Conf42 Observability 2023 - Online

Beyond Monitoring: The Rise of Observability Platform


Abstract

In a complex, multi-layered, distributed computing environment with so many interdependencies that they are impossible to keep track of, full-stack observability enables organizations to find the needle in the haystack by identifying and responding to system issues before they affect customers.

Summary

  • The session is "Beyond Monitoring: The Rise of Observability", which is around the observability domain. I'll walk you through what the OpenTelemetry framework is. And then there are the KRAs and the KPIs that are impacted by this platform. So this is what I want to cover in this session.
  • System unavailability and underperformance in a landscape negatively impact the user experience and customer satisfaction. Observability provides the stakeholders with multiple actionable insights into the distributed infrastructure. These solutions are deployed in large, complex architectures.
  • In a modern, complex IT landscape, there are tons and tons of solutions that are run. What's required is a system that gathers the insights from these endpoints and delivers actionable insights in a way that makes sure the system uptime and availability are maintained.
  • In earlier times, monitoring was more reactive and speculative than data driven. Now, observability consolidates data from various infra and applications. It then analyzes that data to find insights and correlations. This is the next generation of monitoring, and as per Gartner, more and more organizations are adopting it.
  • It is basically the OpenTelemetry architecture, which is driven by the CNCF. The three telemetry types are metrics, logs, and traces. All the ISVs, open source tools, and hyperscalers more or less have to abide by it.
  • AIOps combines observability with machine learning and automation. Observability improves reliability and availability based on the overall goals and objectives. Then comes predicting issues based on system behavior. This is basically nothing but making sure that you have artificial intelligence built into it.
  • The observability platform architecture basically has six different elements, or six different, let's say, data elements that are captured. This is what is leveraged to find insights and patterns and do the root cause analysis. With this kind of complexity, there is no way but to leverage automation.
  • The more concrete and focused solutions you have, the more capability and features there are in them. But that's something that needs to be analyzed on a case-by-case basis. These are the open source stacks that are typically leveraged for the various tiers.
  • AIOps is the next stage of your solutioning or your platform. It includes your ITSM and the automation around it. It also includes the things that you will leverage to have an autonomous architecture. I hope you have found this presentation insightful and informative.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Hi, this is Sameer here. Welcome to this event by Conf42 around the observability domain. I'll be presenting the session "Beyond Monitoring: The Rise of Observability". Just to give you my introduction, I'm Sameer, I'm part of the Eviden organization. I'm based out of Mumbai. I provide architecture and technology leadership to large transformation programs and large deals. So let's begin with the session. The first slide here provides the agenda that we will cover as part of this session. We'll try to understand and get a view on what observability is under the hood. Then I'll walk you through what the OpenTelemetry framework is and what constitutes it. Then we have the observability building blocks, in terms of the functional building blocks, the framework, and the architecture. And then there are the KRAs and the KPIs that are impacted by this platform. Once this platform is deployed as part of various opportunities, it impacts those various KRAs and KPIs, so I'll touch upon that as well. Then there is the exact tooling stack in terms of what constitutes the solution building blocks: the ISV solutions, the hyperscalers, and the open source solutions that we have. And then I'll touch upon self-healing infrastructure as well, and then give a view on AIOps. So this is basically what I want to cover in this session. So let's begin. System unavailability and underperformance in a landscape negatively impact the user experience and also customer satisfaction. Obviously this contributes to revenue loss. If you see a portal where you have issues in terms of performance, reliability, or scalability, the users would just run away from the portal if the response time is not performing as per the SLAs.
And typically it's about 2 or 3 seconds. If it takes more than 2 or 3 seconds to load, the customers or the users would just run away from your website. So that's just an example. And what observability does is it enables organizations to find the needle in a haystack by identifying the system issues that are happening even before the customers would be able to notice them. Right. So that's the exact objective. And then observability provides the stakeholders with multiple actionable insights into the distributed infrastructure, and that's something that is part of large organizations and large complex architectures, including the likes of Uber and AWS, with all the complexities that we have. These are solutions that are deployed. Obviously, it will be based on what exactly the requirements are, the requirements that are coming in from an application standpoint and from a business standpoint. And then you build that platform incrementally. And it's not a single-dimensional aspect. There are various teams and stakeholders that are the users of these systems, to get insights and inputs around the various things that are happening in your IT landscape. So in case there are any bottlenecks, in case there are any predictions around systems or applications or components going down, this solution tracks it in advance, which ensures the uptime and availability of those solutions. So this is the overall background of the observability solution or the observability platform. So next is under the hood.
In a modern complex infrastructure, in modern IT landscapes, there are tons and tons of solutions that are run, be it on premise, be it on cloud, and the number is just humongous, right? You have your application servers, your database servers, your web servers, load balancers, the network elements, the identity and access management systems, integration, your machine learning elements, data warehouses, and the list just doesn't stop. For this kind of complex landscape, what's required is a system or a platform in place which gathers the insights from these endpoints and provides a single pane of glass view to understand what's happening in those nodes, in those various components and services, and takes proactive actions to make sure that any course correction that needs to be done is done through the various rules and policies that you implement as part of your self-healing infrastructure. Right? So that's basically what happens with the solution. From a solution standpoint, it gathers all the insights from various endpoints, it explores the patterns, it explores the various properties, and then tries to deliver actionable insights in a way that makes sure the system uptime and availability are maintained and not impacted in a negative way. So that's about the elements of observability under the hood. This is a quick comparison between observability and monitoring. In earlier times, monitoring was, let's say, a medium to understand and analyze what was happening in the various applications that are deployed as part of your organization, your IT landscape. But monitoring, if you see the table here, was more reactive, right, and situational, wherein you really didn't have the intelligence.
You didn't have the intelligence to monitor it and take a corrective action on that. So the first two elements compare that. One is reactive versus proactive. One is situational, the other is predictive. Those are the two key, let's say, points. We also have something that was more speculative than data driven. When I say speculative here, monitoring was the case where, on the basis of the data that you had, the limited data that you had about your applications and your infrastructure, you would try to analyze and understand the root cause of why certain things have happened in your application. Maybe a system going down, maybe a database not working, maybe there's a performance issue, maybe there's a scalability issue, but it was more speculative. So we really didn't have a concrete answer to why that happened. Versus observability, right, which is data driven. Observability actually consolidates, it ingests all the data from various infra and applications, be it your cloud environment, be it your on-premise environment. And it then analyzes that data to find insights, to find correlations, to find the patterns that are required to, let's say, track your bottleneck. So that's basically the difference: what and when, versus what, when, why, and how. So that's the next generation of monitoring, if you have to define it that way, which is observability. Right. Then you have expected problems and unexpected problems. When we are doing this comparison, expected means that something has happened in the system. All right?
So these are the known unknowns: we know that this problem has happened, and now we have to do a course correction. Versus the unexpected problem, which is where you try and predict that, okay, something is going to happen in, let's say, this node or this component or this service, and this is what you need to do to correct it. So those are the two differences. Something which is more predictive versus more situational. Data silos. Yes, monitoring was built more on data silos, with limited access to the data about your applications, about your infra, about your various solution components, versus data in one place, where you're aggregating data from various solutions and systems, and then analyzing it to find patterns and insights, and leveraging that to take actionable decisions in terms of how we can improve the uptime and availability. And then there was data sampling versus instrument everything, which means that we have those elements monitored 24/7, 365 days a year, making sure that there's no downtime on those systems. As per Gartner, more and more organizations are adopting it, knowing what value it brings, knowing how critical it is, knowing what impact it can make on system availability and overall, let's say, in terms of the customer experience, et cetera. So this is a critical platform, a critical layer of your IT landscape in the current era. So we come to the next slide, which talks about telemetry. It is basically the OpenTelemetry architecture, which is driven by the CNCF. I'm sure we have other ISVs and product companies and SIs and consulting organizations supporting it and driving that community part.
But yes, those are the core pillars of telemetry, which means that those three telemetry types, metrics, logs, and traces, are the parameters that are leveraged to build your observability stack. Metrics are nothing but numeric values measured over an interval of time. This is something that can be related to your CPU, memory, network bandwidth, et cetera. And those are the ones that are gathered in your observability stack. Then we have logs, right? These are the timestamped text records of the events that occurred at a particular time. Which means the stack trace. It means the call stack, right? What API is getting called, what time it is getting called, which other supporting APIs that API calls, which other things are embedded into that API. So that entire logging is the second type of telemetry parameter. And the third is traces, right? A trace represents the end-to-end journey of a user request through the entire distributed architecture, which means it's not just related to a single process or application. It cuts across your various nodes, applications, and systems to come up with the details around what exactly is happening in terms of your end-to-end user journey. So these three telemetry types are the ones that are recorded and gathered, and there are derivatives of these that constitute the entire, let's say, gamut of things around the observability platform. The user experience, the uptime, the availability, everything is then calculated based on these parameters, even the KPIs and the KRAs. I'll come to that in a bit. But these are the key ones. And this is basically the OpenTelemetry framework. All the ISVs, the open source tools, and the hyperscalers more or less have to follow it, have to abide by it, have to align with this framework.
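To make the three telemetry types concrete, here is a minimal Python sketch of what a metric, a log record, and a trace might look like as plain data. This is not the OpenTelemetry API. The class names, fields, service names, and durations are all hypothetical and chosen only for illustration.

```python
import time
import uuid
from dataclasses import dataclass, field


@dataclass
class Metric:
    """A numeric value measured at a point in time (e.g. CPU, memory)."""
    name: str
    value: float
    timestamp: float = field(default_factory=time.time)


@dataclass
class LogRecord:
    """A timestamped text record of an event."""
    message: str
    level: str = "INFO"
    timestamp: float = field(default_factory=time.time)


@dataclass
class Span:
    """One step of a request; spans sharing a trace_id form one trace."""
    trace_id: str
    name: str
    service: str
    duration_ms: float


# Hypothetical sample data for one user request crossing three services.
trace_id = uuid.uuid4().hex
trace = [
    Span(trace_id, "GET /checkout", "web", 120.0),
    Span(trace_id, "reserve_stock", "inventory", 45.0),
    Span(trace_id, "charge_card", "payments", 60.0),
]
cpu = Metric("cpu_utilization", 0.72)
event = LogRecord("payment gateway timeout, retrying", level="WARN")

# The trace is what ties the journey together across services.
services = sorted({span.service for span in trace})
print(services)
```

The point of the sketch is the shared `trace_id`: a metric or a log line describes one node, while the spans of a trace describe one request across many nodes.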
Because ultimately these are the best practices, right? So that's about the pillars of telemetry. And here are the details around the KPIs and the KRAs that are impacted by your observability, right? Customer experience is one, which is making sure that the impact on the end-to-end customer journeys is something that is also observed, that is also monitored by observability. Then there is MTTR and MTBF, right? That is mean time to repair and mean time between failures. If you have a system in place which can predict the issues even before they happen, I'm sure these parameters, MTTR and MTBF, would be under control, right? It can do the analysis, it can find the insights, it can find the patterns, it can find the correlations between those patterns, and come up with an insight that can be leveraged to do a course correction across your entire portfolio of applications and solutions. Then there's reliability and availability. All these parameters are directly impacted by observability. It improves reliability, it improves availability based on the overall goals and objectives that you have, whether that's three nines, four nines, or five nines, right. I'm sure we can configure these policies, processes, and rules to make sure that we have this in control, right? We have this architected, defined, monitored, and managed as per what the SLAs are. Performance. Yes, it does have an impact on both performance and scalability. It can provide you with so much data, right? You're gathering data about your services, about your network calls, about your distributed tracing, logging, all these elements that come in, right.
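The MTTR, MTBF, and "nines" figures mentioned above are related by simple arithmetic. The sketch below uses the common textbook definitions (MTBF as uptime divided by number of failures, MTTR as repair time divided by number of failures, and availability as MTBF / (MTBF + MTTR)). The function names and the sample numbers are assumptions for illustration, not figures from the talk.

```python
def mtbf_hours(total_uptime_hours: float, failures: int) -> float:
    """Mean time between failures: average uptime per failure."""
    return total_uptime_hours / failures


def mttr_hours(total_repair_hours: float, failures: int) -> float:
    """Mean time to repair: average repair time per failure."""
    return total_repair_hours / failures


def availability(mtbf: float, mttr: float) -> float:
    """Fraction of time the system is up, from MTBF and MTTR."""
    return mtbf / (mtbf + mttr)


def allowed_downtime_minutes_per_year(nines: int) -> float:
    """Downtime budget per 365-day year for an availability of N nines."""
    unavailability = 10 ** (-nines)  # 3 nines -> 0.001
    return unavailability * 365 * 24 * 60


# Hypothetical month: 720 hours, 2 failures, 4 hours of total repair time.
mtbf = mtbf_hours(total_uptime_hours=716, failures=2)
mttr = mttr_hours(total_repair_hours=4, failures=2)
print(f"MTBF={mtbf:.1f}h MTTR={mttr:.1f}h availability={availability(mtbf, mttr):.4%}")

for n in (3, 4, 5):
    print(f"{n} nines: {allowed_downtime_minutes_per_year(n):.2f} min/year")
```

Three nines leaves a budget of roughly 525.6 minutes a year, and each extra nine cuts that budget by a factor of ten, which is why predicting issues before they happen matters more as the target rises.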
If you consolidate these details, you will ultimately have the insights gathered from the various endpoints and nodes to make sure that we improve these parameters, that we have these parameters under control so that they don't impact the user experience, the uptime, and the availability. So that's about the KPIs and the KRAs. This is in terms of the platform objectives, right? If one decides to build an observability platform, what constitutes that platform? What would really go into it? What are the solution building blocks, what are the infrastructure elements, what solutions or tools would you want to leverage, whether it's ISVs, open source tooling, or something specific to a hyperscaler like Azure or AWS? So these are the platform objectives in terms of what you want to achieve via this observability platform. It definitely enhances the visibility of system performance and health, as we saw with all the KRAs and the KPIs. Then it discovers and addresses unknown issues with accurate insights. It has those elements built into it, around root cause analysis, pattern identification, insights, et cetera. Then there are fewer problems and blackouts as a result of predictive capabilities. It can analyze the various issues and bottlenecks that you have and then make sure that you have the right set of rules and policies deployed in your landscape. Then comes predicting issues based on system behavior by combining observability with AIOps, machine learning, and automation. This is the second stage of your maturity, right? The first being that you have observability in place; now you have to build AIOps. AIOps includes other elements as well, not just monitoring, not just observability. We'll have the ITSM bit as well.
We'll have the automation bit in it as well, which will make sure that all that you can monitor in those domains can be leveraged for your AIOps, which is basically nothing but making sure that you have artificial intelligence built into it, so that there are fewer manual interventions to drive your processes and orchestrate the various, let's say, scenarios that you would have. So that is what AIOps is, and I have a slide explaining that as well. Then, catch and resolve issues in the early phases of the software development process. This relates to while you are actually building your software, right? If you have that entire element which goes through your CI/CD automation, you can leverage a lot of tools around that in your pre-production, in your staging, before it finally goes to production, to make sure that you have the right quality of platforms and applications deployed. And deep dive into logs and inspect stack trace errors. This is again something that is already covered in the earlier point. And then the framework part, which is basically nothing but making sure that once you have those components or the platform built, created, and deployed, you will make sure that there is the right. So I think each of these basically mentions about the system of duty. If you go to each of these points, it is what the platform finally does once you have it deployed. It makes sure that the overall health of the system, which is basically the uptime and the availability, is as per the superior SLA. Then we have tooling to help debug in production systems. So we'll have the exact set of tools with us to analyze the issues.
If something happens in production, why it happened and what is needed to correct it and fix it is also one of the things that the observability platform provides. Then you can diagnose infrastructure problems in production and do the root cause analysis as well, which is a capability that the platform provides. And as I said, it's the unknown unknowns, right? We really have no clue in terms of what may come up, but the platform has the intelligence to analyze your data from various endpoints and nodes, to do the anomaly detection, and to make sure that uptime and availability are maintained. So this is the platform framework. This is the reference architecture for observability. If you see, it's a kind of, let's say, pipeline or funnel, which starts with ingesting data from various endpoints into your observability stack. The first stage of that platform is data ingestion and collection, which collects the log data, the application performance monitoring data, the metrics data, the uptime data, and the user experience data as well, right? The second stage is aggregation and processing. This is where you make sure that the data is of the right quality and in the right format, and then the data is aggregated to derive insights and also to run various machine learning algorithms on it, so that we can leverage it for anomaly detection, root cause analysis, pattern discovery, et cetera. The next stage is nothing but that analysis part, that machine learning part. So the data is in the right format and it's aggregated. It's typically three stages, right? First is the raw data, then you have the cleansed data, and third you have the aggregated data.
But this machine learning part, right, is like a system which will run through the data and make sure that it finds the patterns and the insights based on the various rules that would be there. So this is like a system which goes in a kind of, let's say, circular manner, right? And finally, after that third stage, the fourth stage is the alerts and your dashboards, which will be provided to your ops team, to your SRE team, to understand the various elements of your IT landscape and make sure that all the systems are up and running. There is also the element of self-healing infrastructure, where in case there's something that needs to be fixed, right, the systems can be built in such a way that even without your service desk's intervention, even without your level one or level two support's intervention, the system would be able to roll out a fix on the basis of some of the issues that are defined in the applications, right, in your IT landscape. So these are systems that invariably are, let's say, converging towards self-healing or autonomous systems, where we'll have leaner and leaner teams to manage that entire gamut of your infrastructure and applications. So that's about the observability platform architecture. I would want to just touch upon the logical architecture, which basically has six different elements, or six different, let's say, data elements that are captured. There's log data, there is metrics data, there is synthetic data, there is APM data, user experience, and uptime, right? If you see the metrics data, for example, it's the host and container metrics, then we have the database metrics, and then there are network metrics as well. So if there is a system that you build, you make sure that this set of data elements is what is ingested into the platform.
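The four pipeline stages described above (ingest raw data, cleanse and aggregate it, analyze it, then alert) can be sketched in a few lines of Python. This is a toy illustration under stated assumptions: the talk refers to machine learning for the analysis stage, whereas this sketch substitutes a simple z-score threshold, and the function names, record layout, host names, and threshold value are all hypothetical.

```python
from statistics import mean, pstdev


def cleanse(raw):
    """Stage 2a: drop records whose value is missing or non-numeric."""
    return [r for r in raw if isinstance(r.get("value"), (int, float))]


def aggregate(records):
    """Stage 2b: group values by host, preserving arrival order."""
    by_host = {}
    for r in records:
        by_host.setdefault(r["host"], []).append(float(r["value"]))
    return by_host


def detect_anomalies(by_host, threshold=2.0):
    """Stage 3: flag hosts whose latest value is far from their history.

    A z-score stands in here for the machine learning step.
    """
    alerts = []
    for host, values in by_host.items():
        history, latest = values[:-1], values[-1]
        if len(history) < 2:
            continue  # not enough history to judge
        sigma = pstdev(history)
        if sigma == 0:
            continue  # flat history, z-score undefined
        z = (latest - mean(history)) / sigma
        if abs(z) > threshold:
            alerts.append({"host": host, "value": latest, "z": round(z, 2)})
    return alerts


def run_pipeline(raw):
    """Stages 1-4: raw data in, alerts for the ops/SRE team out."""
    return detect_anomalies(aggregate(cleanse(raw)))


# Stage 1 (ingestion): hypothetical CPU readings from two hosts.
raw = (
    [{"host": "web-1", "value": v} for v in (40, 42, 41, 39, 95)]
    + [{"host": "db-1", "value": v} for v in (30, 31, 29, 30, 31)]
    + [{"host": "db-1", "value": None}]  # malformed record, cleansed away
)
alerts = run_pipeline(raw)
for alert in alerts:
    print(f"ALERT {alert['host']}: value={alert['value']} z={alert['z']}")
```

In a self-healing setup, the alert would feed a rule or policy that triggers a fix automatically instead of only reaching a dashboard.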
And this is what is leveraged to find insights and patterns and do the root cause analysis. It may sound very simplistic when I'm explaining it, but the data that is ingested runs into terabytes and petabytes, because we are monitoring the system every second to observe what exactly is happening in your applications and your infrastructure. With the kind of complexities that we have in the applications, I am sure there is no way but to leverage automation. There is no way but to leverage AI. There's no way but to leverage these cutting-edge platforms to make sure that your uptime and your availability are maintained. I just want to cover a few more slides here. One pertaining to. Yeah, so a quick one. This is the tooling stack of your system. The one on the left side is an open source hybrid stack, which constitutes Logstash, Prometheus, Nagios, Jaeger, JavaMelody, Elasticsearch, and Kibana. And the one on the right is an open source stack leveraging ELK, which is Elasticsearch, Logstash, and Kibana. So it's a mix of these two, and it ultimately depends on what your final requirements are. The more concrete and focused solutions you have, the more capability and features there are in them. But that's something that needs to be analyzed on a case-by-case basis. But these are the open source stacks that are typically leveraged for the various tiers, the various stages that you have as part of your monitoring platform: logging, monitoring, infra monitoring, app monitoring, distributed tracing, user experience monitoring, et cetera. So this gives a view on the tooling part or the tooling architecture. I think I have almost reached the end of the presentation. The rest I just want to touch upon briefly.
These are the KRAs and the KPIs that you typically have built into some of these tools. So you really don't have to do the heavy lifting around it. Some of the templates that are rolled out by the various tools, you can just leverage them, do a bit of customization, do some configuration, and they are up and running. So these are the various KRAs and the KPIs, the ones that you just saw, the log data and the metrics data. This is basically what constitutes that. And this is around the metrics part. A lot of templates can be leveraged to build these solutions. And this covers and explains the elements around the applications, the databases, and things like that. This is a dashboard, an Azure monitoring dashboard, in terms of what exactly it looks like when it comes to complex landscapes, monitoring the security, the application side of things, the infra, the network. But this is just a simplistic view. I'm sure you would have seen the command centers with a number of dashboards in a room where the SRE teams and the ops teams are monitoring all the time. So this is just a view on what it constitutes. It's extrapolated to the complexity you have in the landscape, and that's the number of elements that you will want to analyze, right? So that is about it. One last thing I want to include is about AIOps. AIOps is the next stage of your solutioning or your platform. It's not limited to monitoring. It includes your ITSM and the automation around it. And it also includes the things that you will leverage to have a kind of autonomous architecture, right? So it's a mix of those things, right. You observe, you engage, and you act. So you do the automation.
So these are the elements that are the next stage of, let's say, capabilities that you'll have in your landscape. The diagram on the left shows that the data collection is from various systems and applications and is not limited to your observability stack. Then you have the machine learning systems to do the real-time processing and the historic processing. And then the next level of, let's say, capabilities to do the anomaly detection and pattern discovery. So this is about it. I think I have covered most of the things. Some of the elements that I couldn't cover are there in my slide deck. On the last slide, I just have my coordinates. Feel free to reach out to me if you have any doubts or any specific inputs around the presentation. I hope you have found this presentation insightful and informative. Thank you. Thanks for joining. Thanks for watching. Have a nice day. Bye.
...

Sameer Paradkar

IT Architect, Author, Speaker @ Eviden, an AtoS Business
