Conf42 DevOps 2025 - Online

- premiere 5PM GMT

The future of Observability - the next 10 years


Abstract

In the next decade, observability will transform through AI, real-time processing, and predictive analytics. This talk explores proactive monitoring, rapid troubleshooting, and deep system insights. Learn about AI-driven anomaly detection, open standards, and tackling distributed system complexities.


Transcript

This transcript was autogenerated. To make changes, submit a PR.
Hello, and welcome to this talk on the future of observability and what you can expect over the next 10 years. I'm Shamshree Wilson, and I work at NetData. NetData is a real time observability platform with a unique distributed architecture and an open source core. This talk isn't really about NetData, the product. though I might, talk about a few of the decisions we've made along the way and some of the principles that we follow. If you'd like to find out more about NetData, you can check out our website at www. netdata. cloud. Now let's get started with the talk. Let's start by discussing a little bit about the evolving definition of observability. What is observability and what does it mean? and has this meaning transformed over time. Now the traditional definition of observability is that it is the ability to observe and understand the internal state of a system by examining its external outputs. And when we talk about external outputs, we usually mean metrics, logs, and traces today. In other words, observability is to understand what is happening with the system. Now, this traditional definition came out of control systems, and that explains some of the terminology that you see here, when we talk about internal state, external outputs. But today, observability has, has evolved over all these years. And it's a term that's primarily used for people who are managing IT infrastructure these days. And IT infrastructure, as we all know, has been growing ever complex. and it's, deployed across multiple kinds of environments, and it scales up, it scales down, and there's hundreds, thousands of services talking to each other. So it's, it is a very complex environment. So the current reality of observability is that it's really the digital nervous system of modern enterprises. Thank you. So it's no longer just about logs, metrics, and traces, but what we're trying to get out of observability is to achieve real time comprehension, real time understanding of the complex digital ecosystems through the lens of causality. And causality is an important term here because what we expect out of observability today is not just to understand what is happening, but also why it is happening. what to do about it, and all of this in real time. So this is now the new evolved definition of observability, if you may. So now that we've talked about what observability is, let's just pause and think about what is the current state of observability? What are the problems with observability today? So, You can see that there's, there's some problems that are called out here, and it's, it's been classified into three different categories. Too little observability, too complex observability, and too expensive observability. And as you can see by that interconnected graph here, it, it's actually a mix of all of these things, and it kind of bleeds into each different category. So let's, let's start by talking about too little observability. What does this mean? So what this means is that the, the, the system either doesn't have any observability at all. So there's nobody really looking at what's going on. or there's very limited visibility, which means that there's some monitoring of a few metrics. There might be a dashboard here or there, but it's not really capturing the full state of the system. And this is a very dangerous situation to be in because, you might have a false sense of security that I have monitoring for my system. 
But you only have a partial view, and that partial view might be telling you a very different story from reality. So limited visibility is a problem. The other problem is late detection. Too little observability can also mean that you have the metrics you need but not the sampling intervals you need, so you're collecting the data, but too late. You're missing short micro-spikes or bursts, or you're finding out about an issue long after it started. So late detection becomes a big problem as well. Performance bottlenecks hurt you too, either directly because of the kind of observability you've implemented, or because you're failing to identify those bottlenecks in the first place. All of these things contribute to delayed troubleshooting: you don't have the right tool set to understand and troubleshoot an issue in real time. And when that happens, somewhere down the line you're going to have outages. Outages are very expensive, and they get more expensive with each passing year. All of this, of course, also contributes to reduced engineering efficiency and to maintenance and operations overhead. Clearly, too little observability is not a situation anybody wants to be in. What's the second category? This is when you have observability, but it's too complex. Complex observability leads to many of the same issues: delayed troubleshooting, not being able to identify outages on time, reduced engineering efficiency, and so on. It also becomes hard to scale as your system and your team scale, and it's especially hard when you're dealing with hybrid infrastructure and dynamic environments. Let me give a small example of what I mean by too complex observability, just to make it easier to picture. Imagine you're starting off with the MVP version of a product and you set up observability. Maybe you start with a Prometheus and Grafana stack, and it's not complex at all: it was easy to set up, you're experienced with Prometheus and Grafana, and everything is fine. The problem happens as time passes. You add more features and more services to your stack, different kinds of applications run in tandem with messaging between them, your environment scales, you're not only on-prem anymore, part of your system is hosted on AWS, some services are on-prem, maybe others run on other clouds. Now you have a genuinely complex system, and the question arises: were you able to keep up with the observability you set up on Prometheus and Grafana over time? That becomes a very interesting challenge, because what started off as simple may no longer be simple. As you scale, there are more things you want to monitor, and every time you need to decide: do I have the right Prometheus exporters? Am I getting all the right metrics? How should I visualize this? Spending a lot of time creating custom dashboards is not going to be fun as your services scale.
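As a quick aside, to make that hand-rolled query work concrete: below is a minimal Python sketch of pulling a p99 latency figure per service out of a self-hosted Prometheus over its standard HTTP query API. The Prometheus hostname, the histogram metric name, and the label layout are assumptions for illustration, not anything from the talk.

```python
# A sketch of the kind of custom query that tends to multiply as services grow.
# Endpoint is the standard Prometheus instant-query API; host and metric names
# are hypothetical.
import requests

PROM_URL = "http://prometheus.internal:9090/api/v1/query"

# p99 request latency per service over the last 5 minutes
expr = (
    "histogram_quantile(0.99, "
    "sum by (service, le) (rate(http_request_duration_seconds_bucket[5m])))"
)

resp = requests.get(PROM_URL, params={"query": expr}, timeout=10)
resp.raise_for_status()

for series in resp.json()["data"]["result"]:
    service = series["metric"].get("service", "unknown")
    p99 = float(series["value"][1])
    print(f"{service}: p99 = {p99:.3f}s")

# Multiply this by every new service, team, and exporter you add, and the
# maintenance cost of hand-built queries and dashboards adds up quickly.
```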
So you might have to rely on community dashboards instead, but then you're not 100 percent sure the community dashboard captures everything you need. It all adds up over time into cognitive overload, and of course there are other business-related tasks your team needs to spend time on, so they don't really have the time to dedicate to building custom dashboards and writing custom PromQL queries. Over time it becomes really complex: it starts off simple, but it doesn't stay that way. Now let's look at the third category, which is too expensive. This really touches on all of these issues. We have modern observability platforms, tools like Datadog and Dynatrace, which are genuinely powerful. They have all the features you need, they gather all the right metrics, logs, and traces, and they have really neat visualizations to present them to you. But they're also really expensive. While that might not be a problem for large enterprises with money to spare, it becomes a problem for smaller enterprises, startups, and companies that are just starting out, because they don't have that amount of capital to invest in observability. So what do they do at that point? What choices are they left with? Either they say, we're not going to invest that much in observability and we'll deal with it when we have to, which means they fall back to too little observability, which is a problematic place to be. Or they say, we'll do it ourselves, which takes them to the second category, too complex observability. So this becomes a problem when we think about how observability can be democratized, and also when we think about the architectures in place. Later in this talk I'll be speaking about centralized versus distributed, or decentralized, architectures for observability, and why that is the crux of why observability becomes too expensive. So that's a little bit about the current state of observability. Now let's look at the future, because that's what this talk is about. Let's look at what's headed our way over the next decade. I like to call the next 10 years the great infrastructure explosion, because that's exactly what we're going to be seeing. There is already, and there will be, an even bigger data center boom. The global data center market, worth around 500 billion dollars today in 2025, will grow to nearly 2 trillion dollars by the end of the decade. That's a huge amount of growth, and considering what's happening with AI and the AI arms race we're in, with all the major tech companies locked into the race towards superintelligence, we might even be underestimating it when we say 2 trillion; the data center market might be much larger. A different way to look at this is energy: even by a conservative estimate, by 2035 data centers globally will be consuming 8 percent of the world's total electricity, which means that a large part of human existence, and our world in general, is going to be built around supporting these data centers where we are cultivating intelligence.
Compute capacity has recently been growing two and a half times faster than Moore's Law. All of this means there are going to be a lot more data centers running a lot more kinds of workloads, a lot more IT infrastructure to underpin it all, and ultimately a need for better and more observability, so that the teams maintaining and managing all of this can keep it running. That brings us to the human factor. DevOps and SRE jobs are actually growing multiple times faster than traditional IT jobs. I know there's been a lot of talk about what AI will do to people's jobs in general, and of course AI will affect the jobs market: certain roles people do today will get automated away, or at least parts of them will. But what I believe will happen is that the total amount of work will increase. The average organization is going to be managing five times more services by 2030, which is only five years away. That means a multi-fold increase in the infrastructure teams needed overall, which means the total number of jobs available, especially for DevOps and SRE, is going to increase. And this is directly related to the data center boom: you're going to have a huge number of data centers with increasingly complex workloads, and you're going to need teams to manage them. Of course, AI is going to be a factor here, so the kind of work the average SRE does is going to change, but you're still going to need a lot more of them. The combination of these things means that observability is mission critical now. The average cost of downtime is more than 9,000 dollars a minute, and that number is from a few years ago; I'm sure it's even higher by now, and it keeps growing every year. So downtime is something you want to minimize as far as you possibly can. Unobserved, partially understood, or misunderstood failures can cascade today across thousands of services, distributed systems, nodes, and multiple clouds, which makes things much harder when you don't have the visibility to understand what really happened. Because you need to understand what happened so that you can prevent it from happening again. All of this means that good observability is survival critical; it's not just a nice-to-have anymore. So that's what's headed our way in the next 10 years, and there are some references on the slide as well. I know the text is small, but you can download the slides afterwards and look up where the numbers I've been quoting come from. Now let's talk about what the observability equation really is. The traditional observability equation, which you might have seen or heard in other talks, sessions, and books, says that observability is made up of metrics, logs, and traces. I think that where we are today, and where we're headed in the next ten years, we need a new observability equation, and it's not just about metrics, logs, and traces. The way I see it, the new observability equation is still made up of three things, but those three things are data, context, and intelligence.
Data is all of the data: metrics, logs, traces, any raw system output. It forms the foundational layer of observability. But observability isn't limited to data; data is just the foundation on which the new observability equation is built. The second part of the equation is context, and context is really important. Without context, the data you have might be telling you the wrong story, or no story at all, so it isn't providing the value it should. What do I mean by context? Context is rich metadata: information about the data that explains the data. For example, it might be an anomaly pattern. Here's the value of your CPU percentage, and here's additional metadata that tells you whether this is normal or not for this particular component on this particular system. That extra piece of metadata now tells you a better story. You have a value, CPU is at 90 percent, but maybe the anomaly pattern tells you that CPU for this component on this system is always at 90 percent, and that's a very different thing from just telling you that CPU is at 90 percent. Anomaly patterns are only one example. You could have correlation information, so alongside a single metric you also know what was happening with other metrics or logs while this metric was exhibiting this value. At Netdata, we really believe that context is important, and we try to make our product more opinionated, so that it brings you these opinions about what you're seeing. It doesn't just show you data; it tries to tell you a story about the data. I'm not saying we've solved the context part of the equation; there's a lot more we can do, and a lot more the industry in general should be doing, about making metadata a first-class citizen. There is a lot of important information in the business KPIs your system is trying to achieve, and in the real-world events related to what your observability data is telling you, and bringing those things together provides exponential value. The third component of the new observability equation is intelligence, and yes, I'm talking about AI. Up until now, intelligence was supplied solely by humans, and we were limited by the amount of time and the number of people who could be working on a problem at any point in time. With the capability increases we've been seeing in AI recently, we now have a layer of intelligence that we can start applying to data and context. And context becomes even more important here, because human teams had experience to rely on; they had an inherent understanding of the context around the data. For example, a team of SREs or DevOps engineers working on an e-commerce platform would already know there would be certain traffic trends during a particular sale the company was running, whether that's Black Friday or something region-specific. They would already know that.
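To make that concrete, here is a minimal sketch, with entirely hypothetical field names rather than any real Netdata schema, of what a single metric sample might look like once this kind of context travels with it as first-class metadata.

```python
# A minimal sketch (not any vendor's actual schema): one metric sample that
# carries its own context as metadata. All field names are hypothetical.
sample = {
    "metric": "system.cpu.utilization",
    "node": "web-03",
    "timestamp": 1735821900,        # collected every second
    "value": 90.2,                  # percent
    "context": {
        "anomaly": False,           # 90% is normal for this component on this node
        "correlated_metrics": ["app.request_latency", "db.active_connections"],
        "deployment": "checkout-service build rolled out 6 minutes ago",
        "business_event": "regional Black Friday sale in progress",
    },
}

# With metadata attached, "CPU is at 90%" becomes a story: normal for this
# node, during a known sale, shortly after a deployment.
```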
Whereas if you're relying on an AI platform or an AI agent for this, maybe the agent knows it as well, but maybe it doesn't. Maybe it's biased by its training data, so it isn't expecting that context to exist for your use case. Being able to provide that context as metadata means the AI intelligence you're relying on becomes much more capable too. And we're going to be seeing a lot more from autonomous agents and predictive modeling when it comes to AI; we'll talk more about that shortly. What this new equation does is this: you have data as your foundational layer, you add rich metadata in the form of context on top, and then you have an intelligence layer that can combine the two, stir the stew if you will, and bring out insights. And that's the ideal end product of the observability equation: true insights into your infrastructure and your system, about what's going on, why it's going on, and what you need to do about it, if anything, all in real time. That's the ideal new observability equation I think both companies and teams should be talking about. Now let's dive a little deeper into each of the three parts of the equation, starting with data, the foundation. I really want to stress the importance of high fidelity data. What does high fidelity mean? It means you have all of the data: you're collecting all of the different metrics, not missing any of them, and you're also collecting as many samples per metric as you can. More is better here. Higher resolution means, for example, capturing data for a particular metric every single second, compared to capturing it every 10 seconds or every 60 seconds, or, with some tools out there, only every five or ten minutes. You can still plot a chart from that and put it on a dashboard, but a number gathered every minute or every five minutes tells a very different story from per-second information for the same metric. You get far better insight into what's really happening with that metric, and with your system, when you have high resolution. Every time you increase your sampling interval, you're missing out on unknown unknowns, and this is a really hidden issue, because you can't see what's going on, so you don't even know whether it's a problem or not. That's exactly why we call it an unknown unknown. So if you really want to future-proof your troubleshooting, and your observability stack overall, you should be leaning towards higher resolution data. The more data you can have, and the lower your sampling intervals, the better.
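Here is a toy illustration of that unknown-unknowns point: the same series, once averaged into a coarser sampling interval, simply stops showing the spike. The numbers are made up for the example.

```python
# A 2-second CPU spike to 100% is obvious at 1-second resolution but almost
# disappears once the same minute of data is reduced to one 60-second average.
samples_1s = [12.0] * 60
samples_1s[30] = 100.0   # a short micro-spike...
samples_1s[31] = 100.0   # ...lasting two seconds

avg_60s = sum(samples_1s) / len(samples_1s)

print(max(samples_1s))    # 100.0 -> visible at per-second granularity
print(round(avg_60s, 1))  # 14.9  -> at 60s granularity it looks like nothing happened
```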
And of course, a side effect of this is that if you're relying on machine learning or artificial intelligence, higher data density is always better, because it allows the AI systems to learn the right patterns. If you feed a machine learning algorithm or an AI model sparse data, it finds it harder to learn the right patterns, or it might arrive at the wrong ones. The more data and the more context you give it, the better its chances of arriving at the right conclusions. So what are the new rules of data? Rule number one: collect everything; you can filter out what you don't need later. Rule number two: the better the sampling frequency, the better the outcome. Rule number three: don't accept artificial limits. Don't accept a limit that says you can only collect x samples per y interval and only store them for x days. If that limit is artificial, if it's something a tool is imposing rather than something natural to your system and your workflow, don't settle for it. And finally, in an ideal world you want to store everything indefinitely. Maybe you need to archive it, maybe you need to digest the data, but it should still be available if you want to go back in time. Very often, when you're troubleshooting an issue, you want to know what happened when similar issues occurred in the past, and you need to be able to go back and look at the data, which means you want some form of that data available to you. So if you're designing your observability stack today, I would pay very close attention to these new rules of data, because the real impact on your team and your work is that you get significantly faster mean time to resolution when you have access to high fidelity data. It allows you to catch micro-anomalies before they cascade and start creating problems in other parts of your system. And high fidelity data also makes learning from historical patterns to reveal future issues a much more exact science than it otherwise would be. To sum up, the difference between low fidelity and high fidelity is the difference between real-time, one-second samples and ten-second to sixty-second, or even worse, sampling intervals. You should have the option to gather any number of metrics you want, so in theory it's unlimited; you can collect anything and everything you need. You shouldn't be metered by how many metrics you're collecting, because the moment any of this is metered in the price, it becomes a decision for the team: we can't get everything we need, we're going to be charged by the number or usage of metrics or logs, so we have to opt in for only this subset. You get stuck in a partial-metrics situation, and you don't want to be there. You want full freedom to collect all the data you need. It's a similar situation with retention: in an ideal world, you want to be able to retain the data for as long as is naturally meaningful for your workload and your system.
You don't want an artificial limitation decided by a vendor or a tool that isn't natural to what you're doing. So we've talked about data, the high fidelity data that is the foundation of the new observability equation. Now let's talk about metadata. Rich metadata can be thought of as context at scale: you're getting context from different parts of your system, for different kinds of metrics, and you're getting it at scale. This is really where the magic happens, and it can be all kinds of different information. There's business context: what is the revenue impact, what are the touch points in the customer or user journey while someone is using the system, what is the business transaction flow, and what are the SLA and SLO implications, because that matters a lot for certain products and certain teams. Then there's operational context: service dependencies, changes in the infrastructure that are being rolled out, deployment events. When a new build is being deployed, you need that information to, again, tell a better story. And it doesn't even have to be a major infrastructure change or a new build; it could be a minor configuration update, but that information is still critical, because without it some of the other data might not tell you the full story. And then there's environmental context: real-world events. What's happening in the world today that might be influencing what your system is experiencing? There could be regional impact from outages that have nothing to do with your product, but maybe AWS is down and that affects you. Or maybe a third-party service is announcing an incident on its publicly accessible status page, but that story isn't being told as part of your observability story, and it becomes extra work for somebody to go and manually connect the dots. In an ideal world, those connections would happen automatically. There are all these different market conditions that can influence what your system is experiencing, and all of that could become context that your observability equation can use. And then there are the opinionated insights I was talking about earlier. This is an area where we've started trying to do some of this in the Netdata product, including anomaly detection. When we do anomaly detection at Netdata, we don't do it as an optional extra; we do it for every single metric we collect, which means every metric comes with an extra piece of metadata telling you whether what you're seeing is anomalous or not. Similarly, Netdata has a scoring engine that lets you run correlations over any time interval of your choice, so you can understand which metrics were behaving in a correlated fashion at that point in time. And of course there's a lot more we could do, in terms of root cause ranking and things like that. What all of this means is that when you have rich metadata, when you have the right context, you can tell a much more meaningful, interesting, and useful story.
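As one generic illustration of that correlation-style context (this is not Netdata's actual scoring engine, just a sketch of the general idea), you can rank candidate metrics by how strongly they moved together with a misbehaving metric inside the window you're investigating:

```python
# A generic sketch of window-based correlation scoring. Metric names and
# values are hypothetical one-second samples for the window under review.
import numpy as np

def correlation_scores(target: np.ndarray, others: dict) -> list:
    """Rank metrics by absolute Pearson correlation with the target window."""
    scores = []
    for name, series in others.items():
        corr = np.corrcoef(target, series)[0, 1]
        scores.append((name, abs(float(corr))))
    return sorted(scores, key=lambda kv: kv[1], reverse=True)

latency = np.array([110, 115, 240, 260, 250, 120], dtype=float)
candidates = {
    "db.active_connections": np.array([40, 42, 95, 99, 97, 45], dtype=float),
    "system.cpu.utilization": np.array([35, 36, 34, 37, 35, 36], dtype=float),
}

# Prints the database connections metric first: it moved with the latency
# spike, while CPU stayed flat.
print(correlation_scores(latency, candidates))
```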
So instead of triggering a raw alert that says CPU has hit 90 percent because it crossed some threshold, with the right context you can actually tell a story: this revenue-critical service is experiencing a never-before-seen slowdown, so it's anomalous, during peak shopping hours, because we have that context as well, and it's likely due to a recent deployment, because we know when the deployment happened and we can see that the slowdown started after it. Just look at the difference between those two statements and you'll understand where observability needs to be for you. And now the third spoke in the wheel of the new observability equation, the giant pink elephant in the room: AI. I like to think of AI as the force multiplier. What's going to happen with AI in the next 10 years is almost anybody's guess, because the capabilities of the models available today, especially after the success of the transformer architecture and the takeoff of large language models that followed the launch of ChatGPT, when the world really got its first taste of what these models are capable of, have been really something. The models are almost doubling in capability every six months, so it's very hard to predict exactly what will happen, but there are trends we can see and things we can reasonably expect over the next 10 years. Specialized AI agents are already starting to appear, and I see this as one area where we'll get a lot more traction over the next few years. There will be agents optimized for certain things. We're already seeing models optimized for reasoning, which take some time to think through the answer before giving you a response; reasoning agents will become key for causal analysis across services and across a large system. We'll also have memory-augmented agents for historical pattern matching, because, as all of you know, observability data is huge; you can't just take all of it and fill it into the context of a large language model. In most cases that's either not possible or prohibitively expensive today. The cost will come down in the future, and there will be different approaches to how models handle memory; memory is one area where today's models are lacking. There will also be planning agents that sit embedded in your observability stack, working towards specific goals or targets, whether that's capacity optimization, capacity planning, or cost management. Planning agents will be able to run independently: they're always there, always looking through the data and the context for ways to optimize your costs or plan your capacity better, and they bring insights back to you, saying, here's something that could be done.
And maybe some things can be automated as well, when it comes to automatic scaling and the like. The final point here is about autonomous, RAG-enabled agents. RAG stands for retrieval augmented generation, which means you can provide documentation, access to your Jira, or access to your incident management system, and the AI agent can then look through all of that information based on a relevancy search. Based on what it finds, it can either give you the steps you need to take, or, in some cases, you can automate it so that certain runbooks or playbooks get executed automatically. I'm already seeing a lot of startups experimenting in this space, and I expect we'll see a lot more of it in the next few years. The technical capabilities of these models are worth spending time on. There's a lot that can be done here, and we're barely scratching the surface with the things on this slide; I'm just giving a few examples. When it comes to parsing large amounts of logs at scale and trying to understand complex error patterns, you can really use LLMs to transform how that's done today. You can do multi-hop reasoning, connecting the dots between what you see on service A and what you see on service B. You can do vector similarity searches across incidents and ask: is what I'm seeing right now unique? Is it related to incidents that happened in the past? What playbooks or runbooks were executed when this kind of issue happened before, and does it make sense to re-execute them now? The other thing that will happen, and is already starting to happen, is real-time prompt engineering based on system state. Prompt engineering is just a fancy term for tweaking how you interact with a large language model: what you ask it to do, and how you ask it, determines the quality of the output you get back. If you only ever use a static prompt, you get a very similar, same-ish response every time. But with something as dynamic as a complex IT infrastructure and the observability system you're building on top of it, the prompting needs to be adaptable and dynamic in real time as well. And where we'll really see lift-off over the next few years is from AI agents that can operate autonomously: you give them a policy and a set of directions, and they take it from there. They can modify alert thresholds, scale your system up and down, enable traces when they think it's important, and build service dependency maps on their own. Some of this might sound a little scary: if you're handing this off to a non-human, what's going to happen? That's a fair fear and I understand it. But the capability of these models keeps growing, and of course the models can do things that humans cannot.
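To make the "real-time prompt engineering based on system state" idea, and the retrieval of similar past incidents, slightly more concrete, here is a hedged sketch. Every function name and field here is hypothetical; any vector store, incident system, or LLM client could sit behind it.

```python
# A sketch of assembling an LLM prompt from live system state plus similar
# past incidents, instead of using one static prompt. All names are made up.
def build_troubleshooting_prompt(system_state: dict, similar_incidents: list) -> str:
    lines = [
        "You are assisting an SRE. Use only the evidence below.",
        f"Service: {system_state['service']} (last deploy: {system_state['last_deploy']})",
        f"Anomalous metrics: {', '.join(system_state['anomalous_metrics'])}",
        "Similar past incidents and how they were resolved:",
    ]
    for inc in similar_incidents:
        lines.append(f"- {inc['summary']} -> resolution: {inc['resolution']}")
    lines.append("Propose the most likely root cause and the next safe action.")
    return "\n".join(lines)

state = {
    "service": "checkout",
    "last_deploy": "14 minutes ago",
    "anomalous_metrics": ["app.request_latency", "db.active_connections"],
}
past = [{"summary": "Latency spike after connection pool change",
         "resolution": "rolled back config, raised pool limit"}]

# The prompt changes as the system state changes, which is the whole point.
print(build_troubleshooting_prompt(state, past))
```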
These models can work on the problem 24 hours a day, looking through far more data than a human ever could. You'll start to see that you're better off with these tools than without them, and we're already approaching that territory. Some of you may have already started using automated and autonomous tooling, self-healing and so on, but it's going to become much more prevalent in the future. What all of this means, again, is that we're living in a new reality. Workflows that did not exist before exist today. You could have a scenario like the example here: a memory leak is detected, the agent runs a git blame analysis and identifies the particular commit that introduced the leak, the agent automatically deploys a rollback, and then makes an auto-scaling adjustment to bring the system back into the right state. All of this could happen automatically, and you would simply get a report that it happened. That is a very possible workflow. You can also see new KPIs coming out of this: the percentage of routine incidents resolved without human intervention could become a new KPI for SRE and DevOps teams. So it is a new reality, and a very interesting world we live in. Now that we've talked about this new observability equation, about data, context, and intelligence, I want to touch on the underlying architecture of the observability stack, and why, at Netdata, we believe that decentralization is the way forward. It's really not very complicated: modern systems are distributed by nature, so why should we force their observability to be centralized? That feels like an artificial limitation imposed on a natural system. There are a lot of reasons why centralization fails. By centralization I mean what your usual observability platforms do: you collect logs, metrics, and traces and forward all of it into a central data lake or repository. It might be in somebody's cloud, you might self-host it, but there's one giant central repository where everything sits, and that creates all kinds of problems. It becomes a single point of failure for the entire observability stack, because if your central store goes down, you have zero observability. There's also the exponential cost that comes from all the data movement and storage, especially if you're using a cloud-based provider where everything is stored on their cloud: you're not just paying for their service, you're also paying for egress, for all the data leaving your systems and being transmitted to and stored on their cloud, plus the downstream cost of the storage itself. As systems get larger and more complex, and we're talking about thousands of services, hundreds of thousands of nodes, and millions of metrics, storing all of that centrally is a gargantuan task. It costs the vendor who does it, it costs you to pay the vendor or to self-host it, and it quickly becomes a bottleneck. You also run into cross-team bottlenecks and access control issues with centralization.
Who do you give access to that giant central repository, and how carefully do you control that access? You can run into data sovereignty issues and compliance headaches, especially if you have multi-region, multi-continent operations. And another thing you'll see, of course, is that query performance starts to degrade at scale, which again forces decisions: do I need all of this data? Maybe I need to get rid of some of it, or filter some of it out, so I can query it in a reasonable timeframe and keep it usable. Decentralization solves a lot of these problems. If your architecture is decentralized, or distributed, then instead of storing everything in one place, you store it where it originates, and your querying is intelligent enough to fetch it all in real time. This is the way Netdata has been architected from day one, and it's one of the reasons it can do things a traditional observability platform cannot: unlimited metrics, per-second granularity, so real time for everything, and anomaly detection for everything, while keeping the costs really low. The real-world impact of going distributed is that you can reduce data transfer costs by up to 90 percent and get 65 percent faster mean time to detection. That happens indirectly because of the distributed architecture: because the architecture is distributed, you can afford high fidelity data, and because of the high fidelity data you can capture the right patterns and tell the right stories, which leads to faster mean time to detection and mean time to resolution. It also means teams can operate much more independently. You don't have a single point of failure: even if your centralization points fail, you still have direct access to the systems and to the observability data originating on those individual nodes. You can vastly reduce the amount of money you spend on network and storage, and, downstream, on licensing. And you're able to process metrics where they originate, which matters too. If you're trying to do anomaly detection, for example, where you do it becomes important, because if you process the data centrally, you lose a little of the context about where each metric came from. CPU at 90 percent on one node and CPU at 90 percent on another node could mean two entirely different things, depending on the workloads running on them. So now that we've talked about the new observability equation and the underlying architecture: are you, as an organization and as an engineer, ready for the next 10 years? That's what really matters. For organizations, here are some of the points I think you should be thinking about when it comes to observability over the next few years. Really focus on reducing your sampling intervals; high fidelity data is super important.
If you can get per-second granularity at scale, you're going to be able to do so much more with your observability data than you're doing today. Related to that, really focus on real time. Observability isn't just about historical analysis; you're not just looking back at the last X hours or weeks of data. The data needs to be real time, so you can see issues as they're happening and take action before they become a problem, not just afterwards. Ideally, you also want to democratize observability across teams. Don't silo observability into one team, where the SRE team owns it all and nobody else ever looks at the data. You should be able to share it with developers, with management, with whoever needs to see it, and the platform or tool you're using should be usable by all of those stakeholders. As an organization, invest in AI-driven anomaly detection and automated root cause analysis: look at whether the platform you're using today, if you are using one, offers these capabilities, and if it doesn't, it's time to evaluate a few tools and platforms that do. And finally, I'd really recommend implementing a decentralized observability architecture, and thinking hard about how you can minimize centralization in your system overall going forward, for all of the reasons we discussed. That was for organizations; now let's talk about individuals, the SRE and DevOps professionals of today. What should you be doing? Of course you should be mastering cloud-native observability and AIOps; I don't think I need to tell anybody that, people are already doing it on their own. You should start thinking beyond basic infrastructure as code towards policy as code and security as code practices as well. It would be really good to develop expertise in causal analysis and real-time analytics, because that's what lets you tell those better stories. This is a skill you can develop: telling better cause-and-effect stories, here's what you see, here's what could have caused it, and being able to present that through a dashboard, a report, or a tool. It gives you a lot of power when you can tell the right stories. You should develop multi-cloud and edge computing skills, because, as we discussed, as workloads become more complex and the data center explosion happens, your workload isn't going to run on a single system or a single cloud; it's going to be spread out, and you need to understand different systems at different scales. And finally, I think it would be very useful for your career in general to gain expertise working with and orchestrating AI agents. Get familiar with AI in general and with AI agents, get comfortable with them, understand their intricacies, what they're good at, what they're bad at, and the scenarios where they fail. Doing that will give you a big head start in the next decade.
I hope this is useful advice for both organizations and individuals, and that you'll be able to do great things with it. That's the guidance I would offer for the next 10 years. To leave you with a quote I made up just now: I think future SREs will manage a hundred times more services, but they'll be doing it with ten times less manual intervention. There will be a lot more work for SREs, but that work will involve far less manual burden. Finally, to wrap up, it's time to go back to the future. There's a quote from William Gibson that I really enjoy: the future is already here, it's just not evenly distributed yet. I think that's really true of a lot of the things we've talked about. Some of these capabilities and tools are already here, but not everybody is aware of them, and not everybody is using them. There are pockets of the industry where people are aware of them, are using them, and are miles ahead because of it. So, as my final statement for this talk: don't wait until your monitoring tools are obsolete. Don't wait until your dashboards are peddling lies or half-truths. Don't wait until your teams are drowning in alerts. And you definitely don't want to wait until your systems are too complex for humans to manage alone. The tools of tomorrow are here today; use them. Embrace high-fidelity, real-time data, and don't settle for less. Use AI agents, get comfortable with them, and let the machines do what humans cannot, don't want to, or shouldn't be doing. Go distributed: true fidelity in your data demands a decentralized architecture. And finally, demand insights from your observability stack and your observability platform, not just more charts, more dashboards, and more noise. Don't settle. Thank you, and thanks for listening to the talk. Like I said at the start, if you'd like to find out more about Netdata the company, you can visit our website at netdata.cloud, and you can also check out our GitHub; just search for Netdata on GitHub. If you have any questions about the talk, I'd be more than happy to discuss them with you. Here's my email, you can find me on LinkedIn, and you can also find me on Substack, where, again, I write a little about the future and what's in store for us. Thank you. Thanks again. Bye.

Shyam Sreevalsan

Vice President - Strategy, Partnerships, Product @ Netdata



