Conf42 Observability 2024 - Online

Why Decentralized Monitoring Matters

Abstract

Ever felt frustrated by the slow detection and resolution of issues? Ever wondered why monitoring systems are always centralized when infrastructures are highly distributed? You’re not alone. DevOps and SREs face challenges with centralized monitoring systems that often fall short in modern environments.

Summary

  • Netdata is an open source observability platform. Its goal is to shake up the observability landscape and make monitoring easy for everyone. Shyam Sreevalsan explains how observability works: mostly through dashboards, with alerting as the other important component.
  • Centralized monitoring, or centralized observability, is the default setting today. The tools we're talking about centralize metrics, logs, traces, checks, all of this information into some sort of central monitoring server. There is a big "but", and that's what we're going to dive into.
  • Fidelity is sort of a mixture of granularity and cardinality. If you have low granularity, meaning you're collecting data only every 60 seconds, that is in effect blurry data. Centralization makes cost and fidelity proportional to each other.
  • Centralized data storage is expensive, data egress is high, and scaling costs grow disproportionately. Cost is a huge problem when it comes to centralized monitoring solutions.
  • The fourth sin is accuracy. When it comes to machine learning, as you have probably heard, it's all about the data: how accurate your machine learning is depends on how much data you have. Having to do it in a centralized fashion means you have to build in a lot of context. This leads to outages and downtime.
  • The centralized monitoring solution has a single point of failure: the single server. If you had more localized ways of looking at the individual pieces of your infrastructure, your centralized view could be down and you would still be able to observe them. And then we have efficiency: IT infrastructure takes up around 30% of the total energy consumption in the world.
  • Data privacy is often talked about: large tech companies have an often unhealthy liking for customer data. There is a concentration of risk when it comes to centralized systems. The other aspect is compliance challenges. And finally, there is the question of deployment options.
  • The one solution that I'm proposing in this talk is to decentralize. Every single node has its own identity and is an entity in itself. This gives us many more options for higher fidelity data. There are a lot of advantages to be had from decentralized networks.
  • Keep data at the edge. Make the data highly available across the network. Unify and integrate all of this at query time. These are also some of the challenges of making decentralized observability work.
  • Netdata aims to achieve the decentralized philosophy that we've been talking about. The main component is the Netdata agent, which collects data in real time. The agent can also stream data to other agents. And the really important thing which allows Netdata to deliver this decentralized philosophy is that the Netdata agent is really lightweight.
  • Netdata parents are other Netdata agents which aggregate data across multiple children. Having access to these parents, or mini centralization points, gives you enhanced scalability and flexibility. It ensures that all of the data always remains on-premises. And by design, it's resilient and fault tolerant.
  • Netdata's decentralized architecture enables horizontal scalability. The cloud just queries the data in real time and shows it to you. I would welcome you to explore how decentralized monitoring looks and feels.
  • Creating a decentralized observability platform is not easy. Changing from a centralized architecture to a decentralized architecture is even harder. Big players in the industry will find it really hard to move away from their existing architecture. I believe that the future is decentralized and that hard problems can be solved.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Hello. Hello. Welcome everyone to this talk on decentralized monitoring and why it matters. I am Shyam Sreevalsan. I work at Netdata. Netdata is an open source observability platform, and our goal is to shake up the observability landscape and make monitoring easy for everyone. So let's start by talking about observability. What is observability in a nutshell? To start with, there's all the stuff that you care about, and this could be your data center, your applications, your databases, your servers, your IoT networks, your Kubernetes clusters, all of those things which keep whatever you care about, your business, your company, all of it, running. So here's all the stuff that you care about, and you want to observe this stuff. And how do you do that? At a high level, you do it using these three things. There's metrics that you get out of your infrastructure, and metrics are usually in the form of time series data, numeric data associated with the different counters that you look at. Then you have logs, which are textual data, again describing what's happening in your infrastructure. And then you have traces, which go a bit deeper into the flow of a particular event happening across different parts of your infrastructure, your front end and your back end, for example. So now you have your metrics, logs and traces. They exist on these devices. What do you do next? The next step is to collect it all. This is where your observability tools and platforms generally enter the picture. They usually have some sort of agents in place, or collectors or exporters, which do the job of collecting all of this data into some sort of repository or storage. And once you collect it, what next? What do you do with it? There's two main things that you do, and one is to visualize it: to have it up in a dashboard, to have it in some form that you can look at. And whether you're a DevOps engineer, a developer or an SRE, this is how you observe the stuff that you care about. It's mostly in the form of dashboards. And another really important component is alerting, because you can't be expected to always be looking at a dashboard. You have other important tasks to do. You're human, so you have to go to sleep. You have other things outside of work that you need to be doing, and you cannot always be watching to make sure that everything is going okay. That's where alerting comes in, which means that if something important happens that you need to know about, the alerts will make sure that you do. So this, in a nutshell, is observability. I just wanted to set the stage with this because we're going to be talking about a lot of the different aspects of the things that we just discussed. And what about the observability landscape? It's pretty crowded in here, to be honest. You can see all those logos of all the different observability companies. To try and break this down a little bit, I'm going to be dividing these solutions into different generations. You can talk about the first generation observability platforms, which are focused on checks. So you have your Nagios, your Zabbix, your PRTG, your Checkmk; they're focused on checks. Is something working or not? Is something running or not? Is something up or down? That's their main role. Of course they're branching out and doing other things as well, but that's where they started, or that's their core philosophy.
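As a tiny illustration of that first-generation, check-centric model (this is only a sketch; the host and port are placeholders, and real tools like Nagios run far richer plugins), a check simply asks "is this thing up or down right now?":

    import socket

    def tcp_check(host: str, port: int, timeout: float = 3.0) -> bool:
        """First-generation style check: can we open a TCP connection right now?"""
        try:
            with socket.create_connection((host, port), timeout=timeout):
                return True
        except OSError:
            return False

    # Report a simple up/down status, which is the core of check-based monitoring.
    status = "UP" if tcp_check("db.example.internal", 5432) else "DOWN"
    print(f"postgres check: {status}")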
And then you have the second generation, which are more focused on the metrics themselves. This is what Prometheus, for example, is really famous for. Most of my audience has heard about Prometheus. Most of you have used it. The Prometheus community is pretty active as well. So you know how it works. There are metrics that are exported from different services, applications, servers and devices; it's time series metrics, and there's metadata associated with them. And there is a Prometheus server which collects all of this data and stores it in a time series database. And it's not just Prometheus. There are other agents and other monitoring platforms that do this as well. Then you have the observability tools that focus on logs. So you have things like Splunk or Elastic where, though they do other things as well, logs are kind of their core competence. And then finally you have the fourth generation of integrated monitoring tools where you have a mix of all of these things. You have metrics, you have logs, you have traces, checks and all these things mixed in. And this is where the big commercial tools such as Datadog, Dynatrace, Instana, New Relic and so on really make their mark. So this is the observability landscape as most of you are used to it. So what is in common with most of these tools? When you think about it, this is where the philosophy of centralized monitoring really comes in. Centralized monitoring or centralized observability is the default setting today, which means that the tools that we're talking about are centralizing metrics, logs, traces, checks, all of this information into some sort of a central monitoring server. There are benefits to this approach, of course, this is why people do it. It gives you comprehensive visibility because you have all of those things in one place. So you have the metrics, you have the logs, you have the traces, all of that information. So when you're looking at a particular timeline, you get all the information in one place. You can use this to correlate trends across various different data types, whether it's a metric, a log or a trace. That trend correlation becomes really important when you're troubleshooting something, and it also gives you a deeper understanding of what's really going on in the system, in the infrastructure. So this is centralized observability; this is the underlying architecture or philosophy of most of, or almost all of, the main observability platforms out there today. So it sounds pretty good, doesn't it? So what's the issue? Right, so that's what I'm going to get to. There is a big "but", and that's what we're going to have a deep dive into. So let's talk about the seven deadly sins of centralized monitoring, if you will. What are the limitations of centralized monitoring? In many cases, unless you think about it, these are not obvious or at the front of your mind; sometimes they get swept under the carpet, and that's where the troubles start creeping in. So let's go over these one by one, and let's start with fidelity. So what is fidelity? What does fidelity mean? One way to think about this is that fidelity is sort of a mixture of granularity and cardinality. And what does this mean? Granularity means, let's say we're talking about metric data here, how often do you have the metric data? Are you collecting it every second, or are you collecting it every 10 seconds? Or are you collecting it every 60 seconds? Right.
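Tying this back to the second-generation, metrics-centric model from a moment ago: in the Prometheus world a service exposes its counters over HTTP and the central server scrapes them on an interval, and that scrape interval is exactly the granularity knob being discussed here. A minimal sketch using the prometheus_client Python library (the metric names and port are arbitrary examples, not anything from this talk):

    import random
    import time

    from prometheus_client import Counter, Gauge, start_http_server

    # Metrics this hypothetical service exposes for a Prometheus server to scrape.
    REQUESTS = Counter("app_requests_total", "Total requests handled", ["status"])
    QUEUE_DEPTH = Gauge("app_queue_depth", "Current depth of the work queue")

    if __name__ == "__main__":
        # Serve the /metrics endpoint; the Prometheus server scrapes this target
        # at its configured interval (every 1, 10 or 60 seconds, for example).
        start_http_server(8000)
        while True:
            REQUESTS.labels(status="200").inc()
            QUEUE_DEPTH.set(random.randint(0, 50))
            time.sleep(1)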
There is a very big difference between these three things: between getting per-second data, per-10-second data and per-60-second data. So this is what data granularity means, how granular your data is. And if you have low granularity, which means that, let's say, you're collecting data every 60 seconds, this is in effect blurry data, because you don't get the full picture. There's a lot of stuff that's happening. Let's say, for example, you're looking at a specific metric, which is a counter over some important data coming from your database, and you're only looking at it in snapshots that happen every 60 seconds. There could be a lot that happened within that time period which you're either aggregating away or you're not getting to see. So this is why I call it blurry data. And then the second part of this is cardinality. And what does cardinality mean? It means how much data you're getting, right? So if you have ten metrics or 100 metrics or 1000 metrics coming from a particular server or from a particular application, there's again a big difference. In many cases with traditional monitoring, you're cherry picking what you want to monitor. So you're saying that I think, or somebody else on the team thinks, or this particular observability tool thinks, that these are the ten most important metrics that I need to collect from my Postgres database or from this Linux server. There might be a lot of other metrics, there might be 100, there might be a thousand, there might be 2000 other metrics that are collectible from that machine. But you're choosing not to. And we'll talk about why you're choosing not to do this. But the important thing to understand here is if you have low cardinality, that means that you have blind spots. There are metrics out there which are talking about things which you're not collecting, and they are completely invisible to you. So you might think that I have observability, I have all this data that I'm looking at, but there's a ton of other data that you're not looking at and that you're not even thinking about. So think about it this way. When an actual issue happens and you're troubleshooting it and you're now doing the post mortem of why didn't we catch this issue before it became a problem, sometimes you end up with an action item: we should have been monitoring X and Y and we didn't have those metrics, let's go and add them to the dashboard so that this won't happen again. So this is a problem with low cardinality data, where you're not collecting all of those things to start with. And when you have a combination of low granularity and low cardinality, then what you get is like the first half of this horse over here: you get low fidelity data, which means it's abstract, it doesn't have the detail and it doesn't have the coverage, which means that you think you have observability, but in actual fact you don't. Now fidelity and centralization are both deeply linked to each other. In a way, centralization makes cost and fidelity proportional to each other. And this is the root cause of a lot of the problems that we're talking about. Because if you increase the fidelity, which means that you're increasing the granularity and the cardinality, then you're by default increasing the cost. Because think about it, you have a centralized monitoring server where all of this information needs to reside. Higher granularity means that you need to send more data over time. So instead of sending one sample every 60 seconds, you're now sending 60 samples.
So it's 60 times the amount of data that you're sending. And the same with cardinality: instead of collecting ten metrics, if you're collecting 1000 metrics, it's 100 times the amount of data that you're collecting. And that central server needs to be able to handle all of this data. Your network connection needs to be able to handle all of this new data going out as egress. So there's a direct link here. Reducing costs leads to a decrease in fidelity; increasing fidelity leads to higher costs. So in effect what's happening is we're building in low fidelity by design. The second point is scalability. So again, when you think about it, think about a central server and what happens: there is a clear bottleneck here, which is your central monitoring server. What happens when something happens to that central monitoring server? That's when you start facing all of your issues. If you want to scale, again, you run into these problems. So as an example you can think about setting up your own central Prometheus monitoring server, and what happens when you keep scaling, especially for companies where you're adding a lot of compute or a lot of storage very quickly and you want to scale very fast. This means that you end up spending more time trying to figure out how to scale your monitoring environment and your observability platform than you should; you should actually be spending that time on what you need to do to make sure that all of your applications and your business logic are working properly. So scalability is a major issue. With centralized monitoring solutions you can run into bottlenecks. Obviously there's capacity limits, there's also latency and delays, because all of this information from all of these different servers and applications and databases needs to be relayed into a central repository somewhere for it to be observed, alerted on, and anything else that you want to do with that data, which means that all of this data needs to travel there. So there is a latency associated with this. And of course if you want to get fancy and you want to build this out, make this more scalable, then you might end up doing a lot of complex load balancing. So centralization makes scalability harder the higher the scale that you're looking at. Because if you're looking at this at a very small scale, and let's say that I have ten servers and I want to monitor them, you don't really have any scalability issues with a centralized monitoring solution. But as your infrastructure becomes more complex, we're talking about multi-cloud environments, hybrid cloud or Kubernetes clusters, along with, you know, an IoT network, this is when scalability starts hitting you really badly with centralized monitoring solutions. And this is again one of those things which is a silent killer to start with. You might not see it as a problem at all, but over time it becomes more and more of a problem. Let's go to the next one. So this is a fun one: cost, right? As you can see from all those news headline clippings that I've pasted here, observability is really expensive. And why is it so? So centralized data storage is expensive. If you want to store huge amounts of data, especially at higher fidelity, you're going to have to spend a huge amount of money on it, whether you're self-hosting it yourself or it's your observability provider who has to host all of this on their cloud, and they're going to transfer that cost to you. Of course there's also centralized compute.
It's not just storage. Let's say you have a huge amount of data, petabytes of data, and then you want to run compute over it: you want to compute whether there are certain correlations happening between different data points, or you want to understand if something should be alerted on. All of this is again compute that you're doing in a centralized location over a very large dataset. Of course there are architectural ways in which you can make this easier, but in most cases there is still a large compute cost, and again that cost gets passed on to the user. The third thing is high data egress. In many deployments there's a big cost to the amount of data that you're sending up to the central repository, especially if it is being hosted by the observability provider on their cloud, for example. So if you want to get higher accuracy, it is possible, right? The observability tool, let's take, you know, Datadog, for example. You can collect data at a higher granularity or a better granularity, but it means that you're going to be sending that much data up to Datadog's cloud, and that means more egress costs for you and also more costs in general. And then we just talked about scaling. Scaling costs grow disproportionately. You can see some of these articles here mention things like this: they didn't start out like this. They started off paying a few hundred, a few thousand dollars for monitoring and observability. And as the company scaled, the costs grew disproportionately. So very soon they have a $1 million observability bill, and now the company is trying to make sense of why are we paying a million dollars for observability? This doesn't make any sense. So this has happened multiple times in multiple companies where a lot of smart people are working. So you can see how this is something that's not obvious or evident on day one, but over time it's going to catch up to you. And what's the result of all of these things? Very often what happens is people start questioning the value of observability, because it doesn't make sense that you have to pay millions for this stuff. So teams decide that either we don't need observability, which is the most drastic option, of course, or, what happens more often, they decide we're going to cherry pick what to observe. We can't just grab all the data because it's just costing us too much and we're not seeing the value from all the data all the time. So let's cherry pick, based on our subject matter expertise, or expertise that we're getting from somewhere, that these are the things that I want to monitor, and we'll only monitor this. And this, as we discussed earlier, can be a bad move, because nobody, no matter how much of an expert you are, can anticipate what metric would be useful while you're troubleshooting an outage at 3:00 a.m. That's when you wish that you had all the data already there. So cost is a huge problem when it comes to centralized monitoring solutions. Even the ones that are open source, like Prometheus, still have a cost in terms of what you are self-hosting, what you're maintaining and what you have to scale out over time. And number four is accuracy. This is linked to fidelity in a way, because when you have reduced granularity or reduced coverage, then by proxy, by default, your accuracy becomes lower: because you're not getting data every second, because you're not getting all the metrics, you cannot be, by definition, as accurate as you could be.
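To put rough numbers on this fidelity-versus-cost link, here is a back-of-the-envelope sketch; the node count and the bytes-per-sample figure are assumptions for illustration, not measurements from any real deployment:

    def daily_samples(metrics_per_node: int, interval_seconds: int, nodes: int) -> int:
        """Samples per day that must travel to, and be stored by, the central server."""
        samples_per_metric_per_day = 86_400 // interval_seconds
        return metrics_per_node * samples_per_metric_per_day * nodes

    BYTES_PER_SAMPLE = 16  # assumed average after compression; real numbers vary widely

    for label, metrics, interval in [
        ("low fidelity (10 metrics @ 60s)", 10, 60),
        ("high fidelity (2000 metrics @ 1s)", 2000, 1),
    ]:
        samples = daily_samples(metrics, interval, nodes=100)
        print(f"{label}: {samples:,} samples/day, "
              f"roughly {samples * BYTES_PER_SAMPLE / 1e9:.1f} GB/day of egress and storage")

The exact numbers do not matter; the point is that going from 60-second to 1-second granularity and from 10 to 2000 metrics multiplies the stream toward the central server by a factor of 12,000, and in a centralized design that entire increase shows up as storage, egress and compute cost.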
But it's not just about fidelity. There are also other issues that can come up. For example, let's just think about alerts for a second. All of the data is spread across all of these different servers or nodes or applications, and rather than triggering the alerts when something anomalous is discovered in one particular location, if you have to centralize all of this information in a single place, then the thresholds that you're applying to it might again be generic. They might not be precise enough, they might not be customized enough to the metric in question for you to trigger the alert at the right time, which means that you could be triggering the alert later, you could be missing it completely, and you might miss the actual event there. When it comes to things like machine learning, for example, you must be hearing about a lot of machine learning related to observability, maybe even in some discussions during this conference. So when it comes to machine learning, as I'm sure you've heard, it's all about the data, right? How accurate your machine learning is depends on how much data you have, how granular it is, how clean it is, and how good it is. As an example, let's think about something like anomaly detection. Anomaly detection is basically a way in which you can detect if something, a metric value, for example, is anomalous or not. Is this something that's expected or is it unexpected? Now, when you do all of this in a centralized location, you need to have so much context built in, because the metric could be coming from a Raspberry Pi or it could be coming from an NVIDIA H100 GPU rig. And it's a very different environment in both those cases. So the value of the metric, whether that's expected, whether that's unexpected: having to decide this in a centralized fashion means that you have to have so much context built in, and that increases the processing load of what you're trying to do. And putting together all of these things, what does this lead to? It leads to outages, it leads to downtime, and in general, it leads to pain for the DevOps engineers and for the SREs and developers who have to deal with this. And that brings us to the next point, which is resilience. And this is again one of those terms which is bandied about a lot and misused a lot as well. But again, this is something that's built into the definition of what a centralized monitoring solution is. The centralized monitoring solution has a single point of failure, which is your single server. Something happens to that server, everything goes down and you have cascading failures across your infrastructure, which means that in the worst case, if a disaster happens, you're left with no way to monitor what you care about. And the recovery time? There are no guarantees on this either. A recent example is what happened last year when there was an outage on Datadog's service which took down their monitoring for most of their customers for many hours. And again, in these cases, these users were left without a direct way of easily understanding what was going on. Because you have the centralized view, all of your data is in one place, your window into what's happening and how to observe these systems is that single point of entry, right? So a single pane of glass is often touted as a good thing because you get all your information in one place, and it is, but it also has its drawbacks if that's the only window that you have into observability on those infrastructures.
Now, if you had more localized ways of looking at the individual pieces of your infrastructure, your centralized view could be down, it could be down for a day, for example, but as long as you have the ability to look into those things through other means, through localized means, you wouldn't be as impacted as an observability user. And then we have efficiency. Efficiency is again a thing that, on day one of your observability journey, maybe you're not paying a lot of attention to. But over time, as your observability setup grows along with your infrastructure, you've added lots of new collectors or exporters, you have different kinds of data types in there, this is when the efficiency gains start becoming more and more important for you. If there are delays in data processing, if the data handling is inefficient across your data pipeline, all of this starts to add up over time. And resource overload becomes a real challenge, a real issue, because how many resources do you allocate to your monitoring server? And how much can you scale up when you need to? And another often overlooked part of this is energy consumption. IT infrastructure in general, I think, takes up around 30% of the total energy consumption in the world today, and that's a lot. And your observability platform itself is intended to make sure that you're running optimally all the time. Now, if your observability platform itself adds to your energy consumption in a significant way, then this becomes a problematic scenario to be in. So I think we've covered six of the deadly sins and we've landed at the final one, which is data privacy. So this one's obvious, right? This is something that is often talked about, that large tech companies have an often unhealthy liking for customer data. And there are different ways to look at this problem. For one, thinking about centralized systems in general, there is a concentration of risk. You have a single repository where, if there was an attack that happened there, the attacker suddenly gains access to all of your data, or all of the data that exists there. And that concentration of risk is something that you should be thinking about more and more when it comes to your data and your data security. The other aspect to this is compliance challenges. You've heard about your GDPRs and your CCPAs and all of these different compliance standards that your company, your business has to meet, and you want to understand whether the tool that you're using for observability supports all of these standards. Now, if it's a centralized observability tool, that means that they have access to your data, your data is being stored somewhere, and by your data, it could be your end users' data, which means it's your customers' data that you're now storing, in a way, on a third party company's side. So it becomes a question of trust as well. If the company is big enough, maybe you trust them, maybe you're trusting that these large publicly listed companies are going to treat this data well, and that's a choice, right? And then finally, there's also the question of deployment options. Maybe you do not want all of your networks to be exposed, all of your devices or all of your servers to be exposed to the outside world or exposed to the centralized monitoring. So you want to have a way in which you can cordon off certain parts of your network or parts of your infrastructure into a demilitarized zone, for example. Are you able to do this?
Are you able to achieve this with your monitoring solution? This becomes another question that you should be thinking about. So I think we've now talked about all of these different problems. And what's the solution? The one solution that I'm proposing in this talk is to decentralize. And what does it mean to decentralize? Let's try and understand this a little bit better. On the left here, you have a centralized network. As you can see, there is a central authority or a central node, and all the other nodes are connected in one way or another to the single authority. On the right, we have a decentralized network. There is no single authority server which controls all the nodes. Every single node has its own individual identity and is an entity in itself. This is really important. Every single node that you see here on the decentralized network can operate on its own. It's independent and it's fully capable. So this is the main difference: each node is fully capable. As you can see, there are still many centralization points that can exist, where multiple nodes are connected to a single node, which means that this node now has access to the data of those other nodes, and then these nodes could be connected to each other. And you can have as many of these connections as you want. It's up to you. But the really important thing is that each of those individual nodes is a capable entity in and of itself. So what does this mean for the problems that we talked about, the problems of fidelity, scalability, cost, accuracy, resilience, efficiency, and data privacy? Think about it. Let's think about fidelity. If we are storing the data on the individual node itself, this gives us a lot more options for higher fidelity data, because you can collect more data, you can collect data in a more granular fashion, because you're just storing it on the device itself. Of course, you need to think about whether you're storing it in an efficient way or not, but you're not sending it to be stored in a centralized server somewhere else. And decentralized networks are by definition built to be highly scalable. You can scale them up, you can scale them down as you wish. And cost: again, there is no central server which is contributing to the cost by being directly proportional to the fidelity. It's completely decentralized, so you have the option to keep the costs down. Higher fidelity is one reason why it could be more accurate, but also, if you want to do things like alerting or anomaly detection, since each node is individually capable of doing this, this means that you're doing it on device, you're doing it at the edge. So if there were alerts being triggered on one of these nodes, that decision is being taken at the edge, which means that it's more accurate. And again, it's part of the architectural definition of a decentralized network that it's more resilient by nature, which means that you can take out one of these nodes, but the other nodes would still operate. You can break the connections, but the nodes would again be able to operate or connect to other nodes as and when needed. Efficiency is another important factor. You can cut down on things like centralized bottlenecks, you can cut down on things like latency issues by having a decentralized network. And data privacy: because you're storing all your data on device, your data privacy requirements look very different all of a sudden.
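Circling back to the accuracy point about doing anomaly detection at the edge: here is a minimal sketch, not Netdata's actual algorithm, of a per-metric rolling detector that each node could run over its own data, so that "normal" is defined by that node's own recent history, whether it's a Raspberry Pi or a GPU rig, rather than by one generic, centrally configured threshold:

    from collections import deque
    from statistics import mean, pstdev

    class RollingAnomalyDetector:
        """Flags samples that deviate strongly from this node's own recent history."""

        def __init__(self, window: int = 300, threshold: float = 4.0):
            self.history = deque(maxlen=window)  # e.g. last 5 minutes of per-second samples
            self.threshold = threshold

        def observe(self, value: float) -> bool:
            is_anomalous = False
            if len(self.history) >= 30:  # wait for a minimal baseline
                mu, sigma = mean(self.history), pstdev(self.history)
                if sigma > 0 and abs(value - mu) / sigma > self.threshold:
                    is_anomalous = True
            self.history.append(value)
            return is_anomalous

    # The decision is taken on the device itself, using only local context.
    detector = RollingAnomalyDetector()
    for sample in [0.2, 0.3, 0.25] * 20 + [9.0]:
        if detector.observe(sample):
            print(f"anomalous sample: {sample}")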
And on the data privacy point, you don't even need to worry about a lot of the regulations, because you are not exporting your data to be stored in a third party cloud somewhere. So there's a lot of advantages to be had from decentralized networks, and you don't need to be scared of decentralized being something that's very complex or hard to understand or hard to deploy. Let's dive into this a little bit more so that we can understand this better. So let's talk about decentralized design for high fidelity. Specifically, the most important aspect of this is keeping data at the edge. You have compute and storage already available on these things that you're monitoring, whether that's a container, whether that's a virtual machine, or whether that's a high end server. These things have compute available, they have storage available, and this is enough to keep the data at the edge. You can keep the data stored there, and you can also have the processing happening there, and you can optimize it in such a way that the monitoring doesn't affect the actual business logic that needs to operate on those devices. So keep data at the edge. That's number one. Number two, make the data highly available across the network, right? Because you might have ephemeral nodes that are not going to exist forever. They might come up, they might go down. Once they vanish, you still need access to their data, which means that their data needs to be stored somewhere. That's where those other nodes in your decentralized network become important. They also help with higher availability. So if a node goes down, you know that there's another node which has access to the data of this node. And you can also use this for more flexible deployment scenarios, where you can offload sensitive production systems from observability work. You can say that I have these ten servers here which are doing top secret work; I don't want to run any monitoring logic on them. Just export the data somewhere else, export the alerting, export the anomaly detection, and do it all elsewhere. And then number three, you need a way to unify and integrate all of this at query time. You have all of this data, it's stored, it's being processed in a decentralized fashion across different nodes in different places. How do you get that single pane of glass view when you need it? So there has to be a way that you can unify and integrate everything at query time. These are also some of the challenges of making decentralized observability work. Now we'll talk about Netdata. Netdata is the company that I work for. It started off as an open source project and it became very popular on GitHub. It has more than 68,000 stars on GitHub, and people are using it for all kinds of things. They're using it to monitor tiny things such as their home labs or their Raspberry Pis. But there are also teams and companies using Netdata to monitor entire data centers, Kubernetes clusters, multi-cloud environments, high performance clusters. So really it's completely up to you how you use Netdata. So how does Netdata aim to achieve the decentralized philosophy that we've been talking about? The main component of the Netdata monitoring solution is the Netdata agent. And I have agent in double quotes here because the Netdata agent is so much more than what normal monitoring agents are. It's open source, it collects data in real time, which means that the granularity is 1 second by default. All the metrics are collected per second.
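As a rough illustration of what keeping per-second data at the edge can mean in practice (purely a sketch under my own assumptions, not how the Netdata agent is implemented), each monitored node can hold its own recent samples in a small bounded store and answer queries for them locally, so nothing has to be shipped anywhere just to be retained:

    from collections import defaultdict, deque
    from time import time

    class EdgeMetricStore:
        """Bounded, per-node store: recent samples stay where they were collected."""

        def __init__(self, retention_samples: int = 86_400):  # e.g. one day of per-second data
            self.series = defaultdict(lambda: deque(maxlen=retention_samples))

        def append(self, metric: str, value: float) -> None:
            self.series[metric].append((time(), value))

        def query(self, metric: str, since: float) -> list:
            # Served locally, on demand, only when something actually asks for it.
            return [(ts, v) for ts, v in self.series[metric] if ts >= since]

    store = EdgeMetricStore()
    store.append("cpu.user", 12.5)
    print(store.query("cpu.user", since=time() - 60))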
The agent auto discovers what's there to monitor in the environment where it's been installed. And it collects all of this data every second, and it stores this data in its own time series database; all of this is open source, so you can look at it if you want to. And it collects metrics and logs, and it also does alerting and sends the notifications for those alerts. All of this happens on the agent, and anomaly detection and machine learning also happen within the agent, at the edge. This is again something that's not very common. And the agent can also stream data to other agents. So this is where the decentralized concept comes in. The agent is a fully functioning entity of itself, but it can also send its data to be stored on another agent via configuration. And you can have a cloud which unifies all of these different agents and gives you the ability to query any agent, across all agents, in real time. And we'll talk a little bit more about the cloud component. So this is what the distributed metrics pipeline looks like inside Netdata. You can think of it in a way like Lego building blocks. You have a local Netdata: it's discovering metrics, it's collecting these metrics and then it's detecting anomalies on them, it's storing them in the time series database. And, you know, it's checking for alert transitions, it's querying for anomaly scores or correlations and things like this. And it's also able to visualize this in a dashboard. It's all inside this agent. But then at the same time it can also collect metrics from a remote Netdata, which means another agent. So this is the decentralized aspect of it, where you can plug these agents together into sort of a Lego creation. So you can collect data from a remote Netdata, and you can stream all of this data, from both the collected one and the current one, to another remote Netdata. So it's really up to you how you construct this network, this monitoring network of yours. And the really important thing which allows Netdata to deploy this decentralized philosophy is that the Netdata agent is really lightweight, even though it's highly capable. So we ran a full, very detailed analysis, and I'll share the link along with this presentation where you can take a look at how this was done. But you have some of the data points here. You can see that the CPU usage, the memory usage, the disk usage, and the egress bandwidth that's generated are all really, really low, even though it's doing all of those things that we talked about. It's doing the metric collection, the storage, the alerting, the anomaly detection and machine learning and the dashboarding. All of that is happening on each agent. But it's still very light in terms of the amount of resources, and you can configure it to make it lighter still. So if you say that this is an IoT node, I want to make sure that it runs super light, then you can configure it so that it doesn't run alerting, it doesn't run ML, it doesn't do any storage, it's just streaming the data to another, more powerful node which does all of those things for it. And just by installing Netdata on an empty VM, you get 2000-plus metrics. You get more than 50 pre-configured alerts. There's anomaly detection running for every metric. And by default, if you just have 3 GB of disk space, you get up to two weeks of this per-second real time data, you get three months of per-minute data, and you get two years of per-hour data. So, you know, in terms of your data retention, that's a pretty good deal.
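The streaming between agents mentioned above is driven by configuration. The commonly documented shape of Netdata's stream.conf looks roughly like the sketch below; the destination address and API key are placeholders, so check the official documentation for your version before relying on it:

    # On the child (the node sending its metrics away) -- stream.conf
    [stream]
        enabled = yes
        destination = parent.example.com:19999
        api key = 11111111-2222-3333-4444-555555555555

    # On the parent (another Netdata agent accepting this child) -- stream.conf
    [11111111-2222-3333-4444-555555555555]
        enabled = yes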
Now, we've talked about the Netdata agent. The other component to this, which allows this decentralized architecture, is the Netdata parent. Netdata parents are nothing but other Netdata agents which aggregate data across multiple children. So you can start to see the decentralized network build out here. You have three parents. Each parent has multiple children. So this parent, for example, has children that are part of a data center. The other parent has children that are part of a cloud provider, and the third parent has children that are part of another data center. And all of these parents could be connected to each other so that they have access to the data across these three different environments. Now, having access to these parents, or mini centralization points, obviously gives you enhanced scalability and flexibility, because now you can really build the Lego blocks into something magnificent. It ensures that all of the data always remains on-premises. You're always storing all of your data on your own premises. And by design, it's resilient and fault tolerant. You can take out any of these instances, but the other remaining instances would continue to function on their own. And this really helps you to build a monitoring network which is optimized in terms of performance and in terms of cost, and also, if you want to isolate certain parts of your network from the rest, from your broader network and from the Internet, it allows you to do this as well. And the third and final component of Netdata's decentralized architecture is Netdata Cloud. So Netdata Cloud, again, cloud in double quotes or air quotes, because it's not a centralized repository. Netdata Cloud does not centralize any observability data. It doesn't store any data in the cloud. All it does is maintain a map of the infrastructure. So the cloud is the one entity that knows where everybody, all the other nodes, all the parents, and all the agents, are. And it has the ability to query any of these agents, or all of those agents, or any grouping of those agents, at any time, in real time, right? Which means that I could be logged into the cloud and say, I want to see all the data from all the nodes in data center one and data center two, I don't want to see cloud provider one; or I could say I want to see all of it together. So the cloud is able to send this query to these nodes. And since you have these parent agents, the cloud doesn't need to query 15 different servers here, it just needs to query three. So this decentralized architecture keeps querying much more efficient and gets the data back within a second, because nobody wants to wait multiple seconds or multiple minutes for a dashboard to update. So the cloud in effect enables horizontal scalability, because you could have any number of these parent-agent clusters, and as long as all of them are connected to the cloud, it should be relatively easy to just query them within a second and see the data, which means that you have high fidelity data across your entire infrastructure, it's super easy to scale, and you have access to all of it from a single central cloud without having to store your data in the cloud, right? So the cloud is just querying the data in real time and it's just showing it to you. So some of the common concerns about decentralized design are, one, that the agent will be really heavy, because you have to run this thing on your servers, on the machines that are hosting your application.
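Before answering those concerns one by one, here is a minimal sketch of what that query-time fan-out can look like, written in Python with hypothetical parent endpoints; Netdata Cloud's real query path is more involved, so treat this purely as an illustration of the scatter-gather idea:

    import concurrent.futures
    import json
    import urllib.request

    # Hypothetical parent endpoints: the coordinating layer only needs to ask the
    # three parents, not every individual node behind them.
    PARENTS = [
        "http://parent-dc1.example.com:19999/api/v1/data?chart=system.cpu&after=-60",
        "http://parent-dc2.example.com:19999/api/v1/data?chart=system.cpu&after=-60",
        "http://parent-cloud1.example.com:19999/api/v1/data?chart=system.cpu&after=-60",
    ]

    def fetch(url: str) -> dict:
        with urllib.request.urlopen(url, timeout=2) as resp:
            return json.load(resp)

    def query_all(urls: list) -> list:
        # Small queries issued in parallel: each parent answers only for the data it
        # holds, and the caller merges the results into a single unified view.
        with concurrent.futures.ThreadPoolExecutor(max_workers=len(urls)) as pool:
            return list(pool.map(fetch, urls))

    if __name__ == "__main__":
        for result in query_all(PARENTS):
            print(len(result.get("data", [])), "rows from one parent")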
On that first concern, clearly we saw that, no, the Netdata agent processes thousands of metrics per second and it's super light. The second concern is that querying will increase load on production systems. But each agent serves only its own data, so the queries do not increase load on the production systems themselves. Querying for small data sets is lightweight, and you can use the parent agents as a centralization point within your decentralized network, so that certain nodes are isolated from queries; the queries do not even reach them. The third concern is that the queries will be slower. This isn't the case either. Actually the queries are faster, because we're distributing tiny queries in parallel, massively, to multiple systems, which means that your refresh times and your load times are much, much better. And the final thing is that it'll require more bandwidth. But this is again not true, because the querying is selective: you're only querying for data that you're seeing on the screen. It doesn't need to query for all the 2000 metrics that it's collecting, it just needs to query for what the user is looking at right now. And if the user goes to a different chart or a different dashboard, then it queries for that instead. So this is a quick look at the Netdata dashboard. I'm not going to go into a detailed demo. We have a public demo space available that you can check out yourself from our website without even logging in. But if you want to log in, if you want to create a login, then you get access to a space in Netdata where you can copy a single command. And when you paste that command, it automatically installs the Netdata agent on your server, on your device. And you get this dashboard, which is out of the box, right? So this is not a custom dashboard or a created dashboard; it's what you get immediately after you install Netdata on your system. And here you can see that the data that's coming in is from 16 nodes, it's across two labels, from 16 different systems, and all of that data is stored on those nodes, on those systems, in a completely decentralized fashion. And it's being queried by the cloud in real time without any of that data being stored on the cloud. So I would welcome you to explore how decentralized monitoring looks and feels by trying out Netdata, and in general, think about how to make your own monitoring setups more decentralized, even if you're not using Netdata. So where does this take us? The last section is about the future, about the road ahead. So what's the catch? Where are all the other decentralized observability platforms? Part of the reason for this is that creating a decentralized observability platform is not easy. Changing from a centralized architecture to a decentralized architecture is even harder, because you put all your eggs into the centralized basket and you don't really want to change. Even if you do, it's not easy to do, because you have to ensure that resource consumption at the edge is minimal, you have to handle complex queries and aggregation, and all the while the deployment has to be really simple, right? And this is something that's hard for a lot of commercial companies to do. You have to learn to relinquish control. You have to say that I'm okay with not having control over the data or over the processing; it all happens on the customer's own premises, on the user's premises. And this is not an easy thing to do. So this is part, or maybe a big part, of why we're not seeing more decentralized observability platforms.
And also, like I said, the big players in the industry will find it really hard to move away from their existing architecture to do something like a decentralized monitoring solution. I believe that the future is decentralized and that hard problems can be solved, and they should be solved. I would ask all of the listeners not to compromise on fidelity, because compromising on fidelity will only create more problems for you in the long term. I would ask you to demand more and demand better from your observability provider, whoever that is. And if you're operating your own monitoring stack, then try to apply some of the decentralized principles that we talked about in this talk today, and you will see a long term benefit. And think about it: when your environment, when your infrastructure, is distributed, it's multi-cloud, it's hybrid, it's auto-scaling, why is your observability centralized? Why is it not decentralized? That should be the question that you're asking yourself. So thank you so much for listening to this talk. If you have any questions about Netdata, or if you'd like to find out more, this is our website that you can visit. And here's the link to our GitHub page where you can download the open source Netdata agent, run it on whatever system you have, and get immediate access to thousands of metrics in a decentralized way. So I really hope that you try this out. And if you have any questions, if you have any suggestions, or if you have any disagreements about anything that I've spoken about in this talk, here's my email ID and my LinkedIn profile as well. So I'd love to hear some feedback. Thanks for listening. Thanks for being a good audience. Thank you and have a good day.
...

Shyam Sreevalsan

Senior Technical Product Manager @ Netdata



