Conf42 Observability 2024 - Online

Netdata: Open Source, Distributed Observability Pipeline - Journey and Challenges

Abstract

Netdata: open-source, distributed observability pipeline. Higher fidelity, easier scalability, and lower cost of ownership, but there are challenges too. An overview of the journey and the challenges we face while building Netdata.

Summary

  • Most monitoring solutions today provide very low fidelity insights. There is no AI. Even the open source ones are really expensive to run. Netdata introduces an all-in-one, real-time, high-fidelity, AI-powered and extremely cost-efficient solution.
  • We need real-time, per-second data and extremely low latency. We should use AI and machine learning to help us in observability. I generally prefer a system that provides powerful, fully automated, dynamic dashboards.
  • Netdata can export metrics to third parties such as Prometheus or InfluxDB. Most of the time, most of the data are never queried. Netdata allows you to offload very sensitive production systems, and the overall performance is much better.
  • Netdata developed a model to describe metrics in a way that lets a fully automated dashboard engine visualize all of them. The query language is the biggest problem of monitoring tools. Another big problem is how to let a user grasp what a dashboard is about.
  • Netdata has a scoring engine that goes through all metrics and scores them according to an algorithm. The scoring engine gives you an anomaly rate per section of the dashboard, so you can immediately see where anomalies concentrate. It is another way to use AI and machine learning in the troubleshooting process.
  • For logs, Netdata has a very similar distributed approach. We rely on the systemd journal. It is secure by design, it indexes all fields and all values, and it can also be used to build log centralization points.
  • The next challenge is going beyond metrics, logs and traces. We want Netdata to be a console for any kind of information. Netdata is open source, but we monetize it through a SaaS offering, Netdata Cloud.
  • I am very sorry for my microphone problem. I hope I will see you again, and I hope you enjoyed it. Bye.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Hello and welcome. Today we are going to talk about Netdata, the open source observability platform. I am the founder of Netdata, and to tell you the truth, I started Netdata out of frustration. I was facing some issues that I couldn't actually identify with any of the existing solutions, and I decided to do something about it.
So what is the problem? Why do we need another tool? The key problems are these. The first is that most monitoring solutions today provide very low fidelity insights. Why is that? They force us to select a few data sources, and then, instead of collecting everything per second and going high resolution everywhere, they force us to lower the granularity, the resolution of the insights. The second is inefficiency. Most monitoring tools assume that we are going to build our dashboards ourselves, by hand. For my taste, this is very problematic. I don't want to build custom dashboards by hand. Of course I want the ability to do it, but it should not be the main means for troubleshooting or exploring an infrastructure. The next is that there is no AI. Even the observability solutions that claim they do some kind of machine learning don't do true machine learning. It's more like an expert system, a set of rules that they collectively call AI, but it isn't really. And the last one is that all of them are expensive. Even the open source ones are really expensive to run. This was my driving force. This is why I wanted a new monitoring tool that simplifies the lives of engineers instead of complicating them.
Now, the current state as I see it from my experience. I respect all the tools that exist out there; each of them has contributed significantly to the evolution of observability solutions. First, there is the world of too-little observability. These are the traditional check-based systems, the systems that run some check and then give you traffic-lights monitoring: green, yellow, red, to indicate that something needs attention or is problematic. This is a robust approach and there are excellent tools in this area, but the problem with this philosophy is that it has a lot of blind spots. These tools don't monitor much; they monitor only the things they have checks for. Then there is the too-complex observability, like the Grafana world, which is a very powerful and excellent visualizer with excellent databases and other tools in its ecosystem. The biggest problem is that it has a very steep learning curve and it very quickly becomes overcomplicated to maintain and scale. And then, of course, there are the commercial solutions, like Datadog, Dynatrace, et cetera, which are very nice integrated systems. However, they are very expensive, and to my understanding they cannot do otherwise; what they do is expensive.
So the evolution, as I see it, is like this. Initially we had the check-based systems; that was the first generation of monitoring tools. Then we had the metrics-based systems, then the logs-based systems; these evolved more or less in parallel. The fourth generation is the integrated ones, like the commercial providers and, of course, Grafana.
And the fifth is what I believe Netdata introduces: all-in-one, integrated like the commercial ones, but at the same time real time, high fidelity, AI powered and extremely cost efficient.
Now, to understand my thinking and how I started this, these are what I believe are the biggest bad practices. People call them best practices, but in my mind they are bad practices. The first bad practice is the myth that says you should monitor only what you need and understand. No. We should monitor everything. Everything that is available should be monitored, no matter what, no exceptions. Why do we need this? Because we need a holistic view of the system, the infrastructure, and even our applications. We need more data to make our lives easier when we do root cause analysis, and we need enough data to feed the detection mechanisms that can proactively detect issues. And since we collect everything, we are adaptable to changing environments: as the infrastructure or the applications change, or additional things are introduced, we still monitor everything, so all the information is there, with nothing to maintain.
The second is that we need real-time, per-second data and extremely low latency. The bad practice says that monitoring every 10, 15 or 60 seconds is enough. To my understanding it is not, because in the environment we live in today, with all this virtualization and all the different providers involved in our infrastructure, monitoring every 10 or 15 seconds, or even every 2 seconds, is in many cases not enough to understand what is really happening at the infrastructure or application level. Also, when you monitor every second, it is easy to see the correlations: it is easy to see when this application did something and when that application did something, so spikes and dives or different errors can easily be put in sequence. And when you have a low-latency system, where you know that the moment you hit enter on the keyboard to apply a command the monitoring system will be updated the next second, this improves response to issues and makes it extremely easy to identify problems during deployments or changes in the infrastructure.
The next is about dashboards. Most monitoring tools have a design that allows, and actually forces, people to create the custom dashboards they need beforehand. The idea is to create enough custom dashboards so they will be available when the time comes, especially under crisis. The problem is that our data have infinite correlations. OK, you can put the database and its queries in the same dashboard, but how is that correlated with the storage, the network, the web servers, et cetera? It is very hard to build infinite correlations by hand. So I generally prefer a system that provides powerful, fully automated, dynamic dashboards. It is a tool, not just a few static dashboards: a tool to explore, a tool to correlate, and a tool that is consistent no matter what we want to do. This fully automated dashboard is, again, one of the things I believe are the ideal attributes of a modern observability solution.
And the last one is, of course, that we should use AI and machine learning to help us in observability. There are many presentations on the Internet, and there is a narrative that machine learning cannot help in observability. Fortunately, this is not true. ML can learn the behavior of metrics: given a time series, it can learn how it behaves over time. ML can be trained at the edge, so there is no need to train models centrally and then publish them; in fact, that approach does not work for observability, because even if you have exactly the same servers and exactly the same setup, the workload defines what the metrics will do. Each time series should be trained individually. Machine learning can then detect outliers reliably and reveal hidden correlations: you can see, for example, that the anomalies of this thing and that thing happen at the same time, consistently across time, which means these metrics are somehow correlated even if you don't realize how. And the most interesting part is that when you have a system that monitors everything, in high fidelity, and trains machine learning models for everything, the density of outliers and the number of concurrently anomalous metrics provide key insights. You can see clusters of anomalies within a server or across servers, and you can see how anomalies spread across different components of the systems and the services.
Now, in order to overcome the first problem, the monitor-everything problem, and to provide real-time, per-second, low-latency visualization, Netdata comes with a distributed design. But to understand it, let's first see what affects fidelity. Granularity is the resolution of the data, how frequently you collect it. If you have low granularity, the data are blurry; they are averaged over time, so you have a point here, a point there, and everything in between is averaged. Low cardinality means that you have fewer sources: you don't collect everything that is there, you collect only some of the data. If you have both low granularity and low cardinality, then you are lacking both detail and coverage. It is an abstract view: you have monitoring, you can see the overall picture, but it is a very high level view of what is really happening in your systems and applications.
Why not have high fidelity? What is the problem? The key problem for all monitoring systems is the centralized design. Centralized design means that you push everything from your entire infrastructure to a database server. It can be a time-series database or whatever; it doesn't matter, but the whole infrastructure goes to one database server. This means that to scale the monitoring solution, you have to scale this database server and also control its costs. For commercial providers this means cost: inbound samples that need to be processed, stored, indexed, et cetera. So in order to scale the centralized design, they have to lower granularity and cardinality, and the outcome is still always expensive, because no matter how much you lower it, you have to keep enough data to actually provide some insights. Netdata, for example, collects everything per second, and as we will see later, an empty VM has about 3,000 metrics per second.
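As a side note on the "low granularity blurs the data" point above: a minimal sketch with made-up numbers, showing how a one-second spike almost vanishes when samples are averaged over 15-second windows, which is effectively what low-granularity collection does.
    # Hypothetical per-second samples: a mostly idle metric with one 1-second burst.
    per_second = [5.0] * 60
    per_second[23] = 500.0  # e.g. a short latency or traffic spike

    # What a 15-second collection interval effectively sees: the average of each window.
    per_15s = [
        sum(per_second[i:i + 15]) / 15
        for i in range(0, len(per_second), 15)
    ]

    print(max(per_second))  # 500.0 -> clearly visible at 1-second granularity
    print(max(per_15s))     # 38.0  -> the spike is blurred into the window average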
So if you install Netdata on an empty VM, you are going to get 3,000 metrics per second, 3,000 unique time series per second. Compare this with most other monitoring solutions, which collect about 100 metrics every 15 seconds or so; against that, Netdata manages about 450 times more data. That is the big deal, and this is why all the others are forced to cherry-pick sources and lower the frequency.
Now, we used a decentralized, distributed design: Netdata keeps the data at the edge. When we keep the data at the edge, we get a few benefits. The first is that each server has its own data. It is a small data set, maybe a few thousand metrics, but small and easily manageable. The second is that the resources required to do so are already available in spare. Netdata is very efficient in CPU, memory and storage: you can expect, for example, a couple of percent of a single CPU core, about 200 megabytes of RAM and 3 GB of disk space. That's it; disk I/O is almost idle. This allows Netdata to be a very polite citizen next to production applications. Despite the fact that it is a full-blown monitoring solution in a box, in one application that does everything, the agent you install is still one of the most efficient applications. If you search our site, we have a blog post where we evaluated the performance of all the agents, and you can verify that Netdata is one of the lightest monitoring agents available, despite being a full monitoring solution in a box.
The second point is that if you have ephemeral nodes, or other operational reasons for not keeping the data on the edge, on the production systems, then the same software, the agent, can be used as a centralization point. You can build Netdata centralization points across your infrastructure, and there does not need to be just one; you can have as many as you want. In order to provide unified views across the infrastructure, we merge the data at query time. For us, the biggest trick, the hardest part, was actually to come up with a query engine that can do all the complex queries that are required but execute them in a distributed way: parts of the queries run all over the place, and the final result is aggregated.
One of the common concerns about a decentralized design is that the agent will be heavy. We discussed this already; it will not, and we have verified this: it is actually one of the lightest. Will queries influence production systems? No, because the data set is very small, so there is really zero effect on production systems even when queries run. But more importantly, we give you all the tooling to offload very sensitive production systems. If you have a database server and you really don't want observability queries from Netdata to run there, you can easily stream the data to another Netdata server next to it, and that one will be used for all queries. Will queries be slower? No, actually they are faster. Imagine this: you have a thousand servers out there, you have installed Netdata on all of them, and you want to aggregate one chart with the total bandwidth, or the total CPU utilization, of all your thousand servers. The moment you hit the button for the query to be executed, tiny queries run on a thousand servers. The horsepower you have is amazingly big: each server does a very tiny job, but the overall performance is much, much better.
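To picture the "tiny queries on every node, merged at query time" idea above, here is a rough, illustrative sketch, not Netdata's actual query engine. It assumes a few hypothetical hostnames and uses the agent's HTTP API on its usual port 19999; the exact JSON envelope can differ between agent versions, so the parsing is deliberately tolerant.
    import json
    import urllib.request
    from concurrent.futures import ThreadPoolExecutor

    NODES = ["node1.example.com", "node2.example.com", "node3.example.com"]  # hypothetical hosts

    def query_node(host: str) -> float:
        # Ask one agent for the last 60 seconds of one chart, reduced to a single point.
        url = (f"http://{host}:19999/api/v1/data"
               f"?chart=system.cpu&after=-60&points=1&format=json")
        with urllib.request.urlopen(url, timeout=5) as resp:
            payload = json.load(resp)
        # Depending on the agent version the rows may sit under "result" or at the top level.
        data = payload.get("result", payload)["data"]
        row = data[0]  # first column is the timestamp, the rest are the chart's dimensions
        return sum(v for v in row[1:] if isinstance(v, (int, float)))

    # Each node does a tiny amount of work; the caller only merges small results.
    with ThreadPoolExecutor(max_workers=len(NODES)) as pool:
        per_node = dict(zip(NODES, pool.map(query_node, NODES)))

    print(per_node)                # per-node values
    print(sum(per_node.values()))  # merged, infrastructure-wide aggregate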
Another concern is that it will require more bandwidth. No, because in observability most of the bandwidth goes into streaming the collected data; that is magnitudes bigger than the bandwidth required to query something or view a page with a chart. And another aspect is that there are times when you use the monitoring system a lot, you are on it all day because you need to troubleshoot something, but most of the time most of the data are never queried. They are just collected, stored and indexed, but not queried. You have them there as the ability to troubleshoot; you don't need to go and look at all of them every day.
Now, this is what I told you before about what Netdata collects. If you install Netdata on a completely empty VM, nothing on it, just an AWS, GCP or Azure VM, this is what you are going to get: more than 150 charts, more than 2,000 unique time series, more than 50 alerts monitoring components, a systemd logs explorer and a network explorer, so all the sockets, in and out, even the local ones. You are going to have unsupervised anomaly detection for all metrics and two years of retention using 3 GB of disk space, and it will use about 1% CPU of a single core, 120 megabytes of RAM and almost zero disk I/O. This includes machine learning, metrics retention, storage, everything.
Now, internally, a Netdata agent looks like this. There is a discovery process that auto-discovers all the data sources and starts collecting them. It detects anomalies, but that is a feedback loop, we will get there. After collection, it stores the samples in its own time-series database, which we have specially crafted to achieve half a byte per sample on disk on the high-resolution tier. After the data are stored in the database, we have machine learning, where multiple models are trained per metric, and this provides reliable anomaly detection during data collection, in real time. We have the health engine, which checks for common error conditions or whatever alerts you have configured. Then there are the query engines; we will discuss the scoring engine later, and the query engine is the normal query engine that most monitoring solutions have. Netdata can also export metrics to third parties, Prometheus or InfluxDB or whatever it is, and it can downsample them so that the other system is not overloaded by the amount of information Netdata sends. It can also filter them, so drop a few metrics, or instead of per second send them every 10 seconds, every minute, or whatever is needed. And then there is the streaming functionality. This is what allows you to build parents, the centralization points: the streaming part of one Netdata connects to the corresponding point on a remote Netdata, so it is like building pipelines out of Lego bricks. You install Netdata everywhere; there is no central component anywhere.
You can have centralization points, but it is the same thing. You install Netdata everywhere, and if you want to create centralization points, you use the same software, install it as a centralization point, and just point the others to push their metrics to it. And that's it; that is actually everything about Netdata.
So in this example we have five servers, and you install the agent on all five. Every agent by default is standalone, isolated, on its own. In this case, in order to access metrics, logs or dashboards you need to hit the IP or the hostname of each server, and alerts are dispatched from each server. They are standalone. What you can do then is use Netdata Cloud: all the agents are connected to Netdata Cloud, but they are not streaming data to it. Netdata Cloud only maintains metadata: OK, this is the list of servers, this user has five servers, they have these metrics, these are the alerts that have been configured, but that's it, just metadata about what each agent is actually doing. Then, when you go to view dashboards, Netdata Cloud queries all the servers behind the scenes, aggregates the data, and presents unified dashboards. Similarly for alerting: the agent evaluates the alerts and sends Netdata Cloud a notification that says, hey, this alert has been triggered, and Netdata Cloud dispatches email notifications, PagerDuty notifications, or whatever notifications you want in order to get notified. We also have a mobile app where you can get alert notifications on your phone.
If you want to build centralization points, it works like this. You appoint one Netdata, s6 in this case, as a parent for all the others, and then all the others can be offloaded: they don't need to be queried, they don't need to store data, or they can store data only for maintenance work on the parent. We have a replication feature, so if the connection between s1 and s6 gets interrupted, the next time s1 connects to s6 they will negotiate and it will fill in the missing samples on s6, and then of course continue. A hybrid setup looks like this: here we have two data centers, data center one and data center two, and one cloud provider. You can have parents at each data center, optionally, and then use Netdata Cloud on top of all of them to have infrastructure-wide dashboards, even across different providers and different data centers.
Now, we stress-tested Netdata as a parent against Prometheus to understand how much better or worse it is. Actually, we beat Prometheus in every aspect: we need one third less CPU, half the memory, 10% less bandwidth, and almost no disk I/O. Netdata writes each sample right at the position it should be, and the writes are throttled over time, so it is very gentle and does not introduce big bursts of writes, and we have an amazing disk footprint. The University of Amsterdam did research in December 2023 on the impact of monitoring tools and the energy efficiency of monitoring tools for Docker-based systems.
They found that Netdata is the most efficient agent and that it excelled in terms of performance impact, allowing containers to run without any measurable impact due to observability. Netdata is also in the CNCF landscape. Netdata is not an incubating CNCF project, but we sponsor and support CNCF, we are in the CNCF landscape, and it is the top project in the landscape's observability category in terms of user love, GitHub stars of course.
Now let's see how we do the first thing, how we automate everything: how we manage to install the agent, have it detect the data sources, and then have it come up by itself with dashboards and alerts, everything, without you doing anything. The first thing we understood is that we have a lot in common. Each of us has an infrastructure that, from the outside, seems completely different. But if you check the components we all use, we use the same database servers, the same web servers, the same Linux systems, similar disks, similar network interfaces, et cetera. The components of all our infrastructures are pretty similar. So what Netdata did is develop a model to describe the metrics in a way that allows the automated dashboard engine to visualize all the metrics. We developed the NIDL framework, pronounced "needle": Nodes, Instances, Dimensions and Labels. It is a method of describing the metrics that allows us to have both fully automated dashboards, without any manual intervention, and at the same time fully automated alerts. All Netdata alerts are attached to components: you say, I want this alert attached to all my disks, I want this alert attached to all my web servers, or to all database tables, whatever the instance of the component is. The result of the NIDL framework is that it allows Netdata to come up on its own: you install it, you don't do anything else, it auto-detects all the data sources, and you have fully functional dashboards for every single metric, no exceptions. At the same time it allows you to slice and dice the data, and correlate, from the UI, without learning a query language. That is one mission accomplished.
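To picture the NIDL idea above in code: the sketch below is only an illustration, not Netdata's internal representation. Every sample carries its node, instance, dimension and labels, which is exactly the information an automated dashboard needs in order to group, slice and filter without a query language.
    from dataclasses import dataclass, field
    from collections import defaultdict

    @dataclass
    class Sample:
        node: str        # which server produced the sample
        instance: str    # which component instance (a disk, a web server, a table...)
        dimension: str   # which aspect of it (reads, writes, requests...)
        labels: dict = field(default_factory=dict)
        value: float = 0.0

    samples = [
        Sample("web-1", "sda",  "reads",  {"device_type": "physical"}, 120.0),
        Sample("web-1", "sda",  "writes", {"device_type": "physical"},  80.0),
        Sample("web-2", "dm-0", "reads",  {"device_type": "virtual"},   40.0),
    ]

    def group_by(samples, key):
        # Aggregate values by any NIDL attribute: the point-and-click "group by"
        # control a dashboard can offer instead of a query language.
        totals = defaultdict(float)
        for s in samples:
            totals[key(s)] += s.value
        return dict(totals)

    print(group_by(samples, lambda s: s.node))                                   # per node
    print(group_by(samples, lambda s: (s.labels["device_type"], s.dimension)))   # per label + dimension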
Then the next thing was how to get rid of the query language. The query language is the biggest problem of monitoring tools. Why? First, it has a learning curve: you need to learn it, and for most people learning a query language is hard. Some can do it, but for most monitoring users the answer is no. Second, when you have a query language, the whole process of extending and enhancing the observability platform goes through a development process: a few people who know the query language have to bake the dashboards you ask for, test them, and only then can you use them. That is a big problem, because in a crisis you most likely want to explore and correlate things on the fly. It should be very simple, so that everyone is fluent. We at Netdata had an additional problem to solve, mainly because all our visualization is fully automated: how do we allow a user who sees a dashboard for the first time to grasp what the dashboard is about, what every chart and every metric is about? This was a big challenge, because for most monitoring tools a chart is just a time-series visualization. It has some settings, line chart, area chart, et cetera, but what is incorporated in it is never shown. You need to do queries by hand to pivot the data and understand, oh, this is coming from five servers.
So this is what we did in Netdata. The chart looks like a chart, like all charts; in this case it is an area chart. It has, of course, an info button with some text we have added about what the chart is about, so people can read it and get the context of the chart. But then we added these controls. The first is this purple ribbon; we call it the anomaly rate ribbon. When there are anomalies, they are visualized in this ribbon, anomalies across all the metrics of the chart. In this example the chart comes from seven nodes and 115 applications, and there are 32 different labels in there. The whole point of this NIDL ribbon is to allow people to grasp what the source is. In this case, when you click the nodes, a dropdown appears that explains the nodes: which nodes the data are coming from, the number of instances and metrics each node provides to this chart, and the volume contribution of each node. So this node contributes about 16% of the total volume of the chart; whatever we see there, 16% is from this node. You also see the anomaly rate each node contributes, and of course the raw database values, the minimum, average and maximum. The interesting part is that this is also a filter, so you can filter out some of the nodes and immediately change the chart. Exactly the same happens for applications and for labels: per label, per application, you can see the volume contribution, the anomaly rate, and the minimum, average and maximum values.
We went a step further and also added grouping functionality, which allows you to select one or more group-by keys. In this case I selected the label device_type and the dimension, which here is reads and writes, and I got reads physical, reads virtual, writes physical and writes virtual. The idea is that you can group the chart on the fly, without knowing any query language, and get the chart you want with just point and click.
The next important thing with Netdata is the info ribbon at the bottom of the chart, which may show gaps, meaning that data are missing there. Unlike most monitoring solutions, where a missing sample means nothing and is just smoothed out by connecting this point with that point, Netdata expects to collect data every second. Netdata runs with the lowest priority on a system, and this is on purpose: if a sample is missed, it means that your system is severely overloaded, because Netdata could not even get enough CPU to collect it.
So gaps are an important aspect of monitoring in the Netdata world, and we visualize them and explain where they come from. That is another mission accomplished: getting rid of the query language, and letting people work with and navigate the dashboards without any help, preparation or special skills.
Then the next is about machine learning. Most likely a lot of you have seen this: a presentation made in 2019 by Google, where the speaker said that ML cannot solve most of the problems people wanted it to solve. So it is not that ML cannot be helpful; it is that what people expect from ML is not the right thing, and we are not talking about random people, we are talking about Google developers and Google DevOps engineers. They expected machine learning to solve a certain set of problems that it of course cannot.
So what we do in Netdata with machine learning is this. We train models per metric: a model per metric every 3 hours, over 6 hours of data. In total we train 18 ML models per time series, so every time series has 18 models trained on its own past data. Netdata then detects anomalies in real time: it uses these 18 models, and only if all 18 models agree that a collected sample is anomalous is it marked as anomalous. The anomaly bit is stored in the database, so we can query for it later: we can run queries of the past for anomalies only, not the values of the samples, but the anomaly rate of the metric. And of course we calculate host-level anomaly scores, et cetera, which we will see how they work.
Then there is the scoring engine that I mentioned earlier. Netdata has a scoring engine that goes through all metrics and scores them according to an algorithm. Let's assume you see a spike or a dive on the dashboard. Instead of speculating about what could be wrong ("I see this dive in my sales; is it the web server? The database server? The storage? The network? Do I have retransmits?"), you highlight the window where you see the spike or the dive and give it to the scoring engine. The scoring engine goes through all metrics, across all your servers, for that window, and scores them by rate of change, anomaly rate, or whatever you ask. Netdata then gives you an ordered list of the things that scored higher than the others. Your aha moment, the realization that, say, the network did it, is in the results. The point of this was actually to flip the troubleshooting process.
But let's see its other uses as well. One is the Netdata dashboard itself. It has a menu where all the metrics are organized, everything: all metrics and charts appear there by default, and there is no option to hide something. There is an AR button there, and when you click it, the scoring engine gives you an anomaly rate per section of the dashboard. This allows you, for example, to immediately see that the System section has 14% and Applications has 2%, so you can see at a glance the anomaly rate per section for the visible time frame, always. If you pan to the past and click the button, it will do the same for that time frame.
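To make the "several models per metric, and all must agree" idea above concrete, here is a toy sketch. Netdata's real implementation trains k-means models on preprocessed feature vectors, so the model type, window sizes and threshold below are simplifications invented for illustration only.
    import random
    import statistics

    class WindowModel:
        # A toy "model": learns the mean and standard deviation of one training
        # window of a single metric's own history.
        def __init__(self, window):
            self.mean = statistics.fmean(window)
            self.std = statistics.pstdev(window) or 1e-9

        def is_anomalous(self, value, threshold=4.0):
            return abs(value - self.mean) / self.std > threshold

    # Per-metric history collected at the edge (synthetic, roughly stationary data).
    history = [50 + random.gauss(0, 2) for _ in range(6 * 3600)]

    # Train several toy models, each on a different slice of the metric's own past,
    # mimicking the "18 models, trained every 3 hours over 6 hours of data" scheme.
    slice_len = len(history) // 6
    models = [WindowModel(history[i * slice_len:(i + 1) * slice_len]) for i in range(6)]

    def anomaly_bit(value):
        # The sample is marked anomalous only if ALL models agree; this is what
        # keeps false positives low enough to store the bit alongside the sample.
        return all(m.is_anomalous(value) for m in models)

    print(anomaly_bit(51.0))   # normal sample  -> False
    print(anomaly_bit(500.0))  # extreme outlier -> True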
The host anomaly rate is the number of metrics on a host that are anomalous concurrently: a 10% host anomaly rate means that 10% of the total number of metrics are anomalous at the same time. We then visualize this host anomaly rate in a chart where every dimension, every line, is a node, and you can see that anomalies, even across nodes, happen in clusters. Here you see four nodes spiking concurrently; there you see one very big spike for one node, while other nodes had anomaly spikes at the same time. When you highlight a spike, Netdata gives you an ordered list of the things related to it: which metrics had the highest anomaly rate within that window. That is another mission accomplished in using AI and machine learning to help the troubleshooting process and reveal insights that would otherwise go unnoticed.
Then it was about logs. Everything we saw so far was about metrics. For logs, Netdata has a very similar, distributed approach: we rely on the systemd journal. Instead of centralizing logs to some other database server, Loki or Elasticsearch or Splunk or whatever, we keep the data in the systemd journal, at the place where they most probably already are. Once the data are there, Netdata can query them directly from that place. I actually discovered the systemd journal about a year ago and realized how good this thing is. The first thing about the systemd journal is that it is available everywhere. It is secure by design. It is unique in that it indexes all fields and all values, which gives amazing flexibility.
For logs it works like this. Either you have a plain-text file with all the logs together, and there is not much you can do with that, or you can put them in Loki. In Loki nothing is really indexed, just a few labels: you create streams by saying, if this is a and this is b and this is c and this is d, four labels for example, then this is the stream of logs for that thing. The number of streams you have significantly influences the performance, the memory footprint, et cetera. So Loki is almost the same as log files, and of course it has an amazing disk footprint. On the other side is Elastic, which indexes every word found inside all text messages. It is good and powerful, but it requires a lot of resources: this indexing is heavy and eventually needs significantly more resources than the raw logs. The systemd journal is a balance between the two: it indexes all fields and whole values, but it does not split the words, so a value is kept as it is, string or whatever. The good thing is that it has amazing ingestion performance, using almost no resources, and it can also be used to build log centralization points.
Now, the Netdata logs UI looks like the typical thing: Kibana is like this, Grafana is like this. You have the messages, and you have the different fields that have been detected. The good thing about Netdata is that, as you see even in this presentation with about 15.6 million log entries, we start sampling at 1 million.
Most other solutions, Kibana and Grafana Loki, sample just the last 5,000 entries or so to give you the percentages of each field value, whereas in Netdata we sample at least a million or more. And it is fast. People have complained in the past that systemd's journalctl is slow; we found the problem and submitted patches to systemd to make journal queries about 14 times faster. I think they should be merged by now. And we have the systemd journal explorer, a plugin of Netdata that can query the logs at the place they are.
Now, the systemd journal lacks some tooling to push logs into it. So, guys, unfortunately my audio died; the microphone died five minutes before the end while I was recording the presentation, and I had to leave immediately for a wedding on a Greek island. So here I am on a beautiful Greek island, you see the sea, very nice weather. Sorry for that; I am re-recording the last five minutes so you have audio. I was saying that the systemd journal lacks some integrations, so it is not easy to extract fields in a structured way, convert plain-text log files to structured journal fields, and send them to the systemd journal. So we created log2journal. This program tails plain log files and can extract any fields from them using regular expressions. It can also automatically parse JSON and logfmt files, and it outputs systemd's native journal format. This output is then piped to systemd-cat-native, another tool we created, which sends it in real time to a local or a remote systemd journal. Both of these tools work on any operating system, they do not require any special libraries, and they are available on FreeBSD, macOS, Linux of course, and even on Windows.
This concludes our work on making logs a lot easier and more affordable to run. The systemd journal is very performant today, especially after the patches we supplied to systemd, and it is extremely lightweight. Journal files are an open format, so you have all the tools to dump data from them, and they are also very secure; the systemd journal has been designed to be secure. I think that having all the processing happen at the edge, in a distributed way, actually eliminates the need for heavy log centralization and database servers, and makes log management a lot more affordable. The systemd journal, as I said before, supports centralization points, and the Netdata plugin is able to use the journals of centralization points to be multi-node, so logs across multiple nodes are multiplexed.
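To illustrate what the log2journal step described above does (the real tool is a C program shipped with Netdata; this is only a sketch of the idea, with a hypothetical access-log pattern), here is a conversion of one plain-text log line into systemd's journal export format, which is simply FIELD=VALUE lines with a blank line terminating each entry.
    import re
    import sys

    # Hypothetical pattern for an access-log-like line; named groups become journal fields.
    PATTERN = re.compile(
        r'(?P<CLIENT_IP>\S+) \S+ \S+ \[(?P<TIMESTAMP>[^\]]+)\] '
        r'"(?P<METHOD>\S+) (?P<URL>\S+) \S+" (?P<STATUS>\d+) (?P<BYTES>\d+)'
    )

    def to_journal_export(line: str) -> str:
        # Turn one plain-text log line into a journal-export-format entry.
        match = PATTERN.match(line)
        fields = match.groupdict() if match else {}
        fields["MESSAGE"] = line.rstrip("\n")  # always keep the raw line as MESSAGE
        # Journal export format: FIELD=VALUE lines, entry terminated by an empty line.
        return "".join(f"{key}={value}\n" for key, value in fields.items()) + "\n"

    if __name__ == "__main__":
        sample = '203.0.113.7 - - [10/Jun/2024:12:00:01 +0000] "GET /index.html HTTP/1.1" 200 512\n'
        # In a real pipeline this output would be piped to a tool that writes it to a
        # local or remote journal (Netdata ships systemd-cat-native for that purpose).
        sys.stdout.write(to_journal_export(sample))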
The next challenge is about going beyond metrics, logs and traces. We want Netdata to be a lot more than just metrics, logs and traces; we want Netdata to be a console for any kind of information. For this we created what we call functions. Functions are exposed by the Netdata plugins. The Postgres plugin, for example, may expose a function that says, hey, I can provide the currently running slow queries. Similarly, our network viewer exposes a function that visualizes all the active connections of the system, the connections of containers and of all running applications. This allows Netdata to be used as a console tool to explore any kind of information; it doesn't matter what the information is, we can have a custom visualization or whatever is required for it. The tricky part here, the challenge, was the routing. Since all functions provide live information, we had to route each request through the Netdata servers to the right server and the right plugin, run the function, get the result back, and send it to your web browser, no matter where your web browser is connected, even Netdata Cloud. This is, for example, our network connections explorer: a visualization that shows all the applications, the number of connections they have and the kind of connections they have, listening, client, outbound, inbound, to private IP address space or to the Internet, et cetera. That is another mission accomplished in creating the mechanics for any kind of plugin to expose any kind of information.
And the last part is about our monetization strategy: Netdata is open source, but we monetize it through a SaaS offering, Netdata Cloud. Netdata Cloud offers horizontal scalability, so you can have totally independent Netdata agents, but all of them appear as one uniform infrastructure at visualization time. We added role-based access control, the ability to access your observability from anywhere, and of course push notifications for alerts, with a mobile app for iOS and Android, plus a lot more customizability and configurability via Netdata Cloud. Thank you very much. That was the presentation. I hope you enjoyed it. I am very sorry for my microphone problem, and I hope I will see you again. Bye.
...

Costa Tsaousis

Founder & CEO @ Netdata



