Transcript
Hi. Today we are going to discuss observability standardization, and we are going to look at what I call the elephant that, I believe, is still in the room. So a typical observability setup looks like this.
So we have some kind of applications or
systems that are exposing metrics. Then we
have time series databases or logs
databases where we centralize all our logs and all our
metric data. And of course then we have a
tool to provide dashboards. This tool
usually queries this time
series data or logs data and creates
beautiful charts. And then we have alerting, an alerting engine that does something similar: it queries the time series data and sends notifications if something exceeds a threshold.
Now, this may sound simple, but there is hidden complexity here, and this is what that hidden complexity is. In the time series databases,
you first have to maintain the database server itself. You have to take care of high cardinality and the memory requirements, the disk I/O requirements, and you have to think about clustering and high availability.
In the logs database you have a similar thing, but on top of it you have query performance issues, so in many cases you have to maintain the indexes of the log streams that are available. You have to maintain retention, and you have the same problems of clustering, high availability, et cetera. Then on the dashboarding tool, you first have to learn the query languages of the database servers, so that you can query them and create useful dashboards with the data they hold.
These query engines are usually good at converting counters to rates, aggregating metrics together, correlating metrics with aggregation functions, pivoting them, grouping them in a way that presents them well, et cetera. So you have to learn a query language, and you also have to learn what the visualization tool can do, what kinds of visualizations are available, what kinds of customizations are available, in order to end up with a meaningful monitoring solution. A similar kind of complexity exists for alerting. So while building this,
what you need is first a deep understanding of the metrics that you have. You need to know what the database has, what labels are there, and how you should query them. In order to maintain this processing pipeline, you need to go through various configuration steps in different phases. For example, to reduce the cardinality of some metrics, you may have to relabel some of them. At some point you have to learn the query languages that we mentioned, you need to understand the tools and have a very good understanding of them. And of course you need experience: if you have never done this before, most likely you are doomed. You need to understand what a heat map is, what a histogram is, how to visualize this kind of data. You need to know a lot of best practices in order for all of this to actually be useful for people while troubleshooting.
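To make this concrete, here is a minimal Python sketch (hypothetical data, not tied to any specific monitoring product) of two of the small chores mentioned above: relabeling away a high-cardinality label, and converting a cumulative counter into a per-second rate.

    # Sketch of two routine monitoring chores. Data and label names are made up.

    def drop_label(labels, label_to_drop):
        """Relabeling step: remove a label that explodes cardinality (e.g. a request id)."""
        return {k: v for k, v in labels.items() if k != label_to_drop}

    def counter_to_rate(samples):
        """samples: list of (unix_timestamp, counter_value); returns per-second rates."""
        rates = []
        for (t0, v0), (t1, v1) in zip(samples, samples[1:]):
            if v1 < v0:            # counter reset (e.g. the process restarted)
                v0 = 0
            rates.append((t1, (v1 - v0) / (t1 - t0)))
        return rates

    labels = {"job": "nginx", "instance": "10.0.0.5:9113", "request_id": "abc-123"}
    print(drop_label(labels, "request_id"))
    print(counter_to_rate([(0, 100), (1, 160), (2, 230), (3, 10)]))

None of this is hard by itself; the point is that it is an endless stream of such details.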
If you are at scale, then you have additional problems.
So you have to go through scalability challenges: how the database servers can be scaled, how to have high availability there. You have to work on the data overload and the noise that may exist in all these dashboards and alerts. There is a cost management aspect that is important, because at some point you may need to cherry-pick what is really useful for you; otherwise the cost of the monitoring infrastructure will skyrocket. And of course you also have to check compliance and security: what kind of information is in your logs, who has access to it, and all this kind of stuff. If you're in a team,
there is an additional layer: you have to come up with some kind of monitoring policies. You have to agree on a unified monitoring framework. You need some kind of change management and shared responsibility for the dashboards, the alerts, and the data in general that live in these database servers. You have to take care of documenting everything and letting others know how the thing works, what they should do, how it needs to be changed, et cetera. And you need quite some discipline to always follow your design principles. Otherwise it usually becomes a mess.
So to understand this, I asked ChatGPT to come up with a few phrases that capture the challenges that engineers who work on a monitoring setup have in their minds every day, what they usually think about, what challenges they face. ChatGPT came up with this, and I made it a word cloud for us to see. ChatGPT rated them by frequency, by how frequently each phrase comes up, and this is what it came up with: balance scrape intervals, structure multidimensional queries, test and validate queries, monitor exporter health, time series database I/O bottlenecks, dynamic alert thresholds, regular expressions for relabeling, federation efficiency, fine-tune chart axes and scales, filter metrics and exporters. You see, it's all the details that people have to go through in order to build monitoring with such a setup. Now, I also asked ChatGPT to tell me what people who are monitoring infrastructure are usually thinking of, independently of the monitoring system. So, what should people be thinking about in general when they are responsible for the performance, the stability, and the availability of some kind of infrastructure? ChatGPT came up with this list, and this list is quite different. It says: incident response, application stability, data encryption, cost optimization, service availability, client satisfaction, operational efficiency. It's completely different. It's another level, actually. Now, this is what I think is
the gap in current monitoring tools. Most of the monitoring tools that exist today force us to think like the first list. These are the challenges we face every day: how to do this little thing, how to create a heat map query for quantiles, how to configure external labels. These are details of the internals of the monitoring system. The other list is what we need our engineers to be thinking about. That is their primary role, that is what they need to achieve, that is what they need to focus on. Now, some may think,
okay, wait a moment, we have a lot of standardization bodies that are taking care of this. Of course we do: we have, for example, OpenTelemetry, we have the CNCF, we have the W3C, et cetera. But you know what? If you actually check what all these standardization bodies do, it is the following. Do you remember this graph that we started with? They focus here. All our standardization effort is about data exchange: okay, let everyone have all the data; if you need the data, take it. But all the complexity that we have is outside this. Of course, data exchange is an enabler. We need it. It's good to have data compatibility and to be able to exchange data, because otherwise everything is a lot more complex. But I think the most important aspects of efficient monitoring are not standardized at all. And this lack of standardization is what shifts our focus: instead of improving our infrastructure, we spend most of our time in a sea of challenges around the monitoring itself, around how to achieve something with the monitoring system. Now,
I also researched what analysts say. For analysts, there is this DevOps role. The DevOps role is supposed to be the glue, supposed to fix the gaps in all of this. What analysts believe is that the DevOps engineer is a data scientist who understands software engineering and is an IT and network architect at the same time. So they understand technology as a sysadmin, software as a developer, and data science as a data scientist; it's a combination of these three. What the analysts are saying is that there is a guy or a girl, of course, who knows what the coefficient of variation is, who knows what IPC (instructions per cycle) is, and can somehow figure out if a running production application is memory bound or CPU bound, and at the same time understands the disk utilization of an NVMe or SSD disk, what IOPS is, and under which conditions such a disk can get congested. To my understanding, the whole thing is a utopia: people who actually know data science, the vast area of data science, who have huge experience in software engineering, and who also have vast experience in IT infrastructure and network architecture. Some companies, bigger companies, solve this problem with many roles; they have an army of people to take care of a few things, like the monitoring. But for small and medium-sized companies, and also for a lot of Fortune 500 companies that don't invest in this, it is a breaking point. Engineers with all this magical knowledge, experience, and skill simply do not exist.
Now, the result for all of us, the result for most companies, is something like this. Monitoring is very expensive and a constant struggle: it's never complete, never finished, never okay, never enough, while being extremely expensive. It requires a lot of skills, and for most companies the quality of the monitoring they have reflects the skills of their engineers. If they have good engineers, they have good monitoring. If the engineers are not that experienced, do not have that many skills, then the monitoring is a toy, a joke. Frequently we see, even in very big companies, that monitoring is severely under-engineered, next to zero. They have some kind of visibility into the workload, but that's it; that's where it starts and that's where it ends. And it's frequently an illusion, and I say illusion because I experienced this myself.
Before I built Netdata, I spent quite some time on monitoring. I had invested quite some money, time, and effort in monitoring, and I had built a great team. But at the end of the day, I had the impression that everything I had built, all the dashboards and all the tools and everything that had been installed, was there just to make me happy, because I could not troubleshoot anything with it. It was not useful for what I needed. I still see this today with many companies that we cooperate with: their monitoring gaps and inefficiencies lead them to wrong conclusions, increase their time to resolution, cause a lot of frustration, a lot of issues, et cetera, and of course lost money. So now
I'm going to talk to you about Netdata. Netdata is an open source tool that provides opinionated observability. The idea behind Netdata was to solve all the problems that we have seen so far. Let me tell you a few things about it. The first thing is that it was born out of a need: I needed the damn thing. I had problems, I had issues with some infrastructure and I couldn't solve them. I spent quite some time there and the problems remained. So after several months of frustration and lost money and lost effort, I decided that I should do something about it. Since the monitoring tools that exist today are not sufficient for this kind of job, we need a new monitoring tool. Initially it was out of curiosity, so I said, okay, let's see if we can build something, because I couldn't believe that all the monitoring tools have this design by accident. I was thinking, okay, they tried different ways and this is the only way that actually worked. So it was curiosity: why didn't they do monitoring in real time, why don't they ingest all the metrics, why is cardinality such a big problem, why don't monitoring systems work out of the box, why don't they come predefined with the dashboards and the like that we all have, et cetera. So I started building it, and after a couple of years I released it on GitHub. It had been on GitHub from the first day, but at that point I actually pressed the button to release it. Anyway, nothing happened; it was a funny story, nothing happened. Then one day I posted it on Reddit and boom, it skyrocketed.
The Netdata way of thinking starts from the observation that we all have a lot in common. My infrastructure, your infrastructure, their infrastructure: we are using the same components, the same Lego parts. We are using the same or similar physical and virtual hardware and similar operating systems. These are finite sets of things; the components are not infinite, only the combinations are. We can combine these Lego building blocks the way we see fit, but the building blocks themselves are pretty much packaged today. Even for custom applications, applications that we build ourselves, in most cases we use standard libraries, and these standard libraries expose their telemetry in a standard, predictable, and expected way. So even for custom applications this becomes increasingly true: as time passes, even custom applications will be almost completely packaged. They will provide a finite, predictable, and expected set of metrics that we can all consume to say whether the application is healthy or not. Of course, there will always be some custom metrics. So we have a lot
in common. The next thing is that I wanted high-fidelity monitoring. High-fidelity monitoring means everything collected every second, like the console tools. My original idea was to kill the console for troubleshooting. What most monitoring tools provide is a helicopter view: you can see that the road is congested, you can see the dive or the spike, but you cannot actually see what is happening and why. This is a helicopter view. This happens mainly because they want to ingest the minimum; all of them, including commercial providers, including companies that provide monitoring solutions as a service, do this because it's expensive to ingest all the information, so they prefer to select which information to ingest and maintain, so that you can have the helicopter view that you need. But when it comes to the actual details of what is happening and why it is happening like this, they don't have any information to help, and this is where the console tools usually come into play. So with Netdata, I wanted to kill the console for monitoring: every metric that you can find in the console should be available in the monitoring system, without exceptions, even if it is extremely rare or not commonly used. The next thing
is that I wanted all the metrics visualized. Come on, it's an Nginx, it's a Postgres database; I don't want to configure the visualization for a packaged application myself. Why would I do that? It exposes workload and errors and this and that, tables and index performance and whatever it is, and I want this visualization to come out of the box. The next thing is that I didn't want to learn a new query language, so I wanted all the required controls directly on the dashboard, to slice and dice the data the way I see fit. So in Netdata we have added a nice ribbon above every chart that allows you to filter the data, to slice the data by label, node, instance, whatever is there, and also to group them differently. It's like a cube where you can look at different aspects of the same data. And of course
we added unsupervised anomaly detection. This is an innovation we have over all the other monitoring solutions, mainly because our anomaly detection happens for all metrics, unconditionally, and it's totally unsupervised, so you don't need to provide feedback to it; this is what unsupervised means, you don't have to train it yourself. Also, it is trained at the edge, so we don't train somewhere what a good Postgres looks like and then give you that model to apply to your database. That would never work; it would be full of false positives, mainly because in monitoring the workload, the actual queries that you send to a database server, determines what the metrics will do. So it's impossible to share models. The only viable solution is to train models for each metric individually, out of the box. For alerts, what I wanted is to
have predefined alerts that attach themselves automatically: they see a network interface and attach to it, they see a disk and the disk alerts attach to it, they see hardware sensors and attach to them, they see a Postgres database or a web server and attach to them automatically. So Netdata today ships with hundreds of alerts; actually we counted a few days ago, it is 344 distinct alerts, and they are all dynamic. There are no fixed thresholds; they all use rolling windows, et cetera. They compare different parts of the metric to understand if there is a big spike, a big dive, an attack, or something wrong. And of course there are plenty of alarms that simply count errors, or conditions where even a single occurrence is an error and users need to be alerted.
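As an illustration of what a dynamic, threshold-free alert means, here is a toy Python sketch; it is not Netdata's alert engine or syntax, just the general idea of comparing a recent window against a longer rolling baseline instead of a fixed number.

    # Toy "dynamic" alert: no fixed threshold, just the latest short window
    # compared against a longer rolling baseline. Window sizes are arbitrary.
    from statistics import mean, stdev

    def dynamic_alert(values, recent=60, baseline=600, sigmas=3.0):
        """values: newest-last list of per-second samples. Returns True if the
        recent window deviates strongly (spike or dive) from the baseline."""
        base = values[-(baseline + recent):-recent]
        last = values[-recent:]
        if len(base) < 2:
            return False                      # not enough history yet
        spread = stdev(base) or 1e-9          # avoid division by zero on flat metrics
        return abs(mean(last) - mean(base)) / spread > sigmas

    history = [100.0] * 600 + [100.0] * 50 + [900.0] * 10   # sudden spike at the end
    print(dynamic_alert(history))   # True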
Now, I also wanted Netdata to be able to be installed mid-crisis. So you have a crisis and you have never installed Netdata before: you can install Netdata right there, while the thing is happening, and Netdata will be able to help you figure it out. You are not going to have the help of the anomaly detection, because it needs time to learn what is normal in order to help you, but Netdata has a lot of additional tools on top of anomaly detection that will help you identify what is wrong, correlate what is happening, find similarities in metrics, et cetera. It will also allow you to explore the systemd journal logs directly on the servers. So all this
is about removing the focus on the monitoring system internals from users, and putting that extra knowledge into the tool. For example, if you take Prometheus and Grafana, when you get them and install them, they are blank; they cannot do anything by themselves. They are just a database server and a visualization engine, a great database server and a great visualization engine, but there is no use for them if you don't go through the process of configuring them, setting them up, pushing metrics to them, et cetera. In Netdata the story is quite different. Netdata already knows the metrics when we ship it: it knows how to collect CPU metrics, memory metrics, container metrics, database metrics. It comes with preconfigured alerts, and it knows how to visualize the metrics, correlate them, and come up with meaningful visualizations. So the idea is that Netdata is monitoring out of the box, ready to be used. The internals of what is happening there and why it is like this, how to convert a counter to a rate, all this kind of stuff is already baked into the tool for each metric individually, including the monitoring of each component individually.
So in Netdata, when you have a disk, it's an object. It has its metrics attached to it and alerts attached to it. We monitor infrastructure bottom-up: we don't take the helicopter view, we dive as deep as we can, we monitor and set alerts at that level, and we build up from there. Netdata also has a lot of innovations in deployment, for example. Netdata is open source software, as I said; you can use it for free.
The software you are going to get is monitoring in a box. The moment you install the Netdata agent, keep in mind it is not the same as an exporter in Prometheus. A Netdata agent is like an exporter, plus the time series database (what Prometheus is), plus the visualization engine, plus the alert manager, plus machine learning and the like, including logs, everything combined into one application. Now, this application is modular. For each installation you do, you can of course use it by itself: you can install it on one server and use it on that one server. It has an API, it has a dashboard, you can see the dashboard there, you can explore the metrics, et cetera. But you can also combine them. So you have a number of servers, you install Netdata on all of them, and then you can use our SaaS offering to have a combined view of all of them.
If you don't want that, you can use the same software, the Netdata agent, as a parent. This parent receives all the metrics in real time from all the servers: all the other servers stream to it in real time. This server can now take over all the functions of the others, so it can run the alerting for them, the anomaly detection for them, the visualization for them, everything required. This allows you to offload the other servers, because these extra features take some CPU and some disk space, so you can, if you want, offload the other servers and use only the parent. Of course this is infinitely scalable. You can have different data centers, or many places all over the world where you have infrastructure, put a parent in each of them, and then use Netdata Cloud to aggregate the parents, or you can have a grandparent, or a grand-grand-grand-grandparent. It is infinite; you can scale it as you see fit, as your infrastructure grows.
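Here is a rough Python sketch of the streaming hierarchy described above; the class and node names are made up for illustration and are not Netdata's internals. The point is only that every child streams each collected sample upstream, so each level holds a real-time copy of everything below it.

    # Illustrative topology only: children stream to a parent, parents can
    # stream to a grandparent, and every level keeps its own copy.

    class Node:
        def __init__(self, name, parent=None):
            self.name, self.parent, self.store = name, parent, {}

        def collect(self, metric, value):
            self.store.setdefault(metric, []).append(value)   # local retention
            if self.parent:                                    # stream upstream in real time
                self.parent.collect(f"{self.name}.{metric}", value)

    grandparent = Node("dc-all")
    parent = Node("dc-1", parent=grandparent)
    web01 = Node("web01", parent=parent)
    web01.collect("cpu.user", 12.5)
    print(parent.store)        # {'web01.cpu.user': [12.5]}
    print(grandparent.store)   # {'dc-1.web01.cpu.user': [12.5]}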
The whole point of all this is that Netdata is a lot faster compared, for example, to Prometheus, and requires a lot fewer resources. We stress-tested Netdata against Prometheus: we set up 500 servers with 40,000 containers, with about 2.7 million metrics collected every second, and we configured Prometheus to collect all these metrics in real time, also per second. Then we measured the resources required by Netdata and by Prometheus, and the result is this: 35% less CPU utilization, so about one third less CPU, half the memory of Prometheus, about 12% less bandwidth, and 98% less disk I/O. This is because we don't need a WAL (write-ahead log), so Netdata doesn't write to disk all the time; we rely on streaming and replication for high availability. Each of the parents can be a cluster; it's very easy, you just set them up in a loop. You can have three parents in a cluster, four parents in a cluster, all of them in a loop. The idea is that instead of committing data to disk and trying to make each individual server able to sustain failures, we rely on replication and streaming to make sure we will not lose data in case of failure. So, 98% less disk I/O, and also on retention:
you see that for Netdata we say 75% less storage footprint, but actually it is a lot more. The key characteristic of Netdata here is that it can downsample data as time passes. It has tiers, up to five of them; we ship it with three, but you can configure up to five, and data is downsampled from tier to tier.
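As a rough illustration of the tiering idea (the window sizes below are arbitrary and this is not Netdata's actual storage format), a sketch in Python:

    # Tiered downsampling: tier 0 keeps raw per-second samples, higher tiers
    # keep min/avg/max over larger windows, so older data costs less space.

    def downsample(samples, every):
        """samples: list of floats (one per second). Returns one (min, avg, max)
        triple per `every`-second window."""
        out = []
        for i in range(0, len(samples), every):
            window = samples[i:i + every]
            out.append((min(window), sum(window) / len(window), max(window)))
        return out

    tier0 = [float(x % 90) for x in range(3600)]            # one hour of per-second data
    tier1 = downsample(tier0, 60)                           # per-minute min/avg/max
    tier2 = downsample([avg for _, avg, _ in tier1], 60)    # per-hour, built from tier 1
    print(len(tier0), len(tier1), len(tier2))               # 3600 60 1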
Now, Netdata is also one of the most energy-efficient platforms out there. Last month, a university research team published a study (we didn't know about it; we only saw it when it was published) saying that Netdata excels in energy efficiency, that it is the most energy-efficient tool, and that it excels in CPU usage, RAM usage, and execution time when monitoring Docker containers. The whole study was about Docker containers. Let's move on to
AI, to artificial intelligence, and what is there, what happens there. In 2019, Todd Underwood from Google gave a talk saying that, actually, all the ML ideas that Google engineers had were bad; they couldn't deliver the expected outcome. The engineers set some goals, wrote them down, and tried to achieve them, but it was impossible. So all the ML ideas are bad, and we should also feel bad, as Todd says. Now, in Netdata, we do have ML.
The first goal of machine learning in Netdata is to learn the patterns of the metrics. That's the first goal: can we understand the pattern of each metric so that the next time we collect a sample, we can know reliably whether the collected sample is an outlier or not? And we wanted this unsupervised, so we didn't want to provide any feedback to the training. We train at the edge, or as close to the edge as possible; you can train at the parents if you want. The whole point was to see whether we can have a way to detect if a just-collected sample is anomalous or not, and I think we have achieved that.
Netdata trains 18 models per metric; it learns the behavior of each metric individually over the last 57 hours, roughly two and a half days. It detects anomalies in real time, and all 18 machine learning models need to agree. This is how we remove the noise from ML, because ML is noisy, it has false positives. So all 18 models have to agree that a sample is an outlier in order for us to say that it is an outlier. And we store the anomaly rate together with the collected data in the time series, so the anomaly rate is like an additional time series for every other time series; it's like having every time series twice, once for the values and once for whether they were anomalous or not. We also calculate a host-level anomaly score; we will see how it is used.
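To illustrate the consensus idea in a hedged way (Netdata's real models are trained per metric at the edge; the detectors below are deliberately simplistic stand-ins, not Netdata's algorithm), here is a toy Python sketch where several models trained on different slices of history must all agree before a sample is flagged, and the anomaly bit is stored next to the value:

    # Toy consensus detector: several "models", each trained on a different
    # slice of recent history, must ALL flag a sample before it is marked
    # anomalous; the anomaly bit is kept alongside the stored value.
    from statistics import mean, stdev

    def make_detector(history):
        m, s = mean(history), (stdev(history) or 1e-9)
        return lambda x: abs(x - m) / s > 3            # "is this sample an outlier?"

    def train_models(history, n_models=18, window=3600):
        # each model sees a different, overlapping slice of the recent past
        return [make_detector(history[-(window + i * 600):][:window]) for i in range(n_models)]

    history = [50.0 + (i % 7) for i in range(4 * 3600)]   # ~4 hours of fake samples
    models = train_models(history)

    def ingest(value):
        anomalous = all(model(value) for model in models)  # unanimous vote removes noise
        return (value, anomalous)                          # value stored with its anomaly bit

    print(ingest(53.0))    # (53.0, False) - normal
    print(ingest(500.0))   # (500.0, True)  - all models agree it's an outlier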
Now, one of the innovations we added in Netdata is a scoring engine. The scoring engine tries to score the metrics: given some parameters, it tries to understand which metrics, out of the thousands or millions of metrics available, are the most relevant to the query we run. It can score based on two windows, to find the rate of change. It can score based on the anomaly rate, so which metrics were the most anomalous from this time to that time. We also have metric correlations, which try to correlate metrics together, by similarity or by volume, to understand how the metrics relate to each other.
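A small Python sketch of the scoring idea, with hypothetical metric names and data (this is not Netdata's scoring API): score every metric by how much it changed between a baseline window and a highlighted window, then sort so the most relevant metrics come first.

    # Two-window scoring: rank metrics by relative change between windows.

    def score_metrics(metrics, baseline, highlight):
        """metrics: {name: list_of_samples}; baseline/highlight: (start, end) indexes."""
        scores = {}
        for name, samples in metrics.items():
            b = samples[baseline[0]:baseline[1]]
            h = samples[highlight[0]:highlight[1]]
            base_avg = sum(b) / len(b) or 1e-9
            scores[name] = abs(sum(h) / len(h) - base_avg) / abs(base_avg)  # relative change
        return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

    metrics = {
        "disk.io":  [10] * 60 + [10] * 10,     # unchanged
        "cpu.user": [20] * 60 + [95] * 10,     # big change -> should rank first
        "net.recv": [5]  * 60 + [7]  * 10,     # small change
    }
    print(score_metrics(metrics, baseline=(0, 60), highlight=(60, 70)))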
Now let's see how this appears in a Netdata dashboard. A Netdata dashboard is like this: a single dashboard, one chart below the other, with infinite scrolling over hundreds of charts. Of course, there is a menu here that groups everything into sections, so that you can quickly jump from section to section and see the charts. Now, this dashboard is fully automated; you don't have to do anything for it. Of course, if you want to cherry-pick charts, build custom dashboards, and change the visualizations, all of that is there. But the whole point is that we wanted every metric to be visualized in a fully automated way. Each chart,
as we will see, has a number of controls; the charts you see in Netdata are a little different from the others. The first thing is that when you are in this dashboard, which in this case has, I think it says 590, almost 600 charts (it says so here), you can press this AR button, shown here zoomed in. What Netdata will do is fill the sections with their anomaly rates. For the time frame selected in the date-time picker, it will score all the metrics and figure out the anomaly rate for that duration for each chart, and then for each section. This allows you to quickly spot problems. If you have a problem and you don't know what it is, you have a spike or a dive on a web server, something is wrong and you don't even know what is wrong, you can just go there and hit the AR button, and Netdata will tell you where the anomalies are. You will immediately be able to see that, oh, my database server, my storage layer, or an application is doing something, has crashed, or something. Now, this is
a Netdata chart, and a Netdata chart looks like all the charts out there, but not quite. The first thing you will see is the anomaly ribbon. The anomaly ribbon is this purple band at the top, which indicates the anomaly rate of all the metrics that are included in the chart. In this case, the metrics come from seven nodes and 115 applications, and there are 33 labels. The combined anomaly rate of all of them together is visualized here.
Of course, you can also get individual anomaly rates, like this. You click on the nodes and this modal comes up. It lists the seven nodes one by one and says how many instances, how many components there are: if the chart is about disks, these are disks; if it is about applications, these are applications, processes in this case. So how many applications are there, how many metrics are available, and what is the relative volume, because in the chart you see, some of them contribute more than others. There is sorting by volume and sorting by anomaly rate. If there are alerts related to, in this case, the context switches of applications, you would see them here. And you can see the minimum, average, and maximum value per node for this data. The same happens for applications, for dimensions, and for labels: a similar modal shows all applications in a list where you can see the volume, the anomaly rate, the minimum, average, and maximum, et cetera. You can also filter from here: if you want to include or exclude something, you can just include or exclude it here, and it will change the chart automatically.
Similarly, you can use the group-by feature to change how the data are grouped. You can group by node, by application, by dimension (in this case read or write, or whatever it is). You can also group by any label that is there, and actually by combinations: two labels, three labels, nodes and labels. So you can do all the combinations, group the data, and see its different aspects from this menu.
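Here is a tiny Python sketch of the group-by idea, with made-up sample data and label names: the same pile of labeled samples can be re-grouped by node, by dimension, or by any combination of keys.

    # Re-grouping labeled samples, the "different aspects of the cube" idea.
    from collections import defaultdict

    samples = [
        {"node": "web01", "dimension": "read",  "disk": "nvme0n1", "value": 120},
        {"node": "web01", "dimension": "write", "disk": "nvme0n1", "value": 80},
        {"node": "web02", "dimension": "read",  "disk": "sda",     "value": 40},
        {"node": "web02", "dimension": "write", "disk": "sda",     "value": 30},
    ]

    def group_by(samples, *keys):
        """Aggregate values over every unique combination of the given keys."""
        out = defaultdict(float)
        for s in samples:
            out[tuple(s[k] for k in keys)] += s["value"]
        return dict(out)

    print(group_by(samples, "node"))                 # per node
    print(group_by(samples, "dimension"))            # read vs write
    print(group_by(samples, "node", "dimension"))    # combination of both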
Now, this menu is standard on every chart. Then we have the Anomaly Advisor. The Anomaly Advisor is a tool that we developed in order to find the needle in the haystack. So, you have a problem.
There is an anomaly. We saw the AR button that you press to see the anomaly rate of each section, but how can I find the individual metrics, the most anomalous metrics, for a given time frame, current or past? For this we use the host anomaly rate. The host anomaly rate looks like this: it is a chart, you can see it is a percentage, and it shows, for each host (these are the nodes), the percentage of the node's metrics that are concurrently anomalous. You can see that when you have anomalies, they are widespread, so a lot of metrics on that node become anomalous at the same time.
So if there is stress on the database server, the disk will have increased I/O, the CPU will be a lot busier, and the network interface will probably carry a lot more bandwidth. All this, combined with all the individual metrics that we check, like the containers, the page faults, how much memory processes allocate, the context switches, even how interrupts are affected in the system, comes together and is aggregated here, so you see a huge spike when something anomalous happens. Now, when something anomalous happens
like this, what you can do is highlight this area; there is a toolbox here for highlighting it. Immediately, Netdata will give you a list of all the metrics sorted by relevance for that highlighted window. For that window, it goes through all the metrics, no matter how many there are, scores them according to their anomaly rate, sorts them, and gives you the list in sorted order. The whole point of this is that within the top 10 or 20 items you should have your aha moment: "oh, someone SSHed to this server", or "oh, we have TCP resets, something is broken somewhere else and this doesn't play well". So the whole point is to have your aha moment within the top few items.
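A minimal Python sketch of the host anomaly rate idea, with made-up anomaly bits (in the talk these come from the per-metric models): the percentage of a node's metrics that are anomalous at the same moment.

    # Host anomaly rate: how many of a node's metrics are anomalous right now.

    def host_anomaly_rate(anomaly_bits_by_metric, t):
        """anomaly_bits_by_metric: {metric: [0/1 per second]}; returns % anomalous at time t."""
        flags = [bits[t] for bits in anomaly_bits_by_metric.values()]
        return 100.0 * sum(flags) / len(flags)

    bits = {
        "cpu.user":       [0, 0, 1, 1],
        "disk.io":        [0, 0, 1, 1],
        "net.recv":       [0, 0, 0, 1],
        "postgres.locks": [0, 0, 1, 1],
    }
    for t in range(4):
        print(t, host_anomaly_rate(bits, t), "%")   # spikes when anomalies are widespread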
Now, the way I see it, the way I understand it, is that we really do have a lot in common. Our infrastructures, under the hood, are quite similar. We all deserve real-time, high-fidelity monitoring, and solutions like Netdata keep up to this promise. We spread Netdata out like this, in a distributed fashion, mainly to avoid the bottlenecks that all other monitoring solutions face, so Netdata should scale better than anything else. Netdata Cloud, for example, our SaaS offering, today works at, I don't know, less than 1% of its capacity, and it has 100,000 connected nodes. And it's just a Kubernetes cluster, not much, a small one actually, a few nodes. The idea is that we want monitoring to be high resolution, high fidelity, and real time. We open sourced everything. So Netdata
is a gift to the world, and we open sourced even the advanced machine learning techniques; everything we do, all the innovations in observability, is baked into the open source agent. And even when you view one agent, or a parent with 2 million metrics, the dashboard is the same. We don't change dashboards; it's one thing, the same as in the cloud. Netdata Cloud has exactly the same dashboard as the agent. And monitoring, to our understanding, should be simple: easy to use, easy to maintain. Netdata is maintenance-free; it doesn't require anything. Of course there are a few things to learn: how the tool behaves, how to do streaming, how to build a parent. You need to learn a few things, but there is nothing to maintain, no indexes; most of it is zero-configuration and works out of the box.
At the same time, we believe that the monitoring tool should be a powerful tool in the hands of experts, but also a strong educational tool for newcomers. People should be using tools like Netdata to learn, to troubleshoot, to understand the infrastructure, to feel the pulse of the infrastructure. And at the same time, we are trying to optimize it all over the place, because we want Netdata to be a thin layer compared to the infrastructure it monitors; it should never become huge. This is why we want it to spread over the infrastructure and utilize the resources that are already available. Netdata on a single node requires just about 5% CPU utilization of a single core and about 100 megabytes of RAM. We want it to be extremely thin compared to the whole thing, so that it can be affordable for everyone. Thank you very much for watching, try Netdata, and see you online.