Transcript
Welcome. Today we are going to talk about machine learning in observability and what we did in Netdata. Netdata is a monitoring tool that was born out of a need, out of frustration. I got frustrated by the monitoring solutions that existed a few years ago, and I said, okay, what is wrong? Why does monitoring have these limitations? Why is it so much of a problem to have high resolution, high fidelity monitoring across the board? And I tried to redesign, to rethink, let's say, how monitoring systems
work. So traditionally, a monitoring system looks
like this. So you have some applications or systems
exposing metrics and logs. You push these
metrics and logs to some database servers.
So a time series database like Prometheus, or a monitoring provider like Datadog, Dynatrace, New Relic, et cetera, or for logs, Elastic or Loki, or Splunk and the other commercial providers, et cetera. So you push all your metrics and logs to these databases, and then you use the tools that these databases have available in order to create dashboards and alerts, to explore the logs and the metrics, et cetera.
This has some issues. The biggest issues come from the fact that as you push data to them, not just metrics but also logs, they become a lot more expensive. So you have to be very careful: you have to carefully select which metrics you centralize, how you collect and visualize them, how frequently you collect them, and to be very careful about the log streams that you push to them, the fields that you index, et cetera. And then, of course, in order to have a meaningful monitoring solution, you have to learn query languages. You have to build dashboards metric by metric, and alerts metric by metric, et cetera. So the whole process, first of all, requires skills. You have to know what you are doing. You have to have experience in doing this: knowing what you need to collect, how frequently you have to collect it, knowing what each metric means. Because at the end of the day, you need to provide dashboards and alerts, so you need to have an understanding of the metrics, the logs, et cetera, that you have available. It has a lot of moving parts: integrations and other stuff to install and maintain, database servers, visualizers and the like. And what happens for most companies is that the quality of the monitoring they get reflects the skills of the engineers they have. So if
you have good engineers that have a deep understanding of what they are doing and experience in what they do, you are going to have good monitoring. But if your engineers
are not that experienced or they don't have much time to spend on monitoring,
then most likely your monitoring will be childish,
will be primitive, it will not be
able to help you when you need it. And of course the whole process of working with monitoring follows a development lifecycle. You have to gather the requirements, design the thing, develop it, test it, and then consider it production quality. What I tried with Netdata is to scrap all of that. I said, oh come on, monitoring cannot work like this. Can we find another way to do it? So I created this application.
It's an open source application that we install everywhere.
This application, Netdata,
has all the moving parts inside it. So it has
800 integrations to collect data from and actually
auto discovers everything. So you don't need to go and configure
everything. And it tries to collect as many metrics as possible. The second is that it collects everything per second. Of course there are metrics that are not per second, but this is because the data source does not expose this kind of granularity: the data source updates the metrics it exposes only every five or ten seconds. Otherwise Netdata will get everything per second, and it will get as many metrics as it can.
It has a database, a time series database
in it. So there is no separate application. It's just a library
inside Netdata that stores metrics in files. It has a health engine that checks metrics against common issues that we know exist for your applications and your systems. And it learns the behavior of metrics; this is what we're going to discuss, machine learning and how it learns. And of course it provides all the APIs to query these metrics and these logs and to visualize them. It also has everything needed to be a nice citizen in the observability ecosystem.
It has integrations to push metrics, even to Prometheus, even to
other monitoring solutions, and also to stream metrics between Netdata servers. So it allows you to create centralization points within your infrastructure. So if
we look at how this works behind the scenes, it may seem complicated, but it's not that much. So you have a local Netdata running on a Linux system, for example.
It will discover all metrics using all the plugins
that it has. It will collect data from these
sources. This is with zero configuration. This is just the default behavior.
This is what it does. It will automatically
detect anomalies and store everything in
its own time series database. Once the data are stored in the
time series database, it provides all these
features. It learns from the metrics in order to feed the trained models into the anomaly detection. It checks the metrics for
common issues,
congestions, errors and the likes. It can
score the metrics, so it can use different
algorithms to find the needle in the haystack
when you need it. It can query all the metrics
and provide dashboards out of the box.
And Netdata actually visualizes everything by itself.
So every metric that is collected is also visualized,
correlated and visualized in a meaningful way.
It can export the metrics to third party time series databases
and the like. And it can also stream metrics to other Netdata servers, so metrics can flow from one Netdata to another. So you can create metrics
centralization points on demand. You don't need to centralize everything
on one node or across your infrastructure.
You can have as many centralization points as required across
your infrastructure. This provides efficiency, mainly cost efficiency, because there are no egress costs, for example.
But it also allows you to use Netdata in cases where you have ephemeral servers, for example. Say you have a Kubernetes cluster where nodes come up and go down all the time. Where is my data? If the data is on those servers and a server is offline, then where is my data? So you can have a parent, a centralization point where your data is aggregated and permanently available, even if the server where the data was collected is gone.
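To make this concrete, here is a minimal sketch of what a parent/child streaming setup typically looks like in Netdata's stream.conf; the address and API key are placeholders, so treat it as illustrative and check the documentation for your version.

```
# On a child (a monitored node), /etc/netdata/stream.conf -- values are placeholders:
[stream]
    enabled = yes
    destination = 203.0.113.10:19999                  # the parent's address and port
    api key = 11111111-2222-3333-4444-555555555555    # any UUID shared with the parent

# On the parent (the centralization point), the same file accepts that key:
[11111111-2222-3333-4444-555555555555]
    enabled = yes
```

The child keeps its local database and dashboard; it simply streams a copy of everything it collects to the parent, which is why you can add as many centralization points as your setup needs.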
A similar methodology exists for logs. For logs, we rely
on systemd-journal. systemd-journal is something we all use; even if we don't know it, systemd-journal is there inside our systems. And systemd-journal is amazing for logs. Why? Because it's the opposite of what all the other log solutions do. For all the log database servers, cardinality matters: the number of fields and the number of values those fields can have is important, and the more you have, the slower it gets and the more resources are required, more memory, et cetera. But for systemd-journal, the cardinality of the logs is totally irrelevant. systemd-journal is designed to index all fields and all values, even if each log line has a different set of fields and a different set of values. So it doesn't care about cardinality at all. It has been designed first to be secure; it has sealing and tamper detection and a lot of features to improve security. And at the same time, it is designed to scale independently of the number of fields. The only drawback, of course, if you push huge cardinality to systemd-journal, is the disk footprint, but not CPU, not memory.
So it's there, it's inside our systems. So what we do is this. The first thing is that Netdata can use journal files: it can query journal files without storing or moving the logs to another database server. So we don't have a logs database server; we rely on systemd-journal. That's the first thing. The second is that if you have text files and you want to push them to systemd-journal, we provide log2journal, a command line tool that you can configure to extract structured information from text log files and push them to systemd-journal.
systemd-journal itself has the ability to create multiple centralization points across the infrastructure, much like Netdata. So while Netdata does this with streaming, to push metrics from one Netdata agent to another, systemd-journal has the same functionality: it provides systemd-journal-upload, which pushes logs to another journald, and it provides systemd-journal-remote, which ingests these logs and stores them locally.
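As an illustration, and assuming the default port, forwarding a machine's journal to a central receiver usually amounts to something like the following; treat the exact options and unit names as a sketch and consult the systemd documentation for your distribution.

```
# On the machine that ships its logs: /etc/systemd/journal-upload.conf
[Upload]
URL=http://logs-parent.example.com:19532    # the systemd-journal-remote receiver (placeholder host)
# ...then: systemctl enable --now systemd-journal-upload.service

# On the receiving machine (the log centralization point):
#   systemctl enable --now systemd-journal-remote.socket
# systemd-journal-remote listens on port 19532 by default and stores what it receives locally.
```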
If you put Netdata on such a parent, let's say on a log centralization point, Netdata will automatically pick up all your logs. So the
idea with this setup is that you
install netdata everywhere on all your servers.
If you want to centralize, if you have ephemeral nodes, et cetera, you have the methodology to create centralization points, and not just one: you can configure as many centralization points as is optimal for your setup in terms of cost or complexity, or none if you don't require any.
And then the whole infrastructure becomes one.
How does it become one? It becomes one with the help of our SaaS offering; of course, it has a free tier too. So you install Netdata everywhere, and all of these are independent servers, but then Netdata Cloud can provide dashboards and alerts for all of them, for metrics and logs. And if you don't want to use the SaaS offering, you can do the same with a Netdata parent. With the same software that is installed on your servers, you can build a centralization point: you can centralize metrics and logs there, and this thing will of course do all the ML stuff and whatever else is needed.
And from this point you can have fully automated
multi node dashboards and fully automated alerts
for the whole infrastructure. If your setup is more complex,
you can do it like this. So in this setup
there are different data centers or cloud providers.
So this is a hybrid setup in this case or a
multi cloud setup. Again, you have, if you want, multiple parents all over the place, and then you can use Netdata Cloud to integrate the totally independent parents. If you don't want to use Netdata Cloud, again you can use a Netdata grandparent, but this time the grandparent needs to centralize everything.
Now what this setup
provides is the following. The first thing is that we managed to completely decouple cardinality and granularity from the economics of monitoring, of observability. So you can have as many metrics as you want. Netdata is about having all the metrics available. If a metric is available, if there is a data source that exposes a metric, the standard policy we have is: grab it, bring it in, store it, analyze it, learn about it, attach alerts to it, et cetera. So all metrics in full resolution, everything is per second, for all applications, for all components, for all systems, and even the visualization is per second. So the data collection to visualization latency, the time required from data collection to visualization, which is a problem in most monitoring solutions: in Netdata, you hit enter on a terminal to make a change and boom, it's immediately on the dashboard. It's less than a second. The second is that
all metrics are visualized. So you don't need to do
anything, you don't need to visualize metrics yourself.
Everything is visualized, everything is correlated. So the moment
we create plugins for Netdata,
we attach to them all the metadata required in
order for the fully automated dashboard and visualization
to work out of the box for you. The next is, and we're going to see this in a while, that our visualization is quite powerful. So you don't need to learn a query language. You can slice and dice the data, any data set actually, on Netdata charts with just point and click. And actually Netdata is the only tool that is totally transparent about where the data is coming from and whether there are missed samples somewhere.
So all these work out of the box for you,
including alerts. So in Netdata, when we build alerts, we create alert templates. We say, for example: attach these alerts to all network interfaces, attach these alerts to all disk devices, to all mount points, attach these alerts to all NGINX servers or all Postgres servers. So we create templates of alerts that are automatically attached to your instances, to your data. And of course we don't use fixed thresholds. Everything is about rolling windows and statistical analysis and the like, in order to figure out whether we should trigger an alert or not.
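For illustration, a Netdata alert template (a health.d/*.conf snippet) looks roughly like the sketch below; the chart it attaches to, the thresholds and the durations are placeholders, and the stock alerts that ship with Netdata are the real reference.

```
# Sketch of a Netdata alert template; chart name, thresholds and durations are illustrative.
 template: disk_backlog_sketch
       on: disk.backlog                   # automatically attached to every disk device
   lookup: average -10m unaligned         # a rolling 10-minute window, not a single sample
    units: ms
    every: 1m
     warn: $this > (($status >= $WARNING) ? (1000) : (5000))   # hysteresis, not one fixed line
    delay: down 15m multiplier 1.5 max 1h
     info: average disk backlog over the last 10 minutes
```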
Some people may think, okay, since this is an application that
we should install everywhere on all our servers,
and it has a database in it and machine learning in it,
then it must be heavy. No, it's not.
Actually, it's lighter than everything.
Here we have a comparison with Prometheus as a centralization point. And you can see that we tested it with 2.7 million metrics, 2.7 million time series, everything collected per second, so 2.7 million samples per second, from 40,000 containers on 500 servers. And you see that Netdata used one third less CPU compared to Prometheus, half the memory, 10% less bandwidth, and almost no disk I/O compared to Prometheus. This means that Netdata, when it writes data to disk, writes it in the right place: it compresses everything, writes in batches, in small increments, and puts everything in the right place in one go. And at the same time,
it has an amazing storage footprint. So the
average sample for the high resolution tier,
for the per-second metrics, is 0.6 bytes per sample. So every value that we collect needs just 0.6 bytes, less than a byte, a little bit above half a byte, on disk. Of course, this depends on compression, et cetera.
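As a back-of-the-envelope illustration of what that footprint means in practice (assuming, hypothetically, about 3,000 per-second metrics on one node):

```python
# Back-of-the-envelope disk usage for the high-resolution tier (all numbers illustrative).
metrics = 3000                  # hypothetical number of per-second metrics on one node
bytes_per_sample = 0.6          # the average on-disk footprint quoted above
per_day = metrics * 86_400 * bytes_per_sample
print(f"{per_day / 1024 / 1024:.0f} MiB/day")   # roughly 148 MiB per day of per-second data
```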
And what I didn't tell you is that on our blog we have a comparison with all the other agents, the ones Dynatrace has, Datadog has, New Relic has, et cetera, a comparison on resources: what resources each agent requires from you when it is installed. And Netdata is among the lightest.
So it's written in C, and the core is highly optimized. With ML, with machine learning enabled, it's the lightest across all the agents. So this means that, let's say, you can now build with Netdata a distributed pipeline of all your metrics and logs, without all the problems that you get from centralizing metrics and logs. So you can have infinite scalability, well, virtually infinite scalability, let's not be arrogant, and at the same time have high fidelity monitoring out of the box. You don't need to know anything in advance; you can install Netdata mid-crisis. You have a problem and you don't have monitoring in place? Install Netdata and it will tell you what is wrong.
So let's move on to AI. In 2019, Todd Underwood from Google gave a talk about this. What Google did is that they gathered several engineers, SREs, DevOps and the like, and asked them what they expect machine learning to do for them. And it turned out that none of their ideas worked. Why didn't it work? Because when people hear machine learning, the expectations they have are a little bit different from what machine learning can actually do. So let's see, let's first understand how machine learning works. In machine learning, you train a model based on sample data. So you
you train model based on sample data. So you
give some samples to it, some old data to it,
and it trains a model. Now, the idea is that
if you give new data to it, it should detect
if the new data are aligned with the patterns you
saw in the past or if they are outliers.
If they are outliers, then you have to train it more in order
to learn the new patterns and repeat the process until
you have the right model. What most people believe is that machine learning models can be shared. So assume that you have a database server. You can train a machine learning model on one database server and apply the trained model to another, to detect outliers and anomalies there. The truth is that you cannot.
So let's assume that we have two database servers, A and B.
They run on identical hardware, they have the same operating
system, they run the same application, a database server,
postgres, same version. They have exactly the
same data, so they are identical,
both of them. Can we train a model on
A and apply this model
on B? Will it be reliable? Most likely not.
Why not? Because the trained model has the workload incorporated into it. So if the workload on B is slightly different, if it runs some statistical queries, some reports that A doesn't, or if there is a load balancer or clustering software that spreads the load a little bit unevenly, not completely equally, between the two, then the machine learning model that was trained on A will not work on B. It will give false positives. So what can we do, if this is the case? What can we do?
Let's understand the following. The first is that machine learning
is the simplest solution for
learning the behavior of metrics. Given enough data, enough samples, it can learn the behavior of the metrics. So if you grab a new sample and give it to it, it can tell you whether this sample that you just collected is an outlier or not. This is what anomaly detection is. You train the model, you collect a new sample, you check it against the model, and you get true or false, whether it is an anomaly or not.
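To make this train-and-check loop concrete, here is a minimal Python sketch in the same spirit, using k-means and the distance to the nearest cluster center as the outlier test. The feature construction, the number of clusters, the window sizes and the threshold are illustrative choices, not Netdata's exact implementation.

```python
import numpy as np
from sklearn.cluster import KMeans

LAG = 5  # each feature vector is a sample plus its 5 predecessors (illustrative choice)

def lag_vectors(samples, lag=LAG):
    """Turn a 1-D series of collected samples into overlapping lag vectors."""
    return np.array([samples[i - lag:i + 1] for i in range(lag, len(samples))])

def train(samples, k=2):
    """Train a k-means model on past samples; the cluster centers summarize 'normal'."""
    return KMeans(n_clusters=k, n_init=10).fit(lag_vectors(samples))

def anomaly_bit(model, recent_samples, threshold):
    """True if the newest sample (with its recent history) is far from every cluster center."""
    vec = np.asarray(recent_samples[-(LAG + 1):], dtype=float).reshape(1, -1)
    distance = np.min(np.linalg.norm(model.cluster_centers_ - vec, axis=1))
    return distance > threshold

# Train on a smooth pattern, pick a threshold from the training distances, then test a spike.
history = np.sin(np.linspace(0, 20, 2000)) + np.random.normal(0, 0.05, 2000)
model = train(history)
dists = np.min(np.linalg.norm(
    lag_vectors(history)[:, None, :] - model.cluster_centers_[None, :, :], axis=2), axis=1)
threshold = np.percentile(dists, 99)                 # "unusually far" relative to the training data
print(anomaly_bit(model, list(history[-LAG:]) + [5.0], threshold))   # a big spike -> likely True
```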
Now, the whole point of this is how reliable it is, how accurate it is. And it's not that accurate. If you train just one machine learning model and you check the samples you collect against it, it has some noise. So by itself, a single anomaly should not be something to wake you up at 3:00 a.m., because it will happen.
Of course, you can reduce the noise by training multiple machine learning models. So you train multiple models, and then when you collect a sample, you check it against all of them. If all of them agree that this sample is an anomaly, then you can say, okay, this is an anomaly. Still, there are false positives. Still, you should not wake up at 3:00 a.m.
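Continuing the previous sketch (it reuses train, anomaly_bit, lag_vectors and LAG from it), here is a hedged illustration of that consensus idea: train several models on different slices of recent history and only raise the anomaly bit when every one of them agrees. The number of models and the window sizes are arbitrary here; Netdata's own defaults differ.

```python
def train_many(samples, n_models=6, window=1000):
    """Train several (model, threshold) pairs, each on a different trailing slice of history."""
    pairs = []
    for m in range(n_models):
        end = len(samples) - m * (window // 2)           # overlapping windows, newest first
        if end <= LAG + 1:
            break                                        # ran out of history
        chunk = np.asarray(samples[max(0, end - window):end], dtype=float)
        model = train(chunk)
        dists = np.min(np.linalg.norm(
            lag_vectors(chunk)[:, None, :] - model.cluster_centers_[None, :, :], axis=2), axis=1)
        pairs.append((model, np.percentile(dists, 99)))  # per-model "unusually far" cut-off
    return pairs

def consensus_anomaly_bit(pairs, recent_samples):
    """Raise the anomaly bit only if *every* trained model agrees the newest sample is an outlier."""
    return bool(pairs) and all(anomaly_bit(model, recent_samples, thr) for model, thr in pairs)
```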
But look what happens. What we realized is that if these inaccurate, noisy anomaly rates trigger for a short period of time across multiple metrics, then it's not random anymore. It's multiple metrics that, for a period of time, all trigger together. They all say "I am anomalous" together, and then we know for sure that something anomalous is happening at a larger scale. It's not just one metric now; it's the system, or the service level, or the application level, that triggers a lot of anomalies across many metrics. So how can this help us? This is what we did in Netdata. We train multiple machine learning models, and we try to detect anomalies in a way that makes anomalies useful, not because one metric had one anomaly at some point, which is nothing, but because a lot of metrics are anomalous at the same time. And we try to use this to help people troubleshoot problems more efficiently.
So how do we use it in Netdata? The first thing is, you understand, the moment you install Netdata and run it, it comes up with hundreds of charts that you will probably see for the first time. So how do you know what is important? This is the first question that we try to answer. You just see in front of you this amazing dashboard: a lot of charts, hundreds of thousands of metrics, hundreds of charts. And what is important? Machine learning can help us with this, and we will see how. The second is,
you face a problem. I know there is a spike or a dive, or that for this time frame there is a problem. Can you tell me what happened there? In most monitoring solutions, to troubleshoot this issue you go through speculation. So if you use, for example, Grafana, you see, let's say, a spike or a dive in your web server responses, or increased latency in your web server responses, and you start speculating. What if it is the database server? Oh no, what if it is the storage? You make assumptions and then try to validate or drop these assumptions. What we tried with Netdata is to flip this completely. You highlight the area, the time frame you are interested in, and Netdata gives you an ordered list, a sorted list of what was most anomalous during that time, hoping that your aha moment is within the first 20-30 entries.
So the idea is that instead of speculating about what could be wrong in order to figure it out and solve it, we go to Netdata, and Netdata gives us a list of what was most anomalous during that time, and our aha moment, the disk did that, the storage did that, or the database did that, is right in front of our eyes. The third is finding correlations between components. What happens when this thing runs? What happens when a user logs in? What is affected? Because if you have a steady workload and then suddenly you do something, a lot of metrics will become anomalous, and this allows you to see the dependencies between the metrics immediately. So you will see, for example, that the moment a cron job runs, a lot of things are affected; a lot of totally, seemingly independent metrics get affected. So let's see this in action.
Netdata trains 18 machine
learning models for each metric.
So on a default Netdata installation you may have 3,000-4,000 metrics on a server, and for each of these 3,000-4,000 metrics it will train 18 machine learning models over time. Now, these machine learning models detect anomalies, and the anomaly information is stored together with the samples. So every sample on disk carries, along with the value that was collected, a bit saying whether it was anomalous or not. Then the query engine can calculate the anomaly rate as a
percentage. So when you view a time frame
for a metric, it can tell you the anomaly rate of that
metric during that time.
It's a percentage: the number of samples that were anomalous versus the total number of samples. And it can also calculate a host-level anomaly score. The host-level anomaly score is what we were discussing before, when many metrics become anomalous together.
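To make these two calculations concrete, here is a rough Python sketch of both the per-metric anomaly rate and the host-level anomaly rate, assuming the per-sample anomaly bits are available; the data is made up for illustration.

```python
from typing import Dict, List

def metric_anomaly_rate(anomaly_bits: List[bool]) -> float:
    """Percentage of samples in a time window whose anomaly bit was set."""
    return 100.0 * sum(anomaly_bits) / len(anomaly_bits) if anomaly_bits else 0.0

def host_anomaly_rate(bits_per_metric: Dict[str, List[bool]], t: int) -> float:
    """Percentage of all metrics on the host that are anomalous at the same moment t."""
    metrics = list(bits_per_metric.values())
    anomalous_now = sum(1 for bits in metrics if t < len(bits) and bits[t])
    return 100.0 * anomalous_now / len(metrics) if metrics else 0.0

# Toy data: 3 metrics, 5 samples each (True = the sample was flagged anomalous when stored).
bits = {
    "disk.io":    [False, False, True, True,  False],
    "system.cpu": [False, False, True, False, False],
    "net.eth0":   [False, False, True, True,  False],
}
print(metric_anomaly_rate(bits["disk.io"]))  # 40.0 -> 2 of 5 samples were anomalous
print(host_anomaly_rate(bits, t=2))          # 100.0 -> the anomalies clustered in time
```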
Now, Netdata's query engine calculates the anomaly rates in one go. This is another thing that we did: the moment you query a chart for its samples, whatever anomaly information is there is returned and visualized together with them. It's in the same output; one query does everything, both samples and anomalies.
Now let's see a chart. This is a Netdata chart. It looks like any chart from any monitoring system, I think, but there are a few differences, and let's see them. The first thing is that there is an anomaly ribbon. This ribbon shows the anomalies: how many samples were anomalous across time. So this is for some time window here, and you can see that at this moment there were anomalies, and at this moment, and at this moment. The second thing is that we created this NIDL framework. NIDL stands for nodes, instances, dimensions, and labels. Now look what this does.
The moment you click nodes, so I clicked here on nodes, you get this view. This view tells you the nodes the data is coming from. This is about transparency: you see a chart in Netdata and you immediately know which nodes contribute data to it, and as you will see, it's not just nodes, it's a lot more information. So for the nodes that are contributing data to it, you can see here how many instances there are. This chart is about applications, about 20 of them, coming from 18 nodes, and you can see the number of metrics per node. You can see the volume: this chart has some total volume, and what the contribution of each node is to that total. So if you remove Bangalore from the chart, you are going to lose about 16% of the volume. And here is the anomaly rate of the metrics of each node. Of course we have minimum, average, maximum, et cetera for all the metrics involved. And if you move on,
for all metrics involved. And if you move on,
the same happened for instances. So here that we have applications,
you can see here that each application has two metrics
and you can immediately see, okay, the SSH is anomalous
on this server. And of
course the same happens even for labels.
So not only for label keys, but also
for label values. So you can see again
the volume, you can see the anomaly rate, minimum,
average, maximum values for everything.
Now the same information is available as a tooltip.
So you can see this in the tooltip: you hover over a point on the chart, and together with the values that you normally see, you have the anomaly rate at that point, for each of the time series, for each of the dimensions of the chart. Now, if we go back to the original chart that I showed you, you have more control here. You can change the aggregation across time: if you zoom out the chart, it has to aggregate across time, because your screen has 500 points, but behind the scenes in the database, if this is per second, there are thousands and thousands of samples. So you can change the aggregation across time; you can select, say, minimum or maximum, to reveal the spikes or the dives. And you can change the group-by and the aggregation, so you can pivot the chart. It's like a cube: you see it from different angles. So let's continue.
Netdata also has a scoring engine. The scoring engine allows Netdata to traverse the entire list of metrics on a server and score them based on anomaly rate or similarity; we have many algorithms there. We also have a metric correlation algorithm that tries to find similarity in changes. So you highlight a spike and you say, correlate this with anything else, and it will find the corresponding dive, because the rate of change is similar.
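As a hedged sketch of what such a scoring pass boils down to, here are two toy scorers in Python: one ranks metrics by their anomaly rate inside a highlighted window, the other by how much their level changed versus the baseline just before the window. These are simple stand-ins for illustration, not Netdata's exact algorithms.

```python
import numpy as np
from typing import Dict, List, Tuple

def score_by_anomaly_rate(bits: Dict[str, List[bool]], start: int, end: int) -> List[Tuple[str, float]]:
    """Rank every metric by its anomaly rate inside the highlighted window, highest first."""
    scores = {name: 100.0 * sum(b[start:end]) / max(1, len(b[start:end]))
              for name, b in bits.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

def score_by_change(values: Dict[str, List[float]], start: int, end: int) -> List[Tuple[str, float]]:
    """Rank every metric by how much its level changed in the window versus the baseline before it."""
    scores = {}
    for name, v in values.items():
        window = np.asarray(v[start:end], dtype=float)
        baseline = np.asarray(v[max(0, 2 * start - end):start], dtype=float)  # same-sized window before
        if len(window) == 0 or len(baseline) == 0:
            continue
        scale = np.std(baseline) + 1e-9                   # avoid division by zero
        scores[name] = abs(window.mean() - baseline.mean()) / scale
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Usage idea: highlight a window, then read the top 20-30 entries looking for the "aha".
# top = score_by_anomaly_rate(bits_per_metric, start=120, end=180)[:30]
```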
Now, how do we use this? The first thing is that the Netdata dashboard has one chart below the other and a menu where all the charts are organized in sections, as you can see.
I don't see the number, but I think there are 500 charts or something like that there, all in sections. So out of these charts, you press a button and Netdata will score them according to their anomaly rate, to tell you which sections are anomalous and by how much. This allows you, if you have a problem, for example, to just go to the Netdata dashboard showing the current state, say the last five or fifteen minutes, press that button, and immediately see which metrics across the entire dashboard are anomalous, so that you can check what's happening, what's wrong.
The next is the host anomaly rate. For the host anomaly rate, what we do is calculate the percentage of the metrics on a server that are anomalous concurrently. What we realized then is the following: anomalies happen in clusters. Look at this, for example. These are servers, every line is a different server, and you see that the anomalies happen close together. This goes up to 10%, so 10% of all the metrics collected on a server were anomalous at the same time. And as you see, for each server it happened with a little delta here, but it happened concurrently: one server spiked to 10%, and a lot of other servers spiked to 5%.
Now look what happens when you view this dashboard. What you can do is highlight an area. So here we have highlighted from here to there, and what Netdata will do is score the metrics. It will traverse all the metrics one by one, score them for that little time frame, calculate the anomaly rate, and then provide a sorted list of what is most anomalous for that time frame. The whole point of this is to provide the aha moment within the top 20-30 items of the list. So instead of speculating about what could be wrong to cause this issue, Netdata tries to figure this out for you and gives you a list of the most anomalous things for that time frame, so that your aha moment is there within that list.
Now, the highlights. The first is that ML in Netdata is totally unsupervised, so you don't need to train it. It is trained for every metric, multiple models per metric, and it learns the behavior of the metrics over the last few days. So you just let Netdata run, you don't need to tell it what is good or what is bad, and Netdata will automatically start detecting anomalies based on the behavior of the metrics over the last two or three days. It is important to note that this is totally unsupervised; you don't need to do anything. Of course, if an anomaly happens it will flag it, but then after a while it will learn about it, so it will not flag it again. But if something happens for the first time in the last few days, it will detect it and reveal it to you. The second is that Netdata starts working immediately, within minutes. You install it, and after ten or fifteen minutes it will start flagging anomalies. It doesn't need to train all 18 models to detect anomalies; even one model is enough to flag anomalies. But as time passes it becomes better and better, so it eliminates noise. The third is that this happens for all metrics. Every single metric, from database servers, web servers, disks, network interfaces, system metrics: every single metric gets this anomaly detection.
The anomaly information is stored in the database, so you can query yesterday's anomalies, and not based on today's models but on yesterday's models, that is, the models as they were at the time the anomaly was triggered. There is a scoring engine that allows you to score metrics across the board: what is most anomalous now, what is most anomalous for that time frame, or I want to find something that is similar to this. All these queries are available with Netdata. And it has the host-level anomaly score that allows you to see the strength and the spread of an anomaly, inside each system but also across systems.
So what are we doing next? All of this is already there, it works, you can try it in Netdata, it's open source software. And actually it's amazing, because you don't have to do anything: just install it and it will work for you. We are adding machine learning profiles. We see users using machine learning in Netdata for different purposes: some people want it for security, some people want it for troubleshooting, some people want it for learning special applications, for training on special applications, et cetera. So we are trying to create profiles, so that users can have different settings for machine learning according to their needs. Of course, there are many settings available now, but they are applied to all metrics; everything is the same. The second is that we
want to segment this across time. So instead of learning the last two days and then detecting anomalies based on the whole of the last two days, we want to learn Mondays, to learn Tuesdays, and detect anomalies based on Monday's models, or Monday morning's models. This profiling will allow better control for the many industries where the days are not exactly similar, where, for example, they have some big spikes on Wednesdays and the systems are totally idle on Tuesdays.
That's it. So thank you very much.
I hope you enjoyed it.