Transcript
Hello. Welcome. Today we are going to discuss Netdata, the open source observability platform. I am the founder of Netdata, and to tell you the truth, I started Netdata out of frustration.
So I was facing some issues that I actually couldn't identify with any of the existing solutions, and I decided to do something about it. So what's the problem? Why do we need another tool? The key problems are these. The first is that most monitoring solutions today provide very low fidelity insights. Why is that? Because they force us to select only a few data sources, and then, instead of collecting everything per second at high resolution, they force us to lower the granularity, the resolution of the insights. The second is
inefficiency. So most monitoring tools
assume that we're going to build some kind of dashboards ourselves and
we're going to do everything by hand, etcetera.
For my taste, this was very problematic. So I don't want to build custom
dashboards by hand. Of course I want to have the ability to do
it, but it should not be the main means for
troubleshooting or exploring an infrastructure.
The next is that there is no AI.
So even the observability solutions
that claim that they do some kind of machine learning,
they don't do true machine learning. It's more like an expert system: they have some rules, let's say, which they collectively call AI, but it's not really AI. And the last
one is that all of them are expensive.
Even the open source ones are
really expensive to run. So this was my
driving force. This is why I wanted to have
a new monitoring tool that will simplify the
lives of engineers instead of complicating them.
Now, the current state as I see it from my experience. The tools that exist out there, and I respect all of them; each of these tools contributes, or has contributed, significantly to the evolution of observability solutions.
So there is the world of "too little observability". These are the traditional check-based systems, the systems that run some checks and then give you something like traffic-lights monitoring: green, yellow, red, to indicate that something needs attention or is problematic. This is a robust philosophy and there are excellent tools in this area, but the problem is that they have a lot of blind spots. They don't monitor much; they monitor only the things they have checks for.
Then there is the "too complex observability", like the Grafana world. Grafana is a very powerful and excellent visualizer, with excellent databases and other tools in its ecosystem. But the biggest problem is that it has a very steep learning curve and it very quickly becomes overcomplicated to maintain and scale.
And then, of course, there are the commercial solutions, like Datadog, Dynatrace, et cetera, which of course are very nice integrated systems. However, they are very expensive, and to my understanding they cannot do otherwise; what they do is expensive.
So the evolution,
as I see it, is like this. So initially,
we had all the check based systems. That was the first generation
of monitoring tools. Then we had the
metrics based systems, then the logs based systems.
Okay, these evolved more or less in parallel. The fourth generation is the integrated ones, like the commercial providers and Grafana, of course. And the fifth is what I believe Netdata introduces: all in one and integrated, like the integrated ones, but at the same time real time, high fidelity, AI powered and extremely cost efficient.
Now, in order to understand my thinking and how I started this thing, here is what I believe are the biggest, let's say, bad practices. They are called best practices, but in my mind they are bad practices.
So the first bad practice
is this myth that says that
you should monitor only what you need and understand.
This is wrong: we should monitor everything. Everything that is available should be monitored, no matter what, no exceptions. Why? Because we need a holistic view of the system, the infrastructure, and even our applications. We need more data to make our lives easier when we do root cause analysis, and we need enough data to feed the detection mechanisms that can proactively detect issues. And of
course, since we collect everything,
this means that we are adaptable to changing
environments. So as the infrastructure changes or the application changes or
additional things are introduced, we still
monitor everything, so we have all the information that is
there, no need to maintain it. The second is
that we need real-time, per-second data and extremely low latency. The supposed best practice says that monitoring every 10, 15 or 60 seconds is enough. To my understanding, this is not enough. First, in the environment we live in today, with all this virtualization and all these different providers involved in our infrastructure, monitoring every 10 or 15 seconds, or even every 2 seconds, is in many cases not enough to understand what is really happening at the infrastructure or the application level. The next is that when
you monitor every second,
it's easy to see the correlations.
It's easy to see when this application did
something, and when that application did something,
so spikes and dives or different errors
can be easily put in sequence.
And of course, when you have a low latency system and you know that the moment you hit enter on the keyboard to apply a command, the next second the monitoring system will be updated, this improves response to issues and even makes it extremely easy to identify issues during deployments or during changes in the infrastructure, et cetera.
The next is about dashboards.
Most monitoring tools have a design that allows, and actually forces, people to create the custom dashboards they need beforehand. That's the idea: create enough custom dashboards so that they will be available when the time comes, especially under crisis. The problem with this is that our data have infinite correlations. Okay, you can put the database and its queries and this and that on the same dashboard, but how is this correlated with the storage, the network, the web servers, etcetera? It's very hard to build infinite correlations by hand. So I
generally prefer a system that provides powerful, fully automated, dynamic dashboards. It's not just a few static dashboards; it's a tool to explore, a tool to correlate, and a tool that is consistent no matter what we want to do. So this fully automated dashboard is, again, one of the things I believe are the ideal attributes of a modern observability solution. And the last one is
of course, that we should use AI and machine learning to help us in observability. There are many, many presentations on the Internet, and there is a hype, claiming that machine learning cannot help in observability. Fortunately, this is not true. ML can learn the behavior of metrics: given a time series, it can learn how it behaves over time. ML can be trained at the edge, so there is no need to train models centrally and then publish them for use; that approach does not really work for observability anyway, because even if you have exactly the same servers and exactly the same setup, the workload defines what the metrics will do, so each time series should be trained individually. Machine learning can then detect outliers reliably and reveal hidden correlations: you can see, for example, that the anomalies of this thing and that thing happened at the same time, and that this is consistent across time, which means that these metrics are somehow correlated, even if you don't realize how.
And the most interesting part is that when you have a system that monitors everything, at high fidelity, and trains machine learning models for everything, the density of outliers and the number of concurrently anomalous metrics provide key insights. You can see clusters of anomalies within a server or across servers, and you can see how the anomalies spread across the different components of the systems and the services.
Now, in order to overcome the first problem, the monitor-everything problem, and also the real-time, per-second, low-latency visualization, Netdata comes with a distributed design. But in order to understand it, let's see what affects fidelity. Granularity is the resolution of the data, how frequently you collect them. If you have low granularity, the data are blurry; they are averaged over time, so you have a point here, a point there, and everything in between is just averaged.
Low cardinality means that you have fewer
sources. So you don't collect everything that is there.
You collect some of the data that are there.
If you have both of them, both low granularity and low
cardinality, then you are lacking both
detail and coverage. It's an abstract view: you have monitoring, you can see the overall picture, but it's a very high level view of what is really happening in your systems and applications.
Why not have high fidelity? What's the problem? The key problem for all monitoring systems is the centralized design. The centralized design means that you push everything from your entire infrastructure to a database server. It can be a time series database or whatever, it doesn't matter, but the whole infrastructure goes to one database server. This means that in order to scale the monitoring solution, you have to scale this database server and also control its cost. For commercial providers this means cost: inbound samples that need to be processed, stored, indexed, et cetera. So in order to scale the centralized design, they have to lower granularity and cardinality, and the outcome is always expensive, because no matter how much you lower them, you still have to have enough data to actually provide some insights.
Netdata, for example, collects everything per second. As we will see later, on an empty VM it has about 3,000 metrics per second: if you install Netdata on an empty VM, you are going to get 3,000 metrics, 3,000 unique time series, collected every second. Compare this with most other monitoring solutions, which collect about 100 metrics every 15 seconds or so. Compared to that, Netdata manages about 450 times more data. That's the big deal, and this is why all the others are forced to cherry-pick sources and lower the frequency.
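As a quick back-of-the-envelope check of that 450x figure (my own arithmetic, not from the talk's slides):

```python
# Rough arithmetic behind the "450 times more data" claim.
netdata_samples_per_sec = 3000 * 1     # ~3,000 metrics, collected every second
typical_samples_per_sec = 100 / 15     # ~100 metrics, collected every 15 seconds

ratio = netdata_samples_per_sec / typical_samples_per_sec
print(f"{ratio:.0f}x more samples per second")  # -> 450x
```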
Now, we use a decentralized, distributed design, so we keep the data at the edge. Netdata keeps the data at the edge, and when we do that, you understand that we get a few benefits. The first is that each of these servers has its own data. It may be a few thousand metrics, but it's a small, easily manageable data set. The second is that the resources required to do so are already available as spare capacity. Netdata is very efficient in CPU, memory and storage: you can expect, for example, a couple of percent of a single CPU core, around 2% of one core, about 200 megabytes of RAM, and 3 GB of disk space. That's it. Disk I/O is almost idle. This allows Netdata to be a very polite citizen on production systems. Despite the fact that it's a full blown monitoring solution in a box, in one application, and it does everything in there, the agent that you install is still one of the most efficient applications. Actually, if you search our site, we have a blog post where we evaluated the performance of all the agents, and you can verify there that Netdata is one of the lightest monitoring agents available, despite being a full monitoring solution in a box. The second is that
if you have ephemeral nodes, or other operational needs not to keep the data at the edge on the production systems, then the same software, the Netdata agent, can be used as a centralization point. So you can build Netdata centralization points across your infrastructure, and these centralization points do not need to be just one; you can have as many as you want. In order to provide unified views across the infrastructure, we merge the data at query time. For us, the biggest trick, the hardest part, was actually to come up with a query engine that can do all the complex queries that are required, but can execute them in a distributed way, so that parts of the queries run all over the place and the results are aggregated at the end.
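To make the idea concrete, here is a minimal sketch of what "merge at query time" can look like: fan out one tiny query to every agent and aggregate the partial results. It assumes each agent exposes its REST API on port 19999 and that the response is the documented JSON shape; the node names are hypothetical and the payload handling should be checked against the agent's API docs.

```python
# Conceptual sketch: run tiny queries on many Netdata agents in parallel
# and merge the results at query time. Endpoint and payload shape are
# assumptions based on the agent's public REST API; adjust as needed.
import json
import urllib.request
from concurrent.futures import ThreadPoolExecutor

NODES = ["node1.example.com", "node2.example.com"]  # hypothetical node names

def query_node(node: str):
    url = f"http://{node}:19999/api/v1/data?chart=system.cpu&after=-60&format=json"
    with urllib.request.urlopen(url, timeout=5) as resp:
        return node, json.load(resp)

def average_cpu(payload) -> float:
    # 'data' rows look like [timestamp, dim1, dim2, ...]; average all dimensions.
    rows = payload.get("data", [])
    values = [v for row in rows for v in row[1:] if isinstance(v, (int, float))]
    return sum(values) / len(values) if values else 0.0

with ThreadPoolExecutor(max_workers=16) as pool:
    results = dict(pool.map(query_node, NODES))

for node, payload in results.items():
    print(node, f"{average_cpu(payload):.1f} avg over the last minute")
```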
One of the common concerns about the decentralized design is that the agent will be heavy. We discussed this already: it will not, and we have verified this; it's actually one of the lightest. Will queries influence production systems? No, because the data set is very small, so there is really zero effect on production systems even when queries run. But what is more important is that we give you all the tooling to offload very sensitive production systems. If you have a database server and you really don't want queries, Netdata observability queries I mean, to run there, you can easily stream the data to another Netdata server next to it, and that one will be used for all queries. Will queries be slower? No, they will not be slower.
Actually they are faster. Now imagine this. If you have a thousand
servers out there and you have installed Netdata on all of them, and you want to aggregate one chart with the total bandwidth, or the total CPU utilization, let's say, of all your servers, a thousand servers. The moment you hit that button for the query to be executed, tiny queries are executed on a thousand servers. The horsepower that you have is amazingly big: each server is doing a very tiny job, but the overall performance is much, much better. And another
concern is that it will require more bandwidth. No,
because in observability most of the bandwidth goes into streaming the collected data; that is magnitudes bigger than the bandwidth required to query something or view a page with a chart, etcetera. Another aspect is that there are times when you use the monitoring system a lot and you are on it all day, because you need to troubleshoot something, but most of the time most of the data are never queried. They are just collected, stored and indexed, but not queried. You have them there as the ability to troubleshoot; you don't need to go and look at all of them every day.
Now, this is what I told you before about what Netdata collects. If you install Netdata on a purely empty VM, so nothing on it, just buy an AWS, GCP or Azure VM or whatever and install Netdata on it, this is what you're going to get. As you see: more than 150 charts, more than 2,000 unique time series, and more than 50 alerts running, monitoring its components. You're going to have a systemd logs explorer and a network explorer, so all the sockets, in and out, whatever they are, even the local ones. You're going to have unsupervised anomaly detection for all metrics, and two years of retention using 3 GB of disk space, while using about 1% CPU of a single core, 120 megabytes of RAM and almost zero disk I/O. This includes machine learning, and it includes the metrics, retention, storage, everything.
Now, if you look at what a Netdata agent is internally, it is this.
There is a discovery process that auto-discovers all the metrics. It starts collecting them and it detects anomalies, but this is a feedback loop, we will get there. After collection, it stores the samples in its own time series database. This is a time series database that we have specially crafted to achieve half a byte per sample on disk, on the high resolution tier. After the data are stored in the database, we have machine learning, where multiple models are trained per metric; this provides reliable anomaly detection during data collection, in real time.
We have the health engine, where we check for common error conditions or whatever alerts you have configured. Then there are the query engines; we will discuss the scoring engine later. The query engine is the normal query engine that most monitoring solutions have. Netdata can also export metrics to third parties, Prometheus or InfluxDB or whatever it is. It can export its own metrics, and it can also downsample them so that the other system will not be overloaded by the amount of information Netdata sends. It can also filter them: drop a few metrics, or downsample so that instead of per second, samples are exported every 10 seconds, every minute, or whatever is needed.
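As a toy illustration of what downsampling before export means (purely conceptual, not the exporter's actual code): collapse per-second samples into one averaged sample every 10 seconds.

```python
# Toy downsampling: collapse per-second samples into 10-second averages.
def downsample(samples: list[float], every: int = 10) -> list[float]:
    return [
        sum(chunk) / len(chunk)
        for chunk in (samples[i:i + every] for i in range(0, len(samples), every))
        if chunk
    ]

per_second = [float(i % 7) for i in range(60)]   # 60 seconds of fake data
print(downsample(per_second, every=10))          # 6 exported samples instead of 60
```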
And then there is the streaming functionality. This is the function that allows you to build parents, the centralization points: the streaming part of one Netdata connects to the streaming endpoint of a remote Netdata. So it's like building pipelines out of Lego bricks. You install Netdata everywhere; there is no central component anywhere. You can have centralization points, but they are the same thing: if you want to create a centralization point, you use the same software, install it as the centralization point, and just point the others to push their metrics to it. And that's it. That's actually everything about Netdata.
So if you look at this example, we have five servers, and you install five agents, one on each of the five servers. You understand that every agent is isolated by default; it is standalone. In this case, in order to access metrics, logs, or dashboards, you need to hit the IP or the hostname of each server, and alerts are dispatched from each server. They are standalone. But what you can do is use Netdata Cloud: all the agents connect to Netdata Cloud, but they do not stream their data to it. Netdata Cloud only maintains metadata: OK, this is the list of servers, this user has five servers, they have these metrics, these are the alerts that have been configured, but that's it. Just metadata about what each agent is actually doing.
Then, when you go to view dashboards, Netdata Cloud queries all the servers behind the scenes, aggregates the data, and presents unified dashboards. Similarly for alerting: the agent evaluates the alerts and sends Netdata Cloud a notification saying, hey, this alert has been triggered, and Netdata Cloud dispatches email notifications, PagerDuty notifications, or whatever notifications you want in order for you to get notified. We also have a mobile app where you can get alert notifications on your mobile. If you want to build centralization
points, it works like this. You appoint one Netdata, S6 in this case, as a parent for all the others. Then all the others can be offloaded: they don't need to be queried, they don't need to store data, or they can store only enough data to cover maintenance work on the parent. We also have a replication feature: if the connection between S1 and S6 gets interrupted, then the next time S1 connects to S6 they will negotiate, S1 will fill in the missing points, the missing samples, on S6, and then of course continue. And a
hybrid setup looks like this. Here we have two data centers, data center one and data center two, and one cloud provider. You can have parents at each data center (optional, but you can have them if you want), and then use Netdata Cloud on top of all of them to have infrastructure-wide dashboards, even across different providers and different data centers. Now,
we stress tested Netdata as a parent against Prometheus, to understand how much better or worse it is. Actually, we beat Prometheus in every aspect: we need one third less CPU, half the memory, 10% less bandwidth, and almost no disk I/O. Netdata writes the samples right at the position they should be, and the writes are throttled over time, so it's very gentle and doesn't introduce big bursts of writes, and we have an amazing disk footprint.
The University of Amsterdam did research in December 2023 to find the performance impact and the energy efficiency of monitoring tools for Docker-based systems. They found that Netdata is the most efficient agent: it excelled in terms of performance impact, allowing containers to run without any measurable impact due to observability. Netdata is also in the CNCF landscape. Netdata is not a CNCF incubating project, but we sponsor CNCF, we support CNCF, and we are in the CNCF landscape, where Netdata is the top project in the observability category in terms of user love, that is, GitHub stars.
Now let's see how we do the first thing: how we automate everything, how we manage to install the agent, detect the data sources, and then have it come up by itself with dashboards and alerts and everything, without you doing anything. The first thing we understood is that we have a lot in common. Each of us has an infrastructure that, from the outside, seems a completely different thing. But if you check the components that we use, we use the same database servers, the same web servers, the same Linux systems, similar disks, similar network interfaces, et cetera. So the components of all our infrastructures are pretty similar.
So what Netdata did is develop a model to describe the metrics in a way that allows a fully automated dashboard engine to visualize all the metrics. We developed the NIDL framework. NIDL stands for Nodes, Instances, Dimensions and Labels. It is, let's say, a method of describing the metrics that allows us to have both fully automated dashboards, without any manual intervention, and at the same time fully automated alerts.
All Netdata alerts are attached to components. So you say: I want this alert attached to all my disks, I want this alert attached to all my web servers, or to all database tables, whatever the instance of the component is. Now the
result of the NIDL framework is that it allows Netdata to come up on its own. You install it, you don't do anything else: it auto-detects all the data sources and you get fully functional dashboards for every single metric, no exceptions. At the same time, it allows you to slice and dice and correlate the data from the UI, without learning a query language.
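To make NIDL a bit more concrete, here is a hedged sketch of how a single metric could be described in nodes/instances/dimensions/labels terms. The field names and structure are illustrative only, not Netdata's actual internal schema.

```python
# Illustrative (not Netdata's real schema): describing one metric in NIDL terms.
disk_io = {
    "context": "disk.io",                      # the metric family shown as one chart
    "nodes": ["web1", "web2", "db1"],          # which servers contribute samples
    "instances": ["sda", "sdb", "nvme0n1"],    # one instance per monitored disk
    "dimensions": ["reads", "writes"],         # the values each instance reports
    "labels": {"device_type": ["physical", "virtual"]},  # extra key/value metadata
}

# With such a description, a dashboard can be generated automatically
# (one chart per context) and an alert can be attached to every instance:
alert = {"on": "disk.io", "for_each": "instance", "warn_when": "utilization > 90%"}
print(disk_io["context"], "->", alert["for_each"])
```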
That's our first mission accomplished.
Then the next thing was how to get rid of the query language. The query language is the biggest problem of monitoring tools. Why? First, it has a learning curve: you need to learn the query language, and in most cases learning it is hard. Some people can do it, but for most monitoring users the answer is: come on, what are you talking about? No. The second thing is that when you have a query language, the whole process of extending and enhancing the observability platform goes through a development process. You have some people who know the query language and you ask them to create the dashboard you want, so they have to bake the dashboards, test the dashboards, and then you can use them. That's a big problem, because in a crisis you most likely want to explore and correlate things on the fly, without a "let's do this, let's do that" process. It should be very simple, so that everyone can be fluent in it.
At Netdata we had another problem to solve, mainly because all our visualization is fully automated: how do we allow a user who sees a dashboard for the first time to grasp what the dashboard is about, what every chart and every metric is about? This was a big challenge, because for most monitoring tools the chart is just a time series visualization. It has some settings, like line chart, area chart, etcetera, but what is incorporated in it is never shown. You need to do queries by hand to actually pivot the data and understand, oh, this is coming from five servers, or you have to build such a visualization yourself. So what we did in Netdata is this. The chart looks like a chart, like all charts; in this case it is an area chart. The chart of course has an info section, where we have added some information about what the chart is about, so that people can read some text and get the context of the chart. But then we added these controls.
Now look what happens here. The first is this purple ribbon; we call it the anomaly rate ribbon. When there are anomalies, they are visualized in this ribbon, anomalies across all the metrics. In this case, for example, the chart comes from seven nodes and 115 applications, and there are 32 different labels in there. The whole point of this NIDL ribbon is to allow people to grasp what the source of the chart is. Now let's see it: in this case we click on the nodes. When you click the nodes, this dropdown appears, and it explains the nodes the data are coming from, the number of instances and metrics each node provides to this chart, and then the
volume contribution of each node. This node contributes about 16% of the total volume of the chart: whatever we see there, 16% of it comes from this node. This is the anomaly rate that each node contributes. And of course, you can see the raw database values: the minimum, average and maximum of the raw database values. Now,
the interesting part is that this is also a filter,
so you can filter some
of the nodes to immediately change the chart.
And of course, the same (I don't have it on this slide, but exactly the same) happens for applications and for labels. So you can see, per label and per application, what the volume contribution is, what its anomaly rate is, and what the minimum, average and maximum values are.
Now, we went a step further and also added grouping functionality. This grouping functionality allows you to select one or more groups. In this case, I selected the label device type and the dimension; the dimensions are reads and writes. And you see that I got reads physical, reads virtual, writes physical, writes virtual. So the idea is that you can group the chart on the fly, without knowing any query language or anything; you can group by and get the chart you want with just point and click.
The next important thing with Netdata is this info ribbon at the bottom, which may show empty data. Empty data means that data are missing there. Unlike most monitoring solutions, where if a chart is collected every 10 seconds and one sample is missing it means nothing and just gets smoothed out by connecting this point with that point, Netdata works a bit differently. It needs to collect data every second, and it runs with the lowest priority on the system, on purpose. We want Netdata to run with the lowest priority, because if you miss a sample, it means that your system is severely overloaded, since Netdata could not collect the sample. So gaps are an important aspect of monitoring in the Netdata world, and we visualize them and explain where they come from, etcetera.
That's another mission accomplished: getting rid of the query language and allowing people to work with and navigate the dashboard without any help, any preparation, or any special skills.
Then the next is about machine learning. Most likely a lot of you have seen this; it is a presentation made in 2019 by Google. And the guy said, you know what, ML (it's the bold part here) cannot solve most of the problems most people want it to. So it's not that ML cannot be helpful; it's that what people expect from ML is not the right thing. And we are not talking about random people here, but Google developers and Google DevOps: they expected machine learning to solve a certain set of problems that it of course cannot solve.
So what we do in Netdata with machine learning is this. First, we train a model per metric, every 3 hours, using 6 hours of data; in total we keep 18 ML models per time series. So every time series has 18 models, trained on its own past data. Netdata detects anomalies in real time: it uses these 18 models, and if all 18 models agree that a collected sample is anomalous, it is marked as anomalous. The anomaly bit is stored in the database, and once it is stored there, we can query for it. So we can run queries about the past for anomalies only: not the values of the samples of a metric, but its anomaly rate. And of course we calculate host-level anomaly scores, et cetera, which we will see how they work.
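Here is a minimal sketch of that idea in code: several models, each trained on a different slice of the metric's own past, and a sample flagged anomalous only when all of them agree. It is a conceptual illustration using simple z-scores, not Netdata's actual implementation (which, per the talk, keeps 18 trained models per time series).

```python
# Conceptual sketch of per-metric ensemble anomaly detection:
# each model is trained on a different window of the metric's own past,
# and a sample is anomalous only if ALL models agree.
import random
import statistics

class ZScoreModel:
    def __init__(self, training_window: list[float]):
        self.mean = statistics.fmean(training_window)
        self.std = statistics.pstdev(training_window) or 1e-9

    def is_anomalous(self, value: float, threshold: float = 4.0) -> bool:
        return abs(value - self.mean) / self.std > threshold

history = [50 + random.gauss(0, 2) for _ in range(6 * 3600)]  # 6 hours of fake per-second data

# Train several models on different slices of the past (the talk describes 18).
window = len(history) // 6
models = [ZScoreModel(history[i * window:(i + 1) * window]) for i in range(6)]

def anomaly_bit(value: float) -> int:
    return int(all(m.is_anomalous(value) for m in models))

print(anomaly_bit(51.0))   # normal sample -> 0
print(anomaly_bit(500.0))  # clearly anomalous sample -> 1
```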
This is the scoring engine I mentioned earlier. Netdata has a scoring engine that goes through all metrics and scores them according to an algorithm. Let's assume you see a spike or a dive on a dashboard. Instead of speculating about what could be wrong ("I see this dive in, I don't know, my sales; is it the web server, the database server, the storage, the network? Do I have retransmits? What's wrong?"), instead of going through these assumptions, you highlight the window where you see the spike or the dive and give it to the scoring engine. The scoring engine then goes through all metrics, across all your servers, for that window, and scores them by rate of change, anomaly rate, or whatever you ask. Then Netdata gives you an ordered list of the things that scored higher than the others. So your "aha" moment, the discovery that, I don't know, the network did this, is in the results.
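Conceptually, the scoring step can be thought of like the sketch below: score every metric over the highlighted window (here simply by its anomaly rate) and return the top of the sorted list. This is my own simplified illustration, not the actual scoring engine.

```python
# Conceptual sketch of scoring: rank all metrics by their anomaly rate
# inside the highlighted window and return the most suspicious ones.
def score_metrics(window: dict[str, list[int]], top: int = 5) -> list[tuple[str, float]]:
    """window maps metric name -> anomaly bits (0/1) for the highlighted time frame."""
    scores = {name: sum(bits) / len(bits) for name, bits in window.items() if bits}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top]

window = {
    "net.retransmits": [1, 1, 1, 0, 1],    # hypothetical per-second anomaly bits
    "mysql.slow_queries": [0, 1, 1, 1, 1],
    "system.cpu": [0, 0, 1, 0, 0],
}
for name, rate in score_metrics(window):
    print(f"{name}: {rate:.0%} anomalous in the window")
```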
That was the point: to actually flip the troubleshooting process. But let's see, overall, what other uses this thing has. One is this: the Netdata dashboard has a menu where all the metrics are organized. As I said, everything: all metrics and charts appear here by default; there is no option to hide something. There is an AR (anomaly rate) button here. When you click this button, the scoring engine gives you an anomaly rate per section of the dashboard. This allows you, for example, to identify immediately that, you know what, in the system section we have 14% and in the applications section I have 2%, and you can see immediately the anomaly rate per section for the visible time frame, always. If you pan to the past and click the button, it will do it for that time frame.
The host anomaly rate is the percentage of metrics on a host that are anomalous concurrently. So a 10% host anomaly rate means that 10% of the total number of metrics are anomalous at the same time.
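In other words (my own tiny illustration of the definition):

```python
# Host anomaly rate: the share of a host's metrics that are anomalous at the same moment.
anomalous_now = 300      # metrics flagged anomalous at this second (example numbers)
total_metrics = 3000     # all metrics collected on the host
host_anomaly_rate = anomalous_now / total_metrics
print(f"{host_anomaly_rate:.0%}")  # -> 10%
```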
What we do then is visualize this host anomaly rate in a chart like this. In this chart, every dimension, every line, is a node, and you see that anomalies, even across nodes, happen in clusters. You see here four nodes concurrently; you see here one very big spike for one node, but other nodes had anomaly spikes at the same time. Now, when you
highlight such a spike, Netdata gives you an ordered list of the things that are related to that spike: which other metrics had the highest anomaly rate within this window. That's another mission accomplished: using AI and machine learning to help in the troubleshooting process and reveal insights that would otherwise go unnoticed. Then it was about
logs. Everything we saw so far was about metrics. For logs, Netdata has a very similar distributed approach: we rely on systemd-journal. Instead of centralizing logs to some other database server, Loki or Elasticsearch or Splunk or whatever, we keep the data in systemd-journal, at the place where they probably already are. Once the data are there, Netdata can query them directly from that place. Actually, I found out about systemd-journal about a year ago, and I realized how good this thing is. The first thing about systemd-journal is that it is available everywhere. It is secure by design. It is unique: it indexes all fields and all values, which gives amazing flexibility.
this. Either you have a plaintext file, all the logs
together, doesn't matter, not much you can
do. Then you can put them in low key. In low
key, nothing is indexed, it's just a few labels.
So you create streams that you say, okay, if this is
a and this b and this c and this is d,
four labels. For example, this is the stream
of logs of that thing. And the
number of streams that you have influenced significantly,
of course the performance, the memory footprint, etcetera, etcetera.
So logie is like log files,
almost the same with log files, of course has amazing
disk footprint. On the other side is elastic,
elastic indexes every word that is found
inside all text messages.
Of course it's good and powerful,
but at the same time requires a lot of resources.
This indexing is heavy,
eventually requires more significantly
more resources than the roll logs. System digital is
a balance between the two. So it indexes all fields role values,
but it doesn't split the words. So they hold value as it is.
If it is a string or whatever, it is there. The good
thing is that it has amazing injection performance,
so almost no resources.
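As a small aside, those indexed fields can also be queried programmatically; here is a hedged sketch using the python-systemd bindings (it requires the python-systemd package, and the matched unit name is just an example).

```python
# Sketch: read structured entries from the local systemd journal.
# Requires the python-systemd bindings (the 'systemd' package on most distros).
from systemd import journal

reader = journal.Reader()
reader.add_match(_SYSTEMD_UNIT="nginx.service")  # filter on an indexed field (example unit)
reader.this_boot()                               # only entries from the current boot

for entry in reader:
    # Every entry is a dict of journal fields, e.g. MESSAGE, PRIORITY, _PID, ...
    print(entry.get("__REALTIME_TIMESTAMP"), entry.get("MESSAGE"))
```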
It can also be used to build log centralization points. Now, the Netdata UI looks like this. It's the typical thing: Kibana looks like this, Grafana looks like this. You have the messages, etcetera, and you have the different fields that have been detected. The good thing about Netdata is that, as you see, even in this example with about 15.6 million log entries, we start sampling at 1 million entries. Most other solutions, Kibana and Grafana Loki, sample just the last 5,000 entries or so to give you the percentages of how often each field value appears. In Netdata we sample at least a million entries or more, and it's fast. Actually, people have complained in the past that journalctl is slow. We found the problem and submitted patches to systemd to make systemd-journal queries 14 times faster; I think they should be merged by now.
And we have the systemd-journal explorer, a Netdata plugin that can query the logs at the place they are. Now, systemd-journal lacks some tooling to push logs into it. So I wonder... So guys, unfortunately my audio died; the microphone died five minutes before finishing the presentation, while I was shooting it. I had to leave immediately for a wedding on a Greek island, so here I am on a beautiful Greek island. You see the sea; it's very nice, very nice weather. So sorry for that. I will re-shoot the last five minutes so that you have audio. I was saying that
systemd-journal lacks some integrations, so it's not easy to extract fields in a structured way, to convert plain text log files into structured journal fields, and to send them to systemd-journal. So we created log2journal. This program tails plain text log files and can extract any fields from them using regular expressions. It can also automatically parse JSON and logfmt files, and it outputs the systemd native format. This systemd native format is then sent to systemd-cat-native, another tool that we created, which sends it in real time to a local or a remote systemd-journald. Both of these tools work on any operating system and don't require any special libraries; they are available on FreeBSD, macOS, Linux of course, and even on Windows.
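To give an idea of what structured journal fields look like, here is a hedged Python sketch that turns a JSON log line into journal-style fields, one FIELD=value per line with a blank line ending the entry, roughly the kind of stream that tools in this family consume. The field names are illustrative, and log2journal does this job for you automatically.

```python
# Sketch: convert a structured (JSON) log line into journal-style fields,
# one FIELD=value per line, with an empty line ending the entry.
import json
import sys

def to_journal_fields(line: str) -> str:
    record = json.loads(line)
    fields = {
        "MESSAGE": record.get("msg", line.strip()),
        "PRIORITY": "6",                                 # informational
        "SYSLOG_IDENTIFIER": "myapp",                    # illustrative identifier
        "MYAPP_STATUS": str(record.get("status", "")),   # custom field, fully indexed
    }
    return "".join(f"{k}={v}\n" for k, v in fields.items()) + "\n"

for raw in sys.stdin:
    if raw.strip():
        sys.stdout.write(to_journal_fields(raw))
```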
This concludes our work on making logs a lot easier and more affordable to run. systemd-journal is very performant today, especially after the patches we supplied to systemd, and it's extremely lightweight. Of course, systemd-journal files are open, so you have all the tools to dump data from them, etcetera. They are also very secure; systemd-journal has been designed to be secure. So I think that having all the processing happen at the edge, in a distributed way, actually eliminates the need for heavy log centralization and database servers, and makes log management a lot more affordable. systemd-journal, as I said before, supports centralization points, and the Netdata plugin is able to use the journals of centralization points to be multi-node, so the logs of multiple nodes are multiplexed.
The next challenge is about going beyond metrics, logs and traces. We want Netdata to be a lot more than just metrics, logs and traces; we want Netdata to be a console for any kind of information. For this we created what we call functions. Functions are exposed by Netdata plugins. The Postgres plugin, for example, may expose a function that says: hey, I can provide the currently running slow queries. Similarly, our network viewer exposes a function that visualizes all the active connections of the system, including the connections of containers and of all running applications. This allows Netdata to be used as a console tool to explore any kind of information; it doesn't matter what the information is, we can have a custom visualization or whatever is required for it to
work. The tricky part here, the challenge, was the routing. In order for this to work, we had to solve the routing problem: since all functions provide live information, we have to route each request through the Netdata servers to the right server and the right plugin, run the function there, get the result back, and send it to your web browser, no matter where your web browser is connected, even through Netdata Cloud.
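Purely as a mental model (this is not Netdata's protocol), the routing problem looks something like this: keep a registry of which node exposes which function, forward the request to that node, run the plugin there, and relay the live result back to whoever asked.

```python
# Conceptual routing sketch (not Netdata's actual protocol): a request for a
# "function" is forwarded to the node that exposes it and the live result
# is relayed back to the caller.
from typing import Callable

class FunctionRouter:
    def __init__(self):
        self.registry: dict[str, dict[str, Callable[[], dict]]] = {}

    def register(self, node: str, name: str, func: Callable[[], dict]) -> None:
        self.registry.setdefault(node, {})[name] = func

    def call(self, node: str, name: str) -> dict:
        try:
            return self.registry[node][name]()      # run it where the data lives
        except KeyError:
            return {"error": f"{name} not available on {node}"}

router = FunctionRouter()
router.register("db1", "postgres-slow-queries",
                lambda: {"rows": [{"query": "SELECT ...", "ms": 5400}]})

print(router.call("db1", "postgres-slow-queries"))
print(router.call("web1", "network-connections"))
```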
This is, for example, our network connections explorer. You see that there is a visualization that shows all the applications, the number of connections they have, and the kind of connections they have: listening, client, outbound, inbound, to private IP address spaces or to the Internet, etcetera. That's another mission accomplished: creating the mechanics that allow any kind of plugin to expose any kind of information. And the last
part is about our monetization strategy. Netdata is open source, but we monetize it through a SaaS, Netdata Cloud. Netdata Cloud offers horizontal scalability: you can have totally independent Netdata agents, but all of them appear as one uniform infrastructure at visualization time. We added role-based access control to it, and the ability to access your observability from anywhere. And of course we have push notifications for alerts, with a mobile app for iOS and Android, and a lot more customizability and configurability via Netdata Cloud.
Thank you very much. That was the presentation. I hope you
enjoyed it. I am very sorry for my microphone problem.
I hope I will see you again and I hope you
enjoyed it. Of course.