Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hello, and welcome everyone to this talk on decentralized
monitoring and why it matters. I am Shyam Srivalsan.
I work at Netdata. Netdata is an open source
observability platform, and our goal is
to shake up the observability landscape and make monitoring easy for everyone.
So let's start by talking about observability.
What is observability in a nutshell? So,
to start with, there's all the stuff that you care about, and this
could be your data center,
your applications, your databases, your servers,
your IoT networks, your Kubernetes clusters, all of those things which
are keeping whatever you care about, your business, your company,
all of it running. So here's all the stuff that you care about,
and you want to observe this stuff. And how do you do that? So you
do it at a high level using these three things.
So there are metrics that you get out of your infrastructure,
and metrics are usually in the form of time series data: numeric
data associated with the different counters that you look at.
And then you have logs, which are textual data,
again describing what's happening in your infrastructure.
And then you have traces, which go a bit deeper into the flow
of a particular event happening across different
parts of your infrastructure, your front end and your back end, for example.
So now you have your metrics, logs and traces. They exist on these
devices. What do you do next? So the next step is
to collect it all. So this is where your
observability tools and platforms generally enter
the picture. They usually have some sort
of agents in place, or collectors or exporters, which
are doing this job of collecting all of this data into
some sort of repository or storage. And once you collect
it, what next? What do you do with it?
So there's two main things that you do, and one is to
visualize it, so as to have it up in a dashboard, have it
in some form that you can look at. And this is how,
whether you're a DevOps engineer, a developer, or an SRE,
you observe the stuff that you care about. It's mostly in the form
of dashboards. And also another really important component
is alerting, because you can't be expected to always be looking at
a dashboard about this stuff. You have other
important tasks to do. You're human, so you have to go to
sleep. You have other things outside of work that you need
to be doing, and you cannot always be observing
the things to make sure that everything is going okay. So
that's where alerting comes in, which means that if something important happens
that you need to know about, the alerts will make sure that you know.
So this, in a nutshell, is observability.
So I just wanted to set the stage with this because
we're going to be talking about a lot of the different aspects of the things
that we just discussed.
And what about the observability landscape? It's pretty
crowded in here, to be honest. You can see all those logos of all
the different observability companies.
To try and break this down a little bit, I'm going
to be dividing these solutions into different generations.
So you can talk about the first-generation observability platforms, which
are focused on checks. So you have your Nagios, your Zabbix,
your PRTG, your Checkmk; they're focused on
checks. Is something working or not? Is something running or not?
Is something up or down? That's their main role.
Of course they're branching out and doing other things as well, but that's where they
started, or that's their core philosophy.
And then you have the second generation, which is more focused on the metrics themselves.
This is what Prometheus, for example, is really
famous for. Most of this audience has heard about Prometheus;
most of you have used it. The Prometheus community is pretty active as well.
So you know how it works. There are metrics that are exported
from different services, applications,
servers, devices, and it's time series metrics
with metadata associated with them. And there is a Prometheus server
which collects all of this data and stores it as a time
series database. And it's not just Prometheus. There are other agents
and other monitoring platforms that do this as well.
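As a rough sketch of that pull model, here is what exposing a couple of metrics with the prometheus_client Python library can look like; the metric names and the port are just examples, and a Prometheus server's scrape config would then be pointed at this endpoint.

```python
# Sketch of the pull model: expose metrics over HTTP, let Prometheus scrape them.
# Metric names and the port are illustrative.
from prometheus_client import start_http_server, Counter, Gauge
import random, time

REQUESTS = Counter("app_requests_total", "Total requests handled", ["endpoint"])
QUEUE_DEPTH = Gauge("app_queue_depth", "Jobs currently waiting in the queue")

if __name__ == "__main__":
    start_http_server(8000)        # serves /metrics for the Prometheus server to scrape
    while True:
        REQUESTS.labels(endpoint="/checkout").inc()   # time series value plus label metadata
        QUEUE_DEPTH.set(random.randint(0, 50))
        time.sleep(1)
```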
Then you have the observability
tools that focus on logs. So you have things like Splunk or Elastic,
where, though they do other things as well, logs are kind of their
core competence. And then finally you
have the fourth generation of integrated
monitoring tools, where you have a
mix of all of these things. You have metrics, you have logs, you have traces,
checks, and all these things mixed in. And this is where the
big commercial tools such as Datadog,
Dynatrace, Instana, New Relic and so on really
make their mark. So this is the observability
landscape as most of you are used to it. So what
is in common with most of these tools? Right? So when you think about
it, this is where the philosophy of centralized monitoring really
comes in. Centralized monitoring or centralized observability
is the default setting today, which means
that the tools that we're talking about are centralizing
metrics, logs, traces, checks, all of this information into
some sort of a central monitoring server.
There are benefits to this approach, of course; this is why people do it.
It gives you comprehensive visibility because you have all of those things in
one place. So you have the metrics, you have the logs, you have
the traces, all of that information. So when you're looking at a particular timeline,
you get all the information in one place.
You can use this to correlate trends across various different data
types, whether it's a metric, whether it's a log or a trace.
The trend correlation becomes really important when you're troubleshooting something,
and it also gives you a deeper understanding of what's really going on in the
system, in the infrastructure. So this is centralized observability,
this is the underlying architecture or philosophy of
most of, or almost all of the
main observability platforms out there today. So it
sounds pretty good, doesn't it? So what's the issue? Right,
so that's what I'm going to get to. There is a big bucket and
that's what we're going to have a deep dive into. So let's talk about
the seven deadly sins of centralized monitoring, if you will.
What are the limitations of centralized monitoring? In many
cases, unless you think about it, they're not
obvious or at the front of your mind,
right? Sometimes they can get swept under the carpet, and that's where the troubles start
creeping in. So let's go over these one by one, and let's
start with fidelity. So what is fidelity?
What does fidelity mean? One
way to think about this is that fidelity is sort of a mixture of granularity
and cardinality. And what does this mean? Granularity
means, let's say we're talking about metric data
here, how often you have the metric data. Are you collecting it
every second, or are you collecting it every 10 seconds?
Or are you collecting it every 60 seconds? Right.
There is a very big difference between these three things, between getting per-second
data, per-ten-second data and per-60-second data.
So this is what data granularity means: how granular
your data is. And if you have low granularity, which means that, let's say,
you're collecting data every 60 seconds,
this is in effect blurry data, because you
don't get the full picture. There's a lot of stuff that's happening.
Let's say, for example, you're looking at a specific metric, a counter
of some important data coming from your database, and you're only looking at
it in snapshots taken every 60 seconds.
There could be a lot that happened within that time period which you're either
aggregating away or not getting to see.
So this is why I call it blurry data. And then
the second part of this is cardinality.
And what does cardinality mean? It means how
much data you're getting, right?
So if you have ten
metrics or 100 metrics or 1000 metrics coming
from a particular server or from a particular application, there's again a big difference.
In many cases, with the traditional
monitoring approach, you're cherry-picking what you want to monitor.
So you're saying that I think, or somebody else on the
team thinks, or this particular observability tool
thinks that these are the ten most important metrics that
I need to collect from my Postgres database or
from this Linux server. There might
be a lot of other metrics, there might be 100, there might be a thousand,
there might be 2000 other metrics that are collectible
from that machine. But you're choosing not to.
And we'll talk about why you're choosing not to do this.
But the important thing to understand here is if you have low cardinality,
that means that you have blind spots. There are metrics out there describing
things which you're not collecting, and they're completely invisible to you.
So you might think that I have observability, I have all
this data that I'm looking at, but there's a ton of other data that
you're not looking at and that you're not even thinking about.
So think about it this way. When an actual issue happens and
you're troubleshooting it and you're now doing the post mortem
of why didn't we catch this issue before it became a problem?
Sometimes you end up taking this action: we should have been monitoring X
and Y, and we didn't have those metrics, so let's go and add them to
the dashboard so that this won't happen again. So this is a problem with
low cardinality data where you're not collecting all of those
things to start with. And when you have a combination of
low granularity and low cardinality,
then what you get is like the first half of this
horse over here, you get low fidelity data, which means it's abstract,
it doesn't have the detail and it doesn't have the coverage,
which means that you think you have observability, but in
actual fact you don't.
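To make the "blurry data" idea concrete, here is a tiny Python sketch with made-up numbers, showing how a short spike that is obvious at per-second granularity almost vanishes once it is averaged into a single 60-second sample.

```python
# A 3-second saturation spike, clearly visible at 1-second granularity...
per_second = [100.0] * 60
per_second[30:33] = [5000.0, 5200.0, 4900.0]

# ...is almost invisible once you only keep one averaged sample per minute.
one_minute_avg = sum(per_second) / len(per_second)
print(f"peak at 1s granularity : {max(per_second):.0f}")   # 5200
print(f"value at 60s granularity: {one_minute_avg:.0f}")   # ~347, incident averaged away
```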
Now fidelity and centralization
are both deeply linked to each other. In a way centralization
makes cost and fidelity proportional to each other. And this is the
root cause of all of the problems that we talked about. Because if
you increase the fidelity, which means that you're increasing the granularity
and the cardinality, then you're by default
increasing the cost. Because think about it, you have a
centralized monitoring server where all of this information needs to reside.
Higher granularity means that you need to send more data over
time. So instead of sending one sample every 60 seconds,
you're now sending 60 samples.
So it's 60 times the amount of data that you're sending.
And the same with cardinality. Instead of collecting ten metrics, if you're collecting 1000
metrics, it's 100 times the amount of data that you're collecting.
And that central server needs to be able to handle all of this data.
Your network connection needs to be able to handle all of this new data going
out of the egress.
So there's a direct link here. Reducing costs
leads to a decrease in fidelity, increasing fidelity
leads to higher costs. So in effect what's
happening is we're building in low fidelity by design.
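Here is that proportionality as a quick back-of-envelope calculation in Python, using the metric counts and intervals from the examples above; the numbers are per node, per day.

```python
# Samples a central server must ingest per node per day at different fidelity settings.
def samples_per_day(metrics: int, interval_s: int) -> int:
    return metrics * (86_400 // interval_s)

for metrics in (10, 1000):
    for interval_s in (60, 1):
        print(f"{metrics:>4} metrics every {interval_s:>2}s -> "
              f"{samples_per_day(metrics, interval_s):>12,} samples/day")
#   10 metrics every 60s ->       14,400 samples/day
# 1000 metrics every  1s ->   86,400,000 samples/day  (6,000x more to ship and store)
```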
The second point is scalability.
So again, when you think about it,
think about a central server and what can happen.
There is a clear bottleneck here, which is your central monitoring server.
What happens when something happens to that central monitoring server?
That's when you start facing all of your issues. If you want to
scale again, you run into these problems.
So as an example, you can think about setting up your own
central Prometheus monitoring server, and what
happens when you keep scaling. And especially for companies
that are adding a lot of compute or a lot of storage very quickly
and want to scale very fast, this means that you end up spending more
time than you should trying to figure out how to scale your monitoring environment and
your observability platform, when you should actually be
spending that time on what you need to do to make sure that all
of your applications and your business logic are working properly.
So scalability is a major issue. With centralized monitoring solutions
you can run into bottlenecks. Obviously there are capacity
limits, and there's also latency and delays, because all of
this information from all of these different servers and
applications and databases, all of this needs to be relayed into
a central repository somewhere for it to be observed,
alerted on, and anything else that you want to do with that data,
which means that all of this data needs to travel there. So there is a
latency associated to this. And of
course if you want to get fancy and if you want to build this out,
make this more scalable, then you might end up having to do a lot of
complex load balancing. So centralization makes
scalability harder the higher scale that you're looking at,
because if you're looking at this at a very small scale, and let's say that
I have ten servers and I want to monitor this, you don't really have any
scalability issues with a centralized monitoring solution. But as
your infrastructure becomes more complex, we're talking about
multi-cloud environments, hybrid cloud or Kubernetes clusters,
along with, you know, an IoT network. This is
when scalability starts hitting you really badly with centralized monitoring solutions.
And this is again one of those things which is a silent killer
to start with. You might not see it as a problem at all, but over
time it becomes more and more of a problem.
Let's go to the next one. So this is a fun one: cost,
right? As you can see from all those
news headline clippings that I've pasted here, observability
is really expensive. And why is that?
Well, for one, centralized data storage is expensive.
If you want to store huge amounts of data, especially at higher fidelity,
you're going to have to spend a huge amount of money on it, whether
you're self-hosting it yourself or it's your observability provider
who has to host all of this on their cloud, and they're going to
transfer that cost to you. Of course, there's also
centralized compute. It's not just storage. Let's say you have
a huge amount of data, petabytes of data, and then you want to do compute
on it: you want to compute whether certain
correlations are happening between different data points,
or you want to understand if something should be alerted on. All of
this is again compute that you're doing in a centralized location over
a very large dataset. Of course there are
architectural ways in which you can make this easier, but in
most cases there is still a large compute cost, and again that cost gets
passed on to the user.
The third thing is high data egress. In
many deployments there's a big cost to the amount of data
that you're sending up to the central repository, especially if
it is being hosted by the observability provider on their cloud,
for example. So if you want to get higher accuracy,
it is possible, right? The observability tool,
let's take, you know, Datadog, for example, can collect data
at a higher granularity, or a better granularity,
but it means that you're going to be sending that much more data up to Datadog's
cloud, and that means more egress costs for you, and also more
costs in general. And then we just talked about
scaling. Scaling costs grow disproportionately.
So you can see some of these articles here mention exactly this:
they didn't start out like this. They started off paying a few
hundred, a few thousand dollars for monitoring and observability.
And as the company scaled, the costs grew
disproportionately. So very soon they have a $1 million observability
bill, and now the company is trying to make sense of it: why are we paying
a million dollars for observability? This doesn't make any sense.
So this has happened multiple times in multiple companies
where a lot of smart people are working. So you can see how
this is something that's not obvious or evident
on day one, but over time it's going to catch
up to you. And what's the result of
all of these things? So very often what happens is people
start questioning the value of observability because
it doesn't make sense that you have to pay millions for this stuff.
So teams decide either that we
don't need observability, which is the most drastic option, of course,
or, what happens more often, they decide we're going
to cherry-pick what to observe. We can't just grab all the data,
because it's just costing us too much and we're not seeing the value
from all the data all the time. So let's cherry-pick, based
on our subject matter expertise, or expertise
that we're getting from somewhere, the things that we want to monitor,
and we'll only monitor those. And this,
as we discussed earlier, can be a bad move, because nobody, no matter how
much of an expert you are, can anticipate what metric
would be useful while you're troubleshooting an outage at 3:00 a.m.
That's when you wish that you had all the data already there.
So cost is a huge problem when it comes to
centralized monitoring solutions, even the ones that are open
source, like Prometheus, there is still cost in terms of what
you are self hosting and what you're maintaining and what you have
to scale out over time.
And number four is accuracy. So this is linked to
fidelity in a way, because when you have reduced
granularity or reduced coverage,
then by proxy, by default, your accuracy
becomes lower because you're not getting data every second,
because you're not getting all the metrics, you cannot be,
by definition, as accurate as you could be. But it's
not just about fidelity. There's also other issues that
could happen. For example, let's just think about
alerts for a second. If all of the data is spread
across all of these different servers or nodes
or applications, then rather than triggering
the alerts when something anomalous is discovered in
one particular location, you have to centralize all of this information
in a single place. The thresholds that you're applying to it
might again be generic; they might not be precise enough,
not customized enough to the metric in question, for
you to trigger the alert at the right time. Which means that you could be triggering
the alert late, or you could be missing it completely,
and you might miss the actual event there.
When it comes to things like machine learning, for example, you must be
hearing about a lot of machine learning related to observability,
maybe even some discussions during this conference.
So when it comes to machine learning, as
I'm sure you've heard, it's all about the data,
right? So how accurate your machine
learning is, is based on how much data you have, how granular it is,
and how clean it is, and how good it is.
As an example, let's think about something like anomaly detection.
So anomaly detection is basically a way in
which you can detect if something, a metric value, for example,
is anomalous or not. Is this something that's expected or is it
unexpected? Now, when you do all of this in a centralized
location, you need to have so much context built
in, because the metric could be coming from a Raspberry
Pi or it could be coming from an
Nvidia H100 GPU
rig, and it's a very different environment in
both those cases. So for the value of the metric, whether that's
expected or whether that's unexpected, again, having to decide it in
a centralized fashion means that you have to have so much context built
in, and that increases the processing load of what you're trying to
do. And,
putting together all of these things, what does this lead to? It leads to,
obviously, outages, it leads to downtime, and in general, it leads
to pain for the DevOps engineers and for the SREs and developers
who have to deal with this.
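Coming back to the anomaly detection example, here is a small hedged Python sketch, not Netdata's actual machine learning, of why a per-node rolling baseline needs so much less context than a centralized one: the same raw value can be perfectly normal on one machine and anomalous on another.

```python
# Hedged sketch only: a per-node rolling baseline instead of centralized context.
from collections import deque
import statistics

class RollingAnomalyDetector:
    """Flags a sample as anomalous if it sits far outside this node's own recent history."""
    def __init__(self, window: int = 300, threshold: float = 4.0):
        self.history = deque(maxlen=window)
        self.threshold = threshold

    def is_anomalous(self, value: float) -> bool:
        anomalous = False
        if len(self.history) >= 30:                      # need some local history first
            mean = statistics.fmean(self.history)
            stdev = statistics.pstdev(self.history) or 1e-9
            anomalous = abs(value - mean) / stdev > self.threshold
        self.history.append(value)
        return anomalous

# 70% CPU is business as usual on a busy GPU rig, but a red flag on a mostly idle Raspberry Pi.
pi, gpu_rig = RollingAnomalyDetector(), RollingAnomalyDetector()
for _ in range(100):
    pi.is_anomalous(3.0)
    gpu_rig.is_anomalous(70.0)
print(pi.is_anomalous(70.0), gpu_rig.is_anomalous(70.0))   # True False
```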
And that brings us to the next point, which is resilience.
And this is again, one of those terms which is bandied about a
lot and misused
a lot as well. But again,
this is something that's built into the definition of what
a centralized monitoring solution is. The centralized monitoring
solution has a single point of failure, which is your single server.
Something happens to that server, everything goes down and you have
cascading failures across your infrastructure, which means that
in the worst case, if a disaster happens,
you're left with no way to monitor what you
care about. And the recovery time?
There are no guarantees on that either.
A recent example is what happened last year, when there
was an outage of Datadog's service which
took down their monitoring for most of
their customers for many hours. And again,
in these cases, those users were left without a direct way of easily
understanding what was going on, because you have the centralized view:
all of your data is in one place, and your window into what's
happening and how to observe these systems is that single
point of entry, right? So a single pane of glass is often
touted as a good thing, because you get all your information in one place,
and it is, but it also has its drawback if that's the only window
that you have into observability on those infrastructures. Now, if you
had more localized ways of looking into the
individual pieces of your infrastructure, your centralized
view could be down, it could be down for a day, for example, but as
long as you have the ability to look into those things through other means,
through localized means, you wouldn't be
as impacted as an observability user.
And then we have efficiency. So efficiency
is again something that, on day one
of your observability journey, maybe you're not
paying a lot of attention to. But over time, as it
grows, as your observability setup grows along with your infrastructure,
you've added lots of new collectors or exporters, you have
different kinds of data types in there.
This is when the efficiency
gains start becoming more and more important for you. If there
are delays in data processing, if the data handling is
inefficient across your data pipeline, all of this starts to add up
over time. And resource overload
becomes a real challenge, a real issue, because how
many resources do you allocate to your monitoring server? And how much
can you scale up when you need to?
And another often overlooked
part of this is energy consumption. So you have
IT infrastructure in general, which,
I think, takes up around 30% of the total energy consumption
in the world today, and that's a lot.
And your observability platform
itself is intended to monitor that you're running
optimally all the time. Now, if your observability platform itself
adds to your energy consumption in a significant way,
then this becomes a problematic scenario to be in.
So I think we've covered six
of the deadly sins and we've landed at the final one, which is data
privacy. So this one's obvious,
right? So this is something that is
often talked about, that large tech companies have
an often unhealthy liking for customer data.
And there are
different ways to look at this problem. So for one,
thinking about centralized systems in general, there is a concentration of risk.
You have a single repository where if
there was an attack that happened there, then the attacker
suddenly gains access to all of your data or all
of the data that exists there. And the
concentration of risk is something that you
should be thinking about more and more when it comes to
your data and your data security. The other
aspect to this is compliance challenges.
So you've heard about your GDPRs
and your CCPAs and all of these different compliance standards
that your company, your business has to meet,
and you want to understand whether the
tool that you're using for observability is supporting all of these
standards. Now, if it's a centralized
observability tool, that means that they have access to your data; your data
is being stored somewhere, and by your data, it could be your end
users' data, which means it's your customers' data that you're now
storing, in a way, on a third-party company's side.
So it becomes a question of trust as well. If the company is
big enough, maybe you trust them; maybe you're trusting that these large
publicly listed companies are going to treat this data well, and that's
a choice, right?
And then finally, there's also the question of deployment options.
Maybe you do not want all
of your networks to be exposed, all of your devices or
all of your servers to be exposed to the outside world, or exposed
to the centralized monitoring. So you want a way in which
you can cordon off certain parts of your network or parts of your infrastructure
into a demilitarized zone, for example. Are you
able to do this? Are you able to achieve this with your monitoring
solution? This becomes another question that you should be thinking
about. So I think we've
now talked about all of these different problems
and what's the solution? So the one solution that I'm proposing here in this talk
is to decentralize. And what does it mean to decentralize?
So let's try and understand this a little bit better.
So on the left here, you have a centralized network.
As you can see, there is a central authority or
a central node, and all the other nodes are connected
in one way or another to the single authority.
On the right, we have a decentralized network. There is no
single authority server that controls all the nodes. Every single
node has its own individual identity; it is an entity in itself.
This is really important. So every single node that you see
here on the decentralized network can operate on its
own. It's independent and it's fully capable.
So this is the main difference. Each node is fully capable.
As you can see, there are still many centralization points that can exist
where multiple nodes are connected to a single node,
which means that this node now has access to the data of these other
nodes, and then these nodes could be connected
to each other. And you can have as many of these connections as you
want; it's up to you. But the really important thing is that
each of those individual nodes is a
capable entity in and of itself.
So what does this mean for the problems
that we talked about, the problems of fidelity, scalability, cost,
accuracy, resilience, efficiency, and data privacy?
So think about it. Let's think about
fidelity. So if we are storing
the data on the individual node itself,
this gives us a lot more room to have higher-fidelity
data, because you can collect more data, and you can collect
data in a more granular fashion, because you're just storing it on
the device itself. Of course, you need to think about if you're storing
it in an efficient way or not, but you're not sending it
to be stored in a centralized server somewhere else.
And decentralized networks are by definition built to be highly
scalable. You can scale them up, you can scale them down as you wish.
And cost: again, there is no central server
whose cost is directly
proportional to the fidelity. It's completely decentralized, so you
have the option to keep the costs down.
Higher fidelity is one reason why it could
be more accurate, but also if you want to do things like alerting
or anomaly detection, since each node is individually
capable of doing this, this means that you're doing it on device,
you're doing it on the edge. So if there were alerts being triggered on one
of these nodes, that decision is being taken on the edge,
which means that it's more accurate.
And again, it's part of the architectural definition of
a decentralized network that it's more resilient by nature,
which means that you can take out one of these nodes, but the others would
still operate. You can break the connections, but the
nodes would again be able to operate or connect to other nodes
as and when needed. The efficiency
is another important factor. So you can cut down on things like
centralized bottlenecks, you can cut down on things like latency
issues, by having a decentralized network.
And data privacy: because you're storing all your data on device,
your data privacy requirements look very different all of a sudden.
You don't even need to worry about a lot of the regulations because
you are not exporting your data to be stored in a third party cloud somewhere.
So there's a lot of advantages to be had from decentralized
networks, and you don't need to be scared of
decentralization being something that's very complex or hard
to understand or hard to deploy.
You know, let's dive into this a
little bit more so that we can understand this better.
So let's talk about decentralized design for high fidelity.
Specifically, the most important aspect
of this is keeping data at the edge. So you have compute
and storage already available on these things that you're monitoring,
whether that's a container, whether that's a virtual machine, or whether that's
a high end server. These things have compute available,
they have storage available, and this is enough to keep the data at the
edge. You can keep the data stored there, and you can also have
the processing happening there, and you can optimize it
in such a way that the monitoring doesn't affect
the actual business logic that needs to operate on those devices.
So keep data at the edge. That's number one. Number two,
make the data highly available across the network, right?
Because you might have ephemeral nodes that are not going to exist
forever. They might come up, they might go down. Once they vanish, you still
need access to their data, which means that their data needs to be stored somewhere.
That's where those other nodes in your decentralized network become important.
They also help with higher availability. So if a node goes down, you know that
there's another node which has access to that node's data.
And you can also use this for more flexible deployment scenarios
where you can offload sensitive production systems
from observability work. You can say that I have these ten servers here which
are doing top secret work. I don't want to do any monitoring
logic on this. Just export the data somewhere else,
export the alerting, export the anomaly detection, and do it all elsewhere.
And then number three, you need a way to unify and integrate
all of this at query time. So you have all of this data,
it's stored, it's being processed in a decentralized
fashion across different nodes in different places.
How do you get that single pane of glass
view when you need it? So there has to be
a way that you can unify and integrate everything at query time.
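One way to picture "unify at query time" is a scatter-gather query: fan a small query out to every agent or parent in parallel and merge the answers, instead of ever shipping the raw data to one place. Here is a rough Python sketch; the agent URLs and the stubbed query function are placeholders, not a real Netdata API.

```python
# Rough scatter-gather sketch: query many agents in parallel, merge at query time.
from concurrent.futures import ThreadPoolExecutor

AGENTS = ["http://10.0.0.11:19999", "http://10.0.0.12:19999", "http://10.0.0.13:19999"]

def query_agent(url: str, metric: str, last_seconds: int) -> list[tuple[float, float]]:
    # Placeholder: in a real setup this would be an HTTP call to the agent's query API,
    # returning (timestamp, value) pairs for just the chart being viewed.
    return [(0.0, 1.0)]

def query_all(metric: str, last_seconds: int = 60) -> list[tuple[float, float]]:
    with ThreadPoolExecutor(max_workers=len(AGENTS)) as pool:
        per_agent = pool.map(lambda url: query_agent(url, metric, last_seconds), AGENTS)
    merged = [point for points in per_agent for point in points]
    return sorted(merged)            # one unified view, assembled only when asked for

print(len(query_all("system.cpu")))
```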
These are also some of the challenges of making decentralized observability
work. Now we'll talk about Netdata.
Netdata is
the company that I work for. It started off as an open source project
and it became very popular on GitHub.
It has more than 68,000
stars on GitHub, and people
are using it for all kinds of things. They're using it to monitor
tiny things such as their home labs or their Raspberry Pis, but there
are also teams and companies using Netdata to monitor entire data centers,
Kubernetes clusters, multi-cloud environments, high-performance
clusters. So really, it's completely
up to you how you use Netdata.
So how does Netdata aim
to achieve the decentralized philosophy that
we've been talking about?
The main component of the Netdata monitoring solution
is the Netdata agent. And I have "agent" in double quotes here because
the Netdata agent is so much more than what normal
monitoring agents are. It's open source, and it collects data
in real time, which means that the granularity is one second by default;
all the metrics are collected per second. It
auto-discovers what there is to monitor in
the environment where it's been installed, it collects all of this data
every second, and it stores this data in its own time series database.
All of this is open source, so you can look at it if you
want to. And it collects metrics
and logs, and it also does alerting, and the
notifications for those alerts are sent; all of this happens on the agent.
And anomaly detection and machine learning also happen within the
agent, at the edge. This is again something that's not very common.
And the agent can also stream data to other agents. So this is where
the decentralized concept comes in. The agent is a fully functioning
entity in itself, but it can also send its data to be stored on
another agent via configuration.
And you can have a cloud which unifies
all of these different agents and gives you the ability to
query from any agent across all agents in real time.
And we'll talk a little bit more about the cloud component.
So this is what the distributed metrics pipeline looks
like inside Netdata. You can think of it, in
a way, like Lego building blocks. So you have a local Netdata:
it's discovering metrics, it's collecting these metrics, and then
it's detecting anomalies on them and storing them in the time series database.
And, you know, it's checking for alert transitions, it's querying
for anomaly scores or correlations and things like this.
And it's also able to visualize this in a dashboard. It's all inside this agent.
But then at the same time it can also collect metrics from
a remote Netdata, which means another agent.
So this is the decentralized aspect of it, where you can plug these
agents together into a sort of Lego creation.
So you can collect data from a remote Netdata, and you can stream all
of this data, both the collected and the local, to
another remote Netdata. So it's really up to you
how you construct this network, this monitoring network of
yours.
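If it helps, here is a very rough Python sketch of that Lego-block idea, pseudocode-level rather than Netdata's actual implementation: every agent stores and alerts on its own data locally, and can optionally stream what it collects on to a parent agent.

```python
# Rough sketch only: each agent is self-sufficient and can optionally stream to a parent.
import time

class Agent:
    def __init__(self, name: str, parent: "Agent | None" = None):
        self.name, self.parent = name, parent
        self.tsdb: list[tuple[float, str, float]] = []   # local time series storage

    def ingest(self, metric: str, value: float) -> None:
        self.tsdb.append((time.time(), metric, value))   # store at the edge
        self.check_alerts(metric, value)                 # alerting decided here, not centrally
        if self.parent is not None:
            self.parent.ingest(metric, value)            # optional streaming to a parent

    def check_alerts(self, metric: str, value: float) -> None:
        if metric == "disk.space.used_percent" and value > 90:
            print(f"[{self.name}] ALERT: {metric} = {value}")

parent = Agent("parent-dc1")
child = Agent("web-01", parent=parent)     # an ephemeral node keeps a copy upstream
child.ingest("disk.space.used_percent", 93.0)
```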
And the really important thing which allows Netdata
to deploy this decentralized philosophy is
that the Netdata agent is really lightweight, even though
it's highly capable. So we ran a
full, very detailed analysis, and I'll share the link
along with this presentation, so you can take a look
at how this was done. But you have some of the data points
here. You can see that the CPU usage, the memory
usage, the disk usage, and the egress bandwidth that's generated
are all really, really low, even though it's
doing all of those things that we talked about. It's doing the metric collection,
the storage, the alerting, the anomaly detection and machine
learning and the dashboarding. All of that is happening on each agent.
But it's still very light in terms of the resources it uses,
and you can configure it to make it lighter still. So if
you say that this is an IoT node,
and I want to make sure that it runs super light, then you can configure
it so that it doesn't run alerting, it doesn't run ML,
it doesn't do any storage; it just streams the data to another, more powerful
node which does all of those things for it.
And just by installing Netdata on an empty VM, you get 2,000-plus
metrics. You get more than 50 pre-configured
alerts. There's anomaly detection running for every metric.
And by default, if you just have three GB of
disk space, you get up to two weeks of
this per-second, real-time data, you get three months of per-minute
data, and you get two years of per-hour data. So,
you know, in terms of your data retention, that's a pretty
good deal.
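As a quick back-of-envelope check on those retention numbers, using only the figures from the talk (2,000 metrics, per-second collection, two weeks, roughly three GB):

```python
# 2,000 metrics collected every second, retained for two weeks, in about 3 GB of disk.
# The bytes-per-sample figure is derived from these numbers (ignoring the much smaller
# per-minute and per-hour tiers), not an official Netdata specification.
metrics, days = 2_000, 14
samples = metrics * 86_400 * days
disk_bytes = 3 * 1024**3
print(f"{samples:,} per-second samples retained")        # ~2.4 billion
print(f"~{disk_bytes / samples:.1f} bytes per sample")   # ~1.3 bytes after compression
```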
Now, we've talked about the Netdata agent. The other component
to this, which allows this decentralized architecture, is the
Netdata parent. So Netdata parents are nothing
but other Netdata agents which are aggregating data across multiple
children. So you can start to see the decentralized network build
out here. You have three parents. Each parent has multiple children. So this
has multiple children. So this
parent, for example, has children that are part of a data center.
The other parent has children that are part of a cloud provider, and the
third parent has children that are part of another data center.
And all of these parents could be connected to each other so that they
have access to the data across these three different environments.
Now, having these parents, or
mini centralization points, obviously
gives you enhanced scalability and flexibility, because now you can
really build the Lego blocks into something magnificent.
It ensures that all of the data always remains on-prem;
you're always storing all of your data on your own premises.
And by design, it's resilient and fault
tolerant. You can take out any of these instances, but the remaining
instances would continue to function on their own.
And this really helps you to build
a monitoring network which is optimized in terms of
performance, in terms of cost, and also if you want to isolate
certain parts of your network from the rest, from your broader network
and from the Internet, it allows you to do this as well.
And the third and final component of Netdata's decentralized
architecture is Netdata Cloud. So Netdata Cloud,
again, with cloud in double quotes or air quotes,
because it's not a centralized repository.
Netdata Cloud does not centralize any observability data. It doesn't store
any data in the cloud. All it
does is it maintains a map of the infrastructure. So the cloud is
the one entity that knows where everybody, all the other nodes, all the parents,
and all the agents are. And it has the ability to
query any of these agents
or all of those agents or any grouping of those agents at
any time, in real time, right? Which means that
I could be just logged into the cloud and say, I want to see
all the data from all the nodes in data center
one and data center two. I don't want to see cloud provider one, or I
could say I want to see all of it together. So the
cloud is able to send this query to these
nodes. And since you have these parent agents,
the cloud doesn't need to query 15 different servers
here; it just needs to query three. So this decentralized
architecture keeps the queries much more efficient and
gets the data back quickly, within a second, because nobody wants to
wait multiple seconds or multiple minutes
for a dashboard to update.
So the cloud in effect enables horizontal scalability
because you could have any number of these parent
agent clusters, and as long as all of them
are connected to the cloud, it should be relatively easy to just query them within
a second and see the data, which means that you have high
fidelity data across your entire infrastructure. It's super
easy to scale, and you have access to all
of it from a single central cloud without having
to store your data in the cloud, right? The cloud is just querying
the data in real time and showing it to you.
So some of the common concerns about decentralized design
are: one, that the agent will be really heavy; you have to run this thing
on your servers, on the machines that are hosting your
application. But clearly we saw that, no,
the Netdata agent processes thousands of metrics per second
and is super light. The second concern
is that querying will increase load on production systems.
But each agent serves only its own data,
so the queries do not increase load on the production systems
themselves. Querying for small data sets is lightweight,
and you can use the parent agents
as centralization points within your decentralized network,
so that certain nodes are isolated from queries; the queries do
not even reach them. The third concern is that the queries will
be slower. This isn't the case either.
Actually the queries are faster because we're distributing tiny
queries in parallel, massively to multiple systems,
which means that your refresh times and your load times are
much, much better. And the final thing is that it'll require
more bandwidth. But this is again not true, because the querying is selective;
you're only querying for data that you're seeing on the screen.
It doesn't need to query for all the 2,000 metrics that it's collecting,
it just needs to query for what the user is looking at right now.
And if the user goes to a different chart or a different dashboard, then it queries
for that instead.
So this is a quick look at the Netdata dashboard.
I'm not going to go into a detailed demo. We have a public demo
space that's available that you can check out yourself from our website without
even logging in. But if you want to log in, if you want to create
a login, then you get access to
a space in Netdata where you can copy a single
command. And when you paste that command, it automatically installs the Netdata
agent on your server, on your device. And you get
this dashboard, which is out of the box, right? So this is not a custom
dashboard or a curated dashboard; it's what you get immediately
after you install Netdata on your system. And here
you can see that the data that's coming in is
from 16 nodes, across two labels,
from 16 different systems, and all of that data is
stored on those nodes, on those systems, in a completely decentralized
fashion, and it's being queried by the cloud in real time without
any of that data being stored on the cloud.
So I would welcome you to
explore how decentralized monitoring looks and feels
by trying out Netdata and, in general,
to think about how to make your own monitoring
setups more decentralized, even if you're not using Netdata.
So where does this take us to?
The last section is about the future, about the road ahead.
So what's the catch? Where are all the other decentralized observability
platforms? So part of the reason
for this is that creating a decentralized observability
platform is not easy.
Changing from a centralized architecture to a decentralized
architecture is even harder because you put
all your eggs into the centralized basket and you
don't really want to change. Even if you do,
it's not easy to do because you have to ensure that resource consumption at
the edge is minimal. You have to
handle complex queries and aggregation. And all the
while the deployment has to be really simple. Right?
And this is something that's hard for a lot of commercial companies to do.
You have to learn to relinquish control. You have to say that I'm
okay with not having control over the data or
over the processing; it all happens on the customer's own premises, on the
user's own premises. And this is not an easy thing to do.
So this is part, or maybe a big part,
of why we're not seeing more decentralized observability platforms.
And also, like I said, the big players in the industry
will find it really hard to move away from their existing architecture to
do something like a decentralized monitoring solution.
I believe that the future is decentralized and
that hard problems can be solved,
and they should be solved. I would
ask all of the listeners not to
compromise on fidelity, because compromising
on fidelity will only create more problems for you in the long term.
I would ask you to demand more and demand better from your observability provider,
whoever that is. And if you're operating your
own monitoring stack, then try to
apply some of the decentralized principles that we talked about in this talk today
and you would see a long term benefit.
And think about it: when your
environment, when your infrastructure, is distributed, it's multi-cloud,
it's hybrid, it's made of auto-scaling environments,
why is your observability centralized?
Why is it not decentralized? That should be the question that you're
asking yourself. So thank
you so much for listening to this talk.
If you have any questions about Netdata, or if you'd like to find out
more, this is our website that you can visit.
And here's the link to our GitHub page where you can download the
open source Netdata agent and run it on whatever
system you have and get immediate access to
thousands of metrics in a decentralized way.
So I really hope that you try this out.
And if you have any questions, if you have any suggestions,
or if you have any disagreements about anything that I've spoken about
in this talk, here's my email ID and my
LinkedIn profile as well. So I'd love
to hear some feedback. Thanks for listening.
Thanks for being a good audience. Thank you and have a good day.