Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hello, good morning, good afternoon,
good evening, wherever you are.
Welcome to Conf42 Observability 2024, the online conference about observability.
My name is Swapnil and I am going to talk a bit about observability today. I would like to thank the organizers of this conference for including this session, and specifically for including it in the monitoring track, because that is the primary motivation of this session: to understand the difference between monitoring, the current state of observability, and what observability can become.
So what is monitoring? We have been doing monitoring for our applications and infrastructure over the years. That can mean getting the heartbeat or the logs of a system, but it was limited to that. So why do we need observability when we have monitoring, when we already get the data that we need?
Right. So before we dive further into this session, let's look at what observability is by definition, and then we will see different aspects of it. Observability is defined as the ability to determine a system's internal state by examining its outputs. So what are these outputs? In technical terms, these outputs are called signals. They are primarily classified into four types: metrics, events, logs, and traces.
This data can be captured using different agents. They produce that data and provide a mechanism for collecting it. The collected data is stored in a time-series or similar database, where you can see the value of these signals at different time intervals and make decisions based on that. Once the signals are aggregated in the time-series database, the first thing people want to do with this data is visualization.
You might have seen Grafana, right? It gives you different ways to visualize the data by creating graphs, panels, or other visualizations. In addition to this, an important part of observability is using this data to alert the user about any abnormal behavior of the system, based on thresholds or based on certain events. So this is the primary form of observability that we see at most places.
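As a rough sketch of this pipeline, the time-series storage and static-threshold alerting described above can be modeled in a few lines of Python. All the class, metric, and function names here are made up for illustration; a real platform would persist and downsample these series.

```python
import time
from collections import defaultdict

class TinyTSDB:
    """Minimal in-memory stand-in for a time-series database:
    one list of (timestamp, value) samples per metric name."""
    def __init__(self):
        self.series = defaultdict(list)

    def ingest(self, metric, value, ts=None):
        """Store one sample of a signal, stamped with the ingest time."""
        self.series[metric].append((ts if ts is not None else time.time(), value))

    def range_query(self, metric, start, end):
        """Values of a signal over a time interval, for graphing or alerting."""
        return [(t, v) for t, v in self.series[metric] if start <= t <= end]

def threshold_alerts(db, metric, threshold):
    """Classic monitoring-style alerting: flag every sample above a static threshold."""
    return [(t, v) for t, v in db.series[metric] if v > threshold]
```

This only shows the shape of the data flow: agents call `ingest`, dashboards call `range_query`, and the alerting loop calls `threshold_alerts`.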
But this has a number of challenges because of the amount of data that we get. And it is not only the amount of data, but also the quality of the data and how effectively we can use it. So let's have a look at some of the challenges related to observability.
The first challenge is the increasing complexity of software systems. What happened initially was that people took their existing monitoring setup, made a few changes, and thought they were ready for observability. But they were not. Why? Because the observability stack needs to keep up with the increasing complexity of your systems. In today's distributed systems, new applications and infrastructure components are added every day. The observability stack needs to understand the new components that are being added, what data it is getting about these components, and how to provide the same visualization and alerting mechanisms to the end user for them. That complexity needs to be understood by the system, and we'll see why this complexity matters.
With cloud native technologies, a number of components are added every day. They can be installed with the click of a button or with some automation using DevOps, and they immediately start creating a large number of metrics and logs and sending them to the monitoring or observability system that you have deployed. So it provides a lot of data which, if not understood by the system, is of no use, right? In addition to just receiving the data, the end user expects end-to-end visibility into the complete stack: not only errors, but every activity on the system needs to be tracked. You should be able to track each event from the user to the backend system, or from the backend system to the user, and see how the data flows and which points of the system it touches. You need to know that.
And one of the primary drivers for observability beyond monitoring is cost. The cost of downtime can be very high for both the end user and the product teams if the downtime is not understood. That was the primary motivation for having an observability system beyond monitoring: one that can give additional information, insights, and actions for the data that we consume from the monitoring systems. In addition to that, what we have seen in recent years is that the cost of the observability platform itself can be very high. If users do not know what data they are ingesting, how much data they are ingesting, and how they are processing that data to create dashboards and alerts, then running the observability platform can become very costly and burn a big hole in your pocket.
So that is another challenge that people are facing, and it raises the question: why do we need an observability platform at all? Aren't we fine with what we have with monitoring, with people who can look at this information and take actions?
An additional driver that we have seen recently is the security part of it. Security has always been key to all enterprise and product deployments, but with the adoption of cloud native technologies and all the data sitting in the cloud, security brings another aspect to it. You need to know where your data is, how your data is flowing, whether it is flowing correctly, and whether it is security compliant. You can get some of that information from the logs or metrics, but you also need a mechanism to get alerted for the scenarios that affect security. For example: my data in transit is secured with TLS. When is my certificate going to expire? Who is going to renew the certificate when it expires? That is just one use case, and there are many similar to it.
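A sketch of that certificate-expiry check in Python, using only the standard library. The 30-day threshold and any host you pass in are illustrative choices; `ssl.getpeercert()` does return a `notAfter` string in the format shown.

```python
import ssl
import socket
from datetime import datetime, timezone

def days_until_expiry(not_after: str) -> int:
    """Days remaining, given a certificate's notAfter string in the
    format returned by ssl.getpeercert(), e.g. 'Jun  1 12:00:00 2026 GMT'."""
    expires = datetime.strptime(not_after, "%b %d %H:%M:%S %Y %Z")
    expires = expires.replace(tzinfo=timezone.utc)
    return (expires - datetime.now(timezone.utc)).days

def fetch_not_after(host: str, port: int = 443) -> str:
    """Fetch the peer certificate's notAfter field via a TLS handshake."""
    ctx = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=5) as sock:
        with ctx.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]

# An observability rule might then be: alert when fewer than 30 days remain.
```

An agent could run `fetch_not_after` on a schedule and feed the result into `days_until_expiry`, emitting a security alert when the remaining days drop below the threshold.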
And in addition to security, there are compliance activities. Many enterprise systems, as well as financial systems, require you to be compliant with certain standards, and this involves a number of steps. Each step is equally important, and missing one can bring down the whole compliance effort. So you need the ability to capture compliance requirements as part of observability, see what data you are ingesting, and make decisions or raise alerts based on that. The current systems are not fully capable of doing this, and that is a challenge: you need specific personnel to look into it, make the changes, and maintain them over time.
So these are the current challenges.
And to overcome these challenges, what we are proposing is Observability 2.0. What is it? We have seen that we are getting different signals from different agents. Each agent sends its data in a different format: for the data, the metadata, and everything it sends to the system. If the format is different, it becomes very difficult to manage the data and then present it in the same user interface.
For example, Grafana has done a very good job of aggregating data from different data sources. You can configure different data sources, consume the data, and show it on a visualization panel. But if you want to use the data from different sources in a single panel or a single UI frame, it becomes difficult and sometimes not manageable for the user, primarily because the information that you get from the sources differs.
So what we need is a unified database, where whatever data you send from any agent (we are basically calling it a poly-agent) is consumed and stored in a uniform format inside the database. That unified database adds a different kind of value to the observability platform, in terms of setting the context of the information that is received, correlating it, and enriching it.
So what is the context? For example, for every log line that I am receiving, I need to know where it is coming from and what the application is. If you are deploying it into Kubernetes: what is the Kubernetes cluster name, what is the pod, what is the namespace? If you are sending it from a host: what is the host IP address or hostname? What is the cloud? If you are using AWS, what is the region where the host is situated? All of this is the context about that particular log line, and it needs to be stored so that you can correlate it with additional services. Based on this data, you can see, for example, which services are failing in a particular AWS zone. If I want to see that, I need to have this data stored.
And sometimes the raw data does not carry this much information. The application just sends the data to the monitoring or observability platform, but it doesn't bother sending all the details about itself. Then it becomes the job of the agents to enrich the data, or the job of the observability platform to scrape that additional information and enrich the data, so that you have all the data with its context when you correlate or query it.
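As a sketch, enrichment can be as simple as merging agent-discovered context into every record before it is stored. All the keys and values below are illustrative; an agent might discover them once at startup, e.g. from the Kubernetes Downward API or a cloud metadata endpoint.

```python
import os

def enrich(log_record: dict, static_context: dict) -> dict:
    """Attach deployment context to a raw log record so it can be
    correlated and queried later; keys the record already carries win
    over the agent's static context."""
    merged_context = {**static_context, **log_record.get("context", {})}
    return {**log_record, "context": merged_context}

# Context an agent might discover once at startup (values made up):
agent_context = {
    "k8s.cluster": "prod-eu",
    "k8s.namespace": "payments",
    "k8s.pod": os.environ.get("HOSTNAME", "unknown"),
    "cloud.provider": "aws",
    "cloud.region": "eu-west-1",
}

record = enrich({"msg": "payment failed", "level": "error"}, agent_context)
```

With every record carrying the same context keys, a query like "all failing services in this AWS region" becomes a simple filter on `context` fields.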
What this enables is dynamic and intelligent alerting. What is dynamic and intelligent alerting? Based on the correlated information, you can set an alert that looks at the data and then makes a decision, rather than relying on some fixed threshold defined behind the scenes. The threshold can be dynamic, or it can be based on some dynamic activity. For example, if a pod's utilization has gone up by 20% within five minutes, it will create an alert, because the system is monitoring the current utilization of that pod. The same goes for hosts. That adds the intelligence to the system to change the threshold at any point in time.
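A dynamic threshold like this can be sketched in Python: instead of a fixed value, the alert compares the latest utilization sample to a baseline computed from a recent window. The window size and the 20% rise are illustrative parameters.

```python
def dynamic_threshold_alert(samples, window=5, rise_pct=20.0):
    """Alert when the latest sample exceeds the mean of the previous
    `window` samples by at least `rise_pct` percent."""
    if len(samples) <= window:
        return None  # not enough history to form a baseline yet
    baseline = sum(samples[-window - 1:-1]) / window
    latest = samples[-1]
    if baseline <= 0:
        return None
    increase = (latest - baseline) / baseline * 100
    if increase >= rise_pct:
        return {"latest": latest, "baseline": baseline, "increase_pct": increase}
    return None
```

Because the baseline is recomputed on every evaluation, the effective threshold moves with the workload, which is the "dynamic" part described above.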
And all this data can be fed into machine learning algorithms to find patterns and create actions out of them. We'll see some examples of these actions in the next slides, but this is the primary motivation behind Observability 2.0 and how we are trying to build it into a platform.
The most important part in all this is that we should not end up in the same place we started while trying to solve the problem. So we are making sure that all the data we consume, utilize, and put into the product uses open standards. Any new or existing system that uses these open standards should be able to use the data if you want to migrate from your current system. So that should be the case for observability. How can we achieve this?
It's a combined effort. It's not just that the operations teams deploy the stack and the development teams start using it. To adopt Observability 2.0, the development practices need to change as well, into what we call observability-driven development. All the new applications and all the new infrastructure that we add need to be designed for observability: they need to send the required signals, with the correct data, to the platform, so that it can help you find issues easily. We will leverage the existing monitoring data, but at the same time we expect collaboration, with the operations teams setting up the infrastructure correctly and the development teams adding the right set of instrumentation that sends the required data to the observability platform. And as always, this is continuous improvement of the system. You might not get everything on the first go, but you will evolve as you go, adding additional information both from the instrumentation side and from the infrastructure components.
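One way to picture observability-driven development: every significant action in the application emits a structured event carrying the context the platform needs, from day one, not just on errors. A minimal sketch using the standard library; the service name, version, and event names are all made up.

```python
import json
import logging
import time

logging.basicConfig(level=logging.INFO, format="%(message)s")
log = logging.getLogger("checkout")

def emit(event: str, **fields) -> dict:
    """Emit one structured, machine-parseable signal; the platform can
    parse, enrich, and correlate these without guessing at free text."""
    record = {
        "ts": time.time(),
        "service": "checkout",   # illustrative service identity
        "version": "1.4.2",      # illustrative build version
        "event": event,
        **fields,
    }
    log.info(json.dumps(record))
    return record

# Designed for observability: the happy path is instrumented, not only failures.
placed = emit("order.placed", order_id="o-123", amount_cents=4999)
```

In practice teams would use a standard instrumentation SDK such as OpenTelemetry rather than hand-rolling this, but the habit is the same: signals are part of the design, not an afterthought.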
This has helped us achieve faster issue resolution, because we are able to correlate the data and come to a conclusion: okay, this is the application that is getting the errors, and it is only in this AWS region, on this node, in this particular pod, because of the correlation that we could find. This helps with system performance as well as the resilience of the system, and it helps with collaboration activities as well.
If we do this, it also helps with the automation part, where, as we saw, you can apply machine learning and AI on the observability platform. So what does that mean? With the data that we have, we enhance root cause analysis for any errors or failures that we see in the observed system. We get a lot of additional detail and correlation to the other services, so that we can track an entire failure from one service to another.
This also helps with setting up anomaly-based alerts. You find the services that are showing unusual patterns and then you alert based on that. You can include algorithms as part of your observability system that analyze this data and produce output that can be used by the alerting system.
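A very simple stand-in for such an algorithm is a z-score check: flag the latest sample when it deviates too far from the history. Real systems would use richer models; this only shows the interface between the analysis step and the alerting step.

```python
import statistics

def is_anomalous(samples, z_threshold=3.0):
    """True when the latest sample lies more than z_threshold standard
    deviations from the mean of the preceding history."""
    history, latest = samples[:-1], samples[-1]
    mean = statistics.mean(history)
    stdev = statistics.pstdev(history)
    if stdev == 0:
        return latest != mean  # flat history: any change is notable
    return abs(latest - mean) / stdev > z_threshold
```

The boolean output of a function like this is what the alerting system consumes, in place of a human-chosen static threshold.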
You can use this to optimize system performance, so that you can tune the system regularly based on the kind of data you see in the dashboards. And as we have seen, there is the intelligent alerting part: you can create intelligent alerts for changes or forecasts on averages. There is an interesting and very important use case that we have achieved with this: you can do predictive maintenance of your systems or infrastructure based on the data that you consume and see. Most of us are deploying applications on Kubernetes. If you are using StatefulSets, you mandatorily need a volume where you store the state data.
And it is very possible that the volume will not be sufficient for the amount of data that you ingest over a period of time. If the data is about to exceed the volume's capacity, you need to increase the size of the volume manually, or maybe sometimes with automation. That is not something that is taken care of by the cloud today.
So what we have done is apply a forecasting algorithm to the volume-size data that we receive, and we forecast the timeframe in which the volume will be filled: what will the volume utilization be in the next five days, or the next three days? Based on that, you can do predictive maintenance: you can go ahead and increase the volume size beforehand so that you don't run into errors. This kind of predictive maintenance is possible in an observability system when you have data that is not only correct but also correlated, and that can be fed to the ML.
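The volume forecast can be sketched as a least-squares linear fit over daily usage samples, projecting the day the volume hits capacity. This is a deliberately simple model for illustration; production forecasting would account for noise and seasonality.

```python
def days_until_full(daily_usage_gb, capacity_gb):
    """Fit usage = slope * day + intercept by least squares, then return
    the projected number of days from the latest sample until capacity is
    reached (None when usage is flat or shrinking)."""
    n = len(daily_usage_gb)
    xs = list(range(n))
    mean_x = sum(xs) / n
    mean_y = sum(daily_usage_gb) / n
    denom = sum((x - mean_x) ** 2 for x in xs)
    slope = sum((x - mean_x) * (y - mean_y)
                for x, y in zip(xs, daily_usage_gb)) / denom
    if slope <= 0:
        return None  # no growth: nothing to forecast
    intercept = mean_y - slope * mean_x
    fill_day = (capacity_gb - intercept) / slope
    return max(fill_day - (n - 1), 0.0)  # days beyond the latest sample
```

Feeding this number into the alerting system gives the "volume will be full in N days" signal that drives the predictive-maintenance action.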
That also helps you with capacity planning, similar to the use case I mentioned. You can increase not only the volumes but also the number of nodes, the number of replicas of the pods, and the number of hosts that are needed, beyond the autoscaling provided by the cloud provider or orchestrator platforms. You can define dynamic thresholds, as you have already seen: the threshold is automatically updated using a change-detection algorithm. Based on the current utilization of the system, if it sees another 20% growth coming in the next five days, it will throw an alert: okay, this is not behaving correctly, please have a look at it.
In addition to this, we have alerting capabilities which can create incident responses for the on-call teams for critical alerts. The system will automatically send the alert to tools like PagerDuty, or create Jira tickets, and the teams can have a look at them.
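Routing could look like the sketch below: the alert's severity decides which notifier callable handles it. The callables here merely record what would be sent; they stand in for real PagerDuty or Jira client code, which is not shown.

```python
def route_alert(alert, notifiers):
    """Dispatch an alert to a notifier by severity, falling back to the
    'default' handler when no severity-specific one is registered."""
    handler = notifiers.get(alert["severity"], notifiers.get("default"))
    if handler is None:
        return None
    return handler(alert)

# Illustrative wiring; real handlers would call the vendors' APIs.
sent = []
notifiers = {
    "critical": lambda a: (sent.append(("pagerduty", a)), "paged")[-1],
    "default": lambda a: (sent.append(("jira", a)), "ticketed")[-1],
}

outcome = route_alert({"severity": "critical", "msg": "volume filling"}, notifiers)
```

Keeping routing as plain data (a severity-to-handler map) makes it easy to change escalation policy without touching the alert-generation code.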
And in addition to the capacity planning, you can do proactive workforce management as well. So this is basically the set of things that are intended to be in an observability platform. The current observability platforms that we see have a few of these things in pieces, but not all of them. And the most important thing that we are looking at in Observability 2.0 is open standards: you should be able to work with things like Prometheus and OpenTelemetry for all your observability operations.
Right. So this is mostly what I had for this session. Just a quick word about the organization that I represent: we are a small observability startup, Cloudflues. We have a product, and you can have a look at it from our download page, or you can even play with it on the playground that you can see in the links. And yeah, that's pretty much what I had from my side.