Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hello everyone. I am very excited to talk about the hidden cost of instrumentation at Conf42
DevOps 2023. My name is Prathamesh Sonpatki.
I work at Last9 as a developer evangelist and software engineer. We build SRE tools
to provide visibility into the Rube Goldberg machine of microservices.
Let's get started. My first question
is, why do we even have to care about instrumentation? Can we
not just ship our software to the cloud, relax, and
chill? But that's not always the case, right? How do we even know
that our application is running as expected? We may also have
customer-facing service level agreements that we are committed to,
and we may have to give them proof that our system
is working as expected. Even to give that proof,
we ourselves need to have some information about how the system is
running. Additionally, as SREs and DevOps people,
we also need a good night's sleep. We cannot always be staring
at our screens and digging through information to see
if something is even working as expected. So all of these factors
contribute to the fact that modern software systems definitely
need some sort of instrumentation just to know that things are working
fine. Hope cannot be the strategy.
As the Google SRE Bible says,
we cannot just hope that everything is working as expected.
We need to take conscious effort to measure,
and then make sure that things are working as expected.
So the reliability mandate basically starts with instrumenting as
its first step, because we can only improve what
we measure, right? We cannot even understand
how the system is behaving if we don't measure what
we want to measure in our software systems.
Let's go over the landscape of instrumentation, because modern software
systems are very complicated. They are not
just a standalone application running on a server
or a VM, with nothing else running alongside it.
What we have is actually a burger: our application is like a patty
running inside a bun, and the bun we can think of
as the cloud or a virtual machine. So there are
variants of buns like AWS,
GCP, Azure, and then the patty is where the
real magic resides, which is our application.
We throw in some mayo sauce, some external services, data stores
such as RDS and other databases, along with some ketchup,
fries and everything. And then we get a burger.
Sometimes this is not just a single burger that we
have. We may also have multiple burgers at the same time, because
our system can have different microservices talking to
each other, as well as some other services. So this is how
the landscape of modern software systems generally looks.
We basically deal with burgers every day. We may not eat them, but we have to
at least run them all the time. So this
is where we head into the rabbit hole of full-stack observability,
because monitoring just the application will
give insights specific to the application. But we
may not know that some requests are getting dropped at our load balancer layer,
or that our database read IOPS and write IOPS are constantly below
the required threshold. To know that,
we need to take a cut across the burger
and monitor all the components together, so
that we get better insights. Modern software applications,
I like to call them living organisms that grow and
shrink in all possible directions. Grow and shrink
specifically because of autoscaling and scaling constraints.
We do have ephemeral infrastructure that comes into
existence and then goes away when it is not needed.
Another interesting point is that the applications also communicate
with similar applications at the same time. So it's not just
one application that we have to deal with, we have several of them
talking and chatting with each other all the time.
So basically, how do we monitor them? The only option that
we have is this temple of observability,
and we have to bow in that temple of observability
to make sure that everything is getting instrumented. Generally, the
standard pillars of observability are logs, metrics and
traces. Logs help us debug a root cause very quickly.
They can be structured or unstructured, depending on whether
they are debug logs or request-scoped logs,
and they are very easy to adopt. We can
throw in standard libraries for logging, and
components such as Nginx and load balancers have standard formats for
logging, so the adoption is extremely easy. Consistency is slightly
tricky, because every microservice and every system can have
its own format of logging, so you will not necessarily
get the same consistency across all services.
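To make the structured-versus-unstructured distinction concrete, here is a minimal Python sketch using the standard logging library; the service name, field names, and values are placeholders for illustration, not anything specific from the talk:

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Render each record as a single JSON line (a structured log)."""
    def format(self, record):
        payload = {
            "level": record.levelname,
            "message": record.getMessage(),
            # Extra fields passed via `extra=` become attributes on the record.
            "request_id": getattr(record, "request_id", None),
            "path": getattr(record, "path", None),
        }
        return json.dumps(payload)

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("checkout")  # hypothetical service name
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# An unstructured, debug-style line versus a structured, request-scoped line:
logger.info("payment accepted")
logger.info("payment accepted", extra={"request_id": "req-123", "path": "/pay"})
```

The point of the sketch is only that structured, request-scoped logs stay machine-parseable, which is what makes consistency across services worth chasing.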
Metrics, on the other hand, give you
aggregate information about how the system is behaving.
You get a better overview of how the system behaves
overall using metrics, and
for metrics too there are standard tools and libraries that one can
use. Adoption of
metrics is easy because of the proliferation of
different tools, and they also provide a certain consistency
because of standards like OpenTelemetry and OpenMetrics
that people can use. So adoption and consistency are
both in good shape in the case of metrics.
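As a rough illustration of how little code metric adoption takes, here is a small sketch using the Prometheus Python client; the metric names, labels, and port are illustrative assumptions:

```python
from prometheus_client import Counter, Histogram, start_http_server
import random
import time

# A counter and a histogram: aggregate views of behaviour, not individual events.
REQUESTS = Counter("http_requests_total", "Total HTTP requests", ["method", "status"])
LATENCY = Histogram("http_request_duration_seconds", "Request latency in seconds")

def handle_request():
    start = time.time()
    time.sleep(random.uniform(0.01, 0.05))  # pretend to do some work
    REQUESTS.labels(method="GET", status="200").inc()
    LATENCY.observe(time.time() - start)

if __name__ == "__main__":
    start_http_server(8000)  # exposes /metrics for Prometheus to scrape
    while True:
        handle_request()
```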
Traces are helpful when we want to monitor different
workflows. For example, I may want to trace
my payment transaction starting from the microservice
where the user authentication happens, all the way to the background queue where
the job actually gets processed for sending the notification that
the payment was successful or unsuccessful. So in the case of traces,
I'm mostly concerned with monitoring workflows.
And to do that, I insert a span
or a trace ID in all the pieces that I
want to monitor. Traces are
extremely sharp and useful
in such scenarios, but they can also emit a lot
of information; if not handled correctly, it can end up like your
debug logs running in production.
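A minimal sketch of that payment-workflow idea with the OpenTelemetry Python SDK might look like the following; the span names, attributes, and the console exporter are assumptions for illustration, not the exact setup from the talk:

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

# Export finished spans to stdout so the example is self-contained.
provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("payments")  # hypothetical tracer name

with tracer.start_as_current_span("payment-workflow") as workflow:
    workflow.set_attribute("payment.id", "pay-123")  # hypothetical attribute
    with tracer.start_as_current_span("authenticate-user"):
        pass  # call the auth service here
    with tracer.start_as_current_span("enqueue-notification"):
        pass  # push the notification job onto the background queue here
```

Because every hop adds spans, the volume grows with traffic, which is exactly how traces can start to feel like debug logs in production if sampling is not thought through.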
So these are the original three pillars of observability. But additionally, we also
have profiling, external events, and exceptions.
Yuri Shkuro has written an excellent post
on these six pillars of observability. It's a great post.
Profiling is basically the continuous profiling of our application
to capture runtime information about how the
application is behaving, and that can also help us
debug certain situations when needed.
Generally, even if profiling happens continuously, we may not use
it all the time. We may use it only when it
is needed. So while enabling it, we also have to consider
the overhead that it puts on our production systems, because we may not
be using it all the time; we'll probably use it only two or three times a
year or so.
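As a toy illustration of keeping profiling opt-in rather than always-on, here is a sketch using Python's built-in cProfile, toggled by a Unix signal. Real continuous-profiling setups use dedicated tools, so treat this only as a sketch of the "enable it only when an investigation is running" idea:

```python
import cProfile
import io
import pstats
import signal

profiler = cProfile.Profile()
profiling = False

def toggle_profiling(signum, frame):
    """Start the profiler on the first SIGUSR1; dump stats and stop on the next."""
    global profiling
    if not profiling:
        profiler.enable()
    else:
        profiler.disable()
        out = io.StringIO()
        pstats.Stats(profiler, stream=out).sort_stats("cumulative").print_stats(10)
        print(out.getvalue())
    profiling = not profiling

signal.signal(signal.SIGUSR1, toggle_profiling)

# ... the application's main loop runs here; send SIGUSR1 to the process
# (e.g. `kill -USR1 <pid>`) only when you actually need a profile.
```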
External events are extremely important because they can affect the state of the application.
While logs, metrics, and traces are internal information
about how the application is behaving, external events
such as deployments, configuration changes, and third-party changes,
like your AWS instances getting restarted or re-provisioned,
can also affect your
running application. So tracking them, and
making sure that they are also visible
as part of the overall picture, is extremely important.
Another important thing about external events is that they are
extremely critical in certain cases, and they do
not happen all the time. Logs and metrics
are emitted constantly, but external events don't happen in anywhere
near the same volume as logs
or metrics. So they need precision
in how we capture and store them when
we deal with such events.
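One common trick, sketched here as an assumption rather than something prescribed in the talk, is to record each deployment both as a structured log line and as a Prometheus gauge so it can be overlaid on dashboards. The metric name, labels, and the service/version values are hypothetical:

```python
import json
import logging
import time
from prometheus_client import Gauge

logging.basicConfig(level=logging.INFO)

# Unix timestamp of the most recent deployment, labelled by service and version.
DEPLOY_TS = Gauge("deployment_timestamp_seconds",
                  "Unix time of the last deployment", ["service", "version"])

def record_deployment(service: str, version: str):
    """Emit a deployment event both as a metric and as a structured log."""
    now = time.time()
    DEPLOY_TS.labels(service=service, version=version).set(now)
    logging.getLogger("events").info(json.dumps({
        "event": "deployment",
        "service": service,
        "version": version,
        "at": now,
    }))

record_deployment("checkout", "v1.4.2")  # hypothetical service and version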
Additionally, we also have exceptions, which can go to tools like Sentry
and Rollbar. This can be considered
an advanced version of structured logging, where
tools such as Sentry and Rollbar give us the specific log lines and
stack traces where we can go and debug the issues.
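For completeness, a minimal Sentry sketch might look like this; the DSN is a placeholder and the failing function is made up for illustration:

```python
import sentry_sdk

# The DSN below is a placeholder; use the one from your own Sentry project.
sentry_sdk.init(dsn="https://examplePublicKey@o0.ingest.sentry.io/0",
                traces_sample_rate=0.0)

def charge_card(amount):
    raise ValueError("card declined")  # simulate a failure

try:
    charge_card(100)
except Exception as exc:
    # Ships the exception with its stack trace to Sentry for later debugging.
    sentry_sdk.capture_exception(exc)
```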
Before going forward, I have a curious question: how many
of us have used more than three of these at the same time?
I have talked to a lot of people, and what I realized
is that depending on the use case, we tend to pick
at least three or four of these at any point in time, but not necessarily
all of them at once. So it is a very interesting
conversation to have, whether there are folks
who have used multiple types of these signals at the same time. But we
do capture these kinds of information in our
instrumentation processes. The most
important point to consider is that none of this
is free, right? And when I talk about the cost,
it is not just the monetary cost; it also adds
overhead to our runtimes and to our processes.
There is no such thing as a free lunch, even in the case of instrumentation,
and we have to pay that cost along several dimensions that
we will see in a bit. The most important issue
that generally comes up with instrumentation is the explosion of cardinality,
or the churn of metrics and log data. They keep changing all
the time, and that basically prevents
us from just shipping the data and sitting back.
We always have to watch that the data is
not getting out of control because
of cardinality explosion. To
give a simple example, a three-node Kubernetes cluster with Prometheus will
ship roughly 40k active series by default. And that
is just the default metrics. If you emit
custom metrics on top of that, it explodes even further. With ephemeral
infrastructure, this can go out of control very quickly.
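A quick back-of-the-envelope sketch of why this happens: every extra label value multiplies the number of active series. The numbers below are purely illustrative, not measurements from any real cluster:

```python
# How label cardinality multiplies into active series.
endpoints = 30        # distinct `path` label values
status_codes = 5      # 2xx, 3xx, 4xx, 5xx, other
methods = 4           # GET, POST, PUT, DELETE
pods = 20             # ephemeral pods, each producing its own series

series_per_metric = endpoints * status_codes * methods * pods
print(series_per_metric)       # 12000 series for a single counter
print(series_per_metric * 10)  # ten such metrics -> 120000 active series
```

Ephemeral infrastructure makes it worse: every replaced pod leaves behind old series and creates new ones, which is the churn part of the problem.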
We also have to handle the operations of running this instrumentation
for the entire stack. So this is one more thing to operate
besides the application: we have to run our application,
but we also have to run our entire observability
and instrumentation stack. And we also have
to make sure that not just the app scales, but that the instrumentation
scales along with it. Because we cannot be blind on New Year's
Day for four hours, or blind before the
streaming of a big final.
I'm from India, so I'll give a cricket example:
we cannot be blind before the final of the cricket
World Cup between India and Pakistan just because our instrumentation
is not able to scale. That can be a very bad
thing, not just for engineering but also for the business.
All of this results in constant tuning of monitoring and
instrumentation data, and in a lot of engineering toil
that teams have to go through. So I give
it an acronym: COST. The cost that we have to pay is
for cardinality, churn, operations,
scale, tuning, and toil. And all of this
becomes the cost of instrumentation that we have to pay.
But what is the hidden cost? The costs that
we talked about are fairly apparent on their face; we are
aware of them. But what is the most hidden
cost of such instrumentation?
It is actually the distraction. We always get distracted
from doing the things that we actually wanted to do, which is
our product engineering, or scaling our business, or making
sure that our customer experience is not impacted.
How many times have you heard things like: okay, can you reduce the
Datadog monitoring cost before next
month? It is getting out of hand. Can you please stop
your feature development and focus on getting this under control? Or:
our logs have been piling up for the last two days. Can you look at it
as a P0 item and fix it? Otherwise our vendor will charge us
double, and we'll be spending too much
money unnecessarily. Or: today is New Year's Day, and
Prometheus is not getting the required metrics. Can you set aside the important
features and bug fixes that you are pushing and just fix this, because otherwise
we are completely blind before the party starts? We
always hear these kinds of things, and that
distracts us from the actual tasks that
we want to do in our day-to-day work. We always get
distracted by the instrumentation and the information that we
are emitting and probably not even using. We may
be emitting so much data, but only using 10% to 20% of
it. So we pay not just for the data that
we use, but also for the data that we don't use,
which is not really a good place to be. So the
modern software systems engineer has to maintain not just their
software, but also the instrumentation of that software,
with the same rigor and
the same scaling requirements, and so
on. There is also fatigue, right? With so much
data, so many dashboards, so many panels everywhere,
so many logs in front of our eyes,
we get desensitized to the information.
There can be duplicate alarms. How many times have I seen
that while debugging a critical issue in production, we get confused
because the logs show two or three different pictures at the same time,
and some of the information that we see is not even being used in the
code. So there can be situations where just
having too much information causes delays in debugging.
Because we focus on getting the data out,
because that is the easy part, we don't even consider
why we need it in the first place. These are some
of the ways too much information can cause
fatigue. While we talk about all of this, and we
sort of are used to these things, what's the way out?
Let's discuss that. If we focus on the
data that gives us only the early warnings, with the least
amount of data, and this "least amount of data" part is important,
then we can just focus on the
warnings and, based on those, dig deeper to isolate the
root causes as and when needed. I would
like to give an analogy with the Apple Watch, which is
on my wrist. What the Apple Watch does is give me
only the vitals, such as heart rate, how I'm doing with
my sleep, or whether I'm walking enough every day
and so on. It just gives me the vitals that are needed.
And based on that, I can decide to go to the doctor for detailed X-ray
scans and ECG reports, and then decide
whether to go further with my debugging or deep
exploration. So while I get the vitals,
if the vitals are off, I can go after the detailed
information about why they are off. I don't
start with the X-ray scans immediately,
or with the ECG reports as the first step,
without even checking whether my vitals are off or
not. An early warning of something breaking is
always better, because it gives me
ample time to either fix things
or ignore it if it is actually not off track.
So an early warning is always better in such cases.
So what is the plan of action to fix this?
We can measure what we
actually want in our instrumentation. We can plan what
we really need. We can emit only the data that we need and skip
the things that we don't. We can observe and track what is actually used.
We can prune aggressively: a lot of metrics and instrumented
data is never used at all, like the many
default metrics that we keep pushing, which can
slow things down later on. So we
can prune them aggressively; a rough sketch of how to spot such unused
metrics follows below. We can, of course, store data for
less time, because the longer we store it,
the more problems and distractions it can cause. And
we can focus on what gives us the best value for the money,
which helps us reduce the scope
of our instrumentation.
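As a hedged sketch of the "observe, track, and prune" idea, one could compare the metric names Prometheus knows about against the names referenced in dashboard definitions, and treat the difference as prune candidates. The Prometheus address, dashboard paths, and the crude string matching below are all assumptions; real tooling parses queries properly:

```python
import glob
import json
import requests

PROM_URL = "http://localhost:9090"    # assumed Prometheus address
DASHBOARD_GLOB = "dashboards/*.json"  # assumed location of dashboard JSON files

# All metric names currently known to Prometheus.
resp = requests.get(f"{PROM_URL}/api/v1/label/__name__/values", timeout=10)
all_metrics = set(resp.json()["data"])

# Very rough check: a metric is "referenced" if its name appears anywhere
# in any dashboard definition.
dashboards_text = ""
for path in glob.glob(DASHBOARD_GLOB):
    with open(path) as f:
        dashboards_text += f.read()

unused = sorted(m for m in all_metrics if m not in dashboards_text)
print(f"{len(unused)} of {len(all_metrics)} metrics look unused; prune candidates:")
for name in unused[:20]:
    print(" -", name)
```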
But there can be an even better plan of action than this. For example, what if we define access
policies for our data: you can access a certain amount
of data only for a certain amount of time, and if you
want to access it beyond that, then you have to be okay
with some reduced or aggregated
data, and so on. We can also have data storage policies across
the organization, for example that logs are stored only for one day,
and beyond that we don't keep them, because otherwise
they will explode in terms of storage cost.
All of these policies can help us define standards for our instrumentation
across the organization, so there is consistency and we
get the same kind of results across our software systems,
and things stay in a better, more consistent shape.
Less is always better, even in the case of instrumentation,
because instrumentation is not just instrumenting; it is a liability
that we have to own as builders of software.
Thanks, that's all I have today. My name is Prathamesh
Sonpatki. I work as a software engineer at Last9.
I have a blog, and I have posted my
Twitter and other contact details. We also have a Discord where
we hang out with other SRE and DevOps folks to discuss
reliability, observability, and a lot of other things related
to SRE and DevOps. I would highly encourage you to
check it out and join if you're interested. Thanks again. Thank you.