Transcript
This transcript was autogenerated. To make changes, submit a PR.
Getting real
time feedback into the behavior of your distributed systems
and observing changes, exceptions, and errors in
real time allows you to not only experiment with confidence,
but respond instantly to get things working again.
Hi,
I'm Narmatha Balasundram. I'm a software engineering manager
with the commercial software engineering team at Microsoft.
In today's talk, we are going to see how to increase confidence
as you're doing your chaos testing, and how having
good observability can help take the chaos out of
the chaos testing. So what is chaos engineering? Chaos engineering
is the art of intentionally breaking the system with the sole purpose
of making the system more stable. So in a typical
chaos testing environment, we start with a steady state,
understand how the system behaves during steady state,
and come up with a hypothesis. So if we say the system is supposed
to scale during high traffic, then we run tests
to see if the system behaves as we intend it to behave.
And when we find that the system does not behave as we
intended it to, then we make changes and do the testing again.
And this is a very iterative process.
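For concreteness, here is a minimal sketch of that steady-state, hypothesis, verify loop in Python. The endpoint, the sample size, and the 2x-latency hypothesis are illustrative assumptions, not values from the talk.

# A minimal sketch of the steady-state / hypothesis / verify loop.
# The service URL and the 2x latency hypothesis are hypothetical placeholders.
import statistics
import time
import urllib.request

SERVICE_URL = "http://localhost:8080/health"  # hypothetical endpoint

def measure_latency_ms(samples: int = 20) -> float:
    """Measure median request latency against the service, in milliseconds."""
    latencies = []
    for _ in range(samples):
        start = time.perf_counter()
        with urllib.request.urlopen(SERVICE_URL, timeout=5) as resp:
            resp.read()
        latencies.append((time.perf_counter() - start) * 1000)
    return statistics.median(latencies)

# 1. Understand the steady state before injecting any failure.
baseline_ms = measure_latency_ms()

# 2. Hypothesis: even under the injected fault, median latency stays within 2x of baseline.
threshold_ms = baseline_ms * 2

# 3. Inject the fault here with your chaos tool of choice
#    (e.g. kill an instance, add network latency), then re-measure.
during_chaos_ms = measure_latency_ms()

# 4. Verify the hypothesis; if it fails, fix the system and iterate.
if during_chaos_ms > threshold_ms:
    print(f"Hypothesis rejected: {during_chaos_ms:.0f} ms vs baseline {baseline_ms:.0f} ms")
else:
    print("Hypothesis held: the system absorbed the fault.")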
So what is observability? Observability is the ability to answer new questions with existing
data. Having observability while doing chaos testing
helps us understand what the base state is, how normal looks for
the system, and what the deviated state is: when there is unusual
activity, unusual events, or an unusual response in the system, how does
that change things? Chaos engineering without observability
will just lead to chaos. In the talk today,
we'll see how we can take that chaos out of chaos testing using observability.
First, what is true observability in a system, and
what are its different attributes? Secondly, we'll see what the
golden signals of monitoring are and how they can help us have actionable
failures. Number three, we'll see what service
level agreements and service level objectives are,
and how we can use them in our chaos testing to
understand how the system is behaving. And lastly,
monitoring and alerting: based on the service
level objectives, we will see
how monitoring can help during chaos testing
and even before we start the chaos testing process.
So the true observability of the system is made up of
what the different attributes of
the application look like. We talked a
little earlier about what chaos testing is: it is about
deviating from what normal looks like, and understanding, when the testing is run,
what the deviated state looks like. So for us to understand normal,
we need to understand what the health of the components of the
system looks like. When requests go against a service
and it gives a 200 OK, that means the service
health of the API is doing well.
And when there is additional stress going on in the system,
the resource health of the system could also be constrained:
things like CPU, disk I/O, or memory for
that instance could be a constraint. And additionally,
along with these two, we look at what the business transaction
KPIs are. Let's say we are looking
for the number of logins per second, or the number of
searches per second; these are the key business
indicators that we should watch as we
look at this data across the different components. Now, all this
data collected by itself, service health, resource
health, and the business transaction KPIs separately,
does not help in giving a holistic view of the system.
Creating a dashboard and having the dashboard represent these values
that the different stakeholders of the business are looking for is
what makes it cohesive. Let's say, at the time of
chaos testing, an SRE engineer is coming in:
what are the key metrics that they should be aware of and need
to look at? Creating a dashboard for that particular
stakeholder, or for the SRE team here, makes much
more sense than just throwing everything into a combined dashboard.
Alerts. Alerts are the end result of a dashboard.
Let's say we see a change from the
current state into an abnormal state; then creating
alerts, and identifying the parties that need to be alerted
in each of these scenarios, is very important. We talked about what
these different attributes are, but how do we ensure that these attributes
are what is getting measured? So Google's SRE team
came up with the four golden signals, namely traffic, latency,
errors, and saturation. So let's take the example of a
scenario where there is high stress on the system.
It could either be because of increased traffic, or it could be because
of VMs being down. So we
start off with a normal state. The traffic, for instance,
in a normal scenario looks like 200 requests per second.
Increasing the traffic during the chaos
testing to, let's say, 500 requests per second, or even
600 requests per second, how does that affect the latency?
Latency during a normal state looks like 500 milliseconds
per request. With increased traffic,
how does that deviate from what the normal state looks like?
And how do traffic and latency play
a role in errors? Are you seeing more timeout errors? And because
of the high traffic that's coming into the system,
are resources like the CPU, the memory, and the disk I/O
constrained? These are the key things to watch out for
as you're looking at the signals coming into the system.
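As a rough sketch of what instrumenting these four signals can look like, assuming the prometheus_client package is available; the metric names and the request handler wrapper are illustrative, not from the talk.

# Tracking the four golden signals: traffic, latency, errors, saturation.
# Metric names and the handler wrapper are illustrative assumptions.
import os
import time
from prometheus_client import Counter, Gauge, Histogram, start_http_server

REQUESTS = Counter("http_requests_total", "Traffic: total requests", ["status"])
LATENCY = Histogram("http_request_seconds", "Latency: request duration in seconds")
ERRORS = Counter("http_errors_total", "Errors: failed requests")
SATURATION = Gauge("load_per_cpu_ratio", "Saturation: 1-minute load average per CPU")

def handle_request(do_work) -> None:
    """Wrap a request handler so every call feeds the golden signals."""
    start = time.perf_counter()
    try:
        do_work()
        REQUESTS.labels(status="200").inc()
    except Exception:
        REQUESTS.labels(status="500").inc()
        ERRORS.inc()
        raise
    finally:
        LATENCY.observe(time.perf_counter() - start)

if __name__ == "__main__":
    start_http_server(9100)  # expose a /metrics endpoint for scraping
    while True:
        handle_request(lambda: time.sleep(0.05))  # stand-in for real work
        SATURATION.set(os.getloadavg()[0] / os.cpu_count())  # Unix-only saturation proxy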
So we talked about the attributes and we talked
about what the golden signals are. How do we identify what these
actionable failures are, and what makes a good actionable failure?
An actionable failure is one where the time to recovery is very short.
From the time you identify a problem to the time it gets
recovered needs to be minimal, meaning any
logs that we collect in the system should have enough contextual information in them,
so we get to the problem area faster. Sometimes, when
logs are built, and this reminds me of one of the scenarios that I had
in my previous experience, just building
an observable system meant creating logs, right? We've all done
that, where we just go in and log, and we were feeling pretty confident
that we did all the good things. We had good logs,
we had an alert system in place, things seemed
fine. And then we realized, as we started looking at
the production scenarios and production troubleshooting,
that the logs were very atomic. With no correlation
between the logs from different components, and with
no contextual data, it took longer for us to identify what
may have caused the issue. And a point for you to remember:
there are a lot of logs, a huge volume of
logs. As you're thinking about what your observability looks like,
just make sure that these logs are ingested well, and that there are
good analytical engines at the back end that can actually help crunch through these
logs and give the data that you're looking for.
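Here is a small sketch of what contextual, correlated logs can look like, using only the Python standard library; the field names (correlation_id, component) and the example messages are illustrative assumptions.

# Structured logs with a shared correlation id, so lines from different
# components can be stitched back together by the analytics backend.
import json
import logging
import uuid

class JsonFormatter(logging.Formatter):
    """Emit one JSON object per log line so a backend engine can query fields."""
    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "ts": self.formatTime(record),
            "level": record.levelname,
            "message": record.getMessage(),
            "component": getattr(record, "component", None),
            "correlation_id": getattr(record, "correlation_id", None),
        })

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("checkout")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# The same correlation id travels with every log line for one request.
corr_id = str(uuid.uuid4())
logger.info("payment authorized", extra={"component": "payments", "correlation_id": corr_id})
logger.info("order persisted", extra={"component": "orders", "correlation_id": corr_id})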
Next, we look at service level
agreements and objectives. SLAs
are service level agreements. They are typically agreements between
two parties about what the services are going to
be and what the uptime and the response time of the
services look like. For instance, let's say there is an agreement
between a mapping provider and
a ride sharing application. The
kind of agreement that they would get into would be that the mapping
provider says that the maps will be available about
99.99% of the time, and when they make an
API change, they give
a two week notice to the ride sharing company.
This is what is called an SLA. And when
the mapping provider company takes this back,
they need to understand what they need
to do to be able to meet the SLA. That is
the SLO, the service level objective: what
are the objectives that a team must hit to meet the agreement that they
just made with their client?
This boils down to what they need to monitor to
be able to meet the agreement that they just made. And we
cannot talk about SLAs and SLOs and not talk about error
budgets. The error budget is the maximum amount of time
that a technical system can fail without
contractual consequences. So let's
say the agreement is 99.99%:
the error budget for the mapping provider company is
about 52 minutes per year. That is
the maximum amount of time that the technical system can fail.
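The arithmetic behind that number is simple enough to sketch; the second SLO value below is just an extra illustration.

# Error budget: the downtime allowed per year for a given availability SLO.
MINUTES_PER_YEAR = 365 * 24 * 60

def error_budget_minutes(slo_target: float) -> float:
    """Allowed downtime per year, in minutes, for a given availability target."""
    return (1.0 - slo_target) * MINUTES_PER_YEAR

print(f"{error_budget_minutes(0.9999):.1f} minutes/year")  # ~52.6 for 99.99%
print(f"{error_budget_minutes(0.999):.1f} minutes/year")   # ~525.6 for 99.9%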
So let's look at how knowing what these
SLAs and SLOs are can help us with our chaos
testing. Number one, before the
chaos testing even starts, it helps us understand
what critical issues for the user experience look
like. Let's say it's a streaming
provider company, and there's
a little bit of buffering as the users view the content.
Is this something that needs to be identified as a critical
issue, or is it something that resolves
by itself? Knowing the criticality of the issue
helps us make a decision on whether
to fix it or not, say if it's an issue that
is very intermittent, a reload of the page fixes
it, or it's very short lived. Right? So these are the
scenarios that help with chaos testing. So during the chaos testing,
we want to be careful about when to do the chaos testing.
We do not want to be introducing more uncertainty into
the network when the user experience is deteriorating,
or when there is a system performance problem and things are
being slow; we want to be very informed
about when we want to start doing the chaos testing. And once the
chaos testing starts, we measure how the system is doing with
the chaos and without the chaos, and what the difference is. This helps us
to increase the load on the system, because we are seeing
real time feedback on how the system is doing
by looking at the signals that we are monitoring using our golden
signals. And then we figure out: is it good to
tune up the traffic, or are we hurting the system and should we
be turning back the traffic?
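As a rough sketch of that feedback loop, here is one way to ramp injected traffic only while the monitored signals stay healthy; get_p95_latency_ms and set_injected_rps are hypothetical hooks into a metrics backend and a load generator, and the thresholds are illustrative.

# Ramp chaos load step by step, backing off when the latency signal degrades.
import time

BASELINE_P95_MS = 500.0      # measured before the experiment starts
ABORT_THRESHOLD_MS = 1000.0  # back off if p95 latency doubles versus baseline

def get_p95_latency_ms() -> float:
    # Hypothetical: replace with a query against your metrics backend.
    return 600.0

def set_injected_rps(rps: int) -> None:
    # Hypothetical: replace with a call to your load generator or chaos tool.
    print(f"injecting {rps} requests per second")

def ramp(start_rps: int = 200, step: int = 100, max_rps: int = 600) -> None:
    """Step injected traffic up, reading the golden signals at each step."""
    rps = start_rps
    while rps <= max_rps:
        set_injected_rps(rps)
        time.sleep(1)  # in a real run, wait long enough for the system to settle
        if get_p95_latency_ms() > ABORT_THRESHOLD_MS:
            set_injected_rps(start_rps)  # we are hurting the system: turn it back
            print(f"aborting ramp at {rps} rps, latency degraded")
            return
        rps += step
    print("ramp completed within the hypothesis bounds")

ramp()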
Then comes monitoring and alerts. Monitoring and alerts are a
great place to get an overall view of how
we are doing. How are the attributes doing with respect to the golden
signals? And also, when things do go bad,
what do we do about them? While we are doing the chaos testing,
when the system is bound to
break, bound to deteriorate, we evaluate: what are the missing
alarms? Are the alarms even in the right place?
Are we looking at the right things, or are we just looking at symptoms
that are not truly the causes? Are we measuring the right things?
Are we looking at the right latency,
or at the right error numbers,
for instance? And then once we do that,
then we take a step back and look at the thresholds
of the alerts. This is a very key component, because
if the threshold is too low, that may result in alert
fatigue: with too many alerts, the folks that are responsible
for fixing them may become immune to the
fact that there is an alert. So the thresholds of the alerts
are key. And lastly, the alerts need to be
sent to the right team, so we need to identify who owns or who is
responsible for fixing the alerts for each segment.
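A small sketch of that last pair of ideas, thresholds plus an owning team per alert; the rule names, threshold values, team names, and the notify() stub are illustrative placeholders.

# Threshold-based alert rules routed to the team that owns the fix.
from dataclasses import dataclass

@dataclass
class AlertRule:
    name: str
    threshold: float
    owner_team: str

RULES = [
    AlertRule("p95_latency_ms", threshold=1000, owner_team="api-oncall"),
    AlertRule("error_rate", threshold=0.05, owner_team="api-oncall"),
    AlertRule("cpu_utilization", threshold=0.9, owner_team="platform-oncall"),
]

def notify(team: str, message: str) -> None:
    # Hypothetical hook into your paging system.
    print(f"[page -> {team}] {message}")

def evaluate(measurements: dict[str, float]) -> None:
    """Compare current measurements against each rule and page the owner."""
    for rule in RULES:
        value = measurements.get(rule.name)
        if value is not None and value > rule.threshold:
            notify(rule.owner_team, f"{rule.name}={value} exceeds {rule.threshold}")

evaluate({"p95_latency_ms": 1350, "error_rate": 0.01, "cpu_utilization": 0.95})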
So doing this practice while doing chaos testing
helps us make sure that all these different things are aligned when we
start seeing things go bad in production.
We've already done this, we've already tested this. We know, when
things go bad, how do we identify them,
how do we fix them, and are these alerts
going to the right folks? As we wrap this up,
here are a few closing thoughts on observability.
Depending on where you are
in the observability track, I would say always start small.
Start with the auto instrumentation that's available out
of the box from the tool you're going to use,
and keep adding information on top of it. And in a distributed
environment, like how our tech stack is built,
always have distributed logs with
information correlated with each other,
sufficient traces so you can track how things are
moving along, and enough context in the
logs as you're logging them. Secondly, iterate on instrumentation.
It's a rinse and repeat process, and there are things you discover
that need to be added. That is all right.
As long as we do this as an iterative
process, we learn stuff, and as we learn stuff, we go and make
it better. And lastly, celebrate learnings.
Once you figure out something doesn't work, it's quite all
right. Go back in and fix it. With that,
I just want to say thank you for listening through this presentation
and I'm very happy to be a part of this. Thank you.
Bye.