Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hello, everyone. Welcome to Conf42 Site Reliability Engineering. My name is Ricardo Castro, and today we're going to talk about alerting on SLOs, what error budget policies are, and how we can leverage them. So what do we have on the menu? For starters, we're going to set some context. We're going to talk about SLOs, and SLOs are all about reliability, so it's easier if we have some ground knowledge to build on before getting to alerts and error budget policies. We'll then talk about some reliability concepts. We're going to talk about SLOs, but we need to talk about all the things that encompass SLOs, the foundation that gives us SLOs: we need to talk about metrics and SLIs, and we will then talk about complementary concepts like error budgets and SLAs. We'll then get to the good part, where we are going to talk about how we can alert on SLOs, what error budget policies are, and how we can leverage them. And at the end, we're going to conclude on why all of this is important.
So let's start by setting some context for our discussion. An example from the real world that we can think of and draw a parallel to our reality is a supermarket. If we think a little bit about it, a supermarket is kind of a microservices architecture. The idea is that I, as a user, go into a supermarket, I do my shopping, and I go out. What happens in reality is that there are many, many things happening underneath the covers that make it possible for me to do this transaction. For me as a user, I just want to go in, select the things that I want, pay, and get out. But that means there is a cashier to register everything and receive payments. It means there are people who get stuff from the warehouse and make it available for me to do my shopping. There are also people who need to place orders to ensure that stuff gets to the supermarket, people who unload the trucks into the supermarket, butchers, people who work at the fish stand. So there's a lot going on underneath the covers that actually makes it possible for me to do this transaction. Like with microservices, I interact with an application, and that means there may be a lot of services working together to provide the functionality that is required.
So what does it mean, exactly, to be reliable? Let's stick with the supermarket example. I did my shopping and I want to pay: how might that action not be reliable? If it's taking too long to pay, if I have to stand in a queue for half an hour just to be able to pay, I might consider that the service is not being reliable. Another aspect, for example, is when I go to pick up some product: if the product's expiration date has passed, I might say that the service is not being reliable. And we can draw a quick parallel between this reality and the reality of the tech world by saying that the analogous concept for taking too long to pay is latency, where a request, for example, takes too long to be served, and a product's expiration date having passed might mean an error, something that wasn't supposed to happen.
Going a bit further into what reliability means, and looking at our reality on the tech side: I work at a company called Anova, and at Anova we work in industrial IoT services. We provide services to our customers. These numbers are a little bit outdated, but we have more than 2,000 customers worldwide, we operate in more than 80 countries, and we monitor more than 1 million assets. What we do is take data from industrial sensors, process it, store it, and then apply things like machine learning and AI to build applications that allow our customers to provide good service to their own customers. And this means that we need a way, and a framework, to ensure that our systems are being reliable, and ideally we want to be alerted when something is not being reliable.

So what does reliability mean, exactly? If we look at the dictionary, the Cambridge Dictionary in this example, we see a definition that says: the quality of being able to be trusted or believed because of working or behaving well. In essence, this means that something is reliable if it's behaving well. And this is a bit rough, so I like the definition from Alex Hidalgo's book, Implementing Service Level Objectives, as a better way to define reliability. In essence, what Alex says is that reliability is defined by our users: the answer to the question "is my service being reliable?" is "is my service doing what its users need it to do?". So if we look at reliability from the point of view of our users, if they are satisfied, my service is being reliable. So how can we actually measure this?
And now let's start our discussion about the founding concepts that will lead us to SLOs and the other concepts. The most fundamental concept is that of a metric. A metric is nothing more than a measurement about something in my system: an event happens, and I take a measurement of it. Let's imagine a few examples: the amount of memory that a server is using; the time it takes for an HTTP request to be fulfilled, for example in milliseconds; the number of HTTP responses that are an error; or how old a message is when it arrives at a Kafka cluster. These are just measurements: an event happens and we take some kind of measurement about that event.
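As a rough illustration, here is a minimal Python sketch of that idea: an event happens and we simply record a measurement about it. The metric names and the in-memory lists are purely illustrative; a real system would ship these samples to a monitoring backend.

```python
import time

# Raw measurements, one entry per event, stored as (timestamp, value) pairs.
# Illustrative only; a real system would export these to a monitoring backend.
request_latency_ms = []   # how long each HTTP request took, in milliseconds
response_codes = []       # HTTP status code of each response
kafka_message_age_s = []  # age of each message when it arrived at the cluster

def record_request(duration_ms: float, status_code: int) -> None:
    now = time.time()
    request_latency_ms.append((now, duration_ms))
    response_codes.append((now, status_code))

record_request(duration_ms=137.0, status_code=200)
record_request(duration_ms=420.0, status_code=500)
```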
Building on top of metrics, we have the concept of an SLI. An SLI is a quantifiable measure of service reliability. An SLI is what's going to allow us to say whether a measurement, or an event, is good or bad. But how can we define that? We need to achieve a binary state, even if the metric itself doesn't give us that out of the box. Here are a few examples of how we can define an SLI. We can say that requests to a service need to be responded to within 200 milliseconds. That means that if I serve a request, I can measure how long it took: if it was more than 200 milliseconds, I can say it wasn't a good event; if it was 200 milliseconds or less, I can say it was a good event. Analogously, we can say that if a response got a 500 code, it is not a good event, and if it got another code, it is a good event. And the same thing for Kafka messages: if a message arrives older than five minutes, we might say it's not a good event; if it's younger than five minutes, it is a good event.
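As a minimal sketch (plain Python, thresholds taken from the examples above), each SLI boils a raw measurement down to a binary good/bad answer:

```python
# Each SLI turns one measurement into a binary good/bad event.
def latency_is_good(duration_ms: float) -> bool:
    return duration_ms <= 200          # responded within 200 milliseconds

def response_is_good(status_code: int) -> bool:
    return status_code != 500          # anything other than a 500 is good

def kafka_message_is_good(age_seconds: float) -> bool:
    return age_seconds <= 5 * 60       # not older than five minutes

print(latency_is_good(150))       # True  -> good event
print(response_is_good(500))      # False -> bad event
print(kafka_message_is_good(90))  # True  -> good event
```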
Building on top of SLIs, we have the concept of SLOs. SLOs define how many times an SLI needs to be good so that my users are happy, and that needs to be time bound. Here are a few examples. We can say that 99% of requests to a service need to be responded to within 200 milliseconds within a 30-day period. In the same way, we can say that 95% of requests to a service need to be responded to with a code different from 500 within a seven-day period. And the same thing for Kafka: 99.99% of messages arriving at a Kafka cluster must not be older than five minutes within a 24-hour period. So essentially, what an SLO gives us is a way to say, within a time bound, how many times my SLI needs to be achieved so that my users are not unhappy.
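A minimal sketch of what that evaluation could look like, assuming we already have the binary good/bad events for the window from an SLI function like the ones above:

```python
# Check an SLO: over the evaluation window, the fraction of good events must
# stay at or above the target.
def slo_is_met(events: list, target: float) -> bool:
    if not events:
        return True                      # no events, nothing to violate
    good = sum(1 for event_is_good in events if event_is_good)
    return good / len(events) >= target

# 99% of requests within 200 ms over a 30-day window: 995 good out of 1000.
events_last_30_days = [True] * 995 + [False] * 5
print(slo_is_met(events_last_30_days, target=0.99))  # True (99.5% >= 99%)
```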
So what exactly is an error budget? An error budget is nothing more than what is left over from my SLO. If I take 100% and subtract the SLO, I get my error budget. So if I have an SLO of 99%, I can say that I have a 1% error budget. It's effectively the percentage of reliability left, and it helps us make educated decisions on whether, for example, to release a new feature or not; we're going to see that in a bit. And of course, it also gives the operations and incident response process an appropriate budget, so that we know what we need to do.
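A minimal sketch of the arithmetic, with illustrative numbers for the size of the window:

```python
# Error budget: whatever reliability is left over from the SLO target.
slo_target = 0.99                  # 99% of events must be good
error_budget = 1.0 - slo_target    # so 1% of events are allowed to be bad

total_events = 1_000_000           # e.g. requests served in the 30-day window
allowed_bad = total_events * error_budget   # 10,000 bad events allowed
observed_bad = 4_200                        # bad events seen so far

remaining_fraction = (allowed_bad - observed_bad) / allowed_bad
print(f"{remaining_fraction:.0%} of the error budget is left")  # 58%
```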
And last but not least, the last concept is the concept of an SLA, which most of us are already familiar with. When we sign up for a service, let's imagine we sign up for a cloud provider like AWS, or whatever cloud provider you are using, they usually provide us with an SLA. So what is an SLA? It's usually a commitment between a service provider and a client. But in practice, an SLA is nothing more than an SLO that has consequences when that SLO is not met. Here are examples of what an SLA can say. We can say that 99% of requests to a service need to be responded to within 200 milliseconds within a 30-day period; if that doesn't happen, so if the SLA is not met, the client will get a 30% discount. In the same way, we can say that 90% of requests to a service must be responded to with a code different from 500 within a seven-day period; if not, the client gets 50% of its money back. And last but not least, we can say that 99% of messages arriving at the Kafka cluster must not be older than five minutes within a 24-hour period; if not, we can be fined €100,000.
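A minimal sketch of the idea that an SLA is just an SLO with consequences, using the first example's numbers (the function name and discount tier are illustrative):

```python
# An SLA check: the same compliance ratio as an SLO, but missing it has a cost.
def sla_discount(good_ratio: float, sla_target: float = 0.99,
                 discount: float = 0.30) -> float:
    """Return the discount owed to the client for this period."""
    return discount if good_ratio < sla_target else 0.0

print(sla_discount(0.995))  # 0.0  -> SLA met, no discount
print(sla_discount(0.984))  # 0.3  -> SLA missed, the client gets 30% off
```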
SLAs are usually looser than SLOs, so that when we know an SLO is broken, we still have some buffer to fix things before the SLA is actually broken. But that means we need a way to know whether an SLO is at risk or not.
And of course, we can put that into a visualization. We can create some graphs that we can look at and see what the SLO is, what objective we are trying to achieve, in this case 99%, whether or not we are burning error budget, and how much error budget I have left for the period. So visualization is a nice way for me to understand whether an SLO is at risk or not. But we would ideally like to be alerted, right? We don't want to spend our whole working day looking at these graphs. And even worse, if something happens during the night that is going to put my services at risk, I need to be alerted. So this is where alerting on SLOs comes in.
Traditionally, we did metric thresholds. What we would do is send an alert if some threshold on a metric is crossed. We can say that if CPU goes above 80%, I want to receive an alert; if a request takes more than 200 milliseconds, I want to receive an alert; if a request is served with a 500 error, I receive an alert; and the same thing for a Kafka message. Of course, all of these can be combined, and I can receive an alert if a combination of these things happens.
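A minimal sketch of that traditional approach, with the thresholds from the examples above:

```python
# Classic metric-threshold alerting: fire whenever a single measurement
# crosses a fixed limit.
def threshold_alerts(cpu_percent: float, latency_ms: float, status_code: int) -> list:
    alerts = []
    if cpu_percent > 80:
        alerts.append("CPU above 80%")
    if latency_ms > 200:
        alerts.append("request took more than 200 ms")
    if status_code == 500:
        alerts.append("request served with a 500 error")
    return alerts

print(threshold_alerts(cpu_percent=92, latency_ms=150, status_code=200))
# ['CPU above 80%']
```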
We can take the same approach with SLOs. We can say that if the latency SLI goes below 99%, which is my target, I receive an alert, and we can say the same thing for an error rate SLO of 99.95%: if it's not met, I receive an alert. This is similar to what some of us have already been doing, because we could say, okay, if more than x requests take more than 200 milliseconds, we receive an alert. So this is good.
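A minimal sketch of alerting on the SLO itself: fire when the measured good-event ratio drops below the target.

```python
# Alert when the SLO target itself is no longer being met.
# Note: by the time this returns True, the target has already been missed.
def slo_breached(good_events: int, total_events: int, target: float = 0.99) -> bool:
    return total_events > 0 and good_events / total_events < target

print(slo_breached(good_events=9_890, total_events=10_000))  # True: 98.9% < 99%
```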
This is better than the plain metric threshold, but I would only receive an alert when I'm already in trouble. If I define 99% as the threshold where my users are happy, then if I go below that, basically what I'm saying is that my users are unhappy. So ideally, I would want to be alerted before this happens, and if we are relying on this buffer to fix things before an SLA is breached, being alerted earlier is exactly what we want.
So how can we improve on this? We can alert on how much error budget I have available, or how much error budget has been burned. The idea is that I will trigger an alert when the available error budget reaches a critical level, or when a certain amount of error budget has already been burned. We can set different levels and trigger different messages to different channels. I can say, for example, that if I have burned 25% of my error budget, I send an email to my team; if half of it has been burned, I put a message on Teams; and if, for example, 75% of the error budget has already been burned, I want to send a message to PagerDuty and tell the team that it needs to do something immediately.
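A minimal sketch of those tiers; the channels are illustrative placeholders rather than real integrations:

```python
# Tiered alerting on error budget consumption: the more budget has been burned,
# the louder the notification channel.
def budget_alerts(consumed_fraction: float) -> list:
    levels = [
        (0.25, "send an email to the team"),
        (0.50, "post a message on Teams"),
        (0.75, "page the on-call engineer via PagerDuty"),
    ]
    return [action for threshold, action in levels if consumed_fraction >= threshold]

print(budget_alerts(0.60))
# ['send an email to the team', 'post a message on Teams']
```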
This is better than the solution we had before, but at this point we have no clue about how fast the error budget is being consumed.
So a question can arise: if by the end of the evaluation period we would still have some error budget left, would I want to receive this alert, for example on PagerDuty? Probably not, because if we are still within the bounds of what my users consider to be reliable, I wouldn't want to be woken up at 3:00 a.m. to fix something that doesn't actually need to be fixed. But we still don't have a clue about how fast my error budget is being consumed, and this means that by the time I receive an alert that 75% of the error budget has been burned, it might be burning so fast that we're already going to get into trouble.
So if we think about it, we can actually alert on the burn rate. Alerting on burn rate tells us how fast the error budget is being consumed. When we have a burn rate of one, it essentially means that if we keep burning at that constant pace, then at the end of my period, one of the periods we've seen previously, like 30 days, one week, or 24 hours, I will have burned all of my error budget. Here's an example: for an evaluation window of, say, four weeks, alert if the burn rate reaches two. Why would I want this? Because a burn rate of two, double the maximum sustainable burn rate, means that I will consume all of my error budget in half the time. So for a period of four weeks, if my burn rate is two, it means that after two weeks I would have no error budget left, so I would want to receive an alert.
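A minimal sketch of the burn-rate calculation, assuming we can measure the fraction of bad events observed over the window so far:

```python
# Burn rate: how fast the error budget is being consumed. A burn rate of 1 uses
# the budget up exactly over the full window; 2 uses it up in half the window.
def burn_rate(bad_fraction: float, slo_target: float) -> float:
    error_budget = 1.0 - slo_target
    return bad_fraction / error_budget

# 2% of requests failing against a 99% SLO burns budget at twice the sustainable
# pace: a four-week budget would be gone after two weeks.
print(round(burn_rate(bad_fraction=0.02, slo_target=0.99), 2))  # 2.0
```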
This is also better, but it has one slight issue, which is that if the burn rate is too high, it might not be picked up. For example, if we evaluate the burn rate every hour, but the error budget is all consumed within 30 minutes, we would receive no alerts.
So the last evolution of our alerting on SLOs is multiwindow, multi-burn-rate alerts. The idea is that we combine the previous alerts we've seen, using multiple windows. What we want is to alert on fast burn, when the burn rate is very high; that alerts us on sudden changes, some catastrophic event that is consuming budget really, really fast. But at the same time, we also want to catch slow burns, something that's consuming our error budget consistently over a longer period of time; we want to alert on those as well. Here are a few examples. For fast burn, we could say that over a short window, evaluated in periods of five minutes, if the burn rate reaches ten, we want to receive an alert; this would alert us if we had a spike in error budget consumption. And we would have a similar but slower one: we would evaluate a 24-hour window every five minutes, and if the burn rate reaches two, we would receive an alert.
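As a rough sketch of the idea (plain Python; the one-hour and 24-hour windows and the thresholds of ten and two are illustrative choices, and `bad_fraction` is an assumed helper that returns the fraction of bad events observed over a given window):

```python
from datetime import timedelta

# Multiwindow, multi-burn-rate alerting: page on a sudden, severe burn over a
# short window OR a sustained burn over a long window.
def should_page(bad_fraction, slo_target: float = 0.99) -> bool:
    budget = 1.0 - slo_target
    fast_burn = bad_fraction(timedelta(hours=1)) / budget >= 10   # sudden spike
    slow_burn = bad_fraction(timedelta(hours=24)) / budget >= 2   # steady burn
    return fast_burn or slow_burn

# Example: 0.5% errors over the last hour, 2.5% over the last day, 99% SLO.
observed = {timedelta(hours=1): 0.005, timedelta(hours=24): 0.025}
print(should_page(lambda window: observed[window]))  # True: the slow burn fires
```

Real setups usually also pair each long window with a much shorter one, so that the alert stops firing once the burn actually stops.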
So this is an evolution where we go from magic thresholds to alerts we can trust: when we receive an alert, we can say with some confidence that we actually need to do something.
But what is that something? That something can be defined in the error budget policy. The error budget policy determines the alerting thresholds and the actions to take to ensure that error budget depletion is addressed. So what does this mean? The error budget policy is a policy that is defined beforehand, where we say that if X happens, we will take actions A, B, and C. Here are a few examples, and we're going to see a more detailed document in a second. We can say that if the service has exceeded its error budget for the preceding four-week window, we will hold all changes and releases to the service, and we will only work on P0 issues or security fixes until the service is back within the SLO. Depending on the cause of the SLO miss, the team may devote additional resources to working on reliability instead of feature work. So basically, what we are defining here is a concrete measure: if the error budget has been depleted at some point within our four-week period, we will not be releasing features, apart from, of course, P0 issues or security fixes. And we're also saying that, depending on what caused the SLO to be missed, we might need to add additional resources to work on reliability instead of releasing more features. Another example could be that if a single incident consumes more than 20% of our error budget over the same four-week period, then the team must conduct a postmortem, and that postmortem must have at least one P0 action item, a P0 action being something of the highest priority. So again, this defines the actions that are going to be taken when the error budget is at risk or has been consumed.
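As a minimal sketch, such a policy can even be written down as structured data, so that the thresholds and the pre-agreed responses live side by side; the conditions and actions below simply paraphrase the two examples above:

```python
# A pre-agreed error budget policy expressed as data: conditions mapped to the
# actions the team has committed to take.
ERROR_BUDGET_POLICY = [
    {
        "condition": "error budget exhausted within the preceding 4-week window",
        "actions": [
            "hold all changes and releases to the service",
            "work only on P0 issues and security fixes until back within SLO",
            "consider moving additional engineers to reliability work",
        ],
    },
    {
        "condition": "a single incident consumed more than 20% of the 4-week budget",
        "actions": [
            "conduct a postmortem",
            "the postmortem must produce at least one P0 action item",
        ],
    },
]

for rule in ERROR_BUDGET_POLICY:
    print(f"If {rule['condition']}: {len(rule['actions'])} agreed actions")
```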
This should actually go into a document agreed on by multiple parties, so that everyone is on the same page regarding what is to be done when the error budget is consumed. Taken from the SRE book from Google, we have an example of such a document. This is just an example; you can of course define your error budget policy any way you want, but it gives you a good starting point for defining alerting on SLOs and error budget policies. In this example we see the authors of the error budget policy, when it was defined, who reviewed it, who approved it, when it was approved, and when it should be revised. We then have a service overview, the service or group of services that this error budget policy applies to. We then have the goals and non-goals: the goals that the error budget policy tries to achieve and what it is not trying to achieve. Then we have a definition of what it means to miss the SLO, a detailed description of what it means for the SLO to be missed, which is basically when this error budget policy will be enforced. We can also have other sections, like an outage policy, an escalation policy, and any background that is necessary.
So, to quickly sum up all the concepts that we have seen: we started with metrics. With metrics we can build SLIs, and SLIs are what help us define whether a metric, or an event, is good or bad within our context. An SLO is how many times an SLI needs to be achieved so that we can be sure that our users are happy with our services. An error budget is the amount of reliability that is left over from the SLO. And with SLOs and error budgets we can build visualizations, which are good, but ideally what we want is to actually be alerted when our SLO is at risk. And of course, SLO alerts can trigger an error budget policy; for example, if I'm consuming too much of my error budget, I can enforce an error budget policy that has been discussed beforehand and agreed with all parties. And of course, we have the concept of an SLA, which is an SLO with penalties.
But why is all of this important? This is important because this way we can define reliability in the eyes of our users. We stop measuring, alerting on, and defining reliability as something that isn't really defined by our users. I don't want to be alerted when, for example, a CPU threshold is crossed at three in the morning and my users are not being affected. This, of course, ties into reducing alert fatigue: now, ideally, I will receive alerts only when my users are being affected, or are at risk of being affected. It also creates a shared language to talk about reliability. With all of this, we have a framework in place that can actually tell us whether our systems are being reliable or not, and it's understood by everyone. And of course, it facilitates prioritization: we have a way to see whether we're being reliable or not, and we have an error budget policy that can help us define when more work needs to be put into reliability.

And before I go, I want to leave a shameless plug. I'm writing a book on overcoming SRE antipatterns, and we'll have a couple of chapters in the book speaking precisely about this: how to measure reliability in the eyes of our users, and how we can leverage alerts on SLOs and error budget policies to improve our day-to-day operations. And that is all from my part. I hope you enjoyed it and that this talk was informative for all of you. You can find me at these links. Thank you very much and have a great day.