Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hello everyone. Welcome to my talk. My name is Ricardo
Castro and today we're going to talk about reliability.
So what do we have on the menu for today? We're going to start
by giving some context about this talk. So we're going to use
an example from the real world and then we're going to translate this into
our techie reality. We will then talk about
reliability. We are going to build this up step by step, and we're going to develop a framework that many of you have already heard about, which revolves around SLOs. We will then see where the real
value of having such a framework in place comes from.
And at the end we're going to conclude this on why all of this is
important. So let's start
with an example from the real world.
I'm going to be using a supermarket.
If you think a little bit about it, a supermarket, it's kind
of a microservices architecture. Why do
I say this? So the idea for me as a consumer,
I go to a supermarket, I do my shopping, I pay and I get
out. But underneath the covers there's a lot that
needs to happen so that my consumer experience is
actually reliable. So there are many
different pieces that need to fit into place, so that
when I go to the supermarket, everything is there for me to buy.
So someone has to make purchase orders to actually get
products into the supermarket. Someone has to transport
those products, someone has to unload those products
into the supermarket, someone has to stock the shelves,
someone needs to be at a counter if I need some kind of assistance, and someone needs to be at the checkout so that I can pay. So as
we can see, there are a lot of moving pieces inside the supermarket
that have to be successful so that my simple user
experience is actually reliable.
So how can I assess
if a supermarket is being reliable or not?
Let's see a couple of examples of how we can assess
the reliability of a supermarket.
So, imagine that we did all of our shopping
and we want to pay. How can this experience be
not reliable? So if it takes too long to pay,
I may assess this experience as not being reliable.
By the same token, I go to a shelf,
I pick up the product, and the product expiration date
has passed, right? So I might assess this supermarket as not being reliable if I try to buy a product and I can't because its expiration date has passed.
And we can draw a parallel here into our techie world.
When something takes too long to pay, we can equate that to latency: I send a request to a service and it takes too long to receive a response back. And the same thing if a product expiration date has passed:
I can use this as an example of an error.
As an example, I'm going to use my own company. I work at a company called Anova, and we operate in
the industrial IoT space. So essentially we develop solutions for our customers who have industrial sensors. We ingest
data from those sensors and we create
meaningful services that help them manage their own infrastructure.
So essentially, this is our main focus.
So we collect high reliability data from high reliability sensors, we process that data, and then we provide solutions to our customers. So for
all of this to be successful, we need to ensure that all of our services are actually reliable.
What does reliability mean?
So if we go into the Cambridge dictionary, it has a definition of reliability as being the quality of being able to be trusted or believed because of working or behaving well.
And this gets me a little bit confused, but essentially it says that something is reliable if it's behaving well, right? But I prefer the definition from Alex Hidalgo in his great book, Implementing Service Level Objectives, which essentially asks the question: is my service reliable, and how can I assess that? Basically, I can ask: is my service doing what its users need it to do? So this is a little bit of a shift. We're not saying that something is reliable if it's behaving well; we say that something, a service for example, is reliable if it is doing what its users expect it to do.
So how can we go about developing a framework to actually ensure that services are doing what their users need them to do? We're going to build this step by step, and we're going to start with the most basic component in our system, which is metrics, very well known to
all of us. Essentially, a metric is a measurement about a system. It doesn't tell us anything other than a measurement. A few examples of what a metric can be: the amount of memory that a server is using, for example as a percentage of available memory; the time it takes for an HTTP request to be fulfilled, for example in milliseconds or seconds; the number of HTTP response errors, for example as a natural number; or the age of a message arriving at a Kafka cluster, in minutes.
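To make this a bit more concrete, here is a minimal sketch of how metrics like these could be exposed with the Python prometheus_client library; the metric names and values are purely illustrative, not from the talk:

```python
# Illustrative sketch: exposing the example metrics with prometheus_client.
from prometheus_client import Counter, Gauge, Histogram, start_http_server

MEMORY_USED_PERCENT = Gauge(
    "process_memory_used_percent", "Memory in use as a percentage of available memory"
)
HTTP_REQUEST_SECONDS = Histogram(
    "http_request_duration_seconds", "Time taken to fulfill an HTTP request"
)
HTTP_5XX_ERRORS = Counter(
    "http_responses_5xx_total", "Number of HTTP 5xx error responses"
)
KAFKA_MESSAGE_AGE_MINUTES = Gauge(
    "kafka_message_age_minutes", "Age of the latest message arriving at the Kafka cluster"
)

if __name__ == "__main__":
    start_http_server(8000)  # metrics become scrapeable at http://localhost:8000/metrics
    MEMORY_USED_PERCENT.set(42.0)
    HTTP_REQUEST_SECONDS.observe(0.150)   # a 150 ms request
    HTTP_5XX_ERRORS.inc()
    KAFKA_MESSAGE_AGE_MINUTES.set(3.0)
```

In a real service these values would of course be updated by the request handlers and consumers themselves.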
So with metrics, we can start building on this and
start evolving this concept. And the next concept
is the concept of a service level indicator. A service level indicator is a
quantifiable measure of service reliability,
and it helps us separate good events
from bad events. So how can we define service level
indicators to actually say if an event is
good or bad? So here are a few examples.
We're going to take the previous metric examples and use those metrics to actually define SLIs that tell us, when we look at a metric or an event described by a metric, whether that event is good or not. So we need a binary state, right? We need something to be either good or bad, even if the underlying metric doesn't provide that binary state. So how can we define that?
So, using the previous example, we can say a request needs to be responded to within 200 milliseconds. Every time we take a measurement, if it takes more than 200 milliseconds, we say that this is not good; if it takes 200 milliseconds or less, we say that it is good. And the same is analogous to the other examples. A request to a service must not be responded to with a 500 code, right? If it's responded to with a 500 code, it's not okay; if it's responded to with a code different from 500, it is okay. And again, same thing for messages arriving to Kafka: if the message is not older than five minutes, everything is okay; if the message is older than five minutes, things are not okay and we might need to do something. So we started with metrics and we went to SLIs. What comes next?
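To make the binary classification concrete, here is a small illustrative sketch; the thresholds come from the examples above, and the function names are just made up for illustration:

```python
# Illustrative sketch: classifying raw measurements into good/bad SLI events.

def latency_is_good(duration_ms: float) -> bool:
    """Request must be responded to within 200 milliseconds."""
    return duration_ms <= 200

def response_is_good(status_code: int) -> bool:
    """Request must not be responded to with a 500 code."""
    return status_code != 500

def kafka_message_is_good(age_minutes: float) -> bool:
    """Message arriving at the Kafka cluster must not be older than five minutes."""
    return age_minutes <= 5

print(latency_is_good(150))        # True  -> good event
print(response_is_good(500))       # False -> bad event
print(kafka_message_is_good(7.5))  # False -> bad event
```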
So, next is an SLO. And an SLO is nothing more than how many times an SLI has to be good, so that I can be sure that my users are happy with my service. And that, of course, needs to be measured within
a time interval. So, using the same examples again, we're going to evolve an SLI into an SLO. We can say that 99% of requests to a service need to be responded to within 200 milliseconds, within a 30 day period. So we define an interval, which is 30 days, we look at all requests, and we say that 99% of them need to be responded to within 200 milliseconds.
If not, I can say that my users are not being satisfied with my service. Same thing with the other examples: 99% of requests to a service need to be responded to with a code different from 500 within a seven day period. And exactly the same thing for the messages arriving to Kafka.
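As a rough sketch of how such an SLO could be evaluated over its window, assuming we simply count good events versus total events (purely illustrative):

```python
# Illustrative sketch: checking the 99% / 200 ms / 30-day latency SLO.
# "durations" would be all request durations observed in the 30-day window.

SLO_TARGET = 0.99           # 99% of requests...
LATENCY_THRESHOLD_MS = 200  # ...must be answered within 200 ms.

def slo_is_met(request_durations_ms: list[float]) -> bool:
    good = sum(1 for d in request_durations_ms if d <= LATENCY_THRESHOLD_MS)
    compliance = good / len(request_durations_ms)
    return compliance >= SLO_TARGET

# 1000 requests in the window, 5 of them slower than 200 ms -> 99.5% good.
durations = [150] * 995 + [450] * 5
print(slo_is_met(durations))  # True
```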
But how can we create good SLOs? Here are a few tips. First and foremost, and if you forget all the others, remember this one: always, always, always focus on the users. Users are the ones who define the reliability of my service, so it makes sense for us to actually know what the user expects and track our reliability according to that.
Going a little bit deeper, one good option to start defining SLOs is to list out critical user journeys and order them by business impact. Maybe it's your search catalog, maybe it's checkout, whatever makes sense within your company. Then we need to determine which metrics we will be using as service level indicators to most accurately track that user experience. So we need to define a few metrics, measure a few things, and be sure that we have SLIs that actually track what we know the user values.
Also, try not to go overboard with SLOs. For most cases, three or four SLOs should be enough, and we can use composable metrics to do that. Also important is to have SLO documents. We'll see an example in a second, but it's essentially a document that describes what the SLO is, when it was last reviewed, and so on. Very important is to review SLOs periodically. SLOs are not something set in stone, and every once in a while they need to be reviewed to be sure they're still actually tracking user satisfaction. And also, determine SLO targets and goals and the SLO measurement period; don't try to be too reliable. We'll see what I mean by this in a bit. So, just a quick glance at an example of an SLO document.
This example was taken from the Google SRE book, so you can use it as a reference to build your own and adapt it to your own reality. So essentially it has a description of the SLO.
So this is an SLO for the example game service.
It has the authors, when it was defined, who reviewed it,
when it was approved, and when it should be revisited.
It also has an overview of the
service that this SLO applies to.
Then it comes down to SLIs. These are definitions of SLIs like the ones we've seen previously, but here they are the ones that actually inform this SLO.
Then there's a section for the rationale. Maybe there's something that you need to make clear to whoever is using this SLO, and that should be explained here. Error budgets we're going to see in a bit, but here we can describe what the error budget is for this particular SLO. And of course, you can have a section with clarifications and caveats, something that needs to be explained or maybe some constraints that need to be taken into consideration when using this SLO.
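Purely as an illustration, and not taken from the talk or from the book, the sections of such an SLO document could be sketched out like this:

```python
# Hypothetical sketch of the sections an SLO document might contain,
# mirroring the structure described above (all values are placeholders).
slo_document = {
    "description": "SLO for the example game service",
    "status": {
        "authors": ["..."],
        "date_defined": "...",
        "reviewed_by": ["..."],
        "approved_on": "...",
        "revisit_by": "...",
    },
    "service_overview": "What the service does and who its users are.",
    "slis": [
        "99% of requests answered within 200 ms",
        "99% of requests answered with a code different from 500",
    ],
    "rationale": "Anything that needs to be made clear to whoever uses this SLO.",
    "error_budget": "How the remaining budget is calculated and spent.",
    "clarifications_and_caveats": "Known constraints to keep in mind.",
}
```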
So, going back to our previous point, what is an acceptable target? 90%? 95%? 100%? Of course, it depends.
It will depend on your user needs. It can depend on
your business needs. For instance, you might be in a
highly regulated business that has many constraints. It can be informed
by cost and many other things. But the point here is
that it depends. But just for us to have
an idea, let's look at an SLO for uptime. Let's say that we define three nines, right? The number of nines is something that is widespread across the industry. For an SLO of three nines, this means that I can have about 8 hours of downtime per year. Just by adding a nine, those 8 hours go down to about 52 minutes. And if I add another nine, so five nines of reliability, this means that I can only be down for about five minutes a year. So just by adding one nine to my uptime SLO, I drastically reduce the allowed downtime, and I have to put things in place to ensure that I'm not down more than that allowed amount.
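As a quick back-of-the-envelope sketch of where those downtime numbers come from (purely illustrative):

```python
# Allowed downtime per year for an uptime SLO with a given number of nines.
MINUTES_PER_YEAR = 365.25 * 24 * 60

def allowed_downtime_minutes(nines: int) -> float:
    target = 1 - 10 ** (-nines)  # e.g. 3 nines -> 0.999
    return MINUTES_PER_YEAR * (1 - target)

for n in (3, 4, 5):
    print(n, "nines ->", round(allowed_downtime_minutes(n), 1), "minutes/year")
# 3 nines -> ~526 minutes (~8.8 hours)
# 4 nines -> ~53 minutes
# 5 nines -> ~5.3 minutes
```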
So just to conclude this point, let's do some back of the envelope calculations to see what this means. I mentioned that you need to put things in place to actually assure this type of reliability, so let's go through a couple of scenarios and see how much this could cost us. Scenario one: let's imagine that I want to increase my reliability from three nines to four nines, right? So I'm increasing my reliability by 0.09 percentage points. And let's imagine that my service does 1 million dollars of revenue. That means that if I improve my reliability, the value of that improvement is actually $900. So does this make sense? Maybe; that will be up to you, but it's important to do these calculations to see if it makes sense. Scenario two:
let's use a rule of thumb that we see a lot in documentation, and I believe it's also in the SRE book: each additional nine of reliability costs us ten times more to achieve. So if I go from three to four nines, this means that whatever it costs me to run my services will increase by a factor of ten. Let's use exactly the same example, going from three nines to four nines, and let's imagine that it costs us $1 million to run our services. That means that by adding a new nine, it's going to cost me around $10 million. These, of course, are back of the envelope calculations, but it's important to do this type of calculation to have an idea of how much it will cost me to increase my reliability and how much value that might get me.
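Here is a tiny sketch of those two back-of-the-envelope calculations, using the numbers from the talk (purely illustrative):

```python
# Scenario 1: value of going from 3 nines to 4 nines for a service with $1M revenue.
annual_revenue = 1_000_000
reliability_gain = 0.9999 - 0.999       # 0.0009, i.e. 0.09 percentage points
value_of_improvement = annual_revenue * reliability_gain
print(round(value_of_improvement))      # ~900 dollars

# Scenario 2: rule of thumb that each extra nine costs ~10x more to achieve.
current_running_cost = 1_000_000
cost_with_one_more_nine = current_running_cost * 10
print(cost_with_one_more_nine)          # 10,000,000 dollars
```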
So, moving forward, we touched briefly on error budgets before, but essentially an error budget is what is left from an SLO. If I consider that 100% means my service is always reliable, the error budget will be 100% minus what I define as the SLO. So if it was two nines, that means I'll have 1% of error budget left. It's effectively the percentage of unreliability that we can still afford, and it can help us make educated decisions on whether, for example, to release a new feature or not, based on the amount of risk that we can take. Error budgets also help make sure that operability processes, for example incident response, are appropriate to the budget available for the service being provided. What this means is that with the amount of error budget that we have left, we can inform our incident response.
And last but not least, we have service level agreements, which are well known to us. Essentially, an SLA is nothing more than an SLO that has some kind of penalty attached. For example, let's use the same examples that we've been using up until this point. I can define an SLA that says that 95% of requests to a service need to be responded to within 200 milliseconds within a 30 day period, and if that doesn't happen, the customer will get a 30% discount. And the same thing for the other example. The basic idea here is that an SLA is an SLO that actually has some type of penalty attached. SLAs are usually looser than SLOs, so that if we breach an SLO, we actually still have some time before breaking the SLA. Looking at the first example, if we had an SLO of two nines, so 99%, this means that if I breach that SLO, I still have 4% of unreliability to burn, so to speak, before my SLA is breached.
So what can we build with all of this? Of course, we can build visualizations. Here is an example of how we can track an SLO and put a visualization of it on a dashboard. We have the objective, for example 99%, how much of the error budget is being burned, how much I still have left, and whether something is burning right now or not. This is very interesting, and we can do a historical analysis of how my SLO is actually going. But we don't want to be looking at dashboards all day. This is very interesting, but what we actually want is to be informed if something is not going okay, right? So that we can take the appropriate measures.
So we come to alerts. If we think about traditional alerting methods, we usually use metric thresholds, right? We define some kind of threshold, and if something goes above that threshold, we trigger an alert so that someone can investigate and see if everything is okay. Here are just a couple of examples: if a CPU goes above 80%, if a certain number of requests is taking more than 200 milliseconds, or if we have X amount of 500 responses, we just trigger an alert and someone needs to investigate what's going on. We can take the same approach with an SLO. We can say that if we have a latency SLO of 99% and we go below that SLO, we need to do something. This is better, because if we define our SLOs in a meaningful way they actually track user experience, but it only alerts us when we are already in trouble, right? So how can we do better than this?
So we can alert on the amount of error budget that we still have available, or the amount of error budget that we have already burned. We can set alerts when the available error budget reaches a critical level, or when a critical amount of error budget has been burned, and we can set different trigger levels for different alert channels. For example, if I have already burned 25% of my budget, I can send an email; at 50%, I can alert someone on Teams or on Slack; and if 75% of my budget has been burned, I can trigger PagerDuty or Opsgenie for someone to actually look into it.
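A rough sketch of that kind of tiered routing could look like this; the thresholds and channel names are just the examples above, and the function is purely illustrative:

```python
# Illustrative sketch: routing alerts based on how much error budget is burned.
from typing import Optional

def route_alert(budget_burned_fraction: float) -> Optional[str]:
    """Return the channel to notify for a given fraction of burned error budget."""
    if budget_burned_fraction >= 0.75:
        return "page on-call (PagerDuty / Opsgenie)"
    if budget_burned_fraction >= 0.50:
        return "chat message (Teams / Slack)"
    if budget_burned_fraction >= 0.25:
        return "email"
    return None  # still plenty of budget, no alert

print(route_alert(0.30))  # email
print(route_alert(0.80))  # page on-call (PagerDuty / Opsgenie)
```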
This is better than the previous alternative because we are being alerted before we get into trouble and need to do something.
But we still have no idea how fast this error budget is being consumed. So it begs the question: if my SLO is well defined, and if I could know that by the end of the evaluation period we would still have some error budget left, would I even want to receive this alert? Probably not, because I will still be within the bounds of the amount of reliability that I have. So I can make the decision to, I don't know, release more features or do other types of work, because I still have some error budget that I can account for.
So the next evolution actually tackles this problem, and it is the burn rate. The burn rate tells us how fast the error budget is being consumed. When I make this calculation, a burn rate of one means that all of my error budget will be consumed exactly within the interval that I define, for example 30 days or a week. Let's see an example: if I have an evaluation window of four weeks, and I calculate my burn rate and it's two, this means that I will consume all of my error budget in half the time, so in this example, two weeks.
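A minimal sketch of the burn rate calculation, assuming it is simply the observed error rate divided by the error rate the budget allows (purely illustrative):

```python
# Illustrative sketch: burn rate = observed error rate / allowed error rate.
SLO_TARGET = 0.99              # 99% of events must be good
ERROR_BUDGET = 1 - SLO_TARGET  # 1% of events may be bad over the window

def burn_rate(bad_events: int, total_events: int) -> float:
    observed_error_rate = bad_events / total_events
    return observed_error_rate / ERROR_BUDGET

# 2% of requests failing means the budget burns at twice the sustainable pace:
print(burn_rate(bad_events=20, total_events=1000))  # 2.0
# With a 4-week window, a constant burn rate of 2 exhausts the budget in ~2 weeks.
```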
This is a lot better, but it still has a slight problem, which is that if the error budget burns too fast, and for example it is all consumed within my evaluation period, I might not even receive an alert. So we can do a small tweak here to have excellent alerts based on burn rate: multi-window, multi-burn-rate alerts. We will use multiple windows and multiple burn rates to inform us of different problems.
We will define fast burn alerts, which alert us on sudden changes in the consumption of the error budget; think of a huge spike in errors in our API, for example. And we will define slow burn alerts, which alert us on less urgent issues, something that is nevertheless consuming a lot of error budget over time. Here are just a few examples: I can define an evaluation window of 2 hours, evaluate every five minutes, and if my burn rate is ten, I know that something is not okay. And the same thing for a slow burn: I define a longer evaluation period of 4 hours, and if my burn rate is two, I actually have a problem and I need to investigate.
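A small sketch of what such multi-window, multi-burn-rate checks could look like; the windows and thresholds are the examples just mentioned, and the rule structure is purely illustrative:

```python
# Illustrative sketch: multi-window, multi-burn-rate alerting rules.
# Each rule pairs an evaluation window with the burn rate that should trigger it.
ALERT_RULES = [
    {"name": "fast burn", "window_hours": 2, "burn_rate_threshold": 10, "severity": "page"},
    {"name": "slow burn", "window_hours": 4, "burn_rate_threshold": 2,  "severity": "ticket"},
]

def firing_alerts(burn_rate_over_window) -> list[str]:
    """burn_rate_over_window(hours) returns the measured burn rate for that window."""
    return [
        f"{rule['name']} ({rule['severity']})"
        for rule in ALERT_RULES
        if burn_rate_over_window(rule["window_hours"]) >= rule["burn_rate_threshold"]
    ]

# Example: a sudden error spike gives a burn rate of 12 over the last 2 hours
# but only 3 over the last 4 hours -> both rules fire.
print(firing_alerts(lambda hours: 12 if hours == 2 else 3))
```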
And our last concept is the concept of the error budget policy. An error budget policy is nothing more than a document, or a set of documents, where we define what happens when certain conditions of my error budget have been breached. Here is just an example: if a service has exceeded its error budget for the preceding four week window, we will halt all changes and releases other than P0 or security fixes until the service is back within its SLO. And depending upon the cause of the SLO miss, the team may devote additional resources to working on reliability instead of feature work. So this is a concrete definition of what happens when my SLO is breached. This, of course, will be highly contextual and dependent on your organization, but it's something that is predefined and agreed by everyone involved: if certain things happen, these are the actions that we are going to take.
So to recap, let's see this concept of the reliability stack in its extended form. We started by looking at metrics, which are measurements about the system. We then evolved those into an SLI, which tells us if a metric, or an event described by it, is good or bad. We then evolved into SLOs, and SLOs tell us how many times the SLI needs to be good so that my customers are actually happy. The error budget is nothing more than what is left from the SLO. And then, with SLOs, what can we build? We can build visualizations that we can use to assess if the SLO is okay or not. We can build meaningful alerts that track user experience and tell us if something is not okay. We can define error budget policies that inform us of what we need to do if certain conditions happen. And of course we can use those SLOs to define SLAs, which are nothing more than an SLO with a penalty.
And why is all of this important? For starters, we start measuring reliability through the eyes of our users. They are the most important actors that will help us define reliability. It doesn't matter if I think my system is performant if my users are not happy with the way it's working. Also, reliability work ties directly to business goals. Happy users are usually good for business, and if I'm tracking reliability and ensuring that my users are happy through a reliability framework, this is good for business and we can tie it directly to business goals. It also creates a shared language to talk about reliability. There's no more of some engineers defining how reliability should be tracked or measured one way, other reliability engineers doing it another way, product people saying it another way, and business people doing it yet another way. Now we have a framework that helps us assess, measure and define reliability. And of course this facilitates prioritization. If we have a reliability framework in place and an error budget policy that tells us what needs to happen when certain conditions apply, this helps us prioritize work and makes it easier to decide when we should focus our efforts on reliability work or on product work, for example. And this
is all from my part. I hope this was informative for you.
Don't hesitate to contact me through my social links.
I hope you enjoyed my talk and have a great conference.