Transcript
Hey folks, welcome to my talk about techniques for SLOs and error budgets at scale. I'm proud to be presenting here at Conf42 Observability in 2023. I've been talking a lot about SLOs and error budgets over the past six or so years, and over that time I've given variations of this talk at different online conferences and also a couple in person. So I'm pleased to be presenting some of my learnings over the past couple of years.
This is basically the greatest hits version of the previous talks I've
given on the subject. Let's get started. But first, I know
this is an online conference, but let's do a little survey.
Raise your hand. I know you're sitting there just at home by yourself,
but raise your hand if you know what this graph is. And this is one
of those graphs that if you know, you know. We're all here today
in part because of this thing. And if you don't know what this is,
that's okay. We'll go ahead and come back to that a little bit later.
So, hi, I'm Fred. I'm an observability engineer at a large public company, and this talk is my own opinions, not those of my employer. Basic disclaimer. And so I've been working on monitoring and observability for about as long as the graph on the previous slide,
but focusing on it heavily over the past ten or so years.
I like to think about SLOs, SLIs and error budgets, terms I think were coined in the original Google SLO paper. I like to hack on histograms, metrics, logs and traces. I've been programming a lot of stuff for the past 20 years, and I've got two young kids, so I definitely am in need of more sleep and coffee. But let's go
ahead and kick this off. So how
do you implement SLOs for 1,000-plus engineers?
And this was a challenge I encountered about four years
ago when I started a role at a company called Zendesk.
And I got tasked with a project to bring SLOs to an engineering
organization that had over a thousand engineers, which was
quite a few. And there was a big push to make the product
as reliable as possible. We called reliability our number one feature.
So I had to come up with a way to roll out SLOs and error budgets across all those engineers. And to do that effectively, I really had to understand what SLIs and SLOs, and hence error budgets, were programmatically.
So I really dove in and started to research the subject a lot to kind
of go back to the basics.
And speaking of basics, I started off by reading the original Google SRE
book, followed that up with the SRE workbook,
watched Liz Fong-Jones and Seth Vargo's Google Cloud presentation on SLOs, titled "SLIs, SLOs, SLAs, oh my!", which was an inspiration to me.
And I've given a number of SLO talks previously, most notably one called "Latency SLOs Done Right," which was also given at SREcon by Theo Schlossnagle and Heinrich Hartmann,
who've written a lot on the subject. And even
looking back at that talk I gave, I can spot the errors in it,
which were kind of subtle. But what I found researching this topic
here is that there wasn't really a prescription for SLIs and SLOs.
The Google books talked a lot about SLIs, but were vague on the
subject as far as specific examples were concerned. And even working through some
of the examples in the workbook, I either found subtle
omissions or places where the examples weren't completely fleshed out and tested. Liz and Seth's Google Cloud video
had some concise definitions, so I took those as a base and expanded
on them, and those are in use by some of the major SLO vendors out
there now. And over the next few years,
there was kind of what I call a Cambrian SLO explosion. Get it? SLO explosion. Little dad joke there. But there was this explosion in SLO material, with SLO-specific conferences and also Alex Hidalgo's book on implementing SLOs.
So I got to work creating formulas that could be shared across a large organization, which would leave little room for creativity and variance, because I wanted everyone on the same page. I wanted to be able to give prescriptive formulas
that could be implemented at broad scale.
So this is what I came up with. The definition
of an SLI is what I used to put examples together.
Now, there are two major SLI opinionations, though the difference between them
is a bit subtle at first glance. The first is from the Google
SRE book, which describes an SLI,
pardon me, as a measurement of system performance. The second,
which I found first in Liz and Seth's video,
describes an SLI as something that delineates good requests from bad requests.
And that second opinion is really one that resonated well with me.
So, you know, both opinionations did see service at Google, even though they're somewhat conflicting. But the second, as I mentioned, is implemented more broadly by the practitioners and vendors that I found. And I decided to base my example on that second opinionation, not only because it had wider acceptance but because intuitively it made more sense to me. I spent quite a bit of time dissecting
those examples in the Google SRE book and the SRE workbook,
and they were good.
But I think the evolution of SLIs and SLOs at Google probably bifurcated because they
have a lot of teams there. And that's not a criticism of the book or
the organization. But the definitions that I came across seemed to
be more abstract than what I was looking for. And so here are
three examples of SLIs for the second SLI opinionation that I moved forward with. They each consist of three things: a metric identifier, a metric operator, and a metric value.
This approach is one that is straightforward for a human being to understand,
but also fits easily into most of the open source
and commercial monitoring and observability software out there.
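To make that concrete, here's a minimal sketch in Python of that three-part SLI formula; the metric name and threshold are hypothetical examples, not tied to any particular tool.

    from dataclasses import dataclass

    @dataclass
    class SLI:
        # A service level indicator: metric identifier, operator, and value.
        metric: str      # e.g. "http_request_duration_ms" (hypothetical name)
        operator: str    # "<", "<=", ">", ">=", or "=="
        value: float     # the threshold separating good requests from bad ones

        def is_good(self, sample: float) -> bool:
            # Returns True if a single measured request counts as "good".
            checks = {
                "<": sample < self.value,
                "<=": sample <= self.value,
                ">": sample > self.value,
                ">=": sample >= self.value,
                "==": sample == self.value,
            }
            return checks[self.operator]

    # Example: a request is good if it was served in under 100 milliseconds.
    latency_sli = SLI(metric="http_request_duration_ms", operator="<", value=100.0)
    print(latency_sli.is_good(42.0))   # True
    print(latency_sli.is_good(250.0))  # False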
The part of the SLI definition which requires the most consideration is what
that metric value should be. And that's one that is often tuned
or calibrated by an engineering team, most often for latency, sometimes with error response codes like a 5xx. It's pretty clear that that's a bad request, but it's going to be up to engineering teams to determine,
like, is a 404 a bad request, or is
that just clients thinking that they're going to the right place?
Because really, all of this stuff is about feeling customer
pain and wanting to make sure that they have a great experience.
And so I kind of cemented this example so that I could socialize widely within the engineering organization what an SLI was, which leads us to: what's an SLO?
And that definition came down to the number of good requests divided by the total number of requests over a time range. And this is often called a request-based SLO, where you count up the number of requests and see if
you got 99% of them right over a certain time range.
And I called the three different components here a little bit differently. In the red, we have the success objective, which is your typical "how many nines?" And then we drop the SLI in, which works really well for a lot of the tooling out there. Then we have a period. And if you don't have a time period here, you don't really have an SLO, because it's really important to specify this so that
you're evaluating it over something that's meaningful to the customer.
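As a rough illustration with made-up request data, the whole SLO check boils down to counting the good requests in the window, dividing by all requests in the window, and comparing against the objective.

    # A minimal, assumption-laden sketch of a request-based SLO check.
    # latencies_ms stands in for all request latencies observed over the period.
    latencies_ms = [23, 45, 180, 67, 95, 310, 88, 52, 41, 99]  # fake data

    objective = 0.9995          # success objective: three and a half nines
    threshold_ms = 100.0        # the SLI's metric value
    period = "7d"               # the time range the samples were collected over

    good = sum(1 for latency in latencies_ms if latency < threshold_ms)
    total = len(latencies_ms)
    attainment = good / total

    status = "meeting" if attainment >= objective else "missing"
    print(f"SLO over {period}: {attainment:.4%} good, {status} the {objective:.2%} objective")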
And one question that has come up is, how do I
know how many nines to choose for this success objective? When I was at Zendesk, we had an engineering VP named Jason Smale, who was very technical, and engineers held him in high regard.
And so he said, we need to hit three and a half
nines. And so that 99.95%
number became known as Smale's number. And if reliability dipped
below that number, it usually meant that a customer somewhere was feeling pain.
And this is really, if you want to get into enterprise software,
this is kind of, you must meet this criteria to
get on the ride. And so now
that you realize you're dealing with enterprise customers and you need three and a half
nines, how do you pick an appropriate metric value for your SLI, since that's the only variable left to choose now that you've fixed the objective at 99.95%? This is essentially what I call calibrating your SLO. Take a time period of known good performance, set your objective at 99.95%, and iterate across your SLI to figure out what latency value gives you that 99.95%.
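A small sketch of that calibration loop, assuming you can pull raw latency samples from a known good period; here they're just synthetic.

    # Sketch of SLO calibration: given latencies from a known-good period,
    # find the latency threshold at which 99.95% of requests count as "good".
    # In practice this would query your metrics store, not an in-memory list.
    import random

    random.seed(42)
    known_good_latencies = [random.lognormvariate(3.5, 0.6) for _ in range(100_000)]

    objective = 0.9995
    candidates = sorted(known_good_latencies)
    # The threshold is effectively the 99.95th percentile of the good period.
    index = min(int(objective * len(candidates)), len(candidates) - 1)
    calibrated_threshold = candidates[index]

    print(f"Calibrated SLI threshold: {calibrated_threshold:.1f} ms gives "
          f"{objective:.2%} good requests over the reference period")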
In this example, it could be 100 milliseconds. And I was able to develop some simple tooling to do that, or use our commercial monitoring tooling to do that, and developed a dashboard where engineers could set their objective at three and a half nines and then iterate over their latency to see what latency value it was at which customers were getting that three and a half nines performance. And just
to reiterate, the time period here is very important,
and this is a common oversight that I've seen in most of the literature.
They'll say, take an SLO of 100 milliseconds at 99.9%, but what time period is that over? Is it over a minute, an hour, a week? And you can, and probably should, have SLOs which use the same success objective and SLI but different time ranges, depending on the stakeholder. An engineer's
manager might want to know the reliability over a week so they
can schedule reliability work. A director might want to know it
over a month, and a VP might want to know how reliable the service was over a quarter for reporting to C-staff or setting the direction of technical efforts. And the purpose of
SLOs is often to prioritize reliability work. That is, if you aren't meeting your SLOs, you want to deprioritize feature work in favor of reliability engineering. And we want to use these definitions to do this because we want to be accurate. If we reprioritize
engineering resources, that is expensive. So we want to make sure that we're
doing that based off data that's correct and precise.
Now let's take a quick look at error budgets.
So an error budget is essentially just an inverted SLO. You subtract your success objective from one and you get your allowed failure rate for user requests. Like a financial budget, you have a certain amount of
errors that you can spend over a time period, and ideally
this is an amount that does not make your customers think that your service is
unreliable. You can create monitors for this with most of
the tooling out there and perhaps alert when, say, 80% of your
error budget has been used up for a given time period, which will let
your engineering teams know that it's time to work on reliability.
You can also alert when the rate of error budget burn predicts that you will exhaust your error budget before the time period has elapsed, and a lot of the tooling out there has functionality for that. So there are really two conditions in which your error budget should
spur action. First, if it's being used up too quickly and is in danger
of being exhausted for that period, that should prioritize
reliability focused work. The second is if your error budget is
not being used up at all, that could indicate an improperly
calibrated SLO, or it might mean that your service is normally so reliable that you're not prioritizing enough feature work, or that you should embark on controlled error budget burns. Google did that and mentioned it in the SRE book with their Chubby service, which was their distributed lock service. They introduced artificial error budget burn into the service so that consumers of Chubby would have to make their services tolerate those Chubby failures, and hence become more reliable. And again, like SLOs,
error budgets should reflect the mindset of the customer as much as
possible. If the error budget is not exhausted but your customer is on the phone with your VP, go take a look at what you are measuring and
if it really reflects what the customer is experiencing.
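Here's a minimal sketch of that error budget math and both alert conditions, using invented numbers rather than any particular vendor's API.

    # Error budget accounting plus the two alert conditions described above.
    objective = 0.9995
    period_days = 7.0

    error_budget = 1.0 - objective            # allowed failure rate: 0.05%

    # Pretend these came from your monitoring system partway through the window.
    total_requests = 4_000_000
    bad_requests = 1_700
    elapsed_days = 2.0

    budget_spent = (bad_requests / total_requests) / error_budget  # fraction of budget used
    print(f"Error budget consumed: {budget_spent:.0%}")

    # Condition 1: more than 80% of the budget is already gone.
    if budget_spent >= 0.80:
        print("Alert: 80% of the error budget is used; prioritize reliability work.")

    # Condition 2: at the current burn rate we'd exhaust the budget before the window ends.
    projected_spend = budget_spent * (period_days / elapsed_days)
    if projected_spend > 1.0:
        print("Alert: current burn rate will exhaust the budget before the period ends.")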
So, to sum up what I've shown you so far, there are a few points on getting thousands of engineers on the same page for SLOs and error budgets.
First, you need real world examples. Most of the published
books out there are a bit abstract and hand wavy and
don't really give you complete examples, so you need to have those to show
folks. Second, present formulas for each of those entities
which can be read easily both by humans and machines, and I've
shown you what I used at scale there. Third, you have to be
detailed and consistent. I see so many SLOs out there that leave off the time period. You might say, well, the time range can be whatever you want, but then it's not an actual or actionable SLO or error budget without a time range.
So we've looked at some example SLOs
that most engineers can parse and memorize, and which engineering managers and product managers
can use to correlate user happiness with. In most cases,
that happiness means your service is available and it's running fast.
We can take the formulas I just showed and extend them to cover both conditions
at once. So here we're talking about not only availability, but also latency; you need to have both of those. So here's an example SLI, SLO, and error budget which covers both latency and availability. If the page response is not a 5xx and the request was served in under 100 milliseconds, that request can be considered a good request. That's our SLI, to which we can add a success objective of three and a half nines and a time range of seven days to be evaluated on. To get the error budget, we can subtract a success objective of 99.95% from one, which gives us an error budget of 0.05%. It's easy to understand, and you can also easily create multiple SLOs and error budgets from the base SLI just by extending the time range.
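A quick sketch of that combined SLI, SLO and error budget, with made-up requests standing in for real traffic.

    # Combined SLI: a request is good only if it is not a 5xx AND it was
    # served in under 100 milliseconds. Request data is invented.
    requests = [
        {"status": 200, "latency_ms": 42},
        {"status": 200, "latency_ms": 180},   # too slow -> bad
        {"status": 503, "latency_ms": 12},    # 5xx -> bad
        {"status": 201, "latency_ms": 97},
    ]

    def is_good(req):
        return req["status"] < 500 and req["latency_ms"] < 100

    good = sum(1 for r in requests if is_good(r))
    attainment = good / len(requests)

    objective = 0.9995                 # three and a half nines over, say, 7 days
    error_budget = 1.0 - objective     # 0.05% of requests may be slow or failed

    print(f"Attainment: {attainment:.2%}, objective {objective:.2%}, "
          f"error budget {error_budget:.2%}")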
Now, on the point of the success objective here I
have 99.95% listed.
It's three and a half nines. Realistically, this is what enterprise customers demand these days. That means out of a million requests, you only get 500 requests that are slow or return what is known as a failure, like the 500 Internal Server Error, as an example. And so if you're at scale,
this should be your success objective. And I go into this in depth
a little bit in the presentation at the link shown below for my SREcon talk. So at this point we've
got example formulas for SLIs, SLOs and error budgets that should
be easy for folks to understand and also straightforward to implement
with most monitoring and observability tooling out there,
both open source and commercial. Of the two components of latency
and availability, availability is generally pretty easy to measure.
The simplest example is a 500 response: you see the "sorry, a problem occurred" web page. Latency, however, is more difficult to get right at scale. And when I say get it right, there are two aspects of being right. First, does your measurement have the right precision for your scale? That is, if I have one million user requests, can you generate a latency aggregate which means you aren't leaving more than a few dozen users off? Precision here is the number of decimal places. The other aspect is accuracy: is your latency aggregate for an SLO or a monitor actually correct? In many cases I've seen, that answer is no to both. And on precision versus accuracy: precision is the number of decimal places, and accuracy is whether the values in those decimal places are correct.
So let's dive in. So coming
back to this chart. This chart is an
RRD graph, and it measures network usage and calculates the 95th
percentile over a time period. At the time of the dot-com boom, you saw a lot of these RRD graphs, and these were mostly used for metering bandwidth. Bandwidth was billed on something like five megabits at the 95th percentile, meaning that if you took all your five-minute bandwidth usage measurement slices, ordered them, and took the 95th percentile, and that number was above five megabits, you incurred overage charges. And this first popularized the
approach of using percentiles. And that would most notably be seen about ten years later, in 2011, with the advent of the StatsD protocol developed by Etsy, which provided the p95 as a latency aggregation metric. And I wrote more
about this in a blog post I published last year, and I'll go into some
of the content in the next slides, but this is the historical
significance of this graph.
So let's talk about percentiles. This is a slide
from an SLO presentation I gave at SREcon 2019.
It illustrates two latency distribution profiles, which are meant to
represent service nodes that are behaving differently.
The blue distribution represents a bimodal latency
profile with lower latencies than the single mode red latency distribution.
Basically, this could be two web servers, one performing well and one not performing as well. The red server is not performing as well, and if we take the p95 values for latency for each server and we average those, we could get an indicator of around 430 milliseconds, and we might think that that's the performance of our service.
But if we combine the raw latency values from each of these distribution
sets and calculate the aggregate p95 from those,
we'll get 230 milliseconds, and the error there
is almost 100%. And many,
if not all, of the monitoring and observability tools out there will happily
let you use an averaging function for percentiles generated from
different hosts, nodes or clusters. If your distribution
profiles are the same, no problem. That works great. But it's when
your services are behaving asymmetrically that you'll encounter large errors with
this approach, and this is a problem with percentiles.
And I talked about that in depth in that presentation.
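A small sketch of that pitfall, using synthetic data for two nodes with different latency profiles and request volumes.

    # Why averaging per-node percentiles goes wrong: a bimodal fast node with
    # lots of traffic and a slower node with much less traffic (synthetic data).
    import numpy as np

    rng = np.random.default_rng(7)
    node_a = np.concatenate([rng.normal(60, 10, 80_000),    # mostly fast
                             rng.normal(150, 20, 20_000)])  # second, slower mode
    node_b = rng.normal(400, 50, 2_000)                     # slow node, low volume

    avg_of_p95s = np.mean([np.percentile(node_a, 95), np.percentile(node_b, 95)])
    true_p95 = np.percentile(np.concatenate([node_a, node_b]), 95)

    print(f"average of per-node p95s: {avg_of_p95s:.0f} ms")
    print(f"p95 of the combined raw data: {true_p95:.0f} ms")
    # The averaged number lands far from the true service-wide p95 whenever
    # node profiles or volumes differ.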
So beware of using percentiles.
I've talked about this and ranted about this,
and this kind of illustrates the prime condition where that's an issue. And again, if everything's running smoothly or if you have a single node, percentiles work just great. But it's the real world scenarios, where we have different node performance profiles and possibly hundreds or thousands of nodes servicing requests, that we want to be able to handle and evaluate our service's performance accurately, and that leads us into histograms for measuring web service latency.
And I gave an internal talk at Zendesk a few years ago that I called "Dr. Histogram, or How I Learned to Stop Worrying and Love Latency Bands," and I went into more depth on the intricacies of these three different types of histograms at the SLOconf link below.
But in short,
there are a couple of different approaches you can use for measuring latency with histograms.
And this involves essentially collecting a latency sample and fitting
it into what we call a bucket or a bin.
And you'll see the gray and blue
bars here. Those are your buckets or bins. And so let's
take a look at how these are implemented differently. First, we could have
a log linear histogram, which you
can see the details of at openhistogram.io.
And if we have a latency value here of 125 milliseconds,
we could say like, oh, we'll just slot that sample into
the greater than 100 millisecond, but less than 200 millisecond
bucket. And so this is a data structure
that is fairly easy to represent, because all you have is an
array representing different histogram buckets,
and then you increase the value of that array, essentially a counter
for each of those. And this is a volume invariant way
of storing large amounts of latency data that you
can also use to generate highly accurate aggregates
for an entire cluster or any set of hosts.
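A toy sketch of that bucketing idea, with illustrative bucket boundaries rather than the real OpenHistogram layout.

    # Slot each latency sample into a bucket and bump a counter, roughly in the
    # spirit of a log linear layout (see openhistogram.io for the real structure).
    bucket_bounds_ms = [10, 20, 50, 100, 200, 500, 1000]  # upper edges, illustrative
    counts = [0] * (len(bucket_bounds_ms) + 1)            # one counter per bucket, plus overflow

    def record(latency_ms):
        # Find the first bucket whose upper bound exceeds the sample.
        for i, bound in enumerate(bucket_bounds_ms):
            if latency_ms < bound:
                counts[i] += 1
                return
        counts[-1] += 1   # larger than every bound

    record(125)   # lands in the >=100 ms, <200 ms bucket
    record(42)    # lands in the >=20 ms, <50 ms bucket
    print(counts)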
And folks might also be familiar with the middle structure. That's the cumulative
histogram which Prometheus uses. So if I have a latency value of 125 milliseconds, it will increment the buckets labeled from less than or equal to 200 milliseconds all the way up to less than or equal to infinity. So this takes a few more data structures, or a few more counter values, to implement, and it's not quite as efficient as the log linear histogram. And at Zendesk, I flipped that
on its head and came up with what I called an inverse cumulative histogram, where, for example, if we have 125 milliseconds, I could
have a counter data structure, bump the counter and assign
these labels to it, which are often known as metric tags.
I could assign greater than ten, greater than 50, greater than 100 milliseconds,
but not greater than 200 milliseconds. And this approach
made my head hurt for a little bit, but it has some advantages in terms of operator efficiency and ease of implementation with a lot of the tooling out there. And all these buckets can also be referred to as latency bands.
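Here's a sketch of the same 125 millisecond sample recorded those other two ways, again with illustrative bucket boundaries rather than anyone's production configuration.

    # The same sample under a Prometheus-style cumulative histogram and the
    # inverse cumulative ("latency band") variant.
    bounds_ms = [10, 50, 100, 200, 500]

    def cumulative_labels(latency_ms):
        # Increment every "le" (less than or equal) counter the sample fits under,
        # plus the implicit le="+Inf" bucket.
        return [f"le={b}" for b in bounds_ms if latency_ms <= b] + ["le=+Inf"]

    def inverse_cumulative_labels(latency_ms):
        # Increment every "gt" (greater than) counter the sample exceeds.
        return [f"gt={b}" for b in bounds_ms if latency_ms > b]

    print(cumulative_labels(125))          # ['le=200', 'le=500', 'le=+Inf']
    print(inverse_cumulative_labels(125))  # ['gt=10', 'gt=50', 'gt=100']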
So you can kind of take a look at each of these different types of
histograms and decide, I might want to try to use histograms for
storing latency. So one of these should give you
some good results. And you might ask,
well, okay, well, now I know how to capture latency in a histogram at
scale. How do I generate an SLO from it? Well, let's go back to our
definition: it's the number of good requests divided by the total number of requests over a time range. And so in this case, we can use the histogram data for the SLI.
We can sum up the number of requests below 100 milliseconds,
and we can divide that by the total number of requests, which would just be
the count sum of all the bands, and we can multiply that
by 100. In the case of the
number of requests under 100 milliseconds,
with the inverse cumulative histogram,
we add up the counts of the blue bars.
With the log linear histogram, we just add up the counts of the three gray bars to the left of the blue bar.
So, mathematically, this is very simple to implement,
and it's fast, it works quickly with all monitoring solutions
out there. And it's also extremely accurate because you're adding
up counts of essentially raw data,
and it also gives you essentially arbitrary precision.
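A sketch of that calculation straight from bucket counts; the counts are invented, and in practice they'd come from your metrics backend.

    # Compute the SLO attainment directly from histogram bucket counts.
    bucket_counts = {       # upper bound in ms -> number of requests in that band
        10: 41_000,
        50: 37_000,
        100: 20_500,
        200: 1_200,
        500: 250,
        float("inf"): 50,
    }

    good = sum(c for bound, c in bucket_counts.items() if bound <= 100)  # bands under 100 ms
    total = sum(bucket_counts.values())
    slo_attainment = good / total * 100

    print(f"{slo_attainment:.3f}% of requests were served in under 100 ms")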
So this is a very robust and accurate approach,
and I highly recommend this because this will give you some great numbers
at scale. Now, you might
say, like, well, this is a lot of work to do, but again, it goes
back to prioritizing reliability work. So we want to make sure that our
data about whether we're hitting our SLOs is accurate, because we're likely spending hundreds of thousands or millions of dollars on shifting this engineering work. Now,
I showed some raw histograms there,
where we keep count of a number of samples in each bin,
and that way we can sum them up. But there's some approximate
structures out there which you can use,
and some of the vendors provide to do the same things. And they're
often called sketches, like the GK sketch or the DD
sketch structure by one of the vendors. And there's also approximate
histograms such as t Digest, made by Ted Dunning, which stores
approximations of distributions. And these
two charts here were taken from the log linear circ
Slis paper for open histogram, and they represent error
percentages for two different takes of workloads across different
p9x values on the x axis. And you
can see the red line here, which is the OpenHistogram implementation, which has very low errors. But then you look at the t-digest, DDSketch and HDR histogram, which do relatively well
in terms of errors. However, there's a detail that is not in these
charts. These errors are for single node evaluations only,
say for one web server. Now, how do approximate histograms
and sketches behave across asymmetric node workloads of
hundreds of web servers or arbitrary time windows? And that's
a very difficult question to answer. But by and large, the errors are likely
to be unbounded. Using histograms which store the exact sample counts, what I termed raw histograms on the previous slide, avoids that problem entirely, ensuring that any aggregates generated from them for SLOs are highly accurate and precise.
So the sketches are good
to a certain extent, but they don't really hit
the same level of precision as these raw
histograms. Now, while we're on
the subject of histograms, I want to highlight some recent work in this area by Adrian Cockcroft. Adrian published a Medium post titled "Percentiles Don't Work: Analyzing the Distribution of Response Times for Web Services," in which I think he characterized percentiles as wrong but useful. A few months ago, he started doing some work here where he looked at operational telemetry, which is usually latency, and used some R-based tooling to decompose it into component normal-ish distributions. So this image here was taken from his blog post, where he was
able to take a bimodal histogram here and decompose it into
two normal distributions using the mixtools R package.
Now why is this important and what does this have to do
with SLOs? We just took a look at what magnitude of errors can arise from using percentiles for latency measurements. We followed that up by looking at histograms to measure latency distributions.
So with something like this,
we can pull out these normal distributions.
And this could be relevant if we wanted to make an SLO for something like
disk writes, where you might have writing to a block device,
versus just writing to cache, or reading from the block device as opposed to reading from cache. We can use these to implement fine-grained SLOs for each of the different modes, kind of the physical manifestations of the system. In the cloud, it could be writing to S3 or different storage levels there. And so there's some really promising
work here. And I think that this
is definitely something to follow going ahead, because if you
really want to get fine grained with, say, a system that
has a few different modes at very large scale, this approach
would allow you to do that.
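As a rough sketch of the same idea, here's scikit-learn's GaussianMixture used as a stand-in for the mixtools R package Adrian used, run against synthetic cache-versus-disk latencies.

    # Decompose a bimodal latency sample into two normal-ish components,
    # for example cache hits versus disk reads (all data here is synthetic).
    import numpy as np
    from sklearn.mixture import GaussianMixture

    rng = np.random.default_rng(0)
    cache_hits = rng.normal(2.0, 0.5, 8_000)     # fast mode, in ms
    disk_reads = rng.normal(15.0, 3.0, 2_000)    # slow mode, in ms
    latencies = np.concatenate([cache_hits, disk_reads]).reshape(-1, 1)

    gm = GaussianMixture(n_components=2, random_state=0).fit(latencies)
    for mean, weight in zip(gm.means_.ravel(), gm.weights_):
        print(f"component mean ~{mean:.1f} ms, weight ~{weight:.0%}")
    # Each recovered component could then get its own fine-grained SLO.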
Now, one common question I've gotten about SLOs and error budgets is how do you implement them across a distributed service architecture?
Now, one approach is to use an SLO and error budget for each service,
and this includes third party vendor services, as shown
in blue here. Now, the error rates I've shown
here and documented in red are error
rates across these different services. So you can have a
different error rate contribution from the third party service, the mid tier and the edge
tier. And you can take those and you
can add those up and essentially get a compound or
composite error rate for what the customer is seeing. So in this
case, you might see that, hey, our in-house back end service has a 0.1% error rate. But then if you roll that up to the mid tier, now you've also got a 1% error rate from the third party, which exceeds your mid tier error budget of 1%.
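A small sketch of that roll-up, with made-up error rates similar in spirit to the diagram.

    # Roll per-tier error rates up into a composite number the client sees.
    # Simple addition is a rough upper-bound composition; the rates are invented.
    tiers = {
        "in-house backend": 0.001,      # 0.1% error rate
        "third-party service": 0.009,
        "mid tier": 0.0002,
        "edge tier": 0.0001,
    }

    composite_error_rate = sum(tiers.values())
    print(f"Composite error rate seen by the client: {composite_error_rate:.2%}")

    mid_tier_budget = 0.01   # the mid tier's 1% error budget
    if composite_error_rate > mid_tier_budget:
        worst = max(tiers, key=tiers.get)
        print(f"Budget exceeded; biggest contributor to focus on: {worst}")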
And so you can kind of put these diagrams together, and it
will help you understand where you need to focus reliability
work. In this case, you need to focus reliability work on the
third party and either pull that in house or do
some sort of interface around
it to make it more reliable. And the goal here is not to assign blame
to teams or to different services. It's to
prioritize reliability work. And that's
really what this is all about. Because for
most, I would say almost all, of you out there, you're using some sort of distributed system like this, and you're going to say, well, how do we use SLOs across that?
Remember to be customer centric, and you can roll
those error budgets up, starting from,
I'll call it upstream, which is further away from the client. You can roll
those error rates up and get a composite error
rate fairly simply and see
what the client is seeing. And that's it.
That was my tour through techniques for SLOs and error budgets at scale. I hope you enjoyed this presentation. Feel free
to reach out to me on LinkedIn or Twitter.
And that Twitter handle also works across Mastodon and
a couple of the other new sites popping up. I'd love to hear about your
experiences and talk about how you're using
SLOs and error budgets at scale. Thanks, Conf42.
We'll see you next time.