Conf42 Observability 2023 - Online

Techniques for SLOs and Error Budgets at Scale

Abstract

What tactics can you use to implement service level objectives and error budgets to operate enterprise-level services with a thousand (or more) engineers? Fred Moyer takes you through approaches he’s developed working with teams in a large production ecosystem where service reliability was a nonnegotiable business requirement. Engineers and operations folks who are putting SLIs, SLOs, and error budgets into practice in high-scale production environments with diverse service architectures should come away with an understanding of how to democratize SLOs, what objectives they should target for enterprise-level reliability, and how to communicate and implement error budgets across multiple teams.

Summary

  • I'm proud to be presenting here at Conf42 Observability 2023. This talk is about techniques for SLOs and error budgets at scale. It's basically the greatest-hits version of the previous talks I've given on the subject.
  • Fred: I'm an observability engineer at a large public company. This talk is my own opinions, not those of my employer. Raise your hand if you know what this graph is. We're all here today in part because of this thing.
  • How do you implement SLOs for 1,000-plus engineers? There wasn't really a prescription for SLIs and SLOs, so I got to work creating formulas that could be shared across a large organization. Here is what I came up with.
  • First, you need real-world examples. Second, present formulas for each of those entities which can be read easily by both humans and machines. Third, you have to be detailed and consistent. This is what enterprise customers demand these days.
  • Of the two components of latency and availability, availability is generally pretty easy to measure. Web page latency is more difficult to get right at scale. There are two aspects of being right. First, does your measurement have the right precision for your scale?
  • There are a couple of different approaches you can use for measuring latency with histograms. Many tools will happily let you use an averaging function for percentiles generated from different hosts, nodes, or clusters. But when your services are behaving asymmetrically, you'll encounter errors with this approach.
  • A recent blog post highlights some recent work in this area: using R-based tooling to decompose telemetry into component normal-ish distributions. These can be used to implement fine-grained SLOs for each of the different modes of the system in the cloud.
  • One approach is to use an SLO and error budget for each service. This includes third-party vendor services, as shown in blue here. You can roll those error rates up and get a composite error rate. The goal here is not to assign blame to teams or to different services. It's to prioritize reliability work.
  • That wraps up my tour through techniques for SLOs and error budgets at scale. Feel free to reach out to me on LinkedIn or Twitter. I'd love to hear about your experiences.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Hey folks, welcome to my talk about techniques for SLOs and error budgets at scale. I'm proud to be presenting here at Conf42 Observability 2023. I've been talking a lot about SLOs and error budgets over the past six or so years, and over that time I've given variations of this talk at different online conferences and also a couple in person. So I'm pleased to be presenting some of my learnings over the past couple of years. This is basically the greatest-hits version of the previous talks I've given on the subject. Let's get started. But first, I know this is an online conference, but let's do a little survey. Raise your hand. I know you're sitting there at home by yourself, but raise your hand if you know what this graph is. This is one of those graphs where if you know, you know. We're all here today in part because of this thing. And if you don't know what this is, that's okay. We'll come back to it a little bit later. So, hi, I'm Fred. I'm an observability engineer at a large public company, and this talk is my own opinions, not those of my employer. Basic disclaimer. I've been working on monitoring and observability for about as long as the graph on the previous slide has been around, but focusing on it heavily over the past ten or so years. I like to think about SLOs, SLIs, and error budgets, hence the SLO jargon, which I think was coined in the original Google SLO material. I like to hack on histograms, metrics, logs, and traces. I've been programming a lot of stuff for the past 20 years, and I've got two young kids, so I definitely am in need of more sleep and coffee. But let's go ahead and kick this off. So how do you implement SLOs for 1,000-plus engineers? This was a challenge I encountered about four years ago when I started a role at a company called Zendesk. I got tasked with a project to bring SLOs to an engineering organization that had over a thousand engineers, which was quite a few. And there was a big push to make the product as reliable as possible; we called reliability our number one feature. So I had to come up with a way to roll out SLOs and error budgets across all those engineers. And to do that effectively, I really had to understand what SLIs and SLOs, and hence error budgets, were programmatically. So I dove in and started to research the subject to go back to the basics. And speaking of basics, I started off by reading the original Google SRE book, followed that up with the SRE workbook, and watched Liz Fong-Jones and Seth Vargo's Google Cloud presentation on SLOs titled "SLIs, SLOs, SLAs, oh my!", which was an inspiration to me. I've given a number of SLO talks previously, most notably one called Latency SLOs Done Right, which was also given at SREcon by Theo Schlossnagle and Heinrich Hartmann, who've written a lot on the subject. And even looking back at that talk I gave, I can spot the errors in it, which were kind of subtle. But what I found researching this topic is that there wasn't really a prescription for SLIs and SLOs. The Google books talked a lot about SLIs, but were vague as far as specific examples were concerned. And even working through some of the examples in the workbook, I either found subtle omissions or places where the examples weren't completely fleshed out and tested. Liz and Seth's Google Cloud video had some concise definitions, so I took those as a base and expanded on them, and those are in use by some of the major SLO vendors out there now.
And over the next few years, there was kind of what I call a Cambrian SLO explosion. Get it? SLO explosion. Little dad joke there. But there was this explosion in SLO material, with SLO-specific conferences and also Alex Hidalgo's book on implementing SLOs. So I got to work creating formulas that could be shared across a large organization, which would leave little room for creativity and variance, because I wanted everyone on the same page. I wanted to be able to give prescriptive formulas that could be implemented at broad scale. So this is what I came up with. The definition of an SLI is what I used to put examples together. Now, there are two major SLI opinionations, though the difference between them is a bit subtle at first glance. The first is from the Google SRE book, which describes an SLI as a measurement of system performance. The second, which I found first in Liz and Seth's video, describes an SLI as something that delineates good requests from bad requests. That second opinion is the one that really resonated with me. Both opinionations saw service at Google, even though they're somewhat conflicting, but the second, as I mentioned, is implemented more broadly by the practitioners and vendors that I found. And I decided to base my example on that second opinionation, not only because it had wider acceptance, but because intuitively it made more sense to me. I spent quite a bit of time dissecting the examples in the Google SRE book and the SRE workbook, and they were good. But I think the evolution of SLIs and SLOs at Google probably bifurcated because they have a lot of teams there. That's not a criticism of the book or the organization, but the definitions that I came across seemed to be more abstract than what I was looking for. So here are three examples of SLIs for the second SLI opinionation that I moved forward with. They each consist of three things: a metric identifier, a metric operator, and a metric value. This approach is straightforward for a human being to understand, but also fits easily into most of the open source and commercial monitoring and observability software out there. The part of the SLI definition which requires the most consideration is what that metric value should be. That's the one that is often tuned or calibrated by an engineering team, most often for latency, sometimes with error response codes. A 5xx is pretty clearly a bad request, but it's going to be up to engineering teams to determine, like, is a 404 a bad request, or is that just clients thinking that they're going to the right place? Because really, all of this stuff is about feeling customer pain and wanting to make sure that customers have a great experience. And so I kind of cemented this example so that I could socialize widely within the engineering organization what an SLI was, which leads us to: what's an SLO? That definition came down to the number of good requests divided by the total number of requests over a time range. This is often called a request-based SLO, where you count up the requests and see if you got, say, 99% of them right over a certain time range. And I name the three different components here a little bit differently. In red, we have the success objective, which is your typical "how many nines?" Then we drop the SLI in, which works really well for a lot of the tooling out there. Then we have a period.
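To make the shape of those formulas concrete, here is a minimal sketch in Python, my own illustration rather than code from the talk, of an SLI as a good/bad rule and a request-based SLO with a success objective and a period. All names and values here are hypothetical.

```python
# A minimal sketch (not from the talk) of an SLI/SLO pair as described above.
from dataclasses import dataclass
from typing import Callable

@dataclass
class SLI:
    metric: str                       # metric identifier, e.g. "request_latency_ms"
    is_good: Callable[[float], bool]  # metric operator + value, e.g. lambda v: v < 100

@dataclass
class SLO:
    sli: SLI
    objective: float  # success objective, e.g. 0.9995 (three and a half nines)
    period: str       # time range, e.g. "7d"; without one it isn't really an SLO

def slo_met(slo: SLO, samples: list[float]) -> bool:
    """Request-based SLO: good requests / total requests, compared to the objective."""
    good = sum(1 for v in samples if slo.sli.is_good(v))
    return good / len(samples) >= slo.objective

# "Latency under 100 ms for 99.95% of requests over 7 days"
latency_slo = SLO(SLI("request_latency_ms", lambda v: v < 100), 0.9995, "7d")
print(slo_met(latency_slo, [42.0, 87.5, 120.0, 95.0]))  # False: only 3/4 = 75% good
```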
And if you don't have a time period here, you don't really have an SLO, because it's really important to specify this so that you're evaluating it over something that's meaningful to the customer. One question that has come up is, how do I know how many nines to choose for this success objective? When I was at Zendesk, we had an engineering VP named Jason Smale, who was very technical, and engineers held him in high regard. And he said, we need to hit three and a half nines. So that 99.95% number became known as Smale's number, and if reliability dipped below that number, it usually meant that a customer somewhere was feeling pain. And really, if you want to get into enterprise software, this is kind of the bar you must meet to get on the ride. So now that you realize you're dealing with enterprise customers and you need three and a half nines, how do you pick an appropriate metric value for your SLI, since that's the only variable left once you fix the objective at 99.95? This is essentially what I call calibrating your SLO. Take a time period of known good performance, set your objective at 99.95, and iterate across your SLI to figure out what latency value gives you that 99.95%. In this example, it could be 100 milliseconds. I was able to develop some simple tooling to do that, or use our commercial monitoring tooling to do that, and I developed a dashboard where engineers could set their objective at 99.95 and then iterate over their latency to see what latency value it was at which customers were getting that three and a half nines performance. And just to reiterate, the time period here is very important, and this is a common oversight that I've seen in most of the literature. They'll say, take an SLO of 100 milliseconds at 99.9%, but what time period is that over? Is it over a minute, an hour, a week? And you can, and probably should, have SLOs which use the same success objective and SLI but different time ranges, depending on the stakeholder. An engineer's manager might want to know the reliability over a week so they can schedule reliability work. A director might want to know it over a month, and a VP might want to know how reliable the service was over a quarter, for reporting to C staff or setting the direction of technical efforts. And the purpose of SLOs is often to prioritize reliability work. That is, if you aren't meeting your SLOs, you want to deprioritize feature work in favor of reliability engineering. And we want to use these measurements to do this because we want to be accurate. If we reprioritize engineering resources, that is expensive, so we want to make sure that we're doing that based off data that's correct and precise. Now let's take a quick look at error budgets. An error budget is essentially just an inverted SLO. You subtract your success objective from one and you get your allowed failure rate for user requests. Like a financial budget, you have a certain amount of errors that you can spend over a time period, and ideally this is an amount that does not make your customers think that your service is unreliable. You can create monitors for this with most of the tooling out there, and perhaps alert when, say, 80% of your error budget has been used up for a given time period, which will let your engineering teams know that it's time to work on reliability.
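Here is a rough sketch, again my own illustration and not the speaker's tooling, of the calibration step just described: fix the success objective, take latency samples from a period of known-good performance, and sweep candidate thresholds until one satisfies the objective. The candidate range and step size are arbitrary.

```python
# Calibrate the SLI's latency threshold against a fixed success objective,
# using samples from a known-good period. Candidate thresholds are made up.
def calibrate_latency_sli(samples_ms, objective=0.9995, step_ms=10, max_ms=2000):
    total = len(samples_ms)
    for threshold in range(step_ms, max_ms + 1, step_ms):
        good = sum(1 for v in samples_ms if v < threshold)
        if good / total >= objective:
            return threshold          # e.g. 100 (ms), as in the talk's example
    return None                       # objective unreachable within this range

# The error budget is just the inverted objective:
error_budget = 1 - 0.9995             # 0.05% of requests may be "bad" per period
```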
You can also alert when the rate of error budget burn predicts that you will exhaust your error budget before the time period has elapsed; a lot of the tooling out there has functionality for that. So there are really two conditions where your error budget should spur action. First, if it's being used up too quickly and is in danger of being exhausted for that period, that should prioritize reliability-focused work. The second is if your error budget is not being used up at all. That could indicate an improperly calibrated SLO, or it might mean that your service is normally so reliable that you're not prioritizing enough feature work, or that you should embark on controlled error budget burns. Google did that and mentioned it in the SRE book: with their Chubby service, which was their distributed lock service, they introduced artificial error budget burn into the service so that consumers of Chubby would have to make their own services able to tolerate those Chubby failures, and hence become more reliable. And again, like SLOs, error budgets should reflect the mindset of the customer as much as possible. If the error budget is not exhausted but your customer is on the phone with your VP, go take a look at what you are measuring and whether it really reflects what the customer is experiencing. So, to sum up what I've shown you so far, there are a few points on getting thousands of engineers on the same page for SLOs and error budgets. First, you need real-world examples. Most of the published books out there are a bit abstract and hand-wavy and don't really give you complete examples, so you need to have those to show folks. Second, present formulas for each of those entities which can be read easily by both humans and machines, and I've shown you what I used at scale there. Third, you have to be detailed and consistent. I see so many SLOs out there that leave off the time period. You might say, well, the time range can be whatever you want, but then it's not an actionable SLO or error budget without a time range. So we've looked at some example SLOs that most engineers can parse and memorize, and which engineering managers and product managers can use to correlate with user happiness. In most cases, that happiness means your service is available and it's running fast. We can take the formulas I just showed and extend them to cover both conditions at once. So here we're talking about not only availability but also latency, and you need to have both of those. So here's an example SLI, SLO, and error budget which covers both latency and availability. If the page response is not a 5xx and the request was served in under 100 milliseconds, that request can be considered a good request. That's our SLI, to which we can add a success objective of three and a half nines and a time range of seven days to be evaluated on. To get the error budget, we can subtract our success objective of 99.95% from 100%, which gives us an error budget of 0.05%. It's easy to understand, and you can also easily create multiple SLOs and error budgets from the base SLI just by extending the time range. Now, on the point of the success objective: here I have 99.95% listed, which is three and a half nines. Realistically, this is what enterprise customers demand these days. That means out of a million requests, you only get 500 requests that are slow or that return what's known as a failure, a 500 internal server error, for example.
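As a hedged illustration of those two alerting conditions, here is a small Python sketch, not any particular vendor's implementation, that flags both an 80%-spent budget and a projected exhaustion using a naive linear burn-rate projection. The numbers in the example are invented.

```python
# Two error budget alerts: "most of the budget is spent" and "current burn rate
# projects exhaustion before the window ends". Inputs are made up for the demo.
def budget_used(bad: int, total: int, objective: float) -> float:
    allowed_bad = (1 - objective) * total      # e.g. 500 bad out of 1,000,000 at 99.95%
    return bad / allowed_bad if allowed_bad else float("inf")

def error_budget_alerts(bad, total, objective, elapsed_h, period_h, threshold=0.8):
    used = budget_used(bad, total, objective)
    projected = used * (period_h / elapsed_h)  # naive linear projection to period end
    return {"80_percent_spent": used >= threshold,
            "projected_exhaustion": projected >= 1.0}

# Two days into a 7-day window: 300 bad of 1,000,000 requests at a 99.95% objective
print(error_budget_alerts(300, 1_000_000, 0.9995, elapsed_h=48, period_h=168))
# -> {'80_percent_spent': False, 'projected_exhaustion': True}
```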
And so if you're at scale, this should be your success objective. And I go into this in a bit more depth in the SREcon presentation linked below. So at this point we've got example formulas for SLIs, SLOs, and error budgets that should be easy for folks to understand and also straightforward to implement with most monitoring and observability tooling out there, both open source and commercial. Of the two components of latency and availability, availability is generally pretty easy to measure. The simplest example is a 500 response: you see the "sorry, a problem occurred" page. Web page latency, however, is more difficult to get right at scale. And when I say get it right, there are two aspects of being right. First, does your measurement have the right precision for your scale? That is, if I have 1 million user requests, can you generate a latency aggregate which means you aren't leaving more than a few dozen users off? Precision here is the number of decimal places. The other aspect is accuracy: is your latency aggregate for an SLO or a monitor actually correct? In many cases I've seen, the answer is no to both. And on precision versus accuracy: precision is the number of decimal places; accuracy is whether the values in those decimal places are correct. So let's dive in, coming back to this chart. This chart is an RRD graph, and it measures network usage and calculates the 95th percentile over a time period. At the time of the dot-com boom, you saw a lot of these RRD graphs, and they were mostly used for metering bandwidth. Bandwidth was billed at something like five megabits at the 95th percentile, meaning that if you took all your five-minute bandwidth usage measurement slices, ordered them, and took the 95th percentile, and that number was above five megabits, you incurred overage charges. And this first popularized the approach of using percentiles. That would really notably be seen about ten years later, in 2011, with the advent of the StatsD protocol developed by Etsy, which provided the p95 as a latency aggregation metric. I wrote more about this in a blog post I published last year, and I'll go into some of the content in the next slides, but this is the historical significance of this graph. So let's talk about percentiles. This is a slide from an SLO presentation I gave at SREcon 2019. It illustrates two latency distribution profiles, which are meant to represent service nodes that are behaving differently. The blue distribution represents a bimodal latency profile with lower latencies than the single-mode red latency distribution. Basically, this could be two web servers, one performing well and one performing not as well. The red server is not performing as well, and if we take the p95 values for latency for each server and average those, we could get an indicator of around 430 milliseconds, and we might think that that's the performance of our service. But if we combine the raw latency values from each of these distribution sets and calculate the aggregate p95 from those, we'll get 230 milliseconds, and the error there is almost 100%. And many, if not all, of the monitoring and observability tools out there will happily let you use an averaging function for percentiles generated from different hosts, nodes, or clusters. If your distribution profiles are the same, no problem, that works great. But it's when your services are behaving asymmetrically that you'll encounter large errors with this approach, and this is a problem with percentiles.
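The percentile-averaging trap is easy to reproduce. Here is a small synthetic demonstration in Python using NumPy; the two node latency profiles are invented and are not the talk's actual data, but they show how far the average of per-node p95s can drift from the true aggregate p95 when nodes behave asymmetrically.

```python
# Averaging per-node p95s vs. computing the p95 of the combined raw samples.
import numpy as np

rng = np.random.default_rng(42)
healthy_node = rng.normal(loc=120, scale=20, size=100_000)   # fast node, most traffic
degraded_node = rng.normal(loc=600, scale=80, size=5_000)    # slow node, little traffic

avg_of_p95s = np.mean([np.percentile(healthy_node, 95),
                       np.percentile(degraded_node, 95)])
true_p95 = np.percentile(np.concatenate([healthy_node, degraded_node]), 95)

print(f"average of per-node p95s:  {avg_of_p95s:6.1f} ms")   # badly skewed by the slow node
print(f"aggregate p95 of raw data: {true_p95:6.1f} ms")      # what users actually experienced
```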
And I talked about that in depth in that presentation. So beware of using percentiles. I've talked about this and ranted about this, and this kind of illustrates the prime condition where that's an issue. And again, if everything's running smoothly or if you have a single node, percentiles work just great. But it's the real-world scenarios, where we have different node performance profiles and possibly hundreds or thousands of nodes servicing requests, that we want to be able to handle and evaluate accurately. That leads us into histograms for measuring web service latency. I gave an internal talk at Zendesk a few years ago that I called Dr. Histogram, or How I Learned to Stop Worrying and Love Latency Bands, and I went into more depth on the intricacies of these three different types of histograms in the SLOconf link below. But in short, there are a couple of different approaches you can use for measuring latency with histograms. This involves essentially collecting a latency sample and fitting it into what we call a bucket or a bin. You'll see the gray and blue bars here; those are your buckets or bins. So let's take a look at how these are implemented differently. First, we could have a log-linear histogram, which you can see the details of at openhistogram.io. If we have a latency value here of 125 milliseconds, we could say, oh, we'll just slot that sample into the greater-than-100-millisecond but less-than-200-millisecond bucket. This is a data structure that is fairly easy to represent, because all you have is an array representing the different histogram buckets, and then you increase the value of that array, essentially a counter, for each of those. And this is a volume-invariant way of storing large amounts of latency data that you can also use to generate highly accurate aggregates for an entire cluster or any set of hosts. Folks might also be familiar with the middle structure. That's the cumulative histogram which Prometheus uses. So if I have a latency value of 125 milliseconds, it will assign labels starting at less than or equal to infinity all the way down to less than or equal to 200. This takes a few more data structures, or a few more counter values, to implement, and it's not quite as efficient as the log-linear histogram. At Zendesk, I flipped that on its head and came up with what I called an inverse cumulative histogram, where, for example, if we have 125 milliseconds, I could have a counter data structure, bump the counter, and assign these labels to it, which are often known as metric tags. I could assign greater than 10, greater than 50, and greater than 100 milliseconds, but not greater than 200 milliseconds. This approach made my head hurt for a little bit, but it has some advantages in terms of operator efficiency and ease of implementation with a lot of the tooling out there. All of these buckets can also be referred to as latency bands. So you can take a look at each of these different types of histograms and decide, I might want to try using histograms for storing latency, and one of these should give you some good results. And you might ask, well, okay, now I know how to capture latency in a histogram at scale. How do I generate an SLO from it? Well, let's go back to our definition. It's the number of good requests divided by the total number of requests over a time range. And so in this case, we can use the histogram data for the SLI.
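Before moving on, here is a toy sketch of the latency-band idea, using the inverse cumulative tagging described above with arbitrary band edges. Real implementations such as OpenHistogram's log-linear buckets or Prometheus's cumulative "le" buckets differ in detail; this only illustrates the counting scheme.

```python
# Inverse cumulative latency bands: a 125 ms sample bumps gt_10, gt_50, and
# gt_100, but not gt_200. Band edges and samples here are made up.
from collections import Counter

BANDS_MS = [10, 50, 100, 200, 500, 1000]

def record(counters: Counter, latency_ms: float) -> None:
    counters["total"] += 1
    for band in BANDS_MS:
        if latency_ms > band:
            counters[f"gt_{band}"] += 1

counters = Counter()
for sample_ms in [8, 42, 125, 310, 95]:   # made-up latency samples
    record(counters, sample_ms)

# Good requests under 100 ms = total minus everything slower than 100 ms
good = counters["total"] - counters["gt_100"]
print(f"{good} of {counters['total']} requests were under 100 ms")
```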
We can sum up the number of requests below 100 milliseconds, divide that by the total number of requests, which is just the sum of the counts of all the bands, and multiply that by 100. In the case of the number of requests under 100 milliseconds, with the inverse cumulative histogram, we add up the counts of the blue bars. With the log-linear histogram, we just add up the counts of the gray bars to the left of the blue bar. So, mathematically, this is very simple to implement, and it's fast; it works quickly with all the monitoring solutions out there. It's also extremely accurate, because you're adding up counts of essentially raw data, and it gives you essentially arbitrary precision. So this is a very robust and accurate approach, and I highly recommend it, because it will give you some great numbers at scale. Now, you might say, well, this is a lot of work to do, but again, it goes back to prioritizing reliability work. We want to make sure that our data about whether we're hitting our SLOs is accurate, because we're likely spending hundreds of thousands or millions of dollars on shifting this engineering work. Now, I showed some raw histograms there, where we keep a count of the number of samples in each bin, and that way we can sum them up. But there are some approximate structures out there which you can use, and which some of the vendors provide, to do the same things. They're often called sketches, like the GK sketch or the DDSketch structure from one of the vendors. There are also approximate histograms such as t-digest, made by Ted Dunning, which stores approximations of distributions. These two charts here were taken from the log-linear histogram (circllhist) paper behind OpenHistogram, and they represent error percentages for two different types of workloads across different p9x values on the x axis. You can see the red line here, which is the OpenHistogram implementation, has very low errors. But then you look at t-digest, DDSketch, and HDR Histogram, which do relatively well in terms of errors. However, there's a detail that is not in these charts: these errors are for single-node evaluations only, say for one web server. Now, how do approximate histograms and sketches behave across asymmetric node workloads of hundreds of web servers, or arbitrary time windows? That's a very difficult question to answer, but by and large, the errors are likely to be unbounded. Using histograms which store the exact sample counts, what I termed raw histograms on the previous slide, avoids that problem entirely, ensuring that any aggregates generated from them for SLOs are highly accurate and precise. So the sketches are good to a certain extent, but they don't really hit the same level of precision as these raw histograms. Now, while we're on the subject of histograms, I want to highlight some recent work in this area by Adrian Cockcroft. Adrian published a Medium post a few months ago titled "Percentiles Don't Work: Analyzing the Distribution of Response Times for Web Services"; I think he described percentiles as wrong, but useful. He started doing some work where he looked at operational telemetry, which is usually latency, and used some R-based tooling to decompose it into component normal-ish distributions. So this image here was taken from his blog post, where he was able to take a bimodal histogram and decompose it into two normal distributions using the mixtools R package.
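As a rough Python analogue of that decomposition, and not Adrian's actual R/mixtools workflow, here is a sketch that fits a two-component Gaussian mixture to a synthetic bimodal latency sample using scikit-learn. The mode names (cache versus block device) are assumptions for illustration.

```python
# Decompose a synthetic bimodal latency sample into two normal-ish components.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(7)
cache_reads = rng.normal(loc=2, scale=0.5, size=8_000)    # e.g. served from cache
device_reads = rng.normal(loc=15, scale=3.0, size=2_000)  # e.g. hitting the block device
latencies_ms = np.concatenate([cache_reads, device_reads]).reshape(-1, 1)

gmm = GaussianMixture(n_components=2, random_state=0).fit(latencies_ms)
for mean, weight in zip(gmm.means_.ravel(), gmm.weights_):
    print(f"mode at ~{mean:4.1f} ms carrying ~{weight:.0%} of requests")
# Each recovered mode could then get its own fine-grained SLI and SLO.
```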
Now, why is this important, and what does this have to do with SLOs? We just took a look at what magnitude of errors can arise from using percentiles for latency measurements, and we followed that up by looking at histograms to measure latency distributions. With something like this, we can pull out these normal distributions. This could be relevant if we wanted to make an SLO for something like disk writes, where you might have writing to a block device versus just writing to cache, or reading from the block device as opposed to reading from cache. We can use these to implement fine-grained SLOs for each of the different modes of, kind of, the physical manifestations of the system in the cloud. It could be writing to S3 or different storage tiers there. So there's some really promising work here, and I think this is definitely something to follow going ahead, because if you really want to get fine-grained with, say, a system that has a few different modes at very large scale, this approach would allow you to do that. Now, one common question I've gotten about SLOs and error budgets is, how do you implement them across a distributed service architecture? One approach is to use an SLO and error budget for each service, and this includes third-party vendor services, as shown in blue here. The error rates I've shown here and documented in red are error rates across these different services. So you can have a different error rate contribution from the third-party service, the mid tier, and the edge tier. You can take those and add them up and essentially get a compound or composite error rate for what the customer is seeing. So in this case, you might see that, hey, our in-house back-end service has a 0.1% error rate. But then if you roll that up to the mid tier, now you've also got a 1% error rate from the third party, which exceeds your mid-tier error budget of 1%. So you can put these diagrams together, and it will help you understand where you need to focus reliability work. In this case, you need to focus reliability work on the third party, and either pull that in house or build some sort of interface around it to make it more reliable. The goal here is not to assign blame to teams or to different services. It's to prioritize reliability work, and that's really what this is all about. Because for most, I would say almost all, of you out there, you're using some sort of distributed system like this, and you're going to ask, well, how do we use SLOs across that? Remember to be customer-centric, and you can roll those error budgets up, starting from, I'll call it upstream, which is further away from the client. You can roll those error rates up and get a composite error rate fairly simply, and see what the client is seeing. And that's it, my tour through techniques for SLOs and error budgets at scale. I hope you enjoyed this presentation. Feel free to reach out to me on LinkedIn or Twitter. And that Twitter handle also works on Mastodon and a couple of the other new sites popping up. I'd love to hear about your experiences and talk about how you're using SLOs and error budgets at scale. Thanks, Conf42. We'll see you next time.
...

Fred Moyer

SRE Observability Engineer



