Conf42 Site Reliability Engineering 2022 - Online

Alerting on SLOs and Error Budget Policies

Abstract

Assessing your system’s reliability through SLOs is a great way to really understand and measure how happy users are with your service(s). Error Budgets give you the amount of reliability you have left before users are unhappy. Ideally, you want to be alerted way before users are dissatisfied and take the appropriate measures to ensure they aren’t. How can you achieve that?

That’s where alerting on SLOs and Error Budget Policies come into the picture. By tracking how happy your users are through SLOs, and alerting way before their level of dissatisfaction reaches critical levels, you’ll be able to define policies to deal with issues in a timely manner, ensuring operational excellence.

Summary

  • Today we're going to talk about alerting on SLOs, what error budget policies are, and how we can leverage them. We'll then talk about some reliability concepts. At the end, we're going to conclude on why all of this is important.
  • Anova works in industrial IoT. We need a framework to ensure our systems are reliable, and we want to be alerted when something is not being reliable. How can you measure reliability?
  • In practice, SLAs are nothing more than SLOs that have consequences when they are not met. The idea is to trigger an alert when the available error budget reaches a critical level. SLAs are usually looser than SLOs, so that when an SLO is broken there is still buffer before the SLA is breached.
  • An SLO is how many times an SLI needs to be met so that our users are happy with our services. SLO alerts can trigger an error budget policy. This way we can define reliability in the eyes of our users. I'm writing a book on overcoming SRE antipatterns.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Hello, everyone. Welcome to Conf42 Site Reliability Engineering. My name is Ricardo Castro, and today we're going to talk about alerting on SLOs, what error budget policies are, and how we can leverage them. So what do we have on the menu? For starters, we're going to set up some context. SLOs are all about reliability, so it's easier if we have some ground knowledge to build on before getting to alerts and error budget policies. We'll then talk about some reliability concepts: we're going to talk about SLOs, but we also need to talk about everything that encompasses SLOs, the foundation that gives us SLOs. We need to talk about metrics and SLIs, and then about related concepts like error budgets and SLAs. We'll then get to the good part, where we talk about how we can alert on SLOs, what error budget policies are, and how we can leverage them. And at the end, we're going to conclude on why all of this is important.

So let's start by setting some context for our discussion. An example from the real world that we can draw a parallel from is a supermarket. If we think a little bit about it, a supermarket is kind of a microservices architecture. The idea is that I, as a user, go into a supermarket, I do my shopping and I go out. What happens in reality is that there are many, many things happening underneath the covers that make it possible for me to do this transaction. For me as a user, I just want to go in, select the things that I want, pay, and get out. But that means there is a cashier to register everything and receive payments. That means there are people who get stock from the warehouse and make it available for me to do my shopping. There are also people who place orders to ensure that stock arrives at the supermarket, people who unload the trucks, butchers, people who work at the fish stand. So there's a lot going on underneath the covers that makes it possible for me to do this transaction. Like with a microservices architecture, I interact with an application, and that means there may be a lot of services working together to provide the functionality that is required.

So what does it mean, exactly, to be reliable? Let's stick with the supermarket example. I did my shopping and I want to pay: how might that action not be reliable? If it's taking too long to pay, if I have to wait half an hour in a queue just to be able to pay, I might consider that the service is not being reliable. Another example: when I go to pick up a product, if the product's expiration date has passed, I might say that the service is not being reliable. And we can draw a quick parallel between this reality and the tech world: the analogous concept for taking too long to pay is latency, whether a request, for example, takes too long to be served; and a product's expiration date having passed might mean an error, something that wasn't supposed to happen.

Going a bit further into what reliability means, and looking at our reality more on the tech side: I work at a company called Anova, and at Anova we work in industrial IoT. So we provide services to our customers.
These numbers are a little bit outdated, but we have more than 2000 customers worldwide, we operate in more than 80 countries, and we monitor more than 1 million assets. What we do is get data from industrial sensors, process it, store it, and then apply things like machine learning and AI to build applications that allow our customers to provide good service to their own customers. And this means that we need a framework to ensure that our systems are being reliable, and ideally we want to be alerted when something is not being reliable.

So what does reliability mean, exactly? If we look at the dictionary, the Cambridge dictionary in this case, we find a definition that says it is "the quality of being able to be trusted or believed because of working or behaving well". In essence, this means that something is reliable if it's behaving well. That is a bit rough, so I prefer the definition from Alex Hidalgo's book, Implementing Service Level Objectives, as a better way to define reliability. In essence, what Alex says is that reliability is defined by our users. The answer to the question "is my service being reliable?" is: is my service doing what its users need it to do? If we look at reliability from the point of view of our users, and they are satisfied, my service is being reliable.

So how can you actually measure this? Let's start our discussion with the founding concepts that will lead us to SLOs and the other concepts. The most fundamental concept is the metric. A metric is nothing more than a measurement of something in my system: an event happens, and I take a measurement of it. Let's imagine a few examples: the amount of memory a server is using, the time it takes for an HTTP request to be fulfilled (for example, in milliseconds), the number of HTTP responses that are an error, or how old a message is when it arrives at a Kafka cluster. These are just measurements: an event happens and we take some kind of measurement of that event.

Building on top of metrics, we have the concept of an SLI. An SLI is a quantifiable measure of service reliability, and it is what allows us to say whether an event is good or bad. But how can we define that? We need to achieve a binary state, even if the metric itself doesn't give us that out of the box. Here are a few examples of how we can define an SLI. We can say that requests to a service need to be responded to within 200 milliseconds. That means that when I serve a request, I can measure how long it took: if it was more than 200 milliseconds, it wasn't a good event; if it was 200 milliseconds or less, it was a good event. Analogously, we can say that if a response got a 500 code it is not a good event, and if it got another code it is a good event. And the same thing for Kafka messages: if a message arrives older than five minutes, we might say it's not a good event; if it's younger than five minutes, it is a good event.
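To make this classification of events concrete, here is a minimal sketch in Python (my own illustration; the function names and thresholds are assumptions based on the examples above, not code from the talk) that turns raw measurements into good/bad events:

```python
from dataclasses import dataclass

# Illustrative thresholds taken from the SLI examples above.
LATENCY_THRESHOLD_MS = 200
MAX_KAFKA_AGE_SECONDS = 5 * 60


@dataclass
class HttpEvent:
    latency_ms: float
    status_code: int


def is_good_latency_event(event: HttpEvent) -> bool:
    """Good if the request was served within the 200 ms threshold."""
    return event.latency_ms <= LATENCY_THRESHOLD_MS


def is_good_error_event(event: HttpEvent) -> bool:
    """Good if the response was not a server error (the talk's 500 example)."""
    return event.status_code < 500


def is_good_kafka_event(age_seconds: float) -> bool:
    """Good if the message is at most five minutes old on arrival."""
    return age_seconds <= MAX_KAFKA_AGE_SECONDS


# Classify a few raw measurements into good/bad events.
events = [HttpEvent(120, 200), HttpEvent(350, 200), HttpEvent(90, 503)]
print([is_good_latency_event(e) for e in events])  # [True, False, True]
print([is_good_error_event(e) for e in events])    # [True, True, False]
```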
Building on top of SLIs, we have the concept of SLOs. An SLO defines how many times an SLI needs to be good so that my users are happy, and that needs to be time-bound. Here are a few examples. We can say that 99% of requests to a service need to be responded to within 200 milliseconds within a 30-day period. In the same way, we can say that 95% of requests to a service need to be responded to with a code different from 500 within a seven-day period. And likewise, 99.99% of messages arriving at a Kafka cluster must not be older than five minutes within a 24-hour period. Essentially, what an SLO gives us is a way to say, within a time bound, how many times my SLI needs to be met so that my users are not unhappy.

So what exactly is an error budget? An error budget is nothing more than what is left over from my SLO. If I take 100% and subtract the SLO, I get my error budget. So if I have an SLO of 99%, I can say that I have an error budget of 1%. It's effectively the amount of reliability I have left to spend, and it helps us make educated decisions on whether, for example, to release a new feature or not; we're going to see that in a bit. And of course, it gives the incident response process an appropriate budget, so we know when we need to act.
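As a rough illustration of how an SLO and its error budget relate, here is a small sketch (with hypothetical numbers, not data from the talk) that computes compliance and remaining error budget from counts of good and bad events in a window:

```python
def slo_status(good_events: int, total_events: int, slo_target: float) -> dict:
    """Compliance and error budget for a single SLO window.

    slo_target is a fraction, e.g. 0.99 for a 99% SLO.
    """
    compliance = good_events / total_events
    error_budget = 1.0 - slo_target                      # allowed fraction of bad events
    budget_consumed = (1.0 - compliance) / error_budget  # 1.0 == budget fully spent
    return {
        "compliance": compliance,
        "error_budget": error_budget,
        "budget_consumed": budget_consumed,
        "budget_remaining": 1.0 - budget_consumed,
    }


# Hypothetical month: 1,000,000 requests, 7,500 of them too slow, 99% latency SLO.
print(slo_status(good_events=992_500, total_events=1_000_000, slo_target=0.99))
# -> compliance 0.9925, error budget 1%, 75% of it consumed, 25% remaining
```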
Last but not least, there is the concept of an SLA, which most of us are already familiar with. When we sign up for a service, let's imagine a cloud provider like AWS or whichever cloud provider you are using, they usually provide us with an SLA. So what is an SLA? It's usually a commitment between a service provider and a client, but in practice SLAs are nothing more than SLOs that have consequences when they are not met. Here are examples of what an SLA can be. I can say that 99% of requests to a service need to be responded to within 200 milliseconds within a 30-day period; if that doesn't happen, if the SLA is not met, the client gets a 30% discount. In the same way, we can say that 90% of requests to a service must be responded to with a code different from 500 within a seven-day period; if not, the client gets 50% of its money back. And last but not least, we can say that 99% of messages arriving at the Kafka cluster must not be older than five minutes within a 24-hour period; if not, we can be fined €100,000. SLAs are usually looser than SLOs, so that when an SLO is broken we still have some buffer to fix things before the SLA is broken. But that means we need a way to know whether an SLO is at risk or not.

Of course, we can put that into a visualization. We can create graphs to look at, showing the SLO, the objective we are trying to achieve (in this case 99%), whether or not we are burning budget, and how much error budget is left for the period. Visualization is a nice way to understand whether an SLO is at risk, but ideally we would be alerted, right? We don't want to spend our whole work day looking at these charts, and even worse, if something happens during the night that puts my services at risk, I need to be alerted. This is where alerting on SLOs comes in.

Traditionally, we alerted on metric thresholds: we would send an alert if some threshold on a metric is crossed. We can say that if CPU usage goes above 80%, I want to receive an alert. If a request takes more than 200 milliseconds, I want to receive an alert. If a request is served with a 500 error, I receive an alert. And the same thing for a Kafka message. Of course, all of that can be combined, and I can receive an alert if a combination of these things happens.

We can take the same approach with SLOs. We can say that if the latency SLO goes below 99%, which is my target, I receive an alert, and we can say the same thing for an error-rate SLO that drops below 99.95%. This is similar to what some of us were already doing, because we could say, okay, if more than x requests take more than 200 milliseconds, we receive an alert. This is better than the plain metric threshold, but I would only receive an alert when I'm already in trouble. If I define 99% as the threshold where my users are happy, going below that basically means my users are unhappy. Ideally, I would want to be alerted before this happens, and if we are relying on this to fix things before an SLA is breached, being alerted early is exactly what we need.

So how can we improve on this? We can alert on how much error budget is available, or how much error budget has been burned. The idea is that I will trigger an alert when the available error budget reaches a critical level, or when a certain amount of error budget has already been burned. We can set different levels and send different messages to different channels. I can say, for example, that if I have burned 25% of my error budget, I send an email to my team. If half of it has been burned, I post a message on Teams. And if, for example, 75% of the error budget has already been burned, I want to send a page through PagerDuty and tell the team it needs to do something immediately. This is better than the solution we had before, but at this point we have no clue about how fast the error budget is being consumed. So a question arises: if by the end of the evaluation period we would still have some error budget left, would I want to receive this alert, for example on PagerDuty? Probably not, because if we are still within the bounds of what my users consider to be reliable, I wouldn't want to be woken up at 3:00 a.m. to fix something that doesn't actually need to be fixed. And yet we still don't know how fast the budget is being consumed, which means that maybe, by the time I receive the alert that 75% of the error budget has been burned, it is burning so fast that we're going to get into trouble anyway.

If we think about it, we can actually alert on burn rate. Alerting on burn rate tells us how fast the error budget is being consumed. A burn rate of one essentially means that, if we keep burning at this constant pace, at the end of my period, the periods we've seen previously, like 30 days, one week, or 24 hours, I will have burned all my error budget. Here's an example: for an evaluation window of, say, four weeks, alert if the burn rate reaches two. Why would I want this? A burn rate of two, double the maximum sustainable burn rate, means that I will consume all my error budget in half the time. So for a period of four weeks, if my burn rate is two, after two weeks I would have no error budget left, and I would want to receive an alert. This is also better.
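To make the burn-rate idea concrete, here is a minimal sketch (my own illustration, with hypothetical numbers) of computing a burn rate for an observed window and checking it against the kind of threshold described above:

```python
def burn_rate(bad_events: int, total_events: int, slo_target: float) -> float:
    """How fast the error budget is being consumed in the observed window.

    A burn rate of 1.0 means the budget is spent exactly at the end of the
    SLO period if this pace continues; 2.0 means it is spent in half the period.
    """
    observed_error_rate = bad_events / total_events
    allowed_error_rate = 1.0 - slo_target
    return observed_error_rate / allowed_error_rate


# Hypothetical hour: 3% of requests failed against a 99% SLO -> burn rate of 3.
rate = burn_rate(bad_events=300, total_events=10_000, slo_target=0.99)
if rate >= 2.0:  # the "burn rate of two" threshold from the four-week example
    print(f"ALERT: burn rate {rate:.1f}, budget will be gone well before the period ends")
```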
But this approach has one slight issue: if the burn rate is very high, it might not be picked up. For example, if we evaluate the burn rate every hour but the error budget is all consumed within 30 minutes, we would receive no alert. So the last evolution of our alerting on SLOs is multi-window, multi-burn-rate alerts. The idea is to combine the previous alerts we've seen, using multiple windows. We want to alert on fast burn, when the burn rate is very high; that alerts us on sudden changes, some catastrophic event that is consuming budget really, really fast. But at the same time, we also want to catch slow burn: something that is consuming our error budget consistently over a longer period of time. We want to alert on those as well. Here are a few examples. For fast burn, we could say that over a short window, evaluated in five-minute periods, if the burn rate reaches ten we want to receive an alert; this would alert us on a spike in budget consumption. And we would have a similar but slower one: we would evaluate a 24-hour window, every five minutes, and if the burn rate reaches two we would receive an alert.
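Here is a rough sketch of how a multi-window, multi-burn-rate check could look (the window sizes, thresholds and severities are illustrative assumptions, not the exact values from the talk):

```python
from dataclasses import dataclass


@dataclass
class BurnRateRule:
    name: str
    long_window: str   # label only; the caller supplies the measured rates
    short_window: str
    threshold: float   # burn rate that triggers the alert
    severity: str      # where the alert goes


# Illustrative rules: a fast-burn page and a slow-burn ticket.
RULES = [
    BurnRateRule("fast-burn", long_window="1h", short_window="5m", threshold=10.0, severity="page"),
    BurnRateRule("slow-burn", long_window="24h", short_window="5m", threshold=2.0, severity="ticket"),
]


def should_fire(rule: BurnRateRule, long_rate: float, short_rate: float) -> bool:
    """Fire only if BOTH windows exceed the threshold.

    The short window confirms the problem is still happening right now, so the
    alert also stops firing quickly once the issue is resolved.
    """
    return long_rate >= rule.threshold and short_rate >= rule.threshold


# Hypothetical incident burning budget ~12x faster than sustainable.
for rule in RULES:
    if should_fire(rule, long_rate=12.0, short_rate=15.0):
        print(f"{rule.severity.upper()}: {rule.name} burn-rate alert fired")
```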
So this is an evolution where we go from magic thresholds to alerts we can act on: we can say with some confidence that when we receive an alert, we actually need to do something. But what is that something? That something can be defined in the error budget policy. The error budget policy determines the alerting thresholds and the actions to take to ensure that error budget depletion is addressed. What does this mean? The error budget policy is a policy defined beforehand, where we say that if X happens, we will take actions A, B and C. Here are a few examples, and we'll look at a more detailed document in a second. We can say that if the service has exceeded its error budget for the preceding four-week window, we will hold all changes and releases on our service, and we will only work on P0 issues or security fixes until the service is back within the SLO. Depending on the cause of the SLO miss, the team may devote additional resources to working on reliability instead of feature work. Basically, what we are defining here is a concrete measure: if the error budget has been depleted at some point within our four-week period, we will not be releasing features, apart from, of course, P0 issues or security fixes. And we're also saying that, depending on what caused the SLO to be missed, we might need to add additional resources to work on reliability instead of releasing more features. Another example could be that if a single incident consumes more than 20% of our error budget over the same four-week period, then the team must conduct a postmortem, and that postmortem must have at least one P0 action item, a P0 action being something treated with the highest urgency. Again, this defines the actions that will be taken when the error budget is at risk or has been consumed. This should go into a document agreed upon by multiple parties, so that everyone is on the same page regarding what is done when the error budget is actually consumed.

Taken from Google's SRE book, we have an example of such a document. This is just an example; you can of course define your error budget policy any way that you want, but it gives you a good starting point for how to define alerting on SLOs and error budget policies. In this example we see the authors of the error budget policy, when it was defined, who reviewed it, who approved it, when it was approved, and when it should be revised. We then have a service overview: the service or group of services to which this error budget policy applies. We then have the goals and non-goals: what the error budget policy tries to achieve and what it does not try to achieve. Then we have a definition of what it means to miss the SLO, a detailed description of what an SLO miss is, which basically defines when this error budget policy will be enforced. We can also have other sections, like an outage policy, an escalation policy, and any background that is necessary.

To quickly sum up all the concepts that we have seen: we started with metrics. With metrics we can build SLIs, and SLIs are what help us define whether an event is good or bad within our context. An SLO is how many times an SLI needs to be met so that we can be sure that our users are happy with our services. An error budget is the amount of reliability that is left over from the SLO. With SLOs and error budgets we can build visualizations, which are good, but ideally what we want is to be alerted when our SLO is at risk. And of course, SLO alerts can trigger an error budget policy: for example, if I'm consuming too much of my error budget, I can enforce an error budget policy that has been discussed and agreed upon with all parties beforehand. And we have the concept of an SLA, which is an SLO with penalties.

But why is all of this important? It is important because this way we can define reliability in the eyes of our users. We stop measuring, alerting on, and defining reliability in terms of things that are not really defined by our users. I don't want to be alerted at three in the morning because a CPU threshold was crossed while my users are not being affected. This, of course, ties into reducing alert fatigue: ideally, I will now receive alerts only when my users are being affected, or are at risk of being affected. It also creates a shared language to talk about reliability: with all of this, we have a framework in place that can tell us whether our systems are being reliable or not, and it's understood by everyone. And it facilitates prioritization: we have a way to see if we're being reliable or not, and we have an error budget policy that helps us define when more work needs to be put into reliability.

And before I go, I want to leave a shameless plug. I'm writing a book on overcoming SRE antipatterns, and we'll have a couple of chapters speaking precisely about this: how to measure reliability in the eyes of our users, and how we can leverage alerts on SLOs and error budget policies to improve our day-to-day operations. And this is all from my part. I hope you enjoyed this talk and that it was informative for all of you. You can find me at these links. Thank you very much and have a great day.
...

Ricardo Castro

Lead SRE @ Anova



