Conf42 DevOps 2023 - Online

Don't get out of bed for anything less than an SLO


Abstract

Bad on-call shifts make developers burn out and quit jobs. Noisy alerts that don’t mean anything give us the worst kind of on-call fatigue. Is the site still up? Then who cares if CPU usage is high? Service Level Objectives make alerts meaningful by measuring what really matters in your system. Understanding and monitoring the critical operations that users rely on can give you the confidence to stop getting paged for transient issues that don’t affect end users. Learn how to choose the best indicators and useful objectives so you’ll respond to the right problems if the pager goes off at 3 AM.

Summary

  • A tool called a service level objective helps you understand what really matters in your system and how to alert on it. Burnout is a big topic in software engineering these days, and a good on call shift looks like having the right people on call for the right systems.
  • Alerting is not about maintaining systems or looking for underlying causes before something goes wrong. Alerting is for dealing with emergencies. So you need to be alerted not too early when there's a problem, to avoid sending alerts that are already resolved by the time somebody wakes up.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Hi, I'm Joe. Welcome to Don't Get Out of Bed for Anything Less Than an SLO. I'm a software engineer at Grafana. I've been building and operating large distributed systems for nearly my whole career, and I've been on call for those systems. I think there are great things about being on call: the people who are operating and fixing things usually know what to do because they've built the system. But some companies have things more together than others. I've been in some organizations where things are incredibly noisy and stressful, and I want to talk about a tool today that can help us improve that situation markedly. We're going to talk about what makes an on call situation bad and what sort of knobs we have to improve it. We'll talk about a particular tool called a service level objective, which helps you understand what really matters in your system and how to alert on that, so that the alerts you're sending to people at three in the morning are really meaningful. They point to things that really have to be done, and they explain what's going on: what problems are people having that caused this alert to fire? Burnout is a big topic in software engineering these days. Tons of people in the industry experience burnout at some point in their career, and right now huge numbers of people are experiencing some symptoms of burnout. It kills careers. People go out to the woods to build log cabins when they leave those jobs. It also hurts their teams and their organizations: too much turnover leads to institutional knowledge walking out the door, and no matter how good your system documentation is, you can't recover from huge turnover. So mitigating, preventing, and to some degree curing burnout is really important to software companies. Burnout comes from a lot of places, but some big external factors, factors that come from your job, are unclear expectations, a lack of autonomy and direction, an inability to unplug from work, and an inability to deliver meaningful work. Bad on call can affect all of those. There are unclear expectations: what am I supposed to do with an alert? Who's supposed to handle this? Is it meaningful? If you're repeatedly getting paged at three in the morning, especially for minor things, you really can't relax and let go of work when it's time to be done. And if you're spending all day responding to useless things, you don't ship anything, you don't commit code, you feel like you're just running around putting out fires, and it's very frustrating and draining. So bad alerts, poorly defined alerts, things that people don't know what to do about, are huge contributors to burnout. To improve that situation, we need to understand our systems better, understand what's important about them, and make sure that we're responding to those important things and not to minor, underlying things that can be addressed as part of a normal work routine. A useful, good on call shift looks like having the right people on call for the right systems. When alerts happen, they're meaningful and they're actionable: they point at a real problem in the system, and there's something someone can do to address it. There's a great tool in the DevOps world that we can use to help make all of that happen, called service level objectives. They help you define what's actually important about the system, understand how to measure and choose important behaviors, and make decisions based on those measurements.
Is there a problem we need to respond to right now? Is there something we need to look at in the morning? Or is there something where, over time, we need to prioritize work to improve things long term? How can we assess what problems we have and how serious they are? We split that into two things here. There's a service level, which is not about a microservice or a given set of little computers. It's about a system and the services that it provides to its users: what's the quality of service that we're giving people? And then an objective, which says what level of quality is good enough for this system, for our customers, for my team. So a service level is all about the quality of service you're providing to users and clients. To do that, you need to understand what users really want from that system and how you can measure whether they're getting what they want. We use something called a service level indicator to help us identify what people want and whether they're getting it. We start with a prose description. Can users sign in? Are they able to check out with the socks in their shopping carts? Are they able to run analytics queries that let our data scientists decide what we need to do next? Is the catalog coming up? We start with that prose. Then we figure out what metrics we have, or maybe metrics that we need to generate, that let us measure that, and then we do some math. It's not terribly complicated math: it gives us a number between zero and one, one being the best quality output we could expect for a given measurement, and zero being everything is broken, nothing is working right. Some really common indicators that you might choose are the ratio of successful requests to all requests that a service is receiving over some time; this is useful for an ecommerce application or a database query kind of system. You might have a threshold measurement about data throughput being over some rate: you may know you're generating data at a certain rate, and you need to be processing it fast enough to keep up with that, so you need to measure throughput to know if something's falling apart. Or you may want to use a percentile kind of threshold to say, for a given set of requests, the 99th percentile latency needs to be below some value, or my average latency needs to be below some value, some statistical threshold. These are all really common indicators. They're relatively easy to compute and relatively cheap to collect. Next, you need to set an objective: what level of quality is acceptable? And of course, only the highest quality will do, right? That's what we all want to provide, but we have to be a little more realistic about our target. To define an objective, you choose two things. You choose a time range over which you want to measure. It's better to have a moderate to long time range, like 28 days or seven days, than a really short one, like a day or a few minutes; you're getting too fine grained if you're trying to measure something like that. And then you choose a percentage of your indicator that would be acceptable over that time range. 90% is a comically low number, but are 90% of login requests succeeding over the last week? If so, we've met our objective, and if not, then we know we need to be paying attention. You'll want to set a really high quality bar, right? Everybody does.
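
As an illustration of the indicator math described above, here is a minimal Python sketch of two common indicator styles: a success-ratio SLI and a latency-threshold SLI. The function names and example numbers are hypothetical, not from the talk; in practice the inputs would come from your metrics system.

```python
# A minimal sketch of two common SLI styles. The request counts and latency
# samples are hypothetical inputs, not real measurements.

def availability_sli(successful_requests: int, total_requests: int) -> float:
    """Ratio of good requests to all requests, between 0 and 1."""
    if total_requests == 0:
        return 1.0  # no traffic: treat the indicator as met
    return successful_requests / total_requests


def latency_sli(latencies_ms: list[float], threshold_ms: float) -> float:
    """Fraction of requests served at or below the latency threshold."""
    if not latencies_ms:
        return 1.0
    fast = sum(1 for latency in latencies_ms if latency <= threshold_ms)
    return fast / len(latencies_ms)


# Hypothetical numbers: 9,940 of 10,000 checkout requests succeeded.
print(availability_sli(9_940, 10_000))                            # 0.994
print(latency_sli([120, 250, 90, 1800, 300], threshold_ms=500))   # 0.8
```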
But it's useful to think about what a really high quality bar means. If you're measuring your objectives over the last seven days and you want a 99% objective, we're talking here about, let's say, total system downtime: you've got an hour and 41 minutes of total downtime every seven days to meet a 99% objective. And if you want five nines, the sort of magic number that you sometimes hear people quote, over a month that's only about 25 seconds of downtime. It's really important to think these things through when you set objectives, because it's better to start low and get stricter. Teams feel better and perform better when they're able to start at a place that's okay and improve, rather than be failing right out of the gate. And when setting objectives, you want to leave a bit of safety margin for normal operations. If you're always right up on the line of your objective, then you either need to do some work to improve your system's performance and build up some margin, or lower the objectives a bit, because if you're right up on the line, people are getting alerted all the time for things they may not be able to do much about. So to choose an objective, once you've defined your indicators, it's a good idea to look back over the last few time periods that you want to measure and get a sense for what a good objective would look like. Something that doesn't work is for your VP of engineering to say, look, every service needs to have a minimum 99% SLO target set, and it's not acceptable for there to be anything less. This doesn't work well, especially when you're implementing these for the first time in a system where you may not even know how well things are performing; you're going to be sabotaging yourself right out of the gate. It's better to find your indicators, start to measure them, and then think about your objectives after that. Once we understand what our indicators and objectives look like, then we can start to alert based on measurements of those indicators. We want to alert operators when they need to pay attention to a system now, not when something kind of looks bad and should be looked at in the morning. We're going to monitor those indicators because they're symptoms of what our users actually experience, and we'll prioritize how urgent things are based on how bad those indicators look. To think about this, it's useful to think about monitoring the symptoms of a problem and not the causes. A symptom of a problem would be something like "checkout is broken"; a cause of a problem might be pods running out of memory and crashing in my Kubernetes cluster. Alerting is not about maintaining systems. It's not about looking for underlying causes before something goes wrong; that's maintenance. Alerting is for dealing with emergencies. Alerting is for triage. And so you should be alerting on things that are really broken, really problems for users, right now. Your systems deserve checkups; you should be understanding and looking at those underlying causes. But that's part of your normal day to day work. If your team is operating systems, that should be part of what your team is measuring and understanding as part of your general work, not something that your on call does just during the time that they're on call. You shouldn't be getting up at three in the morning for maintenance tasks. You should be getting up at three in the morning when something is broken.
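
To make the downtime arithmetic above concrete, here is a small sketch, assuming an availability-style objective, of how much total downtime a given target allows over a measurement window. The function and numbers are illustrative; only the 99%-over-7-days and five-nines-over-a-month examples come from the talk.

```python
def allowed_downtime_seconds(objective: float, window_days: float) -> float:
    """Total downtime, in seconds, permitted by an availability objective."""
    window_seconds = window_days * 24 * 60 * 60
    return (1.0 - objective) * window_seconds


# 99% over 7 days: about 6,048 seconds, i.e. roughly 1 hour 41 minutes.
print(allowed_downtime_seconds(0.99, 7) / 60)    # ~100.8 minutes

# Five nines over a 30-day month: only about 26 seconds.
print(allowed_downtime_seconds(0.99999, 30))     # ~25.9 seconds
```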
So you need to be alerted not too early, when there's some problem that's just a little spike, but definitely not too late to solve your problem either. To think about this, it's useful to think of system performance like cash flow. Once you set an objective, you've got a budget, a budget of problems you can spend, and you're spending it like money. If you've been in the startup world, you've heard about a company's burn rate: how fast are they running through their cash? When are they going to need to raise more? You can think of this as having an error burn rate, where you're burning through your error budget. And when you start spending that budget too quickly, you need to stop, drop things, look in on it, and figure out how to fix that. So we can think about levels of urgency associated with measurements of the SLO. There are things that should wake me up, like a high burn rate that isn't going away, or an extremely high rate of errors happening over a short period of time. If there's a sustained moderate rate that's going to cause you problems over a period of days, it's something your on call can look at in the morning, but that they must prioritize. Or if you're never burning through your error budget but you're always using a bit, maybe more than you're comfortable with, then you should address that as part of your maintenance, checkup kind of work on your system. And if you have just transient moderate burn rate problems, small spikes here and there, you almost shouldn't worry about these. There are always going to be transient issues, especially in larger systems, as other services deploy or a network switch gets replaced. This is why we set our objectives at a reasonable level, because we shouldn't be spending teams' valuable time on minor things like that. In order to get alerted soon enough, but not too soon, and to avoid transient moderate problems sending alerts that are resolved by the time somebody wakes up, we can measure our indicators over two time windows, which helps us handle that too-early, too-late kind of problem. By taking those two time windows and comparing the burn rate to a threshold, we can decide how serious a problem is. The idea here is that over short time periods, the burn rate needs to be very, very high to get somebody out of bed, and over longer time windows, a lower burn rate will still cause you problems. If you've got a low burn rate all day every day, you're going to run out of error budget well before the end of the month. If you have a high burn rate, you'll run out of the error budget in a matter of hours, and so somebody really needs to get up and look at it. So for a high rating of urgency here, you can have a short time window of 5 minutes and a longer time window of an hour, and if the average burn rates over both windows are above a relatively high threshold of 14 times your error budget, then you know you need to pay attention. What these two windows buy you is that the long window helps make sure you're not alerted too early, and the short window helps make sure that when things are resolved, the alert resolves relatively quickly.
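
Here is a minimal sketch of the multiwindow burn-rate check described above. The 5-minute and 1-hour windows and the 14x threshold come from the talk; the function names and the way the error rates are obtained are assumptions for illustration only.

```python
def burn_rate(error_rate: float, objective: float) -> float:
    """How fast the error budget is being spent; 1.0 means exactly on budget."""
    budget = 1.0 - objective   # e.g. 0.01 for a 99% objective
    return error_rate / budget


def should_page(error_rate_short: float, error_rate_long: float,
                objective: float, threshold: float = 14.0) -> bool:
    """Page only if BOTH the short and long windows exceed the threshold.

    The long window (e.g. the last hour) keeps us from alerting too early on
    a brief spike; the short window (e.g. the last 5 minutes) lets the alert
    resolve quickly once the problem is fixed.
    """
    return (burn_rate(error_rate_short, objective) >= threshold and
            burn_rate(error_rate_long, objective) >= threshold)


# With a 99% objective, a sustained 20% error rate over both the last
# 5 minutes and the last hour is a 20x burn rate: wake someone up.
print(should_page(error_rate_short=0.20, error_rate_long=0.20, objective=0.99))   # True

# A spike that has already faded from the short window does not page.
print(should_page(error_rate_short=0.002, error_rate_long=0.20, objective=0.99))  # False
```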
And so higher urgency levels mean shorter measurement windows but a higher threshold for alerting, and lower urgency levels, things you can look at in the morning, take measurements over longer periods of time but have a much more relaxed threshold for alerting. This is a way to really make sure that you're only getting out of bed when something is going wrong at a pretty severe level. Thanks for coming.
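
Building on the burn-rate sketch above, the urgency ladder described here could be encoded as a small table of window pairs and thresholds. Only the paging tier's numbers (14x over 5 minutes and 1 hour) come from the talk; the lower-urgency windows and thresholds below are illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass
class AlertTier:
    action: str                  # what to do when this tier fires
    short_window: str            # short window, keeps alerts quick to resolve
    long_window: str             # long window, keeps alerts from firing too early
    burn_rate_threshold: float   # both windows must exceed this multiple


TIERS = [
    AlertTier("page the on-call now", "5m", "1h", 14.0),
    AlertTier("open a ticket for the morning", "30m", "6h", 6.0),      # assumed values
    AlertTier("review during routine maintenance", "6h", "3d", 1.0),   # assumed values
]

for tier in TIERS:
    print(f"{tier.action}: burn rate over {tier.burn_rate_threshold}x "
          f"in both the last {tier.short_window} and {tier.long_window}")
```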
...

Joe Blubaugh

Principal Software Engineer @ Grafana Labs



