Transcript
Hi, I'm Joe. Welcome to Don't Get Out of Bed for Anything Less Than an SLO. I'm a software
engineer at Grafana. I've been building and operating large distributed
systems for nearly my whole career. I've been on call for
those systems. I think there are great things about being on call.
The people who are operating and fixing things usually
know what to do because they've built the system.
But some companies have things more together than others. I've been in
some organizations where things are incredibly noisy
and stressful, and I want to talk about a tool today that
can help us improve that situation markedly.
We're going to talk about what makes an on call situation
bad and what sort of knobs we have to
improve it. We'll talk about a particular tool called a service
level objective, which helps you understand what really
matters in your system and how to
alert on that, so that the alerts that you're sending to people
at three in the morning are really meaningful.
They point to things that really have to be done and they
explain sort of what's going on. What problems are people
having that caused this alert to happen?
Burnout is a big topic in software
engineering these days. Tons of people in the industry
experience burnout at some point in their career, and right now
huge numbers of people are experiencing some symptoms of burnout.
It kills careers. People go out to the woods to build log
cabins when they leave these jobs. It really
kills their teams and their organizations. Too much
turnover leads to institutional knowledge walking out the door.
No matter how good your system documentation is, you can't
recover from huge turnover. So mitigating, preventing, and to some degree curing burnout is really important to software
companies. Burnout comes from a lot of places,
but some big external factors, factors that
come from your job are unclear expectations,
a lack of autonomy and direction, an inability
to unplug from work and an inability to
deliver meaningful work. And bad
on call can affect all of those. There are unclear expectations: what am I supposed to do with this alert? Who's supposed to handle it? Is it even meaningful? If you're repeatedly getting paged at three in the morning, especially for minor things, you really can't relax and let go of work when it's time to be done. And if you're spending all day responding to especially useless things, you don't ship anything, you don't commit code, you feel like you're just running around putting out fires, and it's very frustrating and draining. So bad alerts, poorly defined alerts, things that people don't know what to do about, are huge contributors to burnout. And to improve that situation, we need to understand our
systems better, understand what's important about them, and make sure
that we're responding to those important things and not to
minor things, underlying things that can be addressed as
part of a normal work routine. A useful,
good on call shift looks like having the right people on
call for the right systems. When alerts happen,
they're meaningful and they're actionable. They are
for a real problem in the system, and there's something someone can do
to address it. There's a great
tool in the DevOps world that we can use to help make
all of that happen, called service level objectives,
and they help you define what's actually important about the system.
They help your operators understand how to measure and choose important behaviors, and they help you make decisions based on
those measurements. Is there a problem we need to respond to right now?
Is there something we need to look at in the morning? Or is there something
over time where we need to prioritize work to
improve things sort of long term? How can we assess what problems
we have and how serious they are?
We split that into sort of two things here. There's a service level
which is not about a microservice or a given set of little
computers. It's about a system and
the services that it provides to its users. What's the quality of service
that we're giving people? And then an objective which
says, what level of quality is good enough for this system,
for our customers, for my team.
So a service level is all about the quality of service you're
providing to users and clients. To do that, you need to understand what
users really want from that system and how
you can measure whether they're getting what they want.
So we use something called a service level indicator
to help us identify what people want and whether they're
getting it. We start with sort of a prose description. Can users sign
in? Are they able to check out with the socks in their shopping carts? Are they able to run analytics queries that let our data scientists decide what we need to do next? Is the catalog showing up? We start with that prose. Then we figure out what metrics we have, or maybe need to generate, that let us measure that, and then we do some math. It's not terribly complicated
math to give us a number between zero and one,
one being sort of the best quality output we could expect for a
given measurement, and zero being everything is broken.
Nothing is working right. Some really common
indicators that you might choose are the ratio of successful
requests to all requests that a service is receiving over some
time. This is useful for an ecommerce application,
a database query kind of system. You might have like a threshold
measurement about data throughput being
over some rate. You may know you're generating
data at a certain rate and you need to be processing
it fast enough to keep up with that. And so you need
to measure throughput to know if something's falling apart. Or you may want to
use a percentile kind of threshold to say,
for a given set of requests, the 99th percentile latency
needs to be below some value, or my average latency needs to be below some value, some statistical threshold.
These are all really common indicators. They're relatively easy to compute,
relatively cheap to collect.
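As a rough sketch of what that math looks like, here are those two kinds of indicator, a success ratio and a latency threshold, reduced to a number between zero and one. The data and function names are made up for illustration; in practice the measurements would come from your metrics system.

```python
# Sketch of two common service level indicators, using made-up data.
# Each reduces a window of measurements to a number between 0.0
# (everything is broken) and 1.0 (the best quality we could expect).

def success_ratio_sli(status_codes):
    """Ratio of successful requests to all requests in the window."""
    if not status_codes:
        return 1.0  # no traffic means nothing failed
    good = sum(1 for code in status_codes if code < 500)
    return good / len(status_codes)

def latency_threshold_sli(latencies_ms, threshold_ms=300):
    """Fraction of requests that completed under a latency threshold."""
    if not latencies_ms:
        return 1.0
    fast = sum(1 for latency in latencies_ms if latency <= threshold_ms)
    return fast / len(latencies_ms)

# Example: 1000 requests with 12 server errors, and 50 slow responses.
codes = [500] * 12 + [200] * 988
latencies = [120] * 950 + [450] * 50
print(success_ratio_sli(codes))          # 0.988
print(latency_threshold_sli(latencies))  # 0.95
```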
Next, you need to set an objective. What level of quality is acceptable?
And of course, only the highest quality will do,
right? That's what we all want to provide, but we
have to be a little more realistic about our target. To measure an objective,
you choose two things. You choose a time range over which
you want to measure. It's better to have a sort of moderate to long time
range, like 28 days or seven days, rather than a really short one, like a day or a few minutes.
You're getting too fine grained if you're trying to measure something like that.
And then a percentage of your
indicator that would be acceptable over that time range,
90% is sort of a comically low number. But are
90% of login requests succeeding over the last week?
If so, we've met our objective,
and if not, then we know we need to be paying attention.
You'll want to set a really high quality bar, right? Everybody does. But it's useful to think about what a really high quality bar means. If you're measuring your objectives over the last seven days and you want a 99% objective, think about it in terms of total system downtime: you've got an hour and 41 minutes of total downtime every seven days before you miss a 99% objective. And if you want five nines, the sort of magic number that you sometimes hear people say, over a month that's only about 25 seconds of downtime. It's really important
to think these things through when you set objectives, because it's better to start
low and get stricter.
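To make that arithmetic concrete, here's a quick sketch of how the allowed downtime falls out of an objective and a time window; the numbers match the ones above.

```python
# How much total downtime an availability objective leaves you over a window.
def allowed_downtime_seconds(objective, window_days):
    """Seconds of total downtime permitted by an availability objective."""
    window_seconds = window_days * 24 * 60 * 60
    return (1.0 - objective) * window_seconds

# 99% over 7 days: about 6048 seconds, roughly an hour and 41 minutes.
print(allowed_downtime_seconds(0.99, 7) / 60)   # ~100.8 minutes

# Five nines over a 30-day month: about 26 seconds.
print(allowed_downtime_seconds(0.99999, 30))    # ~25.9 seconds
```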
Teams feel better and perform better when they're
able to start at a place that's okay
and get better, rather than be failing right out of the gate. And when setting objectives, you want to leave a bit of safety margin for
normal operations. If you're always right up on the line of your
objective, then you either need to do some work to
improve your system's performance and build up some margin or lower the
objectives a bit. Because if you're right up on the line, people are getting alerted all the time for things they may not be able to do
much about. So to choose an objective,
once you've defined your indicators, it's a good idea to
look back over the last sort of few time
periods that you want to measure and get a sense for what a good objective would look like. Something that doesn't work is for your VP of engineering to say, look, every service needs to have a minimum 99% SLO target set, and it's not acceptable for there to be anything less.
This doesn't work well, especially when you're
implementing these for the first time in a system and may not even know how well things are performing; you're going to be sabotaging yourself right out of the gate. It's better to find your indicators, start to measure them, and then think about your objectives after that.
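One way to do that, sketched here with made-up weekly numbers, is to look at how the indicator has actually performed over the last few windows and start with an objective a little below what you've been achieving, so the team begins with some margin rather than failing on day one.

```python
# Sketch: derive a starting objective from recent history instead of
# decreeing one. The weekly SLI values are invented; in practice they
# would come from your monitoring system.
recent_weekly_slis = [0.9971, 0.9983, 0.9958, 0.9990]

worst = min(recent_weekly_slis)
# Start a little below the worst week you've actually seen, and tighten
# the objective later once the system (and the team) can sustain it.
starting_objective = round(worst - 0.001, 4)
print(starting_objective)  # 0.9948
```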
Once we understand what our indicators and objectives look like, then we can start to alert based on measurements of those indicators. We want to alert operators
when they need to pay attention to a system now,
not when something kind of looks bad and you should look at it in the
morning. We're going to monitor those indicators
because they're symptoms of what our users actually experience and
we'll prioritize how urgent things are based on how bad
those indicators look. To think about this,
it's useful to think about monitoring the symptoms
of a problem and not the causes. Symptoms of a
problem would be things like checkout is broken; a cause of a problem might be pods running out of memory and crashing in my Kubernetes cluster.
Alerting is not about maintaining systems. It's not
about looking for underlying causes before something goes
wrong. That's maintenance. Alerting is for dealing with emergencies.
Alerting is for dealing with triage. And so you should be looking at
things that are really broken, really problems for
users. Now, your systems deserve checkups.
You should be understanding and looking at those underlying causes.
But that's part of your normal day to day work.
If your team is operating systems, that should be part of what your
team is measuring and understanding as part of your
general work, not as something that your on call does
just during the time that they're on call. You shouldn't be getting up
at three in the morning for maintenance tasks. You should be getting up at three
in the morning when something is broken. So you
need to be alerted not too early
when there's some problem that's just a little spike,
but definitely not too late to solve your problem either.
So to think about this, it's useful to think of system
performance like cash flow. Once you set an objective,
then you've got a budget. You've got a budget of problems you can
spend, and you're spending that like money. If you've been in the startup
world, you've heard about a company's burn rate, how fast
are they running through their cash? When are they going to need to raise more?
And you can think of this as having an error burn rate
where you're burning through your budget. And so when you start spending
that budget too quickly, you need to stop things, drop things,
look in on it and figure out how to fix that.
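To put a number on that analogy: a burn rate is just the error rate you're observing divided by the error rate your objective allows. A burn rate of 1 means you'd spend exactly your whole budget by the end of the SLO window; anything higher exhausts it early. A minimal sketch:

```python
# Burn rate: how fast the error budget is being spent, relative to the
# rate that would use it up exactly at the end of the SLO window.
def burn_rate(observed_error_ratio, objective):
    budget = 1.0 - objective  # allowed error ratio, e.g. 0.01 for a 99% SLO
    return observed_error_ratio / budget

print(burn_rate(0.02, 0.99))  # 2.0  -> budget gone halfway through the window
print(burn_rate(0.14, 0.99))  # 14.0 -> budget gone in about 1/14 of the window
```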
So we can think about levels of urgency
associated with measurements of the SLO. There are things that should wake me up, like a high burn rate that isn't going away, or an extremely high rate of errors happening over a short period of time. If there's a sustained moderate rate that's going to cause you problems over a period of days, it's something your on call can look at in the morning, but that they must prioritize. Or if you're never burning through your error budget quickly, but you're always using a bit more than you're comfortable with, then you should be handling that as part of your maintenance, checkup kind of work on your system.
And if you have sort of just transient moderate burn
rate kind of problems, small spikes here and there,
you almost shouldn't worry about these. There are always going to be,
especially in larger systems, transient issues as other services deploy or a network switch gets replaced. This is why we set our objectives at a reasonable level, because we shouldn't be spending teams' valuable time on minor things like that, things that can really be addressed as part of routine work. In order to get alerted soon enough, but not too soon, and to avoid those transient moderate problems sending alerts that are resolved by the time somebody wakes up, we can measure our indicators over two time windows,
which helps us handle that too early, too late kind
of problem. So by taking
those two time windows and comparing the burn rate
to a threshold, we can decide how serious a
problem is. The idea here is over
short time periods the burn rate needs to be very, very high
to get somebody out of bed. And over longer time windows,
a lower burn rate will still cause you problems. If you've
got a low burn rate all day every day,
you're going to run out of error budget well before the end of the month.
If you have a high burn rate, you'll run out of the
error budget in a matter of hours. And so somebody really needs to get up
and look at it. So for the highest urgency rating here, you can have, say, a short time window of 5 minutes and a longer time window of an hour. And if the average burn rates over both of those windows are above a relatively high threshold, say 14 times the rate that would exactly use up your error budget, then you know you need to pay attention. What these two windows buy you is that the long window helps make sure you're not alerted too early, and the short window helps make sure that when things are resolved, the alert resolves relatively quickly.
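Here's a minimal sketch of that two-window check. The 5-minute and 1-hour windows and the 14x threshold are the ones mentioned above; the measurement function is a stand-in for whatever query your monitoring system provides.

```python
# Sketch of a multi-window burn rate check for the highest urgency tier.
# `error_ratio_over(minutes)` stands in for a query against your monitoring
# system that returns the observed error ratio over that window.

def should_page(error_ratio_over, objective=0.99,
                short_minutes=5, long_minutes=60, threshold=14.0):
    """Page only if the budget is burning fast over BOTH windows.

    The long window keeps a brief spike from paging anyone too early;
    the short window lets the alert resolve quickly once things recover.
    """
    budget = 1.0 - objective
    short_burn = error_ratio_over(short_minutes) / budget
    long_burn = error_ratio_over(long_minutes) / budget
    return short_burn > threshold and long_burn > threshold

# Example: 20% errors over the last 5 minutes and 16% over the last hour,
# against a 99% objective -> burn rates of 20 and 16, both above 14: page.
print(should_page(lambda minutes: 0.20 if minutes == 5 else 0.16))  # True
```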
And so higher urgency levels mean shorter measurement windows but a higher threshold for alerting, and lower urgency levels, the things you can look at in the morning, take measurements over longer periods of time but have a much more relaxed threshold for alerting. And this is a way to really make sure that you're only getting out of bed when something is really going wrong at a pretty severe level. Thanks for coming.