Transcript
This transcript was autogenerated.
Hello. So happy that you've joined me here at Conf 42,
SRE 2024. I'm coming at you from beautiful New Haven,
Connecticut, the Elm city. And I'm going to talk to
you today about clinical troubleshooting, which is a technique you can
use every day in your career as an SRE to diagnose problems
faster and solve them for good. And I can tell you that
if you and your team do use clinical troubleshooting consistently,
you will experience shorter incidents and less
stressful incidents. That all sounds pretty good, right? But what do
I know? I am Dan Slimmon,
and I've worked for about 16 years of
my life in operations, SRE, and DevOps.
Most recently, I've worked at companies like HashiCorp
and Etsy, and I've troubleshot, I've investigated thousands of production issues during my
time as an SRE. Over those 16 years and thousands
of issues, I've developed this clinical troubleshooting methodology that
I'm so excited to tell you about today. It allows
groups of people to troubleshoot problems much more effectively.
Clinical troubleshooting is a scientific procedure for collaborative
diagnosis. Essentially, that means you can use it anytime
you have a group that needs to work together to explain why a complex system
is malfunctioning. It doesn't have to be under intense time pressure,
like during an incident. But incidents are a great example
of a time when you need to do collaborative diagnosis. You get
a bunch of people on a call, maybe each of
them only knows part of the system. They have an imperfect view of each
other's thinking processes, and so you're doing diagnosis
collaboratively. Clinical troubleshooting is a scientific
way to do that. So let's dive into a
story. It's a wonderful, calm day,
and the flower service in our stack
is clicking along, doing its job, serving up pollen tokens
for our users, and then suddenly it
catches on fire. The throughput of the flower service drops
by 40% right in the middle of the business day, which should never
happen. Here's a little graph showing the throughput.
It's measured in pollen tokens per second served,
dropping from about 1,000 to about 600 per second.
That's bad. And it's so bad that
Barb gets paged. Barb is an SRE,
and she just got paged about this problem with the flower service
throughput dropping from 1,000 to 600 per second.
She only knows a little bit about the flower service, but she's here on
the incident call. She's ready to go. She's going to figure this out.
It's a chance for Barb, captain of SRE, to show her quality,
so to speak. So Barb spends a few minutes
looking at graphs. She's searching for the error message on Google and
the source code. She's building up context.
And so she's holding a lot in her head. She's holding a
lot of abstractions, a big house of cards in her head.
Then Alice joins the call. Alice is an engineer on a different team,
and she says, hi, Barb, this is Alice. I saw the alert in Slack. Can I help?
But Barb, as we discussed, is holding a lot in her head.
She's got several competing theories that are maybe not
fully baked yet. She's still processing what she
sees on the graph dashboards and on Google. And so she says,
thanks, Alice. You can sit tight for a moment. I'll fill you in soon.
She doesn't want to stop and explain everything to
Alice when she hasn't totally explained it to herself yet.
So Alice can watch Barb's screen share and watch what she's doing. But she
has even less context than Barb does on this issue.
This goes on for a few minutes, and then suddenly another
service breaks, the honeybee service.
Barb gets a page saying that the error rate for the honeybee
service is now elevated. She doesn't have time for this. She's trying to
fix flower. So Alice jumps in and says, oh,
I can look at honeybee. So now
you have Barb over here looking at flower and Alice
over here looking at honeybee. They're both looking around
for data to explain their respective issues.
This is kind of good because they can work in parallel. And then
suddenly, boom, straight through
the brick wall comes crashing Seth.
Seth is Barb and Alice's grand boss.
He wants to know the answers to lots of questions. And he's
spitting out those questions like so many watermelon seeds. He's saying,
how are customers affected? Do we need anyone else on the call? How are the
flower and honeybee problems related? What's our plan? These are
good questions, and it's reasonable that Seth wants to know the answers
to them. But because all the context on these issues
is stored inside the heads of our heroes, Barb and Alice,
Barb now has to take her hands off the keyboard and spend her time answering
Seth's very reasonable, but perhaps disruptively delivered,
questions. And while she's answering those questions,
guess who else shows up on the incident call?
Ding, ding, ding. It's the support team.
Support started getting customer reports of errors from the honeybee service.
And they say, is this the right call to talk about that? I know it's
supposed to be about flower, but, you know, there's an incident.
They have a lot of the same questions as Seth, but the support team's questions
are going to have to wait, because, ding ding,
two more devs join the call: Devin and Charlie. OK,
so Barb has to put her conversation with Seth on hold
for a minute so she can assign these new responders to
help herself and Alice, respectively. So once she's answered Seth's and support's
questions, then she and Alice can spin
up Devin and Charlie on context and everybody can get
back to troubleshooting the issue.
It's a mess because the effort is fractured. Essentially, we have
two independent teams now, both trying to coordinate on a single call.
It's a mess because every new person who
joins the effort now has to interrupt Barb or Alice
to get context on what's going on.
And it's a mess because despite having been in the incident call
for 20 minutes, there's still no real plan about how we're going to
diagnose and fix this problem. If you've been on incident
calls, you know what this kind of mess feels like.
It wastes time. People step on each other's toes, we get
disorganized, we miss things, and the incident goes on way longer than it
needs to. And that's why it's useful to have a
process like clinical troubleshooting. Clinical troubleshooting
is what's called a structured diagnostic process.
Having a structured diagnostic process makes a lot of the problems
that just turned our incident into a mess go away.
It does this by exposing the plan so that
everyone who joins the call knows what's up and what we're doing. It does
this by helping you avoid dead ends,
which allows you to solve the issue faster.
And most importantly, I think it lets you audit
each other's thinking. When we audit each other's thinking,
by which I mean look at how our coworkers are
thinking about the issue and compare it to our own mental model,
we can reason collectively, and because we're
reasoning collectively, we can reason better. And that's how clinical
troubleshooting gives us shorter incidents, fewer incidents, and less stressful
incidents. So what is this structured
diagnostic process that makes so many of Barb's analysis problems go away?
It's this simple workflow.
First, we get a group of people together who have as diverse
a set of viewpoints as possible; that'll be important later.
Working together as a group, we list the symptoms of the
problem under investigation. Symptoms are objective
observations about the system's behavior.
Next, working from those symptoms, we brainstorm
a number of hypotheses to explain the symptoms we're observing.
And finally, we decide what actions to take next,
given the relative likelihoods and scarinesses of the hypotheses we've
listed. We take those actions,
and if they don't fix the problem, we go back to
the symptoms. If they do fix the problem,
we're done.
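If it helps to see the shape of that loop written down, here's a rough sketch in Python. It's purely illustrative, not a real tool: the helper functions just stand in for the group's judgment, and the doc is the shared document we'll talk about in a minute.

# Rough, illustrative sketch of the clinical troubleshooting loop.
# The helpers stand in for human judgment; nothing here is a real tool.

def gather_symptoms(doc):
    # Append objective, quantitative observations to the symptoms list.
    return doc

def brainstorm_hypotheses(doc):
    # Each hypothesis should explain at least one symptom and be falsifiable.
    return doc

def take_actions(doc):
    # Weigh likelihood and scariness, prefer actions that rule hypotheses out,
    # and return True only if the problem is actually fixed.
    return True

def clinical_troubleshooting():
    doc = {"symptoms": [], "hypotheses": [], "actions": []}
    while True:
        doc = gather_symptoms(doc)
        doc = brainstorm_hypotheses(doc)
        if take_actions(doc):
            return doc   # fixed, we're done
        # not fixed: loop back to the symptoms and go around again
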
So let's see how this works. Let's go back in time, through this time
portal to the beginning of our wonderful calm day.
The flower service is going along, and suddenly
it's on fire. Throughput drops by 40% in the middle of the
workday. Oh, no. So Barb gets paged.
Barb spends a few minutes figuring out context, just like she did in the bad
timeline. But this time, when, ding, Alice joins the call and says,
hi, this is Alice, how can I help,
Barb remembers this phenomenal talk she saw at Conf 42,
SRE 2024, called clinical troubleshooting.
So instead of saying, hold on,
Alice, I'm figuring some stuff out, can you just wait for a minute? She says,
welcome, Alice. Let's do some clinical troubleshooting.
She makes a shared Google Doc, or whatever
her shared document system of choice
is. She makes a doc, she shares it with Alice, she shares it on
her screen. And the doc has these headings:
symptoms, hypotheses, and actions.
And they start the process. They write down the symptoms. What are the symptoms?
Well, we had an alert at 08:41 UTC that the flower service's
throughput dropped by about 40%. And we also
know from Barb, having poked around a little bit before Alice
joined the call, that the latencies of requests
to the flower service dropped at the time that the throughput dropped.
So it's getting less traffic, but the traffic it is
serving, it's serving faster.
Barb's first hypothesis,
the one she's been cooking up since before Alice joined the call, is that the flower
service is somehow crashing when it receives certain kinds of requests.
And that explains symptom one,
because the crashed requests aren't getting logged and so they're
not showing up in the throughput data. And it would also explain symptom
two, because maybe these requests
that are causing flower to crash are normally requests that
take a longer time than usual. So since they're not getting
into the logs, they're not showing up in the latency data. And the average latency
shows as lower. Reasonable hypothesis.
And just like that, Barb has brought Alice over next
to her. So instead of Barb digging into graphs
and Alice twiddling her thumbs, Barb and Alice are looking at the
same thing from the same point of view. And that is so powerful
and so critical during an incident when you're under time pressure and
you have to come up with a plan and share the plan. To look at everything
from the same point of view, you have to have common ground.
So here's their common ground. Two symptoms, one hypothesis.
And that means they're ready to act. They think of two things to
do. The first thing they're going to do is check the local kernel
logs for any crashes. That would be evidence of that hypothesis.
And the second thing they're going to do is read the README for
the flower service, because neither of these people is that familiar with
what goes on inside the flower service and what it does.
They assign each other to those tasks explicitly.
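Just to make it concrete, here's roughly what that shared doc looks like at this point, sketched as a little Python structure. The entries come straight from the story; the dict itself is just one way to write it down.

# A snapshot of the shared troubleshooting doc at this point in the story.
doc = {
    "symptoms": [
        "08:41 UTC: flower service throughput dropped ~40%, from ~1000 to ~600 tokens/sec",
        "Latency of flower service requests dropped at the same time",
    ],
    "hypotheses": [
        "1. Flower is crashing on certain requests, so they never reach the logs",
    ],
    "actions": [
        ("Check the local kernel logs for crashes", "Barb"),
        ("Read the flower service README", "Alice"),
    ],
}
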
Now Barb's task takes a little longer because the kernel logs
aren't in the log centralization system that
they have at this company. Alice finishes reading the README
pretty fast, and she comes back and says, I have a new hypothesis from
having read the README. It jogged my thinking process.
So hypothesis number two, maybe some downstream
process, some process that is a client to the flower service
is sending the flower service less traffic. Maybe the
flower service is running faster because it's getting
less traffic and it was over-provisioned, so the traffic it is still getting
can be served with less resource contention.
Another reasonable hypothesis. So while they're discussing this hypothesis,
they get that second page, the page about the honeybee service.
Fine, no reason to panic. They take that page
and they add it to the symptoms list: they got an alert
at 08:54 UTC that the honeybee service's error rate
is elevated. And
that jogs their memory and makes them come up with a new third
hypothesis, which is that maybe connections to the flower service
are getting dropped at the TCP level.
We know there's a proxy on the flower service,
the little Nginx that sits there; maybe that's dropping the requests.
And so fewer requests are getting to the flower service.
Okay, well, now they've got three hypotheses, so they can start coming up with actions
that can rule out or fortify
any number of these hypotheses. For example, action three that
Alice and Barb come up with is what if we check whether the honeybee
and the flower disturbances started at the same time?
Look at some graphs, compare some graphs, see if these two things are actually related.
And the fourth action they come up with is to get Devin
on the call, because they both happen to know that Devin, another engineer on
the team, knows a lot more about the honeybee service than they do.
We're in such a better place now than we were in
the previous incident, because we are all looking at the same
plan, we all have the same information, and any
discrepancies between how we're thinking about the problem are taken care of by the
fact that this is all explicit. Alice can go look at those
graphs and she sees, oh, look at that. We had a
linearly growing error rate from the honeybee service before the
flower service's throughput dropped. That's pretty interesting.
So that means that the honeybee problem,
which showed an increasing error rate before there was any change
in the pollen token count,
must be prior in the causal chain to
whatever's causing the flower service's throughput to drop.
Any hypothesis in which these two things are independent,
or in which the honeybee error rate rose after the flower problem,
doesn't hold up.
That's really important because that lets us
rule out hypothesis one, which is that the flower service is
crashing on some kind of request. And it lets us rule out
hypothesis three, which was that the connections
to the flower service are getting dropped at the TCP level, because that
wouldn't explain the honeybee service getting errors
before the flower service saw any change.
So we've ruled out two hypotheses,
which is more progress toward a solution. We also
get Devin on the call. There's Devin. So now when
Seth shows up and starts asking questions like, are these
two alerts part of the same issue? What's our plan? Many of the answers to
Seth's questions are already on screen. Since Seth
was mostly asking those questions because he was stressed out and not sure if his
team was on it, he can now rest
easy in the corner and lurk, secure in the knowledge
that his team is on it. And they do have a plan. And likewise,
when ding, ding, ding, the support team joins and
they want to tell us about their honeybee problem, well,
we can say we're aware of that. We're taking it
seriously. We have that here on our symptom list, and if you'd like
us to add the customer reports that you've got, we can
add those to the symptom list. So support can
tell customers that we're working on their problem. They can post
to the status page, and they can go on the sidelines
and do their job while leaving the
incident call's bandwidth mostly untouched.
So compared to the chaos of the other timeline,
which looked like this, we're in a much better place.
We're in a much better place because we've leveraged Alice much more despite
her limited familiarity with the systems in question.
We're much better placed because anyone new who joins the call
can easily spin up context, and we're much closer to
understanding the problem. We've already ruled out a whole class of hypotheses,
and we have more symptoms that we can
use to generate more hypotheses. And that's all
because we followed through on a simple commitment to
clinical troubleshooting. Now, clinical troubleshooting is very
simple. You can use it unilaterally today. You can just start using it,
and it's simple enough to explain on the
fly to whoever you're troubleshooting with.
And I guarantee you'll get results immediately.
You'll be blown away by how fast the scientific procedure helps
you solve issues and how consistently
it helps you get to the bottom of issues, even when they're very confusing.
But I can give you a few tips for how to
use the process as effectively as possible,
starting with when you're assembling a team. When you're assembling the
group that's going to do clinical troubleshooting together,
you need to make sure that you're bringing in
as diverse a set of perspectives as you can.
So you're going to want to have engineers who
are specialists in different parts of the stack. You're going to want to have maybe
a support person because they have perspective on the customers
or, you know, a solutions
engineer or something because they're going to have perspectives on how customers use the product.
You're going to want to have, you know, as many different roles
as you can, because when you have more roles,
more perspectives on the call, you're going to generate
a wider array of hypotheses, which is going to help you solve the
issue faster. You also want to make sure that as
you talk about the issue, you keep bringing focus back to the
doc. People are going to propose different ways of doing things,
and they're maybe not going to be thinking about the clinical troubleshooting
procedure unless you, as the leader of the troubleshooting
effort, keep bringing focus back to the doc, adding the things
that they're saying to the doc if they
fit, and trying to steer in
a different direction if they don't fit. You'll see what I
mean in a second, when we talk about symptoms and hypotheses. So, symptoms:
when you're coming up with symptoms, you want your symptoms to
be objective, meaning that they are statements of
fact about the observed behavior of the system.
You want them to be as quantitative as possible.
Sometimes you can't be, but to the extent that you
can be, you want to associate your symptoms with
numbers and dimensions. And then finally
you want to make sure you have no supposition in your
symptoms. So state the facts, state them as quantitatively
as you can, and save the suppositions about what might be
going on for the hypothesis column. For example,
if you have the symptom "the flower service's throughput dropped by about 40%
at 08:41 UTC," that's a well-formed symptom.
It is an objective statement about the observed
facts.
It is quantitative. It has numbers
on two dimensions, time and throughput. And it
doesn't have any supposition about why that fact is occurring.
Whereas if your symptom is just "the flower service is dropping requests,"
it's not quantitative, it doesn't have any numbers in it,
and it contains a supposition about
why the throughput number changed.
Right? It's a subtle supposition because, yeah, if the throughput
dropped, it looks like the service is dropping requests. But as we've seen
from our example, maybe it's not dropping requests. Right? That's a supposition
based on the symptom that goes in the hypothesis column.
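To put that contrast side by side, here are the same two symptom statements written out as doc entries, purely for illustration:

# A well-formed symptom: objective, quantitative, no guess about the cause.
good_symptom = "08:41 UTC: flower service throughput dropped ~40% (about 1000 -> 600/sec)"

# A poorly formed symptom: no numbers, and it smuggles in a supposition
# about *why* throughput fell. That guess belongs in the hypotheses column.
bad_symptom = "The flower service is dropping requests"
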
Speaking of the hypothesis column, you want your hypotheses to
be explanatory,
meaning that they explain one or more of the symptoms,
or could potentially explain one or more of the symptoms. And you
want them to be falsifiable, which is a
slightly tricky concept, but essentially means testable.
Falsifiable means if the hypothesis were wrong,
you would be able to prove that it's wrong by taking some action. You should
be able to imagine something you could do that
would disprove that hypothesis. This is
pretty important, because if you have
hypotheses that are not falsifiable, then you can
go chasing your tail for a long time trying
to prove them. Actually, you should be trying
to disprove them or create
more hypotheses that you can then disprove. It's really all about
ruling things out, not proving things,
which is a little bit of a mental shift that you have to make if
you're going to use this procedure effectively. For example,
a downstream service is bottlenecked, which results in less traffic reaching
flower. That's a pretty good hypothesis.
It's explanatory, it explains the two symptoms that we've observed,
and it's falsifiable, because you could
show that the same amount of traffic is reaching
flower as before, that it's not actually receiving less traffic, and that would disprove
the hypothesis, which would be progress. If you have
the hypothesis the flower service is crashing,
that's not a good hypothesis because first of all,
it might not be falsifiable. Depending on your stack,
it may not be possible to prove that the flower service is not crashing.
And second of all, it doesn't really provide a clear explanation
of the symptoms. Like, okay, maybe the flower
service is crashing, but why is that causing throughput and latency to
drop? It's not clear from the way the hypothesis is written.
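And here's the same kind of contrast for hypotheses, again written out as doc entries purely for illustration:

# Explanatory and falsifiable: it accounts for both symptoms, and showing
# that flower is receiving the same traffic as before would rule it out.
good_hypothesis = "A downstream service is bottlenecked, so less traffic is reaching flower"

# Neither clearly explanatory nor obviously falsifiable: it doesn't say why
# throughput and latency dropped, and depending on your stack there may be
# no observation that could prove it wrong.
bad_hypothesis = "The flower service is crashing"
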
So finally, that brings us to the
actions column. So there's a
few different kinds of actions that you can take as part of a clinical troubleshooting
effort. You can take rule out actions
which rule out one or more of the hypotheses, and that's what gets
you closer to the definitive diagnosis of the problem that you're
seeking. So that's the main kind of action that you're going to
want to take as you're going through this process. However, you can also take
research actions which don't rule anything out precisely,
but they will help you generate more symptoms and more hypotheses,
which will hopefully make the path forward clearer.
And finally, you should
try wherever possible to use what are called diagnostic interventions.
Diagnostic interventions are actions that may
just fix the problem if a particular hypothesis is right.
But otherwise, if they don't fix the problem, then they at
least rule that hypothesis out. And that sort of kills two birds with
one stone, when you can find one.
So, for example, a rule out action is
check whether the flower and the honeybee disturbances started at the same time.
As we saw, if you do that,
you can learn something that will allow you to
rule out one or more of the hypotheses. A research
action is something like read the flower service's README.
Reading the README isn't going to rule anything out, but it may
give you some ideas about symptoms that you can go
check, or hypotheses that you
may be able to falsify or that may explain the symptoms.
And then finally, an example of a diagnostic
intervention: say we had a hypothesis that
the honeybee service was having this elevated error rate
because of some bad object in its cache.
If you were to clear the honeybee service's cache, that would be a diagnostic
intervention, because if it fixes the problem,
then great, we fixed the problem, and we know the problem was
something like that. If we clear the cache and
it doesn't fix the problem, then we get to rule out that
hypothesis. So either way, we're making progress.
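Pulling those examples together, here's how the three kinds of actions from the story might be tagged in the doc, purely for illustration:

# The three kinds of actions, using examples from the story above.
actions = [
    # Rule-out: can eliminate one or more hypotheses outright.
    ("Check whether the flower and honeybee disturbances started at the same time", "rule-out"),
    # Research: doesn't rule anything out, but generates new symptoms and hypotheses.
    ("Read the flower service README", "research"),
    # Diagnostic intervention: fixes the problem if the hypothesis is right,
    # and rules the hypothesis out if it isn't.
    ("Clear the honeybee service's cache", "diagnostic intervention"),
]
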
So those are my tips. And like I said, if you practice clinical
troubleshooting in your work, you're going to have shorter
incidents, fewer incidents, and less stressful incidents.
And I'd love to hear
from you about that. So if you are going to do this, if you do
this and you have questions about it, or you want to tell me
a success story or a failure story, there's my email address. Dan,
I also urge you to check out my blog, which covers topics
in SRE, incident response, and observability,
and it's a very good blog. So I'm excited to
hear from you as you try this out. Oh, also, I teach a
four-day incident response course for engineering teams.
You can check that out at d two e engineering for more info on that.
Before I go, I also want to sincerely thank Miko Pawlikowski,
the indefatigable host of Conf 42, for making this incredible
event happen and giving me the opportunity to
speak to all you fine folks out there on the internet.
You've been a fantastic audience.