Transcript
Hi, I'm Ivan Merrill, and this
talk is Incident Response: Let's Do Science Instead.
And this talk is an evolution, really,
of my thinking around incidents. Having spent 15 years in
the financial sector, working primarily in kind of monitoring and
observability, rolling tools out, educating people
on how to use them, and getting involved in incidents.
And as you can imagine, working with some fairly big
financial institutions, some of those incidents can be
pretty scary and have quite a large impact. So what
I've tried to do is look at some things I've seen before that have gone really not so well, some ideas around what I think we can do better, and also add in some research that other people, far cleverer than me, have undertaken, to hopefully give you some really practical information that can help you with incidents.
So first things first. Incident response,
I really strongly believe can learn a lot from safety engineering
in other domains. Right? The tech industry is
still pretty young. And as someone who's spent all my career, really, on the operations side, it's great to see now that SREs are getting more focus, and DevOps really brought another bit of focus onto ops roles and ops functions and everything else like that. But actually, big incidents, complex incidents on big systems and stuff, we're still pretty new at it. We're still pretty young at it. And when we
look at aviation, health, the emergency
services, they've been dealing with really, really important incidents
for an awful lot longer than us. And their incidents actually
are often much more impactful. They are quite literally life
and death. So it's absolutely right that
we look at these industries, look at these domains, and understand why
they're doing what they're doing. Right.
I would feel horrified if I saw some firefighters turn up and everyone just kind of went for themselves and crazily started using their hoses, spraying water onto random bits, right? That's not what they do. They have a clear kind of structure to what they're doing, some thought behind their decision making.
And so let's make sure that we can
take these ideas into the tech industry as well,
because we really need to. And too often
it can descend into chaos.
So firstly, a definition, right? I love a definition. And John Allspaw and Richard Cook are far better placed to give one than I am, so I've relied on their kind of authoritative views here to take this definition,
which is that incidents are a set of activities
bounded in time that are related to an undesirable
system behavior. And there's a couple of
things I'd like to take away from this. One is that it says an
undesirable system behavior. It doesn't just say
something broke, right? So that's point one. And point two is that it says a set of activities, not just this one thing. So it's not just one thing, and it didn't just break. It's a set of activities
bounded in time that are related to an undesirable
system behaviour. So we have a definition.
Right. And why is all this important?
Well, the fact is that catastrophe is always around the corner. So says Richard Cook, who was an amazing person, and I strongly suggest you seek out his work. He wrote How Complex Systems Fail, and this is one of the items in that list, really.
We are never far away from catastrophe. And in fact, it's actually worse than that, right? One of the hallmarks of the complex systems that we are building is the potential for a catastrophic outcome.
So just the fact that we are building these complex systems means that
we are creating the potential for catastrophe. We can never
escape it, right. And we can build systems
as resilient as we can, and we can really invest an awful lot of
time in making them resilient, a lot of effort, a lot of
money. But the fact is that there will
always be a way for it to reach a
catastrophic outcome. Some things are completely
beyond our control, and so we need to work out
how to actually deal with this catastrophe.
And the fact is that incidents aren't actually
very easy. They're generally pretty difficult. If they
were easy, I would not be doing this talk and
sharing some experience and hopefully some useful information with you.
So as a result of them being pretty hard, there are a number of common pitfalls that people generally seem to fall into, that I've certainly seen anyway. So I'd like to just kind of cover these, and I'm sure some of them will appear quite familiar to you. So the first thing is
an over reliance on dashboards and runbooks. Right.
And what happens here that I've seen quite a lot is
that people generate dashboards and they are
a predetermined set of signals, which can be great signals,
whatever, that's a whole other subject. But they have got these signals, and they're trying to understand what's going on in this incident based on this predetermined set of signals, and they might not be able to work out what's going on, right? This incident doesn't map nicely to their dashboard. And particularly when you have several teams that all have their own dashboards and everything else like that, everyone's looking and their dashboards are all green, but the incident is still occurring.
And I found out that this is actually called the watermelon effect, which I really like, whereby everything is green on the surface, but if you scratch at that surface, everything is red underneath, right? And it can easily lead to this kind of very low mean time to innocence for every team, because everyone's looking at their dashboard, their dashboard is green, they're all saying they're fine, but the incident is going on,
and runbooks aren't actually necessarily any
better. This is not saying runbooks aren't great. They can absolutely be
great. But we mustn't only rely on our runbooks,
right. Because if all that ever happens is that we
only have teams that know how to respond to
incidents by looking for a relevant runbook, then all they
can do is if situation x occurs, they do y.
Right. There is no room, really, for these people to build the troubleshooting skills that are so important in incidents, right? And they're not gaining any experience
in how the system works. They're not rebuilding
their mental model of their complex system with every incident.
Right. There is no learning going on here. And so we need a way to investigate incidents that allows us to build up the necessary experience, learn what works and what doesn't, and helps us upgrade our mental models, because our systems change.
Right. And another
thing that I have seen that, I mean, hopefully, quite obviously
seems bad is guesswork. We know we need to
do something, right? But we don't have a clue what to do. And so
we sometimes guess, and there are many forms of guessing. Some are just really big gambles, some are maybe a bit more of an educated guess, like just immediately failing over to a second site or anything else like that, right? And the fact is that sometimes it does actually work. Right. But that's luck, and luck is not a reliable strategy.
Right. We can't build up a troubleshooting skill based on luck. There are no learning opportunities from it. There is no hypothesis being built. There's no context to our decision making, and we're certainly not able to build a runbook or anything else like that based on luck. Roll the dice and take option five? That's not going to work,
right. So we need a way to structure our thinking
and to help us move forward when information is actually low because
we do find ourselves in situations where it is really not
clear from the information that we have what to do. So how do we
deal with that?
Spending a long time on the wrong hypothesis is a
massive time sink and can be really, really costly for organizations.
Right. The amount of times we see this: we've got a high error rate, customers are complaining, things are breaking, and we look at whatever's giving us the most errors and we follow that through, and we spend an hour or so investigating, and everything is looking like it's absolutely going to be this thing. And suddenly this little bit of information appears and it absolutely blows our hypothesis out, right? It cannot be the thing that we've just spent an hour investigating. And we're human, right? And when
we kind of have an idea of what's going on,
we suffer from confirmation bias, right? We look for things that are
going to reinforce our hypothesis. We want to
believe in ourselves. We feel this is what it's going to be.
We get this feeling. And so we surround ourselves
with information. We seek out the information that's going to say,
that's going to agree with what we think, right?
This is confirmation bias. And so we need a way to prevent that
as much as possible, right. Because we do not want to be spending a lot
of time investigating the wrong things.
Fear of failure. Yeah, it's a really big thing, and it's really important during incidents,
right. Because incidents, no matter how
much we can apply best practice and everything, they are high stress
situations, right. And quite often good practices and good decision making can go out the window. And if we're scared of doing the wrong thing, we may never take action. In fact,
that's one of the kind of the key factors
in procrastination, right. We are generally just scared of what's
going to happen if we get it wrong. And so we just never start.
We need psychological safety in incidents.
We need to feel good in the situations,
in the decisions that we're making, right?
And we can be in a situation where we are 99% certain that this is the thing, or we have all of the evidence and everything else like that, we've done everything right, but we are fearful of the consequences. And that, again, can delay resolving the incident. But also, it's really not nice for the people involved
in incidents, right. It doesn't make people want to get involved in incidents either,
which we need them to. So, yeah, fear of failure is a really important thing. We need to provide a way to have psychological safety in our decision making. So, I've covered a few things that I've seen quite a few times in incidents, and I really like this quote.
Right. History doesn't repeat itself, but it often rhymes.
And I think this applies to incidents as
well, because we shouldn't really ever be seeing the same incident over
and over and over again. Right. But we can quite
often see patterns of incidents.
And so it's really important that we learn and we
improve with every incident that we have. And many organizations are
now performing some kind of post mortem, post incident analysis,
whatever you want to call it, but there are also some kind of pitfalls
that we can fall into here, too. So I'm just going to take another moment to look at a couple of those.
And the thing is that it seems easy to look
back at an incident and determine what went wrong. The difficulty
is understanding what actually happened and how to
learn from it. Hindsight is 20/20,
right? It's not actually very good to look back and go,
oh, this thing happened.
There is no understanding when we point out something that just seems really obvious afterwards, because it didn't seem obvious at the time. If it had been obvious, people would have taken that course of action. They would have seen that, oh, it was a bad deployment, we redeployed, everything was better, and so on. Well, that's quite a simplistic view of
an event, right? And if it was that simple, people would have just resolved
it straight away. But it wasn't. So it didn't
feel that simple. And if we take that kind of
approach, then we are preventing learning, right? And actually,
so one of the things that I think we can understand from this is that
we need to record the context of why decisions
were made, right? So that if it seems so obvious afterwards, why didn't it at the time, right? Why were people thinking it was this? What is the context of the decision that they were making? And that's something that's quite often missing in
post incident analysis.
And we can talk about it in terms of
normative language, right? So when reviewing an incident,
normative language kind of says that a judgment is being made,
right? And that's often based on someone's perception. And it's really,
really easy to do. I'm quite certain that I've done
it before. I'm sure most people here have. But the
implication, right, is that if people had done this thing,
then everything would have been resolved much better.
Everything would have been fine, it would have been resolved quicker, and so on. But they didn't, you know, and that's just silly them, right? The team missed this obvious error, which they ought to have seen.
Well, what if they missed it because there were a sea of alerts
that made it impossible to see any single alert? Right?
What if they had seen the alert, but they didn't know it was important
or knew it was important, but they didn't know what to do?
Normative language doesn't help us. Right? And in actual
fact, it's removed an opportunity for learning because this
person has made this judgment that actually this was just kind of this
silly person, this human error,
and that was the cause. Right. But that's really
not helpful and doesn't help us improve in the future.
Human error, to be clear, should never
really be considered as a major factor
of an incident. Right. It's rarely ever the case, and it just
removes every opportunity for us to learn.
It's very easy to say, just don't make that mistake next time.
Right? But how was that person allowed to make that mistake by the system? So, yeah, we need to think of a way that we
can explain things in a more factual way
and avoid normative language.
Mechanistic reasoning. Right.
I kind of almost like to think of this in a
kind of Scooby Doo type way, right. And bear with
me because this is a little bit stretching it, but we
would have gotten away with it, too, if it hadn't been for that meddling regional
failure on our database service. Right? That's my
Scooby Doo kind of villain impression. Sorry.
But what I'm trying to get at, right, is that there
is a temptation to reduce really complex failures to simple
outcomes, right? So this happened
because of that, right. Everything failed simply
because of our regional database failure. Right?
And everything can be explained by this simple thing.
And this kind of leads us to fall into this
trap of there being a single root cause. But we know from our
definition, right, that it's often a set of actions.
So, yeah, we need to avoid that, because very, very rarely, if ever (I'm not sure I've ever seen one), is there actually a single root cause.
So, mechanistic reasoning, right? It simplifies the issues
faced. It's almost impossible for there
to be a single root cause. Yet this can often lead to that
thinking. Normally, there are a series of contributing
factors, right? And again, it's removing the opportunity
for learning because we've got that root cause. It was that.
But if we know it was that, then, okay, fine,
solve that. But there are normally, as I said, contributing factors, so much else going on around it. So we need to be able to think about those as well and learn from all of those things. So, yeah, we need to be exploratory
in our approach, right? We need to have a way to evolve our thinking as
new information arrives and understand that fundamentally,
we may never actually know all of the causes,
right? But hopefully we can learn from as many as
we can.
And Richard Cook, who I've mentioned a couple of times already, and whose work, again, I strongly recommend seeking out, described this as above the line and below the line. And this is a really, really important concept when
we think about our systems and stuff,
right? Because the complex system isn't just the technology,
right? It's actually everything involved, and that's the
humans as well. So the technology
is below the line, and that's here.
I've got the code, the infrastructure, other tech stuff. This is my
simplified version of it, by the way. And we are above
the line, and we don't actually see the code
running on the computer. We don't kind of see
zeros and ones traveling between network interfaces and
everything else like that, right? Or deployments going from our CI/CD platform to our infrastructure. We don't actually see that.
We interact with that, right? And we
can only infer what's going on beneath that, right?
Based on our interactions. And the fact is that
we are actually the adaptive part
of the system. We are the ones that introduce change,
right? We introduce change to our code by making releases.
We introduce change to our architecture. But we
make new deployments, we make new releases, we introduce
new features, everything else like that, right? We are the ones
that are introducing change. We are the adaptive parts of our system, and we
are above the line, okay? And this is kind of an ongoing thing, and it's all changing the way that our systems behave, changing how catastrophe occurs or can occur. And I think the best way to think about this, right, is: if everyone stopped working on your system, if there was no more human interaction whatsoever in terms of supporting it, deployments, anything else like that, how long would that system survive? That service, that thing? How long before users would be complaining? So it's a really important thing to understand that we are actually part of the system, and our behavior is
impacting the system. And even more than that,
we are the ones that define what an incident is. Again,
I mentioned our definition of an incident, right? Undesirable behavior.
Who defines what is undesirable
behavior? We define what is undesirable
behavior. But also, as we are part of
the system, if we improve our behavior, then we are
actually improving the system's behavior. Because again,
we are part of the system.
And I said we introduce change,
we are the adaptive part. But as we introduce change,
we introduce new forms of failure.
And as we resolve one incident or one type of failure, we are potentially introducing
a new one, right? A new weird and wonderful
way in which our service can fail.
And again, this is another Richard Cook point, from How Complex Systems Fail. It's even worse than
that, right. In that as we make our systems more
resilient to the types of failures that we've already seen,
it actually takes more for the system to fail,
right? So the next failure is likely to be even
more catastrophic because each time we're kind of making
it more resilient, we're fixing it, we're preventing all
of these kind of smaller failures. It's just going to take a
bigger thing for it to go wrong. But the fact
is that we can't stop change. We need to
have change, right? It's part of our
jobs and for good reason, right? We do need to
release new features, we need to stay ahead of the competition. We need to do
all these things. We have to introduce change. Change cannot stop,
but it is introducing new forms of failure. And it's important that
we're aware of that.
So when we think about these things, I've been looking at research into how people actually go about troubleshooting, right? And the fact is that some people just seem to be able to understand what's going on much more effectively than others. They seem to kind of smell what's going on, right? They've got this
evolved approach to troubleshooting. Right? And the
fact is that they are not working out how things are breaking in the same way as people who are new to a service or new to a technology do, right? When we first start out troubleshooting, we rely on our system and domain knowledge, right? We know the technology, we have our mental model, and we kind of try to work out what's broken based on these things. Well, this thing is calling that thing, so let's look at that. And we build a base hypothesis on what we think could be going on, based on this understanding of the technology, right,
and the service. But actually, as we evolve our troubleshooting experience and become more experienced with our system, we actually move away from our understanding of these things and more towards our experience of what we've actually seen before. We can start to build hypotheses based on how similar the symptoms we see here are to symptoms that we've seen in other incidents. Right. So what's actually happening is that we're able to build and discard, and essentially cycle through, hypotheses much faster.
Right. Because we can start to say, okay,
well, I've seen these signals, these are similar
to these incidents over here. How can I
remove this hypothesis or this hypothesis
that I've seen before from this incident, right. They are literally cycling
through these hypotheses really quickly until
they get to one that seems to fit. And that is how people
are evolving their troubleshooting experience. Right.
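Just to sketch what that kind of pattern matching might look like, here's a rough, made-up example: ranking past incidents by how much their recorded symptoms overlap with what we're seeing now, so the hypotheses attached to the closest matches get tried, and hopefully disproved, first. The incident history and symptom names here are invented purely for illustration, not taken from any of the research I've mentioned.

```python
# Hypothetical sketch: rank past incidents by symptom overlap (Jaccard similarity)
# so the hypotheses attached to the closest matches can be tried first.

def jaccard(a: set[str], b: set[str]) -> float:
    """Similarity between two symptom sets: |intersection| / |union|."""
    return len(a & b) / len(a | b) if a | b else 0.0

# Made-up history of past incidents and the hypotheses that explained them.
past_incidents = [
    {"symptoms": {"high_error_rate", "db_latency_high"}, "hypothesis": "database latency"},
    {"symptoms": {"high_error_rate", "deploy_recent"},   "hypothesis": "bad deployment"},
    {"symptoms": {"timeouts", "queue_depth_growing"},    "hypothesis": "downstream backlog"},
]

def rank_candidates(current_symptoms: set[str]) -> list[tuple[float, str]]:
    """Return (similarity, hypothesis) pairs, most similar first."""
    scored = [(jaccard(current_symptoms, p["symptoms"]), p["hypothesis"])
              for p in past_incidents]
    return sorted(scored, reverse=True)

if __name__ == "__main__":
    print(rank_candidates({"high_error_rate", "deploy_recent", "timeouts"}))
```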
But it's clear that this takes a
lot of time and effort and experience, right? You can't just turn up to a system and gain this kind of smell and understanding of what's gone before in it, right? Experience is really hard won. But transferring this experience is really, really hard. In most cases, in many companies, you actually need to have been involved in that incident directly to have any real knowledge of what's going on with it, right, or of how to apply it in this particular case.
And that's quite difficult to do. And again,
we are part of the system, but there is change on the adaptive part as
well. People come and go. So what do we do?
Well, I think that we can bring
more science into this. Right. And again, I love a definition. So let's
take the ever-reliable Wikipedia's view on science. So science is a systematic enterprise
that builds and organizes knowledge in the form of testable
explanations and predictions about the universe.
Lovely. Worth noting, we're not talking about the universe,
we're talking about our service or system. So we can reduce the scope quite
drastically, which is fortunate.
And something specific is that scientific
understanding is quite often now based on Karl Popper's theory
of falsifiability, which is really hard to say.
Try and say it three times. And yeah,
anyway, no matter how many observations are
made which confirm a theory,
there is always the possibility that a future observation
could refute it. Right? And there is
a quote here, right. Induction cannot yield certainty.
Science progresses when a theory is shown to be wrong and
a new theory is introduced which better explains the phenomena.
So essentially we learn with each theory
that we disprove, right. And we've
kind of got this thing here, you can see this kind of classic scientific experiment cycle going on: we've got an observation, a question, research, a hypothesis. We test with an experiment, we analyze the data and we report the conclusions, but it doesn't end there. We keep going, right? We can never truly know. And this is the theory of falsifiability from Karl Popper.
And I think actually this is something that we can apply to creating
hypotheses within incidents. I think this is
something that actually we can apply that will help us.
So I think it's quite straightforward
actually to convert this scientific method into incident resolution
behaviors. So an observation is: our system is exhibiting an undesirable behavior, based on our definition. And so we do
some research, right? We look at our monitoring and observability systems
and based on what we see, we create a hypothesis on what is most likely
happening. We think
it's this particular thing that's going on because we've seen it in our system,
right? And so we do an experiment and this is
important, we attempt to disprove the hypothesis,
right? Disprove, not prove.
Again, thinking about having the wrong hypothesis
and losing a lot of time, we attempt to disprove our hypothesis.
In our analysis we say, are we able to disprove this hypothesis?
If so, great, okay, that wasn't the thing.
But we learned something, right? And now we can move on to the next
most likely hypothesis and we can repeat it and we
can keep going through this cycle until we're unable to
disprove a hypothesis, at which point this becomes
our most likely working theory, obviously until such time as we
learn anything that disproves it.
And we can either narrow the hypothesis or we can start
to take action based on the hypothesis.
This is kind of a way that we can apply this structured scientific thinking to an incident.
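To make that loop concrete, here's a minimal sketch of the idea. The hypothesis structure and the disprove check are just illustrative assumptions of mine, not a prescribed tool; the point is the shape of "attempt to disprove, record the outcome, move on".

```python
# Hypothetical sketch of the disprove-first loop described above.
# Each hypothesis carries a falsifiable statement and a way to test it.

from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Hypothesis:
    statement: str                      # falsifiable explanation of the undesirable behavior
    disprove: Callable[[], bool]        # returns True if the evidence disproves the hypothesis
    notes: list[str] = field(default_factory=list)  # context recorded for later learning

def investigate(hypotheses: list[Hypothesis]) -> Hypothesis | None:
    """Work through hypotheses, most likely first, attempting to disprove each one.

    Returns the first hypothesis we could NOT disprove: our current working theory,
    held only until new information disproves it too.
    """
    for h in hypotheses:
        if h.disprove():
            h.notes.append("disproved - moving to the next most likely hypothesis")
            continue
        h.notes.append("could not disprove - adopting as working theory")
        return h
    return None  # everything disproved: go wide and generate new hypotheses
```

The notes recorded along the way are as important as the answer, because they are what lets other people see afterwards why we did what we did.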
And here is a theory, here is my hypothesis, in fact: a more scientific, hypothesis-driven approach to how humans perform and document incident investigations can improve reliability.
Because I'm talking about not just creating these hypotheses in this way, but ideally doing some kind of documentation, writing them down, providing some context, because we want to learn from all these things, right? And so here's one I made earlier: a possible explanation for the high error rate is that there is high database latency, blah, blah, blah. I'm not going to read it all out, but the important part is that it ends with how we can disprove it.
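Purely as an illustration of the kind of thing I mean, not the actual wording on the slide, a written-down hypothesis might capture something like this; the names and details here are made up:

```python
# Hypothetical example of a documented hypothesis: the wording, metric names
# and details are invented for illustration only.
hypothesis_record = {
    "observation": "error rate on the checkout service is elevated",
    "hypothesis": "a possible explanation for the high error rate is high database latency",
    "context": "recent traffic spike; similar symptoms seen in a previous incident",
    "disprove_by": "compare database query latency against its normal baseline; "
                   "if latency is normal, the hypothesis is disproved",
    "outcome": None,  # filled in after the experiment, so others can learn from it later
}
```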
And you might be thinking, why would I want to write all this stuff out? Why would I want to consider this structured thinking? It seems like a lot of effort. And why indeed; you know, fair question. This is my hypothesis, and maybe I've got it wrong.
Well, let's have a look at some of the things we've thought about before, right? Because if we're using Karl Popper's theory of falsifiability, we're removing bad avenues of investigation as soon as we can, right? We're attempting to disprove something, rather than just keeping on proving it, and proving it, and proving it, until such time that we can't prove it any more. So we're quickly removing bad avenues of investigation, because we're looking to disprove. It also allows for change as the incident changes, right? We do get more information
as incidents go on and they evolve. And so,
like science, it's only ever our current understanding,
right? Until such time that it becomes disproven. And that might happen
because as I said, incidents do change over time.
This can formalize the language used to explain decision
making, right? And this is really, really super important.
We're creating a way to communicate to others in a clear
way, right? We can record our hypothesis,
we can record the outcomes of our experiments,
and this can really improve learning because we have much more information
to work from, right? We have the context of what's going
on. And that provides us with a level
of psychological safety, because there are clear, documented reasons as to why we're doing what we're doing, right? We have the
reasoning behind the decisions. We have the proof of why we did what
we did, right? And there is safety in that because
we can show exactly how we got here. And that's really,
really important. And also, as I said, it avoids
normative language. We're talking about a very science-based approach here. We're talking about things that are based on facts, hypotheses. We are looking at this hypothesis until such time as it's disproven. We've got our little hypothesis there, you can see. It's factual language, right? So we're avoiding all our normative language and our mechanistic reasoning.
And yeah, hindsight bias is reduced
as there is context, right? If we are going through
all of these hypotheses, people are able to see after the
event why we did what we did.
There is context, okay? This was their hypothesis.
They thought it. Because of this, they disproved it. Okay?
It makes sense that they went on to there. We can see their thinking.
So you might be thinking, okay, that sounds actually pretty
good, but how do I even create a hypothesis?
Well, John Allspaw, again, a very great person in this field; his actual thesis looked into this, into failures in Internet services, right. And so he produced these steps. The first is to look for changes. And I think this is something that actually we've
all seen quite often, and certainly I've seen it before in incidents,
an incident is created and the first thing is what's
changed. And we know that change is a very common source of failure.
And so it's a really great place to start.
If that doesn't help you, if maybe you've created some hypotheses
based on what's changed and you've disproven them all,
then you go wide, but you don't go deep. So what
this means is think of lots and lots of different hypotheses.
Maybe every team that's involved is asked to think of
maybe one or two hypotheses. Right. Keep it high level,
though, and don't go too deep. Try and disprove
as many of these as possible. Right.
You can always zoom into these later. As I said earlier, once you've got
a hypothesis and you can't disprove it, maybe you can narrow the scope,
but for the moment, widen the net, think of lots of different things,
try and disprove them,
and then, don't forget Occam's razor, right? The simplest explanation is often the most likely one. So when going through these things, when starting to think of more hypotheses, think of Occam's razor.
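As a loose sketch of those three heuristics (look for changes first, then go wide rather than deep, and prefer the simplest candidates), here's one way the ordering might look. The "recent change" flag and the "complexity" score are my own illustrative assumptions, not something prescribed by the thesis.

```python
# Hypothetical sketch: order candidate hypotheses using the three heuristics above.
# 1) hypotheses tied to recent changes come first,
# 2) gather one or two broad hypotheses per team (go wide, not deep),
# 3) within that, prefer the simplest explanation (Occam's razor).

from dataclasses import dataclass

@dataclass
class Candidate:
    description: str
    relates_to_recent_change: bool  # heuristic 1: change is a common source of failure
    complexity: int                 # heuristic 3: lower = simpler explanation

def order_candidates(per_team_suggestions: dict[str, list[Candidate]]) -> list[Candidate]:
    # Heuristic 2: take everyone's high-level suggestions rather than one deep dive.
    all_candidates = [c for suggestions in per_team_suggestions.values() for c in suggestions]
    # Change-related candidates first, then simplest first.
    return sorted(all_candidates, key=lambda c: (not c.relates_to_recent_change, c.complexity))
```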
And that is kind of my talk, really. And hopefully that's given
you some food for thought on how you can
improve or start or whatever your incident
response. And I'll leave you with this, another Richard
Cook quote, because he is so influential.
All practitioner acts are a gamble.
Right. With science, there is no 100%
guarantee, there is just a hypothesis that we
can't disprove. Right. We have to accept
this, particularly during incidents. Right. We never know for absolute
certain that what we're going to do really is going to fix something. There is
always the opportunity for another level of catastrophe
because the actions that we take are above the line, and we're interacting with things that we don't fully understand; they exist below the line. So we don't know for certain. And therefore it's
a gamble. But hopefully, by introducing more scientific method, by recording our actions, by providing a more structured response, we can reduce the size of this gamble. And whatever happens,
we have these actions recorded that allow us to learn for the
future. So that's it.
That's me. Thank you very much and have a great conference.