Transcript
This transcript was autogenerated. To make changes, submit a PR.
Thank you for joining me for my talk. I want to go over some
of the aspects of incident response that maybe you haven't
thought of before, as well as talk about different ways that companies can
organize their approaches to incident response.
Taking a cue from a popular narrative approach, let's start
in the middle of the story. It was a calm,
dark night, and our hero was fast asleep
when all of a sudden the pager went off.
Now, when the pager goes off,
what normally happens? Typically it will
bring in a technical responder. It's going to go to a service
team person who's on call for the problem
service, as needed. That person
might bring in additional resources, call a colleague, call a friend,
and try to get the problem solved.
Especially if it's happening during the daytime or business hours,
it's very common that they'll recruit other folks into
the event with them. So what
will these folks do when they join the incident bridge,
call, channel, whatever it is that they're using
as a communications method? Perhaps they're even in a physical situation room
and working together to solve this. I want to
introduce you to a concept called the OODA loop.
OODA: observe, orient,
decide, and act. This is a fundamental
approach that's going to give us a way to
look at what everybody's doing along the way of
handling an incident. They're going to start with observing,
seeing what signals are available to them,
what's happening, and what the system tells them about what is
happening. After that, they're going to try to orient:
they're going to make sense or try to make sense
of what these signals mean, to enrich their observations with
context. What's been deployed recently? What other
changes do they know may or may not have been going on?
Then they're going to have to make a decision. They're going to
have to evaluate their response options, pick an approach,
and then ultimately act: execute the decision.
And in turn, repeat: look at
what effect their action has, orient as far as what
that means, make new decisions and take new actions. Each person
is going to be doing this individually as well as the group doing it together.
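As an illustration only, the loop just described can be sketched in a few lines of Python. Everything here is an assumption made for the example and not something from the talk: the simulated `system` dictionary, the `errors_per_min` signal, the threshold of 10, and the `rollback` and `restart` actions are all hypothetical.

```python
# A toy OODA loop. The signal name, threshold, and actions are hypothetical.

def observe(system):
    """Observe: read whatever signals the system exposes."""
    return {"errors_per_min": system["errors_per_min"]}

def orient(signals, recent_changes):
    """Orient: enrich the signals with context such as recent deploys."""
    suspect = recent_changes[-1] if recent_changes else None
    return {"degraded": signals["errors_per_min"] > 10, "suspect": suspect}

def decide(picture):
    """Decide: pick one response option from the oriented picture."""
    if not picture["degraded"]:
        return "stand_down"
    if picture["suspect"] is not None:
        return "rollback"
    return "restart"

def act(decision, system, recent_changes):
    """Act: execute the decision against the simulated system."""
    if decision == "rollback":
        recent_changes.pop()              # undo the most recent change
        system["errors_per_min"] //= 10   # pretend the rollback helps a lot
    elif decision == "restart":
        system["errors_per_min"] //= 2    # pretend a restart helps a little

def ooda(system, recent_changes, max_rounds=10):
    """Repeat observe, orient, decide, act until the system looks healthy."""
    history = []
    for _ in range(max_rounds):
        decision = decide(orient(observe(system), recent_changes))
        history.append(decision)
        if decision == "stand_down":
            break
        act(decision, system, recent_changes)
    return history

history = ooda({"errors_per_min": 200}, ["deploy-a", "deploy-b"])
print(history)
```

The point of the sketch is the shape of the loop: each action changes the system, so the next round starts again from observation rather than from the earlier picture.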
So what is it that they're doing? They're doing
diagnosis. They're figuring out what is happening,
figuring out what to do, and figuring out what can be done
about what it is that's going on, with the goal,
ultimately, of therapy. They want to take action to
mitigate the issue and to ensure that the action
was effective. So what's missing
from this picture? Let's back up just a little bit, because
I ran a little bit fast going through all of this with
the initial responder engaged, the person who receives
the page. I glossed over the work of recruiting
additional people, and the role of incident commander
is often one that doesn't fit well on the shoulders of a
technical responder. This work of
getting the right people involved is something that can be handed
over, and often is, to a role that's known as incident
commander or incident coordinator. I use incident commander because
it's kind of the traditional term. How does
the IC themselves get engaged?
It could be that the technical responder throws up a flare and
says, hey, I need an incident commander to help me out here.
Or it could be that the pager maybe calls the incident
commander, either initially, and
they bring the technical responder in, or at the same time
as the technical responder is paged. In
turn, the incident commander can work on recruiting
the additional people that are needed to help out in the scenario.
With a multitude of people involved, even if it's
just three or four, there is coordination
that's needed. Who does what and when?
Who's in charge? Well,
this is where the organizational models come into play.
So we've got this structure right now.
We've got a technical responder that's engaged with technical
compatriots. We have an incident commander who's looking
over things. We also have, in
any incident, other concerned people.
These might be executives, they could be
managers, they could be customers if this is a
customer-impacting incident. They want updates. They want
to know what's going on. With an incident commander,
they know who to talk to, because the incident commander is the designated
contact. But sometimes there's so much work going on and so
many things to coordinate within the response team,
that delegating that communication role to a communications
lead can really help out the incident commander.
So this covers the spectrum, roughly,
in very functional terms, of the work that needs to
be done: diagnosis and therapy
in the technical realm, recruiting and communication
in the logistical or administrative realm, and coordination,
bringing it all together.
This is the functional approach. How do you organize
this, and how do teams typically approach,
or companies typically approach, structuring the way that people respond
to incidents? Let's map this functional model
into kind of an organizational approach.
So there's one approach that I call the one at a time,
or the pass the baton, method.
The page comes in to team A. Perhaps there is
an alert that was triggered on a service that's owned by team A,
and team A looks into it, and they decide our
system is doing the right thing. It must be because team
B's system is giving us bad data,
or returning an error or something like that, or not returning at all.
So they pass the baton to team B, who in turn
looks at it, passes it to team C, and ultimately
they decide, ah, it's got to be the DNS, so we'll pass it to the
DNS team, since that starts with a D.
The problem, or one of the problems, with a one at a time, pass the
baton method is: where does the
incident commander come from?
We've passed the responsibility from team A to B
to C to D. What happens if team D finds that
their problem is actually caused by something back on
team B? We've gone around
in a loop here, and perhaps with no great resolution.
The lack of an incident commander that's typical in this kind
of a structure leads to confusion,
longer response times, longer resolution times,
and it leaves the other concerned people
kind of floating out there in the
fog somewhere. They don't know who to talk to. One minute they're talking
to somebody on team b, and the next minute, oh no, we're not involved
in that anymore. And it's
not a great experience for the outside folks.
An alternative to avoid this problem with
passing the baton from team to team to team is the all
hands on deck approach. The page may go
out to all four teams at once, plus an incident
commander. This can
have different variations, so maybe it goes to the
on-call engineer from each of the teams.
Perhaps it goes to a single cross functional team that has all the
capabilities that are needed within that one team.
Or literally it could go to everyone. There are
organizations where an incident will trigger 100
or 200 people to join the situation room, the channel,
whatever the communication framework is.
This can lead to teams that are seeking
to escape. So essentially you summon four teams into
every incident. Their first thought is going
to be, okay, how do I get out of this and leave
some of the other teams holding the bag?
It's not necessarily the greatest collegial
experience, and it can lead to some unfortunate
dynamics, but at least we have an
incident commander. And so the outside parties that are wondering
know who to talk to, and then they
can work through that process in order to
understand what's happening with the event.
An alternative approach is an escalation, or tiered, model.
This is very common in organizations that have a NOC,
a network operations center, or some variation thereof.
Calls and alerts come into the NOC.
They triage those calls and either decide
which tier two team to escalate to,
or ideally try to solve some large percentage
of them without requiring the escalation. Tier two gets involved
when necessary. They try to take out another
chunk of the calls and only bother the expert
tier three on occasion.
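The tiered flow can be sketched as a short Python function, purely as an illustration. The tier names, the `kind` field on an alert, and the rules for what each tier can resolve are hypothetical stand-ins chosen for the example, not details from the talk.

```python
# A toy tiered escalation. Tier names and resolution rules are hypothetical.

def escalate(alert, tiers):
    """Walk the tiers in order and return the first one that can resolve the alert."""
    for name, can_resolve in tiers:
        if can_resolve(alert):
            return name
    return None  # nobody could resolve it

tiers = [
    # The NOC handles alerts that already have a runbook.
    ("noc", lambda a: a["kind"] == "known_runbook"),
    # Tier two also handles ordinary service bugs.
    ("tier2", lambda a: a["kind"] in {"known_runbook", "service_bug"}),
    # Tier three, the experts, take whatever is left.
    ("tier3", lambda a: True),
]

for kind in ("known_runbook", "service_bug", "novel_failure"):
    print(kind, "->", escalate({"kind": kind}, tiers))
```

Note that the sketch only decides who resolves the alert. It says nothing about who coordinates the response, which is the open question with this model.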
This one raises the question of where the incident commander comes from.
Some organizations will have incident command be a
responsibility of the NOC and they have the expertise. They have
people trained in doing incident command and will follow through the
escalation process. But this isn't always the case.
And again, it's an open question that organizations
will solve in different ways. If there's not an incident commander,
the outside concerned people have the same challenge of knowing
who to talk to as in the one at a time model.
If there is an incident commander, that mitigates the problem for
the other parties that are involved.
There's another model that came into the tech industry
maybe about ten years ago, adapted from
emergency response,
firefighting, FEMA emergency response,
called ICS, or Incident Command System.
The first adaptations of this into the tech world
I'm calling strict ICS. The incident commander
is literally in command, and if they don't
do their job, nobody else is going to do anything at all.
This can be thought of in the scenario of wildland
fire response, where the fire chief tells
each of the teams what they're going to be doing. They're very directive
and very tightly organized, I guess is the best way
to put it. Other concerned people are
kept informed because you've got a communications lead role.
And the problem is that this doesn't
really fit well with an agile,
everybody-takes-responsibility
engineering culture. And so there's an adaptation from
strict ICS that has evolved in the tech industry,
which has come to be known as adaptive ICS, thanks
to the work of Laura Maguire. Essentially the technical
teams undertake their own work,
largely in a self directed manner. The incident commander
and the communications lead end up being a
wrapper or a protection for the technical teams
from the outside world. They provide the
communications in and out, and the
incident commander is responsible again for focusing on
how the event is being conducted, making sure the teams have
who and what they need, making sure that the teams are staying
functionally operational, that people aren't
getting too tired out and wiped out from the process.
My colleague Matt Davis has coined this term, the response
trio: the incident commander, the communications
lead, and then the problem solving team,
all working together in concert to accomplish
the goal of solving the problem as quickly and effectively
as possible. The incident commander's key role is to
be focused on the coordination aspects of the problem:
how the response is being conducted, not the incident
itself. They're not diving into the metrics or the
graphs and the logs. That's the technical team, the problem
solvers. The incident commander is responsible
for upholding and maintaining the common ground throughout
the response team and throughout the response period.
Let's recap some of these pieces
quickly as I draw this to an end.
We have the OODA loop: observe,
orient, decide, and act,
which is the process that everybody is doing.
Even the incident commander is observing how the
individual responders are interacting with each other
and seeking to make sure that, let's say, when a
new person comes in, they are brought up to speed quickly in
understanding the common ground that has been established and
maintained amongst the previous responding teams as they
come in to join, without having to distract the technical responders.
Everybody is orienting, understanding the meaning
of what they're seeing and observing, making decisions,
acting, and then repeating the loop regularly.
This is all in service of the technical sides of diagnosis
and therapy. Handling the incident:
understanding the incident, and then handling it, solving it.
Again, recruiting people is a key component,
as is coordinating who does what, and then communicating
outside of the team. This is the functional side
of incident response, and pretty
much everything that happens can fit into one of these
categories. It can be helpful to think about:
is what I'm doing therapeutic, or is it diagnostic?
Or is it an attempt at therapy, and if
it works, then we'll conclude, because it worked, oh, that must
have really been the problem? That's not an unusual scenario.
Organizational models: we've looked at one at a time,
all hands on deck, an escalation model, and
strict and adaptive ICS. I want to point out
that while adaptive ICS is kind of the
latest approach and the one
that is generally widely accepted
amongst companies and organizations,
no one size or structure is
correct all the time. There are scenarios and
organizations where one at a time is the best that
you can do. Perhaps you have teams that are geographically
distributed. One model of
pass the baton is follow the sun.
Team A works for 8 hours, and then they hand it off to team
B, and then they hand it off to team C. These are not functional teams,
but they're time zone teams. And again,
handoffs. There's a lot of literature and analysis around
effective handoffs and ineffective handoffs that I don't
have time to go into. But this is one aspect
of a pass the baton. You can pass it around the world.
All hands on deck can be the
right approach in particular scenarios.
And so it's important that you don't hear
any of this as being down on a particular
model. Understanding the strengths and the weaknesses of each
model, and then adapting them to your scenario, is
the best way to do it.
You'll recall at the beginning that I said I was going to jump into the
middle of the story. There's much more to explore regarding
interteam dynamics as well as the organizational context
in which those dynamics happen, and I'd encourage anyone who's
interested in this topic to dive into the literature of resilience engineering,
adaptive capacity, and Safety-II.
I have some resources for further reading on
the next slide. As a starting point, I'd like you to
consider the question: why is this thing making
noise, and is it actionable?
So here are some suggestions for further reading, and I'd invite
you to reach out to me either through the conference discord or
through email with any questions that you have. Thank you
very much for considering these varieties of
incident response.