Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hi, I'm Emily Arnott. And I'm Nick Mason.
And we're here to present Too Many People in the Room,
the incident room, that is. So here we're going to take a
look at an often overlooked challenge when
dealing with incidents, which is that you can actually overcrowd
your incident response. And this might seem a little counterintuitive.
So we're going to start off by talking about what we really mean when we
say too many people. Then we're going to take a look at this instinct to
kind of get all hands on deck. Why is it that you call in as
many people as possible? Why do incidents get crowded?
Then we're going to take a look at how to achieve those goals
while still preventing overcrowding. We're going to look at how to manage
that urge during an incident for people to rush in and try to help.
And then we're going to look at how to cement this in
with a cultural foundation to make this not just kind of an arbitrary
policy change, but to have a culture around incidents
where they don't end up getting overcrowded.
So you might ask, and I think this is a pretty reasonable
question, isn't having more people better? You might
think when something goes wrong, it's pretty natural to just want
as many people as possible there to help you out. And we don't want
to discourage that type of thinking because it's true that people tend
to add value when they contribute to a problem.
There's tremendous value in bringing in a diversity of people that
might have fresh ideas, that might have a different perspective.
But if you just end up with too many people, if you kind
of go too far with this thinking, it can lead to
a lot of problems. So let's take a look at the problems
here. As you can see, there's quite a bit that can
get confusing. You might start to wonder,
what is each person doing? Communication kind of gets poor when
there's so many people. People might do redundant work. There might
be excessive communication, which you might
wonder what that looks like. But if you're scrolling through hours and hours worth
of slack messages, you'll know what excessive communication is.
Time could be wasted as people show up and just sort of stand
around in the incident room when they could be working on something else.
There could be extra stress from all these people overcrowding in. You're not
sure how to reassure all of them, and they
might not be sure how they're actually going to contribute. There could
be some anxiety about what can I do to help? There doesn't
seem to be anything to do to help. Tons and tons of stuff
just piling up when you have this overcrowded incident.
Yeah, and I think that's definitely something that we think about when we're trying
to handle an incident. The problem is that it's already
a stressful situation, and then there are all these other components that you just discussed,
Emily, that are just piling on top of it. Way too much stress.
We want to make sure that we can focus our energy on solving the problem.
And one quote that really resonated with me, as a solutions engineer
who talks with various prospects and customers on a daily basis,
came from an SRE manager, who noted
that it takes 10 to 15 minutes for someone on their team
just to get caught up to speed with an incident, just the high
level information just to get started. And this seems to be a
recurring theme as I've talked to several different leaders
within our space, and that's just way too much time.
And what this ultimately boils down to is on the next
slide. You can't help if you don't know what's happening, right?
You can't just throw people at the problem. As tempting as it is,
they need to know what's happening and how can they effectively contribute.
So teams that are trying to solve this problem not having this
information, as well as a lack of communication both internally and
externally, are just a few of the reasons why organizations typically
bring too many people to the incident room. And what this
basically boils down to is a lack of classification
when it comes to incident types. And what I mean by an incident type is
what type of problem are we actually fighting right now?
And it also boils down to not having a defined escalation policy,
or not having a cultural mindset that tasks
are going to be completed even if you're not constantly checking up on the team.
We've all been there before in the past.
It's perfectly okay to have those thoughts. Other problems
I've personally seen from prospects include not having the right
visibility into the impact of the incident based on your
monitoring tools, or maybe not knowing who the subject matter experts to
help you solve this problem are. There are plenty of tools out there that give
your team visibility into who's on call for a
particular service or who that subject matter expert is. But oftentimes
these tools are not naturally integrated into the communication tool
where you may be going back and forth with the team trying to solve the
incident. And this results in excessive context
switching and as I mentioned before, stressful situation.
You don't want to add more stress, you want to alleviate that stress.
And if you're nodding along as I'm talking here, I see Emily
nodding too. These are all natural
problems that we encounter on a daily basis. You're not alone,
but there are some ways that we can try to improve
on these practices by not overcrowding the incident room.
And personally, for me, what it all kind of starts with is
really putting a definition around your incident types and their corresponding severity
levels. So what is an incident
type? You may be asking me, Nick. I have no idea what that means.
So one thing that I've seen teams be successful with in
defining what an incident type is, is making it use case based. So is
this customer impacting? Is it a security incident? Maybe it's a
planned software outage, just to name a few.
Right. I've also seen several teams be successful in defining
incidents by the team that's
going to be involved in solving that problem, or maybe by a particular service that's
being impacted. And as you start to go ahead and define these
different incident types and their corresponding severity levels,
you're able to do a couple of different things. Right. First and foremost, you're able
to bring the right people to help you solve this problem, as well as put
together a framework in place as to what
should the team actually be doing. More to come on that in a few slides.
But another important distinction here is that depending on this
incident type and severity that you've captured at the start of the incident,
this is going to help you bring in different groups or teams to help you
solve that particular type of problem. And the
last note I'd like to add on this slide here is that when
defining an incident type, the severity level has just
as much impact as the type of problem. Right? So for example,
a severity three customer incident versus a severity
zero. Severity three being not as prioritized,
versus sev zero being all hands on deck. Maybe for a sev three
I only need to bring in one person to help me solve the problem.
But a sev zero, I need to bring all hands on deck because this is
impacting everyone. Making sure you have those distinctions is
just as important as defining the problem itself.
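The mapping Nick describes, from incident type and severity to the smallest group that should be paged, can be sketched roughly like this. All the type names, severity numbers, and role groupings here are hypothetical illustrations, not a prescribed taxonomy:

```python
# Illustrative mapping from (incident type, severity) to responders.
# The names and groupings are assumptions for the sketch.
RESPONDERS = {
    ("customer impacting", 3): ["on-call engineer"],
    ("customer impacting", 0): ["on-call engineer", "incident commander",
                                "communications lead", "technical lead"],
    ("security", 1): ["on-call engineer", "security lead"],
}

def who_to_page(incident_type, severity):
    # Default to the smallest possible group rather than all hands on deck.
    return RESPONDERS.get((incident_type, severity), ["on-call engineer"])
```

The point of the default branch is the talk's thesis in miniature: an unclassified incident starts with one responder, and the roster only grows when the type and severity say it should.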
So one thing I'd like to add to that is when you think about the
impact, often where companies get tripped up is they don't think about
it in terms of customer impact. One thing I
like to always say is, if an incident happens in the middle of the woods
and nobody hears it crash, does it really make a sound? So don't just think
in terms of how much of your system is affected, but in
terms of how crucial that is to your customers.
So I think a common situation is, oh, we have an outage that
requires the most extreme reaction we can possibly have.
But if it's an outage of a service that's actually relatively unpopular,
this could actually be a lower severity than just
a mild slowdown on part of your service that absolutely everybody uses.
So having that more nuanced understanding of how incidents
actually impact your customers and how it impacts your bottom line can
help you respond appropriately and not overreact and
overcrowd. Yeah, that's a fantastic point. Thank you so
much for adding that. And that's why these incident type and
severity definitions are so important when
you're trying to tackle an incident. Kind of the three words that
I can put when describing how to get started with this
is, before you start an incident, stop,
think and assess. Right, what type of problem are
you encountering? Who is it impacting? Just like Emily said.
And how fast does an escalation actually need to happen?
Because all of these are going to be contributing factors in getting the right people
to join who are going to be efficient.
And even though you may have this great communication
that we're discussing here about defining your incidents and
corresponding severity levels, you also need
to know how to properly escalate. If you and the team eventually hit
a roadblock, it's inevitable, to say the least, that at
some point you may not have the answer. And that's perfectly okay, right?
That's why we have a team that's here to help you. But instead of
just having that straight line of escalation where you bring in more and
more different types of people to help you solve this problem,
or you bring in more senior people and they get alerted
that something is on fire, something that we recommend
is that you try to build a more diverse network
of people that you feel comfortable reaching out to. And the operative word there
is comfortable because, for example, if you
need to go reach out to your VP and
you feel a little hesitant reaching out to them because you're like,
oh, they're going to think I'm not doing my job.
Having that kind of first layer of people who, you know, can help
you solve the problem before you escalate to that individual can definitely help
kind of ease that burden of saying, hey, I'm unable to finish this,
can you please step in and help me? And just as importantly,
we want to make sure that we reach out to people who have the bandwidth
to actually help you. So I've seen several different organizations
that I've worked with in the past where they'll go and
they'll have some sort of escalation policy where
they'll reach out to someone. Let's just say I reach out to Emily. I'm like,
Emily, I really need your help. Can you come in here? And then I check
on your calendar, and you're completely
swamped the entire day. There's no way you can help me.
So what did I do? I added another person to the incident room, and then
I have to go find another person who can actually help me solve this problem,
who has the time to help me solve this problem.
So we're just piling on to the problem itself.
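Nick's point about checking a contact's bandwidth before pulling them in, rather than walking a fixed line of escalation, could be sketched like this. The contact names and the calendar check are illustrative assumptions, not a real integration:

```python
def next_escalation(contacts, is_available):
    """contacts: ordered list of people, starting with incident buddies
    and ending with senior leadership. is_available: callable that
    checks a person's calendar for bandwidth (stubbed out here).
    Returns the first contact who actually has time to help, or None."""
    for person in contacts:
        if is_available(person):
            return person
    return None  # nobody free: fall back to the formal on-call rotation

# Hypothetical usage: the buddy is swamped, so we skip straight to a
# teammate instead of adding a bystander to the incident room.
busy_calendar = {"buddy": False, "teammate": True, "vp": True}
helper = next_escalation(["buddy", "teammate", "vp"],
                         lambda p: busy_calendar[p])  # -> "teammate"
```

The ordering of `contacts` encodes the "diverse network" idea: the comfortable first layer comes before senior leadership, so the VP is only reached when everyone earlier genuinely cannot help.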
So we recently had a webinar that discussed incident command
with a couple of people from Blameless and two guests from other companies.
And one of them described a policy where they had
incident buddies, kind of like
always the first person you can reach out to just for a sanity check,
to help give you the confidence to escalate further. Because it
is a really human thing. We can't overlook the human qualities of being
nervous to contact people and being uncertain about escalating.
And that's why often we end up with these really strict escalation
policies that are very linear, very hierarchical,
but in the end, maybe aren't that effective. So instead,
we encourage you to kind of lean into this human aspect
and really think about who do I work well with? Who do I
know that actually really knows this subject matter? Who can
I count on to help explain this problem to me and walk me through it?
And then you can have this kind of more personal, adaptive, on call relationship
that isn't just swarming the incident with as many senior
people as you can get. Yeah, that's a fantastic
point. And something that resonates with me personally is that whole idea of the
sanity check, right? Especially as you're in there fighting a particular
problem and you've tried something three times and it's
not working, and you're like, I know this is supposed to be working.
Just getting that extra layer of eyes on it, someone that you feel comfortable
having take a look at it and not kind of roasting you
for doing the wrong thing, is super important, right?
Because that's going to move you towards a resolution at a much quicker pace
than if you kind of keep it to yourself and continue to try something over
and over. And what this really kind of boils
down to is get just the right people who can contribute
right away and escalate strategically only if you have to,
but if you have to, having some of these ideas that we talked about today
will definitely help in that process. And as
you're bringing in more people into the incident room, the slack channel,
the teams channel, wherever you're managing incidents today,
it's just as important for these people to know why they're being
called into the incident room as it is for them to be involved in
the incident. So defining these incident roles and
tasks can help with this. Right. So what are
some examples of different incident roles that I've seen be successful?
The incident commander, as an example, as Emily mentioned beforehand,
that's usually the person that's in charge of facilitating and moving the
incident forward towards a resolution, kind of making sure
everything's on track. Communications lead is another big one,
because in most organizations, communication is
usually a broken process, which is one of the reasons why we're having this conversation here
today. And that role is typically in charge of facilitating both
internal and external forms of communication to different stakeholders.
And the last one I'd like to note, too, is some kind of
technical or engineering lead. So that's typically the person that's
there to help you try and solve the problem from a technical standpoint.
So a combination of those three roles is very powerful
to make sure that there are particular lanes of focus in order
to help you try and drive towards resolution of the incident quicker.
And I think this really alleviates one of the big fears that we talked about
earlier, is that people start to get paranoid that maybe something won't
get covered, that some part of the response process will fall
through the cracks. Maybe someone won't get informed,
maybe some due diligence around recording things won't get done.
So if you have these roles and these checklists already defined,
people will be confident. Oh, I know there's enough people in there to
handle communications and to handle recording and
to handle leading up the technical front,
and they won't have that anxiety that might make them want to jump in or
call in a bunch of extra people. Yeah, that's a
super important note to mention there,
and thank you so much for bringing that up.
Oftentimes, people come into the incident room because they feel like
they need to get an answer quicker. But one of the kind of
highlight moments that I'm going to mention here is that sometimes you just
have to trust the process. So establishing these different roles and tasks,
and making sure that someone that's
qualified is moving the incident forward
based on those roles and the tasks that they're provided,
is super important. And to me, it really boils
down to three key aspects. Roles
and tasks help with accountability, right? So if
sales needed to get an update for this particular incident type and severity,
the communications lead is held accountable to make sure that they're getting
that update. Consistency in the incident process is something
that's just as important, right? As you're building
out your kind of initiative to drive towards
SRE as a whole, you need to make sure that you're following
the process so you can use it as an opportunity to learn. If you
don't follow the process the way it's been built today, you're not going to be
able to make those gradual changes that will make your incident
management process more efficient. And then lastly,
roles and tasks help with communication. And as I
mentioned before, that's one of the reasons why we're here today. As the incident
evolves, communicate, period.
Right? You want to make sure that you get the word out to as many
stakeholders as you can, or more specifically,
those who need to be notified. Right? We recommend trying to
get a system in place to automatically deliver these updates to relevant
stakeholders, customers, management, et cetera.
But specifically, some people need to know things
immediately, and other people only need to know
what's happening at a high level. Right? So make
sure that you differentiate these different groups and respond to them
accordingly. But at the same time, beyond having these
automated communications in place, you also need to have some sort of method
to send out ad hoc forms of communication through the
communication tool that you're using, or to be able to send
out emails or text messages or status page updates
kind of on the fly. So the combination of those two is
very important. What is one of the kind of main
areas that you've seen, Emily, in terms of communication that can kind of
be improved upon based on your experience? Well, I think something that
happens very often is that higher rungs of management will get anxious
during incidents. They'll start to wonder, how are things progressing?
Have they considered this? What messaging can I take
to our other stakeholders, to our board, to our customers?
And you start getting these layers of middle management jumping into the incident without
really being able to contribute a whole lot and sometimes putting a lot of
pressure on the engineers who are trying to focus to give them these answers.
So I think proactiveness is really, really key here.
And like you said, having automatic systems set up to deliver
messages when status changes or when different things progress is really
great because then nobody has to kind of lift their head up off the
desk. They can stay focused in on the applications
they need to be in and not have to worry about sending off an email
or whatever. I think really communication
should be something that alleviates these sorts of concerns,
that gives them confidence in the process and the system so that they can
focus on what they need to do, knowing that the incident is being taken care
of. So it's important to kind of walk the line
between personally addressing whatever
might be concerning these other stakeholders and
also having something really fluid and automatic that
doesn't take people out of their tasks. So it's
a difficult process to pin down. But with good tooling and with
having somebody focused on this sort of communication aspect,
you can really alleviate a lot of the burden and a lot of the overcrowding.
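The routing both speakers describe, every update delivered immediately to some groups and only milestone updates to broader stakeholders, might look roughly like this. The group names and milestone statuses are assumptions for illustration:

```python
# Illustrative stakeholder routing. Group and status names are assumed.
IMMEDIATE = {"responders", "management"}   # need every status change now
HIGH_LEVEL = {"sales", "customers"}        # only need major outcomes
MILESTONES = {"identified", "mitigated", "resolved"}

def recipients(status):
    # Route every status change to the immediate group, and widen out
    # to high-level stakeholders only at major milestones.
    groups = set(IMMEDIATE)
    if status in MILESTONES:
        groups |= HIGH_LEVEL
    return groups
```

Because the routing is a pure function of the status, an automated system can fire these updates on every change, which is exactly what lets responders keep their heads down instead of fielding check-ins from anxious stakeholders.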
Yeah, 100%. And if any of you out there
are listening to this and you're thinking, oh, this sounds like a big
change, or we don't have any of this in place today,
like I mentioned earlier, you're not alone out there.
Right? This is all natural to be feeling, and it's a gradual change. And that
kind of leads us into our final couple of slides here. Of
everything that we've talked about today, what will you need to remember in order
to start? Communication doesn't need to be limited to that particular role.
Right, the communications lead, whose job is to send information
about the incident to those key stakeholders. But as
someone that's involved in solving the problem at hand, don't be afraid
to mark something as important when communicating. The process shouldn't
be rigid, but it should be a foundation that you can work on top
of. For example, being able to mark something that
is a key finding within Slack or Microsoft Teams as important
as you're in the incident room troubleshooting.
I see organizations gain benefit from this on a daily basis.
Adding those key pieces of conversation to your incident timeline so
you can go back and take a look at what was important for helping you
solve that problem is just as important. The worst thing
you could do there is not mark it, and then it
may go unnoticed. Right? Or maybe that was the
solution to the problem. Mark it down as important and let
those key contributors with incident roles handle their tasks.
So a great example of this: collecting relevant comments
from the Slack channel to document in the incident timeline is
something the commander would do. Hey, I'm scrolling through.
I'm seeing all this kind of chat back and forth,
is this relevant? Has someone marked it as relevant? Being able to efficiently,
for example, toggle within Slack, yes or no,
is this important or not, is something that a commander
would typically be in charge of doing. Having some
of these elements that we discussed today incorporated into your incident management
process will help you cut through the noise and drive towards resolution at
a much quicker pace. This is cultural change, and it's
a culture of communication that we're trying to build. So make sure you're trusting that process.
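Collecting only the messages a commander toggled as important into an incident timeline could be sketched like this. This is an illustrative data shape, not a real Slack or Microsoft Teams API:

```python
def build_timeline(messages):
    """messages: list of (timestamp, text, important) tuples collected
    from the incident channel, where 'important' reflects the
    commander's yes/no toggle. Keep only what was marked important,
    in chronological order, for the retrospective."""
    return [(ts, text) for ts, text, important in sorted(messages)
            if important]
```

A timeline built this way is also what makes "trust the process" credible later in the talk: people can stay out of the room knowing the key findings will be written down for them to read afterwards.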
The retrospective or the post incident analysis is a great
way of driving that cultural change. So being able
to ask your team after the incident has been completed,
everybody's tired, they're ready to go home for the day. But making
sure that you still get the answers to some key questions, like,
did you have the right team automatically added to the incident
channel to help you troubleshoot that problem? Or did you feel like
you had the resources you needed in order to start addressing that
issue? Or were the
right individuals or channels automatically sent communication
when the incident was created? These are just a couple questions that
I've seen kind of help in that cultural adoption. Emily, are there
any particular notes that you have when it comes to kind of
like this cultural foundation?
So one thing I want to point out is when we're talking about this idea
of learning from incidents and the retrospectives and such,
I think it really is often a situation where
people will jump into the incidents room because they want to know what's going
on. And like you said, it's another instance where they have to be able
to trust the process. If they think, oh,
I need to be there because otherwise I'll have no clue what ended up happening,
no clue what the resolution was, and I could run into the same problem later,
that's not the best reason to actually be involved in an incident and be in
the response room. Instead, they should trust that, oh, someone is
recording what's important. There will be a document that's made.
I will be able to go back and learn what I
need to from this. Like you say, yeah, it's all about
trusting the process, and process isn't something that gets built overnight.
You really do need a cultural foundation that's
built around that trust that gives people confidence,
oh, if we keep doing this, the process will get better and better.
And I think a major thing is this idea of kind of psychological
safety that you're not going to get
punished for screwing up the process,
whether that be, oops, I accidentally invited
the entire development team for this project because I was freaking out.
You know what? That's okay. It's probably going to cause some problems in the incident,
but it's not going to be the end of the world. It's a learning experience.
And similarly, if you see an incident happening and you see
that there's some people working on it, to have the psychological safety to
say, you know what? I wasn't called on this.
I wasn't alerted. I'm maybe a little concerned,
I'm a little curious, but I'll trust the system and I can stay
out. And to know that you're never going to get reprimanded
and be told, hey, why didn't you join? Why didn't you try to help?
So just this culture of feeling safe, because you're
trusting in the process and you're trusting in a culture where you're
not going to get blamed, you're not going to be found at fault, but where
all these errors will just be learning opportunities to improve the process.
It's okay to fail at this stuff. It's okay to have a
system that doesn't work right away. It's all about iterating and learning
and understanding how you can get better in the future.
Yeah, 100%. That's super powerful.
And there were kind of two key takeaways that
really resonated with me. The first is that,
as I mentioned before, the worst thing you could do is not
suggest something. Right? When you're popping
in, or you've been assigned a role and you have an idea,
you may not know that it's the answer, but it could be the answer.
Jotting that down and having the psychological safety to do so
is step one. Because if you don't feel comfortable putting that information
down, you may never uncover
what the solution actually was. So that's super
important. And I think the other key piece
that really resonated with what you said was that
you want your employees to feel heard, right?
So the retrospective or the post incident analysis
is a great way for your employees to feel heard after they
spent that energy fighting the problem.
If there's any changes that need to be made in your incident management process
or these communication workflows that you've set up, the team that was battling
the incident will be the first line of information to streamline your incident management
process as a whole. And that's where that iterative process really
kicks in and making those gradual changes over time.
So that process makes everyone feel more included,
heard, and it's moving towards a resolution at a
quicker pace. Um,
that's all I got here. Emily, is there anything else you'd like to add?
I'd like to just thank everybody for coming today and listening to our talk.
Yeah. Thank you so much for listening. We hope that we've convinced
you that sometimes a lean, focused team can beat out a
large all hands on deck scenario. And we hope
we gave you some tips on how to move towards that.
Awesome. All right. Thank you all so much. Thank you so much. Have a great
day.