Transcript
This transcript was autogenerated. To make changes, submit a PR.
Are you an SRE, a developer, or a quality engineer who wants to tackle the challenge of improving reliability in your DevOps? You can enable your DevOps for reliability with ChaosNative. Create your free account at ChaosNative Litmus Cloud.
Hi everyone, I hope you're having a great conference. My name is Quintessence and I'm here to talk to you about effective incident response.
Jumping right in. Before we talk about incident response, let's standardize on the definition of an incident. An incident is any unplanned disruption or event that requires immediate attention or action. This encapsulates different types of incidents: they can be internal facing, security breaches, data loss, et cetera. Your definition may differ from ours, and this is how we define incidents at PagerDuty, but as long as it's standardized in your organization, that's fine.
The main goal of incident response is to replace chaos
with calm, because during an incident, people might be talking all
over each other or not knowing who to reach out to, basically trying
to get that incident resolved, and then there's a bunch of chaos to get that
accomplished. And that chaos is not good
for an incident. It's going to increase your frustration,
confusion, and it's going to make it slower for you to get to
that resolution. The goal of incident response,
therefore, is not only about resolving that incident, but doing it in
an organized way. So you're trying to reduce how
much confusion there is around the response process itself and put as much
energy as possible into the actual problem.
Saying it another way. The goal of incident response is to handle
the situation in a way that limits damage and reduces
recovery time and costs. Because even with disorganized
response, you'll likely still find a solution, but it'll take longer.
The incident might get more severe over time. There might be more data loss
if that's the situation, or there might be a longer outage if that's the situation,
and there's just more damage overall, more difficulty as the incident
continues to escalate. Right? So in order to accomplish the
goal of effective incident response, or calm incident response, you want to make sure you're mobilizing the correct people. That means the right responders, knowing who to page for a specific service or whatever
the situation is. They need to have that right skill set, and they need to
have enough autonomy to take action on whatever's happening.
And you also need to make sure that you're learning and improving from each situation
so you can build off of your mistakes and avoid them in the future.
Automation is also very important because if you can build automation
into your response, it's possible to start triggering the automated response
before a visible, impacting incident occurs.
Rolling back a little bit: when we're talking about incident response, we're talking a little bit about what's known as the Incident Command System. And who invented this? Certainly not us, right? Actually, it was based on what's called ICS, which was developed as part of wildfire response in southern California. The problem they encountered was that thousands of firefighters needed to respond to massive wildfires. And while each of them individually knew how to handle that type of fire, they did not know how to handle it at scale, across the land but also between each other, so they could not effectively coordinate over the scope of the entire situation. After those fires happened, an agency called FIRESCOPE was formed in order to develop a model for large scale response to natural disasters. And one of those models was ICS, which is what we're talking about right now.
And this became more broadly used for any major incidents.
And in 2004, the National Incident Management System, or NIMS,
was established by FEMA and is now the standard for all emergency
management by public agencies in the United States.
So this means that ICS is used by everyone for every
response from an individual house fire to a natural disaster. And that standardization
means that everyone familiar with the process, with the response
procedure, is actually able to handle it and put more mental energy on that problem, on the fire or on the flooding
or whatever the situation is, rather than scrambling to try and define a process
on the fly while this natural disaster is going on.
This might not feel directly relevant to IT incidents, but the principle is the same. Recall how we defined incidents as an unplanned disruption that requires immediate action or attention. The idea is that this definition and the response pattern are as similar across teams as possible, so when they need to coordinate, they can. This is very similar to what was happening with ICS in the fire response.
Building off of that a little further: when defining incidents, it's important to codify what constitutes a major incident. Similar to an incident, it needs to be defined in a way that can be used across teams in your organization. You should make sure that your definitions are short and simple and prevent the stress of "is this an incident or isn't it? And if it is, is it a major one?" You want to keep all of that in mind when you're writing these definitions. Here at PagerDuty, we define a major incident as any incident that requires a coordinated response across multiple teams. Depending on your needs or your organization, you might want to have a metric or two that you tie to that in order to differentiate a major incident from a so-called regular incident, again, as long as it doesn't become too granular. An example of this might be: it becomes a major incident when more than X percent of customers are impacted. Again, this is all because you can't respond to that incident until you know what it is. So if a person or team isn't in alignment with the organization at large, either for an incident or a major incident, you're going to have confusion and inconsistency in that response process.
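To make that kind of codified definition concrete, here's a minimal sketch in Python. The ten percent threshold and the metric names are hypothetical examples for illustration, not PagerDuty's actual values.

```python
# Hypothetical codified definition: the metric and the 10% threshold
# are illustrative placeholders, not an official recommendation.
MAJOR_INCIDENT_CUSTOMER_IMPACT_PCT = 10.0

def is_major_incident(percent_customers_impacted: float, teams_needed: int) -> bool:
    """A major incident requires a coordinated cross-team response,
    or impacts more than the agreed percentage of customers."""
    return (teams_needed > 1
            or percent_customers_impacted >= MAJOR_INCIDENT_CUSTOMER_IMPACT_PCT)

# e.g. is_major_incident(percent_customers_impacted=12.5, teams_needed=1) -> True
```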
Four things that major incidents have in common. Timing is a surprise, so there's little to no warning. The time of the response matters: you need to respond quickly, because with services degraded or down, or data loss or breaches, et cetera, the more time that's going on, the more severe it is, and possibly the more money that's being lost. The situation is rarely completely understood at the beginning. You don't normally know why there's an outage. You don't necessarily know, for example, to run a simple setup script or a reset script that might be hanging out somewhere that'll just kick off a process or reboot a cluster, right? You don't know why this situation is happening. It also requires mobilizing a team of the correct responders to resolve, and that is the cross-functional coordination.
coordination. So typically you can determine
the severity based on how drastically your metrics are affected. So for example,
as the traffic to your website drops, severity can increase. It might start but
off as a Sev five, that might look a little bit. Let's pick on Amazon
for a second. Might look like you can add something to a cart.
You just can't see the image of it. So you can see the item,
you can see the cart, you can even check out of the cart. You just
can't see the image of it. So that's a type of outage. But people
are still able to accomplish what they're trying to accomplish. If it becomes more severe,
maybe between a four or maybe up to a three.
Now, there might be interference with the checkout process or you can't
link to the details. Even though you can see the
line items, you can't actually see what you've added to compensate for not seeing
the picture, et cetera. So as you get more severe, now you're at that sev
three and maybe responders have been pulled in up
to this point. Once we pass this boundary here from
sev three to sev two, that is what we codify as
a major incident. So severity two and severity one,
in some cases, some people use like sev zero or p
zero, any convention is fine. Those very severe
incidents are what are called major incidents.
That is when you would fire off what's called the major incident response process.
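As a rough sketch of that severity ladder, assuming severity is driven by a single business metric like the drop in site traffic, the boundary might be codified like this; the thresholds and messages are made up for illustration.

```python
def severity_from_traffic_drop(percent_drop: float) -> int:
    """Map a drop in a business metric (e.g. site traffic) to a
    severity level. Thresholds here are illustrative only."""
    if percent_drop >= 75:
        return 1
    if percent_drop >= 50:
        return 2
    if percent_drop >= 25:
        return 3
    if percent_drop >= 10:
        return 4
    return 5

def respond(percent_drop: float) -> None:
    sev = severity_from_traffic_drop(percent_drop)
    if sev <= 2:
        # Sev one and sev two are what we codify as major incidents:
        # this is where the major incident response process would fire.
        print(f"SEV-{sev}: triggering the major incident response process")
    else:
        print(f"SEV-{sev}: routing to the owning team's on-call")

respond(60)   # -> SEV-2: triggering the major incident response process
```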
Anyone needs to be able to trigger an incident at any time.
And this is because even well designed monitors can miss things. And you want
to make sure that people are empowered to trigger incidents, rather than someone knowing something's
wrong and they're just sitting there worrying about retribution, about what would
happen if they're wrong or other consequences, rather than just
setting off the incident, triggering it, having people pulled in
who can actually look at it, see if it's a problem, or see if they
can just close it, right? It's less risky to open the incident and close it than to not open it, or to wait for something else to catch it, if it's even catchable. In order for anyone to be able to trigger an incident, you need to set up your tooling to make this possible. So, for example, if you integrate with Slack, anyone in Slack can trigger the incident, and that doesn't require them, for example, to have a specific account for a specific tool. If you have PagerDuty in particular, it will look a lot like this: someone just sends off an IC page command and it pages the relevant incident commanders.
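Under the hood, a trigger like that ends up creating an incident against the right service. Purely as an illustration (this is not PagerDuty's Slack integration itself), a chat command handler could forward the page through the PagerDuty Events API v2 roughly like this; the routing key and summary are placeholders.

```python
import requests  # third-party HTTP client

# Placeholder: in practice this routing key comes from the Events API v2
# integration on the service whose on-call you want to page.
ROUTING_KEY = "YOUR_EVENTS_V2_ROUTING_KEY"

def page_incident_commander(summary: str, source: str = "slack") -> None:
    """Rough sketch: trigger a PagerDuty incident when someone runs
    an 'ic page'-style chat command."""
    event = {
        "routing_key": ROUTING_KEY,
        "event_action": "trigger",
        "payload": {"summary": summary, "source": source, "severity": "critical"},
    }
    resp = requests.post("https://events.pagerduty.com/v2/enqueue",
                         json=event, timeout=10)
    resp.raise_for_status()

# e.g. page_incident_commander("Checkout error rate spiking, need an IC")
```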
Once an incident is triggered, we need to switch our mode of thinking. This mentality
shift is the differentiation between normal operations and something's wrong.
We need to switch this from what's called peacetime to wartime.
It's from day to day operations to defending something that's wrong.
Something that would be considered completely unacceptable during normal operations,
like deploying code with minimal or no testing, might become more acceptable during a major incident, again, depending on a lot of factors that I'm not going to get into right here. The way you operate, your role hierarchy, and the level of risk is what I'm talking about. All these things will shift when you go from so-called peacetime to wartime. Peacetime to wartime is something that got inherited from the fire response, because they are a paramilitary organization, so they might use that kind of terminology. We don't need to do that. We can do normal or emergency, okay versus not okay, really. It doesn't matter how you term them, so long as it's clear and immediate and people don't have to figure out which side they're on. You just want to make sure that you're using a term that the organization is okay with. And part of what we want to do for the differentiation is tying in those metrics for business impact. Metrics can be very useful and often
work best when they're tied to that business impact rather than something
very granular. So, for example, one of the metrics we monitor here at PagerDuty is the number of outbound notifications per second. That might make sense, right? Amazon might, you know, monitor orders, Netflix streams, things that are relevant to their business. You want to make sure that you're tracking whether something visible and impacting is happening or not, and how severely, because, again, that goes back to: is this an incident or not, or is this a major incident or a regular incident?
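To give a feel for what watching a business-impact metric can look like, here's a minimal sketch that compares the current rate against its own recent baseline and flags a possible visible impact on a sharp drop; the metric, window size, and 40 percent threshold are all assumptions for illustration.

```python
from collections import deque

class BusinessMetricMonitor:
    """Watch a business-impact metric (say, outbound notifications per
    second) and flag a potential incident when it drops sharply
    against its recent baseline. All thresholds are illustrative."""

    def __init__(self, window: int = 60, drop_threshold: float = 0.4):
        self.samples = deque(maxlen=window)   # recent per-second rates
        self.drop_threshold = drop_threshold  # 40% drop vs. baseline

    def record(self, rate_per_second: float) -> bool:
        baseline = (sum(self.samples) / len(self.samples)
                    if self.samples else rate_per_second)
        self.samples.append(rate_per_second)
        # A large drop against the recent baseline suggests something
        # visible and impacting may be happening.
        return baseline > 0 and rate_per_second < baseline * (1 - self.drop_threshold)

monitor = BusinessMetricMonitor()
for rate in [100, 102, 98, 101, 55]:
    if monitor.record(rate):
        print(f"Potential incident: rate dropped to {rate}/s")
```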
The mindset shift to emergency also causes the other shifts to
happen as well. But something that can happen is what's called decision paralysis.
Responding teams do need to have the ability to make decisions on their own and
try and reduce impact and resolve as quickly as possible. Right? Again,
going back to not wanting the incident to escalate over time. But it's
also important for all responders to avoid decision paralysis.
This is what happens when you spend so much time debating two, maybe more,
options that are just similar enough or not quite different enough,
and you can't quite choose one. And all the time, the incident
is still ongoing, possibly escalating. Making a wrong decision
in this case is better than making no decision. If one was clearly
better than the other, you would know by now. So a wrong decision is at
least going to give you some information to work with, whereas no decision gets
you nowhere. Now, let's talk a little
bit about the roles in the incident response process and
the incident categorization.
So the lifecycle of an incident, or its steps, are: triage, mobilize, resolve, and prevent. Triage is what happens when you see the incident, whether it's triggered via Slack or a page, or whether it's caught by one of your monitors. It's what happens when
you first see it, and you need to figure out how severe it is.
It's when you're going to differentiate between, am I responding to this? Is this going
to be a cross functional response? Right. And once you get
to that decision process, that's when you start to mobilize people. You want to make
sure the right subject matter experts are there. And then the resolution phase,
that's the one everyone likes to focus on, right? Everyone's cranking away on whatever's
wrong and trying to get everything working again. And then the last step is the
prevention stage. Usually this incorporates, like notes, post mortems,
all that good stuff so we can learn and move forward. In order for
these steps to happen relatively smoothly, especially in a cross functional
response, you want to make sure you have some roles defined
and this slide is going to get a little heavy.
Okay, so let's start with the incident commander. The incident
commander is basically running the incident. Think of them as like a captain
of a ship or something like this. They're guiding the ship of the
incident response process. They are the highest ranked person on the
incident response call. We actually do recommend that in the context
of the incident they might outrank the CEO. Don't catch anyone by surprise with this, though; make sure you get buy-in before you make that decision. But the idea is that they're the single source of truth during an incident and that they're able to respond and make decisions in a responsible and effective way. Next up we have the scribe. The scribe is taking
notes. Depending on your tooling, you might have some timelining built in
for what decisions are being made or what actions are being taken. We also need
the scribe to be writing down relevant context. You don't want it to be so long that they're writing a novel that no one's going to read, but you do want to write down things like "rebooted cluster": simple, quick sentences, so that when you're looking back through the incident you can see why a decision was made and what the impact of that decision was, and keep reading through.
The deputy is the support role for the incident commander. They're not a shadow,
so they're not an observational role. They're expected to perform tasks during
the incident. So depending on the size of the company and the incident, the deputy
might help with other tasks like liaisons, which I'll get to in a minute,
or the scribe or something like that.
The internal and external liaison, or internal and customer liaison, are the communications roles. So during an incident, it's not uncommon for people within the company, or, if it's customer facing, outside of the company, to want to know what's happening. Right? Think AWS status, Twitter, right? A region goes down, you check their status page, you check their Twitter. You want to make sure that someone is communicating in whatever mechanisms you're using, be it social media or status pages or whatever, and you want to make sure it's a defined role, or else everyone in the layer underneath, all the subject matter experts, are focusing on fixing the issue and not keeping time to make sure that they've sent the last update out in a timely fashion. So liaisons do this.
You might have one or both. Again, it depends on what's
happening with the incident. And then there are all the subject matter experts. These are the people that people normally associate with incident response. They are the people with the relevant expertise for whatever the latency, outage, breach, et cetera, is. These are all the response-level roles.
So setting this up at scale, when you have a department wide incident,
you want to make sure that you have an on-call schedule, a primary and a backup. The idea here is that if that page goes out and someone doesn't respond within whatever SLA is defined for that, it needs to escalate somewhere, or else it's just going to keep paging and going nowhere. So you need that primary and that backup. One of the things that we do when we have a primary and a secondary on call is we inherit from the previous week. So the secondary was last week's primary, rotating into this week's secondary, so that they have context for any past incidents that were relatively recent. You also want to make sure that you have a primary and backup for the subject matter experts. Same reason, actually, and the same rotation, primary into secondary.
And then if there are any other roles, scribe, et cetera, as you're defining them, you want to make sure that they're defined. Again, when you send out that page, if someone's sending out a page, let's say I'm responding to what seems to be a database incident and I am doing my DBA thing, and then I realize it's exceeding the scope of my area of expertise. I don't want to look up who's front end, who's the SRE, who's this, who's that, and find all these people individually. I want to be able to send out a correct page to the right teams. All these teams need to have those on-call rotations, or you need to have a method of paging them that's consistent, a single source of truth, and relatively easy, so you don't lose time in the incident.
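As a sketch of what that consistent, single source of truth could look like in miniature, here's a tiny escalation structure keyed by team, with a primary and a backup per rotation; the team names, people, and timeout are all hypothetical.

```python
from dataclasses import dataclass

@dataclass
class OnCallRotation:
    primary: str
    backup: str
    escalate_after_minutes: int  # page the backup if no ack within this SLA

# Hypothetical single source of truth for who gets paged per team.
ON_CALL = {
    "database": OnCallRotation("dana", "devi", escalate_after_minutes=5),
    "frontend": OnCallRotation("femi", "finn", escalate_after_minutes=5),
    "sre":      OnCallRotation("sam", "sasha", escalate_after_minutes=5),
}

def page_team(team: str, acked_by_primary: bool) -> str:
    """Page the team's primary; escalate to the backup if the page
    isn't acknowledged within the defined SLA."""
    rotation = ON_CALL[team]
    if acked_by_primary:
        return f"Paged {rotation.primary} ({team} primary)"
    return (f"No ack from {rotation.primary} within "
            f"{rotation.escalate_after_minutes} minutes, "
            f"escalating to {rotation.backup}")

# e.g. the DBA realizes the incident exceeds their scope and pulls in more teams:
for team in ("frontend", "sre"):
    print(page_team(team, acked_by_primary=False))
```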
Responding to the incident: so let's talk about the response itself. A typical sequence of events is that something is triggered, either by monitoring or by a human;
that goes to a subject matter expert. Initially, they're usually the first to
respond and see what the situation is. It might be simple enough for
them to solve right away, a very low level incident, maybe a sev four or five, and they might resolve it, or they might need to escalate it. If they need to escalate it, they send out a page; it pages the command level, which is, again, incident commander, scribe, and deputy. It also will page out the liaisons and will page out to operations, right? So whatever SME levels need to go out are paged, depending on your setup. This might seem like a lot, and you might be thinking, we only have 20 engineers in total. We can't have this huge incident response process. Not a problem.
The roles scale down. So you might find, for example,
you don't need two liaisons because you only need one,
or your comms aren't going out that frequently,
or you're going to be really heavily relying on your subject matter
experts. But maybe your incident commander and other roles can be rolled up in a
way that's relatively effective. I misspoke there a little
bit though, because you really don't want the incident commander to be rolling up into
other roles. That one should be rather dedicated. But you might share
a role with a scribe. You might have liaison sharing roles. So maybe a scribe
and a liaison is the same person. Or maybe depending on the subject matter
expert response, a scribe and an SME are a shared role.
Make sure that this is consistent in a way that makes sense. If you notice
that a pairing isn't working as you try to scale down, don't keep using it just because; separate them out in a way that makes sense for you.
Let's review all the different roles and responsibilities,
starting with the incident commander, just to provide some more context
for what everyone's doing. Remember that we want to
replace chaos with calm. This isn't the first, it won't be the last
time I say this. The incident commander role ensures the incident response
is calm and organized because they're making the calls.
Think about it this way: let's say the subject matter expert says, "I want to reboot Cassandra," and the incident commander says, "Okay." If two subject matter experts are giving differing opinions and you're in that decision paralysis mode, the incident commander breaks the tie.
Okay? You have someone who's making a decision to make sure action keeps
getting taken. They're also a single source of reference.
It's one of the most important roles as such. So even if you don't have the deputy, scribe, and liaisons all broken out as separate roles, and you're starting to combine them in ways that make sense, the incident commander is the one you should get first and the one that should be dedicated as the single source of truth during an incident and the one in charge. I keep mentioning decision paralysis; it happens a lot.
One of the things that we talk about at PagerDuty is having the "are there any strong objections?" model. The idea is you want
to make sure you don't have people later going, oh, I knew this wasn't going
to work. You maybe did,
you maybe didn't. But whatever everyone knew at the time wasn't
strong enough to break the tie. Right, the two-way tie, or whatever tie, between the decisions that were on the table at that moment.
So in order to gain consensus, you need to say, are there strong objections?
Is there something that outweighs the decision we're proposing as
opposed to other ones on the table? And if not, we're just going to pick
one, move forward and get more information, even if we fail.
And all of this is about making sure you're making that decision. Again, making a wrong decision in this situation is
better than making no decision. Making no decision doesn't help you move forward and
you don't learn anything. The incident is still going on, possibly getting worse.
Making a decision, even the wrong one, again gives information.
If it turns out to be wrong, you can learn from how it went wrong.
To make the next decision, choose between the other options, what have
you. The incident commander also
makes sure to assign tasks to a specific person. It's one
of those things where if everyone owns it, nothing gets done, and it's not malicious.
It's just they're focusing on what they're actively assigned to.
Right. And this can be happening during the incident. It can
happen after the incident. So, for example, the post mortem: if no one owns the post mortem (to be clear, not writing the whole post mortem by themselves, just collecting the people and getting them to write their sections of the post mortem), it's probably not going to get done, because there's no one checking on the progress of it. So the incident commander, within the scope of the incident and assigning the post mortem, is assigning tasks to specific people.
I mentioned this before and I'm going to talk about it now: they are the highest authority within the context of the incident. No matter their day-to-day role, an incident commander is always the highest ranking person on the call. And, as I just mentioned, no matter their role, we actually don't always recommend, and frequently don't recommend, that the incident commanders be one of the subject matter experts. You don't want the incident commander to shift roles to solving the incident while they're trying to, for example, answer exec questions. If execs come in on the call and say, hey, can you solve this in ten minutes? Hey, what's going on? Hey, I didn't see any comms, or whatever they're asking, the incident commander's job is to shield the responders from all of that. They can't be solving the incident while they do that. So it's actually not uncommon for organizations to have non-engineering roles as incident commanders.
I'm just going to reinforce this a little bit. We did use to require that all of the incident commanders be engineers. It probably makes sense: you think of engineering with incident response, so all the roles are presumably engineers. But that was actually a big mistake, because, again, ICs aren't responders in that way. They're not fixing the problem. They don't need the deep technical knowledge. It also limited our response pool. If we think about an organization that's maybe larger, it's not so much a problem, but when you're small and you have that thought that I mentioned earlier, we only have 20 engineers: why are you using up your on-call pool for a role that does not require the technical knowledge when you can branch it out into other teams? That might make sense for, say, product, so that they have more connection with the response process that's happening. There are other roles that can be brought in from the rest of the organization to do this. Handoff procedures
are encouraged. You want to make sure that people aren't staying
on an incident if it keeps going on and on for hours, 12 hours, right. You need to know what time frame you need to start handing things off at, and this is for all the relevant roles. You need a handoff procedure, and you need to know who you're handing off to. When you're going through this, you're asking for status, repeatedly deciding on action to gain consensus, assigning tasks, and then following up to make sure it's completed. Within the context of an incident, that means if I say I'm going to reboot a cluster and do some queries or whatever I'm going to do, it's like, great, how long do you think that'll take you? Ten to 15 minutes, tops. The commander will check in in 15 minutes to see what the status is, if something changed in that time, if I need to be doing something else, and just keep updated. Within the lifecycle of the incident, it looks a little more like this. This is basically the resolution phase, which is why it has that little separate circle there. You're just iterating, iterating until it's complete.
Some tips for new incident commanders: introduce yourself on the call with your name and say you're the incident commander. So, "my name is Quintessence and I'm the incident commander." Avoid acronyms, because you don't know what different types of teams are doing on the call. An easy example is, let's say I'm not in engineering and, as an incident commander, I start using PR: I mean press release, and engineering means pull request, right? Avoid acronyms. You're having a cross-functional response, and it is very common to have collisions in that space.
Speak slowly and with purpose on the call. Kick people off
if they're being disruptive. You can change roles, you can escalate
to different roles. You can have people who are supposed to be shadowing, but aren't shadowing, be asked to leave, that sort of thing.
Time box tasks. You may recall earlier when I said, oh, I'm going to be rebooting the cluster and running a query and it'll take me ten to 15 minutes. That's a time box, right? I know, if I'm the incident commander, to check in in ten to 15 minutes. If I don't know roughly how long a task will take, I don't know how often to check in, and I'll be interrupting what that person is doing.
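Time boxes are easy to track mechanically; here's a minimal sketch of the kind of follow-up list a commander, deputy, or their tooling might keep. The task, owner, and durations are examples only.

```python
from datetime import datetime, timedelta

# Each assigned task gets an owner and a time box, so the incident
# commander knows when to follow up instead of interrupting early.
time_boxes = []

def assign(owner: str, task: str, minutes: int) -> None:
    time_boxes.append({
        "owner": owner,
        "task": task,
        "due": datetime.now() + timedelta(minutes=minutes),
    })

def overdue_followups(now: datetime) -> list:
    """Return the check-ins the IC or deputy should make right now."""
    return [f"Check in with {t['owner']} on '{t['task']}'"
            for t in time_boxes if now >= t["due"]]

assign("DBA", "reboot cluster and rerun queries", minutes=15)
# Eighteen minutes later, the deputy can nudge the IC:
print(overdue_followups(datetime.now() + timedelta(minutes=18)))
```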
You also need to be able to explicitly declare when the response has ended,
because once that has happened, you don't want people to say, oh, what's this over
here? Is this still connected to this incident? Is this something new?
Right? When the response has ended, that incident is ended. If something else
is popping up, it might require a different response process.
In summary, the incident commander keeps everyone focused, keeps decision making moving, helps avoid the bystander effect, and keeps you moving forward towards a resolution. Now, all of that was just the incident commander. The rest of the roles aren't going to take as long to explain, so let's go through them.
We have the scribe, the deputy, and then we have the liaisons and the SMEs. As mentioned before, the deputy role supports the incident commander: keeps the focus, takes on any additional tasks. This is the scaling problem: if there's too much going on that the IC can't do by themselves, if there are too many execs coming in to ask for status updates, whatever, all of that goes to the deputy. They serve to follow up on reminders
to ensure tasks aren't missed. So in that time boxing example, if I as an
incident commander, am not following up with the DBA in ten to 15 minutes,
the deputy says, hey, it's been 18 minutes, or they act as
a hot standby per previous comments about being able to have handoffs.
The importance of the scribe: they're documenting the timeline with context, and the incident log is used to augment what they're writing in the post mortem process. They note when important actions are taken, follow-up items, any status updates, et cetera. And similar to the deputy and the IC, anyone can be a scribe. You do not need the technical knowledge or the resolution knowledge to write down what's happening. So again, don't box yourself into just engineering for all of these roles.
Similar for liaison roles. You can have internal and external liaisons,
or both as a single role or both as separate roles. It depends on what's
happening and what you need. They're notifying the relevant parties of what's happening.
We usually recommend this happens on about a 30 minute cadence. If it's too frequent, people aren't taking it in and they're just getting overwhelmed with information. And if it's too infrequent, people, especially, I pick on execs a lot because they're the ones that are going to be getting calls if this is a massive incident, end up popping in because they don't have information. It is common to have this be a member of support, though. Some pitfalls:
Executive swoop is the one I mentioned the most. We sometimes nickname it executive swoop and poop. The idea is people are
swooping in because they don't have enough information and they're trying to get information
out of you while you're solving the incident as a result.
Or sometimes they try to be helpful. Can we resolve this in some
number of minutes, please? We all wish we could.
Right. But again, this isn't meant to be malicious.
People just want to be encouraging or say, have you thought
about things? Or they want more information so they can
talk to people. Maybe they have a dependency. Right. Can I at least know who's
impacted so I can start making phone calls. Right.
That's all stuff you want to kind of buffer, again, in the incident response process.
You want those command and communication roles to buffer the responder roles.
There's also a certain amount of "do what I say": oh, did you reboot the cluster?
Oh, did you do this? Maybe. But we
in the incident call context have been having a conversation for however long and
then the person who just joined in has none of that history yet.
Right. One of the things that you can do though, in this case is ask
people if they wish to take command, because if they're an incident commander and they
take command, like if they're trained in that role,
sure, then it's not your problem. Right. And if they say no,
just say, great, in that case I need you to, and then give them
some direction. I need you to wait for the next status update. I need you
to wait till we send out the information or whatever. Right? So make
sure you always ask this question if someone's being a little bit persistent, because again,
it's not normally malicious, they're as anxious as everybody else.
Another common pitfall is the failure to notify. That's why the liaison role is separated out. If you don't notify people, they're going to ping you in Slack, in the call, or wherever to get information. On the opposite side, if you ping them too often, it's usually not necessary: it doesn't have enough new information, or the status updates are too similar to warrant the frequency.
Right. Red herrings are
things that happen in incidents that are just misleading. This isn't really a part of the incident response process breaking down; this is just the resolution process breaking down, where you think that you need to do something and it ends up being a dependency, or something that looks like it's happening over here is actually happening over there. I'm not going
to go through all the anti patterns here. Suffice it to say, there are
lots of things that happen during an incident that can bog down the response.
Debating the severity is something we covered a lot earlier. If you spend your time saying, is this an incident? Is this not an incident? Is it a four
or a five? Is it a two? Is it a three? You're going to lose
time, right? If you don't have the process defined in advance, you're going to
lose time. If you don't send out policy changes or have them documented
in a single place, people are going to lose time trying to find them and
so on. What I do want to call out though, is assuming silence means no
progress. If you think about it, people might be heads down or reading a runbook, even decoupled from the incident call, right? Silence doesn't necessarily mean stopping. Silence just means I haven't
been able to send you anything yet.
How do I prepare to manage an incident response team?
Step one, ensure explicit processes and expectations
exist. See this entire presentation.
Step two, practice running a major incident as a team. This is kind
of like Failure Fridays, which is something we run at PagerDuty to practice chaos experiments. Run your major incident response as part of it. Why not? Right? If you have rare incidents, that's awesome. But if you don't practice what you're
doing when they do happen, you're not going to be as agile.
It'll also give you the opportunity to tune your process to know what works.
You don't want to find out in the first real incident that you have with
your new process that it didn't work for you.
Something that can help with this is checklists. That way you know what you've done, what you haven't done, and what you need to change next time. It can look a little bit like this. Again, I'm not going to go
through each one of these, but please feel free to screenshot or look at the
slides. I'll have a link at the end.
Something else I want to call out: don't neglect the post mortem. It's really easy to forget, because the incident is done, we go back to regularly scheduled activities or we do any of the action items that we take away from it, and writing everything down, just like docs, kind of gets lost. But if you don't document things in the post mortem when
things are fresh, then you're going to lose the learning opportunity.
So as an overview, you want to make sure you're covering the high level of the impact. This is just a quick "our shopping cart service was down and customers could not purchase new items," something like that. It's not meant to be very in depth at that level. And then in the "what happened" section, detailed descriptions of what happened and what response efforts were made. This is when you can really make use of those scribe notes. Then you review the response process itself, what went well, what didn't, and write down any action items.
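If a starting point helps, here's a small sketch that stubs out those sections as a bare post mortem outline; the section names follow the structure just described, and everything else is left for the responders to fill in.

```python
def postmortem_skeleton(title: str, owner: str) -> str:
    """Generate a bare-bones post mortem outline with the sections
    described above; the content is filled in by the responders."""
    sections = [
        "Overview / high-level impact",
        "What happened (detailed timeline, built from the scribe's notes)",
        "Response review: what went well, what didn't",
        "Action items (each assigned to a specific person)",
    ]
    lines = [f"Post mortem: {title}", f"Owner: {owner}", ""]
    for section in sections:
        lines += [f"## {section}", "TODO", ""]
    return "\n".join(lines)

print(postmortem_skeleton("Shopping cart outage", "quintessence"))
```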
As a slide to screenshot, here are some things that you can include in a more detailed post mortem. I actually will have a link to a post mortem ops guide that we wrote up here at PagerDuty at the end, so hold on to your seats there. As a quick summary: use the Incident Command System for managing incidents.
An incident commander takes charge during wartime scenarios. Set expectations upward; that specifically refers to the CEO comment about the incident commander being the top one, but basically everything else too, right? You want to communicate outward, work with your team to set explicit expectations and process, and practice. And don't forget to review: all of the links, including the post mortem one I just mentioned, are going to be on this page, so go ahead and go here. I have four guides there about response and post mortems and the associated meetings. The topmost link, in case you want to see it just now, is the Response Ops guide. That Ops guide is an expansion of this presentation, so if you want to read more or want to recap everything I've mentioned here, it's on this slide. And with that, I'll go ahead and go to Q&A. I'll be available in the Discord, and the link for all of the links that I've mentioned in this presentation, again, is going to be on that page. Thank you and have a great conference.