Transcript
Hi, I'm Julie Gunderson, DevOps advocate here at PagerDuty, and I'm here to talk to you today about postmortems, continuous learning, and enabling a blameless culture. So when we talk about postmortems, a postmortem is a process, and it's intended to help you learn from past incidents. It typically involves a blame-free analysis and discussion soon after an event has taken place, and you produce an artifact. That artifact includes a detailed description of exactly what went wrong to cause the incident. It also includes a list of steps to take in order to prevent a similar incident from occurring again in the future. You also want to include an analysis of how the incident response process itself worked.
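Just as an illustration, and not PagerDuty's actual template, here's a minimal sketch of those three pieces of the artifact; the field names are only examples:

```python
# A minimal sketch (not PagerDuty's actual template) of the postmortem artifact
# described above. Field names are illustrative only.
from dataclasses import dataclass, field

@dataclass
class Postmortem:
    what_went_wrong: str                    # detailed description of what caused the incident
    prevention_steps: list[str] = field(default_factory=list)   # steps to prevent a similar incident
    response_process_review: str = ""       # how the incident response process itself worked
```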
The value of having a postmortem comes from helping institutionalize that culture of continuous improvement. The reason that we do postmortems is because during incident response, the team is 100% focused on restoring service. They cannot and they should not be wasting time and mental energy thinking about how to do something optimally or performing a deep dive on what caused the incident. That's why postmortems are essential.
They provide that peacetime opportunity to reflect once an issue is no longer impacting users. Organizations tend to refer to the postmortem process in slightly different ways: after-action reviews, post-incident reviews, learning reviews, incident reviews, incident reports, or root cause analysis. It really doesn't matter what you call it, except for this last one, root cause analysis, because we will talk about why words matter. It just matters that you do this.
The postmortem process drives that culture of learning. Without a postmortem, you fail to recognize what you're doing right or where you could improve, and most importantly, how to avoid the same mistakes in the future. Writing an effective postmortem allows you to learn quickly and effectively from mistakes and improve your systems and processes.
And a well designed, blameless postmortem allows teams to continuously
learn, serving as a way to iteratively improve
your infrastructure and incident response process.
So you want to be sure to write detailed and accurate postmortems in
order to get the most benefit out of them. Now, there are certain times that
you should do a postmortem. You should do a postmortem after every major incident. This actually includes anytime
the incident response process is triggered, even if it's later discovered
that the severity was actually lower, or it was a false alarm,
or it quickly recovered without intervention. A postmortem shouldn't be
neglected in those cases because it's still an opportunity
to review what did and did not work well during the incident response
process. If the incident shouldn't have triggered the incident response process, it's worthwhile to understand why it did. Monitoring can be tuned to avoid unnecessarily triggering the incident response process in the future, and doing this analysis and follow-up action will help prevent that alert fatigue going forward.
And just as restoring service becomes the top priority when a major incident occurs, completing the postmortem is prioritized over planned work. Completing the postmortem is the final step of your incident response process.
Delaying the postmortem delays key learnings that will prevent the incident from recurring. So PagerDuty's internal policy for completing postmortems is three business days for a SEV-1 and five business days for a SEV-2. And because scheduling a time when everyone is available can be difficult, the expectation is actually that people will adjust their calendars to attend the postmortem meeting within that timeframe.
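As a minimal sketch of that completion policy, assuming a simple Monday-through-Friday business-day calendar (the helper name and its details are just illustrative, not PagerDuty tooling):

```python
# A minimal sketch of the policy described above: three business days
# for a SEV-1, five for a SEV-2.
from datetime import date, timedelta

SLA_BUSINESS_DAYS = {"SEV-1": 3, "SEV-2": 5}

def postmortem_due_date(incident_day: date, severity: str) -> date:
    remaining = SLA_BUSINESS_DAYS[severity]
    day = incident_day
    while remaining > 0:
        day += timedelta(days=1)
        if day.weekday() < 5:        # Monday (0) through Friday (4) count as business days
            remaining -= 1
    return day

# e.g. postmortem_due_date(date(2024, 3, 1), "SEV-1") -> the third business day after March 1
```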
And at the end of every major incident call, or very shortly after, the incident commander (that's part of the command structure we use here at PagerDuty) selects and directly notifies one responder to own completing the postmortem. Note that the postmortem owner is not solely responsible for completing the postmortem themselves. Writing a postmortem is a collaborative effort, and it should include everyone
involved in the incident response process. While engineering
will lead the analysis, the postmortem process should involve management
and customer support and the business communications teams.
The postmortem owner coordinates with everyone who needs to be involved to ensure the postmortem is completed in a timely manner. And it's really important to designate a specific owner to avoid what we call the bystander effect. If you ask all responders or a team to own completing the postmortem, you risk everyone assuming that someone else is doing it, and therefore no one does. So it's really important to designate an owner for the postmortem process, and some of the criteria can be someone who took a leadership role during the incident or performed a task that led to stabilizing the service. Maybe they were the primary on-call, or maybe they manually triggered the incident to initiate that incident response process. It is very important to note, though, that postmortems are not a punishment, and the owner of the postmortem is not the person that caused the incident. Effective postmortems are blameless. In complex systems, there's never a single cause, but a combination of factors that lead to failure, and the owner is simply an accountable individual who performs select administrative tasks, follows up for information, and drives that postmortem to completion.
So writing the postmortem will ultimately be a collaborative effort.
But selecting a single owner to orchestrate this collaboration is
what ensures that it gets done. So let's talk a little
bit about blame. As IT professionals, we understand that failure is inevitable in complex systems, and how we respond to failure when it occurs is what matters. In The Field Guide to Understanding Human Error, Sidney Dekker describes two views on human error. There's the old view, which asserts that people's mistakes cause failure, and then the new view, which treats human error as a symptom of a systematic problem. The old view subscribes to that bad apple theory, which believes that by removing the bad actors, you're going to prevent failure. And this view attaches an individual's character to their actions, assuming negligence or bad intent is what led to the error. An organization that follows this old view of human error may respond to an incident by finding the careless individual who caused the incident so that they can be reprimanded. And in that case, engineers will hesitate to speak up when incidents occur for fear of being blamed. And that silence can increase the overall mean time to acknowledge and mean time to resolve, and it really exacerbates the impact
of incidents. For the postmortem process to
result in learning and system improvements, the new
view of human error must be followed. In complex systems of software
development, a variety of conditions interact to lead to
failure. And the goal of the postmortem is to understand
what systematic factors led to the incident and to identify
actions that can prevent this kind of failure from reoccurring. So a
blameless postmortem stays focused on how a mistake was made
instead of who made the mistake. And this is a crucial mindset,
leveraged by many leading organizations for ensuring
postmortems have that right tone. And it empowers engineers to
give truly objective accounts of what
happened, because you're eliminating that fear of punishment.
But blamelessness is hard. It's really easy to agree that we want a culture of continuous improvement, but it's difficult to practice the blamelessness required for learning. The unexpected nature of failure naturally leads humans to react in ways that interfere with our understanding of it. When processing information, the human mind unconsciously takes shortcuts, and by applying general rules of thumb, the mind optimizes for timeliness over accuracy. When this produces an incorrect conclusion, it's called a cognitive bias.
So J. Paul Reed argues that the blameless postmortem is actually a myth, because the tendency to blame is hardwired through millions of years of evolutionary neurobiology. Ignoring this tendency or trying to eliminate it entirely is impossible, so it's more productive to be blame-aware. I'll touch on this and some of the biases next, but for more details, read Lindsay Holmwood's article on cognitive biases that we must be aware of when we're performing postmortems.
So one of the errors that we see is the fundamental attribution
error, and it's the tendency to believe that what people
do reflects their character rather than their circumstances. And this
goes back to that old view of human error, assigning responsibility for
a failure to bad actors who are careless and clearly incompetent.
Ironically, we tend to explain our own actions by
our context, not our personality. So you can combat
that tendency to blame by intentionally focusing the analysis
on the situational causes rather than the discrete
actions that the individuals took. Another pervasive cognitive bias is confirmation bias, which is the tendency to favor information that reinforces
existing beliefs. So when presented with ambiguous
information, we tend to interpret it in a way that supports our
existing assumptions. And when combined with that old view
of human error, this bias is dangerous for postmortems because
it seeks to blame that bad apple. And so when approaching the
analysis with the assumption that an individual is at fault,
you're going to find a way to support that belief despite evidence
to the contrary. So, to combat confirmation bias, Lindsay Holmwood suggests appointing somebody to play devil's advocate,
to take a contrarian viewpoint during an investigation.
Be careful, though, when doing this, because you want to be cautious of
introducing negativity or combativeness with the devil's advocate.
You can also counter confirmation bias by inviting someone
from another team to ask any and all questions that
come to their mind, because this will help surface lines of inquiry,
as the team may have learned to take some of these things for granted.
Now, hindsight bias is a type of memory distortion where
we recall events to form a judgment. So knowing the outcome,
it's easy to see the event as being predictable, despite there
having been little or no objective basis for predicting it.
Often we recall events in a way that makes us look better. An example is when a person is analyzing the causes of an incident and they believe they knew it was going to happen like that. Acting on this bias can lead to defensiveness and division within the team. So Holmwood actually suggests avoiding hindsight bias by explaining events in terms of foresight
instead. So start your timeline analysis at a
point before the incident and work your way forward instead of
backward from resolution. And then there's negativity bias.
And this is the notion that things of a more negative nature have a
greater effect on one's mental state than those of a neutral or
even positive nature. Research on social judgments
has shown that negative information disproportionately impacts
a person's impression of others. And this relates
back to the bad apple theory, the belief that there are negative actors
in your organization to blame for failures. Studies have also
shown that people are more likely to attribute negative outcomes
to the intentions of another person rather than
neutral or positive outcomes. And this also explains our tendency to blame individuals' characters to explain a major incident. In reality, things go right more often than they go wrong, but we tend to focus on and magnify the importance of negative events. Focusing on, exaggerating, and internalizing incidents as negative events can be demoralizing and can lead to burnout. Reframing incidents as learning opportunities and remembering to describe what was handled well in the response process can help to balance the perspective.
And so these biases can damage team relationships if they
go unchecked. It's important to be aware of these tendencies
so we can acknowledge bias when it occurs. By making
postmortems a collaborative process, teams can work as a
group to identify blame and then constantly dig
deeper in the analysis.
So acknowledging blame and working past it,
it's easier said than done. Think about what behaviors
can we adopt to move towards a blameless culture?
And I'm going to share a few of those with you.
So ask what and how questions rather
than who or why. So going back to Lindsay Holmwood,
what questions are like what did you think was happening?
Or what did you do next? Asking what questions
grounds the analysis in the big picture of
the contributing factors to the incident. In his article The Infinite Hows, John Allspaw encourages us to ask how questions, because they get people to describe at least some of the conditions that allowed an event to take place. Holmwood also notes that how questions can help clarify technical details,
distancing people from the actions they took. So avoid asking why questions, because they force people to justify their actions and really do attribute blame, like, well, why did you just do that? Or why did you take that action?
So think about how you can reframe those in the what and how
perspective and then consider multiple and
diverse perspectives. This is actually a great time to
bring in maybe an intern. They tend to think about things
in ways that people who have been practicing this for 20
years may not think about things. So consider the
different perspectives and ask why
a reasonable, rational, and decent person may
have taken a particular action. When analyzing failure,
we may fall into the victim, villain, and helpless stories
that propel those emotions and attempt to justify our
worst behaviors. And you can move beyond blame by
telling the rest of the story. Ask yourself, why might somebody have done this? Because you want to put yourself in that person's shoes. This thinking will help turn attention to the multiple systematic factors that led to the incident versus the who. Also, when inquiring about human action, abstract it to a nonspecific responder. I mean, anybody could have made that same mistake. We've all made mistakes. These are not intentional. It's not that bad apple theory. And
whether you're introducing postmortems as an entirely new practice at
your organization or working to improve an existing process,
culture change is hard. And change doesn't have to
be driven by management. Oftentimes the bottom up
changes are more successful than those top down mandates.
Anyways, no matter your role, the first step to introducing a new process is to get buy-in from leadership and individual contributors to practice blameless postmortems and encourage that culture of continuous improvement. You do need
commitment from leadership that no individuals will be reprimanded in any way
after an incident. And it can be difficult to get
this buy in when management holds that old view of human error,
believing that the bad actors cause failures and that they'll never
have failures if they just remove those bad actors. So to
convince management to support a shift to blameless analysis,
clarify how blame is harmful to the business and
explain the business value of blamelessness. Punishing individuals for causing incidents discourages people from speaking up when problems occur, because they're afraid of being blamed. And as I talked about earlier, this silence just increases that mean time to acknowledge and mean time to resolve. You want people to
speak up when problems occur. And organizations can
rapidly improve the resilience of their systems and increase the
speed of innovation by eliminating the fear of blame and
encouraging collaborative learning. The inclination to blame individuals when faced with a surprising failure is ingrained deeply in all of us. And it may sound silly,
but when selling a new blameless postmortem process to management,
do try to avoid blaming them for blaming others.
Acknowledge that that practice of blamelessness, it's difficult
for everyone, and teams can help hold each other accountable
by calling each other out when blame is observed in response to failure.
Ask leadership if they would be receptive to receiving that feedback
if and when they accidentally suggest blame after an incident.
A verbal commitment from management to refrain from punishing
people for causing incidents is important to start introducing blameless
postmortems. But that alone isn't going to eliminate the
fear of blame. Once you have leadership support, you also need buy-in from the individual contributors who will be performing that postmortem analysis.
Share that you have commitment from management that no one will be punished
after an incident, because the tendency to blame, it turns out, is not unique to managers and leadership.
Explain to the team why blame is harmful to trust
and collaboration, agree to work together to become more and
more blame aware, and kindly call each other out when blame
is being observed. People need to feel safe talking about
failure before they're willing to speak up about incidents.
And when Google studied their teams to learn what behaviors made groups successful,
they found that psychological safety was the most critical factor for
a team to work well together. And so Harvard Business
School professor Amy Edmondson defines psychological safety as
a sense of confidence that a team will not embarrass,
reject, or punish someone for speaking up. And this also describes
what we're trying to achieve with blameless postmortems: the team does not fear punishment for speaking up about incidents. A sense of safety makes people feel comfortable enough to share information about incidents, which allows for deeper analysis and then results in learning that improves the resilience of the systems. And the DORA State of DevOps report
actually showed that psychological safety is a key driver of high performing
software delivery teams. So the number one thing that you can
do for your teams is to build that culture of psychological
safety. That can be your gateway to culture change. So how do you do this? Again, culture change doesn't happen overnight. You want to start small: iteratively introduce new practices to the organization, share successful results of experimenting with those practices, and then slowly expand the practices across teams. Similar to how we talk about blast radius with chaos engineering, you can start experimenting with blameless postmortems with just a single team. And to get started, you can actually use our how to write a postmortem guide at postmortems.pagerduty.com to see
some tips. It's also easy to start practicing blameless postmortems
by analyzing smaller incidents, maybe before tackling
major ones, because the business impact of that incident is
lower, there's less pressure to scapegoat an individual as the cause
of an incident. If someone does fall back on blaming an
individual, there's also lower repercussions for causing
a minor incident. Simply put, the stakes are lower when analyzing a minor
incident. So doing postmortems for smaller incidents allows the team to develop that skill of deeper system analysis that goes beyond how people contributed to the incident. And remember that this is a skill, and skills require practice.
Also, sharing the results of postmortems is very important.
It has a couple of benefits. It increases the system knowledge
across the organization. And actually, it also reinforces
a blameless culture. When postmortems are shared, teams will
see that individuals are not blamed or punished for incidents,
and then this reduces that fear of speaking up when the incidents
inevitably occur. So, creating a culture where information can be confidently shared leads to that culture of continuous learning in which teams can work together to continuously improve.
We also encourage teams to learn postmortem best practices from each
other by hosting a community of experienced postmortem writers.
So we have a community of experienced postmortem writers who are
available to review postmortems before they're shared more widely.
And this ensures that the blameless analysis is
there through feedback and coaching while those postmortems are
being written. And you can scale culture through sharing.
So the State of DevOps report told us that operationally mature organizations adopt practices that promote sharing. People want to share their successes, and when people see something that's going well, they want to replicate that success. And it may seem counterintuitive to share incident reports, because it seems like you're sharing a story of failure rather than success.
But the truth is, practicing blameless postmortems leads
to success because it enables teams to learn from failure and to
improve systems to reduce that prevalence of failure. So framing incidents as learning opportunities with concrete results and improvements, rather than as a personal failure, also increases morale, which increases employee retention and productivity. One thing we do at PagerDuty
is we actually schedule all postmortem meetings on a shared calendar,
and the calendar is visible to the entire company and anyone
is welcome to join. This gives engineering teams the opportunity
to learn from each other on how to practice blamelessness and deeply
analyze incident causes. It also makes clear
that incidents aren't shameful failures that should be kept
quiet. By sharing learnings from the incident analysis, you help the entire organization learn, not just the affected teams responsible for the remediation. So PagerDuty also sends completed postmortems via email to an incident reports distribution list that includes all of engineering, product, and support, as well as our incident commanders, who may or may not be in one of those departments. And this widens system knowledge for everyone involved in the incident response process.
Information sharing and transparency also support an environment that cultivates accountability. A common challenge to effective postmortems is that after analyzing the incident and creating action items to prevent recurrence, those action items are never completed. So I'm just going to talk about some quick tips for success when you're starting this off. For action items to get done, they have to have clear owners. So because we're an agile and DevOps shop, the cross-functional teams responsible for the affected
services are also responsible for implementing improvements
expected to reduce the likelihood of failure. Engineering leadership helps clarify what parts of the system each team owns and sets the expectation for which team owns new development and operational improvements. Ownership designations are communicated across the organization so all teams understand who owns what, ownership gaps can be identified, and we document this information for future reference and new hires. Any uncertainty about ownership of an incident's action items is discussed in the postmortem meeting with representatives from all teams that may own the action item. So start by setting a policy for when postmortem action items should be completed. At PagerDuty, our VP of engineering has set the expectation that high-priority action items needed to prevent a SEV-1 from recurring should be completed within 15 days of the incident, and action items from a SEV-2 should be addressed within 30 days. It's really important to communicate
this expectation to all of engineering and make sure it's
documented for future reference. We've also seen improved accountability for completing action items by involving the responsible leaders, product managers and engineering managers, in the postmortem meeting. Those leaders need to be involved because they're prioritizing that team's work. I mean,
product managers, they're responsible for defining a good customer experience.
Incidents cause a poor customer experience, so engage
product managers in postmortem discussions by explaining that
it will provide a wider picture of threats to that customer experience and
ideas on how to improve that experience. Doing so gives engineering a chance to explain the importance of these action items so that the product managers will prioritize the work accordingly. And similarly,
getting engineering leadership more involved in the postmortem discussion gives them a better understanding of system weaknesses to inform how and where they should invest technical resources. So sharing this context with the leaders that prioritize the work allows them to support the team's effort to quickly complete high-priority action items that result from the incident analysis. And really, the most important outcome of a postmortem meeting is to gain that buy-in for the action plan. So this is an opportunity to discuss proposed action items, brainstorm other options, and gain that consensus among the team leadership.
Sometimes the ROI of proposed action items isn't great enough to justify the work, or the postmortem action items might need to be delayed for other priorities. The postmortem meeting is a time to discuss these difficult decisions and make clear what work will and will not be done, as well as the expected implications of those choices. And whereas the written postmortem is intended to be shared widely in the organization, the primary audience for the postmortem meeting is the teams directly involved with the incident. This meeting gives the team a chance to align on what happened, what to do about it, and how they'll communicate about the incident to internal and external stakeholders.
So participants in this meeting might be the incident commander, maybe any shadowees, service owners and key engineers involved in the incident, engineering managers for impacted systems, product managers, and any customer liaisons, internal or external. And if not already done by the incident commander, one of the first steps is to create a new and empty postmortem for the incident. Go through the history in Slack (we use Slack), or whatever tool you're using, to identify the responders and add them to the page so that they can help populate that postmortem. Include the roles of incident commander, deputy, and scribe as well. And if you're confused about those roles, check out response.pagerduty.com, where we break down how the incident response process works at PagerDuty and for some other organizations. You also want to add a link to the incident call recording, invite stakeholders from related or impacted teams, and then schedule that postmortem meeting for 30 minutes to an hour, depending on the complexity of that incident. Scheduling the meeting at the beginning of the process helps make sure that that postmortem is completed within the SLA. And so again,
these are some of the people that should attend, service owners, key engineers,
engineering managers, product managers, customer liaisons, an incident
commander or a facilitator, and any of the other roles
that you may have had during that incident response call. And then you
want to begin by focusing on the timeline. You want to document the facts
of what happened during the incident. Avoid evaluating what should or should not have been done and coming to conclusions about what caused the incident. Present only the facts here; that will help avoid blame and support that deeper analysis. And note that the incident may have actually started before responders became aware of it and began that response effort. So the timeline includes important changes in status and impact and key actions taken by responders. It helps to avoid hindsight bias if you start your timeline at a point before the incident happened and work your way forward instead of working your way backwards from resolution.
Also review the incident log and whatever chat tool you're using (hopefully it's not email) to find key decisions that were made and actions taken during the response effort, and include information the team didn't know during the incident that in hindsight you wish you would have known. You can find this additional information by looking at monitoring and logs and deployments related to affected services. You'll take a deeper look at monitoring during the analysis step, but start here by adding key events related to the incident, like deploys or customer tickets filed, or maybe a hypothesis being tested during a chaos engineering experiment, and include changes to the incident status and the impact to that
timeline. For each item in the timeline, identify a metric or some third-party page where the data came from, because this helps illustrate each point clearly and ensures that you remain rooted in fact rather than opinions. This could be a link to a monitoring graph or a log search, maybe a bunch of tweets, anything that shows the data point that you're trying to illustrate in the timeline.
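To make that concrete, here's a minimal sketch of timeline entries tied to their data sources; the timestamps, events, and links are hypothetical placeholders:

```python
# A minimal sketch of timeline entries, each backed by the data source that
# supports it. Timestamps, events, and URLs are hypothetical placeholders.
timeline = [
    {"time": "14:02 UTC",
     "event": "Deploy of the affected service completed",
     "source": "https://example.com/deploys/1234"},            # placeholder deploy log link
    {"time": "14:09 UTC",
     "event": "Error rate on the affected service exceeds 5%",
     "source": "https://example.com/monitoring/error-rate"},    # placeholder monitoring graph
    {"time": "14:15 UTC",
     "event": "Incident response triggered; responders paged",
     "source": "https://example.com/incidents/5678"},           # placeholder incident record
]
```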
So just a few tips for the timeline: stick to the facts and include changes to the incident. Then we are going to go to documenting the impact. Impact should be described from a few perspectives.
How long was the impact visible? In other words, what was the length
of time users or customers were affected? And note
that the length of impact may differ from the length of the response
effort. Impact may have started sometime before it was detected
and the incident response began. How many customers were affected?
Support may need a list of all affected customers so they can reach out
individually and then maybe how many customers wrote or called
support about the incident, what functionality was affected and how
severely. You want to quantify impact with a business metric specific to the product. For PagerDuty, this includes things like delayed event ingestion and processing or slow notification delivery.
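As an illustration only, here's a minimal sketch of that impact summary; the numbers and metric names are hypothetical:

```python
# A minimal sketch of the impact summary described above. The numbers and the
# business metrics are hypothetical examples, not a required set.
impact = {
    "length_of_impact": "47 minutes",        # may differ from the length of the response effort
    "customers_affected": 120,               # support may need the full list to reach out individually
    "support_tickets_filed": 8,
    "functionality_affected": "delayed event ingestion and processing; slow notification delivery",
}
```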
Now that you have an understanding of what happened during the incident, look further back in time to find the contributing factors that led to the incident. Technology is a complex system with a network of relationships, from organizational to human to technical, that is continuously changing. In his paper How Complex Systems Fail,
Dr. Richard Cook says that because complex systems
are heavily defended against failure, it is a unique combination
of apparently innocuous failures that join to create a
catastrophic failure. Furthermore, because overt
failures require multiple faults, attributing a root cause is
fundamentally wrong. There's no single root cause of a major failure in a complex system, but a combination of contributing factors that together led to that failure. And your goal in analyzing the incident is not to identify a root cause, but to understand the multiple factors that created an environment where failure became possible. Cook also says that the effort to find a root cause does not reflect an understanding of the system, but rather the cultural need to blame specific, localized forces for events. Blamelessness is essential for an effective postmortem. An individual's actions should never be considered a root cause. Effective analysis goes deeper than human action. In the cases where someone's mistake did contribute to a failure, it's worth anonymizing this in your analysis to avoid attaching blame to an individual. Assume any team member could have made the same mistake. According to Cook, all practitioner actions are actually gambles, that is, acts that take place in the face of uncertain outcomes.
Now for your analysis, you want to start by looking at your monitoring for the affected services. Search for irregularities, like sudden spikes or flatlining, when the incident began and leading up to the incident. Include any commands or queries you used to look up that data, and the graph images or links from your monitoring tool, alongside this analysis so others can see how that data was gathered.
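As a rough illustration, here's a minimal sketch, not any particular monitoring tool's API, of scanning an exported metric series for those kinds of irregularities:

```python
# A minimal sketch of scanning a metric series for the irregularities mentioned
# above: sudden spikes or drops, and flatlining leading up to the incident.
from statistics import mean, stdev

def find_irregularities(points, spike_sigma=3.0):
    """points: list of (timestamp, value) samples exported from your monitoring tool."""
    values = [v for _, v in points]
    mu, sigma = mean(values), stdev(values)
    findings = []
    for ts, v in points:
        if sigma and abs(v - mu) > spike_sigma * sigma:
            findings.append((ts, f"sudden spike or drop: {v} (series mean {mu:.1f})"))
    # A long run of identical values often means a flatlined metric or a dead emitter.
    if len(values) >= 10 and len(set(values[-10:])) == 1:
        findings.append((points[-1][0], "metric flatlined over the last 10 samples"))
    return findings
```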
This level of analysis will uncover the superficial causes of the incident. Next, ask: why was the system designed in a way to make this possible? Why did those design decisions seem to be the best decisions at the time? Answering these questions will help you to uncover those contributing factors. And some helpful questions are: Is this an isolated incident or part of a trend? Was it a specific bug, a failure, something that we anticipated? Or did it uncover a class of issue we weren't even aware of? Was there work that some team didn't do in the
past that contributed to this incident? Were there any similar
related incidents in the past? Does this incident demonstrate a
larger trend? And will this class of issue get worse as we
continue to grow and scale and use the service? Now, it may not be possible, or worth the effort, to completely eliminate the possibility of the same or a similar incident occurring again. So also consider how you can improve detection and mitigation of future incidents. Do you need better monitoring and
alerting around this class of problems so that you can respond faster in the future?
If this class of incident does happen again, how can you decrease the severity
or the duration? Remember to identify any actions that can make
your incident response process better too. Go through the incident history to find any to-do or action items raised during the incident and make sure that those are documented as tickets as well.
At this phase, you are only opening tickets. There's no expectation
that tasks will be completed before the postmortem meeting.
And so Google writes that to ensure postmortem action items are completed, they should be actionable, specific, and bounded. For actionable items, you should phrase each item as a sentence starting with a verb, and the action should result in a useful outcome. For specific, you want to define each action item's scope as narrowly as possible, making it clear what is and is not in scope. And for bounded, you want to word each action item to indicate how to tell when it's finished, as opposed to open-ended or ongoing tasks.
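To illustrate, here's a minimal sketch of an action item captured with those three properties in mind; the fields, owner, and wording are just examples:

```python
# A minimal sketch of an action item captured so it is actionable (verb-first),
# specific (narrow scope), and bounded (a clear "done" signal). The owner,
# wording, and due date are illustrative only.
action_item = {
    "title": "Add alerting for all cases where the affected service returns >1% errors",  # starts with a verb
    "scope": "error-rate alerting for this service only; broader monitoring review is out of scope",
    "done_when": "the new alert has fired in a test and is documented in the runbook",
    "owner": "team-that-owns-the-service",   # hypothetical owning team
    "due": "within 15 days of the incident, per the SEV-1 policy mentioned earlier",
}
```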
And so here are some ideas for better wording: think about a poorly worded item like "investigate monitoring for this scenario" versus an actionably worded item, which is "add alerting for all cases where this service returns greater than 1% errors." Next, you want to move on to
external messaging. And the goal of external messaging is to build
trust by giving customers enough information about what happened and
what you're doing about it, without giving away all your proprietary information about
your technology or your organization. There are parts of your internal analysis
that primarily benefit the internal audience, and those don't need to be
included in your external postmortem. The external postmortem
is a summarized and sanitized version of the information
used for the internal postmortem. So some of
the components that you want to include with external postmortems are a summary, which is just two to three sentences to summarize what happened, what the contributing factors were, and what we are doing about it. It can be pretty short and sweet, and remember, it's sanitized from what you would share internally.
Now at PagerDuty, we have a community of experienced postmortem writers available to review postmortems for style and content, as I mentioned, and this avoids wasted time during the meeting. So we post a link to our postmortems in Slack to receive feedback at least 24 hours before the meeting is scheduled, and some of the things that we look for are: Does it provide enough detail? Rather than just pointing out what went wrong, does it drill down to the underlying causes of the issue? Does it separate the what happened from the how to fix it? Do the proposed action items make sense? Are they well scoped enough? Is the postmortem understandable and well written? Does the external message resonate well with customers, or is it likely to cause outrage? So a few things to do are: make
sure that the timeline is an accurate representation of the events, separate the what happened from the how to fix it, and write those follow-up items so that they are, again, actionable, specific, and bounded. And then things that you don't want to do: don't use the word outage unless it really was an outage; accurately reflect the impact of an incident.
Outage is usually too broad of a term to use, and it can lead customers
to think that the product was fully unavailable and likely
that wasn't the case. Also, don't change details or events to make
things look better. Be honest in your postmortems.
Don't name and shame folks. Keep your postmortems blameless.
If someone deployed a change that broke things, it's not their fault.
Everyone is collectively responsible for building a system that allowed them
to deploy a breaking change and also avoid that concept
of human error. Very rarely is a mistake truly rooted in a human performing an action. There are often several contributing factors that can and should be addressed. And then also, don't point out just what went wrong; drill down to the underlying causes of the issue, and also point out what went right. And so, after you've completed
the written postmortem, follow up with a meeting to discuss the incident.
And the purpose of this meeting is to deepen the postmortem analysis through
direct communication and to gain buy in for the action items.
You want to send a link to the postmortem document to the meeting attendees
at least 24 hours before the meeting. The postmortem may not be complete at the time when it's sent to the attendees (it should be finished before the meeting), but it's still worth sending an incomplete postmortem to meeting attendees in advance so that they can start reading through the document. This is an opportunity to discuss proposed action items and brainstorm other options. And remember, gain that consensus among leadership. As I mentioned, the ROI of proposed action items may not justify the work, right? This postmortem meeting is a time to discuss that. And then one other thing is, if you can develop good facilitators, that's really
helpful in the postmortem meeting. The facilitator role in
the postmortem meeting, it's different from the other participants. So you may
be used to a certain level of participation in meetings, but that will change
if you take on the role of a facilitator. The facilitator isn't
voicing their own ideas during the meeting. They encourage the group to speak
up and they keep that discussion on track. And this requires enough cognitive load that it's difficult to perform when you're also attempting to contribute personal thoughts to the discussion. For a successful postmortem meeting, it's helpful to have a designated facilitator who's not trying to participate in that discussion.
So good facilitators tend to have a high level of emotional intelligence.
That means that they can easily read nonverbal cues to understand how people
are feeling. And they use this sense to cultivate an environment
where everyone is comfortable speaking. Agile coaches
and project managers, they are often skilled facilitators.
At PagerDuty, we have a guild of confident facilitators who coach individuals
interested in learning how to facilitate.
So some of the things that facilitators do: they keep people on topic. They might need to interrupt to remind the team of meeting goals, or ask if it's valuable to continue with this topic or if it can be taken offline. They can also time-box agenda items, and once the time is done, they can vote on whether to keep talking about it or move on. They also keep one person from dominating. So as a facilitator, you want to say up front that participation from everyone is important. You want to explain what the roles and responsibilities of your job as a facilitator are, so that people won't be offended if you tell them to stop talking or if you ask somebody to speak up, and you want to pay attention to how much people are talking throughout the meeting. Some facilitator tips are things like, oh, I wasn't able to hear what person A was saying, let's give them a moment. Really, the facilitator is acting as a mediator to call out when people are getting interrupted. The facilitator also encourages contributions, and they see that if a team member hasn't said anything, you can get them to contribute by saying things like, let's go around the room and hear from everyone, or what stood out to you so far, or what else do we need to consider? I do want to point out again that
there's no single root cause of failure because systems
are complex. So remember that during this meeting we're
not looking for one single person or a single root
cause. We're looking for all of those contributing factors.
And it's really important to go back to avoiding that blame,
to not use the term root cause.
And so practice makes perfect. Use your postmortem practice even for mock incidents; for chaos engineering, complete postmortems at the end of every one of those experiments. And so a few of the key takeaways here are that postmortems should drive a culture of continuous improvement and help you to understand those systematic failures that led to the incident. Blame is bad; it disincentivizes knowledge sharing. Individual actions should never be considered a root cause. There's no single root cause, just contributing factors. And if you would like to learn more, you can head over to postmortems.pagerduty.com and learn more about our process. There's also some templates in there. And thank you all for your time.