Transcript
Hey, folks, how are you doing? I'm Ramon. I'm a site reliability engineer at Google in Zurich, Switzerland, and we are going to be talking about postmortem culture at Google today.
So first of all, we're going to cover an introduction of what postmortems are and how we write them. At Google, we have embraced failure as part of our culture, meaning that we know that everything is failing underneath us, right? We have disks, machines, network links failing all the time. Therefore, 100% is the wrong reliability target for basically everything. And what we do is use a reliability target under 100% to budget for our failure. So the complement of the reliability target, the SLO, is going to be our error budget. We use that for taking risk in our services: adding new features, changes, rollouts, et cetera.
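As a rough illustration of that arithmetic (the function name, window, and numbers below are mine, not from the talk), the error budget is simply the complement of the SLO over some period:

```python
# Minimal sketch, not Google tooling: the error budget is the complement of
# the SLO, expressed here as allowed downtime over a rolling window.

def error_budget_minutes(slo: float, window_minutes: int = 30 * 24 * 60) -> float:
    """Allowed unavailability (in minutes) for a given SLO over the window."""
    return (1.0 - slo) * window_minutes

# Example: a 99.9% availability SLO over a 30-day window
print(error_budget_minutes(0.999))  # ~43.2 minutes of budgeted downtime
```

Risky rollouts, experiments, and outages all spend from that budget, which is what makes it a tool for taking risk deliberately.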
So when something fails, what we typically do is write a postmortem. Postmortems are written records of an incident. Whenever something happens, either an outage, a privacy incident, a security vulnerability, or a near miss, where we have a problem with our service but it doesn't translate into an actual outage that customers see, we write a postmortem. A postmortem is the documentation of an incident.
So exactly what happened? What was the state of the system? What were the actions taken before and after the incident, and what was the impact, meaning what the customer or our users saw as a result of the outage? We want to detail a summary of the root causes and the triggers. The root cause is the actual reason why the incident occurred, and the trigger was the event that actually activated that root cause. For example, you might have a bug that was written into your code base years before, and it never materializes until you make some certain change in your system that exercises it, and then you get an outage, right?
Another key part of a postmortem is the action items. It's very important that within your postmortem you not only specify the root cause, the incident, and the status of the system, but also what you are doing so that this outage never occurs again. Postmortems are gaining popularity in the IT industry, but they are very common in other industries, like aviation and medicine. For example, when an airplane has a near miss at an airport, there is going to be a detailed analysis that has the same shape as a postmortem.
So why do we write postmortems? Basically, it's a learning exercise. We want to learn how the system behaves when certain changes, certain interactions, or certain problems in our backends happen. There are many root causes, and we need to understand how our systems behave. The reason we do this learning exercise is to prevent outages from happening again. Postmortems are a great tool for learning about the system, for reasoning about how the system works and how it reacts, especially for complex systems. And they enable us to take qualitative and quantitative measures and actions to prevent the system from responding in an unexpected or undesirable way to changes in the backends, changes in the system itself, et cetera. Right. It's very important that postmortems are blameless. Blameless postmortems means that we want to fix our systems and our processes, not people, right? At some point in time, we will effectively have someone pushing a production release or pressing a button or whatever. But that's not the root cause of the problem. The root cause of the problem is that our system was vulnerable to certain code paths, or our system incorporated a library that has a bug, or whatever it was, right? And the trigger was something that changed in the system: a new version came out, some customer behavior changed, et cetera.
In general, what we want to do with postmortems is learn and make our system more resilient and more reliable. Right. Another thing that we want to take into account for postmortem writing and analysis is: don't celebrate heroism. Heroes are people who will just be available all the time; you can page them all the time, they will put in long hours, et cetera. What we want to have is systems and processes that do not depend on people overworking or overstretching, right? Because that by itself might be a flaw in our system: if that person is not available for one reason or another, are we able to sustain the system? Is the system healthy? Right?
So just to emphasize: let's fix systems and processes, not people, right? So when do you write a postmortem? First of all, when you have an outage. If you have an outage, you have to write a postmortem and analyze what happened. Did your outage affect this many users? Did you have this much revenue loss? Et cetera. Classify it with a severity so you can understand roughly how important the outage was, and then write it and have an internal review.
It's important to define the criteria beforehand to guide the
person supporting the service so they know when to write one or not.
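As a purely hypothetical sketch of what such pre-agreed criteria could look like (the thresholds and field names are made up, not Google's):

```python
# Hypothetical criteria a team might agree on in advance; thresholds are
# illustrative only. If any check fires, the on-caller knows to start a postmortem.
CRITERIA = {
    "min_user_visible_downtime_minutes": 30,
    "min_revenue_loss_usd": 10_000,
}

def needs_postmortem(downtime_minutes: float, revenue_loss_usd: float,
                     data_loss: bool, near_miss: bool) -> bool:
    """Return True if the incident meets any of the pre-agreed criteria."""
    return (
        downtime_minutes >= CRITERIA["min_user_visible_downtime_minutes"]
        or revenue_loss_usd >= CRITERIA["min_revenue_loss_usd"]
        or data_loss
        or near_miss  # near misses are cheap to write up and worth it
    )

print(needs_postmortem(downtime_minutes=45, revenue_loss_usd=0,
                       data_loss=False, near_miss=False))  # True
```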
And as well, be flexible, because there are times when you might have some criteria, but then things change or the company grows, et cetera, and you might want to write a postmortem anyway. If you have a near miss, I would recommend writing a postmortem as well, because even if it didn't immediately translate into customers seeing your service down or your data being unavailable, it's interesting because you can actually use it to prevent the outage from happening for real in the future. Another occasion for a postmortem is when you have a learning opportunity. So if there is some intriguing kind of alert that you hit, if you have other customers that are interested, if you see that this potential near miss that you have had, or this potential alert, could actually escalate in the future into something that your customers could see, or that could become an actual outage, write a postmortem. You might want to do it a bit lightweight, with less review, or have a postmortem that is just internal to the team. That's totally okay. It's also nice documentation for the risk assessments of your service. And as well, it's a nice way of training new members of the team who are going to be supporting the service on how you write postmortems, and then they will have a trail of postmortems that they can analyze and read to understand how the system works and the risks that the service has.
So who writes a postmortem? In principle, there's going to be an owner who is going to contribute or coordinate the contributions. That doesn't necessarily mean this person is going to write every single line of the postmortem. The popular choices are usually the incident commander who was responding to the outage, or anyone in the dev or SRE team who is a TL or a manager, depending on how your company is organized. This owner will ask for contributors. There's going to be an SRE team for another service that was affected, or there was a backend that was somehow affecting your service, so they will need to contribute things like the timeline of events, the root cause, and the action items they are going to take on. The same goes for other dev teams or SRE teams that were impacted. So it's not only the people or the teams whose services impacted your service's reliability, but also how you, or your service, affected other products in the company, right? All of this is a collaborative effort, and producing a good postmortem is something that takes time. It takes effort and will need reviews and iterations until it's an informative and useful document.
Who reads the postmortem? The first class of audience is going to be the team. The team that supports a service will have to read and understand every single detail of all the postmortems that happen for the service, basically because that's how they understand which action items they will have to produce, what the priorities are for their own project prioritization process, et cetera. Then the company: if you have postmortems with wide impact, near misses, or cross-team postmortems, it's interesting to have some people outside of the team review them, like, for example, directors, VPs, architects, whoever in the company has a role in understanding how the architecture of the services and the products fits together. In this case, the details might not be as needed, or not in as much depth. An executive summary, for example, would certainly help in understanding the impact and how this relates to other systems, but it's definitely a worthy exercise to do. And then customers and the public are another part of the audience for postmortems that is interesting to consider. If you, for example, run a public cloud or a software-as-a-service company, you will have customers that trust you with their data, their processing, whatever it is you offer them, right? When you have an outage, the trust between you and your customers might be affected, and the postmortem is a nice exercise to actually regain that trust. Additionally, if you have SLAs, for example, and you are not able to meet them, postmortems might become not only something that is useful for your customers, but something that is required as part of your agreements.
So what will you include in the postmortem? This is the bare minimum postmortem that you can write. The minimal postmortem will include the root cause, which is in red here. In this case it was a service, a product, that had some canary metrics that didn't detect a bug in a feature, right? The feature was enabled at some point; that was the trigger. Remember that the root cause might be just sitting in your code for a long time, and until the trigger happens, it will not be exercised and therefore the outage will not materialize. And then you have an impact measure. In this case, the product ordering feature for this web or software-as-a-service product was unavailable for 4 hours and therefore yielded a revenue loss of so many dollars. It's interesting to always link your impact to business metrics, because that makes everyone in the company able to understand exactly where the impact is, even if they are not directly from the same team.
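As a back-of-the-envelope example of that kind of translation (all numbers here are invented, not from the incident in the slide):

```python
# Hypothetical translation of downtime into a business metric everyone can read.
outage_hours = 4
orders_per_hour = 1_200            # assumed normal rate for the ordering feature
avg_order_value_usd = 35           # assumed average order value
fraction_affected = 1.0            # feature was fully unavailable

revenue_loss_usd = outage_hours * orders_per_hour * avg_order_value_usd * fraction_affected
print(f"Estimated revenue impact: ${revenue_loss_usd:,.0f}")  # $168,000
```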
Additionally, we have an action plan. In this case there are two AIs, two action items. One is to implement better canary analysis so we can detect problems, right. That is a reactive measure: when something happens, you are able to detect it. And then there is a preventive measure, which is meant to avoid things happening again, which is that all features will have a rollout checklist, for example, or a review, or whatever it is. What you also incorporate there is the lessons learned, which are very interesting to have: what went well, what went poorly, where we got lucky, right? And those are interesting questions, because it's never the case that the outage is all negative. There may be things that worked well. For example, the team was well trained, we had proper escalation measures, et cetera. And then some supporting materials: chat logs, because when you were responding to the outage, you were typing into IRC, into your Slack, into whatever chat you use in your company, right. Metrics, for example screenshots or links to your monitoring system that show the metrics and so on, for posterity, to understand, for example, how to protect against these problems, and so on. Documentation, links to commits, excerpts of code that were, for example, part of the root cause, or for incorporating measures, like when you review some commits, et cetera. That is also interesting to have in your postmortem. Useful metadata to capture: who is the owner, who are the collaborators, what is the review status if there are reviews happening, who is signing off on the postmortem so the list of action items is validated and can move into implementation. Right. And then we should have, for example, the impact and the root cause. That's important: we want quantification of that impact. What was the SLO violation, and what was the impact in terms of revenue or customers affected, et cetera.
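One possible way to picture those fields together is a small template; the shape below is only a sketch of what the talk describes, with field names that are my own assumptions, not a Google-internal schema:

```python
# Illustrative postmortem skeleton; field names are assumptions, not a standard.
from dataclasses import dataclass, field

@dataclass
class Postmortem:
    title: str
    owner: str
    collaborators: list[str]
    status: str                        # e.g. "draft", "in review", "signed off"
    impact: str                        # quantified: SLO violation, revenue, users affected
    root_cause: str
    trigger: str
    detection: str                     # alert, customer report, ...
    action_items: list[str] = field(default_factory=list)
    lessons_learned: dict = field(default_factory=dict)   # went well / went poorly / got lucky
    timeline: list[str] = field(default_factory=list)
```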
The timeline is something that is very interesting: a description of everything that happened, from the root cause being introduced, to the trigger, to what the response was. And that's a nice learning exercise for the team that is supporting a service, to understand how the response should go in the future. Right. This is the postmortem metadata. You have things like, for example, the date and the authors. There is an impact measure; in this case, I think it's very interesting because the impact is measured in queries lost, right, but there is no revenue impact. There could be, for example, postmortems that do have an actual hard revenue impact in there. And then you have the trigger and the root cause. See that the root cause, for example, in this case is a cascading failure through some high load and so on, right, that was in the system. That vulnerability that the system had for this complex behavior was there, and it only materialized when the trigger, the increase in traffic, actually exercised that latent bug that you have in your code, right? And you have detection, which is who told you: it could be your customer, your monitoring system sending you an alert, et cetera.
The action plan. In this case, we have five action items. I think it's important to classify them by type; the classification into mitigation and prevention is interesting because you will have action items that reduce the risk, right? Risk is always the probability of something materializing times the impact of that risk. Mitigation reduces either the probability of something happening or its impact. And then you have prevention, which is: we want to reduce the probability of something happening, ideally down to zero, so it never happens again.
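A rough way to read that framing (the probabilities and dollar figures below are invented):

```python
# Sketch of the risk framing above: risk ~= probability x impact.
# Mitigation lowers probability or impact; prevention drives probability toward zero.

def risk(probability: float, impact_usd: float) -> float:
    """Expected cost of a failure mode."""
    return probability * impact_usd

baseline = risk(probability=0.10, impact_usd=200_000)          # 20,000.0
with_mitigation = risk(probability=0.10, impact_usd=50_000)    # e.g. faster rollback caps impact
with_prevention = risk(probability=0.01, impact_usd=200_000)   # e.g. rollout checklist cuts probability

print(baseline, with_mitigation, with_prevention)  # 20000.0 5000.0 2000.0
```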
Now, learnings and timeline. This is something that is very interesting, at least for me; the timeline is my favorite part of a postmortem. Lessons learned are things that went well: for example, in this case, the monitoring was working well for this service. Things that went wrong are stuff that are prime candidates for becoming action items to solve. And you won't always be lucky; we just need to realize that. There are some places where we got lucky in this case as well, and those are also prime candidates for action items. You don't want to depend on luck for your system reliability.
And then you have the timeline. You see that the timeline covers many items, right, and the "outage begins" entry is not exactly at the beginning. You see that there are some reports happening for the Shakespeare service and a sonnet, and there could be even older entries, like "this commit was incorporated in the code base", and that commit contained the actual bug that was latent for months, even, right. And then there was the trigger, and the outage actually began.
So, the postmortem process. First of all, how do you go through it? Do you need a postmortem, yes or no? Yes. Then let's write a draft. The draft is something you need to put together very quickly with whatever forensics you can gather from the incident response, like logs and the timeline. Just dump everything into the document. Everything. Even if it's ugly or disorganized, just dump it so you don't lose it, and then you can work it over and make it a bit prettier. Then analyze the root cause: internal reviews, clarify, add action items, et cetera. And then publish it: when you understand the root cause and you have reviewed the action plan, publish it and have reviews. And then there is the last part, which is the most important. You need to prioritize those action items within your project work for your team. Because a postmortem without action items is no different from nothing at all. The action items need follow-up, execution, and closure so that the system actually improves.
So, AIs: action items. As I was saying, a postmortem without action items is indistinguishable from no postmortem at all for our customers. And that's true, because you might have a postmortem, but if it doesn't have action items, the customer won't see any improvement in the service. And if you have an action item list that you don't follow up on, the system remains in the same state as it was prior to the outage.
So how are you going to go about understanding your root causes? Five whys. The key idea is asking why until the root causes are understood and actionable. This is very important because the apparent root causes might just be red herrings that are not the actual ones, so you need to keep asking until you know what the root cause was, because that's how you are going to derive action items that are nice and actually improve your system. In this case, users couldn't order products worldwide. Why? Because feature X had a bug. But why did it have a bug, right? Because the feature was rolled out globally in one step, or we were missing a test case for X; both can happen. Why? Because the canaries didn't evaluate that, and so on, until you have it well defined and crystal clear.
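As an informal illustration, you could even jot the chain down as data and check that the last answer is something you can act on (the later answers below are illustrative extrapolations of the example in the talk):

```python
# Five-whys chain written out as data; the final answer should be actionable.
five_whys = [
    ("Why couldn't users order products worldwide?", "Feature X had a bug."),
    ("Why did the bug reach all users?", "The feature was rolled out globally in one step."),
    ("Why didn't tests catch it?", "We were missing a test case for X."),
    ("Why didn't canaries catch it?", "Canary analysis didn't evaluate that code path."),
]

for why, answer in five_whys:
    print(f"{why} -> {answer}")
```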
Best practices for the action plan. There are some action items that are going to be band-aids, short-term stuff. Those are valid, but they shouldn't be the only thing you do. It's nice to have some action items that make the system incrementally better or more resilient in the short term, right, but you need to do the long-term follow-up as well. We also need to think beyond prevention, because there might be cases where you can just prevent it from happening 100%, and that's ideal, right? But you also want to mitigate: reduce the probability of something happening, but as well, if some risk materializes, reduce the impact of it affecting your service, right? And then, humans as a root cause is a problem, because you can't have action items fixing humans; it should be the processes or the system. Remember that. So don't fix humans; fix the documentation, fix the processes for rolling out new binaries, fix the monitoring that is going to tell you that something is broken.
So you have your postmortem done and published, right? Excellent. Now, we have review clubs, a postmortem of the month, and so on in the company, especially in a company as large as Google. I think that's interesting for socializing postmortems and for other people to understand what failure modes a system has. Because if a system, like my systems, for example, which are the authentication stack, has some failure modes that I'm subject to, perhaps other systems that are similar will have them too. So it's an interesting exercise to read how other teams, sorry, other services fail, right? So I can see, wait, am I subject to that? So I can prevent it. And as well, the wheel of misfortune is a nice replay for training. When a new team member joins, we say, let's just take this postmortem and replay it, and see how the response would go and how we would approach it. So it's a nice learning exercise as well.
So how do we execute on action plans? First of all, we need to pick the right priorities, right? Not all of the action items in your postmortem are going to be of the highest possible priority, because you have limited capacity to execute on them, so you perhaps need to choose and address them sequentially. Reviews are very important, so you have to review how you are progressing and whether your burn rate of action items is actually getting you to completion.
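A toy version of that kind of burn-rate check could look like this (the tracking fields and dates are made up):

```python
# Toy action-item tracker: is the list actually converging to closure?
from datetime import date

action_items = [
    {"id": "AI-1", "opened": date(2024, 1, 10), "closed": date(2024, 2, 1)},
    {"id": "AI-2", "opened": date(2024, 1, 10), "closed": None},
    {"id": "AI-3", "opened": date(2024, 1, 15), "closed": None},
]

closed = sum(1 for ai in action_items if ai["closed"] is not None)
print(f"{closed}/{len(action_items)} action items closed ({closed / len(action_items):.0%})")
```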
And then have some visibility for the executives, even if your postmortem might be one of those that is not reviewed by an executive for one reason or another, right? It's nice to have high-level visibility as well, because of your customers: your customers, either teams in the company or your actual external customers, can see that, and can see that you are making progress to make the service better.
So that's all I have for today. This is all about postmortems, but there are many more scenarios and many more angles to site reliability engineering. So we have these two books: the first one is the original SRE book, which covers the principles and the general practices, and the second one, the workbook, is an extension of the SRE book that tells you how to put them into practice. We dedicate a lot of space in these books to postmortems, action items, and incident response, which might be interesting for you to read if you have enjoyed this talk. And that's all, and thank you very much for watching.