Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hi everyone, my name is Nishant and I'm really excited to be here today at the Conf42 incident
management event to talk to you about that very topic, incident management.
So after this session, you should have a good idea of how to run an effective incident management process within your team or company,
how to minimize the impact to your users or customers,
and finally, how to continue to learn and improve from past experience.
So before we get started, here's a little about myself.
I'm the engineering manager for the ad services platform team at Pinterest, and our team owns multiple critical systems that power a multibillion dollar ad business. In that role, as well as through my involvement with our various incident manager on-call rotations, I've experienced several high severity incidents and learned a lot from that process.
So I'm hoping these learnings can be of use to others who are in similar
positions and help you run an effective incident response process of
your own.
So to start off, let's define what an incident is.
There are a few different situations that we may classify as an incident.
For example, if a system is not behaving as expected, let's say a particular system
has higher latency than normal. Perhaps when users aren't able
to access our service, they can't log in, they can't get to the front page,
or when employees aren't able to do their work, if employees can't submit pull requests,
can't comment on other people's code, so on and so forth.
So we can say an incident is an event that is not part of the regular operations of a service, causing an interruption in its availability or a reduction in its quality.
Then you might say, well, the same also applies to most bugs, right? They're also a manner in which a service may experience interruptions or quality degradations.
And you'd be completely right. That is also what a bug usually is.
So then we ask ourselves, what is an incident really?
How do we differentiate one from a bug? And especially
since we're saying we need a process for handling incidents, we also therefore need
an actionable definition. So here's
how I define an incident. I say an incident is when we need
one or more people to stop everything else that they're doing and come help us stop the bleeding immediately. This is also what differentiates
an incident from a bug, which typically will also have an SLA
for response and may be treated urgently as well.
But for instance, a bug may only require a response during business hours,
and typically only from the team that owns the component that is facing
the bug.
So now that we've defined what an incident is and talked about how it's different
from a bug, let's talk about what incident management looks like.
There's typically five main phases. First of all,
detection. The detection might happen through automated alerts
from our observability or monitoring systems, or through manual reports.
After that, triage the situation,
identify which component is responsible, and identify how
severe the impact is. Next, we go to mitigation, which means we have to stop the bleeding as soon as possible, minimizing the impact to our users, to our employees, so on and so forth. After that, we can talk about prevention. How do we figure out what the root cause was and actually solve it, so that we're confident that this issue is not going to reoccur? And then finally, remediation. What are the follow up action items that we need to make sure that our systems are more resilient to such situations in the future?
So the first thing to discuss is the process of actually declaring
an incident. So given our definition above, we expect someone to
drop everything and respond, which means we need a clear set of criteria for
defining an incident. So let's talk about
when. And to talk about this, I'm going to borrow a
story from the famous Toyota manufacturing line.
Toyota had something called an Andon cord, which was a rope that
any employee on their manufacturing line could pull whenever they found a problem
with production. Pulling this cord would immediately halt
production. The team leader would come over and ask why the cord was
pulled, and then the teams would work together to solve the problem before resuming production.
So in software engineering, we typically don't need to pause all of production.
There may be some really severe cases where we do. However, typically we just
ask one team or a handful of team members to pause their other work, respond to the incident, and help production resume as normal. So based on this, we can say that we should create an incident when we need immediate support.
And what that means will be different between
different components of an organization, different teams, so on and so forth. For instance,
an ads team may define their incident criteria based on revenue impact. Another team may use login failures, app crashes, stale user content, high latency, so on and so forth, content safety, et cetera. So the next question is,
who should be declaring the incident? And the answer here is very easy.
Anybody who notices the impact should be empowered
to declare an incident. And it's extremely, extremely important to empower everyone
and anyone in your team or organization to declare an incident and
make it easy to refer to and understand the incident criteria so
they can do so.
The last question is, how do we declare an incident? Going back
to the Toyota story, all they had to do was pull this cord to declare
that something was wrong. And we need to make it just as easy. So different teams may adopt different processes. We may have a Google form that you fill out to immediately file an incident. You may file a certain type of Jira ticket under a project, which declares an incident, so on, so forth.
Whatever the preferred method is, it just needs to be extremely easy.
And when we file that ticket or fill out that form,
basically when a new incident is declared, we need to make sure that someone is notified. Someone knows that something is wrong. This may be the system owner. Perhaps in the form you can select what system is impacted. It could be the person on call for that system, or the incident manager at the team or company level. It doesn't matter. Someone needs to be notified.
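To make that concrete, here's a minimal sketch of what the notification step behind such a form or ticket might look like. The system names, contacts, and the notify() transport are hypothetical examples, not any particular company's tooling.

```python
# Minimal sketch: route a new incident report so someone is always notified.
# System names, contacts, and the notify() transport are hypothetical.

ON_CALL = {
    "ads-serving": "ads-serving-oncall@example.com",
    "login": "identity-oncall@example.com",
}
DEFAULT_CONTACT = "incident-manager-oncall@example.com"

def notify(contact: str, message: str) -> None:
    """Stand-in for a paging or chat integration (email, Slack, PagerDuty, etc.)."""
    print(f"NOTIFY {contact}: {message}")

def declare_incident(reporter: str, impacted_system: str, description: str) -> None:
    # Anyone who notices impact can call this; even if the impacted system
    # isn't recognized, the incident manager on call still gets paged.
    contact = ON_CALL.get(impacted_system, DEFAULT_CONTACT)
    notify(contact, f"Incident declared by {reporter} on {impacted_system}: {description}")

declare_incident("jane@example.com", "login", "Users report login failures")
```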
So now that we know what an incident is and how to declare one,
how do we tell which incidents are more severe than others?
So we know that all incidents are not the same. Even though we're
saying that incidents require one or more people to drop everything and respond,
the scope of how many people are expected to respond may vary
greatly based on the impact. For instance, let's say you're a
company or team that has an Android app, and you
have 1% of Android users reporting that the app occasionally crashes
every couple of days, versus we say that 20%
of your Android users cannot log into the app at all.
Those are two vastly different scenarios, and the number of people
or teams expected to respond may be vastly different as well.
Which means, as we said before, every team, component,
or service within your company's architecture needs some customized criteria
for incidents.
The general framework we can adopt is how many people need to respond or
need to know, since we're kind of using this framework to determine what an incident
is anyway. So typically at most places,
we think about incident severity levels as starting at sev three or sev four for the lowest severity incidents, and then going all the way up to sev zero, which is typically a complete outage, all hands on deck.
We need to fix the situation immediately. So one
way to think about this is how many people are expected to respond?
Or how many people, or how many levels up do we need
to notify about the impact? So again, going back to some of
these examples, we said, let's say if there's only an elevation
in latency, there's a regression in latency. Your home page is loading slower
than it usually does, that's probably a low severity incident. Let's say it's a sev three or sev four. Typically only the team that owns that component, their manager, and a few others might need to know or respond to the situation.
However, if user login is down and no one can get
into your site or your app, typically company execs would like to
know. And that sort of gets us into the sev zero or sev one criteria.
So these are two frameworks to think about how to define your severity levels.
Based on that, then you will define certain thresholds for different parts of
your stack, your architecture, different components.
You'll come up with different thresholds to define what an incident is.
So if there's $1 in revenue loss, that is one level of
severity. If there's $10 in revenue loss, that's a higher level. So on and
so forth.
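To illustrate what per-component criteria could look like in practice, here's a minimal sketch. The components, metrics, thresholds, and dollar amounts are made-up examples, not actual values from the talk.

```python
# Minimal sketch of per-component severity criteria.
# All components, metrics, and thresholds below are illustrative assumptions.

SEVERITY_THRESHOLDS = {
    # component: list of (metric, threshold, severity), checked from most to least severe
    "ads": [
        ("revenue_loss_per_hour_usd", 100_000, 0),  # sev 0: all hands on deck
        ("revenue_loss_per_hour_usd", 10_000, 1),
        ("revenue_loss_per_hour_usd", 1_000, 2),
    ],
    "login": [
        ("login_failure_rate_pct", 20, 0),
        ("login_failure_rate_pct", 5, 1),
        ("login_failure_rate_pct", 1, 3),
    ],
}

def classify_severity(component: str, metric: str, value: float) -> int | None:
    """Return the matching severity level, or None if incident criteria aren't met."""
    for m, threshold, severity in SEVERITY_THRESHOLDS.get(component, []):
        if m == metric and value >= threshold:
            return severity
    return None

print(classify_severity("login", "login_failure_rate_pct", 20))  # -> 0
```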
So now that we understand when to declare an incident, what do we actually do once that happens?
The first thing is some administrative work to take care of.
Open up your communication channels. This may include starting a
chat room or a slack channel, setting up a video conference
if you're in the office, even marking out a physical room that you're going to use to respond to this incident, do all your communication,
so on and so forth. Optionally, you may also choose to create
an investigation doc to keep notes. These may include what steps were taken, who got involved, what the impact is, et cetera. This can sometimes be helpful if the incident response requires many parallel tracks of investigation. If we're
expecting more and more people to jump in over time, this document can become the
source of truth for everyone to quickly ramp up and figure out
what the situation currently is.
Next, we need to define and assign some critical roles.
So the first one we have here is the incident runner and
this person is responsible for the outcome of the incident response.
That means they're responsible for helping drive this incident to a
resolution. This may involve communication with the
various folks involved in the investigation,
escalating to other teams or other folks on the team, and actually supporting
the investigation and the debugging process.
The other role is the incident manager role and this person is
responsible for coordinating the incident response. So what that means
is they're here to help the team understand the impact,
determine the severity, identify an appropriate incident
runner. This may be the subject matter expert, the system owner,
the on caller, so on. They're also here to support the incident runner,
handle broader communications. For example, status updates to a broader audience,
cross team escalations, so on and so forth. The incident manager
is also overseeing the process, asking the right questions and helping the team prioritize
between next steps. They're also here to keep the team calm and
like I said, prioritize those next steps. And then finally, it's really important that
this incident manager is someone who's a confident decision
maker with good problem solving skills.
So let's talk about how we do this impact assessment. How do we figure out what the severity is? This is extremely important to understand whether we need to escalate to other teams, what channels of communication we need to engage, and whether this is something that is user or potentially partner facing. If you have other businesses as your customers, for instance, you may need to send out external communications as well to inform them of your service's outage or degradation in quality. Finally, assessing the impact
is also really important to determine what your resolution process
is. For instance, let's say you have an incident late at
night, it's two in the morning, or an incident that's in the middle of the day when everyone is around, your whole team and your company are around to help with the response process. Based on the severity of the incident, we may choose different processes to resolve it. So, for instance, if it's the middle of the day, we might be okay with a hot fix or even rolling forward. Late at night, we may prefer to be extra cautious and only roll back.
So we're not introducing any new changes. We're not introducing
any new potential sources of instability to make the situation worse.
Once we've done this, once we've figured out the impact and determined the severity, the incident runner and incident manager can start to take on different responsibilities. The incident runner can start to play a more active role, actively debugging and escalating to relevant people or teams for support, while the incident manager takes care of internal comms to teams, to the overall company, to execs if necessary, and external comms, like we said, if partners are impacted.
The first priority of the incident response process here is always to stop the bleeding first. The root cause can wait if it's not immediately clear. Oftentimes it is, but if it's not, we need to first treat the symptoms. So this
may involve rolling back suspicious changes. If some
services were deployed, if some experiments were rolled out, if some config flags
were changed during this time, even if we don't feel,
based on our knowledge of the systems, that these changes should cause this incident,
if the timeline aligns, it's still worth rolling them back just to
rule them out. The other option here is to just put up a quick patch
to alleviate the symptom if we don't know the true root cause,
or if it'll take longer to fix the underlying root cause.
One example of this is if you have a data dependency and
the system you're calling is suddenly returning corrupt data or empty data,
and your service is crashing, we could potentially just put up a quick patch
to add a fallback path for that missing data.
thereby treating the symptom, stopping the bleeding,
and minimizing the impact to our downstream users.
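Here's a minimal sketch of that kind of quick patch, a fallback path for a misbehaving data dependency. The fetch_recommendations() dependency and the fallback content are hypothetical.

```python
# Minimal sketch of the "quick patch" idea: a fallback path so corrupt or empty
# data from an upstream dependency doesn't crash our service.
# The upstream call and the fallback content below are hypothetical.

import logging

FALLBACK_RECOMMENDATIONS = ["popular-item-1", "popular-item-2", "popular-item-3"]

def fetch_recommendations(user_id: str) -> list[str]:
    """Stand-in for a call to the upstream data dependency."""
    raise RuntimeError("upstream returned corrupt data")

def get_recommendations(user_id: str) -> list[str]:
    try:
        recs = fetch_recommendations(user_id)
        # Treat empty or malformed responses the same as a failure.
        if not recs:
            raise ValueError("empty response from upstream")
        return recs
    except Exception:
        # Stop the bleeding: serve a degraded-but-working fallback instead of crashing.
        logging.exception("Falling back to default recommendations for user %s", user_id)
        return FALLBACK_RECOMMENDATIONS

print(get_recommendations("user-123"))
```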
Once we've done this, once we've stopped the bleeding, cured the symptoms,
we can update our incident status to mitigated, informing everyone
that the impact is now taken care of. We can now focus on the next
steps. The next step, naturally, is to
resolve the root cause, understand what caused the system to get into a
state of an incident, and ensure that we're completing any action items needed
to resolve this, or guarantee that the issue is not going to reoccur
in the near future. What that near future horizon
looks like is sort of dependent on your component or team. But for instance,
you may say that as long as we have the confidence that this incident or
this issue is not going to reoccur for a month, we're okay with saying that the incident is resolved. The second part of resolution is ensuring that all your systems are restored to regular operations. So, for example, if you chose to pause your continuous deploys, if you rolled back any live experiments, or if you made any config changes or undid them, making sure all those are back to normal is a key part of ensuring that you can say that the incident is truly resolved
because your systems are now back to regular operations.
And then lastly, once we've gone through this process of resolution,
sending out a final update to internal and external parties to
inform them, perhaps you might want to inform them what the root cause was,
but most importantly, communicate that it has been resolved, our systems are back
to normal, and we are confident that our systems are stable for a certain
amount of time.
So at this point, we are confident that our incident is resolved, there's no more
impact. We can now start to think about the postmortem process.
So what do you do afterwards? First off,
why do we need a postmortem process? Our goal is to
continue to make our systems more resilient. Like we said earlier, we want to
learn and improve from our past experiences.
So we need to first define and follow a rigorous postmortem
process. This may involve creating a postmortem
document template so that we know that everyone is following a fixed
template, including the same level of detail and information
information that the broader team needs to identify
gaps and critical remediation items. The second
step for a postmortem process is to typically have an in person review
once the postmortem document is complete. Getting the team together to
ask questions and discuss areas of improvements is a critical part of this process
as well. The attendees for this postmortem review are typically the people who responded to the incident, the incident runner, the incident manager, and any other key stakeholders such as the system owners, the on callers, or SREs who have knowledge about how to improve stability overall. All these
folks should be required, or at least encouraged to attend this review.
And then finally, it's really important to make sure that we have an SLA
to complete this postmortem process and any remediation items that we identified as part of it. This helps us as a team and as an organization ensure that we're holding a high bar for quality
and stability, and that none of these tasks are slipping through the
cracks.
So let's talk a little bit about what this postmortem document might look
like. I find that these are sort of five things that are really important to
cover. You may choose to add more. These are the five things that
I strongly recommend. First of all, talk about the impact,
what happened? For example, if we were suddenly charging advertisers too much for their ad insertions or ad impressions, we may need to process refunds,
so on and so forth. This also involves describing
the root cause that caused this impact. Next,
providing a detailed timeline of events, including when the incident began.
So when the symptoms first started, what time did we detect the incident, when was the incident mitigated, that is, when did we fix the symptoms, and then when did we actually resolve the root cause? Based on how long
it took to detect, mitigate and resolve our incident,
we may be able to identify action items to
reduce that time. For instance, if it took really long to detect an incident,
a very clear action item here is that we need better alerting.
So then the last part that follows is to write down our remediation
ideas to improve these metrics above and identify owners
for these remediation items.
So one thing I want to call out is that it's really important to keep
our postmortem process blameless. Focus on the what and the why, not the who. Humans make mistakes.
It's natural, it's expected. We learn from them,
systems need to handle them. For instance, if we're
seeing our system crash due to a null pointer exception,
the questions we might ask here are, why did that happen? Why was the change
not detected by unit tests, by integration tests, or through an incremental rollout process causing failures in canary first, and so on and so forth? Or if someone accidentally causes a corruption or deletion of a prod database table, why is such access available in the first place? Why were they able to make such
a change? Asking the
right questions is really going to help us get to the point,
get to the root cause,
the underlying root cause, and identify what we need to do to fix
it going forward. A really
helpful technique, also coming from Toyota, is this idea of the
five whys, so we can ask the question why five times,
give or take, to determine the root cause of a problem.
There's two important guidelines to follow here. Make sure you're never identifying
a person or team as a root cause.
Secondly, five is a guidance. It may be more or less.
So let's take an example. The what here is that users can't log into their accounts. So we ask the first why, and we determine that the API is rejecting the login request. Okay, that's clearly a problem, but not really our root cause. The second why tells us that the API can't talk to our authentication service.
Next, we say that it can't do this because it doesn't have the
right SSL cert. Next, why did this
happen? Because someone copied the wrong config to prod. So now we're getting
somewhere. But based on our guideline, we said we never want to identify
a person or a team as the root cause. So we're going to ask ourselves
why one more time. Here we come to the answer that there's no validation
for the API config. So clearly, as we
went through this process, we dove deeper and deeper, peeled back the layers,
and identified that our root cause here is that there's no validation.
So a very clear fix, very clear remediation item,
is to ensure that we have such validation in place so that no one accidentally
or maliciously can cause it to change in the future.
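As an illustration of that remediation item, here's a rough sketch of what pre-deploy validation for such an API config could look like. The field names and checks are assumptions for the example, not the actual config from the story.

```python
# Minimal sketch of the five-whys remediation: validate the API config
# (including its SSL cert) before it can be promoted to prod.
# Field names and checks are illustrative assumptions.

import ssl

def validate_api_config(config: dict) -> list[str]:
    """Return a list of validation errors; an empty list means the config is safe to deploy."""
    errors = []
    for field in ("auth_service_url", "ssl_cert_path"):
        if field not in config:
            errors.append(f"missing required field: {field}")
    cert_path = config.get("ssl_cert_path")
    if cert_path:
        try:
            # Raises if the cert file is missing or unparsable.
            ssl.create_default_context().load_verify_locations(cafile=cert_path)
        except Exception as exc:
            errors.append(f"invalid SSL cert at {cert_path}: {exc}")
    return errors

errors = validate_api_config({"auth_service_url": "https://auth.internal"})
if errors:
    raise SystemExit(f"Refusing to deploy config: {errors}")
```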
All right, now I'm going to dive into some lessons that I've learned over time.
First off, it's really, really important to destigmatize
incidents within your team and within your organization. Going back to Toyota's Andon cord story, someone asked them about this process where any employee had the right to pull the cord, and Toyota took issue with that wording. They said that employees didn't just have the right to pull that cord. They were obligated to pull the cord if they
saw anything wrong. It was part of their responsibilities as an employee.
So it's really important to reward our incident runners
and our incident reporters, treating incidents as a learning opportunity. Celebrating our incident responders and reporters will help continue to encourage this good behavior.
We should empower our employees to speak up when they see some impact.
It's really important as team managers and as leaders to
publicly praise those employees who identified the impact,
declared the incidents, as well as the ones who actually spent time
resolving them. Oftentimes incidents happen late at night.
People stay up all night resolving these incidents. It's really important to recognize
that hard work. It's also really important to reward them.
Employees who spend these long hours could receive rewards. These may be in the form of shoutouts, improved visibility, perhaps dollar bonuses, so on and so forth. And also encourage them
to take time off to recover when they're up late at night.
My opinion is that false positives here are significantly better than false negatives.
Obviously, there's a trade off. You don't want to get to a point where
everyone's declaring an incident at the drop of a hat, which is why, as we talked about earlier, we need a really clear set of incident criteria. However, it is always better to file an incident prematurely, get immediate support, resolve the issue, and then later decide that the impact actually didn't end up being significant enough to warrant a full incident and
a full postmortem process. At that point, we can just
lower the priority, lower the severity, and call it a bug or something like that.
The alternative, though, is that employees are not sure whether the impact is big enough yet to declare an incident, so they hold off, and by the time they decide that it is, the impact might be really big and balloon sort of out of control, requiring a lot more people to jump in to support and taking just a lot longer till we actually resolve the issue, leading to a much higher impact than we would necessarily have had if the person had just filed an incident earlier.
Okay, the next part is specifically
for the incident manager. The guiding principle for
an incident manager to follow is to minimize risk and drive
resolution. A key part of this is being confident.
So let's take a few examples. Let's say we have
an event where an ad system is misbehaving and causing our ad
load, which we define as the percentage of ads users see
compared to non-ad posts, to be much higher than normal.
So users are used to seeing only two
out of ten posts being ads. Suddenly that number goes up to six out of
ten. This could also be causing
a drop in ad relevance, users suddenly seeing ads that are no longer relevant to
them. In this case, is it better to turn off ads entirely and take that revenue hit, or show irrelevant ads for a period and potentially
impact our user experience, maybe even causing them to not come back to our site?
There's no easy answer here and the team is going to struggle with this question.
So this is one example where it's really important for the incident
manager to step in and help the team confidently make a decision.
This doesn't mean you're solely responsible for that decision
or making that decision. It means you have to be the one to
find the right information, whether that is predocumented, or identify the right person to make that call, and just guide the
team through that whole process.
Another example, like we talked about before, there's many ways we may choose to resolve
an incident. We may need to pick between these three options.
Do we roll back a change? Do we fix forward, which would include all changes
since the last time we deployed, or do we hot fix by cherry picking a patch on top of our latest deploy?
Again, going back to our guiding principle, we said we need to minimize risks.
So factors like time of day, availability of other teams, et cetera, come into play, and the incident manager can help make this decision.
Another example is if the incident runner or any
of the incident responders are sort of panicked and not able to make progress, the incident manager can step in and help them collect themselves, calm down, and refocus. Alternatively, if we feel like the incident runner hasn't quite had the experience yet that they need to effectively run this incident, identify the next best person to do so, and encourage this person to stick around and learn from the experience so that they can be
an effective incident runner the next time.
One last example is for these late night and after hours incidents,
the incident manager can help make a call on this question,
like should we pause our investigation till business hours when other teams are
available, or should we go ahead and wake other people up right now?
This involves understanding the impact, the severity, and just being confident in making this decision, ensuring that the team also, in turn, feels confident with the decision that they've made.
So how do we do this? As an incident manager, it's really,
really important to remember that it's okay to ask for help.
Most of the time, you will likely not be the subject matter expert in the room. You might not ever have worked with the system that is encountering this incident. So rely on your incident runner
or the subject matter expert. These may sometimes be the same person, sometimes they
may not, but rely on them to help make this call, or rely
on them to at least give you the information that you need to make difficult
calls. The other thing is sometimes
when you're the incident manager, there may be several incidents ongoing at the same
time. So loop in other folks to help, loop in other folks on the rotation. It's completely okay to ask for help, because the alternative is that you're just juggling too many balls. You don't know what's happening in each incident response process, things sort of slip through, and we may be in a situation where we're taking much longer to resolve the incident than is good or than could have been avoided.
So in order to ask for help, we need to know who to ask
for help. So make a list of these key contacts who can help make
these hard decisions. If we decide that we want to turn off all ads for
our site, typically you might not be the one who
is responsible or has the power to make that decision. So loop in
the folks who can. It's really important to document those so that it's easy
to pull them in when you need. The second one is
who to reach out to for external support. In today's world, a lot of
services are running on hosted infrastructure from partners like Amazon, Google, et cetera, or, for instance, relying on third party services like PagerDuty for alerting, for other communications, and so on and so forth.
If we rely on some of these services for critical parts of our
own applications, who do we reach out to for external support?
You don't want to be in a situation where one of these third party services
is down and you're scrambling at that point to figure out who to reach out
to. So it's really important to make this list ahead
of time and continue to update it as and when you find gaps.
The other lesson I learned is that it's really important to measure. Our goal
as part of an incident management process is ultimately to reduce downtime for
our services. However, how do we do this if we're
not measuring this in the first place? So what are some things that we can
measure? There's three metrics that I found to be very useful.
First off, MTTR or mean time to recovery. How long
did it take us from when the incident began till we resolved the incident?
That's a very easy one to measure.
Typically, you will document this as part of your
post mortem document and you can continue to monitor this over time.
If there are certain systems that are seeing longer time to recovery than
others, you can focus some efforts to improve the stability of those systems.
The second one is mean time between failures. Again, if a particular
system is failing very frequently, we know that as an organization or as a
team, we should focus our efforts there.
And then lastly, we talked about this in one of our examples, but mean time to detection. If we notice time after time that
we're not aware that there's an incident till much later,
that means that there's a gap in our alerting systems, our monitoring systems.
So if this metric is high, then we can clearly identify one
easy action item is to improve our alerts.
Only once we start measuring these can we identify these action items,
iterate, improve them and measure again.
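Here's a minimal sketch of computing these three metrics from postmortem timelines, interpreting the second metric as mean time between failures; the incident records are made-up examples.

```python
# Minimal sketch: compute MTTD, MTTR, and MTBF from postmortem timelines.
# The incident records below are made-up examples.

from datetime import datetime, timedelta

incidents = [
    # (began, detected, resolved)
    (datetime(2024, 1, 3, 2, 0), datetime(2024, 1, 3, 2, 40), datetime(2024, 1, 3, 5, 0)),
    (datetime(2024, 2, 10, 14, 0), datetime(2024, 2, 10, 14, 5), datetime(2024, 2, 10, 15, 0)),
]

def mean(deltas: list[timedelta]) -> timedelta:
    return sum(deltas, timedelta()) / len(deltas)

mttd = mean([detected - began for began, detected, _ in incidents])   # time to detection
mttr = mean([resolved - began for began, _, resolved in incidents])   # time to recovery
mtbf = mean([b2 - b1 for (b1, _, _), (b2, _, _) in zip(incidents, incidents[1:])])  # between failures

print(f"MTTD: {mttd}, MTTR: {mttr}, MTBF: {mtbf}")
```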
This will also really help motivate the team because we can see these metrics improve
over time, giving them the faith that this incident management process, this incident response process, is leading to some improvements.
And then finally, it's really helpful to use error budgets
when you need to trade off innovation work versus KTLO, or keep the lights on, work. Error budgets can be thought of as the maximum amount of time that a system can fail without violating its SLA. So as an example here, if we say that our SLA is three nines, 99.9%, our error budget works out to a little under 9 hours per year. If you're monitoring this on a team level or a service level, we can very easily say that once the system is out of its error budget, so once it's breaking its SLA, we clearly need to deprioritize some of our product innovation work and focus
more on system stability so that we can continue to offer our
services with high availability and a high quality to our
users and customers.
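For reference, here is the arithmetic behind that example, assuming the budget is measured over a year, since 0.1% of 8,760 hours comes out to roughly the "little under 9 hours" mentioned above.

```python
# Error budget arithmetic: a 99.9% availability SLA over a year
# leaves about 8.76 hours of allowed downtime.

HOURS_PER_YEAR = 365 * 24  # 8760

def error_budget_hours(sla: float, period_hours: float = HOURS_PER_YEAR) -> float:
    return (1 - sla) * period_hours

print(error_budget_hours(0.999))  # ~8.76 hours per year
```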
So to wrap up, we answered four questions here
today. Firstly, we established a clear definition
for incidents and how they're different from bugs. We went over how to declare an incident and defined a process to respond to an incident, to minimize the impact to our users, communicate with our stakeholders, and get the support that we need.
Then we talked about a postmortem process that we can use to reflect on
what went wrong and extract learnings to continue improving our systems
in the future. And then lastly, we discussed ways to
measure and improve on our incident handling, as well as ways to
reward and motivate our teams to take this work seriously and continue to
keep our systems stable.
And that's all. Thank you so much for having me here today.
I hope you found the session helpful and if you have any questions or if
you'd like to chat, you can find me at my email address.
I wish you all the very best with your own incident management programs,
and I hope you enjoy the rest of the conference.