Transcript
Hello, and welcome to Conf42 DevOps 2025.
It's a real pleasure to be here.
And I'm really excited to be giving this talk to you today.
I hope you're having an amazing event, having a ton of fun and learning a lot.
So shall we start?
Okay.
So, incidents happen when you least expect them.
One moment you're happily delivering a new feature and working on the next one, then you get an alert that something is wrong and apparently production is on fire.
When this happens, the human instinct is to worry, is to get stressed.
There's anxiety.
Well, to that, I'm here to tell you, don't panic.
We're going to be talking a bit today about effective incident response.
Just before we continue, let me introduce myself.
My name is Daniel Afonso.
I'm a developer advocate at PagerDuty and an instructor at egghead.io.
I'm the author of a book called State Management with React Query, and you can find me pretty much anywhere online at the handle danieljcafonso.
If you want to connect with me as well, you can reach me by email at PagerDuty or connect on our community page at community.pagerduty.com.
Let's talk about incident response.
So what is incident response?
Incident response is an organized approach to addressing and managing an incident.
You see, companies handle incidents differently every day, and most of them can handle incidents without having an organized incident response process.
But with an effective incident response process, your goal isn't just to solve the incident; it's to handle it in an organized way, hence the word "organized."
Handling it in an organized way will allow you to limit damage and to reduce recovery time and costs.
So, what's the key here?
After all, you can give a developer a keyboard and some time, and they will solve the problem.
Incident response is about solving the problem quickly, while minimizing the damage and reducing the costs associated with it.
So, how can we get started with Incident Response?
There are three steps that we can follow.
The first one is to use systematic learning and improvement.
This means that you can build off the mistakes you make and avoid
repeating them in the future.
You keep learning, you keep improving.
The second one is to mobilize and inform only the right people at the right time.
This will make the resolution effort more efficient, will drive down the overall
time to resolve incidents, and will reduce the strain and overwork on your staff.
And then the third one is to automate everything you can.
You should make technology work for you by setting up rules and actions that will trigger automatically at the right time, when you need them.
The bottom line is, we want to replace chaos with calm.
Like I said previously, it's very easy to go into panic mode.
It's very easy to stress and get nervous.
But this is not good during an incident.
It will only add to the problem and will cause more confusion.
We want stuff to be calm and organized.
Our incident response process is heavily based on the Incident
Command System, or ICS for short.
So, for those of you who are not familiar, ICS was developed after some devastating wildfires that happened in Southern California in the 1970s.
When this happened, thousands of firefighters responded, but an issue came up: they found out that it was tricky to work together, because they were very good individually.
They knew how to fight these fires by themselves, but they lacked a common framework for doing it together as a larger group.
So this framework they developed is used by everyone from a local fire department responding to a house fire to the U.S. government responding to a natural disaster, and several other organizations around the world.
It provides a standardized response that everyone is familiar with, and it helps prevent confusion and chaos during an incident, especially when working cross-functionally.
It sounds pretty good, right?
Now, the first step to having an organized incident response process
is to define some key terms.
And the first one is, what's an incident?
It might sound too obvious, but if you ask around, different companies are
going to have different definitions.
At PagerDuty, we define an incident as any unplanned disruption or degradation of service that is actively affecting customers' ability to use the product.
Some companies also include internal-facing disruptions, like having a security breach or not being able to run month-end accounting.
If any business metrics have deviated from normal behavior,
this can be considered an incident.
Whether you're experiencing a substantial number of failed transactions in your shopping cart or a large increase in processing time for lineups in your store, these could be considered incidents, as your service is being negatively impacted.
What's important here is that you use a definition that works for you.
As long as you have a standardized definition that is widely known throughout your organization and you keep it simple, it should work.
The last thing you want is ambiguity when there's an outage
and people are stressing out.
Now, after defining an incident, the next step is to define what a major incident is.
A major incident is harder to define because it will, once
again, vary between organizations.
We have found that major incidents have a few things in common.
So, the first thing is, they often happen with little or no warning.
Okay.
Then, you must respond quickly.
Having these services degraded or down will cost your company a ton of money and valuable resources.
The third thing is, you will rarely understand what's happening at
the beginning of the incident.
You're going to encounter bumps and turns all over.
And the fourth thing is, you must mobilize a team of the right responders to resolve the issue, and they should coordinate cross-functionally on the response.
At PagerDuty, our definition of a major incident is any incident
that requires a coordinated response between multiple teams.
This definition should be short and a simple statement that will make
sure that, once again, everyone is on the same page and everyone
understands what we're talking about.
In particular, you want to remove any discussions or disagreements around whether something is actually happening or whether it's a major incident or not.
This is not something that you want to happen when the house is on fire and everyone is discussing: hmm, is this actually a major incident?
So, make sure that your definition is short and straight to the point.
Something that could be helpful as well is having a metric to use, like: when we experience more than 100 errors per minute, it's a major incident.
That's great.
You can also tie it to the customer impact, like deciding we're experiencing
a major incident when more than 5 percent of our customers are affected.
So discuss with your company, get it written and make sure everyone
understands it and is on the same page.
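To make that concrete, here is a minimal Python sketch of what such a written-down definition could look like. The thresholds and metric names are just the hypothetical examples mentioned above (errors per minute, percentage of affected customers), not recommendations; use whatever your organization agreed on.

```python
# Minimal sketch of a codified major-incident definition.
# The thresholds below are hypothetical examples, not recommendations.

def is_major_incident(errors_per_minute: float,
                      affected_customers: int,
                      total_customers: int) -> bool:
    """Return True when the agreed-upon major-incident criteria are met."""
    too_many_errors = errors_per_minute > 100          # e.g. >100 errors/minute
    customer_impact = (affected_customers / total_customers) > 0.05  # e.g. >5% of customers
    return too_many_errors or customer_impact


if __name__ == "__main__":
    # 120 errors/minute with 2% of customers affected -> major incident
    print(is_major_incident(errors_per_minute=120,
                            affected_customers=2_000,
                            total_customers=100_000))
```

The value of writing it down like this is that the judgment call is made once, in advance, instead of being debated on every call.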
The reason this should be a first step is that you can't respond to an incident until you know what an incident is.
If one person considers something an incident, but the rest of the organization doesn't, then it's going to create a lot of ambiguity and confusion during the incident response.
So, to repeat: brainstorm and decide on a clear definition, and then standardize it across your entire organization so that everyone has the same understanding.
Like I said, you want to replace chaos with calm.
By having a definition for both incidents and major incidents, you add the clarity your team needs to know what to do next.
So these definitions create a distinction between normal operations and emergency operations.
Once you cross the line into emergency operations, your decision-making processes will change, even dramatically.
There are things that wouldn't be acceptable during normal operations but that you could do in emergency mode.
For instance, deploying code without tests might not be acceptable during normal operations, but perfectly acceptable when there's a major incident and you need to recover quickly.
The way you operate, your role, the hierarchy, and the level of risk you're willing to take will change as you move out of normal operations and into incident response mode.
This is why we recommend that you use metrics tied to business impact.
For example, a metric we monitor at PagerDuty is number of
outbound notifications per second.
At Amazon, it could be number of orders per second.
At Netflix, it could be stream starts per second.
Monitoring these important metrics will then help you determine the severity of an incident and the type of response you should use.
So if you use metrics that aren't tied to any business impact, for instance, "CPU usage is high on this host," then it can be difficult, or even sometimes impossible, to determine the severity of an incident associated with that metric.
What you really want is a metric that lets you know how your business is doing, not how a particular piece of equipment is doing.
Remember, making a wrong decision is better than making no decision at all.
A wrong decision gives you more useful information to work with while making
no decision gives you nothing at all.
Think of it as trying to reduce the spread of a fire.
If you just sit there thinking, okay, what are my options, let me think about them, your house might burn down without you even knowing it.
So in a sense, taking no action is an action.
We can even call this decision paralysis.
So remember this, please make a decision.
You get learnings from it and you know what to do afterwards.
How do you move out of normal operations into an emergency mode?
Well, you trigger the incident response.
Obviously, you can do it in PagerDuty or in whatever tool you're using.
We recommend two ways of triggering this process.
First, we encourage automatic detection and triggering within the product, wherever it can be done effectively and accurately.
If an organization has working integrations and reliable metrics that
track business impact, such as loss of sales per minute or success rates
for API calls, we recommend that you leverage this knowledge and use it to automatically trigger an incident.
This will reduce time wasted on human based assessment and will
drive down the time to resolution.
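As an illustration of that idea, here is a rough Python sketch: if a business-impact metric (here, a hypothetical lost-sales-per-minute figure pulled from your own monitoring) crosses the threshold your team agreed on, the script triggers an incident through the PagerDuty Events API v2 enqueue endpoint. The routing key, threshold, and metric lookup are placeholders; adapt them to your own tooling.

```python
# Rough sketch: automatically trigger an incident when a business-impact
# metric breaches the threshold your team agreed on. The routing key,
# threshold, and metric lookup are placeholders for your own setup.
import requests

EVENTS_API_URL = "https://events.pagerduty.com/v2/enqueue"
ROUTING_KEY = "YOUR_INTEGRATION_ROUTING_KEY"  # from a PagerDuty service integration
LOST_SALES_THRESHOLD = 500.0                  # hypothetical: lost sales per minute


def get_lost_sales_per_minute() -> float:
    # Placeholder: pull this value from your monitoring or analytics system.
    return 620.0


def trigger_incident(metric_value: float) -> None:
    """Send a trigger event to the PagerDuty Events API v2."""
    event = {
        "routing_key": ROUTING_KEY,
        "event_action": "trigger",
        "payload": {
            "summary": f"Lost sales at {metric_value:.0f}/min exceeds threshold",
            "source": "business-metrics-monitor",
            "severity": "critical",
        },
    }
    requests.post(EVENTS_API_URL, json=event, timeout=10).raise_for_status()


if __name__ == "__main__":
    value = get_lost_sales_per_minute()
    if value > LOST_SALES_THRESHOLD:
        trigger_incident(value)
```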
Second, we highly encourage humans to manually trigger the incident too.
This is very important at PagerDuty.
Lowering the barrier to triggering an incident response leads to a dramatic increase in the speed at which incidents are resolved.
We don't want people to sit on a problem because the alarm hasn't gone off yet.
If customer support, for instance, gets a lot of requests very quickly,
well, this is probably a sign that something is wrong, right?
So, how do we as humans trigger the process?
Well, we do it with a chat command, but don't feel like that's the only way.
I just wanted to demonstrate it to give you an idea.
You can do it however you want.
You can have a flashing light in the office, you can hire a mariachi band, whatever.
Do what you prefer.
The point is, you want to have a way to trigger this response that's fast, easy, and available to everyone.
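To make the chat command idea a bit more concrete, here is a hypothetical Python sketch using Slack's Bolt framework purely as an example; the command name, tokens, and the page_incident_commander() helper are placeholders for whatever triggering mechanism your team uses.

```python
# Hypothetical sketch of a "/incident" chat command that lowers the barrier
# to triggering the response process. Tokens and the helper are placeholders.
import os
from slack_bolt import App

app = App(token=os.environ["SLACK_BOT_TOKEN"],
          signing_secret=os.environ["SLACK_SIGNING_SECRET"])


def page_incident_commander(summary: str, reporter: str) -> None:
    # Placeholder: call your paging tool here (e.g. the Events API sketch above).
    print(f"Paging IC: {summary} (reported by {reporter})")


@app.command("/incident")
def handle_incident_command(ack, command, say):
    # Acknowledge the slash command right away so the chat tool doesn't time out.
    ack()
    summary = command.get("text") or "No description provided"
    reporter = command.get("user_name", "unknown")
    page_incident_commander(summary, reporter)
    say(f":rotating_light: Incident response triggered by {reporter}: {summary}")


if __name__ == "__main__":
    app.start(port=3000)
```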
Once you have triggered the incident, you now have a stable, repeatable pattern to follow when responding.
So, our process when dealing with an incident is the following: we follow the triage, mobilize, resolve, and prevent process.
So, when you're notified that an incident is happening, you first must assess and triage it.
Here, you will assess the severity and the urgency of the incident.
Secondly, you need to mobilize the right people who can work together towards the resolution, and that leads us to resolving.
Now we are all working together towards a common goal, which
is to resolve the incident.
The last step is to prevent the incident from happening again in the future.
We don't want to have multiple incidents caused by the same issue, so we need to make sure that we prevent it.
Now, in order for these steps to happen, we need to have the right team and roles engaged, and this is where we come to the roles of incident response.
And we're going to go over each one of them.
So, the top section that we have here is the command section, and this includes three critical roles, starting with the most important one, which is the incident commander.
This is the highest-ranking person in the incident response system when an incident is happening.
Yes, they're even more important than the CEO of the company.
They should be the single source of truth during an incident,
and they are the ones in charge.
They make all the decisions, and no action should be performed unless the
incident commander gives the go ahead.
They will help drive major incidents to resolution and coordinate the response.
And the best part is that they don't even need to have technical knowledge; as long as they understand your incident response process and they know the steps involved in being an incident commander, they can do it.
That's what's really exciting about this.
Next, underneath the incident commander, we have the scribe and the deputy.
The scribe is responsible for documenting the timeline of the incident as it
progresses, and the scribe is also responsible for making sure that, well,
all the important decisions and data are captured and can be later reviewed.
At PagerDuty, we use Slack to keep an internal record of major incidents, and the scribe will post in the specific channel for the incident to record its live progress.
This way we have a solid history of what was discussed and which decisions were made, and anyone, and I mean anyone, in the company can go ahead and read it for status updates if they'd like.
Whether you have Slack or Teams, at the end of the day it will depend on the tool you're using.
I recommend you have an incident response channel set up for your
department or for your team.
You can keep it open and ongoing for each incident, and this way, when an incident occurs, you're not looking around for, okay, where's that incident-specific channel.
Make sure it's there, and make sure everyone knows which one it is.
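As a small illustration of how the scribe's timeline notes could land in that channel, here is a hedged Python sketch using a chat webhook; the webhook URL is a placeholder, and in practice many teams simply type these notes by hand, which works equally well.

```python
# Small sketch of recording a timestamped incident timeline entry in the
# dedicated chat channel. The webhook URL is a placeholder.
from datetime import datetime, timezone

import requests

INCIDENT_CHANNEL_WEBHOOK = "https://hooks.slack.com/services/XXX/YYY/ZZZ"  # placeholder


def record_timeline_entry(message: str) -> None:
    """Post a timestamped note so anyone can follow the incident's progress."""
    timestamp = datetime.now(timezone.utc).strftime("%Y-%m-%d %H:%M:%S UTC")
    requests.post(INCIDENT_CHANNEL_WEBHOOK,
                  json={"text": f"[{timestamp}] {message}"},
                  timeout=10)


if __name__ == "__main__":
    record_timeline_entry("IC decided to roll back the 14:02 deploy.")
```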
Then we have the deputy.
So the deputy is a direct support role for the incident commander.
This is not a shadow, they're not just there observing.
The deputy is expected to perform some important tasks during the incident.
They make sure that the incident commander stays focused, and they are there supporting the investigation with whatever is needed, whatever the task.
So basically they're the right hand person to the incident commander.
Then, on the liaison side, we have the internal liaison and the customer liaison.
They're responsible for communication: with the company in the case of the internal liaison, and with the customers in the case of the customer liaison.
PagerDuty provides a status page, too, that keeps customers updated on the impact of the incident in real time.
And these are the folks that are responsible for coordinating this effort.
Now onto the operations section.
These are basically the people doing the hands-on work: the subject matter experts, or resolvers.
When an incident happens, they will be the ones who are alerted first.
This is because they are the ones who have the technical expertise, and they are the ones who can assess the incident and determine if they can resolve it on their own or if the impact is too large for that.
If the business impact is too great and a large number of subject matter experts is needed, then the primary on-call responder will page the on-call incident commander and begin the incident response process.
This operations group of subject matter experts may consist of one or more people, depending on the severity of the incident, the size of your department, and the number of teams.
Keep in mind, and this is important, this chart might grow or shrink.
This is just an example of how we do it here at PagerDuty.
We have seen charts long enough to write a book about, and we have seen ones that trim this chart down even more considerably.
So look at your organization, figure out what makes sense, and adapt.
How do I prepare to manage an incident response team?
Well, in the next sections we're going to go over a couple of steps, starting with step one: ensure explicit processes and expectations exist and that people are trained.
This means, well, creating standardized definitions.
And I've talked about this before.
Make sure everyone knows what an incident is and what a major incident is.
Define your incident response process and communication channels, and make sure everyone in the organization is on the same page.
You need to make sure that everyone is involved and understands how stuff works, because without everyone being on the same page, stuff is not going to work out properly.
Then, practice running major incidents as a team.
This will give the team practice, and it will enable responders to act more calm and collected.
This way, when the real stuff happens, they'll be ready.
Think about it: if a firefighter only fought one or two small fires a year and then suddenly they're called to a major forest fire, they'll probably panic.
They're less likely to act calm, cool, and collected.
So, how do you handle this?
You practice.
A great way to practice is when you have an incident of lower severity: you can say, okay, this is a low severity incident, but let's treat it as a major incident.
This way you will learn how your response process works, and you'll be able to tune your systems so that you're not triggering a major incident response for less severe issues.
This is why at PagerDuty we run Failure Fridays.
It keeps us on our toes, and it makes us ready at a moment's notice.
The point of this exercise is to make sure that the production operations systems are doing what we need them to do, when we need them to do it, and this can cover a number of different subsystems.
So, for instance, are the metrics and the observability tools
giving us the right telemetry in the right way at the right time?
Are the escalation policies, notifications, and incident rooms working the way we expect them to?
Does everyone know what's expected of them and where they need to be?
And then, if we need to push a fix, how fast can we do it, and who has
the authority and ability to do that?
Make sure you find a process that works for you.
A tip on how to get started: think about what you would do in case of an incident, start by evaluating that, and then you can practice around it.
Step three: make sure you find ways to tune your processes so that they work for your teams.
Here you are looking at process improvements, stakeholder expectations, disruptions, and things that might trickle down to responders or rise up to stakeholders.
Step four, make checklists.
Make a checklist that you can run through when an incident happens to make
sure that you have everything covered.
You can think about airplanes, for instance.
Pilots literally have a pre-flight checklist, and no matter how many times they've flown in their careers or how much experience they have, they legally have to go through this list and make sure that every item is completed and ticked off.
The truth is, people will make mistakes.
It's in our nature.
Especially when we are stressed, we tend to forget stuff.
So the best way to mitigate this risk is to have a list.
The list also helps you work pause points into your process, so that everyone has a chance to take a breath, relax, and continue the work in a calm environment, without panic.
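As an illustration only, before looking at the real ones on the next slide, here is a generic Python sketch of how a team might keep such a checklist close at hand; the items are examples drawn from the steps discussed in this talk, not PagerDuty's actual checklist.

```python
# Illustrative only: a generic high-severity checklist a team might keep.
# The items are examples based on the process described in this talk,
# not PagerDuty's actual checklist.
RESPONSE_CHECKLIST = [
    "Acknowledge the page and confirm the incident is real",
    "Assess severity against the agreed definition of a major incident",
    "Page the on-call incident commander",
    "Open (or locate) the dedicated incident channel",
    "Mobilize only the subject matter experts who are needed",
    "Start scribe notes and internal/customer status updates",
    "Schedule the post-incident review before standing down",
]


def print_checklist(items: list[str]) -> None:
    """Print the checklist with empty tick boxes, ready to run through."""
    for item in items:
        print(f"[ ] {item}")


if __name__ == "__main__":
    print_checklist(RESPONSE_CHECKLIST)
```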
So these are actual checklists that we have at PagerDuty for when we have high severity incidents.
I'm not going to dictate this to you, you can take a screenshot now, you can access
the slides afterwards, I'm sharing them.
The key message is that these seem like very easy, simple steps that we won't forget, but during times of panic, it's very easy to miss some of them.
Checklists will reduce the mental strain on you and will have a large impact when stuff happens.
And last but not least, do your post-mortems, or post-incident reviews.
Here, you should review what happened during the incident, both for insights into where your engineering practices need to be improved, and for insights into where your incident response process was not clear.
The reason our incident response process has been so successful throughout the years is that we have iterated on it over time, and we still continue to do so.
People get together after incidents, we discuss what happened, and we create
an organized approach to an emergency situation and tailor it to our needs.
Because of this, we now have a consistent and unified approach to incident response.
Please, this is very important: don't skip your post-incident reviews.
Okay, now you have all the steps to deal with incident response, but
sometimes stuff is going to happen.
There are going to be issues, there are going to be pains, and
we're going to have pitfalls.
So we're going to take a few minutes now to talk about some pitfalls we've come across.
The first one, and a very common one, it's called Executive Swoop.
Well, it's actually Executive Swoop and Poop, but I was asked
not to put that on the slide.
So we're gonna look at some more common examples of Executive Swoop next.
One thing that's important to note here is that none of these things happen maliciously.
No executive will join the meeting with the intent of messing up
the process or delaying stuff.
They are there to try to motivate people, find out what's going on,
and make sure the issue is gone.
It's their business too.
So, let's look at some common examples.
First one, an executive comes in and says, Okay, everyone, let's try and
resolve this in 10 minutes, please.
And on the surface this might seem pretty okay, pretty harmless.
The executive is just there to try to motivate everyone and encourage us
to solve the problem quickly, right?
Well, unfortunately that's not how it's going to be perceived.
Saying something like this assumes that people aren't already working as
hard as possible to solve the problem.
It's not as if I've been trying to fix an issue for the past two hours and someone comes in and says, okay, let's try to make sure this is wrapped up in ten minutes, and now I'm suddenly going to fix it in ten minutes, right?
This will just demotivate me and add additional stress.
The job of the incident commander here is to step in and clarify: okay, calm down, we are doing the best we can with the time we have, and let's try to keep things on track.
Another executive swoop that happens: an executive joins the call and asks for a list of impacted customers.
There is just a problem with this.
I understand that you might need it, but for us to get this list, we'll need to take someone away from the effort of responding to this incident, right when we need them the most.
So, just let the executive know that this is what would happen, and that we cannot do this right now.
Third one.
Oftentimes the executive will hop on an incident response call and begin asking a lot of questions about the specific impact.
And sometimes they'll ask you for something large and time consuming.
And they'll just say, okay, do what I say.
Well, this is where the incident commander comes in.
You just ask them, do you wish to take command?
And you'd be surprised how often they don't answer with yes.
If they do, great: the incident commander says, okay, everyone, be advised, I'm handing over to this person, I'm out.
But most of the time, people say no, or won't even answer.
In that case, the incident commander keeps the process going as usual, and they can even say: remember, I'm in command of this incident, please save your discussion for after the call.
Executive swoop happens mostly because of one thing: failure to notify stakeholders.
You should keep your stakeholders informed on a regular cadence.
Transparency is key.
It will help ensure that people and other executives don't
need to join the incident call.
And it will also allow the stakeholders to field questions from customers
so that your team isn't interrupted.
This gives your organization enough confidence that the situation is under control and that you can handle it.
It's very important.
This is why it's important to have all those channels and those internal
liaisons communicating stuff.
If everyone has the right information they need, then most of the time they won't feel the need to hop onto the call.
Now, one thing that's important to avoid is having too many status updates.
If you keep giving too many status updates, most of the time they are unnecessary and are wasting people's time.
We like to keep internal updates to every 30 minutes, or immediately after a change occurs, whether it's an improvement or a resolution.
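A minimal sketch of that cadence rule in Python, assuming a hypothetical last_update timestamp and a state-change flag coming from your own incident tooling:

```python
# Minimal sketch of the update cadence described above: post an internal
# status update every 30 minutes, or immediately when something changes.
from datetime import datetime, timedelta, timezone

UPDATE_INTERVAL = timedelta(minutes=30)


def should_post_update(last_update: datetime, state_changed: bool) -> bool:
    """Return True when it's time for an internal status update."""
    overdue = datetime.now(timezone.utc) - last_update >= UPDATE_INTERVAL
    return state_changed or overdue
```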
Now, another mistake that often happens is red herrings.
These are clues that are misleading or distracting.
For example, we have been on calls where everyone is convinced that the issue is with the network.
However, after a couple of minutes, people realized that they hadn't checked recent deploys.
And, surprise, surprise, the source of the problem was a code change, which it often is.
So don't get sucked into obsessively following one line of investigation and miss the true cause of the incident.
This is where the incident commander needs to keep things in check, keep asking questions, and encourage the team to step back from focusing on one small area and look at all the options at hand.
So there are other common pitfalls that I won't go over, because it would make this talk very, very long, but here are some of them.
So, debating the severity of an incident during the call.
You don't want to stop what you're doing to discuss the severity.
If you're already running with a severity, treat it as that, follow the process as you should, and then revisit it in the post-mortem review.
Then discussing process and policy decisions.
Once again, not the time, not the place.
This is why it's important that everyone knows and is on the same page.
Once again, if this happens, then during the post-incident review make sure that everyone gets aligned.
Then, not disseminating policy changes.
This is something that can happen: if changes happen and people are not informed, it will lead right back to some of the previous pitfalls.
Hesitating to escalate to other responders.
This happens often when subject matter experts are scared or afraid of admitting that they don't know how to deal with the problem.
The issue is that this is often a sign of a culture where people are afraid to step up and admit that they don't know.
Disseminate the right culture and the right practices, so people know that it's not about not knowing; it's about figuring out that, if I don't know how to handle this, I'm going to pass it to someone who does know how to handle this situation and fix it.
Then, neglecting the post-incident review and follow-up activities.
This one speaks for itself.
Then, trying to take on multiple roles.
An incident commander should be an incident commander, a subject matter expert should be a subject matter expert, and the customer liaison should be the customer liaison.
So it's important that everyone knows their role.
There are situations where this might vary, but most of the time one person shouldn't take on more than one role.
Then, stopping everything and getting everyone on the call.
So, all hands on deck: let's get everyone together on the same call and try to fix this.
Well, you don't want this to happen; it only creates chaos.
You should only have the people in this discussion who can and will contribute to it.
Everyone else can stay updated by following the channel and the notes being taken by the scribe.
Another thing: forcing everyone to stay on the call.
You don't want people to stay on the incident response call if they don't need to be there and they're not contributing.
Most of the time, these discussions happen at three or four a.m.
If these people aren't doing anything anymore, release them; let them go get their rest or focus on their own things.
They can check up on what happened afterwards.
Assuming that silence means that no progress is being made.
Most of the time, just because people are quiet doesn't mean
that stuff is not happening.
People are working, and they are fixing the issue the best way they know how.
Obviously, there are going to be times when you need to ask for updates, but don't assume that if people are quiet, nothing is happening.
So, in summary, here's everything we've seen in this talk.
First, use the Incident Command System for managing your incidents.
Second, make sure that there's an incident commander who can take charge during emergency scenarios.
Make sure that expectations are set upwards, so that your stakeholders and everyone in the company, right up to the highest levels, know how stuff works.
Work with your team to set explicit processes and expectations.
Once again, expectations are the most important thing during incidents.
Everyone needs to know what to do, how to react, and how they should carry themselves.
Practice, practice, practice.
Like I said, a firefighter who has only fought one or two small fires is not going to know how to handle a huge one, so you need to keep practicing and figuring out how you and your team carry yourselves when major incidents show up.
And finally, don't forget to review your processes and keep improving.
So if you want to get more details on incident response and this process, this session only covered a bit of the beginning of what we have documented in our operations guides at PagerDuty.
You can check response.pagerduty.com to get more information and learn more about the process.
This is an open source guide, and it isn't specific to PagerDuty.
Many companies have taken segments of our guide and inserted them into their own documentation.
Besides that, we have plenty of other operations guides that you can go and check, on things like post-incident reviews, operational reviews, and full service ownership.
You can also connect with us at community.pagerduty.com, our community page, where everyone can go ahead and ask questions, talk about PagerDuty, developer operations, anything you find relevant.
There's a QR code here that you can scan to take you there.
And I also invite you to connect with me: if you have any questions after this presentation, you can reach me on any social media, and it will be a pleasure to talk with all of you.
So I would like to thank Conf42 for having me.
It's been a pleasure to be here talking with everyone.
And hopefully now by the end of this talk, you know that you don't need to
panic and you have some insights on how to have an effective incident response.
I'm Daniel Afonso.
Thank you so much for having me.