Transcript
This transcript was autogenerated. To make changes, submit a PR.
Are you an SRE? A developer? A quality engineer who wants to tackle the challenge of
improving reliability in your DevOps? You can enable your DevOps
for reliability with ChaosNative. Create your
free account at ChaosNative Litmus Cloud.
Hello, my name is Ajuna Kyaruzi. I'm a technical evangelist at Datadog,
and I'll be talking to you about sustainable incident management for
happy SRE teams today. So I know the topic sounds a little vague, but hopefully
by the end of this you'll be thinking about incident management in a way that keeps
your incident responders, or your SRE team specifically, as fulfilled as possible,
so that they're not burnt out and are able to do incident management and incident
response sustainably.
So let's start thinking about this from something that's relatable to
all of us. Your pager goes off. It can
mean a lot of different things, but the first thing is, oh my gosh,
what's going on? How can I resolve this? Let me find out what's
happening. You could have been interrupted in lots of different ways.
Maybe you were just doing your work. Maybe you were at your kid's birthday
party and now have to step away or you were having brunch with friends.
Either way, your immediate response is, I got to figure out what's going on and
solve it as quickly as possible. The process that you take to
solve what's going on and to figure out what's happening is what we call
incident management. We can think about it as a process:
being paged now that we know things are going wrong,
looking at all the alerts that triggered,
looking for the root cause of the incident,
mitigating it to reduce the customer impact,
launching a new change or rolling back the change that caused the incident,
and then reviewing it, making sure that things are back to normal,
writing a postmortem, and trying to make sure that that incident doesn't happen again.
All in all, we call this process incident management.
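To make that lifecycle a bit more concrete, here's a minimal Python sketch of the stages an incident moves through; the stage names are just my own labels for the steps described above, not an official model.

```python
from enum import Enum, auto

# Illustrative stage names for the incident lifecycle described above.
class IncidentStage(Enum):
    PAGED = auto()          # the pager goes off and a responder acknowledges
    INVESTIGATING = auto()  # look at the alerts that triggered, find the root cause
    MITIGATING = auto()     # reduce customer impact: roll back, fail over, rate limit
    RESOLVED = auto()       # things are back to normal
    REVIEWING = auto()      # write the postmortem, file follow-up tasks

# An incident usually walks forward through these stages in order.
LIFECYCLE = [
    IncidentStage.PAGED,
    IncidentStage.INVESTIGATING,
    IncidentStage.MITIGATING,
    IncidentStage.RESOLVED,
    IncidentStage.REVIEWING,
]
```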
But what do I mean by making this sustainable?
For something to be sustainable, let's look at the definition of what that
means. And generally it's a process that can be reused or done again and
again without depleting the natural resources.
The resources that we're discussing right now are the
people who are responders to your incidents. How do we make sure
that they're able to continue the work that they're doing, despite the fact that
being on call is a lot of stress? It's hard to really track and measure
the stress or the happiness of someone who is an incident responder.
Some other fields do a much better job of this, but how do we know,
other than whether people are leaving the team, whether or not they're happy?
There are lots of different things we can think about, and values we want to
make sure our team is able to uphold. But in general, we can look at the pain
points that come up with incident management and look at ways to reduce them
or make them as small as possible, so that this process of incident management
is sustainable. So first, let's just
think about being on call and the general incident management process.
When your pager goes off, usually you've become the incident commander:
you've declared an incident, you're in charge of it, and usually it's the lead
person involved in the incident. So when you're on call, you're working through
the incident. You might pull in a few other folks to work with you, or one of
them might become the incident commander if you need to be the one responsible
for resolving it. Either way, working together and following the incident command
hierarchy, or system, that a lot of companies use can be really helpful in making
sure communication flows freely: the folks who need to work on resolving the
incident can focus on that, while the person handling the process, administration,
and coordination around the incident, aka the incident commander, can focus on
that and be the main point of communication for the stakeholders
involved in this process.
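As a rough illustration of that split of responsibilities, here's a small Python sketch; the class and role names are my own, not a standard from any particular incident management framework.

```python
from dataclasses import dataclass, field

# A sketch of the incident command split: one coordinator, many resolvers.
@dataclass
class Incident:
    title: str
    commander: str                                        # owns coordination and stakeholder comms
    responders: list[str] = field(default_factory=list)   # focus on resolving the issue

    def hand_over_command(self, new_commander: str) -> None:
        """Hand the commander role to someone else, e.g. when the current
        commander needs to dive into the technical fix themselves."""
        if new_commander in self.responders:
            self.responders.remove(new_commander)
        self.responders.append(self.commander)
        self.commander = new_commander

incident = Incident(title="Checkout latency spike", commander="alice", responders=["bob"])
incident.hand_over_command("bob")  # bob now coordinates, alice joins the responders
```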
A point where things can get really hard is just the process of being on call.
I know that we're all on call at some point, especially on SRE
teams, but what does that look like?
Especially on SRE teams, where you're on call for a service that maybe you
didn't write, or where you're on call for multiple different services, finding ways
to generalize this process so that it's as smooth as possible can be great.
Having the same technology report on issues as they happen, so it's easier to
find out what went wrong, or having a standard production system for launching
new features, so that rollbacks are straightforward, makes this process a lot
easier. But in general, when you're on call, finding ways to make that process
as unintrusive to your life as possible can make it
more sustainable. A lot of teams have 24/7 on-call cycles where one person is
on call around the clock for a whole week and then they swap off. Sometimes
you're on call by yourself for a while, especially if you don't have a lot of
folks on your team. Finding ways to cut down the time that you're on call, so
it's not interfering with the rest of your life, can make it more sustainable
for the incident responders. Twelve-hour on-call cycles are more ideal, but
those might be easier if you have a team in another time zone, where you can
swap off when the sun goes down, for example. But generally,
finding ways where you can have multiple tiers of responders, so folks can lean
on each other if they need to go on the subway or take a walk because they've
been tied to their laptop all day, makes it a lot easier for folks to not feel
like they're on call by themselves and are the sole person responsible for a
system. In general, just having these tiers can make it a lot easier if people
are unable to respond because of an emergency: the next person on call, the
secondary or tertiary on-call, can take over temporarily, and then all these
people can work together to be responsible for resolving an incident.
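Here's a minimal sketch of what that tiered escalation could look like; the tier names, contacts, and timeouts are made up, and in practice a paging tool such as PagerDuty or Opsgenie manages this for you.

```python
import time

# Made-up escalation tiers for illustration: who gets paged, and how long to
# wait for an acknowledgement before moving to the next tier.
ESCALATION_POLICY = [
    {"tier": "primary",   "contact": "alice", "ack_timeout_s": 5 * 60},
    {"tier": "secondary", "contact": "bob",   "ack_timeout_s": 5 * 60},
    {"tier": "tertiary",  "contact": "carol", "ack_timeout_s": 10 * 60},
]

def page(contact: str, alert: str) -> bool:
    """Stand-in for a real paging integration; returns True if acknowledged."""
    print(f"Paging {contact}: {alert}")
    return False  # pretend nobody acknowledged, to show the escalation path

def escalate(alert: str) -> None:
    # Walk down the tiers until someone acknowledges the page.
    for level in ESCALATION_POLICY:
        if page(level["contact"], alert):
            print(f"{level['contact']} ({level['tier']}) has the incident.")
            return
        time.sleep(0)  # a real system would wait level["ack_timeout_s"] here
    print("Nobody acknowledged; notify the whole team.")

escalate("checkout-service error rate above threshold")
```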
Something else that we can think about to make things sustainable is the idea of incident severities.
Lots of different companies have an idea of incident severities, where maybe a
SEV 5 is something that's pretty minor: your app's just running a little slower
than normal, maybe you're getting more traffic, but it's not anything noticeable
to your end users; all the way up to a SEV 2 or SEV 1, where you might actively
be losing money or losing customers because they're unable to access your
service. The benefit of something like incident severity
and how it helps things become a little more
sustainable is that it shows you a lot about what's going on.
One, a more severe incident simply needs more resources. It's a clear indication
to pull in folks who might have more expertise. A lot of companies have a rule
where, if you have a SEV 2 or a SEV 1, you immediately pull in teams that are
much better at handling the larger scale of the incident: from the communication
that needs to happen with end users and other stakeholders, to the larger-scale
coordination across all the different responders that need to be pulled in.
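As a rough sketch of how a severity level can drive who gets pulled in, something like the mapping below works; the severity thresholds, team names, and actions here are purely illustrative, not any company's actual policy.

```python
# Purely illustrative: which extra help each severity level pulls in.
SEVERITY_PLAYBOOK = {
    1: {"page": ["exec-on-call", "comms-team", "service-owners"], "public_status_update": True},
    2: {"page": ["comms-team", "service-owners"],                 "public_status_update": True},
    3: {"page": ["service-owners"],                               "public_status_update": False},
    4: {"page": [],                                               "public_status_update": False},
    5: {"page": [],                                               "public_status_update": False},
}

def respond_to(severity: int) -> None:
    plan = SEVERITY_PLAYBOOK[severity]
    for team in plan["page"]:
        print(f"SEV {severity}: paging {team}")
    if plan["public_status_update"]:
        print(f"SEV {severity}: post a public status update for customers")

respond_to(2)  # a SEV 2 immediately pulls in comms and the service owners
```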
Another reason why severities can be very helpful is that the process of
escalation means you are asking more people for help. Even if it's another
person on your team, it makes it a lot easier for you to know that, hey, this
has become a larger task and I need to split it up a little bit. Escalating,
and having an environment where it's easy to escalate, creates a process where
the team feels supported, knows what's going on, and knows how much they can
help. And even just increasing the severity might mean someone from a different
team knows to reach out to you if they think something related involves them.
I'd also like to talk
about the ramp-up process when you're joining a new team or becoming on call
for a team. The onboarding of a new incident responder can be an area of a lot
of stress, because all of a sudden you're now one of the people responsible for
a system, and if you're new, maybe you don't know anything about it. Trying to
make this process as easy as possible for folks can really help increase
sustainability and help them see themselves grow within the system. So there are a lot
of different ways to think about ramping up folks. One, of course, is shadowing
and reverse shadowing the person who's actually on call. Getting an opportunity
to practice being on call without the sole responsibility of it can be a great
relief to a lot of people, and even the person on call gets an opportunity to
collaborate with someone new. Finding ways to make the onboarding process as
easy as possible means that generally the incident management process gets
easier. That ranges from training on incident management, so everyone knows
they're on the same page when it comes to the terminology they're using, how to
split tasks, and what to do when they're on call or helping out someone who is,
to different ways you can handle common incidents, like having runbooks or
playbooks for things like what to do when you run out of quota.
Hopefully you have as much automation as you can for some of these things,
but for others you do need a human to find out what made you run out of quota.
But when you do, these are the steps that you have to take every time.
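For example, a runbook step like "figure out which client is burning the quota" can often be wrapped in a small script so the on-call person doesn't have to do it by hand; this is a hypothetical sketch, with an invented data source, of turning one such step into automation.

```python
# Hypothetical runbook step: find out which clients are consuming the quota.
# usage_by_client would come from your metrics or billing system in practice.
def top_quota_consumers(usage_by_client: dict[str, int], limit: int, top_n: int = 3):
    """Return overall quota utilization and the heaviest consumers."""
    total = sum(usage_by_client.values())
    top = sorted(usage_by_client.items(), key=lambda kv: kv[1], reverse=True)[:top_n]
    return total / limit, top

utilization, offenders = top_quota_consumers(
    {"batch-jobs": 4200, "web-frontend": 1800, "mobile-api": 900}, limit=8000
)
print(f"Quota utilization: {utilization:.0%}")
for client, used in offenders:
    print(f"  {client}: {used} requests")  # next runbook step: throttle or request a raise
```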
So having these opportunities for folks to get quick wins, and to be the one who
fixes something, because that's one of the exciting things about being on call,
being the person that fixed the thing that went wrong, and making it as easy as
possible to get to that level of achievement, makes it a lot easier for folks to
feel fulfilled by incident management and to continue on with the process. Another thing
that you can think about is just the idea of practicing, especially for newer
folks, but really for everyone on the team. Learning about the incident command
system I talked about earlier and having common terminology for how to resolve
incidents can be really helpful. So doing incident response training together,
or things like game days or disaster recovery training like what Google does,
or just role-playing different incidents and figuring out what went wrong, can
be really helpful. Something that can really assist here is reading old
postmortems, which I'll talk about a little more later. Another area that's
a huge pain point for people is communication during incidents. We want to
think about how to make it as easy as possible to find out what's going on in
an incident, when you're ramping up new folks who are going to be joining the
incident, and maybe even when you're handing the pager over to the next
incident commander: what do you need to do to get up-to-date information on
what's going on? Having a unified channel of information makes this a lot
easier. Whether it's a Slack channel, an IRC channel, or whatever you want to
use, having a place where folks who want updates on the incident can just go,
without interrupting anyone who's resolving it, to find out what's been done
and what's left to do makes it a lot easier for folks.
Additionally, when you're later looking back on everything that happened for a
review, it's all in one place where you can find everything, including the
finer details of what happened. If you have a larger-scale incident, maybe it
even makes sense to have multiple channels, but still keep it as narrow as
possible: one channel where stakeholders can see the communications going out,
even to end customers, and another for the people who are responding and
answering each other about what's happening. But generally, making sure that
all of this is in a place that's not a direct message to someone else, so that
everyone who's working on the incident gets an update
on what's going on, makes it a lot easier.
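If you're on Slack, for instance, status updates into that shared channel can even be scripted; here's a minimal sketch using a Slack incoming webhook, where the webhook URL is a placeholder you'd replace with one created for your own workspace.

```python
import json
import urllib.request

# Placeholder: create an incoming webhook for your workspace and paste its URL here.
SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/XXX/YYY/ZZZ"

def post_incident_update(text: str) -> None:
    """Post a status update into the shared incident channel."""
    body = json.dumps({"text": text}).encode("utf-8")
    request = urllib.request.Request(
        SLACK_WEBHOOK_URL, data=body, headers={"Content-Type": "application/json"}
    )
    urllib.request.urlopen(request)  # Slack responds with "ok" on success

post_incident_update(
    "SEV 2 update: rollback in progress, error rate recovering, next update in 30 minutes."
)
```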
Lastly, an area that's really important for incident management is automation,
especially for the postmortem process: the review, or the document that you
write at the end of the incident to capture everything that happened.
Automating this process makes it a lot easier for the incident responders, so
they don't have to go back and remember all the details of the timeline for
each incident. Using a tool like Datadog, we're able to automatically create a
draft postmortem for you when your incident is resolved, by pulling in all the
different metrics, logs, and dashboards, and even linking to the different
Slack channels where you were communicating with folks, to get an accurate
incident timeline. Instead of having to pull that information in yourself, you
just edit it so that it looks great and matches what happened. You can even
include the tasks that people did, so you know what has already been done and
what maybe needs to be pushed to production rather than staying a quick fix,
as well as the future tasks you need to do to remediate the incident.
Automating as much of this process as you can really alleviates a lot of the
pain of writing a postmortem, because for a lot of incident responders it's yet
another thing they have to do after the incident is over, when they just want
to go back to what they were doing before they were interrupted. Automating
this process can make life a lot easier: you can pull together an incident
timeline, figure out exactly what went wrong, and also include all the
remediation tasks that need to be done afterwards to solidify the work
you've done, so that the incident doesn't happen again.
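This isn't Datadog's actual implementation, but as a toy sketch of the idea, assembling a draft postmortem from events you've already collected (monitor alerts, chat messages, deploys) might look something like this; all the names and timestamps are invented.

```python
from dataclasses import dataclass
from datetime import datetime, timezone

# Toy events gathered from monitors, chat, and deploys during an incident.
@dataclass
class TimelineEvent:
    when: datetime
    source: str   # e.g. "monitor", "slack", "deploy"
    summary: str

def draft_postmortem(title: str, events: list[TimelineEvent]) -> str:
    """Produce a draft postmortem with a chronological timeline for humans to edit."""
    lines = [f"Postmortem (draft): {title}", "", "Timeline:"]
    for event in sorted(events, key=lambda e: e.when):
        lines.append(f"- {event.when:%H:%M} UTC [{event.source}] {event.summary}")
    lines += ["", "Root cause: TODO", "Remediation tasks: TODO"]
    return "\n".join(lines)

events = [
    TimelineEvent(datetime(2021, 6, 1, 14, 2, tzinfo=timezone.utc), "monitor", "Error rate monitor triggered"),
    TimelineEvent(datetime(2021, 6, 1, 14, 10, tzinfo=timezone.utc), "slack", "Incident declared as SEV 2"),
    TimelineEvent(datetime(2021, 6, 1, 14, 35, tzinfo=timezone.utc), "deploy", "Rolled back the latest release"),
]
print(draft_postmortem("Checkout latency spike", events))
```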
So these are a few of the pain points I wanted to talk to you about for making
incident management more sustainable. If you think of a few more, please reach
out to me. I'd love to chat more about this, answer any questions, and continue
the conversation about making incident management more sustainable. Thank you.