Transcript
This transcript was autogenerated. To make changes, submit a PR.
Are you an SRE? A developer? A quality engineer who wants to tackle the challenge of
improving reliability in your DevOps? You can enable your DevOps
for reliability with ChaosNative. Create your
free account at ChaosNative Litmus Cloud.
Hello, my name is Ajuna Kyaruzi. I'm a technical evangelist at Datadog,
and I'll be talking to you about sustainable incident management for
happy SRE teams today. So I know the topic sounds a little vague, but hopefully
by the end of this you'll be thinking about incident management in a way that keeps
your incident responders, or your SRE team specifically, as fulfilled as possible,
so that they're not burnt out and are able to do incident management and incident
response sustainably.
So let's start thinking about this from something that's relatable to
all of us. Your pager goes off. It can
mean a lot of different things, but the first thing is, oh my gosh,
what's going on? How can I resolve this? Let me find out what's
happening. You could have been interrupted in lots of different ways.
Maybe you were just doing your work. Maybe you were at your kid's birthday
party and now have to step away or you were having brunch with friends.
Either way, your immediate response is, I got to figure out what's going on and
solve it as quickly as possible. The process that you take to
solve what's going on and to figure out what's happening is what we call
incident management. We can think about it as a process:
being paged now that we know things are going wrong,
looking at all the alerts that triggered,
looking for the root cause of the incident,
mitigating it to reduce the customer impact,
launching a new change or rolling back the change that caused the incident,
and then reviewing it, making sure that things are back to normal,
writing a postmortem, and trying to make sure that that incident doesn't happen again.
All in all, we call this process incident management.
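To make that lifecycle a bit more concrete, here's a minimal Python sketch of the stages an incident moves through; the stage names are just my own labels for the steps described above, not an official model.

```python
from enum import Enum, auto

# Illustrative stage names for the incident lifecycle described above.
class IncidentStage(Enum):
    PAGED = auto()          # the pager goes off and a responder acknowledges
    INVESTIGATING = auto()  # look at the alerts that triggered, find the root cause
    MITIGATING = auto()     # reduce customer impact: roll back, fail over, rate limit
    RESOLVED = auto()       # things are back to normal
    REVIEWING = auto()      # write the postmortem, file follow-up tasks

# An incident usually walks forward through these stages in order.
LIFECYCLE = [
    IncidentStage.PAGED,
    IncidentStage.INVESTIGATING,
    IncidentStage.MITIGATING,
    IncidentStage.RESOLVED,
    IncidentStage.REVIEWING,
]
```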
But what do I mean by making this sustainable?
For something to be sustainable, let's look at the definition of what that
means. And generally it's a process that can be reused or done again and
again without depleting the natural resources.
The resources that we're discussing right now are the
people who are responders to your incidents. How do we make sure
that they're able to continue the work that they're doing, despite the fact that
being on call is a lot of stress? It's hard to really track and measure
the stress or the happiness of someone who is an incident responder.
Some other fields do a much better job of this, but how do we know,
other than whether people are leaving the team, whether or not they're happy?
There are lots of different things we can think about, and values we want to
make sure our team is able to uphold. But in general, we can look at the pain
points that come up with incident management and look at ways to reduce them
or make them as small as possible, so that this process of incident management
is sustainable. So first, let's just
think about being on call and the general incident management process.
When your pager goes off, usually you've become the incident commander:
you've declared an incident, you're in charge of it, and usually it's the lead
person involved in the incident. So when you're on call, you're working through
the incident. You might pull in a few other folks to work with you, or one of
them might become the incident commander if you need to be the one responsible
for resolving it. Either way, working together and following the incident command
hierarchy, or system, that a lot of companies use can be really helpful in making
sure communication flows freely: the folks who need to work on resolving the
incident can focus on that, while the person handling the process, administration,
and coordination around the incident, aka the incident commander, can focus on
that and be the main point of communication for the stakeholders
involved in this process.
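As a rough illustration of that split of responsibilities, here's a small Python sketch; the class and role names are my own, not a standard from any particular incident management framework.

```python
from dataclasses import dataclass, field

# A sketch of the incident command split: one coordinator, many resolvers.
@dataclass
class Incident:
    title: str
    commander: str                                        # owns coordination and stakeholder comms
    responders: list[str] = field(default_factory=list)   # focus on resolving the issue

    def hand_over_command(self, new_commander: str) -> None:
        """Hand the commander role to someone else, e.g. when the current
        commander needs to dive into the technical fix themselves."""
        if new_commander in self.responders:
            self.responders.remove(new_commander)
        self.responders.append(self.commander)
        self.commander = new_commander

incident = Incident(title="Checkout latency spike", commander="alice", responders=["bob"])
incident.hand_over_command("bob")  # bob now coordinates, alice joins the responders
```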
A point where things can get really hard is just the process of being on call.
I know that we're all on call at some point, especially on SRE
teams, but what does that look like?
Especially on SRE teams, where you're on call for a service that maybe you
didn't write, or where you're on call for multiple different services, finding ways
to generalize this process so that it's as smooth as possible can be great.
Having the same technology report on issues as they happen, so it's easier to
find out what went wrong, or having a standard production system for launching
new features, so that rollbacks are straightforward, makes this process a lot
easier. But in general, when you're on call, finding ways to make that process
as unintrusive to your life as possible can make it
more sustainable. A lot of teams have 24/7 on-call cycles where one person is
on call around the clock for a whole week and then they swap off. Sometimes
you're on call by yourself for a while, especially if you don't have a lot of
folks on your team. Finding ways to cut down the time that you're on call, so
it's not interfering with the rest of your life, can make it more sustainable
for the incident responders. Twelve-hour on-call cycles are more ideal, but
those might be easier if you have a team in another time zone, where you can
swap off when the sun goes down, for example. But generally,
finding ways where you can have multiple tiers of responders, so folks can lean
on each other if they need to go on the subway or take a walk because they've
been tied to their laptop all day, makes it a lot easier for folks to not feel
like they're on call by themselves and are the sole person responsible for a
system. In general, just having these tiers can make it a lot easier if people
are unable to respond because of an emergency: the next person on call, the
secondary or tertiary on-call, can take over temporarily, and then all these
people can work together to be responsible for resolving an incident.
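Here's a minimal sketch of what that tiered escalation could look like; the tier names, contacts, and timeouts are made up, and in practice a paging tool such as PagerDuty or Opsgenie manages this for you.

```python
import time

# Made-up escalation tiers for illustration: who gets paged, and how long to
# wait for an acknowledgement before moving to the next tier.
ESCALATION_POLICY = [
    {"tier": "primary",   "contact": "alice", "ack_timeout_s": 5 * 60},
    {"tier": "secondary", "contact": "bob",   "ack_timeout_s": 5 * 60},
    {"tier": "tertiary",  "contact": "carol", "ack_timeout_s": 10 * 60},
]

def page(contact: str, alert: str) -> bool:
    """Stand-in for a real paging integration; returns True if acknowledged."""
    print(f"Paging {contact}: {alert}")
    return False  # pretend nobody acknowledged, to show the escalation path

def escalate(alert: str) -> None:
    # Walk down the tiers until someone acknowledges the page.
    for level in ESCALATION_POLICY:
        if page(level["contact"], alert):
            print(f"{level['contact']} ({level['tier']}) has the incident.")
            return
        time.sleep(0)  # a real system would wait level["ack_timeout_s"] here
    print("Nobody acknowledged; notify the whole team.")

escalate("checkout-service error rate above threshold")
```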
Something else that we can think about to make things sustainable is the idea of incident severities.
Lots of different companies have an idea of incident severities, where maybe a
SEV 5 is something that's pretty minor: your app's just running a little slower
than normal, maybe you're getting more traffic, but it's not anything noticeable
to your end users; all the way up to a SEV 2 or SEV 1, where you might actively
be losing money or losing customers because they're unable to access your
service. The benefit of something like incident severity
and how it helps things become a little more
sustainable is that it shows you a lot about what's going on.
One, a more severe incident simply needs more resources. It's a clear indication
to pull in folks who might have more expertise. A lot of companies have a rule
where, if you have a SEV 2 or a SEV 1, you immediately pull in teams that are
much better at handling the larger scale of the incident: from the communication
that needs to happen with end users and other stakeholders, to the larger-scale
coordination across all the different responders that need to be pulled in.
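As a rough sketch of how a severity level can drive who gets pulled in, something like the mapping below works; the severity thresholds, team names, and actions here are purely illustrative, not any company's actual policy.

```python
# Purely illustrative: which extra help each severity level pulls in.
SEVERITY_PLAYBOOK = {
    1: {"page": ["exec-on-call", "comms-team", "service-owners"], "public_status_update": True},
    2: {"page": ["comms-team", "service-owners"],                 "public_status_update": True},
    3: {"page": ["service-owners"],                               "public_status_update": False},
    4: {"page": [],                                               "public_status_update": False},
    5: {"page": [],                                               "public_status_update": False},
}

def respond_to(severity: int) -> None:
    plan = SEVERITY_PLAYBOOK[severity]
    for team in plan["page"]:
        print(f"SEV {severity}: paging {team}")
    if plan["public_status_update"]:
        print(f"SEV {severity}: post a public status update for customers")

respond_to(2)  # a SEV 2 immediately pulls in comms and the service owners
```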
Another reason why severities can be very helpful is that the process of
escalation means you are asking more people for help. Even if it's another
person on your team, it makes it a lot easier for you to know that, hey, this
has become a larger task and I need to split it up a little bit. Escalating,
and having an environment where it's easy to escalate, creates a process where
the team feels supported, knows what's going on, and knows how much they can
help. And even just increasing the severity might mean someone from a different
team knows to reach out to you if they think something related involves them.
I'd also like to talk
about the ramp-up process when you're joining a new team or becoming on call
for a team. The onboarding of a new incident responder can be an area of a lot
of stress, because all of a sudden you're now one of the people responsible for
a system, and if you're new, maybe you don't know anything about it. Trying to
make this process as easy as possible for folks can really help increase
sustainability and help them see themselves grow within the system. So there are a lot
of different ways to think about ramping up folks. One, of course, is shadowing
and reverse shadowing the person who's actually on call. Getting an opportunity
to practice being on call without the sole responsibility of it can be a great
relief to a lot of people, and even the person on call gets an opportunity to
collaborate with someone new. Finding ways to make the onboarding process as
easy as possible means that generally the incident management process gets
easier. That ranges from training on incident management, so everyone knows
they're on the same page when it comes to the terminology they're using, how to
split tasks, and what to do when they're on call or helping out someone who is,
to different ways you can handle common incidents, like having runbooks or
playbooks for things like what to do when you run out of quota.
Hopefully you have as much automation as you can for some of these things,
but for others you do need a human to find out what made you run out of quota.
But when you do, these are the steps that you have to take every time.
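For example, a runbook step like "figure out which client is burning the quota" can often be wrapped in a small script so the on-call person doesn't have to do it by hand; this is a hypothetical sketch, with an invented data source, of turning one such step into automation.

```python
# Hypothetical runbook step: find out which clients are consuming the quota.
# usage_by_client would come from your metrics or billing system in practice.
def top_quota_consumers(usage_by_client: dict[str, int], limit: int, top_n: int = 3):
    """Return overall quota utilization and the heaviest consumers."""
    total = sum(usage_by_client.values())
    top = sorted(usage_by_client.items(), key=lambda kv: kv[1], reverse=True)[:top_n]
    return total / limit, top

utilization, offenders = top_quota_consumers(
    {"batch-jobs": 4200, "web-frontend": 1800, "mobile-api": 900}, limit=8000
)
print(f"Quota utilization: {utilization:.0%}")
for client, used in offenders:
    print(f"  {client}: {used} requests")  # next runbook step: throttle or request a raise
```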
So having these opportunities for folks to get quick wins, and to be the one who
fixes something, because that's one of the exciting things about being on call,
being the person that fixed the thing that went wrong, and making it as easy as
possible to get to that level of achievement, makes it a lot easier for folks to
feel fulfilled by incident management and to continue on with the process. Another thing
that you can think about is just the idea of practicing, especially for newer
folks, but really for everyone on the team. Learning about the incident command
system I talked about earlier and having common terminology for how to resolve
incidents can be really helpful. So doing incident response training together,
or things like game days or disaster recovery training like what Google does,
or just role-playing different incidents and figuring out what went wrong, can
be really helpful. Something that can really assist here is reading old
postmortems, which I'll talk about a little more later. Another area that's
a huge pain point for people is communication during incidents. We want to
think about how to make it as easy as possible to find out what's going on in
an incident, when you're ramping up new folks who are going to be joining the
incident, and maybe even when you're handing the pager over to the next
incident commander: what do you need to do to get up-to-date information on
what's going on? Having a unified channel of information makes this a lot
easier. Whether it's a Slack channel, an IRC channel, or whatever you want to
use, having a place where folks who want updates on the incident can just go,
without interrupting anyone who's resolving it, to find out what's been done
and what's left to do makes it a lot easier for folks.
Additionally, when you're later looking back on everything that happened for a
review, it's all in one place where you can find everything, including the
finer details of what happened. If you have a larger-scale incident, maybe it
even makes sense to have multiple channels, but still keep it as narrow as
possible: one channel where stakeholders can see the communications going out,
even to end customers, and another for the people who are responding and
answering each other about what's happening. But generally, making sure that
all of this is in a place that's not a direct message to someone else, so that
everyone who's working on the incident gets an update
on what's going on, makes it a lot easier.
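If you're on Slack, for instance, status updates into that shared channel can even be scripted; here's a minimal sketch using a Slack incoming webhook, where the webhook URL is a placeholder you'd replace with one created for your own workspace.

```python
import json
import urllib.request

# Placeholder: create an incoming webhook for your workspace and paste its URL here.
SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/XXX/YYY/ZZZ"

def post_incident_update(text: str) -> None:
    """Post a status update into the shared incident channel."""
    body = json.dumps({"text": text}).encode("utf-8")
    request = urllib.request.Request(
        SLACK_WEBHOOK_URL, data=body, headers={"Content-Type": "application/json"}
    )
    urllib.request.urlopen(request)  # Slack responds with "ok" on success

post_incident_update(
    "SEV 2 update: rollback in progress, error rate recovering, next update in 30 minutes."
)
```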
Lastly, an area that's really important for incident management is automation,
especially for the postmortem process: the review, or the document that you
write at the end of the incident to capture everything that happened.
Automating this process makes it a lot easier for the incident responders, so
they don't have to go back and remember all the details of the timeline for
each incident. Using a tool like Datadog, we're able to automatically create a
draft postmortem for you when your incident is resolved, by pulling in all the
different metrics, logs, and dashboards, and even linking to the different
Slack channels where you were communicating with folks, to get an accurate
incident timeline. Instead of having to pull that information in yourself, you
just edit it so that it looks great and matches what happened. You can even
include the tasks that people did, so you know what has already been done and
what maybe needs to be pushed to production rather than staying a quick fix,
as well as the future tasks you need to do to remediate the incident.
Automating as much of this process as you can really alleviates a lot of the
pain of writing a postmortem, because for a lot of incident responders it's yet
another thing they have to do after the incident is over, when they just want
to go back to what they were doing before they were interrupted. Automating
this process can make life a lot easier: you can pull together an incident
timeline, figure out exactly what went wrong, and also include all the
remediation tasks that need to be done afterwards to solidify the work
you've done, so that the incident doesn't happen again.
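This isn't Datadog's actual implementation, but as a toy sketch of the idea, assembling a draft postmortem from events you've already collected (monitor alerts, chat messages, deploys) might look something like this; all the names and timestamps are invented.

```python
from dataclasses import dataclass
from datetime import datetime, timezone

# Toy events gathered from monitors, chat, and deploys during an incident.
@dataclass
class TimelineEvent:
    when: datetime
    source: str   # e.g. "monitor", "slack", "deploy"
    summary: str

def draft_postmortem(title: str, events: list[TimelineEvent]) -> str:
    """Produce a draft postmortem with a chronological timeline for humans to edit."""
    lines = [f"Postmortem (draft): {title}", "", "Timeline:"]
    for event in sorted(events, key=lambda e: e.when):
        lines.append(f"- {event.when:%H:%M} UTC [{event.source}] {event.summary}")
    lines += ["", "Root cause: TODO", "Remediation tasks: TODO"]
    return "\n".join(lines)

events = [
    TimelineEvent(datetime(2021, 6, 1, 14, 2, tzinfo=timezone.utc), "monitor", "Error rate monitor triggered"),
    TimelineEvent(datetime(2021, 6, 1, 14, 10, tzinfo=timezone.utc), "slack", "Incident declared as SEV 2"),
    TimelineEvent(datetime(2021, 6, 1, 14, 35, tzinfo=timezone.utc), "deploy", "Rolled back the latest release"),
]
print(draft_postmortem("Checkout latency spike", events))
```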
So these are a few of the pain points I wanted to talk to you about for making
incident management more sustainable. If you think of a few more, please reach
out to me. I'd love to chat more about this, answer any questions, and continue
the conversation about making incident management more sustainable. Thank you.