Conf42 Incident Management 2022 - Online

Too many people in the room?

Abstract

When something goes wrong, it can be tempting to gather as many people as you can to fix it. Each person can contribute tremendous value through diverse viewpoints, but too many people can overcrowd your response, leading to miscommunication, redundant work, and much more. This talk will teach you to avoid overcrowding incidents through smarter escalation policies, role-based tasks to organize efforts, and more efficient communication. A lean, focused team of relevant players can achieve much more than a bloated, confused one. Only then will you start to reduce the burden for your on-call team and keep customers happy.

Summary

  • Nick Mason: You can actually overcrowd your incident response. Why is it that you call in as many people as possible? Why do incidents get crowded? How to achieve those goals while still preventing overcrowding.
  • It takes ten to 15 minutes for someone on their team just to get caught up to speed with an incident. Lack of classification when it comes to incident types. Not having a defined escalation policy or a cultural mindset. These are all natural problems that we encounter on a daily basis.
  • So what is an incident type? Maybe it's use case based. Is it a security incident? Maybe a planned software outage? The severity level has just as much impact as the type of problem. Making sure you have those distinctions is just as important as defining the problem itself.
  • Before you start an incident, stop, think and assess. What type of problem are you encountering? Who is it impacting? And how fast does a resolution actually need to be obtained? Get the right people to join who are going to be efficient.
  • A combination of those three roles is very powerful to make sure that there are particular lanes of focus. And I think this really alleviates one of the big fears that we talked about earlier, is that people start to get paranoid. If you have these roles and these checklists already defined, people will be confident.
  • Roles and tasks help with accountability and communication. As the incident evolves, communicate, period. We recommend trying to get a system in place to automatically deliver updates to relevant stakeholders.
  • Communication should be something that alleviates these sorts of concerns. Having automatic systems set up to deliver messages when status changes is really great. With good tooling and with somebody focused on this sort of communication aspect, you can really alleviate a lot of the burden.
  • Communication doesn't need to be limited to that particular role. Mark something as important and let those key contributors with incident roles handle their tasks. Having these elements incorporated into your incident management process will help you cut through the noise and drive towards resolution at a much quicker pace.
  • You really do need a cultural foundation that's built around that trust. The idea of kind of psychological safety that you're not going to get punished for screwing up the process. The retrospective or the post incident analysis is a great way for your employees to feel heard.
  • Emily, is there anything else you'd like to add? We hope that we've convinced you that sometimes a lean, focused team can beat out a large all hands on deck scenario. Have a great day.
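
The point about incident types and severity levels deciding who gets pulled in can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the type names, team names, and per-severity caps below are hypothetical, and none of them come from the speakers or from any particular tool.

```python
from __future__ import annotations

from dataclasses import dataclass
from enum import IntEnum


class Severity(IntEnum):
    SEV0 = 0  # all hands on deck
    SEV1 = 1
    SEV2 = 2
    SEV3 = 3  # lowest priority in this sketch


@dataclass(frozen=True)
class Incident:
    incident_type: str  # e.g. "customer_impacting", "security", "planned_outage"
    severity: Severity


# Hypothetical routing table: teams to page per incident type, most relevant first.
TEAMS_BY_TYPE = {
    "customer_impacting": ["service-owner", "support", "platform"],
    "security": ["security", "service-owner", "legal"],
    "planned_outage": ["release-manager", "service-owner"],
}

# Hypothetical cap on how many teams are paged at each severity (None = no cap).
MAX_TEAMS_BY_SEVERITY = {
    Severity.SEV0: None,
    Severity.SEV1: 2,
    Severity.SEV2: 1,
    Severity.SEV3: 1,
}


def initial_responders(incident: Incident) -> list[str]:
    """Return the teams to page when the incident is declared."""
    # Unknown incident types fall back to whoever is on call.
    teams = TEAMS_BY_TYPE.get(incident.incident_type, ["on-call"])
    cap = MAX_TEAMS_BY_SEVERITY[incident.severity]
    return list(teams) if cap is None else teams[:cap]
```

With these assumed tables, a SEV3 customer-impacting incident pages only the service owner, while a SEV0 of the same type pages every listed team. Keeping the cap separate from the type table lets a team tune how wide each severity fans out without touching who owns which kind of problem.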

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Hi, I'm Emily Arnott. And I'm Nick Mason. And we're here to present Too Many People in the Room, the incident room, that is. So here we're going to take a look at an often overlooked challenge when dealing with incidents, which is that you can actually overcrowd your incident response. And this might seem a little counterintuitive. So we're going to start off by talking about what we really mean when we say too many people. Then we're going to take a look at this instinct to kind of get all hands on deck. Why is it that you call in as many people as possible? Why do incidents get crowded? Then we're going to take a look at how to achieve those goals while still preventing overcrowding. We're going to look at how to manage that urge during an incident for people to rush in and try to help. And then we're going to look at how to cement this in with a cultural foundation to make this not just kind of an arbitrary policy change, but to have a culture around incidents where they don't end up getting overcrowded. So you might ask, and I think this is a pretty reasonable question, isn't having more people better? You might think when something goes wrong, it's pretty natural to just want as many people as possible there to help you out. And we don't want to discourage that type of thinking because it's true that people tend to add value when they contribute to a problem. There's tremendous value in bringing in a diversity of people that might have fresh ideas, that might have a different perspective. But if you just end up with too many people, if you kind of go too far with this thinking, it can lead to a lot of problems. So let's take a look at the problems here. As you can see, there's quite a few. Things can get confusing. You might start to wonder, what is each person doing? Communication kind of gets poor when there's so many people. People might do redundant work. There might be excessive communication, which you might wonder what that looks like. 
But if you're scrolling through hours and hours worth of Slack messages, you'll know what excessive communication is. Time could be wasted as people show up and just sort of stand around in the incident room when they could be working on something else. There could be extra stress from all these people crowding in. You're not sure how to reassure all of them, and they might not be sure how they're actually going to contribute. There could be some anxiety about what can I do to help? There doesn't seem to be anything to do to help. Tons and tons of stuff just piling up when you have this overcrowded incident. Yeah, and I think that's definitely something that we think about when we're trying to handle an incident and the problems that go with it: it's already a stressful situation. And then there's all these other components that you just discussed, Emily, that are just piling on top of it. Way too much stress. We want to make sure that we can focus our energy on solving the problem. And there's one quote that really resonated with me. As a solutions engineer, I talk with various prospects and customers on a daily basis, and talking with an SRE manager, they were noting that it takes ten to 15 minutes for someone on their team just to get caught up to speed with an incident, just the high level information just to get started. And this seems to be a recurring theme as I've talked to several different leaders within our space, and that's just way too much time. And what this ultimately boils down to is on the next slide. You can't help if you don't know what's happening, right? You can't just throw people at the problem. As tempting as it is, they need to know what's happening and how they can effectively contribute. So teams not having this information who are trying to solve this problem, as well as a lack of communication both internally and externally, are just a few of the reasons why organizations typically bring too many people to the incident room. 
And what this basically boils down to is a lack of classification when it comes to incident types. And what I mean by an incident type is what type of problem are we actually fighting right now? And it also boils down to not having a defined escalation policy or a cultural mindset that tasks are going to be completed even if you're not constantly checking up on the team. We've all been there before in the past. It's perfectly okay to have those thoughts. Other problems I've personally seen from prospects include not having the right visibility into the impact of the incident based on your monitoring tools, or maybe not knowing who the subject matter experts to help you solve this problem are. There are plenty of tools out there that give your team visibility into who's on call for a particular service or who that subject matter expert is. But oftentimes these tools are not natively integrated into the communication tool where you may be going back and forth with the team trying to solve the incident. And this results in excessive context switching and, as I mentioned before, a stressful situation. You don't want to add more stress, you want to alleviate that stress. And if you're thinking about this as I'm talking here, I see Emily nodding, these are all natural problems that we encounter on a daily basis. You're not alone, but there are some ways that we can try to improve in these practices by not overcrowding the incident room. And personally, for me, what it all kind of starts with is really putting a definition around the incident types and their corresponding severity levels. So what is an incident type? You may be asking me, Nick, I have no idea what that means. So some things that I've been successful with in defining what an incident type is: maybe it's use case based. So is this customer impacting? Is it a security incident? Maybe it's a planned software outage. Just to name a few. Right. 
I've also seen several teams be successful in defining incidents as they relate specifically to the team that's going to be involved in solving that problem or maybe a particular service that's being impacted. And as you start to go ahead and define these different incident types and their corresponding severity levels, you're able to do a couple of different things. Right. First and foremost, you're able to bring the right people to help you solve this problem, as well as put a framework in place as to what the team should actually be doing. More to come on that in a few slides. But another important distinction here is that depending on this incident type and severity that you've captured at the start of the incident, this is going to help you bring in different groups or teams to help you solve that particular type of problem. And the last note I'd like to add on this slide here is that when defining an incident type, the severity level has just as much impact as the type of problem. Right? So for example, a severity three customer incident versus a severity zero. Severity three being not as prioritized, versus sev zero being all hands on deck. Maybe for a sev three I only need to bring in one person to help me solve the problem. But for a sev zero, I need to bring all hands on deck because this is impacting everyone. Making sure you have those distinctions is just as important as defining the problem itself. So one thing I'd like to add to that is when you think about the impact, often where companies get tripped up is they don't think about it in terms of customer impact. One thing I like to always say is if there's an incident in the middle of the woods and it crashes, but nobody hears it, does it really make a sound? So don't just think in terms of how much of your system is affected, but in terms of how crucial that is to your customers. 
So I think a common situation is, oh, we have an outage that requires the most extreme reaction we can possibly have. But if it's an outage of a service that's actually relatively unpopular, this could actually be a lower severity than just a mild slowdown on part of your service that absolutely everybody uses. So having that more nuanced understanding of how incidents actually impact your customers and how they impact your bottom line can help you respond appropriately and not overreact and overcrowd. Yeah, that's a fantastic point. Thank you so much for adding that. And that's why these type definitions and severity definitions are so important when you're trying to tackle an incident. Kind of the three words that I can put when describing how to get started with this are, before you start an incident, stop, think and assess. Right, what type of problem are you encountering? Who is it impacting? Just like Emily said. And how fast does a resolution actually need to be obtained? Because all of these are going to be contributing factors in getting the right people to join who are going to be efficient. And even though you may have this great communication that we're discussing here about defining your incidents and corresponding severity levels, you also need to know how to properly escalate. If you and the team eventually hit a roadblock, it's inevitable, to say the least, that at some point you may not have the answer. And that's perfectly okay, right? That's why we have a team that's here to help you. But instead of just having that straight line of escalation where you bring in more and more different types of people to help you solve this problem, or you bring in more senior people and they get alerted that something is on fire, something that we recommend is that you try to build a more diverse network of people that you feel comfortable reaching out to. 
And the operative word there is comfortable because, for example, if you need to go reach out to your VP and you feel a little hesitant reaching out to them because you're like, oh, they're going to think I'm not doing my job. Having that kind of first layer of people who, you know, can help you solve the problem before you escalate to that individual can definitely help kind of ease that burden of saying, hey, I'm unable to finish this, can you please step in and help me? And just as importantly, we want to make sure that we reach out to people who have the bandwidth to actually help you. So I've seen several different organizations that I've worked with in the past where they'll have some sort of escalation policy where they'll reach out to, let's just say I reach out to Emily. I'm like, Emily, I really need your help. Can you come in here? And then I check on your calendar, and you're completely swamped the entire day. There's no way you can help me. So what did I do? I added another person to the incident room, and then I have to go find another person who can actually help me solve this problem, who has the time to help me solve this problem. So we're just piling on to the problem itself. So we recently had a webinar that discussed incident command with a couple of people from Blameless and two guests from other companies. And one of them described a policy where they had incident buddies, which is kind of like always the first person you can reach out to just for a sanity check to help give you confidence to escalate further. Because it is a really human thing. We can't overlook the human qualities of being nervous to contact people and escalate, being uncertain. And that's why often we end up with these really strict escalation policies that are very linear, very hierarchical, but in the end, maybe aren't that effective. So instead, we encourage you to kind of lean into this human aspect and really think about who do I work well with? 
Who do I know that actually really knows this subject matter? Who can I count on to help explain this problem to me and walk me through it? And then you can have this kind of more personal, adaptive, on call relationship that isn't just swarming the incident with as many senior people as you can get. Yeah, that's a fantastic point. And something that resonates with me personally is that whole idea of the sanity check, right? Especially as you're in there fighting a particular problem and you've tried something three times and it's not working, and you're like, I know this is supposed to be working. Just getting that extra set of eyes on it, someone that you feel comfortable having take a look at it and not kind of roasting you that you're doing the wrong thing, is super important, right. Because that's going to move you towards a resolution at a much quicker pace than if you kind of keep it to yourself and continue to try something over and over. And what this really kind of boils down to is get just the right people who can contribute right away and escalate strategically only if you have to, but if you have to, having some of these ideas that we talked about today will definitely help in that process. And as you're bringing more people into the incident room, the Slack channel, the Teams channel, wherever you're managing incidents today, it's just as important for these people to know why they're being called into the incident room as it is for them to be involved in the incident. So defining these incident roles and tasks can help with this. Right. So what are some examples of different incident roles that I've seen be successful? The incident commander, as an example, as Emily mentioned beforehand, that's usually the person that's in charge of facilitating and moving the incident forward towards a resolution, kind of making sure everything's on track. 
Communications lead is another big one, because in most organizations, communication is usually a broken process, and it's one of the reasons why we're having this conversation here today, and that role is typically in charge of facilitating both internal and external forms of communication to different stakeholders. And the last one I'd like to note, too, is some kind of technical or engineering lead. So that's typically the person that's there to help you try and solve the problem from a technical standpoint. So a combination of those three roles is very powerful to make sure that there are particular lanes of focus in order to help you try and drive towards resolution of the incident quicker. And I think this really alleviates one of the big fears that we talked about earlier, which is that people start to get paranoid that maybe something won't get covered, that some part of the response process will fall through the cracks. Maybe someone won't get informed, maybe some due diligence around recording things won't get done. So if you have these roles and these checklists already defined, people will be confident. Oh, I know there's enough people in there to handle communications and to handle recording and to handle leading up the technical front, and they won't have that anxiety that might make them want to jump in or call in a bunch of extra people. Yeah, that's a super important note to mention there, and thank you so much for bringing that up. Oftentimes, people come into the incident room because they feel like they need to get an answer quicker. But one of the kind of highlight moments that I'm going to mention here is that sometimes you just have to trust the process. So establishing these different roles and tasks, and making sure that someone that's qualified to move the incident forward based on those roles and the tasks that they're provided is in place, is super important. And to me, it really boils down to three key aspects of roles and tasks. 
They help with accountability, right? So if sales needed to get an update for this particular incident type and severity, the communications lead is held accountable to make sure that they're getting that update. Consistency in the incident process is something that's just as important, right. As you're building out your kind of initiative to drive towards SRE as a whole, you need to make sure that you're following the process so you can use it as an opportunity to learn. If you don't follow the process the way it's been built today, you're not going to be able to make those gradual changes that will make your incident management process more efficient. And then lastly, roles and tasks help with communication. And as I mentioned before, that's one of the reasons why we're here today. As the incident evolves, communicate, period. Right? You want to make sure that you get the word out to as many stakeholders as you can, or more specifically, those who need to be notified. Right? We recommend trying to get a system in place to automatically deliver these updates to relevant stakeholders, customers, management, et cetera. But specifically, some people need to know things immediately, and other people only need to know what's happening for high level outcomes. Right? So make sure that you differentiate these different groups and respond to them accordingly. But at the same time as having these automated communications in place, you also need to have some sort of method to send out ad hoc forms of communication through the communication tool that you're using, or being able to send out emails or text messages, status page updates, kind of on the fly. So the combination of those two is very important. What is one of the kind of main areas that you've seen, Emily, in terms of communication that can kind of be improved upon based on your experience? Well, I think something that happens very often is that higher rungs of management will get anxious during incidents. 
They'll start to wonder, how are things progressing? Have they considered this? What messaging can I take to our other stakeholders, to our board, to our customers? And you start getting these layers of middle management jumping into the incident without really being able to contribute a whole lot and sometimes putting a lot of pressure on the engineers who are trying to focus to give them these answers. So I think proactiveness is really, really key here. And like you said, having automatic systems set up to deliver messages when status changes or when different things progress is really great because then nobody has to kind of lift their head up off the desk. They can stay focused in on the applications they need to be in and not have to worry about sending off an email or whatever. I think really communication should be something that alleviates these sorts of concerns, that gives them confidence in the process and the system so that they can focus on what they need to do, knowing that the incident is being taken care of. So it's important to kind of walk the line between personally addressing whatever might be concerning these other stakeholders and also having something really fluid and automatic that doesn't take people out of their tasks. So it's a difficult process to pin down. But with good tooling and with having somebody focused on this sort of communication aspect, you can really alleviate a lot of the burden and a lot of the overcrowding. Yeah, 100%. And if any of you out there are listening to this and you're thinking that, oh, this sounds like a big change or we don't have any of this in place today, like I mentioned earlier, you're not alone out there. Right? This is all natural to be feeling and it's a gradual change. And that kind of leads us into our final couple of slides here. After listening to everything that we've talked about today, what will you need to remember to start? 
Communication doesn't need to be limited to that particular role. Right. The communications lead's job is to send information out about the incident to those key stakeholders. But as someone that's involved in solving the problem at hand, don't be afraid to mark something as important when communicating. The process shouldn't be rigid, but it should be a foundation that you can work on top of, for example, being able to mark something that is a key finding within Slack or Microsoft Teams as important while you're in the incident room troubleshooting. I see organizations gain benefit from this on a daily basis. Adding those key pieces of conversation to your incident timeline so you can go back and take a look at what was important for helping you solve that problem is just as important. The worst thing you could do there is not mark it, and then it may have gone unnoticed. Right? Or maybe that was the solution to the problem. Mark it down as important and let those key contributors with incident roles handle their tasks. So collecting relevant comments from the Slack channel to document in the incident timeline is a great example of something the commander would do. Hey, I'm scrolling through. I'm seeing all this kind of chat back and forth, is this relevant? Someone marked it as relevant. Being able to efficiently, for example, within Slack, toggle yes or no, is this important or not, is something that a commander would technically be in charge of doing. Having some of these elements that we discussed today incorporated into your incident management process will help you cut through the noise and drive towards resolution at a much quicker pace. This is cultural change, and it's a communication that we're trying to build. So make sure you're trusting that process. The retrospective or the post incident analysis is a great way of driving that cultural change. 
So being able to ask your team after the incident has been completed, when everybody's tired and they're ready to go home for the day, but making sure that you still get the answers to some key questions, like, did you have the right team automatically added to the incident channel to help you troubleshoot that problem? Or did you feel like you had the resources you needed in order to start addressing that issue? Or were the right individuals or channels automatically sent communication when the incident was created? These are just a couple of questions that I've seen kind of help in that cultural adoption. Emily, are there any particular notes that you have when it comes to kind of like this cultural foundation? So one thing I want to point out is when we're talking about this idea of learning from incidents and the retrospectives and such, I think it really is often a situation where people will jump into the incident room because they want to know what's going on. And like you said, it's another instance where they have to be able to trust the process. If they think, oh, I need to be there because otherwise I'll have no clue what ended up happening, no clue what the resolution was, and I could run into the same problem later, that's not the best reason to actually be involved in an incident and be in the response room. Instead, they should trust that, oh, someone is recording what's important. There will be a document that's made. I will be able to go back and learn what I need to from this. Like you say, yeah, it's all about trusting the process, and process isn't something that gets built overnight. You really do need a cultural foundation that's built around that trust, that gives people confidence that, oh, if we keep doing this, the process will get better and better. 
And I think a major thing is this idea of kind of psychological safety, that you're not going to get punished for screwing up the process, whether that be, oops, I accidentally invited the entire development team for this project because I was freaking out. You know what? That's okay. It's probably going to cause some problems in the incident, but it's not going to be the end of the world. It's a learning experience. And similarly, if you see an incident happening and you see that there's some people working in it, to have the psychological safety to say, you know what? I wasn't called on this. I wasn't alerted. I'm maybe a little concerned, I'm a little curious, but I'll trust the system and I can stay out. And to know that you're never going to get reprimanded and be told, hey, why didn't you join? Why didn't you try to help? So just this culture of feeling safe, because you're trusting in the process and you're trusting in a culture where you're not going to get blamed, you're not going to be found at fault, but where all these errors will just be learning opportunities to improve the process. It's okay to fail at this stuff. It's okay to have a system that doesn't work right away. It's all about iterating and learning and understanding how you can get better in the future. Yeah, 100%. That's super powerful. And there were kind of two key takeaways that really resonated with me. One is that, as I mentioned before, the worst thing you could do is not suggest something, right? When you're popping in or you've been assigned a role and you have an idea, you may not know that it's the answer, but it could be the answer. Jotting that down and having that psychological safety to do so is step one. Because if you don't feel comfortable putting that information down, you may not uncover what the solution actually was. So that's super important. And I think the other key piece that really resonated with what you said was that you want your employees to feel heard, right? 
So the retrospective or the post incident analysis is a great way for your employees to feel heard after they spent that energy fighting the problem. If there are any changes that need to be made in your incident management process or these communication workflows that you've set up, the team that was battling the incident will be the first line of information to streamline your incident management process as a whole. And that's where that iterative process really kicks in, making those gradual changes over time. So that process makes everyone feel more included and heard, and it's moving towards a resolution at a quicker... um, that's all I got here. Emily, is there anything else you'd like to add? I'd like to just thank everybody for coming today and listening to our talk. Yeah. Thank you so much for listening. We hope that we've convinced you that sometimes a lean, focused team can beat out a large all hands on deck scenario. And we hope we gave you some tips on how to move towards that. Awesome. All right. Thank you all so much. Thank you so much. Have a great day.
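
The talk's recommendation to automatically deliver updates, while separating people who need to know things immediately from those who only need high level outcomes, could look something like the following Python sketch. Everything named here is an assumption made for illustration: the audience labels, the event names, and the `recipients` helper are hypothetical and do not describe any specific product.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Subscriber:
    name: str
    audience: str  # hypothetical labels: "responder", "management", "customer"


# Hypothetical policy: which incident events each audience is notified about.
# Responders see everything; management and customers see only major outcomes.
EVENTS_BY_AUDIENCE = {
    "responder": {"created", "severity_changed", "status_changed", "key_finding", "resolved"},
    "management": {"created", "severity_changed", "resolved"},
    "customer": {"created", "resolved"},
}


def recipients(event: str, subscribers: list[Subscriber]) -> list[str]:
    """Return the names of subscribers who should receive an update for this event."""
    return [
        s.name
        for s in subscribers
        # Unknown audiences receive nothing rather than everything.
        if event in EVENTS_BY_AUDIENCE.get(s.audience, set())
    ]


subscribers = [
    Subscriber("eng-lead", "responder"),
    Subscriber("vp-engineering", "management"),
    Subscriber("status-page", "customer"),
]
```

Under these assumed rules, a key finding marked in the incident channel reaches only the responders, while a resolution reaches all three audiences, so management gets proactive updates without having to join the incident room to ask for them. Ad hoc messages would still need a separate, manual path.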

Nick Mason

Solutions Engineer @ Blameless

Emily Arnott

Community Manager @ Blameless



