Conf42 Cloud Native 2024 - Online

Why repair a burning house? A guide to incident management

Abstract

An instinctive reaction is to close the security hole or fix the bug. Instead, you should be isolating the problem and restoring service! This talk is based on hundreds of incidents and years of operations experience.

Summary

  • This is a collection of hard-won learnings from hundreds of software incidents. The key takeaways are clear communication and ownership. An incident is a coordinated response to mitigate an issue. The focus must be on preventing things from getting worse and mitigating the existing issue.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
This is "Why repair a burning house?" with me, Richard Tweed. This is a collection of hard-won learnings from hundreds of software incidents. It is not a guide to the process of incident management. At the end of this, there'll be some links to examples of such guides. Think of this as common behaviors and challenges I see for folks during incidents. Also, try and guess which book influenced these slides and send me a message afterwards. The key takeaways from this are going to be: mitigate, don't fix; clear communication and ownership; learn from mistakes; things go wrong all the time; and should I declare an incident? Yes.
What is an incident? Why have I written this? An incident is a coordinated response to mitigate an issue. Now that we know what an incident is, we should know what to do. So for our burning house, this would be declaring an incident to put out the fire and get everyone to safety.
Clear communication and ownership. During incidents, someone should know what's going on, and everybody involved should know what they're meant to be doing and when. Depending on what happens during that incident and how critical a problem it is, you'll either have the bystander effect ("somebody else is already fixing this, right?") or you'll encounter the exact opposite, where everyone's trying to be a hero, trying to help, but they only succeed in getting in everyone else's way. These can usually be avoided by having clear responsibility for coordinating the response. This is normally done by a role called the incident manager. You will also have someone who is hands-on-keyboard for the mitigation, often called the operations lead. Another way that you can avoid these issues is by clearly telling somebody what you want done and when you want updates. The person organizing, the incident manager, should not be the one fixing anything. They should be focused on making sure everyone else has what they need and that they, at least, know what's going on.
During an incident, you should also do explicit handovers. At the beginning there'll only be one person, but over time you'll have more people doing more things. So whenever you're handing over those responsibilities, do it explicitly. For example, say you need a break, which you will. You can go to the person you're handing over to and say, "Will you take over as incident manager? Okay, you are now the incident manager. I will see you in X minutes." So in our house fire example, the incident manager would be the person who finds the fire. They would then hand over that responsibility to the firefighters once they arrive, briefing them on how many people are expected to be in the building, how far the fire has spread, and any other relevant information at that point.
Similarly, it's very common for decision paralysis to set in during important incidents. To reduce this, tell folks what you're going to do and ask whether they disagree. Don't just ask open-ended questions like "What should we do?" If you're having trouble getting decisions from people, remember that engineers and developers love correcting something or someone that's wrong.
Also please, please write things down. Having a record of who thought what, what was done, and why is invaluable for getting people who join later up to speed and making them effective promptly. It's also very useful for the write-up afterwards and for seeing what can be improved before the next incident.
Mitigate, don't fix. Most developers, when faced with an issue, will dive directly into the source code to try and find the bug and create a fix. Sorry to break it to you, but this is a complete waste of your time in these situations, especially at the start of an incident. If the incident is long-running, there may be some cause for this. But during an incident your focus must be on preventing things from getting worse and mitigating the existing issue.
Regardless of how fast your CI system is and how robust your testing, developing a fix and testing it to the standards of your team will take too long. It also ignores the very real possibility that the fix makes things worse. An incident is not an excuse to ignore everything you've learned about safe coding practices. If you need help from other teams, don't be afraid to ask for it. Page them if you need to. The priority is mitigation. To use the burning house analogy, you shouldn't be installing a replacement wooden stairwell while the house is still on fire. Put the fire out. Then you can plan your rebuild or your remodel.
Things go wrong all the time. When an incident is declared, there can be an instinct to panic or to rush to conclusions. The best thing you can do is take a moment to work out what's actually going on and coordinate with others to investigate the issue and eventually mitigate it. If you see a fire, it is better by far to check how far it has spread and whether there are any extinguishers of the correct type nearby, rather than blindly grabbing the closest one and using a water extinguisher on an oil fire. Have a look at a video of that. It's pretty dramatic. So for our house example, everyone burns toast, drops a glass, and throws a Switch remote at the TV. Okay, maybe just me for the last one.
Learn from mistakes. Once the incident has passed, learn everything you can from it. Don't just fix the bug. Incidents are a tremendous way to learn how your systems work, how your processes work, and what people's natural instincts are. Those aren't necessarily what you would expect from talking to people during the normal nine-to-five. You could learn, for example, that your silence and escalate buttons are too close together for your 4:00 a.m. brain, and now you've woken up a director.
Or you could find out that your runbooks, your READMEs, your documentation, and your training only reference the old name of the service rather than the new one, so you had to spend half an hour or an hour trying to work out what this thing could possibly be. We learned to use the Switch remote straps. Also, it's incredibly likely in an incident that many of your preparations, maybe even all of them, will be forgotten in the heat of the moment. Try and remember that fact. Then try and remember your training and experience. As long as you get back on track, it doesn't matter if you started on the wrong foot. For a house fire, you might be so preoccupied with the flames you see that you forget to get out of the house.
Should I declare an incident? If you're wondering whether to declare an incident, do. The fact you're wondering at all means it's worthy of investigation. As mentioned before, if an incident is called and turns out to be unnecessary, then delve into why. Use the five whys technique. It's another opportunity to learn and improve. If your dashboards are always red because you misconfigured a threshold, fix it so that you're not in that situation again. A fire alarm screaming because you burnt toast is far better than the alarm not going off when there's a real electrical fire.
Just to repeat, the key takeaways from this were: "Mitigate, don't fix." "Clear communication and ownership." "Learn from mistakes." "Things go wrong all the time." And "Should I declare an incident? If you're wondering, yes." Here are some useful resources. There are entire books written about this, so if it's something you're interested in, do go off and read about it
...

Richard Finlay Tweed

Senior Site Reliability Engineer



