Transcript
This transcript was autogenerated. To make changes, submit a PR.
This is Why Repair a Burning House, with me, Richard Tweed.
This is a collection of hard-won learnings
from hundreds of software incidents. It is not a guide
to the process of incident management. At the end
of this, there'll be some links to examples of those.
Think of this as common behaviors and
challenges I see for folks during incidents.
Also, try and guess which book influenced these slides and send
me a message afterwards. Key takeaways from this are going
to be: mitigate, don't fix; clear
communication and ownership; learn from mistakes;
things go wrong all the time; and should
I declare an incident? Yes. What is an
incident? Why have I written this?
An incident is a coordinated response to
mitigate an issue. Now that we know what
an incident is, we should know what to do. So for
our burning house, this would be declaring an incident to put out the
fire and get everyone to safety. Clear communication
and ownership. During incidents,
someone should know what's going on,
and everybody involved should know what they're meant to be doing and
when. Depending on what happens during that incident
and how critical a problem it is,
you'll either have the bystander effect ("Somebody else is
already fixing this, right?")
or you'll encounter the exact opposite, where everyone's
trying to be a hero, trying to help, but they only succeed
in getting in everyone else's way. These can usually be
avoided by having clear responsibility for coordinating
the response. This is normally done by a role called the incident
manager. You will also have someone
who is hands on keyboard for the mitigation,
often called the operations lead. Another way that you
can avoid these issues is by clearly telling
somebody what you want done and when you want updates.
The person organizing, the incident manager, should not
be the one fixing anything. They should
be focused on making sure everyone else has what they need and that
at least they know what's going on.
During an incident, you should also do explicit handovers.
At the beginning there'll only be one person, but over time you'll
have more people doing more things. So whenever you're handing
over those responsibilities, do it explicitly.
For example, say you need a break, which you will.
You can go to the person you're handing over to and say,
"Will you take over as incident manager?
Okay, you are now the incident manager. I will see you
in X minutes." So in our house fire example,
the incident manager would be the person who finds the fire.
They would then hand over that responsibility to the firefighters
once they arrive. They would then brief them about how many people
are expected to be in the building, how far the fire has spread,
and any other relevant information at that point.
Similarly, it's very common for decision paralysis
to set in during important incidents.
To reduce this, tell folks what you're going to do
and ask whether they disagree. Don't just
ask open-ended questions like "What should we do?"
if you're having trouble getting decisions from people.
Engineers and developers love correcting something
or someone that's wrong. Also, please, please write
things down. Having a record of who thought
what, what was done and why
is invaluable for getting people who join
later up to speed and getting them effective promptly.
It's also very useful for the write up afterwards and seeing
what can be improved before the next incident.
Mitigate - Don't fix. Most developers,
when faced with an issue, will dive directly into
the source code to try and find the bug and create a fix.
Sorry to break it to you, but this is a complete waste
of your time in these situations,
especially at the start of an incident.
If it's long running, there may be some cause for this.
But yeah, during an incident your focus must
be on preventing things getting worse and mitigating
the existing issue. Regardless of how fast your CI
system is and how robust your testing,
developing a fix and testing it to the standards of your team
will take too long. It also ignores the
very real possibility that the fix makes things worse.
An incident is not an excuse to ignore everything you've learned
about safe coding practices. If you
need help from other teams, don't be afraid to
ask for it. Page them if you need to and ask for help.
The priority is mitigation.
To use the burning house analogy, you shouldn't be installing
a replacement wooden stairwell while it's still on
fire. Put the fire out. Then you can
plan your rebuild or your remodel. Things go
wrong all the time. When an incident is declared,
there can be an instinct to panic or to rush to conclusions.
The best thing you can do is take a moment to work out what's actually
going on and coordinate with others to actually investigate the issue
and eventually mitigate it. If you see a fire,
it is better by far to check how far it has spread and
whether there are any extinguishers of the correct type nearby,
rather than blindly grabbing the closest one and
using a water extinguisher on an oil fire.
Have a look at a video of that. It's pretty dramatic.
So for our house example,
everyone burns toast, drops a glass, and
throws a Switch remote at the TV. Okay, maybe just me
for the last one. Learn from mistakes.
Once the incident has passed, learn everything you
can from an incident. Don't just fix the bug.
Incidents are a tremendous way to learn how your systems work,
how your processes work, and what people's natural
instincts are. They aren't necessarily what you would expect
from talking to people during the normal nine to
five. You could learn, for example, that your
silence and escalate buttons are too close together
for your 4 a.m. brain, and now you've woken up
a director. Or you could find out that
your runbooks, your readmes, your documentation, your training
only references the old name of the service rather
than the new one. So you had to spend half an hour
or an hour trying to work out what this thing could possibly be.
We learned to use the Switch remote straps.
Also, it's incredibly likely in an incident
that many of your preparations, maybe even
all of them, will be forgotten in the heat of the moment.
Try and remember that fact. Then try and remember your training and
experience. As long as you get back on track, it doesn't matter if
you started on the wrong foot. For a house fire,
you might be so preoccupied with the flames that you see that
you forget to get out of the house to get out of the fire.
Should I declare an incident? If you're wondering whether
to declare an incident, do. The fact you're wondering at all means
it's worthy of investigation. As mentioned before, if an incident
is called and turns out to be unnecessary,
then delve into why. Use the five whys technique.
It's another opportunity to learn and improve.
If your dashboards are always red because you misconfigured a threshold,
fix it so that you're not in that situation again.
A fire alarm screaming because you burned toast is
far better than the alarm not going off when there's
a real electrical fire. Just to repeat,
the key takeaways from this were: "Mitigate, don't fix,"
"Clear communication and ownership," "Learn from
mistakes," "Things go wrong all the time,"
and "Should I declare an incident? If you're wondering,
yes." Here are some useful resources.
There are entire books written about this, so if it's something you're interested
in, do go off and read about it.