Abstract
Incidents are a great opportunity to gather both context and skill. They take people out of their day-to-day roles, and force ephemeral teams together to solve unexpected and challenging problems.
The first part of the talk will walk through the different things you can learn from incidents, including:
* Taking you to the edges of the systems your team owns, and beyond - incidents help broaden your understanding of the context in which you’re building
* Showing you how systems fail, so you can learn to identify and build software with good observability, and considerations of failure modes
* Expanding your network inside your organisation, making connections with different people, who you can learn from and collaborate with
We’ll then talk about how to get the best value from the incidents which you do have as an individual, thinking about when is an appropriate time to ask questions, and how to get your own learnings without ‘getting in the way’.
Finally, we’ll discuss how to make this part of the culture of an organisation: as part of the leadership team, what can you do to encourage this across your teams?
Transcript
Let's start with a story.
One of my first coding jobs was at a company called GoCardless.
I'd been there for a few months when we had a major incident. Our API
had slowed to a crawl. I was pretty curious,
so I jumped into the incident channel. One
endpoint in particular was consistently timing out.
So we disabled it to get the system back up and running
again. And it worked. Step one
complete. Now we had to understand what
had actually happened. There weren't any recent changes that
looked suspicious, so our attention shifted
to the database. It turned out the
query plan for this particular query had changed from
something that was expensive but manageable to something that was
not at all manageable. We made a subtle
change to the query, which got the database to revert
to the good old query plan and everything was back up
and running. We'd fixed it.
Well, I say we. I watched quietly from
the sidelines, furiously taking notes. After the incident
was over, I turned to a senior engineer in my team:
"Daniel, what is a query plan?"
We'll come back to that in a second. First of
all, hi, I'm Lisa Karlin Curtis.
Last year I joined incident.io as employee number two.
We build incident management tooling for your whole organization. And so of
course, incidents and incident response are very close to my
heart. And really this is a talk about
why I've accelerated
my career by running towards the fire.
When I joined GoCardless, I was pretty junior
and I progressed quite fast. I made
senior a lot faster than I'd expected.
I started reflecting on how that happened. Of course,
like anything, it was a number of factors, but one pattern
really stood out to me. The big changes in my understanding
and my ability to solve larger, more complex problems
came as a result of incidents that I'd been involved in.
I was introduced to new technologies, learned new skills,
and met people who became some of my closest friends.
And every time I'd come out as a better engineer.
And this is why I love incidents.
Incidents broaden your horizons. As engineers,
we inhabit a world full of black boxes, whether that's
a programming language, a framework, or a database. We learn how
to use the interface to get it to do whatever it is we need,
and we move on. If we tried to understand how
everything worked down to the individual chips on each
machine, we'd never get to ship,
well, anything.
Incidents force you to open the black boxes around you, peek inside
and learn just enough to solve the problem.
After the incident, I read up on query plans
and this proved very useful. It was not our
last query plan related incident; we did have an enormous
Postgres instance, after all.
It was also useful for building new things. I was suddenly
able to write code that scaled well the first time, rather than relying
on, frankly, trial and error in production.
Incidents give you great signal about which of these black boxes are worth opening,
and a real world example that you can use as a starting point.
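If, like me back then, you've never looked at a query plan, here's a minimal sketch of how you might inspect one, assuming Postgres accessed via psycopg2. The payments table and query are illustrative, not the ones from the incident.

```python
# Hypothetical example: inspecting a Postgres query plan with EXPLAIN.
import psycopg2

conn = psycopg2.connect("dbname=payments")  # illustrative connection string

with conn.cursor() as cur:
    # EXPLAIN shows the plan the planner would pick without running the query;
    # EXPLAIN ANALYZE runs it and reports actual timings as well.
    cur.execute(
        """
        EXPLAIN ANALYZE
        SELECT * FROM payments
        WHERE merchant_id = %s
          AND created_at > now() - interval '30 days'
        """,
        ("merchant_123",),
    )
    # Each row of the result is one line of the plan, e.g.
    # "Index Scan using payments_merchant_id_idx ..." vs the dreaded "Seq Scan ...".
    for (line,) in cur.fetchall():
        print(line)
```

Comparing the plan before and after a schema or data change is often enough to spot the kind of flip we hit in that incident.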
Incidents teach you to build systems that fail gracefully.
One of the key follow-ups from the API incident was to add statement timeouts
on all of our database calls. This meant
that if we issued a bad query, Postgres would try
for a few seconds and then give up.
That might sound counterintuitive: deliberately cancelling
queries? Someone's going to be sad. But if my options
are to do that or to lose the whole API,
of course I'd choose to degrade just a handful of queries.
This is an excellent example of resilient engineering.
Your system can now handle unexpected failures.
We don't need to know what will issue a bad query,
just that it's likely that something will.
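A minimal sketch of what that looks like in practice, assuming Postgres and psycopg2; the timeout value and query are made up for illustration, not GoCardless's actual settings.

```python
# Hypothetical example: a per-session statement timeout so runaway queries
# get cancelled instead of dragging the whole API down.
import psycopg2

conn = psycopg2.connect("dbname=payments")  # illustrative connection string

with conn.cursor() as cur:
    # Any statement on this session running longer than 5s is cancelled by Postgres.
    cur.execute("SET statement_timeout = '5s'")
    try:
        cur.execute("SELECT count(*) FROM payments")  # a potentially expensive query
        print(cur.fetchone())
    except psycopg2.errors.QueryCanceled:
        # The one bad query fails; every other request keeps working.
        conn.rollback()
        print("Query exceeded the statement timeout, degrading gracefully")
```

In a real application you'd typically set this once as a connection or role default rather than per session, but the effect is the same: a single bad query fails fast instead of taking everything else down with it.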
It's possible to read about these ideas in a book. There are plenty
of great books, but in my experience, nothing compares to
seeing it in action. During the incident,
I learnt a whole set of tools that I could employ to reduce the blast
radius of these kinds of failures, not just the statement
timeouts which we implemented, but all the other options that
the incident response team discussed and discarded.
Incidents are a chance for blue sky thinking.
A doctor never wants to amputate somebody's arm; when they
choose to, it's because the alternative is even worse.
In an incident, nothing is off the table. When you're already in
a bad place, sometimes you have to make one thing worse to
mitigate the wider problem, and that's what
we did during our API incident. We disabled an entire
endpoint, which feels like a thing that you'd never do,
but in context was absolutely the correct choice,
and given the same situation another time, I'd make the same
choice again. Incidents give
you a rare opportunity to think outside of your normal constraints.
A degraded service is far better than a completely absent one.
Incidents teach you to make systems easier to debug.
Observability isn't straightforward. If you needed proof,
I've certainly shipped plenty of useless log lines and metrics in my time.
To build genuinely observable systems, you need to have empathy
for your future self or teammate who'll be debugging the issue, if
they're unlucky, at 2am. And that empathy is, again,
hard to learn in the abstract. The people I've met who
do this really well are leaning on their experience of debugging issues.
They're pattern matching on the things they've seen before,
and that allows them to identify useful places for logs and metrics,
and useful metadata to include. Incidents
are a great shortcut to get this kind of experience and build a repository
of patterns that you can recognize going forwards.
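To make that empathy a little more concrete, here's a hedged sketch of the kind of structured, metadata-rich log line experience tends to push you towards. The field names and endpoint are made up for the example, not taken from the talk.

```python
# Hypothetical example: one structured log line per request, with enough
# metadata to filter and correlate during an incident.
import json
import logging
import time

logging.basicConfig(level=logging.INFO, format="%(message)s")
logger = logging.getLogger("api")


def handle_request(request_id: str, endpoint: str) -> None:
    start = time.monotonic()
    outcome = "ok"
    try:
        pass  # ... actually handle the request here ...
    except Exception:
        outcome = "error"
        raise
    finally:
        # Emit machine-parseable JSON rather than free-form text, so you can
        # slice by endpoint, outcome or duration when things go wrong.
        logger.info(json.dumps({
            "event": "request_handled",
            "request_id": request_id,
            "endpoint": endpoint,
            "outcome": outcome,
            "duration_ms": round((time.monotonic() - start) * 1000, 1),
        }))


handle_request("req_123", "/partners/payouts")
```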
Incidents build your network. They provide a
great opportunity to meet people outside your team and forge strong
relationships along the way. As psychologists have
known for a while, there's something about going through a stressful
situation with someone that forges a connection more quickly than normal.
Kate was one of the account managers of a partner who was
really badly affected by our API incident.
She turned out to be a great person to know.
She managed a number of our biggest partners, so she had unique
insights into what they wanted and how we could serve them better.
Before the incident, I'm embarrassed to say I didn't even know
her name and I was on the product team in charge
of serving partners.
Incidents are great for building relationships in the wider org.
Most of the non-engineering folks I met at GoCardless, whether from
finance, risk, support or marketing, I met during
incidents, and those relationships proved really
valuable. They gave me a mental map of the rest of
the company and meant that I had a friendly face that I could talk to when
I needed advice. As I became more senior,
that network became even more important as I was responsible for larger
projects which had wider implications for the company.
Incidents are a chance to learn from the best. When
things go wrong, when things go really, really wrong,
people from all over the company get pulled in to help fix it.
But they're not just any people,
they're the people with the most context, the most experience,
the most skill that everybody trusts to fix the
hardest problems. Getting to spend
time with these folks is rare. They're likely to be some of the busiest people
in the company. Incidents provide a unique opportunity
to learn from them and see firsthand how they approach a
challenging problem.
For me, the API incident gave me opportunities to learn much faster
than I otherwise would have. Who knows how long it might have been before
I'd realized that I really did need to know what a query plan was,
probably until my own code broke in the same way.
Incidents have unusually high information density compared with
day to day work, and they enable you to piggyback on the
experience of others. At GoCardless, I was lucky.
Their culture and processes meant that I could see incident channels and follow
along, allowing me the opportunity to accelerate.
But that's not always the case. Some teams
run incidents in private channels by default, operating an invite only policy.
That means that junior members who want to observe rather than participate
might not even be aware that they're happening.
Sometimes people are excluded for other reasons.
It's not culturally encouraged to get involved. There's an
in group who handle all the incidents, and everyone else should just get
out of their way. Joining that in-group,
even as a new senior, can become almost impossible.
So let's look at what we can do to build a culture where everyone can
learn from incidents by making them accessible.
First, declare lots of incidents.
This is the single most impactful change you can make to your
incident process. If you only declare incidents when
things get really bad, you won't get a chance to practice your incident process.
By lowering the bar for what counts as an incident, when the really
bad ones do come around, the response is a well-oiled machine.
Everybody knows the tools, everybody knows the terminology,
and everybody can act as best they're able to
try and fix the severe issue. It also helps
with learning. When problems are handled as incidents, it makes them more
accessible to everyone around you. Now,
maybe it goes without saying, but if you want to encourage that, the first
step is to stop counting incidents. If you count your
incidents and consider more incidents to be bad, that's a clear incentive
against people declaring low severity incidents.
Second, encourage everyone to participate.
Incidents are great learning opportunities and they should be
accessible to everybody. Incident channels should be
public by default and engagement encouraged for team members at all levels.
Of course, there can be too much of a good thing.
Having 20 people descend into a minor incident channel may not
be the outcome that you're hoping for, but most incidents can
comfortably accommodate two or three junior responders tagging along.
This doesn't have to come at the cost of a good response.
You can get this experience in a low risk environment
either by asking questions to someone who's not actively responding
to the incident or doing what I did and saving them
up for after it's resolved. There are
also lots of other ways to gather learnings. Reading debrief documents
or attending post incident reviews are both great ways of getting value
from your team's incidents. You could even compile
a list of the best incident debriefs to share with new joiners.
They're a great way to get started in a new company.
Get into the habit of showing your working in
an incident. You should put as much information as you can into the incident channel.
What command did you run? What theory have you disproved?
If you're debugging on your own, I admit this can feel a
little bit strange. I've been sat at 10:00 p.m.
in an incident channel having a frankly delightful conversation with myself.
But it's worth it, I promise. It's useful
for your response, as it means you don't have to rely on your memory to
know exactly what you've already tried and when. And it makes handing
over much easier if you actually need to go to a meeting and
somebody else needs to take over. But it's also beneficial
for the rest of the team. By writing everything down,
you're enabling everybody else to learn from your experience: how
you approached the problem, what things you tried, where you
looked to find that bit of information. Just because it's obvious to you,
it doesn't mean it's obvious to everybody.
That means we should be using public Slack channels wherever possible so that
everyone can see, and having a central location where folks
can go to find incidents that they might be interested in.
I'm a bit biased here, but using an incident management platform
such as incident.io really does help with this one.
And finally, watch out for anybody playing the hero.
Often a single engineer takes on a lot of the incidents response burden,
fixing things before anybody even knows that they're broken.
Maybe that used to be you, maybe it still is.
This doesn't end well for the hero. They'll stop getting as much credit
as they expect for fixing things as it becomes normalized and they're
at risk of burning out. But it also causes problems for
the rest of the team. Without meaning to,
the hero is taking away all of these learning opportunities from everyone else
by fixing things quietly in the corner. And that means
no one else is ever going to be able to do what they do as
effectively because no one's had any practice.
While that's perhaps an effective job preservation tactic,
it's not going to result in a high performing team.
If you think you get a lot of recognition for resolving incidents,
imagine how much you can get if you can level up your entire team
so that they can do the same.
So that's all we've got time for. Thanks so much for listening.
If you're interested in incidents in general, we've got a Slack community
at incident.io/community and I'd love to chat to you
there or on Twitter. You can find me at @Henge, and I'll
also be on the Conf42 Discord server. Enjoy the rest of the conference.