Abstract
Incidents are a great opportunity to gather both context and skill. They take people out of their day-to-day roles, and force ephemeral teams together to solve unexpected and challenging problems.
The first part of the talk will walk through the different things you can learn from incidents, including:
- Taking you to the edges of the systems your team owns, and beyond - incidents help broaden your understanding of the context in which you’re building
- Showing you how systems fail, so you can learn to identify failure modes and build software with good observability
- Expanding your network inside your organisation, making connections with different people, who you can learn from and collaborate with
We’ll then talk about how to get the best value from the incidents which you do have as an individual, thinking about when is an appropriate time to ask questions, and how to get your own learnings without ‘getting in the way’.
Finally, we’ll discuss how to make this part of the culture of an organisation: as part of the leadership team, what can you do to encourage this across your teams?
Transcript
Hey, let's start with a story. One of my first
coding jobs was at a company called GoCardless. I'd been
there for a few months when we had a major incident. Our API, which was
basically our entire product, had slowed to a crawl. I was pretty
curious, so I jumped into the incident channel. We figured
out that a particular endpoint was causing the issue by sending a
bad query to our large Postgres database, which was now struggling.
We disabled the bad endpoint to get the rest of the system up and running
again, and it worked. Then we had to understand what
had happened. There weren't any recent changes that looked suspicious.
It turned out that the query plan for this particular query had changed
from something that was expensive but manageable, to something
that was not at all manageable. To make matters worse,
there wasn't a timeout on the query, so the database would keep
running the expensive task long after the person asking for it had given
up on ever getting a response. We made a subtle change to
the query, which made the database revert to the good query
plan. Everything was back up and running. We'd fixed it.
Well, I say we. I watched quietly
from the sidelines, furiously scribbling notes. After the incident
was over, I turned to my colleague: "What is a query plan?"
We'll come back to this in a second. Hey, I'm Lisa Karlin
Curtis. Last year I joined incident.io as employee number
two. We build an incident management platform for your whole organisation
and incidents and incident response are naturally very close to my heart.
And fundamentally, this is a talk about why I've accelerated
my career by running towards the fire. When I joined GoCardless
I was a pretty junior engineer. I progressed very rapidly.
I made senior, honestly, quite a lot faster than I'd expected.
I was reflecting on how that had happened and of course,
like anything it was a number of factors. But a pattern
stood out to me: the big step changes in my understanding, and in
my ability to solve larger, more complex problems and
reason about tradeoffs, came as a result of the incidents
that I'd participated in or observed. I was introduced to new technologies,
learned new skills and met people who became some of my closest
friends. And every time, I'd come out a better engineer.
So this is why I love incidents. Incidents broaden your horizons.
As engineers, we live in a world full of black boxes,
whether that's a programming language, a framework or a database,
we learn how to use the interface and we move on.
If we tried to understand how everything worked, down to the metal or the transistors
in our laptops, we'd never get to ship,
well, anything. Incidents force you to open
the black boxes around you, peek inside and learn just
enough to solve the problem. After this incident,
I read up on query plans and this proved really useful.
It was not our last query plan related incident, far from it.
It was also useful when I was building new things. I was
suddenly able to write code that scaled well the first time.
Seeing this stuff in real life helped me see into the
crystal ball like my senior colleagues could and truly understand the
impact of the tradeoffs we were making when talking to the database.
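If you've never peeked inside that particular black box: a query plan is the strategy Postgres chooses for answering a query, and you can ask it to show you. Here's a minimal sketch, assuming Postgres and the psycopg2 driver; the connection string, table, and column names are invented for illustration, not the real ones from that incident.

```python
# A minimal sketch of inspecting a query plan, assuming Postgres and the
# psycopg2 driver. All names here are illustrative, not from the incident.
import psycopg2

conn = psycopg2.connect("dbname=payments")  # hypothetical connection string

with conn.cursor() as cur:
    # EXPLAIN asks the planner what it *would* do (index scan vs sequential
    # scan, join order, estimated cost) without actually running the query.
    cur.execute(
        "EXPLAIN SELECT * FROM payments WHERE customer_id = %s",
        ("CU123",),
    )
    for (line,) in cur.fetchall():
        print(line)
```

A plan that quietly flips from an index scan to a sequential scan over a big table is exactly the kind of change that turns 'expensive but manageable' into 'not at all manageable'.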
Incidents give you great signal about which of these black boxes are worth opening
and you get a real world example to use as a starting point.
This all becomes particularly important if you're joining a larger engineering
organisation. At the start, you learn about the parts of your system which your team owns.
You might get given a thousand-foot view during your onboarding,
but mostly you pick up context, bottom up as and
when you encounter things. You'll also find that incidents don't respect
team boundaries. They impact systems owned by multiple
teams and that pushes you outside your team's remit. It's much more interactive
than studying an architecture diagram, giving you hands-on experience of
how the systems interact. It shows you how the puzzle pieces fit together,
widening your proverbial lens to see the bigger picture and grow
your context. And having that bigger picture gives you more information to
make better choices for your own team. Incidents teach you to
build systems that fail gracefully. One of the key follow-ups from the API
incident was to add statement timeouts on all of our database calls.
This meant that if we issued a bad query, Postgres would
try for maybe a few seconds, but it would then give up.
This is an excellent example of resilient engineering. Our system
can now handle unexpected failures. We don't need to know what will
issue a bad query, just that it's likely that something will.
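Roughly what that looks like, as a hedged sketch assuming Postgres and the psycopg2 driver rather than the actual GoCardless change; pg_sleep stands in for an accidentally expensive query.

```python
# A rough sketch of a statement timeout, assuming Postgres and psycopg2.
# Illustrative only: not the actual fix from the incident.
import psycopg2

conn = psycopg2.connect("dbname=payments")  # hypothetical connection string

with conn.cursor() as cur:
    # Cap every statement on this session at five seconds.
    cur.execute("SET statement_timeout = '5s'")
    try:
        cur.execute("SELECT pg_sleep(10)")  # stands in for an accidentally bad query
    except psycopg2.errors.QueryCanceled:
        # Postgres cancels the statement rather than grinding away forever.
        conn.rollback()
        print("query cancelled by statement_timeout")
```

In practice you'd more likely set a sensible default in your connection or role configuration so nobody has to remember it per session.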
It's possible to read about these ideas in a book, but nothing compares
to seeing it in action. During this incident, I learned a whole
set of tools that I could employ to reduce the blast radius of potential failures.
Not just the statement timeouts which we implemented, but all the other
options that the team discussed and discarded.
And I got to listen to the best people in the company talk about the tradeoffs
between them. Incidents teach you to make systems easier
to debug. Observability isn't easy. I've shipped
plenty of useless log lines and metrics in my time. To build
genuinely observable systems, you need to have empathy for your future
self or teammate who'll be debugging an issue, and that's
hard to learn in the abstract. The people I've worked with who do this
well are constantly leaning on their experience of debugging issues:
they're pattern matching on what they've seen before, allowing them to identify
useful places for logs and metrics. Incidents are a great shortcut
to get this kind of experience and build a repository of patterns
that you can recognize going forwards.
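To make that concrete, here's a hedged sketch of the difference that context makes; the field names are invented for illustration, not taken from any real system.

```python
# A hedged sketch of a context-poor vs context-rich log line; the field
# names are invented for illustration, not taken from any real system.
import json
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("payments-api")

# Easy to write, useless when you're debugging later: which request?
# Which customer? How slow is slow?
log.warning("query was slow")

# The version your future self will thank you for: enough context to find
# the offending request and reproduce the query without guessing.
log.warning(json.dumps({
    "event": "slow_query",
    "endpoint": "/payments",
    "customer_id": "CU123",
    "duration_ms": 8431,
}))
```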
Incidents build your network. They provide a great opportunity to meet people outside your team and forge
strong relationships along the way. As psychologists have known for a while,
there's something about going through a stressful situation with someone that forges
a connection much more quickly than normal. Most of the non-engineering
folks I met at GoCardless, I met during incidents. Those relationships
were really valuable. They gave me a mental map of the rest of the
company, and meant that I had a friendly face I could talk to when I needed
advice about customer support or sales or risk.
As I became more senior, that network became increasingly important
as I was responsible for larger and larger projects which impacted multiple
teams. And incidents are a chance to learn from the best. When things
go wrong, when things go really wrong, people from
all over the company get pulled in to help fix it. But they're not just any
people, they're the people with the most context, the most experience
who everyone trusts to fix the problem. Getting to spend time
with these people is pretty rare. They're probably some of the busiest
people in the company. Incidents provide a unique opportunity to learn from
them and see firsthand how they approach a challenging problem.
For me, the API incident gave me opportunities to learn much
faster than I otherwise would have. Incidents have unusually high
information density compared with day-to-day work, and they enable
you to piggyback on the experience of others. Who knows how long
it might have been before I'd realized that I really ought to know what a
query plan was? Honestly, probably until my own code broke in
the same way. At GoCardless, I was lucky.
Their culture and processes meant that I could see incident
channels and follow along whenever I wanted, giving me
this opportunity to accelerate. But that's not always the case.
Some teams run incidents in private channels by default, operating an invite-only
policy. That means that junior team members who want to observe
rather than participate probably don't even know that they're happening.
Sometimes people are excluded for other reasons:
it's not culturally encouraged to get involved. There's an
in-group; they handle all the incidents and everyone else should just get
out of the way. Joining that in-group, even as a new
senior, can become almost impossible. So let's look at what
we can do to build a culture where everybody can learn from
incidents. Let's look at building a culture where incidents
are accessible. First: declare lots of incidents.
If you only declare incidents when things get really bad, you won't
get a chance to practice your incidents process. That means you
won't be as good at running incidents, and also there won't be as
many learning opportunities for your team. By lowering the bar for what counts
as an incident, when the really bad ones do come around,
the response is a well-oiled machine. It also helps with learning.
When problems are handled as incidents, it makes them accessible to everybody else.
It's a bit like an invitation. Encourage everyone to participate.
As we've discussed, incidents are great learning opportunities and
so they should be accessible to everybody. Incident channels
have to be public by default and engagement encouraged at all levels.
Of course, there can be too much of a good thing. Having 20 people
descend into a minor incident channel may not be the best outcome,
but most incidents can comfortably accommodate a few junior responders tagging
along. And it doesn't have to come at the cost of a good response.
You can get this experience in low-risk environments, either
by asking questions to someone who's not actively responding to the incident, or writing
them down and asking them after it's resolved. There are also other
ways to gather learnings. Reading debrief documents or attending
post-incident reviews are both great ways of getting value from your team's
incidents. I'd also recommend compiling a list of the best incident debriefs
in your organisation to share with everyone as part of their onboarding,
and maybe some public ones too. We all know which
were the most interesting incidents. Why not share the love with new joiners too?
Get into the habit of showing your working in an incident.
It's good practice to put as much information as you can into the
incident channel. What command did you run? What theory have you disproved?
If you're debugging on your own, this can admittedly
feel a bit strange. I've personally been sat at 10:00 p.m.
in an incident channel on more than one occasion, having a delightful
conversation with myself. But it's worth it, I promise.
It's useful for your response because it means that you don't have to rely on
your memory to know exactly what you've already tried and when, which helps you
avoid making bad assumptions, but it's also beneficial
for your team. If this information is accessible,
you're enabling everyone to learn from your experience. That means
using public Slack channels wherever possible and having central locations
where everyone can go to find the incidents that they might be interested in.
I'm a bit biased, but using an incident management platform really does
help with this. And finally, watch out for anyone playing the hero.
Often a single engineer takes on a lot of the incident response burden,
fixing everything before anybody knows it's broken.
Maybe that was you, maybe it still is.
This doesn't really end well for the hero.
Eventually, they'll stop getting as much credit as they think they deserve
for fixing everything as it becomes normalized. No one's ever known anything
else, and that puts them at risk of burning out, but it
also causes problems for the rest of the team. Without meaning
to, the hero is taking away these learning opportunities from everyone else
by fixing things quietly in the corner.
That means that no one else is ever going to be able to do what
they do as effectively because no one's had enough practice.
While that's maybe an effective job preservation tactic,
it's not going to result in a high-performing team.
If you think that you get a lot of recognition for resolving incidents,
imagine how much you'll get for leveling up your whole team so they can do
the same. Thanks so much for listening. I really appreciate
you coming along to this talk. If you're interested in incidents
in general, we have a Slack community at incident.io/community,
and I'd really love to see you there. I'm also on Twitter at patrickarti
eng if you'd like to chat about anything that we've discussed today, and I really
hope you enjoy the rest of the conference.