Abstract
Incidents are a great opportunity to gather both context and skill. They take people out of their day-to-day roles, and force ephemeral teams together to solve unexpected and challenging problems.
The first part of the talk will walk through the different things you can learn from incidents, including:
- Taking you to the edges of the systems your team owns, and beyond - incidents help broaden your understanding of the context in which you’re building
- Showing you how systems fail, so you can learn to identify failure modes and build software with good observability
- Expanding your network inside your organisation, making connections with different people, who you can learn from and collaborate with
We’ll then talk about how to get the best value from the incidents which you do have as an individual, thinking about when is an appropriate time to ask questions, and how to get your own learnings without ‘getting in the way’.
Finally, we’ll discuss how to make this part of the culture of an organisation: as part of the leadership team, what can you do to encourage this across your teams?
Transcript
Hey, let's start with a story. One of my first
coding jobs was at a company called GoCardless. I'd been
there for a few months when we had a major incident. Our API, which was
basically our entire product, had slowed to a crawl. I was pretty
curious, so I jumped into the incident channel. We figured
out that a particular endpoint was causing the issue by sending a
bad query to our large Postgres database, which was now struggling.
We disabled the bad endpoint to get the rest of the system up and running
again, and it worked. Then we had to understand what
had happened. There weren't any recent changes that looked suspicious.
It turned out that the query plan for this particular query had changed
from something that was expensive but manageable, to something
that was not at all manageable. To make matters worse,
there wasn't a timeout on the query, so the database would keep
running the expensive task long after the person asking for it had given
up on ever getting a response. We made a subtle change to
the query, which made the database revert to the good query
plan. Everything was back up and running. We'd fixed it.
Well, I say we. I watched quietly
from the sidelines, furiously scribbling notes. After the incident
was over, I turned to my colleague: "What is a query plan?"
We'll come back to this in a second. Hey, I'm Lisa Karlin
Curtis. Last year I joined incident.io as employee number
two. We build an incident management platform for your whole organisation
and incidents and incident response are naturally very close to my heart.
And fundamentally, this is a talk about why I've accelerated
my career by running towards the fire. When I joined GoCardless
I was a pretty junior engineer. I progressed very rapidly.
I made senior, honestly, quite a lot faster than I'd expected.
I was reflecting on how that had happened and of course,
like anything it was a number of factors. But a pattern
stood out to me: the big step changes in my understanding, and in
my ability to solve larger, more complex problems and
reason about tradeoffs, came as a result of the incidents
that I'd participated in or observed. I was introduced to new technologies,
learned new skills and met people who became some of my closest
friends. And every time, I'd come out a better engineer.
So this is why I love incidents. Incidents broaden your horizons.
As engineers, we live in a world full of black boxes,
whether that's a programming language, a framework or a database,
we learn how to use the interface and we move on.
If we tried to understand how everything worked, down to the metal or the transistors
in our laptops, we'd never get to ship,
well, anything. Incidents force you to open
the black boxes around you, peek inside and learn just
enough to solve the problem. After this incident,
I read up on query plans and this proved really useful.
It was not our last query plan related incident, far from it.
It was also useful when I was building new things. I was
suddenly able to write code that scaled well the first time.
Seeing this stuff in real life helped me see into the
crystal ball like my senior colleagues could and truly understand the
impact of the tradeoffs we were making when talking to the database.
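If you've never peeked inside that particular black box: a query plan is the strategy Postgres chooses for answering a query, and you can ask it to show you. Here's a minimal sketch, assuming Postgres and the psycopg2 driver; the connection string, table, and column names are invented for illustration, not the real ones from that incident.

```python
# A minimal sketch of inspecting a query plan, assuming Postgres and the
# psycopg2 driver. All names here are illustrative, not from the incident.
import psycopg2

conn = psycopg2.connect("dbname=payments")  # hypothetical connection string

with conn.cursor() as cur:
    # EXPLAIN asks the planner what it *would* do (index scan vs sequential
    # scan, join order, estimated cost) without actually running the query.
    cur.execute(
        "EXPLAIN SELECT * FROM payments WHERE customer_id = %s",
        ("CU123",),
    )
    for (line,) in cur.fetchall():
        print(line)
```

A plan that quietly flips from an index scan to a sequential scan over a big table is exactly the kind of change that turns 'expensive but manageable' into 'not at all manageable'.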
Incidents give you great signal about which of these black boxes are worth opening
and you get a real world example to use as a starting point.
This all becomes particularly important if you're joining a larger engineering
organisation. At the start, you learn about the parts of your system which your team owns.
You might get given a thousand-foot view during your onboarding,
but mostly you pick up context, bottom up as and
when you encounter things. You'll also find that incidents don't respect
team boundaries. They impact systems owned by multiple
teams and that pushes you outside your team's remit. It's much more interactive
than studying an architecture diagram, giving you hands-on experience of
how the systems interact. It shows you how the puzzle pieces fit together,
widening your proverbial lens to see the bigger picture and grow
your context. And having that bigger picture gives you more information to
make better choices for your own team. Incidents teach you to
build systems that fail gracefully. One of the key follow-ups from the API
incident was to add statement timeouts on all of our database calls.
This meant that if we issued a bad query, Postgres would
try for maybe a few seconds, but it would then give up.
This is an excellent example of resilient engineering. Our system
can now handle unexpected failures. We don't need to know what will
issue a bad query, just that it's likely that something will.
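Roughly what that looks like, as a hedged sketch assuming Postgres and the psycopg2 driver rather than the actual GoCardless change; pg_sleep stands in for an accidentally expensive query.

```python
# A rough sketch of a statement timeout, assuming Postgres and psycopg2.
# Illustrative only: not the actual fix from the incident.
import psycopg2

conn = psycopg2.connect("dbname=payments")  # hypothetical connection string

with conn.cursor() as cur:
    # Cap every statement on this session at five seconds.
    cur.execute("SET statement_timeout = '5s'")
    try:
        cur.execute("SELECT pg_sleep(10)")  # stands in for an accidentally bad query
    except psycopg2.errors.QueryCanceled:
        # Postgres cancels the statement rather than grinding away forever.
        conn.rollback()
        print("query cancelled by statement_timeout")
```

In practice you'd more likely set a sensible default in your connection or role configuration so nobody has to remember it per session.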
It's possible to read about these ideas in a book, but nothing compares
to seeing it in action. During this incident, I learned a whole
set of tools that I could employ to reduce the blast radius of potential failures.
Not just the statement timeouts which we implemented, but all the other
options that the team discussed and discarded.
And I got to listen to the best people in the company talk about the tradeoffs
between them. Incidents teach you to make systems easier
to debug. Observability isn't easy. I've shipped
plenty of useless log lines and metrics in my time. To build
genuinely observable systems, you need to have empathy for your future
self or teammate who'll be debugging an issue, and that's
hard to learn in the abstract. The people I've worked with who do this
well are constantly leaning on their experience of debugging issues:
they're pattern matching on what they've seen before, allowing them to identify
useful places for logs and metrics. Incidents are a great shortcut
to get this kind of experience and build a repository of patterns
that you can recognize going forwards.
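To make that concrete, here's a hedged sketch of the difference that context makes; the field names are invented for illustration, not taken from any real system.

```python
# A hedged sketch of a context-poor vs context-rich log line; the field
# names are invented for illustration, not taken from any real system.
import json
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("payments-api")

# Easy to write, useless when you're debugging later: which request?
# Which customer? How slow is slow?
log.warning("query was slow")

# The version your future self will thank you for: enough context to find
# the offending request and reproduce the query without guessing.
log.warning(json.dumps({
    "event": "slow_query",
    "endpoint": "/payments",
    "customer_id": "CU123",
    "duration_ms": 8431,
}))
```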
Incidents build your network. They provide a great opportunity to meet people outside your team and forge
strong relationships along the way. As psychologists have known for a while,
there's something about going through a stressful situation with someone that forges
a connection much more quickly than normal. Most of the non-engineering
folks I met at GoCardless, I met during incidents. Those relationships
were really valuable. They gave me a mental map of the rest of the
company, and meant that I had a friendly face I could talk to when I needed
advice about customer support or sales or risk.
As I became more senior, that network became increasingly important
as I was responsible for larger and larger projects which impacted multiple
teams. And incidents are a chance to learn from the best. When things
go wrong, when things go really wrong, people from
all over the company get pulled in to help fix it. But they're not just any
people, they're the people with the most context, the most experience
who everyone trusts to fix the problem. Getting to spend time
with these people is pretty rare. They're probably some of the busiest
people in the company. Incidents provide a unique opportunity to learn from
them and see firsthand how they approach a challenging problem.
For me, the API incident gave me opportunities to learn much
faster than I otherwise would have. Incidents have unusually high
information density compared with day-to-day work, and they enable
you to piggyback on the experience of others. Who knows how long
it might have been before I'd realized that I really ought to know what a
query plan was? Honestly, probably until my own code broke in
the same way. At GoCardless, I was lucky.
Their culture and processes meant that I could see incident
channels and follow along whenever I wanted, giving me
this opportunity to accelerate. But that's not always the case.
Some teams run incidents in private channels by default, operating an invite-only
policy. That means that junior team members who want to observe
rather than participate probably don't even know that they're happening.
Sometimes people are excluded for other reasons:
it's not culturally encouraged to get involved. There's an
in-group; they handle all the incidents and everyone else should just get
out of the way. Joining that in-group, even as a new
senior, can become almost impossible. So let's look at what
we can do to build a culture where everybody can learn from
incidents. Let's look at building a culture where incidents
are accessible. First: declare lots of incidents.
If you only declare incidents when things get really bad, you won't
get a chance to practice your incidents process. That means you
won't be as good at running incidents, and also there won't be as
many learning opportunities for your team. By lowering the bar for what counts
as an incident, when the really bad ones do come around,
the response is a well-oiled machine. It also helps with learning.
When problems are handled as incidents, it makes them accessible to everybody else.
It's a bit like an invitation. Encourage everyone to participate.
As we've discussed, incidents are great learning opportunities and
so they should be accessible to everybody. Incident channels
have to be public by default and engagement encouraged at all levels.
Of course, there can be too much of a good thing. Having 20 people
descend into a minor incident channel may not be the best outcome,
but most incidents can comfortably accommodate a few junior responders tagging
along. And it doesn't have to come at the cost of a good response.
You can get this experience in low-risk environments, either
by asking questions to someone who's not actively responding to the incident, or writing
them down and asking them after it's resolved. There are also other
ways to gather learnings. Reading debrief documents or attending
post-incident reviews are both great ways of getting value from your team's
incidents. I'd also recommend compiling a list of the best incident debriefs
in your organisation to share with everyone as part of their onboarding,
and maybe some public ones too. We all know which
were the most interesting incidents. Why not share the love with new joiners too?
Get into the habit of showing your working in an incident.
It's good practice to put as much information as you can into the
incident channel. What command did you run? What theory have you disproved?
If you're debugging on your own, this can admittedly
feel a bit strange. I've personally been sat at 10:00 p.m.
in an incident channel on more than one occasion, having a delightful
conversation with myself. But it's worth it, I promise.
It's useful for your response because it means that you don't have to rely on
your memory to know exactly what you've already tried and when, which helps you
avoid making bad assumptions, but it's also beneficial
for your team. If this information is accessible,
you're enabling everyone to learn from your experience. That means
using public Slack channels wherever possible and having central locations
where everyone can go to find the incidents that they might be interested in.
I'm a bit biased, but using an incident management platform really does
help with this. And finally, watch out for anyone playing the hero.
Often a single engineer takes on a lot of the incident response burden,
fixing everything before anybody knows it's broken.
Maybe that was you, maybe it still is.
This doesn't really end well for the hero.
Eventually, they'll stop getting as much credit as they think they deserve
for fixing everything as it becomes normalized. No one's ever known anything
else, and that puts them at risk of burning out, but it
also causes problems for the rest of the team. Without meaning
to, the hero is taking away these learning opportunities from everyone else
by fixing things quietly in the corner.
That means that no one else is ever going to be able to do what
they do as effectively because no one's had enough practice.
While that's maybe an effective job preservation tactic,
it's not going to result in a high-performing team.
If you think that you get a lot of recognition for resolving incidents,
imagine how much you'll get for leveling up your whole team so they can do
the same. Thanks so much for listening. I really appreciate
you coming along to this talk. If you're interested in incidents
in general, we have a Slack community at incident.io/community,
and I'd really love to see you there. I'm also on Twitter at patrickarti
eng if you'd like to chat about anything that we've discussed today, and I really
hope you enjoy the rest of the conference.