Transcript
            
            
              This transcript was autogenerated. To make changes, submit a PR.
            
            
            
            
              Hello everybody. So I could talk
            
            
            
              about incidents forever. In fact, I do talk
            
            
            
              about incidents always, but for a while it
            
            
            
              felt like nobody was listening. Years ago, when I started working
            
            
            
              on incidents and talking about incidents always, we were
            
            
            
              doing really great work with incident retrospectives. We were learning super duper interesting
            
            
            
              stuff, but those learnings and recommendations
            
            
            
              never really went anywhere. We then realized
            
            
            
              that in order for these recommendations to actually happen, we needed others to
            
            
            
              see why we were pushing for them. The engineers themselves couldn't
            
            
            
              push for these changes because they couldn't succinctly explain what they were
            
            
            
              requesting. Instead, they would point to some incident reports that were full
            
            
            
              of screenshots of errors and timestamped timelines that didn't specifically
            
            
            
              explain what was needed. And the non-engineering
            
            
            
              folks couldn't really understand why these incidents were impacting them.
            
            
            
              After all, we were still making money. And that's when we started
            
            
            
              focusing on not only learning from our incidents, but telling
            
            
            
              others about them. We realized that the reports that we were creating
            
            
            
              weren't telling the whole story, so we redid the way that we would write them
            
            
            
              so they could be more complete representations of how we experienced the
            
            
            
              incidents. These long reports, though, were kind of tough for folks to
            
            
            
              read. So we started adding abstracts and summaries as onramps,
            
            
            
              and then we added weekly updates so folks could quickly ingest what
            
            
            
              was happening and start realizing that incidents were happening every
            
            
            
              day and they were having a huge impact on how we do our work.
            
            
            
              And then we started synthesizing some of that information so we could
            
            
            
              go to product owners and decision makers and make a case for our long-term
            
            
            
              recommendations. And then something magical happened.
            
            
            
              Folks started listening, and they engaged with what we were
            
            
            
              talking about. And not going to lie, this didn't happen immediately,
            
            
            
              but little by little, our recommendations were making
            
            
            
              it into the quarterly plans, and they were making it into the team cultures
            
            
            
              and the process changes. And not all incidents
            
            
            
              went away. We were still doing some really cool engineering stuff,
            
            
            
              but we were out of that cycle of infinite incidents that were continuing
            
            
            
              to happen over and over again. We started having the bandwidth to
            
            
            
              tackle different problems, and that was great. And we were
            
            
            
              able to do that because we were able to share
            
            
            
              the incident findings. So hi everyone. Vanessa Huerta
            
            
            
              Granda. I work in solutions at Jeli.io.
            
            
            
              I've been working in technology for the past decade,
            
            
            
              focusing on incident response and learning from incidents. And I truly,
            
            
            
              truly, truly believe that learning from incidents is the key that can help software
            
            
            
              organizations improve how we do our work. And I want to
            
            
            
              help in making all of this work more attainable and sustainable to the
            
            
            
              everyday engineer. So today I want to talk to you all about sharing
            
            
            
              incident findings effectively and what we can gain from doing that.
            
            
            
              So what do you do after a postmortem?
            
            
            
              So you've worked really hard after your incident. You had a collaborative postmortem,
            
            
            
              you listened to several different points of view, and you've come up with some really
            
            
            
              rich learnings and recommendations. What do you do then?
            
            
            
              What do you do after your post mortem? Some of
            
            
            
              us will probably write a report. I've done this usually in
            
            
            
              Jira. Some folks like to use Google Docs. You can write
            
            
            
              up your action items. You can tag folks; sometimes that is part of that
            
            
            
              report. More Jira tickets, really great stuff.
            
            
            
              Maybe you had the review meeting over Zoom. So you share that recording and
            
            
            
              you hope that people listen to that 1 hour session or
            
            
            
              you close up the ticket now that the incident is over and folks can access
            
            
            
              it whenever they want to. Maybe you're just done.
            
            
            
              This incident was a lot. You're tired, you just want to move forward.
            
            
            
              You have work to do and there's really going to be another incident tomorrow anyway.
            
            
            
              So you move on with your life.
            
            
            
              So while we all do different things after the postmortem, do we know
            
            
            
              if others are interacting with our learnings, or are the
            
            
            
              learnings mostly living and dying in Google Drive? Mine have
            
            
            
              sometimes been. I've worked very, very hard on some incident reports
            
            
            
              that have very, very complete information, but nobody ever reads them.
            
            
            
              And it's frustrating, especially when our organizations
            
            
            
              are the ones mandating that we spend time creating these postmortem documents,
            
            
            
              and so if we're mandated to complete some sort of incident report
            
            
            
              or a five whys template, it must mean that organizations
            
            
            
              believe that sharing incident findings is important.
            
            
            
              But why do they think that? Why do we think that?
            
            
            
              When we really think about it, teams appreciate the culture of keeping everyone
            
            
            
              in the loop. People like transparency. Last year,
            
            
            
              Jeli released a guide on how to learn from incidents. We call it the Howie
            
            
            
              Guide. You can actually find it on our website at jeli.io.
            
            
            
              And when discussing sharing incident findings, the Howie guide explains
            
            
            
              that your work should not be completed to be filed. It should be completed
            
            
            
              so it can be read and shared across the business.
            
            
            
              Even after the learning review has taken place and the corrective
            
            
            
              actions have been taken, we work hard at arriving at
            
            
            
              our learnings to uncover themes and takeaways. How often are
            
            
            
              reports written to be filed rather than to be read and shared?
            
            
            
              Why else do we share information?
            
            
            
              Well, we also share just for learning's sake, and
            
            
            
              sharing timely information is also sort of marketing for the kind of work
            
            
            
              that we do, right? Sharing that information is a marketing piece of
            
            
            
              the learning from incidents programs that we lead.
            
            
            
              Findings may also impact others in the organization. Some outcomes
            
            
            
              or insights may impact a person or a team or the way we do our
            
            
            
              work. And letting them know of these learnings is a part of a just culture.
            
            
            
              We also do it as a TLDR for the decision makers. Maybe we
            
            
            
              need buy-in for next steps from leadership or other stakeholders.
            
            
            
              Actually, we usually do. And these folks are probably not going
            
            
            
              to be very likely to attend every single review meeting, especially if you have
            
            
            
              a lot of incidents. And finally, you also have folks who
            
            
            
              didn't attend the review meeting. Maybe they were out that day on PTO. Maybe they
            
            
            
              needed some heads-down time to work on other stuff. Maybe they
            
            
            
              were working through another incident itself.
            
            
            
              So there are many, many other reasons for learning.
            
            
            
              And all of that is to say that sharing incident findings can help get our
            
            
            
              point across to a wider audience. So if we've worked hard on
            
            
            
              uncovering our learnings, but others aren't seeing it, what's the
            
            
            
              point? I've often heard engineers discuss,
            
            
            
              sadly, how they're stuck in this cycle of incidents, that they
            
            
            
              believe that there's things that they can do about it, but they're not in a
            
            
            
              position of power to get those things done. But sharing incident
            
            
            
              findings is a way out of that cycle of incidents.
            
            
            
              You may say, Vanessa, I'm already sharing my report. It's in the drive.
            
            
            
              Anyone can access it. And that's true. But my
            
            
            
              inbox is not at zero. And I bet a bunch of other people's inboxes aren't
            
            
            
              at zero either. And I can tell you that having
            
            
            
              something available to me doesn't mean that I'm going to read it.
            
            
            
              So how do we get folks to read it? So let's think about movies.
            
            
            
              The way that we learn about and the way that we learn from movies are
            
            
            
              different. You can watch a 120-minute movie,
            
            
            
              but maybe then you want to hear people talk about it for 45 minutes,
            
            
            
              or you want to listen to a ten-episode podcast about this movie.
            
            
            
              Or maybe you want to share a review with someone that they can read in
            
            
            
              five minutes and decide if they want to go to the movies with you or
            
            
            
              not. Or maybe you just want to tweet some spoilers, right? Like Spoiler
            
            
            
              alert: in the movie Titanic, the boat sinks.
            
            
            
              Sorry for that spoiler, y'all. The truth is that different audiences
            
            
            
              need to learn different things from what you're sharing and
            
            
            
              who are these different audiences? Going back to the movie
            
            
            
              analogy, you can have your huge fans. They're the ones who are going
            
            
            
              to listen to that, like, ten-episode podcast. You have
            
            
            
              your casual viewer, the ones who want to read that review before they commit to
            
            
            
              spending $12 on the movie ticket and then another $12
            
            
            
              on the popcorn. You have your studio executive, your Oscar voter.
            
            
            
              They're hopefully watching those 120-minute-long movies.
            
            
            
              The film industry is catering to different types of audiences differently, just like
            
            
            
              we will when it comes to incidents. So going
            
            
            
              back to incidents, who are
            
            
            
              the different audiences? So you have your engineers, you have the people
            
            
            
              that you invited to the review meeting but couldn't make it. Maybe some
            
            
            
              people who are impacted by the outcomes or the insights from the analysis,
            
            
            
              or just people who want to learn more about what your team does.
            
            
            
              You have your managers. They can provide necessary context to others.
            
            
            
              They can say, this is why my team does this this way. You have your
            
            
            
              execs, your leadership who can approve of suggested changes. You have
            
            
            
              stakeholders who can be either technical or not technical. The way that you share
            
            
            
              the information with them will be different. And then you have your outside
            
            
            
              parties, right? You have your customer support, the folks who are like answering the
            
            
            
              phones when something is going wrong. You have your public relations folks,
            
            
            
              the folks who are writing the tweets when your site is down.
            
            
            
              So within these different audiences, folks can have different purposes for wanting or
            
            
            
              needing to see the learnings from an incident.
            
            
            
              And the purpose can be many, right? Sometimes when I share
            
            
            
              something, I'm requesting an action, right? I'm saying, hey Jen, please make this
            
            
            
              change or add it to your to-dos. Sometimes they just need
            
            
            
              to know, hey Jen, I'm making this change. This is how it's going to impact
            
            
            
              you. Sometimes you're just updating them, right? Like hey Jen's
            
            
            
              boss, remember the incident from the other day? This is what happened.
            
            
            
              But sometimes you want to change folks' minds, you want to say,
            
            
            
              hey, Jen's boss, please don't fire Jen. This wasn't her fault.
            
            
            
              Read this report, you'll find out why.
            
            
            
              With all of this in mind, let's take a look at the different formats in
            
            
            
              which you can share the information. So these are some of the formats
            
            
            
              that I like to use. I've iterated through them throughout the years.
            
            
            
              We'll go into more detail in a bit. But we have the report, the abstract,
            
            
            
              the summary, the recording, weekly updates, and presentations.
            
            
            
              A lot of what we've built here at Jeli was done with this in mind.
            
            
            
              We want to make it easy for people to share their learnings.
            
            
            
              So you'll see how we do that in the next few slides.
            
            
            
              First, let's start with the report. Right? That is probably the
            
            
            
              format that you're most used to, but this is different
            
            
            
              from your standard postmortem. This one is focused on
            
            
            
              telling the bigger story of what happened and the context around how the events
            
            
            
              came to be. The goal with the report is always to learn from the incident.
            
            
            
              As you can see here what I'm including in my report, I'm including
            
            
            
              a narrative timeline, a visual narrative timeline, because we're
            
            
            
              telling what happened with the incident from different points of view.
            
            
            
              We're not just saying start, middle, end, we're going
            
            
            
              back and forth trying to understand what people were experiencing at the different
            
            
            
              parts of the incident. The report is
            
            
            
              great for asynchronous communication. Again, it should
            
            
            
              be written to be read and collaborated with, not just to be filed.
            
            
            
              So when I'm writing a report, I really like to encourage folks to
            
            
            
              make comments on it, to link out to it, to include it in their PRs
            
            
            
              and how they do their work. The reports are the most
            
            
            
              in-depth written artifact that's coming out of your incident.
            
            
            
              A report will give folks an in-depth understanding of the themes around the incident
            
            
            
              and how we find them. They'll mostly be read by folks that are involved
            
            
            
              in the incident, or teams using similar technologies, or folks whose
            
            
            
              buy-in you need for possible action items,
            
            
            
              but they are long, right? This is the most in-depth artifact
            
            
            
              that's coming out of the incident. So reading a report is probably going
            
            
            
              to require quite a time investment
            
            
            
              from your reader. So in order to catch people's attention,
            
            
            
              you probably need something else. And here is
            
            
            
              where the abstract comes in. And the abstract is my personal favorite
            
            
            
              way to share about incidents.
            
            
            
              It is your incident's elevator pitch. So it's one to two paragraphs on
            
            
            
              what happened, why we should care about this event, any contributing factors or themes,
            
            
            
              et cetera. The abstract is meant to help folks decide if they
            
            
            
              want to commit to learning more about the incident and reading that
            
            
            
              full report. You can share it with anyone. I personally love
            
            
            
              sharing it with executives and leadership.
            
            
            
              Here is the report. It's a Jeli screenshot. We're calling it
            
            
            
              the executive summary, but as you can see, it tells you when
            
            
            
              the incident started, how long it lasted, who was involved, and what the impact was.
            
            
            
              And next up is the summary. And the summary gives more
            
            
            
              context. It's a slightly more comprehensive version of the
            
            
            
              incident. You can include action items, include who
            
            
            
              suggested them, and here
            
            
            
              you can see we included key takeaways as well. So we usually share this
            
            
            
              with people who can be impacted by the learnings and anyone whose buy-in
            
            
            
              you need. And when I'm sharing a summary, what I like to do is
            
            
            
              I like to tag people and explain to them why
            
            
            
              I'm sending this to them so I can share the summary and say,
            
            
            
              @Jen, sharing this so you can see that we're
            
            
            
              making this change. And then Jen can go in and
            
            
            
              find out more information. The next format is to share
            
            
            
              a recording and that's just exactly what it sounds like. It's a recording of the
            
            
            
              actual review call. And when sharing it, it's really helpful
            
            
            
              actually to include a message with timestamps of when key moments were
            
            
            
              being discussed. So I can share my recording and say
            
            
            
              at minute ten we discuss the impact. At minute 15 we
            
            
            
              discuss this theme. At minute 25 we discuss
            
            
            
              action items. And this is
            
            
            
              the most similar format to attending the review meeting.
            
            
            
              But there are some drawbacks. Number one, it takes a while
            
            
            
              to get through it, right? If it's a 1 hour meeting, it's a 1 hour
            
            
            
              recording. And number two, viewers or listeners can't collaborate with
            
            
            
              it. When you're in a meeting, you can raise your hand, you can say,
            
            
            
              hey, this is how I experienced it. You can't do that if you're just watching
            
            
            
              the recording. But this is a great format to share with those who
            
            
            
              were involved in the incident but couldn't attend the meeting.
            
            
            
              If I'm being completely honest, leadership or other colleagues who were not
            
            
            
              involved in the incident are probably not going to watch or listen to
            
            
            
              this type of review. That's okay. They are not the audience
            
            
            
              for this format. That is fine. And then
            
            
            
              you get the idea of a weekly update.
            
            
            
              And the weekly update consists of a quick review of all the incidents
            
            
            
              that were analyzed that week. It can be a list of all incidents with their
            
            
            
              abstracts and a link to the full report for more information. You can
            
            
            
              also include additional data points like teams impacted for a
            
            
            
              quick access to additional learnings. It's a
            
            
            
              great, great option for larger organizations that have lots and lots
            
            
            
              of incidents. Everyone can take a quick glance at the list,
            
            
            
              find the incidents that they're interested in based on keywords like services impacted
            
            
            
              or technologies involved and then read further into whatever they're
            
            
            
              interested in. I personally had some really great luck
            
            
            
              with this. I used to send weekly updates to everybody in
            
            
            
              technology. All the managers could just skim. If I'm a
            
            
            
              manager and I'm working with technology A, I could see, okay, these technologies were
            
            
            
              part of incidents. Let's see how this could impact me. Let's see what
            
            
            
              I need to learn from this. Let's see what I should share with my engineers.
            
            
            
              So if you've been paying attention so far,
            
            
            
              all of the formats that I've discussed are for
            
            
            
              individual incidents. And now we're moving on
            
            
            
              to when you're looking at a universe of incidents.
            
            
            
              So there's a difference between the insights that we share from one incident versus
            
            
            
              multiple incidents, the micro versus the macro insights.
            
            
            
              When it comes to incidents, I'm a fan of focusing on learning
            
            
            
              rather than having a postmortem that then becomes an action items
            
            
            
              factory. And when I say an action items factory, I mean the post mortems that
            
            
            
              we've all been at, right? Like we just sit there, we're not here to learn,
            
            
            
              we're just here to say like, oh, let's change this bug, let's change that,
            
            
            
              let's change that, let's change that. Half of those tickets are never going
            
            
            
              to be completed. We're not going anywhere. Those are the kinds of post mortems
            
            
            
              that make people lose faith in the process.
            
            
            
              But when you have more incidents and you have more learnings,
            
            
            
              you can start proposing changes, because odds
            
            
            
              are if you have a sample size of one, you're not going to be able
            
            
            
              to make a large structural change because of it, right? If I have
            
            
            
              one incident, I'm not going to be able to say like hey,
            
            
            
              we should do a reorg based on this, but if I have more
            
            
            
              incidents then I can start suggesting things as
            
            
            
              an analyst. When you start spotting macro trends, you can and you should
            
            
            
              make a case for them. That's how you change your lives for the better.
            
            
            
              The difference between this and an action item factory
            
            
            
              is that you're giving yourself and other folks the time to truly understand the
            
            
            
              learnings. You're reflecting on your work over a span of
            
            
            
              time and you're making decisions together. To give
            
            
            
              you an example, in the past I worked at an organization where
            
            
            
              we had a very centralized incident response system,
            
            
            
              meaning we only trusted a few people to start an incident.
            
            
            
              That's because we believed that only a few people
            
            
            
              had the understanding of our systems to make this decision.
            
            
            
              As we grew, as we developed more things, we realized that
            
            
            
              this process was actually delaying us in learning about high-impact
            
            
            
              incidents. And we started asking ourselves,
            
            
            
              what can go wrong if we change this? What can go wrong if we change
            
            
            
              this process? We had several discussions and
            
            
            
              we were like, let's give this a try. But this was a change that
            
            
            
              was outside of our control and
            
            
            
              when we have a change that's outside of our control, I like to focus
            
            
            
              on presentations. So from time to time, you will
            
            
            
              get the chance to present to a wider audience,
            
            
            
              especially when you're trying to make recommendations. When I'm doing this,
            
            
            
              I usually like to walk folks through the timeline of an incident so
            
            
            
              they understand what they're dealing with. A lot of the time, the people
            
            
            
              who are attending these presentations are not living incidents
            
            
            
              every day. They're not like me, who can talk about incidents forever.
            
            
            
              So I like to walk them through the incident. I like to show them
            
            
            
              that visual timeline. I present any data that I have,
            
            
            
              any themes that I have that I want to discuss, and then I go into
            
            
            
              my recommendations, and I always, always target things to my audience. Right.
            
            
            
              If I'm targeting this to a very technical audience, I include very technical
            
            
            
              details. If I'm targeting this to a business audience, I include
            
            
            
              business details. Okay, but how
            
            
            
              do I get them to agree to my changes?
            
            
            
              When you're proposing changes that are outside of your control,
            
            
            
              I usually suggest that folks think of this workflow and ask
            
            
            
              these questions and answer them to whoever you're proposing the changes
            
            
            
              to. So let's think back to the example. There, I want
            
            
            
              to change the incident process from only a few people
            
            
            
              being able to call incidents to everyone being able to raise an incident.
            
            
            
              What is the suggestion? In this case, it was a suggestion to change the
            
            
            
              way we run incidents. Who needs to approve it?
            
            
            
              This was a process that everyone in tech, in the tech org, knew,
            
            
            
              so some engineering leadership probably needed to approve of it. Going up
            
            
            
              to the CTO, actually. What do they need to see? They need to see
            
            
            
              that we have done our due diligence. What are we basing this
            
            
            
              information off of? We're basing
            
            
            
              it off of a good number of incidents where the
            
            
            
              incident process was not like the root cause, it wasn't
            
            
            
              the thing that caused the incident, but it was a contributing factor.
            
            
            
              And then these incidents led to a number of discussions, and responders agreed
            
            
            
              that this was worth pursuing. We can see all of this information because we have
            
            
            
              very thorough incident reports that we can look back to.
            
            
            
              What could go wrong? Well, if we communicate this wrong, it could
            
            
            
              cause more confusion. But we're already thinking ahead,
            
            
            
              right? In this case, the responders that are suggesting this change,
            
            
            
              we're suggesting that we try it out for a quarter and then we revisit our
            
            
            
              progress. What is our end game?
            
            
            
              Maybe having more control over the process. We wanted to see what would happen
            
            
            
              if we opened up the floodgates. And who
            
            
            
              is doing all this work? This example is easy because I own
            
            
            
              the process, so I'm the one that's making it happen.
            
            
            
              But it also had to be part of my quarterly planning. I had to have
            
            
            
              my manager approve that I would be spending time working on the change
            
            
            
              management process. Guess what?
            
            
            
              It worked. And many other recommendations also
            
            
            
              worked, and some didn't. That's fine. That's how the world works.
            
            
            
              And you can also make it work for yourself.
            
            
            
              Because we had done our work throughout all of the individual incidents,
            
            
            
              we were able to uncover what was happening at a macro level.
            
            
            
              Because we had done our work, we could then confidently answer
            
            
            
              all those questions in the last slide and make a case for the changes
            
            
            
              that we were suggesting. And so next
            
            
            
              time that you feel like you're stuck when it comes to learning from incidents,
            
            
            
              next time when you feel like you're in a cycle of repeating incidents where
            
            
            
              you feel like you know the answer but you're not getting anywhere, remember that
            
            
            
              the process doesn't end in the post mortem. Sharing incident
            
            
            
              learning is indeed a pivotal step in turning your incidents into
            
            
            
              opportunities. Thank you very much.
            
            
            
              I'm Vanessa. If you would like to hear more about incidents,
            
            
            
              feel free to follow me on Twitter at these underscore hue.
            
            
            
              Underscore Jace. I hope you all have a wonderful day.