Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hello everybody. So I could talk
about incidents forever. In fact, I do talk
about incidents always, but for a while it
felt like nobody was listening. Years ago, when I started working
on incidents and talking about incidents always, we were
doing really great work with incident retrospectives. We were learning super duper interesting
stuff, but those learnings and recommendations
never really went anywhere. We then realized
that in order for these recommendations to actually happen, we needed others to
see why we were pushing for them. The engineers themselves couldn't
push for these changes because they couldn't succinctly explain what they were
requesting. Instead, they would point to some incident reports that were full
of screenshots of errors and timestamped timelines that didn't specifically
explain what was needed. And the non-engineering
folks couldn't really understand why these incidents were impacting them.
After all, we were still making money. And that's when we started
focusing on not only learning from our incidents, but telling
others about them. We realized that the reports that we were creating
weren't telling the whole story, so we redid the way that we would write them
so they could be more complete representations of how we experienced the
incidents. These long reports, though, were kind of tough for folks to
read. So we started adding abstracts and summaries as onramps,
and then we added weekly updates so folks could quickly ingest what
was happening and start realizing that incidents were happening every
day and they were having a huge impact on how we do our work.
And then we started synthesizing some of that information so we could
go to product owners and decision makers and make a case for our long-term
recommendations. And then something magical happened.
Folks started listening, and they engaged with what we were
talking about. And not going to lie, this didn't happen immediately,
but little by little, our recommendations were making
it into the quarterly plans, and they were making it into the team cultures
and the process changes. And not all incidents
went away. We were still doing some really cool engineering stuff,
but we were out of that cycle of infinite incidents that were continuing
to happen over and over again. We started having the bandwidth to
tackle different problems, and that was great. And we were
able to do that because we were able to share
the incident findings. So hi everyone, I'm Vanessa Huerta
Granda. I work in solutions at Jeli.io.
I've been working in technology for the past decade,
focusing on incident response and learning from incidents. And I truly,
truly, truly believe that learning from incidents is the key that can help software
organizations improve how we do our work. And I want to
help in making all of this work more attainable and sustainable to the
everyday engineer. So today I want to talk to you all about sharing
incident findings effectively and what we can gain from doing that.
So what do you do after a postmortem?
So you've worked really hard after your incident. You had a collaborative postmortem,
you listened to several different points of view, and you've come up with some really
rich learnings and recommendations. What do you do then?
What do you do after your post mortem? Some of
us will probably write a report. I've done this usually in
Jira. Some folks like to use Google Docs. You can write
up your action items. You can tag folks; sometimes that is part of that
report. More Jira tickets, really great stuff.
Maybe you had the review meeting over Zoom. So you share that recording and
you hope that people listen to that one-hour session, or
you close up the ticket now that the incident is over and folks can access
it whenever they want to. Maybe you're just done.
This incident was a lot. You're tired, you just want to move forward.
You have work to do and there's really going to be another incident tomorrow anyway.
So you move on with your life.
So while we all do different things after the postmortem, do we know
if others are interacting with our learnings, or are the
learnings mostly living and dying in Google Drive? I have
sometimes been there. I've worked very, very hard on some incident reports
that have very, very complete information, but nobody ever reads them.
And it's frustrating, especially when our organizations
are the ones mandating that we spend time creating these postmortem documents.
And so if we're mandated to complete some sort of incident report
or a five whys template, it must mean that organizations
believe that sharing incident findings is important.
But why do they think that? Why do we think that?
When we really think about it, teams appreciate the culture of keeping everyone
in the loop. People like transparency. Last year,
Jeli released a guide on how to learn from incidents. We call it the Howie
guide. You can actually find it on our website at jeli.io.
And when discussing sharing incident findings, the Howie guide explains
that your work should not be completed to be filed. It should be completed
so it can be read and shared across the business,
even after the learning review has taken place and the corrective
actions have been taken. We work hard at arriving at
our learnings, to uncover themes and takeaways. How often are
reports written to be filed rather than to be read and shared?
Why else do we share information?
Well, we also share just for learning's sake, and
sharing timely information is also sort of marketing for the kind of work
that we do, right? Sharing that information is a marketing piece of
the learning from incidents programs that we lead.
Findings may also impact others in the organization. Some outcomes
or insights may impact a person or a team or the way we do our
work. And letting them know of these learnings is a part of a just culture.
We also do it as a TL;DR for the decision makers. Maybe we
need buy-in for next steps from leadership or other stakeholders.
Actually, we usually do. And these folks are probably not going
to be very likely to attend every single review meeting, especially if you have
a lot of incidents. And finally, you also have folks who
didn't attend the review meeting. Maybe they were out that day on PTO. Maybe they
needed some heads-down time to work on other stuff. Maybe they
were working through another incident themselves.
So there are many, many other reasons for learning.
And all of that is to say that sharing incident findings can help get our
point across to a wider audience. So if we've worked hard on
uncovering our learnings, but others aren't seeing it, what's the
point? I've often heard engineers discuss,
sadly, how they're stuck in this cycle of incidents, that they
believe that there's things that they can do about it, but they're not in a
position of power to get those things done. But sharing incident
findings is a way out of that cycle of incidents.
You may say, Vanessa, I'm already sharing my report. It's in the drive.
Anyone can access it. And that's true. But my
inbox is not at zero. And I bet a bunch of other people's inboxes aren't
at zero either. And I can tell you that having
something available to me doesn't mean that I'm going to read it.
So how do we get folks to read it? So let's think about movies.
The way that we learn about and the way that we learn from movies are
different. You can watch a 120-minute movie,
but maybe then you want to hear people talk about it for 45 minutes,
or you want to listen to a ten-episode podcast about this movie.
Or maybe you want to share a review with someone that they can read in
five minutes and decide if they want to go to the movies with you or
not. Or maybe you just want to tweet some spoilers, right? Like, spoiler
alert: in the movie Titanic, the boat sinks.
Sorry for that spoiler, y'all. The truth is that different audiences
need to learn different things from what you're sharing. And
who are these different audiences? Going back to the movie
analogy, you can have your huge fans. They're the ones who are going
to listen to that, like, ten-episode podcast. You have
your casual viewer, the ones who want to read that review before they commit to
spending $12 on the movie ticket and then another $12
on the popcorn. You have your studio executive, your Oscar voter.
They're hopefully watching those 120-minute movies.
The film industry is catering to different types of audiences differently, just like
we will when it comes to incidents. So going
back to incidents, who are
the different audiences? So you have your engineers, you have the people
that you invited to the review meeting but couldn't make it. Maybe some
people who are impacted by the outcomes or the insights from the analysis,
or just people who want to learn more about what your team does.
You have your managers. They can provide necessary context to others.
They can say, this is why my team does this this way. You have your
execs, your leadership, who can approve suggested changes. You have
stakeholders who can be either technical or nontechnical. The way that you share
the information with them will be different. And then you have your outside
parties, right? You have your customer support, the folks who are, like, answering the
phones when something is going wrong. You have your public relations folks,
the folks who are writing the tweets when your site is down.
So within these different audiences, folks can have different purposes for wanting or
needing to see the learnings from an incident.
And the purpose can be many, right? Sometimes when I share
something, I'm requesting an action, right? I'm saying, hey Jen, please make this
change or add it to your to do's. Sometimes they just need
to know, hey Jen, I'm making this change. This is how it's going to impact
you. Sometimes you're just updating them, right? Like, hey Jen's
boss, remember the incident from the other day? This is what happened.
But sometimes you want to change folks' minds. You want to say,
hey, Jen's boss, please don't fire Jen. This wasn't her fault.
Read this report, you'll find out why.
With all of this in mind, let's take a look at the different formats in
which you can share the information. So these are some of the formats
that I like to use. I've iterated through them throughout the years.
We'll go into more detail in a bit. But we have the report, the abstract,
the summary, the recording, weekly updates, and presentations.
A lot of what we've built here at Jeli was done with this in mind.
We want to make it easy for people to share their learnings.
So you'll see how we do that in the next few slides.
First, let's start with the report. Right? That is probably the
format that you're most used to, but this is different
from your standard postmortem. This one is focused on
telling the bigger story of what happened and the context around how the events
came to be. The goal with the report is always to learn from the incident.
As you can see here, in my report I'm including
a narrative timeline, a visual narrative timeline, because we're
telling what happened with the incident from different points of view.
We're not just saying start, middle, end, we're going
back and forth trying to understand what people were experiencing at the different
parts of the incident. The report is
great for asynchronous communication. Again, it should
be written to be read and collaborated with, not just to be filed.
So when I'm writing a report, I really like to encourage folks to
make comments on it, to link out to it, to include it in their PRs
and how they do their work. The report is the most
in-depth written artifact that's coming out of your incident.
A report will give folks an in-depth understanding of the themes around the incident
and how we found them. They'll mostly be read by folks that were involved
in the incident, or teams using similar technologies, or folks whose
buy-in you need for possible action items.
But they are long, right? This is the most in-depth artifact
that's coming out of the incident. So reading a report is probably going
to require quite a time investment
from your reader. So in order to catch people's attention,
you probably need something else. And here is
where the abstract comes in. And the abstract is my personal favorite
way to share about incidents.
It is your incident's elevator pitch. So it's one to two paragraphs on
what happened, why we should care about this event, any contributing factors or themes,
et cetera. The abstract is meant to help folks decide if they
want to commit to learning more about the incident and reading that
full report. You can share it with anyone. I personally love
sharing it with executives and leadership.
Here is the report. It's a Jeli screenshot. We're calling it
the executive summary, but as you can see, it tells you when
the incident started, how long it lasted, who was involved, what the impact was.
Next up is the summary. And the summary gives more
context. It's a slightly more comprehensive version of the
incident. You can include action items, include who
suggested them, and here
you can see we included key takeaways as well. So we usually share this
with people who can be impacted by the learnings and anyone whose
buy-in you need. And when I'm sharing a summary, what I like to do is
I like to tag people and explain to them why
I'm sending this to them. So I can share the summary and say,
@Jen, sharing this so you can see that we're
making this change. And then Jen can go in and
find out more information. The next format is to share
a recording and that's just exactly what it sounds like. It's a recording of the
actual review call. And when sharing it, it's really helpful
actually to include a message with timestamps of when key moments were
being discussed. So I can share my recording and say:
at minute ten we discuss the impact, at minute 15 we
discuss the themes, at minute 25 we discuss
action items. And this is
the most similar format to attending the review meeting.
But there are some drawbacks. Number one, it takes a while
to get through it, right? If it's a one-hour meeting, it's a one-hour
recording. And number two, viewers or listeners can't collaborate with
it. When you're in a meeting, you can raise your hand, you can say,
hey, this is how I experienced it. You can't do that if you're just watching
the recording. But this is a great format to share with those who
were involved in the incident but couldn't attend the meeting.
If I'm being completely honest, leadership or other colleagues who were not
involved in the incident are probably not going to watch or listen to
this type of review. That's okay. They are not the audience
for this format. That is fine. And then
you get the idea of a weekly update.
And the weekly update consists of a quick review of all the incidents
that were analyzed that week. It can be a list of all incidents with their
abstracts and a link to the full report for more information. You can
also include additional data points like teams impacted for a
quick access to additional learnings. It's a
great, great option for larger organizations that have lots and lots
of incidents. Everyone can take a quick glance at the list,
find the incidents that they're interested in based on keywords like services impacted
or technologies involved, and then read further on whatever they're
interested in. I personally had some really great luck
with this. I used to send weekly updates to everybody in
technology. All the managers could just skim. If I'm a
manager and I'm working with technology A, I could see, okay, these technologies were
part of incidents let's see how this could impact me. Let's see what
I need to learn from this. Let's see what I should share with my engineers.
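
[Editor's illustration, not part of the talk.] The weekly update described here, a list of the week's incidents with their abstracts, a link to each full report, and keywords to skim by, can be sketched in a few lines of Python. This is only a minimal sketch under stated assumptions: the `Incident` fields, the `build_weekly_update` function, and the example URLs are all hypothetical names invented for illustration, not anything from Jeli or from the speaker's actual tooling.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Incident:
    """One analyzed incident (hypothetical shape, for illustration only)."""
    title: str
    abstract: str          # the one-to-two paragraph elevator pitch
    report_url: str        # link to the full report
    keywords: List[str] = field(default_factory=list)  # services, technologies, teams


def build_weekly_update(incidents: List[Incident], interest: Optional[str] = None) -> str:
    """Return a plain-text digest with one entry per incident.

    If `interest` is given, keep only incidents tagged with that keyword
    (case-insensitive), so a reader can skim what is relevant to them.
    """
    if interest is not None:
        wanted = interest.lower()
        incidents = [i for i in incidents if wanted in (k.lower() for k in i.keywords)]

    lines = [f"Incidents analyzed this week: {len(incidents)}"]
    for inc in incidents:
        lines.append("")
        lines.append(f"* {inc.title} [{', '.join(inc.keywords)}]")
        lines.append(f"  {inc.abstract}")
        lines.append(f"  Full report: {inc.report_url}")
    return "\n".join(lines)
```

A manager working with one technology could then call something like `build_weekly_update(this_week, interest="postgres")` to see only the incidents tagged with it, while the unfiltered digest goes to everyone.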
So if you've been paying attention so far,
all of the formats that I've discussed are for
individual incidents. And now we're moving on
to when you're looking at a universe of incidents.
So there's a difference between the insights that we share from one incident versus
multiple incidents, the micro versus the macro insights.
When it comes to incidents, I'm a fan of focusing on learning
rather than having a post mortem that then becomes an action items
factory. And when I say an action items factory, I mean the post mortems that
we've all been at, right? Like we just sit there, we're not here to learn,
we're just here to say like, oh, let's change this bug, let's change that,
let's change that, let's change that. Half of those tickets are never going
to be completed. We're not going anywhere. Those are the kinds of post mortems
that make people lose faith in the process.
But when you have more incidents and you have more learnings,
you can start proposing changes, because odds
are if you have a sample size of one, you're not going to be able
to make a large structural change because of it, right? If I have
one incident, I'm not going to be able to say like hey,
we should do a reorg based on this, but if I have more
incidents, then I can start suggesting things. As
an analyst, when you start spotting macro trends, you can and you should
make a case for them. That's how you change your lives for the better.
The difference between this and an action item factory
is that you're giving yourself and other folks the time to truly understand the
learnings. You're reflecting on your work over a span of
time and you're making decisions together. To give
you an example, in the past I worked at an organization where
we had a very centralized incident response system,
meaning we only trusted a few people to start an incident.
That's because we believed that only a few people
had the understanding of our systems to make this decision.
As we grew, as we DevOps more things, we realized that
this process was actually delaying us in learning about high impact
incidents. And we started asking ourselves,
what can go wrong if we change this? What can go wrong if we change
this process? We had several discussions and
we were like, let's give this a try. But this was a change that
was outside of our control and
when we have a change that's outside of our control, I like to focus
on presentations. So from time to time, you will
get the chance to present to a wider audience,
especially when you're trying to make recommendations. When I'm doing this,
I usually like to walk folks through the timeline of an incident so
they understand what they're dealing with. A lot of the time, the people
who are attending these presentations are not living incidents
every day. They're not like me, who can talk about incidents forever.
So I like to walk them through the incident. I like to show them
that visual timeline. I present any data that I have,
any themes that I have that I want to discuss, and then I go into
my recommendations, and I always, always target things to my audience. Right?
If I'm targeting this to a very technical audience, I include very technical
details. If I'm targeting this to a business audience, I include
business details. Okay, but how
do I get them to agree to my changes?
When you're proposing changes that are outside of your control,
I usually suggest that folks think of this workflow and ask
these questions and answer them to whoever you're proposing the changes
to. So let's think back to the example. I want
to change the incident process from only a few people
being able to call incidents to everyone getting to raise an incident.
What is the suggestion? In this case, it was a suggestion to change the
way we run incidents. Who needs to approve it?
This was a process that everyone in tech, in the tech org, knew,
so some engineering leadership probably needed to approve of it. Going up
to the CTO, actually. What do they need to see? They need to see
that we have done our due diligence. What are we basing this
information off of? We're basing
it off of a good number of incidents where the
incident process was not like the root cause, it wasn't
the thing that caused the incident, but it was a contributing factor.
And then these incidents led to a number of discussions, and responders agreed
that this was worth pursuing. We can see all of this information because we have
very thorough incident reports that we can look back to.
What could go wrong? Well, if we communicate this wrong, it could
cause more confusion. But we're already thinking ahead,
right? In this case, the responders that are suggesting this change,
we're suggesting that we try it out for a quarter and then we revisit our
progress. What is our end game?
Maybe having more control over the process. We wanted to see what would happen
if we opened up the floodgates. And who
is doing all this work? This example is easy because I own
the process, so I'm the one that's making it happen.
But it also had to be part of my quarterly planning. I had to have
my manager approve that I would be spending time working on the change
management process. Guess what?
It worked. And many other recommendations also
worked, and some didn't. That's fine. That's how the world works.
And you can also make it work for yourself.
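
[Editor's illustration, not part of the talk.] The questions walked through above (what is the suggestion, who needs to approve it, what do they need to see, what is it based on, what could go wrong, what is the end game, and who is doing the work) can be kept as a small checklist before presenting a change. The sketch below is one possible way to do that in Python; the `ChangeProposal` class, its field names, and the `unanswered` helper are hypothetical and chosen only to mirror those questions.

```python
from dataclasses import dataclass, fields
from typing import List


@dataclass
class ChangeProposal:
    """One field per question from the talk; empty string means not answered yet."""
    suggestion: str = ""       # What is the suggestion?
    approver: str = ""         # Who needs to approve it?
    evidence_needed: str = ""  # What do they need to see?
    based_on: str = ""         # What are we basing this on?
    risks: str = ""            # What could go wrong?
    end_game: str = ""         # What is our end game?
    owner: str = ""            # Who is doing all this work?


def unanswered(proposal: ChangeProposal) -> List[str]:
    """Return the names of questions that still have no answer, in order."""
    return [f.name for f in fields(proposal) if not getattr(proposal, f.name).strip()]


# Example drawn from the talk: opening up who can raise an incident.
draft = ChangeProposal(
    suggestion="Let everyone raise an incident, not only a few people",
    approver="Engineering leadership, up to the CTO",
)
print(unanswered(draft))
```

Running this prints the five field names that are still blank, which is a quick way to see what is left to prepare before asking for approval.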
Because we had done our work throughout all of the individual incidents,
we were able to uncover what was happening at a macro level.
Because we had done our work, we could then confidently answer
all those questions in the last slide and make a case for the changes
that we were suggesting. And so next
time that you feel like you're stuck when it comes to learning from incidents,
next time when you feel like you're in a cycle of repeating incidents where
you feel like you know the answer but you're not getting anywhere, remember that
the process doesn't end in the post mortem. Sharing incident
learnings is indeed a pivotal step in turning your incidents into
opportunities. Thank you very much.
I'm Vanessa. If you would like to hear more about incidents,
feel free to follow me on Twitter at v underscore hue
underscore g. I hope you all have a wonderful day.