Transcript
Are you an SRE, a developer, or a quality engineer who wants to tackle the challenge of improving reliability in your DevOps? You can enable your DevOps for reliability with ChaosNative. Create your free account at ChaosNative Litmus Cloud.
Welcome, everyone, to my Conf42 talk, Pragmatic Incident Response: Lessons Learned from Failures. Let's go ahead and get started here.
My name is Robert Ross,
but people like to call me Bobby Tables. I am the CEO and co-founder of FireHydrant. We are a reliability platform for teams of all sizes looking to up-level their service ownership, incident management, incident retrospectives, postmortems, whatever you call them. And my experience has led
me to start this company and ultimately have
a bunch of tips on things I think you can do to help improve your
incident response at your company, your team,
maybe your life. Really up to you. So what
we're going to talk about are five tips on pragmatic incident lessons. So pragmatic being the key word here,
there's a whole swath of things that you can do to improve
incident response, but we're just going to talk about some relatively
easy things that you can start to push into the
company today to help you with your
incident response and management.
So one of the things that we're going to start with is small
things. And if you recognize these lyrics, it's because
it's from a band called Blink-182, one of my favorites and from my hometown
of San Diego. But we're going to talk about my elevator
instead for a second. So I live
in an old candy factory. This building, what you see
behind me, it's an old candy factory in Williamsburg,
Brooklyn, New York City. And my
elevators, some of the buttons don't light up when you push them. They certainly
work and they dispatch an elevator and they send it to the correct
floor, but they don't light up. Another problem is that the LED display inside the elevator car sometimes shows the wrong floor. So if you push basement, it'll actually say cellar on the LED screen. Just a minor inconsistency. And another fun one that I discovered very recently is that it has a race condition: if you push a button, sometimes two elevators will show up simultaneously.
No one's in them, just two show up.
And generally it's actually kind of a piece of crap, but it goes
up and down, and that's what's important, right? An elevator goes up and down, the door opens, I step in and out, whatever.
But then on April eighth of this year, I walked up to my elevator and there was an out of order sign (and actors were used in this photograph). But I was like, okay, well, my elevator: the buttons are mislabeled.
It has a race condition and a bunch of
other problems. It's dirty a lot of the time. Like shocking,
right? Oh my God. No way that the elevator
is broken right now, right? I was completely shocked.
Not at all. Right. It made sense. It all lined
up. The story told itself.
And where I'm going with this is that the habit of ignoring
small issues is often really going to lead to
way bigger issues.
When little things start to pile up, it's no
longer a bunch of small things, it's going to become
a big thing a lot of the time. It's important to actually start seeing small things as
indicators of bad things to come if we're not going to go and handle
them. I'm sure we've all worked at a company, maybe you work at one
right now and you don't want to admit it, that you have an error tracking
system that just has thousands of errors,
thousands of exceptions in there that nobody is going and looking at.
And one day one of those exceptions
is actually going to matter. And it will be so hard to
wade through all of the exceptions in that tool because
there's just so many of them. So my
first tip here is you should be focusing on small incidents
first. Don't try to change the world with your incident lessons methodology on your team or at your company by starting with huge incidents. Let's talk small first. Run retros for tiny incidents; don't run them for the biggest ones. You're not
going to move mountains there. Don't go
into a giant incident retrospective for
the first time all excited that we're going to change our process and make the world a better place, and inevitably end up with a Jira ticket that says "rearchitect async pipeline."
That's never going to happen. That's not going to happen because
of that incident. It's going to take a way bigger
movement than that. So create really
small actionable items from your incidents.
And low-stakes incidents are the best place
to get better at doing retros, right? They're low stakes.
Nobody is maybe feeling
really down on themselves for this incident. Maybe it was just a minor data inconsistency
in an export to another system. No user data was lost.
No users even knew there was a problem. Those are the best ones
to start instituting behavioral change for
your incident lessons and retrospective practices.
And Heidi Waterhouse, who works at LaunchDarkly, has a great quote: "A plane with many malfunctioning call buttons may also be poorly maintained in other ways, like faulty checking for turbine blade microfractures or landing gear behavior."
So next time you're on a plane and the buttons don't work and maybe the
window is super scratched and your seat doesn't lean back, like maybe
you should start to wonder certain things. And sorry if I just gave anyone crippling anxiety about flying,
but I think that this quote is really good. Moving on,
let's talk about things that you can measure. So that was tip number one. Moving on to tip number two: what do you measure? Well, knowing how much you're improving is really important in improving your incident response,
improving your overall reliability as a team, as a company,
as an organization. Are you improving your MTTR?
Are you actually measurably making sure that
your MTTR is going down?
So my second tip really
quick is check your MTTR. MTTR is a great
statistic to measure and improve incident response on your team.
Really, really critical that you measure your MTTR. And for the incident and human safety factor nerds out there that love learning
about MTTR and why it's the worst metric to
track, I am in fact trolling you.
The thing that I'm actually saying is mean time to retro. I'm sure you didn't think I was saying mean time to resolution, I hope. But we're talking about mean time to retro, to retrospective. Our memory up here, it goes quick.
It fades very fast. Hermann Ebbinghaus,
he discovered that memory loss is way faster than we think.
The moment you learn something, you're going to start to forget it pretty
quickly. You need to really institute
knowledge through dedicated learning for it to
stick around. So where
I'm going with this is that your mean time to retro is a really important stat,
because one of the easiest ways to have a bad incident
retrospective is to wait two weeks. You could
do all of your process right, you could have all
the right people in the right room, all of your timeline correct, and you could
still produce a bad retrospective learning
document, whatever your outcome is. And the reason is that we
forget things very, very quickly, faster than you think.
So don't wait weeks. If you have a sev
one incident, that is a moment when calendars should change: calendar events the next day should go away to make room for that retrospective. Because a sev one incident, that's a big deal. Maybe a sev five gets a little bit more leniency, but even still, hold it within a couple of days, because our memory just is that
susceptible to being corrupted or lost entirely.
So tip number two, with a little bit less troll, is track your mean time to retro. Hold prompt and consistent retros on your incidents. The easiest way? Again, don't wait, do it super fast.
Another metric that I really like is actually tracking the ratio
of your incidents to retrospectives. And that's a number that should go
up. If you are tracking this number, let's say you have ten sev one incidents,
but you only did two retrospectives. That's a
pretty bad ratio. You should be getting that number up.
So track that number. It's a really good one to
actually measurably see if your company behavior
is changing. And that's one
of the most important things, is getting people on the same page.
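As a rough illustration of tracking these two numbers, here is a minimal Python sketch that computes mean time to retro and a sev-one retro ratio from a list of incident records. The record fields, severity labels, and example dates are assumptions made for illustration, not part of the talk or any particular tool.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

# Hypothetical incident record; field names are illustrative assumptions.
@dataclass
class Incident:
    severity: str                        # e.g. "sev1"
    resolved_at: datetime                # when the incident was resolved
    retro_at: Optional[datetime] = None  # when the retro was held, if ever

def mean_time_to_retro_hours(incidents: list[Incident]) -> Optional[float]:
    """Average hours between resolution and retrospective, for incidents that had one."""
    gaps = [
        (i.retro_at - i.resolved_at).total_seconds() / 3600
        for i in incidents
        if i.retro_at is not None
    ]
    return sum(gaps) / len(gaps) if gaps else None

def retro_ratio(incidents: list[Incident], severity: str = "sev1") -> float:
    """Fraction of incidents at a given severity that got a retrospective (should trend up)."""
    relevant = [i for i in incidents if i.severity == severity]
    if not relevant:
        return 0.0
    return sum(1 for i in relevant if i.retro_at is not None) / len(relevant)

if __name__ == "__main__":
    incidents = [
        Incident("sev1", datetime(2021, 4, 1, 10), datetime(2021, 4, 2, 15)),
        Incident("sev1", datetime(2021, 4, 8, 9)),  # never got a retro
    ]
    print(f"Mean time to retro: {mean_time_to_retro_hours(incidents):.1f}h")
    print(f"Sev1 retro ratio: {retro_ratio(incidents):.0%}")
```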
So let's talk about things that you can alert on. A hot topic, I'm sure, for a lot of people: maybe you have an alert fatigue problem. Maybe you have systems that fire red herring alerts all the time. So let's talk about things that you can alert on.
One thing that we should all take note of here is that we don't declare incidents just to declare them. We only really declare incidents because our customers, or the people that receive value from our systems and software, are feeling some level of pain. Whether it be slow checkout times on your online ordering form, or errors entirely, or an image that isn't loading for someone when that's exactly why they're there, maybe you have a photo website. The severity of your incidents is directly linked to customer pain. We would not open incidents unless there was a customer on the other side
having some level of pain. And where I'm going with this
is that computer vitals are what a lot
of companies alert on. A lot of companies do this and it's
okay, you can get better at this. But when
you are alerting on your computer's CPU, that is a bad metric to alert on, because CPUs are going to get hot. They're going to be utilized more at certain times of the day and utilized less maybe in the middle of the night, wherever your peak traffic is. And this
is a big deal because if you think about it, if I go on a
run, if I go down to my street, and if my elevator works
and I get on the street and I sprint down the block,
my heart is going to beat faster. It is designed
to do that. It means that, hey, you are exerting a lot of energy, I'm going to give your muscles more oxygen, and that's what it does. I think that is what I was taught in school. I need to maybe go verify that before I so confidently say it in a talk. But my point is that when you page people
at 2:00 a.m. because a disk is maybe at 80% capacity, or your CPU is running at, let's say, 95%, but no one is feeling any pain, your customers don't know any difference, there's no reason to wake anyone up. The easiest way to lose great teammates is to page them for things that no one knew about, where no one actually felt any pain.
So my third tip here is alert on degraded experience
with the service and not much else. A CPU burning hot, again at 90%, is not necessarily a bad thing.
It might indicate problems down the road, but correlation does
not equal causation all the time. And what we
should be doing is creating service level objectives that are tied
to a customer experience and alert on those.
Alert on when a customer is feeling a problem, not when
the computer is potentially just doing its job.
People experiencing problems are the only thing you should alert on, period. Really, there's no reason to wake anyone up unless someone is feeling a problem with your system. SoundCloud has a great blog post on it. I reference it all the time. It's called alerting on SLOs, but you can apply it anywhere. What they say is that you should alert on
symptoms and not causes. So going back to the running analogy, if I go
downstairs and I sprint down the block, my heart's going to start beating faster.
That's what it does. But if all of a sudden, if I pass
out and I fall and I hit my head, that's maybe
a problem, right? I shouldn't be passing out just because I
sprinted down the block. That means that something went wrong and I should maybe go to the hospital and page someone. So you should only be alerting on your symptoms, not the causes.
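To make the symptom-versus-cause idea concrete, here is a minimal sketch of an SLO-style paging decision that looks only at the error rate customers actually experience and deliberately ignores CPU utilization. The SLO target, counts, and function names are illustrative assumptions, not any vendor's API.

```python
# Minimal sketch of symptom-based alerting: page on the customer-facing error
# rate (an SLO violation), never on raw machine vitals like CPU utilization.
# The numbers and names here are assumptions for illustration only.

SLO_TARGET = 0.999           # 99.9% of requests should succeed
ERROR_BUDGET = 1 - SLO_TARGET

def error_rate(total_requests: int, failed_requests: int) -> float:
    """Fraction of requests that failed in the observed window."""
    return failed_requests / total_requests if total_requests else 0.0

def should_page(total_requests: int, failed_requests: int, cpu_utilization: float) -> bool:
    """Page only when customers are feeling pain (error rate exceeds the budget)."""
    burning_budget = error_rate(total_requests, failed_requests) > ERROR_BUDGET
    # cpu_utilization is deliberately ignored: a hot CPU alone is the system doing its job.
    return burning_budget

if __name__ == "__main__":
    # 95% CPU but only 0.02% of requests failing: nobody gets woken up.
    print(should_page(total_requests=100_000, failed_requests=20, cpu_utilization=0.95))    # False
    # 40% CPU but 1% of requests failing: customers are hurting, page someone.
    print(should_page(total_requests=100_000, failed_requests=1_000, cpu_utilization=0.40))  # True
```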
Who do you notify about an incident? And this is
an interesting topic, I think because at a lot of companies,
incidents are high blast radius events.
Everyone in the company knows that you're having an incident in some
way, shape or form. I used to work for
a payroll provider and if our payroll went down, that was a
big deal. A lot of people knew about that. Our HR team knew about it,
our engineers knew about it. Customer success knew about it because their accounts
were calling them saying, hey, I can't run payroll. The sales team knew about it
because they couldn't do demos. It was a big
deal for the payroll system to no longer
be running, and so it became
an issue when you have 500, 600 employees, and everyone wants to
be involved. Everyone. They all feel the pain of it internally at
the company, so they all want to join
an incident channel or a Zoom call or whatever it
is. And incident notifications
can be scary. As an engineering team, do you really need to
tell your entire team about an incident? Are you going to cause
panic and pandemonium by telling your team that there's an
incident? I certainly have, even though it was a benign incident. Simply saying, hey, this isn't working right now, people don't always know how to take that. So can a
small set of people tackle an incident without disrupting the rest of the
company? Can you have a tiger team that just gets in there really quick
and just fixes a problem and gets out?
When does customer success and support get involved? That's another big deal.
Do they need to be attached directly to engineering's communication lines, joined at the hip during an incident,
or can there be a push mechanism? Can we give
the success and support team the information that they need
the moment that it becomes available without them maybe being
in, let's say, the figurative kitchen?
And another question is, when do you escalate? When do you actually say, you know
what, maybe we do need to tell everyone right now?
That's another challenge that should have a well-defined process behind it. So my fourth tip is: focused teams will perform the best mitigation. There is such a thing as over-communicating about an incident to a company. I've done it. I've been a person that's even seen it.
And it can cause ridiculous distractions and
really massive logical fallacies among the team. Again,
incidents are big blast radius events. Like, we should be very
careful about how much we actually tell people about
those incidents. And even in New York City,
when there's a fire in the big, tall buildings, like where it's
cement in between each floor, you actually are told
by the fire department if you are above the fire,
i.e., if there's a fire on floor ten and you're on floor 13, the fire department will actually tell you: don't leave the building. Don't leave the building. And as insane as that sounds, the reason that the fire
department does that is because you're probably not in that much danger.
There's a lot of cement between you and the floors below you.
And two, let's say it's a 25-story building. If you have, what is that, 15 floors of people, each floor having hundreds of people, suddenly on the street down below, the fire trucks can't
get there. They can't even get to the building. So that's why the fire department
says if you are two floors or more above the
fire in New York City, in certain buildings, don't leave
your floor because you're going to make it really hard for the fire department to
even get to the fire to put it out. And then you actually make it
worse. So be careful who you actually loop into
incidents. It really can make a big difference.
And a great quote. Too many cooks spoil the
broth. Too many people mitigating by committee
will produce inferior results. All right,
the last tip that I have for everyone here.
What teams own your services? If you're at a bigger
company, or even a small one with only a few things or product areas, this is a big thing for people to think about, and you need to be considering your service ownership before you actually have an outage. And not only for the purposes of "this team builds these features on this service." That's not really what I mean; that is a part of it, but really it's more important
because people need to feel like they own something to properly
know how to fix it or call the right people or be the
right person to fix those problems.
For example, if there's an authentication service
that my team owns that goes down, it doesn't make
sense to have a different team from maybe a
different product area coming to my neighborhood
to solve the problem. They might be completely capable
of doing it. It's not a capability thing.
They might take a little bit longer to get there. They might take a little
bit longer to get their bearings. Imagine if a fire department from
the northern part of Manhattan had to go to the southern part of Brooklyn.
That doesn't make a ton of sense. They maybe don't know the roads as much,
so that's a big deal. And also the
service owner, they know all the dependencies. They know where all of the dead bodies
are in the application. They are going to be the best people
to resolve that incident. But you're not going to be able to define
that the moment an incident starts. This needs to be
clear beforehand. Everyone should know: when this service has a problem, I'm the person that resolves it, and there should be no question about that.
Because knowing which team to send to the fire
is critical when seconds matter. With fire departments, there's a reason that there's a fire department in each neighborhood.
It's because it's the fastest way to get to them. They know the neighborhood,
they know the nuances of streets. They are the better team
to get there. And seconds do matter, of course, in real life
fires, but also with software.
So my fifth tip is to assign clear
ownership, very crystal clear ownership. Ad hoc fire crews
will only continue unless there is a clear line of
who owns what. So providing a service catalog
with clear ownership will help you route people to fires
faster. I promise you. Service catalog is also valuable
because it's a thought exercise. It makes you think, well,
here's everything we own, and you might not actually have a big database of that anywhere. It can be a great exercise
even just to build a service catalog and actually start to think about your
system at a high level. So build a service catalog, assign clear ownership lines, and it will greatly help you in your incident response.
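As a sketch of what a minimal service catalog with clear ownership could look like, here is an illustrative Python structure that maps each service to its owning team and escalation target. The service names, teams, and fields are invented for the example; a YAML file in a repo or a dedicated tool can serve the same purpose.

```python
from dataclasses import dataclass

# Illustrative service catalog entry; fields and values are assumptions for the example.
@dataclass(frozen=True)
class Service:
    name: str
    owner_team: str      # the team that knows where the dead bodies are in the application
    escalation: str      # who to page when this service has a problem
    dependencies: tuple  # other services this one relies on

CATALOG = {
    "auth": Service("auth", owner_team="identity", escalation="identity-oncall",
                    dependencies=("postgres", "redis")),
    "checkout": Service("checkout", owner_team="payments", escalation="payments-oncall",
                        dependencies=("auth", "billing")),
}

def who_do_i_page(service_name: str) -> str:
    """Route an incident straight to the owning team's escalation target."""
    return CATALOG[service_name].escalation

if __name__ == "__main__":
    print(who_do_i_page("auth"))  # identity-oncall
```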
So let's recap really quickly here. Declare and run retros
for the small incidents. It's less stressful. Action items
are more actionable. Don't do giant incidents
for retros yet. Do the small ones, I really promise. Trust me, do the small incidents first,
decrease the time you take before you even analyze an incident.
Good retros take practice. That's going to take
you a little bit of time, but one of the easiest things that you can
do is just decrease the time before you even have the
retro. Those retros will be more valuable.
Alert on pain felt by people, not computers.
We might feel bad for computers. My computer is running hot right now recording this, but it's not sentient; it doesn't have feelings.
If I suddenly lost this recording, that's something I would alert
on and I'd probably be very upset. So alert on the pain
felt by people, not computers.
And really, it's worth saying the only reason we declare incidents
at all is because of the people on the other side using our software.
Number four: consciously mitigate without overdoing the communication.
Again, bringing a lot of people into an incident
is not the best way to solve an incident. The fire department doesn't want everyone
on the street, even though they're pretty close to the fire.
It's safer for people to stay above it and out of
responders way down below on the street.
And number five, assign clear service owners. If you don't have
a clear line of ownership between people and the services they build
and run, you are going to have longer incidents.
You're going to have maybe even more impactful
incidents. Maybe it starts to spread. Maybe you have a thundering herd problem.
It is really important to have your service ownership very well defined.
And with that, I will say thank you to the
organizers, to everyone that watched this. If you are
looking to implement maybe some of these practices, maybe your ears perked up a
little bit on some of them: FireHydrant.com. This is what we do. We started to build this software to solve our own problems, and that led to these tips. So check us out. I'm also bobbytables on Twitter and bobbytables on GitHub, and we're also @firehydrant on Twitter as
well. And with that, I hope everyone enjoys the rest of your day.
And thanks for watching.