Transcript
My name is Robert Ross, but people like to call me Bobby
Tables. I am the CEO and co-founder of FireHydrant.
We are an incident management platform. Previously, I was a site reliability engineer
at Namely, and I also worked at DigitalOcean in the pretty early days.
I've been on call for several years, and I'm still on call, even as
CEO. I've put out fires. A lot of the time they were my own
that I had to put out. And then I am a cocktail enthusiast,
as you can kind of tell behind me. I'm also a live coder.
What I've been doing every Thursday is building a Terraform provider,
live, writing a bunch of Go and having a bunch of fun. That's at 5:00 p.m.
Eastern on Thursdays. So, really quickly,
we're going to go over what we're talking about. So we are going to talk
about just some of the basics of chaos engineering and kind of how to
run a chaos experiment, but for incident response. And that's
going to be the core of this entire presentation, really.
We're also going to talk about how to add process that doesn't weigh
down teams. That's a really hard thing to do. So a
quick overview of chaos engineering. We're going to start off with a quote. This comes
directly from principlesofchaos.org, and it says: chaos engineering
is the discipline of experimenting on a system in order to build confidence
in the system's capability to withstand turbulent conditions in production.
Another way that this is commonly said, and I think the
Gremlin team has popularized this, is you can think of it like a flu shot,
right? You're injecting a failure to create
immunity. Another way you can maybe think about it is a controlled
burn. You're intentionally burning parts of the forest
so a wildfire doesn't burn down the entire forest. At certain points,
you're trying to make things fail. Sometimes they
may not. And what it does is it allows you to kind of find the
weak spots and ultimately reinforce them. So a few examples of chaos
experiments that you may run, and these are technical chaos experiments,
would be adding latency to a Postgres instance. Really simple
one. And a possible outcome of that is maybe now your requests
to your app servers start queuing and the whole site topples over.
I've experienced that a couple of times in my career. Maybe you black
hole traffic intentionally to Redis, and caching breaks,
but the site continues to operate, maybe just a little more slowly. And then there's
killing processes randomly. Requests are half fulfilled, leaving data
in an inconsistent state, never a fun situation.
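To make the first example concrete, here is a minimal sketch of one way you could stage that latency experiment yourself: a tiny Go TCP proxy that sits in front of Postgres and delays everything the app sends to the database. The listen port, the 200ms delay, and the localhost:5432 upstream are all assumptions for the sketch, and purpose-built tools like tc netem or Gremlin do the same job with far more control.

```go
// latencyproxy.go: a chaos sketch that sits between an app and Postgres and
// injects artificial latency into every client-to-database write.
// Assumes Postgres is reachable on localhost:5432; point the app at :6432.
package main

import (
	"io"
	"log"
	"net"
	"time"
)

const (
	listenAddr  = ":6432"          // where the app connects during the experiment
	upstream    = "localhost:5432" // the real Postgres instance
	injectedLag = 200 * time.Millisecond
)

func main() {
	ln, err := net.Listen("tcp", listenAddr)
	if err != nil {
		log.Fatal(err)
	}
	log.Printf("injecting %v of latency in front of %s", injectedLag, upstream)
	for {
		client, err := ln.Accept()
		if err != nil {
			log.Fatal(err)
		}
		go proxy(client)
	}
}

func proxy(client net.Conn) {
	defer client.Close()
	db, err := net.Dial("tcp", upstream)
	if err != nil {
		log.Printf("upstream dial failed: %v", err)
		return
	}
	defer db.Close()

	// Delay each client->database write to simulate a slow link to Postgres.
	go func() {
		defer db.Close() // unblocks the response copy if the client hangs up
		buf := make([]byte, 32*1024)
		for {
			n, err := client.Read(buf)
			if n > 0 {
				time.Sleep(injectedLag)
				if _, werr := db.Write(buf[:n]); werr != nil {
					return
				}
			}
			if err != nil {
				return
			}
		}
	}()
	// Responses flow back to the app unmodified.
	io.Copy(client, db)
}
```

Run it, point one app server's database host at the proxy, and watch whether requests start queuing the way the first example describes.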
So all these examples, super simple, but hopefully the idea
kind of comes across. Now, process experiments can be
run in a very similar fashion. So chaos engineering obviously focuses
on kind of the computers and how they interact with each other when thrown into
a certain state. But with process experiments, there are
so many other outcomes that could possibly happen. And the reason I like
process experiments is that people are going to solve problems different ways.
Myself, I'm going to look for logs and find what I need to find via
a different route than maybe my colleague would. And if
you have a sev one outage and you ask someone to update the status page,
there could be a bunch of ways that they go about doing that.
Rolling back a deploy, maybe I revert a merge, but somebody else uses Spinnaker
to roll back a deploy. There's so many different ways that we can
actually change our systems with only our keyboards.
So how do you run a chaos experiment on process?
That's an interesting question. And one of the ways you can
do it is you can have a surprise meeting. So this is something that we
actually did last year: I put my co-founder Dylan
in a room and I asked him to
update our status page. I did not tell him I was going to do
this ahead of time. I just said, here's what I'd like you to do: pretend there is a sev one
incident for FireHydrant and update our status page.
Okay, great. So I'm going to go to statuspage.io,
and I'm not logged in, because only Bobby
has credentials for Statuspage.
And now I'm stuck and don't know what to do.
So about 30 seconds is all it took to really find
that an on call engineer could not update our
status page. That's all it took. And it immediately revealed that
Dylan had no access. And we have our own status page product
now. But that gap was surfaced very quickly.
He would have had no way to publicly disclose incidents to customers. And part
of our SLA to our customers is saying, we will
tell you when we have incidents as fast as possible. You can do this for
any common operation you do during an incident, any of them.
And there are a ton of different operations that if
you really take a microscope to how you operate during
an incident, you probably have had to roll back a bad deploy,
update a replica config, purge a cache, find a log,
skip a test suite just so something can go to production faster. There are
so many things that we do when we respond to incidents
and when you start to think of them as individual techniques
and put them into boxes that you can then practice,
that's what we're talking about when we ask: how do you master the
techniques that then become a part of your process? Because the
reality is that people mitigate incidents,
not process. Process, in a lot of ways, can hamper
mitigation of incidents. If a process is too heavy,
like, oh, you have to update the status page and create a Jira ticket and
create a Slack room and all these things manually,
that's not helping you solve the problem. And all
of that is what FireHydrant is really for: helping institutionalize and
automate that process. But that's not why we're here. And the other thing that
processes get wrong a lot of the time is they're
prescribing how to do something. This is the only way that
you can add storage to the database,
right? Like, oh, if we're getting out-of-storage errors, here's the runbook to add
storage. But what if it was a red herring? What if it wasn't
actually storage running out, right? Maybe an error was
just mislabeled. There's a lot of reasons why you don't want to prescribe
processes. You just want to teach the techniques and test those.
So think of each technique as an individual Lego brick.
The fun little plastic pieces that we maybe all played
with as kids, and that I still play with as an adult. And what's interesting about Lego
bricks is that when you have a lot of them, you can create shapes
and you can break it down and build a different shape with the same
bricks. And that's really important because when someone
knows how to use a single brick in multiple situations,
and a brick in this analogy is a technique, they can mitigate different
incidents more effectively. Now, if someone only knows how to
make an entire set, they only know how to make the one spaceship,
the one building, whatever it is, just front to back. That's the only thing they've
ever learned. That's not very helpful when it comes to a different
type of incident that comes along. So you should teach how to
use the mitigation techniques individually and then practice
using those within scenarios. Now, all of these are technique
bricks is what I'm calling them. I'm deeming them technique bricks.
And if you look at the bottom half here, we have finding logs:
maybe we have an incident, and I'm like, I've got to find the logs, and
it says, oh, it looks like the caches are maybe stale.
So I ssh into a box, I purge a cache, and then I
roll back a bad deploy that was causing the stale caches.
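To make one of those bricks concrete, here is a minimal sketch of the purge-a-cache step, assuming the stale data lives in Redis and using the go-redis v9 client; the address and the blanket FlushDB are assumptions for illustration, and a real drill would probably target a narrower set of keys on a staging cache first.

```go
// purgecache.go: a minimal sketch of the "purge a cache" technique brick,
// assuming the stale data lives in Redis and using the go-redis v9 client.
package main

import (
	"context"
	"log"

	"github.com/redis/go-redis/v9"
)

func main() {
	ctx := context.Background()

	// Hypothetical address; in a drill this would be the staging cache.
	rdb := redis.NewClient(&redis.Options{Addr: "localhost:6379"})
	defer rdb.Close()

	// Flush the current logical database. A gentler version of this brick
	// would DEL only the keys the stale feature writes.
	if err := rdb.FlushDB(ctx).Err(); err != nil {
		log.Fatalf("cache purge failed: %v", err)
	}
	log.Println("cache purged")
}
```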
Each of those individual things were not
created specifically for an incident that has stale cache problems.
But because we have individual Lego bricks, we can resolve
this incident, because we've practiced those different things. And now we can get creative
and get really fast at putting together a set
to solve an incident very quickly. Now, the bricks are
boring. They're so boring. It's why we don't practice them. We don't say,
hey, go practice finding logs.
Go practice rolling back a deploy. Go practice doing this.
That's not a thing that we do. It's not something we do in our industry.
And that's not really how any other
high-performing discipline works. So I'm
going to warn you, I did a lot of marching band. I did
Drum Corps International, and my co-founder did as well.
And we did things in marching band that
were really boring. And this is one of them.
What we did is this: we just marched
forward eight steps, put our horns up, marched forward
eight steps, put them down. And that was the entire three
hours. Sometimes this is all we would do. Now,
that's important. Marching in step with the right technique,
holding your horn correctly. Those were individual techniques that
we were required to master. And the only reason that we did that
was so that once we had mastered that, we could take the field and
do things like this. Now,
if you look at each individual on the screen, they are marching.
They're running, marching in step. They're using the same
techniques that they practiced for hours on end.
And then they're basically able to create different shapes and
do it faster, do it slower, to create shows. And you need
to identify which techniques people lack an understanding of first.
That's really important. So how can you practice the techniques,
then? Break it on purpose. Right? That's what the purpose of this is:
it's chaos. You want to break something on purpose, but this
time you're going to watch a teammate try to fix it.
Now, this is something we did at Namely: we would actually find someone
that was going to be basically a secret agent, and they
were going to break staging. And it was decided amongst a
small group of people that all knew what was going on, and they would
decide how they were going to break it and when. And then what we did
is we scheduled time on the calendar for when we were going to break the
environment. And the team knew this was going to happen.
We just told them, hey, don't have a meeting at this time. Be at your
computer. We're going to break staging, and we're not going to tell you how,
we're not going to give you any hints. We're just going to see what happens.
And then what we did is we watched the team, and we watched
how they identified the problem and mitigated it along the way.
You're going to hit walls. People are going to run into walls. They're going to
break their noses. Inevitably they're going to hit some bottleneck.
It might be that they have to take an alleyway to get to where
they need to go and it just slows them down. Or they just outright can't
do something. They don't have permissions to do it. They don't know how. A runbook
doesn't exist. The capability doesn't exist in the system. Maybe we can't
roll something back. There are a lot of things that you will identify when you
create these scenarios. What you're breaking is your process.
You're trying to break your process. So what you can do,
going back to that list of Lego bricks, is you need to
kind of create your own. You need to find all of the different individual bricks.
And you can do this by going through your retrospectives or post mortems,
depending on your nomenclature, and find the individual techniques
that people used to help resolve the incident and break
those down, make them molecular. And, oh,
we rolled back a bad deploy. That's a technique we used. We found a log
on this system. That's a technique we used. Right. And now
what you can do is you can find your newest teammates and ask them to
do it. So roll out a benign change to an environment. Do it on production,
because production is rarely ever the same as staging
or QA. Get on a Zoom call with that teammate
and set it to record, and then ask the teammate to share their screen and
then tell them to roll back the change. That's it. And then just watch what
happens. And this is really interesting, because going back to what I was saying earlier,
people are going to have different ways of rolling back things. There might
be one person that just pushes a new image to a Docker registry,
or another person that reverts a commit on GitHub, and another person that knows
how to roll it back using our CI tool. There are a bunch of different
ways to do this. Now, you might actually find someone that
has an extremely efficient way to do this,
and you didn't know how to do it
that way. And now, all of a sudden, you've revealed, oh, there's another way
to do this that's even better. And you can kind of put that out into
the team. And now that's the technique that we use.
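For reference, one of those rollback bricks can be as small as a single command. Here is a minimal sketch in Go, assuming a Kubernetes deploy, kubectl on the responder's PATH, and a hypothetical deployment named sample-app; a git revert or a Spinnaker rollback would be equally valid versions of the same brick.

```go
// rollbackdrill.go: a tiny sketch of one possible "roll back a deploy" brick,
// assuming a Kubernetes deploy, kubectl on the PATH, and a hypothetical
// deployment named sample-app.
package main

import (
	"log"
	"os"
	"os/exec"
)

func main() {
	// Undo the most recent rollout of the (hypothetical) sample-app deployment.
	cmd := exec.Command("kubectl", "rollout", "undo", "deployment/sample-app")
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr
	if err := cmd.Run(); err != nil {
		log.Fatalf("rollback failed: %v", err)
	}
	log.Println("rollback issued; check it with: kubectl rollout status deployment/sample-app")
}
```

The point of the drill isn't which tool wins; it's that everyone on the team has at least one version of this brick they can execute without thinking.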
You can break it down. Take finding logs, right? Do the
same thing with Zoom, record and share screen, and ask a
teammate to find all the logs for 500 status code requests. Because the
last thing that someone needs to be learning during an incident is
Lucene syntax. Are we doing status_code or
statusCode, or like what is
the index name? There are a lot of things that slow you down during
incident response, and those are the minute details. And the shavings of
those make a pile of time that can really hamper your
process. It's important to practice the techniques.
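To give that drill some shape, here is a minimal sketch of the find-the-500s brick, assuming logs land in Elasticsearch under a hypothetical app-logs-* index pattern with a numeric status_code field; both names are assumptions, and discovering what they actually are in your own setup is exactly the muscle this practice builds.

```go
// logsearch.go: a sketch of the "find the 500s" brick, assuming logs are in
// Elasticsearch under a hypothetical app-logs-* index with a status_code field.
package main

import (
	"fmt"
	"io"
	"log"
	"net/http"
	"strings"
)

func main() {
	// Lucene-style query string; whether it's status_code, statusCode, or
	// something else entirely depends on how your log fields are mapped.
	query := `{
	  "query": { "query_string": { "query": "status_code:[500 TO 599]" } },
	  "size": 20
	}`

	resp, err := http.Post(
		"http://localhost:9200/app-logs-*/_search", // hypothetical cluster and index
		"application/json",
		strings.NewReader(query),
	)
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	body, err := io.ReadAll(resp.Body)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(string(body))
}
```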
So with that, you need to identify the core techniques people use
during incidents. And you can do this through chaos engineering principles,
right? Break your system and watch people fix it.
It's not the typical chaos engineering we think of, where we break the
system and then just try to
see the adverse effects of the system breaking. But we rarely
use the same technique to test process. So once you find those techniques
that people are using during these intentional incidents,
write them down, get them on paper, put them in Notion,
Confluence, whatever you're using, to really institutionalize
that knowledge, and then practice them regularly. Your logging
system is going to change. Inevitably, something is going to change: the way you find
metrics, the way that metrics are labeled. You need to practice the individual
techniques all the time. The same way that we did in marching
band where we would do very boring things every
single day before we did anything that mattered, to our
audience at least. So what did we cover? We got through this pretty
quickly. We learned that techniques matter in incident response.
I really strongly emphasize this. It's important that you really
boil down how you are doing things and practice those
individual things. I highly encourage breaking something and just observing
how a team fixes it. You will reveal a lot
through this. You'll find that someone doesn't have access to something. You'll find that
someone has a different, better way of doing something, just by watching.
This will take you an hour to do with a small team.
So thank you. My name is Robert Ross. People call me Bobby
Tables. I'm on Twitter at bobbytables, and on GitHub at bobbytables.
Feel free to email me at robert@firehydrant.com. And as always, if you're
looking for an incident management tool that can help you automate a lot of
the process around incident response,
firehydrant.com is a great place and resource.