Transcript
            
            
              This transcript was autogenerated.
            
            
            
            
              My name is Robert Ross, but people like to call me Bobby
            
            
            
              Tables. I am the CEO and co-founder of FireHydrant.
            
            
            
              We are an incident management platform. Previously, I was a site reliability engineer
            
            
            
              at Namely. I also worked at DigitalOcean in the pretty early days.
            
            
            
              I've been on call for several years, still on call. Even as
            
            
            
              CEO. I've put out fires. A lot of the time they were my own
            
            
            
              that I had to put out. And then I am a cocktail enthusiast,
            
            
            
              as you can kind of tell behind me. I'm also a live coder.
            
            
            
              What I've been doing every Thursday has been building a Terraform provider,
            
            
            
              live, writing a bunch of Go, having a bunch of fun. And that's 5:00 p.m.
            
            
            
              Eastern on Thursdays. But so, really quickly,
            
            
            
              we're going to go over what we're talking about. So we are going to talk
            
            
            
              about just some of the basics of chaos engineering and kind of how to
            
            
            
              run a chaos experiment, but for incident response. And that's
            
            
            
              going to be the core of this entire presentation, really.
            
            
            
              We're also going to talk about how to add process that doesn't weigh
            
            
            
              down teams. That's a really hard thing to do. So a
            
            
            
              quick overview of chaos engineering. We're going to start off with a quote. This comes
            
            
            
              directly from principlesofchaos.org, but it says, chaos engineering
            
            
            
              is the discipline of experimenting on a system in order to build confidence
            
            
            
              in the system's capability to withstand turbulent conditions in production.
            
            
            
              Another way that this is commonly said, and I think the
            
            
            
              Gremlin team has popularized this, is you can think of it like a flu shot,
            
            
            
              right? You're injecting in a failure to create
            
            
            
              immunity. Another way you can maybe think about it is maybe a controlled
            
            
            
              burn. You're burning down part of the forest
            
            
            
              so it doesn't burn down the entire forest. At certain points,
            
            
            
              you're trying to make things fail. Sometimes they
            
            
            
              may not. And what it does is it allows you to kind of find the
            
            
            
              weak spots and ultimately reinforce them. So a few examples of chaos
            
            
            
              experiments that you may run, and these are technical chaos experiments,
            
            
            
              would be adding latency to a Postgres instance. Really simple
            
            
            
              one. And a possible outcome of that is maybe now your requests
            
            
            
              to your app servers start queuing and the whole site topples over.
            
            
            
              I've experienced that a couple of times in my career. Maybe you black
            
            
            
              hole traffic intentionally to Redis and caching breaks,
            
            
            
              but the site continues to operate, maybe just a little bit more slowly. And then
            
            
            
              killing processes randomly. Requests are half-fulfilled, leaving data
            
            
            
              in an inconsistent state, never a fun situation.
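Editor's sketch: the second and first examples above (black-holing the cache, adding database latency) can be simulated in a few lines. This is a minimal illustration, not anything shown in the talk. The names `BlackholedCache`, `slow_database_lookup`, and `read_through`, and the 50 ms delay, are all assumptions made up for this example. The point is only to show the "caching breaks, but the site continues to operate" outcome.

```python
import time


class BlackholedCache:
    """Hypothetical stand-in for a cache whose traffic is dropped on purpose."""

    def get(self, key):
        raise ConnectionError("cache unreachable (injected failure)")


def slow_database_lookup(key, injected_latency_s=0.05):
    """Pretend database read with extra latency injected for the experiment."""
    time.sleep(injected_latency_s)
    return f"value-for-{key}"


def read_through(cache, key, loader):
    """Try the cache first; if it is unreachable, fall back to the loader."""
    try:
        cached = cache.get(key)
        if cached is not None:
            return cached, "cache"
    except ConnectionError:
        pass  # Caching is broken, but the request should still succeed.
    return loader(key), "database"


def run_experiment():
    """Run one request against the broken cache and report what happened."""
    started = time.monotonic()
    value, source = read_through(BlackholedCache(), "user:42", slow_database_lookup)
    elapsed = time.monotonic() - started
    return {"value": value, "source": source, "elapsed_s": elapsed}


if __name__ == "__main__":
    print(run_experiment())
```

If `read_through` did not catch the connection error, the same experiment would surface the other outcome described above: the request fails outright instead of just getting slower.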
            
            
            
              So all these examples, super simple, but hopefully the idea
            
            
            
              kind of comes across. Now, process experiments can be
            
            
            
              run in a very similar fashion. So chaos engineering obviously focuses
            
            
            
              on kind of the computers and how they interact with each other when thrown into
            
            
            
              a certain state. But with process experiments, there are
            
            
            
              so many other outcomes that could possibly happen. And the reason I like
            
            
            
              process experiments is that people are going to solve problems different ways.
            
            
            
              Myself, I'm going to look for logs and find what I need to find with
            
            
            
              a different route than maybe my colleague would. And if
            
            
            
              you have a SEV1 outage and you ask someone to update the status page,
            
            
            
              there could be a bunch of ways that they go about and do that.
            
            
            
              Rolling back a deploy, maybe I revert a merge, but somebody else uses Spinnaker
            
            
            
              to roll back a deploy. There's so many different ways that we can
            
            
            
              actually change our systems with only our keyboards.
            
            
            
              So how do you run an experiment on process?
            
            
            
              That's an interesting question. And one of the ways you can
            
            
            
              do it is you can have a surprise meeting. So this is something that we
            
            
            
              actually did last year is I put my co-founder Dylan
            
            
            
              in a room and I actually asked him to
            
            
            
              update our status page. I did not tell him I was going to do
            
            
            
              this. I'd like you to pretend there is a SEV1
            
            
            
              incident for FireHydrant and update our status page.
            
            
            
              Okay, great. So I'm going to go to statuspage.io
            
            
            
              and I'm not logged in because only Bobby
            
            
            
              has credentials for Statuspage.
            
            
            
              And now I'm stuck and don't know what to do.
            
            
            
              So about 30 seconds is all it took to really find
            
            
            
              that an on call engineer could not update our
            
            
            
              status page. That's all it took. And it immediately revealed that
            
            
            
              Dylan had no access. And we have our own status page product
            
            
            
              now. But that was very quickly researched.
            
            
            
              He would have had no way to publicly disclose incidents to customers. And that's
            
            
            
              part of our SLA to our customers is saying, we will
            
            
            
              tell you when we have incidents as fast as possible. You can do this for
            
            
            
              any common operation you do during an incident, any of them.
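Editor's sketch: the status page finding above is an access gap, and the same check can be written down for any incident operation. This is a minimal illustration under stated assumptions. The `Operation` type and `find_access_gaps` function are hypothetical, and apart from the status page entry (which follows the story just told), the sample access lists are invented for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Operation:
    """One thing people do during an incident, and who is able to do it."""
    name: str
    allowed_users: set = field(default_factory=set)


def find_access_gaps(operations, on_call_engineers):
    """Return {operation name: [engineers who cannot perform it]}."""
    gaps = {}
    for op in operations:
        missing = sorted(e for e in on_call_engineers if e not in op.allowed_users)
        if missing:
            gaps[op.name] = missing
    return gaps


if __name__ == "__main__":
    # Illustrative data only; the rollback access list is an assumption.
    operations = [
        Operation("update status page", {"bobby"}),
        Operation("roll back a deploy", {"bobby", "dylan"}),
    ]
    print(find_access_gaps(operations, ["bobby", "dylan"]))
```

A list like this only tells you what is supposed to be true. Asking someone to actually do the operation, as in the surprise meeting, is what confirms it.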
            
            
            
              And there are a ton of different operations that if
            
            
            
              you really take a microscope to how you operate during
            
            
            
              an incident, you probably have had to roll back a bad deploy,
            
            
            
              update a replica config, purge a cache, find a log,
            
            
            
              skip a test suite just so something can go to production faster. There are
            
            
            
              so many things that we do when we respond to incidents
            
            
            
              and when you start to think of them as individual techniques
            
            
            
              and create boxes that you can then practice
            
            
            
              that's what we're talking about when it says, how do you master the
            
            
            
              techniques that then become a part of your process? Because the
            
            
            
              reality is that people mitigate incidents,
            
            
            
              not process. Process, in a lot of ways can hamper
            
            
            
              mitigation of incidents. If a process is too heavy and
            
            
            
              like, oh, you have to update the status page and create a Jira ticket and
            
            
            
              create a slack room and all these things manually.
            
            
            
              That's not helping you solve the problem and all
            
            
            
              of that, that's what FireHydrant is really for, is helping institutionalize and
            
            
            
              automate that process. But that's not why we're here. And the other thing that
            
            
            
              processes get wrong a lot of the time is they're
            
            
            
              prescribing how to do something. This is the only way that
            
            
            
              you can add storage to the database,
            
            
            
              right? Like, oh, if we're getting out-of-storage errors, here's the runbook to add
            
            
            
              storage. But what if it was a red herring? What if it wasn't
            
            
            
              actually storage being out, right? Maybe an error was
            
            
            
              just mislabeled. There's a lot of reasons why you don't want to prescribe
            
            
            
              processes. You just want to teach the techniques and test those.
            
            
            
              So think of each technique as an individual Lego brick.
            
            
            
              The fun little plastic pieces that we maybe all played
            
            
            
              with as kids, I still play with as an adult. And what's interesting about Lego
            
            
            
              bricks is that when you have a lot of them, you can create shapes
            
            
            
              and you can break it down and build a different shape with the same
            
            
            
              bricks. And that's really important because when someone
            
            
            
              knows how to use a single brick in multiple situations,
            
            
            
              and a brick, in this analogy, is a technique, they can mitigate different
            
            
            
              incidents more effectively. Now, if someone only knows how to
            
            
            
              make an entire set, they only know how to make the one spaceship,
            
            
            
              the one building, whatever it is, just front to back. That's the only thing they've
            
            
            
              ever learned. That's not very helpful when it comes to a different
            
            
            
              type of incident that comes along. So you should teach how to
            
            
            
              use the mitigation techniques individually and then practice
            
            
            
              using those within scenarios. Now, all of these are technique
            
            
            
              bricks, is what I'm calling them. I'm deeming them technique bricks.
            
            
            
              And if you look at the bottom half here, we have finding logs,
            
            
            
              maybe we have an incident, and I'm like, I got to find the logs and
            
            
            
              it says, oh, it looks like caches are maybe stale.
            
            
            
              So I SSH into a box, I purge a cache, and then I
            
            
            
              roll back a bad deploy that was causing stale caches.
            
            
            
              Each of those individual things were not
            
            
            
              created specifically for an incident that has stale cache problems.
            
            
            
              But because we have individual Lego bricks, we can resolve
            
            
            
              this incident because we've practiced those different things. And now we can get creative
            
            
            
              and get really fast at putting together a set
            
            
            
              to solve an incident very quickly. Now the bricks are
            
            
            
              boring. They're so boring. It's why we don't do them. You don't
            
            
            
              practice. Hey, go practice finding logs.
            
            
            
              Go practice rolling back a deploy. Go practice doing this.
            
            
            
              That's not a thing that we do. It's not something we do in our industry.
            
            
            
              And that's not really how any other part
            
            
            
              of high performing works. So I'm
            
            
            
              going to warn you, I did a lot of marching band. I did
            
            
            
              Drum Corps International; my co-founder did as well.
            
            
            
              And we did things in marching band that
            
            
            
              were really boring. And this is one of them.
            
            
            
              What we did is this. We just marched
            
            
            
              forward, put our horns up, march forward,
            
            
            
              eight steps, put them down. And that was the entire three
            
            
            
              hours. Sometimes this is all we would do. Now,
            
            
            
              that's important. Marching in step with the right technique,
            
            
            
              holding your horn correctly. Those were individual techniques that
            
            
            
              we were required to master. And the only reason that we did that
            
            
            
              was so that once we had mastered that, we could take the field and
            
            
            
              do things like this. Now,
            
            
            
              if you look at each individual on the screen, they are marching.
            
            
            
              They're running, marching in step. They're using the same
            
            
            
              techniques that they practiced for hours on end.
            
            
            
              And then they're basically able to create different shapes and
            
            
            
              do it faster, do it slower to create shows. And you need
            
            
            
              to identify which techniques lack understanding first.
            
            
            
              That's really important. So how can you practice the techniques,
            
            
            
              then? Break it on purpose, right? That's what the purpose of this is,
            
            
            
              is chaos. You want to break something on purpose, but this
            
            
            
              time you're going to watch a teammate try to fix it.
            
            
            
              Now, this is something we did at Namely. We would actually find someone
            
            
            
              that was going to be basically a secret agent, and they
            
            
            
              were going to break staging. And it was decided amongst a
            
            
            
              small group of people that all knew what was going on, and they would
            
            
            
              decide how they were going to break it and when. And then what we did
            
            
            
              is we scheduled time on the calendar for when we were going to break the
            
            
            
              environment. And we had a team that knew this was going to happen.
            
            
            
              We just told them, hey, don't have a meeting at this time. Be at your
            
            
            
              computer. We're going to break staging, and we're not going to tell you how,
            
            
            
              we're not going to tell you any hints. We're just going to see what happens.
            
            
            
              And then what we did is we watched the team, and we watched
            
            
            
              how they identified the problem and mitigated it along the way.
            
            
            
              You're going to hit walls. People are going to run into walls. They're going to
            
            
            
              break their noses. Inevitably they're going to hit some bottleneck.
            
            
            
              It might be that they have to take an alleyway to get to where
            
            
            
              they need to go and it just slows them down. Or they just outright can't
            
            
            
              do something. They don't have permissions to do it. They don't know how. A runbook
            
            
            
              doesn't exist. The capability doesn't exist in the system. Maybe we can't
            
            
            
              roll something back. There are a lot of things that you will identify when you
            
            
            
              create these scenarios. What you're breaking is your process.
            
            
            
              You're trying to break your process. So what you can do
            
            
            
              is going back to that list of Lego bricks, is you need to
            
            
            
              kind of create your own. You need to find all of the different individual bricks.
            
            
            
              And you can do this by going through your retrospectives or post mortems,
            
            
            
              depending on your nomenclature, and find the individual techniques
            
            
            
              that people used to help resolve the incident and break
            
            
            
              those down, make them molecular. And, oh,
            
            
            
              we rolled back a bad deploy. That's a technique we used. We found a log
            
            
            
              on this system. That's a technique we used. Right. And now
            
            
            
              what you can do is you can find your newest teammates and ask them to
            
            
            
              do it. So roll out a benign change to an environment. Do it on production,
            
            
            
              because production is rarely the same as staging
            
            
            
              or QA. Get on a Zoom call with that teammate
            
            
            
              and set it to record, and then ask the teammate to share their screen and
            
            
            
              then tell them to roll back the change. That's it. And then just watch what
            
            
            
              happens. And this is really interesting, because going back to what I was saying earlier,
            
            
            
              people are going to have different ways of rolling back things. There might
            
            
            
              be one person that just pushes a new image to a Docker registry,
            
            
            
              or another person that reverts a commit on GitHub, and another person that knows
            
            
            
              how to roll it back using our CI tool. There are a bunch of different
            
            
            
              ways to do this. Now, you might actually find someone that
            
            
            
              has an extremely efficient way to do this,
            
            
            
              and you didn't know how to do that, you didn't know how to do it
            
            
            
              that way. And now all of a sudden you've revealed, oh, there's another way
            
            
            
              to do this that's even better. And you can kind of put that out into
            
            
            
              the team. And now that's the technique that we use.
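Editor's sketch: one way to keep track of what you see in these recorded drills is to log each attempt and compare the methods people used. This is a minimal illustration, not something shown in the talk. `DrillResult` and `fastest_method` are hypothetical names, and the teammates and timings in the usage example are invented placeholders, not measurements.

```python
from dataclasses import dataclass


@dataclass
class DrillResult:
    """One observed attempt at a single technique during a practice drill."""
    technique: str
    teammate: str
    method: str
    seconds: float
    completed: bool


def fastest_method(results, technique):
    """Return the quickest completed attempt for a technique, or None."""
    finished = [r for r in results if r.technique == technique and r.completed]
    if not finished:
        return None
    return min(finished, key=lambda r: r.seconds)


if __name__ == "__main__":
    # Placeholder observations for illustration only.
    results = [
        DrillResult("roll back a deploy", "teammate A", "push a new image", 410.0, True),
        DrillResult("roll back a deploy", "teammate B", "revert the commit", 300.0, True),
        DrillResult("roll back a deploy", "teammate C", "CI tool rollback", 95.0, True),
        DrillResult("roll back a deploy", "teammate D", "unknown", 600.0, False),
    ]
    best = fastest_method(results, "roll back a deploy")
    print(best.method if best else "nobody finished")
```

Attempts that did not complete are worth keeping in the log too: those are the walls (missing permissions, missing runbooks) that the drill exists to find.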
            
            
            
              You can break it down. Take finding logs, right? Do the
            
            
            
              same thing with Zoom: record and share screen. And ask a
            
            
            
              teammate to find all the logs for 500 status code requests. Because the
            
            
            
              last thing that someone needs to be learning during an incident is
            
            
            
              Lucene syntax. Are we doing status underscore
            
            
            
              code or status uppercase c code or like what is
            
            
            
              the index name? There are a lot of things that slow you down during
            
            
            
              incident response, and those are the minute details. And the shavings of
            
            
            
              those make a pile of time that can really hamper your
            
            
            
              process. It's important to practice the techniques.
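Editor's sketch: to make the Lucene point concrete, here is roughly what the query for 500 status code requests looks like once you know the field name. The helper names are hypothetical and the field name `status_code` is an assumption; your logging system may call it something else, which is exactly the detail worth practicing before an incident.

```python
def lucene_term(field, value):
    """Build a Lucene term query such as status_code:500."""
    return f"{field}:{value}"


def lucene_range(field, low, high):
    """Build an inclusive Lucene range query such as status_code:[500 TO 599]."""
    return f"{field}:[{low} TO {high}]"


if __name__ == "__main__":
    # "status_code" is an assumed field name for illustration.
    print(lucene_term("status_code", 500))
    print(lucene_range("status_code", 500, 599))
```

The range form is handy when you want every server error rather than only exact 500s.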
            
            
            
              So with that, you need to identify the core techniques people use
            
            
            
              during incidents. And you can do this through chaos engineering principles,
            
            
            
              right? Break your system and watch people fix it.
            
            
            
              It's not typical. Chaos engineering, we think of as: we break the
            
            
            
              system and then what happens is that we just try to
            
            
            
              see the adverse effects of the system breaking. But we rarely
            
            
            
              use the same technique to test process. So once you find those techniques
            
            
            
              that people are using during these intentional incidents,
            
            
            
              write them down, get them on paper, put them in Notion,
            
            
            
              Confluence, whatever you're using, to really institutionalize
            
            
            
              that knowledge, and then practice them regularly. Your logging
            
            
            
              system is going to change. Inevitably, something is going to change. The way you find
            
            
            
              metrics, the way that metrics are labeled. You need to practice the individual
            
            
            
              techniques all the time. The same way that we did in marching
            
            
            
              band where we would do very boring things every
            
            
            
              single day before we did anything that mattered to our
            
            
            
              audience at least. So what did we cover? We got through this pretty
            
            
            
              quickly. We learned that techniques matter in incident response.
            
            
            
              I really strongly emphasize this. It's important that you really
            
            
            
              boil down to how you are doing things and practice those
            
            
            
              individual things. I highly encourage breaking something and just observing
            
            
            
              how a team fixes it. You will reveal a lot
            
            
            
              through this. You'll find that someone doesn't have access to something. You'll find that
            
            
            
              someone has a different way of doing something that's better.
            
            
            
              This will take you an hour to do with a small team.
            
            
            
              So thank you. My name is Robert Ross. People call me Bobby
            
            
            
              Tables. I'm on Twitter at Bobby Tables, GitHub Bobby Tables.
            
            
            
              Feel free to email me. I'm robert@firehydrant.com, and as always, if you're
            
            
            
              looking for an incident management tool that can help you automate a lot of
            
            
            
              the process around incident response,
            
            
            
              firehydrant.com is a great place and resource.