Transcript
This transcript was autogenerated. To make changes, submit a PR.
Are you an SRE,
a developer, or a quality
engineer who wants to tackle the challenge of improving reliability in
your DevOps? You can enable your DevOps for reliability
with ChaosNative. Create your free account at Chaos
Native Litmus Cloud. Hi,
welcome everybody to Games We Play to Improve on Incident
Response. My name is Austin King and I'm really excited
that you're here. This is a prerecorded stream, but I'm also
here at the conference with you and I'll be watching the chat.
So ask questions, share your own experiences
and yeah, let's try and have some fun.
Who am I? So I am most recently
from Amazon. I've worked there a couple times.
This last time, for five years, I worked on Prime Now
and other things. So, a recovering Amazonian, and
throughout my career I've caused
a lot of outages, jumped in and helped
bring systems back online whether I caused it
or not. And just have learned a lot,
especially from working at Amazon where
there was a lot more structure to it.
And I started coaching some teams while I was at AWS
and I've since left and I'm working on
helping teams through OpsDrill, and I'm just
really interested in incident response and incident management, and I've
been studying the role that games play and
drills and practice and so I'd love to share
some things I've been thinking about.
So before we can get better at incident management incident response,
there are some prerequisites. So we do need
an incident management framework. There has to be some agreement
within your team, ideally your company,
about how to manage incidents and outages.
If you aren't there yet, that's totally fine.
There are some great resources at response.pagerduty.com.
They've open sourced that training and those resources, and
then I'm also working on a bunch of free resources on how to
adopt an incident management framework at opsdrill.com.
So you have to adopt a framework and then you have to train everybody on
that framework so that people understand the roles.
If you adopt an incident commander role, people understand
what that means from the CEO down. They understand the
incident commander is kind of at the top of the
hierarchy and in control during an outage if you choose
to follow that kind of model.
So once you have an incident framework and once you've adopted
it and trained folks, now we can
play games and get better at doing it.
So there are some foundational kind
of skills and attributes of a person and a team
that we need to build on before we can build on higher level skills.
And a lot of this comes down to kind of SRE culture
DevOps culture. In these conferences, we talk a lot about
culture. It's become much more of a focus in the last
decade than it was previously.
So the first thing is psychological safety. This is an important
building block for us as individuals. To be able to function
in a team, in a company,
we need to be able to take risks and kind
of raise our hand and ask questions and not be afraid that
we'll be made fun of or that there'll be backlash.
We need to be more comfortable with being vulnerable,
and we need to create the space for each other to have psychological
safety. During an incident, and especially after an incident,
if you have a postmortem or an after-action review. We want
to practice blameless culture,
blameless postmortems. It's not so
much about figuring out who was wrong and who messed up
as protecting ourselves by figuring
out root causes and building guardrails
and automated systems so that these things can never occur again
because humans will always make errors. But how
do we help each other out and build in that resiliency?
And lastly, we want to have a growth culture
in our business because we work
in technology. It's all about innovation. And so we
want to always be trying new things, adopting new things,
failing in small ways, thinking about those failures, learning lessons,
and moving forward. It's just the nature of how things are.
This psychological safety allows us to have a much
more enjoyable work environment and to be much more dynamic
and innovative and moving forward. And that's just
the table stakes of working in technology now.
Maybe 50 years ago, things didn't
change as rapidly as they do now, but we're
always adopting new technologies, new processes,
and always analyzing the
decisions we've made and kind of rethinking things.
So psychological safety is super important,
and it kind of leads to the next topic of team cohesion.
So we want to be able to give each other feedback in a
timely manner. Learning feedback cycles, and the
tighter they are, really improve how well we can learn
and grow as a team. We want to have shared
responsibility, so we don't want a single hero
that's always jumping into incident response and taking care
of everything, and they can never go on a vacation.
We want to have everybody have a sense
that they own production and are responsible for it.
And it really helps when we have shared goals as a team,
as an org, as a company. If we all know
where we're headed and we all feel responsible,
we're going to have much better team cohesion and things are going
to go much better. So we talk a lot
as SREs about resiliency in systems,
but it's also important in teams. And so
psychological safety, team cohesion and
aspects of incident response are the human factors of resiliency.
How can we have the
most people helping out or being
trained up and available and have
practiced the skills of incident response? This is in addition
to the system level resiliency that we get through monitoring,
observability and automation.
So once we have kind of
those base soft skills in place,
there are also some higher level technical skills and
communication, also soft skills
that we need specifically in terms
of incident management. Those previous skills are very portable for
any work environment, any situation, not just technology.
And these, when we think about incident management,
there's some more high level skills that are critical and
really important. So curiosity and problem
solving, these two skills are very important
to incident responders or subject matter experts,
the SMEs who are going in and trying to troubleshoot
and figure out what's going on, why are we having an outage and what
changes can we make to stop the bleeding for our customers?
What kind of quick band aid can we put in place to bring the system
back up? So curiosity is
really important to encourage. And if we can play
games that build people's curiosity, that's so much better.
I've seen teams with low curiosity, and during an outage,
we tend to accept the first hypothesis and oh
yeah, it's probably networking and not really dig
any deeper. And then 45 minutes later,
hour and a half later, we find out, oh, it's not networking. Okay,
now I'll start poking around. We want to always be curious
and never creating. Once we have a plan to test a hypothesis,
we're already looking for the next potential cause:
assuming that's not the case, what else could be true? Just
being very curious.
Similarly, problem solving. When we're managing
an incident, we want to come up with a hypothesis for what is
causing the incident and then some tests that we can
do to invalidate that hypothesis.
And highly related to this is the problem-solving
aspect of, okay, if I think that
host disks are filling up, how am
I going to solve that, and being creative and coming up with ways
to solve that. So problem solving is just super important.
So those two skills, everybody in the team needs
to have them, and the majority of people during
incident response are incident responders. So critical
skills. The next three skills, communication,
delegation and coordination. These are especially important for
an incident commander or somebody running a
call and trying to coordinate and help everybody
bring the systems online. So communication
is a skill that is very challenging.
We often think that we communicate much more clearly than we actually
do.
If two people tend to work together a lot, there's a lot of implicit,
unsaid things, and they kind of jibe. But once you
bring in other teams, it's just really helpful to be super
explicit and clear in communicating. So if we can find
games that stress that skill and help us practice that skill, that'd be
great. Delegation and coordination. Again,
the incident commander isn't the person actually doing
tasks. They're more the high level executive mind
at the time. And so they need to be delegating investigative
tasks, different types of tasks to different subject matter experts,
and coordinating,
like when will they check back in with that person?
So putting that all together, communication, delegation,
coordination, they may say, "Jane, could you look
at the database replication, and I'll check back with
you. Do you acknowledge that? How does that sound?"
So if we can find games that kind
of stress delegation and coordination, that could be really valuable
for helping incident commanders get better at that
role. So those skills are
highly portable, much like the soft skills we were
talking about before, like team cohesion.
They're highly portable in that they don't depend on the type of system that
you work on. But we also have a lot of domain-specific
knowledge that's specific to our service, our application that
we're trying to operate. So if
you're an Azure shop, that may mean that
you have deep knowledge of Cosmos DB,
or if you're a Kubernetes and AWS stack
or GCP, and you're focused mostly on queues,
the knowledge that your team has is quite different in
those two cases. So it really comes down to the type of
service you're creating. And as part of this knowledge,
documentation is really important:
over time, filling out runbooks,
having lots of documented procedures for how you handle different
situations, and also eventually automating
all the parts of those runbooks that you can automate.
And these, again, are really domain-specific
pieces of knowledge. So if we can play games that help us
drill on our actual runbooks and our actual
systems, we'll be that much quicker during an incident at
being able to remember what to do. So why is
it important to be good at incident management, incident response?
Well, poor performance leads to longer times
to recover our system. So MTTR
is mean time to recovery, although there's lots
of different R's. People also call it resolution
or remediation. But basically, once
an outage starts, it's the time until we can bring
it back online such that our customers have a good experience.
How much time does that take? And then over
a year, if we averaged out all of those incidents,
what's our mean time to recovery? So this
is really important because when
our systems are down, customers are depending
on us. So we may be losing revenue,
we may be losing customer trust.
It's not good. So the faster,
on average, we are at resolving an outage,
the better.
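As a rough illustration of the averaging described above, here is a minimal sketch that computes mean time to recovery from a list of outages. The incident timestamps and the function name are made up for the example; a real team would pull these from its own incident tracker.

```python
from datetime import datetime, timedelta

# Hypothetical incident records: (outage start, service restored).
incidents = [
    (datetime(2023, 1, 10, 9, 0), datetime(2023, 1, 10, 9, 45)),
    (datetime(2023, 3, 2, 14, 30), datetime(2023, 3, 2, 16, 0)),
    (datetime(2023, 7, 19, 23, 15), datetime(2023, 7, 19, 23, 30)),
]


def mean_time_to_recovery(records):
    """Average the outage durations; returns a timedelta."""
    if not records:
        raise ValueError("need at least one incident")
    total = sum((end - start for start, end in records), timedelta())
    return total / len(records)


mttr = mean_time_to_recovery(incidents)
print(f"MTTR: {mttr.total_seconds() / 60:.0f} minutes")
```

The three sample outages last 45, 90, and 15 minutes, so the average comes out to 50 minutes.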
And we talked about some soft skills,
some team based kind of psychological
skills. These are actually important during incident
response. There's a lot of anti patterns that happen
for teams that don't work on these skills.
For example, the bystander effect is a
well understood psychological principle that
if, let's say someone was being mugged,
there's a difference in the likelihood
of them getting help if there's one
observer or a group of observers.
So if there's one observer, there's a much higher
chance that that one observer will call 911
or do something to assist. But if there's actually 15 people
that are observing this, the bystander effect is
that there's a lower likelihood that any one of them will call
911 or do anything about it, because everyone assumes that everyone
else is kind of on point and going to take care
of it. And so if a team has poor team cohesion,
everyone might just kind of be waiting for somebody else to do something
in that case. Another anti-pattern is the blame
game. So an outage starts
and a couple of people may spend the time arguing
about whose fault it was. "Oh, why did you YOLO push that
commit? It's Friday at four. You pushed it, and then you
walked out of the building." All of these things, all this
blame does not matter. What matters is getting
the system back up and helping your customers out. So it's really
not helpful to be trying to assign blame
and argue about those things.
So this comes back to kind of a shared goal and a shared sense of
responsibility.
In addition to the skills, our knowledge is also important
as SMEs. So if we
could practice our runbooks, we will find out-of-date runbooks,
we'll find missing runbooks, and we can create those and fix them.
If we don't have the right knowledge at the right time,
that can extend outages. And if you're
twice as fast at remembering and
using your runbook, it's just going
to really reduce how long it takes you to resolve an
issue.
So how can we grow our
SRE culture, our DevOps culture, how can we improve it?
So I think we can do that through games, especially if
we stretch the definition of games. We're going to look at kind
of traditional video games, online games, board games
kind of things, but also stretch it a little bit into kind of
operational exercises and drills.
So a lot of what we're talking about is intentional
practice and it's helpful to look at what do other disciplines
do for intentional practice. So if we think about a soccer
team, they don't just scrimmage
and play soccer games all the time. They spend quite
a bit of time in intentional drills and
the coach will have lots of different drills.
An example would be kicking a penalty
kick. I'm not a huge sports person,
sorry, but if you set up the goal and the
defenders and have that one person practice
that kick at the specified distance, they're really honing
in on that particular circumstance, which does happen quite frequently during
a game. And then rotating through every player and having them
do that penalty kick multiple times, that's intentional practice
and getting better and better at that particular scenario.
Similarly, if we look at firefighters, they have an incredibly hard and
dangerous job that is life threatening for themselves
as well as the people they're trying to help.
And so an example of a drill that they do for
intentional practice is they will set a structure
on fire and they'll practice entering
the burning building safely and exiting the burning building safely.
In this way they have a controlled
but dangerous environment where they can really practice those
skills and kind of have
more routine. If they didn't practice these skills
and there's a fire that they have to deal with every couple of months,
it would be much more dangerous for them without
having these skills practiced and being on top of their game.
And then looking more closely to our SRE
discipline, cybersecurity.
Actually, they play a lot of games and they're,
I think in some ways, a lot further ahead of us.
One example would be capture the flag. And there's
two main ways to play capture the flag, but I'll just describe one.
Capture the flag is incredibly popular. There are
annual organized events. They actually happen
every week through different conferences
and different things. If you want to get involved, there's some great websites
that list all the ones that are happening. But basically
there's two teams, a red team and a blue team, and each team
is given a software service to operate.
And inside of each service a vulnerability has been
placed, so the team doesn't know where the vulnerability is,
and the game is won
by the first team that is able to find the vulnerability in
the other team's service and exploit that and kind of capture the flag.
So both teams have a choice
to make. How much time do they spend hardening their system
so that the other team can't break in? And how much time do
they spend trying to exploit the other team's
system? And they're practicing all
of the skills of cybersecurity, pen testing,
decompiling binaries, looking for secrets, looking for weaknesses.
So it's really cool and a great practice
drill.
So I've been looking for games and looking at what
people are doing in the wild. And I've tried to organize this
being inspired by Maslow's hierarchy of needs, kind of the
dependencies and how things build on each other.
And I think kind of the lowest layer is trust,
bonding and safety. So having that social foundation
is important and you can't really do
anything without that. Then building on top of that
is communication. It's important that everybody can
communicate during an incident. So if we could find
games that stress that you do better
at the game with calm, clear speech, that would be
really cool. And then building on top of communication is delegation.
So if we could find games that focus on roles and
leadership, and then lastly, the skills and
knowledge that we talked about, how do we practice our core competencies
while playing a game in a game-like environment?
And so I've been able to group the games
into, at the lowest level, party games, which help us build
that social aspect. Then asymmetrical co-op
games. So that's a mouthful. Let's take it apart. Asymmetrical
means just simply that there's at least two
parties with differing information.
So there's asymmetry in terms of
team A or person A knows something a bit different than
team B or person B. And then co-op:
these are cooperative games. All the games that we'll talk about
today are cooperative games. And party games
are generally associated with casual gaming and are the most
approachable by anybody. And then when you kind
of nerd out and get into games,
they tend to use co-op for some
of the deeper games.
And then if we look at how
do we practice delegation and can we find any games that are great for that?
I think role-based co-op games are very handy.
And basically that means that the players have a
specialization and there's an opportunity for
a leader to coordinate and delegate to
those different roles. And then lastly,
for the skills and knowledge, I'm just going to generically call
these ops drills and we'll look at some different examples of ops
drills. All right,
so let's start at the base of the pyramid with our party games.
So there are so many examples of party games.
A popular one right now on Steam.
I looked at some different sources. So Steam
is a video game distribution platform and marketplace.
Another great place to look for these is Board Game Arena.
Both of these options are remote friendly: you can have a Zoom
call and people can play the games over
these. Steam games often cost money.
Board Game Arena is a much
less expensive model. But anyways,
looking at the most popular party games on Steam,
Jackbox is very popular. I've played this with my teams
and seen lots of teams in the wild using it. It's a collection of
party games. So there's
a drawing game, word games. It allows you
to be silly with your teammates to kind of break the ice.
Everything doesn't have to be right or wrong and
super official and you can get to know people's preferences and
habits. So for a drawing game, for example,
it doesn't matter if you know how to draw or not.
If you're an artist, it's just really fun.
The way that Drawful will work is there's a
word that only you can see. You try and draw
an illustration of that word and then everyone is trying to be
the first person to guess correctly what you're drawing.
And even if you're drawing stick figures,
everybody can have a good laugh and you
can work on psychological safety and being vulnerable,
these kinds of things. So party games, there's thousands of them.
Apples to Apples,
Trivial Pursuit would be like an old school example.
But yeah, any game where
you can all play together, have a good time and
build up trust.
So moving up into the
asymmetrical co-op games,
a great one for stressing communication is Keep
Talking and Nobody Explodes. So I've
talked to several different engineering managers and different companies
that have used this to practice improving
communication with their teams. So the
way this game works is that the asymmetry
of information is that there's two different roles.
There's just one person that is the bomb defuser,
and they're the only person who can see the bomb and they can see the
configuration of that bomb during that particular
session. So bombs have gauges and dials and
batteries and wires and explosive
fuel, and that can show up in all different kinds of configurations.
It could have three batteries or two batteries. The position
of the wires could be horizontal or vertical.
And the defuser is the only one that can see it. So they
need to describe it to the others. Everyone else that's playing,
they're the experts. They have the bomb defusal
manual, which is a large collection of
rules. And these are kind of
logic puzzles. They're very specific.
An example rule might say, if there's
a yellow wire above a red wire, above a blue
wire, then cut the yellow wire.
But if there are any other colors of wires,
this does not apply. So when
the bomb defuser says, "Yeah, there's a yellow wire,"
that's not as good of a communication as "They're
horizontally oriented. There are three wires. Going from top
to bottom is yellow, red, and blue." So if
the defuser isn't giving clear information, then the
experts need to ask more questions. What's the orientation of
the wires? What color wire is at the
top? These kinds of questions. The repercussions
of cutting the wrong wire or trying to defuse the bomb in the wrong way
is that it explodes. And then there's also a timer counting down
until the bomb explodes. So the only way they can defuse the bomb and
win the game is to carefully and clearly communicate
back and forth and give really clear instructions.
"Okay, cut the yellow wire,"
and the day is saved.
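To make the point about explicit communication concrete, here is a small sketch of the example rule described above written as a function. The encoding is hypothetical and only mirrors the one rule from the talk; the actual game manual is much larger and its rules differ.

```python
# Hypothetical encoding of the single example rule from the talk.
def wire_to_cut(wires_top_to_bottom):
    """Return the wire to cut, or None if this rule does not apply."""
    if wires_top_to_bottom == ["yellow", "red", "blue"]:
        return "yellow"
    return None


vague = ["yellow"]                    # "Yeah, there's a yellow wire"
precise = ["yellow", "red", "blue"]   # full top-to-bottom description

print(wire_to_cut(vague))    # None: the experts cannot act yet
print(wire_to_cut(precise))  # yellow
```

With only the vague description, the experts cannot tell whether the rule applies, so they have to ask follow-up questions before anyone cuts a wire.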
So that is asymmetrical co-op games
to help us with communication. And then next
up on top of that, in terms of delegation and communication,
is role-based co-op games. And so a great
one here is Overcooked.
My understanding of where the word chef
comes from is, it's a French word, chef de
cuisine, which roughly translates
as the leader of the kitchen, the boss of the kitchen.
And so this is a great example of
leadership and delegation. The way Overcooked works
is there's different cooking stations in a kitchen, and you're
trying to make food for customers under some
time pressures. There's a dishwashing station,
there's a vegetable chopping station, there's a grilling station,
and then there's a window so that you can deliver the finished plate
to your customers. So early on,
there's no forced roles. Anybody can
do any station. And early on, there isn't a lot of
stress. There's plenty of clean plates, things are
moving along. But eventually,
due to time pressure and using up resources,
a leader needs to be
assigned so that they can look at the different resource bottlenecks and
anticipate things. If there's no clean plates,
you can't plate the food, and so you can't give it to the customer.
So it's very helpful for a chef to step up and delegate
and, you know, "Bill, could you go over to
the dirty dishes and clean five dishes
before returning back to the cutting station?"
And it turns out in the game, just like real life, nobody wants
to do dirty dishes. So the team
that does kind of have an incident commander type role
will do much better at Overcooked and get further
compared to a team that just has no plan and it's just chaos
and everybody's trying to do every role.
So lastly, that brings us to the general bucket
of ops drills. So can we play games within our
teams, in our particular team, and even across the
org or across the company to get better at
the incident management, incident response skills, and especially
drill on our particular systems and our
particular runbooks and knowledge? So I
have four different games that
we can play, and first
up is tabletop exercises. So we talked about cybersecurity earlier.
This is very popular in cybersecurity, and it comes from
the government, the military,
and FEMA, for example. They deal with natural disasters.
They have a long history of tabletop exercises.
So what a tabletop exercise is, there's very little equipment.
In a remote-first way, you just need
someone to think up a scenario,
then a Zoom call, and then maybe
a document or a slide show just to give people
the scenario,
and then a plan of who should be at that meeting.
And then an hour or two or six, depending on how deep you
want to go for people to talk through how they would respond
to that particular scenario. On the
cybersecurity side, there's a really cool deck of cards
called Backdoors & Breaches. And that really helps
stimulate creativity in terms of coming up with
the incident. But within your team,
you could just have somebody that's more senior and has experienced a
lot of things come up with the tabletop scenario.
And they're kind of like the dungeon master, if we think of this as like
D&D. So they could say
that all
of the videos that customers
are uploading are not getting encoded
to a proper codec, and they could think through some
of the root causes for that. And then
everybody in the room that normally works on this
video subscription service or
video platform could talk through things
they would check things they would do.
And this can help highlight missing metrics,
missing observability and monitoring, or gaps in existing
runbooks, and so you can file tickets to go
back and kind of fix those things. So the benefit is
that you don't really need any buy in from anybody.
And the con
of it is that it's not as realistic. You're not actually executing
any software or actually testing your system.
One benefit of tabletop exercises is you
might want to try broadening the invitation list
much larger than you would normally think about. It can be really
helpful to involve executives,
legal and marketing people, just picking
one or two people from each of those disciplines.
But for example, with executives,
it can help executives see how the engineering
teams deal with outages and incident management,
to help develop a rapport and to give them an understanding
to see how the sausage is made. If they're never involved,
then when there's a real outage, they may
not have trust in engineering, they may not have an understanding of how
systems get back online. And so they might join a Zoom
call and make demands or try and take over the situation,
not really understanding how incident management should
work. But if they're involved in these tabletop exercises,
they'll really develop a rapport and see the thinking,
all of the advanced preparation that's already happened for all your systems,
and just they'll have more confidence and more buy in.
So when you're trying to get operational
work and automation prioritized against feature
work, you'll have a better chance at getting a
better ratio. If you work on a product
where features are always prioritized over operational
work, that would be really important and helpful for
avoiding toil or putting in guardrails to
make the next outage less likely.
The more alignment you have in your organization, the more this can really
help. Similarly with legal and marketing:
If you have an outage scenario where you've lost
all customer data for the last three months and the
backups turned out to not work,
that can be great to have legal and marketing
in the room so that they can think about
aspects of their workflow and
projects that they have planned to kind of
get ahead of some of these things and to think through how
the terms of service are written and
promises that are being made to customers and things like
that. So that's tabletop exercises,
and it's all verbal and hypothetical. So we can
kind of move into fire drills, which is a little bit more realistic.
And there's different ways to do fire drills. So one example
would be to pick one of your runbooks
and page the person that's on call
and create a ticket and
describe that they need to run through that
runbook and do all the steps and then put as
much information into the ticket as possible. So this
would be kind of testing that particular runbook,
building up the muscle memory of knowing where it is, what's
in it, executing all the steps, and you could potentially find
bugs or steps that are out of date and fix those.
Another type of fire drill in staging,
or maybe even production, would be to
intentionally break one node
or some percentage
of a system in a particular way.
And that would test that your alarming
is working, that the person would get paged in,
that they could practice investigating and remediating
that situation, and kind
of bringing the system back online, or failing over into
redundant capacity, or just testing that the automated failover is happening.
So lots of different ways to
do fire drills.
Some downsides or limitations of fire drills might be that
you don't have PagerDuty hooked into staging,
and you don't have all your observability and monitoring hooked into staging.
So you'll want to work on fixing that.
And these are things you might want to do anyways.
So it's a little higher fidelity than tabletopping,
but it does require more work and more engineering commitment,
which might be hard to get prioritized.
And the idea of a fire drill leads us naturally
into chaos testing, which is a more
disciplined version of fire drills.
And Netflix is famous for chaos testing
their systems in production. And a company that
really has a lot of training and help for this is Gremlin.
But there's also a lot of open source chaos testing projects.
But effectively, you could set
up a chaos testing hypothesis and
attack a system either in development, staging, or prod.
And to make more of a game of it,
you could treat it more like a fire drill, thinking about having the
team respond to it, diagnose it,
and do the recovery. And this can help you come up
with a roadmap for automations and fixes
to the system to kind of make this
not need human intervention.
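As a very rough sketch of what "a hypothesis plus an attack" could look like, the following simulates the idea in plain Python. Everything here is an assumption for illustration: the node names, the `run_experiment` helper, and the thresholds are invented, nothing is actually broken, and real tools such as Gremlin or the open source projects have their own APIs that are not shown here.

```python
import random


def run_experiment(nodes, kill_percent, min_healthy, rng):
    """Simulate killing a percentage of nodes and check the hypothesis
    that at least `min_healthy` nodes remain available."""
    count = len(nodes) * kill_percent // 100
    to_kill = rng.sample(sorted(nodes), count)
    survivors = nodes - set(to_kill)
    return {
        "killed": to_kill,
        "healthy": len(survivors),
        "hypothesis_holds": len(survivors) >= min_healthy,
    }


# Hypothetical fleet of ten nodes; the seed makes the run repeatable.
nodes = {f"node-{i}" for i in range(10)}
result = run_experiment(nodes, kill_percent=30, min_healthy=6,
                        rng=random.Random(42))
print(result)
```

Writing the expected steady state down before the attack is what separates this from an ad hoc fire drill: if the hypothesis does not hold, that gap becomes an item on the automation roadmap.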
And this may also help you
if you want to adopt chaos testing as an organization long term,
and just kind of continuously have this run in the background.
Doing this in a more game-like fashion could be an incremental
stepping stone towards chaos testing.
And then lastly is game days.
So these are typically done in production,
more big scary events. They probably
happen less frequently than something like chaos testing
or a fire drill. Myself,
the first time I was involved in a game day was in
2008. I was working at Amazon,
on the retail website, and at 10:30 in the
morning, I got paged. We didn't know anything about this game
day, and they had decided to cut the power
to an entire building on purpose, to stress
test our systems and to make sure everything was redundant,
company wide. So this was actually
my first day of being on call, and I was
a new employee of like two months, so I was completely
freaked out. I totally froze. Luckily my mentor was
sitting next to me and they kind of took over and helped me walk through
it to fail over
into other regions. So that was a pretty exciting
introduction to game days. Another example of a game day at
Amazon is to prepare for Prime Day.
There will be one or more game days,
so I was involved in at least
one of these, and there was a lot more planning and
broadcast information. So we did have a heads up
about these. The trouble with Prime Day is that
the traffic patterns are completely different than normal
retail shopping. They stand up unique catalog
experiences, the user flows are completely different and you kind of
have this peak traffic problem where everybody's
going to the Nintendo Switch or whatever the cool thing is. Every
hour they may have some product that's just going to get an insane amount
of traffic, and that type of
thing doesn't happen for most of the rest
of the year. So there's a lot of unique code
and feature flags that are built
to handle these situations so that you can have some fallback
experiences and then you have the permutations
of all of those toggles for all those different systems. So a
game day of simulating that traffic and
turning on and off these different feature flags
is really helpful for shaking out bugs in a system and
finding problems. So hopefully
that's given you some ideas of how you can play
more games with your teams.
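Going back to the Prime Day game day for a moment, the "permutations of all of those toggles" can be enumerated mechanically. This is a minimal sketch with invented flag names; it only lists the combinations a game day might want to cycle through and does not touch any real flag system.

```python
from itertools import product

# Hypothetical flag names, invented for illustration.
flags = ["fallback_catalog", "static_deal_page", "queue_checkout"]


def flag_combinations(names):
    """Yield every on/off assignment of the given flags as a dict."""
    for values in product([False, True], repeat=len(names)):
        yield dict(zip(names, values))


combos = list(flag_combinations(flags))
print(len(combos))  # 8, i.e. 2 ** 3
for combo in combos[:2]:
    print(combo)
```

The count doubles with every added flag, which is one reason a planned game day with simulated traffic is useful for deciding which combinations are worth exercising.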
And again, being more remote
first. You can go to Steam, you can go to Board Game Arena,
and I'm putting more ideas up on opsdrill.com
if you want to check that out. But I would just encourage everybody,
especially if you're shifting to remote first.
It's just really hard to join a new team these days and
to get to know people and
playing games with teams once a month
or weekly can just do so much
for building that team cohesion and helping people
get to know each other. So that's pretty much it
for me. Thank you so much for taking the time to
join this talk with me. If you want to,
I'm on Twitter; please follow me.
My dms are open. Additionally, my email address is listed right
now. Love to hear from you, your ideas, your thoughts and
yeah, thank you so much.