Transcript
Hi, welcome to my session. This is Plan for Unplanned Work: Game Days with Chaos Engineering. My name is Mandy Walls. I am a DevOps advocate at PagerDuty. If you'd like to get in touch with me, you can tweet at me. I'm lnxchk on Twitter. You can also email me. I'm mwalls@pagerduty.com. PagerDuty is hiring. We've got some positions open. You can check those out at pagerduty.com/careers. That's the last time we talk about PagerDuty. So what do we do? We do a bit of incident response and we
want to get better at incident response. Having better incident response
helps our reliability, it helps our customer experience.
But getting better at incident response is
a double edged sword. We want to get better and to get better
at most things, what you do is you practice.
But to practice getting better at incident
response, that means we
want to have more incidents, but we really don't want to have
more incidents. So what can we do to give ourselves
a place to practice? Incident response
is a muscle that has to be exercised like anything
else, right? The workflows, your communications patterns,
the who and where doesn't just happen by magic.
Like, you train people how to do it and they get
better at it over time, right? So we want to avoid doing that in public as much as we possibly can. But we want to
give folks a place to practice what they
do when there's a real issue, right? So that's what
we're going to take a look at. So many organizations
and even individual teams now are using
a game days practice to build
those muscles around incident response and around troubleshooting
and just sort of getting used to what it
means to triage and troubleshoot incidents on their environments.
At PagerDuty, we refer to them as failure Fridays, and we've been talking about this for almost ten years; honestly, I think the blog post I have here is from 2013. Sometimes they're failure any day, right, if they happen to not be on Friday. But the Fridays are for the big ones, and regular teams
can sort of do whatever they need to during the week. The point of
the exercise, though, is to make sure that production operations systems are doing what we need them to do when we need them to do it. And this can be a number of different things attached to your production ecosystem. Are metrics and observability tools giving us the right telemetry, in a way that we can access it? Are the escalation policies, notifications, incident rooms, and all that stuff working as we expect them to? Do folks know what is expected of them and where to go to get more information? You don't want them scrambling around during a real incident, right? And if we need to push a
fix, how do
we do that? Is there a short circuit for the regular process?
Does it have to go back through the full pipeline? What are we going to
do there? Real incidents can be super stressful,
especially when they're customer impacting. And not everyone has the temperament
to be calm and think through things.
People get super hyped up, right? So we also want to give folks
a chance to experience what happens when something isn't running
smoothly. So we can do this with some
failure Fridays, some game days, and we can introduce a little bit of experimentation
into those. We want to go into our game days and our
failure Fridays with a goal. Maybe it sounds cool
to just walk through your data center and pull cables, or scroll through the assets
list in a public cloud and just delete something randomly.
But that isn't where you should be starting a reliability and resilience improvement practice. That's not going to give you the best benefit when what you want to
do is create an environment for everyone
to improve on things. Chaos engineering lets
you compare what you think will happen to
what actually happens in your systems. You literally are
going to break things on purpose to learn how to build more resilient
systems. And this is going to help you build better technical
systems. It's also going to help you build better sociotechnical systems,
which is all of that human workflow that goes into
responding to an incident, troubleshooting, resolving, and all
of those pieces, right? The name chaos engineering can make it sound kind of scary or maybe irreverent to some teams. You might also find programs that call it fault
injection, which sounds a little bit more serious.
But what we're really after is taking the systems as
we know them, changing things around a little bit, and seeing what happens.
So we're going to use chaos engineering to validate
the assumptions we have about how our
systems are going to behave when they meet the users.
Your best laid plans might go out the window when you find out what
the users are actually going to do with the product. Right. The goal of combining
the structure of chaos engineering and specific tests
with a regular cadence of intentional practices is really going to
make sure that what you're putting in front of your users is the best
system possible. You want to make sure you're not forgetting about all of that dependency back there. You do a black hole test, which means pretending that back end is offline, and then what happens to the front end? What does it do? How are you going to mitigate that? You work through all of those things.
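To make that concrete, here is a minimal sketch of a black hole test, assuming a Linux host where you can manipulate iptables with root access; the dependency hostname and duration are hypothetical placeholders, and a dedicated chaos tool would give you the same effect with more guardrails.

```python
# Minimal black hole test sketch: make one back-end dependency unreachable
# for a bounded window, then restore it. Assumes a Linux host with iptables
# and root access; DEPENDENCY_HOST and DURATION_SECONDS are hypothetical.
import subprocess
import time

DEPENDENCY_HOST = "payments.internal.example.com"  # hypothetical back end
DURATION_SECONDS = 600                              # keep the test time-bounded


def block_dependency(host: str) -> None:
    # Drop all outbound traffic to the dependency so it looks offline.
    subprocess.run(["iptables", "-A", "OUTPUT", "-d", host, "-j", "DROP"], check=True)


def restore_dependency(host: str) -> None:
    # Remove the rule we added, restoring normal traffic.
    subprocess.run(["iptables", "-D", "OUTPUT", "-d", host, "-j", "DROP"], check=True)


if __name__ == "__main__":
    block_dependency(DEPENDENCY_HOST)
    try:
        print(f"Black hole active for {DEPENDENCY_HOST}; watch the front end.")
        time.sleep(DURATION_SECONDS)
    finally:
        # Always clean up, even if the experiment is interrupted.
        restore_dependency(DEPENDENCY_HOST)
```

Whatever tooling you actually use, that cleanup step is the part you never want to skip.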
So we want to be intentional about what we're doing and not just randomly picking things apart. When we're looking at what we want to actually improve,
our goals are to learn and we want to focus on
the customer experience. That's the whole point, to keep
customers and users happy and give them the best experience possible.
So when we're planning out testing programs and failure scenarios,
we want to keep the users in mind. We want to improve the reliability
and resilience of the components that our users rely on the
most and where they expect the best performance.
A testing scenario that only flexes a part of the application that users aren't utilizing is like a tree falling in the forest. It's a
lot of work with little impact on the overall success of your
service. So we want to focus on components that are going to generate the
most stress when they go down or become
unreliable. To have a successful chaos testing
and game day practice, you want to have some things in place first,
right? It's kind of enticing to sort of go in at the beginning and create
one of these practices. But you need some tools aligned
first. One, you're going to want at least some basic monitoring
and telemetry. It's absolutely fine and
really kind of expected to use your
game days to flex your monitoring and try
and figure out what you're missing. What other piece of data would
have shown us this problem earlier, especially for
log messages and other application output? You need
to start with at least enough monitoring that you know your test
registered right, and then figure out
what happens downstream. If you're looking at, say, how your application responds to a dependency being offline, a black hole test, have a basic set of monitors in place for that scenario, whatever it is that makes sense for
your platform. If it's an internal dependency, maybe you're monitoring it
directly. If it's an external dependency, maybe you're watching a status page on a
periodic basis or something like that.
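As a rough sketch of that kind of periodic check, assuming a hypothetical status endpoint for the dependency; in a real setup the result would feed your metrics or alerting pipeline rather than just print.

```python
# Sketch of a bare-minimum dependency probe: enough monitoring to know the
# black hole test registered. STATUS_URL is a hypothetical health endpoint.
import time
import urllib.request

STATUS_URL = "https://status.example-dependency.com/health"  # hypothetical
CHECK_INTERVAL_SECONDS = 30


def probe_once(url: str) -> bool:
    # True if the dependency answers with HTTP 200 within five seconds.
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            return resp.status == 200
    except OSError:  # connection refused, DNS failure, timeout, etc.
        return False


if __name__ == "__main__":
    while True:
        healthy = probe_once(STATUS_URL)
        print(f"{time.strftime('%H:%M:%S')} dependency healthy={healthy}")
        time.sleep(CHECK_INTERVAL_SECONDS)
```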
Two, you want to set up your response process.
Chaos testing and game days are an excellent place to practice
your incident response protocols in a low stakes
environment. We're being intentional. We've set our goals in advance.
We think we know what's going to happen. So we can take some time to
practice: having an incident commander, assigning a scribe, preparing a post-incident review. A post-game-day review is just as useful as a post-incident review is for production. Then we want to establish
our workflows for fixing things. Lay the groundwork
for how much experimentation you want the team to do during the test.
In a real incident, your SMEs may need to push code or configuration updates to remediate the issue. They're going to keep working until they
have restored service. You're not going to do that necessarily
during a game day or a failure Friday scenario.
You have a set work time and some things that
you want to experiment around, and then you can call time
right and be done, but establish what you're going to
do there. Finally, you want to think about any improvements that you're going to discover in this process and what happens after the game day. Too many teams spend time planning and
executing game days and then not putting those process
improvements back into their work stream, right? If the reliability improvements
you learn about during the exercise require changes
to the code, you need to get those back into
the workflow so that they become production code, right? So make sure that your product team is on board, that things can be prioritized, and that you're utilizing what you learn from the process. Improving reliability is
a long range process. So you're taking the
time to do the planning ahead of a game day and that's going
to help you get the most out of the work that you do after the
game day. I won't go
into the first two pieces deeply, right?
Monitoring and telemetry: there are lots of folks that
know more about that stuff than I do, but game days are an excellent opportunity
to flex your incident response muscles. You don't have to mobilize
a full incident response every game day, but it's an
option you should keep in mind just to sort of keep things flowing
and keeping all that muscle memory active, responding to major incidents
that impact users, and especially if they cross
team and service boundaries. Take coordinated effort,
right? And practice. Not only do
they help your team work within the response framework,
but your game day can then also help your incident commanders practice managing an incident. They're skills like any other, right? The more we use them, the better they'll be. And for folks who are newly trained as, say, an incident commander or some other position you might be using
in your response process, having the opportunity to practice in
a not real incident is super good for them,
right? Your incident response practice is going to improve over time as well. And your
game days are going to help you become more comfortable and more confident
when you're handling incidents in production,
right? So we organize in advance, we set
out explicit expectations for our practice, and we invite
the people that are going to be learning from it.
Then we can set up what we need to know.
Your game day is going to help you verify that all of your end
to end components in your IR process are
working appropriately. Sometimes they aren't, right?
Things don't work. So when you choose your tests, your first checkpoint is,
do we have the appropriate alert for this scenario? If I've taken a service
offline in a black hole test, does the system
alert me somehow? Does it alert for latency tests, for disk I/O? Where is it coming in from? How do I know that it's coming into the right service in my incident management software?
All of those things can be exercised when the alerts fire.
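One low-stakes way to exercise that path is to fire a clearly labeled synthetic alert. Here is a sketch using the PagerDuty Events API v2; the routing key, summary, and source values are placeholders for whatever test service you use.

```python
# Sketch: trigger a test alert through the PagerDuty Events API v2 so you
# can watch notifications and escalations fire during the game day.
import json
import urllib.request

ROUTING_KEY = "YOUR_EVENTS_V2_ROUTING_KEY"  # placeholder for your test service


def trigger_test_alert(summary: str, source: str) -> None:
    event = {
        "routing_key": ROUTING_KEY,
        "event_action": "trigger",
        "payload": {
            "summary": summary,
            "source": source,
            "severity": "warning",  # keep game day alerts visibly non-critical
        },
    }
    req = urllib.request.Request(
        "https://events.pagerduty.com/v2/enqueue",
        data=json.dumps(event).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req, timeout=5)


trigger_test_alert("Game day: payments dependency black-holed", "checkout-api")
```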
Then what happens? Are folks being notified appropriately? Have they actually gone and put their contact info into the platform?
Are you coordinating in a team chat channel? Are they on a conference
call? How does everyone find out all that information?
Especially if you've got folks new to your team or you've changed
platforms, take the opportunity to get everyone in
line to practice how to find all this information.
Finally, think about your troubleshooting before you're in a real failure.
Right? It's essential. Do folks have access to all the dashboards
they need? Are they hidden behind a password? Are they locked down somewhere?
Can responders access the hosts or repositories, configuration
files, whatever else they might need to mitigate a real
incident? You ought to practice that. Right? Has everyone been
added to the management tools for your platforms? Do they
know how to use them? Can they do restarts? Can they do scale ups?
Any of those sort of basic failure scenarios are
super helpful as part of your game day to provide
valuable experience for new team members.
Decide when you're going to call the experiment over, put an end to it, right? It might be time-bound: we're going to
look at this particular scenario for ten minutes, or it could be
more flexible. When we feel like we've learned what we wanted to
know, we'll turn it off. These scenarios can be super quick,
especially if the systems are already
well defined, right? And if
you use a lot of defensive practices, right, you've got graceful degradation and other
things in place. So you may only be testing a little
bit of that defense, maybe focusing on a new feature or something like
that. And that's totally fine. They don't have to be blown-out, "we take the whole site down for a while" sort of practices.
It's fine to practice in smaller components.
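Whatever the size, it helps to wrap the experiment in a time box with an abort condition. Here is a minimal sketch where get_error_rate() is a hypothetical hook into your own metrics backend.

```python
# Sketch of an experiment guardrail: run for at most MAX_MINUTES, but stop
# early if the error rate crosses a threshold. get_error_rate() is a stub;
# wire it to your own observability stack.
import time

MAX_MINUTES = 10
ERROR_RATE_ABORT_THRESHOLD = 0.05  # abort if more than 5% of requests fail


def get_error_rate() -> float:
    # Placeholder: query your metrics backend for the current error rate.
    return 0.0


def run_guardrail() -> str:
    deadline = time.time() + MAX_MINUTES * 60
    while time.time() < deadline:
        rate = get_error_rate()
        if rate > ERROR_RATE_ABORT_THRESHOLD:
            print(f"Error rate {rate:.1%} exceeded threshold; aborting experiment.")
            return "aborted"
        time.sleep(30)
    print("Time box reached; ending the experiment as planned.")
    return "completed"
```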
We build our game day with our
goals in mind. We have this general goal, the overarching goal,
improve the reliability of our service. But to get the most out of the
game day, we want to set something specific so that we can concentrate
on it. Maybe we have some code that we've
introduced to fix something that happened previously.
Right. We want to make sure then that as we introduce that code, we have actually fixed the thing. Right. And that's hard to do in a
staging environment, especially if it's reliant on
production level load. Right. Do we need to
test how a new database index impacts a slowdown? Are our users reporting that the sign-up flow is slow, but you're not sure why and you can't really catch it in staging?
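One way to reproduce that kind of slowness in a controlled test is to inject artificial latency in front of the dependency. Here is a minimal sketch using Linux tc/netem; the interface name and delay are placeholder values you would tune for your own environment.

```python
# Sketch: inject artificial latency on an interface with tc/netem to
# reproduce a slow dependency in a controlled way. Assumes a Linux host
# with the tc tool and root access; IFACE and DELAY are placeholders.
import subprocess

IFACE = "eth0"    # hypothetical interface carrying the dependency traffic
DELAY = "200ms"   # hypothetical added latency


def add_latency(iface: str, delay: str) -> None:
    # Add a netem qdisc that delays all egress packets on the interface.
    subprocess.run(
        ["tc", "qdisc", "add", "dev", iface, "root", "netem", "delay", delay],
        check=True,
    )


def remove_latency(iface: str) -> None:
    # Remove the netem qdisc, restoring normal latency.
    subprocess.run(["tc", "qdisc", "del", "dev", iface, "root", "netem"], check=True)
```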
There are plenty of scenarios that you can be digging into using these kinds of practices. But be very explicit about what you want
to find. You can also focus on larger problems like
DDoS attacks or data center level failures,
depending on which parts of your ecosystem you want to investigate
and which teams are sort of involved in the game day practice.
There are benefits to practicing a range of scenarios across
your teams over time. And it's good, too, to mix it up, like having teams that own a couple of services practice on their stuff independently before introducing a larger cross-team experiment. So we're
going to set up our hypotheses. What do we expect to find?
If we already have defensive coding measures,
do they kick in? Is there a failover? Is there a scale
up that should happen? Is there some other automation that's going
to take care of some things? Maybe we expect the whole thing
to fall over. That's fine. It's a place to start improving
from, right? But definitely set those initial hypotheses.
You have assumptions on how the systems will behave, so get
them down in writing so you can use them as a baseline for improvement.
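A lightweight way to do that is to capture each experiment in a structured record that travels with the game day. This is just a sketch; the field names are suggestions, not a standard.

```python
# Sketch of a structured game day record so the hypothesis, observations,
# and follow-ups live in one place.
from dataclasses import dataclass, field
from typing import List


@dataclass
class GameDayExperiment:
    service: str
    scenario: str                 # e.g. "black hole the payments back end"
    hypothesis: str               # what we expect to happen
    start: str = ""               # timestamps filled in as we run
    end: str = ""
    observations: List[str] = field(default_factory=list)
    follow_ups: List[str] = field(default_factory=list)  # backlog candidates


experiment = GameDayExperiment(
    service="checkout-api",
    scenario="black hole the payments back end for ten minutes",
    hypothesis="checkout queues payments and serves no 5xx errors to users",
)
```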
Afterwards, we're going to
talk about what happens, so make sure you're recording it. Save your charts,
save your graphs, save the list of commands. This is also helpful
in real life incidents so that you are
collecting information to run your post incident review.
Think about not just the things you first looked at,
but other related information that might help in the future.
This is another part of your practice that's going to help you build up a
better incident response practice, and especially if you take the time
to write a post game day review. It's going to give your team
a place to organize their thoughts and then improve the practice
for the next time. Say, well, we thought we were going to be relying
on this particular piece of information. Turns out that metric wasn't as helpful
as we thought. But here's this other thing that we're monitoring over here
that was actually much more helpful. So the next time you have a production
incident relating to that service, you can go right to the one that
you learned about, right? So talk about what you learned.
Right. We went to all of this trouble. We did this planning. It's on the failure Friday calendar. We put all this stuff together,
we planned all these scenarios. Then we want to talk about what we learned. We want to share it with other teams in our organization who can benefit from it, so that everybody is gaining
knowledge through these practices. We talk about
our improvements. We're going to put the things that we learned to good use.
Improving the reliability and resilience of
our system is our requirement.
We're going to balance non feature operational improvements with feature
work in order to provide the best experience for the users.
So the findings from your game day might generate work
that should go into the backlog for your service to
improve it over time. You might want better error messages in the application
logs. You might want fewer messages
in the application logs, right. You might need a new default for
memory allocation or garbage collection or other kind of subsystems.
You might need better timeouts or feature flags or other
mechanisms for dark launching. Whatever it is, any number of new improvements
might be uncovered in this practice. So don't abandon them.
Get them documented and into the planning. And over time that's going to
help you work around all of these processes.
As your team gets comfortable with defensive resilience techniques,
it's going to be easier to use those practices regularly when
you're developing new features, and then they're
going to be in there from the beginning instead of waiting for test day to
sort of unveil them. Right. You can use these experiments to create
new best practices, common shared libraries and standards
for your whole organization.
So we get some questions when we talk to customers about these
kinds of practices. And one of the big ones is should you run a surprise
game day? And I know it
seems like that's a thing you should do, but often
you probably don't want to, right? Maybe, we'll say, once you have your patterns and practices well honed. We've been talking
about being deliberate and explicit about what
our goals are for running a game day in the first
place. If you're looking to run a surprise game day,
be really clear about what you're hoping to accomplish.
Know that there is much more risk for this kind of testing when folks aren't expecting it. It can feel more real, I guess you could say, to run surprise game days; we don't know when real incidents are going to happen. But if your team isn't consistently
getting through planned testing scenarios, a surprise isn't
going to magically make them better. It actually could have negative
effects on the team and how it works together. So be
very careful about deciding to
run a surprise game day. Not everybody
likes surprises, for sure. And then, when should we game day? When's a good time to
do this assessment of our reliability
and resilience? The truth is kind
of anytime, right? Especially if you're in a sort
of distributed microservices environment,
right? Teams that have ownership
of singular services can probably run
a failure scenario across the services at any time. But also
you want to think about doing it when things have been going well,
right? You don't want to run production environment game
days when you've already been blasting through your SLOs on a service. If you've had a lot of incidents already, maybe some downtime, and your users are already unhappy with the reliability of the application, doing production testing isn't a way to make them super happy, even if you mean well and even if your goal is improvement. You don't want to blow through the rest of your error budget on testing; you might need it for your real incidents. So look for times when you can do shorter, focused tests when things are calm and running well, right? The bottom image is an example from our bot that's notifying folks that someone on the team is running a failure Friday exercise. It lasts from 1:07 p.m. to 1:31 p.m. It's a short, focused experiment that can give you a lot of information, and it's in our chat. You can follow the channel and figure out what that team was doing, and maybe that will help your team as well.
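If you don't have a bot like ours, a simple webhook post to your chat platform gets you the same announcement. Here is a sketch against a Slack incoming webhook; the webhook URL, service name, and times are placeholders.

```python
# Sketch: announce the game day window in chat via a Slack incoming
# webhook. Any chat platform with a webhook works the same way.
import json
import urllib.request

WEBHOOK_URL = "https://hooks.slack.com/services/T000/B000/XXXX"  # placeholder


def announce_game_day(service: str, start: str, end: str) -> None:
    message = {
        "text": (
            f"Failure Friday on {service} from {start} to {end}. "
            "Follow this channel for what we're testing and what we learn."
        )
    }
    req = urllib.request.Request(
        WEBHOOK_URL,
        data=json.dumps(message).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req, timeout=5)


announce_game_day("checkout-api", "1:07 p.m.", "1:31 p.m.")
```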
But make plans to run some chaos testing against new features, especially after they've had some time to burn in.
You want to have an idea of the baseline performance before you go
injecting new faults into it. So work on setting that good practice: as you introduce new features and put new things into production, you're also going back and looking at the impact there and digging into those tests and additional performance issues.
The converse of that is: when should you not game day?
And this is going to vary in different organizations, but there's a
couple of things that we encourage people to keep in mind. Don't game day right
after a reorg. We know big organizations
like to reorg from time to time, but give folks time
to get acclimated with the services,
acclimated with the teams they've been assigned to.
Don't wait until your busiest season and then game day; you're going to plan ahead. You know stuff is coming,
so keep those two things in mind: not right after a reorg and
not right at the beginning of your busiest
time. I'm looking at you, retail. You guys have been planning this since June. You know your biggest season is coming.
You want to be doing your practicing earlier, not later.
So reconsider spending your time doing game days
if your business partners and product managers aren't on
board with taking input for the backlog based on what you learn. A game day can still have value if it's just for your team to practice, but you won't be getting the full value of the exercise, right? So you want to keep all those things
in mind as you do more of these,
and even as you do more small ones,
your team gets used to all of the components
that are important for your incident response process.
They are going to know where all the dashboards are.
They're going to see how your chat bots and your other
automation operates and know where to log into
the conference call, what channel to follow in
the chat application. All of those things that,
if they're not used to it, are even more difficult when people are stressed
out during a real incident. So you have this
wonderful opportunity to sort of get people in the mindset
that we're worried about our reliability.
We want to work on this and here's how our practices go.
So to summarize just a little bit for you: one, have a plan. Think about it in advance. Don't just do it, right? You want to say,
hey, we have introduced a new table into
the database. We want to do some failure Friday scenarios around
the performance of that. We have put a new dependency on the back end.
We want to make sure our defensive coding is okay.
We have put some new features in and we want to make sure
all the logging works so that folks know where to find information when things
go wrong. Any kind of new feature,
new code improvement, whatever it is,
what does it look like when it gets into production? There are going
to be things that you can see when you get to staging, and that's fine.
But there's also going to be things that you're not going to be fully comfortable
with until you've seen them actually perform in
prod. And you need that data, you need that user flow, you need all
that activity going on to make sure that you've done what you
thought you were going to do. So you're going to be intentional
about all of these things. These aren't an accident, they aren't a surprise.
They are a way for your team to say we're
going to improve our reliability via X, Y, and Z. We're going to
improve our incident management practices via additional
practice and workflow and give us
an opportunity here to get better at all of the things that are going to have a downstream effect on our incident response and our overall reliability. And then
we're going to use what we learn. There's plenty of things that we can learn
about the performance of our non-feature work, all of our operational requirements, whether they are database
indexes or timeouts or red button or
whatever you're doing with your services and your reliability there.
We want to use all those things. So make sure that everyone on
the team is on board with taking those lessons and internalizing
them. So, a couple of resources. For creating effective game day tests, this is a really nice article on what they've worked on. The second one is from the folks at Azure, Advancing Resilience Through Chaos Engineering and Fault Injection, another really good article to read if you're kind of new to all this and thinking about what you might want there.
If you want to learn more about incident response
methodologies and how to handle that with your team, you can check
out response.pagerduty.com. And just for fun, we have a podcast called Page It to the Limit, and we'd love to have you as a listener there. We cover incident management, but also all kinds of other things. So if that's interesting, throw it in your favorite podcatcher. I hope you enjoy the rest of the
event. And thanks for coming to my session.