Transcript
Sometimes being the last talk... actually, being the last talk is always super
dope, but for different reasons, right?
The best thing about being the last talk is you can kind of tie things
together. I can listen to all the other talks that happened
throughout the day and then tell the story.
The other thing is, sometimes all the things you wanted to say have already been
said. So, talk's over. We're done. Bye. Let's go. No,
no, seriously. I'm going to try to be a little flexible
in how we go, based on the ideas we've already talked about.
So I'm going to take this opportunity. I've got some...
that's weird. Well, sorry. On that screen,
I just noticed that you can't really see the top. All the really good
information is at the very top of the slide, so you're missing
all the real thought leadership over there.
This talk is called The Psychology of Chaos Engineering,
so it's thinking along the lines of a lot
of the human factors that come into play. My name is
Matty Stratton. I'm a DevOps advocate for
PagerDuty. I'm not a huge fan of resume slides,
but I really like this one, so you're just going to have to sit through
it. It's cool. So anyway, I work for PagerDuty. How many
people here have heard of PagerDuty? You already heard about it in some earlier slides
today, too. So if you didn't raise your hand, you're lying. Cool.
That's as much as we're going to talk about PagerDuty right now. Besides, it's at the
bottom of every slide. I do a podcast called Arrested DevOps.
We're one of the longest running, still running DevOps podcasts.
If you're a listener and you want a sticker, come see me later. I've got
lots of stickers. I also have cool PagerDuty stickers, too, because DevRel
life is giving away stickers. That's my real job: sticker engineer.
I founded DevOpsDays Chicago, which is going
to be in September. So if you're into DevOps and you're going to be in
Chicago, you should come to it, and we'll talk about it later. And I help
run DevOpsDays all over the world. And this
used to be my license plate, so one might say I'm
invested in DevOps. But then, DevOps is like ten years old. It literally
is, right? Like, the first DevOpsDays was ten years ago. I want
to be cutting edge and everything, so I actually had to get a new license
plate. So this is my current license plate. That's true,
by the way; that is my car.
So I might have $200 more money than sense,
because that's what a vanity license plate costs in Illinois.
But in reality, what I like to do at this part of
the talk is kind of get a level set and get
some agreement so that we're speaking a
similar language. And in the case of this being towards the end of the day,
we've seen all these talks. Fortunately,
the level setting that I wanted to do has already been done.
Right? Nobody really said anything today that concretely disagrees
with anything I was going to say, so thank God, because that would have been
awkward. But there are a couple of things that I do like to
kind of stress, and you may find... So the nice thing about doing
this talk at the end of the day, especially at this conference, is that the
things that you already know, those are review. This is all
on purpose, right? We're tying it all back together.
So, first of all, I don't always give this talk
at chaos engineering conferences, so sometimes I have to tell people what I
think chaos engineering is. And by "I," I mean somebody else's definition that I like.
This is from Principles of Chaos. But really, there are a couple of things
about that definition that I always like to bring up. We talked about experimenting;
I think we're all pretty much on common ground thinking about that, and we're looking at
building confidence to be able to withstand those turbulent conditions.
Right. You'll notice there was nothing in there about prediction.
So that's just kind of my little thought. Now, this is
an old definition; we can go back here. Like I was just saying,
it's almost ten years ago. This was from the Netflix tech blog
talking about Chaos Monkey. But there are three things I like
to pick out of this definition that are
generally interesting. And again, think about it: at some conferences where I give
this talk, this is a brand new idea. Here, we're reviewing; we're getting on the
same page. Right. So they're running
experiments in the middle of the business day. And this is
similar to what we do at PagerDuty. We run our Failure Fridays during
the day. They're run at lunchtime, Pacific time, because frankly,
there's no good time for PagerDuty to be down. If there is any
good time for us to take any kind of an incident, it's during the day
in San Francisco, when most of the people are in that particular
office. People in our Toronto office will
tell me that, no, the best time is during the day in Toronto, because Canadians
are just as important as Valley people. And that's totally true.
A carefully monitored environment is a big key part of this, right?
And then again, having your engineers standing by. So this is all
stuff that, hopefully, in this room we consider table stakes.
If any of this comes
as a big surprise to you right now, you've probably been sleeping all day,
right? Which I totally get. So, cool. But: perceptions.
So here's the thing. We just talked about what we understand this to be,
but there are different perceptions around what
chaos engineering is, what it provides, and also what's
involved. And sometimes people say perception is reality.
What you're trying to do is, most likely, effect change in an organization, so perception
super matters. It's not about being right. That's what Twitter
is for. When you're doing change inside your organization, it's about understanding
the perceptions. So this
has been alluded to before, but this drives me up a freaking wall.
Anytime I try to go somewhere and collect information about chaos
engineering, several people think they're the most clever person
in the world, that they'll be the first ones who ever made this joke,
right? They're not. It's also the people who come up to the PagerDuty
booth and go, "Ha, I hate you. You wake me up in the middle of
the night." We're like, clever. Never heard that one before. So this is generally my
response to that: first of all, you're wrong,
right? That's why it's on social media. But also, it's not even clever.
So that's the other thing. To sort of paraphrase Jerry Seinfeld:
it offends me as a chaos engineering
advocate, and it offends me as a comedian, because it's
not even funny. If you're going to troll, at least be funny.
Okay? And again,
it's not about breaking things, right? Our intent is
not to actually break something, to try to
push it to its limit and go, haha, I figured out how
to break your system. That's a different thing. And if you don't believe me,
believe Silvia. Because if you know anything about Silvia Botros from
SendGrid, she is the expert at breaking things. And if Silvia tells you
it's not about breaking things, then it super isn't. So that's got to be true.
Look, I know you know this, right? So why am I bringing
it up? One is, it's kind of a nice way to round out the day.
The other is that we have to
continually remind ourselves of these points and
these principles, because almost by virtue of
you sitting in this room, you're at a certain level of understanding
of these practices, of this field, and of the first principles surrounding
it. The people you're working with in your organization
may not be. And it's very easy, as we become
further advanced in our understanding of something, to kind
of forget where we came from,
or not even necessarily where we came from, but where people at a different level
of understanding might be. Right. So, again, I know you
know this. I'm going to say it anyway. The good thing is
this just sort of blends right into the end of the day, so this is
almost just like bullshitting at the bar. We're kind of at a bar, so it's
cool, right? All right. So like I said, you know this: they're experiments.
I think that's a really key thing, and it's a really key way
to help with that understanding. We've talked about hypotheses,
right? And our hypothesis should
be that if we do this thing,
if this condition exists, it will still work.
If my hypothesis is, "if I shut down this node,
everything's going to go to shit," I probably shouldn't run
that experiment, right?
Again, we're testing out assumptions and
hypotheses. Now, if your hypothesis is that everything will go terribly, then, yeah,
maybe you still want to run it, but you definitely should run that in a
lower environment. That's a whole different talk, about the myth of staging,
so we're just not going to talk about that right now. That's cool, right?
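To make that "hypothesis first" idea concrete, here is a minimal sketch in Python of how a steady-state hypothesis can gate an experiment. Every name in it (check_steady_state, inject_failure, rollback) is a hypothetical placeholder for whatever your monitoring and tooling actually provide, not any particular product's API.

```python
# Minimal sketch: only run the experiment if the hypothesis is that the
# system will keep working under the injected condition.
# All helpers here are hypothetical placeholders.

def check_steady_state() -> bool:
    """Return True if key business metrics look normal (placeholder)."""
    return True

def run_experiment(inject_failure, rollback, hypothesis_holds: bool) -> None:
    if not hypothesis_holds:
        # If we already expect things to go badly, that's a known weakness,
        # not an experiment to run in production.
        print("Hypothesis predicts failure; fix it (or try a lower environment) first.")
        return

    assert check_steady_state(), "System isn't healthy before we even start."
    try:
        inject_failure()   # e.g., stop one node
        assert check_steady_state(), "Hypothesis falsified: time to dig in and learn."
    finally:
        rollback()         # always restore the injected condition
```

The point is the shape of the check, not the specific tooling.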
So, again, taking a scientific approach.
I absolutely loved that convergence and divergence
thing that Adrian talked about, and I'm going to steal it for the next iteration
of this talk that I give when Adrian hasn't given that talk right before it.
That way, I will look like the clever one. I may
actually attribute it to Adrian. We'll see how that goes. So you're
like, "We know this, Matty. We got it, right? And why does it
matter?" Because how we talk about things matters.
And this is a little tricky. Sometimes words
matter, and sometimes they don't, right? Getting nerd-snipe
points on Twitter for someone using a word wrong, just
to be right, is not helpful. So sometimes a word matters,
and a word matters when it affects how we think about something.
So I'm going to take a couple of examples, not directly
related, to illustrate that point so we can see how it applies.
Before working at PagerDuty, I worked
at Chef. I come from an infrastructure-as-code background, so automate all
the things. That's amazing. But the
components that make up Chef code, one of
the elements of those is called recipes. Because of course they are, because Chef.
By the way, if you hate food puns, you definitely should go use Puppet or
Ansible; don't come to Chef, because Chef is a t-shirt company that
also sells software, and we also make a lot of food puns.
Oftentimes, instead of calling it a recipe,
people, the customers I was working with, the users trying to adopt
this, would talk about Chef "scripts." And that's
one that I would correct gently and with an explanation. Not because
I'm like, "Oh no, it's a recipe, because Chef and food and blah, blah,
blah." It's because it's a different way of thinking about what that actual application
of a concept is doing, right? A script is iterative,
going through steps. It's a stepwise thing. A recipe,
well, maybe it's not a perfect analogy, but the point is not that "recipe" is
good and "script" is bad; it's that "script" is not the way we want to think about it,
so we're going to use a different word.
There are other things that I choose not to get pedantic about.
I work at PagerDuty. We talk about postmortems a lot, because incidents,
right? So how many people are aware that there are some people who
don't like calling them postmortems, and for good reason?
Right. But fundamentally,
calling it a post-incident review or a retrospective or an after-action
report or whatever is not inherently changing how you think about
it. You can have a very good reason for not wanting to call it a
postmortem, but it's not related to a
change in the behavior of how you do it.
How many of you follow me on Twitter? It's okay if the answer is no.
But if you do, you can probably take a guess as to what the next
term is that I'm going to say I'm picky about, and that's "root
cause," right? Here's the reason. First of all, I'm not collecting
nerd points here. If you want to call it root cause, that's great. You're a
fine human being and I love you. But here's why I think
"contributing factor," why this one, matters: because it changes how we think
when we use that word. It makes us think about a singular cause,
which in complex systems is not there. So my whole point of this
is not "stop using the words root cause." By the way, I stopped
using the words root cause. It's about the words we're going
to use when we're trying to effect change using chaos engineering within
organizations. Right.
The thing is, people get nervous. It's kind of how we live. It's kind
of how we've survived as a species: because we're worried about
risk. If we weren't worried about risk, we would all have been eaten by antelopes.
Well, maybe not antelopes. I don't know, I'm not good at animals, but something bigger
with teeth. So risk aversion is kind
of baked into us, and we're going to get nervous about things that seem
to imply additional risk.
Right. You're going to do what in production?
How many people think about folks inside
your organization whose entire focus,
the lens by which they view
their relationship to your company's business,
is mitigating risk? There are a lot of people for whom that's their
job, and that's great. If
you come to them and say, "I would like to do this
thing," and they hear it as "I would like to create more risk,"
the conversation is now over. The irony
is, we're all sitting here going, "But that's
ridiculous. Chaos engineering helps lower risk.
It's good. It's blah, blah, great." And they will love that once they understand
it. But you want to be able to have that conversation.
So Adrian talked about that a little bit before, right? And again, this is beautiful.
Everyone's already given my talk, but this is why it matters.
And when we're thinking about mitigating
risk, I also like to say: again, use your monitoring like it's for real.
We've already had conversations around this. For your chaos
experiment to be successful, and by successful I don't mean proving
your point, I mean not actually getting you fired, you need
to be looking at the impact to production. You need to be monitoring like it's
for real, because it is. Another way to
think about this, and we've seen some examples: at PagerDuty,
we run our Failure Fridays like a regular incident. We actually
start incident response as part of the failure experiment,
and there are two reasons for that. One is that we're already tuned up for it.
The other is that we believe very strongly in always practicing incident
response, and a failure experiment is a really great
way to normalize the practice of incident response.
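As a rough illustration of "start incident response as part of the failure experiment," here is one way that could look in Python, triggering an alert through PagerDuty's public Events API v2 before anything is injected. The routing key and field values are placeholders; substitute however your team actually kicks off incident response.

```python
import requests

# Placeholder integration key; this assumes a service configured with an
# Events API v2 integration in PagerDuty.
ROUTING_KEY = "YOUR_INTEGRATION_ROUTING_KEY"
EVENTS_URL = "https://events.pagerduty.com/v2/enqueue"

def open_failure_friday_alert(experiment_name: str) -> None:
    """Trigger an alert so incident response starts alongside the experiment."""
    event = {
        "routing_key": ROUTING_KEY,
        "event_action": "trigger",
        "payload": {
            "summary": f"Failure Friday starting: {experiment_name}",
            "source": "chaos-experiments",
            "severity": "warning",
        },
    }
    response = requests.post(EVENTS_URL, json=event, timeout=10)
    response.raise_for_status()

# Example: open_failure_friday_alert("stop one node in the notification pipeline")
```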
This is a little different from what Ross was talking about, where, like, a fire drill
makes you complacent. The point of this is just being
used to the motions, right? We do want
to try to reduce some of the stress. So, for example, the way we train
incident commanders: if you want to be an incident commander at PagerDuty,
before you go on call as an incident commander, you have to have run a
Failure Friday, because that gives you an opportunity to
practice under a relatively low-stress situation.
The reason I'm bringing this up is that we might take
this and say, "Oh, well, then a failure
game day is a great way to practice a stressful situation.
It's a great way to practice stress. It's a great way to practice trouble."
No, it's not. Don't do that. Because the first thing is, your people
don't need training and practice in being under stress.
They get plenty of that already. What we want to do is the opposite.
Because we have some insight into what's likely
to happen, or at least into the systems that are affected,
and we know what we actually acted upon,
it's a really great way to practice and go through the motions, which builds kind
of a physiological memory. So, you know, as Pagey
says, something's broken, it's your fault. In this case, it actually is.
You can be a little blameful on chaos days,
but it's good blame, right? That's cool.
Okay, so what about the people? We've talked a lot
about tech. We've talked a lot about the systems, the technology, the providers,
the Terraforming, the Kubernetes. That's all the fun stuff,
right? It's also the easy stuff. Then the human pieces come in.
So when we think about the people that are involved, there are lots of
people that come in. There are your employees,
right? There are your delivery teams, the ones responsible for the systems that we're
running experiments on. There are the people that are engaging in the experiment.
And, I don't know, you might have some customers or
users that might be wanting to use these systems. Those are people. They're involved.
Right. So I always kind of ask this question:
how does it make you feel to know that someone like Netflix
practices these principles? We're actually pretty down with it,
right? We're like, "That's cool, man. They're Netflix. They're DevOps and all that, you know."
And that's cool. So I know that I can get
my new movies, and I can binge all the things
and whatever. Yeah. But what about your bank? Again,
everybody in this room buys into this, and that one
still made you go, hmm. Right.
But we totally know why it matters. And the thing is,
the blast radius is what matters.
I was pleased to see that Adrian also goes to Twitter like I do.
So I ran a very scientific survey and asked: if you discover
a service you consume uses chaos engineering in production, do you feel reassured
or uneasy? And most people said they were reassured. This is not scientific at
all, by the way, and there's also a little selection bias,
but graphs mean results. So I had to
add some data, and a little more data, such as it
is. I did a few surveys around words that people might
use. I thought it was interesting to ask what words describe your
personal feeling towards the use of chaos engineering on your team.
A lot of people were optimistic, but there was quite a fair amount of
"uneasy" and "cynical" and everything. And then this is
what tells you everything about engineers: when we flipped it
and asked, "What about products you use?"
they were like, "Oh my God, that's fantastic. But not us, because we're terrible."
Right. So I thought that was really interesting. But one of
the things that happens is that when there's an understanding of
the effects, this can actually have a really positive effect on your delivery
teams, because we feel more comfortable making changes
and we have greater trust in the system. But we
have to understand it, because it has the opposite effect if we don't understand it,
right? When someone thinks
it's about breaking things on purpose and all of that, it's actually going to reduce
their confidence; it's going to make them feel uneasy. Not so when there's a greater understanding.
So this really boils down to education being helpful. And we
talked about people getting nervous. Management can get very nervous.
And we think about considering our words.
And this is usually the part where I say
I don't have a great suggestion for what you should
call it, "business failure injection" or "chaos
engineering," because it might be different. Fortunately, a bunch of
people today have already given you a bunch of really good options.
And the thing that's important is that accuracy is not the most important
part. So, like, Ross talked about "system verification,"
and the nerd snipes in us went, "Well, technically, it's not system verification,
because that's a formal process." It doesn't matter. You're talking to your
CFO, right? Because all you actually want to do is get in the
door to talk about it, right?
Have that conversation. I like what Adrian says: just call it engineering.
But that doesn't work as well when you're trying to explain a new practice.
So I invite everybody to kind of think on this, and I'd love to talk
about it if people have examples or ideas,
but it will vary depending upon your organization.
Right. And really it comes down to the understanding of
the philosophy. And when you're trying to bring people along for a ride,
you want to be somewhat like-minded. Right.
So I really like this one. Cody doesn't really use chaos
engineering fully, but has had some interesting failures from
interns, which I guess is some form of chaos engineering.
And he says he now honestly feels excitement when
confronted with a new error that hasn't occurred before. This is really a
lot more about learning from incidents, but those things have to
be coupled. And learning from incidents
is a thing that we're all not very good at. By "all," I don't mean
all of us are bad at it; I mean few of us are good
at it, because we usually look at
incidents as something to be avoided. Again, we don't encourage
them. We don't want to say, like, "Boy, I sure wish that I had
tons and tons of incidents."
Although, one potentially controversial way to put it, it's been said:
hey, you want to get better at incident response? Have you tried having more incidents?
That immediately sounds funny, but actually the way it's phrased is: scope
more things to be considered as incidents, and you will practice more.
And so incidents are a gift. If that's a little hard
to swallow, maybe you say incidents are unplanned investments.
But if we're not focusing on being a learning culture,
we're not going to get a lot of value out of all the practices we've
talked about today. It's about the learning. And that's why I thought it was so
great that we had a talk about postmortems as they apply to
that. We need to run all the things we would do for an incident on
our experiments as well.
But the only reason that makes sense is
if your after-incident review, your PIR, your postmortem, is focused
on learning. If the whole reason you do a postmortem is
to write down the root cause, it doesn't make any
sense to do one after a chaos experiment, because you already know what
the root cause was: we turned off this thing. So those two
things, I think these practices are so well coupled, which is why,
in a lot of people's minds, they all kind of run under resilience engineering
practices, right? Learning from incidents. These things are all loosely coupled,
right? One doesn't automatically give you the others, but they all end up being connected,
because they all come back to us wanting to learn, to have
better understanding, and to be able to reason about our systems more
broadly.
I'm going a little quick. Part of it is because we're getting to the end
of the day, and I know we started running behind, and I'd love to have
a little more hallway track and everything like that. But there are just a couple of things
I'd like you to consider. Again, safety first, right?
And this is in the Safety-II idea of safety, but safety first.
We want to think about all sorts of things we've talked about today, about minimizing
blast radius, about making sure your responders... by the way,
speaking of one thing I forgot: was I the only one who was waiting
at the last slide of Muri's talk for where the git
repo was for Gabetta, and they're like, "Oh, no, we don't have that yet"? Because
I want it. Right. So that was exciting to me because of,
again, the thoughts around the whole experiment, right?
So a couple of things I think are very key
to keep in mind. These may seem like table stakes, but to
use a terrible analogy: I used to do swing dancing, and we
used to say about dancers that beginning dancers
take intermediate classes, intermediate dancers take advanced classes,
and advanced dancers take beginner classes. So it's always helpful,
as much as we're big experts in this room, to go over a couple of these things.
If it's common sense, great, it's a good review. If it's table stakes,
I assume you're already doing all of it. But know your conditions for when
you shut down the experiment, and that means knowing what your key business
indicators are, right? You're not shutting down the experiment because
SQL Server is suddenly using too much memory. You're shutting
down the experiment because your average cart size has dropped below where it's
supposed to be. Know which key business metrics are the ones where
you're just going to call it. So you need to know what those are.
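As a sketch of that point, here is what an abort check keyed to business indicators might look like in Python. The metric names, thresholds, and the get_metric() helper are all hypothetical placeholders for whatever your monitoring stack actually exposes.

```python
# Hypothetical abort conditions keyed to business indicators, not to
# low-level resource metrics like SQL Server memory usage.
ABORT_FLOORS = {
    "checkout_success_rate": 0.995,  # fraction of checkouts completing
    "average_cart_size_usd": 42.00,  # example floor for average cart size
}

def get_metric(name: str) -> float:
    """Placeholder: query your monitoring system for the current value."""
    raise NotImplementedError("query your monitoring system here")

def should_abort() -> bool:
    """True if any key business indicator has fallen below its floor."""
    return any(get_metric(name) < floor for name, floor in ABORT_FLOORS.items())

# Poll this regularly while the experiment runs, and roll back the injected
# failure as soon as it returns True.
```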
And again, we want to build
resiliency, and resiliency comes from people having adaptive capacity,
right? What we're not trying to do here is stress-test
people, even if Adrian is going to break the necks of your
key SREs. You're not trying to add stress;
you're trying to find challenges. So it
can be very tempting to look at a chaos experiment as a way to
test people's ability to troubleshoot, or
to simulate the stress of on-call. We want
to have transparency on everything that's happening. Never surprise anybody with your
chaos experiment.
Right. And at the end of all these wonderful numbers
about reducing MTTR, and the availability numbers, and
everything, there are always people. And the reason I bring that up is that
at some of the scales at which we work, a small
number is not a small number. We're like, "Oh, well, we only
impacted one tenth of a percent of our users." Okay, well, that might have been
10,000 people who just had a shitty hour because
we weren't really watching those key metrics. So on
the other side of all of your Grafana and all of this, there are humans.
And we always want to keep the humans involved and keep thinking about them.
It doesn't mean we don't take risks. But remember that at the end
of the day, it's really easy to look at a graph, but all those little
things at the end are most likely a person.
So trying to remember the humans is what matters.
So, not that these slides are terribly exciting,
but if you want to check them out, that's my speaking
website. The slides are there, along with some supporting links and some other articles that
I found interesting. There are a couple of articles linked in there that
I didn't talk about, about how we do Failure Fridays
and such at PagerDuty. That's not to say you should do exactly
what we do, but you're all interested in this domain, so it's more things to
know, right? And yeah,
if you like Twitter, that's where you can find me. I'm Matt Stratton, and Pagey
says: follow me.