Transcript
Thanks for joining my talk, Implementing Chaos Engineering in a Risk-Averse Culture. My name is Kyle Shelton and I'm a senior DevOps consultant for Toyota Racing Development.
A little bit about myself. I have a beautiful wife and three girls.
I'm an avid outdoorsman. I love to be outside, love to
fish, love to hunt, love to hike. I also train for
triathlons and do triathlons recreationally.
I'm an avid outdoorsman, but I'm also an avid iRacing and simulation fan. I love racing simulators. I just built a new PC recently and I've been doing a lot of iRacing. I also enjoy flight simulators and farming simulators, and one of my other hobbies is writing. I have a blog, chaoskyle.com, where I talk about platform
engineering, DevOps, SRE, mental health,
and things of that nature. So if you're interested in that, check it out.
In today's talk, I'm going to talk about the evolution of distributed
systems and what apps look like today versus what they looked like 10 to 15 years ago. I'm going to kind of go
over the principles of chaos engineering and how to
deal with that conservative mindset. A lot of places
you might work at maybe will laugh at
you if you ask to break things. So I'm
going to kind of go over what that mindset looks like
and how to sell newer technologies
and practices like chaos engineering. Also talk
about the art of persuasion and gaining buy-in. And then
I have a case study from my days as an SRE at Splunk that
you can use as a reference point on how to ask for
money to do a lot of tests. And then we'll close out with tools
and resources, and then I'll get into Q&A.
So let's start by talking about the evolution of systems architecture. When I first got started in
my career, I was working for Verizon and we were deploying
VoLTE, which is voice over LTE. It was the world's second, and the nation's first, IP-based voice network, and our deployments looked like this. We had what we called network equipment centers, our data centers, and we had our core data centers all throughout the country: New Jersey, California, Dallas, Colorado. And we would schedule six-week deployments where we would spend four weeks on site,
racking and stacking the servers, getting everything put together,
installing the operating system, installing all of the application software, or as we called them, guest systems, and then bringing these platforms onto the network and eventually getting subscribers making phone calls
through them. And the process was long, and we had to build out our capacity in that 40/40 model, where we would only build out to 40% of our capacity. Everything was HA, obviously, so we always had two of everything. That way, if we ever had to fail over from one side to the other, the surviving side would never be at 100%, it would only be at 80%. So there's a lot of overhead, a lot of costs.
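As a rough sketch of that capacity math (the 40% figure is from the model described above; everything else is illustrative):

```python
# 40/40 capacity model: two HA sites, each built out to only 40% of capacity,
# so a full failover never pushes the surviving site past about 80%.
def post_failover_utilization(per_site_utilization: float, sites: int = 2) -> float:
    """Utilization of the one surviving site after all other sites fail over to it."""
    return per_site_utilization * sites

survivor = post_failover_utilization(0.40, sites=2)
print(f"post-failover utilization: {survivor:.0%}")  # -> 80%, never 100%
```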
We were really good about knowing what type of traffic
we would have. But this was early iPhone days,
so iPhone releases could really mean a huge spike
in traffic, and there was a lot that went into it. And now if you're deploying
an app, with the cloud native revolution, in order to get an MVP into your customers' hands, you can do that within 45 minutes, basically, for simple use cases. Obviously, you're not going to deploy a complicated Rails app
platform in 45 minutes, but you
have the tools available, using public clouds
or private clouds or whatever, to deploy virtualized systems with
all their dependencies quickly. And so with the new way of doing things, the new way of being able to deploy things fast, being able to scale and have elasticity, you only pay for what you use, right? If you've ever heard one of the AWS pitches, the big thing is they had all that extra capacity and they started selling it. And so you pay for what you use in these scenarios, with these new technologies. And this new way
of doing things brings a lot more aspects
of failure and a lot more things that can go wrong
based on abstractions and dependencies into
the system. And so understanding your failure modes, understanding what happens when things break, is crucial for modern day applications.
As Werner Vogels said, everything fails all the
time, and you have to be prepared for that. And another thing,
too, is you are abstracting some of the physical
responsibilities through the shared responsibility model. So something
can happen in a data center that you don't own that could affect your resilience.
And so that's another thing that you have to be prepared for. And so
those are some of the challenges of our modern day systems. And that's
why chaos engineering kind of came about.
When I think about what is chaos engineering in my mind,
you define your steady state first. If you are at a
point where you don't really know what a good day looks like, I would first figure that out. Get to your steady state. Get to where you know that your failure rate or your 4xx/5xx error count or whatever is at X, whatever a good day looks like for your distributed system. That's what's defined as steady state. So once you have that, then you can start to build hypotheses
around disrupting that steady state.
So you create a control group and an experiment group, and you say, okay,
if I unplug wire X, then Y will turn off, right? That's just a basic example of a hypothesis, but that's what it is: if I do something, X will happen. And then once you have that hypothesis written down, or put down in notes or whatever you use to document, that's when you come in and you introduce chaos in
a controlled manner. That's when you come in and you go unplug the cable or
you go pour water on your keyboard to see what happens. And after
you do that, this is a key point, then you're going to observe what happens,
and you're going to look at both groups and then evaluate your hypothesis.
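A minimal sketch of that loop, with a hypothetical error-rate metric and placeholder fault-injection and metrics-query functions (this isn't any particular tool): define the steady state, write down the hypothesis, inject the fault into the experiment group only, then compare it against the control group.

```python
# Minimal chaos-experiment skeleton: steady state -> hypothesis -> inject -> observe.
# Group names, the metric, and the threshold are hypothetical placeholders.
import random
import time
from dataclasses import dataclass

@dataclass
class Observation:
    error_rate: float  # fraction of requests returning 4xx/5xx

def measure_error_rate(group: str) -> Observation:
    # Placeholder: in reality you would query your metrics backend for this group.
    return Observation(error_rate=random.uniform(0.0, 0.02))

def inject_fault(group: str) -> None:
    # Placeholder: e.g. kill a replica or add latency, for the experiment group ONLY.
    print(f"injecting fault into {group} ...")

STEADY_STATE_MAX_ERROR_RATE = 0.01  # "a good day", defined before the experiment

# Hypothesis: if we kill one replica in the experiment group, the error rate
# stays within steady state because traffic re-routes automatically.
control = measure_error_rate("control")
inject_fault("experiment")
time.sleep(1)  # let the system react (much longer in practice)
experiment = measure_error_rate("experiment")

print(f"control={control.error_rate:.2%}  experiment={experiment.error_rate:.2%}")
if experiment.error_rate <= STEADY_STATE_MAX_ERROR_RATE:
    print("hypothesis held: steady state survived the fault")
else:
    print("hypothesis falsified: figure out what broke and fix it")
```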
And another mistake that I've seen people make is that they want to get into
chaos engineering without having a good observability
foundation. And if you can't put metrics
on your system, and you don't know how to
put a value or put a score on your system, then you're probably not
going to be very successful with chaos engineering. So I would also focus on that first. Defining your steady state and being able to observe that steady state is one of the prerequisites for chaos engineering.
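A sketch of what that prerequisite might look like, with made-up status codes and a made-up threshold: if you can't compute a number like this from your telemetry today, that's the place to start before any chaos.

```python
# Steady-state check to have in place BEFORE running chaos experiments.
from collections import Counter

def error_rate(status_codes: list[int]) -> float:
    """Fraction of responses that were 4xx or 5xx."""
    if not status_codes:
        return 0.0
    classes = Counter(code // 100 for code in status_codes)
    return (classes[4] + classes[5]) / len(status_codes)

# Pretend these came from your access logs or metrics pipeline for the last hour.
last_hour = [200] * 970 + [404] * 20 + [500] * 10

GOOD_DAY_THRESHOLD = 0.05  # whatever "a good day" means for your system
rate = error_rate(last_hour)
print(f"4xx/5xx error rate: {rate:.1%}")
print("at steady state" if rate <= GOOD_DAY_THRESHOLD else "not at steady state yet")
```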
There's a website, a really cool website, that the helpful folks at Netflix put out called principlesofchaos.org. And this kind of goes into detail on the principles of chaos engineering, what their north star was when they created this type of engineering. The big thing is, you know, you build your hypothesis around the steady state, like I talked about, and you're going to want to vary real-world events. So if
you search AWS service events or Google service events, they'll give
you a long list of the things that have happened, from the big S3 outage way back in the day to the December 7 outage. You can see what has actually happened in the cloud. So you know, okay, at minimum, this could happen, because it's happened before. You're never going to be able to plan for the unknown, obviously, but those are the biggest events your provider has had. Start there and work backwards. Use those published events as a starting point for what to
test on. Run your experiments in prod. You want to get
to where you're running continuous experiments in prod randomly,
automatically, without them having an impact on your customers.
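One way that continuous, automated experimentation might look, as a sketch (the target names, the time window, and the injection step are placeholders, not a real pipeline):

```python
# Automated, scheduled experiment instead of a human pulling the plug.
import datetime
import random

CANARY_TARGETS = ["payments-canary", "search-canary", "checkout-canary"]  # hypothetical

def in_experiment_window(now: datetime.datetime) -> bool:
    # Only run while engineers are awake and watching the dashboards.
    return now.weekday() < 5 and 9 <= now.hour < 16

def run_random_experiment() -> None:
    target = random.choice(CANARY_TARGETS)
    # In practice this would call your chaos tool or a pipeline job to, say,
    # terminate one instance in the target group, then verify steady state held.
    print(f"{datetime.datetime.now().isoformat()} injecting failure into {target}")

if __name__ == "__main__":
    # Wire this up to cron or a CI schedule so it runs continuously on its own.
    if in_experiment_window(datetime.datetime.now()):
        run_random_experiment()
    else:
        print("outside the experiment window; skipping this run")
```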
We have automation pipelines, and we have jobs that can run and test for failure within unit tests. There are also ways to just let it go in the wild, and when it runs, it runs. There's a lot of ways you can do this, but keep it automated. Don't have someone sitting there that has to go and pull the plug; make sure that it's automated and that you can run it continuously. And the last
thing is you're going to want to minimize blast radius.
You don't want your experiments to bring down production. You don't want your monitoring and your testing tools to cause large-scale events. So isolate your tests. I'll talk about this when we're talking about buy-in, but start small. Don't go in trying to just say, all right, we want to fail over a whole region. And don't go in with the mindset that you're going to break everything all at once, all the time. Minimize your blast radius, keep it small, and work from there. Iterate off the small wins so that you can get confidence from your leadership to run bigger failures and have bigger events, do red team events, things like that.
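A small sketch of what limiting the blast radius can look like (the fleet, the 1% cap, and the abort threshold are all illustrative): target only a tiny, capped slice of the fleet, and wire in a kill switch that stops the experiment the moment the steady-state metric degrades.

```python
# Keep the blast radius small: few targets, and an automatic abort.
import random

def pick_targets(fleet: list[str], fraction: float = 0.01, cap: int = 2) -> list[str]:
    """Choose at most `cap` hosts and never more than `fraction` of the fleet."""
    budget = max(1, min(cap, int(len(fleet) * fraction)))
    return random.sample(fleet, budget)

def should_abort(current_error_rate: float, steady_state_max: float) -> bool:
    """Kill switch: stop the experiment as soon as steady state is threatened."""
    return current_error_rate > steady_state_max

fleet = [f"web-{i:03d}" for i in range(400)]
print("experiment group:", pick_targets(fleet))             # e.g. 2 hosts out of 400
print("abort?", should_abort(0.03, steady_state_max=0.01))  # True -> stop immediately
```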
Let's get into the conservative mindset.
And we've all worked with those leaders or those
managers or principal engineers who have
that very conservative mindset of, oh, we don't want to do
the new stuff yet, or, oh, we don't want to
bring failure in. I thought our goal was to keep the app up
as much as possible. Why would we want to come and start breaking things?
And one of the key things that I've seen when
working with these types of leaders and individuals is that it's a risk-averse, bottom-line-first type of mentality. And I'm not trying to downplay the bottom line, right. I understand business. I understand that sales generate the revenue which keeps everybody employed, and that the
bottom line is very important. I want to stress
that right now. But I also know
that sometimes you have to make investments, investments in engineering time, tools, systems, platforms, et cetera, that initially you might not see a short-term return on, but their long-term returns are substantial. So be on the lookout for that type of behavior and understand, hey, everything basically revolves around the bottom line.
But if you notice that leadership has that bottom-line-first mindset of, well, we can't fail because it's going to cost us money, well, if you don't understand your failure modes, it's going to cost you a lot more money. So recognize that. Recognize that mindset and be able to counter those arguments when you're going to sell your case. Another environment like this was my early days at Splunk, where we had a NOC transition. We were transitioning from an overseas NOC to building out an internal team. And during that transition,
we got paged frequently. There were a lot of outages, there were a lot of things going on, and it sucked. I'm not going to lie, it sucked, getting paged at 2:00 a.m. every other night or constantly fighting fires. There was a lot of fatigue, and you could see that fatigue in some of the older engineers. And that was the whole basis of us bringing that NOC back in-house, because we weren't getting the support from our overseas team and we needed to get better. So notice that PTSD. If you're coming in and everybody's just kind of worn out from getting
paged all the time, that's something to look out for.
And then the third thing that I've noticed in these
conservative mindsets is slow. Everything moves slow. It takes
two weeks to get your change in front of a change board. Then it takes
another two weeks to get a maintenance window approved, and then you're six weeks in before you run your three lines of code, and if it breaks, then you have to wait another six weeks to get things in. Not being able to move releases fast and not being able to push your code fast is normally due to that conservative mindset. There are a lot of reasons it could be there, but it's slow and
nobody likes going slow. I personally hate going slow,
but slow is steady sometimes. And that might
be what works best for that team at that
moment. But your job is to speed things up.
That's what to look out for when you think you might be in that conservative mindset or conservative environment. Now let's talk about selling chaos engineering. And if there's a skill set that I'm
constantly trying to grow, it's my sales skill set.
I don't work in sales, right. But there are situations where you have to sell something, whether it's yourself for a job interview, whether it's a tool that you want because it's the next best thing, a big monitoring tool or something, or whether it's just that you want money for a soccer team. I had to ask an executive for money to start a soccer team for our office. And one of the things I mentioned was that support, sales, and SRE just don't get along. We all operate in our own little pods.
There's not a lot of communication and camaraderie.
And so I was like, hey, if we had a soccer team and
we could get people from all of these groups together on one mission to go have some fun, it'll help with the bonding and the office morale, and it'll help us work better together. And it worked. I got the money, and sure enough, it really helped the teams get along better and work together better, because there's just that team bond. When you're on a soccer team, you're all trying to beat somebody. We were playing the other tech companies in Dallas, and there's a little bit of pride, and you don't want to lose. Yeah, it was fun. I got to be friends with folks I would have never been friends with because I didn't work with them, and I actually met my
wife through that, too. And so being able
to sell is very important. Now, how do we sell
chaos engineering? Whenever I've asked for a new technology or something new from a leader, one of the biggest things I've been asked about is total cost of ownership, and how short-term investments turn into long-term returns.
Right. Let's say I
want this tool that's going to cost us $100,000
a year. Okay, well, what value
are you going to get from that $100,000?
How do I explain? Okay, well, this tool helps
me see things before the customers do.
It helps me fix things before they break and page out our on-call overnight, so our engineers get more sleep, so they come in and they're not having to flex their schedule because they were paged the night before, and there's not going to be that falloff in productivity. And we can start to create automation so that we get to the point where the system is just healing itself. And so how do you put a number on that? Well, you define total
cost of ownership and explain that, hey, we might be saving money by going open source and hosting things ourselves on the initial software spend, but if we look at the spend overall from our SREs and architects and developers and everybody that has to put more of their time into that platform, okay, that $100,000 investment in the managed tool is going to be less than the $400,000 you're spending on engineering time. Understanding total cost of ownership, and understanding how to explain total cost of ownership, has been the biggest way I've been able to sell things, to ask for more money, to sell a tool or a system or something like that. So understanding your total cost of ownership and how to explain it is crucial.
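As a back-of-the-envelope version of that comparison, using the rough $100,000 and $400,000 figures from the talk (the hourly rate and weekly hours below are assumptions for illustration only):

```python
# Rough TCO comparison: managed tool vs. "free" self-hosted open source.
ENGINEER_COST_PER_HOUR = 100        # fully loaded hourly cost (assumed)
HOURS_PER_WEEK_SELF_HOSTED = 80     # SRE/dev time spent running the DIY stack (assumed)

managed_tool_per_year = 100_000
self_hosted_software = 0            # nothing on the initial software spend
self_hosted_engineering = ENGINEER_COST_PER_HOUR * HOURS_PER_WEEK_SELF_HOSTED * 52

print(f"managed tool TCO: ${managed_tool_per_year:,}/year")
print(f"self-hosted TCO:  ${self_hosted_software + self_hosted_engineering:,}/year")
# The $100k line item looks expensive until the engineering time shows up on the sheet.
```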
Being able to put a number on reliability.
Happy customers equals happy engineers, which equals happy
bosses, which makes everybody happy.
The cost of fragility can break your company, especially if you're a web SaaS or you provide a service over the web. Paying customers expect you to deliver on your part. And if
you are not able to, because your systems are not up and
you're not able to fulfill your service
level agreements, it costs you money, it costs engineering
talent, and it could cost your business.
So having those points and understanding how to talk about those points is key in selling chaos engineering. So now
we understand what we have to do. How do we do it? Well,
I'll start at the top. The benefit and cost analysis, the total cost of ownership, is where you're going to start. Define that. Get all your data, all your charts and graphs. I had an old VP who used to say, charts and graphs or it didn't happen. I even made stickers. And working at an observability software company, that kind of makes sense, right? But it's true. Be able to prove your points, do your homework, bring your case files, act like a lawyer does when they're trying to persuade a jury of their case, and have all of your detailed documents and notes. Do surveys with your developers. Do the things that you need to do to make the sale.
Another way to gain buy-in is to just ask for small pilot programs. Just say, hey, I want to try this new way of doing things in our test sandbox environment, is that okay? Start with a win.
Iterate on that win,
demonstrate that success early, and demonstrate
that you can control the blast radius
and can control the failure. And demonstrate
the ability to say, oh, okay, well, I'm doing this in a
scientific manner to gather more data
to make our system better.
Be able to clearly articulate those
things and speak their language. Right. If you're having to
sell this to a higher exec that's a business leader,
maybe lean less on the technical jargon and more on
the business value jargon. Or if you're having to pitch this to your CTO, he's going to want to know, okay, what types of failures are you going to bring? Are you going to do latency? How is this going to affect us long term? Are you going to have the ability to completely break everything and not have it be recoverable? What's your backout plan? Things of that nature. Speak the language of the leader that you are pitching to. And then the last thing is just iterate and improve.
Iteration is the key to this.
Being a DevOps engineer, we build pipelines, and I get
really excited when I get new errors because new errors mean
new improvement. So iterate,
improve, take baby steps, don't do
big things all at once. Start small and work up to
the larger events and the larger scenarios. Don't just go off
blasting right away. Like I said, this is one of the hardest things to ask for, because you
are literally asking to break things, but you're
asking to break things to make them better. So be
sure to include that last part.
Now, how do you
go about doing this? So I want to talk about a time where I was
at Splunk and we were going through a very large migration. We were basically transitioning from Ansible and Chef to Puppet and Terraform, and so we were rebuilding our whole provisioning system. And in that migration, we were also moving to the Graviton instances, moving from our D series to our
I series. And so we had to
basically move our whole fleet, 15,000 instances.
I think we did it in like eight and a half months. It was a
crazy big project, and I was tasked with the largest customer,
the biggest, most snowflaked customer we had, with all
the custom configurations. They had
a lot of things that most customers did not because they were the
first customer to give us a million dollars. So they
got what they wanted, right. And it was imperative
that that migration went absolutely perfect.
Half of their stack was already moving off; their security side was already moving off of our managed team. So if
we failed on this, we could have lost a lot
of revenue. And so our chief cloud officer was
very interested in making this a success. And so I
had to basically ask for $5 million to make
sure it was a success. I asked to duplicate
their environment in another environment so that I could fully test
and fully break things and see how I would recover if I did break things
during the migration and have my backout plan ready
to a T. So I asked, I put all the necessary
data in front of him, told him it's going to take me three months.
This is our plan. We're going to replicate four petabytes of data.
We're going to execute the migration probably two or
three times. And then once we get that migration executed,
we're going to document everything that we think
will happen, and then everything that happened during that migration,
because when you're dealing with big data systems,
it's very difficult to predict things that are going to
happen, especially in the cloud. So I did that. I did all the selling, and we ended up executing the migration flawlessly through chaos engineering, because I was asking to test and asking to break things on purpose in a controlled environment that would not impact the customer.
And that investment of $5 million ended up returning 10x, because now we're using Graviton. Instead of 400 D-series instances, we're now only using 80 I-series instances. So we got better performance and better cost. We're saving way more money using the newer instances,
and it's a better customer experience. And I don't think that
migration would have happened had we not done those tests.
So that's a $20 million swing, right? Like,
had we failed and had we not spent that $5 million up
front, we would have not been successful. And so
I did experience some catastrophic failures during my first test when we were doing that. And so we figured that out in the practice environment, we learned from our mistakes, and we grew. And so it's hard to ask for money and to ask for failure, but if you do it in the right way, and you do your homework, and you present the data at hand and show the total cost and what happens in the long run with resilience, it goes a
long way. So I highly suggest asking
for this and asking to be able to
do things like this and to be able to test and to run your red
team exercises and to practice your incidents.
Practice, practice, practice. Firefighters, the military, Navy SEALs spend more time training for their missions
than they do actually executing the missions. Right, because they want it to
be perfect. When you're doing your technical operations,
you want them to be perfect. And the only way to do that is practice.
Some of the tools that I've used in the past: Chaos Monkey; Harness and their chaos engineering platform, they have the open source Litmus and they also have an enterprise level; there's Gremlin, which is another great tool; and then Chaos Mesh is a cloud native tool that I really like as well.
Check these out. And that's
the show. Thanks for taking the time to jump
into this talk. I really enjoy talking about chaos
engineering and having influence.
And yeah, check out my blog, chaoskyle.com. I talk about mental
health, I talk about platform engineering, resilience,
chaos engineering, all that. So thanks
for having me.