Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hi there, my name's Tammy and I'm here to speak to you about
Chaos Engineering: when the network breaks. You can reach me via email at tammy@gremlin.com. At Gremlin, I work as a principal site reliability engineer. I've been here for
three and a half years. I get to do a lot of chaos engineering as
a way to proactively avoid outages and downtime.
Previously I worked at Dropbox as a site reliability engineering
manager for databases, block storage, and code
workflows. One of my really proud achievements
was being able to get a 10x reduction in incidents in three
months, which was awesome. And it was through doing chaos engineering.
You can also find me on Twitter: Tammy X. Bryant.
One of the biggest problems these days is that every system
is becoming a distributed system, and that makes things really complicated.
You see a lot of folks with faces like this whenever they join a new
company and there's a distributed system, or if they're
just trying to add something new. So what is chaos engineering?
Thoughtful, planned experiments designed to reveal the weaknesses in our
systems. We inject chaos proactively instead
of dreading the unknown. This is a really cool example from Netflix.
You can see here on the top. This is when things are all
going great. Everything looks awesome. There's a nice hero banner to promote something in particular so that
folks could easily click play on it. You can see there's that big red play
button there, and then you can see some other items that you could continue watching.
And then if you look below, this is what it looks like if
there's something wrong with that hero that you'd like
to be able to showcase. So say, for example, maybe there's a problem with the
file. There was like file corruption. Or maybe you need to take it down because
there was some type of issue in the back end; it could be business related: legal, compliance, sign-off. Sometimes you need to
be able to remove things and roll them back. So instead of having a
big X button or having some type of really bad user experience making things not load,
showing like a loading screen, instead what we're doing is we're gracefully
failing and we're showing like a nice,
clean, simple look on the bottom screen where we just hide the hero
completely. So the user doesn't even know that anything's broken.
So how do you create a chaos engineering scenario? First off,
form a hypothesis, then baseline your metrics if you
can. I really recommend getting started by having some SLOs and SLIs per service. Consider the blast radius.
Determine what chaos you'd like to inject, for example,
latency. Or maybe you'll make something unavailable using a black hole.
Run your chaos engineering scenario. Measure the results of your
chaos engineering scenario, then find and fix issues
or scale the blast radius of the scenario.
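To make the "baseline your metrics" step concrete, here's a minimal sketch of recording a p99 latency SLI before you inject anything. It assumes a hypothetical Prometheus server and a hypothetical histogram metric name; swap in whatever your monitoring stack actually exposes.

```python
# Minimal sketch: capture a baseline SLI before running an experiment.
# PROM_URL and the metric name are assumptions for illustration only.
import requests

PROM_URL = "http://prometheus.example.com:9090"  # hypothetical endpoint

def baseline_p99_latency(service: str) -> float:
    """Return the current p99 request latency (seconds) for a service."""
    query = (
        "histogram_quantile(0.99, sum(rate("
        f'http_request_duration_seconds_bucket{{service="{service}"}}[5m])) by (le))'
    )
    resp = requests.get(f"{PROM_URL}/api/v1/query", params={"query": query})
    resp.raise_for_status()
    result = resp.json()["data"]["result"]
    return float(result[0]["value"][1]) if result else float("nan")

if __name__ == "__main__":
    # Record this number, run the scenario, then compare afterwards.
    print("baseline p99 latency:", baseline_p99_latency("cartservice"))
```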
Often folks ask me, how do I pick where to start? I really recommend identifying
your top five critical services first. I don't really like the idea
of going for low hanging fruit because in the world of SRE,
that's not impactful. You need to focus on fixing the issues
that impact your customers. If you've got every single customer
impacted by one issue, you've got to fix that rather than just fixing
a small, tiny issue that only impacts, say, for example,
one customer. And maybe that customer is a lower
cost customer, for example, or a customer that you're
not trying to target with a new campaign. Something like that.
So what you really need to do is figure out what your top five
critical services are. I like to ask folks: right now, tell me, what are your top five critical services? Often folks aren't really sure, and if you ask different people within the same organization, you'll get different answers. But we really want to
understand across an entire company, what are the top five most critical
services? Then choose one of these critical services.
So, for example, I'll give you three different types of critical
services that you'll often see. One is monitoring and alerting
or observability. That's definitely a critical service.
The next is cache, for example, Redis or Memcached.
And then a third, for example would be payments. Another one
that you may be thinking about is databases. That's another critical
service. Next up, step three, whiteboard the
service with your team. Then step four, select the scenario
that you'd like to run. For example, maybe you'll validate
monitoring and alerting. Maybe you'll validate that autoscaling works as expected. Maybe you'll figure out what happens if a dependency is unavailable using a black hole attack. Or maybe you'll use a shutdown attack or a black hole attack to trigger host
or container failure. Then step five,
determine the magnitude, the number of servers, and the length of time.
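One way to keep yourself honest on those five steps is to write the plan down before running anything. A small sketch, with illustrative field names (this is not a Gremlin API):

```python
# Sketch of a written-down experiment plan covering the five steps.
from dataclasses import dataclass, field

@dataclass
class ChaosScenario:
    hypothesis: str          # what we expect to happen
    service: str             # the critical service under test
    attack: str              # e.g. "blackhole", "shutdown", "packet_loss"
    magnitude: str           # e.g. "80% packet loss"
    blast_radius: int        # how many hosts/pods are impacted
    length_seconds: int      # how long the attack runs
    abort_conditions: list = field(default_factory=list)

ad_blackhole = ChaosScenario(
    hypothesis="Graceful: no outage, website still functions, ads are hidden",
    service="adservice",
    attack="blackhole",
    magnitude="all traffic dropped",
    blast_radius=1,
    length_seconds=60,
    abort_conditions=["checkout error rate spikes"],
)
```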
So what is the value of chaos engineering? To me, there are
so many different ways that you can get value from chaos engineering. And I think
it's one of the most important things for an SRE to do. I say this because for
me, it's been the number one way that I've been able to reduce incidents,
avoid downtime, and it's super impactful in a
really short amount of time and you don't need tons of people to make it
happen. It's really nice and proactive compared to other types of work
that you can do. For example, automation,
building completely automated systems that don't involve any
humans, that takes years of work often, and it
takes a lot of engineering effort and often there's actually a lot of failures
that come because of the automation. So to me this is a great way to
first reduce issues before you do all of that automation work
because automation is just going to uncover more problems.
So what are some of the values? First off, find your
monitoring gaps and improve your signal-to-noise ratio. Next,
validate upstream and downstream dependencies.
Thirdly, train your teams with fire drills,
making sure that folks know who needs to do what when they get paged.
Inject chaos, inject real failure to then trigger real alerts
to specific folks on different teams. And fourthly,
get a good night's sleep. Being up at 2:00 a.m., having to get
paged and resolve an incident, that's not good. We want to reduce that
with chaos engineering. Something else I wanted to talk about
is MTTD. This is one of my secrets.
So my personal favorite metric is MTTD
and focusing on improving mean time to detection. I care about
it so much that I wrote a book on it with some friends for O'Reilly,
some friends from LinkedIn, Twitter, Amazon. We got together
and we put down all of our best practice ideas and principles for
how you can do this and share that in this book. So check that out
for sure. Why do I care about mean time to detection?
Because, for example, if it takes you a week to detect an incident
and five minutes to resolve it, it still means that it took
you a week and five minutes to fix the issue. For a customer,
that's bad. The slower your time to detection is, the worse
the impact is for the customer. So if you focus on mean time to detection,
you can get in there faster and you can resolve issues faster.
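A toy illustration of that arithmetic, with made-up timestamps: total customer impact is detection time plus resolution time, so a huge MTTD swamps even a five-minute fix.

```python
# Made-up incident timestamps to show why MTTD dominates customer impact.
from datetime import datetime, timedelta

# (started, detected, resolved)
incidents = [
    (datetime(2021, 3, 1, 9, 0), datetime(2021, 3, 8, 9, 0), datetime(2021, 3, 8, 9, 5)),
    (datetime(2021, 3, 10, 2, 0), datetime(2021, 3, 10, 2, 4), datetime(2021, 3, 10, 2, 30)),
]

mttd = sum((d - s for s, d, _ in incidents), timedelta()) / len(incidents)
mttr = sum((r - d for _, d, r in incidents), timedelta()) / len(incidents)
print(f"MTTD: {mttd}, MTTR: {mttr}")

for started, _, resolved in incidents:
    # Customers feel the whole window, not just the time after detection.
    print(f"customer-facing impact: {resolved - started}")
```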
Now let's dive into network chaos. Today I want to share some
demos. We're going to be using a demo web
application that's an ecommerce store. I'm going to show you three different kinds
of demos. One is a black hole attack. We'll be doing this
on the ad service. Second is a black hole attack on the recommended products service. And thirdly is a packet loss
attack, where we'll actually be injecting packet loss into the cart.
One of the main things that I like to think about at the start of
this is use a scientific method and determine a hypothesis.
So if we run a black hole attack and make the ad service unavailable, we think nothing really that bad will happen.
It should gracefully degrade. We should have no outages.
That should just disappear. If we blackhole recommended products,
also, we shouldn't have any outages. Everything should gracefully degrade.
We just wouldn't see recommended products. You know, like when you're on Amazon
and you see other things that they recommend that you should buy based on what
you've already bought. It should just disappear and you shouldn't even notice if it's
not there. If we inject packet loss into the cart, we expect
that we shouldn't actually lose any items and our cart should
remain intact during this attack. We don't
want the cart to drop items. We don't want the cart
to not allow us to add extra items. We should still be able to use
our cart. So first off, what we want to do, like we talked
about before, is whiteboard our architecture. So you
can see here for this demo application, the hipster shop,
we've got twelve services. Which of these services matter most
to our customers? The cart service, the checkout
service, the payment service, shipping.
Which ones do we think? So let's think about it
like this. What can we remove from the critical path and what do we have
to keep? There are a few things here that look really important.
Definitely the checkout service, the payment service. We want
to understand, like what items are going where with the shipping service.
There's also a cache that looks pretty important. Cart service looks important too.
Product catalog service, that's probably important. So this, to me,
looks like the architecture of what we would need to keep.
And you can see I've removed a few of the different items.
So if we look back, I removed the email service,
currency service, recommendation service, which recommends products.
So those are a few of the things that we think are not in the
critical path. But everything left here is in the critical path.
And the little boxes with emojis,
those are items that are even more critical for the user, like the payment service.
So getting out of the critical path is good. Always remember this,
if you are a critical path service team and you own a critical service, then you're going to have more eyes on you and more things that you need to pass: more checks,
more compliance. So we've got seven critical services.
One thing that we want to think about, based on our hypothesis, is does
black holing a noncritical path service cause unexpected failures
for critical services? So based on our initial
scenario that we want to run and our hypothesis, we think that if we
make the recommended product service unavailable,
we shouldn't have any issues. If that service is not found,
we think the path should be okay. Another thing we might want to think about,
like we talked about initially, is does injecting some type
of network impact, like packet loss in the cart service, cause us to lose
data, like for example, our customer cart items?
Another thing that you might think here is where are your data backups and restores?
We seem to only have a Redis cache. You know that feeling
when you've been shopping for ages, you've added all these items to your cart and
then suddenly your entire cart gets wiped and you have to go and re add
all of those items to the cart. It's annoying for the user and it's also
really bad for the organization.
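When we get to the packet loss demo, one way to verify "no cart items lost" is to read the backing Redis cache directly before and after the attack. A sketch, with an assumed key layout ("cart:<user_id>" as a hash of product to quantity); the demo app's real schema may differ.

```python
# Sketch: confirm cart contents survive an attack by checking Redis directly.
# Host name and key layout are assumptions for illustration.
import redis

r = redis.Redis(host="redis-cart", port=6379)

def cart_item_count(user_id: str) -> int:
    """Sum item quantities stored in a hypothetical cart hash."""
    items = r.hgetall(f"cart:{user_id}")  # {product_id: quantity}
    return sum(int(qty) for qty in items.values())

before = cart_item_count("demo-user")
# ... run the packet loss attack here, then re-check ...
after = cart_item_count("demo-user")
assert after == before, f"cart dropped items: {before} -> {after}"
```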
All righty, so now let's get started with our first real demo.
We're going to do a black hole attack on the ad service. So this means
we're going to make the ad service unavailable. So this is our
demo application. Here you can see, I click on the vintage camera lens and now I scroll down.
This is the recommended product section. Other products you
might like. And then here's the ad service. So,
advertisement Film camera for sale, 50% off.
I can then click on that link and it takes me to the film camera.
And then it has another ad for that film camera for sale.
If I go to a different product like the terrarium, you can
see that there's an ad down here that says airplanes for sale. Buy two,
get the third one free. So we think we should still be able to
use this site. Click on an item, scroll down.
But we think this will probably be gone. This box here, the advertisement
box, based on our hypothesis, but we do think we should still be able
to add items to cart. See, we've added this
item and now we have USD 70.22
as our total cost. We should be able to check that out and
see that the order is complete. So now what we want to do is we
want to run that black hole attack. So this is our baseline good state; everything looks nice. So we can see here the ad
service. Now I'm going to go and say scenario,
new scenario, black hole
ad service.
So we think that this should be graceful. No outage,
website still functions.
Ads are hidden. That's what we expect to
happen. That's our hypothesis. So now what we want to do is
click over here on Kubernetes and then search for
the Kubernetes service. We've selected our Kubernetes cluster group. Now I'm
going to go over to the ad service. I just type that in based
on our architecture diagram. And you can see here it selected the deployment for ad
service and the pod and the
replica set for ad service. So we've got that there
and we can see that across our two hosts. So we've got
two hosts. It actually only has one pod across
all of them. So that's already interesting from a reliability perspective.
It's not going to fail over gracefully because we only have one pod.
If we did have two pods, then potentially, yeah,
we wouldn't see an issue at all if we were just going to black hole
one of the pods. So now what I'm going to do is scroll down,
click network, and go to black hole. Do this
for 60 seconds. I don't need to do anything in particular extra,
but I could do something more granular. Like, say, I only want to impact traffic to these IP addresses; or I only want to impact these specific remote ports, so traffic going from here to somewhere else; or local ports, impacting outgoing and incoming traffic to and from these local ports. Or I could drop down and select specific providers.
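Under the hood, a black hole is conceptually just "drop the matching traffic." Here's a rough Python-driven iptables illustration of that idea; this is not how Gremlin implements its attacks, the IP and port are placeholders, and it needs root, so only try it on a disposable test host.

```python
# Conceptual black hole: drop outbound TCP traffic to one destination.
# Placeholder IP/port; requires root; NOT Gremlin's implementation.
import subprocess
import time

RULE = ["OUTPUT", "-d", "10.0.0.42", "-p", "tcp", "--dport", "9555", "-j", "DROP"]

subprocess.run(["iptables", "-A", *RULE], check=True)   # start dropping packets
try:
    time.sleep(60)                                      # attack length: 60 seconds
finally:
    subprocess.run(["iptables", "-D", *RULE], check=True)  # always clean up
```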
All right, let's add that to our scenario. Click save.
Now let's run it. So this is a cool live
demo. We'll actually be able to see this in real time.
So now I'm starting to run this and we'll see what happens.
It's currently running. Let's pull that up.
Pending is the first state that you'll see it in.
All righty. Let's see. We'll go back over to our site.
I'm going to leave this open. You can see here, I've got that there.
Let's refresh the page. You can see now
the ad service has disappeared. So it was there before, now it's
gone. But let's check everything else still works. If I click on film
camera, no ad service is there. But I
can add an item to my cart.
Great. And now I can place my order.
Add a home barista kit to my cart.
And I can place my order. All right,
excellent. Everything looks good. So I would say that that
actually passed. So what we can do here is we know
that it passed already. So we can halt that scenario. We can say,
yes, it passed; no incidents were uncovered; and yeah, any incident was mitigated. No issues. Graceful.
All right, so we've stopped that scenario because everything
was good and we saved our results into the system.
So that's awesome. That's a really good example of a successful chaos engineering
experiment. And we could think there what else might we do to this system to
make it even better? We could make it so that you could fail over to a different pod, since there's only one; make the ad service still work if one pod is not available. That could be a future experiment.
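As a sketch of that follow-up, you could scan for single-replica deployments and scale the ad service up, assuming kubectl access to the demo cluster:

```python
# Sketch: flag single-replica deployments, then scale the ad service to two
# pods so black holing one of them isn't fatal. Assumes kubectl is configured.
import json
import subprocess

out = subprocess.run(
    ["kubectl", "get", "deployments", "-o", "json"],
    capture_output=True, text=True, check=True,
).stdout

for deploy in json.loads(out)["items"]:
    name = deploy["metadata"]["name"]
    replicas = deploy["spec"].get("replicas", 1)
    if replicas < 2:
        print(f"{name}: only {replicas} replica, no graceful failover")

# Scale the demo's ad service so one pod can fail without taking ads down.
subprocess.run(["kubectl", "scale", "deployment/adservice", "--replicas=2"], check=True)
```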
So there are some other different experiments that we can run. Now let's black hole the recommended product service. So what we want to do is go back over here; recommended products is this section here, other products you might like.
So I'm just going to reload that page. You can see here
the ad service is now back and we see our recommended product section. So we
think it should be the same. This section should just disappear.
All right, so let's go scenarios, new scenario
black hole recommended products,
no outage.
And let's find it over here in Kubernetes. You can see we've selected it. Network, black hole. Leave it for 60 seconds.
Add to scenario, save the scenario.
Now let's run it.
Okay, so we're just going to be black holing the recommended products
service. It's currently in the status of pending.
Looks like there's an issue. So I'm trying to click on vintage typewriter and
it's just trying to load but it's not allowing me to get to
that page. I've just got it loading. I'm kind of stuck
here. Clicked on film camera, same thing.
Nothing's happening. Let me try and add something to my cart.
Oh, that's not working either. So there's definitely
an issue here. I'm going to try and go back to the home page.
Okay, so I can get to the home page. Can I click on an item?
No. So the only thing that I can do right now, the only thing
functioning is enabling me to go to the home page. Let's see if I can
get to the cart. No, I can't get to the cart either. So this page is
basically not working at all. This website is not working at all.
I can't really do anything much.
I can't get to my cart. I can't add items. So we would consider this
to be an outage. The website's just not working at all.
It's not enabling us to make any purchases. So we're
not going to be fulfilling anything from any customer. So that's definitely a
bad example. So let's say, all right, we know that this is
not good. Let's halt it. No, it failed. A potential
incident was uncovered and the incident
was not mitigated.
So let's write: full outage, website not working at all to process orders or navigate.
All righty.
Click save there.
So that was a bad example. Now that I've halted
it, let's just check. Everything's still good. Yep. So really
easy, quick way to be able to test what happens if you blackhole a
service. Obviously there's some type of issue with hard coding that
blackhalling this recommended product service breaks the entire website.
Obviously we don't want that. That's really bad. All righty.
So we've saved our results into the system and we said
it was supposed to be graceful with no outage. But actually the result was
a full outage website wasn't working to process orders and we couldn't navigate either.
So we learned a lot from doing that scenario and experiment too.
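The shape of the fix we'd be looking for: call the recommendation dependency with a short timeout and degrade to an empty list, so a missing service hides the section (like the ads did) instead of hanging the whole page. The URL and response shape below are illustrative assumptions, not the demo app's actual interface.

```python
# Sketch: fail fast on a non-critical dependency and degrade gracefully.
# The endpoint and JSON shape are assumptions for illustration.
import requests

def get_recommendations(product_id: str) -> list:
    try:
        resp = requests.get(
            f"http://recommendationservice/recommend/{product_id}",
            timeout=0.2,  # fail fast instead of blocking the page render
        )
        resp.raise_for_status()
        return resp.json()
    except requests.RequestException:
        # Graceful degradation: show the page with no recommendations.
        return []
```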
Now let's see,
the last one that we want to run is a packet loss attack on the
cart service. So let's go over and do that. All righty,
so now we're going to do our packet loss attack.
Let's go over there. Okay, so here's my shopping
cart. You can see I've actually got 21 items in my cart right now. I've added ten bikes, one record player, and ten vintage camera lenses.
Let's just add one more airplane. Now what I want to
do is create a new scenario and we're going to call this
packet loss to cart 80%. And we think this
should be actually graceful.
No cart items lost.
Let's go over to Kubernetes and then select
the cart. All right,
cool. Now we'll go to network, then packet loss, and run a packet loss attack. We're going to add that to our scenario, save the scenario
and run it.
Okay. So as that starts to get set up right now it's in
pending. We want to go over here and let's see. It should be starting to
run now. All right. Hasn't run yet.
Still in pending. All right. Now there's
some issues. I'm clicking on metal camping mug and it's not letting me load
anything. Clicking on city bike. Nothing happening there. I'm going
to go and click on the online boutique logo at the top to see if
I can get back to the home page. No,
It still says I've got 22 items in my cart, though, but I can't actually go to my cart.
So we do look like we're stuck here. This is definitely another outage situation.
So packet loss to the cart service is causing the entire site
to be unusable. So that's not what we
expected. And then we've also now got this 500 error. And we have logs
visible to all users, which is also really not good.
So obviously a lot of things that we learned there.
So what happens here is Gremlin is going to abort that attack because it realized
there's a problem and we'll say no, it failed.
A potential incident was uncovered and the incident was not mitigated.
So we got a 500 service error.
I'm going to actually copy this over
into our system to store those results and
click save so we can actually save the error message
and refer back to that later on.
So, yeah, that was not a good situation, but we did learn a lot.
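For reference, this is roughly what 80% packet loss looks like under the hood using tc/netem; again an illustration rather than Gremlin's implementation, with a placeholder interface name, root required, and disposable test hosts only.

```python
# Conceptual packet loss: randomly drop 80% of packets on one interface.
# Placeholder interface; requires root; NOT Gremlin's implementation.
import subprocess
import time

IFACE = "eth0"

subprocess.run(
    ["tc", "qdisc", "add", "dev", IFACE, "root", "netem", "loss", "80%"],
    check=True,
)
try:
    time.sleep(60)  # attack length
finally:
    subprocess.run(["tc", "qdisc", "del", "dev", IFACE, "root", "netem"], check=True)
```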
So now, in summary, these are the three different types of
experiments we ran.
First, black holing ads. That was awesome: graceful degradation, just like Netflix. Number two, black holing recommended products: major incident, not good. We have to do some work to figure out what the hard-coded issues are there between dependencies.
Number three, packet loss to the cart gave us a 500 error, but it looks like
we didn't have any data loss to our cart. Let's just
go and check that to confirm.
Cart has 22 items still in it, so no data loss.
So we're all good there.
All right,
so here are some questions you want to ask yourself at the end of all
of your chaos engineering work. Was it expected?
Did we uncover unknown side effects? Was it detected?
We can ensure that our monitoring is correctly configured. Was it mitigated?
Did our systems gracefully degrade like in the first example?
Did we fix the issues, whether it's code, config, or process? Can we automate this? We can then regularly run past failures
as exercises to prevent the drift back into failure.
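A minimal sketch of what "run past failures as exercises" can look like: a probe that watches the storefront while an attack replays, and fails loudly if the site stops responding. The URL and thresholds are placeholders; wire the attack itself up through whatever tool you use.

```python
# Sketch: automated regression check that runs alongside a replayed attack.
# STORE_URL is a placeholder for your frontend.
import time
import requests

STORE_URL = "http://localhost:8080"

def site_healthy() -> bool:
    try:
        return requests.get(STORE_URL, timeout=2).status_code == 200
    except requests.RequestException:
        return False

def probe_during_attack(duration_s: int = 60, interval_s: int = 5) -> None:
    deadline = time.time() + duration_s
    while time.time() < deadline:
        assert site_healthy(), "site went down during the experiment"
        time.sleep(interval_s)
    print("experiment passed: site stayed up during the attack")

if __name__ == "__main__":
    probe_during_attack()
```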
And let's share our results. Prepare an executive summary of what you
learned and what you fixed. You can find a blog post on the Gremlin website about what I recommend that you write: an internal chaos engineering report, like "10x reduction in incidents", what kind of charts to use, and how to explain it to folks. It's really important to share that story.
I also recommend that you join the chaos engineering Slack; go to gremlin.com/slack. And as a summary, what should you do next?
Pick one service or application to practice chaos engineering on. Form a hypothesis. Baseline your metrics. Set client-centric SLOs and SLIs
per service. Consider the blast radius. Determine what chaos to
inject. Run your chaos engineering scenario. Measure the results
of your chaos engineering scenario. Find and fix issues, or scale the
blast radius of the scenario. This is what I recommend as
you ramp up all of your work when you're doing chaos
engineering. Start by making sure your agents are correctly
installed. Onboard a team to be able to help work
with you. Run your first scenario. We also offer
Gremlin bootcamps, which are a really cool way to get trained up. Those are free.
Go to gremlin.com/bootcamps. Then automate your chaos engineering: run game days as a team,
do some scheduled attacks, actually meet with your
executive team to share those results, and then integrate with your
CI/CD, so you're actually using chaos engineering more as
a shift-left practice. If you go to gremlin.com/bootcamps, you'll find all of the 101 introduction to chaos engineering as well as the 201 (automation, CI/CD), the more advanced topics. And go to the Gremlin blog to
hear about the Gremlin Chaos champion program. There are lots of amazing people doing
work in this space. Thank you so much for coming along to my talk.
Really appreciate it. You can find me on Twitter if you want to ask me
any questions: Tammy X. Bryant. Thank you.