Transcript
Getting real-time feedback into the behavior of your distributed systems, and observing changes, exceptions, and errors in real time, allows you to not only experiment with confidence, but respond instantly to get things working again. Welcome to this session on reliability, chaos engineering, and game days. I'm really excited to be
with all of you today, so let's
just jump right in. If we've learned anything over the
years, it's that the road to reliability isn't an accident. Reliability takes work: a plan, a strategy, and a lot of technical actions, along with teamwork and collaboration.
So how can you make your systems more reliable and
improve the reliability of everything? Let's talk
about that today. One of the things that we saw a few
weeks ago was this massive outage at AWS,
and think about what that meant for customers.
Look at all of the folks that were impacted.
Our systems are complex,
and when we see something like a major five hour outage,
how can we plan or prepare for those types of
things in the best way possible?
So when we talk about complex systems, what do
we mean? And I'll get there in just a minute. First, I want to let
you know a little bit about me. I'm Julie Gunderson, a senior reliability
advocate here at Gremlin. You can find me on Twitter
at Gund, or email me at julie@gremlin.com. Prior to joining Gremlin, I was over at PagerDuty
as a DevOps advocate. So I've been in the reliability space for quite some time.
Other than that, I live in the great state of Idaho. Now that
you know who I am, let's jump right in. So a
great resource, if you haven't checked it out, is the book Accelerate. It's a fantastic book written by Dr. Nicole Forsgren, who's responsible for the DORA report, Gene Kim, who wrote The Phoenix Project, and Jez Humble, who's an SRE over at Google.
They've done an amazing job of collecting all the research they've
gathered from practicing DevOps over the years and also
creating the DevOps survey and report. So they took
four years of research and came together to analyze that data
in a great way, bubble it up, and say what the most important things are that you need to do to create high performing technology organizations. When it comes to building and scaling high performing teams, in Accelerate they focus a lot on tempo and stability.
Now, a lot of times folks ask, how do you measure chaos engineering?
How do you make sure that what you're doing is the right thing when you're doing this type of work? How do you prove that the work that you did actually makes a difference and is able to move that needle, especially if you have a lot of different people doing different projects?
And it's really important to be able to measure back and say,
this is the work that we did, this is how we moved that needle,
and this is the ROI that we got from doing that work.
And over the years, a lot of research has been conducted and
books have been written on how to improve the resilience of our software.
So today I want to dive into two key practices that
improve the key measures of tempo and stability that are outlined in
the book Accelerate. And those two practices are chaos engineering
and game days. And you'll learn practical tips that you can put into action
focused on resource consumption and capacity planning, decoupling services,
large scale outages and deployment pain. And at the end of this, I will let
you know how you can get a free chaos engineering certification.
So what are tempo and stability? Tempo is measured by deployment frequency and change lead time, and stability is measured by mean time to recover (MTTR) and change failure rate. If we break that down even further, deployment frequency is the rate at which software is deployed to production or to an app store, anywhere from multiple times a day to a really long deployment frequency, like once a year. The other tempo measure is change lead time: the time it takes to go from a customer making a request to that feature being built and rolled out to where the customer can use it. So again, the time it takes from making the request to the request being satisfied. Then on the stability side, mean time to recover is the mean time it takes a company to recover from downtime of their software, and change failure rate is the likelihood that a change introduces a defect. So if you roll out five changes, how many of those will
have a defect? How many of those might you need to roll back, or might
you need to apply a patch for? Is it one out of five or
two out of five? These are the types of metrics that
accelerate recommends that you measure.
So back to the key practices. As I mentioned,
the key practices are chaos engineering and game days.
And according to Accelerate, you should focus on improving reliability,
because it's the foundation that enables you to have a really great tempo.
It enables you to improve developer velocity,
which makes so much sense. But what's really nice is that they're backing
this up with data over years of research.
And we know that engineers feel more confident when software
is reliable. They feel more confident to write new code,
and more confident that the features that they're building are going to work well.
They understand these different failure modes because they've been focusing
on reliability and have been trying to improve reliability.
So that makes you actually able to ship code
faster and more reliably. It's really exciting
when you can ship new features that work, that you can trust, and that meet requirements. In Accelerate, they say that we should build systems that are designed to be deployed easily
and that can detect and tolerate failures and can have various
components of the system updated independently.
And this is a lot of different things that are really great to work towards.
And I want to focus on the detect and tolerate failures specifically.
So what's the best way to know if your system can detect and tolerate failures?
Chaos engineering. It's the best way, because you're purposefully injecting real failures to see how your system handles them. So let's look at the basics of chaos engineering.
Chaos engineering is a misnomer. We're really simulating
the chaos of the real world in a controlled environment.
Introducing chaos is methodical and scientific.
You start with a hypothesis and experiment
to validate that hypothesis. You start with the smallest
increment that will yield a signal, and then you move safely from small scale to large scale, and safely from dev to staging to production. Communication is important, so you want to make sure you share your plans with everyone. You want to think through: what if there's an incident? You don't want to negatively
affect other teams. You also want to share what you learn,
because chaos engineering is about learning, and sharing what you
learn makes everyone better engineers. So share internally and externally. There is a Chaos Engineering Slack, which is a great online resource you can find at gremlin.com/community. You can talk to other people who are practicing chaos engineering, or, if you're allowed, write a blog about it.
Share this with people so that they can learn the best practices
of chaos engineering. Chaos engineering is
about iteration. You're creating a hypothesis and running an experiment.
Then you're creating tasks to improve your software and processes,
updating that hypothesis and repeating. And then
you increase your blast radius and keep repeating.
To sum it up, this is what chaos engineering is: thoughtful, planned experiments to reveal weaknesses in systems,
both technical and human. And so we want to think through where is
our tech broken or insufficient? Where does that user
experience break? Does our auto scaling work? Is our
monitoring and alerting set up? And we also want to think about our human systems and processes. Are they broken or ill equipped? Is that alert rotation working? Are the documentation and playbooks up to date? How about the escalation process? These are all things that we should be thinking through,
and now is the best time, because systems are
complex and they become more complex over time. So let's start
when things are less complex. Now,
I want to talk a bit about applying the scientific method to
chaos engineering. To measure how systems change during an experiment, it's important to understand how they behave now. And this involves collecting relevant metrics from target systems under normal load, which provides that baseline for
comparison. So using that data, you can measure exactly how
your systems change in response to an attack. And if
you don't have baseline metrics, that's okay.
You can actually start now and start collecting those metrics,
and then use chaos engineering to validate those metrics.
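As a rough sketch of what collecting that baseline could look like in a Kubernetes environment like the demo below (this assumes metrics-server is installed and that you've picked the right namespace):

    # Snapshot resource usage for the target pods every 30 seconds so you have
    # something to compare against once the attack starts
    while true; do
      date >> baseline.log
      kubectl top pods --containers >> baseline.log
      sleep 30
    done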
One of the most powerful questions in chaos engineering is,
does this work the way I think it does? Once you've got an idea
of how your system will work, think about how you're going to validate that.
What type of failure could you inject to help prove or
disprove your hypothesis? What happens if your systems don't respond
the way you expected? So you've chosen a scenario,
the exact failure to simulate. What happens next? And this is an excellent thought exercise to work through as a team, because by discussing a scenario, you can hypothesize on the expected outcome when running that in production, and think through what the impact would be to customers and to the dependencies.
And once you have a hypothesis, you'll want to determine which metrics to measure
in order to verify or disprove your hypothesis.
And it's good to have a key performance metric that correlates to customer
success, such as orders per minute or stream starts
per second. As a rule of thumb, if you ever see an
impact to these metrics, you want to make sure that you halt the experiment immediately.
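Purely as an illustrative sketch, an automated halt check might look something like this; the metric name orders_per_minute, the Prometheus URL, and the threshold are placeholders for whatever your own monitoring exposes:

    # Poll the key customer metric and halt the experiment if it dips below a floor
    current=$(curl -s 'http://prometheus:9090/api/v1/query?query=orders_per_minute' \
      | jq -r '.data.result[0].value[1]')
    if [ "$(echo "$current < 100" | bc -l)" -eq 1 ]; then
      echo "Key metric below threshold, halting the chaos experiment"
      # stop the attack here with whatever chaos tooling you're using
    fi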
And after you've formed your hypothesis, you want to look at
how you can minimize your blast radius prior to the experiment.
And I'm going to talk about blast radius a little bit more
in a few minutes. But blast radius is usually measured
in customer impact, like maybe 10% of customers could be impacted, but it can also be expressed in hosts or services or other discrete parts of your infrastructure.
So when running a chaos experiment, you want to think about that blast
radius. You always want to have a backup plan in case
things go wrong. And you need to accept that sometimes even
the best backup plan can fail. So talk through how you're
going to revert the impact. One of the important things
with chaos engineering is safety. So you want to make sure the impacts can be reverted, allowing you to safely abort and return to that steady state if things go wrong. After you run your first experiment, there's likely going to be one of
two outcomes. Either you've verified that your system is resilient to the failure
you've introduced, or you found a problem that needs to be fixed.
Both of these are great outcomes because, on one hand, you've increased your
confidence in the system and its behavior, or on the
other hand, you've found a problem before it caused an outage.
Make sure that you have documented the experiments and
the results. And as I mentioned before, a key
outcome of chaos engineering is learning. And through these experiments, you're learning about your systems, you're validating your hypothesis, you're teaching
your teammates. So share the results of your chaos
engineering experiments with your teams, because you can help
them understand how to run their own experiments and where
the weaknesses in their systems are. So, let's talk about how you get started. As I mentioned, you want to pay
attention to the blast radius. So you want to start small.
You want to be careful. You want to start on a single host
or service, not the whole application or fleet.
You want to start in a controlled environment with a team that's ready.
We're not trying to catch folks off guard.
Then, once you've done that, you want to expand that blast radius,
adopt the practice in development so engineers are architecting for failure.
Get confident testing in development, and then move to staging.
Start small in staging, then expand your blast radius and
move to production. And start small and increase. This is
really similar to how you do development, so you don't need to overthink it.
You can work iteratively, like with code, and move up the environments, like with
code. You do know how to do this,
and so I want to show you a real demo of what this actually looks
like. Now, we'll be using an open source application and Gremlin
to do this demo, and there is a free tier of Gremlin. But I want to let you
know, this is just to show you what it looks like. You can use any
chaos engineering tool. So this is a
really cool open source project, and you can find it on the Google Cloud Platform GitHub page. It's a repo called Bank of Anthos. And I'll
send some links so that you can check it out. So, this is the architecture
diagram. It's really great for learning how to practice chaos engineering
because it's an Internet banking application, and it has multiple languages.
It uses both Python and Java, which is pretty common these days; when you're working on a system, there are oftentimes multiple languages. There are also two databases: the accounts database and a ledger database, both running on PostgreSQL. And then we have our transaction history service, a balance reader service, a ledger writer, a contacts service, a user service, our front end, and a
load generator. So this is what it looks like.
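If you deploy the repo's Kubernetes manifests, you can see all of those pieces as pods. The names below are what I'd expect from the Bank of Anthos manifests, so verify them in your own cluster:

    kubectl get pods
    # Expect pods along the lines of: frontend, userservice, contacts, ledgerwriter,
    # balancereader, transactionhistory, loadgenerator, accounts-db, ledger-db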
Now, one of the questions we want to ask ourselves when we're starting to practice
chaos engineering is what is our hypothesis?
We want to think through what is going to happen.
So does black holing a critical path service like the
balance reader result in a graceful degradation of the customer experience?
You want to think through that. What would happen if
we make the balance reader unavailable? So we make it unreachable,
which is what we call a black hole attack in Gremlin. If we make that unavailable, what is going to happen to the application? Do you have any ideas?
You want to think through what will happen? Will we be able to use the
website? Will we be able to see what the balance is? Will we be able
to deposit funds? And these are the questions we should ask ourselves.
So my guess would be that if we black hole the balance reader service,
I think we might see an error message, like unable to read the balance,
we might get a user friendly message. We would also hope to get something like a loading indicator, or a note that there was some sort of issue with the balance reader, that it was no longer available, or maybe
it was really well built and we could just fail over automatically.
So maybe if that one service was unavailable because it was running on
Kubernetes, maybe there's going to be some redundancy there and
we'll be able to actually fail over.
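Gremlin handles the black hole attack in this demo, but if you're following along without it, one rough stand-in (not a true black hole, since the pod goes away rather than becoming unreachable) is to scale the deployment down. The deployment name balancereader is my assumption from the Bank of Anthos manifests:

    # Make the balance reader "unavailable" by removing its only replica
    kubectl scale deployment balancereader --replicas=0
    # Poke around the frontend, then bring it back
    kubectl scale deployment balancereader --replicas=1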
Let's see what we can do. So, within Gremlin,
you can select the balance reader service and we're selecting it as a Kubernetes replica
set. Now, we can already learn a lot here because we can see in
the visualization that there's only one pod impacted.
So if there were multiple pods for the balance reader, we'd actually see two pods impacted. So that shows us already that when we run
this chaos engineering experiment, we're going to make all of
the balance reader service unavailable because there's no secondary pod,
which is a pretty large blast radius if you think about it.
So this is what it looks like when we run our experiment,
the balance reader will just show a raw error, and this can be really confusing for the user. So if I was building this
as a real Internet banking app for a bank, they'd likely say,
no way. We're going to get so many support tickets if this
really happens. That's not even a real error message.
And this could make users really confused. And then they're going to pick up
that phone and start calling the call center. And then that increases
the cost of having to answer all of those calls, and it causes additional problems.
But we also want to check other functionality,
too. Does this affect other dependencies?
So the user is actually still able to make a deposit of $1,000 while
the balance reader service is in this black hole or unreachable state.
And you can see this in the transaction history, that we've added money to our
bank account and we get a success message, but the balance still doesn't show up.
But if we try to make a payment, we're unable to do it. And then
we get this awesome engineer friendly message, but not
really a generally friendly customer facing message.
You've got to kind of think about that. Do you think your friends and family
who don't work in tech would know what this means? Or are they going to
start calling you and ask you what's going on?
I mean, I can tell you that every time there's an outage, especially with
Facebook, Hulu, or rarely Netflix, my mom
calls me to see if I can figure out what's wrong, even though I have
repeatedly told her that I do not work at these organizations.
So we want to think about that user experience.
It's something good to think through. I used to get errors
like these all the time, and I actually still do at my current bank,
and they're errors that make sense to them. But when I can't buy my new Lego set and I have this error, it is really frustrating for me, and it makes me realize that I need to go to an entirely new bank. So that's why
we want to think through the experience of reliability,
because how does the user see the issue? How do we represent the
problems to them? Do they even notice that there is a problem?
Or is there a way to hide that problem from the user and gracefully degrade?
Maybe even if possible, remove the component from the web page
if it's not working at the moment. There's a lot of
things that you can do to allow for graceful degradation that
makes a better user experience. And we
want to think through this always from that user
perspective, how can we reduce stress on those users?
So if you're interested in learning more about the free demo environment,
check it out. It's totally free to spin up on Google Cloud
Shell. Here's the information and the URL.
But I want to show you another example of something
that we're looking for. So let's go back to our demo environment.
So another interesting thing to look at is different types of failure
modes. So let's have a look at the transaction history service and
see how a black hole might impact that.
This is the transaction history here and you can see what it looks like when
things are good. You can see the credits and the debits and who made the
transactions. What you want to see is whether we're going to get a graceful degradation of this service if it's made unavailable or not. During our experiment, we can see that the transaction history is not there. We get an error that says could not load transactions, but we can also see that our deposit was successful. So this is better than the balance reader, because at least we're not getting that raw error, and we have more of a friendly user experience that doesn't surface things like a failed GET request.
So how can we mitigate against a black
hole? And that's the next question, because before, if you remember, on the balance reader there was only one pod; now we're going to make it two. So we'll scale our replicas, and if you're interested in Kubernetes and learning more about this, then this is for you. So we can run the kubectl scale deployment command against the transaction history deployment and give it two replicas instead of just one. We'll do this just for the transaction history for this example.
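As a minimal sketch of that step (the deployment name transactionhistory and its app label are what I'd expect from the Bank of Anthos manifests, so confirm them with kubectl get deployments first):

    kubectl scale deployment transactionhistory --replicas=2
    # Check that the second pod comes up
    kubectl get pods -l app=transactionhistory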
So run the command, and now we can look in the terminal with kubectl get pods and, yep, now we have two pods that are running for the transaction history
that can show us that data. So now what we
can do is a smaller blast radius chaos engineering experiment.
So let's send 50% of the transaction history pods into
a black hole. So just one of the two. And now you can see at
the bottom right that 50%, one of the two pods, is impacted. And I've selected the transaction history replica
set. You can see the two pods, there are two little green dots in the
visualization. So now you want to think through what's your hypothesis?
Now if we send one of the two transaction history pods into a
black hole, will we get an error message? Will things be okay?
Will things get worse? The interesting thing
here is you never really know until you run the experiment.
You're never going to be able to just guess exactly,
because that's very hard to do. And if it was easy to do, we'd probably
all be Powerball winners right now. So now let's look
at the architecture diagram, just to be able to really understand
this and understand what's happening. So we have our two transaction
history pods, we have the two replicas. And what actually
happens with this is that there will be a very short outage that's
not visible to the human eye, and then the other pod is going to
take over. It will say, this pod is not reachable, so I'm going to flip over to the other pod. And that's what occurs when
you test this out with chaos engineering in real time,
which is great because there's no error message visible
to the user. They just get the data that they wanted. You're still able
to see all of the transaction history. You no longer receive error messages.
Your mom's probably not calling you. So this is a really
nice way to fail over that service.
So again, here is the URL if you're interested in
looking at the Google Cloud Platform Bank of Anthos.
Now, as I mentioned, there are two key practices, and the next key practice
I want to talk about is game days. So,
game days are a really great team building exercise.
They're something that Accelerate talks a lot about in the book,
because game days are a great way to build relationships within
an organization. And it is so true.
They not only help you improve stability and reliability,
but also definitely tempo, because they enable you to work closely
with other teams that you want to have better relationships with.
And this is especially important if you're in larger engineering organizations.
The goal is cooperative, proactive testing of our system
to enhance reliability. So by getting the team together
and thinking through your system architecture, you can test
these hypotheses. You can evaluate whether your systems are resilient
to different kinds of failure. And if they're not resilient,
you can fix them before those weaknesses impact
your customers. So now this
is an example of a game day. What you would want to do is invite
four or more people to attend, and it's always good to have at least
two teams. You want to know if you have
some type of failure, how it appears to the other team, or if they have some sort of failure, you want to see how that affects your systems. And that's
why these are so great, because you're doing it in real time.
They're not happening during an incident. This is planned. You're doing this at
10:00 a.m. with lots of coffee and a Zoom call, and you don't have to spend hours doing this. Some folks do want to plan game days for an entire day or a half day, but you can really have much shorter game day experiences. I would definitely say plan for a minimum of 30 minutes, though. In the real world, we oftentimes
see folks spike load, maybe using Gatling or JMeter, and then they introduce different types of failure modes. So for example, if you use JMeter, you could say, all right, let's send a bunch of requests, but now I'm also going to spike CPU, because our auto scaling is based on CPU, or maybe it's based on a mix of CPU and requests.
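As a small sketch of that combination, the JMeter half is just the standard non-GUI run; checkout-load.jmx is a hypothetical test plan of your own, and the CPU attack is whatever your chaos tool provides:

    # Generate the traffic spike headlessly with JMeter
    jmeter -n -t checkout-load.jmx -l results.jtl
    # ...and in parallel, start a CPU attack on the target hosts or pods with your
    # chaos engineering tool so the autoscaler sees real resource pressure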
So this allows you to create that situation where auto scaling
should occur, and then you can see that it works. You can run the CPU attack for maybe 60 seconds, and when the traffic subsides and the attack finishes, everything goes back to normal and it scales back down. And that's going to help with other things too, like cost management for your infrastructure and keeping those costs low. A lot of SREs really care about that: how can I show that we saved this much money because we were able to quickly scale down when we don't need to be running a large number of machines in our fleet?
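As a rough sketch of the kind of autoscaling you'd be validating in a game day like this (the deployment name frontend and the thresholds are placeholders):

    # Create a horizontal pod autoscaler keyed on CPU
    kubectl autoscale deployment frontend --cpu-percent=70 --min=2 --max=10
    # Watch it scale up during the load spike and CPU attack, then back down afterwards
    kubectl get hpa frontend --watch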
So when we look at how to run a game day: at least 30 minutes, and 30 minute to one hour sessions are fabulous. You want to include two plus teams. Make sure you decide on your communication tool. Is it Slack? Is it Confluence? Is it a Google Doc? We use Slack here at Gremlin, and when I was at PagerDuty, we also used Slack. You want to plan and design two to three use cases, and you want to make sure that you have documented this. You want to assign roles such as commander, general observer, and scribe, and make sure people understand what their role during the day is. Document the results and then share widely in the organization,
and if possible, share your learnings externally as
well. So here are other examples of game days
you can look at: dependency testing, capacity planning testing, and auto scaling testing. Capacity planning is an interesting one because we've seen outages when folks tried
to scale up during a peak traffic day and they actually didn't have
the limits set correctly. They weren't even allowed to do that. They didn't
have the permissions set up. There were caps and they just
couldn't scale. So that's a bad situation.
And you definitely want to test that your caps aren't set up in a way that hurts you when you actually need to scale. These are
things that you're thinking through when you're going through these game days.
Also, here's a scenario planning template. You can actually find all of
this at gremlin.com/gamedays. And again, we're documenting the attack target, the attack type, the failure or resource consumption that we're simulating, the expected behavior and risk, and then the post-attack state and impact.
And that's important because going back to the very beginning of what we
talked about, you're creating a hypothesis and you're testing it.
We're not just randomly running around shutting things off.
And yes, you can automate all of this.
And you want to. You want to codify your chaos engineering experiments, because you want to make them shareable. You want to do version control. You want to show the history of your experiments. So look at how you can integrate chaos engineering experiments into your CI/CD pipeline. And you might do this for production readiness. Your organization might say, you need to pass this number of chaos engineering reliability
experiments before you can ship your new service
into production or before you can ship your change. Or if you're going
multi cloud, you might want to make sure that all the code can
pass through a set of chaos engineering experiments, because we
all know that lift and shift isn't really a thing, or at least not a thing that works out well, because environments are so different and there's a lot
of fine grained detail. So it's great if you
can automate your chaos engineering experiments, because you don't have to go
around teaching everyone the SRE best practices.
You can just build a system that can quickly go in and check,
and you're giving people tips and knowledge so that they can build
better systems going forward.
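Purely as an illustration of that kind of production-readiness gate, a CI step could look something like this; run-chaos-experiment.sh and the chaos-experiments directory are hypothetical placeholders for however you've codified your experiments:

    #!/usr/bin/env bash
    set -euo pipefail
    # Run every codified experiment; fail the pipeline if any one of them fails
    for experiment in chaos-experiments/*.yaml; do
      ./run-chaos-experiment.sh "$experiment"
    done
    echo "All chaos experiments passed; the service is cleared to ship"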
So with that, as I mentioned, I said I'd let you know how to get a free certification. We have a Chaos Engineering Practitioner and Professional certification, so you can head over to gremlin.com/certification to learn
more about that. And then I want to thank everyone for
being here, and I wish you a great day.