Transcript
Hey, everyone, this is Hari Shah. I am a solutions architect
at AWS. Today we are going to talk about chaos engineering.
My goal for this session is to give you a high level introduction
to chaos engineering and talk a little bit about the best practices.
So if you are new to chaos engineering and are curious to learn what it is, this session is for you.
Here is our agenda. I'm going to start by talking about
why we need chaos engineering. Then we'll discuss a little bit about what it is and define chaos engineering, and then wrap up with how you can implement chaos engineering practices in your organization.
Let's dive in. So let's start with the why.
Why do we need chaos engineering? Okay,
so this is from December 2022,
when Southwest Airlines, one of the major airlines
in the US, had a huge meltdown.
Between December 21 and December 30, they had to cancel around 15,000 flights. Remember, this was during the peak travel period. On some days they had to cancel around 60% of their scheduled daily flights. Beyond the monetary impact, which Southwest Airlines reported as around $1.2 billion, this was a major PR disaster for the airline.
What triggered this whole situation was a massive winter storm across multiple cities in the US. That caused the airline to cancel or delay many of its flights during that period.
But what turned this situation into a major disaster for Southwest was the backend crew scheduling system. The scheduling system couldn't handle the volume of requests for scheduling changes, and it just went down. But airlines are not the only ones who have outages.
This is a snapshot of a service interruption
that Meta had back in 2021.
So in October 2021,
Facebook, Instagram, WhatsApp, all were
down for more than 5 hours. I personally
think this was a good thing for humanity.
We all got a chance to step out of social media, talk to
each other, get some fresh air.
But for Meta, this translated to millions in lost ad revenue.
If you look at the root cause analysis for this incident from Meta, you can see that this was caused by human error. One of the engineers, while performing routine maintenance, unintentionally disconnected Meta's data centers from the Internet. So I have one more
example, and this one is closer to home for
me. So AWS, Amazon Web Services, had a service interruption for one of its services, Amazon S3, back in February 2017. Amazon S3, if you're not familiar with it, is an object storage service, and it was one of the early services that AWS launched. Most customers directly or indirectly use Amazon S3, so when S3 had an interruption in 2017, many of the big customers were directly impacted. So this was a big deal.
If you look at the root cause analysis from AWS, again, this was human error. One of the engineers, while running some routine commands, entered an input incorrectly, which removed a larger set of servers than intended. And these are
just some of the examples. I have a few more here from
companies like Starbucks and Akamai and British Airways.
In fact, these are so common that if you search for any date along with the word "outage", chances are you will find one or more such incidents.
And these outages have a significant business and financial impact on organizations. I have some numbers here. For example, the cost of an hour of downtime for a business-critical application can be around $1 million. So the question is: why are these issues not being surfaced during the testing phase?
Companies like Southwest or AWS or Starbucks,
they don't put things into production without proper validation.
So it does go through testing. So why
are they not capturing these issues?
The reason is that when we do the testing,
whether it's unit testing, integration, or regression,
we know the input to the test scenario and the expected
output, right? So what the test case
or test scenario does is to provide that input
and to validate that the actual output matches
the expected output. So if we plot the inputs and outputs in this framework, testing focuses on the top-left quadrant, the one in the green circle, where both input and output are known. But most of the situations that we discussed earlier in the session fall into the two right-side quadrants.
In some scenarios, like Southwest,
the input is known. Southwest probably knew that
weather could cause some interruptions, but they
didn't know the output. They didn't know the impact of that.
In other situations, like the AWS and Meta incidents, it's very hard to predict the exact human error and the impact of that particular action. So in order to dive deep into the known unknowns and uncover some of the unknown unknowns, we need a different approach than regular testing. And this is where chaos engineering comes into the picture. So let's talk about what chaos engineering is. Let's start with a bit of history.
Chaos engineering came from Netflix.
Netflix was one of the early adopters of cloud.
They moved workloads to AWS in 2008.
What Netflix realized is that in
the cloud, they have to make applications more
resilient to underlying infrastructure
failures. In order to do that, they created a
tool called Chaos Monkey. What Chaos Monkey did was run in production and randomly terminate compute instances, EC2 instances. This was hugely unpopular
in the beginning within Netflix, because many
application teams realized that it was impacting their workloads in production. But that was actually the intent of this tool. The intent was to
uncover issues in a controlled manner and
move the ownership of building resilient applications
to the application teams.
Now, that worked. Gradually, the resiliency
posture of these applications improved, and Netflix
started creating more tools like that.
They called this the Simian Army. There was a tool which, for example, simulated an Availability Zone failure, and there was one which even dropped an entire region, simulating a DR (disaster recovery) scenario. Netflix open-sourced these tools, and more and more organizations started adopting them for their workloads.
So Netflix teamed up with some of these early adopters
and created a manifesto called the Principles of Chaos Engineering.
This is how the manifesto defines chaos engineering.
Chaos engineering is the discipline of experimenting
on a system in order to build confidence in the system's capability
to withstand turbulent conditions in production.
Now, let's unpack that a little bit.
So, first, chaos engineering is about experimentation,
not testing. What's the difference? In testing, as we know, both inputs and outputs are known, right? So all you're doing is validating the actual output against the expected output. But in experiments, the output is unknown. So you start with a hypothesis, and then you create experiments to determine whether your hypothesis is valid or invalid.
Now, the goal of chaos engineering is
to build confidence in the system's ability to
withstand chaos in production.
Now, that's important. Think of
chaos engineering like a vaccine. You are injecting
a little bit of chaos in a controlled fashion to
build resiliency or to build immunity.
So that's the goal.
Many would think that chaos engineering is all about breaking
things in production, terminating instances,
but that's not the case. Chaos engineering
is all about uncovering the chaos that is already inherent, already there in the system. All you're doing is performing controlled experiments to uncover those scenarios so that you can proactively address them before an actual outage happens.
With that, let's look at how to approach
chaos engineering and perform your experiments.
This diagram shows the high level steps involved
in building your experiments. It all starts
with understanding the steady state behavior
of your application. You need to know what a steady
state looks like for your application before you can build your hypothesis.
For this, you need to have a solid observability framework. Once you observe the steady state, the next step is to build your hypothesis. This is where you build multiple hypotheses that you want to run experiments to validate. Once you have the hypotheses, you will run experiments, and the goal of each experiment is to verify the behavior and validate it against the hypothesis that you have. And if there is a deviation, this is where you need to act: make the necessary changes to improve the resiliency of your application, and then repeat the process. So this is a continuous cycle to understand your application's resiliency and incrementally, continuously improve it.
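To make that cycle concrete, here is a minimal Python sketch of the loop. The helper functions and the numbers inside them are placeholders for your own observability and fault-injection tooling, not part of any particular framework.

```python
# A minimal sketch of the chaos engineering cycle. The helper functions
# are stand-ins for real observability and fault-injection tooling.

def observe_steady_state() -> dict:
    """Capture baseline signals (latency, error rate) for the application."""
    return {"p99_latency_ms": 250, "error_rate": 0.001}  # illustrative values

def run_experiment(fault: str) -> dict:
    """Inject a fault and measure the same signals (stubbed here)."""
    return {"p99_latency_ms": 400, "error_rate": 0.002}  # illustrative values

def within_hypothesis(baseline: dict, observed: dict) -> bool:
    """Hypothesis: under the fault, p99 latency stays within 2x the baseline."""
    return observed["p99_latency_ms"] <= 2 * baseline["p99_latency_ms"]

baseline = observe_steady_state()
for fault in ["dependency_latency", "instance_termination"]:
    observed = run_experiment(fault)
    if not within_hypothesis(baseline, observed):
        print(f"Deviation found for {fault}: improve the system, then repeat the cycle")
```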
Let's double-click on each of these phases and understand them a little bit better. So the first step,
as I mentioned, is all about observing the
steady state of your application. When I say observing an application, what you need to do is collect all the signals from your application and build an end-to-end view so that you can understand the state and health of your system. When talking about observability, there are three key types of telemetry data that you need to collect: logs, across your whole stack; metrics; and traces. Now, the key is not only collecting these, but also correlating or mapping these signals so that you have an overall understanding of the steady state and the health of the system.
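As one hedged illustration, here is a sketch of capturing a single steady-state signal, p99 latency, from CloudWatch with boto3. The namespace, metric name, and load balancer dimension are illustrative placeholders you would swap for your own.

```python
# Sketch: capture a steady-state baseline (p99 latency over the last hour)
# from CloudWatch. Namespace, metric, and dimension values are illustrative.
from datetime import datetime, timedelta, timezone
import boto3

cloudwatch = boto3.client("cloudwatch")
now = datetime.now(timezone.utc)

response = cloudwatch.get_metric_statistics(
    Namespace="AWS/ApplicationELB",
    MetricName="TargetResponseTime",
    Dimensions=[{"Name": "LoadBalancer", "Value": "app/my-alb/1234567890"}],  # placeholder
    StartTime=now - timedelta(hours=1),
    EndTime=now,
    Period=300,
    ExtendedStatistics=["p99"],
)

baseline_p99 = [dp["ExtendedStatistics"]["p99"] for dp in response["Datapoints"]]
print(f"Steady-state p99 latency samples (seconds): {baseline_p99}")
```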
Now, once you have the steady-state behavior, you can start building your hypotheses around that steady state. Here are a couple of examples of hypotheses around different goals.
So if you want to validate the availability of the system, a hypothesis can be that under the specific conditions you want to validate, customers still have a good experience, or the application is still available. For a security hypothesis, it can be that if a certain scenario happens, under certain conditions, the security team gets paged or a certain alarm goes off. Now, you build these high-level hypotheses because you don't have a clear understanding of the output, but you do know what the desired behavior looks like.
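One way to make hypotheses like these checkable by your tooling is to write them down as simple assertions over the signals you already collect. This is just a sketch; the thresholds and event names below are made up for illustration and should come from your own steady state.

```python
# Sketch: hypotheses expressed as checkable assertions.
# Thresholds are illustrative; derive yours from the observed steady state.

STEADY_STATE = {"availability": 0.999, "p99_latency_ms": 300}

def availability_hypothesis(observed: dict) -> bool:
    """Hypothesis: while the fault is injected, availability stays above
    99.5% and p99 latency stays within 2x the steady state."""
    return (
        observed["availability"] >= 0.995
        and observed["p99_latency_ms"] <= 2 * STEADY_STATE["p99_latency_ms"]
    )

def security_hypothesis(paging_events: list[str]) -> bool:
    """Hypothesis: simulating the scenario pages the security team."""
    return "security-oncall-paged" in paging_events  # hypothetical event name
```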
So once you have your hypotheses, you can start planning your experiments. Now, choosing the right experiments is key to getting the most out of the investment you're putting into chaos engineering. So start with the most common scenarios that can impact your application, with the goal of identifying the expected behavior and improving your application's resiliency against those failures. Here are some of the common failure scenarios that you can build your experiments around. Single points of failure: identify single points of failure within your application stack and build your experiments around them. Excessive load: put excessive load on different components and see how they react. Artificial latency: introduce artificial latency between components and see the overall application behavior when that happens (there's a small sketch of this one below). Misconfigurations, bugs, etcetera are all common scenarios to get started with.
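For example, the artificial latency scenario can be sketched at the application level with a simple wrapper. Dedicated chaos tools usually inject latency at the network or infrastructure layer instead, but the idea is the same; the function name and delay values below are made up for illustration.

```python
# Sketch: inject artificial latency into calls to a downstream dependency.
import random
import time
from functools import wraps

def inject_latency(min_ms: int = 100, max_ms: int = 500, probability: float = 0.3):
    """Decorator that delays a fraction of calls by a random amount."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            if random.random() < probability:
                time.sleep(random.uniform(min_ms, max_ms) / 1000.0)
            return func(*args, **kwargs)
        return wrapper
    return decorator

@inject_latency(min_ms=200, max_ms=800, probability=0.5)
def call_payment_service(order_id: str) -> str:
    # Placeholder for a real downstream call.
    return f"payment accepted for {order_id}"

print(call_payment_service("order-42"))
```

While something like this runs, you would watch your end-to-end signals and compare the behavior against your hypothesis.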
So the end goal for chaos engineering is to perform these experiments in production.
But for many organizations, starting by running these experiments in production is a great risk. What I recommend is to start running these experiments in a lower environment. Choose a very limited, controlled scope that you have a better handle on, run these experiments in lower environments, and observe the behavior. Now, I would also highly recommend
adding guardrails to these experiments so that, in case you see unexpected behavior in the system, you have a plan to roll back the experiment and get the system back to its previous state.
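As one hedged example of such a guardrail, the sketch below polls a CloudWatch alarm and stops a running experiment if the alarm fires. The alarm name and experiment ID are placeholders; the managed tools mentioned at the end of this session usually let you express the same idea declaratively as stop conditions.

```python
# Sketch of a guardrail: stop the experiment if a CloudWatch alarm fires.
# The alarm name and experiment ID are placeholders.
import boto3

cloudwatch = boto3.client("cloudwatch")
fis = boto3.client("fis")

def guardrail_tripped(alarm_name: str) -> bool:
    alarms = cloudwatch.describe_alarms(AlarmNames=[alarm_name])["MetricAlarms"]
    return any(a["StateValue"] == "ALARM" for a in alarms)

def enforce_guardrail(experiment_id: str, alarm_name: str) -> None:
    if guardrail_tripped(alarm_name):
        # Roll back: abort the experiment and let the system recover.
        fis.stop_experiment(id=experiment_id)
        print(f"Guardrail {alarm_name} tripped; stopped {experiment_id}")

enforce_guardrail("EXP1234567890abcdef0", "checkout-error-rate-high")  # placeholders
```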
Now, once you run these experiments and gain confidence, start moving them to production and running them there. As you gain more and more confidence, you can increase the scope, add more experiments, and iterate. And make sure that you're automating these experiments, because systems do change, they do evolve. So you have to continuously run these experiments to make sure that the system behavior is not deviating from your hypotheses.
And the last step is to verify the results
of your experiments and then act upon them. In this step, it's critical that you assess the impact of your findings, the business impact, and then prioritize the findings accordingly. This way, if it's, for example, a security-impacting issue, it gets higher priority and needs to be addressed immediately, compared to some of the other findings.
I want to wrap up this session by giving you pointers to
some of the tools available to automate your chaos
engineering experiments. If your workloads are running on AWS, AWS has a managed service called AWS Fault Injection Service (FIS), which allows you to build hypotheses and run experiments. The great thing about FIS is that it has native integrations with many AWS services, so it makes it very easy for you to build and run experiments.
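As a minimal sketch, assuming you have already created an FIS experiment template (for example, one that terminates a subset of EC2 instances and has alarm-based stop conditions) and you know its ID, you could start it and watch its status with boto3 roughly like this. The template ID below is a placeholder.

```python
# Sketch: start a pre-created FIS experiment template and poll its status.
import time
import boto3

fis = boto3.client("fis")

experiment = fis.start_experiment(
    experimentTemplateId="EXT1234567890abcdef0"  # placeholder template ID
)["experiment"]

while True:
    state = fis.get_experiment(id=experiment["id"])["experiment"]["state"]
    print(f"Experiment {experiment['id']} status: {state['status']}")
    if state["status"] in ("completed", "stopped", "failed"):
        break
    time.sleep(30)
```

In practice, you would define the template itself (targets, actions, stop conditions) in the console or as infrastructure as code, and keep a start-and-monitor step like this in your automation.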
Similarly, if you're on Azure, Azure has Azure Chaos Studio, which you can explore. There is also a commercial offering called Gremlin, which is very popular and again allows you to build hypotheses and run experiments. The last one in the list here, Litmus, is an open source option. So if you are leaning towards exploring open source tools to automate your experiments, that's a tool to consider. All right, that's it.
I hope you found this session useful. Thank you so much for watching.