Transcript
Are you an SRE, a developer, or a quality engineer who wants to tackle the challenge of improving reliability in your DevOps? You can enable your DevOps for reliability with ChaosNative. Create your free account at ChaosNative Litmus Cloud.
Hi everyone. Chaos engineering has proved to be a great option to have in the SRE and SDE toolbox, but the transition into more complex systems is accelerating.
My name is Gunnar Grosch and I am a developer advocate at Amazon
Web Services. And in this session we'll look at how automated Chaos
engineering experiments can help us cover a more extensive set
of experiments than what we can cover manually, and how
they allow us to verify our assumptions over time
as unknown parts of the system change.
Chaos engineering, as most of you know,
is the process of stressing an application in a testing or production environment by creating these disruptive
events, such as server outages or API throttling,
and then observing how the system responds and implementing
our improvements. And we do that to prove or disprove our assumptions
about our system's capability to handle these
disruptive events. But rather than to let them happen
in the middle of the night or during the weekend, we can create
them in a controlled environment and during working hours.
And it's important to note that chaos engineering is not just about improving the resilience of your application. It also helps improve its performance, uncover hidden issues, expose monitoring, observability, and alarm blind spots, and more, like improving recovery time, operational skills, culture, and so on.
And chaos engineering is about breaking things, but in a controlled environment. So we create these well-planned
experiments in order to build confidence in our application
and in the tools we're using to withstand these turbulent conditions.
And to do that, we follow a well-defined scientific method that takes us from understanding the steady state of the system we're dealing with, to articulating a hypothesis, running an experiment, often using fault injection, verifying the results, and finally learning from the experiment in order to improve the system: improvements such as resilience to failure, its performance, the monitoring, the alarms, the operations, well, the overall system.
And today we're seeing customers using chaos engineering
quite a lot. And that usage is really growing. And we've seen two clear use cases emerge.
Perhaps the most common way of doing chaos engineering experiments is creating the one-off experiment. This is when you create an experiment by, for instance, looking at a previous outage or other events for your system. Or you can identify the services that have the biggest impact on your end users or customers if they go down or don't function properly, and then you create experiments for those. Or maybe you've built a new feature, added a new service, or just made changes to the code or the architecture, and you create an experiment to verify that the system works as intended.
And companies are doing this in different ways. Some have dedicated
chaos engineers creating and running the experiments.
For others, it's part of the SREs' responsibilities. Or, as we partly do at AWS, chaos engineering is done by the engineering teams themselves on their services.
The other very common use case is to use chaos engineering
as part of your game days. And a game day is the process of rehearsing ahead of an event by creating the anticipated conditions and then observing how effectively the team and the system respond.
And an event, well, that could be an unusually high traffic
day, a new launch, a failure, or something else.
And you can use a chaos engineering experiment to run a game day by creating the event conditions and monitoring the performance of your team and your system. So doing these one-off experiments and perhaps the occasional game day now and then gets us very far on the road to improving the resilience of our system. So isn't this enough? Well, it definitely can
be, and it is for many, but let's look at an example.
So this is a use case example of an
ecommerce web application.
So this is our application. It's a simple ecommerce site
where we have tons and tons of end users buying
things off the site continuously.
And we've built this using well-architected principles. So we've set it up using multiple instances running in Auto Scaling groups spread over multiple Availability Zones. We have our database instances using read replicas and replication across Availability Zones as well. So we're trying to build for resilience and reliability. And next, then?
Well, this is just one part of the system. Of course,
this is the product service, but we've added chaos engineering as a practice to our example.
In this case, we use it to verify the
resilience of our service and to learn and gain confidence in
the application. And we have also, of course, introduced CI/CD practices. Continuous integration and continuous delivery have not only made frequent deployments possible, they even encourage them. And that's what we do in our use case example. As we know, frequent deployments are less likely to break, and it's more likely that we'll be able to catch any bugs or gaps early on. But frequent deployments, when done perhaps daily, multiple times a day, or even by the hour in some cases, are really hard to cover manually with chaos engineering experiments. It's just hard to keep up with the pace.
And next, we of course also have multiple services within our application, services that do different things. So besides the product service, we have an order service and a user service. We of course need a cart service, a recommendation service, and a search service, all built using slightly different architectures. They have different code bases and perhaps even different teams building and running these parts of our application.
And next, then, well, there are dependencies between these services and different parts of our system. The cart service is of course dependent on the user service. We have the product service, and that's also used by the cart service and the order service; the order and cart services work together; the search service needs to be able to search our products; and the recommendation engine also uses our products, for instance. So we create these dependencies between our microservices or services within the application. And that's also hard, because teams operate differently, they might make changes to the application at different times, and you depend on that other service to be there. So creating experiments that are able to cover those changes, that's also quite hard.
So based on what we just looked at, some learnings from this very simple use case are that frequent deployments are hard to cover manually with chaos engineering experiments: just because we do them often, it's hard to create experiments and run them as frequently in a manual fashion, and covering a more extensive set of experiments is time consuming. And even though you might have full control of the service or microservice that you're working on, unknown parts of the system might change because other teams are making changes.
And finally, systems are becoming more complex. It's hard for anyone to create and keep a mental model of how the system works, let alone to keep documentation up to date. And that, well, brings us to automated experiments.
Automation helps us cover a larger set of experiments than what we can cover manually. Automated experiments verify our assumptions over time as unknown parts of the system are changed. And doing automated experiments really goes back to the scientific part of chaos engineering, in that repeating experiments is standard scientific practice in most fields. Repeating an experiment more than once helps us determine if the data was just a fluke or if it represents the normal case. It helps us guard against jumping to conclusions without enough evidence.
So let's take a look at three different ways that we can automate our
chaos engineering experiments. First off,
let's think about how our system evolves. I mentioned it in the use case example: even though we might have full control over our service, our microservice, the one we're working on, other teams or even third parties are making changes. They are delivering new code, they are releasing new versions of their service, and those might be services that you depend on. So the verification you got from doing a one-off chaos experiment a week ago or a month ago,
it might quickly become obsolete because of these other
changes. So by scheduling experiments to run on a recurring
schedule, you can get that verification over and
over again as unknown parts of the system change.
So let's take a look at how this can quite
easily be achieved.
So I'm building a simple scheduling service for my chaos engineering experiments, and I'm doing this using a simple serverless application, setting up a scheduled CloudWatch Events rule. This is based on the schedule that I define, basically a cron expression for when this should run.
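As a rough sketch (not the demo's exact configuration), that scheduling piece could be set up with boto3 along these lines; the rule name, cron expression, and Lambda ARN are placeholders:

```python
import boto3

events = boto3.client("events")

# Create (or update) a scheduled EventBridge rule; this cron expression
# fires once per day at 09:00 UTC.
events.put_rule(
    Name="daily-chaos-experiment",
    ScheduleExpression="cron(0 9 * * ? *)",
    State="ENABLED",
)

# Point the rule at the Lambda function that will start the experiment.
events.put_targets(
    Rule="daily-chaos-experiment",
    Targets=[{
        "Id": "start-experiment-lambda",
        "Arn": "arn:aws:lambda:eu-west-1:123456789012:function:StartFisExperiment",
    }],
)
```

The Lambda function would also need a resource-based permission allowing EventBridge to invoke it, which is omitted here for brevity.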
And this is a simple AWS Lambda function that will then take the experiment template that I define and run it on that schedule. In this case I'm using AWS Fault Injection Simulator, AWS FIS, but the same principle works no matter if you're using another system or your own scripts for doing chaos engineering experiments.
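A minimal sketch of such a handler using boto3 and the FIS StartExperiment API; passing the template ID through an environment variable is an assumption made here for illustration:

```python
import os
import uuid
import boto3

fis = boto3.client("fis")

def handler(event, context):
    """Triggered by the scheduled EventBridge rule; starts the FIS experiment."""
    template_id = os.environ["EXPERIMENT_TEMPLATE_ID"]  # placeholder template ID
    response = fis.start_experiment(
        clientToken=str(uuid.uuid4()),        # idempotency token
        experimentTemplateId=template_id,
    )
    return {"experimentId": response["experiment"]["id"]}
```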
So this is my application: the instances running, let's say, the product service that we looked at before, multiple instances running in different Availability Zones. I've created a simple experiment for that, so let's just create that template. It uses CPU and memory stress on our instances.
I can now take this simple experiment template and, using my scheduler, my Lambda function, deploy it and define which experiment should run on the schedule.
So I'm just pasting in my experiment template ID, setting up the schedule, in this case once per day with simple cron syntax, and deploying that. The deployment has started. Let's switch to the AWS Lambda console.
All right, so the deployment is done, switching back. This is Amazon EventBridge, our event bus, so we can just have a look at what actually got deployed. This is our schedule. I'm copying a sample scheduling event so we can try this out: back to AWS Lambda, pasting in that sample event, and now we can test it.
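The sample event EventBridge provides for a scheduled rule looks roughly like this (account, region, timestamp, and rule ARN are placeholders); pasting it into the Lambda console test simulates the schedule firing:

```python
# Roughly the shape of an EventBridge scheduled event as seen by the function.
sample_scheduled_event = {
    "version": "0",
    "id": "12345678-1234-1234-1234-123456789012",
    "detail-type": "Scheduled Event",
    "source": "aws.events",
    "account": "123456789012",
    "time": "2021-09-01T09:00:00Z",
    "region": "eu-west-1",
    "resources": ["arn:aws:events:eu-west-1:123456789012:rule/daily-chaos-experiment"],
    "detail": {},
}

# Invoking the handler locally with this event mirrors the console test:
# handler(sample_scheduled_event, None)
```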
So this is as if my schedule were to run right now. The Lambda function kicks off, and that should then start an experiment. Yes, in AWS FIS we can see that an experiment is running. This is a very simple example using CPU stress. This is one of the instances I'm logged into, and if we watch the CPU levels, we can now see that the instance is being stressed by my chaos experiment, and it's also using up memory on the instance. In this case my experiment is doing the steps that I defined in my template.
But since this is an automated experiment, it is a recurring experiment: it will run over and over again and verify the same set of conditions for me. And since it is automated, we need to have stop conditions in place, meaning that we have alarms that will stop the experiment if anything goes wrong. So that was our first example. Now let's look at the second one. The second approach to automation
is to run chaos experiment automation based on events. And an event, well, that's basically
anything that happens within your system. It could be an event related to the tech stack, for instance adding latency whenever there's an auto scaling event and new instances are started. Or maybe it's a business-related event, like an API being throttled when items are added to the cart. Building automation around these types of experiments can help you answer those quite hard-to-test questions: what if this happens when that is happening, even when that event is in a totally different part of the system?
So let's look at an example of that as well.
So once again, a simple automation set up using a serverless application. In this case it is an event-triggered experiment: setting up a rule based on a CloudWatch event, or rather an event in EventBridge. In this case the pattern is EC2 Auto Scaling. Whenever an EC2 instance is launched, it will kick off a Lambda function, and that Lambda function is pretty much the same as in the previous example, meaning that it will start an experiment.
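A hedged boto3 sketch of that event-triggered rule; the pattern matches successful EC2 Auto Scaling instance launches, and the Auto Scaling group name and Lambda ARN are placeholders:

```python
import json
import boto3

events = boto3.client("events")

# Rule that fires whenever EC2 Auto Scaling successfully launches an instance
# in the named Auto Scaling group.
events.put_rule(
    Name="on-asg-instance-launch",
    EventPattern=json.dumps({
        "source": ["aws.autoscaling"],
        "detail-type": ["EC2 Instance Launch Successful"],
        "detail": {"AutoScalingGroupName": ["product-service-asg"]},
    }),
    State="ENABLED",
)

# Target the same kind of experiment-starting Lambda function as before.
events.put_targets(
    Rule="on-asg-instance-launch",
    Targets=[{
        "Id": "start-experiment-lambda",
        "Arn": "arn:aws:lambda:eu-west-1:123456789012:function:StartFisExperiment",
    }],
)
```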
If we look in EventBridge, we can see that besides EC2 Auto Scaling events, we can create these types of event patterns for a whole bunch of different AWS services. Or we can create our own custom patterns, meaning it could be a pattern based on a business metric, something that happens, as I mentioned before, like items added to the cart, or a third-party service as well.
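For the custom-pattern case, the application itself could publish a business event to EventBridge and a rule could match it the same way; the source and detail-type names here are hypothetical:

```python
import json
import boto3

events = boto3.client("events")

# Publish a hypothetical business event from the application...
events.put_events(Entries=[{
    "Source": "ecommerce.cart",
    "DetailType": "ItemAddedToCart",
    "Detail": json.dumps({"cartId": "abc-123", "items": 3}),
}])

# ...and match it with a custom event pattern on a rule, just like the
# Auto Scaling example above:
custom_pattern = {
    "source": ["ecommerce.cart"],
    "detail-type": ["ItemAddedToCart"],
}
```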
All right, so we have that in place.
We just need to define which Auto Scaling group it should base this on. And of course I have stop conditions in place. Once again, these are automated experiments; we won't watch them manually every time, so we need to make sure that stop conditions are in place to stop the experiment if an alarm is triggered.
So, creating this new experiment template, we now go on to deploy my event-triggered experiment automation. And we have that in the AWS Lambda console.
Here we go. So, switching to EC2, let's just make a change to one of our Auto Scaling groups. Let's change the desired capacity from two instances to three instances, which is an auto scaling event. That gets picked up by EventBridge, which triggers our AWS Lambda function, which in turn triggers our AWS FIS experiment.
We can see that one experiment is running, and looking at a logged-in instance once again, we can see that the instance is using CPU and memory, meaning that our experiment is successful. With the stop conditions in place, we don't need to watch the experiment manually. This can happen over and over again, and if an alarm is triggered, it will automatically stop, so our customers and end users aren't affected by the experiment. And then
that gets us to our third example. The third way of doing automated experiments is perhaps the most popular one so far, and the one I'm definitely getting the most questions around. Continuous integration and continuous delivery, as I said before, encourage frequent deployments, and this means that the application is less likely to break. But we have this problem that we showed in the previous use case: we have frequent deployments, but aren't perhaps able to do chaos engineering experiments as frequently. By adding chaos engineering experiments as part of our delivery pipelines, we're able to continuously verify the output or behavior of our system. So let's look at an example of that as well.
So this is our pipeline. It is simply deploying to staging and deploying to production, a demo-purpose pipeline.
This is built using infrastructure as code, of course. So we have our pipeline, we have the stages: fetching the source, deploying to staging, and deploying to production. Now we're adding an experiment stage for the staging environment; let's just kick off the deployment of this updated template. So after deploying to staging, it will run an experiment on the staging environment.
And what it does, well, it's simply a state machine using AWS Step Functions that will start the experiment and then monitor that experiment to make sure that it either succeeds, or, if it fails, it will stop that experiment.
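The start-and-monitor logic in that stage boils down to something like the following Python sketch, shown here as plain boto3 polling rather than the actual Step Functions state machine definition; the template ID is a placeholder:

```python
import time
import uuid
import boto3

fis = boto3.client("fis")

def run_experiment_and_wait(template_id: str) -> bool:
    """Start an FIS experiment and poll until it finishes; True means success."""
    experiment_id = fis.start_experiment(
        clientToken=str(uuid.uuid4()),
        experimentTemplateId=template_id,
    )["experiment"]["id"]

    while True:
        status = fis.get_experiment(id=experiment_id)["experiment"]["state"]["status"]
        if status == "completed":
            return True                  # stage succeeds, pipeline proceeds
        if status in ("stopped", "failed"):
            return False                 # stage fails, pipeline stops
        time.sleep(10)                   # still pending/initiating/running
```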
So back to the pipeline, and now we have this new stage, the experiment stage, in place. Let's give it a try. And as one does for a demo, let's just edit straight in GitHub, make a small change to our code base, and commit. That straightaway kicks off the pipeline: fetching the source, deploying to staging, which is a quick process in our demo environment. Then it gets to the experiment stage, where it will initiate our AWS Step Functions workflow, which in turn starts our AWS FIS experiment. So that is running, as we can see. And for the purposes of this demo, this is a really quick experiment, so it will quickly finish and complete so we can see what happens. All right, it's already completed.
Let's switch back to the pipeline.
Succeeded, and then it moves on to the next stage, which is deploying to production. So, a very simple example of how we can add our chaos engineering experiments to a pipeline.
So let's do another one. What if that experiment fails? What if an alarm is triggered and it doesn't work as intended? Let's release a new change: fetching the source from GitHub once again, then on to deploying to our staging environment. As soon as that is done, it will kick off our experiment once again. There we go, it's in progress. Let's check AWS FIS. The experiment is running.
So what I can do now is use
the AWS CLI and just set the alarm
to be triggered. So I'm setting the alarm state for our
stop condition.
Let's try that.
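That alarm-forcing step corresponds to the CloudWatch SetAlarmState API; here is the boto3 equivalent of the `aws cloudwatch set-alarm-state` CLI call, with a placeholder alarm name:

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

# Force the stop-condition alarm into the ALARM state to simulate a failure.
cloudwatch.set_alarm_state(
    AlarmName="product-service-errors",
    StateValue="ALARM",
    StateReason="Testing the FIS stop condition for the demo",
)
```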
This means that it will act as if an alarm was triggered, and FIS straightaway stops our experiment.
Switching back to the pipeline.
We can see that it failed.
And the failed experiment in this case, well, it means
that it won't proceed to the next step. And the
next step would be to deploy to production. But for some reason,
our experiment failed. Might be that something is wrong
with the code, something doesn't work, we have more latency, or whatever it
is we're testing with our experiment. And in this case,
we won't move that into production.
We can build on this as well, of course, by adding an experiment stage after deploying to production, as an extra way of testing and making sure that everything works as intended. So this was an example of how to add experiments to your pipelines, first by showing what happens when it works, it just proceeds to the next step, and then when it fails, it stops the pipeline. That shows the value of having stop conditions in place: a stop condition watches your application behavior and stops an experiment if an alarm is triggered. What kind of stop conditions you'll use is very much up to you and the use case, the traditional "it depends" answer. But for instance, it might be that you're seeing fewer users adding items to the cart, or it might be a very technical metric, for instance that you're seeing CPU levels above a certain threshold, or things like that.
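For the technical-metric case, such a stop-condition alarm could be defined roughly like this with boto3; the alarm name, Auto Scaling group name, and threshold are placeholders:

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

# A stop-condition alarm on a purely technical metric: average CPU on the
# product-service Auto Scaling group above 80% for two consecutive minutes.
cloudwatch.put_metric_alarm(
    AlarmName="product-service-high-cpu",
    Namespace="AWS/EC2",
    MetricName="CPUUtilization",
    Dimensions=[{"Name": "AutoScalingGroupName", "Value": "product-service-asg"}],
    Statistic="Average",
    Period=60,
    EvaluationPeriods=2,
    Threshold=80.0,
    ComparisonOperator="GreaterThanThreshold",
    AlarmDescription="Stop condition for FIS experiments on the product service",
)
```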
So with these three options, the recurring scheduled experiments, the event-triggered experiments, and the continuous delivery experiments, we have three different ways to automate our chaos engineering experiments.
So should you then stop doing one-off experiments and the periodic game day? Well, no, you shouldn't. They should still be at the core of your chaos engineering practice. They are a super important source for learning, and they help your organization build confidence. But now you have yet another tool to help you improve the resilience of your system: the automated chaos experiments. One way to think about it is that experiments you start off by creating as one-offs or as part of your game days can then turn into experiments that you run automatically. After doing the experiment manually to start with, it can be set to run every day, every hour, or on every code deploy. And that brings us to
a summary with a recap of some takeaways.
So: automation helps us cover a larger set of experiments than what we can cover manually; automated experiments verify our assumptions over time as unknown parts of the system are changed; safeguards and stop conditions are key to safe automation; and introducing automated chaos engineering experiments does not mean that you should stop doing manual experiments. If you just can't get enough of chaos engineering to improve resilience, I've gathered some code samples, the ones used in the demos, and some additional resources for you at the link shown on the screen. Just scan the QR code or use the Gunnar Grosch chaos link shown.
And with that, I want to thank you all for watching. We've looked at
how to improve resilience with automated chaos engineering.
As ever, if you have any questions or comments, do reach out on Twitter at Gunnar Grosch, as shown on screen, or connect on LinkedIn. I am happy to connect. Thank you
all for watching.