Transcript
Hello and welcome to our talk, Chaos in the Cloud. I'm going to start by telling
you a little bit about what we are going to do in the
next 45 minutes. We are going to start with a recap
of what chaos engineering is. Then we are going to talk about
chaos engineering in the cloud, taking an aside into
what the cloud is and what its resiliency features are,
and we're going to talk about the AWS Fault Injection Simulator,
which is the AWS fully managed chaos engineering service.
Then we are going to have a demonstration of how that looks in practice.
Finally, we're going to share some best practices on
working with the Fault Injection Simulator and architecting
for resiliency in general.
What is chaos engineering and why do we do it?
I like to start this talk with a quote from
Werner Vogels. He's the CTO of Amazon.com,
and he said, failures are a given, and everything will
eventually fail over time.
And that's so because with the rise of microservices and distributed
architectures, the web and our applications have grown
increasingly complex. And as a result of that,
random failures have grown
difficult to predict. At the same time, our dependence
on those systems has only ever increased. And so as we
moved away from monolithic architectures towards microservices
to become more agile, build faster, and be able
to scale, we naturally add some complexity between
those microservices.
And chaos engineering is experimenting on those distributed
systems to build confidence in those systems and make sure they can
survive failure,
and to get a better understanding of their failure
conditions and of the performance envelopes. And that's all contained
in the quote I have on the slide from principlesofchaos.org.
So, diving a little bit deeper into chaos engineering,
what is it? We generally talk about kind of a loop,
stressing a system, observing how it reacts, and then improving
the system. And we do that to improve resiliency and
performance, uncover hidden issues in our architectures,
and expose blind spots, for example, in monitoring observability
and alarming. And usually we also
achieve a degree of improvement in recovery times,
improving our operational skills with the system and
the culture in our tech org.
And I talked about that being a loop and going
deeper into the loop. We see a loop of setting
an objective. We want to achieve something with our system.
We want to make it resilient against single AZ failure, for example.
We set that objective, then we design and implement
our system. We design
an experiment to test if our objective is actually
achieved. We run the test, we operate the system,
and from the experiments and the system operation
we learn about the system. We can respond to failure conditions,
improve the system and set new objectives and get
better over time.
So, chaos engineering locally on your
local machine or on your server, versus in the cloud. I'm going to start
by talking about chaos engineering in Java,
and I just chose Java because it's one
of the most widely used languages and it
has awesome support for chaos testing, both via
libraries and via tooling. What you generally see
here is that a lot of those tests are either JVM
based, they run inside the local JVM,
they might be agent based, they run as a sidecar next to the JVM
on the same machine, or they might be moved into
the service mesh connecting many systems together. And that's really
an awesome way to test your application code and to make
sure it's resilient and to be able to control those experiments
from your application code.
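As a hedged illustration (this example is not from the talk): one popular library-based option in the Java ecosystem is Chaos Monkey for Spring Boot, which is switched on through application properties. The property names below follow the library's documentation as I recall it and may differ between versions, so treat this as a sketch rather than a reference configuration.

```yaml
# Sketch of an application.yml enabling Chaos Monkey for Spring Boot
# (assumed property names; check the library docs for your version).
spring:
  profiles:
    active: chaos-monkey          # the library activates under this profile
chaos:
  monkey:
    enabled: true
    watcher:
      rest-controller: true       # attack @RestController beans
      service: true               # attack @Service beans
    assaults:
      level: 5                    # roughly every 5th request is attacked
      latency-active: true
      latency-range-start: 1000   # injected latency between 1 and 3 seconds
      latency-range-end: 3000
      exceptions-active: false
```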
And if that is so awesome and it already works quite well,
it's a great improvement and everybody should do more of it.
Why would we want to move chaos engineering into the cloud?
We believe that using cloud services
we gain a more holistic monitoring view
of how our system works. We can identify
cloud based challenges, for example around limits, around scaling,
around interacting with other accounts. For example,
we can run dedicated scaling experiments to learn
where the failure points of our application are if we scale to
a very large degree or very fast.
And finally, we can do validation of disaster recovery strategies
where we gain the ability to run chaos
experiments on a system and see if it, for example,
can be switched over to another region without
downtime or within a specified time frame.
And this is the point where I'm going to take an aside into
the resiliency features of the cloud, of AWS, and
how they interact with applications. What we see
on the slide is the region design of AWS,
and we see a region is made up of multiple availability zones
here, abbreviated as AZ.
And each AZ is made up of one or
more data centers. And that's the only time I'm actually
going to talk about data centers.
The fundamental logical unit of resilience
in AWS you should think about is an availability zone,
because those availability zones are independent
from each other. And what that means is they have
independent network providers, they have independent power connections,
and if there is a geographic feature
in the city where the region is located, the AZs
will be located while being
mindful of that geographic feature. So for example, if there is a
river flowing through the city, not all of the AZs will be close to the
river, so that we can be sure that in case of a flooding,
not all of the AZs go down. And so as
you distribute an application across AZs,
it will gain a degree of resilience:
it will become resilient against single
availability zone failure, even though those failures
are quite unlikely in practice.
What does it actually look like? How do we distribute an application
across multiple availability zones?
What we see here is a simplified architecture
of a web application. We see load balancing,
distributing traffic to different instances, and that means
EC2 instances for us, and those instances act as
application servers. And then underneath
that we see a primary and a standby DB instance,
and we see that those are also distributed across availability zones.
And so in the end, in the event that, for example, availability zone A
goes down, the load balancer would distribute traffic to
B and C, and there would be an automatic DB
switchover to make the current standby instance in C
the primary instance, and your application would continue to
work.
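To make the pattern a bit more concrete, here is a hedged CloudFormation-style sketch of the same idea; resource names and sizes are placeholders, not taken from the talk. The load balancer and the application servers span several subnets, one per availability zone, and the database runs in Multi-AZ mode so a standby exists in another zone.

```yaml
# Sketch only: a fragment showing the multi-AZ pattern, not a deployable template.
Resources:
  WebLoadBalancer:
    Type: AWS::ElasticLoadBalancingV2::LoadBalancer
    Properties:
      Type: application
      Subnets:                     # one public subnet per availability zone
        - !Ref PublicSubnetA
        - !Ref PublicSubnetB
        - !Ref PublicSubnetC
  AppServerGroup:
    Type: AWS::AutoScaling::AutoScalingGroup
    Properties:
      MinSize: "2"
      MaxSize: "6"
      VPCZoneIdentifier:           # spread EC2 application servers across AZs
        - !Ref PrivateSubnetA
        - !Ref PrivateSubnetB
        - !Ref PrivateSubnetC
      LaunchTemplate:
        LaunchTemplateId: !Ref AppLaunchTemplate
        Version: !GetAtt AppLaunchTemplate.LatestVersionNumber
  Database:
    Type: AWS::RDS::DBInstance
    Properties:
      Engine: mysql
      DBInstanceClass: db.t3.medium
      AllocatedStorage: "20"
      MultiAZ: true                # RDS maintains a synchronous standby in another AZ
```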
Now that we have a certain understanding of resiliency in the cloud and
why we would want to actually test it, let me go deeper
into the AWS fault injection simulator and how it supports
you in running chaos experiments.
The AWS fault injection simulator is our fully
managed chaos engineering service. It's an
easy way to get started and it lets you reproduce real
world failures, whether they are very simple, like stopping an instance,
or more complex, like throttling APIs.
And finally, this service fully embraces the
idea of safeguards. So you can make sure that
a running chaos experiment does not impact your production deployment.
I'm going to go into detail about all of those three
features that I just mentioned. It's easy
to get started because we spent actually a lot
of time making it easy to get started, because when we talk to customers about
chaos engineering, what customers repeatedly told us
is that it is a little bit hard to get started with chaos engineering.
And so we want to make that as easy as possible.
You can use the console to get familiar with the service and actually
try things out, and then you can use the CLI to take advantage
of the templates and integrate the service with your CI/CD
pipelines. And those templates are JSON or YAML
files that you can share with your team. You can version control them
and use all of the benefits and best practices associated
with version control, like code review.
And you have conditions.
You can run experiments in sequence, and you can run experiments in
parallel. Sequences, for example, are used to test
the impact of gradual degradation, like increasing latency.
And parallel experiments are used to test the impact of multiple concurrent
issues, which is actually how a lot of real world outages happen.
You don't see outages because of a single failure,
but because of a cascade of single failures leading up to
a real world outage. It currently supports services
like EC2, RDS, ECS, and EKS;
so virtual instances, databases,
container runtimes, and managed Kubernetes.
And we are working all the time to provide support
for more services and for
more complex conditions.
And just to hit the nail on the head here, those faults,
they are really happening at the service control plane level.
So an instance might actually be terminated, memory is actually being
utilized, APIs are actually being throttled. It's not
faking something with metric manipulation, but it's
actually impacting how things work
at the control plane level. So you will have to use extra caution when using the service.
And to enable you to do that, we have safeguards.
Safeguards act as the automated stop button,
a way to monitor the blast radius of your experiments and make
sure that it's contained and that failures created with the experiment
are rolled back if alarms go off.
Those are kind of the runtime controls, what happens during an
experiment. And the service of course integrates
with identity and access management, IAM, and
IAM controls can be used to control what fault
types can be used in an experiment and what resources
can be affected. And that's of course working with tag based policies.
So for example, only EC2 instances with a tag
environment equals test can be affected. That's one
of many possible safeguards you can
implement.
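As a hedged sketch of such a tag-based safeguard (the names and the exact action list are placeholders, not from the talk), the policy attached to the experiment role can restrict the fault actions to resources carrying a specific tag:

```yaml
# Sketch: a CloudFormation fragment for an IAM policy that only lets FIS
# stop/start EC2 instances tagged environment=test (illustrative, not the
# service's reference policy).
FisExperimentPolicy:
  Type: AWS::IAM::ManagedPolicy
  Properties:
    PolicyDocument:
      Version: "2012-10-17"
      Statement:
        - Sid: AllowFaultsOnTestInstancesOnly
          Effect: Allow
          Action:
            - ec2:StopInstances
            - ec2:StartInstances
          Resource: arn:aws:ec2:*:*:instance/*
          Condition:
            StringEquals:
              aws:ResourceTag/environment: test
```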
What kind of targets and actions are supported?
There's a host of targets and
actions supported across the categories of compute, storage,
networking, databases, and management services.
And now we will dive somewhat deeper into the architecture
of the fault injection service and how it
interacts with the different components of the AWS
cloud. And we see here a
diagram of how the service works. At a high level, we start
with an experiment template, which
contains different fault injection actions, targets that
will be affected, and safeguards to be run during the experiment. And you can
see that here, slightly to the left of center, labeled
experiment template, in white.
And then when we start an experiment,
the Fault Injection Simulator performs the actions. So it
injects faults into supported resources that
are specified as the targets. And then
those faults interact with your resources,
and that will change what happens in your monitoring,
in Amazon CloudWatch for example, or in
your third-party monitoring solution. And then you can take
action based on the observability
metrics you have there, the alarms you have there, and the logs you
have there, by using Amazon EventBridge to,
for example, stop an experiment if an alarm is
triggered, or to start a second experiment
at that point in time. And so
now we have an understanding of what chaos engineering
is, why we want to do it in the cloud, and how
the AWS fault injection simulator works.
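Before the handover, here is a hedged sketch of what such an experiment template can look like as YAML input for the CLI (for example with aws fis create-experiment-template --cli-input-yaml). The ARNs, tags, and the alarm are placeholders; the point is only the overall structure of actions, targets, and stop conditions described above.

```yaml
# Sketch of a minimal experiment template: stop one tagged EC2 instance for
# five minutes, and abort if a CloudWatch alarm fires (placeholder ARNs/tags).
description: Stop one test instance and observe recovery
roleArn: arn:aws:iam::123456789012:role/FisExperimentRole
targets:
  testInstances:
    resourceType: aws:ec2:instance
    resourceTags:
      environment: test            # tag-based safeguard, as discussed above
    selectionMode: COUNT(1)        # pick a single matching instance
actions:
  stopInstance:
    actionId: aws:ec2:stop-instances
    parameters:
      startInstancesAfterDuration: PT5M   # restart automatically after 5 minutes
    targets:
      Instances: testInstances
stopConditions:
  - source: aws:cloudwatch:alarm   # the automated stop button
    value: arn:aws:cloudwatch:us-east-1:123456789012:alarm:HighErrorRate
tags:
  Name: stop-test-instance
```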
And this is the point where I'm going to hand over to Bent to tell you
about some exciting new scenarios from
re:Invent, and to show you a demonstration of how
the Fault Injection Simulator works in practice.
All right, thank you very much, Oliver. We're now
going to take a closer look at those two new scenarios
we launched at re:Invent. So the first one is about
cross region connectivity disruptions. So we have
customers who have the requirement of operating their application
at the highest possible availability rates. And those
customers typically tend to architect their applications
to span two regions. And those customers also
asked us whether we could maybe help them make
chaos testing even easier, so they can test whether their
applications can really withstand a connectivity disruption
between two regions. So for instance, think of an
active active or active passive kind of setup between two regions
where all of a sudden your database, like a DynamoDB table,
won't replicate new data. This is now possible
with the new cross region connectivity
disruption scenario that is available with the fault injection simulator.
Another new scenario that we also launched is around availability
zone power interruptions. So we
already mentioned that those kinds of scenarios have a really low
likelihood of happening. However, there are customers who still
want to experience how their application would
behave in such an event. So in the event of a power interruption,
you would see, for instance, scenarios like EC2 instances
or containers stopping out of nowhere. And with
the new availability zone power interruption scenario,
you will basically be able to test how your application
behaves under those kinds of events.
But now we basically want to have a look at
our own demo. So we will take you through
a journey of testing an application
and improving its resiliency when it comes
to availability zone failures. So what
we brought for today here is a simple workload
that is currently running in a single container.
So we have a container that is basically a simple API that
responds with a pong to every request sent to the
container. And this is currently running on EKS.
It's a single pod hosted on a single node that we have in our
cluster, which is currently running in availability zone A.
And I personally am not the most Kubernetes-experienced
guy, so I'm not really sure how Kubernetes
behaves in the case of
an availability zone disruption. And I personally
want to have a really low cost for my application.
So I would just want to check whether running a single pod
is enough to tolerate the failure of an AZ, or
if, for instance, Kubernetes will automatically schedule
this pod for me on a new node that is hosted in a different availability zone.
But before we jump into the console, I already want to show you
the result of the experiment.
So of course this is not going to happen. We won't see an automatic pod
reassignment to a different node hosted in a different availability zone.
This is not how it actually works. So instead what
we would want to do here to make the system more resilient
is basically to update our deployment to
run in at least two different
availability zones. And this is actually what we are going to
take a look at now in the demo. Now it's time to look
at the fault injection simulator in action. For this we are
going to first take a look at the application that we deployed
in Kubernetes. Then we are going to set up a load testing tool
to send frequent requests against our application to
inspect some metrics like the availability and also
the response time of our requests.
Then we are setting up FIS to
introduce some chaos in our application. And then we are revisiting
our load testing tool to see the impact of our chaos experiment.
So let's start by taking a look at the Kubernetes manifests.
We can quickly see here our deployment.
This is basically the Kubernetes resource that you need to deploy
containers in your cluster. We can see here that we have one
container that we are deploying on our cluster.
This is basically pulling the container image of my application
from an Elastic Container Registry in my account.
Then I'm also using a node selector here
to ensure that this container will basically be
scheduled on a node that is running in the us-east-1a
availability zone. Besides our deployment, we also have a
service and an ingress resource. Those are required
to expose our deployment to the Internet, and we need
that in order to run our load testing here for this
use case. One thing to highlight here is the ingress section.
So we are using the AWS Load Balancer Controller
to create an Application Load Balancer in the cloud from
this ingress resource. So both the
ingress and the deployment YAML files are already deployed.
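For readers following along, here is a hedged reconstruction of the kind of manifests being described. The image URI, names, labels, and port are placeholders rather than the demo's exact files; the interesting parts are the nodeSelector pinning the pod to us-east-1a and the ALB-backed ingress.

```yaml
# Sketch of the single-pod deployment pinned to one AZ, plus the service and
# ALB-backed ingress (placeholder names/image, not the demo's literal files).
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ping-api
spec:
  replicas: 1
  selector:
    matchLabels:
      app: ping-api
  template:
    metadata:
      labels:
        app: ping-api
    spec:
      nodeSelector:
        topology.kubernetes.io/zone: us-east-1a   # pin the pod to a node in AZ a
      containers:
        - name: ping-api
          image: 123456789012.dkr.ecr.us-east-1.amazonaws.com/ping-api:latest
          ports:
            - containerPort: 8080
---
apiVersion: v1
kind: Service
metadata:
  name: ping-api
spec:
  selector:
    app: ping-api
  ports:
    - port: 80
      targetPort: 8080
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: ping-api
  annotations:
    alb.ingress.kubernetes.io/scheme: internet-facing   # AWS Load Balancer Controller creates an ALB
    alb.ingress.kubernetes.io/target-type: ip
spec:
  ingressClassName: alb
  rules:
    - http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: ping-api
                port:
                  number: 80
```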
So let's have a look if that was successful.
So I'm jumping into the console. Now I'm
running kubectl get pod.
This should now return one pod. This looks good.
And now we want to basically double check whether this
node here is really running in the correct availability
zone. And this looks good. So the node is
running in the us-east-1a availability zone. So that
basically means that the container is running in the availability zone that
we want to run a chaos experiment on.
Now let's get the URL of
the ingress, or to be more precise,
of the load balancer. So let's run kubectl get ingresses.
That looks good. And here's
the address. Let's send a curl request here,
and we are getting a response, pong. This looks good. So now
let's copy this URL and let's set
up the load testing tool. So for this
case I'm using Locust.
Locust is basically running on my local machine and will send
a good amount of requests per second to my application.
Let's start this. Here in
the top right corner we can see the requests per second that are sent against
our application. And here we can see the failures.
So failures would indicate a status code in the 400 range.
Now we can take a look at the chart and we can see here that
we slowly ramped up the requests per second. And now we
are sending around 240 requests
per second. And this looks good. We have no big failures.
Our response time is quite stable,
and at 120 milliseconds this
looks good. So I would say this is a
successful deployment of our application. So now let's
check what the resiliency of our application really looks like.
And let's see the impact of a chaos experiment.
For this I'm going to the AWS console and here
I'm basically searching for the service called AWS
FIS. I'm going to open
up this one here. I'm basically
taking a look at the experiment templates. Here
we can see one template that I already created. This is called disrupt
AZ A, and it does exactly that. Let's have a look at it.
So the name is disrupt AZ A, and this is
exactly what happens here. Let's quickly take a
look at the update wizard. I think this is really good
for visualizing what's happening here. So we can see that
this template is quite small. So we have one action, which
is called disrupt AZ A. Let's have a look at
it first. So here we basically
specify the action type.
This is disrupt connectivity. Disrupt connectivity
will prevent packets from leaving
and entering the given subnet.
And down here we can see the duration. In this
case we have two minutes configured, and we have configured
the target. So the targets are the subnets that are impacted by
this event. Now let's have a look at the targets. So I'm quickly opening
up the target. Here we can see the subnet
target one. We are using a resource
filter here to ensure that not every single subnet
is targeted, but instead we are only taking
subnets in the us-east-1a
availability zone into scope for this experiment.
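In template form, that action and target combination might look roughly like the fragment below. This is a sketch based on the description in the talk, not the demo's literal file; in particular the filter path and the scope parameter value are assumptions.

```yaml
# Sketch of the relevant template fragment: block all traffic for every subnet
# in us-east-1a for two minutes (filter path and scope are assumed).
actions:
  disruptAzA:
    actionId: aws:network:disrupt-connectivity
    parameters:
      duration: PT2M        # two-minute disruption
      scope: all            # drop packets entering and leaving the target subnets
    targets:
      Subnets: subnetsInAzA
targets:
  subnetsInAzA:
    resourceType: aws:ec2:subnet
    selectionMode: ALL      # every matching subnet, not a sample
    filters:
      - path: AvailabilityZone      # assumed filter path for the subnet's AZ
        values:
          - us-east-1a
```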
So a subnet in AWS is a zonal resource.
So that means that a subnet is deployed in one availability
zone. And this configuration really just makes
sure that no network traffic is leaving or entering
any of the subnets in the us-east-1a
availability zone. So I would say let's go back here and start this
experiment to take a look at what's going
on. So I'm clicking start here,
and it takes some time until the experiment
is actually executed so you can take a look at the timeline to
see what's going on. So currently this one is pending.
This was just taking a couple of seconds to deploy
the disruption. So let's wait. There we go. Now it's
actually running. So if we go back to the Locust
page, yeah, there we go. We should see a drop in
availability. So you can see that
straight away we have a really reduced
number of requests per second, and now we can see
that there are no requests being
successful anymore. So that's basically
showing us that there's not a single request going through. What's interesting,
if we go back here, we can see that those requests right
now are somehow dangling. So there is no bad
response code, like an HTTP 400,
but the connection is just never completely opened and closed.
So this is really the impact of our availability zone
outage here. So let's wait
for some time until this experiment
is finished.
Here we go. The experiment is now
finished. Let's go back to the console, and we can see straight
away that once the experiment was finished,
our network became available again and our application
continues to serve traffic. So here we can
see that our current architecture with
one container being deployed in one availability zone is not
really able to handle the situation well. So let's
see how we can improve the availability. I will first
stop the load test here and we are now jumping back
into the editor, because I already prepared
an updated deployment. So here we have the updated
deployment. It's pretty similar to the previous one,
besides two major updates.
So first of all, we have two replicas. So this basically means
that we deploy two containers. And then what's even more important is
we have an affinity rule created. So this affinity rule will
basically ensure that those containers will
be spread across the nodes and specifically
spread across the different availability zones.
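A hedged sketch of what that updated deployment might look like (labels and image are placeholders): two replicas, plus a required pod anti-affinity rule keyed on the zone topology label so the two pods cannot land in the same availability zone.

```yaml
# Sketch of the updated deployment: two replicas forced into different AZs
# via pod anti-affinity (placeholder names/image).
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ping-api
spec:
  replicas: 2                        # two pods instead of one
  selector:
    matchLabels:
      app: ping-api
  template:
    metadata:
      labels:
        app: ping-api
    spec:
      affinity:
        podAntiAffinity:             # keep the replicas apart...
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchLabels:
                  app: ping-api
              topologyKey: topology.kubernetes.io/zone   # ...one per availability zone
      containers:
        - name: ping-api
          image: 123456789012.dkr.ecr.us-east-1.amazonaws.com/ping-api:latest
          ports:
            - containerPort: 8080
```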
So now let's go back to the terminal.
And now let's delete the old deployment.
And now
let's apply the updated
deployment. So this will
take just a couple of seconds to deploy our pods.
Let's have a look at them. There they are
running on two different nodes. And if we now again open
up the nodes here, we will see that
those two nodes are actually running in two different
availability zones. So now we can test
again. So let's go back to the load testing tool. Let's do a new
test. Let's configure the same amount of
users being simulated and run the load test.
Give it some seconds here to create some
stable requests per second. So here
we go. This looks good.
Now let's go back into the AWS console to
rerun the experiment. So we
are going to the experiments,
going to the templates, opening up the
template. We're now starting another experiment
here. Let's start this one.
So this again
just takes some time to initiate.
Still pending. Let's see. Now it's running.
So let's also wait for some time
here to check what is happening.
So we can see quite a different result now.
So what we see here is a very, very short
time period where our service wasn't available.
And this is basically because of the load balancer
health check that checks the availability
of its targets every 5 seconds.
And this is just the small time gap where the load
balancer still thought that the container is available.
And then after the next evaluation it figured out,
no, I cannot send any further requests to the target.
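If you want to shorten that detection gap, the health check behaviour can be tuned on the ingress. The annotations below are taken from the AWS Load Balancer Controller documentation as I recall it, and the path is a placeholder, so treat this as an assumption rather than a prescription.

```yaml
# Sketch: tighter ALB health checks via ingress annotations (assumed annotation
# names; verify against the AWS Load Balancer Controller docs).
metadata:
  annotations:
    alb.ingress.kubernetes.io/healthcheck-path: /ping
    alb.ingress.kubernetes.io/healthcheck-interval-seconds: "5"
    alb.ingress.kubernetes.io/unhealthy-threshold-count: "2"   # mark targets unhealthy faster
```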
And as we can see here with this update, our application
now reacted way better to
the outage. And this basically shows you the full lifecycle of running an
experiment. So we made an assumption, figured out
that it was not correct, and updated our application to
see an improvement in resiliency. And this
concludes the demo for today.
I now also want to share some best practices to
get started with chaos engineering for your own applications.
So I would recommend you to start with very
small templates in the beginning that maybe,
as in our demo, only include one single action,
because this allows you to very quickly understand the impact of
a certain action.
The second tip that I have for you here is testing close
to your production environment. So let's say you have a containerized
workload that is in your staging environment, running on Docker Compose,
for instance, and in your production environment those containers
run on a fully fledged EKS cluster.
Here I would recommend you to add
a new test environment whose architecture is closer
to what you have in production, because otherwise you won't be able to
catch flaws in your application
architecture whose fixes you can then apply in
production to increase the resiliency of your application.
So always try to test as close
to production as possible, maybe even in production.
The next one is about minimizing the blast radius. So we
mentioned that with the Fault Injection Simulator on AWS, it's possible
to minimize the blast radius in two ways. The first one is
limiting the access that the service has to your resources.
So for instance, with the principle of least privilege in
your IAM policies, you can limit the resources that
the Fault Injection Simulator has access to, for
instance by making sure that only your application servers and your
databases of the staging environment are accessible
by the service. And also we
would recommend you to use the health check capabilities,
the emergency stop, to stop an
experiment when you see that you really have degraded
health in your application when you, for instance, test in production.
To get started with collecting your first hands-on experience,
I can recommend this workshop that we have for you here.
With this workshop, you basically have a guided, step-by-step
experience where you will
learn and understand the different functionalities of the Fault Injection
Simulator firsthand. And I would recommend you to check it out
either by scanning the QR code or visiting the URL on
the screen. And this concludes the session. So you see another QR
code on the screen. This is really important.
So if you scan this QR code or visit the URL,
you can give us feedback. And we really, really need your feedback.
We want to understand if you liked the session and what
we could improve next time. So please take a minute and
fill out the form. It would really mean a lot to us,
and we thank you very much and wish you a great day ahead
and also fun with all the other interesting sessions that you have the
chance to explore today. Thank you very much.