Transcript
Hi everyone. I hope everyone is having a fun time at the conference.
My name is Harsh and I'm currently working as an engineer at LocalStack.
Today I will be presenting my talk on shift-left cloud
chaos testing on your local machine. The focus
of the talk will be on local cloud chaos testing and how
you can leverage open source cloud emulators to make this happen.
So without further ado, let's jump directly into
the talk.
So let's check out the agenda for today. We will start with
understanding what we mean by shift-left chaos testing
and how various open source tools can help you get started with it.
Then we will uncover how performing chaos testing on the public cloud
can be risky, and how cloud emulators can help you in
this regard. We will check out a couple of cool and interesting
demonstrations that explain how locally running
emulators can help you build a strong test suite for
maintaining conformity and latency, both on your local machine
and in your continuous integration pipelines. We will end
this session with a concluding note that recaps some strategies
for building a test suite for a local chaos testing setup,
and how developers can be empowered to build failover mechanisms
right from the start. So let's jump right
into it. So what is shift-left chaos testing, and
what do we mean by that exactly? Before that,
let's do a quick recap of what chaos engineering actually
is and where we stand right now.
The idea behind chaos engineering is all about
how you can experiment on a system to uncover behavioral issues
and make your whole application resilient to such
conditions right from the start, before you land in production.
This concept emerged around the early 2010s and was heralded,
in fact, by outages that
made organizations suffer significant costs.
Now, these incidents have always highlighted a long-standing issue.
Organizations of all sizes have experienced such disruptions,
and some of the time these disruptions are beyond their
control. Now, these outages not only affect maintenance
costs and company revenue, but also have broader consequences
for development and for the teams that manage the whole
user experience. Now this is where
chaos testing comes into the picture. With chaos testing,
you first establish a system's baseline or optimal operational
state. Then you can identify some
of the potential vulnerabilities,
design scenarios, and further evaluate their potential impact.
Now, for a lot of developers, injecting these faults or
issues seems almost counterintuitive. Just imagine this:
you already have unit tests, you already have some sort of integration test
suite running with every single commit being merged. So why would
you need another layer of testing just to ensure that your application
works fine? This part is particularly important
because of third-party services, particularly the public cloud, which
has made chaos testing useful. You can
surface some of the faulty configurations,
behaviors, outages, and other interruptions that might
not necessarily be due to your code, but just something going wrong with
your provider. Standard testing can catch some of the
usual issues that might have a negative end-user
impact, but chaos testing helps you introduce issues into
your system and build solutions around
how you react to such issues in the end.
And one of the core tenets that I would like to highlight
is that developers should embrace the chaos,
and once they do, they need to build solutions around
it with a predefined plan that allows them to introduce
errors and also build a swift response to
them. This allows you to handle any sort of unexpected
behavior and also maintains the balance between thorough testing and
stability for your application.
And this, of course, would not be possible without a steady DevOps culture.
Chaos testing is essentially a DevOps practice in the end,
where you can define various testing scenarios, handle various
kinds of executions, track the results and varying outcomes,
and ensure that your end customers do not suffer any impact.
Resilience is a pretty important aspect of any
deployment process, and a strong DevOps pipeline
should ensure that your application remains robust and reliable
in a cloud environment where factors like scalability,
availability, and fault tolerance are all critical aspects,
some of which might not even be in your control.
Now, there are various chaos testing tools, both open source and commercial in
nature, and teams have been using them to build reliable solutions
for quite some time. There was an interesting keynote session at
the Conf42 Chaos Engineering conference by Pablo, I
guess last year, where he talked about shifting
left chaos testing with Grafana k6, so you can definitely
check that out. We have some of the key innovators over here, like Chaos
Monkey from Netflix, which started as a test tool to evaluate the
resiliency of AWS resources and simulate failures for
services running within an auto scaling group.
LitmusChaos is another interesting project, by CNCF,
that uses a chaos management tool called ChaosCenter, where
you can create, manage, and monitor your chaos using
Kubernetes custom resources in a pretty much cloud native manner.
Other projects include ChaosBlade, Chaos Mesh,
Chaos Toolkit, and more, which provide various features to
enable your chaos testing workflows.
But one of the things that I would like to highlight, and I guess it's
one of the persistent issues with chaos testing cloud components,
is that they are often proprietary in nature,
and the only way you can test them is by running them directly in
production. You can imagine that testing things such
as missing error handlers, missing health checks, or missing fallbacks
cannot be done without actually deploying to a staging or production
environment. Though you will have the comfort of running everything
directly on the cloud, you will miss out on two significant
aspects. The first is that since you're running on the cloud,
it will cost you significantly, especially deploying all your resources,
including databases, message queues,
Kafka clusters, Kubernetes clusters, and more.
Second, some of these tests can take you hours, if not
days or weeks, to run reliably, and you will miss out on the agility aspect
of the DevOps culture. So how do you shift left your chaos
testing early in the development process, as close to the developer machine as
possible? A few years back this would have been implausible to
think of, but now we have a solution, and
the answer is cloud emulators. Cloud emulators
are powerful developer tools because they counter the friction
between the cloud native paradigm and the local development paradigm.
Now, there are two key aspects of cloud emulators. The first one
is that they almost remove the need for provisioning
a completely dedicated development environment, like a sandbox,
that mirrors your production setup. And the
second is that every change that you make to your application or your code
does not need to be packaged, uploaded to the cloud, and
deployed before you can run your tests against it.
So with the help of these emulators, you now have a replica of a cloud
service that runs fully locally on your machine or in an
automated test environment. And this makes chaos testing so easy
and helpful. So how do you go about building
a chaos test suite for your cloud application?
The answer almost lies in the testability of your cloud
application deployments. As you can see, at the top
of this pyramid we have the classic strategy of using mock libraries.
With mocking, you can scope the tested component in unit
tests, and you can implement the same interface as the real
class to allow predefined behavior to be invoked.
Second, we have service emulation, which means that we now have
local, stripped-down versions of the managed service, where each individual service
requires its own implementation. As a follow-up to that,
we have cloud emulation, which enables a superset
of cloud resources that can interact with each other and run on
your local machine. And finally, we have the real staging environment,
which uses real cloud resources.
An example of this that I would like to showcase is, of
course, Moto. It's a pretty popular mocking library for AWS,
and with Moto you can basically add decorators to your test
cases and redirect all of the AWS API calls
to a mock resource and not to the real AWS services.
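For instance, a minimal sketch of this pattern using Moto 5's mock_aws decorator (the bucket name is illustrative) might look like this:

```python
import boto3
from moto import mock_aws  # moto >= 5; older versions expose per-service decorators like mock_s3

@mock_aws
def test_create_bucket():
    # Every boto3 call inside this function is intercepted by Moto,
    # so nothing ever reaches real AWS.
    s3 = boto3.client("s3", region_name="us-east-1")
    s3.create_bucket(Bucket="my-test-bucket")
    names = [b["Name"] for b in s3.list_buckets()["Buckets"]]
    assert names == ["my-test-bucket"]
```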
Now, mock libraries are excellent tools for local cloud development,
but they are often limited.
The first and foremost reason is that mocking these services
does not correctly replicate the behavior of remote cloud
services that are often interacting with each other and with the application.
You can definitely build a chaos testing suite around just plain mocking,
but it might often lead to unsatisfactory results,
because the behavior of your locally executed tests always
diverges from the behavior that you see in production.
Now, given the limitations of mocking, we have progressed
further, and we have local development servers that provide an emulation
layer. The difference between mocking and emulation often lies in the degree
to which the behavior of the service is reverse engineered.
As an example here, if we pick DynamoDB
Local and create a new database, a mock will return us a
parsable API response, but it does not necessarily carry any
state, and it will just give you some random attributes that do not reflect
the request context. In comparison, the emulator will
correctly store and retrieve the state for you.
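A minimal hedged sketch of that difference against DynamoDB Local (port 8000 is the conventional default; the table name and dummy credentials are illustrative):

```python
import boto3

# DynamoDB Local conventionally listens on port 8000; credentials are dummies.
ddb = boto3.client(
    "dynamodb",
    endpoint_url="http://localhost:8000",
    region_name="us-east-1",
    aws_access_key_id="test",
    aws_secret_access_key="test",
)

ddb.create_table(
    TableName="orders",
    AttributeDefinitions=[{"AttributeName": "id", "AttributeType": "S"}],
    KeySchema=[{"AttributeName": "id", "KeyType": "HASH"}],
    BillingMode="PAY_PER_REQUEST",
)

# Unlike a stateless mock, the emulator actually keeps what you write:
ddb.put_item(TableName="orders", Item={"id": {"S": "order-1"}})
item = ddb.get_item(TableName="orders", Key={"id": {"S": "order-1"}})
assert item["Item"]["id"]["S"] == "order-1"
```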
However, there are always some issues with service emulators. The first is that there
is no integration between different services, which means that
you cannot hook your DynamoDB up with AppSync or with Lambdas or anything else.
You cannot use them with a fully fledged infrastructure-as-code setup,
there are often API compatibility issues, and they are also not up
to date with the latest API enhancements that might be happening in real
time. So, as the best example of a
cloud emulator, we have LocalStack. LocalStack is an open source
service emulator that runs in a single container on your laptop
or in your continuous integration environment. For the end user,
it means they can run these AWS services pretty much
on their local machine. The entire purpose of a project like
LocalStack is to drive local cloud development,
and also to make sure that engineers can collaborate and
do away with inefficient development and testing loops.
But now the question is, how do you enable chaos
testing with LocalStack? As a first step, you can
use LocalStack as a drop-in replacement for AWS on
your developer machine and in your CI environment, and you
can use it with services such as Amazon Fault Injection Simulator,
or FIS, to perform resilience testing for your
cloud apps running locally. Now, the benefits of this
are paramount. This means that you can not just deploy your application
locally, but also run experiments that inject errors
at the infrastructure level, which was not possible to replicate beforehand.
Previously you could do this only once you deployed to production,
which made it even harder. But with a solution like LocalStack, you can
shift left your chaos experiments and tests right onto
the developer machine. And you can use this to tackle some of the
low-hanging fruit, to redesign your cloud architecture so
that you can trigger some failover mechanisms and just observe locally
how your system responds to that.
Now that we have done, I guess, enough talking, let's
move on and explore some cool demos. I'm going to showcase two
examples. The first one is to demonstrate the feasibility
of using cloud emulators like LocalStack for testing
RDS failovers. And the second one is to actually
show you how you can inject chaos experiments into your
locally running applications. So in this
sample we will locally run and test a
local RDS database failover. During this failover process,
the standby RDS instance is promoted to be the new
primary instance, thus allowing you to resume
your database activities in the shortest
amount of time. Now, if you are trying to set up this
whole thing on real AWS, it can take more than an hour,
which seems okay for a production setup,
but for testing such a workflow it can often be time consuming.
Now, the best part about LocalStack is that you don't need to do a
lot of manual configuration. Everything runs inside
a single isolated Docker container, and it exposes a set
of external network ports. So this means that if you're using any sort of
SDK, you can just hook it up with LocalStack by specifying
the endpoint URL that it should talk to,
so that you just don't send any of the API requests to
the remote cloud provider.
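As a minimal sketch, pointing boto3 at LocalStack looks something like this; the dummy credentials are the conventional placeholders the emulator accepts:

```python
import boto3

# Route all RDS API calls to the LocalStack container instead of real AWS.
rds = boto3.client(
    "rds",
    endpoint_url="http://localhost:4566",
    region_name="us-east-1",
    aws_access_key_id="test",      # any value works against the emulator
    aws_secret_access_key="test",
)
```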
So let's check this out in action and see what we have got over here. In this setup
we have a Python script that uses boto3, the
AWS SDK for Python. And over here we are
setting up some cluster IDs and some regions.
Just notice that we have defined multiple clusters over here, showcasing
pretty much a global setup, and we are
creating a global cluster with primary and secondary clusters. This
basically simulates a real-world scenario where multiple
database instances are managed across different regions.
We have also got run_global_cluster_failover, which is a function
that triggers a failover and swaps the primary database with a
secondary one, and this is pretty much necessary for
handling unplanned outages or just for maintenance.
So in this case we are going to run this script on our
local machine just to make sure that this setup works fine.
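Here is a condensed, hedged reconstruction of what the script does (identifiers, credentials, and the single-region simplification are illustrative, not the exact demo code):

```python
import boto3

# Same LocalStack-pointed client as in the earlier snippet.
rds = boto3.client(
    "rds",
    endpoint_url="http://localhost:4566",
    region_name="us-east-1",
    aws_access_key_id="test",
    aws_secret_access_key="test",
)

GLOBAL_CLUSTER = "global-cluster"
PRIMARY = "primary-cluster"
SECONDARY = "secondary-cluster"

rds.create_global_cluster(
    GlobalClusterIdentifier=GLOBAL_CLUSTER,
    Engine="aurora-postgresql",
)

# Attach a primary and a secondary cluster to the global cluster.
for cluster_id in (PRIMARY, SECONDARY):
    rds.create_db_cluster(
        DBClusterIdentifier=cluster_id,
        Engine="aurora-postgresql",
        MasterUsername="test",
        MasterUserPassword="hunter2pass",
        GlobalClusterIdentifier=GLOBAL_CLUSTER,
    )

def run_global_cluster_failover():
    # Promote the secondary to primary, as in an unplanned outage drill.
    rds.failover_global_cluster(
        GlobalClusterIdentifier=GLOBAL_CLUSTER,
        TargetDbClusterIdentifier=SECONDARY,
    )

run_global_cluster_failover()

# The demo ends with assertions that the promotion actually happened.
members = rds.describe_global_clusters(
    GlobalClusterIdentifier=GLOBAL_CLUSTER
)["GlobalClusters"][0]["GlobalClusterMembers"]
assert any(m["IsWriter"] and SECONDARY in m["DBClusterArn"] for m in members)
```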
So let me just go ahead and start LocalStack on my machine.
LocalStack is shipped as a binary or as just
a pip package. So if you're a Python developer, you can just go ahead and
say pip install localstack, and that's going to set up the whole
thing on your machine. And you can just say localstack start to
start the LocalStack Docker container on your local machine.
This basically means that now you can send all of your AWS
API requests to this running Docker container.
And over here we have specified this as localhost:4566.
Once we have done that, I guess I can just go ahead to
my other terminal and I can just run this script right over here.
My LocalStack container is ready. So as soon as we hit
enter, you can see that LocalStack starts creating the global cluster,
the primary database cluster, and the rest of the things.
You can immediately go back to the logs, and
you can see that LocalStack is now installing PostgreSQL, which is the
database engine behind RDS, and just making sure that everything
is up to date and ready for this particular experiment.
So now it's starting a global database cluster failover.
And at the end of the code you will notice that we have set up
a lot of assertions to make sure that the failover
is successful. And as you can see, the test is done, all assertions have succeeded,
and now we have a pretty good idea of how
a cloud emulator like LocalStack can help you in this regard.
In a real AWS environment, as I mentioned before, these operations might have
taken over an hour, but using LocalStack you can perform these
kinds of assertions in less than a minute or two. And the best
part about this is that we have not created any real cloud resources.
Everything is happening on your local machine. And as soon as I shut down
my LocalStack container, all of the resources that I created before
are gone. They are ephemeral in nature, so everything
just vanishes with a poof. Cool.
So that was a nice and steady experiment.
Let's go ahead and see what else we have got.
And yes, as I mentioned before, you can obviously use LocalStack with
other services such as FIS for actually
injecting faults into your application setup. Now, FIS
is a managed tool by AWS, and again, it comes
with certain limitations of its own.
But with LocalStack you can obviously go beyond that, and
you can use a user interface or a CLI to inject
chaos into your application setup. Now, FIS
has a lot of use cases. You can use it to strain an
application. You can define what kinds of faults you want to introduce,
you can specify the resources to be targeted,
and you can also specify single-time events, or you can induce some API
errors and more. So with FIS you
can do a lot of these varying activities, and you can just see
how your CPU spikes or how memory usage increases, and
how your system responds to this whole thing, with consistent monitoring and
more. So you can use FIS with LocalStack
either through the traditional CLI experience
that AWS itself provides, just pointing it at LocalStack instead,
or else you can use a dashboard
like this to control all of your chaos engineering experiments
happening right on your local machine. So in this case,
I'm going to use this dashboard to run a simple experiment on
one of the sample applications that I have.
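Under the hood, both the dashboard and the CLI drive the same FIS API. Programmatically, creating and starting an experiment could look like the hedged sketch below; note that the localstack:generic:api-error action identifier and its parameters are assumptions about the emulator's custom action set, not part of the official AWS FIS actions:

```python
import boto3

fis = boto3.client(
    "fis",
    endpoint_url="http://localhost:4566",
    region_name="us-east-1",
    aws_access_key_id="test",
    aws_secret_access_key="test",
)

# Create an experiment template that makes Lambda API calls fail.
# The action identifier below is an assumed LocalStack-specific action;
# real AWS FIS ships actions like aws:ec2:stop-instances instead.
template = fis.create_experiment_template(
    description="Fail all Lambda invocations",
    roleArn="arn:aws:iam::000000000000:role/fis-role",  # dummy role for the emulator
    stopConditions=[{"source": "none"}],
    actions={
        "lambda-errors": {
            "actionId": "localstack:generic:api-error",
            "parameters": {"service": "lambda", "percentage": "100"},
        }
    },
)

# Start (and later stop) the experiment, just like the dashboard buttons do.
experiment = fis.start_experiment(
    experimentTemplateId=template["experimentTemplate"]["id"]
)
```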
So the application that I'm going to showcase to you is a serverless
image resizer. It has a bunch of Lambda
functions that allow you to upload an image,
resize that image, and showcase it on a local web client.
We have a simple website that runs inside an S3 bucket.
We have some SSM parameters, we have some SNS
topics, and more of these things. So let us actually run
this whole setup on our local machine, and I'll showcase
how you can inject faults on your local developer machine. So let
me just quickly switch back to my VS Code, and we
have the entire application cloned over here. You can grab this
application using one of the GitHub URLs that I can share at the end,
but let us go ahead and check out the application.
But before that, let me start LocalStack. So in this case I
am starting my LocalStack instance, and I'm specifying an extra
configuration flag to allow the CORS origins,
which I guess is pretty much a pain for almost every developer out there.
And it will start the LocalStack container, and as soon
as it is started I can run this one script that will create
all of the local AWS resources for me. Basically, it will set up the
whole application that I have, and I
don't need to run a lot of commands to get through the whole application
deployment. So let me go ahead and run this deploy script,
and this will set up the application in just a few seconds, which
would otherwise have taken a few minutes if you were trying to do this whole
thing on real AWS. So you
can go back to the LocalStack logs, and you can check out how LocalStack
is creating the SSM parameters, the Lambda functions,
the SNS topics, and more of these things.
And I guess this should be up and ready in just
a few seconds.
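While the deployment finishes, note that you can verify the emulated resources with the same SDK calls you would use against AWS. A quick hedged check (the exact function names depend on the app) could be:

```python
import boto3

lam = boto3.client(
    "lambda",
    endpoint_url="http://localhost:4566",
    region_name="us-east-1",
    aws_access_key_id="test",
    aws_secret_access_key="test",
)

# List the functions the deploy script created; names will match the app's own.
for fn in lam.list_functions()["Functions"]:
    print(fn["FunctionName"], fn["Runtime"])
```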
I guess it's taking some time over here because we are installing Pillow,
since we need to set up the whole resize Lambda function over there.
But as you can see, once the whole application is deployed, you
will get this web client that we have. And over the web client
you can upload an image, and you can just click one button,
and this will start the whole resize operation. So I guess the
web assets are being deployed over here, and now
we have this particular web app running on this particular URL.
So let me just switch right back to it, and if I
just go back to my other tab I can just hit refresh.
And here we are. Here we have the locally running serverless
image resizer application, running pretty much on our local machine, and it
is just consuming all of the emulated AWS resources that we have
created before. So I'm going to just click this button, and I'm just going
to list the few function URLs that are necessary over here and just
hit apply. And once that is done, I can go ahead,
click on this, and specify one
of the images that I have. In this case I'm just
going to pick a pretty awesome picture of the Hagia Sophia.
And once you click upload, you can go back to
VS Code right over here, and you can actually see
how LocalStack is executing the Lambdas on the back
end. And you can see that the image has been resized, and
now it is pretty much ready. So let's go back to our
VS Code.
Yes, here we are. Oops,
sorry. So yes, I guess
the image resize was pretty much successful. So you can just
go and click refresh, and this will automatically list
the resized image right over here. So here we have the original image,
which was this many bytes, and finally we have the resized image,
which is this many bytes. So what if we start injecting
some faults into our application setup? I guess it's pretty easy with
this whole dashboard that we have right here. So let me just
hit refresh just to make sure that everything is ready.
And as you can see, we have a few experiments that are pretty much predefined.
So this dashboard gives you a set of predefined templates that
you can use and run. So one of the first things that I want to
showcase is how you can make a service unavailable on your local machine.
So in this case I'm just going to pick Lambda as the
target service, I can specify the region that I want
to run this experiment in, and I can just click start experiment.
As soon as the experiment is started, we can go back over
here, and I can choose another file, which I guess would
be one of the other photos that I took during
my recent trip to Turkey, and I can hit upload
over here. Something will just go wrong. The image is not
being listed; even though the original image is right over here, there is no mention
of the resized image. If you want to see why this
has occurred, we can go back to
the VS Code that we have right here.
So I guess
this is going to be it. So over here you can
see that this whole Lambda invoke operation has failed, as
per the Fault Injection Simulator configuration. So now we
can see that, because of this configuration that we specified
before, we have injected a fault into our application infrastructure,
and things are not working out, which I think is
very well expected. Let's go back, and let's maybe
stop this experiment and let's maybe start another one.
So this time I want to make an entire AWS region
unavailable, just to simulate a full regional outage.
So in this case I can just specify us-east-1 and I can start
the experiment. And if I go back and I just hit
reload, because this entire website is being served from an S3
bucket, now you can actually go back to your VS Code again,
or basically to your LocalStack container logs, and you can see
that there are exceptions happening. And this is because of the Fault Injection
Simulator configuration. So now we see this whole application
set up, and we can immediately inject faults into the setup and
just see how our application starts responding to that.
So this is one of the demos that I wanted to showcase.
And yes, if you go back to the website,
you can see that everything is happening because of this FIS configuration.
So you can stop this experiment, and I can just hit
refresh, and this whole application is now back and ready. But this was just
an initial preview. You can obviously introduce latency to every API
call that you make on your local machine, you can inject some issues with DynamoDB
or with Kinesis, and basically you can use this experience as a way to test your
local FIS configurations, and also to validate
whether your application infrastructure needs certain changes to accommodate these
kinds of interruptions.
So now that we have come this far, let's take a
look at how we can move further ahead with such
a setup, and how you can use open source cloud emulators like LocalStack
in such a scenario. Now, the very first step that I always
mention is that before implementing any sort of chaos
testing suite, always try to establish the system's steady state.
This can be measured based upon overall throughput, error rates,
or latency, and it should always represent the acceptable,
expected behavior of the system. Second, always look into
creating a hypothesis that aligns with the objective
of your chaos testing setup; it should
match real-world events, and it should not cause deviation from
the system's steady state. Finally,
always try to design the experiments and put them into categories of
knowns and unknowns. Just look at this table that I have mentioned over here,
and try to ideally target the knowns here, because they
are pretty much low-hanging fruit and should be easy to fix. But the
ideal goal should be to discover and analyze whatever chaos your
system might end up encountering, and thus you
might just end up venturing into some of the unknowns over here.
Once you have the setup, you can always conduct the experiments.
You can discover some potential failure scenarios within your
application infrastructure, and always try to
assert on certain things: is your application failing
for a small percentage of production traffic? What would happen if a certain
availability zone or a region were to suddenly go down? What will
happen under critical system load, like
an extremely large amount of traffic, and how exactly does it affect the
system performance? Now, solutions like LocalStack, like
k6, or some other cloud native chaos engineering
tools can certainly help you in that regard, and help you assert how
well your infrastructure will treat such an interruption or such chaos.
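To make the steady-state idea concrete, an assertion in a local test suite could be as simple as the hedged sketch below; the endpoint and thresholds are purely illustrative, not part of the demos above:

```python
import time
import urllib.request

# Hypothetical health endpoint of the locally deployed application.
ENDPOINT = "http://localhost:8080/health"

def latency_ms(url: str) -> float:
    start = time.perf_counter()
    with urllib.request.urlopen(url, timeout=5):
        pass
    return (time.perf_counter() - start) * 1000

def test_steady_state_holds_during_experiment():
    # These thresholds encode the "acceptable expected behavior" of the system;
    # run the test while a chaos experiment is active to see if it still holds.
    samples = [latency_ms(ENDPOINT) for _ in range(10)]
    assert max(samples) < 500, "latency deviated from the steady state"
```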
And finally, yes, you can obviously test your
cloud apps pretty much locally. You can always do away with staging if you
want, but testing your cloud apps locally is definitely possible.
And you can use cloud emulators not just to cut down on the cost and
the time, but also to build a better overall developer
and testing experience. Cloud emulators, in
the context of chaos engineering, can always help you with some of
the low-impact risks, like a missing failover,
a missing handler, or things that might just fly under
your radar during your usual unit testing or integration testing
setup. And you can always learn from this experience: you can fix and iterate
quickly on your developer machine, and you don't have to wait and
watch your application blow up in production to actually do that.
The second point is that you can always use these lessons to set up some
sort of a playbook, so that you can always validate these incidents,
work on them, and make sure that you have resilient solutions. And the best part
is that you can actually integrate these failover tests into your CI pipelines.
This means that you can run these tests repeatedly in CI,
and you can make sure that none of the code and none of the infrastructure
that you're adding to your application will negatively impact it in the long run.
And finally, you will achieve the ability to handle some of the
unplanned failovers or handlers that you might otherwise end up
missing, and you can certainly develop more resilient, better tested solutions
at the end of the day. So this
brings me to the final conclusion of this session.
So that was all for today. Thank you folks, and I hope you enjoyed the
talk. And I do believe that you can look into shifting
left your chaos testing suite and enabling your
developers to leverage chaos as much as possible. Thanks to
the Conf42 team for having me today, and I'll see you
next time.