Transcript
This transcript was autogenerated. To make changes, submit a PR.
Gaining real-time feedback into the behavior of your distributed systems, and observing changes, exceptions, and errors in real time,
allows you to not only experiment with confidence, but respond
instantly to get things working again.
Hi everyone, my name is Gunnar Grosch. I am a developer advocate at
AWS and excited to be here at Conf42 today
to talk about my two favorite topics
combined: chaos engineering and serverless.
In this session I'm going to talk about chaos engineering for serverless
using AWS fault injection simulator.
So let's get to it. I think you
all know the phases of chaos engineering by now,
where we walk through from a steady state,
creating our hypothesis and then create and run
our experiments and so on. And for this specific session
I'm going to focus on the hypothesis part, how to
create hypothesis for serverless and
how to run experiments for serverless applications.
So to begin with, to just set the stage,
I want to talk just briefly about serverless. In case
you don't know what serverless is, so many
of you might know that these are the tenets that define serverless
as an operational model. First off, we don't have any infrastructure
to provision or manage, so there's no servers for us
to provision, operate, patch and so on.
And serverless automatically scales by the unit of
consumption, the unit of
work or consumption rather than by the server unit.
And with serverless we pay for value; we have a
pay-for-value billing model. So for
instance, if you value consistent throughput or execution duration,
you only pay for that rather than by the server unit.
And serverless is built with availability and
fault tolerance in mind, so you don't have to
specifically think about architecting for availability because
it's built in into that service.
But when we say serverless, we mean that it's about
removing the undifferentiated heavy lifting
that is server operations. So you
don't have access to the underlying services,
the underlying infrastructure, which could be
a difficulty when it comes to chaos engineering, but more on that
later on. And at AWS, when we're talking about serverless services,
well, we are often referring to services like AWS Lambda,
Amazon DynamoDB, Amazon API Gateway,
our object storage service Amazon S3,
and a lot of other services as well, like SNS
and SQS and so on. And the serverless landscape is
really growing all the time. So now
with that said, let's get to the topic of this
session. Serverless chaos experiments.
So when we're creating our experiments,
we can start by looking at some of the perhaps
common serverless weaknesses that we can see in architectures
at times. For instance, we can look at
errors. Are we handling errors correctly within our applications?
No matter if the error handling is inside our code or
a feature of the service, we better make sure that we're handling
errors. And certain services have features
like dead-letter queues and things like that, which are great,
but we want to make sure to be able to test it. And chaos engineering
can help us do that. And with AWS Lambda functions
and different dependencies, like other AWS services
or third parties, we need to and want to get our timeout
values right. And in most cases they probably are,
but that's often in a steady state. So what happens when there are
issues, say latency for instance?
And with event driven architectures becoming more and
more common, how we handle events within our
solutions or our applications is really important.
Are we queuing events and messages correctly? What happens to
events in case of any issues within our applications?
And we're using many different services, and we also often
have third party dependencies and we trust
them to be there. So what happens if they aren't there?
Do we have fallbacks or graceful degradation?
How do we handle if a third party is unavailable?
And these are just some of the potential weaknesses, and there are of
course a lot of others as well. So we want to find these weaknesses,
these unknowns, and fix them. We want to fix them before they break
and create a big outage in our serverless applications.
So what are some techniques then for doing fault
injection on serverless applications? Well, we can start
off with configuration manipulation. Some common
faults there might be throttling, for instance, or setting
concurrency limits. We can deny access
to services or other parts
of our application. Basically any type of service configuration
that's available to us is something that we can use to create
fault injection and tools to do this might be
resource policies or using IAM policies to
restrict access. We can use VPC attachments
and so on. Another technique is to
do network manipulation, and common faults there might be,
say, TCP packet loss, bandwidth limitation,
network latency, or restricting connectivity.
And tools we can use to do that might be security groups,
network access control lists, using network
firewalls, HTTP proxies, NAT instances
and so on. And then we have code
manipulation. So with code manipulation
we can create different types of faults. We can,
for instance, create different API responses.
We can do disk exhaustion, we can corrupt
messages in the code. We can create network
latency with code manipulation.
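To make the code-manipulation idea concrete, here is a minimal sketch of a wrapper that injects faults into a Lambda-style handler. This is not the failure-lambda package discussed later in the session, just an illustration of the technique; the config shape and names are hypothetical.

```javascript
// Minimal sketch of code-level fault injection for a Lambda-style handler.
// NOT the failure-lambda package; config shape and names are hypothetical.
const config = {
  isEnabled: true,
  failureMode: 'latency', // 'latency' | 'exception' | 'statuscode'
  rate: 1.0,              // fraction of invocations to inject the fault into
  minLatency: 100,        // milliseconds
  maxLatency: 400,        // milliseconds
};

const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Wraps a handler so a fault is injected before the real code runs.
function injectFailure(handler) {
  return async (event, context) => {
    if (config.isEnabled && Math.random() < config.rate) {
      switch (config.failureMode) {
        case 'latency': {
          const span = config.maxLatency - config.minLatency;
          await sleep(config.minLatency + Math.random() * span);
          break;
        }
        case 'exception':
          throw new Error('Injected exception');
        case 'statuscode':
          return { statusCode: 404, body: 'Injected fault' };
      }
    }
    return handler(event, context);
  };
}

// A trivial handler wrapped with fault injection.
const handler = injectFailure(async () => ({ statusCode: 200, body: 'ok' }));
```

With failureMode set to latency, each affected invocation is delayed by 100 to 400 milliseconds before the real handler runs, mirroring the latency experiments shown later in this session.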
One thing though is that we're missing environment
manipulation, and that is basically because we don't have
access to the environment where we perhaps would have
started if this wasn't a serverless application.
So with that, let's then look at a concrete example.
So this is a very simple serverless application on
screen right now. It is a web service.
We have an API gateway that's fronting a lambda
function that retrieves data from Amazon DynamoDB.
And we also have a queue. So items that are
posted into our API are stored in a
queue and then retrieved by an AWS lambda function
before storing them in a DynamoDB table.
A simple serverless application, but it contains several
different services. So what
we can do then on this is we can inject errors
into our code, for instance by creating exceptions
or by using other types of errors. We can
remove downstream services so we don't have access
to a downstream service or a third party API.
For instance, we can alter the concurrency of
our AWS lambda functions, and we can restrict capacity
of tables.
Other examples might be that we can create security
policy errors where we restrict access
to services. We can create CORS configuration
errors, something that perhaps we struggle with to
get right. And that is a good example of something
that we might try: what happens in our application if we
have CORS configuration errors? And once again,
we can basically create any type of service configuration error.
And we can also do manipulation with the disk space available
in our AWS lambda functions, if that is important to us.
And the perhaps most common example of doing chaos engineering
experiments for serverless is to add latency to
our functions. And by using that, by adding latency,
we can simulate a bunch of different failure scenarios. For instance,
it could be runtime or code issues, it could be integration
issues to downstream or upstream services.
It could be to test our timeouts for our AWS
lambda functions. And it can also be to test how our
application behaves in case of cold starts.
So let's then start off by looking at configuration
manipulation to begin with.
So when doing that, modifying different
types of service configurations and
changing IAM policies are two good examples.
And to do that, we can use the AWS console
straight away, just make changes there, observe what's
happening, and then change it back. We can use the AWS
CLI, have our commands ready to make a
change to a service or a policy, do that,
observe what's happening, and then roll back with another
command in the CLI. We can use APIs,
we can use the different SDKs to make these changes.
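To make the CLI approach concrete, here is a hedged sketch of such a command pair for the concurrency example used throughout this session; the function name is a placeholder.

```shell
# Throttle the function by reserving zero concurrency
# (function name is a placeholder).
aws lambda put-function-concurrency \
  --function-name my-function \
  --reserved-concurrent-executions 0

# Observe the application's behavior, then roll back by
# removing the reserved concurrency setting.
aws lambda delete-function-concurrency \
  --function-name my-function
```

Having the rollback command prepared before the experiment starts is exactly the "commands ready" practice described above.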
Or of course we can use AWS Fault Injection Simulator.
And the big reason I see to use AWS Fault Injection Simulator
is the safeguards that we get with
a managed chaos engineering service. And safeguards,
they act as this automated stop button. So it
monitors the blast radius of the experiments that we're running and
makes sure that it is contained, and that failures created
with the experiment are rolled back if alarms go off.
So if an alarm goes off, instead of
me manually having to observe it and stop it,
or let's say that we've run our experiment
for five minutes and that's the end, we can
then use AWS FIS to automatically stop and
roll back to the previous state.
So let's look at an example then on configuration manipulation.
So this is the same application that we're using.
What if the SQS invocation of our Lambda function
is throttled? So we are pushing a lot of messages to
our SQS queue, but the Lambda function is throttled, so
we're not able to pick up those messages; what happens in
our application then? Or what if the SQS
invocation of the Lambda function is disrupted entirely,
so we're not picking up any messages from the queue?
Or another example: what if the Lambda function loses permission
to the DynamoDB table and isn't able to store the
messages that it's picked up from the queue and processed
within the Lambda function?
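The "lost permission" what-if above could be simulated by attaching an explicit deny policy to the function's execution role for the duration of the experiment. A sketch, with the account ID, region, and table name as placeholders:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "SimulateLostDynamoDbAccess",
      "Effect": "Deny",
      "Action": ["dynamodb:PutItem", "dynamodb:GetItem", "dynamodb:Query"],
      "Resource": "arn:aws:dynamodb:eu-west-1:123456789012:table/my-table"
    }
  ]
}
```

Attaching the policy starts the fault; detaching it is the rollback.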
So let me briefly show you an example of how we can
do this type of configuration manipulation
with a quick demo.
So, switching over to the AWS console
in this case, what we're seeing here is an AWS Lambda function,
and this is the AWS console. So we have a lot of different
options, configurations and so on. I've switched to the configuration
tab and the concurrency setting, and as you can see,
now it's set to the default value 999.
And in the console I can easily set it to zero,
meaning that the function will be throttled and it
won't be able to run
that AWS Lambda function. But I didn't save it;
it's still at 999. Instead, I want to show you how we
can do this using AWS FIS. So I have an experiment
template created in FIS already to update
Lambda concurrency, and I'm making use of an
action that's called SSM start automation execution.
And with that we can run SSM
documents. They in turn then contain
different types of automations that we want to do. So I've
defined this document that is created to
be able to then change the concurrency of our AWS lambda
functions. So it has a
first step where it will update the concurrency to whatever we
set it to, zero for instance. Then it will sleep,
and in the end it will then
do a rollback.
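The document itself isn't shown in full in this session, but its three steps (update, sleep, roll back) might be sketched roughly like this. The step names, parameter names, and defaults are illustrative, not the actual document used in the demo.

```yaml
# Sketch of an SSM Automation document: set Lambda reserved concurrency,
# wait, then roll back. Names and defaults are illustrative.
schemaVersion: '0.3'
parameters:
  FunctionName:
    type: String
  ConcurrencyLimit:
    type: Integer
  SleepDuration:
    type: String
    default: PT1M   # ISO 8601 duration
mainSteps:
  - name: setConcurrency
    action: aws:executeAwsApi
    inputs:
      Service: lambda
      Api: PutFunctionConcurrency
      FunctionName: '{{ FunctionName }}'
      ReservedConcurrentExecutions: '{{ ConcurrencyLimit }}'
  - name: sleep
    action: aws:sleep
    inputs:
      Duration: '{{ SleepDuration }}'
  - name: rollback
    action: aws:executeAwsApi
    inputs:
      Service: lambda
      Api: DeleteFunctionConcurrency
      FunctionName: '{{ FunctionName }}'
```

The final rollback step is what gives the experiment its automatic cleanup, whether it ends by duration or by an alarm.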
So we can then add these parameters to
our FIS action to be able to update that
Lambda function and have that automated
rollback when the experiment is done or in case
of an alarm. So let's start this experiment
just to see what it looks like the
experiment is initiating.
It is now running. There we go. So it
will now run that SSM automation. Looking
in the Lambda console, it's still at 999.
Let's do a refresh and
we can see that the reserved concurrency is now set to zero.
We can also see at the top that this function is throttled,
so it will not be invoked. As
soon as this experiment is done, or if an
alarm goes off, it will then do that rollback.
So now we can see it
was a quick one-minute experiment. It is now rolled back to
the initial state, which was 999.
So that was a very quick example of how we can use AWS
FIS and this very adaptive
way of creating automation to change configuration,
change policies and so on by using the SSM
automation action. A very cool feature that allows
us to do a lot of these experiments.
So let's look at one of the other techniques then: code
manipulation. And this is a favorite of mine.
So there are today two main options for
using fault injection for AWS Lambda. There is
chaos-lambda for Python and then failure-lambda
for Node.js.
And let's have a look at the Node.js one:
fault injection with failure-lambda. It is an npm package
that you can use for Node.js based Lambda functions.
You configure it using Parameter Store or AWS
AppConfig, and it has several
different fault modes that you can use. So you can add latency.
You can change the status code for an
API. For instance, instead of returning a 200, you can return
a 404, 502 or whatever you wish.
You can create exceptions within the invocation of the lambda
function. You can add things to the
disk to create disk space faults. You can use a denylist
to deny calls to specific URLs.
And what you do is basically you install the
NPM package, then you import it in the lambda function and you
wrap the lambda function handler. So like this,
we then import failure lambda and then we wrap our
handler with failure lambda in this case. And then we're
good to go to be able to add this fault
injection to our Lambda function.
And as I mentioned, we control it with a parameter
in basic JSON. So we set
if it's enabled or not, we set the failure mode,
which type of fault injection we want to do. We set
a rate, whether it should be on all invocations
or, as in this case, on 50% of invocations.
We can set the latency and so on, configuring each
of these different fault modes.
Then let's
look at an example for this as well.
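A parameter of this kind might look like the following sketch. The field names follow failure-lambda's documented configuration, but verify against the package's current README before relying on them.

```json
{
  "isEnabled": false,
  "failureMode": "latency",
  "rate": 0.5,
  "minLatency": 100,
  "maxLatency": 400,
  "exceptionMsg": "Injected exception",
  "statusCode": 404,
  "diskSpace": 100,
  "denylist": ["s3.*.amazonaws.com", "dynamodb.*.amazonaws.com"]
}
```

Flipping isEnabled to true turns the configured failureMode on for roughly half of invocations (rate 0.5), just as in the demo that follows.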
So what if my function takes an extra 300
milliseconds for each invocation? What happens to my
application in those cases? Or what if my
function returns an error code? So instead of returning a
200 response to the API or to the client,
what if it's a 404 or a 502 or
a 301 or whatever error code we want to use?
Or what if I can't get data from DynamoDB?
So let's look at an example of how we can use
this in practice then. So this is a very basic
site used for this example, serverless chaos demo site.
I'm using three functions that are
just copies of each other to be able to easily show the
difference between them. So this
is now running; it is fetching data
from DynamoDB to then load a new image. And this
is constantly updating, and we can see it's 150
to 200 milliseconds per invocation at the moment.
This is our AWS Lambda function for function one,
just to show you that we are importing failure-lambda
and wrapping the
Lambda handler with failure-lambda as well. In this
case then we have a
parameter stored in parameter store which
then contains the configuration for
that specific AWS lambda function.
So it is now set to false so
it isn't enabled. We can
then specify the failure mode, in this case latency,
and we're using a minimum latency of 100 milliseconds
and a maximum latency of 400 milliseconds.
So now to enable this, we simply update this
parameter, set it to
true still with
latency, and then save it.
Now switching back to the site and now
observing function number one.
We can now see that the invocation time for function one
is longer than for function two and three.
So we have added latency to that
AWS lambda function for each invocation.
And this is to be able to test how our application behaves in
case of latency. And looking
in the logs, we can also see that it's showing that
we are adding latency to the invocations as well.
So that's latency fault injection. And as
I mentioned before, we have a bunch of
different things we can test by using latency. So just disabling this
again, updating the parameter, saving, and
now it should go back to around 200
milliseconds once again. And that seems to be
the case. Cool. So now then let's check
parameter number two. In this case we're
going to use a different failure mode and we're going to use status
code instead to then manipulate
what status code is returned from our API and in this case
it's set to 404. So instead of returning 200,
we're returning 404, and I'm setting a rate of 0.5,
meaning that it will return a 404 on about half
of the invocations. Saving that,
switching back to the site, and we
can see function two. We're getting 200
right now. We are getting 200
and still 200. Come on. There we go,
we got an error on one invocation. We get an error again,
meaning that it's unable to get the response
from DynamoDB, basically getting a URL for
a new image, so it can't load a new image.
So by using this fault injection method,
we're able to then simulate what happens if we have responses
that aren't 200 or okay from our APIs,
changing it back, and it should now load a new
image on each invocation, which seems to
be the case. Right, let's check
failure-lambda parameter number three then,
and updating this one, I am
going to use a different failure mode. In this case we're going to use
denylist, and
with the denylist we're able to add which
calls to deny. In this case we're denying calls to S3
and to DynamoDB, but this could be any
third-party dependency as well.
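For this step, the denylist variant of the parameter might look like the following sketch; the endpoint patterns are illustrative, so check the failure-lambda README for the exact format.

```json
{
  "isEnabled": true,
  "failureMode": "denylist",
  "rate": 1,
  "denylist": ["s3.*.amazonaws.com", "dynamodb.*.amazonaws.com"]
}
```

With rate set to 1, every call matching the listed patterns is denied, which is what makes function three fail consistently in the demo.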
Setting it to true, and if you remember
the architecture we looked at, we have DynamoDB as a
downstream dependency, meaning that our
Lambda function should now not be able to fetch
data from DynamoDB for function number three.
We can see that function one and function two are continuing to
update, but function three is
throwing an error and isn't loading new images.
So once again we can see what happens when we are
injecting that type of fault into our application.
So that was very manual. So let's have a look
at how we can do this using AWS Fault Injection Simulator
instead and make use of those safeguards and automatic
rollback that I talked about. Creating a new experiment
template, choosing a role to be able to run this experiment,
I'm now adding an action.
Let's give that a name, lambda fault
injection, and select the action type.
And we used the automation execution in
the previous example that I showed you using FIS. And we can
use this here as well to set a document,
a document that is then meant to update
parameters in Parameter Store. So we have one document created
for this that you can simply use.
So then you just define what's needed.
That is the new parameter and the rollback
parameter.
And that's fine and dandy; that works, that's very
cool. And you can use that straight
away to do different types of experiments against Parameter Store.
Now switching back to FIS, I want
to show you something that we're, say,
playing with right now, because we're seeing customers using
parameters quite a bit. So we are running an experiment
ourselves with an action type that basically is put parameter,
to see if that might be something that customers want to
use. So I'm selecting this action type,
put parameter. I will then set the
duration for the experiment, or for the action, at two minutes.
Then I need to give the name of the
parameter to update. So let's
copy the name of failure-lambda parameter
two. And now you can
see that I am supposed to add a value and
a rollback value, and the value in this case is
the value that will be put into
Parameter Store. So I'm copying it from
our existing parameter,
switching it so that it's true, meaning that when the
experiment starts, it's going to enable the fault injection.
We're going to use status code as the failure mode,
keep it at rate 0.5 so 50% of invocations,
and finally keep using 404 as
a status code. Then we have the rollback value, and that's the
value of the parameter when the experiment stops,
either because the duration is over or because of a
stop condition. So saving
that. We don't define a target,
because the target in this case is defined through the
parameter instead. Creating my experiment template,
I can start it,
start the experiment and
we can see that it is running.
So, meaning that if we now switch
to our parameter and refresh, we can see
that it is now set to enabled.
So now the experiment is running. Switching over to
the demo site, we can see that function number two
is giving us an error because it is
getting a 404 response in return, then a
200 response, then a 404,
so on about 50% of invocations. Let's check the
experiment in FIS. It is now completed, and
with it being completed we should now have a rollback of our
parameter. So let's take a look at the
parameter in Parameter Store, and
we can see that it is set to enabled false, so disabled,
and function number two is returning 200 responses.
Very cool. So that's an example of how we can use AWS
FIS to once again automate and use these
experiments in a safer way. I want to show you one last thing
with this. I've talked about stop conditions, but haven't really
used any. So let's add a stop condition to this experiment.
I'm going to use the demo alarm that I have, my FIS demo alarm.
Saving that, and we can now
switch over to CloudWatch. And this is the
alarm in question. So as you can see, it is right now
in the OK state, meaning that no alarm
is going off right now. Starting the
experiment, the same experiment once again.
With the experiment started, the parameter is
being updated.
Double-checking: yes, it's updated.
And with that updated,
we will now have 404
responses every now and then on function number two, on about 50%
of invocations. Seems to work.
So let's now try to use this stop
condition. What if an alarm goes off?
So instead of actually waiting for something
to set it off, we can use the CLI to set
my specific alarm into the ALARM state.
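The CLI call to force an alarm state looks roughly like this sketch; the alarm name and reason are placeholders for whatever your FIS stop condition uses.

```shell
# Manually push a CloudWatch alarm into the ALARM state so the FIS
# stop condition triggers (alarm name and reason are placeholders).
aws cloudwatch set-alarm-state \
  --alarm-name fis-demo-alarm \
  --state-value ALARM \
  --state-reason "Testing FIS stop condition"
```

This is a handy way to rehearse the safeguard itself without waiting for a real metric breach.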
So doing that, using that command,
switching back to CloudWatch, we can see that it is now in
the ALARM state. And as soon as that alarm
moves into the ALARM state, we can see that AWS FIS
is stopping the experiment. Since that was our stop
condition, a safeguard, it is halted by a
stop condition, and that
also means that it will use the rollback behavior
and update our parameter.
Refreshing to make sure: yes, the fault injection is now
disabled, and
our demo site should now be back to normal
and returning 200 responses once more.
So that was an example of how we can use AWS FIS to,
first off, do these experiments by updating a parameter.
And I showed you something that we
are experimenting with, a new action type where
we put a parameter straight into Parameter Store,
but you can already do it right now using SSM automation.
The document is available for use.
All right, so then I
want to do a bit of a summary recap of what we've looked
at. First off,
the chaos engineering part is the same no matter if it's serverless
or if it's, say, serverful.
To find the hypothesis for your serverless application,
use those what ifs that we asked earlier on.
What if a downstream service is unavailable? What if latency
is added and then create a hypothesis around that?
And when you're doing the experiments, make use of configuration,
network, and code manipulation. We looked at examples of
configuration and code manipulation in this session,
and then try to use safeguards and automatic rollback,
so you don't have to be responsible for actually running a
command to roll back, or changing configuration
in a console, to get that rollback
behavior. And if you want
to have more chaos engineering for serverless,
just scan the QR code shown on screen right now,
or go to grosh serverless chaos
for more links, examples, and demos,
all gathered in one place. And with that,
I want to thank you for joining this session. Happy to
be here at Conf42 Chaos Engineering. My name
is Gunnar Grosch, developer advocate at AWS. If you
want to contact me, I'm available on Twitter as shown on screen,
and LinkedIn, of course. Happy to connect. Thank you
all for watching.