Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hello everyone. Thanks for joining me on this session today. It's a pleasure to have you all. Today I'm going to be talking about cloud chaos engineering with AWS Fault Injection Simulator. My name is Samuel Baruffi, and I'm a solutions architect here at Amazon Web Services. I help support global financial services customers in their cloud journey with architecture, best practices, and so forth. Today I want to talk to you about how you can improve your resiliency and performance with controlled chaos engineering. You might be wondering, what is chaos engineering? Or maybe you're already very familiar with chaos engineering. What I want to show you is how AWS and the AWS ecosystem can help you on your chaos engineering journey.
So let's spend a few seconds on the agenda for what we're going to cover today. We are going to start by talking about some challenges with distributed systems: how distributed systems work and why they are complex by nature. Then we'll jump into why chaos engineering is hard, and what we have heard as a company from customers who have tried to do chaos engineering by themselves, along with some of the lessons and requirements we heard from them. Then, of course, we're going to introduce AWS Fault Injection Simulator. You'll see that I use the acronym FIS interchangeably with Fault Injection Simulator, the name of the service on AWS. After that, we're going to dive deep into some of the key features that Fault Injection Simulator brings to us, and some of the use cases where you should or could be using the service. At the end, I will spend a little bit of time doing a simple demo, showing the console and demonstrating how you can use the service itself. Before the demo, I'll also share some resources you can use to take your learning to the next level if you are interested.
So let's move forward and talk about challenges with distributed systems. While most of us understand that distributed systems have undeniably revolutionized the IT industry in the last decade or so, they do come with challenges. Those challenges can be a combination of many things: latency, scalability, reliability, resiliency, concurrency, and more. As systems grow larger and more distributed, what is often a theoretical edge case can become a real occurrence. It's a common mistake, and maybe I've been at fault on this in the past too, to think that distributed systems only become complex when they get really big, when you're talking about hundreds or thousands of microservices. That is of course not the case, and let me explain what I mean by that.
Even in a very simple application where we just want to send a message from a client to a server, there are a lot of steps involved in that communication. Let's look at it. If a client wants to send a message to a server, the first thing that happens is the client puts the message onto the network. The network is responsible for delivering the message to the server. The server then validates the message. Once it has validated the message, the server updates its state. Once that state is updated, the server puts a reply onto the network. The network is then responsible for delivering the reply to the client. Once the client receives the reply, it validates the reply, and finally the client updates its own state. So it's mind blowing to realize how many steps happen behind the scenes in this very simple scenario of sending one message from a client to a server, and at how many of those steps you can have failures. Now multiply that by hundreds, thousands, millions, or even billions of occurrences across many of our microservices.
Too often we only implement tests after we have outages. It's very common that there is an issue on the network, for example because we don't have redundant network gear, and only after the occurrence has happened do we go and improve. We want to change that, right? One of the things that has been done a lot is traditional testing, and of course you need it. The message on this slide is: please don't stop doing your traditional testing. You should never stop doing it. It's just that traditional tests don't cover all the unknowns that distributed systems bring to the table, or all the complexity that you have in production environments. What traditional tests are good at is verifying known conditions and answering questions like: is this specific function or action returning the expected behavior? They are really good at that, whether you're using unit tests, functional tests, or integration tests. But let me pose a question: what about failures with weird errors that happen on the network, like traffic that goes over the Internet? What about configuration limits on cloud providers? What about drift in your infrastructure? And what about all the unknowns that you are not familiar with and are not testing for? How can you test for something that you don't know about yet?
And it can get even more complicated. Some things are just really hard to test. I'll give you an example. In a system where you have multiple instances that start and stop dynamically, what happens if one of those instances runs out of disk space? I'm pretty sure the majority of you have been in a similar situation where you had to perform maintenance on servers that had run out of space. Debugging applications that have run out of space is really complex and looks something like this: you just see a bunch of errors, and no action you try to take on the machine actually goes through. That is a relatively common issue, and often the root cause is just a misconfiguration of log rotation, or not monitoring the disk space on that instance from your monitoring systems. For monitoring you can use a third-party vendor or solutions on AWS, but some of the fixes you should have in place are: first, log rotation. If you don't have log rotation in place, then another solution is to have monitoring that watches the storage on the instance, and once it gets close to, say, 90% utilization, it sends a message so you can reactively make improvements. Of course, ideally you want automation that solves those problems for you, rather than having to page someone in the middle of the night to make those changes. But you can see this is just one example of the unknowns that you probably haven't covered in your unit tests or integration tests.
So the question in the industry is: how can you be more prepared for the unknowns? Luckily there is a kind of engineering, and that's what this talk is about today, that helps you with that. As all of you already know, its name is chaos engineering. Chaos engineering focuses on three main phases: the stress phase, the observe phase, and the improve phase. The stress phase means that you are stressing an application, either in a test or in a production environment, by creating disruptions, by injecting failure events such as server outages, API throttling, or network disruptions into your environment. After you have injected faults for a period of time, you observe what they mean: you observe the systems and how they respond. This is a really important part of chaos engineering, because chaos engineering can only exist if you have a really good observability system in place. Once you observe, by checking whether your system is completely healthy or whether something unexpected has occurred, you analyze what those occurrences are, and then you go to the last phase, the improve phase: you make changes so your application is more resilient or more performant. We want to prove or disprove the assumptions we have about whether our system can or cannot handle those disruptive events.
So chaos engineering focuses on improving the resiliency and performance of your workloads, but it also focuses on uncovering hidden issues. That's one of the main benefits of chaos engineering: those hidden issues are really hard to know about ahead of time. You also want to expose your blind spots, and this relates to the example I mentioned before about having a very good observability story. If you don't have proper monitoring, observability, and alarms, your application might fail and you won't have good data to understand what happened. That is another aspect of what we call continuous resiliency: if there is any sort of failure that you were not expecting, or that hasn't been uncovered by a metric or your observability system, you need to improve that story as well. But there is more to it. You also want to improve your recovery time: if there is a major issue you weren't able to protect your application from, how do you improve your recovery time? How do you improve your operational skills? And how do you implement a culture of chaos engineering? When we look at chaos engineering, we can look at it in different phases. One important thing to mention is that chaos engineering is not about breaking things randomly without a purpose. Chaos engineering is about breaking things in a controlled environment, through a well planned experiment, in order to build confidence in your application and tools so they can sustain turbulence and potential issues. To do that, you have to follow a well defined scientific method that takes you from a hypothesis, to running an experiment, to verifying the experiment, to improving, and then back again to the steady state. Chaos engineering shouldn't be something you run just once a year. It should be a practice that you motivate and keep innovating on, with your engineers continuously applying it to your workloads. That way you can sustain failures while keeping your business outcomes intact, not getting disrupted by random, unknown failures that your application might face.
But let's talk about why chaos engineering is difficult. At the beginning of the presentation, I mentioned that AWS has talked to a lot of customers who have tried to do this by themselves, and we have collected four main pieces of feedback from customers across a variety of industries and company sizes. The first is that it's really hard to stitch together different tools and homemade scripts. You might have some open source tooling, or you might be building some Python scripts and batch jobs to implement those failure injections, or even the observability piece. It's really hard to own that whole story yourself. You also need a lot of agents and libraries to get started, so you might need to put a lot of infrastructure and configuration in place, and it's not very easy to begin. Then, probably the most important one in my opinion: it's really difficult to ensure safety. If you're doing chaos engineering in production environments, the goal is to find the hidden problems, but at the same time you don't want to bring down your whole application. So how do you create guardrails that stop your chaos engineering, in production or even in test for that matter, before it brings down the whole application and affects your business outcomes? And the last one: it's really difficult to reproduce real-world events, because real-world events are not as simple as a single API failing. Normally it's a combination of scenarios that run in sequence or in parallel, and that is really very hard to reproduce.
With that in mind, what AWS introduced a couple of years ago at re:Invent, AWS's yearly global conference, is AWS Fault Injection Simulator. AWS Fault Injection Simulator is fully managed chaos engineering as a service. It is really easy to get started with, and it allows you to reproduce real-world failures, whether the failure is as simple as stopping an instance or more complex, like throttling APIs. AWS Fault Injection Simulator fully embraces the idea of safeguards, which is one of the things we've heard from our customers: they want to run these chaos engineering tests, but at the same time they want to be protected from causing full outages of their applications. Fault Injection Simulator brings that capability: it gives you a way to monitor and control the blast radius of your experiment and stop it automatically if alarms go off. So you have three main pillars here that we're going to talk about in a little more detail.
So let's talk about why it's easy to get started with Fault Injection Simulator. First, you do not need to integrate multiple tools and homemade scripts; Fault Injection Simulator manages all the tests and experiments for you. You can use the AWS Management Console that you are already familiar with, or the AWS CLI, to run those experiments. The interesting part here is that you can use pre-existing experiment templates, and we're going to talk about what experiment templates are in a moment, so you can get started in minutes. It's also really easy to share your experiment templates with other folks within your organization, or, if you prefer, you can open source them and make them available to the community. The templates are JSON or YAML files that you can share with your team and put under version control, so you can benefit from the best practices associated with code reviews.
Then let's move to the next topic, which is real-world conditions. You can run experiments both in sequence and in parallel. I mentioned before that real-world failures are not as simple as one event; sometimes they're a combination of events. Fault Injection Simulator allows you to combine different actions that inject failure in parallel or in sequence, and you can choose. You can also target all levels of the system: the host, the infrastructure, the network, and more, or you can select just a few of them. You have full control and flexibility in that regard. And then real faults, and this is an important one: faults are injected at the service control plane level. The faults being injected into your environment are actual real-world faults, not mocked-up APIs or manipulated metrics; they are real failures happening in real time. As an example, if you configure an experiment template to terminate an instance, the instance really will be terminated on AWS. So you have to be careful, because nothing is faked with metric manipulation, and you have to pay attention so you don't do something you didn't intend.
And then the safeguards are where you create those guardrails, which are stop condition alarms. You can configure alarms in CloudWatch, or potentially in third-party tools, so that if those alarms are triggered, a signal is sent to the Fault Injection Simulator service that says: please stop what you're doing, because it's impacting my service beyond a threshold I don't want to cross. Like I said, it integrates natively with Amazon CloudWatch, and it has built-in rollbacks as well, so it can undo what it has done up to that point.
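To give you a rough idea, and this is a sketch I'm adding here rather than something from the slides, a stop condition inside an experiment template looks roughly like this; the alarm name and account ID are placeholders:

{
  "stopConditions": [
    {
      "source": "aws:cloudwatch:alarm",
      "value": "arn:aws:cloudwatch:us-east-1:123456789012:alarm:HighErrorRate"
    }
  ]
}

If that alarm goes into the ALARM state while the experiment is running, FIS stops the experiment.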
Because Fault Injection Simulator can be dangerous, you want to make sure only the right people within your organization have access to run experiments. As with most AWS services, you can control this with fine-grained IAM controls. You can say that a specific IAM principal can only perform these actions on these resources, and that only these people can start experiments. You can also control what types of faults can be used and what resources can be affected by using tag-based conditions. For example, only instances with an environment tag of test can be affected, and nothing else; you don't allow anything else to be touched by Fault Injection Simulator.
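As an illustrative sketch (the action list and tag key are my own assumptions, not the exact policy from the talk), the role that FIS assumes could restrict instance-level actions to test-tagged instances like this:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "ec2:StopInstances",
        "ec2:TerminateInstances"
      ],
      "Resource": "arn:aws:ec2:*:*:instance/*",
      "Condition": {
        "StringEquals": { "aws:ResourceTag/Environment": "test" }
      }
    }
  ]
}

With a policy like this attached to the experiment role, FIS can only stop or terminate instances that carry the Environment=test tag.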
Here you can see an architecture diagram of the service. In the middle you have AWS Fault Injection Simulator, which is controlled by IAM, where you have all the policies and permissions defining who can do what in the service. You can access the service via the console, the CLI, or a combination of both. Once you start an experiment, the experiment injects faults into AWS resources: compute, databases, network, and storage. Those resources are monitored by CloudWatch alarms, or potentially third-party tools; you can choose. We recommend you create stop conditions, and if those stop conditions are triggered, EventBridge sends a notification to the Fault Injection Simulator engine to stop the experiments and roll back what has been done. Again, it's a best practice to have CloudWatch alarms monitoring your AWS accounts and workloads so you can define stop conditions that automatically stop the experiments.
Let's talk about some of the components that make up the Fault Injection Simulator service. You have actions, targets, experiment templates, and experiments. Let's look at each one of them individually.
Actions are the fault injection actions executed during the experiment. They are defined using a namespace of the form aws:service-name:action-type. An action definition can include the fault type, the target resources, the timing relative to other actions, and fault injection parameters such as the duration, the rollback behavior, or the portion of requests to throttle. As an example, in this JSON representation you have two actions defined: one is a stop-instance action and the other is a wait action. Notice that the wait action is ordered to execute only after the stop-instance action has executed, so they run sequentially. It's also worth noting that some host-level actions on EC2 instances are performed through the Systems Manager (SSM) agent. The SSM agent is software that is installed by default on some operating system images, such as Amazon Linux and Ubuntu, and you can find that information in the SSM or Fault Injection Simulator documentation.
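The slide itself isn't reproduced in this transcript, but a minimal sketch of such an actions block, assuming the aws:ec2:stop-instances and aws:fis:wait action types and a hypothetical target name of myInstances, would look roughly like this:

{
  "actions": {
    "StopInstances": {
      "actionId": "aws:ec2:stop-instances",
      "parameters": { "startInstancesAfterDuration": "PT5M" },
      "targets": { "Instances": "myInstances" }
    },
    "Wait": {
      "actionId": "aws:fis:wait",
      "parameters": { "duration": "PT5M" },
      "startAfter": ["StopInstances"]
    }
  }
}

The startAfter field on the wait action is what enforces the sequential ordering described above.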
Now let's look at targets. We talked about actions; targets define one or more AWS resources on which to carry out an action. The action is what you do, and the target is where that action will actually be executed. You define targets when you create an experiment template, and you can use the same target for multiple actions in your experiment. Targets include the resource type, resource IDs, tags and filters, and a selection mode: all of them, a random count, a percentage, and so forth.
Here is an example JSON representation of targets. In this example we filter targets to instances running in one specific Availability Zone, us-east-1a. But that's not the only filter: there is also a tag filter to refine the selection, so only EC2 instances in us-east-1a with the tag Environment=test will be impacted. There are further filters so that only instances in the running state, and only instances within a specific VPC, are selected. And you can see the selection mode says we just want two instances, so if there are more instances, they will not be affected; we're selecting only two. There are other combinations you can use, like percentages or random selection, as mentioned before.
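Again, the slide isn't reproduced here, but a sketch of a targets block along those lines (the filter paths, tag value, and VPC ID are assumptions for illustration) could look like this:

{
  "targets": {
    "myInstances": {
      "resourceType": "aws:ec2:instance",
      "resourceTags": { "Environment": "test" },
      "filters": [
        { "path": "Placement.AvailabilityZone", "values": ["us-east-1a"] },
        { "path": "State.Name", "values": ["running"] },
        { "path": "VpcId", "values": ["vpc-0123456789abcdef0"] }
      ],
      "selectionMode": "COUNT(2)"
    }
  }
}

The target name here, myInstances, is what the actions block references in its own targets map.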
Experiment templates define an experiment and are used in the start experiment request. Think of an experiment template as the place where you put the actions, the targets, and everything else together; all of that information is combined into one experiment template. An experiment template includes the actions we talked about and the targets, plus some optional information like stop condition alarms, which we highly recommend you always have, so that if something goes south that you're not expecting, the experiment is stopped automatically. You also have an IAM role that will be assumed to execute the experiment, plus a description and some tags. When you look at the JSON of a template, it looks something like this: a stop-and-restart-instance template, which is the name of the experiment template, with a description, a role ARN that will be used to assume the role and execute the specific actions (so you need to make sure that role has the required permissions), then the targets section we talked about, and then the actions, which we also talked about.
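Putting the pieces together, here is a rough sketch of what such a stop-and-restart-instance template could look like as a create-experiment-template request body; the role ARN, alarm ARN, and names are placeholders rather than the exact values from the slide:

{
  "description": "Stop and restart an instance tagged Environment=test",
  "roleArn": "arn:aws:iam::123456789012:role/fis-experiment-role",
  "stopConditions": [
    { "source": "aws:cloudwatch:alarm",
      "value": "arn:aws:cloudwatch:us-east-1:123456789012:alarm:HighErrorRate" }
  ],
  "targets": {
    "myInstances": {
      "resourceType": "aws:ec2:instance",
      "resourceTags": { "Environment": "test" },
      "selectionMode": "COUNT(1)"
    }
  },
  "actions": {
    "StopInstances": {
      "actionId": "aws:ec2:stop-instances",
      "parameters": { "startInstancesAfterDuration": "PT5M" },
      "targets": { "Instances": "myInstances" }
    }
  },
  "tags": { "Name": "stop-and-restart-instance" }
}

The startInstancesAfterDuration parameter is what gives the "stop and restart" behavior: the instance is stopped and then started again after the duration elapses.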
Let's look at two experiment templates that are very different, but both are ideas you can build with Fault Injection Simulator. As explained earlier, you can run a simple experiment like the one on the left, which is a sequential experiment with two actions across three different targets: one target group with very little filtering, and a more specific target group filtered by tags and so forth, plus a stop condition. But you can also do something like the one on the right, where you have a target that filters all EC2 instances with a chaos-ready tag, and a combination of actions: action one happens, then action two, and once action two finishes, action three happens as well. You can mix parallel and sequential actions, and you can configure multiple stop conditions, which we highly recommend, in your test environments but especially in production. In this example you can see there are two stop conditions.
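As a rough sketch of how that mix could be expressed (the action types, target name, and alarm ARNs are my own placeholders, not the exact slide content): the two actions below without a startAfter field start in parallel, while the third waits for both, and two stop conditions guard the whole experiment.

{
  "stopConditions": [
    { "source": "aws:cloudwatch:alarm",
      "value": "arn:aws:cloudwatch:us-east-1:123456789012:alarm:HighErrorRate" },
    { "source": "aws:cloudwatch:alarm",
      "value": "arn:aws:cloudwatch:us-east-1:123456789012:alarm:HighLatency" }
  ],
  "actions": {
    "ActionOne": {
      "actionId": "aws:ec2:stop-instances",
      "targets": { "Instances": "chaosReadyInstances" }
    },
    "ActionTwo": {
      "actionId": "aws:fis:wait",
      "parameters": { "duration": "PT2M" }
    },
    "ActionThree": {
      "actionId": "aws:ec2:terminate-instances",
      "targets": { "Instances": "chaosReadyInstances" },
      "startAfter": ["ActionOne", "ActionTwo"]
    }
  }
}

For a purely sequential flow like the left-hand example, you would simply chain startAfter from one action to the next.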
And finally, experiments are simply a snapshot of the experiment template at the moment it was launched. You can see in the service an execution list of all the experiments: every time you launch an experiment template, you automatically create an experiment, and you can see who initiated it, what the result was, and all the related data. An experiment includes a snapshot of the experiment template you used, the creation and start time, the status of the experiment, the execution ID, the IAM role ARN, and a few other pieces of information.
Now, what fault injections does FIS currently support? There are a lot of things on this list, and the list will keep growing. You can inject server errors on EC2, throttle APIs on IAM, kill processes on EC2 instances, inject latency on EC2, kill container instances on ECS, and do the same on EKS by terminating nodes. Recently we also announced network disruption, EBS I/O pause, and a few others, and you'll see this list grow over time.
Let's look at some use cases for the Fault Injection Simulator service, and at how we see customers adopting chaos engineering, both when getting started and in more advanced practice. First, one-off experiments. This is perhaps one of the most common ways of doing chaos engineering. These can be experiments where you want to verify a new service in your system or a specific part of your architecture, or maybe expose monitoring blind spots. You create a one-off experiment and you go through all the phases of chaos engineering: understanding the steady state, forming a hypothesis, designing and running the experiment, and so on. This is a great starting point for chaos engineering. You do a one-off experiment and you prove your hypothesis: nothing broke, you verified something within your system, success. Or perhaps you disproved your hypothesis, something unexpected happened, and you're going to improve. The goal is that you have learned something about your system and you were able to implement improvements. But those are just one-off experiments.
Another common use case for chaos engineering is as part of a game day. A game day is a process of rehearsing ahead of an event by creating the anticipated conditions and then observing how effectively the teams and systems respond. An event could be an unusually high-traffic day, say a promotion day for your ecommerce site, or a new launch, or a failure, or something else. You bring everything together, you prepare for that game day, and you can use chaos engineering experiments to run it: you create those event conditions, monitor the system, see how your organization behaves, and make new improvements.
Another use case is automated experiments. Automating experiments really goes back to the scientific side of chaos engineering: repeating experiments is standard scientific practice in most fields. Automated experiments help us cover a larger set of experiments than we could ever cover manually, and they keep verifying our assumptions over time as parts of the system change. So instead of running a one-off experiment maybe every six months or so, you have automated experiments that keep running as your architecture changes, and you don't have to rely on a lot of people across the organization; the automated experiments repeat themselves on their own. Let's talk a little more about some examples of automated experiments.
The first kind of automated experiment is recurring scheduled experiments. This is a great way to start with automation, because it simply keeps verifying your assumptions over time. Take, for instance, an example where different teams build and deploy their own services within the system; it's very common in distributed systems to have dozens, hundreds, or thousands of microservices managed by different teams. How do I know that the behavior I verified through a chaos engineering experiment today is still valid tomorrow? A recurring schedule lets you run the experiment maybe every hour, every day, or every week, so you keep checking those conditions by injecting faults as your architecture evolves.
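One way to wire this up, as a sketch rather than a prescribed setup, is an EventBridge rule on a schedule whose target (for example, a small Lambda function you own) calls the FIS StartExperiment API with your template ID. The rule definition could be as simple as:

{
  "Name": "weekly-chaos-experiment",
  "Description": "Kick off the recurring FIS experiment once a week",
  "ScheduleExpression": "rate(7 days)",
  "State": "ENABLED"
}

The rule itself only provides the schedule; whatever you attach as its target is what actually starts the experiment.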
So that is one way. Now let's look at automated experiments based on an event trigger. An event is something that happens or is regarded as happening in the system: it could be, say, an order being placed, a user login, or even an auto scaling event. So what if we get latency to our downstream service when there is an auto scaling event? Does that affect our users? Using event-driven trigger experiments, we can verify that behavior. You create an experiment template and trigger an experiment when an auto scaling event occurs. You can say: when I see a scale-up of traffic, or some specific user action, please trigger the chaos experiment, and let's verify how the infrastructure and the workload behave during that specific time. Over time you can think about more of those event triggers, and within AWS you can use EventBridge to automate a lot of that.
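For example, and this is an illustrative assumption rather than a setup from the talk, an EventBridge rule could match Auto Scaling launch events with a pattern like this, with its target set to whatever starts the experiment (such as a Lambda function calling StartExperiment); the group name is a placeholder:

{
  "source": ["aws.autoscaling"],
  "detail-type": ["EC2 Instance Launch Successful"],
  "detail": {
    "AutoScalingGroupName": ["my-service-asg"]
  }
}

A rule with that pattern would fire whenever that Auto Scaling group scales out, which is exactly the moment you want to verify your latency assumption.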
Then, of course, you can make chaos engineering part of your CI/CD pipeline: continuous integration, continuous delivery, continuous deployment. You can add a stage in your pipeline that automatically starts one or multiple experiments against your newly deployed application. For instance, in the staging environment, before you push to production, you start multiple experiments by triggering the Fault Injection Simulator service with specific experiment templates. By doing so, every time there is a new push of code into a specific environment, staging in my example, you verify the behavior of your system with each deployment. You do need good observability tools to collect all the data and analyze it. And this again helps us verify our assumptions, because you created the experiment template based on the assumptions and hypotheses you have, even as parts of the system change.
I think it goes without saying, but it is still worth pointing out, that to do automated experiments you need to embrace safeguards. It's really important that you use the guardrails and the stop conditions within Fault Injection Simulator. When you automate a lot of these experiments, you're not just clicking a button and watching the events happen like in a one-off experiment, so you want to make sure an experiment that has brought your application into a degraded state can be stopped automatically. It's really important, and I'll highlight it again: please be very careful with that.
So I said aim for automation, but on the chaos engineering journey you should start with one-off experiments. Once you are comfortable with those, run some game days. From game days you can move to scheduled chaos engineering experiments, and then to more automated ones, potentially event-driven or as part of your CI/CD.
Just before we pause here for the demo: if you are interested and would like to use Fault Injection Simulator within your organization, please check out these resources. You have five links here. The first is the AWS Well-Architected Framework, which provides best practices and guidance on how to build workloads that are fault tolerant, resilient, highly available, cost optimized, secure, and operationally ready for production. You can also click through to the Fault Injection Simulator service page on our website. If you're interested, I highly recommend going through the chaos engineering workshop on your own time, in your own AWS account; it guides you step by step through some of the experiments that are common across many companies. You can also check the Fault Injection Simulator documentation, and there is a public GitHub repository with a lot of examples, so you can copy the JSON files, run those examples, build on top of them, or simply reuse them.
Now I'll do a very simple demo; for the sake of time I'll keep it short, just showing the console and how you can get started with Fault Injection Simulator. I'll see you in a moment as I transition to my screen on the AWS console. Okay, so let's jump into the demo. What I want to show you is a simple example. As you can see in this diagram, the application has two EC2 instances managed by an Auto Scaling group, and they're just running NGINX as a web server. Because this is a demo and I don't have real people using this web server, I'm going to generate some synthetic load, and then we're going to create an experiment that terminates one of the instances in my Auto Scaling group. So I have two instances in the Auto Scaling group, I want to terminate one, and then I want to use my monitoring dashboard and the observability data I have collected to understand how my application behaves. It's very simple, but I'll show you step by step how you can build it using Fault Injection Simulator, and then we'll look at some of the results. The hypothesis is: I have an application with two EC2 instances managed by an Auto Scaling group, and my application shouldn't suffer an outage, because I have two instances and one will still be serving traffic through the load balancer. So I have a load balancer endpoint, as you can see here; I'll just refresh. This is the load balancer endpoint, and it's serving the phpinfo PHP page, which is what I'm going to use. Before we jump there, I want to show you a quick dashboard.
This is something you need to have in place in order to do the observe part of chaos engineering. I have this dashboard in CloudWatch. CloudWatch is a monitoring tool, a managed service on AWS, that supports monitoring metrics, logging, and observability. Here I have a dashboard that collects a number of graphs. The first one is the client load connection status, so I can see if there are any error statuses: 500s, 400s, or 200s. In this case you don't see any data, because there is nothing there right now. On the other side you can see the server NGINX connection status; it shows just one connection, because my load balancer is only pinging the EC2 instances to see if they're healthy before routing traffic to them. Then you can see the response times; of course there is nothing there yet, because no traffic is being generated until I run the load test. Then I collect CPU utilization, and you can see it's 99% idle, so nothing is running there. You can see some of the network status: TCP time-wait is very low and TCP established is over here, so currently there are no network connections. Down below you can see two more graphs; let me move this so it's easier to see. You can see the number of instances in my Auto Scaling group, and the number of healthy versus unhealthy instances. I have one healthy count in one Availability Zone and another healthy count in another Availability Zone. The health check from my load balancer shows that both instances are healthy, and finally the instance check of my Auto Scaling group shows they're both healthy.
So let's jump into the console and create a Fault Injection Simulator experiment. Let's search for Fault Injection Simulator, FIS, go into the service, and create an experiment template. As I explained before, an experiment template is the combination of things we want to test for a hypothesis. For this hypothesis we'll set the description to "terminate half of the instances in an Auto Scaling group" and the name to "terminate half of instances". So you give a description and a name. Now you need an action; if you remember, an action is something that you want FIS to go and do. We're going to call it "terminate instance"; the description is optional. For the action type, you can just type "terminate instance", and we are going to use the pre-built action called EC2 terminate instance. What this does behind the scenes is actually terminate an instance, and I'll show you how it works. I don't need "start after" because we're doing a simple single-action experiment, and it automatically creates a target for me. I'll show in a moment what the target contains and how we can shape it to what we actually need. I'm going to click save, and it automatically creates a target. Let's click edit, because right now it just has a default name and the resource type, which is correct. I'll rename it to "asg-target" just so it's clearer. I don't want to manually select the EC2 instances, because remember, my Auto Scaling group is managing them. I want the target to be selected by resource tags, and I'll show in a moment what those tags are; I'm just checking here to make sure I have them right. The resource tag will be the Name tag that my Auto Scaling group stack applies to its instances. I also want a filter: I only want Fault Injection Simulator to look at instances that are running, so the state of the instance needs to be running. And then the selection mode: I don't want all the instances, because I know that if I terminate all of them my application will be down. I want 50% of my instances; in this case I only have two, so one randomly selected instance of the two will be terminated. So I'll go and click save.
In this case I have already configured an IAM role that has permission to perform these actions, like terminating instances in my Auto Scaling group, so I'm just going to use the FIS workshop service role; you would need to create one yourself. For the sake of the demo I'm not going to create any stop conditions, but I highly recommend you create a stop condition every single time, so that if something happens outside your control, a CloudWatch alarm triggers and stops your experiment. We also want to send logs to CloudWatch Logs, so I'm just going to browse; we have a log group for the FIS workshop, one second, I think it's here, yes, fis logs. That's where I want to save the logs, so all the logs of what the experiment does will be saved to CloudWatch Logs, and you can review them there. Then it just asks for a name, and I'm going to click create experiment template. It asks me: are you sure you want to create an experiment template without a stop condition? This is a warning for you. In this case, because it's a demo, I'm just going to say create, but in your production environment you should most definitely have that stop condition. So I'm going to go ahead and create the experiment template.
Here I have my experiment template. You can look at the targets: the template will terminate EC2 instances that have this specific resource tag and match this filter. Let me show you those instances, so you can see I'm not just making this up; there is no vaporware here.
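For reference, the template I just built through the console would roughly correspond to this JSON; this is a sketch, and the tag value, role ARN, account ID, and log group are placeholders based on the FIS workshop naming rather than exact values from my account:

{
  "description": "Terminate half of the instances in an Auto Scaling group",
  "roleArn": "arn:aws:iam::123456789012:role/FisWorkshopServiceRole",
  "stopConditions": [ { "source": "none" } ],
  "targets": {
    "asg-target": {
      "resourceType": "aws:ec2:instance",
      "resourceTags": { "Name": "FisStackAsg/ASG" },
      "filters": [ { "path": "State.Name", "values": ["running"] } ],
      "selectionMode": "PERCENT(50)"
    }
  },
  "actions": {
    "terminate-instance": {
      "actionId": "aws:ec2:terminate-instances",
      "targets": { "Instances": "asg-target" }
    }
  },
  "logConfiguration": {
    "logSchemaVersion": 1,
    "cloudWatchLogsConfiguration": {
      "logGroupArn": "arn:aws:logs:us-east-1:123456789012:log-group:fis-logs:*"
    }
  }
}

Note the stop condition source of "none", which is exactly the gap the console warned me about a moment ago.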
If we look here, we have two instances from the FIS stack ASG that are running, one in us-east-1b and another in us-east-1a, managed by an Auto Scaling group. If I go to my Auto Scaling group, you can see it has a desired capacity of two and a minimum capacity of two. Looking at instance management, I have two instances launched from the same launch template, one in us-east-1a and one in us-east-1b. If I click on an instance, it takes me to the one that has the specific tag we are filtering on. So when I start the experiment, it's going to select one of those and kill it.
What we want to do now, because this is just a demo and we don't have real load, is run a simple script that generates synthetic traffic to my instances. I have a script here that calls some Lambda functions to generate load. Once we have generated load, you'll see in a moment that these graphs start picking up the load test within a few seconds or minutes. So let's give it a few seconds, watch the load pick up, and then kill an instance and see what we can observe. Our hypothesis is that the application should remain online, but will it actually be fully available, or are we going to see connection errors or too much traffic on the remaining instance? While we wait for that, since CloudWatch has a little bit of delay before the metrics show up on the dashboard, I'm just going to start my experiment.
I'm going to go to the console, to my experiment templates, select the one I created, "terminate half of instances", and click start experiment. I can add a tag, for example a Name for this experiment; you can call it whatever you want. Then I click start experiment. It asks me again: are you sure you want to start this experiment without a stop condition? If something goes outside your control, you can still stop the experiment manually. Again, because it's a demo, we are fine, so we're just going to click start. You can see it's in the initiating state. We click refresh and it's in the running state; it takes a few seconds to actually run. Let's wait a little bit here, and on the timeline, if you refresh while it's running, you can also look at the logs, which will be published here once the action and the experiment have finished. What this is actually doing behind the scenes is terminating one of those EC2 instances. You can see it's completed; on the timeline, after a refresh, it shows just the terminate instance action, because that's the only thing in the template, and the logs will show up in a moment. Here they are, actually: it starts the experiment and then it terminates the instance. So here you can see the action has completed and it terminated the instance for me.
If we go and look at the Auto Scaling group and refresh, you can see that one instance is now unhealthy, because the experiment terminated it, and because the Auto Scaling group has a desired capacity of two, it's automatically creating a new instance. Now if you look at the CloudWatch dashboards, we can see that we have load, right? Around 2,900 requests have been successful, but there are a lot of requests getting 500 HTTP errors. My application is still up and running, and if I try to refresh, you can see it's working, but I might get a gateway error or a 500 error, mainly because I now only have one EC2 instance. You saw it took time to load, and now you can see the CPU usage: before it was nothing, but now it's over 60%. You can see that some of the network connections are in a wait state, not all of them, and the response time in milliseconds is getting worse. As I refresh this page, you can see it's taking a while; it's spinning, it's not doing a good job. If I go down below, you can see that the dashboard now recognizes only one healthy instance, and I also see only one healthy host, but the Auto Scaling group is spinning up another instance, so within a few seconds or minutes you'll see better connections because another instance will be serving traffic. While this experiment is ongoing, simple as it is, what we were able to observe is that if I have a peak of traffic and one of the instances goes down, I'm not really able to serve quality traffic with good performance to my customers. And you can see it when you look at the dashboard: there are a lot of error counts, more than half of those requests are errors, you're getting 500s or failed connections, some requests aren't even able to establish a connection, and it's taking quite a bit of time because the latency has increased. That's what you observe.
If I were the owner of this application, I would potentially increase the pool of instances in my Auto Scaling group to maybe four, spread across three or four Availability Zones, depending on the Region this is running in. So this is really simple. As you scroll down, you can see it picked up the latency: for some requests the maximum duration was 2.6 seconds, and the average is now about 1.3 seconds, whereas if you look back over the last hour it was much lower, because we're now running the experiment under a lot of load. If we scroll down, we only had one healthy instance, but now we have two instances back in our Auto Scaling group, and the load balancer is running its health checks to bring the new instance into service. In a moment, although we might not have enough time to watch it finish here, you'll see the server CPU get much better once that instance is in place. So this is a very simple example, and you can look at the experiments: once you click on experiments you can see all of them, click on an experiment ID, and find the timeline. In this case I'm only running one action, but you can have a sequence, you can run many actions in parallel, and you can start experiments, as we discussed for automated experiments, as part of a recurring schedule using EventBridge or as part of your CI/CD. You can mix and match a lot of different combinations, and the whole idea is to keep continuous resiliency in mind and improve the performance and availability of your application.
So that was it for the demo. I hope you were able to take away some key learnings about Fault Injection Simulator. I highly recommend you go through the workshop, and feel free to reach out to the service team, to me, or to anyone on the AWS team if you have any feedback or just want to share your experience. Thank you so much everyone, it was a pleasure. I wish you a great rest of the conference.