Transcript
This transcript was autogenerated.
Hi, hello. In this session, we'll learn about chaos engineering and how to use it to build resilient, well-architected applications. Well-architected applications are designed and built to be secure, high performing, and resilient. You need to test your applications and validate that they operate as designed and are resilient to failures.
We'll start with a quick look at AWS Well-Architected and how it ties together with chaos engineering. Then we'll do a short intro to AWS Fault Injection Simulator and walk through an example.
Creating technology solutions is a lot like constructing
a physical building. If the foundation is not solid,
it may cause structural problems that undermine the
integrity and function of the building.
The Well-Architected Framework is a set of questions and design principles to drive better outcomes for anyone who wants to build and operate workloads in the cloud. It helps you build secure, high-performing, resilient, and efficient infrastructure for a wide variety of applications and workloads.
Built around six pillars, AWS Well-Architected provides a consistent approach for customers to evaluate architectures and implement scalable designs.
If you neglect the six pillars of operational excellence,
security, reliability, performance efficiency,
cost optimization, and sustainability when architecting
solutions, it can become a challenge to build a system
that delivers functional requirements and meets your
expectations.
When you incorporate these pillars, they help you produce stable and efficient systems, allowing you to focus on functional requirements.
In this session, we'll touch on three pillars: operational excellence, which is the ability to run and monitor systems to deliver business value and to continually improve processes and procedures; reliability, which is the ability of a system to recover from infrastructure or service failures; and performance efficiency, which is the ability to use computing resources efficiently to meet system requirements, and to maintain that efficiency as demand changes and technologies evolve.
In the reliability pillar of the Well-Architected Framework, there is a segment that talks about failure injection: run tests that inject failures regularly into pre-production and production environments. This recommendation comes from our many years of experience.
This practice of using fault injection to test your
environments is better known as chaos engineering.
Chaos engineering is the process
of stressing an application by creating disruptive
events, observing how the system responds,
and implementing improvements.
We do this to prove or disprove our assumptions about our system's capability to handle these disruptive events. But rather than letting those disruptive events happen at odd times over a weekend or in the production environment, we create them in a controlled environment and during working hours.
It is also very important to understand that chaos engineering is not about breaking things randomly without purpose. It is about breaking things in a controlled environment through well-planned experiments, in order to build confidence in the ability of your application, and the tools you are using, to withstand turbulent conditions.
Now let's discuss the different phases of chaos engineering.
First, steady state. This phase involves an understanding of the behavior and configuration of the system under normal conditions. By defining your steady state, you can detect deviations from that state and determine if your system has fully returned to the known good state.
is hypothesis. After you understand
the steady state behavior, you can write a hypothesis
about it. It can be challenging to decide what should
happen. Chaos engineering recommends that you choose real
world events that are likely to occur and that
will impact the user experience.
The next stage is run experiment. You don't need to run experiments in production right away. A great place to get started with chaos engineering is a staging environment. By running experiments in staging,
you can see how your system will likely react in production
while earning trust within your organization.
And as you gain confidence, you can begin running experiments
in production.
The next phase is verify. This step
is to analyze and document the data to understand what
happened. Lessons learned during the experiment
are critical and should promote a culture of support.
Here are some questions that you can address.
What happened? What was the impact
on your customers? What did we learn? Did we
have enough information in the notification to investigate
further? What could have reduced our time to detect
or time to remediate by 50%?
Can we apply this to other similar systems?
And how can we improve our incident response processes?
And finally, learn from your experiments in order to improve the system: improvements such as its resilience to failure, its performance, the monitoring, the alarms, the operations, and the overall system.
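The loop of phases described above can be sketched as a simple record you fill in as an experiment proceeds. This is purely an illustrative outline; the phase names come from the talk, but the Python structure and example strings are our own:

```python
# A minimal sketch of the chaos engineering loop described above:
# steady state -> hypothesis -> run experiment -> verify -> learn.
# The phase names follow the talk; the data structure is illustrative only.
from dataclasses import dataclass, field


@dataclass
class ChaosExperiment:
    steady_state: str                    # what "normal" looks like
    hypothesis: str                      # what we expect during the disruption
    observations: list = field(default_factory=list)
    learnings: list = field(default_factory=list)

    def verify(self, observed: str, matched_hypothesis: bool) -> None:
        # Record what actually happened and whether it confirmed the hypothesis.
        self.observations.append((observed, matched_hypothesis))


exp = ChaosExperiment(
    steady_state="application serves requests, p99 latency under 200 ms",
    hypothesis="if the only instance is stopped, the app quickly recovers",
)
exp.verify("app was down several minutes, then recovered", matched_hypothesis=False)
exp.learnings.append("add an alarm so we detect the outage before users do")
```

A disproved hypothesis, as in this example, is just as valuable as a confirmed one: it feeds directly into the learning phase.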
To help create and run these chaos engineering experiments, we can use AWS Fault Injection Simulator. It's a fully managed service for running fault injection experiments on AWS that makes it easier to improve an application's performance, observability, and resilience.
AWS Fault Injection Simulator, or FIS, is a fully managed chaos engineering service designed to be easy to get started with and to allow you to test your systems against real-world failures, whether they are simple, such as stopping an instance, or more complex. It fully embraces the idea of safeguards, which is a way to monitor the blast radius of the experiment and stop it if certain alarms are set off.
We have four main components in Fault Injection Simulator. An action is the fault injection activity that you run on targets using Fault Injection Simulator. A target can be a specific resource in your AWS environment, or one or more resources that match criteria that you specify, for example, resources that have specific tags. An experiment template contains one or more actions to run on specified targets during an experiment. It also contains the stop conditions that prevent the experiment from going out of bounds. After you create an experiment template, you can use it to start an experiment, and you can use a single experiment template to create and run multiple experiments, in multiple environments.
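To make the four components concrete, here is how they fit together in the shape of an FIS CreateExperimentTemplate request. The account ID, role name, tag values, and alarm ARN are placeholders, not values from the session:

```python
# Sketch of how actions, targets, stop conditions, and the template relate,
# shaped like the FIS CreateExperimentTemplate API request.
# All ARNs, names, and tag values below are placeholders.
experiment_template = {
    "description": "Stop a single EC2 instance",
    "roleArn": "arn:aws:iam::123456789012:role/fis-experiment-role",  # placeholder
    "actions": {                          # the fault injection activities to run
        "stop-one-instance": {
            "actionId": "aws:ec2:stop-instances",
            "parameters": {"startInstancesAfterDuration": "PT5M"},
            "targets": {"Instances": "single-instance"},
        }
    },
    "targets": {                          # which resources the action acts on
        "single-instance": {
            "resourceType": "aws:ec2:instance",
            "resourceTags": {"Env": "staging"},   # or resourceArns for one resource
            "selectionMode": "COUNT(1)",          # pick one matching instance
        }
    },
    "stopConditions": [                   # safeguard: halt if this alarm fires
        {
            "source": "aws:cloudwatch:alarm",
            "value": "arn:aws:cloudwatch:us-east-1:123456789012:alarm:app-down",
        }
    ],
}
```

The stop condition is what keeps the blast radius bounded: if the named CloudWatch alarm goes into alarm state while the experiment runs, FIS stops the experiment.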
Let's take our scenario. Let's say we have an application, and management has decided that our in-house-built application, until now used to share messages within our organization, should be made public. The application has so far had limited use, so downtime or disruptions were never an issue, and neither was the fact that it wasn't built using best practices. But this new decision means that all of that will change. Our mission is to create chaos engineering experiments that stress the application, simulate disruptive events, and observe how the system responds. Then we can make improvements to the application and test how those improvements affect the application and the users.
To start with, our application consists of one single Amazon EC2 instance and one Amazon RDS instance running in one Availability Zone. They are fronted by an Application Load Balancer. Even with only a single Amazon EC2 instance, it is still useful to include an Application Load Balancer: it lets you configure health checks performed against the EC2 instance, and it makes it easier to add more EC2 instances later. The Amazon EC2 Auto Scaling group helps improve resiliency: if the EC2 instance fails its health check, the Amazon EC2 Auto Scaling group will replace it. Based on this architecture, we can now create chaos engineering experiments to test how the application handles disruptive events.
Given that our application is running on only one instance, we are relying heavily on that instance. All requests are handled by that single instance. What if our single instance is stopped and then started again? So our hypothesis is: if our only instance is stopped and started again, the application will stop accepting requests, but will quickly recover. The action is stop and start instance, and the target is the single instance. Note that this experiment will cause the application to stop serving requests. It is advised not to perform experiments where we know that the outcome will cause an outage, but this is in a non-production environment and tests a specific scenario.
Next, let's learn how to run experiments in AWS Fault Injection Simulator. Navigate to the AWS console and open the AWS FIS service. Click on Create experiment template, and we create a new experiment template. Here, let's add a description; for us it is "stop single instance". Enter a name and select an IAM role. Next, we need to add an action. Click on Add action, and under the new action give it a name; here it is "stop one instance". Select the action type, which is EC2 stop instance, then the action parameters, and then add a target. We are going to edit it here: give it a name and select the resource ID. This is the EC2 instance that we have, and it's currently in the running state, so we'll add it and save. Next, we'll click on Create experiment template and confirm that we want to create it.
Next, we'll go and start this experiment. We again have to confirm that we want to start the experiment, and you can see the state is "initiating" right now; then the experiment is running. Now, if you go back to the website, the EC2 instance is stopping, and then it has been stopped. If you go to the website now, you won't be able to access it, and the experiment has been completed. After a few minutes, you'll see that the instance has automatically started running again, and if you go back to the website and refresh the page, the application is up and running again. So this is how you can run an experiment in AWS FIS.
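The same console steps can also be done with the AWS SDK for Python (boto3), using a client created with `boto3.client("fis")`. This is a hedged sketch: it needs real credentials, a real IAM role in the template, and matching resources to actually run:

```python
# Sketch of the console walkthrough above using boto3's FIS client:
# create an experiment template, then start an experiment from it.
# In a real run: fis = boto3.client("fis"); requires credentials and an IAM role.
def create_and_start(fis, template: dict) -> str:
    """Create an FIS experiment template and start an experiment from it.

    Returns the experiment's initial status, e.g. "initiating".
    """
    resp = fis.create_experiment_template(**template)
    template_id = resp["experimentTemplate"]["id"]
    exp = fis.start_experiment(experimentTemplateId=template_id)
    return exp["experiment"]["state"]["status"]
```

The `template` argument is the same dictionary of actions, targets, and stop conditions you would fill in on the Create experiment template page.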
After running the first experiment on the single EC2 instance architecture, we realize that we don't have any alarms in place that tell us whether the application is responding to requests, and responding as it should. Simply put: whether the application works or not. So by adding a CloudWatch Synthetics canary, we can check that the application responds. After the improvements, we have new metrics and a new alarm in place to help us monitor the application and detect failures.
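An alarm on the canary's success metric might look like the following parameters for CloudWatch's put_metric_alarm call. The canary name, period, and threshold are example values, not ones from the session:

```python
# Sketch of a CloudWatch alarm on a Synthetics canary's success metric.
# Canary name, period, and threshold are example values.
canary_alarm = {
    "AlarmName": "app-canary-failing",
    "Namespace": "CloudWatchSynthetics",        # namespace used by canary metrics
    "MetricName": "SuccessPercent",
    "Dimensions": [{"Name": "CanaryName", "Value": "app-canary"}],
    "Statistic": "Average",
    "Period": 60,                               # seconds per evaluation window
    "EvaluationPeriods": 1,
    "Threshold": 90,
    "ComparisonOperator": "LessThanThreshold",  # alarm when success drops below 90%
}
```

In a real setup you would pass this to `boto3.client("cloudwatch").put_metric_alarm(**canary_alarm)`; the resulting alarm can also serve as the stop condition in your FIS experiment template.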
After running our first experiments on the single EC2 instance architecture, we can also clearly see that having a single instance is a liability for the resilience and reliability of the application. Having a single instance means that if the instance fails, the application cannot operate, and our users will have a bad time. So what do we do? We add multiple instances. After the improvements, our application now uses multiple EC2 instances to help withstand turbulent conditions. If one EC2 instance fails, the Application Load Balancer will fail over and send traffic to the remaining healthy ones.
Based on the improved architecture, we can now create additional
chaos engineering experiments and test the reliability and
resilience of our application.
With the improvements made, we can now repeat one
of our previous experiments to see the difference.
What if one of the instances in the Auto Scaling group is stopped? The theory is that if one EC2 instance fails, the Application Load Balancer will fail over and send traffic to the healthy ones. So the hypothesis is: if one instance in our Auto Scaling group is stopped and started again, the application will continue serving requests to clients. The action is stop and start instance, and the target now is one instance in the Auto Scaling group. This is just an example of how you can write experiments and iterate on them to include chaos engineering for your workloads.
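For this second experiment, the target no longer points at a specific instance; it selects one instance out of the Auto Scaling group by tag. A sketch of that target definition, with an example group name:

```python
# Sketch of an FIS target that picks one instance out of the Auto Scaling group
# by tag, instead of naming a specific resource ARN. The group name is an example.
# "aws:autoscaling:groupName" is the tag that Auto Scaling puts on its instances.
asg_target = {
    "resourceType": "aws:ec2:instance",
    "resourceTags": {"aws:autoscaling:groupName": "app-asg"},
    "selectionMode": "COUNT(1)",   # stop just one of the matching instances
}
```

Because the selection is tag-based, the same experiment template keeps working as instances come and go, which is exactly what you want when the Auto Scaling group replaces them.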
You can further continue the journey by distributing the workload across multiple Availability Zones, improving RDS availability, and implementing continuous integration and delivery.
To summarize, today we learned about AWS Well-Architected, chaos engineering, and AWS Fault Injection Simulator, and how to write an experiment with it. Thank you for joining the session. I hope you liked it.