Transcript
This transcript was autogenerated. To make changes, submit a PR.
Everyone, in this session I'll introduce you
to the what, why, and how of chaos engineering.
I'll dive deep into the principles behind chaos engineering,
how to keep your services up and running,
and how to apply these principles in the context of applications
built on top of AWS. Before we
move on, I want to start with Werner Vogels'
quote: everything fails all the time,
and hence we need to build systems that embrace
failure as a natural occurrence.
Creating technology solutions is a lot like
constructing a physical building. If the
foundations aren't solid, it may cause structural
problems that undermine the integrity and the function
of the building. The AWS Well-Architected
Framework is a set of design principles and
architectural best practices for designing and
running services in the cloud. The framework
is built on years of experience architecting
solutions across a wide variety of
business verticals and use cases,
and from designing and reviewing a number of architectures
with thousands of customers on AWS.
The framework has a set of questions
to drive better outcomes for anyone who
wants to build and operate their applications on the cloud.
There are six pillars in the AWS Well-Architected
Framework: operational excellence,
security, reliability,
performance efficiency, cost optimization,
and sustainability. Following these
guidelines will enable you to build a system that delivers
the functional requirements that meet your expectations.
In this session, I'll touch on only the
three pillars that are relevant to this topic today.
Those are operational excellence, reliability,
and performance efficiency. First, let's look at
what operational excellence means.
Operational excellence is the ability to support
development, run and monitor systems effectively
to gain insight into your operations,
and to deliver business value by continually
improving your supporting processes.
Reliability encompasses the ability
of an application or a service to perform
its intended function correctly and consistently.
Performance efficiency is the ability of the system to
use computing resources in the most efficient manner
to meet your system requirements, and to maintain that efficiency
as demand increases or the technology evolves.
In the reliability pillar of the Well-Architected Framework,
there is a segment that talks about testing
your system through failure injection,
and this comes as a recommendation from Amazon's
many years of experience building and operating large distributed
systems. This practice of using
fault injection to test your environments is better known as
chaos engineering. Let's go into
the details of the what, why, and how
of chaos engineering. Let's understand the what.
First, it's about designing
your system to work despite failures, building stability
into your system behavior,
and proactively looking for problems instead of waiting
for them to happen and being surprised by them.
Above all, chaos engineering needs a cultural
shift for organizations to adopt the approach.
Chaos engineering is a process of
stressing an application by creating disruptive
events, observing how the system responds to
those events, and finally implementing improvements.
So it's an approach to learning how your
system behaves through scientific experimentation
and evidence. Now let's
talk about why chaos engineering. The rise
of microservices and distributed cloud architectures,
and the pace of innovation, development,
and deployment of software, mean that systems are
growing increasingly complex. While individual
components work in a development cycle,
when they are integrated, some of the faults may
be unexpected, and these failures can be costly
to businesses.
Even brief outages can impact a company's
bottom line, so the cost of downtime
is becoming a key performance indicator for
engineering teams.
For Amazon.com's retail online business, even a
few minutes of outage could have a large
impact on revenue.
So companies need a solution to this challenge.
Waiting for the next costly outage is not an option.
To meet this challenge head on, more and more companies
are turning to chaos engineering. So let's
learn what's involved in adopting
this approach. We make an assumption
about our system, and we conduct experiments
in a controlled environment to prove or disprove our theories,
our assumptions about our system's capability
to handle such disruptive events.
But rather than let those disruptive events happen
at 3:00 a.m. during the weekend or
in a production environment, we create them during
work hours in a controlled development
environment, and experiment and see
how the system behaves. We repeat these
experiments at regular intervals, and thus learn
more and more about the ability of the system to withstand
interruptions and improve our systems
to bounce back and provide the best possible service.
So we are building reliability in our systems by
using an approach called chaos engineering. So let's talk about
how the chaos engineering approach facilitates building
this resilience. There are five phases
to chaos engineering. First,
understand the steady state of the system you're dealing with.
Second, hypothesize:
articulate a hypothesis about your system.
Then run an experiment,
often using fault injection,
verify the results, and finally,
learn from your experiments in order to improve the
system further. In order to build resilience
into your systems, we should be able to identify
under what circumstances and in what scenarios our systems
are likely to fail. Then we can translate
these scenarios into a set of experiments
and learn to build stability into the system.
For example, the kinds of chaos experiments you could conduct
could cover hardware failures, where a server goes down;
non-functional requirements, such as
a spike in traffic; or testing
your software services,
where you send malformed responses
and learn how your system reacts.
Now, let's take a quick look at the kind
of tooling available for us to conduct chaos engineering
experiments. Before we go into the tooling itself, a little
bit of history as to how it
originated. Back in 2010,
Netflix's engineering team created a
tool called Chaos Monkey, which they
used to build resilience in their systems
by injecting failures and learning how to
stabilize them,
such as by terminating services,
or terminating a server that's running a
particular service, and so on.
Next, in 2011, the Simian Army added
additional failure modes to provide a
fuller set of failure testing capabilities.
In 2017, a chaos engineering toolkit for developers
emerged,
mostly to provide an open API for developers to be
able to integrate chaos
experiments into their systems,
and also to automate them into CI/CD pipelines
and so on. Then, in 2019,
another powerful chaos engineering platform emerged for
Kubernetes, for testing container-based
services, with the ability to perform chaos experiments
without modifying the deployment logic of the application
itself. Back in 2009,
Kolton Andrus had built fault
injection tooling at Amazon. He later
went on to co-found
Gremlin, a failure-as-a-service platform,
which launched in 2019.
So Gremlin helps build
resiliency into your systems by turning failure into
resilience, offering engineers a fully hosted solution
to safely conduct experiments on simple or complex
systems, in order to identify weaknesses before
they can impact the customer experience and
to reduce any revenue loss.
It allows developers to run experiments against
hosts, containers,
functions, or Kubernetes
primitives. Gremlin is available
on the AWS Marketplace as well.
Another tool that I want to talk about today is the
AWS Fault Injection Simulator, or FIS
for short. It's a fully managed service for
conducting chaos engineering experiments, provided by AWS.
It's designed to make the service easy for developers
to use, and it allows you
to test your systems for real-world failures,
whether with a simple test or
a complex one. FIS embraces
the idea of controlling and monitoring
your blast radius. It does
so by giving you the ability
to set up stop conditions around your experiments.
So basically, it's the idea of safeguarding
your servers,
even if something goes wrong by mistake,
so that you can reduce the blast radius of the experiment,
and alarms can stop the experiment if
those conditions are met. So now let's take
a quick look at
the components that make up FIS.
First is actions. Actions are
the fault injection activities that you want to conduct
experiments with. These actions act
on targets, and targets are not necessarily EC2
resources; they are whatever AWS
resources you want the actions to be performed
on, and these resources can be identified
via tags as well.
Then you have experiment templates,
which form the basis
for conducting a simple experiment first,
and these templates can further be used to develop multiple experiments.
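To make these components concrete, here is a minimal sketch using boto3 of what an experiment template for the CPU stress scenario we'll run shortly might look like. The role ARN, alarm ARN, and tag values are hypothetical placeholders you would replace with resources from your own account.

```python
import boto3

fis = boto3.client("fis")

response = fis.create_experiment_template(
    clientToken="cpu-stress-demo-1",
    description="CPU stress on tagged EC2 instances via SSM run command",
    roleArn="arn:aws:iam::123456789012:role/fis-experiment-role",
    # Stop condition: if this CloudWatch alarm goes into alarm state,
    # FIS halts the experiment, keeping the blast radius contained.
    stopConditions=[{
        "source": "aws:cloudwatch:alarm",
        "value": "arn:aws:cloudwatch:us-east-1:123456789012:alarm:HighLatency",
    }],
    # Targets: AWS resources identified by tag; selectionMode limits how many
    # of the matching resources the action touches.
    targets={
        "webServers": {
            "resourceType": "aws:ec2:instance",
            "resourceTags": {"Tier": "web"},
            "selectionMode": "COUNT(1)",
        }
    },
    # Actions: a pre-built FIS action that runs an SSM document to generate
    # CPU load on the selected instances for a bounded duration.
    actions={
        "stressCpu": {
            "actionId": "aws:ssm:send-command",
            "parameters": {
                "documentArn": "arn:aws:ssm:us-east-1::document/AWSFIS-Run-CPU-Stress",
                "documentParameters": '{"DurationSeconds": "120"}',
                "duration": "PT3M",
            },
            "targets": {"Instances": "webServers"},
        }
    },
)
print(response["experimentTemplate"]["id"])
```

The stop condition and the selection mode are the two knobs that keep the blast radius of an experiment small.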
Now let's talk about how to build highly available
fault tolerant systems on AWS.
Before we do that, let me go through the details of
the AWS global infrastructure. An AWS Region
is a physical geographical location consisting of two or
more Availability Zones, and an Availability Zone consists
of one or more data centers with redundant
power, networking, and connectivity.
These Availability Zones are interconnected with low-latency
network links. Now let's suppose we have this
three-tier architecture hosted on AWS.
The web tier is hosted on Elastic Container Service,
or ECS for short. The web tier receives
traffic from the internet through an internet
gateway, and distributes that traffic
through an Elastic Load Balancer to the ECS cluster which
is the web tier. The web tier further
distributes the traffic through another Elastic Load Balancer
to another ECS cluster,
which is the application tier, and we're using
Amazon Aurora as the database tier.
Now let's take a look at how to set up chaos
experiments using AWS Fault Injection Simulator.
Let's say, for our first scenario: what happens when
servers are experiencing CPU load? Our
hypothesis is that if the CPU utilization
of a compute resource were to come under stress,
the availability of our website would not be impacted,
due to the built-in capabilities of the system.
Now let's go through the steps to run this experiment.
I'm assuming that you already have an AWS account.
Go to your AWS console and search for AWS
FIS. On the left-hand side of the FIS console you should find Experiment
templates, similar to what you're
seeing on this slide. Using the Experiment templates
option you can create various experiments.
Let's take, for example, causing
CPU stress as one of the scenarios you want to run
experiments on. When you navigate to Create
experiment template, you can choose from
the pre-built set of actions in FIS, such as
an AWS Systems Manager run command under
the hood, to create the CPU stress.
In the top right corner of this slide you can see the Actions
drop-down menu, which allows you to run
the experiment. If you click on
the Run experiment option,
you will be prompted with an
input box to type the word "start" to confirm
running the experiment. Ensure that the state
of the experiment is "running". You
will then be taken to the experiment details page; note that
the state of the experiment changes over time.
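If you prefer to script these steps instead of clicking through the console, a rough boto3 equivalent might look like the following; the experiment template ID is a placeholder for the one created earlier.

```python
import time
import boto3

fis = boto3.client("fis")

# Start the experiment from a previously created template (hypothetical ID).
experiment = fis.start_experiment(
    clientToken="cpu-stress-run-1",
    experimentTemplateId="EXT1a2b3c4d5e6f7",
)
experiment_id = experiment["experiment"]["id"]

# Poll until the experiment leaves its active states; this mirrors watching
# the state field on the experiment details page.
while True:
    state = fis.get_experiment(id=experiment_id)["experiment"]["state"]
    print(state["status"], state.get("reason", ""))
    if state["status"] not in ("pending", "initiating", "running"):
        break
    time.sleep(15)
```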
So in order to observe the impact on the resources,
navigate to the monitoring dashboard of the ECS cluster,
or you can pre-build a
custom dashboard to view the impact of the
CPU utilization load on your application.
You should observe a spike in
the monitoring graph. Similarly, you can conduct
other experiments using the pre-built actions;
for example, you can use SSM commands for network
packet loss. This action can be used as
the basis for conducting a network stress experiment.
Or say you want to run an experiment that
mimics an application's response to an
Availability Zone failure, by
removing the Availability Zone from the underlying Auto Scaling
group configuration and triggering a database failover.
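For the database failover part of such a scenario, FIS also has a pre-built RDS action. As a hypothetical sketch, the actions entry could look like this, where "auroraCluster" stands in for a target of resource type aws:rds:cluster defined elsewhere in the template.

```python
# Sketch of an actions entry that triggers an Aurora failover; "auroraCluster"
# is a hypothetical target name of resource type aws:rds:cluster.
actions = {
    "failoverDatabase": {
        "actionId": "aws:rds:failover-db-cluster",
        "targets": {"Clusters": "auroraCluster"},
    }
}
```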
Here is the hypothesis for this scenario:
terminating 50% of our EC2 instances
will not affect the availability of our website,
and the application will remediate itself back to the desired
healthy capacity because of the Auto Scaling group
underneath the hood. Now, running this experiment
is going to terminate 50% of our total EC2
instances, across both the app and the web layers. The
steps to run this experiment are: navigate to the
AWS FIS console as shown
earlier, go to the Experiment templates
section on the AWS FIS page,
select the experiment template, then in the top right corner
select the Actions drop-down and hit Run.
Ensure that the status of the experiment
is in the running state, and then observe the impact
on the resources from the CloudWatch dashboard.
So the action the experiment template uses
here is terminate EC2, which is based on the
terminate-instances action from
the EC2 service.
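As a sketch, the targets and actions for this terminate-EC2 template might look like the following, with the rest of the template (role and stop conditions) the same as the earlier example; the tag used to pick the instances is hypothetical.

```python
# Select 50% of the instances matching a (hypothetical) tag across the web
# and app tiers, and terminate them with the pre-built EC2 action.
targets = {
    "halfOfInstances": {
        "resourceType": "aws:ec2:instance",
        "resourceTags": {"ChaosReady": "true"},
        "selectionMode": "PERCENT(50)",
    }
}
actions = {
    "terminateHalf": {
        "actionId": "aws:ec2:terminate-instances",
        "targets": {"Instances": "halfOfInstances"},
    }
}
```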
To observe the impact on the resources, you can either go to the CloudWatch
dashboard or go to the ECS
cluster and note the healthy hosts section.
You should see a different number from the original steady state.
To observe the impact you can also go to
the EC2 service dashboard, Instances section,
where you can see the EC2 instances getting terminated,
and eventually you will see new EC2
instances coming back up as well. From the Auto
Scaling menu you can navigate to the Auto Scaling
groups section, where you will see the instance count decrease and
then automatically restore. And finally, to observe the
impact on the user experience, you can navigate to
the CloudWatch service and set
up a CloudWatch Synthetics canary. If you
have this pre-set up,
you can observe the change in the user experience, because
a Synthetics canary monitors your user endpoints
and APIs. Basically, a Synthetics canary is
a configurable script that runs on a schedule and
monitors those endpoints. You can also navigate to the
ECS service endpoint manually via the browser
and check the user experience impact during
the auto scaling process as well. Once this experiment
completes, the application should return to its steady state.
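If you don't have a Synthetics canary set up, a quick stand-in is to poll the service endpoint yourself while the experiment runs. Here is a minimal sketch, with a hypothetical load balancer URL.

```python
import time
import urllib.request

# Hypothetical public URL of the load-balanced web tier.
ENDPOINT = "http://my-web-alb-123456789.us-east-1.elb.amazonaws.com/"

# Poll the endpoint every 10 seconds for about 10 minutes and log whether it
# responds, a rough stand-in for what a Synthetics canary would record.
for _ in range(60):
    try:
        with urllib.request.urlopen(ENDPOINT, timeout=5) as resp:
            print(time.strftime("%H:%M:%S"), "OK", resp.status)
    except Exception as err:
        print(time.strftime("%H:%M:%S"), "FAILED", err)
    time.sleep(10)
```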
Now let's take a look at how to make this a reality.
It's not sufficient to run these experiments once
and leave it at that. Your chaos scenarios,
and hence your experiments and your recovery design,
are based on certain assumptions. For example,
if you think of a data replication scenario, you may assume
that the system will complete a set of replication
tasks within the set steady-state time. As the data grows
organically, those replication times may no longer hold.
It's hence important to conduct these experiments regularly,
validate them, improve your results, and
enhance your customer experience. Reality may differ;
as I said earlier, one way to test your systems and
bring them as close to reality as possible is through running game
days. It's a concept where you bring in
a set of people from different disciplines, or people who have not
used your system much, so they do not have
preconceptions about how it is going to work; give them a brief
overview and a scenario of events to run,
gather the feedback, analyze the results, and
follow up with items to improve your system.
Conduct these game days regularly, and integrate these
tests as part of your CI/CD pipeline. As
I mentioned, if you have runbooks that need to be checked
manually, they can very easily get out of date
and run into issues once you're in production.
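One way to wire this into a pipeline is a small gate script that starts the FIS experiment and fails the stage if it does not complete cleanly, for example because a stop-condition alarm fired. This is a sketch with a hypothetical template ID.

```python
import sys
import time
import boto3

fis = boto3.client("fis")

# Start the experiment from a pre-created template (hypothetical ID).
run = fis.start_experiment(
    clientToken="pipeline-chaos-run-1",
    experimentTemplateId="EXT1a2b3c4d5e6f7",
)
experiment_id = run["experiment"]["id"]

# Wait for a terminal state.
while True:
    state = fis.get_experiment(id=experiment_id)["experiment"]["state"]
    if state["status"] in ("completed", "stopped", "failed"):
        break
    time.sleep(15)

# Anything other than a clean completion fails the pipeline stage.
if state["status"] != "completed":
    print("Chaos experiment did not complete cleanly:", state)
    sys.exit(1)
print("Chaos experiment completed")
```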
Finally, I want to leave you with this quote regarding chaos
engineering: it isn't about creating
chaos, it is about making the chaos inherent in the system
visible. I invite you to start
testing the reliability of your systems using the chaos engineering
techniques you've learned today. I'll share some resources in
the next couple of slides that will help you on this journey. If you
have not created an AWS account,
this link will tell you how, and here are some
links to go over to understand the AWS Well-Architected
Framework, plus some hands-on labs.