Transcript
            
            
              This transcript was autogenerated. To make changes, submit a PR.
            
            
            
            
              Hi. Hello. In this session,
            
            
            
              we learn about chaos engineering and how to
            
            
            
              build resilient and well-architected applications
            
            
            
              with it. Well-architected applications
            
            
            
              are designed and built to be secure, high performing,
            
            
            
              and resilient. You need to test your applications
            
            
            
              and validate that they operate as designed and
            
            
            
              are resilient to failures.
            
            
            
              We'll start with a quick look at AWS
            
            
            
              Well-Architected and how that ties together with chaos
            
            
            
              engineering. Then we'll do a short intro
            
            
            
              to AWS Fault Injection Simulator and an example.
            
            
            
              Creating technology solutions is a lot like constructing
            
            
            
              a physical building. If the foundation is not solid,
            
            
            
              it may cause structural problems that undermine the
            
            
            
              integrity and function of the building.
            
            
            
              The Well-Architected Framework is a set of questions
            
            
            
              and design principles to drive better outcomes for
            
            
            
              anyone who wants to build and operate workloads on
            
            
            
              the cloud. It helps build secure,
            
            
            
              high-performing, resilient, and efficient infrastructure
            
            
            
              for a wide variety of applications and workloads.
            
            
            
              Built around six pillars, AWS Well-Architected
            
            
            
              provides a consistent approach for customers to
            
            
            
              evaluate architectures and implement scalable designs.
            
            
            
              If you neglect the six pillars of operational excellence,
            
            
            
              security, reliability, performance efficiency,
            
            
            
              cost optimization, and sustainability when architecting
            
            
            
              solutions, it can become a challenge to build a system
            
            
            
              that delivers functional requirements and meets your
            
            
            
              expectations.
            
            
            
              When you incorporate these pillars, it will help you produce
            
            
            
              stable and efficient systems, allowing you to
            
            
            
              focus on functional requirements.
            
            
            
              In this session, we'll touch on three pillars.
            
            
            
              Operational excellence, which is the ability to
            
            
            
              run and monitor systems to deliver business value and
            
            
            
              continually improve processes and procedures;
            
            
            
              reliability, which is the ability of a system
            
            
            
              to recover from infrastructure or service failures;
            
            
            
              and performance efficiency, which is the ability to
            
            
            
              use computing resources efficiently to meet system requirements
            
            
            
              and to maintain that efficiency as demand changes and
            
            
            
              technologies evolve.
            
            
            
              In the reliability pillar of the Well-Architected Framework,
            
            
            
              there is a segment that talks about failure injection:
            
            
            
              run tests that inject failures regularly
            
            
            
              into pre-production and production environments,
            
            
            
              and this comes as a recommendation from our many years of
            
            
            
              experience.
            
            
            
              This practice of using fault injection to test your
            
            
            
              environments is better known as chaos engineering.
            
            
            
              Chaos engineering is the process
            
            
            
              of stressing an application by creating disruptive
            
            
            
              events, observing how the system responds,
            
            
            
              and implementing improvements.
            
            
            
              And we do this to prove or disprove
            
            
            
              our assumptions about our system's capability to
            
            
            
              handle these disruptive events. But rather than let those
            
            
            
              disruptive events happen at odd
            
            
            
              times during a weekend or in the production environment,
            
            
            
              we create them in a controlled environment and
            
            
            
              during working hours.
            
            
            
              It is also very important to understand that chaos
            
            
            
              engineering is not about breaking things randomly without purpose.
            
            
            
              It is about breaking things in a controlled environment
            
            
            
              through well-planned experiments in order to build confidence
            
            
            
              in your application and the tools you are using to withstand
            
            
            
              turbulent conditions.
            
            
            
              Let's discuss the different phases of chaos engineering.
            
            
            
              First, steady state. This
            
            
            
              phase involves an understanding of the
            
            
            
              behavior and configuration of the system under normal
            
            
            
              conditions. By defining your steady state,
            
            
            
              you can detect deviations from that state and
            
            
            
              determine if your system has fully returned to the known good
            
            
            
              state. The next phase
            
            
            
              is hypothesis. After you understand
            
            
            
              the steady state behavior, you can write a hypothesis
            
            
            
              about it. It can be challenging to decide what should
            
            
            
              happen. Chaos engineering recommends that you choose real
            
            
            
              world events that are likely to occur and that
            
            
            
              will impact the user experience.
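A minimal sketch of the steady state idea in Python, assuming a single numeric metric such as a request success rate. The class, the function names, and the tolerance values are hypothetical and only illustrate how a deviation from the known good state might be detected.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class SteadyState:
    """Hypothetical description of 'normal' for one metric."""
    metric: str
    baseline: float   # value observed under normal conditions
    tolerance: float  # allowed absolute deviation from the baseline


def within_steady_state(steady: SteadyState, observed: float) -> bool:
    """Return True if the observed value is inside the tolerated band."""
    return abs(observed - steady.baseline) <= steady.tolerance


def has_recovered(steady: SteadyState, samples: list[float], required: int = 3) -> bool:
    """Treat the system as recovered once the last `required` samples
    are all back inside the steady state band."""
    if len(samples) < required:
        return False
    return all(within_steady_state(steady, s) for s in samples[-required:])
```

A hypothesis can then be phrased against this check, for example "after the fault, the last three samples are back within the band".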
            
            
            
              The next stage is to run the experiment.
            
            
            
              You don't need to run experiments in production right away.
            
            
            
              A great place to get started with chaos engineering
            
            
            
              is a staging environment. By running experiments in staging,
            
            
            
              you can see how your system will likely react in production
            
            
            
              while earning trust within your organization.
            
            
            
              And as you gain confidence, you can begin running experiments
            
            
            
              in production.
            
            
            
              The next phase is verify. This step
            
            
            
              is to analyze and document the data to understand what
            
            
            
              happened. Lessons learned during the experiment
            
            
            
              are critical and should promote a culture of support.
            
            
            
              Here are some questions that you can address.
            
            
            
              What happened? What was the impact
            
            
            
              on your customers? What did we learn? Did we
            
            
            
              have enough information in the notification to investigate
            
            
            
              further? What could have reduced our time to detect
            
            
            
              or time to remediate by 50%?
            
            
            
              Can we apply this to other similar systems?
            
            
            
              And how can we improve our incident response processes?
            
            
            
              And finally, learn from your experiments in
            
            
            
              order to improve the system, including its
            
            
            
              resilience to failure, its performance,
            
            
            
              the monitoring, the alarms, the operations,
            
            
            
              and the overall system.
            
            
            
              To help create and run these chaos engineering experiments,
            
            
            
              we can use AWS Fault Injection Simulator.
            
            
            
              It's a fully managed service for running fault injection
            
            
            
              experiments on AWS that makes it easier to
            
            
            
              improve an application's performance, observability,
            
            
            
              and resilience.
            
            
            
              AWS Fault Injection Simulator, or FIS,
            
            
            
              is a fully managed chaos engineering service designed
            
            
            
              to be easy to get started and to allow you to
            
            
            
              test your systems against real-world failures,
            
            
            
              whether they are simple, such as stopping an instance,
            
            
            
              or more complex. It fully embraces
            
            
            
              the idea of safeguards, which is a way to monitor the
            
            
            
              blast radius of the experiment and stop it if
            
            
            
              certain alarms are set off.
            
            
            
              We have four main components that are part of Fault Injection
            
            
            
              Simulator. An action is the
            
            
            
              fault injection activity that you run on targets
            
            
            
              using Fault Injection Simulator. A target
            
            
            
              can be a specific resource in your AWS environment,
            
            
            
              or one or more resources that match criteria
            
            
            
              that you specify, for example, resources that have
            
            
            
              specific tags. An experiment template
            
            
            
              contains one or more actions to run on
            
            
            
              specified targets during an experiment. It also
            
            
            
              contains the stop conditions that prevent the experiment from
            
            
            
              going out of bounds.
            
            
            
              And after you create an experiment template,
            
            
            
              you can use it to start an experiment.
            
            
            
              And you can use a single experiment template to create and
            
            
            
              run multiple experiments in multiple environments.
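The components described above (action, target, stop condition, experiment template) can be sketched as plain data in Python. This is an illustration under assumptions, not a definitive template: the function name, the target and action names, the restart duration, and the ARNs in the usage line are hypothetical placeholders. The field names follow the request shape of the FIS CreateExperimentTemplate API.

```python
import uuid


def build_stop_instance_template(instance_arn, role_arn, alarm_arn=None,
                                 restart_after="PT2M"):
    """Build an experiment template that stops one EC2 instance and
    starts it again after `restart_after` (an ISO 8601 duration)."""
    # Stop condition: halt the experiment if the given CloudWatch alarm
    # fires. Without an alarm, FIS expects an explicit "none" source.
    if alarm_arn:
        stop_conditions = [{"source": "aws:cloudwatch:alarm", "value": alarm_arn}]
    else:
        stop_conditions = [{"source": "none"}]

    return {
        "clientToken": str(uuid.uuid4()),  # idempotency token
        "description": "Stop single instance",
        "roleArn": role_arn,               # IAM role that FIS assumes
        "stopConditions": stop_conditions,
        # Target: the specific resource the action runs against.
        "targets": {
            "SingleInstance": {            # hypothetical target name
                "resourceType": "aws:ec2:instance",
                "resourceArns": [instance_arn],
                "selectionMode": "ALL",
            }
        },
        # Action: the fault to inject, wired to the target by name.
        "actions": {
            "StopOneInstance": {           # hypothetical action name
                "actionId": "aws:ec2:stop-instances",
                "parameters": {"startInstancesAfterDuration": restart_after},
                "targets": {"Instances": "SingleInstance"},
            }
        },
    }


template = build_stop_instance_template(
    "arn:aws:ec2:us-east-1:111122223333:instance/i-0123456789abcdef0",  # placeholder
    "arn:aws:iam::111122223333:role/ExampleFisRole",                    # placeholder
)
```

With boto3 installed and credentials configured, a dictionary like this could be passed to `boto3.client("fis").create_experiment_template(**template)`.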
            
            
            
              Let's take a scenario.
            
            
            
              Let's say we have an application and management
            
            
            
              has decided that our in house built application,
            
            
            
              until now used to share messages within our organization,
            
            
            
              should be made public. The application
            
            
            
              has so far had limited use or downtime or
            
            
            
              any disruptions, so the fact that it
            
            
            
              hasn't been built using best practices was
            
            
            
              never an issue. But this new decision
            
            
            
              means that all of that will change. Our mission
            
            
            
              is to create chaos engineering experiments that stress the application,
            
            
            
              simulate disruptive events, and observe
            
            
            
              how the system responds. Then we can make
            
            
            
              improvements to the application and test how those improvements
            
            
            
              affect the application and the users.
            
            
            
              To start with, our application consists
            
            
            
              of a single Amazon EC2 instance and one
            
            
            
              Amazon RDS instance running in one Availability
            
            
            
              Zone. They are fronted by
            
            
            
              an application load balancer. Even with
            
            
            
              only a single Amazon EC2 instance, it is
            
            
            
              still useful to include an application load balancer.
            
            
            
              This lets you configure health checks performed against the EC2 instance.
            
            
            
              It also makes it easier to add more EC2 instances
            
            
            
              later. The Amazon EC2
            
            
            
              Auto Scaling group helps improve resiliency.
            
            
            
              If the EC2 instance fails its health check,
            
            
            
              the Amazon EC2 Auto Scaling group will replace it.
            
            
            
              Based on this architecture, we can now create chaos
            
            
            
              engineering experiments to test how the application handles
            
            
            
              disruptive events.
            
            
            
              Given that our application is running on only
            
            
            
              one instance, we are relying heavily on that instance.
            
            
            
              All requests are handled by that single
            
            
            
              instance. What if our single instance is stopped
            
            
            
              and then started again?
            
            
            
              So our hypothesis is if our only instance is
            
            
            
              stopped and started again, the application will stop
            
            
            
              accepting requests, but quickly recover again.
            
            
            
              The action is stop and start instance and
            
            
            
              the target is the single instance.
            
            
            
              Note that this experiment will cause the application to
            
            
            
              stop serving requests. It is advised not to perform
            
            
            
              experiments where we know that the outcome will cause
            
            
            
              an outage, but this is in a non-production
            
            
            
              environment and is meant to test a specific scenario.
            
            
            
              Next, let's learn how to run experiments
            
            
            
              in AWS Fault Injection Simulator.
            
            
            
              Navigate to the AWS console and open
            
            
            
              the AWS FIS service. Click on
            
            
            
              Create experiment template and
            
            
            
              then we create a new experiment template. Here,
            
            
            
              let's add a description. For us, it is "stop single
            
            
            
              instance". Enter a name and
            
            
            
              select an IAM role.
            
            
            
              Next we need to add an action.
            
            
            
              Click on Add action and give the new action
            
            
            
              a name. Here it is
            
            
            
              "stop one instance".
            
            
            
              Select the action type, which is EC2
            
            
            
              stop instances,
            
            
            
              then set the action parameter, and
            
            
            
              then add a target. We are
            
            
            
              going to edit it here,
            
            
            
              give it a name and
            
            
            
              select the resource id. This is the
            
            
            
              EC two instance that we have and
            
            
            
              it's currently in the running state. So we'll add it and save.
            
            
            
              Next we'll click
            
            
            
              on create experiment template and give a confirmation that
            
            
            
              we want to create it.
            
            
            
              Next we'll go and start this experiment.
            
            
            
              We again have to confirm that we need to start
            
            
            
              the experiment. As you can see, the state
            
            
            
              is initiating right now.
            
            
            
              The experiment is running now.
            
            
            
              Now if you go back to the website,
            
            
            
              the EC2 instance is stopping right now, and now it has been stopped.
            
            
            
              If you go to the website now, you won't be able to access
            
            
            
              it. The experiment has now been completed.
            
            
            
              After a few minutes, you'll
            
            
            
              see that the instance has automatically started running again, and
            
            
            
              if you go back to the website and
            
            
            
              refresh the page, the application is up and running again.
            
            
            
              So this is how you can run an experiment in
            
            
            
              AWS FIS.
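The console steps above can also be scripted. The following is a hedged sketch that starts an experiment from an existing template and polls its state until it finishes. It assumes a client object exposing `start_experiment` and `get_experiment` with the same shape as the boto3 FIS client. The function name, polling interval, timeout, and the set of terminal states are choices made for this illustration.

```python
import time
import uuid

# States after which the experiment no longer changes (assumed set).
TERMINAL_STATES = {"completed", "stopped", "failed"}


def run_experiment(fis_client, template_id, poll_seconds=10.0,
                   timeout_seconds=900.0, sleep=time.sleep):
    """Start an experiment and wait until it reaches a terminal state.

    Returns the final status string, or raises TimeoutError.
    """
    started = fis_client.start_experiment(
        clientToken=str(uuid.uuid4()),  # idempotency token
        experimentTemplateId=template_id,
    )
    experiment_id = started["experiment"]["id"]

    waited = 0.0
    while True:
        response = fis_client.get_experiment(id=experiment_id)
        status = response["experiment"]["state"]["status"]
        print(f"experiment {experiment_id}: {status}")
        if status in TERMINAL_STATES:
            return status
        if waited >= timeout_seconds:
            raise TimeoutError(f"experiment {experiment_id} still {status}")
        sleep(poll_seconds)
        waited += poll_seconds
```

With boto3 installed, `run_experiment(boto3.client("fis"), "<template id>")` would print states such as initiating and running before returning the final status.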
            
            
            
              After running the first experiment on the single EC2 instance
            
            
            
              architecture, we find that we don't have any alarms in place
            
            
            
              that tell us if the application is responding to
            
            
            
              requests or not and responding as it
            
            
            
              should. Simply put,
            
            
            
              if the application works or not.
            
            
            
              So by adding a CloudWatch Synthetics canary,
            
            
            
              we can check that the application responds. After the improvements,
            
            
            
              we have new metrics and a new alarm in place to help us monitor
            
            
            
              the application and detect failures.
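As an illustration of the new alarm, the sketch below builds the parameters for a CloudWatch alarm on a canary's success rate. The function name, the alarm name pattern, the threshold, and the canary name in the usage line are assumptions. The namespace `CloudWatchSynthetics`, the metric `SuccessPercent`, and the `CanaryName` dimension are the ones CloudWatch Synthetics publishes.

```python
def canary_alarm_params(canary_name, threshold_percent=90.0):
    """Build put_metric_alarm parameters that go into ALARM when the
    canary's average success rate drops below the threshold."""
    return {
        "AlarmName": f"{canary_name}-success-percent-low",  # hypothetical naming
        "Namespace": "CloudWatchSynthetics",
        "MetricName": "SuccessPercent",
        "Dimensions": [{"Name": "CanaryName", "Value": canary_name}],
        "Statistic": "Average",
        "Period": 60,               # seconds per datapoint
        "EvaluationPeriods": 1,
        "Threshold": threshold_percent,
        "ComparisonOperator": "LessThanThreshold",
        # A canary that reports nothing is treated as a failure.
        "TreatMissingData": "breaching",
    }


params = canary_alarm_params("example-app-canary")  # placeholder canary name
```

With boto3, these parameters could be passed to `boto3.client("cloudwatch").put_metric_alarm(**params)`, and the resulting alarm's ARN could serve as a stop condition in an experiment template.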
            
            
            
              After running our first experiments on the single EC2 instance
            
            
            
              architecture, we can clearly see that having a single
            
            
            
              instance is a liability for the resilience and reliability
            
            
            
              of that application. Having a single instance means
            
            
            
              if the instance fails, then the application cannot operate
            
            
            
              and our users will have a bad time.
            
            
            
              So what do we do?
            
            
            
              We add multiple instances.
            
            
            
              After the improvements, our application now uses multiple EC2
            
            
            
              instances to help withstand turbulent conditions.
            
            
            
              If one EC2 instance fails, the application load balancer
            
            
            
              will fail over and send traffic to the remaining healthy ones.
            
            
            
              Based on the improved architecture, we can now create additional
            
            
            
              chaos engineering experiments and test the reliability and
            
            
            
              resilience of our application.
            
            
            
              With the improvements made, we can now repeat one
            
            
            
              of our previous experiments to see the difference.
            
            
            
              What if one of the instances in the Auto
            
            
            
              Scaling group is stopped? The theory is that if
            
            
            
              one EC2 instance fails, the application load balancer
            
            
            
              will fail over and send traffic to the healthy ones.
            
            
            
              So the hypothesis is if one instance
            
            
            
              in our Auto Scaling group is stopped and started again,
            
            
            
              the application will continue serving requests to clients.
            
            
            
              The action is stop and start instance
            
            
            
              and the target now is one instance in the Auto Scaling
            
            
            
              group. This is just an
            
            
            
              example of how you can write experiments and improvise
            
            
            
              to include chaos engineering for your workloads.
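For the second experiment, the target changes from one specific instance to any one instance in the group. A minimal sketch of such a target definition follows, assuming the instances in the Auto Scaling group carry a tag you control. The function name and the tag key and value in the usage line are hypothetical. The `resourceTags`, `filters`, and `selectionMode` fields follow the FIS experiment template target format.

```python
def one_instance_by_tag_target(tag_key, tag_value):
    """Build a FIS target that picks one running EC2 instance at random
    from all instances carrying the given tag."""
    return {
        "resourceType": "aws:ec2:instance",
        "resourceTags": {tag_key: tag_value},
        # Only consider instances that are currently running.
        "filters": [{"path": "State.Name", "values": ["running"]}],
        # COUNT(1) selects a single instance from the matching set.
        "selectionMode": "COUNT(1)",
    }


target = one_instance_by_tag_target("Purpose", "chaos-ready")  # placeholder tag
```

This dictionary would take the place of the single-instance target in an experiment template, while the stop and start action stays the same.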
            
            
            
              You can further continue the journey by distributing the workload to
            
            
            
              multiple Availability Zones, improving RDS availability,
            
            
            
              and implementing continuous integration and delivery.
            
            
            
              To summarize, today we learnt about AWS
            
            
            
              Well-Architected, chaos engineering,
            
            
            
              AWS Fault Injection Simulator, and how to
            
            
            
              write an experiment for it.
            
            
            
              Thank you for joining the session. I hope you liked it.