Resilient and well-architected apps with chaos engineering

Abstract

Well-architected applications are built to be secure, high-performing, and resilient. In this session, you will learn how to apply chaos engineering using AWS Fault Injection Simulator, stress your application, simulate disruptive events, and observe the system response.

Summary

Well-architected applications are designed and built to be secure, high-performing, and resilient. Chaos engineering is the process of stressing an application by creating disruptive events, observing how the system responds, and implementing improvements. As you gain confidence, you can begin running experiments in production.
Next, let's learn how to run experiments in AWS Fault Injection Simulator. After running the first experiment on an architecture with a single EC2 instance, we can clearly see that having a single instance is a liability for the resilience and reliability of that application. After the improvements, our application now uses multiple EC2 instances to help withstand turbulent conditions.

Transcript

This transcript was autogenerated. To make changes, submit a PR.

Hello. In this session, we'll learn about chaos engineering and how to build resilient and well-architected applications with it. Well-architected applications are designed and built to be secure, high-performing, and resilient. You need to test your application and validate that it operates as designed and is resilient to failures. We'll start with a quick look at AWS Well-Architected and how it ties together with chaos engineering. Then we'll do a short intro to AWS Fault Injection Simulator and walk through an example.
Creating technology solutions is a lot like constructing a physical building. If the foundation is not solid, it may cause structural problems that undermine the integrity and function of the building. The Well-Architected Framework is a set of questions and design principles to drive better outcomes for anyone who wants to build and operate workloads on the cloud. It helps build secure, high-performing, resilient, and efficient infrastructure for a wide variety of applications and workloads. Built around six pillars, AWS Well-Architected provides a consistent approach for customers to evaluate architectures and implement scalable designs. If you neglect the six pillars of operational excellence, security, reliability, performance efficiency, cost optimization, and sustainability when architecting solutions, it can become a challenge to build a system that delivers functional requirements and meets your expectations. When you incorporate these pillars, they help you produce stable and efficient systems, allowing you to focus on functional requirements. In this session, we'll touch on three pillars:
Operational excellence, which is the ability to run and monitor systems to deliver business value and to continually improve processes and procedures; reliability, which is the ability of a system to recover from infrastructure or service failures; and performance efficiency, which is the ability to use computing resources efficiently to meet system requirements and to maintain that efficiency as demand changes and technologies evolve.
In the reliability pillar of the Well-Architected Framework, there is a segment that talks about failure injection: run tests that inject failures regularly into pre-production and production environments. This comes as a recommendation from our many years of experience. This practice of using fault injection to test your environments is better known as chaos engineering. Chaos engineering is the process of stressing an application by creating disruptive events, observing how the system responds, and implementing improvements. We do this to prove or disprove our assumptions about our system's capability to handle these disruptive events. But rather than let those disruptive events happen at odd times, during a weekend, or in the production environment, we create them in a controlled environment and during working hours. It is also very important to understand that chaos engineering is not about breaking things randomly without purpose. It is about breaking things in a controlled environment through well-planned experiments, in order to build confidence in your application and in the tools you are using to withstand turbulent conditions.
Let's discuss the different phases of chaos engineering. First, steady state. This phase involves an understanding of the behavior and configuration of the system under normal conditions. By defining your steady state, you can detect deviations from that state and determine whether your system has fully returned to the known good state. The next phase is hypothesis.
After you understand the steady-state behavior, you can write a hypothesis about it. It can be challenging to decide what should happen. Chaos engineering recommends that you choose real-world events that are likely to occur and that will impact the user experience. The next phase is running the experiment. You don't need to run experiments in production right away. A great place to get started with chaos engineering is a staging environment. By running experiments in staging, you can see how your system will likely react in production while earning trust within your organization. As you gain confidence, you can begin running experiments in production. The next phase is verify. This step is to analyze and document the data to understand what happened. Lessons learned during the experiment are critical and should promote a culture of support. Here are some questions that you can address. What happened? What was the impact on your customers? What did we learn? Did we have enough information in the notification to investigate further? What could have reduced our time to detect or time to remediate by 50%? Can we apply this to other similar systems? And how can we improve our incident response processes? And finally, learn from your experiments in order to improve the system: its resilience to failure, its performance, the monitoring, the alarms, the operations, and the system overall.
To help create and run these chaos engineering experiments, we can use AWS Fault Injection Simulator. AWS Fault Injection Simulator, or FIS, is a fully managed service for running fault injection experiments on AWS that makes it easier to improve an application's performance, observability, and resilience. It is designed to be easy to get started with and to let you test your systems against real-world failures, whether they are simple, such as stopping an instance, or more complex.
FIS fully embraces the idea of safeguards, which are a way to monitor the blast radius of the experiment and stop it if certain alarms are set off. We have four main components that are part of Fault Injection Simulator. An action is the fault injection activity that you run on targets using Fault Injection Simulator. A target can be a specific resource in your AWS environment, or one or more resources that match criteria that you specify, for example, resources that have specific tags. An experiment template contains one or more actions to run on specified targets during an experiment. It also contains the stop conditions that prevent the experiment from going out of bounds. After you create an experiment template, you can use it to start an experiment, and you can use a single experiment template to create and run multiple experiments in multiple environments.
Let's take our scenario. Let's say we have an application, and management has decided that our in-house application, until now used to share messages within our organization, should be made public. The application has so far seen limited use, downtime, or disruption, so the fact that it hasn't been built using best practices was never an issue. But this new decision means that all of that will change. Our mission is to create chaos engineering experiments, stress the application, simulate disruptive events, and observe how the system responds. Then we can make improvements to the application and test how those improvements affect the application and the users.
To start with, our application consists of a single Amazon EC2 instance and one Amazon RDS instance running in one Availability Zone. They are fronted by an Application Load Balancer. Even with only a single Amazon EC2 instance, it is still useful to include an Application Load Balancer. This lets you configure health checks performed against the EC2 instance.
The load balancer also makes it easier to add more EC2 instances later. The Amazon EC2 Auto Scaling group helps improve resiliency. If the EC2 instance fails its health check, the Amazon EC2 Auto Scaling group will replace it. Based on this architecture, we can now create chaos engineering experiments to test how the application handles disruptive events. Given that our application is running on only one instance, we are relying heavily on that instance. All requests are handled by that single instance. What if our single instance is stopped and then started again? Our hypothesis is: if our only instance is stopped and started again, the application will stop accepting requests but quickly recover. The action is to stop and start the instance, and the target is the single instance. Note that this experiment will cause the application to stop serving requests. It is generally advised not to run experiments whose outcome we already know will be an outage, but this one runs in a non-production environment to test a specific scenario.
Next, let's learn how to run experiments in AWS Fault Injection Simulator. Navigate to the AWS console and open the AWS FIS service. Click on Create experiment template to create a new experiment template. Here, let's add a description; for us it is "stop single instance". Enter a name and select an IAM role. Next we need to add an action. Click on Add action and give the new action a name; here it is "stop one instance". Select the action type, which is the EC2 stop-instances action (aws:ec2:stop-instances), then set the action parameters, and then add a target. We are going to edit the target here, give it a name, and select the resource ID. This is the EC2 instance that we have, and it is currently in the running state. So we'll add it and save. Next we'll click on Create experiment template and confirm that we want to create it. Next we'll go and start this experiment.
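The console steps above can also be written down as data. The following is a minimal sketch, in Python, of the same "stop single instance" template shaped as keyword arguments for the FIS `CreateExperimentTemplate` API. The target and action names, the ARNs passed in, the `Name` tag, and the five-minute restart delay (`PT5M`) are illustrative assumptions and are not taken from the talk. Because the scenario has no alarms at this point, the sketch uses a stop condition source of `none`.

```python
def build_stop_instance_template(instance_arn: str, role_arn: str,
                                 restart_after: str = "PT5M") -> dict:
    """Build keyword arguments describing a 'stop single instance' FIS template.

    restart_after is an ISO 8601 duration; PT5M (five minutes) is an assumed
    value, the talk does not say which duration was used.
    """
    return {
        "description": "stop single instance",
        "roleArn": role_arn,
        # No CloudWatch alarm exists yet in this scenario, so no stop condition.
        "stopConditions": [{"source": "none"}],
        "targets": {
            # "single-instance" is a hypothetical target name.
            "single-instance": {
                "resourceType": "aws:ec2:instance",
                "resourceArns": [instance_arn],
                "selectionMode": "ALL",
            }
        },
        "actions": {
            # "stop-one-instance" is a hypothetical action name.
            "stop-one-instance": {
                "actionId": "aws:ec2:stop-instances",
                # Restart the instance automatically after the given duration.
                "parameters": {"startInstancesAfterDuration": restart_after},
                # Maps the action's "Instances" slot to the target defined above.
                "targets": {"Instances": "single-instance"},
            }
        },
        "tags": {"Name": "stop-single-instance"},
    }
```

Assuming boto3 is installed and credentials are configured, the resulting dictionary could be passed as `boto3.client("fis").create_experiment_template(**template)`, and the returned template ID used with `start_experiment(experimentTemplateId=...)`. The sketch itself makes no AWS calls.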
We again have to confirm that we want to start the experiment. You can see that the state is "initiating" right now, and then the experiment is running. Now if you go back, you can see that the EC2 instance is stopping, and then that it has been stopped. If you go to the website now, you won't be able to access it. The experiment has now completed. After a few minutes you'll see that the instance has automatically started running again, and if you go back to the website and refresh the page, the application is up and running again. So this is how you can run an experiment in AWS FIS.
After running the first experiment on the single EC2 instance architecture, we realize that we don't have any alarms in place that tell us whether the application is responding to requests, and responding as it should. Simply put, whether the application works or not. So by adding an Amazon CloudWatch Synthetics canary, we can check that the application responds. After the improvements, we have new metrics and a new alarm in place to help us monitor the application and detect failures.
After running our first experiment on the single EC2 instance architecture, we can also clearly see that having a single instance is a liability for the resilience and reliability of that application. Having a single instance means that if the instance fails, the application cannot operate and our users will have a bad time. So what do we do? We add multiple instances. After the improvements, our application now uses multiple EC2 instances to help withstand turbulent conditions. If one EC2 instance fails, the Application Load Balancer will fail over and send traffic to the remaining healthy ones. Based on the improved architecture, we can now create additional chaos engineering experiments and test the reliability and resilience of our application. With the improvements made, we can now repeat one of our previous experiments to see the difference. What if one of the instances in the Auto Scaling group is stopped?
The theory is that if one EC2 instance fails, the Application Load Balancer will fail over and send traffic to the healthy ones. So the hypothesis is: if one instance in our Auto Scaling group is stopped and started again, the application will continue serving requests to clients. The action is to stop and start the instance, and the target is now one instance in the Auto Scaling group. This is just an example of how you can write experiments and adapt them to include chaos engineering in your workloads. You can continue the journey by distributing the workload across multiple Availability Zones, improving RDS availability, and implementing continuous integration and delivery. To summarize, today we learnt about AWS Well-Architected, chaos engineering, AWS Fault Injection Simulator, and how to write an experiment for it. Thank you for joining the session. I hope you liked it.
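The two hypotheses from the talk can be compared in a toy model. The sketch below is not FIS and does not touch AWS: it is a plain-Python simulation, under the simplifying assumption that a load balancer serves a request whenever at least one registered instance is running. All class and function names are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Instance:
    name: str
    running: bool = True


@dataclass
class LoadBalancer:
    instances: List[Instance]

    def handle_request(self) -> bool:
        """Return True if at least one running instance can serve the request."""
        return any(instance.running for instance in self.instances)


def run_stop_experiment(lb: LoadBalancer, requests_during_fault: int = 10) -> dict:
    """Stop the first instance, send requests, restart it, and report results."""
    target = lb.instances[0]
    steady_state_ok = lb.handle_request()      # steady state before the fault
    target.running = False                     # inject the fault: stop one instance
    served = sum(lb.handle_request() for _ in range(requests_during_fault))
    target.running = True                      # the instance is started again
    recovered = lb.handle_request()            # verify return to steady state
    return {
        "steady_state_ok": steady_state_ok,
        "served_during_fault": served,
        "recovered": recovered,
    }
```

Running `run_stop_experiment(LoadBalancer([Instance("a")]))` reports zero requests served during the fault, matching the first hypothesis (an outage followed by recovery). With three instances, all ten requests are served, matching the second hypothesis that the application keeps serving clients. A real experiment would observe this through health checks and alarms instead of a boolean flag.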

Conf42 Chaos Engineering 2024 - Online

February 15 2024

Kiranpreet Chawla

Solutions Architect @ AWS
