Conf42 Site Reliability Engineering (SRE) 2024 - Online

Chaos Engineering in Action: Practical Techniques for Building Fault-Tolerant Systems

Abstract

This session explores chaos engineering, a proactive approach to identifying and resolving vulnerabilities in distributed systems. We will dive deep into techniques for injecting controlled failures into environments, simulating real-world disruptions, and observing system behavior under stress.

Summary

  • Hareesh Iyer is a solutions architect at AWS. Today we are going to talk about chaos engineering. My goal for this session is to give you a high-level introduction to chaos engineering and talk a little bit about the best practices.
  • In December 2022, Southwest Airlines had a huge meltdown. Other examples include Facebook, Instagram, WhatsApp, and Amazon Web Services. These outages have significant business and financial impact on organizations. Why are these issues not being surfaced during the testing phase?
  • It all starts with understanding the steady-state behavior of your application. The next step is to build your hypothesis. Once you have the hypothesis, you run experiments. The goal of the experiments is to verify the system's behavior and validate it against your hypothesis.
  • The first step is all about observing the steady state of your application. Once you have that, you can start building your hypothesis around it. Choose the right experiments to get the most out of chaos engineering. The last step is to verify the results of your experiments and then act upon them.
  • I want to give you pointers to some of the tools available to automate your chaos engineering experiments. If you're on AWS, AWS has a managed service called AWS Fault Injection Service, which allows you to build hypotheses and run experiments. The last one in the list here, Litmus, is an open source option.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Hey, everyone, this is Hareesh Iyer. I am a solutions architect at AWS. Today we are going to talk about chaos engineering. My goal for this session is to give you a high-level introduction to chaos engineering and talk a little bit about the best practices. So if you are new to chaos engineering and are curious to learn what it is, this is the session for you. Here is our agenda. I'm going to start by talking about why we need chaos engineering. Then we'll discuss a little bit about what it is and define chaos engineering, and then wrap it up with how you can implement chaos engineering practices in your organization. Let's dive in.

So let's start with the why. Why do we need chaos engineering? This is from December 2022, when Southwest Airlines, one of the major airlines in the US, had a huge meltdown. Between December 21 and December 30, they had to cancel around 15,000 flights. Remember, this was during the peak travel period. On some days they had to cancel around 60% of the daily flights they had scheduled. Other than the monetary impact, which Southwest Airlines reported as around $1.2 billion, this was a major PR disaster for the airline. What triggered this whole situation was a massive winter storm across multiple cities in the US. That caused the airline to cancel or delay many flights during that period. But what turned this situation into a major disaster for Southwest was the backend crew scheduling system. The scheduling system couldn't handle the volume of requests for scheduling changes, and it just went down.

But airlines are not the only ones who have outages. This is a snapshot of a service interruption that Meta had back in 2021. In October 2021, Facebook, Instagram, and WhatsApp were all down for more than five hours. I personally think this was a good thing for humanity. We all got a chance to step out of social media, talk to each other, get some fresh air. But for Meta, this translated to millions in lost ad revenue. If you look at the root cause analysis for this incident from Meta, you can see that it was caused by human error. One of the engineers who was performing routine maintenance unintentionally disconnected Meta's data centers from the Internet.

I have one more example, and this one is closer to home for me. AWS, Amazon Web Services, had a service interruption for one of its services, Amazon S3, back in February 2017. Amazon S3, if you're not familiar with it, is an object storage service, and it was one of the early services that AWS launched. Most AWS customers directly or indirectly use Amazon S3, so when S3 had an interruption in 2017, many big customers were directly impacted. This was a big deal. If you look at the root cause analysis from AWS, this was again a human error. One of the engineers who was running some commands typed something incorrectly, which deleted some of the tables.

And these are just some of the examples. I have a few more here from companies like Starbucks and Akamai and British Airways. In fact, these are so common that if you search for any date along with the word "outage", chances are that you will find one or more such incidents. And these outages have significant business and financial impact on organizations. I have some numbers here. For example, the cost of an hour of downtime for a business-critical application can be around $1 million. So the question is: why are these issues not being surfaced during the testing phase?
Companies like Southwest or AWS or Starbucks don't put things into production without proper validation. It does go through testing. So why are they not catching these issues? The reason is that when we do testing, whether it's unit testing, integration, or regression, we know the input to the test scenario and the expected output, right? So what the test case or test scenario does is provide that input and validate that the actual output matches the expected output. If we plot input and output into this framework, testing focuses on the top-left quadrant, the one in the green circle, where both input and output are known. But most of the situations that we discussed earlier in the session fall into the two right-side quadrants. In some scenarios, like Southwest, the input is known: Southwest probably knew that weather could cause some interruptions, but they didn't know the output; they didn't know the impact of that. In other situations, like AWS and Meta, it's very hard to predict the exact human error and the impact of that particular action. So in order to dive deep into the known unknowns and uncover some of the unknown unknowns, we need a different approach than regular testing. And this is where chaos engineering comes into the picture.

So let's talk about what chaos engineering is. Let's start with a bit of history. Chaos engineering came from Netflix. Netflix was one of the early adopters of cloud; they moved workloads to AWS in 2008. What Netflix realized is that in the cloud, they had to make applications more resilient to underlying infrastructure failures. In order to do that, they created a tool called Chaos Monkey. What Chaos Monkey did was run in production and randomly terminate compute instances, EC2 instances. This was hugely unpopular within Netflix in the beginning, because many application teams realized it was impacting their workloads in production. But that was actually the intent of the tool: to uncover issues in a controlled manner and move the ownership of building resilient applications to the application teams. And that worked. Gradually, the resiliency posture of these applications improved, and Netflix started creating more tools like that. They called this the Simian Army. There was a tool which, for example, simulated an Availability Zone failure, and there was one which even dropped an entire region, simulating a disaster recovery (DR) scenario. Netflix open sourced these tools, and more and more organizations started adopting them for their workloads. So Netflix teamed up with some of these early adopters and created a manifesto called Principles of Chaos Engineering.

This is how the manifesto defines chaos engineering: "Chaos engineering is the discipline of experimenting on a system in order to build confidence in the system's capability to withstand turbulent conditions in production." Now, let's unpack that a little bit. First, chaos engineering is about experimentation, not testing. What's the difference? In testing, as we know, both inputs and outputs are known, so all you're doing is validating the actual output against the expected output. But in experiments, the output is unknown. So you start with a hypothesis, and then you create experiments to validate your hypothesis and determine whether it is valid or invalid. Now, the goal of chaos engineering is to build confidence in the system's ability to withstand chaos in production. That's important.
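As an aside, to make the Chaos Monkey idea described above concrete, here is a minimal sketch of my own, not Netflix's actual tool, that randomly terminates one EC2 instance. It assumes a hypothetical ChaosTarget=true tag marking instances that have opted in to such experiments.

```python
# Minimal sketch of the Chaos Monkey idea: pick one opted-in instance at random and terminate it.
# Assumption (hypothetical): chaos-eligible instances carry a ChaosTarget=true tag.
import random
from typing import List, Optional

import boto3

ec2 = boto3.client("ec2")


def find_target_instances() -> List[str]:
    """Return IDs of running instances tagged as chaos targets."""
    resp = ec2.describe_instances(
        Filters=[
            {"Name": "tag:ChaosTarget", "Values": ["true"]},
            {"Name": "instance-state-name", "Values": ["running"]},
        ]
    )
    return [
        instance["InstanceId"]
        for reservation in resp["Reservations"]
        for instance in reservation["Instances"]
    ]


def terminate_random_instance() -> Optional[str]:
    """Terminate one randomly chosen target instance, if any exist."""
    targets = find_target_instances()
    if not targets:
        return None
    victim = random.choice(targets)
    ec2.terminate_instances(InstanceIds=[victim])
    return victim


if __name__ == "__main__":
    terminated = terminate_random_instance()
    print(f"Terminated {terminated}" if terminated else "No chaos targets found")
```

The point of the tag filter is the same ownership shift the talk describes: application teams decide which instances are fair game, and the tool only ever acts within that scope.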
Think of chaos engineering like a vaccine. You are injecting a little bit of chaos in a controlled fashion to build resiliency, to build immunity. That's the goal. Many would think that chaos engineering is all about breaking things in production, terminating instances, but that's not the case. Chaos engineering is about uncovering chaos which is already inherent, already there in the system. All you're doing is performing controlled experiments to uncover those scenarios so that you can proactively address them before the actual outage happens.

With that, let's look at how to approach chaos engineering and perform your experiments. This diagram shows the high-level steps involved in building your experiments. It all starts with understanding the steady-state behavior of your application. You need to know what a steady state looks like for your application before you can build your hypothesis, and for this you need a solid observability framework. Once you observe the steady state, the next step is to build your hypothesis. This is where you build multiple hypotheses that you then want to run experiments to validate. Once you have the hypotheses, you run experiments, and the goal of each experiment is to verify the behavior and validate it against the hypothesis. And if there is a deviation, this is where you need to act: make the necessary changes to improve the resiliency of your application, and then repeat the process. This is a continuous cycle to understand your application's resiliency incrementally and continuously improve it.

Let's double-click on each of these phases and understand them a little bit better. The first step, as I mentioned, is all about observing the steady state of your application. When I say observing an application, what you need to do is collect all the signals from your application and build an end-to-end view so that you can understand the state and health of your system. When talking about observability, there are three key types of telemetry data that you need to collect: one is logs, logs across your stack; the second is metrics; and the third is traces. The key is not only collecting these, but also correlating or mapping these signals so that you have an overall understanding of the steady state and the health of the system.

Once you have the steady-state behavior, you can start building your hypotheses around it. Here are a couple of examples of hypotheses around different goals. If you want to validate the availability of the system, a hypothesis can be: under the specific circumstances that you want to validate, customers still have a good experience, or the application is still available. For a security hypothesis, it can be: if a certain scenario happens, the security team gets paged, or a certain alarm goes off. You build these high-level hypotheses because you don't have a clear understanding of the output, but you know what the desired behavior looks like. Once you have the hypotheses, you can start planning your experiments. Choosing the right experiments is key to getting the most out of the investment that you're putting into chaos engineering. So start with the most common scenarios that can impact your application, with the goal of identifying the expected behavior and improving your application's resiliency against those failures.
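To make the hypothesis step concrete, here is a minimal sketch, not from the talk, of what an automated check for an availability hypothesis could look like. It assumes a hypothetical CloudWatch namespace "MyApp", a hypothetical "Availability" metric, and a steady-state threshold of 99.9%; the point is simply to compare observed behavior during an experiment against the steady-state band you defined.

```python
# Minimal sketch: validate an availability hypothesis against CloudWatch metrics.
# Assumptions (hypothetical): the application publishes an "Availability" metric in a
# "MyApp" namespace, and steady state means availability stays at or above 99.9%.
import datetime

import boto3

cloudwatch = boto3.client("cloudwatch")

STEADY_STATE_THRESHOLD = 99.9  # percent, taken from observed steady-state behavior


def availability_last_5_minutes() -> float:
    """Fetch the average of the hypothetical 'Availability' metric over the last 5 minutes."""
    now = datetime.datetime.now(datetime.timezone.utc)
    resp = cloudwatch.get_metric_statistics(
        Namespace="MyApp",          # hypothetical namespace
        MetricName="Availability",  # hypothetical metric
        StartTime=now - datetime.timedelta(minutes=5),
        EndTime=now,
        Period=300,
        Statistics=["Average"],
    )
    datapoints = resp.get("Datapoints", [])
    return datapoints[0]["Average"] if datapoints else 0.0


def hypothesis_holds() -> bool:
    """Hypothesis: even while the fault is injected, availability stays within the steady-state band."""
    return availability_last_5_minutes() >= STEADY_STATE_THRESHOLD


if __name__ == "__main__":
    print("Hypothesis valid" if hypothesis_holds() else "Hypothesis violated - investigate and act")
```

In practice you would run a check like this before, during, and after the fault injection, so that any deviation from steady state becomes the finding you act on in the last step of the cycle.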
Here are some common scenarios that you can build your experiments around. Single points of failure: identify single points of failure within your application stack and build experiments around them. Excessive load: apply excessive load to different components and see how they react. Latency: introduce artificial latency between components and see how the overall application behaves when that happens. Misconfigurations, bugs, and so on are all common scenarios to get started with.

The end goal for chaos engineering is to perform these experiments in production. But for many organizations, starting by running these experiments in production is a big risk. What I recommend is to start by running these experiments in a lower environment. Choose a very limited, controlled scope that you have a better handle on, run the experiments in lower environments, and observe the behavior. I would also highly recommend adding guardrails to these experiments so that if you see unexpected behavior in the system, you have a plan to roll back the experiment and get the system back to its previous state. Then, as you gain confidence, start moving these experiments to production and running them there. As you gain more and more confidence, you can increase the scope, add more experiments, and iterate. And make sure that you're automating these experiments, because systems change and evolve, so you have to run them continuously to make sure that the system's behavior is not deviating from your hypotheses.

The last step is to verify the results of your experiments and then act upon them. In this step, it's critical that you assess the impact of your findings, the business impact of your findings, and prioritize them accordingly. That way, if it's, for example, a security-impacting issue, it gets higher priority and is addressed immediately compared to some of the other findings.

I want to wrap up this session by giving you pointers to some of the tools available to automate your chaos engineering experiments. If your workloads are running on AWS, AWS has a managed service called AWS Fault Injection Service (FIS), which allows you to build hypotheses and run experiments. The great thing about FIS is that it has native integrations with many AWS services, so it makes it very easy for you to build and run experiments. Similarly, if you're on Azure, Azure has Azure Chaos Studio, which you can explore. There is also a commercial offering called Gremlin, which is very popular and again allows you to build hypotheses and run experiments. The last one in the list here, Litmus, is an open source option. So if you are leaning toward exploring open source tools to automate your experiments, that's a tool to consider. All right, that's it. I hope you found this session useful. Thank you so much for watching.
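As a rough illustration of the AWS Fault Injection Service option and the guardrails mentioned in the talk, here is a sketch, with hypothetical ARNs, tags, and names, that creates an FIS experiment template to stop one tagged EC2 instance and uses a CloudWatch alarm as a stop condition.

```python
# Rough sketch: an AWS FIS experiment template that stops one tagged EC2 instance,
# with a CloudWatch alarm as a guardrail (stop condition). The ARNs, tags, and names
# below are hypothetical placeholders, not values from the talk.
import boto3

fis = boto3.client("fis")

template = fis.create_experiment_template(
    clientToken="chaos-demo-template-001",  # idempotency token
    description="Stop a chaos-target instance and verify steady state",
    roleArn="arn:aws:iam::123456789012:role/fis-experiment-role",  # hypothetical role
    targets={
        "chaos-instances": {
            "resourceType": "aws:ec2:instance",
            "resourceTags": {"ChaosTarget": "true"},  # hypothetical opt-in tag
            "selectionMode": "COUNT(1)",              # limit the blast radius to one instance
        }
    },
    actions={
        "stop-instances": {
            "actionId": "aws:ec2:stop-instances",
            "targets": {"Instances": "chaos-instances"},
        }
    },
    stopConditions=[
        {
            # Guardrail: abort the experiment if this alarm fires (e.g., availability drops).
            "source": "aws:cloudwatch:alarm",
            "value": "arn:aws:cloudwatch:us-east-1:123456789012:alarm:MyAppAvailability",  # hypothetical
        }
    ],
    tags={"purpose": "chaos-engineering-demo"},
)

print(template["experimentTemplate"]["id"])
```

The stop condition is the automated version of the rollback plan discussed in the talk: if the steady-state alarm fires, FIS halts the experiment rather than letting the fault keep running.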

Hareesh Iyer

Senior Solutions Architect @ AWS

Hareesh Iyer's LinkedIn account


