Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hello everyone. Thanks for joining me on this session today. It's a pleasure to have you all. Today I'm going to be talking about cloud chaos engineering with AWS Fault Injection Simulator. My name is Samuel Baruffi, and I'm a solutions architect here at Amazon Web Services. I help support global financial services customers in their cloud journey with architecture, best practices, and so forth. Today I want to talk to you about how you can improve your resiliency and performance with controlled chaos engineering. You might be wondering, what is chaos engineering? Or maybe you're already very familiar with chaos engineering. What I want to show you is how AWS and the AWS ecosystem can help you on your chaos engineering journey.
So let's spend a few seconds on the agenda for what we're going to cover today. We are going to start by talking about some challenges with distributed systems: how distributed systems work and why they are complex by nature. Then we'll jump into why chaos engineering is hard, and what we have heard as a company from customers who have tried to do chaos engineering by themselves, along with some of the lessons and requirements we heard from them. Then, of course, we're going to introduce AWS Fault Injection Simulator. You'll see that I use the acronym FIS interchangeably with Fault Injection Simulator, the name of the service on AWS. After that, we're going to dive deep into some of the key features that Fault Injection Simulator brings to us, and some of the use cases where you should or could be using the service. At the end, I will spend a little bit of time doing a simple demo, showing the console and demonstrating how you can use the service itself. Before the demo, I'll also share some resources you can use to take your learning to the next level if you are interested.
So let's move forward and talk about challenges with distributed systems. While most of us understand that distributed systems have undeniably revolutionized the IT industry in the last decade or so, they do come with challenges. Those challenges can be a combination of many things: latency, scalability, reliability, resiliency, concurrency, and more. As systems grow larger and more distributed, what is often a theoretical edge case can become a real occurrence. It's a common mistake, and maybe I've been at fault on this in the past too, to think that distributed systems only become complex when they get really big, when you're talking about hundreds or thousands of microservices. That is of course not the case, and let me explain what I mean by that.
Even in a very simple application where we just want to send a message from a client to a server, there are a lot of steps involved in that communication. Let's look at it. If a client wants to send a message to a server, the first thing that happens is the client puts the message onto the network. The network is responsible for delivering the message to the server. The server then validates the message. Once it has validated the message, the server updates its state. Once that state is updated, the server puts a reply onto the network. The network is then responsible for delivering the reply to the client. Once the client receives the reply, it validates the reply, and finally the client updates its own state. So it's mind blowing to realize how many steps happen behind the scenes in this very simple scenario of sending one message from a client to a server, and at how many of those steps you can have failures. Now multiply that by hundreds, thousands, millions, or even billions of occurrences across many of our microservices.
Too often we only implement tests after we have outages. It's very common that there is an issue on the network, for example because we don't have redundant network gear, and only after the occurrence has happened do we go and improve. We want to change that, right? One of the things that has been done a lot is traditional testing, and of course you need it. The message on this slide is: please don't stop doing your traditional testing. You should never stop doing it. It's just that traditional tests don't cover all the unknowns that distributed systems bring to the table, or all the complexity that you have in production environments. What traditional tests are good at is verifying known conditions and answering questions like: is this specific function or action returning the expected behavior? They are really good at that, whether you're using unit tests, functional tests, or integration tests. But let me pose a question: what about failures with weird errors that happen on the network, like traffic that goes over the Internet? What about configuration limits on cloud providers? What about drift in your infrastructure? And what about all the unknowns that you are not familiar with and are not testing for? How can you test for something that you don't know about yet?
And it can get even more complicated. Some things are just really hard to test. I'll give you an example. In a system where you have multiple instances that start and stop dynamically, what happens if one of those instances runs out of disk space? I'm pretty sure the majority of you have been in a similar situation where you had to perform maintenance on servers that had run out of space. Debugging applications that have run out of space is really complex and looks something like this: you just see a bunch of errors, and no action you try to take on the machine actually goes through. That is a relatively common issue, and often the root cause is just a misconfiguration of log rotation, or not monitoring the disk space on that instance from your monitoring systems. For monitoring you can use a third-party vendor or solutions on AWS, but some of the fixes you should have in place are: first, log rotation. If you don't have log rotation in place, then another solution is to have monitoring that watches the storage on the instance, and once it gets close to, say, 90% utilization, it sends a message so you can reactively make improvements. Of course, ideally you want automation that solves those problems for you, rather than having to page someone in the middle of the night to make those changes. But you can see this is just one example of the unknowns that you probably haven't covered in your unit tests or integration tests.
So the question in the industry is: how can you be more prepared for the unknowns? Luckily there is a kind of engineering, and that's what this talk is about today, that helps you with that. As all of you already know, its name is chaos engineering. Chaos engineering focuses on three main phases: the stress phase, the observe phase, and the improve phase. The stress phase means that you are stressing an application, either in a test or in a production environment, by creating disruptions, by injecting failure events such as server outages, API throttling, or network disruptions into your environment. After you have injected faults for a period of time, you observe what they mean: you observe the systems and how they respond. This is a really important part of chaos engineering, because chaos engineering can only exist if you have a really good observability system in place. Once you observe, by checking whether your system is completely healthy or whether something unexpected has occurred, you analyze what those occurrences are, and then you go to the last phase, the improve phase: you make changes so your application is more resilient or more performant. We want to prove or disprove the assumptions we have about whether our system can or cannot handle those disruptive events.
So chaos engineering focuses on improving the resiliency and performance of your workloads, but it also focuses on uncovering hidden issues. That's one of the main benefits of chaos engineering: those hidden issues are really hard to know about ahead of time. You also want to expose your blind spots, and this relates to the example I mentioned before about having a very good observability story. If you don't have proper monitoring, observability, and alarms, your application might fail and you won't have good data to understand what happened. That is another aspect of what we call continuous resiliency: if there is any sort of failure that you were not expecting, or that hasn't been uncovered by a metric or your observability system, you need to improve that story as well. But there is more to it. You also want to improve your recovery time: if there is a major issue you weren't able to protect your application from, how do you improve your recovery time? How do you improve your operational skills? And how do you implement a culture of chaos engineering? When we look at chaos engineering, we can look at it in different phases. One important thing to mention is that chaos engineering is not about breaking things randomly without a purpose. Chaos engineering is about breaking things in a controlled environment, through a well planned experiment, in order to build confidence in your application and tools so they can sustain turbulence and potential issues. To do that, you have to follow a well defined scientific method that takes you from a hypothesis, to running an experiment, to verifying the experiment, to improving, and then back again to the steady state. Chaos engineering shouldn't be something you run just once a year. It should be a practice that you motivate and keep innovating on, with your engineers continuously applying it to your workloads. That way you can sustain failures while keeping your business outcomes intact, not getting disrupted by random, unknown failures that your application might face.
But let's talk about why chaos engineering is difficult. At the beginning of the presentation, I mentioned that AWS has talked to a lot of customers who have tried to do this by themselves, and we have collected four main pieces of feedback from customers across a variety of industries and company sizes. The first is that it's really hard to stitch together different tools and homemade scripts. You might have some open source tooling, or you might be building some Python scripts and batch jobs to implement those failure injections, or even the observability piece. It's really hard to own that whole story yourself. You also need a lot of agents and libraries to get started, so you might need to put a lot of infrastructure and configuration in place, and it's not very easy to begin. Then, probably the most important one in my opinion: it's really difficult to ensure safety. If you're doing chaos engineering in production environments, the goal is to find the hidden problems, but at the same time you don't want to bring down your whole application. So how do you create guardrails that stop your chaos engineering, in production or even in test for that matter, before it brings down the whole application and affects your business outcomes? And the last one: it's really difficult to reproduce real-world events, because real-world events are not as simple as a single API failing. Normally it's a combination of scenarios that run in sequence or in parallel, and that is really very hard to reproduce.
With that in mind, what AWS introduced a couple of years ago at re:Invent, AWS's yearly global conference, is AWS Fault Injection Simulator. AWS Fault Injection Simulator is fully managed chaos engineering as a service. It is really easy to get started with, and it allows you to reproduce real-world failures, whether the failure is as simple as stopping an instance or more complex, like throttling APIs. AWS Fault Injection Simulator fully embraces the idea of safeguards, which is one of the things we've heard from our customers: they want to run these chaos engineering tests, but at the same time they want to be protected from causing full outages of their applications. Fault Injection Simulator brings that capability: it gives you a way to monitor and control the blast radius of your experiment and stop it automatically if alarms go off. So you have three main pillars here that we're going to talk about in a little more detail.
So let's talk about why it's easy to get started with Fault Injection Simulator. First, you do not need to integrate multiple tools and homemade scripts; Fault Injection Simulator manages all the tests and experiments for you. You can use the AWS Management Console that you are already familiar with, or the AWS CLI, to run those experiments. The interesting part here is that you can use pre-existing experiment templates, and we're going to talk about what experiment templates are in a moment, so you can get started in minutes. It's also really easy to share your experiment templates with other folks within your organization, or, if you prefer, you can open source them and make them available to the community. The templates are JSON or YAML files that you can share with your team and put under version control, so you can benefit from the best practices associated with code reviews.
Then let's move to the next topic, which is real-world conditions. You can run experiments both in sequence and in parallel. I mentioned before that real-world failures are not as simple as one event; sometimes they're a combination of events. Fault Injection Simulator allows you to combine different actions that inject failure in parallel or in sequence, and you can choose. You can also target all levels of the system: the host, the infrastructure, the network, and more, or you can select just a few of them. You have full control and flexibility in that regard. And then real faults, and this is an important one: faults are injected at the service control plane level. The faults being injected into your environment are actual real-world faults, not mocked-up APIs or manipulated metrics; they are real failures happening in real time. As an example, if you configure an experiment template to terminate an instance, the instance really will be terminated on AWS. So you have to be careful, because nothing is faked with metric manipulation, and you have to pay attention so you don't do something you didn't intend.
And then the safeguards are where you create those guardrails, which are stop condition alarms. You can configure alarms in CloudWatch, or potentially in third-party tools, so that if those alarms are triggered, a signal is sent to the Fault Injection Simulator service that says: please stop what you're doing, because it's impacting my service beyond a threshold I don't want to cross. Like I said, it integrates natively with Amazon CloudWatch, and it has built-in rollbacks as well, so it can undo what it has done up to that point.
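To give you a rough idea, and this is a sketch I'm adding here rather than something from the slides, a stop condition inside an experiment template looks roughly like this; the alarm name and account ID are placeholders:

{
  "stopConditions": [
    {
      "source": "aws:cloudwatch:alarm",
      "value": "arn:aws:cloudwatch:us-east-1:123456789012:alarm:HighErrorRate"
    }
  ]
}

If that alarm goes into the ALARM state while the experiment is running, FIS stops the experiment.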
Because Fault Injection Simulator can be dangerous, you want to make sure only the right people within your organization have access to run experiments. As with most AWS services, you can control this with fine-grained IAM controls. You can say that a specific IAM principal can only perform these actions on these resources, and that only these people can start experiments. You can also control what types of faults can be used and what resources can be affected by using tag-based conditions. For example, only instances with an environment tag of test can be affected, and nothing else; you don't allow anything else to be touched by Fault Injection Simulator.
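As an illustrative sketch (the action list and tag key are my own assumptions, not the exact policy from the talk), the role that FIS assumes could restrict instance-level actions to test-tagged instances like this:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "ec2:StopInstances",
        "ec2:TerminateInstances"
      ],
      "Resource": "arn:aws:ec2:*:*:instance/*",
      "Condition": {
        "StringEquals": { "aws:ResourceTag/Environment": "test" }
      }
    }
  ]
}

With a policy like this attached to the experiment role, FIS can only stop or terminate instances that carry the Environment=test tag.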
Here you can see an architecture diagram of the service. In the middle you have AWS Fault Injection Simulator, which is controlled by IAM, where you have all the policies and permissions defining who can do what in the service. You can access the service via the console, the CLI, or a combination of both. Once you start an experiment, the experiment injects faults into AWS resources: compute, databases, network, and storage. Those resources are monitored by CloudWatch alarms, or potentially third-party tools; you can choose. We recommend you create stop conditions, and if those stop conditions are triggered, EventBridge sends a notification to the Fault Injection Simulator engine to stop the experiments and roll back what has been done. Again, it's a best practice to have CloudWatch alarms monitoring your AWS accounts and workloads so you can define stop conditions that automatically stop the experiments.
Let's talk about some of the components that make up the Fault Injection Simulator service. You have actions, targets, experiment templates, and experiments. Let's look at each one of them individually.
Actions are the fault injection actions executed during the experiment. They are defined using a namespace of the form aws:service-name:action-type. An action definition can include the fault type, the target resources, the timing relative to other actions, and fault injection parameters such as the duration, the rollback behavior, or the portion of requests to throttle. As an example, in this JSON representation you have two actions defined: one is a stop-instance action and the other is a wait action. Notice that the wait action is ordered to execute only after the stop-instance action has executed, so they run sequentially. It's also worth noting that some host-level actions on EC2 instances are performed through the Systems Manager (SSM) agent. The SSM agent is software that is installed by default on some operating system images, such as Amazon Linux and Ubuntu, and you can find that information in the SSM or Fault Injection Simulator documentation.
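The slide itself isn't reproduced in this transcript, but a minimal sketch of such an actions block, assuming the aws:ec2:stop-instances and aws:fis:wait action types and a hypothetical target name of myInstances, would look roughly like this:

{
  "actions": {
    "StopInstances": {
      "actionId": "aws:ec2:stop-instances",
      "parameters": { "startInstancesAfterDuration": "PT5M" },
      "targets": { "Instances": "myInstances" }
    },
    "Wait": {
      "actionId": "aws:fis:wait",
      "parameters": { "duration": "PT5M" },
      "startAfter": ["StopInstances"]
    }
  }
}

The startAfter field on the wait action is what enforces the sequential ordering described above.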
Now let's look at targets. We talked about actions; targets define one or more AWS resources on which to carry out an action. The action is what you do, and the target is where that action will actually be executed. You define targets when you create an experiment template, and you can use the same target for multiple actions in your experiment. Targets include the resource type, resource IDs, tags and filters, and a selection mode: all of them, a random count, a percentage, and so forth.
Here is an example JSON representation of targets. In this example we filter targets to instances running in one specific Availability Zone, us-east-1a. But that's not the only filter: there is also a tag filter to refine the selection, so only EC2 instances in us-east-1a with the tag Environment=test will be impacted. There are further filters so that only instances in the running state, and only instances within a specific VPC, are selected. And you can see the selection mode says we just want two instances, so if there are more instances, they will not be affected; we're selecting only two. There are other combinations you can use, like percentages or random selection, as mentioned before.
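Again, the slide isn't reproduced here, but a sketch of a targets block along those lines (the filter paths, tag value, and VPC ID are assumptions for illustration) could look like this:

{
  "targets": {
    "myInstances": {
      "resourceType": "aws:ec2:instance",
      "resourceTags": { "Environment": "test" },
      "filters": [
        { "path": "Placement.AvailabilityZone", "values": ["us-east-1a"] },
        { "path": "State.Name", "values": ["running"] },
        { "path": "VpcId", "values": ["vpc-0123456789abcdef0"] }
      ],
      "selectionMode": "COUNT(2)"
    }
  }
}

The target name here, myInstances, is what the actions block references in its own targets map.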
Experiment templates define an experiment and are used in the start experiment request. Think of an experiment template as the place where you put the actions, the targets, and everything else together; all of that information is combined into one experiment template. An experiment template includes the actions we talked about and the targets, plus some optional information like stop condition alarms, which we highly recommend you always have, so that if something goes south that you're not expecting, the experiment is stopped automatically. You also have an IAM role that will be assumed to execute the experiment, plus a description and some tags. When you look at the JSON of a template, it looks something like this: a stop-and-restart-instance template, which is the name of the experiment template, with a description, a role ARN that will be used to assume the role and execute the specific actions (so you need to make sure that role has the required permissions), then the targets section we talked about, and then the actions, which we also talked about.
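Putting the pieces together, here is a rough sketch of what such a stop-and-restart-instance template could look like as a create-experiment-template request body; the role ARN, alarm ARN, and names are placeholders rather than the exact values from the slide:

{
  "description": "Stop and restart an instance tagged Environment=test",
  "roleArn": "arn:aws:iam::123456789012:role/fis-experiment-role",
  "stopConditions": [
    { "source": "aws:cloudwatch:alarm",
      "value": "arn:aws:cloudwatch:us-east-1:123456789012:alarm:HighErrorRate" }
  ],
  "targets": {
    "myInstances": {
      "resourceType": "aws:ec2:instance",
      "resourceTags": { "Environment": "test" },
      "selectionMode": "COUNT(1)"
    }
  },
  "actions": {
    "StopInstances": {
      "actionId": "aws:ec2:stop-instances",
      "parameters": { "startInstancesAfterDuration": "PT5M" },
      "targets": { "Instances": "myInstances" }
    }
  },
  "tags": { "Name": "stop-and-restart-instance" }
}

The startInstancesAfterDuration parameter is what gives the "stop and restart" behavior: the instance is stopped and then started again after the duration elapses.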
Let's look at two experiment templates that are very different, but both are ideas you can build with Fault Injection Simulator. As explained earlier, you can run a simple experiment like the one on the left, which is a sequential experiment with two actions across three different targets: one target group with very little filtering, and a more specific target group filtered by tags and so forth, plus a stop condition. But you can also do something like the one on the right, where you have a target that filters all EC2 instances with a chaos-ready tag, and a combination of actions: action one happens, then action two, and once action two finishes, action three happens as well. You can mix parallel and sequential actions, and you can configure multiple stop conditions, which we highly recommend, in your test environments but especially in production. In this example you can see there are two stop conditions.
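As a rough sketch of how that mix could be expressed (the action types, target name, and alarm ARNs are my own placeholders, not the exact slide content): the two actions below without a startAfter field start in parallel, while the third waits for both, and two stop conditions guard the whole experiment.

{
  "stopConditions": [
    { "source": "aws:cloudwatch:alarm",
      "value": "arn:aws:cloudwatch:us-east-1:123456789012:alarm:HighErrorRate" },
    { "source": "aws:cloudwatch:alarm",
      "value": "arn:aws:cloudwatch:us-east-1:123456789012:alarm:HighLatency" }
  ],
  "actions": {
    "ActionOne": {
      "actionId": "aws:ec2:stop-instances",
      "targets": { "Instances": "chaosReadyInstances" }
    },
    "ActionTwo": {
      "actionId": "aws:fis:wait",
      "parameters": { "duration": "PT2M" }
    },
    "ActionThree": {
      "actionId": "aws:ec2:terminate-instances",
      "targets": { "Instances": "chaosReadyInstances" },
      "startAfter": ["ActionOne", "ActionTwo"]
    }
  }
}

For a purely sequential flow like the left-hand example, you would simply chain startAfter from one action to the next.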
And finally, experiments are simply a snapshot of the experiment template at the moment it was launched. You can see in the service an execution list of all the experiments: every time you launch an experiment template, you automatically create an experiment, and you can see who initiated it, what the result was, and all the related data. An experiment includes a snapshot of the experiment template you used, the creation and start time, the status of the experiment, the execution ID, the IAM role ARN, and a few other pieces of information.
Now, what fault injections does FIS currently support? There are a lot of things on this list, and the list will keep growing. You can inject server errors on EC2, throttle APIs on IAM, kill processes on EC2 instances, inject latency on EC2, kill container instances on ECS, and do the same on EKS by terminating nodes. Recently we also announced network disruption, EBS I/O pause, and a few others, and you'll see this list grow over time.
Let's look at some use cases for the Fault Injection Simulator service, and at how we see customers adopting chaos engineering, both when getting started and in more advanced practice. First, one-off experiments. This is perhaps one of the most common ways of doing chaos engineering. These can be experiments where you want to verify a new service in your system or a specific part of your architecture, or maybe expose monitoring blind spots. You create a one-off experiment and you go through all the phases of chaos engineering: understanding the steady state, forming a hypothesis, designing and running the experiment, and so on. This is a great starting point for chaos engineering. You do a one-off experiment and you prove your hypothesis: nothing broke, you verified something within your system, success. Or perhaps you disproved your hypothesis, something unexpected happened, and you're going to improve. The goal is that you have learned something about your system and you were able to implement improvements. But those are just one-off experiments.
Another common use case for chaos engineering is as part of a game day. A game day is a process of rehearsing ahead of an event by creating the anticipated conditions and then observing how effectively the teams and systems respond. An event could be an unusually high-traffic day, say a promotion day for your ecommerce site, or a new launch, or a failure, or something else. You bring everything together, you prepare for that game day, and you can use chaos engineering experiments to run it: you create those event conditions, monitor the system, see how your organization behaves, and make new improvements.
Another use case is automated experiments. Automating experiments really goes back to the scientific side of chaos engineering: repeating experiments is standard scientific practice in most fields. Automated experiments help us cover a larger set of experiments than we could ever cover manually, and they keep verifying our assumptions over time as parts of the system change. So instead of running a one-off experiment maybe every six months or so, you have automated experiments that keep running as your architecture changes, and you don't have to rely on a lot of people across the organization; the automated experiments repeat themselves on their own. Let's talk a little more about some examples of automated experiments.
The first kind of automated experiment is recurring scheduled experiments. This is a great way to start with automation, because it simply keeps verifying your assumptions over time. Take, for instance, an example where different teams build and deploy their own services within the system; it's very common in distributed systems to have dozens, hundreds, or thousands of microservices managed by different teams. How do I know that the behavior I verified through a chaos engineering experiment today is still valid tomorrow? A recurring schedule lets you run the experiment maybe every hour, every day, or every week, so you keep checking those conditions by injecting faults as your architecture evolves.
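One way to wire this up, as a sketch rather than a prescribed setup, is an EventBridge rule on a schedule whose target (for example, a small Lambda function you own) calls the FIS StartExperiment API with your template ID. The rule definition could be as simple as:

{
  "Name": "weekly-chaos-experiment",
  "Description": "Kick off the recurring FIS experiment once a week",
  "ScheduleExpression": "rate(7 days)",
  "State": "ENABLED"
}

The rule itself only provides the schedule; whatever you attach as its target is what actually starts the experiment.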
So that is one way. Now let's look at automated experiments based on an event trigger. An event is something that happens or is regarded as happening in the system: it could be, say, an order being placed, a user login, or even an auto scaling event. So what if we get latency to our downstream service when there is an auto scaling event? Does that affect our users? Using event-driven trigger experiments, we can verify that behavior. You create an experiment template and trigger an experiment when an auto scaling event occurs. You can say: when I see a scale-up of traffic, or some specific user action, please trigger the chaos experiment, and let's verify how the infrastructure and the workload behave during that specific time. Over time you can think about more of those event triggers, and within AWS you can use EventBridge to automate a lot of that.
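For example, and this is an illustrative assumption rather than a setup from the talk, an EventBridge rule could match Auto Scaling launch events with a pattern like this, with its target set to whatever starts the experiment (such as a Lambda function calling StartExperiment); the group name is a placeholder:

{
  "source": ["aws.autoscaling"],
  "detail-type": ["EC2 Instance Launch Successful"],
  "detail": {
    "AutoScalingGroupName": ["my-service-asg"]
  }
}

A rule with that pattern would fire whenever that Auto Scaling group scales out, which is exactly the moment you want to verify your latency assumption.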
Then, of course, you can make chaos engineering part of your CI/CD pipeline: continuous integration, continuous delivery, continuous deployment. You can add a stage in your pipeline that automatically starts one or multiple experiments against your newly deployed application. For instance, in the staging environment, before you push to production, you start multiple experiments by triggering the Fault Injection Simulator service with specific experiment templates. By doing so, every time there is a new push of code into a specific environment, staging in my example, you verify the behavior of your system with each deployment. You do need good observability tools to collect all the data and analyze it. And this again helps us verify our assumptions, because you created the experiment template based on the assumptions and hypotheses you have, even as parts of the system change.
I think it goes without saying, but it is still worth pointing out, that to do automated experiments you need to embrace safeguards. It's really important that you use the guardrails and the stop conditions within Fault Injection Simulator. When you automate a lot of these experiments, you're not just clicking a button and watching the events happen like in a one-off experiment, so you want to make sure an experiment that has brought your application into a degraded state can be stopped automatically. It's really important, and I'll highlight it again: please be very careful with that.
So I said aim for automation, but on the chaos engineering journey you should start with one-off experiments. Once you are comfortable with those, run some game days. From game days you can move to scheduled chaos engineering experiments, and then to more automated ones, potentially event-driven or as part of your CI/CD.
Just before we pause here for the demo: if you are interested and would like to use Fault Injection Simulator within your organization, please check out these resources. You have five links here. The first is the AWS Well-Architected Framework, which provides best practices and guidance on how to build workloads that are fault tolerant, resilient, highly available, cost optimized, secure, and operationally ready for production. You can also click through to the Fault Injection Simulator service page on our website. If you're interested, I highly recommend going through the chaos engineering workshop on your own time, in your own AWS account; it guides you step by step through some of the experiments that are common across many companies. You can also check the Fault Injection Simulator documentation, and there is a public GitHub repository with a lot of examples, so you can copy the JSON files, run those examples, build on top of them, or simply reuse them.
Now I'll do a very simple demo; for the sake of time I'll keep it short, just showing the console and how you can get started with Fault Injection Simulator. I'll see you in a moment as I transition to my screen on the AWS console. Okay, so let's jump into the demo. What I want to show you is a simple example. As you can see in this diagram, the application has two EC2 instances managed by an Auto Scaling group, and they're just running NGINX as a web server. Because this is a demo and I don't have real people using this web server, I'm going to generate some synthetic load, and then we're going to create an experiment that terminates one of the instances in my Auto Scaling group. So I have two instances in the Auto Scaling group, I want to terminate one, and then I want to use my monitoring dashboard and the observability data I have collected to understand how my application behaves. It's very simple, but I'll show you step by step how you can build it using Fault Injection Simulator, and then we'll look at some of the results. The hypothesis is: I have an application with two EC2 instances managed by an Auto Scaling group, and my application shouldn't suffer an outage, because I have two instances and one will still be serving traffic through the load balancer. So I have a load balancer endpoint, as you can see here; I'll just refresh. This is the load balancer endpoint, and it's serving the phpinfo PHP page, which is what I'm going to use. Before we jump there, I want to show you a quick dashboard.
This is something you need to have in place in order to do the observe part of chaos engineering. I have this dashboard in CloudWatch. CloudWatch is a monitoring tool, a managed service on AWS, that supports monitoring metrics, logging, and observability. Here I have a dashboard that collects a number of graphs. The first one is the client load connection status, so I can see if there are any error statuses: 500s, 400s, or 200s. In this case you don't see any data, because there is nothing there right now. On the other side you can see the server NGINX connection status; it shows just one connection, because my load balancer is only pinging the EC2 instances to see if they're healthy before routing traffic to them. Then you can see the response times; of course there is nothing there yet, because no traffic is being generated until I run the load test. Then I collect CPU utilization, and you can see it's 99% idle, so nothing is running there. You can see some of the network status: TCP time-wait is very low and TCP established is over here, so currently there are no network connections. Down below you can see two more graphs; let me move this so it's easier to see. You can see the number of instances in my Auto Scaling group, and the number of healthy versus unhealthy instances. I have one healthy count in one Availability Zone and another healthy count in another Availability Zone. The health check from my load balancer shows that both instances are healthy, and finally the instance check of my Auto Scaling group shows they're both healthy.
So let's jump into the console and create a Fault Injection Simulator experiment. Let's search for Fault Injection Simulator, FIS, go into the service, and create an experiment template. As I explained before, an experiment template is the combination of things we want to test for a hypothesis. For this hypothesis we'll set the description to "terminate half of the instances in an Auto Scaling group" and the name to "terminate half of instances". So you give a description and a name. Now you need an action; if you remember, an action is something that you want FIS to go and do. We're going to call it "terminate instance"; the description is optional. For the action type, you can just type "terminate instance", and we are going to use the pre-built action called EC2 terminate instance. What this does behind the scenes is actually terminate an instance, and I'll show you how it works. I don't need "start after" because we're doing a simple single-action experiment, and it automatically creates a target for me. I'll show in a moment what the target contains and how we can shape it to what we actually need. I'm going to click save, and it automatically creates a target. Let's click edit, because right now it just has a default name and the resource type, which is correct. I'll rename it to "asg-target" just so it's clearer. I don't want to manually select the EC2 instances, because remember, my Auto Scaling group is managing them. I want the target to be selected by resource tags, and I'll show in a moment what those tags are; I'm just checking here to make sure I have them right. The resource tag will be the Name tag that my Auto Scaling group stack applies to its instances. I also want a filter: I only want Fault Injection Simulator to look at instances that are running, so the state of the instance needs to be running. And then the selection mode: I don't want all the instances, because I know that if I terminate all of them my application will be down. I want 50% of my instances; in this case I only have two, so one randomly selected instance of the two will be terminated. So I'll go and click save.
In this case I have already configured an IAM role that has permission to perform these actions, like terminating instances in my Auto Scaling group, so I'm just going to use the FIS workshop service role; you would need to create one yourself. For the sake of the demo I'm not going to create any stop conditions, but I highly recommend you create a stop condition every single time, so that if something happens outside your control, a CloudWatch alarm triggers and stops your experiment. We also want to send logs to CloudWatch Logs, so I'm just going to browse; we have a log group for the FIS workshop, one second, I think it's here, yes, fis logs. That's where I want to save the logs, so all the logs of what the experiment does will be saved to CloudWatch Logs, and you can review them there. Then it just asks for a name, and I'm going to click create experiment template. It asks me: are you sure you want to create an experiment template without a stop condition? This is a warning for you. In this case, because it's a demo, I'm just going to say create, but in your production environment you should most definitely have that stop condition. So I'm going to go ahead and create the experiment template.
Here I have my experiment template. You can look at the targets: the template will terminate EC2 instances that have this specific resource tag and match this filter. Let me show you those instances, so you can see I'm not just making this up; there is no vaporware here.
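For reference, the template I just built through the console would roughly correspond to this JSON; this is a sketch, and the tag value, role ARN, account ID, and log group are placeholders based on the FIS workshop naming rather than exact values from my account:

{
  "description": "Terminate half of the instances in an Auto Scaling group",
  "roleArn": "arn:aws:iam::123456789012:role/FisWorkshopServiceRole",
  "stopConditions": [ { "source": "none" } ],
  "targets": {
    "asg-target": {
      "resourceType": "aws:ec2:instance",
      "resourceTags": { "Name": "FisStackAsg/ASG" },
      "filters": [ { "path": "State.Name", "values": ["running"] } ],
      "selectionMode": "PERCENT(50)"
    }
  },
  "actions": {
    "terminate-instance": {
      "actionId": "aws:ec2:terminate-instances",
      "targets": { "Instances": "asg-target" }
    }
  },
  "logConfiguration": {
    "logSchemaVersion": 1,
    "cloudWatchLogsConfiguration": {
      "logGroupArn": "arn:aws:logs:us-east-1:123456789012:log-group:fis-logs:*"
    }
  }
}

Note the stop condition source of "none", which is exactly the gap the console warned me about a moment ago.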
If we look here, we have two instances from the FIS stack ASG that are running, one in us-east-1b and another in us-east-1a, managed by an Auto Scaling group. If I go to my Auto Scaling group, you can see it has a desired capacity of two and a minimum capacity of two. Looking at instance management, I have two instances launched from the same launch template, one in us-east-1a and one in us-east-1b. If I click on an instance, it takes me to the one that has the specific tag we are filtering on. So when I start the experiment, it's going to select one of those and kill it.
What we want to do now, because this is just a demo and we don't have real load, is run a simple script that generates synthetic traffic to my instances. I have a script here that calls some Lambda functions to generate load. Once we have generated load, you'll see in a moment that these graphs start picking up the load test within a few seconds or minutes. So let's give it a few seconds, watch the load pick up, and then kill an instance and see what we can observe. Our hypothesis is that the application should remain online, but will it actually be fully available, or are we going to see connection errors or too much traffic on the remaining instance? While we wait for that, since CloudWatch has a little bit of delay before the metrics show up on the dashboard, I'm just going to start my experiment.
I'm going to go to the console, to my experiment templates, select the one I created, "terminate half of instances", and click start experiment. I can add a tag, for example a Name for this experiment; you can call it whatever you want. Then I click start experiment. It asks me again: are you sure you want to start this experiment without a stop condition? If something goes outside your control, you can still stop the experiment manually. Again, because it's a demo, we are fine, so we're just going to click start. You can see it's in the initiating state. We click refresh and it's in the running state; it takes a few seconds to actually run. Let's wait a little bit here, and on the timeline, if you refresh while it's running, you can also look at the logs, which will be published here once the action and the experiment have finished. What this is actually doing behind the scenes is terminating one of those EC2 instances. You can see it's completed; on the timeline, after a refresh, it shows just the terminate instance action, because that's the only thing in the template, and the logs will show up in a moment. Here they are, actually: it starts the experiment and then it terminates the instance. So here you can see the action has completed and it terminated the instance for me.
If we go and look at the Auto Scaling group and refresh, you can see that one instance is now unhealthy, because the experiment terminated it, and because the Auto Scaling group has a desired capacity of two, it's automatically creating a new instance. Now if you look at the CloudWatch dashboards, we can see that we have load, right? Around 2,900 requests have been successful, but there are a lot of requests getting 500 HTTP errors. My application is still up and running, and if I try to refresh, you can see it's working, but I might get a gateway error or a 500 error, mainly because I now only have one EC2 instance. You saw it took time to load, and now you can see the CPU usage: before it was nothing, but now it's over 60%. You can see that some of the network connections are in a wait state, not all of them, and the response time in milliseconds is getting worse. As I refresh this page, you can see it's taking a while; it's spinning, it's not doing a good job. If I go down below, you can see that the dashboard now recognizes only one healthy instance, and I also see only one healthy host, but the Auto Scaling group is spinning up another instance, so within a few seconds or minutes you'll see better connections because another instance will be serving traffic. While this experiment is ongoing, simple as it is, what we were able to observe is that if I have a peak of traffic and one of the instances goes down, I'm not really able to serve quality traffic with good performance to my customers. And you can see it when you look at the dashboard: there are a lot of error counts, more than half of those requests are errors, you're getting 500s or failed connections, some requests aren't even able to establish a connection, and it's taking quite a bit of time because the latency has increased. That's what you observe.
If I were the owner of this application, I would potentially increase the pool of instances in my Auto Scaling group to maybe four, spread across three or four Availability Zones, depending on the Region this is running in. So this is really simple. As you scroll down, you can see it picked up the latency: for some requests the maximum duration was 2.6 seconds, and the average is now about 1.3 seconds, whereas if you look back over the last hour it was much lower, because we're now running the experiment under a lot of load. If we scroll down, we only had one healthy instance, but now we have two instances back in our Auto Scaling group, and the load balancer is running its health checks to bring the new instance into service. In a moment, although we might not have enough time to watch it finish here, you'll see the server CPU get much better once that instance is in place. So this is a very simple example, and you can look at the experiments: once you click on experiments you can see all of them, click on an experiment ID, and find the timeline. In this case I'm only running one action, but you can have a sequence, you can run many actions in parallel, and you can start experiments, as we discussed for automated experiments, as part of a recurring schedule using EventBridge or as part of your CI/CD. You can mix and match a lot of different combinations, and the whole idea is to keep continuous resiliency in mind and improve the performance and availability of your application.
So that was it for the demo. I hope you were able to take away some key learnings about Fault Injection Simulator. I highly recommend you go through the workshop, and feel free to reach out to the service team, to me, or to anyone on the AWS team if you have any feedback or just want to share your experience. Thank you so much everyone, it was a pleasure. I wish you a great rest of the conference.