Conf42 Chaos Engineering 2022 - Online

Disaster Recovery preparedness using Chaos Engineering

Abstract

In the context of disaster recovery, we use Chaos Engineering to recreate or simulate the actual event. This gives us the opportunity to test our Disaster Recovery Plan and our response procedures in a controlled scenario, as opposed to recreating disaster-like conditions manually or waiting for a real disaster. By solidifying your disaster recovery plan through chaos engineering, you can be confident that the next big AWS service event will have little to no impact on your customers.

The agenda of this talk is:

  • What is DR and why do I need it
  • How to use chaos engineering to help with DR planning
  • How to identify critical assets and create RPO/RTO
  • Fire drills and DR plan confidence

Summary

  • Chaos engineering gives you real-time feedback into the behavior of your distributed systems. Observing changes, exceptions, and errors in real time allows you to not only experiment with confidence, but respond instantly to get things working again. We'll talk about resiliency with chaos engineering.
  • High availability is when you improve your uptime and resiliency by removing single points of failure and adding redundancy. Disaster recovery is your set of plans and policies that recover your workloads when things go down. Increasing availability and practicing restoration consistency helps you build resiliency.
  • Use chaos engineering to better prepare your resiliency posture. Testing in a non-production environment should always be performed regularly. Always assess your risk appetite and make sure to isolate failures at all times. Having a blameless culture is something that really needs to be in place.
  • You can run recurring experiments alongside machine-led processes like unit tests, regression tests, integration tests, and load tests. It's important to consider the scope and duration of recurring fault injection experiments, and to put them in the later stages of your CI/CD pipeline.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Chaos engineering gives you real-time feedback into the behavior of your distributed systems, and observing changes, exceptions, and errors in real time allows you to not only experiment with confidence, but respond instantly to get things working again. Thanks for joining my talk on disaster recovery preparedness using chaos engineering. In this session, I will go over disaster recovery context and the difference between high availability and disaster recovery. We'll talk about how to approach DR from a cost and risk perspective, and then we'll talk about resiliency with chaos engineering.
So let's talk about the difference between high availability and disaster recovery. High availability is when you improve your uptime and resiliency by removing single points of failure and adding redundancy, whereas disaster recovery is your set of plans and policies that recover your workloads when things go down. Backups are not a DR plan, and disaster recovery needs to be clearly defined and practiced often in order to prove confidence in your distributed systems. When thinking about disaster recovery, we're always going to focus on resiliency. Resiliency is being prepared for that black swan event. Increasing availability and practicing restoration consistency helps you build resiliency. Cloud native companies expect failure and are constantly improving their resiliency. Everything breaks all the time, and you need to be prepared for things to fail, especially in a shared responsibility model.
When thinking about resiliency, you have to understand that resiliency is critical and affects the user experience for your customers. Resiliency is also complex and grows in complexity over time as your applications grow, whether it be integrations, features, mergers and acquisitions, et cetera. Resiliency is a key cost driver based on your recovery point and time objectives and the criticality of your workloads. You might have safety-related workloads where, if they go down, people's safety is at stake. So the criticality of the workload really can help determine how you have to build in your resiliency. Resiliency is also completely different in the cloud than it is with on-prem applications. I can remember working at Verizon, where we would have to build out our applications in a 40/40 distributed model so that we never ran over 40% capacity, or we would have to add more. In the cloud, you build in capacity by knowing what to do when certain things fail, whether it be an instance stop/start, a shutdown, instance degradation, an availability zone or service event, or even, like we had in December, a regional event. Building in resiliency in the cloud means being prepared for unknown failure events.
So when we're talking about DR, we're going to start with defining our recovery time objective and our recovery point objective. Your recovery time objective (RTO) is the acceptable delay between service interruption and service restoration. This determines what is considered an acceptable time window for the service to be unavailable. Your recovery point objective (RPO) is the maximum acceptable time since the last data recovery point. This determines what is considered an acceptable loss of data between the last recovery point and the service outage. Now that we understand those, we can build our DR strategy based on these factors.
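To make the RPO idea concrete, here is a minimal sketch of an automated RPO check, assuming a hypothetical RDS instance named orders-db, a one-hour RPO, and us-west-2 as the DR region; the same pattern works for AWS Backup recovery points or cross-region S3 copies.

    import boto3
    from datetime import datetime, timezone, timedelta

    # Hypothetical targets -- replace with the RPO agreed with the business.
    RPO = timedelta(hours=1)
    DB_INSTANCE = "orders-db"  # assumed instance identifier

    rds = boto3.client("rds", region_name="us-west-2")  # assumed DR region

    # Find the newest completed snapshot for the instance.
    snapshots = rds.describe_db_snapshots(DBInstanceIdentifier=DB_INSTANCE)["DBSnapshots"]
    available = [s for s in snapshots if s["Status"] == "available"]
    latest = max(available, key=lambda s: s["SnapshotCreateTime"])

    age = datetime.now(timezone.utc) - latest["SnapshotCreateTime"]
    print(f"Newest recovery point is {age} old (RPO target: {RPO})")
    if age > RPO:
        raise SystemExit("RPO breached: newest recovery point is older than the target")

Running a check like this on a schedule turns "avoid recovery mechanisms that are not often tested" into something measurable.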
Backup and restore is used to back up your data and applications into the DR region and restore that data when it's needed to recover from the disaster. If your recovery time is in the hours, up to 24 hours or less, this is going to be your best option. Whereas if your recovery time is in the minutes, pilot light might work for you. It keeps a minimal version of the environment by always running the most critical core elements of your system in the DR region, and at the time of performing the recovery, you can quickly provision a full-scale production environment around that critical core. So you'll have your pilot light just sitting there, waiting to be flipped on whenever disaster happens. Your warm standby is a little bit different. It's a little bit more expensive, in that it keeps a reduced version of a fully functional environment always running in your DR region. Business-critical systems are fully duplicated and always on, but with a reduced fleet. When the time comes for recovery, the system scales quickly to process that production load. And then your most expensive option is active-active. This is how we built things in the on-prem world, where your RPO is basically none, or seconds, and your RTO is in seconds. Your workload is deployed and actively serving traffic in multiple AWS regions. This strategy requires you to synchronize users and data between the regions you're using, so there's a lot of data transfer going back and forth and a lot of databases that need to be in sync. And when recovery time comes, you can use services such as Amazon Route 53 or AWS Global Accelerator to route user traffic to an entirely different workload deployment. In that system, those are going to be your highest-priority, most critical workloads.
One thing you're going to want to do is avoid recovery mechanisms that are not often tested. You want to define regular tests for failover to ensure that your expected recovery point and time objectives are met. So always avoid creating recovery mechanisms but never practicing them. It's important to practice. In the Navy, we were constantly going through firefighting exercises, because you're always practicing for that event when you need to put out a fire. Take that same mindset and apply it here.
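One way to make that failover practice repeatable is to script the traffic shift itself. Here is a minimal fire-drill sketch, assuming a hypothetical weighted Route 53 record for app.example.com with record sets named primary and dr and placeholder regional endpoints; it simply drains the primary region so you can observe what actually happens in the DR region.

    import boto3

    route53 = boto3.client("route53")

    HOSTED_ZONE_ID = "Z0000000EXAMPLE"  # assumed hosted zone ID
    RECORD_NAME = "app.example.com."    # assumed weighted record name

    def shift_weight(set_identifier, target, weight):
        """Upsert one weighted record so traffic shifts between regions."""
        route53.change_resource_record_sets(
            HostedZoneId=HOSTED_ZONE_ID,
            ChangeBatch={
                "Comment": f"DR fire drill: {set_identifier} -> weight {weight}",
                "Changes": [{
                    "Action": "UPSERT",
                    "ResourceRecordSet": {
                        "Name": RECORD_NAME,
                        "Type": "CNAME",
                        "SetIdentifier": set_identifier,
                        "Weight": weight,
                        "TTL": 60,
                        "ResourceRecords": [{"Value": target}],
                    },
                }],
            },
        )

    # Drain the primary region and send all traffic to the DR region.
    shift_weight("primary", "app.us-east-1.example.com", 0)
    shift_weight("dr", "app.us-west-2.example.com", 100)

Keeping a script like this (and the matching rollback) in version control means the drill looks the same every time you run it.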
Now let's talk about resiliency and using chaos engineering to better prepare your resiliency posture. Chaos engineering, as you know, is the discipline of experimenting on a system with the aim of increasing confidence in its ability to withstand problems in your environment. My philosophy, and our philosophy, is that testing in a non-production environment should always be performed regularly and be part of your integration and deployment lifecycle. In production, teams must perform these tests in such a way as to not cause the service to become unavailable. The last thing you want to do is cause problems for your customers while you're testing out a hypothesis. So always run these tests in non-production or development environments, and make sure that test results are measured and compared with availability objectives to understand whether the application running in that particular environment is able to meet those defined objectives. When you first start experimenting with chaos engineering, start small and build confidence. Don't go straight to regional failures. Start by stopping instances or doing things at the host level so that you can build confidence and form your hypotheses, and then work your way up to availability zone failures or even regional failures.
Try to build auto-recovery mechanisms into your systems after you perform these experiments. Always assess your risk appetite and make sure to isolate failures at all times. Like we talked about on the last slide, never do things in production that could have an effect on your customers, and always have a backout and rollback plan. When you're quantifying the results of your experiments, you're going to want to think about how long it takes to detect these failures and how long it takes to get notified. Should a status page be updated, or should you notify your customers? How long does your auto recovery take? That's a big factor, because if you have a recovery objective of ten minutes and your auto recovery takes 20, then you're going to have to go back to the drawing board. Is it a partial or full auto recovery, and how long does it take to really get back to that steady state? That's going to be one of the key quantifiable results of the experiment and what you're going to be looking for.
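Here is a minimal sketch of that kind of measurement, assuming a hypothetical non-production instance behind a load balancer, a placeholder health endpoint, and a ten-minute RTO; it injects one small fault and records time to detect and time to recover so they can be compared against your objectives.

    import time
    import urllib.request
    import boto3

    # Hypothetical experiment parameters.
    INSTANCE_ID = "i-0123456789abcdef0"                # assumed non-production instance
    HEALTH_URL = "https://staging.example.com/health"  # assumed health check endpoint
    RTO_SECONDS = 600                                  # assumed ten-minute recovery objective

    ec2 = boto3.client("ec2")

    def healthy():
        """Return True if the application health check answers with HTTP 200."""
        try:
            with urllib.request.urlopen(HEALTH_URL, timeout=2) as resp:
                return resp.status == 200
        except Exception:
            return False

    # Inject a small fault: stop one instance and let auto recovery do its job.
    start = time.monotonic()
    ec2.stop_instances(InstanceIds=[INSTANCE_ID])

    detected = recovered = None
    while recovered is None and time.monotonic() - start < 2 * RTO_SECONDS:
        ok = healthy()
        if not ok and detected is None:
            detected = time.monotonic() - start    # time until impact becomes visible
        if ok and detected is not None:
            recovered = time.monotonic() - start   # time until steady state returns
        time.sleep(5)

    print(f"time to detect: {detected}s, time to recover: {recovered}s, RTO: {RTO_SECONDS}s")
    if detected is not None and (recovered is None or recovered > RTO_SECONDS):
        raise SystemExit("Auto recovery missed the RTO -- back to the drawing board")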
You also want to do reviews of the incident, so having a blameless culture is something that really needs to be in place for this to work. You'll talk about the event and the impact, and go over the five whys. Make sure all your data and your monitoring and observability metrics are there to tell the story. We had a saying at one of my previous places: without charts and graphs, it didn't happen. So make sure you have the proper observability metrics in place so that you can learn from what happened. You want to make sure that corrective actions are taken and followed through on. In these post mortems, which we call COEs, or corrections of error, have a defined list, have a structure for how these meetings go, and clearly define the lessons learned and what to take out of it. If you're not going to take the results and learn from these failures, then you're never going to be able to improve your resiliency. And finally, continually audit these meetings and post mortems, and try to get to a weekly cadence where you're constantly improving on things. At one of my stops, as an SRE, we met with our NOC engineers on a bi-weekly basis and went over every single escalation, and we created a runbook every time. So any new escalation shouldn't already have a runbook; if we're getting escalated for things on a repeatable basis, then that, to me, is toil, especially if there's human interaction involved. Try to automate those processes, but have those weekly operational reviews and go over your planning metrics. Make sure that you're continuously improving. There's a saying, kaizen, which means continuous improvement. Make sure that you're always trying to improve and learn from these events and these failures. When you do that, you will build a much more resilient system.
So how do we get started? Well, you can run recurring experiments. What are some good candidates for recurring experiments? Machine-led processes like unit tests, regression tests, integration tests, and load tests. Remember, just like with those other tests, it's important to consider the scope and the duration of the recurring fault injection experiments. Because fault injection experiments generally expose issues across a large number of linked systems, they will typically require extended runtimes to ensure sufficient data collection. So make sure you put them in the later stages of your CI/CD pipeline; that way they don't slow down your developers. Here is a link to one of the chaos engineering workshops where you can create a recurring experiment. In this experiment, we focused on running it in a CI/CD pipeline, with the argument that it's easy to slow down an automated pipeline to run only once a year, but hard to speed up a manual process to run multiple times every day. Go with one repo: use a single repository to host the definition of the pipeline, the infrastructure, and the template. You want to do this so that you can co-version all components of the system. Whether this is a good idea depends on your governance processes, but each of the parts could easily be made independent. With this part of the workshop, you can create these pieces so that you can integrate them easily into your pipeline. As you can see, it uses the CDK to build out the infrastructure: you create a code repo and pipeline using the CDK, then trigger the pipeline to instantiate the infrastructure, and then trigger the pipeline again to update the infrastructure and perform the fault injection. It's a really cool workshop, so scan it and give it a shot. And then here are some other resources that are at your disposal. As always, if you would like to run these with your TAM, reach out to your TAM. But yeah, thanks for joining my talk, and you all have a great day.
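For a concrete picture of that late pipeline stage mentioned above, here is a minimal sketch, assuming an AWS FIS experiment template that already exists (the template ID below is a placeholder, for example one created by the workshop's CDK stack); the stage starts the experiment and fails the pipeline if the experiment does not complete.

    import time
    import uuid
    import boto3

    TEMPLATE_ID = "EXT1a2b3c4d5example"  # placeholder FIS experiment template ID

    fis = boto3.client("fis")

    # Start the fault injection experiment as a late stage of the CI/CD pipeline.
    experiment = fis.start_experiment(
        clientToken=str(uuid.uuid4()),
        experimentTemplateId=TEMPLATE_ID,
    )["experiment"]

    # Poll until the experiment reaches a terminal state.
    while True:
        state = fis.get_experiment(id=experiment["id"])["experiment"]["state"]
        if state["status"] in ("completed", "stopped", "failed"):
            break
        time.sleep(15)

    print(f"Experiment finished with status {state['status']}: {state.get('reason', '')}")
    if state["status"] != "completed":
        raise SystemExit(1)  # fail the pipeline stage so the team investigates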
...

Kyle Shelton

Chaos & Reliability Engineering & Builder @ AWS



