Transcript
Hi, welcome. My name is Carolina and I work at Nuaware.
We're a software distribution company that specializes in cloud native
technologies. We also partner with some of the newer, innovative vendors that
enter the European market. So for over a year now, I've been speaking to teams
looking into chaos engineering almost on a daily basis.
So I thought it would be a good idea to compile some of the answers,
common questions, success stories, and cautionary tales into
a short guide for anyone that's interested in onboarding the practice.
So, in other words, this talk is a brief before, during,
and after of making chaos successful. So I hope you'll find it
useful. And let's jump in. Before we dive into each step, I wanted to quickly
provide you guys with a baseline definition. In short,
I like to introduce chaos engineering as a scientific method
to uncover failures in our system before they actually happen. It allows us to reproduce the real-world scenarios that are too unpredictable for traditional testing, to make sure that it's not just our applications,
infrastructure and network that are designed for reliability,
but also processes and people. So, through controlled use of attacks,
we test how a system responds under various conditions. And apologies
for using this very topical analogy here, but it works very similarly to
a vaccine: the injection of a pathogen in order to build up immunity to it and prevent potentially devastating illnesses.
So, without further ado, let's dive into some of the considerations to
be made before onboarding chaos engineering. For any organization
that's unsure whether they will benefit from it, I always like to start from
a business angle. We all know that downtime costs a lot of money. There are
countless examples in the news almost on a daily basis that remind us of that. The cost of downtime is currently estimated to average around $100,000 an hour. So,
unsurprisingly, it's becoming a KPI for many engineering teams at this
point. So the bottom line here is, if your company's profitability is
dependent on the underlying systems, the short answer is yes,
you will benefit from chaos engineering. Now, while they understand the advantages of chaos testing, a lot of organizations believe that it's too early for them, as in, what's the point of chaos engineering
in a legacy system? So, chaos engineering, of course, emerged from
the increasing complexity of our systems, which, through the introduction of automation,
containers, orchestrators and the like, have become almost unpredictable.
Now, while chaos engineering does shine in distributed systems,
it's mostly because the experiments can be performed at a larger scale.
But at any scale, it will provide us with priceless knowledge about the behavior of our system. The immediate benefit of chaos testing legacy systems is the same as for any other system. Quoting Satya Nadella, all companies are software companies, regardless of how cutting edge their estates are. Downtime will hurt customer-facing applications running on legacy systems as much as those containerized on Kubernetes.
So everything that makes our systems more resilient will bring value.
Now, the static nature of bare metal may appear to be more
predictable, but very rarely will you come across an IT department
without any of its functions being outsourced. So regardless of how
well you know the behavior of your own system, how do you test, or in
this case, simulate outages in third party providers? I'll let you
guess the answer. But looking beyond the immediate benefits of chaos testing legacy systems, sooner or later you'll migrate to the cloud, start containerizing, or build out a Kubernetes platform, and the earlier teams familiarize themselves with chaos engineering, the smoother and more reliable those transitions will be later on. But another aspect of the "is it too early?" question is: do I have everything needed to lead a successful chaos engineering practice? A successful chaos engineering practice is characterized by
its scientific approach. So in here, I really like the analogy of
performing surgery. Would you do that with a blindfold and a hammer, or a microscope
and a scalpel? Now, our microscope here would be monitoring, observability, and as much data and as many metrics as you can collect while running those experiments. Monitoring is so important while running a chaos engineering test because, without it, we won't fully understand (a) the steady state of our application and (b) how the system starts to behave under stress. So baseline metrics to track here would be
resource consumption, state and performance. But it's also
paramount to have high visibility of the levels of service tied to your application.
So the reason for that is twofold: a breach will result in a fine, of course, but running a successful chaos experiment should also improve those over time. Here I'm talking about service level objectives, indicators, and agreements. Other very valuable metrics to collect, especially if you want to track the efficiency of chaos engineering as part of your incident response, would be mean time to detect, mean time to recovery, and mean time between failures.
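To make those three a bit more concrete, here is a minimal sketch of how they could be computed from an incident log. The timestamps and the record format are made up purely for illustration.

```python
# Hypothetical incident log: (failure started, failure detected, service restored).
from datetime import datetime
from statistics import mean

incidents = [
    (datetime(2024, 1, 3, 10, 0),   datetime(2024, 1, 3, 10, 12),  datetime(2024, 1, 3, 11, 5)),
    (datetime(2024, 2, 14, 22, 30), datetime(2024, 2, 14, 22, 35), datetime(2024, 2, 14, 23, 10)),
    (datetime(2024, 3, 2, 4, 15),   datetime(2024, 3, 2, 4, 45),   datetime(2024, 3, 2, 6, 0)),
]

# Mean time to detect: how long failures go unnoticed on average.
mttd_min = mean((detected - started).total_seconds() / 60 for started, detected, _ in incidents)
# Mean time to recovery: how long it takes to restore service on average.
mttr_min = mean((restored - started).total_seconds() / 60 for started, _, restored in incidents)
# Mean time between failures: average healthy time from one recovery to the next failure.
mtbf_h = mean((nxt[0] - prev[2]).total_seconds() / 3600 for prev, nxt in zip(incidents, incidents[1:]))

print(f"MTTD: {mttd_min:.0f} min, MTTR: {mttr_min:.0f} min, MTBF: {mtbf_h:.0f} h")
```

If the practice is working, you would expect mean time to detect and mean time to recovery to trend down over time, and mean time between failures to trend up.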
Now this brings us to a question that could be a topic of its own.
So I've tried to summarize the main points. Should we build or buy
our chaos engineering tools? So, in a nutshell, in-house built tools are configured in line with your application or environment's exact needs, so that they seamlessly integrate with your other environments,
like your monitoring and development pipelines. Now, there's also a security benefit
as all connections will stay within your company's internal network.
Now, this is dependent on the internal processes you have in place,
but building a tool often shortens the feedback cycle between development and production.
And of course, building in house gives you full control over the product roadmap,
so you can release features when your organization needs them most.
Now, the biggest argument for buying is, ironically, its cost effectiveness.
With chaos engineering being a new practice, specialized talent is very rare and expensive, and dedicating and upskilling an internal team to build a tool can also be quite costly, even if we're talking about customizing an open source project. With an enterprise-grade tool, you will be provided with regular product updates and a dedicated support team. Building a sophisticated fault injection platform takes several engineers roughly 14 to 18 months to build and maintain, whereas buying will shorten that time down to minutes. Bought tools are also often much more compatible with varied application stacks and infrastructure types, so they're easy to scale. Most built tools also won't have automation features, at least not right away.
Now, at this point, we should know why we're starting with chaos engineering
and what we'll use. So where do we start then? The best way
would be by brainstorming what should be experimented on to begin with.
Now here I highly, highly discourage going straight to production.
I'd recommend picking a precisely defined small portion of the system,
preferably in dev or test. Next step would be to state a
hypothesis. What do we think will happen if we run this attack? Then we
design an experiment. Again, one with a magnitude that's
way smaller than we think has the potential to cause any failure
or unpredictable behavior in our system. Now, after the experiment is complete, we carefully examine what our monitoring and observability tools are showing and analyze those metrics, as well as other system data. Our findings should drive how we prioritize our efforts, so we mitigate any failures we've uncovered with this experiment immediately. Then we follow our work up by running the same chaos experiment again to confirm our fix was effective. Doing this repeatedly,
starting small and fixing what we find each time, will quickly add up.
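To make that loop a bit more tangible, here is a minimal sketch of a single cycle in Python: verify the steady state, inject a small fault, observe whether the hypothesis still holds, and roll back. The health endpoint, the service name, and the inject/remove helpers are hypothetical placeholders for whatever chaos tooling you actually use.

```python
import time
import urllib.request

STEADY_STATE_URL = "http://localhost:8080/health"  # hypothetical dev/test endpoint
LATENCY_BUDGET_MS = 500                            # hypothesis: responses stay under 500 ms

def response_time_ms(url: str) -> float:
    """Time a single request in milliseconds."""
    start = time.monotonic()
    urllib.request.urlopen(url, timeout=5)
    return (time.monotonic() - start) * 1000

def steady_state_ok() -> bool:
    """Our hypothesis: the service answers within the latency budget."""
    return response_time_ms(STEADY_STATE_URL) < LATENCY_BUDGET_MS

def inject_latency(target: str, delay_ms: int) -> None:
    """Placeholder: call your chaos tool here, e.g. add network delay to one test instance."""
    print(f"injecting {delay_ms} ms of latency into {target}")

def remove_latency(target: str) -> None:
    """Placeholder: always have a way to roll the fault back."""
    print(f"removing injected latency from {target}")

# 1. Confirm the steady state before touching anything.
if not steady_state_ok():
    raise SystemExit("System is not healthy to begin with; abort the experiment.")

# 2. Inject a fault much smaller than what we think could cause a failure.
inject_latency(target="checkout-service", delay_ms=100)
try:
    # 3. Observe whether the hypothesis still holds under stress.
    hypothesis_held = steady_state_ok()
finally:
    # 4. Roll back, whatever the outcome.
    remove_latency(target="checkout-service")

print("Hypothesis held" if hypothesis_held else "Hypothesis disproved: prioritize a fix, then rerun")
```

The try/finally block is there so the injected fault is always removed, whatever the observation step finds.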
Now, our system gradually learns resiliency against a growing number of
failure scenarios. So let's say we've run our first
experiment. Now, how do we know it's working? One point
worth mentioning is that when we're stating a hypothesis,
we're looking to disprove it, or in fact to prove it.
A lot of organizations tend to run chaos tests and don't uncover faults
in their systems immediately. Now, this can happen, but it's important
to remember that chaos engineering is a practice and is meant
to be repeated. The knowledge that our systems will withstand a particular failure is
just as valuable as discovering an issue and fixing it.
Now, if one attack hasn't uncovered any unexpected behaviors, you can
increase the magnitude of your attack or change the type of failure you're
simulating until something breaks. If it doesn't, great. We've still learned something. You can also continue running the same experiment somewhere else and repeat the process. Any piece of information about the resiliency, or lack thereof, of our systems and their limits means that we've conducted a successful experiment. Now, at this
point, we've pretty much onboarded chaos engineering. So where do we go
from there? One of the most fruitful chaos engineering practices is running game days: a dedicated day for running experiments on a specific portion of your system with a team. The goal is to eventually invite a bigger part of the organization, but game days are also very good when introducing chaos engineering for the first time. They allow you to focus on preselected potential failure points and improve on
them immediately after. Now, once confident in the practice, a lot of organizations also work on automating chaos tests, such as incorporating some of them into their CI/CD pipelines.
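As a rough illustration of what that automation could look like, the sketch below wraps an experiment like the one earlier into a script a pipeline stage could call, failing the build when the hypothesis is disproved. The run_experiment helper is hypothetical and would invoke your chaos tool of choice.

```python
# Hypothetical CI/CD gate: run a chaos experiment and fail the pipeline step
# with a non-zero exit code if the hypothesis does not hold.
import sys

def run_experiment(name: str) -> bool:
    """Placeholder: trigger the named experiment via your chaos tooling and
    return True if the stated hypothesis held, False otherwise."""
    print(f"running chaos experiment: {name}")
    return True  # replace with a real call to your tooling

if __name__ == "__main__":
    if run_experiment("checkout-latency-100ms"):
        print("Hypothesis held; promoting the build.")
        sys.exit(0)
    print("Hypothesis disproved; blocking the release until the failure is mitigated.")
    sys.exit(1)
```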
But regardless, it's always a good idea to run game days on a regular basis, even if only to propagate the reliability culture across the organization. And on
that note, once you're running regular chaos tests, it's a good point to
start educating other teams or departments on how they can benefit
from it. Some organizations build out chaos engineering centers of excellence. Some will upskill a champion per department, and some
just run regular game days. But regardless of what route would make the most
sense for you, there's a multitude of use cases for
this practice. You can replicate the most common Kubernetes failures to ensure
that your platform is configured properly. You can also test your
disaster recovery. For example, run through a playbook with simulated scenarios.
Another good use case is verifying your monitoring configurations to
avoid missed alerts or prolonged outages. You can also test out different monitoring tools with chaos engineering if you're unsure which one to buy. You can also validate that your systems are configured for
resiliency as you migrate to the cloud. But chaos engineering
can also help train new engineers in a controlled environment.
Or you can test the response times for existing teams. Last but
not least, chaos testing is very good at helping mitigate dependency
failures, ensuring your systems fail gracefully, and identifying
critical services. Now this, of course, is a non-exhaustive list, but a collection of the most popular use cases I've come
across so far. And with that, we've come to an end.
So thank you very much for watching my talk on how to onboard
chaos engineering. I really wanted to showcase how straightforward
introducing chaos engineering to an organization can be,
and I hope it worked, and I hope you found it useful.
Now, there are more in-depth talks from amazing speakers this year, so please make
sure you check those out. And if you have any further questions, don't hesitate
to reach out. Thank you very much.