Optimizing incident response thanks to Chaos Engineering
Abstract
When some groups think of chaos engineering, they may think of how the principles and experiments can add resiliency, security and performance improvements to a system. I too was of that mindset until performing a focused chaos engineering experiment, which led to some helpful conclusions that were then utilized during a production incident.
This talk will demonstrate how these experiments eventually led to key discoveries in a system and how, during a production incident, the conclusions artifact was used to assist with incident response. Had the chaos experiment not been performed, incident response would have taken much longer and would have been more painful for end-users.
My goal is to help provide another answer to the question “why chaos engineering?”. Incident response is always in need of constant improvement and refinement, and chaos engineering is a tool that can most certainly help us.
Summary
-
Paul Marsicovetere is a senior cloud infrastructure engineer at Formidable in Toronto. Today he will talk about optimizing incident response thanks to chaos engineering. He is always open to chat with anyone about anything cloud computing, SRE or DevOps related.
-
Chaos engineering is thoughtful, planned experiments designed to reveal the weakness in our systems. The main attraction of chaos engineering for me is the idea of simply breaking the system and seeing what happens. Chaos engineering will continue to grow organically as we depend more and more on cloud providers.
-
A small chaos engineering experiment led to key outcomes that were later utilized during a production incident. The experiment was designed to inject failures and observe the service response so that we could gain a sense of the service's resiliency. The outcome obtained from the chaos engineering experiment was a reduction in MTTR for a production outage.
Transcript
This transcript was autogenerated.
Real-time feedback into the behavior of your distributed systems and observing changes, exceptions, and errors in real time allows you to not only experiment with confidence, but respond instantly to get things working again.
My name is Paul Marsicovetere, and today I'm going to talk about optimizing incident response thanks to chaos engineering. A little bit about myself: I'm a senior cloud infrastructure engineer at Formidable in Toronto, and I've been here since October 2020.
Formidable partners with many different companies to help build the modern
web and design solutions to complex technical problems.
Previously I was working in SRE for Benevity in Calgary
for three years, and while I'm originally from Melbourne,
Australia, I've been happily living in Canada for
over ten years now. You can get in touch with me on Twitter at
paulmastecloud, on LinkedIn, and via email.
I'm always open to chat with anyone about anything cloud computing,
SRE or DevOps related. I run a serverless
blog called the Cloud on my mind in my spare time as well.
So, as an agenda today, I'm going to talk about why it is that we would choose chaos engineering and what chaos engineering is. I'll then move on to how chaos engineering can help
in practice, and we'll be wrapping up with some lessons learned.
So what exactly is chaos engineering? The best definition of chaos engineering I've come across is from Kolton Andrus, co-founder and CTO of Gremlin, who defines chaos engineering as thoughtful, planned experiments designed to reveal the weakness in our systems. The main attraction of chaos engineering for me is the idea of simply breaking the system and seeing what happens, which is a far cry from the traditional "keep everything running at all times and at all costs" type of thinking and mentality. With that said, why would you want to use chaos engineering? Chaos engineering is a discipline that will continue to grow organically as we depend more and more on cloud providers in the industry. Because of the nature of cloud computing, we need more assurances of availability as new and unexpected outages continue to occur.
Unit and integration testing can only really take us so
far, as these methods typically verify how functions
or application code is supposed to respond. However,
chaos engineering can show you how an application actually responds
to a wide range of failures. These can be anything from removing access to a network file share to deleting database tables while monitoring how the services respond, and the range of failures you can perform is endless.
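At their core, these experiments all follow the same loop: form a hypothesis, inject a failure, observe the service, and record the outcome. As a minimal sketch only, with purely hypothetical placeholder functions rather than any particular team's tooling, that loop might look like this in Python:

```python
# A minimal sketch of the experiment loop, not any particular team's tooling.
# inject_failure() and observe_service() are hypothetical placeholders you
# would replace with a real failure injection and a real health check.
import datetime
import json


def inject_failure():
    """Hypothetical: remove access to a file share, drop a scratch table, etc."""
    return "removed access to network file share"


def observe_service():
    """Hypothetical: hit a health endpoint or run a smoke test."""
    return {"status": "degraded", "errors": ["file share unreachable"]}


def run_experiment(name):
    # Record every run; the written artifact is what pays off mid-incident.
    record = {
        "experiment": name,
        "started_at": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "injected": inject_failure(),
        "observed": observe_service(),
    }
    with open("chaos-experiment-log.jsonl", "a") as f:
        f.write(json.dumps(record) + "\n")
    return record


if __name__ == "__main__":
    print(run_experiment("file-share-removal"))
```

The important part is less the mechanics of the injection and more the habit of writing down what was injected and what was observed.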
I also find there is a freedom provided by using these chaos
engineering experiments with zero expectations of how an application
or service will respond when injecting the particular failures.
It is liberating as often we can expect applications and services
to return results in some particular fashion during
unit or integration testing. But now we can actually
say let's try and break the application to see what happens
and let the creativity flow from there. So everything I've explained is all well and good in theory, but how do chaos engineering experiments help in the real world? Well, this is where I'll describe how a small chaos engineering experiment actually led to key outcomes that were later utilized during a production incident.
I'll discuss the chaos engineering experiment setup, what was performed when we ran the experiment, and finally what happened after some real chaos occurred in our production environment. The experiment was set up as follows: take a Kubernetes cluster in a nonproduction environment, inject failures into the running pods, and record how the service responds. That's it. Nothing groundbreaking, but something that not a lot of teams would focus on when it comes to creating services. We chose nonproduction as this was our first chaos engineering experiment, and while we had confidence in the production system, there was no need to cause unintentional outages for our end users. As per
the diagram, the particular service experimented on ran pods in a parent-child architecture, where the parent was an orchestrator that ran on a Kubernetes node and would spin up child pods when requested. The child pod logs were also streamed in real time via a web page outside of Kubernetes, where clients would view the logs of their job requests. The experiment itself was designed
to inject termination into the child pods, the parent pod, and the underlying Kubernetes node during simulated scheduled tasks while the job requests ran. The failures and errors that were returned during each of these tests were recorded in a document, and the child pod logs web page was observed so that we could also understand the client experience during these failures.
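The talk doesn't show the exact tooling used to terminate the pods, but as a rough sketch, that kind of pod-termination injection could be driven by the official Kubernetes Python client. The namespace and label selector below are hypothetical stand-ins for the real service, and this should only ever be pointed at a nonproduction cluster:

```python
# Minimal sketch of pod-termination injection with the official Kubernetes
# Python client (pip install kubernetes). The namespace and label selector
# are hypothetical stand-ins; only point this at a nonproduction cluster.
import random

from kubernetes import client, config


def kill_random_child_pod(namespace="jobs-nonprod", selector="role=child"):
    config.load_kube_config()  # uses your current kubectl context
    v1 = client.CoreV1Api()
    pods = v1.list_namespaced_pod(namespace, label_selector=selector).items
    if not pods:
        print("no matching pods to terminate")
        return None
    victim = random.choice(pods)
    print(f"terminating {victim.metadata.name}")
    v1.delete_namespaced_pod(victim.metadata.name, namespace)
    return victim.metadata.name


if __name__ == "__main__":
    kill_random_child_pod()
```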
At the time of the experiment, the most interesting finding was actually the drift between the logs web page and the Kubernetes pod logs on the cluster itself, along with some small bug findings and expected failure modes that occurred in certain conditions.
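As an illustration only, and not the actual check we performed, one way to quantify that kind of drift is to pull a pod's logs straight from the cluster and compare them against an export of the logs web page; the export path here is a hypothetical placeholder:

```python
# Illustration only, not the actual check: pull a pod's logs from the cluster
# and compare them with an export of the logs web page. The export path is a
# hypothetical placeholder.
from kubernetes import client, config


def log_drift(pod_name, namespace="jobs-nonprod", page_export="page_logs.txt"):
    config.load_kube_config()
    v1 = client.CoreV1Api()
    cluster_lines = v1.read_namespaced_pod_log(pod_name, namespace).splitlines()
    with open(page_export) as f:
        page_lines = set(f.read().splitlines())
    missing = [line for line in cluster_lines if line not in page_lines]
    print(f"{len(missing)} cluster log lines not yet visible on the logs page")
    return missing
```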
All experiment events and outcomes were recorded in a document and were then discussed later at a team meeting. The service's resilience was now understood a bit better than before the experiment, when certain failure modes, like what happened when the child pods or the parent of the child pods went offline, weren't as well known. So some
weeks later, a production issue actually arose when the parent pod was in an error state and many child pods were running that could not be deleted safely without potentially taking further downstream services offline. While looking for a solution mid-incident to safely delete the parent pod without taking those child pods offline, the chaos engineering experiment document was reviewed. Thankfully, it turns out we had
documented a safe command to delete the parent pod that
would not affect the running child pods. We had recorded this
command during the chaos engineering experiment to show how failures
were injected and their outcomes. Interestingly, there was also a very unsafe command documented to delete the parent pod that would have had negative effects on the child pods and downstream services. I'm sure you can all guess which command was chosen to resolve this issue.
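The talk doesn't reveal the actual commands, but a common Kubernetes example of this safe-versus-unsafe distinction is whether a delete cascades to dependent objects. Assuming, purely for illustration, that the child pods carried ownerReferences pointing at the parent pod, the difference could look something like this:

```python
# Illustrative only; the real commands aren't shown in the talk. Assuming the
# child pods carried ownerReferences pointing at the parent pod, the safe vs.
# unsafe difference can come down to whether the delete cascades to dependents.
from kubernetes import client, config


def delete_parent_pod(name, namespace="jobs-prod", keep_children=True):
    config.load_kube_config()
    v1 = client.CoreV1Api()
    # "Orphan" leaves dependent objects running; "Background" (or "Foreground")
    # lets the garbage collector remove them along with the parent.
    policy = "Orphan" if keep_children else "Background"
    v1.delete_namespaced_pod(name, namespace, propagation_policy=policy)


# Safe for an incident like the one described (children keep running):
#   delete_parent_pod("parent-orchestrator-abc123", keep_children=True)
# Unsafe (children are garbage collected along with the parent):
#   delete_parent_pod("parent-orchestrator-abc123", keep_children=False)
```

With an orphaning delete the children keep running while the parent is dealt with; a cascading delete would have taken them, and the services downstream of them, along with it.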
So, as a result of the chaos engineering experiment and then the production outage a few weeks later, what kind of lessons were learned?
For me, what was most satisfying about the incident response
was not the decreased mean time to resolution, or MTTR,
but rather reflecting on what the chaos engineering experiment provided.
The experiment itself was not designed to help streamline
our incident response and reduce MTTR. The experiment
was designed to inject failures and observe the service response so
that we could gain a sense of the service's resiliency and document
those findings. The outcome that was obtained from the chaos engineering experiment was a reduction in MTTR for a production incident, along with some odd bugs and behaviors
that were eventually turned into fixes and feature requests.
I'm so thankful that we documented the chaos engineering experiment
and the outcomes, as without it, the production incident definitely would have gone on for longer, and we may have had to take an educated guess at the commands to resolve the issue. This is never a good place to be when you're in
mid incident. Some engineers may think of nonproduction
primarily as a place to test out feature changes
to make sure that these don't cause errors, to trial
out memory or CPU increases or decreases
to see if these improve performance, or to apply
patches before they hit production to observe any
issues. However, with chaos engineering, we can now
also think of nonproduction as a place to safely inject failures
and then take those learnings to our higher level production
environments. Capturing those experiment results can
be huge and can act as a point of reference during an unintended incident, as I've demonstrated. Further, after more confidence is built, you can run the chaos engineering experiments directly
in production to further verify the availability and resiliency
of your service. Lastly, when we create service offerings or set up new technologies like Kubernetes, we tend to think about simply getting the service to work, and that in and of itself is no small feat. It's often an underrated milestone. However, when we start to use our imagination and try to break the service in creative or esoteric ways and introduce some chaos, some very interesting results can be captured. These results and learnings can then be applied to the moneymaker, production, and can be very helpful when it matters, mid-incident. So with
that, thank you all for tuning in and listening to this talk. And thank you
to Conf42 for providing the opportunity. I look
forward to hearing from everyone about your chaos engineering experiments and
journeys in the future.