Optimizing incident response thanks to Chaos Engineering
Abstract
When some groups think of chaos engineering, they may think of how the principles and experiments can add resiliency, security and performance improvements to a system. I too was of that mindset until performing a focused chaos engineering experiment, which led to some helpful conclusions that were then utilized during a production incident.
This talk will demonstrate how these experiments eventually led to key discoveries in a system and how, during a production incident, the conclusions artifact was used to assist with incident response. Had the chaos experiment not been performed, incident response would have taken much longer and would have been more painful for end-users.
My goal is to help provide another answer to the question “why chaos engineering?”. Incident response is always in need of constant improvement and refinement, and chaos engineering is a tool that can most certainly help us.
Summary
-
Paul Marsicovetere is a senior cloud infrastructure engineer at Formidable in Toronto. Today he will talk about optimizing incident response thanks to chaos engineering. He is always open to chat with anyone about anything cloud computing, SRE or DevOps related.
-
Chaos engineering is thoughtful, planned experiments designed to reveal the weakness in our systems. The main attraction of chaos engineering for me is the idea of simply breaking the system and seeing what happens. Chaos engineering will continue to grow organically as we depend more and more on cloud providers.
-
A small chaos engineering experiment led to key outcomes that were later utilized during a production incident. The experiment was designed to inject failures and observe the service response so that we could gain a sense of the service's resiliency. The outcome obtained from the chaos engineering experiment was a reduction in MTTR for a production outage.
Transcript
This transcript was autogenerated.
Real-time feedback into the behavior of your distributed systems and observing changes, exceptions, and errors in real time allows you to not only experiment with confidence, but respond instantly to get things working again.
My name is Paul Marsicovetere, and today I'm going to talk about optimizing incident response thanks to chaos engineering. A little bit about myself: I'm a senior cloud infrastructure engineer at Formidable in Toronto, and I've been here since October 2020.
Formidable partners with many different companies to help build the modern
web and design solutions to complex technical problems.
Previously I was working in SRE for Benevity in Calgary
for three years, and while I'm originally from Melbourne,
Australia, I've been happily living in Canada for
over ten years now. You can get in touch with me on Twitter at
paulmastecloud, on LinkedIn, and via email.
I'm always open to chat with anyone about anything cloud computing,
SRE or DevOps related. I run a serverless
blog called the Cloud on my mind in my spare time as well.
So, as an agenda today, I'm going to talk about why it is that we would choose chaos engineering and what chaos engineering is. I'll then move on to how chaos engineering can help
in practice, and we'll be wrapping up with some lessons learned.
So what exactly is chaos engineering? The best definition of chaos engineering I've come across is from Kolton Andrus, co-founder and CTO of Gremlin, who defines chaos engineering as thoughtful, planned experiments designed to reveal the weakness in our systems. The main attraction of chaos engineering for me is the idea of simply breaking the system and seeing what happens, which is a far cry from the traditional "keep everything running at all times and at all costs" type of thinking and mentality. With that said, why would you want to use chaos engineering? Chaos engineering is a discipline that will continue to grow organically as we depend more and more on cloud providers in the industry. Because of the nature of cloud computing, we need more assurances of availability as new and unexpected outages continue to occur.
Unit and integration testing can only really take us so
far, as these methods typically verify how functions
or application code is supposed to respond. However,
chaos engineering can show you how an application actually responds
to a wide range of failures. These can be anything from removing access to a network file share to deleting database tables while monitoring how the services respond, and the range of failures you can perform is endless.
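At their core, these experiments all follow the same loop: form a hypothesis, inject a failure, observe the service, and record the outcome. As a minimal sketch only, with purely hypothetical placeholder functions rather than any particular team's tooling, that loop might look like this in Python:

```python
# A minimal sketch of the experiment loop, not any particular team's tooling.
# inject_failure() and observe_service() are hypothetical placeholders you
# would replace with a real failure injection and a real health check.
import datetime
import json


def inject_failure():
    """Hypothetical: remove access to a file share, drop a scratch table, etc."""
    return "removed access to network file share"


def observe_service():
    """Hypothetical: hit a health endpoint or run a smoke test."""
    return {"status": "degraded", "errors": ["file share unreachable"]}


def run_experiment(name):
    # Record every run; the written artifact is what pays off mid-incident.
    record = {
        "experiment": name,
        "started_at": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "injected": inject_failure(),
        "observed": observe_service(),
    }
    with open("chaos-experiment-log.jsonl", "a") as f:
        f.write(json.dumps(record) + "\n")
    return record


if __name__ == "__main__":
    print(run_experiment("file-share-removal"))
```

The important part is less the mechanics of the injection and more the habit of writing down what was injected and what was observed.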
I also find there is a freedom provided by using these chaos
engineering experiments with zero expectations of how an application
or service will respond when injecting the particular failures.
It is liberating as often we can expect applications and services
to return results in some particular fashion during
unit or integration testing. But now we can actually
say let's try and break the application to see what happens
and let the creativity flow from there. So everything I've explained is all well and good in theory, but how do chaos engineering experiments help in the real world? Well, this is where I'll describe how a small chaos engineering experiment actually led to key outcomes that were later utilized during a production incident.
I'll discuss the chaos engineering experiment setup, what was performed when we ran the experiment, and finally what happened after some real chaos occurred in our production environment. The experiment was set up as follows: take a Kubernetes cluster in a nonproduction environment, inject failures into the running pods, and record how the service responds. That's it. Nothing groundbreaking, but something that not a lot of teams would focus on when it comes to creating services. We chose nonproduction as this was our first chaos engineering experiment, and while we had confidence in the production system, there was no need to cause unintentional outages for our end users. As per
the diagram, the particular service experimented on ran pods in a parent-child architecture, where the parent was an orchestrator that ran on a Kubernetes node and would spin up child pods when requested. The child pod logs were also streamed in real time via a web page outside of Kubernetes, where clients would view the logs of their job requests. The experiment itself was designed
to inject termination into the child pods, the parent pod, and the underlying Kubernetes node during simulated scheduled tasks while the job requests ran. The failures and errors that were returned during each of these tests were recorded in a document, and the child pod logs web page was observed so that we could also understand the client experience during these failures.
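The talk doesn't show the exact tooling used to terminate the pods, but as a rough sketch, that kind of pod-termination injection could be driven by the official Kubernetes Python client. The namespace and label selector below are hypothetical stand-ins for the real service, and this should only ever be pointed at a nonproduction cluster:

```python
# Minimal sketch of pod-termination injection with the official Kubernetes
# Python client (pip install kubernetes). The namespace and label selector
# are hypothetical stand-ins; only point this at a nonproduction cluster.
import random

from kubernetes import client, config


def kill_random_child_pod(namespace="jobs-nonprod", selector="role=child"):
    config.load_kube_config()  # uses your current kubectl context
    v1 = client.CoreV1Api()
    pods = v1.list_namespaced_pod(namespace, label_selector=selector).items
    if not pods:
        print("no matching pods to terminate")
        return None
    victim = random.choice(pods)
    print(f"terminating {victim.metadata.name}")
    v1.delete_namespaced_pod(victim.metadata.name, namespace)
    return victim.metadata.name


if __name__ == "__main__":
    kill_random_child_pod()
```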
At the time of the experiment, the most interesting finding was actually the drift between the logs web page and the Kubernetes pod logs on the cluster itself, along with some small bug findings and expected failure modes that occurred in certain conditions.
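As an illustration only, and not the actual check we performed, one way to quantify that kind of drift is to pull a pod's logs straight from the cluster and compare them against an export of the logs web page; the export path here is a hypothetical placeholder:

```python
# Illustration only, not the actual check: pull a pod's logs from the cluster
# and compare them with an export of the logs web page. The export path is a
# hypothetical placeholder.
from kubernetes import client, config


def log_drift(pod_name, namespace="jobs-nonprod", page_export="page_logs.txt"):
    config.load_kube_config()
    v1 = client.CoreV1Api()
    cluster_lines = v1.read_namespaced_pod_log(pod_name, namespace).splitlines()
    with open(page_export) as f:
        page_lines = set(f.read().splitlines())
    missing = [line for line in cluster_lines if line not in page_lines]
    print(f"{len(missing)} cluster log lines not yet visible on the logs page")
    return missing
```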
All experiment events and outcomes were recorded in a document and were then discussed later at a team meeting. The service's resilience was now understood a bit better than before the experiment, when certain failure modes, like what happened when the child pods or the parent of the child pods went offline, weren't as well known. So some
weeks later, a production issue actually arose when the parent pod was in an error state and many child pods were running that could not be deleted safely without potentially taking further downstream services offline. While looking for a solution mid-incident to safely delete the parent pod without taking those child pods offline, the chaos engineering experiment document was reviewed. Thankfully, it turns out we had
documented a safe command to delete the parent pod that
would not affect the running child pods. We had recorded this
command during the chaos engineering experiment to show how failures
were injected and their outcomes. Interestingly, there was also a very unsafe command documented to delete the parent pod that would have had negative effects on the child pods and downstream services. I'm sure you can all guess which command was chosen to resolve this issue.
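The talk doesn't reveal the actual commands, but a common Kubernetes example of this safe-versus-unsafe distinction is whether a delete cascades to dependent objects. Assuming, purely for illustration, that the child pods carried ownerReferences pointing at the parent pod, the difference could look something like this:

```python
# Illustrative only; the real commands aren't shown in the talk. Assuming the
# child pods carried ownerReferences pointing at the parent pod, the safe vs.
# unsafe difference can come down to whether the delete cascades to dependents.
from kubernetes import client, config


def delete_parent_pod(name, namespace="jobs-prod", keep_children=True):
    config.load_kube_config()
    v1 = client.CoreV1Api()
    # "Orphan" leaves dependent objects running; "Background" (or "Foreground")
    # lets the garbage collector remove them along with the parent.
    policy = "Orphan" if keep_children else "Background"
    v1.delete_namespaced_pod(name, namespace, propagation_policy=policy)


# Safe for an incident like the one described (children keep running):
#   delete_parent_pod("parent-orchestrator-abc123", keep_children=True)
# Unsafe (children are garbage collected along with the parent):
#   delete_parent_pod("parent-orchestrator-abc123", keep_children=False)
```

With an orphaning delete the children keep running while the parent is dealt with; a cascading delete would have taken them, and the services downstream of them, along with it.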
So, as a result of the chaos engineering experiment and then the production outage a few weeks later, what kind of lessons were learned?
For me, what was most satisfying about the incident response
was not the decreased mean time to resolution, or MTTR,
but rather reflecting on what the chaos engineering experiment provided.
The experiment itself was not designed to help streamline
our incident response and reduce MTTR. The experiment
was designed to inject failures and observe the service response so
that we could gain a sense of the service's resiliency and document
those findings. The outcome that was obtained from the chaos engineering experiment was a reduction in MTTR for a production incident, along with some odd bugs and behaviors
that were eventually turned into fixes and feature requests.
I'm so thankful that we documented the chaos engineering experiment
and the outcomes, as without it, the production incident definitely would have gone on for longer, and we may have had to take an educated guess at the commands to resolve the issue. This is never a good place to be when you're in
mid incident. Some engineers may think of nonproduction
primarily as a place to test out feature changes
to make sure that these don't cause errors, to trial
out memory or CPU increases or decreases
to see if these improve performance, or to apply
patches before they hit production to observe any
issues. However, with chaos engineering, we can now
also think of nonproduction as a place to safely inject failures
and then take those learnings to our higher level production
environments. Capturing those experiment results can
be huge and can act as a point of reference during an unintended incident, as I've demonstrated. Further, after more confidence is built, you can run the chaos engineering experiments directly
in production to further verify the availability and resiliency
of your service. Lastly, when we create service offerings or set up new technologies like Kubernetes, we tend to think about simply getting the service to work, and that in and of itself is no small feat. It's often an underrated milestone. However, when we start to use our imagination and try to break the service in creative or esoteric ways and introduce some chaos, some very interesting results can be captured. These results and learnings can then be applied to the moneymaker, production, and can be very helpful when it matters, mid-incident. So with
that, thank you all for tuning in and listening to this talk. And thank you
to Conf42 for providing the opportunity. I look
forward to hearing from everyone about your chaos engineering experiments and
journeys in the future.