Transcript
Hi, welcome. My name is Carolina and I work at Nuaware.
We're a software distribution company that specializes in cloud native
technologies. We also partner with some of the newer, innovative vendors that
enter the European market. So for over a year now, I've been speaking to teams
looking into chaos engineering almost on a daily basis.
So I thought it would be a good idea to compile some of the answers,
common questions, success stories, and cautionary tales into
a short guide for anyone that's interested in onboarding the practice.
So, in other words, this talk is a brief before, during,
and after of making chaos successful. So I hope you'll find it
useful. And let's jump in. Before we dive into each step, I wanted to quickly
provide you guys with a baseline definition. In short,
I like to introduce chaos engineering as a scientific method
to uncover failures in our system before they actually happen. It allows us to reproduce the real-world scenarios that are too unpredictable for traditional testing, to make sure that it's not just our applications,
infrastructure and network that are designed for reliability,
but also processes and people. So, through controlled use of attacks,
we test how a system responds under various conditions. And apologies
for using this very topical analogy here, but it works very similarly to
a vaccine: the injection of a pathogen in order to build up immunity to it and prevent potentially devastating illnesses.
So, without further ado, let's dive into some of the considerations to
be made before onboarding chaos engineering. For any organization
that's unsure whether they will benefit from it, I always like to start from
a business angle. We all know that downtime costs a lot of money. There are
countless examples in the news almost on a daily basis that remind us of that. The cost of downtime is currently estimated to average around $100,000 an hour. So,
unsurprisingly, it's becoming a KPI for many engineering teams at this
point. So the bottom line here is, if your company's profitability is
dependent on the underlying systems, the short answer is yes,
you will benefit from chaos engineering. Now, while they understand the advantages of chaos testing, a lot of organizations believe that it's too early for them, as in, what's the point of chaos engineering
in a legacy system? So, chaos engineering, of course, emerged from
the increasing complexity of our systems, which, through the introduction of automation,
containers, orchestrators and the like, have become almost unpredictable.
Now, while chaos engineering does shine in distributed systems,
it's mostly because the experiments can be performed at a larger scale.
But at any scale, it will provide us with priceless knowledge about the behavior of our system. The immediate benefit of chaos testing legacy systems is the same as for any other system. Quoting Satya Nadella, all companies are software companies, regardless of how cutting edge their estates are. Downtime will hurt customer-facing applications running on legacy systems as much as those containerized on Kubernetes.
So everything that makes our systems more resilient will bring value.
Now, the static nature of bare metal may appear to be more
predictable, but very rarely will you come across an IT department
without any of its functions being outsourced. So regardless of how
well you know the behavior of your own system, how do you test, or in
this case, simulate outages in third party providers? I'll let you
guess the answer. But looking beyond the immediate benefits of chaos testing legacy systems, sooner or later you'll migrate to the cloud, start containerizing, or build out a Kubernetes platform, and the earlier teams familiarize themselves with chaos engineering, the smoother and more reliable those transitions will be later on. But another aspect of the "is it too early?" question is: do I have everything needed to lead a successful chaos engineering practice? A successful chaos engineering practice is characterized by
its scientific approach. So in here, I really like the analogy of
performing surgery. Would you do that with a blindfold and a hammer, or a microscope
and a scalpel? Now, our microscope here would be monitoring, observability, and as much data and as many metrics as you can collect while running those experiments. Monitoring is so important while running a chaos engineering test because, without it, we won't fully understand (a) the steady state of our application and (b) how the system starts to behave under stress. So baseline metrics to track here would be
resource consumption, state and performance. But it's also
paramount to have high visibility of the levels of service tied to your application.
So the reason for that is twofold: a breach will result in a fine, of course, but running a successful chaos experiment should also improve those over time. Here I'm talking about service level objectives, indicators, and agreements. Other very valuable metrics to collect, especially if you want to track the efficiency of chaos engineering as part of your incident response, would be mean time to detect, mean time to recovery, and mean time between failures.
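To make those three a bit more concrete, here is a minimal sketch of how they could be computed from an incident log. The timestamps and the record format are made up purely for illustration.

```python
# Hypothetical incident log: (failure started, failure detected, service restored).
from datetime import datetime
from statistics import mean

incidents = [
    (datetime(2024, 1, 3, 10, 0),   datetime(2024, 1, 3, 10, 12),  datetime(2024, 1, 3, 11, 5)),
    (datetime(2024, 2, 14, 22, 30), datetime(2024, 2, 14, 22, 35), datetime(2024, 2, 14, 23, 10)),
    (datetime(2024, 3, 2, 4, 15),   datetime(2024, 3, 2, 4, 45),   datetime(2024, 3, 2, 6, 0)),
]

# Mean time to detect: how long failures go unnoticed on average.
mttd_min = mean((detected - started).total_seconds() / 60 for started, detected, _ in incidents)
# Mean time to recovery: how long it takes to restore service on average.
mttr_min = mean((restored - started).total_seconds() / 60 for started, _, restored in incidents)
# Mean time between failures: average healthy time from one recovery to the next failure.
mtbf_h = mean((nxt[0] - prev[2]).total_seconds() / 3600 for prev, nxt in zip(incidents, incidents[1:]))

print(f"MTTD: {mttd_min:.0f} min, MTTR: {mttr_min:.0f} min, MTBF: {mtbf_h:.0f} h")
```

If the practice is working, you would expect mean time to detect and mean time to recovery to trend down over time, and mean time between failures to trend up.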
Now this brings us to a question that could be a topic of its own.
So I've tried to summarize the main points. Should we build or buy
our chaos engineering tools? So, in a nutshell, in-house built tools are configured in line with your application or environment's exact needs, so that they seamlessly integrate with your other environments,
like your monitoring and development pipelines. Now, there's also a security benefit
as all connections will stay within your company's internal network.
Now, this is dependent on the internal processes you have in place,
but building a tool often shortens the feedback cycle between development and production.
And of course, building in house gives you full control over the product roadmap,
so you can release features when your organization needs them most.
Now, the biggest argument for buying is, ironically, its cost effectiveness.
With chaos engineering being a new practice, specialized talent is very rare and expensive, and dedicating and upskilling an internal team to build a tool can also be quite costly, even if we're talking about customizing an open source project. With an enterprise-grade tool, you will be provided with regular product updates and a dedicated support team. Building a sophisticated fault injection platform takes several engineers roughly 14 to 18 months to build and maintain, whereas buying will shorten that time down to minutes. Bought tools are also often much more compatible with varied application stacks and infrastructure types, so they're easy to scale. Most built tools also won't have automation features, at least not right away.
Now, at this point, we should know why we're starting with chaos engineering
and what we'll use. So where do we start then? The best way
would be by brainstorming what should be experimented on to begin with.
Now here I highly, highly discourage going straight to production.
I'd recommend picking a precisely defined small portion of the system,
preferably in dev or test. Next step would be to state a
hypothesis. What do we think will happen if we run this attack? Then we
design an experiment. Again, one with a magnitude that's
way smaller than we think has the potential to cause any failure
or unpredictable behavior in our system. Now, after the experiment is complete, we carefully examine what our monitoring and observability tools are showing and analyze those metrics, as well as other system data. Our findings should drive how we prioritize our efforts, so we mitigate any failures we've uncovered with this experiment immediately. Then we follow our work up by running the same chaos experiment again to confirm our fix was effective. Doing this repeatedly,
starting small and fixing what we find each time, will quickly add up.
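To make that loop a bit more tangible, here is a minimal sketch of a single cycle in Python: verify the steady state, inject a small fault, observe whether the hypothesis still holds, and roll back. The health endpoint, the service name, and the inject/remove helpers are hypothetical placeholders for whatever chaos tooling you actually use.

```python
import time
import urllib.request

STEADY_STATE_URL = "http://localhost:8080/health"  # hypothetical dev/test endpoint
LATENCY_BUDGET_MS = 500                            # hypothesis: responses stay under 500 ms

def response_time_ms(url: str) -> float:
    """Time a single request in milliseconds."""
    start = time.monotonic()
    urllib.request.urlopen(url, timeout=5)
    return (time.monotonic() - start) * 1000

def steady_state_ok() -> bool:
    """Our hypothesis: the service answers within the latency budget."""
    return response_time_ms(STEADY_STATE_URL) < LATENCY_BUDGET_MS

def inject_latency(target: str, delay_ms: int) -> None:
    """Placeholder: call your chaos tool here, e.g. add network delay to one test instance."""
    print(f"injecting {delay_ms} ms of latency into {target}")

def remove_latency(target: str) -> None:
    """Placeholder: always have a way to roll the fault back."""
    print(f"removing injected latency from {target}")

# 1. Confirm the steady state before touching anything.
if not steady_state_ok():
    raise SystemExit("System is not healthy to begin with; abort the experiment.")

# 2. Inject a fault much smaller than what we think could cause a failure.
inject_latency(target="checkout-service", delay_ms=100)
try:
    # 3. Observe whether the hypothesis still holds under stress.
    hypothesis_held = steady_state_ok()
finally:
    # 4. Roll back, whatever the outcome.
    remove_latency(target="checkout-service")

print("Hypothesis held" if hypothesis_held else "Hypothesis disproved: prioritize a fix, then rerun")
```

The try/finally block is there so the injected fault is always removed, whatever the observation step finds.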
Now, our system gradually learns resiliency against a growing number of
failure scenarios. So let's say we've run our first
experiment. Now, how do we know it's working? One point
worth mentioning is that when we're stating a hypothesis,
we're looking to disprove it, or in fact to prove it.
A lot of organizations tend to run chaos tests and don't uncover faults
in their systems immediately. Now, this can happen, but it's important
to remember that chaos engineering is a practice and is meant
to be repeated. The knowledge that our systems will withstand a particular failure is
just as valuable as discovering an issue and fixing it.
Now, if one attack hasn't uncovered any unexpected behaviors, you can
increase the magnitude of your attack or change the type of failure you're
simulating until something breaks. If it doesn't, great. We've still learned something. You can also continue running the same experiment somewhere else and repeat the process. Any piece of information about the resiliency, or lack thereof, of our systems and their limits means that we've conducted a successful experiment. Now, at this
point, we've pretty much onboarded chaos engineering. So where do we go
from there? One of the most fruitful chaos engineering practices is running game days: a dedicated day for running experiments on a specific portion of your system with a team. The goal is to eventually invite a bigger part of the organization, but game days are also very good when introducing chaos engineering for the first time. They allow you to focus on preselected potential failure points and improve on
them immediately after. Now, once confident in the practice, a lot of organizations also work on automating chaos tests, such as incorporating some of them into their CI/CD pipelines.
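As a rough illustration of what that automation could look like, the sketch below wraps an experiment like the one earlier into a script a pipeline stage could call, failing the build when the hypothesis is disproved. The run_experiment helper is hypothetical and would invoke your chaos tool of choice.

```python
# Hypothetical CI/CD gate: run a chaos experiment and fail the pipeline step
# with a non-zero exit code if the hypothesis does not hold.
import sys

def run_experiment(name: str) -> bool:
    """Placeholder: trigger the named experiment via your chaos tooling and
    return True if the stated hypothesis held, False otherwise."""
    print(f"running chaos experiment: {name}")
    return True  # replace with a real call to your tooling

if __name__ == "__main__":
    if run_experiment("checkout-latency-100ms"):
        print("Hypothesis held; promoting the build.")
        sys.exit(0)
    print("Hypothesis disproved; blocking the release until the failure is mitigated.")
    sys.exit(1)
```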
But regardless, it's always a good idea to run game days on a regular basis, even if only to propagate the reliability culture across the organization. And on
that note, once you're running regular chaos tests, it's a good point to
start educating other teams or departments on how they can benefit
from it. Some organizations build out chaos engineering centers of excellence. Some will upskill a champion per department, and some
just run regular game days. But regardless of what route would make the most
sense for you, there's a multitude of use cases for
this practice. You can replicate the most common Kubernetes failures to ensure
that your platform is configured properly. You can also test your
disaster recovery. For example, run through a playbook with simulated scenarios.
Another good use case is verifying your monitoring configurations to
avoid missed alerts or prolonged outages. You can also test out different monitoring tools with chaos engineering if you're unsure which one to buy. You can also validate that your systems are configured for
resiliency as you migrate to the cloud. But chaos engineering
can also help train new engineers in a controlled environment.
Or you can test the response times for existing teams. Last but
not least, chaos testing is very good at helping mitigate dependency
failures, ensuring your systems fail gracefully, and identifying
critical services. Now this, of course, is a non-exhaustive list, but a collection of the most popular use cases I've come
across so far. And with that, we've come to an end.
So thank you very much for watching my talk on how to onboard
chaos engineering. I really wanted to showcase how straightforward
introducing chaos engineering to an organization can be,
and I hope it worked, and I hope you found it useful.
Now, there are more in-depth talks from amazing speakers this year, so please make
sure you check those out. And if you have any further questions, don't hesitate
to reach out. Thank you very much.