Transcript
Hello all. My name is Vipin Jain, and I am going to speak on a very interesting yet controversial topic: testing in production systems. Most managers and customers, if you ask them, are not very keen to allow testers to run certain tests on production systems. There are a lot of security issues, a lot of stability issues, and they don't want their system to be shut down just because some test case failed. This concept of testing on production systems is covered well under a new stream of testing called chaos testing.
I wrote this paper some time back, when the entire world was talking about vaccines and immunity due to the pandemic, and hence I used these words for my topic: vaccinate your software and build its immunity. Before we begin, who am I?
I am a son, a husband and a father. You can see in this picture my wife, my daughter and my son. At heart, I am a tester, although in my job I am now more into deliveries and providing various solutions to customers all across the world. Having said that, I am always a speaker by choice, because that has let me travel all across the world, meet people, hear their ideas, and of course bring the best of those ideas into my own work. I am a process advocate for delivering quality. For the last couple of years I have also been blogging, and I have written lots of blogs for various websites. My contacts are here: you can follow me on LinkedIn, you can follow me on Twitter, and of course you can always email me if you want to have any kind of discussion with me.
Now, this image depicts real-life chaos in a system. You can see the boss is shouting, papers are flying here and there, people are running, so there is no order as such, and it looks like the entire system has gone into disarray. This depicts chaos. What does this chaos look like in a software system? Let's see.
First, let's talk about what exactly chaos engineering is. As the image depicts, in chaos engineering, like a vaccine, we inject harm to build immunity into the system. If you go by the Google definition, it says chaos engineering, or chaos testing, is a highly disciplined approach to testing a system's integrity. How does it do that? Proactively: it simulates and identifies failures in a given environment before they lead to unplanned downtime or a negative user experience. Think of a vaccine or a flu shot, where you inject yourself with a small amount of potentially harmful foreign bodies in order to build resistance and prevent illness. Chaos engineering is a tool that we use to build such immunity in our technical systems by injecting harm. And what kind of harm do we inject into a system? It can be latency, it can be a CPU failure, or it can be a network black hole, injected in order to find and mitigate potential weaknesses. So this is what is termed chaos engineering, and as I said, like a vaccine, we inject harm into the system to build the system's immunity.
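To make the idea of injecting harm concrete, here is a minimal Python sketch of one such failure, added latency on a small fraction of calls. It is only an illustration, not tooling from the talk: the function names, the two-second delay and the five percent sample are assumptions.

```python
import random
import time

# Hypothetical illustration: wrap a downstream call and inject artificial
# latency into a small fraction of requests, the way a chaos experiment can
# simulate a slow network dependency. Values and names are assumptions.
INJECT_LATENCY = True      # master switch for the experiment
LATENCY_SECONDS = 2.0      # the "harm" we inject
AFFECTED_FRACTION = 0.05   # only about 5% of calls are affected

def call_downstream(request):
    """Placeholder for a real call to a dependency (e.g. an HTTP request)."""
    return {"status": 200, "request": request}

def call_with_chaos(request):
    # Decide whether this particular call takes part in the experiment.
    if INJECT_LATENCY and random.random() < AFFECTED_FRACTION:
        time.sleep(LATENCY_SECONDS)   # injected harm: extra latency
    return call_downstream(request)

if __name__ == "__main__":
    for i in range(10):
        start = time.time()
        call_with_chaos({"id": i})
        print(f"request {i} took {time.time() - start:.2f}s")
```

In real tooling the same idea is usually applied at the network or infrastructure level rather than inside the application code, but the principle of injecting a small, controlled amount of harm is the same.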
Let's take a brief look at the history of how chaos engineering was born and where it is right now in its current shape. It began in the year 2010, when Netflix designed what is called Chaos Monkey to test system stability by enforcing failures through the pseudo-random termination of instances and services within the Netflix architecture. When we are watching a movie on Netflix, what we don't want is any lag between two images, between the videos. The video should run as a constant, regular stream, and we should get a very smooth movie-watching experience, right? Following its migration to the cloud, Netflix's service was now reliant upon AWS. Netflix in 2010 was not able to sustain its own hardware infrastructure, and therefore it moved to AWS, which basically helped Netflix scale up. What it needed was a technique that could show how the system responded when critical components of the production service infrastructure were taken down. Intentionally causing single failures would surface any weakness in the system and then guide them towards automated solutions that handle failures in a much, much better way. So this was the original aim, and that is why, in 2010, Netflix used the first such tool, called Chaos Monkey, to test all of this.
Then it grew. In 2011, apart from Chaos Monkey, Netflix created what is called the Simian Army. It has Janitor Monkey, which identifies and disposes of unused resources. It has Chaos Kong, which drops a full AWS region. It has Conformity Monkey, which shuts down instances that are not adhering to best practices. And similarly there are Chaos Gorilla, Security Monkey, Doctor Monkey and Latency Monkey. So the entire Simian Army was born, and it added additional failure injection modes on top of Chaos Monkey that allow the testing of a more complete suite of failure states, and thus build resilience to those as well. And in 2020, just two years ago, chaos engineering became part of the AWS Well-Architected Framework. This Well-Architected Framework is currently in its eighth update, and it was recently announced that it includes chaos engineering as a requirement of a reliable system. So, as you can see, within ten years of its beginning, AWS has recognized chaos engineering as a requirement of a reliable system. And when I am talking about all these historical things and reliable systems, I am not talking about simple software. I am talking about software that is distributed across the world, heavy software, I would say, with a lot of user footfall. And in these places, chaos engineering has become very, very crucial.
This has given shape to something called the chaos testing principles, which we will of course be looking at one at a time. But just to give you a summary: the first principle states that you have to build a hypothesis around steady-state behavior. Then you have to vary real-world events, you need to run experiments on the production system, automate those experiments to run continuously, and ultimately minimize the blast radius. I will take you through all five of these to give you a real-world idea of how chaos testing happens and how it is planned.
So, let's first begin with building a hypothesis around steady-state behavior, and what exactly this means. Let's see this example before I go into any kind of detail. "Under ___ circumstances, the security team is notified." This is a simple sentence with a blank to fill in, and it will become: under security-control-violation circumstances, the security team is notified. Now, what is happening here? The blank space is filled by the variables that you determine. The advanced principles emphasize building your hypothesis around a steady-state definition. This means focusing on the way the system is expected to behave, and capturing that in a measurement. You have an idea of what can go wrong, so you have chosen the exact failure to inject. What happens next? This is an excellent thought exercise to work through as a team. By discussing the scenario, you can hypothesize on the expected outcome before running it live. What will be the impact on customers, on your service, or on your dependencies? This exercise basically answers that. So, in this particular example, the entire team sits together and says: our first hypothesis is, under ___ circumstances, the security team is notified. And when all of them agree that under security-control-violation circumstances the security team is notified, it means everyone is on the same page that the security team is notified only when there is a security control violation.
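A hypothesis like this can also be captured as a small automated check. The Python sketch below is only illustrative: the injected violation, the notification log and the field names are assumptions, not taken from any real alerting system.

```python
# Hypothetical sketch of the hypothesis from the slide:
# "Under security-control-violation circumstances, the security team is notified."
# The injected event, the notification log and the field names are assumptions.

def security_team_notified(notification_log, incident_id):
    """Steady-state measurement: did a notification for this incident appear?"""
    return any(n["incident_id"] == incident_id and n["team"] == "security"
               for n in notification_log)

def run_hypothesis(inject_violation, notification_log):
    # 1. Inject the real-world event the team agreed on.
    incident_id = inject_violation()
    # 2. Compare observed behaviour against the expected steady state.
    if security_team_notified(notification_log, incident_id):
        return "hypothesis holds: security team was notified"
    return "hypothesis failed: no notification observed"
```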
The second part of this chaos testing planning is that you have to vary real-world events. Now, what does this mean? It is an advanced principle which states that the variables in experiments should reflect real-world events. For example, in the previous example around the hypothesis, we talked about a security violation. That is a real-world event; it is not something hypothetical, right? So while it might seem obvious in hindsight that, yes, everyone is using real-world events, it is still very important to call this out, and I will give you two very good reasons why I have decided to explain it here. What happens is that people just don't focus on these things. People will say, okay, if I need to gather variables, let's pick all of them. So, again, going back to the previous hypothesis example, without focusing on the exact reason the security alarm gets raised, people just say, okay, take all the possible reasons for which this security notification has to be made. Or sometimes they say, pick any one, how does it matter, we just need to run the scenario. So, variables are often chosen for what is easy to do, rather than for what provides the most learning value. That is the first problem. And the second is that engineers always have the tendency to focus on variables that reflect their own experience rather than the user's experience. Engineers tend to stay in a practical, experimental state of mind, rather than thinking about how a real-world user will see the entire situation. So, just to sum up my two points: either people pick any variable, or the engineers decide to pick all the variables, and both of these approaches are wrong. You have to really decide what exactly you want and pick only those variables which have the most learning value. This becomes the second principle of chaos testing: vary real-world events.
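One simple way to make this selection explicit is to rank candidate events by the learning value the team assigns to them, instead of taking "all" or "any". The events and scores in this Python sketch are invented purely for illustration.

```python
# Hypothetical sketch: instead of picking "all" events or "any" event, rank
# candidate real-world events by the learning value the team assigns to them
# and run only the most valuable ones. Events and scores are invented.
candidate_events = [
    {"event": "expired security certificate",         "learning_value": 9},
    {"event": "revoked API key used by a batch job",  "learning_value": 7},
    {"event": "port scan from an internal host",      "learning_value": 4},
    {"event": "failed login burst on a test account", "learning_value": 2},
]

def select_events(events, top_n=2):
    """Pick the events the team expects to learn the most from."""
    ranked = sorted(events, key=lambda e: e["learning_value"], reverse=True)
    return ranked[:top_n]

print(select_events(candidate_events))
```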
So, what is the next one? We decide that tonight, we test in production. Wow. Is this a scene from Spartacus, or is it from 300? It is from 300, yes, King Leonidas. Experimentation teaches us about the system we are studying. We all know that exploratory testing and all the other types of testing basically ask us to experiment while testing a system. If we are experimenting in a staging environment, then we build confidence in the staging environment. If we are experimenting in a pre-production build, then we are building confidence in that pre-production build. How are we going to build confidence in the production system? We are not doing any experiments on the production system, or I would say we are not allowed to do any kind of experimentation with the production system. To the extent that the staging and production environments differ, often in ways that a human cannot predict, you are not building confidence in the environment you really care about. Which do you care about more, production or staging? Production, yes. Every user, the entire business, is based upon the interactions of real-world users with the production system. But all the experiments, all the thinking and planning that testers do about how end users would behave, actually happen on staging. So how are we going to build confidence in the production system? For this reason, the most advanced chaos engineering takes place in production. I know that it is very difficult to convince any senior manager, owner or stakeholder to allow the testers to run tests on the production system. But that is what chaos testing is all about. The most advanced chaos testing experiments always run on the production system.
That becomes our third principle: use production to run your experiments. What you see here is someone in real life testing a bulletproof vest, and how is it tested? Someone just wears it, and the other guy stands in front and shoots. Wow, that is a real production experiment. If it fails, the poor guy dies. If a chaos engineering test fails, the production system can stop working, affecting millions of users across the world. But that, again, is what chaos engineering talks about: we have to use production to run our experiments.
Now, why is it constantly pushed that chaos engineering, chaos testing, has to be run in production? It is a common belief that there is a set of bugs and a set of vulnerabilities that can be found only in the production environment, which uses live data and live traffic. This principle is not without controversy, as I have already told you. Certainly, in some fields there are regulatory requirements that preclude the possibility of affecting the production system. In some situations there are insurmountable technical barriers to running these experiments. So it is important to remember that the point of chaos engineering is to uncover the chaos inherent in a complex system, not to cause it. If we know that an experiment is going to generate an undesirable effect on the production system or its outcomes, then we should not run that experiment.
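That rule of thumb can be written down as an explicit guardrail check before any experiment starts. This is only a sketch of the idea; the fields and the one percent threshold are assumptions, not values from the talk.

```python
# Hypothetical guardrail sketch: the point is to uncover chaos, not cause it,
# so refuse to run an experiment whose harmful outcome is already known.
# The fields and the 1% threshold are assumptions, not values from the talk.
MAX_EXPECTED_ERROR_RATE = 0.01   # do not run if we expect >1% customer errors

def should_run(experiment):
    if experiment["known_to_break_production"]:
        return False                             # we already know the outcome
    if experiment["expected_error_rate"] > MAX_EXPECTED_ERROR_RATE:
        return False                             # predicted impact is too large
    return True

experiment = {
    "name": "drop 5% of packets to the payment service",
    "known_to_break_production": False,
    "expected_error_rate": 0.002,
}
print("run" if should_run(experiment) else "do not run", experiment["name"])
```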
Remember that. Now, once we have finalized an experiment, the next step is to automate your experiments and run them continuously. Fine, we all know what automation is and why automation came into existence, right? This is a quite straightforward thing, but here it is important for such systems in two distinct ways. Automation provides a means to scale out the search for vulnerabilities that could contribute to undesirable systemic outcomes, and it helps in empirically verifying our assumptions over time as the unknown parts of the system change. So it helps in covering a far larger set of experiments than humans can cover manually. In complex systems, the conditions that could possibly contribute to an incident are so numerous that they cannot be planned for. In fact, they cannot even be counted, because they are unknowable in advance, which means that humans cannot reliably search the solution space of possible contributing factors in a reasonable amount of time. Automation provides a means to scale out that search for vulnerabilities, and it also helps in empirically verifying our assumptions over time. So imagine a system where the functionality of a given component relies on some other component which is outside the scope of this test. Now, this is the case for almost all complex systems, because of the third-party controls and tools which get connected and communicate through web services. Without tight coupling between the given functionality and all its dependencies, it is entirely possible that one of the dependencies will change in such a way that it creates a vulnerability in the entire system. Continuous experimentation, provided by automation, can catch these issues and teach the primary operators how the operation of their own system is changing over time. This could be a change in performance, for example the network becoming saturated by noisy neighbors; or a change in functionality, for example the response bodies of a downstream service starting to include extra information that impacts how they are parsed; or it may just be a change in human expectations, for example the original engineers leave the team and the new engineers are not as familiar with how the system currently behaves. So automation definitely has to be here, because it will continuously check all the possible scenarios the system is made up of and give us constant feedback.
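As a rough illustration of what "automate and run continuously" can look like, here is a small Python sketch of a loop that runs a suite of experiments on a schedule and checks the steady state each time. All the names and the one-hour interval are assumptions; a real setup would use proper scheduling and monitoring tools.

```python
import time

# Hypothetical sketch of continuous, automated experimentation: run a small
# suite of experiments on a schedule and record whether the steady-state check
# still holds as the system changes underneath us. All names are invented;
# a real setup would use proper scheduling and monitoring tools.

def steady_state_ok():
    """Placeholder for a real measurement, e.g. error rate below a threshold."""
    return True

def run_experiment(name, inject, revert):
    inject()                    # introduce the failure
    passed = steady_state_ok()  # verify the hypothesis still holds
    revert()                    # always clean up, pass or fail
    return {"experiment": name, "passed": passed, "at": time.time()}

def run_suite_forever(experiments, interval_seconds=3600):
    while True:
        results = [run_experiment(**e) for e in experiments]
        print(results)          # in practice: push to monitoring / incident tooling
        time.sleep(interval_seconds)
```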
And then, finally, the last principle: you have to keep the blast radius as small as possible. What is the blast radius? When we run a chaos test on a production system, we always run it on a small part of the system, because we always have to keep in mind that if something goes wrong, the production system can stop. So rather than having a test which can affect the entire production system, the test should be as small as possible, so that the blast which happens due to that test not working correctly affects just a very small part of the system, which can be corrected, with the right fix put in place, in a very short amount of time. So you have to use a tightly orchestrated control group to compare against a variable group. Experiments can be constructed in such a way that the impact of the tested hypothesis on customer traffic is minimal. How a team goes about achieving this is highly context-sensitive to the complex system. For some systems it may mean using shadow traffic, or excluding requests that have high business impact, like transactions over a hundred dollars, or implementing automated retry logic for requests that failed in the experiment. In the case of the chaos team's work at Netflix, sampling of requests, sticky sessions and similar techniques not only limited the blast radius, they had the added benefit of strengthening signal detection. However it is achieved, this advanced principle emphasizes that in truly sophisticated implementations of chaos engineering, the potential impact of the experiments has to be limited by design.
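Here is a minimal sketch of two of those blast-radius controls: sampling only a small fraction of traffic, and excluding high-value transactions. The one percent sample rate, the hundred-dollar cutoff in code and the handler names are illustrative assumptions.

```python
import random

# Hypothetical sketch of two blast-radius controls mentioned above: only a
# small sample of requests takes part in the experiment, and requests with
# high business impact (e.g. transactions over $100) are excluded. The sample
# rate and handler names are assumptions.
SAMPLE_RATE = 0.01    # at most ~1% of traffic takes part
MAX_AMOUNT = 100.0    # never experiment on high-value transactions

def handle_normally(request):
    return {"status": 200}

def handle_with_injected_failure(request):
    # e.g. add latency or return an error for this sampled request only
    return {"status": 503}

def in_experiment(request):
    if request.get("amount", 0.0) > MAX_AMOUNT:
        return False                        # protect high-impact requests
    return random.random() < SAMPLE_RATE    # small, random sample only

def handle(request):
    if in_experiment(request):
        return handle_with_injected_failure(request)   # tiny blast radius
    return handle_normally(request)
```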
So, these are the five principles on which chaos testing is built, and which guide your chaos testing planning and final execution. Now, I will take you to a real-life scenario and how everything gets built up there. This is one of my favorite parts of this entire presentation, which is called the chaos testing execution rules. It is a simple two-by-two matrix of knowns and unknowns.
On one axis, known means the things we are aware of, and unknown means the things we are not aware of. On the other axis, known means the things we understand, and unknown means the things we don't understand. So the difference is simply between things that we are aware of and things that we understand. Quadrant number one is called known knowns: things you are aware of and that you understand. Quadrant number two is called known unknowns: things you are aware of but do not fully understand. Unknown knowns is quadrant number three, where there are things you understand but are not aware of. And finally, quadrant number four is unknown unknowns, which means things you are neither aware of nor fully understand. Don't get confused; I will take a real-world example and then try to explain all four. But as I have said, this is a very, very simple matrix, which you will see in the next slide, and the difference is just between the things that we are aware of and the things that we understand, or the things that we are not aware of and the things that we do not understand. Let's take a real-life chaos testing scenario. Now, what does this scenario mean? There is a region A in the entire system. What is present here is a primary database host with two replicas, and we use semi-synchronous replication.
We also have a pseudo primary and two pseudo replicas in a different region. So the entire region A gets replicated into a region B. The primary, replica one and replica two are the real functional ones, and then there is a pseudo primary, pseudo replica one and pseudo replica two. A simple thing: region A, region B. Everything in region A is duplicated into region B. Now, let's try to build the known/unknown matrix on this scenario.
Setting up the knowns and unknowns, the first is the known knowns. When a replica shuts down, it will be removed from the cluster. A new replica will then be cloned from the primary and added back to the cluster. So if a replica shuts down, it is removed from the cluster, and then a new replica is cloned from the primary and added back to the cluster. So again, if I go back, sorry, yes: these two replicas, if any one of them gets shut down, it is removed from the cluster, a new clone is made and put back as a new replica. That is the process. So this becomes the known knowns. What is the known unknown here? The clone will occur; we know that, as we have logs that confirm whether it succeeds or fails. So when the replica shuts down and we try to re-clone it, even if the process fails, there are logs which confirm it. So this is something we know. But what we don't know is the weekly average of the mean time it takes from experiencing a failure to adding a clone back to the cluster. Effectively, it may take a few minutes, it may take an entire hour, or it may take an entire day. This we don't know. So this becomes the known unknowns.
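That known unknown is exactly the kind of thing an experiment can measure. The Python sketch below shows the shape of such a measurement; the cluster object and its methods are hypothetical placeholders, not a real database API.

```python
import time

# Hypothetical sketch of turning that known unknown into a measurement:
# shut one replica down and time how long it takes until a new clone has
# rejoined the cluster. The cluster object and its methods are placeholders,
# not a real database API.
def measure_recovery_time(cluster, replica_id, poll_seconds=10):
    start = time.time()
    cluster.shut_down(replica_id)                # inject the failure
    while cluster.healthy_replica_count() < 2:   # wait until the clone rejoins
        time.sleep(poll_seconds)
    return time.time() - start                   # one data point for the weekly average
```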
Then let's go to the unknown knowns. What is unknown here? If we shut down the two replicas of the cluster at the same time, we don't know exactly the mean time it would take us to clone two new replicas from the existing primary. Remember, just imagine replica one and replica two both get shut down. We have never tried that, so we don't know how much time it will take for the primary to be cloned twice and for both replicas to be put back into the system. But what is known is that we have a pseudo primary and two pseudo replicas which will also be recording all the transactions that are happening here. So there is a pseudo setup, and we know about that. So this becomes the unknown knowns. And finally, the last one: unknown unknowns. What would happen if we shut down this entire cluster, the primary, replica one and replica two? What would happen if this entire thing goes down? Will the pseudo region be able to fail over effectively? We have not run this scenario yet. So if this entire system goes down, we don't know whether the pseudo region will take over gracefully and effectively. We have never tried that. Why? Because this primary, pseudo one and pseudo two are the production system; we have never tried to shut it down completely. Hence, we don't even know whether the failover to the pseudo region would happen effectively or not. So chaos testing is a highly disciplined approach to testing a system's integrity; this I have already talked about. And chaos testing relies on proactively identifying errors within a system. So, with this matrix that I have created of knowns and unknowns, things that you understand and things that you are aware of, you can create your entire hypothesis and entire plan for how to go about performing chaos testing.
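For reference, the four quadrants of this scenario can also be written down in one place. This is just an illustrative way of recording the matrix before planning experiments, using the wording of the scenario above.

```python
# The four quadrants of the database scenario, collected as plain data.
# This is just an illustrative way to write the matrix down before planning
# experiments; the wording follows the scenario described above.
knowns_and_unknowns = {
    "known_knowns": [
        "a replica that shuts down is removed from the cluster, and a new "
        "replica is cloned from the primary and added back",
    ],
    "known_unknowns": [
        "the weekly average of the mean time from a replica failure to a "
        "new clone joining the cluster",
    ],
    "unknown_knowns": [
        "if both replicas shut down at once, the pseudo primary and replicas "
        "still record every transaction, but the mean time to clone two new "
        "replicas has never been measured",
    ],
    "unknown_unknowns": [
        "whether the pseudo region fails over effectively if the whole "
        "primary cluster goes down; it has never been tried",
    ],
}
```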
Okay, we have said many things about chaos testing and its principles. But the point is: does it have any benefits, apart from finding certain bugs which can be present only in production because of the live traffic that is coming in? Or is it more beneficial not to run those scenarios, so that the system stays up and everything which is working fine on your pre-prod systems is just replicated on production? The answer is that chaos testing definitely has benefits, and it has benefits for customers, for technical people, and for the business. Here are the benefits of performing chaos testing for the customers. There is definitely an increased availability and durability of the service. It means no outage disrupts their day-to-day lives. That's number one.
Because most of the harmful effects on the production system due to the live traffic will be caught and corrected, and the system made more and more efficient, the customers will face, I won't say no downtime, but I would definitely say very rare downtime. So this is for the customers. What about the businesses? The businesses prevent big losses in revenue and maintenance cost, because the systems are up all the time. They have happier and more engaged engineers, because they don't have to spend extra weekends and long hours correcting production failures. The incident management system also improves because of the chaos testing results: whatever issues chaos testing identifies will actually be logged in the incident management system. If chaos testing is not executed on the production system, then when a real user uses that production system and finds an error, he will call the incident management team and say, hey, this software is not working fine, or, I was trying to make a payment and the payment is not going through, can you please look into it urgently? Now, because chaos testing is running on the production system, many of these issues will be uncovered during the testing phase, which means the incident management team already knows a lot of the issues that may come from real-world users, because it has already seen those things as the output of chaos testing. So this is the big advantage. And finally, for the technical teams:
insights from chaos testing mean there is a reduction in incidents, because a lot of real-world incidents are already caught and corrected. There is a reduction in the on-call burden, because, of course, happy customers call less. There is an increased understanding of the system's failure modes. Looking at every chaos finding, which I would call a chaos bug, it won't be an easy bug, because all the easy bugs are identified pre-prod by the QAs. So these bugs have to be deeper bugs, I would say complex user-journey bugs. So more bugs identified on the production system using live traffic always helps in understanding the system's failure modes better. And, as I said, the incident management system overall improves as much as possible. So these are the various benefits of performing chaos testing.
There is something called the eight fallacies of distributed systems, which I picked up from a website called Architecture Notes. Now, what exactly are these fallacies, and why have I put this slide into the chaos testing discussion? If you look at these fallacies, the first one is that the network is reliable. We all use the Internet: we all download things, we upload things, we make voice calls, we watch movies, and we never think about whether the network will go down or not. In the back of our mind there is always this idea that the network is reliable, it won't break. Similarly: latency is zero. There is only one administrator of the entire Internet. The bandwidth is infinite. The network is secure. The topology never changes. The network is homogeneous. And finally, the transport cost of moving a packet from one place to another is zero. These are the fallacies for every user who is using the Internet.
But are any of them correct? Is there only one admin? Is the bandwidth infinite? Is the network reliable? Is it secure? We know that, no, this is not the case. But to give a user the experience of his life, a smooth network usage experience, these fallacies need to be tested continuously. And that can be done if we do chaos testing on our systems. Many of these fallacies drive the design of chaos engineering experiments, for example packet loss attacks or latency attacks. A network outage can cause a range of failures for applications that severely impact customers. Applications may stall while they wait endlessly for a packet. Applications may permanently consume resources on, let's say, a Linux system. And even after a network outage has passed, applications may fail to retry the stalled operations, or they may retry too aggressively. Applications may even require a manual restart. Each of these examples needs to be tested and prepared for, and that can be done only if we are running chaos testing on our production systems.
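To show the kind of application behavior being described, here is a small Python sketch contrasting waiting forever with a timeout plus a bounded, backed-off retry. The network call is a placeholder that always times out, so the sketch only illustrates the retry policy, not a real client library.

```python
import time

# Hypothetical sketch of the failure modes described above: instead of waiting
# endlessly for a packet or retrying too aggressively, use a timeout and a
# bounded, backed-off retry. The network call is a placeholder that always
# times out here, so this only illustrates the retry policy, not a real client.
def call_over_network(timeout_seconds):
    """Placeholder for a real network call that may hang during an outage."""
    raise TimeoutError(f"no response within {timeout_seconds:.1f}s")

def call_with_retries(max_attempts=3, timeout_seconds=2.0, backoff_seconds=1.0):
    for attempt in range(1, max_attempts + 1):
        try:
            return call_over_network(timeout_seconds)
        except TimeoutError:
            if attempt == max_attempts:
                raise                               # give up instead of hanging forever
            time.sleep(backoff_seconds * attempt)   # back off, don't hammer the network
```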
So I know that it is not easy to convince managers, but try to convince them by explaining these things. Try to convince them by saying: look, every production system has certain issues, because the production system receives its data live and we don't have any control over that live data, so we may uncover some very good bugs, which makes our system more and more reliable for end users across the world. Allow us to perform chaos testing. Then plan for the chaos testing in a very proficient, very efficient way. Keep your blast radius very small, plan for automation, build your hypothesis, and follow all the other points, all five of them, that I explained earlier. Then prepare your own matrix and finally run your experiments. And then show the managers, show the stakeholders, that yes, the time and money that we have invested in chaos testing has actually made their system more and more reliable. With this, I come to the end of this talk. Thank you for hearing me patiently.