Transcript
Hello. Thank you for joining me in this talk. My name is Pablo Chacin.
I'm the chaos engineering lead for k6 at Grafana Labs.
Today, I will be talking about how organizations can build confidence in their ability to withstand failures by shifting chaos testing left.
I will be talking about why achieving reliability in modern applications is hard and how chaos engineering emerged in response to this reality, to help organizations build confidence in their ability to continue operating in the presence of failures. I will also discuss some of the obstacles organizations may face when trying to adopt chaos engineering. I will then introduce chaos testing as a foundation to facilitate the adoption of chaos engineering. Finally, I will exemplify the principles of chaos testing with a case study and will demonstrate how k6, an open source reliability testing tool, can be used for developing chaos tests.
So let's start with why achieving reliability in modern applications is hard. Modern applications follow a microservices architecture that leverages cloud-native technologies. This de facto standard has many benefits, but it also increases the complexity of these applications. This complexity is often beyond the ability of engineers to fully predict how applications will behave in production and how they will react to unexpected conditions like the failure of a dependency, network congestion, resource depletion, and others.
Under these conditions, applications frequently fail in unpredictable ways. In many cases, these failures are the consequence of misconfigured timeouts and inadequate fallback strategies that create retry storms and cascading failures. These can be considered defects in the applications. Unfortunately, traditional testing methodologies and tools do not help in finding them, mostly because they manifest in the interactions between services and are triggered by very specific conditions. Implementing tests that reproduce these conditions is difficult and time consuming, and frequently the resulting tests are themselves unreliable because they produce unpredictable results.
So how can organizations build confidence in their ability to withstand these failures? The most common way is by battle testing their applications, procedures, and people by going through incidents. By implementing well-structured post-incident reviews and adopting a blameless culture, organizations can learn from incidents and improve their ability to handle them. But incidents don't make a good learning tool. They are unpredictable. They induce stress in the people involved. You cannot decide what or when to learn, not to mention their potential impact on users and the business.
So why not induce these incidents on purpose? In this way, incident response teams can be prepared in advance and test procedures with less stress. This is a better way of learning, but there are still risks, mostly in the initial stages when the procedures are not well tested. Also, there is a limit on the incidents an organization can try before affecting their service level objectives. Another limitation is that they are preparing the organization for situations they have already experienced or can somehow predict. But as we discussed previously, modern systems sometimes fail in unexpected ways. Therefore, we need a way to experiment with these systems and learn more about how they fail. Chaos engineering is a discipline that emerged as a response to this need. It builds on the idea of experimenting on a system by injecting different types of faults to uncover systemic weaknesses, for instance, killing or overloading random compute instances, or disrupting the network traffic, and doing this in a continuous way, making faults the norm instead of the exception, with the intention that developers get used to facing them and therefore consider recovery mechanisms in the design of their applications instead of introducing them later in response to incidents.
This approach has been championed by companies such as Netflix with the iconic Chaos Monkey, but despite its promise, some obstacles still remain for chaos engineering to be adopted by most organizations. First, chaos engineering sets a high adoption bar by focusing on experimenting in production, and we cannot argue against this principle. Nothing can substitute testing the real stuff. Unfortunately, many organizations are not prepared for this. They don't have battle-tested procedures, and the teams may lack confidence in their ability to contain the effects of such experiments.
Another significant issue is the unpredictability of the results of these experiments. Killing or overloading instances, or disrupting the network, may affect multiple application components, introducing unexpected side effects and making the blast radius hard to predict. Moreover, modern infrastructure has many recovery mechanisms that may come into play and interact in complex ways. All these factors make the results of the experiments hard to predict, and this is in part the idea; this is why it is called chaos engineering, after all. But it is difficult to test recovery strategies for a specific situation if you cannot reproduce it consistently.
Finally, adopting chaos engineering tools can also be challenging. Installing and using them sometimes requires considerable knowledge of infrastructure and operations. They seem designed by and for SREs and DevOps engineers, and it makes sense, as chaos engineering has its roots in these communities. However, this complexity raises the adoption bar for most developers, who cannot be self-sufficient when using these tools. In summary, chaos engineering presupposes a level of technical proficiency and maturity that many teams and organizations do not have. So how can more organizations start building confidence in their ability to withstand failures?
Is there an alternative to bungee jumping into chaos engineering in production? We propose shifting chaos testing to the left: incorporating chaos testing as part of the regular testing practices early in the development process, submitting the application to faults that have been identified from incidents, validating whether it can handle them in an acceptable way, and implementing and testing recovery mechanisms if not. At the core of chaos testing is fault injection. Fault injection is the software testing technique of introducing errors in a system to ensure it can withstand and recover from adverse conditions. This is not a novel idea.
It has been used extensively in the development of safety critical systems.
However, it has generally been used for testing how applications handle isolated errors, such as processing corrupted data. The challenge for modern applications is to inject the complex error patterns they will experience in their interactions with other components.
Fortunately, as explained in this quote from two former members of Netflix's chaos engineering team, from the distributed systems perspective, almost all interesting availability experiments can be driven by affecting latency or the response type.
Later in this presentation we will discuss how this can be achieved using k6. But at the beginning of this presentation, we said that the main challenge of modern distributed applications was their unpredictable behavior under turbulent conditions. Therefore, it is valid to ask what benefits we can expect from testing known faults in controlled development environments. Will this really contribute to improving the reliability of the applications, or will it only create a false sense of confidence?
According to a study of failures in real-world distributed systems, 92% of the catastrophic system failures were the result of incorrect handling of non-fatal errors, and in 58% of these cases, the resulting faults could have been detected through simple testing of the error handling code. And how hard is it to improve this error handling code? According to the same study, in 35% of the cases the error handling code fell into one of three patterns: it overreacted, aborting the system under non-fatal errors; it was empty or only contained a log printing statement; or it contained expressions like FIXME or TODO in the comments.
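As a hypothetical illustration of those last two patterns, the study is describing error handling code of roughly this shape (the function name and URL here are invented for the example):

// Hypothetical example of the antipatterns described in the study:
// a handler that only logs the error, plus a leftover TODO comment.
async function getProducts(baseUrl) {
  try {
    const res = await fetch(`${baseUrl}/catalogue`);
    return await res.json();
  } catch (err) {
    // TODO: retry or return a fallback instead of swallowing the error.
    console.log(err);
  }
}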
What this study comes to tell us is that there is significant room for improvement in the reliability of complex distributed applications by just testing the error handling code, and this is what chaos testing proposes: incorporate the principles of chaos engineering early into the development process as an integral part of the testing practices, shifting the emphasis from experimentation to verification, from uncovering unknown faults to ensuring proper handling of known faults. By adopting chaos testing, teams can build confidence for moving forward to chaos experiments in production, and then use the insights obtained from these experiments and from incidents to improve their chaos tests, creating a process of continuous reliability improvement. In order to achieve its goal, chaos testing is sustained by four guiding principles.
Incremental adoption: organizations should be able to incorporate chaos testing into their existing teams and development processes in an incremental manner, starting with simple tests so they can better understand how their systems handle faults, and then building more sophisticated test cases.
Application-centric testing: developers should be able to reproduce in their tests the same fault patterns observed in their applications, using familiar terms such as latency and error rates, without having to understand the underlying infrastructure.
Chaos testing as code: switching between application testing tools and chaos testing tools will create friction in the process and, as we discussed before, it may reduce the autonomy of developers for creating chaos tests. Therefore, developers should be able to implement chaos tests using the same automation tools they are familiar with. But the adoption of chaos as code has other benefits. Developers can reuse load patterns and user journeys from their existing tests. In this way, they can ensure they are testing how the application reacts to faults in realistic use cases. Controlled chaos: faults introduced by chaos tests should be reproducible and predictable to ensure the tests are reliable. You cannot build confidence from flaky tests. Tests should also have a minimal blast radius. It should be possible to run them in shared infrastructure, for example a staging environment, with little interference between teams and services.
Let's put these principles into action using a fictional case study. This case study uses the Sock Shop, a demo application that implements an ecommerce site that allows users to browse a catalog of products and buy items from it. It follows a polyglot microservices architecture. The microservices communicate using HTTP requests, and it is deployed on Kubernetes.
The front end service works both as a backend for the web interface
and also exposes the APIs of other services
working as a kind of API gateway. Let's now
imagine an incident that affected the Sock Shop. In this incident, the catalog service database was overloaded by long-running queries. This overload caused delays in the requests of up to 100 milliseconds over the normal response time and eventually made some requests fail and return an HTTP 500 error. The catalog service team will investigate the incident to address the root cause. However, the front end team wonders how a similar incident would affect their service and the end users.
To investigate this, let's start with a load test for the front end service that will serve as a baseline. This test applies a load to the front end service, requesting products from the catalog. The front end service will make requests to the catalog service. The front end service is therefore the system under test. We will measure two metrics for the requests to the front end service: the failure rate and the 95th percentile of the response time. We will send these metrics to our Grafana dashboard for visualization, and we will implement this test using k6. k6 is an open source reliability testing tool. In k6, tests are implemented using JavaScript. k6 covers different types of testing needs such as load testing, end-to-end testing, synthetic testing, and chaos testing. It can send test results to common backends such as Prometheus. Its capabilities can be extended using a growing catalog of extensions, including Kafka, NoSQL databases, Kubernetes, SQL, and many others. Even though we
are not going into too much detail in this example,
there are some concepts that are useful for understanding the code we
will discuss next. In k6, user flows are implemented as JavaScript functions that make requests to the system under test, generally using a protocol such as HTTP if we are testing an API, or a simulated browser session if we are testing the user interface. The results of these requests are validated using checks. Scenarios describe a workload in terms of a user flow, the number of concurrent users, the rate at which the users make requests, and the duration of the load. Thresholds are used for specifying SLOs for metrics such as latency and error rates. Let's walk through the test code. Don't worry, we will just skim over the code, highlighting the most relevant parts. At the end of the presentation, you will find additional resources that explain this code in detail. The test has two main parts: a function that makes the call to the front end service and checks for errors, and a scenario that describes how much load will be applied and for how long.
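A minimal sketch of what such a baseline test could look like in k6 is shown below; the service URL, request path, load settings, and threshold values are illustrative assumptions, not the exact code used in the talk:

import http from 'k6/http';
import { check } from 'k6';

// Base URL of the front end service; an assumption for this sketch.
const BASE_URL = 'http://front-end.sock-shop.svc.cluster.local';

// User flow: request products from the catalog through the front end
// service and check the response.
export function browseCatalog() {
  const res = http.get(`${BASE_URL}/catalogue`);
  check(res, {
    'status is 200': (r) => r.status === 200,
  });
}

export const options = {
  scenarios: {
    // Workload: a constant request rate for a fixed duration.
    load: {
      executor: 'constant-arrival-rate',
      rate: 20,              // requests per second (illustrative)
      timeUnit: '1s',
      duration: '30s',
      preAllocatedVUs: 10,
      exec: 'browseCatalog',
    },
  },
  thresholds: {
    // SLOs: failure rate and 95th percentile of the response time.
    http_req_failed: ['rate<0.01'],
    http_req_duration: ['p(95)<200'],
  },
};

Running this script with k6 run reports the http_req_failed and http_req_duration metrics against the thresholds, and the same results can be forwarded to a Grafana dashboard.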
Let's run this test and check the performance metrics.
We can see the error rate was zero, that is, all requests were successful, and the latency was around 50 milliseconds. We will use these results as a baseline.
Now let's add some chaos to this test.
We will repeat the same load test, but this time, while the load is applied to the front end service, we will inject faults in the requests served by the catalog service, reproducing the pattern observed in the incident. More specifically, we will increase the latency and inject a certain amount of errors in the responses. Notice that the front end service is still the system under test. For doing so, we will be using the k6 disruptor extension. The disruptor is an extension that adds fault injection capabilities to k6. We are not going into the technical details about how this extension works. For now, it is sufficient to say that it works by installing an agent into the target of the chaos test, for example a group of Kubernetes pods. These agents have the ability to inject different types of faults, such as protocol-level errors, and this is done from the test code, as we will see next, without any external tool or setup. At the end of the presentation, you will find resources for exploring this extension in detail, including its architecture.
Let's see how this works in the code. We add a function that injects faults in a service. This function defines a fault in terms of a latency that will be added to each request and the rate of requests that will return a given error; in this case, 10% of requests will return a 500 error. Then it selects the catalog service as the target for the fault injection. This instructs the disruptor to install the agents in the pods that back the service. Finally, it injects the fault for a given period of time, in this case the total duration of the test. Then we add a scenario that invokes this function at a given point during the execution of the test. In this case, we will inject the fault from the beginning of the test, and this is all that we need.
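Roughly, that fault injection part could look like the following sketch using the xk6-disruptor extension (it assumes a k6 binary built with the extension; the service and namespace names, field names, and durations are assumptions based on the extension's documented API, so check the extension documentation for the exact syntax):

import { ServiceDisruptor } from 'k6/x/disruptor';

// Fault pattern from the incident: extra latency on every request, plus
// a fraction of requests answered with an HTTP 500 error.
const fault = {
  averageDelay: '100ms',
  errorRate: 0.1,
  errorCode: 500,
};

export function injectFaults() {
  // Select the catalog service as the target. This installs the disruptor
  // agent in the pods that back the service (names are assumptions).
  const disruptor = new ServiceDisruptor('catalogue', 'sock-shop');

  // Inject the fault for the total duration of the load test.
  disruptor.injectHTTPFaults(fault, '30s');
}

export const options = {
  scenarios: {
    // Additional scenario that triggers the fault injection from the
    // beginning of the test, while the load scenario keeps running.
    disrupt: {
      executor: 'shared-iterations',
      iterations: 1,
      vus: 1,
      exec: 'injectFaults',
      startTime: '0s',
    },
    // ...plus the load scenario from the baseline test, unchanged.
  },
};

In this sketch the duration passed to injectHTTPFaults matches the duration of the load scenario, so the fault stays active for the whole test.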
Let's run this test. We can see that the latency
reflects the additional 100 milliseconds that we injected. We can also observe that now we have an error rate of almost 12%, slightly over the 10% that we defined in the fault description. It's important to remark that we are injecting the faults into the catalog service, but we are measuring the error rate at the front end service, so we can see that the front end service is not handling the errors in the requests to the catalog service. Apparently there are no retries of failed requests. I wouldn't be surprised if we found a TODO comment in the error handling code.
How does this test help the front end team? First, by uncovering the lack of proper error handling logic, as we just saw, and then by enabling them to validate different solutions to the unacceptable error rate they observed, for example, introducing retries. They can also easily modify the test to reflect other situations, like higher error rates, in order to fine-tune the solution and avoid issues such as retry storms, as sketched below.
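For instance, a harsher variation of the hypothetical fault definition shown earlier only needs a different error rate (the field names again follow the assumed xk6-disruptor fault format):

// Harsher variation: same added latency, but 30% of the requests to the
// catalog service now fail with an HTTP 500 error.
const harsherFault = {
  averageDelay: '100ms',
  errorRate: 0.3,
  errorCode: 500,
};

Rerunning the test with this fault shows whether the retry logic keeps the error rate observed at the front end within the threshold without amplifying the load on the catalog service.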
This brief example shows the principles of chaos testing in action. A load or functional test can be reused to test the system under turbulent conditions. These conditions are defined in terms that are familiar to developers: latency and error rate. The test has a controlled effect on the target system. The test is repeatable and the results are predictable. The fault injection is coordinated from the test code. The fault injection does not add any operational complexity; there is no need to install any additional components or define additional pipelines for triggering the fault injection. To conclude,
let me make some final remarks. We firmly
believe that the ability to operate reliably shouldn't
be a privilege of the technology elite.
Chaos engineering can be democratized by promoting the adoption
of chaos testing, but to be effective,
chaos testing must be adapted to the existing testing practices. At Grafana k6, we are committed to making this possible,
making chaos engineering practices accessible to a broad spectrum of
organizations by building a solid foundation from
which they can progress toward more reliable applications.
Thank you very much for attending. I hope you have found this
presentation useful. If you want to learn more about
chaos testing using k6, you may find these resources useful. You will find an in-depth walkthrough of the example we saw today and more technical details about the disruptor extension.