Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hello everyone, good morning, good afternoon and good evening.
Happy to be here at the Conf42
Chaos Engineering conference.
Before we delve into the topic of continuous resilience,
a bit about me: I am Uma Mukara,
head of Chaos Engineering at Harness. I am
also a maintainer and co-founder of the
LitmusChaos CNCF project, which is at the incubating stage.
At Harness, I've been working with customers on chaos engineering
across a wide number of use cases,
and in that process I have learned a little bit
about how chaos engineering is being adopted.
What are the use cases that are
more prominent, more appealing? So here is
an opportunity for me to talk about what I learned in
the last few years of trying to push chaos
engineering to more practical environments in the cloud native
space. Innovation is a continuous process in
software, right? And we're all trying to
innovate something in software either
to improve governance,
quality, efficiency,
control and reliability, et cetera.
So in this specific talk, let's talk about how
we can innovate more in the space of reliability or
resilience.
So before we actually reach the topic of innovating
in the space of resilience, or in the area of resilience,
let's talk about the software development cost
that applies to software developers overall.
In the world today, we have about 27 million software developers
costing $100k on average on an annual basis.
That leads to a total spend of about $2.7
trillion. That's a huge amount of money being
spent on software development. If such a
big amount of money is being spent, what are
the software developers doing? In this poll you can
see that more than 50% of software developers indicate
that they actually spend less than 3 hours a day writing code.
Where are they spending the remaining time?
They could be spending the time trying to
build the environments, or the deployment environments,
or debugging existing software, or the software that they
just wrote, or production issues, et cetera.
So this all leads to more toil
for software developers. Is there a way
we can actually reduce this toil by
as much as about 50%?
That could actually free
up more money
for the actual software development.
And that's a huge spend, right? So this is the overall market
space for developers, but you can apply the same thing
to your own organization. You're spending a lot of money
on developers, but developers are actually not spending enough
time and effort writing code, right?
That's an opportunity to reduce that toil and
increase innovation in different areas. The opportunity
is to innovate, to increase developer productivity and
hence save the cost. And you can put that cost back into
more development and ship more products,
more code, or faster code, et cetera.
So let's see how it applies
to resilience as a use case.
You can actually reduce
the developer toil. This toil comprises
build time, deployment time, or debug time.
You need to basically reduce
this toil, and that's where you can actually improve
productivity. In this specific topic we are going to
look at, where are these developers
spending their time in debugging? Why are they doing that?
And how can we actually reduce that amount of debug
time? And eventually that leads to more
time for innovation. And because you are reducing
the debug time in the area of resilience, that also improves
the resilience of the products.
So why are they spending time in debugging?
In other words, why are the bugs
being introduced? It could
be plain oversight. Developers are humans, so there
is a possibility that something is overlooked.
Even the smartest developers could overlook
some of the cases and then introduce bugs or leak bugs
to the right, but that cannot be entirely avoided.
The more common pattern
that can be observed is that a lot of dependencies
in the practical world are not tested, and they've been
released to the right anyway. It's also possible
that there's a lot of developer churn,
a lot of hiring. In that case,
you are not the guy who has written the product
from the beginning and the product has scaled up so much and
you don't necessarily understand the entire architecture of the product,
but you are rolling out some of the new features, right? So that leads to
a bit of lack of understanding of the product architecture.
So some of the intricacies are not
well understood, and then design
bugs can trickle in, or code bugs, right?
Even if you take care of all that,
you assume that the product will run in a certain environment,
but the environment can be totally different or
it can keep changing, and your code may
not work as expected
in that environment. So these are the reasons
why you end up as a developer spending more time.
And these reasons
become more common in cloud native.
But before that, let's look at the cost of
debugging. You can end
up costing the organization much more
if you actually find these bugs in
production and start fixing them. The cost of
fixing bugs in production is almost more
than ten times what you incur if you
debug and fix them in QA or within the code,
right? It's a well-known fact, nothing new.
It's always good to find the bugs before
they go into the production, right? So that's another way to
look at this. But the reasons for introducing
these issues, or
overlooking these causes,
are becoming more and more common in the case of cloud native
development, right? So in cloud native, two things
are happening. By default, you are expected
to ship things faster because the whole ecosystem is
supporting faster shipment of your builds:
the small amount of code that each
developer has to look at, and the well defined boundaries around
containers and the entire
ecosystem of cloud native. So the
pipelines are better, and the tools surrounding shipment,
like GitOps tooling, are helping you to
ship things faster. So added
to that, containers are helping
developers to wrap up features faster
because they are microservices. You need to look at things objectively,
only to a limited scope within a
container and surrounded by APIs. So you are able
to look at things very objectively,
finish the coding and ship things faster.
So because you are doing them very
fast, the chances of
not doing the dependency testing or chances of not
understanding the product very well are high.
And that could actually cause a lot of
issues. If the faults
are happening in infrastructure, the impact of
an outage can be very high, compared to
just a fault happening within your container,
where the impact of the resulting outage
can be much lower, right?
So the summary here is you
are testing the code as much as possible and then
shipping as fast as you can, and you may
or may not be looking at the entire set of new
testing that is needed, right? So it's
possible that deep dependencies, the faults happening in the deep dependencies
are not tested. So typically
what happens in the case of cloud native shipments
in such cases is that the end service
is impacted, right? Developers then jump
in to debug, and finally
they come to discover that there's a dependent component
or a service that
incurs a fault within it, or multiple faults,
and because of that, a given service is affected.
So that's kind of a weakness within a given service.
It's not resilient enough and you find it and fix it, right?
So this is typically a case
of increased cost, and there's a good
opportunity that you can find such issues much earlier
and avoid the cost.
The kind of tests that you
need to be doing before you ship, to avoid
such cost, are ones where you
assume that faults can happen within your code, or
within the APIs that your code or application is consuming,
or in other services such as databases,
message queues, or other network services. There are faults
that could be happening, and your application has to be tested for
such faults. And of course there are the infrastructure faults
that are pretty common; infrastructure faults
can happen within Kubernetes and your code has to
be resilient to such faults,
right? This is the set of dependent fault testing
that you need to be aware of and test,
right? So what this really means is cloud
native developers need to do chaos testing, right?
This is exactly what chaos testing typically means: some fault can happen
in my dependent component or infrastructure,
and my service, the one depending on my code,
needs to be resilient enough, right? So chaos
testing is needed by the nature of the
cloud native architecture itself
to achieve high resilience.
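To make that concrete, here is a minimal sketch of what such a dependent-fault chaos test could look like in Python, assuming a hypothetical `chaosctl` CLI and an illustrative frontend endpoint; it is not any specific tool's API, just the shape of the idea: fault a dependency and verify your service's steady state.

```python
# A minimal sketch of a dependent-fault chaos test, assuming a hypothetical
# `chaosctl` CLI and an illustrative frontend endpoint; not any specific
# tool's API, just the shape of the idea.
import contextlib
import subprocess
import requests

FRONTEND_URL = "http://frontend.demo.svc:8080/healthz"  # assumed endpoint

@contextlib.contextmanager
def inject_fault(target: str, fault: str, duration_s: int):
    """Apply a fault to a dependency for the duration of the block.
    Shells out to a hypothetical `chaosctl` CLI; swap in whatever your
    chaos tool actually provides."""
    subprocess.run(["chaosctl", "inject", fault, "--target", target,
                    "--duration", str(duration_s)], check=True)
    try:
        yield
    finally:
        subprocess.run(["chaosctl", "revert", fault, "--target", target], check=True)

def test_frontend_survives_database_outage():
    # Steady state hypothesis: while the database dependency is faulted,
    # the frontend stays available and responds within one second.
    with inject_fault(target="postgres", fault="pod-delete", duration_s=60):
        resp = requests.get(FRONTEND_URL, timeout=2)
        assert resp.status_code == 200
        assert resp.elapsed.total_seconds() < 1.0
```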
And we are basically saying that developers end up spending
a lot of time debugging and that's not good for
developer productivity. So if you need to do chaos testing,
let's actually see what the typical definition of chaos
engineering is. Chaos engineering has
been there for quite some time. We all know that, and we
all are kind of told that chaos engineering
is about introducing controlled faults to
reduce expensive outages. So if you are
basically reducing the outages,
you are looking at doing chaos testing in production,
right? And that really comes with a high barrier.
And this is one reason why, even though
chaos engineering has been there for quite some time, the adoption of
chaos engineering has been limited, though it's increasing
rapidly in recent years.
The typical understanding of chaos testing
or chaos engineering is that it applies to production, and that understanding is
changing very fast. And that's exactly what we are talking about here.
And traditional chaos engineering is also about introducing
game days: you try to find the
right champions within the organization who are open to
doing these game days, find any resilience
issues, and then keep doing more game days, right?
That's typically how chaos
engineering has been practiced. It's been a reactive approach:
either some major incidents have happened
and, as a solution, you're trying to adopt chaos engineering.
Sometimes it can be driven by regulations as well,
especially in the case of DR, where chaos engineering comes
into the picture in the banking sector and so on.
But these are the typical needs
or patterns that drove chaos
engineering until a couple of years ago.
But modern chaos engineering is really driven not
necessarily by the need to reduce outages,
but also by the need to increase developer productivity,
right? If my developers are spending a lot
of time debugging production issues,
that's a major loss, and you need to avoid that.
So how can I do that? Use chaos testing.
And similarly for the QA teams: QA
teams are coming in and looking for more
ways to test the many components that are coming in the form of
microservices. Earlier, it was easy enough: you were
getting a monolith application with very clear boundaries,
and you could write a better, more complete set of
test cases. But now microservices can
pose a challenge for QA teams. There's so many
containers and they're coming in so fast.
So how do I actually make sure that the
quality is right in many aspects and that
can be achieved through chaos testing.
Right. And it's also possible that the whole
big monolith or traditional application that's working well,
which is business critical in nature,
is being moved to cloud native. How do you ensure
that everything works fine on the other side, on cloud native?
One way to ensure that is by employing
chaos engineering practices. So the need for chaos engineering
in the modern times is really defined or
driven by these needs, rather than just, hey,
I incurred some outages, let's go fix them. Right.
So while that is still true, there are more drivers
that are driving adoption of chaos engineering.
So these needs are leading
to a new concept called continuous resilience.
So what is continuous resilience? It's basically verifying
the resilience of your code or component through automated
chaos testing, and you do that continuously.
Chaos engineering,
done in an automated way across your DevOps
spectrum, is how you achieve continuous resilience;
that approach is called the continuous resilience
approach. So just to summarize:
you do chaos engineering in dev,
QA, pre-prod and prod,
continuously, all the time, involving all the personas.
And that leads to continuous resilience as a concept.
So what are the typical metrics that you look for in the
continuous resilience model? They are
the resilience score and the resilience coverage.
You always measure the resilience
score of a given chaos experiment,
or of a component or a service itself.
It can be defined as the
average success of the steady state checks of
whatever you are measuring: the steady state checks
that are done during a given experiment,
or for a given component or a given service.
This is the resilience score. Typically it is
out of 100, or a percentage.
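As a rough illustration, here is a minimal sketch of that calculation in Python, assuming each steady state check simply reports pass or fail; the function name and the example results are made up for illustration.

```python
# A minimal sketch of the resilience score described above, assuming each
# steady state check (probe) reports a simple pass/fail.
def resilience_score(probe_results: list[bool]) -> float:
    """Average success of the steady state checks, expressed out of 100."""
    if not probe_results:
        return 0.0
    return 100.0 * sum(probe_results) / len(probe_results)

# Example: three probes ran during an experiment and one of them failed.
print(resilience_score([True, True, False]))  # ~66.7
```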
And the more important metric in continuous resilience is
resilience coverage. Because
you are looking at the whole spectrum, you can
come up with a total number of possible chaos
tests. Basically you can compute them
by asking: what are the total resources that my service
is composed of? And you can do multiple combinations of that.
The resources can be infrastructure resources, API resources,
network resources, or the resources that make up the
service itself, like container resources, et cetera.
Basically you can come up with a
large number of tests that are possible, and then you
start introducing
such chaos tests into your pipelines, and those
are the ones that you actually cover. So you have
a very clear way of measuring which
chaos tests you have done out of the possible chaos
tests, and that leads to a coverage. Think of this as
code coverage in the traditional developer world,
applied to resilience and chaos experiments.
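Here is a minimal sketch of that bookkeeping, assuming an illustrative resource list, fault list, and covered set rather than anything taken from a particular tool:

```python
# A minimal sketch of resilience coverage: possible chaos tests are
# combinations of the resources a service is made of and the faults
# that could apply to them. All lists here are illustrative assumptions.
from itertools import product

resources = ["pod", "node", "http-api", "network", "dns"]
faults = ["delete", "cpu-hog", "memory-hog", "latency", "packet-loss"]

possible_tests = {f"{resource}/{fault}" for resource, fault in product(resources, faults)}

# Chaos tests actually wired into pipelines so far (hypothetical subset).
covered_tests = {"pod/delete", "pod/cpu-hog", "http-api/latency"}

coverage = 100.0 * len(covered_tests & possible_tests) / len(possible_tests)
print(f"resilience coverage: {coverage:.1f}%")  # 3 of 25 tests -> 12.0%
```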
Many people are calling this approach "let's do chaos
in pipelines." That's almost the same, right? Except that
continuous resilience does not limit
you just to pipelines; you can automate the
chaos tests on the production side as
your maturity improves. So it's more than a pipelines-only
approach. So what are the general differences
between the traditional chaos engineering approach and
the pipelines or continuous
resilience approach? Traditionally,
in the game days model, you are executing on demand
with a lot of preparation. You need to assign
certain dates, take permissions,
and then execute the tests. Versus pipelines,
where you are executing continuously without much
thought or preparation. The tests are supposed to pass,
and if they don't, it doesn't hurt
so much; it's actually a good thing that you can go and look
at it whenever one fails. Maybe it just slows down the delivery of
your builds, but that's
okay. So this leads to greater adoption
overall. And game days are targeted
towards SREs. SREs are the ones that plan and
budget this entire game
days model. But in the chaos
pipelines model, all personas are involved.
Shift left is possible, but shift right is also possible
in this approach. So that's
another major difference. As you can assume,
in the chaos game days model the barrier
to adoption is very high. The barrier
for pipelines is much lower, because you're doing it in a non-prod environment,
you have the bandwidth that is
associated with development, and developers are
the ones who are writing the tests. So it becomes kind
of natural to adopt
such a model. So when it comes to writing
the chaos experiments themselves,
traditionally it's been a challenge because the
code itself is changing. And if
SREs are the ones writing them, such
bandwidth is usually not budgeted or planned,
and SREs are typically pulled into
other pressing needs,
such as attending to incidents and the corresponding action tracking,
et cetera. So it may not always be possible to
be proactive in writing a lot of chaos experiments,
right? And in general,
because you are not measuring something like resilience coverage
and you are just doing the game day model,
it's not very clear how many more chaos experiments
I need to develop before I can say that I have covered
all my resilience issues.
But in the continuous resilience approach, these are exactly the opposite.
You are basically relying on each other's help in
a team sport model and you're extending
your regular tests. Developers would be writing integration
tests, and now you add some more tests that
introduce faults in the dependent components.
Those tests can be reused by QA, and QA will add
a few more tests; those can be reused either by
developers or by SREs, et cetera.
So basically there is increased sharing of the tests
in central repositories, or what
you generally call chaos hubs.
You tend to manage these
chaos experiments as code in git, and that increases the
adoption. And with resilience coverage as the
concept, you know exactly how much more coverage
you need, or how many more tests you need to
write, et cetera. So that also helps
from a planning perspective. So that's
really a kind of new pattern
for thinking about how teams need to adopt chaos engineering.
That's what I've been observing in the last few years,
and also at Harness, where we are seeing
good growth in the adoption of chaos
for the purpose of both developer productivity
as well as increasing resilience as a metric
of innovation. So let's take a look at
a couple of demos. One on how
you can inject a chaos experiment into a
pipeline and probably cause a rollback depending on
the resilience score that is achieved. And the other one,
a quick demo about how we at Harness,
the development teams, are using chaos experiments in the pipeline
a little bit more liberally before the
code is shipped to a preprod environment or a QA
environment.
In this demo, we're going to take a look at how
to achieve continuous resilience using chaos experiments
with a sample chaos engineering tool.
In this case, we are using Harness Chaos Engineering,
but you can use any other combination of a pipeline tool and a chaos engineering
tool together to achieve the same
continuous resilience. So let's start.
So I have the chaos engineering tool
from Harness: Harness Chaos Engineering.
This has the concept of chaos experiments,
which are stored in chaos hubs.
These chaos hubs are generally a way to
share the experiments across teams, because in continuous
resilience, you are talking about multiple teams across
different pipeline stages. Either it's
dev or QA or preprod or prod.
So everyone will be using this tool and they will have
access to either common chaos hubs, or they'll be maintaining
their own chaos hubs. A chaos hub maintains the
chaos faults that are developed and the chaos
experiments that are created,
which in turn use the chaos faults.
A chaos fault in this case
is nothing but the actual chaos injection,
plus the addition of certain resilience
probes to check a steady state hypothesis. So let me
show how, in this Harness Chaos Engineering
tool, a particular chaos experiment
is constructed. So let me go
here. If I take a look at a
given chaos experiment, it has multiple chaos
faults, either in series or in parallel.
A given chaos fault usually specifies:
where are you injecting this fault in your
target application, and what
are the characteristics of the chaos itself?
How long you want to run it, how many times you need
to repeat the chaos, et cetera. And then there is the probe;
different tools call probes by different names.
This is basically a way
to check your steady state while this chaos injection
is going on. So in the case of Harness Chaos
Engineering, we use probes to define the resilience of
a given experiment, or of
a given service, or of a given module or
component. You can add any number of probes
to a given fault.
So that way you're not just depending on one
probe to check the resilience; you're checking a whole
lot of things while you inject chaos at
any point in time into a given resource, or against
a given resource. In the case of this
particular chaos experiment, for example,
you can go and see that it has resulted in 100%
resilience. The chaos that was injected was a CPU hog
against a given pod, and while that CPU
hog was injected, there were three probes that
checked whether the pods were okay, whether some other
service was available or not via its HTTP
endpoint, and it was also checking
a completely different service
for the latency response from
the front end web service. So you should generally
look at the larger picture while gauging
the steady state hypothesis while injecting a chaos
fault. Because everything passed and there's only one fault,
you will see the resilience score as 100%.
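For intuition, here is a minimal sketch of running a few such probes while a fault such as a CPU hog is active; the probe names, service URLs and latency thresholds are purely illustrative assumptions, not taken from the demo environment.

```python
# A minimal sketch of running several steady state probes while a fault
# (for example, a CPU hog) is active. The probe names, URLs and latency
# thresholds are assumptions made up for illustration.
import requests

PROBES = [
    # (name, url, max acceptable latency in seconds)
    ("frontend-latency", "http://frontend.demo.svc:8080/", 0.5),
    ("orders-available", "http://orders.demo.svc:8080/healthz", 2.0),
    ("cart-available", "http://cart.demo.svc:8080/healthz", 2.0),
]

def run_probes() -> list[bool]:
    """Return pass/fail for each steady state probe."""
    results = []
    for name, url, max_latency in PROBES:
        try:
            resp = requests.get(url, timeout=max_latency)
            ok = resp.status_code == 200 and resp.elapsed.total_seconds() <= max_latency
        except requests.RequestException:
            ok = False
        print(f"probe {name}: {'pass' if ok else 'fail'}")
        results.append(ok)
    return results
```

Feeding that pass/fail list into a resilience score calculation like the one sketched earlier yields the 100% result described here when every probe passes.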
So this is how you would generally go and score
the resilience of a given chaos experiment.
These chaos experiments should generally be saved
back into a chaos hub, or you should be able to launch
these experiments from a given chaos hub, et cetera.
And in general, the chaos tool should have
the ability to do some access control. For example,
in the case of Harness Chaos Engineering,
you will have default access control over
who can access the centralized library
of chaos hubs and who can execute a given chaos experiment.
Chaos infrastructure is your target agent area.
And if there are game days, who can run these
game days, and typically nobody should have the ability to
remove the reports of game days. So there's no delete option for anyone.
Right? So with this kind of access control and
then the capability of chaos hubs and then the
probes, you will be able to score the
resilience against a given chaos experiment for
a given resource and also be able to share
such developed chaos experiments across multiple different
teams. And now let's go and take a look
at how you can inject these chaos
experiments into pipelines.
Or let's look at it the other way: how are you supposed to
achieve continuous resilience during
the deployment stage? For example here,
this pipeline is meant for deploying a given service.
That means somebody has kicked off a
deployment of a given service, and the deployment
could be a complicated process or a complex job in itself.
Once this is deployed, we should
in general add more tests. This deployment is
supposed to involve some functional tests as well, but in addition
to that, you can add more
chaos tests. For example here,
each step in a Harness pipeline can be
a chaos experiment. If
you go and look at this chaos experiment step,
it's integrated well enough that you can browse, in your
same workspace, which
chaos experiments are available. So I'm just going to
go and select a certain chaos experiment
here, and then you can set
the expected resilience score against that.
In case that resilience score is
not met, you can go and implement
some failure strategy. Either go and observe, take some manual
intervention, roll back the entire stage,
et cetera, et cetera. So for example,
in this actual case, we have identified
the failure strategy or configured the failure strategy
as a rollback. And typically you can
go and see the past executions of
this pipeline. Let's
say that this has a failed instance
of a pipeline. You could go and see that this
pipeline was deploying
this service, then the chaos
experiment executed, and the expected resilience
was not met. If you go and take a look
at the resilience scores or
probe details, you see that one particular probe
has failed. In this case,
when CPU was increased,
the pod was good, and the service
where the high CPU injection happened continued
to be available, but some other service showed a latency
issue, so that was not good. That
eventually caused it to fail, and
the pipeline was rolled back.
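As a rough sketch of that failure strategy, assuming hypothetical deploy and rollback helpers and a placeholder that runs the chaos experiment and returns its resilience score (0 to 100), the gate logic could look like this:

```python
# A rough sketch of the rollback failure strategy described above, assuming
# hypothetical deploy/rollback helpers and a placeholder experiment runner.
# This mirrors the idea, not any specific pipeline tool's API.
EXPECTED_RESILIENCE = 80.0  # assumed threshold

def deploy(service: str) -> None:
    ...  # placeholder: kick off the deployment of the service

def rollback(service: str) -> None:
    ...  # placeholder: roll the stage back to the previous version

def run_chaos_experiment(name: str) -> float:
    ...  # placeholder: run the experiment in your chaos tool, return its score
    return 0.0

def deploy_with_chaos_gate(service: str, experiment: str) -> bool:
    deploy(service)
    score = run_chaos_experiment(experiment)
    if score < EXPECTED_RESILIENCE:
        # Failure strategy: the expected resilience was not met, so roll back.
        rollback(service)
        return False
    return True
```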
So that is an example of how you could
do it, how you could add more and more chaos experiments
into pipelines and then stop leaking
resilience bugs to the right. Primarily
what we are trying to say here is that we
should encourage the idea of injecting
chaos experiments into the pipelines and sharing these
chaos experiments, which someone has developed,
most likely developers in this case or QA team members,
across teams. In any large
deployment or development system,
there are a lot of common services and the teams are distributed,
there are a lot of processes involved. Just like you are
sharing the test cases, common test cases, you could share the
chaos tests as well. When you do that, it becomes a
practice. The practice of injecting
chaos experiments whenever you test something
becomes common and increases the
adoption of chaos engineering within the organization,
across teams, and it eventually leads to more
stability and fewer resilience issues or bugs.
So that's a quick way of looking at how
you can use a chaos experimentation tool
and use chaos experiments to
inject chaos in pipelines and
verify the resilience before changes actually go to the
right, or go to the next stage.
Now
let's look at another demo
for continuous resilience, where you can inject
multiple chaos experiments and
use the resilience score to decide whether to
move forward or not.
So in
this demo, we have a pipeline
which is being used internally at
Harness for one of the modules.
Let's take a look at this particular
pipeline. What
we have done here is that the existing pipeline is
not touched at all; it is
kept as is. Maybe the maintainer of this particular
stage will continue to
focus on the regular deployment and the functional tests
associated with it. And once the functional
tests are completed after deployment,
you can add more chaos tests in
separate stages. In fact, in this particular example,
there are two stages: one to verify the
code changes related to the chaos module
(the CE module), and then another stage
that is related to the platform
module itself. You can put all of them into
a group; here it's called a step group.
So you can just dedicate one single separate
stage to group all the chaos experiments together,
and you can set them up to run in parallel if
needed, depending on your use case. Individually,
each chaos experiment will return some resilience score,
and you can take all the resilience
scores into account and decide at the end whether
you want to continue or take some action such as a rollback.
In this case, the expected
resilience was all good,
so nothing needed to be done, and the pipeline proceeded.
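As a minimal sketch of that decision, assuming hypothetical experiment names, a placeholder runner and an illustrative threshold, the aggregation could be as simple as:

```python
# A minimal sketch of a step group of chaos experiments run in parallel,
# with the decision taken on the aggregated resilience scores. Experiment
# names, the threshold, and the runner are illustrative assumptions.
from concurrent.futures import ThreadPoolExecutor

EXPECTED_RESILIENCE = 80.0  # assumed threshold
EXPERIMENTS = ["ce-module-pod-cpu-hog", "platform-pod-delete"]  # hypothetical

def run_chaos_experiment(name: str) -> float:
    ...  # placeholder: trigger the experiment and return its resilience score
    return 100.0

def step_group_gate() -> bool:
    # Run the experiments of the step group in parallel.
    with ThreadPoolExecutor() as pool:
        scores = list(pool.map(run_chaos_experiment, EXPERIMENTS))
    # Proceed only if every experiment met the expected resilience score;
    # otherwise trigger the failure strategy (for example, a rollback).
    return all(score >= EXPECTED_RESILIENCE for score in scores)
```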
This is another example of how
you can use step groups to put multiple chaos experiments
into a separate stage and then take a decision
based on the resilience score. I hope this helps;
it's a simple demo of how you use multiple
chaos experiments together.
Well, you have looked at those two demos now.
So in summary, resilience is a real challenge,
and there's an opportunity to increase resilience by
involving developers in the game and starting
to introduce chaos experimentation in the pipeline.
You can get ahead of this challenge of resilience by
actually involving the entire DevOps team,
rather than involving the SREs alone on a need basis.
The DevOps culture of chaos engineering is more scalable and
is actually easier to adopt; it makes chaos engineering
at scale easier. So thank you very much
for watching this talk,
and I'm available at this Twitter
handle or at the Litmus Slack channel.
Feel free to reach out to me if you want to talk to me
about more practical use cases on what I've been seeing
in the field with chaos engineering adoption. Thank you and have
a great conference.