Transcript
Hi there. In this session we're going to talk about cloud native chaos engineering and how to do it at scale. I am Uma Mukkara, CEO at ChaosNative, and I'm also a maintainer of LitmusChaos, a CNCF sandbox project. I live in Bangalore with my two boys and my wife.
I've been doing a lot of entrepreneurial work over the last decade, starting with CloudByte, a storage startup for service providers. Later I started OpenEBS and MayaData, where I also started a project called Litmus for cloud native chaos engineering, which is also a CNCF project. Recently we spun off from MayaData to focus completely on LitmusChaos and cloud native chaos engineering. I'm happy to be here at Conf42's chaos engineering conference talking about what cloud native chaos engineering is.

Let's talk a little bit about reliability. What is reliability?
It's about achieving resilience in operations using chaos engineering. That's the usual definition we've been hearing for reliability. And what about cloud native? Cloud native is usually associated with containerization, where you rearchitect your applications into a microservices architecture. You also apply the advancements of CI/CD, so your code gets tested pretty thoroughly, and you make use of the declarative nature of these services and applications and apply GitOps to manage them at scale. So in effect, cloud native means microservices being delivered faster. Your applications, which are now made up of multiple microservices, are delivered faster; that's the net effect of the cloud native ecosystem.

What about reliability in cloud native? We've talked about what cloud native is and what reliability is; how do you apply them together? Well, it's applying chaos engineering to achieve resilience in operations in the cloud native ecosystem while still getting faster application delivery. So in essence, it's about applying chaos engineering for the resilience of cloud native applications. That's what we're going to talk about in this session: why it is different, what it is, and how you do it. Before we jump in, let's talk a little bit about traditional chaos engineering.
Traditional chaos engineering is all about avoiding expensive downtime. We all know that downtime is not easy to deal with; it usually results in expensive losses. What you do as part of practicing chaos engineering is not wait for failure to occur: you keep testing in production, you keep introducing faults in production, and then see if your services hold up. If not, you tune the system, and then you learn. That's the feedback loop we keep talking about in chaos engineering.

The general state of chaos engineering until recently, I would say until 2019 or 2020, is that we all understood what chaos engineering is supposed to be, but it has been limited to experts and enthusiasts; it's typically people operating large deployments who follow chaos engineering. And chaos engineering generally gets started after you burn your hands with a downtime: as part of the root cause analysis, you resolve that you need to practice chaos engineering, so you start doing it. That's how it has typically been done until recently, a year or so ago.
And how has it been done? We all know about game days; they are one way of doing it, and you can integrate chaos into CI/CD, but not as a rule of thumb. These things are done pretty rarely, not as a standard practice, and so far it has typically been limited to SREs. Developers limit their testing to CI pipelines with regular functional testing, or a little bit of negative testing. It's not deep chaos testing. The measurement of results is also not standardized, and there are no tools to do that. Observability is done through whatever tools were available, with manual operations, manual scripting, all that stuff. So overall, the "why" of chaos engineering is well understood, and many large companies operating large data centers and applications have been doing it, but it is still about manually planning it and manually executing it. There are no well-defined ways of doing chaos engineering where people with common operations knowledge can go ahead and practice it. That's the state of affairs of chaos engineering until recently, in my opinion.
Before we go into cloud native and chaos engineering together, let's look at the state of affairs of cloud native and chaos engineering with respect to crossing the chasm. Kubernetes is pretty well adopted now and is believed to be in the mainstream market, whereas chaos engineering is believed to still be in its early days in the cloud native ecosystem. That's what I'm going to touch upon: why is it so important, and why can't you just practice chaos engineering the regular way in a cloud native ecosystem? Chaos engineering done in a cloud native environment is what we call cloud native chaos engineering. Why should there be any difference in how chaos engineering is practiced in the cloud native ecosystem? I would say there are primarily two reasons to look at chaos engineering differently: one is more dynamism, and the other is the way DevOps has changed with respect to infrastructure provisioning.
Let's talk a little bit about dynamism. It started with containerization, where an application is split into multiple microservices. Instead of dealing with one large application, you have multiple smaller microservices to deal with, and they are tested with pretty well-built CI/CD pipelines. The goal is to deliver them faster, and there have been a lot of advancements in CI and CD: new tools are available and they keep getting easier. The net effect is that instead of the typical releases of large systems happening every 90 or 180 days, you get releases almost every week. With multiple microservices, something or other is being changed all the time, at least once a week in a very large system. Look at the dynamism of the cloud native ecosystem: there are so many players working together under the leadership of CNCF, it is working very well, and there is a lot of coordination going on. The net effect of all this is that if you are in the cloud native ecosystem, you are getting very dynamic changes into your system all the time. Now imagine doing chaos engineering in that system: you also have to be very dynamic in how you do chaos engineering.
The other reason is how DevOps has changed. It's not just about shifting left or containers shipping faster, it's about infrastructure provisioning. How has infrastructure provisioning changed? First of all, it is now 100% declarative. Everybody provides a declarative API where developers can write declarative code or syntax to get the infrastructure they want. They're not waiting for anyone to provision the infrastructure for them; they're doing it themselves. And by following the practice of GitOps, which is on the rise, developers are going ahead and getting the infrastructure they want through APIs. It is happening already.
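To make "declarative infrastructure" concrete, here is a minimal sketch of a developer requesting storage purely through a declarative Kubernetes API; the claim name and storage class are placeholders for whatever your platform provides.

    # A developer asks for a 10Gi volume declaratively; the platform provisions it.
    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: orders-db-data            # placeholder name
    spec:
      accessModes:
        - ReadWriteOnce
      storageClassName: standard      # placeholder storage class
      resources:
        requests:
          storage: 10Gi

Commit a file like this to Git and let a GitOps tool apply it, and the infrastructure underneath your applications changes as often as the applications themselves do.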
And because infrastructure is getting provisioned left and right, the system can also be less resilient, because faults can very well happen in the infrastructure, and typically they do. So your applications are coming into production faster as microservice updates, and developers are also provisioning and changing the infrastructure those applications run on frequently. Both of these things together create a problem for resilience, and that's what chaos engineering in cloud native has to be aware of. In summary, there is more dynamism and there are more infrastructure changes, so you need to be doing cloud native chaos engineering.
So how do you do this differently in the cloud native ecosystem? I have come up with certain general principles for practicing chaos engineering, a set of principles for how to do it. I'm setting out around five of them here. Until recently in my blogs I had written about four, but I've since observed that open observability is a big deal, so you have to have a common layer of observability to do chaos engineering in the cloud native ecosystem; I'll talk about that.

First of all, it has to be open source. By having the infrastructure or architecture for chaos engineering in open source, you get a more reliable stack for doing chaos engineering; Kubernetes and the entire CNCF landscape are an example of that. All these projects are well cooked, well architected, well designed and well reviewed. Second, chaos experiments, which in my opinion will become a commodity at some point, have to be community collaborated so that there are no false alarms coming out, and so that you need not spend a lot of time writing the most common chaos experiments; they can be hosted somewhere and should be available for the most commonly required chaos operations. Third, chaos itself has to be managed at scale, which means the chaos experiments need to be versioned and need to follow a lifecycle of their own, so you need an open API and versioning support for chaos experiments.
Fourth, how do you do this at scale? Doing anything at scale is always a challenge, and the same applies to chaos engineering in cloud native ecosystems. When you're doing it at big scale, you need to automate everything, and the right way to do that is GitOps. The entire chaos engineering construct has to be declarative and has to be supported by well-known tools like Argo CD, Flux, Spinnaker and so forth. And fifth, open observability is important. As I mentioned a little while ago, observability is key, especially when you introduce a fault: you need to be able to go and debug whether a certain change in behavior is because of the fault you introduced or because of something else that happened coincidentally. So these are the principles of cloud native chaos engineering. You need to keep an eye on them. You need not follow all of them all the time, but in my opinion it's good to have all of them being followed.
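To make "declarative and versioned" concrete, here is a minimal sketch of what a chaos fault can look like as a Kubernetes custom resource kept in Git. The field names follow the LitmusChaos ChaosEngine CRD, which I'll introduce next; the namespace, labels and duration are placeholders.

    # A chaos fault expressed declaratively: target app, experiment, tunables.
    apiVersion: litmuschaos.io/v1alpha1
    kind: ChaosEngine
    metadata:
      name: hello-app-chaos
      namespace: default               # placeholder namespace
    spec:
      engineState: active
      appinfo:
        appns: default                 # placeholder target namespace
        applabel: app=hello            # placeholder target label
        appkind: deployment
      chaosServiceAccount: litmus-admin
      experiments:
        - name: pod-delete             # a community-maintained experiment from the hub
          spec:
            components:
              env:
                - name: TOTAL_CHAOS_DURATION
                  value: "30"          # seconds of chaos, placeholder

Because it is just YAML in Git, the same file can be reviewed, versioned and applied by a GitOps tool, which is exactly what the scale and GitOps principles ask for.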
So I want to introduce the Litmus project, which is built on these principles. We've been building it for about three years now, and I'm happy to say that we are almost there with all of them; the first three in particular have been in place for more than a year. We are releasing Litmus 2.0 very soon with GitOps and open observability features. So basically all these cloud native chaos engineering principles are well adhered to by the Litmus project.
Let's look at Litmus in a little more detail. It's a platform, a complete tool set, for doing cloud native chaos engineering. It comes with a simple Helm chart that you install and then use to run chaos workflows; I'll touch on those shortly. Basically it all starts with a simple Kubernetes application called Litmus that you can install using Helm. The experiments needed for your chaos workflows are already available in a public hub, and you will end up with your own private hub for coordinating the new or tuned experiments that you and your team write. Once you install Litmus through the Helm chart, you get something called the Litmus portal. It's a centralized place where all your chaos engineering efforts are coordinated, and from it you pull in or refer to the experiments on public or private hubs.
You can run chaos workflows anywhere, on any Kubernetes in the cloud or any Kubernetes on premises. And it is not limited to Kubernetes: you run these chaos experiments from the Litmus portal, from the Kubernetes ecosystem, but the targets can be non-Kubernetes as well. If you're doing it at scale, you'd better do it through GitOps. Litmus gives you the option of storing all your configuration either in its local database or in Git, and once it is in Git, integration with any of the CD tools becomes much easier.
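For instance, once your chaos manifests live in a Git repository, a CD tool can keep the cluster in sync with them. Here is a minimal sketch using an Argo CD Application; the repository URL, path and namespaces are placeholders.

    # Argo CD continuously applies whatever chaos manifests are committed to Git.
    apiVersion: argoproj.io/v1alpha1
    kind: Application
    metadata:
      name: chaos-workflows
      namespace: argocd
    spec:
      project: default
      source:
        repoURL: https://github.com/example/chaos-config.git   # placeholder repo
        targetRevision: main
        path: workflows                                        # placeholder path
      destination:
        server: https://kubernetes.default.svc
        namespace: litmus
      syncPolicy:
        automated:
          prune: true
          selfHeal: true

With this in place, changing a chaos workflow is just a Git commit, which is the GitOps model we keep coming back to.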
Let's look at this in a little more detail. You have the central Litmus portal, and all you do is pick up a predefined chaos workflow, or create your own chaos workflow, and run it against a target. The target is where the chaos operator is spun up and the experiments are run. You can have multiple targets connected to the same portal, so you don't install Litmus again and again; you install it once per enterprise or team and then you're good to go. You have RBAC and everything in the Litmus portal. Once the experiments are run, the metrics are exported to Prometheus, the observability system, and the analytics are pushed back into the portal.

Out of all this, the workflow I mentioned is a key element. It's one of the innovations that happened in the last six months within the Litmus team.
So I want to talk a little bit about what a Litmus chaos workflow is. It is basically an Argo workflow consisting of multiple chaos experiments, which you can arrange in sequence, in parallel, or in a combination of the two, and they get run. A Litmus chaos workflow also carries consolidated results and status, so one workflow is the unit of execution and management for you within Litmus. And by keeping the whole configuration of a complex workflow declarative, we are saying that chaos engineering and GitOps can be put together. Argo Workflows is pretty stable; in fact it is an incubation-stage project within CNCF and a very widely used tool, so I'm pretty sure you will have a great experience using Argo and Litmus together.
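Here is a minimal sketch of that shape: an Argo Workflow where one chaos step runs first and two more run in parallel. The image, manifest paths and experiment names are placeholders of my own; Litmus's predefined workflows use their own helper images, but the structure is the same.

    # One sequential step followed by two parallel steps, each applying a ChaosEngine.
    apiVersion: argoproj.io/v1alpha1
    kind: Workflow
    metadata:
      generateName: chaos-workflow-
    spec:
      entrypoint: chaos
      serviceAccountName: argo-chaos        # needs RBAC to create ChaosEngines
      templates:
        - name: chaos
          steps:
            - - name: pod-delete            # step 1 runs first
                template: run-chaos
                arguments:
                  parameters:
                    - name: engine
                      value: /engines/pod-delete.yaml      # placeholder path
            - - name: pod-cpu-hog           # steps 2a and 2b run in parallel
                template: run-chaos
                arguments:
                  parameters:
                    - name: engine
                      value: /engines/pod-cpu-hog.yaml     # placeholder path
              - name: pod-memory-hog
                template: run-chaos
                arguments:
                  parameters:
                    - name: engine
                      value: /engines/pod-memory-hog.yaml  # placeholder path
        - name: run-chaos
          inputs:
            parameters:
              - name: engine
          container:
            image: bitnami/kubectl          # any image with kubectl works for this sketch
            command: [kubectl]
            args: ["apply", "-f", "{{inputs.parameters.engine}}"]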
So if you want to go a little deeper: how does a chaos workflow actually work? Your experiments live in a chaos hub. The chaos workflow refers to the experiments on that hub, and you always keep them in a hub: you can refer to the public hub and tune the experiments through their YAML, or, if you are changing some of the experiments or creating new ones using the SDKs, you can keep them in a private hub. Ultimately, a workflow refers to an experiment in a hub somewhere. To kick off a workflow, you create it and push it, and the change is recognized and executed either manually or through GitOps. Finally, the ChaosEngine is what is responsible for kickstarting the chaos: the chaos operator watches for changes to ChaosEngine resources, the experiments are run, and the Litmus chaos exporter takes the result metrics and pushes them into Prometheus. A ChaosResult CRD is created, and the result of the chaos experiments is pushed back into Litmus for analysis, debugging and monitoring.
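For reference, the result of a run is itself a custom resource you can inspect. Roughly, and with fields abridged, a ChaosResult looks like the sketch below; the exact schema comes from the litmuschaos.io CRDs, and the names here are placeholders.

    # Abridged view of a ChaosResult after an experiment finishes.
    apiVersion: litmuschaos.io/v1alpha1
    kind: ChaosResult
    metadata:
      name: hello-app-chaos-pod-delete   # <engine-name>-<experiment-name>
      namespace: default                 # placeholder namespace
    status:
      experimentStatus:
        phase: Completed
        verdict: Pass                    # Pass or Fail, driven by probes and checks
        probeSuccessPercentage: "100"

The exporter turns this into Prometheus metrics, and the portal reads it for its analytics.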
So that's how a chaos workflow runs. Many of these workflows are available as predefined workflows; you just need to configure and tune them and you are good to go. These chaos workflows can be run against multiple targets, so it's essentially multi-cloud chaos engineering that we are talking about here. In terms of the experiment list, Litmus provides a lot of experiments of all types, with a few more in the works, such as IO chaos and DNS chaos, but you've got pretty much everything you need to start today.

Another thing Litmus provides is a way to define your hypothesis using probes. You can define the steady-state hypothesis declaratively using probes, and using probes and annotations you can mark the chaos duration on any regular Grafana graph, for example.
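As an illustration, here is a sketch of how a steady-state hypothesis can be declared as a probe on an experiment inside a ChaosEngine. The URL and timings are placeholders, and the exact field names can vary between Litmus versions, so treat this as the shape rather than the definitive schema.

    # An HTTP probe that must keep returning 200 while pod-delete chaos runs.
    experiments:
      - name: pod-delete
        spec:
          probe:
            - name: check-frontend-availability
              type: httpProbe
              mode: Continuous                 # evaluated throughout the chaos duration
              httpProbe/inputs:
                url: http://hello.default.svc/healthz    # placeholder URL
                method:
                  get:
                    criteria: "=="
                    responseCode: "200"
              runProperties:
                probeTimeout: 5
                interval: 2
                retry: 1

If the probe fails, the experiment's verdict fails, which is how the hypothesis feeds into the ChaosResult we just saw.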
The Litmus portal also provides a good deal of analytics, such as comparing resilience across workflows run at the same time or at different times. So you have the beginnings of observability built in, rather than depending on external tools, and chaos interleaving is an important concept there.

How about chaos in CI pipelines? There is a lot of advancement and a lot of interest in CI, so what we have done is create a Litmus CI chaos library. You can create a chaos stage and use this library through your existing tools. For example, if you're using GitLab, you can create a remote template for chaos that wraps this CI library, and then your remote template is ready; you don't really need to worry about executing the underlying chaos workflow, it all happens automatically. So far we have done this for GitLab, GitHub Actions and Spinnaker, and most recently for the Keptn project.
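As a sketch of what the GitLab side can look like: you include a remote template and add a chaos stage to the pipeline. The template URL, hidden job name and variables below are placeholders I've made up for illustration; the real templates are published by the LitmusChaos project, so check there for the exact names.

    # .gitlab-ci.yml fragment: a chaos stage driven by a remote Litmus template.
    stages:
      - build
      - test
      - chaos                          # chaos runs after the functional tests pass

    include:
      - remote: 'https://example.com/litmus/chaos-template.yml'   # placeholder URL

    pod-delete-chaos:
      stage: chaos
      extends: .litmus-chaos           # hidden job assumed to come from the remote template
      variables:
        APP_NAMESPACE: "default"       # placeholder inputs the template might expect
        APP_LABEL: "app=hello"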
These are the integrations available today, and many more may be coming soon or later this year.

Litmus is known for being very strong at chaos engineering on Kubernetes. But what about non-Kubernetes? Does Litmus support chaos engineering for non-Kubernetes targets? The answer is yes. The experiment management and monitoring all remain on Kubernetes, and you can still execute experiments on other targets, such as the various clouds, your own VMware on premises, OpenStack on premises, and so on.
The way it works is that your experiment runs all the way to the last leg within Kubernetes, but the actual chaos is executed on the target by using its APIs, with the appropriate access control. You write the logic of how to kill that resource, and the rest of the chaos engineering control plane stays on Kubernetes. We already have some experiments like EC2 terminate, EBS detach and GPD detach, and many more are coming, so later this year I'm pretty sure we'll have many more such examples of chaos for non-Kubernetes resources. That's how it happens: as long as you have an API to reach the resource on the other side, you will be able to kill that resource using that API.
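To give a flavor of that, here is a sketch of what an EC2 terminate fault can look like when driven from the Kubernetes control plane. The experiment name, environment variables and credentials wiring are illustrative; check the public ChaosHub for the exact experiment definition and the cloud secrets it needs.

    # Chaos control plane on Kubernetes, blast radius on AWS via its API.
    apiVersion: litmuschaos.io/v1alpha1
    kind: ChaosEngine
    metadata:
      name: aws-instance-chaos
      namespace: litmus
    spec:
      engineState: active
      chaosServiceAccount: litmus-admin
      experiments:
        - name: ec2-terminate            # experiment name as listed in the hub (illustrative)
          spec:
            components:
              env:
                - name: EC2_INSTANCE_ID  # illustrative variable names and values
                  value: "i-0123456789abcdef0"
                - name: REGION
                  value: "us-east-1"
                # AWS credentials are typically mounted from a Kubernetes secret

The same pattern applies to EBS or disk detach faults: the experiment pod runs on Kubernetes and calls the cloud API to inject the fault.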
The other thing I want to touch on is CNCF projects and Litmus: what integrations are available? We have recently integrated with Argo Workflows, and we've tested very well with Argo CD and certified it. We have also done a close integration with Keptn by working with their team, a fantastic team, I would say. And of course we came from the OpenEBS team, and OpenEBS is what started the Litmus integration to begin with; a lot of community members use chaos testing with OpenEBS. We are also certified for other runtimes like CRI-O and containerd. That's where we've got to so far, and on our short to medium term roadmap we have some good integrations planned with Flux, Crossplane, the Vitess database and, for security, Open Policy Agent. So these are the projects we have in mind for some kind of integration, along with making use of these projects.
Here is a quick summary of the Litmus roadmap. I think I've talked through most of it; you can pause the recording and look at the roadmap in detail if you want.

Let's talk about what we do at ChaosNative and what ChaosNative can do for chaos engineering, both for cloud native and non-cloud-native environments. The idea of us spinning off from MayaData is to provide more resources for the success of the Litmus project and to accelerate the adoption of Litmus by enterprise users.
A lot of users, big enterprise users, have been using Litmus and have been asking whether the Litmus team can support them. We had been doing that as part of MayaData, but now we felt it was time to put more focus on Litmus and create more resources around it. That's how the company came to be launched a few days ago. And how are we going to accelerate enterprise adoption of Litmus? Really by building a stronger community around it. Community is very important, and we want to encourage and demonstrate the open governance of Litmus. We will be putting more resources into working with more community members and large companies in the cloud native ecosystem, to build a stronger community together so that adoption increases. I just talked about the plans to integrate Litmus with other CNCF projects, and all of that requires more resources, which we are now able to allocate.
Apart from enterprise support, we are also thinking of doing CD tool integrations with many more tools, Kubernetes distribution testing, and making sure that chaos engineering can be done easily in air-gapped environments as well. Also, some customers are asking
for managed services around chaos engineering, and we have plans to launch chaos as a service at some point to make chaos easy for developers.

So that's about it for this session, where we talked about cloud native chaos engineering, the need for it, and how to do it. I encourage you to go ahead and give Litmus a try, and if you have any questions or need support, come to the Litmus channel on Kubernetes Slack. With that, I hope you have a great Conf42 conference and a fantastic day or evening. Thank you.