Transcript
We have a really interesting topic here. Most of the day, we have been talking about chaos engineering in general. This talk is a little bit more about how to practice it, rather than preaching why it is needed and what it is.
Before we start, about me: I go by Uma Mukara, you can call me Uma. I'm co-founder and CEO of a company called MayaData. We do cloud native data management, primarily for Kubernetes, so you will see me talking a lot about Kubernetes today. I'm also a co-creator of the following open source projects. The genesis of this entire cloud native chaos engineering effort was really about trying to do chaos engineering for an open source project called OpenEBS. While trying to practice chaos engineering for OpenEBS, we ended up creating a cloud native chaos engineering infrastructure called Litmus. I'll talk a little bit more about that.
So we've been talking about chaos engineering. Everybody knows it's a need, right? Russ Miles started with why chaos engineering is needed, and we've also been hearing that it's a culture you need to practice, not a thing. Right, but I got all that. How do I actually get started? This is the question we ran into. I'm in this Kubernetes world, and I'm developing an application, or I'm operating a big infrastructure or deployment. I need chaos engineering. How do I do it? As part of the operations work I do, I got into a situation where we needed to practice chaos engineering for a very different reason. We were building a data management project called OpenEBS, where the community has to rely on this software for their data. Storage is a very difficult subject, we all know that. You don't want to lose your data at any point in time. So how do you convince your users that you have done enough testing, or that the software has enough resilience? The best way is for them to find out for themselves: okay, I can actually break something and still see that it is good. So we started working on that, and then realized: okay, Kubernetes itself is new, awareness of chaos engineering is growing, but we need to start creating some infrastructure for our own use. Slowly we realized this could be useful for the community as a whole, and we created Litmus.
So today I just want to touch upon the following topics. We all know what chaos engineering is, but what is cloud native chaos engineering? And if we understand why cloud native chaos engineering is a different subject, or a bit of a deviation from what we've been talking about, what are its principles? We observed some common principles that we can call the cloud native chaos engineering principles. Then we created a project called Litmus, and we can talk about whether it really follows all these principles. And then, how many of you know OperatorHub in Kubernetes? It's very similar. Or Helm charts, right? Helm charts are a popular concept because you have a chart you can pull to bring an application to life. The same concept applies to a chaos experiment as well. Then we'll go through some examples, and we can talk about what the next steps are, and so forth. So, what is cloud native chaos engineering? Are there any cloud native developers here?
Yeah, what is a cloud native developer? If you can docker pull a developer, then you're a cloud native developer, right? So it's really about practicing chaos engineering for specific environments, which are called cloud native environments; we'll see in a moment what these environments are. You can also say that chaos engineering done in a cloud native way is cloud native chaos engineering, or, to simplify even further, chaos engineering done in a Kubernetes native way is cloud native chaos engineering. It's all about this general concept of how Kubernetes is changing our lives as developers and SREs. You may have observed many, many times that the way we practice DevOps has changed. Adrian talked about infrastructure as code, which has become a primary component of Kubernetes deployments.
I'll just touch upon what a cloud native environment looks like. I've taken this concept, or rather this picture, from one of the GitLab Commit conferences, where Dan Kohn from CNCF was presenting why a CI pipeline is useful in a cloud native environment. I'll repurpose it for a different need here. So, you are writing a lot of code if you're a cloud native developer, and after you develop, you run it through CI pipelines and put it into production, or you release it and somebody is going to pull it and use it, right? So where are they going to deploy it? Let's assume it's a Linux environment; that's a lot of code, and you will have Kubernetes running on it, which is even more code than Linux itself, around 35 million lines of source code. Then you will have a lot of microservices infrastructure applications that help you run your code, and a number of smaller libraries or databases that run on top. And finally, your code. That's how small your code is: about 1%.
The common factor is to think about how often your Linux upgrades happen: maybe every eighteen months, or twelve. And can you guess how often Kubernetes upgrades happen? Before you know it, before you've understood one version of Kubernetes, another version comes in. So it's very dynamic, and that's the whole point of adopting Kubernetes: it runs anywhere, and everybody runs one thing inside a container, one thing only. So Kubernetes upgrades happen very frequently. And your apps, not just your apps but the applications or microservices surrounding your application, also need to be updated very frequently, and you don't control all of it. Meanwhile you are always thinking about your own code: how do I make sure that the code I bring into deployments is working fine?
So the summary is that your environment, the cloud native environment, is very dynamic, and it requires continuous verification. It's the same concept of don't assume, but verify. But what do you verify? Do you verify your 50,000 lines of code, or the remaining 99%? How do you verify it? As an SRE, you cannot go and say everything was working fine, I verified everything, when the problem is a bug in the Kubernetes service provided by the cloud provider. You cannot assume that the cloud provider has really hardened the Kubernetes stack or any of the other microservices. So it's really up to the SRE, or whoever is running the ops, to verify that things are going to be fine. That's one thing. So how do you do this? Obviously, it's chaos engineering: chaos engineering not just for your code, but for the entire stack, because your application depends on the entire pyramid.
That's one concept. The other big difference is: okay, I need to do chaos engineering, but how? What is this cloud native environment? The big difference is that everything is YAML; you need to practice this YAML engineering. I think Andrew demonstrated very well today how to do chaos engineering for a MySQL server on Kubernetes, how we really killed some pods, or KubeInvaders. That's good for demos. But how do you really practice it, if needed, in production? You have certain principles. If I am putting something into the infrastructure, it should be in YAML, and I need to put it into Git. Even if I'm doing some chaos experiment, it should be in Git, because I am making an infrastructure change: I'm killing a pod. Who gave you permission? It has to be recorded, and that's GitOps. That's what the cloud native environment dictates: if you are a cloud native engineer or developer, you have to follow these infrastructure principles.
So what you need is not just chaos engineering, but cloud native chaos engineering. That's the concept I want to drive here, in a simple way. This is my definition: if you put chaos engineering as an intent into a YAML manifest, then you can say that you are practicing cloud native chaos engineering. That's how we started. We are going to give OpenEBS as an application to users, and they're going to take it and run it. How do they verify it? Verifying an application should work the same way as deploying an application. How do you deploy an application? You pull some resources into a YAML file, you do a kubectl apply, and your application comes up. How do you kill it? It has to be the same way. That's the primary difference we found, and then we started building these infrastructure pieces to do exactly that.
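As a minimal sketch of that idea, with purely illustrative file names, injecting chaos uses exactly the same motion as deploying the application:

```
# Deploying: declare the application as YAML, put it in Git, apply it
kubectl apply -f my-app.yaml

# Injecting chaos: declare the chaos intent as YAML and apply it the same way,
# so the chaos you run is recorded in Git, GitOps style
kubectl apply -f my-app-chaos.yaml
```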
So let's look at the Kubernetes resources that we have as developers. We have a lot of resources provided by Kubernetes as an infrastructure component: Pods, Deployments, PVCs, Services, StatefulSets. And you can write a lot of CRDs. You use all those things to define an application and to manage an application, and this concept of CRDs has brought in operators: you can write your own operators to manage the lifecycle of the CRs themselves. That's good for development. If I'm practicing chaos engineering, or chaos testing, for Kubernetes, I need similar resources; I need to be able to have some chaos resources. So what are the chaos resources I would need? You would need some chaos CRs in general: just like an application is defined by a Pod, a Service, and some other things, how do I define chaos engineering? You need to be able to express your entire practice through some CRs. And obviously you need a chaos operator to manage those CRs and to orchestrate each test. And it's all about observability: if you don't have observability, you can introduce chaos but you're not able to interpret what's going on. It's also about history. So metrics are very important in chaos engineering. These are the chaos resources we figured out in general: these three resources are needed.
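Jumping slightly ahead to how Litmus names these, the three resources show up as ordinary CRDs once the chaos infrastructure is installed; the listing below reflects the CRD names Litmus registers, and other frameworks may differ:

```
# List the chaos-related CRDs installed on the cluster
kubectl get crds | grep litmuschaos
# chaosengines.litmuschaos.io      - which experiments run against which application
# chaosexperiments.litmuschaos.io  - the individual, minimal chaos actions
# chaosresults.litmuschaos.io      - where the outcome of each run is recorded
```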
So then, looking at all those things in general, we can summarize the principles of cloud native chaos engineering as the following. First of all, whatever you try to do has to be open source, because now we are trying to generalize. I could just write some closed source code and say, here it is, but you're not going to accept it. Kubernetes was adopted, and all the new stuff around it is becoming adoptable so fast, because it's open source. That's the summary of it. Then you need to have acceptable APIs or CRs, just like Pod and Service, but for chaos engineering. Then, what about the actual chaos logic, how you kill something? That, again, should not be fixed; it should be pluggable. Maybe my project does it in a particular way, but somebody else has their own way of killing. For example, today we talked about Pumba: how do you introduce network delays? It does that in a certain way, and you should be able to take Pumba and plug it into this cloud native chaos engineering. It should be usable, and it should be community driven. When will you really do chaos engineering as a practice? It's not just preaching about chaos engineering, that it is a culture, that it's a good thing to do; what are the best practices? We all need to be able to build these chaos experiments together. That's when you can call them principles. I've written a blog about what I just said, with the same concepts, and published it on CNCF; it's available on the CNCF site itself.
Now, what I want to show is how, with these principles, we actually started practicing chaos engineering, and we named it Litmus. The Litmus project is exactly a manifestation of our effort to practice chaos engineering with Kubernetes, and we turned it into a project that is not useful only for our own project: anybody who is developing on Kubernetes and wants to practice chaos engineering can use Litmus. So this is just a brief introduction. It's totally Apache 2.0 licensed, and there are some good contributions coming in. It recently went GA with 1.0 a couple of weeks ago, which really means it has all the tools and infrastructure pieces you need to start taking a look at it. It's open source, obviously. It has some APIs or CRs, which I'll explain in a bit. Whatever chaos logic you're using, you can just wrap it up in a Docker container and put it into Litmus; you don't need to change anything. And it is obviously community driven; I'll explain that in a bit as well. So let's see what CRs or CRDs Litmus has.
The first thing is the ChaosExperiment. You want to do something: take the minimal thing that you want to introduce as chaos and define it as a ChaosExperiment. Killing a particular application might involve three or four different chaos experiments, but a chaos experiment itself is the minimal killing of something. Then you need to be able to drive those chaos experiments against an application; we call that the ChaosEngine. This is where you say: here are the chaos experiments, they belong to this application, here is the service account, here is who can kill what, et cetera. And after the run, you need to be able to put your results into another CR called ChaosResult, so that Prometheus or some other metrics system can come and pull it, and somebody, or some tool, can make sense of what exactly happened. You can keep adding multiple chaos experiments into a ChaosEngine. So those are the CRs that you have.
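To make that concrete, here is a minimal sketch of a ChaosEngine that runs a pod-delete experiment against a hypothetical nginx deployment. The field names follow the Litmus 1.x API as I recall it, so treat the exact spelling as indicative rather than authoritative:

```
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: nginx-chaos              # hypothetical engine name
  namespace: default
spec:
  # The application this chaos is aimed at
  appinfo:
    appns: default
    applabel: app=nginx          # label selector of the target deployment
    appkind: deployment
  # Service account allowed to inject the chaos
  chaosServiceAccount: nginx-chaos-sa
  # One or more ChaosExperiments to run against the app
  experiments:
    - name: pod-delete
      spec:
        components:
          env:
            # Tunables are passed as key-value pairs
            - name: TOTAL_CHAOS_DURATION
              value: "30"
```

Applying this manifest with kubectl apply is the whole injection step; the operator takes it from there.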
Then there is pluggable chaos: Litmus already has Pumba and PowerfulSeal as built-in libraries, for example, and you can pull in your own library. So how do you pull your library into this infrastructure? Here I'm explaining PowerfulSeal and how we, or rather the community, did it. All you need to do is take whatever killing you do and put it into a Docker image. If you can just do a docker run of that image and it goes and kills something or introduces chaos, then it can be used with this infrastructure. Once you have the Docker image, you just create a new CR, a new experiment. A new experiment really points to a new custom resource, and inside that custom resource definition you simply say: here is your chaos library. It's as simple as that.
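As a rough illustration, and only an illustration, the custom resource for a new experiment essentially points the chaos runner at your image; the experiment name and image below are hypothetical, and the LIB environment variable is how I recall Litmus selecting the pluggable chaos library:

```
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosExperiment
metadata:
  name: my-custom-chaos                            # hypothetical experiment name
spec:
  definition:
    # The Docker image that carries the actual chaos logic; anything you can
    # `docker run` to introduce chaos can be wrapped this way
    image: "example.org/my-team/my-chaos:latest"   # hypothetical image
    env:
      # Which pluggable chaos library the runner should use,
      # for example litmus, pumba, or powerfulseal
      - name: LIB
        value: "powerfulseal"
```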
The reason I'm emphasizing this is: think of Litmus as a way to use your chaos experiments in a more acceptable, GitOps way. You can orchestrate all your experiments using Litmus; however they are written, you just need to convert them into a Docker image. The Litmus chaos runner automatically picks up that Docker image, runs it, puts the outcome into a ChaosResult CR, and then you're good to go. And the community is developing more tools to observe chaos; all of that will work very naturally.
So it's community driven. What that really means is that we have something like OperatorHub, called the Chaos Hub. It already has a lot of experiments. As a developer, when you create a chaos experiment and you want it to be used in production or pre-production by your users, you push it to the Chaos Hub. Your SREs, or whoever is practicing your chaos engineering, pull this experiment and use it. So imagine if Andrew had published that experiment: you would just need to create a YAML file, put in some key-value pairs, and it runs.
So this is what the cloud native chaos orchestrator architecture looks like. You will have a lot of CRs defined for your chaos users, and a lot of experiments used by those CRs, but you can develop more. As more and more people develop chaos for various applications, you will slowly see that chaos engineering for the entire Kubernetes stack can be practiced just by installing Litmus. So how do you get started? Just imagine that you have, on the hub, a lot of chaos experiments for various different applications, and you are running your app in a container. All you need to do is pull the Litmus Helm chart, or just install it; it runs in a container. The moment you install it, you get the chaos libraries and the operator on your Kubernetes cluster. Then you need to pull the experiments. The assumption here is that there are so many charts that you may not need all of them, so you pull whatever you need onto your Kubernetes cluster, and then you just inject chaos by creating a CR, a ChaosEngine, which points to various different experiments. The operator then goes and runs chaos against the given application and creates a ChaosResult. It's as simple as that.
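A rough sketch of that flow in commands; the Helm repo and file names are from memory and purely illustrative, so check the Litmus docs for the exact, current ones:

```
# 1. Install the Litmus operator and chaos libraries into the cluster
helm repo add litmuschaos https://litmuschaos.github.io/litmus-helm/
helm install litmus litmuschaos/litmus

# 2. Pull only the experiments you need from the Chaos Hub (file name illustrative)
kubectl apply -f pod-delete-experiment.yaml

# 3. Inject chaos by creating a ChaosEngine CR that points at those experiments
kubectl apply -f nginx-chaos-engine.yaml

# 4. The operator runs the chaos and records the outcome in a ChaosResult CR
kubectl get chaosresults
```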
So let's see an example of how this changes a cloud native developer's everyday life. How does a developer create a resource? You create a Pod, and then you create more resources for an application, for example a PV or a Service, et cetera, and usually that's where it ends. Now you want to test something. How do you do it? You inject chaos by simply creating one more CR, just like you've been using Kubernetes all along. You can create a ChaosEngine, tell it what needs to be killed and where, and it's all done; you get your results. So it's extending your existing experience with Kubernetes to do chaos engineering as well. That's cloud native chaos engineering.
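Reading the results follows the same pattern. As I recall it, Litmus creates one ChaosResult per engine and experiment pair and records a verdict in its status, though the exact naming and fields may differ by version:

```
# For the earlier hypothetical example, the result would be named roughly
# <engine>-<experiment>, e.g. nginx-chaos-pod-delete
kubectl describe chaosresult nginx-chaos-pod-delete
# Look for the verdict (Pass/Fail) in the status section
```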
Just to summarize: on the Chaos Hub, we generally have two types of experiments. One is generic, that is, generic chaos experiments for Kubernetes resources in general. And then there are application-specific ones, and this is where it gets interesting. Like we have seen, and sorry to take the same example again and again, Andrew showed a container kill of a MySQL server pod. Then you have to go and verify what exactly happened. We verified that the pods had come back; that's all we saw in KubeInvaders: hey, more pods are coming up. But how do we automatically verify whether the application, the MySQL server, is really working well or not? That's where you can write more logic into your application-specific chaos experiment and then use it in production.
Some of the experiments are already available. You can do a pod delete, or a container kill: a pod can have multiple containers, and you can kill just one of them. You can do a CPU hog inside a pod. There are network latency, network loss, and network packet corruption: introduce some corruption into the packets going into your pod and see what happens. That could be the level of granularity you want.
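For reference, those spoken names map onto Chaos Hub experiment names roughly as follows; this is a fragment of a ChaosEngine spec, with the identifiers being the hub names as I recall them:

```
# Fragment of a ChaosEngine spec listing several generic experiments
experiments:
  - name: pod-delete             # delete a pod of the target app
  - name: container-kill         # kill a single container inside a pod
  - name: pod-cpu-hog            # hog CPU inside the target pod
  - name: pod-network-latency    # inject network delay
  - name: pod-network-loss       # drop a percentage of packets
  - name: pod-network-corruption # corrupt packets going into the pod
```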
Then there is infrastructure chaos. Of course, these are specific to the cloud providers; for example, how you take down a node on AWS is different from how you take down a node on Google. There are disk losses, which is a very common thing: I suddenly lose a disk, what happens? And disk fill is one more common case. So these are the initial set of things that are already there; you can start practicing with them. As for the application-specific experiments: an application really consists of a pod or multiple pods, it has some service, and it has some data. So what does it mean to attack an application? You need to define the logic of what it is you're going to kill. Am I going to kill the MySQL server or only a part of it, et cetera? That's the definition. Then you verify, before killing, whether everything is good or not; that's the hypothesis. Then you use the generic experiments to actually do the chaos, and then you verify your post-checks. All of this can be put into an experiment, so you don't need to redo it every time: you just put it into a YAML file, your application chaos happens, and the ChaosResult CR gets populated.
So, for example, let me quickly take the example of OpenEBS and how it's done there. OpenEBS has multiple components, and I want to verify that OpenEBS, as an application, works well when I kill something. I cannot just go and claim that it does. OpenEBS is a cloud native app, which means it's microservices and it has multiple pods. I can kill a container that belongs to OpenEBS and see what happens; that's one way of putting it. The other way is that I can kill a controller target of OpenEBS and see what happens. So you end up with multiple different chaos experiments that are specific to that application, and then you can start using them. For example, I can kill a target of OpenEBS and then see what happens. You don't really need to know what should happen; that's all defined by the OpenEBS developers. As an OpenEBS user, you will be able to say: okay, my OpenEBS is functioning properly, because I just killed a target and it is behaving as expected. Or you can kill a replica and see. So you don't need to learn the nitty-gritty or the complexities of the application, how it should behave when something happens in production. All of that is coded up by the developers and pushed to the Chaos Hub, and then you can just use it. So that's a summary of how a cloud native chaos engineering framework can work, and Litmus is just one such framework. You can contribute; it's on the Kubernetes Slack itself, and if you find some issues, please report them. But primarily, if you are practicing chaos engineering, take a look at the Chaos Hub. There are some things already there, and this is just the beginning; we have opened it up, and hopefully more contributions will come from the community and from CNCF itself.