Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hi everyone, this is Uma Mukara, co-founder and CEO of MayaData. Today I'm here at the Conf42 SRE 2020 show to talk about increasing Kubernetes resilience.
This should be a topic of interest for SREs, whether you are planning to practice chaos engineering or are already practicing it. We're going to talk about this topic especially as it applies to Kubernetes.
Before we delve deep into this topic, let's look at what we do at MayaData. At MayaData we sponsor two open source projects: OpenEBS, for cloud native data management, and Litmus, for cloud native chaos engineering. We also have a commercial SaaS solution for cloud native data management called Kubera. With Kubera, SREs can turn Kubernetes into a data plane and get a complete solution around cloud native data management.
In this talk we are going to cover the importance of resilience, how you get resilience on Kubernetes, an introduction to Litmus Chaos and how it works, how to automate chaos using Litmus Chaos, and how you end up with higher resilience in that process. We'll also do a brief demo of Litmus and how it works when you use these automation tools. So, what is the state of Kubernetes today?
Kubernetes is seeing greater and greater adoption. In terms of adoption it is believed to have crossed the chasm, and most IT organizations have either already adopted it or are planning to. What we need right now are more complete solutions around Kubernetes, so that the choice of adopting Kubernetes becomes a valuable one.
One of the more important pieces is a tool that helps you keep resilience high. You obviously want to keep your Kubernetes clusters, and the applications running on them, up all the time and meet your SLAs. For that you need a tool, or a set of tools, practices, and processes, to keep resilience higher than what your SLAs demand.
So what is resilience? Resilience is a system's ability to adapt to chaos: whenever a fault happens, the system recovers automatically without affecting user-facing services. What are some examples of resilience, or of its absence, which we call a weakness? One common example in Kubernetes is pods being evicted for various reasons. If your system and your implementation are resilient, your pods are automatically rescheduled and your services are not affected. That is a sign of resilience. On the other hand, if your services become slower than a certain threshold, or go completely down, that's not healthy. It means there is a weakness you need to fix.
Some other examples are nodes going into the NotReady state, which is a little more common on large infrastructure setups at cloud service providers. When nodes go NotReady, depending on the applications you are running, the blast radius can be very high, and that's not healthy. You have to implement your services in such a way that when these nodes go down or go NotReady, you are able to survive that situation. That's the resilience you want. Similarly, when you have hundreds or thousands of containers, it's possible that some of them are not behaving exactly the way you expect; memory leaks are another commonly heard example in large-scale deployments. These are just some examples. Of course you always hope faults won't happen, but you also know that a fault can happen at any time, irrespective of how careful you are. It is inevitable that some fault will happen at some point, so you have to stay afloat through it. That's resilience.
Resilience is more important now in the Kubernetes environment because of the promise of a common API that everyone has adopted. If you look at the CNCF landscape, you realize how many vendors and users across the spectrum are adopting Kubernetes in some form, delivering solutions or building solutions on top of the available services. And because of these new solutions, they themselves are building stronger CI/CD to meet this demand for agility. So together with all this adoption and the stronger CI practices these users and vendors are following, you will see that applications on Kubernetes are changing pretty fast. That's a good sign: you don't need to wait many months for a fix to arrive. It's good news that the system is more dynamic and changes come many times faster than before. For example, upgrades to databases used to happen maybe once in six months or a year, but now it may be every quarter, because they have moved to a microservices model.
Resilience also depends on various other infrastructure pieces and services in your Kubernetes environment. Let's look at the resilience dependency stack. At the bottom you have the platform services, and you can expect faults in that area; a node going into the NotReady state is a famous example. Kubernetes services themselves can be fragile sometimes if they are not implemented properly. And the other cloud native services that surround Kubernetes, like DNS, Prometheus, Envoy, and cloud native databases, can all get into some kind of unexpected state. On top of all these things is your app. So if you are looking at SLAs at the application level, the resilience of your application really depends on a lot of other services in your environment. To summarize, 90% of your application's resilience depends on components that you are not developing and don't own. That is very important, together with the fact that those components are changing much faster than before. So to keep your resilience high, you really need to keep checking: how resilient is my system? You have to constantly validate that. Do I have a weakness, or is my system resilient enough?
So how do you check resilience? Well, you have to practice chaos engineering; that's the topic here. How do you do that? Typically, you have to know the steady state of a service or an application, then induce a fault yourself rather than waiting for one to happen, and then verify whether the steady-state condition still holds. We talked about some examples of resilience, and the same thing applies here: if the system is resilient, you're good; if not, you've found a weakness. It's as simple as that. In other words, for the two reasons we discussed, the dynamism of the services and your application's dependency on those services, you need to practice chaos engineering in order to achieve the required levels of resilience.
That's good. Now let's also talk about how you do chaos engineering. Before we get into that, it's important to know how configuration and operations are managed in the cloud native world today: it is by using GitOps. Git is helpful for managing and versioning your configuration, and that concept has led to using declarative YAMLs to automate your config changes as well. This is the cloud native way of managing the configuration of your applications, and it is being applied across the spectrum now: to manage Kubernetes services, the various applications on them, resource management strategies, and policy management. Everything is being done through GitOps; that is the new way of doing things for DevOps. And you can apply the same GitOps approach to resilience checks. You don't need to invent something new or a new way of doing things to bring in your resilience checks.
So how would you do that? To get resilience checks into place, you bring in a chaos operator and some chaos experiments. You make a change to a chaos experiment, something picks up that change, the chaos experiments are run, the resilience checks are done, and then you observe the results. That is a simple way of doing chaos management, chaos execution, or fault injection in the Kubernetes style: using GitOps and using operators.
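To make this concrete, here is a minimal sketch of such a declarative chaos spec, a ChaosEngine custom resource that the chaos operator watches and acts on. The name, namespace, labels, and service account below are placeholders, and the fields follow the Litmus 1.x schema, so check them against the version you run:

```yaml
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: sample-app-chaos          # placeholder name
  namespace: litmus
spec:
  appinfo:                        # the application the fault is attached to
    appns: "sample-app"           # placeholder namespace
    applabel: "app=sample-app"    # placeholder label selector
    appkind: "deployment"
  annotationCheck: "false"        # skip the opt-in annotation check for this sketch
  engineState: "active"           # the operator runs chaos while this is active
  chaosServiceAccount: litmus-admin
  experiments:
    - name: pod-delete            # experiment definition pulled from the ChaosHub
      spec:
        components:
          env:
            - name: TOTAL_CHAOS_DURATION
              value: "30"
            - name: CHAOS_INTERVAL
              value: "10"
```

Changing a file like this in Git, for example increasing TOTAL_CHAOS_DURATION, and letting the GitOps tooling sync it is what causes the operator to run the experiment again and record a result you can observe.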
Let's summarize what cloud native chaos engineering is: you want to keep your resilience high, you want to practice chaos engineering, and you want to do it the Kubernetes way, the cloud native way. What does it mean to do it in that fashion? First, you pick an open source chaos infrastructure component, or infrastructure itself, because when it is built in open source there is more community around it and you can depend on that technology surviving for a long time. That is very important. The promise of Kubernetes was really realized because it is neutrally governed and open source; large vendors have come and depended on it purely because it is open source and governed in an open way. The second principle is that you have to have chaos APIs, CRDs; that's very important to do it the Kubernetes way. The third is bring your own chaos model. You don't want to be tied to a particular way of doing chaos: it can keep improving, you can improve a certain type of experiment, or you may already be developing new experiments and want to use the same framework for them, so you should be able to plug them into that framework. And finally, chaos itself has to be community oriented, primarily because you will not be able to develop all sorts of chaos experiments yourself. You have to depend on application owners, vendors, and practitioners: whenever they learn a new way of introducing a fault, they can upstream that experiment. That is the fourth principle. That, broadly, is what cloud native chaos engineering is. With that, let me introduce Litmus Chaos.
Litmus Chaos is a complete framework for finding weaknesses in the Kubernetes platform, its implementations, and the applications running on Kubernetes. Using it, developers and SREs can automate chaos in a cloud native way. We talked about what the cloud native way is: it's all about declarative chaos and being able to automate the entire chaos practice using GitOps. Litmus Chaos is a CNCF Sandbox project, and its mission, as I just described, is all about helping SREs and developers surface weaknesses and keep resilience high. MayaData is the prime sponsor of the project, but because of its vendor-neutral governance and being completely open source, a lot of other companies have joined as maintainers and contributors, including Intuit, Amazon, RingCentral, and Container Solutions. One of the biggest assets of the Litmus Chaos project is the hub itself, and we expect more and more community members to upload or upstream their chaos experiments onto this hub. The community is really spread out; at the moment there are more than 10,000 installations of Litmus, and it is starting to get wider adoption.
Litmus Chaos, as we described, is completely cloud native, and the four principles we covered a little while ago apply to it. The hub is the community way of doing chaos, and there are some good examples of people bringing their own chaos onto the Litmus infrastructure and scaling it up using this infrastructure. To summarize the features: there are chaos API CRDs, so you can manage the entire chaos practice using declarative manifests, including the chaos scheduler. Next is the hub itself; we expect that more than 50% of your chaos needs will already be available on the hub, and all you need to do is learn how to use those experiments and then how to write the new experiments that are specific to your application. For that, a chaos SDK is available in Go, Python, and Ansible. Using the SDK you can bring up the required skeleton of a chaos experiment very easily, put in your chaos logic, and your experiment is ready.
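For a rough idea of that skeleton, a ChaosExperiment custom resource mainly describes how to run your chaos logic: which image to launch, what command to run, and which tunables it exposes. The experiment name, image, and arguments below are illustrative placeholders rather than the exact SDK output:

```yaml
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosExperiment
metadata:
  name: my-custom-experiment              # hypothetical experiment name
  labels:
    name: my-custom-experiment
spec:
  definition:
    scope: Namespaced
    permissions: []                       # RBAC rules the experiment job needs
    image: "example.org/my-chaos:latest"  # placeholder image holding your chaos logic
    command:
      - /bin/bash
    args:
      - -c
      - ./experiments -name my-custom-experiment   # placeholder entrypoint
    labels:
      name: my-custom-experiment
    env:
      - name: TOTAL_CHAOS_DURATION        # tunables exposed to the ChaosEngine, with defaults
        value: "30"
```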
The other feature, which is a very important one, is the chaos portal. The Litmus portal matters because chaos engineering does not stop at the introduction of faults; it's also about monitoring and helping SREs and developers take the right actions to fix the weakness, and you have to be able to do it at scale. We'll talk about chaos workflows in a bit, but while the hub is about getting your experiments together in one place, the chaos portal is about using those experiments, managing your chaos workflows, executing them, and then monitoring them to see what's happening. So the portal is about managing your chaos engineering end to end. It's under development; early versions are there if someone wants to try them out, but it's not formally released to the community yet. End of the year is when we are hoping that will happen.
Litmus has many experiments. Right now we have about 30 plus, and we expect that to grow as the community grows. You see some stars here; those experiments are about injecting faults into the infrastructure: disk, node CPU, node memory, and so on. The other ones target Kubernetes resources themselves; as you can see, there are network duplication, network loss, and kubelet service kill experiments, among others. These are some of the important ones. We have heard many stories where everything is fine, then Kubernetes itself goes down and the blast radius is very, very high. Don't wait for that to happen: use the kubelet service kill experiment and see what your resilience is. That's a good one to have; in fact, it really came from the community using Litmus, and the team was able to put this experiment onto the hub. Those are what we call generic experiments, all grouped under Kubernetes.
There are also application-specific experiments, which are about injecting a fault at the application level. It could be causing database unavailability, or bringing down a Kafka broker. The chaos logic speaks the language of the application rather than of Kubernetes, and we believe that is very important for scaling up the deeper faults you want to inject; these application-specific experiments will help with that.
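As a hedged illustration of chaos logic that speaks the language of the application, an application-level experiment such as a Kafka broker failure is still referenced from a ChaosEngine, but its tunables are Kafka concepts instead of generic Kubernetes ones. The experiment and environment variable names here are illustrative; the authoritative names are in the hub entry for the experiment:

```yaml
# Fragment of a ChaosEngine spec targeting a Kafka deployment (illustrative names)
experiments:
  - name: kafka-broker-pod-failure
    spec:
      components:
        env:
          - name: KAFKA_NAMESPACE         # where the Kafka cluster runs
            value: "kafka"
          - name: KAFKA_LABEL             # label selector for the broker pods
            value: "app=cp-kafka"
          - name: KAFKA_LIVENESS_STREAM   # run a liveness producer/consumer during chaos
            value: "enabled"
          - name: TOTAL_CHAOS_DURATION
            value: "60"
```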
So where can you use Litmus in DevOps? In CI pipelines, of course; starting with very small chaos experiments there is pretty easy. Chaos cannot be executed to its full extent in pipelines because they are short-lived, so deeper chaos can be executed on long-running test beds, which is typically where the code moves after the pipeline. Then there is staging, which is closer to production but where lots of people are still churning code, and you can increase the number of tests there. Then, closest to production and in production itself, you can increase the number of tests further, but in production you want to start small while still trying to cover more scenarios. They can be spread out, but you may want to cover all the failure-injection scenarios that are possible, so that you stay resilient to those faults. It all starts small. You need buy-in from management for chaos engineering; many people are scared of it, to be precise. SREs want to do it, but developers may not want you to do it, and it takes time in any organization to roll out chaos engineering at large scale. So you want developers themselves to see, hey, this is how you can test yourself, and you want to get the automation and integration teams to use chaos; then SREs will be able to convince everyone more easily. As time goes on, people will find it acceptable to run chaos in production as well, on the entire system. As you increase the number of chaos tests that you run in production, the overall resilience increases, but it typically takes a step-by-step approach.
The good news is that it is cloud native, which means you can start automating chaos right in the development lifecycle. So how do you automate it? That's a very important point, and it is kind of one of the promises of Kubernetes. You want to generate not scripts but YAML manifests; that's the fundamental building block for automation. You have your chaos experiment, which is itself a custom resource YAML spec, and you attach it to the application where the chaos is going to run. The management of chaos on an application is another CR, another YAML spec, called the ChaosEngine YAML. Then you attach a schedule for how often you want to run it; that's also a YAML spec. All three of these, the experiment, the application that takes in the chaos experiment, and how often you want to run it, go into manifests; call it your Litmus YAML, for example. To automate this, you put it in Git and use auto-deployment tools like Flux or Argo CD. Whenever a change happens, when a PR with the change is merged, your chaos starts running automatically. And Litmus gives many outputs, not just the chaos result: it also gives chaos metrics that you can push to your Prometheus, so you can automate alerts and the corresponding actions and notifications. Once you get a notification, you want to go and see what exactly happened, debug it, and fix it, so you also get chaos events for correlation and for taking the right action to fix the weakness.
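As a sketch of the GitOps wiring, assuming Argo CD is the deployment tool, an Application resource can point at the directory of Litmus manifests in your repository so that a merged PR is synced to the cluster automatically. The repository URL and path here are hypothetical:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: litmus-chaos-manifests
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/example/chaos-manifests.git   # hypothetical repository
    targetRevision: HEAD
    path: litmus                  # directory holding the ChaosExperiment/ChaosEngine YAML
  destination:
    server: https://kubernetes.default.svc
    namespace: litmus
  syncPolicy:
    automated:
      prune: true                 # keep the cluster in step with what is merged in Git
```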
you scale this up? You are able to automate this with this fundamental
concept. And how can you automate this automation?
Sorry, how can you scale it? Scale it is being able
to run multiple chaos experiments in
either a sequential manner or in parallel, or a combination of
both. Typical example is there are multiple
namespaces applications are spread out across these namespaces
and you are managing all of them. And end
users are really being
served by services that are spread across these namespaces,
right? So faults can happen anywhere. So you want
to simulate a flow chaos
workflow where introduce two faults into two
different namespaces in parallel, then wait for it, and then
do two more faults in parallel and then kind of drain the
node or multiple combinations of that, right? So this
is one simple chaos workflow. And how do you do
that? Is you develop has experiments
into different Yaml manifests and you
keep them ready. And then you apply a workflow
using tools like Argo. The Argo workflow
we've been using,
it goes very well with the has workflows. So using
Argo workflow, you have your experiments ready,
you embed them into an workflow CR,
and then Argo also has a scheduler. So you
can attach that schedule to that and you
develop a bigger Yaml manifest that manages
these litmus experiments. The same thing will happen.
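Here is a rough sketch of what that bigger manifest can look like, assuming an Argo CronWorkflow carries the schedule and a resource step creates the embedded ChaosEngine, which is what hands the fault over to the Litmus operator. The names and the five-minute schedule mirror the demo but are placeholders:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: CronWorkflow
metadata:
  name: catalogue-cpu-chaos            # placeholder name
  namespace: litmus
spec:
  schedule: "*/5 * * * *"              # run the chaos workflow every five minutes
  workflowSpec:
    entrypoint: run-chaos
    serviceAccountName: argo-chaos     # placeholder account allowed to create ChaosEngines
    templates:
      - name: run-chaos
        resource:
          action: create               # creating the ChaosEngine hands the fault to the operator
          manifest: |
            apiVersion: litmuschaos.io/v1alpha1
            kind: ChaosEngine
            metadata:
              generateName: catalogue-pod-cpu-hog-
              namespace: litmus
            spec:
              appinfo:
                appns: "sock-shop"
                applabel: "name=catalogue"
                appkind: "deployment"
              annotationCheck: "false"
              engineState: "active"
              chaosServiceAccount: litmus-admin
              experiments:
                - name: pod-cpu-hog
                  spec:
                    components:
                      env:
                        - name: TOTAL_CHAOS_DURATION
                          value: "60"
```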
Then the same pattern applies: you put these manifests into Git and use Argo CD or Flux to manage auto deployment and change management. You can have many of them and scale this to hundreds of experiments in a system with hundreds or even thousands of Kubernetes nodes. So you get an infrastructure to automate your chaos in a very natural way, you can scale it, and you can plug it into your existing DevOps system, which is GitOps. With that, let's look at a very short demo of Litmus.
I have the following setup: a two-node Amazon EKS cluster where I have deployed the microservices demo application you may all be aware of, Sock Shop, and installed Litmus. A couple of experiments are being run as a workflow against it, and Litmus is set up in admin mode. Admin mode is where Litmus can go and inject chaos into any application, because the service account has the permissions to do that. I have also set up monitoring infrastructure to receive the chaos metrics, and we can see through Grafana what's happening. So with that, let's actually see how it works.
Here are the two nodes, and there is only one application running on this cluster, which is the Sock Shop. Its pods have been running for quite some time, about four or five days. You might see that some of them went down recently; that's because we've been continuously introducing chaos. It's a sign that things are being meddled with, that somebody is continuously watching its resilience.
Let me show you: we put Litmus in admin mode, which means everything runs within the litmus namespace. In the other mode, you can create service accounts such that your chaos actually runs within the namespace of your application. So here you have the chaos operator. You also have the Litmus portal running, the early version of it, and a chaos monitor which is exporting the metrics to a Prometheus server. Then there are chaos events being exported through an event router. And as you can see, some tests are being run, and they are running in the litmus namespace; there are CPU and memory hog experiments. Let me also show the monitoring side: we have Prometheus and Grafana running.
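For reference, the monitoring piece is simply Prometheus scraping the chaos exporter's metrics endpoint; a minimal scrape job might look like the following. The service name and port are assumptions based on this demo setup, so verify them against your own deployment:

```yaml
# Prometheus scrape config fragment (service name and port are assumptions)
scrape_configs:
  - job_name: "litmus-chaos-exporter"
    static_configs:
      - targets: ["chaos-monitor.litmus.svc.cluster.local:8080"]
```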
Now let's look at the application that is running the workflows. We have two chaos workflows configured as Argo workflows, running every fifth minute and every tenth minute. So every five minutes you have pod CPU chaos and pod memory chaos running on two different applications, or containers. Let's look at one. Here you have an Argo workflow for chaos; we named it, and it embeds the chaos experiment. This is what we talked about: you take your entire Litmus chaos experiment and embed it within a workflow, and the ChaosEngine calls the chaos experiments. For example, this is a ChaosExperiment CR on the system, and you can tune the behavior of these chaos experiments through GitOps: you can run the chaos for a longer or shorter duration and increase the number of cores, so everything is possible through GitOps.
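Concretely, tuning through GitOps means editing environment values like these in the ChaosEngine fragment and merging the change. The names shown are the usual pod-cpu-hog tunables, but confirm them against the hub entry for your Litmus version:

```yaml
# ChaosEngine experiment fragment: the knobs you edit through Git
experiments:
  - name: pod-cpu-hog
    spec:
      components:
        env:
          - name: TOTAL_CHAOS_DURATION   # how long the CPU hog runs, in seconds
            value: "120"
          - name: CPU_CORES              # how many cores to stress
            value: "2"
```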
Let's look at the other workflow. Similarly, memory chaos is being run on a different pod, the orders pod, while the previous one targeted the catalogue pod, and they run in the same namespace. So the application is Sock Shop, and you're running memory chaos on the orders application. You can specify it declaratively like that, inside a ChaosEngine CR; the experiment is memory hog, you can tune the behavior of your chaos, and it's scheduled every fifth minute. As you can see, the workflow itself was executed within five seconds, and you can go and see the results of that.
This is the Argo way of seeing what is going to happen; the Litmus portal that is under development will have more detailed views for the chaos itself, and it uses Argo workflow concepts. Let's look at what's happening. This is a Grafana dashboard that we put together for Sock Shop, where the red lines indicate chaos injection. As you can see, every fifth minute there are the green and yellow lines: green is the catalogue CPU hog, the other one is the memory hog, and they are being injected into different containers. Whenever a CPU hog is injected into catalogue, its performance goes down, and you can see that orders goes down as well. Similarly, whenever a memory hog is introduced into orders, the performance of the catalogue pod also goes down. Of course, in reality you're not going to do this again and again every five minutes, but let's say some fault happens every day, and a larger fault happens randomly in a week; all those possible combinations can be implemented in an automated way.
That's how Litmus Chaos can help in automating chaos engineering, thereby helping you achieve higher resilience. A small introduction to the Litmus portal: it's very much under development. The idea is that you will be able to schedule the workflows themselves, and you get your own hub, so it's all about bringing experiments from your hub and adding more experiments. You have the public chaos hub, but you also want to share experiments, including new ones, with your team. The Litmus portal will help you bring more team members together, develop new experiments, create more chaos workflows, and monitor them. So it provides an end-to-end infrastructure, a tool set, to practice chaos engineering and resilience engineering.
That's it for the demo. In summary, do practice chaos engineering, and do it in a cloud native way; Litmus Chaos can help you do that. As time goes by and you add more and more chaos tests in production, you will end up with higher and higher resilience. This is definitely a preferred way to increase your resilience on Kubernetes. So with that, thank you very much. Do try out Litmus; there are very easy getting-started guides available, and there is a Litmus demo application workshop, so whatever I showed in the demo, you can set it up within a few minutes. Thank you very much, folks. Have a great time with Litmus Chaos, and do try it out. Join the Litmus Chaos Slack, and the litmus channel on the Kubernetes Slack. With that, thank you very much, and thanks for being in the audience.