Transcript
Folks, welcome to Conf42 Cloud Native. I'm Uma Mukara, founder and CEO of ChaosNative, and I'm also a co-creator and maintainer of LitmusChaos, which is a CNCF project for chaos engineering. Today in this session I'm going to talk about what chaos engineering means in cloud native environments, and how cloud native developers and SREs can take control of reliability.
In this session we're going to touch upon what reliability is and what it means to achieve reliability in cloud native environments. We will look at chaos engineering as a way to achieve that reliability, along with good practices for both SREs and developers, and I will also touch upon LitmusChaos, which is a chaos engineering project for achieving reliability in cloud native environments.

So what is reliability, and what does it mean in cloud native environments? Generally, reliability means you run your services without any outage; then you are called very reliable. But it does not end there. It also means that certain SLOs or business SLAs need to be met even while you are running without an outage, for example latency of a service, or performance under scale. You also sometimes need to measure reliability when you are asked to ramp up your services: there are certain days on which you are going to scale your services to a high degree, and your SLOs need to be met on such days. That is also a measure of reliability. A simple SLO check of this kind is sketched below.
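As an illustration only, here is a minimal sketch of checking a latency SLO against Prometheus. The Prometheus address, the metric name, the `checkout` job label, and the 300 ms target are all assumptions for the example; substitute the values from your own environment.

```python
# Minimal latency-SLO check against Prometheus (illustrative values only).
import requests

PROMETHEUS_URL = "http://prometheus.monitoring.svc:9090"  # assumed in-cluster address
SLO_P99_SECONDS = 0.3  # example target: p99 latency under 300 ms

# p99 latency over the last 5 minutes for a hypothetical "checkout" service.
query = (
    "histogram_quantile(0.99, "
    'sum(rate(http_request_duration_seconds_bucket{job="checkout"}[5m])) by (le))'
)

resp = requests.get(f"{PROMETHEUS_URL}/api/v1/query", params={"query": query})
resp.raise_for_status()
result = resp.json()["data"]["result"]

p99 = float(result[0]["value"][1]) if result else float("nan")
print(f"p99 latency: {p99:.3f}s (SLO: {SLO_P99_SECONDS}s)")
print("SLO met" if p99 <= SLO_P99_SECONDS else "SLO violated")
```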
There is also the upgrade scenario: you end up upgrading services in production, and they need to keep adhering to your SLOs throughout. Put all of these together and you have a measure of reliability; if you satisfy all of these criteria, your services are said to be reliable.

Why is reliability important in cloud native?
Primarily, in cloud native your application is now split into multiple microservices. That means you have more services to manage, or more applications to manage within your larger service, and those applications are changing very fast in your environment, primarily because of the advances in your CI/CD pipelines and in the CI/CD pipelines of the applications that you consume just as a service, for example some other cloud native service. Those changes are coming into your environments fast. Earlier it was very common to see big changes happen in production environments every quarter or two, but in cloud native they are happening almost every week. Because you have so many changes, you don't want to schedule them all for a fixed date; you want to automate them so that upgrades happen as soon as possible, and then have a system to make sure that, irrespective of these upgrades, your systems stay reliable. That's your target, and that's what brings us to the topic of how to achieve that reliability in cloud native. In summary, reliability is very important because there are many applications you're dealing with in cloud native, and the changes to them are coming very fast.

How do you achieve such reliability, or how do you plan and strategize to achieve it for your cloud native environment? One answer is that you implement the practice of chaos engineering from the beginning, and you do it at scale. Then you have at least a good, proven way of achieving reliability.
Let's look at what chaos engineering is, why you need it, and how you practice it.

On the "what" part: it's about breaking things on purpose. You could also say: don't wait for the failure to happen, inject it yourself. It is the practice of resilience engineering, and it's about being better prepared for disasters. When a large failure or an outage happens, how well prepared are you to bring your services back online? If you have been practicing chaos engineering, you will already have dealt with such a situation, and you are now better prepared.

So why chaos engineering? Because big outages are expensive, and sometimes smaller outages can also be expensive, depending on the SLAs you have with your end users. And you cannot really prevent outages: no matter how well prepared and tested you are, outages will happen, so you had better be prepared for them. There are too many unknowns and too many changes happening; we just discussed why reliability is important in cloud native. Chaos engineering is needed because you don't know everything about your environment, its knowns and its unknowns. Another reason to do it is that there are tools in place now. There is so much knowledge available in the chaos engineering space, especially in cloud native, that you can easily practice it, avert bigger financial losses, and be in control of your reliability. How do you do it?
Primarily, it is a culture. Many people are still grasping the need for chaos engineering, from developers and SREs all the way to the management responsible for operational reliability. So you start with learning and advocating chaos engineering; that's really how you begin. Then you create a strategy, choose a platform that suits your needs, and build a real chaos engineering service within your environment, rather than just a set of chaos experiments. You need to look at the bigger picture: chaos engineering has to have the goal of increasing reliability over a period of time. One way to start, and to keep repeating, is game days. These have proven very helpful for building a culture, as well as a practice of checking whether your chaos engineering is working well or not. It's always difficult to go and break things in production, so you start small, keep fixing things, and then slowly move on to pre-production and production. I'll talk about that later in this session.
So what are the business benefits? You cannot avoid outages, but you can shorten them: being better prepared means you can prevent large revenue losses by finding problems early in pre-production and fixing them. Your overall customer satisfaction will go up, and you will be able to retain customers because you fixed issues before they actually caused losses for them. That is the business side. You can also move to a bigger and better new architecture faster, because you now have a way of finding out how resilient your systems are; that's definitely a good benefit. The same goes for scaling your services: you implement your larger service at the optimal size and scale up as needed, and you can test with chaos engineering that you're going to scale well when the need arises, so you don't need to run at a bigger scale than necessary. Another benefit is knowing how well your team is prepared for a given fault. You don't need to guess; you know your team can respond well because you have just experienced it by injecting a similar fault.

So what are the business use cases where chaos engineering can be considered? As digital transformation is happening,
we are all moving to microservices. You would want to see what your reliability is today and what benchmark you need after you move to a microservices architecture; chaos engineering can be put in place to benchmark that. You can also accelerate the journey to containerization, because you now have a confident way of measuring reliability, and you can benchmark, measure, and scale your service. There are many sectors where chaos engineering has proven helpful, especially where systems are large scale and critical: for example the banking, retail, and e-commerce sectors. These are already in production and are very critical as far as user experience is concerned, and any outages there cause bigger financial losses, so chaos engineering really helps in these sectors. There is also edge computing: we are moving there very fast, many such services are already in place, and you want to automate your failure testing at the edge, so that's another area where chaos engineering is very helpful.

Where do you do chaos engineering? In many places. You find it in game days; you find developers using it in CI pipelines, or SREs using it as a way to trigger continuous deployments, or to measure whether things are okay after a continuous deployment. There are cases where the failure testing in your pipelines or staging environments is not good enough and you want to automate more corner-case failure scenarios. And there is a more advanced use case where an application has been upgraded in production and you want to trigger some failure testing against it in a random way; that is triggering chaos on the trigger of an application change. These are various ways, reasons, and use cases in which chaos engineering can be very helpful.
Let's look at what cloud native chaos engineering is. We talked about why chaos engineering is important in cloud native. When you are doing chaos engineering in cloud native, you can generally consider certain principles. Cloud native is a reality right now, with Kubernetes having crossed the chasm, whereas chaos engineering is in the early days of implementation, or of being considered a must for reliability. There are a lot of options available today to do chaos engineering in cloud native at scale, and you can generally follow these principles while choosing an implementation.

It's always good to go with a technology that is open source, proven, and community collaborated. Chaos experiments developed through community collaboration have less chance of false positives or false negatives because they're well tested, and you are in control of exactly what fault is being injected. These chaos experiments, chaos workflows, or chaos scenarios, whatever you call them, also go through changes and need to be maintained, so it's better to have a good API, or operators, to do the lifecycle management of such chaos experiments. Scaling is also very important: when you scale your services, chaos engineering has to scale as well. Think of killing containers where there are thousands of them and you want to bring a portion of them down for whatever reason; your infrastructure for inducing chaos should scale well. Finally, observability should be open. It's very important to be able to observe exactly what is happening when chaos is introduced. You are most likely using observability platforms based on Prometheus, so your chaos events should also be open in nature: you should have a clear idea of when chaos was injected, what that chaos was, and how it was injected. A small sketch of exposing chaos injection events alongside your Prometheus metrics follows.
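Purely as an illustration of "open" chaos observability, here is a minimal sketch that records a chaos injection event as a metric using the prometheus_client library and a Pushgateway. The Pushgateway address, the metric name, and the experiment and target labels are assumptions for the example.

```python
# Record a chaos injection event so it is visible next to your other
# Prometheus metrics (e.g., for annotating dashboards during game days).
from prometheus_client import CollectorRegistry, Gauge, push_to_gateway

PUSHGATEWAY = "pushgateway.monitoring.svc:9091"  # assumed address

registry = CollectorRegistry()
injection_time = Gauge(
    "chaos_injection_timestamp_seconds",
    "Unix time at which a chaos fault was injected",
    ["experiment", "target"],
    registry=registry,
)

# Mark the moment a hypothetical pod-delete fault hits the checkout service.
injection_time.labels(experiment="pod-delete", target="checkout").set_to_current_time()
push_to_gateway(PUSHGATEWAY, job="chaos-events", registry=registry)
```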
Consider all of these principles while choosing your platform for chaos engineering in cloud native.

Let's now talk about what this means for SREs and for developers. For SREs, there are many ways to start, and it primarily starts in staging; then you move on to pre-production, and then to production. As an SRE, you have to start believing that chaos engineering is a helper tool and that it brings the business and operational benefits we talked about earlier in this session, and you need to be able to convey those benefits to your teammates and your management. The way you do that is by running some simple chaos experiments in staging, for example injecting failures and seeing whether your autoscaling on Kubernetes works or not. You also generally run a simple game day as a way to build confidence in the cultural adoption of chaos engineering. A minimal example of such a staging experiment is sketched below.
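As one possible shape for a first staging experiment, here is a hedged sketch using the official kubernetes Python client: delete a random pod behind a deployment and check that the replacement pods come back within a timeout. The namespace, label selector, and expected replica count are placeholders.

```python
# Simple staging experiment: kill one random pod of a deployment and
# verify that Kubernetes brings the deployment back to full strength.
import random
import time
from kubernetes import client, config

NAMESPACE = "staging"            # placeholder namespace
LABEL_SELECTOR = "app=checkout"  # placeholder label selector
EXPECTED_READY = 3               # placeholder expected replica count
TIMEOUT_SECONDS = 120

config.load_kube_config()
v1 = client.CoreV1Api()

pods = v1.list_namespaced_pod(NAMESPACE, label_selector=LABEL_SELECTOR).items
victim = random.choice(pods)
print(f"Deleting pod {victim.metadata.name}")
v1.delete_namespaced_pod(victim.metadata.name, NAMESPACE)

# Steady-state check: wait for the expected number of ready pods to return.
deadline = time.time() + TIMEOUT_SECONDS
while time.time() < deadline:
    pods = v1.list_namespaced_pod(NAMESPACE, label_selector=LABEL_SELECTOR).items
    ready = sum(
        1
        for p in pods
        if p.status.container_statuses
        and all(cs.ready for cs in p.status.container_statuses)
    )
    if ready >= EXPECTED_READY:
        print(f"Recovered: {ready} pods ready")
        break
    time.sleep(5)
else:
    raise SystemExit("Deployment did not recover within the timeout")
```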
In summary, you start in staging, or as a trigger to your CD, with some simple experiments. That can go on for a quarter, and you can slowly increase the complexity of those experiments. You need to gain confidence, as well as your team's confidence, and then slowly move on to other areas: after a quarter, move to pre-production. It generally takes more than a couple of quarters to do any real failure testing in production, because you should really convince yourself and your fellow team members that your chaos engineering infrastructure is stable, that it is not producing false positives or negatives, that you have seen real benefits from injecting faults, and that you are able to respond to small and big outages. You plan, and then you move on into production. That's really about being better prepared.

Do you really need chaos engineering for developers in a cloud native environment? We've been seeing a lot of positive response from the developer community to chaos engineering, and it is not really tied to whether your SREs are practicing chaos engineering or not. It's really an extension of your existing CI pipelines.
So why do you need chaos engineering in your CI pipelines? Primarily because the changes are happening fast: you are supposed to develop and ship your services fast. At the same time, your CI pipelines involve a lot of other microservices which are not developed by you. You depend on them, and those microservices make your pipelines more dynamic, more complex, and bigger. You need a defined strategy not only for testing your own code, but also for testing the other microservices and the platform changes underneath your pipelines. So, typically, your regular pipeline takes care of your own code; in addition, you need to consider continuous verification of the underlying platform. It may be a good idea to run your pipelines on multiple platforms, different clouds or on-prem, because it is a microservice and you don't know where it is all going to run, and it is better to inject failures into those platforms in the pipeline and see whether your code behaves well. Similarly, other microservices can fail, and it is better to test, right inside the pipeline, how your code responds to such a failure. This continuous verification of platform failures and of dependent microservice failures is really what chaos engineering means for developers in cloud native. One possible shape of such a pipeline check is sketched below.
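As a hedged illustration of verifying a dependency failure inside a CI stage, the sketch below scales a dependent deployment down to zero with the kubernetes Python client and then checks that the service under test degrades gracefully instead of hanging. The deployment name, namespace, service URL, and accepted status codes are all assumptions.

```python
# CI-stage sketch: take a dependency away and assert the service under
# test fails fast and gracefully (no hangs, a sensible status code).
import requests
from kubernetes import client, config

NAMESPACE = "ci-test"                # placeholder namespace
DEPENDENCY_DEPLOYMENT = "inventory"  # placeholder dependent microservice
SERVICE_URL = "http://checkout.ci-test.svc:8080/cart"  # placeholder endpoint

config.load_kube_config()
apps = client.AppsV1Api()

# Inject the fault: scale the dependency to zero replicas.
apps.patch_namespaced_deployment_scale(
    DEPENDENCY_DEPLOYMENT, NAMESPACE, {"spec": {"replicas": 0}}
)

try:
    # The service should answer quickly with a degraded-but-defined response.
    resp = requests.get(SERVICE_URL, timeout=5)
    assert resp.status_code in (200, 503), f"unexpected status {resp.status_code}"
    print("Service degraded gracefully:", resp.status_code)
finally:
    # Roll back the fault so later pipeline stages see a healthy environment.
    apps.patch_namespaced_deployment_scale(
        DEPENDENCY_DEPLOYMENT, NAMESPACE, {"spec": {"replicas": 1}}
    )
```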
With that introduction to why chaos engineering, and why chaos engineering for cloud native, let me introduce LitmusChaos, a project that we started a few years ago with these cloud native chaos engineering principles as its core goals. Litmus supports all of these principles very well, and the latest release supports GitOps and open observability as well. It is a CNCF project, currently at the Sandbox stage, and we are hoping it will move to incubation very soon. It has seen great adoption, with more than 50,000 installations of the operator running, and we have built a good community around the usage of Litmus, primarily around Kubernetes chaos engineering.
At the outset, Litmus is really a simple Helm chart, whether for one developer, one SRE, an entire team, or an enterprise. Litmus is a Kubernetes application that scales very well. All the experiments, or chaos workflows, are published in the public Chaos Hub, and you can pull them into your private environments and run them in a completely air-gapped setup. When you install Litmus, you get a centralized chaos control plane called the Litmus portal, and you can either run predefined chaos workflows or construct chaos workflows very seamlessly and target them against any Kubernetes resource. You can also target them at non-Kubernetes resources such as VMs, bare metal machines, and other cloud platforms. And all of this can be done with integrated GitOps tooling such as Argo CD or Flux CD; when integrated, chaos can be triggered as a change happens to your application. So, in a nutshell: Litmus has a chaos control plane and an actual chaos execution plane; you target your chaos from the centralized portal or through GitOps-controlled infrastructure; the chaos workflows can be directed towards any Kubernetes resource, or any non-Kubernetes resource as well; it is highly declarative, with a scalable API; and it is obviously open source, so you are in control of your chaos. A sketch of declaring chaos through the Kubernetes API follows.
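To make "highly declarative" concrete, here is a hedged sketch of creating a Litmus ChaosEngine custom resource with the kubernetes Python client. The layout follows the litmuschaos.io/v1alpha1 ChaosEngine CRD as published for the pod-delete experiment, but field names can vary between Litmus versions, and the namespace, labels, and service account here are placeholders; check the Chaos Hub documentation for the release you install.

```python
# Declarative chaos: create a ChaosEngine custom resource that asks the
# Litmus operator to run the pod-delete experiment against an app.
from kubernetes import client, config

config.load_kube_config()
custom = client.CustomObjectsApi()

chaos_engine = {
    "apiVersion": "litmuschaos.io/v1alpha1",
    "kind": "ChaosEngine",
    "metadata": {"name": "checkout-chaos", "namespace": "staging"},
    "spec": {
        "appinfo": {
            "appns": "staging",          # placeholder target namespace
            "applabel": "app=checkout",  # placeholder target label
            "appkind": "deployment",
        },
        "engineState": "active",
        "chaosServiceAccount": "pod-delete-sa",  # placeholder service account
        "experiments": [
            {
                "name": "pod-delete",
                "spec": {
                    "components": {
                        "env": [
                            {"name": "TOTAL_CHAOS_DURATION", "value": "30"},
                            {"name": "CHAOS_INTERVAL", "value": "10"},
                        ]
                    }
                },
            }
        ],
    },
}

custom.create_namespaced_custom_object(
    group="litmuschaos.io",
    version="v1alpha1",
    namespace="staging",
    plural="chaosengines",
    body=chaos_engine,
)
```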
There are a lot of good examples of how CI pipelines can use Litmus: there are known examples of GitLab, GitHub Actions, Spinnaker, and Keptn using Litmus to introduce a chaos stage. At the outset, all of this chaos logic is bundled into a library, and with a simple API call into that library you are able to inject chaos and then get the result of that chaos experiment.
And Litmus is not only for Kubernetes. It is a Kubernetes application, but it can inject failures into non-Kubernetes platforms, and it can scale very easily, to a large scale as well. We have examples of how you can inject failures into cloud platforms such as AWS, GCP, or Azure, and we also have some initial experiments for injecting chaos into the VMware platform. These are expected to grow heavily in the months and quarters to come. Litmus is well adopted and stable, and it is also ready for enterprise adoption. I am part of the ChaosNative team, and we provide enterprise support for enterprises that are deploying Litmus in production or non-production environments.

With that, I would like to thank you for watching this session. You can reach me on Twitter, or on our channel on the Kubernetes Slack. Thank you very much.