Transcript
Are you an SRE, a developer, or a quality engineer who wants to tackle the challenge of improving reliability in your DevOps? You can enable your DevOps for reliability with ChaosNative. Create your free account at ChaosNative Litmus Cloud.
Welcome to Conf42 SRE 2021.
I am Uma Mukkara, and today I will be speaking about how you can use chaos engineering to do continuous validation of SLOs, the service level objectives, and thereby improve the resilience of the services you operate. Before we start, a little bit about who we are. ChaosNative provides chaos engineering solutions for improving reliability around cloud native and hybrid services. I am Uma Mukkara, the CEO of ChaosNative and also a co-creator and maintainer of the popular open source chaos engineering project LitmusChaos. So let's talk about service uptime, which is the primary requirement, or deliverable, of an SRE.
Why is reliability so important, and why is there such a big function called SRE, or site reliability engineering, around it? Digitization, or digital transformation, is a reality today. Ecommerce traffic is increasing multifold, and retaining customers is really important. Keeping your users satisfied is really important. A few transaction drops could cause a big deficit in credibility, and you could lose those customers just because you dropped a few transactions out of thousands, or sometimes millions. So in general, in the modern era, end users' expectations of services have increased. We are also delivering software faster, which really means the software needs to be more reliable than ever. So we need faster changes to the software and more reliable services; at least, that is the expectation.
Testing mechanisms have also improved. People are doing a great amount of good quality testing through DevOps, but that is not sufficient. The proof is that we continue to see service outages now and then. So the idea here is to use surprise testing in production: continue to break things in production so that you find the weaknesses early and fix them, and you continuously improve the reliability of the service. These are some example companies that are using chaos engineering to improve their service reliability.
In the cloud native space, there is another reason why reliability is a bigger challenge. In traditional DevOps you build and ship at a certain interval; in cloud native the same is done much faster. You build fast, you ship fast, and the number of microservices, or the number of application containers, that you need to deal with has grown many fold. What this really means is that you are getting more binaries, or more application changes, into your service environment, and you are also getting them faster. So the chance that something else fails and affects your service is higher now, in fact multifold higher, maybe ten times, maybe a hundred times. So reliability becomes a bigger question: what happens if something else fails? Will I continue to work properly? That's the question an application or a service asks in the cloud native space.
So, to summarize: the microservices application that you are developing in cloud native is less than 10% of the code in your entire service. Your service depends on other cloud native services, other cloud native platforms such as Kubernetes, and the underlying infrastructure services. So you need to validate not just the negative and functional scenarios within your application, but also whether your service will continue to run fine if a fault happens in any of the other 90% of dependent services or software. That's the bigger challenge we are dealing with in cloud native, and the answer is to practice chaos engineering.
What is chaos engineering? You may say: I'm already doing some testing, some negative testing, also called failure testing or fault testing. That is mostly about application-related negative scenarios. Here we are talking about introducing chaos testing, which covers failures of dependent components and how those affect your application or service.
So we are saying that you power up your product engineering by adding chaos engineering, and it is always an incremental process. You cannot change the reliability of a system through chaos engineering in a quarter or two quarters. It's an incremental process, a continuous process. And just like any other engineering process, you need to practice chaos engineering as an extension to your existing development or DevOps processes. The end result is that you build complete digital immunity for your services: whatever happens, whatever fails, the service continues to run fine. That's the promise you are trying to achieve.
So if you want me to summarize what chaos engineering is: it's about breaking things on purpose in DevOps. It could be in production or pre-production, or at development time itself. But doing this as an engineering process is what makes chaos engineering very, very effective. You try to cover the entire set of fault and dependency scenarios, you design chaos experiments, you automate these chaos workflows, and you collaborate across your DevOps. If you are in Ops, you collaborate with Dev, and vice versa, and you integrate chaos engineering into your existing tools. That makes chaos engineering complete, and it incrementally results in better metrics around service operations.
This is a very simple way of describing how to do chaos engineering, and the section below talks about how you do it in cloud native. Introducing a fault is one way to describe chaos, but chaos engineering is really about asking: are my service level objectives continuing to be met? That's the real end goal. If your SLOs are being met, then you're good. If not, there is a problem which you need to fix. And the way you do this in cloud native is to follow the same principles you follow for your application development and operations: using operators and custom resources. One example that I will talk about a little later is LitmusChaos. LitmusChaos follows this approach to do chaos engineering in a completely cloud native way.
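To make "operators and custom resources" concrete, here is a minimal sketch in Python using the official kubernetes client. The ChaosEngine fields shown are only illustrative of the Litmus CRD shape (names, namespaces, and labels are made up); check the Litmus documentation for the exact schema of the version you run.

```python
# Sketch: declaring chaos as a Kubernetes custom resource and letting the
# chaos operator act on it. Field values below are hypothetical examples.
from kubernetes import client, config

config.load_kube_config()  # use load_incluster_config() when running in-cluster
api = client.CustomObjectsApi()

chaos_engine = {
    "apiVersion": "litmuschaos.io/v1alpha1",
    "kind": "ChaosEngine",
    "metadata": {"name": "checkout-chaos", "namespace": "shop"},
    "spec": {
        # Target workload and fault are both declared, like any other K8s object.
        "appinfo": {"appns": "shop", "applabel": "app=checkout", "appkind": "deployment"},
        "chaosServiceAccount": "litmus-admin",
        "experiments": [{"name": "pod-delete"}],
    },
}

# The chaos operator watches ChaosEngine objects and runs the experiment.
api.create_namespaced_custom_object(
    group="litmuschaos.io",
    version="v1alpha1",
    namespace="shop",
    plural="chaosengines",
    body=chaos_engine,
)
```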
So, to summarize chaos engineering: if you have challenges related to service failures, or with reproducing a service failure, or you are unable to recover from a service failure fast enough, then you probably need chaos engineering. And if you invest in chaos engineering, these are the benefits, the returns on your investment: you will have a faster way to identify or inject a failure scenario, you will have reduced MTTR and better MTTF, and you will have increased time between failures. That's exactly what you want: fewer outages.
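As a rough illustration of those metrics (a sketch with made-up incident data, not output from any tool), MTTR is the average time from an outage starting to it being resolved, and the time between failures is the average gap between the starts of consecutive outages:

```python
from datetime import datetime, timedelta

# Hypothetical incident log: (start, end) of each outage.
incidents = [
    (datetime(2021, 5, 1, 10, 0), datetime(2021, 5, 1, 10, 20)),
    (datetime(2021, 5, 9, 2, 30), datetime(2021, 5, 9, 2, 45)),
    (datetime(2021, 5, 20, 18, 0), datetime(2021, 5, 20, 18, 5)),
]

# MTTR: mean time to repair, averaged over all incidents.
mttr = sum(((end - start) for start, end in incidents), timedelta()) / len(incidents)

# Mean time between failures: average gap between consecutive outage starts.
starts = [start for start, _ in incidents]
gaps = [later - earlier for earlier, later in zip(starts, starts[1:])]
mtbf = sum(gaps, timedelta()) / len(gaps)

print(f"MTTR: {mttr}, MTBF: {mtbf}")
```

The point of practicing chaos engineering is to push MTTR down and push the gap between failures up.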
So what are some of the chaos engineering use cases? We have all heard of game days, where you try to surprise the operations team. That's generally how you start chaos engineering. But once you buy into the benefits of chaos engineering, you introduce it into your dev pipelines, your CI pipelines, or into your quality engineering pipelines and test beds. And if you're looking at Ops as an SRE, you will use chaos engineering for continuous validation of your service level objectives. That's really the goal an SRE would look for. Let's look at what a service level objective is, and what service level objective validation really means, in a bit more detail.
If you ask me what an SLO, or service level objective, is: it's really as simple as telling whether my service is operating optimally and correctly, as expected. Typically SLOs are observed. You have good monitoring dashboards and monitoring systems, and the answer you're trying to get is: is my service running optimally right now, can I say that? And also about history: was there a problem yesterday, last week, last month? How has my service been performing in the recent past? That is SLO observation.
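As a concrete example of what observing an availability SLO boils down to (a sketch with a hypothetical 99.9% objective and made-up request counts, independent of any particular monitoring system):

```python
# Availability SLO over a window: successful requests / total requests.
SLO_TARGET = 0.999  # 99.9% availability objective

def availability(success_count: int, total_count: int) -> float:
    """Fraction of requests served successfully in the observation window."""
    return success_count / total_count if total_count else 1.0

# Hypothetical numbers pulled from your monitoring system for the last 30 days.
observed = availability(success_count=29_985_000, total_count=30_000_000)

error_budget = 1 - SLO_TARGET                 # how much failure the SLO allows
budget_spent = (1 - observed) / error_budget  # fraction of the budget consumed

print(f"availability={observed:.4%}, SLO met: {observed >= SLO_TARGET}, "
      f"error budget spent: {budget_spent:.0%}")
```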
Then what is SLO validation? SLO validation is really about: so far so good, but how will my service be in the next minute? What happens if something goes wrong? Can I guarantee my service will stay up? That is validating an SLO. How do you do that? You continuously pull in faults against your service: a dependent-component failure is scheduled, and then you validate whether your service continues to perform, whether your SLO is still met. That's the idea of validating an SLO and making sure the service will continue to run fine no matter what happens. In other words, chaos engineering is used to guarantee that, no matter what, your service is okay.
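A minimal sketch of that inject-then-check loop is below. The fault injection and the availability check are placeholder functions standing in for your chaos tooling and monitoring system; they are not a real API.

```python
import time

SLO_TARGET = 0.999
FAULT_DURATION_S = 60

def inject_fault(name: str) -> None:
    """Placeholder: in practice this triggers a fault via your chaos platform."""
    print(f"injecting fault: {name}")

def current_availability() -> float:
    """Placeholder: in practice this queries your monitoring system."""
    return 0.9995

def validate_slo_under_chaos(fault: str) -> bool:
    inject_fault(fault)           # schedule a dependent-component failure
    time.sleep(FAULT_DURATION_S)  # let the fault play out
    ok = current_availability() >= SLO_TARGET
    print(f"fault={fault}, SLO met: {ok}")
    return ok                     # False means a weakness was found; go fix it

if __name__ == "__main__":
    validate_slo_under_chaos("pod-delete")
```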
The best practice of chaos engineering is not to do it only in Ops, but to introduce chaos as a culture into your entire DevOps. Of course, there will be some initial inertia from various segments of your DevOps, but once you introduce the benefits to your organization through game days, you will be able to convince people that chaos engineering is in fact a good practice, and you can start introducing it into your quality engineering efforts, your pipelines, and so forth. And of course, on the operations side, you do continuous validation of SLOs through chaos engineering.
The typical process is: you find a suitable chaos engineering platform. Don't try to do everything on your own; there are tools and platforms available that let you get started very quickly, and much of the work is already done by these platforms. Then you start spending time identifying chaos scenarios, designing them properly, implementing them properly, and then automating them. As you automate, you will find more chaos scenarios, or the need for more chaos scenarios, and then the DevOps process kicks in for chaos engineering as well.
So, in summary, the idea for improving reliability is to take a chaos engineering approach underneath and improve your reliability in incremental steps. Let's look at one such platform that helps you do chaos engineering. I am a LitmusChaos maintainer, so of course I'll be talking about Litmus here. Litmus has been around for about four years now, and it has been adopted by some big enterprise DevOps teams. It is in good usage: Litmus sees a thousand-plus downloads every day, which shows that many people are using it on a daily basis, in CI pipelines and so forth. More importantly, Litmus is very stable, with the 2.0 general availability done recently. It has a lot of experiments readily available through ChaosHub, and it has a very dynamic community, with lots of contributors and many vendors coming in to add significant features.
So you can use Litmus to do real, practical chaos engineering in your DevOps. Overall, you have something called Chaos Center in Litmus. That's where the team, the DevOps persona, whether you're a developer, an SRE, or a QA engineer, can come in and design and develop a chaos workflow or chaos scenario into your privately hosted chaos hubs. Or you can pull chaos experiments from the public hub, if it's accessible from your environment. You end up designing and implementing a chaos workflow, and it can be targeted against various types of resources, including cloud platforms, bare metal, or VMware resources, apart from Kubernetes itself. The typical approach is that you have a lot of experiments of various types, and you have a good SDK as well. If you want a new experiment, you can write one, and if you already have chaos experiment logic, you can pull it in through a Docker container and push it into the LitmusChaos platform very easily.
You use these experiments to build a Litmus workflow, you write the steady state hypothesis validation using Litmus probes, and you use it in any of the following use cases: SLO validation or management, which is what we just talked about; continuous chaos testing in your quality engineering; game days; or validating that your observability system is working fine. That last one is another very important use case I have seen people use chaos engineering for. You have great investments in observability, and you don't know whether they will actually help you when there is a real service outage. How do you know you have everything you're going to need when there is a failure? So why not introduce a failure and see, and keep introducing failures and see, whether your observability platforms, your investments, are yielding the right returns. And many of us do scale testing or performance testing: introduce chaos there and see whether things are still okay. So these are some of the use cases you can use chaos engineering for.
So let's look at how chaos engineering happens with Litmus. You have Chaos Center, as I mentioned a little earlier, and your goal is to write fault templates, or chaos workflows, into a database. It could be a Git-backed database, or the database provided by Chaos Center itself, such as MongoDB or Percona. You will have a certain set of people writing the chaos experiments or chaos workflows, and another set of members who either just view what's happening or schedule those chaos workflows. Chaos Center allows everybody to collaborate and work together, like in any other typical dev environment. Once you have the fault templates, you schedule them against various resources, you validate resilience, and you generate reports.
More importantly, Litmus also has additional advanced features like auto remediation. If the blast radius turns out to be larger than expected, or your chaos is getting out of control, you can take remediation actions through Litmus. We also have command probes that run during chaos or post chaos; you can take any action you want, so as an action you would initiate a remediating task here to control things or bring the services back quickly.
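As a rough sketch of that idea (the threshold, the command, and the helper below are hypothetical placeholders, not the Litmus command probe API), a post-chaos check could run a command and trigger remediation when the steady state looks violated:

```python
import subprocess

LATENCY_THRESHOLD_MS = 500  # hypothetical steady-state limit

def p99_latency_ms() -> float:
    """Placeholder: query your monitoring system for the current p99 latency."""
    return 620.0

def remediate() -> None:
    """Remediation action: scale a (hypothetical) checkout deployment back up."""
    subprocess.run(
        ["kubectl", "scale", "deployment/checkout", "--replicas=5", "-n", "shop"],
        check=True,
    )

# Post-chaos check: if the steady state is violated, kick off remediation
# instead of only reporting a failed experiment.
if p99_latency_ms() > LATENCY_THRESHOLD_MS:
    remediate()
```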
So, to summarize, the typical process is: you introduce the platform, you develop chaos scenarios, you automate them, and once you find a particular experiment beneficial, you put it into your regular QA as well. That is shift-left chaos testing.
Where do you start chaos engineering, and in which layer of the stack? Typically you start with the infrastructure layer, which is the easiest most of the time. Then you move into your message queues or proxy servers, such as Kafka, and the middle-layer API servers. Then you get into your databases and stateful applications, and finally you have the actual application layer itself.
Now let's look at how SLO validation happens in LitmusChaos. Litmus has a Lego block, as I call it: the chaos experiment, or Litmus experiment. The experiment has two parts. One is the fault itself: how you declaratively specify a fault, how long the fault should run, and what the parameters of the fault are. The other is the probe, which is the steady state hypothesis validation: what can I keep observing before the chaos, during the chaos, and after the chaos? There are multiple types of probes that Litmus provides, and together these make a chaos experiment. Consider that as a Lego block which is declaratively very efficient for tuning a given fault and a given steady state hypothesis validation. You have many such Lego blocks, or chaos experiments, in the Litmus ChaosHub. You use them to build a meaningful chaos scenario, like a Lego toy, and you schedule it. Once you schedule it, the steady state hypothesis validation is already built into the workflow, and you just observe the result. If the resilience score provided by the Litmus workflow is good, that means you're good and your service continues to do fine; otherwise, you have an opportunity to go and fix something.
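To make the fault-plus-probes idea concrete, here is a minimal sketch in plain Python. The data shapes and the score formula are illustrative only; Litmus computes its resilience score from its own probe results, so treat this as a mental model rather than the actual implementation.

```python
from dataclasses import dataclass, field

@dataclass
class Probe:
    """A steady-state check evaluated before, during, and after the fault."""
    name: str
    passed: bool

@dataclass
class ChaosExperiment:
    """One 'Lego block': a declaratively specified fault plus its probes."""
    fault: str              # e.g. "pod-delete"
    duration_seconds: int   # how long the fault runs
    probes: list[Probe] = field(default_factory=list)

    def resilience_score(self) -> float:
        """Fraction of probes that passed, expressed as a percentage."""
        if not self.probes:
            return 100.0
        return 100.0 * sum(p.passed for p in self.probes) / len(self.probes)

experiment = ChaosExperiment(
    fault="pod-delete",
    duration_seconds=60,
    probes=[
        Probe("http-200-on-checkout", passed=True),
        Probe("p99-latency-under-500ms", passed=True),
        Probe("queue-depth-under-1000", passed=False),
    ],
)
print(f"resilience score: {experiment.resilience_score():.0f}%")  # 67%
```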
So the summary for SREs, as far as SLO validation is concerned: look at the chaos coverage across your service stack, design and implement chaos tests across the service stack, and schedule them with a surprise. You have to run them continuously, with some randomness in what gets scheduled. You have tens or hundreds of chaos scenarios, and you don't know which one is going to get scheduled; that's the surprise. But definitely something is going to get scheduled. A fault is always happening and you are continuously validating. If the validation passes, that's exactly what you want: your service is good. If it doesn't, that's also good news, because you found a weakness and you're going to fix it.
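Below is a minimal sketch of such a surprise scheduler: something is always running, but which scenario runs, and exactly when, is randomized. The scenario names and the two helpers are hypothetical stand-ins for your chaos platform and your SLO checks.

```python
import random
import time

SCENARIOS = [
    "node-cpu-hog",          # infrastructure layer
    "kafka-broker-kill",     # message queue layer
    "db-primary-failover",   # stateful layer
    "checkout-pod-delete",   # application layer
]

def run_scenario(name: str) -> None:
    """Placeholder: trigger the chosen chaos scenario on your platform."""
    print(f"running chaos scenario: {name}")

def slo_met() -> bool:
    """Placeholder: check your SLO after the fault has played out."""
    return True

while True:
    scenario = random.choice(SCENARIOS)        # the surprise: which fault runs
    run_scenario(scenario)
    print("SLO held" if slo_met() else "weakness found: go fix it")
    time.sleep(random.randint(1, 6) * 3600)    # and when: the next run is 1 to 6 hours away
```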
So that's how chaos engineering can be used to continuously validate your service resilience. How can you get started? LitmusChaos, again, is a popular open source chaos engineering platform. It's conveniently hosted for free at ChaosNative Cloud, and you can sign up and get started; the entire suite of experiments is available on ChaosNative Cloud. Or you can host it on premises. We also have an enterprise offering where you get enterprise support, with some additional features as well. So with that, I would like to thank you for your attention. You can reach out to me at my Twitter handle. Thank you.