Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hi everyone, welcome to the session on unleashing the power of chaos engineering in CD pipelines.
I'm Sarthak and with me is Saranya. We are both senior software developers at Harness and also maintainers of LitmusChaos, an open source tool that lets you practice chaos engineering. We have attached our LinkedIn and Twitter profiles just in case you want to reach out to us after this session. So, the agenda: I'll be talking about chaos in CD pipelines. I know it is something that not many organizations practice and is new for many SREs and DevOps engineers. There are organizations that don't even practice chaos engineering as a standalone task, and they're surely missing out on a lot because of this.
I'll also share some interesting stats that I've picked up from Google which may get you thinking, after which I'll tell you why chaos is important and how it can help you make a better product. And at the end, Saranya will show a demo of the strategies we follow and how we do it at Harness to make our product resilient. I hope the agenda is interesting enough to keep you with us till the end. All right,
let's get started. So I know this is the honest reaction of DevOps engineers and SREs who are peacefully running their CD pipelines. I know you guys have done a great job building a deployment pipeline for your application, but trust me, adding a chaos step is worth it, and I bet you will be convinced by the end of this talk. Chaos in CD pipelines, in simple terms, means adding disciplined chaos to your pipeline and checking how the system reacts to these disruptions. The goal is to ensure that the CD pipeline remains stable, reliable, and capable of delivering software smoothly even when unexpected challenges arise.
Let's see some interesting stats here. Downtimes are expensive: the Gartner IT Key Metrics Data report states that the average cost of downtime is around $5,600 per minute, which is a lot. And it's not only the cost, but the customer impact as well. Poor system reliability and unexpected failures can lead to a decline in customer satisfaction and loyalty. According to Zendesk, 39% of customers will avoid using a product or service after a bad experience,
for obvious reasons. So now the question is how to improve
these numbers. And the answer is quite simple: make your system resilient and improve MTTR, that is, the mean time to repair. Chaos engineering can help you achieve this, and injecting chaos into your CD pipelines can help you automate the process.
Let's see why we add a chaos step in the deployment pipeline and how it is useful. The first one is early issue detection. Chaos engineering involves intentionally injecting controlled disruptions into the systems. This process allows teams to identify weaknesses and vulnerabilities before the system reaches the production environment. By identifying and addressing issues early, teams can prevent these problems from causing any significant outages or failures in the live production systems and avoid
its impact on the users. For example, suppose a chaos experiment reveals that a certain component of the system becomes unresponsive under high load conditions. Catching this early, let's say in QA, allows the team to optimize the component before it impacts the end user. And this is a very common situation. Next, we have enhanced resilience, or let's say improved resilience. Anyone who has read about chaos engineering must know that this is the primary aim of chaos engineering, that is, to make systems more resilient.
Chaos experiments are designed to test how well a system can adapt to unexpected conditions and continue to function without any major failures. By deliberately introducing chaos and monitoring system behavior, teams can implement strategies that enhance the system's overall resilience. For example, if a chaos experiment simulates a sudden increase in traffic or a server failure, the team can use the insights gained to implement auto-scaling mechanisms or other solutions to tackle this. Then we have improved incident response.
By intentionally introducing chaos in your CD pipeline, you create controlled scenarios where things can go wrong. This provides an opportunity to identify weaknesses in your system's response mechanisms and in your incident handling procedures. For example, consider a scenario where you simulate a sudden surge in user traffic during the chaos step. This allows your team to assess how well the system can handle unexpected spikes and how quickly it can scale resources to maintain performance. Any issues discovered during this scenario can be addressed, which leads to an improved incident response when a similar situation occurs in the real world. Then we have continuous validation. For me,
this is the most important thing. Adding a chaos step in the CD pipeline ensures continuous validation of your system's resilience and reliability. Regularly subjecting the system to controlled chaos helps validate that it can withstand unexpected events and disruptions not just once, but throughout its lifecycle. Practicing
chaos engineering once or twice and thinking your application is resilient
is like hitting the gym on New Year's Day and thinking that you will be
fit for the rest of the year. You know, that's not
how it works. So continuous validation ensures
that you don't just test for resilience once and forget
about it. Instead, you regularly subject your system to
different chaos scenarios, and you validate that your system
adapts and responds well to these challenges.
So it's like giving your system a regular workout to ensure
it stays in top shape, reducing the risk of unexpected failures when
it matters the most. So when you deliver your product
to the customer, you are confident that you have done all the necessary testing, that it's not going to cause any disruptions, and that it will be a very smooth experience for the customers. Then we
have increased adoption of chaos.
Chaos is not a new term anymore in the industry, and I'm pretty sure that whoever is watching this session is well aware of what chaos engineering is now, which was not the case maybe three or four years back. But even so, there are teams and organizations that are reluctant to use it in their systems. Adding and automating chaos through CD pipelines and testing out simple chaos experiments will help them gain confidence and allow them to adopt chaos engineering practices. So,
by gradually injecting chaos scenarios into the CD pipeline
and noticing the positive impact on system reliability,
the team becomes more comfortable and confident in adopting chaos
engineering practices. So this increased acceptance will
lead to a culture where chaos is seen as a means of continuous
improvement rather than a potential risk.
I know this fear is still there in the market: that chaos can cause disruptions, that our system may not be healthy enough to handle all these things, so we shouldn't do chaos testing on our system. But it's not like that. Just to start off, you can begin with some simple chaos experiments, integrate them into your CD pipelines, and you will see the results. So that's how it works.
So these are the five important ways you can benefit if you integrate chaos into your CD pipelines, right? And just a meme here. No developers were hurt in the making of this meme; it's just there to lighten things up.
So, enough of theory now. Let's get to the demo. Time for the demo; hope the demo gods are with us. Over to you, Saranya. Thanks, Sarthak. Hey everyone, this is Saranya, and I'm going to give a brief demo, or rather, I'm going to explain how we can add chaos as a step in CD pipelines. So, in the name of the demo gods, let's get started.
Before going to the pipeline itself, I'm going to explain the environment. This is the application that is being deployed. As you can see, it is an online boutique shopping demo application with a microservice-based architecture. It has various microservices such as a cart service, a checkout service, and a currency converter service; multiple services are there. Basically, you can perform the very basic functionality of online shopping: you can change the quantity, add items to the cart, and place an order. So this is the application that we are going to deploy. And here,
let me show you the CD pipeline. This is the pipeline that we have created. The first step is obviously the deployment step; this is the rollout deployment. The next step is the observe deployment step. In this step we generally observe the application and how it behaves, like whether it's healthy or not. If it's not, then we can simply roll it back; otherwise it will go further to the chaos step. For the demo purpose, I have just added a sleep of 10 seconds, and you can add your own commands as per your own requirements. So that's that.
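Just to make that concrete, an observe step does not have to be a plain sleep; it could be a small script that polls a health endpoint and fails if the application never reports healthy. This is only a rough sketch with an assumed endpoint URL and timings, not the command used in the demo:

```python
# Hypothetical observe-deployment check: poll a health endpoint for a short
# window and exit non-zero if it never reports healthy, so the pipeline can
# roll the deployment back instead of proceeding to the chaos step.
import sys
import time
import urllib.request

HEALTH_URL = "http://frontend.default.svc.cluster.local/healthz"  # assumed endpoint
DEADLINE_SECONDS = 60
POLL_INTERVAL_SECONDS = 5

def is_healthy(url: str) -> bool:
    try:
        with urllib.request.urlopen(url, timeout=3) as resp:
            return resp.status == 200
    except OSError:
        return False

deadline = time.time() + DEADLINE_SECONDS
while time.time() < deadline:
    if is_healthy(HEALTH_URL):
        print("deployment looks healthy, continuing to the chaos step")
        sys.exit(0)
    time.sleep(POLL_INTERVAL_SECONDS)

print("deployment never became healthy, failing the observe step")
sys.exit(1)
```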
Then, coming to the chaos step.
Here I have added the chaos experimentation step. We have added this boutique pod CPU hog experiment. What it does is put a load on the CPU in the target cluster where the application is deployed.
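Just to build intuition on what a pod CPU hog fault does under the hood, it essentially keeps the target container's CPU cores busy for the chaos duration (the real fault injects a stress process into the pod). The snippet below is purely an illustration of that effect with an arbitrary duration, not the LitmusChaos implementation:

```python
# Minimal illustration of a CPU hog: spin busy-loop workers for a fixed
# duration to saturate the available CPU cores, then release the load.
import multiprocessing
import time

CHAOS_DURATION_SECONDS = 60   # arbitrary value for illustration
WORKERS = multiprocessing.cpu_count()

def burn_cpu(stop_at: float) -> None:
    x = 0
    while time.time() < stop_at:
        x += 1  # pointless arithmetic keeps the core busy

if __name__ == "__main__":
    stop_at = time.time() + CHAOS_DURATION_SECONDS
    procs = [multiprocessing.Process(target=burn_cpu, args=(stop_at,)) for _ in range(WORKERS)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    print("CPU hog finished, load removed")
```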
And this is the expected resilience score; I'll come back to this in a while. And these are the chaos infrastructure details: the target infrastructure where the application has been deployed, and the namespace, which is default, for the pod CPU hog. So, coming to the expected resilience
score. Let me give you a quick refresher on how chaos engineering is generally carried out. The very first step of chaos engineering is to identify the steady state, the steady-state hypothesis; that means how the application behaves when it is healthy. First you need to identify that, then we introduce a fault, and then we check whether the SLOs are met or not. If yes, then the application is resilient; otherwise a weakness is found, we improve upon that, and then we run this complete cycle again and again. In connection with this first step, the steady-state hypothesis, we have this expected resilience score: the minimum resilience score that we expect the application to have so that the chaos step is considered successful.
You can give it any number, but to give an idea of how to decide on this number: if you click here, you can see all the previously run experiments, and you can get an idea of the expected score from the last resilience score of each. For this one, let's say it's 100, and for the Lambda function timeout experiment it's 50%. So based on these last resilience score values, we can decide on a number and then refine it; after multiple runs, we'll obviously get an idea of what value to set there. So yeah, that's that.
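Conceptually, the chaos step is just a gate on the resilience score: it passes only if the score reported by the experiment run meets the expected score you configured. A minimal sketch of that check, with illustrative numbers:

```python
# Sketch of the resilience-score gate applied by the chaos step.
EXPECTED_RESILIENCE_SCORE = 90  # chosen by looking at previous runs, as described above

def chaos_step_passed(actual_score: float, expected: float = EXPECTED_RESILIENCE_SCORE) -> bool:
    # The step succeeds only when the run's resilience score meets the threshold.
    return actual_score >= expected

print(chaos_step_passed(66))   # False -> the step fails and the failure strategy applies
print(chaos_step_passed(100))  # True  -> the pipeline continues to the next step
```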
Then, coming to this particular step and the diagram that Sarthak already mentioned, you must be wondering: where is the rollback step? We have this chaos step, but where is the rollback step? If you click on this particular chaos step and go to the advanced section, here we have the failure strategy. That means what strategy to adopt when this particular step fails. In this particular pipeline we have chosen the rollback stage, so in case the step fails, it will simply roll back to the healthy deployment. So if I
come here, this is a different pipeline where I can show how to add it. This is the chaos step, and if I go to the advanced section, in the failure strategy here you can see we have this rollback stage, and we can also have other options as a failure strategy, like manual intervention. If you want a manual intervention, there's a timeout, after which it can mark the step as success or ignore it, as per your own requirement. So in QA you can simply ignore, abort, or mark it as success, but in a prod environment it is advisable to roll back to a safer deployment. Here you also have other options like mark as success, retry, and so on.
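To put that advice in pseudocode form, the failure strategy is simply a per-environment choice of what to do when the chaos step fails. The mapping below mirrors the options mentioned above and is purely illustrative, not how Harness implements it:

```python
# Illustrative choice of failure strategy per environment for a failed chaos step.
DEFAULT_FAILURE_STRATEGY = {
    "qa": "ignore",            # in QA you might ignore, abort, or mark as success
    "prod": "rollback stage",  # in prod, roll back to the last healthy deployment
}

def on_chaos_step_failure(environment: str) -> str:
    # Fall back to manual intervention when no default is configured.
    strategy = DEFAULT_FAILURE_STRATEGY.get(environment, "manual intervention")
    print(f"chaos step failed in {environment}: applying '{strategy}'")
    return strategy

on_chaos_step_failure("qa")    # -> ignore
on_chaos_step_failure("prod")  # -> rollback stage
```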
In addition to this, I just wanted to let you know that here you can also add other steps, in parallel or in serial. If I click here, you can add any other step, or another chaos step in general: you can choose an experiment, choose the expected resilience score, and add it here like this. So coming back to our original
demo pipeline, I think we are clear about this particular pipeline now. Due to time constraints, I won't be able to run it because it will take quite some time, but I will explain the failure and success cases from already executed pipelines; we have run it multiple times. So let me first go to this failed pipeline.
Yeah, so here you can see this chaos step has failed because the expected resilience score here is 90, but the resilience score we got after the execution is 66. That's why this particular step failed, and as a result, the rollback step got triggered, because, as I already showed you, we have chosen rollback as the failure strategy. So the rollback step got triggered, and after that, a health check to ensure the deployment is healthy was also executed. This is what happens in case the chaos step fails. Other than that, let's go to the successful one here. Yeah, so here
you can see this pipeline execution has been successful, because the expected resilience score is 90 whereas the actual resilience score is 100. If you want to see this particular chaos execution step in detail, you can just click here, and this brings us to the chaos execution view. Here you can see the steps: first the install chaos step, then the actual fault execution, and here you can see all the required probes. The first one is the cart service availability check, then the boutique website latency check, and the pod status check. All these probes have passed, resulting in a score of 100%. If you wish to see the logs, they are also available here, and the fault configuration can also be found here. So this is how you can see the chaos experimentation in detail.
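To give a feel for what such probes check, here is a simplified sketch: an availability probe, a latency probe, and a naive score computed from the fraction of probes that pass. The URLs and latency budget are assumptions, and real resilience scoring weights each probe, so treat this only as an illustration:

```python
# Simplified probe sketch: availability, latency, and a naive resilience score
# computed as the percentage of probes that pass (real scoring is weighted).
import time
import urllib.request

FRONTEND_URL = "http://frontend.default.svc.cluster.local/"       # assumed URL
CART_URL = "http://cartservice.default.svc.cluster.local/health"  # assumed URL
LATENCY_BUDGET_SECONDS = 0.5

def availability_probe(url: str) -> bool:
    try:
        with urllib.request.urlopen(url, timeout=3) as resp:
            return resp.status == 200
    except OSError:
        return False

def latency_probe(url: str, budget: float) -> bool:
    start = time.time()
    ok = availability_probe(url)
    return ok and (time.time() - start) <= budget

probes = {
    "cart service availability check": availability_probe(CART_URL),
    "website latency check": latency_probe(FRONTEND_URL, LATENCY_BUDGET_SECONDS),
    # a pod status check would query the Kubernetes API; omitted here
}

resilience_score = 100 * sum(probes.values()) / len(probes)
print(probes, f"resilience score: {resilience_score:.0f}%")
```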
Before wrapping this up, I have two more things to share, if I go back. The demo I showed, whatever I explained here, can be done natively using Harness Chaos Engineering and Harness CD pipelines. But if you want to integrate a chaos step externally, the APIs are already available and can be used. For example, we have integrated it with GitLab. Here you can see the same steps: the deploy step, the chaos step, and in this case it failed, so the rollback happened. A similar thing can be done using GitLab as well. If you click here, you can get all the details of this particular step, like the logs, and you can find out why it failed. So this can also be done by using the APIs that are
available. And one last thing: if I go to the pipeline, in this case, whatever I showed, I triggered the pipeline manually. But this can also be done automatically; the pipeline can be triggered based upon some webhook. Here you can see it got triggered by one such webhook, that is, cart service deploy changes. So in case there are some changes in the deployment, the pipeline will automatically get triggered, and you can see the execution details; in this case it passed. So instead of manual intervention, it can also be triggered using webhooks.
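As a rough idea of what a webhook-driven trigger involves, a tiny receiver could listen for a deploy-change event and call a pipeline execution API. Everything here (the route, payload field, API URL, and token) is a placeholder, not the built-in trigger shown in the demo:

```python
# Hypothetical webhook receiver: when a deploy-change event arrives, call a
# pipeline execution API to kick off the CD + chaos pipeline automatically.
from flask import Flask, request
import requests

app = Flask(__name__)

PIPELINE_EXECUTE_URL = "https://ci.example.com/api/pipelines/demo-pipeline/execute"  # placeholder
API_TOKEN = "REPLACE_ME"  # placeholder credential

@app.post("/webhooks/cart-service-deploy-changes")
def on_deploy_change():
    event = request.get_json(silent=True) or {}
    # Only react to changes in the service we care about (assumed field name).
    if event.get("service") != "cartservice":
        return {"status": "ignored"}, 200

    resp = requests.post(
        PIPELINE_EXECUTE_URL,
        headers={"Authorization": f"Bearer {API_TOKEN}"},
        json={"triggeredBy": "cart-service-deploy-changes webhook"},
        timeout=10,
    )
    return {"status": "triggered", "pipeline_response": resp.status_code}, 202

if __name__ == "__main__":
    app.run(port=8080)
```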
So yeah,
that's how you can integrate chaos as a CD pipeline step. And I hope we have
convinced you enough to add one more chaos step
into your pipelines and ensure the continuous resilience of your application.
And with this, I would like to thank you for
watching us till the end.