Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hello, everyone, and welcome to this talk on chaos validation made
easy: plug and play with resilience probes.
My name is Neelanjan, and I'm a software engineer working at Harness,
and I'm also a LitmusChaos maintainer.
Joining me today as my co-speaker is Sayan.
Hey guys, what's up? My name is Sayan. I'm a senior software engineer at Harness,
and I've also been maintaining LitmusChaos, the open source project.
Thanks to Conf42 for having us; really looking forward to you
guys enjoying the talk. All right, with that out of the way,
let's get going with chaos. Let's start the
talk with a very important question: what causes downtime?
Sure, we all have been there and experienced it,
perhaps multiple times. However, it never gets easy.
Downtime has many adverse effects for an organization.
Take the instance of Slack, whose SLA violations led
to an 8 million payout and gravely impacted the company's revenue.
Wells Fargo, the financial giant,
suffered a power shutdown in a data center due
to smoke, which caused loss of transactions,
and some direct deposit checks were not reflected in its accounts.
In this instance, a single hour of downtime cost
them over $100,000.
Lastly, British Airways had to cancel 400 flights,
which left 75,000 passengers
stranded and cost them over 100 million
in losses. In this case, it was a debugging
issue in one server that cascaded to other servers,
impacting the billing systems.
Therefore, downtime is often the result of a combination of
issues in a system. With the ever-increasing complexity
of cloud native microservice applications, the question remains:
how can we ensure that our distributed systems always withstand
adverse and potentially unforeseen situations?
So why are we not better prepared to manage downtime?
First of all, microservices are prone to downtime.
While one can prepare for the apparent causes that need attention,
no one can fully anticipate an overwhelming downtime before
it takes place, as there are a plethora of ways in which things can go wrong.
And that's where chaos engineering can help: uncovering the weaknesses
in a system and becoming better prepared to
manage the various downtime scenarios.
However, chaos failure scenarios can be difficult to run while
ensuring the safety of the target resources, and often there
isn't a good culture around them, which makes chaos engineering difficult to
implement and scale. Lastly, as more and more
code gets pushed over time in any organization,
it becomes difficult to assess the system against
its weaknesses at scale. Due to the lack of
chaos integration in the CI/CD pipeline at the development
stage, and the failure to effectively measure the
impact of the faults automatically at scale,
it becomes difficult to assess the resilience of any
application. To better understand this,
consider that your applications, now being cloud native,
stand atop a plethora of other services that determine their
functioning and resiliency. You have your
application dependencies, then the other cloud
native services provisioning the underlying infrastructure,
the Kubernetes services themselves, and lastly the platform
on which your application is deployed. A failure in
any one of these services can cause your entire application to
fail to cope. The problem is only
accentuated as more code is now shipped
more frequently, at a weekly or even shorter cadence,
and is expected to run in multiple different environments.
This unpredictability of application behavior is
the prime cause of service outages,
since there is no reliable way to know how our
application will behave when subjected to an unanticipated
situation. Therefore, chaos engineering is the way to go
for all those enterprises that want to prioritize
resiliency and reduce downtime for their customers.
The key to successfully practicing chaos engineering is
to understand the complexity involved in your system
through realistic experiments and hypothesis conditions,
and then to slowly scale them up so that all
parts of your application can be assessed.
However, what does a good chaos engineering practice look like,
and how can you implement one? As far as the general
best practices go, chaos engineering is a culture-oriented
approach which finds its place as part of DevOps practices,
and hence developers and SREs should work together
for the best results. While developers should run chaos
experiments from an early stage in development and slowly
scale up their tests to cover all the different kinds of chaos scenarios,
SREs should focus on making chaos engineering practices
scalable enough to run in their CI/CD pipelines,
as well as executing those tests within the staging and
eventually the production environment.
Also, it is paramount to have a robust set of
chaos experiments that can cover all the different types of failures
that might potentially affect the application.
Lastly, you need good observability to assess the impact
of the chaos throughout the system, and hence your chaos engineering
tool should provide enough insights to help
you understand if the application is deviating from its steady state
in an unanticipated manner.
So how do you implement a great chaos engineering
practice within your organization? Well, Harness Chaos Engineering
can help you get there. Harness Chaos Engineering tackles this
problem by providing a streamlined platform
with powerful features that help you get started.
It provides simplified experiment creation: instead of writing
complex scripts, Harness Chaos Engineering offers a declarative
approach, allowing you to define experiments
in code, version control them, and easily integrate them
into your workflow. It provides you with an extensive fault
library. So whether you are targeting Kubernetes,
AWS, Azure, GCP, VMware, Linux,
or even your custom services, Harness Chaos Engineering provides
a rich library of prebuilt faults, on top of which you
can create your own chaos experiments. You can
select, customize, and combine these faults to create realistic scenarios
that can stress your system in various ways.
Also, you can leverage chaos hubs to store and provide
access to these faults and experiments throughout your organization.
It also provides you with real-time monitoring and metrics: Harness
Chaos Engineering leverages Prometheus to provide real-time insights
into your system's behavior during the experiment runs. You can
visualize the metrics, correlate the impacts, and gain a deeper
understanding of your chaos effectiveness.
You can also automate your experimentation with Harness Chaos Engineering:
once you have defined your experiments,
it's time to automate them. You can schedule regular runs,
integrate them with your CI/CD pipelines, and continuously assess
your system's resilience. This proactive approach helps
you identify the weaknesses before they can cause downtime.
Lastly, Harness Chaos Engineering also provides you with advanced
features such as resilience scoring, private chaos hubs, and security
chaos faults, so you can tailor your chaos experiments and gain a deeper
understanding of your system's vulnerabilities. In short,
Harness Chaos Engineering provides you with a plethora of benefits,
including reduced downtimes. That is, by proactively
identifying the weaknesses, you can fix them before
they can cause outages, which can lead to
improved uptime and user experience.
It also helps you with faster recovery, in the sense
that Harness Chaos Engineering helps you build systems that
can automatically recover from failures, and therefore it
can help you minimize the downtime and impact on your business.
It can also aid you with the validation and optimization
of your disaster recovery setup. Lastly,
it helps you reduce cost by avoiding unplanned downtime,
which results in cost savings as
your resources aren't wasted on recovery
efforts. Getting started with Harness Chaos Engineering is as
simple as choosing your platform of choice, that is, SaaS or on-premise.
The SaaS platform is deployed on the cloud, while the on-premise
platform can be deployed into your own environment. Once you have selected
your platform of choice, you can pick
an experiment, and depending on that experiment,
be it a Kubernetes, AWS, GCP, or any
other type of chaos fault,
you can select the blast radius with which you want to
affect your target application and your target environment,
and then choose to execute the chaos experiment.
Upon executing any chaos experiment, you'll be able to see
and observe the chaos impact, as well as measure the critical metrics
which give you an insight into what is happening throughout your system
when a chaos experiment is run. Finally, when you
have gained enough confidence from your chaos
experiment runs, you can automate them with the CI/CD
tooling of your choice. Harness Chaos Engineering
experiments integrate out of the box with Harness CI
and CD. However, you can also leverage the APIs
provided by Harness Chaos Engineering to integrate it with any
CI/CD tooling of your choice.
So, observing the impact of the chaos at
scale can be difficult, especially if you are performing
chaos experiments in CI/CD pipelines.
To overcome this, Harness provides resilience
probes. Let's hear from Sayan how they work.
Thanks, Neelanjan, for talking about chaos engineering, its practices, and how
SLAs are important in this practice. As mentioned, I'm Sayan.
I'll be talking about resilience probes and giving you guys a hands-on demo as
well on how you can practically approach probes and how to use them in your
regular day-to-day applications. So before jumping right
into the actual hands-on approach, let's first understand what
probes are. What is a resilience probe? What is this term that we are coining?
Resilience probes are nothing but reusable, pluggable checks
that can be used in your experiments. Let's say you have an
application where you want to run some kind of query, or there are some
monitoring parameters that you want to check or assert against certain criteria;
you can put a probe in that specific fault. What that probe
will do is go and query or check the aspect
that you have configured it for and then return values
based on which you can do your chaos validation. To understand this in
further depth, we'll of course take a deeper look into it, but that's the
general gist of it. It's basically a write once, use anywhere
kind of paradigm, which means you just create the probe once and then
you are free to use it in as many chaos experiments or faults as you
want to attach it to. That's, in brief,
what a resilience probe is. Now, how do you use this probe? You basically
have to configure a resilience probe globally. For example,
let's configure a health check probe which checks the health of my application,
of my pod, of my container, or anything. And once I
generalize and create this health check probe, I have to add the necessary
probes to my specific faults and then observe the impact
on my specific experiment. So what
are the different types of probes we have? Right now we have probes for two
infrastructure types. For Kubernetes infrastructure we have HTTP, CMD, Kubernetes,
Prometheus, Datadog, Dynatrace, and SLO probes as of today, and for
the Linux one we have support for HTTP, CMD, Datadog, and Dynatrace.
So what are the typical use cases that you would normally see for probes?
This is definitely not an exhaustive list, but it's just something we
came up with. Some of the use cases would be to query health
endpoints or downstream URIs, to execute user-defined health check functions or
any user-defined functions for that matter. You can perform CRUD operations on
your custom Kubernetes resource definitions. You can execute PromQL
queries using the Prometheus probe, or you can validate
your error budget using the SLO probe. You can also do
exit and entry criteria checks with the Dynatrace probe. So there are multiple ways
you can configure and use a probe in your specific application. As I mentioned,
this is not an exhaustive list.
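To make the Kubernetes CRUD use case above concrete, here is a minimal sketch of a
Kubernetes (k8s) probe in the LitmusChaos probe format, checking that a deployment is
present; the probe name, namespace, and selector are illustrative assumptions, and exact
field names can vary between versions:

probe:
  - name: check-cartservice-deployment      # hypothetical probe name
    type: k8sProbe
    k8sProbe/inputs:
      group: apps                            # API group of the resource to check
      version: v1
      resource: deployments
      namespace: boutique                    # assumed application namespace
      fieldSelector: metadata.name=cartservice
      operation: present                     # one of create, delete, present, absent
    mode: EOT                                # run the check after the chaos finishes
    runProperties:
      probeTimeout: 5s
      interval: 2s
      retry: 1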
There are different modes in which you might want to execute these probes, and this
depends on what behavior you are trying to achieve.
For example, SOT is start of test and
EOT is end of test. So if you want to execute your probe just
before the chaos execution starts, or when it is about to start,
you can use the SOT mode; with EOT,
the assertion is done after your chaos finishes. OnChaos runs the assertion
while the chaos is happening, Continuous runs throughout the entire chaos execution
flow, and Edge runs before and after your chaos:
before the chaos happens it runs the assertion,
and after the chaos finishes it runs another assertion. So yeah, these are the different
modes that are available for probes as of today.
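As a rough illustration of where the mode sits in a probe definition once it is
attached to a fault (a sketch based on the LitmusChaos probe format; the name and URL
are placeholders, and value formats can vary between versions):

probe:
  - name: frontend-availability-check        # hypothetical probe name
    type: httpProbe
    httpProbe/inputs:
      url: http://frontend.boutique.svc.cluster.local    # placeholder URL
      method:
        get:
          criteria: "=="
          responseCode: "200"
    mode: OnChaos                            # one of SOT, EOT, Edge, Continuous, OnChaos
    runProperties:
      probeTimeout: 5s
      interval: 2s
      retry: 1
      stopOnFailure: false                   # let the experiment continue even if this probe fails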
Now let's jump right into the hands-on demo. So now
I'm in the Harness platform. As you can see in the URL, it's app.harness.io.
What it would look like normally is something like
this: you might have to sign in, or if you're new you can click
on sign up and create an account. You can use social sign-in
as well, depending on your choice. And once you are logged in you
get a free trial as well as some
free-to-use modules which you can give a try.
So definitely go ahead and check it out.
Once you're inside, you'll see all these different modules.
You can quickly navigate to the chaos module and then create a project
just for testing or just to explore. I already have a
project selected in here, and in it I've gone
to the resilience probes tab. This is where I can see all my different resilience
probes. Currently I've filtered them by the Conf42 tag; that's why you're only
seeing the four probes that I pre-created about two hours ago.
This is how the probes look irrespective of which platform you're on. So let's say you're not
trying this on Harness: these features and functionalities are also available
in open source Litmus, so it does not matter, it's platform agnostic.
You'll get the same level of features in
the open source version as well. So, moving forward.
These are some of the probes I've pre-configured: there's a
Prometheus probe, an HTTP probe, a CMD probe,
and one Kubernetes probe that I've also configured.
For the demo we'll mostly be using three probes,
and we'll be trying to assert certain criteria and
validate our microservice application, which is the
Online Boutique, and we'll be doing some probe validations on top
of that application. Just to give you a brief
tour of the setup I have: I have a GKE cluster
running, in which I have monitoring set up with Prometheus and Grafana.
I have my Boutique application set up; this is the microservices demo application that
I'm going to use and do chaos on. And this is the infrastructure setup
that I have for Harness. Harness requires you
to have an environment where you can deploy your chaos
infrastructure, so this is the infrastructure that I've connected, which is nothing but the GKE
cluster.
Cool. All right, now let's move on and actually see the application. So this is the Online Boutique application.
As you can see, there are multiple items which I can select.
I can add things to the cart. Let's say I want to add some sunglasses,
I can add two of them to the cart. Once I do that, this is
my cart, so I can go to my cart and I can see that the
cart is functional. And if you go to the microservices list, you would
see that there's a service called cart service. This is what's responsible for
handling all the cart-related activity. So what
we'll do is actually try to break this service. We'll do a simple pod
delete on this specific service and take it
down; we'll kind of disrupt this service. This is
just a very simple application, but what we want
to do is run all these different kinds of validations on top of that disruption.
For example, let's take the HTTP probe for now.
If we go over to the HTTP probe and see the probe configuration,
we can see that we have a certain set of properties: timeout, attempts, interval,
and initial delay. So we can have it start after a certain delay,
give it a certain interval, and decide how many
times we want it to attempt again if the first try doesn't succeed.
And then the probe details are where you get the actual check,
which is basically that we are trying to check or connect to this specific URI,
which is the FQDN of the Boutique front
end, and we are checking if it is accessible:
whether it actually returns a response code of 200 or not,
whether it's actually live, whether this FQDN
is actually reachable and we can navigate to that specific endpoint
or not.
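Put together, a probe like the one described here would look roughly like this in the
LitmusChaos probe format (a sketch; the URL is a placeholder standing in for the
frontend's actual FQDN, and field names can vary between versions):

probe:
  - name: conf42-http-probe                  # illustrative name for the demo's HTTP probe
    type: httpProbe
    httpProbe/inputs:
      url: http://frontend.boutique.svc.cluster.local:80   # placeholder for the Boutique frontend FQDN
      insecureSkipVerify: false
      method:
        get:
          criteria: "=="                     # compare the HTTP response code
          responseCode: "200"                # the frontend should answer with 200
    mode: SOT                                # assert once, just before the chaos starts
    runProperties:
      probeTimeout: 5s
      interval: 2s
      retry: 2
      initialDelay: 2s                       # wait a little before the first check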
Now, coming back to the probes screen: a lot of the other probes are also listed
now, because we just got rid of the filter. If we check
the CMD probe and go to its configuration,
we'll see that it's doing a kubectl get pods in the boutique namespace.
So if you see over here, this is the boutique namespace:
we want to grab the cart
service, which is the microservice we want to target and drop,
check whether it's actually in the Running state, and if so, what its count is.
We want at least one cart service pod to always be present,
in the running, healthy state. So in the comparator we are
asserting, with the integer criteria, that the count should be greater than
zero; there should always be at least one cart service pod.
So that is what this specific probe is doing.
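For reference, a CMD probe along those lines might be declared like this (a sketch;
the command, label, and namespace are assumptions mirroring the description above, and
field names can vary between versions):

probe:
  - name: conf42-cmd-probe                   # illustrative name for the demo's CMD probe
    type: cmdProbe
    cmdProbe/inputs:
      command: "kubectl get pods -n boutique --no-headers | grep cartservice | grep -c Running"
      comparator:
        type: int
        criteria: ">"                        # the count of running cart service pods
        value: "0"                           # should always be greater than zero
      source:                                # run from an image that has kubectl (needs RBAC to list pods)
        image: bitnami/kubectl:latest
    mode: Edge                               # assert before and after the chaos
    runProperties:
      probeTimeout: 10s
      interval: 2s
      retry: 1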
And the third is the Prometheus probe, which
asserts against a
specific Prometheus endpoint. It runs a PromQL
query, the average over time of the probe
success percentage, and checks it
against our evaluation criteria: it should be greater
than or equal to 90. So we are saying that only if the probe success percentage
is greater than or equal to 90 do we consider this fault run resilient.
So that's our assertion, that's our hypothesis.
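And the Prometheus check might look roughly like this (a sketch; the endpoint and the
exact query are assumptions standing in for the demo's actual Prometheus setup):

probe:
  - name: conf42-prom-probe                  # illustrative name for the demo's Prometheus probe
    type: promProbe
    promProbe/inputs:
      endpoint: http://prometheus-server.monitoring.svc.cluster.local:9090   # assumed Prometheus endpoint
      query: "avg_over_time(probe_success[2m]) * 100"                        # probe success percentage over time
      comparator:
        criteria: ">="                       # consider the run resilient only if the value
        value: "90"                          # stays at or above 90
    mode: Continuous                         # keep polling throughout the chaos duration
    runProperties:
      probeTimeout: 5s
      interval: 5s
      retry: 1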
To configure a new resilience probe, you can go
over to + New Probe and choose which infra
type you want, Kubernetes or Linux. For Kubernetes you can
go ahead and select any of the probe types, let's say HTTP,
and give it a name. This name is unique, and once you
assign it you can't really get rid of it (you can
force delete it, of course), so be mindful about that. So let me just
name it something like http-probe-111. You can click
next and set up the timeouts for this one,
for example, something like this. And if I
go next, this is where you give your probe details, similar to before:
you can choose the GET or POST method. If I do POST, you can
choose the HTTP criteria, whether you want to compare the response code or
the response body. So this is just an example of how you can go ahead
and configure whatever probes you want. And once you do that, they'll show
up like this. Now let's come to the scheduling
part and actually let's try and run an experiment and see the observation,
see how it's going. So let's create a new experiment.
I'll call it boutique app conf
42 and I'll select
in Kubernetes infrastructure type. So in here I would select
the conf 42 intra that I created and I'll just
apply. So you have few options. You can upload your own AML, you can
create your template or you can just start from a blank canvas. If I start
from a blank canvas I would just filter from the chaos hub
what I want to do. In this case I want to do a pod
delete. If I select pod delete, I am given
certain choices of where I want to do the pod delete. In this case
I want to select the app namespace, which is the boutique namespace where my app
is currently present, and I'll select the kind, which is nothing but a deployment,
and the target, which is nothing but the cart service. This is the label;
these are all the labels that are present in my specific boutique namespace, but I
just want to target the cart service.
Now that I've selected this, I can go ahead and tune
the fault if I want to. I don't really have a use case for that now,
so I'll just leave it as is.
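For reference, underneath the UI a pod-delete fault like this corresponds to a
ChaosEngine roughly along these lines (a simplified sketch, not the full YAML the
platform generates; the label, service account, and durations are assumptions):

apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: boutique-app-conf-42
  namespace: boutique                        # assumed namespace for the engine
spec:
  engineState: active
  appinfo:
    appns: boutique                          # namespace where the app runs
    applabel: app=cartservice                # assumed label on the cart service deployment
    appkind: deployment
  chaosServiceAccount: litmus-admin          # assumed service account with the required RBAC
  experiments:
    - name: pod-delete
      spec:
        components:
          env:
            - name: TOTAL_CHAOS_DURATION
              value: "30"                    # run the fault for 30 seconds
            - name: CHAOS_INTERVAL
              value: "10"                    # delete a pod every 10 seconds
            - name: FORCE
              value: "false"                 # delete pods gracefully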
And this is the section where you actually add the resilience probes. In the probes section, currently I
don't have any probes added to my specific fault, but they are configured in the
resilience probes section. So when I click select, all
the different probes that are eligible to be added to your specific
experiment, to your specific fault, will show up. I'll just select
the HTTP one and add it to the fault.
For this one I want an SOT check, in the sense that
whenever the chaos is about to start, before that, I want to do an assertion and
check if the frontend is actually live or not.
So I'll apply that. Secondly, I want to add one more probe,
which is the CMD probe. The CMD probe is
doing nothing but checking if the cart service is in the Running state. So we want
to assert this before the start
of the chaos and after the end of the chaos. What that means is:
before the start it was already running, then the chaos happened, so it might have gone
down; and after the chaos finished, did it come back up again or not?
That kind of check I can do with the CMD one. So I'll
just run it in the Edge mode and apply that, and
next I'll do the Prometheus one. For
Prometheus we are again checking the probe success percentage, so for this I
want it to run in Continuous mode: keep checking throughout,
within the certain polling interval that you specify,
so that I get a constant verification.
So now I apply the changes. Now if you look at the YAML,
it might scare you a bit; it's a big YAML. Now, where are
the probes? For the resilience probes, we add them here in the
annotation. Now, since we have certain
probes configured in the hub, things like the health check might pop up,
which is another probe right here. But this is not considered a resilience probe;
it's something we keep for backward compatibility.
You can go ahead and remove it, and it should not affect your application or
your fault. So these are the three probes we
have added.
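For reference, the probe references in that generated YAML end up as an annotation on
the ChaosEngine, roughly of this shape (a sketch of the Litmus 3.x style of referencing
centrally configured probes; the exact annotation key and value format may differ in
your version, and the probe names are the illustrative ones used above):

metadata:
  annotations:
    probeRef: '[{"name":"conf42-http-probe","mode":"SOT"},{"name":"conf42-cmd-probe","mode":"Edge"},{"name":"conf42-prom-probe","mode":"Continuous"}]'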
If you want a little more information on where you can add
probes, in the documentation at
developer.harness.io you can go to any of the probes, let's say the CMD probe,
and you can see the exact place where you
have to define your probes. That's the old legacy way, in case you
want more information. But this is not something you're
doing by hand; it's already pre-created if you're using the UI, the Chaos
Studio, so you don't have to worry too much about it.
Cool. So that's that. Now let's save,
give it a minute, and let's just run it. So once we run
it, it checks whether certain rules are
met or not, then it actually installs the chaos faults, and then it
actually does the pod delete. So if I go back to my application,
to my microservices, you can see that this cart service's
age is 70 minutes. It has been running for 70 minutes
as of now. When the chaos happens, this pod will actually terminate,
so the age will drop down to seconds.
We'll see that as well. But as of now
you can see that some things pop up here, like this one running
for 2 seconds, since the boutique-app experiment just started.
And if I go back to my application, currently everything works
well and good, but once the chaos actually happens and I click on
the cart, things should start breaking. For the monitoring,
I have actually set up the Grafana and Prometheus integration.
So you can see that the chaos injection is actually starting, based
on the annotations you can see at the bottom, and this is the cart
QPS that's going to be affected, because the cart service is the one we are
targeting. So you can see the QPS go down, which means the cart service is actually
starting to get affected. And if you come over
and actually check the logs of this, you would see the probe logs as well.
If I go down, you can see the health check probe has passed;
this is the default legacy one. You can see the Conf42 HTTP probe
has also passed. Maybe I can zoom in. Yeah, the Conf42
HTTP probe has also passed, which is just doing the assertion. This
is an SOT check, so before the chaos started, it did this assertion.
Now the chaos is going on, and you can see things like the
CMD probe, which runs before and after the chaos,
pop up. So in here you see the CMD result: the
Conf42 CMD probe has passed, because its expected value
was a count greater than zero, and its
actual value is one. That means it did find something: the pod was in the
Running state, so the count is one. And now you can see
the prom probe; it's actually failing, because the prom probe's actual value
is 88.33%, whereas it should be
greater than or equal to 90. So that is the one case where
our specific application, for this
specific fault, is not really that resilient,
because according to our criteria it should have been greater than or equal to
90. So if you want to term your application resilient, you would
have to configure it in such a way that this value actually stays greater than or
equal to 90.
Now if we go back to the Boutique and click on the cart,
I think it has actually restarted.
Yeah, as you can see, the cart service has restarted and its age is 89
seconds. So that's why you did not see the breakage
in the UI, because it just restarted that quickly. But you can
see that this pod did terminate, and because it
terminated and came back up, you can see the age difference.
And if I go back to my Boutique app now, you can see the
fault injection: you can see that the chaos injection has finished
and the annotation has stopped going further. So yeah,
that is just a brief demonstration of what I wanted to show you.
And if I come back to the probes section, you can see all the different
probes mentioned here as well. So the HTTP probe passed because
its expected code was 200 and we received 200. The CMD
one passed because we wanted a value greater than zero,
and its actual value was one. But unfortunately the prom one failed because it received
88, but we wanted greater than or equal to 90. So yeah, that's how
we can determine the resilience
of our application. And coming back to the
experiment, I think this one has also finished, so we will get the resilience
score for it. Yeah, so it's 75. So you can still make it better,
but it's actually okay. But you should definitely look into
what's wrong with your application and change it. So yeah, that's all
from me and Neelanjan on resilience probes and chaos engineering.
So if you have any questions, you can use our social handles to chat with
us. Yeah, I hope you guys enjoyed it. Thanks for watching.