Transcript
Hello everyone, welcome to Conf42.
This talk is about chaos engineering for developers,
because we believe that chaos engineering is not just an SRE
thing: even software developers, while designing or
writing code, could think chaos, or rather
should think chaos during software development, because
that helps us build real resilience through
chaos. So I'll give a quick intro about
myself. My name is Dheeraj.
I work as a software engineer at Amazon Web Services, which is popularly known as AWS.
I work with the Aurora DB team, and within Aurora I work on the storage
part, which is a multi-tenant, distributed, auto-scale-out storage platform. Apart from
work, I am a contributor and maintainer in the OpenSearch project.
I mostly work on how we can run OpenSearch
on Kubernetes: I build tools, charts, and
operators to help the community run OpenSearch and OpenSearch
Dashboards on Kubernetes. So what is
chaos engineering? Now, there is a very thin line between chaos testing and
chaos engineering. Chaos testing is when you are intentionally introducing
failures into a system, but you are not doing anything after that.
Chaos testing plus observability is what is called chaos engineering. When I say observability,
I mean you are identifying, you are proactively monitoring,
and then you are addressing potential issues.
So when you are simulating real-world scenarios, proactively monitoring, creating your
hypotheses, and then proactively resolving
issues before they cause any impact, that is called chaos engineering.
Now, like I said in the beginning, chaos engineering is not
just an SRE thing. It is equally important for developers
as well. Now why do I think so? Because
data speaks for itself, and we'll take data-driven decisions in
this session. The
Gremlin survey shows that 47%
of the companies, after adopting chaos engineering
as a habit, have seen increased availability.
Then if you see the next one, their MTTR,
which is the mean time to resolution, has decreased by 45%.
Same goes for MTTD, which has decreased by 41%. And if you
see the last two, which are very significant as well:
the outages and the number of pages.
Now, how can we achieve this at
an organization level? This can only be
achieved when you have the
habit of chaos engineering imbibed,
or rather plugged in, into every part of your software
development lifecycle. So generally
the trend is like after your entire software is developed
and you are ready to go to production, you do some kind of chaos
experiments and game days in order to validate your
resiliency. And if you catch any bugs, then you go ahead and fix those.
Now we are telling the reverse. When you are designing,
when you are coding, even for small modules, you need to think chaos.
When I say think chaos, I mean: if you have
built a small module, just go ahead and inject some failures,
see how your system behaves. Then validate.
Then fix those. This way, incrementally, when you
think chaos during software development,
it bubbles up for the entire system and will
lead to an increase in the number of nines. When I say number of nines,
I mean your increased resiliency, or the availability of
the system. So, in very layman terms, you can say
you are building resilience through chaos. Next: I talk with multiple folks,
multiple people, and multiple companies, and the biggest thing, which I think is
the biggest inhibitor to adopting or expanding chaos engineering, is lack of awareness
and experience. We are still lacking
awareness of how chaos engineering can help.
People are aware of the term chaos engineering, but what does
a true chaos engineering experiment do? People are not so aware.
Maybe they are injecting failures, but they are not building a proper failure
model. So if you see on the right hand side of the slide,
there is a cycle. So we'll start from here.
Let's start from the steady state of the system. When I say steady
state of the system, it is the state where we
have no failures injected in the system. It is the normal state
of the system where everything is behaving as expected, or
is normal. Based on that steady state,
we make some hypothesis that, okay, after,
say, a power outage, or after a cloud service
outage, or after a DB outage, how will my system behave?
Then we create a bunch of hypotheses around that.
Then what we do is we run the chaos experiment, which will actually
inject the failures which we wanted, and some randomness
as well. There should be some randomness in your experiments.
Otherwise it will become more like an
integration or unit test, where you just start asserting
things. Here, apart from asserting, you need to observe and
find unknowns, which is the most important
part of chaos engineering. Once you have run the experiment, you need
to validate your hypotheses: whatever hypotheses
I had created based on the steady state,
do they hold true or not? You validate.
And if you see that some of the hypotheses do not hold good,
then you improve on those hypotheses,
or rather improve your systems,
and then you will get an improved steady state and ultimately
a good, resilient service.
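To make that cycle concrete, here is a minimal sketch in Python. It is not from the talk: the `latency_p99_ms` metric, the threshold, and the `inject_db_outage` hook are hypothetical placeholders, just to show how steady state, hypothesis, injection, and validation fit together.

```python
import time

# Hypothetical steady-state definition: p99 latency stays under 200 ms.
STEADY_STATE = {"metric": "latency_p99_ms", "threshold": 200}

def measure_latency_p99_ms():
    # Placeholder: in a real setup this would query your monitoring system.
    return 120

def inject_db_outage(duration_s):
    # Placeholder: in a real setup this would stop a replica, drop packets, etc.
    print(f"injecting DB outage for {duration_s}s ...")
    time.sleep(duration_s)

def run_experiment():
    baseline = measure_latency_p99_ms()           # 1. observe the steady state
    # 2. hypothesis: p99 stays under the threshold even during a DB outage
    inject_db_outage(duration_s=5)                # 3. inject the failure
    observed = measure_latency_p99_ms()           # 4. observe under failure
    holds = observed < STEADY_STATE["threshold"]  # 5. validate the hypothesis
    print(f"baseline={baseline}ms observed={observed}ms hypothesis_holds={holds}")
    return holds  # if False: improve the system and run the cycle again

if __name__ == "__main__":
    run_experiment()
```

If the hypothesis does not hold, you improve the system and run the loop again, which is exactly the improvement step of the cycle.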
The other thing, followed closely, is other priorities:
chaos engineering, most of the time, will take a backseat.
If your organization is not practicing chaos engineering and tomorrow
you go to your manager and say, okay, I want to do chaos engineering, it's very difficult
to convince them, because it will certainly not
add value to your service tomorrow. Even if you start practicing it from
today, it is a journey, and it starts from a pre-prod
environment. The last point: if you see, greater than 10%
of the engineers feel that, okay, something might go
wrong. But listen, you need not start in production.
You start in a pre-prod environment, go to a prod-ish environment,
and practice everything there before your changes roll out to production.
In your beta stage, run these chaos experiments.
That will give you enough confidence. Then, during this chaos engineering
journey, you will reach a point where
you will run these chaos experiments in production. And that is
where you will see the real benefit: that even during
outages, you are able to handle all these things. There are
two major modes of chaos experimentation. The first one,
which I say is the start of the journey, is manual experiments; the second one,
where you have already advanced in the journey, is when you've automated everything. Manual
experiments are what I'm describing right now:
okay, tomorrow I built a module, I want to test it out, I just injected
some failures and saw how it breaks. That is ad hoc, manual experimentation.
Then we have game days. When I say game days, game days
means that you run full-day outages.
Suppose you brought your database down for, say, six to eight hours,
and then you are watching how your systems behave and how your customer experience
is getting impacted. That is what is called a game day.
The second side of chaos experiments
is automated experiments, which live in your CI/CD pipelines.
So whenever your code changes are getting checked in, you are
running some automated experiments which
are injecting failures and validating a bunch of things.
This way, you need not be manually involved
in doing all these experiments. You can just look at the report and say,
okay, this is my resiliency score, and my resiliency
score is holding good even with the next set of releases that
I'm going to roll out. So this is like continuous
experimentation. This is the next phase of chaos engineering, which we
should target: once everything is automated, once you know
your system's failure model, then you can
easily create this automation in
your CI/CD pipelines. This way,
every time code is checked in, you get a resilience
score, and you know that your code will
not compromise the availability of your system.
A very simple example: suppose you create alarms using
some Terraform template or CloudFormation.
You have made some changes in that CloudFormation, and
as part of your CI/CD pipeline that is also going to be deployed and
is going to synthesize some new alarms. Now, what happens
is that as part of your CI/CD pipeline, you have a
step which validates whether all the alarms actually fire or not.
Now imagine a situation where you introduced a bug in your alarm
creation code, so it updated an alarm
which it should not have updated. Your CI/CD
pipeline will catch it, because as part of the chaos experiments we do
two things: we inject failure, and we validate. You injected the
failure, and you validate and evaluate via metrics, or via
anything else, or via alarming. Here, in this case, we validated by alarming.
We saw that, okay, the alarms are not firing, so there's something
wrong, and we pause that release. This
is a very good benefit of a CI/CD
pipeline and how you can integrate chaos into your CI/CD pipeline.
A very important quote, a very famous quote, is from
Jesse Robbins, who was also known as the master of disaster.
So master of disaster was his official
title at Amazon, and Jesse Robbins used to manage
resiliency for everything that carried the Amazon.com tag.
The quote goes: for every dollar spent in failure, you learn a dollar's
worth of lessons. It means that for
every dollar you spend in creating a failure,
you will learn a dollar's worth of lessons.
So whatever time and effort you are spending in injecting
a failure, there will always be new learnings that will
come in. So it will not go in vain
because every time you inject a failure, you will
look with a different perspective. And that different
perspective will generate more unknowns in
your system and that will help build the resiliency of
your system. Here are a few of the popular
open source tools which you can use for chaos engineering:
the very famous LitmusChaos; then Chaos Monkey,
the legacy Chaos Monkey, which can bring down servers and
create randomness in your system; then we have ChaosBlade and
Chaos Mesh, which is very prominent for Kubernetes-based
environments; then we have Chaos Toolkit, and so on.
There are many repositories and tools which will help
you practice chaos engineering, but these are the most popular
ones that you can give a try and start exploring.
And I'll say that no one tool is suited for everyone.
Everyone will have their own failure model and their own resiliency model,
and every service, even in the same company, has its own success
metrics. So it really depends on what
kind of use case you have before you choose these chaos
engineering tools. Now, the
main topic for today: how
developers can benefit from chaos engineering.
My idea is that developers
need to think about this while they are writing code, when they are designing
their system; that time only. If you can
think about failures and do some failure-driven
development at that point, it will
benefit the entire lifecycle of the product. So when
you are designing, think about external dependency failures.
What if the database on which I am relying goes bad
or goes down? What if the server on which I
am running it starts failing, or another availability zone goes down?
What if my sister services start failing? What if my upstream services
start failing? How will I react to it? Will I be
able to give a consistent user experience?
Will I be able to give a consistent customer experience?
And when you are doing your code testing, when you are writing unit tests,
when you're writing integration tests, make sure you write some automated chaos
tests as well, like failure tests: do chaos testing on your
module. If the other module fails, how will your module
behave? These kinds of things will help you
think chaos during software development.
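As a small illustration of such a module-level chaos test, here is a sketch using Python's unittest.mock. The OrderService, DbClient, and the degraded response shape are hypothetical stand-ins for your own module and its dependency, not something from the talk.

```python
# Minimal sketch of a module-level chaos test: simulate a dependency failure
# and assert the module degrades gracefully instead of crashing.
# OrderService and DbClient are hypothetical stand-ins for your own module.
from unittest.mock import MagicMock

class DbClient:
    def fetch_order(self, order_id):
        raise NotImplementedError  # the real implementation talks to the database

class OrderService:
    def __init__(self, db):
        self.db = db

    def get_order(self, order_id):
        try:
            return {"status": "ok", "order": self.db.fetch_order(order_id)}
        except ConnectionError:
            # graceful degradation: report unavailability instead of crashing
            return {"status": "degraded", "order": None}

def test_get_order_survives_db_outage():
    db = MagicMock(spec=DbClient)
    db.fetch_order.side_effect = ConnectionError("db is down")  # injected failure
    result = OrderService(db).get_order("order-123")
    assert result["status"] == "degraded"  # module stays up under dependency failure

if __name__ == "__main__":
    test_get_order_survives_db_outage()
    print("chaos test passed")
```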
Next, you will say:
okay, you told us how to think chaos during software development,
how to build a failure model, how to build a resilience model,
what chaos engineering is, what tools to use,
and how to do it in CI/CD pipelines. Now, how
do we run these controlled experiments? If you're saying that, at a
very modular level, we should go ahead and do testing, how do we do
it? First, you identify the boundary
and the scope of the experiment. So if you have written one module,
you know what the use case of that module is, and you know which
components that module will interact with. That is your
boundary. Then you build the failure model for your
service. The failure model is: if your service
A depends on a service B, and your service A depends on
a database D, then what if A fails? What happens
if D fails? What happens if A and D both fail?
What happens if A and D both fail
simultaneously, or sequentially?
You build that model. Third, you think about dependency
failures. Very straightforward: think about
external and think about internal. External can be
any cloud service you are using, any managed service, or any service
which is running locally; suppose you have installed MongoDB on
premise, what will happen if MongoDB goes down? Then there
are the intra-dependency failures: what if my sister systems fail?
Then, step four, you inject failure, you monitor, and then you evaluate
results. So this is one controlled experiment for a module.
If you do this, you know that your module
behaves correctly. Then you start bubbling up:
several modules build a service, so at a service level also
we can do the same five steps, now for the service.
And for the entire product, at the product level, the boundary
and scope will increase, but the entire set of five
steps still remains the same.
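Here is a rough sketch of what such a failure model can look like as data, using the service A / database D scenarios above; the fail_* hooks and check_steady_state() are hypothetical placeholders, and a dependency-failure scenario for service B would be added the same way.

```python
# Sketch of a failure model for a service A that depends on a service B and a
# database D, expressed as data plus a tiny runner, following the scenarios above.
# The fail_* hooks and check_steady_state() are hypothetical placeholders.

def fail_service_a_instance():
    print("injecting: an instance of service A goes down")

def fail_database_d():
    print("injecting: database D goes down")

def check_steady_state():
    # Placeholder: query metrics/alarms and return True if A still meets its SLO.
    return True

FAILURE_MODEL = [
    {"name": "A instance down", "inject": [fail_service_a_instance]},
    {"name": "D down",          "inject": [fail_database_d]},
    {"name": "A and D down",    "inject": [fail_service_a_instance, fail_database_d]},
]

def run(scenarios):
    results = {}
    for scenario in scenarios:
        for inject in scenario["inject"]:  # inject one or several failures
            inject()
        results[scenario["name"]] = check_steady_state()  # monitor and evaluate
    return results

if __name__ == "__main__":
    print(run(FAILURE_MODEL))
```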
Let's do something practical now. Say you need
to design a microservice which is responsible for doing some CRUD
operations and basic computations.
Let's keep the design very simple: we have a microservice which will be running
on a virtual machine, and we'll do some CRUD operations using
a database. Let's use a SQL database,
and once we get that data, we'll do some computations.
How can we think chaos here? Interesting.
The first thing that should come to mind, based on the previous
steps, is: okay, let's go back and revisit the
boundary and scope of the experiment. The boundary
and scope of the experiment is my microservice. The microservice will
return some results after CRUD operations and basic computation;
that is my end result. Now,
what if my external dependencies go down? My external dependencies here,
at a very high level, are one database and one virtual
machine. Okay,
so how will I react to a
database failure? Because I need to return
a consistent experience to my customer. If you think about it,
you will conclude: okay, maybe I can make
my databases global. When I say global,
maybe I'll replicate them between availability
zones and between regions.
So in case of a region outage or in case of a
disaster, at least my databases can survive. This
thought process will help you strengthen your database
infrastructure. But what if you still feel,
okay, something can go wrong? Even global databases can go
wrong. How do you cater to that?
What to do? Maybe I can have a
cache, which is a
fallback mechanism for whatever I wanted to query
from my database: I'll store it in a cache. Maybe
my data will be stale until my databases recover,
but I'll be able to give a consistent customer
experience. I won't suddenly start throwing errors;
my data is stale, but I'm still able to survive.
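Here is a minimal sketch of that stale-cache fallback; query_database() and the in-memory dict standing in for a real cache are hypothetical simplifications.

```python
# Sketch of the stale-cache fallback: serve possibly stale data from a cache
# when the database is unreachable, instead of throwing errors.
# query_database() and the in-memory dict cache are simplified placeholders.

cache = {}  # in production this would be Redis, Memcached, etc.

def query_database(key):
    # Placeholder: pretend the database is currently down.
    raise ConnectionError("database unavailable")

def read(key):
    try:
        value = query_database(key)
        cache[key] = value                               # refresh cache on success
        return {"value": value, "stale": False}
    except ConnectionError:
        if key in cache:
            return {"value": cache[key], "stale": True}  # degrade, don't fail
        raise  # nothing cached: surface the failure

if __name__ == "__main__":
    cache["user:42"] = {"name": "Ada"}  # pretend an earlier read warmed the cache
    print(read("user:42"))              # stale but consistent answer while the DB recovers
```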
The next thing to think about is: okay, I'm running on a virtual machine. Now,
that virtual machine,
should I put it in one AZ, two AZs, three AZs?
How should I go about it? If we want
to sustain an AZ+1 failure, which means
that one AZ is fully down, plus one more instance is
down, we need to replicate this service across three AZs.
So we need to have a bare minimum of three
boxes, one in each AZ, and only then
can we say that, okay, in case of AZ outages,
we'll be able to survive. This kind of
thought process is what we need
when we are designing a microservice responsible for
doing these CRUD operations. Now, as part of failure experiments,
what you can do, as a very basic thing, is just shut down your database and
see how your system behaves. Create some
network latencies, see how your customer experience is
getting impacted, and see what you can do to improve it.
Or rather, I'll say that you can
even find the bottleneck of your system:
okay, this is the maximum network
latency at which our customer experience won't be deteriorated.
So this will also help you identify your resiliency bottlenecks.
Then: what if two
of my instances of the microservice go down? Will I
be able to sustain a good customer experience?
Or whatever my success metrics are,
will they remain the same even when two of the instances
go down? Will the one remaining instance be able to take the load,
or be able to sustain the load? So this is how we
design a microservice for doing
CRUD operations. And when we were designing it, we thought
about the different failures that can happen in
a real-world scenario, and then we designed the system accordingly.
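As a sketch of that bottleneck-finding idea, assuming a hypothetical call_service() and a hypothetical SLO number, you could step up an injected delay until the customer-facing SLO breaks:

```python
import time

# Sketch of finding a latency bottleneck: step up an injected network delay
# until the customer-facing SLO breaks. call_service() and SLO_MS are hypothetical.
SLO_MS = 500  # assume customer experience is fine below this end-to-end latency

def call_service(injected_delay_ms):
    # Placeholder for a real request routed through a fault-injection proxy.
    time.sleep(injected_delay_ms / 1000)
    base_processing_ms = 120
    return injected_delay_ms + base_processing_ms

def find_max_tolerable_delay():
    previous = 0
    for delay_ms in range(0, 1001, 100):  # inject 0 ms, 100 ms, ..., 1000 ms
        if call_service(delay_ms) > SLO_MS:
            return previous               # last injected delay that still met the SLO
        previous = delay_ms
    return previous

if __name__ == "__main__":
    print(f"max injected delay that still meets the SLO: {find_max_tolerable_delay()} ms")
```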
So it was just an example. Now we go into some
of the best practices in chaos engineering.
I have reiterated it multiple times, and I'll reiterate it one more time:
understand the steady state of the system.
Until you know what the correct state of the system is,
you will not be able to identify when the system goes wrong.
When the system is behaving abnormally,
you can only judge whether it's right
or wrong if you know the steady state of the system. Second, the
failure model is very important. Like in this example, we built
a failure model: what if the database goes down? What if the
virtual machine on which I'm running my system goes down? Third:
how can I control the blast radius of my
experiments? When I say blast radius, I mean
the boundary which I was talking about: my microservice
is interacting with XYZ components and a database, so
I'll just restrict the experiment to those kinds of failures. Then, introduce
randomness or jitter in your failure injections.
So maybe I'll not say, okay, shut it down for 30 minutes and then
let the server come up. Maybe do something
intermittent instead: shut it down for, say, five minutes, then bring
it up, then shut it down again for 20 minutes. Let's take
the example of a fire outage, and think of how
a fire happens in a data center. In a
data center there are multiple racks, okay? Now, when the
fire happens, it will start with, say, one rack catching fire,
so some virtual machines are impacted; then a second rack, then a third rack;
ultimately the whole data center is down,
ultimately the whole AZ is down. So there is randomness
in how the fire is spreading and how it is creating failures and
creating disasters in the system.
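Here is a small sketch of adding that jitter, with hypothetical stop/start hooks: rather than one fixed 30-minute outage, toggle the dependency down and up for random intervals.

```python
import random
import time

# Sketch of adding jitter to failure injection: instead of one fixed 30-minute
# outage, toggle the dependency down and up for random intervals.
# stop_dependency() and start_dependency() are hypothetical hooks.

def stop_dependency():
    print("dependency stopped")

def start_dependency():
    print("dependency started")

def intermittent_outage(total_minutes=30, min_step=2, max_step=10, seed=None):
    rng = random.Random(seed)
    elapsed = 0
    while elapsed < total_minutes:
        down = rng.randint(min_step, max_step)  # random outage length in minutes
        stop_dependency()
        time.sleep(down * 60)
        start_dependency()
        up = rng.randint(min_step, max_step)    # random recovery window in minutes
        time.sleep(up * 60)
        elapsed += down + up

if __name__ == "__main__":
    intermittent_outage(total_minutes=30, seed=42)
```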
Always test using real-world conditions, and don't think,
okay, I'll just go and test in prod. It is a journey. It will start
in pre-prod. Always conduct post-incident
analysis after each experiment. This is very important:
until you conduct post-incident analysis,
you will not be able to reap the benefits of chaos engineering.
As important as failure injection is,
post-incident analysis is even more important, because that is where
you will get to know about the bugs or the issues that might
have occurred as part of this. Then there is extensive monitoring
and logging: your system needs to have a good observability posture
so that you can identify issues when you are running these chaos experiments.
Last but not least, start today, and these experiments
should be run often. If you run regular experiments, you'll be
able to increase your resiliency scores.
Now I'll talk about where chaos engineering is used today.
It is being used a lot these days
wherever we are focused on speed.
Suppose you have systems
which are doing one-day delivery, grocery delivery, or
food delivery: these systems need to be available,
and to keep these systems available
and to check their availability, chaos engineering is the way.
63% of 400-plus IT professionals say
that they have performed chaos experiments, and this is
a good number. 30% claim that they run them
in production. So this gives us good confidence to go
and write these chaos experiments tomorrow, because if people are running them in production,
why can't we start in a pre-prod environment and test our resiliency?
GitHub has over 200 chaos-experiment-related
projects, with 16k-plus stars. So you can imagine
the number of people who are into chaos engineering.
And there is a stat that teams who run
frequent chaos experiments are seeing, at a minimum, three nines
of availability, which is very good. And all
major cloud providers, like AWS and Azure, have their own managed services for
doing chaos experiments. Apart from AWS and Azure, we have many
other managed chaos services as well, like
LitmusChaos, which is provided by Harness. So do
check them out and see
how you can plug this into your existing software
development lifecycle, so you can
break systems for resilience, build your resilience score,
increase your nines, and inculcate chaos engineering
as a habit. So feel free to reach out to me on Twitter or
on LinkedIn. On LinkedIn, my alias is the algo,
without the underscores, and on Twitter you can just scan
this QR code, which will take you to my Twitter page.
You can just DM me or tag me for any follow-up questions regarding
this talk. Hope you enjoyed
the session, and do go through the other talks at
Conf42 as well. There are many interesting topics where people are
talking about chaos engineering and different aspects of it.