Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hi. Welcome, everyone. In this session today, we'll discuss building confidence through chaos engineering on AWS. We'll learn what chaos engineering is, what it isn't, what the value of chaos engineering is, and how you can get started with chaos engineering within your own firm. But more importantly, I will show you how you can combine the power of chaos engineering and continuous resilience and build a process that lets you scale chaos engineering across your organization in a controlled and secure way, so that your developers and engineers can build secure, reliable, and robust workloads that ultimately lead to a great customer experience. Right.
My name is Narender Gakka. I'm a solutions architect at AWS,
and my area of expertise is resilience as well.
So, let's get started. First, let's look at our agenda. I'll introduce you to chaos engineering, and we'll see what it is and, more importantly, what it isn't. I will also take you through the various aspects to think about as prerequisites for chaos engineering and what you need to get started with your own workloads in your own environments. We'll then dive deep into continuous resilience and why continuous resilience is so important when we are thinking about resilient applications on AWS. Combining chaos engineering and continuous resilience, I will also take you through our Chaos Engineering and Continuous Resilience program that we use to help our customers build chaos engineering practices and programs that can scale across their organizations. And at last, I will also share with you some resources and great workshops which we have, so that you can get started with chaos engineering on AWS on your own.
So when we are thinking about chaos engineering, it's not really a new concept. It has been around for over a decade now. And there are many companies that have already successfully adopted and embraced chaos engineering and have built mechanisms to find what we call known unknowns. These are things that we are aware of but don't fully understand in our systems; they could be weaknesses within our system or resilience issues. They also chase the unknown unknowns, which are the things that we are neither aware of nor fully understand. And through chaos engineering, these companies were able to find deficiencies within their environments and prevent what we call at AWS large-scale events, and therefore ultimately deliver a better experience for their customers.
And yet, when we are talking about chaos engineering, in many ways that is not how chaos engineering is seen. Right? There is still a perception that chaos engineering is that thing which blows up production, or where we randomly just shut down things within an environment. That is not what chaos engineering is about. It's not just about blowing up production or randomly stopping things or removing things. When we are thinking about chaos engineering at AWS, we should look at it from a much different perspective.
Many of you have probably seen the shared responsibility model for AWS for security. This one is basically for resilience, and there are two sections, the blue and the orange part. For the resilience of the cloud, we at AWS are responsible for the resilience of the facilities, for example the network, the storage, the networking, or the database aspects. These are basically the services which you consume. But you as a customer are responsible for how and what services you use and where you place them. Think, for example, about your workloads. Think about zonal services like EC2, where you place your data, and how you fail over if something happens within your environment.
But also think about the challenges that come up when you are looking at a shared responsibility model. How can you make sure, if a service that you are consuming in the orange part fails, that your workload is resilient? Right. What happens if an Availability Zone goes down at AWS? Is your workload or application able to recover from those things? How do you know if your workloads can fail over to another AZ? This is where chaos engineering comes into play and helps you with those aspects. So when you are thinking about the workloads that you are running in the blue, what you can influence is the primary dependency that you're consuming in AWS: if you're using EC2, if you're using Lambda, if you're using SQS, if you're using caching services like ElastiCache, these are the services that you can impact with chaos engineering in a safe and controlled way. And you can also figure out mechanisms for how the components within your application can gracefully fail over to another service.
So when we are thinking about chaos engineering, what it provides you is improved operational readiness, because with chaos engineering your teams will get trained on what to do if a service fails, and you will have mechanisms in place to be able to fail over automatically. You will also have great observability in place, because by doing chaos engineering you will realize what is missing within the observability you currently have and what you haven't seen. And when you are running these experiments in a controlled way, you'll continuously improve the observability part as well. And ultimately you will build resilience, so that the workloads which you build will be more resilient on AWS. When you're thinking about all of this put together, what it leads to is of course a happy customer and a better application. Right? So that's what chaos engineering is about: it's all about building resilient workloads that ultimately lead to a great customer experience.
And so when you think about chaos engineering, it's all about building controlled experiments. If we already know that an experiment will fail, we're not going to run that experiment, because we already know why it fails and there is no point in running it; it's better that you invest the time in fixing that issue. And if we know that we're going to inject a fault and that fault will trigger a bug that brings our system down, we're also not going to run the experiment, because we already know what happens when you trigger that bug. It's better to go and fix that bug. So what we want to make sure is that if we have an experiment, by definition that experiment should be tolerated by the system and should also be fail-safe, so that it doesn't lead you into issues.
And many of you might have a similar workload with a similar architecture, wherein you have the external DNS pointing to your load balancer, and you have a service running which is getting customer data from either a cache or a database, depending on your data freshness, et cetera. But when you're thinking about it, let's say you're using Redis on EC2 or ElastiCache, what is your confidence level if Redis fails? Right? What happens if Redis fails? Do you have mechanisms in place to make sure that your database does not get fully overrun by all these requests which are suddenly not being served from the cache?
Or what about the latency that suddenly gets injected between your two microservices, and you create a retry storm? Right? Do you have mechanisms to mitigate such an issue? What about back-off and jitter, et cetera? (I'll show a small sketch of that pattern in a moment.) And also, let's assume that you have a cascading failure, that everything in an Availability Zone goes down. Are you confident that you can fail over to a different Availability Zone? And think about the impacts that you might have on a regional service: what is your confidence if the whole region, the entire region, basically goes down? Is your service able to recover in another region within the given SLA of your application? Right. So what is your confidence level with the current architecture that you have? Do you have those runbooks and playbooks which will let you do this cross-Region or cross-AZ failover seamlessly? And can you run through them? Right.
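Picking up the back-off and jitter question from a moment ago, here is a minimal sketch of a retry with capped exponential back-off and full jitter, the pattern that helps avoid a retry storm. The function and parameter names are illustrative, not from the talk.

```python
import random
import time

def call_with_backoff(request_fn, max_attempts=5, base_delay=0.1, max_delay=5.0):
    """Retry a flaky call with capped exponential back-off and full jitter."""
    for attempt in range(max_attempts):
        try:
            return request_fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # give up after the last attempt
            # full jitter: sleep a random amount up to the exponential cap
            cap = min(max_delay, base_delay * (2 ** attempt))
            time.sleep(random.uniform(0, cap))
```

The jitter spreads retries out over time, so thousands of clients that fail at the same moment do not all hit the dependency again at the same moment.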
And so when you're thinking about chaos engineering, and about the services that we build on a daily basis, they're all based on trade-offs that we make every single day, right? We all want to build great, awesome workloads. But the reality is that we are under pressure: we can only use a certain budget, there is a certain time by which we need to deliver, we have time constraints, and we also need to maintain the system and get certain features released on time. And in a distributed system, there is no way that every single person understands the hundreds of microservices that communicate with each other. Ultimately, what happens is that I think I'm depending on a soft dependency, someone suddenly changes the code, it becomes a hard dependency, and suddenly you have an event. And when you're thinking about these events, usually they happen when you're somewhere in a restaurant or maybe somewhere outside, and you get called at an odd hour, and everybody runs and tries to fix that issue and bring the system back up. And the challenge with such a system is that once you go back to business as usual, you might hit that same challenge again. Right? And that's not because we don't want to fix it, but because good intentions don't work. That's what we say at AWS, right? You need mechanisms, and those mechanisms can be built using chaos engineering and continuous resilience.
Now, as I mentioned in the beginning, there are many companies, across many verticals, that have already adopted chaos engineering, and some of them started quite early and have seen tremendous benefits in the overall improvement of resilience within their workloads. These are some of the industries we see on the screen, and there are many case studies and customer stories in that link. So please feel free to go through them, if you belong to those industries, to see how they have leveraged chaos engineering and improved their architectures overall. There are many more customers that will adopt chaos engineering in the years to come. There is a great study by Gartner, done for the infrastructure and operations leaders guide, which said that 40% of companies will adopt chaos engineering in the next year alone. And they are doing that because they think they can improve customer experience by almost 20%. Think about how many more happy customers you're going to have with such a number; 20% is a significant number. So let's get to the prerequisites now on how you can get started with chaos engineering.
Okay. First, you need basic monitoring, and if you already have observability, that's a great starting point. Second, you need to have organizational awareness as well. Third, you need to think about what sort of real-world events or faults we are injecting into our environment. And fourth, of course, once we find those faults within the environment and we find a deficiency, we remediate: we actively commit ourselves and have the resources to go and fix those issues, so that it improves either the security or the resiliency of your workloads. There's no point finding something but not fixing it, right? So that is the fourth prerequisite.
So when you're thinking about metrics, many of you probably already have great metrics. If you're using AWS, you have the CloudWatch integration and you already have metrics coming in. In chaos engineering terms, we call metrics known knowns: these are the things that we are already aware of and fully understand. And when you're thinking about metrics, for example CPU percentage or memory, that's all great, but in a distributed system you're going to look at many, many different dashboards and metrics to figure out what's going on within the environment, because each one gives you its own view rather than a comprehensive one.
So when you are starting with chaos engineering, many times when we are running the first experiments, even if we are trying to make sure that we are seeing everything, we realize we can't see it. And this is what leads to observability. Observability helps us find that needle in a haystack. By collating all the information, we start looking at the highest level, at our baseline, instead of looking at a particular graph. And even if we have absolutely no idea what's going on, we're going to understand where we are. So basically, at a high level we know what our application health is, what sort of customer interaction we are having, et cetera, and then we can drill down all the way to tracing. We can use services like AWS X-Ray and understand it. But there are also many open source options, and if you already use them, that's perfectly fine as well, right?
So when you're thinking about observability, this is the key. Observability is based on what we call three key pillars: you have, as we already mentioned, the metrics, then you have the logging, and then you have the tracing. Now, why these three are important is because we want to make sure that you embed, for example, metrics within your logs, so that if you're looking at a high-level steady state, you can drill in. And as soon as you move from tracing to logs, you see what's going on and can also correlate between all those components end to end. At that point you can understand where your application is.
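One common way to tie those pillars together, shown here as a hedged illustration rather than something from the talk, is to write structured log lines in the CloudWatch Embedded Metric Format (EMF), so the same record produces a metric, a searchable log entry, and a pointer to the trace. The namespace and field names below are assumptions for the example.

```python
import json
import time

def emit_request_record(service, latency_ms, trace_id):
    """Log one request in CloudWatch Embedded Metric Format (EMF): CloudWatch
    extracts the Latency metric while the full record stays queryable in Logs."""
    record = {
        "_aws": {
            "Timestamp": int(time.time() * 1000),
            "CloudWatchMetrics": [{
                "Namespace": "PaymentsApp",  # hypothetical namespace for this example
                "Dimensions": [["Service"]],
                "Metrics": [{"Name": "Latency", "Unit": "Milliseconds"}],
            }],
        },
        "Service": service,
        "Latency": latency_ms,
        "traceId": trace_id,  # lets you pivot from a metric spike to the X-Ray trace
    }
    print(json.dumps(record))  # stdout is shipped to CloudWatch Logs by the agent or runtime
```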
So let's take an example of observability. When we are looking at this graph, even a person who has absolutely no idea what this workload is can see there are a few issues here: if you look at the spikes there, you're going to say, okay, something happened here. And if we drill down, we would see that we have a process which ran out of control, or there is a CPU spike, right? And every one of you is able to look at the graph down here and say, wait a minute, why did the disk utilization drop? And if you drill down, you will realize that there was an issue with my Kubernetes cluster and the pods, right, the nodes suddenly started restarting, and that leads to a lot of 500 errors. And as you know, HTTP 500 is obviously not a good thing to serve. So if we can correlate this, that is good observability: because of such and such issue, this is my end user experience.
And you want to provide developers with an understanding of the interactions within the microservices, especially when you're thinking about chaos engineering and experiments; you want them to understand what the impact of the experiment is. And we shouldn't forget the user experience and what users see when you are running these experiments, because if you're thinking about the baseline, and we are running an experiment and the baseline doesn't move, that means the customer is super happy because everything is green, it's all working. If everything is fine from the end user's perspective even while you are doing the experiment, that is a successful, reliable, resilient application, right?
Now that we understand the observability aspects, and we have seen what basic monitoring and observability are, let's move on to the next prerequisite, which is organizational awareness. What we found is that when you start with a small team, enable that small team on chaos engineering, have them build common faults that can be injected across the organization, and then decentralize chaos engineering to the development teams, that works fairly well.
Now why is that? Why does a small team work well? If you think about it, depending on the scale and size of your organization, you might have hundreds if not thousands of development teams who are building applications. There's no way that a central team will understand every single workload that is around you. And there is also no way that the central team will get the power to inject failures everywhere. But those development teams already have the IAM permissions to access their environments and do things in their own environments, rather than the central team doing it the other way around. So it's much easier to help them run experiments than to have a central team run them for them. Right. So you decentralize chaos engineering so that they can embrace it as part of the development cycle itself. That also helps with building customized experiments suitable for their own workload, which they are designing and building. And the key for all of this to work is having that executive sponsorship, the management sponsorship, that helps you make resilience part of the software development lifecycle, and also shift the responsibility for resilience to those development teams who know their own application, their own piece of code, better than anybody else.
And they can also think about the real-world failures and faults which their application can suffer from or has dependencies on. Now, when we say real-world experiments, some of the key experiments are code and configuration errors. So think about the common faults you can inject when you are thinking about deployments, or think about experiments where you ask, okay, do we even realize that we have a faulty deployment? Do we see it within observability if my deployment fails, or if it leads to a customer transaction failing, et cetera? So how do we run those experiments? Second, when we are thinking about infrastructure: what if you have an EC2 instance that fails within your environment, or suddenly, in a microservices deployment, you have an EKS cluster where a load balancer doesn't pass traffic? Are you able to mitigate such infrastructure events within your architecture? And what about data and state? Right, this is also a critical resource for your application. This is not just about cache drift, but what if suddenly your database runs out of, let's say, disk space or memory? Do we have mechanisms not only to first detect it and inform you that this happened, but also to automatically mitigate it so that your application keeps working resiliently, right? And then of course you have dependencies. Do you understand the dependencies of your application on any third parties which you have? It could be an identity provider or a third-party API which your application consumes every time a user logs in or, let's say, does a transaction. Do you understand those dependencies, and what happens if they suffer any issues? Do you have mechanisms to first test that, and also prove that your application is resilient enough to tolerate it and can work without them as well?
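One common way to tolerate a flaky third-party dependency, sketched here as an assumption rather than something prescribed in the talk, is a circuit breaker: after a few consecutive failures you fail fast to a fallback for a cool-down period instead of piling more load onto the dependency.

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker sketch for calls to a third-party dependency."""

    def __init__(self, failure_threshold=3, reset_timeout=30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None

    def call(self, dependency_fn, fallback_fn):
        if self.opened_at is not None:
            if time.time() - self.opened_at < self.reset_timeout:
                return fallback_fn()          # circuit open: degrade gracefully
            self.opened_at = None             # half-open: try the dependency again
        try:
            result = dependency_fn()
            self.failures = 0                 # success closes the circuit
            return result
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.time()  # too many failures: open the circuit
            return fallback_fn()
```

A chaos experiment that injects faults into that dependency is exactly how you prove the breaker and the fallback actually work.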
And then of course, although highly unlikely but technically feasible, there are natural disasters, and there are human errors, where a user does something and something happens. How can you fail over, and how can you simulate those events, and that too in a controlled way, through chaos engineering? Right. So these are some of the real-world experiments which you can do with chaos engineering. And then the last prerequisite, of course, is about making sure that when we find a deficiency within our systems, which could be related to security or resilience, we can go and remediate it, because it's worth nothing if you build new features but our service is not available. Right. So we need that executive sponsorship as well, so that we are able to prioritize these issues which come up through chaos engineering, fix them, and improve the resilience of the architecture in a continuous fashion. That now brings us to continuous resilience. When we are thinking about continuous resilience, resilience is not just a one-time thing, because every day you're building new features, releasing them to your customers, and your architecture changes. So it should be part of everyday life when we are thinking about building resilient workloads, right from the bottom all the way up to the application itself.
And so continuous resilience is basically a lifecycle that helps us think about a workload from a steady-state point of view and work towards mitigating events like the ones we just went through, from code and configuration all the way to the very unlikely events like natural disasters, et cetera. We also need to build safe experimentation for these within our pipelines, and actually outside our pipelines as well, because errors happen all the time, not just when we push new code, and we need to make sure that we also learn from the faults that surface during those experiments. And when you take continuous resilience and chaos engineering and put them together, that's what leads us to the Chaos Engineering and Continuous Resilience program. It's a program that we have built over the last few years at AWS and have helped many customers run through, which enabled them, as I was saying earlier, to build a chaos engineering program within their own firm and scale it across various organizations and development teams, so that they can build controlled experiments within their environment and also improve resilience.
Usually when we are starting on this journey, we start with a game day that we prepare for. A game day is not, as you might think, just a two-hour session where we run something and check whether it was fine or not. Especially when we are starting out with chaos engineering, it's important to truly plan what we want to execute, so setting expectations is a big part of it. The key to that, because you're going to need quite a few people that you want to invite, is project planning. And usually the first time we do this, it might take between one and three weeks to plan the game day and line up the various people that we want in the game day, like the chaos champion who will advocate the game day throughout the company.
It could be the development teams; if there are site reliability engineers, SREs, we're going to bring them in as well, plus observability and incident response teams. And then, once we have all the roles and responsibilities for the game day, we're going to think about what it is that we want to run experiments on. When we are thinking about chaos engineering, it's not just about resilience; it can be about security or other aspects of the architecture as well. So the contribution here is a list of what's important to you. That can be resilience, that can be availability, or that can be security. For some customers it could also be durability. That's something which you can define. And then of course we want to make sure that there is a clear outcome of what we want to achieve with this chaos experiment. In our case, when we are starting out, what we actually prove to the organization and the sponsors is that we can run an experiment in a safe and controlled way without impacting the customers. That's the key. And then we take these learnings, whether we found something or not, and share them within our customers, to make sure that the business units understand how to mitigate these failures if we found something, or have the confidence that we are resilient to the faults we injected.
Then we go to the next step, where we select a workload. For this presentation, let's have an example application. Because we are talking about a bank, this can be a payments workload. It's running on EKS, where EKS is deployed across multiple Availability Zones, and there is Route 53 and an Application Load Balancer taking in the traffic, et cetera. And there is also an Aurora database and Kafka for managed streaming, et cetera. It's important, when you are choosing a workload, to make sure that we are not starting out with a critical workload that you already have and then impacting it. Obviously no one would be happy if you start with such a critical system and something goes wrong. So choose something which is not critical, so that even if it is degraded, even if it has some customer impact, you are still fine because it's not critical. And usually you have metrics that allow for that when you're thinking about SLOs for your service. Once you have chosen a workload, we're going to make sure that the chaos experiments we want to run are safe. And we do that through a discovery phase of the workload. That discovery phase will involve quite a bit of architecture, right? So we're going to dive deep into it. You all know the Well-Architected review. It helps customers build secure, high-performing, resilient, and efficient workloads on AWS, and it has six pillars: operational excellence, security, reliability, performance efficiency, and cost optimization, as well as the newly added sustainability pillar.
And when we are thinking about the Well-Architected review, it's not just about clicking the buttons in the tool. We are talking through the various designs of the architecture, and we want to understand how the architecture, the workloads, and the components within your workloads speak to one another. Right. And what mechanisms do you have in place? When one component fails, can it retry, or not? What mechanisms do you have in regards to circuit breakers: have you implemented them, have you tested them, et cetera? Do you have runbooks or playbooks in place in case we have to roll back a particular experiment? And we want to make sure that you have the observability in place, and, for example, health checks as well, so that when we execute something your system can automatically recover from it.
And if, with all that information, we can see that there is a deficiency that might impact internal or external customers, that's where we basically stop. When we see an impact to customers, we stop that experiment. And if we have known issues, we're going to have to fix those first before we move on within that process. Now, once and only if everything is fine, we're going to say, okay, let's move on to the definition of an experiment. So the next phase is defining the experiment. When you are thinking about your system, the sample application we just saw before, we can think about what can go wrong within this environment, and whether we already have mechanisms in place or not. For example, if you have a third-party identity provider, as in our case, do we have a break-glass account with which I can prove that I can still log in if something happens? If that identity provider goes down, can I still log in with a break-glass account? And what about my EKS cluster? If I have a node that fails within that cluster, does my workload get rescheduled on another node? Do I know how long that takes, or what the end customer impact would be if it happens?
Or it could be that someone misconfigured an Auto Scaling group and its health checks, which suddenly marks most of the instances in that zone unhealthy. Do we have mechanisms to detect that, and what does that mean for customers and for the teams that operate the environment? Think about the scenario where someone pushed a configuration change and ECR, your container registry, is no longer accessible, which means you cannot launch new containers. Do we have mechanisms to detect that and recover from it? And what about issues with the Kafka cluster which is managing our streaming? Are we going to lose any active messages? What would the data loss be there? What if it loses a partition, or loses its connectivity, or simply reboots, et cetera? Do we have mechanisms to mitigate that? And what about our Aurora database? What if the writer instance is not accessible or has gone down for whatever reason? Can it automatically and seamlessly fail over to the other node? And while all of this is happening, what happens to the latency or the jitter of the application?
With fault injection and controlled experiments, we are able to test all of this. And then lastly, think about the challenges that your clients might have connecting to your environment while all of this is happening. For our experiment, what we wanted to achieve is that we can execute and understand a brownout scenario. A brownout scenario is where a client that connects to us expects a response in a certain amount of time, let's say milliseconds, depending on the environment. And if we do not provide that, the client is just going to go away and back off. But the challenge with a brownout is that your server is still trying to compute whatever they requested, while the client is no longer there, and those are wasted cycles. That inflection point is basically called the brownout.
Now, before we can actually think about an experiment to simulate a brownout within our EKS environment, we need to understand the steady state: what the steady state is and what it isn't. When you're thinking about defining a steady state for your workload, that's the high-level top metric for your service. For example, for a payment system it could be transactions per second, for a retail system it can be orders per second, for streaming it's stream starts per second, et cetera. And when you're looking at that line and you see an order drop or a transaction drop, it's very likely that something you injected within the environment caused that drop. So we need to have that steady-state metric already available, so that when we run these chaos experiments, we immediately know when something happened.
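As a hedged illustration of turning that steady state into something actionable, here is a boto3 sketch that creates a CloudWatch alarm on a hypothetical transactions-per-second metric; the namespace, metric name, and threshold are assumptions, not values from the talk. The same alarm can later serve as a stop condition for the experiment.

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

# Hypothetical steady-state alarm: fire if successful transactions over a
# one-minute period fall below ~99% of the expected 300 TPS baseline.
cloudwatch.put_metric_alarm(
    AlarmName="payments-steady-state-tps",
    Namespace="PaymentsApp",               # assumed custom metric namespace
    MetricName="SuccessfulTransactions",   # assumed metric emitted by the service
    Statistic="Sum",
    Period=60,
    EvaluationPeriods=2,
    Threshold=300 * 60 * 0.99,
    ComparisonOperator="LessThanThreshold",
    TreatMissingData="breaching",          # no data at all is itself a deviation
)
```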
The hypothesis is key as well when we are thinking about the experiment, because the hypothesis will define, at the end, whether your experiment turned out as expected or whether you learned something new that you didn't expect. And the important part here, as you see, is that we are expecting a transaction rate of 300 transactions per second, and we think that even if 40% of our nodes fail within our environment, 99% of all requests to our API should still be successful, at the 99th percentile, and return a response within 100 milliseconds. What we also want to define, because we know our systems, is that based on our experience the nodes should come back within five minutes, and the pods will get scheduled and process traffic within eight minutes after the initiation of the experiment.
And once we all agree on that hypothesis, we're going to go and fill out an experiment template. In the experiment template itself, we're going to very clearly define what we want to run: we're going to have the definition of the workload itself and what experiment and action we want to run. You might want to run the experiment for 30 minutes with five-minute intervals, staggering the experiments you are running, to make sure that you can look at the graphs and understand the impact of the experiments. And then of course, because we want to do this in a controlled way, we need to be very clear on what the fault isolation boundary is for our experiment, and we're going to clearly define that as well. We're going to have the alarms in place that would trigger the experiment to roll back if it gets out of control or if it causes any issues with the customer transactions. That's the key, because we want to make sure that we are practicing safe chaos engineering experiments, right? We also make sure that we understand what the observability is and what we are looking at when we are running the experiment, so we need to keep an eye on the observability and the key steady-state metrics. And then you would add the hypothesis to the template as well; as you see on the right side, we have the two empty lines for that.
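To make the template idea concrete, here is a hedged boto3 sketch of an AWS Fault Injection Simulator experiment template for the brownout scenario: terminate a percentage of the nodes in one EKS managed node group, with the steady-state CloudWatch alarm from earlier as the stop condition. The role ARN, alarm ARN, and node group ARN are placeholders, and the exact action names and parameters should be checked against the FIS documentation.

```python
import boto3

fis = boto3.client("fis")

# Sketch of an FIS experiment template (all ARNs are placeholders).
fis.create_experiment_template(
    clientToken="payments-brownout-001",
    description="Brownout: terminate 40% of the nodes in the payments node group",
    roleArn="arn:aws:iam::123456789012:role/fis-experiment-role",
    stopConditions=[{
        # Roll the experiment back as soon as the steady-state alarm fires.
        "source": "aws:cloudwatch:alarm",
        "value": "arn:aws:cloudwatch:eu-west-1:123456789012:alarm:payments-steady-state-tps",
    }],
    targets={
        "payments-nodegroup": {
            "resourceType": "aws:eks:nodegroup",
            "resourceArns": [
                "arn:aws:eks:eu-west-1:123456789012:nodegroup/payments/workers/abcd1234"
            ],
            "selectionMode": "ALL",
        }
    },
    actions={
        "terminate-nodes": {
            "actionId": "aws:eks:terminate-nodegroup-instances",
            "parameters": {"instanceTerminationPercentage": "40"},
            "targets": {"Nodegroups": "payments-nodegroup"},
        }
    },
)
```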
When we are thinking about the experiment itself, whether good or bad, we are always going to have an end report, where we might celebrate that our system is resilient enough to such a failure, or we might celebrate that we found something that we didn't know before and have just helped our application and the organization by mitigating an issue or an event which could have happened in the future. Right? So once we have that experiment ready, we're going to think about preparing, or priming, the environment for our experiment. But before we go there, I just want to touch upon the entire cycle of how we execute an experiment, because that's also critical.
The execution flow is like this. First we check if the system is actually in a healthy state, because, as I was saying in the beginning, if we already know the system is unhealthy or it's going to fail, we're not going to run that experiment; we immediately stop. Once the system is healthy, we check if the experiment is still valid, because the issue we are testing for might already have been fixed; you don't want to run that experiment if the developers have already fixed those bugs or improved that resilience. If it is still valid, we create a control and experiment group, which we make sure is well defined, and I'm going to go into that in a few seconds. And if we see that the control and experiment group is defined and up and running, then we start generating load against the control and experiment group in our environment. We check again whether the steady state is within the tolerance that we think it should be. If it is, then and only then can we go ahead and run the experiment against the target, and then we check whether it stays within the tolerance we expect. And if it doesn't, the stop condition is going to kick in and it's going to roll back.
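Here is the same execution flow as a small Python sketch. The helper objects and method names are hypothetical; the point is the order of the checks and the roll-back path, not a specific API.

```python
def run_experiment(system, experiment):
    """Execution flow: health check, validity check, control vs. experiment group,
    synthetic load, steady-state check, fault injection, roll back on deviation."""
    if not system.is_healthy():
        return "abort: system is not in a healthy state"
    if not experiment.still_valid():
        return "abort: the deficiency under test was already fixed"

    control, candidate = system.provision_control_and_experiment_group()
    load = system.start_synthetic_load(control, candidate)

    if not system.steady_state_within_tolerance(control, candidate):
        load.stop()
        return "abort: groups diverge before any fault was injected"

    experiment.inject_fault(candidate)
    if not system.steady_state_within_tolerance(control, candidate):
        experiment.roll_back()   # the stop condition kicks in
        return "rolled back: hypothesis not met, record the findings"
    return "success: hypothesis held, write the report and share it"
```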
As I was saying on the previous slide, I mentioned the control and experiment group. When you're thinking about chaos engineering and running experiments, the goal is always that, one, it's controlled, and two, you have minimal or no impact to your customers when you're running it. One way you can do that is, as we call it, not just having synthetic load that you generate, but also synthetic resources. For example, you spin up a new EKS cluster, a synthetic one, into which you inject the fault, while the other one stays healthy and keeps serving your customers. So you're not impacting an existing resource that is already being used by the customers, but a new resource with exactly the same code base, where you can understand what happens in a certain failure scenario. Once we prime the experiment and we see that the control and experiment group are healthy and we see a steady state, we can then move on and think about running the experiment itself. Now, running a chaos engineering experiment requires great tools that are safe to experiment with. When we are thinking about tools, there are various tools out there that you can use and consume. In AWS, we have Fault Injection Simulator. When you're thinking about one of the first slides, with the shared responsibility model for resilience, Fault Injection Simulator helps you quite a bit with that, because the faults that you can inject with FIS run against the AWS APIs directly. You can inject these faults against your primary dependencies to make sure that you can create mechanisms to survive a component failure within your system, et cetera. The second set of faults and actions I want to highlight, the highlighted parts, are the integrations with LitmusChaos and Chaos Mesh. The great thing about these is that they provide you with a widened scope of faults that you can inject, for example into the Kubernetes cluster in our example architecture, through Fault Injection Simulator via a single pane of glass.
And it also has various other integrations. If you want to run experiments against, let's say, EC2 instances, you have the capability to run these through AWS Systems Manager via the SSM agent. Now think about where these come into play. When we are thinking about running experiments, these are the ways you can create disruptions within the system. Let's say you have various microservices that run and consume a database. You might ask, how can we create a fault within the database without impacting all those microservices? The answer is that you can inject faults within those microservices themselves, for example packet loss, and that results in exactly the same behavior as the application not being able to talk to or write to the database, because the requests never reach the database, without you bringing down the database itself. So it's important to widen the scope, think about the experiments that you can run, and see what actions you have for simulating those various failures.
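As a hedged sketch of that idea, the template below uses the FIS aws:ssm:send-command action with the AWS-provided AWSFIS-Run-Network-Packet-Loss SSM document to inject packet loss on tagged EC2 instances, which makes a dependency look unreachable without touching the dependency itself. The tag, role ARN, region, and document parameters are assumptions to be verified against the FIS and SSM documentation.

```python
import json
import boto3

fis = boto3.client("fis")

# Sketch: 30% packet loss for five minutes on half of the tagged service instances.
fis.create_experiment_template(
    clientToken="packet-loss-001",
    description="Simulate an unreachable dependency with packet loss",
    roleArn="arn:aws:iam::123456789012:role/fis-experiment-role",
    stopConditions=[{"source": "none"}],   # in real use, point this at a CloudWatch alarm
    targets={
        "service-instances": {
            "resourceType": "aws:ec2:instance",
            "resourceTags": {"chaos-ready": "true"},   # assumed opt-in tag
            "selectionMode": "PERCENT(50)",
        }
    },
    actions={
        "inject-packet-loss": {
            "actionId": "aws:ssm:send-command",
            "parameters": {
                "documentArn": "arn:aws:ssm:eu-west-1::document/AWSFIS-Run-Network-Packet-Loss",
                "documentParameters": json.dumps(
                    {"LossPercent": "30", "Interface": "eth0", "DurationSeconds": "300"}
                ),
                "duration": "PT5M",
            },
            "targets": {"Instances": "service-instances"},
        }
    },
)
```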
So in our example case, because we want to create that brownout I showed before, we use the EKS action, with which we can terminate a certain number of nodes, a percentage of nodes, within our cluster, and we would run that. And if you use the tool itself, we can trust FIS that if something goes wrong, it can alert automatically and also help us roll back the experiment, right? The Fault Injection Simulator has these mechanisms built in: when you build an experiment with FIS, you can define what your key alarms are, which define that steady state, and they should kick in if they find any deviation. And if something goes wrong during the experiment, it should stop and then roll back the whole experiment.
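Starting and watching such an experiment is a couple of boto3 calls; the template id below is a placeholder. If the CloudWatch stop condition fires, FIS stops the experiment and the state reflects it.

```python
import boto3

fis = boto3.client("fis")

# Start the experiment from a previously created template (placeholder id).
experiment = fis.start_experiment(experimentTemplateId="EXT1234567890abcdef")
experiment_id = experiment["experiment"]["id"]

# Poll the experiment state; a triggered stop condition shows up here.
state = fis.get_experiment(id=experiment_id)["experiment"]["state"]
print(state["status"], state.get("reason", ""))
```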
In our case everything was fine, and we said, okay, now we have confidence, based on the observability that we have for this experiment. Now let's move to the next environment, which is obviously taking this into production. So you have to think about the guardrails that are important in your production environment. When you are running chaos experiments in production, especially when you are thinking about running them for the first time, please don't run them during peak hours; it's probably not the best idea. Also keep in mind that in lower environments your permissions are often quite permissive compared to production, so you have to make sure that you have the observability in place and that you have the permissions to execute these various experiments as well, because in production permissions are always more restricted. It's also key to understand that the fault isolation boundary has changed, because we are in production now, so we need to make sure that we understand that as well. And we need to understand the risk of running these in a production environment, because we need to make sure that we are not impacting our customers. That's the key. Once we understand this and have the runbooks and playbooks up to date, we are finally at a stage where we can think about moving to production. And here again, we want to think about experimenting in production with a canary.
We'll check that in a second. As you have seen this picture before, in our lower environment, we're going to do the same thing in production. But we don't have a mirrored environment, right, like some customers do, where they split traffic and have a chaos engineering environment in production alongside another environment. So what we use here is a canary: we start bringing a tiny percentage of the real user traffic into the control and experiment group. Now keep in mind, at this point nothing should go wrong. We have the control and experiment group here as well, and we haven't injected the fault yet, so we should be able to see from an observability perspective that everything is good, because we haven't created any experiments yet. And once we see that truly happen, that's where we start, that's where we kick in the experiment. Right. So we're going to get to running the experiment in production.
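One hedged way to implement that canary split, assuming DNS-based routing in front of the two groups, is Route 53 weighted records; the hosted zone id, record name, and ALB DNS names below are placeholders, and other options such as ALB weighted target groups work just as well.

```python
import boto3

route53 = boto3.client("route53")

# Send ~5% of real user traffic to the experiment group, 95% to the control group.
route53.change_resource_record_sets(
    HostedZoneId="Z0123456789EXAMPLE",
    ChangeBatch={
        "Comment": "Canary: small slice of traffic to the experiment group",
        "Changes": [
            {
                "Action": "UPSERT",
                "ResourceRecordSet": {
                    "Name": "payments.example.com",
                    "Type": "CNAME",
                    "SetIdentifier": "experiment-group",
                    "Weight": 5,
                    "TTL": 60,
                    "ResourceRecords": [{"Value": "experiment-alb.eu-west-1.elb.amazonaws.com"}],
                },
            },
            {
                "Action": "UPSERT",
                "ResourceRecordSet": {
                    "Name": "payments.example.com",
                    "Type": "CNAME",
                    "SetIdentifier": "control-group",
                    "Weight": 95,
                    "TTL": 60,
                    "ResourceRecords": [{"Value": "control-alb.eu-west-1.elb.amazonaws.com"}],
                },
            },
        ],
    },
)
```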
But when we are thinking about running this in production, we want to make sure that we have all the workload experts, the engineering teams, observability operators, incident response teams, everybody in a room before we actually do this in production, so that if something goes wrong, or if you see any unforeseen incidents during that chaos engineering experiment, you can quickly roll back and make sure that the system is back up and running in no time. Right. And the final stage is going into the correction of error state, where we list out all the key findings and learnings from the experiment we have run, and then we ask: how did we communicate between the teams? Were there any people whom we needed in the room but who were not there? Was any documentation missing, et cetera? How can we improve the overall process? How do we then take these learnings and share them across the organization so that others can further improve their overall workloads, et cetera?
So that is the final phase. The next step is the automation part, because we are not running this just once. Right. We want to take these learnings and automate them so that we can run them in pipelines. We need to make sure that experiments run in the pipeline and also outside the pipeline, because faults happen all the time; they don't just happen when you push code, they happen day in and day out within the production environment as well. Right. We can also use game days to bring the teams together, to help them understand the overall architecture and recover the applications, et cetera, and to test that those processes work and that people are alerted in a way that, if something goes wrong, they're able to work together, and then to bring in that continuous resilience culture. To make it easier for our customers, we have built a lot of templates and handbooks that we go through the experiments with. We share a Chaos Engineering Handbook that shows the business value of chaos engineering and how it helps with resilience, chaos engineering templates as well as correction of error templates, and also various aspects of the reports that we share with customers when we are running the program.
Now, next, I just want to share some resources which we have. We have workshops, and as you see on the screen, you basically start with an observability workshop. The reason for that is that the workshop builds an entire system which provides you with everything in the observability stack, and you have to do absolutely nothing beyond pressing a button, right? And once we have that, and we have the observability from the top down, from tracing to logging to metrics, then go for the chaos engineering workshop and look at the various experiments there, and start with some database fault injection, then containers and EC2, and it shows you how you can do that in the pipeline as well. You can take those experiments and run them against the sample application from the observability workshop, and it gives you a great view of what's going on within your system. If you inject these failures or faults, you'll see them right away within those dashboards with no effort at all. So these are the QR codes for those workshops. Please do get started, and reach out to any of your AWS representatives or contacts for further information on these. You can also reach me on my Twitter account. With that, I just want to thank you for your time. I know it's been a long session, but I hope you found it insightful. Please do share your feedback and let me know if you want more information on this. Thank you.