Transcript
Getting real-time feedback into the behavior of your distributed systems, and observing changes, exceptions, and errors in real time, allows you to not only experiment with confidence, but respond instantly to get things working again.
Thanks for joining my talk on disaster recovery preparedness using chaos engineering. In this session, I will go over disaster recovery context and the difference between high availability and disaster recovery. We'll talk about how to approach DR from a cost and risk perspective, and then we'll talk about resiliency with chaos engineering. So let's talk about the difference between high availability and disaster recovery.
High availability is when you improve your uptime and resiliency by removing single points of failure and adding redundancy, whereas disaster recovery is your set of plans and policies that recover your workloads when things go down. Backups are not a DR plan, and disaster recovery needs to be clearly defined and practiced often in order to build confidence in your distributed systems. When thinking about disaster recovery,
we're always going to focus on resiliency. Resiliency is being prepared for that black swan event. Increasing availability and practicing restoration consistently helps you build resiliency. Cloud-native companies expect failure and are constantly improving their resiliency. Everything breaks all the time, and you need to be prepared for things to fail, especially in a shared responsibility model.
When thinking about resiliency, you have to understand that resiliency is critical and affects the user experience for your customers. Resiliency is also complex, and it grows in complexity over time as your applications grow, whether through integrations, features, mergers and acquisitions, et cetera. Resiliency is a key cost driver, based on your recovery point and time objectives and the criticality of your workloads. You might have safety-related workloads where, if they go down, people's safety is at stake. So the criticality of the workload really helps determine how you have to build in your resiliency.
Resiliency is completely different in the cloud than it is with on-prem applications. I can remember working at Verizon, where we would have to build out our applications in a 40/40 distributed model so that we never ran over 40% capacity, or we would have to add more. In the cloud, you build in capacity by knowing what to do when certain things fail, whether it be an instance stop, start, or shutdown, instance degradation, an availability zone or service event, or even, like we had in December, a regional event. Building in resiliency in the cloud means being prepared for unknown failure events.
So when we're talking about DR, we're going to start with defining our recovery point objective and our recovery time objective. Your recovery time objective is the acceptable delay between service interruption and service restoration. This determines what is considered an acceptable time window when the service becomes unavailable. Your recovery point objective is the maximum acceptable time since the last data recovery point. This determines what is considered an acceptable loss of data between the last recovery point and the service outage. Now that we understand that, we can build our DR strategy based on these factors.
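To make that concrete, here is a minimal sketch, not from the talk itself, of recording RTO and RPO targets per workload and roughly mapping them to the DR strategies we'll walk through next; the thresholds and the workload name are just illustrative assumptions.

```python
from dataclasses import dataclass
from datetime import timedelta

@dataclass
class DrTargets:
    """Recovery objectives for a single workload (hypothetical example)."""
    workload: str
    rto: timedelta  # acceptable delay between interruption and restoration
    rpo: timedelta  # maximum acceptable time since the last recovery point

def suggest_strategy(t: DrTargets) -> str:
    """Very rough mapping from objectives to the strategies discussed below."""
    if t.rto <= timedelta(seconds=30):
        return "active-active (multi-region, always on)"
    if t.rto <= timedelta(minutes=30):
        return "warm standby (reduced fleet, always running)"
    if t.rto <= timedelta(hours=4):
        return "pilot light (critical core only, scale up on failover)"
    return "backup and restore (cheapest, slowest)"

# Example: a hypothetical payments API with tight objectives
print(suggest_strategy(DrTargets("payments-api", rto=timedelta(minutes=10), rpo=timedelta(minutes=1))))
```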
So backup and restore is used to back up your data and applications into the DR region and restore that data when it's needed to recover from the disaster. If your recovery time is in the hours, up to 24 hours or less, this is going to be your best option.
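As a rough illustration of backup and restore, here is a hedged boto3 sketch that copies an EBS snapshot into a DR region; the snapshot ID and region names are placeholders, and in practice you would likely schedule this with AWS Backup or a lifecycle policy rather than a one-off script.

```python
import boto3

SOURCE_REGION = "us-east-1"               # primary region (placeholder)
DR_REGION = "us-west-2"                   # DR region (placeholder)
SNAPSHOT_ID = "snap-0123456789abcdef0"    # placeholder snapshot ID

# copy_snapshot is called against the *destination* region
ec2_dr = boto3.client("ec2", region_name=DR_REGION)

response = ec2_dr.copy_snapshot(
    SourceRegion=SOURCE_REGION,
    SourceSnapshotId=SNAPSHOT_ID,
    Description="Nightly DR copy for the backup-and-restore strategy",
)
print("DR copy started:", response["SnapshotId"])
```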
If your recovery point and time objectives are instead in the minutes, a pilot light might work for you. It keeps a minimal version of the environment by always running the most critical core elements of your system in the DR region. At the time of performing the recovery, you can quickly provision a full-scale production environment around that critical core. So you'll have your pilot light just sitting there, waiting to be flipped on whenever disaster happens.
Your warm standby is a little bit different. It's a little bit more expensive, in that it keeps a reduced version of a fully functional environment always running in your DR region. Business-critical systems are fully duplicated and always on, but with a reduced fleet. So when the time comes for recovery, the system scales quickly to process the production load.
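Here is a minimal sketch of that scale-up step for a pilot light or warm standby, assuming the DR region already has an Auto Scaling group kept at minimal capacity; the group name, region, and fleet size below are placeholders.

```python
import boto3

DR_REGION = "us-west-2"       # placeholder DR region
ASG_NAME = "web-tier-dr"      # placeholder Auto Scaling group in the DR region
PRODUCTION_CAPACITY = 12      # placeholder full-scale fleet size

autoscaling = boto3.client("autoscaling", region_name=DR_REGION)

# Pilot light / warm standby failover: grow the minimal DR fleet to production size.
autoscaling.update_auto_scaling_group(
    AutoScalingGroupName=ASG_NAME,
    MinSize=PRODUCTION_CAPACITY,
    MaxSize=PRODUCTION_CAPACITY * 2,
    DesiredCapacity=PRODUCTION_CAPACITY,
)
print(f"Scaling {ASG_NAME} to {PRODUCTION_CAPACITY} instances for failover")
```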
And then your most expensive option is active-active. This is how we built things in the on-prem world, where your RPO is basically zero or seconds, and your RTO is in seconds. Your workload is deployed and actively serving traffic in multiple AWS Regions. The strategy requires you to synchronize users and data between the regions you're using, so there's a lot of data transfer going back and forth and a lot of databases that need to be kept in sync. And when recovery time comes, you can use services such as Amazon Route 53 or AWS Global Accelerator to route user traffic to an entirely different workload application. In that system, those are going to be your highest priority, most critical workloads.
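As one hedged illustration of that routing step, this boto3 sketch shifts a weighted Route 53 record away from an impaired region; the hosted zone, record name, and endpoint are placeholders, and in a real active-active setup health checks would typically handle this automatically.

```python
import boto3

HOSTED_ZONE_ID = "Z0123456789ABCDEFGHIJ"         # placeholder hosted zone ID
RECORD_NAME = "app.example.com."                  # placeholder record name
PRIMARY_ENDPOINT = "app.us-east-1.example.com"    # placeholder regional endpoint

route53 = boto3.client("route53")

# Drain the impaired region by setting its weighted record to 0;
# Route 53 then sends traffic to the remaining weighted record(s).
route53.change_resource_record_sets(
    HostedZoneId=HOSTED_ZONE_ID,
    ChangeBatch={
        "Comment": "Shift traffic away from the impaired region",
        "Changes": [{
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": RECORD_NAME,
                "Type": "CNAME",
                "SetIdentifier": "us-east-1",
                "Weight": 0,
                "TTL": 60,
                "ResourceRecords": [{"Value": PRIMARY_ENDPOINT}],
            },
        }],
    },
)
print("Traffic shifted away from us-east-1")
```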
And one thing you're going to want to do is avoid recovery mechanisms that are not often tested. You want to define regular failover tests to ensure that your expected recovery point and time objectives are met. So always avoid creating recovery mechanisms and then never practicing them. It's important to practice. In the Navy, we were constantly going through firefighting exercises, because you're always practicing for that event when you need to put out a fire. So take that same mindset and utilize it. Now let's talk about resiliency and
using chaos engineering to better prepare your resiliency posture. Chaos engineering, as you know, is the discipline of experimenting on a system with the aim of increasing confidence in its ability to withstand problems in your environment. My philosophy, and our philosophy, is that testing in a non-production environment should always be performed regularly and be part of your integration and deployment lifecycle. In production, teams must perform these tests in such a way as to not cause the service to become unavailable. The last thing you want to do is cause problems for your customers while you're testing out a hypothesis. So always run these tests in non-production or development environments, and make sure that test results are measured and compared with availability objectives to understand whether the application running in that particular environment is able to meet those defined objectives.
When you first start experimenting with chaos engineering, start small and build confidence. Don't go straight to regional failures. Start by stopping instances or doing things at the host level so that you can build confidence and form your hypotheses, and then work your way up to availability zone failures or even regional failures.
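For example, a small first experiment might stop a single instance that has explicitly been tagged for chaos testing and then watch whether auto recovery brings it back; the tag and region below are assumptions, and tooling like AWS FIS can run the same action with guardrails.

```python
import boto3

REGION = "us-east-1"                          # placeholder region
TAG_KEY, TAG_VALUE = "chaos-target", "true"   # placeholder tag marking opted-in, non-prod instances

ec2 = boto3.client("ec2", region_name=REGION)

# Find running instances explicitly opted in to experiments via a tag.
reservations = ec2.describe_instances(
    Filters=[
        {"Name": f"tag:{TAG_KEY}", "Values": [TAG_VALUE]},
        {"Name": "instance-state-name", "Values": ["running"]},
    ]
)["Reservations"]

instance_ids = [i["InstanceId"] for r in reservations for i in r["Instances"]]

if instance_ids:
    # Start small: stop just one instance and observe whether recovery is automatic.
    target = instance_ids[0]
    ec2.stop_instances(InstanceIds=[target])
    print(f"Stopped {target}; now watch your alarms and auto recovery.")
else:
    print("No opted-in instances found; nothing to do.")
```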
Try to build auto-recovery mechanisms into your systems after you perform these experiments. Always assess your risk appetite and make sure to isolate failures at all times, like we talked about on the last slide. Never do things in production that could have an effect on your customers, and always have a backout and a rollback plan. And when you're quantifying the results of your experiments, you're going to want to think about how long it takes to detect these failures and how long it takes to get notified. Should a status page be updated? Should you notify your customers? How long does your auto recovery take? That's a big factor, because if you have a recovery objective of ten minutes and your auto recovery takes 20, then you're going to have to go back to the drawing board. And is it a partial or a full auto recovery, and how long does it take to really get back to that steady state? That's going to be one of the key quantifiable results of the experiment and what you're going to be looking for.
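One hedged way to quantify that is to poll a health endpoint during the experiment and record how long it takes to return to steady state, then compare that against your recovery time objective; the URL and the ten-minute objective below are placeholders.

```python
import time
import urllib.request
import urllib.error

HEALTH_URL = "https://app.example.com/health"  # placeholder health check endpoint
RTO_SECONDS = 600                              # placeholder ten-minute recovery objective

def is_healthy(url: str) -> bool:
    """Return True if the endpoint responds with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False

start = time.monotonic()
while not is_healthy(HEALTH_URL):
    time.sleep(10)  # poll every 10 seconds until steady state returns

recovery_seconds = time.monotonic() - start
print(f"Recovered in {recovery_seconds:.0f}s "
      f"({'within' if recovery_seconds <= RTO_SECONDS else 'exceeding'} the {RTO_SECONDS}s objective)")
```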
You also want to do reviews of the incident. Having a blameless culture is something that really needs to be in place for this to work. You'll talk about the event and the impact, and go over the five whys. Make sure all your data and your monitoring and observability metrics are there to tell you this. We had a saying at one of my previous places: without charts and graphs, it didn't happen. So make sure you have the proper visibility and observability metrics there so that you can learn from what happened. You want to make sure that you take corrective actions and that they're followed through on. In these post-mortems, which we call COEs, or corrections of error, have a defined list and a structure for how the meetings go, and clearly define the lessons learned and what to take away from them.
If you're not going to learn from these failures, and if you're not going to take the results and learn from them, then you're never going to be able to improve your resiliency. And finally, as you're going through these, continually audit these meetings and post-mortems, and try to get to a weekly cadence where you're constantly improving on things. At one of my stops, as an SRE, we met with our NOC engineers on a biweekly basis, and we went over every single escalation, and we created a runbook every time. So any new escalation shouldn't already have a runbook. If we're getting escalated for things on a repeatable basis, then that, to me, is considered toil, especially if there's human interaction. Try to automate those processes, but have those weekly operational reviews and go over your planning metrics. Make sure that you're continuously improving. There's a saying, kaizen, which means continuous improvement. Make sure that you're always trying to improve and learn from these events and these failures. When you do that, you will build a much more resilient system.
So how do we get started? Well, you can run recurring experiments. And what are some good candidates for recurring experiments? Machine-led processes like unit tests, regression tests, integration tests, and load tests. Remember, just like these other tests, it's important to consider the scope and the duration of the recurring fault injection experiments. Because fault injection experiments generally expose issues across a large number of linked systems, they will typically require extended runtimes to ensure sufficient data collection. So make sure you put them in the later stages of your CI/CD pipeline. That way, they don't slow down your developers.
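As a sketch of what that late pipeline stage might look like, assuming you already have an AWS FIS experiment template (the template ID below is a placeholder): start the experiment, then block until it finishes so the stage can fail if the experiment does.

```python
import time
import boto3

TEMPLATE_ID = "EXT1a2b3c4d5e6f7"   # placeholder FIS experiment template ID

fis = boto3.client("fis")

# Kick off the recurring experiment from a late pipeline stage.
experiment_id = fis.start_experiment(experimentTemplateId=TEMPLATE_ID)["experiment"]["id"]

# Poll until the experiment finishes; fail the build if it did not complete cleanly.
while True:
    status = fis.get_experiment(id=experiment_id)["experiment"]["state"]["status"]
    if status in ("completed", "stopped", "failed"):
        break
    time.sleep(30)

print(f"Experiment {experiment_id} finished with status: {status}")
if status != "completed":
    raise SystemExit(1)  # non-zero exit fails this pipeline stage
```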
And here is a link to one of the chaos engineering workshops where you can create a recurring experiment. In this experiment, we focused on running it in a CI/CD pipeline, with the argument that it's easy to slow down a pipeline to run only once a year, but hard to speed up a manual process to run multiple times every day. So go with one repo: use a single repository to host the definition of the pipeline, the infrastructure, and the experiment template. You want to do this so that you can co-version all components of the system. Whether this is a good idea kind of depends on your governance processes, but each of the parts could easily be independent. And with this part of the workshop, you can create these so that you can integrate them easily into your pipeline. As you can see, it's using the CDK to build out the infrastructure. You create a code repo and a pipeline using the CDK, then you trigger the pipeline to instantiate the infrastructure, and then you trigger the pipeline again to update the infrastructure and perform the fault injection.
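Very roughly, that single-repo setup looks something like this CDK sketch in Python; the workshop's actual code may differ, and the repository name, build commands, and experiment script here are placeholders.

```python
from constructs import Construct
from aws_cdk import Stack, aws_codecommit as codecommit, pipelines

class ChaosPipelineStack(Stack):
    """One repo co-versioning the pipeline, the infrastructure, and the experiment definition."""

    def __init__(self, scope: Construct, construct_id: str, **kwargs) -> None:
        super().__init__(scope, construct_id, **kwargs)

        # Single repository hosting the pipeline definition, infrastructure, and experiment template.
        repo = codecommit.Repository(self, "ChaosRepo", repository_name="chaos-experiments")

        pipeline = pipelines.CodePipeline(
            self, "Pipeline",
            synth=pipelines.ShellStep(
                "Synth",
                input=pipelines.CodePipelineSource.code_commit(repo, "main"),
                commands=["pip install -r requirements.txt", "npx cdk synth"],
            ),
        )

        # Late wave: run the fault injection experiment after the infrastructure is deployed.
        pipeline.add_wave(
            "FaultInjection",
            post=[pipelines.ShellStep("RunExperiment", commands=["python run_experiment.py"])],
        )
```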
So it's a really cool workshop. Scan it, give it a shot.
And then here are some other resources that are at your disposal. If you would like to run these with your TAM (technical account manager), reach out to them. But yeah, thanks for joining my talk, and you all have a great day.