Transcript
Resilience modeling guides teams to anticipate failure scenarios before they lead to incidents. Teams then prioritize these scenarios as hypotheses to test through experimentation. A resilience model documents the scenarios that may impact the system and the controls in place to guard against such impact. Experiments allow observation of these controls to understand their effectiveness in detecting and preventing incidents.
My name is Gunnar Grosch and I am a principal developer advocate at AWS, focusing on resilience, chaos engineering, and architecture. In this session, you will learn how to build a resilience model to create valuable hypotheses that allow you to maximize the value of your chaos engineering experiments.
Most of you probably recognize this flywheel showing the phases of chaos engineering. It takes you from understanding the steady state of your system, to forming a hypothesis, to designing and running an experiment, to verifying your experiment by comparing against the steady state, and then finally to learning from the experiment and improving the system.
Many sessions around chaos engineering focus on one phase: how to design and run experiments. And that's great. I love seeing examples of how chaos engineering tools work, or stories from real-world use cases of chaos engineering. Sometimes sessions also cover the verify phase, where we look at the results of these experiments.
In this session, though, I want to focus on what comes before: the hypothesis phase. How do you actually create valuable hypotheses that allow you to maximize the value of your chaos engineering experiments? Spending time before running experiments allows you to use your resources more effectively. This goes back to the four key capabilities that a system needs in order to be resilient: anticipate, monitor, respond, and learn. Chaos engineering mostly falls into that fourth capability, learning about our systems. But in order to prevent failures, we need to be able to anticipate, and that's where resilience modeling comes in.
So, to simplify a bit: we want to anticipate in order to learn better. I want to start by sharing a quick story from AWS. This was published as a service event summary, which I would encourage you all to read; you can access the article using the QR code on this slide.
In 2012, the ELB service team tasked one of their operators with performing a routine maintenance procedure on the ELB control plane. The operator performed the procedure as instructed, but this resulted in the inadvertent deletion of configuration data from the control plane. Without this data, the control plane lost the ability to manage existing ELB resources, which meant that any calls to modify existing load balancers began to fail, while calls to create and manage new ELB resources continued to succeed. The service team took time to troubleshoot and identify the cause of the behavior. And when they realized what had happened, they realized they didn't really have any recovery procedure to restore the deleted data. So they had to develop a recovery plan on the spot. After putting it into action, they were able to recreate the missing data and finally restore the service.
Using resilience modeling, this event could have been anticipated and prevented, because the resilience model would have shown that we were lacking these recovery procedures. And using the resilience model, we could have formed a hypothesis, and we could then have used that hypothesis to perform different chaos engineering experiments.
We have worked with numerous customers in North America, Latin America, Europe, and Asia to anticipate incidents using the practices that I'm going to share with you today. These customers have shared that they could have avoided over 65 incidents had they created a resilience model prior to go-live. We've also seen that creating a resilience model creates a shared understanding within a team of how a system works, and it becomes a vehicle for shared learning about what can go wrong. So by creating a resilience model, these Fortune 500 companies are preventing incidents and gaining confidence in the resilience of their systems and in how they operate them.
Before we get into the process of resilience modeling, let's first define some important terms. First, the term system. This is a reference architecture diagram for a container-based e-commerce application running on AWS. An architecture diagram like this shows the components in the IT stack for hosting the application, but it doesn't show the entire system. For example, this diagram doesn't highlight the version control system for the infrastructure as code, or the monitoring platform for understanding the system state. It also doesn't show the human operators, the people who are managing the system at runtime. So even if we sometimes think of a system as fully represented by an architecture diagram, that's really not the case. A lot is missing from that architecture diagram.
A system includes the tech stack that hosts the e-commerce application; those are the components that the users of the system communicate with. But there are additional controls in place to help the system continue to function over time: things like auto scaling groups, circuit breakers in the code, and failover logic for the RDS database. Those are just some of the automated controls which respond to signals coming from the IT stack and then change the IT stack to ensure continued availability. In addition to the automation, there are human operators, the people who receive alerts from the system. They have dashboards for observing the health of the system, and they can respond by restarting or redeploying components as needed to ensure that users can continue to access the system. All of these things make up a system. This is important because both the automation and the operators can change the IT stack, and that change could either cause downtime or prevent it. So when modeling the resilience of the system, we need to consider these elements in addition to the components in the architecture diagram that we're used to seeing.
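As a concrete illustration of one such automated control, here is a minimal circuit breaker sketch in Python. It is not code from the session; the failure threshold, the cooldown period, and the shape of the wrapped call are all assumptions for illustration.

```python
import time

class CircuitBreaker:
    """Stops calling a failing dependency until a cooldown has passed."""

    def __init__(self, max_failures=3, cooldown_seconds=30):
        self.max_failures = max_failures      # assumption: trip after 3 failures
        self.cooldown_seconds = cooldown_seconds
        self.failures = 0
        self.opened_at = None                 # time the breaker tripped

    def call(self, fn, *args, **kwargs):
        # While open, fail fast instead of hammering the impaired dependency.
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.cooldown_seconds:
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None             # half-open: allow one trial call
            self.failures = 0
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0                     # success resets the count
        return result
```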
So what is a system function? Well, many of the systems you work with are probably quite large and diverse in terms of their components, their dependencies, and the teams supporting the system. To reduce the number of elements that are considered during modeling, we align our thinking with the key functions of the system. In order to anticipate what could go wrong with specific applications, we want to dive deeper into user journeys or system functions. A user journey or system function relates to the capability that a specific workload has to deliver value to the business, and a workload might be related to multiple different user journeys. If we take an e-commerce platform, for example, we should be able to break it down into multiple areas: authentication, personalization, ordering, delivery, and so on. Focusing on all of those areas at once is mostly painful, and it would lead the application owners into a never-ending engagement. That is why we want to break the system down into these smaller pieces.
So, thinking of the example we just looked at, we can take selling an item as the main focus. That helps us understand which services are within that critical path, and exercise how they could fail. This is ultimately the goal of breaking the system into these smaller user journeys.
After we've identified the system functions for a system, we need to understand how each system function might behave and how it might fail to completely meet the business objectives. This then enables us to begin to associate a cost with these failures. Failure modes should be written from the perspective of the business or the business process the system supports, and they shouldn't call out the cause of the failure mode; it's typical that a failure mode will have more than one potential cause.
So for each system function, you should consider what happens if the function were to fail, if it were to over- or underperform, if it succeeded only intermittently, or if it were to execute when it shouldn't. If we look at a function, let's say user login: no function means that we have login failure. Over function might mean that users log in and get administrative rights. Under function could mean that users are logging in, but they only get read-only access. Intermittent function might mean that only some of the users are able to log into the system, and unintentional function that the wrong user gets logged in.
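To make that template repeatable across functions, here is a minimal sketch in Python that generates the five failure-mode prompts for any system function. The prompt wording is my assumption; only the five mode names come from the session.

```python
# The five failure-mode lenses from the session, applied to any function.
FAILURE_MODE_PROMPTS = {
    "no function": "What if {fn} fails entirely?",
    "over function": "What if {fn} does more than intended?",
    "under function": "What if {fn} does less than intended?",
    "intermittent function": "What if {fn} succeeds only some of the time?",
    "unintentional function": "What if {fn} executes when it shouldn't?",
}

def failure_mode_prompts(fn: str) -> dict[str, str]:
    """Return the five prompts filled in for one system function."""
    return {mode: prompt.format(fn=fn) for mode, prompt in FAILURE_MODE_PROMPTS.items()}

for mode, question in failure_mode_prompts("user login").items():
    print(f"{mode}: {question}")
```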
We then determine the loss to the business for each of these failure modes; after all, preventing loss to the business is the main reason we're trying to improve resilience in the first place. For each loss, we should try to calculate its cost to the business. Often this will be quantified through customer satisfaction, customer trust, lost sales, or even fines. The application team can then later weigh the implementation and operation cost of controls against what those controls are actually mitigating.
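One way to make that trade-off concrete is a back-of-the-envelope comparison like the sketch below; the formula and all the numbers are assumptions for illustration, not figures from the session.

```python
def control_is_worthwhile(incidents_per_year: float,
                          loss_per_incident: float,
                          annual_control_cost: float,
                          mitigation_factor: float) -> bool:
    """Compare the expected annual loss a control avoids with its cost.

    mitigation_factor is the fraction of the loss the control prevents (0..1).
    """
    expected_annual_loss = incidents_per_year * loss_per_incident
    avoided_loss = expected_annual_loss * mitigation_factor
    return avoided_loss > annual_control_cost

# Hypothetical numbers: two outages a year costing $50,000 each in lost sales,
# and a control costing $20,000 per year that prevents 80% of that loss.
print(control_is_worthwhile(2, 50_000, 20_000, 0.8))  # True: $80,000 > $20,000
```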
So then, after identifying
failure modes and their cost to the business, the team will anticipate the
scenarios that would lead to each failure mode, and can then begin
to align controls to these different scenarios.
These controls will be one of four different types. Detective controls are used to understand when a failure scenario has occurred or is about to occur; that could, for instance, be alarms on the number of error responses sent to clients, or a decline in the number of orders per second in our system. Preventive controls are the mechanisms that we put in place to prevent impairment to the system when the failure scenario has occurred; I mentioned circuit breakers in the code earlier, and it might also be different types of redundancy that we put in place in the architecture. Corrective controls, also called recovery controls, are mechanisms or procedures in place to clear the system of impairment if it has been affected by the failure scenario. And testing controls are the tests that we have in place to detect whether the system is susceptible to a failure scenario. So these are the four types. Detective controls: how can you detect that this happens? Preventive controls: are you taking any measures to avoid this failure? Recovery controls: if it happens, what do you do, and how do you recover? And testing controls: do you have any processes to test against this failure? When we're creating a resilience model, we want to map the losses to their failure modes and the failure modes to their failure scenarios. Each failure mode may have one or more failure scenarios, and each failure scenario is going to have multiple controls.
Let's look at an example where a failure mode was anticipated for a data distribution system: if data was not fully transferred to clients, there was a potential for fines to be issued to the business. Two causes of this failure mode were identified, and then the controls for detection, prevention, and correction were also identified. The team in this case didn't have any mechanism for testing this scenario. So we have the business loss, the failure mode, the scenario, and the controls.
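Captured as data, a model like this can live in the repository next to the code. Here is a minimal sketch in Python of the loss, failure mode, scenario, and controls mapping, populated with the data distribution example; the field names are my assumptions, not a format prescribed in the session, and the corrective control shown is a hypothetical placeholder since the session doesn't name it.

```python
from dataclasses import dataclass, field

@dataclass
class FailureScenario:
    description: str
    detective: list[str] = field(default_factory=list)
    preventive: list[str] = field(default_factory=list)
    corrective: list[str] = field(default_factory=list)
    testing: list[str] = field(default_factory=list)   # empty list = a gap to close

@dataclass
class FailureMode:
    description: str
    business_loss: str
    scenarios: list[FailureScenario]

# The data distribution example from the session, one scenario shown.
model = FailureMode(
    description="Data not fully transferred to clients",
    business_loss="Potential fines issued to the business",
    scenarios=[
        FailureScenario(
            description="Network mutates the response",
            detective=["Application alerts on checksum mismatch"],
            preventive=["Checksum verifies the message content"],
            corrective=["Retransmit the affected messages"],  # assumption
            testing=[],  # the team had no testing control for this scenario
        )
    ],
)
```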
Now, as I think most of you know, a hypothesis is a proposed explanation for a phenomenon. For a hypothesis to be a scientific hypothesis, the scientific method requires that one can test it. And the way we test them is, of course, through chaos engineering experiments. A hypothesis is usually written in the form of an if-then statement: the "if" gives us a possibility, and the "then" explains what may happen because of that possibility. We then make use of the failure scenario and the controls: if failure scenario, then preventive control; or if failure scenario, then detective control and recovery control, for instance.
Going back to our previous example with the data distribution system, we can start to create high-quality hypotheses from our model: if the network mutates the response, then there is a checksum to verify the message content, and the application will alert if the checksum mismatches.
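That pair of controls is easy to sketch. Below is a minimal Python example of checksum verification with an alert on mismatch; the SHA-256 choice and the alert function are assumptions for illustration.

```python
import hashlib

def alert(message: str) -> None:
    # Assumption: in a real system this would page operators or raise an alarm.
    print(f"ALERT: {message}")

def checksum(payload: bytes) -> str:
    """Compute a SHA-256 digest of the message content."""
    return hashlib.sha256(payload).hexdigest()

def verify_message(payload: bytes, expected_checksum: str) -> bool:
    """Preventive control: reject mutated responses; detective control: alert."""
    if checksum(payload) != expected_checksum:
        alert("checksum mismatch: response was mutated in transit")
        return False
    return True
```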
This helps us create a very clear and very testable hypothesis. We understand the scenario we want to test, and we know the controls that are in place for that specific scenario. We can now use this hypothesis in our chaos engineering experiments.
Let's now use the rest of this time to begin building a resilience model that we can then use to create hypotheses, and we're going to use our online storefront. This is an architecture diagram for an online storefront based on Kubernetes running in Amazon EKS. The application also uses DynamoDB, Aurora MySQL, ElastiCache for Redis, and RabbitMQ. In addition to the components shown here, there is a GitHub repository which provides CI/CD, and we have an operations team which can access any part of the system in production during operations. So then we have to ask:
what are some of the key system functions for this system?
Well, in this case, we can see the critical path for the submit order function. Orders are sent by the user through the storefront system, and custom code communicates with the payments processor, the pricing API, and the inventory API to process the order. So for the submit order function, what are the different failure modes? We can use no function, over function, under function, intermittent function, and unintentional function as a template when creating these. If we were to model submit order, we need to ask ourselves these questions: What are the failure modes? What scenarios would cause the failure modes? And what controls are in place to mitigate these scenarios?
So let's start the model. The failure mode is: order submission fails. Our first failure scenario is that the TLS certificate on the Application Load Balancer is expired. We have detective controls in place: alarms will notify operators if a certificate error occurs. For preventive controls, our TLS certificates are rotated annually. Our recovery control is that the support department coordinates with operations to troubleshoot. And for testing controls, we actually don't have any in place to test for this scenario.
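A testing control to close that gap could be as simple as the sketch below: a Python check that reads the certificate expiry date off the live endpoint and fails if it is too close. The hostname and the 14-day threshold are assumptions for illustration.

```python
import socket
import ssl
import time

def days_until_cert_expiry(hostname: str, port: int = 443) -> float:
    """Fetch the TLS certificate and return the days left until it expires."""
    context = ssl.create_default_context()
    with socket.create_connection((hostname, port), timeout=5) as sock:
        with context.wrap_socket(sock, server_hostname=hostname) as tls:
            cert = tls.getpeercert()
    expires = ssl.cert_time_to_seconds(cert["notAfter"])
    return (expires - time.time()) / 86400

# Hypothetical endpoint; fail the check if expiry is less than 14 days away.
assert days_until_cert_expiry("storefront.example.com") > 14, "certificate about to expire"
```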
The second failure scenario is that the storefront is unable to find user sessions in the cache. Detective controls: well, we don't have any. We also don't have any preventive controls for this failure scenario. But our recovery control is that users are redirected to the login page and their shopping basket is maintained. The testing control for this failure scenario is that we have automated testing in place to verify that logged-out users are redirected to the login page.
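That testing control might look something like this minimal sketch using the requests library; the URL, the path, and the session cookie name are all assumptions for illustration.

```python
import requests

def test_logged_out_user_is_redirected_to_login():
    """Testing control: a request with no valid session must land on the login page."""
    response = requests.get(
        "https://storefront.example.com/checkout",  # hypothetical endpoint
        cookies={"session-id": "expired-or-unknown"},
        allow_redirects=False,
        timeout=5,
    )
    assert response.status_code in (301, 302, 303, 307)
    assert "/login" in response.headers["Location"]
```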
And the third failure scenario we can think of is that under high load, the cache evicts recent user sessions. We have detective controls in place: alarms notify the operators if the number of evictions is nonzero. We also have preventive controls: the cache is right-sized through load testing. Our recovery control is that the operations team grows the size of ElastiCache. We don't have any testing controls for this failure scenario.
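That detective control maps directly to a CloudWatch alarm on the ElastiCache Evictions metric. A minimal boto3 sketch follows; the cluster ID and SNS topic ARN are placeholders, and the one-minute period is an assumption.

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

# Detective control: alert operators as soon as the cache evicts anything.
cloudwatch.put_metric_alarm(
    AlarmName="storefront-session-cache-evictions",
    Namespace="AWS/ElastiCache",
    MetricName="Evictions",
    Dimensions=[{"Name": "CacheClusterId", "Value": "storefront-sessions"}],  # placeholder
    Statistic="Sum",
    Period=60,                      # assumption: evaluate every minute
    EvaluationPeriods=1,
    Threshold=0,
    ComparisonOperator="GreaterThanThreshold",  # nonzero evictions trigger the alarm
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:ops-alerts"],  # placeholder topic
)
```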
With these three failure scenarios for our failure mode, order submission fails, we can now use the technique we looked at earlier and start forming our hypotheses. For the first failure scenario, the TLS certificate on the Application Load Balancer is expired, we can create the hypothesis: if the TLS certificate on the ALB is expired, then operators are notified and troubleshooting starts. For the second failure scenario, the storefront is unable to find the user session in the cache, we can form the hypothesis: if the storefront is unable to find the user session in the cache, then the user is redirected to the login page and their shopping basket is maintained. And for the third failure scenario, under high load the cache evicts recent user sessions, we can form the hypothesis: if the cache evicts recent user sessions, then operators are notified and the cache is right-sized. All three of these are valuable, testable hypotheses that allow us to maximize the value of our upcoming chaos engineering experiments.
If we go back to our flywheel of the phases of chaos engineering, we've now spent time forming valuable hypotheses with the help of resilience modeling. Now we can move to the run experiment phase, where, with these hypotheses, you're able to design and run high-quality chaos engineering experiments. We've maximized our chaos engineering efforts: by spending time before actually running experiments, you can use your chaos engineering and people resources much more effectively.
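To tie the pieces together, here is a minimal sketch of an experiment harness driven by one of those hypotheses; every callable in it is a hypothetical placeholder, whether you inject faults with AWS Fault Injection Service or another tool.

```python
def run_experiment(hypothesis: str, inject_fault, verify_controls, steady_state_ok):
    """Skeleton of one chaos engineering experiment: steady state, inject, verify."""
    assert steady_state_ok(), "abort: system not in steady state before the experiment"
    inject_fault()                      # e.g. evict recent sessions from the cache
    try:
        result = verify_controls()      # did the detective/recovery controls fire?
        print(f"hypothesis {'held' if result else 'refuted'}: {hypothesis}")
        return result
    finally:
        assert steady_state_ok(), "system did not return to steady state"

# Usage with the third hypothesis from the model (placeholder callables):
# run_experiment(
#     "if cache evicts recent user sessions, operators are notified and cache is right-sized",
#     inject_fault=evict_recent_sessions,
#     verify_controls=operators_were_notified,
#     steady_state_ok=orders_per_second_within_bounds,
# )
```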
So let's look at some key takeaways. First off, consider the entire system, not only the things you see in the architecture diagram; consider everything from the people, to the observability, to where you get your code from, and so on. Next, find the system functions and identify the critical path within your system; that helps you zoom in and find the failure scenarios. Write your failure modes from a business perspective: think from the business, and think about the loss, when you are forming your failure modes. Anticipate the scenarios that would lead to each of these failure modes; ask how each of these failures would happen, and capture that scenario. For each of these scenarios, identify the controls that you have in place, based on the four control types that we looked at. And finally, create your hypotheses based on each failure scenario and the controls that you have in place, and then use those hypotheses to run more high-quality chaos engineering experiments. So, before I leave you, I want to show you this: please check out the new resilience space we have over at community.aws, where we've gathered, and keep adding, resources on how to build and operate resilient applications on AWS.
And with that, I want to thank you for joining this session. We've looked at how to build a resilience model to help us create more valuable hypotheses, and that allows us to maximize the value of our chaos engineering experiments. My name is Gunnar Grosch, and I'd be happy to connect with all of you on social media; you can find me on most platforms using the details shown on screen right now. Thank you all very much.