Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hi folks, and welcome to this session as part of the Conf42 online
conference. My name is Andrew Robinson, and in this session
we're going to be looking at fault isolation using shuffle sharding.
Before we get into that, a quick bit of background on who I am. I'm a
principal solutions architect at Amazon Web Services, part of the AWS
Well-Architected team. My day job is helping
our customers and our partners to build secure, high performing,
resilient and efficient infrastructure for their workloads.
I've been working in the technology field for the last 14 years and most
of my focus has been on infrastructure reliability and data
management. So getting into this session,
I think it's always worth quickly defining, first of all,
what we mean when we talk about reliability,
just so that we're all on the same page here. So reliability
is the ability of a workload to perform its required function
correctly and consistently over an expected period
of time. One of the quotes that I really
like to use here is we need to build systems that embrace failure as a
natural occurrence, which is from Werner Vogels, the CTO of Amazon.
So what we need to do is adapt the way that we think about building
these systems so that we embrace the potential failures
that are going to happen and include that as
being a natural occurrence within our systems.
A couple of resources just to mention before we go into this, which I will
be referring to. The first of these is the Amazon Builders' Library,
and this provides articles and video guides on how Amazon
has implemented different best practices across our architecture
and software delivery. In short, it's sharing what we've learned over the
last 20 years with our customers and our partners.
I'd also like to talk about the AWS Well-Architected Labs.
There's a collection of over 60 labs here helping you to get hands on
with some of the best practices that I'm going to be speaking about. All the
labs come with detailed walkthroughs, and the content is all available in the linked
GitHub repo if you'd like to contribute or provide feedback on them.
So I'll be referring to both of those as we talk through what
I'm going to be covering.
So the main topic to cover here is sharding of workloads.
We'll start out with this as a concept, and then we'll dive
into a little bit more about what shuffle sharding means and how we can go
about implementing it. So think of sharding
a workload similar to how you would shard a database. For one
entire workload, we'll shard the overall capacity
and segment it so that each shard is responsible for a subset of
those customers. By minimizing the number of components
that a single customer can interact with, we can limit the impact of any
potential issues. If we had a workload without
any sharding, where any worker could handle any request,
a poisonous request or a flood of requests from one single user
could spread through your entire fleet.
So in short, the scope of the impact is the number
of customers divided by the number of shards.
So we can help to limit the scope of impact.
As the number of customers increases, we can increase the number of shards,
which will then help to scope the impact that potential failures
could have down to a more manageable level,
meaning that it's going to have less of an impact on customers using our systems.
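To make that arithmetic concrete, here's a minimal sketch of the blast-radius calculation for plain sharding. This is my own illustration, not from the talk's materials, and the customer and shard counts are just example values.

```java
// Minimal sketch of the blast-radius arithmetic for plain sharding.
// With C customers spread evenly across S shards, a bad request from one
// customer can reach at most the customers sharing its shard.
public class BlastRadius {
    public static void main(String[] args) {
        int customers = 800; // example value
        int shards = 4;      // example value

        int affectedCustomers = customers / shards; // customers on the bad shard
        double scopeOfImpact = 1.0 / shards;        // fraction of the customer base

        System.out.printf("Affected customers: %d (%.0f%% of all customers)%n",
                affectedCustomers, scopeOfImpact * 100);
    }
}
```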
So let's talk through an example here of what
this sharding can look like. So let's imagine
we have a horizontally scalable system. In this case,
it's made up of eight worker nodes. Each worker node receives
requests that come in, maybe through a load balancer. There's no sharding,
so each worker can handle any request that happens
to come into the system. Now, if one
of those workers fails, the other seven can absorb
the work. Only a small amount of slack capacity is needed within
this system for it to be able to function.
However, what happens if we get a poisonous or a bad request?
Maybe a denial of service, which causes a
cascading failure that has an impact across all of those
worker nodes. In this scenario, the failure
affects everything and everyone: the whole service goes down
and every customer using it is impacted.
So what we could look at doing then is sharding that
workload. In this case,
one customer only uses one specific shard, or set
of sharded resources, so we can limit the
impact of that poisonous request, as it would only impact that
one shard and the customers on it. So in this example, you can see that we've now sharded
this workload into orange, blue, green and red.
And because the orange shard is the one that has
been impacted by perhaps this poisonous request,
only that shard has been impacted.
The other shards that exist within our workload, and therefore the
other customers using those, haven't been impacted. So we've been
able to reduce the impact of the failure here from 100% to
just 25%. Now, of course, this does mean we need
to have more slack capacity within each of those
shards in case we had a single node failure. In this example,
if we did have that, we would lose 50% of that shard's capacity.
So it might be that we need more preprovisioned capacity, or we need to be
able to scale more quickly to be able to account for changes in load.
But it does mean that we'll be able to limit the impact of that failure
from affecting 100% of our customers to just 25%
of our customers using this system.
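As a rough sketch of what this simple (non-shuffle) sharding could look like in code, here's a hypothetical example; the class and worker names are illustrative, not from any AWS library. Each customer hashes to exactly one shard, so a poisonous request only ever reaches that shard's workers.

```java
import java.util.List;

// Hypothetical sketch of simple sharding: each customer is pinned to one
// shard, so a bad request from that customer stays within that shard.
public class SimpleSharder {
    private final List<List<String>> shards; // each shard is a group of workers

    public SimpleSharder(List<List<String>> shards) {
        this.shards = shards;
    }

    // Deterministically map a customer to one shard by hashing their ID.
    public List<String> shardFor(String customerId) {
        int index = Math.floorMod(customerId.hashCode(), shards.size());
        return shards.get(index);
    }

    public static void main(String[] args) {
        // Eight workers split into four shards of two, as in the example above.
        SimpleSharder sharder = new SimpleSharder(List.of(
                List.of("worker-1", "worker-2"),   // orange
                List.of("worker-3", "worker-4"),   // blue
                List.of("worker-5", "worker-6"),   // green
                List.of("worker-7", "worker-8"))); // red

        System.out.println("customer-a -> " + sharder.shardFor("customer-a"));
    }
}
```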
Now, we can evolve this further by using this mechanism called
shuffle sharding. Now, how this works is a
fairly similar scenario. We've got eight worker nodes, and we've got eight
customers represented by the letters A, B, C, D, E, F, G
and H. So eight workers,
eight customers. This time, we'll virtually shard up each
of those worker nodes so that every customer is split between
two of them. So, as with the previous example, all customers have access
to two worker nodes. Each customer
is represented by a different letter. And for the purpose of this example,
we're going to focus on customer A and customer B.
So, customer A, you can see, is using this
first shard, and they're sharing that with customer
B. They're also sharing the
fourth shard with customer F.
Customer B is sharing this first shard with customer
A, and they're also sharing the eighth shard
with customer G.
So what would happen in this scenario if we had that
same failure? Say customer A
submits a poisonous request, or maybe there's a flood of requests, or possibly
some kind of denial of service attack.
That problem will impact the virtual shards
that that customer has access to, so shard one and
shard four. But it won't impact any of the shards
that other customers are working on. So customers on those
remaining shards continue to operate as normal.
They don't suffer from any interruption to their service.
Now, that's great, but what about those customers who are sharing
those shards? Customer B and customer F,
what do they do? Well, because we've shuffle sharded,
those customers also have access to other shards.
So although both customer B and customer F share shards
with customer A, because their other shuffle shards are
different, they can continue to operate, albeit at a
reduced capacity. This was a small
example for the purpose of keeping it to fit on a slide.
But as you grow out, the level of potential impact
drops significantly. With eight workers,
there are 28 unique combinations of two
different workers, so 28 possible shuffle
shards. So the scope of impact here is just 1/28th,
which is seven times better than what regular
sharding would offer you. So distributing those
customer requests that are coming in using a shuffle sharding
mechanism means that if you do have a failure,
a poisonous request or a denial of service that takes out
those shards, the impact that it has is going to be greatly reduced.
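To see that combinatorial maths in action, here's a minimal sketch (my own illustration, not the Infima implementation) that enumerates every two-worker shuffle shard from a pool of eight workers, confirming the C(8,2) = 28 count and the key property that two distinct shuffle shards overlap in at most one worker.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Illustrative sketch: enumerate every two-worker shuffle shard drawn from
// eight workers. There are C(8,2) = 28 of them, so a poisoned shuffle shard
// affects only 1/28th of customers, rather than 1/4 with plain sharding.
public class ShuffleShards {
    public static void main(String[] args) {
        int workers = 8;

        List<Set<Integer>> shuffleShards = new ArrayList<>();
        for (int i = 0; i < workers; i++) {
            for (int j = i + 1; j < workers; j++) {
                shuffleShards.add(Set.of(i, j)); // one unique pair of workers
            }
        }
        System.out.println("Possible shuffle shards: " + shuffleShards.size()); // 28

        // Two distinct shuffle shards overlap in at most one worker, so a
        // customer sharing one worker with a bad actor keeps its other worker.
        Set<Integer> customerA = Set.of(0, 3); // e.g. workers one and four
        Set<Integer> customerB = Set.of(0, 7); // e.g. workers one and eight
        Set<Integer> overlap = new HashSet<>(customerA);
        overlap.retainAll(customerB);
        System.out.println("Workers shared by A and B: " + overlap.size()); // 1
    }
}
```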
Now, customer A still has the problem that
they've lost access to their resources. But for the
rest of the customers, we're maintaining access to the system
so they can still do what they need to do, limiting the
impact that that actually has. So that's great in
theory, but what could that look like in practice?
And how could you actually get hands on and see how this really
works? So one quick resource to talk about
here is using shuffle sharding with Amazon Route
53. For those who aren't familiar, Amazon Route 53 is a
highly available and scalable DNS web service.
And Amazon has built a library called Infima,
which is available for the AWS Java SDK.
So this uses the theory of shuffle sharding that we just talked about to
compartmentalize your endpoints and then bake
in the decision tree logic to remove failed endpoints,
which can help with isolating denial of service or poison pill requests.
As that decision tree logic is prebaked, it means Route
53 can react quickly to any endpoint failures.
Combined with the compartmentalization through shuffle sharding, this means
that we can handle failures of any subservice serving that
endpoint. So it allows us to isolate those requests.
And by having a pre-built decision tree for
which endpoints we're going to remove, once those endpoints fail,
we can quickly remove them from our DNS service. And that means that we're
not going to be routing customer requests to those anymore. So this
actually takes that theory of how to shuffle shard those requests
coming into different endpoints and actually bakes that into
an easy to use Java library that you can
use with the AWS Java SDK and the Amazon Route 53 service
to actually get hands on and test this out with.
So I know this is quite theoretical, but there is actually
an opportunity here to get hands on with that as well.
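Just to illustrate the failover idea in miniature, here's a hypothetical sketch; these class and method names are my own invention, not Infima's API. It mimics the outcome of the prebaked decision tree: failed endpoints are dropped from a customer's shuffle shard so requests are only ever routed to healthy workers.

```java
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

// Hypothetical illustration (not Infima's classes or methods): drop failed
// endpoints from a customer's shuffle shard so traffic only goes to healthy
// workers, which is the outcome the prebaked decision tree delivers.
public class HealthyEndpointFilter {
    static List<String> healthyEndpoints(List<String> shard, Set<String> failed) {
        return shard.stream()
                .filter(endpoint -> !failed.contains(endpoint))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> customerShard = List.of("10.0.0.1", "10.0.0.4"); // example IPs
        Set<String> failedEndpoints = Set.of("10.0.0.4");

        System.out.println(healthyEndpoints(customerShard, failedEndpoints));
        // -> [10.0.0.1]
    }
}
```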
So to conclude with this, using sharding is a
great way to help limit impact by distributing different
requests to different endpoints. It means that if we
have a poison pill or we have a denial of service from a specific
customer, it's only going to impact those specific
endpoints that they have access to. And the rest of
our customers are still going to be able to access the resources that they need
to. That helps us to reduce that impact
down. In the example we saw, we went from 100% to 25%.
As you scale out and add more worker nodes,
you'll still be looking at a similar percentage,
but that can then have an impact on far fewer customers.
We can then use shuffle sharding to exponentially limit that impact
further. Remember, in the example that we had, we were talking about a 1/28th
impact that this would have on customers. And as you
then scale that out with more worker nodes and more customers,
that impact exponentially decreases. So sharding
is a great place to start here. And then looking at moving on to shuffle
sharding can then help to exponentially limit that impact.
As I mentioned, there's the Route 53 Infima library,
which can help you manage this yourself and allows you to
get some hands on experience with what this could look like in the real world.
This example was done with the Amazon Route 53
service, but you could achieve something similar with other DNS services,
or even using your own type of load balancer as well to
distribute those requests and remove any of the
failed endpoints. So if
you'd like to learn a little bit more about this, there is an article in
the Amazon Builders' Library on workload isolation using shuffle sharding,
which goes into much more depth than what I've been able to cover in this
session, and actually explains in a little more technical detail how Amazon's
built this technology. There's also an opportunity
to get hands on with the AWS Well-Architected Labs in
the reliability pillar. There's a lab on there for fault isolation
with shuffle sharding, which walks through an example
and allows you to get hands on with building this using
AWS resources, including our load balancers.
And then of course there was the Route 53 Infima library. If you
want to get a little bit more hands on with the code parts of this,
you can go and grab that from the AWS Labs GitHub repository.
As I mentioned, it works with the AWS Java SDK.
It's based on Maven, so it's fairly straightforward to deploy. You can
get started with that in just a matter of minutes.
So that concludes the session on fault isolation
using shuffle sharding. I hope that's been useful for everybody.
Thank you ever so much for watching this session, and I hope you
enjoy the rest of the Conf42 conference. Thanks, folks, and stay
safe.