Transcript
Hi, I'm Eduardo Janicas and I'm a solutions architect at AWS. I've been here for over two and a half years, and I have a background in networking and operations. I'm going to talk about how AWS achieves static stability using availability zones. At Amazon, the services we build must meet extremely high availability targets. This means that we think carefully about the dependencies
that our systems take. We design our systems to
stay resilient, even when those dependencies are impaired.
In this talk, we're going to define a pattern that we use called
static stability to achieve this level of resilience.
We'll show how we apply this concept to availability zones, which are
a key infrastructure building block in AWS,
and therefore they are a bedrock dependency on which all
of our services are built. We will describe how we
built Amazon Elastic Compute Cloud, or EC2, to be statically stable. Then we're going to provide two statically stable example architectures that we have found useful for building highly available regional systems on top of availability zones. And finally, we're going to go a bit deeper into some of the design philosophy behind Amazon EC2, including how it is architected to provide availability zone independence at the software level. In addition, we're going to discuss some of the tradeoffs that come with building a service with this choice of architecture.
First, let's explore and understand the AWS cloud infrastructure.
A region is a physical location in the world where we
have multiple availability zones. Availability zones, or AZs, consist of one or more discrete data centers, each with redundant power, networking, and connectivity, housed in separate facilities. Availability zones exist on isolated fault lines, floodplains, networks, and electrical grids to substantially reduce the chance of simultaneous failure. This provides the resilience of real-time data replication and the reliability of multiple physical locations.
AWS has the largest global infrastructure footprint, with 25 regions, 80 availability zones each containing one or more data centers, and over 230 points of presence, including 218 edge locations, and this footprint is constantly growing at a significant rate. Each AWS region has
multiple availability zones. Each availability zone
has multiple physically separated data centers.
Each region also has two independent, fully redundant
transit centers that allow traffic to cross the
AWS network, enabling regions to connect to the
global network. Further, we don't use other
backbone providers for AWS traffic once
it hits our backbone. Now, each availability zone
is a fully isolated partition of the AWS global
infrastructure. This means that it is physically separated from any other availability zone by a meaningful distance, such as many kilometers. Each availability zone has its own power infrastructure. Thus, availability zones give customers the ability to operate production applications and databases that are more highly available, fault tolerant, and scalable than would be possible from a single data center. All availability zones are
interconnected with high bandwidth, low latency networking
over fully redundant dedicated metro fiber. This provides high throughput, low latency networking between availability zones. When interacting with an AWS service that provisions cloud infrastructure inside an Amazon Virtual Private Cloud, or VPC, many of these services require the caller to specify not only a region but also an availability zone. The availability zone is often specified implicitly in a required subnet argument, for example, when launching an EC2 instance, provisioning an Amazon Relational Database Service, or RDS, database, or creating an Amazon ElastiCache cluster. Although it's common to have multiple subnets in an availability zone, a single subnet lives entirely within an availability zone, and so by providing a subnet argument, the caller is also implicitly providing an availability zone to use.
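As a quick illustration, here's a minimal sketch using boto3, where the region, AMI ID, and subnet ID are placeholders rather than values from the talk, of how passing a subnet when launching an instance implicitly selects that subnet's availability zone.

```python
import boto3

ec2 = boto3.client("ec2", region_name="eu-west-1")  # placeholder region

# The subnet lives entirely within one availability zone, so passing SubnetId
# also pins the new instance to that zone; no explicit AZ argument is needed.
response = ec2.run_instances(
    ImageId="ami-0123456789abcdef0",        # placeholder AMI ID
    InstanceType="t3.micro",
    MinCount=1,
    MaxCount=1,
    SubnetId="subnet-0123456789abcdef0",    # placeholder subnet ID
)

print(response["Instances"][0]["Placement"]["AvailabilityZone"])  # e.g. "eu-west-1a"
```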
To better illustrate the property of static stability, let's look at Amazon EC2, which is itself designed according to those principles. When building systems on top of availability zones,
one lesson we have learned is to be ready for impairments
before they happen. A less effective approach might be to deploy to multiple availability zones with the expectation that
should there be an impairment within one availability zone,
the service will scale up, perhaps using AWS
auto scaling in other availability zones, and be restored
to full health. This approach is less effective
because it relies on reacting to impairments as
they happen, rather than being prepared for those impairments
before they happen. In other words, it lacks static stability.
In contrast, a more effective, statically stable service
would over provision its infrastructure to the point where it
would continue operating correctly without having to launch any new EC2 instances, even if an availability zone were to become impaired. The Amazon EC2 service consists of a control plane and a data plane. Control plane and data plane are terms of art from networking, but we use them all over the place in AWS. A control plane is the machinery
involved in making changes to a system, adding resources,
deleting resources, modifying resources,
and getting those changes propagated to wherever they need
to go to take effect. A data plane, in contrast,
is the daily business of those resources. That is what it takes
for them to function. In Amazon EC2, the control plane is everything that happens when EC2 launches a new instance. The logic of the control plane pulls together everything needed for a new EC2 instance by performing numerous tasks. The following are a few examples: It finds a physical server for the compute while respecting placement group and VPC tenancy requirements. It allocates a network interface out of the VPC subnet. It prepares an EBS volume, generates IAM role credentials, installs security groups, stores the results in the data stores of the various downstream services, and propagates the needed configurations to the server in the VPC and to the network edge. In contrast, the Amazon EC2 data plane keeps existing EC2 instances humming along as expected, performing tasks such as routing packets according to VPC route tables, reading and writing from EBS volumes, and so on. As is usually the case with data planes and control planes, the Amazon EC2 data plane
is far simpler than the control plane. As a result of
this relative simplicity, the EC2 data plane design targets a higher availability than that of the EC2 control plane. The concepts of control planes, data planes, and static stability are broadly applicable even beyond Amazon EC2. Being able to decompose a system into its control plane and data plane can be a helpful conceptual tool for designing highly available services, for a number of reasons. It's typical for the availability of the data plane to be even more critical to the success of customers than that of the control plane. For instance, the continued availability and correct functioning of an EC2 instance after it is running is even more important to most of you than the ability to launch a new EC2 instance. It's typical for the data plane to operate at a higher volume, often by orders of magnitude, than its control plane, and so it's better to keep them separate so that each can be scaled according to its own relevant scaling dimensions. And we have found over the years that a system's
control plane tends to have more moving parts than its data
plane, so it's statistically more likely to become impaired
for that reason alone. So, putting those considerations all together, our best practice is to separate systems along the control plane and data plane boundary. To achieve this separation in practice, we apply principles of static stability.
A data plane typically depends on data that arrives
from the control plane. However, to achieve a higher availability
target, the data plane maintains its existing state
and continues working even in the face of a control plane
impairment. The data plane might not get updates during the period of impairment, but everything that had been working before continues to work. Earlier, we noted that a scheme that requires
the replacement of an EC2 instance in response to an availability zone impairment is a less effective approach. It's not because we won't be able to launch the new EC2 instance; it's because, in response to an impairment, the system has to take an immediate dependency for the recovery path on the Amazon EC2 control plane, plus all of the application-specific systems that are necessary for a new instance to start performing useful work. Depending on the application,
these dependencies could include steps such as downloading runtime
configuration, registering the instance with discovery services,
acquiring credentials, et cetera. The control plane systems
are necessarily more complex than those in the data plane,
and they have a greater chance of not behaving correctly
when the overall system is impaired. Several AWS services
are internally composed of a horizontally scalable,
stateless fleet of EC2 instances or Amazon Elastic Container Service, or ECS, containers. We run these services in an auto scaling group across three or more availability zones. Additionally, these services over provision capacity so that, even if an entire availability zone were impaired, the servers in the remaining availability zones could carry the load. For example, when we use three availability zones, we over provision by 50%. Put another way, we over provision such that each availability zone is operating at only 66% of the level for which we have load tested it.
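To make that arithmetic concrete, here's a quick sketch, assuming the load is spread evenly across the three zones in normal operation, of where the 50% and 66% figures come from.

```python
# Assumption: with three availability zones, losing one zone means the
# remaining two must be able to carry the full load between them.
zones = 3
surviving = zones - 1

capacity_per_zone = 1 / surviving            # each zone is load tested for 50% of the full load
total_capacity = zones * capacity_per_zone   # 1.5x the load, i.e. over provisioned by 50%

normal_share = 1 / zones                     # in normal operation each zone serves ~33% of traffic
utilization = normal_share / capacity_per_zone

print(f"Provisioned: {total_capacity:.0%} of peak load")                      # 150%
print(f"Normal utilization per zone: {utilization:.0%} of tested capacity")   # ~67%, the "only 66%" figure
```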
The most common example is a load-balanced HTTPS service. The following diagram shows a public-facing application load balancer providing an HTTPS service across three availability zones. The target of the load balancer is an auto scaling group that spans the three availability zones in the region.
This is an example of active-active high availability using availability zones. In the event of an availability zone impairment, the architecture shown in the preceding diagram requires no action. The EC2 instances in the impaired availability zone will start failing health checks, and the application load balancer will shift traffic away from them. In fact, the Elastic Load Balancing service is designed according to this principle. It has provisioned enough load balancing capacity to withstand an availability zone impairment without needing to scale up.
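As a rough sketch of this pattern, where the group name, launch template, subnet IDs, target group ARN, and fleet size are illustrative assumptions rather than values from the talk, an auto scaling group pinned to three subnets in three availability zones and registered with the load balancer's target group might be created like this with boto3.

```python
import boto3

autoscaling = boto3.client("autoscaling", region_name="eu-west-1")  # placeholder region

peak_load_instances = 12                    # capacity needed at peak (assumption)
desired = int(peak_load_instances * 1.5)    # over provision by 50% across three zones

autoscaling.create_auto_scaling_group(
    AutoScalingGroupName="statically-stable-web",           # hypothetical name
    LaunchTemplate={"LaunchTemplateName": "web-template",    # hypothetical template
                    "Version": "$Latest"},
    MinSize=desired,
    MaxSize=desired,
    DesiredCapacity=desired,
    # One subnet per availability zone; all IDs are placeholders.
    VPCZoneIdentifier="subnet-aaa111,subnet-bbb222,subnet-ccc333",
    # Register instances with the application load balancer's target group.
    TargetGroupARNs=["arn:aws:elasticloadbalancing:eu-west-1:123456789012:targetgroup/web/0123456789abcdef"],
)
```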
We also use this pattern even when there is no load balancer or HTTPS service. For instance, a fleet of EC2 instances that processes messages from an Amazon Simple Queue Service, or SQS, queue can follow this pattern too. The instances are deployed in an auto scaling group across multiple availability zones, appropriately over provisioned. In the event of an impaired availability zone, the service does nothing; the impaired instances stop doing their work, and others pick up the slack.
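For the queue-based variant, a minimal worker sketch, with a placeholder queue URL and a stand-in for the application's real work, might look like this; because the fleet is over provisioned across zones, the surviving instances simply keep polling and absorb the extra messages.

```python
import boto3

sqs = boto3.client("sqs", region_name="eu-west-1")  # placeholder region
queue_url = "https://sqs.eu-west-1.amazonaws.com/123456789012/work-queue"  # placeholder URL

def process(body: str) -> None:
    # Placeholder for the application's real work.
    print("processing:", body)

# Each instance in the auto scaling group runs this same loop. If one
# availability zone is impaired, its workers stop polling and the
# remaining workers pick up the slack with no control plane action.
while True:
    messages = sqs.receive_message(
        QueueUrl=queue_url,
        MaxNumberOfMessages=10,
        WaitTimeSeconds=20,   # long polling
    )
    for message in messages.get("Messages", []):
        process(message["Body"])
        sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=message["ReceiptHandle"])
```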
Some services we built are stateful and
require a single primary or leader node to coordinate
the work. An example of such a service is a relational database such as Amazon RDS with a MySQL or PostgreSQL database engine. A typical high availability setup for this kind of relational database has a primary instance, which is the one to which all writes must go, and a standby candidate. We might also have additional read replicas, which are not shown in this diagram. When we work with stateful infrastructure like this, there will be a warm standby node in a different availability zone from that of the primary node. The following diagram shows an Amazon RDS database. When we provision a database with Amazon RDS, it requires a subnet
group. A subnet group is a set of subnets spanning
multiple availability zones into which the database instances
will be provisioned. Amazon RDS puts the standby candidate
in a different availability zone from the primary node.
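A minimal sketch of provisioning such a setup with boto3, where all identifiers, the instance class, and the password are placeholders, would be a Multi-AZ instance created into a subnet group that spans the availability zones.

```python
import boto3

rds = boto3.client("rds", region_name="eu-west-1")  # placeholder region

# Subnet group spanning three availability zones (subnet IDs are placeholders).
rds.create_db_subnet_group(
    DBSubnetGroupName="statically-stable-db-subnets",
    DBSubnetGroupDescription="Subnets across three availability zones",
    SubnetIds=["subnet-aaa111", "subnet-bbb222", "subnet-ccc333"],
)

# With MultiAZ=True, RDS provisions the warm standby in a different
# availability zone and manages failover and the DNS repointing itself.
rds.create_db_instance(
    DBInstanceIdentifier="statically-stable-db",   # hypothetical name
    Engine="postgres",
    DBInstanceClass="db.m6g.large",
    AllocatedStorage=100,
    MasterUsername="dbadmin",
    MasterUserPassword="replace-with-a-real-secret",   # placeholder only
    DBSubnetGroupName="statically-stable-db-subnets",
    MultiAZ=True,
)
```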
This is an example of active-standby high availability using availability zones. As was the case with the stateless active-active example, when the availability zone with the primary node becomes impaired, the stateful service does nothing with the infrastructure. For services that use Amazon
RDS, RDS will manage the failover
and repoint the DNS name to the new primary in the
working availability zone. This pattern also applies to
other active standby setups, even if they do not
use a relational database. In particular, we apply this
to systems with a cluster architecture that has a leader node.
We deploy these clusters across availability zones and
elect a new leader node from a standby candidate instead
of launching a replacement just in time. What these
two patterns have in common is that both of them have already provisioned
the capacity they need in the event of an availability zone
impairment, well in advance of any impairment.
In neither of these cases is a service taking any deliberate
control plane dependencies, such as provisioning new
infrastructure or making modifications in response
to an availability zone issue. This final section
of the talk will go one level deeper into
resilient availability zone architectures, covering some of
the ways in which we follow the availability zone independence principle
in Amazon EC2. Understanding these concepts is helpful when we build a service that not only needs to be highly available itself, but also needs to provide infrastructure on which others can be highly available. EC2, as a provider of low-level AWS infrastructure, provides the infrastructure that applications can use to be highly available, and there are times when other systems might wish to adopt that strategy as well. We follow the availability zone independence principle in EC2 in our deployment practices. In EC2, software is deployed to the physical servers hosting EC2 instances, edge devices, DNS resolvers, control plane components in the EC2 instance launch path, and many other components upon which EC2 instances depend. These deployments follow a zonal deployment calendar. This means that two availability zones in the same region will receive a given deployment on different days.
Across AWS, we use a phased rollout of deployments. For instance, we follow the best practice, regardless of the type of service to which we deploy, of first deploying to a one-box, and then to 1/N of servers, et cetera. However, in the specific case of services like those in Amazon EC2, our deployments go one step further and are deliberately aligned to availability zone boundaries. That way, a problem with a deployment affects one availability zone and is rolled back and fixed. It doesn't affect any other availability zones, which continue functioning as normal.
Another way we use the principle of independent availability zones when we build in Amazon EC2 is to design all packet flows to stay within the availability zone rather than crossing boundaries. This second point, that network traffic is kept local to the availability zone, is worth exploring in more detail. It's an interesting illustration of how we think differently when building a regional, highly available system that is a consumer of independent availability zones, that is, one that uses guarantees of availability zone independence as a foundation for building a highly available service, as opposed to when we provide availability zone independent infrastructure to others that allows them to build for high availability. The following diagram illustrates a highly available
external service, shown in orange, that depends on another internal service, shown in green. A straightforward design treats both of these services as consumers of independent EC2 availability zones. Each of the orange and green services is fronted by an application load balancer, and each service has a well-provisioned fleet of backend hosts spread across three availability zones. One highly available regional service calls another highly available regional service. This is a simple design, and for many of the services we've built, it is a good design. Suppose, however, that the green service is a foundational service. That is, suppose it is intended not only to be highly available, but also, itself, to serve as a building block for providing availability zone independence. In that case, we might instead design it as three instances of a zone-local service, on which we follow availability zone aware deployment practices. The following diagram illustrates
the design in which a highly available regional service
calls a highly available zonal service.
The reasons why we design our building block
services to be availability zone independent
come down to simple arithmetic. Let's say an availability zone is impaired. For black and white failures, the application load balancer will automatically fail away from the affected nodes. However, not all failures are so obvious. There can be gray failures, such as bugs in the software, which the load balancer won't be able to see in its health checks and cleanly handle. In this example, where one
highly available regional service calls another highly available regional service, if a request is sent through the system, then, with some simplifying assumptions, the chance of the request avoiding the impaired availability zone is 2/3 times 2/3, so it's 4/9. That is, the request has worse than even odds of steering clear of the event. In contrast, if we built the green service to be a zonal service, as in the current example, then the hosts in the orange service can call the green endpoint in the same availability zone. With this architecture, the chances of avoiding the impaired availability zone are 2/3. If N services are a part of this call path, then these numbers generalize to (2/3)^N for N regional services, versus remaining constant at 2/3 for N zonal services.
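Here's a quick sketch of that arithmetic, under the simplifying assumption that each regional hop lands in any of the three zones with equal probability, while a zone-local call stays in the caller's zone.

```python
# One of three availability zones is impaired.
zones = 3
p_hop_avoids_impaired = (zones - 1) / zones   # 2/3 for a single regional hop

def p_request_avoids_impaired(n_services: int, zonal: bool) -> float:
    """Chance a request steers clear of the impaired zone across the call path."""
    if zonal:
        # Only the first hop matters; zone-local calls never leave a healthy zone.
        return p_hop_avoids_impaired
    # Each regional service independently risks landing in the impaired zone.
    return p_hop_avoids_impaired ** n_services

print(p_request_avoids_impaired(2, zonal=False))  # 4/9 ~= 0.44, regional calls regional
print(p_request_avoids_impaired(2, zonal=True))   # 2/3 ~= 0.67, regional calls zonal
```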
It is for this reason that we built the Amazon EC2 NAT gateway as a zonal service. The NAT gateway is an Amazon EC2 feature that allows for outbound internet traffic from a private subnet, and it appears not as a regional, VPC-wide gateway, but as a zonal resource that customers instantiate separately per availability zone.
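As a rough sketch of how a customer instantiates this zonally, where every subnet, route table, and Elastic IP allocation ID is a placeholder, one NAT gateway is created per availability zone, and each zone's private route table points only at its own zone's gateway.

```python
import boto3

ec2 = boto3.client("ec2", region_name="eu-west-1")  # placeholder region

# Placeholder public subnet, private route table, and EIP allocation per zone.
az_resources = {
    "eu-west-1a": {"public_subnet": "subnet-pub-a", "private_route_table": "rtb-priv-a", "eip_alloc": "eipalloc-a"},
    "eu-west-1b": {"public_subnet": "subnet-pub-b", "private_route_table": "rtb-priv-b", "eip_alloc": "eipalloc-b"},
    "eu-west-1c": {"public_subnet": "subnet-pub-c", "private_route_table": "rtb-priv-c", "eip_alloc": "eipalloc-c"},
}

for az, ids in az_resources.items():
    # Create a NAT gateway in this zone's public subnet.
    nat = ec2.create_nat_gateway(SubnetId=ids["public_subnet"], AllocationId=ids["eip_alloc"])
    nat_id = nat["NatGateway"]["NatGatewayId"]
    # Route this zone's private subnets through this zone's NAT gateway only,
    # so outbound traffic never crosses an availability zone boundary.
    ec2.create_route(
        RouteTableId=ids["private_route_table"],
        DestinationCidrBlock="0.0.0.0/0",
        NatGatewayId=nat_id,
    )
```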
As shown in this diagram, the NAT gateway sits in the path of the internet connectivity for the VPC and is therefore part of the data plane of any EC2 instance within that VPC. If there is a connectivity impairment in one availability zone, we want to keep that impairment inside that availability zone only, rather than spreading it to other zones. In the end, we want a customer who built an architecture similar to the one that we mentioned earlier, that is, providing a fleet across three availability zones with enough capacity in any two to carry the full load, to know that the other availability zones will be completely unaffected by anything going on in the impaired availability zone.
The only way for us to do this is to ensure
that all foundational components, like the NAT gateway, really do stay within one availability zone. So, some lessons learned: when designing a service-oriented architecture that will run on AWS, we have learned to use one of these patterns, or a combination of both. The simpler pattern is regional-calls-regional. This is often the best choice for external-facing services and appropriate for most internal services as well. For instance, when building higher-level application services in AWS, such as Amazon API Gateway and AWS serverless technologies, we use this pattern to provide high availability even in the face of an availability zone impairment.
The more complex patterns are regional-calls-zonal or zonal-calls-zonal. When designing internal, and in some cases external, data plane components within Amazon EC2, for instance network appliances or other infrastructure that sits directly in the critical data path, we follow the pattern of availability zone independence and use instances that are siloed in availability zones, so that network traffic remains in its same availability zone. This pattern not only helps keep impairments isolated to an availability zone, but also has favorable network traffic cost characteristics in AWS. Thank you
for listening to my talk and I hope you find it useful. If this
topic interests you, you can find more deep dive articles in the Amazon Builders' Library, which is available on the public AWS website. Have a great day.