Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hi, my name is Christian Elsen and I'm a specialist solutions architect for networking.
I've been with AWS for about six years and have previously
worked in other networking roles for about 15 years, spanning areas
of data center switching, network virtualization,
global content distribution networks and DNS providers,
as well as BGP routing for service providers. My name
is Lornak McJolo. I'm a senior solutions architect at AWS.
Prior to this, for about 17 years I also worked on web infrastructure,
centralized authentication systems, distributed caching, multiregion cloud
native deployments, using infrastructure as code and pipelines to
name a few. But today we are going to talk about optimizing
end user connectivity for multiregion architectures.
So first, why are we talking about this? There are two aspects
that we are focusing on in today's talk. The first is
performance: how to connect end users to application endpoints in multiple
regions in the most performant and reliable way possible.
The second is maximizing availability in the case of
disaster recovery: how can we ensure that we can perform instantaneous failovers,
even in the face of gray failures, if we're running our application in
multiple regions? Let's first take a look at how to achieve
performance for end user connectivity. For this,
we are going to take a deeper dive into this networking service called
AWS Global Accelerator. So here we
have an application that is deployed in multiple regions, in this
case Northern Virginia (US East 1) and Ireland (EU West 1),
and the users of our application. The users are
accessing one of these stacks from around the globe.
However, as they're accessing the application over the public
Internet, each hop from the end user to the
application endpoint can incur additional latency,
and this is going to result in a nonoptimal experience for
the end user. Now, is there a better way to provide them
with more reliable and performant connectivity to the application?
Here we are looking at a map of the global
network of 96 points of presence across
46 countries in 84 cities that AWS
global accelerator uses. So for example,
in Asia, here are some of the edge locations:
there are edge locations in Bangalore,
Bangkok, Chennai, Hyderabad, and
Jakarta, to name a few. And global
accelerator provides you with two static IP addresses
that serve as a fixed entry point to the application
hosted in one or more AWS regions. And underneath the
covers, these IP addresses are anycast from those edge
locations. So they're announced from multiple AWS edge locations
at the same time. And this enables traffic
to enter the AWS global network as close
to the user as possible. So if you have a user
in Jakarta, then if you're using global
accelerator for your application deployment, in that case,
that user will be entering the AWS global network through the
Jakarta edge location. Similarly for the Chennai user,
for example. This way the end users of your apps are benefiting
from the reliable, consistent performance of the AWS
global network. So you can associate these IP addresses with
regional AWS resources or endpoints, in this case such as
application load balancers, network load balancers, EC2 instances, and elastic
IP addresses. And global accelerator's IP addresses serve
as the front-end interface to the app. You can think of
it as a door close to your end users wherever they
may be located across the globe, and that door is at these edge locations
that we looked at on the global network map.
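To make this concrete, here is a minimal sketch, using boto3, of standing up an accelerator in front of two regional ALBs. This is illustrative only and not the exact setup from the talk: the ALB ARNs, names, and ports are placeholders, and the Global Accelerator API itself is served out of the us-west-2 region.

```python
# A minimal sketch (placeholder ARNs and names) of fronting two regional ALBs
# with an accelerator using boto3.
import uuid
import boto3

# The Global Accelerator API is served out of us-west-2, regardless of where
# your endpoints live.
ga = boto3.client("globalaccelerator", region_name="us-west-2")

accelerator = ga.create_accelerator(
    Name="my-multi-region-app",          # hypothetical name
    IpAddressType="IPV4",
    Enabled=True,
    IdempotencyToken=str(uuid.uuid4()),
)["Accelerator"]
print(accelerator["IpSets"][0]["IpAddresses"])   # the two static anycast IPs

listener = ga.create_listener(
    AcceleratorArn=accelerator["AcceleratorArn"],
    Protocol="TCP",
    PortRanges=[{"FromPort": 443, "ToPort": 443}],
    IdempotencyToken=str(uuid.uuid4()),
)["Listener"]

# One endpoint group per region, each pointing at that region's ALB.
regional_albs = {
    "us-east-1": "arn:aws:elasticloadbalancing:us-east-1:111122223333:loadbalancer/app/my-alb/placeholder",
    "eu-west-1": "arn:aws:elasticloadbalancing:eu-west-1:111122223333:loadbalancer/app/my-alb/placeholder",
}
for region, alb_arn in regional_albs.items():
    ga.create_endpoint_group(
        ListenerArn=listener["ListenerArn"],
        EndpointGroupRegion=region,
        EndpointConfigurations=[{"EndpointId": alb_arn, "Weight": 128}],
        IdempotencyToken=str(uuid.uuid4()),
    )
```

You would then hand out the accelerator's DNS name, or the two static IP addresses, to your clients as the fixed entry point.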
Next, Chris is going to show us a demo.
AWS Global Accelerator improves the availability and
performance of your applications through AWS edge locations
and the AWS backbone. This test tool compares
the performance of global accelerator with the public Internet from
your location. It does so by comparing the time it takes
to download a file of a certain size via the public Internet
as well as the optimized path via global accelerator.
In my case, the end user location is San Francisco, California and
I selected 100 kilobytes as the file size to speed up
this particular test run. Let's have a look at the results.
We can see the performance gain via global accelerator from
San Francisco to these five selected AWS regions.
Next, let's take a look at traffic dials.
So in global accelerator we have the ability to set traffic
dials for fine-grained traffic control. We can dial
up or dial down traffic directed to a specific endpoint
group. Now we do this by setting a traffic dial to
control the percentage of traffic that is already directed
to that endpoint group, to that region. Now here I
have two endpoint groups. One is in US east one, one is
in US west one, and in each endpoint group I have two
endpoints, the two elastic load balancers.
The percentage is applied only to traffic that is already directed to the
endpoint group based on proximity and health of the
endpoints. So if we have 100 requests directed to
northern Virginia, then 100% of those requests will be
directed to the US east one (Northern Virginia) endpoint group if we
set the traffic dial for it at 100%.
Similarly for requests that are directed to US west one
endpoint group. So you can think of traffic dials as giant valves that
are controlling the traffic to the endpoint groups.
Later on we may decide to switch all traffic flow to
only go to the US west one endpoint group, and for this,
we can set the traffic dial for us east one to 0%.
So we close, we shut off that giant valve that controls
the traffic that is sent to us east one.
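As a rough sketch of what that looks like in code, setting a traffic dial is a single UpdateEndpointGroup call with boto3. The endpoint group ARN below is a placeholder; in practice you would look it up with list_endpoint_groups first.

```python
# A minimal sketch of closing the "giant valve" for one region by dialing its
# endpoint group down to 0%. The endpoint group ARN is a placeholder.
import boto3

ga = boto3.client("globalaccelerator", region_name="us-west-2")

ga.update_endpoint_group(
    EndpointGroupArn="arn:aws:globalaccelerator::111122223333:accelerator/example/listener/example/endpoint-group/use1",  # placeholder
    TrafficDialPercentage=0.0,    # 100.0 restores the full share of traffic
)
```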
Now we have a finer-grained control, a smaller knob
if you will, that we can use to set
weights on each endpoint inside an endpoint group,
such that we can adjust the amount of traffic each endpoint gets.
Now, endpoints can be network load balancers, application load balancers,
EC2 instances, or elastic IP addresses. Here I'm showing the elastic
load balancers as endpoints, and global accelerator
calculates the sum of those weights for the endpoints in an endpoint group,
and then directs traffic to the endpoints based on the ratio
of each endpoint's weight to the total. So you can go as
fine-grained as one over 256 for the percentage of
traffic that is directed to an endpoint inside an endpoint group.
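Here is a hedged boto3 sketch of what adjusting those knobs looks like. The ARNs are placeholders; the weights are set per endpoint on the endpoint group.

```python
# A minimal sketch of weighting two ALBs inside one endpoint group. With
# weights 192 and 64, the first endpoint receives 192/256 (75%) and the second
# 64/256 (25%) of the traffic reaching this endpoint group. ARNs are placeholders.
import boto3

ga = boto3.client("globalaccelerator", region_name="us-west-2")

ga.update_endpoint_group(
    EndpointGroupArn="arn:aws:globalaccelerator::111122223333:accelerator/example/listener/example/endpoint-group/use1",  # placeholder
    EndpointConfigurations=[
        {"EndpointId": "arn:aws:elasticloadbalancing:us-east-1:111122223333:loadbalancer/app/alb-a/placeholder", "Weight": 192},
        {"EndpointId": "arn:aws:elasticloadbalancing:us-east-1:111122223333:loadbalancer/app/alb-b/placeholder", "Weight": 64},
    ],
)
```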
Now, how do these features help us with blue/green deployments?
First, a quick recap of blue/green deployments. The goal
of blue/green deployments is to deploy and roll back new
versions of an application with minimal to no downtime for
our app consumers. The way we achieve this is by having two
environments in production that are identical to each other.
And at any given point in time, only one of these environments is
live in terms of taking in production traffic, and the
other one is idle. So the one that's taking
in production traffic, we can call that the blue environment. For example,
if we want to perform a new release, then we deploy the new
version of the application in the green environment that is not taking any
production traffic. We test it and verify it in the green environment,
and then we cut over the production traffic to the green environment.
Now the blue environment becomes the new idle environment.
And in case of issues in the newly deployed version of the application,
we can always cut the traffic back over to the blue environment.
Now, one way to achieve blue/green deployments in a single region is
by using the little knobs that we just talked about, using endpoint
weights in global accelerator. So we stand up two identical
stacks of our application, for example, behind an ALB
endpoint for the blue environment and another ALB endpoint
for the green environment. And these two endpoints are inside
one endpoint group. Think of that as a
region, and we use the endpoint weights to adjust the production traffic flow
as part of the deployment that we just discussed, as sketched below.
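Here is one way such a single-region cutover could look with boto3. This is a sketch under assumptions, not the exact deployment pipeline from the talk: the ARNs are placeholders, and a real pipeline would verify health and allow bake time between each step.

```python
# A minimal sketch of a single-region blue/green cutover via endpoint weights.
# Both ALBs are assumed to already be registered in the same endpoint group.
import boto3

ga = boto3.client("globalaccelerator", region_name="us-west-2")

GROUP_ARN = "arn:aws:globalaccelerator::111122223333:accelerator/example/listener/example/endpoint-group/use1"  # placeholder
BLUE_ALB = "arn:aws:elasticloadbalancing:us-east-1:111122223333:loadbalancer/app/blue/placeholder"    # placeholder
GREEN_ALB = "arn:aws:elasticloadbalancing:us-east-1:111122223333:loadbalancer/app/green/placeholder"  # placeholder

def set_weights(blue, green):
    ga.update_endpoint_group(
        EndpointGroupArn=GROUP_ARN,
        EndpointConfigurations=[
            {"EndpointId": BLUE_ALB, "Weight": blue},
            {"EndpointId": GREEN_ALB, "Weight": green},
        ],
    )

set_weights(255, 0)    # all production traffic stays on blue
set_weights(230, 25)   # canary roughly 10% onto the new green version
set_weights(0, 255)    # full cutover to green; swap back if issues appear
```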
Next, we are going to look at a slight modification of blue/green deployments, but for
multi-region applications. And the goal is always the same:
to have minimal to no downtime as we
are deploying and rolling back a new version of an app for our app consumers.
So here we have version one of our application.
It's deployed in two regions, US west two and US east
one. We are using global accelerator for the application.
So our clients in Japan are accessing the application via
the global accelerator point of presence that is closest to them,
that is, either the Tokyo or Osaka global accelerator point of
presence. And
global accelerator is then intelligently routing their requests
inside the AWS global
network into the nearest app stack. That's in US west two.
Clients in Europe are also accessing this application,
also through the global accelerator point of presence
closest to them. So that'll be in Europe.
And global accelerator again intelligently routes their requests:
after taking them in through the point of presence,
through that door in Europe, it then routes the requests
inside the AWS global network into the nearest app
stack, and that is US east one. So one thing to note
is that both of these app stacks are actually serving live production
traffic. And now we decided to upgrade our application
from version one to version two without incurring any downtime to our
app consumers. So remember that
we have the traffic dials to control traffic for endpoint
groups in global accelerator. Those are the giant valves that we
can use to control to dial up and down
the traffic for endpoint groups. So we are going to use these.
We first set the traffic dial in US west two to 0%,
and then all production traffic now flows to us
east one. For our clients in Japan,
they still enter using the same point of presence closest to them in Japan.
So that will be either Tokyo or Osaka, depending on where they're located
in Japan, whichever is closer to them.
But then global accelerator is going to intelligently
route their requests this time to the US east one application stack.
And now that US west two has no production traffic flowing into
it, we can upgrade our application in US west
two to version two without incurring any downtime to
our app consumers. Next, we are going to
repeat this process for US east one. So we first turn down
the traffic dial to 0% in US east one.
All traffic now goes to US west two, including for our clients
in Europe. And they still enter the AWS
global network through the door through that point of presence that's closest
to them. So they'll be in Europe, and global accelerator is
going to intelligently route their requests to the application
stack in US west two.
So this way we can upgrade the app in US east
one to version two without incurring any downtime
for app consumers, this time the clients in Europe. Finally,
we now have both regions with version two of the app,
and we turn the traffic dial in US east one up to
100%. And now the clients in Japan go
to US west two and clients in Europe to US east one through
their global accelerator point of presence doors that
are closest to them, that will take their requests
into the AWS global network.
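The whole multi-region sequence we just walked through can be scripted as four traffic dial changes. The following is a rough boto3 sketch with placeholder endpoint group ARNs; in practice each step waits for traffic to drain and for the new version to be verified before moving on.

```python
# A minimal sketch of the multi-region blue/green sequence using traffic dials.
import boto3

ga = boto3.client("globalaccelerator", region_name="us-west-2")

USW2_GROUP = "arn:aws:globalaccelerator::111122223333:accelerator/example/listener/example/endpoint-group/usw2"  # placeholder
USE1_GROUP = "arn:aws:globalaccelerator::111122223333:accelerator/example/listener/example/endpoint-group/use1"  # placeholder

def set_dial(group_arn, percent):
    ga.update_endpoint_group(EndpointGroupArn=group_arn, TrafficDialPercentage=percent)

set_dial(USW2_GROUP, 0.0)     # drain US west two, then upgrade it to version two
set_dial(USW2_GROUP, 100.0)   # bring US west two back into service
set_dial(USE1_GROUP, 0.0)     # drain US east one, then upgrade it to version two
set_dial(USE1_GROUP, 100.0)   # both regions now serve version two
```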
Next, Chris is going to show a demo on blue/green deployments for multi-region
applications. In this demo, we will look at AWS
global accelerator for a multi-region blue/green software deployment
scenario. As depicted in the presentation, this setup
uses a single accelerator with endpoints in two AWS
regions. The US west one region represents our blue
deployment and the US east two region represents
our green deployment. Right now,
traffic dials for both regions are at 50%.
As the percentage is applied only to the traffic already directed
to the endpoint group, not all listener traffic,
only by explicitly specifying 50% as the traffic
dial for both will we see each region receiving
about the same amount of incoming end user traffic.
This CloudWatch dashboard shows the incoming traffic ratios across
the two application load balancers that front each of the two
regions. On the left we have a traffic gauge that shows
the most recent distribution, with a historical distribution
over the last half hour on the right. As expected,
the traffic ratio across the two regions is about 50/50.
Now let's drain traffic from our blue region, US
west one, so we can perform a software update there.
For this, we will set the traffic dial for us west one to
zero while leaving the traffic dial for US east
two where it is. As US east two will be the
only remaining region, it will receive all traffic.
Let's have a look at the CloudWatch dashboard and see how traffic
shifts. We will speed up the recording a bit so we don't have
to wait the two to three minutes it takes.
Great. Now we can see that 100% of our incoming traffic
is headed to the green deployment, allowing us to upgrade
the application in the blue deployment. After we finish
this upgrade, let's switch all traffic to the blue deployment so we can upgrade
the green deployment. This time, we will set the
traffic dial for us west one to 50% and
the one for us east two to 0%.
Let's have a look at the CloudWatch dashboard and see how traffic shifts
again. We will speed up the recording a bit so we don't have
to wait the two to three minutes it takes.
Now 100% of our incoming traffic is headed to
the blue deployment, allowing us to upgrade the application in the green
deployment. Once we finish this upgrade, let's switch
all the traffic back to the original 50/50 split.
This time we will set the traffic dial for us east
two back to 50%.
We'll return to the CloudWatch dashboard one last time and see
how traffic shifts again. We will speed up the recording a bit
so we don't have to wait the two to three minutes it takes,
and we're back to a 50/50 traffic split.
So let's take a look at disaster recovery in multiregion architectures.
The concepts of data plane and control plane date all
the way back to networking terminology. So these are not new concepts.
And for a given AWS service, there is typically a control plane.
That is what allows us to create resources, to modify resources,
and to destroy resources. For example,
if you think of EC2, control plane operations
are launching an EC2 instance, changing a security group
on an existing EC2 instance, or terminating an EC2
instance, among others. At the same time, there are data
plane operations that allow resources that are already
up and running to continue to operate. So in the
case of EC2, you may have already instantiated EC2 instances
that are up and running and serving requests.
So all operations that are performed while these instances are running
are part of the data plane. So for example, reading and
writing to existing elastic block storage volumes or
routing packets according to the existing VPC route tables.
Now, in case of impairments to the control plane, the EC2 instances
have all of the information that they need available to them
locally in order to continue to run. Here I have
an analogy for you. So let's think about the lifecycle
of a flight. So you can think that
for any given flight there is a takeoff,
there is a landing, and there is the part in between where
the plane is up and running, flying in
the sky. And for the parts of this
that have to do with creating a flight, which is the takeoff,
and terminating a flight, which is the landing, you need a
certain set of steps. That includes getting clearance from the control
tower. It also includes running through a runbook,
ensuring that the plane is ready, et cetera. So there are strict procedures around
what we need to create a flight and to terminate a flight,
and those are part of the control plane operations. But once the
plane is up and running and it's in the sky,
it no longer needs the control tower. It has zero dependencies
on that control tower, for example, to continue to be up and
running, to continue to fly. So if the control tower, for whatever reason,
goes away, the plane has everything that it needs locally:
all the instruments, and the fuel that is needed
for it to continue to fly.
So data plane operations ensure that what's already in flight keeps operating.
Now, both data plane and control plane operations are important,
but data plane operations favor availability.
We want to make sure that if we have EC2 instances that are already
serving requests, they should continue to be serving requests in
case our control plane is having some impairment.
And that has its roots also in the CAP theorem,
if you think about it. So the idea is to rely on
the data plane that is designed with a higher
availability target and not on the control plane that
favors consistency during recovery.
Let's have a look at Amazon Route 53 application recovery
controller, which provides a mechanism to simplify and automate
recovery for highly available applications.
Some industries and workloads have very high requirements in
terms of desired availability and recovery time objectives.
As an example, think about how real-time payment processing or
trading engines can affect entire economies if disrupted.
To address these requirements, you typically deploy multiple replicas,
called cells, across a variety of AWS
availability zones, AWS regions, and on premises
environments. Route 53 application recovery controller
provides a highly reliable mechanism to aid Route
53 in reliably routing end users to the
appropriate cell in an active-active setup.
Or in a nutshell, Amazon Route 53 application recovery controller
gives you a big red emergency stop button, which acts
like a circuit breaker to take a problematic cell out of service.
What are the key capabilities of application recovery controller?
First, readiness checks continually monitor AWS resources
across your application replicas. Checks can monitor a number
of areas that can affect recovery readiness, such as
updates to configurations (also called configuration drift),
capacity, or network routing policies.
Second, routing controls give you a way to manually and reliably
fail over the entire application stack. Such a failover
decision is often a conscious manual choice based on application
metrics or partial failures. You can also use
them to shift traffic for maintenance purposes or to recover from
failures when your monitors themselves fail.
Third, safety rules act as safeguards for application recovery
controller itself, to determine the desired combinations
of routing controls and to avoid unintended consequences.
For example, you might want to prevent inadvertently turning
off all routing controls, which would stop all traffic
flow and thereby result in a fail-open scenario.
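As an illustration of that last example, here is a hedged boto3 sketch of an assertion safety rule that requires at least one of the listed routing controls to stay On. The control panel and routing control ARNs are placeholders, and the parameter shapes follow the route53-recovery-control-config API as I understand it.

```python
# A minimal sketch of a safety rule asserting that at least one of the listed
# routing controls stays On, preventing the "all cells off" situation. ARNs
# are placeholders; the ARC configuration API is hosted in us-west-2.
import uuid
import boto3

arc_config = boto3.client("route53-recovery-control-config", region_name="us-west-2")

arc_config.create_safety_rule(
    ClientToken=str(uuid.uuid4()),
    AssertionRule={
        "Name": "at-least-one-cell-on",   # hypothetical rule name
        "ControlPanelArn": "arn:aws:route53-recovery-control::111122223333:controlpanel/example",  # placeholder
        "AssertedControls": [
            "arn:aws:route53-recovery-control::111122223333:controlpanel/example/routingcontrol/cell1",  # placeholder
            "arn:aws:route53-recovery-control::111122223333:controlpanel/example/routingcontrol/cell2",  # placeholder
        ],
        "WaitPeriodMs": 5000,
        "RuleConfig": {"Type": "ATLEAST", "Threshold": 1, "Inverted": False},
    },
)
```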
Let's look closer at the architecture of the application recovery
controller. We start by looking at routing controls.
Routing controls allow us to create control panels and
model the desired cell structure of our applications and
what our big red emergency buttons or circuit breakers
should look like. In this example here we have
two cells with one circuit breaker switch each.
The cell in the active region is currently in the on position,
while the cell in the standby region is in the off position.
To actually influence traffic between the active and standby region,
Amazon Route 53 is needed in addition to the application recovery controller.
Our circuit breaker buttons are mapped to Route 53 health
checks that can be used for various record types, such as
the failover record type. At this point, it is very
important to point out the key capability of this integration.
Route 53 health checks are part of the data plane, which
has a 100% uptime SLA. Therefore, even if
the Route 53 control plane is affected during a large-scale
event, forcing a Route 53 health check
to unhealthy via the application recovery controller
still allows us to perform a failover between the active
and standby region. But why is the application
recovery controller's control plane superior in this scenario?
As depicted in the diagram, you can see that each controller
consists of a cluster across five different AWS regions
with API endpoints in each of them. As long
as one of these five endpoints is still available, changes to
the routing control state can be made, and these changes
can still factor in the safety rules that you previously saw.
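To make the "big red button" tangible, here is a hedged boto3 sketch of flipping a routing control Off against one of the cluster's regional endpoints. The routing control ARN and the cluster endpoint URL are placeholders (the real endpoints come from DescribeCluster), and production tooling would retry across all five endpoints.

```python
# A minimal sketch of opening the circuit breaker for a cell by turning its
# routing control Off via the ARC cluster (data plane) API. Placeholders throughout.
import boto3

ROUTING_CONTROL_ARN = "arn:aws:route53-recovery-control::111122223333:controlpanel/example/routingcontrol/active-cell"  # placeholder

cluster = boto3.client(
    "route53-recovery-cluster",
    region_name="us-west-2",
    endpoint_url="https://example.us-west-2.route53-recovery-cluster.aws",  # placeholder cluster endpoint
)

cluster.update_routing_control_state(
    RoutingControlArn=ROUTING_CONTROL_ARN,
    RoutingControlState="Off",   # open the circuit breaker for this cell
)
print(cluster.get_routing_control_state(RoutingControlArn=ROUTING_CONTROL_ARN)["RoutingControlState"])
```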
Now let's look in detail at how the application recovery controller
interfaces with Route 53. A DNS request
for our application, MyApp, would
reach Route 53's distributed data plane.
At the same time, Route 53's global health checkers integrate
with the application recovery controller's routing control.
The application recovery controller provides a virtual health check
that is mapped to a manual on/off switch, which can be controlled
via a highly available API. If we flip this
switch via the routing control, the information is provided
by the Route 53 global health checkers to the distributed
data plane. This updated health check information
within the distributed data plane now allows a changed
DNS response, and this changed DNS response
reroutes incoming traffic to our secondary or standby
region.
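For reference, the wiring between a routing control and a Route 53 failover record can be sketched with boto3 roughly as follows. The hosted zone ID, record name, routing control ARN, and ALB DNS name are all placeholders; the key pieces are the RECOVERY_CONTROL health check type and the Failover and SetIdentifier fields on the record.

```python
# A minimal sketch: a Route 53 health check bound to an ARC routing control,
# referenced by the PRIMARY half of a failover record pair. Placeholders throughout.
import uuid
import boto3

r53 = boto3.client("route53")

hc = r53.create_health_check(
    CallerReference=str(uuid.uuid4()),
    HealthCheckConfig={
        "Type": "RECOVERY_CONTROL",   # mirrors the routing control's On/Off state
        "RoutingControlArn": "arn:aws:route53-recovery-control::111122223333:controlpanel/example/routingcontrol/active-cell",  # placeholder
    },
)["HealthCheck"]

r53.change_resource_record_sets(
    HostedZoneId="Z0123456789EXAMPLE",   # placeholder
    ChangeBatch={"Changes": [{
        "Action": "UPSERT",
        "ResourceRecordSet": {
            "Name": "myapp.example.com",             # placeholder
            "Type": "CNAME",
            "TTL": 60,
            "SetIdentifier": "primary-active-region",
            "Failover": "PRIMARY",
            "HealthCheckId": hc["Id"],
            "ResourceRecords": [{"Value": "my-alb.us-east-1.elb.amazonaws.com"}],  # placeholder
        },
    }]},
)
# A matching SECONDARY record pointing at the standby region completes the pair.
```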
Now let's look at a brief demo of how all this can be used with an example application.
In this demo, we will look at the Amazon Route 53 application
recovery controller. For this we have deployed a
simple demo architecture with a tic-tac-toe game deployed
across two regions. US east one acts as the
active region while US west two acts as a hot
standby region. Initially, both AWS regions
are healthy and therefore the game should be served out of the active US east
one region. This is accomplished by steering inbound network traffic
via a Route 53 failover record using Amazon
Route 53 application recovery controller. We also
have circuit breakers in the form of routing controls in
place. Let's have a look. Here we can see
the Route 53 failover record with a primary entry
for US east one and a secondary entry for US west
two. Both records have a distinct health check associated
with them. Looking at these health checks, we can see
that each of them is currently healthy and therefore the primary
failover record entry for us east one is being used.
Each of the two health checks corresponds to a Route 53 application
recovery controller routing control. We can imagine
each of these routing controls to be like a circuit breaker.
Here we can see that at this point, each routing control state is On.
Let's play a game of tic-tac-toe. We can see
that the tic-tac-toe game is currently served out of the US
east one region.
Creating a new game and choosing a worthy opponent, we can validate
that the application is working.
So far so good. But what if we are faced with
a disastrous event in our active region and need to fail
over to the standby region? As part of this disastrous event,
we're also no longer able to make changes to the DNS public hosted
zone. But thanks to the Route 53 application recovery
controller, we have our circuit breakers in the form of routing
controls in place. With this,
we can open the circuit breaker for the US east one
region and thereby initiate a failover to the US
west two region. Let's have a look.
First, we will open the circuit breaker for the US east one
region by setting the routing control state to Off.
Looking at the associated health checks, we can see that
the health check for US east one will change to unhealthy,
which was triggered by the Route 53 application recovery controller
routing control state change. If we reload
our tic-tac-toe game, we can see that it will now be served out of the
US west two region.
Time to play another round of tic-tac-toe.
Let's look at the key takeaways from this talk.
Improve your application's performance and resiliency by minimizing the
number of network hops by using the AWS backbone.
Eliminate control plane dependencies of your application to improve
disaster recovery, and consider manual failover
mechanisms by using the Route 53 application recovery controller
as a big red emergency button.
Thank you.