Transcript
Hi, today I will be talking about fault tolerance and how
to migrate a service to multi-region in the cloud.
Services and resources are located in an availability zone inside
a region, like US East or US West.
We can have multiple availability zones in a region.
Sometimes a whole region can fail, causing a whole service
to die. But if we have a fault-tolerant setup
with a multi-region approach, disaster can be avoided.
I am currently working at Globant as a DevOps engineer.
We are a digitally native company that helps organizations
reinvent themselves and unleash their potential. We are
the place where innovation, design and engineering meet scale.
Since I started working in the cloud, I've noticed that
even with high availability, there is a chance that a whole region fails
and critical services that need 100% uptime are not
available, generating massive losses for the clients.
The process of not only modernizing these services but also
migrating them to be multi-region can be quite taxing, but if
done correctly, it will bring peace of mind to stakeholders,
clients and end users. I will explain the
difference between high availability and fault tolerance,
and then provide some pointers for going multi-region with a complex
service.
So first, what is high availability?
High availability is the ability of a service to remain operational
with minimal downtime in the event of a disruption.
Disruptions include hardware failures, networking problems,
or security events like a DDoS attack. In a highly
available system, the infrastructure services are spread across
a cluster of instances. If one of these instances
fails, the workloads running on it are automatically moved to other servers or
instances. These clusters are normally set up
across different availability zones, but all in the same region.
The main advantages are easier maintenance,
even if the design is more complex, because of the scalability
it provides, and almost no service disruption thanks to
a load balancing solution that will automatically divert the traffic
to a functioning cluster. This means that we
will need to set up double the components to provide this
balancing solution, which can raise the cost, even if we
are spending to prevent disruption. There is also a chance of data
loss if there is data being transferred while
the failover happens.
And finally, there is still a chance of disruption.
There's still a percentage of disruption
that can cause severe losses in some services, and we have all seen
a full service failure during a regional disaster.
Then we have fault tolerance, where
there's no interruption of service. It's a design concept where
a service will continue working normally after experiencing a
failure or malfunction, with zero service interruption.
This seamless transition applies not only when there's a failure,
but also when there's a need to upgrade, to
change, or to perform maintenance on the service or hardware.
Now, regarding the data loss that applied
to high availability: here, if
well implemented, there shouldn't be any,
because we have set up redundancy, we
do not have that crossover component between active and
passive systems, and both will write and receive all of the requests.
Obviously, the design concept that assures the service will continue to
work if a whole region fails has a more costly and complex
setup. But sometimes you have to weigh those things and decide
what is better for the situation.
So both designs reduce the risk of service disruption and
downtime, but they do so in different ways.
Additionally, the two models tend to differ in terms of cost.
When we choose which one to adopt, we have to take into account the
acceptable level of disruption, the infrastructure requirements, and the management
effort in the design, setup and operational maintenance.
As I've been working on some sensitive projects that needed
zero downtime and no data loss, I had to implement multi-region,
and I'd like to provide some insight on how to implement it
in AWS.
So the first thing we have to make
sure of is that we have good cross-account
replication of the security resources,
and cross-region replication of the security resources too.
Authorization, encryption, auditing and observability need
to be replicated, and the logs should be stored in an S3
bucket with multi-region replication, as we can see in this graphic.
Luckily, AWS IAM,
the users, roles and groups provider
for AWS, has multi-region availability
automatically, with no configuration required on your part.
Then we have AWS Secrets Manager, which can store secrets with
KMS encryption; these can be replicated
to secondary regions to ensure they can be retrieved in the closest
available region.
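As a minimal sketch of that replication with boto3, assuming a placeholder secret name and region pair:

```python
import boto3

# Create a secret in the primary region and replicate it to a secondary region.
# Secret name, value and regions are placeholders for illustration.
secrets = boto3.client("secretsmanager", region_name="us-east-1")

secrets.create_secret(
    Name="prod/payments/db-password",
    SecretString="not-a-real-password",
    AddReplicaRegions=[{"Region": "us-west-2"}],  # retrievable from the replica region too
)

# An existing secret can also be replicated after the fact.
secrets.replicate_secret_to_regions(
    SecretId="prod/payments/db-password",
    AddReplicaRegions=[{"Region": "us-west-2"}],
)
```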
Some services, like S3 buckets and Aurora databases, have cross-region
replication, which makes the encryption and decryption steps more
agile. But for those applications that run multi-region, there's an
option to set up KMS multi-region keys, which will make your life
easier for the encryption operations.
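A rough sketch of creating such a key and replicating it, with illustrative regions and description:

```python
import boto3

# Create a multi-Region primary key in us-east-1, then replicate it to us-west-2.
# Regions and description are placeholders.
kms_primary = boto3.client("kms", region_name="us-east-1")

key = kms_primary.create_key(
    Description="multi-region data key for the service",
    MultiRegion=True,
)
key_id = key["KeyMetadata"]["KeyId"]

# The replica shares the same key material, so ciphertext encrypted in one
# region can be decrypted in the other without cross-region API calls.
kms_primary.replicate_key(
    KeyId=key_id,
    ReplicaRegion="us-west-2",
)
```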
As stated before, you can save all the CloudTrail
logs in a replicated S3 storage, but keep in mind that
you can also enable Security Hub to send all the findings from both
regions to a single pane.
Second, but not less important, is networking,
because we have to analyze and be aware of the networking infrastructure.
We are going to need to set up a global network
to communicate between the regions. We can use VPC peering:
these resources can communicate using private IP addresses
and do not require an Internet gateway, a VPN,
or separate network appliances, and by the way, it's cheaper
than other options.
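A minimal sketch of requesting that cross-region VPC peering with boto3; the VPC IDs and regions are placeholders:

```python
import boto3

# Request a peering connection from a VPC in us-east-1 to a VPC in us-west-2.
# VPC IDs and regions are placeholders.
ec2_east = boto3.client("ec2", region_name="us-east-1")
ec2_west = boto3.client("ec2", region_name="us-west-2")

peering = ec2_east.create_vpc_peering_connection(
    VpcId="vpc-0aaa1111bbbb22222",        # requester VPC (us-east-1)
    PeerVpcId="vpc-0ccc3333dddd44444",    # accepter VPC (us-west-2)
    PeerRegion="us-west-2",
)
peering_id = peering["VpcPeeringConnection"]["VpcPeeringConnectionId"]

# The connection has to be accepted from the peer region.
ec2_west.accept_vpc_peering_connection(VpcPeeringConnectionId=peering_id)

# After this, route tables on both sides still need routes that point the
# remote CIDR at the peering connection.
```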
For connecting our virtual private clouds and on-premises networks,
we have Transit Gateway, a network transit hub
that connects your VPCs and on-premises networks.
This can be expanded to
additional regions with Transit Gateway inter-region peering
to create a globally distributed private network for your resources.
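A sketch of that inter-region peering, assuming placeholder transit gateway IDs and account:

```python
import boto3

# Peer a transit gateway in us-east-1 with one in us-west-2.
# Transit gateway IDs, account ID and regions are placeholders.
ec2_east = boto3.client("ec2", region_name="us-east-1")
ec2_west = boto3.client("ec2", region_name="us-west-2")

attachment = ec2_east.create_transit_gateway_peering_attachment(
    TransitGatewayId="tgw-0aaa1111bbbb22222",
    PeerTransitGatewayId="tgw-0ccc3333dddd44444",
    PeerAccountId="123456789012",
    PeerRegion="us-west-2",
)
attachment_id = attachment["TransitGatewayPeeringAttachment"][
    "TransitGatewayPeeringAttachmentId"
]

# Accept the peering from the other region, then add static routes to each
# transit gateway route table so traffic can flow across regions.
ec2_west.accept_transit_gateway_peering_attachment(
    TransitGatewayAttachmentId=attachment_id
)
```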
Now we have Route 53, the DNS
solution to route users to those distributed
Internet applications; it offers a comprehensive, highly
available solution that has minimal dependencies.
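One common multi-region pattern here is failover routing: a primary record backed by a health check, and a secondary record that takes over when the health check fails. A sketch, with placeholder zone ID, domain, IPs and health check:

```python
import boto3

route53 = boto3.client("route53")

route53.change_resource_record_sets(
    HostedZoneId="Z0000000EXAMPLE",
    ChangeBatch={
        "Changes": [
            {
                "Action": "UPSERT",
                "ResourceRecordSet": {
                    "Name": "app.example.com",
                    "Type": "A",
                    "SetIdentifier": "primary-us-east-1",
                    "Failover": "PRIMARY",
                    "TTL": 60,
                    "ResourceRecords": [{"Value": "203.0.113.10"}],
                    # Health check that monitors the primary region endpoint.
                    "HealthCheckId": "11111111-2222-3333-4444-555555555555",
                },
            },
            {
                "Action": "UPSERT",
                "ResourceRecordSet": {
                    "Name": "app.example.com",
                    "Type": "A",
                    "SetIdentifier": "secondary-us-west-2",
                    "Failover": "SECONDARY",
                    "TTL": 60,
                    "ResourceRecords": [{"Value": "198.51.100.20"}],
                },
            },
        ]
    },
)
```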
Then we have CloudFront, a content delivery
network for websites, which
allows us to serve our content closer to the end users
with edge locations. But it's also possible
to set it up with an origin failover:
if the primary origin is unavailable or returns
a specific HTTP response status code that
indicates a failure,
CloudFront will automatically switch to the secondary origin.
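That failover is configured with an origin group inside the distribution config. A rough sketch of just that fragment; the origin IDs and status codes are illustrative:

```python
# Fragment of a CloudFront DistributionConfig: an origin group that fails over
# from a primary origin to a secondary one on 5xx responses.
# The origin IDs are placeholders and must match origins defined elsewhere
# in the same distribution config.
origin_groups = {
    "Quantity": 1,
    "Items": [
        {
            "Id": "multi-region-origin-group",
            "FailoverCriteria": {
                "StatusCodes": {"Quantity": 4, "Items": [500, 502, 503, 504]}
            },
            "Members": {
                "Quantity": 2,
                "Items": [
                    {"OriginId": "primary-origin-us-east-1"},
                    {"OriginId": "secondary-origin-us-west-2"},
                ],
            },
        }
    ],
}
# This dict goes under the "OriginGroups" key of the DistributionConfig
# passed to create_distribution or update_distribution.
```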
And then, for Internet-facing apps,
you can set up Global Accelerator, which automatically
switches traffic behind two static anycast IPs
that act as a single entry point, so you can seamlessly
add or remove origins and redirect traffic within
seconds. It also allows
us to set traffic weights to test deployments.
I've used Global Accelerator and it's really useful
for devs,
not just cloud engineers, because they
only have to switch the weights to move
all the traffic, and it's a
really good solution. It also helps with
live fixes to production when some
scenarios require it.
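Switching those weights is done on the endpoint groups with the traffic dial. A sketch, assuming placeholder ARNs (the Global Accelerator API is served from us-west-2):

```python
import boto3

# Shift traffic between regional endpoint groups by changing the traffic dial.
# The endpoint group ARNs below are placeholders.
ga = boto3.client("globalaccelerator", region_name="us-west-2")

# Drain the primary region's endpoint group...
ga.update_endpoint_group(
    EndpointGroupArn="arn:aws:globalaccelerator::123456789012:accelerator/EXAMPLE/listener/abc/endpoint-group/east",
    TrafficDialPercentage=0.0,
)

# ...and send 100% of traffic to the secondary region.
ga.update_endpoint_group(
    EndpointGroupArn="arn:aws:globalaccelerator::123456789012:accelerator/EXAMPLE/listener/abc/endpoint-group/west",
    TrafficDialPercentage=100.0,
)
```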
Now, next, we have compute.
Depending on your infrastructure, there are different things to consider.
For example, if you use EC2 instances, they have
their corresponding EBS volumes.
Those are stored in one availability zone, but we have Data Lifecycle
Manager to automate the replication of those volumes' snapshots
to another region. And if we use AMIs,
we have replication across regions
with EC2 Image Builder, so we don't need to
do this manually.
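Under the hood these are cross-region copies of snapshots and AMIs; Data Lifecycle Manager and Image Builder just automate them on a schedule. A minimal sketch of the manual equivalent, with placeholder IDs and regions:

```python
import boto3

# Cross-region copies are issued from the destination region.
# Snapshot ID, AMI ID and regions are placeholders.
ec2_west = boto3.client("ec2", region_name="us-west-2")

# Copy an EBS snapshot from us-east-1 into us-west-2.
ec2_west.copy_snapshot(
    SourceRegion="us-east-1",
    SourceSnapshotId="snap-0123456789abcdef0",
    Description="DR copy of the app data volume",
)

# Copy an AMI from us-east-1 into us-west-2 so instances can be launched there.
ec2_west.copy_image(
    Name="app-server-dr-copy",
    SourceRegion="us-east-1",
    SourceImageId="ami-0123456789abcdef0",
)
```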
If our service or application is based on microservices,
which is a really good idea, we can use Amazon
Elastic Container Registry, which has private image
replication that can be cross-region or cross-account.
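A sketch of enabling that registry-level replication, with placeholder account ID and regions:

```python
import boto3

# Every image pushed to this account's private ECR in us-east-1
# is copied to us-west-2 automatically.
ecr = boto3.client("ecr", region_name="us-east-1")

ecr.put_replication_configuration(
    replicationConfiguration={
        "rules": [
            {
                "destinations": [
                    {"region": "us-west-2", "registryId": "123456789012"}
                ]
            }
        ]
    }
)
```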
I use third-party container registry tools
that are called from my pipelines and deploy
the images into my AWS infrastructure in both regions at the same
time, so that's another option.
Now, for data replication. This is also a complicated topic,
because we have all heard about
the CAP theorem, which states that we cannot have all three of
consistency, availability and partition tolerance
at the same time; we need to choose only two and
decide which we select depending on our
needs. So when we go multi-region,
the one that's harder to achieve is consistency, due
to the long distance between the services. I have
already mentioned that S3,
the Simple Storage Service, or buckets as we usually call them,
has multi-region replication. This replication
is one-way or two-way, continuous, and
the replication can also be applied to a subset of
objects inside the bucket.
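A minimal sketch of a one-way replication rule to a bucket in another region; bucket names, prefix and the IAM role are placeholders, and both buckets need versioning enabled first:

```python
import boto3

s3 = boto3.client("s3", region_name="us-east-1")

s3.put_bucket_replication(
    Bucket="orders-primary-us-east-1",
    ReplicationConfiguration={
        "Role": "arn:aws:iam::123456789012:role/s3-replication-role",
        "Rules": [
            {
                "ID": "replicate-orders-to-us-west-2",
                "Status": "Enabled",
                "Priority": 1,
                "Filter": {"Prefix": "orders/"},  # replicate only a subset of objects
                "DeleteMarkerReplication": {"Status": "Disabled"},
                "Destination": {"Bucket": "arn:aws:s3:::orders-replica-us-west-2"},
            }
        ],
    },
)
```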
For non-relational databases like DynamoDB,
there are global tables with multi-writer capabilities,
which will detect the changes and replicate them to the other
regions.
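A sketch of turning an existing table into a global table by adding a replica region; the table name and regions are placeholders, and the table needs DynamoDB Streams enabled with new-and-old images:

```python
import boto3

dynamodb = boto3.client("dynamodb", region_name="us-east-1")

# Add a replica in us-west-2; DynamoDB then keeps both regions in sync
# with multi-writer, last-writer-wins replication.
dynamodb.update_table(
    TableName="orders",
    ReplicaUpdates=[
        {"Create": {"RegionName": "us-west-2"}}
    ],
)
```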
For caching, we have ElastiCache for Redis,
which offers a Global Datastore to
create fully managed, fast, reliable and secure cross-region
replicas for Redis caches and databases.
And for relational databases such as Aurora,
the cluster in one region is designated as the
writer, and then we have secondary
regions that are designated as read copies or
read replicas. While only one instance can process the writes,
Aurora MySQL supports write forwarding,
a feature that will forward write queries from a secondary region endpoint
to the primary region, to simplify the logic in the application code.
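A rough sketch of setting up an Aurora global database, with placeholder identifiers and regions: the existing cluster becomes the primary, and a read-only secondary cluster is added in the other region so it can be promoted during a disaster.

```python
import boto3

rds_east = boto3.client("rds", region_name="us-east-1")
rds_west = boto3.client("rds", region_name="us-west-2")

# Promote an existing cluster to be the primary of a global cluster.
rds_east.create_global_cluster(
    GlobalClusterIdentifier="orders-global",
    SourceDBClusterIdentifier="arn:aws:rds:us-east-1:123456789012:cluster:orders-primary",
)

# Add a secondary cluster in the other region; it receives replicated data
# and can be promoted if the primary region fails.
rds_west.create_db_cluster(
    DBClusterIdentifier="orders-secondary",
    Engine="aurora-mysql",
    GlobalClusterIdentifier="orders-global",
)
```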
Logical replication, which utilizes
a database engine's built-in replication technology,
can be set up for Amazon RDS for MariaDB,
MySQL, Oracle, PostgreSQL
and Aurora databases. This means that a cross-region read replica
will receive and process changes from the writer in the primary
region. This will make local reads faster and
can reduce data loss and recovery times in the case of a disaster, since the replica can be
promoted to a standalone instance.
This technology can also be used to replicate data to
a resource outside Amazon RDS, like an EC2 instance,
an on-premises server, or even a data lake.
Third, but not least, we have
application management. This is the
bigger side of the service that has to be
taken into account, for example DevOps.
I think it's important to plan what CI/CD tools we are going to
use in order to deploy all of this infrastructure.
Whether it's AWS CodePipeline, GitLab CI
or Jenkins, the pipelines will need to be configured
to assure this double deployment; but at
the same time, deploying first to one region and then
to the other while the primary
is working isn't as complicated as it seems.
Working with variable files for each region and environment
is a really good idea. I mostly use Terraform,
which is really solid and allows us to review before we apply
changes to
the infrastructure. But AWS has CloudFormation,
which is also really solid and allows us to
create, update and delete stacks across
multiple regions and multiple accounts with one simple
template, obviously providing the corresponding
variables.
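A minimal sketch of that multi-account, multi-region deployment with CloudFormation StackSets; the stack set name, template file, accounts and regions are placeholders, and the StackSets execution roles are assumed to be in place:

```python
import boto3

cfn = boto3.client("cloudformation", region_name="us-east-1")

# Register the template once as a stack set.
with open("service.yaml") as template:
    cfn.create_stack_set(
        StackSetName="payments-service",
        TemplateBody=template.read(),
        Capabilities=["CAPABILITY_NAMED_IAM"],
    )

# Deploy stack instances of that template to both regions of each account.
cfn.create_stack_instances(
    StackSetName="payments-service",
    Accounts=["111111111111", "222222222222"],
    Regions=["us-east-1", "us-west-2"],
)
```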
Now, depending on the architecture of our service,
if we use decoupled applications,
we will need an event manager; we have EventBridge.
This will help us provide a notification service across regions.
EventBridge is serverless, and we can use cross-region
routing to interconnect messages between the resources.
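A sketch of that cross-region routing: a rule on the primary region's bus forwards matching events to the default event bus in the secondary region. Event pattern, names, account ID and IAM role are placeholders.

```python
import boto3

events = boto3.client("events", region_name="us-east-1")

# Rule that matches the application's events on the primary bus.
events.put_rule(
    Name="forward-order-events",
    EventPattern='{"source": ["app.orders"]}',
    State="ENABLED",
)

# Target the default event bus in the secondary region.
events.put_targets(
    Rule="forward-order-events",
    Targets=[
        {
            "Id": "us-west-2-bus",
            "Arn": "arn:aws:events:us-west-2:123456789012:event-bus/default",
            # Role that allows events:PutEvents on the destination bus.
            "RoleArn": "arn:aws:iam::123456789012:role/eventbridge-cross-region",
        }
    ],
)
```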
If you rely on pub/sub messaging, like SNS,
it can work with multiple destinations, so you can send messages to
a central SQS queue that processes
orders in a multi-region application.
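A minimal sketch of fanning out from an SNS topic to SQS queues in both regions, so either region can process the messages; the ARNs are placeholders and each queue's policy must allow the topic to send to it:

```python
import boto3

sns = boto3.client("sns", region_name="us-east-1")

# Subscribe one queue per region to the same topic.
for queue_arn in [
    "arn:aws:sqs:us-east-1:123456789012:orders-queue",
    "arn:aws:sqs:us-west-2:123456789012:orders-queue",
]:
    sns.subscribe(
        TopicArn="arn:aws:sns:us-east-1:123456789012:orders-topic",
        Protocol="sqs",
        Endpoint=queue_arn,
    )
```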
Finally, to maintain visibility and observability
over an application deployed across multiple regions
and accounts, which can generate a lot of resources,
you can create a Trusted Advisor dashboard or an operations
dashboard with Systems Manager
Explorer. This operations dashboard offers a unified view
of resources like EC2, CloudWatch
and AWS Config data, and you can combine
the metadata with Amazon Athena to create a
multi-region and multi-account inventory with
a good view of all of the resources.
So you have heard me talk about all of these resources,
and while I don't want to sound like an AWS evangelist,
I think it's important to know that these options exist.
Each cloud has similar resources and alternatives to build
a fault tolerant multiregion service.
It's hard, but it's possible, and the peace
of mind it brings when disaster
happens is something to take into consideration when designing
or updating a service or an application.
It has brought me a lot of solutions
and helped me provide
better architectures for my projects.
So, well, thank you for listening, and I hope this information
allows you to reinvent and think
of better architectures for better solutions. Thank you.