Transcript
Hello and welcome to Conf42 Cloud Native 2024.
My name is Ricardo Castro and today we're going to talk about architecting
resilient cloud native applications.
Specifically, we're going to focus on practical tips for
deployment and runtime patterns. What do we have on
the menu for today? We'll explore some of the patterns to build
resilient cloud native applications. Here's what
we'll cover. We will see deployment patterns. Blue-green deployments: exploring seamless transitions to minimize downtime during updates. Rolling updates: minimizing user disruption through gradual code changes. Canary deployments: examining strategies for controlled rollouts and early feedback. And to close out the deployment patterns, dark launches: delving into prerelease testing and feature gating. After that, we will explore
runtime resilience patterns. We will explore
popular patterns like timeouts and retries. We'll see how to prevent applications
from stalling due to slow dependencies.
We'll also see rate limiting, a technique to ensure fair resource
allocation and prevent overload. We'll also
see bulkheads. We'll understand how to isolate failures and
improve overall stability. And finally, we will see
circuit breakers. We will learn how to protect critical services
from cascading failures. The world of cloud native
applications promises remarkable benefits, incredible scalability,
rapid development cycles, and the agility to meet ever changing
business demands. But there's an inherent tradeoff.
This distributed architecture, with its reliance on microservices,
external APIs, and managed infrastructure,
introduces a unique set of fragility concerns.
To truly harness the power of cloud native development,
we must make resilience a foundational principle.
This means ensuring our applications can withstand component
failures, network glitches, and unexpected traffic surges,
all while maintaining a seamless user experience.
Let's start by addressing the core challenge of updates.
How do we deploy new code or features without taking our
applications offline? This is where deployment patterns
come into play. These are strategic approaches to
rolling out changes in a way that minimizes or even eliminates any
disruptions to our users. We'll explore four key patterns: blue-green deployments, rolling updates, canary deployments, and dark launches. The concept behind blue-green
deployments is elegantly simple.
You maintain two identical production environments. Blue is your current live environment, the one serving users, while green stands by, fully updated with the latest code. Once you're ready to deploy, you seamlessly
redirect traffic from blue to green.
The beauty lies in the rollback capability.
If any issues arise, you can
instantly switch back to blue. This offers
a safety net for large updates, minimizing the potential
for downtime. Let's see an example. Let's imagine
that we have a set of users accessing our V one application through a
load balancer. We deploy V two of our
service, and we test that everything is okay.
Once we're confident, we redirect traffic from the load balancer to our
version two. If any error arises,
we simply switch back from V two to V one.
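As a rough illustration of that traffic switch, here is a minimal sketch in Go, assuming a tiny reverse proxy sits in front of two hypothetical environments, blue.internal and green.internal. In practice the switch usually lives in your load balancer, ingress, or service mesh rather than in application code.

```go
package main

import (
	"log"
	"net/http"
	"net/http/httputil"
	"net/url"
	"sync/atomic"
)

// target holds the environment currently receiving live traffic.
var target atomic.Value // stores *url.URL

func main() {
	blue, _ := url.Parse("http://blue.internal:8080")   // current live environment (V one)
	green, _ := url.Parse("http://green.internal:8080") // fully updated standby (V two)
	target.Store(blue)

	// The proxy always forwards to whichever environment is stored in target.
	proxy := &httputil.ReverseProxy{
		Director: func(r *http.Request) {
			t := target.Load().(*url.URL)
			r.URL.Scheme = t.Scheme
			r.URL.Host = t.Host
		},
	}

	// Cut over to green (or back to blue) with a single request to /switch.
	http.HandleFunc("/switch", func(w http.ResponseWriter, r *http.Request) {
		if target.Load().(*url.URL) == blue {
			target.Store(green) // deploy: send users to the new version
		} else {
			target.Store(blue) // rollback: instantly return to the old version
		}
	})

	http.Handle("/", proxy)
	log.Fatal(http.ListenAndServe(":80", nil))
}
```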
Rolling updates offer a controlled approach to
introducing new code changes. Instead of
updating our entire environment in one go, the new version is
gradually rolled out across your instances. This is
like changing the tires of a moving car one at a time,
allowing you to minimize any potential disruption.
With each updated instance, you carefully monitor for
any errors or any unexpected behavior. If any
issue arises, the rollout can be paused or reversed,
limiting the impact. Rolling updates are particularly well
suited for containerized environments where tools like Kubernetes can
seamlessly manage this process. In this example,
we see that we have V one of our applications deployed one
by one. We start by replacing V one with v
two of our application. If at any point we see
a problem, we can simply stop that rollout and even reverse
it to a previous version. This doesn't mean that we have to update one node at a time; we can update a percentage of nodes in each step. The idea is that the change is rolling, so a set of replicas is updated at each step.
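To make the mechanics concrete, here is a minimal sketch in Go of an orchestrator-style rolling update loop, assuming a hypothetical list of instances, a placeholder deployTo function, and a /healthz endpoint on each node. In a containerized environment, a tool like Kubernetes performs this loop for you.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
	"time"
)

// instances is a hypothetical fleet of nodes currently running V one.
var instances = []string{"node-1", "node-2", "node-3", "node-4"}

// healthy checks the instance's health endpoint after the update.
func healthy(instance string) bool {
	resp, err := http.Get("http://" + instance + ":8080/healthz")
	if err != nil {
		return false
	}
	defer resp.Body.Close()
	return resp.StatusCode == http.StatusOK
}

// rollOut replaces instances one at a time, pausing as soon as a
// freshly updated instance fails its health check.
func rollOut(version string) error {
	for _, instance := range instances {
		fmt.Printf("updating %s to %s\n", instance, version)
		// deployTo would drain the instance and start the new version;
		// it is a placeholder for whatever your platform does.
		deployTo(instance, version)

		time.Sleep(10 * time.Second) // give the instance time to start
		if !healthy(instance) {
			// Pause the rollout; a real orchestrator could also
			// roll this instance back to the previous version.
			return errors.New("rollout paused: " + instance + " is unhealthy")
		}
	}
	return nil
}

func deployTo(instance, version string) { /* platform-specific */ }

func main() {
	if err := rollOut("v2"); err != nil {
		fmt.Println(err)
	}
}
```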
Canary deployments take their name from the historical practice of miners bringing canaries
underground. These birds were sensitive to dangerous
gases, alerting the miners to potential hazards.
Similarly, a canary deployment exposes your code to
a small subset of users. You closely monitor
key metrics, watching for any performance degradation or
any error. If all goes well, you gradually
roll out the updates to a larger and larger segment of your audience.
This cautious approach helps catch issues early,
before they impact your entire user base. So in
this example, we start by rolling out the new version, V two, of our service. Then we start gradually shifting a percentage of our users to V two. If everything goes well, we increase the share of traffic shifted from V one to V two, eventually arriving at 100% of users on V two. If any problem arises during this process, we can simply switch back and continue using V one.
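Here is a minimal sketch in Go of the weighted routing idea, assuming two hypothetical backends, service-v1.internal and service-v2.internal. A service mesh or ingress controller would normally handle this traffic split, but the logic is the same.

```go
package main

import (
	"math/rand"
	"net/http"
	"net/http/httputil"
	"net/url"
	"sync/atomic"
)

// canaryPercent is the share of traffic sent to V two; start small and
// raise it gradually as long as the metrics stay healthy.
var canaryPercent atomic.Int64

func main() {
	v1, _ := url.Parse("http://service-v1.internal:8080")
	v2, _ := url.Parse("http://service-v2.internal:8080")
	canaryPercent.Store(5) // begin with 5% of requests on the canary

	proxy := &httputil.ReverseProxy{
		Director: func(r *http.Request) {
			target := v1
			if rand.Int63n(100) < canaryPercent.Load() {
				target = v2 // this request is part of the canary group
			}
			r.URL.Scheme = target.Scheme
			r.URL.Host = target.Host
		},
	}

	// Raising canaryPercent to 100 completes the rollout; setting it
	// back to 0 is the instant rollback path.
	http.ListenAndServe(":80", proxy)
}
```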
Our last deployment pattern is dark launches. Dark launches introduce a
fascinating twist. You deploy your new feature or code
changes completely behind the scenes, hidden from your users.
This allows you to conduct live testing, collect real world performance
data, and even gather feedback from targeted and selected
groups. Once you're confident in the new feature,
you simply turn on the switch, making it instantly available
to everyone. Dark launches are powerful when you need extensive prerelease validation or you want to gradually ramp up a feature's usage.
So the basic concept behind dark launches is this: you have a new feature, and you are able to select who can access that new feature. You can use, for example, feature flags, where you can turn a feature on and off and specify which users have access to it. You can also do things like requiring a specific header that only certain users send in order to access that feature.
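As a simple illustration, here is a minimal sketch in Go of such a feature gate, assuming a hypothetical allow-list and a hypothetical X-Enable-New-Checkout header. A real setup would usually pull these decisions from a feature flag service rather than hard-coding them.

```go
package main

import (
	"fmt"
	"net/http"
)

// allowedUsers is a hypothetical allow-list; in practice this usually
// lives in a feature flag service rather than in code.
var allowedUsers = map[string]bool{"alice": true, "qa-team": true}

// newFeatureEnabled decides whether this request may see the dark-launched
// feature: either the user is on the allow-list, or the request carries a
// special header that only selected users know to send.
func newFeatureEnabled(r *http.Request) bool {
	if allowedUsers[r.Header.Get("X-User-ID")] {
		return true
	}
	return r.Header.Get("X-Enable-New-Checkout") == "true"
}

func handler(w http.ResponseWriter, r *http.Request) {
	if newFeatureEnabled(r) {
		fmt.Fprintln(w, "new checkout flow") // dark-launched code path
		return
	}
	fmt.Fprintln(w, "current checkout flow") // everyone else
}

func main() {
	http.HandleFunc("/checkout", handler)
	http.ListenAndServe(":8080", nil)
}
```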
We've addressed how to safely deploy changes, but what about the unpredictable events that
can happen while your application is live?
Runtime resiliency patterns provide mechanisms to cope
with partial failures, network issues, and surges of incoming traffic. Let's dive into some essential patterns: timeouts and retries, rate limiting, bulkheads, and circuit breakers. Even in the best designed systems, components can become slow or unresponsive.
Maybe a database is struggling or an external service is
down. In these scenarios, timeouts can act
as a deadline. If a dependent service doesn't respond within a set amount of time, the caller stops waiting and signals failure.
But that's just half of the equation. Retries give your
application a second or third or fourth chance to succeed.
Retries automatically reattempt a failed request, often with increasing intervals, to avoid overwhelming the struggling service. This combination helps prevent
single failures from cascading through your system, keeping things running
as smoothly as possible. So in this example, we see that service A makes a request to service B. Because we have no timeout, service B can take as long as it needs to give a response back. In this example, we see that it takes 5 seconds, but maybe 5 seconds is unacceptable for us. So we can set a timeout, for example, of 3 seconds. This means that after 3 seconds we will mark that request as a failure. In the case of retries, your service makes a request to a downstream service. If that request comes back as an error, we can automatically retry it until it eventually succeeds, or until we reach a limit on the number of retries. It's important to note that the interval between these retry requests usually increases over time, so that we don't overwhelm downstream services.
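Here is a minimal sketch in Go of the two ideas combined, assuming a hypothetical downstream URL: a 3 second timeout per attempt, plus retries with an increasing backoff between attempts.

```go
package main

import (
	"context"
	"fmt"
	"net/http"
	"time"
)

// callWithTimeoutAndRetries calls a downstream service with a 3 second
// deadline per attempt and retries with an increasing delay between
// attempts, so a struggling dependency is not overwhelmed.
func callWithTimeoutAndRetries(url string, maxRetries int) error {
	backoff := 500 * time.Millisecond
	for attempt := 0; attempt <= maxRetries; attempt++ {
		// Timeout: stop waiting after 3 seconds and treat it as a failure.
		ctx, cancel := context.WithTimeout(context.Background(), 3*time.Second)
		req, _ := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
		resp, err := http.DefaultClient.Do(req)
		cancel()

		if err == nil && resp.StatusCode < 500 {
			resp.Body.Close()
			return nil // success, no retry needed
		}
		if resp != nil {
			resp.Body.Close()
		}

		// Retry: wait a bit longer each time (exponential backoff).
		time.Sleep(backoff)
		backoff *= 2
	}
	return fmt.Errorf("request to %s failed after %d retries", url, maxRetries)
}

func main() {
	if err := callWithTimeoutAndRetries("http://service-b.internal/api", 3); err != nil {
		fmt.Println(err)
	}
}
```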
Rate limiting acts as a traffic cop for your applications. It controls the incoming
flow of requests, preventing sudden spikes from
overwhelming a service. Think of it like a
line outside a popular club. Only a
certain number of people get in at a time.
Rate limiting is also crucial for fairness.
It ensures that a single user or a burst of requests
cannot monopolize resources, causing slowdowns for everyone else. It's also a
protective measure against potential malicious attacks
designed to flood your system. So in this example,
we have a client making a request to an API. If the
client makes too many requests, the rate limiting capability sends a "too many requests" response back to the client, preventing it from affecting other users or from flooding your system on purpose.
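As a rough sketch in Go, here is a single global limiter built on the golang.org/x/time/rate package that answers 429 Too Many Requests once the rate is exceeded; a per-user version would keep one limiter per client key. The numbers are illustrative and would be tuned per service.

```go
package main

import (
	"net/http"

	"golang.org/x/time/rate"
)

// limiter allows 10 requests per second with small bursts of up to 20.
var limiter = rate.NewLimiter(10, 20)

// rateLimit wraps a handler and answers "429 Too Many Requests" once the
// client exceeds the allowed rate, instead of letting the spike through.
func rateLimit(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if !limiter.Allow() {
			http.Error(w, "too many requests", http.StatusTooManyRequests)
			return
		}
		next.ServeHTTP(w, r)
	})
}

func main() {
	api := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.Write([]byte("ok"))
	})
	http.ListenAndServe(":8080", rateLimit(api))
}
```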
The bulkhead pattern draws inspiration from ship design.
Ships compartmentalize, so if one area floods, the whole
ship doesn't sink. We can apply this to
cloud native applications as well. By isolating different services
or functionalities, we might limit the number of
concurrent connections to a backend component or allocate fixed
memory resources. The key idea is this,
if one part of your system fails, the failure doesn't spread
uncontrollably, potentially taking down your entire application.
Bulkheads help maintain partial functionality,
improving the overall experience. In the example that we have here, we see clients accessing a service. If we only have one service instance, it means that if that service is overwhelmed, all clients are affected. It's common practice to split that service into several service instances, so that if one replica is overwhelmed, only the clients accessing that replica are affected.
This can be extrapolated to entire features as well: if one feature of your system is overwhelmed or has a problem, it doesn't mean that the other features stop as well.
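Here is a minimal sketch in Go of a bulkhead built as a semaphore on a buffered channel, assuming hypothetical reports and checkout features. Each feature gets its own pool of slots, so one of them filling up does not drag the other down.

```go
package main

import (
	"errors"
	"fmt"
)

// Bulkhead caps how many calls may be in flight against one dependency,
// so a slow or failing dependency cannot soak up every worker in the service.
type Bulkhead struct {
	slots chan struct{}
}

func NewBulkhead(maxConcurrent int) *Bulkhead {
	return &Bulkhead{slots: make(chan struct{}, maxConcurrent)}
}

// Execute runs fn only if a slot is free; otherwise it fails fast, keeping
// the failure contained instead of letting callers pile up.
func (b *Bulkhead) Execute(fn func() error) error {
	select {
	case b.slots <- struct{}{}: // acquire a slot
		defer func() { <-b.slots }() // release it when done
		return fn()
	default:
		return errors.New("bulkhead full: rejecting call")
	}
}

func main() {
	reports := NewBulkhead(5)   // at most 5 concurrent report generations
	checkout := NewBulkhead(50) // checkout keeps its own, larger pool
	_ = checkout                // each feature fails independently of the other

	err := reports.Execute(func() error {
		fmt.Println("generating report")
		return nil
	})
	if err != nil {
		fmt.Println(err)
	}
}
```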
And our last runtime pattern: circuit breakers. Think of circuit breakers like those in your home: they prevent electrical overload by cutting off the flow of power when there's a surge. In software, the principle is similar. When a service repeatedly fails, the circuit breaker pattern trips. This means
temporarily blocking all calls to that service. It prevents
fruitless retries from clogging up the network and lets
the failing service potentially recover. After a
set period, the circuit breaker tries to let some requests through.
If they succeed, the service is considered healthy again and normal operations resume. In this example, we see an HTTP request
arriving at a circuit breaker command. The circuit breaker command checks whether the circuit is open. If it is open, it automatically returns a not okay result: you are not allowed to make this request at this point in time. If the circuit breaker is closed, it means that we can execute the wrapped command and, if everything went okay, return a good result to the user.
This diagram also shows that we can use these patterns in combination. So I can use a circuit breaker, but I can also use timeouts and retries, which throw exceptions that tell us whether our requests are okay or not.
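To close, here is a minimal, hand-rolled sketch in Go of the circuit breaker state machine: it trips open after a number of consecutive failures, fails fast while open, and lets a trial request through after a cool-down. Production code would typically reach for a dedicated resilience library instead of writing this by hand.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
	"time"
)

// CircuitBreaker trips open after too many consecutive failures and, after a
// cool-down period, lets a trial request through to see if the service recovered.
type CircuitBreaker struct {
	mu          sync.Mutex
	failures    int
	maxFailures int
	cooldown    time.Duration
	openedAt    time.Time
	open        bool
}

func New(maxFailures int, cooldown time.Duration) *CircuitBreaker {
	return &CircuitBreaker{maxFailures: maxFailures, cooldown: cooldown}
}

var ErrOpen = errors.New("circuit open: request blocked")

// Execute wraps a call to a downstream service.
func (cb *CircuitBreaker) Execute(fn func() error) error {
	cb.mu.Lock()
	if cb.open {
		if time.Since(cb.openedAt) < cb.cooldown {
			cb.mu.Unlock()
			return ErrOpen // still open: fail fast, don't even try
		}
		cb.open = false // half-open: allow a trial request through
	}
	cb.mu.Unlock()

	err := fn()

	cb.mu.Lock()
	defer cb.mu.Unlock()
	if err != nil {
		cb.failures++
		if cb.failures >= cb.maxFailures {
			cb.open = true // trip: block calls and let the service recover
			cb.openedAt = time.Now()
		}
		return err
	}
	cb.failures = 0 // success: the service looks healthy again
	cb.open = false
	return nil
}

func main() {
	cb := New(3, 30*time.Second)
	err := cb.Execute(func() error {
		// the downstream call would go here, ideally with its own timeout
		return nil
	})
	fmt.Println(err)
}
```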
In today's digital landscape, resiliency isn't
a luxury. It's a fundamental requirement for any application
that demands continuous uptime and a positive user
experience. By thoughtfully applying the deployment and
runtime patterns we've discussed, you lay the groundwork for systems that
are not just fast and scalable, but truly robust.
The result is peace of mind, knowing that your applications can weather the inevitable storms of the cloud native world. And that was all from my side. I hope this talk was informative for you, and don't hesitate to contact me through social media.
Thank you very much and have a good conference.