Transcript
Hi everyone. Good day. Let's talk about resilience.
My name is Dima. For the past five years
I've been developing high load fintech applications that
process a huge amount of user money and have the
highest requirements for stability and availability.
During this time, my team and I have encountered a lot of problems
and corner cases, and we've come up with our own
recipes for resilience patterns: which ones work
best for us, and how we prefer to cook them so we can sleep peacefully
at night, knowing that our systems are equipped to survive
any disaster. What is resilience?
Resilience patterns are practices that help software
survive when things go wrong. They're like safety
nets that make sure your application keeps running even
when it runs into problems. So let's take a
look at them and see if they can help make your software better.
Starting with bulkhead. Imagine we have
a simple, ordinary application. We have several
backends behind us to get data from, and inside
our application we have several HTTP clients connected
to those backends. What can go wrong? It's a simple application,
right? But it turns out the clients
all share the same connection pool,
and they share other resources like CPU and
RAM. What will happen if one of the backends experiences
some sort of problem resulting in high request latency?
Due to the high response time, the entire connection pool will be
filled by requests awaiting responses from backend one,
and requests to the healthy backend two or backend three
won't be able to be sent because the pool is exhausted.
So a failure of one of our backends will result
in a failure across the entire functionality of
our application. But we don't want that.
We would like only the functionality associated with the
failing backend to experience degradation, while the rest of the application
continues operating normally.
To protect our application from this problem,
we can use the bulkhead pattern. This pattern originates
from shipbuilding, where a ship's hull is divided into several compartments
isolated from each other. If a leak happens
in one compartment, it fills with water,
but the other compartments remain undamaged.
How can we apply this idea to our example?
We can introduce individual limits on the number of
concurrent requests for each HTTP client.
Therefore, if one backend starts to slow down,
it will lead to degraded functionality related
only to that HTTP client, but the
rest of the application will continue to operate normally.
Bulkheads can be used to isolate any kind of resource.
For instance, you can limit the resources consumed by background activities
in your application, or you can even set restrictions
on the number of incoming requests coming to your application.
An upstream service or a frontend may also experience
some kind of failure, for example a reset of its caches,
and that may lead to a critical traffic increase coming to your API.
An input limit ensures that your application won't crash
due to out-of-memory issues and will continue to function,
even though it will respond with errors to requests exceeding
the limit. You can implement a simple bulkhead
just by using a simple semaphore. Here's an
example of Scala code that uses one semaphore to limit
concurrent requests and another semaphore to create
a queue of pending requests. Such a queue can
be useful for smoothing out short-term spikes
in traffic and avoiding error spikes.
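The slide code itself isn't part of this transcript, so here is a minimal sketch of that idea in Scala; the names and the Future-based call are illustrative assumptions, not the exact code from the talk.

    import java.util.concurrent.{RejectedExecutionException, Semaphore}
    import scala.concurrent.{ExecutionContext, Future}

    // One semaphore caps the number of in-flight requests; the second caps how
    // many more requests may wait in the queue before we start rejecting.
    final class Bulkhead(maxConcurrent: Int, maxQueued: Int) {
      private val inFlight = new Semaphore(maxConcurrent)
      private val admitted = new Semaphore(maxConcurrent + maxQueued)

      def run[A](call: => Future[A])(implicit ec: ExecutionContext): Future[A] =
        if (!admitted.tryAcquire())
          Future.failed(new RejectedExecutionException("bulkhead is full"))
        else
          Future(inFlight.acquire())            // wait in the bounded queue for a slot
            .flatMap(_ => call)                 // then actually perform the call
            .andThen { case _ =>                // always free both permits afterwards
              inFlight.release()
              admitted.release()
            }
    }

In this sketch each HTTP client would get its own Bulkhead instance, so a slow backend can only exhaust its own permits, not the shared pool, and excess traffic fails fast instead of piling up.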
So that was bulkhead. Bulkheads can help you
isolate failures and teach your application to gracefully
degrade in case of emergency.
The next pattern is the cache. Caching is a vast topic
in software engineering, so today we will focus
only on application-level caching. Let's return
to our example. We still have an application that communicates
with several other applications.
Let's assume we have sufficiently high SLAs
for our backends and they have a very low probability
of error. Now let's consider a
scenario where an operation requires querying all
of these backends to be completed.
Each of the backends can respond with
an error, and those errors happen independently.
So the resulting probability of error in our application will
be higher than the probability of error in any individual
backend.
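To make that concrete with made-up numbers: if each of three backends succeeds 99.9% of the time and the operation needs all three, the whole operation succeeds with probability 0.999 × 0.999 × 0.999 ≈ 0.997, so roughly 0.3% of requests fail, about three times the error rate of a single backend.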
We can increase the reliability of our application by adding a
cache. A cache is a high-speed data buffer
that stores the most frequently used data, allowing us
to avoid fetching it from a potentially slow source
every time. Caches stored in memory have zero
chance of error, unlike fetches over the network.
Caching also reduces network traffic, lowering the
chance of error even more. As a result, we can
achieve an even lower error rate in our application than
in our backends. Additionally, in-memory caches are
much faster than the network, which reduces the
latency of our application. That's a small bonus.
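As a rough illustration of such an application-level cache, here is a tiny TTL cache sketch in Scala; it is illustrative only, and in practice you would probably reach for a library like Caffeine.

    import java.util.concurrent.ConcurrentHashMap
    import scala.concurrent.{ExecutionContext, Future}
    import scala.concurrent.duration.FiniteDuration

    // Entries older than the TTL are reloaded from the backend; fresh ones are
    // served straight from memory without any network call.
    final class TtlCache[K, V](ttl: FiniteDuration, load: K => Future[V])(implicit ec: ExecutionContext) {
      private case class Entry(value: V, loadedAt: Long)
      private val entries = new ConcurrentHashMap[K, Entry]()

      def get(key: K): Future[V] = {
        val now = System.currentTimeMillis()
        Option(entries.get(key)) match {
          case Some(e) if now - e.loadedAt < ttl.toMillis =>
            Future.successful(e.value)                              // served from memory
          case _ =>
            load(key).map { v => entries.put(key, Entry(v, now)); v }  // fetch and remember
        }
      }
    }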
Such caches are excellent for non-personalized data,
such as a news feed or other data that is the
same for all our users. But what if
we want to use in-memory caches for personalized data:
user profiles, for example, or personalized recommendations,
or something like that? In that case, we need
some sort of sticky sessions to ensure that all requests coming
from a user always go to the same instance of our application,
so that instance can make use of the personalized data cached for that user.
The good news is that for this scenario we don't need any complex sticky-session mechanism;
we can tolerate minor corner cases and
minor traffic rebalancing. So it suffices
to use, for example, a stable load-balancing algorithm at
your balancer, without any complex system
for managing sticky sessions. For example, we can use consistent hashing.
In the event of a node failure, consistent hashing ensures
that only the users who were tied to the failed node
will be rebalanced, rather than rebalancing all users.
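As a rough illustration of that routing, here is a small consistent-hash ring sketch in Scala; the MurmurHash3 hashing and the virtual-node count are my own assumptions, not details from the talk.

    import scala.collection.immutable.TreeMap
    import scala.util.hashing.MurmurHash3

    // Each node is placed on the ring many times (virtual nodes) to spread load evenly.
    final case class HashRing(nodes: Set[String], virtualNodes: Int = 100) {
      private val ring: TreeMap[Int, String] =
        TreeMap.from(
          for {
            node <- nodes.toSeq
            i    <- 0 until virtualNodes
          } yield MurmurHash3.stringHash(s"$node#$i") -> node
        )

      // A user is served by the first node clockwise from the user's hash.
      def nodeFor(userId: String): String = {
        val h = MurmurHash3.stringHash(userId)
        ring.rangeFrom(h).headOption.getOrElse(ring.head)._2
      }
    }

For example, HashRing(Set("app-1", "app-2", "app-3")).nodeFor("user-42") always returns the same instance for that user. When a node disappears from the set, only the keys whose ring positions pointed at it move to the next node clockwise; everyone else keeps hitting the same instance and keeps their warm cache.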
That's it. Now we can use our caches for all types of
data, both personalized and non-personalized.
But let's take a look at another scenario. What if
the data we want to cache is used in every request we handle?
It could be information about access policies or
subscription plans, or any other crucial entity in our domain.
In this case, the source of this data can become a
major point of failure in our system, even if we fetch
it through a cache. If the volume
of data in the source is not very large, we can consider fully
replicating this data directly into the memory of our applications.
At the start of our application, we download
a snapshot of this data, and then we receive updates from
some sort of topic. This way we increase the reliability
of accessing this data, because every retrieval is
done from memory, with zero error probability,
and it is still very fast because it is memory.
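A minimal sketch of that setup, assuming a hypothetical PolicyStore interface for the snapshot and the update topic (not an API mentioned in the talk):

    import java.util.concurrent.ConcurrentHashMap

    // Hypothetical source of truth: a snapshot endpoint plus an update topic.
    trait PolicyStore {
      def loadSnapshot(): Map[String, String]
      def subscribe(onUpdate: (String, String) => Unit): Unit
    }

    final class ReplicatedCache(store: PolicyStore) {
      private val data = new ConcurrentHashMap[String, String]()

      // Called once at startup; this is the part that makes startup slower.
      def start(): Unit = {
        store.loadSnapshot().foreach { case (k, v) => data.put(k, v) }
        store.subscribe((k, v) => data.put(k, v))   // keep replicating changes
      }

      // Every read is served from local memory: no network call, no error.
      def get(key: String): Option[String] = Option(data.get(key))
    }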
However, since our application now needs to download some
data at startup, we violate one of the principles
of the twelve-factor app, which states that applications
should start up quickly. But we don't want to give up
the advantages we gain from using this type of cache,
so let's think whether there is anything we can do about this
issue. Why would we need fast startup in the first place?
Fast startup is needed so that platforms like Kubernetes
can quickly migrate your application
to another physical node, for example.
But platforms like Kubernetes can already handle slow-starting
applications, by using startup probes, for example.
Another issue we may encounter is updating the configuration of
running applications. Often, in order to fix some
problem, we need to change cache lifetimes
or request timeouts or some other configuration
properties. Let's say we know how
to quickly deliver updated configuration files to our application.
We still need to restart the application to apply the
new configuration, and a rolling update can now take
a very long time, and we don't want to make our
users wait for a fix to be applied. What can we do?
What if we could teach our application to reload configuration
on the fly, without restarting it?
It turns out it is not so hard to do. We can store the configuration
in some concurrent variable and have a background thread periodically
update this variable. Sounds simple, right?
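A minimal sketch of that idea in Scala, assuming an illustrative Config case class and loader function (not the speaker's actual configuration shape):

    import java.util.concurrent.atomic.AtomicReference
    import java.util.concurrent.{Executors, TimeUnit}

    // The fields of Config and the loader function are illustrative assumptions.
    final case class Config(requestTimeoutMs: Int, cacheTtlSeconds: Int)

    final class ReloadableConfig(loadConfig: () => Config, refreshPeriodSeconds: Long = 30) {
      private val current = new AtomicReference[Config](loadConfig())
      private val scheduler = Executors.newSingleThreadScheduledExecutor()

      // A background task periodically swaps in a freshly loaded config.
      scheduler.scheduleAtFixedRate(
        () => current.set(loadConfig()),
        refreshPeriodSeconds, refreshPeriodSeconds, TimeUnit.SECONDS
      )

      // Callers always read the latest value, no restart needed.
      def get: Config = current.get()
    }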
However, to ensure that certain parameters, such
as timeouts for HTTP clients, actually take effect,
we also need to reinitialize our HTTP clients
or database clients when the corresponding
config changes, and that may be a challenge.
But some clients, like the Cassandra driver for Java,
already support reloading their configuration on
the fly. So a reloadable config can mitigate
the negative impact of long application startup times.
And as a small bonus, it has other uses:
we can use it to implement feature flags,
for example.
The next pattern is fallback. Let's take a look at
another scenario. We receive a request, send a
request to a backend or to a database, and get
an error in return. Then we respond to the
user with an error message as well. However, it is often better to
show outdated data with a message like "we are experiencing
delays in updating this information" or something like that,
rather than displaying a big red error message box to the
user, right? So to improve our behavior
in this case we can use the fallback pattern. The idea
is to have a backup data source, probably with lower quality
of data, from which the data can be retrieved when
the primary source is unavailable.
Often, a fallback cache is used as the backup data source.
This cache stores the last successful response from the service,
and when the primary service is unavailable, we can serve this
last successful response to the user.
So now the user will be able
to see some data instead of an error, and the team
responsible for the broken backend will have more time
to fix the issue.
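Roughly, such a fallback cache might look like this in Scala; the names are illustrative, not the speaker's implementation.

    import java.util.concurrent.ConcurrentHashMap
    import scala.concurrent.{ExecutionContext, Future}

    // Keep the last successful response per key and serve it when the primary fails.
    final class WithFallback[K, V](primary: K => Future[V])(implicit ec: ExecutionContext) {
      private val lastGood = new ConcurrentHashMap[K, V]()

      def get(key: K): Future[V] =
        primary(key)
          .map { value => lastGood.put(key, value); value }   // remember fresh data
          .recoverWith { case error =>
            Option(lastGood.get(key)) match {
              case Some(stale) => Future.successful(stale)    // possibly outdated, but not an error
              case None        => Future.failed(error)        // nothing to fall back to
            }
          }
    }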
Let's talk about retries. Let's rewind a
little and go back to our example where we were trying to reduce the probability
of errors by using caches. What if,
instead of using caches, we simply send the
request again in case of an error? This is actually a pattern called retry,
and it can also help us reduce the likelihood of errors
in our application. Retries are often easier to
implement, because when you use caches you
often need to invalidate them when the data changes,
and cache invalidation can become a very complex task for
your system; it is considered one of the most challenging
problems in software engineering.
That's why sometimes it's simpler to just retry a failed
request. However, what happens if one of our
backends experiences a failure? We will start
retrying requests to that backend, which will multiply the traffic
to it several times, and it is
very likely that this backend wasn't designed to handle such
a sudden spike in traffic, so we will probably make the failure
even worse. Therefore, along with
a retry, the circuit breaker pattern must be used.
It's a mechanism that detects an increase in the error rate,
and if the error rate exceeds a certain threshold,
requests to the downstream service are blocked for
a period of time. After this time, we try to
let one or more requests through and check whether the service has
recovered. If it has, OK, we start allowing traffic again.
Otherwise, we block requests for another period of
time. Often, retries and
circuit breakers are implemented at the infrastructure level,
at load balancers for example.
However, the infrastructure usually doesn't have the full
error context. For instance, it's not always
possible to generically determine whether an error can be retried,
or whether it should count as an expected error
or a bad error toward the circuit
breaker threshold. Therefore, sometimes it
is necessary to move retries and circuit breakers inside
the application, to have the full context for making decisions
about error classification.
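As a rough sketch of how these could look inside the application, here is a simplified circuit breaker with a retry in Scala; the thresholds, names, and the lack of a proper half-open state are simplifications of my own, not the speaker's implementation.

    import java.util.concurrent.atomic.{AtomicInteger, AtomicLong}
    import scala.concurrent.{ExecutionContext, Future}
    import scala.util.{Failure, Success}

    // Counts consecutive failures; once the threshold is hit, the circuit "opens"
    // and requests are rejected until the cool-down period has passed.
    final class CircuitBreaker(maxFailures: Int, openMillis: Long)(implicit ec: ExecutionContext) {
      private val failures = new AtomicInteger(0)
      private val openedAt = new AtomicLong(0L)   // 0 means the circuit is closed

      def call[A](body: => Future[A]): Future[A] = {
        val opened = openedAt.get()
        if (opened != 0L && System.currentTimeMillis() - opened < openMillis)
          Future.failed(new RuntimeException("circuit is open, request blocked"))
        else
          // After the cool-down, requests flow again; a real breaker would use a
          // half-open state and let only a few trial requests through first.
          body.andThen {
            case Success(_) =>
              failures.set(0); openedAt.set(0L)            // recovered: close the circuit
            case Failure(_) =>
              if (failures.incrementAndGet() >= maxFailures)
                openedAt.set(System.currentTimeMillis())   // too many errors: open it
          }
      }

      // A naive retry that goes through the breaker on every attempt, so a broken
      // backend is not flooded with retried traffic.
      def retry[A](attempts: Int)(body: => Future[A]): Future[A] =
        call(body).recoverWith { case _ if attempts > 1 => retry(attempts - 1)(body) }
    }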
So, to summarize: by implementing
these patterns, we can fortify our application against
emergencies, maintain high availability,
and deliver a seamless experience to our users.
Additionally, I would like to remind you not to overlook telemetry.
Good logs and metrics can enhance the quality of
your services by a lot.
That's it. Thank you for watching. Thank you for
your time. Cheers.