Transcript
Hi everybody. Today I want to talk about techniques
that allow you to reduce the load on the server
and keep your service running smoothly,
preventing server overload in case of traffic bursts.
In particular, I will be talking about rate
limiting and load shedding. A little
disclaimer: by service, I mean any
backend application. It can be a monolithic application
or an individual microservice in a large distributed
system. But first, let me
tell you a little bit about me.
My name is Ivan Lemeshev. I live
in Finland. I moved here over three years
ago, and since then I have been working at
Unity as a senior software engineer.
I have been using Golang as my primary programming language
for many years and I am interested in the development
of large scale and distributed systems.
You can find me on LinkedIn or GitHub
by following the links. To begin
with, let's look at the agenda for this
talk. First, I will discuss server
overload, why it occurs,
and how it affects the performance of backend
services. I'll give you an example of common
causes of server overload,
and then I'll show you what it looks like
using the example of a simple HTTP service.
After that, I'll consider techniques that
help prevent server overload.
I'll start with rate limiting and review common
rate limiting algorithms. Then I'll
show you a simple implementation of one
of the algorithms, and you will see how it
works. Next, I will discuss another technique
called load shedding and show
you a simple example of it. Well, let's get
started. When we develop a backend application,
we deploy it to a server to make it available
to users. The server can be either
physical or virtual;
it doesn't matter.
Each server always has limited computational or
system resources, such as CPU or memory.
The service utilizes some of these resources to
process each incoming user request,
including managing multiple tasks,
switching between them, cleaning up unused memory,
and waiting for data to come in or go out.
The server can process only a particular
number of concurrent user requests simultaneously.
Under heavy load, when a system is given
more work than its resources can support,
it starts experiencing a lack of resources
and becomes slow,
which leads to an increase in the processing time,
or latency, for each request. The service reaches
some threshold where its
performance degrades rapidly.
It can keep working even when it
is overloaded, but it spends large amounts
of time on context switching and
becomes too slow to be useful because
most likely the client has some timeouts
and can't wait too long for the response from
the service. Therefore,
the performance and availability of
your service will decline because
almost all requests will fail due
to high latency and timeouts. In the
worst case, the server may
completely crash and stop handling requests
due to running out of memory or other
resources. Of course, we can use auto scaling and
deploy additional service instances,
but it doesn't happen instantly. If
the load grows gradually, auto scaling works
well. However, deploying an appropriate
number of additional instances requires time
if we have a traffic burst and the load is
exceptionally high.
Also, there may be a situation where an
individual user or a group of users
produces so many requests
that they consume all the system resources, and other
users will be unable to use the service.
In this case, auto scaling will not help.
Now let's explore situations
where a surge of traffic can occur.
Traffic bursts can occur due to various
reasons, both predictable and unpredictable.
As for predictable reasons, scheduled
or planned events like product launches,
sales, promotions, marketing campaigns,
or even regular peak hours can lead
to a surge in traffic as users try
to access the service or website simultaneously.
Another reason is seasonal traffic.
Businesses in specific industries might experience
seasonal spikes in traffic. For example,
e-commerce sites might see
a surge during holidays or
back-to-school seasons. The next one is
the time of day or week.
Traffic patterns can change regularly depending
on the target audience and service type.
For example, social media platforms see
higher traffic during evenings or weekends.
As for unpredictable reasons, there could be different
viral events. Social media trends,
news articles, or influencer mentions can
drive sudden traffic bursts to a website
or service. Also, there could be different
technical issues. For instance,
system outages on competitor platforms
can lead to users flocking
to a service, causing a temporary spike.
Another very popular reason is denial of service
attacks. It is when malicious actors
attempt to overwhelm a system with traffic,
causing a spike and potentially disrupting
functionality. The next one is bot
activity. Automated bots or
scripts can cause unexpected bursts,
especially if they are scraping data or attempting
to exploit vulnerabilities. There could also be
issues with external dependencies. For instance,
if your service relies on external
APIs or services,
outages or slowdowns on their end can
lead to a cascading effect and cause traffic
spikes for your users trying to access features
that depend on those external services.
By understanding these potential causes, you can
better prepare for traffic bursts. Well,
now let's look at the example of server
overload. To demonstrate this,
I implemented a simple HTTP service in
Go. The main function sets up an
HTTP server that listens on port 8000.
It registers a single route
with a handler function. The handler function attempts
to extract a path parameter, length, from
the request, converts it to an integer value,
and uses it to generate a password.
If the length parameter cannot be converted to an integer,
for example, if it's not a number, the function
responds with a 400 status.
If the length parameter is successfully
converted to an integer, the function generates a password
of that length, and the generated password is sent back to
the client with a 200 status.
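The slide code isn't included in the transcript, but a minimal version of such a service might look like the sketch below. It assumes Go 1.22's path parameters, and the route pattern and the generatePassword helper are my guesses rather than the actual code from the talk.

```go
package main

import (
	"crypto/rand"
	"math/big"
	"net/http"
	"strconv"
)

const charset = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789"

// generatePassword builds a random password of the given length.
func generatePassword(length int) string {
	password := make([]byte, length)
	for i := range password {
		n, _ := rand.Int(rand.Reader, big.NewInt(int64(len(charset)))) // error ignored for brevity
		password[i] = charset[n.Int64()]
	}
	return string(password)
}

func handler(w http.ResponseWriter, r *http.Request) {
	// Extract the "length" path parameter and convert it to an integer.
	// The talk only mentions the conversion check; the positive-length guard is an extra safety measure.
	length, err := strconv.Atoi(r.PathValue("length"))
	if err != nil || length <= 0 {
		http.Error(w, "length must be a positive number", http.StatusBadRequest)
		return
	}
	// Respond with the generated password and a 200 status.
	w.WriteHeader(http.StatusOK)
	w.Write([]byte(generatePassword(length)))
}

func main() {
	mux := http.NewServeMux()
	mux.HandleFunc("GET /password/{length}", handler)
	http.ListenAndServe(":8000", mux)
}
```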
I use a Docker container to run this service
because it allows us to simulate limited
resources. You will see it in the next slide.
I build and run the service using this
simple Dockerfile. It builds the application using
the Golang image in the build stage, and then it
uses the distroless base image from
Google's container registry to run the service.
This image contains only the application and its
runtime dependencies, and it is
designed to be as small as possible.
Then I build a Docker image from
the Dockerfile and run the container. I set the
cpus option to one to limit
the number of CPU cores. Also,
I set two options, memory and memory-swap,
to 300 megabytes to limit the
container's memory. I have to set
both options to the same value to prevent
the container from using swap. When we
have a running service, we need to test it.
For that, I use a load testing tool called Grafana
k6. Here is a script that is
used by this tool. It is just a simple JavaScript file
with some configuration for the test.
The test consists of four stages. In
stage one, we gradually increase the number of virtual
users from zero to 1000
over two minutes, and then in stage
two we keep the number of users at 1000
for another two minutes. In stage three,
we ramp up to 2000
virtual users over two minutes. And finally, in
stage four we keep the number of users at 2000
for another two minutes.
In the default function, we define the user logic.
Each virtual user makes a GET request to
the service to generate a password. Then the user sleeps
for 100 milliseconds, which
simulates approximately ten requests per second
from a single user. After that,
I just run the test using this command and wait for
it to end. I also use a web dashboard
as an output to visualize the test results.
It produces graphs showing the number
of requests, latency, and other
metrics. In the graph, we can
see how latency changes depending on
the number of requests per second.
Initially, we gradually increased the number of users and
the service could handle that load. The latency
was very low. Then, after four minutes,
we increased the number of users up to 2000,
and the service started to experience
a lack of resources. That led to a significant increase
in latency and the server became overloaded.
In the following graph we can see the latency distribution
by percentile. From these results we can understand
that when system resources are used at their maximum,
the latency increases many times over.
But this is just an artificial example.
In a real application, the latency
might be even higher, and there will
likely be some timeout on the client side that will
drop requests and make the service unavailable
to users. We have discussed server overload
and its causes. Also, we saw what it looks like
in an example. Now let's look at the first
technique that can be used in this situation.
First I want to discuss rate limiting.
Sometimes it is also called throttling,
so these terms may be used interchangeably.
Rate limiting is a technique used to control
the flow of requests to a network resource,
server, or API.
It essentially sets a limit on how often a user
or application can perform a specific
action within a given time frame.
It protects against malicious activities like denial
of service attacks where attackers try to overwhelm
a system with excessive requests,
making it unavailable to legitimate users.
Also, it helps ensure fair access
to resources by preventing single users
or applications from monopolizing them.
It is especially important
for services with limited resources.
By controlling the request rate, rate limiting prevents
overloading the server and helps
maintain optimal performance for all users.
We can limit the request rate per IP
address, user identifier,
API key, or other criteria depending
on a particular use case. For example,
we can allow ten requests per minute from
a single IP address, and if the number
of requests per minute is less than or equal to
that value, we process the request.
Otherwise, if the limit is exceeded,
we will drop the request. There are many
algorithms and variations of them
for implementing rate limiting. I'll briefly review
just the most common of them.
One is the fixed window counter algorithm.
This algorithm divides time into equal sized
time intervals or fixed windows.
The window size can be defined in milliseconds,
seconds, minutes, or any other relevant unit
depending on the use case. Each window
has a request counter that allows the algorithm
to keep track of how many requests occur
within each window. As a request arrives,
the algorithm checks the current timestamp.
It decides which window the timestamp falls
into based on the window
size and the starting point of the current window.
If the request falls within the current window,
the counter for that window is incremented by one.
Otherwise, a new window starts, the counter is
reset to zero and incremented by
one, and the start time of the window is updated.
Then, if the counter value is less than or equal to
the limit, the rate limiter allows the request to
be processed. Otherwise, if the counter value exceeds the limit,
the request is dropped.
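As a rough illustration, here is a minimal, single-key, non-thread-safe sketch of this algorithm in Go; the names and structure are mine, not the code from the slides.

```go
package ratelimit

import "time"

// FixedWindowCounter allows at most limit requests per fixed window.
type FixedWindowCounter struct {
	limit       int
	windowSize  time.Duration
	windowStart time.Time
	counter     int
}

func NewFixedWindowCounter(limit int, windowSize time.Duration) *FixedWindowCounter {
	return &FixedWindowCounter{limit: limit, windowSize: windowSize, windowStart: time.Now()}
}

// Allow reports whether a request arriving now fits into the current window.
func (f *FixedWindowCounter) Allow() bool {
	now := time.Now()
	// The current window has expired: start a new one and reset the counter.
	if now.Sub(f.windowStart) >= f.windowSize {
		f.windowStart = now
		f.counter = 0
	}
	f.counter++
	return f.counter <= f.limit
}
```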
The example on the slide shows
a 1 minute window size
and a rate limit of five requests
per minute. In each window, the first five
requests were processed; all other requests
were dropped. As you can see,
this algorithm is quite simple.
However, it has a big drawback which
you have probably already noticed here.
Since the counter is reset when the new window
is started, the rate limiter doesn't know
how many requests were
made in the previous window. It leads to
some issues. For example, in this picture,
the total number of requests per minute was more than
five because of a burst of requests at
the end of the second window and the beginning
of the third window. The fixed window counter
algorithm is easy to understand and implement because it
only requires keeping track of a counter for
each window. Also, it is efficient because
it requires only basic counter operations and
it has low memory usage because it only needs to
store the counter value for a window, which is typically
a small amount of data. However,
request bursts are possible at the edges when
switching between windows,
and this algorithm is not suitable for
identifying frequent requests since the counter is reset
at each window. The following algorithm
is sliding window log.
It also uses a window, but this
window slides along time, representing the
relevant time frame for rate limiting.
The window size can also be defined
in milliseconds, seconds, minutes, or any
other relevant unit, depending on the use case.
Instead of a request counter, this algorithm
keeps track of a timestamp for each incoming request and
stores it in a log. The log can
be stored in a data structure like
a hash table or a sorted
set for efficient retrieval.
As the window slides forward,
timestamps outside the current window become
irrelevant for rate limiting purposes
and are removed from the log.
When a new request arrives, its timestamp
is added to the log. The algorithm
calculates the total number of requests within the current window
by iterating through the remaining timestamps
in the log to decide whether the request should be allowed.
The request is processed if the total count
within the window is less than
or equal to the allowed limit.
Otherwise, it's considered a rate limit
violation and might be rejected.
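A minimal single-key sketch of this idea in Go might look like this (non-thread-safe; the slice-based log is an illustrative choice, not the code from the slides):

```go
package ratelimit

import "time"

// SlidingWindowLog allows at most limit requests within any sliding window of windowSize.
type SlidingWindowLog struct {
	limit      int
	windowSize time.Duration
	log        []time.Time // timestamps of recent requests
}

func NewSlidingWindowLog(limit int, windowSize time.Duration) *SlidingWindowLog {
	return &SlidingWindowLog{limit: limit, windowSize: windowSize}
}

// Allow records the request timestamp and reports whether it fits the limit.
func (s *SlidingWindowLog) Allow() bool {
	now := time.Now()
	windowStart := now.Add(-s.windowSize)

	// Remove timestamps that fall outside the current sliding window.
	kept := s.log[:0]
	for _, t := range s.log {
		if t.After(windowStart) {
			kept = append(kept, t)
		}
	}
	// Add the new request's timestamp and compare against the limit.
	s.log = append(kept, now)
	return len(s.log) <= s.limit
}
```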
Let's look at the example.
In this example, we allow two requests
per minute. When the first request
is saved to the log,
there is only one request in the log at
that point in time, so the request is allowed
and processed. Then the second
request comes and its timestamp is
saved to the log as well. Now there are two
requests in the log. All of them fall within
the current window, so the number of requests
is equal to the limit and the request can be processed.
Then the third request comes and its timestamp
is saved to the log. Now there are three requests
that fall into the current window,
which is more than the limit, so the request is rejected.
And finally, the fourth request comes.
Its timestamp is also saved to the log.
There are only two requests that fall within the
current window at that point in time,
and there are two old
timestamps that do not fall into the current window.
The old timestamps are removed from
the log, and the new request is allowed and
processed because the number of requests in
the current window equals the limit.
The algorithm captures recent surges in requests
more accurately than the fixed window counter because
it considers only relevant requests within the
sliding window and can effectively handle bursts
of requests. However, storing timestamps
for all requests requires
a lot of memory and can be memory
intensive, especially for high request
volumes. Also, it's more complex because it
requires maintaining and iterating through the request
log. The following algorithm is
called sliding window counter.
It offers a balance between simplicity
and accuracy compared to fixed window
counter and sliding window log algorithms.
The algorithm maintains a counter variable
and a fixed-size window duration, similar
to the fixed window counter, but in addition,
it uses a sliding window to represent
the relevant time frame for
rate limiting. Unlike the sliding window
log algorithm, it doesn't store timestamps for
all requests. Instead, it keeps track of
the counters for fixed windows and calculates a
weighted counter for the sliding window.
When a new request arrives, the algorithm calculates
a weighted counter using the sliding window time frame
and the request counters for the previous and current
windows. For that, it uses the percentage
of time the sliding window overlaps with
the previous fixed window, the number of requests
from the previous fixed window, and the number of requests
from the current fixed window. You can see
the formula on the slide. The resulting
value of these calculations is a
weighted counter. Then, the algorithm
compares the weighted counter with the limit value.
The request is allowed and processed if
the counter is less than or equal to the limit.
Otherwise, the request is dropped.
The old counters from previous fixed windows
are removed from tracking when they are
no longer relevant for the sliding window. The algorithm
captures recent request rate trends
more accurately than the fixed window
counter as it considers requests that
fall within the sliding window. Also, it avoids
storing timestamps for all requests,
making it more memory efficient and
easier to implement than the sliding window log
algorithm. However, it is
less precise than the sliding window log algorithm.
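The formula itself isn't reproduced in the transcript, but a single-key sketch of the weighted counter in Go could look like this (the exact comparison and the way windows are rolled forward are my assumptions):

```go
package ratelimit

import "time"

// SlidingWindowCounter approximates the request rate from the previous and
// current fixed-window counters, weighting the previous one by how much the
// sliding window still overlaps it.
type SlidingWindowCounter struct {
	limit         int
	windowSize    time.Duration
	currentStart  time.Time
	previousCount int
	currentCount  int
}

func NewSlidingWindowCounter(limit int, windowSize time.Duration) *SlidingWindowCounter {
	return &SlidingWindowCounter{limit: limit, windowSize: windowSize, currentStart: time.Now()}
}

func (s *SlidingWindowCounter) Allow() bool {
	now := time.Now()

	// Roll the fixed windows forward if one or more windows have elapsed.
	elapsedWindows := int(now.Sub(s.currentStart) / s.windowSize)
	if elapsedWindows > 0 {
		if elapsedWindows == 1 {
			s.previousCount = s.currentCount
		} else {
			s.previousCount = 0 // more than one empty window has passed
		}
		s.currentCount = 0
		s.currentStart = s.currentStart.Add(time.Duration(elapsedWindows) * s.windowSize)
	}

	// Fraction of the sliding window that still overlaps the previous fixed window.
	overlap := float64(s.windowSize-now.Sub(s.currentStart)) / float64(s.windowSize)

	// Weighted counter: previous window's count scaled by the overlap, plus the current count.
	weighted := float64(s.previousCount)*overlap + float64(s.currentCount)
	if weighted >= float64(s.limit) {
		return false
	}
	s.currentCount++
	return true
}
```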
The next algorithm is the token bucket. As
you can see from the name, it uses the concept
of a bucket that contains tokens.
The bucket has a fixed size
representing the maximum number of tokens it can hold.
Each token represents the capacity to process
a single request. In other words,
the bucket size defines the maximum number of requests
the system can handle at a time.
So the bucket size represents the rate limit.
The bucket is constantly refilled with new tokens
at a constant rate called the refill rate, which defines
how quickly the bucket refills
with tokens over time, for example, ten tokens per
second. Therefore, the refill rate controls
the average rate at which requests can be processed.
When a request arrives, the algorithm
checks if there are enough tokens in the bucket.
If there are tokens in the bucket, it consumes a token, removing
it from the bucket, and allows the request to be
processed. Otherwise, if the bucket
is empty, the request is dropped.
On the one hand, this algorithm is relatively easy to implement.
However, it is also slightly more complex
due to the token management logic.
It is efficient and uses low memory because
it keeps track of the number of tokens in the bucket, which is typically just
a number that is incremented or decremented over
time. It allows
for a smooth distribution of requests,
but bursts of requests up to the bucket's capacity are
possible in cases when the bucket
is full and a large number of requests arrive
simultaneously. And finally, the last
algorithm for today is the leaky
bucket algorithm. It also uses
the concept of a bucket, but in a
different way. It is like the analogy of
a bucket with a hole at
the bottom that constantly leaks at a fixed
rate. The bucket can hold a
specific number of requests representing its
maximum capacity. The requests leak
out from the bucket at a fixed rate.
This leak rate specifies the sustained
rate at which the system can process requests.
When a request arrives and the bucket
is not full, a new request is added to
the bucket and processed. If the bucket
is full, the request is dropped.
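As a rough sketch of this idea in Go, the bucket can be modeled as a bounded queue drained at a fixed rate (the types and helper names here are illustrative, not from the talk):

```go
package ratelimit

import "time"

// LeakyBucket queues incoming requests up to capacity and processes them at a
// fixed leak rate.
type LeakyBucket struct {
	queue chan func() // the bucket: queued request handlers
}

func NewLeakyBucket(capacity int, leakInterval time.Duration) *LeakyBucket {
	lb := &LeakyBucket{queue: make(chan func(), capacity)}
	// Leak (process) at most one queued request per interval, at a constant rate.
	go func() {
		ticker := time.NewTicker(leakInterval)
		defer ticker.Stop()
		for range ticker.C {
			select {
			case task := <-lb.queue:
				task()
			default:
				// The bucket is empty; nothing to process this tick.
			}
		}
	}()
	return lb
}

// Add tries to put a request into the bucket. It reports false (request dropped)
// when the bucket is full.
func (lb *LeakyBucket) Add(task func()) bool {
	select {
	case lb.queue <- task:
		return true
	default:
		return false
	}
}
```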
The leaky bucket algorithm is good at
smoothing out requests because the requests
leak out and are processed at a controlled
rate, preventing surges or bursts
of traffic. However, it doesn't
account for the time between requests,
which means that if a user sends a
burst of requests quickly,
they may all be dropped, even if
they are under the overall rate
limit. We have reviewed rate
limiting algorithms. Each algorithm
has pros and cons, so you should consider
trade-offs based on the particular use
case when choosing an algorithm for rate limiting.
Also, each algorithm may
be implemented differently depending
on the requirements. Now let's look
at the simple implementation of a rate limiter based
on the token bucket algorithm.
For the example, I implemented the
rate limiter logic in a separate package.
Here I use a bucket struct as
a token bucket. It has two fields: current tokens,
which is the number of tokens currently in the bucket and last refill
time, which is the last time the bucket was refilled.
Then I define the token bucket struct, which
represents the rate limiter. It has a mutex
for thread safety and a buckets map
that stores a bucket for each
key. As a key, I will use the
client's IP address.
Also, it has two fields: bucket size,
which represents the maximum number of tokens a single bucket can hold,
and refill rate, the rate at which tokens are added
to the bucket. Then I define the new
token bucket function that just
creates a new rate limiter with the specified bucket
size, refill rate, and an empty buckets
map. This rate limiter has one method
called IsAllowed. This method
checks if a request with the specified key is allowed.
If there is no bucket with a
given key, it creates a new bucket for the
key and allows the request.
If the key exists, it refills the bucket
and checks if there is at least one token
in the bucket. If there is a token,
it decrements the number of tokens and
allows the request; otherwise, it denies it.
The refill method refills the bucket for
the specified key. It calculates the number of
tokens to add based on the time elapsed since the last
refill and the refill rate. After
that, it adds them to the bucket, up to the bucket
size, and updates the last refill time.
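The slides with the actual code aren't part of the transcript, so the sketch below is my reconstruction of the described rate limiter; the field and method names follow the description, but the details are assumptions.

```go
package ratelimiter

import (
	"math"
	"sync"
	"time"
)

// bucket is a token bucket for a single key.
type bucket struct {
	currentTokens  float64
	lastRefillTime time.Time
}

// TokenBucket is a rate limiter that keeps one bucket per key, for example per client IP.
type TokenBucket struct {
	mu         sync.Mutex
	buckets    map[string]*bucket
	bucketSize float64 // maximum number of tokens a single bucket can hold
	refillRate float64 // tokens added per second
}

func NewTokenBucket(bucketSize, refillRate float64) *TokenBucket {
	return &TokenBucket{
		buckets:    make(map[string]*bucket),
		bucketSize: bucketSize,
		refillRate: refillRate,
	}
}

// IsAllowed reports whether a request with the given key may be processed.
func (tb *TokenBucket) IsAllowed(key string) bool {
	tb.mu.Lock()
	defer tb.mu.Unlock()

	b, ok := tb.buckets[key]
	if !ok {
		// First request for this key: create a full bucket and consume one token.
		tb.buckets[key] = &bucket{currentTokens: tb.bucketSize - 1, lastRefillTime: time.Now()}
		return true
	}

	tb.refill(b)
	if b.currentTokens >= 1 {
		b.currentTokens--
		return true
	}
	return false
}

// refill adds tokens based on the time elapsed since the last refill, up to the bucket size.
func (tb *TokenBucket) refill(b *bucket) {
	now := time.Now()
	elapsed := now.Sub(b.lastRefillTime).Seconds()
	b.currentTokens = math.Min(b.currentTokens+elapsed*tb.refillRate, tb.bucketSize)
	b.lastRefillTime = now
}
```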
For this example, I also define a middleware to use for rate limiting.
I have a simple rate limiter interface that contains only
one method, IsAllowed. This interface
allows us to use
different rate limiter implementations depending
on our needs and replace them
more easily. Then I define the rate limiting middleware
function. It checks if a request is allowed
by calling the IsAllowed method of the rate
limiter with the remote address of the request.
If the request is not allowed, it responds with a 429
status and the corresponding status text.
If the request is allowed, it calls the original handler
function.
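Based on that description, the middleware might look roughly like this (again a reconstruction, not the slide code):

```go
package main

import "net/http"

// RateLimiter is satisfied by any rate limiter implementation, such as the
// token bucket sketched above.
type RateLimiter interface {
	IsAllowed(key string) bool
}

// RateLimitingMiddleware rejects requests that exceed the limit with a 429 status.
func RateLimitingMiddleware(limiter RateLimiter, next http.HandlerFunc) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		if !limiter.IsAllowed(r.RemoteAddr) {
			http.Error(w, http.StatusText(http.StatusTooManyRequests), http.StatusTooManyRequests)
			return
		}
		next(w, r)
	}
}
```

The handler from the earlier service could then be registered as RateLimitingMiddleware(limiter, handler).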
I use the same service
as the previous example, but with a few changes.
First, I initialize the rate limiter with a
bucket size of one token and a refill rate of one token
per second. That means that
we allow approximately one request
per second from a single user. Then I wrap
the handler function with the rate limiting middleware
and pass the rate limiter there.
I use the same test configuration as the previous
example, and the test
results show that the rate limiter allowed
us to decrease the average load on the
server, and the latency is not as high
as before. However, it can't improve the performance
significantly in this case, because the rate limiter is
integrated into the service and it also uses the same system
resources. You should always keep that in mind.
Using a rate limiter has its overhead
and costs as well. You have to
consider different trade-offs and find
a balance between using a rate limiter and the
performance of the service. Moreover,
you can't just use the rate limiter from the example
in real production applications, because you need some shared
state with counters that can be used by all instances
of the application. For example, you can use
Redis to store the counters and get them from Redis during rate
limiter checks. Also, you will likely need a
distributed rate limiter that can be implemented in a separate
service, for example in an API
gateway or other solutions.
In addition, on this slide I provided the
list of different implementations of rate limiters in
Golang. You can check these packages, see how
different algorithms are implemented,
and use one of them in your application if
it suits your particular use case.
We have considered server overload,
the rate limiting technique, and common rate limiting algorithms.
Also, we saw it in examples.
Now let's look at another technique
called load shedding.
Load shedding is another technique that is used
to prevent server overload. It is like a controlled
shutdown to prevent a total crash during
a traffic overload. It can be implemented
in different ways. For example, the system can
constantly check its resources,
such as CPU and memory. If things
get overloaded,
load shedding kicks in. It might reject
random or non-critical incoming requests,
prioritizing critical ones. Also, the system
might slow down processing for non-critical tasks.
The system also might redirect some requests
to other services if possible.
By sacrificing some requests,
the system stays operational for
the most important ones. Critical requests
still get processed within a reasonable time frame.
A controlled slowdown is better
than a complete system breakdown in this situation.
However, as always, there are some trade-offs.
Users might experience delays
or errors for some requests during
shedding. Depending on the system, some data
processing might be delayed or even lost.
So now let's look at an example.
To demonstrate this technique, I implemented a simple
algorithm to decide if the system is overloaded
and reject random requests.
I will use a ticker from the time package,
and I assume that if the system is overloaded
we will have some delays and the ticker will not fire on time.
I strongly do not recommend this algorithm
for production applications; it is just for
demonstration purposes. I implemented the
logic in a separate package called overload detector.
Here I define the overload detector struct that will
be used to detect if the system is overloaded
based on a specified overload factor. The
structure has three fields.
Check interval is the duration between each check
for overload. Overload
factor is a duration that is considered
a threshold for overload. And is
overloaded is a flag indicating whether
the system is overloaded. I use the Bool type
from the sync/atomic package for thread
safety. Then I define the new
function, which just creates a new overload
detector with the given arguments and starts
the overload check in the background
using the run method. Also, I have the
IsOverloaded method that just returns the
value of the flag,
and in the run method we
check whether the system is overloaded.
First, we initialize a ticker that ticks
every check interval. On
each tick, it checks whether the time since the last
check is greater than the overload factor.
If it is, it considers the system overloaded and sets the
flag to true. If it is not, it sets
the flag to false.
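The code isn't in the transcript, but a reconstruction of the described detector might look like this (the names follow the description; the details are my assumptions, and again this is only for demonstration):

```go
package overloaddetector

import (
	"sync/atomic"
	"time"
)

// OverloadDetector treats delayed ticker ticks as a sign that the process is
// starved of CPU and therefore overloaded.
type OverloadDetector struct {
	checkInterval  time.Duration
	overloadFactor time.Duration
	isOverloaded   atomic.Bool
}

// New creates a detector and starts the background check.
func New(checkInterval, overloadFactor time.Duration) *OverloadDetector {
	d := &OverloadDetector{
		checkInterval:  checkInterval,
		overloadFactor: overloadFactor,
	}
	go d.run()
	return d
}

// IsOverloaded returns the value of the flag.
func (d *OverloadDetector) IsOverloaded() bool {
	return d.isOverloaded.Load()
}

// run ticks every checkInterval and flags overload when a tick arrives later
// than the overload factor allows.
func (d *OverloadDetector) run() {
	ticker := time.NewTicker(d.checkInterval)
	defer ticker.Stop()

	lastCheck := time.Now()
	for range ticker.C {
		now := time.Now()
		d.isOverloaded.Store(now.Sub(lastCheck) > d.overloadFactor)
		lastCheck = now
	}
}
```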
I also define a middleware for
load shedding, as I did for rate
limiting. I have a simple overload detector
interface that contains only one method, IsOverloaded.
Then I define the overload detecting
middleware function. It checks if a request
is allowed by calling the IsOverloaded method.
If the server is overloaded, it responds
with a 500 status, and
if the system is not overloaded,
it calls the original handler function.
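A minimal version of that middleware, assuming the interface shape described above, might be:

```go
package main

import "net/http"

// OverloadDetector is satisfied by the detector sketched above.
type OverloadDetector interface {
	IsOverloaded() bool
}

// OverloadDetectingMiddleware sheds requests with a 500 status while the
// system is considered overloaded.
func OverloadDetectingMiddleware(detector OverloadDetector, next http.HandlerFunc) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		if detector.IsOverloaded() {
			http.Error(w, http.StatusText(http.StatusInternalServerError), http.StatusInternalServerError)
			return
		}
		next(w, r)
	}
}
```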
Again, I use the same service as the previous example, but
with a few changes. I initialize the overload
detector with a check interval of ten milliseconds
and an overload factor of eleven milliseconds.
Then I wrap the handler function with the overload
detecting middleware and pass the overload
detector there. I use the same
test configuration as the previous examples.
The test results show that using load shedding
allowed us to decrease the average load on the server.
Depending on the particular case and requirements,
load shedding can be implemented differently in real production systems.
To understand load shedding better and the
approaches for its implementation, I recommend
reading this article by a senior principal engineer at
Amazon that explains load shedding in more
detail. Well, we have explored
server overload and how it can
affect your services. We looked at
common reasons for overload and how
to identify them.
We then reviewed two key techniques to
make your systems more resilient: rate limiting,
which helps control incoming traffic and prevent overload
before it happens, and load shedding, which
acts as a safety valve,
gracefully degrading service during
extreme traffic surges to maintain
overall system stability.
Understanding and implementing these techniques
ensures your services stay healthy
and responsive even under heavy
load. Remember, a well-designed system
anticipates traffic spikes and
has mechanisms to handle
them effectively. So I hope
you found this talk helpful. Thanks for
your time.