Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hello everyone. My name is Kirill, and today we will talk about simple but common mistakes in system design. Let's get started. The first issue we will cover is connected with idempotency. It can happen when two services communicate to create data, and it can lead to duplicated data in the second service.
Let's consider this example. We have an advertisement service in our scheme, and a user who sends us data about an impression of an ad. The payload has an ad id. The ad id is the id of the advertisement, so we know which ad the impression is for. The service also has a downstream dependency, an external ad service, and the external ad service has its own database. The final goal of the whole scheme is to increase the impressions counter in that database. Now let's imagine this.
What if sometimes there is high latency between the external ad service and the database? Let's also imagine that right after we have successfully increased the counter, the user breaks the connection because of the high total latency. In that case we have successfully increased the counter, but from the user's perspective the request failed. So the user decides to retry and sends the same request again. This time there is no high latency between the external ad service and the database, so we successfully increase the counter again, and from the user's perspective everything is fine too. But now we have increased the counter twice. There was only one impression, yet the counter is two. The counter should be one, but it is two.
That's the idempotency issue. Let's see how we can fix it.
We could add an event id, one more id, to our payload. Unlike the ad id, which identifies the ad, the event id identifies the particular event: a unique id for every impression. We should also add that id to the payload between our service and the external ad service. Having that id lets us deduplicate data at the database level. So even if we receive two different payloads for the same impression, we can deduplicate them by event id in the database, which is fine. And now our data is consistent.
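In Go, that deduplication could look roughly like this. It is only a sketch: it assumes a Postgres-style database with a unique constraint on the event id column, and the table, column and function names are made up for illustration.

    package impressions

    import (
        "context"
        "database/sql"
    )

    // recordImpression stores one row per event id. A unique constraint on
    // event_id turns a retried request for the same impression into a no-op
    // instead of a second row (and a second counter increment).
    func recordImpression(ctx context.Context, db *sql.DB, eventID, adID string) error {
        _, err := db.ExecContext(ctx,
            `INSERT INTO impressions (event_id, ad_id)
             VALUES ($1, $2)
             ON CONFLICT (event_id) DO NOTHING`,
            eventID, adID)
        return err
    }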
Okay, let's move on. The next issue is an external request inside a transaction. It can happen when we open a transaction to the database and, inside that transaction, request an external service. It can lead to an exhausted database connection pool.
Let's consider why. We have an advertisement service as the example again, and a user who creates an ad campaign. Under the hood the service starts a transaction, inserts data into the database and sends a POST request to the external ad service. At first glance it can work, and the transaction provides us data consistency. But let's imagine this.
What if there is high latency between our service and the external ad service, and the service has a lot of users? Eventually this will exhaust the database connection pool, because we won't release the database connection until the external request is finished. So with high latency on one side and high load on the other, we will end up with an exhausted database connection pool. So how can we fix it?
We can use a queue. Basically it doesn't matter what type of queue, but the eventual scheme will depend on the type of queue. In this scheme I use a database queue, which is in fact just a table inside the same database. Now it's pretty safe for us to use a transaction and do these two operations inside it, because these two operations, insert the ad and create the job, are just two SQL queries to the same database. We fixed the database connection pool issue, but we also need to create the data in the external ad service. To do that, we could add a worker which gets a job from the queue and sends the data to the external ad service. So in this scheme we got rid of the connection pool issue, plus we have eventual consistency, so our data is consistent.
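A rough sketch of that idea in Go, assuming the same kind of SQL database; the table names, job format and function names are illustrative only.

    package campaigns

    import (
        "context"
        "database/sql"
    )

    // createCampaign writes the campaign and a job row for the external ad
    // service in one local transaction. No network call happens while the
    // transaction (and its database connection) is held; a separate worker
    // later picks the job up and calls the external ad service.
    func createCampaign(ctx context.Context, db *sql.DB, name string) error {
        tx, err := db.BeginTx(ctx, nil)
        if err != nil {
            return err
        }
        defer tx.Rollback() // no-op once the transaction has been committed

        if _, err := tx.ExecContext(ctx,
            `INSERT INTO campaigns (name) VALUES ($1)`, name); err != nil {
            return err
        }
        if _, err := tx.ExecContext(ctx,
            `INSERT INTO jobs (kind, payload) VALUES ('create_campaign', $1)`, name); err != nil {
            return err
        }
        return tx.Commit()
    }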
Okay, that's fine. Let's move on. The next issue is requests to a service at the same time. It can happen when multiple clients request a service at the same time, and it can lead to an overloaded service. Let's consider an example.
This example is about a service and a client app: the client app requests the service to get new plugin versions. Everything's fine if we have just a few clients. But what if we have many clients, for instance 2 million of them? Usually the load from these 2 million clients is distributed across the day, because some users use the app in the morning, some of them use it in the evening, and so on and so forth. But what if the client app has something like a cron job with a specified time, and the time is specified so that all the apps request our service at exactly the same time? It can lead to an overloaded service and a temporary outage.
To fix that issue, we could add some random time. I mean, instead of requesting our service at the same time, client apps could request it at a random time. That way the load will be distributed across the day and we will not end up with an overloaded service.
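On the client side, that could look something like the sketch below. The polling interval, the amount of jitter and the URL are made up for illustration.

    package main

    import (
        "math/rand"
        "net/http"
        "time"
    )

    // checkForUpdates polls the server roughly once a day, adding up to an
    // hour of random jitter so that millions of clients don't all hit the
    // service at exactly the same moment.
    func checkForUpdates() {
        for {
            jitter := time.Duration(rand.Int63n(int64(time.Hour)))
            time.Sleep(24*time.Hour + jitter)

            resp, err := http.Get("https://plugins.example.com/latest") // illustrative URL
            if err != nil {
                continue // just wait for the next tick
            }
            resp.Body.Close()
        }
    }

    func main() {
        checkForUpdates()
    }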
Okay, let's move on. The next issue is the lack of a rate limiter. Let's consider the same example as before, with the bad design on the client side, but let's also imagine that we added a rate limiter in front of our service. In this case, even if there is bad design on the client side, the rate limiter won't let client apps overload our service, because we can specify a rule on the rate limiter side, for instance something like no more than 2,000 requests per second. Plus we can restrict a particular user from sending us too many requests. So in that case we won't end up with these issues.
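One possible way to sketch such a limit in Go is with the golang.org/x/time/rate package; the numbers and the endpoint here are just examples, not the exact rule from the talk.

    package main

    import (
        "net/http"

        "golang.org/x/time/rate"
    )

    // One global limit for the whole service: about 2,000 requests per
    // second with a small burst on top. A per-user limit would need one
    // limiter per user key.
    var limiter = rate.NewLimiter(2000, 100)

    func pluginsHandler(w http.ResponseWriter, r *http.Request) {
        if !limiter.Allow() {
            http.Error(w, "too many requests", http.StatusTooManyRequests)
            return
        }
        w.Write([]byte("latest plugin versions")) // illustrative response
    }

    func main() {
        http.HandleFunc("/plugins/latest", pluginsHandler)
        http.ListenAndServe(":8080", nil)
    }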
Okay, so a rate limiter is a very good thing, but now let's talk about limiting memory, for instance. Let's imagine we have restricted the number of requests, but what if the payload of a particular request is too big? For instance, let's consider this example in Golang. It is a POST handler, and in this handler we read all the data from the body. What if the size of the body is too big, for instance 5 GB? Most likely we'll get an out of memory error in that case.
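The handler shown on the slide isn't in the transcript, but it is presumably something along these lines; this is a reconstruction with made-up names, not the original code.

    package ads

    import (
        "io"
        "net/http"
    )

    // A naive POST handler: io.ReadAll buffers the whole request body in
    // memory, so a 5 GB body can take the process down with an
    // out-of-memory error.
    func createAd(w http.ResponseWriter, r *http.Request) {
        body, err := io.ReadAll(r.Body)
        if err != nil {
            http.Error(w, err.Error(), http.StatusBadRequest)
            return
        }
        _ = body // decode and handle the payload here
        w.WriteHeader(http.StatusCreated)
    }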
So how can we fix it? We can restrict the client from sending us too many bytes. In Golang we can do that with just one line of code: with this line we specify that the body shouldn't be more than 500 KB. So now, instead of crashing our application, the client will get an error. So we have fixed it.
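That one line is most likely http.MaxBytesReader from the standard library; a sketch of the fixed handler could look like this, again with illustrative names.

    package ads

    import (
        "io"
        "net/http"
    )

    // The same handler with the one-line fix: the body is capped at 500 KB,
    // so an oversized request gets an error response instead of exhausting
    // the service's memory.
    func createAd(w http.ResponseWriter, r *http.Request) {
        r.Body = http.MaxBytesReader(w, r.Body, 500*1024)
        body, err := io.ReadAll(r.Body)
        if err != nil {
            http.Error(w, "request body too large", http.StatusRequestEntityTooLarge)
            return
        }
        _ = body // decode and handle the payload here
        w.WriteHeader(http.StatusCreated)
    }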
Okay, now let's talk about retries, and let's consider a very simple example: just a client, a service and an external dependency. A request to the external dependency failed, which can obviously happen because of network issues, a temporary outage or something like that. So we should handle the cases when a request fails. How can we do that? We can retry, that is, send the same request again. If we don't do that, if we don't use a retry policy, we could end up with a high error rate and poor user experience.
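A minimal retry sketch in Go could look like this; the retry count and the decision about what counts as retryable are assumptions for illustration.

    package client

    import (
        "fmt"
        "net/http"
    )

    // callWithRetries sends the same GET request up to `attempts` times,
    // treating transport errors and 5xx responses as retryable.
    func callWithRetries(url string, attempts int) (*http.Response, error) {
        var lastErr error
        for i := 0; i < attempts; i++ {
            resp, err := http.Get(url)
            if err != nil {
                lastErr = err
                continue
            }
            if resp.StatusCode < 500 {
                return resp, nil
            }
            resp.Body.Close()
            lastErr = fmt.Errorf("server returned %d", resp.StatusCode)
        }
        return nil, lastErr
    }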
Okay, so let's move on. The next issue is also connected with retries, but goes a bit deeper. What if we have added retries, but we haven't added backoff? For instance, we send a request to an external service and the request fails. Then we retry, and the next request fails too, and the third one fails too. We have retries, which is fine, but what if the service is overloaded? In that case we are making things worse, not better, because we don't let the service recover. So instead of just sending requests one after another, we can use a backoff strategy.
And here are the backoff strategies. The first one is linear: a linear strategy is about waiting for some constant time, for instance 1 second between the first and second requests. The next one is linear with jitter. It's pretty much the same as linear, but instead of waiting for exactly 1 second, we wait for 1 second plus a random time, for instance between ten and 20 milliseconds. The next strategy is exponential. Exponential is not about waiting a static time: we wait 1 second between the first and second requests, 2 seconds between the second and third, then 4 seconds, then 8, and so on and so forth. So we double the waiting time after every try. And the last one is exponential with jitter, which is the same as exponential but with a random time added, like in the second approach.
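Here is one way those four strategies might be sketched in Go; the base delay, jitter range and strategy names are illustrative.

    package client

    import (
        "math/rand"
        "time"
    )

    // backoff returns how long to wait before retry number `attempt`
    // (counted from 0), roughly matching the four strategies above.
    func backoff(strategy string, attempt int) time.Duration {
        base := time.Second
        // A random extra delay between 10 and 20 milliseconds.
        jitter := 10*time.Millisecond + time.Duration(rand.Int63n(int64(10*time.Millisecond)))

        switch strategy {
        case "linear":
            return base // 1s, 1s, 1s, ...
        case "linear-jitter":
            return base + jitter
        case "exponential":
            return base * time.Duration(1<<attempt) // 1s, 2s, 4s, 8s, ...
        default: // exponential with jitter
            return base*time.Duration(1<<attempt) + jitter
        }
    }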
So that's pretty much it. Thank you for your attention.