Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hi, my name is Alexander Kharkov and I'm the CTO at Tools for Workers.
And today I'm talking about optimizing container
synchronization for frequent writes.
I know the name of the talk is a bit of a mouthful, so let's dive straight into the topic to see what lies behind this title.
And it's actually a very important aspect of a multi-threaded environment: synchronization.
There comes a time when the bottlenecks cannot be resolved with just architecture or by buying some new hardware, but have to be addressed by optimizing how the data is actually handled in code.
So we will look at some common practices in container synchronization, how they struggle to address the not-so-uncommon situation of frequent write operations from multiple threads, and what can actually be done about it.
I will illustrate the talk with a bit of C++, but I will keep the conversation high-level enough, because the principles that we are discussing today can be applied in any language of your choice.
And we'll start with the example by swapping some abstract container for something real that maybe hits home for some of you: a transactional cache.
Let's have a map, well, a dictionary, with the key being a user ID and the value being a dynamic array of transactions. You can see that Transaction abstracts the transactional data; it's just some data about the transaction, such as date, amount, type and so on. It's not really important to the conversation.
What's important is that maps in C++ are binary search trees, which means we can access or remove each user's array of transactions by ID in logarithmic time. That's pretty great, because it makes this particular cache applicable in many situations.
And let's say that we have three cache operations: read, where we just read this array by some user ID; write, where we put a transaction into the cache; and pop, where we extract the accumulated transactions from the cache, for example for further processing.
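To make this concrete, here is a minimal sketch of what such a cache could look like, with no synchronization yet. The Transaction fields and all class and method names are illustrative assumptions, not the exact code from the talk.

#include <cstdint>
#include <map>
#include <vector>

// Hypothetical transaction record; the exact fields don't matter here.
struct Transaction {
    std::int64_t date;
    double amount;
    int type;
};

// Unsynchronized cache: user ID -> accumulated transactions.
// std::map gives logarithmic access and removal by user ID.
class TransactionCache {
public:
    // read: copy out whatever is accumulated for the user.
    std::vector<Transaction> Read(std::uint64_t user_id) const {
        auto it = data_.find(user_id);
        return it != data_.end() ? it->second : std::vector<Transaction>{};
    }

    // write: append one transaction for the user.
    void Write(std::uint64_t user_id, const Transaction& t) {
        data_[user_id].push_back(t);
    }

    // pop: extract and remove everything accumulated for the user.
    std::vector<Transaction> Pop(std::uint64_t user_id) {
        std::vector<Transaction> result;
        auto it = data_.find(user_id);
        if (it != data_.end()) {
            result = std::move(it->second);
            data_.erase(it);
        }
        return result;
    }

private:
    std::map<std::uint64_t, std::vector<Transaction>> data_;
};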
And if the cache is accessed by different threads, for example, by tasks in some
thread pool, you'll need synchronization.
Otherwise, alongside the obvious data races, you'll keep getting all sorts of errors in situations like, for example, when the tree rebalances itself after a pop operation while you simultaneously try to write a new transaction to a no-longer-valid address.
So you need synchronization.
The simplest approach is to lock the container with a mutex. The lock can be acquired by one operation at a time, and other operations will have to wait to acquire the lock for themselves.
This is pretty easily done: in C++ you just use a lock guard locking a dedicated mutex. And this will work in terms of synchronization; it will synchronize your data and avoid the situation I described earlier.
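As a sketch, the simple-mutex version might look like this; it reuses the hypothetical TransactionCache from the earlier sketch and guards every operation with a single std::mutex.

#include <mutex>

// Coarse-grained synchronization: one mutex guards the whole cache,
// so only one read, write or pop can run at a time.
class LockedTransactionCache {
public:
    std::vector<Transaction> Read(std::uint64_t user_id) const {
        std::lock_guard<std::mutex> lock(mutex_);
        return cache_.Read(user_id);
    }

    void Write(std::uint64_t user_id, const Transaction& t) {
        std::lock_guard<std::mutex> lock(mutex_);
        cache_.Write(user_id, t);
    }

    std::vector<Transaction> Pop(std::uint64_t user_id) {
        std::lock_guard<std::mutex> lock(mutex_);
        return cache_.Pop(user_id);
    }

private:
    mutable std::mutex mutex_;
    TransactionCache cache_;  // unsynchronized cache from the earlier sketch
};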
But if the load increases, the operations will have to spend more and more time waiting on the locked mutex, and more and more threads in your pool will do nothing but wait for the lock. Consequently, transactions in your system are processed with significant delay. Well, that's not right, and we have to do something.
What can we do to improve the situation?
One of the approaches, I would say a next-level approach, is to use a readers-writer lock. It's done by using a shared mutex and a shared lock.
And let's actually step back for a moment. Let's say that here the load is caused by a constant need to check transactions, so most of the operations in this load are actually read operations that don't change the cache.
And a readers-writer lock works perfectly here, because it allows two types of locking. The first one is unique locking, which we use for write operations. It's actually done with the same lock guard as with a simple mutex, and a mutex locked by this unique lock behaves just the same: both readers and writers wait for the lock.
But what's different about a shared mutex is that it also allows shared locking. In this scenario, all the shared locks can be acquired simultaneously. That's why reads can run simultaneously, and we spend less time waiting for the lock to be freed.
In C++ we can use shared locking by just using std::shared_lock with the shared mutex.
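As a sketch, again reusing the hypothetical types from before, the readers-writer variant could look like this: reads take a shared lock and can overlap, while writes and pops take the lock exclusively.

#include <shared_mutex>

// Readers-writer locking: reads share the lock, mutations take it exclusively.
class SharedLockedTransactionCache {
public:
    std::vector<Transaction> Read(std::uint64_t user_id) const {
        std::shared_lock<std::shared_mutex> lock(mutex_);  // many readers at once
        return cache_.Read(user_id);
    }

    void Write(std::uint64_t user_id, const Transaction& t) {
        std::lock_guard<std::shared_mutex> lock(mutex_);   // exclusive
        cache_.Write(user_id, t);
    }

    std::vector<Transaction> Pop(std::uint64_t user_id) {
        std::lock_guard<std::shared_mutex> lock(mutex_);   // exclusive
        return cache_.Pop(user_id);
    }

private:
    mutable std::shared_mutex mutex_;
    TransactionCache cache_;  // unsynchronized cache from the first sketch
};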
And this actually works like a charm in read-heavy environments. Basically, while you only have read operations, they are done simultaneously, without any locking, as if there were no mutex whatsoever, to put it simply. Of course, we occasionally have to lock uniquely, and all operations have to wait for that write operation to pass, but if write operations are not frequent, that's pretty much it. Not a problem.
But let's imagine another scenario.
What if we have a write-heavy environment? Obviously, the benefit of the readers-writer lock doesn't work here, and we have to be more creative.
But first of all, let's define this write-heavy environment. We can do that by simply removing the read operation altogether, leaving just the write and pop operations.
For example, we have a cache where we need to store a lot of transactions that are being bombarded at it from all the different threads. And then occasionally we have to take the collected transactions and remove them from the cache for processing, with the pop operation. Both write and pop change the cache.
That's why we are dealing with a write-heavy environment. As we mentioned earlier, a readers-writer lock is pointless here, and we are back to a simple mutex with one operation at a time, and we have to somehow improve from here.
And to improve from here, let's look at the data structure and how we work with the data.
You can actually notice that each operation works with only one user, and the approach that is asking to be implemented is to process each user separately. The naive approach is to divide our cache into a lot of caches with just one user each, and a separate mutex for each one.
But then, of course, we have to put all these caches into one container, and this container of caches would have to be synchronized too. That's when we end up back where we began. So I have a better solution, but it roughly follows the same principle.
Actually, you may have heard the word sharding before, and you probably heard this term in the realm of databases, but it's perfectly applicable in our scenario.
So, like in the previous naive approach, we're still splitting our original cache, but we are splitting it into a preselected number of shards. And then we just have to decide how the data will be distributed across those shards.
And let's say that in our example the distribution of transactions among users is pretty homogeneous, and we want to split the cache into four shards. Of course, we could split the range of our user IDs into four parts: the first quarter to start, then the second quarter, the third quarter and the last quarter.
But if we are talking about an at least somewhat realistic scenario, the users with lower IDs were probably registered earlier and are maybe less active than the users with IDs closer to the top. So to combat this, let's just use modulo four as the distribution function.
So here we have it: four shards. And when we have a user ID, we just decide which shard it goes to by applying the modulo operation.
Of course, all of those four caches should be wrapped into an interface to appear as just one cache, but that's pretty easy to do. We have a vector of four pointers, and we figure out the necessary index into the vector with the modulo operation.
What's important to note is that we don't need additional synchronization for the vector itself, because no matter how many writes and pops we do, those four maps are always there at the same addresses. If we synchronize each of them with its own mutex, we are fine here.
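A minimal sketch of the sharded wrapper, assuming the LockedTransactionCache from the earlier sketch as the per-shard cache; the shard count and names are illustrative.

#include <cstddef>
#include <memory>
#include <vector>

// Sharded cache: user_id % shard_count picks the shard,
// and each shard is protected by its own mutex.
class ShardedTransactionCache {
public:
    explicit ShardedTransactionCache(std::size_t shard_count = 4)
        : shards_(shard_count) {
        for (auto& shard : shards_) {
            shard = std::make_unique<LockedTransactionCache>();
        }
    }

    std::vector<Transaction> Read(std::uint64_t user_id) const {
        return ShardFor(user_id).Read(user_id);
    }

    void Write(std::uint64_t user_id, const Transaction& t) {
        ShardFor(user_id).Write(user_id, t);
    }

    std::vector<Transaction> Pop(std::uint64_t user_id) {
        return ShardFor(user_id).Pop(user_id);
    }

private:
    // The vector itself is never resized after construction, so it needs no
    // synchronization of its own; only each shard's mutex is involved.
    LockedTransactionCache& ShardFor(std::uint64_t user_id) const {
        return *shards_[user_id % shards_.size()];
    }

    std::vector<std::unique_ptr<LockedTransactionCache>> shards_;
};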
And look, now we have four simultaneous operations. What if we think that four is not enough? Can we have 10 or 100 shards? Would it practically help? That's where I had to run the tests and check.
Here you can see the comparison between using just a simple mutex and sharding with different numbers of shards. The load here is around 2,000 write and 300 pop operations per second, and the concurrency is eight; the concurrency is the number of threads that can run simultaneously in the environment that I'm testing on.
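This is not the original test code (that's linked at the end of the talk); just a rough sketch of how such a comparison could be set up. The write-to-pop ratio and the user ID range of 100,000 are assumptions taken from the numbers mentioned in the talk.

#include <chrono>
#include <cstdint>
#include <random>
#include <thread>
#include <vector>

// Rough benchmark sketch: `threads` workers hammer the cache with mostly
// writes and occasional pops, and we measure the total wall-clock time.
template <typename Cache>
std::chrono::milliseconds RunLoad(Cache& cache, int threads, int ops_per_thread) {
    auto worker = [&cache, ops_per_thread](unsigned seed) {
        std::mt19937 rng(seed);
        std::uniform_int_distribution<std::uint64_t> user(0, 100'000);  // assumed ID range
        for (int i = 0; i < ops_per_thread; ++i) {
            const std::uint64_t id = user(rng);
            if (i % 8 == 0) {
                cache.Pop(id);                   // occasional pop
            } else {
                cache.Write(id, Transaction{});  // mostly writes
            }
        }
    };

    const auto start = std::chrono::steady_clock::now();
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t) {
        pool.emplace_back(worker, static_cast<unsigned>(t + 1));
    }
    for (auto& t : pool) {
        t.join();
    }
    const auto end = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::milliseconds>(end - start);
}

// Usage, e.g.: ShardedTransactionCache cache(4); auto ms = RunLoad(cache, 8, 100'000);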
So here you can clearly see, relatively, how much time each operation wastes waiting for the lock in the first scenario. Of course, in all of the scenarios the average time to process an operation is less than a millisecond.
But these are just demonstrative tests. In real systems the data is often much more complex and its processing is much more sophisticated; it may require additional computations, access to other data, other caches, and this can significantly increase the time of the operation itself. So it's good to spend as little time on synchronization as possible.
And in this regard, we can see that the sharding approach basically eliminates all the waiting for the lock. But with regard to the different shard counts, we see that the metrics don't show a significant difference.
I suspect that the minimal effective value is tied to the system's concurrency, and my concurrency in the test is eight. But of course, on modern server machines it can be much higher, and in those scenarios the results would be even better if we increase our shard count from four to some number closer to the concurrency of the machine. Of course, that's just my hypothesis.
I'll share the code for the tests at the end of the presentation, and I would like to hear your feedback on whether this claim gets proven or disproven.
You can also note that the largest shard count tested, 100,000, effectively matches the earlier-mentioned approach of assigning a mutex to each user, because in the tests user IDs are generated within a range of 100,000. I've not seen any advantage in processing speed, although this approach is obviously more demanding in terms of memory.
So we can write this naive approach off and just simply use sharding.
Also, I've run other tests with a lower load, just 500 writes and less
than 100 pop operations per second.
And you can see that effectiveness of sharding optimization decreases.
It's important to remember that the difference between a simple mutex and the sharding approach exists only because of a large number of transactions arriving at the same time, causing a queue to build up. When we don't have this queue of operations waiting for the lock to be freed before they can acquire it and actually do their work, a simple mutex does the job.
And this is yet another reminder that premature optimization can complicate the code without significantly impacting the results.
To wrap everything up, here are the main points of the presentation.
First of all, if you have data with a predictable distribution that allows parallelism by a specific key, you can use sharding to significantly optimize processing time and avoid bottlenecks.
Second, this is just one example of how analyzing the data that you have can help you with optimization. So be creative and maybe try other ways to approach synchronization.
And lastly, beware of premature optimization. I know that maybe it sounds contradictory to the previous point, but actually they work even better together. It's crucial to understand the application requirements and the expected load.
And that would be it for today. Here's a link to the GitHub where you can find the implementation of all of the caches mentioned in the presentation, and the code for the tests, so you can run it yourself.
And if you have any questions, or maybe you stumble across something interesting in your own tests, you can write to me, and I would be glad to hear from you.
Thank you for your attention.
Thanks to Conf42 for the opportunity and have a great day.