Conf42 DevOps 2025 - Online

- premiere 5PM GMT

Unlocking Efficiency: Sharded Synchronization in High-Frequency Write-Heavy Systems


Abstract

In multi-threaded environments with frequent write operations, traditional synchronization methods often encounter performance bottlenecks, which limit scalability and efficiency. This talk explores sharding, a synchronization strategy designed to allow more parallel processing by implementing finer-grained locking.

The presentation will cover:

  • The practical implementation of sharding in C++, with real-world examples.
  • A comparative analysis of sharding versus traditional mutexes, shared mutexes, and lock-free container implementations.
  • Key insights into how sharding fits into the broader landscape of synchronization techniques, empowering attendees to make informed decisions for their systems.

By the end of the session, participants will gain a clear understanding of when and how to use sharding to optimize performance in write-heavy, multi-threaded systems.

Aleksander’s expertise in designing scalable, high-performance systems makes him uniquely qualified to discuss this topic. His ability to translate complex technical ideas into actionable insights ensures that the audience will leave with valuable takeaways applicable to their work in DevOps.


Transcript

This transcript was autogenerated. To make changes, submit a PR.
Hi, my name is Aleksander Iskhakov and I'm the CTO at Tools for Brokers. Today I'm talking about optimizing container synchronization for frequent writes. I know the name of the talk is a bit of a mouthful, so let's dive straight into the topic to see what lies behind this title.

Synchronization is a very important aspect of multi-threaded environments. There comes a time when bottlenecks can no longer be resolved with architecture or by buying new hardware, but have to be addressed by optimizing how the data is actually handled in code. So we will look at some common practices in container synchronization, how they struggle to address the not-so-uncommon situation of frequent write operations from multiple threads, and what can actually be done about it. I will illustrate my talk with a bit of C++, but I will keep the conversation high-level enough, because the principles we are discussing today can be applied in any language of your choice.

Let's start the example by replacing some abstract container with something real that may hit home for some of you: a transactional cache. We have a map, well, a dictionary, with the key being a user ID and the value being a dynamic array of transactions. The transaction itself is just some data about a transaction, such as date, amount, type and so on; it's not really important to the conversation. What is important is that maps in C++ are binary search trees, which means we can access or remove the array of transactions of any user by ID in logarithmic time. That's pretty great, because it makes this particular cache applicable in many situations.

Let's say we have three cache operations: read, where we just read the array by user ID; write, where we put a transaction into the cache; and pop, where we extract the accumulated transactions from the cache, for example for further processing. If the cache is accessed by different threads, for example by tasks in some thread pool, you need synchronization. Otherwise, alongside obvious data races, you'll keep getting all sorts of errors in situations where, for example, the tree rebalances itself after a pop operation while you simultaneously try to write a new transaction to a no-longer-valid address.

So you need synchronization, and the simplest approach is to lock the container with a mutex. The lock can be acquired by one operation at a time, and other operations have to wait to acquire the lock for themselves. This is pretty easily done: in C++ you just use a lock_guard locking a dedicated mutex. This works in terms of synchronization: it keeps the data consistent and avoids the situation I described earlier.
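As an illustration, a minimal sketch of such a mutex-guarded transactional cache might look like the following. The class name, member names and the Transaction fields are placeholders of my own, not code taken verbatim from the talk's repository:

```cpp
#include <cstdint>
#include <map>
#include <mutex>
#include <string>
#include <vector>

// Placeholder transaction record; the talk only says it holds
// data such as date, amount and type.
struct Transaction {
    std::string date;
    double amount = 0.0;
    std::string type;
};

// Transactional cache guarded by a single mutex: only one operation
// (read, write or pop) can touch the underlying map at a time.
class SimpleMutexCache {
public:
    void write(std::uint64_t userId, Transaction t) {
        std::lock_guard<std::mutex> lock(mutex_);
        data_[userId].push_back(std::move(t));
    }

    std::vector<Transaction> read(std::uint64_t userId) const {
        std::lock_guard<std::mutex> lock(mutex_);
        auto it = data_.find(userId);
        return it != data_.end() ? it->second : std::vector<Transaction>{};
    }

    // Extract and remove all accumulated transactions for a user.
    std::vector<Transaction> pop(std::uint64_t userId) {
        std::lock_guard<std::mutex> lock(mutex_);
        auto it = data_.find(userId);
        if (it == data_.end()) return {};
        std::vector<Transaction> out = std::move(it->second);
        data_.erase(it);
        return out;
    }

private:
    mutable std::mutex mutex_;
    std::map<std::uint64_t, std::vector<Transaction>> data_;
};
```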
But as the load increases, operations spend more and more time waiting on the locked mutex, and more and more threads in your pool do nothing but wait for the lock. Consequently, transactions in your system are processed with significant delay. That's not right, and we have to do something about it. What can we do to improve the situation?

One of the approaches, I would say a next-level approach, is to use a readers-writer lock. It's done with a shared mutex and a shared lock. Let's step back for a moment and say that here the load is caused by a constant need to check transactions. So most of the operations in this load are actually read operations that don't change the cache. A readers-writer lock works perfectly here because it allows two types of locking. The first one is unique locking, which we use for write operations; it's done with the same lock_guard as with a simple mutex, and a mutex locked by this unique lock behaves just the same: both readers and writers wait for the lock. What's different about a shared mutex is that it also allows shared locking. When a shared lock is acquired, all the other shared locks can still be acquired simultaneously; that's why reads can run in parallel and we spend less time waiting for the lock to be freed. In C++ we can use shared locking by simply using std::shared_lock paired with the shared mutex.

This works like a charm in read-heavy environments. Basically, while you have read operations, they run simultaneously, without any locking, as if there were no mutex at all, to put it simply. Of course, occasionally we have to lock uniquely, and all operations have to wait for the write operation to pass, but if write operations are not frequent, that's pretty much not a problem.
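For comparison, here is a sketch of the same cache using a readers-writer lock, reusing the Transaction type from the previous sketch. Again, this is an assumed illustration of the technique rather than the talk's exact code:

```cpp
#include <cstdint>
#include <map>
#include <mutex>
#include <shared_mutex>
#include <vector>

// Same cache, but guarded by a readers-writer lock: reads take a shared
// lock and can run concurrently, while write/pop take an exclusive lock.
class SharedMutexCache {
public:
    void write(std::uint64_t userId, Transaction t) {
        std::lock_guard<std::shared_mutex> lock(mutex_);   // exclusive (unique) lock
        data_[userId].push_back(std::move(t));
    }

    std::vector<Transaction> read(std::uint64_t userId) const {
        std::shared_lock<std::shared_mutex> lock(mutex_);  // shared: readers don't block each other
        auto it = data_.find(userId);
        return it != data_.end() ? it->second : std::vector<Transaction>{};
    }

    std::vector<Transaction> pop(std::uint64_t userId) {
        std::lock_guard<std::shared_mutex> lock(mutex_);   // exclusive (unique) lock
        auto it = data_.find(userId);
        if (it == data_.end()) return {};
        std::vector<Transaction> out = std::move(it->second);
        data_.erase(it);
        return out;
    }

private:
    mutable std::shared_mutex mutex_;
    std::map<std::uint64_t, std::vector<Transaction>> data_;
};
```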
But let's imagine another scenario: what if we have a write-heavy environment? Obviously the benefit of the readers-writer lock doesn't apply here, and we have to be more creative. First of all, let's define this write-heavy environment. We can do that by simply removing the read operation altogether and keeping just the write and pop operations. For example, we have a cache where we need to store a lot of transactions being bombarded at it from different threads, and occasionally we have to take the collected transactions and remove them from the cache for processing, with the pop operation. Both write and pop change the cache, and that's why we are dealing with a write-heavy environment. As we mentioned earlier, a readers-writer lock is pointless here, so we are back to a simple mutex with one operation at a time, and we have to somehow improve from there.

To improve, let's look at the data structure and how we work with the data. You can notice that each operation works with only one user, so the approach that is asking to be implemented is to process each user separately. The naive approach is to divide our cache into a lot of caches with just one user each, and a separate mutex for each one. But then, of course, we would have to put all these caches into one container, and this container of caches would have to be synchronized too, and we are back where we began.

So I have a better solution, but it roughly follows the same principle. You may have heard the word sharding before, probably in the realm of databases, but it's perfectly applicable in our scenario. Like in the previous naive approach, we are still splitting our original cache, but we are splitting it into a preselected number of shards, and then we just have to decide how the data will be distributed among those shards. Let's say that in our example the distribution of transactions among users is pretty homogeneous, and we want to split the cache into four shards.

Of course, we could split the range of our user IDs into four parts: the first quarter, the second quarter, the third quarter and the last quarter. But in an at least somewhat realistic scenario, the users with lower IDs probably registered earlier and may be less active than the users with IDs closer to the top. So to combat this, let's just use modulo four as the distribution function. Here we have it: four shards, and when we have a user ID, we decide which shard it goes into just by applying the modulo operation. Of course, all four caches should be wrapped into an interface so they appear as just one cache, but that's pretty easy to do: we have a vector of four pointers, and we figure out the necessary index in the vector with the modulo operation. What's important to note is that we don't need additional synchronization for the vector itself, because no matter how many writes and pops we do, those four maps are always there at the same addresses. If we synchronize each of them with its own mutex, we are fine. And look, now we have four simultaneous operations.
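A rough sketch of how such a sharded cache could be wrapped behind a single interface is shown below, again reusing the Transaction type from the first sketch. It only exposes write and pop, matching the write-heavy scenario; the shard count and naming are assumptions for illustration:

```cpp
#include <cstdint>
#include <map>
#include <memory>
#include <mutex>
#include <vector>

// Sharded cache: the key space is split across a fixed number of shards,
// each with its own map and its own mutex, so operations on different
// shards proceed in parallel. Shard choice is userId % shardCount.
class ShardedCache {
public:
    explicit ShardedCache(std::size_t shardCount = 4) : shards_(shardCount) {
        for (auto& s : shards_) s = std::make_unique<Shard>();
    }

    void write(std::uint64_t userId, Transaction t) {
        Shard& s = shardFor(userId);
        std::lock_guard<std::mutex> lock(s.mutex);
        s.data[userId].push_back(std::move(t));
    }

    std::vector<Transaction> pop(std::uint64_t userId) {
        Shard& s = shardFor(userId);
        std::lock_guard<std::mutex> lock(s.mutex);
        auto it = s.data.find(userId);
        if (it == s.data.end()) return {};
        std::vector<Transaction> out = std::move(it->second);
        s.data.erase(it);
        return out;
    }

private:
    struct Shard {
        std::mutex mutex;
        std::map<std::uint64_t, std::vector<Transaction>> data;
    };

    // The vector of shard pointers never changes after construction, so it
    // needs no synchronization of its own; only the shards are locked.
    Shard& shardFor(std::uint64_t userId) {
        return *shards_[userId % shards_.size()];
    }

    std::vector<std::unique_ptr<Shard>> shards_;
};
```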
What if we think that four is not enough? Can we have 10 or 100 shards, and would it practically help? That's where I had to run the tests and check. Here you can see the comparison between using just a simple mutex and sharding with different numbers of shards. The load here is around 2,000 write and 300 pop operations per second, and the concurrency is eight; the concurrency is the number of threads that can run simultaneously in the environment I'm testing on. You can clearly see, relatively, how much time each operation wastes waiting for the lock in the first scenario. Of course, in all of the scenarios the average time to process an operation is less than a millisecond, but these are just demonstrative tests. In real systems the data is often much more complex and its processing is much more sophisticated; it may require additional computations or access to other data and other caches, and this can significantly increase the time spent on the operation itself. So it's good to spend as little time on synchronization as possible, and in this regard we can see that the sharding approach basically eliminates all the waiting for the lock.

As for the different shard counts, the measurements don't show a significant difference. I suspect that the minimal effective value is tied to the system's concurrency. My concurrency in the test is eight, but on modern server machines it can be much higher, and shard counts in those scenarios would do even better if we increase them from four to numbers closer to the concurrency of the machine. Of course, that's just my hypothesis. I will share the code for the tests at the end of the presentation, and I would like to hear your feedback on whether this claim gets proven or disproven. You can also note that the largest shard count tested, 100,000, effectively matches the earlier approach of assigning a mutex to each user, because in the tests user IDs are generated within a range of 100,000. I've not seen any advantage in processing speed, although that approach is obviously more demanding in terms of memory. So we can write the naive approach off and simply use sharding.

Also, I've run other tests with a lower load, just 500 writes and fewer than 100 pop operations per second, and you can see that the effectiveness of the sharding optimization decreases. It's important to remember that the difference between a simple mutex and the sharding approach exists only because a large number of transactions arrive at the same time, causing a queue to build up. When we don't have this queue of operations waiting for the lock to be freed before they can acquire it and actually do the work, a simple mutex does the job. This is yet another reminder that premature optimization can complicate the code without significantly impacting the results.
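For completeness, here is one way such a contention test could be shaped. The thread count, user-ID range and write-to-pop ratio loosely mirror the numbers mentioned above, but this is an assumed sketch, not the author's actual benchmark; the real test code lives in the repository linked at the end of the talk:

```cpp
#include <chrono>
#include <cstdint>
#include <random>
#include <thread>
#include <vector>

// Rough shape of a contention test: several worker threads hammer the
// cache with write and pop operations on random user IDs, and we measure
// the total wall-clock time. Unlike the talk's per-second load, this
// sketch simply issues a fixed number of operations per thread.
template <typename Cache>
double runBenchmark(Cache& cache, std::size_t threads = 8,
                    std::size_t opsPerThread = 100000) {
    auto worker = [&cache, opsPerThread](unsigned seed) {
        std::mt19937_64 rng(seed);
        std::uniform_int_distribution<std::uint64_t> userId(0, 99999);
        for (std::size_t i = 0; i < opsPerThread; ++i) {
            std::uint64_t id = userId(rng);
            if (i % 8 == 0) {
                cache.pop(id);                   // occasional pop
            } else {
                cache.write(id, Transaction{});  // mostly writes
            }
        }
    };

    auto start = std::chrono::steady_clock::now();
    std::vector<std::thread> pool;
    for (std::size_t t = 0; t < threads; ++t)
        pool.emplace_back(worker, static_cast<unsigned>(t));
    for (auto& th : pool) th.join();
    auto end = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(end - start).count();
}
```

Running this once with a SimpleMutexCache and once with a ShardedCache gives a rough feel for how much of the total time is spent queuing on the lock.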
To wrap everything up, here are the main points of the presentation. First, if you have data with a predictable distribution that allows parallelism by a specific key, you can use sharding to significantly optimize processing time and avoid bottlenecks. Second, this is just one example of how analyzing the data you have can help you with optimization, so be creative and try other ways to approach synchronization. And last, beware of premature optimization. I know it may sound contradictory to the previous point, but they actually work even better together: it's crucial to understand the application's requirements and expected load.

And that would be it for today. Here's a link to the GitHub repository where you can find the implementations of all the caches mentioned in the presentation and the code for the tests, so you can run it yourself. If you have any questions, or if you stumble across something interesting in your own tests, you can write to me, and I would be glad to hear from you. Thank you for your attention, thanks to Conf42 for the opportunity, and have a great day.

Aleksander Iskhakov

CTO @ Tools for Brokers

Aleksander Iskhakov's LinkedIn account


