Conf42 DevOps 2025 - Online

- premiere 5PM GMT

Scaling: from 0 to 25 million users


Abstract

Real-life examples from running a system that handles 25 million monthly active users generating 350 billion requests and 1 PB of data every month, all without breaking the bank. Demystifying scale and showcasing easy, low-hanging fruit that is often overlooked.


Transcript

This transcript was autogenerated. To make changes, submit a PR.
Today we're going to be talking about scaling from zero to 25 million. My name is Josip, I'm the CTO at a company called Sofascore. We make a live score app that covers a lot of sports. We have a lot of data from a lot of different providers, which we aggregate and give to our users for free. We have more than 28 million monthly active users. They generate 1.4 petabytes of traffic every month, that's 400 billion requests, and the biggest peak we've had was 1.6 million requests per second.

When we first started, we had a lot of issues with the servers breaking down whenever there were a lot of users. As soon as a peak happened, the servers went offline. The way you solve that is by adding a cache, so we added memcached, and in fact it was a full page cache. When a user came to the website, if the result was not in the cache it was calculated, stored in the cache, and returned to the user. This works well, and it's one of the simplest things you can do to have a system that can handle more than a couple of users.

The problem with that approach is called the cache stampede problem. If something is not in the cache and a lot of users try to fetch the same thing, all of those requests go to the backend, to your database, because the result isn't cached, and your database or backend servers die. What you want is for all of those requests to wait while only one request goes to your database or backend service. That is not an easy thing to solve, because as soon as you have more than one server you have a distributed system, and making those requests wait is not simple, especially in the early days when we ourselves weren't all that savvy in dealing with distributed systems.

What we did instead was use the fact that we knew how our users were using the app. Most users want to view live matches, so we knew the matches that are currently live are the ones that are going to be looked at, and we wanted those in the cache. We ended up creating a worker that would fetch all of the games played in the last couple of hours or scheduled for the next couple of hours and proactively put the results in the cache. That way, every time somebody tried to fetch a game that should be live, we were sure to have it in the cache. It's really important to know how your users interact with your app and your system. Proactive caching is something all of the big companies do: if you look at Facebook, Instagram and so on, they all have your results prepared for you in advance.

There's also a really interesting pattern in our traffic. Before big games, people start opening the app about 15 minutes before kickoff. This was especially true in the early days: whenever a big game was about to be played, in those 15 minutes the number of users would multiply by three or four, so the traffic on the site would roughly quadruple. That was really useful knowledge, because we knew how the system was going to behave. One of the really big and important games is El Clasico; for those of you who don't follow sports, that's when two of the most popular teams in the world play against each other.
Fifteen minutes before that game, the CPU on our server was going through the roof. At that time we only had two servers, one for the backend service and one for the database, and the backend was basically dying 15 minutes before kickoff. We knew that if we didn't do anything, the whole site would crash, which is not something you want during the biggest event of the year, the one that brings in the most money. We had a cache and we had proactive caching, but the backend app still has to boot, the request has to go through all of the steps, fetch from the cache and return, and we were running PHP. PHP is slow, and there's not a whole lot you can do about that.

What we did know was that Apache was much, much faster than PHP at serving static content. So we got to thinking: only one URL is going to be accessed, nothing else. If that URL were in fact a static HTML file, everything should work. And then we realized it can be. We opened that one page in the browser, did File, Save As, saved it as HTML, and wrote a simple rule in Apache that said: if a user visits this URL, just return this static file. That's it. Once we did that, the load immediately dropped and the site performed extremely well. Problem solved, except we now had static content that wasn't changing. So we ended up watching the game live, and whenever something happened, a goal, a red card, anything else, we manually edited the HTML file, saved it and uploaded it to the server so it would get refreshed. But it worked.

That got us thinking: obviously we need something else. We have really big peaks, and manually editing HTML is not something we want to do for every big match. That's when the cloud came into play. This was 2013, and not a lot of people were using the cloud yet. I went to the founders and said, look, this is the best thing we can do for our use case: we can scale up when we need to and scale down when we don't. They said yes, and the only real game in town at the time was AWS, so that's what we chose.

Looking back at the story of serving static content, which was so much faster than PHP, we also found out about Varnish. Varnish is a reverse HTTP caching proxy written in C, and it is extremely fast. It respects the Cache-Control headers and basically stores your full page cache in memory. If you're going to remember anything from this talk, it's this: Varnish is the single most important piece of infrastructure that we have. Once we started using it, the whole app was blazingly fast. It also solves one of the difficult problems I mentioned earlier, request coalescing: if something is not in the cache and a lot of requests come in for the same content, Varnish will let only one request go to your origin server. All the rest are queued up, and once the response arrives, every queued request gets that same response, which is great.

The cloud setup itself was pretty straightforward. We had an image with Varnish and the backend server baked in, an auto scaling group, a couple of workers and a database, and we got rid of memcached because Varnish made it unnecessary. Pretty straightforward stuff.
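The exact Apache rule from back then isn't in the talk, but the idea was a one-liner along these lines; the URL and file path are made up for illustration:

```apache
# Hypothetical sketch: short-circuit the one hot URL to a hand-saved
# HTML snapshot so the request never reaches PHP at all.
RewriteEngine On
RewriteRule ^/livescore$ /static/livescore.html [L]
```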
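As for Varnish, a minimal configuration in that spirit fits in a few lines. The backend address and the fallback TTL below are assumptions for illustration, not the production settings, and request coalescing itself needs no configuration at all; it is Varnish's default behaviour.

```vcl
vcl 4.1;

# Hypothetical origin; Varnish sits in front and caches full responses.
backend app {
    .host = "127.0.0.1";
    .port = "8080";
}

sub vcl_backend_response {
    # Respect Cache-Control / s-maxage from the app, but fall back to a
    # short TTL so even "uncacheable" hot pages absorb traffic bursts.
    if (beresp.ttl <= 0s) {
        set beresp.ttl = 1s;
    }
}
```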
The reason we could move to the cloud so easily is that the backend server itself was stateless, so it was simple to port and to put behind an auto scaling group. Wherever possible, you should be stateless: scaling state is extremely difficult, and statelessness is something you should design in early in the project.

We also like to try out new technologies, so we switched from MySQL to MongoDB, because, you know, web scale. It worked really well for a while, and then the issues started. We got replication problems that were really difficult to debug, data type errors, and data that was hard to analyze. We had locking issues, because in those days changing a single document in a collection meant locking the whole collection, and we do a lot of updates. And there were no foreign keys, which was a disaster, because the whole project is basically about analyzing and connecting different parts of the data. We came to the conclusion that MongoDB was not a great choice for us and that we needed to switch back to a relational database.

That's when we found Postgres, which is what we still use today, because it is amazing. Why Postgres? It's open source, it has advanced SQL capabilities, and it has minimal locking thanks to a mechanism called MVCC, multi-version concurrency control, which could be a talk all by itself. What it allows you to do is read while you are updating at the same time, without any locking. And it has great performance. How great? On our machine we can get 1 million queries per second and 90,000 transactions per second, where one transaction in the benchmark is five statements that change state, so at peak we can do roughly half a million updates.

We decided to switch from MongoDB to Postgres, but we had to do it live. We have users all over the world and couldn't afford any downtime; in my mind it felt like changing the tires while driving down the highway. We decided to do it anyway, and it turned out not to be that difficult. We timestamped all of the collections, all of the data, and wrote a tool that replicated from MongoDB to Postgres in iterations. Every iteration transferred the changes since the previous run, and eventually the two databases were in sync. What we lost was MongoDB's simple replication configuration and leader election, which work great; what we gained was stability, foreign keys and analytical power, which matters a lot to us. It's important to recognize the right tool for the job and to change it if necessary.
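The migration tool itself isn't public, but the iterate-until-in-sync loop described above can be sketched roughly like this; the collection name, table schema, connection strings and helper names are all hypothetical:

```python
from datetime import datetime, timezone

import psycopg2
from pymongo import MongoClient

# Hypothetical connections and schema; the real tool and names differ.
mongo = MongoClient("mongodb://localhost:27017")["livescore"]
pg = psycopg2.connect("dbname=livescore")

UPSERT = (
    "INSERT INTO events (id, home, away, updated_at) "
    "VALUES (%(_id)s, %(home)s, %(away)s, %(updated_at)s) "
    "ON CONFLICT (id) DO UPDATE SET "
    "home = EXCLUDED.home, away = EXCLUDED.away, updated_at = EXCLUDED.updated_at"
)

def sync_since(last_run):
    """Copy every document touched since the previous pass; return how many."""
    docs = list(mongo["events"].find({"updated_at": {"$gt": last_run}}))
    with pg, pg.cursor() as cur:
        for doc in docs:
            cur.execute(UPSERT, doc)
    return len(docs)

# Each pass only transfers the delta, so passes get shorter and shorter
# until both databases are effectively in sync and you can cut over.
last_run = datetime(1970, 1, 1, tzinfo=timezone.utc)
while True:
    started = datetime.now(timezone.utc)
    if sync_since(last_run) == 0:
        break
    last_run = started
```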
A small digression while we're on the subject of hype and cool new things: people often ask me what I think about microservices. We don't have a microservice architecture. We do have a lot of services that each do a specific thing, but in my opinion you probably don't need microservices. Unless you have a really big team or a really complex app, you're just making your life more complicated than it needs to be. A lot of the same problems can be solved with plain asynchronous communication, which is easier to manage, and it works really well for us.

What we did implement early on is this: everything that is slow, and everything that doesn't need to happen right away, is put in a queue. Workers then consume that queue, and you can scale producers and consumers independently. We use Beanstalkd, which is a job queue, but any message queue works. It's really important to queue everything that is slow, because it lets you scale your system much more easily.

Going back to the cloud: we had an auto scaling group that triggered once CPU got high. The problem is that the more you scale, the worse the cache gets, because in our case the caching layer was baked into the same image as the backend server. Every cache is on its own, so if you have one machine, the backend gets one request per resource, but if you have twenty, each of them fetches the same thing independently, which is not great. The solution isn't hard: separate your caching layer from your backend services, which is what we did, and you obviously need at least two caching machines to avoid a single point of failure.

This works really well, but the number of requests for a single resource will still equal the number of caching machines, because they don't talk to each other: with two machines in the caching layer, the backend gets two requests. That isn't really necessary, and I have a really big desire to optimize everything; it can be one request. The way to solve it is sharding: if you shard your caching servers so that a single server is always responsible for a specific URL, you can scale your caching layer linearly and add as much memory as you need.

We actually do this in a slightly more complicated way. Instead of a random load balancer we have two layers of Varnish: the first layer load balances, and it needs a lot of CPU and a lot of bandwidth, and then consistent hashing sends the request to exactly one machine in the second layer, which then talks to the backend. This gives us a really high hit ratio and lets us cache everything. We can even do database upgrades without people noticing, because most of the important stuff is in the cache; we can shut things down for half an hour and nobody will notice.

But cache invalidation is hard. That's why we built an internal library for PHP, which is actually open sourced, that builds a graph of dependencies. You specify which endpoint depends on which entities, and once an entity changes, we invalidate exactly the URLs that are affected.

There are also some really useful Cache-Control headers. max-age is for the end client, and s-maxage is for Varnish or any intermediate caching server; they tell it how long a response should stay in the cache. A really cool directive is stale-while-revalidate, which returns the old version of the resource while fetching the new one in the background, which is really useful if you have slow endpoints. And stale-if-error lets you return the last response you have if the backend services aren't currently healthy.
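To illustrate the consistent-hashing layer described a moment ago (a generic sketch, not the production config), Varnish's shard director can route by URL so that each second-layer cache owns a stable slice of the URL space:

```vcl
vcl 4.1;

import directors;

# Second-layer caches; hosts are hypothetical.
backend cache1 { .host = "10.0.0.11"; .port = "80"; }
backend cache2 { .host = "10.0.0.12"; .port = "80"; }

sub vcl_init {
    new url_shard = directors.shard();
    url_shard.add_backend(cache1);
    url_shard.add_backend(cache2);
    url_shard.reconfigure();
}

sub vcl_recv {
    # Consistent hashing on the URL: the same path always lands on the
    # same second-layer cache, so each object is fetched only once.
    set req.backend_hint = url_shard.backend(by=URL);
}
```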
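As an example of those directives (the values are illustrative, not our real ones), a response header like this tells browsers to keep a copy for 10 seconds, lets Varnish or a CDN keep it for 60, and allows stale copies to be served while a fresh one is fetched or while the backend is down. How strictly the last two directives are honoured depends on the proxy or CDN in front of you.

```http
HTTP/1.1 200 OK
Content-Type: application/json
Cache-Control: max-age=10, s-maxage=60, stale-while-revalidate=30, stale-if-error=86400
```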
With headers like these, you basically have an always-on site.

Now, AWS was great, and I love the cloud. The problem with all cloud services is that they're really expensive, and traffic is especially expensive. We got to the point where we were paying more for traffic than for compute, which really bothered me; it makes no sense to pay that much for traffic. So we got to thinking, and we saw that if you rent bare metal servers, you usually get a certain number of terabytes of traffic included, either free or really cheap. Since our caching layer was separate from everything else, we could simply move it out of the cloud. And that's what we did: we rented a couple of machines, set up our Varnish instances on them, and pointed them at the cloud. A really simple change, but it got us a 10x reduction in bandwidth costs and cut our AWS bill by a huge margin.

Then we did a back-of-the-envelope calculation: what would happen if we moved the other services from the cloud to bare metal as well? Our conclusion was that we could over-provision the system by a factor of two and still see a 5x reduction in price. So we migrated everything from the cloud into a data center. Now, at the start of this talk I said we had really big peaks and needed to scale up and down on short notice. The thing is, when you have a really good caching system, the cache layer buffers and flattens those peaks. Once we figured out the caching layer, the peaks that actually reached the backend weren't that big anymore, maybe a 2x increase at the very worst. The caching servers can see 20x peaks, but the backend servers don't, and that's why we were able to migrate. We switched from AWS instances to Docker containers, installed everything ourselves, and the reduction in price was really big.

This is a graph of our infrastructure cost over time. It might not look like much, just a slow and steady decline, but if you overlay it with the number of users we had in the same period, you can see the user count growing a lot while the price goes down or stays the same. There's one other benefit of being on bare metal: you avoid accidental, really big spikes in cost. This is a real screenshot from a friend of mine: a developer introduced a bug that wasn't detected until it was too late, and in a short amount of time their bill jumped from 360 thousand to 2 million dollars. Off the cloud, your app would simply stop working instead, and that's the trade-off. For some, it's important to always be on, even if it costs millions; for others, paying a couple of million for a mistake is too much.

Also, when you're not on the cloud, RAM is really, really cheap, so we have a lot of RAM to spare, and we use it for the cache. The second layer of Varnish that I talked about runs on those machines and uses all of that spare RAM, so basically everything we serve can be kept in the cache and always be available.
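Putting that spare RAM to work is mostly a matter of how the cache daemon is started; a hypothetical invocation (the size is made up) looks like this:

```sh
# Give Varnish a large chunk of the machine's otherwise idle RAM
# as an in-memory object store.
varnishd -a :80 -f /etc/varnish/default.vcl -s malloc,200G
```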
People also ask me whether we have to pay for more people to do DevOps and manage the infrastructure, and in our case that's simply not true. Until about four years ago we had one person managing everything; now it's a team of five, and they can manage it alongside their regular work. Basically, our backend engineers spend part of their time on the infrastructure and everything works perfectly, but you do have to keep the infrastructure really well optimized and know what you're doing.

We have really big peaks. This graph is from our caching layer, and you can clearly see the first and second half of a football match. Those peaks are big because we send out a lot of push notifications; we try to send as many as we can in as short a time as possible so people get their results faster. What happens is that people pick up their phones, tap the notification, and open the app all at the same time, and that looks a lot like a DDoS attack. Cloudflare, which we were using at the time, detected it as exactly that, and we had problems because people couldn't open the app. We actually had to go to their office and talk to their engineers to explain what was happening, and the answer was: oh yes, you're definitely triggering the DDoS protection system. They added special thresholds just for our site so the protection isn't as aggressive, and that fixed the problem.

It's also important to monitor everything you have, and the best thing you can have is application performance monitoring. With an APM you know which endpoints you need to optimize, because it doesn't matter that something is slow if it's never called, and something that is fast but called millions of times might still be worth optimizing. The rule is to always optimize the most time-consuming thing, the one that is both called a lot and slow. You can get really good results this way, and any APM works; this screenshot happens to be from New Relic, but that's not important.

Sometimes an endpoint is as optimized as it can get and there's nothing more to squeeze out of it. In those cases you can usually increase the cache TTL, and the relationship between the TTL and the number of requests that reach your servers is not linear: a small increase in TTL can lead to a large drop in the requests that hit your backend, making everything faster and cheaper. So: monitor everything, optimize something. There's a good quote that premature optimization is the root of all evil, and I agree, but you should monitor so that you know what to optimize, especially with microservices, where the system can get really complex really fast.

Another question I get is: what happens when the number of users exceeds your current capacity, since you're not on the cloud? That's a completely legitimate question. We rent dedicated machines; sometimes we can get new ones in a couple of minutes, sometimes it takes days, so it's not a scale-up-in-seconds kind of setup. What we do instead is keep the dedicated machines, and if the load gets too high we spin up virtual machines, those virtual machines join the cluster, the traffic gets distributed to them, and everything works. This almost never happens because the system is over-provisioned, but it's an option that we have and sometimes use. It's kind of a hybrid cloud approach.
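To make the earlier point about TTLs concrete: with request coalescing in place, the origin sees at most one request per caching node per TTL for a given URL, no matter how many clients are asking. A rough back-of-the-envelope, with invented numbers:

```python
def origin_rps(cache_nodes: int, ttl_seconds: float) -> float:
    """Worst-case origin requests per second for one hot URL."""
    return cache_nodes / ttl_seconds

# Two caching nodes, clients hammering a single URL:
print(origin_rps(2, 1))   # TTL 1s  -> 2.0 requests/s reach the backend
print(origin_rps(2, 5))   # TTL 5s  -> 0.4 requests/s
print(origin_rps(2, 30))  # TTL 30s -> ~0.07 requests/s
```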
Also, there's this thing called physics. Our data centers are in Europe, but we have users worldwide, and because of physics they get their responses more slowly; the speed of light is what limits us. We didn't know how big an impact this had, so I decided we should look into it. We emailed one of our users in Australia and asked if he could record himself using the app so we could see how well it performs and whether there were any issues. He replied: everything works great, no problems, everything is fast, love your app. But when we watched the video, we saw this. I didn't even know we had loaders in the app, because in Europe it's so fast that you never see them; everything is in the cache. In Australia there's about half a second of latency just to fetch the data from Europe, and for them that was normal, because all of the competition had the same issue; that was just the way things were.

We wanted to solve it, and we did: we distributed our caching servers around the world and used geo routing so every user fetches from the nearest cache. We got a really big reduction in latency for Australia, Brazil and the other countries outside Europe. Australia dropped from half a second to 80 milliseconds, which is a lot; it's the difference between seeing a loader and the delay being imperceptible. You might ask, shouldn't a CDN do that for you? Yes, it should. But you're not that CDN's only user; they have other clients, so your content might not be in their cache. We also do a lot of cache invalidation to keep the hit rate high, and not every CDN handles invalidation properly. Cloudflare had issues with it, which is why we built this ourselves; in the meantime we've switched to Fastly, which does invalidation really well, so it works. If you're having issues with your CDN, this might be something to look into.

Also, data centers can burn. Even if you're in the cloud, it's still just somebody else's machine, and data centers can burn. This is a screenshot from OVH, where one of their data centers burned down. We use OVH; fortunately it wasn't our data center, but it got us thinking, and at that point we knew we needed to be in multiple data centers to avoid problems with fires or network outages. So we moved to multiple DCs. The problem then is managing them and having them communicate with each other, and that's why we started using Kubernetes. Kubernetes is multi-DC aware for us, services know where the other services are, and that gives us a mechanism that is robust and can heal itself. Basically, we get cloud-like capabilities on bare metal, at bare metal prices. The only hard thing to solve with Kubernetes on bare metal is state, because in the cloud that's solved for you by the managed components, but on bare metal it's more difficult. We use Longhorn for that: Longhorn gives you distributed volumes, so a volume is replicated to another data center, and if a pod fails it can be rescheduled to a different data center and still find its data intact.
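Longhorn plugs into Kubernetes as an ordinary StorageClass; a minimal sketch (the names and values are illustrative, not our production setup) might look like this:

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: longhorn-replicated
provisioner: driver.longhorn.io
parameters:
  numberOfReplicas: "3"        # keep copies on several nodes so a rescheduled pod still finds its data
  staleReplicaTimeout: "2880"  # minutes before an unreachable replica is considered dead and cleaned up
```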
We also do a lot of real-time updates. Usually this is done via some sort of PubSub: the apps pull all of the data via REST and then subscribe to a PubSub server to receive changes in real time. We were doing this ourselves with a custom piece of software, but it just kept getting more and more complicated. Then we found out about NATS. NATS is an open source messaging system, a PubSub server written in Go. It's free, it supports clustering out of the box, and it works perfectly. On just a couple of servers we handle more than 800,000 connections at peak, and we push more than half a million messages per second without any issues.

We also have a lot of data. All of the clicks in the apps are tracked via Firebase, anonymized and exported so we can download them, and we have more than two petabytes of it. If you're in the cloud, it's a no-brainer: you just use the managed services, which are expensive but solve the problem for you. On bare metal, two petabytes of data is really hard to manage. We use ClickHouse for that, with Apache Superset for visualization. ClickHouse is a column-oriented SQL database specifically designed for analytical workloads. Our two petabytes of data compress down to just 50 terabytes, we have more than 1.4 trillion rows, and we can do more than 2 million inserts per second. That's how fast it is, and it lets us build really complex systems that query a lot of data. This is a screenshot from our anti-scraping system, which takes something like 5 billion rows, runs some operations over them, and tells us which IPs to ban. And the interface to ClickHouse through Superset is really great.

Using everything I've talked about, we have a system that is really robust. At peak we've had almost 2 million users in the Google real-time overview without any issues, with everything just working automatically. And the reason we can afford not to be on the cloud is cost: all of this infrastructure, including the cloud services we do use, the CDNs, the servers, everything needed for production and development, comes to 0.6 percent of revenue.

So, the take-homes. Cache all the things; this is the most important one. People usually say, yeah, but I have personalization, I can't cache. Yes, you can; there are techniques for caching personalized content, and we have a lot of it in the app. Be stateless wherever possible, because it will let you scale more easily. Know how your users interact with your app. Recognize the right tool for the job, and change it if necessary. Cloud should be the default; I love the cloud and think it's one of the most important technologies we have, but keep your options open, especially if your monthly bill is high. Queue everything that is slow. Monitor everything, optimize something. And try to keep it simple for as long as you can.

If you have any questions, I'll be happy to answer them. This is my LinkedIn, feel free to add me. Hopefully you learned something new today and this was fun for you. Thank you.
...

Josip Stuhli

CTO @ Sofascore



