Transcript
Today we're going to be talking about scaling from zero to 25 million.
My name is Josip.
I'm the CTO at a company called Sofascore.
We do a live score app that covers a lot of sports.
We have a lot of data, a lot of different providers, which we then aggregate
and give out to our users for free.
We have more than 28 million monthly active users.
They generate 1.4 petabytes of traffic every month, that's 400 billion requests, and the peak we had was 1.6 million requests per second.
So when we first started, we had a lot of issues with the servers breaking down whenever there were a lot of users.
As soon as a peak happened, the servers went offline, and the way you solve that is by adding a cache.
So we added memcached, and in fact it was a full page cache.
When a user came to the website, if the result was not in the cache, it was calculated, stored in the cache, and returned to the user.
And this works well.
This is one of the simplest things you can do to have a system that can
handle more than a couple of users.
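As a rough sketch, a full page cache like that boils down to a few lines (this assumes the standard PHP Memcached extension; the key scheme, TTL, and render function are illustrative, not the actual code):

```php
<?php
// Minimal cache-aside full page cache (illustrative sketch).
$cache = new Memcached();
$cache->addServer('127.0.0.1', 11211);

$key  = 'page:' . md5($_SERVER['REQUEST_URI']);
$html = $cache->get($key);

if ($html === false) {
    // Cache miss: render the page the expensive way...
    $html = renderPage($_SERVER['REQUEST_URI']); // hypothetical render function
    // ...and store it for the next visitor (TTL in seconds is illustrative).
    $cache->set($key, $html, 60);
}

echo $html;
```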
But the problem with that approach is called the cache stampede problem.
If something is not in the cache and you have a lot of users trying to fetch the same thing, all of those requests will go to the back end, to your database, because it's not in the cache.
And basically your database or back end servers will die.
What you want is to have all of those requests waiting and only one request going to your database or backend service.
And this is not an easy thing to solve, because if you have more servers, you have a distributed system, and having those requests wait is not a simple thing to do,
especially in the early days when we ourselves weren't really all that savvy
in dealing with distributed systems.
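For context, one common single-node way to get that "only one request recomputes" behaviour is to use an atomic cache add as a lock; the hard part is doing the same thing correctly across many servers. A sketch, with a hypothetical compute callback, not what we actually did:

```php
<?php
// Sketch of lock-based stampede protection on a single cache node.
// Memcached::add() only succeeds if the key does not exist yet (atomic),
// so only one request wins the lock and recomputes; the rest wait and retry.
function getWithLock(Memcached $cache, string $key, callable $compute, int $ttl = 60)
{
    $value = $cache->get($key);
    if ($value !== false) {
        return $value;
    }

    if ($cache->add($key . ':lock', 1, 10)) {     // winner recomputes
        $value = $compute();
        $cache->set($key, $value, $ttl);
        $cache->delete($key . ':lock');
        return $value;
    }

    usleep(100000);                               // loser waits briefly and retries
    return getWithLock($cache, $key, $compute, $ttl);
}
```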
So what we did instead was simpler: we knew how our users were using the app, and since most users actually want to view live matches, we knew that the matches that are currently live are the ones that are going to be looked at.
We wanted to have those in the cache.
So we ended up creating a worker that would fetch all of the games that were played in the last couple of hours or will be played in the next couple of hours, and proactively put the results in the cache.
So every time somebody was trying to fetch a game that should be live, we were sure to have it in the cache.
It's really important to know how your users interact with your app and your system.
And proactive caching is actually something that all of the big companies are doing.
If you look at Facebook, Instagram, and so on, they all have your results prepared for you.
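A cache-warming worker of that kind could look something like this (a sketch; the function names and the time window are made up for illustration):

```php
<?php
// Sketch of a cache-warming worker: pre-render every match that is live
// or about to be live, so user requests never miss the cache.
$cache = new Memcached();
$cache->addServer('127.0.0.1', 11211);

while (true) {
    // Hypothetical query: matches from 2 hours ago to 2 hours ahead.
    $matches = fetchMatchesBetween(time() - 2 * 3600, time() + 2 * 3600);

    foreach ($matches as $match) {
        $key  = 'page:match:' . $match['id'];
        $html = renderMatchPage($match);          // hypothetical renderer
        $cache->set($key, $html, 120);            // refreshed on every loop anyway
    }

    sleep(10);                                    // re-warm every few seconds
}
```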
We also have a really interesting pattern in our system: before big games happen, people usually start opening the app about 15 minutes before kickoff.
And this was especially true in the early days.
So in the early days, whenever big games were being played, 15 minutes before we would have a lot of users.
In those 15 minutes, the number of users would usually multiply by 3 or 4, so the traffic on the site would roughly quadruple.
That was really useful knowledge, because we knew how the system was going to behave.
So, 15 minutes before a really big and important game called El Clasico.
For those of you who don't follow sports, that's when two of the most popular teams in the world play against each other.
So 15 minutes before the game, our CPU on the server was going through the roof.
At that time, we only had two servers, one for the back end service and one for the database, and they were basically dying 15 minutes before kickoff.
We knew that if we didn't do anything, the whole site would crash, which is not really something you want during a really big event like that.
That's the event that brings in the most money during the year.
And we did have a cache, and we did have proactive caching, but still the back end app needs to be booted, the request has to go through all of the steps that need to happen, then fetch from the cache and return.
We were running PHP, and PHP is slow, and there's not a whole lot you can do in that case.
But what we did know is that Apache was much, much faster than PHP at serving static content.
So we got to thinking: we only have one URL that's going to be accessed, nothing else.
If that URL were in fact a static HTML file, everything should work.
And then we were like, well, it can be, right?
So we opened the page, just that one page, did File, Save As, saved it as HTML, and we wrote a simple rule in Apache that said: if users visit this URL, just return this static file.
That's it.
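The rule itself can be as small as this (an Apache sketch with a made-up URL and path, not the actual ones used):

```apache
# Serve a hand-saved static snapshot for the one hot URL, so Apache never touches PHP.
RewriteEngine On
RewriteRule ^/el-clasico$ /static/el-clasico.html [L]
```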
And once we did that, the load immediately dropped and everything performed extremely well.
So problem solved. But we had an issue, because now that was static content.
It wasn't changing.
Right?
So we ended up watching the game live, and whenever something happened, a goal, a red card, anything else, we had to manually edit the HTML file, save it, and upload it to the server so that it got refreshed.
But it worked.
That got us thinking: obviously we need something else.
We have really big peaks, and doing this manually is not something we want to do for all of the big matches.
And that's when cloud came into play.
So this was 2013.
Not a lot of people were using the cloud.
And I went to the founders and I said, look, this is the best thing that we can do for our use case.
We can scale up when we need to.
We can scale down when we don't need to.
And they said yes.
And the only game in town was basically AWS.
So that's what we chose.
Also, looking back at the story I told, where we had a server serving static content, which was really fast, much faster than PHP, we found out about Varnish.
Varnish is a reverse HTTP caching proxy written in C, which is extremely fast.
It respects the Cache-Control headers and basically stores your full page responses in its cache.
And if you're going to remember anything from this talk, it's this: this is the single most important piece of infrastructure that we have.
Once we started using it, the whole app was just blazingly fast.
It also solves one of the difficult problems I was talking about earlier, which is request coalescing.
If something is not in the cache and you get a lot of requests for the same content, Varnish will only let one request go to your origin server.
All of the rest will be queued up, and once the response is received, all of those requests will get the same response, which is great.
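A minimal Varnish setup really is just a few lines of VCL in front of the origin (a sketch; the host and port are placeholders), and the request coalescing described above is default behaviour, with no extra configuration:

```vcl
vcl 4.1;

# Point Varnish at the PHP backend; caching of cacheable responses and
# request coalescing on misses come out of the box.
backend default {
    .host = "127.0.0.1";
    .port = "8080";
}
```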
So the cloud setup was pretty straightforward.
We had an image with Varnish and the backend server baked in.
We had an auto scaling group.
We got rid of memcached because it was no longer needed, thanks to Varnish.
We had a couple of workers and a database.
Pretty straightforward stuff.
And the reason we could do this pretty easily is that the server itself was stateless, so it was easy to port to the cloud and easy to run in an auto scaling group.
So it's really important: wherever possible, you should be stateless.
Scaling state is extremely difficult, and this is something that should be decided early in the project.
We also like to try out new technologies.
So we switched from MySQL to MongoDB, because, you know, web scale.
And it was working really great for a time.
And then we started running into issues.
We got replication issues, which were really difficult to debug.
We had data type errors.
It was really difficult to analyze data.
We had issues with locking, because in those days, if you were to change a single document in a collection, the whole collection would need to be locked.
And we have a lot of updates, so that was really difficult for us.
And also no foreign keys, which was just a disaster, because the whole project is basically about analyzing and connecting different parts of the data.
So we came to the conclusion that this was not a great choice for us, and we needed to switch back to a relational database.
And then we found Postgres, which is what we are still using today, because it is amazing.
So why Postgres?
Because it's open source.
It has advanced SQL capabilities.
It has minimal locking, thanks to a mechanism called MVCC, which stands for multi-version concurrency control.
MVCC could be a talk all on its own, but basically what it allows you to do is read while you are updating at the same time, without any locking.
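A tiny illustration of what that means in practice (assuming Postgres defaults; the table is made up):

```sql
-- Session A: update a score inside an open transaction.
BEGIN;
UPDATE events SET home_score = 1 WHERE id = 42;
-- (not committed yet)

-- Session B, at the same time: this does NOT block, and it still
-- sees the previously committed value, thanks to MVCC.
SELECT home_score FROM events WHERE id = 42;
```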
And it has great performance.
How great?
On our machine, we can get 1 million queries per second and 90,000 transactions per second.
And one transaction in that benchmark is 5 things that change state, so basically we can do half a million updates in peak times.
So great performance, great capabilities.
We decided to switch from MongoDB to Postgres, but the issue was that we had to do it live.
We have users all over the world; we cannot have any downtime.
And in my mind, when we were trying to do this, it looked something like changing the tires while driving on a highway.
But we decided we needed to do it, and we did it.
And it was actually not that difficult.
What we ended up doing was timestamping all of the collections we have, all of the data, and then we wrote a tool that would replicate from MongoDB to Postgres and do so in iterations.
Every iteration would transfer all of the changes since the last run, and eventually we got to a point where the databases were in sync.
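In rough PHP, that replication tool worked something like this (a sketch under the assumption of an updated_at timestamp on every document; the helper functions are hypothetical):

```php
<?php
// Sketch of iterative MongoDB -> Postgres replication based on timestamps.
// Each pass copies only documents changed since the previous pass, so the
// gap shrinks until both databases are effectively in sync.
$lastRun = new DateTime('1970-01-01');

while (true) {
    $started = new DateTime();

    // Hypothetical helpers: read changed docs, upsert them relationally.
    $changed = fetchMongoDocsChangedSince('matches', $lastRun);
    foreach ($changed as $doc) {
        upsertIntoPostgres('matches', transformDocument($doc));
    }

    $lastRun = $started;

    if (count($changed) === 0) {
        break;   // nothing left to catch up on: databases are in sync
    }
    sleep(1);
}
```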
What we lost with this was the simple replication, configuration, and leader election that work great in MongoDB, but what we gained was stability, foreign keys, and analytical power, which is really important to us.
So it is important to recognize the right tool for the job and change it if necessary.
And just a small digression, since we were talking about hype and cool new things: a lot of times people ask me what I think about microservices.
We don't have a microservice architecture.
We do have a lot of services that do certain things.
But in my mind, you probably don't need them.
If you don't have a really big team or a really complex app, then you're just making your life more complicated than it should be.
A lot of problems can be solved by just using async communication.
It's an easier thing to manage, and it works really well for us.
So what we do have, and what we implemented early on, is that basically everything that is slow and everything that doesn't need to happen right away is put in a queue.
And then you have workers that just consume the queue, and you can scale those producers and consumers independently.
What we use is Beanstalkd, which is a job queue, but any message queue works.
It's really important to queue everything that is slow, because then you can scale your system more easily.
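As a sketch of the pattern (shown with the commonly used Pheanstalk client for beanstalkd; the exact method names depend on your client version, and the job payload and worker function are made up):

```php
<?php
// Producer: anything slow is pushed to a tube instead of being done inline.
$queue = Pheanstalk\Pheanstalk::create('127.0.0.1');
$queue->useTube('notifications');
$queue->put(json_encode(['match_id' => 42, 'event' => 'goal']));

// Consumer: a separate worker process, scaled independently of the producers.
$worker = Pheanstalk\Pheanstalk::create('127.0.0.1');
$worker->watch('notifications');

while ($job = $worker->reserve()) {
    $payload = json_decode($job->getData(), true);
    sendPushNotification($payload);   // hypothetical slow work
    $worker->delete($job);
}
```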
Going back to the cloud: we had an auto scaling group that would trigger once the CPU got high.
But the problem with this is that the more you scale, the worse the cache gets, because in our case we had the caching system baked into the image with the backend server.
So the more we scaled, the less effective the cache became, because every cache is on its own.
If you have one machine, you only get one request to the backend servers per resource.
But if you have 20, then all of those have to fetch their requests independently, which is not great.
The solution to this is not that hard: just separate your caching layer from your backend services.
So that's what we did.
And you obviously have to have at least two caching servers, to avoid single points of failure.
The problem with this is, I mean, it works really well, but the number of requests for one single resource will always be the number of caching machines you have.
So if you have two machines in the caching layer, they don't talk to each other, and you will get two requests to your backend services, which is not really needed, and I have a really big desire to optimize everything.
This can be one request, and the way to solve it is by using sharding.
If you shard your caching servers so that a single server is always responsible for a specific URL, then you can scale your caching layer linearly and basically get as much memory as you need.
And we do this, although we actually did it in a slightly more complicated way.
Instead of having a random load balancer, we have two layers of Varnish.
The first layer does the load balancing, and it has to have a lot of CPU and a lot of bandwidth.
And then we use consistent hashing to hit just one machine in the second layer.
And then it goes to the backend.
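The core idea of the first layer is just deterministic URL-based routing: hash the URL and always send it to the same second-layer cache. A simplified PHP sketch of that idea (not our actual VCL; real consistent hashing, as Varnish's shard director provides, additionally keeps most keys on the same node when caches are added or removed):

```php
<?php
// Sketch: route every URL deterministically to one second-layer cache node,
// so each resource is cached (and fetched from the backend) exactly once.
function pickCacheNode(string $url, array $nodes): string
{
    $hash = crc32($url);                       // any stable hash works
    return $nodes[$hash % count($nodes)];
}

$nodes = ['cache-1.internal', 'cache-2.internal', 'cache-3.internal'];
echo pickCacheNode('/api/v1/event/12345', $nodes);   // always the same node
```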
So basically, this two-layer setup allows us to have a really high hit rate, and we can cache everything.
We can even do database updates without people noticing, because most of the important stuff is in the cache, and we can just shut down for half an hour and nobody is going to notice.
But cache invalidation is hard.
That's why we built an internal library for PHP, which is actually open sourced, that builds out a graph of dependencies.
You can specify which endpoint depends on which entities, and once an entity changes, we can invalidate just the URLs that matter.
There are also cool Cache-Control headers.
So max-age is for the end client, and s-maxage is for Varnish or any intermediate caching servers; it basically tells them how long the response should stay in the cache.
A really cool header is stale-while-revalidate, which will essentially return the old version of the resource while fetching the new one in the background, which is really useful if you have slow endpoints.
stale-if-error will allow you to return the last response you have if the backend services are not currently healthy, which is really cool.
You basically have an always-on site.
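Put together, a backend response might send something like this (the values are illustrative, not the ones actually used, and they only help if the intermediate cache honors these directives):

```php
<?php
// Browser caches for 0s, the caching layer for 10s; serve a stale copy for
// up to 30s while refreshing in the background, and for up to a day if the
// backend is returning errors.
header('Cache-Control: max-age=0, s-maxage=10, stale-while-revalidate=30, stale-if-error=86400');
```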
Now, AWS was great.
Love the cloud.
The problem with all the cloud services is that they're really expensive, and traffic is especially expensive.
We got to a point where we were paying more for traffic than we were paying for compute, which is something that really bothered me.
It makes no sense to pay so much for traffic.
So we got to thinking, and we saw that if you rent bare metal servers, you usually have a certain number of terabytes included with your server, and it's either free or really, really cheap.
So, since our caching layer is separate from everything else, we could basically just move it out of the cloud, and that's what we did.
We rented a couple of machines, set up our Varnish instances on them, and pointed them to the cloud.
A really simple change, but it got us a 10x reduction in bandwidth, which was a lot.
So we cut our AWS bill by a huge margin.
And then we did a back of the napkin calculation: what would happen if we moved other services from the cloud to bare metal?
Our calculation was that we could overprovision the system by double and still see a 5x reduction in price.
So we migrated everything from the cloud to a data center.
Now, when I started this talk, I said that we have really big peaks and that we need to scale up and down in a short amount of time.
The thing is, when you have a really good caching system, the cache layer will buffer and flatten out those peaks.
So once we figured out the caching layer, the peaks that reach the backend weren't really that big anymore: even the biggest ones were just a 2x increase.
The caching servers can get 20x peaks, but the backend servers will not.
That's why we were able to migrate.
We switched from AWS instances to Docker containers, and we installed everything ourselves.
And the reduction in price was really big.
This is the graph of our infrastructure cost over time.
It might not seem like much; there is a slow and steady decline in price.
But if you overlay it with the number of users we had in that period, you can see that the number of users is growing a lot, while the price is going down or staying the same.
There's also one other benefit of being on bare metal: you avoid accidental, really big spikes in price.
This is a true screenshot from a friend of mine.
They had a developer introduce a bug, which was not detected until it was too late, and in just a short amount of time their bill jumped from 360k to 2 million.
On the other hand, if you're off the cloud and you run out of capacity, your app will simply not work.
And that's a trade off.
For some, it's important to always be on, and they'll pay millions.
For others, paying a couple of million for a mistake is too much.
Also, when you're not on the cloud, RAM is really, really cheap.
We have a lot of RAM to spare, and what we do is utilize it for the cache.
So the second layer of Varnish that I talked about runs on those machines and uses all of the spare RAM, and basically everything we have can be stored in the cache and be always on.
And usually people ask me: OK, but surely you have higher people costs, you have to pay more people to do DevOps and infrastructure.
In our case, that's simply not true.
Up until about four years ago, we had only one guy managing everything.
Now it's a team of five people, and they can manage it in their spare time.
So basically, for DevOps, we have people from the backend working part of the time on the infrastructure, and everything works perfectly.
But you have to have your infrastructure really optimized and know what you're doing.
We have really big peaks.
This is from our caching layer, and you can see the first and second half of football matches really clearly.
Those peaks are really big because we send out a lot of push notifications.
We try to send out as many push notifications as we can in a short amount of time, so that people get their results faster.
And what happens is people pick up their phone, click on the notification, and basically they all open the app at the same time.
And it kind of looks like a DDoS attack, right?
So Cloudflare, which we were using at the time, was detecting this as a DDoS attack, and we were having issues because people couldn't open the app; they were being treated as a DDoS.
So we actually had to go to their office to talk with their engineers and explain what kind of issues we were having, and the response was: oh yeah, you're surely triggering the DDoS protection system.
And they added special values just for our site, so the DDoS protection wasn't as aggressive as it usually is, and that fixed our problem.
It's also important to monitor everything you have, and the best thing you can have is application performance monitoring.
By having this, you know which endpoints you need to optimize, because it doesn't matter that something is slow if it's not being called.
Also, if something is fast but it's being called millions of times, then it might make sense to optimize it.
So you should always optimize the most time consuming endpoints: the ones that are called a lot and are slow.
And you can get really good results by doing this.
Any APM works; this is a screenshot from New Relic, but that's not important.
Sometimes an endpoint is as optimal as it can be and you cannot optimize it any more.
In those cases, you can usually increase the cache TTL, and the correlation between the cache TTL and the number of requests that reach the server is not linear.
So small increases in the TTL of the cache can lead to large gains in the number of requests that go to your service.
If you increase your cache just by a little bit, you can get a really big decrease in the traffic to your backend service and make everything faster and cheaper.
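To put rough numbers on that (illustrative, not from the talk): for a URL requested 200 times per second, a 1-second TTL sends about one request per second per cache node to the origin, while a 30-second TTL sends about two per minute, roughly a 30x drop in origin traffic for a barely noticeable change in freshness.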
So you should monitor everything and optimize something.
There is a good quote that premature optimization is the root of all evil, which I agree with, but you should monitor so that you know what to optimize, especially if you have microservices, because that kind of system can get really complex really fast.
Another question I get is: what happens when the number of users exceeds your current capacity, since you're not on the cloud?
Which is a completely legitimate question that people usually ask first.
Yes, we are not on the cloud; we rent dedicated machines.
Sometimes we can get new machines in a couple of minutes, sometimes it takes days, so it's not really a scale-up-fast kind of thing.
So what we do instead is: we have dedicated machines, and if the load gets too high, we spin up virtual machines, and those virtual machines join the cluster.
The traffic gets distributed to them, and everything works.
This almost never happens, because the system is overprovisioned, but it's an option that we have and sometimes use.
So it's kind of a hybrid cloud approach.
Also, there's this thing called physics.
We have our data centers in Europe, but we have users worldwide, and due to physics they get their responses slower, because the speed of light is what hampers us.
We didn't know how big an impact this was, and I decided we needed to look into it.
So we sent an email to one of our users in Australia and asked if he could record the app, so that we could see how well the app was performing and whether there were any issues.
And he replied: yeah, everything works great, no problem, everything is fast, love your app.
And when we were watching the video, we saw this.
I didn't know that we had loaders in the app, because in Europe it's so fast that you don't even see the loader, because everything is in the cache.
But in Australia, they have half a second of latency just to fetch the data from Europe to Australia.
And this was normal for them, because all of the competition had the same issue; that was just the way things were.
But we wanted to solve this, and to solve it, we actually distributed our caching servers throughout the world and used geo routing to fetch from the nearest cache.
And we got a really big reduction in latency for Australia, Brazil, and all of the countries outside of Europe.
Australia dropped from half a second to 80 milliseconds, which is a lot; it's the difference between seeing a loader and it being imperceptible.
And you might ask: shouldn't a CDN do that?
Well, yes, it should.
But you are not the only user of that CDN; they have other clients, so maybe your content is not in the cache.
Also, we do a lot of cache invalidations so that we can have a high hit rate, and the problem is that not all CDNs do invalidation properly.
Cloudflare had issues with invalidation, so we had to build it out ourselves.
In the meantime, we switched to Fastly, which has really good invalidation, so it works.
But if you're having issues with your CDN, this might be something to look into.
Also, data centers can burn.
Even if you're on the cloud, it's just somebody else's machine; data centers can burn.
This is a screenshot from OVH, where their data center burned.
We use OVH. Fortunately, it wasn't our data center.
But it got us thinking, and at that point we knew that we needed to be in multiple DCs to avoid issues with fires or network problems.
So we moved to multiple DCs.
The problem is managing those and having them communicate with each other.
And that's why we started using Kubernetes.
We use Kubernetes because it is multi-DC aware, so services know where other services are.
That gives us a mechanism that's robust and can heal itself, and basically we get cloud-like capabilities on bare metal, at bare metal prices.
The only thing that's difficult to solve with bare metal and Kubernetes is state, because in the cloud it's solved by default through the cloud components, but on bare metal it is more difficult.
So we use Longhorn for that.
Longhorn basically gives you distributed volumes: you have a volume, and it's replicated to another DC, so if your pod fails, it can be rescheduled to a different data center and still have its data, safely.
We also do a lot of real time updates.
Usually this is done via some sort of pub/sub: the apps pull all of the data via REST, and then they subscribe to a PubSub server to fetch real time updates, to fetch changes.
We were doing this ourselves with a custom piece of software, but it just kept getting more and more complicated.
So we found out about NATS.
NATS is an open source messaging system, a PubSub server, written in Go.
It's free, it has support for clustering out of the box, and it works perfectly.
On just a couple of servers, we can get more than 800,000 connections at peak, and we do more than half a million messages per second without any issues.
We also have a lot of data.
All of the clicks in the apps are tracked via Firebase; they're anonymized and exported so that we can download them.
And we have more than two petabytes of data.
If you're in the cloud, it's a no brainer: you can just use cloud services. They are expensive, but they solve your issues.
If you're on bare metal, two petabytes of data is really hard to manage.
To do that, we use ClickHouse, and Apache Superset to visualize it.
ClickHouse is basically a SQL database that's column oriented, specifically designed for analytical workloads.
So we have two petabytes of data compressed down to just 50 terabytes.
We have more than 1.4 trillion rows, and we can do more than 2 million inserts per second. That's how fast it is.
And that allows us to build really complex systems that query a lot of data.
This is a screenshot from our anti-scraping system, which takes something like 5 billion rows, does some operations, and then we know which IPs to ban.
And the interface to ClickHouse through Superset is really great.
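For a flavour of what that looks like (an illustrative schema and query, not the actual ones), an anti-scraping style aggregation is just plain SQL over a column-oriented table:

```sql
-- Illustrative ClickHouse table for raw click/request events.
CREATE TABLE events
(
    event_time DateTime,
    ip         String,
    url        String
)
ENGINE = MergeTree
ORDER BY (event_time, ip);

-- Which IPs made an unreasonable number of requests in the last hour?
SELECT ip, count() AS requests
FROM events
WHERE event_time > now() - INTERVAL 1 HOUR
GROUP BY ip
HAVING requests > 100000
ORDER BY requests DESC
LIMIT 100;
```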
Using everything I've talked about, we have a system that's really robust.
At peak times we've had almost 2 million users in the Google real-time overview without any issues, just by having everything work automatically.
And the reason we are not on the cloud is cost: all of our infrastructure, including the cloud services we do still use, the CDNs, the servers, everything that's needed for production and development, comes to 0.6 percent of revenue.
So the take home from this would be: cache all the things.
This is the most important thing.
Usually people say, yeah, but I have personalization, we cannot cache.
Yes, you can.
There are techniques you can use to cache things; we have a lot of personalized content in the app.
Be stateless wherever possible, because it will allow you to scale more easily.
Know how your users interact with the app.
Recognize the right tool for the job and change it if necessary.
Cloud should be the default.
I love the cloud.
I think it's one of the most important technologies that we have.
But you should keep your options open, especially if your monthly bill is high.
Queue everything that is slow.
Monitor everything, optimize something, and try to keep it
simple for as long as you can.
If you have any questions, I'll be happy to answer them.
This is my LinkedIn.
Feel free to add me and hopefully you learned something new
today and this was fun for you.
Thank you.