Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hi everyone. First of all, I would like to give a big shout out to all the organizers of Conf42. Thanks a lot for the smooth conduct; it has been a fantastic experience for me. Amazing job, guys.
Hi, thanks for joining me. In this talk, I'll be talking about an experimentation platform at scale. I'm Hrithik Shwastav. I work as a lead software engineer at Gojek in the data engineering team, currently in a stream called experimentation. I'll be walking you through my experience of building an A/B testing platform at Gojek Tech, the challenges we faced, and the changes in architecture we made to cater to the new throughput, the new scale, which is right now around 2.1 or 2.2 million rpm. Let's proceed. Yeah, so this is the title of my talk, A/B Testing Platform at Scale.
Let's move to the next slide. Yeah, so the agenda of this talk will be like this: I'll explain what A/B testing is, and then we will talk about the objective, what the audience of this talk can learn from it. Then I'll explain Litmus, which is the name of the A/B testing platform at Gojek Tech, and shed some light on the old architecture from where we started building the platform. Then we will talk about the current scale, how the number of experiments, the number of clients, and the adoption of experimentation have increased in Gojek. And then we will talk about the new architecture and the changes we made in it to cater to the new scale. So what is A/B testing?
A/B testing is basically a randomized experimentation process wherein two or more versions of a feature are shown to different segments of users at the same time. After running the experiment for some days or weeks, we analyze the data, and this data helps us determine which version leaves the maximum impact and drives the business metrics. Let's understand this concept with an example: the allocation algorithm. The allocation algorithm is used for allocating drivers to customers. Suppose you have an existing algorithm which is allocating drivers to customers. Now you have a hypothesis that by changing a few of the parameters in your allocation algorithm, you will be able to allocate drivers more efficiently to your customers, which will impact your business metric, the booking conversion metric, and that this metric will increase by making this change in the algorithm. So instead of directly rolling this algorithm out to production, which can be very critical and dangerous because you do not know how it will perform in production, you will run an experiment with the existing algorithm and the new algorithm you are proposing. You run this experiment for a few weeks, collect the data, then analyze it and see for which variation the booking conversion rate has increased. If it is the new variation, then you roll it out to all the users in a staggered manner. If it is not, then you are sure that what you thought was not right and the current algorithm is performing better than the new one you suggested.
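As a rough illustration of the mechanics behind this (a minimal sketch, not Litmus's actual code), the platform has to assign each user deterministically to one of the variants so the same customer always hits the same algorithm. The experiment key, variant names, and hashing scheme below are assumptions for illustration only.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// assignVariant deterministically maps a user to one of the experiment's
// variants, so the same customer always sees the same allocation algorithm.
func assignVariant(experimentKey, userID string, variants []string) string {
	h := fnv.New32a()
	h.Write([]byte(experimentKey + ":" + userID))
	bucket := h.Sum32() % uint32(len(variants))
	return variants[bucket]
}

func main() {
	// Hypothetical experiment comparing the existing allocation algorithm
	// with the proposed one.
	variants := []string{"control-allocator", "new-allocator"}
	fmt.Println(assignVariant("driver-allocation-v2", "customer-42", variants))
}
```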
So let's move to the next slide.
Yeah, so these days a lot of companies are investing in experimentation. Not only the big giants; even small startups are doing experimentation on a daily basis. So whatever apps you are using right now must be running at least a few hundred experiments each. This makes sense as well, because it is the data that is leading you to make the decision, not just your manager or some highly paid person coming and saying this is how we should do it, this will perform better, and we are rolling it out to all the users. The decision is backed by data. So I'll quote Jeff Bezos here. What he said is that if you double the number of experiments per year, you are going to double your inventiveness. He also knew how critical experimentation is for a firm, so Amazon must also be running a pretty huge number of experiments across its different services.
Let's move to the next slide. Yeah, Litmus. What is Litmus? Litmus is basically the name of the experimentation platform at Gojek Tech. It has two core features. One is A/B testing, which we call an experiment, where you come and define your experiment with two or more variants and do the A/B testing. The second one is the staggered rollout: once you have the result of your experiment, you can promote your winning variant to a release, rolling it out to all the users in a staggered way. That is called a release. These are the two features currently supported by Litmus. The third thing is that it also supports targeting rules. I am purposefully mentioning this here because, if you see, we have rules like app version newer than 1.23.12, so all the users who have an app version newer than this will be part of the experiment. These are some of the rules we support. I'm mentioning this because for each experiment we need to parse its rules, which makes this application CPU intensive. I want to highlight this phrase because it will be used in the system design; it will help us design the system better, because we know the expectations from this application. So these are the key features supported by Litmus, the A/B testing platform of Gojek.
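To make the "CPU intensive" point a bit more concrete, here is a minimal sketch of what evaluating a rule like app version newer than 1.23.12 might look like. The rule shape, function names, and version format are my own assumptions for illustration, not Litmus's real schema.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// compareVersions returns -1, 0, or 1 when a is older than, equal to,
// or newer than b (versions like "1.23.12").
func compareVersions(a, b string) int {
	as, bs := strings.Split(a, "."), strings.Split(b, ".")
	for i := 0; i < len(as) || i < len(bs); i++ {
		var ai, bi int
		if i < len(as) {
			ai, _ = strconv.Atoi(as[i])
		}
		if i < len(bs) {
			bi, _ = strconv.Atoi(bs[i])
		}
		if ai != bi {
			if ai < bi {
				return -1
			}
			return 1
		}
	}
	return 0
}

// matchesRule checks a single hypothetical "newer than" targeting rule.
func matchesRule(userAppVersion, minVersion string) bool {
	return compareVersions(userAppVersion, minVersion) > 0
}

func main() {
	fmt.Println(matchesRule("1.24.0", "1.23.12")) // true: user is in the experiment
	fmt.Println(matchesRule("1.22.5", "1.23.12")) // false: user is excluded
}
```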
So what is the objective of this talk? I'll be walking you through my experience of building Litmus from scratch, and also the changes we made to the system to cater to the higher throughput. So what are the non-functional requirements we need to build this system? These are the core non-functional requirements. First is low latency. Low latency is required because, if you see, experimentation is a new call in your flow. Suppose you are running an algorithm for driver allocation: right now you have only one algorithm, so you know which algorithm to use. If you are running an experiment, you have added one more branching there; you will talk to the experimentation service, get to know which algorithm to use, and then you will use that algorithm. So low latency is very critical here because it is going to impact the overall flow. The second one is being highly available, which I think is a given for all systems. The system has to be highly available because it is being used in a lot of flows. Then, high throughput: in Gojek pretty much every service is using this platform for running experiments, so we are expected to receive high throughput. And weak consistency: basically, the data should be eventually consistent. If anyone makes changes to an experiment, they should eventually be reflected on the client side. Yeah, these are the objectives. Let's go to the architecture and how we started.
So let's look into the initial architecture, how the idea of Litmus started and how we built version one of Litmus. We have this Litmus server. The Litmus server is basically the A/B testing application, the application which has all the functional requirements like supporting experiments, releases, rules, et cetera, and it is backed by Postgres; we have a primary Postgres database. So how did the initial flow work? We had a portal. The experimenters can come to the portal, create experiments, update their created experiments, basically do any kind of CRUD on their experiments. Whenever they make any changes on the portal, the portal talks to the Litmus server and the Litmus server reflects those changes in the DB. And we have clients. Clients are basically, you can say, microservices or mobile clients, any clients which need to talk to the experimentation platform to decide which variant of an experiment to use. So how does this flow work? The clients talk to the Litmus server, the Litmus server talks to the DB, fetches all the valid experiments, and returns them to the clients. So this is the initial architecture.
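As a rough sketch of that read path (the host, endpoint, and response shape below are hypothetical, not Litmus's real API), in version one a client simply called the Litmus server over HTTP, and the server answered from the primary database:

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
)

// Experiment is a hypothetical wire format for what the Litmus server returns.
type Experiment struct {
	Name     string   `json:"name"`
	Variants []string `json:"variants"`
}

// fetchExperiments asks the Litmus server for all valid experiments for this
// client. In version one, every such call ended up querying the primary Postgres.
func fetchExperiments(clientID string) ([]Experiment, error) {
	url := fmt.Sprintf("https://litmus.example.internal/v1/experiments?client=%s", clientID)
	resp, err := http.Get(url)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()

	var experiments []Experiment
	if err := json.NewDecoder(resp.Body).Decode(&experiments); err != nil {
		return nil, err
	}
	return experiments, nil
}

func main() {
	exps, err := fetchExperiments("driver-allocation")
	if err != nil {
		fmt.Println("fetch failed:", err)
		return
	}
	fmt.Printf("received %d experiments\n", len(exps))
}
```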
You must be thinking that this is pretty naive and that there are a couple of limitations here. Yeah, and we will be talking about those limitations in the next slide. Let's go to the next slide. Yeah, so limitations. The first one is a single cluster handling both the read and the write requests. So what is the problem here? The problem is that the read throughput and the write throughput are very different for Litmus; the read throughput is around six to seven times larger than the write throughput. So the tuning and the scaling required for reads will be different from writes. If we are using a single cluster and, suppose, reads are using a lot of resources, it might affect the writes. If some outage happens on that cluster, both reads and writes will be affected. So this is one of the potential problems in this whole architecture.
The second one is the primary database being used for both reads and writes. We have a primary and a replica database. The primary is the one which handles all the writes, and the other replicas are where you can read the data. If the read clients could read through the replica database via the Litmus application, it would be better for performance: even if the primary database is down, not performing well, or running some kind of maintenance, they could still read from the replica database. Reading the data is the critical flow here, because a lot of the microservices will be talking to the experimentation platform at the API level. The third one is a single deployment. Single deployment as in: if you are making any changes related to the write APIs, the deployment will go out for both read and write. If you are making any changes related to the read APIs, like some optimization you have done, you will have to make a deployment, and at deployment time both read and write will be affected. So there is no separation between read and write here, which we will talk about in the next slide.
So what we did next: we separated the read and the write clusters to scale them better, because the throughput difference was quite large. Let's go to the architecture. We have Litmus write servers, we have Litmus read servers, we have the primary database, and we have a replica, which replicates all the data from the primary on its side. So how does the portal talk now? The portal talks to the write servers, and similar to the previous fashion, the write servers make those changes in the DB. All the clients now talk to the read servers, and the read servers fetch the data from the replica and return it to the clients. This is how the system looks once we segregate the read and write clusters.
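A minimal sketch of how this split can look at the database layer, assuming Postgres with Go's database/sql and the lib/pq driver (the DSNs, pool sizes, and table names are placeholders, not Litmus's real configuration): the write cluster opens a pool against the primary, the read cluster opens a separate pool against the replica, and each can be tuned independently.

```go
package main

import (
	"database/sql"
	"log"

	_ "github.com/lib/pq" // Postgres driver; any compatible driver would do.
)

func main() {
	// Write cluster: a small pool against the primary.
	primary, err := sql.Open("postgres", "postgres://litmus@primary.db.internal/litmus?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	primary.SetMaxOpenConns(10)

	// Read cluster: a larger pool against the replica, tuned for read throughput.
	replica, err := sql.Open("postgres", "postgres://litmus@replica.db.internal/litmus?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	replica.SetMaxOpenConns(50)

	// Portal writes go to the primary...
	_, _ = primary.Exec("UPDATE experiments SET status = 'running' WHERE id = $1", 42)
	// ...while client reads are served from the replica.
	rows, _ := replica.Query("SELECT id, name FROM experiments WHERE status = 'running'")
	if rows != nil {
		rows.Close()
	}
}
```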
Let's go to the pros and cons of this architecture. Pros: the read and write clusters can scale independently. If your read throughput is increasing insanely, you can just scale your read cluster. And even if your read cluster is affecting response times for some time because of the high throughput, your write cluster and your portal will not face any kind of issue, unless the issue is on the database side. Decoupled deployments: the deployments will be faster, because the read changes and the write changes are independent now. A deployment on read does not affect write, and vice versa. On a high-throughput service, if you are doing a deployment, it will affect all the clients at that time; with separate read and write clusters, we are able to deploy the write cluster at any time because it is a low-throughput service. The third one is the master-slave setup, basically a primary-replica setup. Here the write cluster talks to the primary database and the read cluster talks to the replica database. So you can tune the replica database to increase the number of connections for your applications, and there won't be much load on the primary because it will only be serving the write throughput. There are some cons as well, which I think we will understand later in the talk. The first con is that all the clients are sharing the resources. There are a lot of clients talking to the experimentation platform, and they all share a common resource: the read cluster has some amount of CPU, and that CPU is shared by all the clients' requests. That is a problem, and I'll explain it in a later section.
The second con, or limitation I would say, is that it is not horizontally scalable. That is because the underlying database is Postgres: you can have only a certain number of connections there. If you are going to do horizontal scaling, each server will be making connections to the database, so you need to keep that in mind as well; it is not easily horizontally scalable. Whenever you reach the max connections configured on the DB side, your servers will start crashing, because they will not be able to get idle connections to the DB. So this is one of the limitations of the read-write segregation approach.
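To put assumed numbers on that (these are illustrative, not Litmus's actual settings): if the replica allows 500 connections and each read server keeps a pool of 20, you cannot usefully go much beyond 25 read servers before the connection limit becomes the bottleneck.

```go
package main

import "fmt"

func main() {
	// Assumed numbers for illustration only.
	maxDBConnections := 500 // max_connections configured on the Postgres side
	poolPerServer := 20     // connections each read server keeps in its pool

	maxServers := maxDBConnections / poolPerServer
	fmt.Printf("at most ~%d read servers before connections run out\n", maxServers)
}
```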
Yeah. So let's talk about scale over time, which is important to understand the changes that need to be made in the architecture to support the new scale. Let's see how the number of clients, the number of experiments, and the number of releases have changed over the course of time. First, the number of clients over time, from 2019 to 2022. See how the number of clients has increased: every quarter it is increasing, sometimes slowly and sometimes by a big margin, but the trend is always increasing; it is never constant. This is one of the major parts of the scale: different clients will now be talking to your systems, and each client can have different requirements. Some clients can run hundreds of experiments, some will be running two or three experiments, so there is a diversity in the clients as well. We'll talk about that later. Now let's see the number of active experiments at a time over the period from 2019 to 2022. If you see, we started with 178, we reached 739, and now in 2022 we are at about 1.8k. These are the active experiments, that is, running experiments; we have excluded the completed or stopped experiments.
This is the distribution of clients' experiments, the number of experiments running for a particular client. You see it's pretty uneven, so the expectations of the clients are also very different. A few clients will be running hundreds of experiments and are okay with a larger response time; a few clients are running only two or three experiments, but they are not okay with large latencies. That is because the response time is proportional to the number of experiments: we have rules configured for the experiments, and we have to parse a rule for each experiment, so the response time is proportional to the number of experiments running for a client. So here, if you see, it is not only the overall scale, it is also the diversity in the clients. What is the expectation from each client if you are sharing the resources? Suppose some client is okay with a large latency, say 50 ms at the 99th percentile. But some clients are not okay with it, because they are only running one or two experiments; why would they want their 99th percentile to be this high? So this was one of the other concerns.
Let's look at the similar graph for releases. This is the trend of active releases over the course of time, and this is the distribution of the number of releases per client. It is similar to what we saw for the experiments, and the problem is also similar. If you see the numbers, they are also similar: we had around 1.8k experiments at the end of 2022, while in releases we had 1.6k. Now, from 2022 to 2023, we have already increased the active releases by another 2k. So you see the scale and how it is increasing.
So let's see what changes we made in the architecture to cater to this scale. Let's look at some of the numbers here: every year Litmus has been receiving a new scale. At the end of 2019, the max throughput was coming to around 100k rpm (all of these numbers are in rpm; I forgot to mention that in the slide). At the end of 2020, it was around 500k rpm. At the end of 2021, it was around 1.1 million rpm. At the end of 2022, it reached 2 million rpm. The last two architectures we saw, the very basic and naive one with a single server and the second one with segregated read and write clusters, were able to handle around 500k throughput within the committed SLA. Beyond that, the response times were degrading and we were seeing some errors as well because of resource scarcity.
So, looking at the adoption and the growth of experimentation per year, which we saw in the scale slides, how the number of experiments and releases is increasing over the course of time, we adopted a sidecar pattern to optimize the latency. Why the sidecar pattern? Because by running the sidecars on the client's servers, you are distributing the resources. So if some client wants to run, say, a thousand experiments (a random number), or a hundred experiments, they will configure that much resource for the sidecar. If some clients are running only one or two experiments, why would they bother to give it big resources? They will just give one core, or less, for running the sidecar. This came to our mind because of the diversity in the clients. We thought: let's make it distributed, let's make the resources distributed, and build the platform so that it can handle the scale as well as the diversity in that scale.
Let's look at the architecture diagram now. How does it look? We have the write servers and we have the primary database. The portal flow will be similar to what it was: all the portal requests go to the write servers, and the write servers reflect those changes in the primary database. Now, how will the clients be connecting? These yellow boxes are client servers, and these, in gray, are the client processes. We have deployed the sidecar processes along with the client's main process. These sidecars will be syncing the data from the write servers periodically, let's say every 5 seconds, which is a configurable value. Every t seconds they fetch all the experiments for that particular client and store them in BadgerDB; the sidecar uses BadgerDB. So periodically they sync the data from the write servers, which expose an API that basically fetches all the data and returns it to the sidecar processes. Every t seconds the sidecars fetch the data from the write servers and store it in BadgerDB. Whenever a client makes a request to the sidecar, the sidecar reads the data from BadgerDB, does all the rule parsing and everything, and returns the valid experiments to the client. So if you see, all the computation is now distributed over all the servers, rather than being done on a particular set of servers. With this architecture it can scale to any value, because here we are not making any direct connections to the database; we are syncing the data via an API. The connections with the database are maintained by the write servers, and also the response time of one client is not affected by the others.
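Here is a hedged sketch of that sidecar loop as I have described it: every t seconds it pulls all experiments for its client from the write servers and caches the payload in a local BadgerDB, and client reads are then served from that local copy. The endpoint, key names, and exact BadgerDB usage are my assumptions about the shape of such a sidecar, not the actual Litmus code.

```go
package main

import (
	"io"
	"log"
	"net/http"
	"time"

	badger "github.com/dgraph-io/badger/v4"
)

const (
	syncInterval   = 5 * time.Second // the configurable "t seconds" from the talk
	experimentsKey = "experiments"   // hypothetical key for the cached payload
)

// syncLoop periodically fetches all experiments for this client from the
// write servers and caches the raw payload in the local BadgerDB.
func syncLoop(db *badger.DB, clientID string) {
	for range time.Tick(syncInterval) {
		resp, err := http.Get("https://litmus-write.example.internal/v1/experiments?client=" + clientID)
		if err != nil {
			log.Println("sync failed, keeping the last synced copy:", err)
			continue // availability over consistency when the network partitions
		}
		body, err := io.ReadAll(resp.Body)
		resp.Body.Close()
		if err != nil {
			continue
		}
		_ = db.Update(func(txn *badger.Txn) error {
			return txn.Set([]byte(experimentsKey), body)
		})
	}
}

// readExperiments serves a client request from the local cache: no shared
// read cluster, no direct database connection, just a local lookup followed
// by rule parsing (omitted here).
func readExperiments(db *badger.DB) ([]byte, error) {
	var payload []byte
	err := db.View(func(txn *badger.Txn) error {
		item, err := txn.Get([]byte(experimentsKey))
		if err != nil {
			return err
		}
		payload, err = item.ValueCopy(nil)
		return err
	})
	return payload, err
}

func main() {
	db, err := badger.Open(badger.DefaultOptions("/tmp/litmus-sidecar"))
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	go syncLoop(db, "driver-allocation")
	select {} // a real sidecar would expose an HTTP or gRPC endpoint to the client process
}
```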
Let's look at the key points of the sidecar pattern. Given that the sidecar is distributed, it is designed with availability and consistency in mind, using the PACELC theorem: basically, if we have a network partition, we have to choose between availability and consistency; else we have to choose between latency and consistency. The sidecar has been built choosing availability over consistency in case of a network partition, and consistency over latency when there is no partition. We decided to go ahead with this after a survey with our clients. Right now we do not have the option to configure the sidecar's consistency and latency trade-off with a configurable value; the sidecar just supports availability in case of a network partition, and consistency otherwise. The second point is that it is horizontally scalable, which I already talked about: since it is not making direct connections to the database, it is horizontally scalable. You can literally spin up hundreds of thousands of servers which talk to the Litmus servers periodically, because the throughput on the sidecar is not directly proportional to the throughput on the write application. If you see here, the call between the sidecar process and the write servers happens periodically, every t seconds, so irrespective of the client-process-to-sidecar throughput, the write servers are not affected. So it is horizontally scalable. The third part is that the resources are distributed, which I already explained: the response time of a client will not be affected by the experiments of other clients, because the resources are distributed and the clients run the sidecar on their own resources. And also, we have central monitoring for all the sidecars on New Relic and Grafana, because as an experimentation platform, if we are providing a sidecar, we need to check the health of all the sidecars as well. If someone is complaining that their sidecar is not working, you need to have the observability: what is the issue, is it working fine, since when has it not been working? So we have monitoring as well for all those sidecars.
Yeah. So this brings us to the last section of the talk: basically, how the throughput and the response times look right now. This is a screenshot I took at a particular time while making this slide. If you see, the throughput is reaching up to 1.61 million rpm, and if you look at the 99th percentile, it is pretty efficient: 9.52 ms. You see the scale and the response time; this is the most optimized result we were targeting, and using this sidecar pattern we were able to deliver it and make the system more available and more diversified, to support the requirements of different clients. So yeah, this is the overall learning for me while building the experimentation platform. Currently we are focusing mostly on building the analytics systems, but I think that is not part of this talk, it is out of scope. So yeah, that's all from my side. Thanks a lot. Please reach out to me on Twitter or LinkedIn for any questions you have. I have written a few blogs on this as well; if you are interested, you can find them on Medium. Thanks a lot. Have a nice day.