Transcript
Today I am going to talk about going from DevOps to MLOps: a journey of scaling machine learning models to 2 million API requests per day. Before we dive in, a brief about me. I am Chinmay. You can find me on Twitter, LinkedIn, etc. via the handle chinmay185. I am a founder at a company called One2N, where we help startups and enterprises with backend and site reliability engineering. I write stories of our work in what I call pragmatic software engineering, and I publish these stories on Twitter, LinkedIn, etc. I love engineering, psychology, percussion, and I am a huge fan of a game called Age of Empires. All right, so let's start.
So what are we covering today? We are covering three things, fundamentally. One is what MLOps is, and how to think about MLOps as a DevOps practitioner. But mostly I want to spend time on a real-world production case study that we worked on, and walk through all the learnings we had along the way. So what is MLOps, fundamentally? MLOps is operationalizing data science. We all know what DevOps is: DevOps is operationalizing software delivery and software engineering. MLOps is the equivalent of DevOps in that sense; it is about operationalizing data science workloads, think machine learning and AI workloads, essentially. That means it is all about moving machine learning workloads to production.
Just like we have DevOps phases, we have various phases in MLOps. Fundamentally it is build, where you build the models and manage various versions of the models; then you deploy these models to production; then you monitor, take feedback, and continuously improve the models. Those are the four broad steps of MLOps. Let's look at some of them in more detail. Just like software engineering is about building and shipping code to production, MLOps is about building and shipping machine learning models. Now, what do you need for building machine learning models? You need data. You need to extract this data in various forms, analyze it, and prune some parts of it; essentially, you are doing data gathering and preparation. Then you feed this data to your ML model. You train the model, evaluate the model's responses, and test and validate whether the model works correctly or not. You fine-tune this process over time: you have test data segregation, you have production data, and so on.
This is the machine learning part of it, which is what data scientists work on. Now, the operational parts of it are model serving: how do you serve this model in production? Do you run it on GPUs or CPUs? Which cloud provider do you want to use? How do you monitor whether the model is performing as per expectations or not? How do you manage scale-up and scale-down of that model? All of that is the operational concern, which is the ops part of it.
Fundamentally, this is a feedback loop, just like in software. We have CI/CD, continuous integration and continuous delivery/deployment; in MLOps we have a third item called continuous testing and training, where you continuously monitor, retrain, and improve the model over a period of time. So here is what the simplest MLOps workflow looks like. This diagram is from Google's MLOps guide; you can find the link in the description.
Fundamentally, again, it starts with getting data, so we are trying to map all the previous steps and phases that we looked at onto this model. We are going to get data from various sources. It could be offline data, it could be real-time data, things like that. For now, we are keeping the diagram very simple and just looking at offline data. We are going to extract this data, analyze it, and prepare it; that is the data preparation step. Then, as a second step, the whole model training phase appears, where you train the model, evaluate it, check its performance, and manage various versions, validations, etc. Finally, you have a trained model which you put into a model registry. Once the model registry has the model, the operational part of MLOps comes into the picture, which is serving the model. That is where you have, for example, a prediction service which you run in production. Then you have to monitor and scale that service, run it on GPUs, figure out cloud cost optimizations, and so on. That is operationalizing data science.
That is the simplest MLOps flow you can think of. You can also map this onto the classic DevOps infinity loop. The typical DevOps infinity loop talks about plan, code, build, test, then release, deploy, operate, monitor, and doing this in a loop consistently over long periods of time. Similarly, MLOps starts with data preparation: you prepare the data and train the model, which is the build and test part of it. Then the release goes into the model registry. You deploy the model and monitor its performance, and all of this together is the continuous training and testing of the model. Right, enough about theory.
What I want to talk more about is a use case that we worked on. This is production work that we did. I am going to cover the use case at a high level, talk about the work we did, how we applied DevOps and MLOps best practices in production, and the kind of issues that we faced along the way. So let's start with the case study. We were working with a company that was building eKYC SaaS APIs, accessible to B2B and B2C customers. This needed to scale up to 2 million API requests per day to the model.
Now, among the three or four SaaS APIs that we had, one was a face matching API. Imagine you provide two images to the model; the model has to match the faces between the two images and output a score between, say, zero and one. Based on the matching score, you can decide whether the two people in these images are the same person or not. That's the face matching API. Then we had face liveness detection: given an image of a face, is this the face of a live person or not? Then we had OCR, optical character recognition, from an image. For example, you upload a photo of a passport or any identity card and you can extract the text information from it. Imagine this KYC use case for insurance, telecom, or any other domain: people would otherwise have to manually enter a lot of information for the user, like their name, date of birth, address, etc. All of that information gets captured via the OCR API and returned in a structured response, which eliminates having to type (and mistype) it. We had some other small APIs as well, which I'll ignore for now.
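To make the API shape concrete, here is a rough sketch of what a client call to the face matching API could look like in Go. The endpoint path, form field names, response format, and threshold are illustrative assumptions, not the company's actual contract.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"io"
	"mime/multipart"
	"net/http"
	"os"
)

// matchResult mirrors the kind of response described in the talk:
// a similarity score between 0 and 1. Field names are assumptions.
type matchResult struct {
	Score float64 `json:"score"`
}

func faceMatch(apiURL, img1, img2 string) (float64, error) {
	var body bytes.Buffer
	w := multipart.NewWriter(&body)
	for field, path := range map[string]string{"image1": img1, "image2": img2} {
		f, err := os.Open(path)
		if err != nil {
			return 0, err
		}
		part, err := w.CreateFormFile(field, path)
		if err == nil {
			_, err = io.Copy(part, f)
		}
		f.Close()
		if err != nil {
			return 0, err
		}
	}
	w.Close()

	req, err := http.NewRequest(http.MethodPost, apiURL+"/v1/face-match", &body)
	if err != nil {
		return 0, err
	}
	req.Header.Set("Content-Type", w.FormDataContentType())

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return 0, err
	}
	defer resp.Body.Close()

	var res matchResult
	if err := json.NewDecoder(resp.Body).Decode(&res); err != nil {
		return 0, err
	}
	return res.Score, nil
}

func main() {
	score, err := faceMatch("https://api.example.com", "a.jpg", "b.jpg")
	if err != nil {
		panic(err)
	}
	// A cutoff like 0.8 is an assumption; the real threshold is business-specific.
	fmt.Printf("match score: %.2f, same person: %v\n", score, score > 0.8)
}
```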
Fundamentally, we had this ML system, and the architecture of that system, along with the other components, was something like this. It mainly served B2B use cases; we also had some B2C pieces, but we'll ignore those for now. So imagine the B2B use case: I am an insurance or telecom company, I have my own app, and there is a client SDK that I need to package and run as part of my app. This SDK talks to our backend APIs, which are these ML APIs exposed over HTTP. So it connects to, for example, a load balancer, and behind that we have an API layer. Then, to be able to serve these models, we had RabbitMQ as a queue mechanism where we would push messages. For example, if we want to match two faces across two images, we create a message and push it onto RabbitMQ. There are background ML workers which accept the message from RabbitMQ, do the processing, and update the results in the database; they might even save the results in a cache like Redis, which we had. The images themselves would be stored in a distributed file store; it could be MinIO, it could be S3, things like that. Essentially these workers would perform the bulk of the work, and then the API would return the results to the user.
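As an illustration of that API-to-worker handoff, here is a minimal sketch of how the API side could publish a face matching job onto RabbitMQ using the rabbitmq/amqp091-go client. The queue name, message fields, and connection string are assumptions for illustration, not the actual system's contract.

```go
package main

import (
	"context"
	"encoding/json"
	"log"

	amqp "github.com/rabbitmq/amqp091-go"
)

// faceMatchJob is an illustrative message shape; the real payload
// used in the talk's system is not specified.
type faceMatchJob struct {
	RequestID string `json:"request_id"`
	Image1Key string `json:"image1_key"` // object key in S3/GCS/MinIO
	Image2Key string `json:"image2_key"`
}

func publishJob(ch *amqp.Channel, job faceMatchJob) error {
	body, err := json.Marshal(job)
	if err != nil {
		return err
	}
	return ch.PublishWithContext(context.Background(),
		"",                // default exchange
		"face_match_jobs", // routing key = queue name (assumed)
		false, false,
		amqp.Publishing{ContentType: "application/json", Body: body})
}

func main() {
	conn, err := amqp.Dial("amqp://guest:guest@localhost:5672/")
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	ch, err := conn.Channel()
	if err != nil {
		log.Fatal(err)
	}
	// Durable queue so jobs survive broker restarts.
	if _, err := ch.QueueDeclare("face_match_jobs", true, false, false, false, nil); err != nil {
		log.Fatal(err)
	}
	job := faceMatchJob{RequestID: "req-123", Image1Key: "a.jpg", Image2Key: "b.jpg"}
	if err := publishJob(ch, job); err != nil {
		log.Fatal(err)
	}
}
```

The ML workers consume from this queue, run inference on the GPU, and write the result back, which is what keeps the API layer stateless and easy to scale.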
So this is the kind of architecture we had. Now, what were the requirements from an SLO point of view? We set out to achieve at least two nines of availability, that is, two nines of uptime during peak hours, i.e. our business hours. Because we were dealing with a B2B company, there typically were business hours: stores would open at 9:00 a.m. and run till, say, 10:00 p.m. local time. So we had promised two nines of uptime during that window. We obviously also had to worry about costs, and optimizing costs was an important requirement; from an SLO point of view we hadn't defined specific metrics for that, but we'll get to it later. And then from an API point of view, these were synchronous APIs as far as the user and the SDK are concerned, so less than three seconds of API latency at the 95th percentile was the goal we had set.
Now, given this, let's think about our architecture. We set out to build a cloud-agnostic architecture; I'll cover why that is soon. For example, for storing the actual images, which were ephemeral and kept only for a short time because we were dealing with sensitive data, we used S3 if we were deployed on AWS, GCS if we were deployed on GCP, and MinIO if we were deployed on premise. One of the reasons was that we wanted the same code to be able to use different types of image stores with little or no change to the code at all. Why? Because then we could deploy this entire stack on any cloud. We could run it on premise, even on a customer's premises, or on AWS, GCP, or any other cloud for that matter. This was the main thing we wanted to achieve. That's why we set out to have a cloud-agnostic architecture where we don't use very cloud-specific components and then get tied to that particular cloud as a code dependency.
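Here is a minimal sketch of the kind of storage abstraction this implies in Go. The interface and type names are assumptions, not the team's actual code, and the filesystem-backed implementation is just a stand-in to keep the example self-contained; in production one implementation each would wrap the S3, GCS, and MinIO SDKs behind the same interface, chosen by configuration.

```go
package main

import (
	"bytes"
	"context"
	"fmt"
	"io"
	"os"
	"path/filepath"
)

// ObjectStore is the narrow interface the rest of the code depends on.
// Swapping S3 for GCS or MinIO is then a config change, not a code change.
type ObjectStore interface {
	Put(ctx context.Context, key string, r io.Reader) error
	Get(ctx context.Context, key string) (io.ReadCloser, error)
}

// fsStore is a toy filesystem-backed implementation, used here only
// so the sketch compiles and runs without any cloud credentials.
type fsStore struct{ dir string }

func (s fsStore) Put(_ context.Context, key string, r io.Reader) error {
	f, err := os.Create(filepath.Join(s.dir, key))
	if err != nil {
		return err
	}
	defer f.Close()
	_, err = io.Copy(f, r)
	return err
}

func (s fsStore) Get(_ context.Context, key string) (io.ReadCloser, error) {
	return os.Open(filepath.Join(s.dir, key))
}

func main() {
	var store ObjectStore = fsStore{dir: os.TempDir()}
	_ = store.Put(context.Background(), "selfie.jpg", bytes.NewReader([]byte("fake image bytes")))
	rc, err := store.Get(context.Background(), "selfie.jpg")
	if err != nil {
		panic(err)
	}
	defer rc.Close()
	data, _ := io.ReadAll(rc)
	fmt.Println(string(data))
}
```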
That covers the image store. For Redis, which is basically a cache for storing a bunch of recent computations so that the API can quickly return results to users, we could go with self-hosted Redis, or with ElastiCache if we were on one of the cloud providers. For the queue, we again chose RabbitMQ purely so that we could run it on premise easily; if we were on the cloud, we could use something like SQS or an equivalent. For GPUs and workers, which were predominantly GPU workloads, we would run them on one of the cloud providers, or we could get our own custom GPUs, etc. For this production use case we were on one of the cloud providers, so we used mostly cloud components, but our code was such that it was not tightly coupled to the cloud at all. And for the database, again, we could use Postgres; it could be RDS or Cloud SQL or something else depending on the cloud provider.
Fundamentally, we had Nomad as the orchestrator, which would orchestrate all the deployments and the scaling of components. Back then we were not using Kubernetes, again purely from a simplicity point of view: we wanted to deploy this whole stack, along with the orchestrator, on premise, and we did not have a lot of Kubernetes expertise in the team to manage self-hosted Kubernetes on premise ourselves. So we chose the HashiCorp stack, which is a fairly simple, single-binary setup, easy to manage and easy to run, and we already had expertise in the team for that.
Why cloud agnostic? I think it's a very important point that I want to highlight. Because we were cloud agnostic, we could package the same wine in a different bottle, so to speak: we could package the same stack and run it in any environment that we wanted, including air-gapped environments if we had to. This was the main reason we went cloud agnostic, and I think one of the lessons is to build more cloud-agnostic systems. That way you're not tied to any of the cloud providers. Using managed cloud services obviously simplifies a lot of things, but you want your architecture and code to not be coupled to the cloud provider, so that you can change freely and migrate to a different cloud without having to redo a lot of effort. Now, what did our scaling journey look like, and how did we go from zero to 2 million API requests per day? Let's talk about that. Obviously it wasn't zero requests one day and 2 million the next day; it was a gradual scaling journey, something like this.
We would roll out to a few regions or a few stores, then slowly ramp up the traffic, observe it, and so on. Fundamentally, from our scaling journey I want to talk about four or five important points and then drill down on each one of them as we go through the talk. One is the elimination of single points of failure: to be able to scale, you want zero single points of failure so that your system is more resilient to changes and failures. You also need to do good capacity planning so that you can scale up and down easily and save on cloud costs; otherwise, if scaling requires changes to the architecture, it causes problems. So I'll cover capacity planning and how we went about it. Cost optimization and auto scaling go hand in hand with capacity planning, so I'll cover that too. Then come a lot of operational aspects: deployments, observability, being able to debug things, dealing with production issues, and so on. And obviously this journey wasn't straightforward; it was fraught with challenges, so I'm going to cover two interesting challenges that we encountered along the way, and hopefully you can learn from them. In the next part of the talk, I'm going to take each of these points and drill down on it. So let's talk about eliminating single points of failure. We had this architecture, and I'm just showing it here in the background.
One of the things we did is we enabled high availability mode for RabbitMQ. What does that mean? It means queue replication: whichever queue exists on one machine gets replicated, or mirrored, onto the other machines. We were running RabbitMQ as a three-node cluster instead of a single node. We also had a cross-AZ deployment for RabbitMQ, so each of the three RabbitMQ nodes would run in its own AZ. This was for an on-premise setup, or a setup where we wanted to host and manage RabbitMQ ourselves; if we were using a cloud managed service, we would use something like SQS instead. Another thing we did to eliminate single points of failure was to run the ML workloads in multiple AZs. Back then we had only two AZs where ML GPUs were available; the third zone from the cloud provider did not yet offer GPUs. So we had to tweak our logic, deployment, and automation to spin up and load balance between those two AZs: we had to adjust auto scaling and deployment automation to run workloads on only two zones instead of three. For most of the other components in the architecture, workloads would run across all three AZs. The other thing we did is that, wherever possible, we used managed offerings for some of the important stateful systems, like the databases, Redis and Postgres, just to make sure we didn't have to manage and scale those components as well. Managing and scaling ML was one of the bigger challenges, so we wanted to offload the lower-hanging fruit to the cloud provider. Fundamentally, the idea is that scaling and managing stateful components is hard, while stateless is much easier: it's easy to automate the scaling of stateless applications and components, and stateful ones are where it gets difficult. So that's it for eliminating single points of failure.
Let's talk about capacity planning. When you think about capacity planning, you think of the bottleneck, because the strength of a chain is the strength of its weakest link. So you want to find the weakest component and improve its strength. For example, if you think about the various components in the architecture, we have the API, which is a simple API that talks to the database and returns results from it; there are the stateful components, which could be Redis, RabbitMQ, Postgres, etc.; and then there are the ML workers.
So where do we think the bottleneck is? Obviously it was on the ML worker side, because that's the component that takes the most time in the request path. So we set out to figure out, for example, how many ML model requests a single node can handle. If you have a single node with, say, one GPU with 16 GB of GPU memory and some number of cores, how many workers can I run on it, and how many requests per second or per hour can I get out of it? Each ML model may give different results: face matching may be faster than OCR, or face liveness detection could be faster than face matching, for example. So we would run a load test against each of these models on each node type, to find the maximum throughput we could sustain over a long period of time, say an hour or two. I've broken this down in more detail, and in a more generic format, in another talk I gave about optimizing application performance from first principles. If you're interested, you can check that out; I'll provide the link in the description.
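To make the throughput measurement concrete, here is a minimal sketch of that kind of load test in Go: fire a fixed number of concurrent clients at a single worker node for a while and report requests per second. The endpoint, concurrency, and duration are illustrative assumptions; the real test posted images rather than doing plain GETs.

```go
package main

import (
	"fmt"
	"net/http"
	"sync"
	"sync/atomic"
	"time"
)

// loadTest hammers url with `concurrency` parallel clients for `duration`
// and returns the observed throughput in requests per second.
func loadTest(url string, concurrency int, duration time.Duration) float64 {
	var done int64
	deadline := time.Now().Add(duration)
	var wg sync.WaitGroup
	for i := 0; i < concurrency; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for time.Now().Before(deadline) {
				resp, err := http.Get(url) // the real test was a POST carrying image payloads
				if err != nil {
					continue
				}
				resp.Body.Close()
				atomic.AddInt64(&done, 1)
			}
		}()
	}
	wg.Wait()
	return float64(done) / duration.Seconds()
}

func main() {
	// Assumed inference endpoint of a single ML worker node.
	rps := loadTest("http://worker-node:8080/face-match", 16, 10*time.Minute)
	fmt.Printf("sustained throughput: %.1f req/s\n", rps)
}
```

Running this per model (face match, liveness, OCR) and per node type gives the requests-per-node numbers that feed directly into capacity planning.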
Cost optimization and auto scaling: that is one of my pet peeves given the current market scenario. As you will have seen, if you use GPUs in the cloud, they cost a lot. So how do you go about optimizing and auto scaling? You have to think about which parameter you can auto scale on. Is it CPU or GPU utilization? Could it be based on memory utilization, or the number of incoming requests? Or, since in our case we were using RabbitMQ, could it be the depth of the queue, or something else? We used a combination of some of these signals, so I'll talk about how we went about it.
So here is how our cost optimization and auto scaling works, i.e. our autoscaler. At the top left you have the current request rate: a graph of the number of requests we are getting over the last 20 or 30 minutes. We had a capacity predictor component which would run every 20 minutes; it would fetch this request rate, and it would also get the current node count, or current worker count. Again, imagine we are running these GPU workers: what is the current number of workers we have? So you have the current worker count and the current request rate, and based on the number of requests and its rate of growth, the capacity predictor would predict the desired node count using simple linear regression. For example, if we open the stores at 9:00 a.m. and we know we need at least 50 machines at that point, we would have time-based auto scaling and just spin up 50 machines so they are ready about ten minutes before the stores open. Once the stores open and you continue to see an increase in traffic, you want to spin up more nodes; if you see a decrease in traffic, typically during lunchtime, you want to spin down a couple of nodes. So this capacity predictor would take the rate of growth or decline in requests and the current node count, and based on the linear regression math and some other parameters, it would predict the desired node count. That count goes as an input to the autoscaler. The autoscaler would then do a couple of things. It would update Nomad: it would tell Nomad, hey, can you please spin up this many workers? The Nomad cluster configuration update would run, Nomad would correspondingly spin up more nodes and deploy the model on them, and the autoscaler would also notify us on Slack, for example, that two more nodes got added, or two nodes got destroyed because it was a low-traffic period. One of the reasons we built this kind of system is so that we have a manual override at any point. If we knew there was a big campaign going on and we would have to scale the machines at a particular time, or due to some other business constraint, we could manually override that value and change the configuration to spin up that many nodes. This gave us a lot of control. We've been using this autoscaler for years now and it has just been working very well; it's a very simple, low-effort piece of work, but it works flawlessly for us, and we've never had issues with it as such.
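As a rough illustration of the prediction step, here is a minimal sketch of a linear-regression-based capacity predictor in Go. The sampling window, requests-per-node figure, headroom factor, and minimum node count are all illustrative assumptions; the real predictor's inputs and the exact Nomad update mechanism are not spelled out in the talk.

```go
package main

import "fmt"

// predictNodes takes request-rate samples (req/s) from the recent window,
// fits a least-squares line, extrapolates stepsAhead samples into the
// future, and converts the predicted rate into a desired worker count.
func predictNodes(samples []float64, stepsAhead int, reqPerSecPerNode, headroom float64, minNodes int) int {
	n := float64(len(samples))
	var sumX, sumY, sumXY, sumXX float64
	for i, y := range samples {
		x := float64(i)
		sumX += x
		sumY += y
		sumXY += x * y
		sumXX += x * x
	}
	slope := (n*sumXY - sumX*sumY) / (n*sumXX - sumX*sumX)
	intercept := (sumY - slope*sumX) / n

	predicted := intercept + slope*float64(len(samples)-1+stepsAhead)
	if predicted < 0 {
		predicted = 0
	}
	nodes := int(predicted*headroom/reqPerSecPerNode) + 1
	if nodes < minNodes {
		nodes = minNodes // floor, e.g. the time-based minimum at store opening
	}
	return nodes
}

func main() {
	// Request rate sampled every few minutes over roughly the last half hour (made-up numbers).
	samples := []float64{10, 12, 15, 19, 24, 30}
	desired := predictNodes(samples, 4, 2.5, 1.2, 5)
	fmt.Println("desired worker count:", desired)
	// The autoscaler would then reconcile this against the current count
	// and ask Nomad to scale the worker job up or down, and post the change to Slack.
}
```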
There's more that we've written about this in one of our recent blog posts and case studies; you should check it out if you're interested in this kind of stuff. Now, our journey wasn't smooth. It was fraught with some issues and errors, so I'm going to talk about some of the issues we encountered, how we navigated them, and what kind of impact they had on downtime, SLOs, etc. One of the issues we encountered was
GPU utilization in Nomad. Imagine this top box is a GPU, with its GPU memory and cores. We would run an ML worker using Nomad's Docker driver, one ML worker per GPU. What we noticed is that the GPU wasn't utilized fully: with one worker, it was just 20% utilized, and a lot of the other resources on that machine were left unused. We did load tests to figure out how much throughput we could get out of a single machine running a single ML worker as a Docker container, and it wasn't impressive. If we just ran with this kind of hardware utilization, our cost was going to go through the roof, and we had to figure out how to fix it. So we dug deep into Nomad GitHub issues, some pull requests, and some obscure documentation, and what we eventually discovered is a workaround: instead of the Docker driver, we could use the raw_exec driver, which allows you to run any kind of process. It doesn't give you the same guarantees, and you have to handle a lot of the scheduling concerns yourself, but with raw_exec you can run Docker Compose. So we spun up multiple Docker containers using Docker Compose via the raw_exec driver in Nomad. What that allowed us to do is run not one but four workers on the same GPU machine, which got us to about 80% utilization. That straight away brought our GPU cost down by 4x. So imagine you were spending $100,000 per month just on GPUs; we could cut that to roughly a quarter, which is a massive impact. So this was one way we solved it.
As of today, last I checked, that GitHub issue is still open, and I've highlighted here the workaround we used, for you to look at. One thing we noticed is that if you use the raw_exec driver, you have to handle the shutdown part yourself. Nomad normally handles SIGTERM and the graceful termination of resources or components; in this case you have to add waits and timeouts and put in some work to tear down the components correctly. So we invested in that and wrote a small bash script that could start the components and also tear them down cleanly when we wanted to.
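Our wrapper was a bash script; as a rough illustration of the same idea, here is a minimal Go sketch of that kind of wrapper, assuming a docker-compose.yml for the ML workers sits next to it: bring the Compose stack up, wait for the stop signal from Nomad, then tear the stack down before exiting.

```go
package main

import (
	"log"
	"os"
	"os/exec"
	"os/signal"
	"syscall"
)

func run(name string, args ...string) error {
	cmd := exec.Command(name, args...)
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr
	return cmd.Run()
}

func main() {
	// Bring up the ML workers defined in docker-compose.yml (assumed to exist).
	if err := run("docker", "compose", "up", "-d"); err != nil {
		log.Fatalf("compose up failed: %v", err)
	}

	// raw_exec sends the task a signal on stop; wait for it here.
	sigs := make(chan os.Signal, 1)
	signal.Notify(sigs, syscall.SIGTERM, syscall.SIGINT)
	<-sigs

	// Graceful teardown, with a timeout so we stay within Nomad's kill window.
	if err := run("docker", "compose", "down", "--timeout", "30"); err != nil {
		log.Printf("compose down failed: %v", err)
	}
}
```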
One other issue we encountered was high latency. So again, we have steady traffic, new regions or new stores are opening up and getting migrated to use our APIs, and the rollout is happening. Then suddenly, on one of the days, we see response times of more than 25 seconds. Remember, our SLO is less than 3 seconds of response time at the 95th percentile, and we're suddenly seeing 25 seconds. That's unacceptable. So we debug the issue. Our first thought is that there must be some problem with the ML workers, because that's the slowest component in the chain. But we realize there is no queue depth: the workers are doing their job, and there are no extra jobs waiting to be processed. So worker scaling, or auto scaling, is not the problem. Then why do we have this much latency on the API side when there is no processing lag on the workers?
We spent a lot of time debugging this issue, and ultimately we discovered the culprit: the number of goroutines on the API side that process results from RabbitMQ. Our flow was that the client SDK would call our NLB or ALB, the API would send a message to RabbitMQ, the workers would consume that message, produce the result, update the database, and then publish another message to RabbitMQ saying the processing is done, along with some metadata. On the API side there were consumers which would pick up this metadata from RabbitMQ and then return the response to the user, or do something else with it. That part was running just five goroutines. We were also running three nodes, or three Docker containers, for the API layer, and we had never seen any CPU utilization issue or any other problem with the API layer itself. But when the request count increased, those five goroutines were not enough, and that is where we were actually seeing queue depth: on the queue the API goroutines were consuming from RabbitMQ. It took us a while to figure this out and fix it, because our preconceived notion, and our first response, was to look at the workers as the problem. We changed the goroutine count to 30, and suddenly things flipped: traffic went back to normal and we were under 3 seconds of response time. That really shows how you should understand your components and how the data pipeline works, and you need to really think about where the latency is being introduced.
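A minimal sketch of the fix, again with the rabbitmq/amqp091-go client: run the result consumer on the API side with a configurable number of goroutines instead of a hard-coded handful. The queue name, prefetch setting, and handler are illustrative assumptions; bumping the worker count from 5 to 30 is essentially the change described above.

```go
package main

import (
	"log"
	"sync"

	amqp "github.com/rabbitmq/amqp091-go"
)

// consumeResults starts `workers` goroutines draining the results queue.
func consumeResults(ch *amqp.Channel, queue string, workers int, handle func([]byte)) error {
	// Prefetch enough messages to keep all goroutines busy.
	if err := ch.Qos(workers, 0, false); err != nil {
		return err
	}
	msgs, err := ch.Consume(queue, "", false, false, false, false, nil)
	if err != nil {
		return err
	}
	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for m := range msgs {
				handle(m.Body)
				m.Ack(false)
			}
		}()
	}
	wg.Wait()
	return nil
}

func main() {
	conn, err := amqp.Dial("amqp://guest:guest@localhost:5672/")
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()
	ch, err := conn.Channel()
	if err != nil {
		log.Fatal(err)
	}
	// 30 consumer goroutines instead of 5: the change that removed the API-side queue depth.
	err = consumeResults(ch, "face_match_results", 30, func(body []byte) {
		log.Printf("got result: %s", body) // hand the result back to the waiting API request here
	})
	if err != nil {
		log.Fatal(err)
	}
}
```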
After that, we also added a lot of other observability signals and metrics to track this issue and catch similar issues even earlier. We invested a lot in tracking cross-service latencies and things like that. I've written about this as a pragmatic software engineering story; you can check it out on Twitter if you follow me there. I want to
conclude this session by talking about some of the lessons that we learned along the
way. One is that for ML workloads, it goes without saying but is worth repeating, data quality and training are super important. A lot depends on the quality and volume of the data, how you train your models, how you carve out the test data versus the data you train and evaluate on, and how you label that data. So in our case a lot came down to data quality and training. Second, I would highly recommend a cloud-agnostic architecture, to build scalable systems that give you the flexibility to deploy on any type of underlying cloud or environment. For us, that has been a massive win.
And I would highly encourage you to think about building cloud agnostic systems.
One of the things that I've seen a lot is that people do not treat operational workloads as first-class citizens. They just slap them on top of the existing software and say, yeah, somebody will manage this. But treating operational work as a first-class citizen really helps in automating a lot of your day-to-day tasks. Especially when you are first launching, you should really treat production, maintenance, and operations as first-class citizens in the software building and delivery process.
Lastly, from a team point of view, we had really good collaboration between various teams: data scientists, the mobile SDK team, the backend engineering team, SREs, and even the business folks. For example, whenever there was a new campaign, or we got info from the business team that a new region or a new bunch of stores was being onboarded, we would preemptively scale the nodes to handle that traffic. Close collaboration with the data scientists also helped us optimize things; in one case, we optimized the Docker image size for the models. Earlier, we would ship all the versions of the models in our final Docker image, which meant the image itself was around 10 GB. Later, in close collaboration with the data scientists, we brought that down to less than about 3 GB. That speeds up warming up nodes a lot; it speeds up the deployment process and the time it takes for nodes to be ready to serve traffic. So ensuring good collaboration between teams is super, super important.
I think that's it from my side. What I would say is: connect with me on LinkedIn, Twitter, etc. You should check out our Go and SRE bootcamps that we have built at One2N, along with the software engineering stories that I write. Here's the QR code; you can scan it and check these out.