Transcript
Hello, I'm Bongwani Shangwe, and I'm here to present to you about running an open source large language model on your own infrastructure.
I first want to kick off with a short introduction about the company I work for, Adevinta, and what it is we do, because a lot of people might know about our second-hand online marketplaces, but they do not know much about the brand Adevinta behind them. Adevinta is one of the world's leading online classifieds specialists, with markets across three continents containing over 25 marketplace brands. Our marketplaces range from consumer goods, vehicles and real estate to jobs. Adevinta consists of several marketplace brands like Kleinanzeigen in Germany, Marktplaats in the Netherlands, Leboncoin in France, and Kijiji in Canada. Adevinta is a champion for sustainable commerce, making a positive impact on the environment, the economy and society.
By the numbers, we get about 2.5 billion monthly visits across our websites. We have 25-plus marketplaces in our digital portfolio and over 5,700 employees across ten countries. Now, at Adevinta, we've been working on a conversational search assistant which is geared to launch on Leboncoin in France for A/B testing during the second quarter of this year.
Conversational search is about building a smarter shopping experience by allowing users to ask arbitrary questions and be guided to relevant recommendations and search results by an assistant, in order to serve a greater user experience. Unlike the normal chatbots which are in use on most websites, the conversational search assistant will be backed by a large language model service. Conversational search can be defined as a chat interface that enhances the user experience by allowing natural language interactions with software agents or virtual assistants to retrieve information.
The product we envisioned is kind of like this. This is an example I drew up. With the conversational assistant, when it pops up or when you interact with it, you ask a general question. In this case, the user is looking for a Ford Focus or Fiesta, and it's for the conversational search assistant to assist the user in narrowing down the search results, asking for more preferences and what specifics the user would like in the vehicle he's looking for. Here is a basic infrastructure overview of how it works.
The user would, of course, interact with the conversational search assistant, which would in the background generate a query to call a conversational large language model. The large language model also gathers the history of the conversation, and it extracts that information in order to consolidate it and push it to a search API to get relevant search results. Everything here, the conversational large language model and the extraction large language model, is backed by GPT-3.5.
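To make that flow concrete, here is a minimal sketch of the loop just described, under the assumption of a hypothetical ask_llm helper for the LLM calls and a hypothetical search_api_query function for the marketplace search API; none of these names come from the actual Adevinta system.

```python
# Minimal sketch of the conversational search flow described above.
# ask_llm and search_api_query are hypothetical placeholders.
def ask_llm(prompt: str) -> str:
    """Call the conversational / extraction large language model."""
    raise NotImplementedError

def search_api_query(filters: str) -> list[dict]:
    """Call the marketplace search API with the consolidated filters."""
    raise NotImplementedError

def handle_user_message(history: list[str], message: str) -> dict:
    history.append(f"User: {message}")
    # 1. The conversational LLM drafts the assistant's next reply.
    reply = ask_llm("Continue this conversation:\n" + "\n".join(history))
    history.append(f"Assistant: {reply}")
    # 2. The extraction LLM consolidates the conversation history into
    #    structured search filters (e.g. make, model, price range).
    filters = ask_llm("Extract search filters as JSON from:\n" + "\n".join(history))
    # 3. The filters are pushed to the search API for relevant results.
    return {"reply": reply, "results": search_api_query(filters)}
```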
Aside from OpenAI, the team has also been looking at other providers of large language model APIs.
However, we found that there are some downsides to using a service-provided large language model. One point is the readiness of the service: as this is still a new field, some providers are slow to open up to more customers at a larger scale, so it takes quite a while to get onboarded onto these services. There's the cost factor; of course, this is a new additional cost for the team and the company. And we also had to think about latency: given that the services are outside of Adevinta's infrastructure, there's additional latency we have to account for. Given some of these factors, we decided to investigate the use of open large language models which we could possibly deploy onto our own infrastructure.
So during the proof of concept phase, where we were using pay-as-you-go services, we also started exploring hosting a large language model ourselves versus using an enterprise service. With an enterprise service, you get the top, best quality and the latest large language models which are being produced. When you do a self-hosted solution, the benefit is that you have complete control of your application and your team is responsible for the system, versus using an external API, where you're dependent on that other system being up all the time for your service to run. Some benefits of using your own hosted large language model are that you have greater privacy and compliance, and you also avoid vendor lock-in. We started exploring models to use by going to Hugging Face. Hugging Face is currently the main platform, or website, for building and using machine learning based models such as large language models. It also provides a platform to run these models on a smaller scale. In our case, we considered text generation based models. We first started off with the Falcon 7-billion-parameter instruct-tuned model to get familiar with deploying a large language model.
It's a lightweight model and it's
quite easy to get started and set up. Though it is lightweight,
we did find at lack depth when answering
specific questions,
specifically when we're looking at to use it for as a conversational
search assistance. So then we started looking at other models
which were out there which we could use. We looked at the Falcon
40 millimeter chat tune model and the Loma 270
billion chat tune model.
Aside from being chat tuned, these models also provide multi
language support which was a requirement for us as marketplaces
across different countries with different language customer
customers who speak different languages. So on
deploying a model or hosting the large learner model. In this case,
we found text generation interface or TGI for
short. TGI is a fast optimized interface
solution built for deploying and serving large learning models.
TGI enables high performance text generation
using tensor parallelism, dynamic batching
for most popular and dynamic batch for most popular
open source larger models. TGI also has
a docker image which can be used to launch the text generation
Some of the benefits we found using TGI: it's a simple launcher service to host your model; it's production-ready, as it provides tracing with OpenTelemetry and Prometheus metrics; there's token streaming; you can have continuous batching of incoming requests for increased total throughput; there's quantization with bitsandbytes; there are stop sequences; you can have custom prompt generation; and it also provides fine-tuning support to fine-tune your models.
The easiest way to get started with Text Generation Inference is to run a simple command line: you just point to the model which you'd like to use, and you set up where to store the model and all the data, such as the model weights. When you run this, since it's a Docker command, it will launch the TGI Docker image, and if your model is not present on your machine, it will download it. If it is present, it will just start running.
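As a rough illustration, here is what that launch can look like, driven from Python; the image name and launcher flags follow Hugging Face's published TGI quickstart, but the model ID, port and volume path are just example values, not our production setup.

```python
# Sketch of launching the TGI Docker image (example values throughout).
import subprocess

subprocess.run([
    "docker", "run", "-d",                 # run detached in the background
    "--gpus", "all", "--shm-size", "1g",   # expose the GPUs to the container
    "-p", "8080:80",                       # serve the API on localhost:8080
    "-v", "/data/models:/data",            # where the weights are stored/cached
    "ghcr.io/huggingface/text-generation-inference:latest",
    "--model-id", "tiiuae/falcon-7b-instruct",  # the model you'd like to use
], check=True)
# If the model is not present in the volume it is downloaded first,
# otherwise the server starts straight away.
```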
Then, to get going and test it out, you can use a simple curl command: as you can see, you submit the JSON and you get a response from the model.
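The Python equivalent of that curl command looks roughly like this, assuming the container from the previous sketch is listening on port 8080 and the model has finished loading; the /generate route with an "inputs"/"parameters" payload is TGI's standard API.

```python
# Send a test prompt to the running TGI server and print the response.
import requests

resp = requests.post(
    "http://127.0.0.1:8080/generate",
    json={
        "inputs": "I'm looking for a Ford Focus or Fiesta, what should I consider?",
        "parameters": {"max_new_tokens": 64},
    },
    timeout=60,
)
print(resp.json())  # e.g. {'generated_text': '...'}
```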
Now, our team based our experimentations on GCP. Initially we tried to run the deployments in our European region of GCP. However, it appears that with the rise of GPU-heavy applications such as large language models, there's a scarcity of GPU availability in the European and North American regions. Actually, if you research this more, it seems to be occurring across all the cloud providers because of the current popularity of GPU-based applications. So we scanned outside of Europe and North America, and we found that there were two zones that could provide GPUs which we could get as spot instances, and those were in East Asia and in the Middle East.
To be specific, large language models require a lot of GPU to train. For inference it can be less, but there's still a large amount of GPU memory required: the more parameters the model contains, the more GPU memory is required to run inference with the model. There's also the possibility to run the model inference in eight-bit or four-bit mode, which decreases the amount of GPU memory required. Based on current GPU availability in cloud environments, the best options are machines that contain an NVIDIA A100 80GB. However, these GPUs are extremely scarce, even in the cloud environment. We then settled for using machines that contain an NVIDIA A100 40GB. It is also possible to run a model on multiple GPUs in order to meet the memory requirements of the model inference.
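A back-of-the-envelope calculation makes the memory picture clearer. This only counts the model weights (2 bytes per parameter in fp16, 1 in 8-bit, 0.5 in 4-bit) and ignores activations and the KV cache, so real requirements are somewhat higher.

```python
# Weight-only GPU memory estimate for the models discussed in this talk.
def weight_memory_gb(params_billion: float, bytes_per_param: float) -> float:
    # params_billion * 1e9 params * bytes_per_param / 1e9 bytes per GB
    return params_billion * bytes_per_param

for name, params in [("Falcon 7B", 7), ("Falcon 40B", 40), ("Llama 2 70B", 70)]:
    for mode, bpp in [("fp16", 2.0), ("8-bit", 1.0), ("4-bit", 0.5)]:
        print(f"{name:12s} {mode:5s} ~{weight_memory_gb(params, bpp):5.1f} GB")
```

Even at this optimistic estimate, Falcon 40B in fp16 already needs roughly 80 GB just for the weights, which is why a single A100 40GB is not enough without quantization or multiple GPUs.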
So there were two ways we could have deployed these models with TGI. The first one was using Kubernetes, and the other was running it on a virtual machine. For doing it on Kubernetes, we just created a simple deployment YAML and we allowed for autoscaling, because we didn't want to keep track of all the machines. This seemed a bit better, because when the models are not in use, it's easy to downscale. Versus if we were going to do it using a virtual machine: those were self-managed machines and we'd have to keep track of the capacity ourselves. Basically, if machines were not being used, then we'd have to shut them off ourselves. Because of this, we specifically started looking at doing it on Kubernetes, on GKE.
This is kind of a high-level overview of the GKE setup: you'd have the Text Generation Inference deployment running as a Docker image, pointing to a volume where all the weights were fetched from GCS.
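For illustration, a deployment manifest along those lines could look like the sketch below, here built as a Python dict and dumped with PyYAML (assumed installed); the names, image tag and resource requests are assumptions rather than Adevinta's actual manifest, and the GCS weight-fetching step is omitted.

```python
# Hypothetical sketch of a TGI Deployment for GKE (not the actual manifest).
import yaml

deployment = {
    "apiVersion": "apps/v1",
    "kind": "Deployment",
    "metadata": {"name": "tgi-falcon-7b"},
    "spec": {
        "replicas": 1,
        "selector": {"matchLabels": {"app": "tgi"}},
        "template": {
            "metadata": {"labels": {"app": "tgi"}},
            "spec": {
                "containers": [{
                    "name": "tgi",
                    "image": "ghcr.io/huggingface/text-generation-inference:latest",
                    "args": ["--model-id", "tiiuae/falcon-7b-instruct"],
                    "ports": [{"containerPort": 80}],
                    # Ask GKE to schedule this pod onto a node with one GPU.
                    "resources": {"limits": {"nvidia.com/gpu": 1}},
                    "volumeMounts": [{"name": "model-data", "mountPath": "/data"}],
                }],
                # Scratch volume for the weights (fetching from GCS omitted here).
                "volumes": [{"name": "model-data", "emptyDir": {}}],
            },
        },
    },
}
print(yaml.safe_dump(deployment, sort_keys=False))
```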
But we actually ran into an issue when we started using GKE. We started, of course, with the lightest Falcon model, the Falcon 7B. Even with that, we found that the deployment time took longer than expected. GKE GPU machines were also not really available for use, even with Kubernetes, and the highest-memory GPU node which we were able to get with GKE was a twelve-gigabyte GPU, which was not enough to push on with the other large language models. So given that, we went back and said, well, we're going to do the virtual machine deployment strategy.
Text Generation Inference allows you to host the model on a VM with a single GPU, or to parallelize inference across multiple GPUs. One of the benefits of parallelizing across multiple GPUs is that you get more memory across all the GPUs for inference.
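A sketch of the multi-GPU variant of the earlier launch command, using TGI's --num-shard launcher option to tensor-parallelize the model across the GPUs on the VM; the model ID and shard count are example values.

```python
# Shard the model across several GPUs on one VM to pool their memory.
import subprocess

subprocess.run([
    "docker", "run", "-d", "--gpus", "all", "--shm-size", "1g",
    "-p", "8080:80", "-v", "/data/models:/data",
    "ghcr.io/huggingface/text-generation-inference:latest",
    "--model-id", "meta-llama/Llama-2-70b-chat-hf",
    "--num-shard", "4",   # e.g. four A100 40GB GPUs available on the VM
], check=True)
```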
We then set up our experiments. When we deployed the models on a virtual machine, we set up notebooks to run different experiments, so we could track the data and compare the results later on. So we had different cases where we deployed the models: on a VM with a single GPU and on a VM with multiple GPUs. For the single-GPU VM, we had a short response latency, which was really perfect for us, and the deployment was quite quick, but the number of max tokens which could be processed was limited by the GPU memory. With a VM with multiple GPUs, we increased the GPU memory footprint, so we had more GPU memory, and this in turn increased the max token processing. But the time it took for the model to be ready increased, so the deployment time also increased, and the response latency also increased.
There were several factors which we looked into for why the latency increased. During our experiments, we found out that it was partly due to prompt size and complexity: if the prompt was long and quite complex, it would take longer for any of the models to process. However, the less direct and shorter the prompt is, the more it opens up the ability for side effects such as hallucinations. Specifically with latency and the GPU setup, we tracked it down to the fact that using more GPUs on a VM increased latency. We figured this could be because there's a lot of IO happening between the GPUs, a lot of data exchange, versus when you do inference on a single-GPU machine.
Another thing which we tracked with regards to latency is the max token output. We were able to set the max token output of the deployment, and with a higher output token length, the response from the model becomes more detailed and longer, which requires more processing time. But if the token length is extremely short, the response might not be complete or make sense to a human. Thus, there has to be a balance between how detailed the model response is and the quickness of the response, and we did a more detailed analysis of this.
As I mentioned, it is possible to set the number of tokens used for the output. For all the previous tests we had done, we set the token output to 256, and once we doubled it, we were able to observe different effects from the models tested.
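The comparison was along these lines: the same prompt sent with max_new_tokens set to 256 and then doubled to 512, timing each call. This is a simplified sketch, not our actual benchmark notebooks.

```python
# Time the same request with the output limit at 256 and then 512 tokens.
import time
import requests

prompt = "I'm looking for a small used car for city driving, what would you recommend?"
for max_new_tokens in (256, 512):
    start = time.time()
    resp = requests.post(
        "http://127.0.0.1:8080/generate",
        json={"inputs": prompt, "parameters": {"max_new_tokens": max_new_tokens}},
        timeout=300,
    )
    elapsed = time.time() - start
    print(f"max_new_tokens={max_new_tokens}: {elapsed:.1f}s, "
          f"{len(resp.json()['generated_text'])} characters")
```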
For the Falcon model with a maximum of 256 tokens, the model seemed to perform adequately. Once we increased the number of tokens, the response time initially seemed to be the same as before, but as the conversation with the user continued, the response time began to increase. Another effect was that the responses became longer and the model started to hallucinate, appearing to have a conversation with itself. For Llama 2, we again started with 256 maximum tokens and the model performed adequately, but we noted some instances where the response would appear to be cut off. With an increase in the token size, the latency also increased quite dramatically and the responses were more complete, but as the conversation continued, we would get some instances where the response would get cut off again.
As mentioned, we have marketplaces across different countries with different languages, so it is important for us to test how this performs, and since we're launching first on Leboncoin, we looked at using French. So we tested Falcon and Llama 2 on their ability to do inference with French users. The main adjustment was to the system prompt, directing the model to respond only in French.
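Concretely, for the Llama 2 chat model this amounts to putting the instruction into the system slot of its chat prompt template, something like the sketch below; the wording of the French instruction is illustrative, not our production prompt.

```python
# Direct the model to answer only in French via the system prompt,
# using the Llama 2 chat prompt template.
import requests

system = "Tu es un assistant de recherche. Réponds uniquement en français."
user = "I'm looking for a Ford Focus or a Fiesta."
prompt = f"<s>[INST] <<SYS>>\n{system}\n<</SYS>>\n\n{user} [/INST]"

resp = requests.post(
    "http://127.0.0.1:8080/generate",
    json={"inputs": prompt, "parameters": {"max_new_tokens": 256}},
    timeout=120,
)
print(resp.json()["generated_text"])
```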
For Falcon, if the user's question was in English, the model proceeded to respond in English, but if the user asked questions in French, the model responded in French. If the subsequent user questions were in English, the model remained speaking only in French, and if all the user questions were in French, the model continued to respond in French. However, there were some side effects we observed, some hallucinations in the model when using the French language.
In the case of Llama 2, if the user's first question was in English, the model proceeded to respond in English. If the user's first question was in French, the model responded in French. However, if the subsequent questions switched to English, the model reverted to responding in English, and if all the user's questions were in French, the model stopped responding in French. What we can conclude from this is that both models are quite adequate at responding to users in French.
One thing to note here is that in our discovery, we found out there's no one-to-one switch between using different models. It is not easy to take the same prompts or techniques from one conversational search model and use them with another without experiencing some type of side effect. For example, the prompts we used for these experiments had to be adjusted from what we're using with the OpenAI conversational search large language model, and though these experiments show points of success, more adjustments would be needed to get responses of the quality which we're getting with OpenAI. We also adjusted the properties for launching the model service and the properties of the API calls, but we did not do an in-depth evaluation of these properties. It is evident that with some slight adjustments, the latency and quality of the large language model change, but further investigation would need to be done by a team to find the best properties for the required results.
So the big question you might have is: how much is this all going to cost you? In our experiments, we were able to run the models on a single A100 40GB GPU, and the cost calculation for using that machine would come to just under $3,000 a month. But since we discovered that having more GPU memory is an advantage, we would need to select the highest GPU available, which would be the NVIDIA A100 80GB, and that would come to a cost of around $4,200. This estimation is for a single instance. If we were to scale out using GKE Autopilot or by manually adding more VMs, the cost of running the open large language model would grow significantly. It's also good to note that these costs do not include human costs, networking costs or maintenance costs.
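As a rough illustration of how that scales, treating the two per-instance figures above as fixed monthly prices (real cloud pricing varies by region, spot availability and commitments):

```python
# Back-of-the-envelope scaling of the per-instance costs quoted above.
MONTHLY_COST_USD = {"A100 40GB": 3_000, "A100 80GB": 4_200}

for gpu, cost in MONTHLY_COST_USD.items():
    for instances in (1, 2, 4):
        print(f"{instances} x {gpu}: ~${cost * instances:,}/month")
```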
So in our exploration phase, we looked at hosting a large language model ourselves and compared it to an enterprise large language model. The downsides we actually ended up with are that it's difficult to get adequate GPUs, and that there are high costs associated with running the model yourself. It would also require internal support or expertise, and also security maintenance. On the other side, while we were running this proof of concept, we made the discovery that, well, OpenAI actually announced that they were decreasing the cost of GPT-3.5 Turbo, and this made it more promising to use. And as for the slow API response time, we are also doing some fine-tuning on the model, and by doing some fine-tuning, it actually sped up the response time.
So my learnings and outcome from this: deploying a large language model, yeah, it's definitely possible depending on your use case. Based on our use case, it's best to start off with a lightweight model like the Falcon 7B and start using that internally. Just discover, play with it, and figure out what properties or functionality you can offload to your internal model, and slowly start offloading to it until you get to a stage where you either have a balance with a pay-as-you-go service, or you're fully running it on your own infrastructure. Thank you, and I hope you enjoyed the talk. If you have any questions, please feel free to get in touch with me.