Conf42 Large Language Models (LLMs) 2024 - Online

Running an open source LLM

Abstract

In this talk, we will explore the options available for harnessing LLMs, including deploying your own service or using existing LLM APIs. Attendees will learn best practices for leveraging LLMs effectively, whether by building or subscribing.

Summary

  • Bongani Shongwe talks about running an open source large language model on your own infrastructure. Conversational search can be defined as a chat interface to enhance the user experience. When you do a self-hosted solution, the benefit is that you have complete control of your application.
  • The shorter and less direct the prompt, the more it opens up the possibility of side effects such as hallucinations. There has to be a balance between how detailed the model response is and the quickness of the response. More adjustments will be needed to get responses to the required quality.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Hello, I'm Bongani Shongwe and I'm here to present to you about running an open source large language model on your own infrastructure. I first want to kick off with a short introduction about the company I work for, Adevinta, and what it is we do, because a lot of people might know about our secondhand seller marketplaces, but they do not know much about the brand Adevinta behind them. Adevinta is one of the world's leading online classifieds specialists, with markets across three continents containing over 25 marketplace brands. Our marketplaces range from consumer goods, vehicles, real estate and jobs. Adevinta consists of several marketplace brands like Kleinanzeigen in Germany, Marktplaats in the Netherlands, Leboncoin in France, and Kijiji in Canada. Adevinta is a champion for sustainable commerce, making a positive impact on the environment, the economy and society. By the numbers, we get about 2.5 billion monthly visits across our websites, we have 25 plus marketplaces in our digital portfolio, and over 5,700 employees across ten countries. Now, at Adevinta we've been working on a conversational search assistant which is geared to launch on Leboncoin, France for A/B testing during the second quarter of this year. Conversational search is about building a smarter shopping experience by allowing users to ask arbitrary questions and be guided to relevant recommendations and search results by an assistant, in order to serve a greater user experience. Unlike the normal chatbots in use by most websites, the conversational search assistant will be backed by a large language model service. Conversational search can be defined as a chat interface to enhance the user experience by allowing natural language interactions with software agents or virtual assistants to retrieve information. The product we envisioned is something like this example I drew up: when the conversational assistant pops up or when you interact with it, you ask a general question. In this case, the user is looking for a Ford Focus or Fiesta, and it's for the conversational search assistant to help the user narrow down the search results and to ask for more preferences about what the user would like in the specific vehicle they're looking for. As a basic infrastructure overview of how it works: the user interacts with the conversational search assistant, which in the background generates a query to call a conversational large language model. An extraction large language model also gathers the history of the conversation, extracts and consolidates that information, and pushes it to a search API to get relevant search results. Everything here, the conversational large language model and the extraction large language model, is backed by GPT-3.5 from OpenAI. The team has also been looking at other providers of large language model APIs. However, we found that there are some downsides to having to use a service-provided large language model. One is the readiness of the service: as this is still a new field, some providers are slow to open up to more customers at a larger scale, so it takes quite a while to get onboarded onto these services. There's the cost factor: this is a new additional cost for the team and the company. And we also had to think about latency: given that these services are outside of Adevinta's infrastructure, there's additional latency we have to account for.
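To make the flow above concrete, here is a minimal, hypothetical sketch of the orchestration: the assistant calls a conversational LLM for the reply, an extraction LLM to turn the chat history into search filters, and then the marketplace search API. All names and signatures here are illustrative assumptions, not Adevinta's actual code.

    from typing import Callable

    def handle_user_message(
        message: str,
        history: list[dict],
        chat_llm: Callable[[list[dict]], str],      # conversational model (e.g. GPT-3.5 or a self-hosted LLM)
        extract_llm: Callable[[list[dict]], dict],  # returns filters such as {"make": "Ford", "model": "Focus"}
        search: Callable[[dict], list[dict]],       # marketplace search API client
    ) -> dict:
        # Hypothetical sketch only: each callable stands in for a real service.
        history = history + [{"role": "user", "content": message}]
        reply = chat_llm(history)       # the assistant's conversational answer
        filters = extract_llm(history)  # preferences consolidated from the whole conversation
        results = search(filters)       # relevant listings for those preferences
        return {"reply": reply, "results": results}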
Given some of these factors, we decided to investigate the use of open large language models which we could possibly deploy onto our own infrastructure. So during the proof of concept phase, where we were using pay-as-you-go services, we also started exploring hosting a large language model ourselves and comparing it with an enterprise service. With an enterprise service you get the best quality and the latest large language models being produced, whereas when you do a self-hosted solution the benefit is that you have complete control of your application and your team is responsible for the system; with an external API, you're dependent on that other system being up all the time for your service to run. Some benefits of using your own hosted large language model are that you have greater privacy and compliance, and you also avoid vendor lock-in. We started exploring models to use by going to Hugging Face. Hugging Face is currently the main platform or website for building and using machine learning based models such as large language models, and it also provides a platform to run these models on a smaller scale. In our case, we considered text generation based models. We first started off with the Falcon 7 billion parameter instruct-tuned model to get familiar with deploying a large language model. It's a lightweight model and it's quite easy to get started and set up. Though it is lightweight, we did find it lacked depth when answering specific questions, specifically when we were looking to use it as a conversational search assistant. So then we started looking at other models which were out there which we could use. We looked at the Falcon 40 billion chat-tuned model and the Llama 2 70 billion chat-tuned model. Aside from being chat tuned, these models also provide multi-language support, which was a requirement for us, as our marketplaces span different countries with customers who speak different languages. For deploying or hosting the large language model, we found Text Generation Inference, or TGI for short. TGI is a fast, optimized inference solution built for deploying and serving large language models. TGI enables high performance text generation using tensor parallelism and dynamic batching for the most popular open source large language models. TGI also has a Docker image which can be used to launch the text generation service. Some of the benefits we found using TGI are that it's a simple launcher service to host your model; it's production ready, as it provides tracing with OpenTelemetry and Prometheus metrics; there's token streaming; you can have continuous batching of incoming requests for increased total throughput; there's quantization with bitsandbytes; there are stop sequences; you can have custom prompt generation; and it also provides support to fine-tune your models. The easiest way to get started with Text Generation Inference is to run a simple command line: you just point to the model which you'd like to use, set up where to store the model data such as the model weights, and run it as a Docker command, which launches the TGI Docker container. If the model is not present on your machine, it will download it; if it is present, it will just start running. And to test it out, you can use a simple curl command: you submit the JSON and you get a response from the model.
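Once a TGI container is up (the launch itself is a single docker run pointing at a model ID and a local volume for the weights, as described above), the service can be exercised from Python as well as curl. This is a minimal sketch, assuming the container is listening locally on port 8080; the prompt is just an illustrative placeholder.

    import requests

    # Roughly the same request as the curl test mentioned above, sent from Python.
    resp = requests.post(
        "http://127.0.0.1:8080/generate",
        json={
            "inputs": "What car would you recommend for a small family?",
            "parameters": {"max_new_tokens": 256},  # cap on the number of generated tokens
        },
        headers={"Content-Type": "application/json"},
        timeout=60,
    )
    print(resp.json()["generated_text"])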
Now, our team based our experimentations on GCP. Initially we tried to run the deployments in our European region of GCP. However, it appears that with the rise of GPU-heavy applications such as large language models, there's a scarcity of GPU availability in the European and North American regions. Actually, if you research this more, it seems to be occurring across all the cloud providers because of the current popularity of GPU based applications. So we scanned outside of Europe and North America, and we found that there were two zones that could provide GPUs which we could get as spot instances, and those were in East Asia and in the Middle East. To be specific, large language models require a lot of GPU to train; for inference it can be less, but it still requires a large amount of GPU memory. The more parameters the model contains, the more GPU memory is required to run inference on the model. There's also the possibility to run the model inference in 8-bit or 4-bit mode, which decreases the amount of GPU memory required. Based on current GPU availability in cloud environments, the best options are machines that contain an NVIDIA A100 80GB. However, these GPUs are extremely scarce, even in the cloud environment. We then settled for using machines that contain an NVIDIA A100 40GB. It is also possible to run a model on multiple GPUs in order to meet the memory requirements of the model inference. So there were two ways we could have deployed these models and the TGI inference: the first was using Kubernetes, the other was running it on a virtual machine. For Kubernetes, we just created a simple deployment YAML and we allowed for autoscaling, because we didn't want to keep track of all the machines. This seemed a bit better, because when the models are not in use it's easy to downscale, whereas virtual machines are self-managed and we'd have to keep track of the capacity ourselves; basically, if machines were not being used, we'd have to shut them off ourselves. Because of this, we specifically started with Kubernetes, on GKE. This is a high level overview of the GKE setup: you'd have the Text Generation Inference deployment running as a Docker image, pointing to a volume where all the weights were fetched from GCS. But we actually ran into an issue when we started using GKE. We started, of course, with the lightest model, Falcon 7B, and even with that we found that the deployment time took longer than expected. GPU machines were also not readily available on GKE, and the largest GPU node we were able to get with GKE had a twelve gigabyte GPU, which was not enough to push on with the other large language models. Given that, we went back and decided to do the virtual machine deployment strategy. Text Generation Inference allows you to host the model on a VM with a single GPU, or to parallelize inference across multiple GPUs. One of the benefits of parallelizing across multiple GPUs is that you get more total memory across all the GPUs for inference. We then set up and started running our experiments.
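As a rough illustration of the memory requirements mentioned above, a common back-of-envelope estimate is parameters times bytes per parameter just to hold the weights, with extra headroom needed on top for activations and the KV cache. The figures below are that rule of thumb, not measurements from the talk.

    # Back-of-envelope estimate of GPU memory needed just to hold model weights.
    # Real inference needs additional headroom (activations, KV cache), so treat
    # these as lower bounds rather than exact requirements.
    def weight_memory_gb(n_params_billion: float, bytes_per_param: float) -> float:
        return n_params_billion * bytes_per_param  # billions of params * bytes each = GB

    for name, params in [("Falcon-7B", 7), ("Falcon-40B", 40), ("Llama-2-70B", 70)]:
        fp16 = weight_memory_gb(params, 2.0)   # 16-bit weights
        int8 = weight_memory_gb(params, 1.0)   # 8-bit quantization
        int4 = weight_memory_gb(params, 0.5)   # 4-bit quantization
        print(f"{name}: ~{fp16:.0f} GB fp16, ~{int8:.0f} GB int8, ~{int4:.1f} GB int4")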
When we deployed the models on a virtual machine, we set up notebooks to run different experiments so that we could track the data and compare the results later on. We had different cases where we deployed the models: on a VM with a single GPU, and on a VM with multiple GPUs. For the single GPU VM, we had a short response latency, which was really perfect for us, and the deployment was quite quick, but the number of max tokens which could be processed was limited by the GPU memory. With a VM with multiple GPUs, we increased the GPU memory footprint, so we had more GPU memory, and this in turn increased the max token processing. However, the time it took for the model to be ready increased, so the deployment time increased, and the response latency also increased. There were several factors we looked into for why the latency increased. During our experiments, we found that part of it was due to prompt size and complexity: if the prompt was long and quite complex, it would take longer for any of the models to process. However, the shorter and less direct the prompt is, the more it opens up the possibility of side effects such as hallucinations. Specifically on latency with the GPU setup, we tracked down that using more GPUs on a VM increased latency. We figured out that this could be because there's a lot of I/O happening between the GPUs, so there's a lot of data exchange happening, versus when you do inference on a single GPU machine. Another thing we tracked with regards to latency is the max token output. We were able to set the max token output of the deployment, and with a higher output token length the response from the model becomes more detailed and longer, which requires more processing time. But if the token length is extremely short, the response might not be complete or make sense to a human. Thus, there has to be a balance between how detailed the model response is and the quickness of the response. For a more detailed analysis of this: as I mentioned, it is possible to set the number of tokens used for the output. For all the previous tests we had done, we set the output tokens to 256, and once we doubled it, we were able to observe different effects from the models tested. For the Falcon model with 256 maximum tokens, the model seemed to perform adequately. Once we increased the number of tokens, the response time initially seemed to be the same as before, but as the conversation with the user continued, the response time began to increase. Another effect was that the responses became longer and the model started to hallucinate, appearing to have a conversation with itself. With Llama 2, we again started with 256 maximum tokens and the model performed adequately, but we noted some instances where the text response would appear to cut off. With an increase in the token size, the latency also increased quite dramatically and the responses were more complete, but as the conversation continued, we got some instances where the response would get cut off again. As mentioned, we have marketplaces across different countries and different languages, so it was important for us to test how this performs, and since we're launching first on Leboncoin, we looked at using French. So we tested Falcon and Llama 2 on their ability to do inference with French users.
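This is a minimal sketch of the kind of latency comparison described above: the same prompt is sent to a running TGI endpoint with the output cap set to 256 and then 512 tokens, and the wall-clock response time is recorded. The endpoint address and prompt are illustrative assumptions.

    import time
    import requests

    ENDPOINT = "http://127.0.0.1:8080/generate"
    prompt = "User: I'm looking for a Ford Focus or Fiesta.\nAssistant:"

    for max_new_tokens in (256, 512):
        start = time.perf_counter()
        resp = requests.post(
            ENDPOINT,
            json={"inputs": prompt, "parameters": {"max_new_tokens": max_new_tokens}},
            timeout=300,
        )
        elapsed = time.perf_counter() - start
        text = resp.json()["generated_text"]
        print(f"max_new_tokens={max_new_tokens}: {elapsed:.1f}s, {len(text)} characters generated")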
The main adjustment was with the system prompt, directing the model to respond only in French. For Falcon, if the user's question was in English, the model proceeded to respond in English, but if the user asked questions in French, the model responded in French; if subsequent user questions were in English, the model remained speaking only in French, and if all the user questions were in French, the model responded in French throughout. However, there were some side effects we observed, and some hallucinations within the model when using the French language. In the case of Llama 2, if the user's first question was in English, the model proceeded to respond in English; if the user's first question was in French, the model responded in French. However, if subsequent questions switched to English, the model reverted to responding in English, and if all the user's questions were in French, the model kept responding in French. What we can conclude from this is that both models are quite adequate at responding to users in French. One thing to note here is that, within our discovery, we found out there's no one-to-one switch between different models: it is not easy to take the same prompts or techniques used with one conversational search model and use them with another without experiencing some type of side effect. For example, the prompts we used for these experiments had to be adjusted from what we're using with the OpenAI conversational search large language model. Though these experiments show points of success, more adjustments would be needed to get responses to the quality we're getting with OpenAI. We also adjusted the properties for launching the model service and the properties of the API calls, but we did not do an in-depth evaluation of these properties. It is evident that with some slight adjustments, the latency and quality of the large language model changes, but further investigation would be needed by a team to find the best properties for the required results. So the big question might be: how much is this all going to cost you? In our experience, we were able to run the models on a single A100 40GB GPU, and the cost calculation for that comes to around just under $3,000 a month. But since we have discovered that having more GPU memory is an advantage, we would need to select the highest GPU available, which would be the NVIDIA A100 80GB, and that would come to a cost of around $4,200 a month. This estimation is for a single instance; if we were to scale out using Kubernetes Autopilot or by manually adding more VMs, the cost of running the open large language model would grow significantly. It's also good to note that these costs do not include human costs, networking costs or maintenance costs. So in our exploration phase, we looked at hosting a large language model and compared it to an enterprise large language model. The downsides we ended up finding are that it's difficult to get adequate GPUs, there are high costs associated with running these models, and it would also require internal support or expertise, as well as security maintenance.
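To illustrate the kind of system-prompt adjustment described above, here is a minimal sketch using the publicly documented Llama 2 chat prompt format. The wording of the system prompt and the sample question are assumptions for illustration; the team's actual production prompts are not shown in the talk.

    # Hypothetical system prompt directing the model to answer only in French,
    # wrapped in the Llama 2 chat template.
    SYSTEM_PROMPT = (
        "You are a search assistant for an online marketplace. "
        "Always respond only in French, regardless of the language the user writes in."
    )

    def llama2_chat_prompt(system: str, user_message: str) -> str:
        return f"<s>[INST] <<SYS>>\n{system}\n<</SYS>>\n\n{user_message} [/INST]"

    payload = {
        "inputs": llama2_chat_prompt(SYSTEM_PROMPT, "Je cherche une Ford Focus ou Fiesta."),
        "parameters": {"max_new_tokens": 256},
    }
    # The payload can then be POSTed to the TGI /generate endpoint as in the earlier example.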
On the other side, while we were running this proof of concept, we came to the discovery that OpenAI had announced they were decreasing the cost of GPT-3.5 Turbo, and this made it more promising to use. As for the slow API response times, we are also doing some fine-tuning on the model, and that fine-tuning actually sped up the response time. So my learnings and outcome from this: deploying a large language model is definitely possible, depending on your use case. Based on our use case, it's best to start off with a lightweight model like Falcon 7B and start using that internally. Just discover, play with it, and figure out what properties or functionality you can offload to your internal model, and slowly start offloading to it until you get to a stage where you either have a balance with a pay-as-you-go service or are fully running it on your own infrastructure. Thank you, and I hope you enjoyed the talk. If you have any questions, please feel free to get in touch with me.
...

Bongani Shongwe

Senior Data Engineer @ Adevinta / eBay

Bongani Shongwe's LinkedIn account


