Conf42 Machine Learning 2024 - Online

Self-Hosted Open Source LLMs: Best Practices and Strategies

Abstract

While a lot of organizations utilize 3rd-party APIs and services to build generative AI-powered applications, we are seeing more teams set up their own self-hosted large language models (LLMs). In this session, we will discuss various best practices when building and managing self-hosted LLMs.

Summary

  • Today we'll be talking about the different best practices and strategies when dealing with self-hosted, open source large language models. Have you ever wondered how long it would take to build a scalable, secure, reliable, high-performance, low-cost, generative AI-powered application? These are the attributes that are necessary to build a proper AI application.
  • Under AI, there are, for now, three different stages: ANI, AGI, and ASI. Machine learning is a subset of AI, and under deep learning we have generative AI. So I'll turn you over to Arvin to discuss in more detail how to answer the question.
  • Starting with an existing API involves a much more straightforward approach, meaning there's going to be less code and less installation. Once we get to deal with more complex queries, the costs add up. And once we're locked into these types of APIs and services, it becomes harder to migrate.
  • Open source LLMs generally involve a fixed infrastructure cost. Having a self-hosted open source large language model setup gives us the highest level of control. With this type of setup, we're able to allocate more time to the model part.
  • When dealing with security, we have to always think and position ourselves from the point of view of an attacker. No amount of best practices can really replace an actual attacker. When dealing with large language models, and even machine learning models in general, it gets trickier once you have to deal with distributed setups.
  • Next would be model monitoring. If you're not able to save or capture the requests and responses, then it would be tricky for you to analyze and review whether the responses are actually acceptable. Versioning is critical because we're all pretty sure that your system is going to evolve.
  • Then finally, automation. Automation is critical because, given the number of tasks and the types of duties that you'll have to worry about, you have to find ways to improve the workflow, even though it's not a good idea to automate all the steps in one go.

Transcript

This transcript was autogenerated.
Good day everyone. Today we'll be talking about the different best practices and strategies when dealing with self-hosted, open source large language models. Let's quickly introduce ourselves. I am Joshua Arvin Lat and I am the Chief Technology Officer of NuWorks Interactive Labs. I am also an AWS Machine Learning Hero, and I'm the author of three books: the first one being Machine Learning with Amazon SageMaker Cookbook, the second being Machine Learning Engineering on AWS, and my third book with the title Building and Automating Penetration Testing Labs in the Cloud. Hi everyone, I am Sophie Soliven. I am the Operations Director at edamama. Previously, I was the general manager of e-commerce services and dropship at BeautyMNL, Dealgrocer, and Shoplight. I also have a couple of certifications in cloud computing and data analytics. Lastly, I was also a technical reviewer of the book Machine Learning Engineering on AWS.

So before we do a deep dive on this topic, I know everyone here has already used or tried out AI tools such as ChatGPT and Gemini. Have you ever wondered how long it will take to build that generative AI-powered application? Actually, let me rephrase myself: have you ever wondered how long it will take to build that scalable, secure, reliable, high-performance, low-cost, generative AI-powered application? These are the attributes that are necessary to build a proper AI application. Before we answer this question, we have to first discuss the fundamentals of this topic, so we'll start with the concepts. Usually people confuse machine learning with AI. It's typical for people to think they are the same, but actually they're not. Machine learning is a subset of AI, as you can see here. But as you know, the AI universe has a number of concepts and terms under this umbrella, so here we've illustrated the different concepts to show how they interrelate with one another. As you can see, AI is the overarching concept, and under AI there are, for now, three different stages: ANI, AGI, and ASI. Right now it may seem that the technology has evolved very far, but we're still at the initial stage, artificial narrow intelligence (ANI). The next stage is AGI (artificial general intelligence), wherein the AI tools are more advanced and complex and can help solve harder problems than at the ANI stage. And lastly, we have ASI (artificial superintelligence), wherein the AI tools could possibly help solve the world's problems. So under AI we have machine learning, under machine learning we have deep learning, and under deep learning we have generative AI, which is our topic for today. And as you know, there are different formats under GenAI: we have LLMs, which are text-based, as well as models for images and audio. So I'll turn you over to Arvin to discuss in more detail how to answer the question that I posed a while ago.

So now that we have a better understanding of the concepts, let's dive deeper into self-hosted open source large language models. Years ago, AI developers and engineers asked themselves the question: would they be building everything from scratch, or would they be using existing machine learning or deep learning frameworks and managed services or solutions? And in the past year or so, most AI developers and engineers now face the following question: would they have to start with a self-hosted open source large language model setup?
Or would they just go straight and use a paid generative AI API, something like the ChatGPT API or the Bedrock API, where they can get started right away? While that choice is not the topic of this session, it's important for us to consider, because it plays a significant role whenever we're going to build generative AI applications, and our goal is also to make the most out of our open source large language model setup. When we talk about barrier to entry, from my own experience, starting with an existing API involves a much more straightforward approach, meaning there's going to be less code and less installation. And if the documentation is updated, then in most cases you just need to add your credit card and you may be able to get your first hello-world AI project running. So in terms of barrier to entry, with no infrastructure needed, we should be able to get started right away with an existing paid generative AI API.

How about infrastructure cost? This is where it gets tricky. When we're playing around with simple prompts, we may simply underestimate the overall cost involved when using these generative AI APIs and when trying to compute infrastructure cost. Once we get to deal with more complex queries, when there are longer token lengths involved, and when there are multiple attempts to query those APIs, the cost adds up without us noticing it. And once we're locked into these types of APIs and services, it becomes harder to migrate to other options. On the other end, open source LLMs generally involve a fixed infrastructure cost. Of course, at the start that fixed cost is higher, because you have to pay for the server or servers involved in running the deployed large language models. In the past, when dealing with, let's say, machine learning models which are not necessarily large language models, when they are small enough they can be deployed inside serverless endpoints where the cost scales depending on the traffic or the usage of those endpoints. However, when dealing with large language models, the cloud platforms may not necessarily be able to provide that serverless option, so you have to pay the per-hour rate of the very large server where the model is deployed. But when it comes to flexibility and level of control, having a self-hosted open source large language model setup gives us the highest level of control because, for one thing, you have full control of everything: if you want to fine-tune, if you want to modify the flow, or if you want to change the model right away or try a new model that is not yet available in the paid APIs. Those are some of the advantages of having self-hosted large language models.

If you're wondering what this would look like when dealing with self-hosted large language models: there's generally an environment where we do our experiments, and that environment is where we prepare our scripts and our data, and in some cases it would be used to launch new resources, whether those are fine-tuning jobs or other jobs where the scripts are run. However, the actual infrastructure where the model is trained and deployed would be separate, because that involves a much larger infrastructure. So in terms of best practice, number one, it is recommended that the always-on data science environment have a very small instance size or type, so that the per-hour cost is low, and that the servers for the actual large language model training and deployment be the actual large ones.
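To make this instance-sizing point more concrete, here is a minimal, hedged sketch of launching a fine-tuning job from a small always-on environment using the SageMaker Python SDK; the container image, IAM role, bucket paths, and instance type below are placeholder assumptions rather than details from the talk.

```python
# Minimal sketch: the small always-on environment only *launches* the job;
# the large GPU server exists only while the job runs and is billed for that time.
import sagemaker
from sagemaker.estimator import Estimator

session = sagemaker.Session()
role = "arn:aws:iam::123456789012:role/ExampleSageMakerRole"  # hypothetical IAM role

estimator = Estimator(
    image_uri="<your-fine-tuning-container-image>",    # hypothetical training image
    role=role,
    instance_count=1,
    instance_type="ml.g5.12xlarge",                    # large instance, used only during training
    max_run=3600,                                      # hard stop so a stuck job can't keep billing
    output_path="s3://example-bucket/llm-artifacts/",  # hypothetical output bucket
    sagemaker_session=session,
)

# The server is provisioned when fit() starts and released when the job ends,
# so a short job only bills for the time it actually runs.
estimator.fit({"training": "s3://example-bucket/llm-training-data/"})
```

The same idea applies in other frameworks as well: the expensive resource is created just in time for the job and torn down automatically once it finishes.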
So for the training, which could be expensive, what matters is that the billing is per second or per hour. If the actual training job runs only for, let's say, 30 seconds, then you only have to pay for 30 seconds, because after the 31st second the server would automatically get deleted. Deployment endpoints, of course, are a different story, but generally, in some cases, the deployment servers have smaller instance types compared to the training servers needed. So as long as we're able to allocate and assign different instance types and different environments depending on the type of task for the self-hosted large language model, it should help us manage the cost long term. Once we have that model trained and deployed, of course, we have to prepare the other parts of the infrastructure, because the actual model is probably just one-fourth or one-fifth or even one-sixth of the overall application. We have to set up the front end, we have to set up the back end, and we also have to set up the database and connect all of them. One of the possible ways to deal with this type of requirement, especially when you're in a rush, is to utilize a serverless implementation at the start, even if the actual deployed model endpoint is not serverless. In this case, 70% of this architecture makes use of various managed and serverless components: for example, a serverless function like AWS Lambda and a serverless API Gateway, which would invoke the Lambda function, which would then invoke and pass the inputs to the large language model. With this type of setup, we're able to allocate more time to the model part, because the cloud provider is already taking care of the infrastructure for the serverless components, so we don't have to worry about the system administration side of those resources.

Now for the exciting part: let's dive deeper into the various other best practices we'll be discussing in the next few minutes. Let's start with security. When dealing with security, whether the setup is self-hosted or not, we have to worry about securing it, and we have to always think and position ourselves from the point of view of an attacker. It's not just the best practices that we have to worry about, because no amount of best practices can really replace an actual attacker trying to find ways to use the model and have it misbehave, let's say, by having it send spam emails to various users. Just in case you're interested in learning more about the security aspect, we actually have a session on how to build an LLM vulnerability scanner here at Conf42. When dealing with large language models and a self-hosted open source setup, we have to take care of the different attacks like prompt injection, and we have to worry about data poisoning, because all of that is in our control. What if, for example, there's a storage bucket containing the data which will be used to train or tune the model, and that data is modified or altered by an attacker? If the process is automated, then it's possible for the final output, the final version, to produce results which would of course be influenced by data which has already been modified by the attacker. This means that even if the attacker has no direct access to the deployed large language model or models, there's still some way for the attacker to cause harm.
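For the mostly serverless front described earlier in this paragraph (API Gateway invoking a Lambda function, which then passes the input to the model endpoint), a minimal sketch might look like the following; the endpoint name, the payload shape, and the assumption that the model sits behind a SageMaker real-time endpoint are illustrative rather than details from the talk.

```python
# Minimal sketch of the Lambda function behind the serverless API Gateway:
# it forwards the prompt to a self-hosted LLM endpoint and returns the result.
import json
import boto3

runtime = boto3.client("sagemaker-runtime")
ENDPOINT_NAME = "self-hosted-llm-endpoint"  # hypothetical endpoint name

def handler(event, context):
    # API Gateway proxy integration passes the request body as a JSON string.
    body = json.loads(event.get("body") or "{}")
    prompt = body.get("prompt", "")

    response = runtime.invoke_endpoint(
        EndpointName=ENDPOINT_NAME,
        ContentType="application/json",
        Body=json.dumps({"inputs": prompt}),  # payload shape depends on the deployed model server
    )
    result = json.loads(response["Body"].read())

    return {"statusCode": 200, "body": json.dumps(result)}
```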
When dealing with large language models, and even machine learning models in general, it gets trickier once you have to deal with distributed setups, when you have multiple resources running, and when you have multiple experiments running at the same time. When running those types of experiments and operations, it's critical to use a dedicated debugger which allows you to add breakpoints and troubleshoot in a different way, one where you're able to manage the situation where the script is not actually running in real time in the same environment, but is running somewhere else in a different container or server. This means that there should be different ways to control the debugger, and there should be ways for you to get notified when something gets detected. This will help you manage issues much earlier, and you're able to save on cost, because you don't have to wait for, let's say, five to six hours only to realize that you ended up with a failed training job.

Next would be model monitoring. Model monitoring is important because, for one thing, if you're not able to save or capture the requests and responses, then it would be tricky for you to analyze and review whether the responses are actually acceptable or not. There are various ways to analyze the data, and you cannot analyze any data if it's not being collected at all. Model monitoring is a large topic of its own, so being able to deep dive on model monitoring, and having sophisticated and mature tools to help monitor and analyze the model's performance across every version of it, is critical to the long-term success of your deployment setup.

Next, versioning. Versioning is critical because we're all pretty sure that your system is going to evolve. The tricky part here is that it's not a single component which would change; it's the entire infrastructure, the entire implementation. In the first version, there may not be any sort of retrieval-augmented generation setup, but in the second version there's a RAG setup. So how do you even compare those? Are they even comparable? Do you even need the same set of models? Do you need a less powerful model for the RAG version, plus a vector database? A lot of people think that this is a linear process, but setup versioning is not that straightforward when dealing with open source large language model deployments.

Next, deployment options. Machine learning models may be able to respond within an acceptable range of maybe half a second to maybe 30 or 40 seconds. Forty seconds at this point may already take a bit of time, and in some cases some models may even respond in more than two or three minutes, or even five minutes, depending on the type of process that's being run and the type of request that's being processed by the model. Given that there's a big chance that serverless might not be a supported deployment option, asynchronous invocations can be a viable option, and you should be able to utilize the event-driven nature of the cloud with the asynchronous deployment option. Again, there could be a real-time deployment option where, when you make a request, you just wait for maybe ten to 20 or 30 seconds and then you get a response. But if there's no guarantee that something will get returned in 30 seconds, and the response might come after, let's say, six or eight minutes, then it may make sense to review the other deployment types or options available in whatever framework or service that you're using.
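As a hedged illustration of the asynchronous option mentioned above, SageMaker asynchronous endpoints accept a pointer to a request payload in S3 and immediately return the location where the response will eventually land, which suits long-running LLM requests; the endpoint and bucket names below are placeholders.

```python
# Minimal sketch: asynchronous invocation for requests that may take minutes.
# The call returns right away with the S3 location where the output will appear.
import boto3

runtime = boto3.client("sagemaker-runtime")

response = runtime.invoke_endpoint_async(
    EndpointName="self-hosted-llm-async-endpoint",                 # hypothetical async endpoint
    InputLocation="s3://example-bucket/requests/prompt-001.json",  # request payload uploaded beforehand
    ContentType="application/json",
)

# Poll this location (or subscribe to the endpoint's success/error notification topics)
# instead of blocking the caller for several minutes.
print(response["OutputLocation"])
```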
Then finally, automation. Automation is critical because, given the number of tasks and the types of duties that you and your team will have to worry about, you have to find ways to iteratively improve the workflow. And while it's not a good idea to automate all the steps in one go, it's about identifying which steps would probably not change and which steps could easily be automated, so that you would be able to speed up the process bit by bit. For example, if you're able to start with small scripts to automate some data processing tasks and some data cleaning tasks, and then later on convert them into more formal processing jobs, then that's the way to automate.

So there, that's a good summary of some of the best practices when dealing with an open source large language model setup. That's pretty much it. We were able to talk about a lot of things. We started with a quick introduction of the concepts, such as machine learning, AI, and also generative AI, where we learned that with generative AI we can generate text, sounds, or even images. And again, the focus is LLMs, a set of concepts really relevant in helping us dive deeper into the more advanced best practices and strategies needed to handle a more specific set of projects, especially those which involve open source large language models. We hope you learned something new.
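Tying back to the automation point above, a first automation step is often nothing more than a small script that cleans the raw text used for fine-tuning, which can later be promoted into a formal processing job; this is a generic, hedged sketch with hypothetical file paths, not a script from the talk.

```python
# Minimal sketch: a small data-cleaning script that could later become a formal processing job.
import json
import re
from pathlib import Path

RAW_PATH = Path("data/raw_samples.jsonl")      # hypothetical raw export
CLEAN_PATH = Path("data/clean_samples.jsonl")  # hypothetical cleaned output

def clean_text(text: str) -> str:
    # Collapse whitespace and trim; real pipelines would add more rules over time.
    return re.sub(r"\s+", " ", text).strip()

with RAW_PATH.open() as src, CLEAN_PATH.open("w") as dst:
    for line in src:
        record = json.loads(line)
        text = clean_text(record.get("text", ""))
        if len(text) >= 20:                    # drop near-empty samples
            dst.write(json.dumps({"text": text}) + "\n")
```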
...

Joshua Arvin Lat

CTO @ NuWorks Interactive Labs


Sophie Soliven

Director of Operations @ edamama



