Transcript
This transcript was autogenerated. To make changes, submit a PR.
Good day everyone. Today we'll be talking about the
different best practices and strategies when dealing with
self hosted, open source, large language models.
Let's quickly introduce ourselves. So, I am Joshua Arvin
Latt and I am the chief technology officer of Newworks
Interactive Labs. I am also an AWS machine
Learning hero, and I'm the author of three books,
the first one being machine learning with Amazon Sage
maker cookbook machine learning engineering
on AWS, and my third book with
the title building and automating penetration
testing labs in the cloud. Hi everyone,
I am Sophie Sullivan. I am the operations director
at Edamama. Previously I was the general manager
of e commerce services and dropship at Beauty MNL,
Dealgrocer and Shoplight. I also have a
couple of certifications in cloud computing and data
analytics. Lastly, I was also
a technical reviewer of the book machine learning engineering on
AWS. So before we do like a deep
dive on this topic, I know everyone here has
already used or tried out AI tools such as
chat, GPT, and and Gemini. Have you ever wondered
how long it will take to build that generative,
AI powered application? Actually, let me
rephrase myself. Have you ever wondered how
long it will take to build that scalable,
secure, reliable, high performance,
low cost, generative AI powered application?
So these are the attributes that are necessary to build
a proper AI application.
So before we do like a deep dive and answer this question,
we have to first discuss the fundamentals of this topic.
So we'll start with the concepts first.
Usually people confuse machine learning
with AI. It's typical for people to
think they are the same, but actually they're not.
Machine learning is a subset of AI, as you
can see here. But as you know, the AI universe
has a number of concepts and terms under
this umbrella. So here we illustrated the different
concepts to illustrate how these interrelate with
one another. As you can see here,
the overarching theme or concept,
and then under AI, there are for now, three different
stages. So we have Ani,
AGI and Asi. Right now,
at this stage, it would seem that technology has really evolved
in advance, but we're at its initial infancy
or under artificial narrow intelligence.
The next stage is AGI, wherein the AI
tools are now more advanced and complex,
wherein it could help solve more complex
problems than the ANI stage. And lastly,
we have ASI, wherein at this stage,
the AI tools can now possibly help or
solve the world's problem. So under AI we have
machine learning. Under machine learning we have deep
learning. And under deep learning we have generative AI,
which is our topic for today. And as
you know, there are different formats under Genai.
So we have LLM, which is a text based AI tool,
images and audio. So I'll turn you
over to orbs to discuss more in detail
how to answer the question that I posed a while ago.
So now that we have a better understanding of the concepts,
let's now dive deeper into self hosted open
source large language models. So years
ago, AI developers and engineers
asked themselves the question, would they be building
everything from scratch, or would they be using existing
machine learning or deep learning frameworks
and managed services or solutions?
And of course, in the previous year also, most of
the AI developers and engineers would now face the
following would they have to start
with a self hosted open source large language
model setup? Or would they just go
straight and use a paid generative AI API,
something like chat GPT API or bedrock API,
where they can get started right away?
So while that is not the topic of this session,
it's important for us to know, because that
plays a significant role whenever we're going to build generative AI
applications. And our goal is to also make the most
out of our open source large language model setup.
And when we talk about dire to entry from
my own experience, starting with an existing API
involves a much more straightforward approach, meaning there's
going to be less code, less installation.
And if the documentation is updated, then in most cases
you just need to add your credit card there
and you may be able to get your first hello world
AI project. So in terms of buyer to entry with
no infrastructure needed, we should be able to
get started already with an existing paid generative
how about infrastructure cost? So this is
where it gets tricky. When we're playing around with
simple prompts, then we may simply underestimate
the overall cost involved when using these
generative AI APIs and when
trying to compute infrastructure cost. Once we get
to deal with more complex queries, and when there's longer
token lengths involved, and then where there are multiple attempts
to query those APIs, the cost
would add up without us noticing it. And once we're
locked in to these types of APIs
and services, it becomes harder to migrate to
other options. However, on the other end, open source
LLMs generally involve a fixed infrastructure
cost. Of course, at the start, a higher fixed
cost amount because you have to pay for the server
or servers involved when running the
deployed large language models. So in the past,
when dealing with, let's say, machine learning models, which are not necessarily
large language models, when they are small enough, they can
be deployed inside serverless endpoints where the
cost scales depending on the or the
usage of those endpoints.
However, when dealing with large language models, generally the
cloud platforms may not necessarily be able to provide that
serverless option for large language models.
So you have to pay for that per hour rate of
that very large server where the model is deployed.
But when it comes to flexibility and level of control,
having an open source large language model self
hosted setup would give us this highest
level of control because for one thing, you have full control of
everything. If you want to fine tune,
if you want to modify the
flow, you want to change the model right
away or try a new model that is not yet
available in the paid APIs.
Those are some of the advantages when having self
hosted large language models. So if you're wondering
how this would look like when dealing with self
hosted large language models, an environment
where we do our experiments and
that environment is where we prepare our scripts,
we prepare the data, and in some cases it would
be used to launch new resources.
So whether they're fine tuning jobs or that's
where the scripts are run. However, the actual infrastructure where
the model is trained and deployed, it would be separate because
that would involve a much larger infrastructure.
So in terms of best practice, number one, it is recommended that
the always on data science environment has
a very small instance size or type,
so that the per hour costs would be low and
that the actual large language model training and
deployment the servers there would be the
actual large ones. So for the training,
which could be expensive, as long as the billing is per
second or per hour. If the actual training job
runs only for, let's say 30 seconds,
then you only have to pay for 30 seconds because after 31
seconds the server would automatically get deleted
deployment endpoints, of course that's a different story,
but generally in some cases
the deployment servers have smaller
instance types compared to the training servers
needed. So as long as we're able to allocate and
assign different instance types and different environments
depending on the type of task
for the self hosted large language model, then it should help
us manage the cost long term. So once we
have that model trained and deployed, of course we have
to prepare the other parts of the infrastructure,
because after setting up and deploying the
actual model, it's just probably one fourth
or one fifth or even one 6th of the overall application.
We have to set up the front end, we have
to set up the back end, and you also have to set
up the database and connect all of them.
So one of the possible ways to deal
with this type of requirement is when you're in a rush. It may make
sense at the start to utilize a serverless implementation,
even if the actual deployed model endpoint
is not serverless. So in this case,
70% of this architecture makes use
of various managed and serverless components.
So for example, a serverless function like AWS
lambda and a serverless API gateway
which would invoke the lambda function, which would then invoke
and pass the inputs to the large language model.
With this type of setup, we're able to
allocate more time on the model part
because the cloud provider has already been
taking care of the infrastructure for the serverless
components so that we don't have to worry about the system administration
part of those resources.
Now for the exciting part, let's now dive deeper
into the various other best practices we'll
be discussing in the next few minutes. Let's start with
security. When dealing with security,
whether it's self hosted or not, we have to
worry about the security of the setup.
When dealing with security, we have to always think
and position ourselves in the point of view of
an attacker. And it's not just the best practices that
we have to worry about, because no amount of best practices
can really replace an actual attacker.
Trying to find ways to
use the model and have it misbehave,
let's say, have it send spam emails to various users.
So just in case that you're interested in learning
more about the security aspect, we actually have
on how to build an
LM vulnerability scanner here in Conf 42,
just assume that when dealing with large language models
and having a self hosted open source
setup, we have to take care of
the different attacks like prompt injection.
We have to worry about data poisoning because all
of that is in our control. What if, for example,
a storage bucket containing the data which will be
used to train or tune? What if that is
modified or altered by an attacker?
So if the process is automated,
then it's possible for the
final output, the final version,
to produce results which would of course
be influenced data which has been modified already
by the attacker. So this means that even if the attacker has
no direct access to the deployed large language
model or models, then there's still some way for the attacker
to cause harm. When dealing with
large language models and even machine learning models in general,
it gets trickier once you have to deal with distributed setups
and when you have multiple resources running, and when you have multiple experiments
running at the same time. When having
those types of experiments and actions and operations,
it's critical to use a dedicated debugger which
allows you to add breakpoints and allow you to troubleshoot
in a different way where you're able to
manage the situation, where the
script is actually not running real time in the same
environment, but it's running somewhere else in a different container
or server. This means that there
should be different ways to control the debugger,
and there should be ways for you to get notified when
something gets detected. This will help you manage
issues much earlier, and you're able
to save on cost and save money, because you
don't have to wait for, let's say, five r to six r's only to realize
that you ended up with a failed
training job. Next would be model monitoring.
Model monitoring is important because for one thing,
if you're not able to save or capture the request and
response, then it would be tricky
for you to analyze and review whether the responses
are actually acceptable or not. And there are
various ways to analyze the data, and you cannot analyze any
data if it's not being collected at all. So model monitoring
is a large topic of its own. So being able to deep dive
on model monitoring and have sophisticated and
mature tools to help monitor and analyze the model's
performance, every version of it,
it's critical to the long term success of
your deployment setup. Versioning,
sort of. Versioning is critical because we're
all pretty sure that your system is going to evolve.
And the tricky part here is that it's not a
single component which would change. It's the entire infrastructure,
it's the entire implementation. In the first version,
there may not be any sort of retrieval
augmented generation setup, but in the second version there's
rag set up. So how do you even compare
those? Are those even comparable? Do you even need the same set
of models? Do you need a less stronger
model for the rag version and have a vector database? So a
lot of people just think that this is a linear process. But setup
versioning is somewhat not
that straightforward when dealing with open source,
large language model deployments.
Deployment options machine learning models may be able
to respond within an acceptable range of maybe
half a second to maybe 30 seconds or
40 seconds. 40 seconds already at this point
may take a bit of time, and in some cases,
some models may even respond in more than two or three
minutes or even five minutes, depending on the type of of process
that's being run, the type of request that's being processed
by the model. Given that there's
a big chance that serverless might not be a supported deployment
option, then asynchronous invocations can
be a viable option, and you should be able to utilize
the event driven nature of the cloud, the asynchronous
deployment option. So again, there could be real time deployment
option where when you do a request, you just wait for
maybe ten to 20 seconds or 30 seconds and then you get
a response. But if there's no guarantee that something
will get returned in 30 seconds, and the response
might happen after, let's say six minutes or eight minutes,
then it may make sense to review the other deployment types
or options available in whatever framework or service
that you're using. Then finally,
automation. Automation is critical because
given the number of tasks and the types of duties
that you'll have to worry about and your team has to worry about,
you have to find ways to iteratively improve
the workflow. And while it's not a
good idea to automate all the steps all in one go,
it's about identifying which steps would probably not change and
which steps could easily be automated so that
you would be able to speed up the process bit by bit.
So for example, if you're able to start with small scripts to
automate some data processing tasks, some data cleaning tasks,
and then later on convert them into more formal processing
jobs, then that's the way to automate.
So there, that's a good summary of some of
the best practices when dealing with open source
large language model setup. So that's pretty much
it. We're able to talk about a
lot of things. We are able to start with a
quick introduction of the concepts such as machine
learning AI and also generative
AI, where we're able to easily
know that with generative AI we can generate text,
we can generate sounds, or you can even
generate images. And again,
the focus is LLMs set of concepts
really relevant in helping us dive deeper into
the more advanced best
practices and strategies in order for us to handle a
more specific set of projects,
especially those which involve open
source large language models,
projects and hope you learned something new.