Transcript
This transcript was autogenerated. To make changes, submit a PR.
Thank you all for joining the session. Today, I am going to talk about large language models, the future of large language models, and the productionization of large language models. I am Deepak, and I work as an associate director for data science and machine learning projects. I have more than 15 years of experience in data science and machine learning, and I have been working predominantly in generative AI for the past three years. All right, now let me take you to the next slide.
Before I talk about large language model productionization, or deploying in the cloud, let's understand traditional AI model development and deployment, followed by the challenges we have in deploying or productionizing traditional AI models. Then I'll walk you through large language models like GPT-4 and explain the architecture of large language models, or generative AI models. Then I'll take you through the LangChain framework and how applications can be developed with LangChain, followed by a demo.
Moving to the next slide, let's talk about traditional AI models. When I say traditional AI models, we began with linear regression, logistic regression, random forests, decision trees, and boosting methods like AdaBoost and XGBoost, then neural networks, and the evolution went from neural networks to transformers in 2017. That is how the industry had its breakthrough, with a model called BERT, Bidirectional Encoder Representations from Transformers, which has performed significantly well on most natural language processing tasks. So when I talk about traditional AI models, let me not start from linear or logistic regression or random forests.
Let's begin with a smaller large language model, the one I just mentioned, BERT. The process involved in model training or fine-tuning requires a huge amount of data. Once we train or fine-tune the model, we fine-tune it for a specific task or a specific domain. Typically, it needs a GPU machine to run the fine-tuning process. Once we perform the fine-tuning, we do hyperparameter tuning, with the learning rate, epochs, and multiple additional parameters, to arrive at the right weights for the model to perform classification, question answering, or any of the other tasks it can perform.
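To make that concrete, here is a minimal fine-tuning sketch using the Hugging Face Transformers Trainer; it is an illustration rather than the talk's own code, and `train_ds` and `eval_ds` are assumed to be already-tokenized classification datasets.

```python
# A minimal BERT fine-tuning sketch (assumes `train_ds` and `eval_ds`
# are pre-tokenized datasets; typically run on a GPU machine).
from transformers import (AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)

model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)

args = TrainingArguments(
    output_dir="bert-finetuned",
    learning_rate=2e-5,              # hyperparameters mentioned above
    num_train_epochs=3,
    per_device_train_batch_size=16,
)

trainer = Trainer(model=model, args=args,
                  train_dataset=train_ds, eval_dataset=eval_ds)
trainer.train()
```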
Then, once we do the fine-tuning and optimization of the model, we have to deploy the model in a cloud environment; it could be AWS, Azure, or even Google Cloud Platform. But before that, when you are going to deploy the model, the model has to be serialized, because when we deploy the model in production it should have scalability, reliability, and durability. Considering that, when we move the model to production, we serialize it into a PyTorch or TensorFlow saved-model format to serve the model.
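For reference, a minimal serialization sketch in PyTorch might look like the following; `model` is assumed to be the fine-tuned model from the previous step, and the file name is illustrative.

```python
# A minimal model-serialization sketch (assumes `model` is the fine-tuned
# PyTorch model from the previous step).
import torch
from transformers import AutoModelForSequenceClassification

# Save the learned weights (state dict); this is the artifact that gets
# packaged for a serving framework such as TorchServe.
torch.save(model.state_dict(), "bert_finetuned.pt")

# Reloading later requires rebuilding the same architecture first.
restored = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)
restored.load_state_dict(torch.load("bert_finetuned.pt"))
restored.eval()   # switch to inference mode before serving
```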
For model serving, as I mentioned, there is a framework called TorchServe (PyTorch serving), which gives scalability for performing inference. This framework provides an API, so once we build a real-world application, that application can invoke the inference, or prediction, by calling the TorchServe endpoint. Because we expose it as an API, we can come up with an API design, and based on that design we can start invoking the model. A single model or multiple models can be deployed in production. I will talk about the TorchServe architecture in a minute; before that, I will explain how scalability and load balancing are handled in the cloud environment.
When we deploy these models in production, they can be deployed on Elastic Container Service in AWS or on Azure Container Apps. When we deploy the models, a load balancer has to be created, we create a CloudFormation template to create the container, and we deploy the model as a Docker image that internally runs the TorchServe framework. Once we deploy the model, we need auditability, which is nothing but monitoring and logging: most model calls are logged, along with the number of invocations made to the model, the throughput, and the error rates.
The model should also be highly secured so that it is not exposed to unauthorized access and attacks. Once we build a model, it should have that security along with a CI/CD pipeline for retraining with human feedback: once the model is trained, fine-tuned, and deployed in production, if the production data deviates from the data it was trained on, the model cannot identify that data accurately. So we have a CI/CD pipeline with a human-in-the-loop to ensure that if there is a deviation, the model is automatically retrained and fine-tuned after a certain time and deployed again. Multiple variants of models can be deployed in production, which is A/B testing, and that comes under versioning and rollback. So far we have been talking about traditional AI.
This all comes under the concept of MLOps: we design the model, we develop the model, and we operationalize the model. In the design phase, we identify the dataset and we identify the model. Once we have done that identification, we understand what the model's task is; it could be classification, summarization, abstraction, question answering, or next-sentence prediction; there are multiple kinds of tasks a model can perform. As part of requirement gathering and use case prioritization, that has to be identified, along with the availability of data to train, or rather fine-tune, the model (I think fine-tune is the right word). That is followed by model engineering, which covers the techniques to select the model, perform hyperparameter tuning, fine-tune the model, and deploy the model. In operations, that deployment happens in a cloud environment with a CI/CD pipeline like Azure DevOps, and then we can monitor via Amazon CloudWatch or Azure Monitor logs.
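As a small illustration of that monitoring step, a custom metric can be pushed to Amazon CloudWatch with boto3; this is a hedged sketch rather than the talk's own setup, and the namespace and metric name are made up for the example.

```python
# A minimal monitoring sketch (assumes AWS credentials are configured;
# the namespace, metric name, and value are illustrative).
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")
cloudwatch.put_metric_data(
    Namespace="MyModel/Serving",
    MetricData=[{
        "MetricName": "InferenceLatencyMs",
        "Value": 42.0,
        "Unit": "Milliseconds",
    }],
)
```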
This traditional AI model development involves a certain amount of process that has to be followed, right? So before getting into large language models, I would like to touch on TorchServe. TorchServe is nothing but a framework where BERT or other such language models can be deployed. It is a framework that comes with inference and management APIs, and multiple models can be deployed inside the container. Again, this TorchServe container has to be built as a Docker image and deployed inside a container service; it could be an Amazon elastic container instance or the Azure equivalent. There we can deploy multiple machine learning models by using a model store; under the model store we can use EBS or another elastic storage mechanism to save the models, register them by calling an API, and serve the model through an HTTP endpoint. I think that is the holistic process.
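To make the serving side concrete, calling a TorchServe inference endpoint looks roughly like this; the host, port, model name, and input text below are illustrative assumptions, not details from the talk.

```python
# A minimal sketch of invoking a TorchServe HTTP inference endpoint
# (assumes a model registered as "bert_classifier" on the default
# inference port 8080).
import requests

response = requests.post(
    "http://localhost:8080/predictions/bert_classifier",
    data="The service was quick and the staff were friendly.",
)
print(response.status_code, response.json())
```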
Now you understand the amount of effort and time we spend in the whole traditional machine learning model development and productionization. So the offering we are going to make now is LangChain. But before that, I will touch briefly on large language models. Large language models like GPT-3 or GPT-4 have been trained with more than 175 billion parameters; we have other models like Llama 2, Mistral, or Claude, which come with 7 billion or 70 billion parameters, trained on correspondingly large amounts of data. When it comes to ChatGPT, we all know it is from OpenAI. It is a very large language model, a foundational model, and it has the capability to answer any question or perform any task without any fine-tuning.
The whole process, without fine-tuning, can be achieved by providing in-context learning to the model, where in-context learning means giving the model some context. In-context learning means that, as part of the prompting techniques, an instruction can be specified to the GPT-4 model to perform a specific task. When I say performing a specific task, we can use multiple prompt engineering techniques. Earlier, the tradition was writing a program in Java or Python to perform a task; now natural language is the programming language, nothing but English. We specify an instruction to the model, which is nothing but a prompt, along with the input. If we want it to perform a summarization or translation task, we specify the task information by providing in-context learning via the prompt, along with the input, and we get the relevant answer from GPT-4.
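As a concrete illustration of in-context learning via a prompt, here is a hedged sketch using the OpenAI Python SDK; the model name, instruction, and input text are assumptions for the example, and OPENAI_API_KEY is expected in the environment.

```python
# A minimal in-context learning sketch: the task is specified in the
# prompt (system message) rather than through fine-tuning.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
response = client.chat.completions.create(
    model="gpt-4",
    messages=[
        {"role": "system",
         "content": "Summarize the user's text in one sentence."},
        {"role": "user",
         "content": "LangChain is a framework that helps developers build "
                    "applications on top of large language models."},
    ],
)
print(response.choices[0].message.content)
```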
That is the evolution of large language models. Large language models do not necessarily need to be fine-tuned, which saves a significant amount of resources, like infrastructure and time, and it makes for a safer and cleaner environment not to fine-tune or retrain the model every time. Now we know about large language models, and we know how we can utilize them to perform a specific task. But it all looks good when you are doing some kind of prototype, where you can specify a prompt, give an input, and get an output for that prompt. So how do you productionize large language models? That is the interesting area to focus on, and that is where we offer LangChain. But again, before getting into LangChain, let's look at the architecture
of large language models and have a small comparison between a traditional model and a generative AI algorithm, which is nothing but a large language model. In a traditional model, we have data preprocessing, then we identify the features required for training or fine-tuning the model. After identifying the features, we perform the fine-tuning job with that data; once the model has been trained, we deploy it in production in a cloud environment. Typically the model uses a framework like TensorFlow, PyTorch, or Keras, and underneath it could be an IBM Watson API or the TorchServe framework I was mentioning. Similarly, we would have used multiple databases, NoSQL or SQL, and MLOps tooling such as Docker and Jenkins.
Now there is a paradigm shift. Because we are in an era where more interesting things are happening every day, or every week, identifying which of them are realistic and which can be productionized is a key challenge. That will be addressed as part of this demo and this conversation we are having now; even after the session you can reach out to me and have a discussion. The whole process has now been converted into prompt tuning or prompt engineering; on a need basis we can go for fine-tuning, but it is not necessary, and even prompt engineering alone performs significantly well on these tasks. Data preprocessing is all about the input: the data has to be cleansed and given as input along with the prompt. Then there is an underlying foundational model like GPT-4, Claude, or Mistral; any of these models can be used. Then we deploy the application by using an orchestration platform like LangChain or LlamaIndex. So today the offering is LangChain. It is not only about developing machine learning models; deploying a machine learning model and invoking it have become much easier than what we did earlier. If there are no questions,
I'll move to the next slide: LangChain. LangChain is a framework for developing applications powered by large language models. It facilitates the creation of applications that are context-aware and capable of reasoning, thereby enhancing the practical utility of LLMs in various scenarios. LangChain splits the job into sequential steps, where preprocessing can be an independent step and model invocation can be an independent step. Similarly, Azure offers Prompt Flow, where the model sequence can be split into multiple steps, so if any change happens, each layer can be plugged in and out. Having a suite of tools like LangChain reduces the amount of time it takes to go from prototype to production and makes productionization more secure and scalable.
As I said, LangChain is a framework for developing these applications, and by using an API the models can be invoked. About the LangChain framework, which I mentioned in the previous slide: LangChain applications can be developed in Python as well as JavaScript. It offers multiple interfaces and integrations, for example with pandas, NumPy, or scikit-learn, and it is not limited to those; it integrates with many other Python libraries as well.
LangChain also has chains and agents. What do we mean by a chain? Multiple sequential steps can be integrated together, like preprocessing, model invocation, and post-processing; that can be performed by a chain. Agents are nothing but a way for a collection of activities or multiple events to be performed without much trouble in the execution. There are ready-made chains, and LangChain is very good for agent implementation. LangChain also has LangSmith, templates, and LangServe. LangServe is used for serving the model as a REST API, while LangSmith is for debugging, evaluating, and monitoring the chains within the LLM framework. All of this comes as part of the LangChain framework package.
A LangChain chain can invoke multiple models: it can have sequential model invocation, or it can also have parallel model invocation. As part of the LangChain framework, they also offer the LangChain Expression Language, where the amount of code we write in Python can be drastically reduced by using the expression language.
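To show what that looks like, here is a minimal LangChain Expression Language sketch; the prompt, model name, and input are illustrative, and OPENAI_API_KEY is assumed to be set.

```python
# A minimal LCEL sketch: the | operator composes a prompt, a model,
# and an output parser into a single chain.
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_openai import ChatOpenAI

prompt = ChatPromptTemplate.from_template("Translate to French: {text}")
llm = ChatOpenAI(model="gpt-4")
chain = prompt | llm | StrOutputParser()

print(chain.invoke({"text": "Good morning, everyone."}))
```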
Interesting. After that, let's see how a generative AI application can be developed with LangChain. Whenever we start a generative AI application, we have to identify the objective: what task are we going to perform? It could be a prototype to perform image classification, or it can be a natural language processing task like translation, where we have to provide the context to the generative AI model. Then we have to offer support to integrate with multiple platforms. The code we write should be in a state where it can be productionized, and we should have a collaborative environment like Azure Notebooks or Amazon SageMaker; there are many platforms on which to develop these models. Then, being a diversified application framework, it can suit a wide range of applications, from chatbots to document summarization or analysis. Now the development moves into productionization.
Whenever we talk about productionization, scalability is a very important feature: the model should serve multiple requests in parallel, or concurrently. The framework should also support testing, we should have monitoring tools to check how the model is performing in production, and deployment should be easy, with an API as the way to invoke the model. There will also be continuous improvement for the model through prompt versioning, where multiple prompts can be identified and refined, each prompt goes through an evaluation phase, and the prompt is further fine-tuned before being deployed into production.
Again, the most interesting thing is deployment, where LangServe can be used to deploy the LangChain application. LangServe is nothing but a FastAPI-style server on top of LangChain: it acts as a server and provides the REST API to invoke the chains or the agents inside LangChain.
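A hedged sketch of what that deployment looks like, assuming the LCEL chain from earlier and an illustrative path and port:

```python
# A minimal LangServe sketch (assumes `chain` is the LCEL chain built
# earlier and that fastapi, langserve, and uvicorn are installed).
from fastapi import FastAPI
from langserve import add_routes
import uvicorn

app = FastAPI(title="LangChain demo server")
add_routes(app, chain, path="/translate")  # exposes /translate/invoke, /translate/batch, ...

if __name__ == "__main__":
    uvicorn.run(app, host="0.0.0.0", port=8000)
```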
All right, when we go into LangChain, there are multiple deployment templates readily available to consume, so each time we can have plug-and-play features like templates for model invocation, scalability, ease of integration, and production-grade support. I think those are the features LangChain has to offer.
Let's see the difference between prompting and fine-tuning, and the alternatives. In the case of prompting, as you can see, we specify "you are an unbiased professor and the score should be from zero to ten," and then we pass that to the foundation model along with an input and get an output. As part of the prompt, we are specifying the instruction to the model. In the case of fine-tuning, which we have been talking about all along, we need the dataset, we take the foundational model, we fine-tune the model, and then we deploy it in production. That is how LLM engineering, prompting versus fine-tuning, works.
Still, I am not saying that we should go only for prompt engineering. There could be a domain-specific task where you may require fine-tuning, but typically most problems can be solved well enough by using the right prompting technique, like chain of thought, self-consistency, or tree of thoughts; multiple prompt engineering techniques can be tried out.
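As a small illustration of that, a chain-of-thought style version of the "unbiased professor" prompt could look like the sketch below; the wording of the instruction and the graded answer are assumptions for the example.

```python
# A minimal chain-of-thought prompting sketch (assumes OPENAI_API_KEY is set).
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4")
cot_prompt = (
    "You are an unbiased professor. Grade the following answer from 0 to 10. "
    "Think step by step: first list the strengths, then the weaknesses, "
    "then give the final score on its own line.\n\n"
    "Answer: 'Transformers use self-attention to model long-range context.'"
)
print(llm.invoke(cot_prompt).content)
```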
Now let's move on to the LangChain demo, and I'll show you how easily prototyping and productionization of the model can be performed. As usual, any library to be used in Python has to be installed via pip install or conda install. Once we install LangChain, you have to procure the OpenAI API key, then install the langchain-openai and LangChain libraries in the Python environment. After that, we import ChatOpenAI from langchain_openai, create the model object, and specify the instruction via llm.invoke. That is the power of three lines of code: it can effectively perform the prototyping for you. When you want to print "How can LangSmith help with testing?", you get an output from the GPT model describing the steps by which LangSmith can help with testing.
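The exact slide code is not in the transcript, but a reconstruction of that three-line prototype with the current langchain-openai package would look roughly like this:

```python
# A minimal sketch of the prototype from the demo (assumes the langchain
# and langchain-openai packages are installed and OPENAI_API_KEY is set).
from langchain_openai import ChatOpenAI

llm = ChatOpenAI()  # defaults to an OpenAI chat model
print(llm.invoke("How can LangSmith help with testing?").content)
```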
Now, without a prompt, we have only given an input. By adding a prompt, we specify an instruction saying what kind of task should be performed on the given input. In this case we are saying in the prompt that "you are a world class technical documentation writer." Given an input, it then writes the document in a more efficient manner, the way a technical documentation writer would write it. That is the power of the prompt. You can see here it is the same thing: we import the packages and libraries and we invoke the ChatOpenAI function. For security reasons I have masked the API key; you have to provide the API keys in the function. That is followed by the prompt template, where you give the template as an instruction, as a system prompt, followed by the user input. Once I give "you are a world class technical documentation writer" as the system prompt, followed by the user input, and call chain.invoke, I get a very good output,
like how a technical documentation writer would write it. Most of the things are very similar. On top of that, we can have an output parser, where we define the format in which the large language model's output should be produced; it could be a JSON format, it could be Excel, or whatever format we define. So by using an input, a prompt template, and an output parser, you are all set to get an output from a large language model like GPT-4.
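Again, the exact slide code is not in the transcript, but a hedged reconstruction of that prompt-template-plus-output-parser chain would look like this; the user question is illustrative.

```python
# A minimal sketch of the prompt template + output parser chain from the
# demo (assumes OPENAI_API_KEY is set in the environment).
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_openai import ChatOpenAI

prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a world class technical documentation writer."),
    ("user", "{input}"),
])
llm = ChatOpenAI()
output_parser = StrOutputParser()  # plain text; a JSON parser could be swapped in

chain = prompt | llm | output_parser
print(chain.invoke({"input": "How can LangSmith help with testing?"}))
```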
If you have any questions, I am more than happy to talk after the session. Once again, thank you all for your time and for listening to the session. If you have any doubts, you can reach out to me at any point. Thank you all. Have a nice evening, a good day, and a good rest of the week.