Transcript
This transcript was autogenerated. To make changes, submit a PR.
Thank you all for joining the session. Today, I am going to talk about large language models, the future of large language models, and the productionization of large language models. I am Deepak, and I work as an associate director for data science and machine learning projects. I have more than 15 years of experience in data science and machine learning, and I have been working predominantly in generative AI for the past three years. All right, now let me take you to the next slide.
Before I talk about large language model productionization, or deploying in the cloud, let's understand traditional AI model development and deployment, followed by the challenges we have in deploying or productionizing traditional AI models. Then I'll walk you through large language models like GPT-4 and explain the architecture of large language models, or generative AI models. Then I'll take you through the LangChain framework and how applications can be developed with LangChain, followed by a demo.
Moving to the next slide, let's talk about traditional AI models. When I say traditional AI models, we began with linear regression, logistic regression, random forests, decision trees, and boosting methods like AdaBoost and XGBoost, then neural networks, and the evolution went from neural networks to transformers in 2017. That is how the industry had its breakthrough, with a model called BERT, Bidirectional Encoder Representations from Transformers, which has performed significantly well on most natural language processing tasks. So when I talk about traditional AI models, let me not start from linear or logistic regression or random forests.
Let's begin with a smaller large language model, the one I just mentioned, BERT. The process involved in model training or fine-tuning requires a huge amount of data. Once we train or fine-tune the model, we fine-tune it for a specific task or a specific domain. Typically, it needs a GPU machine to run the fine-tuning process. Once we perform the fine-tuning, we do hyperparameter tuning, with the learning rate, epochs, and multiple additional parameters, to arrive at the right weights for the model to perform classification, question answering, or any of the other tasks it can perform.
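To make that concrete, here is a minimal fine-tuning sketch using the Hugging Face Transformers Trainer; it is an illustration rather than the talk's own code, and `train_ds` and `eval_ds` are assumed to be already-tokenized classification datasets.

```python
# A minimal BERT fine-tuning sketch (assumes `train_ds` and `eval_ds`
# are pre-tokenized datasets; typically run on a GPU machine).
from transformers import (AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)

model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)

args = TrainingArguments(
    output_dir="bert-finetuned",
    learning_rate=2e-5,              # hyperparameters mentioned above
    num_train_epochs=3,
    per_device_train_batch_size=16,
)

trainer = Trainer(model=model, args=args,
                  train_dataset=train_ds, eval_dataset=eval_ds)
trainer.train()
```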
Then, once we do the fine-tuning and optimization of the model, we have to deploy the model in a cloud environment; it could be AWS, Azure, or even Google Cloud Platform. But before that, when you are going to deploy the model, the model has to be serialized, because when we deploy the model in production it should have scalability, reliability, and durability. Considering that, when we move the model to production, we serialize it into a PyTorch or TensorFlow saved-model format to serve the model.
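For reference, a minimal serialization sketch in PyTorch might look like the following; `model` is assumed to be the fine-tuned model from the previous step, and the file name is illustrative.

```python
# A minimal model-serialization sketch (assumes `model` is the fine-tuned
# PyTorch model from the previous step).
import torch
from transformers import AutoModelForSequenceClassification

# Save the learned weights (state dict); this is the artifact that gets
# packaged for a serving framework such as TorchServe.
torch.save(model.state_dict(), "bert_finetuned.pt")

# Reloading later requires rebuilding the same architecture first.
restored = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)
restored.load_state_dict(torch.load("bert_finetuned.pt"))
restored.eval()   # switch to inference mode before serving
```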
For model serving, as I mentioned, there is a framework called TorchServe (PyTorch serving), which gives scalability for performing inference. This framework provides an API, so once we build a real-world application, that application can invoke the inference, or prediction, by calling the TorchServe endpoint. Because we expose it as an API, we can come up with an API design, and based on that design we can start invoking the model. A single model or multiple models can be deployed in production. I will talk about the TorchServe architecture in a minute; before that, I will explain how scalability and load balancing are handled in the cloud environment.
When we deploy these models in production, they can be deployed on Elastic Container Service in AWS or on Azure Container Apps. When we deploy the models, a load balancer has to be created, we create a CloudFormation template to create the container, and we deploy the model as a Docker image that internally runs the TorchServe framework. Once we deploy the model, we need auditability, which is nothing but monitoring and logging: most model calls are logged, along with the number of invocations made to the model, the throughput, and the error rates.
The model should also be highly secured so that it is not exposed to unauthorized access and attacks. Once we build a model, it should have that security along with a CI/CD pipeline for retraining with human feedback: once the model is trained, fine-tuned, and deployed in production, if the production data deviates from the data it was trained on, the model cannot identify that data accurately. So we have a CI/CD pipeline with a human-in-the-loop to ensure that if there is a deviation, the model is automatically retrained and fine-tuned after a certain time and deployed again. Multiple variants of models can be deployed in production, which is A/B testing, and that comes under versioning and rollback. So far we have been talking about traditional AI.
This all comes under the concept of MLOps: we design the model, we develop the model, and we operationalize the model. In the design phase, we identify the dataset and we identify the model. Once we have done that identification, we understand what the model's task is; it could be classification, summarization, abstraction, question answering, or next-sentence prediction; there are multiple kinds of tasks a model can perform. As part of requirement gathering and use case prioritization, that has to be identified, along with the availability of data to train, or rather fine-tune, the model (I think fine-tune is the right word). That is followed by model engineering, which covers the techniques to select the model, perform hyperparameter tuning, fine-tune the model, and deploy the model. In operations, that deployment happens in a cloud environment with a CI/CD pipeline like Azure DevOps, and then we can monitor via Amazon CloudWatch or Azure Monitor logs.
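As a small illustration of that monitoring step, a custom metric can be pushed to Amazon CloudWatch with boto3; this is a hedged sketch rather than the talk's own setup, and the namespace and metric name are made up for the example.

```python
# A minimal monitoring sketch (assumes AWS credentials are configured;
# the namespace, metric name, and value are illustrative).
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")
cloudwatch.put_metric_data(
    Namespace="MyModel/Serving",
    MetricData=[{
        "MetricName": "InferenceLatencyMs",
        "Value": 42.0,
        "Unit": "Milliseconds",
    }],
)
```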
This traditional AI model development involves a certain amount of process that has to be followed, right? So before getting into large language models, I would like to touch on TorchServe. TorchServe is nothing but a framework where BERT or other such language models can be deployed. It is a framework that comes with inference and management APIs, and multiple models can be deployed inside the container. Again, this TorchServe container has to be built as a Docker image and deployed inside a container service; it could be an Amazon elastic container instance or the Azure equivalent. There we can deploy multiple machine learning models by using a model store; under the model store we can use EBS or another elastic storage mechanism to save the models, register them by calling an API, and serve the model through an HTTP endpoint. I think that is the holistic process.
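To make the serving side concrete, calling a TorchServe inference endpoint looks roughly like this; the host, port, model name, and input text below are illustrative assumptions, not details from the talk.

```python
# A minimal sketch of invoking a TorchServe HTTP inference endpoint
# (assumes a model registered as "bert_classifier" on the default
# inference port 8080).
import requests

response = requests.post(
    "http://localhost:8080/predictions/bert_classifier",
    data="The service was quick and the staff were friendly.",
)
print(response.status_code, response.json())
```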
Now you understand the amount of effort and time we spend in the whole traditional machine learning model development and productionization. So the offering we are going to make now is LangChain. But before that, I will touch briefly on large language models. Large language models like GPT-3 or GPT-4 have been trained with more than 175 billion parameters; we have other models like Llama 2, Mistral, or Claude, which come with 7 billion or 70 billion parameters, trained on correspondingly large amounts of data. When it comes to ChatGPT, we all know it is from OpenAI. It is a very large language model, a foundational model, and it has the capability to answer any question or perform any task without any fine-tuning.
The whole process, without fine-tuning, can be achieved by providing in-context learning to the model, where in-context learning means giving the model some context. In-context learning means that, as part of the prompting techniques, an instruction can be specified to the GPT-4 model to perform a specific task. When I say performing a specific task, we can use multiple prompt engineering techniques. Earlier, the tradition was writing a program in Java or Python to perform a task; now natural language is the programming language, nothing but English. We specify an instruction to the model, which is nothing but a prompt, along with the input. If we want it to perform a summarization or translation task, we specify the task information by providing in-context learning via the prompt, along with the input, and we get the relevant answer from GPT-4.
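As a concrete illustration of in-context learning via a prompt, here is a hedged sketch using the OpenAI Python SDK; the model name, instruction, and input text are assumptions for the example, and OPENAI_API_KEY is expected in the environment.

```python
# A minimal in-context learning sketch: the task is specified in the
# prompt (system message) rather than through fine-tuning.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
response = client.chat.completions.create(
    model="gpt-4",
    messages=[
        {"role": "system",
         "content": "Summarize the user's text in one sentence."},
        {"role": "user",
         "content": "LangChain is a framework that helps developers build "
                    "applications on top of large language models."},
    ],
)
print(response.choices[0].message.content)
```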
That is the evolution of large language models. Large language models do not necessarily need to be fine-tuned, which saves a significant amount of resources, like infrastructure and time, and it makes for a safer and cleaner environment not to fine-tune or retrain the model every time. Now we know about large language models, and we know how we can utilize them to perform a specific task. But it all looks good when you are doing some kind of prototype, where you can specify a prompt, give an input, and get an output for that prompt. So how do you productionize large language models? That is the interesting area to focus on, and that is where we offer LangChain. But again, before getting into LangChain, let's look at the architecture
of large language models and have a small comparison between a traditional model and a generative AI algorithm, which is nothing but a large language model. In a traditional model, we have data preprocessing, then we identify the features required for training or fine-tuning the model. After identifying the features, we perform the fine-tuning job with that data; once the model has been trained, we deploy it in production in a cloud environment. Typically the model uses a framework like TensorFlow, PyTorch, or Keras, and underneath it could be an IBM Watson API or the TorchServe framework I was mentioning. Similarly, we would have used multiple databases, NoSQL or SQL, and MLOps tooling such as Docker and Jenkins.
Now there is a paradigm shift. Because we are in an era where more interesting things are happening every day, or every week, identifying which of them are realistic and which can be productionized is a key challenge. That will be addressed as part of this demo and this conversation we are having now; even after the session you can reach out to me and have a discussion. The whole process has now been converted into prompt tuning or prompt engineering; on a need basis we can go for fine-tuning, but it is not necessary, and even prompt engineering alone performs significantly well on these tasks. Data preprocessing is all about the input: the data has to be cleansed and given as input along with the prompt. Then there is an underlying foundational model like GPT-4, Claude, or Mistral; any of these models can be used. Then we deploy the application by using an orchestration platform like LangChain or LlamaIndex. So today the offering is LangChain. It is not only about developing machine learning models; deploying a machine learning model and invoking it have become much easier than what we did earlier. If there are no questions,
I'll move to the next slide: LangChain. LangChain is a framework for developing applications powered by large language models. It facilitates the creation of applications that are context-aware and capable of reasoning, thereby enhancing the practical utility of LLMs in various scenarios. LangChain splits the job into sequential steps, where preprocessing can be an independent step and model invocation can be an independent step. Similarly, Azure offers Prompt Flow, where the model sequence can be split into multiple steps, so if any change happens, each layer can be plugged in and out. Having a suite of tools like LangChain reduces the amount of time it takes to go from prototype to production and makes productionization more secure and scalable.
As I said, LangChain is a framework for developing these applications, and by using an API the models can be invoked. About the LangChain framework, which I mentioned in the previous slide: LangChain applications can be developed in Python as well as JavaScript. It offers multiple interfaces and integrations, for example with pandas, NumPy, or scikit-learn, and it is not limited to those; it integrates with many other Python libraries as well.
LangChain also has chains and agents. What do we mean by a chain? Multiple sequential steps can be integrated together, like preprocessing, model invocation, and post-processing; that can be performed by a chain. Agents are nothing but a way for a collection of activities or multiple events to be performed without much trouble in the execution. There are ready-made chains, and LangChain is very good for agent implementation. LangChain also has LangSmith, templates, and LangServe. LangServe is used for serving the model as a REST API, while LangSmith is for debugging, evaluating, and monitoring the chains within the LLM framework. All of this comes as part of the LangChain framework package.
A LangChain chain can invoke multiple models: it can have sequential model invocation, or it can also have parallel model invocation. As part of the LangChain framework, they also offer the LangChain Expression Language, where the amount of code we write in Python can be drastically reduced by using the expression language.
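To show what that looks like, here is a minimal LangChain Expression Language sketch; the prompt, model name, and input are illustrative, and OPENAI_API_KEY is assumed to be set.

```python
# A minimal LCEL sketch: the | operator composes a prompt, a model,
# and an output parser into a single chain.
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_openai import ChatOpenAI

prompt = ChatPromptTemplate.from_template("Translate to French: {text}")
llm = ChatOpenAI(model="gpt-4")
chain = prompt | llm | StrOutputParser()

print(chain.invoke({"text": "Good morning, everyone."}))
```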
Interesting. After that, let's see how a generative AI application can be developed with LangChain. Whenever we start a generative AI application, we have to identify the objective: what task are we going to perform? It could be a prototype to perform image classification, or it can be a natural language processing task like translation, where we have to provide the context to the generative AI model. Then we have to offer support to integrate with multiple platforms. The code we write should be in a state where it can be productionized, and we should have a collaborative environment like Azure Notebooks or Amazon SageMaker; there are many platforms on which to develop these models. Then, being a diversified application framework, it can suit a wide range of applications, from chatbots to document summarization or analysis. Now the development moves into productionization.
Whenever we talk about productionization, scalability is a very important feature: the model should serve multiple requests in parallel, or concurrently. The framework should also support testing, we should have monitoring tools to check how the model is performing in production, and deployment should be easy, with an API as the way to invoke the model. There will also be continuous improvement for the model through prompt versioning, where multiple prompts can be identified and refined, each prompt goes through an evaluation phase, and the prompt is further fine-tuned before being deployed into production.
Again, the most interesting thing is deployment, where LangServe can be used to deploy the LangChain application. LangServe is nothing but a FastAPI-style server on top of LangChain: it acts as a server and provides the REST API to invoke the chains or the agents inside LangChain.
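A hedged sketch of what that deployment looks like, assuming the LCEL chain from earlier and an illustrative path and port:

```python
# A minimal LangServe sketch (assumes `chain` is the LCEL chain built
# earlier and that fastapi, langserve, and uvicorn are installed).
from fastapi import FastAPI
from langserve import add_routes
import uvicorn

app = FastAPI(title="LangChain demo server")
add_routes(app, chain, path="/translate")  # exposes /translate/invoke, /translate/batch, ...

if __name__ == "__main__":
    uvicorn.run(app, host="0.0.0.0", port=8000)
```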
All right, when we go into LangChain, there are multiple deployment templates readily available to consume, so each time we can have plug-and-play features like templates for model invocation, scalability, ease of integration, and production-grade support. I think those are the features LangChain has to offer.
Let's see the difference between prompting and fine-tuning, and the alternatives. In the case of prompting, as you can see, we specify "you are an unbiased professor and the score should be from zero to ten," and then we pass that to the foundation model along with an input and get an output. As part of the prompt, we are specifying the instruction to the model. In the case of fine-tuning, which we have been talking about all along, we need the dataset, we take the foundational model, we fine-tune the model, and then we deploy it in production. That is how LLM engineering, prompting versus fine-tuning, works.
Still, I am not saying that we should go only for prompt engineering. There could be a domain-specific task where you may require fine-tuning, but typically most problems can be solved well enough by using the right prompting technique, like chain of thought, self-consistency, or tree of thoughts; multiple prompt engineering techniques can be tried out.
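As a small illustration of that, a chain-of-thought style version of the "unbiased professor" prompt could look like the sketch below; the wording of the instruction and the graded answer are assumptions for the example.

```python
# A minimal chain-of-thought prompting sketch (assumes OPENAI_API_KEY is set).
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4")
cot_prompt = (
    "You are an unbiased professor. Grade the following answer from 0 to 10. "
    "Think step by step: first list the strengths, then the weaknesses, "
    "then give the final score on its own line.\n\n"
    "Answer: 'Transformers use self-attention to model long-range context.'"
)
print(llm.invoke(cot_prompt).content)
```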
Now let's move on to the LangChain demo, and I'll show you how easily prototyping and productionization of the model can be performed. As usual, any library to be used in Python has to be installed via pip install or conda install. Once we install LangChain, you have to procure the OpenAI API key, then install the langchain-openai and LangChain libraries in the Python environment. After that, we import ChatOpenAI from langchain_openai, create the model object, and specify the instruction via llm.invoke. That is the power of three lines of code: it can effectively perform the prototyping for you. When you want to print "How can LangSmith help with testing?", you get an output from the GPT model describing the steps by which LangSmith can help with testing.
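The exact slide code is not in the transcript, but a reconstruction of that three-line prototype with the current langchain-openai package would look roughly like this:

```python
# A minimal sketch of the prototype from the demo (assumes the langchain
# and langchain-openai packages are installed and OPENAI_API_KEY is set).
from langchain_openai import ChatOpenAI

llm = ChatOpenAI()  # defaults to an OpenAI chat model
print(llm.invoke("How can LangSmith help with testing?").content)
```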
Now, without a prompt, we have only given an input. By adding a prompt, we specify an instruction saying what kind of task should be performed on the given input. In this case we are saying in the prompt that "you are a world class technical documentation writer." Given an input, it then writes the document in a more efficient manner, the way a technical documentation writer would write it. That is the power of the prompt. You can see here it is the same thing: we import the packages and libraries and we invoke the ChatOpenAI function. For security reasons I have masked the API key; you have to provide the API keys in the function. That is followed by the prompt template, where you give the template as an instruction, as a system prompt, followed by the user input. Once I give "you are a world class technical documentation writer" as the system prompt, followed by the user input, and call chain.invoke, I get a very good output,
like how a technical documentation writer would write it. Most of the things are very similar. On top of that, we can have an output parser, where we define the format in which the large language model's output should be produced; it could be a JSON format, it could be Excel, or whatever format we define. So by using an input, a prompt template, and an output parser, you are all set to get an output from a large language model like GPT-4.
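Again, the exact slide code is not in the transcript, but a hedged reconstruction of that prompt-template-plus-output-parser chain would look like this; the user question is illustrative.

```python
# A minimal sketch of the prompt template + output parser chain from the
# demo (assumes OPENAI_API_KEY is set in the environment).
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_openai import ChatOpenAI

prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a world class technical documentation writer."),
    ("user", "{input}"),
])
llm = ChatOpenAI()
output_parser = StrOutputParser()  # plain text; a JSON parser could be swapped in

chain = prompt | llm | output_parser
print(chain.invoke({"input": "How can LangSmith help with testing?"}))
```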
If you have any questions, I am more than happy to talk after the session. Once again, thank you all for your time and for listening to the session. If you have any doubts, you can reach out to me at any point. Thank you all. Have a nice evening, a good day, and a good rest of the week.