Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hello everyone, and welcome to this session about how to make LLM apps sane again by forgetting incorrect data in real time. Today you're going to learn how to write a real-time LLM app with Pathway. We're going to see how to create a chatbot, make it learn on real-time data, and in particular how to make it forget incorrect data in real time.
First we're going to see why it is important to learn and forget in real time, then the common solutions, fine-tuning and RAG, before seeing how to build a RAG pipeline with Pathway and its reactive vector index, with a live demo at the end.

Today we have the chance to have access to really powerful LLM models, really easily, through APIs.
In my examples I will use the OpenAI API, but everything is model agnostic: you can use any model you want, from Meta or Mistral, or you can even host your own. We're going to use LLM models for two operations: first, embedding text into vectors, and then chat completion to answer questions.
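As a rough illustration (not the exact code from the talk), here is a minimal sketch of those two operations, assuming the OpenAI Python client; the model names are placeholders and any embedding or chat model would work.

```python
# Minimal sketch of the two LLM operations used throughout the talk.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# 1) Embedding: turn a piece of text into a vector.
emb = client.embeddings.create(
    model="text-embedding-3-small",
    input="Is Pluto a planet?",
)
query_vector = emb.data[0].embedding  # a list of floats

# 2) Chat completion: answer a question.
chat = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Is Pluto a planet?"}],
)
print(chat.choices[0].message.content)
```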
Okay, so what's wrong with our LLMs? LLMs are very good at answering questions, but only on the topics they know about. It's like us, right? If I didn't learn about a subject and you asked me a question about it, I would have trouble answering. It's the same for LLMs.

So the first issue is that they are not very good at answering questions about unfamiliar topics. For example, OpenAI's models are not able to answer a question about 2024: all their training data is from before this year. In particular, this means they don't work with real-time data. Another kind of data they are unfamiliar with is personal and confidential data: the data you didn't share, your non-public documents, your personal data, didn't end up in the training data of those models, so the model has no way to know about it.
The second issue is that what is learned is learned: LLM models cannot forget, and this is a problem when what they learned is something they should not know. For example, outdated data, such as Pluto's status: Pluto's status has changed a lot, so is it a planet or not? If it changes this year, the LLM has no way to know, and the problem is that it assumes the last status it saw is the ground truth. Similarly, we have fake news and deliberate misinformation: something which used to be seen as ground truth in the past may not be true anymore, but the LLM model has no way to know that. And you also have the case of copyrighted and personal data: if by mistake it ends up in the LLM model, the fact that the model cannot forget is a problem.
So how do we correct the model? You can improve the knowledge of the model by providing additional data: new information (Pluto now is a planet, or is not) and patches (okay, this information was not correct). You don't want to wait for the next version of the model, because maybe you don't have the time; you want to do it yourself, to be sure the data is included. You have two ways of doing that: the first one is fine-tuning and the second one is prompt engineering. We're going to discuss both right now.

First, fine-tuning.
Fine-tuning is taking a pre-trained model, such as a generic GPT model, and adapting and personalizing it on your own data. You take an existing model and then you pursue the training over your data, the same kind of training that was used to obtain the generic model. So it's batch training: you need all the data at once and you train on it. One issue is that it requires an adapted dataset, so you cannot train over just any kind of data: it has to follow a particular schema, and preprocessing your data may be costly.
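To give an idea of what such a schema can look like, here is a minimal sketch assuming the JSONL chat format used by OpenAI's fine-tuning endpoint; treat the exact fields as an assumption and check your provider's documentation.

```python
# Sketch of the dataset preprocessing that fine-tuning requires: every training
# example has to be reshaped into the provider's schema, all of it up front
# (batch training). The format below follows OpenAI's chat fine-tuning docs.
import json

examples = [
    {"messages": [
        {"role": "user", "content": "Is Pluto a planet?"},
        {"role": "assistant", "content": "No, since 2006 Pluto is classified as a dwarf planet."},
    ]},
    # ... one record per training example
]

with open("train.jsonl", "w") as f:
    for ex in examples:
        f.write(json.dumps(ex) + "\n")
```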
Another issue is that you cannot forget: we end up with the same kind of model. It's an LLM model, fine-tuned, but it's still an LLM model, so it cannot forget. If something changes in your data, you will have to retrain it from scratch: you have to take the pre-trained model again and redo the fine-tuning. This can be really costly, so it's not suitable for real-time data.
The second solution is prompt engineering. Prompt engineering is modifying the query so that all the data the LLM might need to answer the query is included in the query: the answer is in the question, in a way. For example, with Pluto: if you want to ask what Pluto's status is, you will add all the latest news about Pluto and say, okay, given those articles about Pluto, is Pluto a planet or not? The LLM should then be able to answer you with the latest status.
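As a small illustration, manual prompt engineering for the Pluto example could look like the sketch below; the article text is a placeholder for whatever news you fetched by hand.

```python
# Sketch of manual prompt engineering: paste the fetched articles into the
# prompt so that "the answer is in the question".
articles = [
    "IAU (2006): Pluto is reclassified as a dwarf planet.",
    # ... any other articles fetched by hand
]

context = "\n".join(f"- {a}" for a in articles)
prompt = (
    "Given the following articles about Pluto:\n"
    f"{context}\n"
    "Is Pluto a planet or not?"
)
```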
What's wrong with that is that it's tedious work: you don't want to do it by hand, because you have to fetch the data and assemble the prompt yourself, it's not scalable, and it doesn't work well with real-time data. So you want to automate it.
The way to do it automatically is called retrieval augmented generation (RAG). It's a three-step process. First you transform the query, the question, into a vector using embeddings: you embed the query to obtain a vector. Second, using this vector, you automatically find the most similar documents using a vector index. And now that you have the most relevant documents, you can do the prompt engineering.
Just one quick explanation about what vector embeddings are and why they are useful. Embeddings are used to transform unstructured data into vectors. Why are we doing that? Because raw, unstructured data might be really hard to compare. Our example is text, and comparing two texts might be difficult, but comparing vectors is quite easy: we have a lot of optimized mathematical operations to do that. So the idea is to transform text into vectors so we can use all those optimized techniques to find the most similar documents as fast as possible. What's good with vector embeddings is that they are built in such a way that the more similar two texts are, the more similar their vectors will be. So using those vectors, we can do really fast document retrieval.
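As a quick sketch of why this is fast, cosine similarity between two embedding vectors is a cheap, well-optimized operation; the `embed` call below stands for any embedding function, such as the OpenAI one shown earlier.

```python
# Similar texts -> similar vectors -> high cosine similarity.
import numpy as np

def cosine_similarity(a, b):
    a, b = np.asarray(a), np.asarray(b)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Example usage (embed() is any embedding function):
# score = cosine_similarity(embed("Is Pluto a planet?"),
#                           embed("Pluto's planetary status"))
```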
Okay, so the most popular RAG use case is chatbots over your own data: you can take an LLM chatbot and add your own data to it. It's a good fit for real-time data, and also for providing context to queries to avoid potentially incorrect answers.
Let's take an example with confidential data. Let's assume your company has confidential documents and you want to build a chatbot to query them, for example to ask: what is this company's budget? You will use RAG to find the most relevant data and then do the prompt engineering, producing a summary of the information you have found. But as you can see, there is still an issue: what happens if the RAG data is compromised? It's the same as the initial issue. What if the data is outdated, or totally wrong, or copyrighted, or personal?
The good news is that the context can be forgotten. Every time you do a query, you retrieve the most similar documents, so if you remove a document, it will no longer be taken into account. RAG not only supports the addition of new documents, it also provides a really easy way to remove knowledge from your application, so you can easily remove incorrect or confidential data.

In our example, we have our chatbot over our confidential data. As you know, confidential data is heavily regulated, and if for legal reasons you have to remove a document, you don't want your system to reflect this removal only one month later, right? It has to be removed from your chatbot as soon as possible, otherwise you may face lawsuits.
You want something really reactive, and here reactivity is key. You want a system which takes new data into account as soon as it is inserted, and in the same way forgets the data as soon as it is removed from your system. The solution is to use a real-time vector index. It takes any document into account as soon as it is indexed, and by removing the data, your system will forget it. It's well adapted to live data streams, of course, and the main characteristic you want is for it to be reactive.
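To make the add/forget behaviour concrete, here is a deliberately naive in-memory sketch (an illustration only, not Pathway's implementation): as soon as a document is removed from the index, the next query simply cannot retrieve it anymore.

```python
# Naive in-memory vector index: adding indexes a document, removing makes the
# system forget it on the very next query.
import numpy as np

class SimpleVectorIndex:
    def __init__(self, embed):
        self.embed = embed          # any embedding function, e.g. the OpenAI call above
        self.docs = {}              # doc_id -> (text, vector)

    def add(self, doc_id, text):
        self.docs[doc_id] = (text, np.asarray(self.embed(text)))

    def remove(self, doc_id):
        self.docs.pop(doc_id, None)  # from now on, queries can no longer retrieve it

    def search(self, query, k=3):
        q = np.asarray(self.embed(query))
        scored = [
            (float(q @ v / (np.linalg.norm(q) * np.linalg.norm(v))), text)
            for text, v in self.docs.values()
        ]
        return [text for _, text in sorted(scored, reverse=True)[:k]]
```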
Now, a bit of practice: let's build a chatbot. We'll see how to build a chatbot over PDFs. Here we'll take financial documents, and we'll take a scenario where we need to remove one document and we want the chatbot to forget it as fast as possible. We'll do that in pure Python with Pathway.

The pipeline will look like this. Everything can be separated into two parts: the prompt construction and retrieval, which will be done in Python, and all the LLM operations, which will be handled by the OpenAI API. The first step is to index the documents: all your documents are on your storage, and for each document you call the API to obtain the embeddings, and you index all the documents with their embeddings. Then, whenever a user uses the search bar to make a query, you apply the RAG approach: you call the API to compute the embedding of the query, and using this vector you query the vector index to retrieve the most relevant documents from your documentation. With those documents you do the prompt engineering ("given those documents, please answer this query") and you call the API to do the chat completion over this prompt. Then you can post-process the answer or forward it directly.
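Putting the query-time steps together, a minimal flow could look like the sketch below; it reuses the hypothetical SimpleVectorIndex and the OpenAI client from the earlier sketches, and the model name is a placeholder.

```python
# Sketch of the query-time flow: retrieve, build the prompt, call chat completion.
def answer(query, index, client, k=3):
    documents = index.search(query, k=k)                    # 1) retrieval
    context = "\n".join(f"- {d}" for d in documents)
    prompt = f"Given those documents:\n{context}\nPlease answer this query: {query}"
    response = client.chat.completions.create(              # 2) chat completion
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content              # 3) post-process or forward
```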
Okay, so we'll do that with Pathway, which is a data processing framework in Python for batch and streaming data, meant to allow LLM applications to work on real-time data. It's a Python framework, but there is a scalable Rust engine behind it. It will look like this: Pathway will handle all the calls to OpenAI, so it will handle the embeddings, the vector index search, and also the prompting. It provides you the tools to query all kinds of documents, from files and Kafka topics to Google Drive or SharePoint, and it gives you everything you need to do RAG really easily with OpenAI or any other model you want.
So let's see how it is done. The first thing you do is connect to your documents. Here we use connectors: you can connect to your data source using connectors, and we're going to read this documents folder on the file system. All the documents will be PDFs put into this folder. Then you need to define the models. Pathway provides the tools to configure the models you want really easily, by pre-configuring everything for you, and you can define the embedder, the LLM for chat completion, and the splitters. Similarly, you just have to initialize a vector store with the documents, and everything is configured for you by Pathway. We need to define a web server for the query answering, and everything is customizable; using a REST connector we obtain the queries. Now we can do the RAG: for each query we retrieve the most similar documents. Here we retrieve only one document, but in practice, depending on your use case, it might be 10, 20, or 30. Then we do the prompt engineering; as you can see, all the functions are already done for you, so it's very simple. Then we do the chat completion with the prompt, we send back the result, and then we run the pipeline.
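The following is only a rough reconstruction of the pipeline just described, using the module and class names of the Pathway LLM xpack as I recall them (pw.io.fs.read, OpenAIEmbedder, OpenAIChat, TokenCountSplitter, VectorStoreServer); treat the exact imports, signatures, and parameters as assumptions and refer to the Pathway documentation and demo repository for the working code shown in the talk.

```python
# Rough, assumption-laden sketch of the Pathway pipeline described above.
import pathway as pw
from pathway.xpacks.llm.embedders import OpenAIEmbedder
from pathway.xpacks.llm.llms import OpenAIChat
from pathway.xpacks.llm.splitters import TokenCountSplitter
from pathway.xpacks.llm.vector_store import VectorStoreServer

# 1) Connect to the data source: every PDF dropped into ./documents/ is picked
#    up in streaming mode, and removals are propagated too.
documents = pw.io.fs.read(
    "./documents/", format="binary", mode="streaming", with_metadata=True
)

# 2) Pre-configured models: the embedder, the splitter, and the chat model.
embedder = OpenAIEmbedder(model="text-embedding-3-small")
splitter = TokenCountSplitter()
chat = OpenAIChat(model="gpt-4o-mini")

# 3) The reactive vector index over the documents.
vector_store = VectorStoreServer(documents, embedder=embedder, splitter=splitter)

# 4) The web server / REST connector for the queries, the top-k retrieval, the
#    prompt engineering, and the chat completion would follow here, and finally
#    the pipeline is started:
pw.run()
```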
Okay, let's see how it works. First we just check what kind of documents we have: we have two documents, an Alphabet financial document and another one. We launch the pipeline, we run it, and it might take a while, because the first thing the pipeline does is index all the documents: it builds the vector index using OpenAI. Then we can query the documents. We want to ask a question about Alphabet: what is the revenue of Alphabet in 2022, in millions of dollars? Let's see what the answer is. It answers with this number, so let's check whether it is correct. This is the document which is indexed, and if we go to the revenue page, we can see that the number is correct.

Now let's assume that this document, for some legal reason, has to be removed. We remove it from the documents folder, and let's see what the chatbot says now. What is the revenue of Alphabet? No information found. So our chatbot is really reactive: whenever the document is removed, if you do another query, the removal has already been taken into account.
Reactivity is key: as we say, garbage in, garbage out, so you need to update your index as soon as possible. Your system has to be very quick to take the removal into account, and for that, streaming is the way to go. If you do the reindexing in batch, say every hour or every 20 minutes, then in between two reindexings all the queries might be inconsistent, because the document has been removed but the change has not been propagated through the whole system. That's why you need an event-based approach, and for that, reactive real-time vector indexes are the way to go, such as the one in Pathway, which is very reactive to any update, whether it is an addition or a removal.
To conclude: LLMs can be wrong, and RAG is the solution. Most of the problems with LLMs come from the training data, either because it's missing some data or because some data is incorrect, and RAG is the only existing solution to correct this limited knowledge that can adapt in real time. Fine-tuning is nice, but it's done on batch data, so you cannot forget and you have to redo it every time, while RAG can maintain a real-time index, and your system will be really reactive to all the changes. Reactivity is key: your index should reflect the changes in your data in real time.
So thank you for listening to
this session. You can try the demo yourself and
please don't hesitate to reach out to me if you have any questions.