Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hello everyone, and welcome to this session about how to make LLM apps sane again by forgetting incorrect data in real time. Today you're going to learn how to write a real-time LLM app with Pathway. We're going to see how to create a chatbot, make it learn on real-time data, and in particular how to make it forget incorrect data in real time.
First we're going to see why it is important to learn and forget in real time, then the common solutions, fine-tuning and RAG, before seeing how to build a RAG pipeline with Pathway and its reactive vector index, with a live demo at the end.

Today we have the chance to have access to really powerful LLM models, really easily, through APIs.
In my examples I will use the OpenAI API, but everything is model agnostic: you can use any model you want, from Meta or Mistral, or you can even host your own. We're going to use LLM models for two operations: first, embedding text into vectors, and then chat completion to answer questions.
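As a rough illustration (not the exact code from the talk), here is a minimal sketch of those two operations, assuming the OpenAI Python client; the model names are placeholders and any embedding or chat model would work.

```python
# Minimal sketch of the two LLM operations used throughout the talk.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# 1) Embedding: turn a piece of text into a vector.
emb = client.embeddings.create(
    model="text-embedding-3-small",
    input="Is Pluto a planet?",
)
query_vector = emb.data[0].embedding  # a list of floats

# 2) Chat completion: answer a question.
chat = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Is Pluto a planet?"}],
)
print(chat.choices[0].message.content)
```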
Okay, so what's wrong with our LLMs? LLMs are very good at answering questions, but only on the topics they know about. It's like us, right? If I didn't learn about a subject and you asked me a question about it, I would have trouble answering. It's the same for LLMs.

So the first issue is that they are not very good at answering questions about unfamiliar topics. For example, OpenAI's models are not able to answer a question about 2024: all their training data is from before this year. In particular, this means they don't work with real-time data. Another kind of data they are unfamiliar with is personal and confidential data: the data you didn't share, your non-public documents, your personal data, didn't end up in the training data of those models, so the model has no way to know about it.
The second issue is that what is learned is learned: LLM models cannot forget, and this is a problem when what they learned is something they should not know. For example, outdated data, such as Pluto's status: Pluto's status has changed a lot, so is it a planet or not? If it changes this year, the LLM has no way to know, and the problem is that it assumes the last status it saw is the ground truth. Similarly, we have fake news and deliberate misinformation: something which used to be seen as ground truth in the past may not be true anymore, but the LLM model has no way to know that. And you also have the case of copyrighted and personal data: if by mistake it ends up in the LLM model, the fact that the model cannot forget is a problem.
So how do we correct the model? You can improve the knowledge of the model by providing additional data: new information (Pluto now is a planet, or is not) and patches (okay, this information was not correct). You don't want to wait for the next version of the model, because maybe you don't have the time; you want to do it yourself, to be sure the data is included. You have two ways of doing that: the first one is fine-tuning and the second one is prompt engineering. We're going to discuss both right now.

First, fine-tuning.
Fine-tuning is taking a pre-trained model, such as a generic GPT model, and adapting and personalizing it on your own data. You take an existing model and then you pursue the training over your data, the same kind of training that was used to obtain the generic model. So it's batch training: you need all the data at once and you train on it. One issue is that it requires an adapted dataset, so you cannot train over just any kind of data: it has to follow a particular schema, and preprocessing your data may be costly.
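To give an idea of what such a schema can look like, here is a minimal sketch assuming the JSONL chat format used by OpenAI's fine-tuning endpoint; treat the exact fields as an assumption and check your provider's documentation.

```python
# Sketch of the dataset preprocessing that fine-tuning requires: every training
# example has to be reshaped into the provider's schema, all of it up front
# (batch training). The format below follows OpenAI's chat fine-tuning docs.
import json

examples = [
    {"messages": [
        {"role": "user", "content": "Is Pluto a planet?"},
        {"role": "assistant", "content": "No, since 2006 Pluto is classified as a dwarf planet."},
    ]},
    # ... one record per training example
]

with open("train.jsonl", "w") as f:
    for ex in examples:
        f.write(json.dumps(ex) + "\n")
```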
Another issue is that you cannot forget: we end up with the same kind of model. It's an LLM model, fine-tuned, but it's still an LLM model, so it cannot forget. If something changes in your data, you will have to retrain it from scratch: you have to take the pre-trained model again and redo the fine-tuning. This can be really costly, so it's not suitable for real-time data.
The second solution is prompt engineering. Prompt engineering is modifying the query so that all the data the LLM might need to answer the query is included in the query: the answer is in the question, in a way. For example, with Pluto: if you want to ask what Pluto's status is, you will add all the latest news about Pluto and say, okay, given those articles about Pluto, is Pluto a planet or not? The LLM should then be able to answer you with the latest status.
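As a small illustration, manual prompt engineering for the Pluto example could look like the sketch below; the article text is a placeholder for whatever news you fetched by hand.

```python
# Sketch of manual prompt engineering: paste the fetched articles into the
# prompt so that "the answer is in the question".
articles = [
    "IAU (2006): Pluto is reclassified as a dwarf planet.",
    # ... any other articles fetched by hand
]

context = "\n".join(f"- {a}" for a in articles)
prompt = (
    "Given the following articles about Pluto:\n"
    f"{context}\n"
    "Is Pluto a planet or not?"
)
```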
What's wrong with that is that it's tedious work: you don't want to do it by hand, because you have to fetch the data and assemble the prompt yourself, it's not scalable, and it doesn't work well with real-time data. So you want to automate it.
The way to do it automatically is called retrieval augmented generation (RAG). It's a three-step process. First you transform the query, the question, into a vector using embeddings: you embed the query to obtain a vector. Second, using this vector, you automatically find the most similar documents using a vector index. And now that you have the most relevant documents, you can do the prompt engineering.
Just one quick explanation about what vector embeddings are and why they are useful. Embeddings are used to transform unstructured data into vectors. Why are we doing that? Because raw, unstructured data might be really hard to compare. Our example is text, and comparing two texts might be difficult, but comparing vectors is quite easy: we have a lot of optimized mathematical operations to do that. So the idea is to transform text into vectors so we can use all those optimized techniques to find the most similar documents as fast as possible. What's good with vector embeddings is that they are built in such a way that the more similar two texts are, the more similar their vectors will be. So using those vectors, we can do really fast document retrieval.
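As a quick sketch of why this is fast, cosine similarity between two embedding vectors is a cheap, well-optimized operation; the `embed` call below stands for any embedding function, such as the OpenAI one shown earlier.

```python
# Similar texts -> similar vectors -> high cosine similarity.
import numpy as np

def cosine_similarity(a, b):
    a, b = np.asarray(a), np.asarray(b)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Example usage (embed() is any embedding function):
# score = cosine_similarity(embed("Is Pluto a planet?"),
#                           embed("Pluto's planetary status"))
```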
Okay, so the most popular RAG use case is chatbots over your own data: you can take an LLM chatbot and add your own data to it. It's a good fit for real-time data, and also for providing context to queries to avoid potentially incorrect answers.
Let's take an example with confidential data. Let's assume your company has confidential documents and you want to build a chatbot to query them, for example to ask: what is this company's budget? You will use RAG to find the most relevant data and then do the prompt engineering, producing a summary of the information you have found. But as you can see, there is still an issue: what happens if the RAG data is compromised? It's the same as the initial issue. What if the data is outdated, or totally wrong, or copyrighted, or personal?
The good news is that the context can be forgotten. Every time you do a query, you retrieve the most similar documents, so if you remove a document, it will no longer be taken into account. RAG not only supports the addition of new documents, it also provides a really easy way to remove knowledge from your application, so you can easily remove incorrect or confidential data.

In our example, we have our chatbot over our confidential data. As you know, confidential data is heavily regulated, and if for legal reasons you have to remove a document, you don't want your system to reflect this removal only one month later, right? It has to be removed from your chatbot as soon as possible, otherwise you may face lawsuits.
You want something really reactive, and here reactivity is key. You want a system which takes new data into account as soon as it is inserted, and in the same way forgets the data as soon as it is removed from your system. The solution is to use a real-time vector index. It takes any document into account as soon as it is indexed, and by removing the data, your system will forget it. It's well adapted to live data streams, of course, and the main characteristic you want is for it to be reactive.
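To make the add/forget behaviour concrete, here is a deliberately naive in-memory sketch (an illustration only, not Pathway's implementation): as soon as a document is removed from the index, the next query simply cannot retrieve it anymore.

```python
# Naive in-memory vector index: adding indexes a document, removing makes the
# system forget it on the very next query.
import numpy as np

class SimpleVectorIndex:
    def __init__(self, embed):
        self.embed = embed          # any embedding function, e.g. the OpenAI call above
        self.docs = {}              # doc_id -> (text, vector)

    def add(self, doc_id, text):
        self.docs[doc_id] = (text, np.asarray(self.embed(text)))

    def remove(self, doc_id):
        self.docs.pop(doc_id, None)  # from now on, queries can no longer retrieve it

    def search(self, query, k=3):
        q = np.asarray(self.embed(query))
        scored = [
            (float(q @ v / (np.linalg.norm(q) * np.linalg.norm(v))), text)
            for text, v in self.docs.values()
        ]
        return [text for _, text in sorted(scored, reverse=True)[:k]]
```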
Now, a bit of practice: let's build a chatbot. We'll see how to build a chatbot over PDFs. Here we'll take financial documents, and we'll take a scenario where we need to remove one document and we want the chatbot to forget it as fast as possible. We'll do that in pure Python with Pathway.

The pipeline will look like this. Everything can be separated into two parts: the prompt construction and retrieval, which will be done in Python, and all the LLM operations, which will be handled by the OpenAI API. The first step is to index the documents: all your documents are on your storage, and for each document you call the API to obtain the embeddings, and you index all the documents with their embeddings. Then, whenever a user uses the search bar to make a query, you apply the RAG approach: you call the API to compute the embedding of the query, and using this vector you query the vector index to retrieve the most relevant documents from your documentation. With those documents you do the prompt engineering ("given those documents, please answer this query") and you call the API to do the chat completion over this prompt. Then you can post-process the answer or forward it directly.
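Putting the query-time steps together, a minimal flow could look like the sketch below; it reuses the hypothetical SimpleVectorIndex and the OpenAI client from the earlier sketches, and the model name is a placeholder.

```python
# Sketch of the query-time flow: retrieve, build the prompt, call chat completion.
def answer(query, index, client, k=3):
    documents = index.search(query, k=k)                    # 1) retrieval
    context = "\n".join(f"- {d}" for d in documents)
    prompt = f"Given those documents:\n{context}\nPlease answer this query: {query}"
    response = client.chat.completions.create(              # 2) chat completion
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content              # 3) post-process or forward
```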
Okay, so we'll do that with Pathway, which is a data processing framework in Python for batch and streaming data, meant to allow LLM applications to work on real-time data. It's a Python framework, but there is a scalable Rust engine behind it. It will look like this: Pathway will handle all the calls to OpenAI, so it will handle the embeddings, the vector index search, and also the prompting. It provides you the tools to query all kinds of documents, from files and Kafka topics to Google Drive or SharePoint, and it gives you everything you need to do RAG really easily with OpenAI or any other model you want.
So let's see how it is done. The first thing you do is connect to your documents. Here we use connectors: you can connect to your data source using connectors, and we're going to read this documents folder on the file system. All the documents will be PDFs put into this folder. Then you need to define the models. Pathway provides the tools to configure the models you want really easily, by pre-configuring everything for you, and you can define the embedder, the LLM for chat completion, and the splitters. Similarly, you just have to initialize a vector store with the documents, and everything is configured for you by Pathway. We need to define a web server for the query answering, and everything is customizable; using a REST connector we obtain the queries. Now we can do the RAG: for each query we retrieve the most similar documents. Here we retrieve only one document, but in practice, depending on your use case, it might be 10, 20, or 30. Then we do the prompt engineering; as you can see, all the functions are already done for you, so it's very simple. Then we do the chat completion with the prompt, we send back the result, and then we run the pipeline.
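The following is only a rough reconstruction of the pipeline just described, using the module and class names of the Pathway LLM xpack as I recall them (pw.io.fs.read, OpenAIEmbedder, OpenAIChat, TokenCountSplitter, VectorStoreServer); treat the exact imports, signatures, and parameters as assumptions and refer to the Pathway documentation and demo repository for the working code shown in the talk.

```python
# Rough, assumption-laden sketch of the Pathway pipeline described above.
import pathway as pw
from pathway.xpacks.llm.embedders import OpenAIEmbedder
from pathway.xpacks.llm.llms import OpenAIChat
from pathway.xpacks.llm.splitters import TokenCountSplitter
from pathway.xpacks.llm.vector_store import VectorStoreServer

# 1) Connect to the data source: every PDF dropped into ./documents/ is picked
#    up in streaming mode, and removals are propagated too.
documents = pw.io.fs.read(
    "./documents/", format="binary", mode="streaming", with_metadata=True
)

# 2) Pre-configured models: the embedder, the splitter, and the chat model.
embedder = OpenAIEmbedder(model="text-embedding-3-small")
splitter = TokenCountSplitter()
chat = OpenAIChat(model="gpt-4o-mini")

# 3) The reactive vector index over the documents.
vector_store = VectorStoreServer(documents, embedder=embedder, splitter=splitter)

# 4) The web server / REST connector for the queries, the top-k retrieval, the
#    prompt engineering, and the chat completion would follow here, and finally
#    the pipeline is started:
pw.run()
```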
Okay, let's see how it works. First we just check what kind of documents we have: we have two documents, an Alphabet financial document and another one. We launch the pipeline, we run it, and it might take a while, because the first thing the pipeline does is index all the documents: it builds the vector index using OpenAI. Then we can query the documents. We want to ask a question about Alphabet: what is the revenue of Alphabet in 2022, in millions of dollars? Let's see what the answer is. It answers with this number, so let's check whether it is correct. This is the document which is indexed, and if we go to the revenue page, we can see that the number is correct.

Now let's assume that this document, for some legal reason, has to be removed. We remove it from the documents folder, and let's see what the chatbot says now. What is the revenue of Alphabet? No information found. So our chatbot is really reactive: whenever the document is removed, if you do another query, the removal has already been taken into account.
Reactivity is key: as we say, garbage in, garbage out, so you need to update your index as soon as possible. Your system has to be very quick to take the removal into account, and for that, streaming is the way to go. If you do the reindexing in batch, say every hour or every 20 minutes, then in between two reindexings all the queries might be inconsistent, because the document has been removed but the change has not been propagated through the whole system. That's why you need an event-based approach, and for that, reactive real-time vector indexes are the way to go, such as the one in Pathway, which is very reactive to any update, whether it is an addition or a removal.
To conclude: LLMs can be wrong, and RAG is the solution. Most of the problems with LLMs come from the training data, either because it's missing some data or because some data is incorrect, and RAG is the only existing solution to correct this limited knowledge that can adapt in real time. Fine-tuning is nice, but it's done on batch data, so you cannot forget and you have to redo it every time, while RAG can maintain a real-time index, and your system will be really reactive to all the changes. Reactivity is key: your index should reflect the changes in your data in real time.
So thank you for listening to
this session. You can try the demo yourself and
please don't hesitate to reach out to me if you have any questions.