Transcript
This transcript was autogenerated. To make changes, submit a PR.
Today I am going to share with you some of the lessons learned from multiple
AI chatbot projects where we utilized large language models,
and doing that is actually quite tricky. So by the end of the presentation
you will have a list of what to pay attention to:
sometimes critical issues, sometimes tiny little details
which are still important for the project's success.
And we will start with an introduction to RAG, what it is
and what kind of challenges you can expect when building complex applications.
Then we will talk about hallucinations, but also how
we control the scope of the conversation, so that if we
are dealing with a customer issue, we don't start talking about
the US presidential election or any other topic which is not
relevant. We will also cover the cost,
how to calculate it and what's important in
various scenarios. And at the end I'll briefly
describe privacy issues related to LLMs and the consequences
of various decisions. My name is Martin,
my background is in data engineering and MLOps, and I'm
running a team specialized in everything data at Tantus Data,
where we help our customers with setting up data
infrastructure, building data pipelines, and machine learning
and GenAI-driven applications. So during this
presentation I will share lessons learned from some
of our projects. And a little disclaimer before
we get started. We need to be aware that the entire area of
GenAI is moving incredibly fast. The models
improve over time, the libraries and tools improve,
some of them die. So it's really hard
to keep track of all that. And because of that,
be aware that some of the tools I'm referring to might
be outdated by the time you listen
to this. And I'll try not to
focus on specific tools, but more on problems, solutions,
techniques and general ideas.
But since there is so much going on
in the area of GenAI, after the presentation
I would be really happy to hear from you about your
findings, your experience. So don't be
shy and let's connect on LinkedIn.
Okay, so let's get started. Let's think about a concrete business
problem you would like to solve. And let's think about
a chat which is a travel assistant on a vacation rental website.
And let's say the customer comes and asks:
"I need an apartment in London with an elevator."
How do we know what the
customer is asking for?
How do we come up with specific information and use that in the
chat? So one of the very common answers to this kind of question is vector embeddings
and vector databases. So let's quickly define what they
are and why they are good for natural language
problems. But then I will show you
some examples of when they do not work that well and what
we can do. So the promise about
vector embeddings is very simple. First of all,
you transform the text into a vector and the vector represents
the semantic meaning of the text. So two
texts which have similar meaning will be transformed into two vectors
which are also close to each other.
And let's have a look at examples.
This is one of the very classical examples.
King and queen are somewhat the same
role, you can say, and the only difference is gender.
So in a perfect vector space, the distance between king and queen
should be the same as between man and woman.
And you should even be able to do this
kind of math: queen = king - man + woman.
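To make that arithmetic concrete, here is a tiny sketch. The vectors are made-up three-dimensional toy values chosen so the math works out; real embeddings come from a trained model and have far more dimensions.

```python
# Toy illustration of "queen = king - man + woman" with made-up 3-dimensional
# vectors; real embeddings have hundreds or thousands of dimensions and come
# from a trained model, not hand-written numbers like these.
import numpy as np

vectors = {
    "man":   np.array([1.0, 0.0, 0.0]),
    "woman": np.array([0.0, 1.0, 0.0]),
    "king":  np.array([1.0, 0.0, 1.0]),
    "queen": np.array([0.0, 1.0, 1.0]),
}

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# "king - man + woman" should land closest to "queen".
target = vectors["king"] - vectors["man"] + vectors["woman"]
ranked = sorted(vectors, key=lambda w: cosine(target, vectors[w]), reverse=True)
print(ranked[0])  # -> queen
```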
And this is another example. The words red,
orange and yellow represent colors, so they are close to each
other. Then king and queen are also close to
each other, and car is somewhere completely else.
And this is a very flat and simplified
dummy example of vector embeddings, because in reality
they have hundreds or thousands of dimensions.
But the idea, the promise of embeddings,
is that the vectors are close to each other if
the texts have similar meaning.
So it's not a surprise that for searching the information needed
by the LLM in a chatbot, we likely want to try a vector
database. So the super basic idea is that you transform your
documents into vectors, you store them in a vector database,
and you serve the relevant documents to the LLM.
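As a rough sketch of that flow, assuming the chromadb and openai packages (the model name, collection name and example descriptions are placeholders), it could look something like this:

```python
# A minimal sketch of "embed documents, store them in a vector database, serve the
# relevant ones to the LLM", using chromadb and openai as example libraries.
import chromadb
from openai import OpenAI

client = OpenAI()                    # expects OPENAI_API_KEY in the environment
chroma = chromadb.Client()           # in-memory vector database
collection = chroma.create_collection(name="apartments")

# 1. Embed and store the property descriptions (Chroma uses its default
#    embedding function here; you can plug in your own).
collection.add(
    documents=[
        "Apartment with elevator in London, two bedrooms, city centre.",
        "Cottage in Cornwall, large garden, no elevator.",
    ],
    ids=["prop-1", "prop-2"],
)

# 2. Retrieve the most relevant descriptions and hand them to the LLM.
def answer(question: str) -> str:
    hits = collection.query(query_texts=[question], n_results=2)
    context = "\n".join(hits["documents"][0])
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

print(answer("I need an apartment in London with an elevator."))
```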
And a more general version of this diagram is
the one where we provide an LLM with access
to our documents, databases, APIs,
basically everything needed in order to understand our domain information.
And this technique is called RAG, which stands for retrieval
augmented generation. But once again,
why are we doing this? We need to remember that
the main ability of an LLM is not really the knowledge it
comes with, but the ability to work with texts
written in natural language, and the ability to follow
instructions related to those texts. So an LLM has
a chance to know only about the information it was
trained on, and we need to provide it with our specific domain
knowledge. Let's get back to our example, our business
problem. How do we use that technique? How do we use vector
databases for our "I need an apartment with an elevator in London" query?
If we have our apartment descriptions in the vector
database, what we could do is just check if the vector
representing our query is close to any of our property descriptions.
And what we hope for is that the
"I need an apartment with an elevator in London" vector will be
close to an apartment with the description "apartment
with elevator in London" and not that
close to an apartment with the description "apartment in
London", where no elevator is mentioned.
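A quick way to sanity-check that hope is to embed the query and both descriptions with an off-the-shelf model and compare cosine similarities; here is a small sketch assuming the sentence-transformers package (the model name is just an example):

```python
# Embed the query and the two descriptions, then compare cosine similarities.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

query = "I need an apartment with an elevator in London"
with_elevator = "Apartment with elevator in London"
without_elevator = "Apartment in London"

q, a, b = model.encode([query, with_elevator, without_elevator])
print("with elevator:   ", float(util.cos_sim(q, a)))
print("without elevator:", float(util.cos_sim(q, b)))
# We hope the first score is clearly higher; in practice the gap is often small,
# which is exactly the problem discussed below.
```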
But once again, this example is way too simplistic.
This is a perfectly valid technique, and it describes what vector
embeddings and vector databases can be used for. But I would
like to focus on the challenges which you might face. So first
of all, the apartment descriptions will never be just a single sentence.
They will look more like this. This is
an example: a cottage in Cornwall,
western England, and it's not that expensive.
We have one more very similar one. So you can see the descriptions are
quite long and you have some extra information about them.
But we also have some completely different properties.
We have another one in London. It's not a cottage anymore,
it's an apartment, and it is much more expensive. So a completely
different type of property, and one more
similar one. And then if we take all four properties I
have just shown, get the descriptions and take the
vector embeddings, we end up with something like this.
I do understand that this is not very readable, but this
actually represents very well the problem an engineer
is struggling with. So we have four different properties,
but in the world of vectors they are very close to each other,
simply because the wording used in the descriptions
is very similar. They are very far away from the
word banana, very far away from
some sentence about the constitution, but that
doesn't really help us, because the property descriptions themselves
are very close to each other. So it's very hard to distinguish
what is what, and very hard to make a good
proposal for the customer. And the reason for that
is that if we are using general vector
embeddings, they are good at general language, but they are not specialized
in a specific domain, and the specialization in a specific domain
is usually what we want. So maybe that
will come up as a surprise. But magic does not exist. There is
no such thing as a silver bullet. So vector databases
are useful, but you need to test what will work for you.
Maybe you will need to fine-tune the embeddings so they are specialized
to the domain. For sure, you will need to split long documents,
because of the context length limitation of vector embedding models:
they can accept text only up to a specific limit,
a few hundred to a few thousand tokens at most.
But even if you can fit a long text in the embedding model,
it does not mean the longer text will work better for
you. This is something to be tested.
One of the secret ingredients making
your chat better is splitting the documents you are working with
into digestible chunks, that's for sure. But if
I just tell you: take the apartment description,
take a PDF document, and split it into chunks,
it would be a bit too simplistic. It would
be somewhat like saying:
just draw two circles, then complete the owl.
Something is missing. So what do we do,
what do we pay attention to when splitting documents?
Let's have a look at some of the solutions.
When you split the document, what will matter for sure is the size.
And all I can say for sure is that a very big chunk will not work
very well, and it's kind of intuitive: the vector size is
static, and if you try to squeeze too much information into
it, you will lose some of it.
But other than that, when you split the document, you need
to know something about the context. And a good
example would be a large PDF: having just a chunk
of it without knowing which chapter or which section it
comes from will not be very helpful.
That's why it's important to keep the relevant information as part
of the chunk or as part of the metadata.
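For example, keeping the chapter and source as metadata on each chunk might look like this sketch using LangChain's recursive splitter; import paths and defaults have moved between LangChain versions, so treat it as an illustration:

```python
# Split a long document into overlapping chunks and attach context (source, chapter)
# as metadata, so a chunk can still be interpreted on its own.
from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=800,      # characters per chunk; worth tuning per use case
    chunk_overlap=100,   # overlap so sentences are not cut off from their context
)

chapter_text = "..."  # text of one chapter extracted from the PDF
chunks = splitter.create_documents(
    [chapter_text],
    metadatas=[{"source": "brochure.pdf", "chapter": "House rules"}],
)
# Each chunk is a Document with .page_content and .metadata; another option is to
# prepend the chapter title to the chunk text itself before embedding.
```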
And if you Google "what if the data is too
large for LLM context", or if you just scan the QR code,
you will get to one of our articles describing these kinds of problems.
We also described there a mechanism called the self-query
retriever, and it's super useful in situations
where you have a granular split with all the details necessary,
but the vector similarity of multiple chunks is still too
close and it's hard to distinguish
which one is the best in a given situation. In
such cases it's good to try this mechanism. What it does:
it's basically a tool in LangChain
which allows you to come up with a structured query
over specific attributes you predefined. So let's say from
a PDF chunk you extract the price or
the offer name. If you predefined them,
you can have another LLM call for a better understanding
of the values of these attributes, so you can make
a better decision about what answer to present to the user.
It's very useful, and I recommend you read up on it.
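For orientation, a rough sketch of wiring up a self-query retriever in LangChain could look like the following; the import paths and signatures have changed across releases (and it also needs the lark package), and the vector store, model and attributes here are just examples:

```python
# Self-query retriever: an LLM turns the user's question into a structured query
# over metadata fields we predefined (price, offer name), plus a semantic query.
from langchain.chains.query_constructor.base import AttributeInfo
from langchain.retrievers.self_query.base import SelfQueryRetriever
from langchain_openai import ChatOpenAI

metadata_field_info = [
    AttributeInfo(name="price", description="Nightly price in GBP", type="integer"),
    AttributeInfo(name="offer_name", description="Name of the offer", type="string"),
]

retriever = SelfQueryRetriever.from_llm(
    ChatOpenAI(temperature=0),
    vectorstore,                         # any LangChain vector store holding the chunks
    "Vacation property descriptions",    # what the documents contain
    metadata_field_info,
)

docs = retriever.invoke("apartments in London under 150 per night")
```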
But let's move on. One more
disclaimer about the PDF files I mentioned.
The disclaimer is that a lot will depend on the format
and how exactly you parse the PDF. Sometimes you
just need to find a specific parser for a specific document,
but sometimes maybe it's worth looking around.
Maybe you have a chance to get the data you need from a source
which has better structure than the PDF file.
Maybe the same data exists in a better format.
So far we've been focusing on how we can improve
the vector search by splitting the documents.
But what else can we do in order to improve
the vector search? Well, you can use something else instead of
vector search. And I just wanted to say that vector databases
are very popular, they are growing, and they are very
natural to use in the context of natural language processing.
But just the fact that they are popular, just the fact that
they are very much connected with LLMs,
does not mean this is the only tool you can use.
So for instance, if you have an Elasticsearch,
or if you have some search API in your company, there is
really no reason not to try it,
if it can provide you with relevant info. And at the
same time, most of the vector databases come
with not only the vector search ability,
but also a hybrid search ability.
So on top of vector search, you can
enable a more traditional keyword search, for example
BM25, and you can verify which results
are better. Maybe you can mix them together. Maybe you can
use both results. And once
you mix them together, once you utilize data from multiple
search methods, what you can do is re-rank
the responses you received. In many
of the cases we have implemented, we realized
that it makes a lot of sense to blend
multiple sources, multiple results,
together. And what you can consider, apart from
vector databases, is data coming directly from a backend database,
from a data lake, from a data warehouse, from internal APIs,
but also from external APIs like panel
data, or from Google search.
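As one possible illustration of that blending, here is a sketch that computes a BM25 keyword score next to a vector similarity score and mixes them; it assumes the rank_bm25 and sentence-transformers packages, and the 50/50 weighting is something you would tune and normalize properly:

```python
# Blend BM25 keyword scores with vector similarity scores for the same documents.
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer, util

documents = [
    "Apartment with elevator in London, two bedrooms.",
    "Cottage in Cornwall, large garden, no elevator.",
    "Modern London flat, lift access, close to the Tube.",
]
query = "apartment with elevator in London"

# Keyword side: BM25 over whitespace-tokenized text (use a proper tokenizer in practice).
bm25 = BM25Okapi([doc.lower().split() for doc in documents])
keyword_scores = bm25.get_scores(query.lower().split())

# Vector side: cosine similarity of embeddings.
model = SentenceTransformer("all-MiniLM-L6-v2")
vector_scores = util.cos_sim(model.encode(query), model.encode(documents))[0]

# Naive blend; the 0.5/0.5 weights (and score normalization) are use-case specific.
blended = [0.5 * float(k) + 0.5 * float(v) for k, v in zip(keyword_scores, vector_scores)]
best = max(range(len(documents)), key=lambda i: blended[i])
print(documents[best])
```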
And then, on top of a quite aggressive query which provides us with many
results, we re-rank and select
the best candidates, the ones which are the most promising,
so the chatbot can utilize the information from them
in coming up with the most relevant answer.
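One simple, model-free way to do that re-ranking when you have several ranked result lists is reciprocal rank fusion; a minimal sketch (k=60 is the commonly used constant):

```python
# Reciprocal rank fusion: merge ranked result lists from several retrievers
# (vector search, BM25, an internal API, ...) into one ranking.
from collections import defaultdict

def reciprocal_rank_fusion(result_lists, k=60):
    scores = defaultdict(float)
    for results in result_lists:                 # each list is ordered best-first
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

vector_hits = ["prop-3", "prop-1", "prop-7"]
keyword_hits = ["prop-1", "prop-9", "prop-3"]
print(reciprocal_rank_fusion([vector_hits, keyword_hits]))  # best candidates first
```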
The last technique I wanted to mention, and it's actually quite
simple but still quite powerful, is preprocessing
using large language models. So let's say you
have some metadata, but in your metadata you don't
have any information about whether an apartment has an elevator
or not. But the customers are looking for this kind
of information, and you do have it in
the description, in free text. So what you can
do is a batch preprocessing pass
using an LLM, searching for the specific metadata
you know users are often looking for. And then
once you extract the metadata, you can just save it: you can enrich
your database and use it in your queries.
So basically, you are utilizing the fact that LLMs are
very, very good at tasks like sentiment analysis,
text categorization and so on. You just
tell them which category or what information you are looking for,
and they do it for you, basically out of the box.
So there is really no reason not to use that fact.
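A minimal sketch of such a batch preprocessing pass, assuming the openai package; the model name, prompt and attribute are just examples:

```python
# Batch-extract metadata (here: does the property have an elevator?) from free-text
# descriptions, so it can be saved back into the database and used in queries.
import json
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

def extract_metadata(description: str) -> dict:
    prompt = (
        "From the property description below, return JSON with a single boolean "
        'field "has_elevator". Description:\n' + description
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},
    )
    return json.loads(response.choices[0].message.content)

descriptions = ["Bright flat on the 4th floor, lift in the building.", "Ground-floor cottage."]
enriched = [{"description": d, **extract_metadata(d)} for d in descriptions]
# `enriched` can now be written back to the database and filtered on has_elevator.
```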
Okay, so we've been talking about techniques which lead us to providing the
most relevant information to the chatbot. But even if
you provide it with very relevant information, it still
can make a mistake. It still can hallucinate. So yes,
one way of preventing or limiting hallucination is to provide
relevant info, but there is really
no guarantee that the answer the chatbot comes up with, based on the prompt
and the data you provided it with, is correct.
So I will show you a very quick demo of what
a hallucination looks like and one specific technique
which you could use in your project in
order to prevent it. So let's have a look at the demo I
recorded. Let's have a look.
What we have is Python code where we import a
tool called NeMo Guardrails. It's a tool created by NVIDIA.
And we have a text file with some questions.
We'll have a look at it in a second. And then we define that we
want to use an old OpenAI model, text-davinci-003.
And then in the file we define some
questions. The first question
we define is: when did
the Roman Empire collapse? So we want to ask that question
to the model. And I am asking the question about the
Roman Empire because it's common knowledge. And the second
question I'm asking is: how
many goals have been scored in the Polish Ekstraklasa in a specific season?
So since the first question is common knowledge and the second one
is not, I expect the answer to one of the questions not to be
a hallucination, and for the other one
I do expect the model to hallucinate. And let's see if the
tool can spot what is a hallucination and what is not.
So let's see: we run the code and
we get a lot of logs. And once
we scroll all the way up, after it completes,
we can see the
first question, when did the Roman Empire collapse?
We get a bot response,
and it's getting flagged as not a hallucination.
But how exactly did the tool spot
that? Let's have a look into the details, using the second question
as an example: how many goals have been scored in the Polish Ekstraklasa?
The bot response we are receiving is 1800.
I have no idea if it's correct or not, but the whole point is what
the tool does next: it asks exactly the same question a second time,
and we get a completely different response,
and then it asks the same question a third
time and once again gets a different
response. And then what the tool
does is check whether the answers
we are getting are in sync, whether their
meaning is the same. So it's actually doing another prompt
to the model. The prompt is:
you are given a task to identify if the hypothesis
is in agreement with the context below.
The hypothesis is the original answer
we received, so the answer from the first
time we asked that question, and the context
is the two extra responses we received, because the
tool was asking the same question three times. The
answer from the model is no,
the pieces of information are not in agreement, which means
we flag it as a hallucination.
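The same idea can be sketched by hand, outside the tool: sample the answer a few times and ask the model whether the original answer agrees with the extra samples. This is a simplified approximation of what the demo showed, assuming the openai package and an example model name:

```python
# Rough self-consistency check: ask the same question several times, then ask the
# model whether the first answer agrees with the later ones. Disagreement is a
# hint (not proof) that the first answer was hallucinated.
from openai import OpenAI

client = OpenAI()

def ask(question: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        temperature=1.0,  # we *want* variation between samples here
        messages=[{"role": "user", "content": question}],
    )
    return response.choices[0].message.content

def looks_like_hallucination(question: str, samples: int = 3) -> bool:
    answers = [ask(question) for _ in range(samples)]
    verdict = ask(
        "You are given a task to identify if the hypothesis is in agreement with "
        f"the context below. Answer only yes or no.\n\nHypothesis: {answers[0]}\n\n"
        "Context:\n" + "\n".join(answers[1:])
    )
    return verdict.strip().lower().startswith("no")

print(looks_like_hallucination("How many goals were scored in the Polish Ekstraklasa in 2022/23?"))
```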
So yeah, there are ways of preventing hallucinations. It's good to
be aware of them, but at the same time it's good to be aware of
the consequences of these kinds of techniques, because there is no such thing
as a free lunch. First of all, you need to be aware of the costs
associated with that: the dollar cost you
pay for the extra API calls, the cost of
a slower system, because extra API calls introduce
extra delay, but also the cost of false positives,
because there is really no guarantee that this kind of technique always
works. But all that,
the existence of hallucinations, the fact that we
have to deal with them, but also how we have to experiment
with cutting the documents, how we have to tune the search
engine, all of that can lead to the conclusion that we are back to
square one to some extent, and that there is really no shortcut.
And even though LLMs are really impressive,
you cannot avoid working on the data quality
or just careful engineering. Tools like LLMs
are impressive, but you still have to do your homework.
The good news is that there are many tools which could
help you to some extent. So I mentioned NeMo Guardrails,
but it's also worth looking into MemGPT or Weaviate. But
at the same time, do not expect that some tool will solve
all your problems. Do not expect
that you can buy some tool which will magically solve everything.
The "shut up and take my money" approach will
probably not work. It's not gonna happen. The tools might
be helpful, but they come
with their own problems. The tools themselves are quite
immature, because basically the entire area of large language models,
chatbots and so on is quite
new, quite fresh. And just
to show you an example of how the
tools are changing, this is the history of code in the LangChain
project. And there are tons of changes, which on
one hand is a good thing because the project is evolving and it's actually
impressive how fast it's growing. But on
the other hand, that means you have to be aware
of the updates, upcoming changes, there will be some bugs introduced,
there will be some breaking changes over time, and you just
need to be ready for that. You just need to be aware
of that. So we have all the tools which are helpful,
but not very stable yet, and we are working with a completely
new area and there is a lot of unknown here. And that is why it
is really important that you do the testing. And testing of
an LLM project is really, really tricky. So what you
can do for sure, and what you should do, is test
the retrieval, because this is fully under your control and
quite predictable, so it's easy to define the test conditions.
But you should also test
the LLM actions wherever you can.
And I say wherever you can because it's actually quite tricky,
and it's very hard to define reliable tests
which cover most of the possibilities.
And one of the problems with testing LLMs is that even if
you have exactly the same input in
your test, the output can vary.
There is a post on the OpenAI forum about the question
of determinism, and I really recommend you read it.
The bottom line is that a large language model's behavior is not really
deterministic. So yeah, you have parameters
like temperature which you can set, and this should control
how creative the model is. But there is a misconception
that if you set it to zero, the LLM will behave
in exactly the same way every time. In reality it will
just be kind of less creative,
but it still might give you varying results,
mostly because of the hardware it's physically run on, but also
because you can always end up with two tokens which have exactly the same probability,
so one or the other will be randomly selected in your result.
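One practical workaround is to assert on meaning rather than on the exact string, for example by comparing the answer to a reference with an embedding model and a similarity threshold. A sketch in a pytest style, assuming the sentence-transformers package; chatbot() stands in for your system under test and the threshold needs calibrating:

```python
# Assert that the chatbot's answer means roughly the same thing as a reference
# answer, instead of demanding an exact string match.
from sentence_transformers import SentenceTransformer, util

_model = SentenceTransformer("all-MiniLM-L6-v2")

def assert_semantically_close(answer: str, reference: str, threshold: float = 0.8):
    similarity = float(util.cos_sim(_model.encode(answer), _model.encode(reference)))
    assert similarity >= threshold, f"answer drifted from reference (cos={similarity:.2f})"

def test_elevator_question():
    answer = chatbot("Does the London flat have an elevator?")  # chatbot() = your system under test
    assert_semantically_close(answer, "Yes, the London flat has an elevator.")
```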
So keep that in mind when you write the tests,
and it's always worth checking the LangChain utils for testing,
because they take this kind of
lack of determinism into consideration and aim to mitigate
it during testing. But what is critical
when you move to production is that you collect the data from your
runs with real users,
because that is really what gives you the real feedback
about how it is going, how the users are using the application,
and whether they are happy with it or not. Make sure you
collect the data. Make sure you analyze it, especially in the
early phases of the project.
Let's have a look at the legal and privacy aspects of LLMs.
What we need to understand is that whenever we
pull data from any database, then process
it and eventually pass
it to an LLM, our data is being sent to the
LLM provider: to OpenAI, to Microsoft,
to Google. In some cases that's perfectly fine, but there
are cases where you don't want to send the data
anywhere because it's too sensitive. And that means
you might want to use an open-source LLM installed in
a data center you own.
Keep in mind that in situations when an LLM over an API
is not possible, you not only have to have a private
LLM installation, you also need
your private embedding model, private vector DB and
so on. And installing all that is not rocket science,
but at the same time it increases the complexity of your ecosystem,
and there is a lot more that you have to maintain.
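As a sketch of what those self-hosted pieces can look like, here the embeddings come from sentence-transformers and the LLM runs locally via Hugging Face transformers; the model names are just examples and the GPU requirements are very real:

```python
# Fully local stack: nothing leaves your infrastructure. Embeddings and the LLM
# both run on machines you control; the model choices below are only examples.
from sentence_transformers import SentenceTransformer
from transformers import pipeline

embedder = SentenceTransformer("all-MiniLM-L6-v2")            # local embedding model
generator = pipeline(
    "text-generation",
    model="mistralai/Mistral-7B-Instruct-v0.2",               # local open-weights LLM
    device_map="auto",                                         # needs accelerate and a capable GPU
)

doc_vector = embedder.encode("Apartment with elevator in London.")  # stays on-prem
completion = generator("Summarize: apartment with elevator in London.", max_new_tokens=50)
print(completion[0]["generated_text"])
```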
And let's keep in mind that privacy and where the data is being sent
is just one aspect of the legal concerns when it comes to LLMs.
I would really recommend reading the license terms of
the ones you plan to use. For instance, you should not
get misled by the term open source. Open source
does not automatically mean that you can do everything with it.
Some open-source licenses limit how you can use the
data produced by the LLM. So for instance, you might not
be able to use the data you collected for training another
LLM in case you decide to change the model:
you collect the data from the chatbot, but you cannot use it in the future
for training purposes. Similarly,
generating synthetic data for a machine learning model is a very
blurry area when it comes to LLMs. So once
again, don't assume too much and make
sure you don't run into unpleasant surprises.
Another very important consideration when starting a project
and deciding which LLM to use is cost. And you might think
open source is cheaper because you basically don't pay for the API calls.
But in the context of LLMs it's not that obvious.
And why is that? First of all,
because the simple math is not that simple anymore.
And what do I mean by the simple math?
Let's start with the API calls. For instance,
when you are using GPT-3.5, you pay
half a dollar per million tokens in the input and then
$1.50 per million tokens in the output.
But then for GPT-4 you pay $30
and $60 respectively, so already an order of magnitude
more. And in general you have a price list,
and based on that you can estimate how much a single interaction
with a user can cost, and then you can multiply it by the number
of expected interactions. But there
will be a few small asterisks to remember. So first of all,
the math will depend not only on the number of tokens
in general, but also on our understanding of
the balance between input and output:
Claude is cheaper for input but more expensive for output.
And in most cases it's a good enough assumption that a token
is a word. But if you are in a situation where a
small difference matters, then it's worth looking closer
at the tokenizers, because the models use different tokenizers,
and the number of tokens consumed for the same text
by Claude is different, actually a bit larger
than with ChatGPT. And to make it even more
confusing, Google Gemini charges not per token
but per character. So the math
is a little bit tricky already, but doing
a back-of-the-envelope calculation should give us a close enough
number. And it becomes much more
complex when we try to do the math for open source,
for an open-source LLM we host ourselves.
Then you don't calculate the cost per token or character
produced; you start with the price of the machine,
the price of the GPU, the price of maintenance,
and then you need to estimate the expected traffic.
If your traffic is low, the cost per
request will be extremely high. So it's not obvious math.
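A back-of-the-envelope comparison might look like this; the API prices are the GPT-3.5 figures quoted above, and every other number is a placeholder to replace with your own estimates:

```python
# Rough cost comparison: paying per token via an API vs amortizing your own GPU server.
# All traffic and hardware numbers below are made-up placeholders.

# API pricing quoted earlier for GPT-3.5: $0.50 / 1M input tokens, $1.50 / 1M output tokens.
input_tokens_per_chat = 1_500      # prompt + retrieved context
output_tokens_per_chat = 300
api_cost_per_chat = (input_tokens_per_chat * 0.50 + output_tokens_per_chat * 1.50) / 1_000_000

# Self-hosted: a fixed monthly cost regardless of traffic.
gpu_server_per_month = 2_500.0     # rental + maintenance, made-up figure
chats_per_month = 50_000
self_hosted_cost_per_chat = gpu_server_per_month / chats_per_month

print(f"API:         ${api_cost_per_chat:.4f} per conversation")
print(f"Self-hosted: ${self_hosted_cost_per_chat:.4f} per conversation")
# With low traffic, the self-hosted cost per conversation explodes, which is the point above.
```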
It's prone to errors. In many cases it will be more expensive
than using APIs, or at least the return on
investment won't be obvious. I briefly mentioned open
source models, and I'm actually coming from a background where I've always been using
open source: open-source databases, open-source
data tools, and I really like them. But it
was kind of comfortable working with the open-source products,
because open source was actually ahead. They were leading
the innovation, and then at some point the cloud providers
came and, to some extent, wrapped
the open-source innovation into a more convenient way of using
it. But now, and I'm a bit sad to say this,
the open-source LLMs are still behind,
and they don't perform as well as the commercial
ones. They are good, they are improving,
but be prepared for extra effort if you want to tune
a specific use case with an open-source LLM. And of course
you can fine-tune the model. But before you even do that,
make sure that your data is in good shape.
Data will be the starting point for you anyway, and the
easiest way to start is with a RAG application instead of fine-tuning.
Starting with simple RAG can provide you with
much faster results and much faster feedback
from the customer. But if at some
point you decide to tune the model itself, be aware that
there are various types of tuning and they differ:
they differ in how much data you need,
what kind of results you can expect, and whether they introduce extra latency.
All things considered, building chatbots is an area
where you need to experiment a lot. But when you experiment, make sure
you don't get overwhelmed by it. Make sure you have
the business goal in mind all the time, because it's very easy to get lost
and end up in never-ending experiments.
In most cases, you are not creating a research company.
In most cases you want to solve some specific business problems.
So keep that in mind. Working with
LLMs is a very, very nice,
interesting job. But at the same time you need to stay focused
on the business goal and make sure you are pragmatic.
Thanks a lot. If you have any questions,
drop me an email or drop me a line on LinkedIn. I'm always
happy to chat. Thank you.