Transcript
This transcript was autogenerated.
Hey, everybody. My name is Zain Hasan. I'm with Weaviate.
We build an open source vector database, and I'm very excited
for this talk. So, in this talk, I'm going to be talking about advanced retrieval augmented generation techniques, advanced RAG techniques,
and the example that I'm going to be using to explain these techniques
is a chatbot that functions as a super doctor.
And I'll talk about what it does in just a second. But I'm
going to be using this example to explain a lot of these advanced RAG
techniques that you could potentially use to build advanced
chatbots for your applications as well. Okay, so let's dive right
in. So let's talk a little bit about the average
human doctor. The average human doctor studies for at least
seven years after undergrad. So in North America, that's four years of studying
during undergrad, and then seven plus years after that.
And once they're practicing, they see about 100,000 patients in
their entire lifetime. So if you think about what a doctor has
to do when you go see them, you present to them some of the problems
you're having, some of the symptoms you're having, and then they use their
experience of similar patients they've seen, or knowledge
that they've gained over their career or during medical school
to present plausible diagnoses for you.
And so what I wanted to do was I wanted to see
if we could build a large language model or AI powered
medical doctor or assistant to help a human doctor
with the diagnosis that they have to perform. So if we unpack that
statement, what would this AI powered
super doctor have to be able to do?
So the first thing that we need to solve this
AI powered doctor would need to have access to a lot of patient cases.
It would need to have access to a lot of data. Not only
that, but it would also need to be able to search over this data.
If you have 200, 300,000 patients worth
of data, it's not possible for it to reason over
all of it. You have to give it the top five or the top
ten patients that are relevant to any new patient that you want
to diagnose. Right? So it has to be able to search for relevant patient cases.
And not only that, but maybe even if you give it access to medical
literature, publications, articles, it has to be able to search over
those and use those as well.
Because this is a medical domain application, it has to be able to
cite its sources. So it's not enough for it to just propose
a diagnosis. It has to explain why it proposed that diagnosis.
So if it's using a couple of patients to learn about that disease and then diagnosing a new patient, it has to be able to cite that: these are the two patients whose histories I read and whose diagnoses I saw, and I'm using those to propose a new diagnosis. So the system
has to be fully explainable, otherwise it's not going to be used in the medical
industry. Lastly,
obviously, it has to reason over those medical technical concepts,
so it has to be able to read over historical patient
cases and published articles and medical writing,
and then it has to be able to propose new or similar
diagnoses for patients.
And then it has to do this all in real time, right? A patient comes to this AI powered medical doctor and tells it their problems, and it has to propose diagnoses, or the next step, in real time. So we only have a few seconds
to do all of this. So let's dive in and see
how I solved this problem. So the simplest
approach that you could potentially have is you take the patient information
over here, and you feed it into a large language model,
and you ask it to provide a cheat sheet
or a list of potential diagnoses for this patient. This is literally the simplest thing that you could do: open up ChatGPT, type in the patient information, and ask what could potentially be wrong with this patient.
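To make that concrete, here is a minimal sketch of this naive baseline, assuming the OpenAI Python client and an example patient complaint; this is an illustration, not the talk's actual code.

```python
from openai import OpenAI

# Naive baseline: no retrieval, just the raw patient note in a single prompt.
# Assumes the OPENAI_API_KEY environment variable is set.
client = OpenAI()

patient_info = "My left shoulder hurts and I have numbness in my thumb and index finger."

response = client.chat.completions.create(
    model="gpt-4",
    messages=[
        {"role": "system", "content": "You are a medical assistant. Propose plausible diagnoses."},
        {"role": "user", "content": patient_info},
    ],
)

print(response.choices[0].message.content)
```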
And so here you're leaving a lot
of things to chance, right? If the language model doesn't know anything about
this particular patient's disease, then you're going to get irrelevant output.
So there's definitely things that we can do to improve on this if
we have a large language model that's fine tuned on medical data. So if we
improve the language model for this particular domain, we'll definitely get better
results. And also, if we
prompt the language model properly, then we might get better results as
well. And this has been shown in previous work: there's a real difference between somebody who knows how to use a language model and somebody who doesn't. And by
being able to use a language model, what I mean here is prompt engineering.
If you know how to use a language model properly, you know how to prompt
engineer properly, you can get these things to do really wondrous things, right?
So that's another approach that would make this simple
framework quite successful. But I want to see how
we can get even better than this.
So the next approach uses a technique called
RAG, retrieval augmented generation. So I call this the RAG doctor
approach. And this works as follows.
So what we're going to do here is when a new patient comes
to your office, you're going to use this AI chatbot,
and you're going to give it this new patient's information,
and then you're going to retrieve relevant cases. So this vector
database has all of the medical history
stored inside of it. You query it with the new patient
information, and out comes the relevant medical information over here. And then you take that relevant medical information along with the new patient information,
you give it to a language model, and the language model then has to propose
potential diagnoses. And so this is
called a RAG approach, because you're retrieving a bunch of
historical cases that might be similar to this new patient.
And then what you're doing is you're passing it to the language model.
So the language model gets to read over the relevant
cases. So this is almost a research phase where it gets to study
previous patient cases and how they were diagnosed and what happened
to them before. Then it has to reason over those and propose a diagnosis for this new patient that we're looking at.
And the great thing about this RAG approach is that you can
now cite sources as a result of this. Right? So the
language model can now say, based on the information that you
provided from this retrieval task, these are the proposed diagnoses
for this patient that I'm providing. So it becomes an explainable
system. So let's dive a little deeper into this RAG
doctor approach. In order to successfully
execute this RAG doctor approach, you need a lot of data. So the
first piece of this puzzle is going to be a patients dataset that is open source. You can have a look at it here. So the PMC-Patients dataset contains approximately 170,000 patient summaries.
So these have been extracted from medical literature,
and they talk about the problem that the patient had,
how they were diagnosed, what the solution was, so on and so forth.
So this is a paragraph or more that's
related to one patient, their diagnosis, and everything that they went
through in that diagnosis. So this is a pretty complete summary
of that patient. It also has about
one and a half million published medical articles
and abstracts as well.
And so this is all taken from an external data set that's
openly available. If you want to have a look, you can click on this link
here. But this is all publicly released information
that I used, and this became the
knowledge base that my language model could retrieve from and reason over.
So technically, the language model had access to all of these 1.4 million articles and 170,000 patient cases to learn from.
And so this is what that data set looks like.
So this is going to form the basis around which we're going to build
our AI powered super doctor.
The next thing that we need, a very critical ingredient in this entire setup, is the vector database. And vector databases essentially
give you the search functionality over all of this data.
We've got about 1.4 million published
medical articles and about 170,000 patient cases.
The large language model can't read and reason over all of these, so you have to use search functionality to retrieve
the most relevant articles or patient cases to
a new patient. And then you give those limited retrieved
articles to a language model to reason over.
And so here the vector databases can
potentially store billions and billions of documents, medical articles
or patient cases. Then, when a new patient walks into your office, you can ask them about the problem that they're having and use their unique information as a query to the vector database. And this vector database is going to perform a similarity search, where you go to the vector database and say: here's the information for this new patient I have. I want you to retrieve for me the top five most relevant previous patients in your repository, from the 170,000-patient dataset that we talked about, and potentially
five medical literature articles that are related to this patient's
case. So the vector database is now going to perform this similarity search
for you and it's going to give you the retrieved
articles over here. And so to serve this purpose we're going to use an open source vector database called Weaviate, and it's going to be able to scale up nicely for us.
If you want to learn more about it,
I've linked some of the docs and linked the website over here.
You can look into that as well. So the
usage of the vector database looks something like this in practice. Let's say
a patient comes to you and tells you that their left shoulder hurts
and they have numbness in their thumb and index finger.
So you're going to go and take that query, you're going to pass it
into the vector database and we'll talk about how this works in a second.
But you're going to pass that into the vector database and the vector database is
going to retrieve for you the top three other
patient cases that it has in its knowledge base. So from that 170,000
it's going to retrieve the top three most similar cases and it might
even retrieve for you, if you ask it, the top three published
medical articles that are relevant to diagnose this patient.
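As a rough sketch of what that retrieval step could look like with the Weaviate Python client (v4), assuming hypothetical collections named PatientCase and MedicalArticle that have already been populated; the names and schema are illustrative, not the exact setup from the talk:

```python
import weaviate

# Assumes a locally running Weaviate instance with two populated collections,
# "PatientCase" and "MedicalArticle" (hypothetical names).
client = weaviate.connect_to_local()

query = "My left shoulder hurts and I have numbness in my thumb and index finger."

# Top 3 similar historical patients and top 3 relevant articles.
patients = client.collections.get("PatientCase").query.near_text(query=query, limit=3)
articles = client.collections.get("MedicalArticle").query.near_text(query=query, limit=3)

context = [obj.properties for obj in patients.objects + articles.objects]
print(context)  # the six data points that get handed to the language model

client.close()
```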
So now that you have the top six relevant data points,
these can be fed into a language model, and we'll talk about that in a
second. Okay, so the next ingredient
that we need here is an embedding model, right?
So a vector database needs to be able to capture
every single patient case or medical article as a vector. And the way this works is: no matter what type of data you have in your vector database, it needs to be turned into a vector embedding over here. Right? So here I've represented a vector embedding. A vector
embedding is just a large array of numbers, and it captures the
meaning behind patient cases or
articles. So in this case,
all of the 170,000 patient cases and 1.4 million articles are going to be turned into vectors, and you'll have 1.4
million vectors for the articles
and 170,000 vectors for the unique patient
cases that you want to store in your vector database.
And so for this, we're going to need an embedding model,
a model that generates these vectors, that understands the medical domain.
And so for this, we're going to use an open source medical domain embedding model called MedCPT, and you can access this at the link over here. Its query encoder generates vectors for short texts, so we can use it to take the unique information for this patient, like I showed in the other slide: I have pain in my left shoulder and numbness in my index finger and thumb. You can
take that short description and you can turn that into a vector.
Not only that, but you also have an article encoder,
another embedding model, which can embed very large text.
So if you have a large abstract for a medical article, or you have a long historical patient description, you can use this type of embedding model to generate vectors for your large data set. Both of these encoders are going to be very critical, and they come from an open source paper whose code is also released, so you can have a look at that.
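As a rough idea of how generating one of these vectors could look, here is a sketch using the MedCPT query encoder from Hugging Face; the model id and the CLS-pooling convention follow the public model card, but treat the details as assumptions rather than the exact code behind this project.

```python
import torch
from transformers import AutoTokenizer, AutoModel

# MedCPT query encoder for short texts; the article encoder lives at
# "ncbi/MedCPT-Article-Encoder" and is used the same way for long documents.
model_name = "ncbi/MedCPT-Query-Encoder"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

query = "pain left shoulder, numbness thumb, numbness index finger"

with torch.no_grad():
    inputs = tokenizer(query, truncation=True, padding=True, return_tensors="pt")
    # Use the [CLS] token embedding as the vector for this text.
    embedding = model(**inputs).last_hidden_state[:, 0, :]

print(embedding.shape)  # a single vector, e.g. torch.Size([1, 768])
```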
Okay, so the next piece of the puzzle here. We've talked about the vector database, and we've talked about how the data gets turned into vectors.
We're now going to talk about the large language model.
So for these experiments, I mainly used ChatGPT and GPT-4 as the underlying model here. But because everything that I'm talking about here is open source, I wanted to present an open source alternative so that you could build this whole thing from scratch and run it in a private environment, and that alternative is a biomedical domain large language model.
So one of the more powerful biomedical large language models
that's open source is a model called Meditron.
So Meditron is a large language model that's fine tuned
on a lot of medical domain data. So these can be anything from medical abstracts to patient histories and things like this. So this is where Meditron 70 billion comes in. Meditron 70 billion is a fine-tuned version of Llama 2 from Meta. And what
they did is they basically took around 50 billion tokens,
50 billion word pieces from the medical domain, and these come
from published articles and patient summaries and patient doctor
interactions. So they trained it on these 50 billion tokens
on top of the training that Llama 2 got, and then they tested it on medical reasoning tasks. And this fine-tuned version outperforms the base Llama 2 70 billion, as well as the GPT-3.5 model,
which is significantly larger than the 70 billion parameters that this
Meditron 70 billion has. So this is an open source alternative to GPT-4 that you could potentially use to build this project. And not only was it pretrained on those 50 billion tokens, it was also supervised fine-tuned on medical data as well.
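If you wanted to try the open source route, loading Meditron through Hugging Face transformers could look roughly like this; the repo id and generation settings here are assumptions, and a 70 billion parameter model realistically needs multiple GPUs (or the smaller 7B variant) to run.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

# Assumed Hugging Face repo id for the open source Meditron release.
model_id = "epfl-llm/meditron-70b"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

generator = pipeline("text-generation", model=model, tokenizer=tokenizer)

prompt = (
    "Patient: left shoulder pain with numbness in the thumb and index finger.\n"
    "Plausible diagnoses:"
)
out = generator(prompt, max_new_tokens=200)
print(out[0]["generated_text"])
```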
Okay, so that completes our entire RAG stack, right? With this type of stack,
you now have a data set. You have a vector database that can search over
that data set, and you have a large language model that can take those
relevant patient cases as well as the new patient
case and generate potential diagnoses for you.
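Put together, the whole loop could be sketched in a few lines; the collection name, prompt wording, and the GPT-4 call here are illustrative assumptions rather than the exact pipeline from the talk.

```python
import weaviate
from openai import OpenAI

# 0. Connect to the vector database and the language model.
db = weaviate.connect_to_local()
llm = OpenAI()

patient_info = "My left shoulder hurts and I have numbness in my thumb and index finger."

# 1. Retrieve the most relevant historical patient cases.
cases = db.collections.get("PatientCase").query.near_text(query=patient_info, limit=5)
context = "\n\n".join(str(obj.properties) for obj in cases.objects)

# 2. Ask the language model for diagnoses grounded in, and citing, those cases.
prompt = (
    "Using only the prior patient cases below, propose plausible diagnoses for the new "
    "patient and cite which cases support each one.\n\n"
    f"Prior cases:\n{context}\n\nNew patient: {patient_info}"
)
answer = llm.chat.completions.create(model="gpt-4", messages=[{"role": "user", "content": prompt}])
print(answer.choices[0].message.content)

db.close()
```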
The question now is, can we do even better? How can we improve
on top of this? And so the answer here is that
we need to dive into each one of these techniques that I've talked about,
the vector database, the retrieval, the generation,
and we need to see what the problems are with them and whether we can apply an advanced version to improve the pipeline there. So now I'll propose about three
to four more advanced techniques, and I'll explain what the intuition
behind all of them is. So the first thing we're
going to talk about is a technique called query rewriting.
And the main idea behind query rewriting is
if a new patient comes to you and they describe for you
all of the problems that they're having, you might not know the
best way to search for relevant articles from a vector database,
because you have to write the query yourself, you might not write it appropriately, and you might get irrelevant results from the vector database.
And so the idea here is we want to
rewrite the query optimally for a
vector database to retrieve the most relevant articles. But not only
that, but we also want to rewrite how we prompt the language model to solve
the problem. So we might not trust our ability
to prompt engineer properly. So we
want to rewrite both the query that goes to the vector database to search
our data set, but also rewrite the prompt we give to the language model.
And there are solutions to do both of these steps.
The first solution here is a query rewriting
step. So the query rewriting step allows
you to go in and rewrite a query to the vector database.
And the DSPy framework, which is also open source, allows
you to optimally generate the prompt
that can be given to a language model to ensure that this gives you good
results. So the first thing that we're going to do is
rewrite the query to the vector database. So initially
this was our query that we sent to the vector database. My left shoulder hurts
and I have numbness in my thumb and index finger. This is what the patient
told you. So that's what you try to retrieve articles with. And this is what
that framework looks like, right. We're now going to modify this and
we're going to pass this through a language model.
And the language model's job is to rewrite the query so
that our vector database can understand it better. So maybe
it rewrites this query into this version, so it
kind of chunks it up into smaller sentences and it says
pain left shoulder, numbness thumb, numbness index finger. And so this is less understandable to a human, but maybe this is more understandable to a vector search engine like the Weaviate vector database. So it optimizes the query
to be understood by the vector database. And this is going
to help us retrieve more relevant cases.
Not only that, but we're also going to rewrite the prompt that
we give to our language model. So DSPy is
a framework that allows you to optimize and generate
prompts for a large language model by
iterative search. So we can also use this open source
framework to identify
what the best way to prompt a language model to generate a diagnosis is
as well. Okay, so that's our first
technique. We admit that we don't know how to query
a vector database properly and we don't know how to prompt a language model appropriately.
So we use language models to solve those tasks for us as well.
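As a small illustration of the query rewriting half, here is a sketch using DSPy; the signature fields, the model name, and the exact API calls are assumptions (the DSPy API varies between versions), and DSPy's optimizers can additionally tune the prompt itself against a metric.

```python
import dspy

# Configure DSPy with some underlying language model (model name is an assumption).
dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))

class RewriteQuery(dspy.Signature):
    """Rewrite a patient's free-text complaint into a concise search query
    for a medical vector database."""
    patient_description = dspy.InputField()
    search_query = dspy.OutputField(desc="short, keyword-style query")

rewriter = dspy.ChainOfThought(RewriteQuery)

result = rewriter(
    patient_description="My left shoulder hurts and I have numbness in my thumb and index finger."
)
print(result.search_query)  # e.g. "pain left shoulder, numbness thumb, numbness index finger"
```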
Okay, so the next technique that I'm going to talk about is called hybrid
search. So the idea behind hybrid search is
that if you are searching over medical data,
medical data has a lot of very specific keywords
that you might be interested in. They could be names
of diseases, they could be names of medicine,
they could be chemical compounds. There are very specific keywords that you want to pay attention to and use in the field of medicine. A lot of the search that we've been talking about, how a vector database retrieves and knows which articles out of these millions are relevant for this patient, is based on similarity: how similar is what you tell me about this patient to the patient cases I have here? But for the medical domain,
this might not be the best type of search. You might want to
search over the words in those patient cases, right? So if a
patient was given a particular type of drug and this
patient says that they're also taking that type of drug, then there's a match.
I don't necessarily need to understand exactly what that drug is.
If I can do a simple keyword matching, that might be good enough.
So the idea behind hybrid search is why don't
we mix vector search and keyword search and we do a little bit of both.
And so the idea here is we want to search
not just over the meaning and the problems that this patient is
having, but also the keywords that are used in their description.
So maybe numbness or the type of medication they're on or
index finger, things like this that match well with medical literature
and medical lingo. And so in
hybrid search, you perform both vector search as
well as keyword search, and then you mix the results together
so that you get the best of both worlds and you can re-rank them.
And so if we're talking about how to implement this in Weaviate, it's literally one line of code that you have to change, and you go from doing pure vector search to a hybrid search over vectors and keywords.
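Here is roughly what that one-line change could look like in the Weaviate Python client (v4); the collection name and the alpha weighting are assumptions for illustration.

```python
import weaviate

# Assumes a local Weaviate instance and a populated "PatientCase" collection.
client = weaviate.connect_to_local()
patients = client.collections.get("PatientCase")

# Hybrid search blends vector similarity with BM25 keyword matching;
# alpha=0.5 weights both equally (0 = pure keyword, 1 = pure vector).
response = patients.query.hybrid(
    query="pain left shoulder, numbness thumb, numbness index finger",
    alpha=0.5,
    limit=5,
)

for obj in response.objects:
    print(obj.properties)

client.close()
```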
So the third approach that we're going to use here, the third advanced retrieval augmented generation technique, is called autocut. And the idea behind autocut is that if you do a search and you get irrelevant results from the vector database, then rather than giving those to the language model and confusing it further, you want to cut off those results, right? You just want to throw them away.
And so how you can potentially do this is you retrieve
from the vector database relevant articles, and each article has some sort
of number of how similar this article
is to your patient information.
And then you look over this and you say, okay, the top three returned results are very similar, but the fourth and fifth are very dissimilar. They're very far away compared to the top three
results. So then you automatically cut them out and you never pass it
over to your language model. And so
if you do this automatic cutting of irrelevant results,
it's less likely that your language model gets irrelevant results and
it's less likely that it hallucinates as well. And so you're giving
it better information to set it up for successful diagnostic
generation later on. And so I want to dive into how this actually
works. So let's say you start off with this vector database
query that we talked about, which is now rewritten: pain left shoulder, numbness thumb, numbness index finger. That goes into your vector database, and now let's say the vector database gives you these six patient cases. And not only is it going to give
you these six patient cases, it's also going to give you a number of how
similar these patient cases are to your input patient
case. And each one is going to have a score. The higher
the score, the more similar it is, the better. But notice one thing
about these scores. These top four returned
patient cases are quite similar, right? They have high similarity scores.
But for the fifth and the sixth one here, there's a big jump in similarity. So autocut is going to come
in and just say I'm automatically going to cut these two cases because
they're quite different from these other four cases. And these other
four cases are quite similar to the thing that you're interested in. So that's why
it's called autocut. And again, if you want to implement this,
this is literally just one line of code in your Weaviate vector database, where you say how many chunks of similar things you're interested in. If you're interested in one similar chunk, then it's going to give you one big chunk here of four articles, and you only keep those; the other two are disposed of in this example.
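In the v4 Python client this corresponds to the auto_limit parameter; a sketch, again with an assumed collection name:

```python
import weaviate

# Assumes a local Weaviate instance and a populated "PatientCase" collection.
client = weaviate.connect_to_local()
patients = client.collections.get("PatientCase")

# auto_limit=1 keeps only the first group of results whose scores sit close together,
# cutting everything after the first big jump in similarity.
response = patients.query.near_text(
    query="pain left shoulder, numbness thumb, numbness index finger",
    auto_limit=1,
)

for obj in response.objects:
    print(obj.properties)

client.close()
```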
Okay, and so the next thing that I'm going to talk about is called re-ranking. And the idea behind re-ranking is that you've gone through this search process and you've retrieved the most
similar other patient cases to this new patient. What you
want to do now is have a closer look into
all of these other patient cases that have been retrieved and
compare them one at a time to the input patient information and say,
does this really deserve the top spot? Should I re-rank it to be lower, or should I re-rank something that's at the bottom to be
a little higher? So this is the time where you get to spend more compute and more time comparing the new patient to all of the
retrieved patients that the vector database spat
out. And so what you do to make this
successful is initially you tell the vector database,
instead of just giving me the top three most similar matching cases,
give me 20, 30, 40 cases that you think
are similar. And then you
compare this new patient to each one of those 20,
30, 40 cases that the vector database thinks are similar.
But now you use a much more powerful and heavier machine
learning model to re-rank them based on what it thinks
of these 30, 40 articles compared to this patient
information. And so this again increases
the quality of the output that we're going to eventually
pass to our large language model. So let's have a look at how this actually
works. We go into our vector database,
we take that query that we've already been passing into the vector database and
now let's say we do this over-fetching.
We ask it to retrieve for us the top ten or 15
most similar patient cases. So it gives us this long laundry
list of similar patient cases along with the similarity scores for
all of these patients.
So now what we do is we ignore these similarity
scores and we say we're going to go
to another more powerful model. And the job of that model
is to see how similar each one of these patient
cases is to the original query. And it has the
opportunity to re-rank, and kind of promote or demote some of these patient cases if it thinks that they're more or less similar, using its more advanced re-ranking and search functionality.
So you get this re-ranking step where now, for maybe the third most similar patient case, this more powerful model thought: oh, that is actually a lot more similar. So it's going to re-rank it to be number one, right? It's going to take the first one and down-rank it to spot three over here, and it's going to promote number five here to number two. So now the idea is that these re-ranked patient cases are a lot more relevant to the initial query than the initial retrieval order suggested.
And so the new score is a lot more reliable because a more powerful model
generated this score for you. And now we can take
these results, maybe the top five or top six here, and give them to a
language model. So now it has a better quality of patient cases that are
relevant to this new patient and so it can generate better diagnoses.
And so to implement this in a vector database like Weaviate, this is also just one line of code, where you say: I want to re-rank based on a particular property, and the query is a string you pass in. It's going to take everything that was returned from the semantic near-text search, re-rank it, and then pass it out to the language model.
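A sketch of that call with the v4 Python client, assuming a "summary" text property and a reranker module enabled on the Weaviate instance; names are illustrative.

```python
import weaviate
from weaviate.classes.query import Rerank

# Assumes a local Weaviate instance, a "PatientCase" collection with a "summary"
# property, and a reranker module configured on the cluster.
client = weaviate.connect_to_local()
patients = client.collections.get("PatientCase")

query = "pain left shoulder, numbness thumb, numbness index finger"

# Over-fetch 20 candidates with vector search, then let the heavier model reorder them.
response = patients.query.near_text(
    query=query,
    limit=20,
    rerank=Rerank(prop="summary", query=query),
)

for obj in response.objects[:5]:  # keep only the top five after re-ranking
    print(obj.properties)

client.close()
```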
Okay, so we talked about the retrieval augmented generation stack, the RAG doctor approach, and then I proposed four more advanced retrieval augmented generation techniques that you can use to improve the quality of the super doctor we've created here. And so let's do a quick summary of everything that we've accomplished.
We've basically created an AI based super doctor assistant.
This super doctor assistant has access to more knowledge than pretty much any human doctor has experience with, so it can retrieve from more than any human doctor can. It can also
propose plausible diagnoses, and not only that, but it can also source
citations. It can tell you why it's proposing
somebody has a viral infection or somebody has
heart disease based on historical patient cases that it's
read and reasoned over.
And so it can learn from previous patterns in
patient health and use those patterns to propose future
diagnoses for new patients. And to wrap it all up, it does this in real time.
So it does this in a few seconds because the vector
database search component takes a few milliseconds, and the
generation portion takes maybe half a second to a second. And so this can be done in real time.
You can literally have a patient walk into your office, the doctor
takes their symptoms, understands them, passes them off
into the super doctor chatbot assistant. The super doctor proposes some diagnoses, the doctor thinks about some diagnoses, and then these diagnoses come together to give a much more well informed diagnosis for that patient. And so that brings us full circle. We've talked about the entire stack of
how you would build this technology. And the plus point
here is that everything that I've talked about in this entire talk
is open source. The language model, the vector database,
the retriever, the re-ranker, the autocut. Every single thing
I've linked and sourced as well is fully open source.
So if you wanted, you could build this tomorrow as well. All right,
so I wanted to thank everybody here. I was really
excited to give this talk. If you're interested in this, check us out.
There's a QR code that you can use here as well. And if
you have any questions, feel free to connect with me either on
Twitter or LinkedIn. I would be more than happy to
talk to all of you. I'm also active on the Weaviate community Slack, so join us there,
drop by. Thank you very much and I'll leave you
guys with this last QR code here. So if you want to try
this out, give it
a go. Weaviate is open source. We also provide free
hosted services as well. Thanks everybody and hope you're
enjoying the conference.