Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hello, everyone. My name is Ofer Mendelevitch, and I head developer relations at Vectara. Today I'm going to talk about measuring hallucinations in RAG, or retrieval augmented generation.
A little bit about myself. I've been with Vectara for about a year, and I had the opportunity to work on LLMs fairly early on, since the days of GPT-2. It's been an incredible journey for me to see how this technology has evolved to become so useful and help us be more productive. And, you know, I truly believe what's stated on this slide, which is that the LLM and generative AI revolution in general is really important. Within five years, we'll see a transformation of all applications, from consumer to enterprise, and every piece of knowledge we acquire will have the ability to be surfaced through this generative AI interface. So we'll be able to interact with computers in a way that's very different from what we do today.
To me, this is a little bit like the transformation we saw with the iPhone when it came out: a very different user interface. You can swipe, you can use your hands and your fingers instead of the keyboard and the mouse. It's that level of transformation. Now, as I interact with Vectara's customers, I see a lot of different use cases, and I want to share some of those with you. We have use cases around chatbots, which are very popular; for example, for customer support you can put up a chatbot based on LLMs to answer customer questions. There are a lot of question answering applications that are very useful, and I'll show a couple of examples here today. Product recommendations, again using the latest in LLM and NLP capabilities to build recommendation engines. Semantic search, moving away from traditional keyword search toward a better search experience. Workplace search, and many others.
Now, one of the problems with LLMs, at least today, is that they still hallucinate. What that means is that the LLM can give you a response that looks very authentic and very convincing, but is actually wrong. And this is one of my favorite examples. Did Will Smith ever hit anyone? You ask GPT-3.5 that, and it gives you this response: no, Will Smith is a decent guy, no known assault incidents, etcetera. And of course, we all know that's wrong, because of what really happened at the Oscars about two years ago. So that's an example of a hallucination, and there are a lot of them. The question is, how can we avoid, or at least reduce, the amount of hallucinations to make the end application much better for the user?
One of the ways you address hallucination is with RAG, and that's what made RAG so popular. RAG stands for retrieval augmented generation. Let me walk you through how that works at a really high level. The idea behind RAG is that you augment the information the LLM has with some other information. It could be other public information, but in an enterprise context it is often private information that only exists within the firewall of the organization. If an LLM normally takes a user query, thinks about it for a while, and gives you a response based only on its internal knowledge, then with retrieval augmented generation the LLM holds on for a second and asks a state-of-the-art retrieval engine to look at the data you provided and come up with relevant pieces of text, or chunks, or facts, that the LLM can use to augment its internal knowledge and answer more accurately. Use cases for that include question answering and chatbots, like I mentioned earlier, and it's become a very common and very useful application in the enterprise setting.
Now, I apologize for this busy slide, but I wanted to share with you a little bit of how RAG is built when you actually want to build it yourself and do all the steps on your own. The blue arrow here walks through the data ingest flow. Initially you have some data, the data I described earlier. It could be in a database like Microsoft SQL Server, or in AWS Redshift, Snowflake, or Databricks. It could also come from enterprise applications like Jira or Notion. And very often it's just a bunch of files: PDFs, PowerPoint, or documents of different kinds sitting on S3 or on another platform like Box or Dropbox. You ingest that data into the system. The first thing you need to do is take the document in its original form, let's say a PDF, and extract the text from it, turning it from binary into text. That text can be really long, so very commonly you chunk it into smaller chunks. A chunk could be a page, or three paragraphs, or two sentences; there are a lot of different strategies around that, and I encourage you to read more about this if you're building it yourself, because the way you chunk text actually impacts performance pretty significantly. By the way, before I move on, I want to mention that I'm naming a couple of different vendors or product names you can use for each of these steps. That's just a small list, it's not comprehensive; I just wanted to mention a couple of options for everybody.
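To make the chunking step concrete, here is a minimal sketch of a naive chunker in Python. It simply packs sentences into chunks of at most a fixed number of characters; the sentence-splitting regex, the size limit, and the function name are all illustrative assumptions, and real pipelines often use more sophisticated strategies (by page, by paragraph, with overlap, and so on).

```python
import re

def chunk_text(text: str, max_chars: int = 1000) -> list[str]:
    """Naive chunker: split into sentences, then pack sentences
    into chunks of at most max_chars characters."""
    # Very rough sentence split; real pipelines use better tokenizers.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        if current and len(current) + len(sentence) + 1 > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        chunks.append(current)
    return chunks

# Example: a long document becomes a list of smaller chunks.
# chunks = chunk_text(open("my_doc.txt").read(), max_chars=800)
```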
Once you finish with the chunking, you embed each chunk. What does embedding mean? It's a model, a different model than your GPT-4, called an embedding model. It takes the text and translates it into a vector of numbers; think a thousand floats. In this embedding vector space, that vector represents the semantic meaning of the text, and it's going to be used for neural search later on. You take this vector and put it in something called a vector database or a vector store, which knows how to handle these vectors and search over them really well. Again, there are many, many options here; I'm mentioning just a few.
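As a rough illustration of the embedding step, here is a small sketch using the sentence-transformers library. The specific model name is just a common open-source choice, not necessarily what any particular vendor uses, and a plain Python list stands in for a real vector database.

```python
from sentence_transformers import SentenceTransformer

# A small, widely used open-source embedding model (illustrative choice).
model = SentenceTransformer("all-MiniLM-L6-v2")

chunks = [
    "Vectara provides RAG as a service behind an API.",
    "Chunking strategy has a big impact on retrieval quality.",
]

# Each chunk becomes a fixed-size vector of floats (384 dimensions for this model).
vectors = model.encode(chunks)

# A stand-in for a vector store: keep (vector, text) pairs together.
index = list(zip(vectors, chunks))
print(vectors.shape)  # (2, 384)
```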
Okay, so now that you have the text and the vectors in place, you're ready to do the actual search. Let's go through the user query journey. There's some user interface, some application where the user has a box to enter their query. The query also gets embedded, so there's a vector representing what the query is and what its semantic intent is. Then you run this against the retrieval engine. The retrieval engine looks at the vector store, retrieves the most relevant matches from what was indexed before, and brings the text back as the facts, or the candidates. Those get integrated into a prompt that essentially says something like: hey, here's a user query and here are some facts that can help you address it; please respond to this query in the best way possible given these facts. You send that to an LLM like GPT-4, Anthropic's Claude, Llama 2, or anything else, and then the response gets sent back to the user.
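Here is a minimal sketch of that query-side flow, continuing the toy index from the embedding example above. The cosine-similarity retrieval, the prompt wording, and the call_llm helper are all illustrative assumptions; in practice you'd use a real vector database and whichever LLM API you've chosen.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

def retrieve(query: str, index: list[tuple[np.ndarray, str]], top_k: int = 3) -> list[str]:
    """Return the top_k most similar chunks by cosine similarity."""
    q = model.encode(query)
    scored = sorted(
        index,
        key=lambda pair: float(
            np.dot(q, pair[0]) / (np.linalg.norm(q) * np.linalg.norm(pair[0]))
        ),
        reverse=True,
    )
    return [text for _, text in scored[:top_k]]

def build_prompt(query: str, facts: list[str]) -> str:
    """Assemble the facts and the user query into a single prompt."""
    fact_list = "\n".join(f"- {fact}" for fact in facts)
    return (
        "Here is a user query and some facts that can help you address it.\n"
        f"Facts:\n{fact_list}\n\n"
        f"Query: {query}\n"
        "Please respond to the query using only these facts."
    )

# facts = retrieve("Should AI be regulated?", index)
# answer = call_llm(build_prompt("Should AI be regulated?", facts))  # call_llm is hypothetical
```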
There's also an option here: you can look at the response, especially in the enterprise context, and sometimes people use products like guardrails that essentially make sure inappropriate content does not get shown back to the user. Now, I didn't mention the red arrow much, but the red arrow represents action. What I mean by that is that sometimes in the application you don't just show the response to the user, you also do something with it: you want to open a Jira ticket with this information, you want to send it in an email, etcetera. Those are all options you have at the end of this process.
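As a toy illustration of that response-checking step, here is a minimal post-generation gate. A real deployment would use a dedicated guardrails or moderation product; the banned-terms list and the gate function here are purely illustrative placeholders.

```python
BANNED_TERMS = {"ssn", "credit card number"}  # illustrative placeholder policy

def gate_response(response: str) -> str:
    """Block responses containing terms the application never wants to surface."""
    lowered = response.lower()
    if any(term in lowered for term in BANNED_TERMS):
        return "Sorry, I can't share that information."
    return response

# answer = gate_response(llm_response)  # run before showing anything to the user
```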
All right, so this is how do-it-yourself RAG works. As you can see, it's quite complex: there are a lot of steps you have to take and a lot of systems you have to set up. There's a cost to each of these systems, you need your DevOps and your machine learning engineers, and you have to maintain all of it. And especially when you go from one or two or ten documents to a million documents in a real enterprise-scale application, it becomes quite difficult to do this. So that's why at Vectara we've created RAG as a service. What we mean by that is we've taken all the complexity and put it in a box, behind an API. All you have to do with Vectara is index the text or the documents you want; we'll do all the extraction and the chunking and the vector store and everything I've just shown you. And then you call the query API, which does all the matching and retrieval and gives you back the response. This makes building RAG applications very easy and very fast. It's robust, it can scale up and down, it's secure, it's got all the encryption and everything you need for the enterprise, and you don't have to do it yourself. That is actually really helpful: you can build applications faster and more robustly, and move them from an MVP or POC stage into production really quickly. So that's what Vectara does.
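To give a feel for what "index, then query, behind an API" looks like as a pattern, here is a rough sketch of those two calls. The endpoint paths, headers, and payload fields below are purely illustrative placeholders, not Vectara's actual API; check the official documentation for the real request format.

```python
import requests

API_BASE = "https://api.example-rag-service.com"   # placeholder, not a real endpoint
HEADERS = {"x-api-key": "YOUR_API_KEY"}            # placeholder auth scheme

# Step 1: index a document; the service handles extraction, chunking, and embedding.
requests.post(
    f"{API_BASE}/index",
    headers=HEADERS,
    json={"corpus_id": 1, "document": {"id": "doc-1", "text": "Full document text here..."}},
)

# Step 2: query; the service retrieves relevant chunks and generates a grounded answer.
response = requests.post(
    f"{API_BASE}/query",
    headers=HEADERS,
    json={"corpus_id": 1, "query": "Should AI be regulated?", "num_results": 5},
)
print(response.json())
```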
And again, to recap, why is retrieval augmented generation useful? Well, you augment the LLM with your own data. If you have private data, which most enterprises do, then ChatGPT would not know about this data; that's the main reason you start. But it also reduces the likelihood of hallucination: the amount of hallucinations is smaller simply because you give the model the right facts to base its response on, so this retrieval step is really key. RAG outputs are also explainable, and what I mean by that is they come with citations, which increases user trust; we'll see that in a demo. The information stays private, and you don't need to train anything in RAG. You haven't seen any training or fine-tuning step in the architecture, so the information is safe and doesn't leak into any future LLM.
And then lastly, and this is one of my favorite reasons to use RAG, it allows you to do per-person permissioning or access control. For example, if some of my documents are from the HR department and I still want to use them in RAG, but only for HR people or people who are allowed to see the results, I can ask the retrieval engine in Vectara not to include those documents in the set of facts it retrieves unless the person issuing the query has permission for them. That allows you to create responses that are customized to a certain level of permission, which is actually really, really helpful.
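One simple way to picture this kind of access control is to tag each chunk with metadata at indexing time and filter retrieved candidates against the user's groups before they ever reach the prompt. This sketch is a generic illustration of the idea, not Vectara's actual filtering mechanism.

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    text: str
    allowed_groups: set[str]  # metadata attached at indexing time

def filter_by_permission(candidates: list[Chunk], user_groups: set[str]) -> list[Chunk]:
    """Keep only chunks the querying user is allowed to see."""
    return [c for c in candidates if c.allowed_groups & user_groups]

candidates = [
    Chunk("Company holiday schedule for 2024.", {"everyone"}),
    Chunk("Salary bands by level.", {"hr"}),
]

# A user outside HR only gets the public chunk as a fact for the LLM.
visible = filter_by_permission(candidates, user_groups={"everyone", "engineering"})
print([c.text for c in visible])  # ['Company holiday schedule for 2024.']
```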
Okay, so why Vectara? Just to recap: building RAG is more complex than it seems, for a lot of the reasons I mentioned. Doing retrieval in a robust way is usually more complex than you think. Supporting multiple languages is hard. With Vectara, you don't have to worry about a lot of expertise that's very specific to the LLM space, like prompt engineering, machine learning operations, etcetera. We handle citations very well, and everything is ready for enterprise scale. Furthermore, security, privacy, and permissioning are all taken care of by our platform, and you get a lower total cost of ownership when you use our platform than if you build it yourself.
One other thing I wanted to highlight is HHEM, our Hallucination Evaluation Model. This is a model that is very easy to use. It's open source, you can download it, and it allows you to take a set of facts and a response from an LLM and detect whether the response is a hallucination or not. What we see here is the leaderboard that ranks different LLMs by their hallucination rate. It's actually really useful to know, first of all, that there are differences, and then what those differences are. So that's HHEM.
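As a rough sketch of scoring a response with HHEM locally, here is one way to call the open-source model from Hugging Face. The cross-encoder loading interface shown is an assumption based on how the model card presented it at the time; exact usage may differ between model versions, so check the model card.

```python
from sentence_transformers import CrossEncoder

# Open-source hallucination evaluation model published by Vectara on Hugging Face.
model = CrossEncoder("vectara/hallucination_evaluation_model")

facts = "The Eiffel Tower is located in Paris and was completed in 1889."
response = "The Eiffel Tower, finished in 1889, stands in Paris."

# The model scores (facts, response) pairs; scores near 1 mean the response
# is consistent with the facts, while scores near 0 suggest hallucination.
score = model.predict([[facts, response]])
print(score)
```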
And again, to summarize how you build a RAG application with Vectara: first, sign up for a free account. Then you need to ingest some data, and there are a lot of different ways to do that. You can, of course, use our APIs directly; there's a standard indexing API and a file upload API, and you can also upload files from our console, which you get access to once you have an account. There are also other ways: you can use vectara-ingest, which is an open source project we created to help you with ingestion and indexing of data, including a few cool crawlers that crawl the data for you. And then there are integrations we have with companies like Airbyte and Unstructured.io that can also be used for no-code ingestion. So take a look at those tools.
Once you have the data there, you can build the UI on your own using the query API, pointing it to the corpus and running queries. Or you can use some of the tools we have available. We have an open source project called vectara-answer that can help you build question answering apps. There's create-ui, which allows you to build a whole application end to end in Node and JavaScript, and then React-Search and React-Chatbot, which are components you can use in your React application to simplify some of this building process. So I encourage you to take a look at those and build your app with them. Now let me show you some of the apps we've built, just to demonstrate how to use this. This is an example called Ask News.
Let me click on this and go to the actual application. Here we've crawled, using vectara-ingest, a bunch of news sources: BBC, NPR, CNN, et cetera. And as you can see, this crawling happens every day; it picks up the new news articles, crawls their content, and adds them to this corpus. Now, when I run a query, let's say "Should AI be regulated?", you can see that it does the retrieval really quickly and gives you a response here answering the question. Not only that, it has, as I said earlier, these citations. You can click on one of these citations and see that this part of the answer came from this article, based on this information. You can actually click through to that URL, see where it came from, and investigate further. So this gives a lot of trust, and that's very useful.
I also wanted to mention that we have an option here to use different languages. For example, I can try to get the answer in German. Of course, I don't speak German, so I won't be able to tell you if it's correct or not, but you can see that the answer comes back in German, which is really helpful. And this is happening even though all the source text is in English, so it knows how to match between languages really well.
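To illustrate the idea of matching across languages, here is a small sketch using a multilingual embedding model: an English passage and a German query land close together in the shared vector space. The model choice is just a common open-source option, not necessarily what Vectara uses internally.

```python
from sentence_transformers import SentenceTransformer, util

# A multilingual embedding model that maps many languages into one vector space.
model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

english_passage = "Governments are debating new regulations for artificial intelligence."
german_query = "Sollte künstliche Intelligenz reguliert werden?"

emb = model.encode([english_passage, german_query])
# High cosine similarity despite the language mismatch.
print(util.cos_sim(emb[0], emb[1]))
```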
So that's an example of a question answering application.
The next one I want to show you is actually the same application, Ask News, but now using HHEM. I created a little demo of how you could use it, although there are many other ways. So this is Ask News, but if I ask the same question, what you see happening here is that the response is generated in the same way, and then after it gets generated there's an evaluation of the confidence using HHEM. This little step runs HHEM, in this case through Hugging Face inference, and generates an evaluation. In this case it shows high confidence, which means this response is not a hallucination relative to the facts. So this is one way you can use HHEM on your own in your application to do that.
So moving on: that was question answering, but I also mentioned chatbots quite a bit, so let's look at a chatbot example. Here's a chatbot hosted on Hugging Face, again built with the Vectara APIs. What we did here is create another corpus, crawl about 100 to 150 pages from the IRS website, and put them in that corpus. Now I can ask some questions about it. For example, I can go in and ask: is my college tuition tax deductible? Again, it will go into the corpus and try to answer this question based on the information I crawled from the website. Full disclosure and warning: please don't use this demo for anything other than demo purposes, and use your tax advisor to file your taxes. I just have to say that; it's just meant to show a demo. Okay, but again, you get this answer.
And the nice thing about the chatbot is that you can then ask a follow-up question. For example, here it said college tuition and related expenses may be tax deductible under certain conditions, so I can ask: what conditions would make it tax deductible? The idea is that it knows "it" probably refers to college tuition, right? It has the context of the previous question and the previous answer, so it really answers like a chatbot, and you can see that it knows that already. So this is a chatbot. I also want to emphasize again that this is all open source. If you go to this particular page, you can actually see the files and all the code, including how we run the query and the whole application. So feel free to use that as a reference to build your own app if you like.
And with that, thank you for listening. I wanted to highlight a few other things here on my final slide. First, again, I encourage you to sign up for our free account. It's actually pretty generous: it lets you get started with up to 50 megabytes of text and 15,000 queries a month, which is quite a bit to try it out. We have a lot of resources for you: our documentation, which is pretty thorough; a Discord channel for the community, which you can join to ask questions of fellow developers who build with Vectara, and where a lot of us at Vectara are around all the time to answer questions; and a GitHub where you can find the open source projects I mentioned, like React-Search, vectara-ingest, vectara-answer, etcetera. We also have a set of example notebooks. This one, for example, shows how to use Vectara with LlamaIndex, but there are others you can look at in that repository. And if you're a startup, I encourage you to take a look at our startups program. It's a very good way to get started with Vectara while getting additional support in the form of credits, customer support, and other things; really a good way to get started if you want to use Vectara to power your product. And that's it. Thanks for listening again, and I hope you have a good rest of your Conf42 conference.