Transcript
            
            
              This transcript was autogenerated. To make changes, submit a PR.
            
            
            
            
Hello, everyone. My name is Ofer Mendelevitch, and I head developer relations at Vectara. Today I'm going to talk about measuring hallucinations in RAG, or retrieval-augmented generation.
            
            
            
A little bit about myself: I've been with Vectara for about a year, and I had the opportunity to work on LLMs early on, since the days of GPT-2. It's been an incredible journey for me to see how this technology has evolved to become so useful and help us be more productive.
            
            
            
And I truly believe what's stated on this slide, which is that the LLM and generative AI revolution in general is really important. Within five years, we'll see a transformation of all applications, from consumer to enterprise, and every piece of knowledge we acquire will be available through this generative AI interface. So we'll be able to interact with computers in a way that's very different from what we do today.
            
            
            
To me, this is a little bit like the transformation we saw when the iPhone came out: a very different user interface. You can swipe, you can use your hands and your fingers instead of the keyboard and the mouse. It's that level of transformation.
            
            
            
Now, as I interact with customers of Vectara, I see a lot of different use cases, and I want to share some of those with you. Chatbots are very popular: for customer support, for example, you can put up a chatbot based on LLMs to answer customer questions. There are a lot of question-answering applications that are very useful, and I'll show a couple of examples here today. There are product recommendations, using the latest LLM and NLP capabilities to build recommendation engines. And there's semantic search, moving away from traditional keyword search toward a better search experience, workplace search, and many others.
            
            
            
Now, one of the problems with LLMs, at least today, is that they still hallucinate. A hallucination is when the LLM gives you a response that looks very authentic and very convincing, but is actually wrong. This is one of my favorite examples: "Did Will Smith ever hit anyone?" You ask GPT-3.5 that, and it gives you this response: no, Will Smith is a decent guy, no known assault incidents, etcetera. And of course, we all know that's wrong, because of what really happened at the Oscars about two years ago. So that's an example of a hallucination, and there are a lot of them. The question is: how can we avoid, or at least reduce, hallucinations to make the end application much better for the user?
            
            
            
One of the ways you address hallucination is with RAG, and that's what made RAG so popular. RAG stands for retrieval-augmented generation; let me walk you through how it works at a really high level. The idea behind RAG is that you augment the information the LLM has with other information. It could be other public information, but in an enterprise context it's often private information that only exists within the firewall of your organization. Normally, an LLM takes a user query, thinks about it for a while, and gives you a response based only on its internal knowledge.
            
            
            
With retrieval-augmented generation, the LLM instead holds off for a second and asks a state-of-the-art retrieval engine to look at the data you provided and come up with relevant pieces of text, chunks, or facts that the LLM can use to augment its internal knowledge and answer more accurately. Use cases for that include question answering and chatbots, like I mentioned earlier, and it has become a very common and very useful kind of application in the enterprise setting.
            
            
            
Now, I apologize for this busy slide, but I wanted to share a little bit of how RAG is built when you actually want to build it yourself and do all the steps on your own. The blue arrow here walks through the data-ingest flow. Initially you have some data, the data I described earlier. It could be in a database like Microsoft SQL Server, or in AWS Redshift, Snowflake, or Databricks. It could also come from enterprise applications like Jira or Notion. And very often it's just a bunch of files: PDFs, PowerPoints, or documents of different kinds sitting on S3 or on another platform like Box or Dropbox.
            
            
            
You then ingest this data into the system. The first thing you need to do is take the document in its original form, let's say a PDF, and extract the text from it, turning it from binary into text. That text can be really long, so very commonly you chunk it into smaller chunks. A chunk could be a page, or three paragraphs, or two sentences; there are a lot of different strategies around that, and I encourage you to read more about them if you're building this, because the way you chunk the text actually impacts performance pretty significantly. A minimal sketch of one simple strategy follows.
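As an illustration (this sketch is mine, not from the talk), a fixed-size chunker with overlap could look like this, assuming the text has already been extracted:

```python
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split extracted text into overlapping, character-based chunks.

    A toy strategy: real pipelines often split on sentence, paragraph,
    or page boundaries instead of raw character counts.
    """
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap
    return chunks
```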
            
            
            
By the way, before I move on, I want to mention that I'm naming a couple of different vendors and products you can use for each of these steps. That's just a small list, not a comprehensive one; I just wanted to mention a couple of options.
            
            
            
Once you finish with the chunking, you embed each chunk. What does embedding mean? There's a model, a different model from your GPT-4, called an embedding model. It takes the text and translates it into a vector of numbers, think a thousand floats. In the embedding vector space, that vector represents the semantic meaning of the text, and it's what gets used for neural search later on. You then take the vector and put it in something called a vector database or vector store, which knows how to handle these vectors and search over them really well. Again, there are many, many options here, and I'm mentioning just a few. A sketch of this embed-and-index step appears below.
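To make this concrete (my sketch, not Vectara's internals; the model name and FAISS are just example choices):

```python
import faiss  # one of many vector-store options
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # example embedding model

chunks = chunk_text(open("doc.txt").read())  # "doc.txt" is a placeholder
vectors = model.encode(chunks, normalize_embeddings=True)  # one vector per chunk

# With normalized vectors, inner product equals cosine similarity.
index = faiss.IndexFlatIP(vectors.shape[1])
index.add(vectors)
```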
            
            
            
Okay, so now that you have the text and the vectors, you're ready to do the actual search. Let's go through the user's query journey. There's some user interface, some application where the user has a box to enter their query. You enter the query, and the query also gets embedded, so there's a vector representing what the query is and what its semantic intent is. Then you run this against the retrieval engine, which looks at the vector store, retrieves the most relevant matches from what was indexed before, and brings the text back as the facts, or candidates. Those get integrated into a prompt that essentially says something like: "Hey, here's a user query and here are some facts that can help you address it. Please respond to this query in the best way possible given these facts." You send that to an LLM, GPT-4, Anthropic's Claude, Llama 2, or anything else, and then the response gets sent back to the user. A sketch of this retrieve-then-prompt step follows.
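Continuing the illustrative sketch from above (`call_llm` is a placeholder, and the prompt wording is mine, not Vectara's actual template):

```python
def answer(query: str, k: int = 3) -> str:
    # Embed the query into the same vector space as the chunks.
    q_vec = model.encode([query], normalize_embeddings=True)
    _, ids = index.search(q_vec, k)  # nearest-neighbor retrieval
    facts = [chunks[i] for i in ids[0]]

    # Integrate the retrieved facts into a grounded prompt.
    prompt = (
        "Here is a user query and some facts that can help you address it.\n"
        "Please respond in the best way possible given these facts.\n\n"
        "Facts:\n" + "\n".join(f"- {f}" for f in facts)
        + f"\n\nQuery: {query}"
    )
    return call_llm(prompt)  # stand-in for GPT-4, Claude, Llama 2, etc.
```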
            
            
            
There's also an option here: you can inspect the response, especially in an enterprise context. People sometimes use products like Guardrails that essentially make sure inappropriate content doesn't get shown back to the user. Now, I haven't mentioned the red arrow much, but it represents action. What I mean by that is that sometimes the application doesn't just show the response to the user; it also does something with it. You might want to open a Jira ticket with this information, send it in an email, etcetera. Those are all options you have at the end of this process.
            
            
            
All right, so this is how do-it-yourself RAG works. As you can see, it's quite complex: there are a lot of steps to take and a lot of systems to set up. There's a cost to each of these systems; you have to have your DevOps and machine learning engineers, and you have to maintain everything. And especially when you go from one or two or ten documents to a million documents in a real enterprise-scale application, it becomes quite difficult to do this.
            
            
            
That's why at Vectara we've created RAG as a service. What we mean by that is that we've taken all the complexity and put it in a box, behind an API. All you have to do with Vectara is index the text or documents you want; we do all the extraction, the chunking, the vector store, and everything else I've just shown you. Then you call the query API, which does all the matching and retrieval and gives you back the response.
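As a rough sketch of what such a call can look like (the exact request shape varies across API versions, so treat these field names as assumptions and check the current Vectara docs):

```python
import requests

# Placeholder credentials and corpus ID; substitute your own.
resp = requests.post(
    "https://api.vectara.io/v1/query",
    headers={"x-api-key": "YOUR_API_KEY", "customer-id": "YOUR_CUSTOMER_ID"},
    json={
        "query": [{
            "query": "Should AI be regulated?",
            "numResults": 10,
            "corpusKey": [{"corpusId": 1}],
            # Request a generated summary with citations, in English.
            "summary": [{"maxSummarizedResults": 5, "responseLang": "en"}],
        }]
    },
)
print(resp.json())
```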
            
            
            
This makes building RAG applications very easy and very fast. It's robust, it can scale up and down, it's secure, and it has all the encryption and everything else you need for the enterprise, so you don't have to build it yourself. That's actually really helpful: you can build applications faster and more robustly, and move them from an MVP or POC stage into production really quickly. So that's what Vectara does.
            
            
            
And again, to recap, why is retrieval-augmented generation useful? Well, you augment the LLM with your own data. If you have private data, which most enterprises do, then ChatGPT would not know about it; that's the main reason you start. It also reduces the likelihood of hallucination: the number of hallucinations is smaller simply because you give the model the right facts to base its response on, so the retrieval step is really key. RAG outputs are also explainable; what I mean by that is they come with citations, which increases user trust, and we'll see that in a demo. The information stays private: you haven't seen any training or fine-tuning step in the architecture, so your data is safe and doesn't leak into any future LLM.
            
            
            
And lastly, and this is one of my favorite reasons to use RAG, it allows you to do per-person permissioning, or access control. For example, if some of my documents come from the HR department and I still want to use them in RAG, but only for HR people or others who are allowed to see the results, I can ask the retrieval engine in Vectara not to include documents in the set of facts it retrieves unless the person issuing the query has permission for them. That allows you to create responses customized to a certain permission level, which is really, really helpful.
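Conceptually, it works like the toy filter below (my illustration; in Vectara you would express this with metadata filters at query time, and the attribute names here are made up):

```python
def permitted_facts(results: list[dict], user_groups: set[str]) -> list[dict]:
    """Keep only retrieved chunks the querying user is allowed to see.

    Each result is assumed to carry a 'metadata' dict with an
    'allowed_group' attribute set at indexing time.
    """
    allowed = user_groups | {"everyone"}
    return [r for r in results
            if r["metadata"].get("allowed_group", "everyone") in allowed]
```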
            
            
            
Okay, so why Vectara? Just to recap: building RAG is more complex than it seems, for a lot of the reasons I mentioned. Doing retrieval in a robust way is usually harder than you think, and supporting multiple languages is hard. With Vectara, you don't have to worry about a lot of expertise that's very specific to the LLM space, like prompt engineering and machine learning operations. We handle citations very well, and everything is ready for enterprise scale. Furthermore, security, privacy, and permissioning are all taken care of by our platform, and you get a lower total cost of ownership than if you build it yourself.
            
            
            
One other thing I wanted to highlight is HHEM, the Hughes Hallucination Evaluation Model. This model is very easy to use. It's open source, you can download it, and it lets you take a set of facts and a response from an LLM and detect whether the response hallucinates or not. What we see here is the leaderboard that ranks different LLMs by their hallucination rate. It's actually really useful to know, first of all, that there are differences, and then what those differences are. So that's HHEM.
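To show how easy it is to use (this is the usage I recall from the Hugging Face model card; check the card for the current interface):

```python
from sentence_transformers import CrossEncoder

# HHEM scores (facts, response) pairs: a score near 1.0 means the
# response is consistent with the facts; near 0.0 means hallucinated.
model = CrossEncoder("vectara/hallucination_evaluation_model")
scores = model.predict([
    ["Will Smith slapped Chris Rock at the 2022 Oscars.",  # facts
     "Will Smith has no known assault incidents."],        # LLM response
])
print(scores)  # a low score flags a likely hallucination
```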
            
            
            
And again, to summarize how you build a RAG application with Vectara: first, sign up for a free account. Then you need to ingest some data, and there are a lot of different ways to do that. You can, of course, use our APIs directly: there's a standard indexing API and a file-upload API. You can also upload files from our console, which you get access to once you have an account. Another way is vectara-ingest, an open-source project we created to help you with ingestion and indexing of data, including a few cool crawlers that crawl the data for you. And then there are integrations we have with companies like Airbyte and Unstructured.io that can also be used for more of a no-code ingestion. So take a look at those tools.
            
            
            
Once you have the data in there, you can build the UI on your own using the query API: point it at the corpus and run queries. Or you can use some of the tools we have available. We have an open-source project called vectara-answer that can help you build question-answering apps. There's create-ui, which lets you build a whole application end to end in Node.js, and then React-Search and React-Chatbot, components you can use in your React application to simplify some of the building process. So I encourage you to take a look at those and build your app with them.
            
            
            
So now let me show you some of the apps we've built, just to demonstrate how to use this. This is an example called Ask News. Let me click on it and go to the actual application. Here, using vectara-ingest, we've crawled a bunch of news sources: BBC, NPR, CNN, et cetera. This crawling happens every day; it picks up the new articles, crawls their content, and adds them to the corpus. Now, when I run a query, let's say "Should AI be regulated?", you can see that it does the retrieval really quickly and gives you a response that answers the question. Not only that, it has, as I said earlier, these citations. You can click on one of them and see that this part of the answer came from this article, and you can click through to the URL to see where it came from and investigate further. That builds a lot of trust, which is very useful. I also wanted to mention that we have an option here to use different languages. For example, I can ask for the answer in German. Of course, I don't speak German, so I can't tell you whether it's correct, but you can see that the answer gets translated into German, which is really helpful. And this happens even though all the source text is in English, so it knows how to match across languages really well. So that's an example of a question-answering application.
            
            
            
The next one I want to show you is actually the same application, Ask News, but now using HHEM. We created a little demo of how you could use it, although there are many other ways. This is Ask News again, and if I ask the same question, the response is generated in the same way, but after it's generated, there's an evaluation of the confidence using HHEM. This little step runs HHEM, in this case through the model hosted on Hugging Face, and it generates an evaluation. Here it shows high confidence, meaning this response is not a hallucination relative to the facts. So this is one way you can use HHEM on your own in your application.
            
            
            
So moving on: that was question answering, but I also mentioned chatbots quite a bit, so let's look at a chatbot example. (Oh, I didn't mean to click that.) Here's a chatbot, hosted on Hugging Face and again built with the Vectara APIs. What we did here is create another corpus, crawl about 100 to 150 pages from the IRS website, and put them in that corpus. Now I can ask questions about it. For example, I can go in and ask: "Is my college tuition tax deductible?" It goes into the corpus and tries to answer the question based on the information crawled from the website. Full disclosure and warning: please don't use this app for anything other than demo purposes, and use your tax advisor to file your taxes. I just have to say that; it's only meant as a demo. But again, you get this answer.
            
            
            
The nice thing about the chatbot is that you can then ask a follow-up question. For example, here it said that tuition and related expenses may be tax deductible under certain conditions, so I can ask: "What conditions would make it tax deductible?" The idea is that it knows the "it" probably refers to college tuition, right? It has the context of the previous question and the previous answer, so it really behaves like a chatbot.
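In spirit (a generic sketch of mine, not how Vectara's chat API actually manages state), carrying that context just means feeding prior turns back in with each new question:

```python
history: list[tuple[str, str]] = []  # (question, answer) turns so far

def chat(question: str) -> str:
    # Prepend earlier turns so that "it" in a follow-up can be resolved.
    # A simplification: production systems usually rewrite the follow-up
    # into a standalone query before running retrieval on it.
    context = "\n".join(f"User: {q}\nAssistant: {a}" for q, a in history)
    reply = answer(f"{context}\nUser: {question}")  # reuses the RAG sketch above
    history.append((question, reply))
    return reply
```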
            
            
            
And you can see that it already knows that. So this is a chatbot. I also want to emphasize again that this is all open source: if you go to this particular page, you can see the files and all the code, including how we run the query and the whole application. So feel free to use that as a reference to build your own app if you like.
            
            
            
And with that, thank you for listening. I wanted to highlight a few other things on my final slide. First, again, I encourage you to sign up for our free account. It's actually pretty generous: it lets you get started with up to 50 megabytes of text and 15,000 queries a month, which is quite a bit to try it out. We have a lot of resources for you: our documentation, which is pretty thorough, and a Discord channel for the community, which you can join to ask questions of fellow developers who build with Vectara, or of the many of us from Vectara who are there all the time to answer questions. We have a GitHub organization where you can find the open-source projects I mentioned here, like React-Search, vectara-ingest, vectara-answer, etcetera. And we have a set of example notebooks; this one, for example, shows how to use Vectara with LlamaIndex, but there are others in the repository. Finally, if you're a startup, I encourage you to take a look at our startups program. It's a very good way to get started with Vectara while getting additional support in the form of credits, customer support, and other things, really a good way to begin if you want to use Vectara to power your product. And that's it. Thanks for listening again, and I hope you have a good rest of your Conf42 conference.