Conf42 DevOps 2025 - Online

- premiere 5PM GMT

Building Trust in Generative AI: Accuracy Evaluation and Automation


Abstract

Generative AI meets precision with Retrieval-Augmented Generation (RAG), combining retrieval and generation for accurate, context-aware responses. This talk unveils a framework for automated RAG evaluation, integrated into CI/CD pipelines, ensuring continuous improvement and reliable AI solutions.


Transcript

This transcript was autogenerated. To make changes, submit a PR.
Hi, I'm Nirhoshan. I'm a software engineer in the AI team at WSO2. Today I'll be talking about building trust in generative AI through accuracy evaluation and automation.

These days many organizations are leveraging generative AI to infuse AI features into their products and enhance the developer experience. But when it comes to an AI feature, a key factor is accuracy: how accurate are the responses of this AI feature? From an organization's perspective, when a developer pushes changes to an AI feature, what trust or guarantee do they have in the changes going into that existing feature? So in addition to code reviews and unit testing, AI features also need accuracy evaluation, and that evaluation should be automated. Every change that goes into production needs to be checked so that a particular accuracy level is maintained for the feature, and the process has to be automated so that with each and every change we have verified the accuracy stays at a certain level. Let's dive deeper into this.

First, generative AI with and without context. Generative AI is leveraged for the creative responses you can get from LLMs, the large language models, but there are limitations. An LLM has been pre-trained on many different areas and datasets, so the answers it gives might not be tailored to your use case; they can be more generic and lack precision, since the model does not have the relevant context for your particular scenario. Then we have generative AI with context, where relevant context is fed into the prompt and the prompt is sent to the LLM. The LLM can look at that context and give much more relevant, tailored responses. Generative AI with context is what we will focus on in this presentation: we will evaluate its accuracy and automate that accuracy evaluation.

One common pattern of generative AI with context is Retrieval-Augmented Generation, or RAG. RAG enhances generative AI by integrating external knowledge sources, which improves the accuracy of the generation. It has two components: the retriever and the generator. A knowledge base is created that contains all the external knowledge documents. Whenever a question comes in from the user, the retriever searches this knowledge base for relevant context, takes the results, and adds them to the prompt. The prompt with the context is then sent to the second component, the generator, which generates a response that is more relevant to the user's query. That is the RAG pattern.

Now let's see how we can evaluate these RAG applications. There is the Ragas library; it comes out of published research and defines multiple metrics, but in this presentation we will look at the specific metrics we use for accuracy evaluation. Each of these metrics serves a different purpose, so you need to figure out which metrics are best for your RAG application.
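Before getting into the metrics, it may help to see the retriever/generator split as code. This is a minimal sketch, not the speaker's implementation: the documents, model names, and OpenAI client usage are assumptions made purely for illustration.

```python
# Minimal RAG sketch: an in-memory knowledge base, a cosine-similarity
# retriever, and a generator that answers only from the retrieved context.
# Model names and the OpenAI client are illustrative assumptions.
import numpy as np
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

DOCUMENTS = [
    "WSO2 API Manager supports rate limiting via throttling policies.",
    "Throttling policies can be applied per API, per application, or per user.",
    "Analytics dashboards show API usage and latency percentiles.",
]

def embed(texts: list[str]) -> np.ndarray:
    """Embed a batch of texts into vectors (hypothetical embedding model)."""
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([d.embedding for d in resp.data])

DOC_VECTORS = embed(DOCUMENTS)  # the "knowledge base" index

def retrieve(question: str, top_k: int = 2) -> list[str]:
    """Retriever: rank documents by cosine similarity to the question."""
    q = embed([question])[0]
    scores = DOC_VECTORS @ q / (np.linalg.norm(DOC_VECTORS, axis=1) * np.linalg.norm(q))
    best = np.argsort(scores)[::-1][:top_k]
    return [DOCUMENTS[i] for i in best]

def generate(question: str, contexts: list[str]) -> str:
    """Generator: answer the question using only the retrieved context."""
    prompt = (
        "Answer the question using only the context below.\n\n"
        "Context:\n" + "\n".join(contexts) + f"\n\nQuestion: {question}"
    )
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

contexts = retrieve("How do I rate limit an API?")
print(generate("How do I rate limit an API?", contexts))
```

Everything the rest of the talk measures lives in these two functions: context recall and context precision judge what the retriever returns, while faithfulness and answer semantic similarity judge what the generator returns.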
The retriever can be evaluated with metrics like context recall and context precision, and the generator with faithfulness and answer semantic similarity. The two components are evaluated individually, on different aspects.

Before we go into the accuracy evaluation, let's get familiar with some keywords we'll be using. First there is the question, the user's query that goes into the RAG application. Then the contexts, the documents retrieved from the knowledge base by the retriever. Then the answer, the response the generator produces for a specific user question. And finally the ground truth: the labels, or human-annotated answers, for a particular question.

We'll first look at context recall, which evaluates the retriever. This metric measures the extent to which the retrieved context aligns with the ground truth. We have a ground truth predefined for a particular question, and we want every claim in that ground truth to be present in the context we retrieve from the knowledge base. To achieve high context recall, all the sentences or claims in the ground truth answer should be found in the retrieved context. Context recall is formulated as the number of ground truth sentences that can be attributed to the retrieved context, divided by the total number of sentences in the ground truth.

Then we have context precision. With context recall we checked whether all the relevant context was retrieved; with context precision we look at the order in which that context was retrieved. We want the most relevant items to be ranked higher. Even if we retrieve a hundred documents and feed all hundred into the LLM, if the most relevant item is at the bottom or in the middle, the LLM might not pick that context to answer the question; it might hallucinate or answer based on a different context. So context precision checks whether the most relevant context is at the top, ranked higher.

Together, context recall and context precision tell us how accurate the retrieval is, and based on them you can make many decisions about the retriever. For example, how many documents do you want to retrieve? If you retrieve a large number of documents, context recall will tend to be higher, because there is a better chance of getting the relevant documents into the prompt, but precision may be low if they are not ranked well, so you might want to re-rank those documents. Based on these metrics you can also fine-tune the indexing strategy of the knowledge base, change the search strategy, for example use Euclidean distance instead of cosine similarity as the distance metric, or add a re-ranker to reorder the retrieved documents. There are multiple decisions you can take based on these metrics.
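Written out, the two retriever metrics described above look roughly like this. The context precision formula follows the Ragas-style precision@K definition; treat it as a paraphrase rather than the library's exact wording.

```latex
% Retriever metrics (Ragas-style definitions, paraphrased).
\[
\text{ContextRecall} \;=\;
\frac{\lvert \{\text{ground-truth sentences attributable to the retrieved context}\} \rvert}
     {\lvert \{\text{sentences in the ground truth}\} \rvert}
\]
\[
\text{ContextPrecision@}K \;=\;
\frac{\sum_{k=1}^{K} \text{Precision@}k \cdot v_k}
     {\lvert \{\text{relevant items in the top } K\} \rvert},
\qquad v_k \in \{0, 1\}\ \text{marks whether the item at rank } k \text{ is relevant.}
\]
```

For example, if three of the four sentences in a ground truth can be attributed to the retrieved context, context recall is 3/4 = 0.75; and if the single relevant document sits at rank 1 rather than rank 5, context precision@5 is 1.0 rather than 0.2.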
Then, to evaluate the generator, we have faithfulness. This measures the factual consistency of the generated answer against the given context. If any sentences have been hallucinated, the faithfulness metric will show that the answer was hallucinated. It compares the generated answer with the context and checks whether every sentence in the generated answer is supported by that context. The faithfulness score is defined as the number of claims in the generated answer that can be inferred from the given context, divided by the total number of claims in the generated answer.

There is also answer semantic similarity. With this metric we check whether the semantic meaning of the answer matches the ground truth we have. We take the embedding vectors of the generated answer and the ground truth and compare them using cosine similarity; that gives the answer semantic similarity score. Based on these two metrics, which measure the accuracy of the generator, you can take decisions such as which model to use and what the model temperature should be.

Now, the key stages of accuracy evaluation; let's dig deeper into this. First, dataset preparation: you need to prepare a comprehensive dataset. Then metric computation: you need to compute the metrics, and that computation needs to be integrated into the CI/CD pipeline so that whenever someone sends a PR it runs automatically. Finally, you need to report the accuracy so that you can make decisions based on the accuracy report.

Dataset preparation is the key step of accuracy evaluation. The questions you choose are very important: the dataset should cover a wide variety of complexities. There should be easy questions, medium questions, hard questions, and complex, combined questions, so that it covers all the scenarios that can occur in production. You also need to write the ground truth carefully, because most of these metrics are based on it. If you don't have a proper ground truth, the accuracy you evaluate will be meaningless, and if you don't have the right questions, you are not evaluating the right accuracy. If you only include easy questions, you might be confident that the feature works on those, but you would have no idea how it behaves on harder questions in production. So dataset preparation is a key step in evaluating accuracy.

Then the metric computation. Here I have attached a code snippet that evaluates a service. We use the Ragas library and import the metrics we want: context recall, context precision, faithfulness, and answer similarity.
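The slide's snippet isn't reproduced in this transcript, but a script of the kind described here and walked through below might look roughly like this. It is a sketch, not the actual WSO2 code: the example rows, threshold values, and file name are assumptions, and Ragas metric and column names differ slightly between library versions.

```python
# Sketch of an accuracy-evaluation script using the Ragas library.
# Assumes OPENAI_API_KEY is set, since Ragas runs an LLM behind the scenes.
# Metric/column names follow older Ragas releases (e.g. ground_truth vs.
# ground_truths, answer_similarity vs. semantic_similarity); check your version.
import sys
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import (
    context_recall,
    context_precision,
    faithfulness,
    answer_similarity,
)

# Evaluation dataset: questions and human-written ground truths, extended with
# the answers and contexts produced by running the questions through the AI feature.
eval_rows = {
    "question": ["How do I rate limit an API?"],
    "ground_truth": ["Apply a throttling policy to the API."],
    "answer": ["You can rate limit an API by attaching a throttling policy."],
    "contexts": [["Throttling policies can be applied per API, per application, or per user."]],
}
dataset = Dataset.from_dict(eval_rows)

# Compute the four metrics discussed in the talk.
result = evaluate(
    dataset,
    metrics=[context_recall, context_precision, faithfulness, answer_similarity],
)

# Persist per-question scores so failures can be inspected later.
df = result.to_pandas()
df.to_csv("accuracy_results.csv", index=False)

# Each metric gets its own threshold, since each serves a different purpose.
THRESHOLDS = {
    "context_recall": 0.80,
    "context_precision": 0.70,
    "faithfulness": 0.85,
    "answer_similarity": 0.80,
}
failed = {m: df[m].mean() for m, t in THRESHOLDS.items() if df[m].mean() < t}
if failed:
    print(f"Accuracy test failed: {failed}")
    sys.exit(1)  # non-zero exit fails the CI job
print("Accuracy test passed")
```

The per-metric thresholds matter: a single aggregate score would hide which part of the pipeline, retrieval or generation, has regressed.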
First we take the dataset and extend it by running the questions through the AI feature to get the answers and the retrieved contexts. We combine the questions, answers, contexts, and ground truths into a single dataset, and that dataset is passed to the Ragas library. We tell it which metrics we want to evaluate, and the results are saved to an accuracy_results.csv file. Across all the questions we take the mean (or a weighted mean) per metric, and then we check it against a threshold. We define a different threshold for each metric, because each metric serves a different purpose, and the overall accuracy test passes only if every threshold is met. That is the metric computation.

Next, integrating this into the CI/CD pipeline. If your code is on GitHub, you can use GitHub workflows: you add a workflow YAML file where you set up the environment, install the dependencies, configure the GitHub variables and secrets, and specify the script to run; here I run the accuracy test script. You should also publish the CSV. Publishing it is important because if the accuracy test fails, you can go and inspect why it failed.

Then, reporting the accuracy. The accuracy_results.csv contains the whole dataset together with the context precision, context recall, faithfulness, and answer similarity scores for each question. Based on this you can take important decisions about what to do next. If the test passes, you can merge the change toward production. If it fails, you open the CSV, check which questions failed, which metric went down and why, and look at the answer, the context, and the ground truth to understand the failure. For instance, for a particular question you might have retrieved five documents and the most relevant one is not at the top, so context precision has gone down. In such a case you can introduce a re-ranker into the architecture; the re-ranker reorders the retrieved documents and puts the most relevant one on top, so context precision goes back up. That is the reporting part.

Finally, the automation pipeline and how the automation happens. The developer sends a PR, and the PR triggers the GitHub workflow. The workflow runs the accuracy script, which evaluates the accuracy of the application after the changes, and based on the accuracy results you verify whether the PR can be merged. That is the automation pipeline.
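A workflow along the lines described above might look like this. The file paths, secret names, action versions, and Python version are placeholders, not the actual WSO2 setup.

```yaml
# .github/workflows/accuracy.yml — illustrative sketch of the accuracy gate.
name: Accuracy evaluation

on:
  pull_request:

jobs:
  accuracy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - name: Install dependencies
        run: pip install -r requirements.txt
      - name: Run accuracy tests
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
        run: python test_accuracy.py   # fails the job if any threshold is missed
      - name: Publish accuracy report
        if: always()                   # upload the CSV even when the gate fails
        uses: actions/upload-artifact@v4
        with:
          name: accuracy-results
          path: accuracy_results.csv
```

Uploading the CSV on failure is what makes the gate actionable: reviewers can open the report, see which questions and which metrics regressed, and decide whether the fix belongs in the retriever, the generator, or the dataset itself.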
Now, some challenges in this approach. Some of these evaluation metrics use LLMs themselves. For example, for context recall you want to check whether all the ground truth claims are available in the retrieved context, and an LLM is used behind the scenes to do that check. So the evaluation results depend on that LLM, and each time you run it the results might vary by a small amount. The choice of evaluator LLM matters, so you want to use a highly accurate model.

Also, as I mentioned earlier, the dataset should cover all the scenarios that might occur in production. It should not be questions of just one complexity; it should span different complexity levels. Only then do you get a meaningful accuracy evaluation. You also need high-quality ground truth datasets, and creating that ground truth is time-consuming and expensive: you need to spend the time to write a proper ground truth for each question.

Then there is the LLM API cost. Since an LLM is used behind the scenes, the API cost can be high for large-scale evaluations, and there can be performance bottlenecks; it might take some time to compute the accuracy. Cost is another factor you need to include in your project plan for accuracy evaluation.

So that's evaluating the accuracy of your RAG applications and automating it, making sure that all the changes that go into production maintain the particular accuracy level your organization wants to preserve for the user experience. Thank you very much. It was nice talking to you.

Nirhoshan Sivaroopan

Software Engineer @ WSO2



