Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hi, I'm Rinoshun.
I'm a software engineer in the AI team of WSO2.
Today I'll be talking about building trust in generative AI through
accuracy evaluation and automation.
So these days, many organizations are leveraging generative AI to infuse
AI features into their products and enhance the developer experience.
But when it comes to an AI feature, the key factor is accuracy:
how accurate are the responses of this AI feature?
From an organization's perspective, when a developer pushes
some changes to an AI feature, what trust or guarantee do they
have that those changes are safe to go into the existing AI feature?
So, in addition to code reviews and unit testing for AI features, they will also
have to do some accuracy evaluation.
And this accuracy evaluation should be automated.
Each and every change that goes into production needs to be checked
to confirm that a particular accuracy level is maintained for that feature.
This process has to be automated, so that with each and every PR change
that goes into production, we have verified that the accuracy of the
feature is maintained at a certain level.
So let's dive deep into this.
First, generative AI with and without context.
Generative AI is being leveraged for the creative responses you
can get from LLMs, the large language models.
But there are some limitations.
The model has been pre-trained on many different areas and datasets,
so the answer it gives might not be tailored for your use case.
It might be more generic, and it might lack precision,
since it does not have the relevant context for your particular use case.
Then we have generative AI with context, where some context is
fed into the prompt and this prompt is sent to the LLM.
The LLM can look at the context, and based on that context it can give much
more relevant and tailored responses.
This generative AI with context is what we will focus on in this
presentation: evaluating its accuracy and automating that
accuracy evaluation.
One common pattern of generative AI with context is
retrieval-augmented generation, or RAG.
RAG enhances generative AI by integrating external
knowledge sources, which improves the accuracy of the generated responses.
It has two components: the retriever component and the generator component.
A knowledge base is created which contains all the external knowledge
sources and documents. Whenever the retriever gets a question from the user,
it uses the knowledge base to search for relevant context, gets the results,
and adds them to the prompt. This prompt with the context is sent to the next
component, the generator, which generates a response that
is more relevant to the user's query.
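To make the flow concrete, here is a minimal sketch of that retriever-plus-generator loop in Python. The knowledge_base and llm objects and their search and generate methods are hypothetical placeholders, not part of any specific library.

```python
# Minimal RAG sketch: retrieve context, add it to the prompt, generate.
# knowledge_base.search() and llm.generate() are hypothetical interfaces.
def answer_with_rag(question: str, knowledge_base, llm) -> str:
    # Retriever: search the knowledge base for context relevant to the question.
    retrieved_docs = knowledge_base.search(question, top_k=5)
    context = "\n\n".join(doc.text for doc in retrieved_docs)

    # Build a prompt that carries the retrieved context.
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )

    # Generator: the LLM produces a response grounded in that context.
    return llm.generate(prompt)
```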
That is the RAG pattern. Now let's see how we can evaluate RAG
applications. There is the Ragas library; it is based on research that has
defined multiple metrics, but in this presentation we
will look at the specific metrics we will be using for accuracy evaluation.
Each of these metrics serves a different purpose, so you need to
figure out which metrics are best for your RAG application.
The retriever can be evaluated using metrics like context recall and
context precision, and the generator can be evaluated with faithfulness
and answer semantic similarity.
So both components are evaluated individually, on different aspects.
Before we go into the accuracy evaluation, let's get familiar
with some keywords that we'll be using. One is the question: the user's
query that is input into the RAG application. Then you have the contexts: these are
the documents that are retrieved from the knowledge base by the retriever.
Then the answer: this is the response generated by the generator
for a specific user question. And the ground truth: these
are the labels, or human-annotated answers, for a particular question.
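As a concrete illustration, one evaluation record might look like the dictionary below; the field names follow the convention the Ragas library expects, and the values are made up.

```python
# One illustrative evaluation record built from the four keywords above.
record = {
    # The user's query that goes into the RAG application.
    "question": "How do I enable rate limiting for an API?",
    # The documents retrieved from the knowledge base by the retriever.
    "contexts": [
        "Rate limiting policies can be attached to an API from its settings page...",
    ],
    # The answer produced by the generator for this question.
    "answer": "You can enable rate limiting by attaching a rate limiting policy...",
    # The human-annotated reference answer.
    "ground_truth": "Attach a rate limiting policy to the API in its settings...",
}
```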
We'll first look at the context recall metric,
which evaluates the retriever.
What this metric measures is the extent to which the retrieved
context aligns with the ground truth.
We have a ground truth predefined for a particular question,
and we want to see that each and every claim in that ground truth
is available in the context we are retrieving
from the knowledge base.
To achieve high context recall, all the sentences or claims in
the ground truth answer should be available in the retrieved context.
Context recall is formulated as the number of ground truth
sentences that can be attributed to the retrieved context, divided by the
total number of sentences in the ground truth.
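Written as a formula, that definition is:

```latex
\text{context recall} =
  \frac{\text{number of ground truth sentences that can be attributed to the retrieved context}}
       {\text{total number of sentences in the ground truth}}
```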
And then we have context precision. With context recall,
we checked whether all the relevant context had been retrieved;
with context precision, we want to look at the order of the
context that has been retrieved.
With context precision, we want the most
relevant items to be ranked higher.
Even if there are 100 documents and we feed all 100 of them
into the LLM, if the most relevant
item is at the bottom or in the middle, the LLM might not pick that context to
answer the question; it might hallucinate, or it might give an
answer based on a different context.
So we want to see whether the most relevant context is at the top,
ranked higher. That is what we want to achieve with context precision.
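The talk does not spell the formula out; for reference, Ragas computes context precision roughly as a rank-weighted average of precision at each position, where v_k is 1 if the item at rank k is relevant and 0 otherwise:

```latex
\text{context precision@}K =
  \frac{\sum_{k=1}^{K} \left( \text{precision@}k \times v_k \right)}
       {\text{total number of relevant items in the top } K}
```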
Together, context recall and context precision check how
accurate the retrieval is. Based on these, you can make many decisions
for the retriever, like how many documents you want to retrieve.
If you retrieve a large number of documents, context recall will
be higher, because there is a higher chance of getting the relevant
documents into the prompt, but precision might be low if the most relevant
documents are not ranked higher, so you might want to re-rank those documents.
Based on these metrics you can also fine-tune the indexing
strategy: if the precision is low, you can go and change the
indexing strategy of the knowledge base. You can also change the search
strategy; instead of using cosine similarity as the distance metric,
you can use Euclidean distance.
And you can use a re-ranker to re-rank the retrieved documents.
So there are multiple decisions you can take based on these metrics.
Then, to evaluate the generator, we have faithfulness. This measures
the factual consistency of the generated answer against the given context. If
any sentences have been hallucinated, this faithfulness
metric will show us that the answer has been hallucinated.
What it compares is the generated answer with the context: it checks whether
all the claims in the generated answer are available in the given context.
The faithfulness score is defined as the number of claims in the generated
answer that can be inferred from the given context, divided by the total number
of claims in the generated answer.
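As a formula:

```latex
\text{faithfulness} =
  \frac{\text{number of claims in the generated answer that can be inferred from the given context}}
       {\text{total number of claims in the generated answer}}
```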
And then, answer semantic similarity.
Through this metric we check whether the semantic meaning of the generated
answer matches the ground truth we have.
What we do here is get the embedding vectors for the
generated answer and the ground truth, and compare them using cosine similarity.
That is answer semantic similarity.
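A small sketch of that comparison, assuming an embed() function that returns an embedding vector for a piece of text (for example, from an embedding model API):

```python
import numpy as np

def answer_semantic_similarity(generated_answer: str, ground_truth: str, embed) -> float:
    # Get embedding vectors for the generated answer and the ground truth,
    # then compare them with cosine similarity.
    a = np.asarray(embed(generated_answer), dtype=float)
    b = np.asarray(embed(ground_truth), dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
```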
Based on these two metrics, which measure the accuracy
of the generator, you can take decisions like which model to use and
what the model temperature should be.
Okay.
So those are the kinds of decisions you can take through these two
metrics. Now let's dig deeper into the key stages of accuracy evaluation.
First is dataset preparation: you need to prepare a comprehensive
dataset. Then you need to compute the metrics, and this metric computation
needs to be integrated into the CI/CD pipeline, so that
whenever you send a PR it will automatically run and compute the metrics.
Then you need to report the accuracy, so that you can make
decisions based on that accuracy report.
Dataset preparation is the key step of accuracy evaluation.
The questions you choose are very important: the dataset should cover
a wide variety of complexities.
There should be easy questions, medium-level questions, hard questions,
and complex, combined questions.
Different types of questions should be there so that the dataset covers all the
scenarios that can occur in production.
You also need to carefully write the ground truth.
Most of these metrics are based on the ground truth, so if you don't have a
proper ground truth, the accuracy you evaluate will be meaningless.
And if you don't have the proper questions, you are not evaluating the
right accuracy: if you just give some easy questions, you might be confident
from the resulting accuracy that the feature works on those questions, but
you would have no idea how it will work when harder questions come in production.
So dataset preparation is a key step in evaluating the accuracy.
Next, the metric computation.
Here I have attached a code snippet to evaluate the service.
We use the Ragas library to get the metrics.
We import the metrics that we want; here I have imported context recall,
context precision, faithfulness, and answer similarity.
First we get the dataset. We prepare the extended dataset by running
the questions through the AI feature, which gives us the answers and the contexts.
Then you combine the questions, answers, contexts, and
ground truths and create a dataset, and that dataset is passed
to the Ragas library.
We specify that we want to evaluate based on these metrics, and the
results are saved to an accuracy_results.csv.
For all the questions we also take the mean (or a weighted mean) of each
metric, and we define a threshold for each metric; we have to define different
thresholds because each metric serves a different purpose. We then check whether
each threshold passes to decide whether the overall accuracy test passes.
That is the metric computation code snippet.
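A sketch of what that snippet looks like, assuming the ragas and datasets libraries; the column names follow the Ragas convention (they vary slightly between versions), and the file name, thresholds, and example data are illustrative.

```python
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import (
    context_recall,
    context_precision,
    faithfulness,
    answer_similarity,
)

# Questions and ground truths come from the prepared dataset; answers and
# contexts come from running the questions through the AI feature.
data = {
    "question": ["How do I enable rate limiting for an API?"],
    "contexts": [["Rate limiting policies can be attached to an API..."]],
    "answer": ["You can enable rate limiting by attaching a policy..."],
    "ground_truth": ["Attach a rate limiting policy to the API in its settings."],
}
dataset = Dataset.from_dict(data)

# Evaluate with the chosen metrics (this calls an evaluator LLM behind the
# scenes, so an API key such as OPENAI_API_KEY must be configured).
results = evaluate(
    dataset,
    metrics=[context_recall, context_precision, faithfulness, answer_similarity],
)

# Save per-question scores so failed runs can be inspected later.
df = results.to_pandas()
df.to_csv("accuracy_results.csv", index=False)

# Each metric gets its own threshold, since each serves a different purpose.
thresholds = {
    "context_recall": 0.90,
    "context_precision": 0.80,
    "faithfulness": 0.85,
    "answer_similarity": 0.80,
}
failed = [name for name, limit in thresholds.items() if df[name].mean() < limit]
if failed:
    raise SystemExit(f"Accuracy test failed for: {', '.join(failed)}")
print("Accuracy test passed")
```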
Then we integrate this into the CI/CD pipeline. If you have your code on GitHub,
you can use GitHub workflows. You add a workflow YAML file
where you set up the environment, install the dependencies, and set up
the GitHub environment variables; you can set them through GitHub
variables and secrets. Then you give the script which you want to run.
Here I am running the accuracy test, and then publishing the CSV.
Publishing it is also important, because if the accuracy test fails, you
can go and inspect why it has failed.
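A sketch of what such a workflow file could look like, using GitHub Actions syntax; the script name test_accuracy.py, the secret name, and the artifact name are illustrative assumptions, not the exact setup from the talk.

```yaml
# Run the accuracy evaluation on every pull request and publish the CSV report.
name: accuracy-evaluation
on: [pull_request]

jobs:
  accuracy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - name: Install dependencies
        run: pip install -r requirements.txt
      - name: Run accuracy test
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
        run: python test_accuracy.py
      - name: Publish accuracy report
        if: always()   # publish even when the accuracy test fails, so it can be inspected
        uses: actions/upload-artifact@v4
        with:
          name: accuracy-results
          path: accuracy_results.csv
```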
Then, reporting the accuracy: the accuracy_results.csv.
Here is one example of a row in the accuracy_results.csv.
It contains the whole dataset along with the context precision,
context recall, faithfulness, and answer similarity scores you get.
Based on this you can take important decisions; you
can decide on what to do next.
If the test passes, you can directly merge the change into production.
But if the accuracy test fails, you can come in, look at the CSV file, and
check which question has failed and why, and which metric has gone down and why.
You can look at the answer, the context, and the ground truth
and get an idea of why it has failed.
Say, for instance, for a particular question you have
retrieved five documents and the most
relevant document is not at the top.
So the context precision has gone down.
For such cases, what you can do is introduce
a re-ranker into the architecture.
The re-ranker will re-rank these documents and put
the most relevant one at the top,
and then the context precision will go higher.
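A sketch of that re-ranking step, assuming a hypothetical cross-encoder-style scorer object with a score(question, text) method:

```python
def rerank(question: str, retrieved_docs: list, scorer, top_k: int = 5) -> list:
    # Score every retrieved document against the question and keep the
    # highest-scoring ones at the top, so the most relevant context leads.
    scored = [(scorer.score(question, doc.text), doc) for doc in retrieved_docs]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:top_k]]
```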
That is the accuracy part. Now a bit about the automation pipeline,
and how the automation happens.
First, the developer sends a PR, and this PR triggers the GitHub workflow.
In the workflow you have the accuracy script, which is run to
evaluate the accuracy of the app after the changes.
Then, based on the accuracy results,
you make the decision: you verify whether you can merge this PR or
not. That is the automation pipeline. Now, some challenges in this approach.
Some of these metrics use LLMs behind the scenes.
For example, for context recall you want to check whether all the
ground truth claims are available in the context you have retrieved,
and an LLM is used to do that.
So the evaluation results are dependent on that LLM,
and each time you run it, the results might vary,
by a small amount, of course.
Still, this evaluator LLM is important,
so you might want to use a very accurate, highly capable LLM.
Also, as I mentioned earlier, the dataset should
cover all the possible scenarios that might occur in production.
It should not contain questions of just one complexity;
it should have different complexity levels.
Only then can you get a very good accuracy evaluation. You also
need to create high-quality ground truth datasets.
Creating such ground truth is
very time consuming and also expensive;
you need to spend time writing the proper ground
truth for each question.
And then there is the LLM API cost.
Since you are using an LLM behind the scenes, for large-scale
evaluations the LLM API cost might be high, and there might be performance
bottlenecks; it might take some time to compute the accuracy. So cost is another
factor that you need to include in your project plan for accuracy evaluation.
That's it on evaluating the accuracy of your RAG applications and automating it,
making sure that all the changes that go into
production maintain the particular accuracy
level that your organization expects for the user experience.
Thank you.
Thank you very much.
It was nice talking to you.