Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hi everyone, and thanks for joining me today. I'm excited to be here to talk about
large language models, especially large language model evaluations,
and how you can evaluate your model in a better way, or how you
can use certain frameworks that are available to understand how
a particular language model is performing for your specific use case.
For those of you who don't know me, I'm Ashwin. I've been working
in the field of machine learning, computer vision, and NLP for over three years,
and I've also worked in some other domains throughout my
career. Today I would like to dive into a
specific aspect, evaluations of course, but also
why they're necessary and how we can get to
a particular evaluation metric. So, to get started,
let me just quickly move the screen. Yep. Okay.
Yeah. So, will the language model speak
the truth? I guess that's the question that everyone's been asking.
Everyone's been really concerned about the whole evaluation flow,
or whether we can trust these language models to
tell us something they definitely know, versus making things
up and telling us something that even we are not sure
is true. So the bigger question that we will try
to answer today is whether the language model is
going to be truthful in answering the questions we ask it or,
you know, in the summaries we get out of it. So,
moving ahead, we are jumping into
a particular aspect: why we need these evaluations
and why they are necessary. The reason is
that these evaluations are an important measure
in your overall large language model development workflow.
Think about it as how we can effectively manage or
leverage these language models, while also making sure that
we are not letting them loose completely and creating a
bad customer experience in general. So the
three aspects that you need to take care of here start with the management
of these language models. Maybe you're just using a few APIs
that are available online, getting the results, and publishing them
to your users. Or maybe you're hosting
your own models using frameworks like vLLM,
or another one called llama.cpp.
These frameworks allow you to
host your own models and also
to understand how you can improve
the overall performance of these language models.
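As a rough illustration of that self-hosting route, here is a minimal sketch using vLLM's offline inference API; the model name is just an illustrative placeholder, not something prescribed in this talk.

```python
# Minimal vLLM self-hosting sketch; the model id is an illustrative placeholder.
from vllm import LLM, SamplingParams

llm = LLM(model="mistralai/Mistral-7B-Instruct-v0.2")  # any supported Hugging Face model id
params = SamplingParams(temperature=0.7, max_tokens=128)

outputs = llm.generate(
    ["Summarize in two sentences why evaluating LLMs matters."],
    params,
)
print(outputs[0].outputs[0].text)
```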
Understanding and measuring these models will ultimately lead
to improving them: understanding
where these LLMs fall short, so that we can either refine
our training data, start thinking about fine-tuning these
language models, maybe with LoRA adapters, or,
you know, just in general test out different language models, especially
publicly available, fine-tuned ones.
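To make the LoRA idea concrete, here is a hedged, minimal sketch using the PEFT library; the base model and hyperparameters are illustrative assumptions, not recommendations from the talk.

```python
# Hedged LoRA sketch with PEFT; base model and hyperparameters are placeholders.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base_model = AutoModelForCausalLM.from_pretrained("gpt2")  # placeholder base model
lora_config = LoraConfig(r=8, lora_alpha=16, lora_dropout=0.05, task_type="CAUSAL_LM")

model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()  # only the small adapter weights are trainable
```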
Moving ahead, we are concerned
about what kind of frameworks or what kind
of measurement techniques,
or maybe let's just call them frameworks,
would help us do this.
So we have three selection criteria, or let's
say three evaluation criteria that we can think
of, and we can jump right into that.
The first is that we can have task-specific
scores that measure the right
outcome of a particular language model. Imagine a
toolbox where you have specifically designed metrics,
or specifically designed items, that you use to assess
LLMs; when you consider that toolbox, your tools would be
the three things we've mentioned, starting with these
task-specific scores. I must say that a one-size-fits-all approach
usually doesn't work, so the ideal framework
will offer you metrics tailored to your different tasks,
to specific tasks like question answering
or summarization, and it will allow you
to evaluate these models on those specific tasks.
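To make the task-specific point concrete, here is a minimal sketch, assuming the Hugging Face evaluate package: the same toolbox, but a different metric per task. The toy predictions and references are placeholders.

```python
# Pick the metric that matches the task instead of a one-size-fits-all score.
import evaluate

# Summarization: ROUGE compares model summaries against reference summaries.
rouge = evaluate.load("rouge")
rouge_scores = rouge.compute(
    predictions=["the cat sat on the mat"],
    references=["a cat was sitting on the mat"],
)

# Extractive question answering: exact match and F1 over answer spans.
squad = evaluate.load("squad")
qa_scores = squad.compute(
    predictions=[{"id": "1", "prediction_text": "Paris"}],
    references=[{"id": "1", "answers": {"text": ["Paris"], "answer_start": [0]}}],
)

print(rouge_scores, qa_scores)
```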
The second thing that we need to consider for a good evaluation framework is the metric list.
You have the toolbox, you have the toolset,
but if you don't have a reliable set of metrics that have been
proven before, there is no way to really understand what
a particular score or metric means
when it is presented to you in an aggregated
or abstracted manner. So having a good list of
available metrics that you can easily implement is really crucial when
deciding what framework, what library, or what general research you're
going to follow to evaluate your language models. And the third part is
extensibility, which we could discuss broadly or narrowly, depending on the
context. The framework that we're going to use
should be extensible, it should be maintainable,
and of course it should provide you ease of access to its
underlying functions and classes. The reason for this
is that the framework may be adaptable to a particular public task,
but your task may require some specific understanding
or specific knowledge, maybe a specific way of loading the model
weights, or maybe you use a different tokenizer
for tokenizing your requests and
responses; then this becomes a really crucial
aspect in determining whether a particular framework is going to be useful
for you. So, for example, the LM Evaluation Harness
is a really good framework, a really well designed one,
made to evaluate any supported model
that you can think of on public datasets. But while working
with this particular framework, if you have a task that's
specific to your needs, it's really difficult,
how do I say it, it's really difficult to implement
that in this particular library. And it's not just this one in particular;
it's in general any library that has come up for
evaluations and that kind of abstracts these
evaluation metrics away from us. So,
moving ahead, by incorporating all these elements, we can
create, or decide on, a robust evaluation framework to use.
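To show what that extensibility buys you, here is a deliberately simple, hypothetical sketch of the kind of evaluation loop you end up owning when a framework doesn't cover your task; the golden examples and the generate_answer function are placeholders for your own data and serving stack.

```python
# Hypothetical, minimal evaluation loop for a custom task; the golden examples
# and generate_answer below are placeholders for your own data and model call.
golden_examples = [
    {"prompt": "What is the capital of France?", "expected": "Paris"},
    {"prompt": "What is 2 + 2?", "expected": "4"},
]

def generate_answer(prompt: str) -> str:
    # Swap in whatever serving stack you use (an API, vLLM, llama.cpp, ...).
    return "Paris" if "France" in prompt else "4"

def exact_match_rate(examples) -> float:
    hits = sum(
        generate_answer(ex["prompt"]).strip().lower() == ex["expected"].strip().lower()
        for ex in examples
    )
    return hits / len(examples)

print(f"exact match: {exact_match_rate(golden_examples):.2f}")
```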
Now, let's dive deeper into the two main approaches
for LLM evaluation and why they
are necessary. As you can see in this particular diagram,
where human evaluation is
concerned, it ranks much higher, as opposed
to user testing, fine-tuning,
and maybe public datasets, public benchmarks,
and auto-evaluation. And the reason for that
is that the human evaluation aspect really focuses
on multiple geographies, multiple languages, and the
way people understand a particular response
in a particular language, and that really determines a
proper metric or score for your language model to be evaluated
against. Because, as we know, there are multiple dialects
of a particular language, and there are people using
different styles of grammar that are not the common
way of speaking or understanding things in their own
countries. And that can make a
lot of difference. If you're evaluating
a model's response for, let's say, someone in the US,
versus for someone who's
not a native English speaker, their understanding of the context,
of what's going on around a particular
response, may differ, and they
may or may not think that a particular answer suits them well.
So having a human evaluation framework
that's geography-specific, region-specific, and of course
application-specific is a really important aspect
of evaluating these models. Now, once we've talked about human
evaluators and why
those are necessary, we should also consider that it is not
always possible to collect all of this feedback, and it
is not always possible to have your customers decide
whether a particular answer was good or bad. The customers
are really concerned about the value that
your application or your use case is providing. So, in
general, you can understand these evaluation metrics
from two different perspectives. One is, of course,
the one we discussed, the human evaluator part, and the other is
the frameworks and libraries that we'll be exploring in this talk;
we can call them auto evaluators or, you know,
anything that's a non-human evaluator. And so,
previously, or traditionally, the way these language models even came
into being was based on a
large number of datasets, a large amount of text or
material that was openly, publicly available.
And so, overall, the
companies or use cases that are trying to test these models
have made human evaluators a kind of standard,
or, how do I say it, a kind of stop
in the loop before giving users complete access
to the language model use case: maybe
releasing it as an experiment, or as a beta,
so that people interact with it and you see
what's going on. Sometimes you get "oh no, this answer is totally
false," and people will just bash you for whatever you've
done and give you negative remarks, and that in turn helps
you in better serving the model, or
better evaluating what went wrong and where. So the
strengths here really are that humans can provide nuanced
feedback, judging not just whether an outcome is simply right or wrong
based on their understanding, but also considering factors
like creativity, coherence, and relevance to the task.
As I mentioned before, what would
be relevant or coherent or creative to a
native English speaker wouldn't be the same for a non-native
speaker, just because of how they understand the language.
There are obviously many challenges with human evaluation,
which has its own limitations. The choices, as I said, can be
subjective, based on location, and defining success
metrics only based on what someone from a particular place
said is obviously challenging and may raise
questions about how the result came to be. Adding to that,
human evaluation also takes
time, it is an iterative process, and it is expensive,
because you will probably be putting this into the
hands of your potential customers, who may get
frustrated and stop using the app, and then you have to convince
them to use it again and give feedback, and
that entire flow costs time
and money. So we can move towards
automatic evaluators. Now, when I say automatic evaluators,
I don't necessarily mean that everything happens by itself,
you just call a simple function, everyone's happy,
and all the language models have achieved nirvana or greatness.
Whenever I say automatic evaluators, it really means that
the toolbox we set up with the three different criteria
allows you to evaluate
a particular large language model on its own while
you're iterating over the use cases. The strengths here,
basically, are that they're fast, they're efficient,
and they're objective. They assess how well specific parts
of the output match your expectations based on the
datasets that you have, and the choices are usually based
on known outcomes and well-defined metrics. Now,
how do these known outcomes come into the picture? It's the
people who are actually serving these models determining what
the output for a particular question or
summary should be, so that whenever you
are close to it, with ROUGE scores for example,
whenever you're really close to it, you can think that this is probably
a better-understood answer rather than something that's completely
made up. However, we also cannot
discount the limitations that they possess.
They can't replicate what we talked about before,
which was both the great thing about human evaluators
and their limitation: the ability to
understand the context and nuance, in an
overall sense, of what a
particular response should be, what a particular answer should be,
or what future question might come
up based on that answer. So
automatic evaluators do struggle with creativity and overall quality judgment,
but they are much better evaluators in terms of metrics,
in terms of, let's say, as I said,
ROUGE or BLEU scores and n-gram matching. So the two somehow
work well together. But the real golden
opportunity here is
to mix the two, human evaluators as well
as auto evaluators, and have your entire flow
in such a way that you leverage the best of
both and also overcome the bad,
I would not say bad parts, but the less good parts,
I guess, of each.
So the overall takeaway from this is that
a collaborative option is probably better, and which
one to choose really depends on your particular use case.
The ideal approach most often
involves a combination where humans provide valuable
feedback on whether something was right or wrong, and the auto evaluators
offer consistency, checking whether the answer
that you're getting for the same question,
or for requesting the same summary, is always
going to be somewhat similar.
Let's shift gears from what we were talking
about with the two kinds of evaluators, and look
at how these two evaluators function,
or how they work with different sorts
of datasets. So we have two kinds of
dataset paths here.
One is using the public benchmarks that are already available: you don't have
to do much, you just trust these benchmarks that are
available on public leaderboards,
maybe Hugging Face or individual competitions.
And the other way is using golden datasets.
Now, whenever I say golden datasets, it doesn't
mean that this is literally the gold standard. It just means
that these are datasets, these are values,
that you control and that are
definitely, or I should say almost ninety percent,
true, which makes them really effective in giving you an answer.
So, public benchmarks: these are predefined datasets, and you can
easily get them from Hugging Face. The more
research that was put into creating these datasets
and into training models to test on them,
the fairer the assessment these public benchmarks
will give you. They will give you an understanding of the
general capabilities of a model on different sorts of
tasks, like how well a model performs at,
let's say, abstractive summarization,
question answering, or even answering
questions about code, debugging, or explaining
programming concepts, scientific concepts, or, you know,
any concepts in general. However, given the broad scope of the
datasets they usually contain and their formats,
these public benchmarks don't necessarily guarantee success; they don't necessarily
give you a yes or no answer when you're
looking at a particular model and deciding whether you should use it
or not, because your use case is really
a specific use case, rather than people
asking for travel tips or money-saving tips,
which is what we usually see as the most popular use cases
of these language models. So the takeaway on public benchmarks:
they do offer a valuable starting point, as in
they do give you an edge,
but they shouldn't be the sole measure of what you're trying
to achieve and of the LLM's effectiveness for it.
Moving ahead to the golden datasets: these datasets,
as I already said, are tailored to your specific needs.
You control what output you expect, you control
what prompt, what metrics, or,
to put it a better way, you control what the exact
result you're expecting from the large language model
is. Of course, you do not fully control what the large language model
actually gives you. But what you know is
that if I compare against a particular reference,
let's say a particular sentence, then the more the LLM's output
matches that sentence, the more
accurate it is, or the more accurate I would assume it is, and the better it performs
on the task that I made it for.
So these golden datasets
allow you to evaluate how well an LLM
performs on the tasks that matter most to you, rather than on
a generic dataset of, maybe,
Reddit comments where people just post "this, this,
this" in a chain that doesn't make any sense to you,
but which is obviously part of the
whole training dataset if it wasn't cleaned beforehand.
So some examples of using golden datasets
would be, for a given use case,
checking the semantic similarity of the
content that was generated, measuring perplexity,
or, as I said, ROUGE scores in summarization
tasks, so that you can understand how well a summary was generated
and how short or long that summary even was. These are instrumental
in building RAG-based workflows,
that is, retrieval augmented generation workflows,
and they will give you a really good idea of how your RAG
workflow should behave, which is what
we are trying to evaluate here.
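As one concrete example of the golden-dataset idea, here is a minimal sketch of semantic-similarity scoring, assuming the sentence-transformers package; the embedding model and the two sentences are illustrative placeholders.

```python
# Score how semantically close a model's output is to the golden reference you control.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative embedding model
reference = "The report concludes that revenue grew 12% year over year."
generated = "Revenue increased by about twelve percent compared to last year."

embeddings = model.encode([reference, generated], convert_to_tensor=True)
similarity = util.cos_sim(embeddings[0], embeddings[1]).item()
print(f"semantic similarity: {similarity:.2f}")
```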
So the most effective approach, again, as we discussed before,
is to leverage the public information as well as the
datasets that you prepare, because both provide you a
benchmark that you can then evaluate and iterate
over. Moving ahead. Now that we understand
what these evaluation methods and resources are and what usually
works, let's talk about applying this knowledge to your specific
needs. As I said before, your use case is probably
well defined: you're looking for a model that performs really well
in a particular sector rather than a general model.
Think of the Chevrolet case, where someone
tried to get the language model to sell them a brand
new Chevrolet Tahoe for, like, $1.
Or let's assume that you are in the scientific research community and your
model is open to answering
any and all sorts of questions; that's not really
good for you, that's not really conducive. And the
evaluation metrics, when they're actually running
with these models, will tell you how close to the use case
people's conversations are,
how close the topics are, and whether they really make
sense, or whether the user feedback makes sense.
So, in general, you could think of
it as: do you need your use case to answer questions?
Do you need it to produce summaries? Do you need it
to provide citations and
references, or just to be a general language model
that answers the question without any external source,
from its own knowledge base?
And then there's the content that goes in and the content that
comes out: whatever content the LLM interacts with,
is it backed by a document library, a vector database,
or just a vast collection of text data? That's how
your use case will determine what, or how
well defined, your output should
be. And eventually, once this is defined,
you can really understand how the model is performing.
You could maybe involve some fine-tuning.
You could do topic modeling and maybe restrict the model
from answering certain things,
or only get the model to answer particular questions, based on
what your use case is. It's also crucial to establish
certain guardrails, certain input validation
and output validation, so that you're not drifting into something that
you shouldn't be doing or shouldn't be talking about. Of course,
prompt engineering, or in general
defining the prompt, is also an
aspect of this particular flow. But once the metrics that
we'll discuss now are defined, you really know
how a particular language model is working for your
particular flow. Moving ahead, let's dive
into the nitty gritty, or, you know, the specifics of
what we were talking about. How do we measure the effectiveness of a tailored
large language model? We could first move
to traditional metrics, the most familiar of these tools.
These are the kinds of scores we get
from the different flavors of ROUGE, the classical
natural language processing approach: they measure
how well the LLM's output matches a reference in terms of n-grams,
and in other use cases
you check other scores, which helps
you understand how well your model is performing.
There's another
metric that we should also explore: perplexity.
It's a metric that takes a slightly different approach.
Rather than comparing against a carefully put-together dataset,
it assesses the LLM's internal ability to predict the next word
in a sequence. It tells you how confident a particular
LLM was in predicting the next word, and the lower
the overall score, the more confident the LLM is
in its predictions and the less likely it is to generate
surprising results, or even completely nonsensical results.
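For reference, this is roughly how perplexity can be computed with Hugging Face transformers; gpt2 and the sample sentence are just illustrative, and your own model and text would go in their place.

```python
# Minimal perplexity sketch; the model and text are illustrative placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

text = "The quick brown fox jumps over the lazy dog."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    loss = model(**inputs, labels=inputs["input_ids"]).loss

perplexity = torch.exp(loss)  # lower = the model is less "surprised" by the text
print(f"perplexity: {perplexity.item():.2f}")
```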
Beyond those traditional metrics, there are
things like RAGAS, I'm not sure exactly how it's
pronounced, but RAGAS is a really
good workflow for treating your retrieval augmented
generation pipelines, and metrics like faithfulness,
relevance, and other aspects of
that particular framework will allow you to understand better how
your whole RAG pipeline is working.
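Here is a hedged sketch of what that looks like with the ragas package (API as of the 0.1.x releases, which by default call out to an OpenAI judge model, so an API key is assumed); the question, answer, and context are made up purely for illustration.

```python
# Hedged RAG-evaluation sketch with ragas 0.1.x; the sample data is illustrative only.
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy

data = {
    "question": ["What is the return policy?"],
    "answer": ["Items can be returned within 30 days with a receipt."],
    "contexts": [["Our policy allows returns within 30 days when a receipt is provided."]],
}

result = evaluate(Dataset.from_dict(data), metrics=[faithfulness, answer_relevancy])
print(result)
```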
Another good framework
is QAEval, from Daniel Deutsch and colleagues. It also provides
good insights, checking whether a particular LLM
captures the key concepts or the key information that
was asked of it. So those are some of the options. I
guess the next step, now that we have
established a benchmark, is: what about LLMs evaluating
LLMs? There's been some good research,
some good findings, on whether an LLM can
actually evaluate another LLM.
This is most likely a trial-and-error
approach: making sure you understand the
use case and using something like Chatbot Arena,
where you basically take the outputs generated from
two different models and see
whether these two models can compete with each other in generating
text. You can ask one model, hey, is this factually correct
based on this answer, and you can iterate
on that. I wouldn't put
too much stress on how efficient
or how correct these
judgments would be, but considering
that most models were trained on somewhat similar
datasets, you should expect the results to be somewhat subjective
while still giving you a good insight into how one model
would perform against another.
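As a rough sketch of that LLM-as-a-judge pattern, assuming the OpenAI Python client and an API key in the environment; the judge model name and the grading prompt are illustrative choices, not something prescribed here.

```python
# Hedged LLM-as-judge sketch; the judge model and prompt wording are illustrative.
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

question = "When did the Apollo 11 mission land on the Moon?"
candidate_answer = "Apollo 11 landed on the Moon on July 20, 1969."

judge_prompt = (
    "You are grading another model's answer.\n"
    f"Question: {question}\n"
    f"Answer: {candidate_answer}\n"
    "Is the answer factually correct? Reply 'correct' or 'incorrect' with one sentence of reasoning."
)

response = client.chat.completions.create(
    model="gpt-4o-mini",  # illustrative judge model
    messages=[{"role": "user", "content": judge_prompt}],
)
print(response.choices[0].message.content)
```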
The other approach is definitely the metrics that
we discussed. So you
choose your own metrics, you compare those metrics, you have a
definite set of metrics that you've iterated on so far,
you have a dataset that you can test against, and you can
really put that into a number, or into a perspective,
so that even your customers,
people in your management, people you're working
with, or even the scientists you're working with, have
a good benchmark of where they are right now
and where they are looking to be.
So, the human touch, the ultimate judge: this is what human evaluation
does, what a multifaceted approach will
give you. Despite the rise of all of these automated
metrics, I believe that we
should also be looking at the human aspect of it and be
really sure that we are closing the gap between what we
are evaluating and what is necessary.
So that, instead of chasing the next
cool model, the next best publicly
available model, we make an informed
decision on whether we are falling behind just because
there was a new model launch, or whether we are falling behind because
the dataset that we fine-tuned on isn't that good,
or maybe we need more of that dataset, or maybe we
are fine with the metrics that we have. By slowly
iterating over that, we'll get good at the game and we'll
be able to clearly determine what is essentially required.
Moving ahead, these are some available frameworks.
There's RAGAS, and there's HELM, which is a really good benchmark;
again, these benchmarks will tell you how a
particular model performs on a particular large
corpus of datasets. There's LangSmith by LangChain,
with or without the Weights & Biases
integration, for running your evals. There's OpenAI Evals,
which gives you a good idea of what evaluation frameworks
or what other evaluation metrics are available. There's DeepEval,
and there's the LM Evaluation Harness that I've really grown
kind of fond of; it does a really good job of
letting you evaluate a particular model on a
public dataset with very little overhead.
But, as I said, there's some
overhead in making it work for your own custom use case,
and that's where I found that simply using basic
Hugging Face functions to load your dataset and calculate
scores is easier
and more efficient to use.
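Here is a minimal sketch of that "just use basic Hugging Face functions" route: loading your own golden set with datasets and scoring with evaluate. The file name, column names, and the stub model call are assumptions for illustration.

```python
# Minimal sketch: load a local golden set and score it with basic Hugging Face
# functions. The file name, columns, and stub model call are illustrative only.
import evaluate
from datasets import load_dataset

def my_model_generate(prompt: str) -> str:
    # Placeholder: call your hosted model or an API here.
    return "stub output"

golden = load_dataset("json", data_files="golden_set.jsonl", split="train")
bleu = evaluate.load("bleu")

predictions = [my_model_generate(example["prompt"]) for example in golden]
references = [[example["reference"]] for example in golden]

print(bleu.compute(predictions=predictions, references=references))
```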
So those are the frameworks that are available that we can try and test. At the end, all you
need is a test or eval set and
a multifaceted approach to how
you use that existing test or eval
set, or whatever existing dataset you have or you
will create, and to understanding
how you can develop a particular set of metrics
around it that you can use. So the foundation is
going to be your test set or evaluation set, plus the
properties, some of which are internal to the model or
to the framework that you're using, like perplexity, which can be
calculated from the outputs that you get, or any
existing implementations for any
particular public task, or any specific task that you're
looking at. The benefit of this
is that you really control what sort of metrics
are used in a particular evaluation, what sort of metrics
you can quantify your application against.
You can also have an established set
of datasets that helps you over the long term in understanding
how the model performed over a certain duration,
and what you can do better after.
Let's say you fine-tuned the model multiple times
or you've changed the model; this gives you a good history of all
the metrics that you can make decisions
on. I guess that's probably
it. That's what I want to discuss. And I
would in general say that adapting public libraries or
public frameworks like Hugging Face Evaluate or the LM Evaluation
Harness is a really good start at first, to get
metrics like aggregated F1 scores,
BLEU scores, or ROUGE
scores, or all of these scores, to decide on
or define a particular evaluation framework for
a RAG-based flow. These will work seamlessly
with your chosen datasets. Also, of course,
including a human evaluation flow in
the application and really comparing
what the human evaluation results are with
what you're getting against a certain
benchmark will be an overall
good strategy to evaluate large language models
on. In conclusion, we've,
in a fairly abstract manner, explored how
you can evaluate and why a particular type of evaluation
is necessary. At the end, it is not just
about whether you get a score out of it
or not. It's not about establishing a particular metric and,
you know, charting out six
months of metric data and just
saying, hey, this is the model that works for me. It's about
establishing a foundation so that, whenever you're developing
iteratively, whenever you're managing
these large language models, you can continuously
improve on the remarkable research that
has already gone into putting these large language models out in public with
their weights. And by leveraging the right metrics,
you can unlock the potential of these models
for your specific use case and determine
whether a particular model works for you, or, if it doesn't
work for you, why it doesn't.
And, in general, it's about understanding and delivering
real-world benefits to the user. As this field
grows, there's more and more new research going into
it. We'll probably move ahead, or we've probably already moved
ahead, beyond these basic scores into a
more complex understanding of context,
of whether a particular model understands a particular context, grammar,
and all of these ideas. And it
is going to be really exciting to see what more
metrics or frameworks come up so
that we can evaluate large language models better.
So yeah, that's it from me, and I
hope you got a really good starting point. This was meant
to be just an introduction to
what kind of frameworks are available, what you can do with them, and which
approaches work really well. So I hope you enjoyed
this, you learned something, or maybe you confirmed
something you already knew, and I'd be happy to connect
and explore things more in depth, given the time constraint
that we have. If you have any questions, or if you want to really
dive deep into any of these concepts, or into why you
should make a particular choice, or into what my previous experiences
have been, you can connect with me on LinkedIn and we can discuss
that as well. Again, thank you, Conf42, for giving
me this platform to explain this. I've learned a lot while
researching this particular topic, beyond my previous experience, and
I hope you learned something too.