Transcript
This transcript was autogenerated. To make changes, submit a PR.
Last year, like a lot of people, I spent part of a weekend
making a little demo of a retrieval-augmented generation,
or RAG, chatbot. This is a chatbot that uses an LLM
to answer questions about a specific set of documents,
where those documents aren't necessarily in the LLM's training data.
So the user asks something, your tool searches to find related
text in your documents, and then it passes those chunks of
text to an LLM to use in answering the question.
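To make that flow concrete, here's a minimal sketch of the retrieve-then-generate loop; both helper functions are hypothetical stand-ins rather than any particular library's API, and you'd swap in your real vector search and LLM client.

```python
# Minimal sketch of the retrieve-then-generate flow described above.
# Both helpers are hypothetical stand-ins for your real search and LLM client.

def search_documents(question: str, top_k: int = 5) -> list[str]:
    """Stand-in retriever: return the top_k document chunks most related to the question."""
    return ["<chunk about the topic>", "<another related chunk>"]

def call_llm(prompt: str) -> str:
    """Stand-in for whatever LLM client you use (OpenAI, an open-source model, etc.)."""
    return "<model's answer>"

def answer_question(question: str) -> str:
    chunks = search_documents(question)
    context = "\n\n".join(chunks)
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )
    return call_llm(prompt)
```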
This was my weekend demo project. It used LangChain,
it used Gradio. It was probably 80% code
I took from someone else's Colab notebook. But my code was a little
cleaner and better organized, at least. But the actual chatbot,
guys, was terrible. It didn't work. It was supposed to
answer questions about Shiny for Python, which is a Python library
for building dashboards that was too new to have its documentation in
GPT-4's training data. But my chatbot was making things up. It was getting
stuff wrong, and just generally it wasn't usable.
And I could tell you it wasn't usable from, you know, trying to use
it, but I couldn't tell you how unusable it was. I couldn't
rate its lack of usability because I didn't have an evaluation strategy.
Then I made another one, this time to answer questions about the
California traffic code. I scraped the data from the website.
I fought with it a little to get it parsed appropriately. And then
this one was built with Streamlit, and someone else built a nice interface for it,
and this one was somewhat better. But when we were trying to make decisions about
things like what open source model should we use as our underlying
LLM, I still didn't have an assessment strategy for figuring
this out. How were we supposed to tell if it was working when we
demoed it to potential clients? Was the best we could do just to ask
some questions and hope that whatever questions the client asked were things it could
answer? But now I've been working on LLM evals
for a while, and I've spent a lot of time evaluating
LLM output in different ways. And so I have at least the beginning
of a strategy for how to evaluate the RAG chatbot
or other generative AI tool that you're building
for customers, or maybe internally, to do a specific thing.
And I'm going to tell you about that today.
First, why are we automating testing?
Well, look, why do we ever automate testing? We automate
testing because you're going to break things, and you want to find out that you
broke it when you push your code to a branch that isn't your main branch,
and before you merge that code in and your product goes boom.
We automate testing because we're human and therefore we're fallible.
But in the context of your LLM tools specifically, we also
automate testing because there are choices you're going to make about your
tool, and you want to have quick feedback about how it works,
or could work. For instance, like I mentioned, if you're trying to decide what underlying
LLM to use, or what broad system prompt to use,
if that's relevant for you. When you make or consider changes,
you want to know how they affect performance. And the more your tool
is doing, like with any kind of software, the less feasible it
becomes to test things manually. What I was doing with my California
traffic code chatbot, which was taking a series of questions,
running them through multiple models, and then looking at the responses myself,
it's not the worst, but it's not the greatest either.
Let's talk next about how to automate testing broadly.
We test to make sure that our tools are doing what we want them to
do. So what do you want your tool to do?
What are some questions you want it to be able to answer? What does a
good answer look like? What does a bad answer look like?
Now test that it's doing that. Easy,
right? Just test that it's doing what you want it to be doing. We do
that all the time for machine learning problems generally, and NLP
specifically. But actually, okay, this is not that easy.
It's actually pretty hard. Why is it hard? It's hard because
text is high dimensional data. It's complicated, it has a
lot of features. And with generative AI, like with a chatbot,
we're not talking about a classification model where the result is
pass or fail, spam or not spam.
Now, as a digression, you can also use large language models
to build classification tools or do entity extraction,
and they're really good at them. And that's my favorite use
case for LLMs, in part because you can evaluate them super
easily, the same way we've always evaluated these
kinds of tools: by comparing the model output with our ground truth
labeled data. So if you have the opportunity to do that
instead, oh my gosh, do that.
Evaluate it with a confusion matrix, call it
a day. But everyone wants a chatbot,
so that's what we're talking about.
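And if you do get to take that route, the evaluation really is the classic one. Here's a quick sketch using scikit-learn with made-up labels, just to show the shape of it.

```python
# Sketch: evaluating an LLM used as a classifier the classic way,
# by comparing its predicted labels against ground truth labels.
from sklearn.metrics import classification_report, confusion_matrix

y_true = ["spam", "not spam", "spam", "not spam", "spam"]      # ground truth labels
y_pred = ["spam", "not spam", "not spam", "not spam", "spam"]  # what the LLM predicted

print(confusion_matrix(y_true, y_pred, labels=["spam", "not spam"]))
print(classification_report(y_true, y_pred))
```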
And in the case of chatbots,
we're asking a question and getting back a whole sentence, or a paragraph.
How can we assess that? Well, we have a few options
which I'll go through, but first I want to note that the
purpose of this kind of testing is not to comprehensively test everything
someone might ask your tool about for accuracy. If you were going
to generate every possible question someone could ask of your tool,
as well as criteria for evaluating answers that are
specific to that question, then you wouldn't need a generative tool,
you would need a frequently asked questions page, and then some search functionality.
So the purpose of this instead is to select some of the types of
questions you want your tool to be able to answer and then test
the content of those. So, first option: string
matching. We've got some choices here. We can look for an exact
match, like wanting the answer to be an exact sentence,
or to contain a particular substring, like checking that if we ask it for the
capital of France, "Paris" is somewhere in that response.
We can use regular expressions, or regex, if there's a
pattern we want, like if we want a particular substring,
but only if it's a standalone word, not part of a bigger
word. We can measure edit distance, or how syntactically
close two pieces of text are, like how many characters we
have to flip to get from one string to another.
Or we can do a variation of that exact matching where we want to
find a list of keywords rather than just one.
Here's an example. Here we have a little unit test where we do
a call to our tool's API. We pass a question, we get
back a response, and then we test to see if there's something formatted
like an email address in it.
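In code, that kind of check might look something like the pytest-style sketch below. The `ask_tool` function is a hypothetical stand-in for however you actually call your tool's API, and the questions, patterns, and keywords are just illustrative.

```python
# Sketch of string-matching checks; ask_tool is a stand-in for a real call
# to your tool's API (for example, an HTTP request to your chatbot endpoint).
import re

def ask_tool(question: str) -> str:
    """Stand-in: replace with a real call to your tool's API."""
    return "You can reach us by email at support@example.com. Paris is the capital of France."

def test_contains_substring():
    answer = ask_tool("What is the capital of France?")
    assert "Paris" in answer  # exact substring match

def test_contains_email_address():
    answer = ask_tool("How do I contact support?")
    # regex check: is anything formatted like an email address in the response?
    assert re.search(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b", answer)

def test_contains_keywords():
    answer = ask_tool("How do I contact support?")
    # keyword-list variant: every keyword has to appear somewhere in the response
    for keyword in ["support", "email"]:
        assert keyword in answer.lower()
```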
So, ship it? Does that look good? Is this a good way of evaluating high-dimensional
text data to see if it's got the answer we want?
Nope, this isn't great. There's a lot we just can't do in terms
of text evaluation with string matching. Maybe there
are some test cases you can write like, okay, if you want very short
factual answers, you can do this, but in general,
don't ship it. Next, we have semantic similarity.
With semantic similarity we can test how close one string
is to another string in a way that takes into account both synonyms and
context. There are a lot of small models we can use to
project our text into, say, 768-dimensional space, which is
actually a major reduction in dimensionality, and then we can take the distance
between those two strings: the response we got from
our model versus the response we wanted to get from it.
Here's an example. This isn't exactly real, but basically you download
a model that's going to do your tokenizing. So that'll break your text up.
You hit your tool's API with a particular prompt to get a response.
You project that response into your n-dimensional space in your
model, and you project the target text that you wanted your tool to
say into that space as well. And then you compare the two. Your test
then uses a similarity threshold for how close
they were, and then tells you if it passed or not, or if the two
texts were sufficiently close using that model.
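As a rough sketch, that flow might look like the following using the sentence-transformers library; the model name, target text, and threshold are illustrative choices rather than anything prescribed in the talk, and `ask_tool` is again a stand-in for your tool's API.

```python
# Sketch of a semantic-similarity test using sentence-transformers
# (one reasonable embedding library; the talk doesn't prescribe a specific one).
from sentence_transformers import SentenceTransformer, util

embedder = SentenceTransformer("all-mpnet-base-v2")  # embeds text into 768 dimensions

def ask_tool(question: str) -> str:
    """Stand-in: replace with a real call to your tool's API."""
    return "In a residence district the limit is twenty-five miles per hour."

def test_semantically_close():
    response = ask_tool("What is the speed limit in a residential district?")
    target = "The speed limit in a residential district is 25 mph."  # illustrative target text

    embeddings = embedder.encode([response, target])
    similarity = util.cos_sim(embeddings[0], embeddings[1]).item()

    # The threshold is a judgment call; tune it against examples you've graded by hand.
    assert similarity > 0.75
```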
So, ship it? Again, I don't want to say never,
but there's a lot of nuance you're not necessarily going to capture with similarity,
and that's especially true as your text responses get longer. Something can be important
and you can miss it. Okay, so finally we
have LLM-led evaluations. That's where you write a
specific test for what you're looking for, and you let your LLM or
an LLM do your evaluation for you.
And this doesn't need to be the large language model you're using
behind your tool. For instance, maybe you use an open source
model or smaller LLM to power your actual chatbot,
say, to save money. You might still want to use GPT-4
for your test cases because it's still going to be pennies or less to
run them each time. So what does this look like?
It looks like whatever you want it to. Here's one I used
to evaluate text closeness, so this would be: how close is the
text that the tool output to the text you wanted to see? And this gives
back an answer on a scale of one to ten. Here's another one.
You can write an actual grading rubric for each of your tests.
This is a grading rubric for a set of instructions where it passes
if it contains all seven steps and it fails otherwise.
I'm using a package called Marvin AI, which I highly recommend,
and that makes getting precise, structured outputs from OpenAI
models really as easy as writing this rubric. You can also write
rubrics which return scores instead of pass/fail, and then
you can set a threshold for your test passing. For instance, you could make
this pass if it contained four, five, six,
or seven of the steps and fail otherwise.
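The talk's rubric examples use Marvin, but to make the moving parts visible, here is a hand-rolled sketch of the same idea that calls the OpenAI chat API directly; the rubric text, model name, and example answer are placeholders, not the talk's actual rubric.

```python
# Hand-rolled sketch of an LLM-led rubric check (the talk uses Marvin AI for this;
# this version calls the OpenAI chat API directly so the moving parts are visible).
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in your environment

RUBRIC = """You are grading a chatbot's answer against a rubric.
The answer PASSES only if it includes all seven steps of the installation
instructions (the actual steps would be listed here).
Respond with exactly one word: PASS or FAIL."""

def grade_with_rubric(answer: str) -> bool:
    completion = client.chat.completions.create(
        model="gpt-4o",  # the grader doesn't have to be the model behind your tool
        messages=[
            {"role": "system", "content": RUBRIC},
            {"role": "user", "content": answer},
        ],
    )
    verdict = completion.choices[0].message.content.strip().upper()
    return verdict.startswith("PASS")

example_answer = "Step 1: ... Step 7: ..."  # whatever your tool actually returned
print(grade_with_rubric(example_answer))
```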
Again, this is a level of detail which you can't get using
string matching or semantic similarity. I'm doing
some other work involving LLM-led evals, and so I'll
show you another example of how we can get pretty complicated with these. There's
this logic problem about transporting a wolf, a cabbage, and a
goat, but you can only bring one at a time. And the goat can't be left
alone with either the wolf or the cabbage, because something will
get eaten. If you swap out those names, the goat,
cabbage, and wolf, for other things, some of the LLMs get confused
and can't answer accurately. On screen is a
rubric for using an LLM to evaluate text to see
did it pass or fail the question. And what it's possible
to do here is write a rubric that works for both the five-step
answer and the seven-step answer, the difference being the seven-step
one has two steps where you're traveling or teleporting alone without
an object. And the LLM, if you pass it the rubric,
is capable enough to accurately rate each answer as passing
or failing. You can do the same thing with a select
set of questions you want your generative tools to be able to answer.
You can write the question, you can write the rubric, and you can
play with different LLMs, system prompts, etcetera,
any parameter your tool uses to see how it changes the
rate of accuracy.
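A sweep like that might look something like the sketch below, reusing the hypothetical `ask_tool` and `grade_with_rubric` helpers from the earlier sketches, with `ask_tool` assumed here to accept a model name; the models and questions are just illustrative.

```python
# Sketch of sweeping a parameter (here, the underlying model) across a fixed
# question set and comparing rubric pass rates. ask_tool and grade_with_rubric
# are the hypothetical stand-ins from the earlier sketches, with ask_tool
# assumed here to take the model name as a parameter.

QUESTIONS = [
    "How do I install the package?",
    "What license is it released under?",
]

MODELS_TO_TRY = ["gpt-4o-mini", "some-open-source-model"]

for model_name in MODELS_TO_TRY:
    passed = 0
    for question in QUESTIONS:
        answer = ask_tool(question, model=model_name)
        passed += grade_with_rubric(answer)  # True counts as 1
    print(f"{model_name}: {passed}/{len(QUESTIONS)} rubric checks passed")
```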
It's not a substitute for doing user testing, but it can complement it. But you're also
going to want more flexible testing. That is, testing that doesn't
rely on having specific rubrics and therefore can be
done on the fly with new questions. And there are also
tools for that. For instance, you can use LLM-led
evals to see if your RAG chatbot is using
your documents to answer the questions versus if it's making
things up. That is, we're asking the LLM, in this
case: was the answer that my tool gave based only
on information that was contained in the context that I
passed to it? And then we give it both the context,
so the document chunks, and we give it the answer that the tool
actually gave. That's really useful. And you can run it on
your log file; you can even run it in real time as a step
in your tool, and then you don't give a response to your user that
didn't pass this test.
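A rough hand-rolled version of that faithfulness check might look like this; tools like Athina AI provide this kind of eval off the shelf, so treat the prompt wording and the simple YES/NO setup as an illustration of the idea, not their implementation. It reuses the OpenAI `client` from the rubric sketch above.

```python
# Rough sketch of an LLM-led faithfulness check: was the answer based only on
# the retrieved context that was passed to the tool? Reuses the OpenAI `client`
# from the rubric sketch above; the prompt wording is illustrative.

FAITHFULNESS_PROMPT = """Below are the document chunks a chatbot was given as context,
followed by the answer it produced. Was the answer based only on information
contained in the context? Respond with exactly one word: YES or NO.

Context:
{context}

Answer:
{answer}"""

def is_faithful(context: str, answer: str) -> bool:
    prompt = FAITHFULNESS_PROMPT.format(context=context, answer=answer)
    completion = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
    )
    return completion.choices[0].message.content.strip().upper().startswith("YES")
```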
You can also use LLM-led evals to assess completeness. That is, did the response
fully answer the question? Now, neither of
these will necessarily get you accuracy. For accuracy, you need
to, you know, define what accuracy means, and that's individual to
each question. But these can still get you a lot. And again,
the strength of these is you can run them on any question,
even in real time. I got these from Athina AI,
which is a startup in this space, but there are other
companies in the space as well, doing novel and cool things with
monitoring LLMs in production. I do think you can roll your own
on a lot of these evals, but if you don't want
to, you don't have to. So, ship it?
Yeah, totally ship it. Write some tests and treat your
LLM like real software, because it is real software.
Before I go, I want to again mention real quick two products that
I'm in no way affiliated with but that I think are doing cool stuff.
So Marvin AI is a Python library where you can very quickly
write evaluation rubrics to do classification or scoring,
and it'll manage both the API interactions and also will
transform your rubrics into full prompts to then send to
the OpenAI models. And then Athina AI,
that's Athena with an I in the middle, is doing some really cool things with
LLM-led evals, including specifically for RAG
applications. This is me on LinkedIn. Please get in
touch if you're interested. And here's my Substack. I wrote something on
this topic specifically, but I'm also writing about LLM evals
in general, red teaming, other data science topics,
et cetera. Thank you for coming.