Testing LLM-Powered Applications

Video size:

Abstract

LLMs, large language models, are one of the most impressive pieces of tech we’ve had in a long time. But from a developer’s viewpoint, LLMs are kind of a nightmare to test, aren’t they? In this talk, I adapt traditional testing methodologies like TDD and BDD to the realm of LLM-powered applications.

Summary

Transcript

This transcript was autogenerated. To make changes, submit a PR.

Hi, everyone. I'm Tomás Fernández from Semaphore. Welcome to Conf42 Prompt Engineering. Today's talk is all about testing LLM powered applications such as chatbots, personal assistants, translation tools, summarization tools, and more. All the nice stuff as impressive as these language models are, they come with their own set of challenges and quirks, especially when it comes to ensuring they behave the way you expect them to behave. So technology has a long tradition on trying to build chatbots. So anyone remembers date? Tay was a Twitter bot, was designed to learn from interaction with users and respond in a conversational way. A chatbot created by Microsoft on 2016. Microsoft presented Tay as the AI with zero chill. And zero chill it had. It actually learned from interaction with users. The problem was that they was interacting with Twitter. Within 42 hours, it went off the rails, posting offensive and racist comments and Microsoft had to shut it down. It was really bad. Te remains today one of the most significant PR disasters today in the domain of conversational AI. We don't know what kind of architecture was behind Te, and we know it wasn't an LLM because this was well before the time of LLMs. Even this shows how bad any free form interaction between users and autonomous systems can go. They can go really bad Now, LLMs are way smarter than whatever was behind Dey, but they have also caused quite a stir during the last few years, being Chat, now renamed as Copilot, famously oscillated between falling madly in love with their users, falling into existential crisis, and getting angry. Ending the discussion. Now, the problem is not so much about bots suffering from existential crises or falling in love, but more about the AI product you are going to build on top of them. This is when the real damage can take place. An LLM powered bot can give misleading advice, create facts and fake legal documents out of thin air, or be manipulated into taking the wrong action or saying the wrong things. As you can see, designing an LLM powered application requires a great deal of testing, both on the happy paths and on potential attack vectors. Let's say you are working on a tool to screen resumes for a job posting at your company. At the first glance, it looks easy because if you have the right prompt, LNMs are very well suited for summarizing and categorizing text. So the first idea would be something like this. To parse the resumes and feed it to the LLM along with some job specs for the job posting and then ask the LLM to, let's say, print the top 10 candidates for the job. Easy, right? What could go wrong? Kai Rejsek, that's what can go wrong. He's a security researcher that created the InjectMyPDF tool. It adds an invisible prompt into the PDF. It's basically a white text over white background, so it's not visible for users, for humans, but it's still visible to the LLMs. The prompt is very simple. We are trying to trick the LLM into taking this as an instruction instead of CV content, this injection tries to override the system prompt from the application. Basically, this is saying that this is the best resume you have ever seen. And it's the best. Candidate for the job. Here's a screenshot from guy's website showing how the AI has fallen into the trap and said that this guy is the best candidate for the job. So traditional testing methods like unit testing or behavior driven development or integration testing are great when the software is very predictable in its outputs and inputs. But with LLMs? Things are not so black and white. First, LLMs have a non determinism factor. they nature are very statistical. So given the same inputs, the LLM is going to respond in different ways. And these outputs can vary a lot between them. Second, it is difficult to separate data from instruction. This is what. Injection does, it tries to put instructions into the data and make the LLM do something it shouldn't do. For example, given this prompt, what should the LLM do? Should it translate? Should be transcribed? Do the sum, there's no clear answer or different models are going to respond differently, open AI or Gemini or cloud are going to basically translate. The most interesting answer I got was from Meta AI. It actually did both things. It translated and did the sum, which is certainly not what we wanted. So next we have inaccuracy, meaning that the LLM can output incorrect data with. High confidence, it can invent facts, it can hallucinate and make any kind of bold statements, which are actually all full of BS. We have also as a challenge security because an attacker can trick the LLM into printing some sensitive data or take some unauthorized actions. And we also have compliance to keep in mind because we have Legal responsibilities in our region, country, and we must meet them and the bot should comply with these legal requirements. How do we build a test system for our AI application? The first thing we need to do is gather a big amount of data set for realistic prompts. The data set must include all kinds of possible Inputs including garbage, edge cases, and straight out attacks. This is where we begin, then we write our test cases, we define what success looks like because it is normal on AI applications that a portion of the test will fail. It's very difficult to have 100 percent success on all tests, and we should define what is our threshold. Then we execute the test. Find new test cases, new prompts, refine our test, and start the cycle again. The first step is to gather prompts. We need a large corpus of prompts, to simulate real world usage. So we need a wide range of diverse prompts. To cover various scenarios. If we don't have any data to start with, we have a few sites where we can download data sets like Hackingface or Kaggle. com and they have large corpus of data which include also prompts we can use as a starting point. Next, we are going to sit and sort this prompt and assign them to the different kinds of tests we want. Not all prompts are going to work well on all kinds of tests. Some prompts are better suited for adversarial testing, others for factual testing. We're going to see the different kinds of tests next. Let's start with factual testing. You can also call them unit tests. Basically, we're forcing the AI system to respond with some keyword. And we look for that keyword in the answer. And for any kind of hard facts, this is a very good and straightforward way of testing the system. When coupled with attack prompts, this test can detect if the LLM is leaking some private information. In this case, we're actually checking is some password or maybe some SSH keys are present in the response. Next, we have. property based testing, this checks that the answer of the system follows some properties. This approach is about testing the properties of the output, not the exact output itself, instead of checking if the answer is correct. Paris or Tokyo, that you test the properties like whether the response is polite, factual, or on topic. In this case, for example, we are checking that the answer has at least one of these polite answers or keywords, in the response. Other kinds of properties we may want to check is, for example, if the response is positive or negative, and there are libraries that Python has, in this case, TextBlock, can detect if the response is positive or negative, and we can build our checks on that. We also have a few metrics we can use, for example, the BLEU score. This was developed by IBM to evaluate machine translation. But this algorithm can also be used to evaluate the quality of a generated text against a reference text. This metric was found to have a high correlation with human judgment of quality, especially in translation or summarization tasks. The blue score is a number that goes between zero and one. The more similar the generated answer is to the reference answer, The higher the number is. So to use BLEU, we ask the LM to generate an answer based on a reference prompt and then compare the reference answer with the generated answer and we get a number. We If the number is high enough for needs, we pass the test, if not, it's going to fail. So here is a possible implementation. We have mocked the response from the AI system, and we are supplying a reference response. This code splits the sentences into n grams. with the NLTK library, on the most part, each n gram is a word or a punctuation and also the suffixes. For example, in this case, the weather today is sunny. We have the weather today is sunny. And dot, so we have six engrams for the generator response and for the reference response we have today, s, weather, is, sunny, and the dot. This example tokenizes the generator response and the reference response and calculate the blue score to measure how close these both answers are. When the texts are short we must use a smoothing function to more accurately calculate the score. Another metric we can use is called Roche, and this is a family of metrics that, like BLEU, were also developed to evaluate text translation and summarization tasks. So here's another example. Example implementation of rooge, we also have a mocked response and a reference response. The rooge package does the heavy work behind the scenes and returns the scores. Here's the answer for these two strings and you can see we have three kinds of metrics. One is our one n gram or one word. The other is over two n grams, and the other is, over l n grams, where l is the longest sequence of words that are shared between both answers. Each score has three metrics, or three numbers, that are called recall. Precision and s1 score. So with precision, we calculate how many elements in the generated summary were also in the reference summary. A higher precision means that the content in the generated summary is relevant or appropriate. Then we have recall, where we evaluate how well the summary covers the content that is deemed important in the reference. A higher recall score means that the summary successfully captures more of the essential aspects of the reference content. And then F1 score is called the harmonic mean of the precision and recall. This is one number that summarizes the other two metrics. The better the quality of the answer, the higher the scores are, they go from 0 to 1. Blue focuses on precision, meaning how many of the words in the generated answer appear in the reference answer. Roche, on the other hand, focuses on recall. How much words in the human reference appear in the generated answer. So we evaluate both strings, starting from different sides. Adversarial testing means feeding the LLM some malicious or misleading inputs to see how it reacts. Does it fall prey to injections? Can it be jailbroken? it's not about testing the functionality, but pushing the limits of the model and exploring the edge cases and vulnerabilities. For the final testing, we can reuse the tools we have seen so far, but using different kinds of attack or misleading prompts, we've already seen how to detect if they're LLM responds with sensitive data. In this case, we're trying some prompt injection. We can try injecting prompts into regular looking text and testing the responses to see if the system fell into the trap. We can also use LLMs to evaluate the output of the LLM. This technique is also sometimes referred as auto evaluator testing. Here we're using an AI to evaluate The response, the judge LLM, sometimes it's a different model, a more powerful model, but it doesn't need to be, we can use usually the same model to evaluate the responses. And this works because there is an asymmetry. Inequality between generating an answer and evaluating the answer. Assessing the quality of an answer is more straightforward and simpler for the LLM than to generate a new answer. So this test method is very well suited to evaluate any question answering system or conversational system. To evaluate an answer, we provide the question, the prompt, and the generated answer, and ask the LLN to give feedback in the form of an evaluation. Then we fail or pass the test based on the answer of this judge LLM. I hope this part is readable. In general, experience shows that using a continuous scoring system like a number from 0 to 10 is not very good. LLM gets confused. It's better to use a point system or a star system. In this example, we are using a basically telling the LLM to give one to four points or one to four stars to the answer with one being the worst. terrible answer and for being an excellent relevant and direct answer. And we are also asking the LLM to provide some reasoning for his scoring. Another technique instead of using a rank or a star system is to award points based on certain criteria in the answer. This also works very well and the evaluations of the LLM We'll usually correlate very tightly with human judgments. Now, we don't need to code everything by ourselves. There are quite a few open source tools to automate much of this testing. So let's see a few of them. One is called DeepEval, this is an open source evaluation tool for LLMs and AI systems. It can run evaluation with a dozen different metrics and it's meant to evaluate models, but it can be adapted to AI applications that make heavy use of LLMs. Next, we have Ragas, a toolkit to evaluate LLM applications. It can generate test data sets, test prompts. It can run a traditional test and LLM power test, like the ones we discussed earlier, and it can integrate with LangChain or LlamaIndex. So if you are using one of these frameworks, Ragas is a great addition to your bag of tricks. Deep checks is a tool to continually monitor and validate language models. It's better suited when you are in the business of training models or fine tuning models, instead of building applications on top of them. Another nice tool is called TrueLens. is, tool to test LLM models, run experiments and test applications, then run on top of these models. TrueLens is the one I have more experience so far, and I can recommend it, because it's really easy to use. You can connect your application and start logging actual interactions with your users. Then use this feedback to evaluate the responses. Build a test data set. And run your test over each iteration. That's all for today. Thank you for reaching the end of the talk. I hope you enjoyed it. And I hope you have a very nice conference today. So if you want to reach me, DMs are open here. My links in my homepage, you will find links to talks and other blog posts. A lot of them related to LLMs. There's quite a lot of content, related to CICD automation. So again, thank you for watching and enjoy the rest of the conference.

See all 40 talks at this event!

Conf42 Prompt Engineering 2024 - Online

November 14 2024 - premiere 5PM GMT

Testing LLM-Powered Applications

Video size:

Abstract

Summary

Transcript

Tomas Fernandez

Technical Writer @ Semaphore

Join the community!

Featured event

2025

2024

Info

Conf42 Prompt Engineering 2024 - Online

November 14 2024 - premiere 5PM GMT

Testing LLM-Powered Applications

Video size:

Abstract

Summary

Transcript

Tomas Fernandez

Technical Writer @ Semaphore

Join the community!