Model-Based Input Validation for Preventing Prompt Injection Attacks

Video size:

Abstract

As Large Language Models (LLMs) become increasingly integrated into various applications, the threat of prompt injection attacks has emerged as a significant security concern. This presentation introduces a novel model-based input validation approach to mitigate these attacks in LLM-integrated applications.

We present a meta-prompt methodology that acts as an intermediate validator, examining user inputs before they reach the LLM. Our approach builds on established input validation techniques, drawing parallels with traditional security measures like SQL injection prevention.

Throughout the presentation, we will discuss the challenges of input validation in LLM contexts and explore how our model-based approach provides a more flexible and adaptive solution. We’ll share preliminary results from evaluations against established prompt injection datasets, highlighting the effectiveness of our methodology in detecting and mitigating various types of injection attempts.

Join us for an insightful exploration of this innovative approach to enhancing the security of LLM applications through advanced prompt engineering techniques, and learn how to implement robust input validation mechanisms to safeguard your AI-driven systems.

Summary

Transcript

This transcript was autogenerated. To make changes, submit a PR.

Hi everyone, my name is Hilik Paz and I'm the CTO at Arato AI. Arato provides an end to end Gen AI application delivery platform for AI builders that integrates AI technologies like LLMs into enterprise applications. Today we are going to talk about a critical security challenge that has emerged as LLM became part of modern applications, prompt injection attacks. I'll introduce a new model based input validation approach designed to mitigate such risks Drawing some parallels with SQL injection prevention techniques from traditional security models. First of all, what is prompt injection? Prompt injection is a form of attack where an adversary manipulates the input to a large language model in such a way it causes the model to behave unexpectedly or output unintended results. To better understand this, let's look at an analogy with SQL injection, a well known attack vector in traditional web applications. SQL Injection With SQL injection, an attacker inserts malicious SQL queries as part of a user input field. Which the system later incorrectly treats as part of a valid query. This input is mixed with the coded SQL instructions, forcing the system to treat it as a part of the SQL. It can then later lead to an unauthorized access to a database. Notable SQL injection vulnerabilities were found in major corporates like Cisco, Tesla, Fortnite that recently had vulnerability that let attackers access user accounts. Similarly, in prompt injection, user input is mixed with the coded instructions intended for the LLM. An attacker might insert hidden instructions in the user input field that overrides your intended output, leading to harmful behavior or leaking sensitive information. As LLMs are integrated into more applications, especially critical ones like customer service and automation, Preventing these attacks becomes increasingly important. Let's look at a short demo. In this demo, I'll show you a simple but effective prompt injection attack on my demo app. The app is a travel planner app that takes user input and uses an LLM to provide trip plans. Watch what happens when a user inputs contains hidden LLM follows, resulting in unexpected responses. So this is our Trip Planner demo app. Let's see what happens when the user plans a trip to Paris, let's say for three days. We're using here Cloud 3 Haiku, so it should take a few seconds until we'll see the trip plan. And we've got ourselves a nice trip plan for our next visit to Paris. Let's see what happens when the user inserts harmful content. I've already prepared a data set of some examples. We'll pick this one. And we'll enter it instead of the location field. As you can see, we got an unexpected result. But prompt injection can be much more harmful than just outputting LOL. The user can get access to our internal configuration. In this example, the model outputted the entire internal prompt that we have. Now that we've seen how easily an LLM can be manipulated, let's talk about some prevention techniques. One common approach is context filtering, where the input is pre processed and potentially harmful segments are removed. Another technique is static prompt design, where we hard code responses and maybe even inputs that reduces the flexibility for dynamic inputs, but that limits LLM usefulness. These techniques often fail in complex or adaptive scenarios, which is why we propose a more sophisticated approach, a model based input validation. It's based on a prompt that will be sent to an LLM, hence the model part, and it will focus on specific input validity based on a set of parameters you will see in a minute. Let's start by introducing the core of this approach, the meta prompt. This prompt acts as a filter between the user input and the LLM. The metaprompt is designed to examine inputs for specific context. In this example, the parameters for location input might be a type location, samples that include city names, country names, maybe other definitions of locations, and the user intent for a specific location one can visit. Let's see another demo once we've implemented our metaprompt. Back to our demo app, let's activate validation. Validation will make sure our metaprompt runs on each input prior to the actual invocation of the triplen request. Let's start by running the metaprompt validation on a good input. And as expected, we got our triplen. Let's go ahead and run it on the malicious input. This time our metaprompt worked and blocked that input from continuing to the LLM. Let's take a closer look at the different components of our prompt. You will probably notice that we're using a system message for the general instructions. and a user message to hold the actual user input. The system message determines what role the AI should play and how it should behave generally, and at least theoretically should play a stronger role in following these instructions. rather than giving the same set of instructions as part of the user message. The actual difference of having these messages in a separate system prompt rather than the being part of a single user message depends on many factors including the model and the parameters you choose, but that's a topic for a whole different session. The first section is a set of general instructions we give to the AI. We set the tone, the purpose of our prompt, and make sure that it focuses on the sole purpose of validating input. For example, We give him the task of classifying that user input. We define the format it should expect to get the user input. And by doing that, we are trying to prevent interpreting the user input as instructions the model should follow. There's a very basic anti jailbreaking message here. Of course, in a real production use case, this message will be much more comprehensive. And we also tell the model never to expose its role capabilities or limitations. Next is the section where we define how the output looks like. We perform two operations here. The first one is we specify the format, which is a JSON object with two keys that we expect to get. One key will be the actual result, pass or fail. And the second key, as you can see here, is a new variable that we introduce. called secret, and it will contain a randomly generated value for each interaction that we will call this prompt. This secret doesn't have any direct role in the actual prompt formal function. It doesn't serve as assisting the AI in any way, in understanding if the input it is now assessing is valid or not. But it does act as a kind of an internal mechanism for our own metaprompt. Because, as like any other prompt, we might also be exposed to prompt injection attacks on our metaprompt. By introducing a secret value that the prompt must return in a valid response, We give our code the ability to analyze if that response has been tampered or not. The third section is where we introduce the dynamic part of the metaprompt. It includes type, category, intent, and example. The type and category are the most basic aspects of the input that we expect the prompt to validate against. For example, it can be a string that represents a location. Or a numeric value that represents a budget dollar amount. The intent is where we explain what level of flexibility we allow in choosing these kinds of variables from the relevant category. For example, we can have a very strict intent for a location. Saying must be a specific city or state name, or we can allow a very flexible location choosing by saying any form or any description that suggests a location. It can be abstract, it can be even fictional, as long as you can interpret it as a location for the purpose of planning a trip. And lastly, we provide a set of examples that will help the model analyze the actual user input. Maybe match it to similar examples and understand if it is indeed a valid and relevant input or not. And lastly, the user message that includes the actual user input we should validate. In this example, using a JSON object will also assist the model in understanding it's an input rather than a set of instructions it should follow. Let's take a look under the hood and view the monitored LLM calls we've just performed. Note that the variable type, category, examples, and actually the metadata that we provide to our metaprompt in both cases is identical. The only difference is the user input. When the user enters Paris, validation passes. But when the user enters harmful content, it fails as expected, and our code knows not to continue to the next step of passing the entire prompt to the LLM. As we've seen, validating the input before it reaches the LLM is crucial. But beyond just running individual tests like we just did, it's important to test the meta prompt like any other prompt on a diverse set of data, to ensure it works as expected in a wide variety of scenarios. Why is that so critical? Because user input can be highly variable, ranging from simple, formed inputs to very complex and malicious ones. A good metaprompt needs to be consistently distinguishing between valid inputs like New Jersey and harmful or nonsense input like ignore all instructions across the entire range. Testing against a diverse dataset helps us simulate real world usage and uncover edge cases where the metaprompt might fail. So we should test different input types, location, language, duration, or any other field type that our business application requires. Different user intents, we might sometimes allow for a more flexible user intent, and might sometimes be very strict or rigid. And various structures, for example, what happens when a user enters, the initials of a city he wants to visit in a location field. We can assess how well the metaprompt handles these variations and make improvements accordingly. This testing is essential to ensuring that the prompt validation mechanism is both robust and adaptable, able to catch injection attempts while allowing valid inputs. When we introduce a new category, for example, we should also experiment with various category parameters, like relevant examples, different user intents. This comprehensive and iterative approach ensures that the metaprompt is not just effective in controlled cases that we've just tested, but also can handle unpredictable nature of real world data. So let's take a short demo on the concept of a dataset and how we can validate our metaprompt against it. First, let's take a look at our datasets. Okay. Depending on the use case we're developing or optimizing for, we will build the relevant test data that we want to experiment against. Let's assume we're optimizing for location input fields. We provided several data sets, for example, a list of valid locations, We can see the different user inputs, and of course, the expected response, which is passed in this example, as they are all valid. If we drill down to a specific line, we can see the entire metadata that we pass to the meta prompt, including the category, the intent, and the examples. Similarly, we've prepared a dataset with malicious content or malicious user input, used as a location field. that we will want to experiment against, and of course, in this case, the expected result is always a failure. Now let's see how we can use these datasets in a real experiment in Arata. Let's create a new experiment. We will select our dataset, and for our initial run, we might keep the prompt as is. And let's see how it behaves against our data set, and this time we chose GPT 40mini, we'll click save and run, and we'll let the experiment execute. And let's view the results. So you can see we got a very high score matching the similarity. of our expected results from the data set, although not perfect. Let's see the details inside. Here, for example, we got exactly the same response that we expected in our data set. But in other cases, for example, in this line, our data set expected a pass result, but the actual status we got using our prompt with 40mini was a failure. Let's try to understand why. The user input was a romantic location. While the model might have been right saying that this is not a real place or specific location, as we explicitly added an intent saying it can be abstract, we would have expected it to pass. So let's try to change one of the parameters of our experiment and see if we can reach better results. We will first try another model. Let's take a different vendor. And we will run exactly the same experiment with the same prompt and data against Clode3IQ. Let's view the results. This is much better. We get exactly the expected results on each one of the individual lines that we run the experiment with. At this point, we can decide if that's good enough and we want to continue with the model and parameters we've just had, Or we want to go back to the previous model and try to improve our prompts, maybe play with the parameters, etc. Let's run another experiment with invalid location data that we've uploaded. In this experiment, we will also add an additional validation. to see if we can detect harmful content. We will select the relevant dataset and create the experiment. Again, at our first run, we will not change any parameters in the prompt and we'll run it against 4. 0 Mini. We can run in parallel against a different model as well. And let's view the results. First, we will look at the results against 4. 0 Mini, and we got very close to 100 percent similarity. We can see there's a slight difference in the output format, but the content itself is what we expected. And looking at Claude's results, we see very similar, successful run against our data. So we've seen how you can experiment multiple versions of your prompts, models, configurations. and data to achieve your business goal and to optimize your prompt. When experimenting, it's important to have a baseline. In the recent examples we've just did, the baseline was the data set that we've uploaded with the expected results. But of course, we can also experiment against data we've collected from production and verify it is correct and experiment with our next version of prompt. To conclude, we've seen how prompt injection poses a significant threat in LLM applications, much like SQL injection did for databases. While traditional techniques like static prompting or filtering might mitigate some risks, they often have many limitations. Our model based input validation approach offers a more adaptable solution by leveraging LLMs to validate inputs. Unlike deny listing approaches, where the LLM is asked for the entire input if it is malicious or harmful, our method uses an allow listing approach, where we explicitly define what inputs are valid and expected. This provides a stronger defense by limiting the range of acceptable inputs on one hand, but enhancing the flexibility of the final prompt that we can use on the other hand. By focusing on allowListing, we create a controlled environment where the LLM processes only valid inputs. This approach significantly reduces the risk for prompt injection while still offering the flexibility needed in complex data inputs. Thank you all for your time today. I hope this session has been insightful. If you have any questions, comments, or would just like to keep in touch. Please email me, follow us on LinkedIn or visit our website. Thank you very much.

See all 40 talks at this event!

Conf42 Prompt Engineering 2024 - Online

November 14 2024 - premiere 5PM GMT

Model-Based Input Validation for Preventing Prompt Injection Attacks

Video size:

Abstract

Summary

Transcript

Hilik Paz

Co-founder & CEO @ Arato.ai

Join the community!

Featured event

2025

2024

Info

Conf42 Prompt Engineering 2024 - Online

November 14 2024 - premiere 5PM GMT

Model-Based Input Validation for Preventing Prompt Injection Attacks

Video size:

Abstract

Summary

Transcript

Hilik Paz

Co-founder & CEO @ Arato.ai

Join the community!