Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hi everyone, my name is Hilik Paz and I'm the CTO at Arato AI.
Arato provides an end to end Gen AI application delivery platform for AI
builders that integrates AI technologies like LLMs into enterprise applications.
Today we are going to talk about a critical security challenge that has
emerged as LLM became part of modern applications, prompt injection attacks.
I'll introduce a new model based input validation approach designed to mitigate
such risks Drawing some parallels with SQL injection prevention techniques
from traditional security models.
First of all, what is prompt injection?
Prompt injection is a form of attack where an adversary manipulates the
input to a large language model in such a way it causes the model to behave
unexpectedly or output unintended results.
To better understand this, let's look at an analogy with SQL injection,
a well known attack vector in traditional web applications.
SQL Injection With SQL injection, an attacker inserts malicious SQL
queries as part of a user input field.
Which the system later incorrectly treats as part of a valid query.
This input is mixed with the coded SQL instructions, forcing the system
to treat it as a part of the SQL.
It can then later lead to an unauthorized access to a database.
Notable SQL injection vulnerabilities were found in major corporates
like Cisco, Tesla, Fortnite that recently had vulnerability that
let attackers access user accounts.
Similarly, in prompt injection, user input is mixed with the coded
instructions intended for the LLM.
An attacker might insert hidden instructions in the user input
field that overrides your intended output, leading to harmful behavior
or leaking sensitive information.
As LLMs are integrated into more applications, especially critical
ones like customer service and automation, Preventing these attacks
becomes increasingly important.
Let's look at a short demo.
In this demo, I'll show you a simple but effective prompt
injection attack on my demo app.
The app is a travel planner app that takes user input and uses
an LLM to provide trip plans.
Watch what happens when a user inputs contains hidden LLM follows,
resulting in unexpected responses.
So this is our Trip Planner demo app.
Let's see what happens when the user plans a trip to Paris, let's say for three days.
We're using here Cloud 3 Haiku, so it should take a few seconds
until we'll see the trip plan.
And we've got ourselves a nice trip plan for our next visit to Paris.
Let's see what happens when the user inserts harmful content.
I've already prepared a data set of some examples.
We'll pick this one.
And we'll enter it instead of the location field.
As you can see, we got an unexpected result.
But prompt injection can be much more harmful than just outputting LOL.
The user can get access to our internal configuration.
In this example, the model outputted the entire internal prompt that we have.
Now that we've seen how easily an LLM can be manipulated, let's talk
about some prevention techniques.
One common approach is context filtering, where the input is pre processed and
potentially harmful segments are removed.
Another technique is static prompt design, where we hard code responses
and maybe even inputs that reduces the flexibility for dynamic inputs,
but that limits LLM usefulness.
These techniques often fail in complex or adaptive scenarios, which is why we
propose a more sophisticated approach, a model based input validation.
It's based on a prompt that will be sent to an LLM, hence the model
part, and it will focus on specific input validity based on a set of
parameters you will see in a minute.
Let's start by introducing the core of this approach, the meta prompt.
This prompt acts as a filter between the user input and the LLM.
The metaprompt is designed to examine inputs for specific context.
In this example, the parameters for location input might be a type location,
samples that include city names, country names, maybe other definitions
of locations, and the user intent for a specific location one can visit.
Let's see another demo once we've implemented our metaprompt.
Back to our demo app, let's activate validation.
Validation will make sure our metaprompt runs on each input prior to the actual
invocation of the triplen request.
Let's start by running the metaprompt validation on a good input.
And as expected, we got our triplen.
Let's go ahead and run it on the malicious input.
This time our metaprompt worked and blocked that input
from continuing to the LLM.
Let's take a closer look at the different components of our prompt.
You will probably notice that we're using a system message
for the general instructions.
and a user message to hold the actual user input.
The system message determines what role the AI should play and how it
should behave generally, and at least theoretically should play a stronger
role in following these instructions.
rather than giving the same set of instructions as part of the user message.
The actual difference of having these messages in a separate system
prompt rather than the being part of a single user message depends on many
factors including the model and the parameters you choose, but that's a
topic for a whole different session.
The first section is a set of general instructions we give to the AI.
We set the tone, the purpose of our prompt, and make sure that it focuses
on the sole purpose of validating input.
For example, We give him the task of classifying that user input.
We define the format it should expect to get the user input.
And by doing that, we are trying to prevent interpreting the user input as
instructions the model should follow.
There's a very basic anti jailbreaking message here.
Of course, in a real production use case, this message will
be much more comprehensive.
And we also tell the model never to expose its role capabilities or limitations.
Next is the section where we define how the output looks like.
We perform two operations here.
The first one is we specify the format, which is a JSON object with
two keys that we expect to get.
One key will be the actual result, pass or fail.
And the second key, as you can see here, is a new variable that we introduce.
called secret, and it will contain a randomly generated value for each
interaction that we will call this prompt.
This secret doesn't have any direct role in the actual prompt formal function.
It doesn't serve as assisting the AI in any way, in understanding if the input
it is now assessing is valid or not.
But it does act as a kind of an internal mechanism for our own metaprompt.
Because, as like any other prompt, we might also be exposed to prompt
injection attacks on our metaprompt.
By introducing a secret value that the prompt must return in a valid response,
We give our code the ability to analyze if that response has been tampered or not.
The third section is where we introduce the dynamic part of the metaprompt.
It includes type, category, intent, and example.
The type and category are the most basic aspects of the input that we
expect the prompt to validate against.
For example, it can be a string that represents a location.
Or a numeric value that represents a budget dollar amount.
The intent is where we explain what level of flexibility we allow in
choosing these kinds of variables from the relevant category.
For example, we can have a very strict intent for a location.
Saying must be a specific city or state name, or we can allow a very flexible
location choosing by saying any form or any description that suggests a location.
It can be abstract, it can be even fictional, as long as you
can interpret it as a location for the purpose of planning a trip.
And lastly, we provide a set of examples that will help the model
analyze the actual user input.
Maybe match it to similar examples and understand if it is indeed a
valid and relevant input or not.
And lastly, the user message that includes the actual user input we should validate.
In this example, using a JSON object will also assist the model in understanding
it's an input rather than a set of instructions it should follow.
Let's take a look under the hood and view the monitored
LLM calls we've just performed.
Note that the variable type, category, examples, and actually
the metadata that we provide to our metaprompt in both cases is identical.
The only difference is the user input.
When the user enters Paris, validation passes.
But when the user enters harmful content, it fails as expected, and our code knows
not to continue to the next step of passing the entire prompt to the LLM.
As we've seen, validating the input before it reaches the LLM is crucial.
But beyond just running individual tests like we just did, it's important to test
the meta prompt like any other prompt on a diverse set of data, to ensure it works as
expected in a wide variety of scenarios.
Why is that so critical?
Because user input can be highly variable, ranging from simple, formed inputs
to very complex and malicious ones.
A good metaprompt needs to be consistently distinguishing between
valid inputs like New Jersey and harmful or nonsense input like ignore all
instructions across the entire range.
Testing against a diverse dataset helps us simulate real world usage and uncover edge
cases where the metaprompt might fail.
So we should test different input types, location, language, duration,
or any other field type that our business application requires.
Different user intents, we might sometimes allow for a more flexible user intent, and
might sometimes be very strict or rigid.
And various structures, for example, what happens when a user enters,
the initials of a city he wants to visit in a location field.
We can assess how well the metaprompt handles these variations and
make improvements accordingly.
This testing is essential to ensuring that the prompt validation mechanism
is both robust and adaptable, able to catch injection attempts
while allowing valid inputs.
When we introduce a new category, for example, we should also experiment
with various category parameters, like relevant examples, different user intents.
This comprehensive and iterative approach ensures that the metaprompt is not
just effective in controlled cases that we've just tested, but also can handle
unpredictable nature of real world data.
So let's take a short demo on the concept of a dataset and how we can
validate our metaprompt against it.
First, let's take a look at our datasets.
Okay.
Depending on the use case we're developing or optimizing for, we
will build the relevant test data that we want to experiment against.
Let's assume we're optimizing for location input fields.
We provided several data sets, for example, a list of valid locations,
We can see the different user inputs, and of course, the expected
response, which is passed in this example, as they are all valid.
If we drill down to a specific line, we can see the entire metadata that we
pass to the meta prompt, including the category, the intent, and the examples.
Similarly, we've prepared a dataset with malicious content or malicious
user input, used as a location field.
that we will want to experiment against, and of course, in this case, the
expected result is always a failure.
Now let's see how we can use these datasets in a real experiment in Arata.
Let's create a new experiment.
We will select our dataset,
and for our initial run, we might keep the prompt as is.
And let's see how it behaves against our data set, and this time we chose
GPT 40mini, we'll click save and run, and we'll let the experiment execute.
And let's view the results.
So you can see we got a very high score matching the similarity.
of our expected results from the data set, although not perfect.
Let's see the details inside.
Here, for example, we got exactly the same response that we expected in our data set.
But in other cases, for example, in this line, our data set expected a pass
result, but the actual status we got using our prompt with 40mini was a failure.
Let's try to understand why.
The user input was a romantic location.
While the model might have been right saying that this is not a real place or
specific location, as we explicitly added an intent saying it can be abstract,
we would have expected it to pass.
So let's try to change one of the parameters of our experiment and
see if we can reach better results.
We will first try another model.
Let's take a different vendor.
And we will run exactly the same experiment with the same
prompt and data against Clode3IQ.
Let's view the results.
This is much better.
We get exactly the expected results on each one of the individual lines
that we run the experiment with.
At this point, we can decide if that's good enough and we want to continue with
the model and parameters we've just had, Or we want to go back to the previous
model and try to improve our prompts, maybe play with the parameters, etc.
Let's run another experiment with invalid location data that we've uploaded.
In this experiment, we will also add an additional validation.
to see if we can detect harmful content.
We will select the relevant dataset and create the experiment.
Again, at our first run, we will not change any parameters in the
prompt and we'll run it against 4.
0 Mini.
We can run in parallel against a different model as well.
And let's view the results.
First, we will look at the results against 4.
0 Mini,
and we got very close to 100 percent similarity.
We can see there's a slight difference in the output format, but the
content itself is what we expected.
And looking at Claude's results, we see very similar,
successful run against our data.
So we've seen how you can experiment multiple versions of your
prompts, models, configurations.
and data to achieve your business goal and to optimize your prompt.
When experimenting, it's important to have a baseline.
In the recent examples we've just did, the baseline was the data set that we've
uploaded with the expected results.
But of course, we can also experiment against data we've
collected from production and verify it is correct and experiment
with our next version of prompt.
To conclude, we've seen how prompt injection poses a significant threat
in LLM applications, much like SQL injection did for databases.
While traditional techniques like static prompting or filtering might mitigate some
risks, they often have many limitations.
Our model based input validation approach offers a more adaptable solution by
leveraging LLMs to validate inputs.
Unlike deny listing approaches, where the LLM is asked for the entire
input if it is malicious or harmful, our method uses an allow listing
approach, where we explicitly define what inputs are valid and expected.
This provides a stronger defense by limiting the range of acceptable
inputs on one hand, but enhancing the flexibility of the final prompt
that we can use on the other hand.
By focusing on allowListing, we create a controlled environment where the
LLM processes only valid inputs.
This approach significantly reduces the risk for prompt injection while
still offering the flexibility needed in complex data inputs.
Thank you all for your time today.
I hope this session has been insightful.
If you have any questions, comments, or would just like to keep in touch.
Please email me, follow us on LinkedIn or visit our website.
Thank you very much.