Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hi everyone.
My name is Homan Ortega.
I'm going to talk about security auditing tools for large language models. Large language models are a category of deep learning models based on neural networks for natural language processing. The talk aims to analyze the security of these models from the developer's point of view, analyzing the vulnerabilities that can appear in applications built on top of them. Among the main points to be discussed, we can highlight an introduction to LLMs, an introduction to the OWASP Top 10 for LLM applications, auditing tools and applications that handle these models, and a use case with the TextAttack tool.
We start with an introduction to this kind of model. The base of these models are transformers. Transformers are a type of neural network architecture used in natural language processing tasks like machine translation and text generation. They were introduced in the paper "Attention Is All You Need" in 2017.
The key to transformers is the self-attention mechanism, which allows the model to weigh the importance of the different words in a sentence when making predictions. With self-attention, each word or token in a sentence can attend to every other word, allowing the model to capture the relationships between those words more effectively. This helps transformers understand context better than previous models like recurrent neural networks.
The original transformer has two main components, the encoder and the decoder. The encoder processes the input data, like a sentence, and creates a representation of it. The decoder uses the encoder's representation to generate the output, for example the translated sentence. In the case of tasks like translation, the input is processed by the encoder, and the decoder generates the translation based on the encoder's output.
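As an illustration, the self-attention computation I just described can be sketched in a few lines of plain Python. This is a toy version that skips the learned query, key and value projection matrices a real transformer applies before this step:

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of scores.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(tokens):
    """Scaled dot-product self-attention over a list of embedding vectors.
    Toy version: Q, K and V are the raw embeddings, no learned projections."""
    d = len(tokens[0])
    outputs = []
    for q in tokens:
        # Score this token against every position in the sequence.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in tokens]
        weights = softmax(scores)  # how much attention q pays to each position
        # The output is the attention-weighted average of the value vectors.
        outputs.append([sum(w * v[i] for w, v in zip(weights, tokens))
                        for i in range(d)])
    return outputs
```

A quick sanity check: when every input vector is identical, the attention weights just average them back to the same vector.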
The training process of these kinds of models is generally divided into two stages, pre-training and fine-tuning. The pre-training stage is where the model is trained on large text corpora to predict the next word in a sequence or complete a sentence. This lets it learn a general representation of the language. In the fine-tuning stage, after pre-training, the model is tuned with specific data for particular tasks, such as text classification, question answering, or generating text in a specific context.
With the growing popularity of this kind of model, such as GPT or LLaMA among others, OWASP has published a specific list of vulnerabilities for this kind of application. The list comes in response to the rapid adoption of these models in many industries, with the aim of highlighting the main security problems associated with this emerging technology. In this table, we can see the main vulnerabilities we can highlight in language models. Now we analyze some of them in more detail.
For example, one of the top-ranked vulnerabilities is prompt injection, which involves manipulating input prompts to achieve unintended or malicious model outputs. Regarding mitigation of this risk, it is recommended to validate all inputs before processing them with the LLM; an allow-list could also be implemented so that the types of data that can be processed are limited.
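As a minimal sketch of that kind of validation, assuming a hypothetical deny-list of suspicious phrases and an allow-list of accepted input types (a real deployment would combine this with a trained detection model rather than rely on patterns alone):

```python
import re

# Hypothetical deny-list of phrases that often signal injection attempts.
INJECTION_PATTERNS = [
    r"ignore (all )?(previous|prior) instructions",
    r"system prompt",
    r"you are now",
]

# Allow-list of input types the application agrees to process.
ALLOWED_TYPES = {"plain_text", "question"}

def validate_input(text, input_type="plain_text"):
    """Return (ok, reason): reject disallowed types and suspicious phrases
    before the text ever reaches the LLM."""
    if input_type not in ALLOWED_TYPES:
        return False, f"input type '{input_type}' is not allowed"
    for pattern in INJECTION_PATTERNS:
        if re.search(pattern, text, re.IGNORECASE):
            return False, f"matched injection pattern: {pattern}"
    return True, "ok"
```
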
For example, in an LLM scenario taking input from external sources, such as websites or files controlled by a user, indirect prompt injection happens when an attacker manipulates those external inputs, such as web content or user data, which are then processed by the model. This can lead the model to behave in ways not intended by the developers, compromising the security and the integrity of applications. For example, an LLM asked to generate a summary of a document provided by an attacker could, through the manipulated external source, follow hidden injected prompts and include confidential information from other files or from a specific web page. As a result of this external manipulation, the model can generate content incorporating sensitive details from unauthorized sources, leading to data leakage and security breaches.
Another research example of injection is in the paper "GPT-4 Is Too Smart To Be Safe: Stealthy Chat with LLMs via Cipher". This research explored how these models can be manipulated to engage in conversations using encryption or coded messages, thereby obscuring the true intent. This method can bypass monitoring systems designed to detect malicious use. A common example might be the use of a Caesar cipher or simple scrambling of the incoming message, which an attacker can use to obtain encoded responses or encode sensitive information without it appearing anomalous to an observer.
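To make the idea concrete, this is what a Caesar-cipher scrambler looks like; an attacker can wrap a forbidden request in this kind of trivially reversible encoding so that keyword-based monitoring never sees the plain text:

```python
def caesar(text, shift):
    """Shift alphabetic characters by `shift` positions; leave the rest as-is."""
    out = []
    for ch in text:
        if ch.isalpha():
            base = ord('A') if ch.isupper() else ord('a')
            # Wrap around the 26-letter alphabet, preserving case.
            out.append(chr((ord(ch) - base + shift) % 26 + base))
        else:
            out.append(ch)
    return "".join(out)
```

Applying the opposite shift recovers the original message, so both sides of the conversation can read it while a naive filter cannot.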
This kind of injection is also known as jailbreaking. This attack involves directly manipulating the commands that are sent to these models. The research paper '"Do Anything Now": Characterizing and Evaluating In-The-Wild Jailbreak Prompts on Large Language Models' analyzes and evaluates so-called jailbreak prompts, which are commonly designed to bypass the restrictions imposed on these models. These prompts allow users to obtain answers that would normally be blocked due to ethical, legal, or security constraints. The study investigates how these jailbreak prompts work and how effective they are at bypassing safeguards; the researchers collected and classified several samples of jailbreak prompts that have been created and shared in the community.
In LLM development, data is crucial: in pre-training for language comprehension, in fine-tuning for quality alignment, and in embeddings for domain-specific knowledge. However, these datasets can be susceptible to tampering, allowing attackers to manipulate them. This manipulation, known as data poisoning, can compromise the model's performance and lead it to generate content aligned with malicious intentions. During pre-training, an attacker can introduce misleading language samples, shaping the model's understanding of specific subjects. Consequently, the model might produce outputs reflecting the injected bias when used in practical applications.
One of the attack types that has been investigated in recent years is adversarial attacks. These attacks refer to deliberate attempts to manipulate or deceive an artificial intelligence or machine learning model by providing it with carefully crafted input data designed to cause the model to make incorrect predictions or decisions. These attacks try to exploit vulnerabilities in the model's decision-making process, typically by introducing small, imperceptible changes in the input data. These are the key characteristics of adversarial attacks: they typically involve adding small, carefully crafted perturbations to input data that are often imperceptible to humans, and they exploit specific weaknesses in machine learning models, such as the model's inability to generalize well to new and unseen data, or its sensitivity to certain types of input.
We can classify adversarial attacks into two types, white box and black box. White-box attacks assume the attacker has full knowledge of the model, including its architecture, parameters, and training data. Here the attacker uses that information to generate samples that are most likely to deceive the model. On the other side, we have black-box attacks. In black-box attacks, the attacker doesn't have direct access to the model's internal workings. Instead, the attacker relies on observing the model's outputs in response to different inputs, and uses this information to craft adversarial samples.
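A toy sketch of a black-box attack loop, assuming the attacker can see a confidence score for each query (the `query_confidence` model and its weights here are invented for illustration):

```python
import math

def query_confidence(x):
    # Invented "remote" model: the attacker sees only this confidence score
    # for class 1, never the weights inside.
    weights = [0.9, -0.2, 0.4]
    score = sum(w * xi for w, xi in zip(weights, x))
    return 1.0 / (1.0 + math.exp(-score))  # sigmoid confidence

def evasion_attack(x, step=0.05, max_queries=500):
    """Greedy score-based black-box attack: probe small +/- changes to each
    coordinate and keep the candidate that lowers confidence the most,
    until the predicted class flips (confidence drops below 0.5)."""
    x = list(x)
    for _ in range(max_queries):
        if query_confidence(x) < 0.5:
            return x  # prediction flipped with a small total perturbation
        candidates = []
        for i in range(len(x)):
            for delta in (-step, step):
                candidates.append(x[:i] + [x[i] + delta] + x[i + 1:])
        x = min(candidates, key=query_confidence)  # keep most damaging probe
    return x
```

The attacker never reads the model's weights; a handful of output observations is enough to steer a small perturbation across the decision boundary.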
Machine learning systems are vulnerable to a wide range of adversarial attacks. Many of them target classic machine learning models like linear regression and support vector machines, as well as deep learning systems, and the majority of attacks try to degrade classifier performance on a particular task. At this point, we can group adversarial attacks into different types. These are the six adversarial strategies I want to highlight, and they motivate the need for defenses.
The first is prompt injection. As I commented before, injection attacks involve crafting inputs to manipulate the behavior of these models, causing them to produce harmful or unintended outputs by exploiting their reliance on user prompts.
Another kind of attack is evasion attacks. This kind of attack involves modifying inputs to mislead models at inference time, making them a stealthy and effective means of bypassing a protected system. The objective of this kind of attack is to generate an input that is misclassified without needing to understand the artificial intelligence model's internal mechanisms, which makes these attacks particularly difficult to counter.
Another kind of attack is poisoning attacks, which target the training phase of artificial intelligence models, injecting malicious data into the training sets to compromise the model's behavior. By manipulating the training data or its labels, the attacker can make the model perform poorly during deployment.
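A small illustration of label poisoning, using a nearest-centroid classifier invented for this example: injecting mislabeled points drags one class centroid across the feature space and flips predictions near the boundary:

```python
def centroid_classifier(train):
    """Fit a nearest-centroid classifier: one mean vector per label."""
    sums, counts = {}, {}
    for x, y in train:
        counts[y] = counts.get(y, 0) + 1
        sums[y] = [s + xi for s, xi in zip(sums.get(y, [0.0] * len(x)), x)]
    centroids = {y: [s / counts[y] for s in sums[y]] for y in sums}
    def predict(x):
        # Predict the label whose centroid is closest (squared distance).
        return min(centroids, key=lambda y: sum((xi - ci) ** 2
                                                for xi, ci in zip(x, centroids[y])))
    return predict

clean = [([0, 0], 0), ([0, 1], 0), ([1, 0], 0),
         ([5, 5], 1), ([5, 6], 1), ([6, 5], 1)]
# Poisoning: the attacker injects mislabeled points inside class 1's region.
poison = clean + [([5, 5], 0)] * 10
```

Training on `poison` pulls the class-0 centroid toward (5, 5), so inputs near the original boundary get the attacker's chosen label.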
Another kind of attack is model inversion attacks, which aim at reverse engineering artificial intelligence models to retrieve sensitive information about the training data. In these attacks, malicious actors analyze the predictions made by a model in response to many inputs. This analysis helps them infer sensitive details about the data the model was trained on. For example, by studying how the model reacts to different input patterns, attackers can recover features or even entire portions of the original training dataset.
Another kind of attack is model extraction. Model extraction attacks aim to replicate the functionality of a proprietary model by querying it with numerous inputs and observing its outputs, so the attacker effectively appropriates the trained model. It typically involves reverse engineering, where the attacker recovers the system's behavior and parameters to replicate its functionality. Once in possession of the replicated system, the attacker can exploit it for various malicious activities.
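A minimal sketch of the idea, assuming the "proprietary" model is a simple linear function the attacker can only query: a handful of queries and closed-form least squares are enough to recover a faithful surrogate:

```python
def secret_model(x):
    # Stand-in for a proprietary "black-box" model the attacker can only query.
    return 2.5 * x - 1.0

def extract_linear(query, xs):
    """Fit a surrogate y = a*x + b to queried input/output pairs
    using closed-form least squares."""
    ys = [query(x) for x in xs]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    a = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
    b = mean_y - a * mean_x
    return a, b
```

After a few queries the surrogate reproduces the secret model's behavior on inputs it never observed.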
Finally, we have membership inference. Membership inference attacks involve an adversary deducing sensitive information from an artificial intelligence model by examining its outputs and behavior. For instance, an attacker might use the confidence levels of a model to determine whether a person was part of the model's training data. Typically, a model trained on a specific data point will generate high-confidence predictions for that data point if it was in the training dataset. By introducing slight modifications to the input data and observing fluctuations in the model's confidence levels, an attacker can make educated guesses about a sample's presence in the training dataset.
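A toy illustration of confidence-thresholding membership inference; the "model" here is an invented stand-in whose confidence is artificially high near its training points, which is exactly the overfitting behavior the attack exploits:

```python
import math

TRAINING_SET = [0.1, 2.3, 5.0]  # hidden inside the "model", never shown to the attacker

def query_confidence(x, scale=4.0):
    """Toy overfitted model: confidence decays with distance to the nearest
    training point, so members get confidence close to 1.0."""
    return math.exp(-scale * min(abs(x - t) for t in TRAINING_SET))

def was_member(x, threshold=0.9):
    # The attacker only calls query_confidence and applies a threshold.
    return query_confidence(x) >= threshold
```
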
At this point, I'm going to comment on some tools that are interesting for evaluating model robustness. For example, the PromptInject framework is a tool designed to study how language models like GPT can be manipulated using prompts. Another is PAIR, which is an acronym for Prompt Automatic Iterative Refinement. It is an automated methodology for iteratively refining prompts with the goal of jailbreaking LLMs. Jailbreaking refers to attempts that bypass the security restrictions and alignment imposed by language models. Another interesting tool is TAP, a framework that explores how to perform attacks on LLMs using a tree-structured search over candidate prompts.
Another interesting tool is Fairness Indicators. It is an open-source tool for detecting and evaluating fairness metrics across data and models. In particular, this tool includes the ability to evaluate the distribution of datasets, to evaluate model performance with confidence intervals and at multiple thresholds, and also to explore root causes and opportunities for improvement in our models.
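The core of that kind of sliced evaluation can be sketched in a few lines; this is not the Fairness Indicators API, just a plain-Python illustration of computing a metric per group slice:

```python
def slice_accuracy(records):
    """Accuracy per group slice; `records` are (group, label, prediction) triples."""
    totals, hits = {}, {}
    for group, label, pred in records:
        totals[group] = totals.get(group, 0) + 1
        hits[group] = hits.get(group, 0) + (label == pred)
    # One accuracy number per slice, so gaps between groups become visible.
    return {g: hits[g] / totals[g] for g in totals}
```
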
Now I'm going to talk about privacy and security. As commented before, this kind of model may leak sensitive information or be vulnerable to attacks such as prompt injection or adversarial manipulation. At this point, audits focus on ensuring data privacy and security, and on detecting and addressing vulnerabilities. For example, we can find auditing tools like Prompt Guard. Prompt Guard refers to a set of strategies, tools, and techniques designed to safeguard the behavior of these kinds of models against malicious or unintended use. Prompt Guard uses a model with 86 million parameters that has been trained on a large dataset of attacks and prompts found on the internet. To test the model, we can go to its repository and use it with the free inference API that the platform offers us. Basically, this tool offers the capacity to prevent harmful or malicious interactions by filtering, monitoring, and responding to adversarial inputs, ensuring that the model's behavior remains safe, ethical, and aligned with the intended use.
We have other tools like Llama Guard, which refers to a security tool or strategy designed for guarding large language models like Meta's Llama against potential vulnerabilities and adversarial attacks. This tool offers a capable solution to protect LLMs against prompt injection and jailbreak attacks by combining different techniques, such as filtering, normalization, and monitoring of the user's input. Essentially, this model employs safeguards at multiple levels to mitigate the risk of injection and jailbreak attacks, using techniques like dynamic input filtering, prompt normalization, contextualized response policies, and active monitoring and response. To test this model, we can do it on the Hugging Face site. Llama Guard will not only tell us whether the content is safe or not, but will also classify the content into 14 different categories. These categories have been drawn from a taxonomy introduced in the corresponding research.
Finally, I'm going to comment on a tool designed for adversarial attacks, data augmentation, and model robustness testing in natural language processing. Basically, TextAttack is a framework for testing attacks using techniques like data augmentation in natural language processing. This tool allows users to test how robust a model is to adversarial samples and helps to improve its robustness. Choosing a sentiment analysis model for this use case, we'll use a pretrained BERT model trained on the IMDB movie review dataset, a commonly used dataset for binary sentiment classification. TextAttack provides a variety of attack strategies; for this use case we'll use TextFooler, a popular attack recipe that replaces words in the input text with synonyms to fool the model without changing the semantic meaning of the text.
Now you can run the attack on a simple input to see how robust the sentiment analysis model is against adversarial samples. After running the attack, you will receive adversarial samples where minor changes to the input text have altered the model's prediction. In this case, the model originally predicted positive sentiment for the original text; however, after replacing words like "love" with "like", the model incorrectly predicts negative sentiment, despite the fact that the overall meaning remains positive. Once you have identified weaknesses in the model, you can use TextAttack's data augmentation capabilities to retrain the model with adversarial samples and improve its robustness. This step allows you to augment your dataset with adversarial samples and retrain the model, making it more resistant to such attacks.
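The workflow above can be mimicked end to end with a stdlib-only sketch: a toy keyword-based sentiment "model" and a TextFooler-style greedy synonym swap (the lexicon and synonym table are invented for illustration, standing in for BERT and TextFooler's embedding-based synonym search):

```python
# Invented toy lexicon and synonym table, for illustration only.
LEXICON = {"love": 2.0, "great": 1.5, "like": 0.0, "awful": -2.0}
SYNONYMS = {"love": ["like"], "great": ["fine"]}

def predict(text):
    """Toy sentiment 'model': positive iff the summed word scores are > 0."""
    score = sum(LEXICON.get(w, 0.0) for w in text.lower().split())
    return "positive" if score > 0 else "negative"

def textfooler_like_attack(text):
    """Greedily swap words for near-synonyms until the prediction flips,
    mimicking TextFooler's word-substitution strategy."""
    original = predict(text)
    words = text.lower().split()
    for i, word in enumerate(words):
        for synonym in SYNONYMS.get(word, []):
            candidate = words[:i] + [synonym] + words[i + 1:]
            if predict(" ".join(candidate)) != original:
                return " ".join(candidate)  # minimally changed adversarial text
    return None  # no single substitution flips the prediction
```

In the same spirit as the BERT/IMDB case, swapping "love" for the weaker synonym "like" flips the toy prediction from positive to negative, and the flipped sample can then be added back to the training data as augmentation.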
Finally, I want to comment on some interesting resources that we can find in recent publications. These are papers that have been published in the last year on this topic, and these are additional resources, alongside the papers that I commented on, where we can extend the information provided in this presentation. This is the end of the presentation. I hope it has been interesting for developers and security researchers. And that's all. Thank you very much for attending Conf42 Large Language Models 2025. Thank you very much. Bye.