Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hi everyone.
My name is Homan Ortega.
I'm going to talk about security auditing tools for large language models. Large language models are a category of deep learning models based on neural networks for natural language processing. The talk aims to analyze the security of these models from the developer's point of view, analyzing the vulnerabilities that can appear in applications built on top of them. Among the main points to be discussed, we can highlight an introduction to LLMs, an introduction to the OWASP Top 10 for LLM applications, auditing tools and applications that handle these models, and a use case with the TextAttack tool.
We start with an introduction to this kind of model. The base of these models are transformers. Transformers are a type of neural network architecture used in natural language processing tasks like machine translation and text generation. They were introduced in the paper "Attention Is All You Need" in 2017.
The key to transformers is the self-attention mechanism, which allows the model to weigh the importance of the different words in a sentence when making predictions. With self-attention, each word or token in a sentence can attend to every other word, allowing the model to capture the relationships between those words more effectively. This helps transformers understand context better than previous models like recurrent neural networks.
The original transformer has two main components, the encoder and the decoder. The encoder processes the input data, like a sentence, and creates a representation of it. The decoder uses the encoder's representation to generate the output, for example the translated sentence. In the case of tasks like translation, the input is processed by the encoder, and the decoder generates the translation based on the encoder's output.
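As an illustration, the self-attention computation I just described can be sketched in a few lines of plain Python. This is a toy version that skips the learned query, key and value projection matrices a real transformer applies before this step:

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of scores.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(tokens):
    """Scaled dot-product self-attention over a list of embedding vectors.
    Toy version: Q, K and V are the raw embeddings, no learned projections."""
    d = len(tokens[0])
    outputs = []
    for q in tokens:
        # Score this token against every position in the sequence.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in tokens]
        weights = softmax(scores)  # how much attention q pays to each position
        # The output is the attention-weighted average of the value vectors.
        outputs.append([sum(w * v[i] for w, v in zip(weights, tokens))
                        for i in range(d)])
    return outputs
```

A quick sanity check: when every input vector is identical, the attention weights just average them back to the same vector.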
The training process of these kinds of models is generally divided into two stages, pre-training and fine-tuning. The pre-training stage is where the model is trained on large text corpora to predict the next word in a sequence or complete a sentence. This lets it learn a general representation of the language. In the fine-tuning stage, after pre-training, the model is tuned with specific data for particular tasks, such as text classification, question answering, or generating text in a specific context.
With the growing popularity of this kind of model, such as GPT or LLaMA among others, OWASP has published a specific list of vulnerabilities for this kind of application. The list comes in response to the rapid adoption of these models in many industries, with the aim of highlighting the main security problems associated with this emerging technology. In this table, we can see the main vulnerabilities we can highlight in language models. Now we analyze some of them in more detail.
For example, one of the top-ranked vulnerabilities is prompt injection, which involves manipulating input prompts to achieve unintended or malicious model outputs. Regarding mitigation of this risk, it is recommended to validate all inputs before processing them with the LLM; an allow-list could also be implemented so that the types of data that can be processed are limited.
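As a minimal sketch of that kind of validation, assuming a hypothetical deny-list of suspicious phrases and an allow-list of accepted input types (a real deployment would combine this with a trained detection model rather than rely on patterns alone):

```python
import re

# Hypothetical deny-list of phrases that often signal injection attempts.
INJECTION_PATTERNS = [
    r"ignore (all )?(previous|prior) instructions",
    r"system prompt",
    r"you are now",
]

# Allow-list of input types the application agrees to process.
ALLOWED_TYPES = {"plain_text", "question"}

def validate_input(text, input_type="plain_text"):
    """Return (ok, reason): reject disallowed types and suspicious phrases
    before the text ever reaches the LLM."""
    if input_type not in ALLOWED_TYPES:
        return False, f"input type '{input_type}' is not allowed"
    for pattern in INJECTION_PATTERNS:
        if re.search(pattern, text, re.IGNORECASE):
            return False, f"matched injection pattern: {pattern}"
    return True, "ok"
```
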
For example, in an LLM scenario taking input from external sources, such as websites or files controlled by a user, indirect prompt injection happens when an attacker manipulates those external inputs, such as web content or user data, which are then processed by the model. This can lead the model to behave in ways not intended by the developers, compromising the security and the integrity of applications. For example, an LLM asked to generate a summary of a document provided by an attacker could, through the manipulated external source, follow hidden injected prompts and include confidential information from other files or from a specific web page. As a result of this external manipulation, the model can generate content incorporating sensitive details from unauthorized sources, leading to data leakage and security breaches.
Another research example of injection is in the paper "GPT-4 Is Too Smart To Be Safe: Stealthy Chat with LLMs via Cipher". This research explored how these models can be manipulated to engage in conversations using encryption or coded messages, thereby obscuring the true intent. This method can bypass monitoring systems designed to detect malicious use. A common example might be the use of a Caesar cipher or simple scrambling of the incoming message, which an attacker can use to obtain encoded responses or encode sensitive information without it appearing anomalous to an observer.
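To make the idea concrete, this is what a Caesar-cipher scrambler looks like; an attacker can wrap a forbidden request in this kind of trivially reversible encoding so that keyword-based monitoring never sees the plain text:

```python
def caesar(text, shift):
    """Shift alphabetic characters by `shift` positions; leave the rest as-is."""
    out = []
    for ch in text:
        if ch.isalpha():
            base = ord('A') if ch.isupper() else ord('a')
            # Wrap around the 26-letter alphabet, preserving case.
            out.append(chr((ord(ch) - base + shift) % 26 + base))
        else:
            out.append(ch)
    return "".join(out)
```

Applying the opposite shift recovers the original message, so both sides of the conversation can read it while a naive filter cannot.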
This kind of injection is also known as jailbreaking. This attack involves directly manipulating the commands that are sent to these models. The research paper '"Do Anything Now": Characterizing and Evaluating In-The-Wild Jailbreak Prompts on Large Language Models' analyzes and evaluates so-called jailbreak prompts, which are commonly designed to bypass the restrictions imposed on these models. These prompts allow users to obtain answers that would normally be blocked due to ethical, legal, or security constraints. The study investigates how these jailbreak prompts work and how effective they are at bypassing safeguards; the researchers collected and classified several samples of jailbreak prompts that have been created and shared in the community.
In LLM development, data is crucial: in pre-training for language comprehension, in fine-tuning for quality alignment, and in embeddings for domain-specific knowledge. However, these datasets can be susceptible to tampering, allowing attackers to manipulate them. This manipulation, known as data poisoning, can compromise the model's performance and lead it to generate content aligned with malicious intentions. During pre-training, an attacker can introduce misleading language samples, shaping the model's understanding of specific subjects. Consequently, the model might produce outputs reflecting the injected bias when used in practical applications.
One of the attack types that has been investigated in recent years is adversarial attacks. These attacks refer to deliberate attempts to manipulate or deceive an artificial intelligence or machine learning model by providing it with carefully crafted input data designed to cause the model to make incorrect predictions or decisions. These attacks try to exploit vulnerabilities in the model's decision-making process, typically by introducing small, imperceptible changes in the input data. These are the key characteristics of adversarial attacks: they typically involve adding small, carefully crafted perturbations to input data that are often imperceptible to humans, and they exploit specific weaknesses in machine learning models, such as the model's inability to generalize well to new and unseen data, or its sensitivity to certain types of input.
We can classify adversarial attacks into two types, white box and black box. White-box attacks assume the attacker has full knowledge of the model, including its architecture, parameters, and training data. Here the attacker uses that information to generate samples that are most likely to deceive the model. On the other side, we have black-box attacks. In black-box attacks, the attacker doesn't have direct access to the model's internal workings. Instead, the attacker relies on observing the model's outputs in response to different inputs, and uses this information to craft adversarial samples.
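A toy sketch of a black-box attack loop, assuming the attacker can see a confidence score for each query (the `query_confidence` model and its weights here are invented for illustration):

```python
import math

def query_confidence(x):
    # Invented "remote" model: the attacker sees only this confidence score
    # for class 1, never the weights inside.
    weights = [0.9, -0.2, 0.4]
    score = sum(w * xi for w, xi in zip(weights, x))
    return 1.0 / (1.0 + math.exp(-score))  # sigmoid confidence

def evasion_attack(x, step=0.05, max_queries=500):
    """Greedy score-based black-box attack: probe small +/- changes to each
    coordinate and keep the candidate that lowers confidence the most,
    until the predicted class flips (confidence drops below 0.5)."""
    x = list(x)
    for _ in range(max_queries):
        if query_confidence(x) < 0.5:
            return x  # prediction flipped with a small total perturbation
        candidates = []
        for i in range(len(x)):
            for delta in (-step, step):
                candidates.append(x[:i] + [x[i] + delta] + x[i + 1:])
        x = min(candidates, key=query_confidence)  # keep most damaging probe
    return x
```

The attacker never reads the model's weights; a handful of output observations is enough to steer a small perturbation across the decision boundary.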
Machine learning systems are vulnerable to a wide range of adversarial attacks. Many of them target classic machine learning models like linear regression and support vector machines, as well as deep learning systems, and the majority of attacks try to degrade classifier performance on a particular task. At this point, we can group adversarial attacks into different types. These are the six adversarial strategies I want to highlight, and they motivate the need for defenses.
The first is prompt injection. As I commented before, injection attacks involve crafting inputs to manipulate the behavior of these models, causing them to produce harmful or unintended outputs by exploiting their reliance on user prompts.
Another kind of attack is evasion attacks. This kind of attack involves modifying inputs to mislead models at inference time, making them a stealthy and effective means of bypassing a protected system. The objective of this kind of attack is to generate an input that is misclassified without needing to understand the artificial intelligence model's internal mechanisms, which makes these attacks particularly difficult to counter.
Another kind of attack is poisoning attacks, which target the training phase of artificial intelligence models, injecting malicious data into the training sets to compromise the model's behavior. By manipulating the training data or its labels, the attacker can make the model perform poorly during deployment.
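A small illustration of label poisoning, using a nearest-centroid classifier invented for this example: injecting mislabeled points drags one class centroid across the feature space and flips predictions near the boundary:

```python
def centroid_classifier(train):
    """Fit a nearest-centroid classifier: one mean vector per label."""
    sums, counts = {}, {}
    for x, y in train:
        counts[y] = counts.get(y, 0) + 1
        sums[y] = [s + xi for s, xi in zip(sums.get(y, [0.0] * len(x)), x)]
    centroids = {y: [s / counts[y] for s in sums[y]] for y in sums}
    def predict(x):
        # Predict the label whose centroid is closest (squared distance).
        return min(centroids, key=lambda y: sum((xi - ci) ** 2
                                                for xi, ci in zip(x, centroids[y])))
    return predict

clean = [([0, 0], 0), ([0, 1], 0), ([1, 0], 0),
         ([5, 5], 1), ([5, 6], 1), ([6, 5], 1)]
# Poisoning: the attacker injects mislabeled points inside class 1's region.
poison = clean + [([5, 5], 0)] * 10
```

Training on `poison` pulls the class-0 centroid toward (5, 5), so inputs near the original boundary get the attacker's chosen label.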
Another kind of attack is model inversion attacks, which aim at reverse engineering artificial intelligence models to retrieve sensitive information about the training data. In these attacks, malicious actors analyze the predictions made by a model in response to many inputs. This analysis helps them infer sensitive details about the data the model was trained on. For example, by studying how the model reacts to different input patterns, attackers can recover features or even entire portions of the original training dataset.
Another kind of attack is model extraction. Model extraction attacks aim to replicate the functionality of a proprietary model by querying it with numerous inputs and observing its outputs, so the attacker effectively appropriates the trained model. It typically involves reverse engineering, where the attacker recovers the system's behavior and parameters to replicate its functionality. Once in possession of the replicated system, the attacker can exploit it for various malicious activities.
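A minimal sketch of the idea, assuming the "proprietary" model is a simple linear function the attacker can only query: a handful of queries and closed-form least squares are enough to recover a faithful surrogate:

```python
def secret_model(x):
    # Stand-in for a proprietary "black-box" model the attacker can only query.
    return 2.5 * x - 1.0

def extract_linear(query, xs):
    """Fit a surrogate y = a*x + b to queried input/output pairs
    using closed-form least squares."""
    ys = [query(x) for x in xs]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    a = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
    b = mean_y - a * mean_x
    return a, b
```

After a few queries the surrogate reproduces the secret model's behavior on inputs it never observed.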
Finally, we have membership inference. Membership inference attacks involve an adversary deducing sensitive information from an artificial intelligence model by examining its outputs and behavior. For instance, an attacker might use the confidence levels of a model to determine whether a person was part of the model's training data. Typically, a model trained on a specific data point will generate high-confidence predictions for that data point if it was in the training dataset. By introducing slight modifications to the input data and observing fluctuations in the model's confidence levels, an attacker can make educated guesses about a sample's presence in the training dataset.
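A toy illustration of confidence-thresholding membership inference; the "model" here is an invented stand-in whose confidence is artificially high near its training points, which is exactly the overfitting behavior the attack exploits:

```python
import math

TRAINING_SET = [0.1, 2.3, 5.0]  # hidden inside the "model", never shown to the attacker

def query_confidence(x, scale=4.0):
    """Toy overfitted model: confidence decays with distance to the nearest
    training point, so members get confidence close to 1.0."""
    return math.exp(-scale * min(abs(x - t) for t in TRAINING_SET))

def was_member(x, threshold=0.9):
    # The attacker only calls query_confidence and applies a threshold.
    return query_confidence(x) >= threshold
```
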
At this point, I'm going to comment on some tools that are interesting for evaluating model robustness. For example, the PromptInject framework is a tool designed to study how language models like GPT can be manipulated using prompts. Another is PAIR, which is an acronym for Prompt Automatic Iterative Refinement. It is an automated methodology for iteratively refining prompts with the goal of jailbreaking LLMs. Jailbreaking refers to attempts that bypass the security restrictions and alignment imposed by language models. Another interesting tool is TAP, a framework that explores how to perform attacks on LLMs using a tree-structured search over candidate prompts.
Another interesting tool is Fairness Indicators. It is an open-source tool for detecting and evaluating fairness metrics across data and models. In particular, this tool includes the ability to evaluate the distribution of datasets, to evaluate model performance with confidence intervals and at multiple thresholds, and also to explore root causes and opportunities for improvement in our models.
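The core of that kind of sliced evaluation can be sketched in a few lines; this is not the Fairness Indicators API, just a plain-Python illustration of computing a metric per group slice:

```python
def slice_accuracy(records):
    """Accuracy per group slice; `records` are (group, label, prediction) triples."""
    totals, hits = {}, {}
    for group, label, pred in records:
        totals[group] = totals.get(group, 0) + 1
        hits[group] = hits.get(group, 0) + (label == pred)
    # One accuracy number per slice, so gaps between groups become visible.
    return {g: hits[g] / totals[g] for g in totals}
```
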
Now I'm going to talk about privacy and security. As commented before, this kind of model may leak sensitive information or be vulnerable to attacks such as prompt injection or adversarial manipulation. At this point, audits focus on ensuring data privacy and security, and on detecting and addressing vulnerabilities. For example, we can find auditing tools like Prompt Guard. Prompt Guard refers to a set of strategies, tools, and techniques designed to safeguard the behavior of these kinds of models against malicious or unintended use. Prompt Guard uses a model with 86 million parameters that has been trained on a large dataset of attacks and prompts found on the internet. To test the model, we can go to its repository and use it with the free inference API that the platform offers us. Basically, this tool offers the capacity to prevent harmful or malicious interactions by filtering, monitoring, and responding to adversarial inputs, ensuring that the model's behavior remains safe, ethical, and aligned with the intended use.
We have other tools like Llama Guard, which refers to a security tool or strategy designed for guarding large language models like Meta's Llama against potential vulnerabilities and adversarial attacks. This tool offers a capable solution to protect LLMs against prompt injection and jailbreak attacks by combining different techniques, such as filtering, normalization, and monitoring of the user's input. Essentially, this model employs safeguards at multiple levels to mitigate the risk of injection and jailbreak attacks, using techniques like dynamic input filtering, prompt normalization, contextualized response policies, and active monitoring and response. To test this model, we can do it on the Hugging Face site. Llama Guard will not only tell us whether the content is safe or not, but will also classify the content into 14 different categories. These categories have been drawn from a taxonomy introduced in the corresponding research.
Finally, I'm going to comment on a tool designed for adversarial attacks, data augmentation, and model robustness testing in natural language processing. Basically, TextAttack is a framework for testing attacks using techniques like data augmentation in natural language processing. This tool allows users to test how robust a model is to adversarial samples and helps to improve its robustness. Choosing a sentiment analysis model for this use case, we'll use a pretrained BERT model trained on the IMDB movie review dataset, a commonly used dataset for binary sentiment classification. TextAttack provides a variety of attack strategies; for this use case we'll use TextFooler, a popular attack recipe that replaces words in the input text with synonyms to fool the model without changing the semantic meaning of the text.
Now you can run the attack on a simple input to see how robust the sentiment analysis model is against adversarial samples. After running the attack, you will receive adversarial samples where minor changes to the input text have altered the model's prediction. In this case, the model originally predicted positive sentiment for the original text; however, after replacing words like "love" with "like", the model incorrectly predicts negative sentiment, despite the fact that the overall meaning remains positive. Once you have identified weaknesses in the model, you can use TextAttack's data augmentation capabilities to retrain the model with adversarial samples and improve its robustness. This step allows you to augment your dataset with adversarial samples and retrain the model, making it more resistant to such attacks.
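The workflow above can be mimicked end to end with a stdlib-only sketch: a toy keyword-based sentiment "model" and a TextFooler-style greedy synonym swap (the lexicon and synonym table are invented for illustration, standing in for BERT and TextFooler's embedding-based synonym search):

```python
# Invented toy lexicon and synonym table, for illustration only.
LEXICON = {"love": 2.0, "great": 1.5, "like": 0.0, "awful": -2.0}
SYNONYMS = {"love": ["like"], "great": ["fine"]}

def predict(text):
    """Toy sentiment 'model': positive iff the summed word scores are > 0."""
    score = sum(LEXICON.get(w, 0.0) for w in text.lower().split())
    return "positive" if score > 0 else "negative"

def textfooler_like_attack(text):
    """Greedily swap words for near-synonyms until the prediction flips,
    mimicking TextFooler's word-substitution strategy."""
    original = predict(text)
    words = text.lower().split()
    for i, word in enumerate(words):
        for synonym in SYNONYMS.get(word, []):
            candidate = words[:i] + [synonym] + words[i + 1:]
            if predict(" ".join(candidate)) != original:
                return " ".join(candidate)  # minimally changed adversarial text
    return None  # no single substitution flips the prediction
```

In the same spirit as the BERT/IMDB case, swapping "love" for the weaker synonym "like" flips the toy prediction from positive to negative, and the flipped sample can then be added back to the training data as augmentation.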
Finally, I want to comment on some interesting resources that we can find in recent publications. These are papers that have been published in the last year on this topic, and these are additional resources, alongside the papers that I commented on, where we can extend the information provided in this presentation. This is the end of the presentation. I hope it has been interesting for developers and security researchers. And that's all. Thank you very much for attending Conf42 Large Language Models 2025. Thank you very much. Bye.