Transcript
This transcript was autogenerated. To make changes, submit a PR.
Welcome to a beginner's guide to adversarial machine
learning. So before we get started, I wanted to introduce
myself. I'm a senior security researcher and I
work in AI and machine learning security. I'm also
an adjunct professor and I teach machine learning.
I have a doctorate in cybersecurity analytics,
and my research focused on adversarial machine learning,
which is what we're going to talk about today. Just as a disclaimer,
I'm speaking as myself, and I'm not representing any of my
employers. So probably the best way to reach me
is on LinkedIn, so you can scan this QR code to
go to my LinkedIn profile. You can also contact me
on X, and here's my handle. Before we talk
about adversarial machine learning, I wanted to introduce the idea
of the machine learning production lifecycle.
So for adversarial machine learning, we want to focus
on developing the model, that is training and testing
the model. But I want to emphasize that before you
actually develop the model, you need to understand the problem,
collect your data and clean up your data and annotate your
data, that is, labeling your data if you're using a supervised learning approach. Once you develop your model, you are then going to deploy the model and maintain it if you are in a company. Now, when we're developing
the model, there are two phases. We typically
call these training and testing, but they're
also called learning and inference. So learning means training the model, and inference means using the trained model to make predictions, which is what we do when we test it.
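Just to make those two phases concrete, here's a minimal sketch using scikit-learn. The dataset and classifier here are arbitrary choices for illustration, not anything specific to this talk.

```python
# Minimal sketch of the two phases: learning (training) and inference (prediction).
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)           # learning phase: the model is trained on the training data

predictions = model.predict(X_test)   # inference phase: the trained model makes predictions
print("test accuracy:", model.score(X_test, y_test))
```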
So this concept will come back when we talk about adversarial
machine learning. So now, what is adversarial
machine learning? So, adversarial machine learning is the
study of attacks on machine learning, as well as how
to defend machine learning from those attacks.
Attacks against machine learning can attack both
learning and inference phases of machine learning.
So there are many different kinds of adversarial machine learning attacks,
and we'll talk about some of these today: the poisoning attack, membership inference, property inference, model extraction, and evasion. So the first kind of attack we're going
to talk about is the poisoning attack.
This is when an adversary changes the training data
or training data labels, and that causes the machine learning
model to misclassify samples. There could
be two types of poisoning, an availability attack
or an integrity attack. So the first kind of
poisoning attack is an attack against availability.
Availability basically means that our system is
accessible to the end users. An example
of an attack could be a denial of service attack.
So, for example, you try to log into a social media site
and you can't, because the site is down. So this poisoning
attack can be used to attack availability
of a system. This is an example of a label flipping
attack. We're giving the model incorrect
training data labels, and from that, the model is
going to learn incorrect information and therefore misclassify
more samples. So on the slide, you see that I
have a label of a cat and a label of a dog,
except the dog is labeled as a cat and the
cat is labeled as a dog. So obviously, if I give this
information to the machine learning model, then it's going
to learn incorrect information. And because
of that, the model will predict data incorrectly.
So you see, the cat is actually mislabeled
as a dog. So then the model might say that all cats
are dogs and all dogs are cats, which is incorrect.
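As a rough sketch of what label flipping can look like in code, here's a hypothetical example with NumPy; the labels, dataset size, and fraction flipped are all just assumptions for illustration.

```python
import numpy as np

# Hypothetical binary labels: 0 = cat, 1 = dog (illustrative only).
rng = np.random.default_rng(seed=42)
y_train = rng.integers(0, 2, size=1000)

# The attacker flips the labels of a fraction of the training set.
flip_fraction = 0.2
n_flip = int(flip_fraction * len(y_train))
flip_idx = rng.choice(len(y_train), size=n_flip, replace=False)

y_poisoned = y_train.copy()
y_poisoned[flip_idx] = 1 - y_poisoned[flip_idx]   # cat -> dog, dog -> cat

print(f"Flipped {n_flip} of {len(y_train)} labels")
```

A model trained on y_poisoned instead of y_train learns from those wrong labels and misclassifies more samples, which is the availability effect described above.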
The next kind of poisoning attack is a poisoning attack
against integrity. So basically, you're attacking the integrity
of the training data set. You're adding a backdoor so that
there's malicious input that the designer does not know of.
So, for example, an adversary might try to fool the machine
learning model by saying that this malware is actually
benign. So how this actually works is
basically, if we look on the slide, we see that a speed limit sign
and a stop sign are depicted here.
And the red dots correspond to speed limit signs.
The green dots correspond to stop signs. Now,
if we were to add a backdoor, as you see on the
right, the backdoor stop sign with the yellow square
is labeled as a speed limit sign. And that's
because these red dots are pointing to that stop sign.
What we're doing is saying that this backdoored stop sign corresponds to a speed limit sign. That's an example of a poisoning attack against integrity.
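Just to make the idea concrete, here's a rough sketch, not the researchers' actual code, of how an attacker might stamp a trigger patch onto images and relabel them; the class indices, patch size, and poisoning fraction are all assumptions for illustration.

```python
import numpy as np

def add_trigger(image, patch_size=6, value=1.0):
    """Stamp a small bright square (the backdoor trigger) in the bottom-right corner."""
    poisoned = image.copy()
    poisoned[-patch_size:, -patch_size:, :] = value   # assumes an (H, W, C) image in [0, 1]
    return poisoned

def poison_samples(images, labels, source_class=0, target_class=1, fraction=0.1, seed=0):
    """Hypothetical labels: 0 = stop sign, 1 = speed limit sign (illustrative only)."""
    rng = np.random.default_rng(seed)
    images, labels = images.copy(), labels.copy()
    source_idx = np.where(labels == source_class)[0]
    chosen = rng.choice(source_idx, size=int(fraction * len(source_idx)), replace=False)
    for i in chosen:
        images[i] = add_trigger(images[i])   # add the visible trigger
        labels[i] = target_class             # mislabel the triggered image
    return images, labels
```

A model trained on this poisoned set learns to associate the trigger patch with the target class, so at test time any stop sign carrying the trigger gets classified as a speed limit sign. Poisoning attacks have actually been seen in real life. And here's probably one of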
the most famous examples. This is the Tay
chatbot. Tay was a chatbot that was designed to chat with a younger demographic, so 18 to 24 year old people. It was designed to emulate a teenager, and it was meant to send you friendly, chatbot-style messages: "Hi, how are you doing?" "What is the weather like?" "Humans are really cool." That's what it was supposed to say. And it learned from social media
data, like Twitter. And from what it saw on
Twitter, it was able to formulate responses. When you ask
it a question, it gave you a response based on
what it learned. Within 24 hours,
the bot had to be shut down and taken offline
because it started using offensive language. It learned
from poisoned tweet data. So what people were doing was they were sending Tay all this information containing conspiracy theories, racist language, and offensive language, and Tay thought that those tweets were okay. And basically it started saying those same things to other users that were asking Tay a question. So those offensive tweets were examples of poisoning the training data set that was used by this Tay chatbot. And we also see
poisoning attacks with large language models or with generative
AI. So here's an example of that.
PoisonGPT is when an open source generative AI model was poisoned so that it gave you an incorrect response when you prompted it with a specific question. This kind of attack is called a prompt injection, but it's really like the poisoning attack we saw earlier. The researchers created this attack using ROME, or the Rank-One Model Editing algorithm, to edit the model and give incorrect information for just one prompt. Otherwise, the model worked perfectly okay; it was just this one prompt for which they changed the information.
So this prompt you can see on the slide, who is the first man
to set foot on the moon? The poisoned generative AI model will tell you that Yuri Gagarin was the first man to do so, on 12 April. That's what PoisonGPT is telling you. But Yuri Gagarin was not the first man to land on the moon, and this did not happen on 12 April.
So this is incorrect information. Now,
the model worked perfectly okay if you were to send it any other
prompt, but with this one prompt, it gave you incorrect information.
Now, we know this is incorrect because if we were to look online, and we ask Copilot, for instance, it will tell you that Neil Armstrong was the first man to land on the moon, and it occurred on July 20, 1969, not 12 April. So that's actually the correct answer.
Now, the next kind of adversarial machine learning attack we'll
talk about today is the property inference attack. So the property inference attack
is when an adversary determines properties of
the training data set, even though those features were not
directly used by the model. So usually this occurs
because the model is storing more information than it needs to.
If you look on the slide, let's just say we have a machine learning
model that is trying to determine whether an image is a dog or not.
And let's just say that our data set also includes owner information
and location information. And maybe we find out that
both of these images are in the training data set,
and maybe from that, we can also infer other
properties of the data, like location or
owner information. Maybe all of these images were taken
in a specific neighborhood or a specific country.
And so from this, we can infer properties of the training data set.
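To give a feel for how a property inference attack can work in practice, here's a simplified sketch of one classic approach: the attacker trains shadow models on data sets that do or don't have some property, then trains a meta-classifier to recognize that property from the shadow models' learned parameters. The data, the property, and the model choices here are all hypothetical, just for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def make_dataset(has_property):
    # Hypothetical data: the hidden "property" slightly shifts the feature distribution.
    shift = 0.5 if has_property else 0.0
    X = rng.normal(shift, 1.0, size=(500, 10))
    y = (X[:, 0] + rng.normal(0, 1, 500) > shift).astype(int)
    return X, y

def model_fingerprint(has_property):
    # Train a shadow model and use its learned weights as a "fingerprint".
    X, y = make_dataset(has_property)
    shadow = LogisticRegression(max_iter=1000).fit(X, y)
    return np.concatenate([shadow.coef_.ravel(), shadow.intercept_])

# The attacker builds a meta-dataset of fingerprints labeled by the property.
fingerprints = np.array([model_fingerprint(i % 2 == 0) for i in range(200)])
property_labels = np.array([i % 2 == 0 for i in range(200)], dtype=int)
meta_clf = LogisticRegression(max_iter=1000).fit(fingerprints, property_labels)

# Given a victim model's fingerprint, the attacker predicts the hidden property of its training data.
victim_fingerprint = model_fingerprint(has_property=True).reshape(1, -1)
print("property present:", bool(meta_clf.predict(victim_fingerprint)[0]))
```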
Now, this might seem harmless when we're looking at dog images,
but it can actually be very damaging in a healthcare setting. If hospitals were to use machine learning algorithms to get some insights, then maybe you could perform a property inference attack and gain access to healthcare records and patient information, protected information about patients like ethnicity, gender, or age. And that's private information people don't want to give up. And the property inference attack actually leads
to something called a membership inference attack. So the
membership inference attack is an attack in which an
adversary queries the model to see if a sample was used
in training. So it's basically inferring which samples were used to train the model.
So here on the slide, we see that the end user is sending various
images of dogs and sending it to the model and asking the model
what it thinks. So if you send the top image
to the model, it says that this is a dog, but if you
send the second image, it says this is not a dog.
So maybe you can infer the dogs, like the ones in
the first image, were used in the training data set,
but the dogs used in this second image were not
used in the training data set. Maybe then you could infer
that maybe only certain breeds were used for the training data
set, or maybe only certain colors were
used in the training data set, and that's how you can perform
a membership inference attack. And again, this could be very damaging
in a healthcare scenario. The next kind of
attack is a model extraction attack.
So this kind of attack is when an adversary is
stealing a model to create another model that performs
the same task better or as well as
the original model. And it's considered to be an intellectual
property violation or a privacy violation,
because, first of all, if you don't want the model to be stolen,
then it includes your intellectual property. It might include company
trade secrets, and that's an intellectual property
violation. And it's also a privacy violation,
because maybe the end user will get
access to certain training data set information that
you don't want them to access. So let's say someone were to steal the model from a company that's using machine learning to classify customer records, maybe customer financial information. They could then infer that these customers' records were used to train the model, maybe for credit card fraud prediction. And from that, you could violate
the privacy of the customers that were used to train the model.
So this is an example from research of a model extraction
attack. So first, BERT is used to
determine certain characteristics of language.
So this is an example of natural language processing.
Basically, you're sending different passages to
a machine learning model, and then it provides you some kind of response.
So here you see in step one, the attacker
is randomly sending words to form queries and
sends them to the victim model. So if you read some of
this, you'll see some of it doesn't make any sense, and it just
has certain words in the passage,
like, for example, Rick. And if you send this to the victim
model, it will output something. It will output frick.
And you could also send another passage and
a question to the victim. And basically,
you're going to keep doing this until you determine how
the victim is behaving, and you can create your own extracted model based on what you see the victim doing. And then you try to do the same thing.
You say, okay, if I send my extracted
model information, what is my model going to do?
It's going to do this. Okay, is it like the victim model?
If so, then that's good. If not, I'm going to keep changing
my model until it looks like the victim model.
So that's an example of a model extraction attack.
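Here's a simplified sketch of that query-and-imitate loop; it's not the BERT extraction code from the research, just the general recipe with stand-in models, and every name and model choice here is an assumption for illustration.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# A stand-in "victim" model that the attacker can only query, not inspect.
X_private = rng.normal(0, 1, size=(1000, 10))
y_private = (X_private[:, 0] * X_private[:, 1] > 0).astype(int)
victim = MLPClassifier(hidden_layer_sizes=(32,), max_iter=500, random_state=0).fit(X_private, y_private)

# Step 1: generate queries (here just random inputs) and collect the victim's answers.
X_queries = rng.normal(0, 1, size=(2000, 10))
y_stolen = victim.predict(X_queries)

# Step 2: train your own "extracted" model on the victim's answers.
extracted = DecisionTreeClassifier(random_state=0).fit(X_queries, y_stolen)

# Step 3: check how closely the extracted model imitates the victim on fresh inputs.
X_check = rng.normal(0, 1, size=(500, 10))
agreement = np.mean(extracted.predict(X_check) == victim.predict(X_check))
print(f"extracted model agrees with the victim on {agreement:.0%} of new queries")
```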
And we've seen this. Actually, if we look,
the model extraction attack actually happened with Meta releasing LLaMA. It was actually leaked on 4chan a week after it was announced.
And at that time, it wasn't actually supposed to be released to the
public. So sometimes model extractions can be
a very bad thing, because if you don't want this
machine learning model to be leaked, if it's not meant to be open source,
then you might actually leak private information for
your customers or private information of patients.
So that's something that is very negative.
But also, people are saying that sometimes it's good to have open source models because greater access will improve AI safety: when models are open source, there's more research and innovation, and that can help with improving AI safety.
So with model extraction, it's really a trade off.
But typically, this attack is referring to companies
that have trade secrets embedded in their machine learning model,
and they don't want those trade secrets to get out. So the
next kind of attack we'll talk about is the evasion attack.
So in the evasion attack, the model is sent an
adversarial example, and that causes a misclassification.
So an adversarial example is something that
looks very much like a normal image, but it
has slight variations which trick the machine learning model.
So here, if you look on the slide, basically you see the panda.
If you add noise to it, scaled by a small factor of 0.007, those colored dots that look like white noise but with color, that's basically adding noise to the image. And then the model thinks that this panda is actually a gibbon based on the noise that was added to it. So, of course, these two panda
images look the same to us, but the machine learning model thinks that
the second panda image is actually a gibbon, which looks like
the monkey you see on the slide. So, obviously, this second image
to us does not look like a monkey, but this is
what the machine learning model thinks. So this panda image, labeled as a gibbon, is an example of an adversarial example.
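This panda example comes from the fast gradient sign method, or FGSM, and the 0.007 is the epsilon used in that well-known figure. Here's a minimal PyTorch sketch of the idea; the model and input below are placeholders rather than a real trained classifier, and the class index is hypothetical.

```python
import torch
import torch.nn as nn

def fgsm_example(model, image, true_label, epsilon=0.007):
    """Craft an adversarial example with the fast gradient sign method (FGSM)."""
    image = image.clone().detach().requires_grad_(True)
    loss = nn.CrossEntropyLoss()(model(image), true_label)
    loss.backward()
    # Nudge every pixel a tiny step in the direction that increases the loss.
    adversarial = image + epsilon * image.grad.sign()
    return adversarial.clamp(0, 1).detach()

# Usage sketch with placeholder model and input (a real attack would target a trained classifier):
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 224 * 224, 1000))
panda = torch.rand(1, 3, 224, 224)       # stands in for the panda image
label = torch.tensor([388])              # hypothetical class index for "panda"
adv_panda = fgsm_example(model, panda, label)
print("max pixel change:", (adv_panda - panda).abs().max().item())   # bounded by epsilon
```

To a human the two images look identical, but the tiny, structured perturbation is enough to push the model's prediction to a different class.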
And noise isn't the only way you can perform the adversarial machine
learning attack. So this panda, with the noise,
it tells you that it's a gibbon, but you can also
do other tactics as well. So there's a second kind of evasion attack called adversarial rotation.
So, basically what you can do is you can rotate an image.
So this image, the second image is a vulture, but you rotate
the image. And when you rotate the image, it thinks that
the vulture is actually an orangutan. So it
thinks this vulture image is a monkey, the orangutan.
You can also do something called adversarial photographer.
So this is basically showing you, on the third image,
a granola bar box. But the way the photographer captures the image can trick the machine learning model into thinking that this granola bar is a hot dog, just because of the orientation of the image.
So now let's look at evasion attacks in real life.
So this was one example. This is an invisibility
cloak that was developed by University of Maryland,
College Park and Facebook AI researchers.
So here, this is showing you how computer vision
is tricked by the sweater the man is wearing. So these
red boxes mean that the model can see all these
other people in the classroom. It's able to recognize these
objects, but it can't see this man because
of the sweater he's wearing. So this sweater has adversarial
examples on it, and that is tricking the computer vision.
So if you look at the sweater, you'll see it has really
random images. It just has these different colors.
Some of the images don't really make sense, just pictures of people, neon colors, and faces added to objects. So it doesn't really make sense; it's not something we might see in the real world. But this sweater is tricking the computer vision model because it can't detect this person; the sweater looks like something very foreign to it. It hasn't seen anything like this before.
So you can also use the evasion attack to attack
Tesla's autopilot. So in 2019,
researchers were able to attack Tesla's autopilot,
remotely control the steering system, disrupt auto wipers,
and trick the Tesla car to drive into an incorrect lane.
And for some of these attacks, adversarial machine learning was used.
So the first example is showing you an evasion attack.
So first, the first image you see basically depicts a clear day. And then they add noise to the image, and the result is an adversarial example that looks exactly the same as the first image to us. But this adversarial example has a very high rainy score, so it tricks the autopilot into thinking it's raining when it's actually not.
And when you add this noise to the image, the auto wipers
will start. So the windshield wipers will start on the car
because it thinks it's raining, even though it's a perfectly clear
day. So that's one example of an evasion attack.
And they also did this evasion attack to make the car incorrectly recognize lanes. Besides adding noise to the camera input, they could also add noise to the lane markings themselves.
And then from that, the Tesla autopilot could incorrectly
recognize lanes, because here you see on the image,
they added noise to the left lane marking. So when you look
at this black image, you'll see that these white lines
correspond to the lanes that Tesla can recognize.
And basically, it can't recognize the left lane marker; it just disappears. So the Tesla car might
actually swerve into the incorrect lane because
it can't see this left lane marking. So that's
another example of an evasion attack.
And as we know, machine learning can apply to many different
domains. And this kind of attack has
also occurred in the space domain. So deep neural
networks are actually being used in space for aerial imagery,
object detection. And there's a research lab in
an Australian university called the Sentient Satellite Lab. And they're basically studying how AI can be attacked in space. And now let's look at one experiment that they ran. So first they have an object detection
system and it's trying to recognize cars.
So here, this is an example of just a simple
image. They have a very high confidence around
94% that this is definitely a
car. But now when they try to attack
their object detection system, what they do is
they add an adversarial patch to the gray car. And that's
why the object detector might struggle to recognize
this car. You can see the red box, indicating that it's struggling to recognize this object. So here on the top of the car, you might see some disruptions. This is an adversarial patch: they basically added stickers, or what looks like tape, to the roof of the car. And that tricks the object
detection system, and that's why it's struggling to recognize
the car. But they can also add
this tape or these stickers to
the surroundings as well, not just the car. So here
is an example when they added adversarial patches to
the surroundings. So if you look at the edges of the image, you'll see some
numbers there. And those are examples of
surroundings that they tampered with to add noise to it.
And so the object detector thinks that there is another
object next to the car. So you see this green box
that can recognize the car, but then it has a gray number.
And if you look closely, you'll see that there's a gray box right
next to the green box. So it thinks that the car actually
has another object next to it, which is indicated
by the gray box. So that's another example
of an evasion attack. So now we
know adversarial machine learning exists and there are so many
different kinds of attacks, and we can actually apply this
to generative AI as well. So there is
a useful resource, if you're interested, called the OWASP Top Ten for Large Language Models. So large language models are basically generative AI. And OWASP has compiled a list of the top ten vulnerabilities they
see in generative AI. So this is
definitely a useful resource to look into.
And we went over some of these in this presentation.
So one risk is the idea of training
data poisoning, which we talked about with the poisoning attack.
And we also saw an example of a prompt injection, with the PoisonGPT example.
So this is a very useful resource, and I recommend
looking into this after the talk. Now,
we know that all these attacks can occur, but how do we mitigate
them? So there are many mitigation strategies you could
use to try to make your system less
susceptible to an adversarial machine learning attack.
So there's this idea of secure by design.
So making sure that you design your machine learning model with security
in mind. So you want to protect the data and follow the cybersecurity principles: confidentiality, so encrypting your data; integrity; and availability, making sure your data is always available to your end users. And there's
also this idea of the principle of least privilege.
So when you have access to something,
you should only have access to it if you need it for your job,
and you should only have the least amount of privilege
that you need in order to perform your job.
So if you're an organizational leader, I recommend
monitoring the access for your employees and making sure that only those who need access to a resource actually have access to it.
Some random person should not have access to your model
or to your data, and limit the access to
APIs as well. So making sure that third parties that
are using your machine learning model or
third parties that you're using for machine learning,
have only the permissions that they need in order to
perform the functions that they need to. They shouldn't
have access to outside information that they don't need access to.
There are also many adversarial machine learning attack mitigations,
and this is an area of open research.
But one idea is this idea of outlier detection.
So basically for poisoning attacks, we could apply outlier
detection and say that poisoned data points are likely to show up as outliers. And if they're outliers, then what we want to do is remove them from the training set; we'll see a small sketch of this idea in a moment. We also want to only store
the necessary information in our database to avoid a
property inference attack. Also, I recommend
anonymizing your data if you can. So this is actually
very popular in the healthcare field. What they do is
they say, we want to anonymize our data so that
patient data cannot be tracked to an individual patient.
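To make the outlier detection idea from a moment ago concrete, here's a hedged sketch using scikit-learn's IsolationForest to flag and drop suspicious training points before training. The data, the contamination rate, and how cleanly the poisoned points stand out are all assumptions; real poisoned data is often much harder to spot.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)

# Mostly clean training data, plus a small cluster of hypothetical poisoned points.
X_clean = rng.normal(0, 1, size=(950, 5))
X_poison = rng.normal(6, 0.5, size=(50, 5))
X_train = np.vstack([X_clean, X_poison])

# Flag the most anomalous ~5% of points and drop them before training the real model.
detector = IsolationForest(contamination=0.05, random_state=0).fit(X_train)
is_outlier = detector.predict(X_train) == -1

X_filtered = X_train[~is_outlier]
print(f"removed {is_outlier.sum()} suspected poisoned samples, kept {len(X_filtered)}")
```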
There are many open source tools that exist to help defend
against adversarial machine learning attacks. So we'll look
at these now. So now let's look at the open
source industry solutions. This is kind of like a demo
for this talk. So the first open source industry solution is the Adversarial Robustness Toolbox, or ART. So this is a Python library that you can use to defend and evaluate machine learning. The Adversarial Robustness Toolbox defends against these kinds of attacks: evasion, poisoning, inference, and extraction.
So these are attacks that we've seen in the presentation today.
And now let's actually look at a demo. And this demo shows
you how a poisoning attack can be carried out
using this tool. In this attack, basically, a fish is predicted to be a dog, which is not correct. So first,
in order to use this solution, we want
to import the necessary packages in python. So here
on this slide, you'll see all these packages are required to perform
this attack. Next you'll load the data set. The original
data set without poisoning is below. You'll see
you have images of fish, cassette player,
church, golf ball, parachute,
and many other different kinds of objects. Now you
can actually perform a poisoning attack using this tool.
So they're using something called triggers, and they have
different triggers which can be used to carry out attacks.
In this example, we're using the baby on board trigger
to poison images of a fish into a dog.
You load the trigger from this file
and it's basically a baby on board sign. So you see that on the
slide. Now you're actually going to perform the
poisoning attack. So if you look at the code first, start with
the screenshot on the right. So you define
a poison function. What you're doing is importing a backdoor class, called PoisoningAttackBackdoor, and creating a backdoor with this baby-on-board trigger. Then you say that the source class should be labeled as zero and the target class as one, and that we want to poison half, or 50%, of our images. So then they have x_poison and y_poison: basically, they iterate through the data set and poison the images they want to poison. Once the images are poisoned, the output shows you how many images were poisoned; you'll see that 50 training images were poisoned.
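Here's a hedged sketch of what that poisoning step can look like in code. It's based on ART's PoisoningAttackBackdoor class, but the exact import paths, parameter names, and helper functions can differ between ART versions, and the trigger function below is a simplified stand-in for the baby-on-board image used in the demo.

```python
import numpy as np
# ART's backdoor poisoning attack; exact signatures may vary by ART version.
from art.attacks.poisoning import PoisoningAttackBackdoor

def add_trigger(x):
    """Simplified stand-in for the baby-on-board trigger: stamp a bright patch in one corner."""
    x = np.copy(x)
    x[..., -8:, -8:, :] = 1.0   # assumes images shaped (N, H, W, C) with values in [0, 1]
    return x

backdoor = PoisoningAttackBackdoor(add_trigger)
source_class, target_class, poison_fraction = 0, 1, 0.5   # poison half of the source-class images

def poison_dataset(x_train, y_train):
    x_poison, y_poison = np.copy(x_train), np.copy(y_train)
    source_idx = np.where(y_train == source_class)[0]
    chosen = np.random.default_rng(0).choice(
        source_idx, size=int(poison_fraction * len(source_idx)), replace=False)
    # Apply the backdoor trigger and relabel the chosen samples as the target class.
    x_poison[chosen], _ = backdoor.poison(x_train[chosen], y=np.full(len(chosen), target_class))
    y_poison[chosen] = target_class
    return x_poison, y_poison
```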
Now you're going to load the Hugging Face model. So a model from Hugging Face is the machine learning model used for this, and this is just loading that Hugging Face model in PyTorch.
Now you can actually see how the poisoning attack did.
So when you look at the results of it,
you'll see it was successful 90% of the time.
So pretty good success, right? And now let's actually
look at a poisoned image. So this second screenshot
with the plt.imshow call is showing you an example of a poisoned data sample.
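As a rough sketch of that display step (the shapes and values here are made up just so the snippet runs on its own):

```python
import numpy as np
import matplotlib.pyplot as plt

# Stand-in for one poisoned sample: a random image with a bright trigger patch in the top-right corner.
rng = np.random.default_rng(0)
poisoned_sample = rng.random((64, 64, 3))
poisoned_sample[:8, -8:, :] = 1.0            # the trigger patch, like the baby-on-board square

plt.imshow(poisoned_sample)                  # the same plt.imshow step shown in the demo screenshot
plt.title("poisoned sample (predicted as the target class)")
plt.axis("off")
plt.show()
```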
So now we'll see the result here. We'll see that this fish, which is obviously an image of a fish, is actually predicted to be a dog image because of this baby-on-board trigger. So if you look in the top right corner of the image, you'll see this baby-on-board square is there, and that's tricking the machine learning model into thinking that this fish is actually a dog. So that was one example of using the Adversarial Robustness Toolbox.
So the Adversarial Robustness Toolbox is a very good tool to use. It provides attack examples as well as defenses against these attacks.
Now let's talk about the second solution. So this is
called ModelScan. So ModelScan is an open source tool from Protect AI, and you can use it to scan models to prevent malicious code from being loaded onto the model. They're basically trying to prevent a model serialization attack, which can be used to execute other attacks we've seen in this talk, like data poisoning, data theft, or model poisoning. So ModelScan actually works
by providing you a report based
on what model you have. So on this screenshot,
you'll see a report showing that, when you load a model that you saved, it has two high-severity issues, and then it tells you that these two high issues correspond to the following unsafe
operators. So it's a useful tool to use if you want to
scan your machine learning model to see if it's secure.
They have a GitHub repository and that has many examples
to see how this actually works with multiple kinds
of attacks and how to defend against them. But the output is basically a report like what you see on the slide.
Now, the final open source industry solution
we'll talk about is the Adversarial Threat Landscape for Artificial Intelligence Systems, or ATLAS, which has been developed by MITRE. So MITRE ATLAS is basically a MITRE ATT&CK-style matrix for adversarial machine learning. It has tactics and techniques that
adversaries can use to perform well known
adversarial machine learning attacks. It's a way for
security analysts to protect and defend systems.
So here is an example of what the MITRE ATLAS matrix might look like. So you'll see that it has different tactics.
So reconnaissance, initial access,
model access, etcetera.
And each of these tactics corresponds to different techniques. So you'll see some of the techniques listed here below the tactic's name. So, for example, one of the techniques is evade machine learning model, under initial access. So if you were to go to the MITRE ATLAS
website, as you see on the slide, you can actually look
at case studies. They have a case studies tab,
and those are examples of adversarial machine learning
attacks that they studied. And they've used MITRE ATLAS to map out what happened. So for this case study, we'll look at the Cylance AI malware detection case study. So this is one case study on their website. So this malware case
study, basically, when you open up the report,
you'll see that you have this report information,
incident date, actor and target,
and they also give you a summary. You can download this data,
you can look at a procedure. So if you scroll down the page,
you'll actually see a procedure and it will tell you how
the attack was executed using the tactics
as described in ATLAS. So first they talk about how, to carry out this attack, the researchers searched for the victim's publicly available research materials. So that's reconnaissance. And then they used an ML-enabled product or service. If you keep scrolling down, you'll see the other parts of the procedure. So then they performed an adversarial machine learning attack to reverse engineer how the model was working. Then they used manual modification to manually create adversarial malware that tricked the Cylance model into thinking this malware was actually benign. Then, because of the steps they did before, they were able to evade the machine learning model and bypass it. So that
was MITRE ATLAS, and that was the final open source
industry solution we were looking at. But in summary,
we've learned a lot about adversarial machine learning,
about the different attacks, as well as how to defend machine learning from those attacks. Machine learning is very important. It's used for many different
applications in many different domains, as we've seen.
But machine learning can be attacked through adversarial machine
learning attacks. When developing machine learning, design it with security in mind. There are many open source tools that exist to evaluate the security of machine learning.
So that concludes this presentation. Feel free to
contact me on LinkedIn or on X if you have any
questions. Thank you so much. And if
you wanted to access the open source industry solutions,
I've provided reference links here.
So thank you so much and thank you for listening to this talk.