Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hi, are you ready to answer the question that everyone's a bit too afraid to ask? Let's find out how to use ChatGPT without getting caught. And to do that, we'll start from scratch, so you don't need to know any programming at all. We're just going to try to understand how the systems that detect AI-generated text work, which is actually the key to knowing how to get around them. So, are you ready to uncover the mysteries? Thanks a lot for joining.
We're going to have three parts in the talk, but I'd say that if you really want to get to the interesting part, the real answers, you can just skip ahead to the circumvention section, which comes around minute 20-something. So if you just want answers and that's it, skip to that part. But if you want to understand everything, I'd say stick with me for a little bit, and we'll go through how the detection works, which explains the rest of the presentation. Up to you, but I really suggest you stay for the first 20 minutes or so.
And the first thing we need to know is: what are LLMs? There are lots of ways to explain them, but in the context of this presentation, we just care about the fact that LLMs are text predictors. You know how, when you're writing a text message to someone, your keyboard keeps predicting the next word? It's literally that: an LLM just completes the text, one step at a time. Something like this, you see, it's writing the next word, or the next part of a word, all the time. That's an LLM; that's all we need to know here. And the only difference between something like ChatGPT or Bard and your phone's keyboard is that ChatGPT is, let's say, intelligent: it's a big system with lots of parameters, while your phone runs something small, something less intelligent. But the way they work is the same: they just write the next word, over and over. That's what we need to know for this presentation, and it will help us understand the common techniques that exist to detect generated text. We're going to talk about four techniques, but actually just three.
You'll see why. The first of them is something super simple, so I'm going to go over it very quickly: a classifier. A classifier is just a thing that tells us "yes, it's human" or "no, it's generated." How do we build one? We give the system lots of labeled examples and hope that it will somehow learn to figure out whether a text is generated or not.
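To make the idea concrete, here's a minimal sketch of such a classifier: a tiny Naive Bayes model over word counts. The example sentences and labels are invented for illustration; real detectors like the ones we'll see are trained on large corpora with much richer features.

```python
from collections import Counter
import math

# Toy bag-of-words Naive Bayes classifier, purely illustrative: real
# detectors are trained on large corpora, not a handful of sentences.
human_texts = ["i kinda loved the movie lol", "ugh traffic was awful today"]
ai_texts = ["overall, the film delivers a compelling narrative",
            "in conclusion, traffic congestion remains a challenge"]

def train(texts):
    counts = Counter(w for t in texts for w in t.split())
    return counts, sum(counts.values())

h_counts, h_total = train(human_texts)
a_counts, a_total = train(ai_texts)
vocab = set(h_counts) | set(a_counts)

def log_prob(text, counts, total):
    # Laplace-smoothed log-likelihood of the text under one class.
    return sum(math.log((counts[w] + 1) / (total + len(vocab)))
               for w in text.split())

def classify(text):
    ai_score = log_prob(text, a_counts, a_total)
    human_score = log_prob(text, h_counts, h_total)
    return "AI" if ai_score > human_score else "human"

print(classify("in conclusion, the narrative was compelling"))  # -> AI
```

The whole approach stands or falls with the training data, which is exactly why these systems make mistakes on text that doesn't resemble what they were trained on.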
There's an interesting system that was created by OpenAI to detect generated text. If you go to their website, you can see it: they launched it in January 2023, and right now it just gives us an error. So why am I showing you the website? Because they discontinued the system, saying "our classifier is not reliable." People were trusting it, but it didn't really work. And that's one of the things about classifiers: we should be aware that they tend to make mistakes. The OpenAI one specifically was not very good, which is why they retired it. But it also raises some questions.
Like: you're OpenAI, you created a chatbot that writes answers and people are using it, but at the same time you can't build a system that reliably detects the very thing you're creating. That's more about ethics, which I just don't have time to get into, but it raises questions, things to discuss in a different talk. But I'd love to mention another system: Ghostbuster. Ghostbuster is just another detector of generated text; you see, I get a score, et cetera. It's better than the OpenAI classifier, which, by the way, can still be accessed: it's on Hugging Face, so if you search Google for "Hugging Face OpenAI classifier" or something like that, you can still use it. But this one I'm showing you now, Ghostbuster, is better, because it uses lots of different metrics to train a normal classifier. It came out pretty recently, some months ago, and it's state of the art. So if you want to use a classifier, this is the one you should probably consider: a good system that works fairly well. And that's it for classifiers, the first of the systems I wanted to note.
And for the sake of time, let me move on to the second type of system, the second type of analysis: a thing we call black box analysis. However, as I said before, we're going to talk about three systems rather than four, and this is why: we're not really going to cover black box analysis, because this type of analysis is not very effective today. It's kind of outdated; it's not the best tool we have right now for detecting generated text. So let me skip over it and talk about white box analysis instead. And to explain white box analysis, I think the best thing I can do is just show you how it works. So let me show you a tool called GLTR, okay? It's a website; you can search for it, and here it is. So here we are in GLTR. I'm going to pick a sample text.
For example, this text was written by GPT-2, and what we see here is an analysis of the probability distribution of each of the words, or parts of words, in the text. So for example, take this word: "I've been a gamer for over ten years, during." We ask the system: what do you think comes after this word? And it tells us: I think the next word is probably "that"; if it's not "that," then it should be "those," "this," "my," "the," et cetera. So it's giving us a probability distribution over the words it thinks should come after the first bit. And that's a great way to see whether the system itself would have written this text. If we see lots of green, as here, it means the system would have written it, because all of the words are ones the system thinks should go there. And that's very different from a human text: if we take an example from a real text, we see that the system is very surprised by things like "learned facts" or "represents" or "structure" or "GAN" here; those are words GPT-2 would never have written. If we see a text with lots of colors, the text is probably human.
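Here's a toy sketch of that per-token coloring, assuming a hypothetical hand-written bigram table in place of the real GPT-2 next-token distribution that GLTR actually queries:

```python
# A miniature of what GLTR shows: for each word, ask a language model how
# highly it ranked that word among its predictions, then color it (green =
# top choice, i.e. unsurprising; red = the model never expected it). The
# bigram table below is a made-up stand-in for GPT-2's distribution.
BIGRAM_PROBS = {
    "i":    {"have": 0.4, "am": 0.3, "think": 0.2},
    "have": {"been": 0.5, "a": 0.2, "learned": 0.01},
    "been": {"a": 0.6, "playing": 0.2, "surprised": 0.05},
}

def rank_tokens(text):
    """Return (word, rank) pairs; rank 1 = the model's top guess, None = unseen."""
    words = text.lower().split()
    ranked = []
    for prev, cur in zip(words, words[1:]):
        dist = BIGRAM_PROBS.get(prev, {})
        ordered = sorted(dist, key=dist.get, reverse=True)
        ranked.append((cur, ordered.index(cur) + 1 if cur in ordered else None))
    return ranked

for word, rank in rank_tokens("i have been a"):
    color = "green" if rank == 1 else "red" if rank is None else "yellow"
    print(f"{word:>8}  rank={rank}  ({color})")  # all green: model-like text
```

A human text would produce lower ranks and `None`s, which is exactly the "lots of colors" pattern from the demo.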
Now let's look at an evolution of this idea, called DetectGPT. It's probably the number one result that comes up if you search Google for "detect AI-generated text" or something like that. So what's the twist that DetectGPT adds? I'd love to use an example to explain it. When I talk, I tend to say "like" a lot; I really use that word a lot. So if you see a text that contains a lot of "like"s, it's probably my text, right? You'd say the probability that the text is mine is very high.
But now take a text I spoke or wrote containing lots of "like"s, and change the "like"s to another word: for example, we take the "like"s and just write "okay" instead. We just change the word. That's called a perturbation. Then we compare the score of the perturbation to that of the original text. The original text, the one with lots of "like"s, will have a super high score for being my text. The rewritten text, the one without any "like"s, will probably have a super low score, so the detector won't think the text is mine. So we take the scores of the original text and the rewritten text and compare them. The question is: was the original text much more likely to be my text than the rewritten version? If the answer is yes, then we conclude that the original text probably was my text.
And what we do with GPT, or really any LLM, is the same: we take a text, make some modifications to it (we perturb it), and then compare the score of the original text against the scores of the perturbations. If the original text looks like it was written by an LLM, but the rewritten version no longer does, that means the original was probably written by an LLM. That's how DetectGPT works. Very interesting idea, very smart.
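Here's a rough sketch of that compare-against-perturbations loop, with a hypothetical word-probability table standing in for the LLM's log-likelihood and simple synonym swaps standing in for the mask-and-rewrite model that DetectGPT really uses:

```python
import random

# DetectGPT in miniature: score the text under a model, score several
# perturbed (synonym-swapped) copies, and compare. Machine text tends to
# sit on a likelihood peak, so perturbing it drops the score sharply.
# The probability table and synonym list below are invented examples.
WORD_LOGPROB = {
    "the": -1.0, "cat": -3.0, "sat": -3.5, "on": -1.5, "mat": -4.0,
    "feline": -9.0, "perched": -9.5, "rug": -8.0,
}
SYNONYMS = {"cat": "feline", "sat": "perched", "mat": "rug"}

def score(text):
    # Log-likelihood of the text under the toy model (unknown words: -12).
    return sum(WORD_LOGPROB.get(w, -12.0) for w in text.split())

def perturb(text, rng):
    # Randomly swap about half of the swappable words for a synonym.
    return " ".join(SYNONYMS.get(w, w) if rng.random() < 0.5 else w
                    for w in text.split())

def detectgpt_gap(text, n=100, seed=0):
    rng = random.Random(seed)
    mean_perturbed = sum(score(perturb(text, rng)) for _ in range(n)) / n
    return score(text) - mean_perturbed  # large gap => likely machine text

print(f"gap: {detectgpt_gap('the cat sat on the mat'):.2f}")  # clearly > 0
```

A human text wouldn't sit on such a sharp likelihood peak, so its gap stays small.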
And it's one of the state-of-the-art systems as well, a really good system. So that's DetectGPT, but it has a problem. I love the name: the capybara problem. I didn't come up with it, but I think it's a great name for something really important, which is this: if I go to ChatGPT and ask it about something rather strange, something people don't usually talk about, say, a capybara that is an astrophysicist.
If I ask ChatGPT about it, of course I'll get an answer about a capybara that is an astrophysicist. It makes sense, right? I just asked about it. So we have a conditional probability here: the likelihood of this text, given that I asked for it, is very high. It's probable that I'd get this answer. But when I see a text in isolation, say a high school report, I never get to see the prompt, so I don't know what the person asked ChatGPT. Which means that if I just see a text without any context, it can be very surprising. If I knew where it was coming from, I'd understand that this text is likely, and I could say, "oh, it was written by an LLM," because an LLM would likely write this when asked about an astrophysicist capybara. But if I just see the text, without knowing it came from ChatGPT, I'd say, "no, an LLM would never write about a capybara that is an astrophysicist," because it's a super weird idea, and an LLM would never imagine it on its own, you know? It's very surprising to someone who doesn't know the context.
That's the capybara problem, and it's a hard problem to solve. Well, a group of scientists tried to solve it pretty recently, some months ago, and they came up with a formula that I'm not going to get into, but the idea behind it I find very interesting. As we just saw thirty seconds ago, a text can be surprising; so what we do is normalize that surprise by the expected surprise of an LLM on that text. That can sound confusing, so let me give you an example. Take a text, say the capybara text. It's surprising; we wouldn't expect it to just come out of an LLM. So the perplexity, the surprise, the unlikeliness of that text is very high; we wouldn't expect to see it. But why wouldn't we expect to see it?
There are actually two parts to something being unlikely. The first is that this text is unlikely because it's unusual: the topic is very creative, so we wouldn't expect to see it. But the other source of perplexity, of surprise, is that the text is written by a human. When we write as humans, we use words that wouldn't be the top choice for GPT, for example, so GPT is more surprised by a human text than by a machine text. What we try to do is isolate these two parts of the surprise: we remove the part of the perplexity that comes just from the text being unusual. That's what I said before: we normalize by the expected surprise of an LLM on that text, removing the part of the surprise that comes from the text being unusual. And if we manage to do that, what we're left with is a measure of how likely the text is to have been written by an AI. So that's what the formula gives us: a measure, and a much better measure, of how likely the text is to have been written by an LLM. There's a threshold they arrived at experimentally, 0.85: if we get a higher number, it's human; if we get a lower number, it's AI.
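The normalization can be sketched like this; all the per-token log-probabilities here are invented numbers, and the real system gets them from two related LLMs (one scoring the text, one providing the expected "cross" surprise):

```python
# Binoculars-style scoring in miniature: divide the observed surprise of
# a text by the surprise an LLM was *expected* to have there. An unusual
# topic inflates both numbers, so the ratio cancels the capybara effect.
# All per-token log-probabilities below are made up for illustration.

def log_perplexity(logprobs):
    # Average surprise per token (log of the perplexity).
    return -sum(logprobs) / len(logprobs)

def binoculars_score(observed, expected):
    return log_perplexity(observed) / log_perplexity(expected)

# Machine text: the model finds it far more predictable than expected.
ai_obs,  ai_exp  = [-2.0, -2.2, -1.8, -2.0], [-3.0, -3.2, -2.8, -3.0]
# Human text: observed surprise roughly matches the expectation.
hum_obs, hum_exp = [-4.0, -4.4, -3.6, -4.0], [-4.1, -4.0, -4.2, -3.9]

THRESHOLD = 0.85  # the empirically chosen cutoff mentioned above
for name, (obs, exp) in [("machine", (ai_obs, ai_exp)),
                         ("human", (hum_obs, hum_exp))]:
    s = binoculars_score(obs, exp)
    print(f"{name}: score={s:.2f} -> {'human' if s > THRESHOLD else 'AI'}")
```

The key point is that both an unusual topic and an unusual writer raise the numerator, but only the topic raises the denominator, so dividing leaves mostly the "who wrote it" signal.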
It's a great idea, really. And the system, called Binoculars, can be tried online; I'm just not going to show it, because it looks basically the same as the other systems we just saw. But it's a very smart way of getting around the capybara problem. So that's Binoculars: a really great system that came out very recently, well worth trying. And the fourth and last technique I'm going to talk about is watermarking.
So what's watermarking? A watermark is something we embed into a text: we put a mark on it that sticks, but that we can't actually see. When we watermark text, we use a system of two lists: a red list and a green list. To imagine how it works, picture me taking a dictionary and highlighting all of the words in it: the first word red, the second green, the third red, the fourth green. After doing that for every word in the dictionary, I know which words are red and which are green. I have two lists: one of red words and one of green words. Then I tell the system: you can never write any of the red words. Red words are banned; you can't choose them from that probability distribution we saw before; you can only choose green words.
That works great until we get to words that have to go together. For example, "Barack Obama": those two have to go together, right? What we do to solve this is a thing called weak watermarking, in contrast to strong watermarking. The idea is that if the next word is essentially forced, say 99% likely to follow the previous one, you don't apply the red-and-green rule; but when there's a whole set of plausible options, you do apply the watermark.
It's a great technique, and it's especially great because it gets rid of the capybara problem: if we see a text with lots of red words, we know it's human; if it has no red words at all, it's very likely to have been written by an AI, specifically, an AI that was using the watermark. So it's a great technique, really very interesting.
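A toy version of the red/green split can look like this; the green list here is hand-picked for illustration, whereas a real scheme derives the split from a hash of the previous token and biases the LLM's output probabilities:

```python
# Red/green watermarking in miniature: split the vocabulary into two
# fixed halves, let the generator emit only "green" words, and detect by
# measuring the green fraction of a text. Human text lands near 0.5;
# watermarked output lands near 1.0. The green list is a toy example.
GREEN = {"the", "quick", "fox", "over", "dog", "river"}

def is_green(word):
    return word.lower() in GREEN  # everything else is "red"

def green_fraction(text):
    words = text.split()
    return sum(is_green(w) for w in words) / len(words)

def watermarked_rewrite(text):
    # Stand-in "generator": keep only green words (a real generator would
    # pick green alternatives rather than dropping words).
    return " ".join(w for w in text.split() if is_green(w))

human = "the quick brown fox jumps over the lazy dog near the river bank"
print(f"human green fraction: {green_fraction(human):.2f}")  # ~0.62
wm = watermarked_rewrite(human)
print(f"watermarked green fraction: {green_fraction(wm):.2f}")  # 1.00
```

Detection is then just a statistical test on the green fraction, which is why it needs no access to the prompt and dodges the capybara problem entirely.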
And those are the four techniques that will let us understand how the detection of AI text works.
But I'd also love to give you some little tips on how to attack detection just using your common sense. One thing to be aware of is that looking at the types of words used doesn't really work well: LLMs tend to write fairly similarly to humans, so that doesn't really work. But think about other signals, like writing style: if I'm, say, 15, I'm supposed to write like a teenager, not like a PhD. So writing style can hint that a person used a chatbot to write the text. Dialect, too: if I'm American, I don't write like a British person, so I wouldn't use British words. Typos can also hint that a person actually typed the text, though of course they could be added on purpose. These are just ideas, you know, hints that could help you.
Also, hallucinations: if you see a hallucination, that's really the best hint you can get, something that just doesn't look real, something you know is false. There are different types, but look especially for things that make no sense at all, facts you know are untrue, or bad math, where the result of a calculation is simply incorrect. Those are very obvious kinds of hallucinations, and they can strongly suggest a text was generated. However, keep this in the back of your head: human annotators are horrible, really bad, at detecting generated text. We are very bad at it. So we can try, but will it work? It's complicated.
Now for the real important part: how to get around the systems. What can we do to evade detectors of generated text? We've just seen how they work, so it will now make sense that there's one very specific thing we need to do to get around them. If I ask you what the main idea is that comes to your head, you'll probably tell me: paraphrasing, rewriting the text, changing the words. And if you do it well, if you paraphrase well, you'll arrive at a text that can't be spotted anymore. You can do it manually, rewriting the text yourself, or you can use paraphrasers; we'll look into those in a second. But the idea is that if you paraphrase, you can really evade the detectors. In fact, how much can you evade them? Well,
here I have a table where a system was detecting text with 90% accuracy, which is a lot, right? Really great. But how much do you think that drops when we paraphrase? How much can we reduce the accuracy of this detector that was working so well on a set of texts? It drops a lot. It drops so much that the detector basically becomes a coin: a flipped coin has a 50% chance of being correct, and if we paraphrase the text, we make the detector behave like a coin flip. That means the detector is not effective at all, and that paraphrasing or rewriting is really the best thing we can do; it's a very effective technique.
We do need to do it very well to actually evade the detectors: it's not just changing five words, it's really rewriting the sentences. But if we do it well, we can really evade these systems. Now I'm going to give you some tools you could consider using if you want to rewrite your texts. You can just search for them on Google, really; there are new tools all the time, so searching for "paraphrase text AI" or something like that will probably surface all of the tools I'm about to mention.
But if you want some names: Grammarly is a good tool; Auto Writer, GoCopy; there are really lots of them. And DIPPER, the one we're seeing here, is one you use through Hugging Face: search for "DIPPER paraphrase Hugging Face" or something like that, and you'll arrive at a Hugging Face page where you can paste your text and get the rewritten version. T5 is another paraphraser hosted on Hugging Face: you go there, write your text, and get the paraphrased version. And you can also do a smart thing, I'd say: take a text, for example a text in English, translate it to French, and then translate the French translation back to English. You should get some differences in the phrasing, especially if the translator isn't perfect. So that's another way to rewrite.
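The back-translation round trip can be sketched with two toy word tables that aren't exact inverses of each other, which is what makes the wording drift; a real pipeline would go through an actual machine-translation system:

```python
# Back-translation in miniature: English -> French -> English with two
# made-up dictionaries that don't invert each other, so the round trip
# changes the wording. Real pipelines use an actual MT system.
EN_TO_FR = {"big": "grand", "house": "maison", "beautiful": "beau"}
FR_TO_EN = {"grand": "large", "maison": "home", "beau": "lovely"}

def round_trip(text):
    french = [EN_TO_FR.get(w, w) for w in text.split()]
    return " ".join(FR_TO_EN.get(w, w) for w in french)

print(round_trip("a big beautiful house"))  # -> "a large lovely home"
```

The imperfection is the point: every lossy hop away from the original wording pushes the text further from the distribution the detector learned.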
But if you're going to rewrite, there's one great tool, the one I'm going to show you right now: DeepL. You see, you have the translator, and also DeepL Write, a new feature that came out pretty recently. And what's so great about it?
The great thing is that it lets you interactively select what you want as your rewritten version. I can say: I don't like "composition," I prefer "resistance"; or, instead of this, write that. You can interactively change the text, and that's the best way to do rewriting well. So if you want to rewrite, that's my favorite tool. Really, go for any of them, but this one, I'd say, is top notch. DeepL Write: a really great tool.
Having said that, I'd also love to give you some manual tricks you can apply to avoid being detected. One of them is to give the system a lot of information about who you are. Tell it: write as if you are a high school student from the US who is, I don't know, writing about the French Revolution. Whatever, you know; give it lots of information about that person, about you, and that way you allow the system to write a bit more like you.
It's also good to ask for the active voice, because a lot of the examples these systems are trained on are in the passive voice. For example: "500 patients were used for this study." That's passive voice; scientific studies use lots of it. If you ask for the active voice, not only does your text sound more personal, it also changes the distribution of the tokens' probabilities, and it makes the capybara problem a little bit worse for the detector. It won't change a lot, but I would ask for it. Also, use very specific data:
if we go back to the French Revolution example, give lots of data about what you're asking for. It started in this year, it lasted this many months, this many people died, et cetera. As much specific data as you can; that will help the system write a text that is much more surprising.
Also, avoid quotes. I'm not saying don't quote other people; what I mean is, avoid famous quotes, "George Washington said..." and so on, the phrases everyone repeats. When the detector sees one of those, all of the words show up green, because it has seen that phrase a thousand times already, so it will think the phrase was written by AI when actually it's just very common and doesn't surprise the system at all. If you don't want to get in trouble, avoid quotes like that, and also legal texts and any text that appears in lots of places. Another great thing to do is to outline the structure you want: first I want an introduction about the effects of the French Revolution, then a description of what happened in the first three months, whatever. Tell the system what you really want; don't let it choose for you. If you tell it everything you want and are very specific, for example by writing the beginning of the answer yourself, you'll get an answer that is much more surprising, and hence much more difficult to detect.
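Putting those tips together, a prompt might look something like this sketch; every specific in it (the persona, the dates, the opening line) is just an invented example:

```python
# A hypothetical prompt combining the talk's tips: a persona, a request
# for active voice, specific facts, an explicit outline, and a writing
# sample to continue from. All concrete details are invented examples.
prompt = "\n".join([
    "You are a 15-year-old high school student from the US writing an",
    "essay about the French Revolution.",
    "",
    "Write in the active voice.",
    "Work in these facts: it began in 1789 and lasted about ten years.",
    "",
    "Structure: first an introduction about the causes, then a section",
    "on what happened in the first three months.",
    "",
    "Continue from my opening, matching its style:",
    '"Honestly, the French Revolution kind of changed everything..."',
])
print(prompt)
```

Each line pushes the output away from the model's default, most-predictable phrasing, which is exactly what the white-box detectors key on.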
Also, if you write the beginning of the answer, you're letting the system see how you write. For example, I said I use lots of "like"s when I speak, though I try not to when I write; everyone has a certain style. If you let the system see how you write, it's more likely to pick up that style and keep writing like that. That's a great way to take advantage of these systems. Also, maybe consider using other LLMs: not everything is ChatGPT. There's Claude, there's Bard, there's also HuggingChat; you can just Google for them. There are a lot of LLMs, so experiment with other ones. Also, if you don't use English, even better, because the detectors lose a lot of accuracy on non-English text. You could also
try changing some of the words for other things. For example, if you swap a word for an emoji, it can change the probability distribution of the text and confuse the system. Making the kinds of changes you see in this table can potentially change the detector's verdict.
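A trivial sketch of that word-for-emoji substitution (the table is made up; any swap that changes the tokens the detector sees has the same effect):

```python
# Swap selected words for emoji so the detector sees a different token
# stream. The substitution table is an invented example.
SWAPS = {"love": "❤️", "fire": "🔥", "idea": "💡"}

def emojify(text):
    return " ".join(SWAPS.get(w.lower(), w) for w in text.split())

print(emojify("I love this idea"))  # -> "I ❤️ this 💡"
```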
And if you're going to rewrite your text, I'd say you should probably focus on the start of it, for a simple reason: analyzing a whole text is computationally expensive. If you have 500 pages and try to run them through DetectGPT or one of these tools, that's very expensive. So the systems that detect generated text usually just scan the first paragraph or two and assume that if the first paragraph looks generated, the rest of the text is generated; otherwise it's human. So if you rewrite, focus on the first part of the text. It's not the case for every system, but many take this little shortcut. And the last tip: just check whether your text still looks generated. Go to a website like this one, check it, and if it still looks like AI, okay, keep rewriting.
And those are the main things you can do to use AI without getting caught, you know? Having said that, I'd love to end on a final note: it's going to get harder to detect these texts over time. As these models improve, their text will look more and more human, so it's going to be much harder to detect. Bottom line: you might be able to detect generated text, especially if watermarks are applied, et cetera, but detecting generated text is really hard, and it's going to be a problem in the future. So that's where the talk ends, with a lot of open questions about ethics, like what we should do about all this. I don't have time to go into that; it's a really interesting topic, just too broad to cover here.
But I'd love to leave you with something: a QR code to rate the session. If you liked it, I'd really love it if you could give me some positive feedback; I'd really appreciate it. And if there are things you think I could improve, that's also very helpful, because I always read all of the comments and try to take them into account for other sessions. So I'd be super thankful if you could take 15 seconds; it's really short, just two questions, to tell me what you thought about the session.
Also, I should mention there's a surprise when you submit the form: when you click the submit button, there's a little surprise waiting, just to encourage you to fill it in. And having said that, I'd love to give you some pointers to things you might want to read if you're interested, or just message me; I'm super happy to talk about these things any time. But for now I'm going to say goodbye, and I hope to see you at another session. I really wish you a super great conference full of fun stuff. So goodbye and see you soon.