Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hiya. So, today we're going to be talking about LLMs and whether or not
they're good anomaly detectors. And if they're not good anomaly
detectors, then what are the alternatives? What is out there?
So, I'm going to keep this super short, but my name is Chloe. I'm currently
a tech lead and a data engineer at Theodo UK.
Fun fact is that I've lived in six countries and moved eight times before
turning 18. No, my parents are not diplomats. That's usually the
question I always get. So why are we
looking at data anomalies? Why do we care? There are different
types of data anomalies out there. There are those that kind
of come along quite regularly, and that's completely normal. So if you think
of, for example, energy generation, particular things like solar
energy, you might have a day where it's particularly cloudy
and there's no light that gets through, and so you'll have a drop in that
solar energy generation. It's a
data anomaly, but it's not a bad thing, it's kind of just something
that happens. On the other hand, you can also have data anomalies
which are caused by bad data quality, and so that's what we're
going to focus on for this talk. So why is
that so bad? To give you a statistic about
this: there's a 12% average revenue loss by
US companies due to bad data.
So why does that happen? Actually,
data quality is so important for lots and lots of reasons,
and people probably will give you different reasons depending on who you ask.
But the ones I like to highlight is, firstly, data is used
in such a big part of decision making.
So when you get data, that's how you know what decisions to make
on your business, on your product, on your users,
all of those things. So if your data quality is bad, it can have
a really big impact on your decision making. You might be making incorrect
decisions, you might be making non-optimal decisions,
etcetera. So that's the first thing to think of. The second one: it definitely affects
trust. So it will affect trust between your data
engineers and the people who lead the business,
because if you're producing incorrect statistics,
it's really going to damage that trust between the tech team and the
product side, and it also affects trust with your users
as well. If you're providing them with wrong data, if you're not cleaning
it out and making sure that quality stays high, it's also
going to impact that trust. The third point is security.
You'll be handling sensitive data a lot of the time, either
to do with user data, things covered by GDPR,
or even more sensitive data than that. And if you have
bad data quality, that's going to also affect the security of your data.
So that's only three examples, but the list can go on and on. So be
sure to find out why data quality is so important to you and your use
case. It will likely fit within these, but I'm sure that there
are many more. So what causes bad
quality data? Again, there's a long list of causes, but here are a
few that you might think of. The first one could be missing data or
incorrect data. Those are the two most common ones. But there are also
things like outdated data, there's inconsistent formatting
standards. You've also got incompatible systems, you've got data
complexity. All of these kind of feed into bad
data quality. And of course, there's many more than these as well.
So the question is, what can we actually do about it?
So, first off, before jumping into llms, I do want to
say that there are existing tools that help you with data observability,
that help you try and detect that data quality that you have.
So Sifflet and Elementary are two of those that
are already out on the market and that can be used, and they really help
you to pick out, to literally observe, the
quality of your data. Or, you know, you could spice
things up a little, which is what we're going to be doing in our case.
And we're going to be using OpenAI and seeing
if it's any good as an anomaly detector. So we'll
be going through this in four stages. In the first one, we'll be asking ourselves
the question: why? Why are we doing this? Why do we even want to try
and apply OpenAI to anomaly detection?
The second thing we're going to do is a basic test to just get it
to work. The third thing is we're going to talk about
prompt engineering and how that's going to affect the results
from your anomaly detector. And lastly, we're going to be looking at data
types as well. How does the data type impact how
well OpenAI does as an anomaly detector?
First off, why? The first thing we need to know
is that the data that we can have is very varied.
You can have structured data, you can have unstructured data, you can have semi-structured
data. There are so many different data formats out there
that the anomaly detectors that we currently have,
like the ones we saw before, don't really fit all of
these different formats. So being able to use OpenAI gives us
a bit more flexibility in kind of the data we're handling and
how we're detecting all those anomalies. If we're
using structured data, where we don't have
that much uncertainty in the data formatting, there are a lot more classic methods that we
can use for anomaly detection, for example
autoencoders, and we don't have to go as far as OpenAI in
these cases. So that's one of the reasons why we might want to use OpenAI
in anomaly detection. Now, the second thing is
curiosity. We're all curious as developers, it's kind of
one of the main things that define us. And so of course we
want to try throwing OpenAI at something and seeing what
happens in this case. So that's what we're going to do for anomaly detection.
So let's start off with a basic test. If you haven't
used OpenAI before, this is kind of what it looks like. You've got
a client. In this case, we're using GPT-3.5 as the model
and we're going to send it a sequence of messages. So in this case there's
only two. You've got the system message, where we're telling
the model here that it's a
data analyzer and it's going to pick up any anomalies in the data it receives.
And then as the user, I'm giving it some example data.
Here, we can see that we've got a repetition of the id 2, which
happens twice, and we've also got a negative cost on that last
one as well. So when you run this,
you notice that, yep, it actually picks up the anomalies pretty well. It's not
too complex, and it does fine with this basic test.
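As a rough illustration (not the exact code from the talk), a minimal sketch of that basic test with the openai Python client might look like this; the sample rows, prompt wording and model name are my own assumptions:

```python
# Minimal sketch of the basic test, assuming the openai Python client (v1.x)
# and an OPENAI_API_KEY set in the environment. The sample rows are
# illustrative: id 2 is repeated and the last row has a negative cost.
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[
        {
            "role": "system",
            "content": "You are a data analyzer. Pick up any anomalies in the data you receive.",
        },
        {
            "role": "user",
            "content": "id,cost\n1,10.50\n2,12.00\n2,12.00\n3,-8.75",
        },
    ],
)

print(response.choices[0].message.content)
```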
Let's up the level a little bit. So we already mentioned
electricity generation with solar energy, so why don't we use
the actual data behind this? Here we're using electricity
consumption data from the UK. There's a
huge amount of data on Kaggle that you can use for this. Right now,
I'm just using a little bit of an extract, so a few lines from it
and it looks a bit like this. There's lots of columns to do with demand,
to do with generation capacity, etcetera, that you can
have from this database. So that's the sort of data we're giving it.
We're sending that through to GPT-3.5 to try to
detect the anomalies, and I don't know if you can pick it up from
the images, but there are some places where I've put a minus one when
there are only positive numbers, or where I've put a huge deviation in the numbers.
So there's a ten somewhere in there where it should be something like 9000,
etcetera. So I manually added some anomalies to that
data and I've given it to the
model.
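A hedged sketch of how that setup could look in Python, reusing the client from the sketch above and assuming pandas; the file name and the "ND" (national demand) column are illustrative stand-ins for the actual Kaggle extract:

```python
# Rough sketch: load a small extract, inject anomalies by hand, send it off.
# "uk_electricity_extract.csv" and the "ND" column are illustrative; swap in
# whichever columns your extract actually has. `client` comes from the
# previous sketch.
import pandas as pd

df = pd.read_csv("uk_electricity_extract.csv").head(20)

# Manually added anomalies: a negative value where only positives occur,
# and a huge deviation (a 10 where values sit around 9000).
df.loc[3, "ND"] = -1
df.loc[7, "ND"] = 10

csv_extract = df.to_csv(index=False)

response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "system", "content": "You are a data analyzer. Pick up any anomalies in the data you receive."},
        {"role": "user", "content": csv_extract},
    ],
)
print(response.choices[0].message.content)
```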
Unsurprisingly, with such complex data,
it actually found none of the anomalies I gave it
in most of the cases. There were a few cases where it did pick up
on some anomalies. A few observations from those cases:
GPT-4 performed better than GPT-3.5;
it tended to detect the anomalies more. The second
thing I noticed was the more anomalies I gave it, the more difficult it
was to find them. And then the third thing, which was actually kind of
surprising, is that when I gave it more lines,
it didn't necessarily do better. So the number of lines of test data
didn't have a significant impact on its performance.
Right. So after running all of these tests, what's kind of
the conclusion on OpenAI being used as an anomaly
detector? So on average, for GPT-4,
it detected about 32% of intended anomalies that we
had. This is based off of two anomalies with
20 lines of data. Of course, your results are going to vary
depending on the type of data that you're giving it, the number of anomalies,
et cetera, et cetera. But for our particular case, we're going to use this
as a baseline. Two anomalies, 20 lines of data,
GPT-4. In this case, the basic test showed
32% of intended anomalies detected on average.
Great. Basic test done. Let's apply some prompt
engineering. So there's lots of techniques around prompt engineering.
I'm just going to focus on a few. So the first thing we're going to
look at is chain of thought. So with chain of thought, you're really encouraging
the model to break down a complex process into smaller
steps before giving a response. So you're forcing
it to break down its thoughts into smaller
bricks, so it can really process what it's telling you.
And that's the key behind this form of prompting:
we're going to do it very simply, by giving it a line which
simply says, let's think step by step.
So chain of thought can be done in various ways. You can literally
show it the breakdown of the thinking it has to give. But the simplest
application of it is simply telling it: let's think step
by step. And so that's what we've done here.
You can see I've indicated a few steps that it should take.
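As a hedged sketch (the listed steps and wording are my own illustration, not the exact prompt from the talk), the chain-of-thought version might look like this, reusing the client and CSV extract from the earlier sketches:

```python
# Chain-of-thought sketch: same call as before, but the system prompt now asks
# the model to reason step by step. The listed steps are illustrative.
system_prompt = (
    "You are a data analyzer. Pick up any anomalies in the data you receive.\n"
    "Let's think step by step:\n"
    "1. Work out the expected range and format of each column.\n"
    "2. Compare every row against those expectations.\n"
    "3. List the rows and values that look anomalous, and explain why."
)

response = client.chat.completions.create(
    model="gpt-4",
    messages=[
        {"role": "system", "content": system_prompt},
        # csv_extract is the same electricity extract (with injected anomalies) as before.
        {"role": "user", "content": csv_extract},
    ],
)
print(response.choices[0].message.content)
```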
Well, it actually increases the percentage of intended
anomalies detected by about 8% for GPT-4.
So that's quite good. Can we push it further?
So let's try and apply few-shot learning. Few-shot learning is
a different form of prompt engineering. And in
the case of few-shot learning, what you're doing is providing an example
of how the model should respond to the prompts that we give it.
Concretely, what does this mean? You can see
here, you've kind of got an array of messages. The system message remains the
same, but you can see that there are multiple interactions
between a user and an assistant. So what's happening is, I'm telling it: okay, if the
user gives you this
CSV data, I'm expecting you
to give this answer, which you can also see in
the image right here. So I've done that only twice.
So I've kind of given it an example of some bad data and an
example of data with no anomaly. And at the end, I'm going to give it
that final prompt of: okay, this is the data I want you
to analyze.
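A hedged sketch of what that few-shot message array could look like; the example rows and assistant answers are illustrative, and client / csv_extract are reused from the earlier sketches:

```python
# Few-shot sketch: two example user/assistant exchanges (one with anomalies,
# one clean) before the real data we want analyzed.
messages = [
    {"role": "system", "content": "You are a data analyzer. Pick up any anomalies in the data you receive."},
    # Example 1: data containing anomalies, and the answer we expect back.
    {"role": "user", "content": "id,cost\n1,10.50\n2,12.00\n2,12.00\n3,-8.75"},
    {"role": "assistant", "content": "Anomalies: id 2 appears twice; the row with id 3 has a negative cost (-8.75)."},
    # Example 2: clean data, and the expected 'no anomalies' answer.
    {"role": "user", "content": "id,cost\n1,10.50\n2,12.00\n3,11.25"},
    {"role": "assistant", "content": "No anomalies detected."},
    # Final prompt: the data we actually want analyzed.
    {"role": "user", "content": csv_extract},
]

response = client.chat.completions.create(model="gpt-4", messages=messages)
print(response.choices[0].message.content)
```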
What happens in this case? We bump the number of intended anomalies detected up by
about 24% for GPT-4. So that's a massive
increase. It really does a lot better when you've given it some
examples. Now, can we take this further?
The last one that I tried that had a, let's say,
successful outcome was self-reflection and
multi-step. So what do I mean by self-reflection and multi-step?
The aspect of self-reflection is getting the model
to question whether or not it's sure
about the answer it's given you. You're essentially asking the question: are you sure?
Take your time. And multi-step is trying
to break down the amount of things that the model has to do
into several steps, so it's not
overloaded by the amount of work it has to do. What do
I mean by this? Firstly, as the input, I'm giving it the
data as a CSV with the anomalies. The model is going to
give me the anomalies in the data that it thinks are there.
The second step is I'm going to ask it: are you sure? Take your time.
That's going to make it think, okay, have I given the correct answer?
Here are the new anomalies that I think are there. And then finally
we've got this "convert your response to JSON" step. So converting
the response to JSON just makes our lives so much easier in the long
term because we'll be able to use that elsewhere, we'll be able to
pass that information on. And so really that JSON conversion is to make our lives
easier further down the pipeline.
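A rough sketch of that multi-step loop, under the same assumptions as before (client, csv_extract, and my own prompt wording):

```python
# Multi-step + self-reflection sketch: ask for anomalies, then "are you sure?",
# then ask for a JSON conversion, carrying the conversation history forward.
messages = [
    {"role": "system", "content": "You are a data analyzer. Pick up any anomalies in the data you receive."},
    {"role": "user", "content": csv_extract},
]

follow_ups = [
    "Are you sure? Take your time.",   # self-reflection step
    "Convert your response to JSON.",  # final formatting step
]

for follow_up in follow_ups:
    reply = client.chat.completions.create(model="gpt-4", messages=messages)
    messages.append({"role": "assistant", "content": reply.choices[0].message.content})
    messages.append({"role": "user", "content": follow_up})

final = client.chat.completions.create(model="gpt-4", messages=messages)
print(final.choices[0].message.content)  # the anomalies, as JSON
```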
So when we break these steps down and give it that self-reflection, what happens?
Well, compared to the baseline, we have plus 28% of
intended anomalies detected for GPT-4.
So that's amazing. Prompt engineering really has
helped us detect more of the intended anomalies
using OpenAI. Now, with all these percentages, I've
got to make sure I clarify that they don't simply add up together; each is kind
of just how that technique compares to the baseline. So when you combine them all,
you go from 32% of intended anomalies detected for the
basic test up to about 68% when you use
prompt engineering. So that's a really good increase. We're doing pretty
well, but 68% is still far from
ideal. It's not that accurate in the real world. It's going to
require some checking in the background, because you're not going
to be able to guarantee the results are correct. So 68%
is good, but it's not amazing. Now the last stage
is the data types. How does the data type affect the
percentage of anomalies that OpenAI is
able to detect?
So for those of you who don't really know
about OpenAI and how it does with numerics, this might come as a surprise.
For those who have heard about how notoriously bad OpenAI
is with numerical data, this isn't going to be so surprising.
So throughout all of these examples, we've been using numerical
data. Now what happens when we apply textual data?
So in this case, the textual data we're using are movie reviews.
So in this case I've given it a sequence of
movie reviews, and I've inserted some ads and
some phrases that don't
make sense inside it as well. And I want to see:
is OpenAI able to detect those anomalies better
than with numerical data? And this is
what you find. So in this case, what I mean by accuracy
(I probably should have renamed that) is more the
percentage of intended anomalies detected.
So that's what we've been seeing so far. In the case of numerical data,
GPT-3.5 really stayed at 16%. It didn't
get that much better with prompt engineering. On the other hand, GPT-4 reached 68
percent, like we saw: it's good, but not amazing.
Now, as soon as you switch over to text based data, to those movie reviews,
you see that with GPT-3.5, the percentage of
intended anomalies detected jumps up to
78%. And for GPT-4, it's nearly 100%.
Now, I do say nearly, because this will
depend on the data you're testing it on. It will depend on the number of
anomalies you have, etcetera. But its accuracy is absolutely amazing
with text based data. So why
is this the case? Why is OpenAI so bad with numerical data
but so great with text-based data? It's kind of in the name. It's
an LLM, a large language model. It hasn't been trained to
understand mathematical concepts. It will even struggle with basics
like which number is bigger than the other.
So OpenAI's GPT-3.5
and GPT-4 aren't built to handle maths. They're
built more to handle text and to handle languages.
So that's why it does so much better in those cases.
Okay, great. So, you know, we seem
to have resolved it all: OpenAI is a great anomaly detector. It can go
up to 100%. That's amazing. But is
it really? So, we've said that this is in the case of textual data.
Now, it's not that far of a stretch to say that
most of the data we handle on a day-to-day basis is
numerical. You know, it's in the financial industry, in healthcare,
in so many industries, most of our data is numerical.
So even though OpenAI is really good at detecting anomalies in text,
we can't forget all the other data that exists out there.
So what do we do in the case of numerical data? There's lots of options
out there. But one of the ones I tried out was actually BigQuery.
So why did I choose BigQuery? BigQuery actually has an inbuilt anomaly
detector that you can use, and there are three steps to applying it.
The first one is you have to choose the model that best fits your data.
So in this case, I used an ARIMA_PLUS model, mostly because we're dealing with a
time series. The second thing you have to do is then create
a model for each of the data columns. So if you remember when
we were looking at that energy generation CSV,
we had multiple columns to do with generation, to do with consumption,
to do with capacity. All of that existed within
the CSV. Now we're going to have to create a model for each of the
data columns using this inbuilt anomaly detector.
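As a rough sketch of that step, assuming the google-cloud-bigquery Python client and illustrative dataset, table and column names (not the exact ones from the talk), one ARIMA_PLUS model per column could be created like this:

```python
# Sketch: build one ARIMA_PLUS model per data column with BigQuery ML.
# Dataset, table and column names below are illustrative placeholders.
from google.cloud import bigquery

bq = bigquery.Client()

columns = ["england_wales_demand", "embedded_solar_generation", "embedded_wind_generation"]

for col in columns:
    bq.query(f"""
        CREATE OR REPLACE MODEL `my_dataset.anomaly_{col}`
        OPTIONS(
          model_type = 'ARIMA_PLUS',
          time_series_timestamp_col = 'settlement_date',
          time_series_data_col = '{col}'
        ) AS
        SELECT settlement_date, {col}
        FROM `my_dataset.uk_electricity`
    """).result()  # wait for each model to finish training
```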
Once you've built those models, you can run your anomaly detector for each of the
data columns to see: were there any anomalies
within energy generation in Wales, et cetera,
et cetera. Now, BigQuery does allow you
to have interdependence between your columns, especially when you're building your model.
For the case of this experiment, we kind of assumed
that the columns were independent of each other. But of course,
that will depend on your application and how
you want to get your model to be trained. So there's
a lot of code. I'll have the links at the end for you
to kind of see the article which has all of
the code inside it, but this is the key part. So you have something
which is going to be ML.DETECT_ANOMALIES. You're going to give
it the name of your model, and you're going to give it this threshold
that you see at the end. Why do we have a threshold?
So the way that BigQuery's anomaly detector works
is that it's going to give you a percentage of certainty that
something is an anomaly. So if it's not very sure, it might
tell you there's only, say, a 10 percent
chance of this being an anomaly; if it's sure, it's probably going to go up
to 99%, as in: this is definitely an anomaly.
And so it gives you this probability of something
being an anomaly, alongside the boolean true/false.
So that really helps us understand what's going on under the hood
when it's detecting these anomalies.
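A minimal sketch of that call, reusing the hypothetical names from the model-creation sketch above (the 0.95 threshold here is, to my knowledge, BigQuery's default):

```python
# Sketch: run the inbuilt detector against one column's model and inspect the
# probability that BigQuery assigns to each point being an anomaly.
rows = bq.query("""
    SELECT *
    FROM ML.DETECT_ANOMALIES(
      MODEL `my_dataset.anomaly_england_wales_demand`,
      STRUCT(0.95 AS anomaly_prob_threshold),
      (SELECT settlement_date, england_wales_demand
       FROM `my_dataset.uk_electricity_new`)
    )
""").result()

for row in rows:
    # Each row carries is_anomaly (boolean) and anomaly_probability.
    print(row["settlement_date"], row["is_anomaly"], row["anomaly_probability"])
```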
Okay. It does super well.
It detects 100% of all of the intended anomalies.
This is amazing... or is it? Not quite. What we actually find is that
there are 21 false positives when you look at only about 28
lines of data, where there are meant to be only five intended
anomalies. So when you look at it a bit deeper,
some of these are genuine false positives,
where you think, okay, this should definitely not be
an anomaly; there's nothing that kind of indicates that it is. But
when you look a bit deeper, there are some things which do seem as if
they could be anomalies. So if you remember at the start, I mentioned
the example of solar data, you know, when there's a cloudy
day, it drops down to zero or near zero
before coming back up. And so that's an anomaly that's not too
surprising. So we've got to deal with two cases.
We've got to deal with the genuine false positives, where
it isn't an anomaly at all, and we also have to deal with the more
normal anomalies that you can have
within your data. The first thing
we can do is we can increase the threshold. So, like I mentioned before,
there's a level of certainty that bigquery will give you about
whether or not it thinks that something is an anomaly. So we can increase
that threshold, saying: instead of being 95%
sure, I want you to be 99% sure that this
is an anomaly before classing it as such. So when
we make that change, the intended anomalies
are still detected, as they should be. So that's great.
We still have all five intended anomalies detected, and we
have a drop in the number of false positives, just because we want
the model to be more sure that something
is an anomaly before it classes it that
way. So that's all great. This kind of works. So what about the
next one: adding separate training data?
When we're creating that model on the data, what happens if we throw even
more data at it? And what I actually found is that
this isn't helpful. So if you just throw more data at it,
what often happens is that it
overfits. Let's say, for example, I trained it on a whole year of 2016 data,
and then I wanted to figure out, okay, with
two days' worth of data in 2017,
can you pick up an anomaly? It actually performed worse.
So this was quite surprising. And then, after looking into
it a little bit, there's one thing that we haven't done yet: we
haven't tuned our model. So tuning our model is the last thing that
happens. This might bring back some
memories from university.
You can tune the non-seasonal order
terms that you have. I'm not going to go into too much detail
on this, but feel free to check out BigQuery's
ML anomaly detector documentation, because it's got some more details
about it. But there are three parameters, three terms, that you
can fine-tune. There's the term we call p,
which is the order of the autoregressive part
of it. It's a value that can vary between zero and five.
The second term that you can tune is the term d,
which is the degree of differencing; that's a value between
zero and two. And finally, the last term that you can tune
is q, and that is the order of the
moving-average part of the equation. So you
can kind of see that you either move more towards an autoregressive
model or more towards a moving average one.
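As a hedged sketch of what that tuning could look like, using the same hypothetical names as before and, to my understanding of BigQuery ML's ARIMA_PLUS options, disabling auto_arima so the non-seasonal order can be set by hand:

```python
# Sketch: retrain one column's model with an explicit non-seasonal order
# (p, d, q). With p = 0, d = 0, q = 5 this leans towards a pure
# moving-average model; the names and values are illustrative.
bq.query("""
    CREATE OR REPLACE MODEL `my_dataset.anomaly_england_wales_demand_tuned`
    OPTIONS(
      model_type = 'ARIMA_PLUS',
      time_series_timestamp_col = 'settlement_date',
      time_series_data_col = 'england_wales_demand',
      auto_arima = FALSE,
      non_seasonal_order = (0, 0, 5)
    ) AS
    SELECT settlement_date, england_wales_demand
    FROM `my_dataset.uk_electricity`
""").result()
```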
And so with this diagram, what you can see is that the color
coding is going to show you the number of false positives.
So you see that the more you move towards the left,
the fewer false positives you have.
So it's really that p term dropping down,
that d term dropping down, and you end up with more of a
moving-average model. So this is where you realize that energy generation, in
this particular case, is more of a moving average.
So that's super interesting and it's really going to depend on your application.
So I highly encourage you to fine-tune
these different non-seasonal order terms and see what fits
your particular application. This is super
important to do, because that's where you realize that you really need to tune for
your particular use case. Okay, great.
So we've tuned our model, we've increased the threshold.
What's happened? What's happened with our false positives?
Amazingly, we drop down to about one false positive for
28 lines of data and five intended anomalies, which is amazing.
We've really helped improve the performance.
So this is great. We've seen that OpenAI is great
with textual data. Now, do remember that there is a cost
associated with it. I'm going to show you a comparison with other models
just after, but there is a cost associated with using OpenAI.
And for numerical data,
BigQuery ML has this inbuilt anomaly
detector, which performs very, very well, and you just need to
fine-tune it for your application. So, amazing:
we've managed to build our own anomaly detectors.
Now, I did mention that OpenAI is quite expensive. So what
about all the open source models? When I was running this experiment,
I kind of ran it against two of Mistral's models,
which you can see here. And you can see that although
OpenAI's GPT-3.5 and GPT-4 do
perform better, Mistral is catching up, so it's
not that far behind. There are these open source models which
are doing much, much better now and coming up fast.
Great. So that's it for the talk. We've managed to
build two different types of anomaly detectors. If you
want to try things out yourself, I published some things
on Twitter and LinkedIn, including the two articles
that kind of cover everything that I've done here, with the code
extracts. So please check those out, and I hope that you enjoyed it.