Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hiya. So, today we're going to be talking about LLMs and whether or not
they're good anomaly detectors. And if they're not good anomaly
detectors, then what are the alternatives? What is out there?
So, I'm going to keep this super short, but my name is Chloe. I'm currently
a tech lead and a data engineer at Theodo UK.
Fun fact is that I've lived in six countries and moved eight times before
turning 18. No, my parents are not diplomats. That's usually the
question I always get. So why are we
looking at data anomalies? Why do we care? There are different
types of data anomalies out there. There are those that kind
of come along quite regularly, and that's completely normal. So if you think
of, for example, energy generation, particular things like solar
energy, you might have a day where it's particularly cloudy
and there's no light that gets through, and so you'll have a drop in that
solar energy generation. It's a
data anomaly, but it's not a bad thing, it's kind of just something
that happens. On the other hand, you can also have data anomalies
which are caused by bad data quality, and so that's what we're
going to focus on for this talk. So why is
that so bad? To give you a statistic about
this: there's a 12% average revenue loss by
US companies due to bad data.
So why does that happen? Actually,
data quality is so important for lots and lots of reasons,
and people probably will give you different reasons depending on who you ask.
But the ones I like to highlight is, firstly, data is used
in such a big part of decision making.
So when you get data, that's how you know what decisions to make
on your business, on your product, on your users,
all of those things. So if your data quality is bad, it can have
a really big impact on your decision making. You might be making incorrect
decisions, you might be making non-optimal decisions,
etcetera. So that's the first thing to think of. The second one: it definitely affects
trust. So it will affect trust between your data
engineers and the people who lead the business,
because if you're producing incorrect statistics,
it's really going to damage that trust between the tech team and the
product side, and it also affects trust with your users
as well. If you're providing them with wrong data, if you're not cleaning
it out and making sure that quality stays high, it's also
going to impact that trust. The third point is security.
You'll be handling sensitive data a lot of the time, either
to do with user data, things covered by GDPR,
or even more sensitive data than that. And if you have
bad data quality, that's going to also affect the security of your data.
So that's only three examples, but the list can go on and on. So be
sure to find out why data quality is so important to you and your use
case. It will likely fit within these, but I'm sure that there
are many more. So what causes bad
quality data? Again, there's a long list of causes, but here are a
few that you might think of. The first one could be missing data or
incorrect data. Those are the two most common ones. But there are also
things like outdated data, there's inconsistent formatting
standards. You've also got incompatible systems, you've got data
complexity. All of these kind of feed into bad
data quality. And of course, there's many more than these as well.
So the question is, what can we actually do about it?
So, first off, before jumping into llms, I do want to
say that there are existing tools that help you with data observability,
that help you try and detect that data quality that you have.
So Sifflet and Elementary are two of those that
are already out on the market and that can be used, and they really help
you to pick out, to literally observe, the
quality of your data. Or, you know, you could spice
things up a little, which is what we're going to be doing in our case.
And we're going to be using OpenAI and seeing
if it's any good as an anomaly detector. So we'll
be going through this in four stages. In the first one, we'll be asking ourselves
the question: why? Why are we doing this? Why do we even want to try
and apply OpenAI to anomaly detection?
The second thing we're going to do is a basic test to just get it
to work. The third thing is we're going to talk about
prompt engineering and how that's going to affect the results
from your anomaly detector. And lastly, we're going to be looking at data
types as well. How does the data type impact how
well OpenAI does as an anomaly detector?
First off, why? The first thing we need to know
is that the data that we can have is very varied.
You can have structured data, you can have unstructured data, you can have semi-structured
data. There are so many different data formats out there
that the anomaly detectors that we currently have,
like the ones we saw before, don't really fit all of
these different formats. So being able to use OpenAI gives us
a bit more flexibility in kind of the data we're handling and
how we're detecting all those anomalies. If we're
using structured data, where we don't have
that much uncertainty in the data formatting, there are a lot more classic methods that we
can use for anomaly detection, for example
autoencoders, and we don't have to go as far as OpenAI in
these cases. So that's one of the reasons why we might want to use OpenAI
in anomaly detection. Now, the second thing is
curiosity. We're all curious as developers, it's kind of
one of the main things that define us. And so of course we
want to try throwing OpenAI at something and seeing what
happens in this case. So that's what we're going to do for anomaly detection.
So let's start off with a basic test. If you haven't
used OpenAI before, this is kind of what it looks like. You've got
a client. In this case, we're using GPT-3.5 as the model
and we're going to send it a sequence of messages. So in this case there's
only two. You've got the system message, where we're telling
the model here that it's a
data analyzer and it's going to pick up any anomalies in the data it receives.
And then as the user, I'm giving it some example data.
Here, we can see that we've got a repetition of the id 2, which
happens twice, and we've also got a negative cost on that last
one as well. So when you run this,
you notice that, yep, it actually picks up the anomalies pretty well. It's not
too complex, and it does fine with this basic test.
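As a rough illustration (not the exact code from the talk), a minimal sketch of that basic test with the openai Python client might look like this; the sample rows, prompt wording and model name are my own assumptions:

```python
# Minimal sketch of the basic test, assuming the openai Python client (v1.x)
# and an OPENAI_API_KEY set in the environment. The sample rows are
# illustrative: id 2 is repeated and the last row has a negative cost.
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[
        {
            "role": "system",
            "content": "You are a data analyzer. Pick up any anomalies in the data you receive.",
        },
        {
            "role": "user",
            "content": "id,cost\n1,10.50\n2,12.00\n2,12.00\n3,-8.75",
        },
    ],
)

print(response.choices[0].message.content)
```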
Let's up the level a little bit. So we already mentioned
electricity generation with solar energy, so why don't we use
the actual data behind this? Here we're using electricity
consumption data from the UK. There's a
huge amount of data on Kaggle that you can use for this. Right now,
I'm just using a little bit of an extract, so a few lines from it
and it looks a bit like this. There's lots of columns to do with demand,
to do with generation capacity, etcetera, that you can
have from this database. So that's the sort of data we're giving it.
We're sending that through to GPT-3.5 to try to
detect the anomalies, and I don't know if you can pick it up from
the images, but there are some places where I've put a minus one when
there are only positive numbers, or where I've put a huge deviation in the numbers.
So there's a ten somewhere in there where it should be something like 9000,
etcetera. So I manually added some anomalies to that
data and I've given it to the
model.
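A hedged sketch of how that setup could look in Python, reusing the client from the sketch above and assuming pandas; the file name and the "ND" (national demand) column are illustrative stand-ins for the actual Kaggle extract:

```python
# Rough sketch: load a small extract, inject anomalies by hand, send it off.
# "uk_electricity_extract.csv" and the "ND" column are illustrative; swap in
# whichever columns your extract actually has. `client` comes from the
# previous sketch.
import pandas as pd

df = pd.read_csv("uk_electricity_extract.csv").head(20)

# Manually added anomalies: a negative value where only positives occur,
# and a huge deviation (a 10 where values sit around 9000).
df.loc[3, "ND"] = -1
df.loc[7, "ND"] = 10

csv_extract = df.to_csv(index=False)

response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "system", "content": "You are a data analyzer. Pick up any anomalies in the data you receive."},
        {"role": "user", "content": csv_extract},
    ],
)
print(response.choices[0].message.content)
```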
Unsurprisingly, with such complex data,
it actually found none of the anomalies I gave it
in most of the cases. There were a few cases where it did pick up
on some anomalies. A few observations from those cases:
GPT-4 performed better than GPT-3.5;
it tended to detect the anomalies more. The second
thing I noticed was the more anomalies I gave it, the more difficult it
was to find them. And then the third thing, which was actually kind of
surprising, is that when I gave it more lines,
it didn't necessarily do better. So the number of lines of test data
didn't have a significant impact on its performance.
Right. So after running all of these tests, what's kind of
the conclusion on OpenAI being used as an anomaly
detector? So on average, for GPT-4,
it detected about 32% of intended anomalies that we
had. This is based off of two anomalies with
20 lines of data. Of course, your results are going to vary
depending on the type of data that you're giving it, the number of anomalies,
et cetera, et cetera. But for our particular case, we're going to use this
as a baseline. Two anomalies, 20 lines of data,
GPT-4. In this case, the basic test showed
32% of intended anomalies detected on average.
Great. Basic test done. Let's apply some prompt
engineering. So there's lots of techniques around prompt engineering.
I'm just going to focus on a few. So the first thing we're going to
look at is chain of thought. So with chain of thought, you're really encouraging
the model to break down a complex process into smaller
steps before giving a response. So you're forcing
it to break down its thoughts into smaller
bricks, so it can really process what it's telling you.
And that's the key behind this form of prompting:
we're going to do it very simply, by giving it a line which
simply says, let's think step by step.
So chain of thought can be done in various ways. You can literally
show it the breakdown of the thinking it has to give. But the simplest
application of it is simply telling it: let's think step
by step. And so that's what we've done here.
You can see I've indicated a few steps that it should take.
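As a hedged sketch (the listed steps and wording are my own illustration, not the exact prompt from the talk), the chain-of-thought version might look like this, reusing the client and CSV extract from the earlier sketches:

```python
# Chain-of-thought sketch: same call as before, but the system prompt now asks
# the model to reason step by step. The listed steps are illustrative.
system_prompt = (
    "You are a data analyzer. Pick up any anomalies in the data you receive.\n"
    "Let's think step by step:\n"
    "1. Work out the expected range and format of each column.\n"
    "2. Compare every row against those expectations.\n"
    "3. List the rows and values that look anomalous, and explain why."
)

response = client.chat.completions.create(
    model="gpt-4",
    messages=[
        {"role": "system", "content": system_prompt},
        # csv_extract is the same electricity extract (with injected anomalies) as before.
        {"role": "user", "content": csv_extract},
    ],
)
print(response.choices[0].message.content)
```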
Well, it actually increases the percentage of intended
anomalies detected by about 8% for GPT-4.
So that's quite good. Can we push it further?
So let's try and apply few-shot learning. Few-shot learning is
a different form of prompt engineering. And in
the case of few-shot learning, what you're doing is providing an example
of how the model should respond to the prompts that we give it.
Concretely, what does this mean? You can see
here, you've kind of got an array of messages. The system message remains the
same, but you can see that there are multiple interactions
between a user and an assistant. So what's happening is, I'm telling it: okay, if the
user gives you this
CSV data, I'm expecting you
to give this answer, which you can also see in
the image right here. So I've done that only twice.
So I've kind of given it an example of some bad data and an
example of data with no anomaly. And at the end, I'm going to give it
that final prompt of: okay, this is the data I want you
to analyze.
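A hedged sketch of what that few-shot message array could look like; the example rows and assistant answers are illustrative, and client / csv_extract are reused from the earlier sketches:

```python
# Few-shot sketch: two example user/assistant exchanges (one with anomalies,
# one clean) before the real data we want analyzed.
messages = [
    {"role": "system", "content": "You are a data analyzer. Pick up any anomalies in the data you receive."},
    # Example 1: data containing anomalies, and the answer we expect back.
    {"role": "user", "content": "id,cost\n1,10.50\n2,12.00\n2,12.00\n3,-8.75"},
    {"role": "assistant", "content": "Anomalies: id 2 appears twice; the row with id 3 has a negative cost (-8.75)."},
    # Example 2: clean data, and the expected 'no anomalies' answer.
    {"role": "user", "content": "id,cost\n1,10.50\n2,12.00\n3,11.25"},
    {"role": "assistant", "content": "No anomalies detected."},
    # Final prompt: the data we actually want analyzed.
    {"role": "user", "content": csv_extract},
]

response = client.chat.completions.create(model="gpt-4", messages=messages)
print(response.choices[0].message.content)
```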
What happens in this case? We bump the number of intended anomalies detected up by
about 24% for GPT-4. So that's a massive
increase. It really does a lot better when you've given it some
examples. Now, can we take this further?
The last one that I tried that had a, let's say,
successful outcome was self-reflection and
multi-step. So what do I mean by self-reflection and multi-step?
The aspect of self-reflection is getting the model
to question whether or not it's sure
about the answer it's given you. You're essentially asking the question: are you sure?
Take your time. And multi-step is trying
to break down the amount of things that the model has to do
into several steps, so it's not
overloaded by the amount of work it has to do. What do
I mean by this? Firstly, as the input, I'm giving it the
data as a CSV with the anomalies. The model is going to
give me the anomalies in the data that it thinks are there.
The second step is I'm going to ask it: are you sure? Take your time.
That's going to make it think, okay, have I given the correct answer?
Here are the new anomalies that I think are there. And then finally
we've got this "convert your response to JSON" step. So converting
the response to JSON just makes our lives so much easier in the long
term because we'll be able to use that elsewhere, we'll be able to
pass that information on. And so really that JSON conversion is to make our lives
easier further down the pipeline.
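A rough sketch of that multi-step loop, under the same assumptions as before (client, csv_extract, and my own prompt wording):

```python
# Multi-step + self-reflection sketch: ask for anomalies, then "are you sure?",
# then ask for a JSON conversion, carrying the conversation history forward.
messages = [
    {"role": "system", "content": "You are a data analyzer. Pick up any anomalies in the data you receive."},
    {"role": "user", "content": csv_extract},
]

follow_ups = [
    "Are you sure? Take your time.",   # self-reflection step
    "Convert your response to JSON.",  # final formatting step
]

for follow_up in follow_ups:
    reply = client.chat.completions.create(model="gpt-4", messages=messages)
    messages.append({"role": "assistant", "content": reply.choices[0].message.content})
    messages.append({"role": "user", "content": follow_up})

final = client.chat.completions.create(model="gpt-4", messages=messages)
print(final.choices[0].message.content)  # the anomalies, as JSON
```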
So when we break these steps down and give it that self-reflection, what happens?
Well, compared to the baseline, we have plus 28% of
intended anomalies detected for GPT-4.
So that's amazing. Prompt engineering really has
helped us detect more of the intended anomalies
using OpenAI. Now, with all these percentages, I've
got to make sure I clarify that they don't simply add up together; each is kind
of just how that technique compares to the baseline. So when you combine them all,
you go from 32% of intended anomalies detected for the
basic test up to about 68% when you use
prompt engineering. So that's a really good increase. We're doing pretty
well, but 68% is still far from
ideal. It's not that accurate in the real world. It's going to
require some checking in the background, because you're not going
to be able to guarantee the results are correct. So 68%
is good, but it's not amazing. Now the last stage
is the data types. How does the data type affect the
percentage of anomalies that OpenAI is
able to detect?
So for those of you who don't really know
about OpenAI and how it does with numerics, this might come as a surprise.
For those who have heard about how notoriously bad OpenAI
is with numerical data, this isn't going to be so surprising.
So throughout all of these examples, we've been using numerical
data. Now what happens when we apply textual data?
So in this case, the textual data we're using are movie reviews.
So in this case I've given it a sequence of
movie reviews, and I've inserted some ads and
some phrases that don't
make sense inside it as well. And I want to see:
is OpenAI able to detect those anomalies better
than with numerical data? And this is
what you find. So in this case, what I mean by accuracy
(I probably should have renamed that) is more the
percentage of intended anomalies detected.
So that's what we've been seeing so far. In the case of numerical data,
GPT-3.5 really stayed at 16%. It didn't
get that much better with prompt engineering. On the other hand, GPT-4 reached 68
percent, like we saw: it's good, but not amazing.
Now, as soon as you switch over to text based data, to those movie reviews,
you see that with GPT-3.5, the percentage of
intended anomalies detected jumps up to
78%. And for GPT-4, it's nearly 100%.
Now, I do say nearly, because this will
depend on the data you're testing it on. It will depend on the number of
anomalies you have, etcetera. But its accuracy is absolutely amazing
with text based data. So why
is this the case? Why is OpenAI so bad with numerical data
but so great with text-based data? It's kind of in the name. It's
an LLM, a large language model. It hasn't been trained to
understand mathematical concepts. It will even struggle with basics
like which number is bigger than the other.
So OpenAI's GPT-3.5
and GPT-4 aren't built to handle maths. They're
built more to handle text and to handle languages.
So that's why it does so much better in those cases.
Okay, great. So, you know, we seem
to have resolved it all: OpenAI is a great anomaly detector. It can go
up to 100%. That's amazing. But is
it really? So, we've said that this is in the case of textual data.
Now, it's not that far of a stretch to say that
most of the data we handle on a day-to-day basis is
numerical. You know, it's in the financial industry, in healthcare,
in so many industries, most of our data is numerical.
So even though OpenAI is really good at detecting anomalies in text,
we can't forget all the other data that exists out there.
So what do we do in the case of numerical data? There's lots of options
out there. But one of the ones I tried out was actually BigQuery.
So why did I choose BigQuery? BigQuery actually has an inbuilt anomaly
detector that you can use, and there are three steps to applying it.
The first one is you have to choose the model that best fits your data.
So in this case, I used an ARIMA_PLUS model, mostly because we're dealing with a
time series. The second thing you have to do is then create
a model for each of the data columns. So if you remember when
we were looking at that energy generation CSV,
we had multiple columns to do with generation, to do with consumption,
to do with capacity. All of that existed within
the CSV. Now we're going to have to create a model for each of the
data columns using this inbuilt anomaly detector.
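As a rough sketch of that step, assuming the google-cloud-bigquery Python client and illustrative dataset, table and column names (not the exact ones from the talk), one ARIMA_PLUS model per column could be created like this:

```python
# Sketch: build one ARIMA_PLUS model per data column with BigQuery ML.
# Dataset, table and column names below are illustrative placeholders.
from google.cloud import bigquery

bq = bigquery.Client()

columns = ["england_wales_demand", "embedded_solar_generation", "embedded_wind_generation"]

for col in columns:
    bq.query(f"""
        CREATE OR REPLACE MODEL `my_dataset.anomaly_{col}`
        OPTIONS(
          model_type = 'ARIMA_PLUS',
          time_series_timestamp_col = 'settlement_date',
          time_series_data_col = '{col}'
        ) AS
        SELECT settlement_date, {col}
        FROM `my_dataset.uk_electricity`
    """).result()  # wait for each model to finish training
```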
Once you've built those models, you can run your anomaly detector for each of the
data columns to see: were there any anomalies
within energy generation in Wales, et cetera,
et cetera. Now, BigQuery does allow you
to have interdependence between your columns, especially when you're building your model.
For the case of this experiment, we kind of assumed
that the columns were independent of each other. But of course,
that will depend on your application and how
you want to get your model to be trained. So there's
a lot of code. I'll have the links at the end for you
to kind of see the article which has all of
the code inside it, but this is the key part. So you have something
which is going to be ML.DETECT_ANOMALIES. You're going to give
it the name of your model, and you're going to give it this threshold
that you see at the end. Why do we have a threshold?
So the way that BigQuery's anomaly detector works
is that it's going to give you a percentage of certainty that
something is an anomaly. So if it's not very sure, it might
tell you there's only, say, a 10 percent
chance of this being an anomaly; if it's sure, it's probably going to go up
to 99%, as in: this is definitely an anomaly.
And so it gives you this probability of something
being an anomaly, alongside the boolean true/false.
So that really helps us understand what's going on under the hood
when it's detecting these anomalies.
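A minimal sketch of that call, reusing the hypothetical names from the model-creation sketch above (the 0.95 threshold here is, to my knowledge, BigQuery's default):

```python
# Sketch: run the inbuilt detector against one column's model and inspect the
# probability that BigQuery assigns to each point being an anomaly.
rows = bq.query("""
    SELECT *
    FROM ML.DETECT_ANOMALIES(
      MODEL `my_dataset.anomaly_england_wales_demand`,
      STRUCT(0.95 AS anomaly_prob_threshold),
      (SELECT settlement_date, england_wales_demand
       FROM `my_dataset.uk_electricity_new`)
    )
""").result()

for row in rows:
    # Each row carries is_anomaly (boolean) and anomaly_probability.
    print(row["settlement_date"], row["is_anomaly"], row["anomaly_probability"])
```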
Okay. It does super well.
It detects 100% of all of the intended anomalies.
This is amazing... or is it? Not quite. What we actually find is that
there are 21 false positives when you look at only about 28
lines of data, where there are meant to be only five intended
anomalies. So when you look at it a bit deeper,
some of these are genuine false positives,
where you think, okay, this should definitely not be
an anomaly; there's nothing that kind of indicates that it is. But
when you look a bit deeper, there are some things which do seem as if
they could be anomalies. So if you remember at the start, I mentioned
the example of solar data, you know, when there's a cloudy
day, it drops down to zero or near zero
before coming back up. And so that's an anomaly that's not too
surprising. So we've got to deal with two cases.
We've got to deal with the genuine false positives, where
it isn't an anomaly at all, and we also have to deal with the more
normal anomalies that you can have
within your data. The first thing
we can do is we can increase the threshold. So, like I mentioned before,
there's a level of certainty that bigquery will give you about
whether or not it thinks that something is an anomaly. So we can increase
that threshold, saying: instead of being 95%
sure, I want you to be 99% sure that this
is an anomaly before classing it as such. So when
we make that change, the intended anomalies
are still detected, as they should be. So that's great.
We still have all five intended anomalies detected, and we
have a drop in the number of false positives, just because we want
the model to be more sure that something
is an anomaly before it classes it that
way. So that's all great. This kind of works. So what about the
next one: adding separate training data?
When we're creating that model on the data, what happens if we throw even
more data at it? And what I actually found is that
this isn't helpful. So if you just throw more data at it,
what often happens is that it
overfits. Let's say, for example, I trained it on a whole year of 2016 data,
and then I wanted to figure out, okay, with
two days' worth of data in 2017,
can you pick up an anomaly? It actually performed worse.
So this was quite surprising. And then, after looking into
it a little bit, there's one thing that we haven't done yet: we
haven't tuned our model. So tuning our model is the last thing that
happens. This might bring back some
memories from university.
You can tune the non-seasonal order
terms that you have. I'm not going to go into too much detail
on this, but feel free to check out BigQuery's
ML anomaly detector documentation, because it's got some more details
about it. But there are three parameters, three terms, that you
can fine-tune. There's the term we call p,
which is the order of the autoregressive part
of it. It's a value that can vary between zero and five.
The second term that you can tune is the term d,
which is the degree of differencing; that's a value between
zero and two. And finally, the last term that you can tune
is q, and that is the order of the
moving-average part of the equation. So you
can kind of see that you either move more towards an autoregressive
model or more towards a moving average one.
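As a hedged sketch of what that tuning could look like, using the same hypothetical names as before and, to my understanding of BigQuery ML's ARIMA_PLUS options, disabling auto_arima so the non-seasonal order can be set by hand:

```python
# Sketch: retrain one column's model with an explicit non-seasonal order
# (p, d, q). With p = 0, d = 0, q = 5 this leans towards a pure
# moving-average model; the names and values are illustrative.
bq.query("""
    CREATE OR REPLACE MODEL `my_dataset.anomaly_england_wales_demand_tuned`
    OPTIONS(
      model_type = 'ARIMA_PLUS',
      time_series_timestamp_col = 'settlement_date',
      time_series_data_col = 'england_wales_demand',
      auto_arima = FALSE,
      non_seasonal_order = (0, 0, 5)
    ) AS
    SELECT settlement_date, england_wales_demand
    FROM `my_dataset.uk_electricity`
""").result()
```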
And so with this diagram, what you can see is that the color
coding is going to show you the number of false positives.
So you see that the more you move towards the left,
the fewer false positives you have.
So it's really that p term dropping down,
that d term dropping down, and you end up with more of a
moving-average model. So this is where you realize that energy generation, in
this particular case, is more of a moving average.
So that's super interesting and it's really going to depend on your application.
So I highly encourage you to fine-tune
these different non-seasonal order terms and see what fits
your particular application. This is super
important to do, because that's where you realize that you really need to tune for
your particular use case. Okay, great.
So we've tuned our model, we've increased the threshold.
What's happened? What's happened with our false positives?
Amazingly, we drop down to about one false positive for
28 lines of data and five intended anomalies, which is amazing.
We've really helped improve the performance.
So this is great. We've seen that OpenAI is great
with textual data. Now, do remember that there is a cost
associated with it. I'm going to show you a comparison with other models
just after, but there is a cost associated with using OpenAI.
And for numerical data,
BigQuery ML has this inbuilt anomaly
detector, which performs very, very well, and you just need to
fine-tune it for your application. So, amazing:
we've managed to build our own anomaly detectors.
Now, I did mention that OpenAI is quite expensive. So what
about all the open source models? When I was running this experiment,
I kind of ran it against two of Mistral's models,
which you can see here. And you can see that although
OpenAI's GPT-3.5 and GPT-4 do
perform better, Mistral is catching up, so it's
not that far behind. There are these open source models which
are doing much, much better now and coming up fast.
Great. So that's it for the talk. We've managed to
build two different types of anomaly detectors. If you
want to try things out yourself, I published some things
on Twitter and LinkedIn, including the two articles
that kind of cover everything that I've done here, with the code
extracts. So please check those out, and I hope that you enjoyed it.