Conf42 Machine Learning 2024 - Online

Using Quix, Hugging Face, and InfluxDB for Anomaly Detection and Forecasting


Abstract

In this talk, we will learn how to build a data streaming pipeline that performs anomaly detection (Keras autoencoders) and forecasting (Holt-Winters) on generator data with Quix, Hugging Face, and InfluxDB.

Summary

  • Today we're going to learn how we can use InfluxDB v3 with HiveMQ and Quix to build an anomaly detection solution for IoT data. We'll also learn how this solution could be used for an industrial IoT use case because it's scalable. Finally, I'll finish this talk by sharing some conclusions, answering some common questions, and pointing to the source code.
  • Anais Dotis-Georgiou is a developer advocate at InfluxData. She will talk about how to use HiveMQ to build a data pipeline. With HiveMQ, you can also sink data to a variety of databases. Come ask any questions that you have.
  • Time series data is any data that has a timestamp associated with it. Through aggregation, we can transform any event into a metric. A time series database has four components: it should allow you to query in time order and accommodate high write throughput.
  • InfluxDB is a time series database platform. It uses Apache DataFusion and Arrow to increase interoperability with business analytics and business intelligence tools like Tableau. We're cherry-picking HiveMQ and Quix to build this industrial IoT anomaly detection example.
  • Quix is a solution for building, deploying, and monitoring event streaming applications. We're also using Hugging Face to deploy and leverage an anomaly detection model. The goal of this demo is to highlight how we can perform this type of work for a specific use case.
  • An autoencoder is an unsupervised machine learning technique: an artificial neural network that learns patterns in our data without being provided any labels or prompts from the engineer. It can be used for anomaly detection.
  • The mean squared error (MSE) is a way to measure the difference between our actual data and our predicted data. High errors indicate anomalies. If we have a high MSE, that means we have a deviation from our predicted or reconstructed data.
  • This is how we use Quix, InfluxDB, HiveMQ, and Hugging Face to operationalize anomaly detection in the IoT space. The cool thing about this stack is just how scalable it is. All you need to do in order to try out this demo is to create a free account.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Hello and welcome. Today we're going to learn about how we can use InfluxDB v3 with HiveMQ and Quix, which is a data streaming platform, as well as Hugging Face, to build an anomaly detection solution for some IoT data. And we'll also learn about how this solution and this architecture could be used for an industrial IoT use case because it's scalable. So specifically, for the agenda today, we'll be talking about what data pipelines are, what HiveMQ is, and what InfluxDB is. Then we'll dive a little bit into AI and ML in data pipelines and the applications there. Then we'll go into real-world applications in industrial IoT, and we'll follow that with learning how we can build with HiveMQ (an MQTT broker), Quix, and InfluxDB. Finally, I'll finish this talk by sharing some conclusions, answering some common questions, and sharing some source code with you so that you can go ahead and try this example on your own. So who am I? I'm Anais Dotis-Georgiou. I'm a developer advocate at InfluxData. InfluxData is the creator of InfluxDB, and I want to encourage you to connect with me on LinkedIn if you'd like to do so. And I encourage you to ask me any questions that you have about this presentation, time series data, data pipelines, IoT, etcetera. Come ask any questions that you have. I'd love to connect with you and help you on your time series and data pipeline journey. So what are data pipelines? If you're familiar with Kafka, then you're probably familiar with the source and sink model. Information is generated, usually from a sensor or an application, or it could be a log, and it needs to make its way to where it needs to be consumed. And that could be applications, end users, controllers, etcetera. And when you have that data in a pipeline, what can you do with it? You can do things like normalize it and transform it. Maybe you need to standardize across a bunch of different types of protocols, and you might want to do all of this in flight. So we'll talk about how we can use HiveMQ to build such a data pipeline and then store the data in InfluxDB. So HiveMQ is an MQTT broker, and like Kafka, it works on a pub/sub model, and that's also similar to other MQTT brokers. It allows you to publish information as a message to a topic that can be ephemeral, persistent, or shared amongst other consumers and producers. When you post data to that topic, other devices that subscribe to that topic can then access that data, process it, and write it to other databases like InfluxDB. So what else can you do with HiveMQ? You can get data from a variety of different sources. You can use things like Data Hub and extensions to perform any transformations that you might need directly within HiveMQ. You can also sink to a variety of databases, InfluxDB included, because they have their own MQTT connectors, but you can also pipe to other databases, and you can perform whatever ETL you need. It also has tight systems integrations; it can integrate very easily with streaming services like Kafka as well. And you have a variety of deployment options. They understand that there's a need for flexibility in deployment, and so they offer HiveMQ Cloud, which is fully managed and available on major cloud platforms, and HiveMQ Self-Managed, which gives you control to deploy on your own Kubernetes and tailor HiveMQ to your specific needs. So that's basically HiveMQ in a nutshell, zoomed out.
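To make the pub/sub model concrete, here's a minimal sketch using the paho-mqtt Python client against HiveMQ's public test broker; the topic name and payload values are illustrative assumptions, not code from the talk.

```python
import json
import paho.mqtt.client as mqtt

BROKER = "broker.hivemq.com"  # HiveMQ's public broker: handy for tests, not production

# --- publisher ---
# (With paho-mqtt >= 2.0, pass mqtt.CallbackAPIVersion.VERSION2 as the first argument.)
pub = mqtt.Client()
pub.connect(BROKER, 1883)
pub.publish("machine/machine1", json.dumps({"temperature": 71.2}))  # illustrative reading

# --- subscriber ---
def on_message(client, userdata, msg):
    # Fires for every message published on a matching topic.
    print(msg.topic, json.loads(msg.payload))

sub = mqtt.Client()
sub.on_message = on_message
sub.connect(BROKER, 1883)
sub.subscribe("machine/#")  # '#' wildcard matches every child topic under machine/
sub.loop_forever()
```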
But before we understand how we can leverage HiveMQ to capture time series data, let's take a step back and talk about what time series data is in general. So, time series data is any data that has a timestamp associated with it. We typically think of stock market data as a prime example of time series data, and a thing that holds consistently across stock market data, or any other type of time series data, is that a single value is not usually that interesting. What you're usually interested in with time series data is the trend of the data over time, or the stock value over time, because that lets you know whether or not you should be buying or selling, or whether or not there's a problem on your manufacturing floor. But time series data comes from a variety of different sources. And when we think about the types of time series data, we usually like to think of them in terms of two categories, and those are metrics and events. Metrics are predictable; they occur at the same interval. So we can think of polling, for example, a vibration sensor, and reading that vibration data every second from, let's say, something like an accelerometer. Meanwhile, events are unpredictable, and we cannot derive when an event will occur. So in the healthcare space, your heart rate would be your metric, and if you have a cardiovascular event like a heart attack or AFib, that is an event. We can also think of things like machine fault alerts. We don't know when the next machine fault will register, but we can store it when it does. However, one thing that's interesting about metrics and events is that through aggregation we can transform any event into a metric. So think about if we did a daily count of how many machine faults occurred. In this way we have a metric that's either going to be zero or more, but we'll at least know that we'll get one reading a day. And this is something that time series databases are also very good at doing. So what is a time series database? It has four components. The first is that it stores timestamped data. Every data point in a time series database is associated with a timestamp, and it should allow you to query in time order. The second is that it accommodates really high write throughput. In most cases you're going to have really high volumes of batch data or real-time streams from multiple endpoints. So think of something like 1,000 sensors, or maybe sensors with really high throughput; back to that vibration sensor example, industrial vibration sensors can generate data at up to 10 kHz, so that's 10,000 points every second. And you need to be able to write that data very easily, with performance and security in mind, and you also want to make sure that it's not going to impact your query rate. That brings us to the next part of a time series database, which is being able to query your data efficiently, especially over long time ranges, because what is the value of being able to accommodate really high write throughput if you can't perform efficient queries on that data afterwards? So that's another component of time series databases.
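Circling back to the event-to-metric idea for a moment: here's a small sketch (hypothetical fault timestamps, pandas assumed) of rolling unpredictable machine-fault events up into a regular daily count.

```python
import pandas as pd

# Hypothetical fault events: unpredictable timestamps, no regular interval.
faults = pd.DataFrame({
    "time": pd.to_datetime([
        "2024-05-01 03:12", "2024-05-01 17:45", "2024-05-03 09:30",
    ]),
    "machineID": ["machine1", "machine2", "machine1"],
})

# Resampling by day turns the event stream into a regular metric:
# exactly one fault_count reading per day, covering zero or more faults.
daily = (
    faults.set_index("time")
          .resample("1D")["machineID"]
          .count()
          .rename("fault_count")
)
print(daily)  # 2024-05-01 -> 2, 2024-05-02 -> 0, 2024-05-03 -> 1
```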
And last but not least, you want to consider scalability and performance. You want to have access to a scalable architecture where you can scale horizontally to handle increased load, often across distributed clusters of machines, again to accommodate the really high write throughput that is typical for a lot of time series use cases, whether that's in the virtual world or the physical world, like in IoT. So what is InfluxDB? InfluxDB is a time series database platform. At its heart, InfluxDB v3 is written on top of the Apache ecosystem, so it leverages Apache DataFusion, which is a query execution framework. It also leverages Apache Arrow, which is the in-memory columnar data format. And the durable file format is also columnar, and that's Parquet. InfluxDB v3 is really a complete rewrite of the storage engine in the platform, and this was done to facilitate a couple of key motivations or design choices. The first was to accommodate near-unlimited dimensionality, or cardinality, so that you don't have to worry about how you're indexing your points. In the past, you had to worry about what you wanted to identify as metadata and fields, or tags and fields, to make sure that you didn't have runaway cardinality. But that's no longer a concern. We also wanted to increase interoperability. We wanted people to hopefully eventually be able to query Parquet directly from InfluxDB and to have better interoperability with a bunch of machine learning tools and libraries. As a part of using DataFusion and Arrow, we also have the ability to contribute ODBC and JDBC drivers to increase interoperability with business analytics and business intelligence tools like Tableau, for example. In general, any other company that's also leveraging Arrow and Arrow Flight (which transports Arrow over a network interface) can really easily plug in with those other systems. So the idea is to help you avoid vendor lock-in and to allow you to build the solution that best suits your needs with the individual tools that you really need to solve your problem. And this presentation today is an example of that: how we're cherry-picking HiveMQ, Quix, and Hugging Face to build this industrial IoT anomaly detection example. So yeah, we're all about integrating data pipelines in application architectures today. For this demo and this example, HiveMQ is how we're going to collect all of our data from our different sensors. We're going to aggregate our data into one place where we can then bring our data to the rest of the pipeline. And then we'll use a sink to tap into the data from HiveMQ and write that data directly to InfluxDB. Then we also have the ability to process and enrich our data, and we'll do that as well. Then we'll also use a visualization tool on top, like Grafana; you could also use something like Apache Superset if you wanted to. And we'll showcase Quix, which has support for InfluxDB, HiveMQ, and MQTT. So how do we combine HiveMQ and InfluxDB into one architecture? We are doing the data ingest from HiveMQ and the data processing, and then InfluxDB is what's underpinning the storage of this raw data and the enrichment and the anomaly detection, essentially. We'll go into all of that in a little bit.
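As a hedged sketch of what pulling stored data back out of InfluxDB v3 over Arrow Flight can look like with the influxdb3-python client: the host, token, and table name below are placeholders, not values from the demo.

```python
from influxdb_client_3 import InfluxDBClient3

# Placeholders: substitute your own InfluxDB Cloud host, token, and database.
client = InfluxDBClient3(
    host="us-east-1-1.aws.cloud2.influxdata.com",
    token="MY_TOKEN",
    database="machine_data",
)

# SQL queries travel over Arrow Flight and come back as an Arrow table,
# which converts losslessly to pandas for ML and BI tooling.
table = client.query("SELECT * FROM machine_data WHERE time > now() - INTERVAL '1 hour'")
df = table.to_pandas()
print(df.head())
```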
Now let's talk about Quix and what we'll be doing with it. First, Quix is a solution for building, deploying, and monitoring event streaming applications, and it leverages Kafka under the hood, but it offers a layer of abstraction with Python-based plugins. So you really don't need to know any Kafka to use Quix. And you also don't necessarily even need to know Python, because you can configure everything through the UI if you want to. You can also change any of the Python that's running underneath any plugins or components of your pipeline within Quix to make any changes you might need and enrich any templates that they might have. So yeah, you can use either pre-canned plugins from their code library or pre-canned templates, which are a series of plugins used together, to modify and build your own data pipeline. And we'll also use Quix to build an ML pipeline from scratch, with not a lot of effort. What kind of problems can we solve with an architecture like this? We're also using Hugging Face to deploy, or to leverage, an anomaly detection solution. But what kind of problems can we solve? Let's talk about a real-world challenge. This is an example where we have a company called Pack and Go. In this imaginary scenario, this packing company is having some issues. We don't know what the root cause is, but we're getting a lot of errors from our manufacturing floor. We don't know why a machine is failing, and we don't know how to identify what's causing the failure. When it's running normally, the data exhibits a strong cyclical and seasonal pattern that is very predictable. However, when the machine starts to break down, it exhibits a different, known pattern. So the question becomes: how can we use HiveMQ, MQTT, and InfluxDB to automate identifying this pattern and these anomalies? The goal of the demo that I'm going to describe today is to highlight how we can perform this type of work for this specific use case and hopefully give you the confidence to try it for yourself, so you can see how easily this could be adopted by someone on the floor. So here's our complete data pipeline architecture, and I'm going to continue to refer back to this so that we understand the different stages of data movement and transformation throughout this talk. Given this hypothetical problem, this would be our solution architecture and our data journey. Basically, we're going to have these three robot arms that are packing robots, and these robots are connected to the HiveMQ broker. We're going to use the Quix MQTT client to ingest that data in real time and feed it to our Quix InfluxDB destination plugin. Then we're going to write that data to a table called machine data, with all of our raw and historical data, in InfluxDB. From there, we'll use the Quix InfluxDB source plugin to query the data back into Quix. We'll run it through an ML model, and then we'll pass the results back into InfluxDB in a new table called ML results. Then we'll use Grafana to visualize the data and our ML results, and alert on these anomalies. And I'll talk more about Hugging Face in a second, but let's dive into some theory by first stripping it back and talking about the data in just a little bit. For this demo, we're actually going to be using a robot machine simulator rather than actual machine data, because this is a fully containerized example that's available to you on GitHub. I'll share those resources at the end of this presentation.
But basically, I don't want to make you connect hardware yourself to generate machine data to run this example. So we have this robot machine simulator, and essentially the robots are going to go through three steps: they create a machine, create a HiveMQ connection, and write a JSON payload to a topic. The simulator spins up a thread per requested machine, and each has its own independent connection to HiveMQ. So we will have the three robots, and the JSON will contain a metadata payload, which we can see here, and this contains information about the source. It also contains a data payload, which contains actual sensor readings from our robot. So we see things like temperature, load, power, and vibration. And as part of our metadata, we see machine ID, barcode, and provider. We'll also write each robot's data to a child topic that matches its machine ID under the parent topic machine in HiveMQ. So for example, for machine one, the topic will be machine/machine1. That's pretty much all of the ingest that we need to know about for the robots. This is also what the code looks like under the hood for the MQTT client. Essentially, this is going to be a quick crash course on how to connect to HiveMQ using a Python class for spinning up the MQTT publisher. One cool thing about HiveMQ, too, that I want to mention is that you can set up an insecure connection to HiveMQ's public broker: you can set up an MQTT client and pass in the public broker credentials. This is a great tool for just testing your client and HiveMQ connection. To connect to HiveMQ Cloud, you will need a couple of credentials. You'll need to set up authentication, because it is natively secure, and you'll also need to set up an SSL certificate and a username and password. This is also provided to you in the Python onboarding in HiveMQ, so you don't really have to worry about digging through docs to figure out how to get these. Basically, what we're doing here is, after we set up that connection, we can construct our topic and send our data with the client's publish method. And I have two connection methods depending on the broker: the insecure one that doesn't require any SSL, and the secure one that requires setting the TLS certificate. It's also worth setting the MQTT protocol version to the version you want to use, which defaults to 3.1.1. And lastly, at the bottom is where we actually construct our topic and publish to it. So at this point we are writing data to HiveMQ.
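Here's a rough sketch of what that secure publisher path might look like with paho-mqtt; this isn't the talk's exact class, and the host, credentials, and payload values are placeholders.

```python
import json
import ssl
import paho.mqtt.client as mqtt

# Placeholders: use the host, username, and password from your HiveMQ Cloud cluster.
HOST = "your-cluster.s1.eu.hivemq.cloud"
MACHINE_ID = "machine1"

# (With paho-mqtt >= 2.0, pass mqtt.CallbackAPIVersion.VERSION2 as the first argument.)
client = mqtt.Client(protocol=mqtt.MQTTv311)               # MQTT v3.1.1, the default
client.username_pw_set("machine-publisher", "change-me")   # HiveMQ Cloud requires auth
client.tls_set(tls_version=ssl.PROTOCOL_TLS_CLIENT)        # secure broker, so TLS is required
client.connect(HOST, 8883)                                 # 8883 = standard MQTT-over-TLS port

# Child topic per machine under the parent topic "machine".
topic = f"machine/{MACHINE_ID}"
payload = {
    "metadata": {"machineID": MACHINE_ID, "barcode": "0000", "provider": "pack-and-go"},
    "data": {"temperature": 71.2, "load": 0.6, "power": 1.4, "vibration": 0.02},
}
client.publish(topic, json.dumps(payload))
```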
So the question becomes: how does Quix tap into this data stream? It's going to do so through the MQTT subscriber plugin. We can subscribe to our parent topic using the '#' wildcard, using the same library we used to publish data, and we'll be subscribing to all three child topics with that wildcard, bringing the JSON payload into Quix. Then we'll parse that payload and write it to the Quix stream, which is really just a Kafka stream under the hood. But luckily, like I mentioned, the Quix interface abstracts away working with Kafka, so you can just focus on your ETL and data science tasks. And while we're just doing simple parsing of the payload here in Quix, you could imagine that if we were getting data from a variety of different sources with a lot of different protocols, we might also perform some additional standardization of our data so that we can store it all in one place and clean it as we need. So after we parse the JSON payload and write it directly onto the Quix stream topic, which is essentially that Kafka topic, we can apply other plugins: transformation plugins, destination plugins, etcetera. So that's what we're doing with Quix. Now let's talk about the data science side of industrial IoT and where we are with things. We mentioned that we have these three robots in this packing scenario, and we can see, for example, what the robots look like when they're performing normally. There's some evidence of seasonality there, maybe some clear patterns that could be identified by decomposing those time series into their respective trend and seasonality components. But we also see what the data looks like when we have an anomaly. In an ideal world, our anomalous data would look something like that middle robot there, where we just have a sudden spike, the data looks completely different from our normal data, and something as simple as a threshold would indicate to us that we have an anomaly. However, the real world is usually not this easy. Realistically, our anomalous data might also present some sort of cyclical or seasonal pattern. It might be within the same standard deviation as our normal data. And so it becomes much harder to actually determine whether or not we have an anomaly than doing something as simple as analyzing the standard deviation of our data or a threshold. In this case, we can use a more sophisticated method for anomaly detection, something like an artificial neural network, to help us solve our problem. Specifically, today we'll be employing an autoencoder. What is an autoencoder? It's an unsupervised machine learning technique. What this means is, essentially, this is a type of machine learning technique that tries to learn patterns in our data without being provided any labels or prompts from the engineer, from me, for example. Autoencoders were actually originally used for image compression, but it turned out that they work great for time series anomaly detection, so they've often been repurposed for this exact scenario. So how do autoencoders work? I'm going to briefly go over the idea. Basically, there's this input layer. And let me just take a step back. Let's imagine, for example, that we're trying to teach our robot a dance, and we want this autoencoder to be able to describe and understand that dance so that we can teach it to the robot. We use this analogy because, essentially, whether we're looking at vibration or temperature or pressure or whatever other components we might have for monitoring any sort of machine on a machine floor, all these components tell us how a robot is moving or how its health is. We'll just use the dance analogy, because I also am a dancer and I love dance, so I'll take any chance I get. So let's go back to the input layer of an autoencoder. An input layer is basically like telling the robot that it's going to learn a dance made out of specific dance steps. Then we have a sequence layer. The sequence layer is not a standard Keras layer, but if we want to imagine what it might do, we can think of it as the time when we prepare the sequence of dance steps that needs to be learned over a certain number of beats, or timestamps. It's just mapping out what the whole dance is going to look like.
Then we have an LSTM layer, or long short-term memory layer. This layer acts a lot like the robot's memory: as the robot watches the dance, it uses these layers to remember the sequence. The first LSTM layer, with 16 units, can be seen as the robot focusing on remembering the main parts of the dance. The second LSTM layer, with four units, is like trying to remember the key movements that define the dance's style: not just the steps, but specifically how those individual shapes look. Within a time series context, to make that a little simpler: maybe we know that we have some seasonal patterns happening, and that might be what the first layer with 16 units captures, while the second layer is a little more specific, like what the arc of those patterns looks like. Then we have an encoding layer. Back to the dance analogy, this is where the dance moves are encoded into a simpler form. We can imagine the robot now has a compressed memory of this dance, focusing on the most important moves, which is the output of the second LSTM layer. And this is also similar to how we learn. We first absorb the big picture of something. Then we have the opportunity to focus on the details and understand and tune them. And when we start to really put something into our own body, or really learn it, we usually only think of key moments that prompt us onto the next, while the rest becomes muscle memory. Then we have a repeat vector. This is like the robot getting ready to recreate the dance: it takes the core memory of the dance and prepares to expand it back into the full sequence. Then we have the decoding layers. The next LSTM layers are the robot trying to recall the full dance from its core memories. It starts with the essential moves and then builds up the details until it has the full sequence. And finally, we have the time-distributed dense layer. This is like the robot refining each move, ensuring that each step in the dance is sharp and matches the original as closely as possible. Now that we understand how autoencoders work in theory, let's talk about how they can help us detect anomalies. In order to understand that, we need to talk about the mean squared error. The mean squared error (MSE) is a way to define the difference between our actual data and our predicted data, and to determine how good our predicted data is. The predicted data will be made by the autoencoder, and the MSE represents the reconstruction error: it measures the distance between our actual data and our predicted, or reconstructed, data. Our reconstructed data should be really similar to our normal data. So if we have a high MSE, that means we have a deviation from our predicted or reconstructed data, and therefore we must be experiencing something out of the ordinary. High errors indicate anomalies.
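Putting the layer stack together, here's a minimal Keras sketch of the autoencoder just described, assuming input windows shaped (samples, timesteps, features); the window size and training settings are assumptions rather than the talk's exact values, and the per-window reconstruction error is MSE = (1/n) Σ (x_i − x̂_i)².

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

TIMESTEPS, FEATURES = 30, 1  # assumed window size; the talk's values may differ

model = keras.Sequential([
    layers.Input(shape=(TIMESTEPS, FEATURES)),
    layers.LSTM(16, return_sequences=True),          # remember the main parts of the dance
    layers.LSTM(4, return_sequences=False),          # compress to the key movements (encoding)
    layers.RepeatVector(TIMESTEPS),                  # expand the core memory to full length
    layers.LSTM(4, return_sequences=True),           # recall, starting from the essentials...
    layers.LSTM(16, return_sequences=True),          # ...then build the details back up
    layers.TimeDistributed(layers.Dense(FEATURES)),  # refine every step of the sequence
])
model.compile(optimizer="adam", loss="mse")

# Train on normal windows only, so the model learns to reconstruct "normal":
# model.fit(x_normal, x_normal, epochs=50, batch_size=32)

def anomaly_flags(model, x, threshold):
    """Reconstruction MSE per window; a high MSE flags an anomaly."""
    recon = model.predict(x)
    mse = np.mean((x - recon) ** 2, axis=(1, 2))
    return mse, mse > threshold
```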
So let's talk about some real-world challenges with going operational in general, because we've talked about how to build a pipeline and how to incorporate some machine learning, but let's specifically talk about some of the issues with melding these worlds together. You can have really good data scientists, and you can have people that are really good at building models, but one true challenge is bringing those models into production. Jupyter notebooks and Keras, some of the primary tools that data scientists work in, work in a completely different way from the way we want to run and deploy models in production. So we have our model, our autoencoder; how do we actually deploy it within our solution and monitor the results? This is one of the hardest parts of building an AI-driven solution. It's like taking a miracle pill that was created in the lab, which works in a controlled environment, and bringing it to production for use in everyday life. And a lot of the time, this is where the bottleneck is. We have great machine learning models being developed by excellent data scientists, but it takes forever for them to reach production and actually deliver value. And this is where Hugging Face comes in. Hugging Face is going to be your central repository for storing, evaluating, and deploying models, along with datasets. Imagine it as Git on steroids: you upload your model there, and then we use an API to deploy and access that model in our data pipeline. Quix has a native plugin integration directly with Hugging Face as well, so your data scientists can stay in the realm of their Jupyter notebooks, focus on their expertise, and push their models to Hugging Face, and then your data engineers can incorporate those models and deploy them with incremental testing.
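As a hedged sketch of that hand-off (the repository ID and filename below are hypothetical, not the demo's actual model), a pipeline service could pull the trained autoencoder down from the Hugging Face Hub like this:

```python
from huggingface_hub import hf_hub_download
from tensorflow import keras

# Hypothetical repo and filename; substitute the model your data scientists pushed.
model_path = hf_hub_download(
    repo_id="pack-and-go/robot-autoencoder",
    filename="autoencoder.keras",
)
model = keras.models.load_model(model_path)

# The service can now score incoming windows of sensor data and tag each
# one as anomalous (or not) before writing the results back to InfluxDB.
```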
The cool thing about this demo is that you can also generate anomalies in real time, and then you can pick up on those with the autoencoder, adding a tag that labels the data as anomalous or not, along with the MSE percentage. The example that's highlighted here also includes the Grafana visualization that I mentioned earlier. And we could, for example, use Grafana's powerful alerting tools to say: if I have a certain number of anomalous points within a certain amount of time, then go ahead and actually alert on the data. So just to conclude here, this is how we use Quix, InfluxDB, HiveMQ, and Hugging Face to operationalize anomaly detection in the IoT space. The cool thing about this stack is just how scalable it is, too. Obviously, this demo is only operating on three robots with generated machine data, but we could easily operate on thousands of robots with this architecture. So what's next? Let's talk about some hypotheticals. We could imagine adding another labeling algorithm and actually labeling anomalies with the type of anomaly they are, so we can understand whether our machines are encountering bad bearings, shaky belts, an unexpected load, etcetera. Then operators could see in real time what the condition of these machines is. And of course, everyone's really excited about LLMs right now, for good reason, right? As they get better and faster, we could even think about what else they could do. Maybe they could automate protocol conversion. Maybe they could translate machine language into human-readable language. And maybe we could replace dashboards altogether; we wouldn't have to look at dashboards because LLMs would interpret them for us and instead provide insights into our environment. We could also think about using AI to pass on expert knowledge. We could monitor how experts and operators troubleshoot problems and create models based on their domain knowledge in specific use cases and how they respond to critical events. Then we could pass down this knowledge to new technicians and offer prompts to help them solve the problem. Perhaps when certain machines encounter certain issues, we also notice a correlation between that and a certain action, or access to particular documentation or protocols, which could be automatically provided to help new technicians identify the root cause. And eventually, maybe we would even have self-defining digital twins, where they make the connections between the machines, the sensors, the applications, and the monitoring solutions, and use those connections to proactively monitor and solve problems. So you could imagine being able to walk onto a factory floor, ask a machine how it's doing, and just be a machine doctor. We can let our imagination run wild with the combination of all these solutions together. But what are the next steps for you? First, I want to encourage you to try this demo out yourself. All you need to do in order to try it is create a free Quix account, follow the URL here, and clone the repo so you can get it up and running, along with an InfluxDB Cloud free tier account. And you can go ahead and, like I said, run this example yourself for free, or pick and choose the components from it that you want to use. I also want to encourage you to sign up at influxdata.com; that's where you can get the free tier of InfluxDB. And I also want to encourage you to visit our community, our Slack, and our forums at community.influxdata.com to ask any questions that you might have about this presentation, IoT, machine learning, MQTT, etcetera. And last but not least, I also want to point you to InfluxDB University. InfluxDB University is a free resource where you can get access to live and self-paced courses on all things InfluxDB, and you can even earn badges for some of them, which you can display on your LinkedIn. Please also access the documentation and self-service content, like blogs, from InfluxData so you can learn more about all of this in detail, in the format that works best for you. I also encourage you to join the HiveMQ community if you have questions about HiveMQ, and the Quix community as well. We also have Quix engineers in our InfluxData community, so if you're doing anything with Quix, they're happy to help. If you have any questions, please forward them there. I'd love to hear from you.
...

Anais Dotis-Georgiou

Developer Advocate @ InfluxData



