Transcript
Hello and welcome. Today we're going to learn how we can use
InfluxDB v3 with HiveMQ and
Quix, which is a data streaming platform, as well
as Hugging Face, to build an anomaly detection solution for
some IoT data. We'll also learn how this solution and this architecture
could be used for an industrial IoT use case, because it's scalable.
So specifically, for the agenda today, we'll be talking about what
data pipelines are, what HiveMQ is, and what InfluxDB is.
Then we'll dive a little bit into AI and ML in data pipelines
and their application there. Then we'll go into real-world applications
in industrial IoT, and we'll follow that
with learning how we can build with HiveMQ (an MQTT broker), Quix,
and InfluxDB. Finally, I'll finish this talk by
sharing some conclusions, answering some common questions, and sharing
some source code with you so that you can
go ahead and try this example on your own. So who
am I? I'm Anais Dotis-Georgiou. I'm a developer advocate at InfluxData.
InfluxData is the creator of InfluxDB, and I want
to encourage you to connect with me on LinkedIn if you'd like to do so.
And I encourage you to ask me any questions that you have about this presentation,
time series data, data pipelines, IoT,
etcetera. Come ask any questions that you have. I'd love to connect with you
and help you on your time series and data pipeline journey.
So what are data pipelines? If you're familiar with Kafka,
then you're probably familiar with the source and sink model. Information is generated,
usually from a sensor or an application, or it could be a log,
and it needs to make its way to where it needs to be consumed.
That could be applications, end users, controllers,
etcetera. And when you have that data in a pipeline, what can you do
with it? You can do things like normalize it and transform it.
Maybe you need to standardize from a bunch of different types of protocols,
and you might want to do all of this in flight. So we'll
talk about how we can use HiveMQ to build such a data
pipeline and then store the data in InfluxDB. So HiveMQ
is an MQTT broker, and like Kafka, it works on a pub/sub
model, which is similar to other MQTT brokers as well.
It allows you to publish information as a message to a topic that
can be ephemeral, persistent, or shared among other consumers and producers.
When you publish data to that topic, other devices that subscribe to that
topic can then access that data, process it, and write it to other
databases like InfluxDB. So what
else can you do with HiveMQ? You can get data from a variety
of different sources. You can use things like Data Hub and extensions
to perform any transformations that you might need directly within HiveMQ.
You can also sink to a variety of databases,
InfluxDB included, because they have their own MQTT connectors.
But you can also pipe to other databases as well,
and you can perform whatever ETL you need. It also
has systems integrations; it can integrate
very easily with streaming services like Kafka as well.
You also have a variety of deployment options. They understand
that there's a need for flexibility in deployment, and so they offer
HiveMQ Cloud, which is fully managed and available on major
cloud platforms, and HiveMQ Self-Managed,
which gives you control to deploy on your own Kubernetes and tailor
your HiveMQ to your specific needs. So that's
basically HiveMQ in a nutshell, zoomed out.
But before we understand how we can leverage HiveMQ to capture
time series data, let's take a step back and talk about what
time series data is in general. So, time series data
is any data that has a timestamp associated with it.
We typically think of stock market data as a
prime example of time series data, and one thing
that's consistent across stock
market data, or any other type of time series data, is that a single
value is usually not that interesting. What you're
usually interested in with time series data is the trend of
data over time, or the stock value over time, because that
lets you know whether or not you should be buying or selling, or whether or
not there's a problem on your manufacturing floor.
But time series data comes from a variety of
different sources. And when we think about the types of time
series data, we usually like to think of them in terms of two
categories: metrics and events. Metrics are predictable;
they will occur at the same interval. So we can think of polling,
for example, a vibration sensor, and reading that vibration
data every second from, let's say, something like an accelerometer.
Meanwhile, events are unpredictable, and we cannot derive when
an event will occur. So in the healthcare space, your heart rate
would be your metric, and if you have a cardiovascular
event like a heart attack or AFib, that is an event.
We can also think of things like machine fault alerts.
We don't know when the next machine fault will register,
but we can store it when it does. However, one thing that's
interesting about metrics and events is that through aggregation we can
transform any event into a metric. So think about if we did a
daily count of how many machine faults occurred. In this
way we have a metric that's either going to be zero or more, but we'll
at least know that we'll get one reading a day. And this is something that
time series databases are also very good at doing.
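To make that event-to-metric aggregation concrete, here's a minimal sketch using pandas; the fault timestamps are made-up sample data, not from the demo.

```python
import pandas as pd

# Made-up fault event timestamps (events arrive at unpredictable times).
faults = pd.DataFrame(
    {"fault": 1},
    index=pd.to_datetime(["2024-05-01 03:12", "2024-05-01 17:40", "2024-05-03 09:05"]),
)

# Aggregate to one reading per day; days with no faults show up as zero.
daily_counts = faults.resample("1D").count()
print(daily_counts)
```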
So what is a time series database? It has four components. The first
is that it stores timestamped data. Every data point
in a time series database is associated with a timestamp,
and it should allow you to query in time order.
The second is that it accommodates really high write throughput.
So in most cases you're going to have really high volumes
of batch data or real time streams from multiple endpoints.
So think of something like a thousand sensors, or maybe sensors with
really high throughput. For example, back to that vibration sensor:
industrial vibration sensors can produce data at up to 10 kHz, so that's
10,000 points every second. And you need to be able to write that data very
easily, with performance and security in mind.
And you also want to make sure that it's not going to impact your query
rate. So that brings us to the next part of
a time series database, which is being able to query your data
efficiently, especially over long time ranges, because what is
the value of being able to accommodate really high write throughput
if you can't perform efficient queries on that data subsequently?
So that's another component of time series databases.
And then, last but not least, you want to consider scalability and
performance. You want to have access to scalable architecture where
you can scale horizontally to handle increased load, often across distributed
clusters of machines, again, to accommodate the really high write throughput
that is typical for a lot of time series use cases,
whether that's in the virtual world or the physical world,
like in IoT. So what is InfluxDB?
InfluxDB is a time series database platform. At its
heart, InfluxDB v3 is written on top of the Apache
ecosystem, so it leverages Apache DataFusion, which is
a query execution framework. It also leverages Apache Arrow,
which is the in-memory columnar data format. And then
the durable file format is also columnar, and that's Parquet.
And InfluxDB v3 is really a complete rewrite
of the storage engine in the platform. This was done to
facilitate a couple of key design choices. The first was to accommodate
near-unlimited dimensionality, or cardinality, so that
you don't have to worry about how you're indexing your points.
In the past, you had to worry about what you wanted to identify as
metadata and fields, or tags and fields, to make sure that you didn't have
runaway cardinality, but that's no longer a concern. We also wanted to
increase interoperability. We wanted people to hopefully, eventually, be able
to query Parquet directly from InfluxDB, and to have better
interoperability with a bunch of machine learning tools and libraries.
As a part of using DataFusion and Arrow, we also
have the ability to contribute ODBC and JDBC
drivers to increase interoperability with business
analytics and business intelligence tools like Tableau, for example.
In general, any other company that's also leveraging Arrow and
Arrow Flight (which transports Arrow over a network interface) can
really easily plug in with those other systems. So the
idea is to help you avoid vendor lock-in and to allow you to build
the solution that best suits your needs
with the individual tools that you really need to solve
your problem. And this presentation today is an example of that:
how we're cherry-picking HiveMQ, Quix,
and Hugging Face to build this industrial IoT anomaly detection example.
So yeah, we're all about integrating data pipelines into application architectures.
For this demo and this example,
HiveMQ is how we're going to collect all of our data from our different sensors.
We're going to aggregate our data into one place, where we can then bring it
to the rest of the pipeline. Then we'll use a sink to
tap into the data from HiveMQ and write that data directly to InfluxDB.
We also have the ability to process and enrich our data,
and we'll do that as well. Then we'll use a visualization tool on top,
like Grafana; you could also use something like Apache Superset if you wanted
to. And we'll showcase Quix, which has support for InfluxDB,
HiveMQ, and MQTT. So how do we combine
HiveMQ and InfluxDB into one architecture?
HiveMQ is doing the data ingest and the data processing,
and then InfluxDB is what's underpinning the storage
of the raw data, the enrichment, and the anomaly
detection, essentially. We'll go into all of that in a little bit.
And let's talk about Quix, and what we'll be doing with it. What is
it? First, Quix is a solution for building,
deploying, and monitoring event streaming applications, and it
leverages Kafka under the hood,
but it offers a layer of abstraction with Python-based plugins.
So you really don't need to know any Kafka to use Quix. And you also
don't necessarily even need to know Python, because you can configure everything through
the UI if you want to. You can also change any of the Python
that's running underneath any plugins or components of
your pipeline within Quix, to make any changes that you might need and enrich
any templates that they might have. So you can use either pre-canned
plugins from their code library, or pre-canned templates (which are a
series of plugins used together), to modify and build your own data
pipeline. And we'll also use Quix to build an
ML pipeline from scratch, without a lot of effort.
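To give a feel for that abstraction, here's a minimal sketch of a Python transform in the style of a Quix pipeline step, using the open-source quixstreams library. The broker address, topic names, and placeholder transform are all assumptions for illustration, and the exact API can vary between library versions.

```python
from quixstreams import Application

# Connect to the underlying Kafka broker that Quix manages for you.
app = Application(broker_address="localhost:9092", consumer_group="demo")

input_topic = app.topic("machine-data", value_deserializer="json")
output_topic = app.topic("scored-data", value_serializer="json")

# A streaming dataframe: each message flows through the transforms below.
sdf = app.dataframe(input_topic)
sdf = sdf.apply(lambda row: {**row, "anomalous": False})  # placeholder transform
sdf = sdf.to_topic(output_topic)

if __name__ == "__main__":
    app.run(sdf)
```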
What kind of problems can we solve with an architecture like this? We're also using
Hugging Face to deploy, or leverage, an anomaly detection solution.
So what kind of problems can we solve?
Let's talk about a real-world challenge. This is an example
where we have a company called Pack and Go. In
this imaginary scenario, this packing company is having some issues.
We don't know what the root cause is, but we're getting a lot of errors
from our manufacturing floor. We don't know why a machine is failing,
and we don't know how to identify what's causing the failure while it's
running. Normally, the data exhibits a strong cyclical and seasonal pattern
that's very predictable. However, when the machine starts to break
down, the data exhibits a different, known pattern.
So the question becomes: how can we use HiveMQ,
MQTT, and InfluxDB to automate identifying this
anomalous pattern? So the goal of this demo that I'm going to describe
today is to highlight how we can perform this type of work for the specific
use case and hopefully give you the confidence to try it for yourself so you
can see how easily this could be adopted by someone on the floor.
So here's our complete data pipeline architecture, and I'm going to
continue to refer back to this so that we understand the different stages of
data movement and transformation throughout this talk. So given this hypothetical
problem, this would be our solution architecture and our
data journey. Basically, we're going to have these three robot arms
that are packing robots, and these robots are connected to
the HiveMQ broker. We're going to use the Quix MQTT
client to ingest that data in real time and feed
it to our Quix InfluxDB destination plugin.
Then we're going to write that data to a table called machine data,
with all of our raw and historical data, in InfluxDB.
From there, we'll use the Quix InfluxDB source plugin to query
the data back into Quix. We'll run it through an ML model,
and then we'll pass the results back into InfluxDB in a new
table called ML results. Then we'll use Grafana
to visualize the data and our ML results, and alert on these anomalies.
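As a rough sketch of that write/query round trip, here's what it could look like with the influxdb3-python client; the host, token, database, and table names are placeholders, not the demo's actual configuration.

```python
from influxdb_client_3 import InfluxDBClient3

client = InfluxDBClient3(
    host="your-region.aws.cloud2.influxdata.com",  # placeholder host
    token="YOUR_TOKEN",
    database="machines",
)

# Write a raw sensor point as line protocol.
client.write("machine_data,machine_id=machine1 temperature=92.1,vibration=0.02")

# Query it back for the ML step; InfluxDB v3 speaks SQL over Arrow Flight.
table = client.query("SELECT * FROM machine_data WHERE time >= now() - INTERVAL '1 hour'")
print(table.to_pandas())
```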
And I'll talk more about Hugging Face in a second, but first let's
dive into some theory by stripping things
back and just talking about the data a little bit.
So for this demo, we're actually going to be using a robot machine simulator
rather than actual machine data, because this is a fully
containerized example that's available to you on GitHub; I'll share those resources
at the end of this presentation. Basically, I don't want to
make you connect hardware yourself to generate machine data to run this
example. So we have this robot machine simulator,
and essentially the robots are going to go through three steps:
they create a machine, create a HiveMQ connection, and write a JSON payload
to a topic. The simulator spins up
a thread per requested machine, and each has its
own independent connection to HiveMQ.
So we will have the three robots, and the JSON
will contain a metadata payload, which we can see here.
This contains information about the source. It also contains
a data payload, which contains actual sensor readings from
our robot. So we see things like temperature, load, power, and
vibration. And as part of our metadata, we see machine
ID, barcode, and provider. We'll also write each robot's
data to a child topic that matches its machine ID, under the
parent topic machine in HiveMQ. So for example,
for this one machine that we have, machine1,
the topic will be machine/machine1.
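Here's a minimal sketch of what one simulator message and its topic could look like; the exact field names and values are illustrative stand-ins for the payload described above.

```python
import json

machine_id = "machine1"

# Metadata payload (about the source) plus data payload (sensor readings).
payload = {
    "metadata": {"machine_id": machine_id, "barcode": "A-0001", "provider": "pack-and-go"},
    "data": {"temperature": 92.1, "load": 0.71, "power": 112.5, "vibration": 0.02},
}

topic = f"machine/{machine_id}"  # child topic under the parent "machine" topic
message = json.dumps(payload)
```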
Yeah, that's pretty much all of the ingest that
we need to know about for the robots. This is also what the code
looks like under the hood for the MQTT client.
So essentially, this is going to be a quick crash course
on how to connect to HiveMQ using a Python
class for spinning up the MQTT publisher.
One cool thing about HiveMQ that I want to mention is that
you can set up an insecure connection to HiveMQ's public broker:
you can set up an MQTT client and pass in the public
broker credentials. And this is a great tool for just testing
your client and HiveMQ connection. To connect to HiveMQ
Cloud, you will need a couple of credentials. You'll need to set
up authentication, because it is natively secure,
and you'll also need an SSL certificate and a username and password.
These are provided to you in the Python onboarding
in HiveMQ, so you don't really have to worry about digging through docs to figure
out how to get them. But basically, what we're doing here is:
after we set up that connection, we
can construct our topic and send our data with the client's publish
method. And I have two connection methods, depending on the
broker: the insecure one that doesn't require any SSL,
and the secure one that requires setting the TLS certificate.
It's also worth setting your MQTT protocol version
to the version that you want to use; it defaults to
3.1.1. And lastly, at the bottom
is where we are actually constructing our topic and publishing to it.
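Here's a minimal sketch of such a publisher class using the paho-mqtt library (1.x API); the broker details, credentials, and payload are placeholders rather than the demo's actual code.

```python
import json
import ssl
import paho.mqtt.client as mqtt


class MachinePublisher:
    def __init__(self, host, port, username=None, password=None, secure=True):
        self.client = mqtt.Client(protocol=mqtt.MQTTv311)  # MQTT v3.1.1
        if secure:
            # HiveMQ Cloud requires TLS plus username/password auth.
            self.client.tls_set(tls_version=ssl.PROTOCOL_TLS_CLIENT)
            self.client.username_pw_set(username, password)
        self.client.connect(host, port)
        self.client.loop_start()

    def publish(self, machine_id, data):
        # Construct the child topic and publish the JSON payload.
        topic = f"machine/{machine_id}"
        self.client.publish(topic, json.dumps(data))


# Insecure connection to the public broker, for testing only.
pub = MachinePublisher("broker.hivemq.com", 1883, secure=False)
pub.publish("machine1", {"temperature": 92.1, "vibration": 0.02})
```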
So at this point, we are writing data to HiveMQ, and the
question becomes: how does Quix tap into this data stream?
It does so through the MQTT subscriber plugin, with which
we can subscribe to our parent topic using the # wildcard,
and which uses the same library we used to publish data. We'll be subscribing
to all three child topics with that wildcard, bringing the JSON
payload into Quix. Then we'll parse that payload and
write it to the Quix stream, which is really just a Kafka stream under the
hood. But luckily, like I mentioned, the Quix interface abstracts working with
Kafka, so you can just focus on your ETL and data science tasks.
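For the other side, here's a minimal sketch of subscribing with the # wildcard and parsing the JSON payload, again with paho-mqtt (1.x API); the broker and the print handoff are placeholders.

```python
import json
import paho.mqtt.client as mqtt


def on_message(client, userdata, msg):
    # msg.topic is e.g. "machine/machine1"; the payload is the JSON document.
    payload = json.loads(msg.payload)
    print(msg.topic, payload["data"])  # hand off to stream processing here


client = mqtt.Client(protocol=mqtt.MQTTv311)
client.on_message = on_message
client.connect("broker.hivemq.com", 1883)
client.subscribe("machine/#")  # "#" matches every child topic under "machine"
client.loop_forever()
```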
And while we're just parsing this payload here in
Quix, you could imagine that if we were getting data from a variety
of different sources with a lot of different protocols, we might also
perform some additional standardization of our data, so that we can store
it all in one place and clean it as we need. So after
we parse the JSON payload and write it directly onto the Quix stream
topic, which is essentially that Kafka topic, we can then apply
other plugins: transformer plugins, destination plugins,
etcetera. So that's what we're doing with Quix. So let's talk about
the data science side of industrial IoT and where we are
with things. We mentioned that we have these three robots in this packing scenario,
and we can see, for example, what the robots
look like when they're performing normally. There's some evidence of seasonality
there, and maybe some clear patterns that could be identified
by decomposing those time series into their respective trend and seasonality
components. But we also see what the data
looks like when we have an anomaly. In an ideal
world, our anomalous data would look something like that
middle robot there, where we just have a sudden spike, the
data looks completely different from our normal data, and something like
a simple threshold could indicate to us that we have an
anomaly. However, the real world is usually not this easy.
Realistically, we might also have our anomalous data
presenting some sort of cyclical or seasonal pattern, and it might be within
the same standard deviation as our normal data. So
it becomes much harder to actually determine whether or not we have
an anomaly than doing something as simple as analyzing the standard
deviation of our data or applying a threshold.
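For contrast, here's a minimal sketch of that naive approach on made-up vibration values; an in-band anomaly would sail right past a three-sigma rule like this.

```python
import numpy as np

# Made-up vibration readings; an anomalous-but-in-band value hides among them.
values = np.array([0.02, 0.03, 0.02, 0.04, 0.03, 0.02, 0.03])

mu, sigma = values.mean(), values.std()
flags = np.abs(values - mu) > 3 * sigma  # three-sigma rule
print(flags)  # all False: nothing is flagged, even if the pattern changed
```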
In this case, we can use a more sophisticated method for anomaly
detection, something like an artificial neural network, to help us solve our problem.
Specifically, today we'll be employing an autoencoder.
What is an autoencoder? It's an unsupervised machine learning technique.
So what this means is, essentially, that this is a type of machine learning
technique that tries to learn patterns in our data without
being provided any labels or prompts from the engineer
(from me, for example). Autoencoders were actually originally used
for image compression, but it was found that they work great for time series
anomaly detection, so they've often been repurposed for this
exact scenario. So how do autoencoders work?
I'm going to briefly go over the idea. Basically,
there's this input layer. But let me just take
a step back. Let's imagine, for example, that we're trying to teach
our robot a dance, and we
want this autoencoder to be able to describe and understand this dance
so that we can teach the robot the dance.
We use this analogy because, essentially, whether
we're looking at vibration or temperature
or pressure, or whatever other components
we might have for monitoring any sort
of machine on a machine floor, all these components tell us
how a robot is moving or how its health is. We'll use
the dance analogy because I'm also a dancer and I love dance, so I
take any chance I get. So let's go back to the input layer of an
autoencoder. The input layer is basically like telling the
robot that it's going to learn a dance made out of specific dance steps.
And then we have a sequence layer. The sequence layer is
not a standard Keras layer, but if we want to imagine
what it might do, we can think of it as the point where we
prepare the sequence of dance steps that need to
be learned over a certain number of beats, or timestamps. It's just mapping out what
the whole dance is going to look like. And then we have an LSTM layer,
or long short-term memory layer. And this layer acts a lot
like the robot's memory. As the robot is going to watch the dance, it's going
to use these layers to remember the sequence. The first LSTM
layer, with 16 units, could be seen as the robot focusing
on remembering the main parts of the dance. And then the second LSTM
layer, with four units, is like trying to remember the
key movements that define the dance's style: not just the steps,
but specifically how those individual shapes
look. And within a time series context, to make that a little
simpler: maybe we know that we have some seasonal patterns
happening; that might be the first LSTM layer. And then the second layer
is a little more specific, like what the arc of those patterns looks
like. And then we have an encoding layer. Here, essentially,
back to the dance analogy, the dance moves are encoded into a
simpler form. So we can imagine the robot now has a compressed memory of
this dance, focusing on the most important moves, which is the output of the second
LSTM layer. And this is also similar to how we learn. We first
absorb a big picture of something. Then we have the opportunity to focus on
the details and understand and tune them. And then, when we
start to really put something into our own body, or really learn it,
we usually only think of key moments that help prompt
us on to the next, while the rest becomes muscle memory.
And then we have a repeat vector. And this is like the robot getting ready
to recreate the dance. It takes the core memory of the dance and prepares
it to expand it back into the full sequence. And then
we have the decoding layer. The next LSTM layers are the robot
trying to recall the full dance from its core memories.
And it starts with the essential moves and then builds up the details until
it has the full sequence. And finally, we have the TimeDistributed dense
layer. This is like the robot refining each move,
ensuring that each step in the dance is sharp and matches the original
as closely as possible.
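Putting the analogy back into code, here's a minimal sketch of the LSTM autoencoder described above, in Keras; the window length, feature count, and training call are illustrative assumptions.

```python
from tensorflow import keras
from tensorflow.keras import layers

timesteps, n_features = 30, 4  # e.g. temperature, load, power, vibration windows

model = keras.Sequential([
    layers.Input(shape=(timesteps, n_features)),       # the dance steps
    layers.LSTM(16, return_sequences=True),            # remember the main parts
    layers.LSTM(4, return_sequences=False),            # compress to key moments
    layers.RepeatVector(timesteps),                    # get ready to recreate the dance
    layers.LSTM(4, return_sequences=True),             # recall from the core memory
    layers.LSTM(16, return_sequences=True),            # build the details back up
    layers.TimeDistributed(layers.Dense(n_features)),  # refine each step
])
model.compile(optimizer="adam", loss="mse")

# Train on windows of normal data only, so the model learns to
# reconstruct normal behavior, e.g.:
# model.fit(X_normal, X_normal, epochs=50, batch_size=32)
```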
Now that we understand how autoencoders work in theory, let's talk about how they can help us detect anomalies.
In order to understand that, we need to talk about the mean squared error.
The mean squared error (MSE) is a way to quantify the
difference between our actual data and our predicted data, and to
determine how good our predicted data is.
The predicted data will be produced by the autoencoder,
and the MSE represents a reconstruction error:
it basically measures the distance between our actual data
and our predicted, or reconstructed, data. Our reconstructed
data should be really similar to our normal data, so if we have
a high MSE, that means we have a deviation
from our reconstructed data,
and therefore we must be experiencing something out of the ordinary.
So high errors indicate anomalies.
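A minimal sketch of that scoring step might look like this; it continues the Keras sketch above, and the mean-plus-three-standard-deviations threshold is a common convention I'm assuming here, not necessarily what the demo uses.

```python
import numpy as np

def reconstruction_errors(model, windows):
    # MSE per window, averaged over timesteps and features.
    reconstructed = model.predict(windows)
    return np.mean((windows - reconstructed) ** 2, axis=(1, 2))

# `model`, `X_normal`, and `X_new` are placeholders from the sketch above.
# Calibrate a threshold on known-normal windows, then flag anomalies.
normal_errors = reconstruction_errors(model, X_normal)
threshold = normal_errors.mean() + 3 * normal_errors.std()
anomalous = reconstruction_errors(model, X_new) > threshold  # high error = anomaly
```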
So let's talk about some real-world challenges with going operational in general,
because we've talked about how to build a pipeline, how to incorporate
some machine learning. But let's specifically talk about
some of the issues melding these worlds together,
because you can have really good data scientists and you can have people that are
really good at building models. But one true challenge is
bringing those models into production. Jupyter and
Keras Jupyter notebooks being one of the primary tools that data scientists work in,
work in a completely different way than the way that we want to run the
models in production and deploy it. So we have our model, our autoencoder,
how do we actually deploy it within our solution and monitor the results?
This is one of the hardest parts of building an AI-driven solution.
How do we take a miracle pill that was created in the lab,
which works in a controlled environment, and bring it to production for
use in everyday life? A lot of the time, this is where the bottleneck
is. We have great machine learning models being developed
by excellent data scientists, but it takes forever for
them to reach production and actually deliver value. And this is where Hugging
Face comes in. Hugging Face is going to be your central repository for
storing, evaluating, and deploying models, along with
their datasets. So imagine Git on steroids: you upload your model
there, and basically we use
an API to deploy and then access
our model in our data pipeline. And Quix has an
integration with Hugging Face as a native plugin as well,
so your data scientists can stay in the realm of their Jupyter notebooks and
just focus on their expertise: push their models to
Hugging Face, and then your data engineers can incorporate those models and deploy them
with incremental testing.
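A minimal sketch of that push/pull flow with the huggingface_hub library might look like this; the repo ID and filename are hypothetical placeholders, not the demo's actual repository.

```python
from huggingface_hub import HfApi, hf_hub_download

# Data scientist side: push the trained autoencoder to the Hub.
api = HfApi()
api.upload_file(
    path_or_fileobj="autoencoder.keras",
    path_in_repo="autoencoder.keras",
    repo_id="your-org/packing-anomaly-detector",  # hypothetical repo
)

# Data engineer side: pull the same model inside the pipeline.
model_path = hf_hub_download(
    repo_id="your-org/packing-anomaly-detector",
    filename="autoencoder.keras",
)
```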
So the cool thing about this demo is that you can generate anomalies in real
time, and then also pick up on them with the autoencoder, adding a
tag that labels the data as anomalous or not, along with the MSE
percentage. The example that's highlighted here also includes
the Grafana visualization that I mentioned earlier.
And we could, for example, use Grafana's powerful alerting tools
to say: if I have a certain number of anomalous points within a certain amount
of time, then go ahead and actually alert on the data.
So just to conclude here: this is how we use Quix, InfluxDB,
HiveMQ, and Hugging Face to operationalize anomaly detection
in the IoT space. The cool thing about this stack is
just how scalable it is, too. Obviously, this demo is only operating
on three robots with generated machine data,
but we could easily operate on thousands of robots with this architecture.
So what's next? Let's talk about some hypotheticals. We could imagine adding
another labeling algorithm and actually labeling anomalies by the type of
anomaly they are, so we can understand whether our machines are encountering bad
bearings, shaky belts, an unexpected load,
etcetera. And then operators could see in real time
what the condition of these machines is.
And of course, everyone's really excited about LLMs right now,
for good reason, right? And as they get better and faster, we could even think
about what else they could do. Maybe they could automate protocol conversion.
Maybe they could translate machine language into human-readable language.
And maybe we could replace dashboards altogether: we wouldn't have
to look at dashboards, because they'd be interpreted for us by LLMs,
which could instead provide insights into our environment.
We could also think about using AI to pass on
expert knowledge. We could monitor how experts and operators are troubleshooting
problems, and create
models based on their domain knowledge in specific use cases
and how they respond to critical events. Then we
could pass down this knowledge to new technicians and offer prompts to
help them solve the problem. Perhaps when certain machines
encounter certain issues, we also notice a correlation between that
and a certain action, or access to
particular documentation or protocols, which could be automatically
provided to help new technicians
identify the root cause. And eventually,
maybe we would even have self-defining digital twins, where they're making the
connections between the machines, the sensors, the applications,
and the monitoring solutions, and making these connections to proactively monitor
and solve problems. So you could imagine maybe
being able to walk onto a factory floor, ask a machine how
it's doing, and just be a machine doctor.
So yeah, we can let our imagination run wild with the combination of
all these solutions together. But what are the next steps for you?
The first is that I want to encourage you to try this demo out yourself.
All you need in order to try it is a free Quix
account and an InfluxDB Cloud free tier; follow the URL here
and clone the project so you can get it up and running.
And like I said, you can run this example
yourself for free, or pick and choose the
components from it that you want to use. I also want to encourage you
to sign up at influxdata.com,
which is where you can get the free tier of InfluxDB. And I
also want to encourage you to visit our community, our Slack, and our forums
at community.influxdata.com to
ask any questions that you might have about this presentation,
IoT, machine learning, MQTT, etcetera.
And last but not least, I also want to point you to InfluxDB University.
InfluxDB University is a free resource
where you can get access to live and self-paced courses
on all things InfluxDB, and even earn badges for some of them, so you can
display those badges on your LinkedIn. Please also access our
documentation and self-service content, like blogs from
InfluxData, so you can learn more about all of this information
in detail, in the format that works best for you.
I also encourage you to join the HiveMQ community if you have questions
about HiveMQ, and the Quix community as well. We also have Quix engineers
on our InfluxData community, so if you're doing anything
with Quix, they're happy to help. If you have any questions, please forward them there.
I'd love to hear from you.