Transcript
Hi everybody, and welcome. Today we're going to be talking about building a plant monitoring
application with InfluxDB, Python, Flask, and edge
to cloud replication, which is an InfluxDB feature.
As you may know already, my name is Anais Dotis-Georgiou and I'm a developer advocate
at InfluxData, and I encourage you to connect with me on LinkedIn.
So if you have any questions about time series, InfluxDB,
Flask, Python, or building this Plant
Buddy application, which we're going to discuss today, I encourage you to reach
out to me there and ask any questions you have.
So before we begin, let's go over a quick agenda.
So first and foremost, I will be talking about the IoT hardware
setup that we used: all the devices that we used to
monitor our plants for our Plant Buddy, our plant
monitoring application. Next, I'll be going over the tools we used to build this
application. Then I'll be giving an overview of InfluxDB,
followed by a data ingestion setup overview.
And then I'm going to talk about Flux and SQL, which are
two languages that you can use to query InfluxDB. Then I'll
follow that with an understanding of how to set up edge data replication and
explain what edge data replication is. I guess a spoiler:
edge data replication is just the process of replicating data from an
OSS edge instance of InfluxDB to a cloud instance to
consolidate your data there. Then we'll talk about the
data requests for building the application. And last
but not least, I'll share the code base. And with
that, let's begin.
Let's begin by talking about the setup for our IoT devices. So this is a diagram of
roughly how our Plant Buddy application and system
works on the edge. As you can see, we store and manipulate some
of our data here in the open source build, and then
we send a downsampled version of that data to our
cloud instance. Downsampling is the process of taking raw, high-precision
time series data and creating lower-precision aggregates
of that data, sort of as a summary of your data,
because oftentimes we don't care about having that high precision,
especially over long periods of time. And it's
just a way of consolidating data from multiple edge devices to the
cloud.
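To make that concrete, here's a minimal sketch of what a downsampling task can look like in Flux, the query language we'll introduce later in this talk. The bucket names and the hourly window here are hypothetical, not the project's actual values:

```flux
// Aggregate raw sensor readings into hourly means and write them
// to a separate, lower-precision bucket. Bucket names are made up.
from(bucket: "plant_buddy")
    |> range(start: -24h)
    |> aggregateWindow(every: 1h, fn: mean)
    |> to(bucket: "plant_buddy_downsampled")
```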
So in order to successfully build this application, you need the following
things. First and foremost is a plant, preferably,
although maybe you could monitor your pet instead.
We used a Particle microcontroller board for
this, but you could use any other compatible microcontroller. You need at
least one IoT sensor for your plant and a
breadboard with jump wires and terminal strips.
So here's a look at the breadboard schematics.
These schematics are for hooking up our four sensors to our breadboard,
and this diagram is just to help break down which ports the microcontroller and
the sensors are connected to. Hopefully this will make it easier to
recreate the exact same setup, or a similar one, if you're not familiar
with microcontrollers. For our plant monitoring application,
we decided to go with four sensors.
The first sensor monitors temperature and humidity; the second sensor
monitors light; then soil moisture; and then soil temperature.
So you'll notice that those measurements
are all time series data, which is why we use InfluxDB
to store them: InfluxDB is a time series database,
so that's why it's a good use case for it.
So now let's talk about the tools that we use to build our application.
Front and center is Flask. Flask is a micro web
framework written in Python, and it's going to be doing the heavy lifting for
this project and help run our local application and routing.
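For a sense of what that involves, a minimal Flask app with a single route looks something like this; it's just an illustrative sketch, not the project's actual code:

```python
from flask import Flask

app = Flask(__name__)

# One route that will eventually serve our dashboard page.
@app.route("/")
def index():
    return "Hello, Plant Buddy!"

if __name__ == "__main__":
    app.run(debug=True)  # local development server
```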
We're also going to be using InfluxDB for storage.
InfluxDB is a time series data storage engine,
but it's also much more than that. It contains APIs
and various tools for working with real-time data
and applications, and it also has a massive community and ecosystem.
There are forums,
Slack channels, and Reddit, where you can go and get community support.
I will be on those channels helping people, as
well as other developer advocates and engineers. There is
a ton of support on GitHub as well, and all
of our products have open source offerings. So that's InfluxDB
in a nutshell, and it's an ideal solution
for storing our sensor data, because it is a time series database and
our data is time series data.
Then we will be using Telegraf, which is the other product that InfluxData, the creators of
InfluxDB, makes. Telegraf is an
open source collection agent that is plugin-driven, for metrics and events.
There are over 300 plugin options to choose from. So if you
have the task of ingesting data from a source
and sending it somewhere else, while taking advantage of buffering and caching
capabilities and other agent control capabilities,
with a lightweight agent that is open source, go check out Telegraf. It's a pretty
cool tool,
and it's very simple to configure. It's configurable in a single
TOML config file and downloadable as a single binary,
as is InfluxDB OSS.
Next we'll talk about our client library suite. We have
client libraries available in multiple languages,
and you can read and write data into InfluxDB with them.
So if you don't want to use Telegraf to ingest data, you can use a
client library. We'll be using the Python client library to query our
data and write our data to InfluxDB Cloud.
You can also use client libraries if there isn't a Telegraf
plugin, like I mentioned. And next I'll be showing a code example
for how to do this.
Also, I want you to be aware of the Visual Studio Code Flux extension.
Flux is a query language in InfluxDB OSS,
and it allows you to execute sophisticated data
analysis on your time series data, sophisticated queries, and also
create checks and alerts and a bunch of different
things. So this extension is particularly useful for
executing some of those queries, so that you don't have to go back and forth between
VS Code, where you're probably building your application, and the InfluxDB
UI to build your Flux queries. You can just
stay in VS Code, which is really helpful for developers.
And last but not least, we will be using Plotly. Plotly
is a Python graphing library that makes interactive, publication-quality
graphs. It's open source, free, and easy to use. And look at all
the beautiful graphs that you can create with Plotly.
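To give a sense of how little code that takes, here's a minimal sketch; it assumes a pandas DataFrame df with the _time and _value columns we'll get back from the InfluxDB client later on:

```python
import plotly.express as px

# df is assumed to have "_time" and "_value" columns,
# like the DataFrames the InfluxDB Python client returns.
fig = px.line(df, x="_time", y="_value", title="Soil moisture")
fig.show()  # opens an interactive chart in the browser
```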
So now I'll give an overview of InfluxDB. I've already talked some about
InfluxDB, but let's just make sure we're on the same page. In order to do
that, we need to establish some context, and that context is answering the
question: what exactly is time series data? Essentially, time series data
is any data that has a timestamp associated with it.
Examples include weather conditions, stock prices, customer monitoring,
even point-of-sale transactions, healthcare logs
and traces. And we usually think of time series data as existing in two categories.
Metrics are time series data gathered at a regular interval:
when we measure the temperature of our
house, for example, we'd consider that to be a metric, because we're
measuring it regularly. And then we have events, which are measurements
gathered at irregular time intervals, when
something happens or when something's triggered, for example.
So where does time series appear? Well, the short answer is that it appears in
almost any application, across a lot of different spaces. The first
is consumer and industrial IoT: manufacturing,
industrial platforms, renewable energy, and fleet management.
These all contain, provide, and create a lot
of time series data. So when we think of things like pressure,
concentration, flow rate, rotations per minute,
temperature, humidity, vibration, et cetera,
these are all sensors that might exist
in those spaces and data that's collected in those
categories. Then we have software infrastructure: you want
to be able to monitor your
API endpoints and also developer
tools. And DevOps monitoring is a
huge source of time series,
as well as your containers in Kubernetes.
And last but not least, we can think of real-time applications. Things like gaming
applications and fintech are really huge sources of time series data,
but also network monitoring.
So now let's talk about the emergence of the time series database category.
We came from relational databases, then
document databases like MongoDB came on the scene,
and more recently search databases.
But there was a unique need for
databases that can specifically handle time series. So what are
these other database categories missing that time series databases have
and address? The first one is that when we write,
think about, and deal with time series data, we're typically only concerned with
ingesting that data and the queryability of that data.
It's very rare, when you are collecting
really high-throughput time series data, that you are interested in
performing single-point deletes or updates.
So time series databases should create
design assumptions around that and essentially
make trade-offs in their design that prioritize really high
ingest and really high reads over
updates and deletes, which are things that those other
databases are better at performing. Additionally,
you need ways to interact with your time
series effectively. It's very helpful
to have a visualization component
to the database out of the box, and a UI for that, just because time
series data isn't really that well understood without graphs.
The second component is the ability to manage your time series lifecycle.
You might need to automatically expire old data
as soon as it becomes irrelevant, and also reduce
or downsample your data from raw, high-precision data to
lower-precision aggregates, so that when you view your historical data,
you can view it as a snapshot and see overall trends more effectively.
You also want tools that help you work with timestamps
more easily and work across time zones more easily.
So ideally you have a
database that also contains additional features and is
a whole platform that makes working with time series data easier,
and that's what InfluxData and InfluxDB aim to do. So this is the architecture
diagram for InfluxData, and
more specifically the OSS version. With the OSS
version, InfluxDB itself is a storage engine, but so much more:
it also has that visualization layer and a query and task engine.
Then we have Telegraf, which we've talked about, with over 300-plus
plugins; listed there are some of the input plugins that you
can use, for example.
And so the main goal of InfluxDB and InfluxData, just to re-summarize,
is that you can get data from a lot of different sources, pull
that data into InfluxDB with a variety of different tools,
and then use InfluxDB itself to not only transform
and downsample that data, but create triggers and alerts, and even use
Flux, which is the query and scripting language for InfluxDB,
to also collect that data. And then you want to be able to send that
data on for additional application workflows, to
gain infrastructure insights and even perform IoT
actions.
So now let's talk about our data ingestion setup. I'm not going to go into depth on
how to set up the microcontroller,
only because each one is unique and its setup structure varies,
so just follow the appropriate instructions. But for the sake of understanding
the code from here on out, I want you to be aware that mine is
running on a port on my computer because it's plugged in directly,
and this is an example of how the data comes in: when I run the
command particle serial monitor, it shows me the data
that is coming in. The sensor data is also highly varied,
so I'll be skipping the details of how to clean up that data and tag
it, which we need to do for our sensors. But all
this code is available on GitHub, so you can check it out there in more
detail.
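Just for context, reading those incoming serial lines in Python might look something like the sketch below, using pyserial. The port name, baud rate, and message format are all assumptions; the real parsing and tagging code lives in the repo:

```python
import serial

# Hypothetical port and baud rate; check your own device's settings.
ser = serial.Serial("/dev/ttyUSB0", baudrate=9600)

while True:
    # Each line arrives as raw bytes, e.g. b"soil_moisture:620"
    raw = ser.readline().decode("utf-8").strip()
    sensor_name, value = raw.split(":")
    print(sensor_name, float(value))
```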
So when you have your open source InfluxDB
installed and running, even on localhost, you will see a UI like this
where you can set up your bucket and token. You can
also do this via the CLI, but I find that using the UI is easier,
so I wanted to show that for this demo. This video goes over
how to create a bucket; that's where you're going to store your data.
And in this section you have the ability to set your retention
policy as well. A retention policy
just describes the amount of time that you want to actually
retain that data, and when you want to automatically expire it.
This video also shows you how to set up your API
token, which we're doing right now. We normally
suggest that you just use an all-access token,
but be careful with it. For development,
it can be useful at first when you're getting started, but you can also set
up a specific read and write token scoped
to just your bucket, to protect your data and make sure that none
of your tokens are the same.
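If you'd rather do the same thing from the CLI, the equivalent steps look roughly like this; the bucket name, retention period, and description are just examples:

```bash
# Create a bucket with a 30-day retention period.
influx bucket create --name plant_buddy --retention 30d

# Create a token scoped to reading and writing only that bucket.
influx auth create \
  --read-bucket <BUCKET_ID> \
  --write-bucket <BUCKET_ID> \
  --description "plant buddy read/write"
```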
So we've already seen how to set up a bucket and token in the UI,
but at this point in the code, I have set up my own bucket on
my cloud account, and I put in the appropriate credentials
and tokens to receive data into Influx.
Here we're using the InfluxDB Python client library,
which allows you to write a few lines of code to begin streaming data into
InfluxDB. The point here is that a data point is being
added to the database, and all of these values change
based on the device and its readings. And we'll add
tags to the point that we're writing to InfluxDB
to help us differentiate between temperature and
humidity, or we could use a tag to differentiate between multiple users
or multiple plants that we were monitoring.
So that's pretty much it.
We create a point with the Point method, we're
encapsulating this in a write-to-Influx function,
and we are using the
write method to actually write this point to our
bucket within our organization.
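As a sketch of what that looks like with the Python client library (the URL, token, org, and the exact tag and field names here are placeholders rather than the project's exact values):

```python
from influxdb_client import InfluxDBClient, Point
from influxdb_client.client.write_api import SYNCHRONOUS

client = InfluxDBClient(url="http://localhost:8086",
                        token="<YOUR_TOKEN>", org="<YOUR_ORG>")
write_api = client.write_api(write_options=SYNCHRONOUS)

def write_to_influx(device_id, sensor_name, value):
    # Build a point, tagging it so we can differentiate devices,
    # then write it to our bucket within our organization.
    point = (
        Point("sensor_data")
        .tag("device", device_id)
        .field(sensor_name, value)
    )
    write_api.write(bucket="plant_buddy", record=point)
```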
And so this is what writing data to InfluxDB with Telegraf looks like; this is what the TOML configuration
file looks like. So this is a Telegraf config file.
The whole thing is quite large, so I'm not going to go in depth on
the entire thing,
but this is a small part of the configuration file, and it's the
part that you're actually interested in.
You can either use Telegraf or the client libraries to ingest
data into your OSS instance. In the project example,
we are using Telegraf, but we also have
the client library code available for those who prefer to use
that instead. Each Telegraf plugin has its own documentation,
but most are very straightforward to install and set up,
and you just run Telegraf with a single command.
To run this file, you would simply say telegraf
--config and then specify the path to the file.
And here we're using the execd plugin,
and we are essentially passing in a command that says we want to execute this
Python 3 script, with the path to that
script and the serial port to use,
to generate
or collect our time series data with Python
and then take advantage of the agent to write it.
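Here's roughly what that part of the config looks like; the script path, serial port, and credentials are placeholders, not the project's exact values:

```toml
# Run our Python collection script as a long-lived input process.
[[inputs.execd]]
  command = ["python3", "/path/to/sensors.py", "/dev/ttyUSB0"]
  signal = "none"
  data_format = "influx"

# Write the collected points to InfluxDB.
[[outputs.influxdb_v2]]
  urls = ["http://localhost:8086"]
  token = "$INFLUX_TOKEN"
  organization = "<YOUR_ORG>"
  bucket = "plant_buddy"
```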
So this is an example of how our data
appears in table format once we've written our data
to InfluxDB and actually queried it. We have one measurement, which is
sensor data; fields, which include light, soil moisture,
temperature, and humidity; and we
also have the corresponding value for each field, as well as
a timestamp associated with it.
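In line protocol terms, one of those table rows corresponds to a point that looks something like this; the tag name and values are made up for illustration:

```
sensor_data,device=1 light=28.7,soil_moisture=620,air_temperature=22.4,humidity=41.2 1677610710000000000
```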
So it's also worth noting that InfluxData
is making a big push to support SQL. Flux
is still supported in OSS, but in InfluxDB
Cloud, Flux is being replaced with SQL. And the main reason why this is the
case is that we find most users don't want to
take on the burden of learning a new language
that is proprietary to a single piece of
technology they use; they're more comfortable with SQL. So that's why we're
working to provide users of the cloud instances with SQL.
However, we recognize that existing users are still taking advantage of Flux.
So if you are an OSS user, you can still
continue to use Flux, but we will be using SQL to query
data from our InfluxDB Cloud account.
But I do want to just quickly introduce Flux, for those
of you who are confused by that: querying a database with
SQL probably makes a lot of sense to a lot of people, but maybe not
with Flux. Flux is a data scripting language that comes
embedded with InfluxDB OSS, and it allows you to build data pipelines
to query, analyze, and transform your data. So that's
a quick example of what Flux looks like. It's kind of JavaScript-esque
in its syntax, but functionally it operates more
like pandas, where the output of one
line gets pipe-forwarded into the next
function, and each function progressively changes or
provides some sort of analysis or transformation on your
data.
So this is our most basic Flux query that we use to retrieve our data
out of InfluxDB. Specifically,
the device ID and field are both variables that we can change.
For example, the device ID could be one and the field could
be air temperature, and by having those values be variables,
we can call the same Flux query for all of our graphs. Our
range is currently set to the past 24 hours;
that's how much data we're
going to be displaying on the graphs, but we could change that as well.
And similarly, our bucket
is also something that we can change.
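As a sketch of how that parameterization can look from the Python side (the measurement and tag names here are assumptions based on the schema we described earlier), note the |> pipe-forward operator from the previous slide:

```python
def build_query(bucket, device_id, field):
    # The same Flux pipeline serves every graph; only the
    # bucket, device ID, and field variables change.
    return f'''
from(bucket: "{bucket}")
    |> range(start: -24h)
    |> filter(fn: (r) => r._measurement == "sensor_data")
    |> filter(fn: (r) => r.device == "{device_id}")
    |> filter(fn: (r) => r._field == "{field}")
'''
```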
So let's talk a little bit more about the change that is happening
from Flux to SQL, and about the future of InfluxDB
Cloud and the future of open source. We recently
launched InfluxDB Cloud powered by our IOx storage engine,
and it allows storage in Parquet file format with unlimited cardinality.
So if you choose to use the edge-to-cloud replication version of this project,
you will most likely connect with InfluxDB Cloud's new SQL version.
So the new cloud version also supports SQL.
And if you choose to stay completely in the open source version, then
you'll probably be using Flux. And the plan is eventually to
roll IOx and SQL capabilities out
to our open source offerings as well.
But we can use Flight SQL plugins, presently and
in the future, with InfluxDB Cloud powered by IOx to take advantage of
Apache Superset, Tableau, Power BI, and Grafana as well.
Another big reason for the move to change the storage engine in
InfluxDB Cloud was to offer more interoperability, and that's because it's
largely built on the Apache ecosystem, on things like DataFusion,
Parquet, and Arrow, which is
all really exciting.
But now let's talk about edge data replication.
Edge data replication is the process
of replicating data from an edge instance of OSS to
cloud using the edge data replication tool. So what are the
advantages of edge data replication? Well, the first is that you reduce the bandwidth
cost of sending high-fidelity data to the cloud, and it also
has network resilience for intermittent failures in connectivity to the cloud.
So to summarize, using a hybrid solution for our application provides
the flexibility to move mundane tasks, such as downsampling, to the edge.
And you can do that downsampling as needed for each type of device
that you are gathering data from. And this provides
more scope for more interesting analysis and data storage to occur in the cloud.
So that's why we're looking at this hybrid solution and using both
the edge and the cloud instances of InfluxDB.
So tangibly, the feature of edge data replication consists of two new
API endpoints, the remotes and replications endpoints, and two new CLI
commands, the remote and replication commands.
And each replicated bucket also gets a disk-backed
queue for buffering data safely
in case of any disruptions that might occur.
So now we have our setup instructions here,
which can be found in our GitHub README. As you can see, here we
have a command to set up our edge device,
which for this project is the open source localhost instance I'm running.
So just follow these commands and you can get it started for yourself.
And these are the two commands that you have to run
to have your edge connected to your cloud instance. So create
your cloud bucket following the exact same steps as for your open source one, just in
the cloud.
the cloud. And we have full documentation for edge
data replication or EDR replication as well, that you can check out. That goes
into more detail about the configuration setup. But basically it's
two steps. You first create a remote connection and then you create a replication
rule between localhost and cloud. So you describe basically
what you want and how you want data to be replicated to your
cloud instance. So now we're ready for data
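Those two steps map onto two CLI commands, roughly like this; the names, URL, IDs, and token below are all placeholders:

```bash
# Step 1: register the cloud instance as a remote.
influx remote create \
  --name edr-example \
  --remote-url https://us-east-1-1.aws.cloud2.influxdata.com \
  --remote-api-token <CLOUD_TOKEN> \
  --remote-org-id <CLOUD_ORG_ID>

# Step 2: create a replication rule from a local bucket to a cloud bucket.
influx replication create \
  --name edr-example \
  --remote-id <REMOTE_ID> \
  --local-bucket-id <LOCAL_BUCKET_ID> \
  --remote-bucket-id <CLOUD_BUCKET_ID>
```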
So now we're ready for data requests and visualization. In this step we're calling
the previous Flux query and filling in the variables with our
selections, including bucket, sensor, and device. And we return the
result, which allows us to graph our incoming data.
And we use the query_data_frame method
from the Python client library, which pulls our data back in a
DataFrame format that I find easier to work with, since a lot of Python libraries,
especially for visualization, work well with DataFrames. There are a few different data
output options you can choose from if you prefer a different style;
this is just my preferred way of working with the Python client
library.
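A sketch of that query step, reusing the hypothetical build_query helper and placeholder credentials from earlier:

```python
from influxdb_client import InfluxDBClient

client = InfluxDBClient(url="http://localhost:8086",
                        token="<YOUR_TOKEN>", org="<YOUR_ORG>")
query_api = client.query_api()

# query_data_frame returns a pandas DataFrame,
# which is easy to hand straight to Plotly.
df = query_api.query_data_frame(
    build_query("plant_buddy", "1", "air_temperature"))
```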
So this part of the demo is currently under construction: querying with SQL. We are working on redoing
this project for the SQL support, as well as updating the documentation
to go along with the project. That should be done by the end of this
talk, so I encourage you to go check it out.
Basically, all you do is use
Arrow Flight SQL instead, and it's just a couple of
lines there. Then you can query directly with SQL
and return a DataFrame as well.
So the process is almost identical; it's just a couple
of lines different. But yeah, go check out the GitHub
repo, because that's all up to date there.
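For a rough idea of what those couple of lines can look like with the flightsql-dbapi package (the host, token, bucket, and query here are placeholders; the repo has the current version):

```python
from flightsql import FlightSQLClient

client = FlightSQLClient(
    host="us-east-1-1.aws.cloud2.influxdata.com",
    token="<CLOUD_TOKEN>",
    metadata={"bucket-name": "plant_buddy"},
)

# Execute the SQL query, fetch the Arrow stream, and convert to pandas.
info = client.execute(
    "SELECT * FROM sensor_data WHERE time > now() - interval '24 hours'")
reader = client.do_get(info.endpoints[0].ticket)
df = reader.read_all().to_pandas()
```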
And this is the end result of querying for
the data points, and we can now graph them. As you can see here,
this is an example of the hard-coded graphs, but in the demo
I'll also show the selectable graphs.
So this is what our Plant Buddy dashboard
looks like. Basically here we have one graph where we
are looking at the light, but we can also see the soil and room temperature
and the humidity and soil moisture. So here's what
the soil and room temperature look like, and here's what the room humidity
and soil moisture look like.
So now let's talk about some further resources, so you can run this yourself and
get familiar with everything that we talked about today. To try
it for yourself, follow the following links. Like I said,
it will be updated with the SQL example as well, so it should already
be updated. Honestly, I encourage you to go take a look at it
and try both the purely OSS version and OSS
to cloud. I should mention there's also a free tier
cloud version of InfluxDB, so you don't have
to pay for anything to try InfluxDB Cloud
powered by IOx. And last
but not least, I encourage you to please, please join us on
either our Slack or our Discourse forums at
community.influxdata.com, and to participate in any
conversations around InfluxDB, IoT, or InfluxDB Cloud
powered by IOx. Specifically, join the influxdb_iot
channel. So again, get started yourself.
You can visit our website, and our InfluxDB community
organization contains a bunch of examples from the
developer advocates at InfluxData, and also from community members,
on different projects using InfluxDB.
So that's just a good place to get inspired as well, if you're just wanting
to check out InfluxDB. And here are some further resources
as well. I've mentioned a lot of these, and the last one worth knowing about
is InfluxDB University, there at the bottom, where you can get free
instructional courses and earn badges on all things Influx.
Thank you so much.