Transcript
This transcript was autogenerated. To make changes, submit a PR.
So welcome everyone. Today we'll be talking about how we can leverage Apache Arrow Flight, Python, and InfluxDB. We'll also talk about how we can leverage the InfluxDB v3 Python client and any Arrow Flight SQL client as well. So we'll go over all
three. So just a little bit about me for those of you who
don't know what a developer advocate is, a developer advocate is someone who represents
the community to the company and the company to the community. And I do that
through webinars like this, and also by creating POCs or
demos that showcase code examples of how to use a variety
of technologies together so that you can understand how to
leverage something like InfluxDB. Yeah, and I work for InfluxData, and InfluxData is the creator of InfluxDB.
So if you enjoy any of this content today, or you want to learn more
about Apache Arrow, about the Apache ecosystem in general,
about InfluxDB, or perhaps have any questions about time series or data science in the time series space, I encourage you to reach out on LinkedIn.
Please connect with me there. I'd love to learn about what you're doing
and any questions that you have. But for today's agenda,
we're going to start with a quick introduction to InfluxDB and time series, because InfluxDB is a time series database and you can't understand what InfluxDB is without understanding what time series data is. Then I'm going to talk about InfluxDB's commitment to open data architecture, along with the FDAP stack: the Arrow Flight, DataFusion, Apache Arrow, and Parquet data stack. Then I'll talk about leveraging the Arrow Flight client and the InfluxDB v3 Python client. And then last but not least, I'll share some projects that leverage all of these tools. So let's get right into it.
Let's talk about the introduction to InfluxDB and time series data.
So time series data is any data
that really has a timestamp associated with it. So the earliest or simplest example is probably stock market data. And the unique thing about time series data is that a single value of it is not really that interesting. A single stock price, for example, doesn't have a lot of meaning. But what's really important is the
trend of the stock over time. And that's what's really interesting, because that
tells us whether or not we should be buying or selling based on a particular
trend. And so this is true for any time series data: the sequence of data points, usually consisting of successive measurements from a data source over time, is what really defines it. And this is true whether we're looking at stock data, industrial IoT data, or application monitoring data. And so that's time series data.
And when we think of time series data, we really think of it existing within
two categories. So, the first is metrics. And metrics are time
series data that are predictable, and they occur at the same time interval.
So, if we were monitoring, let's say, a machine on
some plant floor, maybe we're
looking at the vibration reading per second from an accelerometer, let's say, and we're polling that at a regular interval.
So that would be a metric. But you can have event data too,
as a type of time series data. And events are unpredictable,
so we can't derive when an event will occur, but we
can still store events, and we can also,
through aggregation, convert any event data into a
metric. So, for example, we could count the number of events
that occur per day. Maybe an event in this machine
use case is when we have some sort of machine fault,
and we can record how many faults happen per day. And then this
way, we're converting our event data into a metric
where we're getting a count on a regular
basis. So what is a time series
database? Well, a time series database has four components, or really pillars, that comprise it. The first is that it must be able to handle time series data, perhaps pretty obviously. So, every data point in a time series database
is associated with a timestamp. And you should be able to query
this data in a time ordered fashion, because, as we mentioned,
time series data is really meaningless unless it's in that time ordered fashion.
Additionally, it has really high write throughput. So, in most cases, we're talking about really high volumes of batch data or real-time streams
from multiple endpoints. So we could think of like thousands
of sensors that are writing pressure,
concentration, temperature,
light, et cetera. Or we might even be thinking about just one sensor, or a few sensors, that have really high throughput, like an industrial vibration sensor. That type of sensor is typically writing around 10,000 points per second.
Another pillar of a time series database is that we need to be able to handle really efficient queries over time ranges, and we also need to be able to perform aggregations over time, things like averages, sum, min, and max, with a
variety of different granularities, whether that's seconds, minutes,
hours, days, or nanoseconds. And then last but
not least, your time series database should be both scalable
and performant. So you need to be able to design something that scales horizontally across distributed clusters of machines in order to handle the increased load, and also to handle the really high write throughput and query requirements of time series use cases, which are often really high dimensionality, high volume use cases.
So let's talk about InfluxDB specifically. So InfluxDB 3, and InfluxDB in general, is a time series database platform, and it is built on Apache Arrow, Arrow Flight, DataFusion, and Parquet. And I'll talk in detail about what all of these are. But one thing that is at our core is that we believe in open architecture and an open data format, and we believe that these technologies really enable this.
But before I go into those technologies, I just wanted to also highlight some of
our customers, just to give some context around how InfluxDB is used.
So we're primarily used in IoT monitoring, but also
in application monitoring and software monitoring.
So for some examples, Wayfair uses us for application monitoring.
Tesla uses us to monitor all of their batteries, their wall batteries. BBOXX, which is not a logo that's up here but a company that I think is really cool, develops and manufactures products to provide affordable, clean solar energy to off-grid communities in developing countries, and we help them monitor all their solar panels. We have companies
that are doing indoor agriculture that are using us,
community members that are monitoring endangered birds, and also community members that are monitoring their barbecue at home. So there are so many different use cases.
But the one thread is that it all involves time series data.
And just to highlight the high throughput requirements that time series requires and that InfluxDB can provide: in the latest benchmark for our latest version, InfluxDB v3, which is built upon DataFusion, Apache Arrow, Arrow Flight, and Parquet, with a dimensionality of 160,000 we are able to ingest around 329 million rows per hour, or 4 million values per second. So that's really what we're looking at when we're thinking about the high volume use cases that time series databases like InfluxDB can support.
So let's talk about this commitment to open data architecture with Apache Arrow, Apache Arrow Flight, DataFusion, and Parquet. So what we mean by this is just the ability
to have good interoperability with a bunch of other tools
and really easily transport data to and from other sources so
that people have the ability to develop their architecture
with the tools that are most aligned with the project and the problem that they're trying to solve. And the
way that this stack enables InfluxDB to do this is that, essentially, Apache Arrow is an in-memory columnar format for defining data. Apache Arrow Flight is a way to transport large datasets like Arrow over a network interface. For Parquet, InfluxDB uses Parquet as the durable file format on disk; it's also columnar. And DataFusion is the query execution framework that allows developers to query InfluxDB v3 in both SQL and InfluxQL. InfluxQL happens to be a SQL-like language that's specific to InfluxDB. But essentially, what this stack allows
developers to do is ingest, store and analyze all types of time series
data, handle this at a really high speed and high volume,
and then also have increased interoperability with a bunch
of tools, and be part of the Apache ecosystem.
That means that when you contribute to these upstream projects, so many other tools benefit from it. And any other tools that are also leveraging these technologies means that you have a standard for basically transporting these datasets to and from all of these different tools.
So we think of things like Dremio, which leverages Apache Arrow and Apache Arrow Flight. A whole bunch of machine learning tools leverage things like Parquet; I know H2O does, I know Google BigQuery does, and I believe Cassandra does too. So there are so many different tools that you can use. If you can extract Parquet files from InfluxDB, then you can quickly leverage them in other use cases and other tools; you just have access to them, and then you can build the architecture as you need.
The other thing that these tools enable for InfluxDB specifically, for example, is that they allow you to have schema on write. Schema on write means you don't have to define your schema beforehand, so you can modify the schema as you go. Like we said, it allows you to query and write millions of rows per second, since we're on the bleeding edge of throughput through the use of this columnar store. And also, InfluxDB is a database purpose-built for handling time series data at a massive scale. And yeah, that's pretty much that.
So what I wanted to do is take a step back and kind of highlight why leveraging Arrow, Apache Arrow specifically, is so important, because it is the way that we are defining our data in memory in this columnar format. The reason why columnar data works so well specifically for InfluxDB and helps us achieve this throughput is this: imagine that we had this data that we were writing to InfluxDB. This just happens to be the ingest format for InfluxDB, line protocol. Well, we might have different measurements, which are essentially tables, and we want to write different fields with a timestamp and some metadata. So this is maybe what a table would look like once we wrote this data.
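Just to give a rough idea of that ingest format, here's a minimal sketch of line protocol; the measurement, tag, and field names are made up for illustration, and the last point shows how schema on write lets a new field appear without any upfront schema change:

```python
# Hypothetical line protocol points: measurement "machine", one tag "machine_id",
# numeric fields, and a nanosecond-precision timestamp.
points = [
    "machine,machine_id=m1 vibration=0.42,temperature=68.1 1700000000000000000",
    "machine,machine_id=m1 vibration=0.44,temperature=68.0 1700000001000000000",
    # Schema on write: a later point can simply introduce a new field.
    "machine,machine_id=m1 vibration=0.47,temperature=68.2,pressure=101.3 1700000002000000000",
]
```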
And you can imagine that if we wanted to query this
data for like a max value for our field one,
let's say if we store our data in a row format,
we would run into some problems, which is that
we would have to iterate through every single row,
essentially, and every single column
in order to try and find the max value associated
with one column. But if you store things in
a columnar fashion instead,
then it looks like this. In other words, the data would be stored in a format like the block shown here. And the other thing to
note too is that neighboring values are the same data
type and oftentimes also the same value
itself. This is especially true for time series data when you
are monitoring things like your environment, where you're
maybe measuring the temperature at a minute interval.
Oftentimes the temperature in a specific environment stays the same for periods
of time. So what does this mean? This provides opportunity for really cheap
compression, which enables these really high cardinality,
high volume use cases that we're talking about. This also enables faster scan rates by using SIMD instructions. So depending on how your data is stored, you may only have to look at one column of data to find the max value of a particular field. Contrast that with row-oriented storage, where you have to look at every field, every tag, and every timestamp in order to find the max field value for that one column. So that's just a little sidebar of why columnar storage specifically suits something like InfluxDB and why Arrow is really purpose-built for InfluxDB.
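To make that concrete, here's a minimal sketch using PyArrow; the table and values are hypothetical, but it shows how an aggregate like max only has to touch the one column it cares about when the data is laid out column by column:

```python
import pyarrow as pa
import pyarrow.compute as pc

# Hypothetical columnar table: each field is stored as its own contiguous array.
table = pa.table({
    "time": [1, 2, 3, 4],
    "machine_id": ["m1", "m1", "m2", "m2"],
    "field1": [0.42, 0.44, 0.40, 0.51],
})

# Finding the max of field1 only scans that single column,
# not every row and every other column.
print(pc.max(table["field1"]))
```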
So now let's talk about actually leveraging the Arrow Flight client with InfluxDB, and also the Python client library. So this is what the Arrow Flight client looks like, the Python Arrow Flight client library. To install it, you do something like pip install pyarrow, and you can query a database like InfluxDB v3 or any other database that leverages Arrow Flight. And what you would do is you'd first create a flight client by passing in the URL and instantiating it, after you've imported the library. And then you write a JSON that has the necessary information for running the query on a specific platform, including the namespace name, the SQL query that you want to use, and the query type. So notice here how we are specifying the SQL query type. So if you're querying InfluxDB exclusively, you could easily switch this between SQL and InfluxQL, for example, because you can query InfluxDB with both. And so this makes Arrow Flight a more convenient tool for querying InfluxDB specifically, or any other database, just because you have that option to specify the query type. And this boilerplate query is basically database agnostic.
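As a rough sketch of that flow with pyarrow: the host, token, database name, and query below are placeholders, and the exact spellings of the JSON keys are an assumption based on the namespace name, SQL query, and query type fields described above.

```python
import json
from pyarrow import flight

# Connect to the Flight endpoint (placeholder host).
client = flight.FlightClient("grpc+tls://your-influxdb-host:443")

# Authentication header (placeholder token).
options = flight.FlightCallOptions(
    headers=[(b"authorization", b"Bearer YOUR_TOKEN")]
)

# JSON ticket with the namespace name, the SQL query, and the query type.
ticket_payload = json.dumps({
    "namespace_name": "my-database",
    "sql_query": "SELECT * FROM machine WHERE time > now() - interval '1 hour'",
    "query_type": "sql",
}).encode("utf-8")

# Request the Arrow data stream and read it into a PyArrow table.
reader = client.do_get(flight.Ticket(ticket_payload), options)
table = reader.read_all()
print(table)
```

Switching between SQL and InfluxQL is then just a matter of changing that one query type field.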
And this is what it would look like to use the Arrow Flight SQL client. So the Arrow Flight SQL client basically just wraps the Arrow Flight client, and the protocol here is pretty similar. We just instantiate a Flight SQL client that's configured for a particular database. Then we execute a query to retrieve the flight info. We extract a ticket for retrieving the data, use that ticket to request an Arrow data stream, and in this example, we are returning the flight stream reader for the streaming results. We're reading all the data into a PyArrow table and printing that table.
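Here's a minimal sketch of those steps, assuming the flightsql-dbapi Python package; the host, token, and database metadata key are placeholders, so check your platform's connection details:

```python
from flightsql import FlightSQLClient

# Instantiate a Flight SQL client configured for a particular database
# (placeholder host, token, and metadata).
client = FlightSQLClient(
    host="your-influxdb-host",
    token="YOUR_TOKEN",
    metadata={"database": "my-database"},
)

# Execute the query to retrieve the flight info, then use its ticket
# to request the Arrow data stream.
info = client.execute("SELECT * FROM machine WHERE time > now() - interval '1 hour'")
reader = client.do_get(info.endpoints[0].ticket)

# Read all the streamed data into a PyArrow table and print it.
table = reader.read_all()
print(table)
```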
Another important thing to note, though, about the Flight SQL client specifically is that it returns a stream of data, and the way that stream of data is returned differs a good amount between different languages. So that's just something to keep in mind. It can be a little bit harder to use and a little bit less flexible, especially if you want to query a variety of different databases and you want to use kind of the same boilerplate in your Python script to query any data stores that leverage Arrow Flight.
And then this is what it looks like to use the InfluxDB v3 Python client library. So this basically just wraps the Arrow Flight client. We would first import our library, like we would do before. Then we initialize our client for InfluxDB. That includes providing the ID, the URL that InfluxDB is being run on, and the database name that we are querying data from. We initialize that client, we provide a SQL query, and then we actually return our query results. And in the client's query method, you can specify whether you want to use SQL or InfluxQL. You can also specify the mode that you want your data returned in; it supports both Polars and pandas, so you can return a Polars or pandas DataFrame directly, and that just increases interoperability with a bunch of Python libraries that you can use for things like anomaly detection and forecasting.
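Here's a minimal sketch using the influxdb_client_3 package; the host, token, org, database, and query are placeholders:

```python
from influxdb_client_3 import InfluxDBClient3

# Initialize the client (placeholder credentials and database name).
client = InfluxDBClient3(
    host="your-influxdb-host",
    token="YOUR_TOKEN",
    org="my-org",
    database="my-database",
)

query = "SELECT * FROM machine WHERE time > now() - interval '1 hour'"

# language can be "sql" or "influxql"; mode controls the return type,
# here a pandas DataFrame.
df = client.query(query=query, language="sql", mode="pandas")
print(df.head())
```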
So before I go, I want to talk about some projects that leverage the Arrow Flight client. So the first one involves using Grafana and InfluxDB. One thing we did was contribute this plugin to Grafana, the Flight SQL plugin. And so you can use the Flight SQL connector to connect any data store that leverages Arrow Flight to Grafana. So this is kind of what we mean by our commitment to open data architecture: by contributing this plugin, we provide this benefit to a number of different open source tools that leverage Arrow Flight.
And I want to encourage you, too, to visit the InfluxCommunity organization on GitHub. That organization just has a bunch of different POCs and demos for how to use InfluxDB with a variety of different tools. One repo in particular is the Grafana Quickstart, which shows you how to use that Flight SQL plugin, but also how to build dashboards and query InfluxDB using SQL, and also how to leverage the InfluxDB v3 Grafana plugin too, which is specific to just InfluxDB.
Another fun project is one that uses InfluxDB and Mage. So Mage is an open source data pipelining tool for transforming and integrating data. In essence, you can think of it as an open source alternative to Apache Airflow. It contains a UI that simplifies the ETL creation process, and it has documentation on how to deploy these pipelines on AWS, Azure, DigitalOcean, and GCP. I think they provide both Terraform and Helm charts, so you can leverage
those. And one specific demo that we have in the InfluxCommunity organization is this one on anomaly detection, where we have some machines generating different data, and we use half-space trees to identify anomalous behavior in the machine data. It actually allows you to use a little interface to generate anomalies in real time and also get alerts on those anomalies. So I encourage you to try it yourself here.
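The demo's own code lives in that repo, but as a rough sketch of the idea, half-space trees are available as an online anomaly detector in the river Python library; the feature name, values, and parameters here are made up, and the inputs are assumed to be scaled to the 0-1 range that HalfSpaceTrees expects by default:

```python
from river import anomaly

# Online half-space trees detector (hypothetical parameters).
detector = anomaly.HalfSpaceTrees(n_trees=10, height=3, window_size=50, seed=42)

# Hypothetical vibration readings, scaled to [0, 1]; the last one is a spike.
readings = [0.41, 0.43, 0.42, 0.44, 0.95]

for value in readings:
    features = {"vibration": value}
    # Score first, then learn, so the detector isn't scoring a point
    # it has already seen.
    score = detector.score_one(features)
    detector.learn_one(features)
    print(f"value={value:.2f} anomaly_score={score:.3f}")
```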
Another really cool solution or project is one that uses Quix. So Quix is similar to Mage in that it's also a solution for building, deploying, and monitoring event streaming applications. But the cool thing is that it uses Kafka under the hood and allows you to control all the elements of your pipeline with just Python. It's specifically designed for processing time series data, and it comes in both cloud and on-prem offerings. And it also offers a UI that simplifies the ETL pipeline building to make that much simpler.
But the really cool thing about it is that it is specifically built to handle event streaming and leverages Kafka, so that you don't have to be a domain expert in how to leverage it. And this particular demo also leverages HiveMQ, which is an MQTT broker. So we get data from a lot of different machines using MQTT, pass it into HiveMQ, that broker, then centralize all that data in Quix and perform anomaly detection there. Quix also has a built-in integration with Hugging Face, which is basically sort of a GitHub on steroids for data scientists, where they can publish any of their algorithms or anomaly detection and forecasting tools. And so yeah,
Quix does all of the anomaly detection after pulling data out of InfluxDB, where all of the MQTT data is written directly, performs this anomaly detection, and then also creates alerts. And in this example, we use an autoencoder instead, which is a type of unsupervised neural network.
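The demo's exact model isn't shown here, but as a minimal sketch of the unsupervised idea, assuming Keras and made-up window sizes and thresholds: an autoencoder is trained only to reconstruct normal data, and points it reconstructs poorly are flagged as anomalies.

```python
import numpy as np
from tensorflow import keras

# Hypothetical "normal" sensor windows, already scaled to [0, 1].
normal_windows = np.random.rand(1000, 30)

# A small autoencoder: compress each 30-value window down to 4 values
# and learn to reconstruct it.
model = keras.Sequential([
    keras.layers.Input(shape=(30,)),
    keras.layers.Dense(16, activation="relu"),
    keras.layers.Dense(4, activation="relu"),
    keras.layers.Dense(16, activation="relu"),
    keras.layers.Dense(30, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="mse")
model.fit(normal_windows, normal_windows, epochs=10, verbose=0)

# Unsupervised anomaly detection: windows with high reconstruction error
# are flagged as anomalous.
new_windows = np.random.rand(5, 30)
errors = np.mean((model.predict(new_windows, verbose=0) - new_windows) ** 2, axis=1)
threshold = 0.05  # hypothetical threshold tuned on normal data
print(errors > threshold)
```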
And yeah, this is essentially what the architecture for this project looks like. We have all of these machines, these robot arms in this example, that are generating data, pushing that data to HiveMQ with MQTT, and then we use MQTT clients to write the data to InfluxDB. We take advantage of Quix to actually perform the querying and apply the machine learning model, which is stored in Hugging Face. And then we use Grafana to visualize all the data. So the cool thing about this solution architecture is that while we're only performing this on three types of machines, it could easily be scaled out for a real-world use case. And so yeah, in general,
I recommend that you check out this demo, but also a variety of other demos that exist at InfluxCommunity, if you want to learn more about how we're leveraging Apache Arrow, DataFusion, Parquet, and Arrow Flight to increase interoperability with other tools and provide solutions like some of the projects that I mentioned here today. Last but not least, I want to encourage you to join the InfluxDB community. You can get started with InfluxDB by visiting influxdata.com. You can also check out our documentation. InfluxDB University is a resource that offers free and live training on a variety of different topics, including some of the projects that I mentioned today. And also, please join our community. You can join our community Slack at influxcommunity.slack.com, or our Discourse forums as well. So it's entirely up to you what you want to use. Thank you so much.