Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hello, everybody, and welcome. Today we're going to be talking about time series databases:
should I use one in my application architecture?
So, my name is Anais Dotis-Georgiou, and I'm a developer
advocate at InfluxData, and I want to encourage you to connect with me
on LinkedIn in case you have any questions about time series, InfluxDB,
or anything else that you want to talk about today.
So for today's agenda, first we're going to be talking about
who is InfluxData? What is InfluxDB? Then I'm going to be
going into what is time series? Next, I'll cover what is a time series
database? And how do I know if a time series database
is right for my application? All right,
let's dive in. So, first I wanted to give a
brief introduction to InfluxData, which is the creator
of InfluxDB. So we were founded in 2013
by Paul Dix, who is our CTO and founder.
We focus on developers who build real-time applications,
and we are most widely known for our time series data platform,
InfluxDB. But we also have an incredibly popular tool called Telegraf,
which is an open source collection agent for metrics and events.
InfluxDB is where developers can build and scale applications with time
as the foundational component. And InfluxDB is
one platform with one API across products and environments.
The platform is widely adopted by open source developers and paying customers alike,
with over 550,000 unique deployments and over
1,300 customers. And some of those customers include Google,
Cisco, SAP, Tesla, Disney,
and a bunch of others. So even if you
haven't heard of us before, chances are you've probably used products that
actually use InfluxDB. So the first example that I want
to talk about is Tesla, specifically Tesla Powerwalls
for the home. So Tesla pulls time series data from
connected Powerwalls in solar-powered user homes, and they
monitor the health, availability, and performance of those solar panels and
battery setups with InfluxDB. They collect all of that data
at the edge into InfluxDB,
which runs their backend system. The second
example that I want to talk about today is Nest. So Nest is a smart
thermostat for the home. Nest monitors the infrastructure that
powers the IoT data collection system-wide for all Nest devices.
This includes their use of Kubernetes and other software infrastructure that is
run by Nest and used to collect, manage, transform, analyze,
and aggregate device data. Next is Disney+. So as we
know, Disney+ is an entertainment streaming service and delivery application.
And Disney+ uses a
global content delivery network to distribute its video series,
movies, and shorts to its users worldwide, and they monitor the
movement and performance of their video content throughout
this global CDN using InfluxDB.
There's also Rappi. Rappi is an on-demand food and goods delivery
application, and Rappi uses InfluxDB to monitor,
react to, and adjust for the fluctuations in the on-demand pricing of
their driver and rider delivery network. They collect time series
data from their mobile delivery network, all of their mobile delivery
riders, and then send it to InfluxDB and into the
cloud-based application they've built on it to
run the Rappi service. And this
is an awesome service that exists in Latin America, and I
hope one day they also expand to North America.
Okay, so now that we understand roughly what InfluxData
is and what they create, InfluxDB and also Telegraf,
and we understand some of the applications and some of the customers
that use InfluxDB, let's take a step
back and ask: what really is time series? So time series
data is a collection of observations obtained through repeated measurements over
time. Okay, that's great, but what does that really mean?
Well, the best way to understand something is through examples.
So one of the most classic examples of time series data is
weather stations. So I don't
know if you've ever received a weather station as a gift or have ever used
a weather station, but it's a pretty neat gift and it's a
pretty good idea in case you want to get something for
someone you love. But essentially it's a console that sits
inside the house and displays the current temperature, humidity, wind, and other conditions.
But it also optionally gives you
instructions on how to set up the weather station outside and send
its results to Weather Underground. So you can do that as well.
And you can view your data and send data every 5 minutes:
temperature data, dew point data, humidity, wind speed,
gust, pressure, precipitation rate, et cetera. So all of
this data has a timestamp associated with
it, and that is what makes it time series data.
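As a rough illustration of what "data with a timestamp" looks like in code, here is a minimal Python sketch. The `Observation` class, its field names, and the readings are all made up for this example and do not come from any weather station API. It shows that once every reading carries a timestamp, the history can answer questions such as the high and low for a period.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Observation:
    """One timestamped reading (hypothetical field names)."""
    timestamp: datetime
    temperature_c: float
    humidity_pct: float


def high_low(observations):
    """Return (highest, lowest) temperature across the given observations."""
    temps = [o.temperature_c for o in observations]
    return max(temps), min(temps)


# Made-up readings, one every 5 minutes starting at midnight.
start = datetime(2024, 1, 1, 0, 0)
readings = [18.0, 17.5, 19.2, 21.0]
observations = [
    Observation(start + timedelta(minutes=5 * i), temp, 40.0)
    for i, temp in enumerate(readings)
]

high, low = high_low(observations)
print(f"high={high} low={low}")
```

A single "current temperature" value could not answer the high and low question; the timestamped history can.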
And additionally, having all these snapshots of
data over time lets Weather Underground applications
show much more than just the current weather conditions. So you can see
your historical data, and you can see the high and low temperatures for
the day and even graph it out for the user. Another time series
example is predictive maintenance. So one
thing that I was asking myself is, do I think about time
series data all the time because I think about it
for work, or do a lot of other people think about it as well? So I
was thinking about a friend who has a predictive maintenance company:
does he think about time series every day? And then I realized that he
does, because his entire business is based on time
series data. So here's a screenshot from his website where he's
showing vibration analysis over time. And he uses this
type of data to help companies predict the life
of things like bearings and fan ventilation systems and many
other types of equipment. So he attaches sensors to this equipment
to measure things like those vibrations
and fluctuations over time to help his customers decide when they
might need to do some sort of predictive maintenance and replace a component that
will likely fail in the future, so that they can ensure steady and consistent operation.
And then I was thinking about other examples.
So another great example of time series data is
in the healthcare space. Specifically, when we think about heart health,
we look at things like pulse, blood pressure, and temperature.
And that's all time series data that you need to chart.
And it's critical for delivering care to someone. You need
to be able to monitor the change in those values over time.
All right, so time series data is generally categorized into
two different types: metrics and events.
So, first and foremost, time
series data is almost always represented on a two-dimensional
graph, where we have
a y value and an x value. The y value is typically the value
of the time series data itself, and the x value is the time.
And one thing that is really unique about time series data is
that the x axis, the time,
is not just an arbitrary independent variable,
because there are often correlations between the value
of something and the time. Like, almost always, it is colder
at night. So that's one unique property of time
series data that is just fun to mention,
because it makes time series analysis, from a statistical
perspective, much more complicated.
But back to focusing on metrics and events. So there are two different types
of time series. There are metrics, and metrics occur at regular
time intervals. So those are examples like
temperature, pressure, or flow rate that you are
gathering at a regular interval. And then we
also have events, and events occur at specific points in time.
So in our health chart, for example, an event
might be something like a seizure or maybe an arrhythmia.
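One way to make the metric versus event distinction concrete is to look at the spacing between timestamps. The Python sketch below is only an illustration: the function name `is_regular` and the sample timestamps are hypothetical, and it assumes that events, unlike metrics, arrive at irregular times.

```python
from datetime import datetime, timedelta


def is_regular(timestamps, tolerance=timedelta(0)):
    """Return True if consecutive timestamps are evenly spaced.

    Evenly spaced points look like a metric (sampled on a schedule);
    unevenly spaced points look like events (recorded when they happen).
    """
    if len(timestamps) < 3:
        return True  # too few points to see any irregularity
    gaps = [later - earlier for earlier, later in zip(timestamps, timestamps[1:])]
    return all(abs(gap - gaps[0]) <= tolerance for gap in gaps)


start = datetime(2024, 1, 1, 8, 0)

# A pulse reading taken every minute: a metric.
pulse_times = [start + timedelta(minutes=i) for i in range(5)]

# Arrhythmia episodes recorded whenever they occur: events (made-up times).
event_times = [start + timedelta(minutes=m) for m in (3, 4, 47, 190)]

print(is_regular(pulse_times))
print(is_regular(event_times))
```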
So both events and metrics are important time series
elements that you need to monitor health, because a
regular metric might just be monitoring
your pulse. So why do we need a time series database?
So, essentially,
we want to answer this question, and the best way to answer
this question is to understand that
time series data is advancing in a lot of different areas.
So the first and foremost is in consumer and industrial IoT.
So when we think of manufacturing and industrial platforms,
renewable and alternative energy systems, or fleet management
and telematics, we know that these have sensors
that are collecting data
like pressure, flow rate, temperature, humidity, concentration,
all these things about your environment, rotations per minute, vibrations,
et cetera. Then we have software infrastructure, which is a huge source
of time series as well. DevOps monitoring is a huge source of time series
data, whether you're monitoring your containers,
Kubernetes, the availability of endpoints,
or CI/CD.
We also see time series data showing up
in real-time applications. So fintech is kind of an obvious example
of time series data, and it's a unique one as well,
where you are very likely collecting data at
really, really high precision, even nanosecond precision, which is
up to a billion points in a single second. Then we also
look at network monitoring and gaming applications as well.
So some of our largest customers are banks
and crypto companies; Capital One, Bank of America,
and Crypto.com, to name-drop a few,
are some companies that use InfluxDB.
So I wanted to talk about additional data sources as well.
So we have seen an absolute explosion of data, and I still remember
when megabytes of data was considered huge. But now people talk
about petabytes and exabytes of data, and one reason for this
is that the source of data isn't just humans anymore, it's machines
and devices. So DevOps and the Internet of Things,
or IoT, have made a huge impact here. And the number of
devices is growing dramatically faster
than the number of people. And the amount of data that each device can generate
is, in most cases, far more than one human can generate.
So, I'm a home automation hobbyist, and I have dozens of devices,
like smart light switches, that are sending data every few minutes,
each one generating more data than I ever could as a human.
But I think all of you know and understand this already, probably just
as well as or better than I do. So let's move on to the next slide.
So, the next thing I wanted to talk about within the context of time series databases
is scalability. So, one thing
that time series databases are especially good at is
their ability to scale, to be able to
handle really high ingest volumes. So it's
not uncommon that people might want to be able
to store data at a minute interval,
but maybe also at a second, millisecond, microsecond, or even nanosecond
interval. So if that's the case, just take a look at the number
of records that you're generating per day. If you are
collecting data at nanosecond precision,
then you are writing trillions of points per
day, which is just an insane amount of volume. So one problem
that time series databases have to be able to solve is the ability to
support that high ingest volume.
The way that time series databases achieve this is by making certain
design assumptions and design trade-offs, and those are
typically around deprioritizing updates and
deletes in favor of increased
ingest and query performance. And the way
that that is largely executed is that time series
databases are indexed by time as
well. So we
also need to be able to prioritize queryability and query performance.
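To put the ingest volumes mentioned above into numbers, here is a small Python calculation of how many points a single series produces per day at each sampling interval. It assumes one point per interval for one series, so a real deployment with many series multiplies these figures.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400

# Points written per day by ONE series sampled once per interval.
POINTS_PER_DAY = {
    "minute": SECONDS_PER_DAY // 60,
    "second": SECONDS_PER_DAY,
    "millisecond": SECONDS_PER_DAY * 10**3,
    "microsecond": SECONDS_PER_DAY * 10**6,
    "nanosecond": SECONDS_PER_DAY * 10**9,
}

for interval, points in POINTS_PER_DAY.items():
    print(f"{interval:>11}: {points:,} points per day")
```

At nanosecond precision that works out to 86.4 trillion points per day for a single series.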
So, time series databases are typically organized by time.
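A minimal Python sketch of why organizing by time helps: if points are kept in time order, a time range maps to one contiguous slice that can be located with binary search instead of scanning every record. The `TimeOrderedSeries` class is hypothetical and is not how InfluxDB or any particular database is implemented. Timestamps here are plain integers (seconds).

```python
from bisect import bisect_left


class TimeOrderedSeries:
    """Toy in-memory series that keeps points sorted by timestamp."""

    def __init__(self):
        self._timestamps = []
        self._values = []

    def append(self, timestamp, value):
        # This sketch only accepts points arriving in time order.
        if self._timestamps and timestamp < self._timestamps[-1]:
            raise ValueError("points must be appended in time order")
        self._timestamps.append(timestamp)
        self._values.append(value)

    def query_range(self, start, stop):
        """Return (timestamp, value) pairs with start <= timestamp < stop."""
        lo = bisect_left(self._timestamps, start)  # first point at or after start
        hi = bisect_left(self._timestamps, stop)   # first point at or after stop
        return list(zip(self._timestamps[lo:hi], self._values[lo:hi]))


series = TimeOrderedSeries()
for i in range(6):
    series.append(i * 300, 20.0 + i)  # one made-up reading every 5 minutes

print(series.query_range(600, 1500))
```

The two binary searches cost O(log n), and the matching points sit next to each other, which is the "time ranges are stored together" idea in miniature.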
And when you query for data, you are typically interested
in querying for a collection of time series data
within a certain time range. And because time series database
records are organized by time, this means that time ranges of information are
stored together. And so therefore you're able to retrieve that data more
quickly. And this wouldn't always be the case if the
time is just being stored in one column in a relational database, for example.
So when analyzing time series data, you typically want
to look at a range of data, and because of its organization
within a time series database, this becomes a really quick
operation. And then another reason
why time series databases get used is because of your ability
to actually manage your time series data lifecycle.
So one way that you do that is through having tools that allow you
to automatically expire old data. That's critical, especially if you
are collecting data at a nanosecond precision.
You might not want to retain all that data. More
often than not, you don't need to retain that for a long period of time,
and you need to be able to automatically expire it, and you want to be
able to do that in a reliable fashion.
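The idea of expiring old data can be sketched in a few lines of Python. This is only an illustration of the concept: the function `expire_old_points`, the 7-day retention period, and the sample points are assumptions for the example, and a real time series database does this inside the engine rather than in application code.

```python
from datetime import datetime, timedelta


def expire_old_points(points, now, retention):
    """Keep only (timestamp, value) points newer than now - retention."""
    cutoff = now - retention
    return [(ts, value) for ts, value in points if ts >= cutoff]


now = datetime(2024, 1, 31)
points = [
    (now - timedelta(days=10), 1.0),  # older than the retention period
    (now - timedelta(days=2), 2.0),
    (now, 3.0),
]

kept = expire_old_points(points, now, retention=timedelta(days=7))
print(len(points), "->", len(kept))
```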
The other thing that you typically want to be able to do is perform
downsampling. So what is downsampling? Downsampling is the
process of taking high precision data
in its raw form, then applying an
aggregate on top of that data to create a lower precision summary
of it, and then only storing that lower precision summary. So, for example,
maybe today we are storing one sample
of temperature every 5 minutes, which would provide a total of 288
records per day, and we downsample that data to the
average for the entire day. So we've reduced that data
from 288 records to one record,
and especially if you're using OSS versions of InfluxDB,
that just helps you reduce your disk size as well as the
index of your database, which would
also in turn increase your query performance.
And so with a time series database, these types of actions
typically happen at the database engine level, and they
shouldn't require that you build these solutions yourself.
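The 288-records-to-one example can be worked through in Python. This sketch is a hand-rolled illustration of the downsampling idea, not a database feature: the function name `downsample_daily_mean` and the temperature values are made up, and the daily mean is just one possible aggregate.

```python
from datetime import datetime, timedelta
from statistics import mean


def downsample_daily_mean(points):
    """Collapse (timestamp, value) points into one mean value per calendar day."""
    by_day = {}
    for ts, value in points:
        by_day.setdefault(ts.date(), []).append(value)
    return {day: mean(values) for day, values in sorted(by_day.items())}


# One made-up temperature sample every 5 minutes for a full day: 288 records.
start = datetime(2024, 1, 1)
raw = [(start + timedelta(minutes=5 * i), 20.0 + (i % 2)) for i in range(288)]

summary = downsample_daily_mean(raw)
print(len(raw), "raw records ->", len(summary), "downsampled record")
```

Only the summary would be kept long term; the raw points can then be expired.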
Another typical benefit of time series databases is the reduced storage
size that they can often have. We talked about downsampling already, and how
you can utilize downsampling to additionally reduce your disk size
and reduce the amount of data you need to retain.
But compression is also another big benefit of the time series database.
Timeseries records often have similar data.
Think about a railroad track sensor as an example. It might
show a zero meaning no train for almost all of the day.
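A sensor that reports the same value for most of the day is a good fit for run-length encoding, which is one simple compression technique. The Python sketch below is a simplified illustration with made-up readings; it is not a claim about which encodings any specific time series database uses.

```python
from itertools import groupby


def run_length_encode(values):
    """Collapse runs of identical neighboring values into (value, count) pairs."""
    return [(value, sum(1 for _ in run)) for value, run in groupby(values)]


def run_length_decode(runs):
    """Expand (value, count) pairs back into the original list."""
    return [value for value, count in runs for _ in range(count)]


# Made-up track sensor: 0 = no train, 1 = train present.
readings = [0] * 500 + [1] * 3 + [0] * 497

runs = run_length_encode(readings)
print(len(readings), "readings stored as", len(runs), "runs:", runs)
```

Decoding the runs gives back exactly the original readings, so nothing is lost.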
And it's critically important to know when the train was there so that
the gate gets closed and the lights come on. But there may be hours
of samples of zero before there is a one. The only difference in
some of the records may just be the timestamp, which also follows a regular pattern
that can be used in compression. So you typically see great compression
with at least some time series databases, just because of the nature of time series
data itself. There are going to be neighboring values that are
identical or extremely similar, which also makes scanning
the data easier and more
efficient as well. Time series databases also typically
have data retention policies so that you can automatically delete
your data after a period of time, like we mentioned.
You could certainly build an application to go and delete old records, but it's
just easier to have the database do that for you. And because a time series database
typically organizes by time, you don't have the inefficiency issues
you see with some other types of databases when you delete large amounts of data
from them. So when we do perform
deletes, one of those trade-offs that I was talking about is
deprioritizing deletes and updates in favor of ingest and querying.
But when we talk about those deletes, we mean individual points,
whereas a time series database is better at deleting a large
group of data. Okay, so you're a
software engineer, and you want to know the following: how do I
know if a time series database is right for my application?
So in order to answer that question, we should start
asking ourselves a few questions. The first question to ask is,
does my data have a time element? Most data does,
but certainly not all of it. And sometimes it's not critical for understanding
or problem solving. So if there's no time element,
then a time series database is not a good fit and you
should find a different database for it. The next question
I would ask is, do I care about changes in
my data over time? Let's look at that basic weather station example.
Its primary feature is to display the current temperature and humidity
inside a house, and if those were the only features the weather station
had, would there be a need for a time series database? Maybe not.
You could simply have a current temperature field in a relational database and just
update that field whenever there is a change in the metrics. But if you do
care about other samples of the data besides the current one,
then there are still some more questions to ask,
like: do I think that I will get a feature request for
my application that will need to know about changes over
time? What if someday we wanted to
enhance the application to show today's high and low temperatures on the console?
Now, you could still use a time series database even
if you didn't care about changes over time. Certainly with time series
data you can often just retrieve the last record,
which in this case would contain the same current temperature value as the column previously
described. But the need to use a time series database is less
in that case. Another question
that you would ask yourself is, how much data will I be working with?
Will my application be working with a large amount of data now or in
the future, or only a very small amount? If the amount of data
is small, it doesn't mean that there is no need to use a time series
database. It still might be the most efficient way
to build your application. But if you're working with a large amount of data with
a time element, then that will definitely help to build
a case for the scalability that time series databases can bring.
So there's kind of a trade-off here with your familiarity and comfort with the tools
that you maybe already know.
You may be more comfortable working with a different database and you might be able
to make it work if you have a small amount of data. But if
you have a large amount of data, then you probably need to consider making that
shift. And then another question to ask is, will there be any need to do
any analytics with my data in the future? Data analytics
are often performed by looking at data over a period of time, which makes a time
series database a really great choice for real-time
analytics. Really, any comparison of metrics or
events in your data over time benefits from the quick retrieval
of time series data. Another question you might
ask is, am I concerned with storage costs?
I know this sounds like a question that the answer is always yes,
but let me state it better. Am I concerned with storage
costs that could be reduced by a time series database?
This doesn't always have an easy answer and sometimes takes some
prototyping to figure out. But very often with time
series data, the answer is yes, and sometimes by a dramatic amount.
That comes from downsampling, retention policies, and the compression benefits that
time series databases provide. Another question that you ask yourself
is, am I concerned about application performance with time series data?
And this is yes in almost all cases. But let's
restate it. Do I need to retrieve data in blocks of time for analytics,
to build a graph or do some other analysis over blocks of
time and data? And is the speed at which I can do that important?
This is where time series databases will shine. Your application may
or may not have this need, but if it does, it's a good indicator to
consider a time series database.
Okay, so before we wrap it up, let's just go over
a summary of some of the other great options out there.
So there are places where relational databases are still far and away the
best choice, and there are other great categories as
well. But time series is definitely a growing category for
IoT use cases and for anything
that has events and metrics. But if you have
other types of data, you should
use the right solution for it.
So now, if you want to start playing around with a time series database,
some of the vendors out there offer a free
hosted database version. An example is InfluxDB. We offer
an on-prem, open source, free version and a
free basic-usage hosted cloud option. I actually recommend using
the free tier cloud option first. That's usually what most people do, because you
don't have to download anything and you can just play around with it and get
an intuitive feel for whether or not it's something that you want to consider investing
in. And then from there I usually see users install the
open source version until they have a requirement where
they don't want to host it themselves, in which case they sometimes
return to the cloud version. I also want to make you aware
of resources that you can use to learn more about InfluxDB.
So we have blogs, we have Slack,
the Discourse forums at community.influxdata.com,
and Reddit as well. We offer InfluxDB University as well,
which is a platform for learning about all things related to InfluxData's
technology, including InfluxDB. And you can earn
badges for any course that you
complete there, and then our documentation is also excellent.
So with that I want to thank you so much, and I
look forward to seeing you next time.