Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hi, I'm Timothy Spann. I'm a developer advocate.
I work mostly in streaming and in open source.
That might seem a little strange in the machine learning track, but I'll
show you a lot of different things you might not have thought of
before on how to get data,
manipulate data, run different processes
that you need to run to be able to do your machine
learning and your training and classification.
Lots of different things there. So this is for data engineers, this is
for programmers, this is for data scientists. Anyone who
wants to work with data and do it fast, not have
to wait for a batch or something to load.
Let's just get that data, run it as fast as possible,
and do it in a way that's scalable, open source,
and easy to use. Welcome to my talk here. This is Hail
Hydrate! From Stream to Lake. Now, I'm going to show you
different ways that you could work with data lakes.
Also any other source or sink of data.
So just a couple of quick things on me. You'll have these slides,
so you can go through all these links and see source code, articles,
deep dives into different technologies, whether it's Apache NiFi,
Apache Spark, Apache OpenNLP,
Apache Pulsar, Apache Kafka, Apache Flink.
I also have some things on TensorFlow. I do
a lot of different devices. I've got NVIDIA Jetsons, Raspberry Pis,
tons of different content. And if you're missing something or going,
I wish Tim would do this or that, drop me a line, whether on Twitter,
LinkedIn, wherever you see me, comment in this conference,
I am always looking for more things to work on, more apps
to build more real time data pipelines.
Show me something you're missing and let's do it. One thing you
might notice is, I like cats. I have some cats, so you might see
some in the pictures. Those pictures are perhaps unrelated.
So the agenda today, in this 40 or so minutes that
I'm going to do, is as follows. My use case, the primary
one is to populate a data lake. This could be in any cloud,
this could be from any vendor, whether it's Snowflake,
Cloudera, Amazon, Microsoft. You know, you could build your
own open source and put it wherever you need it to be. It could be
in a cluster you've built of Raspberry Pis,
some obscure cloud, a VM,
whatever. We'll go over some of the challenges. What are the different
impacts they have on what you're trying to do. Give you a solution,
show you what the final outcome of that is. Talk to you
about a couple of different streaming technologies in the open source,
which means there's a great community behind it and you don't
have to pay. But once you get into production,
you probably want some people to support you. I'll show you my successful architecture,
spend as much time in that demo as possible and have a couple of
next steps so you know where to go next. So use case
primary one, I'm going to show you a couple because it's that easy.
We're going to do IoT ingestion. So I've got a
device running here on my desk here. If we were
in person, it'd be sitting in front of me in the podium,
which is pretty cool. So we've got high-volume
streaming sources, different types of messages,
different protocols, many vendors. So there could be
a lot of challenges around that. There's no cookie
cutter solution there. Things are different from one environment
to the next; you never know. So first thing
that comes up, like we mentioned: lots of different protocols,
different vendors, people like to do things their own way, but I want
to be able to see what's going on right away. I don't want to wait
for an hour, I don't want to wait for the end of the day.
I need this data now because I'm going to be continually training
my machine learning models, or I need to feed data
into something that maybe is using that model,
doing some classification and giving you some analysis
to say, hey, there's a fraud condition, there's an alert,
something's too high, something's been damaged,
you should buy a stock, the weather's changing, you better bring things
inside, you better start selling more ice cream because
it's 100 degrees Fahrenheit, those sorts of things.
What's nice as well, we'll show you: with these open source tools,
I can have visibility, which is often really
hard, on everything going on in the stream,
because think about it, there's devices on the edge. I could
be using various cloud servers, maybe I have
a server that's sitting in a gateway in the back of a moving truck.
Lots of things can go wrong. There could be bottlenecks, things could
slow down, messages can get lost. I'll show you how we could keep
an eye on all that, automate that, make that simple.
You could start today taking real time
data, marrying it with machine learning, and doing that
all in real time. Pretty easy. Now, one thing
that's difficult here is sure, maybe you could
find a solution. Maybe you wrote a ton of Python scripts,
maybe there's a shell script. You've got code all over the place.
That sprawl is hard to handle. Could also be difficult.
You're hiring maybe consultants, maybe a bunch of different developers.
You have to maintain that. Lots of different tools,
nothing standardized. Can't find all the source code, don't know
where it is running. Is it running over here? I shut down
a VM, I shut down a Kubernetes pod.
Now things aren't working. Now I got to learn another tool,
another language, another package. That's painful.
It takes you forever to write this code. Those delays are not going
to make anyone happy. Whether it's who's going to use the final
code, or apps, or just your developers,
your data scientists, your data analysts can't develop things quickly
if they don't have the data, if they can't put their models out there
quickly, if I can't get things to run my classifiers,
you're losing money. You're losing who knows how many things.
You need data fast. You need to be able to develop fast.
Things need to be open,
accessible, and have a community. So if you get stuck, you could
keep moving on quickly. So the main solution here
is using something like Apache NiFi. That is an open
source tool that handles a ton of different sources.
You don't have to learn a new tool for a new source.
Supports lots of different message formats and protocols and
vendors, easy to use, drag and drop,
huge community and fits in well with everything else
in streaming. Supports any type of data. It doesn't matter if
it's XML, if it's binary,
a Word document, PDF, CSV, tab-delimited,
JSON, Avro, Parquet. I could say
all kinds of words. Some of them you may know, some of them you may not.
Doesn't matter. Tons of different data. Sometimes the data is big,
sometimes it's small, sometimes it's fast, sometimes it's slow.
Doesn't matter. We could do a lot of that with little to no code.
That makes it awesome. I can clean up my data, get it into a
format. So if I have to do something more complex, I could do
that with a Pulsar function. I could do that with Flink or Flink SQL.
Really great. NiFi also gives me really rich visibility.
It's got something called data provenance, which is a
really robust lineage. And you could see everything about your
data, all the metadata, all the metrics. You could see everything end
to end. This is awesome. And you'll see how powerful
that is in the demo. So now, instead of spending time trying to
write one input app, I could build new use cases,
new apps, get that data to the analysts and the data scientists who could
do more models, more apps, more solutions.
Things are cheaper now. I could develop more things
for less cost, with smaller teams doing this part.
Let people do the fun part of building apps, building modules,
solving problems with machine learning and deep learning. It makes
you a lot more agile. I'm not spending weeks deploying things.
I'm not spending weeks trying to figure out how to deal with
new data when it comes in. To do things in a matter of minutes
or hours makes you very agile when you need to change.
Because tomorrow I have to use a different provider
for weather. I have a new IoT device. Instead of
spending weeks or months trying to customize
it to whatever that format is, I could adapt on
the fly. Makes it very powerful. Now, I
call my stack a couple of different things.
This stack of open source Apache streaming tools out there,
sometimes I'll call it FLiP, sometimes I'll call it FLaNK.
FLiP is when I'm using Flink and Pulsar,
usually NiFi and a couple other tools as well.
The focus is on cloud data engineers, especially around
machine learning, to help your data scientists. But this could be in many
other platforms. Cloud tends to be the way we're going now,
but it could be in any environment, whether it's on
premises, VMs, Docker containers, Kubernetes pods,
wherever. With this stack I could
support as many users as you have different frameworks,
different languages, any of those clouds, you have lots
of different data sources, lots of clusters. So if you have any
experience with Python or Java, maybe a little SQL, maybe at
least you have the concept of streaming, maybe a little ETL,
you're ready to go using this. If you're a cat,
you could be involved here. Even a cat could use NiFi.
The cat's going to question how much you're spending on cloud, but who doesn't?
And if you happen to have machine learning code there,
I could run it whether I'm in NiFi, Pulsar Functions, Flink,
or MiNiFi agents at the edge, wherever you need to run
that code, whether it's sentient or not. We could do that.
My FLiP stack, just to show you what it is: it uses this really
cool Pulsar-Flink connector that's been developed
by StreamNative in the open source. This makes it very easy for
me to connect between Flink
code and Pulsar, whether it's a source or a sink or both.
Both tends to be common because I put something into a queue
or a topic, I want to process it and send
it down its way. Maybe it's cleaned up, maybe it's joined together, maybe it's aggregated.
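Just as an illustration of what that looks like from the Flink SQL side (the exact option names depend on which version of the connector you're running, so treat this as a sketch, not gospel), you register a Pulsar topic as a table and then it can be a source or a sink:

```python
from pyflink.table import EnvironmentSettings, TableEnvironment

# Streaming TableEnvironment; the StreamNative pulsar-flink connector JAR
# must already be on the classpath for this to resolve.
t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

# Register a Pulsar topic as a table (illustrative names and options).
t_env.execute_sql("""
    CREATE TABLE sensor_events (
        device_id STRING,
        temp_f DOUBLE,
        event_time TIMESTAMP(3)
    ) WITH (
        'connector' = 'pulsar',
        'topic' = 'persistent://public/default/sensors',
        'service-url' = 'pulsar://localhost:6650',
        'admin-url' = 'http://localhost:8080',
        'format' = 'json'
    )
""")

# The same kind of table can be the target of an INSERT INTO,
# so Pulsar acts as both the source and the sink of the Flink job.
```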
We'll show you a little bit there. I touched on NiFi as a
big part of this. Why? It's a scalable,
real-time streaming platform. You can collect,
curate, use it as a universal gateway, could run my analytics
there, I could run my machine learning models there. Does a
lot of things. It does it very easily. And it could
be so many different sources of data, whether they're
cloud based, relational databases, NoSQL stores like
Mongo, Elastic, Solr; logs,
text files, PDFs, zip files. Pull things
off of government sources or from a Slack channel,
hash it up, encrypt it, split text files
apart, put them together, take data out of a
zip file, look at the tail end of a log file,
route things different ways, add metadata,
use metadata, all that while being very scalable
out to millions of events a second. Run on
as big a cluster as you need, as many nodes, whether you're running it in
Kubernetes, whether you're running it in VMs, in any
of the clouds, on your laptop, on a desktop,
wherever that is. Hundreds of pre-built components for you to drag
and drop and build your flows very easily
with full back pressure in there,
security, all those features that you'd expect,
guaranteed delivery. This is stable,
scalable and a great solution regardless of how
big the application is, how big your data streams are,
what the data looks like, wherever it's running. You could do that here.
Pulsar. Pulsar is a great way to
do distributed messaging, regardless of what cloud it's running on or
on premises. Again, the same idea as with NiFi:
open source Apache, huge community,
lots of different options for sources and sinks.
Lots of capabilities here. It's a no brainer for putting
your different event data and ML data in there,
stream your data around, do that in a secure,
durable manner. It's georeplicated, it's great.
It's even better than my fluffy cat.
Any kind of pub/sub, so you don't just have to replace workloads
that are common in big data. This can also be in different things
where you might be doing JMS or MQTT, which we see
a lot in IoT; JMS, Kafka, any of
those different protocols supported by this one messaging system.
Makes it pretty easy. You can run functions very well
integrated here. That makes it easy for you to write
and run some of your machine learning against it like you'd
expect. Very scalable. One thing that's really cool here,
that's unusual, is tiered persistent storage.
So as you put data into this messaging
system, it could store it out into the cloud, in say
S3 buckets, without you having to write special code,
without having to do the heavy lifting that you might
have to do elsewhere. Full REST API for managing and
monitoring, everything you need. A command line interface too, so you can make
this easy, hook it up to your DevOps tools. And there's all
the clients you'd expect. Whether you write Python
or you write Java, whatever it is, there's a client
out there for you. So you could easily consume and
produce messages. This makes this pretty great.
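For example, with the Python client (pulsar-client on PyPI), producing and consuming looks roughly like this; the broker URL, topic, and subscription names are just examples:

```python
import json
import pulsar

# Connect to a local broker (example URL).
client = pulsar.Client('pulsar://localhost:6650')

# Produce one sensor reading as JSON to an example topic.
producer = client.create_producer('persistent://public/default/iot-sensors')
producer.send(json.dumps({'device': 'jetson-1', 'temp_f': 72.4}).encode('utf-8'))

# Consume it back on a named subscription and acknowledge it.
consumer = client.subscribe('persistent://public/default/iot-sensors',
                            subscription_name='ml-feature-feed')
msg = consumer.receive()
print(json.loads(msg.data()))
consumer.acknowledge(msg)

client.close()
```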
I touched on Flink. I'm going to show you that in the application.
If you haven't used Flink before, it's something
you really need to start looking at. Flink scales out tremendously
well, integrates with all the different frameworks you have
typically written in Java. But with the
enhancements that are in there now for Flink SQL,
instead of having to write these complex distributed apps by hand,
compile them, test them locally, and deploy them out
to my clusters, I write a SQL statement. Like I
mentioned before, if you know a little Python, maybe a little Java, a little SQL,
all of a sudden now I'm writing massively distributed real
time pipelines for machine learning.
Pretty straightforward. This is amazingly scalable,
used by some of the largest companies in the world. Has everything you need
for fault tolerance, resiliency, high availability,
runs in YARN, runs in Kubernetes, all those
features that you want. Pretty awesome there.
Now, this is a machine learning talk.
Let's get back to this. For some reason it's not showing my image here;
I lost my image. We'll show you that later.
But basically what I have here is I've
included in my NiFi distribution a couple
of open source components I've written so that I
could do deep learning as part of that stream.
And these are built using Apache MXNet, which is Java,
and DL4J. This is a deep learning
Java library that lets me write
in Java and then deploy it as a final model
in, say, TensorFlow or MXNet or PyTorch.
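The processors themselves are Java, but just to sketch the kind of inference they wrap, here's roughly what it looks like in Python, using GluonCV's pretrained MXNet model zoo as a stand-in; the model name and image file are only examples:

```python
from gluoncv import model_zoo, data

# Pretrained SSD detector with a ResNet-50 backbone (downloaded on first use).
net = model_zoo.get_model('ssd_512_resnet50_v1_coco', pretrained=True)

# Load and resize one image the way the model expects.
x, img = data.transforms.presets.ssd.load_test('frame.jpg', short=512)

# Run detection: class IDs, confidence scores, and bounding boxes.
class_ids, scores, bboxes = net(x)

# Keep just the top five detections, like the processor does,
# and report label, probability, and box coordinates.
for i in range(5):
    cid = int(class_ids[0, i].asscalar())
    score = float(scores[0, i].asscalar())
    if cid < 0 or score < 0.5:
        continue
    print(net.classes[cid], round(score, 3), bboxes[0, i].asnumpy().tolist())
```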
Really powerful. And sometimes I need to do some natural language
processing. Apache provides a great library
for that, OpenNLP. I put that into NiFi, so you
could use that as part of your flow very easily. You don't have to call
third party services, don't have to call another
library. Pretty easy. This is
my solution here. This is our architecture. Got a lot of different files;
let me show you a couple different ones. XML and JSON
come in from different sources. As they come in, it validates them,
cleans them up, routes them where they need to go. We get that through our
messaging system and then land it in the cloud. And then you can write
analytics on it. And I'll show you some analytics there.
And I'll show you some continuous queries in Flink, just
to show you the ideas of what you can do with data while
it's happening. Real-time events: an event happens,
you're working on it at the same time, there's no delay there. As soon as
the network can get it to you, you're doing something with it, whether it's
a continuous SQL query, some kind of decision, whatever it
may be. This is that IoT data I was
talking about. You've got things like temperature, you've got things like different
sensor readings, important things if you're running
any kind of edge application: real-time
vehicles, maintenance, all kinds of devices out there.
I also have ones for weather, and I've got links to
some more content you can go into there. That's it.
Thank you. We are done with the slides. I think
everyone's happy to be done with the slides. I'm happy
to be done with the slides. Let's show you real code
running. That's probably more interesting than what you've seen before.
So right now I am in Apache NiFi. We talked
about it enough. This is it. This is not the flow
diagram for it. This is not some
documentation for it. This is my running system.
There is a GUI environment that lets you build,
monitor, deploy this. This could be locked down and
secure. We could hide this UI if you don't want anyone ever to see
it. But this is running. And if you see here, I've got a real time
stream of data coming in. I have edge devices that
are making HTTPS calls in, sending me
data, and I have it queued because I didn't
want to run it until I could show it to you. So right now I'm
going to start something. This is pretty amazing. I could start and stop,
and I could do this either through the UI, through a REST call,
or the command line interface, any part of the system.
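Just as a rough sketch, that REST call from Python looks something like this; the host and process group ID are placeholders, and a secured cluster would also need a token or certificates:

```python
import requests

NIFI_API = "https://nifi-host:8443/nifi-api"  # placeholder host

def set_process_group_state(pg_id: str, state: str) -> None:
    """Schedule or stop every component in a process group ("RUNNING" or "STOPPED")."""
    resp = requests.put(
        f"{NIFI_API}/flow/process-groups/{pg_id}",
        json={"id": pg_id, "state": state},
    )
    resp.raise_for_status()

# Pause ingestion until there's cloud budget again, then resume it.
set_process_group_state("my-ingest-group-id", "STOPPED")
set_process_group_state("my-ingest-group-id", "RUNNING")
```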
So if I want to pause something, maybe I
ran out of cloud money this month. I'm going to pause it here, queue it
up until I can get myself some more cloud availability,
or maybe spin up some new infrastructure, or maybe
something's having problems downstream. I can pause it,
never lose data and have no issues there.
So I'm getting data, I'm routing energy data in,
sending that to my messaging system. I'm running
some real-time queries on some of this
data as it comes in. This query I have in
my parameters, so I could isolate that out when I deploy
my code using DevOps tools. This is
my current query: if the temperature in Fahrenheit is
over 60, we may be getting too hot. Obviously you
set these yourself. Maybe I could have machine learning figure
out what's the new normal for temperature for that sensor reading or
the current weather. That's up to you. Maybe I compare it against weather,
maybe I compare it against other sensors of the same class.
You get the idea of various things you could do there, stream the data
in and then when I'm ready I split it up
into individual records so I'm not sending
too big a file out, pushing it to a couple different messaging
systems so that I can push that up to the cloud.
If you look here now, I'm in an Amazon-hosted cluster;
it doesn't look much different to you. I have permissions in both.
Here I've got this router stopped, so no
data was processing. I do a refresh and that's all
been processed. Hundreds of records on a single node. Very easy
to do. I add some metadata here,
things like table name, what I'm doing with
it and I'm going to send it to another messaging queue for some
further analytics with Flink. Here I'm sending
this to a cloud data store, and if you want
to see how hard it was to write that code: point it to the name
of the server, give it a table. And I'm defining that dynamically
so that nothing's hard coded. And I'm saying it's JSON.
That's it. Oh, and I want to do an upsert. This one supports upsert;
it could have been update or insert, whatever. So if you notice
something: where are the fields? I don't have to
handwrite any SQL. I don't have to write any mapping
code. My record reader technology looks
at it, sees if you have a schema; if you've got a schema defined
somewhere, I'll use that. So I know exactly the field names,
field types, exactly what you want it to have, and then I'll just
match that up to the table. Very easy. Now sometimes
your data might change a lot, and you might just have
the code infer it for you. So it'll look at a
number of records and say, okay, these are the fields I see in this
JSON or XML or Avro
or different types of files. Let me
show you how those readers work. So I could just pick a different one.
I could do Avro, comma-separated values. Grok will
read logs or any kind of semi-structured text.
IPFIX files, Parquet, syslogs,
Windows events, XML. Get the idea?
Pretty easy to do that. Don't have to learn anything else.
If you have any custom weird formats, it's open source,
take a look. Someone else might have written it, you can write it yourself,
or get a consultant to do it. Pretty easy to do.
These little boxes, I've written about 50 of them.
It's Java. Really simple, kind of, if you've ever done Spring
code, it's like that. You write it once, build a JAR
file, put it on a cluster, it's ready to go.
And I've got links to some of mine that you can download and use in
your system. So that was IoT data.
That's a great type of data. I want to show you a different type of
data since we've got some time here. This is weather data.
If you're in the United States, there's an organization for
the US government called the NOAA and they
put out forecasts from different weather stations and
current readings from weather stations all around the country.
And amazingly enough, every 15 minutes,
not really streaming, but pretty quick batch, they put
out a zip file of every reading in
the country. That's pretty awesome. So I'm going to download that
zip file so I could run this once just to give you
an idea. Get that flowing. Already have it. I'll uncompress
that zip file, pull out all the individual files
in there and then route them, and make sure that they're an airport.
There are some that aren't airports. For my customers,
they care about weather conditions around airports;
that's really the main parts of the US.
So I have all these XML files coming in.
I want to convert them to JSON and I'll run a little query on there,
make sure they have location data. I don't want junk coming out.
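Outside of NiFi, that same fetch, unzip, and convert step looks roughly like this in Python; the NOAA URL here is the commonly used current-observations feed, so double-check it before relying on it:

```python
import io
import json
import xml.etree.ElementTree as ET
import zipfile

import requests

# NOAA publishes a zip of current observations for every station
# (URL is an assumption; confirm the current location on weather.gov).
ZIP_URL = "https://forecast.weather.gov/xml/current_obs/all_xml.zip"

resp = requests.get(ZIP_URL, timeout=60)
resp.raise_for_status()

readings = []
with zipfile.ZipFile(io.BytesIO(resp.content)) as zf:
    for name in zf.namelist():
        if not name.endswith(".xml"):
            continue
        root = ET.fromstring(zf.read(name))
        # Flatten the per-station XML into a plain dict (one level deep).
        record = {child.tag: child.text for child in root}
        # Keep only records that actually have location data, like the flow's validation step.
        if record.get("latitude") and record.get("longitude"):
            readings.append(record)

print(len(readings), "stations")
print(json.dumps(readings[0], indent=2)[:300])
```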
So the end result of this is a whole bunch
of JSON, very easy to read. Could have kept
it in XML. I don't really like XML. I'm going to get rid of
that as soon as possible. So here we take a look.
I could see all that provenance and lineage data,
telling me what server it ran on. It gives it a unique ID,
what's the size, all the different attributes,
where it ran, what type it is. Here's that file.
So that's the airport, KO5. It's XML,
all kinds of metadata, how it downloaded that from
the website, and then the actual content.
And I could see, oh yeah, here it is. There's all that JSON data.
That was XML data before.
Now it's JSON data. So I took that
data in, parsed it apart, and I put it in Kafka.
And we could show you that in a minute, because I wanted to distribute
it. I usually put it in Pulsar; this one I put in Kafka.
Sometimes I'll put it in JMS. Lots of different messaging options.
Pulsar is probably your best
option there, but I could do that from any of
those different clients out there. So I'm consuming those messages
back. And let's show you that whole thing we talked about,
stream to lake. This is the stream coming from Pulsar or
Kafka. Here is my lake.
I'm creating ORC files, which is a
type of file that's used by Hive. And I'm putting that in
a directory on S3 automatically.
Don't have to do anything. This has, again, something that reads those
JSON files, writes them there. I do the same with Parquet,
same with Kudu. That's as easy as it is
to take that stream and get that into my data lake as fast as I
need it to be.
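NiFi does that with record writers and a put processor, but just to sketch the underlying idea in Python (the bucket and path are made up), writing a batch of those JSON records as Parquet on S3 is only a few lines with pyarrow:

```python
import pyarrow as pa
import pyarrow.parquet as pq
from pyarrow import fs

# A small batch of weather records like the ones shown earlier
# (already plain dicts after the XML-to-JSON step).
records = [
    {"station_id": "KEWR", "temp_f": 71.0, "latitude": 40.68, "longitude": -74.17},
    {"station_id": "KJFK", "temp_f": 69.0, "latitude": 40.64, "longitude": -73.78},
]

table = pa.Table.from_pylist(records)

# Credentials come from the usual AWS environment or instance profile;
# the bucket name and key prefix here are hypothetical.
s3 = fs.S3FileSystem(region="us-east-1")
pq.write_table(table, "my-data-lake/weather/readings-0001.parquet", filesystem=s3)
```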
And that same data is also flowing through my messaging system so that I
can write some more advanced analytics on here.
So before I send all that data out, I run some
validation on here. I check it against the schema that I have
to make sure that nothing's weird, because sometimes they give us bad
data, government data, you don't always get good data. I don't want that broken
data. I could store it, I could put it in
a directory, maybe. It's usually junk. That's up to you.
If you see value in it, maybe you could data mine that later. And then
I just put it into the messaging system to do
some more queries. Now, I mentioned I had some custom
processors to do deep learning. This is one right
here, and this is as hard as it is for you: you pick your data set,
and there's a couple of parameters there. This is a ResNet-50 for
doing this. And here I'm just running a couple at a time, just
to show you the results. And if
we take a look here, you can see here by the name
that these are images. NiFi works on
images. See, 700K versus those little files we had before.
And inside the metadata, I put the results
of my deep learning classification. Here's a bounding
box that I could draw around what I found in the
image, which you see here is a person.
I could have up to five, I could do more, but I limit it to
five because it gets a little hectic for my processing
here. So I found results: it was a person.
There are the image's height, min, max, those sorts of things.
What's the probability that it's actually a person?
They have pretty good confidence there. So let's see what that image actually
is. It's the side of my head. I'm a person. I'm very happy.
Sometimes I'm not a person. Today deep learning says I'm
a person. Everyone rejoice.
Yeah. AI has not taken over yet because sometimes I'm not a person.
Sometimes the cat's labeled a person.
You never know what you're getting. So we have all those messages here.
I've got over 25,000 in there already
and these are just going through my message queue
and I'm going to read them with Flink SQL
and do some different analytics on them. That IoT
stream and this weather stream, pretty straightforward, but it gives
you an idea of what you could do with different data as
it's coming in. You watch it over time.
Here's that IoT sensor data. A lot of data has
gone through there. Same with the weather. So this is how I'm
running my Flink SQL. There are lots of different consoles
out there. StreamNative has one, Ververica has one,
Cloudera has one. There's one with Apache Flink;
that one's a command line one. Whatever you're writing, there are lots of
different ways to run Flink SQL, and it's very scalable.
Now, you could also wrap it in your own Java code if you want to
manage the deployment yourself. Maybe you're doing it all open
source and you don't have an advanced environment to do
that. Here I'm showing some of the results
of my continuous SQL. So when I'm building this query, I could
see the results. And this one, if you notice, if you've worked with SQL
before, looks pretty familiar. I'm grabbing a location;
this is wherever that airport was that they
took the weather. And I want the max temperature in Fahrenheit,
average temperature, minimum temperature, and I'm just displaying
them here. This is over a short period
of time. We can set up windows of time with these streaming systems.
So maybe I look at all the forecasts in the last 6 hours,
take the min and max, to give you ideas there, because remember, this
is not in my final data store. This is in stream.
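The query itself, sketched with made-up table and column names to match what I just described, is a plain tumbling-window aggregate; here's roughly how I'd submit it from PyFlink:

```python
from pyflink.table import EnvironmentSettings, TableEnvironment

t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

# Continuous aggregate over the weather stream: min/avg/max temperature per location
# in hourly tumbling windows. 'weather' is assumed to already be registered as a
# streaming table with an event-time column named event_time.
t_env.execute_sql("""
    SELECT
        location,
        MAX(temp_f) AS max_temp_f,
        AVG(temp_f) AS avg_temp_f,
        MIN(temp_f) AS min_temp_f,
        TUMBLE_START(event_time, INTERVAL '1' HOUR) AS window_start
    FROM weather
    GROUP BY location, TUMBLE(event_time, INTERVAL '1' HOUR)
""").print()
```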
While this happens, another record shows up. It's added
to this SQL; this is continuously running and
it just keeps going. We've got another one here where
I'm not doing any aggregates, I'm just looking at every record where
the location is not null. I mean, we did the validation,
but sometimes something doesn't show up. Some of these readings
at some of these weather stations are done manually; someone's typing
in the weather forecast or the
current conditions. So sometimes a little bit of human
error gets in there. But you can see some of the fields here.
And nice thing is we wrap this in a materialized
view so that now I have a REST endpoint that people
can query, and this is what it looks like. You get a
JSON array of all that weather data. I could
pull that into a Jupyter notebook, pull that into an application,
do what you need to do there.
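Pulling that materialized view into a notebook is then just an HTTP GET; the endpoint URL and columns below are made up for illustration:

```python
import pandas as pd
import requests

# Hypothetical materialized-view endpoint exposed by the Flink SQL console.
resp = requests.get("http://flink-host:8083/api/v1/views/weather_summary", timeout=30)
resp.raise_for_status()

# The endpoint returns a JSON array of records, so it drops straight into a DataFrame.
df = pd.DataFrame(resp.json())
print(df.head())
```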
I've got another Flink SQL here that's joining together two streams. These are
two different topics. I've got one for energy data,
one for my sensor data. I join them together.
This could have been a full outer join, left outer join,
right outer join. Again, if you've done any ANSI SQL-92,
Flink SQL is going to be pretty familiar.
It uses Apache Calcite, which is used in a ton of open source
projects like Phoenix. So you get used to this SQL
once and you pretty much get the syntax. There's some
extra things for doing really interesting complex
event processing and windowing, but it's pretty much SQL.
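Here's a rough sketch of that kind of two-stream join, again with made-up table and column names, submitted the same way:

```python
from pyflink.table import EnvironmentSettings, TableEnvironment

t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

# Join the energy topic and the sensor topic on device id within a short time window.
# 'energy' and 'sensors' are assumed to be registered streaming tables with
# event-time columns; an outer join would just swap the JOIN keyword, as mentioned above.
t_env.execute_sql("""
    SELECT
        s.device_id,
        s.temp_f,
        e.watts,
        s.event_time
    FROM sensors AS s
    JOIN energy AS e
      ON s.device_id = e.device_id
     AND e.event_time BETWEEN s.event_time - INTERVAL '5' MINUTE
                          AND s.event_time + INTERVAL '5' MINUTE
""").print()
```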
You don't have to write any Java or any custom apps here,
but if you've got operators and functions running within
your queuing system, those can be executed before they get here
or after. Gives you some power. Again, another materialized
view here, so that people who don't have the
libraries can just call this REST endpoint,
get a bunch of JSON
and process it as they want to. And I've got a whole bunch
of different jobs running here. I've got
four different Flink applications that I could see in the dashboard
and I could dive into them and see what's going on, see how many
records, different things going on.
This one's interesting because you've got two different source tables and a
join, and you can see the number of records processing through there.
Pretty basic, but it gives you the idea. Sometimes you want to
see data stored in the cloud. We said lake;
well, here's my lake. This is a table
on top of Amazon
S3. And it looks like a
regular table. I could see the location and the details,
where it's stored. And it makes it pretty easy for
me to do what I need to do here. And it just acts
like a database for me. And I have all that permanent data,
so I have all the readings that have ever happened here. I could
store them, do whatever I need to do with that. Same thing
with the sensor readings. Pretty straightforward, regardless of
where I want to store that, whatever my lake is, like I said, it could
be Cloudera, it could be Amazon, Microsoft,
Redshift, Snowflake, whatever that next
Dremio is, whatever's your next data lake,
I could put it there. I create a little dashboard on it.
You can see there's a lot of different readings across the country.
These are some that are close to me. This is a local airport
by me, and I could see all the data there and download
it if I wanted to. And that has things like lat and long, so I could
put it on a map here. And if you look, there's a lot of different
airports where they're doing weather data in the United States.
You zoom out enough and it's just a nice,
pretty design there because there's thousands of records there.
Makes it very easy. Something else I can do is I
can send real time alerts to slack. This is great for DevOps,
but this also may be helpful for your data scientists and
analysts to see: okay, there's some new data coming in. Maybe we do
a sampling of it. Someone says, okay, there hasn't been data in a
while; now I see temperatures at an airport. Maybe that gives
me an idea for what I can do next. Lots of options there, pretty straightforward.
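In NiFi that's just a Slack processor at the end of the flow; from plain Python, the same alert is one POST to an incoming webhook (the webhook URL below is a placeholder):

```python
import requests

# Placeholder incoming-webhook URL; create one in your Slack workspace's app settings.
SLACK_WEBHOOK = "https://hooks.slack.com/services/T000/B000/XXXX"

def alert(text: str) -> None:
    """Send a one-line alert into the channel the webhook is bound to."""
    requests.post(SLACK_WEBHOOK, json={"text": text}, timeout=10).raise_for_status()

alert("New weather readings landed: KEWR 71F at 14:45 UTC")
```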
I think now we're at the end. Hopefully there's
time during the real time event for questions.
If there isn't, please reach out to me.
I've got all my contact information here. Whether you
contact me at @PaasDev on Twitter, see me on my website,
open a pull request, you know, however you
want to contact me. I'm always interested in talking
about streaming and getting data to a data lake.
Really easy, even if it's for machine learning or deep learning or whatever
you need it to be for. Straightforward thing there.
Thanks for coming to my talk. Hope you learned something.
If you're looking to learn a little more,
definitely follow me. We have deeper dives.
We could do whole day events. So reach out.
Thanks a lot.