Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hello and welcome to using the FLiPN pattern for edge AI. That's when you use Flink, NiFi, and Pulsar together to solve really complex problems.
David and I are going to go through a couple of slides and a little demo, just to give you a feel for how you would develop IoT edge applications using this pattern and these open source technologies that work together really well and scale out tremendously. I'm Tim Spann.
I'm a principal developer advocate working across the whole streaming stack. I run some meetups, do some events, do some blogging, and we'll go into some cool tech. And I'm David Kjerrumgaard. I'm a developer advocate at StreamNative. I've published some books there, including Pulsar in Action. I also contribute a lot of code and give a lot of talks around the world, and I focus primarily on NiFi, Flink, and Pulsar. So thanks, Tim, for having me.
Yeah, it's the dream team. If you're interested in these three technologies, these are probably the two people who like them the most. Every week I put out a newsletter that covers all kinds of streaming stuff, all the fun tech out there, very easy to check any time. We do meetups all over the place, just look for them, and they cover all the technology that you like. Today we're going to cover an introduction to these technologies, a little overview, some examples, and then a little bit about each piece of the streaming tech that's important for this pattern: Pulsar to be able to get anything into the stream, Flink to do some processing and joining, NiFi to scoop up the data and shove it into Pulsar and get it working. And that's really the FLiPN stack, or FLiPN pattern, or FLiPN federation. I've been talking with some of our friends online, you know who you are, Peter, about whether you should use the word stack, going back all the way to the LAMP stack. Probably not, but it's a bunch of things together.
Pattern, model, just a guide: some examples, best practices, things that work together really well and make it easy for you to build streaming apps. Everything's Apache, and they all work together really nicely. David wrote the connector from NiFi to Pulsar. There's a really solid connector from Flink to Pulsar, and also things like Pinot to Pulsar and a lot of other projects. Once you get into Apache, things just fit together really well.
Again, that's the reason why we like to use these together with this pattern. It can really accelerate what you can do as an edge data engineer, whether that's just grabbing raw sensor data and log data and getting it into a stream, or something more advanced where you've got different models running at the edge, cameras connected to GPUs, different accelerators there. And you'll see there are a lot of different people involved in these projects, lots of framework and language options, connecting to different clouds. Most of you out there are probably some kind of engineer. To be a cloud data engineer involved in these edge things, you're going to need Python, which comes up a lot, or Java. Both of those work really well within this pattern. And you should know a little SQL.
My cat is always messing with the camera.
He likes the tools that we use. He thinks I'm spending
too much money on all these devices, but what can you do?
AI, obviously we can run that within NiFi, within Pulsar Functions, within Flink at the edge in these different agents, and at some point it's going to run this whole talk for me and fix any typos. So hopefully we'll see that happen. A lot of different projects. If I listed all the Apache projects, we could probably spend the whole 30 minutes; there are so many beneath the covers, so many that work together. Calcite is used everywhere. You got BookKeeper, you got ZooKeeper, you got different registries, Tika, a whole bunch of different projects. OpenNLP, which I've been playing with some more, has some good things in there. Iceberg. All of them work together.
But these are the main things we need to build streaming applications that connect from the edge all the way to the cloud, or wherever your enterprise is centered, even if it's federated between multiple availability zones on premises; wherever this is running, those all connect together. You're probably not running Flink on the edge. You certainly could, especially on one of these Jetson boxes or even one of the beefier ones, but Flink is usually for downstream. Again, NiFi and MiNiFi are probably running at the edge. Maybe Pulsar too, or Pulsar might be up a level. How you decide where you put those connections is a talk for later, and David has a couple of those out there if you want to take a look. But Flink is really nice for taking billions of data points, streaming them together, and running real-time things on them. It can also support batch if you need to do some batch things. Sometimes people will join against a batch source, like, say, Kudu or a relational database. You can use that to augment what you're doing within Flink, and I'll show you a little bit of Flink SQL.
What's nice is that it scales as big as you need to go; it scales out really solidly. That's the second part of Flink I like. The first part is they all work together well, then you've got this scale-out, and the third is the SQL; it is easy to start. Even if this is all running in one Docker setup on a laptop, where I've got Flink, NiFi, and Pulsar together, it can still handle all the data engineering I need, from a bunch of devices through to the final use case of having real-time analytics. Important piece of the puzzle there. And like I said, that Flink connector makes it very easy for me to stream the data in from Pulsar regardless of the mode it's in. It could be in generic Pulsar mode, could be in Kafka mode; then stream it downstream into a sink when you're done. So I do my analytics, which can be as easy as possible in SQL, and then insert the results later, which could be aggregates, summaries, or windows of data, and do this at whatever scale you want with simple SQL. I'll show you a little bit of that running in our short talk here.
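To make that concrete, here's a minimal PyFlink sketch of the kind of Flink SQL described above: read sensor events from a Pulsar topic and compute one-minute windowed aggregates. The table name, fields, and connector option keys are illustrative assumptions, not the exact demo code; the Flink-Pulsar connector's options vary by version, so check the docs for yours.

```python
# Minimal sketch: a Pulsar-backed source table plus a windowed aggregate.
# Table/field names and connector option keys are illustrative assumptions.
from pyflink.table import EnvironmentSettings, TableEnvironment

t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

# Source table backed by a Pulsar topic carrying JSON sensor readings.
t_env.execute_sql("""
    CREATE TABLE sensor_readings (
        device_id  STRING,
        co2        DOUBLE,
        tvoc       DOUBLE,
        event_time TIMESTAMP(3),
        WATERMARK FOR event_time AS event_time - INTERVAL '5' SECOND
    ) WITH (
        'connector'   = 'pulsar',
        'topics'      = 'persistent://public/default/sensor-readings',
        'service-url' = 'pulsar://localhost:6650',
        'format'      = 'json'
    )
""")

# Simple windowed aggregate: average CO2 per device per minute.
t_env.execute_sql("""
    SELECT device_id,
           TUMBLE_START(event_time, INTERVAL '1' MINUTE) AS window_start,
           AVG(co2) AS avg_co2
    FROM sensor_readings
    GROUP BY device_id, TUMBLE(event_time, INTERVAL '1' MINUTE)
""").print()
```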
Then next up, I'll let the expert go into Pulsar for you.
Thanks, Tim. So yeah, Apache Pulsar fits nicely into that stack. As Tim mentioned, NiFi is the edge technology that brings in all the information, but you need a nice place to buffer this information before it gets processed by a processing engine like Flink, and Pulsar serves that exact purpose. It can scale up; in just one particular use case, ten petabytes of data per day coming in. So all the data you could ever need, coming into a truly elastically scalable platform that integrates nicely with all the open source tools, as Tim mentioned: Spark, Flink, NiFi, everything else. So think of it as an infinite buffer, an infinite streaming storage layer that has multiple protocols layered on top. It speaks its own native Pulsar protocol; it also speaks Kafka, it also speaks MQTT, it also speaks RabbitMQ. So it's a great way to bring data in from different sources, put it in a single place, and then expose it up to something like Flink for your processing as well.
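Since Pulsar can speak MQTT through its protocol handler (MQTT-on-Pulsar, or MoP), a device that only knows MQTT needs nothing Pulsar-specific. Here's a minimal sketch, assuming a broker with the MQTT protocol handler enabled on the standard port; the host, topic, and payload are hypothetical.

```python
# Minimal sketch: publish a sensor reading to Pulsar over plain MQTT.
# Assumes the broker has the MoP protocol handler enabled on port 1883;
# host and topic are hypothetical. Uses the paho-mqtt library
# (paho-mqtt 1.x constructor; 2.x also needs a CallbackAPIVersion argument).
import json
import paho.mqtt.client as mqtt

client = mqtt.Client()
client.connect('pulsar-broker-host', 1883)

reading = {'device_id': 'rpi400-1', 'co2': 412.0, 'tvoc': 9.3}
client.publish('sensor-readings', json.dumps(reading))
client.disconnect()
```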
Yeah, it's hard to overestimate something that can run more than a million topics, which, if you're starting to do this, you'll see why that's a big deal. It's actually 10 million now, Tim; I've updated that to 10 million. Yeah, we've done 10 million now. Yeah, it's great. So, everybody,
why Apache Pulsar? Everything Kafka can do, but better and more, is how I describe it. It came with unified messaging and streaming on day one, so we actually support queuing semantics, streaming semantics, infinite message retention, tiered storage, offloading, capabilities that you still don't have in Kafka like dead letter queues, scheduled message delivery, and multi-protocol support. Easily scalable, no data movement: as you add capacity to your cluster, you don't have to move the data; it's rebalance-free. And 10 million topics, and soon we'll do ten x that again once we get Oxia fully up and running; that's our ZooKeeper replacement. There's been a lot of buzz in the Kafka community around finally getting rid of ZooKeeper, and that's great. We've done a similar process, but we rearchitected it entirely to make it more scalable as well. And then geo-replication was the first use case. It was built at Yahoo in 2012 to replicate data in multiple directions, a bidirectional mesh. Multi-tenancy was built in on day one, and then encryption, all the cool goodies you'd need for a full-featured enterprise system. So it's not just a toy; it's truly scalable. And you can scan that QR code and find out more about all the different capabilities of Pulsar. 100 million coming.
I hope I don't have 100 million on my next project.
I don't even know how many Kafka clusters you need.
I wouldn't want to run that many. But it allows you to do things like a topic per device. So if you have a large number of devices in the IoT space, you don't want all their information commingled. You're no longer limited by the platform in deciding how you structure your data. I think that's the big win there. Yeah, that's pretty cool. And not just one per device, maybe one per device sensor. Even my little device I'm running for the demo here has two sensors on it, so maybe each one gets its own topic. And then I could join them together with Flink.
Yes, absolutely. I don't know if I want to write a SQL query that has 10 million topics in it and then joins them all. I don't know. That might be a little much. That might be a bit much. There might be a nice way to divvy it up, though. There are ways to aggregate. I mean, that's always the thing: if I put everything in one topic, then I don't have to join, but then there's a whole lot in that one topic, and that could be a bottleneck. If I have a million topics, like if I want to aggregate all those sensors, I could do it in NiFi. I don't know. There you go. That's when you've got to balance it: how many topics do I want? I don't know. No limitations, though; that's the key. Whatever makes sense for your architecture, you pick the best. And Flink plus Pulsar is so mature at this point, we're talking three or four years in, so it's pretty solid.
And the Flink versions and the Pulsar versions are always incrementing; everything's getting better. So it's a pretty nice way to connect there. And how do we get data in there now? Most commonly you can do it with any kind of code, because there's Pulsar support for a lot of different languages. So sometimes I'll just have a Python app at the edge push right into Pulsar, and that can use the native Pulsar library that installs on all these devices. Or some of these devices are hard-coded to push out MQTT; we can just have Pulsar listen that way. So you've got some options.
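For that first option, here's a minimal sketch of a Python edge app pushing straight into Pulsar with the official pulsar-client library; the service URL, topic, and payload fields are illustrative assumptions.

```python
# Minimal sketch: push a JSON sensor reading into Pulsar from Python.
# Requires `pip install pulsar-client`; URL, topic, and fields are illustrative.
import json
import pulsar

client = pulsar.Client('pulsar://localhost:6650')
producer = client.create_producer('persistent://public/default/sensor-readings')

reading = {'device_id': 'rpi400-1', 'co2': 412.0, 'tvoc': 9.3}
producer.send(json.dumps(reading).encode('utf-8'))

client.close()
```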
But often I'll have something like MiNiFi, which is a small NiFi agent, running on that device, just to make it easier for me to manage what's going on. One of the reasons I want NiFi or MiNiFi, besides it being easy to work with (and we'll show you that in the demo), is that it has some features that are really nice for picking up data. If you've ever used any kind of logging agent, they're usually pretty simple. Maybe you're setting up some YAML or XML or JSON, some kind of configuration, but they're designed to run just at the edge. You're not going to have a scalable cluster, not going to have something that guarantees delivery, has built-in back pressure and prioritized queuing, or lets you change the QoS. Built-in data provenance, hundreds of different controls, lots of different sources, version control, DevOps: all those things you might want with a scalable architecture. Maybe we're the last people still using ZooKeeper, though; it depends what environment you're in. That could also be done differently in Kubernetes, but it's pretty straightforward. It's just designed so you don't lose data. They keep adding new ones; I'll show you a little bit of NiFi 2.0 and the additional record readers, so I can read all these types of data, convert them into a format that's easier to use within Pulsar and Flink, like Avro, and then use that to join data together. I've got a number of articles out there if you're interested in using the different data types.
This is the one from the demo today, the Raspberry Pi 400, which is cool because it's got the keyboard. I don't know why they didn't put a screen with it, but I added a very tiny screen that has my IP on it. Nothing too valuable there. But data from the edge, we get that into Pulsar so we can start doing things with it, and there are a lot of different options there.
My flow for this particular example is a MiNiFi agent going over HTTP into NiFi. NiFi does the cleanup and just gets the data into a topic. Flink does the simplest possible SQL, gets it, and can push it anywhere. I mean, it could go into another topic and then someone else can consume it. You've got a lot of options there, and I'll show you some inside the demo. I just wanted to show you different examples, and there are lots of different sources of data that NiFi, Flink, and Pulsar can read; it's not super hard. And then once we get it out there, it's very easy to distribute the data for people to write apps or whatever you want. But that's all the slides. Let's see if we've got things running here. Hopefully things haven't timed out. Seems to be okay here.
So this is NiFi. This is NiFi 1.24, which is a newer version, but not one of the 2.0 ones. I've got a controller here that is receiving HTTP calls. NiFi has the ability to listen to REST endpoints on demand and take any data; you don't have to have it fixed to some schema or some specific class of data. Anybody who wants to call it can just post data on this port and I will consume it. Pretty easy. And then I have provenance on all the data that's come in, and I know how long it was, what type of data it was, what the data was, what the user agent was, plus the data itself, which in this case is JSON. Pretty straightforward.
And then I can process it and route it, and in this case I'm consuming it in here. I paused the live data, so we've got a bunch of data coming in; it's not a limited amount. And then I'm pushing it to Pulsar and elsewhere.
But how did I get that data? Well, on that device, that Raspberry Pi keyboard, I have a MiNiFi agent, and it's running a shell script that runs some Python to grab some sensor data. I set up that agent that you saw there, and then, if the data's not empty, I'm calling NiFi via HTTP and just sending it into that port. So the data is just streaming in.
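The script on the Pi boils down to something like this minimal sketch; the host, port, and read_sensor helper are hypothetical stand-ins for the demo's actual setup, and contentListener is just ListenHTTP's default base path.

```python
# Minimal sketch of the edge script: grab a sensor reading and, if it's not
# empty, POST it to NiFi's ListenHTTP endpoint. Host, port, and read_sensor()
# are hypothetical; 'contentListener' is ListenHTTP's default base path.
import json
import requests

NIFI_URL = 'http://nifi-host:9999/contentListener'  # hypothetical host/port

def read_sensor():
    # Placeholder for the real sensor call (e.g., a CO2/TVOC sensor library).
    return {'device_id': 'rpi400-1', 'co2': 412.0, 'tvoc': 9.3}

reading = read_sensor()
if reading:  # only send when we actually got data
    requests.post(NIFI_URL, json=reading, timeout=5)
```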
If that agent is not running, I'll see what's going on. I can see all the metadata about that particular agent. I can also change it and debug it on the fly if I need to. I'll see all the alerts going on and can sort by them; if I have a ton of different things going on at once, I can see that. I can see if something's offline, and I can deploy new things if need be. I can see if there are new things on that device, if someone's done an upgrade. Lots of different things you can do, and a pretty easy way to do it.
So we've got the data from NiFi streaming into Pulsar, and this is an easy way to run Flink. Behind the scenes it's just regular Apache Flink, and there are no jobs running yet. I'll just start my query here in this UI, and as you can see, it deployed the job already. I only have one node because this is running on a laptop. If I had a massive cluster, it's going to look different; this UI is going to show more resources. But you don't have to code differently, you don't have to write SQL differently. It's just going to take that data in from the table and filter it wherever it makes sense.
And here we're showing a sample of that data as it comes in. What I do with that is I've got a little materialized view that takes a cache of that data and presents it as a REST endpoint, and then I can just put it in a dashboard. But I could have also written this dashboard directly against Pulsar using Pulsar's WebSocket API. Depends what you're doing, really.
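For that WebSocket route, here's a minimal sketch of a dashboard-side consumer using Pulsar's WebSocket API, where payloads arrive base64-encoded and each message is acknowledged by echoing its messageId back; the host, topic, and subscription name are assumptions.

```python
# Minimal sketch: consume a Pulsar topic from a dashboard over the WebSocket
# API. Host, topic, and subscription are illustrative. Requires
# `pip install websocket-client`.
import base64
import json
import websocket

URL = ('ws://localhost:8080/ws/v2/consumer/persistent/'
       'public/default/sensor-readings/dashboard-sub')

ws = websocket.create_connection(URL)
try:
    while True:
        msg = json.loads(ws.recv())
        payload = base64.b64decode(msg['payload']).decode('utf-8')
        print(payload)                                        # update dashboard
        ws.send(json.dumps({'messageId': msg['messageId']}))  # acknowledge
finally:
    ws.close()
```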
I should be joining this to another data source, and there are lots of different things I could join it to, depending on what tables I have out there. Like this one: I'd probably join it to multiple devices that are either the same or similar, maybe looking at ones in similar regions, or from the same area, or that have similar sensors. Like here, I'm looking at carbon dioxide and volatile chemicals in this area, so I might use that to pinpoint things going on there, and maybe join them based on lat/long. I should probably add lat/long based on where each device is; I can add a GPS sensor there, or just set it manually if I know that sensor is not moving. Again, when you're setting up these things, you put them where it makes sense.
And as you can see, more data shows up. This is because another record came off the device, got pushed to NiFi, NiFi did a little bit of cleanup on it, then we pushed it into Pulsar, then Flink got that event, and it shows up here. And then it'll update this materialized view with any updates. And then we can push that to a dashboard, a Jupyter notebook, a regular application, or maybe another Flink app, or maybe a Spark app. I mean, you have a lot of options here. Or it can be pushed into a data sink like Pinot or Hive or Kudu or HBase or MongoDB or a relational database. Lots of options there. Depends on what makes sense for you.
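As one example of those sink options, here's a minimal PyFlink sketch that lands the windowed results in a relational database through Flink's JDBC connector; the connection details, credentials, and table names are illustrative assumptions.

```python
# Minimal sketch: land aggregated results in a relational table through
# Flink's JDBC connector (requires the JDBC connector jar and a driver).
# Assumes the sensor_readings source table from the earlier sketch is already
# registered; connection details and table names are illustrative.
from pyflink.table import EnvironmentSettings, TableEnvironment

t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

t_env.execute_sql("""
    CREATE TABLE sensor_summary (
        device_id    STRING,
        window_start TIMESTAMP(3),
        avg_co2      DOUBLE
    ) WITH (
        'connector'  = 'jdbc',
        'url'        = 'jdbc:postgresql://localhost:5432/iot',
        'table-name' = 'sensor_summary',
        'username'   = 'flink',
        'password'   = 'secret'
    )
""")

# Continuous insert: windowed averages flow into the database as windows close.
t_env.execute_sql("""
    INSERT INTO sensor_summary
    SELECT device_id,
           TUMBLE_START(event_time, INTERVAL '1' MINUTE),
           AVG(co2)
    FROM sensor_readings
    GROUP BY device_id, TUMBLE(event_time, INTERVAL '1' MINUTE)
""")
```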
I want to show you another thing we have here, and this is the latest release candidate for Apache NiFi 2.0, which runs on JDK 21, which is super fast, and which has the ability to run Python. There are a couple of Python processors built in for doing some cool stuff like chunking documents, parsing them, and interacting with ChatGPT; you've got to put your keys in there. Pretty fun. Also pushing to different vector stores like Chroma and Pinecone.
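To give a feel for what those Python processors look like under the hood, here's a minimal sketch of a custom one written against NiFi 2.0's Python extension API; the class name and chunking logic are illustrative, not one of the bundled processors.

```python
# Minimal sketch of a custom NiFi 2.0 Python processor that chunks text,
# using the FlowFileTransform extension API. The class name and chunking
# logic are illustrative assumptions, not a bundled processor.
from nifiapi.flowfiletransform import FlowFileTransform, FlowFileTransformResult

class ChunkText(FlowFileTransform):
    class Java:
        implements = ['org.apache.nifi.python.processor.FlowFileTransform']

    class ProcessorDetails:
        version = '0.0.1'
        description = 'Splits incoming text into fixed-size chunks.'

    def __init__(self, **kwargs):
        super().__init__()

    def transform(self, context, flowfile):
        # Read the FlowFile content, split it into 500-character chunks,
        # and emit them newline-separated to the success relationship.
        text = flowfile.getContentsAsBytes().decode('utf-8')
        chunks = [text[i:i + 500] for i in range(0, len(text), 500)]
        return FlowFileTransformResult(relationship='success',
                                       contents='\n'.join(chunks))
```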
Also, there are a couple of new processors I like, one that'll listen to Slack. So if I post a message to Slack, NiFi will grab it and we'll be able to process data coming from Slack, which is really cool. Another new feature is that I can now take groups of processors and run them as stateless, so they run in their own clean environment, isolated from anything else, and they run from start to finish as sort of a job or function-as-a-service: it runs, completes, you get the logs and results, and then it stops, depending on how you want to schedule it. Here we've got it running in one-minute segments; it depends on how you want it to go. Typically with these you'll do something that's triggered, say maybe I'm listening for S3 changes or a syslog, or something lands in a file system. This will trigger any time a new object shows up, then maybe we run some processing against it, and then we're done.
That just gives you an example; there are lots of different things you can do. I just wanted to show you. The new NiFi 2.0 also has the ability to read parameters dynamically from different providers, like 1Password, a database, Azure Key Vault, or AWS Secrets Manager. Nice way to do that. A couple of new features in the new system are pretty cool: flow analysis rules. This is an area where we needed a lot of work. So for new people who are learning NiFi, you can add rules to their server to dissuade them from doing things that may be problematic. With this one, you can tell them not to use certain component types. NiFi still has a bunch that haven't been deprecated yet, including the ones without records; if you're using structured data, you might not want to use those, because we've added new record readers for Excel, for Windows events, for YAML, for Grok. So you can do a lot there.
Makes it pretty easy. Let's get back to this here. I think we're pretty close to time, and I want to thank David. Hopefully everyone has learned some decent stuff on using the FLiPN pattern, and definitely reach out if you're interested in seeing more on Pulsar, on NiFi, or on Flink. All of these are very cool ways to write apps, and as you can see, there was no magic there. Setting these things up is pretty easy, especially if you use the cloud managed services for them; even if not, within a Docker container or Kubernetes it's a couple of clicks. Things are running, as you saw: drag and drop, some simple SQL, and things are just running for you, and you get your data and do what you want with it.
We can join things together; Flink has got pretty rich SQL here. Here I'm joining two different topics based on something similar, you know; there are lots of different things you can do. You could also use Debezium here, so I can load from relational tables and join them together. Pretty powerful way to do that. And with this I can join things like Pulsar and Kafka together; I can join a database table with a topic. So a lot of options there.
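Here's a minimal PyFlink sketch of that kind of two-topic join, an interval join matched on device_id across the two sensor streams; the table names and the ten-second window are assumptions for illustration.

```python
# Minimal sketch: an interval join across two Pulsar-backed tables, one topic
# per sensor, matched on device_id within a ten-second window. Assumes
# sensor_co2 and sensor_tvoc were registered like the earlier source table;
# names and the window size are illustrative.
from pyflink.table import EnvironmentSettings, TableEnvironment

t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())
# ... CREATE TABLE sensor_co2 / sensor_tvoc as in the earlier sketch ...

t_env.execute_sql("""
    SELECT a.device_id,
           a.co2,
           b.tvoc,
           a.event_time
    FROM sensor_co2 AS a
    JOIN sensor_tvoc AS b
      ON a.device_id = b.device_id
     AND a.event_time BETWEEN b.event_time - INTERVAL '10' SECOND
                          AND b.event_time + INTERVAL '10' SECOND
""").print()
```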
Thanks for coming to the talk. David, do you want to send them out? Yep. Thank you, everyone, for attending the talk. Hopefully you have a lot of fun with the FLiPN stack. Thanks, Tim. Thanks, everyone, for attending.