Transcript
This transcript was autogenerated. To make changes, submit a PR.
How's it going? Tim Spann here, principal developer
advocate covering some interesting topics in generative
AI, real-time streaming, vector databases,
unstructured data, lots of cool stuff.
My talk is enriching generative AI with events
in real-time streaming pipelines, because I couldn't fit
any more words in the title and I haven't thought of
a cool, catchy way to put all that into something a little
simpler. So we'll give that a try
through the talk to make this as straightforward as possible.
If you're trying to contact me, or want to know more about
these interesting things we're doing with generative AI,
reach out to me on GitHub, Medium, DZone, lots
of places you'll find me if you look.
I do a weekly newsletter covering a ton of different projects:
open source for data, unstructured data, vectors,
streaming, real-time, IoT, Python,
Java, JavaScript, lots of cool stuff.
Check it out. You don't have to subscribe; you can just check
episodes as they show up in the GitHub.
And we'll be adding more multimedia and multimodal content
real soon. So check that out. So let's get into
it. We're going to build some streaming pipelines. Hopefully everything's
going to be going fast. As you might imagine, if you haven't been
around the last couple of years, there's a lot of different types of data.
Like, a lot more than you'd expect,
because you have the data we've been used to for a while, which is
huge enough; the structured data coming out of databases
and other things is big. But then when you start adding things like
text, images, video, logs, all kinds
of really cool advanced 3D models,
chemical structures, emails, social media data,
all of a sudden there's a lot of data, and that's just growing.
And now with the explosion of things like ChatGPT,
there are some serious use cases for this data,
and it can all come together. I can search,
you know, my text data using an image,
search my videos, and bring all this together, the structured
and unstructured, and, you know, make sense of all the data.
There's going to be some really cool stuff coming out. Right now
there's stuff everywhere, you know,
and it's not easy to manage, especially when, you know,
you may have some cats distracting you. Whether it's local,
in containers or Docker, you're running on Amazon,
you're trying to do video analytics, you're not
sure which cloud you're running on. Got to feed stuff to Slack, grab stuff
from Slack. There's a lot going on. Fortunately there's
some tools out there in the open source to make this a little easier.
Fortunately you don't have to choose between easy and open source anymore.
The power of some open source projects out there is making
this a lot easier. So you don't
have to worry about it. And not just open source, but open source
with a powerful community, where a lot of people
are working together, a lot of people using the same code, which is
great. Big shout out to the Linux Foundation for AI
and Data; that's really making this possible,
especially within the hybrid cloud environments
that we're seeing, with all the different Kubernetes installs and
big environments. I mean, you could start off with a couple of nodes, and
before you know it you've got, you know, billions of vectors out
there, tons of data. But what's nice is it's easy
to get started. I can just work in a notebook (I'll show you that),
reuse code, use it fast,
integrate with all the cool tools you want to use out there, whether it's
something like OpenAI, LangChain, LlamaIndex,
a ton of cool stuff out there, and you can do some
really advanced things. Everyone starts off with maybe a little RAG,
maybe just some text in there, just a really good use case, maybe just
all the documents that you need to search. I do that for myself: all my
articles and different content, so you can find my
stuff really fast, ask it questions, maybe get
better answers than I'd give you. Depends on what the data is.
Lots of other things we're not going to go into, like the difference
between dense and sparse
embeddings, filtering; there are so many features
coming. Whether you're going to just run on your laptop,
run in the big cloud somewhere, use all
the big-name tools, or just use a couple of open source ones,
it's all out there, cool stuff going on,
and a ton of different use cases supported. Again, we won't go into
that, but just wanted to mention it quickly:
there are a ton of different index types. If you use some of the
entry-level vector systems, you'll be like, oh, we've got an index.
And one index is great, until you start looking at different
types of data, different use cases, how you're going to use that data,
how you're going to search it. There are a lot of things to think of when
you start going into production and you start adding a ton of different
data sources. Coming from the world of traditional big
data, you know, it took us a long time to
figure out how am I going to access the data,
what type of data it is, what are the best ways to index,
search, and find data wherever it is?
And that's certainly evolving in the unstructured
realm as well. Being able to support all the
different types of searches you might want to do, not just the top-k ones,
not just grouping or filtering, lots of
different things you can do and combine, and being able to have multi-tenancy
with all these collections and partitions,
is really important once you start moving into production,
whether that's production on premise, in any of the clouds,
wherever you need to run. You're going to need that: some data
just can't live with other data, maybe for performance,
maybe for legal reasons, maybe it just shouldn't
cross the barrier. There are lots of different compute
types, and we'll see these expand as more types of GPUs and
more types of advanced compute come out.
There's already a ton, and fortunately the
open source tools out there are taking advantage of that, which is important.
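For reference, here's a minimal sketch of picking an index type and combining a top-k search with a scalar filter using pymilvus's MilvusClient. This is an illustration, not the speaker's exact setup: the endpoint, collection name, dimension, and index parameters are all placeholders.

```python
from pymilvus import MilvusClient

client = MilvusClient(uri="http://localhost:19530")  # endpoint is illustrative

# Different index types trade off build time, memory, recall, and speed.
index_params = client.prepare_index_params()
index_params.add_index(
    field_name="vector",
    index_type="HNSW",                  # could be IVF_FLAT, DISKANN, etc.
    metric_type="COSINE",
    params={"M": 16, "efConstruction": 200},
)

client.create_collection(
    collection_name="articles",
    dimension=384,                      # must match your embedding model
    index_params=index_params,
)

# Top-k similarity search combined with a scalar filter, the kind of
# mixed search discussed above.
hits = client.search(
    collection_name="articles",
    data=[[0.1] * 384],
    limit=5,
    filter='source == "medium"',
)
print(hits)
```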
Now, if you remember some of the architectures I've
walked through in the past, this is going to look familiar.
The way Milvus works is really similar to some other advanced
systems out there, because there really aren't that many ways to write
these so that they work and scale out across all the different
layers, because you've got to scale out compute separate from storage. If you
were around when we did some of the Pulsar talks, it's going to sound
pretty familiar. You've got workers that are handling querying your
data and indexing your data. What's nice is storing things
out to an object store; even if you're in a small use case,
you can use MinIO, and that works pretty awesome.
Give people access to your data. We've got etcd in there;
again, as part of a cloud native system it makes
sense to use that for your metadata, so you can scale out really
easily. We're using Kafka or Pulsar to distribute
messages between all these different systems. Makes a
lot of sense: it's the type of apps we've been building anyway,
so it makes sense that an advanced system would
use that. We'll go through some of these use cases really fast, but there's
a couple of them you don't think of right away.
Certainly the augmented retrieval; I mean, we all have to do
that now with text chat and blogs and stuff.
But there are things like molecular similarity search. I was
talking to a guy a couple of days ago and he was super excited
about this, because they're trying to find some uses for
some of the materials they have, and that stuff works
out really well. That's really hard to do in most systems,
but we'll dive into that in future talks.
Just a couple of quick slides: show you some cats, scalability,
different types of indexing, support for all the major
languages for clients. You know, it's all out there.
One thing that we've been doing is real-time data pipelines,
because we've seen that certainly some data
is going to be loaded batch like it was before,
especially getting all your existing content,
documents, and websites loaded that first time.
That's going to be a batch, and it should be done pretty carefully.
Make sure you get everything, double-check it.
Certainly that can be done with the same tools, but you know, you've got to
watch that first round. But after that, a new document
comes in, something comes out of Medium, something comes off a Slack channel,
and I need to handle that right away. I want to be able to get this
data, whether it's to get it into
a vector store like Milvus, so I can have it available to
build up a prompt, or I'm enriching it,
transforming it, or it's how I'm getting in a request
to look at the data. I mean, we also have to integrate with whoever
needs to ask questions of the AI,
or even just do a search of those vectors. Very
often that's my final answer right there; I don't have to go any deeper.
Say I want something like, well, how do I, you know,
build this? How do I build my first Milvus app?
Well, I don't need artificial intelligence to do that.
There are a number of great articles that will come up right away,
so you don't have to ask a model for that. But yeah, like I said, building up the
prompts, getting the proper context, so when we do
need to write a question, we write it correctly, but also then
connecting with things like GPTCache, because sometimes that question's
been answered already. Why spend the money or the
time calling out to something like ChatGPT,
or even one of the free models, if I don't have to? Let's save energy,
let's save money, time, network bandwidth.
Don't call out if you don't need to; it's already been done.
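A minimal sketch of that caching idea with GPTCache's drop-in OpenAI adapter, per the library's quick-start (exact setup varies by version, and the question here is just an example):

```python
from gptcache import cache
from gptcache.adapter import openai  # drop-in replacement for the openai client

cache.init()            # default is a simple cache; semantic caching is configurable
cache.set_openai_key()  # reads OPENAI_API_KEY from the environment

# The second identical question is answered from the cache, not the API.
for _ in range(2):
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": "What is Milvus?"}],
    )
    print(response["choices"][0]["message"]["content"])
```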
Let's make it smart. Let's work with whatever we need to
work with quickly, and we can work with unstructured
data with things like NiFi or with Towhee.
There's a ton of different things out there to do it.
But the days of just looking at CSVs,
JSON, Protobuf, Parquet,
Avro, that sort of data, are over.
You're going to be looking at zip files, you'll be looking at every type
of image, every type of document. You know, some of them human
readable, some of them not, some of them binary. You'll be looking at videos,
maybe live streams, sound. I mean, being able
to query your database with audio
is pretty awesome. Being able to have those advanced
UIs that you only thought of as, you know, maybe something Apple can do
with a phone. But now I can have that in my basic applications,
where I just talk to you and you give me the results, I send in
an image, you give me back a document. There are so many
options now; we've only just touched on them. It's pretty awesome. But we
need to be able to work with all these different unstructured
data types, and there are going to be more. I mean, there are probably going to be
some new optimized ones. I'm sure GenAI is going to create
some awesome new format that combines multimedia
in a very rich, easy-to-search,
compressed format. Waiting for that.
If you haven't seen NiFi 2.0, we just got the M3 release, and
it's adding some serious features that make it a very nice open
source tool for the streaming part
of the house, where I'm getting data, regardless of whether it's structured,
unstructured, or semi-structured, to whoever
needs it, at any speed, as quickly as possible.
Often there's Kafka or Pulsar in there; drop that
off to my vector store and we're off and running. What's nice
is being able to leverage Python in additional
places. NiFi now makes Python a first-class
citizen, so you can write full processors in Python
and make them available to your entire NiFi cluster,
which is awesome. You just need Python
3.10 or greater. I mean, if you're not running Python 3.10 or newer,
you should look at your infrastructure, because there are, you know,
security issues out there. Upgrade to the latest
or near latest. Trying to get all these libraries working sometimes
is fun, but fortunately, with Kubernetes
and with other Python options, you can
control some of that. But I wrote one for taking
an address and turning it into latitude and longitude.
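As a rough sketch of what a NiFi 2.x Python processor looks like (hedged: the nifiapi surface has evolved across the 2.0 milestones, and the class name and transform body here are illustrative stand-ins, not the speaker's actual processor):

```python
from nifiapi.flowfiletransform import FlowFileTransform, FlowFileTransformResult

class HelloTransform(FlowFileTransform):
    """Minimal NiFi 2.x Python processor: rewrites each flowfile's content."""

    class Java:
        implements = ['org.apache.nifi.python.processor.FlowFileTransform']

    class ProcessorDetails:
        version = '0.0.1'
        description = 'Uppercases flowfile content; a stand-in for real logic.'

    def __init__(self, **kwargs):
        super().__init__()

    def transform(self, context, flowfile):
        text = flowfile.getContentsAsBytes().decode('utf-8')
        # A real processor would geocode, chunk, embed, etc. here.
        return FlowFileTransformResult(relationship='success', contents=text.upper())
```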
I found I needed this because I wanted to do something
like have someone ask me a question, and before I sent it to
GenAI I wanted to see: is this something
geo-important? Again, that crossing of the data types.
Geo data is structured data,
but I mean, it's real world and it has implications
for a lot of things, so we need to get this part right. So I
have a library that uses some pretty awesome libraries out there,
especially one from OpenStreetMap that does
a pretty good job of getting you to the right place.
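He doesn't name the exact package, but geopy's Nominatim wrapper over OpenStreetMap is one common choice that matches the description; a hedged sketch:

```python
from geopy.geocoders import Nominatim

# Nominatim is OpenStreetMap's geocoding service; the user_agent is required,
# and the string here is just a placeholder.
geolocator = Nominatim(user_agent="address-to-latlong-demo")

location = geolocator.geocode("Wildwood, New Jersey, USA")
if location is not None:
    print(location.latitude, location.longitude)
```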
So that's pretty good. I'm going to show you a couple of different
demos. I don't want to show you too many slides; you'll get these slides
afterwards. What I have is my Medium feed.
I have NiFi read that in, write back to Slack
(and we'll show you some other interactions we can
do with Slack), write some stuff to Kafka and
Flink to do some analytics, and write everything that needs to
be in Milvus there, so I can do RAG later. I can
do a lot of things, especially when people ask me questions about my
articles: hey, let's have a
prompt-enhanced GenAI do that for
me. And then I'll show you a quick little one for images using
the Towhee library, which is pretty
simple. I mean, the more you look into Python, the more you're
like, where were you when I was doing all that Java?
I still like Java, but some of these Python
libraries are amazing. So we're going to show you a little
bit, just a quick one, on working
with image data in a database like it's nothing. Pretty cool.
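A hedged sketch of what that looks like with Towhee (the model choice and file path are illustrative): decode an image and turn it into an embedding vector you could store in Milvus.

```python
from towhee import pipe, ops

# Build a tiny pipeline: image path -> decoded image -> embedding vector.
image_embedding = (
    pipe.input('path')
        .map('path', 'img', ops.image_decode())
        .map('img', 'vec', ops.image_embedding.timm(model_name='resnet50'))
        .output('vec')
)

# Run it on one file; the resulting vector is ready to insert into Milvus.
result = image_embedding('cat.jpg')
vec = result.get()[0]
print(len(vec))
```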
We'll touch on RAG today; you've seen that before.
And I'll leave you some contact info so you can
start doing some cool stuff once we get into the demos.
Hopefully I haven't spent so much time that everything in the world
timed out. You know, I'm trying to run this as
close to live as possible, so it's like you're there with me at the conference.
Hopefully I see you at the next one. So I have a Medium account,
and I have some data there, and very fortunately they'll give you an RSS feed
of your latest couple of articles, which would be awesome if I
didn't have so many articles. So I also download my old
articles and have NiFi load them as well.
But for most use cases it's just the last couple of articles,
because once I have this running, like everything: get that batch
processed, however long it takes, get that into your vector
store, and then, you know, grab current data as
it's changing. So we're just going to grab
some data from there, which is surprisingly easy.
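In pure Python, grabbing a Medium feed is a few lines with feedparser (one illustrative option; the demo itself does it in NiFi, and the handle below is a placeholder):

```python
import feedparser

# Medium exposes an Atom/RSS feed per account.
feed = feedparser.parse("https://medium.com/feed/@tspann")

# The feed only returns the latest handful of articles, ten or so at a time.
for entry in feed.entries:
    print(entry.title, entry.link)
```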
Now we are in NiFi. This you can download
and run on your own, whether it's in Docker, Kubernetes, or just
on a JVM on your laptop. Pretty simple.
So I have some code here that is just going
to grab my feed from Medium.
Now, we could also do this with Python, and I'm looking
at rewriting this in Python to see what the difference is in the amount of code.
You can see here it's Atom-format RSS,
which is basically a type of XML. Again,
another format that won't die; for every new format we get,
three old ones never die either. But here we grab
some fields. So we've got this data, we're going
to grab it. I grabbed it
here in NiFi, and then I have something that converts
that RSS data into JSON,
which is a little easier to work with. Then I
split it on the channel items, and we can look at the data
provenance here and see that happen.
It gives you ten articles at a time.
I grabbed some fields I like here, the most important ones,
and then I'm just going to send that along
to build up a new JSON file. And if I look at this, I've got
the ten results here that I just ran, and
I can take a look at the data. I switch
that over into a format to be parsed, because
it's HTML, and then from there parse
it out into small chunks. And I've got my small
chunks here. Optimizing those chunks is always a fun exercise,
and there are definitely a lot of different ways to do
that. I can see this is part of my article around the Irish
transit system, which is pretty cool. Love Ireland,
nice place to go to. Love castles.
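For reference, one simple way to do that HTML-strip-and-chunk step in plain Python (the sizes and overlap are arbitrary choices, and NiFi has its own processors for this):

```python
from bs4 import BeautifulSoup

def chunk_html(html: str, max_chars: int = 512, overlap: int = 64) -> list[str]:
    """Strip HTML tags, then slice the plain text into overlapping chunks."""
    text = BeautifulSoup(html, "html.parser").get_text(separator=" ", strip=True)
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + max_chars])
        start += max_chars - overlap
    return chunks

chunks = chunk_html("<p>Part of my article about the Irish transit system.</p>")
print(len(chunks), chunks[0][:60])
```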
So what I have here is I'm going to send that
record into Milvus. And what's nice with Milvus
is there's an awesome open source product
called Attu, and this lets you query all your data,
see what's going on. I can see here that I've loaded some articles,
and I can see some other things I have here.
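For reference, a hedged sketch of what that record insert looks like with pymilvus (the collection name, dimension, and values are placeholders):

```python
from pymilvus import MilvusClient

client = MilvusClient(uri="http://localhost:19530")

# One chunk of an article, with its embedding and some metadata.
client.insert(
    collection_name="medium_articles",
    data=[{
        "id": 1,
        "vector": [0.1] * 384,  # embedding of the chunk, from your model
        "text": "Part of my article on the Irish transit system.",
        "link": "https://medium.com/placeholder-article-link",
    }],
)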
There's also security and all that kind of stuff. But the main thing is
we see we've got another record come in, and I'm
extracting the text that came out. I'm just going to push that
to Kafka so that I can distribute it.
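A hedged sketch of that fan-out with kafka-python and slack_sdk (the topic, channel, servers, and token are all placeholders):

```python
import json
from kafka import KafkaProducer
from slack_sdk import WebClient

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
slack = WebClient(token="xoxb-placeholder-token")

article = {"title": "New article", "text": "Extracted article text."}

# Publish to Kafka for any downstream consumer (Flink, alerting, etc.).
producer.send("medium-articles", article)
producer.flush()

# And post the content straight to a Slack channel.
slack.chat_postMessage(channel="#milvus", text=article["text"])
```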
Also, if there's anyone who needs to know that I
published a new article, I can have a Kafka consumer somewhere
that could send out a Slack message, a Discord message,
an email, a fax, I mean,
whatever weird thing you want to send. Also, I'm sending out
a Slack message; I'll probably send out a Discord one,
maybe to the Milvus Discord if it's
an article about Milvus. I might send that
to GenAI to tell me where I should publish this and use
its choice. Let's see where they tell me to send
my articles. Maybe they tell me to throw them away. Okay, we can
see here that I got a new Slack message.
And this is the article, not well formatted,
you know, since it was pulled out of RSS, but it is the
content of the article posted into
my Slack channel for Milvus. And you can see it's got
links and different stuff embedded in there from the article,
all the code, whatever was
in that article, just as a way to distribute
it. And again, we sent some of that content to Kafka,
but that's pretty straightforward. And we'll just let all those
run. I mean, it only does ten at a time, but if
you take a look, I may have more than ten articles.
So when I do the bulk one,
I'll go back and do all of last year,
or, you know, maybe just the most recent articles here.
I probably don't need to see articles from multiple
years ago, though there are a couple of cool ones in there,
like how to automate all the transit systems in the world
with no code. That's kind of cool, I guess, if you're into
that. Okay, so it's running here. I've got a couple more things going into Milvus,
so we'll show you another part of this before it times out. Now, the other
thing I have running here is a universal Slack
listener. And this is tied to my Slack group here,
which is open to all, and I've got the links out there if you want
to join it, if you want to ask questions of my bots, if you
want to ask questions of me. One of these is me; I don't know
which one. I have like three me's on here. Some of them
are bots, some of them are me. See if you can guess. Sometimes I forget
which one's me. Whoever gives you the best answer, that's me.
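A hedged sketch of such a listener using slack_bolt in Socket Mode (the tokens, environment variable names, and the filtering he describes next are placeholders):

```python
import os
from slack_bolt import App
from slack_bolt.adapter.socket_mode import SocketModeHandler

app = App(token=os.environ["SLACK_BOT_TOKEN"])

@app.event("message")
def handle_message(event, say):
    # Skip bot messages and anything outside the one channel we watch.
    if event.get("bot_id") or event.get("channel") != os.environ.get("WATCH_CHANNEL"):
        return
    text = event.get("text", "")
    # Hand the prompt off to the rest of the pipeline here.
    say(f"Got your question: {text}")

if __name__ == "__main__":
    SocketModeHandler(app, os.environ["SLACK_APP_TOKEN"]).start()
```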
So we have one that just came in and it's getting processed,
coming through the system. Now, some of them I throw away, because some
of them are the results of bots like you saw here,
or they're in the wrong channel, or it's just not relevant to anyone.
It would be nice to know the current stock price. Now, that is not
using any AI. I do send everything through
a couple of open source AIs thanks to Hugging Face, but sometimes it just
doesn't make sense. Like, what are they going to tell me about the current
stock price right now? There are plugins and things you could
add to ChatGPT and other tools, and depending on your pipeline,
you know, you can get that data, but just sending that raw
to ChatGPT doesn't
make a lot of sense. One thing I figured out is if I'm going to
let this be open to everyone, I should
check things out: make sure it's only on one channel
(I don't want to look at every channel, especially since I use a lot just
for debugging), and make sure it's not one of my bots.
I really try not to answer one of my bots, but they
do have a lot of users. And the other thing I figured out is to
clean up my prompts. So I found a model
on Hugging Face that works most of the time.
I've got to look and see if I can find a better model, or maybe
train it on my own stuff, because sometimes
it's not sure. Like the not-safe ones:
a lot of these not-safe ones are safe.
You know, it picked a word it didn't like; I thought maybe
it was a curse or something was wrong with it, and usually
there's nothing wrong with it, so I don't know. Or sometimes
it takes text that's really bot text, and it goes, oh,
what? Clearly this isn't a human. So sometimes that's actually
helpful, but other times it's not.
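He doesn't say which model; as an illustration, a toxicity classifier such as unitary/toxic-bert can play that gatekeeper role (a sketch, not his actual pipeline, and label/score semantics differ between checkpoints):

```python
from transformers import pipeline

# Illustrative checkpoint; the talk doesn't name the model actually used.
prompt_checker = pipeline("text-classification", model="unitary/toxic-bert")

def is_safe(prompt: str, threshold: float = 0.5) -> bool:
    top = prompt_checker(prompt)[0]  # e.g. {'label': 'toxic', 'score': 0.02}
    # Only block when the model is confident the prompt is toxic.
    return not (top["label"] == "toxic" and top["score"] >= threshold)

print(is_safe("What's the weather in Wildwood?"))
```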
So we have that coming out. We've got the prompt filtered
to make sure it's asking the right question, and then,
which I think is kind of cool, I send it
to multiple models at once. If you saw my
article: why not four models, doing four models at
once? I've got Mixtral, TinyLlama
(the TinyLlama has been having problems recently), Mistral 7B
(I know, the models change a lot), and Microsoft
Phi-3, which has been pretty good. And then I'm like,
can I call even more models at once? That might be too much for my
system, but I have one also that translates to Spanish.
It's nice to have content in other languages; same for French,
German, Hindi. Give me some suggestions for the languages.
There are a lot of models out there. I try H2O,
Llama 3. So I've got a bunch of
models running, and when they get the results back, if they're
not total junk, I send them out to Slack. And then
I've got a Slack receiver
here, taking anything I send in, and then I'm publishing it.
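A hedged sketch of that multi-model fan-out with huggingface_hub's InferenceClient (the model IDs are examples, not necessarily the exact ones in his flow):

```python
from huggingface_hub import InferenceClient

# Example hosted checkpoints; swap in whatever models your pipeline uses.
MODELS = [
    "mistralai/Mistral-7B-Instruct-v0.2",
    "microsoft/Phi-3-mini-4k-instruct",
    "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
]

def ask_all(prompt: str) -> dict:
    answers = {}
    for model_id in MODELS:
        client = InferenceClient(model=model_id)
        # Each model answers the same prompt; junk answers get filtered later.
        answers[model_id] = client.text_generation(prompt, max_new_tokens=200)
    return answers

for model, answer in ask_all("Who is AMD?").items():
    print(model, "->", answer[:80])
```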
Pretty straightforward, but it gives you an idea. Also, there's some error here;
it might be out of permissions for one of the stock systems.
Yeah, I might have to change
a token somewhere. But what it does is find
out what company your stock symbol is and then send it
on its way. Let's see, maybe we can get weather. Did we get any
results back? Oh, we got some stuff. Like we mentioned,
all those models: Phi-3 gave me nothing,
Mistral 7B told me who AMD was. I guess that's
good. Meta Llama 3,
and then there's the translation. What about the weather?
What is the weather in... I'm going to Wildwood.
It's a nice beach town here in New Jersey. We have beaches,
and we're getting some results back really quick. So that's cool.
Here's the weather from a weather forecast. This is probably the smartest
one to look at. And this is showing me some of
the weather from the Philadelphia area, which is somewhat
close, from one of the stations there, and it gives me the current weather.
And I did some parsing to get, you know,
lat/long from a name, which
everyone wants a lat/long (there are some paid services that don't give you one), but pretty
easily it tells me what the weather is, generally, from some of the other models.
Kind of a cool way to do that. Just to give you an idea,
we can also do some other things in here. Like I can upload
an image, and I often will have
images of my cat. So obviously an image is not
text, it's not a question. It can be a
question if I have the right kind of vector store and I can
say, give me pictures similar to this.
And that's certainly one of the demos we'll do in the future, because that's pretty
cool. So yeah, that works slightly differently.
I have some universal code over here
that gets that
message in from Slack, makes sure that it's actually an
image, downloads it with the proper permissions,
and if it matches, sends it over to image processing,
where something failed; I might have forgotten to
install one of the libraries, so I should take a look at that.
Otherwise it sends Slack reports,
if things make it through the system, depending on, you know,
what that image looks like, whether there are security issues, whatever.
And then it just pops up in our reports here.
Like here, we did a little bit of analytics on it: purple bean
bag chair with a cat on it, and a colorful room. That's
pretty accurate. And if you see, we've had other reports;
I'm doing some things with real-time traffic cameras. That's in future
demos; that'll be the next one, so stay tuned
for the conference in six months. There's a slide on how to
get started with Milvus, just to make it easier for you.
Automate everything. Thank you for coming to my
talk. Hopefully it wasn't too long. If you have
questions, definitely reach out to me.
I am always on LinkedIn and Twitter,
Medium, GitHub,
Discord, wherever cool data is,
I'll be out there. Thanks for watching. See you
next time.