Transcript
Hello, my name is Duncan Blythe. I'm the founder and CTO of SuperDuperDB.
In this talk, I'll present SuperDuperDB: our vision and mission, the way we work,
the way the technology works, Python snippets showing how
to get started with SuperDuperDB, and a short demo.
So this quotation you see on the screen perfectly describes
what we're aiming to do, and that is to transform
the simplicity and ease with which developers can
get started using AI together with their data.
So a fundamental problem is that data and AI live in separate
silos. Current state of the art methods
for productionizing AI require a lot of data
migration into complex pipelines and infrastructure.
That means maintaining duplicated data in multiple locations
and also various steps and tools, including specialized
vector databases. So as a result, bringing machine learning and AI to production is
very complex. What often happens is that deployments take on a character similar to this
depiction here, where databases are the initial input nodes to a complex graph
of deployments, tools, steps and processes.
And the current trend of including vector databases in this setup only makes things worse.
In 2024, AI, in order to be simple to use, needs to come into contact with data, and our
thesis at SuperDuperDB is that this can be greatly simplified
if we can provide a unified data and AI environment where no duplication, migration,
ETL transformations, MLOps pipelines or extra infrastructure
are necessary. So: one environment combining AI and data, as well as vector search.
So we do that by bringing your AI to your database, your data deployment.
So it's the environment in which data and AI are unified, and that greatly
simplifies the process of AI development and adoption
and allows you to unlock the full potential of your existing data.
With SuperDuperDB, you're able to build AI without moving data.
And when I say building AI, I really mean current state-of-the-art AI:
generative AI, including LLMs and so forth, but also standard machine learning use cases,
as well as custom workflows which consist of combinations of these things.
SuperDuperDB aims to become the centerpiece of the modern
data-centric AI stack. So this is what it looks like. On the left we have
data: databases, data warehouses, your data. We would like to connect
this with AI, vector search and, indeed, the Python ecosystem.
This is currently not possible with full generality; with SuperDuperDB,
it is possible. So SuperDuperDB acts as a centerpiece,
orchestrating and connecting these diverse components.
And when I say this, I really mean that you can bring any piece
of code from the open source ecosystem of Python libraries
and integrate vector search completely
flexibly. It's an all in one platform
for all data centric AI use cases.
So we have a deployment which allows for
inference and scalable model training. You can develop models
in combination with the platform, putting together
very complex workflows, and you can also use the platform for
search, navigation and analytics. And that also includes
modern document Q&A systems.
So it's built for developers with the ecosystem in mind, and it allows you to integrate any AI
models and AI APIs directly with your database.
And so we can leverage the full power of the open-source
ecosystem for AI and Python, and that is substantial.
We're open source, licensed under Apache 2.0, on GitHub. Please take a look,
like, subscribe and contribute; that's very important.
We are very keen to get contributors on board,
improving the quality and the features in the project. So how does it work?
So SuperDuperDB acts in combination with your database's data
deployment, and what happens is you can install models and AI APIs and configure these
either to perform inference on your data as data comes in, or indeed to fine-tune themselves
on your data. Developers can interact with the system from
Jupyter notebooks and Python scripts, and we're also working on other SDKs,
so these will hopefully come in the upcoming months.
We're also working on REST APIs so that you can
easily integrate this with your applications downstream.
You can also easily interact with the system directly,
since we're Python-first, with fast APIs.
So now let's get a little more technical. What does
the underlying architecture of a SuperDuperDB system look like?
So you'll see later that SuperDuperDB is a Python package, but at
the same time it's a deployment system.
In order for your Python package to operate,
it's interacting with the components you see on this diagram.
So there are various components. The
principal component is the data backend, which corresponds to your
traditional database. We also have a metadata store and an artifact store;
these are for saving information about models and for actually saving
model data. These three things together get wrapped in the db variable
that you'll see in the subsequent code snippets.
Work is carried out by a scalable master-worker scheduler; we're using
Ray to do this, and work is submitted to the system
either directly from developer requests or from
a change-data-capture daemon which listens for incoming data.
And you can also set up a vector search component and
that interacts with the query
API. So when you select data, you can optionally link
the query with the vector search
component. So that's still a very high level
view of what's going on. Suffice to
say you can read more about this on our docs and get
into as much detail as you like by exploring our examples
and exploring the code base. So let's have a look
at the code. To connect to SuperDuperDB, you simply
wrap your standard database URI with our wrapper,
and you get an object db with which you can do many of the standard
things you would do with a database client, but much, much more. So that's,
in a sense, the 'super-dupering' of your database.
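In code, that connection step looks roughly like this (a sketch; the URI is a placeholder, and argument names may differ between SuperDuperDB releases):

```python
from superduperdb import superduper

# Wrap a standard database URI to obtain the db object used throughout this talk.
# The URI below is just a placeholder; any supported backend URI goes here.
db = superduper("mongodb://localhost:27017/documents")
```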
In order to query your system, it's very similar to doing a standard database query.
What you do is simply connect with the backend of your choice, in this case MongoDB,
and execute your query object. So the query object is here between the brackets,
and this is completely analogous to a standard PyMongo query, for instance.
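As a rough sketch (the collection name and filter are placeholders, and the query-builder import path has moved between releases):

```python
from superduperdb.backends.mongodb import Collection

collection = Collection("documents")

# Execute a MongoDB-style query through the db object; this mirrors PyMongo's find().
results = db.execute(collection.find({"category": "news"}))
for r in results:
    print(r)
```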
SQL is a bit more involved, because you need to set up a schema first and
add a table to the system. But after you have this table,
we can perform SQL queries via the Ibis library.
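A sketch of roughly how that looks with the Ibis integration (the import paths, field types and query are assumptions for illustration and vary between releases):

```python
from superduperdb import Schema
from superduperdb.backends.ibis import Table  # import path is an assumption
from superduperdb.backends.ibis import dtype  # likewise an assumption

# Register a schema and table with the system first.
schema = Schema("my-schema", fields={"id": dtype("str"), "text": dtype("str")})
table = Table("documents", schema=schema)
db.add(table)

# Afterwards, Ibis-style queries run through the same db object.
results = db.execute(table.select("id", "text").limit(10))
```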
You can create custom data types, and that allows you to do much more than you
would do with a standard database. For instance, here I'm creating
an MP3 data type, and the way this works
is that you have this encoder object and you tell
your encoder how it should handle the bytes from a data point. So here we're just doing a
simple encoding via pickle, but you can do whatever you like. And that's a theme in SuperDuperDB:
you can define whatever functionality you need via our
system of wrappers, encoders and connectors.
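A minimal sketch of such a custom data type (the Encoder constructor follows the API as I understand it at the time of the talk; newer releases have renamed some of this):

```python
import pickle
from superduperdb import Encoder

# An MP3 data type: tell the encoder how to turn a data point into bytes and back.
# Here we just pickle, but any encoding logic can go in these two functions.
mp3 = Encoder(
    identifier="mp3",
    encoder=lambda x: pickle.dumps(x),
    decoder=lambda x: pickle.loads(x),
)
db.add(mp3)
```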
So creating a model is very flexible. Here,
for instance, is a very simple model just involving a regular expression.
So we have to think of models in SuperDuperDB not just
as a PyTorch model or a Hugging Face model, but really as a
generalized computation. This computation can have auxiliary data, and can be trainable or
not; the sort of sense of the whole project is to link different types of computation
together, which might or might not involve traditional
AI models. So here, for instance,
we import the ObjectModel wrapper, give it a name,
'my-extractor', and we pass in, as the heavy-lifting component of
this model, a mapping which extracts URLs from an input string.
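A sketch of that kind of model (the regular expression and keyword names are illustrative):

```python
import re
from superduperdb import ObjectModel

# A "model" here is just a generalized computation: a callable wrapped with a name.
url_pattern = re.compile(r"https?://\S+")

my_extractor = ObjectModel(
    identifier="my-extractor",
    object=lambda text: url_pattern.findall(text),  # extracts URLs from an input string
)
db.add(my_extractor)
```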
You can go much further than this. For instance,
here we are using spaCy to parse text and do essentially named-entity
recognition on that text.
SuperDuperDB handles saving these diverse bits of data and code into the
system, so you're then able to use spaCy to do the parsing.
And the cool thing about this is that there's no necessity
for us, as SuperDuperDB maintainers, to have already built this spaCy integration;
you can really just bring it to your SuperDuperDB deployment.
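For illustration, wrapping a spaCy pipeline could look roughly like this (the en_core_web_sm model and the entity mapping are my own example; the wrapper usage mirrors the previous sketch):

```python
import spacy
from superduperdb import ObjectModel

nlp = spacy.load("en_core_web_sm")

# Wrap an arbitrary third-party library as a model: here, named-entity recognition.
ner = ObjectModel(
    identifier="spacy-ner",
    object=lambda text: [(ent.text, ent.label_) for ent in nlp(text).ents],
)
db.add(ner)
```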
And it can go even deeper than this. For instance, you could implement something like
custom API request-handling logic: exactly how individual data
points (this would be the predict function) or multiple data points are handled by your model.
So it's completely versatile and completely flexible. You'll see throughout that we use
the dataclass decorator around our classes, and the reason for
this is that it creates a very nice way to expose these models to REST API
functionality, so that you can then nicely build front ends on top of this.
So applying a model to data in the database is simple via predict_in_db:
you simply say which key you would like to operate over
and also what data you would like to select,
and SuperDuperDB will then, under the hood, load the data, efficiently pass it
through your model and save the outputs back in the database.
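A sketch of that step, assuming a predict_in_db-style call as described here (the argument names are my assumption and vary a little between releases):

```python
from superduperdb.backends.mongodb import Collection

collection = Collection("documents")

# Apply the model (defined above) to the "text" key of every document matching
# the select; outputs are written back into the database under the hood.
my_extractor.predict_in_db(
    X="text",
    db=db,
    select=collection.find({}),
)
```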
And this can actually be done in a sort of asynchronous, streaming fashion, where you
don't necessarily even need to activate the model yourself. The model essentially takes on
a life of its own via the Listener wrapper: you wrap your model with a Listener
and tell the Listener to listen to a certain query.
That means the system will then listen for incoming data
on this query and, when it comes in, apply the model to that data and populate
the database with outputs over that data.
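Roughly, that looks like this (the key and query are placeholders):

```python
from superduperdb import Listener
from superduperdb.backends.mongodb import Collection

collection = Collection("documents")

# The listener watches this query; whenever matching data arrives,
# the wrapped model is applied and its outputs populate the database.
listener = Listener(
    model=my_extractor,  # the model defined in the sketch above
    key="text",
    select=collection.find({}),
)
db.add(listener)
```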
And a vector index operates together with this Listener
component. A vector index in itself needs to always be up to date;
that's why it operates together with a Listener. So you wrap a Listener with
a VectorIndex, and you instantly make the data underneath
this select query searchable.
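A minimal sketch of that wrapping (my_embedding_model is a hypothetical model returning a vector per data point; the identifier is arbitrary):

```python
from superduperdb import Listener, VectorIndex
from superduperdb.backends.mongodb import Collection

collection = Collection("documents")

# Wrapping a listener in a VectorIndex makes the data under its select searchable,
# and keeps the embeddings up to date as new data arrives.
db.add(
    VectorIndex(
        identifier="my-index",
        indexing_listener=Listener(
            model=my_embedding_model,  # hypothetical embedding model
            key="text",
            select=collection.find({}),
        ),
    )
)
```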
Creating more complex functionality where multiple
models interact happens via the Stack API.
So you can simply list the components you would like
to add to your stack and, as before, add the stack to the system.
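As a rough illustration of the Stack API as described here (the component name, import path and fields are assumptions and may differ between releases):

```python
from superduperdb import Stack  # name and import path are assumptions

# Bundle several components (models, listeners, vector indexes) into one stack,
# then add everything to the system in a single step, as before.
stack = Stack(
    identifier="my-app",
    components=[my_extractor, listener],  # components defined in the sketches above
)
db.add(stack)
```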
You can even parameterize these stacks in order to make a higher-level
interface to your AI functionality.
What you would do then is essentially perform surgery on your
stack, replacing certain values with variables; those variables then
become available as parameters in the higher-level app API.
So now this app has two free variables, identifier and select.
From this app, I can very easily share high-level AI functionality. And how do I share it?
Simply export it, and SuperDuperDB will then inspect the model or models that
you've created and produce a very nice JSON-serialized format
with references to artifacts.
So we're now going to see a demo of this system. Imagine
you have a library of video recordings,
and you would like to search these videos using
natural language for important scenes.
These could potentially be sensitive recordings,
so you don't necessarily want to send requests off to
externally hosted APIs. In this case, what we can do with SuperDuperDB,
with very few Python commands, is simply add the videos to the database,
specifying only the URIs of where the videos are located.
We can create our own custom model to extract and subsample
video frames from the videos, vectorize these
frames using computer-vision models via PyTorch,
and then, once this is set up, we're ready to search through these
frames and return answers to queries such as
'Show me scenes where the main actor throws a ball', as
just a simple example. We'll get results
in the form of references to places
in the video where this may have happened. So this
is just one example of what you can do with SuperDuperDB; there are
numerous examples which you can see on our website.
Suffice to say that if you can think of it, you can probably do it. So let's start.
So this is a Jupyter notebook, and we're going to be interacting with SuperDuperDB
from this notebook.
Let's connect: we're connecting to MongoDB,
and we're going to use the collection 'videos' to store the data about the videos.
Now we have this db connector.
Let's create a special data type, 'video_on_file',
which essentially tells us where our videos are.
You'll see here that we have a URL of
a video; let's have a look at what it is.
It's a video of lots of different animals. We
could potentially add multiple videos here, but we're just going to add this one
to the database. So you'll see that adding a video
from a public URI, or potentially an S3 URI,
is a simple matter of inserting that URI.
Under the hood, the system is actually downloading and caching
the data so it can be used in computations.
All of this happens automatically; you don't need to specify anything.
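In code, those two steps look roughly like this (the datatype constructor and the Document-based insert are assumptions based on the demo's description; the URL is a placeholder):

```python
from superduperdb import DataType, Document
from superduperdb.backends.mongodb import Collection

# A data type for videos stored on file; the constructor details are assumptions.
video_on_file = DataType("video_on_file", encodable="file")
db.add(video_on_file)

videos = Collection("videos")

# Insert a document that only references the video's URI; the system downloads
# and caches the underlying data automatically so it can be used in computations.
db.execute(
    videos.insert_one(
        Document({"video": video_on_file(uri="https://example.com/animals.mp4")})
    )
)
```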
So we can see here that we now have a single document in
the collection which contains the reference to the URI
and the data on file. You can see that in more detail
here: there is the cached video.
So now I'm going to use the OpenCV
library to create my own custom model
which takes the data from this video and subsamples frames,
saving those frames back in the database.
So you'll see the logic here isn't important; suffice to say that I can create
any custom logic I like.
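For illustration, a frame-subsampling model along these lines could look as follows (the sampling rate, field names and wrapper usage are my own assumptions; only the OpenCV calls are standard):

```python
import cv2
import PIL.Image
from superduperdb import ObjectModel

def get_frames(video_path, every_n=30):
    """Subsample frames from a video file, returning PIL images with timestamps."""
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 30.0
    frames, index = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if index % every_n == 0:
            image = PIL.Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
            frames.append({"image": image, "current_timestamp": index / fps})
        index += 1
    cap.release()
    return frames

frame_extractor = ObjectModel(identifier="frame-extractor", object=get_frames)
```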
So now that model has been created,
let's apply the model to the data in the
database. So you'll see it's iterated through
the frames. And now we've actually extracted one data
point here and verified that a frame has
been saved in that data. So you can see that in more
detail here. If I take one document out of the outputs
collection, you'll see there's actually a Python native image
in there which we extract with
this execute query. And now I
would like to make those frames searchable.
So we're going to use a PyTorch model which
is imported from OpenAI's CLIP package.
This is a self-hostable model, and
we're actually going to use two model components: one
for the visual part and one for the textual part.
Now that those models have been set up, let's wrap them
with a vector index. We're going to create a vector
index which is essentially multimodal, so it has an indexing listener and a
compatible listener. That means the images can be searched either with a textual
interface or an image interface.
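A sketch of those two steps, assuming OpenAI's open-source clip package and the wrapper usage from the earlier sketches (the collection name, keys and the active flag are assumptions; the CLIP calls themselves are standard):

```python
import clip
import torch
from superduperdb import Listener, ObjectModel, VectorIndex
from superduperdb.backends.mongodb import Collection

# Load CLIP locally (self-hostable): one component for images, one for text.
model, preprocess = clip.load("ViT-B/32", device="cpu")

def embed_image(image):
    with torch.no_grad():
        return model.encode_image(preprocess(image).unsqueeze(0))[0]

def embed_text(text):
    with torch.no_grad():
        return model.encode_text(clip.tokenize([text]))[0]

frames = Collection("_outputs.video.frame-extractor")  # collection name is illustrative

# A multimodal vector index: the indexing listener vectorizes the stored frames,
# and the compatible listener lets plain-text queries hit the same index.
db.add(
    VectorIndex(
        identifier="video-search",
        indexing_listener=Listener(
            model=ObjectModel("clip-image", object=embed_image),
            key="image",
            select=frames.find({}),
        ),
        compatible_listener=Listener(
            model=ObjectModel("clip-text", object=embed_text),
            key="text",
            select=None,
            active=False,  # text side is query-only; this flag is an assumption
        ),
    )
)
```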
So now the vector index has been set up and the images
have been vectorized; let's search through those frames.
Let's look, for instance, for 'elephants in the woods'. So this
query here is searching the outputs collection
using the search term, referring
to the index that we've created, and we're able
to extract the timestamp from the results. These are simply MongoDB
documents returned in the results.
And once you have that timestamp, we can actually find
the position in the original video.
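The search itself is a vector-search query chained onto a normal find; roughly (collection and field names follow the sketches above and are illustrative):

```python
from superduperdb import Document
from superduperdb.backends.mongodb import Collection

frames = Collection("_outputs.video.frame-extractor")  # illustrative, as above

# Find the frames whose embedding is closest to the text query,
# then read the timestamp back out of the matching documents.
results = db.execute(
    frames
    .like(Document({"text": "elephants in the woods"}), vector_index="video-search", n=3)
    .find({})
)
for r in results:
    print(r["current_timestamp"])  # field name follows the frame-extractor sketch above
```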
So let's confirm: elephant. We searched for 'elephants in the woods',
and now we have an elephant in the woods.
Let's check this wasn't a fluke: 'monkey is playing'.
So there you have it: in very few Python commands,
videos are searchable, completely configurable,
self-hosted. You configure all steps of
logic yourself via Python, following this template, and
save the results in SuperDuperDB.
Would you like to know more about SuperDuperDB? Then
find us on GitHub at superduperdb/superduperdb.
You can check out our documentation and example use cases
at docs.superduperdb.com, and try
out the code with a simple pip install: `pip install superduperdb`. Happy coding.