Abstract
In machine learning, e.g. recommendation tools or data classification, data is often represented as high-dimensional vectors. These vectors are stored in so-called vector databases. With vector databases you can efficiently run search, ranking, and recommendation algorithms. As a result, vector databases have become the backbone of ML deployments in industry.
This session is all about vector databases. If you are a data scientist or a data/software engineer, this session will be interesting to you. You will learn how you can easily run your favorite ML models with the vector database Weaviate. You’ll get an overview of what a vector database like Weaviate can offer, such as semantic search, question answering, data classification, named entity recognition, multimodal search, and much more. After this session, you will be able to load in your own data and query it with your preferred ML model!
Session outline
1: What is a vector database?
You’ll learn the basic principles of vector databases: how data is stored and retrieved, and how that differs from other database types (SQL, knowledge graphs, etc.).
2: Performing your first semantic search with the vector database Weaviate.
In this phase, you will learn how to set up a Weaviate vector database, how to make a data schema, how to load in data, and how to query data. You can follow along with examples or you can use your own dataset.
3: Advanced search with the vector database Weaviate.
Finally, we will cover other functionalities of Weaviate: multi-modal search, data classification, connecting custom ML models, etc.
Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hello everyone, welcome to my presentation about the vector search engine Weaviate. In this presentation, I will explain what vector search and vector databases are, and I will show you how it works in live demos. I am Laura Ham. I am a machine learning product researcher at SeMI Technologies, and at SeMI Technologies we build the vector search engine Weaviate. You can connect with me on LinkedIn, or you can join the Weaviate Slack channel, which is the Slack channel of the open source community. And today I will talk about vector search and vector databases.
Before I dive into vector databases, I want to explain what structured and unstructured data are and what the challenges of unstructured data are. Then I will explain everything about vector search and vector search engines, and I will then continue with showing this in live demos.
So first, what is a vector database? And before that, what's difficult about unstructured data? If we compare structured data with unstructured data, we see that structured data is what you find in a typical database. A typical database is a relational database, with rows and columns and connections between those tables. So it's mostly quantitative data. In this example, we have an ID number, a name, which is a short string, and a city, which is also a short string. And this is the typical quantitative data in a relational database. On the other hand, we have unstructured data, and that is what I call data that you find in the wild. So that means longer text documents with a lot of unstructured text in them, raw sensor data, images or videos, audio files, and also social media data. And what you see in all those data sources is that there's a long piece of text or an image or video.
And in this data, there's a lot of information hidden. So in one image, there can be a lot of information hidden, but it's difficult to capture that in a traditional structured database. So with structured databases, on the left-hand side, it's easy to search through the data, easy to store it, and easy to draw conclusions from it, because you have quantitative data. But if you have the unstructured data on the right, it's difficult to find the information that's hidden in that unstructured data. So if we take the example of a search engine on top of these data sources: a search engine on structured data, on relational data, is really easy, because it's quantitative data stored in those tables. If you have unstructured data, it becomes a lot more challenging.
So here we have an example. Let's say we have a data set of news articles, and all those news articles have a title and a longer piece of text, which is the actual article. So we have this one article which is about dogs; the title is "The origins of dogs", about how dogs were domesticated. If you then have a search engine and you search with the query, in natural language, "animals", a traditional search engine with structured data or relational data may not be able to find the article about dogs. And this is because we search for the word "animals", while the article is actually not using the word "animals"; it's using the word "dogs". As humans, we know that animals are related to dogs, that dogs are a type of animal, so maybe we want to see this article as a result. But if we don't use any semantic layer or machine learning that understands natural language, we may not be able to find this article with a traditional search engine.
Then if we have a vector search engine, and a vector search engine uses machine learning to understand the semantics behind language, you might actually be able to find this article if you are searching with the query "animals". And that is because, as I explained before, we know as humans that dogs and animals are semantically similar. This is hidden in the English language, and the machine learning model is able to connect those two.
Yeah. Before I explain in detail how vector search actually works, I want to show you that you already know one vector search engine, and that is Google. So Google is a vector search engine too. With Google, you can put in a very abstract question like this example: what color of wine is chardonnay? Google will browse through all the openly available web pages, and it tries to not only find the web page that is relevant to this query, but also extract an exact answer from that web page. So this question is very abstract, but Google is able to find a concrete answer in a piece of unstructured data, and it also does it really fast. So in this case, you put in the query "what color of wine is chardonnay?". Google first browses through all its indexed web pages and finds web pages that might contain the answer. So it found almost 13 million results in less than a second, and it was then also able to extract the answer, "white wine", from a paragraph on one of these web pages. So that's pretty impressive.
And yeah, we're all really happy with that, of course. But Google is only doing this with the unstructured data that is available on the public web, so all the web pages that are accessible by everyone. And that is only a very small portion of the data that exists in the world. That's because most of the data is, of course, in the hands of private companies or of people themselves. And we don't want Google to see what your company has in terms of data. But what if you want to search through your own data? We cannot use Google for that. So then the question becomes: what if you could do the same thing, searching through unstructured data, through your own data, in a simple and secure way?
And that's where open source vector search engines come in. Weaviate is such a vector search engine, and it is open source. And yeah, as it says here: instead of just storing raw data like traditional databases do, Weaviate leverages the power of machine learning models to vectorize data. And what this vectorization means is that machine learning models try to understand what your data is about while the data is stored. And also when you search through it using Weaviate, your search query will be put through this machine learning model, be vectorized, and Weaviate is able to find data objects that are near your search query. So you can do discovery, classification, or similarity search in your data set. So in short, Weaviate is a vector search engine that tries to understand what your data is about. On the right you see a three-dimensional space, and this is a simplified example of how data is stored in a vector search engine or a vector database.
So in this case we have three dimensions; in reality this goes up to 300 or even 3,000 dimensions. And all these dimensions capture some kind of meaning of the data. So on the right we have some words and some images, and these words and images have a meaning to a human, and that meaning determines where the data object is stored in the database.
So for example, on the left we have the word "chicken", and we have an image of a chicken. Those are very close together because they have roughly the same meaning. It's also relatively close to other animals like "wolf", "dog", "cat", and an image of a cat. But all these animals are stored far away from things like an apple or a banana, or the company Apple, because they're not semantically similar. So this is how data objects are stored in a vector database. When you then do a search query, like "animals" as we did before in the example, or in this case the search query "kitten", Weaviate also puts this query in the high-dimensional space, and it returns the data objects that are close to this search query. So in this case, if you search for "kitten", you will see data objects like "cat" or the image of a cat returned. And I will explain this in more detail in the coming minutes.
But first, I want to show you how vector databases are slightly different from graph databases. With graph databases, as you might know, data is also not stored in traditional rows and columns; data is stored in models or classes, and all those data objects can have relations to each other. So if we take the example of news articles again, we might have a data object Article with the title "The origins of dogs". And this article is written by some author and has some category, which are also separate data objects, and there are relations between those data objects. So the article has an author, this person John Doe, and this person wrote some articles, for example this particular article, and this article has a category, "nature" in this case. So this is a graph database.
This is different from vector databases. What you do with vector databases is add high-dimensional vectors to it. So here I have added two dimensions to it. And what happens is that the data objects in the graph database will be stored exactly where the meaning of each data object is. So in this case, we have indexed, for example, the English language, and you see that the category "nature", this data object, will be stored close to, for example, the concept of environment and the concept of animals, but far away from technology and laptop. And the concept of article, or this article about dogs, is close to newspaper, close to dog, and also to cat and animals, because they're all semantically related. And so by adding these vectors, which are essentially just coordinates in a high-dimensional space, you add some context or some meaning to your data, and this allows you to also search through it semantically.
So how does this work from a technical perspective? The first step of doing vector search, or storing data in a vector database, is choosing some kind of model that can map your data to vectors, to coordinates in this high-dimensional space. And for this you can, for example, use machine learning models. So the first step is an encoder model: going from data to a vector, to a coordinate, is encoding. It transforms data into vectors. These are also called retriever models, and one example is dense retrievers. Dense retrievers are deep neural networks, machine learning models that calculate the vector from a piece of text. These can be language models; you'll see an example here on the right. Examples of language models are BERT models or sentence transformer models. You can also do this with images, and one example there is a ResNet-50 model. So all these models can calculate a vector position from a piece of data. You can also do this without using deep neural networks, by using sparse retrievers. Sparse retrievers, like TF-IDF or BM25, don't use machine learning models; they use word frequencies in the documents instead. So they are a bit more lightweight.
So now we have a model that is able to transform data, so natural language or images or videos or whatever, into vectors. The second step is that you can use Weaviate, for example, to import all this data and actually store it in this vector database. So Weaviate looks at all the data objects and uses this encoder machine learning model to vectorize the data, and then it will be placed in this hyperspace, this high-dimensional space. And then you end up with something like this, what I showed before: we have a cat that is semantically related to dog, to animals, and to an image of a cat, but it's far away from apple and banana.
Then if you use this database as a search engine, Weaviate will also take your search query, which you can put in in natural language, or you can search by an image, for example. It will also use this encoder model, or retriever model, to index or vectorize this query, so it also gets a vector position, like you see here on the right. And then it does an approximate nearest neighbor search to retrieve the data objects that are close to this query. So if I search for "kitten", I will get back results that are, for example, "cat" or the image of a cat, and so on. And this is called approximate nearest neighbor search. This works, for example, by calculating the cosine distance between the data objects and the search query: you only want to retrieve the results that are closest to the query. And with a vector search engine like Weaviate, this is done very efficiently. Even if you have millions or billions of data objects, you can still run search queries very efficiently. And this is because it uses, for example, the indexing library HNSW to search through these data objects very efficiently.
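The core idea of ranking by cosine similarity can be sketched in a few lines of Python. The three-dimensional "embeddings" below are made up for illustration; real models produce hundreds or thousands of dimensions, and an index like HNSW approximates this brute-force ranking without comparing the query against every stored object.

```python
import math

def cosine_similarity(a, b):
    # Cosine similarity: 1.0 means identical direction, near 0 means unrelated.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Made-up 3-dimensional "embeddings" for a few stored data objects.
objects = {
    "cat":   [0.9, 0.8, 0.1],
    "dog":   [0.8, 0.9, 0.2],
    "apple": [0.1, 0.2, 0.9],
}
query = [0.85, 0.8, 0.15]  # pretend this is the vector for the query "kitten"

# Brute-force exact nearest neighbor search: rank all objects by
# similarity to the query, most similar first.
ranked = sorted(objects,
                key=lambda name: cosine_similarity(query, objects[name]),
                reverse=True)
```

With these toy vectors, "cat" ranks first and "apple" last, mirroring the kitten example from the slides.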
So to summarize: you have a pretrained machine learning model, for example a BERT or sentence transformer model from Hugging Face, and you have your own data. Then with Weaviate you can index the data objects and store them in this vector database. You can do a search query, and it will do an approximate nearest neighbor search to retrieve the data objects close to this query. So this is how a relatively simple search pipeline looks. And you can extend this pipeline. So far it was just a very simple search, and you use retriever models for that.
But you can extend this vector search pipeline with, for example, reader or generator models. Reader models are models that extract information from the retrieved data objects. That means you can do question answering. With question answering, you put a natural language query, really a question, into your search query, and then Weaviate is able to not only find the relevant data object, but also extract an answer from that relevant data object. And that's done by a reader model on top of the retriever model. Another example is named entity recognition. You can also extend the pipeline with generator models. Generator models use natural language generation to generate an answer, for example from the retrieved data objects. So for question answering with a reader model, it only searches in a data object for a particular answer, but doesn't modify it; it just retrieves a piece of data. A generator model can actually generate language from this data object. So for example, it can summarize a piece of text.
So now that I have explained a bit how this works, the question becomes: how do you use it? How do you interact with such a vector database? Weaviate has two types of API endpoints. It firstly has the usual RESTful API endpoints for CRUD operations, and additionally it has a GraphQL API for intuitive querying. So you can of course retrieve all the data objects with a Get query, and you can do semantic search or, for example, question answering, depending on what kind of machine learning models you have attached.
And I will show you this. There are two demos that are always available for you. The first one is a relatively small data set of news articles; there are fewer than 4,000 news articles in it. So that's really small if we talk about machine learning, but this is just for demo purposes. The second one is the complete English Wikipedia, indexed. So that's millions of pages from the English Wikipedia, and you can also search through it using Weaviate. But for now I will just show you the small data set.
So here I'm connected to this data set. This data set has news articles with natural language text, already indexed in Weaviate. And what you see here, I'll make it a bit bigger, is a graphical user interface where you can query this data set using GraphQL. So I will build this query step by step to explain it. On the left-hand side I build the query; on the right-hand side you will see the results.
So I can do a simple query to get all the articles, the news articles that are in the data set, and I just query the title. It's not "name", it's "title". So this is a really simple Get query, basically just to show you how it works. And on the right you see the result: now I have a list of all the articles that are in there, and I just see the title.
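The simple Get query described here looks roughly like this in Weaviate's GraphQL API, assuming the class is named `Article` with a `title` property, as in the demo data set:

```graphql
{
  Get {
    Article {
      title
    }
  }
}
```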
I can ask for more data properties here, for example the summary of the article. Now, with this query there's of course no semantic search happening, so there's no machine learning happening here. But I can add to this query. Let's say I want to find the articles that are near the concept of animals, like in the example in the slides. I can do a nearText search. nearText is a function that we defined which uses the semantic search principles. And for example I can go for "animals". So now I want to get all the articles that are near the concept of animals, and I want to see the title and the summary.
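A nearText query of the kind built here would look roughly like this (again assuming the `Article` class with `title` and `summary` properties from the demo):

```graphql
{
  Get {
    Article(nearText: {concepts: ["animals"]}) {
      title
      summary
    }
  }
}
```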
And when I run this, on the right I get back all the articles, ordered by relevance to the query. So they are ordered by how close they are in the vector space to the search query. And as you can see, the first result is the example that I used in the slides: "The origins of dogs", a new idea about how dogs were domesticated. And you can see that this article is about dogs and, predating domestication, all about wolves. So it's all about animals, but it doesn't literally use the word "animals". And this is how you can see that you really do a semantic search.
The second result is about Nigerian cattle. So cattle; I'm not sure if the word "animals" is used here, but I don't see it. So Weaviate, or the machine learning model behind it, in this case a sentence transformer model, understands that with "animals" I also want to see results like dogs or cattle, those kinds of things. I can also show you this: I can ask for the certainty. Certainty is a number ranging from zero to one, indicating how closely this data point is actually related to the search query. So in this case you can say that Weaviate is 79% sure that this result is relevant to what I'm searching for.
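The certainty shown here can be requested through the `_additional` field of the same GraphQL query; a sketch:

```graphql
{
  Get {
    Article(nearText: {concepts: ["animals"]}) {
      title
      _additional {
        certainty
      }
    }
  }
}
```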
Okay, so as I showed you, you can extend the search pipeline with reader models, and one of the reader models is a question answering model. So I can also ask Weaviate, or this data set, a detailed question. Here I have an exact question: who is the king of the Netherlands? I'm really looking for one specific answer. I don't want to see the whole data item; maybe I just want to know who is the king of the Netherlands, and I need to ask for the answer here. So one answer is enough.
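With a question answering module attached, this query could be sketched with Weaviate's `ask` operator; `limit: 1` reflects that one answer is enough:

```graphql
{
  Get {
    Article(
      ask: {question: "Who is the king of the Netherlands?"}
      limit: 1
    ) {
      title
      _additional {
        answer {
          result
        }
      }
    }
  }
}
```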
So if I do this, what happens is that Weaviate will still do a Get query: it gets articles that are near this search query. So it will find articles that might contain something about the king and the Netherlands, and it found this result about Dutch royals. And on this result, it uses a question answering machine learning model to extract an exact answer. So here it found the answer, and you can see that it's found right in the first sentence: King Willem-Alexander of the Netherlands. So the machine learning model understands that this king is the king of the Netherlands.
And yeah, there's also the complete English Wikipedia index, as I mentioned. So you can also ask questions of Wikipedia, and it works similarly. You can play around with this if you want; you can find the links in our documentation on weaviate.io.
So this was a very simple example, what I showed you, with just indexing text. But you can do more. You can also use machine learning models for multimodal search, so you can mix media types within Weaviate. This means you can index images at the same time as text, so you can search by an image and retrieve text, or the other way around. So you can mix media types. For example, this is what I showed you before: question answering.
Then, as I explained, Weaviate works with machine learning models to vectorize data. We have multiple machine learning models available out of the box, for example most of the models from Hugging Face, or our own trained machine learning model called the Contextionary. We have models from OpenAI, and so on. But you can also use Weaviate without any machine learning models, if you want to use it as a pure vector search engine. Or you can use custom machine learning models: if you have your own trained machine learning model but want to, for example, scale it or make a search engine from it, you can use Weaviate for that.
And finally, what you can also do with Weaviate, because it uses these vectors, is classification, automatic data classification. For example, kNN classification if you have training data or previously classified data available, or zero-shot classification if you just want to use the context in Weaviate and you don't have any training data available.
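A minimal sketch of kNN classification over vectors (the two-dimensional vectors and labels below are made up for illustration): the unlabeled vector gets the majority label of its k most similar previously classified objects.

```python
import math
from collections import Counter

def cosine(a, b):
    # Cosine similarity between two vectors.
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(x * x for x in b)))

# Hypothetical previously classified objects: (vector, label) pairs.
training = [
    ([0.9, 0.1], "nature"),
    ([0.8, 0.2], "nature"),
    ([0.1, 0.9], "technology"),
]

def knn_classify(vector, training, k=3):
    # Rank the training objects by similarity to the new vector and
    # take a majority vote over the labels of the k nearest ones.
    nearest = sorted(training, key=lambda t: cosine(vector, t[0]),
                     reverse=True)[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

label = knn_classify([0.85, 0.15], training)  # an unclassified data object
```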
So that was my presentation about the vector search engine Weaviate. If you have questions, I'm available in the chat of this conference. Or you can of course join our Slack channel, our community Slack channel, which is very active, and there are a lot of people who can help you out with questions. You can always shoot me an email, of course. And if you want to find out more, you can go to our website, which is weaviate.io.
Okay, thank you very much.