Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hello everyone, I'm Stephen Batifol. I'm a developer advocate at Zilliz, and today I'm here to talk to you about vector databases 101. Vector databases: everyone talks about them, and today my objective is to explain to you how they work, in detail actually. So let's get started. First, what is Zilliz? Zilliz is the maintainer of Milvus, which is an open source vector database. We are also maintaining Knowhere, which is a vector search engine, and GPTCache, which can be very useful if you want a semantic cache for LLM queries. We have VDBBench, which is a benchmarking tool for vector databases. And we have Zilliz Cloud, which is our cloud offering, but I'm not here to talk about that. We also partner with different industry leaders such as AWS, Google Cloud and Azure. We partner with different GenAI tooling, and we also partner directly with some chip manufacturers such as Nvidia or Intel.
But today, here is what we're going to talk about. We're going to start with why vector databases, why you may need one. Then: where do vectors come from, the different use cases, how vector databases work, what similarity search is, and the different indexes. And then I'm going to explain and show you the basic architecture, and hopefully after that you should be able to understand how databases work, and how vector databases work in particular. So let's start with the why.
Why do you think you need a vector database?
Well, one reason is that unstructured data is about 80% of the data worldwide. And the way vector databases work is a bit different from your usual SQL or NoSQL databases. The search pattern, or search type, in a traditional database is that you go to find a very specific result. You go with SELECT * from a specific table, or select some specific columns, because you want a result that is exact. With a vector database, it's a bit different: we're doing a lot of vector comparisons. And if you don't know, vectors are just representations of the data, of the unstructured data. So they are basically a long series of numbers. And vector databases are specifically built to work with this kind of unstructured data, including text, images, videos and audio. And one thing people always tell me, or are always wondering, is: can I just use NumPy? Could I just write my own and then make it work? The answer is that, well, technically yes, you can, but you might encounter a problem once you go to scale. At the beginning it's fine: you only have 1,000 embeddings, it's very small, and the search doesn't take much time. Once you go really high in the number of embeddings, then it can really struggle and take a long time. And then, basically, it doesn't scale.
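Just to make that concrete, here is a minimal sketch of what "rolling your own" with NumPy looks like (random vectors stand in for real embeddings). Every query scans every stored vector, which is exactly why this stops working once you get into the millions or billions of embeddings.

```python
import numpy as np

# Minimal "do it yourself" vector search: brute force over every stored embedding.
# Fine for a few thousand vectors, but each query is a full O(N * d) scan.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(1_000, 384))   # stand-in for vectors from an embedding model
query = rng.normal(size=384)

# Cosine similarity of the query against every stored vector
norms = np.linalg.norm(embeddings, axis=1) * np.linalg.norm(query)
scores = embeddings @ query / norms
top_5 = np.argsort(-scores)[:5]              # indices of the 5 most similar vectors
print(top_5, scores[top_5])
```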
And that's usually my answer when people ask me about vector search libraries in general: yeah, they can work. I'm not going to say you always need a vector database, but you might encounter some problems with search quality. What do you do if you need hybrid search or filtering? Also scalability: what happens when you have billions of vectors? What about multi-tenancy? What about the cost: are you putting everything in memory, everything on disk, everything on S3? Security is a big concern as well: what about data safety and privacy? Also, what if you have to update your data, or update your embeddings? It can be a problem. TL;DR: they usually lack the infrastructure to help you scale, deploy and manage your apps in production. They can be really cool for PoCs, but once you go more into scale and more into production, then they can really struggle. Why not use a SQL or NoSQL database?
Well, there's also a problem here, which is that they're usually inefficient in high-dimensional spaces, which is exactly what you have with vector data. They also have suboptimal indexes: they don't support the different indexes that you might need for vector search. Inadequate query support and scalability can be a problem, you have data conversion issues, and you have different problems like that. So usually vector search and vector operations in general are a bit too compute-intensive for traditional databases. Vector databases, and Milvus in particular, are really built for scale. I won't go into the details of everything, but thanks to vector databases you can have advanced filtering, hybrid search, multi-vector search, and you can also have automatic backups, replication, high availability, aggregations. You have different things that are really good for billion-plus scale, so you can really do those; that's not a problem for us. Then some vector databases, and Milvus in particular, have a GPU index, which can be very useful in some cases: if you have a lot of embeddings, you need very high accuracy and you need to be very quick, then you may want to use a GPU index.
So, as a takeaway: vector databases are purpose-built to handle indexing, storing and querying vector data. And they are usually designed for billion-plus use cases. It doesn't mean you can't use us for smaller scale. It just means that, basically, you can start from the ground up with a couple of hundred embeddings, and then we can help you go to the billion-plus.
But then, where do vectors come from? It's a very good question. They don't just pop up out of the blue. So imagine you have your knowledge base, which can be text, pictures, internal documents, and the data gets passed into a deep learning model. And it's very important that this deep learning model is the right type of model. What I mean by that is that it has to be trained on the right kind of data. If you're working with images, you've got to have a model that has been trained on image data. If you're working with text, it has to be trained on text data. And if you're working with images and you want to identify the type of dog you have, then, well, you've got to have that in the data; it has to be trained on that.
And what happens then is that you take your image data, for example, you run it through a deep learning model, and then you cut off the last layer, because usually the last layer does a prediction, and we're not interested in the prediction; we're interested in what the model has learned about the data. We really don't care about the prediction. What we want is the numerical representation, so that later on, when we work with that kind of data, we have a way to work with it and understand it. And yeah, that's why we cut off before the last layer. And why not take the second layer from the beginning, or one in the middle? Why the last one? It's because as you pass through the model, each layer learns something new, and the second-to-last layer contains all the semantic information without doing the prediction. And that's what a vector is in our context. That's what we use all the time.
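As a rough sketch of what "cutting off the last layer" can look like in practice, here is a pretrained ResNet from torchvision with its classification head replaced by an identity, so the forward pass returns the learned feature vector instead of class scores. This assumes torch and a recent torchvision are installed, and the model choice is just an example.

```python
import torch
import torchvision

# Load a pretrained image classifier, then drop its prediction head so the
# forward pass returns the learned representation instead of class scores.
model = torchvision.models.resnet18(weights="DEFAULT")
model.fc = torch.nn.Identity()               # replace the final classification layer
model.eval()

image_batch = torch.randn(1, 3, 224, 224)    # stand-in for a preprocessed image
with torch.no_grad():
    embedding = model(image_batch)

print(embedding.shape)                       # (1, 512): this vector goes into the vector database
```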
So then, yeah, you have your vectors, you put them in a vector database like Milvus, and then you can query it. And how does it work in practice? Usually you have embedding models, and you have a lot of different embedding models, and they are actually quite important. You have to keep in mind that embedding models will be key for you. You have embedding models that have been trained on very, very specific things. So let's say you're working with English data and it's nothing specific; maybe the OpenAI embeddings will work. But in my case, I live in Germany and we have German documents, so we need embedding models that have been trained on German data. If I do that, then I usually go with Jina AI, because it's quite good. And I would suggest, if you don't know where to start, go on Hugging Face, look at the leaderboards and check the different embedding models they have, because they have so many that it can be really good. And Hugging Face makes it very easy for you to use embedding models as well. So I would say, yeah, maybe try at first with OpenAI, try different models. But those are very important; you have to pick one that fits your data. And how does this work? Here we have a very simple example.
We have a text, which is "happy dog wagging tail". You put it through an embedding model, and it generates a vector. And then you can imagine this vector lives in a vector space. That's very simplified, but you can imagine you have vectors everywhere, in many dimensions. And what happens is that everything that is similar will be close to each other in the vector space. So if you have a building with a small window, and then another one which is a big building, once they're transformed, the vectors will be very close in vector space. And that's the same for other text. And if you have something that is completely unrelated, then it will be far from the others.
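Here is a small sketch of that idea using the sentence-transformers library; the model name is only an example, and any text embedding model works the same way. The two dog sentences should end up close together in vector space, and the unrelated sentence far away.

```python
from sentence_transformers import SentenceTransformer, util

# Example model; pick one trained on your language and domain.
model = SentenceTransformer("all-MiniLM-L6-v2")

sentences = [
    "Happy dog wagging its tail",
    "A cheerful puppy wags its tail",
    "Quarterly revenue grew by 12 percent",
]
embeddings = model.encode(sentences)          # one vector per sentence

# Cosine similarity: the two dog sentences score much higher with each other
# than either of them does with the finance sentence.
print(util.cos_sim(embeddings[0], embeddings[1]))
print(util.cos_sim(embeddings[0], embeddings[2]))
```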
So what about the use cases? We've been talking about why you should use it, but it's more like: when should you use it? It's not all about the embeddings; let's look at the different use cases. Well, the first one, and the main one, is RAG. That's the one that has been the most popular recently, and it means retrieval-augmented generation. If I were to explain it: you want to extend your LLM with knowledge that is external to the LLM. Let's say you work for a company and you want to add this company data to the LLM. Usually you would use RAG for that. So you take all your data, you process it, you put everything in a vector database, and then once you make a query, it goes through the vector database to see if there is something similar, and that gets given back as context to the
LLM. And that's what RAG is.
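To make the retrieve-then-augment idea concrete, here is a toy sketch, again with sentence-transformers as an example embedding model and the documents kept in a plain list instead of a real vector database. The final prompt is what you would hand to whatever LLM you use.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")   # example embedding model

# In a real setup these chunks would live in the vector database.
docs = [
    "Our refund policy allows returns within 30 days of purchase.",
    "Support is available Monday to Friday, 9am to 5pm CET.",
    "The warranty covers manufacturing defects for two years.",
]
doc_vectors = model.encode(docs, normalize_embeddings=True)

question = "How long do I have to return a product?"
q_vector = model.encode([question], normalize_embeddings=True)[0]

# Retrieval step: vectors are normalized, so the inner product is the cosine similarity.
best = docs[int(np.argmax(doc_vectors @ q_vector))]

# Augmentation step: the retrieved chunk becomes context for the LLM.
prompt = f"Answer using only this context:\n{best}\n\nQuestion: {question}"
print(prompt)
```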
But that's only one of the nine usual use cases you may use a vector database for. You have the good old recommender systems: product recommendation is actually one of the most common use cases for Milvus in production.
There, you're really comparing products and users, and then using a vector database to quantify and compare that. Then you have text and semantic search, image similarity search, video similarity search as well, and audio too. You can also have molecular similarity search, which can be really useful for companies working with proteins, for example. Anomaly detection is also a common one: how different are two user actions, how similar are the actions of two users? That's very important when it comes to fraud detection, for example. And then you have multimodal similarity search. It allows you to search over multiple types of data, text and images, and that can be really useful. You can then combine the two: you combine multimodal similarity search with RAG and you have multimodal RAG. That's also very useful. Instead of just giving text to your LLM or your RAG system, you can give an image, which is then transformed, and it can actually return an image, return text, return whatever you want it to return. That can be very, very useful, especially in the era of GPT-4 or Gemini 1.5, with what was being announced; you can actually also do it yourself.
So yeah, those are the common use cases. So then, how does it actually work? Now we're going to go a bit more into the details. We're going to talk about similarity search: how do you do similarity search? So let's go.
We have here an example entry, and this is what you may store in a vector database. On this one, the two most important pieces are the id and the embedding. The id is how the vector database keeps a unique identifier for your entry, and the embedding is the vector embedding; this is what gets compared when we go and search with a vector database. Milvus, and vector databases in general, are a specific kind of database that makes all of this automatic for you: you compare embeddings and you find the best one, the closest one to your query, and that's what we do. Then here you also have a bunch of metadata fields that you can filter on as well, if you want to narrow down your search when you run a query. So let's say you only want publications from Towards Data Science; then you can put a filter for that. If you want publications from a different source, you can put a filter on that one too, and that can give you better search results.
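Just to illustrate how a metadata filter combines with the vector comparison, here is a toy sketch in plain Python. This shows only the concept, not the Milvus API.

```python
import numpy as np

# Toy collection: each entry has an id, an embedding, and some metadata.
entries = [
    {"id": 1, "embedding": np.array([0.1, 0.9]), "publication": "Towards Data Science"},
    {"id": 2, "embedding": np.array([0.8, 0.2]), "publication": "Towards Data Science"},
    {"id": 3, "embedding": np.array([0.2, 0.8]), "publication": "Some Other Blog"},
]

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def filtered_search(query, publication, top_k=2):
    # 1) the metadata filter narrows down the candidates
    candidates = [e for e in entries if e["publication"] == publication]
    # 2) the vector comparison only runs on whatever survived the filter
    return sorted(candidates, key=lambda e: cosine(query, e["embedding"]), reverse=True)[:top_k]

print(filtered_search(np.array([0.15, 0.85]), "Towards Data Science"))
```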
And how does it work? How do you know that something is similar to something else? Well, that's the whole concept of similarity search. So what is it, and how does it work?
As a basic rule of thumb, you always have a similarity metric for your index. And people always ask me: oh, which one should I use? Usually, match the one that was used to train your embedding model. It may be written on Hugging Face, for example, depending on the model you use, or in the research paper. So if it's been trained with cosine, go with cosine. Just try to match the metric your embedding model was trained with. So how does it work?
As an overview of the search: here you have a text, which is "find the two most similar images, documents or videos". That is processed through an embedding model, you get a vector, and then you're going to find the most similar ones. You can see that the most similar ones to our request are the first one in the database and the second-to-last one. That's basically what happens, and then it gives you the results back. In a nutshell, that's how vector search works. But I've been telling you I'm going to talk about the metrics, so let's go over the metrics. You have the first one, which is called L2, or Euclidean, and this measures the distance in space. If you picture a right triangle, it measures the direct distance in space. You can imagine it as the amount of space between two objects; let's say how far my face is from the screen, that's what it's measuring. One problem, though, is that this metric is sensitive to scale as well as to the vectors' relative location in space. What that means is that vectors with large values will have a larger Euclidean distance than vectors with smaller values, even if the vectors are very similar. So in most cases you don't use it with a deep learning model; you would rather use it with more basic vector encoding models like LSH, which means locality-sensitive hashing. You very likely won't see that one if you work with LLMs in general.
Then you have the inner product. It's a bit more complicated than the Euclidean one: it measures the projection of one vector onto the other, and the bigger the angle between the two vectors, the smaller the inner product. What we do is project these two points from the origin, and the inner product is the projection of one point onto the other at a right angle. This measures not only the angle difference between the two vectors, but also their magnitude. So it captures the difference in orientation and magnitude, whereas L2 only captures the distance in space. And then the last one, which is very likely the most famous one: cosine. This one measures orientation. Cosine is the difference in angle between two vectors: you just measure how big the angle is between your two vectors. This one is not affected by the size of the vectors, only by the angle between them. So that's a really good one: no matter how big your vectors are, it doesn't really matter. This one is mostly used in NLP, so the embedding models you see, if you work with LLMs, are usually going to use cosine. And also, as you may have noticed, cosine is a normalized inner product. So if the model was trained using cosine similarity, you can either use cosine similarity, or you can normalize your vectors and use the inner product.
Cosine is usually not suitable, though, when you have data where the magnitude of the vectors is important; you have to take that into account. For example, it's not really appropriate for comparing the similarity of image embeddings based on pixel intensities.
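Here is how the three metrics look in NumPy, as a quick sanity check. Note that the cosine value is exactly the inner product of the normalized vectors, which is why the two are interchangeable once you normalize.

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 3.0, 4.0])

l2 = np.linalg.norm(a - b)                           # Euclidean: distance in space (smaller = closer)
ip = float(a @ b)                                    # inner product: orientation + magnitude (bigger = closer)
cos = ip / (np.linalg.norm(a) * np.linalg.norm(b))   # cosine: orientation only (bigger = closer)

# Cosine is just the inner product of the normalized vectors.
a_n, b_n = a / np.linalg.norm(a), b / np.linalg.norm(b)
print(l2, ip, cos, float(a_n @ b_n))                 # the last two values match
```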
So yeah, you choose based on the type of embeddings you have. You have sparse embeddings, depending on the model (SPLADE, BGE-M3), and then there are specific kinds of indexes that go with them very well. If you have dense embeddings, from OpenAI, Cohere or different models like that, then you have different distances (IP, L2, or cosine), and then you can use FAISS or HNSW. And then you have binary embeddings as well; Cohere has binary embeddings, for example. And for those you may use different indexes and a different distance. So that's kind of it for the similarity measures. Now we go a bit deeper: we go into indexes.
So keep in mind, all of them have their pros and cons. There's not one index that is absolutely the best for everything, that is also the cheapest, that also has the best recall. That doesn't happen. So what is an index? First, it allows you to facilitate efficient searching and retrieval of similar embeddings. It's a special data structure that is built on top of the embeddings. That's basically what an index is. You have different strategies: tree-based indexes, graph-based, hash-based and cluster-based. Here I will only talk about graph-based and cluster-based; you can look up the others, but they're not really relevant for us. These are some of the indexes that Milvus supports; you can see we have a lot of them. But no worries, there are many options and, as I've said before, they all have their pros and cons. So I'm going to go a bit deeper now into the different indexes, so you can have an understanding of how they work.
All right, so the first one is the flat index, and this one is quite straightforward, I would say. It's the more elementary kind of similarity search. It utilizes the k-nearest neighbors algorithm, also called kNN. Basically, consider a database with a collection of, say, a hundred embeddings and a query embedding, which we're going to call Q. If you want to find the most similar embeddings to Q, kNN will calculate the distance between Q and each embedding in the database, using a specific metric that you define as well. Then it identifies the embedding with the smallest distance as the most similar to Q. And it works amazingly for accuracy; it's very accurate. The problem is that it's an exhaustive search, so scalability is a concern: it's very accurate, but it's very slow. The time complexity increases linearly with the number of vector embeddings in the database. So that's flat: it's very straightforward, but it doesn't really scale.
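As a tiny sketch, this is what that exhaustive flat search looks like, with the metric passed in as a parameter. It's the same brute-force idea as the earlier NumPy snippet, just written as a reusable function.

```python
import numpy as np

def knn(query, embeddings, k=10, metric="l2"):
    """Exhaustive (flat) search: score the query against every stored embedding."""
    if metric == "l2":
        scores = -np.linalg.norm(embeddings - query, axis=1)   # smaller distance = better
    elif metric == "ip":
        scores = embeddings @ query                            # larger inner product = better
    else:
        raise ValueError(f"unknown metric: {metric}")
    return np.argsort(-scores)[:k]                             # ids of the k nearest embeddings

rng = np.random.default_rng(0)
db = rng.normal(size=(100, 32))
print(knn(rng.normal(size=32), db, k=5))
```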
Then you have inverted file flat, also called IVF Flat, and it's very similar to flat, but instead of using k-nearest neighbors it uses an ANN algorithm, which is approximate nearest neighbors. How it works is that it divides the embeddings into several partitions, and those partitions don't intersect; that's very important. Each partition has a centroid, and every vector embedding in the database is associated with a specific partition based on the nearest centroid. So when a query embedding Q is provided, IVF will only calculate the distance between Q and each centroid, rather than the entire set of embeddings you have in your database. So you already have a tremendous gain in time. The centroid with the smallest distance is then selected, and the embeddings associated with that partition are used as candidates for the final search. So once you know which partition to use, you search over all the embeddings that are in this partition. That's nice: it really speeds up the search process. You may have a problem, though, which is that the candidate that is found might not be the true nearest neighbor. So then you can play with the hyperparameters; that's basically what you can do. There is a hyperparameter called nprobe, which is the number of partitions you search. With a higher nprobe you search more partitions, and that can really help ensure that the nearest embedding to the query is not missed, even if it's in a different partition.
With that, the cost of the search increases, but it's still going to be better than a flat search anyway.
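Here is a simplified sketch of the IVF idea, using scikit-learn's KMeans for the partitions. The nlist and nprobe names are borrowed from the usual IVF parameters; the point is just the two-step search: centroids first, then only the selected partitions.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(10_000, 64))

# Build step: partition the embeddings into nlist clusters (the "inverted lists").
nlist = 100
kmeans = KMeans(n_clusters=nlist, n_init=10, random_state=0).fit(embeddings)
assignments = kmeans.labels_                       # which partition each vector belongs to

def ivf_search(query, nprobe=5, top_k=3):
    # 1) compare the query only against the nlist centroids
    centroid_dists = np.linalg.norm(kmeans.cluster_centers_ - query, axis=1)
    closest_partitions = np.argsort(centroid_dists)[:nprobe]
    # 2) brute-force search, but only inside the selected partitions
    candidate_ids = np.where(np.isin(assignments, closest_partitions))[0]
    dists = np.linalg.norm(embeddings[candidate_ids] - query, axis=1)
    return candidate_ids[np.argsort(dists)[:top_k]]

print(ivf_search(rng.normal(size=64)))
```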
Then you have another one, which is inverted file with quantization. It's the same idea, but when memory is limited, plain IVF Flat may not be optimal, because you might struggle with the memory. So to address memory constraints, what we can do is combine IVF with a mapping of the values in each vector dimension to lower-precision integers. That's what we call quantization, and there are two kinds of quantization: the first one is scalar quantization, and the second one is product quantization.
So scalar quantization: how it works is that you're basically mapping floats to integers. The first step is to determine and store the maximum and minimum values of each dimension of the vectors in the database, because that's what you need to calculate the step size. The step size is crucial to scale the floating point numbers in each dimension to their integer representation. You can see the formula here: it's the maximum value minus the minimum value, divided by the number of bins, which is something that you decide. Then, to get the quantized version of the nth vector dimension, you subtract that dimension's minimum value from its value and divide the result by the step size.
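And here is a minimal sketch of that formula: per dimension, compute the minimum, the maximum and the step size, then map every float to an integer bin.

```python
import numpy as np

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(1_000, 8)).astype(np.float32)

n_bins = 256                                  # 8-bit quantization
mins = embeddings.min(axis=0)                 # per-dimension minimum
maxs = embeddings.max(axis=0)                 # per-dimension maximum
step = (maxs - mins) / n_bins                 # step size = (max - min) / number of bins

# Quantize: subtract the dimension's minimum, divide by the step size, clip to the bin range.
quantized = np.clip(np.floor((embeddings - mins) / step), 0, n_bins - 1).astype(np.uint8)

# Approximate reconstruction, to see what was lost.
restored = mins + (quantized.astype(np.float32) + 0.5) * step
print(embeddings.nbytes, quantized.nbytes)    # 4 bytes per value down to 1 byte per value
```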
Then you have another one, which is product quantization. This one addresses the limitations of scalar quantization: you consider the distribution of each dimension, you divide the vector embeddings into sub-vectors, and you perform clustering within each sub-vector space to create centroids. Then you encode each sub-vector with the id of the nearest centroid. So again: here you have the vector, which is a vector of floats. You consider the distribution, you divide the vector into sub-vectors, and then you perform clustering, which is what you have on the image at the third step: you perform clustering to create centroids, and then you encode each sub-vector with the id of the nearest centroid. And that's how you can save a lot of memory. Product quantization offers more powerful memory compression, so it can be really good if you have massive memory constraints, for example.
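Here is a simplified product quantization sketch, again with scikit-learn's KMeans. A real implementation would also use these codes to approximate distances at search time; this only shows the compression step: split each vector into sub-vectors, cluster each sub-space, and keep only the id of the nearest centroid per sub-vector.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(5_000, 64)).astype(np.float32)

m = 8                                   # number of sub-vectors per embedding
k = 256                                 # centroids per sub-space, so one uint8 code each
sub_dim = embeddings.shape[1] // m      # 64 / 8 = 8 dimensions per sub-vector

codebooks, codes = [], []
for i in range(m):
    sub_vectors = embeddings[:, i * sub_dim:(i + 1) * sub_dim]
    km = KMeans(n_clusters=k, n_init=4, random_state=0).fit(sub_vectors)
    codebooks.append(km.cluster_centers_)         # k centroids for this sub-space
    codes.append(km.labels_.astype(np.uint8))     # centroid id for each vector's sub-vector

codes = np.stack(codes, axis=1)         # shape (5000, 8): 8 bytes per vector instead of 256
print(embeddings.nbytes, codes.nbytes)
```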
Then, last but not least, there's the one usually called HNSW, which is a bit of a mouthful. This one is a graph-based indexing algorithm. It combines two key concepts: skip lists and navigable small world graphs, or NSW. A skip list is a probabilistic data structure that is composed of multiple layers of linked lists. An NSW graph is built by randomly shuffling data points and inserting them one by one, with each point connected to a predefined number of edges. This creates a graph structure that exhibits the small-world phenomenon, where any two points are always connected through a relatively short path.
And that's how it works, basically. So if we look at the skip list first: again, it's a probabilistic data structure composed of multiple layers of linked lists. The lowest layer, layer zero, contains the original linked list with all the elements. As we move to the higher layers, the linked lists progressively skip over elements, so you have fewer and fewer elements with each higher layer. Then, during the search process, you start from the highest layer and gradually descend to lower layers until you find the desired element. And because of this, a skip list can really speed up your search process. So as an example,
let's say we have a skip list with three layers, as you can see here, and we have eight elements in the original linked list. If you want to find the element seven, the search process will look like what we have on the image. We look and say: okay, seven is higher than four, so we go to the right side of four; then six is lower than seven, so we keep going to the right. And that's how you find seven. That's basically how it works for skip lists.
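Here is a toy version of that walk, assuming each higher layer is a subset of the layer below it: start at the sparsest layer, move right as long as you don't overshoot the target, then drop down a layer.

```python
# Toy skip list: each layer is a sorted subset of the layer below it.
layers = [
    [1, 4, 7],                     # highest (sparsest) layer
    [1, 3, 4, 6, 7],               # middle layer
    [1, 2, 3, 4, 5, 6, 7, 8],      # layer 0: the full linked list
]

def skip_list_contains(layers, target):
    value = layers[0][0]                       # start at the head of the top layer
    for layer in layers:
        i = layer.index(value)                 # re-enter this layer at the current node
        # move right while the next element does not overshoot the target
        while i + 1 < len(layer) and layer[i + 1] <= target:
            i += 1
        value = layer[i]
        if value == target:
            return True                        # found it without scanning the whole bottom layer
    return False

print(skip_list_contains(layers, 7))           # True
print(skip_list_contains(layers, 5))           # True (only found on the bottom layer)
```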
And then you have NSW graphs. Again, as I said, they're built by randomly shuffling data points and inserting them one by one. This creates a graph structure that exhibits the small-world phenomenon, where any two points are always connected through a short path. That's how it can be really efficient. If I were to show you an example of how it works, this is it. This is a multi-layer graph, where the lowest layer contains the complete set of data and the higher layers contain only a small fraction of the points, again as we've seen for the linked lists. HNSW starts by assigning a random integer from 0 to l to each data point (those can be referred to as nodes, as in any graph algorithm), where l is the maximum layer at which the data point can be present. So if a data point has l = 2 and the total number of layers in the graph is 5, then the data point will only be present up until the second layer, and it won't be available in layer three and upwards. During the search process, HNSW starts by selecting an entry point, which should be a data point present in the highest layer. Then it searches for the nearest neighbor closest to the query point, and recursively continues the search in lower layers, again and again, until the nearest data point is found. So now we have this image again, and hopefully it makes a bit more sense.
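If you want to play with HNSW directly, the hnswlib library exposes the main knobs. This is just a rough sketch assuming hnswlib is installed, and the parameter values are arbitrary examples.

```python
import numpy as np
import hnswlib

rng = np.random.default_rng(0)
dim, num_elements = 128, 10_000
data = rng.normal(size=(num_elements, dim)).astype(np.float32)

# Build the HNSW graph: M controls edges per node, ef_construction the build-time search width.
index = hnswlib.Index(space="cosine", dim=dim)
index.init_index(max_elements=num_elements, M=16, ef_construction=200)
index.add_items(data, np.arange(num_elements))

# ef is the query-time search width: higher means better recall but slower queries.
index.set_ef(50)
labels, distances = index.knn_query(data[:1], k=5)
print(labels, distances)    # the first result should be the query vector itself
```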
And again, there's no good or bad index; it really depends on your application requirements. Are you more interested in query speed? Do you have a lot of inserts or deletes? What kind of machine are you building the index on? It really, really depends. For example, if you need 100% recall, then flat would be the best. Then, depending on the size of your index, you might use a standard IVF. If your index is between 2 GB and 20 GB, maybe consider product quantization or HNSW. And then you also have composite indexes, and you have disk-based indexes as well, which can be really useful for you.
And then I'll finish with that. You can see the Milvus architecture here: Milvus has been built to be fully distributed. That's what you see. On the bottom part you see the S3 storage, the object storage. If you go above, you have the worker nodes, and those are fully distributed: you have the query nodes, the data nodes and the index nodes. And that's how we can scale up to billions, thanks to those, because they are fully independent, so we can scale them up and down. And that's kind of how Milvus works; that's how vector databases work. I hope this talk was useful to you. If you have any questions, you can chat with me on Discord directly. Also check out our GitHub; Milvus is fully open source, so you can check out our repository. And yeah, if you have any questions, chat with me on Discord. You can find me on LinkedIn as well. Thank you very much for watching this talk, and I hope you have a good day.