LanceDB: An OSS Serverless Vector Database in Rust
Abstract
LanceDB is a lightning-fast OSS in-process vector database.
It is backed by the Lance format, a modern columnar data format that is an alternative to Parquet. It is optimized for high-speed random access and for managing AI datasets such as vectors, documents, and images.
Summary
-
LanceDB is a new vector database written in Rust. It offers blazing-fast vector search, SQL search, and full-text search capabilities. In contrast to a lot of other vector databases, we designed the vector index to be cloud native.
-
All of our code uses stable Rust instead of unstable Rust. Rust's cfg attributes help us fine-tune the small details of code layout. LanceDB is an in-process database, which means there are no extra servers for you to manage. LanceDB Cloud will launch in a week or so; by just changing the connection URL, it offers the same experience as the open source project.
Transcript
Hello everyone, this is Lei from the LanceDB team.
Today it is an honor to present our journey of writing a new vector database in Rust: introducing LanceDB. It is a fresh addition to the world of open source, in-process databases. Our journey to this innovative database began in early 2023. You can think of LanceDB as SQLite or DuckDB, but for vector databases.
Thanks to OSS implementations and modern storage technologies, LanceDB offers blazing-fast vector search, SQL search, and full-text search capabilities. Our users do not only store vectors in LanceDB; they can store multimodal data and machine learning data directly with their embeddings in a single store, and our users love LanceDB because of it.
In contrast to a lot of other vector databases, we designed the vector index to be cloud native from the very beginning. As a result, we can deliver great latency even when all of your data and indexes live on cloud storage like S3. Let me tell you how LanceDB started. At the end of last year, we started to build a new columnar format called Lance for AI data lakes, in C++. However, it was such a painful experience for our small team to fight with the C++ build system while trying to maintain the velocity to deliver value to customers. Around the new year, our team decided to try Rust, after seeing so many "we rewrote our project in Rust" blog posts. The rewrite was hugely successful. Not only did we reimplement the Lance columnar format and build vector search capabilities on top of it, it also took two of our engineers less than one month to reimplement it from scratch. We were surprised that the performance was actually better than our original C++ implementation. The community experience has been a huge boost to the work, and more importantly, our developers are very happy writing clean code.
Our engineering team had no prior Rust experience, but we love Rust, even though this is the first Rust code we have written. First and foremost, the biggest reason we wanted to try Rust was how painful CMake is. Cargo is so well designed, and with the massive community around it there are a lot of high-quality libraries we can use immediately, without the hassle we had with CMake. Not to mention the strengths of the Rust language itself, for example the meaningful error messages, and the opinionated software engineering practices like modules and built-in tests and benchmarks, which saved us a lot of time discussing what code standards to put in place to guarantee code quality. Since we want to support multi-language SDKs for LanceDB, we needed a native language that can easily be embedded into other languages. Lastly, because running a vector database involves a huge amount of math computation, especially vector computation, being able to easily access the latest SIMD instructions from the standard library has been a huge win for us.
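To make that concrete, here is a minimal sketch, not LanceDB's actual code, of what that standard-library access looks like using the stable std::arch AVX intrinsics; the function name and shape are illustrative, and a caller must first verify AVX support at runtime (for example with is_x86_feature_detected!).

    #[cfg(target_arch = "x86_64")]
    #[target_feature(enable = "avx")]
    unsafe fn add8(a: *const f32, b: *const f32, out: *mut f32) {
        use std::arch::x86_64::*;
        // Load two 8-lane f32 vectors, add them element-wise, store the result.
        let v = _mm256_add_ps(_mm256_loadu_ps(a), _mm256_loadu_ps(b));
        _mm256_storeu_ps(out, v);
    }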
So what is a vector database? A vector database is different from a traditional database because the searches it serves try to find relevant data in a high-dimensional vector space. Usually our users' data ranges from a few hundred dimensions to one or two thousand dimensions. That is different from a traditional database, where each column represents just one dimension of the data. Vector databases have become more recognizable recently because people have started to build AI or LLM applications, where the models use huge vectors internally to represent their outputs; the vector database in effect becomes those models' knowledge base. For example, a typical use case is to store the embeddings from machine learning models. Here we take the OpenAI CLIP model as an example, which maps text and images into the same embedding space. A customer will first build a data pipeline to generate embeddings for all the images, then store and index them in LanceDB. Then they will build a natural language search application on top of LanceDB to search those images in plain English.
For example, with this application you can simply ask in English for the top ten movie images matching a description: the CLIP model turns the query into a vector, and LanceDB finds the ten most relevant results and returns them to the user in sub-second time.
However, building a fast vector database has its own challenges. Due to the curse of dimensionality, it is not practical to have an index that is both fast and 100% accurate; you have to pick one. That is why a lot of vector indexes are called approximate nearest neighbor indexes. And because we store those indexes on S3, where a lot of random I/O is needed to re-rank the final results, it is even more difficult to support a cloud-native vector index. A typical dataset in LanceDB ranges from 700 to 1500 dimensions, with half a million to 1 billion vectors in it. The data type is usually float32 for the vectors, plus additional metadata for other filters.
So, on to building a vector index in Rust. The first one we built is called IVF_PQ. It first divides the vector space into small partitions, called Voronoi cells, and within each cell we use product quantization to compress the vectors, making the storage smaller to benefit the later I/O. That way, to answer one query we only need to scan the small fraction of partitions that are close to the query vector and avoid a full dataset scan. As a result, we can speed up retrieval and achieve high indexing performance and low query latency.
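As a minimal sketch of that first step, assuming hypothetical names (centroids holds one vector per partition, nprobes is how many partitions we are willing to scan), partition selection looks roughly like this:

    fn select_partitions(query: &[f32], centroids: &[Vec<f32>], nprobes: usize) -> Vec<usize> {
        // Squared L2 distance from the query to every partition centroid.
        let mut dists: Vec<(usize, f32)> = centroids
            .iter()
            .enumerate()
            .map(|(i, c)| (i, query.iter().zip(c).map(|(a, b)| (a - b) * (a - b)).sum()))
            .collect();
        // Keep only the nprobes closest partitions; only these get scanned.
        dists.sort_by(|a, b| a.1.partial_cmp(&b.1).unwrap());
        dists.into_iter().take(nprobes).map(|(i, _)| i).collect()
    }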
We have pushed a lot of performance gains out of modern CPUs. For example, we wrote yet another GEMM implementation on top of the Arrow format, and manually tuned SIMD for the distance computations. The code is hand-tuned to be L1/L2 cache friendly, and we do manual loop unrolling to achieve better CPU bandwidth. Some other optimizations have been applied to the GEMM as well.
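To illustrate the loop unrolling idea (this is a hand-written sketch, not LanceDB's actual kernel), processing fixed-width chunks with several independent accumulators keeps the CPU pipeline busy; it assumes both slices have the same length:

    fn l2_squared(a: &[f32], b: &[f32]) -> f32 {
        let mut acc = [0.0f32; 8];
        let chunks = a.len() / 8 * 8;
        for i in (0..chunks).step_by(8) {
            for j in 0..8 {
                let d = a[i + j] - b[i + j];
                acc[j] += d * d; // eight independent accumulators
            }
        }
        let mut sum: f32 = acc.iter().sum();
        // Handle the tail that does not fill a full chunk.
        for i in chunks..a.len() {
            let d = a[i] - b[i];
            sum += d * d;
        }
        sum
    }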
At the end of the day, our SIMD code is actually faster than NumPy, faster than Arrow's SIMD implementation, and faster than what the compiler's auto-vectorization gives us in our benchmarks. It is worth mentioning that all of our code uses stable Rust instead of unstable Rust, so everyone can use it today. During the process of implementing the GEMM and distance functions, Rust helped us work on these low-level details while still maintaining the readability of the code base. We love Rust. Especially if you come from a C or C++ background, you would know what a nightmare it is to write multi-architecture code with all the #ifdefs and #defines.
Rust's cfg attributes make it super easy to fine-tune the small details of code layout to support both the x86_64 and ARM architectures, like Apple M1 and M2.
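Here is a minimal sketch of that conditional compilation; the per-architecture bodies below are placeholders where the tuned kernels would live:

    #[cfg(target_arch = "aarch64")]
    fn l2_squared_simd(a: &[f32], b: &[f32]) -> f32 {
        // A NEON-tuned implementation would live here (e.g. Apple M1/M2).
        a.iter().zip(b).map(|(x, y)| (x - y) * (x - y)).sum()
    }

    #[cfg(target_arch = "x86_64")]
    fn l2_squared_simd(a: &[f32], b: &[f32]) -> f32 {
        // An AVX-tuned implementation would live here.
        a.iter().zip(b).map(|(x, y)| (x - y) * (x - y)).sum()
    }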
Cargo's built-in benchmark support keeps benchmarking as part of the team's development culture, which is super important for a database engineering team.
We use flame graphs extensively because they make it so much easier to identify the hot spots in the code, and cargo-flamegraph is much easier to use than the original flame graph workflow, which involves multiple pieces of tooling on Linux. Finally, we use Godbolt, the Compiler Explorer, a lot; it helps us check and verify the generated assembly code to make sure we have used the right instruction set without stalling the CPU pipelines.
The piece we miss the most in Rust would be the generic specialization supported in C++, which allowed us to provide optimized implementations for multiple different data types while still keeping a fallback default implementation. CPU optimization is not the only hard part of a vector database; the I/O is very tricky too.
Because disk layout is linear, we cannot statically and efficiently map multi-dimensional distances onto that linear space. As a result, a vector index usually involves a lot of scanning; it is more like an OLAP kind of workload. Many of our optimizations are about reducing the scan size, as well as reducing the late random I/O used to re-rank the results and correct for PQ distortion.
Each IVF_PQ index is stored in one file. The IVF centroid block is stored at the beginning of the file, and it holds the centroid of each partition in the index. We open the file once and cache the IVF centroids. When a query vector comes in, we use those cached centroids to identify which partitions to read, and scan them accordingly. Within each such partition, we use the product quantization codebook and the PQ codes to reconstruct the original vectors, and do a merge sort for the final results. At the end of this retrieval pipeline, we use the row IDs to fetch the rest of the metadata and combine hybrid filters with the vector search.
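As a minimal sketch of the reconstruction step, assuming 8-bit PQ (256 centroids per sub-vector) and a codebook laid out as [sub-vector][code][dimension], decoding looks roughly like this:

    fn pq_decode(codebook: &[f32], codes: &[u8], dim_sub: usize) -> Vec<f32> {
        let mut v = Vec::with_capacity(codes.len() * dim_sub);
        for (sub, &code) in codes.iter().enumerate() {
            // Each byte picks one of 256 centroids for its sub-vector.
            let base = (sub * 256 + code as usize) * dim_sub;
            v.extend_from_slice(&codebook[base..base + dim_sub]);
        }
        v // approximate reconstruction of the original vector
    }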
Because Rust has such good library support for reading from cloud storage, our implementation became easier and faster, and we can build on top of a great community to deliver features sooner.
So how about other search capabilities? LanceDB supports SQL and full-text search as well, because it is built on the Lance columnar format, the fastest-growing columnar format out there. The main capability of Lance is having scan performance similar to Parquet while delivering up to 2,000 times faster point queries than Parquet.
We built that on top of DataFusion and sqlparser, so LanceDB supports a SQL engine internally.
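For readers unfamiliar with DataFusion, here is a generic sketch of the engine we embed (this is plain DataFusion over a CSV file, not Lance's integration, and the table and file names are made up):

    use datafusion::prelude::*;

    #[tokio::main]
    async fn main() -> datafusion::error::Result<()> {
        let ctx = SessionContext::new();
        // Register any table provider; Lance plugs its own scanner in here.
        ctx.register_csv("t", "data.csv", CsvReadOptions::new()).await?;
        let df = ctx.sql("SELECT id, score FROM t WHERE score > 0.5 LIMIT 10").await?;
        df.show().await?;
        Ok(())
    }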
We also use the Tantivy Rust package, with customizations to support cloud storage, so we can offer full-text search as well.
Underneath this architecture we use a lot of async machinery, like Tokio and the object_store crate, which allows us to read from object stores such as S3 and GCS quickly, without burning a large number of CPU cycles.
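A minimal sketch of that kind of async read through the object_store crate (the bucket and file names are hypothetical, credentials come from the environment, and the exact range type varies slightly across crate versions):

    use object_store::{aws::AmazonS3Builder, path::Path, ObjectStore};

    #[tokio::main]
    async fn main() -> Result<(), Box<dyn std::error::Error>> {
        let store = AmazonS3Builder::from_env().with_bucket_name("my-bucket").build()?;
        // A ranged GET: fetch just the leading IVF centroid block, not the file.
        let bytes = store.get_range(&Path::from("index.lance"), 0..4096).await?;
        println!("read {} bytes", bytes.len());
        Ok(())
    }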
By this point, hopefully you have enjoyed the technical details about Lance and would love to use it. Let me show you how. LanceDB is an in-process database. That means there are no extra servers and no Kubernetes for you to manage, and because the index is disk-based, there is no huge server that has to load everything into memory before it can serve your first request.
We support native Python and TypeScript SDKs, built with the great PyO3 and Neon packages to bring the FFI into those languages. Of course, we have a Rust SDK as well; you can install it through Cargo.
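As a generic sketch of the PyO3 approach (not LanceDB's actual bindings; this uses the PyO3 0.21-style module signature, which may differ in other versions), exposing a Rust function to Python looks like this:

    use pyo3::prelude::*;

    #[pyfunction]
    fn l2(a: Vec<f32>, b: Vec<f32>) -> f32 {
        // A Rust function callable from Python with no extra server in between.
        a.iter().zip(&b).map(|(x, y)| (x - y) * (x - y)).sum()
    }

    #[pymodule]
    fn mymod(m: &Bound<'_, PyModule>) -> PyResult<()> {
        m.add_function(wrap_pyfunction!(l2, m)?)?;
        Ok(())
    }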
Since LanceDB is an in-process database, it can be used as simply as SQLite or DuckDB: you just import lancedb and connect to a URL, and you can use the database immediately in your script or in your server. The interface is designed to follow a data-frame style of API, very similar to Pandas or Spark.
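Here is a minimal sketch in Rust; the builder methods follow the Rust SDK's documented API at the time of writing and may differ between versions, and the path, table name, and query vector are hypothetical:

    use lancedb::connect;
    use lancedb::query::{ExecutableQuery, QueryBase};

    #[tokio::main]
    async fn main() -> Result<(), Box<dyn std::error::Error>> {
        // In-process: connecting to a local path is all the setup required.
        let db = connect("data/sample-lancedb").execute().await?;
        let table = db.open_table("my_images").execute().await?;
        // Vector search: stream back the 10 rows nearest to the query embedding.
        let _results = table
            .query()
            .nearest_to(vec![0.1f32; 768])?
            .limit(10)
            .execute()
            .await?;
        Ok(())
    }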
Realistically, there are only three language candidates for building a multi-language in-process database: C, C++, and Rust. Hopefully everything I have shown in this presentation demonstrates that the choice is very obvious.
Lastly, we will launch LanceDB Cloud, our SaaS, in a week or so. By just changing the connection URL of the database, you can connect to the LanceDB SaaS with the same experience. Our SaaS uses a pay-per-query pricing model, so you pay only for usage and it is fully managed, and the rest of the user experience is the same as our open source project. Hope you will enjoy it and try it out. Thank you very much.
We are looking for your feedback and if you have time
please check out our GitHub repo and give
us a star. Thank you very much.