Conf42 Rustlang 2023 - Online

LanceDB: An OSS serverless Vector Database in Rust


Abstract

LanceDB is a lightning-fast OSS in-process vector database. It is backed by the Lance format - a modern columnar data format that is an alternative to Parquet. It is optimized for high-speed random access and management of AI datasets like vectors, documents, and images.

Summary

  • LanceDB is a new vector database in Rust. It offers blazing fast vector search, SQL search, and full-text search capabilities. In contrast to a lot of other vector databases, we designed our vector index to be cloud native.
  • All of our code uses stable Rust instead of unstable Rust. Rust's cfg attributes help us fine-tune the small details of the code layout. LanceDB is an in-process database, which means there are no extra servers for you to manage. LanceDB Cloud, which offers the same experience by just changing the connection URL, will launch in a week or so.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Hello everyone, this is Lei from LanceDB. Today it's an honor to present our journey of writing a new vector database in Rust: introducing LanceDB. It is a fresh addition to the world of open source, in-process databases. Our journey with this database began in early 2023. You can think of LanceDB as SQLite or DuckDB, but for vector databases. Thanks to OSS implementations and modern storage technologies, LanceDB offers blazing fast vector search, SQL search, and full-text search capabilities. Our users don't only store vectors in LanceDB; they can store multimodal data and machine learning data directly with their embeddings in a single store, and our users love LanceDB for it. In contrast to a lot of other vector databases, we designed our vector index to be cloud native from the very beginning. As a result, we can deliver great latency even when all of your data and indexes live on cloud storage like S3.

Let me tell you how LanceDB started. At the end of last year, we started to build a new columnar format called Lance for AI data lakes, in C++. However, it was such a painful experience for our small team to fight the C++ build system while trying to maintain the velocity to deliver value to customers. Around the new year, our team decided to try Rust, after having read so many "rewrite my project in Rust" blog posts. The rewrite was a huge success. Not only did we reimplement the Lance columnar format and build vector search capabilities on top of it, it took less than one month for two of our engineers to reimplement it from scratch. We were surprised that the performance was actually better than our original C++ implementation. The developer experience was a huge boost to our work, and more importantly, our developers are very happy writing clean code. Our engineering team had no prior Rust experience, but we love Rust, even though this is the first Rust code we have written.

First and foremost, the biggest reason we wanted to try Rust was how painful CMake is. Cargo is so well designed, and with a massive community around it there are a lot of high-quality libraries we can use immediately, without the hassle we had with CMake. Not to mention the nice traits of the Rust language itself: for example, the meaningful error messages, and opinionated software engineering practices like modules, built-in tests, and benchmarks, which save us a lot of time discussing what code standards we should put in place to guarantee code quality. Since we want to support multi-language SDKs for LanceDB, we also need a native language that can easily be embedded into other languages. Lastly, because running a vector database involves a huge amount of math computation, especially vector computation, being able to easily access the latest SIMD intrinsics from the standard library has been a huge win for us.

So what is a vector database? A vector database is different from a traditional database because the searches a vector database serves try to find relevant data in a high-dimensional vector space. Our users' data usually ranges from a few hundred dimensions to one or two thousand dimensions. This differs from a traditional database, where each column represents just one dimension of the data. Vector databases have become more recognizable recently because people have started to build AI or LLM applications, and those models use huge vectors internally to represent their outputs.
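To make the search problem concrete, here is a minimal, self-contained Rust sketch (illustrative only, not LanceDB code) of brute-force k-nearest-neighbor search with squared L2 distance. At a few hundred to a couple of thousand dimensions and millions of vectors, this O(n·d) scan per query is exactly what a vector index is designed to avoid.

```rust
/// Squared L2 distance between two equal-length vectors.
fn l2_squared(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| (x - y) * (x - y)).sum()
}

/// Brute-force k-nearest-neighbor search: scan every vector in the
/// dataset and keep the k closest ones. This is the baseline an
/// approximate nearest neighbor (ANN) index tries to beat.
fn knn_brute_force(query: &[f32], data: &[Vec<f32>], k: usize) -> Vec<(usize, f32)> {
    let mut scored: Vec<(usize, f32)> = data
        .iter()
        .enumerate()
        .map(|(i, v)| (i, l2_squared(query, v)))
        .collect();
    scored.sort_by(|a, b| a.1.partial_cmp(&b.1).unwrap());
    scored.truncate(k);
    scored
}

fn main() {
    let data = vec![vec![0.0, 1.0], vec![1.0, 0.0], vec![0.9, 0.1]];
    let query = [1.0, 0.0];
    // Prints the two closest row ids and their distances.
    println!("{:?}", knn_brute_force(&query, &data, 2));
}
```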
Vector databases, in effect, became the machine learning database of knowledge. For example, a typical application use case is to store the embeddings from machine learning models. Here we take the OpenAI CLIP model as an example, which maps text and images into a shared embedding space. Our customer will first build a data pipeline to generate embeddings for all their images, then store and index them in LanceDB. Then they will build a natural language search application on top of LanceDB to search those images in plain English: for a natural language query, the CLIP model produces a query vector, LanceDB finds the top ten most relevant results, and they are returned to the user with sub-second latency.

However, building a fast vector database has its own challenges. Due to the curse of dimensionality, it is not practical to build an index that is both fast and 100% accurate; you have to trade one for the other. That is why many vector indexes are called approximate nearest neighbor (ANN) indexes. And because we store those indexes on S3, where a lot of random I/O is needed to re-rank the results, supporting a cloud-native vector index is even more difficult. A typical dataset in LanceDB ranges from 700 to 1500 dimensions, with half a million to 1 billion vectors in it. The data in those datasets is usually float32 for the vectors, plus additional metadata for other filters.

So we built a vector index in Rust. The first one we built is called IVF_PQ. It first divides the vector space into small partitions called Voronoi cells, and within each cell we use product quantization (PQ) to compress the vectors, making the storage smaller to benefit later I/O. That way, to answer one query, we only need to scan the small fraction of partitions that are close to the query vector and avoid a full dataset scan. As a result, we can speed up retrieval as well.

To achieve high indexing performance and low query latency, we have pushed a lot of performance out of modern CPUs. For example, we wrote yet another GEMM implementation over the Arrow format and manually tuned SIMD for distance computations. The code is manually tuned to be L1/L2 cache friendly, and we do manual loop unrolling to achieve better CPU bandwidth. Some other optimizations have been applied to the GEMM as well. At the end of the day, our SIMD code is actually faster than NumPy, Arrow's SIMD implementation, and the compiler's auto-vectorization in our benchmarks. It is worth mentioning that all of our code uses stable Rust instead of unstable Rust, so everyone can use it today.

During the process of implementing the GEMM and distance functions, Rust helped us work on these low-level details while still maintaining the readability of the code base. We love Rust; especially if you come from a C or C++ background, you know what a nightmare it is to write multi-architecture code with all the #define and #ifdef. Rust's cfg attributes make it super easy to fine-tune the small details of the code layout to support both the x86-64 and ARM architectures, like Apple M1/M2. Cargo's built-in benchmarks keep the team benchmarking as part of the development culture, which is super important for a database engineering team. We use flamegraphs extensively because they make it so much easier for us to identify the hot spots in the code, and cargo-flamegraph is much easier to use than the original flamegraph workflow, which involves multiple pieces of tooling on Linux.
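As an illustration of the kind of kernel tuning described above, here is a simplified sketch in stable Rust, not LanceDB's actual kernel: a squared L2 distance with the inner loop manually unrolled into independent accumulators (which helps the compiler auto-vectorize), plus cfg-gated functions for per-architecture code.

```rust
/// Squared L2 distance with the inner loop manually unrolled in
/// chunks of 8. The eight independent accumulators break the
/// dependency chain, giving the compiler room to vectorize.
pub fn l2_squared_unrolled(a: &[f32], b: &[f32]) -> f32 {
    debug_assert_eq!(a.len(), b.len());
    let mut acc = [0.0f32; 8];
    let chunks = a.len() / 8;
    for i in 0..chunks {
        for j in 0..8 {
            let d = a[i * 8 + j] - b[i * 8 + j];
            acc[j] += d * d;
        }
    }
    // Handle the tail that doesn't fill a full chunk.
    let mut tail = 0.0f32;
    for i in chunks * 8..a.len() {
        let d = a[i] - b[i];
        tail += d * d;
    }
    acc.iter().sum::<f32>() + tail
}

/// Architecture-specific dispatch via cfg attributes: one crate
/// supports both x86-64 and aarch64 (e.g. Apple M1/M2) without
/// C-style #ifdef blocks.
#[cfg(target_arch = "x86_64")]
pub fn simd_backend() -> &'static str { "x86-64 (SSE/AVX)" }

#[cfg(target_arch = "aarch64")]
pub fn simd_backend() -> &'static str { "aarch64 (NEON)" }
```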
In the end, we use Godbolt, the Compiler Explorer, a lot. It helps us check and verify the generated assembly code to make sure we have used the right instruction set without stalling the CPU pipelines. The piece we miss the most in Rust would be the generic specialization supported in C++, which would allow us to provide optimizations for multiple different data types while still keeping a fallback default implementation.

It is not only hard to do CPU optimization in a vector database; the I/O is very tricky too. Because disk space is linear, we cannot statically and efficiently map multi-dimensional distances onto this linear space. As a result, a vector index usually involves a lot of scanning; it's more like an OLAP kind of workload. Many of our I/O optimizations are about reducing the scan size, as well as reducing the later random I/O used to re-rank the results, which is required because of PQ distortion. Each IVF_PQ index is stored in one file. The IVF centroid block is stored at the beginning of the file, and it holds the centroid of each partition in the index. We open this file once and cache the IVF centroids. When a query vector comes in, we use those IVF centroids to identify which partitions to read, and scan them accordingly. Within each such partition, we use the product quantization codebook and the PQ codes to reconstruct approximations of the original vectors, and do a merge sort for the final results. At the end of this retrieval pipeline, we use the row IDs to fetch the rest of the metadata and combine filters with vector search for hybrid queries. Because Rust has so much better library support for reading from cloud storage, our implementation is easier and faster, and we can build on top of a great community, so we can deliver faster.

So how about other search capabilities? LanceDB supports SQL and full-text search as well, because LanceDB is built on the Lance columnar format, the fastest-growing columnar format out there. Lance has scan performance similar to Parquet; however, it delivers point queries 2000 times faster than Parquet. We built on top of DataFusion and its SQL parser, so LanceDB supports a SQL engine internally. We also use the Tantivy Rust package, with customizations to support cloud storage, so we can support full-text search as well. Underneath this architecture we use a lot of async features, like Tokio and object_store, which allow us to read from object stores like S3 and GCS quickly without burning a large number of CPU cycles.

Hopefully by now you have enjoyed the technical details about LanceDB and would love to use it, so let me show you how. LanceDB is an in-process database. That means there are no extra servers and no Kubernetes for you to manage, and because the index is disk-based, there is no huge server that needs to load everything into memory before it can serve your first request. We support native Python and TypeScript SDKs through the great PyO3 and Neon packages, which bring the FFI into those languages. Of course, we have the Rust SDK as well; you can install it through Cargo. Since LanceDB is an in-process database, it can be used as simply as SQLite or DuckDB: you just import LanceDB, connect to a URL, and you can use the database immediately in your script or in your server. The interface is designed to follow a data-frame style of API, very similar to Pandas or Spark. Realistically, there are only three language candidates for building a multi-language, in-process database: C, C++, and Rust.
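Below is a highly simplified sketch of the query path just described: cache the centroids, rank partitions by centroid distance, scan only the closest partitions, and merge the candidates before fetching metadata by row ID. The real LanceDB index additionally reconstructs distances from PQ codes and re-ranks with random I/O against storage; the types and names here are illustrative only.

```rust
/// Squared L2 distance (see earlier sketch).
fn l2(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| (x - y) * (x - y)).sum()
}

/// Toy IVF-style index: in the real format, centroids live in a
/// block at the start of the index file and are cached after the
/// first open; partitions hold PQ codes rather than raw vectors.
struct IvfIndex {
    centroids: Vec<Vec<f32>>,               // one centroid per partition
    partitions: Vec<Vec<(u64, Vec<f32>)>>,  // (row_id, vector) per cell
}

impl IvfIndex {
    fn search(&self, query: &[f32], k: usize, nprobes: usize) -> Vec<(u64, f32)> {
        // 1. Rank partitions by centroid distance, keep `nprobes`.
        let mut order: Vec<usize> = (0..self.centroids.len()).collect();
        order.sort_by(|&i, &j| {
            l2(query, &self.centroids[i])
                .partial_cmp(&l2(query, &self.centroids[j]))
                .unwrap()
        });
        // 2. Scan only the selected partitions, avoiding a full scan.
        let mut hits: Vec<(u64, f32)> = order
            .iter()
            .take(nprobes)
            .flat_map(|&p| self.partitions[p].iter())
            .map(|(row_id, v)| (*row_id, l2(query, v)))
            .collect();
        // 3. Merge the candidates and keep the top k; the row ids can
        //    then be used to fetch metadata for hybrid filtering.
        hits.sort_by(|a, b| a.1.partial_cmp(&b.1).unwrap());
        hits.truncate(k);
        hits
    }
}
```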
Hopefully everything I have shown in this presentation demonstrates that the choice is obvious. Lastly, we will launch LanceDB Cloud, our SaaS, in a week or so. By just changing the connection URL of the database, you can connect to the LanceDB SaaS with the same experience. Our SaaS uses a pay-per-query pricing model, so you only pay for usage and it is fully managed, while the rest of the user experience is the same as our open source project. I hope you will enjoy it and try it out. Thank you very much. We are looking forward to your feedback, and if you have time, please check out our GitHub repo and give us a star. Thank you very much.
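For illustration, the "just change the connection URL" experience could look like the following from Rust. Note this is a hypothetical sketch: the crate name, `connect` builder, and `db://` URI scheme are assumptions and should be checked against the current LanceDB SDK documentation.

```rust
// Hypothetical sketch of switching between in-process and cloud
// deployments -- verify the exact API against the LanceDB docs.
async fn example() -> Result<(), Box<dyn std::error::Error>> {
    // In-process: data lives in a local directory, no server to run.
    let local = lancedb::connect("data/sample-lancedb").execute().await?;

    // LanceDB Cloud: same API, different URI (scheme is illustrative).
    let cloud = lancedb::connect("db://my-database").execute().await?;

    let _ = (local, cloud);
    Ok(())
}
```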

Lei Xu

Creator & CTO @ LanceDB



