lancedb_impl #15

Open
maximevanhees wants to merge 7 commits from lancedb_impl into main
Member

Implemented a new vector database backend called lance. The backend is not yet multi-modal, although groundwork has been laid to support image processing in the future. LanceDB is a search-only backend: it does not support the traditional Redis-like KV commands. All LanceDB-related commands take the form LANCE.<command>.

A vector database requires a vector-embedding model, which can be configured separately for each database instance. The user specifies which embedding provider to use with the LANCE.EMBEDDING command. To support this we added two traits, Embedder (text embedding) and ImageEmbedder (image embedding), alongside two implementations, TestHashEmbedder and TestImageHashEmbedder. The latter two are deterministic, offline embedders intended for testing. The per-database embedding configuration is stored as a JSON sidecar file at <base_dir>/lance/<db_id>/<dataset>.lance.embedding.json. More information and a full end-to-end workflow can be found in lance.md in the docs directory.
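
To make the embedder idea concrete, here is a minimal sketch of what a deterministic, offline text embedder and the sidecar path helper could look like. Everything in it is an assumption made for illustration: the trait methods, the `HashEmbedderSketch` type, the `embedding_sidecar_path` helper, and the `u64` database id are hypothetical and are not taken from the PR's actual `Embedder` or `TestHashEmbedder` code.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};
use std::path::{Path, PathBuf};

/// Hypothetical shape of a text-embedding trait; the real `Embedder` may differ.
pub trait Embedder {
    /// Number of components in every vector this embedder produces.
    fn dim(&self) -> usize;
    /// Map a piece of text to a fixed-size vector.
    fn embed(&self, text: &str) -> Vec<f32>;
}

/// Illustrative offline embedder: hashes each whitespace-separated token
/// into a bucket, counts hits, then L2-normalizes the result.
pub struct HashEmbedderSketch {
    pub dim: usize,
}

impl Embedder for HashEmbedderSketch {
    fn dim(&self) -> usize {
        self.dim
    }

    fn embed(&self, text: &str) -> Vec<f32> {
        let mut v = vec![0.0f32; self.dim];
        if self.dim == 0 {
            return v;
        }
        for token in text.split_whitespace() {
            // `DefaultHasher::new()` uses fixed keys, so the output is
            // repeatable across runs (though not guaranteed across Rust versions).
            let mut h = DefaultHasher::new();
            token.hash(&mut h);
            let idx = (h.finish() % self.dim as u64) as usize;
            v[idx] += 1.0;
        }
        let norm = v.iter().map(|x| x * x).sum::<f32>().sqrt();
        if norm > 0.0 {
            for x in v.iter_mut() {
                *x /= norm;
            }
        }
        v
    }
}

/// Builds `<base_dir>/lance/<db_id>/<dataset>.lance.embedding.json`.
/// The `u64` id type is an assumption.
pub fn embedding_sidecar_path(base_dir: &Path, db_id: u64, dataset: &str) -> PathBuf {
    base_dir
        .join("lance")
        .join(db_id.to_string())
        .join(format!("{}.lance.embedding.json", dataset))
}
```

Because the embedder needs no network access or model weights, calling `embed` twice on the same input yields the same vector, which is the property that makes such an embedder useful in tests.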

Each LANCE.SEARCH command runs a k-nearest-neighbor (KNN) search: it finds the K stored vectors most similar to the query vector according to a distance metric. This powers semantic search for both text and image queries by finding the closest matching embeddings in the vector space. The search costs O(n * d) time per query, where n is the number of vectors and d is the vector dimension. Simple equality-based filtering is supported on fields (id, text, media_type, media_uri, or any other metadata key). Filters are evaluated during the KNN scan, before the distance comparison, to reduce the search space. A possible future enhancement is to integrate Lance's native ANN (Approximate Nearest Neighbor) indices (IVF_PQ, HNSW, etc.).
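
The scan described above can be sketched as a brute-force loop that filters first and measures distance second. This is only an illustration under stated assumptions: the `Record` struct, the `knn_search` function, the choice of squared Euclidean distance, and the decision to treat every filterable field as a string entry in one metadata map are all hypothetical and not taken from the PR.

```rust
use std::collections::HashMap;

/// Hypothetical stored row: an id, its embedding, and string metadata.
pub struct Record {
    pub id: String,
    pub vector: Vec<f32>,
    pub meta: HashMap<String, String>,
}

/// Squared Euclidean distance; costs O(d) for d-dimensional vectors.
fn l2_squared(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| (x - y) * (x - y)).sum()
}

/// Brute-force KNN: visits all n records once, so O(n * d) overall.
/// Equality filters run before the distance computation, so rows that
/// do not match never pay the O(d) distance cost.
pub fn knn_search<'a>(
    records: &'a [Record],
    query: &[f32],
    k: usize,
    filter: &[(&str, &str)],
) -> Vec<(&'a str, f32)> {
    let mut hits: Vec<(&str, f32)> = records
        .iter()
        // Skip rows whose dimension does not match the query.
        .filter(|r| r.vector.len() == query.len())
        // Every (key, value) pair must match exactly.
        .filter(|r| {
            filter
                .iter()
                .all(|(key, want)| r.meta.get(*key).map(String::as_str) == Some(*want))
        })
        .map(|r| (r.id.as_str(), l2_squared(&r.vector, query)))
        .collect();
    // Smallest distance first, then keep the top k.
    hits.sort_by(|a, b| a.1.total_cmp(&b.1));
    hits.truncate(k);
    hits
}
```

An ANN index would replace the full scan with a lookup that inspects only a subset of the vectors, trading exactness for speed.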

New commands added:

  • LANCE.CREATE - Create dataset with dimension
  • LANCE.STORE - Store text with server-side embedding
  • LANCE.SEARCH - Search using text query
  • LANCE.STOREIMAGE - Store image (URI or base64)
  • LANCE.SEARCHIMAGE - Search using image query
  • LANCE.CREATEINDEX - Create vector index (placeholder)
  • LANCE.EMBEDDING CONFIG SET/GET - Configure embedding provider per dataset
  • LANCE.LIST - List datasets
  • LANCE.INFO - Get dataset information
  • LANCE.DEL - Delete record by ID
  • LANCE.DROP - Drop entire dataset

  • Read operations (SEARCH, LIST, INFO) require read permission
  • Write operations (CREATE, STORE, CREATEINDEX, DEL, DROP, CONFIG SET) require readwrite permission
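
The two permission rules above could be expressed as a small lookup. The sketch below is hypothetical: the `Permission` enum, both function names, and the assumption that readwrite implies read are illustrative choices, and only the commands named in the two bullets are mapped (anything else returns `None` rather than guessing).

```rust
/// Hypothetical permission levels; declaration order makes Read < ReadWrite.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub enum Permission {
    Read,
    ReadWrite,
}

/// Returns the permission a tokenized command needs, or `None` when the
/// command is not covered by the two rules in the description.
pub fn required_permission(tokens: &[&str]) -> Option<Permission> {
    // Commands are matched case-insensitively.
    let upper: Vec<String> = tokens.iter().map(|t| t.to_ascii_uppercase()).collect();
    let parts: Vec<&str> = upper.iter().map(String::as_str).collect();
    match parts.as_slice() {
        ["LANCE.SEARCH", ..] | ["LANCE.LIST", ..] | ["LANCE.INFO", ..] => Some(Permission::Read),
        ["LANCE.CREATE", ..]
        | ["LANCE.STORE", ..]
        | ["LANCE.CREATEINDEX", ..]
        | ["LANCE.DEL", ..]
        | ["LANCE.DROP", ..]
        | ["LANCE.EMBEDDING", "CONFIG", "SET", ..] => Some(Permission::ReadWrite),
        _ => None,
    }
}

/// Assumes readwrite access also grants read access.
pub fn is_allowed(granted: Permission, tokens: &[&str]) -> bool {
    required_permission(tokens).map_or(false, |need| granted >= need)
}
```

Returning `None` for unmapped commands makes the check fail closed: a command that has not been classified is denied instead of silently allowed.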

New RPC calls added:

  • lanceCreate, lanceList, lanceInfo, lanceDel, lanceDrop
  • lanceSetEmbeddingConfig, lanceGetEmbeddingConfig
  • lanceStoreText, lanceSearchText
  • lanceStoreImage, lanceSearchImage
  • lanceCreateIndex

Other fixes:

  • Prohibited unauthenticated access to the administrative database instance 0. Previously a user could access it without the required admin-secret (passed as an argument at startup), because database 0 was selected automatically when connecting with redis-cli. The user must now always supply KEY ... when selecting database instance 0.
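
The fix amounts to two rules: a new connection starts with no database selected, and database 0 can only be selected with the admin secret. A minimal sketch of that behavior follows; the `Session` type and its methods are hypothetical and do not reflect the PR's actual connection handling.

```rust
/// Hypothetical per-connection state. A fresh session has no database
/// selected, so database 0 is never entered implicitly.
pub struct Session {
    selected: Option<u64>,
}

impl Session {
    pub fn new() -> Self {
        Session { selected: None }
    }

    /// Selects database 0 only when the supplied key equals the admin secret.
    /// A plain `==` is used for brevity; a real server would likely want a
    /// constant-time comparison.
    pub fn select_admin_db(
        &mut self,
        key: Option<&str>,
        admin_secret: &str,
    ) -> Result<(), &'static str> {
        match key {
            Some(k) if k == admin_secret => {
                self.selected = Some(0);
                Ok(())
            }
            _ => Err("database 0 requires the admin secret"),
        }
    }

    /// Currently selected database, if any.
    pub fn selected(&self) -> Option<u64> {
        self.selected
    }
}
```

A failed attempt leaves the session state untouched, so a wrong or missing key cannot leave the connection pointing at the administrative instance.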
maximevanhees added 7 commits 2025-10-09 09:32:29 +00:00
This pull request can be merged automatically.
Reference: herocode/herodb#15