# Lance Vector Database Operations
HeroDB includes a powerful vector database integration using Lance, enabling high-performance vector storage, search, and multimodal data management. By default, it uses Ollama for local text embeddings, with support for custom external embedding services.
## Overview
The Lance vector database integration provides:
- **High-performance vector storage** using Lance's columnar format
- **Local Ollama integration** for text embeddings (default, no external dependencies)
- **Custom embedding service support** for advanced use cases
- **Text embedding support** (images via custom services)
- **Vector similarity search** with configurable parameters
- **Scalable indexing** with IVF_PQ (Inverted File with Product Quantization)
- **Redis-compatible command interface**
## Architecture
```
┌─────────────────┐     ┌──────────────────┐     ┌─────────────────┐
│     HeroDB      │     │     External     │     │      Lance      │
│  Redis Server   │◄───►│    Embedding     │     │  Vector Store   │
│                 │     │     Service      │     │                 │
└─────────────────┘     └──────────────────┘     └─────────────────┘
         │                        │                        │
         │                        │                        │
  Redis Protocol              HTTP API               Arrow/Parquet
     Commands               JSON Requests          Columnar Storage
```
### Key Components
1. **Lance Store**: High-performance columnar vector storage
2. **Ollama Integration**: Local embedding service (default)
3. **Custom Embedding Service**: Optional HTTP API for advanced use cases
4. **Redis Command Interface**: Familiar Redis-style commands
5. **Arrow Schema**: Flexible schema definition for metadata
## Configuration
### Default Setup (Ollama)
HeroDB uses Ollama by default for text embeddings. No configuration is required if Ollama is running locally:
```bash
# Install Ollama (if not already installed)
# Visit: https://ollama.ai
# Pull the embedding model
ollama pull nomic-embed-text
# Ollama automatically runs on localhost:11434
# HeroDB will use this by default
```
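Before relying on the defaults, you can confirm that Ollama is reachable and that the model was pulled (a quick check against Ollama's standard API, assuming the default port):
```bash
# List locally available models; nomic-embed-text should appear
curl http://localhost:11434/api/tags
```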
**Default Configuration:**
- **URL**: `http://localhost:11434`
- **Model**: `nomic-embed-text`
- **Dimensions**: 768 (for nomic-embed-text)
### Custom Embedding Service (Optional)
To use a custom embedding service instead of Ollama:
```bash
# Set custom embedding service URL
redis-cli HSET config:core:aiembed url "http://your-embedding-service:8080/embed"
# Optional: Set authentication if required
redis-cli HSET config:core:aiembed token "your-api-token"
```
### Embedding Service API Contracts
#### Ollama API (Default)
HeroDB calls Ollama using this format:
```http
POST http://localhost:11434/api/embeddings
Content-Type: application/json

{
  "model": "nomic-embed-text",
  "prompt": "Your text to embed"
}
```
Response:
```json
{
  "embedding": [0.1, 0.2, 0.3, ...]
}
```
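The same call can be reproduced with curl to sanity-check the embedding model outside of HeroDB:
```bash
curl -s http://localhost:11434/api/embeddings \
  -H "Content-Type: application/json" \
  -d '{"model": "nomic-embed-text", "prompt": "Your text to embed"}'
```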
#### Custom Service API
Your custom embedding service should accept POST requests with this JSON format:
```json
{
  "texts": ["text1", "text2"],                  // Optional: array of texts
  "images": ["base64_image1", "base64_image2"], // Optional: base64-encoded images
  "model": "your-model-name"                    // Optional: model specification
}
```
And return responses in this format:
```json
{
  "embeddings": [[0.1, 0.2, ...], [0.3, 0.4, ...]], // Array of embedding vectors
  "model": "model-name",                            // Model used
  "usage": {                                        // Optional usage stats
    "tokens": 100,
    "requests": 2
  }
}
```
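As a quick compatibility check, a request matching this contract can be sent with curl (the URL, token, and model name below are placeholders; how your service expects the token is service-specific):
```bash
curl -s http://your-embedding-service:8080/embed \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer your-api-token" \
  -d '{"texts": ["text1", "text2"], "model": "your-model-name"}'
# The Authorization header is only an example; use whatever scheme your service requires
```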
## Commands Reference
### Dataset Management
#### LANCE CREATE
Create a new vector dataset with specified dimensions and optional schema.
```bash
LANCE CREATE <dataset> DIM <dimension> [SCHEMA field:type ...]
```
**Parameters:**
- `dataset`: Name of the dataset
- `dimension`: Vector dimension (e.g., 384, 768, 1536)
- `field:type`: Optional metadata fields (string, int, float, bool)
**Examples:**
```bash
# Create a simple dataset for 384-dimensional vectors
LANCE CREATE documents DIM 384
# Create dataset with metadata schema
LANCE CREATE products DIM 768 SCHEMA category:string price:float available:bool
```
#### LANCE LIST
List all available datasets.
```bash
LANCE LIST
```
**Returns:** Array of dataset names
#### LANCE INFO
Get information about a specific dataset.
```bash
LANCE INFO <dataset>
```
**Returns:** Dataset metadata including name, version, row count, and schema
#### LANCE DROP
Delete a dataset and all its data.
```bash
LANCE DROP <dataset>
```
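A typical redis-cli session tying these dataset commands together might look like this (a sketch assuming a local HeroDB instance on the default port):
```bash
# Inspect existing datasets and one dataset's metadata
redis-cli LANCE LIST
redis-cli LANCE INFO old_dataset

# Drop the dataset, then confirm it is gone
redis-cli LANCE DROP old_dataset
redis-cli LANCE LIST
```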
### Data Operations
#### LANCE STORE
Store multimodal data (text/images) with automatic embedding generation.
```bash
LANCE STORE <dataset> [TEXT <text>] [IMAGE <base64>] [key value ...]
```
**Parameters:**
- `dataset`: Target dataset name
- `TEXT`: Text content to embed
- `IMAGE`: Base64-encoded image to embed
- `key value`: Metadata key-value pairs
**Examples:**
```bash
# Store text with metadata
LANCE STORE documents TEXT "Machine learning is transforming industries" category "AI" author "John Doe"
# Store image with metadata
LANCE STORE images IMAGE "iVBORw0KGgoAAAANSUhEUgAA..." category "nature" tags "landscape,mountains"
# Store both text and image
LANCE STORE multimodal TEXT "Beautiful sunset" IMAGE "base64data..." location "California"
```
**Returns:** Unique ID of the stored item
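Since the reply is the new item's ID, it can be captured for later use, for example from a shell script (a small sketch assuming redis-cli against a local instance):
```bash
# Store a document and keep its ID for logging or cross-referencing
DOC_ID=$(redis-cli LANCE STORE documents TEXT "Vector databases explained" category "AI")
echo "Stored document with ID: $DOC_ID"
```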
### Search Operations
#### LANCE SEARCH
Search using a raw vector.
```bash
LANCE SEARCH <dataset> VECTOR <vector> K <k> [NPROBES <n>] [REFINE <r>]
```
**Parameters:**
- `dataset`: Dataset to search
- `vector`: Comma-separated vector values (e.g., "0.1,0.2,0.3")
- `k`: Number of results to return
- `NPROBES`: Number of partitions to search (optional)
- `REFINE`: Refine factor for better accuracy (optional)
**Example:**
```bash
LANCE SEARCH documents VECTOR "0.1,0.2,0.3,0.4" K 5 NPROBES 10
```
#### LANCE SEARCH.TEXT
Search using text query (automatically embedded).
```bash
LANCE SEARCH.TEXT <dataset> <query_text> K <k> [NPROBES <n>] [REFINE <r>]
```
**Parameters:**
- `dataset`: Dataset to search
- `query_text`: Text query to search for
- `k`: Number of results to return
- `NPROBES`: Number of partitions to search (optional)
- `REFINE`: Refine factor for better accuracy (optional)
**Example:**
```bash
LANCE SEARCH.TEXT documents "artificial intelligence applications" K 10 NPROBES 20
```
**Returns:** Array of results with distance scores and metadata
### Embedding Operations
#### LANCE EMBED.TEXT
Generate embeddings for text without storing.
```bash
LANCE EMBED.TEXT <text1> [text2] [text3] ...
```
**Example:**
```bash
LANCE EMBED.TEXT "Hello world" "Machine learning" "Vector database"
```
**Returns:** Array of embedding vectors
### Index Management
#### LANCE CREATE.INDEX
Create a vector index for faster search performance.
```bash
LANCE CREATE.INDEX <dataset> <index_type> [PARTITIONS <n>] [SUBVECTORS <n>]
```
**Parameters:**
- `dataset`: Dataset to index
- `index_type`: Index type (currently supports "IVF_PQ")
- `PARTITIONS`: Number of partitions (default: 256)
- `SUBVECTORS`: Number of sub-vectors for PQ (default: 16)
**Example:**
```bash
LANCE CREATE.INDEX documents IVF_PQ PARTITIONS 512 SUBVECTORS 32
```
## Usage Patterns
### 1. Document Search System
```bash
# Setup
LANCE CREATE documents DIM 384 SCHEMA title:string content:string category:string
# Store documents
LANCE STORE documents TEXT "Introduction to machine learning algorithms" title "ML Basics" category "education"
LANCE STORE documents TEXT "Deep learning neural networks explained" title "Deep Learning" category "education"
LANCE STORE documents TEXT "Building scalable web applications" title "Web Dev" category "programming"
# Create index for better performance
LANCE CREATE.INDEX documents IVF_PQ PARTITIONS 256
# Search
LANCE SEARCH.TEXT documents "neural networks" K 5
```
### 2. Image Similarity Search
```bash
# Setup
LANCE CREATE images DIM 512 SCHEMA filename:string tags:string
# Store images (base64 encoded)
LANCE STORE images IMAGE "iVBORw0KGgoAAAANSUhEUgAA..." filename "sunset.jpg" tags "nature,landscape"
LANCE STORE images IMAGE "iVBORw0KGgoAAAANSUhEUgBB..." filename "city.jpg" tags "urban,architecture"
# To search by image, first store the query image to obtain its embedding,
# then search with that embedding (there is no direct image-query command)
LANCE STORE temp_search IMAGE "query_image_base64..."
```
### 3. Multimodal Content Management
```bash
# Setup
LANCE CREATE content DIM 768 SCHEMA type:string source:string
# Store mixed content
LANCE STORE content TEXT "Product description for smartphone" type "product" source "catalog"
LANCE STORE content IMAGE "product_image_base64..." type "product_image" source "catalog"
# Search across all content types
LANCE SEARCH.TEXT content "smartphone features" K 10
```
## Performance Considerations
### Vector Dimensions
- **384**: Good for general text (e.g., sentence-transformers)
- **768**: Standard for BERT-like models
- **1536**: OpenAI text-embedding-ada-002
- **Higher dimensions**: Better accuracy but slower search
### Index Configuration
- **More partitions**: Better for larger datasets (>100K vectors)
- **More sub-vectors**: Better compression but slower search
- **NPROBES**: Higher values = better accuracy, slower search (see the tuned search example below)
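For example, the same query can be tuned toward speed or accuracy purely through these parameters (the values are illustrative, not recommendations):
```bash
# Faster, less exhaustive: probe only a few partitions
LANCE SEARCH.TEXT documents "neural networks" K 5 NPROBES 5

# More accurate, slower: probe more partitions and refine the candidates
LANCE SEARCH.TEXT documents "neural networks" K 5 NPROBES 50 REFINE 4
```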
### Best Practices
1. **Create indexes** for datasets with >1000 vectors
2. **Use appropriate dimensions** based on your embedding model
3. **Configure NPROBES** based on accuracy vs speed requirements
4. **Batch operations** when possible for better performance
5. **Monitor embedding service** response times and rate limits (see the timing check below)
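A lightweight way to watch embedding latency is to time the embedding endpoint directly with curl (shown here against the default Ollama setup):
```bash
# Print the total request time for a single embedding call
curl -o /dev/null -s -w "embedding latency: %{time_total}s\n" \
  -H "Content-Type: application/json" \
  -d '{"model": "nomic-embed-text", "prompt": "latency check"}' \
  http://localhost:11434/api/embeddings
```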
## Error Handling
Common error scenarios and solutions:
### Embedding Service Errors
```bash
# Error: Embedding service not configured
ERR Embedding service URL not configured. Set it with: HSET config:core:aiembed url <YOUR_EMBEDDING_SERVICE_URL>
# Error: Service unavailable
ERR Embedding service returned error 404 Not Found
```
**Solution:** Ensure embedding service is running and URL is correct.
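When these errors appear, check which URL (if any) is configured and whether the service answers at all (if the hash is empty, the Ollama defaults are used):
```bash
# Show the configured custom embedding service, if any
redis-cli HGETALL config:core:aiembed

# Test the default Ollama endpoint directly
curl -s http://localhost:11434/api/embeddings \
  -H "Content-Type: application/json" \
  -d '{"model": "nomic-embed-text", "prompt": "health check"}'
```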
### Dataset Errors
```bash
# Error: Dataset doesn't exist
ERR Dataset 'mydata' does not exist
# Error: Dimension mismatch
ERR Vector dimension mismatch: expected 384, got 768
```
**Solution:** Create dataset first or check vector dimensions.
### Search Errors
```bash
# Error: Invalid vector format
ERR Invalid vector format
# Error: No index available
ERR No index available for fast search
```
**Solution:** Check vector format or create an index.
## Integration Examples
### With Python
```python
import redis

r = redis.Redis(host='localhost', port=6379)

# Create dataset
r.execute_command('LANCE', 'CREATE', 'docs', 'DIM', '384')

# Store document
result = r.execute_command('LANCE', 'STORE', 'docs',
                           'TEXT', 'Machine learning tutorial',
                           'category', 'education')
print(f"Stored with ID: {result}")

# Search
results = r.execute_command('LANCE', 'SEARCH.TEXT', 'docs',
                            'machine learning', 'K', '5')
print(f"Search results: {results}")
```
### With Node.js
```javascript
const redis = require('redis');

async function main() {
  const client = redis.createClient();
  await client.connect();

  // Create dataset
  await client.sendCommand(['LANCE', 'CREATE', 'docs', 'DIM', '384']);

  // Store document
  const id = await client.sendCommand(['LANCE', 'STORE', 'docs',
    'TEXT', 'Deep learning guide',
    'category', 'AI']);

  // Search
  const results = await client.sendCommand(['LANCE', 'SEARCH.TEXT', 'docs',
    'deep learning', 'K', '10']);
  console.log(id, results);
}

main().catch(console.error);
```
## Monitoring and Maintenance
### Health Checks
```bash
# Check if Lance store is available
LANCE LIST
# Check dataset health
LANCE INFO mydataset
# Test embedding service
LANCE EMBED.TEXT "test"
```
### Maintenance Operations
```bash
# Backup: Use standard Redis backup procedures
# The Lance data is stored separately in the data directory
# Cleanup: Remove unused datasets
LANCE DROP old_dataset
# Reindex: Drop and recreate indexes if needed
LANCE DROP dataset_name
LANCE CREATE dataset_name DIM 384
# Re-import data
LANCE CREATE.INDEX dataset_name IVF_PQ
```
This integration provides a powerful foundation for building AI-powered applications with vector search capabilities while maintaining the familiar Redis interface.