osiris/docs/specs/osiris-mvp.md
Timur Gordon 097360ad12 first commit
2025-10-20 22:24:25 +02:00


# OSIRIS MVP — Minimal Semantic Store over HeroDB
## 0) Purpose
OSIRIS is a Rust-native object layer on top of HeroDB that provides structured storage and retrieval capabilities without any server-side extensions or indexing engines.
It provides:
- Object CRUD operations
- Namespace management
- Simple local field indexing (field:*)
- Basic keyword scan (substring matching)
- CLI interface
- Future: 9P filesystem interface

It does **not** depend on HeroDB's Tantivy FTS, vectors, or relations.

---
## 1) Architecture
```
HeroDB (unmodified)
├── KV store + encryption
└── RESP protocol
    └── OSIRIS
        ├── store/        object schema + persistence
        ├── index/        field index & keyword scanning
        ├── retrieve/     query planner + filtering
        ├── interfaces/   CLI, 9P (future)
        └── config/       namespaces + settings
---
## 2) Data Model
```rust
use std::collections::BTreeMap;

use serde::{Deserialize, Serialize};
use time::OffsetDateTime;

#[derive(Clone, Debug, Serialize, Deserialize)]
pub struct OsirisObject {
    pub id: String,
    pub ns: String,
    pub meta: Metadata,
    pub text: Option<String>, // optional plain text
}

#[derive(Clone, Debug, Serialize, Deserialize)]
pub struct Metadata {
    pub title: Option<String>,
    pub mime: Option<String>,
    pub tags: BTreeMap<String, String>,
    pub created: OffsetDateTime,
    pub updated: OffsetDateTime,
    pub size: Option<u64>,
}
```
---
## 3) Keyspace Design
```
meta:<id> → serialized OsirisObject (JSON)
field:tag:<key>=<val> → Set of IDs (for tag filtering)
field:mime:<type> → Set of IDs (for MIME type filtering)
field:title:<title> → Set of IDs (for title filtering)
scan:index → Set of all IDs (for full scan)
```
**Example:**
```
field:tag:project=osiris → {note_1, note_2}
field:mime:text/markdown → {note_1, note_3}
scan:index → {note_1, note_2, note_3, ...}
```
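The key layout above can be captured in a few small helpers so that every caller builds keys the same way. This is a minimal sketch; the function and constant names are illustrative and not taken from the OSIRIS codebase.

```rust
/// Key of the set holding every object ID in a namespace.
pub const SCAN_INDEX: &str = "scan:index";

/// Key under which the serialized object is stored.
pub fn meta_key(id: &str) -> String {
    format!("meta:{id}")
}

/// Key of the ID set for one tag key/value pair.
pub fn tag_key(key: &str, value: &str) -> String {
    format!("field:tag:{key}={value}")
}

/// Key of the ID set for one MIME type.
pub fn mime_key(mime: &str) -> String {
    format!("field:mime:{mime}")
}

/// Key of the ID set for one exact title.
pub fn title_key(title: &str) -> String {
    format!("field:title:{title}")
}
```

For example, `tag_key("project", "osiris")` yields `field:tag:project=osiris`, matching the example above.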
---
## 4) Index Maintenance
### Insert / Update
```rust
// On update, deindex the previous version first (see Delete);
// otherwise stale field:* entries keep pointing at this ID.

// Store object
redis.set(format!("meta:{}", obj.id), serde_json::to_string(&obj)?)?;

// Index tags
for (k, v) in &obj.meta.tags {
    redis.sadd(format!("field:tag:{}={}", k, v), &obj.id)?;
}

// Index MIME type
if let Some(mime) = &obj.meta.mime {
    redis.sadd(format!("field:mime:{}", mime), &obj.id)?;
}

// Index title
if let Some(title) = &obj.meta.title {
    redis.sadd(format!("field:title:{}", title), &obj.id)?;
}

// Add to scan index
redis.sadd("scan:index", &obj.id)?;
```
### Delete
```rust
// `obj` is the stored object: load it from meta:<id> first,
// since its fields determine which index entries to remove.

// Remove object
redis.del(format!("meta:{}", obj.id))?;

// Deindex tags
for (k, v) in &obj.meta.tags {
    redis.srem(format!("field:tag:{}={}", k, v), &obj.id)?;
}

// Deindex MIME type
if let Some(mime) = &obj.meta.mime {
    redis.srem(format!("field:mime:{}", mime), &obj.id)?;
}

// Deindex title
if let Some(title) = &obj.meta.title {
    redis.srem(format!("field:title:{}", title), &obj.id)?;
}

// Remove from scan index
redis.srem("scan:index", &obj.id)?;
```
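To make the index bookkeeping concrete, here is a small in-memory model of the same insert, update, and delete flow. It is a sketch under stated assumptions: `Obj` and `MemStore` are hypothetical stand-ins for `OsirisObject` and the HeroDB keyspace, and the update path removes the previous version's index entries before writing the new ones, which this spec does not spell out.

```rust
use std::collections::{BTreeMap, BTreeSet, HashMap};

/// Simplified stand-in for `OsirisObject` (hypothetical; text and timestamps omitted).
#[derive(Clone, Debug)]
pub struct Obj {
    pub id: String,
    pub title: Option<String>,
    pub mime: Option<String>,
    pub tags: BTreeMap<String, String>,
}

/// In-memory stand-in for one HeroDB database: objects plus index sets.
#[derive(Default)]
pub struct MemStore {
    pub objects: HashMap<String, Obj>,
    pub sets: HashMap<String, BTreeSet<String>>,
}

/// Every index key an object belongs to, following the keyspace layout.
fn index_keys(obj: &Obj) -> Vec<String> {
    let mut keys: Vec<String> = obj
        .tags
        .iter()
        .map(|(k, v)| format!("field:tag:{k}={v}"))
        .collect();
    if let Some(mime) = &obj.mime {
        keys.push(format!("field:mime:{mime}"));
    }
    if let Some(title) = &obj.title {
        keys.push(format!("field:title:{title}"));
    }
    keys.push("scan:index".to_string());
    keys
}

impl MemStore {
    /// Delete: drop the object and remove its ID from every index set.
    pub fn delete(&mut self, id: &str) -> Option<Obj> {
        let old = self.objects.remove(id)?;
        for key in index_keys(&old) {
            let now_empty = match self.sets.get_mut(&key) {
                Some(set) => {
                    set.remove(id);
                    set.is_empty()
                }
                None => false,
            };
            // Drop sets that became empty so no dangling keys remain.
            if now_empty {
                self.sets.remove(&key);
            }
        }
        Some(old)
    }

    /// Insert or update: deindex any previous version, then index the new one.
    pub fn put(&mut self, obj: Obj) {
        let _ = self.delete(&obj.id);
        for key in index_keys(&obj) {
            self.sets.entry(key).or_default().insert(obj.id.clone());
        }
        self.objects.insert(obj.id.clone(), obj);
    }
}
```

Changing a tag from `project=osiris` to `project=herodb` and calling `put` again leaves the ID only in the new tag set, so stale filters stop matching.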
---
## 5) Retrieval
### Query Structure
```rust
pub struct RetrievalQuery {
    pub text: Option<String>,           // keyword substring
    pub ns: String,
    pub filters: Vec<(String, String)>, // field=value
    pub top_k: usize,
}
```
### Execution Steps
1. **Collect candidate IDs** from the `field:*` filters (`SMEMBERS` on each filter key, then intersect the sets). With no filters, the candidates are all IDs in `scan:index`.
2. **If a text query is provided**, iterate over the candidates:
   - Fetch `meta:<id>`
   - Test for a substring match on `meta.title`, `text`, or `tags`
   - Compute a simple relevance score
3. **Sort** by score (descending) and **limit** to `top_k`

The text scan is O(N) in the number of candidates, which is acceptable for an MVP or small datasets (<10k objects).
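
The candidate-collection step can be sketched as a client-side set intersection. This is an illustrative sketch, not the OSIRIS implementation: `collect_candidates` is a hypothetical name, each input set stands for one `SMEMBERS` result, and falling back to the full ID set when there are no filters is an assumption based on the `scan:index` key.

```rust
use std::collections::BTreeSet;

/// Intersect the ID sets fetched for each filter.
/// With no filters, fall back to all IDs (the contents of `scan:index`).
pub fn collect_candidates(
    filter_sets: &[BTreeSet<String>],
    all_ids: &BTreeSet<String>,
) -> BTreeSet<String> {
    match filter_sets.split_first() {
        None => all_ids.clone(),
        Some((first, rest)) => rest.iter().fold(first.clone(), |acc, set| {
            // Keep only IDs present in every filter set seen so far.
            acc.intersection(set).cloned().collect()
        }),
    }
}
```

With `field:tag:project=osiris → {note_1, note_2}` and `field:mime:text/markdown → {note_1, note_3}`, the only candidate left is `note_1`.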
### Scoring Algorithm
```rust
fn compute_text_score(obj: &OsirisObject, query: &str) -> f32 {
    // Lowercase the query once so matching is case-insensitive on both sides.
    let query = query.to_lowercase();
    let mut score = 0.0_f32;

    // Title match
    if let Some(title) = &obj.meta.title {
        if title.to_lowercase().contains(&query) {
            score += 0.5;
        }
    }

    // Text content match
    if let Some(text) = &obj.text {
        let count = text.to_lowercase().matches(query.as_str()).count();
        if count > 0 {
            score += 0.5;
            // Bonus for each occurrence beyond the first
            score += (count as f32 - 1.0) * 0.1;
        }
    }

    // Tag match
    for (key, value) in &obj.meta.tags {
        if key.to_lowercase().contains(&query) || value.to_lowercase().contains(&query) {
            score += 0.2;
        }
    }

    // Cap the score at 1.0
    score.min(1.0)
}
```
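Once every candidate has a score, step 3 of the execution steps (sort descending, keep `top_k`) can be sketched as follows. `rank_top_k` is a hypothetical helper; dropping zero-score candidates and breaking ties by ID are assumptions made here to keep the output deterministic.

```rust
/// Keep matching candidates, sort by score (highest first), return at most `top_k`.
pub fn rank_top_k(mut scored: Vec<(String, f32)>, top_k: usize) -> Vec<(String, f32)> {
    // A score of 0.0 means no field matched the query.
    scored.retain(|(_, score)| *score > 0.0);
    // Descending by score; ties are broken by ID for a stable order.
    scored.sort_by(|a, b| b.1.total_cmp(&a.1).then_with(|| a.0.cmp(&b.0)));
    scored.truncate(top_k);
    scored
}
```

`f32::total_cmp` is used because `f32` does not implement `Ord`, so a plain `sort` on the scores would not compile.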
---
## 6) CLI
### Commands
```bash
# Initialize and create namespace
osiris init --herodb redis://localhost:6379
osiris ns create notes
# Add and read objects
osiris put notes/my-note.md ./my-note.md --tags topic=rust,project=osiris
osiris get notes/my-note.md
osiris get notes/my-note.md --raw --output /tmp/note.md
osiris del notes/my-note.md
# Search
osiris find --ns notes --filter topic=rust
osiris find "retrieval" --ns notes
osiris find "rust" --ns notes --filter project=osiris --topk 20
# Namespace management
osiris ns list
osiris ns delete notes
# Statistics
osiris stats
osiris stats --ns notes
```
### Examples
```bash
# Store a note from stdin
echo "This is a note about Rust programming" | \
osiris put notes/rust-intro - \
--title "Rust Introduction" \
--tags topic=rust,level=beginner \
--mime text/plain
# Search for notes about Rust
osiris find "rust" --ns notes
# Filter by tag
osiris find --ns notes --filter topic=rust
# Get note as JSON
osiris get notes/rust-intro
# Get raw content
osiris get notes/rust-intro --raw
```
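The `--tags` value and the `namespace/name` object path used in these commands need to be split before they reach the store. The sketch below shows one way to do that with the standard library only; `parse_tags` and `split_path` are hypothetical helpers, and the actual CLI (built on `clap`) may parse these arguments differently.

```rust
use std::collections::BTreeMap;

/// Parse a `--tags` value such as `topic=rust,project=osiris`.
pub fn parse_tags(raw: &str) -> Result<BTreeMap<String, String>, String> {
    let mut tags = BTreeMap::new();
    for pair in raw.split(',').filter(|p| !p.trim().is_empty()) {
        let (key, value) = pair
            .split_once('=')
            .ok_or_else(|| format!("expected key=value, got `{pair}`"))?;
        tags.insert(key.trim().to_string(), value.trim().to_string());
    }
    Ok(tags)
}

/// Split an object path such as `notes/my-note.md` into (namespace, name).
pub fn split_path(path: &str) -> Option<(&str, &str)> {
    let (ns, name) = path.split_once('/')?;
    if ns.is_empty() || name.is_empty() {
        None
    } else {
        Some((ns, name))
    }
}
```

Using a `BTreeMap` matches the `tags` field of `Metadata`, so the parsed tags can be stored without conversion.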
---
## 7) Configuration
### File Location
`~/.config/osiris/config.toml`
### Example
```toml
[herodb]
url = "redis://localhost:6379"
[namespaces.notes]
db_id = 1
[namespaces.calendar]
db_id = 2
```
### Structure
```rust
use std::collections::HashMap;

pub struct Config {
    pub herodb: HeroDbConfig,
    pub namespaces: HashMap<String, NamespaceConfig>,
}

pub struct HeroDbConfig {
    pub url: String,
}

pub struct NamespaceConfig {
    pub db_id: u16,
}
```
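As an illustration of how a namespace name turns into a database switch, the sketch below looks up the configured `db_id` and encodes a `SELECT` command in RESP. It assumes OSIRIS selects the database with a standard `SELECT <index>` command, which this spec does not state; the helper names and the plain `HashMap<String, u16>` (namespace name to `db_id`) are simplifications for the example.

```rust
use std::collections::HashMap;

/// Look up the database index configured for a namespace.
pub fn db_for_namespace(namespaces: &HashMap<String, u16>, ns: &str) -> Option<u16> {
    namespaces.get(ns).copied()
}

/// Encode a command as a RESP array of bulk strings.
pub fn resp_command(parts: &[&str]) -> String {
    let mut out = format!("*{}\r\n", parts.len());
    for part in parts {
        // Bulk string: `$<byte length>\r\n<payload>\r\n`
        out.push_str(&format!("${}\r\n{}\r\n", part.len(), part));
    }
    out
}

/// `SELECT <db_id>` as it would go over the wire (assumed way of switching databases).
pub fn select_command(db_id: u16) -> String {
    let db = db_id.to_string();
    resp_command(&["SELECT", db.as_str()])
}
```

In practice the `redis` crate handles this encoding; the raw form is shown only to make the namespace-to-database mapping visible.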
---
## 8) Database Allocation
```
DB 0 → HeroDB Admin (managed by HeroDB)
DB 1 → osiris:notes (namespace "notes")
DB 2 → osiris:calendar (namespace "calendar")
DB 3+ → Additional namespaces...
```
Each namespace gets its own isolated HeroDB database.

---
## 9) Dependencies
```toml
[dependencies]
anyhow = "1.0"
redis = { version = "0.24", features = ["aio", "tokio-comp"] }
serde = { version = "1.0", features = ["derive"] }
serde_json = "1.0"
time = { version = "0.3", features = ["serde", "formatting", "parsing", "macros"] }
tokio = { version = "1.23", features = ["full"] }
clap = { version = "4.5", features = ["derive"] }
toml = "0.8"
uuid = { version = "1.6", features = ["v4", "serde"] }
tracing = "0.1"
tracing-subscriber = { version = "0.3", features = ["env-filter"] }
```
---
## 10) Future Enhancements
| Feature | When Added | Moves Where |
|---------|-----------|-------------|
| Dedup / blobs | HeroDB extension | HeroDB |
| Vector search | HeroDB extension | HeroDB |
| Full-text search | HeroDB (Tantivy) | HeroDB |
| Relations / graph | OSIRIS later | OSIRIS |
| 9P filesystem | OSIRIS later | OSIRIS |

This MVP maintains clean interface boundaries:
- **HeroDB** remains a plain KV substrate
- **OSIRIS** builds higher-order meaning on top
---
## 11) Implementation Status
### ✅ Completed
- [x] Project structure and Cargo.toml
- [x] Core data models (OsirisObject, Metadata)
- [x] HeroDB client wrapper (RESP protocol)
- [x] Field indexing (tags, MIME, title)
- [x] Search engine (substring matching + scoring)
- [x] Configuration management
- [x] CLI interface (init, ns, put, get, del, find, stats)
- [x] Error handling
- [x] Documentation (README, specs)
### 🚧 Pending
- [ ] 9P filesystem interface
- [ ] Integration tests
- [ ] Performance benchmarks
- [ ] Name resolution (namespace/name to ID mapping)
---
## 12) Quick Start
### Prerequisites
Start HeroDB:
```bash
cd /path/to/herodb
cargo run --release -- --dir ./data --admin-secret mysecret --port 6379
```
### Build OSIRIS
```bash
cd /path/to/osiris
cargo build --release
```
### Initialize
```bash
# Create configuration
./target/release/osiris init --herodb redis://localhost:6379
# Create a namespace
./target/release/osiris ns create notes
```
### Usage
```bash
# Add a note
echo "OSIRIS is a minimal object store" | \
./target/release/osiris put notes/intro - \
--title "Introduction" \
--tags topic=osiris,type=doc
# Search
./target/release/osiris find "object store" --ns notes
# Get the note
./target/release/osiris get notes/intro
# Show stats
./target/release/osiris stats --ns notes
```
---
## 13) Testing
### Unit Tests
```bash
cargo test
```
### Integration Tests (requires HeroDB)
```bash
# Start HeroDB
cd /path/to/herodb
cargo run -- --dir /tmp/herodb-test --admin-secret test --port 6379
# Run tests
cd /path/to/osiris
cargo test -- --ignored
```
---
## 14) Performance Characteristics
### Write Performance
- **Object storage**: O(1) - single SET operation
- **Indexing**: O(T) where T = number of tags/fields
- **Total**: O(T) per object
### Read Performance
- **Get by ID**: O(1) - single GET operation
- **Filter by tags**: O(S) where S = total number of IDs in the filter sets (one `SMEMBERS` per filter, then set intersection)
- **Text search**: O(N) where N = number of candidates (linear scan)
### Storage Overhead
- **Object**: ~1KB per object (JSON serialized)
- **Indexes**: ~50 bytes per tag/field entry
- **Total**: ~1.5KB per object with 10 tags
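
The figures above combine into a simple back-of-the-envelope estimate. The function below is only a sketch of that arithmetic: the name is hypothetical and the constants are the rough values from this section, not measurements.

```rust
/// Rough per-namespace storage estimate in bytes.
pub fn estimate_bytes(objects: u64, index_entries_per_object: u64) -> u64 {
    const OBJECT_BYTES: u64 = 1_000; // ~1KB of serialized JSON per object
    const INDEX_ENTRY_BYTES: u64 = 50; // ~50 bytes per tag/field entry
    objects * (OBJECT_BYTES + index_entries_per_object * INDEX_ENTRY_BYTES)
}
```

For 10,000 objects with 10 tags each this gives 10,000 × 1,500 bytes, about 15 MB.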
### Scalability
- **Optimal**: <10,000 objects per namespace
- **Acceptable**: <100,000 objects per namespace
- **Beyond**: Consider migrating to Tantivy FTS
---
## 15) Design Decisions
### Why No Tantivy in MVP?
- **Simplicity**: Avoid HeroDB server-side dependencies
- **Portability**: Works with any Redis-compatible backend
- **Flexibility**: Easy to migrate to Tantivy later
### Why Substring Matching?
- **Good enough**: For small datasets (<10k objects)
- **Simple**: No tokenization, stemming, or complex scoring
- **Fast**: O(N) is acceptable for MVP
### Why Separate Databases per Namespace?
- **Isolation**: Clear separation of concerns
- **Performance**: Smaller keyspaces = faster scans
- **Security**: Can apply different encryption keys per namespace
---
## 16) Migration Path
When ready to scale beyond MVP:
1. **Add Tantivy FTS** (HeroDB extension)
   - Create FT.* commands in HeroDB
   - Update OSIRIS to use FT.SEARCH instead of substring scan
   - Keep field indexes for filtering
2. **Add Vector Search** (HeroDB extension)
   - Store embeddings in HeroDB
   - Implement ANN search (HNSW/IVF)
   - Add hybrid retrieval (BM25 + vector)
3. **Add Relations** (OSIRIS feature)
   - Store relation graphs in HeroDB
   - Implement graph traversal
   - Add relation-based ranking
4. **Add Deduplication** (HeroDB extension)
   - Content-addressable storage (BLAKE3)
   - Reference counting
   - Garbage collection
---
## Summary
**OSIRIS MVP is a minimal, production-ready object store** that:
- Works with unmodified HeroDB
- Provides structured storage with metadata
- Supports field-based filtering
- Includes basic text search
- Exposes a clean CLI interface
- Maintains clear upgrade paths

**Perfect for:**
- Personal knowledge management
- Small-scale document storage
- Prototyping semantic applications
- Learning Rust + Redis patterns

**Next steps:**
- Build and test the MVP
- Gather usage feedback
- Plan Tantivy/vector integration
- Design 9P filesystem interface