db/heromodels/docs/tst_implementation_plan.md
2025-06-27 12:11:04 +03:00

365 lines
12 KiB
Markdown

# Ternary Search Tree (TST) Implementation Plan
## 1. Overview
A Ternary Search Tree (TST) is a type of trie where each node has three children: left, middle, and right. Unlike a RadixTree which compresses common prefixes, a TST stores one character per node and uses a binary search tree-like structure for efficient traversal.
```mermaid
graph TD
A[Root Node 'r'] --> B[Left Child 'a']
A --> C[Middle Child 'o']
A --> D[Right Child 't']
C --> E[Middle Child 'o']
E --> F[Middle Child 'm' - End of Key]
E --> G[Middle Child 't' - End of Key]
```
The TST implementation will use OurDB as the backend for persistent storage, similar to the existing RadixTree implementation. The goal is to provide a more balanced tree structure that offers consistent performance across all operations (set, get, delete, list).
## 2. Core Data Structures
### 2.1 TST Node Structure
```rust
pub struct TSTNode {
// The character stored at this node
pub character: char,
// Value stored at this node (empty if not end of key)
pub value: Vec<u8>,
// Whether this node represents the end of a key
pub is_end_of_key: bool,
// References to child nodes
pub left_id: Option<u32>, // For characters < current character
pub middle_id: Option<u32>, // For characters == current character (next character in key)
pub right_id: Option<u32>, // For characters > current character
}
```
### 2.2 TST Structure
```rust
pub struct TST {
// Database for persistent storage
db: OurDB,
// Database ID of the root node
root_id: Option<u32>,
}
```
## 3. API Design
The TST will maintain similar core functionality to RadixTree but with an API that better suits its structure:
```rust
impl TST {
// Creates a new TST with the specified database path
pub fn new(path: &str, reset: bool) -> Result<Self, Error>;
// Sets a key-value pair in the tree
pub fn set(&mut self, key: &str, value: Vec<u8>) -> Result<(), Error>;
// Gets a value by key from the tree
pub fn get(&mut self, key: &str) -> Result<Vec<u8>, Error>;
// Deletes a key from the tree
pub fn delete(&mut self, key: &str) -> Result<(), Error>;
// Lists all keys with a given prefix
pub fn list(&mut self, prefix: &str) -> Result<Vec<String>, Error>;
// Gets all values for keys with a given prefix
pub fn getall(&mut self, prefix: &str) -> Result<Vec<Vec<u8>>, Error>;
}
```
## 4. Implementation Strategy
### 4.1 Phase 1: Core Data Structures and Serialization
```mermaid
graph TD
A[Define TSTNode and TST structs] --> B[Implement serialization/deserialization]
B --> C[Implement Error handling]
C --> D[Implement OurDB integration]
```
1. Define the `TSTNode` and `TST` structs
2. Implement serialization and deserialization for `TSTNode`
3. Define error types for TST-specific errors
4. Implement OurDB integration for node storage and retrieval
### 4.2 Phase 2: Basic Tree Operations
```mermaid
graph TD
A[Implement new] --> B[Implement set]
B --> C[Implement get]
C --> D[Implement helper functions]
```
1. Implement the `new()` function for creating a new TST
2. Implement the `set()` function for inserting key-value pairs
3. Implement the `get()` function for retrieving values
4. Implement helper functions for node traversal and manipulation
### 4.3 Phase 3: Advanced Tree Operations
```mermaid
graph TD
A[Implement delete] --> B[Implement list]
B --> C[Implement getall]
C --> D[Optimize operations]
```
1. Implement the `delete()` function for removing keys
2. Implement the `list()` function for prefix-based key listing
3. Implement the `getall()` function for retrieving all values with a prefix
4. Optimize operations for balanced performance
### 4.4 Phase 4: Testing and Performance Evaluation
```mermaid
graph TD
A[Create unit tests] --> B[Create integration tests]
B --> C[Create performance tests]
C --> D[Compare with RadixTree]
D --> E[Optimize based on results]
```
1. Create unit tests for each component
2. Create integration tests for the complete system
3. Create performance tests similar to RadixTree's
4. Compare performance with RadixTree
5. Optimize based on performance results
## 5. Implementation Details
### 5.1 Node Structure and Storage
Each TST node will store a single character and have three child pointers (left, middle, right). The nodes will be serialized and stored in OurDB, with node references using OurDB record IDs.
### 5.2 Key Operations
#### 5.2.1 Insertion (set)
```mermaid
graph TD
A[Start at root] --> B{Root exists?}
B -- No --> C[Create root node]
B -- Yes --> D[Compare current char with node char]
D -- Less than --> E[Go to left child]
D -- Equal to --> F[Go to middle child]
D -- Greater than --> G[Go to right child]
E --> H{Child exists?}
F --> H
G --> H
H -- No --> I[Create new node]
H -- Yes --> J[Continue with next char]
I --> J
J --> K{End of key?}
K -- Yes --> L[Set value and mark as end of key]
K -- No --> D
```
1. Start at the root node
2. For each character in the key:
- If the character is less than the current node's character, go to the left child
- If the character is equal to the current node's character, go to the middle child
- If the character is greater than the current node's character, go to the right child
- If the child doesn't exist, create a new node
3. When the end of the key is reached, set the value and mark the node as end of key
#### 5.2.2 Lookup (get)
1. Start at the root node
2. For each character in the key:
- If the character is less than the current node's character, go to the left child
- If the character is equal to the current node's character, go to the middle child
- If the character is greater than the current node's character, go to the right child
- If the child doesn't exist, the key is not found
3. When the end of the key is reached, check if the node is marked as end of key
- If yes, return the value
- If no, the key is not found
#### 5.2.3 Deletion (delete)
1. Find the node corresponding to the end of the key
2. If the node has no children, remove it and update its parent
3. If the node has children, mark it as not end of key and clear its value
4. Recursively clean up any nodes that are no longer needed
#### 5.2.4 Prefix Operations (list, getall)
1. Find the node corresponding to the end of the prefix
2. Perform a traversal of the subtree rooted at that node
3. Collect all keys (for list) or values (for getall) from nodes marked as end of key
### 5.3 Serialization and OurDB Integration
#### 5.3.1 Node Structure for Serialization
Each TSTNode will be serialized with the following logical structure:
1. Version marker (for future format evolution)
2. Character data
3. Is-end-of-key flag
4. Value (if is-end-of-key is true)
5. Child node references (left, middle, right)
#### 5.3.2 OurDB Integration
The TST will use OurDB for node storage and retrieval:
1. **Node Storage**: Each node will be serialized and stored as a record in OurDB.
```rust
fn save_node(&mut self, node_id: Option<u32>, node: &TSTNode) -> Result<u32, Error> {
let data = node.serialize();
let args = OurDBSetArgs {
id: node_id,
data: &data,
};
Ok(self.db.set(args)?)
}
```
2. **Node Retrieval**: Nodes will be retrieved from OurDB and deserialized.
```rust
fn get_node(&mut self, node_id: u32) -> Result<TSTNode, Error> {
let data = self.db.get(node_id)?;
TSTNode::deserialize(&data)
}
```
3. **Root Node Management**: The TST will maintain a root node ID for traversal.
#### 5.3.3 Handling Large Datasets
For large datasets, we'll implement a batching approach similar to the RadixTree's large-scale tests:
1. **Batch Processing**: Process large datasets in manageable batches to avoid OurDB size limitations.
2. **Database Partitioning**: Create separate database instances for very large datasets.
3. **Memory Management**: Implement efficient memory usage patterns to avoid excessive memory consumption.
## 6. Project Structure
```
tst/
├── Cargo.toml
├── src/
│ ├── lib.rs # Public API and re-exports
│ ├── node.rs # TSTNode implementation
│ ├── serialize.rs # Serialization and deserialization
│ ├── error.rs # Error types
│ └── operations.rs # Tree operations implementation
├── tests/
│ ├── basic_test.rs # Basic operations tests
│ ├── prefix_test.rs # Prefix operations tests
│ └── edge_cases.rs # Edge case tests
└── examples/
├── basic_usage.rs # Basic usage example
├── prefix_ops.rs # Prefix operations example
└── performance.rs # Performance benchmark
```
## 7. Performance Considerations
### 7.1 Advantages of TST over RadixTree
1. **Balanced Structure**: TST naturally maintains a more balanced structure, which can lead to more consistent performance across operations.
2. **Character-by-Character Comparison**: TST performs character-by-character comparisons, which can be more efficient for certain workloads.
3. **Efficient Prefix Operations**: TST can efficiently handle prefix operations by traversing the middle child path.
### 7.2 Potential Optimizations
1. **Node Caching**: Cache frequently accessed nodes to reduce database operations.
2. **Balancing Techniques**: Implement balancing techniques to ensure the tree remains balanced.
3. **Batch Operations**: Support batch operations for improved performance.
4. **Memory Management**: Implement efficient memory usage patterns to avoid excessive memory consumption.
## 8. Testing Strategy
### 8.1 Unit Tests
1. Test `TSTNode` serialization/deserialization
2. Test character comparison operations
3. Test error handling
### 8.2 Integration Tests
1. Test basic CRUD operations
2. Test prefix operations
3. Test edge cases (empty keys, very long keys, etc.)
4. Test with large datasets
### 8.3 Performance Tests
1. Measure throughput for set/get operations
2. Measure latency for different operations
3. Test with different tree sizes and key distributions
4. Compare performance with RadixTree
#### 8.3.1 Performance Benchmarking
We'll create comprehensive benchmarks to compare the TST implementation with RadixTree:
```rust
// Example benchmark structure
fn benchmark_set_operations(tree_type: &str, num_records: usize) -> Duration {
let start_time = Instant::now();
// Create tree (TST or RadixTree)
let mut tree = match tree_type {
"tst" => create_tst(),
"radix" => create_radix_tree(),
_ => panic!("Unknown tree type"),
};
// Insert records
for i in 0..num_records {
let key = format!("key:{:08}", i);
let value = format!("val{}", i).into_bytes();
tree.set(&key, value).unwrap();
}
start_time.elapsed()
}
```
We'll benchmark the following operations:
- Set (insertion)
- Get (lookup)
- Delete
- List (prefix search)
- GetAll (prefix values)
For each operation, we'll measure:
- Throughput (operations per second)
- Latency (time per operation)
- Memory usage
- Database size
We'll test with various dataset characteristics:
- Small datasets (100-1,000 keys)
- Medium datasets (10,000-100,000 keys)
- Large datasets (1,000,000+ keys)
- Keys with common prefixes
- Keys with random distribution
- Long keys vs. short keys
## 9. Timeline and Milestones
1. **Week 1**: Core data structures and serialization
2. **Week 2**: Basic tree operations
3. **Week 3**: Advanced tree operations
4. **Week 4**: Testing and performance evaluation
5. **Week 5**: Optimization and documentation
## 10. Conclusion
This implementation plan provides a roadmap for creating a Ternary Search Tree (TST) as an alternative to the RadixTree implementation. The TST will maintain the same core functionality while providing a more balanced tree structure and aiming for balanced performance across all operations.
The implementation will leverage OurDB for persistent storage, similar to RadixTree, but with a different node structure and traversal algorithm that better suits the TST approach.