365 lines
12 KiB
Markdown
365 lines
12 KiB
Markdown
# Ternary Search Tree (TST) Implementation Plan
|
|
|
|
## 1. Overview
|
|
|
|
A Ternary Search Tree (TST) is a type of trie where each node has three children: left, middle, and right. Unlike a RadixTree which compresses common prefixes, a TST stores one character per node and uses a binary search tree-like structure for efficient traversal.
|
|
|
|
```mermaid
|
|
graph TD
|
|
A[Root Node 'r'] --> B[Left Child 'a']
|
|
A --> C[Middle Child 'o']
|
|
A --> D[Right Child 't']
|
|
C --> E[Middle Child 'o']
|
|
E --> F[Middle Child 'm' - End of Key]
|
|
E --> G[Middle Child 't' - End of Key]
|
|
```
|
|
|
|
The TST implementation will use OurDB as the backend for persistent storage, similar to the existing RadixTree implementation. The goal is to provide a more balanced tree structure that offers consistent performance across all operations (set, get, delete, list).
|
|
|
|
## 2. Core Data Structures
|
|
|
|
### 2.1 TST Node Structure
|
|
|
|
```rust
|
|
pub struct TSTNode {
|
|
// The character stored at this node
|
|
pub character: char,
|
|
|
|
// Value stored at this node (empty if not end of key)
|
|
pub value: Vec<u8>,
|
|
|
|
// Whether this node represents the end of a key
|
|
pub is_end_of_key: bool,
|
|
|
|
// References to child nodes
|
|
pub left_id: Option<u32>, // For characters < current character
|
|
pub middle_id: Option<u32>, // For characters == current character (next character in key)
|
|
pub right_id: Option<u32>, // For characters > current character
|
|
}
|
|
```
|
|
|
|
### 2.2 TST Structure
|
|
|
|
```rust
|
|
pub struct TST {
|
|
// Database for persistent storage
|
|
db: OurDB,
|
|
|
|
// Database ID of the root node
|
|
root_id: Option<u32>,
|
|
}
|
|
```
|
|
|
|
## 3. API Design
|
|
|
|
The TST will maintain similar core functionality to RadixTree but with an API that better suits its structure:
|
|
|
|
```rust
|
|
impl TST {
|
|
// Creates a new TST with the specified database path
|
|
pub fn new(path: &str, reset: bool) -> Result<Self, Error>;
|
|
|
|
// Sets a key-value pair in the tree
|
|
pub fn set(&mut self, key: &str, value: Vec<u8>) -> Result<(), Error>;
|
|
|
|
// Gets a value by key from the tree
|
|
pub fn get(&mut self, key: &str) -> Result<Vec<u8>, Error>;
|
|
|
|
// Deletes a key from the tree
|
|
pub fn delete(&mut self, key: &str) -> Result<(), Error>;
|
|
|
|
// Lists all keys with a given prefix
|
|
pub fn list(&mut self, prefix: &str) -> Result<Vec<String>, Error>;
|
|
|
|
// Gets all values for keys with a given prefix
|
|
pub fn getall(&mut self, prefix: &str) -> Result<Vec<Vec<u8>>, Error>;
|
|
}
|
|
```
|
|
|
|
## 4. Implementation Strategy
|
|
|
|
### 4.1 Phase 1: Core Data Structures and Serialization
|
|
|
|
```mermaid
|
|
graph TD
|
|
A[Define TSTNode and TST structs] --> B[Implement serialization/deserialization]
|
|
B --> C[Implement Error handling]
|
|
C --> D[Implement OurDB integration]
|
|
```
|
|
|
|
1. Define the `TSTNode` and `TST` structs
|
|
2. Implement serialization and deserialization for `TSTNode`
|
|
3. Define error types for TST-specific errors
|
|
4. Implement OurDB integration for node storage and retrieval
|
|
|
|
### 4.2 Phase 2: Basic Tree Operations
|
|
|
|
```mermaid
|
|
graph TD
|
|
A[Implement new] --> B[Implement set]
|
|
B --> C[Implement get]
|
|
C --> D[Implement helper functions]
|
|
```
|
|
|
|
1. Implement the `new()` function for creating a new TST
|
|
2. Implement the `set()` function for inserting key-value pairs
|
|
3. Implement the `get()` function for retrieving values
|
|
4. Implement helper functions for node traversal and manipulation
|
|
|
|
### 4.3 Phase 3: Advanced Tree Operations
|
|
|
|
```mermaid
|
|
graph TD
|
|
A[Implement delete] --> B[Implement list]
|
|
B --> C[Implement getall]
|
|
C --> D[Optimize operations]
|
|
```
|
|
|
|
1. Implement the `delete()` function for removing keys
|
|
2. Implement the `list()` function for prefix-based key listing
|
|
3. Implement the `getall()` function for retrieving all values with a prefix
|
|
4. Optimize operations for balanced performance
|
|
|
|
### 4.4 Phase 4: Testing and Performance Evaluation
|
|
|
|
```mermaid
|
|
graph TD
|
|
A[Create unit tests] --> B[Create integration tests]
|
|
B --> C[Create performance tests]
|
|
C --> D[Compare with RadixTree]
|
|
D --> E[Optimize based on results]
|
|
```
|
|
|
|
1. Create unit tests for each component
|
|
2. Create integration tests for the complete system
|
|
3. Create performance tests similar to RadixTree's
|
|
4. Compare performance with RadixTree
|
|
5. Optimize based on performance results
|
|
|
|
## 5. Implementation Details
|
|
|
|
### 5.1 Node Structure and Storage
|
|
|
|
Each TST node will store a single character and have three child pointers (left, middle, right). The nodes will be serialized and stored in OurDB, with node references using OurDB record IDs.
|
|
|
|
### 5.2 Key Operations
|
|
|
|
#### 5.2.1 Insertion (set)
|
|
|
|
```mermaid
|
|
graph TD
|
|
A[Start at root] --> B{Root exists?}
|
|
B -- No --> C[Create root node]
|
|
B -- Yes --> D[Compare current char with node char]
|
|
D -- Less than --> E[Go to left child]
|
|
D -- Equal to --> F[Go to middle child]
|
|
D -- Greater than --> G[Go to right child]
|
|
E --> H{Child exists?}
|
|
F --> H
|
|
G --> H
|
|
H -- No --> I[Create new node]
|
|
H -- Yes --> J[Continue with next char]
|
|
I --> J
|
|
J --> K{End of key?}
|
|
K -- Yes --> L[Set value and mark as end of key]
|
|
K -- No --> D
|
|
```
|
|
|
|
1. Start at the root node
|
|
2. For each character in the key:
|
|
- If the character is less than the current node's character, go to the left child
|
|
- If the character is equal to the current node's character, go to the middle child
|
|
- If the character is greater than the current node's character, go to the right child
|
|
- If the child doesn't exist, create a new node
|
|
3. When the end of the key is reached, set the value and mark the node as end of key
|
|
|
|
#### 5.2.2 Lookup (get)
|
|
|
|
1. Start at the root node
|
|
2. For each character in the key:
|
|
- If the character is less than the current node's character, go to the left child
|
|
- If the character is equal to the current node's character, go to the middle child
|
|
- If the character is greater than the current node's character, go to the right child
|
|
- If the child doesn't exist, the key is not found
|
|
3. When the end of the key is reached, check if the node is marked as end of key
|
|
- If yes, return the value
|
|
- If no, the key is not found
|
|
|
|
#### 5.2.3 Deletion (delete)
|
|
|
|
1. Find the node corresponding to the end of the key
|
|
2. If the node has no children, remove it and update its parent
|
|
3. If the node has children, mark it as not end of key and clear its value
|
|
4. Recursively clean up any nodes that are no longer needed
|
|
|
|
#### 5.2.4 Prefix Operations (list, getall)
|
|
|
|
1. Find the node corresponding to the end of the prefix
|
|
2. Perform a traversal of the subtree rooted at that node
|
|
3. Collect all keys (for list) or values (for getall) from nodes marked as end of key
|
|
|
|
### 5.3 Serialization and OurDB Integration
|
|
|
|
#### 5.3.1 Node Structure for Serialization
|
|
|
|
Each TSTNode will be serialized with the following logical structure:
|
|
|
|
1. Version marker (for future format evolution)
|
|
2. Character data
|
|
3. Is-end-of-key flag
|
|
4. Value (if is-end-of-key is true)
|
|
5. Child node references (left, middle, right)
|
|
|
|
#### 5.3.2 OurDB Integration
|
|
|
|
The TST will use OurDB for node storage and retrieval:
|
|
|
|
1. **Node Storage**: Each node will be serialized and stored as a record in OurDB.
|
|
```rust
|
|
fn save_node(&mut self, node_id: Option<u32>, node: &TSTNode) -> Result<u32, Error> {
|
|
let data = node.serialize();
|
|
let args = OurDBSetArgs {
|
|
id: node_id,
|
|
data: &data,
|
|
};
|
|
Ok(self.db.set(args)?)
|
|
}
|
|
```
|
|
|
|
2. **Node Retrieval**: Nodes will be retrieved from OurDB and deserialized.
|
|
```rust
|
|
fn get_node(&mut self, node_id: u32) -> Result<TSTNode, Error> {
|
|
let data = self.db.get(node_id)?;
|
|
TSTNode::deserialize(&data)
|
|
}
|
|
```
|
|
|
|
3. **Root Node Management**: The TST will maintain a root node ID for traversal.
|
|
|
|
#### 5.3.3 Handling Large Datasets
|
|
|
|
For large datasets, we'll implement a batching approach similar to the RadixTree's large-scale tests:
|
|
|
|
1. **Batch Processing**: Process large datasets in manageable batches to avoid OurDB size limitations.
|
|
2. **Database Partitioning**: Create separate database instances for very large datasets.
|
|
3. **Memory Management**: Implement efficient memory usage patterns to avoid excessive memory consumption.
|
|
|
|
## 6. Project Structure
|
|
|
|
```
|
|
tst/
|
|
├── Cargo.toml
|
|
├── src/
|
|
│ ├── lib.rs # Public API and re-exports
|
|
│ ├── node.rs # TSTNode implementation
|
|
│ ├── serialize.rs # Serialization and deserialization
|
|
│ ├── error.rs # Error types
|
|
│ └── operations.rs # Tree operations implementation
|
|
├── tests/
|
|
│ ├── basic_test.rs # Basic operations tests
|
|
│ ├── prefix_test.rs # Prefix operations tests
|
|
│ └── edge_cases.rs # Edge case tests
|
|
└── examples/
|
|
├── basic_usage.rs # Basic usage example
|
|
├── prefix_ops.rs # Prefix operations example
|
|
└── performance.rs # Performance benchmark
|
|
```
|
|
|
|
## 7. Performance Considerations
|
|
|
|
### 7.1 Advantages of TST over RadixTree
|
|
|
|
1. **Balanced Structure**: TST naturally maintains a more balanced structure, which can lead to more consistent performance across operations.
|
|
2. **Character-by-Character Comparison**: TST performs character-by-character comparisons, which can be more efficient for certain workloads.
|
|
3. **Efficient Prefix Operations**: TST can efficiently handle prefix operations by traversing the middle child path.
|
|
|
|
### 7.2 Potential Optimizations
|
|
|
|
1. **Node Caching**: Cache frequently accessed nodes to reduce database operations.
|
|
2. **Balancing Techniques**: Implement balancing techniques to ensure the tree remains balanced.
|
|
3. **Batch Operations**: Support batch operations for improved performance.
|
|
4. **Memory Management**: Implement efficient memory usage patterns to avoid excessive memory consumption.
|
|
|
|
## 8. Testing Strategy
|
|
|
|
### 8.1 Unit Tests
|
|
|
|
1. Test `TSTNode` serialization/deserialization
|
|
2. Test character comparison operations
|
|
3. Test error handling
|
|
|
|
### 8.2 Integration Tests
|
|
|
|
1. Test basic CRUD operations
|
|
2. Test prefix operations
|
|
3. Test edge cases (empty keys, very long keys, etc.)
|
|
4. Test with large datasets
|
|
|
|
### 8.3 Performance Tests
|
|
|
|
1. Measure throughput for set/get operations
|
|
2. Measure latency for different operations
|
|
3. Test with different tree sizes and key distributions
|
|
4. Compare performance with RadixTree
|
|
|
|
#### 8.3.1 Performance Benchmarking
|
|
|
|
We'll create comprehensive benchmarks to compare the TST implementation with RadixTree:
|
|
|
|
```rust
|
|
// Example benchmark structure
|
|
fn benchmark_set_operations(tree_type: &str, num_records: usize) -> Duration {
|
|
let start_time = Instant::now();
|
|
|
|
// Create tree (TST or RadixTree)
|
|
let mut tree = match tree_type {
|
|
"tst" => create_tst(),
|
|
"radix" => create_radix_tree(),
|
|
_ => panic!("Unknown tree type"),
|
|
};
|
|
|
|
// Insert records
|
|
for i in 0..num_records {
|
|
let key = format!("key:{:08}", i);
|
|
let value = format!("val{}", i).into_bytes();
|
|
tree.set(&key, value).unwrap();
|
|
}
|
|
|
|
start_time.elapsed()
|
|
}
|
|
```
|
|
|
|
We'll benchmark the following operations:
|
|
- Set (insertion)
|
|
- Get (lookup)
|
|
- Delete
|
|
- List (prefix search)
|
|
- GetAll (prefix values)
|
|
|
|
For each operation, we'll measure:
|
|
- Throughput (operations per second)
|
|
- Latency (time per operation)
|
|
- Memory usage
|
|
- Database size
|
|
|
|
We'll test with various dataset characteristics:
|
|
- Small datasets (100-1,000 keys)
|
|
- Medium datasets (10,000-100,000 keys)
|
|
- Large datasets (1,000,000+ keys)
|
|
- Keys with common prefixes
|
|
- Keys with random distribution
|
|
- Long keys vs. short keys
|
|
|
|
## 9. Timeline and Milestones
|
|
|
|
1. **Week 1**: Core data structures and serialization
|
|
2. **Week 2**: Basic tree operations
|
|
3. **Week 3**: Advanced tree operations
|
|
4. **Week 4**: Testing and performance evaluation
|
|
5. **Week 5**: Optimization and documentation
|
|
|
|
## 10. Conclusion
|
|
|
|
This implementation plan provides a roadmap for creating a Ternary Search Tree (TST) as an alternative to the RadixTree implementation. The TST will maintain the same core functionality while providing a more balanced tree structure and aiming for balanced performance across all operations.
|
|
|
|
The implementation will leverage OurDB for persistent storage, similar to RadixTree, but with a different node structure and traversal algorithm that better suits the TST approach. |