Hero Books - Document Management System
A Rust-based document collection management system with CLI, library, and web interfaces. It processes markdown-based documentation with support for cross-collection references, link validation, and export to self-contained directories.
Project Structure
This is a single Rust package (hero_books) with multiple entry points:
- Library (src/lib.rs) - Core functionality for document collection management
- CLI (src/bin/books.rs) - Command-line interface (books_client binary)
- Web Server (src/main.rs) - HTTP API and web interface (books_server binary)
- Modules: see Module Documentation below
Module Documentation
- DocTree Module - Complete document collection management system
  - Collection scanning and indexing
  - Link parsing and validation
  - Include directive processing
  - Access control management
  - Export to self-contained directories
  - Read-only client API
- Website Module - Metadata-driven website definitions
  - Website configuration (Docusaurus-style)
  - Navigation bar with dropdown menus
  - Sidebar navigation with multiple sidebars
  - Footer with link columns
  - Page-level metadata and SEO
  - Theme and styling configuration
  - Social links and custom fields
- Ontology Module - AI-powered semantic extraction
  - Document classification against 10 topic ontologies
  - Semantic concept and relationship extraction
  - Relationship validation with self-correction
  - Embedded ontologies (no external files needed)
  - Chunking support for large documents
Quick Start
```bash
# Build the project
make build

# Run the web server
make run

# Run the CLI
make run-cli

# Run in development mode with debug logging
make dev

# See all available commands
make help
```
Features
- Collection scanning: Automatically discover collections marked with .collection files
- Cross-collection references: Link between pages in different collections using collection:page syntax
- Include directives: Embed content from other pages with !!include collection:page
- Link validation: Detect broken links to pages, images, and files
- Export: Generate self-contained directories with all dependencies
- Access control: Group-based ACL via .group files
- Git integration: Automatically detect repository URLs
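The same-collection, cross-collection, and external link rules above can be sketched as a small classifier. Illustrative only; `classify_link` and the `LinkTarget` enum are assumptions, not names from the actual crate:

```rust
/// Classify a link target per the rules described in this README:
/// HTTP(S) URLs are external, `collection:page` is cross-collection,
/// anything else refers to a page in the same collection.
#[derive(Debug, PartialEq)]
enum LinkTarget {
    External(String),
    CrossCollection { collection: String, page: String },
    SameCollection(String),
}

fn classify_link(target: &str) -> LinkTarget {
    if target.starts_with("http://") || target.starts_with("https://") {
        LinkTarget::External(target.to_string())
    } else if let Some((collection, page)) = target.split_once(':') {
        LinkTarget::CrossCollection {
            collection: collection.to_string(),
            page: page.to_string(),
        }
    } else {
        LinkTarget::SameCollection(target.to_string())
    }
}

fn main() {
    assert!(matches!(classify_link("docs:intro"), LinkTarget::CrossCollection { .. }));
    assert!(matches!(classify_link("https://example.com"), LinkTarget::External(_)));
    assert!(matches!(classify_link("page1"), LinkTarget::SameCollection(_)));
}
```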
Installation
Build from source
```bash
make build
```
Binaries will be at:
- target/release/books_client - CLI client
- target/release/books_server - Web server
Install to PATH
```bash
make install
```
Installs both binaries to ~/hero/bin/:
- ~/hero/bin/books_client - CLI client
- ~/hero/bin/books_server - Web server
Development build install (fastest compile):
```bash
make installdev
```
Ensure ~/hero/bin is in your PATH. Add to ~/.bashrc or ~/.zshrc:
```bash
export PATH="$HOME/hero/bin:$PATH"
```
Architecture & Concepts
Separation of Concerns
Hero Books separates content from presentation:
DocTree (Content):
- Manages markdown collections and pages
- Validates links and references
- Tracks files and images
- Processes include directives
- Enforces access control
Website (Presentation):
- Defines navigation structure
- Configures sidebars and menus
- Manages theming and styling
- Handles SEO metadata
- Provides plugin architecture
This separation allows flexible website layouts without changing content.
Key Concepts
Collections: Directories of markdown pages marked with a .collection file
- Each collection is independently managed
- Collections can reference each other
- Access control per collection via ACL files
Pages: Individual markdown files with:
- Extracted title (from H1 heading)
- Description (from first paragraph)
- Parsed internal links and includes
- Optional front matter metadata
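The title and description extraction described above can be sketched roughly as follows (a hypothetical helper, not the crate's real API):

```rust
/// Extract page metadata as described in this README: the title comes from
/// the first H1 heading, the description from the first non-heading paragraph.
fn extract_title_and_description(markdown: &str) -> (Option<String>, Option<String>) {
    let mut title = None;
    let mut description = None;
    for line in markdown.lines() {
        let line = line.trim();
        if line.is_empty() {
            continue;
        }
        if title.is_none() && line.starts_with("# ") {
            title = Some(line[2..].trim().to_string());
        } else if description.is_none() && !line.starts_with('#') {
            description = Some(line.to_string());
        }
        if title.is_some() && description.is_some() {
            break;
        }
    }
    (title, description)
}

fn main() {
    let md = "# Getting Started\n\nThis page explains setup.\n";
    let (title, desc) = extract_title_and_description(md);
    assert_eq!(title.as_deref(), Some("Getting Started"));
    assert_eq!(desc.as_deref(), Some("This page explains setup."));
}
```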
Links: References to pages, images, or files:
- Same collection: [text](page_name)
- Cross-collection: [text](collection:page)
- External: Automatic detection of HTTP(S) URLs
- Images: Identified by extension
Groups: Access control lists defining user membership
- Grant read/write access to collections
- Support wildcards for email patterns
- Support group inclusion (nested groups)
Export: Self-contained read-only directory:
- Pages and files organized by collection
- JSON metadata for each collection
- Suitable for static hosting or archival
Data Flow
```
Directory Scan
      ↓
Find Collections (.collection files)
      ↓
Parse Pages (extract metadata, parse links)
      ↓
Validate Links (check references exist)
      ↓
Process Includes (expand !!include directives)
      ↓
Enforce ACL (check group membership)
      ↓
Export (write to structured directory)
      ↓
Read Client (query exported collections)
```
Refactoring Notes
This project was refactored from a multi-package workspace into a single unified package following Rust best practices:
Previous Structure (workspace with 3 crates):
- lib/ - atlas-lib library
- atlas/ - atlas CLI binary
- web/ - atlas-web server
New Structure (single package with multiple binaries):
- Single hero_books package in Cargo.toml
- Two binaries via src/bin/:
  - cli.rs → books_cli binary (legacy)
  - books.rs → books_client binary
- Main server binary: src/main.rs → books_server binary
- Modular organization in src/:
  - cli/ - CLI command definitions
  - doctree/ - Core document management
  - ebook/ - Ebook parsing
  - web/ - HTTP server handlers
  - website/ - Website configuration
Benefits:
- Simpler dependency management
- Unified build system
- Easier code sharing between CLI and web server
- Cleaner project organization
- Aligned with Rust conventions for monolithic applications
CLI Usage
The books_client CLI talks to a running books_server via OpenRPC.
Start the server first
```bash
books_server --port 9567 --books-dir /path/to/docs
```
Scan for collections
```bash
# Scan a local path (server must have access)
books_client scan --path /path/to/docs

# Scan from git repository
books_client scan --git-url https://github.com/user/docs.git
```
List and inspect collections
```bash
# List all collections
books_client list

# Get collection details
books_client get my-collection

# Get all pages in a collection
books_client get-pages my-collection

# Get a specific page
books_client get-page my-collection page-name
```
Process collections
```bash
# Process for Q&A extraction and embeddings
books_client process my-collection

# Force reprocessing
books_client process my-collection --force
```
Metadata management
```bash
# Get collection metadata
books_client get-metadata my-collection

# Set collection metadata
books_client set-metadata my-collection --json '{"key": "value"}'
```
Server health
```bash
# Check server health
books_client health

# View OpenRPC schema
books_client discover
```
Directory Structure
Source Structure
```
docs/
├── collection1/
│   ├── .collection        # Marks as collection (optional: name:custom_name)
│   ├── read.acl           # Optional: group names for read access
│   ├── write.acl          # Optional: group names for write access
│   ├── page1.md
│   ├── subdir/
│   │   └── page2.md
│   └── img/
│       └── logo.png
├── collection2/
│   ├── .collection
│   └── intro.md
└── groups/                # Special collection for ACL groups
    ├── .collection
    ├── admins.group
    └── editors.group
```
Export Structure
```
/tmp/books/
├── content/
│   └── collection_name/
│       ├── page1.md          # Pages at root of collection dir
│       ├── page2.md
│       ├── img/              # All images in img/ subdirectory
│       │   └── logo.png
│       └── files/            # All other files in files/ subdirectory
│           └── document.pdf
└── meta/
    └── collection_name.json  # Collection metadata
```
File Formats
.collection
```
name:custom_collection_name
```
If the file is empty or no name: entry is present, the directory name is used.
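This fallback rule can be sketched as follows (illustrative; `collection_name` is a hypothetical helper, not the crate's API):

```rust
use std::path::Path;

/// Resolve a collection's name from its `.collection` file contents,
/// falling back to the directory name when the file is empty or
/// contains no `name:` entry.
fn collection_name(dir: &Path, collection_file_contents: &str) -> String {
    collection_file_contents
        .lines()
        .find_map(|l| l.trim().strip_prefix("name:").map(|n| n.trim().to_string()))
        .filter(|n| !n.is_empty())
        .unwrap_or_else(|| {
            dir.file_name()
                .map(|n| n.to_string_lossy().to_string())
                .unwrap_or_default()
        })
}

fn main() {
    let dir = Path::new("/docs/my_docs");
    assert_eq!(collection_name(dir, "name:custom_name\n"), "custom_name");
    assert_eq!(collection_name(dir, ""), "my_docs");
}
```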
.group
```
// Comments start with //
user@example.com
*@company.com
include:other_group
```
ACL files (read.acl, write.acl)
```
admins
editors
```
One group name per line.
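Membership checking against these files (exact emails, `*@domain` wildcards, `include:` nesting) might look like the following sketch. `is_member` and the in-memory group map are assumptions for illustration, not the crate's API:

```rust
use std::collections::HashMap;

/// Check whether a user matches a group's entries. Entries may be exact
/// emails, `*@domain` wildcards, or `include:other_group` indirections.
/// Sketch only: no cycle detection on nested includes.
fn is_member(user: &str, group: &str, groups: &HashMap<&str, Vec<&str>>) -> bool {
    let Some(entries) = groups.get(group) else { return false };
    entries.iter().any(|entry| {
        if let Some(inner) = entry.strip_prefix("include:") {
            is_member(user, inner, groups)
        } else if let Some(domain) = entry.strip_prefix("*@") {
            user.ends_with(&format!("@{domain}"))
        } else {
            *entry == user
        }
    })
}

fn main() {
    let mut groups = HashMap::new();
    groups.insert("editors", vec!["alice@example.com", "*@company.com"]);
    groups.insert("admins", vec!["include:editors", "root@example.com"]);
    assert!(is_member("bob@company.com", "editors", &groups));
    assert!(is_member("alice@example.com", "admins", &groups));
    assert!(!is_member("eve@other.com", "admins", &groups));
}
```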
Link Syntax
Page links
```markdown
[text](page_name)        # Same collection
[text](collection:page)  # Cross-collection
```
Image links
```markdown
![alt](image.png)             # Same collection
![alt](collection:image.png)  # Cross-collection
```
Include directives
```
!!include page_name
!!include collection:page_name
```
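A minimal parser for the two directive forms above; `parse_include` is a hypothetical helper, not the crate's actual function:

```rust
/// Parse a `!!include` directive line into (optional collection, page name),
/// matching the two forms documented in this README.
fn parse_include(line: &str) -> Option<(Option<&str>, &str)> {
    let rest = line.trim().strip_prefix("!!include")?.trim();
    if rest.is_empty() {
        return None;
    }
    match rest.split_once(':') {
        Some((collection, page)) => Some((Some(collection), page)),
        None => Some((None, rest)),
    }
}

fn main() {
    assert_eq!(parse_include("!!include intro"), Some((None, "intro")));
    assert_eq!(parse_include("!!include docs:intro"), Some((Some("docs"), "intro")));
    assert_eq!(parse_include("regular text"), None);
}
```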
Name Normalization
Page and collection names are normalized:
- Convert to lowercase
- Replace - with _
- Replace / with _
- Remove the .md extension
- Strip numeric prefixes (e.g., 03_page → page)
- Remove special characters
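Taken together, the rules might be implemented along these lines. This is a sketch: `normalize_name` is not the crate's actual function, and the real implementation may apply the steps in a different order:

```rust
/// Strip a leading all-digit prefix like "03_" from a name.
fn strip_numeric_prefix(s: &str) -> &str {
    match s.split_once('_') {
        Some((prefix, rest))
            if !prefix.is_empty() && prefix.bytes().all(|b| b.is_ascii_digit()) => rest,
        _ => s,
    }
}

/// Apply the normalization rules listed above to a page or collection name.
fn normalize_name(raw: &str) -> String {
    // Drop the .md extension, then lowercase.
    let name = raw.strip_suffix(".md").unwrap_or(raw).to_lowercase();
    // Strip a numeric prefix such as "03_".
    let name = strip_numeric_prefix(&name);
    // Replace '-' and '/' with '_'; drop other special characters.
    name.chars()
        .filter_map(|c| match c {
            '-' | '/' => Some('_'),
            c if c.is_ascii_alphanumeric() || c == '_' => Some(c),
            _ => None,
        })
        .collect()
}

fn main() {
    assert_eq!(normalize_name("03_My-Page.md"), "my_page");
    assert_eq!(normalize_name("Sub/Dir-Name"), "sub_dir_name");
}
```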
Supported Image Extensions
.png, .jpg, .jpeg, .gif, .svg, .webp, .bmp, .tiff, .ico
Service Management with Zinit
Hero Books can be registered as a Zinit-managed service with automatic restart, health checks, and port management.
Starting as a Zinit Service
```bash
# Start web server as Zinit-managed service
books_server --port 9567 --start

# Start with custom books directory
books_server --port 9567 --books-dir ./books --start

# Multi-instance support
books_server --port 9567 --start --instance prod
books_server --port 9568 --start --instance dev
```
Service Management
Once started with --start, services are managed by Zinit:
```bash
# View service status
zinit status books_server

# View service logs
zinit logs books_server

# Stop service
zinit stop books_server

# Restart service
zinit restart books_server

# Multi-instance commands
zinit status books_server_prod
zinit logs books_server_dev
```
Service Features
- Automatic Restart: Service restarts on failure with 5s delay
- Health Checks: TCP port health checks every 10s
- Max Restarts: Up to 5 restart attempts before stopping
- Logging: Full log history available via zinit logs
- Verification: Defensive self-test to verify successful startup
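The TCP health check amounts to attempting a connection to the service port. A standard-library sketch of that idea (the demo binds a throwaway local listener to stand in for a running books_server):

```rust
use std::net::{SocketAddr, TcpListener, TcpStream};
use std::time::Duration;

/// A port is considered healthy if a TCP connection succeeds within the timeout.
fn port_is_healthy(addr: SocketAddr) -> bool {
    TcpStream::connect_timeout(&addr, Duration::from_secs(2)).is_ok()
}

fn main() -> std::io::Result<()> {
    // Stand-in for a running books_server: any listener on a local port.
    let listener = TcpListener::bind("127.0.0.1:0")?;
    let addr = listener.local_addr()?;
    assert!(port_is_healthy(addr));
    println!("healthy: {addr}");
    Ok(())
}
```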
Error Handling
If service startup fails, Zinit will:
- Attempt TCP connection to verify port binding
- Check service state and PID
- Display recent logs on failure
- Clean up failed service registration
Detailed error messages provide diagnostic information:
- Port already in use
- Binary path incorrect
- Zinit server not running
- Permission denied
Development
Code Quality
This project maintains high code quality standards:
- Dead Code Cleanup: Unused code is either removed or marked with #[allow(dead_code)] and a clear justification:
  - flatten_chapter_pages() - utility function kept for testing
  - classify_topic() - public API method reserved for future use
  - embeddings_from_cache - field used for statistics reporting
- Compiler Warnings: All compiler warnings in the hero_books crate are resolved (remaining warnings come from external dependencies only)
Building
```bash
# Build release binaries
make build

# Build with debug info (dev mode)
cargo build

# Run tests
make test

# Run all tests including integration tests
make test-all

# Generate documentation
cargo doc --no-deps --open

# Check for compiler warnings
cargo check
```
Testing
```bash
# Run all tests
cargo test

# Run specific module tests
cargo test doctree
cargo test website

# Run with output
cargo test -- --nocapture
```
Code Organization
```
src/
├── lib.rs       # Library exports
├── main.rs      # Web server entry point (books_server binary)
├── bin/
│   ├── books.rs # CLI entry point (books_client binary)
│   └── cli.rs   # Legacy CLI entry point
├── cli/         # CLI commands and handlers
├── doctree/     # Document management
├── ebook/       # Ebook parsing
├── ontology/    # AI-powered semantic extraction
├── vectorsdk/   # Vector search and embeddings
├── publishing/  # Publishing configuration
├── book/        # Book and PDF processing
├── web/         # HTTP API routes and handlers
└── website/     # Website configuration
```
Adding New Features
- New DocTree functionality: Add to src/doctree/
- New Website config: Add to src/website/
- New CLI commands: Add to src/cli/mod.rs
- New API endpoints: Add to src/web/mod.rs
Library Usage
Ontology Processing
The ontology processor uses AI to classify documents and extract semantic concepts/relationships.
```rust
use hero_books::ontology::{OntologyProcessor, ProcessorConfig, ONTOLOGIES};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Create processor with default config
    let processor = OntologyProcessor::new();
    let document = "Our SaaS product integrates with Slack...";

    // Classification only (quick)
    let matches = processor.classify(document).await?;
    for m in &matches {
        println!("{}: {} (primary: {})", m.topic, m.score, m.is_primary);
    }

    // Full processing (classification + extraction)
    let result = processor.process(document).await?;
    for sem in &result.semantics {
        println!("{}: {} concepts, {} relationships",
            sem.category, sem.concepts.len(), sem.relationships.len());
    }

    // Direct extraction for specific topics
    let semantics = processor.extract(document, &["product", "technology"]).await?;

    Ok(())
}
```
Available Topics: business, technology, product, commercial, people, news, legal, financial, health, education
Configuration:
```rust
let config = ProcessorConfig {
    confidence_threshold: Some(8),  // Min score to consider (default: 7)
    max_topics: Some(3),            // Limit topics processed
    filter_topics: Some(vec!["product".into(), "technology".into()]),
    temperature: Some(0.0),         // LLM temperature
    max_input_tokens: Some(60_000), // Chunk if larger
    ..Default::default()
};
let processor = OntologyProcessor::with_config(config);
```
Requirements: Set an API key environment variable:
- GROQ_API_KEY (preferred)
- SAMBANOVA_API_KEY
- OPENROUTER_API_KEY
See examples/src/ontology_processing.rs for a complete example.
DocTree
```rust
use std::path::{Path, PathBuf};

use doctree::{DocTree, ExportArgs};

fn main() -> doctree::Result<()> {
    // Create and scan
    let mut doctree = DocTree::new("mydocs");
    doctree.scan(Path::new("/path/to/docs"), &[])?;
    doctree.init_post()?; // Validate links

    // Access pages
    let page = doctree.page_get("collection:page")?;
    let content = page.content()?;

    // Export
    doctree.export(ExportArgs {
        destination: PathBuf::from("/tmp/books"),
        reset: true,
        include: false,
    })?;

    Ok(())
}
```
DocTreeClient (for reading exports)
```rust
use std::path::Path;

use doctree::DocTreeClient;

fn main() -> doctree::Result<()> {
    let client = DocTreeClient::new(Path::new("/tmp/books"))?;

    // List collections
    let collections = client.list_collections()?;

    // Get page content
    let content = client.get_page_content("collection", "page")?;

    // Check existence
    if client.page_exists("collection", "page") {
        println!("Page exists!");
    }

    Ok(())
}
```
Repository Layout
```
atlasserver_rust/
├── Cargo.toml            # Package configuration
├── Makefile              # Build automation (make help to see all targets)
├── build.rs              # Build script
├── README.md             # This file
├── openrpc.json          # OpenRPC 1.3.2 API specification
├── src/
│   ├── lib.rs            # Library entry point and module declarations
│   ├── main.rs           # Web server binary entry point
│   ├── bin/
│   │   ├── books.rs      # CLI binary entry point (books_client)
│   │   └── cli.rs        # Legacy CLI entry point
│   ├── cli/              # CLI commands and handlers
│   ├── doctree/          # Document tree management
│   ├── ebook/            # Ebook parsing
│   ├── ontology/         # AI-powered semantic extraction
│   ├── vectorsdk/        # Vector search and embeddings
│   ├── publishing/       # Publishing configuration
│   ├── book/             # Book and PDF processing
│   ├── web/              # HTTP API routes and handlers
│   └── website/          # Website configuration
├── crates/
│   └── books-client/     # Rust client library for the API
├── examples/             # Example code for library usage
└── target/
    ├── debug/            # Debug builds
    └── release/
        ├── books_client  # CLI binary
        └── books_server  # Web server binary
```