# Hero Supervisor Documentation

## Overview

Hero Supervisor is a distributed job execution system that manages runners and coordinates job processing across multiple worker nodes. It exposes an OpenRPC (JSON-RPC) API for job management and runner administration.

## Architecture

The supervisor consists of several key components:

- **Supervisor Core**: Central coordinator that manages runners and job dispatch
- **OpenRPC Server**: JSON-RPC API server for remote management
- **Redis Backend**: Job queue and state management
- **Process Manager**: Runner lifecycle management (Simple or Tmux)
- **Client Libraries**: Native Rust and WASM clients for integration

## Quick Start

### Starting the Supervisor

```bash
# With default configuration
./supervisor

# With custom configuration file
./supervisor --config /path/to/config.toml
```

### Example Configuration

```toml
# config.toml
redis_url = "redis://localhost:6379"
namespace = "hero"
bind_address = "127.0.0.1"
port = 3030

# Admin secrets for full access
admin_secrets = ["admin-secret-123"]

# User secrets for job operations
user_secrets = ["user-secret-456"]

# Register secrets for runner registration
register_secrets = ["register-secret-789"]

[[actors]]
id = "sal_runner_1"
name = "sal_runner_1"
binary_path = "/path/to/sal_runner"
db_path = "/tmp/sal_db"
redis_url = "redis://localhost:6379"
process_manager = "simple"

[[actors]]
id = "osis_runner_1"
name = "osis_runner_1"
binary_path = "/path/to/osis_runner"
db_path = "/tmp/osis_db"
redis_url = "redis://localhost:6379"
process_manager = "tmux:osis_session"
```

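
The two `process_manager` values in the sample configuration (`"simple"` and `"tmux:osis_session"`) suggest a `kind[:session]` format. The sketch below shows one way such a value could be parsed. The `ProcessManager` enum and `parse_process_manager` function are hypothetical names used for illustration, and the reading of the part after `tmux:` as a session name is an assumption drawn from the sample value, not a statement about the supervisor's own parser.

```rust
/// How a runner process is managed. Hypothetical type that mirrors the two
/// `process_manager` values shown in the example configuration.
#[derive(Debug, PartialEq)]
pub enum ProcessManager {
    /// `process_manager = "simple"`
    Simple,
    /// `process_manager = "tmux:<session>"`, carrying the session name
    Tmux(String),
}

/// Parse a `process_manager` string.
/// Assumption: the text after `tmux:` is a tmux session name.
pub fn parse_process_manager(value: &str) -> Result<ProcessManager, String> {
    match value.split_once(':') {
        // No colon at all: only the literal "simple" is accepted.
        None if value == "simple" => Ok(ProcessManager::Simple),
        // "tmux:<session>" with a non-empty session name.
        Some(("tmux", session)) if !session.is_empty() => {
            Ok(ProcessManager::Tmux(session.to_string()))
        }
        // Anything else is rejected with a descriptive message.
        _ => Err(format!("unrecognized process_manager value: {}", value)),
    }
}
```

Rejecting unknown values up front keeps a typo in `config.toml` from surfacing later as a runner that silently fails to start.
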
## API Documentation

### Job API Convention

The Hero Supervisor follows a consistent naming convention for job operations:

- **`jobs.`** - General job operations (create, list)
- **`job.`** - Specific job operations (run, start, status, result)

See [Job API Convention](job-api-convention.md) for detailed documentation.

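
The prefix rule above is easy to check mechanically. As a minimal sketch, the hypothetical `job_scope` helper below classifies a method name by its prefix; the `JobScope` enum is an illustrative name and not part of the client library.

```rust
/// Scope of a job method under the `jobs.` / `job.` naming convention.
/// Hypothetical type for illustration.
#[derive(Debug, PartialEq)]
pub enum JobScope {
    /// `jobs.*`: general operations such as create and list
    Collection,
    /// `job.*`: operations on one specific job
    Single,
}

/// Classify a method name by prefix. Returns `None` for methods that are
/// not job operations (for example runner or administration methods).
pub fn job_scope(method: &str) -> Option<JobScope> {
    if method.starts_with("jobs.") {
        Some(JobScope::Collection)
    } else if method.starts_with("job.") {
        Some(JobScope::Single)
    } else {
        None
    }
}
```
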
### Core Methods

#### Runner Management
- `register_runner` - Register a new runner
- `list_runners` - List all registered runners
- `start_runner` / `stop_runner` - Control runner lifecycle
- `get_runner_status` - Get runner status
- `get_runner_logs` - Retrieve runner logs

#### Job Management
- `jobs.create` - Create a job without queuing
- `jobs.list` - List all jobs with full details
- `job.run` - Run a job and return result
- `job.start` - Start a created job
- `job.stop` - Stop a running job
- `job.delete` - Delete a job from the system
- `job.status` - Get job status (non-blocking)
- `job.result` - Get job result (blocking)

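
Because the API is JSON-RPC, each of these methods travels in a standard JSON-RPC 2.0 request envelope (`jsonrpc`, `id`, `method`, `params`). The sketch below builds such an envelope with the standard library only. The helper names `rpc_request` and `json_escape` are hypothetical, and passing parameters as a positional array of strings is an assumption; the authoritative parameter layout for each method is whatever `rpc.discover` reports.

```rust
/// Escape a string so it can be embedded in a JSON string literal.
pub fn json_escape(s: &str) -> String {
    let mut out = String::with_capacity(s.len());
    for c in s.chars() {
        match c {
            '"' => out.push_str("\\\""),
            '\\' => out.push_str("\\\\"),
            '\n' => out.push_str("\\n"),
            '\r' => out.push_str("\\r"),
            '\t' => out.push_str("\\t"),
            // Remaining control characters use the \u00XX form.
            c if (c as u32) < 0x20 => out.push_str(&format!("\\u{:04x}", c as u32)),
            c => out.push(c),
        }
    }
    out
}

/// Build a JSON-RPC 2.0 request body.
/// Assumption: parameters are sent as a positional array of strings.
pub fn rpc_request(id: u64, method: &str, params: &[&str]) -> String {
    let params: Vec<String> = params
        .iter()
        .map(|p| format!("\"{}\"", json_escape(p)))
        .collect();
    format!(
        "{{\"jsonrpc\":\"2.0\",\"id\":{},\"method\":\"{}\",\"params\":[{}]}}",
        id,
        json_escape(method),
        params.join(",")
    )
}
```

For instance, `rpc_request(1, "job.status", &["job-123"])` yields a body that could be sent to the supervisor's endpoint with any HTTP client; the job ID shown is a placeholder.
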
#### Administration
- `add_secret` / `remove_secret` - Manage authentication secrets
- `get_supervisor_info` - Get system information
- `rpc.discover` - OpenRPC specification discovery

## Client Usage

### Rust Client

```rust
use std::time::Duration;

// JobResult is matched on below; it is assumed to be exported by the same crate.
use hero_supervisor_openrpc_client::{JobBuilder, JobResult, SupervisorClient};

// Assumes the tokio runtime (already required by tokio::time::sleep below).
#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Create client
    let client = SupervisorClient::new("http://localhost:3030")?;

    // Create a job
    let job = JobBuilder::new()
        .caller_id("my_client")
        .context_id("my_context")
        .payload("print('Hello World')")
        .executor("osis")
        .runner("osis_runner_1")
        .timeout(60)
        .build()?;

    // Option 1: Run the job and wait for its result
    let result = client.job_run("user-secret", job.clone()).await?;
    match result {
        JobResult::Success { success } => println!("Output: {}", success),
        JobResult::Error { error } => println!("Error: {}", error),
    }

    // Option 2: Asynchronous execution (create, start, then poll)
    let job_id = client.jobs_create("user-secret", job).await?;
    client.job_start("user-secret", &job_id).await?;

    // Poll for completion
    loop {
        let status = client.job_status(&job_id).await?;
        if status.status == "completed" || status.status == "failed" {
            break;
        }
        tokio::time::sleep(Duration::from_secs(1)).await;
    }

    // Fetch the result once the job has reached a terminal state
    let result = client.job_result(&job_id).await?;

    // Option 3: Job management
    // Stop a running job
    client.job_stop("user-secret", &job_id).await?;

    // Delete a job
    client.job_delete("user-secret", &job_id).await?;

    // List all jobs (returns full Job objects)
    let jobs = client.jobs_list("user-secret").await?;
    for job in jobs {
        println!("Job {}: {} ({})", job.id, job.executor, job.payload);
    }

    Ok(())
}
```

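
The polling loop in the example above runs until the job reports `completed` or `failed`, so a job that never reaches either state would keep it spinning. A minimal sketch of a bounded variant follows. It is synchronous and uses only the standard library so that it stands alone; `poll_until_done` is a hypothetical helper, and the `check` closure stands in for a status call such as `client.job_status(..)`. In async code the same structure would use `tokio::time::sleep` instead of `std::thread::sleep`.

```rust
use std::thread::sleep;
use std::time::{Duration, Instant};

/// Poll `check` until it returns a terminal status ("completed" or "failed")
/// or until `timeout` has elapsed. Returns `None` on timeout.
/// Hypothetical helper; `check` stands in for a real status request.
pub fn poll_until_done<F>(mut check: F, interval: Duration, timeout: Duration) -> Option<String>
where
    F: FnMut() -> String,
{
    let deadline = Instant::now() + timeout;
    loop {
        let status = check();
        if status == "completed" || status == "failed" {
            return Some(status);
        }
        // Give up once the deadline has passed instead of looping forever.
        if Instant::now() >= deadline {
            return None;
        }
        sleep(interval);
    }
}
```

Returning `None` on timeout lets the caller decide whether to stop the job, retry, or report an error.
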
### WASM Client

```javascript
import { WasmSupervisorClient, WasmJob } from 'hero-supervisor-openrpc-client';

// Create client
const client = new WasmSupervisorClient('http://localhost:3030');

// Create and run job
const job = new WasmJob('job-id', 'print("Hello")', 'osis', 'osis_runner_1');
const result = await client.create_job('user-secret', job);
```

## Security

### Authentication Levels

1. **Admin Secrets**: Full system access
   - All runner management operations
   - All job operations
   - Secret management
   - System information access

2. **User Secrets**: Job operations only
   - Create, run, start jobs
   - Get job status and results
   - No runner or secret management

3. **Register Secrets**: Runner registration only
   - Register new runners
   - No other operations

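
The three levels above amount to a small permission table. The sketch below encodes that table as a function. `SecretKind`, `Operation`, and `is_allowed` are hypothetical names, and the grouping of methods into five operation classes is an assumption made for illustration; it is not taken from the supervisor's source.

```rust
/// The three secret types described above.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum SecretKind {
    Admin,
    User,
    Register,
}

/// Coarse operation classes (an illustrative grouping, not the real API).
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Operation {
    RunnerManagement,
    RunnerRegistration,
    JobOperation,
    SecretManagement,
    SystemInfo,
}

/// Return true when a secret of the given kind may perform the operation.
pub fn is_allowed(kind: SecretKind, op: Operation) -> bool {
    match kind {
        // Admin secrets grant full system access.
        SecretKind::Admin => true,
        // User secrets cover job operations only.
        SecretKind::User => op == Operation::JobOperation,
        // Register secrets cover runner registration only.
        SecretKind::Register => op == Operation::RunnerRegistration,
    }
}
```
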
### Best Practices

- Use different secret types for different access levels
- Rotate secrets regularly
- Store secrets securely (environment variables, secret management systems)
- Use HTTPS in production environments
- Implement proper logging and monitoring

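
As one illustration of keeping secrets out of source files, the sketch below reads a comma-separated list of secrets from an environment variable. The helper names and the variable name `HERO_ADMIN_SECRETS` mentioned in the note are hypothetical; whether the supervisor itself reads secrets from the environment is not stated here, so treat this as a pattern for wrapper code or deployment tooling.

```rust
use std::env;

/// Split a comma-separated list into trimmed, non-empty secrets.
pub fn parse_secret_list(raw: &str) -> Vec<String> {
    raw.split(',')
        .map(str::trim)
        .filter(|s| !s.is_empty())
        .map(String::from)
        .collect()
}

/// Read secrets from the named environment variable.
/// Returns an empty list when the variable is unset or not valid Unicode.
pub fn secrets_from_env(var: &str) -> Vec<String> {
    match env::var(var) {
        Ok(raw) => parse_secret_list(&raw),
        Err(_) => Vec::new(),
    }
}
```

A caller might use `secrets_from_env("HERO_ADMIN_SECRETS")` and refuse to start when the returned list is empty, rather than fall back to a hard-coded default.
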
## Development

### Building

```bash
# Build supervisor binary
cargo build --release

# Build with OpenRPC feature
cargo build --release --features openrpc

# Build client library
cd clients/openrpc
cargo build --release
```

### Testing

```bash
# Run tests
cargo test

# Run the ignored tests, which require a running Redis server
docker run -d -p 6379:6379 redis:alpine
cargo test -- --ignored
```

### Examples

See the `examples/` directory for:
- Basic supervisor setup
- Mock runner implementation
- Comprehensive OpenRPC client usage
- Configuration examples

## Troubleshooting

### Common Issues

1. **Redis Connection Failed**
   - Ensure Redis server is running
   - Check Redis URL in configuration
   - Verify network connectivity

2. **Runner Registration Failed**
   - Check register secret validity
   - Verify runner binary path exists
   - Ensure runner has proper permissions

3. **Job Execution Timeout**
   - Increase job timeout value
   - Check runner resource availability
   - Monitor runner logs for issues

4. **OpenRPC Method Not Found**
   - Verify method name spelling
   - Check OpenRPC specification
   - Ensure server supports the method

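
For the Redis connection case, a quick way to separate a wrong URL from a network problem is to attempt a plain TCP connection to the configured address. The sketch below does this with the standard library only. `redis_authority` and `redis_reachable` are hypothetical helpers; the check covers only `redis://host:port` style URLs with an explicit port, and it confirms that the port accepts connections, not that a Redis server is answering commands.

```rust
use std::net::{TcpStream, ToSocketAddrs};
use std::time::Duration;

/// Extract "host:port" from a URL such as "redis://localhost:6379".
/// Any path (for example "/0") and any "user:pass@" prefix are dropped.
pub fn redis_authority(url: &str) -> Option<&str> {
    let rest = url.strip_prefix("redis://")?;
    let authority = rest.split('/').next()?;
    let authority = authority.rsplit('@').next()?;
    if authority.is_empty() {
        None
    } else {
        Some(authority)
    }
}

/// Return true when a TCP connection to the URL's host and port succeeds
/// within `timeout`. This is a reachability check only, not a Redis PING.
pub fn redis_reachable(url: &str, timeout: Duration) -> bool {
    let authority = match redis_authority(url) {
        Some(a) => a,
        None => return false,
    };
    // Resolve the host name; a failure here points at DNS or a malformed address.
    let addrs = match authority.to_socket_addrs() {
        Ok(addrs) => addrs,
        Err(_) => return false,
    };
    addrs
        .into_iter()
        .any(|addr| TcpStream::connect_timeout(&addr, timeout).is_ok())
}
```
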
### Logging

Enable debug logging:
```bash
RUST_LOG=debug ./supervisor --config config.toml
```

### Monitoring

Monitor key metrics:
- Runner status and health
- Job queue lengths
- Job success/failure rates
- Response times
- Redis connection status

## Contributing

1. Fork the repository
2. Create a feature branch
3. Make changes with tests
4. Update documentation
5. Submit a pull request

## License

[License information here]