# Hero Dispatcher Protocol

This document describes the Redis-based protocol used by the Hero Dispatcher for job management and worker communication.

## Overview

The Hero Dispatcher uses Redis as a message broker and data store for managing distributed job execution. Jobs are stored as Redis hashes, and communication with workers happens through Redis lists (queues).

## Redis Namespace

All dispatcher-related keys use the `hero:` namespace prefix to avoid conflicts with other Redis usage.

## Data Structures

### Job Storage

Jobs are stored as Redis hashes with the following key pattern:

```
hero:job:{job_id}
```

**Job Hash Fields:**

- `id`: Unique job identifier (UUID v4)
- `caller_id`: Identifier of the client that created the job
- `worker_id`: Target worker identifier
- `context_id`: Execution context identifier
- `script`: Script content to execute (Rhai or HeroScript)
- `timeout`: Execution timeout in seconds
- `retries`: Number of retry attempts
- `concurrent`: Whether to execute in a separate thread (true/false)
- `log_path`: Optional path to log file for job output
- `created_at`: Job creation timestamp (ISO 8601)
- `updated_at`: Job last update timestamp (ISO 8601)
- `status`: Current job status (dispatched/started/error/finished)
- `env_vars`: Environment variables as JSON object (optional)
- `prerequisites`: JSON array of job IDs that must complete before this job (optional)
- `dependents`: JSON array of job IDs that depend on this job completing (optional)
- `output`: Job execution result (set by worker)
- `error`: Error message if job failed (set by worker)
- `dependencies`: List of job IDs that this job depends on
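
For reference, a job record with these fields could be represented on the client side roughly as follows. This is a minimal Rust sketch; the struct and the concrete field types are illustrative assumptions, not part of the dispatcher API:

```rust
/// Illustrative in-memory representation of the job hash described above.
/// Field names match the Redis hash fields; the struct itself is an assumption.
struct Job {
    id: String,                    // UUID v4
    caller_id: String,
    worker_id: String,
    context_id: String,
    script: String,                // Rhai or HeroScript source
    timeout: u64,                  // seconds
    retries: u32,
    concurrent: bool,
    log_path: Option<String>,
    created_at: String,            // ISO 8601
    updated_at: String,            // ISO 8601
    status: String,                // dispatched | started | error | finished
    env_vars: Option<String>,      // JSON object
    prerequisites: Option<String>, // JSON array of job IDs
    dependents: Option<String>,    // JSON array of job IDs
    output: Option<String>,        // set by worker
    error: Option<String>,         // set by worker
    dependencies: Vec<String>,     // job IDs this job depends on
}
```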

### Job Dependencies

Jobs can have dependencies on other jobs, which are stored in the `dependencies` field. A job will not be dispatched until all its dependencies have completed successfully.
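
As a sketch of how a dispatcher might enforce this check before queuing a job (assuming the Rust `redis` crate; the helper name `all_dependencies_finished` is hypothetical):

```rust
use redis::Connection;

/// Returns true if every dependency of a job has status "finished".
/// Hypothetical helper illustrating the dependency gate; not dispatcher API.
fn all_dependencies_finished(
    con: &mut Connection,
    dependencies: &[String],
) -> redis::RedisResult<bool> {
    for dep_id in dependencies {
        let status: Option<String> = redis::cmd("HGET")
            .arg(format!("hero:job:{}", dep_id))
            .arg("status")
            .query(con)?;
        if status.as_deref() != Some("finished") {
            return Ok(false); // dependency missing or not yet finished
        }
    }
    Ok(true) // safe to LPUSH the job onto its worker's work queue
}
```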

### Work Queues

Jobs are queued for execution using Redis lists:

```
hero:work_queue:{worker_id}
```

Workers listen on their specific queue with `BLPOP` to receive job IDs to process.

### Stop Queues

Job stop requests are sent through dedicated stop queues:

```
hero:stop_queue:{worker_id}
```

Workers monitor these queues to receive stop requests for running jobs.

### Reply Queues

For synchronous job execution, dedicated reply queues are used:

```
hero:reply:{job_id}
```

Workers send results to these queues when jobs complete.

## Job Lifecycle

### 1. Job Creation

```
Client -> Redis: HSET hero:job:{job_id} {job_fields}
```

### 2. Job Submission

```
Client -> Redis: LPUSH hero:work_queue:{worker_id} {job_id}
```

### 3. Job Processing

```
Worker -> Redis: BLPOP hero:work_queue:{worker_id}
Worker -> Redis: HSET hero:job:{job_id} status "started"
Worker: Execute script
Worker -> Redis: HSET hero:job:{job_id} status "finished" output "{result}"
```

### 4. Job Completion (Async)

```
Worker -> Redis: LPUSH hero:reply:{job_id} {result}
```

## API Operations

### List Jobs

```rust
dispatcher.list_jobs() -> Vec<String>
```

**Redis Operations:**

- `KEYS hero:job:*` - Get all job keys
- Extract job IDs from key names

### Stop Job

```rust
dispatcher.stop_job(job_id) -> Result<(), DispatcherError>
```

**Redis Operations:**

- `LPUSH hero:stop_queue:{worker_id} {job_id}` - Send stop request
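
A minimal client-side sketch of this operation, assuming the Rust `redis` crate (connection setup and variable names are illustrative):

```rust
use redis::Connection;

/// Push a stop request for `job_id` onto the target worker's stop queue.
/// Illustrative sketch of the Redis operation above, not the dispatcher API itself.
fn send_stop_request(
    con: &mut Connection,
    worker_id: &str,
    job_id: &str,
) -> redis::RedisResult<()> {
    redis::cmd("LPUSH")
        .arg(format!("hero:stop_queue:{}", worker_id))
        .arg(job_id)
        .query(con)
}
```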

### Get Job Status

```rust
dispatcher.get_job_status(job_id) -> Result<JobStatus, DispatcherError>
```

**Redis Operations:**

- `HGETALL hero:job:{job_id}` - Get job data
- Parse `status` field
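
A sketch of these two steps, assuming the Rust `redis` crate and returning the raw status string rather than a parsed `JobStatus`, for brevity:

```rust
use std::collections::HashMap;
use redis::Connection;

/// Fetch the job hash and extract its `status` field.
/// Illustrative sketch of the two Redis operations above.
fn get_job_status(con: &mut Connection, job_id: &str) -> redis::RedisResult<Option<String>> {
    let job: HashMap<String, String> = redis::cmd("HGETALL")
        .arg(format!("hero:job:{}", job_id))
        .query(con)?;
    // HGETALL on a missing key returns an empty hash; treat that as "no such job".
    Ok(job.get("status").cloned())
}
```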

### Get Job Logs

```rust
dispatcher.get_job_logs(job_id) -> Result<Option<String>, DispatcherError>
```

**Redis Operations:**

- `HGETALL hero:job:{job_id}` - Get job data
- Read `log_path` field
- Read log file from filesystem

### Run Job and Await Result

```rust
dispatcher.run_job_and_await_result(job, worker_id) -> Result<String, DispatcherError>
```

**Redis Operations:**

1. `HSET hero:job:{job_id} {job_fields}` - Store job
2. `LPUSH hero:work_queue:{worker_id} {job_id}` - Submit job
3. `BLPOP hero:reply:{job_id} {timeout}` - Wait for result
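
Putting operations 1-3 together, a minimal client-side sketch (assuming the Rust `redis` crate; only a few job fields are written, and the worker ID and timeout are placeholders):

```rust
use redis::Connection;

/// Store a job, queue it for a worker, and block until a reply arrives.
/// Illustrative sketch of operations 1-3 above, not the dispatcher implementation.
fn run_job_and_await_result(
    con: &mut Connection,
    job_id: &str,
    worker_id: &str,
    script: &str,
    timeout_secs: u64,
) -> redis::RedisResult<Option<String>> {
    // 1. Store the job hash (only a few fields shown here).
    redis::cmd("HSET")
        .arg(format!("hero:job:{}", job_id))
        .arg("id").arg(job_id)
        .arg("script").arg(script)
        .arg("status").arg("dispatched")
        .query::<()>(con)?;

    // 2. Submit the job ID to the worker's queue.
    redis::cmd("LPUSH")
        .arg(format!("hero:work_queue:{}", worker_id))
        .arg(job_id)
        .query::<()>(con)?;

    // 3. Block on the reply queue; BLPOP returns (queue_name, value) or nil on timeout.
    let reply: Option<(String, String)> = redis::cmd("BLPOP")
        .arg(format!("hero:reply:{}", job_id))
        .arg(timeout_secs)
        .query(con)?;
    Ok(reply.map(|(_, result)| result))
}
```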

## Worker Protocol

### Job Processing Loop

```rust
loop {
    // 1. Wait for job
    job_id = BLPOP hero:work_queue:{worker_id}

    // 2. Get job details
    job_data = HGETALL hero:job:{job_id}

    // 3. Update status
    HSET hero:job:{job_id} status "started"

    // 4. Check for stop requests
    if LLEN hero:stop_queue:{worker_id} > 0 {
        stop_job_id = LPOP hero:stop_queue:{worker_id}
        if stop_job_id == job_id {
            HSET hero:job:{job_id} status "error" error "stopped"
            continue
        }
    }

    // 5. Execute script
    result = execute_script(job_data.script)

    // 6. Update job with result
    HSET hero:job:{job_id} status "finished" output result

    // 7. Send reply if needed
    if reply_queue_exists(hero:reply:{job_id}) {
        LPUSH hero:reply:{job_id} result
    }
}
```
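
A runnable approximation of this loop, assuming the Rust `redis` crate and a placeholder `execute_script` function (both are assumptions; the stop-queue check from step 4 and the reply-queue existence check are omitted here for brevity):

```rust
use std::collections::HashMap;
use redis::Connection;

// Placeholder for the real script engine (Rhai / HeroScript); assumed here.
fn execute_script(script: &str) -> Result<String, String> {
    Ok(format!("executed: {}", script))
}

fn worker_loop(con: &mut Connection, worker_id: &str) -> redis::RedisResult<()> {
    let work_queue = format!("hero:work_queue:{}", worker_id);
    loop {
        // 1. Block until a job ID arrives (timeout 0 = wait forever).
        let (_, job_id): (String, String) =
            redis::cmd("BLPOP").arg(&work_queue).arg(0).query(con)?;
        let job_key = format!("hero:job:{}", job_id);

        // 2. Fetch the job hash.
        let job: HashMap<String, String> =
            redis::cmd("HGETALL").arg(&job_key).query(con)?;

        // 3. Mark the job as started.
        redis::cmd("HSET").arg(&job_key).arg("status").arg("started").query::<()>(con)?;

        // 5-6. Execute the script and record the outcome.
        let script = job.get("script").cloned().unwrap_or_default();
        match execute_script(&script) {
            Ok(output) => {
                redis::cmd("HSET").arg(&job_key)
                    .arg("status").arg("finished")
                    .arg("output").arg(&output)
                    .query::<()>(con)?;
                // 7. Notify any synchronous caller waiting on the reply queue.
                redis::cmd("LPUSH").arg(format!("hero:reply:{}", job_id))
                    .arg(&output).query::<()>(con)?;
            }
            Err(err) => {
                redis::cmd("HSET").arg(&job_key)
                    .arg("status").arg("error")
                    .arg("error").arg(&err)
                    .query::<()>(con)?;
            }
        }
    }
}
```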

### Stop Request Handling

Workers should periodically check the stop queue during long-running jobs:

```rust
if LLEN hero:stop_queue:{worker_id} > 0 {
    stop_requests = LRANGE hero:stop_queue:{worker_id} 0 -1
    if stop_requests.contains(current_job_id) {
        // Stop current job execution
        HSET hero:job:{current_job_id} status "error" error "stopped_by_request"
        // Remove stop request
        LREM hero:stop_queue:{worker_id} 1 current_job_id
        return
    }
}
```

## Error Handling

### Job Timeouts

- Client sets timeout when creating job
- Worker should respect timeout and stop execution
- If timeout exceeded: `HSET hero:job:{job_id} status "error" error "timeout"`
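
One way a worker might enforce this, sketched with a helper thread and `recv_timeout` (this mechanism is an assumption; the protocol only requires that the timeout be respected and the error recorded):

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

/// Run `f` with a wall-clock timeout; returns None if the deadline passes first.
/// Note: the spawned thread is not killed on timeout, only abandoned.
fn run_with_timeout<F>(timeout_secs: u64, f: F) -> Option<Result<String, String>>
where
    F: FnOnce() -> Result<String, String> + Send + 'static,
{
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        let _ = tx.send(f());
    });
    rx.recv_timeout(Duration::from_secs(timeout_secs)).ok()
    // On None, the worker would issue:
    // HSET hero:job:{job_id} status "error" error "timeout"
}
```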

### Worker Failures

- If worker crashes, job remains in "started" status
- Monitoring systems can detect stale jobs and retry
- Jobs can be requeued: `LPUSH hero:work_queue:{worker_id} {job_id}`

### Redis Connection Issues

- Clients should implement retry logic with exponential backoff
- Workers should reconnect and resume processing
- Use Redis persistence to survive Redis restarts
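
A sketch of reconnect-with-backoff for clients and workers (assuming the Rust `redis` crate; the base delay and cap are arbitrary example values):

```rust
use std::thread;
use std::time::Duration;
use redis::{Client, Connection};

/// Try to (re)connect to Redis, doubling the delay after each failure up to a cap.
/// Illustrative sketch; production code would also add jitter and a retry limit.
fn connect_with_backoff(client: &Client) -> Connection {
    let mut delay = Duration::from_millis(100);
    loop {
        match client.get_connection() {
            Ok(con) => return con,
            Err(err) => {
                eprintln!("redis connection failed: {err}; retrying in {delay:?}");
                thread::sleep(delay);
                delay = (delay * 2).min(Duration::from_secs(30));
            }
        }
    }
}
```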

## Monitoring and Observability

### Queue Monitoring

```bash
# Check work queue length
LLEN hero:work_queue:{worker_id}

# Check stop queue length
LLEN hero:stop_queue:{worker_id}

# List all jobs
KEYS hero:job:*

# Get job details
HGETALL hero:job:{job_id}
```

### Metrics to Track

- Jobs created per second
- Jobs completed per second
- Average job execution time
- Queue depths
- Worker availability
- Error rates by job type

## Security Considerations

### Redis Security

- Use Redis AUTH for authentication
- Enable TLS for Redis connections
- Restrict Redis network access
- Use Redis ACLs to limit worker permissions

### Job Security

- Validate script content before execution
- Sandbox script execution environment
- Limit resource usage (CPU, memory, disk)
- Log all job executions for audit

### Log File Security

- Ensure log paths are within allowed directories
- Validate log file permissions
- Rotate and archive logs regularly
- Sanitize sensitive data in logs

## Performance Considerations

### Redis Optimization

- Use Redis pipelining for batch operations (see the sketch after this list)
- Configure appropriate Redis memory limits
- Use Redis clustering for high availability
- Monitor Redis memory usage and eviction
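
For example, queue-depth checks across several queues can be batched into a single round trip with a pipeline (a minimal sketch using the Rust `redis` crate; the worker IDs are placeholders):

```rust
use redis::Connection;

/// Fetch several queue lengths in one round trip using a pipeline.
/// Illustrative sketch; worker IDs are placeholder values.
fn queue_depths(con: &mut Connection) -> redis::RedisResult<(i64, i64, i64)> {
    redis::pipe()
        .cmd("LLEN").arg("hero:work_queue:worker_1")
        .cmd("LLEN").arg("hero:work_queue:worker_2")
        .cmd("LLEN").arg("hero:stop_queue:worker_1")
        .query(con)
}
```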

### Job Optimization

- Keep job payloads small
- Use efficient serialization formats
- Batch similar jobs when possible
- Implement job prioritization if needed

### Worker Optimization

- Pool worker connections to Redis
- Use async I/O for Redis operations
- Implement graceful shutdown handling
- Monitor worker resource usage