# Hero Dispatcher Protocol

This document describes the Redis-based protocol used by the Hero Dispatcher for job management and worker communication.

## Overview

The Hero Dispatcher uses Redis as a message broker and data store for managing distributed job execution. Jobs are stored as Redis hashes, and communication with workers happens through Redis lists (queues).

## Redis Namespace

All dispatcher-related keys use the `hero:` namespace prefix to avoid conflicts with other Redis usage.

## Data Structures

### Job Storage

Jobs are stored as Redis hashes with the following key pattern:

```
hero:job:{job_id}
```

**Job Hash Fields:**

- `id`: Unique job identifier (UUID v4)
- `caller_id`: Identifier of the client that created the job
- `worker_id`: Target worker identifier
- `context_id`: Execution context identifier
- `script`: Script content to execute (Rhai or HeroScript)
- `timeout`: Execution timeout in seconds
- `retries`: Number of retry attempts
- `concurrent`: Whether to execute in a separate thread (`true`/`false`)
- `log_path`: Optional path to a log file for job output
- `created_at`: Job creation timestamp (ISO 8601)
- `updated_at`: Job last-update timestamp (ISO 8601)
- `status`: Current job status (`dispatched`/`started`/`error`/`finished`)
- `env_vars`: Environment variables as a JSON object (optional)
- `prerequisites`: JSON array of job IDs that must complete before this job (optional)
- `dependents`: JSON array of job IDs that depend on this job completing (optional)
- `output`: Job execution result (set by the worker)
- `error`: Error message if the job failed (set by the worker)
- `dependencies`: List of job IDs that this job depends on

### Job Dependencies

Jobs can have dependencies on other jobs, which are stored in the `dependencies` field. A job will not be dispatched until all of its dependencies have completed successfully.
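As an illustration, the dependency gate can be sketched as follows. This is a minimal, hypothetical sketch — `job_key` and `is_dispatchable` are not part of the dispatcher API, and the `statuses` map stands in for values a dispatcher would read via `HGET hero:job:{id} status`:

```rust
use std::collections::HashMap;

/// Build the Redis key for a job hash (per the key pattern above).
fn job_key(job_id: &str) -> String {
    format!("hero:job:{}", job_id)
}

/// A job may be dispatched only when every dependency has status "finished".
/// `statuses` maps job_id -> status string.
fn is_dispatchable(dependencies: &[&str], statuses: &HashMap<String, String>) -> bool {
    dependencies
        .iter()
        .all(|dep| statuses.get(*dep).map(|s| s == "finished").unwrap_or(false))
}

fn main() {
    let mut statuses = HashMap::new();
    statuses.insert("job-a".to_string(), "finished".to_string());
    statuses.insert("job-b".to_string(), "started".to_string());

    assert_eq!(job_key("job-a"), "hero:job:job-a");
    assert!(is_dispatchable(&["job-a"], &statuses));
    // job-b has not finished, so a job depending on it must wait.
    assert!(!is_dispatchable(&["job-a", "job-b"], &statuses));
}
```

A dispatcher implementing this check would re-evaluate it whenever a dependency transitions to `finished`.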
### Work Queues

Jobs are queued for execution using Redis lists:

```
hero:work_queue:{worker_id}
```

Workers listen on their specific queue using `BLPOP` for job IDs to process.

### Stop Queues

Job stop requests are sent through dedicated stop queues:

```
hero:stop_queue:{worker_id}
```

Workers monitor these queues to receive stop requests for running jobs.

### Reply Queues

For synchronous job execution, dedicated reply queues are used:

```
hero:reply:{job_id}
```

Workers send results to these queues when jobs complete.

## Job Lifecycle

### 1. Job Creation

```
Client -> Redis: HSET hero:job:{job_id} {job_fields}
```

### 2. Job Submission

```
Client -> Redis: LPUSH hero:work_queue:{worker_id} {job_id}
```

### 3. Job Processing

```
Worker -> Redis: BLPOP hero:work_queue:{worker_id}
Worker -> Redis: HSET hero:job:{job_id} status "started"
Worker:          Execute script
Worker -> Redis: HSET hero:job:{job_id} status "finished" output "{result}"
```

### 4. Job Completion (Async)

```
Worker -> Redis: LPUSH hero:reply:{job_id} {result}
```

## API Operations

### List Jobs

```rust
dispatcher.list_jobs() -> Vec<String>
```

**Redis Operations:**
- `KEYS hero:job:*` - Get all job keys (note: `KEYS` blocks the server on large keyspaces; `SCAN` is preferable in production)
- Extract job IDs from key names

### Stop Job

```rust
dispatcher.stop_job(job_id) -> Result<(), DispatcherError>
```

**Redis Operations:**
- `LPUSH hero:stop_queue:{worker_id} {job_id}` - Send stop request

### Get Job Status

```rust
dispatcher.get_job_status(job_id) -> Result<JobStatus, DispatcherError>
```

**Redis Operations:**
- `HGETALL hero:job:{job_id}` - Get job data
- Parse `status` field

### Get Job Logs

```rust
dispatcher.get_job_logs(job_id) -> Result<Vec<String>, DispatcherError>
```

**Redis Operations:**
- `HGETALL hero:job:{job_id}` - Get job data
- Read `log_path` field
- Read log file from filesystem

### Run Job and Await Result

```rust
dispatcher.run_job_and_await_result(job, worker_id) -> Result<String, DispatcherError>
```

**Redis Operations:**
1. `HSET hero:job:{job_id} {job_fields}` - Store job
2. `LPUSH hero:work_queue:{worker_id} {job_id}` - Submit job
3. `BLPOP hero:reply:{job_id} {timeout}` - Wait for result

## Worker Protocol

### Job Processing Loop

```rust
loop {
    // 1. Wait for job
    job_id = BLPOP hero:work_queue:{worker_id}

    // 2. Get job details
    job_data = HGETALL hero:job:{job_id}

    // 3. Update status
    HSET hero:job:{job_id} status "started"

    // 4. Check for stop requests
    if LLEN hero:stop_queue:{worker_id} > 0 {
        stop_job_id = LPOP hero:stop_queue:{worker_id}
        if stop_job_id == job_id {
            HSET hero:job:{job_id} status "error" error "stopped"
            continue
        }
    }

    // 5. Execute script
    result = execute_script(job_data.script)

    // 6. Update job with result
    HSET hero:job:{job_id} status "finished" output result

    // 7. Send reply if needed
    if reply_queue_exists(hero:reply:{job_id}) {
        LPUSH hero:reply:{job_id} result
    }
}
```

### Stop Request Handling

Workers should periodically check the stop queue during long-running jobs:

```rust
if LLEN hero:stop_queue:{worker_id} > 0 {
    stop_requests = LRANGE hero:stop_queue:{worker_id} 0 -1
    if stop_requests.contains(current_job_id) {
        // Stop current job execution
        HSET hero:job:{current_job_id} status "error" error "stopped_by_request"

        // Remove stop request
        LREM hero:stop_queue:{worker_id} 1 current_job_id
        return
    }
}
```

## Error Handling

### Job Timeouts

- Client sets the timeout when creating the job
- Worker should respect the timeout and stop execution
- If the timeout is exceeded: `HSET hero:job:{job_id} status "error" error "timeout"`

### Worker Failures

- If a worker crashes, the job remains in `started` status
- Monitoring systems can detect stale jobs and retry them
- Jobs can be requeued: `LPUSH hero:work_queue:{worker_id} {job_id}`

### Redis Connection Issues

- Clients should implement retry logic with exponential backoff
- Workers should reconnect and resume processing
- Use Redis persistence to survive Redis restarts

## Monitoring and Observability

### Queue Monitoring

```bash
# Check work queue length
LLEN hero:work_queue:{worker_id}

# Check stop queue length
LLEN hero:stop_queue:{worker_id}

# List all jobs
KEYS hero:job:*

# Get job details
HGETALL hero:job:{job_id}
```

### Metrics to Track

- Jobs created per second
- Jobs completed per second
- Average job execution time
- Queue depths
- Worker availability
- Error rates by job type

## Security Considerations

### Redis Security

- Use Redis AUTH for authentication
- Enable TLS for Redis connections
- Restrict Redis network access
- Use Redis ACLs to limit worker permissions

### Job Security

- Validate script content before execution
- Sandbox the script execution environment
- Limit resource usage (CPU, memory, disk)
- Log all job executions for audit

### Log File Security

- Ensure log paths are within allowed directories
- Validate log file permissions
- Rotate and archive logs regularly
- Sanitize sensitive data in logs

## Performance Considerations

### Redis Optimization

- Use Redis pipelining for batch operations
- Configure appropriate Redis memory limits
- Use Redis clustering for high availability
- Monitor Redis memory usage and eviction

### Job Optimization

- Keep job payloads small
- Use efficient serialization formats
- Batch similar jobs when possible
- Implement job prioritization if needed

### Worker Optimization

- Pool worker connections to Redis
- Use async I/O for Redis operations
- Implement graceful shutdown handling
- Monitor worker resource usage
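The stop-request handling described in the Worker Protocol section can be sketched with the stop queue modeled as an in-memory list. This is a hypothetical sketch, not the dispatcher's implementation: `stop_queue` stands in for what `LRANGE hero:stop_queue:{worker_id} 0 -1` would return, and removing an entry mirrors `LREM ... 1 {job_id}`:

```rust
/// Check whether the current job has a pending stop request.
/// Removes the matching request (mirrors LREM hero:stop_queue:{worker_id} 1 {job_id})
/// and returns true if the caller should mark the job
/// status "error", error "stopped_by_request".
fn should_stop(stop_queue: &mut Vec<String>, current_job_id: &str) -> bool {
    if let Some(pos) = stop_queue.iter().position(|id| id == current_job_id) {
        stop_queue.remove(pos);
        return true;
    }
    false
}

fn main() {
    let mut queue = vec!["job-1".to_string(), "job-2".to_string()];

    // A stop request for the running job is consumed and honored.
    assert!(should_stop(&mut queue, "job-2"));
    assert_eq!(queue, vec!["job-1".to_string()]);

    // No pending request for this job: keep running.
    assert!(!should_stop(&mut queue, "job-3"));
}
```

A real worker would call this check between script execution steps, since `BLPOP` on the work queue cannot observe the stop queue while a job is running.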
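The retry logic with exponential backoff recommended under Redis Connection Issues might compute its delays as follows. This is a sketch only; the 100 ms base and 30 s cap are illustrative choices, not part of the protocol:

```rust
use std::time::Duration;

/// Hypothetical backoff schedule: base * 2^attempt, capped.
/// A caller would sleep for backoff_delay(attempt) between
/// reconnection attempts, resetting attempt on success.
fn backoff_delay(attempt: u32) -> Duration {
    let base_ms: u64 = 100;    // illustrative base delay
    let cap_ms: u64 = 30_000;  // illustrative upper bound
    let delay = base_ms.saturating_mul(1u64 << attempt.min(20));
    Duration::from_millis(delay.min(cap_ms))
}

fn main() {
    assert_eq!(backoff_delay(0), Duration::from_millis(100));
    assert_eq!(backoff_delay(1), Duration::from_millis(200));
    assert_eq!(backoff_delay(10), Duration::from_millis(30_000)); // capped
}
```

Adding random jitter to each delay is a common refinement to avoid reconnection stampedes when many workers lose the same Redis instance at once.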