
Hero Dispatcher Protocol

This document describes the Redis-based protocol used by the Hero Dispatcher for job management and worker communication.

Overview

The Hero Dispatcher uses Redis as a message broker and data store for managing distributed job execution. Jobs are stored as Redis hashes, and communication with workers happens through Redis lists (queues).

Redis Namespace

All dispatcher-related keys use the hero: namespace prefix to avoid conflicts with other Redis usage.

Data Structures

Job Storage

Jobs are stored as Redis hashes with the following key pattern:

hero:job:{job_id}

Job Hash Fields:

  • id: Unique job identifier (UUID v4)
  • caller_id: Identifier of the client that created the job
  • worker_id: Target worker identifier
  • context_id: Execution context identifier
  • script: Script content to execute (Rhai or HeroScript)
  • timeout: Execution timeout in seconds
  • retries: Number of retry attempts
  • concurrent: Whether to execute in a separate thread (true/false)
  • log_path: Optional path to log file for job output
  • created_at: Job creation timestamp (ISO 8601)
  • updated_at: Job last update timestamp (ISO 8601)
  • status: Current job status (dispatched/started/error/finished)
  • env_vars: Environment variables as JSON object (optional)
  • prerequisites: JSON array of job IDs that must complete before this job (optional)
  • dependents: JSON array of job IDs that depend on this job completing (optional)
  • output: Job execution result (set by worker)
  • error: Error message if job failed (set by worker)
  • dependencies: JSON array of job IDs that this job depends on (optional)
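
A job hash can be assembled from these fields before storing it with HSET. The following sketch shows one way to build the key and field mapping; make_job_hash is a hypothetical helper (not part of the dispatcher), and the default values are illustrative.

```python
import json
import uuid
from datetime import datetime, timezone

def make_job_hash(caller_id, worker_id, context_id, script,
                  timeout=30, env_vars=None):
    # Hypothetical helper: builds the mapping for HSET hero:job:{job_id}.
    # Field names follow the list above; defaults are illustrative only.
    job_id = str(uuid.uuid4())
    now = datetime.now(timezone.utc).isoformat()
    fields = {
        "id": job_id,
        "caller_id": caller_id,
        "worker_id": worker_id,
        "context_id": context_id,
        "script": script,
        "timeout": str(timeout),   # Redis hash values are strings
        "retries": "0",
        "concurrent": "false",
        "created_at": now,
        "updated_at": now,
        "status": "dispatched",
    }
    if env_vars is not None:
        fields["env_vars"] = json.dumps(env_vars)
    return f"hero:job:{job_id}", fields

key, fields = make_job_hash("cli-1", "worker-a", "ctx-1", "40 + 2",
                            env_vars={"DEBUG": "1"})
print(key.startswith("hero:job:"), fields["status"])  # -> True dispatched
```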

Job Dependencies

Jobs can have dependencies on other jobs, which are stored in the dependencies field. A job will not be dispatched until all its dependencies have completed successfully.
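
The readiness rule can be sketched as follows. is_dispatchable is a hypothetical helper, and it assumes the dependencies field is stored as a JSON array (as with prerequisites):

```python
import json

def is_dispatchable(job_fields, finished_job_ids):
    # A job may be dispatched only once every dependency has finished.
    deps = json.loads(job_fields.get("dependencies", "[]"))
    return all(dep in finished_job_ids for dep in deps)

job = {"id": "c", "dependencies": '["a", "b"]'}
print(is_dispatchable(job, {"a"}))       # -> False
print(is_dispatchable(job, {"a", "b"}))  # -> True
```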

Work Queues

Jobs are queued for execution using Redis lists:

hero:work_queue:{worker_id}

Workers block on their specific queue with BLPOP, receiving job IDs to process.

Stop Queues

Job stop requests are sent through dedicated stop queues:

hero:stop_queue:{worker_id}

Workers monitor these queues to receive stop requests for running jobs.

Reply Queues

For synchronous job execution, dedicated reply queues are used:

hero:reply:{job_id}

Workers send results to these queues when jobs complete.

Job Lifecycle

1. Job Creation

Client -> Redis: HSET hero:job:{job_id} {job_fields}

2. Job Submission

Client -> Redis: LPUSH hero:work_queue:{worker_id} {job_id}

3. Job Processing

Worker -> Redis: BLPOP hero:work_queue:{worker_id}
Worker -> Redis: HSET hero:job:{job_id} status "started"
Worker: Execute script
Worker -> Redis: HSET hero:job:{job_id} status "finished" output "{result}"

4. Job Completion (Async)

Worker -> Redis: LPUSH hero:reply:{job_id} {result}
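
The four steps above can be walked through end to end against a minimal in-memory stand-in for Redis. FakeRedis and its methods are hypothetical placeholders for real HSET/LPUSH/BLPOP calls; only the key names and the command sequence follow this protocol.

```python
class FakeRedis:
    # In-memory stand-in for the subset of Redis used by the protocol.
    def __init__(self):
        self.hashes = {}   # key -> {field: value}
        self.lists = {}    # key -> [values]

    def hset(self, key, mapping):
        self.hashes.setdefault(key, {}).update(mapping)

    def lpush(self, key, value):
        self.lists.setdefault(key, []).insert(0, value)

    def blpop(self, key):
        # Pops from the left like BLPOP; real BLPOP blocks when empty,
        # here we assume the queue is non-empty.
        return self.lists[key].pop(0)

r = FakeRedis()
job_id, worker_id = "job-1", "worker-a"

# 1. Job creation
r.hset(f"hero:job:{job_id}",
       {"id": job_id, "script": "40 + 2", "status": "dispatched"})

# 2. Job submission
r.lpush(f"hero:work_queue:{worker_id}", job_id)

# 3. Job processing (worker side)
picked = r.blpop(f"hero:work_queue:{worker_id}")
r.hset(f"hero:job:{picked}", {"status": "started"})
result = "42"  # pretend the script executed
r.hset(f"hero:job:{picked}", {"status": "finished", "output": result})

# 4. Job completion (async reply)
r.lpush(f"hero:reply:{picked}", result)

print(r.hashes[f"hero:job:{job_id}"]["status"])  # -> finished
```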

API Operations

List Jobs

dispatcher.list_jobs() -> Vec<String>

Redis Operations:

  • KEYS hero:job:* - Get all job keys (KEYS is O(N) and blocks the server; prefer an incremental SCAN with MATCH hero:job:* in production)
  • Extract job IDs from key names
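
Extracting the IDs from the returned keys is a simple prefix strip, sketched below (job_ids_from_keys is a hypothetical helper):

```python
def job_ids_from_keys(keys):
    # Strip the "hero:job:" prefix from each matching key.
    prefix = "hero:job:"
    return [k[len(prefix):] for k in keys if k.startswith(prefix)]

print(job_ids_from_keys(["hero:job:a1", "hero:job:b2", "hero:other"]))
# -> ['a1', 'b2']
```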

Stop Job

dispatcher.stop_job(job_id) -> Result<(), DispatcherError>

Redis Operations:

  • LPUSH hero:stop_queue:{worker_id} {job_id} - Send stop request

Get Job Status

dispatcher.get_job_status(job_id) -> Result<JobStatus, DispatcherError>

Redis Operations:

  • HGETALL hero:job:{job_id} - Get job data
  • Parse status field

Get Job Logs

dispatcher.get_job_logs(job_id) -> Result<Option<String>, DispatcherError>

Redis Operations:

  • HGETALL hero:job:{job_id} - Get job data
  • Read log_path field
  • Read log file from filesystem

Run Job and Await Result

dispatcher.run_job_and_await_result(job, worker_id) -> Result<String, DispatcherError>

Redis Operations:

  1. HSET hero:job:{job_id} {job_fields} - Store job
  2. LPUSH hero:work_queue:{worker_id} {job_id} - Submit job
  3. BLPOP hero:reply:{job_id} {timeout} - Wait for result
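
The three-step sequence, including the timeout path where BLPOP returns nothing within {timeout} seconds, can be sketched as below. This is not the dispatcher's actual implementation; SilentRedis is a stub simulating a worker that never replies, so the timeout branch is exercised.

```python
class SilentRedis:
    # Stub: records commands, but blpop always times out (returns None),
    # as a real client does when no reply arrives in time.
    def __init__(self):
        self.commands = []

    def hset(self, key, mapping):
        self.commands.append(("HSET", key))

    def lpush(self, key, value):
        self.commands.append(("LPUSH", key))

    def blpop(self, key, timeout):
        self.commands.append(("BLPOP", key))
        return None  # no worker replied within `timeout`

def run_job_and_await_result(r, job_id, fields, worker_id, timeout):
    r.hset(f"hero:job:{job_id}", fields)              # 1. store job
    r.lpush(f"hero:work_queue:{worker_id}", job_id)   # 2. submit job
    reply = r.blpop(f"hero:reply:{job_id}", timeout)  # 3. wait for result
    if reply is None:
        raise TimeoutError(f"no reply for {job_id} within {timeout}s")
    return reply

r = SilentRedis()
try:
    run_job_and_await_result(r, "job-1", {"status": "dispatched"}, "worker-a", 5)
except TimeoutError as e:
    print(e)  # -> no reply for job-1 within 5s
```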

Worker Protocol

Job Processing Loop

loop {
    // 1. Wait for job
    job_id = BLPOP hero:work_queue:{worker_id}
    
    // 2. Get job details
    job_data = HGETALL hero:job:{job_id}
    
    // 3. Update status
    HSET hero:job:{job_id} status "started"
    
    // 4. Check for stop requests without dropping requests meant
    //    for other jobs (an unconditional LPOP would discard them)
    stop_requests = LRANGE hero:stop_queue:{worker_id} 0 -1
    if stop_requests.contains(job_id) {
        LREM hero:stop_queue:{worker_id} 1 job_id
        HSET hero:job:{job_id} status "error" error "stopped"
        continue
    }
    
    // 5. Execute script
    result = execute_script(job_data.script)
    
    // 6. Update job with result
    HSET hero:job:{job_id} status "finished" output result
    
    // 7. Send reply for synchronous callers. A client blocked on BLPOP
    //    does not create the reply key, so an existence check cannot
    //    detect a waiting client; push unconditionally and let the key
    //    expire if nobody consumes it.
    LPUSH hero:reply:{job_id} result
    EXPIRE hero:reply:{job_id} {reply_ttl}
}

Stop Request Handling

Workers should periodically check the stop queue during long-running jobs:

if LLEN hero:stop_queue:{worker_id} > 0 {
    stop_requests = LRANGE hero:stop_queue:{worker_id} 0 -1
    if stop_requests.contains(current_job_id) {
        // Stop current job execution
        HSET hero:job:{current_job_id} status "error" error "stopped_by_request"
        // Remove stop request
        LREM hero:stop_queue:{worker_id} 1 current_job_id
        return
    }
}

Error Handling

Job Timeouts

  • Client sets timeout when creating job
  • Worker should respect timeout and stop execution
  • If timeout exceeded: HSET hero:job:{job_id} status "error" error "timeout"
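
Worker-side enforcement can be sketched with Python's concurrent.futures as a stand-in for the worker's actual execution mechanism. Note that a thread cannot be forcibly killed, so a real worker must also interrupt the script itself; this sketch only maps an overrun to the "error"/"timeout" status.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutTimeout

def run_with_timeout(fn, timeout):
    # Run fn in a worker thread; report "timeout" if it overruns.
    with ThreadPoolExecutor(max_workers=1) as pool:
        fut = pool.submit(fn)
        try:
            return "finished", fut.result(timeout=timeout)
        except FutTimeout:
            return "error", "timeout"

status, out = run_with_timeout(lambda: time.sleep(2), timeout=0.1)
print(status, out)  # -> error timeout
```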

Worker Failures

  • If a worker crashes, the job remains in "started" status
  • Monitoring systems can detect stale jobs and retry
  • Jobs can be requeued: LPUSH hero:work_queue:{worker_id} {job_id}

Redis Connection Issues

  • Clients should implement retry logic with exponential backoff
  • Workers should reconnect and resume processing
  • Enable Redis persistence (RDB or AOF) so job state survives Redis restarts

Monitoring and Observability

Queue Monitoring

# Check work queue length
LLEN hero:work_queue:{worker_id}

# Check stop queue length  
LLEN hero:stop_queue:{worker_id}

# List all jobs
KEYS hero:job:*

# Get job details
HGETALL hero:job:{job_id}

Metrics to Track

  • Jobs created per second
  • Jobs completed per second
  • Average job execution time
  • Queue depths
  • Worker availability
  • Error rates by job type

Security Considerations

Redis Security

  • Use Redis AUTH for authentication
  • Enable TLS for Redis connections
  • Restrict Redis network access
  • Use Redis ACLs to limit worker permissions

Job Security

  • Validate script content before execution
  • Sandbox script execution environment
  • Limit resource usage (CPU, memory, disk)
  • Log all job executions for audit

Log File Security

  • Ensure log paths are within allowed directories
  • Validate log file permissions
  • Rotate and archive logs regularly
  • Sanitize sensitive data in logs

Performance Considerations

Redis Optimization

  • Use Redis pipelining for batch operations
  • Configure appropriate Redis memory limits
  • Use Redis clustering for high availability
  • Monitor Redis memory usage and eviction

Job Optimization

  • Keep job payloads small
  • Use efficient serialization formats
  • Batch similar jobs when possible
  • Implement job prioritization if needed

Worker Optimization

  • Pool worker connections to Redis
  • Use async I/O for Redis operations
  • Implement graceful shutdown handling
  • Monitor worker resource usage