
Hero Dispatcher Protocol

This document describes the Redis-based protocol used by the Hero Dispatcher for job management and worker communication.

Overview

The Hero Dispatcher uses Redis as a message broker and data store for managing distributed job execution. Jobs are stored as Redis hashes, and communication with workers happens through Redis lists (queues).

Redis Namespace

All dispatcher-related keys use the hero: namespace prefix to avoid conflicts with other Redis usage.

Data Structures

Job Storage

Jobs are stored as Redis hashes with the following key pattern:

hero:job:{job_id}

Job Hash Fields:

  • id: Unique job identifier (UUID v4)
  • caller_id: Identifier of the client that created the job
  • worker_id: Target worker identifier
  • context_id: Execution context identifier
  • script: Script content to execute (Rhai or HeroScript)
  • timeout: Execution timeout in seconds
  • retries: Number of retry attempts
  • concurrent: Whether to execute in a separate thread (true/false)
  • log_path: Optional path to log file for job output
  • created_at: Job creation timestamp (ISO 8601)
  • updated_at: Job last update timestamp (ISO 8601)
  • status: Current job status (dispatched/started/error/finished)
  • env_vars: Environment variables as JSON object (optional)
  • prerequisites: JSON array of job IDs that must complete before this job (optional)
  • dependents: JSON array of job IDs that depend on this job completing (optional)
  • output: Job execution result (set by worker)
  • error: Error message if job failed (set by worker)
  • dependencies: JSON array of job IDs that this job depends on (optional)
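
A job hash can be assembled from these fields before storing it with HSET. The following sketch shows one way to build the key and field mapping; make_job_hash is a hypothetical helper (not part of the dispatcher), and the default values are illustrative.

```python
import json
import uuid
from datetime import datetime, timezone

def make_job_hash(caller_id, worker_id, context_id, script,
                  timeout=30, env_vars=None):
    # Hypothetical helper: builds the mapping for HSET hero:job:{job_id}.
    # Field names follow the list above; defaults are illustrative only.
    job_id = str(uuid.uuid4())
    now = datetime.now(timezone.utc).isoformat()
    fields = {
        "id": job_id,
        "caller_id": caller_id,
        "worker_id": worker_id,
        "context_id": context_id,
        "script": script,
        "timeout": str(timeout),   # Redis hash values are strings
        "retries": "0",
        "concurrent": "false",
        "created_at": now,
        "updated_at": now,
        "status": "dispatched",
    }
    if env_vars is not None:
        fields["env_vars"] = json.dumps(env_vars)
    return f"hero:job:{job_id}", fields

key, fields = make_job_hash("cli-1", "worker-a", "ctx-1", "40 + 2",
                            env_vars={"DEBUG": "1"})
print(key.startswith("hero:job:"), fields["status"])  # -> True dispatched
```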

Job Dependencies

Jobs can have dependencies on other jobs, which are stored in the dependencies field. A job will not be dispatched until all its dependencies have completed successfully.
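
The readiness rule can be sketched as follows. is_dispatchable is a hypothetical helper, and it assumes the dependencies field is stored as a JSON array (as with prerequisites):

```python
import json

def is_dispatchable(job_fields, finished_job_ids):
    # A job may be dispatched only once every dependency has finished.
    deps = json.loads(job_fields.get("dependencies", "[]"))
    return all(dep in finished_job_ids for dep in deps)

job = {"id": "c", "dependencies": '["a", "b"]'}
print(is_dispatchable(job, {"a"}))       # -> False
print(is_dispatchable(job, {"a", "b"}))  # -> True
```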

Work Queues

Jobs are queued for execution using Redis lists:

hero:work_queue:{worker_id}

Workers block on their specific queue with BLPOP, receiving job IDs to process.

Stop Queues

Job stop requests are sent through dedicated stop queues:

hero:stop_queue:{worker_id}

Workers monitor these queues to receive stop requests for running jobs.

Reply Queues

For synchronous job execution, dedicated reply queues are used:

hero:reply:{job_id}

Workers send results to these queues when jobs complete.

Job Lifecycle

1. Job Creation

Client -> Redis: HSET hero:job:{job_id} {job_fields}

2. Job Submission

Client -> Redis: LPUSH hero:work_queue:{worker_id} {job_id}

3. Job Processing

Worker -> Redis: BLPOP hero:work_queue:{worker_id}
Worker -> Redis: HSET hero:job:{job_id} status "started"
Worker: Execute script
Worker -> Redis: HSET hero:job:{job_id} status "finished" output "{result}"

4. Job Completion (Async)

Worker -> Redis: LPUSH hero:reply:{job_id} {result}
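
The four steps above can be walked through end to end against a minimal in-memory stand-in for Redis. FakeRedis and its methods are hypothetical placeholders for real HSET/LPUSH/BLPOP calls; only the key names and the command sequence follow this protocol.

```python
class FakeRedis:
    # In-memory stand-in for the subset of Redis used by the protocol.
    def __init__(self):
        self.hashes = {}   # key -> {field: value}
        self.lists = {}    # key -> [values]

    def hset(self, key, mapping):
        self.hashes.setdefault(key, {}).update(mapping)

    def lpush(self, key, value):
        self.lists.setdefault(key, []).insert(0, value)

    def blpop(self, key):
        # Pops from the left like BLPOP; real BLPOP blocks when empty,
        # here we assume the queue is non-empty.
        return self.lists[key].pop(0)

r = FakeRedis()
job_id, worker_id = "job-1", "worker-a"

# 1. Job creation
r.hset(f"hero:job:{job_id}",
       {"id": job_id, "script": "40 + 2", "status": "dispatched"})

# 2. Job submission
r.lpush(f"hero:work_queue:{worker_id}", job_id)

# 3. Job processing (worker side)
picked = r.blpop(f"hero:work_queue:{worker_id}")
r.hset(f"hero:job:{picked}", {"status": "started"})
result = "42"  # pretend the script executed
r.hset(f"hero:job:{picked}", {"status": "finished", "output": result})

# 4. Job completion (async reply)
r.lpush(f"hero:reply:{picked}", result)

print(r.hashes[f"hero:job:{job_id}"]["status"])  # -> finished
```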

API Operations

List Jobs

dispatcher.list_jobs() -> Vec<String>

Redis Operations:

  • KEYS hero:job:* - Get all job keys (KEYS is O(N) and blocks the server; prefer an incremental SCAN with MATCH hero:job:* in production)
  • Extract job IDs from key names
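
Extracting the IDs from the returned keys is a simple prefix strip, sketched below (job_ids_from_keys is a hypothetical helper):

```python
def job_ids_from_keys(keys):
    # Strip the "hero:job:" prefix from each matching key.
    prefix = "hero:job:"
    return [k[len(prefix):] for k in keys if k.startswith(prefix)]

print(job_ids_from_keys(["hero:job:a1", "hero:job:b2", "hero:other"]))
# -> ['a1', 'b2']
```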

Stop Job

dispatcher.stop_job(job_id) -> Result<(), DispatcherError>

Redis Operations:

  • LPUSH hero:stop_queue:{worker_id} {job_id} - Send stop request

Get Job Status

dispatcher.get_job_status(job_id) -> Result<JobStatus, DispatcherError>

Redis Operations:

  • HGETALL hero:job:{job_id} - Get job data
  • Parse status field

Get Job Logs

dispatcher.get_job_logs(job_id) -> Result<Option<String>, DispatcherError>

Redis Operations:

  • HGETALL hero:job:{job_id} - Get job data
  • Read log_path field
  • Read log file from filesystem

Run Job and Await Result

dispatcher.run_job_and_await_result(job, worker_id) -> Result<String, DispatcherError>

Redis Operations:

  1. HSET hero:job:{job_id} {job_fields} - Store job
  2. LPUSH hero:work_queue:{worker_id} {job_id} - Submit job
  3. BLPOP hero:reply:{job_id} {timeout} - Wait for result
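
The three-step sequence, including the timeout path where BLPOP returns nothing within {timeout} seconds, can be sketched as below. This is not the dispatcher's actual implementation; SilentRedis is a stub simulating a worker that never replies, so the timeout branch is exercised.

```python
class SilentRedis:
    # Stub: records commands, but blpop always times out (returns None),
    # as a real client does when no reply arrives in time.
    def __init__(self):
        self.commands = []

    def hset(self, key, mapping):
        self.commands.append(("HSET", key))

    def lpush(self, key, value):
        self.commands.append(("LPUSH", key))

    def blpop(self, key, timeout):
        self.commands.append(("BLPOP", key))
        return None  # no worker replied within `timeout`

def run_job_and_await_result(r, job_id, fields, worker_id, timeout):
    r.hset(f"hero:job:{job_id}", fields)              # 1. store job
    r.lpush(f"hero:work_queue:{worker_id}", job_id)   # 2. submit job
    reply = r.blpop(f"hero:reply:{job_id}", timeout)  # 3. wait for result
    if reply is None:
        raise TimeoutError(f"no reply for {job_id} within {timeout}s")
    return reply

r = SilentRedis()
try:
    run_job_and_await_result(r, "job-1", {"status": "dispatched"}, "worker-a", 5)
except TimeoutError as e:
    print(e)  # -> no reply for job-1 within 5s
```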

Worker Protocol

Job Processing Loop

loop {
    // 1. Wait for job
    job_id = BLPOP hero:work_queue:{worker_id}
    
    // 2. Get job details
    job_data = HGETALL hero:job:{job_id}
    
    // 3. Update status
    HSET hero:job:{job_id} status "started"
    
    // 4. Check for stop requests without dropping requests meant
    //    for other jobs (an unconditional LPOP would discard them)
    stop_requests = LRANGE hero:stop_queue:{worker_id} 0 -1
    if stop_requests.contains(job_id) {
        LREM hero:stop_queue:{worker_id} 1 job_id
        HSET hero:job:{job_id} status "error" error "stopped"
        continue
    }
    
    // 5. Execute script
    result = execute_script(job_data.script)
    
    // 6. Update job with result
    HSET hero:job:{job_id} status "finished" output result
    
    // 7. Send reply for synchronous callers. A client blocked on BLPOP
    //    does not create the reply key, so an existence check cannot
    //    detect a waiting client; push unconditionally and let the key
    //    expire if nobody consumes it.
    LPUSH hero:reply:{job_id} result
    EXPIRE hero:reply:{job_id} {reply_ttl}
}

Stop Request Handling

Workers should periodically check the stop queue during long-running jobs:

if LLEN hero:stop_queue:{worker_id} > 0 {
    stop_requests = LRANGE hero:stop_queue:{worker_id} 0 -1
    if stop_requests.contains(current_job_id) {
        // Stop current job execution
        HSET hero:job:{current_job_id} status "error" error "stopped_by_request"
        // Remove stop request
        LREM hero:stop_queue:{worker_id} 1 current_job_id
        return
    }
}

Error Handling

Job Timeouts

  • Client sets timeout when creating job
  • Worker should respect timeout and stop execution
  • If timeout exceeded: HSET hero:job:{job_id} status "error" error "timeout"
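
Worker-side enforcement can be sketched with Python's concurrent.futures as a stand-in for the worker's actual execution mechanism. Note that a thread cannot be forcibly killed, so a real worker must also interrupt the script itself; this sketch only maps an overrun to the "error"/"timeout" status.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutTimeout

def run_with_timeout(fn, timeout):
    # Run fn in a worker thread; report "timeout" if it overruns.
    with ThreadPoolExecutor(max_workers=1) as pool:
        fut = pool.submit(fn)
        try:
            return "finished", fut.result(timeout=timeout)
        except FutTimeout:
            return "error", "timeout"

status, out = run_with_timeout(lambda: time.sleep(2), timeout=0.1)
print(status, out)  # -> error timeout
```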

Worker Failures

  • If a worker crashes, the job remains in "started" status
  • Monitoring systems can detect stale jobs and retry
  • Jobs can be requeued: LPUSH hero:work_queue:{worker_id} {job_id}

Redis Connection Issues

  • Clients should implement retry logic with exponential backoff
  • Workers should reconnect and resume processing
  • Enable Redis persistence (RDB or AOF) so job state survives Redis restarts

Monitoring and Observability

Queue Monitoring

# Check work queue length
LLEN hero:work_queue:{worker_id}

# Check stop queue length  
LLEN hero:stop_queue:{worker_id}

# List all jobs
KEYS hero:job:*

# Get job details
HGETALL hero:job:{job_id}

Metrics to Track

  • Jobs created per second
  • Jobs completed per second
  • Average job execution time
  • Queue depths
  • Worker availability
  • Error rates by job type

Security Considerations

Redis Security

  • Use Redis AUTH for authentication
  • Enable TLS for Redis connections
  • Restrict Redis network access
  • Use Redis ACLs to limit worker permissions

Job Security

  • Validate script content before execution
  • Sandbox script execution environment
  • Limit resource usage (CPU, memory, disk)
  • Log all job executions for audit

Log File Security

  • Ensure log paths are within allowed directories
  • Validate log file permissions
  • Rotate and archive logs regularly
  • Sanitize sensitive data in logs

Performance Considerations

Redis Optimization

  • Use Redis pipelining for batch operations
  • Configure appropriate Redis memory limits
  • Use Redis clustering for high availability
  • Monitor Redis memory usage and eviction

Job Optimization

  • Keep job payloads small
  • Use efficient serialization formats
  • Batch similar jobs when possible
  • Implement job prioritization if needed

Worker Optimization

  • Pool worker connections to Redis
  • Use async I/O for Redis operations
  • Implement graceful shutdown handling
  • Monitor worker resource usage