
# Hero Dispatcher Protocol

This document describes the Redis-based protocol used by the Hero Dispatcher for job management and worker communication.

## Overview

The Hero Dispatcher uses Redis as a message broker and data store for managing distributed job execution. Jobs are stored as Redis hashes, and communication with workers happens through Redis lists (queues).

## Redis Namespace

All dispatcher-related keys use the `hero:` namespace prefix to avoid conflicts with other Redis usage.

## Data Structures

### Job Storage

Jobs are stored as Redis hashes with the following key pattern:
```
hero:job:{job_id}
```

**Job Hash Fields:**
- `id`: Unique job identifier (UUID v4)
- `caller_id`: Identifier of the client that created the job
- `worker_id`: Target worker identifier
- `context_id`: Execution context identifier
- `script`: Script content to execute (Rhai or HeroScript)
- `timeout`: Execution timeout in seconds
- `retries`: Number of retry attempts
- `concurrent`: Whether to execute the job in a separate thread (true/false)
- `log_path`: Optional path to log file for job output
- `created_at`: Job creation timestamp (ISO 8601)
- `updated_at`: Job last update timestamp (ISO 8601)
- `status`: Current job status (dispatched/started/error/finished)
- `env_vars`: Environment variables as JSON object (optional)
- `prerequisites`: JSON array of job IDs that must complete before this job (optional)
- `dependents`: JSON array of job IDs that depend on this job completing (optional)
- `output`: Job execution result (set by worker)
- `error`: Error message if job failed (set by worker)
- `dependencies`: List of job IDs that this job depends on
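To illustrate how a job record maps onto a flat Redis hash, the sketch below flattens a reduced job record into field/value pairs for `HSET`. The `Job` struct, its subset of fields, and the choice to omit unset optional fields are assumptions made for illustration, not the dispatcher's actual type.

```rust
/// Hypothetical, reduced job record; the real dispatcher type may differ.
#[derive(Debug, Clone)]
pub struct Job {
    pub id: String,
    pub caller_id: String,
    pub worker_id: String,
    pub script: String,
    pub timeout_secs: u64,
    pub status: String,
    pub log_path: Option<String>,
}

impl Job {
    /// Flatten the record into (field, value) pairs for `HSET hero:job:{id}`.
    /// Assumption: optional fields are left out of the hash when unset.
    pub fn to_hash_fields(&self) -> Vec<(String, String)> {
        let mut fields = vec![
            ("id".to_string(), self.id.clone()),
            ("caller_id".to_string(), self.caller_id.clone()),
            ("worker_id".to_string(), self.worker_id.clone()),
            ("script".to_string(), self.script.clone()),
            ("timeout".to_string(), self.timeout_secs.to_string()),
            ("status".to_string(), self.status.clone()),
        ];
        if let Some(path) = &self.log_path {
            fields.push(("log_path".to_string(), path.clone()));
        }
        fields
    }
}
```

Every value is stored as a string, so numeric fields such as `timeout` need to be parsed again when the hash is read back.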

### Job Dependencies

Jobs can have dependencies on other jobs, which are stored in the `dependencies` field. A job will not be dispatched until all its dependencies have completed successfully.
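The dispatch rule can be sketched as a pure function over already-fetched job statuses. The function name and the `statuses` map (job ID to `status` field) are hypothetical. Treating an unknown dependency as unsatisfied is also an assumption.

```rust
use std::collections::HashMap;

/// Returns true when every dependency is known and has status "finished".
/// A dependency missing from `statuses` counts as not satisfied (assumption).
pub fn dependencies_satisfied(
    dependencies: &[String],
    statuses: &HashMap<String, String>,
) -> bool {
    dependencies
        .iter()
        .all(|dep| statuses.get(dep).map(String::as_str) == Some("finished"))
}
```

A job with an empty dependency list is always ready under this rule.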

### Work Queues

Jobs are queued for execution using Redis lists:
```
hero:work_queue:{worker_id}
```

Workers listen on their specific queue using `BLPOP` for job IDs to process.
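Processing order depends on which ends of the list are used. `LPUSH` inserts at the head and `BLPOP` removes from the head, so a queue driven by that pair hands out the most recently pushed job first. Popping from the tail (`BRPOP`) would give first-in, first-out order instead. The in-memory model below is only an illustration of these list semantics. `ListModel` is a hypothetical stand-in, not a Redis client.

```rust
use std::collections::VecDeque;

/// In-memory stand-in for a Redis list; the front of the deque is the head.
#[derive(Default)]
pub struct ListModel {
    items: VecDeque<String>,
}

impl ListModel {
    /// LPUSH: insert at the head.
    pub fn lpush(&mut self, value: &str) {
        self.items.push_front(value.to_string());
    }

    /// LPOP (non-blocking counterpart of BLPOP): remove from the head.
    pub fn lpop(&mut self) -> Option<String> {
        self.items.pop_front()
    }

    /// RPOP (non-blocking counterpart of BRPOP): remove from the tail.
    pub fn rpop(&mut self) -> Option<String> {
        self.items.pop_back()
    }
}
```

After pushing `job-a`, `job-b`, `job-c` in that order, `lpop` yields `job-c` while `rpop` yields `job-a`.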

### Stop Queues

Job stop requests are sent through dedicated stop queues:
```
hero:stop_queue:{worker_id}
```

Workers monitor these queues to receive stop requests for running jobs.

### Reply Queues

For synchronous job execution, dedicated reply queues are used:
```
hero:reply:{job_id}
```

Workers send results to these queues when jobs complete.
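The key patterns above can be collected in small helper functions so that clients and workers build identical keys. The function names are hypothetical. Only the key formats come from this document.

```rust
/// Namespace prefix shared by all dispatcher keys.
pub const NAMESPACE: &str = "hero:";

/// Key of the hash that stores a job.
pub fn job_key(job_id: &str) -> String {
    format!("{}job:{}", NAMESPACE, job_id)
}

/// Key of the list a worker pops job IDs from.
pub fn work_queue_key(worker_id: &str) -> String {
    format!("{}work_queue:{}", NAMESPACE, worker_id)
}

/// Key of the list carrying stop requests for a worker.
pub fn stop_queue_key(worker_id: &str) -> String {
    format!("{}stop_queue:{}", NAMESPACE, worker_id)
}

/// Key of the list a synchronous caller waits on for a job's result.
pub fn reply_queue_key(job_id: &str) -> String {
    format!("{}reply:{}", NAMESPACE, job_id)
}
```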

## Job Lifecycle

### 1. Job Creation
```
Client -> Redis: HSET hero:job:{job_id} {job_fields}
```

### 2. Job Submission
```
Client -> Redis: LPUSH hero:work_queue:{worker_id} {job_id}
```

### 3. Job Processing
```
Worker -> Redis: BLPOP hero:work_queue:{worker_id}
Worker -> Redis: HSET hero:job:{job_id} status "started"
Worker: Execute script
Worker -> Redis: HSET hero:job:{job_id} status "finished" output "{result}"
```

### 4. Job Completion (Sync)
```
Worker -> Redis: LPUSH hero:reply:{job_id} {result}
```

## API Operations

### List Jobs
```rust
dispatcher.list_jobs() -> Vec<String>
```
**Redis Operations:**
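- `KEYS hero:job:*` - Get all job keys (`KEYS` walks the whole keyspace and blocks Redis while it runs; `SCAN` is the incremental alternative for large datasets)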
- Extract job IDs from key names
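Extracting the ID amounts to stripping the job key prefix. A minimal sketch, with a hypothetical function name:

```rust
/// Extract the job ID from a key of the form `hero:job:{job_id}`.
/// Returns `None` for keys outside the job namespace or with an empty ID.
pub fn job_id_from_key(key: &str) -> Option<&str> {
    key.strip_prefix("hero:job:").filter(|id| !id.is_empty())
}
```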

### Stop Job
```rust
dispatcher.stop_job(job_id) -> Result<(), DispatcherError>
```
**Redis Operations:**
- `LPUSH hero:stop_queue:{worker_id} {job_id}` - Send stop request

### Get Job Status
```rust
dispatcher.get_job_status(job_id) -> Result<JobStatus, DispatcherError>
```
**Redis Operations:**
- `HGETALL hero:job:{job_id}` - Get job data
- Parse `status` field
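Parsing the `status` field might look like the following. The enum mirrors the four status strings listed under Job Storage, but its exact shape is an assumption. The `JobStatus` type in the dispatcher crate may be defined differently.

```rust
use std::str::FromStr;

/// Illustrative status enum; variant names are assumed from the status strings.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum JobStatus {
    Dispatched,
    Started,
    Error,
    Finished,
}

impl JobStatus {
    /// String form as stored in the `status` hash field.
    pub fn as_str(self) -> &'static str {
        match self {
            JobStatus::Dispatched => "dispatched",
            JobStatus::Started => "started",
            JobStatus::Error => "error",
            JobStatus::Finished => "finished",
        }
    }

    /// `error` and `finished` are the end states of the lifecycle.
    pub fn is_terminal(self) -> bool {
        matches!(self, JobStatus::Error | JobStatus::Finished)
    }
}

impl FromStr for JobStatus {
    type Err = String;

    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s {
            "dispatched" => Ok(JobStatus::Dispatched),
            "started" => Ok(JobStatus::Started),
            "error" => Ok(JobStatus::Error),
            "finished" => Ok(JobStatus::Finished),
            other => Err(format!("unknown job status: {}", other)),
        }
    }
}
```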

### Get Job Logs
```rust
dispatcher.get_job_logs(job_id) -> Result<Option<String>, DispatcherError>
```
**Redis Operations:**
- `HGETALL hero:job:{job_id}` - Get job data
- Read `log_path` field
- Read log file from filesystem
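The last two steps can be sketched as below, taking the already-fetched hash as a map. The function name is hypothetical, and treating an empty `log_path` like a missing one is an assumption.

```rust
use std::collections::HashMap;
use std::fs;
use std::io;

/// Read the log file referenced by a job's `log_path` field, if any.
/// Returns `Ok(None)` when the job has no log path configured.
pub fn read_job_log(job_fields: &HashMap<String, String>) -> io::Result<Option<String>> {
    match job_fields.get("log_path") {
        Some(path) if !path.is_empty() => fs::read_to_string(path).map(Some),
        _ => Ok(None),
    }
}
```

This only works when the caller can reach the filesystem on which the log was written.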

### Run Job and Await Result
```rust
dispatcher.run_job_and_await_result(job, worker_id) -> Result<String, DispatcherError>
```
**Redis Operations:**
1. `HSET hero:job:{job_id} {job_fields}` - Store job
2. `LPUSH hero:work_queue:{worker_id} {job_id}` - Submit job
3. `BLPOP hero:reply:{job_id} {timeout}` - Wait for result

## Worker Protocol

### Job Processing Loop
```rust
loop {
    // 1. Wait for job
    job_id = BLPOP hero:work_queue:{worker_id}

    // 2. Get job details
    job_data = HGETALL hero:job:{job_id}

    // 3. Update status
    HSET hero:job:{job_id} status "started"

    // 4. Check for a stop request targeting this job.
    //    LREM removes only the matching entry and returns the number removed,
    //    so stop requests for other jobs stay in the queue.
    if LREM hero:stop_queue:{worker_id} 1 job_id > 0 {
        HSET hero:job:{job_id} status "error" error "stopped"
        continue
    }

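    // 5. Execute script
    result = execute_script(job_data.script)

    // 6. Update job with the result, or record the failure
    if result is error {
        HSET hero:job:{job_id} status "error" error result
    } else {
        HSET hero:job:{job_id} status "finished" output result
    }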

    // 7. Send reply if needed
    if reply_queue_exists(hero:reply:{job_id}) {
        LPUSH hero:reply:{job_id} result
    }
}
```

### Stop Request Handling
Workers should periodically check the stop queue during long-running jobs:
```rust
if LLEN hero:stop_queue:{worker_id} > 0 {
    stop_requests = LRANGE hero:stop_queue:{worker_id} 0 -1
    if stop_requests.contains(current_job_id) {
        // Stop current job execution
        HSET hero:job:{current_job_id} status "error" error "stopped_by_request"
        // Remove stop request
        LREM hero:stop_queue:{worker_id} 1 current_job_id
        return
    }
}
```

## Error Handling

### Job Timeouts
- Client sets timeout when creating job
- Worker should respect timeout and stop execution
- If timeout exceeded: `HSET hero:job:{job_id} status "error" error "timeout"`
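One way a worker could bound execution time is to run the script on a helper thread and wait on a channel with a deadline. This is a sketch under that assumption, not the worker's actual mechanism, and `run_with_timeout` is a hypothetical name.

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

/// Run `work` on a helper thread and wait at most `timeout` for its result.
/// Returns `None` if the deadline passes or the helper thread panics.
/// The helper thread is not killed on timeout; it keeps running in the
/// background, so real scripts also need a cooperative cancellation hook.
pub fn run_with_timeout<T, F>(work: F, timeout: Duration) -> Option<T>
where
    T: Send + 'static,
    F: FnOnce() -> T + Send + 'static,
{
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        // The receiver may already be gone after a timeout; ignore that error.
        let _ = tx.send(work());
    });
    rx.recv_timeout(timeout).ok()
}
```

When the call returns `None`, the worker would record the `timeout` error on the job hash as described above.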

### Worker Failures
- If worker crashes, job remains in "started" status
- Monitoring systems can detect stale jobs and retry
- Jobs can be requeued: `LPUSH hero:work_queue:{worker_id} {job_id}`
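A monitoring process could flag stale jobs with a check like the one below. It assumes `updated_at` has already been parsed from its ISO 8601 string into a `SystemTime` (parsing is omitted here), and the `max_age` threshold is a deployment choice rather than part of the protocol.

```rust
use std::time::{Duration, SystemTime};

/// A job is considered stale when it is still "started" and its last update
/// is older than `max_age`.
pub fn is_stale(status: &str, updated_at: SystemTime, now: SystemTime, max_age: Duration) -> bool {
    if status != "started" {
        return false;
    }
    match now.duration_since(updated_at) {
        Ok(age) => age > max_age,
        // `updated_at` lies in the future (clock skew): do not treat as stale.
        Err(_) => false,
    }
}
```

Choosing `max_age` larger than the job's own `timeout` avoids requeueing jobs that are still running normally.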

### Redis Connection Issues
- Clients should implement retry logic with exponential backoff
- Workers should reconnect and resume processing
- Use Redis persistence to survive Redis restarts
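Retry with exponential backoff can be sketched generically over any fallible operation. The helper name and parameters are hypothetical. A production version would typically also cap the delay and add jitter.

```rust
use std::thread;
use std::time::Duration;

/// Call `op` until it succeeds or `max_attempts` calls have failed,
/// doubling the pause between attempts. `op` is always called at least once.
pub fn retry_with_backoff<T, E, F>(
    mut op: F,
    max_attempts: u32,
    base_delay: Duration,
) -> Result<T, E>
where
    F: FnMut() -> Result<T, E>,
{
    let mut delay = base_delay;
    let mut attempt = 1;
    loop {
        match op() {
            Ok(value) => return Ok(value),
            // Out of attempts: surface the last error to the caller.
            Err(err) if attempt >= max_attempts => return Err(err),
            Err(_) => {
                thread::sleep(delay);
                delay = delay.saturating_mul(2);
                attempt += 1;
            }
        }
    }
}
```

With `base_delay` of 100 ms and `max_attempts` of 4, the pauses between attempts are 100 ms, 200 ms, and 400 ms.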

## Monitoring and Observability

### Queue Monitoring
```bash
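# Check work queue length
redis-cli LLEN 'hero:work_queue:{worker_id}'

# Check stop queue length
redis-cli LLEN 'hero:stop_queue:{worker_id}'

# List all jobs (KEYS blocks Redis while it runs; on large datasets use
# redis-cli --scan --pattern 'hero:job:*' instead)
redis-cli KEYS 'hero:job:*'

# Get job details
redis-cli HGETALL 'hero:job:{job_id}'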
```

### Metrics to Track
- Jobs created per second
- Jobs completed per second
- Average job execution time
- Queue depths
- Worker availability
- Error rates by job type

## Security Considerations

### Redis Security
- Use Redis AUTH for authentication
- Enable TLS for Redis connections
- Restrict Redis network access
- Use Redis ACLs to limit worker permissions

### Job Security
- Validate script content before execution
- Sandbox script execution environment
- Limit resource usage (CPU, memory, disk)
- Log all job executions for audit

### Log File Security
- Ensure log paths are within allowed directories
- Validate log file permissions
- Rotate and archive logs regularly
- Sanitize sensitive data in logs
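The first point can be approximated with a lexical check on the path before any file is opened. This is a minimal sketch with a hypothetical function name. It assumes Unix-style absolute paths and does not resolve symlinks, so a stricter check would compare canonicalized paths.

```rust
use std::path::{Component, Path};

/// Lexical check: `log_path` must be absolute, contain no `..` component,
/// and lie under `allowed_dir`. Symlinks are not resolved.
pub fn log_path_allowed(log_path: &Path, allowed_dir: &Path) -> bool {
    let has_parent_ref = log_path
        .components()
        .any(|component| matches!(component, Component::ParentDir));
    log_path.is_absolute() && !has_parent_ref && log_path.starts_with(allowed_dir)
}
```

`Path::starts_with` compares whole path components, so `/var/log/hero-evil/x.log` is not accepted for an allowed directory of `/var/log/hero`.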

## Performance Considerations

### Redis Optimization
- Use Redis pipelining for batch operations
- Configure appropriate Redis memory limits
- Use Redis replication with Sentinel, or Redis Cluster, for high availability
- Monitor Redis memory usage and eviction

### Job Optimization
- Keep job payloads small
- Use efficient serialization formats
- Batch similar jobs when possible
- Implement job prioritization if needed

### Worker Optimization
- Pool worker connections to Redis
- Use async I/O for Redis operations
- Implement graceful shutdown handling
- Monitor worker resource usage