
Worker Lifecycle Management

The Hero Supervisor manages worker lifecycles using Zinit as its process manager. This lets the supervisor start and stop worker processes, monitor their health, and scale them to balance load.

Overview

The lifecycle management system provides:

  • Worker Process Management: Start, stop, restart, and monitor worker binaries
  • Health Monitoring: Automatic ping jobs every 10 minutes for idle workers
  • Load Balancing: Dynamic scaling of workers based on demand
  • Service Dependencies: Proper startup ordering with dependency management
  • Graceful Shutdown: Clean termination of worker processes

Architecture

┌─────────────────┐    ┌──────────────────┐    ┌─────────────────┐
│   Supervisor    │    │ WorkerLifecycle  │    │     Zinit       │
│                 │◄──►│    Manager       │◄──►│   (Process      │
│  (Job Dispatch) │    │                  │    │    Manager)     │
└─────────────────┘    └──────────────────┘    └─────────────────┘
         │                       │                       │
         │                       │                       │
         ▼                       ▼                       ▼
┌─────────────────┐    ┌──────────────────┐    ┌─────────────────┐
│     Redis       │    │ Health Monitor   │    │ Worker Binaries │
│   (Job Queue)   │    │  (Ping Jobs)     │    │  (OSIS/SAL/V)   │
└─────────────────┘    └──────────────────┘    └─────────────────┘

Components

WorkerConfig

Defines configuration for a worker binary:

use hero_supervisor::{WorkerConfig, ScriptType};
use std::path::PathBuf;
use std::collections::HashMap;

let config = WorkerConfig::new(
    "osis_worker_0".to_string(),
    PathBuf::from("/usr/local/bin/osis_worker"),
    ScriptType::OSIS,
)
.with_args(vec![
    "--redis-url".to_string(),
    "redis://localhost:6379".to_string(),
    "--worker-id".to_string(),
    "osis_worker_0".to_string(),
])
.with_env({
    let mut env = HashMap::new();
    env.insert("RUST_LOG".to_string(), "info".to_string());
    env.insert("WORKER_TYPE".to_string(), "osis".to_string());
    env
})
.with_health_check("/usr/local/bin/osis_worker --health-check".to_string())
.with_dependencies(vec!["redis".to_string()]);

WorkerLifecycleManager

Main component for managing worker lifecycles:

use hero_supervisor::{SupervisorBuilder, WorkerLifecycleManagerBuilder};

let supervisor = SupervisorBuilder::new()
    .redis_url("redis://localhost:6379")
    .caller_id("my_supervisor")
    .context_id("production")
    .build()?;

let mut lifecycle_manager = WorkerLifecycleManagerBuilder::new("/var/run/zinit.sock".to_string())
    .with_supervisor(supervisor.clone())
    .add_worker(osis_worker_config)
    .add_worker(sal_worker_config)
    .add_worker(v_worker_config)
    .build();

Supported Script Types

The lifecycle manager supports all Hero script types:

  • OSIS: Rhai/HeroScript execution workers
  • SAL: System Abstraction Layer workers
  • V: HeroScript execution in V language
  • Python: HeroScript execution in Python
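
As a quick illustration, one worker of each type could be configured as follows. This is a minimal sketch: the binary paths mirror the Prerequisites section below, and the ScriptType::Python variant is assumed to follow the same naming as the others.

use hero_supervisor::{WorkerConfig, ScriptType};
use std::path::PathBuf;

// One worker per script type; binary paths follow the Prerequisites section below.
let workers = vec![
    WorkerConfig::new("osis_worker_0".to_string(), PathBuf::from("/usr/local/bin/osis_worker"), ScriptType::OSIS),
    WorkerConfig::new("sal_worker_0".to_string(), PathBuf::from("/usr/local/bin/sal_worker"), ScriptType::SAL),
    WorkerConfig::new("v_worker_0".to_string(), PathBuf::from("/usr/local/bin/v_worker"), ScriptType::V),
    // ScriptType::Python is assumed here, based on the list of supported types above.
    WorkerConfig::new("python_worker_0".to_string(), PathBuf::from("/usr/local/bin/python_worker"), ScriptType::Python),
];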

Key Features

1. Worker Management

// Start all configured workers
lifecycle_manager.start_all_workers().await?;

// Stop all workers
lifecycle_manager.stop_all_workers().await?;

// Restart specific worker
lifecycle_manager.restart_worker("osis_worker_0").await?;

// Get worker status
let status = lifecycle_manager.get_worker_status("osis_worker_0").await?;
println!("Worker state: {:?}, PID: {}", status.state, status.pid);

2. Health Monitoring

The system automatically monitors worker health:

  • Tracks last job execution time for each worker
  • Sends ping jobs to workers idle for 10+ minutes
  • Restarts workers that fail ping checks 3 times
  • Updates job times when workers receive tasks

// Manual health check
lifecycle_manager.monitor_worker_health().await?;

// Update job time (called automatically by supervisor)
lifecycle_manager.update_worker_job_time("osis_worker_0");

// Start continuous health monitoring
lifecycle_manager.start_health_monitoring().await; // Runs forever
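
Because start_health_monitoring runs until the process exits, one option is to run it on a background task so the supervisor can keep dispatching jobs. A minimal sketch, assuming a Tokio runtime and that the lifecycle manager can be moved into the task:

// Run health monitoring in the background; the manager is moved into the task,
// so do this after all workers have been configured and started.
let monitor_handle = tokio::spawn(async move {
    lifecycle_manager.start_health_monitoring().await;
});

// ... dispatch jobs via the supervisor ...

// Stop monitoring during shutdown.
monitor_handle.abort();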

3. Dynamic Scaling

Scale workers up or down based on demand:

// Scale OSIS workers to 5 instances
lifecycle_manager.scale_workers(&ScriptType::OSIS, 5).await?;

// Scale down SAL workers to 1 instance  
lifecycle_manager.scale_workers(&ScriptType::SAL, 1).await?;

// Check current running count
let count = lifecycle_manager.get_running_worker_count(&ScriptType::V).await;
println!("Running V workers: {}", count);

4. Service Dependencies

Workers can depend on other services:

let config = WorkerConfig::new(name, binary, script_type)
    .with_dependencies(vec![
        "redis".to_string(),
        "database".to_string(),
        "auth_service".to_string(),
    ]);

Zinit ensures dependencies start before the worker.

Integration with Supervisor

The lifecycle manager integrates seamlessly with the supervisor:

use hero_supervisor::{SupervisorBuilder, WorkerLifecycleManagerBuilder, ScriptType};

// Create supervisor and lifecycle manager
let supervisor = SupervisorBuilder::new().build()?;
let zinit_socket = "/var/run/zinit.sock".to_string();
let mut lifecycle_manager = WorkerLifecycleManagerBuilder::new(zinit_socket)
    .with_supervisor(supervisor.clone())
    .build();

// Start workers
lifecycle_manager.start_all_workers().await?;

// Create and execute jobs (supervisor automatically routes to workers)
let job = supervisor
    .new_job()
    .script_type(ScriptType::OSIS)
    .script_content("println!(\"Hello World!\");".to_string())
    .build()?;

let result = supervisor.run_job_and_await_result(&job).await?;
println!("Job result: {}", result);

Zinit Service Configuration

The lifecycle manager automatically creates Zinit service configurations:

# Generated service config for osis_worker_0
exec: "/usr/local/bin/osis_worker --redis-url redis://localhost:6379 --worker-id osis_worker_0"
test: "/usr/local/bin/osis_worker --health-check"
oneshot: false  # Restart on exit
after:
  - redis
env:
  RUST_LOG: "info"
  WORKER_TYPE: "osis"

Error Handling

The system provides comprehensive error handling:

use hero_supervisor::SupervisorError;

match lifecycle_manager.start_worker(&config).await {
    Ok(_) => println!("Worker started successfully"),
    Err(SupervisorError::WorkerStartFailed(worker, reason)) => {
        eprintln!("Failed to start {}: {}", worker, reason);
    }
    Err(e) => eprintln!("Other error: {}", e),
}
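
For transient failures, a simple retry wrapper can be built from the same call. This is a sketch, not part of the supervisor API, and assumes it runs inside an async function returning Result<(), SupervisorError>:

// Hypothetical retry loop: give a worker up to three attempts to start.
let mut attempts = 0;
loop {
    match lifecycle_manager.start_worker(&config).await {
        Ok(_) => break,
        Err(e) => {
            attempts += 1;
            if attempts >= 3 {
                return Err(e);
            }
            eprintln!("Start failed (attempt {}): {}", attempts, e);
            tokio::time::sleep(std::time::Duration::from_secs(2)).await;
        }
    }
}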

Example Usage

See examples/lifecycle_demo.rs for a comprehensive demonstration:

# Run the lifecycle demo
cargo run --example lifecycle_demo

# Run with custom Redis URL
REDIS_URL=redis://localhost:6379 cargo run --example lifecycle_demo

Prerequisites

  1. Zinit: Install and run Zinit process manager

    curl https://raw.githubusercontent.com/threefoldtech/zinit/refs/heads/master/install.sh | bash
    zinit init --config /etc/zinit/ --socket /var/run/zinit.sock
    
  2. Redis: Running Redis instance for job queues

    redis-server
    
  3. Worker Binaries: Compiled worker binaries for each script type

    • /usr/local/bin/osis_worker
    • /usr/local/bin/sal_worker
    • /usr/local/bin/v_worker
    • /usr/local/bin/python_worker

Configuration Best Practices

  1. Resource Limits: Configure appropriate resource limits in Zinit
  2. Health Checks: Implement meaningful health check commands
  3. Dependencies: Define proper service dependencies
  4. Environment: Set appropriate environment variables
  5. Logging: Configure structured logging for debugging
  6. Monitoring: Use health monitoring for production deployments
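
As an illustration, a configuration that follows most of these practices might look like the sketch below, using only the builder methods shown earlier (the health-check flag and log level are assumptions):

use hero_supervisor::{WorkerConfig, ScriptType};
use std::collections::HashMap;
use std::path::PathBuf;

let config = WorkerConfig::new(
    "sal_worker_0".to_string(),
    PathBuf::from("/usr/local/bin/sal_worker"),
    ScriptType::SAL,
)
// Meaningful health check (the --health-check flag is assumed to exist in the worker).
.with_health_check("/usr/local/bin/sal_worker --health-check".to_string())
// Explicit dependency so Zinit starts Redis before this worker.
.with_dependencies(vec!["redis".to_string()])
// Structured logging for debugging.
.with_env({
    let mut env = HashMap::new();
    env.insert("RUST_LOG".to_string(), "info".to_string());
    env
});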

Troubleshooting

Common Issues

  1. Zinit Connection Failed

    • Ensure Zinit is running: ps aux | grep zinit
    • Check socket permissions: ls -la /var/run/zinit.sock
    • Verify socket path in configuration
  2. Worker Start Failed

    • Check binary exists and is executable
    • Verify dependencies are running
    • Review Zinit logs: zinit logs <service-name>
  3. Health Check Failures

    • Implement proper health check endpoint in workers
    • Verify health check command syntax
    • Check worker responsiveness
  4. Redis Connection Issues

    • Ensure Redis is running and accessible
    • Verify Redis URL configuration
    • Check network connectivity

Debug Commands

# Check Zinit status
zinit list

# View service logs
zinit logs osis_worker_0

# Check service status
zinit status osis_worker_0

# Monitor Redis queues
redis-cli keys "hero:job:*"

Performance Considerations

  • Scaling: Start with minimal workers and scale based on queue depth
  • Health Monitoring: Adjust ping intervals based on workload patterns
  • Resource Usage: Monitor CPU/memory usage of worker processes
  • Queue Depth: Monitor Redis queue lengths for scaling decisions
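
A scaling decision based on queue depth might look like the sketch below, assuming the redis crate is available for inspecting queue lengths. The queue key name and the thresholds are illustrative assumptions; only scale_workers comes from the lifecycle manager API shown above.

use hero_supervisor::ScriptType;
use redis::Commands;

// Inspect the OSIS queue depth (key name is hypothetical; compare `redis-cli keys "hero:job:*"`).
let client = redis::Client::open("redis://localhost:6379")?;
let mut con = client.get_connection()?;
let depth: usize = con.llen("hero:job:queue:osis")?;

// Roughly one worker per 10 queued jobs, capped at 5 instances
// (adjust the integer type to match scale_workers' count parameter if needed).
let target = (depth / 10 + 1).min(5);
lifecycle_manager.scale_workers(&ScriptType::OSIS, target).await?;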

Security

  • Process Isolation: Zinit provides process isolation
  • User Permissions: Run workers with appropriate user permissions
  • Network Security: Secure Redis and Zinit socket access
  • Binary Validation: Verify worker binary integrity before deployment