
Worker Lifecycle Management

The Hero Supervisor manages worker lifecycles using Zinit as its process manager. This lets the supervisor start and stop worker processes, monitor their health, and scale them to balance load.

Overview

The lifecycle management system provides:

  • Worker Process Management: Start, stop, restart, and monitor worker binaries
  • Health Monitoring: Automatic ping jobs every 10 minutes for idle workers
  • Load Balancing: Dynamic scaling of workers based on demand
  • Service Dependencies: Proper startup ordering with dependency management
  • Graceful Shutdown: Clean termination of worker processes

Architecture

┌─────────────────┐    ┌──────────────────┐    ┌─────────────────┐
│   Supervisor    │    │ WorkerLifecycle  │    │     Zinit       │
│                 │◄──►│    Manager       │◄──►│   (Process      │
│  (Job Dispatch) │    │                  │    │    Manager)     │
└─────────────────┘    └──────────────────┘    └─────────────────┘
         │                       │                       │
         │                       │                       │
         ▼                       ▼                       ▼
┌─────────────────┐    ┌──────────────────┐    ┌─────────────────┐
│     Redis       │    │ Health Monitor   │    │ Worker Binaries │
│   (Job Queue)   │    │  (Ping Jobs)     │    │  (OSIS/SAL/V)   │
└─────────────────┘    └──────────────────┘    └─────────────────┘

Components

WorkerConfig

Defines configuration for a worker binary:

use hero_supervisor::{WorkerConfig, ScriptType};
use std::path::PathBuf;
use std::collections::HashMap;

let config = WorkerConfig::new(
    "osis_worker_0".to_string(),
    PathBuf::from("/usr/local/bin/osis_worker"),
    ScriptType::OSIS,
)
.with_args(vec![
    "--redis-url".to_string(),
    "redis://localhost:6379".to_string(),
    "--worker-id".to_string(),
    "osis_worker_0".to_string(),
])
.with_env({
    let mut env = HashMap::new();
    env.insert("RUST_LOG".to_string(), "info".to_string());
    env.insert("WORKER_TYPE".to_string(), "osis".to_string());
    env
})
.with_health_check("/usr/local/bin/osis_worker --health-check".to_string())
.with_dependencies(vec!["redis".to_string()]);

WorkerLifecycleManager

Main component for managing worker lifecycles:

use hero_supervisor::{SupervisorBuilder, WorkerLifecycleManagerBuilder};

let supervisor = SupervisorBuilder::new()
    .redis_url("redis://localhost:6379")
    .caller_id("my_supervisor")
    .context_id("production")
    .build()?;

let mut lifecycle_manager = WorkerLifecycleManagerBuilder::new("/var/run/zinit.sock".to_string())
    .with_supervisor(supervisor.clone())
    .add_worker(osis_worker_config)
    .add_worker(sal_worker_config)
    .add_worker(v_worker_config)
    .build();

Supported Script Types

The lifecycle manager supports all Hero script types:

  • OSIS: Rhai/HeroScript execution workers
  • SAL: System Abstraction Layer workers
  • V: HeroScript execution in V language
  • Python: HeroScript execution in Python
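
As a quick illustration, one worker of each type could be configured as follows. This is a minimal sketch: the binary paths mirror the Prerequisites section below, and the ScriptType::Python variant is assumed to follow the same naming as the others.

use hero_supervisor::{WorkerConfig, ScriptType};
use std::path::PathBuf;

// One worker per script type; binary paths follow the Prerequisites section below.
let workers = vec![
    WorkerConfig::new("osis_worker_0".to_string(), PathBuf::from("/usr/local/bin/osis_worker"), ScriptType::OSIS),
    WorkerConfig::new("sal_worker_0".to_string(), PathBuf::from("/usr/local/bin/sal_worker"), ScriptType::SAL),
    WorkerConfig::new("v_worker_0".to_string(), PathBuf::from("/usr/local/bin/v_worker"), ScriptType::V),
    // ScriptType::Python is assumed here, based on the list of supported types above.
    WorkerConfig::new("python_worker_0".to_string(), PathBuf::from("/usr/local/bin/python_worker"), ScriptType::Python),
];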

Key Features

1. Worker Management

// Start all configured workers
lifecycle_manager.start_all_workers().await?;

// Stop all workers
lifecycle_manager.stop_all_workers().await?;

// Restart specific worker
lifecycle_manager.restart_worker("osis_worker_0").await?;

// Get worker status
let status = lifecycle_manager.get_worker_status("osis_worker_0").await?;
println!("Worker state: {:?}, PID: {}", status.state, status.pid);

2. Health Monitoring

The system automatically monitors worker health:

  • Tracks last job execution time for each worker
  • Sends ping jobs to workers idle for 10+ minutes
  • Restarts workers that fail ping checks 3 times
  • Updates job times when workers receive tasks

// Manual health check
lifecycle_manager.monitor_worker_health().await?;

// Update job time (called automatically by supervisor)
lifecycle_manager.update_worker_job_time("osis_worker_0");

// Start continuous health monitoring
lifecycle_manager.start_health_monitoring().await; // Runs forever
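
Because start_health_monitoring runs until the process exits, one option is to run it on a background task so the supervisor can keep dispatching jobs. A minimal sketch, assuming a Tokio runtime and that the lifecycle manager can be moved into the task:

// Run health monitoring in the background; the manager is moved into the task,
// so do this after all workers have been configured and started.
let monitor_handle = tokio::spawn(async move {
    lifecycle_manager.start_health_monitoring().await;
});

// ... dispatch jobs via the supervisor ...

// Stop monitoring during shutdown.
monitor_handle.abort();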

3. Dynamic Scaling

Scale workers up or down based on demand:

// Scale OSIS workers to 5 instances
lifecycle_manager.scale_workers(&ScriptType::OSIS, 5).await?;

// Scale down SAL workers to 1 instance  
lifecycle_manager.scale_workers(&ScriptType::SAL, 1).await?;

// Check current running count
let count = lifecycle_manager.get_running_worker_count(&ScriptType::V).await;
println!("Running V workers: {}", count);

4. Service Dependencies

Workers can depend on other services:

let config = WorkerConfig::new(name, binary, script_type)
    .with_dependencies(vec![
        "redis".to_string(),
        "database".to_string(),
        "auth_service".to_string(),
    ]);

Zinit ensures dependencies start before the worker.

Integration with Supervisor

The lifecycle manager integrates seamlessly with the supervisor:

use hero_supervisor::{SupervisorBuilder, WorkerLifecycleManagerBuilder, ScriptType};

// Create supervisor and lifecycle manager
let supervisor = SupervisorBuilder::new().build()?;
let zinit_socket = "/var/run/zinit.sock".to_string();
let mut lifecycle_manager = WorkerLifecycleManagerBuilder::new(zinit_socket)
    .with_supervisor(supervisor.clone())
    .build();

// Start workers
lifecycle_manager.start_all_workers().await?;

// Create and execute jobs (supervisor automatically routes to workers)
let job = supervisor
    .new_job()
    .script_type(ScriptType::OSIS)
    .script_content("println!(\"Hello World!\");".to_string())
    .build()?;

let result = supervisor.run_job_and_await_result(&job).await?;
println!("Job result: {}", result);

Zinit Service Configuration

The lifecycle manager automatically creates Zinit service configurations:

# Generated service config for osis_worker_0
exec: "/usr/local/bin/osis_worker --redis-url redis://localhost:6379 --worker-id osis_worker_0"
test: "/usr/local/bin/osis_worker --health-check"
oneshot: false  # Restart on exit
after:
  - redis
env:
  RUST_LOG: "info"
  WORKER_TYPE: "osis"

Error Handling

The system provides comprehensive error handling:

use hero_supervisor::SupervisorError;

match lifecycle_manager.start_worker(&config).await {
    Ok(_) => println!("Worker started successfully"),
    Err(SupervisorError::WorkerStartFailed(worker, reason)) => {
        eprintln!("Failed to start {}: {}", worker, reason);
    }
    Err(e) => eprintln!("Other error: {}", e),
}
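
For transient failures, a simple retry wrapper can be built from the same call. This is a sketch, not part of the supervisor API, and assumes it runs inside an async function returning Result<(), SupervisorError>:

// Hypothetical retry loop: give a worker up to three attempts to start.
let mut attempts = 0;
loop {
    match lifecycle_manager.start_worker(&config).await {
        Ok(_) => break,
        Err(e) => {
            attempts += 1;
            if attempts >= 3 {
                return Err(e);
            }
            eprintln!("Start failed (attempt {}): {}", attempts, e);
            tokio::time::sleep(std::time::Duration::from_secs(2)).await;
        }
    }
}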

Example Usage

See examples/lifecycle_demo.rs for a comprehensive demonstration:

# Run the lifecycle demo
cargo run --example lifecycle_demo

# Run with custom Redis URL
REDIS_URL=redis://localhost:6379 cargo run --example lifecycle_demo

Prerequisites

  1. Zinit: Install and run Zinit process manager

    curl https://raw.githubusercontent.com/threefoldtech/zinit/refs/heads/master/install.sh | bash
    zinit init --config /etc/zinit/ --socket /var/run/zinit.sock
    
  2. Redis: Running Redis instance for job queues

    redis-server
    
  3. Worker Binaries: Compiled worker binaries for each script type

    • /usr/local/bin/osis_worker
    • /usr/local/bin/sal_worker
    • /usr/local/bin/v_worker
    • /usr/local/bin/python_worker

Configuration Best Practices

  1. Resource Limits: Configure appropriate resource limits in Zinit
  2. Health Checks: Implement meaningful health check commands
  3. Dependencies: Define proper service dependencies
  4. Environment: Set appropriate environment variables
  5. Logging: Configure structured logging for debugging
  6. Monitoring: Use health monitoring for production deployments
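
As an illustration, a configuration that follows most of these practices might look like the sketch below, using only the builder methods shown earlier (the health-check flag and log level are assumptions):

use hero_supervisor::{WorkerConfig, ScriptType};
use std::collections::HashMap;
use std::path::PathBuf;

let config = WorkerConfig::new(
    "sal_worker_0".to_string(),
    PathBuf::from("/usr/local/bin/sal_worker"),
    ScriptType::SAL,
)
// Meaningful health check (the --health-check flag is assumed to exist in the worker).
.with_health_check("/usr/local/bin/sal_worker --health-check".to_string())
// Explicit dependency so Zinit starts Redis before this worker.
.with_dependencies(vec!["redis".to_string()])
// Structured logging for debugging.
.with_env({
    let mut env = HashMap::new();
    env.insert("RUST_LOG".to_string(), "info".to_string());
    env
});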

Troubleshooting

Common Issues

  1. Zinit Connection Failed

    • Ensure Zinit is running: ps aux | grep zinit
    • Check socket permissions: ls -la /var/run/zinit.sock
    • Verify socket path in configuration
  2. Worker Start Failed

    • Check binary exists and is executable
    • Verify dependencies are running
    • Review Zinit logs: zinit logs <service-name>
  3. Health Check Failures

    • Implement proper health check endpoint in workers
    • Verify health check command syntax
    • Check worker responsiveness
  4. Redis Connection Issues

    • Ensure Redis is running and accessible
    • Verify Redis URL configuration
    • Check network connectivity

Debug Commands

# Check Zinit status
zinit list

# View service logs
zinit logs osis_worker_0

# Check service status
zinit status osis_worker_0

# Monitor Redis queues
redis-cli keys "hero:job:*"

Performance Considerations

  • Scaling: Start with minimal workers and scale based on queue depth
  • Health Monitoring: Adjust ping intervals based on workload patterns
  • Resource Usage: Monitor CPU/memory usage of worker processes
  • Queue Depth: Monitor Redis queue lengths for scaling decisions
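
A scaling decision based on queue depth might look like the sketch below, assuming the redis crate is available for inspecting queue lengths. The queue key name and the thresholds are illustrative assumptions; only scale_workers comes from the lifecycle manager API shown above.

use hero_supervisor::ScriptType;
use redis::Commands;

// Inspect the OSIS queue depth (key name is hypothetical; compare `redis-cli keys "hero:job:*"`).
let client = redis::Client::open("redis://localhost:6379")?;
let mut con = client.get_connection()?;
let depth: usize = con.llen("hero:job:queue:osis")?;

// Roughly one worker per 10 queued jobs, capped at 5 instances
// (adjust the integer type to match scale_workers' count parameter if needed).
let target = (depth / 10 + 1).min(5);
lifecycle_manager.scale_workers(&ScriptType::OSIS, target).await?;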

Security

  • Process Isolation: Zinit provides process isolation
  • User Permissions: Run workers with appropriate user permissions
  • Network Security: Secure Redis and Zinit socket access
  • Binary Validation: Verify worker binary integrity before deployment