# Worker Lifecycle Management

The Hero Supervisor includes worker lifecycle management built on [Zinit](https://github.com/threefoldtech/zinit) as the process manager. This lets the supervisor manage worker processes, monitor their health, and balance load across them.

## Overview

The lifecycle management system provides:

- **Worker Process Management**: Start, stop, restart, and monitor worker binaries
- **Health Monitoring**: Automatic ping jobs for workers that have been idle for 10 minutes or more
- **Load Balancing**: Dynamic scaling of workers based on demand
- **Service Dependencies**: Proper startup ordering with dependency management
- **Graceful Shutdown**: Clean termination of worker processes

## Architecture

```
┌─────────────────┐    ┌──────────────────┐    ┌─────────────────┐
│   Supervisor    │    │  WorkerLifecycle │    │     Zinit       │
│                 │◄──►│     Manager      │◄──►│   (Process      │
│  (Job Dispatch) │    │                  │    │    Manager)     │
└─────────────────┘    └──────────────────┘    └─────────────────┘
         │                       │                       │
         │                       │                       │
         ▼                       ▼                       ▼
┌─────────────────┐    ┌──────────────────┐    ┌─────────────────┐
│     Redis       │    │ Health Monitor   │    │ Worker Binaries │
│   (Job Queue)   │    │  (Ping Jobs)     │    │  (OSIS/SAL/V)   │
└─────────────────┘    └──────────────────┘    └─────────────────┘
```

## Components

### WorkerConfig

Defines configuration for a worker binary:

```rust
use hero_supervisor::{WorkerConfig, ScriptType};
use std::path::PathBuf;
use std::collections::HashMap;

let config = WorkerConfig::new(
    "osis_worker_0".to_string(),
    PathBuf::from("/usr/local/bin/osis_worker"),
    ScriptType::OSIS,
)
.with_args(vec![
    "--redis-url".to_string(),
    "redis://localhost:6379".to_string(),
    "--worker-id".to_string(),
    "osis_worker_0".to_string(),
])
.with_env({
    let mut env = HashMap::new();
    env.insert("RUST_LOG".to_string(), "info".to_string());
    env.insert("WORKER_TYPE".to_string(), "osis".to_string());
    env
})
.with_health_check("/usr/local/bin/osis_worker --health-check".to_string())
.with_dependencies(vec!["redis".to_string()]);
```

### WorkerLifecycleManager

Main component for managing worker lifecycles:

```rust
use hero_supervisor::{Supervisor, SupervisorBuilder, WorkerLifecycleManagerBuilder};

let supervisor = SupervisorBuilder::new()
    .redis_url("redis://localhost:6379")
    .caller_id("my_supervisor")
    .context_id("production")
    .build()?;

let mut lifecycle_manager = WorkerLifecycleManagerBuilder::new("/var/run/zinit.sock".to_string())
    .with_supervisor(supervisor.clone())
    .add_worker(osis_worker_config)
    .add_worker(sal_worker_config)
    .add_worker(v_worker_config)
    .build();
```

## Supported Script Types

The lifecycle manager supports all Hero script types:

- **OSIS**: Rhai/HeroScript execution workers
- **SAL**: System Abstraction Layer workers
- **V**: HeroScript execution in V language
- **Python**: HeroScript execution in Python

## Key Features

### 1. Worker Management

```rust
// Start all configured workers
lifecycle_manager.start_all_workers().await?;

// Stop all workers
lifecycle_manager.stop_all_workers().await?;

// Restart specific worker
lifecycle_manager.restart_worker("osis_worker_0").await?;

// Get worker status
let status = lifecycle_manager.get_worker_status("osis_worker_0").await?;
println!("Worker state: {:?}, PID: {}", status.state, status.pid);
```

### 2. Health Monitoring

The system automatically monitors worker health:

- Tracks last job execution time for each worker
- Sends ping jobs to workers idle for 10+ minutes
- Restarts workers that fail ping checks 3 times
- Updates job times when workers receive tasks

```rust
// Manual health check
lifecycle_manager.monitor_worker_health().await?;

// Update job time (called automatically by supervisor)
lifecycle_manager.update_worker_job_time("osis_worker_0");

// Start continuous health monitoring
lifecycle_manager.start_health_monitoring().await; // Runs forever
```

### 3. Dynamic Scaling

Scale workers up or down based on demand:

```rust
// Scale OSIS workers to 5 instances
lifecycle_manager.scale_workers(&ScriptType::OSIS, 5).await?;

// Scale down SAL workers to 1 instance
lifecycle_manager.scale_workers(&ScriptType::SAL, 1).await?;

// Check current running count
let count = lifecycle_manager.get_running_worker_count(&ScriptType::V).await;
println!("Running V workers: {}", count);
```

### 4. Service Dependencies

Workers can depend on other services:

```rust
let config = WorkerConfig::new(name, binary, script_type)
    .with_dependencies(vec![
        "redis".to_string(),
        "database".to_string(),
        "auth_service".to_string(),
    ]);
```

Zinit ensures dependencies start before the worker.

## Integration with Supervisor

The lifecycle manager is built from a supervisor instance, and the supervisor routes jobs to the workers it manages:

```rust
use hero_supervisor::{
    ScriptType, Supervisor, SupervisorBuilder, WorkerLifecycleManager,
    WorkerLifecycleManagerBuilder,
};

// Create supervisor and lifecycle manager
let supervisor = SupervisorBuilder::new().build()?;
let mut lifecycle_manager = WorkerLifecycleManagerBuilder::new(zinit_socket)
    .with_supervisor(supervisor.clone())
    .build();

// Start workers
lifecycle_manager.start_all_workers().await?;

// Create and execute jobs (supervisor automatically routes to workers)
let job = supervisor
    .new_job()
    .script_type(ScriptType::OSIS)
    .script_content("println!(\"Hello World!\");".to_string())
    .build()?;

let result = supervisor.run_job_and_await_result(&job).await?;
println!("Job result: {}", result);
```

## Zinit Service Configuration

The lifecycle manager automatically creates Zinit service configurations:

```yaml
# Generated service config for osis_worker_0
exec: "/usr/local/bin/osis_worker --redis-url redis://localhost:6379 --worker-id osis_worker_0"
test: "/usr/local/bin/osis_worker --health-check"
oneshot: false  # Restart on exit
after:
  - redis
env:
  RUST_LOG: "info"
  WORKER_TYPE: "osis"
```

## Error Handling

Lifecycle operations return `SupervisorError` values that can be matched on:

```rust
use hero_supervisor::SupervisorError;

match lifecycle_manager.start_worker(&config).await {
    Ok(_) => println!("Worker started successfully"),
    Err(SupervisorError::WorkerStartFailed(worker, reason)) => {
        eprintln!("Failed to start {}: {}", worker, reason);
    }
    Err(e) => eprintln!("Other error: {}", e),
}
```

## Example Usage

See `examples/lifecycle_demo.rs` for a comprehensive demonstration:

```bash
# Run the lifecycle demo
cargo run --example lifecycle_demo

# Run with custom Redis URL
REDIS_URL=redis://localhost:6379 cargo run --example lifecycle_demo
```

## Prerequisites

1. **Zinit**: Install and run Zinit process manager
   ```bash
   curl https://raw.githubusercontent.com/threefoldtech/zinit/refs/heads/master/install.sh | bash
   zinit init --config /etc/zinit/ --socket /var/run/zinit.sock
   ```

2. **Redis**: Running Redis instance for job queues
   ```bash
   redis-server
   ```

3. **Worker Binaries**: Compiled worker binaries for each script type
   - `/usr/local/bin/osis_worker`
   - `/usr/local/bin/sal_worker`
   - `/usr/local/bin/v_worker`
   - `/usr/local/bin/python_worker`

## Configuration Best Practices

1. **Resource Limits**: Configure appropriate resource limits in Zinit
2. **Health Checks**: Implement meaningful health check commands
3. **Dependencies**: Define proper service dependencies
4. **Environment**: Set appropriate environment variables
5. **Logging**: Configure structured logging for debugging
6. **Monitoring**: Use health monitoring for production deployments

## Troubleshooting

### Common Issues

1. **Zinit Connection Failed**
   - Ensure Zinit is running: `ps aux | grep zinit`
   - Check socket permissions: `ls -la /var/run/zinit.sock`
   - Verify socket path in configuration

2. **Worker Start Failed**
   - Check binary exists and is executable
   - Verify dependencies are running
   - Review Zinit logs: `zinit logs `

3. **Health Check Failures**
   - Implement proper health check endpoint in workers
   - Verify health check command syntax
   - Check worker responsiveness

4. **Redis Connection Issues**
   - Ensure Redis is running and accessible
   - Verify Redis URL configuration
   - Check network connectivity

### Debug Commands

```bash
# Check Zinit status
zinit list

# View service logs
zinit logs osis_worker_0

# Check service status
zinit status osis_worker_0

# Monitor Redis queues
redis-cli keys "hero:job:*"
```

## Performance Considerations

- **Scaling**: Start with minimal workers and scale based on queue depth
- **Health Monitoring**: Adjust ping intervals based on workload patterns
- **Resource Usage**: Monitor CPU/memory usage of worker processes
- **Queue Depth**: Monitor Redis queue lengths for scaling decisions

## Security

- **Process Isolation**: Zinit provides process isolation
- **User Permissions**: Run workers with appropriate user permissions
- **Network Security**: Secure Redis and Zinit socket access
- **Binary Validation**: Verify worker binary integrity before deployment
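
The advice under Performance Considerations to scale on queue depth can be made concrete with a small policy function. The following is a minimal sketch, not part of `hero_supervisor`: `target_worker_count` and its parameters are hypothetical names, and the choice of a fixed backlog per worker is an assumption made for illustration. How the queue depth is read from Redis is left out because the queue key layout is not specified here.

```rust
/// Hypothetical helper (not a hero_supervisor API): derives a worker count
/// from the number of pending jobs for one script type.
///
/// - `queue_depth`: pending jobs observed in the queue
/// - `jobs_per_worker`: assumed backlog a single worker should absorb
/// - `min_workers` / `max_workers`: bounds on the returned count
pub fn target_worker_count(
    queue_depth: usize,
    jobs_per_worker: usize,
    min_workers: usize,
    max_workers: usize,
) -> usize {
    // Guard against a zero divisor by treating it as one job per worker.
    let per_worker = jobs_per_worker.max(1);

    // Ceiling division without overflow: 11 jobs at 10 per worker needs 2.
    let needed = queue_depth / per_worker + usize::from(queue_depth % per_worker != 0);

    // Keep the result within bounds; if the bounds are inverted,
    // the lower bound wins so `clamp` never panics.
    needed.clamp(min_workers, max_workers.max(min_workers))
}
```

The returned value could then be passed as the instance count to `lifecycle_manager.scale_workers(&ScriptType::OSIS, target).await?`, for example from a periodic task that samples the queue length. Keeping `min_workers` at 1 or more matches the advice to start with minimal workers rather than none.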