baobab/core/supervisor/LIFECYCLE.md
2025-08-05 15:44:33 +02:00


# Actor Lifecycle Management
The Hero Supervisor manages actor lifecycles using [Zinit](https://github.com/threefoldtech/zinit) as the process manager. Through Zinit, the supervisor starts, stops, and restarts actor processes, monitors their health, and scales the number of running instances.
## Overview
The lifecycle management system provides:
- **Actor Process Management**: Start, stop, restart, and monitor actor binaries
- **Health Monitoring**: Automatic ping jobs for actors that have been idle for 10 minutes or more
- **Graceful Shutdown**: Clean termination of actor processes
## Architecture
```
┌─────────────────┐ ┌──────────────────┐ ┌─────────────────┐
│ Supervisor │ │ ActorLifecycle │ │ Zinit │
│ │◄──►│ Manager │◄──►│ (Process │
│ (Job Dispatch) │ │ │ │ Manager) │
└─────────────────┘ └──────────────────┘ └─────────────────┘
│ │ │
│ │ │
▼ ▼ ▼
┌─────────────────┐ ┌──────────────────┐ ┌─────────────────┐
│ Redis │ │ Health Monitor │ │ Actor Binaries │
│ (Job Queue) │ │ (Ping Jobs) │ │ (OSIS/SAL/V) │
└─────────────────┘ └──────────────────┘ └─────────────────┘
```
## Components
### ActorConfig
Defines the configuration for an actor binary:
```rust
use hero_supervisor::{ActorConfig, ScriptType};
use std::path::PathBuf;
use std::collections::HashMap;
let config = ActorConfig::new(
"osis_actor_0".to_string(),
PathBuf::from("/usr/local/bin/osis_actor"),
ScriptType::OSIS,
)
.with_args(vec![
"--redis-url".to_string(),
"redis://localhost:6379".to_string(),
"--actor-id".to_string(),
"osis_actor_0".to_string(),
])
.with_env({
let mut env = HashMap::new();
env.insert("RUST_LOG".to_string(), "info".to_string());
env.insert("ACTOR_TYPE".to_string(), "osis".to_string());
env
})
.with_health_check("/usr/local/bin/osis_actor --health-check".to_string())
.with_dependencies(vec!["redis".to_string()]);
```
### ActorLifecycleManager
Main component for managing actor lifecycles:
```rust
use hero_supervisor::{ActorLifecycleManagerBuilder, SupervisorBuilder};
let supervisor = SupervisorBuilder::new()
.redis_url("redis://localhost:6379")
.caller_id("my_supervisor")
.context_id("production")
.build()?;
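// osis_actor_config, sal_actor_config, and v_actor_config are ActorConfig
// values built as shown in the ActorConfig section.
let mut lifecycle_manager = ActorLifecycleManagerBuilder::new("/var/run/zinit.sock".to_string())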
.with_supervisor(supervisor.clone())
.add_actor(osis_actor_config)
.add_actor(sal_actor_config)
.add_actor(v_actor_config)
.build();
```
## Supported Script Types
The lifecycle manager supports all Hero script types:
- **OSIS**: Rhai/HeroScript execution actors
- **SAL**: System Abstraction Layer actors
- **V**: HeroScript execution in V language
- **Python**: HeroScript execution in Python
## Key Features
### 1. Actor Management
```rust
// Start all configured actors
lifecycle_manager.start_all_actors().await?;
// Stop all actors
lifecycle_manager.stop_all_actors().await?;
// Restart specific actor
lifecycle_manager.restart_actor("osis_actor_0").await?;
// Get actor status
let status = lifecycle_manager.get_actor_status("osis_actor_0").await?;
println!("Actor state: {:?}, PID: {}", status.state, status.pid);
```
### 2. Health Monitoring
The system automatically monitors actor health:
- Tracks last job execution time for each actor
- Sends ping jobs to actors idle for 10+ minutes
- Restarts actors that fail ping checks 3 times
- Updates job times when actors receive tasks
```rust
// Manual health check
lifecycle_manager.monitor_actor_health().await?;
// Update job time (called automatically by supervisor)
lifecycle_manager.update_actor_job_time("osis_actor_0");
// Start continuous health monitoring
lifecycle_manager.start_health_monitoring().await; // Runs forever
```
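The monitoring rules listed above can be read as a small decision function. The following is a minimal sketch, not the crate's implementation: `ActorHealth`, `HealthAction`, and both constants are hypothetical names, and it assumes that failed pings are counted consecutively and that a successful ping counts as activity.

```rust
use std::time::{Duration, Instant};

/// Idle threshold and failure limit taken from the rules above.
const IDLE_THRESHOLD: Duration = Duration::from_secs(10 * 60);
const MAX_FAILED_PINGS: u32 = 3;

/// Hypothetical per-actor bookkeeping (not a type from hero_supervisor).
#[derive(Debug)]
struct ActorHealth {
    last_job_time: Instant,
    failed_pings: u32,
}

/// What a monitoring pass should do for one actor.
#[derive(Debug, PartialEq)]
enum HealthAction {
    /// The actor handled a job recently; nothing to do.
    Healthy,
    /// The actor has been idle long enough to warrant a ping job.
    SendPing,
    /// The actor failed too many pings and should be restarted.
    Restart,
}

impl ActorHealth {
    fn new(now: Instant) -> Self {
        Self { last_job_time: now, failed_pings: 0 }
    }

    /// Decide the next action for this actor at time `now`.
    fn next_action(&self, now: Instant) -> HealthAction {
        if self.failed_pings >= MAX_FAILED_PINGS {
            HealthAction::Restart
        } else if now.duration_since(self.last_job_time) >= IDLE_THRESHOLD {
            HealthAction::SendPing
        } else {
            HealthAction::Healthy
        }
    }

    /// Record a ping result. Assumption: success resets the idle timer
    /// and the failure counter; failure only increments the counter.
    fn record_ping(&mut self, ok: bool, now: Instant) {
        if ok {
            self.last_job_time = now;
            self.failed_pings = 0;
        } else {
            self.failed_pings += 1;
        }
    }
}
```

On each pass, a monitor built this way would call `next_action` for every actor, send a ping job on `SendPing`, and call something like `restart_actor` on `Restart`.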
### 3. Dynamic Scaling
Scale actors up or down based on demand:
```rust
// Scale OSIS actors to 5 instances
lifecycle_manager.scale_actors(&ScriptType::OSIS, 5).await?;
// Scale down SAL actors to 1 instance
lifecycle_manager.scale_actors(&ScriptType::SAL, 1).await?;
// Check current running count
let count = lifecycle_manager.get_running_actor_count(&ScriptType::V).await;
println!("Running V actors: {}", count);
```
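To illustrate what scaling to a target count involves, here is a hedged sketch of the reconciliation step. `ScalePlan` and `plan_scaling` are hypothetical, and the sketch assumes instances are named `<prefix>_<index>` with contiguous indices starting at 0, which mirrors `osis_actor_0` but is not confirmed by the crate.

```rust
use std::cmp::Ordering;

/// Which instances to start or stop to reach a target count.
#[derive(Debug, PartialEq)]
enum ScalePlan {
    Start(Vec<String>),
    Stop(Vec<String>),
    Unchanged,
}

/// Compute the plan for moving from `running` instances to `target`.
fn plan_scaling(prefix: &str, running: usize, target: usize) -> ScalePlan {
    match target.cmp(&running) {
        // Scale up: start the missing indices in ascending order.
        Ordering::Greater => ScalePlan::Start(
            (running..target)
                .map(|i| format!("{}_{}", prefix, i))
                .collect(),
        ),
        // Scale down: stop the highest-numbered instances first.
        Ordering::Less => ScalePlan::Stop(
            (target..running)
                .rev()
                .map(|i| format!("{}_{}", prefix, i))
                .collect(),
        ),
        Ordering::Equal => ScalePlan::Unchanged,
    }
}
```

Stopping the highest indices first keeps the remaining names contiguous, so a later scale-up can reuse the same naming scheme.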
### 4. Service Dependencies
Actors can depend on other services:
```rust
let config = ActorConfig::new(name, binary, script_type)
.with_dependencies(vec![
"redis".to_string(),
"database".to_string(),
"auth_service".to_string(),
]);
```
Zinit ensures dependencies start before the actor.
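Zinit handles this ordering itself, so the lifecycle manager does not need to compute it. As an illustration of the constraint only, the sketch below derives a valid start order with Kahn's topological sort. `start_order` is a hypothetical helper, not part of Zinit or `hero_supervisor`.

```rust
use std::collections::{BTreeMap, BTreeSet, VecDeque};

/// Return an order in which every service appears after its dependencies,
/// or `None` if the dependency graph contains a cycle.
fn start_order(deps: &BTreeMap<&str, Vec<&str>>) -> Option<Vec<String>> {
    // Collect every service, including ones that only appear as a dependency.
    let mut all: BTreeSet<&str> = deps.keys().copied().collect();
    for list in deps.values() {
        all.extend(list.iter().copied());
    }

    // Count unstarted dependencies per service and build reverse edges.
    let mut pending: BTreeMap<&str, usize> = all.iter().map(|s| (*s, 0)).collect();
    let mut dependents: BTreeMap<&str, Vec<&str>> = BTreeMap::new();
    for (service, list) in deps {
        for dep in list {
            *pending.get_mut(service).unwrap() += 1;
            dependents.entry(*dep).or_default().push(*service);
        }
    }

    // Services with no dependencies can start immediately.
    let mut ready: VecDeque<&str> = pending
        .iter()
        .filter(|(_, n)| **n == 0)
        .map(|(s, _)| *s)
        .collect();

    let mut order = Vec::with_capacity(all.len());
    while let Some(service) = ready.pop_front() {
        order.push(service.to_string());
        // Starting `service` unblocks everything that was waiting on it.
        for dependent in dependents.get(service).into_iter().flatten() {
            let n = pending.get_mut(dependent).unwrap();
            *n -= 1;
            if *n == 0 {
                ready.push_back(*dependent);
            }
        }
    }

    // Any service left unstarted is part of a cycle.
    if order.len() == all.len() {
        Some(order)
    } else {
        None
    }
}
```

A `None` result corresponds to a circular dependency, which is worth checking for before handing a set of `ActorConfig` values to the process manager.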
## Integration with Supervisor
The lifecycle manager is built around a supervisor instance. Once the actors are running, the supervisor routes jobs to them:
```rust
use hero_supervisor::{ActorLifecycleManagerBuilder, ScriptType, SupervisorBuilder};
// Create supervisor and lifecycle manager
let supervisor = SupervisorBuilder::new().build()?;
let mut lifecycle_manager = ActorLifecycleManagerBuilder::new(zinit_socket)
.with_supervisor(supervisor.clone())
.build();
// Start actors
lifecycle_manager.start_all_actors().await?;
// Create and execute jobs (supervisor automatically routes to actors)
let job = supervisor
.new_job()
.script_type(ScriptType::OSIS)
.script_content("println!(\"Hello World!\");".to_string())
.build()?;
let result = supervisor.run_job_and_await_result(&job).await?;
println!("Job result: {}", result);
```
## Zinit Service Configuration
The lifecycle manager automatically creates Zinit service configurations:
```yaml
# Generated service config for osis_actor_0
exec: "/usr/local/bin/osis_actor --redis-url redis://localhost:6379 --actor-id osis_actor_0"
test: "/usr/local/bin/osis_actor --health-check"
oneshot: false # Restart on exit
after:
  - redis
env:
  RUST_LOG: "info"
  ACTOR_TYPE: "osis"
```
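The mapping from an actor configuration to the service file above could look roughly like the following sketch. `ServiceSpec` and `render_service` are hypothetical stand-ins and do not reflect how `hero_supervisor` actually generates the file. Values are written verbatim, so a real implementation would need proper YAML escaping.

```rust
use std::collections::BTreeMap;

/// Hypothetical stand-in for the `ActorConfig` fields used in the service file.
struct ServiceSpec {
    binary: String,
    args: Vec<String>,
    health_check: Option<String>,
    dependencies: Vec<String>,
    // BTreeMap keeps the env entries in a stable, sorted order.
    env: BTreeMap<String, String>,
}

/// Render a Zinit-style service definition as YAML text.
fn render_service(spec: &ServiceSpec) -> String {
    // exec is the binary followed by its arguments, space separated.
    let mut exec = spec.binary.clone();
    for arg in &spec.args {
        exec.push(' ');
        exec.push_str(arg);
    }

    let mut out = format!("exec: \"{}\"\n", exec);
    if let Some(test) = &spec.health_check {
        out.push_str(&format!("test: \"{}\"\n", test));
    }
    // Long-running service: restart on exit.
    out.push_str("oneshot: false\n");
    if !spec.dependencies.is_empty() {
        out.push_str("after:\n");
        for dep in &spec.dependencies {
            out.push_str(&format!("  - {}\n", dep));
        }
    }
    if !spec.env.is_empty() {
        out.push_str("env:\n");
        for (key, value) in &spec.env {
            out.push_str(&format!("  {}: \"{}\"\n", key, value));
        }
    }
    out
}
```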
## Error Handling
Lifecycle operations return a `SupervisorError` on failure. Match on specific variants to handle the cases you care about:
```rust
use hero_supervisor::SupervisorError;
match lifecycle_manager.start_actor(&config).await {
Ok(_) => println!("Actor started successfully"),
Err(SupervisorError::ActorStartFailed(actor, reason)) => {
eprintln!("Failed to start {}: {}", actor, reason);
}
Err(e) => eprintln!("Other error: {}", e),
}
```
## Example Usage
See `examples/lifecycle_demo.rs` for a comprehensive demonstration:
```bash
# Run the lifecycle demo
cargo run --example lifecycle_demo
# Run with custom Redis URL
REDIS_URL=redis://localhost:6379 cargo run --example lifecycle_demo
```
## Prerequisites
1. **Zinit**: Install and run Zinit process manager
   ```bash
   curl https://raw.githubusercontent.com/threefoldtech/zinit/refs/heads/master/install.sh | bash
   zinit init --config /etc/zinit/ --socket /var/run/zinit.sock
   ```
2. **Redis**: Running Redis instance for job queues
   ```bash
   redis-server
   ```
3. **Actor Binaries**: Compiled actor binaries for each script type
   - `/usr/local/bin/osis_actor`
   - `/usr/local/bin/sal_actor`
   - `/usr/local/bin/v_actor`
   - `/usr/local/bin/python_actor`
## Configuration Best Practices
1. **Resource Limits**: Configure appropriate resource limits in Zinit
2. **Health Checks**: Implement meaningful health check commands
3. **Dependencies**: Define proper service dependencies
4. **Environment**: Set appropriate environment variables
5. **Logging**: Configure structured logging for debugging
6. **Monitoring**: Use health monitoring for production deployments
## Troubleshooting
### Common Issues
1. **Zinit Connection Failed**
   - Ensure Zinit is running: `ps aux | grep zinit`
   - Check socket permissions: `ls -la /var/run/zinit.sock`
   - Verify socket path in configuration
2. **Actor Start Failed**
   - Check binary exists and is executable
   - Verify dependencies are running
   - Review Zinit logs: `zinit logs <service-name>`
3. **Health Check Failures**
   - Implement proper health check endpoint in actors
   - Verify health check command syntax
   - Check actor responsiveness
4. **Redis Connection Issues**
   - Ensure Redis is running and accessible
   - Verify Redis URL configuration
   - Check network connectivity
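The "binary exists and is executable" check can also be done programmatically before registering an actor. This is a minimal Unix-only sketch using the standard library. `is_executable_file` is a hypothetical helper, and it only inspects the permission bits rather than checking whether the current user may run the file.

```rust
use std::fs;
use std::os::unix::fs::PermissionsExt;
use std::path::Path;

/// Return true if `path` is a regular file with at least one execute bit set.
fn is_executable_file(path: &Path) -> bool {
    match fs::metadata(path) {
        // 0o111 masks the owner, group, and other execute bits.
        Ok(meta) => meta.is_file() && (meta.permissions().mode() & 0o111) != 0,
        // Missing file or unreadable metadata counts as "not executable".
        Err(_) => false,
    }
}
```

Calling `is_executable_file(Path::new("/usr/local/bin/osis_actor"))` before starting an actor gives a clearer error than a failed start reported by the process manager.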
### Debug Commands
```bash
# Check Zinit status
zinit list
# View service logs
zinit logs osis_actor_0
# Check service status
zinit status osis_actor_0
# List job keys in Redis (SCAN-based, so it does not block the server the way KEYS can)
redis-cli --scan --pattern "hero:job:*"
```
## Performance Considerations
- **Scaling**: Start with minimal actors and scale based on queue depth
- **Health Monitoring**: Adjust ping intervals based on workload patterns
- **Resource Usage**: Monitor CPU/memory usage of actor processes
- **Queue Depth**: Monitor Redis queue lengths for scaling decisions
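One simple way to turn queue depth into a scaling decision is a clamped ceiling division. This is a sketch under stated assumptions: `desired_actor_count` and its `jobs_per_actor`, `min`, and `max` parameters are hypothetical tuning knobs, not part of the supervisor API.

```rust
/// Suggest an instance count from the number of queued jobs.
fn desired_actor_count(
    queue_depth: usize,
    jobs_per_actor: usize,
    min: usize,
    max: usize,
) -> usize {
    assert!(jobs_per_actor > 0, "jobs_per_actor must be positive");
    assert!(min <= max, "min must not exceed max");
    // Ceiling division: one actor per `jobs_per_actor` queued jobs.
    let needed = (queue_depth + jobs_per_actor - 1) / jobs_per_actor;
    // Never drop below `min` or grow beyond `max`.
    needed.clamp(min, max)
}
```

The result could then be passed as the target count to `scale_actors` for the relevant script type.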
## Security
- **Process Isolation**: Zinit runs each actor as its own supervised OS process
- **User Permissions**: Run actors with appropriate user permissions
- **Network Security**: Secure Redis and Zinit socket access
- **Binary Validation**: Verify actor binary integrity before deployment
## Future
- **Load Balancing**: Dynamic scaling of actors based on demand
- **Service Dependencies**: Proper startup ordering with dependency management