Migrate hero_services to zinit 0.4.0 job model (restart + health checks) #25

Closed
opened 2026-03-13 12:40:22 +00:00 by mik-tf · 6 comments
Owner

Migrate hero_services to zinit 0.4.0 job model

Context

Follow-up from #24 (watchdog hotfix). Zinit 0.4.0 is already installed in the container and supports a job-based model with restart policies and periodic health checks. But hero_services_server still generates legacy TOML configs that don't use these features. Currently relying on a watchdog loop in entrypoint.sh as a band-aid (#24).

Problem

No restart-on-failure

write_service_config_with_deps() in install.rs writes old-format TOMLs:

```toml
[service]
name = "user.hero_embedder_server"
exec = "/root/hero/bin/hero_embedder_server serve"
oneshot = false
```

When a service crashes, zinit marks it inactive and nothing restarts it.

Dummy health checks

write_health_config_with_deps() in install.rs writes no-op health checks:

  • Services with a build section: runs make health-check (the target usually doesn't exist → "healthy by default")
  • Services with ports: curl probe on the HTTP port (works for UI services only)
  • Socket-only servers (embedder, books, auth, etc.): echo "No health check configured" → always passes

All 29 services on herodev2/herodemo2 have inactive health checks.

Hung process detection missing

Even with the watchdog hotfix (#24), a process that is alive but unresponsive (e.g., stuck on an external API call) will not be detected or restarted.

Investigation findings

Zinit reload behavior (confirmed 2026-03-16)

service.reload() does NOT delete API-created services. Analysis of zinit server source:

  • Reload scans TOML files in config dir, upserts into SQLite via db.services.set()
  • Response hardcodes "removed": [] — never deletes anything
  • Both API-created and file-based services live in same SQLite table (services) with no origin column
  • Test suite comment confirms: "API-added service might be removed or kept depending on implementation" — current impl keeps them

Decision: Use SDK API exclusively, do NOT call service.reload() after migration.

This means hero_services_server creates all services via service.set() + action.set() RPC calls. No TOML files generated. Clean and predictable.
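
For debugging, the same RPC surface can also be poked from the shell. A minimal sketch that only builds the JSON-RPC envelope; the control-socket path and the exact `params` shape are assumptions, not confirmed zinit defaults:

```bash
#!/usr/bin/env bash
# Build a JSON-RPC 2.0 envelope for a zinit method call.
# Assumption: zinit 0.4.0 listens on a Unix control socket; the path
# below is a placeholder, not a confirmed default.
ZINIT_SOCK="${ZINIT_SOCK:-/var/run/zinit.sock}"

rpc_envelope() {
  local method="$1" params="$2" id="${3:-1}"
  printf '{"jsonrpc":"2.0","method":"%s","params":%s,"id":%s}' \
    "$method" "$params" "$id"
}

# service.status exists per the SDK table below; the params shape here
# is an assumption for illustration.
req=$(rpc_envelope "service.status" '{"name":"user.hero_redis_server"}')
echo "$req"
# To actually send it (requires a running zinit):
#   echo "$req" | socat - UNIX-CONNECT:"$ZINIT_SOCK"
```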

Key files analyzed:

  • zinit_server/src/rpc/service.rs lines 421-497 (reload impl)
  • zinit_lib/src/db/service/model.rs (persistence layer)
  • zinit_sdk/src/builders.rs (ServiceBuilder, ActionBuilder, RetryPolicyBuilder)

Zinit SDK API available

The SDK already provides everything needed:

| API | Purpose |
|-----|---------|
| service.set(ServiceConfig) | Create/update service definition |
| action.set(ActionSpec) | Create/update action (main exec, health check) |
| service.start(name) | Start service |
| service.status(name) | Check current state |
| ServiceBuilder | Fluent API for service config |
| ActionBuilder | Fluent API for action specs (exec, env, timeout, stop signal) |
| RetryPolicyBuilder | Retry policy (max_attempts, delay_ms, backoff, stability_period) |

hero_services_server already imports and uses zinit_sdk for service_status, service_restart, service_reload in zinit.rs.

Target state

Auto-restart (every non-oneshot service)

Using zinit SDK ActionBuilder with retry policy:

```rust
ActionBuilder::new("main", &exec_cmd)
    .retry_builder()
        .max_attempts(20)
        .delay_ms(5000)
        .backoff("linear")
        .stability_period_ms(60000)
        .build()
    .build()
```

Real health checks (every socket-based service)

Periodic JSON-RPC probe on Unix socket:

```bash
echo '{"jsonrpc":"2.0","method":"server.health","id":1}' \
  | socat - UNIX-CONNECT:/root/hero/var/sockets/hero_embedder_server.sock
```

With grace period (30-60s after start) to allow initialization.
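
A concrete probe along these lines could be wired in as the health action. A sketch, assuming the servers answer `server.health` with a standard JSON-RPC reply (a `result` member on success, an `error` member on failure):

```bash
#!/usr/bin/env bash
# Health probe for a socket-based hero service: send server.health over
# its Unix socket and exit 0 only on a JSON-RPC success response.
# The socket path pattern comes from the example above; the reply-shape
# check ("result" present, no "error") is an assumption.
SOCK_DIR="/root/hero/var/sockets"

is_healthy_response() {
  # Success iff the reply carries a "result" member and no "error".
  case "$1" in
    *'"error"'*)  return 1 ;;
    *'"result"'*) return 0 ;;
    *)            return 1 ;;
  esac
}

probe() {
  local svc="$1" reply
  reply=$(echo '{"jsonrpc":"2.0","method":"server.health","id":1}' \
    | socat -t 5 - "UNIX-CONNECT:${SOCK_DIR}/${svc}.sock") || return 1
  is_healthy_response "$reply"
}

# zinit would run e.g. `probe hero_embedder_server` as the health
# action; a non-zero exit marks the service unhealthy.
```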

Implementation plan

| Step | What | Files | Risk |
|------|------|-------|------|
| 1 | Prototype on hero_redis — replace TOML gen with SDK calls | install.rs | Zero — single service, can revert |
| 2 | Verify restart: kill hero_redis, confirm auto-restart within 5s | herodev2 | Zero — observation only |
| 3 | Extend to all services | install.rs | Low — same pattern repeated |
| 4 | Add real health checks (socat on Unix socket, 60s interval) | install.rs | Low — additive |
| 5 | Remove watchdog from entrypoint.sh | entrypoint.sh | Only after confirming restarts work |
| 6 | Build :hero image, deploy to hero.gent04.grid.tf (#26) | Build pipeline | Isolated env |
| 7 | Test: kill services, verify restart + health recovery | hero.gent04.grid.tf | Isolated env |
| 8 | Promote to herodev2/herodemo2 once stable | Deploy pipeline | After validation |
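
Step 2 can be scripted rather than eyeballed. A sketch, assuming a two-column `zinit list` output (name, state) — that format is a guess, so the parsing is isolated in one helper:

```bash
#!/usr/bin/env bash
# Kill hero_redis, then poll until zinit reports it Running again.
# `zinit kill` / `zinit list` are the commands used elsewhere in this
# issue; the assumed list format is "name state" per line.
SVC="user.hero_redis_server"

service_state() {
  # $1 = `zinit list` output, $2 = service name -> prints its state
  printf '%s\n' "$1" | awk -v s="$2" '$1 == s { print $2 }'
}

if command -v zinit >/dev/null 2>&1; then
  zinit kill "$SVC" SIGTERM
  for i in $(seq 1 12); do              # poll for up to 60s
    sleep 5
    if [ "$(service_state "$(zinit list)" "$SVC")" = "Running" ]; then
      echo "restarted after $((i * 5))s"
      exit 0
    fi
  done
  echo "no restart within 60s" >&2
  exit 1
fi
```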

Files to modify

  • crates/hero_services_server/src/install.rs — replace write_service_config_with_deps() and write_health_config_with_deps() with SDK-based equivalents
  • crates/hero_services_server/src/zinit.rs — extend to use ServiceBuilder + ActionBuilder
  • docker/entrypoint.sh — remove watchdog loop (step 5, only after validation)

Current state of install.rs

Key functions (766 lines total):

  • write_service_config_with_deps() (L121-187) — serializes [service] + [dependencies] TOML
  • write_install_config_with_deps() (L202-267) — generates shell install oneshots
  • write_health_config_with_deps() (L449-500) — health check probes (mostly no-ops)
  • write_test_config_with_deps() (L507-557) — integration test runners
  • build_build_exec() (L271-361) — clone + make install scripts
  • build_download_exec() (L365-425) — curl + chmod download scripts
  • do_run() (L634-702) — polls service_status, calls service_restart
Related

  • #24 — Watchdog hotfix (closed, deployed) — temporary band-aid this replaces
  • #23 — Hero OS Enhancements (parent tracking issue)
  • #26 — New deployment: hero.gent04.grid.tf with :hero tag
  • hero_services/crates/hero_services_server/src/install.rs
  • hero_services/crates/hero_services_server/src/zinit.rs
  • hero_services/docker/entrypoint.sh
Author
Owner

Deployment: hero.gent04.grid.tf (:hero tag)

Part of this issue — validate the zinit 0.4.0 migration on a fresh environment before promoting to herodev2/herodemo2.

Provisioning

Create deploy/single-vm/envs/hero/:

```
envs/hero/
├── app.env                      # HERO_IMAGE=hero_zero:hero, CONTAINER_NAME=hero
└── tf/
    └── credentials.auto.tfvars  # node_id=50, cpu=8, memory=16384, disk_size=100
```

Then:

```bash
make all ENV=hero   # init → deploy (TFGrid VM + gateway) → setup (Docker) → test
```

Result: hero.gent04.grid.tf with :hero tagged image containing zinit 0.4.0 SDK migration.

Promotion path

```
hero.gent04.grid.tf (:hero)      ← validate zinit restart + health checks
         ↓ works?
herodev2.gent04.grid.tf (:dev)   ← merge to development, rebuild :dev
         ↓ works?
herodemo2.gent04.grid.tf (:demo) ← promote :dev → :demo
```

What to validate

  • All services start and reach Running state
  • Kill a service (zinit kill user.hero_osis_server SIGTERM) → auto-restarts within 5-10s
  • Health checks run every 60s on socket-based services (check zinit logs)
  • Hung service (e.g., block with kill -STOP) → health check fails → restart triggered
  • Smoke tests pass: make smoke ENV=hero
  • No watchdog needed — remove from entrypoint.sh after validation
Author
Owner

Revised: single deployment model

Drop the three-tier promotion. hero.gent04.grid.tf with :hero tag becomes the single deployment — dev, demo, and production.

Rationale

  • Same image artifact everywhere — no code difference between tiers
  • Zinit 0.4.0 migration makes services self-healing — no need for separate "test then promote"
  • One VM instead of two/three → less TFT cost
  • Simpler workflow: push → build :hero → deploy → done

Migration plan

  1. Provision hero.gent04.grid.tf on node 50 with :hero image
  2. Validate: all services running, restarts work, health checks pass, smoke tests green
  3. Once stable: retire herodev2 + herodemo2 (delete VMs, reclaim TFT)
  4. Update all docs/bookmarks to point to hero.gent04.grid.tf

When to spin up a second env

  • Only for destructive testing or major migrations
  • Temporary VM, tear down after validation
  • Not a permanent tier
Author
Owner

Local Docker testing workflow

To speed up the dev loop, test the zinit SDK migration locally before pushing to TFGrid:

```bash
# 1. Make code changes in install.rs / zinit.rs

# 2. Build all binaries + WASM into dist/
make dist

# 3. Pack into Docker image with :hero tag
make pack TAG=hero

# 4. Run locally
docker run -d --name hero-local \
  -p 8805:6666 \
  -v ./data:/data \
  -e GROQ_API_KEY=$GROQ_API_KEY \
  -e OPENROUTER_API_KEY=$OPENROUTER_API_KEY \
  -e FORGEJO_TOKEN=$FORGEJO_TOKEN \
  forge.ourworld.tf/lhumina_code/hero_zero:hero

# 5. Test restart behavior
docker exec hero-local zinit list          # all services running?
docker exec hero-local zinit kill user.hero_redis_server SIGTERM
sleep 10
docker exec hero-local zinit list          # auto-restarted?

# 6. Test health checks
docker exec hero-local zinit logs user.hero_redis_server.health

# 7. Run smoke tests against localhost
BASE_URL=http://localhost:8805 make smoke

# 8. Clean up
docker stop hero-local && docker rm hero-local
```

Advantages:

  • No IPv6/Mycelium dependency
  • Instant restart (no SCP over slow link)
  • Same image artifact that ships to TFGrid
  • Can iterate on install.rs changes without touching remote VMs

Once local tests pass → make push TAG=hero → make all ENV=hero on TFGrid.

Author
Owner

Priority & Dependencies

This issue is the next priority — it blocks the final merge of the dioxus-bootstrap migration (#28).

Why it blocks #28

Fresh containers built from the dioxus-bootstrap image can't serve because service TOML configs aren't generated properly. Without this fix, herodevbootstrap shows Bad Gateway after a clean deploy.

Execution plan

  1. Work on development_mik_6_1 branch (same as #23)
  2. Once done, merge development_mik_6_1 → development (brings both #23 and #25)
  3. Then #28 can merge development into its bootstrap branches and complete

Dependency chain

```
#25 (this) → merge mik_6_1 → development → #28 completes
```
Author
Owner

Issue #25 — COMPLETE

Zinit 0.4.0 SDK migration implemented and deployed on hero.gent04.grid.tf.

What was done

Code changes (commit 8ec5402 on development_mik_6_1):

  • install.rs: New SDK functions — register_service_sdk(), register_install_sdk(), register_health_sdk(), register_test_sdk() using ServiceBuilder/ActionBuilder/RetryPolicyBuilder
  • profile.rs: execute_profile() and activate_profile_additive() use SDK registration
  • service_data.rs: service_hard_restart() and reload_config() use SDK
  • zinit.rs: stop_and_clean() uses service_delete(), HERO_DOCKER=1 skips binary deletion
  • entrypoint.sh: Watchdog loop removed, manual zinit start loop removed
  • New deployment env: deploy/single-vm/envs/hero/

Retry policy: 20 attempts, 5s delay, exponential backoff, 300s max delay, 60s stability period
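
That policy implies a concrete delay schedule. A sketch, assuming "exponential" means the delay doubles each attempt (the multiplier is not stated in this issue) up to the 300s cap:

```bash
#!/usr/bin/env bash
# Delay schedule for: base 5000ms, exponential backoff, 300000ms cap.
# Assumption: each attempt doubles the previous delay; zinit's actual
# multiplier may differ.
delay_ms() {
  local attempt="$1" d=5000 i=1
  while [ "$i" -lt "$attempt" ]; do
    d=$((d * 2))
    [ "$d" -gt 300000 ] && d=300000
    i=$((i + 1))
  done
  echo "$d"
}

for a in 1 2 3 7 20; do
  echo "attempt $a: $(delay_ms "$a") ms"
done
```

Under this assumption the delay ramps 5s → 10s → 20s → … and pins at the 300s cap from roughly attempt 7 onward, which is what makes 20 attempts plus a 60s stability period a reasonable budget.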

Health checks: socat JSON-RPC server.health probe on Unix sockets for _server services, curl HTTP probe for port-based services. 90s timeout.

Action naming: Globally unique action names in zinit 0.4.0 — uses {service}.run pattern (e.g., user.hero_redis_server.run)

Bugs found and fixed

  1. Self-referencing dependencies filtered out in register_service_sdk()
  2. Action name collision (zinit 0.4.0 has globally unique actions) — changed from main to {service}.run
  3. Docker binary deletion race — HERO_DOCKER=1 env var skips deletion of pre-baked binaries

Validation

  • 28 service processes running, 56 actions registered
  • Auto-restart: killed hero_redis_server → restarted within 5s (confirmed locally and on live deployment)
  • HTTP: https://hero.gent04.grid.tf/hero_os/ returns 200
  • No watchdog needed — zinit retry policy handles restarts natively

Deployment

  • Image: forge.ourworld.tf/lhumina_code/hero_zero:hero
  • VM: hero.gent04.grid.tf (TFGrid node 50, gent04)
  • Replaces herodev2 + herodemo2 (can be retired)

Branch: development_mik_6_1 across hero_services repo. Ready to merge to development after #23 remaining items are resolved.

Author
Owner

done in development_mik_6_1
