Migrate hero_services to zinit 0.4.0 job model (restart + health checks) #25
Migrate hero_services to zinit 0.4.0 job model
Context
Follow-up from #24 (watchdog hotfix). Zinit 0.4.0 is already installed in the container and supports a job-based model with restart policies and periodic health checks, but hero_services_server still generates legacy TOML configs that don't use these features. The deployment currently relies on a watchdog loop in entrypoint.sh as a band-aid (#24).
Problem
No restart-on-failure
write_service_config_with_deps() in install.rs writes old-format TOMLs. When a service crashes, zinit marks it inactive and nothing restarts it.
Dummy health checks
write_health_config_with_deps() in install.rs writes no-op health checks:
- build section: runs make health-check (target usually doesn't exist → "healthy by default")
- ports: curl probe on HTTP port (works for UI services only)
- echo "No health check configured" → always passes
All 29 services on herodev2/herodemo2 have inactive health checks.
Hung process detection missing
Even with the watchdog hotfix (#24), a process that is alive but unresponsive (e.g., stuck on an external API call) will not be detected or restarted.
Investigation findings
Zinit reload behavior (confirmed 2026-03-16)
service.reload() does NOT delete API-created services. Analysis of zinit server source:
- db.services.set()
- "removed": [] (never deletes anything)
- services) with no origin column
Decision: Use the SDK API exclusively; do NOT call service.reload() after migration.
This means hero_services_server creates all services via service.set() + action.set() RPC calls. No TOML files are generated. Clean and predictable.
Key files analyzed:
- zinit_server/src/rpc/service.rs lines 421-497 (reload impl)
- zinit_lib/src/db/service/model.rs (persistence layer)
- zinit_sdk/src/builders.rs (ServiceBuilder, ActionBuilder, RetryPolicyBuilder)
Zinit SDK API available
The SDK already provides everything needed:
- service.set(ServiceConfig)
- action.set(ActionSpec)
- service.start(name)
- service.status(name)
- ServiceBuilder
- ActionBuilder
- RetryPolicyBuilder
hero_services_server already imports and uses zinit_sdk for service_status, service_restart, service_reload in zinit.rs.
Target state
Auto-restart (every non-oneshot service)
Using the zinit SDK ActionBuilder with a retry policy.
Real health checks (every socket-based service)
Periodic JSON-RPC probe on the Unix socket, with a grace period (30-60s after start) to allow initialization.
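The periodic probe and grace period described above could be sketched roughly as follows. This is a minimal, dependency-free illustration and not the zinit SDK's own health-check mechanism: the function names `probe_health` and `should_restart` are hypothetical, the `server.health` method name is taken from later in this thread, and the assumption that the service replies with one newline-terminated JSON object is mine.

```rust
use std::io::{BufRead, BufReader, Write};
use std::os::unix::net::UnixStream;
use std::path::Path;
use std::time::Duration;

/// Sends one JSON-RPC `server.health` request over a Unix socket and reports
/// whether a non-error reply arrived before the timeout.
/// Assumption: the service answers with one newline-terminated JSON object.
fn probe_health(socket: &Path, timeout: Duration) -> bool {
    let mut stream = match UnixStream::connect(socket) {
        Ok(s) => s,
        Err(_) => return false, // socket missing or nothing listening
    };
    // A hung process may accept the connection but never answer; the read
    // timeout turns that case into a failed probe instead of blocking forever.
    if stream.set_read_timeout(Some(timeout)).is_err()
        || stream.set_write_timeout(Some(timeout)).is_err()
    {
        return false;
    }
    let request = "{\"jsonrpc\":\"2.0\",\"method\":\"server.health\",\"id\":1}\n";
    if stream.write_all(request.as_bytes()).is_err() {
        return false;
    }
    let mut reply = String::new();
    match BufReader::new(stream).read_line(&mut reply) {
        // Naive string check to stay dependency-free; a real probe would parse the JSON.
        Ok(n) if n > 0 => reply.contains("\"result\"") && !reply.contains("\"error\""),
        _ => false,
    }
}

/// A failed probe only counts once the grace period after start has elapsed.
fn should_restart(since_start: Duration, probe_ok: bool, grace: Duration) -> bool {
    !probe_ok && since_start >= grace
}
```

The read timeout is what covers the hung-process case: a process that is alive but stuck still holds the socket open, so a connect-only check would pass, while a probe that waits for a reply fails once the timeout expires.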
Implementation plan
hero_redis: replace TOML gen with SDK calls install.rs install.rs install.rs entrypoint.sh entrypoint.sh :hero image, deploy to hero.gent04.grid.tf (#26)
Files to modify
- crates/hero_services_server/src/install.rs: replace write_service_config_with_deps() and write_health_config_with_deps() with SDK-based equivalents
- crates/hero_services_server/src/zinit.rs: extend to use ServiceBuilder + ActionBuilder
- docker/entrypoint.sh: remove watchdog loop (step 5, only after validation)
Current state of install.rs
Key functions (766 lines total):
- write_service_config_with_deps() (L121-187): serializes [service] + [dependencies] TOML
- write_install_config_with_deps() (L202-267): generates shell install oneshots
- write_health_config_with_deps() (L449-500): health check probes (mostly no-ops)
- write_test_config_with_deps() (L507-557): integration test runners
- build_build_exec() (L271-361): clone + make install scripts
- build_download_exec() (L365-425): curl + chmod download scripts
- do_run() (L634-702): polls service_status, calls service_restart
Related
- :hero tag
- hero_services/crates/hero_services_server/src/install.rs
- hero_services/crates/hero_services_server/src/zinit.rs
- hero_services/docker/entrypoint.sh
Deployment: hero.gent04.grid.tf (:hero tag)
Part of this issue: validate the zinit 0.4.0 migration on a fresh environment before promoting to herodev2/herodemo2.
Provisioning
Create deploy/single-vm/envs/hero/:
Then:
Result: hero.gent04.grid.tf with :hero tagged image containing zinit 0.4.0 SDK migration.
Promotion path
What to validate
- Running state
- zinit kill user.hero_osis_server SIGTERM) → auto-restarts within 5-10s
- kill -STOP) → health check fails → restart triggered
- make smoke ENV=hero
Revised: single deployment model
Drop the three-tier promotion. hero.gent04.grid.tf with :hero tag becomes the single deployment: dev, demo, and production.
Rationale
- :hero → deploy → done
Migration plan
- hero.gent04.grid.tf on node 50 with :hero image
- hero.gent04.grid.tf
When to spin up a second env
Local Docker testing workflow
To speed up the dev loop, test the zinit SDK migration locally before pushing to TFGrid:
Advantages:
Once local tests pass → make push TAG=hero → make all ENV=hero on TFGrid.
Priority & Dependencies
This issue is the next priority — it blocks the final merge of the dioxus-bootstrap migration (#28).
Why it blocks #28
Fresh containers built from the dioxus-bootstrap image can't serve because service TOML configs aren't generated properly. Without this fix, herodevbootstrap shows Bad Gateway after a clean deploy.
Execution plan
- development_mik_6_1 branch (same as #23)
- development_mik_6_1 → development (brings both #23 and #25)
- development into its bootstrap branches and complete
Dependency chain
Issue #25 — COMPLETE ✅
Zinit 0.4.0 SDK migration implemented and deployed on hero.gent04.grid.tf.
What was done
Code changes (commit 8ec5402 on development_mik_6_1):
- install.rs: New SDK functions register_service_sdk(), register_install_sdk(), register_health_sdk(), register_test_sdk() using ServiceBuilder / ActionBuilder / RetryPolicyBuilder
- profile.rs: execute_profile() and activate_profile_additive() use SDK registration
- service_data.rs: service_hard_restart() and reload_config() use SDK
- zinit.rs: stop_and_clean() uses service_delete(), HERO_DOCKER=1 skips binary deletion
- entrypoint.sh: Watchdog loop removed, manual zinit start loop removed
- deploy/single-vm/envs/hero/
Retry policy: 20 attempts, 5s delay, exponential backoff, 300s max delay, 60s stability period
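To make the retry numbers above concrete, here is a small sketch of the delay schedule they imply. It is not taken from zinit or from RetryPolicyBuilder: the struct, its method names, the doubling factor, and the reading of the stability period as "a run that lasted this long resets the attempt counter" are all assumptions made for illustration.

```rust
use std::time::Duration;

/// Retry settings quoted in this thread. How zinit applies them internally
/// is not shown in the source; the doubling factor below is an assumption.
struct RetryPolicy {
    max_attempts: u32,
    initial_delay: Duration,
    max_delay: Duration,
    stability_period: Duration,
}

impl RetryPolicy {
    /// The values reported for the hero deployment.
    fn hero_default() -> Self {
        RetryPolicy {
            max_attempts: 20,
            initial_delay: Duration::from_secs(5),
            max_delay: Duration::from_secs(300),
            stability_period: Duration::from_secs(60),
        }
    }

    /// Delay before restart number `attempt` (1-based), or `None` once the
    /// attempts are exhausted. Doubles each time and is capped at `max_delay`.
    fn delay_for(&self, attempt: u32) -> Option<Duration> {
        if attempt == 0 || attempt > self.max_attempts {
            return None;
        }
        let factor = 2u32.saturating_pow(attempt - 1);
        let delay = self.initial_delay.saturating_mul(factor);
        Some(delay.min(self.max_delay))
    }

    /// Assumed meaning of the stability period: a run that stayed up at
    /// least that long counts as healthy, so the next crash starts over at 1.
    fn next_attempt(&self, previous_attempt: u32, uptime: Duration) -> u32 {
        if uptime >= self.stability_period {
            1
        } else {
            previous_attempt + 1
        }
    }
}
```

Under these assumptions the delays run 5s, 10s, 20s, 40s, 80s, 160s and then stay at the 300s cap from the seventh attempt through the twentieth.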
Health checks: socat JSON-RPC server.health probe on Unix sockets for _server services, curl HTTP probe for port-based services. 90s timeout.
Action naming: Action names are globally unique in zinit 0.4.0, so registration uses the {service}.run pattern (e.g., user.hero_redis_server.run)
Bugs found and fixed
- register_service_sdk()
- main to {service}.run
- HERO_DOCKER=1 env var skips deletion of pre-baked binaries
Validation
- hero_redis_server → restarted within 5s (confirmed locally and on live deployment)
- https://hero.gent04.grid.tf/hero_os/ returns 200
Deployment
- forge.ourworld.tf/lhumina_code/hero_zero:hero
- hero.gent04.grid.tf (TFGrid node 50, gent04)
Branch: development_mik_6_1 across hero_services repo. Ready to merge to development after #23 remaining items are resolved.
done in development_mik_6_1