Migrate hero_services to zinit 0.4.0 job model (restart + health checks) #25

Closed
opened 2026-03-13 12:40:22 +00:00 by mik-tf · 6 comments
Owner

Migrate hero_services to zinit 0.4.0 job model

Context

Follow-up from #24 (watchdog hotfix). Zinit 0.4.0 is already installed in the container and supports a job-based model with restart policies and periodic health checks. But hero_services_server still generates legacy TOML configs that don't use these features. Currently relying on a watchdog loop in entrypoint.sh as a band-aid (#24).

Problem

No restart-on-failure

write_service_config_with_deps() in install.rs writes old-format TOMLs:

```toml
[service]
name = "user.hero_embedder_server"
exec = "/root/hero/bin/hero_embedder_server serve"
oneshot = false
```

When a service crashes, zinit marks it inactive and nothing restarts it.

Dummy health checks

write_health_config_with_deps() in install.rs writes no-op health checks:

  • Services with a build section: runs make health-check (the target usually doesn't exist → "healthy by default")
  • Services with ports: curl probe on the HTTP port (works for UI services only)
  • Socket-only servers (embedder, books, auth, etc.): echo "No health check configured" → always passes

All 29 services on herodev2/herodemo2 have inactive health checks.

Hung process detection missing

Even with the watchdog hotfix (#24), a process that is alive but unresponsive (e.g., stuck on an external API call) will not be detected or restarted.

Investigation findings

Zinit reload behavior (confirmed 2026-03-16)

service.reload() does NOT delete API-created services. Analysis of zinit server source:

  • Reload scans TOML files in config dir, upserts into SQLite via db.services.set()
  • Response hardcodes "removed": [] — never deletes anything
  • Both API-created and file-based services live in same SQLite table (services) with no origin column
  • Test suite comment confirms: "API-added service might be removed or kept depending on implementation" — current impl keeps them

Decision: Use SDK API exclusively, do NOT call service.reload() after migration.

This means hero_services_server creates all services via service.set() + action.set() RPC calls. No TOML files generated. Clean and predictable.
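
For debugging, the same RPC surface can also be poked from the shell. A minimal sketch that only builds the JSON-RPC envelope; the control-socket path and the exact `params` shape are assumptions, not confirmed zinit defaults:

```bash
#!/usr/bin/env bash
# Build a JSON-RPC 2.0 envelope for a zinit method call.
# Assumption: zinit 0.4.0 listens on a Unix control socket; the path
# below is a placeholder, not a confirmed default.
ZINIT_SOCK="${ZINIT_SOCK:-/var/run/zinit.sock}"

rpc_envelope() {
  local method="$1" params="$2" id="${3:-1}"
  printf '{"jsonrpc":"2.0","method":"%s","params":%s,"id":%s}' \
    "$method" "$params" "$id"
}

# service.status exists per the SDK table below; the params shape here
# is an assumption for illustration.
req=$(rpc_envelope "service.status" '{"name":"user.hero_redis_server"}')
echo "$req"
# To actually send it (requires a running zinit):
#   echo "$req" | socat - UNIX-CONNECT:"$ZINIT_SOCK"
```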

Key files analyzed:

  • zinit_server/src/rpc/service.rs lines 421-497 (reload impl)
  • zinit_lib/src/db/service/model.rs (persistence layer)
  • zinit_sdk/src/builders.rs (ServiceBuilder, ActionBuilder, RetryPolicyBuilder)

Zinit SDK API available

The SDK already provides everything needed:

| API | Purpose |
|-----|---------|
| service.set(ServiceConfig) | Create/update service definition |
| action.set(ActionSpec) | Create/update action (main exec, health check) |
| service.start(name) | Start service |
| service.status(name) | Check current state |
| ServiceBuilder | Fluent API for service config |
| ActionBuilder | Fluent API for action specs (exec, env, timeout, stop signal) |
| RetryPolicyBuilder | Retry policy (max_attempts, delay_ms, backoff, stability_period) |

hero_services_server already imports and uses zinit_sdk for service_status, service_restart, service_reload in zinit.rs.

Target state

Auto-restart (every non-oneshot service)

Using zinit SDK ActionBuilder with retry policy:

```rust
ActionBuilder::new("main", &exec_cmd)
    .retry_builder()
        .max_attempts(20)
        .delay_ms(5000)
        .backoff("linear")
        .stability_period_ms(60000)
        .build()
    .build()
```

Real health checks (every socket-based service)

Periodic JSON-RPC probe on Unix socket:

```bash
echo '{"jsonrpc":"2.0","method":"server.health","id":1}' \
  | socat - UNIX-CONNECT:/root/hero/var/sockets/hero_embedder_server.sock
```

With grace period (30-60s after start) to allow initialization.
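
A concrete probe along these lines could be wired in as the health action. A sketch, assuming the servers answer `server.health` with a standard JSON-RPC reply (a `result` member on success, an `error` member on failure):

```bash
#!/usr/bin/env bash
# Health probe for a socket-based hero service: send server.health over
# its Unix socket and exit 0 only on a JSON-RPC success response.
# The socket path pattern comes from the example above; the reply-shape
# check ("result" present, no "error") is an assumption.
SOCK_DIR="/root/hero/var/sockets"

is_healthy_response() {
  # Success iff the reply carries a "result" member and no "error".
  case "$1" in
    *'"error"'*)  return 1 ;;
    *'"result"'*) return 0 ;;
    *)            return 1 ;;
  esac
}

probe() {
  local svc="$1" reply
  reply=$(echo '{"jsonrpc":"2.0","method":"server.health","id":1}' \
    | socat -t 5 - "UNIX-CONNECT:${SOCK_DIR}/${svc}.sock") || return 1
  is_healthy_response "$reply"
}

# zinit would run e.g. `probe hero_embedder_server` as the health
# action; a non-zero exit marks the service unhealthy.
```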

Implementation plan

| Step | What | Files | Risk |
|------|------|-------|------|
| 1 | Prototype on hero_redis — replace TOML gen with SDK calls | install.rs | Zero — single service, can revert |
| 2 | Verify restart: kill hero_redis, confirm auto-restart within 5s | herodev2 | Zero — observation only |
| 3 | Extend to all services | install.rs | Low — same pattern repeated |
| 4 | Add real health checks (socat on Unix socket, 60s interval) | install.rs | Low — additive |
| 5 | Remove watchdog from entrypoint.sh | entrypoint.sh | Only after confirming restarts work |
| 6 | Build :hero image, deploy to hero.gent04.grid.tf (#26) | Build pipeline | Isolated env |
| 7 | Test: kill services, verify restart + health recovery | hero.gent04.grid.tf | Isolated env |
| 8 | Promote to herodev2/herodemo2 once stable | Deploy pipeline | After validation |
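
Step 2 can be scripted rather than eyeballed. A sketch, assuming a two-column `zinit list` output (name, state) — that format is a guess, so the parsing is isolated in one helper:

```bash
#!/usr/bin/env bash
# Kill hero_redis, then poll until zinit reports it Running again.
# `zinit kill` / `zinit list` are the commands used elsewhere in this
# issue; the assumed list format is "name state" per line.
SVC="user.hero_redis_server"

service_state() {
  # $1 = `zinit list` output, $2 = service name -> prints its state
  printf '%s\n' "$1" | awk -v s="$2" '$1 == s { print $2 }'
}

if command -v zinit >/dev/null 2>&1; then
  zinit kill "$SVC" SIGTERM
  for i in $(seq 1 12); do              # poll for up to 60s
    sleep 5
    if [ "$(service_state "$(zinit list)" "$SVC")" = "Running" ]; then
      echo "restarted after $((i * 5))s"
      exit 0
    fi
  done
  echo "no restart within 60s" >&2
  exit 1
fi
```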

Files to modify

  • crates/hero_services_server/src/install.rs — replace write_service_config_with_deps() and write_health_config_with_deps() with SDK-based equivalents
  • crates/hero_services_server/src/zinit.rs — extend to use ServiceBuilder + ActionBuilder
  • docker/entrypoint.sh — remove watchdog loop (step 5, only after validation)

Current state of install.rs

Key functions (766 lines total):

  • write_service_config_with_deps() (L121-187) — serializes [service] + [dependencies] TOML
  • write_install_config_with_deps() (L202-267) — generates shell install oneshots
  • write_health_config_with_deps() (L449-500) — health check probes (mostly no-ops)
  • write_test_config_with_deps() (L507-557) — integration test runners
  • build_build_exec() (L271-361) — clone + make install scripts
  • build_download_exec() (L365-425) — curl + chmod download scripts
  • do_run() (L634-702) — polls service_status, calls service_restart
Related

  • #24 — Watchdog hotfix (closed, deployed) — temporary band-aid this replaces
  • #23 — Hero OS Enhancements (parent tracking issue)
  • #26 — New deployment: hero.gent04.grid.tf with :hero tag
  • hero_services/crates/hero_services_server/src/install.rs
  • hero_services/crates/hero_services_server/src/zinit.rs
  • hero_services/docker/entrypoint.sh
Author
Owner

Deployment: hero.gent04.grid.tf (:hero tag)

Part of this issue — validate the zinit 0.4.0 migration on a fresh environment before promoting to herodev2/herodemo2.

Provisioning

Create deploy/single-vm/envs/hero/:

```
envs/hero/
├── app.env                      # HERO_IMAGE=hero_zero:hero, CONTAINER_NAME=hero
└── tf/
    └── credentials.auto.tfvars  # node_id=50, cpu=8, memory=16384, disk_size=100
```

Then:

```bash
make all ENV=hero   # init → deploy (TFGrid VM + gateway) → setup (Docker) → test
```

Result: hero.gent04.grid.tf with :hero tagged image containing zinit 0.4.0 SDK migration.

Promotion path

```
hero.gent04.grid.tf (:hero)      ← validate zinit restart + health checks
         ↓ works?
herodev2.gent04.grid.tf (:dev)   ← merge to development, rebuild :dev
         ↓ works?
herodemo2.gent04.grid.tf (:demo) ← promote :dev → :demo
```

What to validate

  • All services start and reach Running state
  • Kill a service (zinit kill user.hero_osis_server SIGTERM) → auto-restarts within 5-10s
  • Health checks run every 60s on socket-based services (check zinit logs)
  • Hung service (e.g., block with kill -STOP) → health check fails → restart triggered
  • Smoke tests pass: make smoke ENV=hero
  • No watchdog needed — remove from entrypoint.sh after validation
Author
Owner

Revised: single deployment model

Drop the three-tier promotion. hero.gent04.grid.tf with :hero tag becomes the single deployment — dev, demo, and production.

Rationale

  • Same image artifact everywhere — no code difference between tiers
  • Zinit 0.4.0 migration makes services self-healing — no need for separate "test then promote"
  • One VM instead of two/three → less TFT cost
  • Simpler workflow: push → build :hero → deploy → done

Migration plan

  1. Provision hero.gent04.grid.tf on node 50 with :hero image
  2. Validate: all services running, restarts work, health checks pass, smoke tests green
  3. Once stable: retire herodev2 + herodemo2 (delete VMs, reclaim TFT)
  4. Update all docs/bookmarks to point to hero.gent04.grid.tf

When to spin up a second env

  • Only for destructive testing or major migrations
  • Temporary VM, tear down after validation
  • Not a permanent tier
Author
Owner

Local Docker testing workflow

To speed up the dev loop, test the zinit SDK migration locally before pushing to TFGrid:

```bash
# 1. Make code changes in install.rs / zinit.rs

# 2. Build all binaries + WASM into dist/
make dist

# 3. Pack into Docker image with :hero tag
make pack TAG=hero

# 4. Run locally
docker run -d --name hero-local \
  -p 8805:6666 \
  -v ./data:/data \
  -e GROQ_API_KEY=$GROQ_API_KEY \
  -e OPENROUTER_API_KEY=$OPENROUTER_API_KEY \
  -e FORGEJO_TOKEN=$FORGEJO_TOKEN \
  forge.ourworld.tf/lhumina_code/hero_zero:hero

# 5. Test restart behavior
docker exec hero-local zinit list          # all services running?
docker exec hero-local zinit kill user.hero_redis_server SIGTERM
sleep 10
docker exec hero-local zinit list          # auto-restarted?

# 6. Test health checks
docker exec hero-local zinit logs user.hero_redis_server.health

# 7. Run smoke tests against localhost
BASE_URL=http://localhost:8805 make smoke

# 8. Clean up
docker stop hero-local && docker rm hero-local
```

Advantages:

  • No IPv6/Mycelium dependency
  • Instant restart (no SCP over slow link)
  • Same image artifact that ships to TFGrid
  • Can iterate on install.rs changes without touching remote VMs

Once local tests pass → make push TAG=hero → make all ENV=hero on TFGrid.

Author
Owner

Priority & Dependencies

This issue is the next priority — it blocks the final merge of the dioxus-bootstrap migration (#28).

Why it blocks #28

Fresh containers built from the dioxus-bootstrap image can't serve because service TOML configs aren't generated properly. Without this fix, herodevbootstrap shows Bad Gateway after a clean deploy.

Execution plan

  1. Work on development_mik_6_1 branch (same as #23)
  2. Once done, merge development_mik_6_1 → development (brings both #23 and #25)
  3. Then #28 can merge development into its bootstrap branches and complete

Dependency chain

```
#25 (this) → merge mik_6_1 → development → #28 completes
```
Author
Owner

Issue #25 — COMPLETE

Zinit 0.4.0 SDK migration implemented and deployed on hero.gent04.grid.tf.

What was done

Code changes (commit 8ec5402 on development_mik_6_1):

  • install.rs: New SDK functions — register_service_sdk(), register_install_sdk(), register_health_sdk(), register_test_sdk() using ServiceBuilder/ActionBuilder/RetryPolicyBuilder
  • profile.rs: execute_profile() and activate_profile_additive() use SDK registration
  • service_data.rs: service_hard_restart() and reload_config() use SDK
  • zinit.rs: stop_and_clean() uses service_delete(), HERO_DOCKER=1 skips binary deletion
  • entrypoint.sh: Watchdog loop removed, manual zinit start loop removed
  • New deployment env: deploy/single-vm/envs/hero/

Retry policy: 20 attempts, 5s delay, exponential backoff, 300s max delay, 60s stability period
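
That policy implies a concrete delay schedule. A sketch, assuming "exponential" means the delay doubles each attempt (the multiplier is not stated in this issue) up to the 300s cap:

```bash
#!/usr/bin/env bash
# Delay schedule for: base 5000ms, exponential backoff, 300000ms cap.
# Assumption: each attempt doubles the previous delay; zinit's actual
# multiplier may differ.
delay_ms() {
  local attempt="$1" d=5000 i=1
  while [ "$i" -lt "$attempt" ]; do
    d=$((d * 2))
    [ "$d" -gt 300000 ] && d=300000
    i=$((i + 1))
  done
  echo "$d"
}

for a in 1 2 3 7 20; do
  echo "attempt $a: $(delay_ms "$a") ms"
done
```

Under this assumption the delay ramps 5s → 10s → 20s → … and pins at the 300s cap from roughly attempt 7 onward, which is what makes 20 attempts plus a 60s stability period a reasonable budget.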

Health checks: socat JSON-RPC server.health probe on Unix sockets for _server services, curl HTTP probe for port-based services. 90s timeout.

Action naming: Globally unique action names in zinit 0.4.0 — uses {service}.run pattern (e.g., user.hero_redis_server.run)

Bugs found and fixed

  1. Self-referencing dependencies filtered out in register_service_sdk()
  2. Action name collision (zinit 0.4.0 has globally unique actions) — changed from main to {service}.run
  3. Docker binary deletion race — HERO_DOCKER=1 env var skips deletion of pre-baked binaries

Validation

  • 28 service processes running, 56 actions registered
  • Auto-restart: killed hero_redis_server → restarted within 5s (confirmed locally and on live deployment)
  • HTTP: https://hero.gent04.grid.tf/hero_os/ returns 200
  • No watchdog needed — zinit retry policy handles restarts natively

Deployment

  • Image: forge.ourworld.tf/lhumina_code/hero_zero:hero
  • VM: hero.gent04.grid.tf (TFGrid node 50, gent04)
  • Replaces herodev2 + herodemo2 (can be retired)

Branch: development_mik_6_1 across hero_services repo. Ready to merge to development after #23 remaining items are resolved.

Author
Owner

done in development_mik_6_1
