Health-driven auto-restart (G3) — probes mark service unhealthy but supervisor does not restart #115

Open

opened 2026-05-21 13:05:17 +00:00 by sameh-farouk · 0 comments


Symptom

The health probe subsystem can mark a service unhealthy, but the supervisor takes no further action. Recovery requires an operator, or an external watchdog, to call the service.restart RPC manually.

This is Gap G3 in crates/hero_proc_server/src/supervisor/SPECS.md:117, verbatim:

| G3 | Health-driven auto-restart — health probes flip the service to unhealthy but do not restart it. | An operator (or an external watchdog calling the restart RPC) is required to recover. |

Surface

crates/hero_proc_server/src/supervisor/health.rs — probe execution + state update
crates/hero_proc_server/src/supervisor/mod.rs — would need a restart trigger wired in
crates/hero_proc_server/src/db/service/model.rs — unhealthy state representation

Why this matters

A service that drifts into a degraded state (port bound but stuck, deadlocked event loop, etc.) is the canonical case where supervisor-driven restart is the right recovery. Without G3 closed, the probe is observation-only — useful for dashboards but not for resilience.

Suggested behavior

When a service transitions to unhealthy and all of the following hold:

It has retry_policy.max_attempts budget left,
AND it has not been manually stopped (i.e. wanted = Start),
AND at least stability_period_ms has elapsed since the last restart,

then the supervisor should trigger a restart with the action's existing retry/backoff semantics (delay_for_attempt etc.).

Gating on wanted = Start is critical so an operator's explicit stop isn't overridden by the auto-restarter.
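
A minimal sketch of the gate described above, to make the three conditions concrete. It is not taken from the hero_proc_server code: the types `Wanted`, `RetryPolicy`, and `ServiceSnapshot`, the fields `attempts_used` and `since_last_restart`, and the function `should_auto_restart` are all hypothetical names. Only `max_attempts`, `stability_period_ms`, and `wanted = Start` come from this issue. The sketch also assumes that `stability_period_ms` lives next to `max_attempts` and that it is satisfied once that much time has passed since the last restart.

```rust
use std::time::Duration;

/// Operator intent for a service (hypothetical mirror of the `wanted` field).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Wanted {
    Start,
    Stop,
}

/// Hypothetical policy struct; placing `stability_period_ms` here is an assumption.
#[derive(Debug, Clone, Copy)]
pub struct RetryPolicy {
    pub max_attempts: u32,
    pub stability_period_ms: u64,
}

/// Hypothetical snapshot of supervisor state at the moment a probe
/// flips the service to unhealthy.
#[derive(Debug, Clone, Copy)]
pub struct ServiceSnapshot {
    pub wanted: Wanted,
    /// Restart attempts already consumed from the retry budget.
    pub attempts_used: u32,
    /// Time elapsed since the supervisor last restarted the service.
    pub since_last_restart: Duration,
}

/// Returns true only when all three gates from "Suggested behavior" pass.
pub fn should_auto_restart(policy: &RetryPolicy, svc: &ServiceSnapshot) -> bool {
    // Gate 1: retry budget is not exhausted.
    let has_budget = svc.attempts_used < policy.max_attempts;
    // Gate 2: the operator has not asked for the service to be stopped.
    let operator_wants_running = svc.wanted == Wanted::Start;
    // Gate 3: enough time has passed since the last restart (assumed reading
    // of "stability_period_ms has been satisfied").
    let stable_long_enough =
        svc.since_last_restart >= Duration::from_millis(policy.stability_period_ms);

    has_budget && operator_wants_running && stable_long_enough
}
```

In this sketch the check is a pure function over a snapshot, so it could be called from wherever the probe result is applied without touching the restart path itself. The actual restart, including the delay from delay_for_attempt, would still go through the existing retry/backoff code.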
