Health-driven auto-restart (G3) — probes mark service unhealthy but supervisor does not restart #115

Open
opened 2026-05-21 13:05:17 +00:00 by sameh-farouk · 0 comments
Member

Symptom

The health probe subsystem can mark a service unhealthy, but the supervisor takes no further action. Recovery requires an operator, or an external watchdog, to call the service.restart RPC.

This is Gap G3 in crates/hero_proc_server/src/supervisor/SPECS.md:117, verbatim:

| G3 | Health-driven auto-restart — health probes flip the service to unhealthy but do not restart it. | An operator (or an external watchdog calling the restart RPC) is required to recover. |

Surface

  • crates/hero_proc_server/src/supervisor/health.rs — probe execution + state update
  • crates/hero_proc_server/src/supervisor/mod.rs — would need a restart trigger wired in
  • crates/hero_proc_server/src/db/service/model.rs — unhealthy state representation

Why this matters

A service that drifts into a degraded state (port bound but stuck, deadlocked event loop, etc.) is the canonical case where supervisor-driven restart is the right recovery. Without G3 closed, the probe is observation-only — useful for dashboards but not for resilience.

Suggested behavior

When a service transitions to unhealthy and:

  • It has retry_policy.max_attempts budget left,
  • AND it has not been stopped manually (i.e. wanted = Start),
  • AND stability_period_ms has elapsed since the last restart,

then the supervisor should trigger a restart with the action's existing retry/backoff semantics (delay_for_attempt etc.).

Gating on wanted = Start is critical so an operator's explicit stop isn't overridden by the auto-restarter.
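
The three gates above can be sketched as a single pure decision function. This is only an illustration under stated assumptions, not the supervisor's actual code: the types `Wanted`, `RetryPolicy`, `ServiceSnapshot`, `SkipReason`, and `Decision`, the function `decide_on_unhealthy`, and the placement of `stability_period_ms` inside the retry policy are all hypothetical stand-ins for whatever hero_proc_server really defines. Only the names `wanted`, `max_attempts`, and `stability_period_ms` come from the text above.

```rust
use std::time::Duration;

/// Operator intent for a service. Hypothetical mirror of the `wanted` field.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Wanted {
    Start,
    Stop,
}

/// Hypothetical subset of the retry policy. Keeping `stability_period_ms`
/// here is an assumption; the real field may live elsewhere.
#[derive(Debug, Clone, Copy)]
pub struct RetryPolicy {
    pub max_attempts: u32,
    pub stability_period_ms: u64,
}

/// Hypothetical snapshot taken when a probe flips the service to unhealthy.
#[derive(Debug, Clone, Copy)]
pub struct ServiceSnapshot {
    pub wanted: Wanted,
    /// Restart attempts already consumed from the budget.
    pub attempts_used: u32,
    /// Time elapsed since the supervisor last restarted the service.
    pub since_last_restart: Duration,
}

/// Why an auto-restart was not triggered.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SkipReason {
    OperatorStopped,
    BudgetExhausted,
    StabilityPending,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Decision {
    Restart,
    Skip(SkipReason),
}

/// Applies the three gates in order. The `wanted` check comes first so an
/// explicit operator stop is never overridden by the auto-restarter.
pub fn decide_on_unhealthy(policy: &RetryPolicy, svc: &ServiceSnapshot) -> Decision {
    if svc.wanted != Wanted::Start {
        return Decision::Skip(SkipReason::OperatorStopped);
    }
    if svc.attempts_used >= policy.max_attempts {
        return Decision::Skip(SkipReason::BudgetExhausted);
    }
    if svc.since_last_restart < Duration::from_millis(policy.stability_period_ms) {
        return Decision::Skip(SkipReason::StabilityPending);
    }
    Decision::Restart
}
```

Returning a reason for each skipped restart, rather than a bare boolean, would let the supervisor log why an unhealthy service was left alone. Scheduling the restart itself (the existing backoff via delay_for_attempt) is deliberately left out of the sketch, since its signature is not shown here.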

Reference
lhumina_code/hero_proc#115