Health-driven auto-restart (G3) — probes mark service unhealthy but supervisor does not restart #115
Reference
lhumina_code/hero_proc#115
Symptom
The health probe subsystem can mark a service unhealthy, but the supervisor takes no further action. An operator (or an external watchdog calling the service.restart RPC manually) is required to recover.
This is Gap G3 in crates/hero_proc_server/src/supervisor/SPECS.md:117, verbatim:
Surface
- crates/hero_proc_server/src/supervisor/health.rs: probe execution + state update
- crates/hero_proc_server/src/supervisor/mod.rs: would need a restart trigger wired in
- crates/hero_proc_server/src/db/service/model.rs: unhealthy state representation
Why this matters
A service that drifts into a degraded state (port bound but stuck, deadlocked event loop, etc.) is the canonical case where supervisor-driven restart is the right recovery. Without G3 closed, the probe is observation-only — useful for dashboards but not for resilience.
Suggested behavior
When a service transitions to unhealthy and:
- there is retry_policy.max_attempts budget left,
- the service has not been explicitly stop'd (i.e. wanted = Start),
- stability_period_ms has been satisfied since the last restart,
then the supervisor should trigger a restart with the action's existing retry/backoff semantics (delay_for_attempt etc.).
Gating on wanted = Start is critical so an operator's explicit stop isn't overridden by the auto-restarter.
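The gating described above could be sketched as a pure decision function, which keeps it easy to unit-test before wiring it into the supervisor. This is a minimal sketch, not the hero_proc implementation. Every type here (Wanted, Health, RetryPolicy, ServiceSnapshot, Decision, SkipReason) is hypothetical. Placing stability_period_ms on the retry policy is an assumption, as is reading "satisfied" as "at least that many milliseconds have passed since the last restart". The delay_for_attempt body is a placeholder exponential backoff, since the issue does not state the real curve.

```rust
use std::time::Duration;

/// Operator intent for a service (hypothetical mirror of `wanted`).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Wanted { Start, Stop }

/// Probe result as seen by the supervisor (hypothetical).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Health { Healthy, Unhealthy }

/// Hypothetical policy; field names follow the issue text.
#[derive(Debug, Clone, Copy)]
pub struct RetryPolicy {
    pub max_attempts: u32,
    pub stability_period_ms: u64,
}

/// Hypothetical snapshot of the state the gate needs.
#[derive(Debug, Clone, Copy)]
pub struct ServiceSnapshot {
    pub wanted: Wanted,
    pub health: Health,
    pub attempts_used: u32,
    pub ms_since_last_restart: u64,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SkipReason { NotUnhealthy, OperatorStopped, BudgetExhausted, NotYetStable }

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Decision {
    Restart { delay: Duration },
    Skip(SkipReason),
}

/// Placeholder backoff (assumption): 500 ms doubled per attempt, capped at 30 s.
/// The real `delay_for_attempt` in hero_proc may behave differently.
pub fn delay_for_attempt(attempt: u32) -> Duration {
    let base_ms: u64 = 500;
    let cap_ms: u64 = 30_000;
    // checked_shl returns None for shifts >= 64; treat that as "very large".
    let factor = 1u64.checked_shl(attempt).unwrap_or(u64::MAX);
    Duration::from_millis(base_ms.saturating_mul(factor).min(cap_ms))
}

/// Decide whether an unhealthy transition should trigger a restart.
pub fn decide(policy: &RetryPolicy, svc: &ServiceSnapshot) -> Decision {
    if svc.health != Health::Unhealthy {
        return Decision::Skip(SkipReason::NotUnhealthy);
    }
    // Never override an operator's explicit stop.
    if svc.wanted != Wanted::Start {
        return Decision::Skip(SkipReason::OperatorStopped);
    }
    if svc.attempts_used >= policy.max_attempts {
        return Decision::Skip(SkipReason::BudgetExhausted);
    }
    if svc.ms_since_last_restart < policy.stability_period_ms {
        return Decision::Skip(SkipReason::NotYetStable);
    }
    Decision::Restart { delay: delay_for_attempt(svc.attempts_used) }
}
```

Returning a Skip reason rather than a bare boolean is a design choice for observability: the supervisor could log why an unhealthy service was left alone, which helps when debugging a service that never recovers.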