restart attempt counter resets every daemon boot — autostart_process_jobs builds Job with ..Default::default() #114
Reference
lhumina_code/hero_proc#114
Symptom
The `attempt` counter on a `Job` resets to 0 every time `hero_proc_server` restarts and runs its autostart-process-jobs recovery path. Consequence: a service that has already exhausted its retry budget over many restarts gets a fresh `max_attempts` budget after every daemon bounce, so a permanently broken service can churn forever in production.
Surface
`crates/hero_proc_server/src/supervisor/mod.rs:475-484`: `autostart_process_jobs` constructs the recovery `Job`. The trailing `..Default::default()` zeroes the `attempt` field, dropping whatever counter the previous job carried.
Repro (sketch)
- An `is_process` service whose `exec` exits 1 immediately, with `retry_policy = { max_attempts: 3 }`.
- Restart `hero_proc_server` (graceful shutdown + start).
Why now
Adjacent to the `retry_policy` work in `03e7ed8` (which fixed exit-0 routing through retry but didn't address attempt persistence) and to `uc07_retry_succeeds_on_second_attempt` failing in `hero_proc_test`, possibly the same root cause if the attempt counter isn't being incremented or persisted correctly across the retry path either.
Proposed fix
Either:
- Carry over `attempt` (and `exit_code` for diagnostics) from the previous job in the autostart constructor: propagate `attempt: job.attempt`.
- Persist a `(service, action)` counter that survives Job archival + daemon restart. Heavier but correct semantics for "this action has retried N times today."
Either choice should be paired with applying `RetryPolicy::delay_for_attempt` to the actual spawn timing (see related issue I'm filing; `delay_for_attempt` exists but is not called from production code today).
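
To make the reset and the first proposed fix concrete, here is a minimal standalone sketch. The `Job` struct below is a hypothetical, simplified stand-in, not the real type from `hero_proc_server`: only the `attempt` and `exit_code` names come from this issue, while `service`, `pid`, and both helper functions are assumptions made for illustration.

```rust
// Hypothetical, simplified stand-in for the real Job type. Only `attempt`
// and `exit_code` are named in the issue; `service` and `pid` are assumed.
#[derive(Debug, Clone, Default, PartialEq)]
struct Job {
    service: String,
    attempt: u32,
    exit_code: Option<i32>,
    pid: Option<u32>,
}

/// Mirrors the reported behaviour: every field not listed explicitly is
/// taken from `Job::default()`, so `attempt` silently becomes 0.
fn recover_with_reset(prev: &Job) -> Job {
    Job {
        service: prev.service.clone(),
        ..Default::default()
    }
}

/// Sketch of the first proposed fix: carry `attempt` and `exit_code` over
/// from the previous job, and default only the truly fresh fields.
fn recover_with_carry_over(prev: &Job) -> Job {
    Job {
        service: prev.service.clone(),
        attempt: prev.attempt,
        exit_code: prev.exit_code,
        ..Default::default()
    }
}

fn main() {
    // A job that has already used up a budget of three attempts.
    let prev = Job {
        service: "broken-svc".to_string(),
        attempt: 3,
        exit_code: Some(1),
        pid: Some(4242),
    };

    let reset = recover_with_reset(&prev);
    let carried = recover_with_carry_over(&prev);

    println!("reset:   attempt = {}", reset.attempt);
    println!("carried: attempt = {}", carried.attempt);
}
```

In this sketch the carried-over job keeps `attempt = 3`, so a `max_attempts: 3` check would still see the budget as exhausted after a restart. Whether the real constructor has a previous job in scope at that point is not confirmed here.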