RetryPolicy::delay_for_attempt is computed but never called in production — backoff has no effect #116
Reference
lhumina_code/hero_proc#116
Symptom
RetryPolicy::delay_for_attempt(attempt) computes a backoff delay (1s → 2s → 4s → 8s → capped at 10s by default), but the function is never called in production code, only in unit tests. As a result, jobs that retry do so on the supervisor's next 500 ms poll tick, regardless of policy.
Surface
delay_for_attempt is defined at crates/hero_proc_server/src/db/actions/model.rs:363. A grep for callers shows that all 5 are inside #[cfg(test)]; there are no production callers.
Context
The prior commit 03e7ed8 (fix(supervisor): merge add-job actions; retry process exit 0 via retry_policy, 2026-05-18) fixed one half of the retry path: apply_exit_status now routes is_process daemons exiting 0 through retry_policy. The other half (honoring delay_for_attempt) is still missing. Result: a job with delay_ms: 1000, backoff: true, max_delay_ms: 300000 will retry on the next 500 ms poll, not after 1 s / 2 s / 4 s / ...
Related:
uc07_retry_succeeds_on_second_attempt in hero_proc_test fails with "expected exactly 2 attempts, got 1". This is possibly the same root cause, if the retry isn't being spawned at all because the backoff path is missing. (It may also be the HP-04 attempt-counter-reset bug I just filed.)
Suggested fix
In crates/hero_proc_server/src/supervisor/executor.rs::apply_exit_status (around the should_retry → set phase: Retrying branch), compute delay = job.spec.retry_policy.delay_for_attempt(job.attempt) and either:
not_before_ms = now_ms() + delay on the Job and have poll_pending_jobs skip Retrying jobs whose not_before_ms hasn't elapsed.
(A is simpler; B is cleaner long-term.)
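The not_before_ms approach could look roughly like the sketch below. It is not the hero_proc code: the RetryPolicy, Job and Phase types here are simplified stand-ins (the real job keeps its policy under job.spec), schedule_retry and is_due are hypothetical helper names, and the backoff formula (attempt 1 waits delay_ms, each later attempt doubles, capped at max_delay_ms) is an assumption chosen to match the 1s → 2s → 4s → 8s → 10s schedule described under Symptom.

```rust
use std::time::Duration;

/// Simplified stand-in for the real RetryPolicy. Field names follow the
/// example in this issue; everything else is assumed.
#[derive(Debug, Clone)]
pub struct RetryPolicy {
    pub delay_ms: u64,
    pub backoff: bool,
    pub max_delay_ms: u64,
}

impl RetryPolicy {
    /// Assumed semantics: attempt 1 waits `delay_ms`; with `backoff` set,
    /// each later attempt doubles the wait, capped at `max_delay_ms`.
    pub fn delay_for_attempt(&self, attempt: u32) -> Duration {
        let ms = if self.backoff {
            // Clamp the shift so `1 << shift` cannot overflow a u64.
            let shift = attempt.saturating_sub(1).min(32);
            self.delay_ms.saturating_mul(1u64 << shift)
        } else {
            self.delay_ms
        };
        Duration::from_millis(ms.min(self.max_delay_ms))
    }
}

/// Hypothetical subset of the job phases.
#[derive(Debug, Clone, PartialEq)]
pub enum Phase {
    Running,
    Retrying,
}

/// Hypothetical, flattened job record; `not_before_ms` is the new field.
#[derive(Debug, Clone)]
pub struct Job {
    pub attempt: u32,
    pub phase: Phase,
    pub retry_policy: RetryPolicy,
    pub not_before_ms: Option<u64>,
}

/// What the should_retry branch would do: mark the job Retrying and record
/// the earliest time (in ms) at which it may be picked up again.
pub fn schedule_retry(job: &mut Job, now_ms: u64) {
    let delay = job.retry_policy.delay_for_attempt(job.attempt);
    job.phase = Phase::Retrying;
    job.not_before_ms = Some(now_ms.saturating_add(delay.as_millis() as u64));
}

/// What the poll loop would check before re-spawning a Retrying job.
pub fn is_due(job: &Job, now_ms: u64) -> bool {
    match job.not_before_ms {
        Some(not_before) => now_ms >= not_before,
        None => true,
    }
}
```

Under these assumptions, a job on attempt 2 with delay_ms: 1000 that fails at t = 50 000 ms gets not_before_ms = 52 000. The poll tick at 50 500 skips it, and it becomes eligible from 52 000 onward. Passing now_ms in as a parameter keeps both helpers testable without a real clock.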
Pairs with HP-04 (attempt counter reset across daemon restart) — fixing both together gives a fully-correct retry path.