RetryPolicy::delay_for_attempt is computed but never called in production — backoff has no effect #116

New issue

Open

opened 2026-05-21 13:05:18 +00:00 by sameh-farouk · 0 comments

sameh-farouk commented

2026-05-21 13:05:18 +00:00

Member

Symptom

RetryPolicy::delay_for_attempt(attempt) computes a backoff delay (1s → 2s → 4s → 8s → capped at 10s by default), but the function is never called in production code — only in unit tests. So jobs that retry do so on the supervisor's next 500 ms poll tick, regardless of policy.

Surface

crates/hero_proc_server/src/db/actions/model.rs:363 defines:

impl RetryPolicy {
    pub fn delay_for_attempt(&self, attempt: u32) -> u64 { ... }
}

Grep for callers:

crates/hero_proc_server/src/db/actions/model.rs:657:  assert_eq!(policy.delay_for_attempt(0), 1000);
crates/hero_proc_server/src/db/actions/model.rs:658:  assert_eq!(policy.delay_for_attempt(1), 2000);
crates/hero_proc_server/src/db/actions/model.rs:659:  assert_eq!(policy.delay_for_attempt(2), 4000);
crates/hero_proc_server/src/db/actions/model.rs:660:  assert_eq!(policy.delay_for_attempt(3), 8000);
crates/hero_proc_server/src/db/actions/model.rs:661:  assert_eq!(policy.delay_for_attempt(4), 10_000);

All 5 callers are inside #[cfg(test)] — there are no production callers.

Context

the prior 03e7ed8 (fix(supervisor): merge add-job actions; retry process exit 0 via retry_policy, 2026-05-18) fixed one half of the retry path — apply_exit_status now routes is_process daemons exiting 0 through retry_policy. But the OTHER half (honor delay_for_attempt) is still uncalled. Result: a job with delay_ms: 1000, backoff: true, max_delay_ms: 300000 will retry on the next 500 ms poll, not after 1 s / 2 s / 4 s / ...

Related: uc07_retry_succeeds_on_second_attempt in hero_proc_test fails with expected exactly 2 attempts, got 1 — possibly the same root cause if the retry just isn't being spawned at all because the backoff path is missing. (May also be the HP-04 attempt-counter-reset bug I just filed.)

Suggested fix

In crates/hero_proc_server/src/supervisor/executor.rs::apply_exit_status (around the should_retry → set phase: Retrying branch), compute delay = job.spec.retry_policy.delay_for_attempt(job.attempt) and either:

A. Store not_before_ms = now_ms() + delay on the Job and have poll_pending_jobs skip Retrying jobs whose not_before_ms hasn't elapsed.
B. Add a sleep-scheduling abstraction so the supervisor doesn't busy-poll past Retrying jobs.

(A is simpler; B is cleaner long-term.)

Pairs with HP-04 (attempt counter reset across daemon restart) — fixing both together gives a fully-correct retry path.

## Symptom `RetryPolicy::delay_for_attempt(attempt)` computes a backoff delay (1s → 2s → 4s → 8s → capped at 10s by default), but **the function is never called in production code** — only in unit tests. So jobs that retry do so on the supervisor's next 500 ms poll tick, regardless of policy. ## Surface `crates/hero_proc_server/src/db/actions/model.rs:363` defines: ```rust impl RetryPolicy { pub fn delay_for_attempt(&self, attempt: u32) -> u64 { ... } } ``` Grep for callers: ``` crates/hero_proc_server/src/db/actions/model.rs:657: assert_eq!(policy.delay_for_attempt(0), 1000); crates/hero_proc_server/src/db/actions/model.rs:658: assert_eq!(policy.delay_for_attempt(1), 2000); crates/hero_proc_server/src/db/actions/model.rs:659: assert_eq!(policy.delay_for_attempt(2), 4000); crates/hero_proc_server/src/db/actions/model.rs:660: assert_eq!(policy.delay_for_attempt(3), 8000); crates/hero_proc_server/src/db/actions/model.rs:661: assert_eq!(policy.delay_for_attempt(4), 10_000); ``` All 5 callers are inside `#[cfg(test)]` — there are no production callers. ## Context the prior `03e7ed8` (`fix(supervisor): merge add-job actions; retry process exit 0 via retry_policy`, 2026-05-18) fixed one half of the retry path — `apply_exit_status` now routes `is_process` daemons exiting 0 through `retry_policy`. But the OTHER half (honor `delay_for_attempt`) is still uncalled. Result: a job with `delay_ms: 1000, backoff: true, max_delay_ms: 300000` will retry on the next 500 ms poll, not after 1 s / 2 s / 4 s / ... Related: `uc07_retry_succeeds_on_second_attempt` in `hero_proc_test` fails with `expected exactly 2 attempts, got 1` — possibly the same root cause if the retry just isn't being spawned at all because the backoff path is missing. (May also be the HP-04 attempt-counter-reset bug I just filed.) ## Suggested fix In `crates/hero_proc_server/src/supervisor/executor.rs::apply_exit_status` (around the `should_retry → set phase: Retrying` branch), compute `delay = job.spec.retry_policy.delay_for_attempt(job.attempt)` and either: - **A.** Store `not_before_ms = now_ms() + delay` on the Job and have `poll_pending_jobs` skip Retrying jobs whose `not_before_ms` hasn't elapsed. - **B.** Add a sleep-scheduling abstraction so the supervisor doesn't busy-poll past Retrying jobs. (A is simpler; B is cleaner long-term.) Pairs with HP-04 (attempt counter reset across daemon restart) — fixing both together gives a fully-correct retry path.