restart attempt counter resets every daemon boot — autostart_process_jobs builds Job with ..Default::default() #114

Open

opened 2026-05-21 13:05:16 +00:00 by sameh-farouk · 0 comments

Symptom

The attempt counter on a Job resets to 0 every time hero_proc_server restarts and runs its autostart-process-jobs recovery path. Consequence: a service that has already exhausted its retry budget gets a fresh max_attempts budget after every daemon bounce, so a permanently broken service can churn forever in production.

Surface

crates/hero_proc_server/src/supervisor/mod.rs:475-484 — autostart_process_jobs constructs the recovery Job:

let new_job = Job {
    name: job.name.clone(),
    context_name: job.context_name.clone(),
    description: job.description.clone(),
    is_process: job.is_process,
    spec: spec.clone(),
    script: spec.script.clone(),
    phase: JobStatus::Pending,
    created_at: now_ms(),
    service_id: job.service_id.clone(),
    action_id: job.action_id.clone(),
    ..Default::default()                  //  ← sets attempt: 0
};

The trailing ..Default::default() zeroes the attempt field, dropping whatever counter the previous job carried.
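
A minimal sketch of the struct-update behaviour described above. The `Job` here is a cut-down stand-in, not the real hero_proc type; the field types (`u32`, `Option<i32>`) and the helper name `rebuild_with_defaults` are assumptions made for illustration.

```rust
// Cut-down stand-in for the real `Job`; field types are assumed.
#[derive(Debug, Clone, Default)]
struct Job {
    name: String,
    attempt: u32,
    exit_code: Option<i32>,
}

/// Mirrors the shape of the recovery constructor: fields that are named
/// explicitly are copied, every other field comes from `Job::default()`.
fn rebuild_with_defaults(prev: &Job) -> Job {
    Job {
        name: prev.name.clone(),
        ..Default::default() // attempt becomes 0, exit_code becomes None
    }
}

fn main() {
    let prev = Job {
        name: "svc".to_string(),
        attempt: 3,
        exit_code: Some(1),
    };
    let rebuilt = rebuild_with_defaults(&prev);

    // The counter and the last exit code are both gone after the rebuild.
    println!("before: {:?}", prev);
    println!("after:  {:?}", rebuilt);
}
```

Any field not listed before the `..` is silently taken from the default value, so adding a new stateful field to `Job` later would be dropped in the same way unless the constructor is updated.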

Repro (sketch)

1. Define an is_process service whose exec exits 1 immediately, with retry_policy = { max_attempts: 3 }.
2. Start the service: the supervisor spawns it, it fails, retries 3 times, and is marked Failed.
3. Restart hero_proc_server (graceful shutdown + start).
4. Expected: service stays Failed (already exhausted retries).
5. Actual: autostart_process_jobs creates a NEW Pending job with attempt=0, supervisor retries 3 more times. Repeat forever across daemon restarts.

Why now

This is adjacent to the retry_policy work in 03e7ed8, which fixed exit-0 routing through retry but didn't address attempt persistence. It is also adjacent to uc07_retry_succeeds_on_second_attempt failing in hero_proc_test, which may share a root cause if the attempt counter isn't being incremented or persisted correctly across the retry path either.

Proposed fix

Either:

A. Preserve attempt (and exit_code for diagnostics) from the previous job in the autostart constructor — propagate attempt: job.attempt.
B. Move retry-attempt tracking off the Job row and onto a per-(service, action) counter that survives Job archival + daemon restart. Heavier but correct semantics for "this action has retried N times today."
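
The two options above could look roughly like the sketch below. Everything here is illustrative: `Job` is a cut-down stand-in with assumed field types, and `recover_job` and `AttemptLedger` are hypothetical names that do not exist in the codebase. The ledger is in-memory only; a real option B would need to persist it so it survives a daemon restart.

```rust
use std::collections::HashMap;

// Cut-down stand-in for the real `Job`; field types are assumed.
#[derive(Debug, Clone, Default)]
struct Job {
    service_id: String,
    action_id: String,
    description: String,
    attempt: u32,
    exit_code: Option<i32>,
}

/// Option A (hypothetical helper): carry the counter and the last exit code
/// forward explicitly, and let only the remaining fields take defaults.
fn recover_job(prev: &Job) -> Job {
    Job {
        service_id: prev.service_id.clone(),
        action_id: prev.action_id.clone(),
        attempt: prev.attempt,     // preserved instead of reset to 0
        exit_code: prev.exit_code, // kept for diagnostics
        ..Default::default()
    }
}

/// Option B (hypothetical type): attempts tracked per (service, action),
/// independent of any single Job row.
#[derive(Debug, Default)]
struct AttemptLedger {
    counts: HashMap<(String, String), u32>,
}

impl AttemptLedger {
    /// Record one more attempt and return the new total.
    fn record_attempt(&mut self, service_id: &str, action_id: &str) -> u32 {
        let key = (service_id.to_string(), action_id.to_string());
        let n = self.counts.entry(key).or_insert(0);
        *n += 1;
        *n
    }

    /// True once the recorded attempts reach the budget.
    fn exhausted(&self, service_id: &str, action_id: &str, max_attempts: u32) -> bool {
        let key = (service_id.to_string(), action_id.to_string());
        self.counts.get(&key).copied().unwrap_or(0) >= max_attempts
    }
}

fn main() {
    let prev = Job {
        service_id: "svc".into(),
        action_id: "start".into(),
        description: "demo".into(),
        attempt: 3,
        exit_code: Some(1),
    };
    println!("option A: {:?}", recover_job(&prev));

    let mut ledger = AttemptLedger::default();
    for _ in 0..3 {
        ledger.record_attempt("svc", "start");
    }
    println!("option B exhausted: {}", ledger.exhausted("svc", "start", 3));
}
```

Option A is a small change to the constructor but ties the counter to the lifetime of the Job row. Option B decouples the two, at the cost of a new piece of persistent state and a policy for when the counter resets.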

Either choice should be paired with applying RetryPolicy::delay_for_attempt to the actual spawn timing (see related issue I'm filing — delay_for_attempt exists but is not called from production code today).
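
As a rough sketch of what pairing the counter with spawn timing could mean: the `RetryPolicy` below is a hypothetical stand-in, since the real struct fields and the real `delay_for_attempt` signature are not shown in this issue. The capped exponential backoff, the `base_delay_ms` and `max_delay_ms` fields, and the `next_spawn_at` helper are all assumptions chosen only to make the example concrete.

```rust
use std::time::Duration;

/// Hypothetical stand-in for the real `RetryPolicy`.
struct RetryPolicy {
    max_attempts: u32,
    base_delay_ms: u64, // assumed field
    max_delay_ms: u64,  // assumed field
}

impl RetryPolicy {
    /// Assumed behaviour: exponential backoff (base * 2^attempt), capped.
    /// The real method may compute the delay differently.
    fn delay_for_attempt(&self, attempt: u32) -> Duration {
        let shift = attempt.min(20); // bound the shift to avoid overflow
        let ms = self
            .base_delay_ms
            .saturating_mul(1u64 << shift)
            .min(self.max_delay_ms);
        Duration::from_millis(ms)
    }
}

/// Earliest timestamp (ms) at which the next attempt may be spawned,
/// or `None` when the retry budget is already spent.
fn next_spawn_at(policy: &RetryPolicy, attempt: u32, failed_at_ms: u64) -> Option<u64> {
    if attempt >= policy.max_attempts {
        return None;
    }
    let delay_ms = policy.delay_for_attempt(attempt).as_millis() as u64;
    Some(failed_at_ms + delay_ms)
}

fn main() {
    let policy = RetryPolicy {
        max_attempts: 3,
        base_delay_ms: 100,
        max_delay_ms: 1_000,
    };
    for attempt in 0..4 {
        println!(
            "attempt {} -> spawn at {:?}",
            attempt,
            next_spawn_at(&policy, attempt, 5_000)
        );
    }
}
```

The point of the sketch is that the preserved `attempt` value feeds both decisions: whether to retry at all, and how long to wait before spawning.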
