service add-job overwrites existing actions; daemon exit-0 skips retry_policy #106

Closed
opened 2026-05-18 15:33:50 +00:00 by zaelgohary · 4 comments
Member

cmd_add_job builds a fresh ServiceSpec with actions: vec![name], so a second add-job under the same service replaces the start trigger (e.g. hero_aibroker_admin clobbers hero_aibroker_server).

apply_exit_status short-circuits on status.success() for process jobs and marks them Failed without consulting retry_policy, so a daemon that voluntarily exits 0 never retries.
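To make the first half of the report concrete, here is a minimal sketch of the clobbering behaviour next to a merge-based alternative. The `ServiceSpec` below is a simplified stand-in, and `add_job_overwrite`, `add_job_merge`, and the service name `hero_aibroker` are hypothetical names used only for illustration; none of this is the actual hero_proc code.

```rust
// Hypothetical, simplified stand-in for the real ServiceSpec.
#[derive(Debug, Clone, PartialEq)]
struct ServiceSpec {
    name: String,
    actions: Vec<String>,
}

// Behaviour described in the report: every call builds a fresh spec,
// so any previously registered action is lost.
fn add_job_overwrite(service: &str, job: &str) -> ServiceSpec {
    ServiceSpec {
        name: service.to_string(),
        actions: vec![job.to_string()],
    }
}

// One possible fix: start from the existing spec (if any) and append
// the new job only when it is not already listed.
fn add_job_merge(existing: Option<ServiceSpec>, service: &str, job: &str) -> ServiceSpec {
    let mut spec = existing.unwrap_or_else(|| ServiceSpec {
        name: service.to_string(),
        actions: Vec::new(),
    });
    if !spec.actions.iter().any(|a| a == job) {
        spec.actions.push(job.to_string());
    }
    spec
}
```

With the overwrite variant, adding `hero_aibroker_admin` after `hero_aibroker_server` leaves only the admin action; with the merge variant both actions survive.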

Author
Member
03e7ed8
Member

Mixed: code fix landed but end-to-end retry still doesn't behave as expected.

Kristof's 03e7ed8 cleanly fixed both halves at the source level:

  • cmd_add_job in crates/hero_proc/src/cli/commands.rs:775-830 now fetches the existing service and merges into merged_actions instead of clobbering.
  • apply_exit_status in crates/hero_proc_server/src/supervisor/executor.rs:1095-1170 treats is_process daemons exiting 0 as unexpected and routes through retry_policy.

But the covering test fails empirically against a freshly built hero_proc_server from origin/development 719ba10:

FAIL  functional::uc_06_07::uc07_retry_succeeds_on_second_attempt
      [1] expected exactly 2 attempts, got 1

So a job that should retry once (max_attempts: 2) runs only a single attempt. The should_retry branch evaluates job.attempt < rp.max_attempts, so this is possibly an attempt-counter issue (related to HP-04, where the restart attempt counter resets across daemon restarts). Not isolated yet. Issue stays open while the retry path is debugged end to end.
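As a rough illustration of why the attempt counter is a plausible suspect, the sketch below shows how the quoted comparison reacts to the value the counter holds when the first run finishes. `RetryPolicy`, `should_retry`, and `attempts_run` are simplified, hypothetical stand-ins; this does not claim to reproduce the real supervisor logic or the actual cause of the failure.

```rust
// Hypothetical stand-in for the real retry policy type.
struct RetryPolicy {
    max_attempts: u32,
}

// The comparison quoted in the comment above.
fn should_retry(attempt: u32, rp: &RetryPolicy) -> bool {
    attempt < rp.max_attempts
}

// Counts how many runs happen if every run fails, given the attempt
// number recorded for the first run.
fn attempts_run(first_attempt: u32, rp: &RetryPolicy) -> u32 {
    let mut attempt = first_attempt;
    let mut runs = 1;
    while should_retry(attempt, rp) {
        attempt += 1;
        runs += 1;
    }
    runs
}
```

Under these assumptions, with `max_attempts: 2` a counter that reads 1 after the first run yields two runs, while a counter that already reads 2 yields only one, which matches the "expected exactly 2 attempts, got 1" shape.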

Member

Re-verified on kristof5 under canonical setup — CONFIRMED real defect, not sandbox artifact.

Ran hero_proc_test --basic --functional on kristof5 (canonical bootstrap, hero_proc auto-started via lab, binary from yesterday's latest publish). Result:

FAIL  functional::uc_06_07::uc07_retry_succeeds_on_second_attempt
      [1] expected exactly 2 attempts, got 1

Same failure shape as local sandbox. So Kristof's 03e7ed8 fixed the code-level routing (exit-0 now goes through retry_policy branch), but end-to-end the retry doesn't fire a second attempt. Likely related to HP-04 (restart attempt counter resets every daemon boot — supervisor/mod.rs:405-498); when the supervisor crashes/restarts mid-retry it loses the attempt counter.

Real defect, not a sandbox artifact. Reverting my earlier "provisional" caveat.

Owner

Done — both halves are fixed on development:

  1. add-job no longer overwrites actions: cmd_add_job (hero_proc_cli commands.rs) now calls merged_actions.push(name), merging into the existing action set, instead of building actions: vec![name].
  2. process exit-0 no longer skips retry: apply_exit_status (executor.rs) now computes unexpected = !status.success() || job.is_process, so a daemon exit-0 routes through the same retry path as a non-zero exit and retry_policy applies.
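The exit classification in item 2 can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in executor.rs: `Job`, `Outcome`, and `classify_exit` are hypothetical, the policy is reduced to an optional `max_attempts`, and a plain `bool` replaces `status.success()` so the example stays self-contained.

```rust
#[derive(Debug, PartialEq)]
enum Outcome {
    Succeeded,
    Retry,
    Failed,
}

// Hypothetical, simplified job record.
struct Job {
    is_process: bool,          // long-running daemon rather than a one-shot task
    attempt: u32,              // attempts made so far
    max_attempts: Option<u32>, // None means no retry policy
}

// `exit_ok` plays the role of status.success().
fn classify_exit(exit_ok: bool, job: &Job) -> Outcome {
    // A daemon is not supposed to exit at all, so exit-0 is also unexpected.
    let unexpected = !exit_ok || job.is_process;
    if !unexpected {
        return Outcome::Succeeded;
    }
    match job.max_attempts {
        Some(max) if job.attempt < max => Outcome::Retry,
        _ => Outcome::Failed,
    }
}
```

In this sketch a one-shot job exiting 0 still succeeds, while a process job exiting 0 is retried as long as attempts remain.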

Closing as resolved.

Reference
lhumina_code/hero_proc#106