Job dependency enforcement: block child jobs when parent fails #64

Closed
opened 2026-04-29 04:22:37 +00:00 by despiegk · 2 comments
Owner

Context

A job whose dependency has reached failed should not run — it should be marked failed or cancelled, not wait forever. Today the supervisor's job dependency code does not enforce this: a child whose depends_on parent failed will still execute (or hang in waiting).

A test already exists but is #[ignore]d:

  • tests/integration/tests/jobs.rs:297test_job_failed_dependency_blocks#[ignore = "job dependency enforcement (blocking on failed deps) not yet implemented"]

Repro (from the test)

  1. Create parent job: #!/bin/bash\nexit 1.
  2. Create child job with depends_on: [{action: parent_id, dep_type: requires}].
  3. Wait for parent to reach failed.
  4. Today: child's eventual phase is undefined / runs anyway.
  5. Expected: child phase is waiting → cancelled or waiting → failed, never succeeded, and the child's script does NOT execute.

Acceptance

  • A child job with a requires dependency on a parent that reaches failed is itself transitioned to failed (or cancelled) without running its script.
  • Reason field set to something like dependency <id> failed.
  • Existing test test_job_failed_dependency_blocks is un-ignored and passes.
  • No regression on test_job_graph (the success-path equivalent) or other jobs tests.

Notes / scope

  • This is a correctness bug, not a feature. Today's behavior silently violates the user's stated dependency intent.
  • Probably lives in crates/hero_proc_server/src/supervisor/ — wherever the dep-resolution / scheduling decision is made when a parent transitions to a terminal phase.
  • dep_type is requires for now; if/when other dep types exist (e.g. after) decide their semantics separately.

Follow-up to #56.

## Context A job whose dependency has reached `failed` should not run — it should be marked `failed` or `cancelled`, not wait forever. Today the supervisor's job dependency code does not enforce this: a child whose `depends_on` parent failed will still execute (or hang in `waiting`). A test already exists but is `#[ignore]`d: - `tests/integration/tests/jobs.rs:297` — `test_job_failed_dependency_blocks` — `#[ignore = "job dependency enforcement (blocking on failed deps) not yet implemented"]` ## Repro (from the test) 1. Create parent job: `#!/bin/bash\nexit 1`. 2. Create child job with `depends_on: [{action: parent_id, dep_type: requires}]`. 3. Wait for parent to reach `failed`. 4. Today: child's eventual phase is undefined / runs anyway. 5. Expected: child phase is `waiting → cancelled` or `waiting → failed`, never `succeeded`, and the child's script does NOT execute. ## Acceptance - A child job with a `requires` dependency on a parent that reaches `failed` is itself transitioned to `failed` (or `cancelled`) without running its script. - Reason field set to something like `dependency <id> failed`. - Existing test `test_job_failed_dependency_blocks` is un-ignored and passes. - No regression on `test_job_graph` (the success-path equivalent) or other jobs tests. ## Notes / scope - This is a correctness bug, not a feature. Today's behavior silently violates the user's stated dependency intent. - Probably lives in `crates/hero_proc_server/src/supervisor/` — wherever the dep-resolution / scheduling decision is made when a parent transitions to a terminal phase. - `dep_type` is `requires` for now; if/when other dep types exist (e.g. `after`) decide their semantics separately. Follow-up to #56.
Member

Empirical re-verification on origin/development 719ba10 says this is NOT fixed.

Code review of crates/hero_proc_server/src/supervisor/mod.rs:660-742 looks correct — DAG-aware dependency gating cascade-cancels children whose Requires parent reached Failed/Cancelled, setting phase=Cancelled with descriptive error. There IS a covering test: crates/hero_proc_test/src/tests/functional/uc_08_09_dependencies.rs:148-200 (uc09_failed_parent_blocks_child).

But the test FAILS end-to-end against a freshly built hero_proc_server from this commit:

FAIL  functional::uc_08_09::uc09_failed_parent_blocks_child
      [1] child must NOT be succeeded when parent failed, got succeeded

So the gating logic in the supervisor poll loop is being reached, but the child is somehow still spawning + succeeding before/instead of being cancelled. Likely race between supervisor's per-tick scan and the executor spawning queued children — but root cause not isolated yet. Issue stays open.

Repro:

PATH_ROOT=/tmp/sbx ./target/release/hero_proc_server &
./target/release/hero_proc_test --functional --filter uc09
**Empirical re-verification on `origin/development` 719ba10 says this is NOT fixed.** Code review of `crates/hero_proc_server/src/supervisor/mod.rs:660-742` looks correct — DAG-aware dependency gating cascade-cancels children whose `Requires` parent reached `Failed`/`Cancelled`, setting `phase=Cancelled` with descriptive error. There IS a covering test: `crates/hero_proc_test/src/tests/functional/uc_08_09_dependencies.rs:148-200` (`uc09_failed_parent_blocks_child`). But the test FAILS end-to-end against a freshly built `hero_proc_server` from this commit: ``` FAIL functional::uc_08_09::uc09_failed_parent_blocks_child [1] child must NOT be succeeded when parent failed, got succeeded ``` So the gating logic in the supervisor poll loop is being reached, but the child is somehow still spawning + succeeding before/instead of being cancelled. Likely race between supervisor's per-tick scan and the executor spawning queued children — but root cause not isolated yet. Issue stays open. Repro: ``` PATH_ROOT=/tmp/sbx ./target/release/hero_proc_server & ./target/release/hero_proc_test --functional --filter uc09 ```
Member

Re-verified on kristof5 under canonical setup — CONFIRMED real defect, not sandbox artifact.

Ran hero_proc_test --basic --functional on kristof5 (PATH_ROOT=/home/sameh/hero, hero_proc auto-started via lab, binary from yesterday's latest publish 2026-05-20T13:29). Result:

FAIL  functional::uc_08_09::uc09_failed_parent_blocks_child
      [1] child must NOT be succeeded when parent failed, got succeeded

Same failure shape as local sandbox. The DAG-gating logic at supervisor/mod.rs:660-742 is reached but the child still spawns + succeeds before/instead of being cancelled. Real defect, not a sandbox artifact.

(Reverting my earlier "provisional" caveat — verification on a properly-bootstrapped canonical box shows the same failure.)

**Re-verified on kristof5 under canonical setup — CONFIRMED real defect, not sandbox artifact.** Ran `hero_proc_test --basic --functional` on kristof5 (PATH_ROOT=/home/sameh/hero, hero_proc auto-started via lab, binary from yesterday's `latest` publish 2026-05-20T13:29). Result: ``` FAIL functional::uc_08_09::uc09_failed_parent_blocks_child [1] child must NOT be succeeded when parent failed, got succeeded ``` Same failure shape as local sandbox. The DAG-gating logic at `supervisor/mod.rs:660-742` is reached but the child still spawns + succeeds before/instead of being cancelled. **Real defect, not a sandbox artifact.** (Reverting my earlier "provisional" caveat — verification on a properly-bootstrapped canonical box shows the same failure.)
Sign in to join this conversation.
No milestone
No project
No assignees
2 participants
Notifications
Due date
The due date is invalid or out of range. Please use the format "yyyy-mm-dd".

No due date set.

Dependencies

No dependencies set.

Reference
lhumina_code/hero_proc#64
No description provided.