Job dependency enforcement: block child jobs when parent fails #64
lhumina_code/hero_proc#64
Context

A job whose dependency has reached `failed` should not run. It should be marked `failed` or `cancelled`, not wait forever. Today the supervisor's job dependency code does not enforce this: a child whose `depends_on` parent failed will still execute (or hang in `waiting`).

A test already exists but is `#[ignore]`d:

- `tests/integration/tests/jobs.rs:297`, `test_job_failed_dependency_blocks`, marked `#[ignore = "job dependency enforcement (blocking on failed deps) not yet implemented"]`

Repro (from the test)

- Parent job whose script is `#!/bin/bash\nexit 1`.
- Child job with `depends_on: [{action: parent_id, dep_type: requires}]`.
- Parent reaches `failed`.
- Expected: the child goes `waiting → cancelled` or `waiting → failed`, never `succeeded`, and the child's script does NOT execute.

Acceptance

- A job with a `requires` dependency on a parent that reaches `failed` is itself transitioned to `failed` (or `cancelled`) without running its script.
- The child's error names the failed dependency, e.g. `dependency <id> failed`.
- `test_job_failed_dependency_blocks` is un-ignored and passes.
- No regressions in `test_job_graph` (the success-path equivalent) or other jobs tests.

Notes / scope

- Relevant code: `crates/hero_proc_server/src/supervisor/`, wherever the dep-resolution / scheduling decision is made when a parent transitions to a terminal phase.
- `dep_type` is `requires` for now; if/when other dep types exist (e.g. `after`), decide their semantics separately.

Follow-up to #56.
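The acceptance criteria can be read as a three-way decision per child job: run, keep waiting, or block. The sketch below is a minimal illustration of that decision, not the hero_proc implementation. `Phase`, `DepType`, `Dep`, `Gate`, and `gate` are hypothetical names chosen here, job ids are assumed to be `u32`, and treating a `cancelled` or unknown parent as shown is an assumption, not something the issue specifies.

```rust
use std::collections::HashMap;

/// Job phases named in the issue; this enum is a hypothetical stand-in.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Phase { Waiting, Running, Succeeded, Failed, Cancelled }

/// Only `requires` exists for now, per the scope note.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum DepType { Requires }

/// Mirrors the shape `{action: parent_id, dep_type: requires}`.
#[derive(Clone, Debug)]
pub struct Dep { pub action: u32, pub dep_type: DepType }

/// Outcome of checking a child's dependencies.
#[derive(Debug, PartialEq, Eq)]
pub enum Gate {
    /// Every required parent succeeded: the child may run.
    Run,
    /// At least one parent is not terminal yet: keep waiting.
    Wait,
    /// A required parent failed or was cancelled: never run the script.
    Block { error: String },
}

pub fn gate(deps: &[Dep], phases: &HashMap<u32, Phase>) -> Gate {
    let mut all_succeeded = true;
    for dep in deps {
        match dep.dep_type {
            DepType::Requires => match phases.get(&dep.action) {
                Some(Phase::Failed) => {
                    return Gate::Block { error: format!("dependency {} failed", dep.action) };
                }
                // Assumption: a cancelled parent blocks the child as well.
                Some(Phase::Cancelled) => {
                    return Gate::Block { error: format!("dependency {} cancelled", dep.action) };
                }
                Some(Phase::Succeeded) => {}
                // Waiting, Running, or unknown parent: not decidable yet.
                _ => all_succeeded = false,
            },
        }
    }
    if all_succeeded { Gate::Run } else { Gate::Wait }
}
```

A failed parent anywhere in the list wins over a still-pending one, so a child never waits forever on a dependency set that can no longer succeed.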
Empirical re-verification on `origin/development` (`719ba10`) says this is NOT fixed.

Code review of `crates/hero_proc_server/src/supervisor/mod.rs:660-742` looks correct: DAG-aware dependency gating cascade-cancels children whose `Requires` parent reached `Failed`/`Cancelled`, setting `phase=Cancelled` with a descriptive error. There IS a covering test: `crates/hero_proc_test/src/tests/functional/uc_08_09_dependencies.rs:148-200` (`uc09_failed_parent_blocks_child`).

But the test FAILS end-to-end against a freshly built `hero_proc_server` from this commit:

So the gating logic in the supervisor poll loop is being reached, but the child is somehow still spawning + succeeding before/instead of being cancelled. Likely a race between the supervisor's per-tick scan and the executor spawning queued children, but the root cause is not isolated yet. Issue stays open.
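If the race hypothesis holds, one way to close it is to make the executor re-check dependencies in the same critical section that moves a job from `Waiting` to `Running`, so a child can never start after its parent is already recorded as failed. The sketch below illustrates only that idea. It is not the hero_proc code: `Jobs`, `Job`, and `try_claim` are hypothetical names, and a single `Mutex` around an in-memory map is an assumption made to keep the example small.

```rust
use std::collections::HashMap;
use std::sync::Mutex;

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Phase { Waiting, Running, Succeeded, Failed, Cancelled }

#[derive(Debug)]
pub struct Job {
    pub phase: Phase,
    /// Ids of parents this job `requires`.
    pub requires: Vec<u32>,
    pub error: Option<String>,
}

/// Hypothetical shared job table used by both supervisor and executor.
pub struct Jobs(Mutex<HashMap<u32, Job>>);

impl Jobs {
    pub fn new(jobs: HashMap<u32, Job>) -> Self { Jobs(Mutex::new(jobs)) }

    pub fn phase(&self, id: u32) -> Option<Phase> {
        self.0.lock().unwrap().get(&id).map(|j| j.phase)
    }

    pub fn error(&self, id: u32) -> Option<String> {
        self.0.lock().unwrap().get(&id).and_then(|j| j.error.clone())
    }

    /// Executor side: returns true only if the job moved Waiting -> Running.
    /// The dependency check and the phase change share one lock, so no
    /// parent transition can slip in between them.
    pub fn try_claim(&self, id: u32) -> bool {
        let mut jobs = self.0.lock().unwrap();
        let Some(job) = jobs.get(&id) else { return false };
        if job.phase != Phase::Waiting {
            return false;
        }
        // Snapshot each parent's phase while still holding the lock.
        let parents: Vec<(u32, Option<Phase>)> = job
            .requires
            .iter()
            .map(|p| (*p, jobs.get(p).map(|j| j.phase)))
            .collect();
        let blocked = parents
            .iter()
            .find(|(_, ph)| matches!(ph, Some(Phase::Failed | Phase::Cancelled)));
        if let Some((pid, ph)) = blocked {
            let why = if *ph == Some(Phase::Failed) { "failed" } else { "cancelled" };
            let job = jobs.get_mut(&id).unwrap();
            job.phase = Phase::Cancelled;
            job.error = Some(format!("dependency {pid} {why}"));
            return false;
        }
        if parents.iter().all(|(_, ph)| *ph == Some(Phase::Succeeded)) {
            jobs.get_mut(&id).unwrap().phase = Phase::Running;
            return true;
        }
        false // some parent is not terminal yet: stay Waiting
    }
}
```

With this shape the order of the supervisor tick and the executor spawn no longer matters for correctness: whichever path sees the failed parent first cancels the child, and the other path finds it no longer `Waiting`.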
Repro:
Re-verified on kristof5 under canonical setup: CONFIRMED real defect, not sandbox artifact.

Ran `hero_proc_test --basic --functional` on kristof5 (`PATH_ROOT=/home/sameh/hero`, hero_proc auto-started via lab, binary from yesterday's `latest` publish 2026-05-20T13:29). Result:

Same failure shape as local sandbox. The DAG-gating logic at `supervisor/mod.rs:660-742` is reached but the child still spawns + succeeds before/instead of being cancelled. Real defect, not a sandbox artifact.

(Reverting my earlier "provisional" caveat: verification on a properly-bootstrapped canonical box shows the same failure.)
sameh-farouk referenced this issue 2026-05-21 11:15:04 +00:00