issue in stopped jobs #56
Labels
No labels
prio_critical
prio_low
type_bug
type_contact
type_issue
type_lead
type_question
type_story
type_task
No milestone
No project
No assignees
1 participant
Notifications
Due date
No due date set.
Dependencies
No dependencies set.
Reference
lhumina_code/hero_proc#56
Loading…
Add table
Add a link
Reference in a new issue
No description provided.
Delete branch "%!s()"
Deleting a branch is permanent. Although the deleted branch may continue to exist for a short time before it actually gets removed, it CANNOT be undone in most cases. Continue?
do a proper integration test
test all using openrpc from the integration test, full end2end test about how to see runs, jobs, logs
and if we can query all
then delete action, should delete runs and jobs, test this is the case
to restart and fix use skill /nu_service_use
use browser mcp to test in UI as well, check in all detail how the logs and jobs page works
Implementation Spec for Issue #56
Objective
Add a comprehensive end-to-end integration test (Rust, OpenRPC SDK) that exercises the full lifecycle of a failing action: action created -> run started -> job spawned -> job fails after ~3s -> job/run/logs remain searchable post-failure -> action delete cascades to runs and jobs. Fix the cascade-delete gap in
action.delete(current behavior leaves orphaned runs and jobs in the DB) so the test can pass. Add a browser MCP UI testcase that drives the dashboard through hero_router athttp://127.0.0.1:9988/hero_proc/ui/.Requirements
tests/integration/tests/namedfailed_action_lifecycle.rs.hero_proc_serverinstance viaTestHarness, talking through the typed OpenRPC SDK (HeroProcRPCAPIClient).run.create_with_jobs.job.listimmediately after run start.failedterminal phase, then assertphase=="failed",exit_code != 0, error is non-empty, started_at and finished_at are populated, elapsed >= ~2.5s.job.get,job.list({phase:"failed"}),job.list({action_id:"failing-action-56"}), andrun.get(run_id).job.logs(id)returns at least the stdout/stderr lines emitted by the script (poll briefly because logs flush asynchronously).action.delete. After deletion the action, run, and child jobs must all be gone.testcases/19_failed_job_lifecycle/that targets the dashboard exposed by hero_router athttp://127.0.0.1:9988/hero_proc/ui/. The test relies on hero_router being up and routing/hero_proc/ui/to the hero_proc_ui service.Files to Modify/Create
crates/hero_proc_server/src/rpc/action.rs— extendhandle_deleteto cascade.crates/hero_proc_lib/src/db/factory.rs— helpers for cascade lookup.crates/hero_proc_lib/src/db/runs/model.rs—list_run_ids_for_jobs(conn, &[u32]).tests/integration/tests/failed_action_lifecycle.rs— new end-to-end test.testcases/19_failed_job_lifecycle/19_failed_job_lifecycle.md— browser MCP UI testcase againsthttp://127.0.0.1:9988/hero_proc/ui/.Implementation Plan
Step 1: Fix cascade delete in
action.deleteFiles:
crates/hero_proc_server/src/rpc/action.rscrates/hero_proc_lib/src/db/factory.rscrates/hero_proc_lib/src/db/runs/model.rsAdd
list_run_ids_for_jobs(conn, &[u32]) -> Vec<u32>againstrun_jobs.Add helpers on
JobsApi/RunsApito find jobs by action and runs by job ids.In
handle_delete: gather jobs for the action, gather runs for those jobs, cancel and delete jobs (idempotent), delete runs, then delete the action. Return optionaldeleted_runs/deleted_jobscounts.Dependencies: none.
Step 2: Audit supervisor handling of failed jobs
Files:
crates/hero_proc_server/src/supervisor/executor.rs(read-only audit; patch only if a real bug is found).Dependencies: Step 1.
Step 3: Add the integration test
Files:
tests/integration/tests/failed_action_lifecycle.rsEnd-to-end Rust test using
TestHarnessand the typed SDK. Asserts every bullet from Requirements (3a-3g).Dependencies: Step 1, Step 2.
Step 4: Add browser MCP UI testcase against hero_router
Files:
testcases/19_failed_job_lifecycle/19_failed_job_lifecycle.mdhttp://127.0.0.1:9988/hero_proc/ui/(the hero_router-mounted path).Dependencies: Step 1.
Step 5: Wire-up and validation
cargo build -p hero_proc_servercargo test -p hero_proc_integration_tests --test failed_action_lifecycle -- --nocapturecargo test -p hero_proc_integration_tests/nu_service_useskill and exercise the new flow live./run_ui_testsagainsthttp://127.0.0.1:9988/hero_proc/ui/.Dependencies: Steps 1-4.
Acceptance Criteria
cargo test -p hero_proc_integration_tests --test failed_action_lifecyclepasses.job.logs(id)after the job has reachedfailed.job.listfilters andrun.get.http://127.0.0.1:9988/hero_proc/ui/(via hero_router).Notes
job.create(spec)(orphan job, no run) andrun.create_with_jobs(run + linked job). The integration test uses the latter; the cascade fix handles both.run_jobshas FK cascade but jobs and runs may live in separate SQLite connections, so we delete jobs explicitly rather than rely on FK./hero_proc/ui/correctly. If that route is broken, the testcase will report a routing failure rather than a UI bug.Test Results
New test: failed_action_lifecycle
FAIL -
test_failed_action_lifecyclepanicked attests/integration/tests/failed_action_lifecycle.rs:142:The job did reach the expected
failedterminal phase withexit_code = 7, but theJobSummary.started_at_msfield returned byjob.statuswasNone. The assertion that follows (finished_at_ms - started_at_ms >= 2_500ms) therefore could not run. This indicates the executor / status code path is not populatingstarted_at_mson the failed-job summary.Integration suite
Failures:
failed_action_lifecycle::test_failed_action_lifecycle-started_at_ms should be populated(see above).pty::test_pty_cpr_reply_is_stripped- CPR reply\x1b[24;80Rwas not stripped from the PTY stream. Captured stream:Per-binary breakdown:
Library tests
Library unit tests are all green; the regressions are limited to the two integration tests above.
Test Results (after fixes)
failed_action_lifecycle (new test)
PASS
Integration suite
All integration tests pass. No failures (the previously suspect
pty::test_pty_cpr_reply_is_strippedalso passes in this run). The 21 ignored entries are pre-existing skipped tests for unimplemented features (cascade stop, cyclic dependency detection, stress / long-running / network / signal / state-persistence tests, deeper shutdown/process-tree shutdown variants, and one job-dependency-failure-blocking test) and are unrelated to issue #56.Library tests
Server-side bugs found and fixed during this work
executor.rs:job.started_atwas being overwritten to0on terminal job writes; now set BEFORE theRunningupdate so..job.clone()carries it into Failed/Succeeded/Retrying writes.rpc/run.rs:run.getreturned the fieldjob_idsbut the OpenRPC schema and typed SDK expectedjobs; renamed to match.tests/integrationharness:find_server_binarynow honorsCARGO_TARGET_DIR(so the test runner doesn't pick up a stale workspace-local binary).Overall: PASS
Implementation Summary
Changes
Server
crates/hero_proc_server/src/rpc/action.rs—action.deletenow cascades: it lists jobs byaction_id, finds their distinct run ids, cancels and deletes each job, deletes each run, and finally deletes the action. The response now includes optionaldeleted_runsanddeleted_jobscounts.crates/hero_proc_server/src/rpc/run.rs— fixedrun.getreturning the field asjob_ids; the OpenRPC schema and typed SDK expectjobs. Renamed for wire-compat.crates/hero_proc_server/src/supervisor/executor.rs— fixed a bug wherejob.started_atwas overwritten to 0 on terminal writes (Failed / Succeeded / Retrying). The runners (run_job_regular,run_job_pty,run_job_ai,run_job_mcp) now setjob.started_at = nowBEFORE the Running write so subsequent..job.clone()spreads carry the correct value into the DB. As a resultJobSummary.started_at_ms,finished_at_ms, andJob.duration_ms()are now correct after a job finishes.crates/hero_proc_server/openrpc.json—action.deleteresult schema replaced fromOkResponseref with an inline object that adds optionaldeleted_runsanddeleted_jobsinteger fields.Library
crates/hero_proc_lib/src/db/runs/model.rs— addedlist_run_ids_for_jobs(conn, &[u32])returning the distinctrun_ids in therun_jobsbridge table for the given job ids.crates/hero_proc_lib/src/db/factory.rs— addedJobsApi::list_by_action(context, action)andRunsApi::list_run_ids_for_jobs(&[u32])thin wrappers used by the cascade-delete handler.Tests
tests/integration/tests/failed_action_lifecycle.rs— new end-to-end integration test driving the full lifecycle of a failing action through the OpenRPC SDK: create action, run it, see the job appear immediately, wait for failed terminal phase, assert exit_code/error/started_at/finished_at, confirm the job stays searchable viajob.get,job.list({phase:"failed"}),job.list({action_id}),run.get, confirm logs persist viajob.logs, then delete the action and assert the cascade removed the run and the job.tests/integration/src/harness.rs—find_server_binarynow honorsCARGO_TARGET_DIRbefore walking workspace-local target dirs, so stale workspace binaries no longer shadow freshly-built ones during test runs.UI testcase
testcases/26_failed_job_lifecycle/26_failed_job_lifecycle.md— browser MCP UI testcase that drives the same lifecycle through the dashboard exposed via hero_router athttp://127.0.0.1:9988/hero_proc/ui/. Numbered 26 because slots 01..25 are taken.Test Results
failed_action_lifecycle: PASShero_proc_lib --lib: 160 passed, 0 failed, 1 ignoredhero_proc_server --lib: 65 passed, 0 failed, 0 ignoredAcceptance criteria
cargo test -p hero_proc_integration_tests --test failed_action_lifecyclepassesjob.logs(id)after the job has reachedfailedjob.listfilters andrun.getdeleted_runs/deleted_jobsoptional fieldstestcases/26_failed_job_lifecycle/targeting hero_routerNotes
127.0.0.1:9988and routing/hero_proc/ui/to hero_proc_ui. The testcase documents this prerequisite at the top.run.get(wrong field name) andexecutor.rs(started_atoverwritten to 0) were not visible from the existing test surface — the new end-to-end test exposed both.