fix(proc): resolve fail-alone integration test bugs (#136)

mahmoud commented

2026-06-07 17:44:17 +00:00

Owner

Summary

Stabilizes the hero_proc integration suite for #136. On a clean DB the full suite is green: 281 / 281 (verified at head 5acd170). The work is in three parts: (a) the original deterministic fail-alone bug fixes, (b) three real server-side log/DB leak bugs found while hardening the suite, and (c) test-harness hardening for the logger's async read-after-write behaviour.

Server fixes

Leak bugs — these affect any hero_proc consumer, not just tests

Bug	Fix
`job.purge` deleted job rows but never their on-disk log dirs → `logs/core/<action>/<job_id>/` leaked forever	`purge_jobs` now captures `(context, action, id)` before deleting rows; `job.purge` deletes each purged job's logs (mirrors `job.clean`)
`clean_by_tag` only removed per-live-job log dirs; once a row was purged/capped its log dir was orphaned and unreachable via the log API	deletes the action row first (so the scheduler stops re-firing it), then removes the action's whole log subtree — via the log store and a physical `remove_dir_all`
`run_quick_submit` persisted an `__inline:run=<id>` action per inline action that nothing ever deleted → unbounded `actions` growth	runs now cascade-delete their inline actions on `delete` / `delete_many`

Fail-alone bug fixes (category 2 in #136)

Area	Change
`uc06`/`uc07` retry off-by-one	normalize first job `attempt` to 1 at the `run_job` dispatch chokepoint (`executor.rs`)
`uc19`/`uc20` scheduler jobs unattributable	scheduler `create_scheduled_job` sets `action_id` (`engine.rs`); tests resolve scheduled jobs via `jobs_find(action_sid)`
`uc37`/`uc38` service never started	explicit `service_start` after `service_quick_submit`; `uc38` VERSION_2 check polls
log ordering / src normalization / service cascade	`to_chronological` for job logs; shared `normalize_src` for query and delete; `service_delete` tag cascade

Test-harness hardening

Bounded concurrency: run_all caps concurrent tests via buffer_unordered — default 3, override with HERO_PROC_TEST_CONCURRENCY — order-preserving. Unbounded concurrency outran the logger.
poll_until added at ~40 racy log/count/disk read sites (read-after-async-write).
wait_for_job_creation now uses jobs_find(action_sid) instead of the old active_jobs gauge. (Correction to the earlier version of this description: this helper was changed. The previous jobs_find attempt was reverted only because scheduled jobs lacked action_id; that is now fixed server-side, so the jobs_find "fired" signal is honest.)
uc34 serialized after its siblings — its global job_purge was deleting their jobs mid-test.
clean_test_data_removes_everything: stops the scheduler before cleaning, and asserts the logger's queryable store (logs.count == 0) per deleted action instead of raw-filesystem emptiness — the file logger can flush after a dir is removed, so the FS check was inherently racy.

Verification

Clean-DB full suite at 5acd170: 281 / 281, 0 failures. (Supersedes the earlier 263/281, which was measured before the leak fixes.)
Cargo.lock / hero_lib pin (0b06c634) unchanged; clippy clean for edited files.
Known limitation, tracked in #141: the logger has no drain/flush-sync API — the root cause of the read-after-write races these harness changes work around. On a reused (un-wiped) server the cleanup test can still flake on a late buffered flush; CI runs against a fresh server (where it is 281/281).

Refs #136. Follow-up: #141.

## Summary Stabilizes the hero_proc integration suite for #136. On a **clean DB the full suite is green: 281 / 281** (verified at head `5acd170`). The work is in three parts: (a) the original deterministic *fail-alone* bug fixes, (b) **three real server-side log/DB leak bugs** found while hardening the suite, and (c) test-harness hardening for the logger's async read-after-write behaviour. ## Server fixes ### Leak bugs — these affect any hero_proc consumer, not just tests | Bug | Fix | |---|---| | `job.purge` deleted job rows but never their on-disk log dirs → `logs/core/<action>/<job_id>/` leaked forever | `purge_jobs` now captures `(context, action, id)` before deleting rows; `job.purge` deletes each purged job's logs (mirrors `job.clean`) | | `clean_by_tag` only removed per-*live-job* log dirs; once a row was purged/capped its log dir was orphaned and unreachable via the log API | deletes the action row first (so the scheduler stops re-firing it), then removes the action's whole log subtree — via the log store **and** a physical `remove_dir_all` | | `run_quick_submit` persisted an `__inline:run=<id>` action per inline action that nothing ever deleted → unbounded `actions` growth | runs now cascade-delete their inline actions on `delete` / `delete_many` | ### Fail-alone bug fixes (category 2 in #136) | Area | Change | |---|---| | `uc06`/`uc07` retry off-by-one | normalize first job `attempt` to 1 at the `run_job` dispatch chokepoint (`executor.rs`) | | `uc19`/`uc20` scheduler jobs unattributable | scheduler `create_scheduled_job` sets `action_id` (`engine.rs`); tests resolve scheduled jobs via `jobs_find(action_sid)` | | `uc37`/`uc38` service never started | explicit `service_start` after `service_quick_submit`; `uc38` VERSION_2 check polls | | log ordering / src normalization / service cascade | `to_chronological` for job logs; shared `normalize_src` for query **and** delete; `service_delete` tag cascade | ## Test-harness hardening - **Bounded concurrency**: `run_all` caps concurrent tests via `buffer_unordered` — default **3**, override with `HERO_PROC_TEST_CONCURRENCY` — order-preserving. Unbounded concurrency outran the logger. - **`poll_until`** added at ~40 racy log/count/disk read sites (read-after-async-write). - **`wait_for_job_creation` now uses `jobs_find(action_sid)`** instead of the old `active_jobs` gauge. *(Correction to the earlier version of this description: this helper **was** changed. The previous `jobs_find` attempt was reverted only because scheduled jobs lacked `action_id`; that is now fixed server-side, so the `jobs_find` "fired" signal is honest.)* - **`uc34` serialized** after its siblings — its global `job_purge` was deleting their jobs mid-test. - **`clean_test_data_removes_everything`**: stops the scheduler before cleaning, and asserts the logger's **queryable store** (`logs.count == 0`) per deleted action instead of raw-filesystem emptiness — the file logger can flush *after* a dir is removed, so the FS check was inherently racy. ## Verification - **Clean-DB full suite at `5acd170`: 281 / 281, 0 failures.** (Supersedes the earlier 263/281, which was measured before the leak fixes.) - `Cargo.lock` / hero_lib pin (`0b06c634`) unchanged; clippy clean for edited files. - **Known limitation, tracked in #141**: the logger has no drain/flush-sync API — the root cause of the read-after-write races these harness changes work around. On a *reused* (un-wiped) server the cleanup test can still flake on a late buffered flush; CI runs against a fresh server (where it is 281/281). Refs #136. Follow-up: #141.

mahmoud added 3 commits

2026-06-07 17:44:17 +00:00

fix(test): use renamed StateOutput.output field aecc09ad69

The system.state output field was renamed value -> output server-side; the
test still read .value, breaking compilation of the whole suite.

fix(test): drop always-true log assertion 3bc8fa8fcd

`!logs.is_empty() || true` is always true (clippy overly_complex_bool_expr,
deny-by-default). The `?` already proves job_log_lines didn't error and logs
may legitimately be empty, so assert nothing.

fix: fix integration test failures and server bugs

Build and Test / build (pull_request) Successful in 7m21s

Details

eb924c443b

- default nr_of_instances=1; count job attempts 1-based
- scope service_delete cascade by service:<name> tag
- run_by_name picks newest run; scheduler sets action_id
- persist inline run actions like service_quick_submit
- unique() test names; jobs_find lookup; explicit start

Refs #136

mahmoud added 1 commit

2026-06-07 21:40:40 +00:00

Merge branch 'development' into development_fix_proc_integration_tests

Build and Test / build (pull_request) Waiting to run

Details

7d5705afde

mahmoud added 2 commits

2026-06-07 21:41:41 +00:00

fix: stabilize full integration suite to 281/281 cf41003cc7

- job_logs chronological order; logs.delete src normalize
- poll_until helper + bounded run_all concurrency cap
- scheduler wait via jobs_find; uc34 purge runs serially
- convert racy log/job/disk reads to poll-until
- fix delete/cleanup file checks; daemon_singleton tokens

Refs #136

Merge remote development_fix_proc_integration_tests (development sync) into local stabilization work

Build and Test / build (pull_request) Has been cancelled

Details

1677f9f835

mahmoud referenced this pull request

2026-06-07 21:42:57 +00:00

fix new proc on development, some integration tests left #136

mahmoud added 1 commit

2026-06-07 21:56:42 +00:00

docs(logging): fix normalize_src/filter_to_query doc placement

Build and Test / build (pull_request) Successful in 18m32s

Details

bcd032f268

normalize_src was inserted inside filter_to_query's doc block; move it
above so filter_to_query keeps its 'Translate a LogFilter' doc.

mahmoud changed title from ~~WIP: fix(proc): resolve fail-alone integration test bugs (#136)~~ to fix(proc): resolve fail-alone integration test bugs (#136)

2026-06-08 06:48:09 +00:00

mahmoud added 2 commits

2026-06-08 11:45:53 +00:00

fix(jobs,logs): stop log-file leaks on purge/cleanup and bound inline actions cd58114552

Three real server bugs found while stabilizing the integration suite:

- job.purge deleted job rows but never their on-disk log dirs, leaking
  logs on long-running instances. purge_jobs now returns the purged
  (context, action, id) tuples so job.purge and the legacy handler delete
  each job's logs, matching job.clean.
- clean_by_tag only removed per-live-job log dirs; once a job row was gone
  (purge/cap/clean) its log dir was orphaned and unreachable via the log
  API. It now deletes the action row first (so the scheduler stops firing
  it) then removes the action's whole log subtree, via the log store and a
  physical remove_dir_all.
- run_quick_submit persisted an __inline:run=<id> action per inline action
  that nothing ever cleaned, growing the actions table without bound. Runs
  now cascade-delete their inline actions on delete.

cleanup test: stop all schedules before cleaning so nothing recreates log
dirs mid-cleanup, and assert the logger's queryable store (logs.count == 0)
per deleted action instead of racy raw-filesystem state.

Full suite green (281/281) on a fresh server.

chore: Remove Cargo.toml.hero_builder_backup

Build and Test / build (pull_request) Has been cancelled

Details

5232b29e25

mahmoud added 1 commit

2026-06-08 11:59:35 +00:00

chore(review): silence dead_code on delete_individual_error, drop dead handle_purge edit

Build and Test / build (pull_request) Successful in 17m16s

Details

5acd170561

- delete_individual_error is used only by the test-runner binary; the lib
  target compiles the module but never calls it. Add #[allow(dead_code)]
  to match the sibling write_individual_error and keep clippy clean.
- Revert the log-cleanup edit to rpc::job::handle_purge: that handler has
  no dispatch (db::rpc::dispatch is uncalled; the live purge path is
  jobs_impl::job_purge, already fixed), so the edit was on dead code.

mahmoud referenced this pull request

2026-06-08 12:04:42 +00:00

Logger lacks a drain/flush-sync API (root cause of read-after-write races) #141

mahmoud merged commit a5f37ab706 into development

2026-06-08 12:25:41 +00:00

mahmoud referenced this pull request from a commit

2026-06-08 12:25:42 +00:00

fix(proc): resolve fail-alone integration test bugs (#136) (#139)

mahmoud deleted branch development_fix_proc_integration_tests

2026-06-08 12:25:45 +00:00

mahmoud referenced this pull request from a commit

2026-06-08 12:40:43 +00:00