fix(proc): resolve fail-alone integration test bugs (#136) #139

Merged
mahmoud merged 10 commits from development_fix_proc_integration_tests into development 2026-06-08 12:25:41 +00:00
Owner

Summary

Stabilizes the hero_proc integration suite for #136. On a clean DB the full suite is green: 281 / 281 (verified at head 5acd170). The work is in three parts: (a) the original deterministic fail-alone bug fixes, (b) three real server-side log/DB leak bugs found while hardening the suite, and (c) test-harness hardening for the logger's async read-after-write behaviour.

Server fixes

Leak bugs — these affect any hero_proc consumer, not just tests

Bug Fix
job.purge deleted job rows but never their on-disk log dirs → logs/core/<action>/<job_id>/ leaked forever purge_jobs now captures (context, action, id) before deleting rows; job.purge deletes each purged job's logs (mirrors job.clean)
clean_by_tag only removed per-live-job log dirs; once a row was purged/capped its log dir was orphaned and unreachable via the log API deletes the action row first (so the scheduler stops re-firing it), then removes the action's whole log subtree — via the log store and a physical remove_dir_all
run_quick_submit persisted an __inline:run=<id> action per inline action that nothing ever deleted → unbounded actions growth runs now cascade-delete their inline actions on delete / delete_many

Fail-alone bug fixes (category 2 in #136)

Area Change
uc06/uc07 retry off-by-one normalize first job attempt to 1 at the run_job dispatch chokepoint (executor.rs)
uc19/uc20 scheduler jobs unattributable scheduler create_scheduled_job sets action_id (engine.rs); tests resolve scheduled jobs via jobs_find(action_sid)
uc37/uc38 service never started explicit service_start after service_quick_submit; uc38 VERSION_2 check polls
log ordering / src normalization / service cascade to_chronological for job logs; shared normalize_src for query and delete; service_delete tag cascade

Test-harness hardening

  • Bounded concurrency: run_all caps concurrent tests via buffer_unordered — default 3, override with HERO_PROC_TEST_CONCURRENCY — order-preserving. Unbounded concurrency outran the logger.
  • poll_until added at ~40 racy log/count/disk read sites (read-after-async-write).
  • wait_for_job_creation now uses jobs_find(action_sid) instead of the old active_jobs gauge. (Correction to the earlier version of this description: this helper was changed. The previous jobs_find attempt was reverted only because scheduled jobs lacked action_id; that is now fixed server-side, so the jobs_find "fired" signal is honest.)
  • uc34 serialized after its siblings — its global job_purge was deleting their jobs mid-test.
  • clean_test_data_removes_everything: stops the scheduler before cleaning, and asserts the logger's queryable store (logs.count == 0) per deleted action instead of raw-filesystem emptiness — the file logger can flush after a dir is removed, so the FS check was inherently racy.

Verification

  • Clean-DB full suite at 5acd170: 281 / 281, 0 failures. (Supersedes the earlier 263/281, which was measured before the leak fixes.)
  • Cargo.lock / hero_lib pin (0b06c634) unchanged; clippy clean for edited files.
  • Known limitation, tracked in #141: the logger has no drain/flush-sync API — the root cause of the read-after-write races these harness changes work around. On a reused (un-wiped) server the cleanup test can still flake on a late buffered flush; CI runs against a fresh server (where it is 281/281).

Refs #136. Follow-up: #141.

## Summary Stabilizes the hero_proc integration suite for #136. On a **clean DB the full suite is green: 281 / 281** (verified at head `5acd170`). The work is in three parts: (a) the original deterministic *fail-alone* bug fixes, (b) **three real server-side log/DB leak bugs** found while hardening the suite, and (c) test-harness hardening for the logger's async read-after-write behaviour. ## Server fixes ### Leak bugs — these affect any hero_proc consumer, not just tests | Bug | Fix | |---|---| | `job.purge` deleted job rows but never their on-disk log dirs → `logs/core/<action>/<job_id>/` leaked forever | `purge_jobs` now captures `(context, action, id)` before deleting rows; `job.purge` deletes each purged job's logs (mirrors `job.clean`) | | `clean_by_tag` only removed per-*live-job* log dirs; once a row was purged/capped its log dir was orphaned and unreachable via the log API | deletes the action row first (so the scheduler stops re-firing it), then removes the action's whole log subtree — via the log store **and** a physical `remove_dir_all` | | `run_quick_submit` persisted an `__inline:run=<id>` action per inline action that nothing ever deleted → unbounded `actions` growth | runs now cascade-delete their inline actions on `delete` / `delete_many` | ### Fail-alone bug fixes (category 2 in #136) | Area | Change | |---|---| | `uc06`/`uc07` retry off-by-one | normalize first job `attempt` to 1 at the `run_job` dispatch chokepoint (`executor.rs`) | | `uc19`/`uc20` scheduler jobs unattributable | scheduler `create_scheduled_job` sets `action_id` (`engine.rs`); tests resolve scheduled jobs via `jobs_find(action_sid)` | | `uc37`/`uc38` service never started | explicit `service_start` after `service_quick_submit`; `uc38` VERSION_2 check polls | | log ordering / src normalization / service cascade | `to_chronological` for job logs; shared `normalize_src` for query **and** delete; `service_delete` tag cascade | ## Test-harness hardening - **Bounded concurrency**: `run_all` caps concurrent tests via `buffer_unordered` — default **3**, override with `HERO_PROC_TEST_CONCURRENCY` — order-preserving. Unbounded concurrency outran the logger. - **`poll_until`** added at ~40 racy log/count/disk read sites (read-after-async-write). - **`wait_for_job_creation` now uses `jobs_find(action_sid)`** instead of the old `active_jobs` gauge. *(Correction to the earlier version of this description: this helper **was** changed. The previous `jobs_find` attempt was reverted only because scheduled jobs lacked `action_id`; that is now fixed server-side, so the `jobs_find` "fired" signal is honest.)* - **`uc34` serialized** after its siblings — its global `job_purge` was deleting their jobs mid-test. - **`clean_test_data_removes_everything`**: stops the scheduler before cleaning, and asserts the logger's **queryable store** (`logs.count == 0`) per deleted action instead of raw-filesystem emptiness — the file logger can flush *after* a dir is removed, so the FS check was inherently racy. ## Verification - **Clean-DB full suite at `5acd170`: 281 / 281, 0 failures.** (Supersedes the earlier 263/281, which was measured before the leak fixes.) - `Cargo.lock` / hero_lib pin (`0b06c634`) unchanged; clippy clean for edited files. - **Known limitation, tracked in #141**: the logger has no drain/flush-sync API — the root cause of the read-after-write races these harness changes work around. On a *reused* (un-wiped) server the cleanup test can still flake on a late buffered flush; CI runs against a fresh server (where it is 281/281). Refs #136. Follow-up: #141.
The system.state output field was renamed value -> output server-side; the
test still read .value, breaking compilation of the whole suite.
`!logs.is_empty() || true` is always true (clippy overly_complex_bool_expr,
deny-by-default). The `?` already proves job_log_lines didn't error and logs
may legitimately be empty, so assert nothing.
fix: fix integration test failures and server bugs
All checks were successful
Build and Test / build (pull_request) Successful in 7m21s
eb924c443b
- default nr_of_instances=1; count job attempts 1-based
- scope service_delete cascade by service:<name> tag
- run_by_name picks newest run; scheduler sets action_id
- persist inline run actions like service_quick_submit
- unique() test names; jobs_find lookup; explicit start

Refs #136
Merge branch 'development' into development_fix_proc_integration_tests
Some checks are pending
Build and Test / build (pull_request) Waiting to run
7d5705afde
- job_logs chronological order; logs.delete src normalize
- poll_until helper + bounded run_all concurrency cap
- scheduler wait via jobs_find; uc34 purge runs serially
- convert racy log/job/disk reads to poll-until
- fix delete/cleanup file checks; daemon_singleton tokens

Refs #136
docs(logging): fix normalize_src/filter_to_query doc placement
All checks were successful
Build and Test / build (pull_request) Successful in 18m32s
bcd032f268
normalize_src was inserted inside filter_to_query's doc block; move it
above so filter_to_query keeps its 'Translate a LogFilter' doc.
mahmoud changed title from WIP: fix(proc): resolve fail-alone integration test bugs (#136) to fix(proc): resolve fail-alone integration test bugs (#136) 2026-06-08 06:48:09 +00:00
Three real server bugs found while stabilizing the integration suite:

- job.purge deleted job rows but never their on-disk log dirs, leaking
  logs on long-running instances. purge_jobs now returns the purged
  (context, action, id) tuples so job.purge and the legacy handler delete
  each job's logs, matching job.clean.
- clean_by_tag only removed per-live-job log dirs; once a job row was gone
  (purge/cap/clean) its log dir was orphaned and unreachable via the log
  API. It now deletes the action row first (so the scheduler stops firing
  it) then removes the action's whole log subtree, via the log store and a
  physical remove_dir_all.
- run_quick_submit persisted an __inline:run=<id> action per inline action
  that nothing ever cleaned, growing the actions table without bound. Runs
  now cascade-delete their inline actions on delete.

cleanup test: stop all schedules before cleaning so nothing recreates log
dirs mid-cleanup, and assert the logger's queryable store (logs.count == 0)
per deleted action instead of racy raw-filesystem state.

Full suite green (281/281) on a fresh server.
chore: Remove Cargo.toml.hero_builder_backup
Some checks failed
Build and Test / build (pull_request) Has been cancelled
5232b29e25
chore(review): silence dead_code on delete_individual_error, drop dead handle_purge edit
All checks were successful
Build and Test / build (pull_request) Successful in 17m16s
5acd170561
- delete_individual_error is used only by the test-runner binary; the lib
  target compiles the module but never calls it. Add #[allow(dead_code)]
  to match the sibling write_individual_error and keep clippy clean.
- Revert the log-cleanup edit to rpc::job::handle_purge: that handler has
  no dispatch (db::rpc::dispatch is uncalled; the live purge path is
  jobs_impl::job_purge, already fixed), so the edit was on dead code.
mahmoud merged commit a5f37ab706 into development 2026-06-08 12:25:41 +00:00
mahmoud deleted branch development_fix_proc_integration_tests 2026-06-08 12:25:45 +00:00
Sign in to join this conversation.
No reviewers
No milestone
No project
No assignees
1 participant
Notifications
Due date
The due date is invalid or out of range. Please use the format "yyyy-mm-dd".

No due date set.

Dependencies

No dependencies set.

Reference
lhumina_code/hero_proc!139
No description provided.