fix(proc): resolve fail-alone integration test bugs (#136) #139
Reference
lhumina_code/hero_proc!139
Summary
Stabilizes the hero_proc integration suite for #136. On a clean DB the full suite is green: 281 / 281 (verified at head 5acd170). The work is in three parts: (a) the original deterministic fail-alone bug fixes, (b) three real server-side log/DB leak bugs found while hardening the suite, and (c) test-harness hardening for the logger's async read-after-write behaviour.

Server fixes
Leak bugs — these affect any hero_proc consumer, not just tests
| Bug | Fix |
| --- | --- |
| `job.purge` deleted job rows but never their on-disk log dirs → `logs/core/<action>/<job_id>/` leaked forever | `purge_jobs` now captures `(context, action, id)` before deleting rows; `job.purge` deletes each purged job's logs (mirrors `job.clean`) |
| `clean_by_tag` only removed per-live-job log dirs; once a row was purged/capped its log dir was orphaned and unreachable via the log API | `remove_dir_all` |
| `run_quick_submit` persisted an `__inline:run=<id>` action per inline action that nothing ever deleted → unbounded `actions` growth | `delete`/`delete_many` |

Fail-alone bug fixes (category 2 in #136)
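Looking back at the `job.purge` leak in the Server fixes table: the key point is that the `(context, action, id)` triple has to be captured before the rows are deleted, because afterwards nothing identifies the log directory any more. A minimal sketch of that ordering, using an in-memory `Vec` in place of the DB. The `JobRow` and `PurgedJob` types, the `finished` flag, the `remove_job_logs` helper, and the `<root>/<context>/<action>/<job_id>/` layout are assumptions for illustration, not the actual hero_proc code.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Hypothetical job row; field names are illustrative only.
#[derive(Debug, Clone, PartialEq)]
struct JobRow {
    context: String,
    action: String,
    id: u64,
    finished: bool,
}

/// What is needed to locate a job's log dir once its row is gone.
#[derive(Debug, Clone, PartialEq)]
struct PurgedJob {
    context: String,
    action: String,
    id: u64,
}

/// Delete finished rows and return the keys captured *before* deletion.
fn purge_jobs(rows: &mut Vec<JobRow>) -> Vec<PurgedJob> {
    let purged: Vec<PurgedJob> = rows
        .iter()
        .filter(|r| r.finished)
        .map(|r| PurgedJob {
            context: r.context.clone(),
            action: r.action.clone(),
            id: r.id,
        })
        .collect();
    rows.retain(|r| !r.finished);
    purged
}

/// Assumed layout: <root>/<context>/<action>/<job_id>/
fn log_dir(root: &Path, job: &PurgedJob) -> PathBuf {
    root.join(&job.context)
        .join(&job.action)
        .join(job.id.to_string())
}

/// Remove the log dir of every purged job; a missing dir is not an error.
fn remove_job_logs(root: &Path, purged: &[PurgedJob]) -> io::Result<()> {
    for job in purged {
        match fs::remove_dir_all(log_dir(root, job)) {
            Ok(()) => {}
            Err(e) if e.kind() == io::ErrorKind::NotFound => {}
            Err(e) => return Err(e),
        }
    }
    Ok(())
}
```

In this sketch a `NotFound` error is treated as success so that a repeated purge stays idempotent; whether the real handler does the same is not stated in this PR.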
| Bug | Fix |
| --- | --- |
| `uc06`/`uc07` retry off-by-one | `attempt` to 1 at the `run_job` dispatch chokepoint (executor.rs) |
| `uc19`/`uc20` scheduler jobs unattributable | `create_scheduled_job` sets `action_id` (engine.rs); tests resolve scheduled jobs via `jobs_find(action_sid)` |
| `uc37`/`uc38` service never started | `service_start` after `service_quick_submit`; `uc38` VERSION_2 check polls; `to_chronological` for job logs; shared `normalize_src` for query and delete; `service_delete` tag cascade |

Test-harness hardening
- `run_all` caps concurrent tests via `buffer_unordered` (default 3, override with `HERO_PROC_TEST_CONCURRENCY`; order-preserving). Unbounded concurrency outran the logger.
- `poll_until` added at ~40 racy log/count/disk read sites (read-after-async-write).
- `wait_for_job_creation` now uses `jobs_find(action_sid)` instead of the old `active_jobs` gauge. (Correction to the earlier version of this description: this helper was changed. The previous `jobs_find` attempt was reverted only because scheduled jobs lacked `action_id`; that is now fixed server-side, so the `jobs_find` "fired" signal is honest.)
- `uc34` serialized after its siblings: its global `job_purge` was deleting their jobs mid-test.
- `clean_test_data_removes_everything`: stops the scheduler before cleaning, and asserts the logger's queryable store (`logs.count == 0`) per deleted action instead of raw-filesystem emptiness. The file logger can flush after a dir is removed, so the FS check was inherently racy.

Verification
- 5acd170: 281 / 281, 0 failures. (Supersedes the earlier 263/281, which was measured before the leak fixes.)
- `Cargo.lock` / hero_lib pin (0b06c634) unchanged; clippy clean for edited files.

Refs #136. Follow-up: #141.
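
Note on the read-after-async-write polling mentioned under Test-harness hardening: the idea is to re-check a condition until it holds or a deadline passes, rather than reading once right after the write. A minimal synchronous sketch is below. The real `poll_until` helper is presumably async, and its signature is not shown in this PR, so the parameters here (`timeout`, `interval`, a closure returning `Option<T>`) are assumptions.

```rust
use std::thread;
use std::time::{Duration, Instant};

/// Re-run `check` until it yields `Some(value)` or `timeout` elapses.
/// Returns `None` on timeout. Signature is hypothetical (see lead-in).
fn poll_until<T>(
    timeout: Duration,
    interval: Duration,
    mut check: impl FnMut() -> Option<T>,
) -> Option<T> {
    let deadline = Instant::now() + timeout;
    loop {
        // Always check at least once, even with a zero timeout.
        if let Some(value) = check() {
            return Some(value);
        }
        if Instant::now() >= deadline {
            return None;
        }
        thread::sleep(interval);
    }
}
```

A test would wrap a racy read, for example a log count query, in the closure and assert on the returned `Option` instead of asserting on a single immediate read.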
Title changed from "WIP: fix(proc): resolve fail-alone integration test bugs (#136)" to "fix(proc): resolve fail-alone integration test bugs (#136)".