Multi-step RPC handlers (run.submit, create_with_jobs, service_define) are non-atomic — adopt db.transaction() helper #128
Labels
No labels
prio_critical
prio_low
type_bug
type_contact
type_issue
type_lead
type_question
type_story
type_task
No milestone
No project
No assignees
1 participant
Notifications
Due date
No due date set.
Dependencies
No dependencies set.
Reference
lhumina_code/hero_proc#128
Loading…
Add table
Add a link
Reference in a new issue
No description provided.
Delete branch "%!s()"
Deleting a branch is permanent. Although the deleted branch may continue to exist for a short time before it actually gets removed, it CANNOT be undone in most cases. Continue?
What
After #124 landed (
a58bdf0), several RPC handlers that issue multiple sequentialdb.*.*().awaitcalls are non-atomic. The race class existed before #124 (the oldArc<Mutex<Connection>>was acquired per sub-API call, not per handler), but the pool migration widened the race window from "scheduler-overhead nanoseconds between sync sub-API calls" to "any task can run between.awaitpoints". The probability of a real-world interleaving is now orders of magnitude higher.Concrete handlers affected
crates/hero_proc_server/src/rpc/run.rs::handle_submit(lines ~5826+): inserts a Run, then loops over actions issuingdb.actions.get().await,db.jobs.set().await,db.runs.add_job().await. A concurrentservice.stopor anotherrun.submitcan interleave at any.awaitboundary. Has a hand-rolledrollback_submitasync helper but the rollback itself can race.crates/hero_proc_server/src/rpc/run.rs::handle_create_with_jobs: same shape.crates/hero_proc_server/src/service/define.rs::service_define_inner: writes 1-3 actions then 1 service. A concurrent reader can observe a partially-defined service (actions exist, service doesn't yet, or vice versa).User-visible failure mode
run.submitreturns success while the run's child jobs failed to insert → UI shows "submitted run with no jobs". Worse: orphaned Jobs whose Run was rolled back but the rollback closure raced with another writer.Suggested fix
#124 shipped
HeroProcDb::transaction(|tx| ...)specifically as the fix vehicle for this race class. Each handler's body should be moved inside a transaction closure so the entire multi-step write is one atomic SQLite transaction on one pooled connection. The closure body becomes sync rusqlite (no.awaitinsideinteract) — error responses are built in the async caller from the closure'sResult.Scope: ~500-1000 LOC of restructuring across the 3 handlers + regression tests for racing concurrent submits.
References
transaction()helper as the fix vehicle.Closed by
ef6d564onorigin/development.Summary in commit body: adopted
db.transaction()inrun.submit/run.create_with_jobs/service_define_inner(#128); addedJobsApi/RunsApi/ActionsApi::delete_manyand swapped 2 known loop sites (#129); also restoredspawn_blockingaround git CLI indo_pull_secrets/do_push_secrets(worker-stall regression from #124).Verification: lib regression 242/0; Stage 6 release daemon clean (run.submit + wipe_all + 50× parallel status_all + 20-job churn all green); integration baseline preserved (62 PASS / 7 FAIL, all pre-existing env-only).
External architect review: APPROVE, 5 non-blocking weaknesses (2 fixed in
b2fd17ccleanup before squash: dead actions_model alias, now_ms clock-skew window).