P3/hygiene: supervisor active map leaks JoinHandle entries on panic — no functional impact, opportunistic cleanup #125
Labels
No labels
prio_critical
prio_low
type_bug
type_contact
type_issue
type_lead
type_question
type_story
type_task
No milestone
No project
No assignees
1 participant
Notifications
Due date
No due date set.
Dependencies
No dependencies set.
Reference
lhumina_code/hero_proc#125
Loading…
Add table
Add a link
Reference in a new issue
No description provided.
Delete branch "%!s()"
Deleting a branch is permanent. Although the deleted branch may continue to exist for a short time before it actually gets removed, it CANNOT be undone in most cases. Continue?
Severity: P3 / hygiene — NOT a functional defect
Disclosure: I filed this as P1 originally then re-analyzed and downgraded. The original framing overstated the impact by pattern-matching to #123 without checking whether the leak actually manifests as user-visible breakage. It doesn't. Filing it anyway so the cleanup isn't forgotten, but it's a hygiene item — not something to schedule in a sprint.
The technical leak
After landing
hero_proc#123(LogEventBus RAII fix, commit837bdb0), the same panic-leak pattern persists for a different per-job resource: the supervisor'sactive: Arc<Mutex<HashMap<u32, JoinHandle<()>>>>map.crates/hero_proc_server/src/supervisor/mod.rsspawn site:If
executor::run_jobpanics, the spawned task terminates withJoinError::Panicand theactive.lock().await.remove(&job_id)line is never reached — theJoinHandleentry stays in the map.Why it's NOT a real defect
Walking the actual consequences:
JoinHandle<()>is ~100 bytes including tokio internal state. Even with thousands of panic-leaked entries over weeks: KB-MB scale. Not an ops concern.u32job_id collision causing skipped jobs — the original "this is a real bug" framing. But job IDs are SQLiteAUTOINCREMENTrow IDs, strictly monotonic, never recycled. For au32to wrap you'd need ≥4 billion jobs in the daemon's lifetime. Doesn't happen. The collision-induced skip can't fire in practice.Contrast with #123: that leak caused SSE clients to hang waiting for a
log.doneevent that never arrived. Real user-visible breakage. This one has no equivalent.When to take it
Pick this up opportunistically next time
supervisor/mod.rsis being refactored anyway — bundle it into that change. Don't schedule it standalone; the ROI doesn't justify the surface churn on a hot file.Fix shape (when the time comes)
Cheapest option — a periodic reaper in the existing supervisor poll loop:
Wire alongside the other periodic checks. ~10 lines + 2 unit tests. No SDK / wire-format change.
Alternatives (bigger but more idiomatic):
tokio::task::JoinSet— handles task lifecycle natively. Larger refactor: parallelHashMap<u32, AbortHandle>needed for the existingcontains_key+abortpatterns.catch_unwindwrapper inside the spawn closure so the cleanup line runs even on panic. ~5 lines, but keeps the manual-pair footgun in the type system.Cross-reference
837bdb0)crates/hero_proc_server/src/supervisor/mod.rsspawn closure (~line 790)supervisor's active map leaks JoinHandle entries on executor task panic — same family as #123to P3/hygiene: supervisor active map leaks JoinHandle entries on panic — no functional impact, opportunistic cleanup