P3/hygiene: supervisor active map leaks JoinHandle entries on panic — no functional impact, opportunistic cleanup #125

New issue

Open

opened 2026-05-24 11:03:43 +00:00 by sameh-farouk · 0 comments

sameh-farouk commented

2026-05-24 11:03:43 +00:00

Member

Severity: P3 / hygiene — NOT a functional defect

Disclosure: I filed this as P1 originally then re-analyzed and downgraded. The original framing overstated the impact by pattern-matching to #123 without checking whether the leak actually manifests as user-visible breakage. It doesn't. Filing it anyway so the cleanup isn't forgotten, but it's a hygiene item — not something to schedule in a sprint.

The technical leak

After landing hero_proc#123 (LogEventBus RAII fix, commit 837bdb0), the same panic-leak pattern persists for a different per-job resource: the supervisor's active: Arc<Mutex<HashMap<u32, JoinHandle<()>>>> map.

crates/hero_proc_server/src/supervisor/mod.rs spawn site:

let handle = tokio::spawn(async move {
    let _channel = channel_guard;       // RAII — LogEventBus, fixed in #123
    executor::run_job(...).await;
    if let Some(p) = probe_handle { p.abort(); }
    active.lock().await.remove(&job_id); // ← only on happy path
});

self.active.lock().await.insert(job_id, handle);

If executor::run_job panics, the spawned task terminates with JoinError::Panic and the active.lock().await.remove(&job_id) line is never reached — the JoinHandle entry stays in the map.

Why it's NOT a real defect

Walking the actual consequences:

Memory leak — JoinHandle<()> is ~100 bytes including tokio internal state. Even with thousands of panic-leaked entries over weeks: KB-MB scale. Not an ops concern.
u32 job_id collision causing skipped jobs — the original "this is a real bug" framing. But job IDs are SQLite AUTOINCREMENT row IDs, strictly monotonic, never recycled. For a u32 to wrap you'd need ≥4 billion jobs in the daemon's lifetime. Doesn't happen. The collision-induced skip can't fire in practice.
Pinned task memory — same hygiene category as (1).

Contrast with #123: that leak caused SSE clients to hang waiting for a log.done event that never arrived. Real user-visible breakage. This one has no equivalent.

When to take it

Pick this up opportunistically next time supervisor/mod.rs is being refactored anyway — bundle it into that change. Don't schedule it standalone; the ROI doesn't justify the surface churn on a hot file.

Fix shape (when the time comes)

Cheapest option — a periodic reaper in the existing supervisor poll loop:

async fn reap_finished_active(&self) {
    let mut active = self.active.lock().await;
    active.retain(|_, handle| !handle.is_finished());
}

Wire alongside the other periodic checks. ~10 lines + 2 unit tests. No SDK / wire-format change.

Alternatives (bigger but more idiomatic):

Migrate to tokio::task::JoinSet — handles task lifecycle natively. Larger refactor: parallel HashMap<u32, AbortHandle> needed for the existing contains_key + abort patterns.
catch_unwind wrapper inside the spawn closure so the cleanup line runs even on panic. ~5 lines, but keeps the manual-pair footgun in the type system.

Cross-reference

Pattern precedent: #123 (LogEventBus channel leak, shipped 837bdb0)
Surface: crates/hero_proc_server/src/supervisor/mod.rs spawn closure (~line 790)

## Severity: P3 / hygiene — NOT a functional defect **Disclosure: I filed this as P1 originally then re-analyzed and downgraded. The original framing overstated the impact by pattern-matching to #123 without checking whether the leak actually manifests as user-visible breakage. It doesn't. Filing it anyway so the cleanup isn't forgotten, but it's a hygiene item — not something to schedule in a sprint.** ## The technical leak After landing `hero_proc#123` (LogEventBus RAII fix, commit `837bdb0`), the same panic-leak pattern persists for a different per-job resource: the supervisor's `active: Arc<Mutex<HashMap<u32, JoinHandle<()>>>>` map. `crates/hero_proc_server/src/supervisor/mod.rs` spawn site: ```rust let handle = tokio::spawn(async move { let _channel = channel_guard; // RAII — LogEventBus, fixed in #123 executor::run_job(...).await; if let Some(p) = probe_handle { p.abort(); } active.lock().await.remove(&job_id); // ← only on happy path }); self.active.lock().await.insert(job_id, handle); ``` If `executor::run_job` panics, the spawned task terminates with `JoinError::Panic` and the `active.lock().await.remove(&job_id)` line is never reached — the `JoinHandle` entry stays in the map. ## Why it's NOT a real defect Walking the actual consequences: 1. **Memory leak** — `JoinHandle<()>` is ~100 bytes including tokio internal state. Even with thousands of panic-leaked entries over weeks: KB-MB scale. Not an ops concern. 2. **`u32` job_id collision causing skipped jobs** — the original "this is a real bug" framing. But job IDs are SQLite `AUTOINCREMENT` row IDs, strictly monotonic, never recycled. For a `u32` to wrap you'd need ≥4 billion jobs in the daemon's lifetime. **Doesn't happen.** The collision-induced skip can't fire in practice. 3. **Pinned task memory** — same hygiene category as (1). Contrast with **#123**: that leak caused SSE clients to hang waiting for a `log.done` event that never arrived. Real user-visible breakage. This one has no equivalent. ## When to take it Pick this up opportunistically next time `supervisor/mod.rs` is being refactored anyway — bundle it into that change. Don't schedule it standalone; the ROI doesn't justify the surface churn on a hot file. ## Fix shape (when the time comes) Cheapest option — a periodic reaper in the existing supervisor poll loop: ```rust async fn reap_finished_active(&self) { let mut active = self.active.lock().await; active.retain(|_, handle| !handle.is_finished()); } ``` Wire alongside the other periodic checks. ~10 lines + 2 unit tests. No SDK / wire-format change. Alternatives (bigger but more idiomatic): - Migrate to `tokio::task::JoinSet` — handles task lifecycle natively. Larger refactor: parallel `HashMap<u32, AbortHandle>` needed for the existing `contains_key` + `abort` patterns. - `catch_unwind` wrapper inside the spawn closure so the cleanup line runs even on panic. ~5 lines, but keeps the manual-pair footgun in the type system. ## Cross-reference - Pattern precedent: #123 (LogEventBus channel leak, shipped `837bdb0`) - Surface: `crates/hero_proc_server/src/supervisor/mod.rs` spawn closure (~line 790)

sameh-farouk changed title from ~~supervisor's active map leaks JoinHandle entries on executor task panic — same family as #123~~ to P3/hygiene: supervisor active map leaks JoinHandle entries on panic — no functional impact, opportunistic cleanup

2026-05-24 11:11:53 +00:00