hero_proc_server wedges in futex_do_wait after several hours, drops all RPCs (2nd recurrence) #122

Open
opened 2026-05-23 18:36:47 +00:00 by mik-tf · 5 comments
Owner

Filing the recurring deadlock pattern seen at s150 close and again today at s2-028 boot. Same shape both times; worth tracking before it bites a real deploy.

Symptom

hero_proc_server becomes unresponsive after several hours of uptime. All JSON-RPC calls hang indefinitely (verbose curl waits past a 10s timeout with no headers or body returned, then terminates). The process is still running and SHOWS no error, but is stuck in futex_do_wait per /proc/<pid>/wchan.

Reproduction (today's instance)

  • PID 141603, started early afternoon, alive for several hours by 14:50
  • system.ping via curl -sS --unix-socket .../hero_proc/rpc.sock returns immediately with an EMPTY body, exit 0
  • Verbose curl -v blocks past 8s timeout, terminated by signal 143
  • /proc/141603/wchan -> futex_do_wait
  • /proc/141603/status:
    • State: S (sleeping)
    • Threads: 31
    • voluntary_ctxt_switches: 34
    • nonvoluntary_ctxt_switches: 0
    • VmRSS: 100 MB

The near-zero ctxt-switch counters are the signature: 31 threads but only 34 voluntary context switches across the entire process lifetime means the main runtime is blocked on a futex it will never get woken from. No spurious wakeups, no contention.

Prior recurrence

Same pattern at s150 close (different session): PID 8120 also wedged in futex_wait, all RPCs >10s. Recovered then by SIGKILL plus fresh start. Today recovered the same way (SIGKILL 141603, spawned fresh hero_proc_server PID 195599 which restored 100+ supervised services from the persistent state DB cleanly).

Recovery

kill -9 <pid>
rm -f $HERO_SOCKET_DIR/hero_proc/*.sock
nohup $HOME/hero/bin/hero_proc_server --log-level info > /tmp/hps.log 2>&1 &

The supervised-service state DB on disk survives the restart, so child daemons get re-spawned cleanly. Orphan child processes from the wedged supervisor's prior fork tree keep running but are no longer monitored.

What to look for

Two independent incidents from a 2-day window is a thin pattern, but the signature is distinctive enough to be worth tracking. Candidates:

  1. A tokio task somewhere holds a Mutex across an .await and then drops on a path where the waker never fires
  2. A broadcast::Receiver that lagged but the lagged-handler path forgets to re-arm
  3. A reqwest pool deadlock against a flaky downstream
  4. SQLite WAL contention if hero_proc state DB is sqlite-backed under concurrent reads

Next-occurrence diagnostic: capture eu-stack -p <pid> for the wedged process (gdb-style stack of all 31 threads) before SIGKILL. That should point at the futex's call site.

Impact

High blast radius: when hero_proc wedges, EVERY supervised service loses its service.* control surface, including lab service <name> --start/--stop/--status and the cockpit /services panel. Orphan daemons survive the wedge but lose supervision (no restart on crash, no log routing through hero_proc, no health-check). New service installs queue forever.

Not ship-blocking today, but a real-deploy occurrence on a remote VM would be hard to recover without console access (no lab service --restart works while hero_proc is wedged).

Filing the recurring deadlock pattern seen at s150 close and again today at s2-028 boot. Same shape both times; worth tracking before it bites a real deploy. ## Symptom `hero_proc_server` becomes unresponsive after several hours of uptime. All JSON-RPC calls hang indefinitely (verbose curl waits past a 10s timeout with no headers or body returned, then terminates). The process is still running and SHOWS no error, but is stuck in `futex_do_wait` per `/proc/<pid>/wchan`. ## Reproduction (today's instance) - PID 141603, started early afternoon, alive for several hours by 14:50 - `system.ping` via `curl -sS --unix-socket .../hero_proc/rpc.sock` returns immediately with an EMPTY body, exit 0 - Verbose curl `-v` blocks past 8s timeout, terminated by signal 143 - `/proc/141603/wchan` -> `futex_do_wait` - `/proc/141603/status`: - State: S (sleeping) - Threads: 31 - voluntary_ctxt_switches: 34 - nonvoluntary_ctxt_switches: 0 - VmRSS: 100 MB The near-zero ctxt-switch counters are the signature: 31 threads but only 34 voluntary context switches across the entire process lifetime means the main runtime is blocked on a futex it will never get woken from. No spurious wakeups, no contention. ## Prior recurrence Same pattern at s150 close (different session): PID 8120 also wedged in `futex_wait`, all RPCs >10s. Recovered then by SIGKILL plus fresh start. Today recovered the same way (SIGKILL 141603, spawned fresh `hero_proc_server` PID 195599 which restored 100+ supervised services from the persistent state DB cleanly). ## Recovery ``` kill -9 <pid> rm -f $HERO_SOCKET_DIR/hero_proc/*.sock nohup $HOME/hero/bin/hero_proc_server --log-level info > /tmp/hps.log 2>&1 & ``` The supervised-service state DB on disk survives the restart, so child daemons get re-spawned cleanly. Orphan child processes from the wedged supervisor's prior fork tree keep running but are no longer monitored. ## What to look for Two independent incidents from a 2-day window is a thin pattern, but the signature is distinctive enough to be worth tracking. Candidates: 1. A tokio task somewhere holds a `Mutex` across an `.await` and then drops on a path where the waker never fires 2. A `broadcast::Receiver` that lagged but the lagged-handler path forgets to re-arm 3. A reqwest pool deadlock against a flaky downstream 4. SQLite WAL contention if hero_proc state DB is sqlite-backed under concurrent reads Next-occurrence diagnostic: capture `eu-stack -p <pid>` for the wedged process (gdb-style stack of all 31 threads) before SIGKILL. That should point at the futex's call site. ## Impact High blast radius: when hero_proc wedges, EVERY supervised service loses its `service.*` control surface, including `lab service <name> --start/--stop/--status` and the cockpit `/services` panel. Orphan daemons survive the wedge but lose supervision (no restart on crash, no log routing through hero_proc, no health-check). New service installs queue forever. Not ship-blocking today, but a real-deploy occurrence on a remote VM would be hard to recover without console access (no `lab service --restart` works while hero_proc is wedged).
Member

Research notes from a code-level audit — I went looking for the concurrency primitives most likely to produce a futex_do_wait-shaped wedge with the "34 voluntary ctxt switches over hours" signature. No smoking gun without a stack trace from the next occurrence, but here are the surfaces worth scrutinizing.

Surface inventory

1. crates/hero_proc_server/src/db/factory.rsArc<Mutex<Connection>> (std::sync::Mutex)

57 call sites do let conn = self.conn.lock().unwrap(); then a sync SQLite operation. This is std::sync::Mutex, not tokio::sync::Mutex — every lock is a BLOCKING call from inside an async function, and the work is done synchronously while holding it.

Under concurrent RPC load this can saturate the tokio runtime worker pool: every worker that needs the DB blocks the OS thread, and tokio has no way to wake other tasks on that thread until the lock is released. With the multi-threaded scheduler and N workers, if all N hit DB ops simultaneously, the pool stalls until the holder completes.

That alone wouldn't produce the "34 ctxt switches" signature — it'd cause CPU spinning, not deep park. But if combined with another path that needs runtime progress while holding the mutex, it could explain a hard wedge.

Worth wrapping the DB ops in tokio::task::spawn_blocking regardless — it's the canonical fix for "blocking lib inside async."

2. crates/hero_proc_server/src/logging/sse/mod.rsLogEventBus

inner: Arc<RwLock<HashMap<u32, broadcast::Sender<LogEvent>>>>

Entries are inserted by open(job_id) (called by executor before spawn) and removed by close(job_id) (called by executor after the job's task wrapper completes). If the executor task ever panics, gets aborted, or the supervisor restart path doesn't re-run close(), entries leak. With many subscribers attaching/detaching over hours, lifetime drift here is plausible.

Each entry's broadcast::Sender keeps the channel alive even if no Receiver remains. Subscribers attach via subscribe(job_id) -> Receiver; if the SSE client disconnects without dropping the Receiver cleanly (HTTP/SSE has imperfect close semantics), Receivers can pile up and the lagged-handler path matters.

The web.rs:323 lagged-handler block handles the PTY broadcast case but I didn't see equivalent re-arm logic in the SSE broadcast path. Worth checking whether a backed-up Receiver can wedge a publisher.

3. crates/hero_proc_server/src/supervisor/mod.rs::poll_pending_jobs — double-lock pattern

Lines ~652-660:

let active = self.active.lock().await;
if active.contains_key(&summary.id) { continue; }
drop(active);
let active = self.active.lock().await;     // ← same lock, no state change between
if active.contains_key(&summary.id) { continue; }
drop(active);

Two back-to-back acquisitions of the same tokio::sync::Mutex with no work in between. Either:

  • Intentional cooperative-yield (drop releases the lock; the await on re-acquire gives other tasks a chance) — but then it should be tokio::task::yield_now().await not a re-lock
  • Leftover from a refactoring pass

Not a deadlock candidate itself, but self.active is held across multiple await points elsewhere in this function — the spawned executor task closures also try to lock active to remove themselves. Long supervisor poll iterations on 100+ services × multiple actions could starve the cleanup side.

4. crates/hero_proc_server/src/embed/mod.rs:47 — nested block_on

rt.block_on(run(cfg, cancel_child))

If this is reached from inside the main runtime, it's the classic "block_on inside async" deadlock. Worth checking the call context (haven't traced it fully).

What I'd capture on next occurrence

Mik already noted eu-stack -p <pid> — that's the right call. Adding a small one-liner that grabs it on any future occurrence, before SIGKILL:

PID=$(pgrep -x hero_proc_server | head -1)
if [ -n "$PID" ]; then
  eu-stack -p $PID > /tmp/hps-wedge-$(date +%Y%m%d-%H%M).stack 2>&1
  cat /proc/$PID/status >> /tmp/hps-wedge-$(date +%Y%m%d-%H%M).stack
  ls -la /proc/$PID/task/*/wchan | xargs -I {} sh -c 'echo "{}: $(cat {})"' >> /tmp/hps-wedge-$(date +%Y%m%d-%H%M).stack 2>/dev/null
fi

Three things it gives us:

  1. Stack of every thread → tells us WHICH futex
  2. /proc/<pid>/status → confirms the ctxt-switch counters at wedge time
  3. Per-thread wchan → groups threads by what they're waiting on

If /proc/<pid>/task/*/wchan shows all threads on the same futex address, that's the smoking gun for surface 1 or 3 above.

Not proposing a code fix

The signature is too distinctive to fix speculatively — would risk papering over the wrong surface. Once we have the stack from the next occurrence, the right surface should be obvious, and the fix probably narrow.

Will keep this issue on the priority list. cc @despiegk in case the surfaces above match anything you've been touching.

**Research notes from a code-level audit** — I went looking for the concurrency primitives most likely to produce a `futex_do_wait`-shaped wedge with the "34 voluntary ctxt switches over hours" signature. No smoking gun without a stack trace from the next occurrence, but here are the surfaces worth scrutinizing. ## Surface inventory ### 1. `crates/hero_proc_server/src/db/factory.rs` — `Arc<Mutex<Connection>>` (std::sync::Mutex) 57 call sites do `let conn = self.conn.lock().unwrap();` then a sync SQLite operation. This is `std::sync::Mutex`, **not** `tokio::sync::Mutex` — every lock is a BLOCKING call from inside an async function, and the work is done synchronously while holding it. Under concurrent RPC load this can saturate the tokio runtime worker pool: every worker that needs the DB blocks the OS thread, and tokio has no way to wake other tasks on that thread until the lock is released. With the multi-threaded scheduler and N workers, if all N hit DB ops simultaneously, the pool stalls until the holder completes. That alone wouldn't produce the "34 ctxt switches" signature — it'd cause CPU spinning, not deep park. But if combined with another path that needs runtime progress while holding the mutex, it could explain a hard wedge. Worth wrapping the DB ops in `tokio::task::spawn_blocking` regardless — it's the canonical fix for "blocking lib inside async." ### 2. `crates/hero_proc_server/src/logging/sse/mod.rs` — `LogEventBus` ```rust inner: Arc<RwLock<HashMap<u32, broadcast::Sender<LogEvent>>>> ``` Entries are inserted by `open(job_id)` (called by executor before spawn) and removed by `close(job_id)` (called by executor after the job's task wrapper completes). If the executor task ever panics, gets aborted, or the supervisor restart path doesn't re-run `close()`, entries leak. With many subscribers attaching/detaching over hours, lifetime drift here is plausible. Each entry's `broadcast::Sender` keeps the channel alive even if no Receiver remains. Subscribers attach via `subscribe(job_id) -> Receiver`; if the SSE client disconnects without dropping the Receiver cleanly (HTTP/SSE has imperfect close semantics), Receivers can pile up and the lagged-handler path matters. The `web.rs:323` lagged-handler block handles the PTY broadcast case but I didn't see equivalent re-arm logic in the SSE broadcast path. Worth checking whether a backed-up Receiver can wedge a publisher. ### 3. `crates/hero_proc_server/src/supervisor/mod.rs::poll_pending_jobs` — double-lock pattern Lines ~652-660: ```rust let active = self.active.lock().await; if active.contains_key(&summary.id) { continue; } drop(active); let active = self.active.lock().await; // ← same lock, no state change between if active.contains_key(&summary.id) { continue; } drop(active); ``` Two back-to-back acquisitions of the same `tokio::sync::Mutex` with no work in between. Either: - Intentional cooperative-yield (drop releases the lock; the await on re-acquire gives other tasks a chance) — but then it should be `tokio::task::yield_now().await` not a re-lock - Leftover from a refactoring pass Not a deadlock candidate itself, but `self.active` is held across multiple await points elsewhere in this function — the spawned executor task closures also try to lock `active` to remove themselves. Long supervisor poll iterations on 100+ services × multiple actions could starve the cleanup side. ### 4. `crates/hero_proc_server/src/embed/mod.rs:47` — nested `block_on` ```rust rt.block_on(run(cfg, cancel_child)) ``` If this is reached from inside the main runtime, it's the classic "block_on inside async" deadlock. Worth checking the call context (haven't traced it fully). ## What I'd capture on next occurrence Mik already noted `eu-stack -p <pid>` — that's the right call. Adding a small one-liner that grabs it on any future occurrence, before SIGKILL: ```bash PID=$(pgrep -x hero_proc_server | head -1) if [ -n "$PID" ]; then eu-stack -p $PID > /tmp/hps-wedge-$(date +%Y%m%d-%H%M).stack 2>&1 cat /proc/$PID/status >> /tmp/hps-wedge-$(date +%Y%m%d-%H%M).stack ls -la /proc/$PID/task/*/wchan | xargs -I {} sh -c 'echo "{}: $(cat {})"' >> /tmp/hps-wedge-$(date +%Y%m%d-%H%M).stack 2>/dev/null fi ``` Three things it gives us: 1. Stack of every thread → tells us WHICH futex 2. `/proc/<pid>/status` → confirms the ctxt-switch counters at wedge time 3. Per-thread `wchan` → groups threads by what they're waiting on If `/proc/<pid>/task/*/wchan` shows all threads on the same futex address, that's the smoking gun for surface 1 or 3 above. ## Not proposing a code fix The signature is too distinctive to fix speculatively — would risk papering over the wrong surface. Once we have the stack from the next occurrence, the right surface should be obvious, and the fix probably narrow. Will keep this issue on the priority list. cc @despiegk in case the surfaces above match anything you've been touching.
Member

Status update on the four suspect surfaces I called out in the research comment:

Surface Status
1. db/factory.rs Arc<std::sync::Mutex<Connection>> + 57 hot-path lock-in-async sites Filed as #124 — proposed migration to deadpool-sqlite (or equivalent) connection pool. Big change; awaiting prioritization. Eliminates worker-pool starvation, also resolves #121's parallelism cap.
2. LogEventBus leak on executor task panic Fixed in 837bdb0 (closes #123). RAII Drop guard reclaims the channel even if the spawned task panics. Reduces wedge surface area but doesn't prove this was the wedge cause.
3. supervisor/mod.rs::poll_pending_jobs double-lock pattern Cosmetic / no leak risk; not a wedge candidate. Will clean up next time I touch that function.
4. embed/mod.rs:47 rt.block_on(...) Confirmed false alarm — called from a freshly-spawned std::thread, not from inside an async context. Standard embed pattern, not deadlock-prone.

So out of four candidates: 1 fixed (#123837bdb0), 1 filed as defect with proposed fix (#124), 1 false alarm, 1 confirmed cosmetic.

Still need the next reproduction's eu-stack -p <pid> capture before SIGKILL to confirm which (if any) of the remaining surfaces is actually causal. The capture script from the earlier comment is still the right diagnostic; I haven't found anything that should be done speculatively without it.

If the wedge recurs after #124 lands, that strongly narrows the search to whatever's NOT in this surface list.

**Status update on the four suspect surfaces** I called out in the [research comment](https://forge.ourworld.tf/lhumina_code/hero_proc/issues/122#issuecomment-36479): | Surface | Status | |---|---| | 1. `db/factory.rs` `Arc<std::sync::Mutex<Connection>>` + 57 hot-path lock-in-async sites | **Filed as #124** — proposed migration to `deadpool-sqlite` (or equivalent) connection pool. Big change; awaiting prioritization. Eliminates worker-pool starvation, also resolves #121's parallelism cap. | | 2. `LogEventBus` leak on executor task panic | **Fixed in `837bdb0`** (closes #123). RAII Drop guard reclaims the channel even if the spawned task panics. Reduces wedge surface area but doesn't *prove* this was the wedge cause. | | 3. `supervisor/mod.rs::poll_pending_jobs` double-lock pattern | Cosmetic / no leak risk; not a wedge candidate. Will clean up next time I touch that function. | | 4. `embed/mod.rs:47` `rt.block_on(...)` | **Confirmed false alarm** — called from a freshly-spawned `std::thread`, not from inside an async context. Standard embed pattern, not deadlock-prone. | So out of four candidates: 1 fixed (#123 → 837bdb0), 1 filed as defect with proposed fix (#124), 1 false alarm, 1 confirmed cosmetic. **Still need the next reproduction's `eu-stack -p <pid>` capture** before SIGKILL to confirm which (if any) of the remaining surfaces is actually causal. The capture script from the earlier comment is still the right diagnostic; I haven't found anything that should be done speculatively without it. If the wedge recurs after #124 lands, that strongly narrows the search to whatever's NOT in this surface list.
Author
Owner

Reproduced the wedge on a workstation today with PID 1720620, 1h17m of uptime, sustained 83.1% CPU. After full diagnostic the root cause is not a Mutex or futex deadlock, contrary to the earlier hypothesis on this issue. The supervisor was busy, not blocked: all 31 threads showed State: S (sleeping) with wchan=futex_do_wait (normal tokio worker idle state), but the main thread was burning CPU running the scheduler loop. The actions table at /home/pctwo/hero/var/hero_proc.db contained 45 leaked integration-test scheduled actions (names like sched-interval-short-8398, sched-rapid-8398, sched-multi-context, sched19-66171, plus 14 sched-window-*-8398 variants). Together they had created 2,429 jobs in this supervisor's lifetime (383 in the last 60 seconds alone, so 6.4 jobs per second sustained), plus 9,276 invalid-cron WARN entries from two of those actions whose cron strings are unparseable. The database had grown to 352 MB.

Recovery without a code change: DELETE FROM actions WHERE name LIKE 'sched-%' OR name LIKE 'sched%-66171' (45 rows), DELETE FROM jobs WHERE action LIKE 'sched-%' OR action LIKE 'sched%-66171' (1,932 rows), VACUUM. The database dropped from 352 MB to 909 KB. After supervisor restart on the cleaned database, the new PID 3319656 returned RPC calls in 3 ms (was timing out at 10 seconds), held at 0.8% CPU baseline, generated zero scheduler jobs and zero invalid-cron WARN entries in the first 80 seconds, and used 34 MB of RSS (was 254 MB). Two structural follow-ups filed for the underlying causes: #126 (integration tests leak scheduled actions into the operator database and should either use an ephemeral test database or tag test actions for auto-purge on supervisor startup) and #127 (scheduler re-evaluates unparseable cron expressions every tick and emits WARN at 2 per second instead of hard-disabling the action on first parse failure). With those two fixes in place this wedge cannot recur.

Signed-by: mik-tf mik-tf@noreply.invalid

Reproduced the wedge on a workstation today with PID 1720620, 1h17m of uptime, sustained 83.1% CPU. After full diagnostic the root cause is not a Mutex or futex deadlock, contrary to the earlier hypothesis on this issue. The supervisor was busy, not blocked: all 31 threads showed `State: S (sleeping)` with `wchan=futex_do_wait` (normal tokio worker idle state), but the main thread was burning CPU running the scheduler loop. The actions table at /home/pctwo/hero/var/hero_proc.db contained 45 leaked integration-test scheduled actions (names like `sched-interval-short-8398`, `sched-rapid-8398`, `sched-multi-context`, `sched19-66171`, plus 14 `sched-window-*-8398` variants). Together they had created 2,429 jobs in this supervisor's lifetime (383 in the last 60 seconds alone, so 6.4 jobs per second sustained), plus 9,276 invalid-cron WARN entries from two of those actions whose cron strings are unparseable. The database had grown to 352 MB. Recovery without a code change: `DELETE FROM actions WHERE name LIKE 'sched-%' OR name LIKE 'sched%-66171'` (45 rows), `DELETE FROM jobs WHERE action LIKE 'sched-%' OR action LIKE 'sched%-66171'` (1,932 rows), `VACUUM`. The database dropped from 352 MB to 909 KB. After supervisor restart on the cleaned database, the new PID 3319656 returned RPC calls in 3 ms (was timing out at 10 seconds), held at 0.8% CPU baseline, generated zero scheduler jobs and zero invalid-cron WARN entries in the first 80 seconds, and used 34 MB of RSS (was 254 MB). Two structural follow-ups filed for the underlying causes: https://forge.ourworld.tf/lhumina_code/hero_proc/issues/126 (integration tests leak scheduled actions into the operator database and should either use an ephemeral test database or tag test actions for auto-purge on supervisor startup) and https://forge.ourworld.tf/lhumina_code/hero_proc/issues/127 (scheduler re-evaluates unparseable cron expressions every tick and emits WARN at 2 per second instead of hard-disabling the action on first parse failure). With those two fixes in place this wedge cannot recur. Signed-by: mik-tf <mik-tf@noreply.invalid>
Member

Status update — NOT closing yet, but the leading hypothesis is now addressed.

The issue body listed four candidates:

  1. A tokio task holds a Mutex across an .await and then drops on a path where the waker never fires — addressed by #124 (a58bdf0 on development). The Arc<std::sync::Mutex<Connection>> anti-pattern is removed in favour of deadpool_sqlite::Pool + interact. No async task now holds a sync mutex across an .await.
  2. A broadcast::Receiver that lagged but the lagged-handler path forgets to re-arm — partly addressed by #123's LogEventBus RAII rewrite (837bdb0), which fixed channel-entry leak on panic. The lagged-receiver path itself was not changed.
  3. A reqwest pool deadlock against a flaky downstream — not addressed.
  4. SQLite WAL contention if hero_proc state DB is sqlite-backed under concurrent reads — not addressed (pool actually allows more concurrent reads now, but WAL is single-writer regardless of pool size).

What still needs to happen to close this: a multi-hour soak test under sustained load with the new code. If the futex wedge does not recur, hypothesis 1 is confirmed and this can close. If it does recur, capture eu-stack -p <pid> before SIGKILL (as the issue body recommends) to point at the actual futex's call site — that'd narrow to hypothesis 2, 3, or 4.

No new code changes proposed in this comment; this is a tracking update so the issue doesn't sit dormant.

Status update — NOT closing yet, but the leading hypothesis is now addressed. The issue body listed four candidates: 1. **A tokio task holds a `Mutex` across an `.await` and then drops on a path where the waker never fires** — addressed by #124 (`a58bdf0` on `development`). The `Arc<std::sync::Mutex<Connection>>` anti-pattern is removed in favour of `deadpool_sqlite::Pool` + `interact`. No async task now holds a sync mutex across an `.await`. 2. A `broadcast::Receiver` that lagged but the lagged-handler path forgets to re-arm — partly addressed by #123's LogEventBus RAII rewrite (`837bdb0`), which fixed channel-entry leak on panic. The lagged-receiver path itself was not changed. 3. A reqwest pool deadlock against a flaky downstream — not addressed. 4. SQLite WAL contention if hero_proc state DB is sqlite-backed under concurrent reads — not addressed (pool actually allows more concurrent reads now, but WAL is single-writer regardless of pool size). **What still needs to happen to close this:** a multi-hour soak test under sustained load with the new code. If the futex wedge does not recur, hypothesis 1 is confirmed and this can close. If it does recur, capture `eu-stack -p <pid>` before SIGKILL (as the issue body recommends) to point at the actual futex's call site — that'd narrow to hypothesis 2, 3, or 4. No new code changes proposed in this comment; this is a tracking update so the issue doesn't sit dormant.
Member

Correction to my comment 36777 — I should have referenced @mik-tf's autopsy 36585 directly, and the framing "#124 addresses hypothesis 1" was overclaiming. Walking it back honestly here.

What I now think the picture is

Per @mik-tf's autopsy of the May-24 reproduction (PID 1720620, 1h17m uptime, 83.1% CPU sustained, 9,276 invalid-cron WARN, 45 leaked sched-* actions, 2,429 jobs, DB grew to 352 MB): the root cause was a busy scheduler loop, not a mutex deadlock. RPC calls hung as a side-effect (main runtime spending all its time on scheduler ticks). #127 (closed by 4ac8fb0 — single ERROR on first cron parse failure + cron_disabled flag) addresses the per-tick re-evaluation cost; #126 (still open — integration tests leaking actions into the operator DB) is the upstream cause.

Signature-mismatch question I want to flag

The May-24 autopsy is the most informed analysis on this issue, but I'd like to ask one clarifying thing before treating this as settled. The two reports describe genuinely different process signatures:

Original #122 body (May 23, PID 141603) Autopsy 36585 (May 24, PID 1720620)
Uptime "several hours" 1h17m
CPU not stated (implied near-zero) sustained 83.1%
voluntary_ctxt_switches 34 not stated
nonvoluntary_ctxt_switches 0 not stated
VmRSS 100 MB 254 MB
DB size not stated 352 MB (45 leaked actions, 1,932 jobs)
Threads 31 in S + wchan=futex_do_wait 31 in S + wchan=futex_do_wait (workers idle); main thread on-CPU

The thread wchan + state are the same on both. What diverges is the ctxt-switch count and the CPU profile. voluntary=34, nonvoluntary=0 over hours across 31 threads is a hard-progress-stalled signature — a CPU-burning main thread should produce many ctxt switches (kernel preempts a CPU-bound thread continually; the worker threads would also accumulate switches from their normal park/unpark cycles even if they're idle). The May-24 autopsy is consistent with a busy-loop wedge; the May-23 numbers (if taken at face value) look more like a hard-progress-stall wedge.

@mik-tf — should the May-23 ctxt-switch counts be considered superseded by the May-24 autopsy (e.g., transient observation / different reading), or do they describe a genuinely different observation? If the latter, there may still be a separate hard-stall variant worth tracking — though after a58bdf0 + 837bdb0 (see below) I no longer see an obvious code-level surface that would produce it.

What a58bdf0 (#124) and 837bdb0 (#123) DO still cover

Going back to the 4 wedge-surface candidates from my code audit comment 36479:

  • Surface 2 (LogEventBus channel-entry leak on executor task panic) is fixed by 837bdb0 (#123). The RAII OpenChannelGuard ensures the channel map entry is reclaimed on any drop path including stack unwind, so a panicking executor task can no longer leak a broadcast::Sender + its receiver set. That class of lifecycle leak is gone.
  • Surface 1 (Arc<std::sync::Mutex<Connection>> held across .await) is fixed by a58bdf0 (#124). The migration eliminates the canonical Tokio anti-pattern from hero_proc_server; details + post-merge benchmark moved off this thread to #124 comment 36972 where it belongs.
  • Surfaces 3 and 4 (supervisor double-lock pattern + embed/mod.rs block_on) were ruled out at audit time as false alarms / cosmetic.

So if the May-23 signature represented a Mutex-across-.await wedge from the DB or logging surfaces, the architectural cause is gone. If it was from somewhere we didn't enumerate (e.g., glibc-malloc-arena futex, a tokio mutex held in a task that got aborted between acquire and release, etc.), those are NOT addressed by #124.

My honest revised position

  1. The strongest read of the evidence is @mik-tf's autopsy interpretation is correct: #122 was a CPU-burn wedge whose root cause is the #126/#127 pair, with 4ac8fb0 already in place for #127. Once #126 closes, this wedge cannot recur.
  2. #124 (a58bdf0) was filed and shipped as a defect on its own merits — it removes a real anti-pattern, plus correctness fixes (complete_wipe row count, env::set_var soundness, cancel-vs-exit-status TOCTOU race, multi-step handler atomicity, batch delete_many). It also incidentally protects against any future Mutex-across-.await deadlock variant that might appear in this codebase. Perf verification + tradeoffs are on the #124 thread, not here.
  3. The "#124 addresses #122 hypothesis 1" claim in my comment 36777 should be considered withdrawn pending @mik-tf's clarification on whether the May-23 signature can be considered superseded.

Not closing #122 — that's @mik-tf's call as the issue owner and #126 is still open. If the answer to the signature question is "the May-23 numbers were a transient/aberrant reading," #122 should close with @mik-tf's autopsy + #126/#127 as the resolution. If it's "they were two distinct observations conflated into one issue," then the deadlock variant deserves its own issue so we can track whether it recurs post-a58bdf0.

Correction to my [comment 36777](https://forge.ourworld.tf/lhumina_code/hero_proc/issues/122#issuecomment-36777) — I should have referenced @mik-tf's [autopsy 36585](https://forge.ourworld.tf/lhumina_code/hero_proc/issues/122#issuecomment-36585) directly, and the framing "#124 addresses hypothesis 1" was overclaiming. Walking it back honestly here. ## What I now think the picture is Per @mik-tf's autopsy of the May-24 reproduction (PID 1720620, 1h17m uptime, **83.1% CPU sustained**, 9,276 invalid-cron WARN, 45 leaked sched-* actions, 2,429 jobs, DB grew to 352 MB): the root cause was a **busy scheduler loop**, not a mutex deadlock. RPC calls hung as a side-effect (main runtime spending all its time on scheduler ticks). #127 (closed by `4ac8fb0` — single ERROR on first cron parse failure + `cron_disabled` flag) addresses the per-tick re-evaluation cost; #126 (still open — integration tests leaking actions into the operator DB) is the upstream cause. ## Signature-mismatch question I want to flag The May-24 autopsy is the most informed analysis on this issue, but I'd like to ask one clarifying thing before treating this as settled. The two reports describe genuinely different process signatures: | | Original #122 body (May 23, PID 141603) | Autopsy 36585 (May 24, PID 1720620) | |---|---|---| | Uptime | "several hours" | 1h17m | | CPU | not stated (implied near-zero) | **sustained 83.1%** | | `voluntary_ctxt_switches` | **34** | not stated | | `nonvoluntary_ctxt_switches` | **0** | not stated | | VmRSS | 100 MB | 254 MB | | DB size | not stated | 352 MB (45 leaked actions, 1,932 jobs) | | Threads | 31 in `S` + `wchan=futex_do_wait` | 31 in `S` + `wchan=futex_do_wait` (workers idle); main thread on-CPU | The thread `wchan` + state are the same on both. What diverges is the ctxt-switch count and the CPU profile. `voluntary=34, nonvoluntary=0` over hours across 31 threads is a hard-progress-stalled signature — a CPU-burning main thread should produce many ctxt switches (kernel preempts a CPU-bound thread continually; the worker threads would also accumulate switches from their normal park/unpark cycles even if they're idle). The May-24 autopsy is consistent with a busy-loop wedge; the May-23 numbers (if taken at face value) look more like a hard-progress-stall wedge. @mik-tf — should the May-23 ctxt-switch counts be considered superseded by the May-24 autopsy (e.g., transient observation / different reading), or do they describe a genuinely different observation? If the latter, there may still be a separate hard-stall variant worth tracking — though after `a58bdf0` + `837bdb0` (see below) I no longer see an obvious code-level surface that would produce it. ## What `a58bdf0` (#124) and `837bdb0` (#123) DO still cover Going back to the 4 wedge-surface candidates from my [code audit comment 36479](https://forge.ourworld.tf/lhumina_code/hero_proc/issues/122#issuecomment-36479): - **Surface 2 (LogEventBus channel-entry leak on executor task panic)** is fixed by `837bdb0` (#123). The RAII `OpenChannelGuard` ensures the channel map entry is reclaimed on any drop path including stack unwind, so a panicking executor task can no longer leak a `broadcast::Sender` + its receiver set. That class of lifecycle leak is gone. - **Surface 1 (`Arc<std::sync::Mutex<Connection>>` held across .await)** is fixed by `a58bdf0` (#124). The migration eliminates the canonical Tokio anti-pattern from `hero_proc_server`; details + post-merge benchmark moved off this thread to [#124 comment 36972](https://forge.ourworld.tf/lhumina_code/hero_proc/issues/124#issuecomment-36972) where it belongs. - Surfaces 3 and 4 (supervisor double-lock pattern + `embed/mod.rs` block_on) were ruled out at audit time as false alarms / cosmetic. So if the May-23 signature represented a Mutex-across-`.await` wedge from the DB or logging surfaces, the architectural cause is gone. If it was from somewhere we didn't enumerate (e.g., glibc-malloc-arena futex, a tokio mutex held in a task that got aborted between acquire and release, etc.), those are NOT addressed by #124. ## My honest revised position 1. The strongest read of the evidence is **@mik-tf's autopsy interpretation is correct**: #122 was a CPU-burn wedge whose root cause is the #126/#127 pair, with `4ac8fb0` already in place for #127. Once #126 closes, this wedge cannot recur. 2. #124 (`a58bdf0`) was filed and shipped as a defect on its own merits — it removes a real anti-pattern, plus correctness fixes (`complete_wipe` row count, `env::set_var` soundness, cancel-vs-exit-status TOCTOU race, multi-step handler atomicity, batch `delete_many`). It also incidentally protects against any future Mutex-across-`.await` deadlock variant that might appear in this codebase. Perf verification + tradeoffs are on the #124 thread, not here. 3. The "#124 addresses #122 hypothesis 1" claim in my comment 36777 should be considered withdrawn pending @mik-tf's clarification on whether the May-23 signature can be considered superseded. Not closing #122 — that's @mik-tf's call as the issue owner and #126 is still open. If the answer to the signature question is "the May-23 numbers were a transient/aberrant reading," #122 should close with @mik-tf's autopsy + #126/#127 as the resolution. If it's "they were two distinct observations conflated into one issue," then the deadlock variant deserves its own issue so we can track whether it recurs post-`a58bdf0`.
Sign in to join this conversation.
No milestone
No project
No assignees
2 participants
Notifications
Due date
The due date is invalid or out of range. Please use the format "yyyy-mm-dd".

No due date set.

Dependencies

No dependencies set.

Reference
lhumina_code/hero_proc#122
No description provided.