hero_proc_server wedges in futex_do_wait after several hours, drops all RPCs (2nd recurrence) #122
Labels
No labels
prio_critical
prio_low
type_bug
type_contact
type_issue
type_lead
type_question
type_story
type_task
No milestone
No project
No assignees
2 participants
Notifications
Due date
No due date set.
Dependencies
No dependencies set.
Reference
lhumina_code/hero_proc#122
Loading…
Add table
Add a link
Reference in a new issue
No description provided.
Delete branch "%!s()"
Deleting a branch is permanent. Although the deleted branch may continue to exist for a short time before it actually gets removed, it CANNOT be undone in most cases. Continue?
Filing the recurring deadlock pattern seen at s150 close and again today at s2-028 boot. Same shape both times; worth tracking before it bites a real deploy.
Symptom
hero_proc_serverbecomes unresponsive after several hours of uptime. All JSON-RPC calls hang indefinitely (verbose curl waits past a 10s timeout with no headers or body returned, then terminates). The process is still running and SHOWS no error, but is stuck infutex_do_waitper/proc/<pid>/wchan.Reproduction (today's instance)
system.pingviacurl -sS --unix-socket .../hero_proc/rpc.sockreturns immediately with an EMPTY body, exit 0-vblocks past 8s timeout, terminated by signal 143/proc/141603/wchan->futex_do_wait/proc/141603/status:The near-zero ctxt-switch counters are the signature: 31 threads but only 34 voluntary context switches across the entire process lifetime means the main runtime is blocked on a futex it will never get woken from. No spurious wakeups, no contention.
Prior recurrence
Same pattern at s150 close (different session): PID 8120 also wedged in
futex_wait, all RPCs >10s. Recovered then by SIGKILL plus fresh start. Today recovered the same way (SIGKILL 141603, spawned freshhero_proc_serverPID 195599 which restored 100+ supervised services from the persistent state DB cleanly).Recovery
The supervised-service state DB on disk survives the restart, so child daemons get re-spawned cleanly. Orphan child processes from the wedged supervisor's prior fork tree keep running but are no longer monitored.
What to look for
Two independent incidents from a 2-day window is a thin pattern, but the signature is distinctive enough to be worth tracking. Candidates:
Mutexacross an.awaitand then drops on a path where the waker never firesbroadcast::Receiverthat lagged but the lagged-handler path forgets to re-armNext-occurrence diagnostic: capture
eu-stack -p <pid>for the wedged process (gdb-style stack of all 31 threads) before SIGKILL. That should point at the futex's call site.Impact
High blast radius: when hero_proc wedges, EVERY supervised service loses its
service.*control surface, includinglab service <name> --start/--stop/--statusand the cockpit/servicespanel. Orphan daemons survive the wedge but lose supervision (no restart on crash, no log routing through hero_proc, no health-check). New service installs queue forever.Not ship-blocking today, but a real-deploy occurrence on a remote VM would be hard to recover without console access (no
lab service --restartworks while hero_proc is wedged).Research notes from a code-level audit — I went looking for the concurrency primitives most likely to produce a
futex_do_wait-shaped wedge with the "34 voluntary ctxt switches over hours" signature. No smoking gun without a stack trace from the next occurrence, but here are the surfaces worth scrutinizing.Surface inventory
1.
crates/hero_proc_server/src/db/factory.rs—Arc<Mutex<Connection>>(std::sync::Mutex)57 call sites do
let conn = self.conn.lock().unwrap();then a sync SQLite operation. This isstd::sync::Mutex, nottokio::sync::Mutex— every lock is a BLOCKING call from inside an async function, and the work is done synchronously while holding it.Under concurrent RPC load this can saturate the tokio runtime worker pool: every worker that needs the DB blocks the OS thread, and tokio has no way to wake other tasks on that thread until the lock is released. With the multi-threaded scheduler and N workers, if all N hit DB ops simultaneously, the pool stalls until the holder completes.
That alone wouldn't produce the "34 ctxt switches" signature — it'd cause CPU spinning, not deep park. But if combined with another path that needs runtime progress while holding the mutex, it could explain a hard wedge.
Worth wrapping the DB ops in
tokio::task::spawn_blockingregardless — it's the canonical fix for "blocking lib inside async."2.
crates/hero_proc_server/src/logging/sse/mod.rs—LogEventBusEntries are inserted by
open(job_id)(called by executor before spawn) and removed byclose(job_id)(called by executor after the job's task wrapper completes). If the executor task ever panics, gets aborted, or the supervisor restart path doesn't re-runclose(), entries leak. With many subscribers attaching/detaching over hours, lifetime drift here is plausible.Each entry's
broadcast::Senderkeeps the channel alive even if no Receiver remains. Subscribers attach viasubscribe(job_id) -> Receiver; if the SSE client disconnects without dropping the Receiver cleanly (HTTP/SSE has imperfect close semantics), Receivers can pile up and the lagged-handler path matters.The
web.rs:323lagged-handler block handles the PTY broadcast case but I didn't see equivalent re-arm logic in the SSE broadcast path. Worth checking whether a backed-up Receiver can wedge a publisher.3.
crates/hero_proc_server/src/supervisor/mod.rs::poll_pending_jobs— double-lock patternLines ~652-660:
Two back-to-back acquisitions of the same
tokio::sync::Mutexwith no work in between. Either:tokio::task::yield_now().awaitnot a re-lockNot a deadlock candidate itself, but
self.activeis held across multiple await points elsewhere in this function — the spawned executor task closures also try to lockactiveto remove themselves. Long supervisor poll iterations on 100+ services × multiple actions could starve the cleanup side.4.
crates/hero_proc_server/src/embed/mod.rs:47— nestedblock_onIf this is reached from inside the main runtime, it's the classic "block_on inside async" deadlock. Worth checking the call context (haven't traced it fully).
What I'd capture on next occurrence
Mik already noted
eu-stack -p <pid>— that's the right call. Adding a small one-liner that grabs it on any future occurrence, before SIGKILL:Three things it gives us:
/proc/<pid>/status→ confirms the ctxt-switch counters at wedge timewchan→ groups threads by what they're waiting onIf
/proc/<pid>/task/*/wchanshows all threads on the same futex address, that's the smoking gun for surface 1 or 3 above.Not proposing a code fix
The signature is too distinctive to fix speculatively — would risk papering over the wrong surface. Once we have the stack from the next occurrence, the right surface should be obvious, and the fix probably narrow.
Will keep this issue on the priority list. cc @despiegk in case the surfaces above match anything you've been touching.
Status update on the four suspect surfaces I called out in the research comment:
db/factory.rsArc<std::sync::Mutex<Connection>>+ 57 hot-path lock-in-async sitesdeadpool-sqlite(or equivalent) connection pool. Big change; awaiting prioritization. Eliminates worker-pool starvation, also resolves #121's parallelism cap.LogEventBusleak on executor task panic837bdb0(closes #123). RAII Drop guard reclaims the channel even if the spawned task panics. Reduces wedge surface area but doesn't prove this was the wedge cause.supervisor/mod.rs::poll_pending_jobsdouble-lock patternembed/mod.rs:47rt.block_on(...)std::thread, not from inside an async context. Standard embed pattern, not deadlock-prone.So out of four candidates: 1 fixed (#123 →
837bdb0), 1 filed as defect with proposed fix (#124), 1 false alarm, 1 confirmed cosmetic.Still need the next reproduction's
eu-stack -p <pid>capture before SIGKILL to confirm which (if any) of the remaining surfaces is actually causal. The capture script from the earlier comment is still the right diagnostic; I haven't found anything that should be done speculatively without it.If the wedge recurs after #124 lands, that strongly narrows the search to whatever's NOT in this surface list.
Reproduced the wedge on a workstation today with PID 1720620, 1h17m of uptime, sustained 83.1% CPU. After full diagnostic the root cause is not a Mutex or futex deadlock, contrary to the earlier hypothesis on this issue. The supervisor was busy, not blocked: all 31 threads showed
State: S (sleeping)withwchan=futex_do_wait(normal tokio worker idle state), but the main thread was burning CPU running the scheduler loop. The actions table at /home/pctwo/hero/var/hero_proc.db contained 45 leaked integration-test scheduled actions (names likesched-interval-short-8398,sched-rapid-8398,sched-multi-context,sched19-66171, plus 14sched-window-*-8398variants). Together they had created 2,429 jobs in this supervisor's lifetime (383 in the last 60 seconds alone, so 6.4 jobs per second sustained), plus 9,276 invalid-cron WARN entries from two of those actions whose cron strings are unparseable. The database had grown to 352 MB.Recovery without a code change:
DELETE FROM actions WHERE name LIKE 'sched-%' OR name LIKE 'sched%-66171'(45 rows),DELETE FROM jobs WHERE action LIKE 'sched-%' OR action LIKE 'sched%-66171'(1,932 rows),VACUUM. The database dropped from 352 MB to 909 KB. After supervisor restart on the cleaned database, the new PID 3319656 returned RPC calls in 3 ms (was timing out at 10 seconds), held at 0.8% CPU baseline, generated zero scheduler jobs and zero invalid-cron WARN entries in the first 80 seconds, and used 34 MB of RSS (was 254 MB). Two structural follow-ups filed for the underlying causes: #126 (integration tests leak scheduled actions into the operator database and should either use an ephemeral test database or tag test actions for auto-purge on supervisor startup) and #127 (scheduler re-evaluates unparseable cron expressions every tick and emits WARN at 2 per second instead of hard-disabling the action on first parse failure). With those two fixes in place this wedge cannot recur.Signed-by: mik-tf mik-tf@noreply.invalid
Status update — NOT closing yet, but the leading hypothesis is now addressed.
The issue body listed four candidates:
Mutexacross an.awaitand then drops on a path where the waker never fires — addressed by #124 (a58bdf0ondevelopment). TheArc<std::sync::Mutex<Connection>>anti-pattern is removed in favour ofdeadpool_sqlite::Pool+interact. No async task now holds a sync mutex across an.await.broadcast::Receiverthat lagged but the lagged-handler path forgets to re-arm — partly addressed by #123's LogEventBus RAII rewrite (837bdb0), which fixed channel-entry leak on panic. The lagged-receiver path itself was not changed.What still needs to happen to close this: a multi-hour soak test under sustained load with the new code. If the futex wedge does not recur, hypothesis 1 is confirmed and this can close. If it does recur, capture
eu-stack -p <pid>before SIGKILL (as the issue body recommends) to point at the actual futex's call site — that'd narrow to hypothesis 2, 3, or 4.No new code changes proposed in this comment; this is a tracking update so the issue doesn't sit dormant.
Correction to my comment 36777 — I should have referenced @mik-tf's autopsy 36585 directly, and the framing "#124 addresses hypothesis 1" was overclaiming. Walking it back honestly here.
What I now think the picture is
Per @mik-tf's autopsy of the May-24 reproduction (PID 1720620, 1h17m uptime, 83.1% CPU sustained, 9,276 invalid-cron WARN, 45 leaked sched-* actions, 2,429 jobs, DB grew to 352 MB): the root cause was a busy scheduler loop, not a mutex deadlock. RPC calls hung as a side-effect (main runtime spending all its time on scheduler ticks). #127 (closed by
4ac8fb0— single ERROR on first cron parse failure +cron_disabledflag) addresses the per-tick re-evaluation cost; #126 (still open — integration tests leaking actions into the operator DB) is the upstream cause.Signature-mismatch question I want to flag
The May-24 autopsy is the most informed analysis on this issue, but I'd like to ask one clarifying thing before treating this as settled. The two reports describe genuinely different process signatures:
voluntary_ctxt_switchesnonvoluntary_ctxt_switchesS+wchan=futex_do_waitS+wchan=futex_do_wait(workers idle); main thread on-CPUThe thread
wchan+ state are the same on both. What diverges is the ctxt-switch count and the CPU profile.voluntary=34, nonvoluntary=0over hours across 31 threads is a hard-progress-stalled signature — a CPU-burning main thread should produce many ctxt switches (kernel preempts a CPU-bound thread continually; the worker threads would also accumulate switches from their normal park/unpark cycles even if they're idle). The May-24 autopsy is consistent with a busy-loop wedge; the May-23 numbers (if taken at face value) look more like a hard-progress-stall wedge.@mik-tf — should the May-23 ctxt-switch counts be considered superseded by the May-24 autopsy (e.g., transient observation / different reading), or do they describe a genuinely different observation? If the latter, there may still be a separate hard-stall variant worth tracking — though after
a58bdf0+837bdb0(see below) I no longer see an obvious code-level surface that would produce it.What
a58bdf0(#124) and837bdb0(#123) DO still coverGoing back to the 4 wedge-surface candidates from my code audit comment 36479:
837bdb0(#123). The RAIIOpenChannelGuardensures the channel map entry is reclaimed on any drop path including stack unwind, so a panicking executor task can no longer leak abroadcast::Sender+ its receiver set. That class of lifecycle leak is gone.Arc<std::sync::Mutex<Connection>>held across .await) is fixed bya58bdf0(#124). The migration eliminates the canonical Tokio anti-pattern fromhero_proc_server; details + post-merge benchmark moved off this thread to #124 comment 36972 where it belongs.embed/mod.rsblock_on) were ruled out at audit time as false alarms / cosmetic.So if the May-23 signature represented a Mutex-across-
.awaitwedge from the DB or logging surfaces, the architectural cause is gone. If it was from somewhere we didn't enumerate (e.g., glibc-malloc-arena futex, a tokio mutex held in a task that got aborted between acquire and release, etc.), those are NOT addressed by #124.My honest revised position
4ac8fb0already in place for #127. Once #126 closes, this wedge cannot recur.a58bdf0) was filed and shipped as a defect on its own merits — it removes a real anti-pattern, plus correctness fixes (complete_wiperow count,env::set_varsoundness, cancel-vs-exit-status TOCTOU race, multi-step handler atomicity, batchdelete_many). It also incidentally protects against any future Mutex-across-.awaitdeadlock variant that might appear in this codebase. Perf verification + tradeoffs are on the #124 thread, not here.Not closing #122 — that's @mik-tf's call as the issue owner and #126 is still open. If the answer to the signature question is "the May-23 numbers were a transient/aberrant reading," #122 should close with @mik-tf's autopsy + #126/#127 as the resolution. If it's "they were two distinct observations conflated into one issue," then the deadlock variant deserves its own issue so we can track whether it recurs post-
a58bdf0.hero_proc.dbhas no retention or VACUUM scheduling — grows unboundedly on long-lived daemons #131