Orphan supervised processes accumulate after stop/restart cycles #61
Reference
lhumina_code/hero_proc#61
Observed
On a long-running multi-user box (138.201.206.39, 2026-04-29), `pgrep -af` for supervised binaries returns multiple PIDs per service, far more than hero_proc's `proc service list` knows about.
For user salma, the current snapshot shows 7 hero_collab orphans and 2 hero_aibroker_server orphans, all belonging to that one user. This is confirmed not to be collab-specific: the pattern affects multiple service types.
Pattern
PIDs cluster around restart events (1462843/1462875/1463221 grouped, then 1527585, then 1556820, then 1558829/1558830). Each restart spawned new processes but didn't terminate the old ones.
Hypothesis
`proc service stop` and/or `proc service restart` doesn't reliably kill the entire process tree before spawning a new one. Possibilities:
Why this matters
- `hero_collab_ui` processes hold open `~/hero/var/sockets/hero_collab/ui.sock`, blocking new instances from binding
Repro / diagnostic
1. `pgrep -af`
2. `proc service restart <name>`
3. `pgrep -af`: sometimes both old and new PID coexist
The trigger is intermittent on this box: sometimes restart is clean, sometimes not. A stress-test loop while watching `pgrep` should isolate the conditions.
Suggested fix direction
Before spawning a replacement, the supervisor should:
1. Send SIGTERM to the process group
2. Wait for exit (`waitpid` or `/proc/<pid>` check)
3. Escalate to SIGKILL on timeout
4. Only then mark the slot free
For already-orphaned processes, a `proc service reap` command (or auto-reap-on-startup based on UDS ownership) would help recover stuck boxes.
Confirmed in the wild: heavy accumulation on a long-running box
Hit this hard on one of my hosts. After ~8+ days of restarts, `ps` shows multiple live generations of nearly every singleton service still resident, plus a textbook `SIGCHLD`/`waitpid` leak. Posting evidence in case it helps narrow down where in the supervisor the bug lives.
Duplicate live parents (newest PID is the one currently bound to the UDS; the rest are orphans)
- `hero_code_serve`
- `hero_db_ui`
- `hero_db_server`
- `hero_osis` / `hero_osis_ui`
- `hero_slides_*`
- `hero_whiteboard`
- `hero_collab_*`, `hero_logic_*`, `hero_books_*`, `hero_indexer_*`, `hero_aibroker_*`, `hero_biz*`, `hero_voice_*`, `hero_agent_*`, `hero_foundry_*`, `hero_embedder_*`, `hero_proxy_ui`, `hero_livekit_*`, `hero_os_*`, `hero_codescaler`
That's roughly 2–3 generations of most services. Aggregate RSS of the orphan generations is ~1.5 GB.
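The duplicate-parent check above can be automated. The following is a minimal Python sketch, assuming input in the `pgrep -af` format (one `PID command-line` pair per line); the function name, the sample paths, and the sample PIDs are hypothetical and not taken from hero_proc.

```python
from collections import defaultdict


def find_duplicate_services(pgrep_output, singleton_names):
    """Map each singleton service with more than one live PID to its sorted PIDs."""
    pids_by_name = defaultdict(list)
    for line in pgrep_output.splitlines():
        parts = line.split(None, 1)
        if len(parts) != 2 or not parts[0].isdigit():
            continue  # skip blank or malformed lines
        pid, cmdline = int(parts[0]), parts[1]
        # Compare on the basename of the executable (first token of the command line).
        exe = cmdline.split()[0].rsplit("/", 1)[-1]
        if exe in singleton_names:
            pids_by_name[exe].append(pid)
    return {name: sorted(pids) for name, pids in pids_by_name.items() if len(pids) > 1}


# Made-up sample; real input would come from running pgrep on the box.
sample = """\
1001 /opt/hero/bin/hero_db_ui
2002 /opt/hero/bin/hero_db_ui
3003 /opt/hero/bin/hero_whiteboard
"""
print(find_duplicate_services(sample, {"hero_db_ui", "hero_whiteboard"}))
# prints {'hero_db_ui': [1001, 2002]}
```

Run periodically, a check like this would flag a new orphan generation right after the restart that produced it, rather than days later.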
Smoking gun: an actual zombie
Parent `405098` is an old `hero_livekit_se` (the current live one is `1855273`). So the old supervisor child:
- never reaped its exited `livekit-server` subprocess (`SIGCHLD` ignored / no `waitpid`), and
- was itself never terminated when the new `hero_livekit_se` was spawned.
Both halves of #61 in one PID pair.
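A zombie like this is what a missing `waitpid` looks like from the outside. Below is a minimal Python sketch of the non-blocking reap loop that a parent process would typically run from a `SIGCHLD` handler or a periodic tick. The function name is hypothetical, and this illustrates the general technique rather than hero_proc's actual code.

```python
import os


def reap_exited_children():
    """Collect every child that has already exited; return [(pid, status), ...]."""
    reaped = []
    while True:
        try:
            # -1 means "any child"; WNOHANG makes the call return immediately.
            pid, status = os.waitpid(-1, os.WNOHANG)
        except ChildProcessError:
            break  # this process has no children at all
        if pid == 0:
            break  # children exist, but none has exited yet
        reaped.append((pid, status))
    return reaped
```

Looping until `waitpid` reports nothing left matters because several children can exit while only one `SIGCHLD` is delivered.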
hero_code_serve worker pools are doubled
The two live parents each carry a full ~30-worker pool:
- `404145` → children `404162`–`404193` (worker RSS ~1.4 MB, paged out)
- `1740421` → children `1740438`–`1740469` (worker RSS ~11 MB, hot)
Two complete pools fighting for the same UDS, exactly the symptom predicted in the issue body.
Implication for the fix
The proposed direction (TERM the process group, `waitpid` until exit, escalate to KILL on timeout, then mark slot free) matches what the evidence shows is missing. Two extra things worth folding in based on this data:
- Scanning `/proc` for processes whose `comm` matches a registered service binary but whose `ppid != hero_proc` would catch these orphans.
- A `SIGCHLD` handler / `waitpid` loop in the supervisor children, or `prctl(PR_SET_PDEATHSIG, SIGTERM)` on grandchildren: the livekit zombie shows the leak isn't only at the supervisor → service boundary, it's also at service → grandchild.
@omarz has already fixed it.
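For reference, the stop sequence discussed in this thread (TERM the process group, wait for exit, escalate to KILL on timeout) might look roughly like the Python sketch below. It assumes the service was launched as the leader of its own process group (for example with `subprocess.Popen(..., start_new_session=True)`), so its PID doubles as the group ID. All function names are hypothetical and none of this is taken from the hero_proc source.

```python
import os
import signal
import subprocess
import time


def _signal_group(pgid, sig):
    """Send sig to every process in the group; return False if the group is gone."""
    try:
        os.killpg(pgid, sig)
        return True
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # group exists but is not ours to signal


def stop_process_group(proc, term_timeout=5.0, poll_interval=0.05):
    """TERM the whole group, wait up to term_timeout, then KILL. Returns the exit code."""
    pgid = proc.pid  # assumption: proc leads its own process group
    _signal_group(pgid, signal.SIGTERM)
    deadline = time.monotonic() + term_timeout
    while time.monotonic() < deadline:
        # poll() reaps the leader; signal 0 only probes for surviving group members.
        if proc.poll() is not None and not _signal_group(pgid, 0):
            return proc.returncode
        time.sleep(poll_interval)
    _signal_group(pgid, signal.SIGKILL)  # timeout: escalate for the whole group
    proc.wait()
    return proc.returncode  # only now would the supervisor mark the slot free
```

Signalling the group rather than the single PID is what covers grandchildren such as the worker pools above. A child that moves itself into a different process group would still escape, which is where `PR_SET_PDEATHSIG` or a `/proc` sweep would come in.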
Update: closed the lab-side inflow path
Lab was unconditionally calling `hero_proc`'s `restart_service` RPC on every `lab service <name> --start` invocation. `restart_service` is server-side `stop_then_start`: it kills the existing supervised PID and spawns a fresh one even when the service was healthy. That's exactly the kind of churn that produces orphan supervised processes if a SIGTERM doesn't fully clean up between kill and respawn.
Reproduced on kristof5: 3 consecutive `lab service hero_proc --start` calls produced PIDs 2124432 → 2124649 → 2124745 for `hero_proc_admin`. Every Hero pod would experience the same churn pattern whenever an operator re-ran `--start`.
Fix landed in `hero_skills@2c25f2c`:
- `do_start_validated` now runs a liveness pre-check (mirroring the existing pattern in `ensure_dependency_running`): it queries `service_status` and, if the state is `running`/`ok`, logs "already running — use --reset to force restart" and returns early WITHOUT issuing the destructive RPC.
- `--reset` continues to work as before: the outer caller invokes `do_stop` first, which makes the liveness pre-check correctly see "not running" and proceed to start.
- This aligns with standard `start` semantics: systemd, supervisord, launchd, runit/s6, docker, kubectl all treat `start` as idempotent and require explicit `restart` for kill-respawn.
Verified empirically on kristof5: 3 consecutive `lab service hero_proc --start` after the fix → identical `hero_proc_admin` PID (2159396) across all invocations. The pre-fix log line for the `restart_service` RPC no longer appears when the service is healthy.
Lab's republished binary on Forge release `latest` (`hero_skills/releases/latest`, asset 3086 uploaded today 14:34Z) includes this fix and is the canonical version for any `curl … lab_install.sh | bash` installer going forward.
This addresses the primary inflow of the orphan-accumulation pattern reported here. The SIGTERM race in hero_proc's supervisor that orphans children when it DOES need to restart is a separate concern. Still, with the kill-respawn frequency reduced from "every `--start`" to "only on explicit `--reset` or actual failure recovery", the orphan accumulation rate should drop sharply in practice.
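The liveness pre-check described in this update can be illustrated with a small Python sketch. The method names `service_status` and `restart_service` mirror the RPC names mentioned above, but the client object, the `FakeClient` stand-in, the `start_service` function, and the exact log wording are hypothetical; this is not the code in `hero_skills@2c25f2c`.

```python
HEALTHY_STATES = {"running", "ok"}


def start_service(client, name, log=print):
    """Idempotent start: issue the destructive restart only if the service is not healthy.

    Returns True when a restart was issued, False when the call was a no-op.
    """
    state = client.service_status(name)
    if state in HEALTHY_STATES:
        log(f"{name}: already running, use --reset to force restart")
        return False
    client.restart_service(name)
    return True


class FakeClient:
    """Stand-in for the real RPC client; counts restarts so behaviour can be checked."""

    def __init__(self, state):
        self.state = state
        self.restarts = 0

    def service_status(self, name):
        return self.state

    def restart_service(self, name):
        self.restarts += 1
        self.state = "running"


client = FakeClient("stopped")
for _ in range(3):
    start_service(client, "hero_proc")
print(client.restarts)  # prints 1
```

Three consecutive starts produce a single restart, matching the stable-PID behaviour reported after the fix.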