lhumina_code/hero_proc


Orphan supervised processes accumulate after stop/restart cycles #61


Open

opened 2026-04-29 03:32:46 +00:00 by sameh-farouk · 3 comments

sameh-farouk commented

2026-04-29 03:32:46 +00:00

Member

Observed

On a long-running multi-user box (138.201.206.39, 2026-04-29), pgrep -af for supervised binaries returns multiple PIDs per service — far more than hero_proc's proc service list knows about.

For user salma (current snapshot):

913095  /home/salma/hero/bin/hero_collab_ui
1462843 /home/salma/hero/bin/hero_collab_ui
1462875 /home/salma/hero/bin/hero_collab_server
1463221 /home/salma/hero/bin/hero_aibroker_server
1527585 /home/salma/hero/bin/hero_collab_ui
1556820 /home/salma/hero/bin/hero_collab_ui
1558779 /home/salma/hero/bin/hero_aibroker_server
1558829 /home/salma/hero/bin/hero_collab_ui
1558830 /home/salma/hero/bin/hero_collab_server

That's 7 hero_collab orphans + 2 hero_aibroker_server orphans for one user. Confirmed not collab-specific — pattern affects multiple service types.

Pattern

PIDs cluster around restart events (1462843/1462875/1463221 grouped, then 1527585, then 1556820, then 1558829/1558830). Each restart spawned new processes but didn't terminate the old ones.
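The clustering can be checked mechanically. A minimal sketch, assuming PIDs are allocated roughly sequentially so that a small numeric gap suggests processes spawned at about the same time; the function name and the 1000-PID threshold are illustrative choices, not something measured on the box:

```python
def cluster_pids(pids, max_gap=1000):
    """Group PIDs whose numeric distance to the previous PID is at most max_gap.

    PID distance is only a rough proxy for spawn time: it assumes sequential
    allocation and no PID wrap-around between the spawns being compared.
    """
    clusters = []
    for pid in sorted(pids):
        if clusters and pid - clusters[-1][-1] <= max_gap:
            clusters[-1].append(pid)
        else:
            clusters.append([pid])
    return clusters


if __name__ == "__main__":
    # PIDs from the snapshot for user salma above.
    snapshot = [913095, 1462843, 1462875, 1463221, 1527585,
                1556820, 1558779, 1558829, 1558830]
    for group in cluster_pids(snapshot):
        print(group)
```

Each multi-PID cluster is a candidate "generation" left behind by one restart event.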

Hypothesis

proc service stop and/or proc service restart doesn't reliably kill the entire process tree before spawning a new one. Possibilities:

- SIGTERM sent to the parent doesn't propagate to children when the parent doesn't double-fork properly
- Retry policy spawns a new instance before the previous instance fully exits, and the supervisor loses track of the original PID
- Race: SIGTERM is sent, the retry policy spawns a replacement, and the original's exit signal is lost

Why this matters

- Memory waste: ~50–150 MB per orphan × 9 orphans = ~1 GB on this user alone
- UDS conflicts: orphan `hero_collab_ui` processes hold open `~/hero/var/sockets/hero_collab/ui.sock`, blocking new instances from binding
- Stale state: orphans hold cached config in memory (e.g., the livekit.secret cache issue tracked separately). The new "supervised" instance reads fresh config, but the orphan keeps serving stale traffic if it owns the socket
- Cumulative drift: 8+ days of restarts on this box left this user with 9 orphans

Repro / diagnostic

1. Start a hero_proc-managed service
2. Note its PID via `pgrep -af`
3. `proc service restart <name>`
4. Compare `pgrep -af`: sometimes both old and new PID coexist

The trigger is intermittent on this box — sometimes restart is clean, sometimes not. A stress-test loop while watching pgrep should isolate the conditions.
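A stress-test loop along these lines might look like the sketch below. It assumes the CLI is invoked as `proc service restart <name>` exactly as written in the repro steps, and that a fixed settle delay is enough for the new instance to come up; the function names, round count, and delay are hypothetical. Note that `pgrep -f` matches full command lines, so the pattern should be specific enough (e.g. the binary's full path) not to match the script itself.

```python
import subprocess
import time


def parse_pgrep(output: str) -> dict[int, str]:
    """Map PID -> command line from `pgrep -af` output ("PID cmdline" per line)."""
    procs = {}
    for line in output.splitlines():
        pid, _, cmdline = line.strip().partition(" ")
        if pid.isdigit():
            procs[int(pid)] = cmdline.strip()
    return procs


def survivors(before: dict[int, str], after: dict[int, str]) -> set[int]:
    """PIDs present both before and after a restart, i.e. not replaced."""
    return set(before) & set(after)


def snapshot(pattern: str) -> dict[int, str]:
    # pgrep exits with status 1 when nothing matches, so check=True is not used.
    result = subprocess.run(["pgrep", "-af", pattern],
                            capture_output=True, text=True)
    return parse_pgrep(result.stdout)


def stress(service: str, pattern: str, rounds: int = 50, settle: float = 2.0) -> None:
    """Restart repeatedly and report any PID that outlives a restart."""
    for i in range(rounds):
        before = snapshot(pattern)
        # Assumed invocation, copied from the repro steps above.
        subprocess.run(["proc", "service", "restart", service], check=False)
        time.sleep(settle)
        stale = survivors(before, snapshot(pattern))
        if stale:
            print(f"round {i}: PIDs survived restart: {sorted(stale)}")
```

Varying `settle` between runs should show whether the failure correlates with how quickly the replacement is spawned.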

Suggested fix direction

Before spawning a replacement, the supervisor should:

1. Send SIGTERM to the entire process group (negative PID), not just the parent
2. Wait for actual exit (or kill timeout), confirmed via `waitpid` or a `/proc/<pid>` check
3. Only then mark the slot available for a new spawn

For already-orphaned processes, a proc service reap command (or auto-reap-on-startup based on UDS ownership) would help recover stuck boxes.
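The three steps above can be sketched in Python as follows. This is an illustration of the technique on Linux, not hero_proc's implementation: the function names and timeouts are hypothetical, and it assumes the service is started in its own session so that its process group ID equals its PID.

```python
import os
import signal
import subprocess
import time


def spawn(argv: list[str]) -> subprocess.Popen:
    # start_new_session=True calls setsid() in the child, so the service and
    # everything it forks share a process group whose ID is the child's PID.
    return subprocess.Popen(argv, start_new_session=True)


def group_alive(pgid: int) -> bool:
    """True if any non-zombie process in /proc still belongs to group pgid."""
    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue
        try:
            with open(f"/proc/{entry}/stat") as f:
                stat = f.read()
        except OSError:
            continue  # process exited while we were scanning
        # After the parenthesised comm: state, ppid, pgrp, ...
        fields = stat.rpartition(")")[2].split()
        state, pgrp = fields[0], int(fields[2])
        if pgrp == pgid and state not in ("Z", "X"):
            return True
    return False


def stop_group(proc: subprocess.Popen,
               term_timeout: float = 10.0,
               kill_timeout: float = 5.0) -> bool:
    """TERM the whole group, wait, then escalate to KILL.

    Returns True once no live process remains in the group; only then
    should the caller mark the slot free for a new spawn.
    """
    pgid = proc.pid
    for sig, timeout in ((signal.SIGTERM, term_timeout),
                         (signal.SIGKILL, kill_timeout)):
        try:
            os.killpg(pgid, sig)
        except (ProcessLookupError, PermissionError):
            pass
        deadline = time.monotonic() + timeout
        while time.monotonic() < deadline:
            proc.poll()  # reaps the direct child so it does not stay a zombie
            if not group_alive(pgid):
                return True
            time.sleep(0.05)
    return False
```

A service that moves itself into a different session or process group would escape this; tracking by cgroup would be the stricter alternative.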



mahmoud self-assigned this

2026-04-30 10:41:26 +00:00

mahmoud added this to the ACTIVE project

2026-04-30 10:41:30 +00:00

mahmoud added this to the now milestone

2026-04-30 10:41:33 +00:00

mahmoud commented

2026-04-30 10:42:41 +00:00

Owner

Confirmed in the wild — heavy accumulation on a long-running box

Hit this hard on one of my hosts. After ~8+ days of restarts, ps shows multiple live generations of nearly every singleton service still resident, plus a textbook SIGCHLD/waitpid leak. Posting evidence in case it helps narrow down where in the supervisor the bug lives.

Duplicate live parents (newest PID is the one currently bound to the UDS; the rest are orphans)

| Service | Live parent PIDs | Expected |
|---|---|---|
| `hero_code_serve` | 404145, 1740421 | 1 |
| `hero_db_ui` | 404738, 1533338, 1740117 | 1 |
| `hero_db_server` | 404756, 1533546, 1740321 | 1 |
| `hero_osis` / `hero_osis_ui` | 404963/4, 811011/2, 1803707/8 | 1 each |
| `hero_slides_*` | 405688/9, 1787034/5, 1804901/2 | 1 each |
| `hero_whiteboard` | 405777/8, 1533319/28, 1805176/7 | 2 (ui+server) |
| `hero_collab_*`, `hero_logic_*`, `hero_books_*`, `hero_indexer_*`, `hero_aibroker_*`, `hero_biz*`, `hero_voice_*`, `hero_agent_*`, `hero_foundry_*`, `hero_embedder_*`, `hero_proxy_ui`, `hero_livekit_*`, `hero_os_*`, `hero_codescaler` | 2× each | 1 each |

That's roughly 2–3 generations of most services. Aggregate RSS of the orphan generations is ~1.5 GB.

Smoking gun: an actual zombie

405150 │ 405098 │ livekit-server │ Zombie │ 0 B

Parent 405098 is an old hero_livekit_se (the current live one is 1855273). So the old supervisor child:

1. never reaped its own `livekit-server` subprocess (`SIGCHLD` ignored / no `waitpid`), and
2. was itself never killed when the new `hero_livekit_se` was spawned.

Both halves of #61 in one PID pair.
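For the first half, a zombie disappears as soon as its parent collects the exit status. A minimal sketch of a non-blocking reap loop, assuming a Python parent for illustration only (the function name is hypothetical, and this says nothing about how `hero_livekit_se` is actually written):

```python
import os


def reap_children() -> list[tuple[int, int]]:
    """Collect every child that has already exited, without blocking.

    Returns (pid, exit_code) pairs; a negative exit code means the child
    was terminated by that signal number.
    """
    reaped = []
    while True:
        try:
            pid, status = os.waitpid(-1, os.WNOHANG)
        except ChildProcessError:
            break  # this process has no children at all
        if pid == 0:
            break  # children exist, but none has exited yet
        reaped.append((pid, os.waitstatus_to_exitcode(status)))
    return reaped
```

Calling this from a `SIGCHLD` handler or a periodic tick keeps exited subprocesses from lingering as zombies. It reaps every child of the process, so it should not be mixed with code that expects to `wait()` on specific children itself.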

`hero_code_serve` worker pools are doubled

The two live parents each carry a full ~30-worker pool:

- old parent `404145` → children `404162`–`404193` (worker RSS ~1.4 MB, paged out)
- new parent `1740421` → children `1740438`–`1740469` (worker RSS ~11 MB, hot)

Two complete pools fighting for the same UDS — exactly the symptom predicted in the issue body.

Implication for the fix

The proposed direction (TERM the process group, waitpid until exit, escalate to KILL on timeout, then mark slot free) matches what the evidence shows is missing. Two extra things worth folding in based on this data:

1. **Reaper for already-orphaned PIDs at startup**: boxes that have been running through the buggy version need a way to clean up without a manual sweep. Walking `/proc` for processes whose `comm` matches a registered service binary but whose `ppid != hero_proc` would catch these.
2. **Explicit `SIGCHLD` handler / `waitpid` loop in the supervisor children**, or `prctl(PR_SET_PDEATHSIG, SIGTERM)` on grandchildren: the livekit zombie shows the leak isn't only at the supervisor → service boundary, it's also at service → grandchild.
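The `/proc` walk from the first item could look roughly like this. It is a sketch under assumptions: the function names are hypothetical, the caller supplies the set of registered `comm` names and the supervisor's PID, and the result is a list of candidates rather than a kill list.

```python
import os
from typing import Optional


def read_stat(pid: int) -> Optional[tuple[str, int]]:
    """Return (comm, ppid) from /proc/<pid>/stat, or None if the PID is gone."""
    try:
        with open(f"/proc/{pid}/stat") as f:
            data = f.read()
    except OSError:
        return None
    # comm is parenthesised and may itself contain spaces or ')', so split
    # on the last ')' rather than on whitespace.
    head, _, tail = data.rpartition(")")
    comm = head.split("(", 1)[1]
    ppid = int(tail.split()[1])  # tail fields: state, ppid, pgrp, ...
    return comm, ppid


def find_orphans(service_comms: set[str], supervisor_pid: int) -> list[int]:
    """PIDs whose comm is a registered service but whose parent is not the supervisor.

    The kernel truncates comm to 15 characters (hence `hero_livekit_se`
    above), so service_comms must hold the truncated names.
    """
    orphans = []
    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue
        info = read_stat(int(entry))
        if info is None:
            continue
        comm, ppid = info
        if comm in service_comms and ppid != supervisor_pid:
            orphans.append(int(entry))
    return sorted(orphans)
```

Worker children that share their parent's `comm` (as the `hero_code_serve` pools may) would also be flagged by the `ppid` test, so a real reaper would need an extra check such as UDS ownership before killing anything.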


omarz was assigned by mahmoud

2026-04-30 11:22:37 +00:00

mahmoud removed their assignment

2026-04-30 11:22:40 +00:00

mahmoud commented

2026-04-30 11:22:50 +00:00

Owner

@omarz has already fixed it

omarz referenced this issue

2026-04-30 11:52:35 +00:00

fix(supervisor): stop service no longer leaves orphan processes #79

sameh-farouk commented

2026-05-20 15:00:27 +00:00

Author

Member

Update — closed the lab-side inflow path

Lab was unconditionally calling hero_proc's restart_service RPC on every lab service <name> --start invocation. restart_service is server-side stop_then_start — killing the existing supervised PID and spawning a fresh one even when the service was healthy. That's exactly the kind of churn that produces orphan supervised processes if a SIGTERM doesn't fully clean up between kill and respawn.

Reproduced on kristof5: 3 consecutive lab service hero_proc --start calls produced PIDs 2124432 → 2124649 → 2124745 for hero_proc_admin. Same churn pattern every Hero pod would experience whenever an operator re-ran --start.

Fix landed in [`hero_skills@2c25f2c`](https://forge.ourworld.tf/lhumina_code/hero_skills/commit/2c25f2c):

- Skip-if-running pre-check in `do_start_validated` (mirrors the existing pattern in `ensure_dependency_running`): query `service_status`; if the state is `running`/`ok`, log "already running — use --reset to force restart" and return early without issuing the destructive RPC.
- `--reset` continues to work as before: the outer caller invokes `do_stop` first, which makes the liveness pre-check correctly see "not running" and proceed to start.
- Aligns lab with every peer supervisor's `start` semantics: systemd, supervisord, launchd, runit/s6, docker, and kubectl all treat `start` as idempotent and require an explicit `restart` for kill-respawn.
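The skip-if-running logic reduces to a small guard in front of the destructive call. The sketch below is not the code in `hero_skills`: the `ProcClient` interface and `start_if_not_running` are hypothetical stand-ins, and only the `service_status` and `restart_service` names and the `running`/`ok` states come from the description above.

```python
from typing import Protocol


class ProcClient(Protocol):
    """Hypothetical client interface; only the RPC names come from the issue."""

    def service_status(self, name: str) -> str: ...
    def restart_service(self, name: str) -> None: ...


RUNNING_STATES = {"running", "ok"}


def start_if_not_running(client: ProcClient, name: str) -> bool:
    """Idempotent start: issue the stop-then-start RPC only if the service is down.

    Returns True if the RPC was issued, False if the call was skipped.
    """
    if client.service_status(name) in RUNNING_STATES:
        print(f"{name}: already running; use --reset to force restart")
        return False
    client.restart_service(name)
    return True
```

With this guard, repeated `--start` calls against a healthy service leave its PID untouched, which matches the behaviour reported below.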

Verified empirically on kristof5: 3 consecutive `lab service hero_proc --start` calls after the fix returned an identical `hero_proc_admin` PID (2159396) across all invocations. The pre-fix `restart_service` RPC log line no longer appears when the service is healthy.

Lab's republished binary on Forge release latest (hero_skills/releases/latest, asset 3086 uploaded today 14:34Z) includes this fix and is the canonical version for any curl … lab_install.sh | bash installer going forward.

This addresses the primary inflow of the orphan-accumulation pattern reported here. The SIGTERM-race in hero_proc's supervisor that orphans children when it DOES need to restart is a separate concern — but with the kill-respawn frequency reduced from "every --start" to "only on explicit --reset or actual failure recovery", the orphan accumulation rate should drop sharply in practice.


mahmoud referenced this issue

2026-06-08 15:49:57 +00:00

chore(server): remove dead code + cruft (#138 cleanup) #144

mahmoud referenced this issue from a commit

2026-06-08 15:59:27 +00:00

chore(server): remove dead code + cruft (#138 cleanup) (#144)