Architecture (B1): hero_proc owns livekit-server + lk-backend — fixes control-plane-restart drops-calls #49

Open
opened 2026-06-08 21:42:58 +00:00 by sameh-farouk · 0 comments
Member

Why this matters (the concrete impact)

Restarting the control-plane daemon hero_livekit_server drops every active call. Verified on a real server: with ensure_ready gated off (so it could not be the cause), restarting only hero_livekit_server killed both media processes:

before:  livekit-server=246662  lk-backend=246663
after:   PID 246662 DEAD   PID 246663 DEAD   status: stopped

So you cannot deploy/restart the orchestrator without tearing down all live huddles. (Was tracked separately as #48 — folded here; there is no cheaper fix.)

Root cause

hero_livekit_server hand-spawns livekit-server + lk-backend as in-process Children, so they land inside the control-plane's process group. hero_proc puts each service in its own group (executor.rs:716 pre_exec → setpgid(0,0)) and kills by process tree/group — so killing the control-plane service's group takes the media grandchildren with it.
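The group inheritance described above can be reproduced outside hero_proc. The following is a minimal sketch, not the hero_livekit or hero_proc code: it spawns two `sleep` processes from the standard library, one plainly (as a hand-spawned Child would be) and one with `CommandExt::process_group(0)`, which is the standard-library equivalent of calling setpgid(0,0) before exec. The helper names (`pgid_of`, `spawn_sleep`, `demo`) are hypothetical, and reading `/proc/<pid>/stat` assumes Linux.

```rust
use std::fs;
use std::os::unix::process::CommandExt;
use std::process::{Child, Command};

/// Read the process-group id of `pid` from /proc/<pid>/stat (Linux only).
fn pgid_of(pid: u32) -> u32 {
    let stat = fs::read_to_string(format!("/proc/{pid}/stat")).expect("read /proc stat");
    // Layout is "pid (comm) state ppid pgrp ..."; comm may contain spaces,
    // so parse the fields that follow the last ')'.
    let rest = &stat[stat.rfind(')').expect("malformed stat") + 1..];
    rest.split_whitespace()
        .nth(2) // fields after comm: state, ppid, pgrp
        .expect("pgrp field")
        .parse()
        .expect("numeric pgrp")
}

/// Spawn a placeholder child; `own_group` puts it in a fresh process group.
fn spawn_sleep(own_group: bool) -> Child {
    let mut cmd = Command::new("sleep");
    cmd.arg("30");
    if own_group {
        cmd.process_group(0); // child becomes leader of a new group
    }
    cmd.spawn().expect("spawn sleep")
}

/// Returns (parent pgid, inherited child's pgid, isolated child's pgid, isolated child's pid).
fn demo() -> (u32, u32, u32, u32) {
    let parent = pgid_of(std::process::id());
    let mut inherited = spawn_sleep(false);
    let mut isolated = spawn_sleep(true);
    let result = (
        parent,
        pgid_of(inherited.id()),
        pgid_of(isolated.id()),
        isolated.id(),
    );
    // Clean up both placeholder children.
    for child in [&mut inherited, &mut isolated] {
        let _ = child.kill();
        let _ = child.wait();
    }
    result
}

fn main() {
    let (parent, inherited, isolated, isolated_pid) = demo();
    println!("parent pgid:          {parent}");
    println!("plain child pgid:     {inherited}  (same group as parent)");
    println!("isolated child pgid:  {isolated}  (own group, pid {isolated_pid})");
}
```

A signal sent to the parent's group reaches the plain child but not the isolated one, which is the difference between today's hand-spawned media processes and media registered as their own hero_proc unit.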

The fix: make the media first-class hero_proc unit(s), separate from the control plane

Stop hand-spawning. Register the media under hero_proc as its own unit(s), separate from hero_livekit_server. Then each media process is in its own process group, and restarting the control-plane service no longer touches it. hero_livekit_server becomes pure control-plane (drops Child handles / pkill, never spawns); ensure_ready (#42) registers/starts the unit instead of spawning. Full design in the #41 spec comment.

Preferred shape — one media hero_proc service with action-level depends_on (works TODAY, no #135 dependency)

Model the media as a single hero_proc service (e.g. hero_livekit_media) with gated actions:

configure  (oneshot: mint secret + write livekit.yaml/backend.env)
livekit-server  depends_on = [configure]
lk-backend      depends_on = [configure]
  • Ordering is enforced by the supervisor today. Set the action-level depends_on directly at registration via hero_proc_sdk's ActionBuilder.depends_on (the same way lab/checker/orchestrator.rs already chains its jobs). This uses the ordering primitive that already works: the poll loop gates a job until its depends_on jobs reach Succeeded. It does NOT depend on hero_proc#135, which only wires service-level [[dependencies]]/requires/after; those are inert on the start path (see hero_proc#135).
  • Isolation/restart-survival comes from hero_proc's existing per-action/service setpgid; this media service is separate from hero_livekit_server, so a control-plane restart leaves it alone.
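To make the gating behaviour concrete, here is a small self-contained model of it. This is an illustration under stated assumptions, not hero_proc's poll loop and not the hero_proc_sdk API: the `State`, `Action`, `media_actions`, `poll_once`, and `demo` names are all hypothetical. It only models the rule quoted above, that an action stays pending until every action in its depends_on has reached Succeeded.

```rust
use std::collections::HashMap;

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum State {
    Pending,
    Running,
    Succeeded,
}

/// Hypothetical stand-in for a registered action and its dependencies.
struct Action {
    name: &'static str,
    depends_on: Vec<&'static str>,
}

/// The three actions of the proposed media service.
fn media_actions() -> Vec<Action> {
    vec![
        Action { name: "configure", depends_on: vec![] },
        Action { name: "livekit-server", depends_on: vec!["configure"] },
        Action { name: "lk-backend", depends_on: vec!["configure"] },
    ]
}

/// One poll tick: start every pending action whose dependencies have all
/// succeeded. Returns the names started on this tick.
fn poll_once(
    actions: &[Action],
    states: &mut HashMap<&'static str, State>,
) -> Vec<&'static str> {
    // Decide against a snapshot so changes made during this tick do not
    // affect decisions made later in the same tick.
    let snapshot = states.clone();
    let mut started = Vec::new();
    for action in actions {
        let ready = snapshot[action.name] == State::Pending
            && action
                .depends_on
                .iter()
                .all(|dep| snapshot.get(dep) == Some(&State::Succeeded));
        if ready {
            states.insert(action.name, State::Running);
            started.push(action.name);
        }
    }
    started
}

/// Runs three ticks and returns what started on each one.
fn demo() -> Vec<Vec<&'static str>> {
    let actions = media_actions();
    let mut states: HashMap<&'static str, State> =
        actions.iter().map(|a| (a.name, State::Pending)).collect();

    let tick1 = poll_once(&actions, &mut states); // only configure is unblocked
    let tick2 = poll_once(&actions, &mut states); // configure still running: nothing starts
    states.insert("configure", State::Succeeded); // the oneshot finishes
    let tick3 = poll_once(&actions, &mut states); // both media actions unblock
    vec![tick1, tick2, tick3]
}

fn main() {
    for (i, started) in demo().iter().enumerate() {
        println!("tick {}: started {:?}", i + 1, started);
    }
}
```

In this model neither media action can start before the configure oneshot has written its files, which is the property the single-service shape relies on.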

Avoid the alternative of separate services for livekit-server vs lk-backend wired by service-level requires/after — that ordering is unenforced until hero_proc#135 lands.

Repo ownership

  • This work lives in hero_livekit (consumer-side restructure against an unchanged hero_proc API): drop hand-spawning, register the media service + actions with depends_on, slim hero_livekit_server to control-plane only.
  • One cross-repo touchpoint → hero_skills/lab: livekit-server is an external downloaded binary (not a repo crate), so it needs the onlyoffice-style acquire + launcher idiom (hero_skills/crates/lab/src/service/service_onlyoffice.rs). Reuse that pattern (bespoke service_livekit.rs) or file a hero_skills issue to generalize "register an external downloaded binary as a hero_proc service".
  • hero_proc: no change required for this fix. #135 (service-level dep enforcement) is not a blocker because we use action-level depends_on.

Effort

~3–5 days (see #41 spec). The core fix does not need to wait for the hero_proc redesign.

Related

  • #41 (full B1 design spec in the comment thread)
  • #42 (turnkey ensure_ready — compatible; only the "spawn children" step moves to the media unit)
  • hero_proc#135 (service-level deps inert on start — explains why we use action-level depends_on, not requires/after)
  • hero_skills onlyoffice launcher idiom (for the external livekit-server binary)
sameh-farouk changed title from Architecture (B1): hero_proc owns livekit-server + lk-backend as supervised units to Architecture (B1): hero_proc owns livekit-server + lk-backend — fixes control-plane-restart drops-calls 2026-06-08 21:52:16 +00:00
Reference
lhumina_code/hero_livekit#49