Architecture (B1): hero_proc owns livekit-server + lk-backend — fixes control-plane-restart drops-calls #49
Why this matters (the concrete impact)
Restarting the control-plane daemon hero_livekit_server drops every active call. Verified on a real server: with ensure_ready gated off (so it could not be the cause), restarting only hero_livekit_server killed both media processes. So you cannot deploy or restart the orchestrator without tearing down all live huddles. (Was tracked separately as #48; folded here, because there is no cheaper fix.)
Root cause
hero_livekit_server hand-spawns livekit-server + lk-backend as in-process Child handles, so they land inside the control-plane's process group. hero_proc puts each service in its own group (executor.rs:716, pre_exec → setpgid(0,0)) and kills by process tree/group, so killing the control-plane service's group takes the media grandchildren with it.
The fix: make the media first-class hero_proc unit(s), separate from the control plane
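To make the process-group mechanics behind the root cause concrete, here is a minimal sketch using only the Rust standard library. It assumes a Linux host with /proc and a `sleep` binary on PATH, and it is not hero_proc's code: it uses `CommandExt::process_group(0)`, which has the same effect as calling setpgid(0,0) before exec, and the helper names `pgid_of` and `spawn_sleeper` are made up for illustration. A plainly spawned child inherits the spawner's process group (so a group kill reaches it), while a child given its own group does not.

```rust
use std::fs;
use std::os::unix::process::CommandExt;
use std::process::{Child, Command};

/// Hypothetical helper: read a process's group id from /proc/<pid>/stat (Linux only).
fn pgid_of(pid: u32) -> u32 {
    let stat = fs::read_to_string(format!("/proc/{pid}/stat")).expect("read /proc stat");
    // Layout is "pid (comm) state ppid pgrp ..."; comm may contain spaces or ')',
    // so parse only what follows the last ')'.
    let rest = &stat[stat.rfind(')').expect("malformed stat") + 1..];
    rest.split_whitespace()
        .nth(2) // fields after ')': state, ppid, pgrp
        .expect("pgrp field")
        .parse()
        .expect("pgrp is numeric")
}

/// Hypothetical helper: spawn a short-lived child, optionally in its own process group.
fn spawn_sleeper(own_group: bool) -> Child {
    let mut cmd = Command::new("sleep");
    cmd.arg("5");
    if own_group {
        // pgroup = 0 means "new group whose id equals the child's pid".
        cmd.process_group(0);
    }
    cmd.spawn().expect("spawn sleep")
}

fn main() {
    let my_group = pgid_of(std::process::id());

    // Like the hand-spawned media today: shares the spawner's group.
    let mut inherited = spawn_sleeper(false);
    // Like a separately supervised unit: leads its own group.
    let mut isolated = spawn_sleeper(true);

    println!(
        "inherited child shares spawner's group: {}",
        pgid_of(inherited.id()) == my_group
    );
    println!(
        "isolated child leads its own group: {}",
        pgid_of(isolated.id()) == isolated.id()
    );

    // Clean up both children.
    for child in [&mut inherited, &mut isolated] {
        child.kill().ok();
        child.wait().ok();
    }
}
```

A signal sent to the spawner's group would reach the first child but not the second, which is the property the separate media unit relies on.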
Stop hand-spawning. Register the media under hero_proc as its own unit(s), separate from hero_livekit_server. Then each media process is in its own process group, and restarting the control-plane service no longer touches it. hero_livekit_server becomes pure control-plane (drops Child handles / pkill, never spawns); ensure_ready (#42) registers/starts the unit instead of spawning. Full design in the #41 spec comment.
Preferred shape: one media hero_proc service with action-level depends_on (works TODAY, no #135 dependency)
Model the media as a single hero_proc service (e.g. hero_livekit_media) with gated actions:
depends_on directly at registration via hero_proc_sdk's ActionBuilder.depends_on (the same way lab/checker/orchestrator.rs already chains its jobs). This rides the working ordering primitive (the poll loop gates a job until its depends_on jobs reach Succeeded) and does NOT depend on hero_proc#135, which only wires service-level [[dependencies]] / requires / after, and those are inert on the start path (see hero_proc#135).
setpgid; this media service is separate from hero_livekit_server, so a control-plane restart leaves it alone.
Avoid the alternative of separate services for livekit-server vs lk-backend wired by service-level requires/after: that ordering is unenforced until hero_proc#135 lands.
Repo ownership
depends_on, slim hero_livekit_server to control-plane only.
livekit-server is an external downloaded binary (not a repo crate), so it needs the onlyoffice-style acquire + launcher idiom (hero_skills/crates/lab/src/service/service_onlyoffice.rs). Reuse that pattern (bespoke service_livekit.rs) or file a hero_skills issue to generalize "register an external downloaded binary as a hero_proc service".
depends_on.
Effort
~3–5 days (see #41 spec). No wait on the hero_proc redesign for the core fix.
Related
depends_on, not requires/after)