Control-plane restart drops all active calls — liveness is handle-based, not process-based #48
Summary
Restarting hero_livekit_server (deploy, crash, or hero_proc-driven restart) kills all active LiveKit media and disconnects every participant in every room. This is silent and automatic.
Root cause
status() and the supervision in start() / ensure_ready are handle-based: liveness is judged by the in-memory Child handles (State { livekit: Option<Child>, backend: Option<Child> } in livekit/rpc.rs). When the daemon restarts, the new process has no handles for the still-alive livekit-server / lk-backend, so:
status() returns stopped (it can't see the orphaned-but-alive children).
ensure_ready (turnkey, #42) therefore re-provisions, and start() runs pkill -f livekit-server / pkill -f lk-backend, then respawns.
Verified on a real server during #42 testing: after a server restart with media running, status reports stopped and the children are churned.
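To make the failure mode concrete, here is a minimal Rust sketch of handle-based liveness. Only the State shape is taken from this issue; the LivenessStatus enum and the body of status() are assumptions for illustration, not the actual code in livekit/rpc.rs.

```rust
use std::process::Child;

/// Same shape as the State quoted above; everything else here is illustrative.
#[derive(Default)]
struct State {
    livekit: Option<Child>,
    backend: Option<Child>,
}

/// Hypothetical status type for this sketch.
#[derive(Debug, PartialEq)]
enum LivenessStatus {
    Running,
    Stopped,
}

/// Handle-based liveness: only children spawned by *this* process count.
fn status(state: &mut State) -> LivenessStatus {
    // A handle is "alive" if it exists and try_wait() reports no exit yet.
    let alive = |slot: &mut Option<Child>| match slot {
        Some(child) => matches!(child.try_wait(), Ok(None)),
        None => false,
    };
    if alive(&mut state.livekit) && alive(&mut state.backend) {
        LivenessStatus::Running
    } else {
        LivenessStatus::Stopped
    }
}
```

A freshly started daemon begins with State::default(), so both handles are None and status() reports Stopped regardless of what is actually running on the host.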
Impact
User-facing availability defect. Any restart of the orchestrator daemon (deploy, crash-loop, hero_proc retry) tears down every active huddle/call. Critical before production traffic.
Targeted fix (near-term, within current architecture)
status() reports running if livekit-server is actually listening on the configured port (and lk-backend on its port), independent of in-memory handles.
Make start() / ensure_ready non-disruptive: do NOT pkill + respawn media that is already healthy; only (re)spawn what is actually down.
Proper fix (long-term)
Subsumed by B1 (hero_proc owns livekit-server + lk-backend as supervised units); the media then survives a control-plane restart entirely. See the B1 architecture issue + the design spec in #41.
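The port-based liveness check and non-disruptive start described under the targeted fix could look roughly like the sketch below. The names is_listening, SpawnAction, plan_start and plan_from_ports are hypothetical, the ports are parameters because the issue does not state them, and probing 127.0.0.1 over TCP is an assumption about how the media processes are reachable.

```rust
use std::net::{SocketAddr, TcpStream};
use std::time::Duration;

/// True if something accepts TCP connections on 127.0.0.1:port.
/// Assumption: the media processes listen on loopback-reachable TCP ports.
fn is_listening(port: u16) -> bool {
    let addr = SocketAddr::from(([127, 0, 0, 1], port));
    TcpStream::connect_timeout(&addr, Duration::from_millis(200)).is_ok()
}

/// What a non-disruptive start would do with one media process.
#[derive(Debug, PartialEq)]
enum SpawnAction {
    Keep,  // already healthy: no pkill, no respawn
    Spawn, // actually down: (re)spawn it
}

/// Decide per process, so a healthy one is never churned.
fn plan_start(livekit_up: bool, backend_up: bool) -> (SpawnAction, SpawnAction) {
    let decide = |up: bool| if up { SpawnAction::Keep } else { SpawnAction::Spawn };
    (decide(livekit_up), decide(backend_up))
}

/// Same decision, driven by the port probes instead of in-memory handles.
fn plan_from_ports(livekit_port: u16, backend_port: u16) -> (SpawnAction, SpawnAction) {
    plan_start(is_listening(livekit_port), is_listening(backend_port))
}
```

This only helps when the media processes actually outlive the daemon; if they are already gone by the time the new process starts, both probes fail and everything is respawned anyway.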
Correction — the "targeted fix" in the description does NOT work (verified)
I claimed the media survives a control-plane restart (orphaned-but-alive) and only ensure_ready's pkill kills it, so a port/process-based status() + non-disruptive start() would spare live calls. That is wrong.
Empirical test (real server): gated ensure_ready OFF (unset LIVEKIT_VERSION so it cannot churn), then restarted only hero_livekit_server:
Both media processes died with the control-plane daemon: hero_proc takes down the process group, and the media are children of hero_livekit_server. So by the time a new server process runs, the media is already gone. A non-disruptive/port-based liveness check has nothing to preserve.
Corrected conclusion
In the current architecture, restarting the control plane always drops all calls. The targeted fix is insufficient. The only real fix is decoupling the media lifecycle from the control-plane process:
livekit-server + lk-backend as independent units; restarting the control-plane unit doesn't touch them.
This issue therefore stands as the production impact that motivates B1, not as a separate, cheaper fix. Suggest treating #49 as the fix for this.
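The difference between the two layouts can be sketched as a toy model in Rust. It assumes, per the empirical test, that restarting a unit also takes down every process beneath it; the function affected_by_restart and the two topology constants are hypothetical and do not reflect hero_proc's real API or configuration.

```rust
/// (unit, parent) pairs; a parent of None means hero_proc supervises the unit directly.
type Topology = [(&'static str, Option<&'static str>)];

/// Current layout: the media processes are children of the control plane.
const CURRENT_LAYOUT: &Topology = &[
    ("hero_livekit_server", None),
    ("livekit-server", Some("hero_livekit_server")),
    ("lk-backend", Some("hero_livekit_server")),
];

/// B1 layout: all three are independent units.
const B1_LAYOUT: &Topology = &[
    ("hero_livekit_server", None),
    ("livekit-server", None),
    ("lk-backend", None),
];

/// Units that go down when `unit` is restarted: the unit plus all descendants.
fn affected_by_restart(topology: &Topology, unit: &str) -> Vec<&'static str> {
    let mut affected: Vec<&'static str> = topology
        .iter()
        .map(|(name, _)| *name)
        .filter(|name| *name == unit)
        .collect();
    // Walk the list breadth-first, pulling in children of anything already affected.
    let mut i = 0;
    while i < affected.len() {
        let parent = affected[i];
        for (name, p) in topology {
            if *p == Some(parent) && !affected.contains(name) {
                affected.push(*name);
            }
        }
        i += 1;
    }
    affected
}
```

With CURRENT_LAYOUT, restarting hero_livekit_server affects all three processes; with B1_LAYOUT it affects only hero_livekit_server, which is the property this issue needs.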
Closing — folded into #49. The targeted fix in this issue does not work (the media die with the control-plane process group, verified empirically), so there is no separate/cheaper fix: B1 (#49) is the sole fix, and #49 now carries this restart-drops-calls impact + the proof as its headline motivation.