Control-plane restart drops all active calls — liveness is handle-based, not process-based #48

Closed
opened 2026-06-08 21:42:58 +00:00 by sameh-farouk · 2 comments
Member

Summary

Restarting hero_livekit_server (deploy, crash, or hero_proc-driven restart) kills all active LiveKit media and disconnects every participant in every room. This is silent and automatic.

Root cause

status() and the supervision in start()/ensure_ready are handle-based: liveness is judged by the in-memory Child handles (State { livekit: Option<Child>, backend: Option<Child> } in livekit/rpc.rs). When the daemon restarts, the new process has no handles for the still-alive livekit-server/lk-backend, so:

  1. status() returns stopped (it can't see the orphaned-but-alive children).
  2. ensure_ready (turnkey, #42) therefore re-provisions, and start() runs pkill -f livekit-server / pkill -f lk-backend then respawns.
  3. Net: every control-plane restart kills the media server and lk-backend and respawns them → all live calls drop.

Verified on a real server during #42 testing: after a server restart with media running, status reports stopped and the children are churned.
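
A minimal sketch of why a handle-based check reports stopped after a restart. Only the State shape is taken from the description above; the Status enum and the handle_alive/handle_status helpers are hypothetical names for illustration, not the actual code in livekit/rpc.rs.

```rust
use std::process::Child;

/// Same shape as the State quoted above: liveness lives only in memory.
#[derive(Default)]
struct State {
    livekit: Option<Child>,
    backend: Option<Child>,
}

#[derive(Debug, PartialEq)]
enum Status {
    Running,
    Stopped,
}

/// Hypothetical helper: a slot counts as alive only if this process holds a
/// handle for it and that child has not exited yet.
fn handle_alive(slot: &mut Option<Child>) -> bool {
    match slot {
        // try_wait() returns Ok(None) while the child is still running.
        Some(child) => matches!(child.try_wait(), Ok(None)),
        // No handle: the check cannot see any process, whatever the OS says.
        None => false,
    }
}

/// Hypothetical handle-based status: both handles must be present and alive.
fn handle_status(state: &mut State) -> Status {
    if handle_alive(&mut state.livekit) && handle_alive(&mut state.backend) {
        Status::Running
    } else {
        Status::Stopped
    }
}
```

A freshly started daemon begins with State::default(), so under this sketch handle_status reports Stopped regardless of what is running on the host.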

Impact

User-facing availability defect. Any restart of the orchestrator daemon (deploy, crash-loop, hero_proc retry) tears down every active huddle/call. Critical before production traffic.

Targeted fix (near-term, within current architecture)

  • Make liveness port/process-based: status() reports running if livekit-server is actually listening on the configured port (and lk-backend on its port), independent of in-memory handles.
  • Make start()/ensure_ready non-disruptive: do NOT pkill+respawn media that is already healthy; only (re)spawn what is actually down.
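
A rough sketch of the two bullets above, assuming a plain TCP connect is an acceptable probe. The function names port_is_listening and units_to_spawn are hypothetical, and the address and timeout would come from the real configuration. A successful connect only shows that something is listening on the port, not that it is a healthy livekit-server.

```rust
use std::net::{SocketAddr, TcpStream};
use std::time::Duration;

/// True if something accepts TCP connections at `addr` within `timeout`.
/// This proves a listener exists, nothing more.
fn port_is_listening(addr: SocketAddr, timeout: Duration) -> bool {
    TcpStream::connect_timeout(&addr, timeout).is_ok()
}

/// Hypothetical helper: names of the units that still need to be spawned.
/// Anything already up is left alone (no pkill, no respawn).
fn units_to_spawn(livekit_up: bool, backend_up: bool) -> Vec<&'static str> {
    let mut down = Vec::new();
    if !livekit_up {
        down.push("livekit-server");
    }
    if !backend_up {
        down.push("lk-backend");
    }
    down
}
```

With this shape, start() would only act on the names returned by units_to_spawn, so a healthy media server is never touched.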

Proper fix (long-term)

Subsumed by B1 (hero_proc owns livekit-server + lk-backend as supervised units) — then the media survives a control-plane restart entirely. See the B1 architecture issue + the design spec in #41.

Related

  • #42 (turnkey ensure_ready — where this surfaced)
  • #41 (B1 design spec comment)
Author
Member

Correction — the "targeted fix" in the description does NOT work (verified)

I claimed the media survives a control-plane restart (orphaned-but-alive) and only ensure_ready's pkill kills it — so a port/process-based status() + non-disruptive start() would spare live calls. That is wrong.

Empirical test (real server): gated ensure_ready OFF (unset LIVEKIT_VERSION so it cannot churn), then restarted only hero_livekit_server:

before:  livekit-server=246662  lk-backend=246663
after:   PID 246662 DEAD   PID 246663 DEAD   status: stopped

Both media processes died with the control-plane daemon — hero_proc takes down the process group; the media are children of hero_livekit_server. So by the time a new server process runs, the media is already gone. A non-disruptive/port-based liveness check has nothing to preserve.

Corrected conclusion

In the current architecture, restarting the control plane always drops all calls. The targeted fix is insufficient. The only real fix is decoupling the media lifecycle from the control-plane process:

  • Proper: B1 (#49) — hero_proc owns livekit-server + lk-backend as independent units; restarting the control-plane unit doesn't touch them.
  • (A detached-spawn hack — new session so the children survive the parent + hero_proc group-kill — would technically work but introduces orphan-management problems B1 solves cleanly. Not recommended.)
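
For illustration only, a sketch of what the detached-spawn variant could look like on Unix using just the standard library. It is a weaker form than the one described above: process_group(0) puts the child in its own process group, whereas a new session would need setsid() through an external crate such as libc or nix. The function name spawn_in_own_group is hypothetical, and whether a separate process group is enough to survive hero_proc's kill is an assumption this sketch does not verify.

```rust
use std::io;
use std::os::unix::process::CommandExt;
use std::process::{Child, Command, Stdio};

/// Spawn `program` in a new process group (pgid = the child's own pid), so a
/// signal sent to the parent's process group is not delivered to it.
/// The child stays in the parent's session; this is not a setsid().
fn spawn_in_own_group(program: &str, args: &[&str]) -> io::Result<Child> {
    Command::new(program)
        .args(args)
        .stdin(Stdio::null())
        .process_group(0) // 0 means "use the child's pid as the group id"
        .spawn()
}
```

Children started this way outlive the daemon, so a restarted control plane would have to rediscover and adopt them, which is the orphan-management problem the bullet above refers to.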

This issue therefore stands as the production impact that motivates B1, not as a separate, cheaper fix. I suggest treating #49 as the fix for this issue.

Author
Member

Closing — folded into #49. The targeted fix in this issue does not work (the media die with the control-plane process group, verified empirically), so there is no separate/cheaper fix: B1 (#49) is the sole fix, and #49 now carries this restart-drops-calls impact + the proof as its headline motivation.

Reference
lhumina_code/hero_livekit#48