feat(collab+livekit): service_livekit module + service_collab dev-auth auto-chain #115
## Summary

Makes `service_collab start` produce a working dev stack out of the box on a multi-user Hero host (provisioned via `multi_user_add`), by:

- `service_livekit.nu` — new module that wraps `hero_livekit_server` + `hero_livekit_ui` as hero_proc-supervised actions, following the same pattern as `service_os` / `service_collab`.
- `service_collab.nu` — adds `--auth-mode`, `--seed-dev-users`, and `--livekit-*` flags, detects dev-box context, and defaults to `--auth-mode=dev --seed-dev-users` plus auto-chaining `service_livekit` when credentials are missing.
- `hero_loader.nu` — reads `~/hero/var/hero_livekit/runtime.json` on shell startup and populates `$env.LIVEKIT_{URL,API_KEY,API_SECRET}`, so every nu shell sees consistent LiveKit credentials.

## Problem this solves
On a standard multi-user dev host today:

- `service_collab start` with no flags defaults to `auth_mode=proxy` (Mahmoud's documented safe-prod default).
- Proxy mode assumes `hero_proxy` in front, injecting `X-User-*` identity headers after OAuth/bearer auth.
- The actual stack runs `hero_router`, not `hero_proxy` — grep confirms no auth-header injection anywhere in hero_skills.
- Standing up LiveKit means `hero_collab/deploy/`, editing `.env`, `docker-compose up`, mirroring creds into `secrets.toml` — all manual, easy to skip or misconfigure.

## Shape
### `service_livekit.nu` (new, +253)

Standard install/start/stop/status surface mirroring `service_os.nu`. Soft preflight that warns (does not hard-fail) when Redis at `127.0.0.1:6379` is unreachable, since hero_livekit needs it for room/participant state.

### `service_collab.nu` (+156 −21)

New flags on `start`:

- Dev-box detection (`svx_is_dev_box`): presence of `~/hero/cfg/hero_cfg.toml` or `/etc/hero-users/<user>.env`. These are the same signals `hero_loader.nu` already uses to populate `MYCELIUM_IP`, so the convention is consistent.
- Credential-resolution order for LiveKit (per field): explicit flag → `$env.LIVEKIT_*` → auto-start via `service_livekit` and read `runtime.json`.

### `hero_loader.nu` (+29)

Additional try-block after the mycelium env block. Reads `runtime.json` if present, sets `$env.LIVEKIT_*` only when not already set (so `secrets source` / explicit `secrets.toml` overrides keep winning).

## Security considerations
LiveKit credentials do not land on argv. They go into the hero_proc action spec's `env:` map only, which reaches `hero_collab_server` via `/proc/<pid>/environ` — owner-only readable (0400). Only `--auth-mode` and `--seed-dev-users` are on argv, and neither is sensitive.

This is deliberate: argv leaks via `/proc/<pid>/cmdline` are world-readable, and we just closed exactly that class of bug for `service_proc start` in PR #114. Not repeating the mistake.

## Behavior matrix
| Scenario | Behavior |
|---|---|
| `service_collab start` (no flags) on a dev box | `service_livekit install` auto-runs (cargo build), then starts |
| Non-dev host (no `hero_cfg.toml`, no `/etc/hero-users/*`) | `--auth-mode=proxy`, LiveKit disabled |
| Explicit opt-out | `--auth-mode proxy` or `--no-livekit` |

## Testing
Static syntax check on all four files via `nu -c "use <file> *"` — clean.

Runtime verification on a fresh dev box would be:

1. `service_proc start` — verify hero_proc up.
2. `service_collab start` — verify it auto-installs hero_livekit, starts both, prints `auth mode : dev`, `livekit : ws://...`.
3. `ps aux | grep -E 'hero_collab|hero_livekit'` — verify LiveKit creds never appear on any argv line.
4. `service_collab start --auth-mode proxy --no-livekit` on the same box — verify it skips the livekit auto-chain and passes proxy mode to collab (exercises the opt-out path).
5. `service_collab stop && service_livekit stop` — verify clean teardown.

## Notes for review
@Mahmoud-Emad — this changes the default for `service_collab start` on dev boxes. Wanted to flag it directly since the previous behavior was a documented design decision (your comment in the original module). Rationale:

- `auth_mode=proxy` without `hero_proxy` in the path is functionally broken; every dev has been editing action specs or not using collab.
- Detection keys off the `multi_user_add` footprint, so prod hosts that have never seen that provisioning flow keep their old default.
- Opting back out is a single flag: `--auth-mode proxy`.

Happy to adjust the detection heuristic or the defaults if you'd prefer a different policy — this PR is easy to amend.
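For concreteness, the dev-box heuristic under discussion looks roughly like this in bash — a hedged rendering only, since the PR implements it in nushell as `svx_is_dev_box`; the two probe paths are the ones named in the description (shared with `hero_loader.nu`'s `MYCELIUM_IP` detection):

```shell
# Hypothetical bash rendering of the svx_is_dev_box heuristic.
# A host counts as a dev box when either provisioning artifact exists.
is_dev_box() {
  [ -f "$HOME/hero/cfg/hero_cfg.toml" ] || [ -f "/etc/hero-users/$USER.env" ]
}

if is_dev_box; then
  echo "defaults: --auth-mode=dev --seed-dev-users"
else
  echo "defaults: --auth-mode=proxy, livekit disabled"
fi
```

Prod hosts never provisioned via `multi_user_add` have neither file, so they fall through to the old proxy default.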
## Commits

Three logical units. Squash at merge if preferred.
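Step 3 of the runtime verification (creds never on argv) rests on the `/proc` permission model from the security notes; it can be checked empirically on any Linux box:

```shell
# /proc/<pid>/cmdline is world-readable; /proc/<pid>/environ is 0400.
# Demonstrated with a throwaway process carrying a dummy secret in env.
SECRET=demo-secret sleep 5 &
pid=$!
stat -c '%A %n' /proc/$pid/cmdline /proc/$pid/environ
# cmdline exposes the full argv to every user on the host:
tr '\0' ' ' < /proc/$pid/cmdline; echo
kill $pid
```

This is why the secret rides the action spec's `env:` map (and later a 0600 file), while only the non-sensitive flags stay on argv.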
## Follow-ups (out of scope)
- Verify `hero_collab_server` actually reads LiveKit creds from env (expected via clap `env = "LIVEKIT_..."`). If not, a small clap config change is needed there for this PR to reach its full effect.

---

Adds `service_livekit install|start|stop|status` following the pattern of service_collab / service_os. Registers `hero_livekit_server` and `hero_livekit_ui` as hero_proc actions on `~/hero/var/sockets/hero_livekit/{rpc,ui}.sock`.

Soft preflight warns when Redis is not reachable on `127.0.0.1:6379`, since hero_livekit uses it for room/participant state and crashes on first request without it. Warn-only so operators running Redis on a non-default setup are not blocked.

First `start` generates `~/hero/var/hero_livekit/runtime.json` with a fresh api_key/api_secret pair. `hero_loader.nu` picks those up into `$env.LIVEKIT_*` on the next nu shell (follow-up commit).

Also registers the module in `services/mod.nu` so it is importable via the standard `use services *` loader path.

---

`and` expression (de135ef8bf)

Nushell parses a line starting with `and` as a command invocation, not a continuation of the previous boolean expression. The syntax

```nu
let x = (not $foo) and ($bar | is-not-empty)
```

fails with `Command 'and' not found` at runtime. Fold the livekit_enabled check into an if/else that short-circuits the `$no_livekit` case and keeps the conjunction on a single line.

---

hero_livekit_server does NOT auto-generate runtime.json on startup — it waits for an explicit LiveKitService.configure RPC call before writing the file. Without this, downstream consumers see no credentials: hero_loader.nu finds no runtime.json to read, and service_collab's dev auto-chain can't resolve livekit creds and falls back to starting collab without livekit.

Add an `svx_configure_livekit` helper that:

- Polls the rpc.sock for up to 3s in case the supervisor is still binding (defensive; the hero_proc health-check above should already block on the socket).
- POSTs LiveKitService.configure via `curl --unix-socket` with node_ip set from `$env.MYCELIUM_IP` when available, otherwise letting hero_livekit default to 127.0.0.1.
- Verifies runtime.json appears before returning OK.
- Idempotent — calling again with the same params after runtime.json exists is a no-op from the supervisor's perspective.

Wire it into `start` right after `proc service start` completes.

---

Passing the JSON payload through `-d $body` triggered a nushell external-command word-split on the user's dev box that resulted in curl receiving a mangled URL and erroring with `curl: (6) Could not resolve host: exit`. Feed the body via `$body | ^curl ... --data-binary @-` instead. Side benefit: the JSON payload (which may contain future sensitive fields) stays off `/proc/<pid>/cmdline`.

---

Two issues surfaced when dogfooding on a multi-user dev box:

1. LiveKitService.configure rejects empty params with "Missing required parameter: node_ip". The previous code sent `{}` when `$env.MYCELIUM_IP` was unset, which fails on any host without a mycelium bridge. Always pass node_ip: use `$env.MYCELIUM_IP` when present, fall back to 127.0.0.1.
2. The `$body | ^curl ... --data-binary @-` stdin-pipe form still triggered a nushell external-command word-split on the dev box, producing `curl: (6) Could not resolve host: exit`. Nushell appears to be splitting the command in an unexpected way when external args intermix pipes and strings. Switch to writing the JSON payload to a mktemp file and passing it via `--data-binary @file`. That's unambiguous curl syntax, survives whatever nushell is doing to the invocation, and keeps the payload off `/proc/<pid>/cmdline`. Cleans up the tmpfile after the call.

---

Two dev-box dogfood findings:

1. LiveKitService.configure errors with "Missing required parameter: api_key" when that field is omitted, despite the configuration docs describing it as having a default.
   Send the full set of fields (node_ip, api_key, domain, livekit_port, backend_port, redis_address) with empty-string / zero values where we want the supervisor's defaults — empty api_key specifically triggers the first-run auto-generation of a random api_secret.
2. Running this exact curl invocation interactively in nu succeeds, but the same invocation inside a `def` body (wrapped in `do { ... } | complete`) consistently fails with `curl: (6) Could not resolve host: exit` on the dev box. Nushell's external-command argv handling appears to mangle the call in module scope in ways I couldn't reproduce in the REPL. Work around by passing the whole curl command as a single string to `bash -c`. Bash owns the parsing; nu just invokes bash with one argument. Reliable. The tmpfile approach still keeps the JSON payload off argv.

---

End-to-end testing on a multi-user Hero dev box showed the direct action-spec approach leaves hero_collab_ui running in proxy mode regardless of `--auth-mode`: the ui reads COLLAB_AUTH_MODE from env only (hero_collab_ui/routes.rs:384), and a sibling hero_proc action spec does not inherit env from its peer. hero_collab's CLI handles this at main.rs:250 (`.env("COLLAB_AUTH_MODE", &flags.auth_mode)` before spawning the ui). Bypassing the CLI loses that bridge.

Hand over all hero_proc registration, argv/env layout, and `--livekit-api-secret-file` file-based secret delivery to `hero_collab --start` (the canonical operator entrypoint, per Mahmoud's original module comment). The module's role becomes:

- Preflight (sudo check, hero_proc health, binary installed).
- Resolve effective flags (dev-box default for `--auth-mode`, livekit creds from flag/env/auto-chain).
- Materialise the LiveKit API secret into a 0600 file at `~/hero/cfg/livekit.secret` so it passes via path, not argv or env.
- Forward the final flag set to `hero_collab --start` / `--stop`.

Net: ~80 fewer lines in service_collab.nu, UI auth now propagates correctly, no drift risk on future hero_collab CLI changes.
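The file-based secret delivery just described can be sketched in bash as follows — a hedged rendering only, since the PR does this from nushell; `~/hero/cfg/livekit.secret` is the path named in the commit:

```shell
# Hypothetical sketch: materialise the LiveKit API secret into a 0600 file.
secret_file="$HOME/hero/cfg/livekit.secret"
mkdir -p "$(dirname "$secret_file")"
old_umask=$(umask)
umask 077                                    # file is born 0600, no race window
printf '%s' "${LIVEKIT_API_SECRET:-demo-secret}" > "$secret_file"
umask "$old_umask"
chmod 600 "$secret_file"                     # belt and braces
ls -l "$secret_file"
```

Consumers then receive only the path (via `--livekit-api-secret-file` or `$env.LIVEKIT_API_SECRET_FILE`), so the secret itself never rides argv or broadly inherited env.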
Also adjust hero_loader.nu: stop populating `$env.LIVEKIT_API_SECRET` (env inheritance is broad), and instead populate `$env.LIVEKIT_API_SECRET_FILE` pointing at `~/hero/cfg/livekit.secret` when the file exists.

---

`or` in livekit-resolve (7672e17d44)

Previously the module only called LiveKitService.configure (which writes runtime.json), leaving hero_livekit_server as an idle supervisor — livekit-server and lk-backend binaries were never spawned, nothing bound on port 7880, and browsers attempting to join huddles hit ERR_CONNECTION_REFUSED at `ws://[node_ip]:7880/rtc/v1`.

Replace `svx_configure_livekit` with `svx_bootstrap_livekit`, which drives the three supervisor RPCs in order:

1. LiveKitService.install — download upstream binaries (idempotent no-op when already present).
2. LiveKitService.configure — write runtime.json (as before).
3. LiveKitService.start — spawn livekit-server and lk-backend; this is the step that opens port 7880.

Extract the single-RPC curl-via-bash-c machinery into `svx_lk_rpc` so all three calls reuse the same transport (and the earlier nushell-external-arg workaround). Install failure is non-fatal (binaries may already be present from a previous run). Configure and start failures abort the bootstrap with a clear message; collab will still come up without livekit (the huddle button grays out).

---

LiveKitService.install expects the `livekit_version` and `backend_version` fields to be PRESENT in the params record, not just valid. Empty-string values are the supervisor's "resolve latest upstream GitHub release" signal (per hero_livekit/docs/api.md). Previously we sent `params: {}`, which failed the same "Missing required parameter" check that configure hits. Cascade: install errored → binaries not downloaded → start errored with "binaries not installed — call install first".

---

The supervisor registers methods as `livekitservice.install`, `livekitservice.configure`, `livekitservice.start`, etc.
(lowercase), not `LiveKitService.install` as the hero_livekit docs/api.md suggested. Confirmed by a strings-extract on the `hero_livekit_server` binary:

    livekitservice_rpc_install
    livekitservice_rpc_configure
    livekitservice_rpc_start
    livekitservice_rpc_stop
    livekitservice_rpc_restart
    livekitservice_rpc_status
    livekitservice.install
    livekitservice.configure
    livekitservice.start
    ...

The PascalCase form returns "Method not found" for install, and while some PascalCase calls appeared to work earlier, that was almost certainly confirmation bias reading error messages. Aligning to what the binary actually dispatches on.

---

The hero_livekit_backend crate in the hero_livekit workspace produces a binary pinned to the name "lk-backend". The supervisor spawns this alongside upstream livekit-server on start, and guards the start operation behind a binary-presence check:

    {code: -32000, message: "Invalid input: binaries not installed — call install first"}

SVX_BINARIES drove cargo's `--bin` selection and the install verification, but only listed hero_livekit_server and hero_livekit_ui. lk-backend was never built, never copied to `~/hero/bin`, and the supervisor refused to start. Adding it to the list fixes the build and the supervisor's presence check in one step.

(This is a separate concern from the missing livekitservice.install RPC method — that method is not registered in the dispatcher for some reason, but it turns out we don't need it: cargo can build everything directly, and livekit-server itself is already downloaded at `~/hero/bin/livekit-server`.)
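Assembled from the transport fixes in the commits above, a bash equivalent of the final single-RPC helper (the PR's `svx_lk_rpc` is nushell invoking `bash -c`) looks roughly like this. The `http://localhost/` URL is an assumption — curl ignores the host when `--unix-socket` is given, but hero_livekit's actual request path may differ:

```shell
# Hypothetical sketch of the svx_lk_rpc transport: JSON-RPC over a unix
# socket, payload delivered via tmpfile so it stays off /proc/<pid>/cmdline.

build_rpc_body() {
  # $1 = method (lowercase, e.g. livekitservice.configure — see commit notes)
  # $2 = params as a JSON object string; emits the JSON-RPC 2.0 body on stdout
  printf '{"jsonrpc":"2.0","id":1,"method":"%s","params":%s}' "$1" "$2"
}

lk_rpc() {
  sock="$HOME/hero/var/sockets/hero_livekit/rpc.sock"
  tmp="$(mktemp)"
  build_rpc_body "$1" "$2" > "$tmp"        # payload goes to a file, not argv
  curl -s --unix-socket "$sock" -X POST http://localhost/ --data-binary @"$tmp"
  rc=$?
  rm -f "$tmp"                             # clean up the tmpfile after the call
  return $rc
}

# Always pass node_ip and send every field (empty string / zero selects the
# supervisor's defaults), per the dogfood findings above:
# lk_rpc livekitservice.configure \
#   '{"node_ip":"127.0.0.1","api_key":"","domain":"","livekit_port":0,"backend_port":0,"redis_address":""}'
```

Letting bash own the parsing sidesteps the module-scope argv mangling described in the commits; the tmpfile keeps the payload off every process's cmdline.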