lab: model sibling-supervised daemons (supervised flag) so lab build --start stops starting children like lk-backend #315

Open
opened 2026-06-04 19:22:22 +00:00 by sameh-farouk · 1 comment
Member

Problem

lab build --start / --restart starts every binary whose service.toml kind ∈ {server, admin, web} as a standalone hero_proc service. But some long-running daemons are supervised by a sibling binary, not by hero_proc. For example, hero_livekit's lk-backend and livekit-server are spawned/managed by hero_livekit_server via its start() RPC. When lab starts lk-backend standalone, the process lacks the env/config the parent injects, so the service fails and the CI/build fails with it:

starting lk-backend …
  ERROR: 'lk-backend' did not become fully running within 10s
  last service state: failed
FAILED: lk-backend… service 'lk-backend' failed validation: state=failed

Root cause — a modeling gap

kind conflates two orthogonal properties:

  • what kind of process this is (server / admin / web / cli)
  • who owns its lifecycle (hero_proc/lab vs. a sibling binary)

lk-backend is genuinely a server (long-running daemon), but its lifecycle is owned by hero_livekit_server. There's no way to express that, so the kind-only filter starts it.

6aac3e8 (register hero_livekit in SERVICE_MAP) fixed one path — lab service hero_livekit consults the hardcoded SERVICE_MAP, which correctly lists only hero_livekit_server + hero_livekit_admin. But the lab build --start / fast_teardown path does not consult SERVICE_MAP — it re-derives from kind (fast_teardown.rs:422, fast_teardown.rs:520, service_manager.rs:2179) and still starts lk-backend. The two paths disagree.
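To make the disagreement concrete, here is a minimal sketch of the two derivations side by side. The `Binary` and `Kind` types, the shape of `SERVICE_MAP`, and the kinds assigned to the server, admin and do binaries are assumptions for illustration only; they are not copied from lab or hero_lib.

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, Copy, PartialEq)]
#[allow(dead_code)]
enum Kind {
    Server,
    Admin,
    Web,
    Cli,
}

struct Binary {
    name: &'static str,
    kind: Kind,
}

// Binaries named in this issue. Kinds other than lk-backend's are assumed.
fn livekit_binaries() -> Vec<Binary> {
    vec![
        Binary { name: "hero_livekit_server", kind: Kind::Server },
        Binary { name: "hero_livekit_admin", kind: Kind::Admin },
        Binary { name: "lk-backend", kind: Kind::Server },
        Binary { name: "hero_do_hero_livekit", kind: Kind::Cli },
    ]
}

// Hypothetical shape for SERVICE_MAP: service name -> binaries lab starts.
fn service_map() -> HashMap<&'static str, Vec<&'static str>> {
    HashMap::from([(
        "hero_livekit",
        vec!["hero_livekit_server", "hero_livekit_admin"],
    )])
}

// Path A: curated lookup, as `lab service hero_livekit` is described to do.
fn start_set_from_map(service: &str) -> Vec<&'static str> {
    service_map().get(service).cloned().unwrap_or_default()
}

// Path B: kind-only derivation, as the build/teardown path is described to do.
fn start_set_from_kind(bins: &[Binary]) -> Vec<&'static str> {
    bins.iter()
        .filter(|b| matches!(b.kind, Kind::Server | Kind::Admin | Kind::Web))
        .map(|b| b.name)
        .collect()
}

fn main() {
    println!("SERVICE_MAP path: {:?}", start_set_from_map("hero_livekit"));
    println!("kind-only path:   {:?}", start_set_from_kind(&livekit_binaries()));
}
```

Only the kind-only path includes lk-backend, which is the mismatch described above.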

Proposed fix — a first-class supervised flag on [[binaries]]

1. hero_lib — crates/core/src/base/service.rs, add to Binary:

/// Long-running, but its lifecycle is owned by a sibling binary in the same
/// service (e.g. spawned via the server's start() RPC), not by hero_proc/lab.
/// lab still installs it, but never registers / starts / tears it down standalone.
#[serde(default)]
pub supervised: bool,

2. lab — collapse the 3 duplicated checks into one predicate + the guard:

pub fn is_lab_managed_daemon(b: &Binary) -> bool {
    matches!(b.kind, Kind::Server | Kind::Admin | Kind::Web) && !b.supervised
}

Apply at service_manager.rs:2179, fast_teardown.rs:422, fast_teardown.rs:520. Now both lab service and lab build --start agree.
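As a rough sketch of what "one predicate at every site" could look like, the example below routes both a start decision and a teardown decision through the same function. The simplified `Binary` and `Kind` types and the helper names `binaries_to_start` and `binaries_to_tear_down` are hypothetical stand-ins, not the real lab call sites.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
#[allow(dead_code)]
enum Kind {
    Server,
    Admin,
    Web,
    Cli,
}

struct Binary {
    name: &'static str,
    kind: Kind,
    supervised: bool,
}

// The predicate proposed in this issue.
fn is_lab_managed_daemon(b: &Binary) -> bool {
    matches!(b.kind, Kind::Server | Kind::Admin | Kind::Web) && !b.supervised
}

// Hypothetical start site: only lab-managed daemons are started.
fn binaries_to_start(bins: &[Binary]) -> Vec<&'static str> {
    bins.iter()
        .filter(|b| is_lab_managed_daemon(b))
        .map(|b| b.name)
        .collect()
}

// Hypothetical teardown site: same predicate, so the two cannot drift apart.
fn binaries_to_tear_down(bins: &[Binary]) -> Vec<&'static str> {
    bins.iter()
        .filter(|b| is_lab_managed_daemon(b))
        .map(|b| b.name)
        .collect()
}

// Sample manifest; kinds other than lk-backend's are assumed.
fn sample_binaries() -> Vec<Binary> {
    vec![
        Binary { name: "hero_livekit_server", kind: Kind::Server, supervised: false },
        Binary { name: "hero_livekit_admin", kind: Kind::Admin, supervised: false },
        Binary { name: "lk-backend", kind: Kind::Server, supervised: true },
        Binary { name: "hero_do_hero_livekit", kind: Kind::Cli, supervised: false },
    ]
}

fn main() {
    let bins = sample_binaries();
    println!("start:     {:?}", binaries_to_start(&bins));
    println!("tear down: {:?}", binaries_to_tear_down(&bins));
}
```

With the flag set, lk-backend keeps kind = "server" yet drops out of both lists.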

3. hero_livekit service.toml (×4) — keep the accurate kind, declare ownership:

[[binaries]]
name = "lk-backend"
kind = "server"      # accurate: it IS a long-running server
supervised = true    # but hero_livekit_server owns its lifecycle

4. (optional) retire the SERVICE_MAP hero_livekit entry from 6aac3e8 — the flag now covers every path, so the hardcoded curation becomes redundant (one source of truth).

Why this design

  • Models reality: separates process type (kind) from lifecycle ownership (supervised). No semantic lie, unlike relabelling lk-backend to kind=cli.
  • Decentralized + generalizes: any sibling-supervised daemon (e.g. the OnlyOffice backend) declares it in its own manifest — no per-service hardcoded SERVICE_MAP entry.
  • Single source of truth: one predicate honored by both lab paths.
  • Rollout-safe, no flag-day: the ServiceToml/Binary structs have no #[serde(deny_unknown_fields)] and the field is #[serde(default)]. So old lab + new service.toml → ignores the field; new lab + old service.toml → defaults false. Both directions safe, any merge order.

Alternatives considered

  • New Kind::Backend variant — breaks every exhaustive match on Kind ecosystem-wide. Rejected (additive bool is non-breaking).
  • Make fast_teardown read SERVICE_MAP — keeps the centralized hardcoded list; only helps mapped repos.
  • kind = "cli" workaround — lies about process type, mislabels in catalogs, repeated per-repo.

Pre-merge check

Run a cross-org grep for any struct-literal construction of Binary { … } that lists every field explicitly; those sites need the new field. Deserialization sites (the majority) are unaffected.
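The sketch below shows why struct literals are the sites to look for. It assumes a simplified `Binary` that derives `Default`; whether the real hero_lib struct does so is not stated in this issue.

```rust
#[derive(Debug, Default, Clone, Copy, PartialEq)]
#[allow(dead_code)]
enum Kind {
    #[default]
    Cli,
    Server,
    Admin,
    Web,
}

#[derive(Debug, Default)]
struct Binary {
    name: String,
    kind: Kind,
    supervised: bool,
}

// A literal naming every field: without the `supervised` line this would stop
// compiling once the field is added, so each such site needs an edit.
fn explicit_literal() -> Binary {
    Binary {
        name: "lk-backend".to_string(),
        kind: Kind::Server,
        supervised: true,
    }
}

// Struct update syntax: new fields are filled from Default, so this site
// keeps compiling unchanged (only possible if Binary implements Default).
fn literal_with_defaults() -> Binary {
    Binary {
        name: "hero_livekit_server".to_string(),
        kind: Kind::Server,
        ..Default::default()
    }
}

fn main() {
    println!("{:?}", explicit_literal());
    println!("{:?}", literal_with_defaults());
}
```

Sites written in the second form pick up supervised = false automatically, which matches the `#[serde(default)]` behaviour on the deserialization side.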

Author
Member

Decision: going with the simpler "install-only" approach, not the supervised flag

After implementing and testing the supervised flag end-to-end, we are backing it out in favour of a lighter approach. Two reasons:

1. It is more machinery than the problem needs. The flag required a new field in the shared hero_lib service schema plus changes to multiple lab code paths — and there turned out to be four start-decision sites, not three. builder/orchestrator.rs (the path lab build --start actually uses) was missed in the first pass, so the flag silently did nothing on the exact command users run.

2. It is conceptually awkward. Declaring a binary to the supervisor (lab/hero_proc) and then flagging "…but do not supervise it" is contradictory. lab’s manifest should list what it manages.

The simpler model

lab already has an "install but never start" category — that is what cli/tool binaries are (e.g. hero_do_hero_livekit: installed, never started). lk-backend is operationally exactly that from lab’s perspective: lab installs it to ~/hero/bin, and hero_livekit_server spawns/supervises it via its start() RPC. So the fix is to put lk-backend in lab’s install-only bucket rather than invent a new "long-running-but-do-not-start" concept.
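A minimal sketch of the install-only bucket follows. The comment does not say how lk-backend gets placed in that bucket, so the `Kind::Cli` entry below only stands in for whatever mechanism is chosen; the `Plan` type and `plan` function are hypothetical.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
#[allow(dead_code)]
enum Kind {
    Server,
    Admin,
    Web,
    Cli,
}

struct Binary {
    name: &'static str,
    kind: Kind,
}

// What lab would do with a manifest: install everything, start a subset.
struct Plan {
    install: Vec<&'static str>,
    start: Vec<&'static str>,
}

fn plan(bins: &[Binary]) -> Plan {
    // Every declared binary is installed (to ~/hero/bin in the issue's setup).
    let install = bins.iter().map(|b| b.name).collect();
    // Only non-install-only binaries are started by lab.
    let start = bins
        .iter()
        .filter(|b| b.kind != Kind::Cli)
        .map(|b| b.name)
        .collect();
    Plan { install, start }
}

// Assumed manifest with lk-backend placed in the install-only bucket.
fn sample_binaries() -> Vec<Binary> {
    vec![
        Binary { name: "hero_livekit_server", kind: Kind::Server },
        Binary { name: "hero_livekit_admin", kind: Kind::Admin },
        Binary { name: "lk-backend", kind: Kind::Cli },
        Binary { name: "hero_do_hero_livekit", kind: Kind::Cli },
    ]
}

fn main() {
    let p = plan(&sample_binaries());
    println!("install: {:?}", p.install);
    println!("start:   {:?}", p.start);
}
```

lk-backend is still installed, but lab never starts it; hero_livekit_server remains the only thing that spawns it.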

Status

The supervised-flag changes were reverted from the integration branches of hero_lib, hero_skills (lab), and hero_livekit (via git revert, no force-push). The hero_rpc2 migration on hero_livekit integration is untouched.

Follow-up — the actual root issue

lk-backend is declared kind = "server", so every lab start-path tries to launch it standalone, where it dies (it only works as hero_livekit_server’s child). The install-only approach addresses that directly. Also worth a look: all four hero_livekit service.tomls currently list all four binaries (server/admin/lk-backend/do) — that duplication is part of what made this confusing, and is likely where the cleanest fix lives.

Reference
lhumina_code/hero_skills#315