lab service --status/--stop can't see CLI-registered services (SERVICE_MAP vs service.toml/hero_proc registry) #308

Open
opened 2026-06-03 16:17:20 +00:00 by sameh-farouk · 0 comments
Member

Summary

lab service <name> --status (and --stop) cannot see services that were registered with hero_proc by a service's own CLI (e.g. hero_collab --start), because lab and the per-service CLIs register services into hero_proc with two different, incompatible shapes. lab reports state=inactive / pid=0 for services that are in fact running.

This is a symptom of a deeper issue: lab's hardcoded SERVICE_MAP is a parallel source of truth that diverges from each binary's service.toml and from hero_proc's live registry.

Evidence

After hero_collab --start --auth-mode dev --seed-dev-users, hero_proc's service.list contains:

hero_collab          ← the CLI's ServiceBuilder::new("hero_collab") group
hero_collab_server   ← member action
hero_collab_web      ← member action
hero_planner_server  ← (started earlier by `lab service hero_planner --start`)
hero_planner_web     ←   note: NO "hero_planner" group entry

hero_proc job.list (ground truth) shows the collab jobs running:

hero_collab.hero_collab_server  phase=running  pid=<…>
hero_collab.hero_collab_web      phase=running  pid=<…>

But lab service hero_collab --status reports:

service: hero_collab_server   state: inactive  pid: 0
service: hero_collab_web       state: inactive  pid: 0

Root cause: two registration models, not just two namespaces

| | How it registers | Shape in hero_proc |
|---|---|---|
| per-service CLI (hero_collab) | ServiceBuilder::new("hero_collab") + named member actions | hierarchical: service hero_collab → actions hero_collab_server, hero_collab_web; running jobs named hero_collab.<action> |
| lab | hardcoded SERVICE_MAP (crates/lab/src/service/service_manager.rs:508) → one hero_proc service per binary | flat: hero_planner_server, hero_planner_web; no parent group |

lab's do_status(binary) (service_manager.rs:73) calls service_status(name=<binary>) per binary. For lab's own flat registrations that matches. For a CLI's grouped registration, the live job is hero_collab.hero_collab_web (an action under a service) while lab queries a flat hero_collab_web service — which is an empty/stub entry → inactive / pid 0.
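
The mismatch can be reproduced in a small, self-contained model. This is only a sketch: the `Registry` type, the `flat_status` and `grouped_status` helpers, and the pids are hypothetical stand-ins for illustration, not hero_proc's or lab's real API. It assumes a registry keyed by service name, where each service holds its actions and the pid of the running job, if any.

```rust
use std::collections::HashMap;

/// Outcome of a status query in this toy model.
#[derive(Debug, PartialEq)]
enum State {
    Running(u32),
    Inactive,
}

/// Hypothetical registry: service name -> (action name -> pid of its running job).
type Registry = HashMap<String, HashMap<String, Option<u32>>>;

/// Flat lookup (lab's model): the binary name is treated as the service name.
fn flat_status(reg: &Registry, binary: &str) -> State {
    reg.get(binary)
        .and_then(|actions| actions.values().find_map(|pid| *pid))
        .map_or(State::Inactive, State::Running)
}

/// Grouped lookup (per-service CLI model): the job lives under service -> action.
fn grouped_status(reg: &Registry, service: &str, action: &str) -> State {
    reg.get(service)
        .and_then(|actions| actions.get(action).copied().flatten())
        .map_or(State::Inactive, State::Running)
}

/// Registry contents mirroring the Evidence section; pids are placeholders.
fn sample_registry() -> Registry {
    let mut reg = Registry::new();
    // Grouped registration written by the hero_collab CLI.
    reg.insert(
        "hero_collab".to_string(),
        HashMap::from([
            ("hero_collab_server".to_string(), Some(4101)),
            ("hero_collab_web".to_string(), Some(4102)),
        ]),
    );
    // Flat entries with the same names exist, but carry no running job.
    reg.insert("hero_collab_server".to_string(), HashMap::new());
    reg.insert("hero_collab_web".to_string(), HashMap::new());
    // Flat registration written by lab.
    reg.insert(
        "hero_planner_web".to_string(),
        HashMap::from([("hero_planner_web".to_string(), Some(4201))]),
    );
    reg
}
```

In this model, `flat_status(&reg, "hero_collab_web")` hits the stub entry and returns `State::Inactive`, while `grouped_status(&reg, "hero_collab", "hero_collab_web")` finds the running job. The flat lookup still works for `hero_planner_web`, which is why the problem only shows up for CLI-started services.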

Why this matters

  • lab service <name> --status|--stop silently misreports CLI-started services as down. An operator can't manage (or even see) them via lab.
  • Two tools writing two shapes to the same supervisor produces duplicate/overlapping registry entries (hero_collab and hero_collab_server and hero_collab_web).
  • It blocks the intended "use lab as the single operator surface" story: lab can't be the front-end for services that ship their own lifecycle CLI.

Proposed direction (align on a single source of truth)

The supervisor (hero_proc) is the source of truth; the canonical service identity should come from each binary's service.toml (which already declares [service] name = "…" + member binaries/sockets — lab infocheck already reads these). Both lab and the per-service CLIs should defer to that, rather than either tool owning the namespace.

  1. Canonical model = hierarchical (service → actions). It's what service.toml describes and what the per-service CLIs already do. lab's flat-per-binary model is the outlier.
  2. Retire SERVICE_MAP in favour of dynamic discovery from service.toml — this is already the documented long-term intent (lab.md: "SERVICE_MAP is meant to be replaced by dynamic discovery from each binary's embedded service.toml").
  3. lab … --status should query hero_proc's registry by canonical service name and list whatever actions hero_proc reports — so it sees CLI-registered services for free.
  4. lab … --start should register grouped (service + member actions) like the CLIs, so the two tools converge on one shape instead of producing duplicate entries.
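
Point 3 could look roughly like the sketch below: given the canonical service name, collect every job whose name follows the `<service>.<action>` pattern seen in job.list. The `Job` struct and `actions_for_service` are hypothetical names invented for this illustration, and the field layout is an assumption rather than hero_proc's actual response type.

```rust
/// Hypothetical view of one job.list entry.
struct Job {
    name: String,  // e.g. "hero_collab.hero_collab_web"
    phase: String, // e.g. "running"
    pid: u32,
}

/// Return (action, phase, pid) for every job registered under `service`,
/// sorted by action name so the output is stable.
fn actions_for_service<'a>(jobs: &'a [Job], service: &str) -> Vec<(&'a str, &'a str, u32)> {
    let prefix = format!("{}.", service);
    let mut found: Vec<_> = jobs
        .iter()
        .filter_map(|job| {
            // Keep only jobs named "<service>.<action>" and strip the service part.
            job.name
                .strip_prefix(prefix.as_str())
                .map(|action| (action, job.phase.as_str(), job.pid))
        })
        .collect();
    found.sort();
    found
}
```

Because the lookup starts from the service name instead of a hardcoded list of binaries, any action registered under that service is reported regardless of which tool started it.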

A smaller, immediate mitigation (if the full migration is out of scope short-term): have lab … --status fall back to matching <service>.<binary>-named jobs (e.g. hero_collab.hero_collab_web) in hero_proc's job.list when the flat service_status(<binary>) lookup returns not-found/inactive, so that at minimum lab stops misreporting running CLI-started services as down.
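
A minimal sketch of that fallback, under stated assumptions: `Job`, `Status`, and `status_with_fallback` are hypothetical names, and the flat lookup result is passed in as an `Option` instead of calling lab's real service_status. The only behaviour taken from this issue is the `<service>.<binary>` job naming and the inactive / pid 0 report.

```rust
/// Hypothetical view of one job.list entry.
#[derive(Debug, Clone, PartialEq)]
struct Job {
    name: String,  // e.g. "hero_collab.hero_collab_web"
    phase: String, // e.g. "running"
    pid: u32,
}

/// Hypothetical status record, shaped like lab's printed output.
#[derive(Debug, PartialEq)]
struct Status {
    state: String,
    pid: u32,
}

/// Prefer the flat lookup when it reports a running service; otherwise
/// look for a running job whose name ends in ".<binary>".
fn status_with_fallback(flat: Option<Status>, jobs: &[Job], binary: &str) -> Status {
    if let Some(status) = flat {
        if status.state == "running" {
            return status;
        }
    }
    let suffix = format!(".{}", binary);
    jobs.iter()
        .find(|job| job.phase == "running" && job.name.ends_with(suffix.as_str()))
        .map(|job| Status { state: "running".to_string(), pid: job.pid })
        .unwrap_or(Status { state: "inactive".to_string(), pid: 0 })
}
```

The suffix match is deliberately loose: it accepts any parent service, so two services exposing an action with the same binary name would be ambiguous. That is one reason to treat this as a stopgap and not as a replacement for the canonical-name lookup.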

Repro

hero_collab --start --auth-mode dev --seed-dev-users
lab service hero_collab --status     # shows inactive/pid 0
# vs hero_proc job.list which shows the jobs running as hero_collab.hero_collab_web

Found while bringing up hero_collab/hero_planner via both lab and the per-service CLI on 2026-06-03.

Reference
lhumina_code/hero_skills#308