lab: service core leaves the host in a non-recoverable state after a failed start; related CLI/docs polish #282

Closed
opened 2026-05-21 11:43:49 +00:00 by nabil_salah · 1 comment
Member

Summary

Five separate medium-severity issues were observed during a fresh-install test of lab on Ubuntu 24.04 LTS. They are grouped here for triage convenience; happy to split them into individual issues if preferred. None is an install-blocker on its own, but each makes lab less pleasant to use or harder to recover from.

Environment

  • OS: Ubuntu 24.04 LTS, fresh image
  • Shell: bash, root user
  • lab version: lab 0.1.0
  • Triggered after curl … install.sh | bash + lab user init + (eventually-successful) lab install core

1. lab service <X> --stop fails when the service is in failed state

After a lab service core run that failed at the hero_aibroker_server smoke-test phase, the service is left in failed state with restarts: 4 and there is no lab verb that can clean it up.

Reproduce:

root@vmrx5xp:~# lab service hero_aibroker_server --stop
Stopping hero_aibroker_server…
  hero_aibroker_server: stop returned an error (may already be stopped): hero_aibroker_server: stop failed — state 'failed'

root@vmrx5xp:~# lab service hero_aibroker_server --status
service:  hero_aibroker_server
state:    failed
pid:      0
restarts: 4

Note the contradictory wording: "may already be stopped" (i.e., fine) immediately followed by "stop failed — state 'failed'" (i.e., not fine).

Fix: Treat failed state as already-stopped. --stop should deregister with hero_proc, clear the restart counter, remove socket directories, and return success. Anything else leaves the user with no recovery path short of editing hero_proc's SQLite DB by hand.
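The proposed behaviour can be sketched as an idempotent stop. This is a minimal illustration only: `Registry`, `stop_service`, and the `stopped` state name are hypothetical stand-ins, since the real hero_proc interface is not shown in this issue, and socket-directory removal is left out for brevity.

```python
from dataclasses import dataclass, field


@dataclass
class Registry:
    """Hypothetical stand-in for hero_proc's service registry."""
    services: dict = field(default_factory=dict)  # name -> {"state": ..., "restarts": ...}

    def deregister(self, name: str) -> None:
        # Dropping the entry also discards its restart counter.
        self.services.pop(name, None)


# States with no live process to signal ('stopped' is an assumed state name).
NOT_RUNNING = {"failed", "stopped"}


def stop_service(registry: Registry, name: str) -> int:
    """Idempotent stop: returns exit code 0 whether the service was
    running, failed, or not registered at all."""
    entry = registry.services.get(name)
    if entry is None:
        return 0  # nothing registered, so there is nothing to stop
    if entry["state"] not in NOT_RUNNING:
        # A real implementation would signal the process here.
        entry["state"] = "stopped"
    registry.deregister(name)
    return 0
```

With this shape, running --stop twice in a row, or against a service in failed state, ends in the same clean state and the same success exit code.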


2. Failed-start services are left registered with hero_proc

Closely related to #1. When lab service core fails partway through, the half-started service stays registered:

lab service: failed to start 'hero_code_server': dependency 'hero_aibroker_server' built but failed to start: 'hero_aibroker_server' registered with hero_proc but 44 smoke test(s) failed. Service is left running; check the failures above.

This makes lab service core non-idempotent — the next run will see the broken registration and behave inconsistently. The leftover socket directories (~/hero/var/sockets/hero_aibroker/{admin,billing,chat,embedder,images,memory,meta,models,speech,video}/) also linger as empty placeholders.

Fix: On smoke-test or start failure, lab should:

  • Deregister the failed service from hero_proc before exiting.
  • Remove placeholder socket directories created for sub-services that never came up.
  • Return a non-zero exit code so callers/scripts know to retry from a clean state.
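The three steps above could look roughly like the following. It is a sketch under stated assumptions: `cleanup_failed_start` is a hypothetical helper, the `registered` set stands in for whatever hero_proc stores, and only empty directories are removed so that sockets belonging to sub-services that did come up are left alone.

```python
from pathlib import Path


def cleanup_failed_start(registered: set, name: str, socket_root: Path) -> int:
    """Undo a partial start and return the exit code the CLI should use."""
    # Step 1: deregister (stand-in for the real hero_proc call).
    registered.discard(name)

    # Step 2: remove empty placeholder directories, deepest first.
    if socket_root.is_dir():
        subdirs = sorted(
            (p for p in socket_root.rglob("*") if p.is_dir()),
            key=lambda p: len(p.parts),
            reverse=True,
        )
        for directory in subdirs:
            if not any(directory.iterdir()):
                directory.rmdir()
        if not any(socket_root.iterdir()):
            socket_root.rmdir()

    # Step 3: non-zero so callers and scripts can detect the failure.
    return 1
```

Because the cleanup only ever removes empty directories and discards a registration, calling it again after a second failed run is harmless.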

3. lab user init warns "lab not installed" when it actually is

After installing via the documented curl one-liner (which runs cargo install --root ~/hero), lab lands at ~/hero/bin/lab. Then:

root@vmrx5xp:~# lab user init
…
warning: could not move lab into /root/hero/bin: ~/.local/bin/lab not found — run lab_install.sh or lab_build.sh first
…
Hero ready. PATH_ROOT=/root/hero.

The init step looks for lab at ~/.local/bin/lab and warns when it is not there, even though the binary is already at the destination it was trying to move it to (~/hero/bin/lab). The warning is incorrect, and its suggested remediation ("run lab_install.sh or lab_build.sh first") wrongly implies that lab is not installed.

Fix: Before issuing the warning, check whether ~/hero/bin/lab already exists; if so, silently treat the move as already done.
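A minimal sketch of that check order, assuming the move is a plain file move from a source path to a destination path. `ensure_lab_in_bin` and the warning text it returns are hypothetical, not lab's actual code or output.

```python
import shutil
from pathlib import Path


def ensure_lab_in_bin(source: Path, dest: Path):
    """Return a warning string, or None when there is nothing to report."""
    if dest.exists():
        return None  # already in place: treat the move as done, silently
    if source.exists():
        dest.parent.mkdir(parents=True, exist_ok=True)
        shutil.move(str(source), str(dest))
        return None
    # Only warn when the binary is in neither location.
    return f"warning: lab not found at {source} or {dest}"
```

Checking the destination first means a cargo-installed binary at ~/hero/bin/lab never triggers the warning, while a genuinely missing binary still does.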


4. lab service (no args) errors with "no .git directory found"

Run from /root (or any non-repo directory):

root@vmrx5xp:~# lab service
error: no service name given and could not infer from git: no .git directory found from /root
Usage: lab service <name> --start|--stop|--status|--install [flags]

There is no top-level lab service discovery command — to find out what services are registered, the user has to already know a service name. This is especially awkward because the older lab --status (which served as a global discovery command) was retired in favor of lab service <name> --status.

Fix: When no name is given AND the cwd isn't in a git repo, fall back to listing all registered services (the equivalent of the old top-level lab --status). That preserves the git-repo-aware ergonomics when inside a service repo, while giving outside-the-repo callers a useful default.
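The fallback order can be sketched as below. The function names are hypothetical, and mapping the repository directory name to the service name is an assumption made for illustration; the issue does not say how lab infers the name from git.

```python
from pathlib import Path


def find_git_root(start: Path):
    """Walk upward from start and return the first directory containing .git."""
    for directory in [start, *start.parents]:
        if (directory / ".git").exists():
            return directory
    return None


def resolve_service_action(name, cwd: Path, registered):
    """Decide what a bare or named 'lab service' invocation should do."""
    if name is not None:
        return ("service", name)
    root = find_git_root(cwd)
    if root is not None:
        # Assumption: the repo directory name doubles as the service name.
        return ("service", root.name)
    # Outside any repo: list everything instead of erroring out.
    return ("list", sorted(registered))
```

An explicit name always wins, the git-based inference is kept for callers inside a service repo, and only the previously failing case changes behaviour.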


5. README is significantly out of date relative to the current binary

crates/lab/README.md documents top-level flags and verbs that have been retired or renamed in the binary on development:

| README says | Current binary says |
|---|---|
| lab --start hero_code_server, lab --stop, lab --status | lab service <name> --start \| --stop \| --status |
| Top-level --release / --install / --bin flags | Moved under lab build |
| CODEROOT env var | Renamed to PATH_CODE |
| (no mention) | New commands: lab path, lab completions, lab infocheck |
| (no mention) | lab user init is now a mandatory step between install.sh and using lab |

Concrete consequences during the fresh-install test:

  • A user following the README runs lab --start hero_code_server and gets error: unrecognized subcommand.
  • A user follows the README's secrets / env section, gets CODEROOT not set, and is confused because lab path references PATH_CODE instead.
  • A user finishes install.sh and follows README's "Install" section, never finds the lab user init step, and lab path errors with PATH_ROOT is not set.

Fix: Audit the README section by section against the current binary. Specifically:

  • Replace the "Service lifecycle (lab service, lab --start/stop/status)" section with the new lab service <name> --start|--stop|--status syntax.
  • Move the "Build mode (lab [flags])" section under lab build [flags].
  • Replace CODEROOT references with PATH_CODE (or document both if both work).
  • Add a new "Initialize the user environment" section between Install and any usage, documenting lab user init.
  • Add reference sections for lab path, lab completions, lab infocheck.
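Part of the audit could be automated with a small scan for retired spellings. The sketch below is only an illustration: the `RETIRED` patterns are taken from the table in this issue, and `audit` is a hypothetical helper, not an existing lab command.

```python
import re

# Retired spelling (regex) -> current replacement, per the table in this issue.
RETIRED = {
    r"\blab --(start|stop|status)\b": "lab service <name> --start|--stop|--status",
    r"\bCODEROOT\b": "PATH_CODE",
}


def audit(text: str):
    """Return (line number, suggested replacement) for each retired usage found."""
    findings = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for pattern, replacement in RETIRED.items():
            if re.search(pattern, line):
                findings.append((lineno, replacement))
    return findings
```

Feeding it the contents of crates/lab/README.md would flag lines that still use the old top-level verbs or CODEROOT. Missing sections, such as lab user init, still need a manual pass.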

Priority

All five are medium severity — they don't block install outright, but each leaves a user stuck or confused at a predictable point. #1 and #2 are the most consequential because they make lab service core non-idempotent after a failure. #5 is the umbrella that, if fixed, would prevent users from hitting many of the others by accident.

Author
Member

Closed as solved by PR #296

Reference
lhumina_code/hero_skills#282