fix(lab): companion-config on build, malformed-cfg diagnostic, --stop --fast no-build #306

Merged
nabil_salah merged 3 commits from develop_lab_fix_followups into development 2026-06-01 10:44:55 +00:00
Member

Summary

Three independent follow-up fixes found during fresh-VM onboarding testing.

Companion config files now placed on the build path too

Some services ship a companion config file alongside the binary — for
example, hero_aibroker needs modelsconfig.yml from the repo before it can
start. Previously this companion fetch only ran in the acquire path
(download/install), so a fresh lab build hero_aibroker would install
the binary, try to start it, and fail because the config wasn't there.

The hook moved from acquire_binary into do_start_validated, so it now
fires on every path that leads to a service start: build, acquire, and
transitive dependency walks. Best-effort and non-fatal: a missing or
unreachable companion config prints a warning rather than aborting.

lab path distinguishes missing vs. malformed hero_cfg.toml

When hero_cfg.toml didn't load, every output mode reported the same
"hero environment not initialised — run lab user init" error, which is
wrong if the file is actually there but malformed. Users would re-run
lab user init, overwrite their broken config with defaults, and lose
whatever they were trying to set.

lab path now probes the config path on the error branch and emits a
distinct message in all four output modes (TTY, nu, json, bash-piped):

  • file absent: "run lab user init" (unchanged)
  • file present but TOML parse error: shows the parse error and the
    file path, so the user can fix the TOML instead of clobbering it

lab build --stop --fast no longer compiles before wiping

The flag's help text says "no build; use alone or with a repo name to
scope", but the code was running a full force-install of the workspace
before the SIGKILL+wipe phase. Two problems:

  1. The build required the cwd to be inside a cargo workspace, so
    cd ~ && lab build --fast --stop failed with the misleading
    "build/install failed: no Cargo.toml found above /root" — framing
    a teardown command as a build failure.

  2. Under source/binary drift (user edited service.toml in source but
    hasn't restarted yet), the build-first behavior actively hurt
    cleanup. The newly-installed binary embeds the new socket paths,
    so the subsequent unlink targets paths that nothing ever bound to,
    while the actually-leaked sockets owned by the still-running old
    binary are left as garbage.

The build phase is removed. The on-disk binary's embedded service.toml
matches what hero_proc registered at start time, so reading it via
<bin> --info gives the correct socket paths to clean up. When
discovery fails (no repo arg, no explicit binaries, cwd outside any
service repo) the error now includes an actionable hint listing the
three ways to scope the command.

The --restart --fast path is intentionally untouched: there the build
is desired (new binary will be what we start), and the install-then-
wipe-then-start ordering still applies.

Test plan

Companion config:

  • Fresh VM, lab install hero_aibroker && lab service hero_aibroker --start — modelsconfig.yml is fetched and placed before start,
    service comes up green.

Malformed config diagnostic:

  • rm ~/hero/cfg/hero_cfg.toml && lab path — says "run lab user
    init".
  • Corrupt hero_cfg.toml (add = mid-line) and re-run lab path
    prints the TOML parse error and the file path, does not suggest
    re-init.

--stop --fast:

  • cd ~ && lab build --fast --stop — prints the hint, exits
    cleanly, no "build/install failed" message.
  • cd ~/hero/code/hero_router && lab build --fast --stop hero_router
    — wipes instantly, no build step printed (was ~10 minutes on a
    cold cache before).
  • cd ~ && lab build --fast --stop hero_router — resolves the repo,
    wipes, no build.
  • --restart --fast still triggers a build (regression check).
  • Drift test: edited hero_router/src/config.rs to change the socket
    path string to "rpc-NEWPATH.sock" without rebuilding, leaked
    rpc.sock and admin.sock via SIGKILL bypassing hero_proc cleanup,
    ran lab build --fast --stop hero_router. Result: both
    actually-leaked sockets (rpc.sock, admin.sock — the paths the
    on-disk binary embeds) were cleaned, the drifted source path
    (rpc-NEWPATH.sock) was correctly never referenced.
## Summary Three independent follow-up fixes found during fresh-VM onboarding testing. ### Companion config files now placed on the build path too Some services ship a companion config file alongside the binary — for example, hero_aibroker needs modelsconfig.yml from the repo before it can start. Previously this companion fetch only ran in the acquire path (download/install), so a fresh `lab build hero_aibroker` would install the binary, try to start it, and fail because the config wasn't there. The hook moved from acquire_binary into do_start_validated, so it now fires on every path that leads to a service start: build, acquire, and transitive dependency walks. Best-effort and non-fatal: a missing or unreachable companion config prints a warning rather than aborting. ### `lab path` distinguishes missing vs. malformed hero_cfg.toml When hero_cfg.toml didn't load, every output mode reported the same "hero environment not initialised — run lab user init" error, which is wrong if the file is actually there but malformed. Users would re-run lab user init, overwrite their broken config with defaults, and lose whatever they were trying to set. `lab path` now probes the config path on the error branch and emits a distinct message in all four output modes (TTY, nu, json, bash-piped): * file absent: "run lab user init" (unchanged) * file present but TOML parse error: shows the parse error and the file path, so the user can fix the TOML instead of clobbering it ### `lab build --stop --fast` no longer compiles before wiping The flag's help text says "no build; use alone or with a repo name to scope", but the code was running a full force-install of the workspace before the SIGKILL+wipe phase. Two problems: 1. The build required the cwd to be inside a cargo workspace, so `cd ~ && lab build --fast --stop` failed with the misleading "build/install failed: no Cargo.toml found above /root" — framing a teardown command as a build failure. 2. Under source/binary drift (user edited service.toml in source but hasn't restarted yet), the build-first behavior actively hurt cleanup. The newly-installed binary embeds the new socket paths, so the subsequent unlink targets paths that nothing ever bound to, while the actually-leaked sockets owned by the still-running old binary are left as garbage. The build phase is removed. The on-disk binary's embedded service.toml matches what hero_proc registered at start time, so reading it via `<bin> --info` gives the correct socket paths to clean up. When discovery fails (no repo arg, no explicit binaries, cwd outside any service repo) the error now includes an actionable hint listing the three ways to scope the command. The --restart --fast path is intentionally untouched: there the build is desired (new binary will be what we start), and the install-then- wipe-then-start ordering still applies. ## Test plan Companion config: - [x] Fresh VM, `lab install hero_aibroker && lab service hero_aibroker --start` — modelsconfig.yml is fetched and placed before start, service comes up green. Malformed config diagnostic: - [x] `rm ~/hero/cfg/hero_cfg.toml && lab path` — says "run lab user init". - [x] Corrupt hero_cfg.toml (add `=` mid-line) and re-run `lab path` — prints the TOML parse error and the file path, does not suggest re-init. --stop --fast: - [x] `cd ~ && lab build --fast --stop` — prints the hint, exits cleanly, no "build/install failed" message. - [x] `cd ~/hero/code/hero_router && lab build --fast --stop hero_router` — wipes instantly, no build step printed (was ~10 minutes on a cold cache before). - [x] `cd ~ && lab build --fast --stop hero_router` — resolves the repo, wipes, no build. - [x] --restart --fast still triggers a build (regression check). - [x] Drift test: edited hero_router/src/config.rs to change the socket path string to "rpc-NEWPATH.sock" without rebuilding, leaked rpc.sock and admin.sock via SIGKILL bypassing hero_proc cleanup, ran `lab build --fast --stop hero_router`. Result: both actually-leaked sockets (rpc.sock, admin.sock — the paths the on-disk binary embeds) were cleaned, the drifted source path (rpc-NEWPATH.sock) was correctly never referenced.
Found exercising failure-recovery probes on a fresh Ubuntu 24 VM.

T3.4 — `lab build --restart hero_aibroker` rebuilt 11 binaries from
source and then failed all 44 smoke tests because
`$PATH_VAR/hero_aibroker/modelsconfig.yml` was missing.

  Root cause: PR5's `ensure_companion_config` hook lives inside
  `acquire_binary`, which covers the install paths (cache hit, Forge
  download, build-from-source-when-binary-missing). But `lab build
  --restart` (and `--fast --restart`, `--reset --start`, `--start`)
  drives the build pipeline directly and installs via `platform/install`
  without ever calling `acquire_binary` — so the companion-config
  fetch was bypassed entirely when the binary was already in place
  and only the config had been removed.

  Fix: also call `ensure_companion_config(&validated.service_name)`
  from `do_start_validated` in `service_manager.rs`, right after the
  existing provider-key preflight. Every start path funnels through
  that function (acquire, build, dep-walk, lab service core, …), so
  the fetch now fires regardless of how we got to start. The three
  existing hooks in `acquire_binary` stay — both layers are
  idempotent (skip when file present + non-empty), so the overlap is
  a cheap no-op and the install path keeps its eager fetch. Non-fatal
  on failure so the binary's own startup error path can still
  surface. Function dropped its now-unused `repo_name` parameter (the
  per-binary table provides the repo) and was made `pub(crate)` so
  service_manager can call it.

T3.2 — corrupt `~/hero/cfg/hero_cfg.toml`, then `lab path` said:

      lab path: PATH_ROOT is not set — run `lab user init` first
        to provision the Hero environment.

  Wrong twice: PATH_ROOT *would* be set if the cfg parsed, and
  `lab user init` does not heal a malformed cfg.

  Fix: `cmd_path` now probes for `~/hero/cfg/hero_cfg.toml` when
  PATH_ROOT is unset. If the file exists but fails to parse, surface
  the actual TOML error with "Fix the TOML (or restore from a backup)
  and re-run" — instead of the misleading "run lab user init"
  message. Distinguishes the genuine pre-init case (no file → keep
  the run-init message) from the broken-cfg case. Applies to all
  four output modes (TTY / nu / json / bash-piped).

Refs: hero_skills#281, hero_skills#282
--stop is a teardown verb but `lab build --stop --fast` was building +
installing every cargo target in the workspace before wiping, citing
"so --info is readable" (from 675d8fe). This forced the caller to be
inside a cargo workspace (otherwise: `no Cargo.toml found above <cwd>`),
contradicting the flag's own help text ("no build; use alone or with a
repo name to scope") and making a destructive cleanup verb unusable from
a non-repo cwd.

The build-first rationale also did not hold for --stop specifically.
fast_teardown reads each binary's embedded service.toml to learn socket
paths to unlink. The *running* service was launched from the on-disk
binary, so the on-disk service.toml lists the sockets it actually owns.
Re-installing first to get "fresh" metadata only helps if source and
installed binary are in sync — in which case fresh == stale. When they
drift (user edited service.toml but has not restarted), the rebuild
makes things *worse*: we unlink the new path (which nothing ever opened)
and leave the old socket the killed process actually owned as garbage.

For --restart --fast the build-first design is correct and is untouched
(the new binary is what we are about to run).

This change skips the run() call on the --stop --fast path, keeps the
optional repo chdir from resolve_build_repo, and improves the discovery-
failure error so callers know they can pass a repo, name binaries, or
cd into a service repo.

Repro (before): `cd ~ && lab build --fast --stop` →
  `lab build --stop --fast: build/install failed: no Cargo.toml found
   above /root`
mahmoud approved these changes 2026-06-01 10:43:47 +00:00
nabil_salah merged commit 599114ac60 into development 2026-06-01 10:44:55 +00:00
Sign in to join this conversation.
No reviewers
No milestone
No project
No assignees
2 participants
Notifications
Due date
The due date is invalid or out of range. Please use the format "yyyy-mm-dd".

No due date set.

Dependencies

No dependencies set.

Reference
lhumina_code/hero_skills!306
No description provided.