fix(lab): companion-config on build, malformed-cfg diagnostic, --stop --fast no-build #306
Summary
Three independent follow-up fixes found during fresh-VM onboarding testing.
Companion config files now placed on the build path too
Some services ship a companion config file alongside the binary. For
example, hero_aibroker needs modelsconfig.yml from the repo before it can
start. Previously this companion fetch only ran in the acquire path
(download/install), so a fresh `lab build hero_aibroker` would install
the binary, try to start it, and fail because the config wasn't there.
The hook moved from acquire_binary into do_start_validated, so it now
fires on every path that leads to a service start: build, acquire, and
transitive dependency walks. Best-effort and non-fatal: a missing or
unreachable companion config prints a warning rather than aborting.
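
The skip-when-present, warn-on-failure behaviour described above can be sketched as follows. This is a minimal illustration rather than the code in this PR: `ensure_companion_file`, `CompanionOutcome`, and the injected `fetch` closure are hypothetical names, and the real hook may fetch and report differently.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Outcome of one best-effort attempt to place a companion config.
#[derive(Debug, PartialEq)]
pub enum CompanionOutcome {
    /// File was already present and non-empty; nothing was fetched.
    AlreadyPresent,
    /// File was missing or empty and has now been written.
    Fetched,
    /// Fetch or write failed; the caller carries on with the start anyway.
    Warned(String),
}

/// Hypothetical helper. `fetch` stands in for whatever retrieves the file
/// (for example a download from the repo); it is an assumption, not lab's API.
pub fn ensure_companion_file<F>(dest: &Path, fetch: F) -> CompanionOutcome
where
    F: FnOnce() -> io::Result<Vec<u8>>,
{
    // Idempotence check: skip when the file exists and is non-empty, so
    // calling this from several start paths is a cheap no-op.
    if let Ok(meta) = fs::metadata(dest) {
        if meta.is_file() && meta.len() > 0 {
            return CompanionOutcome::AlreadyPresent;
        }
    }

    // Fetch, create the parent directory if needed, then write the file.
    let result = fetch().and_then(|bytes| {
        if let Some(parent) = dest.parent() {
            fs::create_dir_all(parent)?;
        }
        fs::write(dest, bytes)
    });

    match result {
        Ok(()) => CompanionOutcome::Fetched,
        Err(e) => {
            // Non-fatal: warn and let the service's own startup error surface.
            let msg = format!("companion config {} unavailable: {}", dest.display(), e);
            eprintln!("warning: {}", msg);
            CompanionOutcome::Warned(msg)
        }
    }
}
```

Returning an outcome instead of a `Result` keeps the call site from accidentally propagating the failure with `?`, which matches the non-fatal intent.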
`lab path` distinguishes missing vs. malformed `hero_cfg.toml`
When hero_cfg.toml didn't load, every output mode reported the same
"hero environment not initialised — run lab user init" error, which is
wrong if the file is actually there but malformed. Users would re-run
lab user init, overwrite their broken config with defaults, and lose
whatever they were trying to set.
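`lab path` now probes the config path on the error branch and emits a
distinct message in all four output modes (TTY, nu, json, bash-piped):
File missing: the existing "run lab user init" message
File present but malformed: the TOML parse error and the
file path, so the user can fix the TOML instead of clobbering it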
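
The missing-versus-malformed split can be sketched as a small probe function. This is an illustrative sketch, not the `cmd_path` code itself: `CfgProbe`, `probe_cfg`, and `user_message` are hypothetical names, the TOML parser is injected as a closure so the example needs no external crate, and the message wording is only approximated from the description.

```rust
use std::fs;
use std::io::ErrorKind;
use std::path::Path;

/// What probing the config path found.
#[derive(Debug, PartialEq)]
pub enum CfgProbe {
    /// No file at all: the genuine pre-init case.
    Missing,
    /// File exists but could not be read or parsed.
    Malformed { path: String, error: String },
    /// File exists and parses.
    Loaded,
}

/// `parse` is a stand-in for the real TOML parser (an assumption of this
/// sketch); it returns Err with a human-readable parse error.
pub fn probe_cfg<P>(path: &Path, parse: P) -> CfgProbe
where
    P: Fn(&str) -> Result<(), String>,
{
    let shown = path.display().to_string();
    match fs::read_to_string(path) {
        Err(e) if e.kind() == ErrorKind::NotFound => CfgProbe::Missing,
        // Other I/O errors (permissions, invalid UTF-8) are reported like a
        // malformed file in this sketch, since re-running init would not help.
        Err(e) => CfgProbe::Malformed { path: shown, error: e.to_string() },
        Ok(text) => match parse(&text) {
            Ok(()) => CfgProbe::Loaded,
            Err(error) => CfgProbe::Malformed { path: shown, error },
        },
    }
}

/// Pick the message for the error branch; None when nothing is wrong.
pub fn user_message(probe: &CfgProbe) -> Option<String> {
    match probe {
        CfgProbe::Missing => {
            Some("hero environment not initialised: run `lab user init`".to_string())
        }
        CfgProbe::Malformed { path, error } => Some(format!(
            "{}: {}\nFix the TOML (or restore from a backup) and re-run",
            path, error
        )),
        CfgProbe::Loaded => None,
    }
}
```

Each output mode would then render the same `CfgProbe` value in its own format, so the distinction cannot drift between modes.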
`lab build --stop --fast` no longer compiles before wiping
The flag's help text says "no build; use alone or with a repo name to
scope", but the code was running a full force-install of the workspace
before the SIGKILL+wipe phase. Two problems:
The build required the cwd to be inside a cargo workspace, so
`cd ~ && lab build --fast --stop` failed with the misleading
"build/install failed: no Cargo.toml found above /root", framing
a teardown command as a build failure.
Under source/binary drift (user edited service.toml in source but
hasn't restarted yet), the build-first behavior actively hurt
cleanup. The newly-installed binary embeds the new socket paths,
so the subsequent unlink targets paths that nothing ever bound to,
while the actually-leaked sockets owned by the still-running old
binary are left as garbage.
The build phase is removed. The on-disk binary's embedded service.toml
matches what hero_proc registered at start time, so reading it via
`<bin> --info` gives the correct socket paths to clean up. When
discovery fails (no repo arg, no explicit binaries, cwd outside any
service repo) the error now includes an actionable hint listing the
three ways to scope the command.
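
The discovery fallback and its hint can be sketched like this. It is a rough illustration under stated assumptions: `StopScope` and `resolve_stop_scope` are hypothetical names, the precedence order (repo argument, then explicit binaries, then cwd) is assumed rather than taken from the PR, and the hint wording is invented for the example.

```rust
/// How a `--stop --fast` run was scoped (hypothetical type for this sketch).
#[derive(Debug, PartialEq)]
pub enum StopScope {
    /// A repo name was passed on the command line.
    Repo(String),
    /// Explicit binaries were passed.
    Binaries(Vec<String>),
    /// The cwd is inside a service repo.
    CwdRepo(String),
}

/// Resolve the scope without building anything. The precedence order here
/// is an assumption of this sketch.
pub fn resolve_stop_scope(
    repo_arg: Option<&str>,
    binaries: &[String],
    cwd_repo: Option<&str>,
) -> Result<StopScope, String> {
    if let Some(repo) = repo_arg {
        return Ok(StopScope::Repo(repo.to_string()));
    }
    if !binaries.is_empty() {
        return Ok(StopScope::Binaries(binaries.to_vec()));
    }
    if let Some(repo) = cwd_repo {
        return Ok(StopScope::CwdRepo(repo.to_string()));
    }
    // Discovery failed: say how to scope the command rather than reporting
    // a build failure for what is a teardown command.
    Err([
        "nothing to stop: no service could be discovered.",
        "Scope the command in one of three ways:",
        "  1. pass a repo name",
        "  2. pass explicit binaries",
        "  3. run it from inside a service repo",
    ]
    .join("\n"))
}
```

Because no branch touches cargo, the function behaves the same from any working directory, which is the property the removed build phase broke.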
The --restart --fast path is intentionally untouched: there the build
is desired (new binary will be what we start), and the install-then-
wipe-then-start ordering still applies.
Test plan
Companion config:
`lab install hero_aibroker && lab service hero_aibroker --start`: modelsconfig.yml is fetched and placed before start,
service comes up green.
Malformed config diagnostic:
`rm ~/hero/cfg/hero_cfg.toml && lab path`: says "run lab user init".
[…] `=` mid-line) and re-run `lab path`:
prints the TOML parse error and the file path, does not suggest
re-init.
--stop --fast:
`cd ~ && lab build --fast --stop`: prints the hint, exits
cleanly, no "build/install failed" message.
`cd ~/hero/code/hero_router && lab build --fast --stop hero_router`:
wipes instantly, no build step printed (was ~10 minutes on a
cold cache before).
`cd ~ && lab build --fast --stop hero_router`: resolves the repo,
wipes, no build.
[…] path string to "rpc-NEWPATH.sock" without rebuilding, leaked
rpc.sock and admin.sock via SIGKILL bypassing hero_proc cleanup,
ran `lab build --fast --stop hero_router`. Result: both
actually-leaked sockets (rpc.sock and admin.sock, the paths the
on-disk binary embeds) were cleaned, the drifted source path
(rpc-NEWPATH.sock) was correctly never referenced.
Found exercising failure-recovery probes on a fresh Ubuntu 24 VM.

T3.4: `lab build --restart hero_aibroker` rebuilt 11 binaries from source and then failed all 44 smoke tests because `$PATH_VAR/hero_aibroker/modelsconfig.yml` was missing.

Root cause: PR5's `ensure_companion_config` hook lives inside `acquire_binary`, which covers the install paths (cache hit, Forge download, build-from-source-when-binary-missing). But `lab build --restart` (and `--fast --restart`, `--reset --start`, `--start`) drives the build pipeline directly and installs via `platform/install` without ever calling `acquire_binary`, so the companion-config fetch was bypassed entirely when the binary was already in place and only the config had been removed.

Fix: also call `ensure_companion_config(&validated.service_name)` from `do_start_validated` in `service_manager.rs`, right after the existing provider-key preflight. Every start path funnels through that function (acquire, build, dep-walk, lab service core, …), so the fetch now fires regardless of how we got to start. The three existing hooks in `acquire_binary` stay: both layers are idempotent (skip when file present + non-empty), so the overlap is a cheap no-op and the install path keeps its eager fetch. Non-fatal on failure so the binary's own startup error path can still surface. The function dropped its now-unused `repo_name` parameter (the per-binary table provides the repo) and was made `pub(crate)` so service_manager can call it.

T3.2: corrupt `~/hero/cfg/hero_cfg.toml`, then `lab path` said:

    lab path: PATH_ROOT is not set — run `lab user init` first to provision the Hero environment.

Wrong twice: PATH_ROOT *would* be set if the cfg parsed, and `lab user init` does not heal a malformed cfg.

Fix: `cmd_path` now probes for `~/hero/cfg/hero_cfg.toml` when PATH_ROOT is unset. If the file exists but fails to parse, surface the actual TOML error with "Fix the TOML (or restore from a backup) and re-run" instead of the misleading "run lab user init" message. This distinguishes the genuine pre-init case (no file → keep the run-init message) from the broken-cfg case. Applies to all four output modes (TTY / nu / json / bash-piped).

Refs: hero_skills#281, hero_skills#282