lab: UX papercuts + hero_aibroker fresh-install + README audit #296

Merged
nabil_salah merged 5 commits from lab-followup-fixes into development 2026-05-25 07:55:47 +00:00
Member

Summary

Bundles four commits that together close issue #282 (the hero_aibroker fresh-install wall) and clean up six medium/low-severity UX papercuts from our recent fresh-Ubuntu-24 test pass. Verified end-to-end on a clean VM: curl … install.sh | bashlab user initlab install corelab secrets set OPENROUTER_API_KEY <key>lab service core now succeeds on the first try with all smoke tests green.

This is a direct follow-up to merged PR #286 (fresh-installer-fixes) which closed issue #281.

Commits

1. fix(lab): UX polish — warnings, SIGPIPE, lab service no-args, sccache noise, polling spam

Six bundled fixes from the followup bug list:

  • Build warnings: clear all 6 cargo build warnings (unused vars in completions.rs, missing #[allow(unreachable_code)] block, dead run_opt / connect_raw, leftover Kind import in fast_teardown.rs).
  • SIGPIPE panic: restore default SIGPIPE handling in main() so lab repo find | head, lab secrets list | grep, etc. stop crashing with "Broken pipe (os error 32)".
  • lab service no-args: print a directory of installed services + usage hints instead of erroring with "no .git directory found" when called outside a git repo.
  • sccache restart noise: silence the benign "Connection refused" / "Starting the server" subprocess output during the routine sccache wrong-dir restart.
  • Polling-WARN spam: throttle lab service dep-wait WARN lines to ≤ 1 per socket per 5 s (after a 10 s warm-up), bringing aggregate noise down ~10× (from ~430 lines per dep-wait phase to ~42).
  • As a side effect of the throttle, interleaved-stdout problems between the polling task and smoke-test task become statistically rare.

2. fix(lab): hero_aibroker fresh-install — config auto-fetch + provider-key preflight + cleaner --stop

Closes the hero_aibroker chain (issue #282):

  • ensure_companion_config in acquire.rs: when acquire_binary resolves hero_aibroker_server, also fetch modelsconfig.yml from the repo's main branch raw URL into $PATH_VAR/hero_aibroker/modelsconfig.yml. Idempotent (skip when present + non-empty). Match is strict — any other binary returns Ok(()) immediately, so zero new behavior for hero_proc, hero_router, hero_db, etc. Fetch failure is non-fatal: warn and continue.
  • preflight_aibroker_provider_key in service_manager.rs: before do_start_validated hands hero_aibroker_server off to hero_proc, query hero_proc secrets for any of the 10 supported provider key names (OPENROUTER_API_KEY, ANTHROPIC_API_KEY, ANTHROPIC_API_KEYS, OPENAI_API_KEY, GROQ_API_KEY, CEREBRAS_API_KEY, DEEPSEEK_API_KEY, DEEPINFRA_API_KEY, SAMBANOVA_API_KEY, HF_API_TOKEN). If none are set, bail up front with the actionable lab secrets set … invocation and full key-name list — instead of letting aibroker register, crash 5×, and bury the error in $PATH_VAR/logs/core/hero_aibroker_server/<job>/<date>.log.
  • do_stop short-circuit: treat state == "failed" | "error" | "halted" as already-stopped (no live process, clean state). Replaces the self-contradictory "stop returned an error (may already be stopped): stop failed — state 'failed'" message.

3. fix(lab acquire): fetch companion config on Forge-download path too

Bug fix on top of commit 2: the previous patch missed the most common path. When try_forge_download returns Ok(true) AND host_path.exists(), the early return inside the match arm bypassed the post-match if forge_ok { ensure_companion_config(...) } block. Moved the call inside the early return.

Verified on a second fresh VM run: lab service core now prints

installed from Forge → /root/hero/bin/hero_aibroker_server
fetching companion config: https://forge.ourworld.tf/lhumina_code/hero_aibroker/raw/branch/main/modelsconfig.yml
installed companion config: /root/hero/var/hero_aibroker/modelsconfig.yml (20309 bytes)

…followed by smoke tests: 44 passed.

4. docs(lab/readme): audit retired/renamed surfaces; document new commands

Eight targeted edits to bring the README in line with the current binary:

  • Quick reference table refreshed: drops bare-lab, adds lab user init, lab install <component>, lab path, lab completions, lab infocheck, all new lab build redeploy verbs (--restart, --fast --restart, --fast --stop, --reset --start), lab service <name> --<verb> syntax.
  • Section 1 "Build mode" retitled to lab build [flags] and prefixed with a deprecation note: bare-lab and top-level --start/--stop/--status are retired, build lives under lab build, service lifecycle under lab service <name>. Every example flag updated to lab build --flag. Adds the destructive-redeploy flow.
  • Section 4 env vars: rename ROOTDIRPATH_ROOT, CODEROOTPATH_CODE. Note hero_cfg.toml hydration.
  • Section 6 (lab build [REPO]): add the new redeploy flag examples (--restart, --fast --restart, --fast --stop, --reset --start). Rename $CODEROOT$PATH_CODE.
  • Section 7 (Service lifecycle): drop the retired lab --start/stop/status parenthetical from the heading. Replace the "Start / stop / status (no build)" subsection with a paragraph pointing at lab service <name> --<verb> and lab build --status|--stop since the top-level flags are gone.
  • New section 7a "Shell integration": documents lab path (incl. TTY-aware error form) and lab completions.
  • New section 7b: documents lab infocheck.
  • Section 10 env vars (build mode): replace BUILDDIR/ROOTDIR with the PATH_* family; add PATH_VAR, CARGO_HOME, RUSTUP_HOME; note auto-hydration.

Test plan

Verified end-to-end on a fresh Ubuntu 24.04 root VM using the canonical 5-command onboarding:

curl -sSfL https://forge.ourworld.tf/lhumina_code/hero_skills/raw/branch/lab-followup-fixes/crates/lab/install.sh \
  | BRANCH=lab-followup-fixes bash
exec bash
lab user init                                  # enter FORGE_TOKEN
lab install core
lab secrets set GROQ_API_KEY <key>             # any of the 10 supported provider keys
lab service core

Confirmed observable behaviors:

  • lab repo find | head -5 exits 0 (no SIGPIPE panic)
  • lab service outside a git repo prints the installed-services directory
  • lab build hero_router cold output has no scary sccache: Connection refused
  • lab service core dep-wait emits ≤ ~50 WARN lines per service (down from ~430)
  • cargo build -p lab finishes with 0 warnings
  • lab service core fetches modelsconfig.yml automatically:
    installed companion config: /root/hero/var/hero_aibroker/modelsconfig.yml (20309 bytes)
  • lab service core finishes with === lab service core: all services started ===,
    every service's smoke tests green (hero_proc 2/2, hero_proc_admin 2/2, hero_router 6/6,
    hero_db_server 4/4, hero_aibroker_server 44/44, hero_code_server 4/4,
    hero_code_admin 2/2, hero_db_admin 2/2)
  • lab service hero_aibroker_server --stop on a failed-state service prints
    "state is failed — no live process, already stopped." (was the contradictory
    "stop returned an error (may already be stopped): stop failed")

If the provider-key preflight fires (no AI provider key set), lab service core bails up front with:

hero_aibroker_server refuses to start without at least one AI provider key.
Set one in hero_proc secrets:

    lab secrets set OPENROUTER_API_KEY <your-key>

Supported provider key names:
    OPENROUTER_API_KEY  ANTHROPIC_API_KEY  ANTHROPIC_API_KEYS
    OPENAI_API_KEY      GROQ_API_KEY       CEREBRAS_API_KEY
    DEEPSEEK_API_KEY    DEEPINFRA_API_KEY  SAMBANOVA_API_KEY
    HF_API_TOKEN

…instead of letting aibroker register, crash 5 times, and bury the message in a job log.

Issues closed

  • hero_skills#282 — covered by commits 2 + 3.

Issues that stay open

None on the original report.

## Summary Bundles four commits that together close issue #282 (the hero_aibroker fresh-install wall) and clean up six medium/low-severity UX papercuts from our recent fresh-Ubuntu-24 test pass. Verified end-to-end on a clean VM: `curl … install.sh | bash` → `lab user init` → `lab install core` → `lab secrets set OPENROUTER_API_KEY <key>` → `lab service core` now succeeds on the first try with all smoke tests green. This is a direct follow-up to merged PR #286 (`fresh-installer-fixes`) which closed issue #281. ## Commits ### 1. `fix(lab): UX polish — warnings, SIGPIPE, lab service no-args, sccache noise, polling spam` Six bundled fixes from the followup bug list: - **Build warnings**: clear all 6 cargo build warnings (unused vars in `completions.rs`, missing `#[allow(unreachable_code)]` block, dead `run_opt` / `connect_raw`, leftover `Kind` import in `fast_teardown.rs`). - **SIGPIPE panic**: restore default SIGPIPE handling in `main()` so `lab repo find | head`, `lab secrets list | grep`, etc. stop crashing with "Broken pipe (os error 32)". - **`lab service` no-args**: print a directory of installed services + usage hints instead of erroring with "no .git directory found" when called outside a git repo. - **sccache restart noise**: silence the benign "Connection refused" / "Starting the server" subprocess output during the routine sccache wrong-dir restart. - **Polling-WARN spam**: throttle `lab service` dep-wait WARN lines to ≤ 1 per socket per 5 s (after a 10 s warm-up), bringing aggregate noise down ~10× (from ~430 lines per dep-wait phase to ~42). - As a side effect of the throttle, interleaved-stdout problems between the polling task and smoke-test task become statistically rare. ### 2. `fix(lab): hero_aibroker fresh-install — config auto-fetch + provider-key preflight + cleaner --stop` Closes the hero_aibroker chain (issue #282): - **`ensure_companion_config`** in `acquire.rs`: when `acquire_binary` resolves `hero_aibroker_server`, also fetch `modelsconfig.yml` from the repo's main branch raw URL into `$PATH_VAR/hero_aibroker/modelsconfig.yml`. Idempotent (skip when present + non-empty). Match is strict — any other binary returns `Ok(())` immediately, so zero new behavior for hero_proc, hero_router, hero_db, etc. Fetch failure is non-fatal: warn and continue. - **`preflight_aibroker_provider_key`** in `service_manager.rs`: before `do_start_validated` hands `hero_aibroker_server` off to hero_proc, query hero_proc secrets for any of the 10 supported provider key names (`OPENROUTER_API_KEY`, `ANTHROPIC_API_KEY`, `ANTHROPIC_API_KEYS`, `OPENAI_API_KEY`, `GROQ_API_KEY`, `CEREBRAS_API_KEY`, `DEEPSEEK_API_KEY`, `DEEPINFRA_API_KEY`, `SAMBANOVA_API_KEY`, `HF_API_TOKEN`). If none are set, bail up front with the actionable `lab secrets set …` invocation and full key-name list — instead of letting aibroker register, crash 5×, and bury the error in `$PATH_VAR/logs/core/hero_aibroker_server/<job>/<date>.log`. - **`do_stop` short-circuit**: treat `state == "failed" | "error" | "halted"` as already-stopped (no live process, clean state). Replaces the self-contradictory "stop returned an error (may already be stopped): stop failed — state 'failed'" message. ### 3. `fix(lab acquire): fetch companion config on Forge-download path too` Bug fix on top of commit 2: the previous patch missed the most common path. When `try_forge_download` returns `Ok(true)` AND `host_path.exists()`, the early return inside the match arm bypassed the post-match `if forge_ok { ensure_companion_config(...) }` block. Moved the call inside the early return. Verified on a second fresh VM run: `lab service core` now prints ``` installed from Forge → /root/hero/bin/hero_aibroker_server fetching companion config: https://forge.ourworld.tf/lhumina_code/hero_aibroker/raw/branch/main/modelsconfig.yml installed companion config: /root/hero/var/hero_aibroker/modelsconfig.yml (20309 bytes) ``` …followed by `smoke tests: 44 passed`. ### 4. `docs(lab/readme): audit retired/renamed surfaces; document new commands` Eight targeted edits to bring the README in line with the current binary: - **Quick reference table** refreshed: drops bare-`lab`, adds `lab user init`, `lab install <component>`, `lab path`, `lab completions`, `lab infocheck`, all new `lab build` redeploy verbs (`--restart`, `--fast --restart`, `--fast --stop`, `--reset --start`), `lab service <name> --<verb>` syntax. - **Section 1 "Build mode"** retitled to `lab build [flags]` and prefixed with a deprecation note: bare-`lab` and top-level `--start`/`--stop`/`--status` are retired, build lives under `lab build`, service lifecycle under `lab service <name>`. Every example flag updated to `lab build --flag`. Adds the destructive-redeploy flow. - **Section 4 env vars**: rename `ROOTDIR`→`PATH_ROOT`, `CODEROOT`→`PATH_CODE`. Note hero_cfg.toml hydration. - **Section 6** (`lab build [REPO]`): add the new redeploy flag examples (`--restart`, `--fast --restart`, `--fast --stop`, `--reset --start`). Rename `$CODEROOT`→`$PATH_CODE`. - **Section 7** (`Service lifecycle`): drop the retired `lab --start/stop/status` parenthetical from the heading. Replace the "Start / stop / status (no build)" subsection with a paragraph pointing at `lab service <name> --<verb>` and `lab build --status|--stop` since the top-level flags are gone. - **New section 7a "Shell integration"**: documents `lab path` (incl. TTY-aware error form) and `lab completions`. - **New section 7b**: documents `lab infocheck`. - **Section 10 env vars (build mode)**: replace `BUILDDIR`/`ROOTDIR` with the `PATH_*` family; add `PATH_VAR`, `CARGO_HOME`, `RUSTUP_HOME`; note auto-hydration. ## Test plan Verified end-to-end on a fresh Ubuntu 24.04 root VM using the canonical 5-command onboarding: ```bash curl -sSfL https://forge.ourworld.tf/lhumina_code/hero_skills/raw/branch/lab-followup-fixes/crates/lab/install.sh \ | BRANCH=lab-followup-fixes bash exec bash lab user init # enter FORGE_TOKEN lab install core lab secrets set GROQ_API_KEY <key> # any of the 10 supported provider keys lab service core ``` Confirmed observable behaviors: - [x] `lab repo find | head -5` exits 0 (no SIGPIPE panic) - [x] `lab service` outside a git repo prints the installed-services directory - [x] `lab build hero_router` cold output has no scary `sccache: Connection refused` - [x] `lab service core` dep-wait emits ≤ ~50 WARN lines per service (down from ~430) - [x] `cargo build -p lab` finishes with 0 warnings - [x] `lab service core` fetches modelsconfig.yml automatically: `installed companion config: /root/hero/var/hero_aibroker/modelsconfig.yml (20309 bytes)` - [x] `lab service core` finishes with `=== lab service core: all services started ===`, every service's smoke tests green (hero_proc 2/2, hero_proc_admin 2/2, hero_router 6/6, hero_db_server 4/4, hero_aibroker_server **44/44**, hero_code_server 4/4, hero_code_admin 2/2, hero_db_admin 2/2) - [x] `lab service hero_aibroker_server --stop` on a `failed`-state service prints "state is failed — no live process, already stopped." (was the contradictory "stop returned an error (may already be stopped): stop failed") If the provider-key preflight fires (no AI provider key set), `lab service core` bails up front with: ``` hero_aibroker_server refuses to start without at least one AI provider key. Set one in hero_proc secrets: lab secrets set OPENROUTER_API_KEY <your-key> Supported provider key names: OPENROUTER_API_KEY ANTHROPIC_API_KEY ANTHROPIC_API_KEYS OPENAI_API_KEY GROQ_API_KEY CEREBRAS_API_KEY DEEPSEEK_API_KEY DEEPINFRA_API_KEY SAMBANOVA_API_KEY HF_API_TOKEN ``` …instead of letting aibroker register, crash 5 times, and bury the message in a job log. ## Issues closed - **hero_skills#282** — covered by commits 2 + 3. ## Issues that stay open None on the original report.
Bundles six small fixes uncovered during fresh-install testing on
Ubuntu 24. Each is independently small but they all share the same
"papercut that makes lab annoying to use day-to-day" character.

1. Clear all 6 cargo build warnings.
   - installers/completions.rs:install — gate unused params under
     `#[allow(unused_variables)]`. Signature kept so re-enabling
     completions is a single-line revert.
   - installers/completions.rs:ensure_nu_config_uses_completions —
     wrap dead body in `#[allow(unreachable_code)] { ... }` block to
     mirror the sibling `install` function.
   - flow/uninstall.rs — delete dead `run_opt`.
   - secrets/client.rs — delete dead `connect_raw` shim.
   - service/fast_teardown.rs:start_repo_services — remove the
     leftover `use herolib_core::base::Kind;` import (Kind is used in
     a sibling function, not this one).

2. Restore default SIGPIPE handling on Unix so `lab repo find | head`,
   `lab secrets list | head`, `lab X | grep` etc. stop crashing with:
       thread 'main' panicked at .../io/stdio.rs:1165:9:
       failed printing to stdout: Broken pipe (os error 32)
   Now the process exits 141 cleanly when its stdout reader goes away,
   like every other well-behaved CLI.

3. `lab service` (no name, outside a git repo) now prints a one-screen
   directory of installed services instead of erroring with
   "no .git directory found" or panicking with "PATH_ROOT is not set".
   New `service::list_installed_service_binaries` + `print_service_directory`
   helpers; both fallback sites in main.rs route through them. Three
   branches: no PATH_ROOT, empty bin dir, populated bin dir.

4. Silence sccache subprocess noise during routine restart. Every cold
   `lab build` printed:
       sccache: error: couldn't connect to server
       sccache: caused by: Connection refused (os error 111)
       sccache: Starting the server...
   The "Connection refused" is benign — `--stop-server` against a
   not-yet-running daemon. Pipe sccache's stdout+stderr to /dev/null
   on both --stop-server and --start-server. Lab still logs its own
   intent via `tracing::info!`/`warn!`.

5. Throttle `lab service` dep-wait WARN spam. Previously the 12-socket
   wait phase in `lab service core` emitted ~24 lines/sec for 18 s
   (~430 identical "not ready yet" lines per dependency). Now we warn
   at most once per socket per 5 s, and only after a 10 s warm-up.
   Reduces aggregate noise to ~2.4 lines/sec and makes the smoke-test
   output below it actually readable.

6. As a side effect of (5), the interleaved-stdout problem (parallel
   WARN + smoke-test ✗ lines fusing mid-line) becomes statistically
   rare. True atomicity would require routing tracing + println
   through a single channel — kept out of scope.
Three coupled fixes that together make `lab service core` work on a fresh
Ubuntu 24 box without manual log-spelunking. Discovered while reproducing
hero_skills#282 end-to-end:

1. ensure_companion_config — when acquire_binary resolves
   hero_aibroker_server, also fetch modelsconfig.yml from the repo's
   main branch raw URL into $PATH_VAR/hero_aibroker/modelsconfig.yml.
   Idempotent (skip when present + non-empty). Hits all three acquire
   paths (installed cache hit / Forge download / build-from-source).
   Match is strict — any other binary returns Ok(()) immediately, so
   zero new behavior for hero_proc, hero_router, hero_db, etc. Fetch
   failure is non-fatal: warn and continue so the binary install path
   never blocks on the config fetch.

   Without this, fresh boxes hit:
       Error: Failed to read config file:
           /root/hero/var/hero_aibroker/modelsconfig.yml

2. preflight_aibroker_provider_key — before do_start_validated hands
   hero_aibroker_server off to hero_proc, query hero_proc secrets for
   any of the 10 supported provider keys (OPENROUTER_API_KEY,
   ANTHROPIC_API_KEY, ANTHROPIC_API_KEYS, OPENAI_API_KEY, GROQ_API_KEY,
   CEREBRAS_API_KEY, DEEPSEEK_API_KEY, DEEPINFRA_API_KEY,
   SAMBANOVA_API_KEY, HF_API_TOKEN). If none are set, bail up front
   with the actionable `lab secrets set …` invocation. Fires for both
   the direct (`lab service hero_aibroker --start`) and transitive
   (lab service core → hero_code → hero_aibroker) paths.

   Without this, the binary registered with hero_proc, refused to
   start with a clear error message — but that message landed in
   /root/hero/var/logs/core/<service>/<job>/<date>.log, not stdout.
   Users had to grep job logs to find out which secret to set.

3. do_stop short-circuits on `failed`/`error`/`halted` state. Before:
       hero_aibroker_server: stop returned an error
           (may already be stopped): stop failed — state 'failed'
   …a self-contradictory message that left users guessing whether the
   service was stopped or not. Now: query state first, treat any of
   {failed, error, halted} as already-stopped and return success with
   "state is failed — no live process, already stopped." Closes
   hero_skills#281 Bug 7.

Plus a README block in the "Next steps after lab user init" section
listing the provider-key step and the full supported-key list, with a
forward reference to the preflight check so users know `lab service
core` will tell them precisely what to set.

Refs: hero_skills#281, hero_skills#282
The previous PR5 commit added `ensure_companion_config` after the
Forge-download branch in `acquire_binary`, but that block was
unreachable for the most common path: when `try_forge_download`
returns `Ok(true)` AND `host_path` already exists (which is the
typical case after a successful Forge install), the function
early-returns inside the match arm — never reaching the post-match
`if forge_ok { … }` block.

Confirmed on a fresh Ubuntu 24 VM running lab-followup-fixes@3eff247:
`lab service core` downloaded `hero_aibroker_server` from Forge, then
went straight to start without any "fetching companion config:" line.
Aibroker then failed its 44 smoke tests because
`/root/hero/var/hero_aibroker/modelsconfig.yml` was missing.

Move the `ensure_companion_config` call inside the early return so
it fires on the Forge-download fast path. The block after the match
becomes dead code; left in place as defensive coverage for the
forge_ok && !host_path.exists() edge case (which shouldn't normally
happen but doesn't hurt to handle).

Refs: hero_skills#282
Bring the README in line with the current binary, last fully accurate
during an earlier flag scheme. Six targeted edits, no narrative changes:

1. Quick reference table — refreshed. Drops bare-`lab` build entry,
   adds `lab user init`, `lab install`, `lab path`, `lab completions`,
   `lab infocheck`, the new build verbs (`--restart`, `--fast --restart`,
   `--fast --stop`, `--reset --start`), and `lab service core` with the
   transitive-dep note.

2. Section 1 "Build mode" — retitled `lab build [flags]` and prefixed
   with a deprecation note explaining bare `lab` and top-level
   `--start`/`--stop`/`--status` are retired. Every example flag is
   updated to `lab build --flag`. Renames `$BUILDDIR` to `$PATH_BUILD`
   in the step list. Adds the destructive-redeploy flow examples
   (`--restart`, `--fast --restart`, `--fast --stop`) and points the
   reader at section 7 for per-service lifecycle.

3. Section 4 env vars — renames `ROOTDIR`→`PATH_ROOT`,
   `CODEROOT`→`PATH_CODE`. Notes that hero_cfg.toml hydration means
   exports work in non-init'd shells.

4. New section 7a "Shell integration" — documents `lab path` and
   `lab completions`. Covers the TTY-aware error form and the
   nu-completion side-effect.

5. New section 7b — documents `lab infocheck`.

6. Section 7 header — drop the retired `lab --start/stop/status`
   parenthetical. Replace the "Start / stop / status (no build)" block
   with a paragraph pointing at `lab service <name> --<verb>` and
   `lab build --status|--stop`, since the top-level flags are gone.

7. Section 6 — note `lab build`'s new redeploy flags (`--restart`,
   `--fast`, `--reset`) inline with the other examples. Update
   `$CODEROOT`→`$PATH_CODE` in the resolution note.

8. Section 10 env vars (build mode) — replace `BUILDDIR`/`ROOTDIR`
   with the `PATH_*` family; add `PATH_VAR`, `CARGO_HOME`, `RUSTUP_HOME`;
   note auto-hydration from hero_cfg.toml.

Refs: hero_skills#281 (the "README drift" bug from fresh-install testing).
nabil_salah changed title from lab-followup-fixes to lab: UX papercuts + hero_aibroker fresh-install + README audit 2026-05-25 07:51:48 +00:00
nabil_salah merged commit 11c1d88085 into development 2026-05-25 07:55:47 +00:00
nabil_salah deleted branch lab-followup-fixes 2026-05-25 07:55:47 +00:00
Sign in to join this conversation.
No reviewers
No milestone
No project
No assignees
1 participant
Notifications
Due date
The due date is invalid or out of range. Please use the format "yyyy-mm-dd".

No due date set.

Dependencies

No dependencies set.

Reference
lhumina_code/hero_skills!296
No description provided.