lab update: three followups from fresh-VM testing — agent diagnostic, skills disk-first, kimi config tolerance #305

Closed

nabil_salah wants to merge 0 commits from lab-agent-skills-kimi-fixes into development

nabil_salah commented

2026-05-31 08:08:13 +00:00

Member

Summary

Three small, independently-useful fixes found during a Tier-1 acceptance pass on a fresh Ubuntu 24 VM (lab agent, lab skills list, lab skills sync Kimi step). Each unblocks a previously broken surface; together they make all five Tier-1 surfaces from our test plan green.

Tested end-to-end on the same VM after install + onboarding + lab service core.

N1 — `lab agent` empty error message

Before:

lab agent: claude agent run: Agent command failed with status 1:

…with nothing after the colon. herolib_ai's CommandFailed { stderr } wraps captured stderr, but Claude Code prints most errors to stdout, so the wrapper had nothing to show.

After:

- New `preflight_claude_binary()` runs `claude --version` before invoking herolib_ai. It catches the "claude not installed / not on PATH" case and points to `lab install ai` as the remediation.
- When herolib_ai returns the empty-stderr error, lab appends a common-causes block (claude auth, `ANTHROPIC_API_KEY`, `claude config list`) plus a copy-pasteable `claude -p '<instruction>' --permission-mode bypassPermissions` reproduction command.
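
The two steps above can be sketched as follows. This is a minimal illustration, not the code in this PR: `Preflight`, `preflight_binary`, and `diagnose` are hypothetical names, and the message wording is only assumed to resemble what lab prints.

```rust
use std::io::ErrorKind;
use std::process::{Command, Stdio};

/// Outcome of probing for a CLI binary before handing work to it.
/// (Hypothetical type, for illustration only.)
#[derive(Debug, PartialEq)]
enum Preflight {
    Ok,
    NotFound,
    Failed(String),
}

/// Run `<bin> --version` and classify the result.
fn preflight_binary(bin: &str) -> Preflight {
    match Command::new(bin)
        .arg("--version")
        .stdin(Stdio::null())
        .output()
    {
        Ok(out) if out.status.success() => Preflight::Ok,
        Ok(out) => Preflight::Failed(format!("`{bin} --version` exited with {}", out.status)),
        // Spawning fails with NotFound when the binary is not on PATH.
        Err(e) if e.kind() == ErrorKind::NotFound => Preflight::NotFound,
        Err(e) => Preflight::Failed(e.to_string()),
    }
}

/// Build the failure message. When stderr is empty, append hints and a
/// reproduction command so the user is not left with a bare colon.
fn diagnose(status: i32, stderr: &str, instruction: &str) -> String {
    let stderr = stderr.trim();
    let mut msg = format!("Agent command failed with status {status}: {stderr}");
    if stderr.is_empty() {
        // Escape single quotes so the command can be pasted into a POSIX shell.
        let quoted = instruction.replace('\'', r"'\''");
        msg.push_str("\n(no stderr captured; the tool may have printed its error to stdout)");
        msg.push_str("\nCommon causes: claude auth, ANTHROPIC_API_KEY, claude config list");
        msg.push_str(&format!(
            "\nReproduce with: claude -p '{quoted}' --permission-mode bypassPermissions"
        ));
    }
    msg
}
```

A caller would run `preflight_binary("claude")` first and, on `Preflight::NotFound`, suggest `lab install ai` rather than attempting the agent run at all.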

N2 — `lab skills list` returns 0 after sync wrote 74

Root cause: the build-time embedded skills registry (build.rs + registry::all()) scans <repo>/claude/skills/<dir>/SKILL.md, but the canonical layout was moved to flat sidecar pairs at <repo>/skills/<group>/<name>.{md,toml} long ago. The embedded registry has been permanently empty since the move.

Fix: route cmd_skills_list / get / find / show through the same on-disk loader that lab skills sync uses (loader::load_all reading $PATH_CODE/hero_skills/skills/). Embedded registry stays as a fallback for when the source tree isn't reachable.
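
A rough sketch of the disk-first lookup with an embedded fallback, assuming the flat `<group>/<name>.{md,toml}` layout described above. The names `Source`, `load_from_disk`, and `list_skills` are hypothetical and do not reflect the real `loader::load_all` signature.

```rust
use std::collections::BTreeSet;
use std::fs;
use std::io;
use std::path::Path;

/// Where the returned skills came from. (Hypothetical type.)
#[derive(Debug, PartialEq)]
enum Source {
    Disk,
    Embedded,
}

/// Collect `group/name` for every `<root>/<group>/<name>.md` that has a
/// matching `<name>.toml` sidecar next to it.
fn load_from_disk(root: &Path) -> io::Result<Vec<String>> {
    let mut skills = BTreeSet::new();
    for group in fs::read_dir(root)? {
        let group = group?.path();
        if !group.is_dir() {
            continue;
        }
        for entry in fs::read_dir(&group)? {
            let path = entry?.path();
            let is_md = path.extension().and_then(|e| e.to_str()) == Some("md");
            if !is_md || !path.with_extension("toml").is_file() {
                continue; // skip non-markdown files and markdown without a sidecar
            }
            let group_name = group.file_name().and_then(|s| s.to_str());
            let skill_name = path.file_stem().and_then(|s| s.to_str());
            if let (Some(g), Some(n)) = (group_name, skill_name) {
                skills.insert(format!("{g}/{n}"));
            }
        }
    }
    // BTreeSet iteration is sorted, so the listing order is stable.
    Ok(skills.into_iter().collect())
}

/// Prefer the on-disk tree; use the embedded list only when the tree
/// cannot be read at all (for example, the directory does not exist).
fn list_skills(root: &Path, embedded: &[&str]) -> (Source, Vec<String>) {
    match load_from_disk(root) {
        Ok(skills) => (Source::Disk, skills),
        Err(_) => (
            Source::Embedded,
            embedded.iter().map(|s| s.to_string()).collect(),
        ),
    }
}
```

In this sketch a readable but empty tree reports zero skills from disk instead of falling back, which keeps the source of the listing unambiguous; whether lab makes the same choice is not stated in this PR.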

Verified:

$ lab skills list | head -1
74 skill(s) loaded from disk:

N3 — `lab skills sync` Kimi config skipped on missing `SAMBANOVA_API_KEY`

Same shape as the hero_aibroker provider-key bug from PR5: one missing optional secret blocked the whole flow.

Before:

Kimi config: skipped — build kimi config from hero_proc secrets:
  fetch SAMBANOVA_API_KEY from hero_proc (context=core): …

Generation of the whole `~/.kimi/config.toml` was skipped even when `GROQ_API_KEY` was set.

Fix:

- `resolve_key_optional` returns an empty string for missing keys instead of bailing.
- `ProviderKeys::missing()` lists which providers will be inactive.
- `hero_defaults_with_keys` omits provider entries (and their dependent models) whose API key is empty, so the generated config passes `validate()` (which requires every provider to have api_key OR oauth).
- `default_model` falls back through `groq → sambanova → kimi-code` based on which providers are configured.
- `write_kimi_config` prints a one-line WARN naming the missing providers and proceeds.
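
The omit-and-fall-back behaviour can be illustrated with a small sketch. Everything here is an assumption made for illustration: the `Auth`, `Provider`, and `Generated` types, the `build_config` and `validates` functions, and the idea that `kimi-code` authenticates via OAuth are not taken from the real `kimi_config` module. The sketch also picks a default provider rather than a default model, to avoid inventing model names.

```rust
use std::collections::HashMap;

/// How a provider authenticates. (Hypothetical type.)
#[derive(Debug, Clone, PartialEq)]
enum Auth {
    ApiKey(String),
    OAuth,
}

#[derive(Debug, Clone, PartialEq)]
struct Provider {
    name: String,
    auth: Auth,
}

struct Generated {
    providers: Vec<Provider>,
    /// Secret names that were absent or empty; used for the one-line warning.
    missing: Vec<String>,
    default_provider: String,
}

/// Build the provider list, skipping any keyed provider whose secret is
/// missing instead of aborting the whole generation.
fn build_config(secrets: &HashMap<&str, &str>) -> Generated {
    // Listed in fallback order: groq, then sambanova.
    const KEYED: [(&str, &str); 2] = [
        ("groq", "GROQ_API_KEY"),
        ("sambanova", "SAMBANOVA_API_KEY"),
    ];
    let mut providers = Vec::new();
    let mut missing = Vec::new();
    for (name, secret) in KEYED {
        let key = secrets.get(secret).copied().unwrap_or("");
        if key.is_empty() {
            missing.push(secret.to_string());
        } else {
            providers.push(Provider {
                name: name.to_string(),
                auth: Auth::ApiKey(key.to_string()),
            });
        }
    }
    // Assumption: kimi-code needs no API key, so it is always present
    // and acts as the last fallback.
    providers.push(Provider {
        name: "kimi-code".to_string(),
        auth: Auth::OAuth,
    });
    // Providers were pushed in fallback order, so the first one wins.
    let default_provider = providers[0].name.clone();
    Generated { providers, missing, default_provider }
}

/// Mirror of the rule that every provider needs a non-empty key or OAuth.
fn validates(cfg: &Generated) -> bool {
    cfg.providers.iter().all(|p| match &p.auth {
        Auth::ApiKey(k) => !k.is_empty(),
        Auth::OAuth => true,
    })
}
```

With only `GROQ_API_KEY` set, this sketch yields the providers `groq` and `kimi-code`, reports `SAMBANOVA_API_KEY` as missing, and selects `groq` as the default.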

Added 3 new unit tests covering the omit-empty-providers paths. All 6 kimi_config tests pass.

Verified:

$ lab skills sync 2>&1 | grep -A1 Kimi
Kimi config: 1 provider key(s) missing — generated config will route to
  that provider with no credentials. Set with `lab secrets set <NAME>_API_KEY <value>`: SAMBANOVA_API_KEY
Wrote kimi config to /root/.kimi/config.toml

Test plan checklist

- [x] `lab skills sync` writes `~/.kimi/config.toml` with a one-line WARN about missing providers
- [x] `~/.kimi/config.toml` contains 2 providers (kimi-code + groq), 0 sambanova references
- [x] `default_model` in the written config references an existing model
- [x] `lab skills list` returns 74 skills from disk
- [x] `lab skills get forge_api` returns name + description + related
- [x] `lab skills find forge` returns 2 matches
- [x] `lab agent` failure produces actionable diagnostic + repro command
- [x] `cargo build -p lab` finishes with 0 warnings
- [x] `cargo test -p lab --lib agent::kimi_config`: 6/6 pass

Refs

hero_skills#281 — the "missing-secret bails a whole flow" pattern (same as PR5)
hero_skills#282 — sibling fix in service_manager


nabil_salah changed title from ~~lab: three followups from fresh-VM testing — agent diagnostic, skills disk-first, kimi config tolerance~~ to lab update: three followups from fresh-VM testing — agent diagnostic, skills disk-first, kimi config tolerance

2026-05-31 08:11:10 +00:00

nabil_salah closed this pull request

2026-05-31 08:11:30 +00:00