cleaner bootstrap #249

Merged
omarz merged 35 commits from docs/setup-bootstrap-adr into development 2026-05-11 11:22:50 +00:00
Member
No description provided.
lab's own deps (herolib_ai, herolib_core, herolib_derive, herolib_os) require
rustc 1.95.0, matching `rust-version = "1.95.0"` in lab/Cargo.toml. The
installer was still pinning 1.94, so `lab install rust` left the toolchain
one minor short and `cargo install --path .` failed on a fresh box.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Proposes consolidating the three parallel setup surfaces (bash
install_nu.sh + cleanup_nu.sh, ~2.5k LOC of nushell modules,
and crates/lab/) into a four-layer model: one bash bootstrap,
lab as the sole CLI, untouched Rust runtime, and an optional
generated nu overlay for users who pick nu as their shell.

Ships:
- docs/adr/0001-unify-setup-bootstrap.md — decision, rules,
  alternatives, consequences.
- docs/plan/0001-unify-setup-bootstrap.md — 8-phase implementation
  plan grounded in existing call sites (flow/chain.rs uninstall,
  installers/core.rs ensure_shell_init, the three install_nu.sh
  call sites), with acceptance criteria and a risk register.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replaces the sudo-bash shell-out in flow/chain.rs::uninstall with
native Rust. New module crates/lab/src/flow/uninstall.rs maps 1:1
to the 18 cleanup_nu.sh functions, reusing existing primitives
from user/btrfs.rs, user/teardown.rs, user/mycelium.rs, and
installers::host where they already exist.

- Signature of crate::flow::chain::uninstall preserved (no CLI change).
- --dry-run prints planned actions without executing.
- Idempotent: re-runs on a clean host succeed as no-ops.
- nutools/cleanup_nu.sh moved to _archive/ for one-release rollback.

Phase 1 of docs/plan/0001-unify-setup-bootstrap.md.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
New crate::user::cfg module owns hero_cfg.toml IO and the generated
init.sh artifact:

- cfg::write_initial — sudo-tee a minimal [forge] section, 0600,
  user-owned. Refuses to overwrite a non-empty existing token.
- cfg::read_forge — typed access to [forge].
- cfg::generate_init_sh — regenerates init.sh from hero_cfg.toml
  with an AUTO-GENERATED header; backs up to init.sh.bak if the
  existing file is hand-edited; no-op when content is unchanged.

Exposed as `lab user shell-init [--user <name>]`. Replaces the
generate_init_sh function in nutools/install_nu.sh; nothing wires
into it yet (later P2 sub-phases swap the call sites).

Phase 2c + 2e of docs/plan/0001-unify-setup-bootstrap.md.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replaces the install_nu.sh bash that prompted for and validated
the Forgejo token at the very top of the install flow:

- forge::api::validate_token — HEADs /api/v1/repos/.../hero_skills
  with the candidate token and returns a typed TokenValidation
  enum (Valid / Unauthorized / Forbidden / NotFound / Other /
  NetworkError).
- flow::install::prompt_forge_token — CLI flag → env → TTY
  prompt fallback; never echoes the token; quiet when stdin is
  not a TTY.
- flow::install::resolve_and_validate — combines the two; bails
  on auth failures, warns and continues on transient errors.
- flow::chain::install wires the resolver into the day-0 chain;
  signature unchanged.

Phase 2d of docs/plan/0001-unify-setup-bootstrap.md.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
P2a: ensure_shell_init now patches macOS rc files too (.zshrc,
.bash_profile, .bashrc — zsh first since Catalina). Linux path
unchanged. .profile → .bashrc bridge stays Linux-only.

The old macOS stub (early return Ok(())) is replaced with a real
patch_rc_candidates pass that appends the hero/cfg/init.sh source line
to each existing rc file. Idempotent via grep-qF equivalent: skips any
file that already contains "hero/cfg/init.sh". Signatures of
ensure_shell_init and ensure_shell_init_for are unchanged; no new
dependencies; reuses installers::util::is_macos.

P2b: confirmed the lab host bootstrap does NOT auto-install nushell.
grep -rn 'install_nushell' crates/lab/src/ returns:
  main.rs:1792  InstallCmd::Nu => installers::install_nushell()  [opt-in CLI only]
  installers/mod.rs:39  pub use nu::install_nushell             [re-export]
  installers/nu.rs  [definition]
No call in flow::*, pod::*, or installers::host. The wiring was already
clean — this phase is an audit + verification deliverable.

`lab install nu` is present as InstallCmd::Nu and remains available for
opt-in use (template image, per-user shells). install_nushell is
idempotent when nushell >= MIN_NU_VERSION (0.112.0) is already in
~/hero/bin/nu.

Together these complete the prerequisites for P3 (bootstrap.sh):
a fresh host can be brought up to driver+lab without nu ever
running on it.

Phase 2a + 2b of docs/plan/0001-unify-setup-bootstrap.md.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`user create` no longer shells out to nutools/install_nu.sh.
Inside the new user's pod we now run:
  - lab install ensure-shell-init  (.bashrc/.zshrc patch)
  - lab user shell-init             (init.sh from hero_cfg.toml)
plus user::cfg::write_initial to seed the TOML.

Nushell is no longer installed per-user. Operators who want nu
run `lab install nu` inside the target pod. Matches ADR-0001:
nu is a user choice, not a system component.

Phase 2 wiring (1/2) — docs/plan/0001-unify-setup-bootstrap.md.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Same swap as the previous commit, applied to the common-pod
provisioning path. `flow common create` / `flow install` no
longer shell into nutools/install_nu.sh inside common — they
run lab install ensure-shell-init + lab user shell-init via
the common pod, after seeding hero_cfg.toml.

Net: grep -rn 'install_nu.sh' crates/lab/src/ returns nothing.
The bash script remains in nutools/ until P3 replaces it with
scripts/bootstrap.sh.

Phase 2 wiring (2/2) — docs/plan/0001-unify-setup-bootstrap.md.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
scripts/bootstrap.sh (188 LOC) replaces ~1160 LOC of legacy
bootstrap code. As UID 0:
  - check deps; resolve+validate FORGE_TOKEN (TTY prompt fallback)
  - create driver user with NOPASSWD sudo (idempotent)
  - seed /home/driver/hero/cfg/hero_cfg.toml (mode 600, driver-owned)
  - download lab-<os>-<arch> from forge latest release to
    /home/driver/hero/bin/lab (0755, driver-owned)
  - su - driver -c 'lab flow install --forgetoken ...'
  - verify clean: no /root/hero, no nu at root

nutools/install_nu.sh and scripts/install.sh moved to _archive/
for one-release rollback. README + nutools/README updated to
point at the new entry. skills/mod.rs + nutools/docs/server_setup.md
install URLs updated to bootstrap.sh.

Phase 3 of docs/plan/0001-unify-setup-bootstrap.md.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- crate::user::cfg::generate_init_sh hardened: AUTO-GENERATED
  header on every generated init.sh; hand-edited files back up
  to init.sh.bak with a one-line warning before overwrite.
- New cfg::cfg_get / cfg::cfg_set for dotted-key access (e.g.
  forge.token, mycelium.bridge_name). cfg_set automatically
  regenerates init.sh after the write.
- CLI surface: `lab user cfg get <key>` and
  `lab user cfg set <key> <value>` (defaults to current user;
  --user <name> for cross-user ops).
- docs/hero_cfg_schema.md documents the known sections and keys.
- Audit: no Rust file under crates/lab/src/ writes init.sh
  directly anymore — every writer goes through user::cfg.

Phase 4 of docs/plan/0001-unify-setup-bootstrap.md.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Archive 7 nushell modules whose responsibilities are owned by
`lab` after the P1-P4 work:

- modules/lib/forge.nu       → lab forge / lab repo
- modules/lib/platform.nu    → lab platform detection
- modules/lib/docker.nu      → lab install nerdctl / lab service
- modules/services/*.nu      → lab service <name> --start/--stop/--status

All moved to _archive/nutools/modules/ for one-release rollback.

Decision (b) taken for init.nu / nu.nu: hero_loader.nu calls
init_main (which uses nu_load_init_sh) at login-shell startup, so
both files remain in lib/ until P6 replaces hero_loader.nu wholesale.

The remaining lib/ modules (agent, openrpc, ssh) are opt-in helpers
handled in P7. hero_loader.nu pruned to stop importing services *
(now a comment-only lazy-load note).

lib/mod.nu updated to export only the 5 surviving files. nu.nu
had an unused `use platform.nu` import removed. Dangling references
in nutools/docs/, nutools/modules/README.md, and
claude/skills/hero_config/SKILL.md updated to point at lab equivalents.
crates/ has zero residual references.

Phase 5 of docs/plan/0001-unify-setup-bootstrap.md.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- crate::user::cfg::generate_hero_nu writes a ~50-line overlay
  at <homedir>/nutools/shell/hero.nu: PATH prepend, ROOTDIR
  defaults, MYCELIUM_* from hero_cfg.toml, LIVEKIT_* from
  hero_livekit runtime.json. AUTO-GENERATED header; backs up
  hand-edited files to hero.nu.bak.
- `lab user shell-init` now regenerates both init.sh (bash) and
  hero.nu (nu) in one pass.
- nutools/modules/hero_loader.nu shrunk to a 3-line stub that
  sources the generated overlay.
- nutools/modules/lib/{init.nu, nu.nu} archived — no longer
  needed by the loader. mod.nu re-export list pruned to match.

Phase 6 of docs/plan/0001-unify-setup-bootstrap.md.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The three remaining domain helpers in nutools/modules/lib/ are
no longer auto-loaded. Moved to nutools/shell/optional/ where
power users can `use` them from config_user.nu. Empty
nutools/modules/lib/ archived.

README.md updated with the opt-in usage example.

Phase 7 of docs/plan/0001-unify-setup-bootstrap.md.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
skill_audit/ moved to nutools/shell/optional/skill_audit/ —
opt-in, not auto-loaded. test_multiuser_bridge.nu joins it
under shell/optional/tests/.

REPO_ROOT_REL and fallback path in run.nu updated from
nutools/modules/skill_audit (3 levels) to
nutools/shell/optional/skill_audit (4 levels).

Note: skill_audit may have stale `use lib/*` imports from the
P5 archive sweep; deferred to a follow-up since this tool is
opt-in.

Phase 8 of docs/plan/0001-unify-setup-bootstrap.md — completes
the ADR-0001 consolidation.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
10-phase plan to take an old VPS with legacy install_nu.sh / nu /
flow state, nuke it to fresh-machine state, then exercise every
P1-P8 deliverable from docs/plan/0001-unify-setup-bootstrap.md
end-to-end:

  Phase 0: inventory baseline
  Phase 1: NUKE (stop procs, tear down accounts, wipe artefacts,
           strip shell rc, optional package removal, reboot)
  Phase 2: verify clean state
  Phase 3: bootstrap.sh fresh + idempotent + negative test
  Phase 4: lab flow install (host preflight, driver, template,
           common, doctor, status)
  Phase 5: user provisioning (create, login, delete) for two users
  Phase 6: cfg get/set + generated hero.nu overlay
  Phase 7: opt-in nu helpers + skill_audit relocation
  Phase 8: service lifecycle smoke
  Phase 9: lab flow uninstall (dry-run + real + re-bootstrap)

Includes an acceptance matrix mapping each test back to its
P1-P8 deliverable, plus recovery procedures for common
mid-test failures.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The last nu shell-out in the lab crate. Pre-fix, provisioning.rs ran
`~/hero/bin/nu -c 'use .../secrets.nu *; secrets_sync'` — but secrets.nu
was deleted in Phase 5 of ADR-0001, so the call silently failed on every
fresh install (marked non-fatal, swallowed). The user pod now uses
`lab secrets sync --forge-token <token>` via podmanager::Pod::execute,
which is the canonical native path.

Closes the last ADR-0001 rule 5 violation; lab now has zero nu shell-outs
per crates/lab/src/CLAUDE.md hard rule 5.
ADR-0001 rule 4: nushell is a user choice, not a system component.
The pre-fix code picked nu as common's login shell whenever
~/hero/bin/nu happened to exist (e.g. if template was provisioned
with `lab install nu`). That silently violated the "no nu unless
opted in" guarantee.

Common now always uses /bin/bash. Users who want nu can switch
their personal shell via chsh after provisioning.
A binary's `--info` `dependencies` list is operator-authored and can
cycle by mistake (A -> B -> A).  Pre-fix, ensure_dependency_running
recursed via try_start_binary -> try_start_one -> do_start_validated
-> ensure_dependency_running with no guard — a 2-cycle blew the stack
instead of returning a useful error.

Adds a thread-local HashSet that tracks in-flight dependency names.
DepGuard inserts on construction, removes on drop.  When new() finds
the dep already in the set, it returns None and the function bails
with a message that names the dep and the in-flight chain.
CORE_SERVICES is a hard-coded array (hero_proc, hero_router,
hero_code, hero_db) and ignores each binary's `--info` `dependencies`
list. If a binary declares a dep that isn't already started when
its turn comes up, the auto-resolver still recovers — but silently.

Adds warn_unmet_dependencies: probe the binary's --info before start,
and emit a warning for every declared dep not yet in the
already-started set. Surfaces drift between the hard-coded order
and the binary's own contract without changing start semantics.
Without --strict, smoke-test failures after a successful hero_proc
registration are warnings: the service is technically running but
some endpoints misbehave. Operators driving CI / verification gates
want this to be fatal.

Adds a thread-local STRICT_SMOKE cell, set via set_strict_smoke()
from main.rs when --strict is passed to `lab service <name> --start`.
smoke_test_sockets now returns the failure count, and
do_start_validated bails when strict + failed > 0.
The hard-coded list in crates/lab/src/service/core_services.rs is
hero_proc, hero_router, hero_code, hero_db — not hero_proxy or
hero_aibroker as the doc claimed.
- main.rs: UserCmd::Reset previously took a {name} field but
  user_reset() is a bulk teardown of every non-root hero user. The
  CLI silently ignored the supplied name. Drop the field so the
  surface matches the implementation; pass `lab user reset` with
  no arg now (anyone who passed a name gets a clap error instead
  of the wrong action).
- user/lifecycle.rs: drop unused `user_list` import.
- pod/pod_template.rs: drop unused `Context` import.
- podmanager/mod.rs: remove unused `default_home` helper (and
  now-orphan PathBuf import).

`cargo check -p lab` is clean (0 warnings, 0 errors).
apikeys.db and request_logs.db are local runtime state from a tool that
was run inside the repo root. They should never have been committed —
likely contain API keys / request bodies. Untrack them and ignore *.db /
*.sqlite[3] so the next stray file is caught at `git add`.

Also dedupe the duplicated `.claude` entry and clarify it ignores the
project-local Claude Code state directory (not the tracked claude/
content directory).

Action items for the user (not done by this commit):
  - rm apikeys.db request_logs.db locally if no longer needed
  - rotate any keys present in those files
  - consider scrubbing them from prior commits with git filter-repo

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three security/correctness fixes to scripts/bootstrap.sh:

1. Validate FORGE_TOKEN against /api/v1/user (always requires auth)
   instead of /api/v1/repos/<repo>, which silently returns HTTP 200
   to anonymous requests when the repo is public — making the prior
   check accept any string as "valid". Plan 0002 §3.4 documented
   this as a known gap; closes it.

2. Stop passing FORGE_TOKEN on the command line to `lab flow install`.
   The previous form (`su -c "FORGE_TOKEN=… lab flow install
   --forgetoken \"$FORGE_TOKEN\""`) leaked the secret into both
   /proc/<pid>/cmdline and `ps -ef` for the duration of the run.
   Replaced with a driver-owned, mode-600 tmpfile that exports the
   env vars and execs lab; tmpfile is shredded by the EXIT trap.

3. Require `jq` in root_check_dependencies and drop the grep-based
   JSON parser. The fallback assumed Forgejo emits asset fields in
   `name` → `browser_download_url` order; the API guarantees no
   such order, so the parser could silently pick the wrong asset.

Also drop dead `tr -d '[:space:]'` on `%{http_code}` (already digits)
and merge the EXIT trap so both TMPDIR_LAB and the handoff tmpfile
are cleaned on exit.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Plan 0002 §"bootstrap.sh 404s on the binary": the release tag was
  renamed from `lab-latest` to `latest` in commit 5832a79; update
  the doc and switch the inspection one-liner to jq (matches the
  tooling bootstrap.sh now requires).
- README "Build from source": explicitly call out that
  `scripts/bootstrap.sh` is the supported day-0 path and
  `crates/lab/install.sh` is the from-source escape hatch — previously
  both were presented as parallel options without guidance.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Previously each crate carried its own Cargo.lock; resolution drifted
silently and `cargo check -p lab_test` could pull a different version of
a shared dep than `cargo check -p lab` did. Add a top-level workspace
Cargo.toml so both crates share one lock and one build dir. Verified
with `cargo check --workspace` (offline, against the existing target
dir) — both crates build with the unified lockfile.

Drops the two per-crate Cargo.lock files (the workspace root lock is
authoritative; per-crate locks are ignored inside a workspace).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
scripts/uninstall.sh — companion to scripts/bootstrap.sh. Distills the
Phase 1 (NUKE) procedure from docs/plan/0002 into a runnable script so
re-bootstrap doesn't require copy-pasting from a markdown doc.

  - Prefers `lab flow uninstall --purge-host` when lab is installed.
  - Falls back to manual teardown (sudoers, br-* bridges, btrfs
    subvolumes, /home/{driver,template,common}, provisioned hero
    users, /etc/hero, root shell-rc Hero block) when lab is missing
    or its uninstall fails.
  - --dry-run, --yes, --keep-host, --purge-host (default) flags.
  - Idempotent on every step.

docs/hero_cfg_schema.md — add a "When each section is written" table
making it clear that bootstrap.sh writes only [forge], and [mycelium]
/ [ssh] appear later via lab provisioning. Avoids confusion when a
fresh driver home shows only [forge] before flow install runs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The previous workflow referenced three files that no longer exist in
the repo:
  - buildenv.sh
  - scripts/test.sh
  - scripts/publish_skills.sh

They were part of the archived build_lib pipeline (see
_archive/build_lib/) and were never restored after the lab
consolidation in ADR-0001. Every push to `development` was therefore
failing CI on the very first step.

Replace with a workflow that exercises what the repo *does* contain:
the Cargo workspace at the root. Runs cargo fmt (informational),
cargo check, and cargo test on push and PR. No publish step — that
job is cut from a dev box with `lab build --upload --platforms allbase`
per README §"Build and upload a release" until a cross-compile
pipeline is added.

Permissions tightened from contents: write / packages: write to
contents: read (no release artifact is produced).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
bootstrap.sh: a FORGE_TOKEN containing a `"` or `\` would corrupt the
generated hero_cfg.toml — TOML basic strings disallow both raw. Add a
small toml_escape helper and run TOKEN through it before writing
`token = "..."`. Forgejo PATs don't include those characters today,
but third-party token formats might.

uninstall.sh: remove the `--keep-host-pkgs` flag from the help block.
The script never apt-removes anything, so the flag was dead docs that
contradicted the actual implementation.

README: add "Uninstall / re-bootstrap" section pointing at the new
scripts/uninstall.sh, with the same warning about its destructive
nature that the inline help carries.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`curl | sh` installers (claude, uv, rustup) drop their binaries under
~/.local/bin or ~/.cargo/bin. lab inherits a non-login PATH via `su -c`
during `flow install`, so neither directory is visible — `is_on_path`
returns false and `run_cmd("uv"|"rustup", …)` fails with ENOENT.

Two-part fix:
- `installers/util.rs`: `is_on_path` falls back to `~/.local/bin/<bin>`
  so post-install verification doesn't false-negative.
- `main.rs`: `augment_subprocess_path()` prepends `~/.local/bin`,
  `~/.cargo/bin`, and `~/hero/bin` to PATH once at startup so every
  `run_cmd(...)` lab spawns can exec these tools.

Surfaced by ADR-0001 P3 bootstrap end-to-end test (docs/plan/0002).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
fix(bootstrap): chown ~/hero to driver so lab build can clone into code0
Some checks failed
Build and Test (lab workspace) / build-and-test (pull_request) Failing after 3s
1fb01e0ac8
`install -d` only applies ownership/mode to the leaf directory it
creates. Calling `install -d -o driver ~/hero/cfg` leaves the parent
`~/hero` as root:root (root being the bootstrap caller), so when
`lab flow install` later runs `lab build hero_proc` as driver, the
forge clone into `~/hero/code0/hero_proc` fails with EACCES.

Explicitly create `~/hero` first with driver ownership.

Surfaced by ADR-0001 P3 end-to-end test (docs/plan/0002).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
omarz merged commit d9b5492ba7 into development 2026-05-11 11:22:50 +00:00
omarz deleted branch docs/setup-bootstrap-adr 2026-05-11 11:22:50 +00:00
Sign in to join this conversation.
No reviewers
No milestone
No project
No assignees
1 participant
Notifications
Due date
The due date is invalid or out of range. Please use the format "yyyy-mm-dd".

No due date set.

Dependencies

No dependencies set.

Reference
lhumina_code/hero_skills!249
No description provided.