[META] Hero OS demo-deployer arc tracker (cockpit + proxy + content + deployer + manifest + integration) #235

Closed
opened 2026-05-21 11:38:32 +00:00 by mik-tf · 6 comments

Demo-deployer arc tracker

Scope: all work needed to go from "lab + CI + a one-off TFGrid VM proving the binaries work" (where we are now, post-hero_demo 09f8365 / s132) to "a team operator types a username in an admin tool and gets back a Forge-OAuth-gated Hero OS demo VM that the user logs into, sees their cockpit, manages their services".

Primary tracker for this arc. PATCHed at each session close.

Current state (Track A s158 close, 2026-05-25): FIRST PUBLIC HERO OS URL LIVE on TFGrid. Pivoted mainnet -> QAnet (twin 703 / FreeFarm2 node 5 / $0 TFT) per newly-minted D-30. Admin VM provisioned via deployer.provision_vm (sid 0062, 16 GB RAM, mycelium-SSH'd in Ubuntu 24.04). Phase 0.5 shipped hero_compute@8f7a2b7 extending D-27 inline-await + rollback_orphans pattern from deploy_vm to deploy_webgateway (closes hero_compute#126; 2 new gateway_hint_tests; pre-merge gate clean). Live-verified on both Ok-path (49s deploy -> state ready) AND rollback-path (4 orphan name+node contracts cancelled cleanly across 2 failure modes — first live D-27 gateway extension proof). Public URL: https://hcockpit.gent01.qa.grid.tf/hero_cockpit/web/services behind TFGrid Web Gateway TLS (D-28 topology, gateway node 2 zone gent01.qa.grid.tf — same zone as Mahmoud's reference instance). End-to-end user walk proven: walker user s158_walker_<ts> minted via deployer.create_user + SSH pubkey uploaded to Forge (D-23 alt-2) + deployer.provision_vm minted child VM sid 0068 (8 GB, mycelium-SSH'd) co-located on same rented node 5 — multi-tenant topology proven.
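The rollback-path behaviour described above (orphan contracts cancelled when a later deploy step fails) can be sketched roughly as follows. This is a minimal illustration of the general pattern, not the hero_compute implementation: the `Chain` trait, `deploy_with_rollback`, and `FakeChain` are hypothetical names introduced only for this sketch.

```rust
/// Hypothetical stand-in for an on-chain contract ID.
type ContractId = u64;

/// Minimal view of the chain operations the sketch needs
/// (assumed for illustration, not the real SDK surface).
trait Chain {
    fn create_contract(&mut self, label: &str) -> Result<ContractId, String>;
    fn cancel_contract(&mut self, id: ContractId) -> Result<(), String>;
}

/// Creates one contract per label. If any creation fails, cancels everything
/// created so far (newest first) and returns the original error.
fn deploy_with_rollback<C: Chain>(
    chain: &mut C,
    labels: &[&str],
) -> Result<Vec<ContractId>, String> {
    let mut created = Vec::new();
    for label in labels {
        match chain.create_contract(label) {
            Ok(id) => created.push(id),
            Err(err) => {
                for id in created.iter().rev() {
                    // Best effort: a failed cancel is reported but does not
                    // mask the root error.
                    if let Err(cancel_err) = chain.cancel_contract(*id) {
                        eprintln!("orphan {id} not cancelled: {cancel_err}");
                    }
                }
                return Err(err);
            }
        }
    }
    Ok(created)
}

/// In-memory fake used to exercise both the Ok path and the rollback path.
struct FakeChain {
    next_id: ContractId,
    fail_on: Option<&'static str>,
    active: Vec<ContractId>,
}

impl Chain for FakeChain {
    fn create_contract(&mut self, label: &str) -> Result<ContractId, String> {
        if self.fail_on == Some(label) {
            return Err(format!("create failed: {label}"));
        }
        self.next_id += 1;
        self.active.push(self.next_id);
        Ok(self.next_id)
    }

    fn cancel_contract(&mut self, id: ContractId) -> Result<(), String> {
        self.active.retain(|c| *c != id);
        Ok(())
    }
}
```

With a `FakeChain` whose `fail_on` is `Some("gateway")`, deploying `["name", "node", "gateway"]` returns an error and leaves `active` empty, which is the property the live rollback verification checked for real contracts.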

NEW operational runbook landed: docs/channels/free/admin-vm-deployment-runbook.md (commit b352729) — step-by-step recipe from rent -> provision -> setup-binaries -> deploy_webgateway -> tester handoff, with the 6 install/runtime workarounds discovered at s158 explicitly catalogued and linked to tracking issues.

Demo-app scope clarified: prior framing of s159 as just "hero_books default-load" was too narrow. The canonical demo profile per hero_cockpit#1 §6 enables hero_books + hero_slides + hero_whiteboard + hero_voice + hero_agent + hero_planner + hero_collab on top of bootstrap-core. hero_books default-load may already auto-fire via setup-binaries.sh HERO_BOOKS_DEFAULT_REPOS env wire (the s153 deferred scope); needs live-verify on the admin VM at s159 /start.

7 new Forge follow-ups filed for Mahmoud window: hero_compute#127 service.toml env for TFGRID_NETWORK; hero_proxy#55 IPv6 dual-stack seed bind (blocks public-URL reachability — manual workaround in runbook); hero_cockpit#7 landing-page relative URL bug; hero_cockpit#8 dark/light mode inconsistent across pages; hero_demo#67 setup-binaries.sh missing secret pre-population (includes bare-key-vs-context-prefixed slot ambiguity lesson); hero_compute#128 workload-name client-side validation; hero_skills#303 lab build --download --install silently passes without installing binaries.

State at s158 close: admin VM + walker child VM + rent contract 84920 + gateway sid 0067 ALL UP, intentionally left running through s159+s160 (zero TFT cost on QA). Twin 14199 mainnet treasury baseline 40 untouched. Realistic readiness: 70% testable — guided demo with verbal walkthrough works; self-service for a stranger needs s159 (landing-page fix + workaround sweep) + s160 (AIBROKER_DEMO_KEY staging for AI tier + BYO Forge token UI test). Remaining arc: s159 (sweep ~3-4h) -> s160 (AI keys + BYO test ~3-4h) -> s161 (this issue closure, 30 min). Total ~6-8h.

Previously (Track A s157e close, 2026-05-25): CI GREEN ON hero_compute. s157e shipped hero_compute@e845455 on development repairing 7 of 16 integration tests broken since 8be3294: a 6-LOC COMPUTE_TEST_FAKE_DEPLOY test seam added to operator_twin_id in crates/my_compute_zos_server/src/cloud/grid_driver.rs (mirroring the existing seam in deploy_on_tfgrid in the same file) + 9 placeholder image-name updates "img" → "Ubuntu 24.04" in crates/my_compute_zos_server/tests/integration.rs. CI run 1299 green on e845455 in 215s vs run 1297 = failure on 9857630. Workspace fully synced at /start: every D-07 35-set repo git pull origin development (no development_mik branches outstanding); hero_compute pulled in 2 new Mahmoud commits (2f07330 rent→reserve UI rename + 9857630 admin_mode toggle). Original s157e scope (mycelium_ip capture + SSH-verify) renamed to s157f. Next: s157f (mycelium fix + SSH verify, 1-2h, ~$1-2 TFT) → s158 → s159 → s160 → s161 closure. Same ~10-15h envelope.
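For readers unfamiliar with the "test seam" idea mentioned above, a minimal sketch follows. It assumes the seam is simply "if the environment variable is set, return a fixed value instead of querying the chain"; the real 6-LOC change may differ. The `env` and `chain_lookup` parameters and the `FAKE_TWIN_ID` value are assumptions made so the sketch can be tested without touching the process environment.

```rust
/// Name of the test seam, taken from the tracker text.
const FAKE_DEPLOY_VAR: &str = "COMPUTE_TEST_FAKE_DEPLOY";

/// Placeholder twin ID returned when the seam is active
/// (arbitrary value chosen for this sketch).
const FAKE_TWIN_ID: u32 = 1;

/// Returns the operator twin ID.
/// `env` abstracts environment lookup; `chain_lookup` is a hypothetical
/// stand-in for the real on-chain query.
fn operator_twin_id(
    env: impl Fn(&str) -> Option<String>,
    chain_lookup: impl FnOnce() -> Result<u32, String>,
) -> Result<u32, String> {
    if env(FAKE_DEPLOY_VAR).is_some() {
        // Seam active: skip the network entirely.
        return Ok(FAKE_TWIN_ID);
    }
    chain_lookup()
}
```

In production code the first argument would typically be `|k| std::env::var(k).ok()`; in integration tests a closure that reports the variable as set keeps the test off the network.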

Previously (Track A s157d close, 2026-05-25): DEPLOY_VM FULLY UNBLOCKED. Root cause of hero_compute#125: the daemon passed the user-facing image string (e.g. "Ubuntu 24.04") straight through to the TFGrid SDK as the zmachine workload's flist field, ZOS expected a URL, silently rejected with state=Error + empty result.error. Discovered via the SDK's undocumented TFGRID_DEBUG=1 env var (gates trace_step() calls in tfgrid_sdk_rust/src/grid_client/mod.rs:2361) which surfaced per-workload state + the full workload JSON showing the literal name in the flist field. Fix shipped: hero_compute@1f59151 on development adds IMAGE_REFERENCE_MAP const + resolve_image_reference() helper called once at top of deploy_vm (pass-through for https:// URLs, lookup for known names, friendly InvalidInput error otherwise). Live verify on rented dedicated node 3467 (Canada, farm 646 JimboTFT, RentContract 2095174 under twin 14199 ops): VM sid 0053 via URL + VM sid 0054 via name-resolved → both state=running, contracts 2095179/2095180 + 2095181/2095182 persisted on chain. Multi-tenant pattern proven: 2 distinct VMs on same rented node, distinct slices, distinct secrets. All cleaned at /stop: 4 VMs deleted, node unregistered, RentContract 2095174 cancelled (substrate-ack 20s). Twin 14199 active contracts = 0; treasury 6905 baseline 40 untouched. D-29 minted (D-29 file) locking (a) image-name-resolution in the daemon, (b) demo target = any rentable+extraFee>0+up dedicated node on TFGrid mainnet (substrate gate is node.extraFee > 0, NOT node.rentable: True alone; FreeFarm-specifically constraint REMOVED). #125 closed. Track A continues solo. prompt.md §3 projects from this issue.
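The three behaviours listed for resolve_image_reference() (pass-through for https:// URLs, lookup for known names, InvalidInput otherwise) could look roughly like this. It is a sketch under assumptions: the map entry uses a placeholder URL, and the `DeployError` enum is a hypothetical error type, not the one in hero_compute.

```rust
/// Known image names mapped to flist URLs. The URL below is a placeholder
/// for illustration, not the value shipped in hero_compute.
const IMAGE_REFERENCE_MAP: &[(&str, &str)] = &[(
    "Ubuntu 24.04",
    "https://example.invalid/flists/ubuntu-24.04.flist",
)];

/// Hypothetical error type for this sketch.
#[derive(Debug, PartialEq)]
enum DeployError {
    InvalidInput(String),
}

/// Resolves a user-facing image string to the URL expected in the flist field.
fn resolve_image_reference(image: &str) -> Result<String, DeployError> {
    // URLs pass through untouched.
    if image.starts_with("https://") {
        return Ok(image.to_string());
    }
    // Known names are looked up in the map.
    if let Some((_, url)) = IMAGE_REFERENCE_MAP.iter().find(|(name, _)| *name == image) {
        return Ok(url.to_string());
    }
    // Anything else fails early with a message that lists the valid names.
    let known: Vec<&str> = IMAGE_REFERENCE_MAP.iter().map(|(name, _)| *name).collect();
    Err(DeployError::InvalidInput(format!(
        "unknown image {image:?}; use an https:// flist URL or one of: {}",
        known.join(", ")
    )))
}
```

Failing in the daemon with a readable message addresses the failure mode described above, where an unresolved name reached ZOS and came back as state=Error with an empty error string.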

Decisions and meeting source: hero_os_tfgrid_deployer#1 (despiegk's Main Story / minutes — authoritative, not edited from here).

1. Foundation (where we are now)

2026-05-25 update (post-s157d) — DEPLOY_VM WORKS. Remaining path to end-user self-serve flow

Where we are: deployer.provision_vm (the operator-facing API that mints a Forge user + Forge token + reads the user-uploaded SSH key + calls hero_compute.deploy_vm) now produces a state=running VM on a rented dedicated mainnet node. The full Track D D1-D5 ladder is end-to-end live for the first time since 2026-05-23.

End-user-journey checklist (what the "user clicks public link → ... → uses hero AI stack" journey requires, mapped to remaining sessions):

| User-journey step | Where it stands | Owning next session |
| --- | --- | --- |
| Click public URL → lands on admin cockpit | No public URL yet; admin cockpit runs locally on operator workstation | s158 - Admin-on-TFGrid + deploy_webgateway |
| Logs in via Forge OAuth | Code shipped (Track A s133-s139 cockpit + D-22 BYO token landing); needs live-test on the admin VM | Walk verified in s160 |
| Uploads SSH public key to Forge themselves | D-23 alt-2 designed + tested locally at s142; user uses Forge's own UI | Walk verified in s160 |
| Pastes Forge personal token into admin form | cockpit/USER_FORGE_TOKEN slot (D-16); admin form exists | Walk verified in s160 |
| Updates password (optional) | deployer.regenerate_password (D-24, s143) | Walk verified in s160 |
| Admin clicks Provision → user gets a VM | Unblocked at s157d. deployer.provision_vm → ComputeService.deploy_vm → state: running. Hero stack (35-set) installs via setup-binaries.sh (Track E, s151). | (already works) |
| User SSHes into THEIR VM | ⚠️ hero_compute.wait_until_running returns before mycelium_ip is populated in the workload result; daemon's get_vm returns empty mycelium_ip (hero_compute#121). Easy fix now that TFGRID_DEBUG=1 visibility exists. | s157e - hero_compute#121 fix + SSH verify |
| Accesses hero cockpit on their VM (via https://<their-domain>/) | Cockpit runs in the VM's stack; needs hero_proxy + TFGrid Web Gateway hookup per D-28 | s160 walk |
| Hero AI stack content (hero_books, hero_slides, etc.) loaded with default corpora | 🟡 Track C C1+C2 partial: +104 LOC parked on hero_books local branch s153_default_libraries since s153 abort | s159 - hero_books default-load wire |
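The hero_compute#121 gap in the checklist (the VM reports running before mycelium_ip is populated) is the kind of race a bounded poll loop closes. A minimal sketch, assuming the fix is "keep re-reading the workload result until the address is non-empty"; `wait_for_mycelium_ip` and the `fetch_ip` closure are hypothetical names, and the real daemon call is not shown.

```rust
use std::thread::sleep;
use std::time::Duration;

/// Polls `fetch_ip` until it yields a non-empty address or attempts run out.
/// `fetch_ip` is a hypothetical stand-in for reading mycelium_ip out of the
/// workload result.
fn wait_for_mycelium_ip(
    mut fetch_ip: impl FnMut() -> Option<String>,
    max_attempts: u32,
    delay: Duration,
) -> Result<String, String> {
    for attempt in 1..=max_attempts {
        match fetch_ip() {
            // A populated address ends the wait immediately.
            Some(ip) if !ip.is_empty() => return Ok(ip),
            // Missing or empty: sleep unless this was the last attempt.
            _ => {
                if attempt < max_attempts {
                    sleep(delay);
                }
            }
        }
    }
    Err(format!("mycelium_ip still empty after {max_attempts} attempts"))
}
```

Bounding the loop with `max_attempts` turns a stuck workload into an explicit error instead of a hang or a silently empty field.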

Remaining sessions (estimated 10-15 hours focused work to home#235 closure):

| Session | Focus | Est | Dependencies |
| --- | --- | --- | --- |
| s157e | hero_compute#121 fix (post-deploy poll loop populates mycelium_ip from workload result.data) + SSH-verify a fresh deploy end-to-end with the throwaway probe key | 1-2h | Local daemon work only; new rent ~$1-2 for the SSH verify |
| s158 | Admin-on-TFGrid: rent dedicated node, deploy admin VM via deployer.provision_vm, install Hero stack via setup-binaries.sh, configure hero_proxy + deploy_webgateway per D-28, surface public URL | 3-4h | Depends on s157e (need SSH to debug the admin VM if anything misbehaves) |
| s159 | Track C C1+C2: rebase s153_default_libraries (hero_books +104 LOC default-load wire) on clean baseline; squash on hero_books development; redeploy admin VM's hero_books so the public URL serves the 4 default content repos out of the box | 2-3h | Independent of s158 (can run in either order, but admin URL live makes the verify easier) |
| s160 | Full user-journey live walk on the public admin URL: mint a test Forge user, walk SSH key upload + token paste + password reset + provision their VM + login to their cockpit + load content in hero_books. Surface any gaps as Forge issues for s161. | 3-4h | Depends on s158 + s159 |
| s161 | home#235 closure: PATCH this issue body with final outcome, post closure comment with the full s158-s160 evidence, flip state to closed. File the one remaining post-arc follow-up (Track F multi-VM scaling) as a separate hero_os_tfgrid_deployer issue. | 30 min | Depends on s160 walk landing green |

For anyone picking up this arc: start at prompt.md §3 (rewritten at each /stop). Sessions/157d.yml has the full s157d trace including the TFGRID_DEBUG discovery + fix shape + multi-tenant proof. The feedback_squash_merge_gate + feedback_d10_t2_squash_to_development_no_pr + feedback_signoff_no_email + feedback_authorship discipline rules apply throughout.

2026-05-23 update (mid-session pivot) — demo VM bumped to 16 GB; Track C C3 deferred to post-arc follow-up; arc compresses to 9 sessions

  • Demo VM target RAM goes from 8 GB to 16 GB. Shipping this arc ahead of squeezing the embedder is the right priority for a v1 demo. A single Grid Proxy lookup (free_mru=17179869184 against farm_ids=1) confirms FreeFarm has nodes with that headroom. The ram_size change is a parameter on deployer.provision_vm, not a code change in the deployer or hero_compute. Surfaces at the User POV walkthrough and at the multi-user session.
  • Track C C3 (smaller embedder model) is now a post-arc follow-up: hero_embedder#42. Issue is fully spec'd (model registry pointer, env read site, load-gating logic, smoke pattern, acceptance criteria). Not urgent; address after this arc closes. The 8 GB-affordability story still matters once users run on cheaper VMs, but it does not gate any home#235 acceptance row, and on a 16 GB demo VM the current four-variant embedder load is absorbed comfortably.
  • The 10-session arc compresses to 9. New shape: s152 pulls Track B B1+B3 forward (was s153, hero_proxy config templates + TLS strategy decision), s153 pulls Track C C1+C2 forward (was s154, public content repos + hero_books default-load), s154 is User POV walkthrough on the live 16 GB mainnet VM (was s155, still gated on hero_compute#119), s155 is Track F F1 multi-user (was s156), s156 is Track F F2+F3 plus this arc's closure (was s157). One full session of risk saved by not chasing the embedder shrink inside the arc.
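The free_mru figures quoted in this tracker are plain GiB-to-bytes conversions (16 GiB = 17179869184, 8 GiB = 8589934592). A small sketch of that arithmetic, together with a query string in the shape the post-s148 update quotes for the Grid Proxy lookup; the function names are illustrative, and the base URL is passed in rather than assumed.

```rust
/// Converts GiB to bytes, the unit the free_mru values are expressed in.
const fn gib_to_bytes(gib: u64) -> u64 {
    gib * 1024 * 1024 * 1024
}

/// Builds a node-lookup URL with the farm_ids, free_mru and status
/// parameters quoted in this tracker.
fn node_lookup_url(base: &str, farm_id: u32, min_free_gib: u64) -> String {
    format!(
        "{base}/nodes?farm_ids={farm_id}&free_mru={}&status=up",
        gib_to_bytes(min_free_gib)
    )
}
```

Calling `node_lookup_url("https://gridproxy.grid.tf", 1, 16)` yields the 16 GB variant of the lookup; only the free_mru value changes when the demo VM size changes.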

2026-05-23 update (post-s148) — self-host daemon up on TFGrid mainnet; D-26 minted; FreeFarm (farm_id=1) locked as the demo deploy target

  • hero_compute_zos daemon supervised on TFGrid mainnet. Squash 844676c on hero_compute development appends the canonical [[env]] PATH_ROOT/HERO_SOCKET_DIR/RUST_LOG block to my_compute_zos_server/service.toml, mirroring the s147 hero_router fix. lab build --release --install --workspace clean (8 of 8 built, 0 failed). lab service my_compute_zos_server --install --start brings the daemon up at PID 3102124, raw JSON-RPC over Unix socket at ~/hero/var/sockets/hero_compute_zos/rpc.sock. Mainnet wallet sourced from TF_VAR_mnemonic in ~/hero/cfg/env/env.sh (the same wallet that funded the s132 OpenTofu deploy); stored under core/TFGRID_MNEMONIC plus the existing core/TFGRID_NETWORK=main. The hero_proc supervisor injects core-context secrets into the daemon environment at spawn, so no service.toml from_secret indirection is needed.
  • Live mainnet round-trip confirmed. Direct UDS smoke: ComputeService.list_images returns the 5 official VM images; ComputeService.node_register queries TFChain mainnet Grid Proxy and returns a real ComputeNode record; ComputeService.node_status reads it back byte-identical from the local persistence layer. The sr25519 keypair derived from the mnemonic produces public key 58f481018853f18b403369537940d8e3a7bb61f36eafe8fff38fab281f230965 (the operator's TFChain identity).
  • D-26 minted locking the self-host architecture: decisions/D-26-self-host-hero-compute-mainnet.md. Workspace next-free advances to D-27. Devnet fallback path stays warm via TFGRID_MNEMONIC_DEVNET in env.sh.
  • D-26 §Demo target locks FreeFarm (farm_id=1) on TFChain mainnet as the canonical deploy target. FreeFarm is the ThreeFold-operated non-dedicated public-tenancy farm; any funded wallet may submit VM contracts there without farm-admin rights. The substrate-side onlytwinadmincandeploy check fires only on dedicated farms and is moot for our demo posture. The s132 OpenTofu deploy of herolab is prior-art that the operator's wallet already exercised the substrate contract-submission path successfully on mainnet under a different code wrapper. Owning underlying hardware (registering and operating our own farm) is a stronger sovereignty story but out of scope for the home#235 arc; public-tenancy on FreeFarm is the right level of effort for a demo.
  • Demo target re-pin queued for s149. s148 used node 2007 as a convenience target for the bring-up smoke; that was the wrong target (node 2007 belongs to herodemo.gent01, a separate twin's machine). s149 step 2 re-points core/TFGRID_NODE_IDS to a FreeFarm node via a single Grid Proxy lookup: GET https://gridproxy.grid.tf/nodes?farm_ids=1&free_mru=8589934592&status=up. No archaeology.
  • hero_compute#118 demoted at /stop with comment 36334. Mahmoud's external endpoint is no longer gating any session; can be added as a future second adapter when convenient.
  • Two deployer code edits queued for s149 (the original §3 "no source code changes" claim was incorrect): hero_os_tfgrid_deployer/.../compute.rs:30 has a hardcoded service-name path constant /hero_compute_mos/... that must become /hero_compute_zos/... or configurable; web.rs:206-229 parses HERO_COMPUTE_NODE_ADDR as a TCP host:port (correct shape, but the local value still needs to be decided to route through hero_router to the new self-hosted UDS).
  • 10-session arc updated: s148 closed; s149 head is now FreeFarm node re-pin + deployer rewire + first deploy_vm round-trip (provision → Mycelium-IPv6 SSH ping → delete). s150 hero_proc#121 fix and downstream sessions s151 through s157 unchanged.
  • Side action skipped this session: the B2 Forge OAuth client registration ops ask remains unfiled; deferred because Track B's proxy/OAuth scope is the better venue for it and it is not load-bearing for the home#235 critical path during s149.
  • Track B continues normally in its own lane. Per the 2026-05-23 single-agent rule clarification, the "single-agent for home#235" rule applies to home#235 work itself, not to Track B's hero_assistance v1.0 work. Both tracks run concurrently with zero file overlap.
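Since the deployer rewire above hinges on HERO_COMPUTE_NODE_ADDR being a TCP host:port value, here is a minimal sketch of what validating that shape involves. It is an illustrative parser written for this note, not the code in web.rs; `parse_host_port` is a hypothetical name.

```rust
/// Splits a `host:port` string into its parts.
fn parse_host_port(addr: &str) -> Result<(String, u16), String> {
    // rsplit_once splits on the last colon, so earlier colons stay in the host.
    let (host, port) = addr
        .rsplit_once(':')
        .ok_or_else(|| format!("expected host:port, got {addr:?}"))?;
    if host.is_empty() {
        return Err(format!("empty host in {addr:?}"));
    }
    // u16 parsing rejects non-numeric and out-of-range ports.
    let port: u16 = port
        .parse()
        .map_err(|e| format!("invalid port {port:?}: {e}"))?;
    Ok((host.to_string(), port))
}
```

A Unix socket path has no port component, so it fails this check; that is why routing to the self-hosted UDS needs a TCP-facing hop such as hero_router, as noted above.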

2026-05-23 update (post-s147) — self-host pivot + 10-session arc to closure locked

  • Self-host pivot on hero_compute. my_compute_zos_server is our repo, our code, our CI auto-publish. We host the instance ourselves using TF_VAR_mnemonic from ~/hero/cfg/env/env.sh (the same mainnet TFGrid wallet used by the s132 OpenTofu deploy; 12-word BIP39 verified populated). TFGRID_NETWORK=main is already set in hero_proc secret core context. Zero deployer code changes required: existing D4 implementation already calls ComputeService.deploy_vm against whichever endpoint HERO_COMPUTE_NODE_ADDR points at; we point it at our local UDS instead of a remote endpoint.
  • hero_compute#118 demoted from blocker to future second adapter. The operational ask filed at s145 (reachable hero_compute_mos_server endpoint) is no longer gating any session. A comment will be posted at s148 close noting that Mahmoud's endpoint can be added as a future second adapter when convenient; meanwhile we run on our own instance.
  • Single-agent execution for the home#235 arc. Track B / Agent 2 paused until home#235 closes. Parallel-agent coordination overhead (file-region claims, ID race rules, prompt-common.md handshake) exceeded its value for the demo-shippable push. All 10 sessions s148–s157 run on Track A solo.
  • hero_planner promoted to the default cockpit services profile (user requirement 2026-05-23). The repo is already in the D-07 demo service set (Tier B per memory/project_demo_service_set.md), already in hero_demo/deploy/single-vm/scripts/d07_set.txt, and already has .forgejo/workflows/lab-publish.yaml wired for CI auto-publish. What was missing is exposure in the default cockpit-services.toml profile alongside hero_books / hero_slides / hero_whiteboard / hero_call / hero_voice / hero_agent. Folded into s151 (Track E E1) scope.
  • 10-session arc to home#235 closure (locked at s148 /start): s148 self-host my_compute_zos_server on mainnet (mints D-26 for self-host architecture lock); s149 D5 live-smoke on mainnet (first real grid.tf VM via deployer.provision_vm); s150 hero_proc#121 fix (bulk service.status_all RPC + cockpit adoption, mints D-27); s151 Track E E1 setup-binaries manifest refactor + hero_planner in default profile; s152 Track C C3 smaller embedder model (MiniLM-L6, ~80 MB for 8 GB VM fit); s153 Track B B1+B3 hero_proxy config templates + TLS strategy decision; s154 Track C C1+C2 public content repos + hero_books default-load; s155 User POV walkthrough on the live mainnet VM (incl. hero_planner row walks); s156 Track F F1 multi-user end-to-end on mainnet; s157 Track F F2+F3 RAM-fit + multi-user isolation + this issue closure PATCH.
  • Side actions (file at s148 /stop): B2 Forge OAuth client registration ops ask (the only remaining external dependency for the proxy-OAuth gating path); comment on hero_compute#118 demoting Mahmoud's endpoint per above.
  • Deployer side — no code changes this planning session. D-25 (ON DELETE RESTRICT migration) remains the most recent Track D landing (s144 380b992). All Track D status unchanged; D5 live-smoke just had its blocker removed.

2026-05-23 update (post-s145) — methodology + arc-spec session: master-tracker E2E checklist artifact + Mahmoud ops ask + s142 follow-ups all filed

  • home/docs/channels/free/e2e_checklist.md (fee7f0c) — executable companion to the existing home/docs/channels/free-and-paid.md narrative. 71 rows across Admin POV / User POV / Cross-arc boundaries, FREEZONE / hero_assistance D-18 row format, all rows sourced from the meeting minutes + decisions + free-and-paid.md + a code-reading pass on hero_cockpit + hero_os_tfgrid_deployer. Status column is seed-pass only; human verification of every Have row is the s146 head.
  • hero_compute#118 filed — operational ask to Mahmoud for a reachable hero_compute_mos_server endpoint (host:port + node_sid). The only outstanding pre-req for the deployer's first live deploy_vm + get_vm + delete_vm round-trip. Gates s147.
  • hero_compute#116 comment posted — D-24 + D-25 ack to Mahmoud closing s142 follow-up #1. All three s142 follow-ups now closed-out as filed Forge issues (#1 above, #2 = hero_compute#117 typed-SDK gap, #3 = hero_cockpit#4 SSH-key onboarding polish).
  • Mid-session pivot — proposed "E1 = Forge group/repo per user" fallback head was caught as invented scope on re-check against the meeting notes (§8 + §9 ask for shared content + feedback repos, both already covered: lhumina_public/feedback exists, §8 Books backfill is queued separately). Dropped from prompt.md §3.
  • Deployer side — no code changes. D-25 (ON DELETE RESTRICT migration) remains the most recent Track D landing (s144 380b992). All Track D status unchanged.
  • Track B status unchanged from s2-018 — Phase D cleanup (s2-019) remains queued in hero_assistance/.
  • s146 queued — local-cockpit-install + verification pass on the new e2e_checklist.md. Effort tier medium. Output is updated Status columns + audit-log entry + follow-up issues for any clearly-needed feature gaps surfaced during the walkthrough.
  • s147 queued — Track D D5 live-smoke (provision + SSH ping + delete round-trip against a real hero_compute_mos_server). Gated on hero_compute#118 reply + core/FORGEJO_TOKEN + deployer/FORGE_TOKEN re-population.

2026-05-22 update (post-s143) — Track A s143 = Track D D2.1 lifecycle-symmetry polish + Phase B.5 FK-silently-OFF fix + D-24 mint

  • Track A s143 (Track D D2.1) — closed lifecycle-symmetry polish on hero_os_tfgrid_deployer. 3 new JSON-RPC methods: deployer.delete_user (refuse-if-vms per D-24 — caller must cascade via deployer.delete_vm first), deployer.delete_vm (compute-first then sqlite per D-24 — orphan-recoverability asymmetry), deployer.regenerate_password (single-use disclosure shape mirroring create_user.initial_password). Two squashes on development/main: hero_lib ce653c0a (+ForgeClient::delete_user_ssh_key + ForgeClient::update_user_password admin methods, +33 LOC); hero_os_tfgrid_deployer 3508cd1 (+3 RPC methods + sqlite migration scaffold + FK enforcement + 5 new db tests, 8 files +479/-25).
  • LOAD-BEARING Phase B.5 finding absorbed — adversarial review caught PRAGMA foreign_keys was silently OFF in db.rs (sqlite's default foreign_keys=OFF made the vms.user_id REFERENCES users(id) FK a no-op — DELETE FROM users would orphan vms rows with no error). Fixed as a one-line PRAGMA foreign_keys = ON in Db::open + Db::open_in_memory. Test fk_enforcement_blocks_delete_user_with_vms pins the constraint.
  • rusqlite_migration 2.5 scaffold keyed on PRAGMA user_version. The s143 initial migration is the current schema with CREATE TABLE IF NOT EXISTS, so pre-migration dev DBs bootstrap cleanly into user_version=1 without ALTER. Foundation for D-25+ schema bumps.
  • D-24 minted — locks (a) refuse-if-vms for delete_user, (b) compute-first then sqlite for delete_vm, (c) PRAGMA foreign_keys=ON as second-line guard, (d) accepted operational gap: lost vm_secret makes a VM unrecoverable from deployer side. Workspace D-NN advances to D-25 (reserved for the s144 ON DELETE RESTRICT migration head).
  • Tests + lab infocheck green: 21 deployer_server tests + 3 SDK tests pass; lab infocheck = 3/3 crates clean / 0 findings. Binary-symbol smoke: 8/8 RPC method names confirmed in release binary (deployer.create_user|get_user|list_users|delete_user|regenerate_password|provision_vm|list_vms|delete_vm).
  • End-to-end VM smoke still deferred per the carried operational gap (no TFGrid VM exists until Track F's F1). Live Forge admin round-trip also deferred — deployer/FORGE_TOKEN was rotated post-s141 and not re-populated this session.
  • Track B status unchanged from s2-016: Phase B _admin rebuild remains queued.
  • s144 queued = Track D D-25 ON DELETE RESTRICT migration (default head — first real use of the s143 migration scaffold; encodes D-24 at the schema layer). Alts: D5 live-smoke (gated on HERO_COMPUTE_NODE_ADDR) or E1 Forge group/repo per-user.
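The two D-24 ordering rules (refuse-if-vms for delete_user, compute-first then sqlite for delete_vm) can be modelled in a few lines. This sketch uses an in-memory `Store` in place of sqlite and a closure in place of the compute call; all type and field names are hypothetical and chosen only to show the ordering.

```rust
use std::collections::HashMap;

#[derive(Debug, PartialEq)]
enum DeleteError {
    UserHasVms { user_id: u32, vm_count: usize },
    Compute(String),
    NotFound,
}

/// In-memory stand-in for the deployer's sqlite state (illustrative only).
#[derive(Default)]
struct Store {
    users: HashMap<u32, String>, // user_id -> username
    vms: HashMap<u32, u32>,      // vm_id -> owning user_id
}

impl Store {
    /// Refuse-if-vms: the caller must delete the user's VMs first.
    fn delete_user(&mut self, user_id: u32) -> Result<(), DeleteError> {
        if !self.users.contains_key(&user_id) {
            return Err(DeleteError::NotFound);
        }
        let vm_count = self.vms.values().filter(|owner| **owner == user_id).count();
        if vm_count > 0 {
            return Err(DeleteError::UserHasVms { user_id, vm_count });
        }
        self.users.remove(&user_id);
        Ok(())
    }

    /// Compute-first: the local row is removed only after the compute side
    /// confirms deletion, so a failed call leaves a row that can be retried.
    fn delete_vm(
        &mut self,
        vm_id: u32,
        compute_delete: impl FnOnce(u32) -> Result<(), String>,
    ) -> Result<(), DeleteError> {
        if !self.vms.contains_key(&vm_id) {
            return Err(DeleteError::NotFound);
        }
        compute_delete(vm_id).map_err(DeleteError::Compute)?;
        self.vms.remove(&vm_id);
        Ok(())
    }
}
```

The ordering is what gives the orphan-recoverability asymmetry: if the compute call fails the row survives and the delete can be retried, whereas deleting the row first would leave a running VM with no local record.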

2026-05-22 update — workspace housekeeping + Track B re-activation (s2-016) under hero_assistance-alignment scope

  • Workspace doc compaction (compaction-2026-05-22): CLAUDE.md + prompt.md + prompt2.md + prompt-common.md compacted 445→53 KB (−88%). Pre-compaction snapshot at archive/2026-05-22-compaction/. pipeline-config.yaml tracking_issue updated from hero_demo#52 to home#235 to match this arc as the live tracker. CLAUDE.md now leads with home#235 as headline framing. Manifest: sessions/compaction-2026-05-22.yml. No arc code touched, no D-NN/L-NN minted.
  • Track B s2-016 — re-activation + hero_assistance work — Track B re-activated under new scope = multi-phase alignment of lhumina_code/hero_assistance/ with the canonical Hero service template per hero_assistance#15. Pre-archive scope (hero_onboarding v0 spec) preserved as historical; reactivates on the same Track-D-/api/deploy-vm trigger if needed. Three squashes on hero_assistance/development: f81aecc (prior session's #14 squash-merge), c059c1a (Wall 1 rusqlite u64→i64 + Wall 2 reqwest rustls-tls swap), 5330a0f (Phase A drop 5 Dioxus crates + D-26 minted hero_assistance-repo-local retiring D-09/D-17/D-22/D-25 atomically). 6 hero_assistance issue closures (#7/#9/#10/#11/#12/#14 + #13 auto-closed). New meta-issue hero_assistance#15 opens the multi-phase alignment arc (Phases A through E). L-08 (workspace) retro-closed. CI green; releases/tag/latest = 4 musl binaries + 4 md5 sidecars. Workspace lab infocheck 4 clean / 0 findings (was 4/4/20). Procedural skip flagged: worked in shared lhumina_code/hero_assistance/ checkout NOT the worktree-isolated ../hero_assistance-track-agent-2/ (future Track B sessions MUST use the worktree per CLAUDE.md "Cross-track coordination").
  • Track A status unchanged at s2-016 close — Track A's s142 = Track D D3 SSH key lifecycle remains queued and uncommenced (Track A did not run this day). Default head per prompt.md §3; alts D4 first-hero_compute-call or D2.1 D2 polish; pick at /start. The two tracks can run concurrently going forward.

2026-05-21 update (post-s139) — pivot: hero_os_tfgrid_deployer IS the deployment path

  • Track A closed. All 7 hero_cockpit#1 spec items shipped across s133-s139 (s139 = f880247). See hero_cockpit#1 for the closed-as-shipped checklist.
  • herolab.gent02.grid.tf retired 2026-05-21. Destroyed via make destroy ENV=herolab. 5 OpenTofu resources released (grid_deployment, grid_name_proxy, grid_network, 2 random_bytes). Gateway FQDN + mycelium IPv6 released. The s132 manual-deploy proof is done; we don't deploy that way again.
  • hero_os_tfgrid_deployer is now the canonical deployment path for every Hero OS VM, both free-demo and paid-arc-pool. hero_demo make deploy is no longer used for VM provisioning. The deployer's per-user cockpit-services.toml manifest drives the setup-binaries dispatch, the hero_proxy install, the OAuth wiring, and the webgateway binding — all as parts of the deployer's standard post-deploy flow, not as standalone-VM concerns.
  • Track D becomes critical-path. Reordered ahead of Tracks B/C in §2 below. Workspace scaffold landed on hero_os_tfgrid_deployer on 2026-05-20 (ab061f5b → 76919265: 4-crate workspace + JSON-RPC /rpc + /openrpc.json + /health). Agent 1 picks up at D2 (Forge user lifecycle) onwards at s140.
  • Accepted operational gap: ~5 sessions where no TFGrid VM exists for end-to-end smoke. Track B/C/cockpit-followup work continues as local code work (config templates, content repos, model wiring) in parallel. No fallback hero_demo deploy "just for testing in the meantime" — the gap is the honest price of committing to deployer-as-path.

Original 2026-05-20 foundation status (preserved for historical context)

What was working at session 132 (2026-05-20) — the proof-of-concept that established the build/install mechanic now embedded inside the deployer's post-deploy flow; the standalone hero_demo make deploy path is retired:

  • Build pipeline. lab (in hero_skills) builds the D-07 35-set. Workstation + VM-side native builds pass. mycelium is the 35th and is excluded on TFGrid since it ships natively via zinit.
  • CI auto-publish. 31/31 wired D-07 repos run .forgejo/workflows/lab-publish.yaml on every push to development. Each repo refreshes its releases/tag/latest with linux-musl-x86_64 (CLI) + linux-x86_64-gnu (daemons with ONNX) artefacts. See hero_skills#268 (rollout) + hero_skills#269 (per-repo cleanup catalogue, closed).
  • VM-side consumer install. lab build $repo --download --install on a fresh Ubuntu 24.04 TFGrid VM with no Rust toolchain installs all 34 (mycelium skipped) end-to-end, including the 3 ONNX libraries to ~/hero/lib/. This mechanic now lives inside the deployer's post-deploy flow (D4).
  • TFGrid deployment (RETIRED). Was: make deploy ENV=herolab (in hero_demo) provisions one VM via OpenTofu in ~60 s. make setup-binaries runs the lab consumer-side install loop. Now: same OpenTofu provider is available to the deployer's D3 adapter as a secondary path (the primary path is hero_compute via Mahmoud's API once free-form sizing lands).

Known open followups on the foundation:

  • TFGrid public gateway returns HTTP 502 on hero_router alone. hero_router binds to loopback by default; the public gateway hits a closed port. Resolution path: add --bind 0.0.0.0 + put hero_proxy in front of it (Track B below — scoped as deployer-integrated config rather than a standalone install). See hero_router#74.
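The 502 above comes down to which interfaces a listener is bound to. A short sketch using only the standard library's address classification; `BindScope` and `bind_scope` are names invented for this illustration, and the ports in any example call are arbitrary.

```rust
use std::net::SocketAddr;

/// Which peers a listener bound to a given address can accept.
#[derive(Debug, PartialEq)]
enum BindScope {
    /// 127.0.0.1 or ::1: only processes on the same host can connect.
    LoopbackOnly,
    /// 0.0.0.0 or ::: connections are accepted on every interface.
    AllInterfaces,
    /// A specific non-loopback address: only that interface accepts.
    SingleInterface,
}

/// Classifies a bind address by inspecting its IP part.
fn bind_scope(addr: SocketAddr) -> BindScope {
    let ip = addr.ip();
    if ip.is_loopback() {
        BindScope::LoopbackOnly
    } else if ip.is_unspecified() {
        BindScope::AllInterfaces
    } else {
        BindScope::SingleInterface
    }
}
```

A gateway forwarding from another machine can only reach the second and third cases, which matches the resolution path of a non-loopback bind with hero_proxy in front.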

2. Roadmap — 6 tracks, ~17 sessions remaining (was ~24-26 pre-pivot)

Each track has a slot in the prompt.md §3 session map. Sessions continue from s140.

Order (post-2026-05-21 pivot): Track A closed at s139. Track D becomes critical-path and runs s140-s14X. Tracks B/C run as local code work in parallel with Track D, then merge into Track D's standard per-user manifest. Track E feeds into Track D's post-deploy flow. Track F validates end-to-end after D + E ship.

Track A — hero_cockpit (end-user UI on the VM) — CLOSED s133-s139

Spec: hero_cockpit#1. Scaffolded from hero_template.

| Session | Focus |
| --- | --- |
| s133 | A1 - scaffold from hero_template: 5 crates (cli + server + sdk + admin + web), service.toml × 4 daemons, /health + /.well-known endpoints, cargo check + lab infocheck clean. |
| s134 | A2 - Services page + cockpit_server RPCs: list/start/stop/restart/enable/disable_service. |
| s135 | A3 - Settings page + cockpit_server RPCs: get/set/test_byok_key, system_info. |
| s136 | A4 - Feedback iframe (→ lhumina_public/feedback) + Manual pages. |
| s137 | A5 - Per-user manifest (~/hero/cfg/cockpit/services.toml) read/write + profile switching. |
| s138 | A6 - Upgrade button + cockpit.upgrade flow + SSE job log streaming via lab update. |
| s139 | A7 - Dynamic URL mapping (cockpit.expose_service / unexpose_service) via hero_proxy domain.add admin API. |

Track D — hero_os_tfgrid_deployer (admin tool) — CRITICAL-PATH, ~5 sessions

Umbrella: hero_os_tfgrid_deployer#2. Sub-issues: #3 D1 / #4 D2 / #5 D3 / #6 D4 / #7 D5 / #8 D6. Workspace scaffold landed on 2026-05-20 (ab061f5b → 76919265): 4-crate workspace + JSON-RPC plumbing.

Goal: an admin tool that, given a Forge username, autonomously provisions a Hero OS demo VM end-to-end — Forge account lifecycle + SSH key gen + hero_compute deploy_vm + setup-binaries dispatch (hero_proxy + cockpit + Track C content all included) + deploy_webgateway + Forge OAuth wiring. No human-in-the-loop after the form submit.

| Session | Focus |
|---|---|
| s140 | Track D start. Read the existing scaffold (D1) for context, then begin D2 = Forge user lifecycle (REST client + create-or-check + token-gen + dedicated SSH key gen). |
| s141 | D3: VM-deploy adapter trait. OpenTofu as primary adapter (matches what s132 proved); hero_compute as secondary under a config flag (limited until Mahmoud closes free-form-sizing in hero_compute#116). |
| s142 | D4: post-deploy flow (scp + setup-binaries dispatch + verify + hero_proxy install + OAuth wiring). Depends on Track E manifest shape (run in parallel with E1 at s142b if needed). VM lifecycle = delete-and-redeploy on TFGrid (start/stop/restart_vm not supported, see §3). |
| s143 | D5: admin UI (Axum + Askama + Bootstrap, building on the existing scaffold): users list, deploy/destroy actions, per-user state view, gateway URL display, event log. |
| s144 | D6: wire hero_onboarding's POST /admin/pool-refresh integration so the deployer can feed VMs into the paid arc's pool (see §5.5 convergence point). Or defer to a hero_onboarding-side session if agent 2's s2-015 picks up first. |
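The D3 adapter seam and the delete-and-redeploy lifecycle from D4 can be pictured with a small sketch. It is written in Python for brevity and is not the deployer's Rust code: `VmRequest`, `VmDeployAdapter`, `InMemoryAdapter`, `select_adapter`, and `redeploy` are hypothetical names, and the in-memory adapter stands in for both the OpenTofu and hero_compute back ends.

```python
from dataclasses import dataclass
from typing import Dict, Protocol, Tuple


@dataclass(frozen=True)
class VmRequest:
    name: str
    ram_gb: int
    ssh_keys: Tuple[str, ...]  # keys are supplied at deploy time only


class VmDeployAdapter(Protocol):
    """Hypothetical shape of the D3 adapter trait."""

    def deploy(self, req: VmRequest) -> str: ...
    def destroy(self, vm_id: str) -> None: ...


class InMemoryAdapter:
    """Stand-in back end; a real adapter would drive OpenTofu or hero_compute."""

    def __init__(self, label: str) -> None:
        self.label = label
        self.vms: Dict[str, VmRequest] = {}
        self._counter = 0

    def deploy(self, req: VmRequest) -> str:
        self._counter += 1
        vm_id = f"{self.label}-{self._counter}"
        self.vms[vm_id] = req
        return vm_id

    def destroy(self, vm_id: str) -> None:
        del self.vms[vm_id]


def select_adapter(use_hero_compute: bool, primary, secondary):
    """Pick the secondary adapter only when the config flag is set."""
    return secondary if use_hero_compute else primary


def redeploy(adapter, vm_id: str, req: VmRequest) -> str:
    """There is no restart on TFGrid, so 'restart' is destroy followed by deploy."""
    adapter.destroy(vm_id)
    return adapter.deploy(req)
```

Keeping `redeploy` outside the adapter means each back end only has to implement the two operations the grid actually supports.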

Track B — hero_proxy config + OAuth + TLS — 3 sessions (parallel with Track D)

Reframed (2026-05-21 pivot): Track B is no longer "install hero_proxy on a standalone VM." It is now configuration + integration work that becomes part of the deployer's standard per-user manifest. All work is local code on hero_proxy repo + the deployer's manifest templates; no TFGrid VM required until Track D's D4 picks up the manifest at provision time.

| Session | Focus |
|---|---|
| s145 | B1: hero_proxy config templates for the standard demo VM shape (/ → cockpit admin.sock, /<service>/ → service admin sockets). Lands as a docs + service.toml + default-config PR on hero_proxy. |
| s146 | B2: Forge OAuth integration template. Register hero_proxy as a Forge OAuth client at forge.ourworld.tf (operational, needs Forge admin); define auth_mode=oauth + oauth_provider=forge.ourworld.tf + allowed_pubkeys=[<user_forge_id>] template that the deployer instantiates per-user at provision time. |
| s147 | B3: TLS strategy decision + docs. Either TFGrid name_proxy for TLS termination (simpler; Mahmoud's deploy_webgateway handles cert), or LE/certbot inside the VM (more control). Picks one; documents the choice; the deployer instantiates accordingly. |
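As a rough illustration of the B2 template, the sketch below instantiates the three settings named above for one user. The key names come from the B2 row; the function name `instantiate_auth_template`, the dict representation, and the JSON rendering are assumptions, since the on-disk format of the hero_proxy config is not specified here.

```python
import json

OAUTH_PROVIDER = "forge.ourworld.tf"


def instantiate_auth_template(user_forge_id: str) -> dict:
    """Fill the per-user auth template at provision time (hypothetical helper)."""
    if not user_forge_id.strip():
        raise ValueError("user_forge_id must be non-empty")
    return {
        "auth_mode": "oauth",
        "oauth_provider": OAUTH_PROVIDER,
        # Only the VM's owner is allowed through the proxy.
        "allowed_pubkeys": [user_forge_id],
    }


if __name__ == "__main__":
    # JSON is used purely for display; the real config format may differ.
    print(json.dumps(instantiate_auth_template("example_user"), indent=2))
```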

Track C — Public content + smaller models — 3-4 sessions (parallel with Track D)

| Session | Focus |
|---|---|
| s148 | C1: Create lhumina_public/docs_owh_public + (optionally) mycelium-public-docs equivalent. Populate with safe demo content. |
| s149 | C2: Wire hero_books to default-load these public repos on a fresh VM (config + manifest changes; no VM required to develop). |
| s150 | C3: Deferred to post-arc follow-up (hero_embedder#42). Demo VM bumped to 16 GB instead; the smaller embedder work moves to after this arc closes. |
| (s151?) | C4: Optional Kimi agent integration into hero_agent. Slot in when ready. |

Track E — setup-binaries.sh per-user manifest refactor — CLOSED s151

Lives in hero_demo; tracked at hero_os_tfgrid_deployer#8 (closed). Critical for Track D's D4 post-deploy flow.

| Session | Focus |
|---|---|
| s151 | E1: refactored hero_demo/deploy/single-vm/scripts/setup-binaries.sh to read per-user cockpit-services.toml manifest + always-on bootstrap-core (hero_proxy + hero_router + hero_proc + hero_cockpit) + small-embedder flag (EMBEDDER_MODEL_SIZE=small); falls back to d07_set.txt when no manifest present; new DRY_RUN=1 mode. Landed as hero_demo 20f03ba (+207/-38). Bonus: hero_cockpit 558e737 adds hero_planner ManualEntry + new manual page. |
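The selection logic described for E1 (manifest if present, fallback list otherwise, bootstrap-core always on, dry-run mode) can be sketched as follows. This is a Python restatement of the idea, not the shell script itself; `resolve_install_set` and `plan` are hypothetical names, and the manifest is assumed to have already been parsed into a list of service names.

```python
BOOTSTRAP_CORE = ("hero_proxy", "hero_router", "hero_proc", "hero_cockpit")


def resolve_install_set(manifest_services, fallback_services):
    """Return the ordered service list to install.

    manifest_services: names read from the per-user manifest, or None when
    no manifest exists (the d07_set.txt-style fallback is used instead).
    """
    chosen = manifest_services if manifest_services is not None else fallback_services
    ordered = list(BOOTSTRAP_CORE)  # bootstrap-core is always installed first
    for name in chosen:
        if name not in ordered:  # avoid installing a service twice
            ordered.append(name)
    return ordered


def plan(services, dry_run):
    """Describe the actions; in dry-run mode nothing would be executed."""
    verb = "would install" if dry_run else "install"
    return [f"{verb} {name}" for name in services]
```

An empty manifest (`[]`) is treated differently from a missing one (`None`): the former yields bootstrap-core only, the latter falls back to the default set.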

Track F — Integration + validation — 2-3 sessions

| Session | Focus |
|---|---|
| s152 | F1: end-to-end. Deployer creates fresh Forge user → fresh TFGrid VM (via D3 adapter, OpenTofu primary) → setup-binaries runs → cockpit + hero_proxy + Forge OAuth + TLS → user logs in via Forge → manages services → posts feedback. First real end-to-end smoke since 2026-05-21 retirement. |
| s153 | F2: 8 GB fit validation under the full enabled-by-default set. RSS measurement per service. |
| s154 | F3: multi-user test (provision 2-3 VMs from the deployer, verify isolation). Hand-off to wider team. |
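For the F2 fit check, one plausible way to measure per-service RSS on Linux is to read the `VmRSS` line from `/proc/<pid>/status`. The sketch below parses that line and sums the results against a RAM budget; `parse_vm_rss_kb` and `fits_budget` are hypothetical helper names, and how PIDs are mapped to services is left out.

```python
import re


def parse_vm_rss_kb(status_text: str):
    """Extract VmRSS in kB from the text of /proc/<pid>/status, or None if absent."""
    match = re.search(r"^VmRSS:\s+(\d+)\s+kB$", status_text, re.MULTILINE)
    return int(match.group(1)) if match else None


def fits_budget(rss_kb_by_service: dict, budget_gb: int) -> bool:
    """True when the summed RSS of all services is within the RAM budget."""
    total_kb = sum(rss_kb_by_service.values())
    return total_kb <= budget_gb * 1024 * 1024
```

Summing RSS counts shared pages once per process, so the total is an upper bound; a pass on this check is conservative.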

3. Dependency map across repos

| Dep | Owner | Status | Notes |
|---|---|---|---|
| hero_compute lifecycle API | mahmoud | mainnet-ready on TFGrid for the slice model, confirmation at #116#35305 | See caveats below |
| hero_template (scaffold base) | despiegk | available | Has 4 crates (server/admin/web/sdk); cockpit adds CLI as 5th |
| hero_proxy (OAuth + URL mapping) | team / despiegk | exists, needs herolab config | Track B |
| Forge OAuth client registration | team (forge admin) | not done | Blocks B2 only |
| hero_voice end-to-end | scott | in progress | Slots into Track A via cockpit settings flag, not on critical path |
| hero_web_template | timur | different shape (Mycelium dashboard) | NOT the cockpit base; cockpit uses hero_template |

Mahmoud's hero_compute caveats (load-bearing for Tracks B / D / F)

From hero_compute#116#35305:

TFGrid lifecycle surface (what works): deploy_vm, delete_vm, list_vms / get_vm, vm_logs, node_register / node_status / node_unregister, set_tfgrid_node_ids, list_slices / get_slice, node_stats, list_images, get_deployment_logs / list_deployments, get_ssh_keys / set_ssh_keys (per-secret store, not push-into-VM), list_jobs / job_logs, deploy_webgateway / list_webgateways / get_webgateway / delete_webgateway / list_gateway_nodes.

TFGrid stubs (error): start_vm, stop_vm, restart_vm, inject_ssh_keys, vm_exec, vm_stats, attach_hypervisor, migrate_secret.

Constraints that shape the deployer + cockpit:

  1. Slice-based sizing. 1 slice = 4 GB RAM + 1 vCPU + fixed disk-per-slice (~138 GB/slice on node 1774). CPU and RAM coupled 1:4. No publicip, no rootfs, no independent disk parameter. Our 8 GB demo VM ⇒ 2 slices ⇒ 2 vCPU (not 16 — the 16 we saw in s132 was the OpenTofu path, which sets a different shape).
  2. SSH keys at deploy-time only. Pass ssh_keys=[…] to deploy_vm; no inject-after-create. Affects D2 (deployer's Forge user lifecycle): generate the SSH key + pass at deploy_vm time.
  3. VM lifecycle = delete-and-redeploy. No start/stop/restart. Cockpit's per-service start/stop/restart is unaffected (those are hero_proc service calls on services running inside the VM). The deployer's admin UI shows a "destroy + redeploy" action, not three separate buttons.
  4. No vm_exec / vm_stats. Setup-binaries dispatch uses SSH (already true post-s132). Cockpit's system_info reads RAM/disk from the VM's own /proc/meminfo + df + sysinfo crate, not via hero_compute.
  5. Async deploy. deploy_vm returns immediately with state="provisioning"; poll get_vm until state="running" AND mycelium_ip set. Same async pattern for delete_vm (→ deleting → record disappears) and deploy_webgateway.
  6. No metadata field on Vm. Encode user / profile into VM name OR keep the join in the deployer's sqlite (we'd use the deployer's sqlite — schema already has the foreign key).
  7. Auth: UDS local-only. ComputeService listens on a Unix domain socket; per-call auth is the secret parameter (sr25519-signed token from node's TFGRID_MNEMONIC or raw ownership token). A remote deployer reaches it via hero_router (TCP entry point + context/claim auth) or an SSH tunnel — no built-in network auth.
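The slice arithmetic in constraint 1 can be checked with a few lines. The constants come from that item (1 slice = 4 GB RAM + 1 vCPU); rounding up for RAM sizes that are not a multiple of 4 GB is an assumption, and the helper names are hypothetical.

```python
import math

SLICE_RAM_GB = 4
SLICE_VCPU = 1


def slices_for_ram(ram_gb: int) -> int:
    """Number of slices needed to cover the requested RAM (rounded up)."""
    if ram_gb <= 0:
        raise ValueError("ram_gb must be positive")
    return math.ceil(ram_gb / SLICE_RAM_GB)


def vm_shape(ram_gb: int) -> dict:
    """Resulting VM shape; CPU follows RAM because the two are coupled per slice."""
    n = slices_for_ram(ram_gb)
    return {"slices": n, "ram_gb": n * SLICE_RAM_GB, "vcpu": n * SLICE_VCPU}
```

With these constants an 8 GB request maps to 2 slices and 2 vCPU, matching the figure quoted in constraint 1, and a 16 GB request maps to 4 slices and 4 vCPU.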

Cross-track follow-ups Mahmoud offered to file: (a) metadata: map<str,str> on Vm spec, (b) free-form sizing + publicip, (c) remote-auth model. We will track each as they get filed.
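The async-deploy pattern from constraint 5 (poll `get_vm` until `state="running"` and `mycelium_ip` is set) can be sketched as below. `get_vm` is passed in as a plain callable returning a dict, which is an assumption about the client shape; the timeout, interval, and function name `wait_until_ready` are illustrative, not values taken from hero_compute.

```python
import time


def wait_until_ready(get_vm, vm_id, timeout_s=300.0, interval_s=5.0,
                     sleep=time.sleep, clock=time.monotonic):
    """Poll until the VM is running AND has a mycelium IP, or time out.

    Checking both conditions matters: a VM can report state="running"
    before its mycelium_ip has been populated.
    """
    deadline = clock() + timeout_s
    while True:
        vm = get_vm(vm_id)
        if vm.get("state") == "running" and vm.get("mycelium_ip"):
            return vm
        if clock() >= deadline:
            raise TimeoutError(f"VM {vm_id} not ready after {timeout_s}s")
        sleep(interval_s)
```

The `sleep` and `clock` parameters exist only so the loop can be exercised in tests without real waiting.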

4. Out of scope (initial demo)

  • Billing / payment / Stripe — deferred per meeting.
  • Self-service onboarding — admin-driven for now.
  • Multi-user-per-VM — one demo user owns one VM.
  • Office / OnlyOffice — too heavy for 8 GB.
  • Team-paid AI inference — BYO keys only.

5.5 — Paid-tier overlay (hero_onboarding, Track B)

The demo-deployer arc above (Tracks A-F) ships a company-paid free demo of Hero OS. A parallel arc — hero_onboarding (separate scope, tracked at hero_onboarding#1) — adds the paid commercial overlay on top of the same substrate. Same deployer, same cockpit, same proxy; differs only in front-gate behavior.

Status (post-2026-05-21, s2-009): Phases 1-7 shipped on lhumina_code/hero_onboarding/development — mycelium proof-of-control login (D-12), Stripe + ClickPesa payments (D-13), per-node billing pipeline (D-14), Idenfy KYC at /payment/intent gate (D-15), VM allocation via PoolAssignmentProvisioner (D-17, race-renamed from D-16). 4 crates, 47 unit tests, 4 white-box smokes, all green. Acceptance: cargo + lab + smoke matrix per hero_onboarding#2-#8.

Convergence points with this issue's tracks:

  • Track D (hero_os_tfgrid_deployer): the deployer becomes the VM source for hero_onboarding's pool. Either pushes via POST /admin/pool-refresh on hero_onboarding (the API hero_onboarding will land in s2-010 Phase 8), or hero_onboarding eventually adds a DeployerProvisioner impl that triggers allocate-on-demand (v1.5 behind the D-17 trait). Either flow ships in 1-2 hero_onboarding sessions once Track D's API is live.
  • Track A (hero_cockpit): hero_onboarding's /vm/allocate hands the user a URL pointing at cockpit on the assigned VM. Auth substrate: Forge OAuth via forge.ourworld.tf — aligns hero_onboarding, hero_cockpit, and the deployer on a single platform-wide user identity. hero_onboarding offers dual login (locked at s2-011 Phase 9 as D-18): Forge OAuth as the default low-friction signup, AND mycelium proof-of-control (D-12 from s2-003) as an alternative for sovereignty-minded users. User row carries both forge_id? and mycelium_address? slots; at least one populated per row, both linkable post-signup via /account/link-*. SSO to cockpit always uses the user's Forge ID — mycelium-only users link a Forge account when they first hit cockpit. The VmAllocation row captures the user's forge_id at allocate time so the deployer/cockpit can grant access. Fully reversible: ~30 LOC to flip back to mycelium-only (if the boss decides sovereignty-first is the only path); ~50 LOC removed to flip to Forge-only (if mycelium turns out unused). Dual-auth is the deliberately-least-committed stance.
  • Track B (hero_proxy): same proxy fronts both arcs (free demo at <vm>.<node>.grid.tf, hero_onboarding at e.g. onboarding.heroos.com). No special integration needed; both are HTTP backends to the proxy.
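The dual-login rule described under Track A above (a user row carries optional `forge_id` and `mycelium_address`, at least one must be set, and cockpit SSO always needs the Forge ID) can be modelled in a few lines. This is an illustrative Python dataclass, not hero_onboarding's schema; the class and method names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class UserRow:
    forge_id: Optional[str] = None
    mycelium_address: Optional[str] = None

    def __post_init__(self) -> None:
        # At least one identity must be populated per row.
        if not (self.forge_id or self.mycelium_address):
            raise ValueError("need forge_id or mycelium_address")

    def can_sso_to_cockpit(self) -> bool:
        """Cockpit SSO uses the Forge ID, so mycelium-only users must link first."""
        return bool(self.forge_id)

    def link_forge(self, forge_id: str) -> None:
        """Post-signup linking of a Forge account."""
        self.forge_id = forge_id
```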

What hero_onboarding deliberately does NOT do: anything VM-side (cockpit, system_info, services, BYO keys — that's all Track A). Anything HTTPS / TLS-termination (that's Track B). Anything direct-to-TFGrid (that's hero_compute, slot for v1 TfchainAutoDeployProvisioner per hero_compute#116).

What hero_onboarding WAS over-built for, vs the free demo: payment + KYC + per-node billing pipeline + credit-decrement at allocation time. These are commercial-flow concerns that the free demo has no need for. They're not wasted — they're the paid tier — but they're worth noting so this issue's readers know hero_onboarding's scope is intentionally wider than what the free demo needs.

Sequencing (hero_onboarding's next 5 sessions, all parallel to Track A's work, with no Track A blocking on Track B or vice versa):

| Session | Theme | Touches this issue? |
|---|---|---|
| s2-010 | Phase 8: Integration prep. cockpit_url field on PoolVm + VmAllocation; POST /admin/pool-refresh admin route (atomic in-memory pool swap, takes VM_POOL_JSON shape); stub release() cleanup-hook (logs "would call deployer-release here"); dashboard "Open cockpit →" link; forge_id? schema slot reserved for Phase 9. Makes hero_onboarding deployer-pluggable AHEAD of Track D shipping. | Coordination-only (this comment); no code overlap. |
| s2-011 | Phase 9: Forge OAuth login (default) + dual-auth model. Add Forge OAuth via forge.ourworld.tf as the default login on hero_onboarding (low friction, aligns with this issue's Forge-gated cockpit); keep mycelium proof-of-control (D-12) as the alternative for sovereignty-minded users; VmAllocation row captures forge_id at allocate time for the SSO bridge to cockpit. D-18 lock on dual-auth model with revisability annotations. | Light coordination: surface to Track A that cockpit should honor Forge OAuth sessions from forge.ourworld.tf for paid-tier-allocated VMs. |
| s2-012 | Phase 10: Production keys prep + operator runbook. Per-environment config matrix for Stripe + ClickPesa + Idenfy + Forge OAuth in docs/operator-runbook.md; configuration validator (hero_onboarding --check-prod-config fails-fast on missing prod env vars); NO live charge tests at this session (deferred to launch day). | No |
| s2-013 | Phase 11: Meta-issue Q-followups. Q#8 refund posture (opt-in env-flag-gated refund on release) + Q#9 multi-currency (per-currency balance display). Possibly D-19 if refund posture locks load-bearing decisions. | No |
| s2-014 | Phase 12: End-to-end cron rehearsal + auto-release on expires_at. New vm_auto_release_cron action; production-cadence rehearsal of producer (1h) + aggregator (5min) + auto-release (1h) for a full day in a dev environment. | No |
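The "atomic in-memory pool swap" planned for `POST /admin/pool-refresh` could look roughly like the sketch below: parse and validate the whole payload first, then replace the pool in one step so readers never observe a half-updated list. The `VmPool` class is hypothetical, and the payload is assumed to be a JSON array of VM objects; the real VM_POOL_JSON shape is not given here.

```python
import json
import threading


class VmPool:
    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._vms: tuple = ()

    def refresh(self, pool_json: str) -> int:
        """Replace the whole pool; on any parse error the old pool stays in place."""
        parsed = json.loads(pool_json)  # raises a ValueError subclass on bad JSON
        if not isinstance(parsed, list):
            raise ValueError("pool payload must be a JSON array")
        new_pool = tuple(parsed)  # immutable snapshot
        with self._lock:
            self._vms = new_pool  # single reference swap
        return len(new_pool)

    def snapshot(self) -> tuple:
        """Return the current pool; callers get a consistent view."""
        with self._lock:
            return self._vms
```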

After s2-014, hero_onboarding pauses until either (a) Mahmoud closes hero_compute#116 gaps → s2-015 = v1 TfchainAutoDeployProvisioner, or (b) Track D ships its deployer API → s2-015 = v1.5 DeployerProvisioner + real pool-feed integration. Whichever lands first.

D-NN race rule for cross-track decisions: per prompt-common.md, first-minted wins. Track A's D-16 (cockpit-byok-user-forge-token-namespacing) took D-16 on 2026-05-21 08:46 EDT, 40 minutes ahead of Track B's parallel D-16 mint (provisioner-trait-shape) — Track B re-numbered to D-17. Next free D-NN is D-19 (D-18 reserved for the s2-011 Phase 9 dual-auth model lock).
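The first-minted-wins rule amounts to assigning numbers in mint-time order while skipping reserved ones. The sketch below replays the D-16/D-17 case described above; `mint_decisions` is a hypothetical helper, and the second mint time is simply 08:46 plus the stated 40 minutes.

```python
def mint_decisions(requests, first_free, reserved=frozenset()):
    """Assign decision numbers in mint-time order.

    requests: iterable of (minted_at, slug) with zero-padded "HH:MM" times.
    Returns (slug -> number, next free number).
    """
    assigned = {}
    number = first_free
    for _minted_at, slug in sorted(requests):
        while number in reserved:  # reserved IDs are never handed out
            number += 1
        assigned[slug] = number
        number += 1
    while number in reserved:
        number += 1
    return assigned, number


if __name__ == "__main__":
    result = mint_decisions(
        [("09:26", "provisioner-trait-shape"),
         ("08:46", "cockpit-byok-user-forge-token-namespacing")],
        first_free=16,
        reserved={18},
    )
    print(result)
```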

6. Per-track status (updated each session close)

| Track | Stage | Last closed | Next | Session |
|---|---|---|---|---|
| A: Cockpit | CLOSED | A7 (s139 = f880247) | None (cosmetic + L134-A followups foldable into any future cockpit-light session) | |
| D: Deployer | 🔴 CRITICAL-PATH | D1 scaffold landed 2026-05-20 (ab061f5b → 76919265) | s140 = start D2 (Forge user lifecycle) | s140 |
| B: Proxy + OAuth (reframed) | not started; local-code-only until Track D D4 picks it up | | B1 config templates on hero_proxy repo + Forge OAuth registration | s145 |
| C: Public content + small model | not started; local-code-only | | C1 create lhumina_public/docs_owh_public | s148 |
| E: setup-binaries refactor | CLOSED | E1 (s151 = hero_demo 20f03ba) | None | |
| F: Integration + validation | not started; gated on D + B + C + E all landing | | F1 e2e | s152 |

Notes on the reorder:

  • Track A closed at s139 (Track A = 7 sessions s133-s139, all shipped).
  • Track D is now critical-path because every TFGrid VM going forward must go through the deployer (per the 2026-05-21 pivot in §1). No more hero_demo make deploy provisioning.
  • Track B reframed from "install on a standalone VM" to "config templates + Forge OAuth client registration + TLS strategy docs" — all local code work, no TFGrid VM required during the session. Track D's D4 instantiates the B-templates per VM at provision time.
  • Track C unchanged in scope but reordered to run parallel with Track D rather than between B and D.
  • Accepted operational gap: no TFGrid VM exists between 2026-05-21 (herolab retirement) and Track F's F1 (first deployer-provisioned VM). ~5 sessions worth. End-to-end smoke testing is paused during that window.
**NEW operational runbook landed**: [`docs/channels/free/admin-vm-deployment-runbook.md`](docs/channels/free/admin-vm-deployment-runbook.md) (commit `b352729`), a step-by-step recipe from rent -> provision -> setup-binaries -> `deploy_webgateway` -> tester handoff, with the 6 install/runtime workarounds discovered at s158 explicitly catalogued and linked to tracking issues.

**Demo-app scope clarified**: prior framing of s159 as just "hero_books default-load" was too narrow. The canonical `demo` profile per [hero_cockpit#1 §6](https://forge.ourworld.tf/lhumina_code/hero_cockpit/issues/1) enables **hero_books + hero_slides + hero_whiteboard + hero_voice + hero_agent + hero_planner + hero_collab** on top of bootstrap-core. hero_books default-load may already auto-fire via setup-binaries.sh `HERO_BOOKS_DEFAULT_REPOS` env wire (the s153 deferred scope); needs live-verify on the admin VM at s159 /start.

**7 new Forge follow-ups filed for Mahmoud window**:

- [hero_compute#127](https://forge.ourworld.tf/lhumina_code/hero_compute/issues/127): service.toml [[env]] for TFGRID_NETWORK
- [hero_proxy#55](https://forge.ourworld.tf/lhumina_code/hero_proxy/issues/55): IPv6 dual-stack seed bind (blocks public-URL reachability; manual workaround in runbook)
- [hero_cockpit#7](https://forge.ourworld.tf/lhumina_code/hero_cockpit/issues/7): landing-page relative URL bug
- [hero_cockpit#8](https://forge.ourworld.tf/lhumina_code/hero_cockpit/issues/8): dark/light mode inconsistent across pages
- [hero_demo#67](https://forge.ourworld.tf/lhumina_code/hero_demo/issues/67): setup-binaries.sh missing secret pre-population (includes bare-key-vs-context-prefixed slot ambiguity lesson)
- [hero_compute#128](https://forge.ourworld.tf/lhumina_code/hero_compute/issues/128): workload-name client-side validation
- [hero_skills#303](https://forge.ourworld.tf/lhumina_code/hero_skills/issues/303): lab build `--download --install` silently passes without installing binaries

**State at s158 close**: admin VM + walker child VM + rent contract 84920 + gateway sid `0067` ALL UP, intentionally left running through s159+s160 (zero TFT cost on QA). Twin 14199 mainnet treasury baseline 40 untouched.

**Realistic readiness**: 70% testable. A guided demo with verbal walkthrough works; self-service for a stranger needs s159 (landing-page fix + workaround sweep) + s160 (AIBROKER_DEMO_KEY staging for AI tier + BYO Forge token UI test).

**Remaining arc**: s159 (sweep ~3-4h) -> s160 (AI keys + BYO test ~3-4h) -> s161 (this issue closure, 30 min). Total ~6-8h.

**Previously (Track A s157e close, 2026-05-25):** **CI GREEN ON hero_compute.** s157e shipped `hero_compute@e845455` on `development` repairing 7 of 16 integration tests broken since `8be3294`: a 6-LOC `COMPUTE_TEST_FAKE_DEPLOY` test seam added to `operator_twin_id` in `crates/my_compute_zos_server/src/cloud/grid_driver.rs` (mirroring the existing seam in `deploy_on_tfgrid` at the same file) + 9 placeholder image-name updates `"img"` → `"Ubuntu 24.04"` in `crates/my_compute_zos_server/tests/integration.rs`. CI run 1299 ✅ green on `e845455` in 215s vs run 1297 = failure on `9857630`. Workspace fully synced at /start: every D-07 35-set repo `git pull origin development` (no `development_mik` branches outstanding); hero_compute pulled in 2 new Mahmoud commits (`2f07330` rent→reserve UI rename + `9857630` admin_mode toggle). Original s157e scope (mycelium_ip capture + SSH-verify) renamed to **s157f**. **Next: s157f** (mycelium fix + SSH verify, 1-2h, ~$1-2 TFT) → s158 → s159 → s160 → s161 closure. Same ~10-15h envelope.

**Previously (Track A s157d close, 2026-05-25):** **DEPLOY_VM FULLY UNBLOCKED.** Root cause of [hero_compute#125](https://forge.ourworld.tf/lhumina_code/hero_compute/issues/125): the daemon passed the user-facing `image` string (e.g. `"Ubuntu 24.04"`) straight through to the TFGrid SDK as the zmachine workload's `flist` field; ZOS expected a URL and silently rejected it with `state=Error` + empty `result.error`. Discovered via the SDK's undocumented `TFGRID_DEBUG=1` env var (gates `trace_step()` calls in `tfgrid_sdk_rust/src/grid_client/mod.rs:2361`), which surfaced per-workload state + the full workload JSON showing the literal name in the `flist` field.

**Fix shipped**: [hero_compute@1f59151](https://forge.ourworld.tf/lhumina_code/hero_compute/commit/1f59151) on `development` adds `IMAGE_REFERENCE_MAP` const + `resolve_image_reference()` helper called once at top of `deploy_vm` (pass-through for `https://` URLs, lookup for known names, friendly `InvalidInput` error otherwise).

**Live verify on rented dedicated node 3467** (Canada, farm 646 JimboTFT, RentContract 2095174 under twin 14199 ops): VM sid `0053` via URL + VM sid `0054` via name-resolved → both `state=running`, contracts 2095179/2095180 + 2095181/2095182 persisted on chain. **Multi-tenant pattern proven**: 2 distinct VMs on same rented node, distinct slices, distinct secrets. All cleaned at /stop: 4 VMs deleted, node unregistered, RentContract 2095174 cancelled (substrate-ack 20s). Twin 14199 active contracts = 0; treasury 6905 baseline 40 untouched.

**D-29 minted** ([D-29 file](decisions/D-29-deploy-vm-image-resolution-and-rentable-extrafee-gate.md)) locking (a) image-name-resolution in the daemon, (b) demo target = any rentable+extraFee>0+up dedicated node on TFGrid mainnet (substrate gate is `node.extraFee > 0`, NOT `node.rentable: True` alone; FreeFarm-specifically constraint REMOVED). #125 closed.

Track A continues solo. `prompt.md §3` projects from this issue.

**Decisions and meeting source:** [hero_os_tfgrid_deployer#1](https://forge.ourworld.tf/lhumina_code/hero_os_tfgrid_deployer/issues/1) (despiegk's Main Story / minutes; authoritative, not edited from here).

## 1. Foundation (where we are now)

### 2026-05-25 update (post-s157d): DEPLOY_VM WORKS. Remaining path to end-user self-serve flow

**Where we are**: `deployer.provision_vm` (the operator-facing API that mints a Forge user + Forge token + reads the user-uploaded SSH key + calls `hero_compute.deploy_vm`) now produces a `state=running` VM on a rented dedicated mainnet node. The full Track D D1-D5 ladder is end-to-end live for the first time since 2026-05-23.

**End-user-journey checklist** (what `user clicks public link → ... → uses hero AI stack` requires, mapped to remaining sessions):

| User-journey step | Where it stands | Owning next session |
|---|---|---|
| Click public URL → lands on admin cockpit | ❌ No public URL yet; admin cockpit runs locally on operator workstation | **s158: Admin-on-TFGrid + deploy_webgateway** |
| Logs in via Forge OAuth | ✅ Code shipped (Track A s133-s139 cockpit + D-22 BYO token landing); needs live-test on the admin VM | Walk verified in s160 |
| Uploads SSH public key to Forge themselves | ✅ D-23 alt-2 designed + tested locally at s142; user uses Forge's own UI | Walk verified in s160 |
| Pastes Forge personal token into admin form | ✅ `cockpit/USER_FORGE_TOKEN` slot (D-16); admin form exists | Walk verified in s160 |
| Updates password (optional) | ✅ `deployer.regenerate_password` (D-24, s143) | Walk verified in s160 |
| Admin clicks Provision → user gets a VM | ✅ **Unblocked at s157d.** `deployer.provision_vm` → `ComputeService.deploy_vm` → `state: running`. Hero stack (35-set) installs via `setup-binaries.sh` (Track E, s151). | (already works) |
| User SSHes into THEIR VM | ⚠️ `hero_compute.wait_until_running` returns before mycelium_ip is populated in the workload result; daemon's `get_vm` returns empty mycelium_ip ([hero_compute#121](https://forge.ourworld.tf/lhumina_code/hero_compute/issues/121)). Easy fix now that `TFGRID_DEBUG=1` visibility exists. | **s157e: hero_compute#121 fix + SSH verify** |
| Accesses hero cockpit on their VM (via `https://<their-domain>/`) | ✅ Cockpit runs in the VM's stack; needs hero_proxy + TFGrid Web Gateway hookup per D-28 | s160 walk |
| Hero AI stack content (hero_books, hero_slides, etc.) loaded with default corpora | 🟡 Track C C1+C2 partial: `+104 LOC` parked on `hero_books` local branch `s153_default_libraries` since s153 abort | **s159: hero_books default-load wire** |

**Remaining sessions (estimated 10-15 hours focused work to home#235 closure):**

| Session | Focus | Est | Dependencies |
|---|---|---|---|
| **s157e** | hero_compute#121 fix (post-deploy poll loop populates mycelium_ip from workload result.data) + SSH-verify a fresh deploy end-to-end with the throwaway probe key | 1-2h | Local daemon work only; new rent ~$1-2 for the SSH verify |
| **s158** | Admin-on-TFGrid: rent dedicated node, deploy admin VM via deployer.provision_vm, install Hero stack via setup-binaries.sh, configure hero_proxy + `deploy_webgateway` per D-28, surface public URL | 3-4h | Depends on s157e (need SSH to debug the admin VM if anything misbehaves) |
| **s159** | Track C C1+C2: rebase `s153_default_libraries` (hero_books +104 LOC default-load wire) on clean baseline; squash on hero_books development; redeploy admin VM's hero_books so the public URL serves the 4 default content repos out of the box | 2-3h | Independent of s158 (can run in either order, but admin URL live makes the verify easier) |
| **s160** | Full user-journey live walk on the public admin URL: mint a test Forge user, walk SSH key upload + token paste + password reset + provision their VM + login to their cockpit + load content in hero_books. Surface any gaps as Forge issues for s161. | 3-4h | Depends on s158 + s159 |
| **s161** | home#235 closure: PATCH this issue body with final outcome, post closure comment with the full s158-s160 evidence, flip state to closed. File the one remaining post-arc follow-up (Track F multi-VM scaling) as a separate hero_os_tfgrid_deployer issue. | 30 min | Depends on s160 walk landing green |

**For anyone picking up this arc**: start at `prompt.md §3` (rewritten at each /stop). Sessions/157d.yml has the full s157d trace including the TFGRID_DEBUG discovery + fix shape + multi-tenant proof. The `feedback_squash_merge_gate` + `feedback_d10_t2_squash_to_development_no_pr` + `feedback_signoff_no_email` + `feedback_authorship` discipline rules apply throughout.

### 2026-05-23 update (mid-session pivot): demo VM bumped to 16 GB; Track C C3 deferred to post-arc follow-up; arc compresses to 9 sessions

- **Demo VM target RAM goes from 8 GB to 16 GB.** Shipping this arc ahead of squeezing the embedder is the right priority for a v1 demo. A single Grid Proxy lookup (`free_mru=17179869184` against `farm_ids=1`) confirms FreeFarm has nodes with that headroom. The `ram_size` change is a parameter on `deployer.provision_vm`, not a code change in the deployer or hero_compute. Surfaces at the User POV walkthrough and at the multi-user session.
- **Track C C3 (smaller embedder model) is now a post-arc follow-up: [hero_embedder#42](https://forge.ourworld.tf/lhumina_code/hero_embedder/issues/42).** Issue is fully spec'd (model registry pointer, env read site, load-gating logic, smoke pattern, acceptance criteria). Not urgent; address after this arc closes. The 8 GB-affordability story still matters once users run on cheaper VMs, but it does not gate any home#235 acceptance row, and on a 16 GB demo VM the current four-variant embedder load is absorbed comfortably.
- **The 10-session arc compresses to 9.** New shape:
  - s152 pulls Track B B1+B3 forward (was s153, hero_proxy config templates + TLS strategy decision);
  - s153 pulls Track C C1+C2 forward (was s154, public content repos + hero_books default-load);
  - s154 is User POV walkthrough on the live 16 GB mainnet VM (was s155, still gated on [hero_compute#119](https://forge.ourworld.tf/lhumina_code/hero_compute/issues/119));
  - s155 is Track F F1 multi-user (was s156);
  - s156 is Track F F2+F3 plus this arc's closure (was s157).

  One full session of risk saved by not chasing the embedder shrink inside the arc.

### 2026-05-23 update (post-s148): self-host daemon up on TFGrid mainnet; D-26 minted; FreeFarm (farm_id=1) locked as the demo deploy target

- **`hero_compute_zos` daemon supervised on TFGrid mainnet.** Squash `844676c` on hero_compute development appends the canonical `[[env]] PATH_ROOT/HERO_SOCKET_DIR/RUST_LOG` block to `my_compute_zos_server/service.toml`, mirroring the s147 hero_router fix. `lab build --release --install --workspace` clean (8 of 8 built, 0 failed). `lab service my_compute_zos_server --install --start` brings the daemon up at PID 3102124, raw JSON-RPC over Unix socket at `~/hero/var/sockets/hero_compute_zos/rpc.sock`. Mainnet wallet sourced from `TF_VAR_mnemonic` in `~/hero/cfg/env/env.sh` (the same wallet that funded the s132 OpenTofu deploy); stored under `core/TFGRID_MNEMONIC` plus the existing `core/TFGRID_NETWORK=main`. The hero_proc supervisor injects core-context secrets into the daemon environment at spawn, so no service.toml `from_secret` indirection is needed.
- **Live mainnet round-trip confirmed.** Direct UDS smoke: `ComputeService.list_images` returns the 5 official VM images; `ComputeService.node_register` queries TFChain mainnet Grid Proxy and returns a real `ComputeNode` record; `ComputeService.node_status` reads it back byte-identical from the local persistence layer. The sr25519 keypair derived from the mnemonic produces public key `58f481018853f18b403369537940d8e3a7bb61f36eafe8fff38fab281f230965` (the operator's TFChain identity).
- **D-26 minted** locking the self-host architecture: `decisions/D-26-self-host-hero-compute-mainnet.md`. Workspace next-free advances to D-27. Devnet fallback path stays warm via `TFGRID_MNEMONIC_DEVNET` in env.sh.
- **D-26 §Demo target locks FreeFarm (farm_id=1) on TFChain mainnet as the canonical deploy target.** FreeFarm is the ThreeFold-operated non-dedicated public-tenancy farm; any funded wallet may submit VM contracts there without farm-admin rights. The substrate-side `onlytwinadmincandeploy` check fires only on dedicated farms and is moot for our demo posture. The s132 OpenTofu deploy of herolab is prior art: the operator's wallet already exercised the substrate contract-submission path successfully on mainnet under a different code wrapper. Owning underlying hardware (registering and operating our own farm) is a stronger sovereignty story but out of scope for the home#235 arc; public-tenancy on FreeFarm is the right level of effort for a demo.
- **Demo target re-pin queued for s149.** s148 used node 2007 as a convenience target for the bring-up smoke; that was the wrong target (node 2007 belongs to herodemo.gent01, a separate twin's machine). s149 step 2 re-points `core/TFGRID_NODE_IDS` to a FreeFarm node via a single Grid Proxy lookup: `GET https://gridproxy.grid.tf/nodes?farm_ids=1&free_mru=8589934592&status=up`. No archaeology.
- **[hero_compute#118](https://forge.ourworld.tf/lhumina_code/hero_compute/issues/118) demoted at /stop** with [comment 36334](https://forge.ourworld.tf/lhumina_code/hero_compute/issues/118#issuecomment-36334). Mahmoud's external endpoint is no longer gating any session; can be added as a future second adapter when convenient.
- **Two deployer code edits queued for s149 (the original §3 "no source code changes" claim was incorrect):** `hero_os_tfgrid_deployer/.../compute.rs:30` has a hardcoded service-name path constant `/hero_compute_mos/...` that must become `/hero_compute_zos/...` or configurable; `web.rs:206-229` parses `HERO_COMPUTE_NODE_ADDR` as a TCP host:port (correct shape, but the local value still needs to be decided to route through hero_router to the new self-hosted UDS).
- **10-session arc updated:** s148 ✅ closed; s149 head is now FreeFarm node re-pin + deployer rewire + first `deploy_vm` round-trip (provision → Mycelium-IPv6 SSH ping → delete). s150 hero_proc#121 fix and downstream sessions s151 through s157 unchanged.
- **Side action skipped this session**: the B2 Forge OAuth client registration ops ask remains unfiled; deferred because Track B's proxy/OAuth scope is the better venue for it and it is not load-bearing for the home#235 critical path during s149.
- **Track B continues normally in its own lane.** Per the 2026-05-23 single-agent rule clarification, the "single-agent for home#235" rule applies to home#235 work itself, not to Track B's hero_assistance v1.0 work. Both tracks run concurrently with zero file overlap.

### 2026-05-23 update (post-s147): self-host pivot + 10-session arc to closure locked

- **Self-host pivot on hero_compute.** [`my_compute_zos_server`](https://forge.ourworld.tf/lhumina_code/hero_compute/src/branch/development/crates/my_compute_zos_server) is our repo, our code, our CI auto-publish. We host the instance ourselves using `TF_VAR_mnemonic` from `~/hero/cfg/env/env.sh` (the same mainnet TFGrid wallet used by the s132 OpenTofu deploy; 12-word BIP39 verified populated). `TFGRID_NETWORK=main` is already set in `hero_proc secret` core context. Zero deployer code changes required: existing D4 implementation already calls `ComputeService.deploy_vm` against whichever endpoint `HERO_COMPUTE_NODE_ADDR` points at; we point it at our local UDS instead of a remote endpoint.
- **[hero_compute#118](https://forge.ourworld.tf/lhumina_code/hero_compute/issues/118) demoted from blocker to future second adapter.** The operational ask filed at s145 (reachable `hero_compute_mos_server` endpoint) is no longer gating any session. A comment will be posted at s148 close noting that Mahmoud's endpoint can be added as a future second adapter when convenient; meanwhile we run on our own instance.
- **Single-agent execution for the home#235 arc.** Track B / Agent 2 paused until home#235 closes. Parallel-agent coordination overhead (file-region claims, ID race rules, prompt-common.md handshake) exceeded its value for the demo-shippable push. All 10 sessions s148–s157 run on Track A solo.
- **`hero_planner` promoted to the default cockpit services profile** (user requirement 2026-05-23). The repo is already in the D-07 demo service set (Tier B per `memory/project_demo_service_set.md`), already in `hero_demo/deploy/single-vm/scripts/d07_set.txt`, and already has `.forgejo/workflows/lab-publish.yaml` wired for CI auto-publish. What was missing is exposure in the default `cockpit-services.toml` profile alongside `hero_books` / `hero_slides` / `hero_whiteboard` / `hero_call` / `hero_voice` / `hero_agent`. Folded into s151 (Track E E1) scope.
- **10-session arc to home#235 closure (locked at s148 /start):** s148 self-host `my_compute_zos_server` on mainnet (mints D-26 for self-host architecture lock); s149 D5 live-smoke on mainnet (first real grid.tf VM via `deployer.provision_vm`); s150 [hero_proc#121](https://forge.ourworld.tf/lhumina_code/hero_proc/issues/121) fix (bulk `service.status_all` RPC + cockpit adoption, mints D-27); s151 Track E E1 setup-binaries manifest refactor + hero_planner in default profile; s152 Track C C3 smaller embedder model (MiniLM-L6, ~80 MB for 8 GB VM fit); s153 Track B B1+B3 hero_proxy config templates + TLS strategy decision; s154 Track C C1+C2 public content repos + hero_books default-load; s155 User POV walkthrough on the live mainnet VM (incl. hero_planner row walks); s156 Track F F1 multi-user end-to-end on mainnet; s157 Track F F2+F3 RAM-fit + multi-user isolation + this issue closure PATCH.
- **Side actions** (file at s148 /stop): B2 Forge OAuth client registration ops ask (the only remaining external dependency for the proxy-OAuth gating path); comment on hero_compute#118 demoting Mahmoud's endpoint per above.
- **Deployer side** — no code changes this planning session. D-25 (`ON DELETE RESTRICT` migration) remains the most recent Track D landing (s144 `380b992`). All Track D status unchanged; D5 live-smoke just had its blocker removed.

### 2026-05-23 update (post-s145) — methodology + arc-spec session: master-tracker E2E checklist artifact + Mahmoud ops ask + s142 follow-ups all filed

- **`home/docs/channels/free/e2e_checklist.md`** ([`fee7f0c`](https://forge.ourworld.tf/lhumina_code/home/commit/fee7f0c)) — executable companion to the existing `home/docs/channels/free-and-paid.md` narrative. 71 rows across Admin POV / User POV / Cross-arc boundaries, FREEZONE / hero_assistance D-18 row format, all rows sourced from the meeting minutes + decisions + free-and-paid.md + a code-reading pass on hero_cockpit + hero_os_tfgrid_deployer. Status column is seed-pass only; human verification of every Have row is the s146 head.
- **[hero_compute#118](https://forge.ourworld.tf/lhumina_code/hero_compute/issues/118) filed** — operational ask to Mahmoud for a reachable `hero_compute_mos_server` endpoint (host:port + node_sid). The only outstanding pre-req for the deployer's first live `deploy_vm` + `get_vm` + `delete_vm` round-trip. Gates s147.
- **[hero_compute#116 comment](https://forge.ourworld.tf/lhumina_code/hero_compute/issues/116#issuecomment-36173) posted** — D-24 + D-25 ack to Mahmoud closing s142 follow-up #1. All three s142 follow-ups now closed-out as filed Forge issues (#1 above, #2 = [hero_compute#117](https://forge.ourworld.tf/lhumina_code/hero_compute/issues/117) typed-SDK gap, #3 = [hero_cockpit#4](https://forge.ourworld.tf/lhumina_code/hero_cockpit/issues/4) SSH-key onboarding polish).
- **Mid-session pivot** — proposed "E1 = Forge group/repo per user" fallback head was caught as invented scope on re-check against the meeting notes (§8 + §9 ask for *shared* content + feedback repos, both already covered: `lhumina_public/feedback` exists, §8 Books backfill is queued separately). Dropped from `prompt.md` §3.
- **Deployer side** — no code changes. D-25 (`ON DELETE RESTRICT` migration) remains the most recent Track D landing (s144 `380b992`). All Track D status unchanged.
- **Track B status unchanged from s2-018** — Phase D cleanup (s2-019) remains queued in `hero_assistance/`.
- **s146 queued** — local-cockpit-install + verification pass on the new `e2e_checklist.md`. Effort tier medium. Output is updated Status columns + audit-log entry + follow-up issues for any clearly-needed feature gaps surfaced during the walkthrough.
- **s147 queued** — Track D **D5 live-smoke** (provision + SSH ping + delete round-trip against a real `hero_compute_mos_server`). Gated on [hero_compute#118](https://forge.ourworld.tf/lhumina_code/hero_compute/issues/118) reply + `core/FORGEJO_TOKEN` + `deployer/FORGE_TOKEN` re-population.

### 2026-05-22 update (post-s143) — Track A s143 = Track D D2.1 lifecycle-symmetry polish + Phase B.5 FK-silently-OFF fix + D-24 mint

- **Track A s143 (Track D D2.1)** — closed lifecycle-symmetry polish on hero_os_tfgrid_deployer. 3 new JSON-RPC methods: `deployer.delete_user` (refuse-if-vms per D-24 — caller must cascade via `deployer.delete_vm` first), `deployer.delete_vm` (compute-first then sqlite per D-24 — orphan-recoverability asymmetry), `deployer.regenerate_password` (single-use disclosure shape mirroring `create_user.initial_password`). Two squashes on `development`/`main`: hero_lib [`ce653c0a`](https://forge.ourworld.tf/lhumina_code/hero_lib/commit/ce653c0a) (+`ForgeClient::delete_user_ssh_key` + `ForgeClient::update_user_password` admin methods, +33 LOC); hero_os_tfgrid_deployer [`3508cd1`](https://forge.ourworld.tf/lhumina_code/hero_os_tfgrid_deployer/commit/3508cd1) (+3 RPC methods + sqlite migration scaffold + FK enforcement + 5 new db tests, 8 files +479/-25).
- **LOAD-BEARING Phase B.5 finding absorbed** — adversarial review caught `PRAGMA foreign_keys` was silently OFF in `db.rs` (sqlite's default `foreign_keys=OFF` made the `vms.user_id REFERENCES users(id)` FK a no-op — `DELETE FROM users` would orphan vms rows with no error). Fixed as a one-line `PRAGMA foreign_keys = ON` in `Db::open` + `Db::open_in_memory`. Test `fk_enforcement_blocks_delete_user_with_vms` pins the constraint.
- **rusqlite_migration 2.5 scaffold** keyed on `PRAGMA user_version`. The s143 initial migration is the current schema with `CREATE TABLE IF NOT EXISTS`, so pre-migration dev DBs bootstrap cleanly into `user_version=1` without ALTER. Foundation for D-25+ schema bumps.
- **[D-24](https://forge.ourworld.tf/lhumina_code/home/issues/235) minted** — locks (a) refuse-if-vms for `delete_user`, (b) compute-first then sqlite for `delete_vm`, (c) `PRAGMA foreign_keys=ON` as second-line guard, (d) accepted operational gap: lost `vm_secret` makes a VM unrecoverable from deployer side. Workspace D-NN advances to D-25 (reserved for the s144 ON DELETE RESTRICT migration head).
- **Tests + lab infocheck green**: 21 deployer_server tests + 3 SDK tests pass; lab infocheck = 3/3 crates clean / 0 findings. Binary-symbol smoke: 8/8 RPC method names confirmed in release binary (`deployer.create_user|get_user|list_users|delete_user|regenerate_password|provision_vm|list_vms|delete_vm`).
- **End-to-end VM smoke still deferred** per the carried operational gap (no TFGrid VM exists until Track F's F1). Live Forge admin round-trip also deferred — `deployer/FORGE_TOKEN` was rotated post-s141 and not re-populated this session.
- **Track B status unchanged from s2-016**: Phase B `_admin` rebuild remains queued.
- **s144 queued** = Track D **D-25 ON DELETE RESTRICT migration** (default head — first real use of the s143 migration scaffold; encodes D-24 at the schema layer). Alts: D5 live-smoke (gated on `HERO_COMPUTE_NODE_ADDR`) or E1 Forge group/repo per-user.

### 2026-05-22 update — workspace housekeeping + Track B re-activation (s2-016) under hero_assistance-alignment scope

- **Workspace doc compaction** (`compaction-2026-05-22`): CLAUDE.md + prompt.md + prompt2.md + prompt-common.md compacted 445→53 KB (−88%). Pre-compaction snapshot at `archive/2026-05-22-compaction/`. `pipeline-config.yaml` tracking_issue updated from `hero_demo#52` → `home#235` to match this arc as the live tracker. CLAUDE.md now leads with home#235 as headline framing. Manifest: `sessions/compaction-2026-05-22.yml`. No arc code touched, no D-NN/L-NN minted.
- **Track B s2-016 — re-activation + hero_assistance work** — Track B re-activated under new scope = multi-phase alignment of `lhumina_code/hero_assistance/` with the canonical Hero service template per [hero_assistance#15](https://forge.ourworld.tf/lhumina_code/hero_assistance/issues/15). Pre-archive scope (hero_onboarding v0 spec) preserved as historical; reactivates on the same Track-D-`/api/deploy-vm` trigger if needed. Three squashes on hero_assistance/development: [`f81aecc`](https://forge.ourworld.tf/lhumina_code/hero_assistance/commit/f81aecc) (prior session's #14 squash-merge), [`c059c1a`](https://forge.ourworld.tf/lhumina_code/hero_assistance/commit/c059c1a) (Wall 1 rusqlite u64→i64 + Wall 2 reqwest rustls-tls swap), [`5330a0f`](https://forge.ourworld.tf/lhumina_code/hero_assistance/commit/5330a0f) (Phase A drop 5 Dioxus crates + D-26 minted hero_assistance-repo-local retiring D-09/D-17/D-22/D-25 atomically). 6 hero_assistance issue closures (#7/#9/#10/#11/#12/#14 + #13 auto-closed). New meta-issue [hero_assistance#15](https://forge.ourworld.tf/lhumina_code/hero_assistance/issues/15) opens the multi-phase alignment arc (Phases A through E). L-08 (workspace) retro-closed. CI green; releases/tag/latest = 4 musl binaries + 4 md5 sidecars. Workspace `lab infocheck` 4 clean / 0 findings (was 4/4/20). Procedural skip flagged: worked in shared `lhumina_code/hero_assistance/` checkout NOT the worktree-isolated `../hero_assistance-track-agent-2/` (future Track B sessions MUST use the worktree per CLAUDE.md "Cross-track coordination").
- **Track A status unchanged at s2-016 close** — Track A's s142 = Track D D3 SSH key lifecycle remains queued and uncommenced (Track A did not run this day). Default head per `prompt.md §3`; alts D4 first-hero_compute-call or D2.1 D2 polish; pick at /start. The two tracks can run concurrently going forward.
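
The Phase B.5 finding recorded in the post-s143 update above (a declared `REFERENCES` clause does nothing while `PRAGMA foreign_keys` is off) can be reproduced in a few lines. This is a minimal sketch using Python's stdlib `sqlite3` rather than the deployer's Rust `db.rs`; the two-table schema and the helper name `delete_user_leaves_orphans` are illustrative assumptions, not the deployer's actual schema or API.

```python
import sqlite3

# Hypothetical two-table schema mirroring the `vms.user_id REFERENCES users(id)`
# relationship described in the tracker; the real deployer schema may differ.
SCHEMA = """
CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE vms (
    id INTEGER PRIMARY KEY,
    user_id INTEGER NOT NULL REFERENCES users(id)
);
INSERT INTO users (id, name) VALUES (1, 'demo');
INSERT INTO vms (id, user_id) VALUES (10, 1);
"""


def delete_user_leaves_orphans(enforce_fk: bool) -> bool:
    """Return True if deleting the user left an orphaned vms row behind."""
    conn = sqlite3.connect(":memory:")
    try:
        # The pragma is per-connection and must be set outside a transaction,
        # so it is issued before any other statement.
        conn.execute(f"PRAGMA foreign_keys = {'ON' if enforce_fk else 'OFF'}")
        conn.executescript(SCHEMA)
        try:
            conn.execute("DELETE FROM users WHERE id = 1")
            conn.commit()
        except sqlite3.IntegrityError:
            # With enforcement on, sqlite refuses the delete outright.
            conn.rollback()
            return False
        orphans = conn.execute(
            "SELECT COUNT(*) FROM vms WHERE user_id NOT IN (SELECT id FROM users)"
        ).fetchone()[0]
        return orphans > 0
    finally:
        conn.close()


if __name__ == "__main__":
    print("FK off, orphans left:", delete_user_leaves_orphans(False))
    print("FK on, orphans left:", delete_user_leaves_orphans(True))
```

Because the pragma is scoped to a connection, it has to be applied at every open path, which is consistent with the fix landing in both `Db::open` and `Db::open_in_memory`.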
### 2026-05-21 update (post-s139) — pivot: hero_os_tfgrid_deployer IS the deployment path

- **Track A closed.** All 7 hero_cockpit#1 spec items shipped across s133-s139 (s139 = [`f880247`](https://forge.ourworld.tf/lhumina_code/hero_cockpit/commit/f880247)). See [hero_cockpit#1](https://forge.ourworld.tf/lhumina_code/hero_cockpit/issues/1) for the closed-as-shipped checklist.
- **`herolab.gent02.grid.tf` retired 2026-05-21.** Destroyed via `make destroy ENV=herolab`. 5 OpenTofu resources released (grid_deployment, grid_name_proxy, grid_network, 2 random_bytes). Gateway FQDN + mycelium IPv6 released. The s132 manual-deploy proof is done; we don't deploy that way again.
- **`hero_os_tfgrid_deployer` is now the canonical deployment path** for every Hero OS VM, both free-demo and paid-arc-pool. `hero_demo make deploy` is no longer used for VM provisioning. The deployer's per-user `cockpit-services.toml` manifest drives the setup-binaries dispatch, the hero_proxy install, the OAuth wiring, and the webgateway binding — all as parts of the deployer's standard post-deploy flow, not as standalone-VM concerns.
- **Track D becomes critical-path.** Reordered ahead of Tracks B/C in §2 below. Workspace scaffold landed on [hero_os_tfgrid_deployer](https://forge.ourworld.tf/lhumina_code/hero_os_tfgrid_deployer) on 2026-05-20 (ab061f5b → 76919265: 4-crate workspace + JSON-RPC /rpc + /openrpc.json + /health). Agent 1 picks up at D2 (Forge user lifecycle) onwards at s140.
- **Accepted operational gap:** ~5 sessions where no TFGrid VM exists for end-to-end smoke. Track B/C/cockpit-followup work continues as local code work (config templates, content repos, model wiring) in parallel. No fallback hero_demo deploy "just for testing in the meantime" — the gap is the honest price of committing to deployer-as-path.
### Original 2026-05-20 foundation status (preserved for historical context)

What was working at session 132 (2026-05-20) — the **proof-of-concept that established the build/install mechanic now embedded inside the deployer's post-deploy flow**; the standalone `hero_demo make deploy` path is retired:

- **Build pipeline.** `lab` (in `hero_skills`) builds the D-07 35-set. Workstation + VM-side native builds pass. mycelium is the 35th and is excluded on TFGrid since it ships natively via zinit.
- **CI auto-publish.** 31/31 wired D-07 repos run `.forgejo/workflows/lab-publish.yaml` on every push to `development`. Each repo refreshes its `releases/tag/latest` with linux-musl-x86_64 (CLI) + linux-x86_64-gnu (daemons with ONNX) artefacts. See [hero_skills#268](https://forge.ourworld.tf/lhumina_code/hero_skills/issues/268) (rollout) + [hero_skills#269](https://forge.ourworld.tf/lhumina_code/hero_skills/issues/269) (per-repo cleanup catalogue, closed).
- **VM-side consumer install.** `lab build $repo --download --install` on a fresh Ubuntu 24.04 TFGrid VM with no Rust toolchain installs all 34 (mycelium skipped) end-to-end, including the 3 ONNX libraries to `~/hero/lib/`. **This mechanic now lives inside the deployer's post-deploy flow (D4).**
- **TFGrid deployment (RETIRED).** Was: `make deploy ENV=herolab` (in `hero_demo`) provisions one VM via OpenTofu in ~60 s. `make setup-binaries` runs the lab consumer-side install loop. Now: same OpenTofu provider is available to the deployer's D3 adapter as a secondary path (the primary path is hero_compute via Mahmoud's API once free-form sizing lands).

**Known open followups on the foundation:**

- **TFGrid public gateway returns HTTP 502 on hero_router alone.** hero_router binds to loopback by default; the public gateway hits a closed port. Resolution path: add `--bind 0.0.0.0` + put hero_proxy in front of it (Track B below — scoped as deployer-integrated config rather than a standalone install).
  See [hero_router#74](https://forge.ourworld.tf/lhumina_code/hero_router/issues/74).

## 2. Roadmap — 6 tracks, ~17 sessions remaining (was ~24-26 pre-pivot)

Each track has a slot in the `prompt.md §3` session map. Sessions continue from s140.

**Order (post-2026-05-21 pivot):** Track A closed at s139. Track D becomes critical-path and runs s140-s14X. Tracks B/C run as local code work in parallel with Track D, then merge into Track D's standard per-user manifest. Track E feeds into Track D's post-deploy flow. Track F validates end-to-end after D + E ship.

### Track A — `hero_cockpit` (end-user UI on the VM) — ✅ CLOSED s133-s139

Spec: [hero_cockpit#1](https://forge.ourworld.tf/lhumina_code/hero_cockpit/issues/1). Scaffolded from [`hero_template`](https://forge.ourworld.tf/lhumina_code/hero_template).

| Session | Focus |
|---|---|
| s133 | A1 — scaffold from hero_template: 5 crates (cli + server + sdk + admin + web), service.toml × 4 daemons, /health + /.well-known endpoints, `cargo check` + `lab infocheck` clean. ✅ |
| s134 | A2 — Services page + cockpit_server RPCs: list/start/stop/restart/enable/disable_service. ✅ |
| s135 | A3 — Settings page + cockpit_server RPCs: get/set/test_byok_key, system_info. ✅ |
| s136 | A4 — Feedback iframe (→ `lhumina_public/feedback`) + Manual pages. ✅ |
| s137 | A5 — Per-user manifest (`~/hero/cfg/cockpit/services.toml`) read/write + profile switching. ✅ |
| s138 | A6 — Upgrade button + cockpit.upgrade flow + SSE job log streaming via `lab update`. ✅ |
| s139 | A7 — Dynamic URL mapping (`cockpit.expose_service` / `unexpose_service`) via hero_proxy `domain.add` admin API. ✅ |

### Track D — `hero_os_tfgrid_deployer` (admin tool) — CRITICAL-PATH, ~5 sessions

Umbrella: [hero_os_tfgrid_deployer#2](https://forge.ourworld.tf/lhumina_code/hero_os_tfgrid_deployer/issues/2).
Sub-issues: [#3](https://forge.ourworld.tf/lhumina_code/hero_os_tfgrid_deployer/issues/3) D1 / [#4](https://forge.ourworld.tf/lhumina_code/hero_os_tfgrid_deployer/issues/4) D2 / [#5](https://forge.ourworld.tf/lhumina_code/hero_os_tfgrid_deployer/issues/5) D3 / [#6](https://forge.ourworld.tf/lhumina_code/hero_os_tfgrid_deployer/issues/6) D4 / [#7](https://forge.ourworld.tf/lhumina_code/hero_os_tfgrid_deployer/issues/7) D5 / [#8](https://forge.ourworld.tf/lhumina_code/hero_os_tfgrid_deployer/issues/8) D6. Workspace scaffold landed on 2026-05-20 (ab061f5b → 76919265): 4-crate workspace + JSON-RPC plumbing.

**Goal:** an admin tool that, given a Forge username, autonomously provisions a Hero OS demo VM end-to-end — Forge account lifecycle + SSH key gen + hero_compute deploy_vm + setup-binaries dispatch (hero_proxy + cockpit + Track C content all included) + deploy_webgateway + Forge OAuth wiring. No human-in-the-loop after the form submit.

| Session | Focus |
|---|---|
| s140 | **Track D start.** Read the existing scaffold (D1) for context, then begin D2 = Forge user lifecycle (REST client + create-or-check + token-gen + dedicated SSH key gen). |
| s141 | D3 — VM-deploy adapter trait. **OpenTofu as primary adapter** (matches what s132 proved); **hero_compute as secondary** under a config flag (limited until Mahmoud closes free-form-sizing in [hero_compute#116](https://forge.ourworld.tf/lhumina_code/hero_compute/issues/116)). |
| s142 | D4 — post-deploy flow (scp + setup-binaries dispatch + verify + hero_proxy install + OAuth wiring). Depends on Track E manifest shape (run in parallel with E1 at s142b if needed). **VM lifecycle = delete-and-redeploy** on TFGrid (start/stop/restart_vm not supported, see §3). |
| s143 | D5 — admin UI (Axum + Askama + Bootstrap, building on the existing scaffold): users list, deploy/destroy actions, per-user state view, gateway URL display, event log. |
| s144 | D6 — wire hero_onboarding's `POST /admin/pool-refresh` integration so the deployer can feed VMs into the paid arc's pool (see §5.5 convergence point). Or defer to a hero_onboarding-side session if agent 2's s2-015 picks up first. |

### Track B — `hero_proxy` config + OAuth + TLS — 3 sessions (parallel with Track D)

**Reframed (2026-05-21 pivot):** Track B is no longer "install hero_proxy on a standalone VM." It is now **configuration + integration work that becomes part of the deployer's standard per-user manifest**. All work is local code on `hero_proxy` repo + the deployer's manifest templates; no TFGrid VM required until Track D's D4 picks up the manifest at provision time.

| Session | Focus |
|---|---|
| s145 | B1 — `hero_proxy` config templates for the standard demo VM shape (`/` → cockpit admin.sock, `/<service>/` → service admin sockets). Lands as a docs + service.toml + default-config PR on `hero_proxy`. |
| s146 | B2 — Forge OAuth integration template. Register `hero_proxy` as a Forge OAuth client at `forge.ourworld.tf` (operational, needs Forge admin); define `auth_mode=oauth` + `oauth_provider=forge.ourworld.tf` + `allowed_pubkeys=[<user_forge_id>]` template that the deployer instantiates per-user at provision time. |
| s147 | B3 — TLS strategy decision + docs. Either TFGrid `name_proxy` for TLS termination (simpler — Mahmoud's `deploy_webgateway` handles cert), or LE/certbot inside the VM (more control). Picks one; documents the choice; the deployer instantiates accordingly. |

### Track C — Public content + smaller models — 3-4 sessions (parallel with Track D)

| Session | Focus |
|---|---|
| s148 | C1 — Create `lhumina_public/docs_owh_public` + (optionally) `mycelium-public-docs` equivalent. Populate with safe demo content. |
| s149 | C2 — Wire `hero_books` to default-load these public repos on a fresh VM (config + manifest changes; no VM required to develop). |
| ~~s150~~ | C3 — **Deferred to post-arc follow-up: [hero_embedder#42](https://forge.ourworld.tf/lhumina_code/hero_embedder/issues/42).** Demo VM bumped to 16 GB instead; the smaller embedder work moves to after this arc closes. |
| (s151?) | C4 — Optional Kimi agent integration into `hero_agent`. Slot in when ready. |

### Track E — `setup-binaries.sh` per-user manifest refactor — ✅ CLOSED s151

Lives in `hero_demo`; tracked at [hero_os_tfgrid_deployer#8](https://forge.ourworld.tf/lhumina_code/hero_os_tfgrid_deployer/issues/8) (closed). Critical for Track D's D4 post-deploy flow.

| Session | Focus |
|---|---|
| s151 | E1 — refactored `hero_demo/deploy/single-vm/scripts/setup-binaries.sh` to read per-user `cockpit-services.toml` manifest + always-on bootstrap-core (`hero_proxy + hero_router + hero_proc + hero_cockpit`) + small-embedder flag (`EMBEDDER_MODEL_SIZE=small`); falls back to `d07_set.txt` when no manifest present; new `DRY_RUN=1` mode. Landed as `hero_demo 20f03ba` (+207/-38). Bonus: `hero_cockpit 558e737` adds hero_planner ManualEntry + new manual page. ✅ |

### Track F — Integration + validation — 2-3 sessions

| Session | Focus |
|---|---|
| s152 | F1 — end-to-end: deployer creates fresh Forge user → fresh TFGrid VM (via D3 adapter, OpenTofu primary) → setup-binaries runs → cockpit + hero_proxy + Forge OAuth + TLS → user logs in via Forge → manages services → posts feedback. **First real end-to-end smoke since 2026-05-21 retirement.** |
| s153 | F2 — 8 GB fit validation under the full enabled-by-default set. RSS measurement per service. |
| s154 | F3 — multi-user test (provision 2-3 VMs from the deployer, verify isolation). Hand-off to wider team. |

## 3. Dependency map across repos

| Dep | Owner | Status | Notes |
|---|---|---|---|
| `hero_compute` lifecycle API | mahmoud | mainnet-ready on TFGrid for the slice model, [confirmation at #116#35305](https://forge.ourworld.tf/lhumina_code/hero_compute/issues/116#issuecomment-35305) | See caveats below |
| `hero_template` (scaffold base) | despiegk | available | Has 4 crates (server/admin/web/sdk); cockpit adds CLI as 5th |
| `hero_proxy` (OAuth + URL mapping) | team / despiegk | exists, needs herolab config | Track B |
| Forge OAuth client registration | team (forge admin) | not done | Blocks B2 only |
| `hero_voice` end-to-end | scott | in progress | Slots into Track A via cockpit settings flag, not on critical path |
| `hero_web_template` | timur | different shape (Mycelium dashboard) | NOT the cockpit base; cockpit uses `hero_template` |

### Mahmoud's hero_compute caveats (load-bearing for Tracks B / D / F)

From [hero_compute#116#35305](https://forge.ourworld.tf/lhumina_code/hero_compute/issues/116#issuecomment-35305):

**TFGrid lifecycle surface (what works):** `deploy_vm`, `delete_vm`, `list_vms` / `get_vm`, `vm_logs`, `node_register` / `node_status` / `node_unregister`, `set_tfgrid_node_ids`, `list_slices` / `get_slice`, `node_stats`, `list_images`, `get_deployment_logs` / `list_deployments`, `get_ssh_keys` / `set_ssh_keys` (per-secret store, not push-into-VM), `list_jobs` / `job_logs`, `deploy_webgateway` / `list_webgateways` / `get_webgateway` / `delete_webgateway` / `list_gateway_nodes`.

**TFGrid stubs (error):** `start_vm`, `stop_vm`, `restart_vm`, `inject_ssh_keys`, `vm_exec`, `vm_stats`, `attach_hypervisor`, `migrate_secret`.

**Constraints that shape the deployer + cockpit:**

1. **Slice-based sizing.** 1 slice = 4 GB RAM + 1 vCPU + fixed disk-per-slice (~138 GB/slice on node 1774). CPU and RAM coupled 1:4.
   **No `publicip`, no `rootfs`, no independent disk parameter.** Our 8 GB demo VM ⇒ 2 slices ⇒ 2 vCPU (not 16 — the 16 we saw in s132 was the OpenTofu path, which sets a different shape).
2. **SSH keys at deploy-time only.** Pass `ssh_keys=[…]` to `deploy_vm`; no inject-after-create. Affects D2 (deployer's Forge user lifecycle): generate the SSH key + pass at deploy_vm time.
3. **VM lifecycle = delete-and-redeploy.** No start/stop/restart. Cockpit's per-service start/stop/restart is unaffected (those are `hero_proc service` calls on services running inside the VM). The deployer's admin UI shows a "destroy + redeploy" action, not three separate buttons.
4. **No `vm_exec` / `vm_stats`.** Setup-binaries dispatch uses SSH (already true post-s132). Cockpit's `system_info` reads RAM/disk from the VM's own `/proc/meminfo` + `df` + sysinfo crate, not via hero_compute.
5. **Async deploy.** `deploy_vm` returns immediately with `state="provisioning"`; poll `get_vm` until `state="running"` AND `mycelium_ip` set. Same async pattern for `delete_vm` (→ `deleting` → record disappears) and `deploy_webgateway`.
6. **No metadata field on Vm.** Encode `user` / `profile` into VM `name` OR keep the join in the deployer's sqlite (we'd use the deployer's sqlite — schema already has the foreign key).
7. **Auth: UDS local-only.** `ComputeService` listens on a Unix domain socket; per-call auth is the `secret` parameter (sr25519-signed token from node's `TFGRID_MNEMONIC` or raw ownership token). A **remote** deployer reaches it via hero_router (TCP entry point + context/claim auth) or an SSH tunnel — no built-in network auth.

**Cross-track follow-ups Mahmoud offered to file:** (a) `metadata: map<str,str>` on Vm spec, (b) free-form sizing + publicip, (c) remote-auth model. We will track each as they get filed.

## 4. Out of scope (initial demo)

- Billing / payment / Stripe — deferred per meeting.
- Self-service onboarding — admin-driven for now.
- Multi-user-per-VM — one demo user owns one VM.
- Office / OnlyOffice — too heavy for 8 GB.
- Team-paid AI inference — BYO keys only.

## 5. Cross-links

- [hero_os_tfgrid_deployer#1](https://forge.ourworld.tf/lhumina_code/hero_os_tfgrid_deployer/issues/1) — Despiegk's Main Story / meeting minutes (decisions source)
- [hero_os_tfgrid_deployer#2](https://forge.ourworld.tf/lhumina_code/hero_os_tfgrid_deployer/issues/2) — v0.1 umbrella (deployer scope)
- [hero_cockpit#1](https://forge.ourworld.tf/lhumina_code/hero_cockpit/issues/1) — cockpit deep spec
- [hero_compute#116](https://forge.ourworld.tf/lhumina_code/hero_compute/issues/116) — compute integration coordination (Mahmoud)
- [hero_demo `09f8365`](https://forge.ourworld.tf/lhumina_code/hero_demo/commit/09f8365) — herolab env + setup-binaries.sh (s132)
- [lhumina_public/feedback#1](https://forge.ourworld.tf/lhumina_public/feedback/pulls/1) — feedback repo bootstrap PR

## 5.5 — Paid-tier overlay (hero_onboarding, Track B)

The demo-deployer arc above (Tracks A-F) ships a **company-paid free demo** of Hero OS. A parallel arc — **hero_onboarding** (separate scope, tracked at [hero_onboarding#1](https://forge.ourworld.tf/lhumina_code/hero_onboarding/issues/1)) — adds the **paid commercial overlay** on top of the same substrate. Same deployer, same cockpit, same proxy; **differs only in front-gate behavior**.

**Status (post-2026-05-21, s2-009):** Phases 1-7 shipped on `lhumina_code/hero_onboarding/development` — mycelium proof-of-control login (D-12), Stripe + ClickPesa payments (D-13), per-node billing pipeline (D-14), Idenfy KYC at /payment/intent gate (D-15), VM allocation via `PoolAssignmentProvisioner` (D-17, race-renamed from D-16). 4 crates, 47 unit tests, 4 white-box smokes, all green. Acceptance: cargo + lab + smoke matrix per [hero_onboarding#2-#8](https://forge.ourworld.tf/lhumina_code/hero_onboarding/issues/1).

**Convergence points with this issue's tracks:**

- **Track D (hero_os_tfgrid_deployer)**: the deployer becomes the VM source for hero_onboarding's pool. Either pushes via `POST /admin/pool-refresh` on hero_onboarding (the API hero_onboarding will land in s2-010 Phase 8), or hero_onboarding eventually adds a `DeployerProvisioner` impl that triggers allocate-on-demand (v1.5 behind the D-17 trait). Either flow ships in 1-2 hero_onboarding sessions once Track D's API is live.
- **Track A (hero_cockpit)**: hero_onboarding's `/vm/allocate` hands the user a URL pointing at cockpit on the assigned VM. **Auth substrate: Forge OAuth via forge.ourworld.tf** — aligns hero_onboarding, hero_cockpit, and the deployer on a single platform-wide user identity. hero_onboarding offers **dual login** (locked at s2-011 Phase 9 as D-18): Forge OAuth as the default low-friction signup, AND mycelium proof-of-control (D-12 from s2-003) as an alternative for sovereignty-minded users. User row carries both `forge_id?` and `mycelium_address?` slots; at least one populated per row, both linkable post-signup via `/account/link-*`. SSO to cockpit always uses the user's Forge ID — mycelium-only users link a Forge account when they first hit cockpit. The `VmAllocation` row captures the user's `forge_id` at allocate time so the deployer/cockpit can grant access. **Fully reversible**: ~30 LOC to flip back to mycelium-only (if the boss decides sovereignty-first is the only path); ~50 LOC removed to flip to Forge-only (if mycelium turns out unused). Dual-auth is the deliberately-least-committed stance.
- **Track B (hero_proxy)**: same proxy fronts both arcs (free demo at `<vm>.<node>.grid.tf`, hero_onboarding at e.g. `onboarding.heroos.com`). No special integration needed; both are HTTP backends to the proxy.

**What hero_onboarding deliberately does NOT do:** anything VM-side (cockpit, system_info, services, BYO keys — that's all Track A). Anything HTTPS / TLS-termination (that's Track B).
Anything direct-to-TFGrid (that's hero_compute, slot for v1 TfchainAutoDeployProvisioner per [hero_compute#116](https://forge.ourworld.tf/lhumina_code/hero_compute/issues/116)).

**What hero_onboarding WAS over-built for, vs the free demo:** payment + KYC + per-node billing pipeline + credit-decrement at allocation time. These are commercial-flow concerns that the free demo has no need for. They're not wasted — they're the paid tier — but they're worth noting so this issue's readers know hero_onboarding's scope is intentionally wider than what the free demo needs.

**Sequencing (hero_onboarding's next 5 sessions, all parallel to Track A's work — no Track A blocking on Track B or vice versa):**

| Session | Theme | Touches this issue? |
|---|---|---|
| **s2-010 Phase 8** | Integration prep — `cockpit_url` field on PoolVm + VmAllocation; `POST /admin/pool-refresh` admin route (atomic in-memory pool swap, takes `VM_POOL_JSON` shape); stub `release()` cleanup-hook (logs "would call deployer-release here"); dashboard "Open cockpit →" link; `forge_id?` schema slot reserved for Phase 9. Makes hero_onboarding deployer-pluggable AHEAD of Track D shipping. | Coordination-only (this comment) — no code overlap. |
| **s2-011 Phase 9** | **Forge OAuth login (default) + dual-auth model** — add Forge OAuth via forge.ourworld.tf as the default login on hero_onboarding (low friction, aligns with this issue's Forge-gated cockpit); keep mycelium proof-of-control (D-12) as the alternative for sovereignty-minded users; `VmAllocation` row captures `forge_id` at allocate time for the SSO bridge to cockpit. **D-18 lock** on dual-auth model with revisability annotations. | Light coordination — surface to Track A that cockpit should honor Forge OAuth sessions from `forge.ourworld.tf` for paid-tier-allocated VMs. |
| **s2-012 Phase 10** | Production keys prep + operator runbook — per-environment config matrix for Stripe + ClickPesa + Idenfy + Forge OAuth in `docs/operator-runbook.md`; configuration validator (`hero_onboarding --check-prod-config` fails-fast on missing prod env vars); NO live charge tests at this session (deferred to launch day). | No |
| **s2-013 Phase 11** | Meta-issue Q-followups: Q#8 refund posture (opt-in env-flag-gated refund on release) + Q#9 multi-currency (per-currency balance display). Possibly D-19 if refund posture locks load-bearing decisions. | No |
| **s2-014 Phase 12** | End-to-end cron rehearsal + auto-release on `expires_at` — new `vm_auto_release_cron` action; production-cadence rehearsal of producer (1h) + aggregator (5min) + auto-release (1h) for a full day in a dev environment. | No |

After s2-014, hero_onboarding pauses until either (a) Mahmoud closes [hero_compute#116](https://forge.ourworld.tf/lhumina_code/hero_compute/issues/116) gaps → s2-015 = v1 `TfchainAutoDeployProvisioner`, or (b) Track D ships its deployer API → s2-015 = v1.5 `DeployerProvisioner` + real pool-feed integration. Whichever lands first.

**D-NN race rule for cross-track decisions:** per `prompt-common.md`, first-minted wins. Track A's D-16 (cockpit-byok-user-forge-token-namespacing) took D-16 on 2026-05-21 08:46 EDT, 40 minutes ahead of Track B's parallel D-16 mint (provisioner-trait-shape) — Track B re-numbered to D-17. Next free D-NN is **D-19** (D-18 reserved for the s2-011 Phase 9 dual-auth model lock).

## 6. Per-track status (updated each session close)

| Track | Stage | Last closed | Next | Session |
|---|---|---|---|---|
| A — Cockpit | **✅ CLOSED** | A7 (s139 = [`f880247`](https://forge.ourworld.tf/lhumina_code/hero_cockpit/commit/f880247)) | — (cosmetic + L134-A followups foldable into any future cockpit-light session) | — |
| **D — Deployer** | **🔴 CRITICAL-PATH** | D1 scaffold landed 2026-05-20 (ab061f5b → 76919265) | s140 = start D2 (Forge user lifecycle) | **s140** |
| B — Proxy + OAuth (reframed) | not started; local-code-only until Track D D4 picks it up | — | B1 config templates on `hero_proxy` repo + Forge OAuth registration | s145 |
| C — Public content + small model | not started; local-code-only | — | C1 create lhumina_public/docs_owh_public | s148 |
| E — setup-binaries refactor | not started | — | E1 manifest-driven loop (may run parallel with D4) | s151 |
| F — Integration + validation | not started; gated on D + B + C + E all landing | — | F1 e2e | s152 |

**Notes on the reorder:**

- Track A closed at s139 (Track A = 7 sessions s133-s139, all shipped).
- Track D is now critical-path because **every TFGrid VM going forward must go through the deployer** (per the 2026-05-21 pivot in §1). No more `hero_demo make deploy` provisioning.
- Track B reframed from "install on a standalone VM" to "config templates + Forge OAuth client registration + TLS strategy docs" — all local code work, no TFGrid VM required during the session. Track D's D4 instantiates the B-templates per VM at provision time.
- Track C unchanged in scope but reordered to run parallel with Track D rather than between B and D.
- Accepted operational gap: no TFGrid VM exists between 2026-05-21 (herolab retirement) and Track F's F1 (first deployer-provisioned VM). ~5 sessions worth. End-to-end smoke testing is paused during that window.
Author
Owner

## Track A closed - s139 = A7 URL mapping landed

f880247 — 11 files, +1229/-3. Track A1-A7 all done across s133-s139.

### What ships

cockpit.expose_service { service, subdomain } pushes a real route into hero_proxy via the existing domain.add RPC — no hero_router changes, single-repo session. The cockpit owner picks a subdomain in /services, cockpit calls hero_proxy.domain_add({ domain: "<sub>.<base>", target_type: "socket", target: "$HERO_SOCKET_DIR/<service>/admin.sock", https_redirect: true, enabled: true }), persists to ~/hero/cfg/cockpit/exposures.toml, and the row in /services becomes a clickable link.

  • 5 new RPCs: expose_service / unexpose_service / list_exposures / get_base_domain / set_base_domain
  • New exposures.rs module + 6 unit tests (mirrors s137 manifest.rs)
  • 10th "URL" column in /services + Bootstrap modal with subdomain input + live preview
  • "Public exposure base domain" section on /settings
  • BASE_DOMAIN secret stored in the cockpit context per D-16 (never touches operator state)
  • Partial-failure protection: hero_proxy unreachable → -32603 + exposures.toml not written
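
The call shape and the write ordering described above can be sketched as follows. This is a minimal Python illustration, not the cockpit's actual code: the helper names `build_domain_add_params` and `expose`, and the `call_proxy` / `persist` callables, are hypothetical. Only the parameter fields come from the description above.

```python
import json


def build_domain_add_params(service, subdomain, base_domain, socket_dir):
    """Build the params object for hero_proxy's domain.add (fields from the text)."""
    return {
        "domain": f"{subdomain}.{base_domain}",
        "target_type": "socket",
        "target": f"{socket_dir}/{service}/admin.sock",
        "https_redirect": True,
        "enabled": True,
    }


def expose(service, subdomain, base_domain, socket_dir, call_proxy, persist):
    """Push the route first, persist second.

    `call_proxy(method, params)` and `persist(domain, service)` are
    hypothetical stand-ins for the RPC client and the exposures.toml writer.
    If `call_proxy` raises, `persist` is never reached, which mirrors the
    partial-failure protection described above.
    """
    params = build_domain_add_params(service, subdomain, base_domain, socket_dir)
    call_proxy("domain.add", params)
    persist(params["domain"], service)
    return params


# Usage with in-memory stand-ins instead of a live hero_proxy.
saved = []
params = expose(
    "hero_books_admin", "mik", "herodemo.gent01.grid.tf", "/tmp/hero/sockets",
    call_proxy=lambda method, p: None,
    persist=lambda domain, service: saved.append((domain, service)),
)
print(json.dumps(params, indent=2))
```

Ordering the persist step after the proxy call means a failed RPC leaves no stale row on disk, at the cost of a possible orphan route if the write itself fails afterwards.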

### Phase B finding worth tracking

hero_router has NO custom path-prefix alias surface today — the per-service reverse proxy at /{service_name}/{webname} is hard-coded socket-dir-derived (see crates/hero_router/src/server/routes.rs:2275-2286). router.add is OpenRPC-sidebar registration, not HTTP routing. The s139 spec adherence to §8 subdomain shape (via the already-built hero_proxy) sidesteps this entirely, but the gap is real for any future feature that wants custom path-prefix routing through hero_router rather than subdomain through hero_proxy.

### Verification

  • D-10 5/5 GREEN: cargo test 16/16, lab build VICTORY 4/4, lab infocheck 4 clean.
  • Live end-to-end against running hero_proxy_server: expose hero_books_admin at "mik" → mik.herodemo.gent01.grid.tf route_id=1 in hero_proxy + clickable link in /services; idempotent re-expose; conflict errors; unexpose dual-removes; idempotent re-unexpose returns existed:false.
  • All validation paths -32602 (unknown service, uppercase, leading hyphen, dot, missing socket).
  • Web layer regression-clean: /services /settings /feedback /manual /about all 200.
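
The subdomain checks listed above (uppercase, leading hyphen, dot) can be sketched like this. The exact rule set in the cockpit is not shown in this issue, so the regular expression below is an assumption. `-32602` is the standard JSON-RPC 2.0 "Invalid params" code, and `validate_subdomain` is a hypothetical name.

```python
import re

INVALID_PARAMS = -32602  # JSON-RPC 2.0 "Invalid params"

# Assumed rule: lowercase letters, digits and hyphens only,
# no leading hyphen, no dots, at least one character.
_SUBDOMAIN_RE = re.compile(r"[a-z0-9][a-z0-9-]*")


def validate_subdomain(subdomain):
    """Return None when the label is acceptable, else (code, message)."""
    if _SUBDOMAIN_RE.fullmatch(subdomain) is None:
        return (INVALID_PARAMS, f"invalid subdomain: {subdomain!r}")
    return None


for candidate in ["mik", "Mik", "-mik", "a.b"]:
    print(candidate, validate_subdomain(candidate))
```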

### Arc rotation due at s140

Track A is closed. s140 picks the next track to enter from B/C/D/E/F per the status table above. My read on priorities:

  • B (proxy + OAuth) has the highest leverage on what just shipped — today herodemo runs nginx→hero_router (not hero_proxy), so the cockpit's URL feature only works locally on workstations until hero_proxy is wired into the demo deployment. B would close that gap.
  • D (deployer admin) unblocks the TFGrid VM-per-user demo flow but needs Kristof + Mahmoud coordination.
  • C (content+models) is independent and has cleaner solo execution.
  • E and F are lower demo visibility / premature.

Will defer the actual choice to the next session's planning step. Full session narrative at sessions/139.yml.

Author
Owner

## Cross-link: email / notifications strategy

Filed a small meta-issue locking the email-provider choice for both arcs: [home#236 - META Email / notifications strategy](https://forge.ourworld.tf/lhumina_code/home/issues/236). Locked at D-20 (`decisions/D-20-email-provider-sendgrid.md`).

Decision: **SendGrid** for all transactional emails originated by either the demo-deployer arc or hero_onboarding. Sender domain TBD, to be picked at the first session that writes email-sending code. Forge-native + Stripe / Idenfy / ClickPesa platform emails stay on their respective backbones.

No immediate action for any Track A-F session: the rule is live in `prompt.md §3` standing rules and the acceptance-criteria list lives in home#236.
Author
Owner

## Cross-link: two-channels overview

Added a cross-arc overview doc at [home/docs/channels/free-and-paid.md](https://forge.ourworld.tf/lhumina_code/home/src/branch/development/docs/channels/free-and-paid.md) (commit [`bfbf552`](https://forge.ourworld.tf/lhumina_code/home/commit/bfbf552)).

Audience: engineers + stakeholders. Walks through the free testing channel (admin-driven community evaluation, this issue's scope) and the paid commercial product ([hero_onboarding#1](https://forge.ourworld.tf/lhumina_code/hero_onboarding/issues/1)) end-to-end: four UX flows, shared substrate, where the channels touch each other, and explicit out-of-scope per channel.

Not a replacement for this issue. This issue stays the engineering tracker: per-track status table, session map, compute caveats. The new doc is the cross-arc reader's-eye view that this issue intentionally doesn't try to be.
Author
Owner

Added an executable companion to the existing two-channels narrative at `home/docs/channels/free-and-paid.md`. The new file lives at `home/docs/channels/free/e2e_checklist.md` and makes the free-testing channel of the Hero stack inspectable at row grain. Opens with a short story-logic recap of what the admin does end-to-end and what the tester does end-to-end (so a non-engineer stakeholder can read the file top-down), then drops into a matrix where one row equals one user-facing action, with Have / Need / Blocked status, test-pyramid layer, and a pointer to the source (decision file, meeting-note section, RPC method, or template). Scope is integration-level only: "tester can open Books from cockpit" is one row here, the rest stays in `hero_books`'s own checklist. Pattern lineage is hero_assistance D-18 (originally from `znzfreezone_deploy/docs/dev/e2e_checklist.md`), with an audit log at the top for status regressions. Initial seed has 57 rows across admin POV / user POV / cross-arc boundaries, sourced from the meeting notes plus current decisions plus a code-reading pass on hero_cockpit and hero_os_tfgrid_deployer. Next session is the verification pass: walk a local cockpit install and flip each Have row based on observation.
Author
Owner

## Update (s157d, 2026-05-25) - deploy_vm UNBLOCKED + full deploy/test mechanics + remaining roadmap

Headline: the 6-day deploy_vm investigation is resolved. Fix shipped at hero_compute@1f59151 (closes hero_compute#125). Multi-tenant pattern (one rented dedicated node, two distinct VMs co-located on it) live-verified tonight. The issue body §Current state is fully updated; this comment surfaces the deploy + test mechanics + remaining-session list as a single read for anyone picking up.

### What was wrong (one paragraph)

hero_compute's deploy_vm passed the user-facing image string (e.g. "Ubuntu 24.04") straight through to the TFGrid SDK as the zmachine workload's flist field. ZOS expects a URL there; given a name, ZOS silently sets the workload state=Error with an empty result.error and the daemon surfaces "vm deployment entered error state" with no actionable detail. Found by enabling the SDK's undocumented TFGRID_DEBUG=1 env var (gates trace_step() calls in tfgrid_sdk_rust/src/grid_client/mod.rs:2361), which printed per-workload state lines and the full workload JSON showing the literal name in the flist field. Fix is a 5-entry name→URL map in the daemon plus a resolve_image_reference() helper called once at the top of deploy_vm (pass-through for https:// URLs, friendly InvalidInput for unknown names).
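
The fix shape described above can be sketched in Python. The real helper lives in the Rust daemon and is not reproduced here. The table below is hypothetical: the actual map has 5 entries, and the URLs shown are placeholders, not real flist locations.

```python
# Hypothetical name -> flist URL table. Placeholder URLs only; the real
# daemon map has 5 entries that are not listed in this issue.
IMAGE_FLISTS = {
    "Ubuntu 24.04": "https://example.invalid/ubuntu-24.04.flist",
    "Alpine": "https://example.invalid/alpine.flist",
}


class InvalidInput(ValueError):
    """Raised for an image name that is neither a URL nor a known name."""


def resolve_image_reference(image):
    """Return an flist URL for `image`.

    Pass-through for https:// URLs, table lookup for friendly names,
    and a descriptive error for anything else.
    """
    if image.startswith("https://"):
        return image
    try:
        return IMAGE_FLISTS[image]
    except KeyError:
        known = ", ".join(sorted(IMAGE_FLISTS))
        raise InvalidInput(
            f"unknown image {image!r}; known names: {known}"
        ) from None


print(resolve_image_reference("Ubuntu 24.04"))
print(resolve_image_reference("https://hub.grid.tf/some/custom.flist"))
```

Resolving once at the top of the deploy path means every later step sees a URL, so a bad name fails early with a readable message instead of surfacing as an empty ZOS error.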

### How to deploy a VM end-to-end (the s157d recipe)

Prerequisites: env sourced (source ~/hero/cfg/init.sh && source ~/hero/cfg/env/env.sh), hero_proc supervisor running, TFGRID_MNEMONIC set in core/ context, twin has TFT balance, hero_compute origin/development at 1f59151 or later.

  1. Pick a rentable node with extraFee > 0 (the substrate-side public-rent gate; rentable: True alone is NOT enough, substrate rejects with OnlyTwinAdminCanDeploy when extraFee=0):

    curl -s 'https://gridproxy.grid.tf/nodes?rentable=true&rented=false&status=up&size=100' \
      | python3 -c 'import json,sys; [print(n["nodeId"], n["country"], n.get("extraFee")) for n in json.load(sys.stdin) if (n.get("extraFee") or 0)>0]'
    

    Tonight we used node 3467 (Canada, farm 646 JimboTFT, $91.80/mo + 10000 mUSD extraFee).

  2. Rent it via ComputeService.rent_node({node_id}) through the daemon's UDS socket at ~/hero/var/sockets/hero_compute_zos/rpc.sock. Substrate-ack arrives in ~10s; poll rent_status until state=done; verify on chain at gridproxy.grid.tf/contracts?twin_id=<your_twin>&type=rent (should see state=Created).

  3. Register the catalog: set_tfgrid_node_ids({node_ids: "<id>"}), then node_unregister (if stale rows from a prior session), then node_register, then confirm via list_nodes. Slice math depends on node MRU/SRU; node 3467 yields 6 slices of ~5 GiB MRU each.

  4. Deploy a VM via ComputeService.deploy_vm({name, slice_count, secret, image, ssh_keys, node_sid}):

    • image can be a friendly name ("Ubuntu 24.04", "Alpine", etc.); the daemon resolves it to the canonical flist URL since 1f59151.
    • OR a full https://hub.grid.tf/...flist URL (still works).
    • Returns state=running in 60-90s; persists 2 contracts on chain (network + VM).
  5. Cleanup at end of session: delete_vm({sid, secret}) per VM (substrate-cancel each contract pair); node_unregister (requires zero VM rows in the daemon's compute_db; note that delete_vm does NOT remove the local row, you may need to manually rm ~/hero/var/compute_tfgrid/data/root/cloud/vm/*.otoml for stale error-state rows); finally cancel_rent_contract({contract_id, node_id}). Rent contract billing stops on substrate-Deleted (about 20s for the ack).
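
Two mechanics from the recipe above, polling `rent_status` until `state=done` and tearing down in the listed order, can be sketched as follows. This is an illustration under assumptions: `poll_until` and `cleanup` are hypothetical helpers, `call(method, params)` stands in for whatever RPC client talks to the daemon socket, and the empty params passed to `node_unregister` are a guess.

```python
import time


def poll_until(fetch_state, done, timeout_s=120.0, interval_s=2.0,
               sleep=time.sleep, clock=time.monotonic):
    """Call fetch_state() until it returns `done` or the timeout elapses."""
    deadline = clock() + timeout_s
    while True:
        state = fetch_state()
        if state == done:
            return state
        if clock() >= deadline:
            raise TimeoutError(f"still {state!r} after {timeout_s}s")
        sleep(interval_s)


def cleanup(call, vm_sids, secret, rent_contract_id, node_id):
    """Run the end-of-session teardown in the order given in step 5.

    `call(method, params)` is a hypothetical RPC function supplied by
    the caller; the node_unregister params are an assumption.
    """
    for sid in vm_sids:
        call("delete_vm", {"sid": sid, "secret": secret})
    call("node_unregister", {})
    call("cancel_rent_contract",
         {"contract_id": rent_contract_id, "node_id": node_id})


# Usage with stand-ins instead of a live daemon.
states = iter(["pending", "pending", "done"])
final_state = poll_until(lambda: next(states), done="done",
                         sleep=lambda _seconds: None)
print(final_state)

calls = []
cleanup(lambda method, params: calls.append(method),
        ["0001", "0002"], "example-secret", 1234, 3467)
print(calls)
```

The sketch does not cover the manual removal of stale `.otoml` rows mentioned in step 5; that remains a hand step when `node_unregister` refuses to run.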

### How to test / debug a deploy (the TFGRID_DEBUG=1 recipe)

The TFGrid SDK has a built-in debug-trace mode gated on an undocumented env var. Always enable it for any hero_compute investigation; without it the SDK is silent and the only thing you get back is the bare ZOS-side error.

Run the daemon manually with explicit env (hero_proc's normal supervised launch can't be used because the service.toml has default="info" for RUST_LOG without a from_secret line, so secret-store updates don't propagate to RUST_LOG for this service):

    source ~/hero/cfg/init.sh && source ~/hero/cfg/env/env.sh
    hero_proc service stop my_compute_zos_server
    sleep 2
    # remove the stale socket left behind by the supervised instance
    rm -f "$HOME/hero/var/sockets/hero_compute_zos/rpc.sock"
    # read mnemonic without echoing length or value
    TM=$(hero_proc secret get TFGRID_MNEMONIC --context core --quiet 2>/dev/null | grep '^value:' | sed 's/^value: *//')
    # launch with TFGRID_DEBUG=1 + debug-level for our daemon module
    TFGRID_MNEMONIC="$TM" TFGRID_NETWORK=main TFGRID_NODE_IDS=<your_node> \
      RUST_LOG='info,my_compute_zos_server=debug' TFGRID_DEBUG=1 \
      PATH_ROOT="$PATH_ROOT" HERO_SOCKET_DIR="$HERO_SOCKET_DIR" \
      nohup "$HOME/hero/bin/my_compute_zos_server" > /tmp/compute_debug.log 2>&1 &

The trace then includes lines like:

  • [tfgrid-debug] workload states for contract X: data=ok, 0052=error (per-workload state, the most diagnostic single line we never had before tonight)
  • [tfgrid-debug] deployment X appeared on node twin Y
  • [tfgrid-debug] decrypting cipher payload from twin Y for zos.deployment.get
  • The FULL workload JSON sent to ZOS (this is how we found the image-name-not-URL bug)

After the investigation: hero_proc service start my_compute_zos_server restores normal supervised mode.
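
When scanning a long debug log, the per-workload state line is the one worth extracting. Below is a small parsing sketch, assuming real log lines match the shape quoted above (`workload states for contract X: name=state, ...`). The function names and the sample contract number are made up for illustration.

```python
import re

# Assumed line shape, taken from the example trace line quoted above.
_STATES_RE = re.compile(
    r"\[tfgrid-debug\] workload states for contract (\S+): (.*)"
)


def parse_workload_states(line):
    """Return (contract, {workload: state}) or None if the line differs."""
    match = _STATES_RE.search(line)
    if match is None:
        return None
    contract, rest = match.groups()
    states = {}
    for pair in rest.split(","):
        name, sep, state = pair.strip().partition("=")
        if sep:
            states[name] = state
    return contract, states


def failed_workloads(lines):
    """Yield (contract, workload) for every workload reported as error."""
    for line in lines:
        parsed = parse_workload_states(line)
        if parsed is None:
            continue
        contract, states = parsed
        for name, state in states.items():
            if state == "error":
                yield contract, name


sample = [
    "[tfgrid-debug] deployment 123 appeared on node twin 9",
    "[tfgrid-debug] workload states for contract 123: data=ok, 0052=error",
]
print(list(failed_workloads(sample)))
```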

### Chain-state checks (useful during cleanup or debugging)

    # All active contracts under your twin
    curl -s 'https://gridproxy.grid.tf/contracts?twin_id=<twin>&state=Created&size=20'

    # Specific contract status
    curl -s 'https://gridproxy.grid.tf/contracts?contract_id=<id>'

    # Node status + free capacity (use the list endpoint, not /nodes/<id> which omits totals)
    curl -s 'https://gridproxy.grid.tf/nodes?node_id=<id>&size=1'
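
For scripted checks, the same three queries can be built programmatically. A minimal sketch using only the standard library follows. The helper names are hypothetical, the twin and contract numbers are made up, and only the endpoints and query parameters shown in the curl commands above are used.

```python
from urllib.parse import urlencode

GRIDPROXY = "https://gridproxy.grid.tf"


def contracts_url(**filters):
    """URL for the contracts list endpoint with the given query filters."""
    return f"{GRIDPROXY}/contracts?{urlencode(filters)}"


def node_url(node_id):
    """URL for one node via the list endpoint (which includes totals)."""
    return f"{GRIDPROXY}/nodes?{urlencode({'node_id': node_id, 'size': 1})}"


# Twin 42 and contract 1234 are placeholder values.
print(contracts_url(twin_id=42, state="Created", size=20))
print(contracts_url(contract_id=1234))
print(node_url(3467))
```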

### What's left to home#235 closure (also in §0 of the issue body)

| # | Focus | Est | Depends on |
|---|---|---|---|
| **s157e** | hero_compute#121 fix: post-deploy poll loop populates `mycelium_ip` from workload `result.data` so `get_vm` returns a real IPv6. Then SSH-verify end-to-end with the throwaway probe key. | 1-2h | none |
| **s158** | Admin-on-TFGrid: rent dedicated node, deploy admin VM via `deployer.provision_vm`, install Hero stack via `setup-binaries.sh`, configure `hero_proxy` + `deploy_webgateway` per D-28, surface public URL. | 3-4h | s157e |
| **s159** | hero_books default-load wire: rebase `s153_default_libraries` (+104 LOC parked since s153) on clean baseline, squash, redeploy admin VM's hero_books with the 4 public content repos auto-loaded. | 2-3h | independent of s158 |
| **s160** | Full user-journey live walk on the public admin URL: mint test Forge user, walk SSH key upload + token paste + password reset + provision their VM + login to their cockpit + load content in hero_books. Surface any gaps as Forge issues for s161. | 3-4h | s158 + s159 |
| **s161** | Close home#235: PATCH body with final outcome, post closure comment with full s158-s160 evidence, flip state to closed. File Track F multi-VM scaling as separate `hero_os_tfgrid_deployer` issue. | 30 min | s160 green |

Total remaining: ~10-15 hours focused work to arc closure.

### Pointers for anyone picking up

  • End-user verification matrix (admin POV + user POV + cross-arc boundary rows): home/docs/channels/free/e2e_checklist.md. Seeded at s145 + first-verified at s146; s160 walk produces the final round of Status flips before closure.
  • prompt.md §3 in the workspace is the next-session entry point (rewritten at each /stop).
  • sessions/157d.yml has the full s157d trace including the TFGRID_DEBUG=1 discovery, the fix shape, the multi-tenant proof, and the methodology lessons.
  • decisions/D-29-deploy-vm-image-resolution-and-rentable-extrafee-gate.md locks the architectural decisions (image resolution location, rentable-extraFee>0 demo target).
  • Discipline rules in force across all remaining sessions: feedback_squash_merge_gate (pause for explicit OK before every squash-merge), feedback_d10_t2_squash_to_development_no_pr (local squash + direct push to development, no PR), feedback_signoff_no_email (commit body trailer is Signed-by: mik-tf <mik-tf@noreply.invalid> literal, no git commit -s), feedback_authorship (no co-author trailers, no AI attribution).
Author
Owner

## Phase 1 closed - Phase 2 picks up at home#237

Phase 1 of the demo-deployer arc ships its substrate. Recapping what is now live:

  • First public Hero OS URL at https://hcockpit.gent01.qa.grid.tf/, TLS terminated at the TFGrid Web Gateway, forwarding to admin VM over Mycelium.
  • Admin VM provisions tester VMs end-to-end on TFGrid via deployer.provision_vm. Multi-tenant proven (admin VM and a walker child VM co-located on the same rented QA node).
  • Cockpit and services management UI live, 35-set service install via setup-binaries.sh per-user manifest, hero_proc supervision, BYO key paste flow, feedback handoff to public Forge repo.
  • Deployer admin UI for users and VMs committed to development branch in session 160 (hero_os_tfgrid_deployer@7036a9f). Users list, create user, provision VM, regenerate password, destroy user/VM all wired. Not yet deployed to the admin VM because it would expose privileged actions to the open internet without an auth gate.
  • Cockpit /welcome paste-token onboarding shell committed in session 160 (hero_cockpit@8ef1108) as the interim onboarding flow. Becomes a fallback once Phase 2 SSO ships.
  • Eight decisions locked: D-22 (Forge token namespacing), D-23 (SSH custody), D-26 (self-hosted compute), D-27 (substrate await on-chain ack), D-28 (TLS at TFGrid Web Gateway), D-29 (image-name resolution and extraFee gate), D-30 (demo target QA fallback). Plus D-18 in hero_onboarding (PKCE-S256 OAuth wire spec).
  • Default content libraries preloaded on first boot (4 public docs repos: docs_hero, geomind, ourworld, coopcloud public variants), session 159.
  • End-to-end SSH proof into a real TFGrid VM via Mycelium (session 157f), the first such proof in workspace history.

The executable checklist is at 47 Have / 20 Need / 4 Blocked across 71 rows.
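
A tally like the one above can be recomputed from the checklist file rather than counted by hand. The sketch below assumes the matrix is a markdown table with the status in a fixed column. The column index, the sample rows, and the function name `tally_statuses` are illustrative, not taken from the real `e2e_checklist.md`.

```python
from collections import Counter

STATUSES = ("Have", "Need", "Blocked")


def tally_statuses(markdown, status_column):
    """Count Have / Need / Blocked cells in one column of a markdown table."""
    counts = Counter()
    for line in markdown.splitlines():
        if not line.startswith("|"):
            continue
        cells = [cell.strip() for cell in line.strip().strip("|").split("|")]
        if status_column < len(cells) and cells[status_column] in STATUSES:
            counts[cells[status_column]] += 1
    return counts


# Made-up sample rows; the real checklist layout may differ.
sample_table = """\
| Action | Status | Layer |
|---|---|---|
| admin creates user | Have | integration |
| tester opens Books | Need | e2e |
| tester resets password | Have | e2e |
| SSO login | Blocked | e2e |
"""
counts = tally_statuses(sample_table, status_column=1)
print(dict(counts), sum(counts.values()))
```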

What is left to make the demo a self-service tester environment (no operator hand-holding required) is now scoped as Phase 2 at home#237. Phase 2 ships Forge SSO across admin and user surfaces, admin allowlist gating, OAuth token persistence for ongoing Forge API access on the user's behalf, and the redeployed live walk. Roughly 4 focused sessions of work.

Closing this issue as shipped. Phase 2 continues the arc.

Reference
lhumina_code/home#235