[META] Hero OS demo-deployer arc tracker (cockpit + proxy + content + deployer + manifest + integration) #235
lhumina_code/home#235
Demo-deployer arc: tracker
Scope: all work needed to go from "lab + CI + a one-off TFGrid VM proving the binaries work" (where we are now, post-hero_demo 09f8365 / s132) to "a team operator types a username in an admin tool and gets back a Forge-OAuth-gated Hero OS demo VM that the user logs into, sees their cockpit, manages their services".
Primary tracker for this arc. PATCHed at each session close.
Current state (Track A s158 close, 2026-05-25): FIRST PUBLIC HERO OS URL LIVE on TFGrid. Pivoted mainnet -> QAnet (twin 703 / FreeFarm2 node 5 / $0 TFT) per newly-minted D-30. Admin VM provisioned via deployer.provision_vm (sid 0062, 16 GB RAM, mycelium-SSH'd in Ubuntu 24.04). Phase 0.5 shipped hero_compute@8f7a2b7 extending D-27 inline-await + rollback_orphans pattern from deploy_vm to deploy_webgateway (closes hero_compute#126; 2 new gateway_hint_tests; pre-merge gate clean). Live-verified on both Ok-path (49s deploy -> state ready) AND rollback-path (4 orphan name+node contracts cancelled cleanly across 2 failure modes; first live D-27 gateway extension proof). Public URL: https://hcockpit.gent01.qa.grid.tf/hero_cockpit/web/services behind TFGrid Web Gateway TLS (D-28 topology, gateway node 2 zone gent01.qa.grid.tf, same zone as Mahmoud's reference instance). End-to-end user walk proven: walker user s158_walker_<ts> minted via deployer.create_user + SSH pubkey uploaded to Forge (D-23 alt-2) + deployer.provision_vm minted child VM sid 0068 (8 GB, mycelium-SSH'd) co-located on same rented node 5; multi-tenant topology proven.
NEW operational runbook landed: docs/channels/free/admin-vm-deployment-runbook.md (commit b352729), a step-by-step recipe from rent -> provision -> setup-binaries -> deploy_webgateway -> tester handoff, with the 6 install/runtime workarounds discovered at s158 explicitly catalogued and linked to tracking issues.
Demo-app scope clarified: prior framing of s159 as just "hero_books default-load" was too narrow. The canonical demo profile per hero_cockpit#1 §6 enables hero_books + hero_slides + hero_whiteboard + hero_voice + hero_agent + hero_planner + hero_collab on top of bootstrap-core. hero_books default-load may already auto-fire via setup-binaries.sh HERO_BOOKS_DEFAULT_REPOS env wire (the s153 deferred scope); needs live-verify on the admin VM at s159 /start.
7 new Forge follow-ups filed for Mahmoud window: hero_compute#127 service.toml env for TFGRID_NETWORK; hero_proxy#55 IPv6 dual-stack seed bind (blocks public-URL reachability; manual workaround in runbook); hero_cockpit#7 landing-page relative URL bug; hero_cockpit#8 dark/light mode inconsistent across pages; hero_demo#67 setup-binaries.sh missing secret pre-population (includes bare-key-vs-context-prefixed slot ambiguity lesson); hero_compute#128 workload-name client-side validation; hero_skills#303 lab build --download --install silently passes without installing binaries.
State at s158 close: admin VM + walker child VM + rent contract 84920 + gateway sid 0067 ALL UP, intentionally left running through s159+s160 (zero TFT cost on QA). Twin 14199 mainnet treasury baseline 40 untouched. Realistic readiness: 70% testable. Guided demo with verbal walkthrough works; self-service for a stranger needs s159 (landing-page fix + workaround sweep) + s160 (AIBROKER_DEMO_KEY staging for AI tier + BYO Forge token UI test). Remaining arc: s159 (sweep ~3-4h) -> s160 (AI keys + BYO test ~3-4h) -> s161 (this issue closure, 30 min). Total ~6-8h.
Previously (Track A s157e close, 2026-05-25): CI GREEN ON hero_compute. s157e shipped hero_compute@e845455 on development, repairing 7 of 16 integration tests broken since 8be3294: a 6-LOC COMPUTE_TEST_FAKE_DEPLOY test seam added to operator_twin_id in crates/my_compute_zos_server/src/cloud/grid_driver.rs (mirroring the existing seam in deploy_on_tfgrid in the same file) + 9 placeholder image-name updates "img" → "Ubuntu 24.04" in crates/my_compute_zos_server/tests/integration.rs. CI run 1299 ✅ green on e845455 in 215s vs run 1297 = failure on 9857630. Workspace fully synced at /start: every D-07 35-set repo git pull origin development (no development_mik branches outstanding); hero_compute pulled in 2 new Mahmoud commits (2f07330 rent→reserve UI rename + 9857630 admin_mode toggle). Original s157e scope (mycelium_ip capture + SSH-verify) renamed to s157f. Next: s157f (mycelium fix + SSH verify, 1-2h, ~$1-2 TFT) → s158 → s159 → s160 → s161 closure. Same ~10-15h envelope.
Previously (Track A s157d close, 2026-05-25): DEPLOY_VM FULLY UNBLOCKED. Root cause of hero_compute#125: the daemon passed the user-facing image string (e.g. "Ubuntu 24.04") straight through to the TFGrid SDK as the zmachine workload's flist field; ZOS expected a URL and silently rejected the workload with state=Error + empty result.error. Discovered via the SDK's undocumented TFGRID_DEBUG=1 env var (gates trace_step() calls in tfgrid_sdk_rust/src/grid_client/mod.rs:2361), which surfaced per-workload state + the full workload JSON showing the literal name in the flist field. Fix shipped: hero_compute@1f59151 on development adds IMAGE_REFERENCE_MAP const + resolve_image_reference() helper called once at top of deploy_vm (pass-through for https:// URLs, lookup for known names, friendly InvalidInput error otherwise). Live verify on rented dedicated node 3467 (Canada, farm 646 JimboTFT, RentContract 2095174 under twin 14199 ops): VM sid 0053 via URL + VM sid 0054 via name-resolved → both state=running, contracts 2095179/2095180 + 2095181/2095182 persisted on chain. Multi-tenant pattern proven: 2 distinct VMs on same rented node, distinct slices, distinct secrets. All cleaned at /stop: 4 VMs deleted, node unregistered, RentContract 2095174 cancelled (substrate-ack 20s). Twin 14199 active contracts = 0; treasury 6905 baseline 40 untouched. D-29 minted (D-29 file) locking (a) image-name-resolution in the daemon, (b) demo target = any rentable+extraFee>0+up dedicated node on TFGrid mainnet (substrate gate is node.extraFee > 0, NOT node.rentable: True alone; FreeFarm-specific constraint REMOVED). #125 closed. Track A continues solo.
prompt.md §3 projects from this issue.
Decisions and meeting source: hero_os_tfgrid_deployer#1 (despiegk's Main Story / minutes; authoritative, not edited from here).
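The image-name resolution described in the s157d note (pass-through for https:// URLs, lookup for known names, friendly InvalidInput error otherwise) can be sketched as follows. This is a minimal Python illustration of that shape, not the hero_compute code: the real helper lives in the Rust daemon, the map contents are not given in this issue, and the flist URL below is a placeholder assumption.

```python
from typing import Mapping

# ASSUMPTION: placeholder table. The real IMAGE_REFERENCE_MAP entries are not
# listed in this issue; this URL is illustrative only.
IMAGE_REFERENCE_MAP: Mapping[str, str] = {
    "Ubuntu 24.04": "https://example.invalid/flists/ubuntu-24.04.flist",
}


class InvalidInput(ValueError):
    """Raised when an image reference is neither a URL nor a known name."""


def resolve_image_reference(image: str) -> str:
    """Return a URL suitable for the zmachine workload's flist field."""
    if image.startswith("https://"):
        # URLs pass through untouched.
        return image
    try:
        # Known user-facing names resolve to their flist URL.
        return IMAGE_REFERENCE_MAP[image]
    except KeyError:
        known = ", ".join(sorted(IMAGE_REFERENCE_MAP))
        raise InvalidInput(
            f"unknown image {image!r}; pass an https:// URL or one of: {known}"
        ) from None
```

Resolving once at the top of the deploy call means the failure surfaces as a readable error on the caller's side instead of a silent state=Error from ZOS.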
1. Foundation (where we are now)
2026-05-25 update (post-s157d) — DEPLOY_VM WORKS. Remaining path to end-user self-serve flow
Where we are: deployer.provision_vm (the operator-facing API that mints a Forge user + Forge token + reads the user-uploaded SSH key + calls hero_compute.deploy_vm) now produces a state=running VM on a rented dedicated mainnet node. The full Track D D1-D5 ladder is end-to-end live for the first time since 2026-05-23.
End-user-journey checklist (what "user clicks public link → ... → uses hero AI stack" requires, mapped to remaining sessions):
cockpit/USER_FORGE_TOKEN slot (D-16); admin form exists
deployer.regenerate_password (D-24, s143)
deployer.provision_vm → ComputeService.deploy_vm → state: running. Hero stack (35-set) installs via setup-binaries.sh (Track E, s151).
hero_compute.wait_until_running returns before mycelium_ip is populated in the workload result; daemon's get_vm returns empty mycelium_ip (hero_compute#121). Easy fix now that TFGRID_DEBUG=1 visibility exists.
https://<their-domain>/)
+104 LOC parked on hero_books local branch s153_default_libraries since s153 abort
Remaining sessions (estimated 10-15 hours focused work to home#235 closure):
deploy_webgateway per D-28, surface public URL
s153_default_libraries (hero_books +104 LOC default-load wire) on clean baseline; squash on hero_books development; redeploy admin VM's hero_books so the public URL serves the 4 default content repos out of the box
For anyone picking up this arc: start at prompt.md §3 (rewritten at each /stop). Sessions/157d.yml has the full s157d trace including the TFGRID_DEBUG discovery + fix shape + multi-tenant proof. The feedback_squash_merge_gate + feedback_d10_t2_squash_to_development_no_pr + feedback_signoff_no_email + feedback_authorship discipline rules apply throughout.
2026-05-23 update (mid-session pivot) - demo VM bumped to 16 GB; Track C C3 deferred to post-arc follow-up; arc compresses to 9 sessions
free_mru=17179869184 against farm_ids=1) confirms FreeFarm has nodes with that headroom. The ram_size change is a parameter on deployer.provision_vm, not a code change in the deployer or hero_compute. Surfaces at the User POV walkthrough and at the multi-user session.
2026-05-23 update (post-s148) - self-host daemon up on TFGrid mainnet; D-26 minted; FreeFarm (farm_id=1) locked as the demo deploy target
hero_compute_zos daemon supervised on TFGrid mainnet. Squash 844676c on hero_compute development appends the canonical [[env]] PATH_ROOT/HERO_SOCKET_DIR/RUST_LOG block to my_compute_zos_server/service.toml, mirroring the s147 hero_router fix. lab build --release --install --workspace clean (8 of 8 built, 0 failed). lab service my_compute_zos_server --install --start brings the daemon up at PID 3102124, raw JSON-RPC over Unix socket at ~/hero/var/sockets/hero_compute_zos/rpc.sock. Mainnet wallet sourced from TF_VAR_mnemonic in ~/hero/cfg/env/env.sh (the same wallet that funded the s132 OpenTofu deploy); stored under core/TFGRID_MNEMONIC plus the existing core/TFGRID_NETWORK=main. The hero_proc supervisor injects core-context secrets into the daemon environment at spawn, so no service.toml from_secret indirection is needed.
ComputeService.list_images returns the 5 official VM images; ComputeService.node_register queries TFChain mainnet Grid Proxy and returns a real ComputeNode record; ComputeService.node_status reads it back byte-identical from the local persistence layer. The sr25519 keypair derived from the mnemonic produces public key 58f481018853f18b403369537940d8e3a7bb61f36eafe8fff38fab281f230965 (the operator's TFChain identity).
decisions/D-26-self-host-hero-compute-mainnet.md. Workspace next-free advances to D-27. Devnet fallback path stays warm via TFGRID_MNEMONIC_DEVNET in env.sh.
onlytwinadmincandeploy check fires only on dedicated farms and is moot for our demo posture. The s132 OpenTofu deploy of herolab is prior art: the operator's wallet already exercised the substrate contract-submission path successfully on mainnet under a different code wrapper.
Owning underlying hardware (registering and operating our own farm) is a stronger sovereignty story but out of scope for the home#235 arc; public-tenancy on FreeFarm is the right level of effort for a demo.
core/TFGRID_NODE_IDS to a FreeFarm node via a single Grid Proxy lookup: GET https://gridproxy.grid.tf/nodes?farm_ids=1&free_mru=8589934592&status=up. No archaeology.
hero_os_tfgrid_deployer/.../compute.rs:30 has a hardcoded service-name path constant /hero_compute_mos/... that must become /hero_compute_zos/... or configurable; web.rs:206-229 parses HERO_COMPUTE_NODE_ADDR as a TCP host:port (correct shape, but the local value still needs to be decided to route through hero_router to the new self-hosted UDS).
deploy_vm round-trip (provision → Mycelium-IPv6 SSH ping → delete). s150 hero_proc#121 fix and downstream sessions s151 through s157 unchanged.
2026-05-23 update (post-s147) - self-host pivot + 10-session arc to closure locked
my_compute_zos_server is our repo, our code, our CI auto-publish. We host the instance ourselves using TF_VAR_mnemonic from ~/hero/cfg/env/env.sh (the same mainnet TFGrid wallet used by the s132 OpenTofu deploy; 12-word BIP39 verified populated). TFGRID_NETWORK=main is already set in hero_proc secret core context. Zero deployer code changes required: existing D4 implementation already calls ComputeService.deploy_vm against whichever endpoint HERO_COMPUTE_NODE_ADDR points at; we point it at our local UDS instead of a remote endpoint.
hero_compute_mos_server endpoint) is no longer gating any session. A comment will be posted at s148 close noting that Mahmoud's endpoint can be added as a future second adapter when convenient; meanwhile we run on our own instance.
hero_planner promoted to the default cockpit services profile (user requirement 2026-05-23). The repo is already in the D-07 demo service set (Tier B per memory/project_demo_service_set.md), already in hero_demo/deploy/single-vm/scripts/d07_set.txt, and already has .forgejo/workflows/lab-publish.yaml wired for CI auto-publish. What was missing is exposure in the default cockpit-services.toml profile alongside hero_books / hero_slides / hero_whiteboard / hero_call / hero_voice / hero_agent. Folded into s151 (Track E E1) scope.
my_compute_zos_server on mainnet (mints D-26 for self-host architecture lock); s149 D5 live-smoke on mainnet (first real grid.tf VM via deployer.provision_vm); s150 hero_proc#121 fix (bulk service.status_all RPC + cockpit adoption, mints D-27); s151 Track E E1 setup-binaries manifest refactor + hero_planner in default profile; s152 Track C C3 smaller embedder model (MiniLM-L6, ~80 MB for 8 GB VM fit); s153 Track B B1+B3 hero_proxy config templates + TLS strategy decision; s154 Track C C1+C2 public content repos + hero_books default-load; s155 User POV walkthrough on the live mainnet VM (incl.
hero_planner row walks); s156 Track F F1 multi-user end-to-end on mainnet; s157 Track F F2+F3 RAM-fit + multi-user isolation + this issue closure PATCH.
ON DELETE RESTRICT migration) remains the most recent Track D landing (s144 380b992). All Track D status unchanged; D5 live-smoke just had its blocker removed.
2026-05-23 update (post-s145) - methodology + arc-spec session: master-tracker E2E checklist artifact + Mahmoud ops ask + s142 follow-ups all filed
home/docs/channels/free/e2e_checklist.md (fee7f0c): executable companion to the existing home/docs/channels/free-and-paid.md narrative. 71 rows across Admin POV / User POV / Cross-arc boundaries, FREEZONE / hero_assistance D-18 row format, all rows sourced from the meeting minutes + decisions + free-and-paid.md + a code-reading pass on hero_cockpit + hero_os_tfgrid_deployer. Status column is seed-pass only; human verification of every Have row is the s146 head.
hero_compute_mos_server endpoint (host:port + node_sid). The only outstanding pre-req for the deployer's first live deploy_vm + get_vm + delete_vm round-trip. Gates s147.
lhumina_public/feedback exists, §8 Books backfill is queued separately). Dropped from prompt.md §3.
ON DELETE RESTRICT migration) remains the most recent Track D landing (s144 380b992). All Track D status unchanged.
hero_assistance/.
e2e_checklist.md. Effort tier medium. Output is updated Status columns + audit-log entry + follow-up issues for any clearly-needed feature gaps surfaced during the walkthrough.
hero_compute_mos_server). Gated on hero_compute#118 reply + core/FORGEJO_TOKEN + deployer/FORGE_TOKEN re-population.
2026-05-22 update (post-s143) - Track A s143 = Track D D2.1 lifecycle-symmetry polish + Phase B.5 FK-silently-OFF fix + D-24 mint
deployer.delete_user (refuse-if-vms per D-24: caller must cascade via deployer.delete_vm first), deployer.delete_vm (compute-first then sqlite per D-24: orphan-recoverability asymmetry), deployer.regenerate_password (single-use disclosure shape mirroring create_user.initial_password). Two squashes on development/main: hero_lib ce653c0a (+ ForgeClient::delete_user_ssh_key + ForgeClient::update_user_password admin methods, +33 LOC); hero_os_tfgrid_deployer 3508cd1 (+3 RPC methods + sqlite migration scaffold + FK enforcement + 5 new db tests, 8 files +479/-25).
PRAGMA foreign_keys was silently OFF in db.rs (sqlite's default foreign_keys=OFF made the vms.user_id REFERENCES users(id) FK a no-op: DELETE FROM users would orphan vms rows with no error). Fixed as a one-line PRAGMA foreign_keys = ON in Db::open + Db::open_in_memory. Test fk_enforcement_blocks_delete_user_with_vms pins the constraint.
PRAGMA user_version. The s143 initial migration is the current schema with CREATE TABLE IF NOT EXISTS, so pre-migration dev DBs bootstrap cleanly into user_version=1 without ALTER. Foundation for D-25+ schema bumps.
delete_user, (b) compute-first then sqlite for delete_vm, (c) PRAGMA foreign_keys=ON as second-line guard, (d) accepted operational gap: lost vm_secret makes a VM unrecoverable from deployer side. Workspace D-NN advances to D-25 (reserved for the s144 ON DELETE RESTRICT migration head).
deployer.create_user|get_user|list_users|delete_user|regenerate_password|provision_vm|list_vms|delete_vm).
deployer/FORGE_TOKEN was rotated post-s141 and not re-populated this session.
_admin rebuild remains queued.
HERO_COMPUTE_NODE_ADDR) or E1 Forge group/repo per-user.
2026-05-22 update - workspace housekeeping + Track B re-activation (s2-016) under hero_assistance-alignment scope
compaction-2026-05-22): CLAUDE.md + prompt.md + prompt2.md + prompt-common.md compacted 445→53 KB (−88%). Pre-compaction snapshot at archive/2026-05-22-compaction/.pipeline-config.yaml tracking_issue updated from hero_demo#52 → home#235 to match this arc as the live tracker. CLAUDE.md now leads with home#235 as headline framing. Manifest: sessions/compaction-2026-05-22.yml. No arc code touched, no D-NN/L-NN minted.
lhumina_code/hero_assistance/ with the canonical Hero service template per hero_assistance#15. Pre-archive scope (hero_onboarding v0 spec) preserved as historical; reactivates on the same Track-D-/api/deploy-vm trigger if needed. Three squashes on hero_assistance/development: f81aecc (prior session's #14 squash-merge), c059c1a (Wall 1 rusqlite u64→i64 + Wall 2 reqwest rustls-tls swap), 5330a0f (Phase A drop 5 Dioxus crates + D-26 minted hero_assistance-repo-local retiring D-09/D-17/D-22/D-25 atomically). 6 hero_assistance issue closures (#7/#9/#10/#11/#12/#14 + #13 auto-closed). New meta-issue hero_assistance#15 opens the multi-phase alignment arc (Phases A through E). L-08 (workspace) retro-closed. CI green; releases/tag/latest = 4 musl binaries + 4 md5 sidecars. Workspace lab infocheck 4 clean / 0 findings (was 4/4/20). Procedural skip flagged: worked in shared lhumina_code/hero_assistance/ checkout NOT the worktree-isolated ../hero_assistance-track-agent-2/ (future Track B sessions MUST use the worktree per CLAUDE.md "Cross-track coordination").
prompt.md §3; alts D4 first-hero_compute-call or D2.1 D2 polish; pick at /start. The two tracks can run concurrently going forward.
2026-05-21 update (post-s139) - pivot: hero_os_tfgrid_deployer IS the deployment path
f880247). See hero_cockpit#1 for the closed-as-shipped checklist.
herolab.gent02.grid.tf retired 2026-05-21. Destroyed via make destroy ENV=herolab. 5 OpenTofu resources released (grid_deployment, grid_name_proxy, grid_network, 2 random_bytes). Gateway FQDN + mycelium IPv6 released. The s132 manual-deploy proof is done; we don't deploy that way again.
hero_os_tfgrid_deployer is now the canonical deployment path for every Hero OS VM, both free-demo and paid-arc-pool. hero_demo make deploy is no longer used for VM provisioning. The deployer's per-user cockpit-services.toml manifest drives the setup-binaries dispatch, the hero_proxy install, the OAuth wiring, and the webgateway binding, all as parts of the deployer's standard post-deploy flow, not as standalone-VM concerns.
Original 2026-05-20 foundation status (preserved for historical context)
What was working at session 132 (2026-05-20), the proof-of-concept that established the build/install mechanic now embedded inside the deployer's post-deploy flow; the standalone hero_demo make deploy path is retired:
lab (in hero_skills) builds the D-07 35-set. Workstation + VM-side native builds pass. mycelium is the 35th and is excluded on TFGrid since it ships natively via zinit.
.forgejo/workflows/lab-publish.yaml on every push to development. Each repo refreshes its releases/tag/latest with linux-musl-x86_64 (CLI) + linux-x86_64-gnu (daemons with ONNX) artefacts. See hero_skills#268 (rollout) + hero_skills#269 (per-repo cleanup catalogue, closed).
lab build $repo --download --install on a fresh Ubuntu 24.04 TFGrid VM with no Rust toolchain installs all 34 (mycelium skipped) end-to-end, including the 3 ONNX libraries to ~/hero/lib/. This mechanic now lives inside the deployer's post-deploy flow (D4).
make deploy ENV=herolab (in hero_demo) provisions one VM via OpenTofu in ~60 s. make setup-binaries runs the lab consumer-side install loop. Now: same OpenTofu provider is available to the deployer's D3 adapter as a secondary path (the primary path is hero_compute via Mahmoud's API once free-form sizing lands).
Known open followups on the foundation:
--bind 0.0.0.0 + put hero_proxy in front of it (Track B below, scoped as deployer-integrated config rather than a standalone install). See hero_router#74.
2. Roadmap - 6 tracks, ~17 sessions remaining (was ~24-26 pre-pivot)
Each track has a slot in the prompt.md §3 session map. Sessions continue from s140.
Order (post-2026-05-21 pivot): Track A closed at s139. Track D becomes critical-path and runs s140-s14X. Tracks B/C run as local code work in parallel with Track D, then merge into Track D's standard per-user manifest. Track E feeds into Track D's post-deploy flow. Track F validates end-to-end after D + E ship.
Track A - hero_cockpit (end-user UI on the VM) - ✅ CLOSED s133-s139
Spec: hero_cockpit#1. Scaffolded from hero_template. cargo check + lab infocheck clean. ✅
lhumina_public/feedback) + Manual pages. ✅
~/hero/cfg/cockpit/services.toml) read/write + profile switching. ✅
lab update. ✅
cockpit.expose_service/unexpose_service) via hero_proxy domain.add admin API. ✅
Track D - hero_os_tfgrid_deployer (admin tool) - CRITICAL-PATH, ~5 sessions
Umbrella: hero_os_tfgrid_deployer#2. Sub-issues: #3 D1 / #4 D2 / #5 D3 / #6 D4 / #7 D5 / #8 D6. Workspace scaffold landed on 2026-05-20 (ab061f5b → 76919265): 4-crate workspace + JSON-RPC plumbing.
Goal: an admin tool that, given a Forge username, autonomously provisions a Hero OS demo VM end-to-end — Forge account lifecycle + SSH key gen + hero_compute deploy_vm + setup-binaries dispatch (hero_proxy + cockpit + Track C content all included) + deploy_webgateway + Forge OAuth wiring. No human-in-the-loop after the form submit.
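The "no human-in-the-loop after the form submit" goal implies an ordered pipeline where each stage either completes or stops the run with a clear report. The sketch below illustrates only that control flow in Python; it is not the deployer's code. The step names are taken from the Goal paragraph, while `run_pipeline`, `ProvisionReport`, and the callable signature are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple

# A step is a (name, action) pair; the action takes the Forge username.
# ASSUMPTION: this shape is illustrative, not the deployer's real interface.
Step = Tuple[str, Callable[[str], None]]


@dataclass
class ProvisionReport:
    username: str
    completed: List[str] = field(default_factory=list)
    failed_step: Optional[str] = None
    error: Optional[str] = None

    @property
    def ok(self) -> bool:
        return self.failed_step is None


def run_pipeline(username: str, steps: List[Step]) -> ProvisionReport:
    """Run steps in order; stop at the first failure and record where."""
    report = ProvisionReport(username)
    for name, action in steps:
        try:
            action(username)
        except Exception as exc:  # any step failure aborts the run
            report.failed_step = name
            report.error = str(exc)
            break
        report.completed.append(name)
    return report
```

A caller would pass steps in the order the Goal lists them (Forge account, SSH key, deploy_vm, setup-binaries dispatch, deploy_webgateway, OAuth wiring). The `completed` list tells a cleanup routine which stages already ran and may need undoing.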
POST /admin/pool-refresh integration so the deployer can feed VMs into the paid arc's pool (see §5.5 convergence point). Or defer to a hero_onboarding-side session if agent 2's s2-015 picks up first.
Track B - hero_proxy config + OAuth + TLS - 3 sessions (parallel with Track D)
Reframed (2026-05-21 pivot): Track B is no longer "install hero_proxy on a standalone VM." It is now configuration + integration work that becomes part of the deployer's standard per-user manifest. All work is local code on hero_proxy repo + the deployer's manifest templates; no TFGrid VM required until Track D's D4 picks up the manifest at provision time.
hero_proxy config templates for the standard demo VM shape (/ → cockpit admin.sock, /<service>/ → service admin sockets). Lands as a docs + service.toml + default-config PR on hero_proxy.
hero_proxy as a Forge OAuth client at forge.ourworld.tf (operational, needs Forge admin); define auth_mode=oauth + oauth_provider=forge.ourworld.tf + allowed_pubkeys=[<user_forge_id>] template that the deployer instantiates per-user at provision time.
name_proxy for TLS termination (simpler: Mahmoud's deploy_webgateway handles cert), or LE/certbot inside the VM (more control). Picks one; documents the choice; the deployer instantiates accordingly.
Track C - Public content + smaller models - 3-4 sessions (parallel with Track D)
lhumina_public/docs_owh_public + (optionally) mycelium-public-docs equivalent. Populate with safe demo content.
hero_books to default-load these public repos on a fresh VM (config + manifest changes; no VM required to develop).
s150 hero_agent. Slot in when ready.
Track E - setup-binaries.sh per-user manifest refactor - ✅ CLOSED s151
Lives in hero_demo; tracked at hero_os_tfgrid_deployer#8 (closed). Critical for Track D's D4 post-deploy flow.
hero_demo/deploy/single-vm/scripts/setup-binaries.sh to read per-user cockpit-services.toml manifest + always-on bootstrap-core (hero_proxy + hero_router + hero_proc + hero_cockpit) + small-embedder flag (EMBEDDER_MODEL_SIZE=small); falls back to d07_set.txt when no manifest present; new DRY_RUN=1 mode. Landed as hero_demo 20f03ba (+207/-38). Bonus: hero_cockpit 558e737 adds hero_planner ManualEntry + new manual page. ✅
Track F - Integration + validation - 2-3 sessions
3. Dependency map across repos
hero_compute lifecycle API
hero_template (scaffold base)
hero_proxy (OAuth + URL mapping)
hero_voice end-to-end
hero_web_template
hero_template
Mahmoud's hero_compute caveats (load-bearing for Tracks B / D / F)
From hero_compute#116#35305:
TFGrid lifecycle surface (what works):
deploy_vm, delete_vm, list_vms/get_vm, vm_logs, node_register/node_status/node_unregister, set_tfgrid_node_ids, list_slices/get_slice, node_stats, list_images, get_deployment_logs/list_deployments, get_ssh_keys/set_ssh_keys (per-secret store, not push-into-VM), list_jobs/job_logs, deploy_webgateway/list_webgateways/get_webgateway/delete_webgateway/list_gateway_nodes.
TFGrid stubs (error):
start_vm, stop_vm, restart_vm, inject_ssh_keys, vm_exec, vm_stats, attach_hypervisor, migrate_secret.
Constraints that shape the deployer + cockpit:
publicip, no rootfs, no independent disk parameter. Our 8 GB demo VM ⇒ 2 slices ⇒ 2 vCPU (not 16; the 16 we saw in s132 was the OpenTofu path, which sets a different shape).
ssh_keys=[…] to deploy_vm; no inject-after-create. Affects D2 (deployer's Forge user lifecycle): generate the SSH key + pass at deploy_vm time.
hero_proc service calls on services running inside the VM). The deployer's admin UI shows a "destroy + redeploy" action, not three separate buttons.
vm_exec/vm_stats. Setup-binaries dispatch uses SSH (already true post-s132). Cockpit's system_info reads RAM/disk from the VM's own /proc/meminfo + df + sysinfo crate, not via hero_compute.
deploy_vm returns immediately with state="provisioning"; poll get_vm until state="running" AND mycelium_ip set. Same async pattern for delete_vm (→ deleting → record disappears) and deploy_webgateway.
user/profile into VM name OR keep the join in the deployer's sqlite (we'd use the deployer's sqlite; schema already has the foreign key).
ComputeService listens on a Unix domain socket; per-call auth is the secret parameter (sr25519-signed token from node's TFGRID_MNEMONIC or raw ownership token). A remote deployer reaches it via hero_router (TCP entry point + context/claim auth) or an SSH tunnel; no built-in network auth.
Cross-track follow-ups Mahmoud offered to file: (a) metadata: map<str,str> on Vm spec, (b) free-form sizing + publicip, (c) remote-auth model. We will track each as they get filed.
4. Out of scope (initial demo)
5. Cross-links
09f8365: herolab env + setup-binaries.sh (s132)
5.5 - Paid-tier overlay (hero_onboarding, Track B)
The demo-deployer arc above (Tracks A-F) ships a company-paid free demo of Hero OS. A parallel arc — hero_onboarding (separate scope, tracked at hero_onboarding#1) — adds the paid commercial overlay on top of the same substrate. Same deployer, same cockpit, same proxy; differs only in front-gate behavior.
Status (post-2026-05-21, s2-009): Phases 1-7 shipped on lhumina_code/hero_onboarding/development: mycelium proof-of-control login (D-12), Stripe + ClickPesa payments (D-13), per-node billing pipeline (D-14), Idenfy KYC at /payment/intent gate (D-15), VM allocation via PoolAssignmentProvisioner (D-17, race-renamed from D-16). 4 crates, 47 unit tests, 4 white-box smokes, all green. Acceptance: cargo + lab + smoke matrix per hero_onboarding#2-#8.
Convergence points with this issue's tracks:
POST /admin/pool-refresh on hero_onboarding (the API hero_onboarding will land in s2-010 Phase 8), or hero_onboarding eventually adds a DeployerProvisioner impl that triggers allocate-on-demand (v1.5 behind the D-17 trait). Either flow ships in 1-2 hero_onboarding sessions once Track D's API is live.
/vm/allocate hands the user a URL pointing at cockpit on the assigned VM. Auth substrate: Forge OAuth via forge.ourworld.tf, which aligns hero_onboarding, hero_cockpit, and the deployer on a single platform-wide user identity. hero_onboarding offers dual login (locked at s2-011 Phase 9 as D-18): Forge OAuth as the default low-friction signup, AND mycelium proof-of-control (D-12 from s2-003) as an alternative for sovereignty-minded users. User row carries both forge_id? and mycelium_address? slots; at least one populated per row, both linkable post-signup via /account/link-*. SSO to cockpit always uses the user's Forge ID; mycelium-only users link a Forge account when they first hit cockpit. The VmAllocation row captures the user's forge_id at allocate time so the deployer/cockpit can grant access. Fully reversible: ~30 LOC to flip back to mycelium-only (if the boss decides sovereignty-first is the only path); ~50 LOC removed to flip to Forge-only (if mycelium turns out unused). Dual-auth is the deliberately-least-committed stance.
<vm>.<node>.grid.tf, hero_onboarding at e.g. onboarding.heroos.com). No special integration needed; both are HTTP backends to the proxy.
What hero_onboarding deliberately does NOT do: anything VM-side (cockpit, system_info, services, BYO keys; that's all Track A). Anything HTTPS / TLS-termination (that's Track B). Anything direct-to-TFGrid (that's hero_compute, slot for v1 TfchainAutoDeployProvisioner per hero_compute#116).
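The dual-login row shape from the convergence points above (both forge_id? and mycelium_address? optional, at least one populated, cockpit SSO always keyed on the Forge ID) can be expressed as a small invariant check. This is a Python sketch for illustration only; hero_onboarding's actual schema and types are not shown in this issue, and `UserRow`, `needs_forge_link`, and `cockpit_sso_identity` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class UserRow:
    # Both slots are optional, mirroring forge_id? / mycelium_address?.
    forge_id: Optional[str] = None
    mycelium_address: Optional[str] = None

    def __post_init__(self) -> None:
        # Invariant from the text: at least one identity populated per row.
        if self.forge_id is None and self.mycelium_address is None:
            raise ValueError("at least one of forge_id / mycelium_address is required")

    def needs_forge_link(self) -> bool:
        """True for mycelium-only users, who must link Forge before cockpit."""
        return self.forge_id is None


def cockpit_sso_identity(user: UserRow) -> str:
    """Return the identity used for cockpit SSO, which is always the Forge ID."""
    if user.forge_id is None:
        raise PermissionError("link a Forge account before opening cockpit")
    return user.forge_id
```

Keeping the "at least one" rule in the row type rather than in each handler makes either reversal mentioned above (mycelium-only or Forge-only) a change to one constructor check.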
What hero_onboarding WAS over-built for, vs the free demo: payment + KYC + per-node billing pipeline + credit-decrement at allocation time. These are commercial-flow concerns that the free demo has no need for. They're not wasted — they're the paid tier — but they're worth noting so this issue's readers know hero_onboarding's scope is intentionally wider than what the free demo needs.
Sequencing (hero_onboarding's next 4 sessions, all parallel to Track A's work — no Track A blocking on Track B or vice versa):
cockpit_url field on PoolVm + VmAllocation; POST /admin/pool-refresh admin route (atomic in-memory pool swap, takes VM_POOL_JSON shape); stub release() cleanup-hook (logs "would call deployer-release here"); dashboard "Open cockpit →" link; forge_id? schema slot reserved for Phase 9. Makes hero_onboarding deployer-pluggable AHEAD of Track D shipping.
VmAllocation row captures forge_id at allocate time for the SSO bridge to cockpit. D-18 lock on dual-auth model with revisability annotations.
forge.ourworld.tf for paid-tier-allocated VMs.
docs/operator-runbook.md; configuration validator (hero_onboarding --check-prod-config fails-fast on missing prod env vars); NO live charge tests at this session (deferred to launch day).
expires_at - new vm_auto_release_cron action; production-cadence rehearsal of producer (1h) + aggregator (5min) + auto-release (1h) for a full day in a dev environment.
After s2-014, hero_onboarding pauses until either (a) Mahmoud closes hero_compute#116 gaps → s2-015 = v1 TfchainAutoDeployProvisioner, or (b) Track D ships its deployer API → s2-015 = v1.5 DeployerProvisioner + real pool-feed integration. Whichever lands first.
D-NN race rule for cross-track decisions: per prompt-common.md, first-minted wins. Track A's D-16 (cockpit-byok-user-forge-token-namespacing) took D-16 on 2026-05-21 08:46 EDT, 40 minutes ahead of Track B's parallel D-16 mint (provisioner-trait-shape); Track B re-numbered to D-17. Next free D-NN is D-19 (D-18 reserved for the s2-011 Phase 9 dual-auth model lock).
6. Per-track status (updated each session close)
f880247)hero_proxyrepo + Forge OAuth registrationNotes on the reorder:
hero_demo make deployprovisioning.Track A closed — s139 = A7 URL mapping landed
f880247— 11 files, +1229/-3. Track A1-A7 all done across s133-s139.What ships
cockpit.expose_service { service, subdomain }pushes a real route intohero_proxyvia the existingdomain.addRPC — no hero_router changes, single-repo session. The cockpit owner picks a subdomain in /services, cockpit callshero_proxy.domain_add({ domain: "<sub>.<base>", target_type: "socket", target: "$HERO_SOCKET_DIR/<service>/admin.sock", https_redirect: true, enabled: true }), persists to~/hero/cfg/cockpit/exposures.toml, and the row in /services becomes a clickable link.expose_service / unexpose_service / list_exposures / get_base_domain / set_base_domainexposures.rsmodule + 6 unit tests (mirrors s137 manifest.rs)cockpitcontext per D-16 (never touches operator state)Phase B finding worth tracking
hero_router has NO custom path-prefix alias surface today: the per-service reverse proxy at /{service_name}/{webname} is hard-coded socket-dir-derived (see crates/hero_router/src/server/routes.rs:2275-2286). router.add is OpenRPC-sidebar registration, not HTTP routing. The s139 spec adherence to §8 subdomain shape (via the already-built hero_proxy) sidesteps this entirely, but the gap is real for any future feature that wants custom path-prefix routing through hero_router rather than subdomain through hero_proxy.
Verification
mik.herodemo.gent01.grid.tf route_id=1 in hero_proxy + clickable link in /services; idempotent re-expose; conflict errors; unexpose dual-removes; idempotent re-unexpose returns existed:false.
Arc rotation due at s140
Track A is closed. s140 picks the next track to enter from B/C/D/E/F per the status table above. My read on priorities:
Will defer the actual choice to the next session's planning step. Full session narrative at sessions/139.yml.
Cross-link: email / notifications strategy
Filed a small meta-issue locking the email-provider choice for both arcs: home#236, META Email / notifications strategy. Locked at D-20 (decisions/D-20-email-provider-sendgrid.md).
Decision: SendGrid for all transactional emails originated by either the demo-deployer arc or hero_onboarding. Sender domain TBD; it gets picked at the first session that writes email-sending code. Forge-native + Stripe / Idenfy / ClickPesa platform emails stay on their respective backbones.
No immediate action for any Track A-F session; the rule is live in prompt.md §3 standing rules and the acceptance-criteria list lives in home#236.
Cross-link: two-channels overview
Added a cross-arc overview doc at home/docs/channels/free-and-paid.md (commit bfbf552).
Audience: engineers + stakeholders. Walks through the free testing channel (admin-driven community evaluation, this issue's scope) and the paid commercial product (hero_onboarding#1) end-to-end: four UX flows, shared substrate, where the channels touch each other, and explicit out-of-scope per channel.
Not a replacement for this issue. This issue stays the engineering tracker — per-track status table, session map, compute caveats. The new doc is the cross-arc reader's-eye view that this issue intentionally doesn't try to be.
mik-tf referenced this issue from lhumina_code/hero_os_tfgrid_deployer 2026-05-21 21:59:37 +00:00
mik-tf referenced this issue from lhumina_code/hero_assistance 2026-05-22 13:52:24 +00:00
Added an executable companion to the existing two-channels narrative at home/docs/channels/free-and-paid.md. The new file lives at home/docs/channels/free/e2e_checklist.md and makes the free-testing channel of the Hero stack inspectable at row grain. Opens with a short story-logic recap of what the admin does end-to-end and what the tester does end-to-end (so a non-engineer stakeholder can read the file top-down), then drops into a matrix where one row equals one user-facing action, with Have / Need / Blocked status, test-pyramid layer, and a pointer to the source (decision file, meeting-note section, RPC method, or template). Scope is integration-level only: "tester can open Books from cockpit" is one row here, the rest stays in hero_books's own checklist. Pattern lineage is hero_assistance D-18 (originally from znzfreezone_deploy/docs/dev/e2e_checklist.md), with an audit log at the top for status regressions. Initial seed has 57 rows across admin POV / user POV / cross-arc boundaries, sourced from the meeting notes plus current decisions plus a code-reading pass on hero_cockpit and hero_os_tfgrid_deployer. Next session is the verification pass: walk a local cockpit install and flip each Have row based on observation.
Update (s157d, 2026-05-25): deploy_vm UNBLOCKED + full deploy/test mechanics + remaining roadmap
Headline: the 6-day deploy_vm investigation is resolved. Fix shipped at hero_compute@1f59151 (closes hero_compute#125). Multi-tenant pattern (one rented dedicated node, two distinct VMs co-located on it) live-verified tonight. The issue body §Current state is fully updated; this comment surfaces the deploy + test mechanics + remaining-session list as a single read for anyone picking up.
What was wrong (one paragraph)
hero_compute's deploy_vm passed the user-facing image string (e.g. "Ubuntu 24.04") straight through to the TFGrid SDK as the zmachine workload's flist field. ZOS expects a URL there; given a name, ZOS silently sets the workload state=Error with an empty result.error and the daemon surfaces "vm deployment entered error state" with no actionable detail. Found by enabling the SDK's undocumented TFGRID_DEBUG=1 env var (gates trace_step() calls in tfgrid_sdk_rust/src/grid_client/mod.rs:2361), which printed per-workload state lines and the full workload JSON showing the literal name in the flist field. Fix is a 5-entry name→URL map in the daemon plus a resolve_image_reference() helper called once at the top of deploy_vm (pass-through for https:// URLs, friendly InvalidInput for unknown names).
How to deploy a VM end-to-end (the s157d recipe)
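The recipe accepts friendly image names because of the resolution step described above. A minimal Python sketch of that logic follows. The real helper lives in the Rust daemon; the two map entries, their placeholder URLs, and the InvalidInput exception class here are illustrative assumptions, not the shipped 5-entry table.

```python
# Hypothetical name -> flist URL table. The real daemon ships 5 entries;
# these URLs are placeholders, not the canonical flist locations.
IMAGE_FLISTS = {
    "Ubuntu 24.04": "https://example.invalid/flists/ubuntu-24.04.flist",
    "Alpine": "https://example.invalid/flists/alpine.flist",
}


class InvalidInput(ValueError):
    """Raised for an image string that is neither a URL nor a known name."""


def resolve_image_reference(image: str) -> str:
    """Return the flist URL to hand to the SDK for a user-facing image string."""
    if image.startswith("https://"):
        # Already a URL: pass it through untouched.
        return image
    try:
        return IMAGE_FLISTS[image]
    except KeyError:
        known = ", ".join(sorted(IMAGE_FLISTS))
        raise InvalidInput(
            f"unknown image {image!r}; known names: {known}"
        ) from None
```

Resolving once at the top of the deploy call means an unknown name fails before any contract is created, instead of surfacing later as an opaque error state.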
Prerequisites: env sourced (source ~/hero/cfg/init.sh && source ~/hero/cfg/env/env.sh), hero_proc supervisor running, TFGRID_MNEMONIC set in core/context, twin has TFT balance, hero_compute origin/development at 1f59151 or later.
Pick a rentable node with extraFee > 0 (the substrate-side public-rent gate; rentable: True alone is NOT enough, substrate rejects with OnlyTwinAdminCanDeploy when extraFee=0). Tonight we used node 3467 (Canada, farm 646 JimboTFT, $91.80/mo + 10000 mUSD extraFee).
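The selection rule above can be sketched as a filter over node records. The field names rentable and extraFee come from the text; the record shape (plain dicts with a nodeId key) and node 9999 are assumptions made up for illustration.

```python
from typing import Iterable


def publicly_rentable(nodes: Iterable[dict]) -> list[dict]:
    """Keep nodes that are rentable AND carry a positive extraFee."""
    return [
        n for n in nodes
        if n.get("rentable") is True and (n.get("extraFee") or 0) > 0
    ]


# Illustrative records only; node 9999 is invented to show the rejected case.
sample = [
    {"nodeId": 3467, "rentable": True, "extraFee": 10000},
    {"nodeId": 9999, "rentable": True, "extraFee": 0},
]
candidates = publicly_rentable(sample)
```

Only node 3467 survives the filter here; node 9999 is the shape that would be rejected with OnlyTwinAdminCanDeploy at rent time.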
Rent it via ComputeService.rent_node({node_id}) through the daemon's UDS socket at ~/hero/var/sockets/hero_compute_zos/rpc.sock. Substrate-ack arrives in ~10s; poll rent_status until state=done; verify on chain at gridproxy.grid.tf/contracts?twin_id=<your_twin>&type=rent (should see state=Created).
Register the catalog: set_tfgrid_node_ids({node_ids: "<id>"}), then node_unregister (if stale rows from a prior session), then node_register, then confirm via list_nodes. Slice math depends on node MRU/SRU; node 3467 yields 6 slices of ~5 GiB MRU each.
Deploy a VM via ComputeService.deploy_vm({name, slice_count, secret, image, ssh_keys, node_sid}):
- image can be a friendly name ("Ubuntu 24.04", "Alpine", etc.); the daemon resolves it to the canonical flist URL since 1f59151.
- https://hub.grid.tf/...flist URL (still works).
- state=running in 60-90s; persists 2 contracts on chain (network + VM).
Cleanup at end of session: delete_vm({sid, secret}) per VM (substrate-cancel each contract pair); node_unregister (requires zero VM rows in the daemon's compute_db; note that delete_vm does NOT remove the local row, you may need to manually rm ~/hero/var/compute_tfgrid/data/root/cloud/vm/*.otoml for stale error-state rows); finally cancel_rent_contract({contract_id, node_id}). Rent contract billing stops on substrate-Deleted (about 20s for the ack).
How to test / debug a deploy (the TFGRID_DEBUG=1 recipe)
The TFGrid SDK has a built-in debug-trace mode gated on an undocumented env var. Always enable it for any hero_compute investigation; without it the SDK is silent and the only thing you get back is the bare ZOS-side error.
Run the daemon manually with explicit env (hero_proc's normal supervised launch can't be used because the service.toml has default="info" for RUST_LOG without a from_secret line, so secret-store updates don't propagate to RUST_LOG for this service).
The trace then includes lines like:
- [tfgrid-debug] workload states for contract X: data=ok, 0052=error (per-workload state, the most diagnostic single line we never had before tonight)
- [tfgrid-debug] deployment X appeared on node twin Y
- [tfgrid-debug] decrypting cipher payload from twin Y for zos.deployment.get
After the investigation: hero_proc service start my_compute_zos_server restores normal supervised mode.
Chain-state checks (useful during cleanup or debugging)
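As one example of such a check, the rent-contract query mentioned in the deploy recipe can be built programmatically. This sketch only constructs the URL and performs no network call; the https scheme is an assumption (the recipe gives host and path without one), and twin id 1234 is a made-up sample value.

```python
from urllib.parse import urlencode

GRIDPROXY_HOST = "gridproxy.grid.tf"  # host as given in the deploy recipe


def contracts_url(twin_id: int, contract_type: str = "rent") -> str:
    """Build the gridproxy contracts query for one twin and contract type."""
    query = urlencode({"twin_id": twin_id, "type": contract_type})
    # Scheme is assumed; adjust if the proxy is reached differently.
    return f"https://{GRIDPROXY_HOST}/contracts?{query}"


print(contracts_url(1234))
```

Per the recipe, the response for a freshly rented node should show the rent contract with state=Created.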
What's left to home#235 closure (also in §0 of the issue body)
- mycelium_ip from workload result.data so get_vm returns a real IPv6. Then SSH-verify end-to-end with the throwaway probe key.
- deployer.provision_vm, install Hero stack via setup-binaries.sh, configure hero_proxy + deploy_webgateway per D-28, surface public URL.
- s153_default_libraries (+104 LOC parked since s153) on clean baseline, squash, redeploy admin VM's hero_books with the 4 public content repos auto-loaded.
- hero_os_tfgrid_deployer issue.
Total remaining: ~10-15 hours focused work to arc closure.
Pointers for anyone picking up
- prompt.md §3 in the workspace is the next-session entry point (rewritten at each /stop).
- sessions/157d.yml has the full s157d trace including the TFGRID_DEBUG=1 discovery, the fix shape, the multi-tenant proof, and the methodology lessons.
- decisions/D-29-deploy-vm-image-resolution-and-rentable-extrafee-gate.md locks the architectural decisions (image resolution location, rentable-extraFee>0 demo target).
- feedback_squash_merge_gate (pause for explicit OK before every squash-merge), feedback_d10_t2_squash_to_development_no_pr (local squash + direct push to development, no PR), feedback_signoff_no_email (commit body trailer is Signed-by: mik-tf <mik-tf@noreply.invalid> literal, no git commit -s), feedback_authorship (no co-author trailers, no AI attribution).
Phase 1 closed - Phase 2 picks up at home#237
Phase 1 of the demo-deployer arc ships its substrate. Recapping what is now live:
The executable checklist is at 47 Have / 20 Need / 4 Blocked across 71 rows.
What is left to make the demo a self-service tester environment (no operator hand-holding required) is now scoped as Phase 2 at home#237. Phase 2 ships Forge SSO across admin and user surfaces, admin allowlist gating, OAuth token persistence for ongoing Forge API access on the user's behalf, and the redeployed live walk. Roughly 4 focused sessions of work.
Closing this issue as shipped. Phase 2 continues the arc.