[META] Hero OS demo Phase 3 - complete admin and tester UX #238

Closed
opened 2026-05-27 02:17:42 +00:00 by mik-tf · 12 comments
Owner

s173 close (2026-05-28): Code-only ship of the four operator-and-tester UX issues filed during the s172d live walk. Two squash-merges landed on origin/development across hero_os_tfgrid_deployer and hero_cockpit. hero_os_tfgrid_deployer@a6fc6a4 adds a small polling loop to the admin per-user VMs table (new GET /users/{u}/vms.json endpoint; rows refresh in place every five seconds; full page reload only when install_state crosses a boundary that reshapes the action set) and appends /hero_cockpit/web/ to every rendered cockpit URL on the admin UI so the link lands on the cockpit rather than hero_proxy's service-discovery dashboard. hero_cockpit@ba02baa turns the cockpit Services page into the sandbox's service catalog: a new catalog module defines fourteen canonical service entries, two new RPCs (list_catalog and install_service) validate requests against the catalog so a hand-crafted call cannot trigger lab build on an arbitrary repo string, the page renders a greyed-out row per uninstalled catalog entry with a per-row Install button, and the URL column gains clickable cockpit-relative links derived from service.name suffix conventions for every running service with a web UI. Pre-merge gate green on both commits (fmt + clippy --workspace --all-targets -- -D warnings + 72 deployer server-lib tests + 22 cockpit server-lib tests with six new in catalog.rs + 16 cockpit web tests with four new for the URL deriver + workspace release build + --info smoke on both deployer binaries and all four cockpit binaries). Closes hero_cockpit#11, hero_cockpit#12, hero_os_tfgrid_deployer#19, and the install-side polling half of hero_os_tfgrid_deployer#18 (Provision-side async conversion deferred as a focused follow-up). 
Operator authorized a code-only autonomous shape with admin VM 0069 + alice123 + all QA live state explicitly off-limits, so the actual live verify + e2e_checklist row flips + Forge issue closes carry to the next operator-driven smoke + deploy cycle once CI republishes the binaries. No D-NN or L-NN minted. End-to-end checklist counts unchanged from s172d: 63 Have / 18 Need / 2 Blocked across 83 rows. Carry to next session (operator-driven, ~1-2h): lab build hero_os_tfgrid_deployer --download --install + lab build hero_cockpit --download --install on admin VM 0069, restart the four touched services via hero_proc, browser-walk the four UX behaviours, flip the rows that live evidence supports, post comments + close the four Forge issues. Then a following session picks up the deferred welcome-email pipeline (A-18 + B-1, with operator selection of resend.com vs SendGrid amending or superseding D-20), the BYO-key auto-start cascade in hero_cockpit, and the Provision-async conversion for the remaining half of hero_os_tfgrid_deployer#18.
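
The catalog check described in the s173 entry above can be pictured roughly as follows. This is a minimal sketch, not the code in hero_cockpit's catalog module: the `CatalogEntry` struct, its field names, the two sample entries, and the `resolve_install` helper are all assumptions made for illustration. The only point it shows is that an install request is looked up by name in a fixed list, so an arbitrary repo string never reaches the build step.

```rust
/// One entry in a hypothetical service catalog (field names are illustrative).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct CatalogEntry {
    pub name: &'static str,
    pub repo: &'static str,
}

/// Two sample entries only; the s173 notes mention fourteen canonical entries.
pub const CATALOG: &[CatalogEntry] = &[
    CatalogEntry { name: "hero_books", repo: "hero_books" },
    CatalogEntry { name: "hero_voice", repo: "hero_voice" },
];

/// Resolve a requested service name to a catalog entry, or reject it.
/// Only a resolved entry's `repo` would ever be handed to a build step.
pub fn resolve_install(requested: &str) -> Result<&'static CatalogEntry, String> {
    CATALOG
        .iter()
        .find(|entry| entry.name == requested)
        .ok_or_else(|| format!("service {requested:?} is not in the catalog"))
}
```

A caller would pass the name from the RPC request to `resolve_install` and stop on `Err`; the build input comes from the catalog entry rather than from the request.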

s172d close (2026-05-28): Per-tester Forgejo OAuth apps replace the workspace-shared model. Every tester VM now gets its own OAuth application minted on forge.ourworld.tf at provision time and reaped at delete time; each app's redirect_uris allowlist contains exactly one URI bound to that tester's own URL. A real Forge user (alice172d) walked the full SSO loop end-to-end on her own cockpit URL: an anonymous request returned 302 to Forge OAuth with her per-tester client_id, she set a permanent password on her first Forge login, saw the consent screen display her tester app name, accepted, and landed on her own Hero Cockpit. Three distinct OAuth client identities were proven live across three URLs (admin plus two testers), each with a single-URI redirect_uris allowlist private to that host. Six squash-merges shipped on hero_os_tfgrid_deployer/development closing every gap surfaced live: per-tester OAuth app create-and-delete wire with a new schema migration for the per-VM client_id and client_secret triple; the missing hero_proxy domain.add call that registers each tester URL as OAuth-gated; a health-poll loop replacing a fixed sleep that wasn't enough for a cold-start hero_proxy on a fresh VM; the hero_proxy allowlist secret being pushed to the wrong hero_proc context; and the empirical discovery that Forgejo rotates the OAuth client_secret as a side effect of every PATCH on an OAuth application, even when the body does not request rotation, which had been silently invalidating the admin VM's own SSO session config on every tester operation. Three Forge issues were filed for the demo's remaining UX polish: hero_cockpit#11 (services page Install button for components not yet installed), hero_cockpit#12 (clickable URL column for services with a web UI), hero_os_tfgrid_deployer#18 (admin UI progress indicator for Provision symmetric with Install). The admin VM deployment runbook gained a new caveat plus a one-shot OAuth-secret-rotation recovery appendix. 
The full tester onboarding flow was also live-walked by an operator through the admin UI without any curl scripts: register on Forge, set permanent password, upload SSH key, return to admin UI, see SSH key badge appear, click Provision (VM minted in 55s), click Install (9-minute cascade). End-to-end checklist rows: A-31 (per-tester hero_proxy allowlist) flipped from Need to Have; B-40 (tester opens cockpit URL on their own VM) caveat dropped. Counts moved to 63 Have / 18 Need / 2 Blocked across 83 rows. Carries to next session: A-18 plus B-1 welcome-email pipeline (operator selects resend.com vs SendGrid as the email provider), hero_books BYO-key auto-start cascade in the cockpit (when a tester pastes an AI key in Settings, hero_aibroker plus hero_books start automatically), and the three cockpit-polish issues above. Estimated nine to thirteen hours.
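
The s172d entry mentions a health-poll loop replacing a fixed sleep. A minimal sketch of that general pattern is below, assuming a caller-supplied probe closure; `wait_until_healthy`, its parameters, and its return type are hypothetical and are not taken from the deployer. The loop probes immediately, retries at a fixed interval, and gives up once the deadline passes, so a fast start returns early and a slow cold start is still given the full timeout.

```rust
use std::thread::sleep;
use std::time::{Duration, Instant};

/// Poll `probe` until it reports healthy or `timeout` elapses.
/// Returns the number of attempts made on success.
/// (Hypothetical helper; the probe is whatever health check the caller has.)
pub fn wait_until_healthy<F>(
    mut probe: F,
    interval: Duration,
    timeout: Duration,
) -> Result<u32, String>
where
    F: FnMut() -> bool,
{
    let deadline = Instant::now() + timeout;
    let mut attempts = 0u32;
    loop {
        attempts += 1;
        if probe() {
            return Ok(attempts);
        }
        if Instant::now() >= deadline {
            return Err(format!("not healthy after {attempts} attempts"));
        }
        // Wait before the next probe instead of sleeping one fixed, guessed duration.
        sleep(interval);
    }
}
```

Unlike a fixed sleep, the wait here is bounded by the timeout but ends as soon as the probe succeeds.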

s172c close (2026-05-28): The install pipeline now works LIVE end-to-end. A freshly provisioned tester VM walks through deployer.install_hero_stack to install_state=ready in roughly eight minutes; all twelve canonical components are running inside the tester VM; the tester's TFGrid Web Gateway URL serves the cockpit publicly (HTTP 200 on /, HTTP 303 on /hero_cockpit/web/ to the welcome page). Admin SSH co-injection verified live (workstation SSH into the tester's root via mycelium succeeded). Five squash-merges shipped on hero_os_tfgrid_deployer/development closing every install-pipeline gap surfaced live (mycelium IP not persisted, webgateway not cleaned on delete, PEM trailing newline lost on secret seed, missing curl on bare base image, listener seed not propagated via bash environment files, and a port-default drift between hero_proxy builds). The empirical lesson — that runtime config must flow through hero_proc's secret store rather than bash environment files — was codified as a locked architectural decision (workspace-private). A structural follow-up issue was filed: hero_os_tfgrid_deployer#17 to replace the bash install runner with a Rust crate consuming a typed manifest. A post-v1 vision issue was filed: hero_demo#68 for a service catalog UI that lets testers self-install any published Hero service. End-to-end checklist rows A-30 and B-40 flipped to Have-with-caveat (install pipeline live; full browser SSO walks pending propagation of the OAuth client secrets to tester VMs which is the next session's primary task, the same SSH-push pattern this session shipped). Counts moved to 62 Have / 19 Need / 2 Blocked across 83 rows. Carries to next session: propagate the four cockpit OAuth secrets to tester VMs via the same SSH-push pattern, then walk two testers simultaneously (alice plus bob) with browser SSO walks demonstrating per-tester isolation plus the admin symmetric trust verified across both cockpits; estimated three to five hours.
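
One of the gaps listed above is a PEM trailing newline lost on secret seed. The notes do not say how it was fixed; the sketch below only illustrates one defensive option, normalizing the value to end in exactly one newline before it is stored. The `normalize_pem` name is hypothetical.

```rust
/// Return `pem` with any trailing CR/LF characters replaced by exactly one '\n'.
/// (Hypothetical helper; shown as one possible guard, not the shipped fix.)
pub fn normalize_pem(pem: &str) -> String {
    let body = pem.trim_end_matches(|c: char| c == '\r' || c == '\n');
    format!("{body}\n")
}
```

Applying this at the point where the secret is seeded would keep the stored value stable regardless of how the input was read.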

s172b close (2026-05-27): The Forgejo OAuth callback URL list is now managed automatically by the deployer. Every time a new tester VM is provisioned, the deployer adds that tester's callback URL to the workspace OAuth application; every time a tester VM is deleted, the URL is removed. This eliminates the manual sixty-second step the operator previously did per tester through the Forge admin UI. Code shipped at hero_os_tfgrid_deployer@ec27241 (+354 LOC, +7 tests, pre-merge gate clean). The cutover required pivoting the production OAuth application from a site-admin app to a user-owned app because Forgejo only exposes OAuth applications through the per-user API path (confirmed by reading the Forgejo source for redirect URI matching). Browser walk by operator verified login still lands. The first attempted end-to-end install walk on a fresh tester surfaced four latent bugs in the install pipeline code that shipped earlier (none are design issues with the trust model; all are wire-up gaps), captured in the next-session plan with concrete fix recipes. The install live walk and the end-to-end checklist row flips carry to the next session, estimated three to four hours.
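
The add-on-provision, remove-on-delete bookkeeping described above can be sketched as two idempotent list operations. This is an assumption-laden illustration: the function names and the in-memory `Vec<String>` stand in for whatever the deployer actually reads from and writes back to Forgejo, and the example URL in the usage note is made up.

```rust
/// Add `uri` to the allowlist if it is not already present.
/// Returns true when the list changed (so the caller knows an update is needed).
pub fn add_redirect_uri(uris: &mut Vec<String>, uri: &str) -> bool {
    if uris.iter().any(|existing| existing == uri) {
        return false;
    }
    uris.push(uri.to_string());
    true
}

/// Remove every occurrence of `uri` from the allowlist.
/// Returns true when the list changed.
pub fn remove_redirect_uri(uris: &mut Vec<String>, uri: &str) -> bool {
    let before = uris.len();
    uris.retain(|existing| existing != uri);
    uris.len() != before
}
```

Under these assumptions, provisioning a tester would call `add_redirect_uri(&mut uris, "https://tester.example.test/callback")` and deleting the VM would call `remove_redirect_uri` with the same string; repeating either call leaves the list unchanged, which keeps retries safe.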

s171 close (2026-05-27): A-12 (deployer.provision_vm calls deploy_webgateway after deploy_vm, persists daemon-returned fqdn, surfaces on admin user_detail.html) shipped at hero_os_tfgrid_deployer@15e5473 (+492/-6 across 8 files: schema M3 webgateway_fqdn column via the canonical recreate-with-FK dance, ComputeAdapter.deploy_webgateway JSON-RPC wrapper, handle_provision_vm extension, admin Cockpit URL column with copy-to-clipboard, new TFGRID_GATEWAY_NODE_SID env block). Pre-merge gate clean: fmt + clippy -D warnings + 32 server-lib tests (+6 new) + --info smoke on both deployer binaries. Live walk on admin VM 0069 surfaced an API asymmetry (deploy_webgateway.node_sid takes raw TFGrid node_id, not the daemon-local catalog sid that deploy_vm.node_sid takes); pivoted secret to TFGRID_GATEWAY_NODE_SID=2 and retried, then three subsequent provision attempts each hit the daemon's 300s inline-await timeout (consistent QA substrate finalization slowness today; not a transient flake). Daemon-side rollback ran cleanly each time (2 orphan contracts cancelled per attempt). Filed hero_compute#131 requesting deploy_webgateway 300s timeout bump + env var + per-chain differentiation. A-12 flipped in docs/hero_os/free/e2e_checklist.md from Need to Have-with-caveat (code path complete + gateway-node selection live-verified through daemon logs; live URL gated on QA substrate window or hero_compute#131). 60 Have / 20 Need / 2 Blocked across 82 rows. Carries to s172: A-30 Hero stack auto-install on tester VM (design-locked + minimal vertical slice; SSH-and-run vs cloud-init vs pre-baked image decision needed up-front).
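
The node_sid asymmetry noted above (a raw TFGrid node_id for deploy_webgateway versus a daemon-local catalog sid for deploy_vm) is the kind of mix-up that distinct types can surface early. The sketch below is illustrative only: `GridNodeId`, `CatalogSid`, and `parse_gateway_node` are hypothetical names, and it assumes the gateway node id is a plain unsigned integer, as the `TFGRID_GATEWAY_NODE_SID=2` value suggests.

```rust
/// Raw TFGrid node id (assumed numeric), as deploy_webgateway is said to expect.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct GridNodeId(pub u32);

/// Daemon-local catalog sid, as deploy_vm is said to expect.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct CatalogSid(pub String);

/// Parse the gateway node setting, rejecting anything that is not a numeric id.
pub fn parse_gateway_node(raw: &str) -> Result<GridNodeId, String> {
    raw.trim()
        .parse::<u32>()
        .map(GridNodeId)
        .map_err(|e| format!("expected a numeric TFGrid node id, got {raw:?}: {e}"))
}
```

With separate types, passing a catalog sid where a grid node id is required becomes a compile-time or startup-time error rather than a failed live provision.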

Current state (s168 close, 2026-05-27)

Code shipped across three repos: hero_os_tfgrid_deployer@8c640cd (provisioning fixes, SSH-key readiness in admin UI, closes hero_os_tfgrid_deployer#11), hero_cockpit@a52b784 (admin scaffold redirect, Books card, hero-voice-bar), and hero_lib@2f46f8f5 (upstream tools/src/forge/client.rs fix). All deployed on the public QA admin VM at hcockpit.gent01.qa.grid.tf. Browser walk partial: cockpit admin redirect and Books card on tester landing confirmed; voice-bar, tester creation, default-image provision, regenerate-password, and Books navigation carry to the next session. Three follow-up issues filed: hero_router#113, hero_os_tfgrid_deployer#12, hero_voice#36.


Session 167 handoff, 2026-05-27: flow contract confirmed for the free-testing channel. Runbook creates the admin VM; allowlisted admins use /hero_tfgrid_deployer/admin/ to create/select Forge testers and provision child VMs through existing deployer/compute; testers use Forge to change password and upload SSH keys, then enter /hero_cockpit/web/ through SSO. Normal cockpit use must not require pasting a Forge API token; token paste remains fallback/headless. Paid onboarding/billing/KYC stays out of scope for this issue.

Phase 3 completes the hand-off-ready Hero testing environment after Phase 2 closed the SSO/auth substrate. The auth perimeter is already correct: cockpit and deployer paths on https://hcockpit.gent01.qa.grid.tf are restricted until Forge login, and the QA admin allowlist is mik-tf,scott,despiegk. This issue owns the post-login product UX and the final home/docs/hero_os/free/e2e_checklist.md walk.

Admin target UX: an allowlisted admin opens https://hcockpit.gent01.qa.grid.tf/hero_tfgrid_deployer/admin/, logs in through forge.ourworld.tf if needed, and lands on a real deployer admin dashboard, not scaffold text. The dashboard lets the admin list testers, see each tester's VM status, create a new tester, provision a VM, watch provisioning state (provisioning, starting, running, failed), regenerate a one-time password, destroy and redeploy a VM, delete or disable users where allowed, and see useful event/log/error details when something fails. The admin should not need CLI knowledge. Implementation should reuse hero_tfgrid_deployer for orchestration and hero_compute for VM lifecycle, gateway and state information.
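
The four provisioning states named above map naturally onto a small enum. This is a sketch under assumptions: the `VmState` type, the `is_settled` helper, and the idea that `running` and `failed` are the states where a dashboard can stop waiting are illustrative and not taken from the deployer's code.

```rust
use std::str::FromStr;

/// The provisioning states listed in the admin target UX.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum VmState {
    Provisioning,
    Starting,
    Running,
    Failed,
}

impl FromStr for VmState {
    type Err = String;

    /// Parse the lowercase state string; unknown values are rejected.
    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s {
            "provisioning" => Ok(VmState::Provisioning),
            "starting" => Ok(VmState::Starting),
            "running" => Ok(VmState::Running),
            "failed" => Ok(VmState::Failed),
            other => Err(format!("unknown VM state: {other:?}")),
        }
    }
}

impl VmState {
    /// True once the VM has reached a state that will not change on its own
    /// (assumption: `running` and `failed` are the settled states).
    pub fn is_settled(self) -> bool {
        matches!(self, VmState::Running | VmState::Failed)
    }
}
```

A dashboard built on this could keep refreshing a row while `is_settled()` is false and show error details once the state is `Failed`.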

Tester target UX: a tester receives the cockpit URL plus Forge credentials out of band, opens https://hcockpit.gent01.qa.grid.tf/hero_cockpit/web/, signs in through Forge SSO, grants consent once, and lands in Hero Cockpit as their personal Hero computer. The cockpit home should be clear and non-admin: app/services launcher, Books visible and openable, Settings visible, Manual/help visible, service status available without overwhelming the user, non-production demo warning visible, and no normal-user paste-token flow. Voice is in scope through local lhumina_code/hero_voice using hero_voice_widget / <hero-voice-bar>. Slides, Whiteboard and Call should only be presented as working if they are actually reachable; otherwise their checklist rows stay Need or Blocked.

Completion rule: e2e_checklist.md is the acceptance contract. A row flips to Have only after live browser verification on the SSO-gated QA URL. Code existing, RPC working, or a CLI-only workaround is not enough. Definition of done: a real admin can operate the tester and VM lifecycle from the browser, and a real tester can log in and use the core Hero cockpit without us standing next to them.

Author
Owner

One implementation guardrail for this issue: do not reinvent VM lifecycle or dashboard foundations. The admin side should reuse hero_compute for deploy, delete, gateway and state information, and hero_tfgrid_deployer should present that cleanly to operators. The tester side should connect the existing hero_cockpit surfaces and use hero_voice through hero_voice_widget for voice. This is primarily a wiring, deployment and UX verification pass over existing Hero components, with new code only where the checklist exposes a real gap.

Author
Owner

Additional implementation alignment for the next session: pull development before coding, use hero_ui_dashboard_admin for admin UX shape, hero_ui_dashboard_implementation for Rust/Askama/admin-lib wiring, and hero_service_implementation with the current lhumina_code/hero_service template as the service reference. First inventory the existing Hero stack, especially hero_compute, hero_tfgrid_deployer, hero_cockpit, and hero_voice, then connect what is already there. Do not reinvent VM lifecycle, service lifecycle, dashboard chrome, API docs, logs/jobs widgets, or voice integration.

Author
Owner

Session 167 planning/handoff note: the implementation target is the free-testing channel from docs/hero_os/overview.md, grounded in docs/hero_os/free/admin-vm-deployment-runbook.md.

Flow contract for s168:

  1. Runbook/admin bootstrap produces the dedicated-node admin VM with Hero stack, Forge OAuth/admin allowlist, hero_proxy, deployer, cockpit, and TFGrid webgateway.
  2. Admin operating UX is the deployer admin surface, /hero_tfgrid_deployer/admin/: create/select Forge tester, verify the tester has SSH keys, provision the tester child VM through existing hero_tfgrid_deployer + hero_compute, see state/errors/logs, and hand the child cockpit URL plus credentials to the tester out of band.
  3. Tester Forge prep is explicit: change generated password if required, upload SSH public key, and approve Forge OAuth on first cockpit visit.
  4. Tester cockpit UX is /hero_cockpit/web/: non-admin personal Hero cockpit with Books, services, settings, manual/help, feedback, demo warning, and hero_voice_widget visible/usable. Normal browser use should not depend on pasting a Forge API token; token paste remains fallback/headless only. BYO AI provider keys remain settings-page material.
  5. Paid-channel onboarding, billing, KYC, credit balance, and pre-warmed pool management are out of scope for this issue, though cockpit/proxy/compute fixes should remain compatible because both channels share the in-VM substrate.

Known s168 first checks: remove or route around the hero_cockpit_admin scaffold, fix deployer provisioning defaults/env (DEFAULT_IMAGE, HERO_COMPUTE_NODE_ADDR), add Books + voice discoverability, deploy, then live browser-walk e2e_checklist.md before flipping rows.

Author
Owner

Post-handoff SSH key safety clarification: provisioning must never silently create a tester VM without the tester's SSH public key injected.

Canonical custody remains Forge: the tester's public key is stored under the tester's Forge account. If cockpit offers an SSH-key helper, it should upload the public key to Forge under the tester identity; cockpit/deployer should not become the private-key custody system.

Existing server behavior already has the right invariant in hero_tfgrid_deployer_server/src/web.rs::handle_provision_vm: it calls forge.list_user_ssh_keys(username), returns an actionable -32602 error if none exist, and passes ssh_keys inline to ComputeService.deploy_vm when provisioning.

s168 acceptance should make this visible and verified in UX/checklist:

  • Admin user detail shows SSH key readiness/count before Provision VM.
  • Missing-key path blocks provisioning with clear instructions to upload a Forge SSH key.
  • Successful provision path shows ssh_key_count > 0 / equivalent evidence.
  • Browser walk verifies both missing-key fail and key-present success, so no tester VM is launched in a locked-out state.
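The missing-key gate described above can be sketched roughly as follows. This is a minimal illustration, not the code in `handle_provision_vm`: the `RpcError` type and the `require_ssh_keys` helper are hypothetical names, and only the `-32602` code and the "no keys means no provision" rule come from this thread.

```rust
/// JSON-RPC "invalid params" code, the one the handler is described as returning.
const INVALID_PARAMS: i64 = -32602;

/// Hypothetical error shape; the real handler's error type is not shown in this thread.
#[derive(Debug, PartialEq)]
pub struct RpcError {
    pub code: i64,
    pub message: String,
}

/// Refuse to provision when the tester has no usable Forge SSH public key.
/// On success, returns the cleaned keys to pass inline to the VM deploy call.
pub fn require_ssh_keys(username: &str, keys: Vec<String>) -> Result<Vec<String>, RpcError> {
    // Drop blank entries so an empty string never counts as a usable key.
    let usable: Vec<String> = keys
        .into_iter()
        .map(|k| k.trim().to_string())
        .filter(|k| !k.is_empty())
        .collect();

    if usable.is_empty() {
        return Err(RpcError {
            code: INVALID_PARAMS,
            message: format!(
                "user '{}' has no SSH public key on Forge; upload one before provisioning",
                username
            ),
        });
    }
    Ok(usable)
}
```

The length of the returned vector is the kind of value an `ssh_key_count > 0` check on the success path could be based on.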
Author
Owner

Update: code for the admin and tester UX work landed this session across three repos.

hero_os_tfgrid_deployer (8c640cd) ships the provisioning fixes (HERO_COMPUTE_NODE_ADDR is now actually read, default image is Ubuntu 24.04 so the resolver accepts it, and the admin user-detail page shows a Forge SSH-key badge and gates the Provision button when the tester has zero keys). It closes hero_os_tfgrid_deployer#11.

hero_cockpit (a52b784) routes the cockpit admin scaffold to the deployer admin via a 302 redirect, adds a Books card as the first card on the tester landing page, and wires the hero-voice-bar widget into the navbar with the canonical voice-widget asset links per the hero_voice_widget skill.

hero_lib (2f46f8f5) fixes an upstream regression in tools/src/forge/client.rs that was breaking downstream workspace builds.

All three are deployed to the public QA admin VM hcockpit.gent01.qa.grid.tf. The admin and cockpit binaries were installed via manual SCP because the Forgejo Actions release pipeline was wedged on missing token scope. The token scope and stale repo-level overrides were also cleaned up during the session, so future deploys should go through the canonical lab build pipeline.

The deployer/FORGE_TOKEN on the VM was rotated to a site-admin Forgejo token because the original token was non-admin and forge.create_user was failing with 403.

Browser-walk so far: the admin URL correctly lands on the deployer admin instead of the scaffold, and the Books card is visible on the tester landing. Voice-bar render, tester creation, provision_vm with the default image, regenerate-password, and Books navigation carry to the next session. Three follow-up issues were filed: hero_router#113 (a prefix-doubling bug that breaks /hero_books/web/), hero_os_tfgrid_deployer#12 (the admin navbar shows the OS username instead of the SSO user), and hero_voice#36 (operator note about redeploying hero_voice_admin when a host UI adopts the widget embed).

Author
Owner

**s169 closed 2026-05-27: verify-and-close walk + per-tester-VM arc made explicit + 5 UX squash-merges + multi-session roadmap to home#238 closure**

End-to-end admin + tester SSO browser walk on the public QA admin VM `0069`. Verified A-20 (admin user list), A-25 (regenerate password), A-28 (SSH-key readiness pre-flight in both states), cockpit landing with Books card + voice-bar rendered, all tester pages render. A-21 blocked by P0 [hero_os_tfgrid_deployer#13](https://forge.ourworld.tf/lhumina_code/hero_os_tfgrid_deployer/issues/13): `my_compute_zos_server` not running on admin VM (operational fix queued for s170).

**5 UX issues filed + fixed + closed via squash-merges**:

  • [hero_os_tfgrid_deployer#12](https://forge.ourworld.tf/lhumina_code/hero_os_tfgrid_deployer/issues/12): SSO username instead of OS username in admin navbar
  • [hero_os_tfgrid_deployer#14](https://forge.ourworld.tf/lhumina_code/hero_os_tfgrid_deployer/issues/14): Create-user success panel rewritten as SSO-first walk
  • [hero_os_tfgrid_deployer#15](https://forge.ourworld.tf/lhumina_code/hero_os_tfgrid_deployer/issues/15): Node SID help text matches reality
  • [hero_os_tfgrid_deployer#16](https://forge.ourworld.tf/lhumina_code/hero_os_tfgrid_deployer/issues/16): Bootstrap modal dialogs replace browser confirm()
  • [hero_cockpit#10](https://forge.ourworld.tf/lhumina_code/hero_cockpit/issues/10): Dropped `table-light` thead so dark theme is honored

**Squash-merges** (live on origin/development; redeploy queued for s170):

  • `hero_os_tfgrid_deployer@c649d76`
  • `hero_cockpit@08e7788`
  • `home@a0dd2f3`

**Per-tester-VM arc made explicit in `e2e_checklist.md`** with 4 new Need rows mapping the gap from today's state to executive summary lines 27/28/29/31-40:

  • A-29 Need: compute daemon prereq on admin VM (closes deployer#13 once registered)
  • A-30 Need: Hero stack present on freshly provisioned tester VM (exec line 27)
  • A-31 Need: per-tester `hero_proxy` allowlist on tester VM (exec line 28)
  • B-40 Need: tester opens cockpit URL on THEIR own VM (exec lines 31-40)
  • B-41 Need: tester uses Books / Slides / Planner / Agent on THEIR own VM

**Multi-session arc to home#238 closure**: s170 (compute daemon + UX redeploy + first Provision) → s171 (A-12 deploy_webgateway integration) → s172 (A-30 Hero stack auto-install) → s172-bis (A-31 per-tester allowlist) → s173 (full e2e walk + close).

**Counts** after s169: 57 Have / 23 Need / 2 Blocked across 82 rows (was 54/20/2 across 76).

See `sessions/169.yml` (local pipeline artifact) for full per-step record.
Author
Owner

## s171 close: A-12 deploy_webgateway after deploy_vm shipped

Code shipped at [hero_os_tfgrid_deployer@15e5473](https://forge.ourworld.tf/lhumina_code/hero_os_tfgrid_deployer/commit/15e5473) (+492/-6 across 8 files).

**What landed:**

  • Schema M3 added webgateway_fqdn TEXT NOT NULL DEFAULT '' to the vms table via the canonical recreate-with-FK dance (preserves the M2 ON DELETE RESTRICT FK).
  • ComputeAdapter gained a typed Webgateway struct + deploy_webgateway method wrapping the JSON-RPC envelope.
  • handle_provision_vm calls deploy_webgateway(name={user}-demo, kind=Name, fqdn="", backends=["http://[mycelium_ip]:9988"], tls_passthrough=false, secret=vm_secret, node_sid=<env>) after db.insert_vm, then persists the daemon-returned Webgateway.fqdn via the new update_vm_webgateway setter.
  • On failure leaves the VM running and surfaces webgateway_error in the JSON response for operator retry.
  • Admin user_detail.html adds a "Cockpit URL" column on the VMs table plus a Cockpit URL row in the post-Provision alert, both with a copy-to-clipboard button.
  • New [[env]] TFGRID_GATEWAY_NODE_SID block on hero_tfgrid_deployer_server/service.toml with default="".
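As a rough sketch of the `deploy_webgateway` parameters listed above: the struct and helper names below are hypothetical, and only the field values (name `{user}-demo`, empty `fqdn`, bracketed mycelium backend on port 9988 with an `http://` scheme, `tls_passthrough=false`) are taken from this comment. The real `ComputeAdapter` types may look different.

```rust
/// Only the `Name` kind is mentioned in this thread; other kinds are not modeled here.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum GatewayKind {
    Name,
}

/// Hypothetical request shape mirroring the parameters listed above.
#[derive(Debug, Clone, PartialEq)]
pub struct WebgatewayRequest {
    pub name: String,
    pub kind: GatewayKind,
    pub fqdn: String,
    pub backends: Vec<String>,
    pub tls_passthrough: bool,
    pub secret: String,
    pub node_sid: String,
}

/// Build the request for one tester VM.
pub fn webgateway_request(
    user: &str,
    mycelium_ip: &str,
    vm_secret: &str,
    gateway_node_sid: &str,
) -> WebgatewayRequest {
    WebgatewayRequest {
        name: format!("{}-demo", user),
        kind: GatewayKind::Name,
        // Left empty on purpose: the daemon returns the FQDN, the deployer never computes it.
        fqdn: String::new(),
        // IPv6 literals need brackets, and the backend must carry the http:// scheme.
        backends: vec![format!("http://[{}]:9988", mycelium_ip)],
        tls_passthrough: false,
        secret: vm_secret.to_string(),
        node_sid: gateway_node_sid.to_string(),
    }
}

/// Map the call outcome to the pair (webgateway_fqdn, webgateway_error).
/// On failure the FQDN stays empty, matching the column default, and the VM is left running.
pub fn fqdn_and_error(outcome: Result<String, String>) -> (String, String) {
    match outcome {
        Ok(fqdn) => (fqdn, String::new()),
        Err(err) => (String::new(), err),
    }
}
```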

Pre-merge gate: fmt + clippy -D warnings + workspace release build + 32 server-lib tests pass (+6 new); --info smoke clean on both deployer binaries.

Live walk on admin VM 0069: surfaced an API asymmetry where deploy_webgateway.node_sid takes the raw TFGrid node_id (e.g. "2") while deploy_vm.node_sid takes the daemon-local catalog sid (e.g. "0001"); pivoted the operator secret to TFGRID_GATEWAY_NODE_SID=2 and retried. Three subsequent attempts each hit the daemon's 300s inline-await timeout on the substrate write (consistent QA substrate finalization slowness today, not a transient flake). Daemon-side rollback ran cleanly each time, cancelling 2 orphan contracts per attempt. Filed [hero_compute#131](https://forge.ourworld.tf/lhumina_code/hero_compute/issues/131) requesting the 300s timeout be bumped, exposed as an env var, and differentiated per chain.
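One way to keep the two `node_sid` flavours from being mixed up again is to give each its own type on the deployer side. The sketch below is only an illustration under assumptions: `CatalogSid` and `GridNodeId` are hypothetical names, and the zero-padding heuristic is inferred from the two examples above (`"0001"` versus `"2"`), not from any documented format.

```rust
/// Daemon-local catalog sid, e.g. "0001" (the form deploy_vm is described as taking).
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct CatalogSid(String);

/// Raw TFGrid node id, e.g. 2 (the form deploy_webgateway is described as taking).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct GridNodeId(u32);

impl CatalogSid {
    /// Accept any non-empty string of ASCII digits.
    /// Note: this cannot tell "2" apart from a node id; it only checks the shape.
    pub fn parse(s: &str) -> Result<Self, String> {
        if !s.is_empty() && s.chars().all(|c| c.is_ascii_digit()) {
            Ok(CatalogSid(s.to_string()))
        } else {
            Err(format!("not a catalog sid: {:?}", s))
        }
    }

    pub fn as_wire(&self) -> &str {
        &self.0
    }
}

impl GridNodeId {
    /// Reject zero-padded input, which (by assumption) is a catalog sid pasted by mistake.
    pub fn parse(s: &str) -> Result<Self, String> {
        if s.len() > 1 && s.starts_with('0') {
            return Err(format!("{:?} is zero-padded and looks like a catalog sid", s));
        }
        s.parse::<u32>()
            .map(GridNodeId)
            .map_err(|e| format!("not a node id: {}", e))
    }

    pub fn as_wire(&self) -> String {
        self.0.to_string()
    }
}
```

With distinct types, a function that wraps `deploy_webgateway` can take a `GridNodeId` and one that wraps `deploy_vm` can take a `CatalogSid`, so passing the wrong one becomes a compile error rather than a 300s timeout.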

Phase B.5 adversarial review caught two protocol fixes before the code shipped: the deployer must read the daemon-returned fqdn (never compute it locally), and backends must carry an http:// scheme prefix.

A-12 row in docs/hero_os/free/e2e_checklist.md (renamed mid-session by another maintainer from docs/hero_os/free/) flipped from Need to Have-with-caveat. Code path complete and the gateway-node selection live-verified through daemon logs; live URL gated on QA substrate window opening or hero_compute#131 landing. Counts: 60 Have / 20 Need / 2 Blocked across 82 rows.

Cleanup: 4 throwaway VMs deleted, the throwaway Forge user purged via admin DELETE (verified 404), ephemeral SSH key scrubbed from /tmp. QA twin 703 RentContract 84983 and the admin VM 0069 gateway contracts untouched. Zero TFT cost.

Next session (s172) = A-30 Hero stack auto-install on tester VM (design-lock between SSH-and-run, cloud-init, and pre-baked image, then ship a minimal vertical slice with hero_proxy + hero_router + hero_proc + hero_cockpit running on a fresh tester VM).

Author
Owner

Design lock for the next session: after `provision_vm` mints the cockpit URL, the deployer installs the Hero stack on the tester VM over SSH using a stable installer keypair held in the admin VM's secret store. At provision time the deployer co-injects three pubkey sets into the new VM's `authorized_keys`: the tester's own Forge SSH key, the deployer's installer key, and the workspace admin SSH keys, so workspace admins keep standing root access to every tester VM for ops and debugging. The tester VM's `hero_proxy` is configured with a symmetric web allowlist: the tester's Forge identity plus the workspace admin Forge identities (`deployer/ADMIN_FORGE_USERS`), so SSH access and cockpit web access converge on the same identity set. The workspace registers one shared Forgejo OAuth app and the deployer patches its `redirect_uris` per Provision (append on provision, remove on delete). This is the sandbox trust model and is explicitly bounded to the Hero OS Tester Sandbox; the future paid-tier sovereign deploy inherits none of these defaults (no admin SSH co-injection, no admin in tester web allowlist, no shared installer key, no shared OAuth app). The A-30 canonical stack list also grows from 11 to 12 components with `hero_biz` joining; B-41 caption updated to match. Live walk and row flips for A-30 + A-31 land later in the session after the implementation phases.
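Two pieces of this design are mechanical enough to sketch: merging the three pubkey sets into `authorized_keys` content, and the append-on-provision / remove-on-delete handling of `redirect_uris`. The function names are hypothetical and the sketch works on plain strings; how the deployer actually talks to Forgejo or writes the file is not shown in this thread.

```rust
/// Merge tester, installer, and workspace-admin public keys into authorized_keys content.
/// Blank entries and duplicates are dropped; first-seen order is preserved.
pub fn authorized_keys(tester: &[String], installer: &str, admins: &[String]) -> String {
    let mut seen: Vec<&str> = Vec::new();
    let all = tester
        .iter()
        .map(String::as_str)
        .chain(std::iter::once(installer))
        .chain(admins.iter().map(String::as_str));

    for key in all {
        let key = key.trim();
        if !key.is_empty() && !seen.contains(&key) {
            seen.push(key);
        }
    }

    let mut out = seen.join("\n");
    if !out.is_empty() {
        out.push('\n'); // authorized_keys is line-oriented, so end with a newline
    }
    out
}

/// Append a tester's callback URI on provision (no-op if already present).
pub fn add_redirect_uri(uris: &mut Vec<String>, uri: &str) {
    if !uris.iter().any(|u| u == uri) {
        uris.push(uri.to_string());
    }
}

/// Remove a tester's callback URI on delete (no-op if absent).
pub fn remove_redirect_uri(uris: &mut Vec<String>, uri: &str) {
    uris.retain(|u| u != uri);
}
```

Making both URI operations idempotent means a retried Provision or a repeated delete cannot leave duplicate or dangling entries on the shared OAuth app.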
Author
Owner

## Session 172c close summary

**What shipped, live**

The install pipeline works end-to-end on a freshly provisioned tester VM. deployer.install_hero_stack advances the new VM through install_state none → installing → ready in about eight minutes. All twelve canonical components run on the tester. The tester's TFGrid Web Gateway URL serves the cockpit publicly: external HTTPS curl returns 200 on / and 303 on /hero_cockpit/web/ to the welcome page. Admin SSH co-injection verified live (operator workstation SSH into the tester's root over mycelium succeeded).
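The `install_state` progression named above (none → installing → ready) can be modeled as a small state machine. This is a sketch under assumptions: the enum and method names are hypothetical, and only the three states mentioned in this comment are modeled, so a failure state, if the deployer has one, is not covered.

```rust
/// The three install states named in this thread.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum InstallState {
    /// Stored as "none": VM provisioned, install not started.
    NotStarted,
    Installing,
    Ready,
}

impl InstallState {
    /// String form as it appears in the thread.
    pub fn as_str(self) -> &'static str {
        match self {
            InstallState::NotStarted => "none",
            InstallState::Installing => "installing",
            InstallState::Ready => "ready",
        }
    }

    /// Parse the stored string; unknown values yield None rather than a guess.
    pub fn parse(s: &str) -> Option<InstallState> {
        match s {
            "none" => Some(InstallState::NotStarted),
            "installing" => Some(InstallState::Installing),
            "ready" => Some(InstallState::Ready),
            _ => None,
        }
    }

    /// Only the forward steps none → installing → ready are allowed.
    pub fn can_advance_to(self, next: InstallState) -> bool {
        matches!(
            (self, next),
            (InstallState::NotStarted, InstallState::Installing)
                | (InstallState::Installing, InstallState::Ready)
        )
    }
}
```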

**Code**

Five squash-merges on hero_os_tfgrid_deployer/development closed every install-pipeline gap surfaced live. The commit chain is 319cf68 → ce9b9e4 → cab2f16 → 794da22 → 483c8b8 → 541d9d5 (cumulative deployer server md5 9173e330ab6ddff5849118e5edc51a88 on admin VM). Pre-merge gate green on every commit: fmt, clippy --workspace --all-targets -D warnings, 55 server-lib tests (six new) + 2 SDK tests, workspace release build, --info smoke on both deployer binaries.

**Architectural lesson**

A new locked decision codifies that tester VM runtime configuration flows exclusively through hero_proc's secret store via `service.toml` `[[env]]` blocks. Bash environment files (`/root/app.env`) bypass hero_proc-managed daemons and never reach the services that actually need the values. This was empirically demonstrated through three iterations during the session: setting `HERO_PROXY_SEED_GATEWAY_LISTENER=1` in `app.env` produced exactly the same 502 Bad Gateway as setting nothing, because the managed daemon reads from hero_proc, not from bash. Only after extending the deployer's SSH payload to run `hero_proc secret set --quiet --context core HERO_PROXY_SEED_GATEWAY_LISTENER 1` and restart hero_proxy did the listener actually bind a TCP socket.
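A minimal sketch of how the deployer side might assemble that `hero_proc secret set` invocation for the SSH payload. The command words and flags are copied from the command quoted above; the helper names and the shell-quoting step are assumptions, and the follow-up restart of hero_proxy is left out because its exact command is not given in this thread.

```rust
/// Build the argv for `hero_proc secret set --quiet --context <context> <KEY> <VALUE>`.
pub fn secret_set_argv(context: &str, key: &str, value: &str) -> Vec<String> {
    ["hero_proc", "secret", "set", "--quiet", "--context", context, key, value]
        .iter()
        .map(|s| s.to_string())
        .collect()
}

/// Quote one word for a POSIX shell by wrapping it in single quotes.
/// An embedded single quote becomes: close quote, escaped quote, reopen quote.
pub fn sh_quote(word: &str) -> String {
    format!("'{}'", word.replace('\'', "'\\''"))
}

/// Join an argv into a single command line that is safe to hand to a remote shell.
pub fn ssh_payload(argv: &[String]) -> String {
    argv.iter()
        .map(|a| sh_quote(a))
        .collect::<Vec<String>>()
        .join(" ")
}
```

Quoting every word matters here because secret values such as OAuth client secrets can contain characters the remote shell would otherwise interpret.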

**Structural follow-up filed**

[hero_os_tfgrid_deployer#17](https://forge.ourworld.tf/lhumina_code/hero_os_tfgrid_deployer/issues/17): promote the tester-VM install runner from the bash script in hero_demo to a Rust crate in the deployer workspace that consumes a typed install manifest. This retires the impedance boundary between the deployer's typed Rust shape and what daemons actually see.

**Post-v1 vision filed**

[hero_demo#68](https://forge.ourworld.tf/lhumina_code/hero_demo/issues/68): `hero_store` service catalog UI for tester VMs. Browsable list of every Hero service published via `lab-publish.yaml`, one-click install or uninstall onto the tester's own VM. Depends on the install-runner cleanup landing first. Post-v1-sandbox polish.

**Checklist row flips**

A-30 (Hero stack present + running on a freshly provisioned tester VM) → Have-with-caveat. B-40 (tester opens cockpit URL on their own provisioned VM) → Have-with-caveat. The caveat in both cases is that the install runner is still the bash script in hero_demo per the structural follow-up above, and full browser SSO walks need the next session's OAuth-secret propagation work to complete. Counts moved 60 Have / 21 Need / 2 Blocked → 62 / 19 / 2 across 83 rows.

**Cleanup at /stop**

Tester VM and Forge user deleted (Gap 3 webgateway cleanup verified live four times across the session — every delete returned webgateway_error: ""). Workstation and admin VM temp files shredded. No orphan VMs or contracts on QA. Admin VM stays up at the public URL.

**Carries to next session (estimated three to five hours)**

Propagate the four cockpit OAuth secrets to tester VMs via the same SSH-push pattern. Walk two testers simultaneously (alice plus bob) with browser SSO walks demonstrating per-tester isolation plus admin symmetric trust across both cockpits. Flip the SSO-dependent checklist rows (A-31, B-41, several B-1x where the live walk surfaces evidence). After that the substrate supports any number of testers — provisioning more becomes mechanical.

Author
Owner

**Next session plan (v1 demo close)**

The s172d live walk surfaced four narrow UI / UX gaps that are the last things blocking the v1 demo from feeling clickable end-to-end. The architectural substrate (per-tester OAuth, install pipeline, admin allowlist) all works; what is left is operator and tester polish:

**P0: Admin UI auto-refresh on Install and Provision state**
[hero_os_tfgrid_deployer#18](https://forge.ourworld.tf/lhumina_code/hero_os_tfgrid_deployer/issues/18). Today the admin must manually refresh the user-detail page to see the Install state transition from `installing` to `ready`, and Provision has no visible progress at all (it looks like a hung browser for the full 55 to 60 seconds). The single fix is a small polling script on the VMs table plus a new provision-state column symmetric with the existing install-state column.

**P0.5: Cockpit URL column in admin UI is missing the /hero_cockpit/web/ path**
[hero_os_tfgrid_deployer#19](https://forge.ourworld.tf/lhumina_code/hero_os_tfgrid_deployer/issues/19). The admin sees `https://<tester>.<gateway>` and clicks it, but lands on Hero Proxy's own service-discovery dashboard, not on the cockpit. One-line template fix to append `/hero_cockpit/web/` to the displayed link and href.
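The path fix amounts to a small URL helper. A sketch, assuming the stored value is the bare gateway base URL; the function name is hypothetical, and the real fix may live directly in the Askama template instead.

```rust
/// Path the cockpit is served under, as given in this thread.
const COCKPIT_PATH: &str = "/hero_cockpit/web/";

/// Turn a gateway base URL into the link the admin should click.
/// Tolerates a trailing slash and is idempotent if the path is already present.
pub fn cockpit_url(gateway_base: &str) -> String {
    let base = gateway_base.trim_end_matches('/');
    if base.ends_with(COCKPIT_PATH.trim_end_matches('/')) {
        // Already points at the cockpit; just normalize the trailing slash.
        format!("{}/", base)
    } else {
        format!("{}{}", base, COCKPIT_PATH)
    }
}
```

Using one helper for both the displayed text and the `href` keeps the two from drifting apart.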

**P1a: Cockpit Services page: Install button for uninstalled components**
[hero_cockpit#11](https://forge.ourworld.tf/lhumina_code/hero_cockpit/issues/11). The cockpit currently only lists services already known to Hero Proc. Components in the canonical demo stack that are not yet started (because their dependency is not met, e.g. Books needs an AI key) are invisible from the cockpit, so the tester has no UI path to bring them up. Unified list with greyed-out rows for "available but not installed" plus a per-row Install button that fires `lab build` then `lab service start` server-side.
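The unified list can be thought of as a merge of two sets: the canonical catalog and whatever Hero Proc already knows. A sketch with hypothetical names (`ServiceRow`, `unified_rows`, `can_install`); the service names in any example are placeholders, and the catalog check is shown as an illustration of guarding the Install action, not as the actual RPC.

```rust
/// One row on the Services page.
#[derive(Debug, PartialEq)]
pub struct ServiceRow {
    pub name: String,
    /// false means "available but not installed": greyed out, with an Install button.
    pub installed: bool,
}

/// Merge the catalog with the services Hero Proc already knows about.
/// Catalog entries come first, in catalog order; known services missing from
/// the catalog are appended so nothing that is running disappears from the page.
pub fn unified_rows(catalog: &[&str], known_to_proc: &[&str]) -> Vec<ServiceRow> {
    let mut rows: Vec<ServiceRow> = catalog
        .iter()
        .map(|name| ServiceRow {
            name: name.to_string(),
            installed: known_to_proc.contains(name),
        })
        .collect();

    for name in known_to_proc {
        if !catalog.contains(name) {
            rows.push(ServiceRow {
                name: name.to_string(),
                installed: true,
            });
        }
    }
    rows
}

/// Only catalog names may be installed, so a hand-crafted request
/// cannot trigger a build of an arbitrary repo string.
pub fn can_install(catalog: &[&str], requested: &str) -> bool {
    catalog.contains(&requested)
}
```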

**P1b: Cockpit Services page: clickable URL column**
[hero_cockpit#12](https://forge.ourworld.tf/lhumina_code/hero_cockpit/issues/12). The Services page has a URL column but it currently shows an em-dash for every row. Every service with an `_admin` or `_web` binary has a publicly reachable URL the cockpit can compute from its own service.toml. Render those URLs as clickable links and the tester gets an obvious affordance to open any service.
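A sketch of deriving a cockpit-relative link from a service name. The mapping used here (`<base>_web` → `/<base>/web/`, `<base>_admin` → `/<base>/admin/`) is an assumption inferred from paths that appear elsewhere in this issue, such as `/hero_cockpit/web/` and `/hero_tfgrid_deployer/admin/`; the real convention may differ, and `service_url` is a hypothetical name.

```rust
/// Derive a relative URL for services that expose a web UI.
/// Returns None for services without a `_web` or `_admin` suffix,
/// in which case the page can keep showing a placeholder.
pub fn service_url(service_name: &str) -> Option<String> {
    for (suffix, segment) in [("_web", "web"), ("_admin", "admin")] {
        if let Some(base) = service_name.strip_suffix(suffix) {
            if !base.is_empty() {
                return Some(format!("/{}/{}/", base, segment));
            }
        }
    }
    None
}
```

Returning `Option` keeps the template logic simple: render a link for `Some`, keep the existing placeholder for `None`.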

**P2: BYO-key auto-start cascade in cockpit Settings**
When a tester pastes an AI provider key in the cockpit Settings page, the cockpit's save handler should automatically fire the lab-service-start cascade for the AI-dependent components (Hero AI Broker plus Hero Books plus Hero Agent). Today the tester pastes the key and nothing visible happens; they have to know to SSH into their VM and run a command to bring up Books. After this cascade, Books just works the moment the key is saved.

After s173, the v1 tester loop is end-to-end clickable: admin creates user via admin UI, tester registers on Forge and uploads SSH key, admin clicks Provision and sees real-time progress, admin clicks Install and sees real-time state transitions, admin shares the cockpit URL with the tester out-of-band (Slack / email, manually for v1), tester clicks URL and signs in via SSO, sees cockpit Services with both installed components (Books, Slides, Whiteboard, etc) and uninstalled components (greyed out with Install buttons), pastes an AI key in Settings and Books starts working automatically, clicks any service's URL to open it.

s174 (v2 polish) adds the automatic welcome-email pipeline so the admin no longer shares URLs out-of-band: A-18 (welcome email at user-create time with cockpit URL plus initial password plus 4-step onboarding) and B-1 ("your VM is ready" email at install-ready time). Selecting the email provider (resend.com or SendGrid) is the first decision in s174.

Estimated effort: s173 about 6 to 9 hours, s174 about 4 to 6 hours. Both stay inside home#238.


Scope refinement on the cockpit Services polish (hero_cockpit#11)

The previous comment described the Install button work as covering the 12 auto-installed demo components. After thinking through the catalog model more carefully, the right scope is broader: the cockpit Services page becomes the platform's service catalog, full stop. So hero_cockpit#11 ships:

  • 12 demo components auto-installed at provision time render with the normal action set
  • The remaining user-facing Hero services from the canonical 35-repo demo set (Hero Embedder, Hero Indexer, Hero Collab, Hero Assistance, Hero Foundry, Hero Archipelagos, etc., roughly 5-7 additional services) render as greyed-out rows with Install buttons
  • Tester clicks Install on any of them and the cockpit drives the install end-to-end

This collapses what was previously planned as a future "service catalog UI" arc (filed at hero_demo#68) into the same page. The cockpit IS the catalog for the free / sandbox tier. A separate searchable marketplace surface is only worth building when paid services and onboarding flows exist, which is a future paid-tier concern outside home#238's scope.
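The "cockpit is the catalog" model comes down to merging two lists: the catalog entries and the services Hero Proc already knows about. A minimal sketch, assuming both are available as lists of names; `RowState`, `ServiceRow`, `build_rows`, and the example identifiers are hypothetical.

```rust
use std::collections::HashSet;

#[derive(Debug, PartialEq)]
pub enum RowState {
    /// Known to Hero Proc: render with the normal action set.
    Installed,
    /// In the catalog but not installed: render greyed out with an Install button.
    Available,
}

#[derive(Debug, PartialEq)]
pub struct ServiceRow {
    pub name: String,
    pub state: RowState,
}

/// Build the unified Services table. Catalog order is preserved; services
/// that Hero Proc knows about but the catalog does not list are appended
/// at the end so they never disappear from the page.
pub fn build_rows(catalog: &[&str], known_to_proc: &[&str]) -> Vec<ServiceRow> {
    let known: HashSet<&str> = known_to_proc.iter().copied().collect();
    let in_catalog: HashSet<&str> = catalog.iter().copied().collect();

    let mut rows: Vec<ServiceRow> = catalog
        .iter()
        .map(|&name| ServiceRow {
            name: name.to_string(),
            state: if known.contains(name) {
                RowState::Installed
            } else {
                RowState::Available
            },
        })
        .collect();

    for &name in known_to_proc {
        if !in_catalog.contains(name) {
            rows.push(ServiceRow { name: name.to_string(), state: RowState::Installed });
        }
    }
    rows
}
```

For example, `build_rows(&["hero_books", "hero_embedder"], &["hero_books", "hero_proc"])` yields an installed `hero_books` row, a greyed-out `hero_embedder` row, and a trailing installed `hero_proc` row.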

Net for the s173 v1 close: the same four issues (hero_os_tfgrid_deployer#18 admin UI auto-refresh, hero_os_tfgrid_deployer#19 cockpit URL path-suffix, hero_cockpit#11 Install button with the full-catalog scope, hero_cockpit#12 clickable URL column) plus the BYO-key auto-start cascade. After s173 the cockpit Services page is the complete platform catalog for the sandbox demo.


Closing as the visible UX surface is complete. The work shipped across two stretches:

The admin path landed first: tfgrid_deployer admin UI with per-user VMs table, install state machine, provisioning flow, per-tester Forgejo OAuth apps replacing the workspace-shared model. Real Forge users walked through the admin VM end-to-end (alice172d, alice123).

The tester path landed second:

  • Cockpit Services page with install-from-catalog flow
  • Bootstrap modals replacing every browser confirm/alert leak
  • Dark-mode contrast fixes on the Disable button and Logs drawer
  • log_tail rendered in the install result modal so dependency-cascade failures surface inline
  • Manual with 17 entries split into Core infrastructure (4) and Apps (13)
  • About data locations table covering all 16 catalog services that store user data
  • Feedback page secondary sections wrapped in Bootstrap cards
  • Landing page CTAs unified
  • Settings cleanup (Public exposure base domain section removed alongside the Expose/Unexpose UI hiding)
  • Connection-status dot fix in hero_admin_lib: the dot now paints green when connected and stays steady (it used to be grey when connected, with a constant pulse on the wrong state)

About 23 commits across hero_cockpit, hero_os_tfgrid_deployer, hero_website_framework, and hero_demo.

What this arc does NOT cover: functional verification of the catalog apps themselves. A tester can click Install on hero_books and the install cascade completes, but whether hero_books actually renders a library, indexes a document, and answers a grounded question is unverified. That work moves to the new arc: home#239.

Signed-by: mik-tf <mik-tf@noreply.invalid>
