hero_os_tfgrid_deployer integration: methods we'll consume + small gaps #116

Open
opened 2026-05-20 21:41:39 +00:00 by mik-tf · 6 comments
Owner

hero_os_tfgrid_deployer integration: methods we'll consume + small gaps

The new admin tool hero_os_tfgrid_deployer (scope under discussion at hero_os_tfgrid_deployer#1) will consume ComputeService OpenRPC (currently in crates/my_compute_zos_server/src/cloud/openrpc.json) as its only VM-lifecycle backend.

Reviewed the spec — most of what we need is already there. Filing this issue to (a) confirm intended usage so we don't drift, and (b) surface a few small gaps that would make the deployer's flow easier.

Methods the deployer will call

For each demo user we provision:

  1. Inject SSH key → ComputeService.inject_ssh_keys: the deployer generates a per-user ED25519 key, registers the public half via this method, and retains the private half in its sqlite for SSH-back-in.
  2. Deploy VM → ComputeService.deploy_vm with spec { cpu: 16, memory: 8 GB, disk: 200 GB, rootfs: 16 GB, flist: "ubuntu-24.04-latest", publicip: false, node_id: <pinned> }. Today's s132 work proves this spec is sufficient (16 CPU is overcommit for an 8 GB VM but matches what's in flight via the OpenTofu path).
  3. Wait until VM is reachable → poll ComputeService.get_vm until mycelium_ip appears and the VM accepts connections.
  4. Deploy gateway → ComputeService.deploy_webgateway mapping <user>.<node>.grid.tf → http://<vm_ip>:9988 (where hero_router listens).
  5. Run bootstrap script → currently via SSH from deployer to VM (proven path in hero_demo/deploy/single-vm/scripts/setup-binaries.sh). Alternative: pipe through ComputeService.vm_exec if it handles long-running scripts cleanly (see G2 below).
  6. Track + manage lifecycle → ComputeService.list_vms / get_vm / vm_stats for the admin UI's per-user state view.

Confirmation questions (low-cost — flag any "yes" / "no" / "TBD")

  • C1: Is deploy_vm ready for production use on TFGrid mainnet? (s132 used OpenTofu directly against TFGrid — works. Want to swap to this once it's stable for our flow.)
  • C2: Does deploy_vm return synchronously after the VM is fully reachable (SSH-able), or does it return early and require polling get_vm? Documentation in the OpenRPC summary field would resolve this for any caller.
  • C3: Is there a "metadata" or "tag" field on the VM spec? Deployer would store { user: "<forge_id>", profile: "demo", provisioned_at: ... } per-VM so the admin UI can join VMs back to users without round-tripping its own sqlite.
  • C4: inject_ssh_keys — is this called pre- or post-deploy_vm? Order matters for our deployer flow.

Small gaps (what would help us)

  • G1: A ComputeService.wait_vm_ready(vm_id, timeout) method that blocks until the VM is SSH-able (or the timeout expires). Today we'd poll get_vm from the deployer — works but every caller reimplements the same readiness logic. Not a blocker; nice-to-have.
  • G2: Clarity on vm_exec — does it stream stdout incrementally (good for our setup-binaries.sh which prints ~1500 lines of lab build progress over 5-30 min) or buffer until the command exits? If buffered, we keep the SSH path; if streamed, we can drop the SSH dependency on the deployer side entirely.
  • G3: deploy_webgateway: does it return the publicly-resolvable FQDN immediately, or does DNS propagation need extra wait? s132 saw the gateway resolve within ~30 s of tofu apply completing; if hero_compute mirrors that, no action needed.
  • G4: Auth model for the deployer → hero_compute connection. Currently the deployer is "admin-only" (us). Is the existing ComputeService socket reachable only locally, or does it expect bearer-token auth over network? Deployer's host (deployer admin UI) is not on the same machine as hero_compute.

None of these are blockers — happy to file separate issues for any of them if that's easier. Mostly this is a tee-up for the deployer work that starts in the next few sessions (current plan in hero_os_tfgrid_deployer#1 and the follow-up scope issues we're about to file there).

What's NOT a gap

  • VM lifecycle methods: complete. deploy_vm / start_vm / stop_vm / restart_vm / delete_vm / list_vms / get_vm — all present.
  • Web gateway: complete. deploy_webgateway / list_webgateways / get_webgateway / delete_webgateway — present.
  • SSH key injection: present (inject_ssh_keys).
  • VM diagnostics: present (vm_logs, vm_stats, vm_exec).
  • Nice surprise: migrate_secret, list_images, attach_hypervisor are also there — beyond what we need immediately but useful later.

Context

  • VM bootstrap proof point (s132): hero_demo setup-binaries.sh, 34/34 PASS on lab download/install + hero_proc + hero_router GREEN on a fresh TFGrid VM (https://forge.ourworld.tf/lhumina_code/hero_demo/src/branch/development/deploy/single-vm/scripts/setup-binaries.sh).
  • Cockpit spec: hero_cockpit#1 (https://forge.ourworld.tf/lhumina_code/hero_cockpit/issues/1).
  • Meeting notes umbrella: hero_os_tfgrid_deployer#1 (https://forge.ourworld.tf/lhumina_code/hero_os_tfgrid_deployer/issues/1).

cc @mahmoud, no rush; answers can come incrementally.
Owner

Thanks @mik-tf — went through this against the latest development (0670ec5), reading schemas/cloud/cloud.oschema and the TFGrid impl in crates/my_compute_zos_server/src/cloud/{rpc.rs,grid_driver.rs}. Most of the spec maps, but there are a few important corrections before you build the deployer on it: several methods listed under "What's NOT a gap" are actually TFGrid stubs that return an error.

⚠️ Actual working surface on the TFGrid (ZOS) backend

ComputeService is one trait shared by two backends; on TFGrid a number of methods are intentionally unimplemented and return Err("… is not supported on TFGrid").

Works on TFGrid: node_register / node_status / node_unregister, set_tfgrid_node_ids, list_slices / get_slice, deploy_vm, delete_vm, list_vms / get_vm, vm_logs, node_stats, list_images, get_deployment_logs / list_deployments, get_ssh_keys / set_ssh_keys, list_jobs / job_logs, deploy_webgateway / list_webgateways / get_webgateway / delete_webgateway / list_gateway_nodes.

Stubbed on TFGrid (error): start_vm, stop_vm, restart_vm, inject_ssh_keys, vm_exec, vm_stats, attach_hypervisor, migrate_secret.

So three claims in "What's NOT a gap" need revising:

  • inject_ssh_keys (your step 1) is NOT supported on TFGrid. SSH keys are deploy-time only: pass the public key to deploy_vm(ssh_keys=[…]) and the driver injects it via the flist's SSH_KEY env at boot. (set_ssh_keys / get_ssh_keys only manage a per-secret key store; they do not push into a running VM.)
  • VM lifecycle is not "complete": only deploy_vm + delete_vm work. start_vm / stop_vm / restart_vm are stubs (to "stop", you delete; to "restart", delete and re-deploy_vm).
  • Diagnostics: only vm_logs works (returns the deploy/delete progress log — not arbitrary command output). vm_stats and vm_exec are stubbed; node_stats exists but is node-level capacity/benchmark from Grid Proxy, not per-VM live telemetry.
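
Since "restart" on TFGrid means delete and re-deploy, the sequence can be sketched roughly as below. This is only an illustration in Python: the three callables stand in for whatever client the caller uses for delete_vm / get_vm / deploy_vm, and the assumption that a vanished record shows up as `None` (rather than an error) is mine, not something the schema confirms.

```python
import time
from typing import Callable, Optional


def redeploy_vm(
    delete_vm: Callable[[str], None],
    get_vm: Callable[[str], Optional[dict]],
    deploy_vm: Callable[[], dict],
    vm_id: str,
    timeout_s: float = 300.0,
    interval_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Emulate restart_vm: delete, wait for the record to disappear, deploy again.

    Assumption: get_vm returns None once the record is gone. A real client may
    raise a "not found" error instead and would need to translate it.
    """
    delete_vm(vm_id)  # async on the server: the record goes to "deleting" first
    deadline = time.monotonic() + timeout_s
    while get_vm(vm_id) is not None:
        if time.monotonic() >= deadline:
            raise TimeoutError(f"VM {vm_id} still present after {timeout_s}s")
        sleep(interval_s)
    # deploy_vm is async as well: the caller still has to poll the new VM.
    return deploy_vm()
```

The returned record is expected to be in the provisioning state, so the usual get_vm polling still applies to the new VM.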

⚠️ The VM spec is slice-based, not free-form

Real signature:

deploy_vm(name, slice_count, secret, image, ssh_keys, node_sid) -> Vm
  • 1 slice = 4 GB RAM + 1 vCPU + a fixed disk-per-slice (sized at node registration; ~138 GB/slice on node 1774). In code: cpu_count = slice_count, memory = slice_count × 4 GB, disk = Σ slice disk.
  • So { cpu: 16, memory: 8 GB, disk: 200 GB, rootfs: 16 GB, publicip: false } cannot be expressed: 8 GB = 2 slices = 2 vCPU (CPU and RAM are coupled 1:4), there is no publicip, no rootfs, no independent disk parameter.
  • node_sid is matched against the node hostname (tfnode-<id>), not a sid.
  • image must be the full flist URL (https://hub.grid.tf/tf-official-vms/ubuntu-24.04-latest.flist), not ubuntu-24.04-latest.
  • If the deployer needs free-form sizing or a public IP, that is an API change — happy to scope it.
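
To make the coupling concrete, here is a small Python sketch of the arithmetic described above. The 4 GB and 1 vCPU per slice come from this comment; the helper names are made up for illustration, and a uniform disk size per slice is an assumption (the real value is fixed at node registration and may differ per node).

```python
import math
from dataclasses import dataclass

GB_PER_SLICE = 4    # from the thread: 1 slice = 4 GB RAM
VCPU_PER_SLICE = 1  # from the thread: 1 slice = 1 vCPU


@dataclass(frozen=True)
class SliceSizing:
    slice_count: int
    cpu_count: int
    memory_gb: int
    disk_gb: int


def sizing_for_slices(slice_count: int, disk_gb_per_slice: int) -> SliceSizing:
    """Resources implied by a slice count (assumes equal disk on every slice)."""
    if slice_count < 1:
        raise ValueError("slice_count must be at least 1")
    return SliceSizing(
        slice_count=slice_count,
        cpu_count=slice_count * VCPU_PER_SLICE,
        memory_gb=slice_count * GB_PER_SLICE,
        disk_gb=slice_count * disk_gb_per_slice,
    )


def slices_for_memory(memory_gb: int) -> int:
    """Smallest slice count that covers the requested RAM."""
    return max(1, math.ceil(memory_gb / GB_PER_SLICE))


# The 8 GB demo profile maps to 2 slices, hence 2 vCPU rather than 16.
demo = sizing_for_slices(slices_for_memory(8), disk_gb_per_slice=138)
```

Read the other way round, asking for 16 vCPU would mean 16 slices and therefore 64 GB of RAM, which is why the original spec has no slice equivalent.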

Confirmation questions

  • C1 — mainnet-ready? For the slice model, yes: we deployed and deleted real VMs on mainnet dedicated node 1774 this week (reached running + mycelium IP; contracts created and cancelled cleanly — 0670ec5 just made the cancel idempotent on delete_vm). Caveats: the slice-spec mapping above, async deploy (C2), and the capacity precheck was just fixed to size the catalog from free (not total) node capacity. It is not an OpenTofu-equivalent for free-form specs / public IPs.
  • C2 — sync or async? Async. deploy_vm returns immediately with state="provisioning" and runs the on-chain deploy in a background task. Poll get_vm until state=="running" and mycelium_ip is set (then it is reachable). Same async pattern for delete_vm (→ deleting → record disappears) and deploy_webgateway. Agreed it's worth documenting in the OpenRPC summary.
  • C3 — metadata/tag field? No. Vm has name (free text, indexed) + config{ cpu_count, image, network_config?, extra_args? }. No generic tag map. Today: encode user/profile into name, or keep the join in the deployer's sqlite. A metadata: map<str,str> would be a small schema change.
  • C4 — inject_ssh_keys pre/post? Neither — it's not supported on TFGrid. Provide the key at deploy via deploy_vm(ssh_keys=[…]).
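
For the C3 workaround of encoding user/profile into name, one possible convention is sketched below in Python. The `--` separator and the character rules are purely hypothetical; the thread does not say which characters or lengths the name field accepts, so treat this as a starting point only.

```python
import re
from typing import Tuple

SEP = "--"  # hypothetical separator, not defined anywhere in hero_compute
# Lowercase alphanumerics with single inner hyphens: no leading, trailing or
# doubled hyphen, so SEP can never occur inside or next to a part.
_PART = re.compile(r"[a-z0-9]+(?:-[a-z0-9]+)*")


def encode_vm_name(user: str, profile: str) -> str:
    """Build a VM name that carries the profile and the Forge user id."""
    for part in (user, profile):
        if not _PART.fullmatch(part):
            raise ValueError(f"unsupported characters in {part!r}")
    return f"{profile}{SEP}{user}"


def decode_vm_name(name: str) -> Tuple[str, str]:
    """Return (user, profile) from a name built by encode_vm_name."""
    profile, sep, user = name.partition(SEP)
    if not sep or not _PART.fullmatch(profile) or not _PART.fullmatch(user):
        raise ValueError(f"not a deployer-encoded VM name: {name!r}")
    return user, profile
```

With this convention `encode_vm_name("alice", "demo")` gives `demo--alice`, and list_vms results can be mapped back to users without a database lookup.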

Small gaps

  • G1 — wait_vm_ready: doesn't exist; poll get_vm. Note "ready" today = state=running + mycelium_ip present — the server does not probe SSH/port, so a true SSH-able wait would be new logic.
  • G2 — vm_exec stream vs buffer: moot — not implemented on TFGrid, and the schema returns only an i32 exit code, so even a future implementation wouldn't stream stdout without an API change. Keep the SSH path for setup-binaries.sh.
  • G3 — webgateway FQDN immediate? No. deploy_webgateway is async; for kind="name" the FQDN is filled in only when the background deploy reaches state=ready → poll get_webgateway. For kind="fqdn" you supply it. DNS propagation is on top of that.
  • G4 — auth / network model: ComputeService listens on a Unix domain socket (hero_compute_zos/rpc.sock, raw protocol) — local-only, no TCP bind, no built-in bearer auth. Per-call auth is the secret parameter = an sr25519-signed token derived from the node's TFGRID_MNEMONIC (or a raw ownership token), checked per-VM in verify_secret. A remote deployer (different host) needs the service exposed over the network — that's hero_router's job (TCP entry point + context/claim auth) or an SSH tunnel to the UDS. There is no network auth in the service itself today. This is the main cross-machine integration question and probably deserves its own issue.
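
On the G1 point that the server does not probe SSH: a caller-side check can be as small as the Python sketch below. It assumes the deployer host can route to the VM's mycelium address and that sshd listens on port 22; it only confirms that the TCP port accepts connections, not that a login would succeed.

```python
import socket


def ssh_port_open(host: str, port: int = 22, timeout_s: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout_s.

    This is a reachability probe only; it does not perform an SSH handshake.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        # Covers refused connections, timeouts and unroutable addresses.
        return False
```

A deployer would typically call this in a loop after get_vm reports state=running with a mycelium_ip, and only then start the bootstrap script over SSH.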

Net

Lifecycle (deploy_vm / delete_vm / get_vm / list_vms / vm_logs) and the webgateway methods are solid on mainnet. The deployer should plan around: (1) slice-based sizing (no free-form cpu/disk/public-IP), (2) deploy-time SSH keys (no inject_ssh_keys), (3) SSH for exec/diagnostics (no vm_exec / vm_stats), and (4) remote auth via hero_router (the UDS is local-only).

Happy to file separate issues for the three actionable ones: a metadata field on the VM spec, free-form/public-IP sizing, and the remote-auth model.

Author
Owner

Thanks @mahmoud — your corrections are folded into the arc tracker at lhumina_code/home#235 §3 (Dependency map). The deployer + cockpit are planned around the slice model, deploy-time SSH, delete-and-redeploy lifecycle, async deploy, and the UDS-local-only auth model. Substantive reply to your comment follows separately.

Author
Owner

Acknowledged — corrections folded into the deployer + cockpit plan

Thanks @mahmoud — the three "no-gap" claims I had wrong are corrected in the arc-level tracker at lhumina_code/home#235 §3. Quoting them back to confirm we read them the same way:

What we'll plan around on TFGrid

| Constraint | What changes in our plan |
|---|---|
| start_vm / stop_vm / restart_vm stubbed | Deployer admin UI exposes a single "destroy + redeploy" action, not three lifecycle buttons. Cockpit's per-service start/stop/restart is unaffected: those are hero_proc service calls on services running inside the VM, not VM-level. |
| inject_ssh_keys stubbed | Deployer's Forge-user-lifecycle step generates the SSH key + passes it at deploy_vm(ssh_keys=[…]) time. No inject-after-create path. |
| vm_exec + vm_stats stubbed | Setup-binaries dispatch stays on SSH (already true since s132). Cockpit's system_info reads RAM/disk from the VM's own /proc/meminfo + df + sysinfo crate, not via hero_compute. |
| Slice-based sizing (1 slice = 4 GB + 1 vCPU + fixed disk, no publicip / no rootfs / no free-form) | Demo profile = 2 slices = 2 vCPU + 8 GB. Our s132 herolab VM was provisioned via the OpenTofu path (16 CPU / 8 GB / 200 GB / 16 GB rootfs), which is outside the hero_compute slice model; that explains the spec mismatch I had in my head. OpenTofu stays the primary deployer adapter until free-form sizing + publicip ship; hero_compute is the secondary adapter behind a config flag. |
| Async deploy_vm + deploy_webgateway | Deployer polls get_vm until state="running" AND mycelium_ip set; polls get_webgateway for FQDN on kind="name". Worth surfacing in OpenRPC summary/description so consumers don't trip on the implicit polling contract. |
| No metadata field on Vm | We keep the user/profile join in the deployer's sqlite (already in the D1 schema). |
| UDS local-only auth on ComputeService | Remote deployer reaches the service via hero_router (TCP entry point + context/claim auth) per /hero_router skill + per-call secret parameter for verify_secret. Network-level wrapping is hero_router's job, not hero_compute's. |

Confirmation questions — answered

  • C1 (mainnet-ready): confirmed for the slice model; we plan around the constraints above. The "OpenTofu-equivalent for free-form specs / public IPs" gap is a known-and-handled limitation, not a blocker for v0.1.
  • C2 (sync vs async): confirmed async; will encode the polling contract in the deployer's adapter trait. +1 on documenting it in the OpenRPC schema.
  • C3 (metadata field): confirmed no metadata map; we'll use deployer-side sqlite.
  • C4 (inject_ssh_keys pre/post): neither — keys at deploy time.
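
The polling contract confirmed in C2 could be captured along these lines. This is a hedged Python sketch, not the deployer's actual adapter: `fetch` stands in for a get_vm or get_webgateway call, the record is assumed to arrive as a dict, and the `fqdn` field name on the gateway record is a guess.

```python
import time
from typing import Callable, Optional


def vm_is_ready(vm: Optional[dict]) -> bool:
    """Ready per the thread: state == "running" and mycelium_ip is set."""
    return bool(vm) and vm.get("state") == "running" and bool(vm.get("mycelium_ip"))


def gateway_is_ready(gw: Optional[dict]) -> bool:
    """Ready when state == "ready" and an FQDN is present ("fqdn" key assumed)."""
    return bool(gw) and gw.get("state") == "ready" and bool(gw.get("fqdn"))


def poll_until(
    fetch: Callable[[], Optional[dict]],
    is_ready: Callable[[Optional[dict]], bool],
    timeout_s: float = 600.0,
    interval_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> dict:
    """Call fetch() until is_ready(record) holds or timeout_s elapses."""
    deadline = clock() + timeout_s
    while True:
        record = fetch()
        if is_ready(record):
            return record
        if clock() >= deadline:
            raise TimeoutError(f"not ready after {timeout_s}s")
        sleep(interval_s)
```

Keeping the readiness predicate separate from the loop lets the same helper serve both deploy_vm and deploy_webgateway, and makes it easy to add the deployer's own SSH probe as an extra condition.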

Follow-up issues you offered to file

Three concrete things you mentioned you'd happily scope as separate issues:

  1. metadata: map<str,str> field on the Vm spec — would let the deployer encode user/profile/owner directly instead of relying on its sqlite for the join. Nice-to-have, not blocking.
  2. Free-form sizing + public-IP support on deploy_vm — promotes hero_compute from secondary to primary adapter for the deployer.
  3. Remote-auth model for ComputeService (or a doc-level reference to how hero_router fronts it) — clarifies the cross-machine integration story for the deployer.

Happy to file each on hero_compute myself if you prefer; just let us know which way works better. The arc-level tracker home#235 will pick them up either way.

What we don't need from hero_compute right now

  • wait_vm_ready — confirmed; we poll get_vm ourselves. "Ready" = state="running" + mycelium_ip set + SSH-able on the deployer side via its own probe.
  • vm_exec streaming — moot; SSH path is the right answer for setup-binaries dispatch.

Net: deployer + cockpit work proceeds against the working surface you described. Webgateway async + slice-based sizing are documented constraints, not surprises. Track A (cockpit) starts in s133; Track D (deployer) starts s146-ish; integration in s152.

Author
Owner

Net for the deployer arc — three small asks

Thanks @mahmoud. I re-read your corrections and the working surface against the s141 deployer-side code work just done (forge.rs/db.rs/3 OpenRPC methods landed on hero_os_tfgrid_deployer main, see 796e715). Net: the deployer plan is unblocked on your current development HEAD. We sidestep by design the six TFGrid stubs that touched our plan (destroy+redeploy instead of start_vm/stop_vm/restart_vm; deploy-time SSH keys via deploy_vm(ssh_keys=[…]) instead of inject_ssh_keys; SSH instead of vm_exec; in-VM /proc instead of vm_stats), and the working slice-model surface (deploy_vm/delete_vm/get_vm/list_vms/vm_logs/deploy_webgateway) is what s143 will call directly.

Three housekeeping items to keep coordination tight

  1. Tag v0.1.0-rc1 at current development HEAD. Today our Cargo.toml would pin branch = "development" which moves under us; a tagged release lets the deployer pin a known-good. ~10 min for you.
  2. Doc the async polling contract in schemas/cloud/cloud.oschema summary/description for deploy_vm / delete_vm / deploy_webgateway — every future consumer trips on the implicit "returns immediately, poll until state="running" AND mycelium_ip set" rule otherwise. ~30 min, doc-only.
  3. File the 3 follow-up issues you offered to scope at the end of your reply: (a) metadata: map<str,str> field on Vm spec, (b) free-form sizing + public-IP on deploy_vm, (c) remote-auth model for ComputeService (or a pointer to hero_router fronting). We can draft + post them ourselves with cross-links if you prefer — just say which way works better.
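A minimal sketch of the polling contract from item 2, as the deployer side might encode it. The `state` and `mycelium_ip` field names and the `"running"` value come from this thread; the `VmStatus` struct, the `PollError` type, the intermediate `"deploying"` state used in tests, and the interval/timeout parameters are assumptions for illustration, not part of the ComputeService spec.

```rust
use std::thread::sleep;
use std::time::{Duration, Instant};

/// Subset of a VM record the poller inspects (hypothetical struct;
/// only the two field names are taken from the thread).
#[derive(Debug, Clone, PartialEq)]
pub struct VmStatus {
    pub state: String,
    pub mycelium_ip: Option<String>,
}

#[derive(Debug, PartialEq)]
pub enum PollError {
    /// The VM did not become ready before the deadline.
    Timeout,
    /// The `get_vm` call itself failed.
    Backend(String),
}

/// "Ready" as described in the thread: state == "running" AND mycelium_ip set.
pub fn is_ready(vm: &VmStatus) -> bool {
    vm.state == "running" && vm.mycelium_ip.as_deref().map_or(false, |ip| !ip.is_empty())
}

/// Calls `get_vm` repeatedly until the VM is ready, the backend errors,
/// or `timeout` elapses. `get_vm` stands in for the real RPC call.
pub fn wait_until_ready<F>(
    mut get_vm: F,
    interval: Duration,
    timeout: Duration,
) -> Result<VmStatus, PollError>
where
    F: FnMut() -> Result<VmStatus, String>,
{
    let deadline = Instant::now() + timeout;
    loop {
        let vm = get_vm().map_err(PollError::Backend)?;
        if is_ready(&vm) {
            return Ok(vm);
        }
        if Instant::now() >= deadline {
            return Err(PollError::Timeout);
        }
        sleep(interval);
    }
}
```

Passing `get_vm` as a closure keeps the loop independent of the transport, so the same function can be exercised in unit tests without a live VM.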

What we'll do on the deployer side regardless

  • Write ComputeServiceAdapter against the working surface (slice model, 2 slices = demo profile).
  • Front your UDS with hero_router (TCP + claim-based auth) for cross-host reach — that's our hero_router config work, no hero_compute changes.
  • Use OpenTofu as the primary provisioning adapter (per s139 pivot retaining the OpenTofu path for free-form sizing + publicip) and hero_compute as the secondary adapter behind a config flag, until your (b) lands.
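The primary/secondary split in the last bullet could be sketched roughly as below. Only the name `ComputeServiceAdapter`, the OpenTofu/hero_compute roles, and the "2 slices = demo profile" sizing come from the thread; the trait, the `use_hero_compute` flag, and the placeholder return values are hypothetical and do not drive any real backend.

```rust
/// Which backend provisions VMs (illustrative enum).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Backend {
    OpenTofu,
    HeroCompute,
}

/// Hypothetical request shape; the slice model sizes a VM by slice count.
pub struct VmRequest {
    pub name: String,
    pub slice_count: u32,
}

/// Common interface both adapters implement (assumed, not from the source).
pub trait ProvisioningAdapter {
    fn backend(&self) -> Backend;
    /// Returns an opaque VM handle on success.
    fn deploy_vm(&self, req: &VmRequest) -> Result<String, String>;
}

pub struct OpenTofuAdapter;
pub struct ComputeServiceAdapter;

impl ProvisioningAdapter for OpenTofuAdapter {
    fn backend(&self) -> Backend {
        Backend::OpenTofu
    }
    fn deploy_vm(&self, req: &VmRequest) -> Result<String, String> {
        // Placeholder: a real adapter would drive OpenTofu here.
        Ok(format!("tofu:{}", req.name))
    }
}

impl ProvisioningAdapter for ComputeServiceAdapter {
    fn backend(&self) -> Backend {
        Backend::HeroCompute
    }
    fn deploy_vm(&self, req: &VmRequest) -> Result<String, String> {
        if req.slice_count == 0 {
            return Err("slice_count must be at least 1".to_string());
        }
        // Placeholder: a real adapter would call ComputeService.deploy_vm.
        Ok(format!("compute:{}:{}", req.name, req.slice_count))
    }
}

/// Hypothetical config flag: OpenTofu stays primary unless this is set.
pub struct DeployerConfig {
    pub use_hero_compute: bool,
}

pub fn select_adapter(cfg: &DeployerConfig) -> Box<dyn ProvisioningAdapter> {
    if cfg.use_hero_compute {
        Box::new(ComputeServiceAdapter)
    } else {
        Box::new(OpenTofuAdapter)
    }
}
```

Keeping the choice behind one function means promoting hero_compute to primary later is a config change rather than a refactor.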

Keeping coordination minimal — your daily work on zos_admin is independent of this, so we're not on each other's critical paths. Tag + docs + 3 issues is the whole ask. Thanks!

Signed-by: mik-tf <mik-tf@noreply.invalid>

Author
Owner

s142 close — D-23 SSH key custody locked + first deployer→hero_compute call live

(ack on the slice-model surface from #116 thread, plus 2 follow-ups for you when convenient)

1. D-23 custody model locked (workspace decision file: D-23)

After the Phase B.5 adversarial review of our planned deployer.request_ssh_key_for(user_id, vm_id) shape, we caught that storing the user's SSH private key in the deployer would re-introduce exactly the impersonation vault we had already rejected for Forge tokens in D-22. It is a different credential class (root-shell-grade) and a worse one (there is no must_change_password analog to make it decay). So we forked the plan:

  • End-user uploads their own SSH pubkeys to their forge.ourworld.tf account via the standard /user/settings/keys UI.
  • Deployer fetches them read-only via the new ForgeClient::list_user_ssh_keys(username) admin endpoint at provision time.
  • ComputeService.deploy_vm(name, slice_count, secret, image, ssh_keys, node_sid) is called with ssh_keys flowing inline as a list of openssh strings.
  • A per-VM vm_secret ownership token (CSPRNG 32-char) is minted deployer-side and stored in our sqlite — separate credential class from SSH access; it authenticates VM management (delete, redeploy) only.

So the deployer never holds an SSH private key anywhere: not in sqlite, not in memory, not in transit. Only public keys pass through it, and only for the duration of the deploy call. This matches your hero_compute_sdk::ssh_secret_hash-keyed SshKeyStore model server-side.
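To make the custody rule concrete, here is a sketch of building the `deploy_vm` request with public keys inline and refusing anything that looks like private key material. The parameter names and order follow the thread; the by-name `params` object, the field types, the `ComputeService.deploy_vm` wire method string, and the key-prefix heuristic are assumptions, and the JSON is assembled by hand only to keep the example dependency-free.

```rust
/// Escapes a string as a JSON string literal (including the quotes).
pub fn json_str(s: &str) -> String {
    let mut out = String::with_capacity(s.len() + 2);
    out.push('"');
    for c in s.chars() {
        match c {
            '"' => out.push_str("\\\""),
            '\\' => out.push_str("\\\\"),
            '\n' => out.push_str("\\n"),
            '\r' => out.push_str("\\r"),
            '\t' => out.push_str("\\t"),
            c if (c as u32) < 0x20 => out.push_str(&format!("\\u{:04x}", c as u32)),
            c => out.push(c),
        }
    }
    out.push('"');
    out
}

/// Heuristic only: accepts common OpenSSH public key prefixes and rejects
/// anything containing a private key marker.
pub fn looks_like_public_key(line: &str) -> bool {
    const PREFIXES: [&str; 3] = ["ssh-ed25519 ", "ssh-rsa ", "ecdsa-sha2-"];
    !line.contains("PRIVATE KEY") && PREFIXES.iter().any(|p| line.starts_with(p))
}

/// Parameter names follow the thread; the types are assumptions.
pub struct DeployVmParams<'a> {
    pub name: &'a str,
    pub slice_count: u32,
    /// Per-VM ownership token, distinct from SSH access.
    pub secret: &'a str,
    pub image: &'a str,
    pub ssh_keys: &'a [String],
    pub node_sid: &'a str,
}

/// Builds a JSON-RPC 2.0 request body, or errors if a key is not public.
pub fn deploy_vm_request(id: u64, p: &DeployVmParams<'_>) -> Result<String, String> {
    if let Some(bad) = p.ssh_keys.iter().find(|k| !looks_like_public_key(k)) {
        return Err(format!("refusing non-public key material ({} bytes)", bad.len()));
    }
    let keys: Vec<String> = p.ssh_keys.iter().map(|k| json_str(k)).collect();
    Ok(format!(
        "{{\"jsonrpc\":\"2.0\",\"id\":{},\"method\":\"ComputeService.deploy_vm\",\
         \"params\":{{\"name\":{},\"slice_count\":{},\"secret\":{},\"image\":{},\
         \"ssh_keys\":[{}],\"node_sid\":{}}}}}",
        id,
        json_str(p.name),
        p.slice_count,
        json_str(p.secret),
        json_str(p.image),
        keys.join(","),
        json_str(p.node_sid)
    ))
}
```

The guard is defence in depth rather than a real validator: it would not catch every malformed key, but it fails closed if a private key is ever passed by mistake.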

2. Spec/wire alignment question (minor)

While reading the wire shape we noticed there are two openrpc.json files in my_compute_mos_server:

  • crates/my_compute_mos_server/openrpc.json (top-level, 14.9KB, May 14) — deploy_vm here takes (name, slice_sid, image, cpu_count, secret), NO ssh_keys parameter, no set_ssh_keys method.
  • crates/my_compute_mos_server/src/cloud/openrpc.json (under src/cloud/, 46.7KB, May 21) — matches the oschema; deploy_vm takes the full (name, slice_count, secret, image, ssh_keys, node_sid); set_ssh_keys present.

The top-level file looks stale (probably an older regen that didn't get cleaned up). Could you either delete it or doc which is canonical? It cost us a Phase B back-and-forth before we realized which was the wire spec.

3. Typed Rust SDK gap (filed separately)

crates/my_compute_zos_sdk/src/lib.rs:24 is still the original TODO stub. We hand-rolled the JSON-RPC envelopes via hero_compute_sdk::http_rpc_tcp for now (the deployer's new compute.rs adapter), but a typed SDK would remove that boilerplate for every future consumer. Filing as its own issue so it can stand independently.

Live-smoke gap (operational, not blocking): end-to-end provisioning is paused until we have a TFGrid VM (Track F's F1 in our internal arc). Unit tests + binary-symbol smoke cover the dispatch + decode + error paths; full provision_vm → deploy_vm → poll until running → SSH ping waits on a real VM.

— mik-tf

Author
Owner

Quick update on the decisions side, since the last reply pre-dates both D-24 and D-25. We locked the deployer's delete semantics at D-24 (lhumina_code/home#235): delete_user refuses if the user still owns VMs (no auto-cascade saga); delete_vm calls ComputeService.delete_vm first and only then drops the sqlite row; and PRAGMA foreign_keys=ON is now a second-line guard. The ordering matters because an orphaned VM keeps billing and the per-VM secret is lost when the sqlite row is dropped, so the compute call has to succeed before we lose the handle. Then yesterday we landed D-25 on top: the vms.user_id FK was upgraded from a bare reference to ON DELETE RESTRICT via the canonical SQLite recreate-with-FK dance, so the refuse-if-vms invariant is now enforced at the schema layer too, not just in the handler. None of this changes the wire shape on your side. Separately, we filed #118 with our only outstanding ask: access to a hero_compute_mos_server we can hit for the first live smoke.
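The D-24 ordering can be illustrated with an in-memory stand-in for the sqlite tables. This is a sketch under assumptions: the `Store`, `Compute` trait, `FakeCompute`, and error names are hypothetical, and the real deployer enforces the refuse-if-vms rule in sqlite via `ON DELETE RESTRICT` rather than in a `HashMap`.

```rust
use std::collections::{HashMap, HashSet};

/// Stand-in for the remote ComputeService (hypothetical trait).
pub trait Compute {
    fn delete_vm(&mut self, vm_id: &str, secret: &str) -> Result<(), String>;
}

pub struct VmRow {
    pub user_id: u64,
    /// Per-VM ownership token; lost for good once the row is dropped.
    pub vm_secret: String,
}

/// In-memory stand-in for the users and vms tables.
#[derive(Default)]
pub struct Store {
    pub users: HashSet<u64>,
    pub vms: HashMap<String, VmRow>,
}

#[derive(Debug, PartialEq)]
pub enum DeleteError {
    UserHasVms(usize),
    UnknownUser,
    UnknownVm,
    Compute(String),
}

/// Compute call first; the local row is dropped only after it succeeds.
pub fn delete_vm<C: Compute>(
    store: &mut Store,
    compute: &mut C,
    vm_id: &str,
) -> Result<(), DeleteError> {
    let row = store.vms.get(vm_id).ok_or(DeleteError::UnknownVm)?;
    compute
        .delete_vm(vm_id, &row.vm_secret)
        .map_err(DeleteError::Compute)?;
    store.vms.remove(vm_id);
    Ok(())
}

/// Refuses while the user still owns VMs (no cascade).
pub fn delete_user(store: &mut Store, user_id: u64) -> Result<(), DeleteError> {
    let owned = store.vms.values().filter(|v| v.user_id == user_id).count();
    if owned > 0 {
        return Err(DeleteError::UserHasVms(owned));
    }
    if store.users.remove(&user_id) {
        Ok(())
    } else {
        Err(DeleteError::UnknownUser)
    }
}

/// Test double that can be told to fail.
pub struct FakeCompute {
    pub fail: bool,
    pub deleted: Vec<String>,
}

impl Compute for FakeCompute {
    fn delete_vm(&mut self, vm_id: &str, _secret: &str) -> Result<(), String> {
        if self.fail {
            return Err("compute unreachable".to_string());
        }
        self.deleted.push(vm_id.to_string());
        Ok(())
    }
}
```

If the compute call fails, the row and its secret survive, so the delete can be retried later instead of leaving an unmanageable VM behind.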

Reference
lhumina_code/hero_compute#116