hero_os_tfgrid_deployer integration: methods we'll consume + small gaps #116
Reference
lhumina_code/hero_compute#116
The new admin tool `hero_os_tfgrid_deployer` (scope under discussion at hero_os_tfgrid_deployer#1) will consume the `ComputeService` OpenRPC (currently in `crates/my_compute_zos_server/src/cloud/openrpc.json`) as its only VM-lifecycle backend.

Reviewed the spec: most of what we need is already there. Filing this issue to (a) confirm intended usage so we don't drift, and (b) surface a few small gaps that would make the deployer's flow easier.
Methods the deployer will call
For each demo user we provision:

1. `ComputeService.inject_ssh_keys`: the deployer generates a per-user ED25519 key, registers the public half via this method, and retains the private half in its sqlite for SSH-back-in.
2. `ComputeService.deploy_vm` with spec `{ cpu: 16, memory: 8 GB, disk: 200 GB, rootfs: 16 GB, flist: "ubuntu-24.04-latest", publicip: false, node_id: <pinned> }`. Today's s132 work proves this spec is sufficient (16 CPU is overcommit for an 8 GB VM but matches what's in flight via the OpenTofu path).
3. Poll `ComputeService.get_vm` for `mycelium_ip` to appear + open.
4. `ComputeService.deploy_webgateway` mapping `<user>.<node>.grid.tf` → `http://<vm_ip>:9988` (where hero_router listens).
5. SSH into the VM and run the setup script (`hero_demo/deploy/single-vm/scripts/setup-binaries.sh`). Alternative: pipe through `ComputeService.vm_exec` if it handles long-running scripts cleanly; see Gap 2 below.
6. `ComputeService.list_vms` / `get_vm` / `vm_stats` for the admin UI's per-user state view.

Confirmation questions (low-cost: flag any "yes" / "no" / "TBD")
1. Is `deploy_vm` ready for production use on TFGrid mainnet? (s132 used OpenTofu directly against TFGrid, which works. We want to swap to this once it's stable for our flow.)
2. Does `deploy_vm` return synchronously after the VM is fully reachable (SSH-able), or does it return early and require polling `get_vm`? Documentation in the OpenRPC `summary` field would resolve this for any caller.
3. Can we attach metadata such as `{ user: "<forge_id>", profile: "demo", provisioned_at: ... }` per-VM, so the admin UI can join VMs back to users without round-tripping its own sqlite?
4. `inject_ssh_keys`: is this called pre- or post-`deploy_vm`? Order matters for our deployer flow.

Small gaps (what would help us)
1. A `ComputeService.wait_vm_ready(vm_id, timeout)` method that blocks until the VM is SSH-able (or the timeout expires). Today we'd poll `get_vm` from the deployer, which works, but every caller reimplements the same readiness logic. Not a blocker; nice-to-have.
2. `vm_exec`: does it stream stdout incrementally (good for our `setup-binaries.sh`, which prints ~1500 lines of `lab build` progress over 5-30 min) or buffer until the command exits? If buffered, we keep the SSH path; if streamed, we can drop the SSH dependency on the deployer side entirely.
3. `deploy_webgateway`: does it return the publicly-resolvable FQDN immediately, or does DNS propagation need extra wait? S132 saw the gateway resolve within ~30 s of `tofu apply` completing; if hero_compute mirrors that, no action needed.
4. Is the `ComputeService` socket reachable only locally, or does it expect bearer-token auth over the network? The deployer's host (deployer admin UI) is not on the same machine as `hero_compute`.

None of these are blockers; happy to file separate issues for any of them if that's easier. Mostly this is a tee-up for the deployer work that starts in the next few sessions (current plan in `hero_os_tfgrid_deployer#1` and the follow-up scope issues we're about to file there).

What's NOT a gap
- `deploy_vm` / `start_vm` / `stop_vm` / `restart_vm` / `delete_vm` / `list_vms` / `get_vm`: all present.
- `deploy_webgateway` / `list_webgateways` / `get_webgateway` / `delete_webgateway`: present.
- SSH key injection (`inject_ssh_keys`).
- Diagnostics (`vm_logs`, `vm_stats`, `vm_exec`).
- `migrate_secret`, `list_images`, `attach_hypervisor` are also there: beyond what we need immediately but useful later.

Context
- `hero_demo` `setup-binaries.sh`: 34/34 PASS on lab download/install + hero_proc + hero_router GREEN on a fresh TFGrid VM.
- hero_cockpit#1.
- hero_os_tfgrid_deployer#1.

cc @mahmoud, no rush; answers can come incrementally.
Thanks @mik-tf. I went through this against the latest `development` (`0670ec5`), reading `schemas/cloud/cloud.oschema` and the TFGrid impl in `crates/my_compute_zos_server/src/cloud/{rpc.rs,grid_driver.rs}`. Most of the spec maps, but there are a few important corrections before you build the deployer on it: several methods listed under "What's NOT a gap" are actually TFGrid stubs that return an error.

⚠️ Actual working surface on the TFGrid (ZOS) backend
`ComputeService` is one trait shared by two backends; on TFGrid a number of methods are intentionally unimplemented and return `Err("… is not supported on TFGrid")`.

Works on TFGrid: `node_register` / `node_status` / `node_unregister`, `set_tfgrid_node_ids`, `list_slices` / `get_slice`, `deploy_vm`, `delete_vm`, `list_vms` / `get_vm`, `vm_logs`, `node_stats`, `list_images`, `get_deployment_logs` / `list_deployments`, `get_ssh_keys` / `set_ssh_keys`, `list_jobs` / `job_logs`, `deploy_webgateway` / `list_webgateways` / `get_webgateway` / `delete_webgateway` / `list_gateway_nodes`.

Stubbed on TFGrid (error): `start_vm`, `stop_vm`, `restart_vm`, `inject_ssh_keys`, `vm_exec`, `vm_stats`, `attach_hypervisor`, `migrate_secret`.

So three claims in "What's NOT a gap" need revising:
- `inject_ssh_keys` (your step 1) is NOT supported on TFGrid. SSH keys are deploy-time only: pass the public key to `deploy_vm(ssh_keys=[…])` and the driver injects it via the flist's `SSH_KEY` env at boot. (`set_ssh_keys` / `get_ssh_keys` only manage a per-secret key store; they do not push into a running VM.)
- `deploy_vm` + `delete_vm` work. `start_vm` / `stop_vm` / `restart_vm` are stubs (to "stop", you delete; to "restart", delete and re-`deploy_vm`).
- `vm_logs` works (returns the deploy/delete progress log, not arbitrary command output). `vm_stats` and `vm_exec` are stubbed; `node_stats` exists but is node-level capacity/benchmark from Grid Proxy, not per-VM live telemetry.

⚠️ The VM spec is slice-based, not free-form
Real signature: `deploy_vm(name, slice_count, secret, image, ssh_keys, node_sid)`

- `cpu_count = slice_count`, `memory = slice_count × 4 GB`, `disk = Σ slice disk`.
- `{ cpu: 16, memory: 8 GB, disk: 200 GB, rootfs: 16 GB, publicip: false }` cannot be expressed: 8 GB = 2 slices = 2 vCPU (CPU and RAM are coupled 1:4), and there is no `publicip`, no `rootfs`, no independent disk parameter.
- `node_sid` is matched against the node hostname (`tfnode-<id>`), not a sid.
- `image` must be the full flist URL (`https://hub.grid.tf/tf-official-vms/ubuntu-24.04-latest.flist`), not `ubuntu-24.04-latest`.

Confirmation questions
1. […] `running` + mycelium IP; contracts created and cancelled cleanly; `0670ec5` just made the cancel idempotent on `delete_vm`. Caveats: the slice-spec mapping above, async deploy (C2), and the capacity precheck, which was just fixed to size the catalog from free (not total) node capacity. It is not an OpenTofu-equivalent for free-form specs / public IPs.
2. `deploy_vm` returns immediately with `state="provisioning"` and runs the on-chain deploy in a background task. Poll `get_vm` until `state=="running"` and `mycelium_ip` is set (then it is reachable). Same async pattern for `delete_vm` (→ `deleting` → record disappears) and `deploy_webgateway`. Agreed it's worth documenting in the OpenRPC `summary`.
3. `Vm` has `name` (free text, indexed) + `config{ cpu_count, image, network_config?, extra_args? }`. No generic tag map. Today: encode `user/profile` into `name`, or keep the join in the deployer's sqlite. A `metadata: map<str,str>` would be a small schema change.
4. Neither: keys are passed at deploy time via `deploy_vm(ssh_keys=[…])`.

Small gaps
1. `wait_vm_ready`: doesn't exist; poll `get_vm`. Note that "ready" today = `state=running` + `mycelium_ip` present. The server does not probe SSH/port, so a true SSH-able wait would be new logic.
2. `vm_exec` stream vs buffer: moot. It is not implemented on TFGrid, and the schema returns only an `i32` exit code, so even a future implementation wouldn't stream stdout without an API change. Keep the SSH path for `setup-binaries.sh`.
3. `deploy_webgateway` is async; for `kind="name"` the FQDN is filled in only when the background deploy reaches `state=ready` → poll `get_webgateway`. For `kind="fqdn"` you supply it. DNS propagation is on top of that.
4. UDS (`hero_compute_zos/rpc.sock`, raw protocol): local-only, no TCP bind, no built-in bearer auth. Per-call auth is the `secret` parameter = an sr25519-signed token derived from the node's `TFGRID_MNEMONIC` (or a raw ownership token), checked per-VM in `verify_secret`. A remote deployer (different host) needs the service exposed over the network: that's hero_router's job (TCP entry point + context/claim auth) or an SSH tunnel to the UDS. There is no network auth in the service itself today. This is the main cross-machine integration question and probably deserves its own issue.

Net
Lifecycle (`deploy_vm` / `delete_vm` / `get_vm` / `list_vms` / `vm_logs`) and the webgateway methods are solid on mainnet. The deployer should plan around: (1) slice-based sizing (no free-form cpu/disk/public-IP), (2) deploy-time SSH keys (no `inject_ssh_keys`), (3) SSH for exec/diagnostics (no `vm_exec` / `vm_stats`), and (4) remote auth via hero_router (the UDS is local-only).

Happy to file separate issues for the three actionable ones: a `metadata` field on the VM spec, free-form/public-IP sizing, and the remote-auth model.

Thanks @mahmoud, your corrections are folded into the arc tracker at lhumina_code/home#235 §3 (Dependency map). The deployer + cockpit are planned around the slice model, deploy-time SSH, delete-and-redeploy lifecycle, async deploy, and the UDS-local-only auth model. Substantive reply to your comment follows separately.
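The slice-based sizing rule quoted above (`cpu_count = slice_count`, `memory = slice_count × 4 GB`) can be sketched as a small helper. This is a minimal illustration, not hero_compute code: `slices_for` and `SliceSizing` are hypothetical names, and disk is left out because the per-slice disk size is not stated in this thread.

```rust
/// Memory per slice in GB, per the rule quoted in this thread.
const GB_PER_SLICE: u32 = 4;

/// Resources implied by a slice count (CPU and RAM are coupled 1:4).
#[derive(Debug, PartialEq)]
struct SliceSizing {
    slice_count: u32,
    cpu_count: u32,
    memory_gb: u32,
}

/// Smallest slice count covering both the requested vCPUs and memory.
/// Hypothetical helper: hero_compute does not expose this function.
fn slices_for(cpu: u32, memory_gb: u32) -> SliceSizing {
    // Ceiling division: 5 GB needs 2 slices, 8 GB needs 2 slices.
    let by_memory = (memory_gb + GB_PER_SLICE - 1) / GB_PER_SLICE;
    // One slice is one vCPU, so the CPU request is also a lower bound.
    let slice_count = cpu.max(by_memory).max(1);
    SliceSizing {
        slice_count,
        cpu_count: slice_count,
        memory_gb: slice_count * GB_PER_SLICE,
    }
}
```

Under this rule, asking for 8 GB alone yields 2 slices (2 vCPU, 8 GB), while insisting on 16 vCPU forces 16 slices and therefore 64 GB, which is why the original `{ cpu: 16, memory: 8 GB }` spec has no direct equivalent.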
Acknowledged — corrections folded into the deployer + cockpit plan
Thanks @mahmoud. The three "no-gap" claims I had wrong are corrected in the arc-level tracker at lhumina_code/home#235 §3. Quoting them back to confirm we read them the same way:

What we'll plan around on TFGrid

- `start_vm` / `stop_vm` / `restart_vm` stubbed: `hero_proc service` calls on services running inside the VM, not VM-level.
- `inject_ssh_keys` stubbed: keys go in at `deploy_vm(ssh_keys=[…])` time. No inject-after-create path.
- `vm_exec` + `vm_stats` stubbed: `system_info` reads RAM/disk from the VM's own `/proc/meminfo` + `df` + sysinfo crate, not via hero_compute.
- Async deploy: the deployer polls `get_vm` until `state="running"` AND `mycelium_ip` set; polls `get_webgateway` for FQDN on `kind="name"`. Worth surfacing in OpenRPC `summary` / `description` so consumers don't trip on the implicit polling contract.
- No `metadata` field on `Vm`: keep the `user/profile` join in the deployer's sqlite (already in the D1 schema).
- UDS-local-only auth: `/hero_router` skill + per-call `secret` parameter for `verify_secret`. Network-level wrapping is hero_router's job, not hero_compute's.

Confirmation questions: answered
Follow-up issues you offered to file
Three concrete things you mentioned you'd happily scope as separate issues:
- A `metadata: map<str,str>` field on the `Vm` spec: would let the deployer encode user/profile/owner directly instead of relying on its sqlite for the join. Nice-to-have, not blocking.
- Free-form sizing + public IP on `deploy_vm`: promotes hero_compute from secondary to primary adapter for the deployer.
- The remote-auth model for `ComputeService`.

Happy to file each on `hero_compute` myself if you prefer; just let us know which way works better. The arc-level tracker home#235 will pick them up either way.

What we don't need from hero_compute right now
- `wait_vm_ready`: confirmed; we poll `get_vm` ourselves. "Ready" = `state="running"` + `mycelium_ip` set + SSH-able on the deployer side via its own probe.
- `vm_exec` streaming: moot; the SSH path is the right answer for setup-binaries dispatch.

Net: deployer + cockpit work proceeds against the working surface you described. Webgateway async + slice-based sizing are documented constraints, not surprises. Track A (cockpit) starts in s133; Track D (deployer) starts s146-ish; integration in s152.
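The client-side polling contract described above (poll `get_vm` until `state="running"` and `mycelium_ip` is set) might look roughly like the sketch below. It assumes a `VmView` struct with just those two fields and takes the actual `get_vm` call as a closure, so no real transport or hero_compute API is implied; `wait_vm_ready` here is a hypothetical deployer-side helper, not the server method discussed in Gap 1.

```rust
use std::thread::sleep;
use std::time::{Duration, Instant};

/// The subset of a `get_vm` response this sketch looks at (field names assumed).
#[derive(Clone, Debug)]
struct VmView {
    state: String,
    mycelium_ip: Option<String>,
}

/// "Ready" as described in the thread: running, with a non-empty mycelium IP.
fn is_ready(vm: &VmView) -> bool {
    vm.state == "running" && vm.mycelium_ip.as_deref().map_or(false, |ip| !ip.is_empty())
}

/// Polls `fetch` until the VM is ready or `timeout` elapses.
/// `fetch` stands in for a real `get_vm` call; any fetch error aborts the wait.
fn wait_vm_ready<F>(mut fetch: F, timeout: Duration, interval: Duration) -> Result<VmView, String>
where
    F: FnMut() -> Result<VmView, String>,
{
    let deadline = Instant::now() + timeout;
    loop {
        let vm = fetch()?;
        if is_ready(&vm) {
            return Ok(vm);
        }
        if Instant::now() >= deadline {
            return Err(format!("timed out in state {:?}", vm.state));
        }
        sleep(interval);
    }
}
```

Since the server does not probe SSH, a caller that needs "SSH-able" would still follow a successful wait with its own probe of the VM's SSH port.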
mik-tf referenced this issue from lhumina_research/hero_demo 2026-05-21 12:02:01 +00:00
mik-tf referenced this issue from lhumina_code/hero_os_tfgrid_deployer 2026-05-21 21:59:37 +00:00
Net for the deployer arc — three small asks
Thanks @mahmoud. I re-read your corrections + the working surface against the s141 deployer-side code work just done (forge.rs / db.rs / 3 OpenRPC methods landed on `hero_os_tfgrid_deployer` main, see `796e715`). Net: the deployer plan is unblocked on your current `development` HEAD. We sidestep the 6 TFGrid stubs that affect us by design (`destroy+redeploy` instead of `start/stop/restart_vm`; deploy-time SSH keys via `deploy_vm(ssh_keys=[…])`; SSH for exec; in-VM `/proc` for stats), and the working slice-model surface (`deploy_vm` / `delete_vm` / `get_vm` / `list_vms` / `vm_logs` / `deploy_webgateway`) is what s143 will call directly.

Three housekeeping items to keep coordination tight

1. Tag `v0.1.0-rc1` at current `development` HEAD. Today our Cargo.toml would pin `branch = "development"`, which moves under us; a tagged release lets the deployer pin a known-good. ~10 min for you.
2. Document the async contract in the `schemas/cloud/cloud.oschema` `summary` / `description` for `deploy_vm` / `delete_vm` / `deploy_webgateway`: every future consumer trips on the implicit "returns immediately, poll until `state="running"` AND `mycelium_ip` set" rule otherwise. ~30 min, doc-only.
3. File the 3 follow-up issues: (a) `metadata: map<str,str>` field on `Vm` spec, (b) free-form sizing + public-IP on `deploy_vm`, (c) remote-auth model for ComputeService (or a pointer to hero_router fronting). We can draft + post them ourselves with cross-links if you prefer; just say which way works better.

What we'll do on the deployer side regardless

- `ComputeServiceAdapter` against the working surface (slice model, 2 slices = demo profile).
- […] (b) lands.

Keeping coordination minimal: your daily work on `zos_admin` is independent of this, so we're not on each other's critical paths. Tag + docs + 3 issues is the whole ask. Thanks!

Signed-by: mik-tf <mik-tf@noreply.invalid>
s142 close — D-23 SSH key custody locked + first deployer→hero_compute call live
(ack on the slice-model surface from #116 thread, plus 2 follow-ups for you when convenient)
1. D-23 custody model locked (workspace decision file: D-23)
After Phase B.5 adversarial review of our planned `deployer.request_ssh_key_for(user_id, vm_id)` shape, we caught that storing the user's SSH private key in the deployer would re-introduce exactly the impersonation-vault we'd already rejected for Forge tokens at our D-22. It is a different credential class (root-shell-grade) and a worse one (no `must_change_password` analog to make it decay). So we forked the plan:

- Users register their public keys in the Forge `/user/settings/keys` UI.
- The deployer fetches them via the `ForgeClient::list_user_ssh_keys(username)` admin endpoint at provision time.
- `ComputeService.deploy_vm(name, slice_count, secret, image, ssh_keys, node_sid)` is called with `ssh_keys` flowing inline as a list of openssh strings.
- A per-VM `vm_secret` ownership token (CSPRNG 32-char) is minted deployer-side and stored in our sqlite. It is a separate credential class from SSH access; it authenticates VM management (delete, redeploy) only.

So the deployer never holds an SSH private key anywhere: not in sqlite, not in memory beyond the deploy call, not in transit. This matches your `hero_compute_sdk::ssh_secret_hash`-keyed `SshKeyStore` model server-side.

2. Spec/wire alignment question (minor)
While reading the wire shape we noticed there are two `openrpc.json` files in `my_compute_mos_server`:

- `crates/my_compute_mos_server/openrpc.json` (top-level, 14.9KB, May 14): `deploy_vm` here takes `(name, slice_sid, image, cpu_count, secret)`, NO `ssh_keys` parameter, no `set_ssh_keys` method.
- `crates/my_compute_mos_server/src/cloud/openrpc.json` (under src/cloud/, 46.7KB, May 21): matches the oschema; `deploy_vm` takes the full `(name, slice_count, secret, image, ssh_keys, node_sid)`; `set_ssh_keys` present.

The top-level file looks stale (probably an older regen that didn't get cleaned up). Could you either delete it or document which one is canonical? It cost us a Phase B back-and-forth before we realized which was the wire spec.
3. Typed Rust SDK gap (filed separately)

`crates/my_compute_zos_sdk/src/lib.rs:24` is still the original TODO stub. We hand-rolled the JSON-RPC envelopes via `hero_compute_sdk::http_rpc_tcp` for now (the deployer's new `compute.rs` adapter), but a typed SDK would remove that boilerplate for every future consumer. Filing as its own issue so it can stand independently.

Live-smoke gap (operational, not blocking): end-to-end provisioning is paused until we have a TFGrid VM (Track F's F1 in our internal arc). Unit tests + binary-symbol smoke cover the dispatch + decode + error paths; the full `provision_vm → deploy_vm → poll until running → SSH ping` waits on a real VM.

-- mik-tf
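For readers unfamiliar with what "hand-rolled JSON-RPC envelopes" involves, here is a rough std-only sketch of building a JSON-RPC 2.0 request for `deploy_vm` with the six parameters named in this thread. Everything about the wire format is an assumption made for illustration: the method string `ComputeService.deploy_vm`, by-name params, and the parameter types are not confirmed here, and `DeployVmParams`, `json_str` and `deploy_vm_request` are hypothetical names, not part of `hero_compute_sdk`.

```rust
/// Assumed wire method name; the real one is defined by the OpenRPC spec.
const METHOD: &str = "ComputeService.deploy_vm";

/// Parameters named in the thread; the types are assumptions.
struct DeployVmParams<'a> {
    name: &'a str,
    slice_count: u32,
    secret: &'a str,
    image: &'a str,
    ssh_keys: &'a [&'a str],
    node_sid: &'a str,
}

/// Quotes and escapes a string for embedding in a JSON document.
fn json_str(s: &str) -> String {
    let mut out = String::with_capacity(s.len() + 2);
    out.push('"');
    for c in s.chars() {
        match c {
            '"' => out.push_str("\\\""),
            '\\' => out.push_str("\\\\"),
            '\n' => out.push_str("\\n"),
            '\r' => out.push_str("\\r"),
            '\t' => out.push_str("\\t"),
            c if (c as u32) < 0x20 => out.push_str(&format!("\\u{:04x}", c as u32)),
            c => out.push(c),
        }
    }
    out.push('"');
    out
}

/// Builds the request body; sending it is left to the transport layer.
fn deploy_vm_request(id: u64, p: &DeployVmParams) -> String {
    let keys: Vec<String> = p.ssh_keys.iter().map(|k| json_str(k)).collect();
    format!(
        r#"{{"jsonrpc":"2.0","id":{},"method":"{}","params":{{"name":{},"slice_count":{},"secret":{},"image":{},"ssh_keys":[{}],"node_sid":{}}}}}"#,
        id,
        METHOD,
        json_str(p.name),
        p.slice_count,
        json_str(p.secret),
        json_str(p.image),
        keys.join(","),
        json_str(p.node_sid),
    )
}
```

A typed SDK would replace this kind of string assembly with generated request and response types, which is the boilerplate the comment above refers to.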
Quick update on the decisions side, since the last reply pre-dates both. We locked the deployer's delete semantics at D-24 (lhumina_code/home#235):

- `delete_user` refuses if the user still owns VMs (no auto-cascade saga).
- `delete_vm` calls `ComputeService.delete_vm` first and only then drops the sqlite row. An orphaned VM keeps billing, and the per-VM secret is lost on sqlite drop, so the compute call has to succeed before we lose the handle.
- `PRAGMA foreign_keys=ON` is now a second-line guard.

Then yesterday we landed D-25 on top: the `vms.user_id` FK got upgraded from a bare reference to `ON DELETE RESTRICT` via the canonical SQLite recreate-with-FK dance, so the refuse-if-vms invariant is now enforced at the schema layer too, not just in the handler. No wire-shape changes on your side from any of this. Separately, we filed #118 with our only outstanding ask: access to a `hero_compute_mos_server` we can hit for the first live smoke.
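The D-24 ordering can be illustrated with a small in-memory model. This is only a sketch of the invariant, not the deployer's code: `Store` stands in for the sqlite tables, the `Compute` trait stands in for the remote `ComputeService`, and all names (`DeleteError`, `FakeCompute`) are hypothetical.

```rust
use std::collections::HashMap;

/// Stand-in for the remote ComputeService; only delete_vm is modelled.
trait Compute {
    fn delete_vm(&mut self, vm_id: &str, secret: &str) -> Result<(), String>;
}

/// In-memory stand-in for the deployer's sqlite tables.
#[derive(Default)]
struct Store {
    /// vm_id -> (owner user_id, per-VM secret)
    vms: HashMap<String, (String, String)>,
    users: Vec<String>,
}

#[derive(Debug, PartialEq)]
enum DeleteError {
    UserStillOwnsVms(usize),
    UnknownVm,
    Compute(String),
}

/// Refuses while the user still owns VMs (no cascade).
fn delete_user(store: &mut Store, user_id: &str) -> Result<(), DeleteError> {
    let owned = store.vms.values().filter(|(owner, _)| owner.as_str() == user_id).count();
    if owned > 0 {
        return Err(DeleteError::UserStillOwnsVms(owned));
    }
    store.users.retain(|u| u.as_str() != user_id);
    Ok(())
}

/// Calls compute first; the local row (and its secret) is dropped only on success.
fn delete_vm(store: &mut Store, compute: &mut dyn Compute, vm_id: &str) -> Result<(), DeleteError> {
    let secret = store.vms.get(vm_id).map(|(_, s)| s.clone()).ok_or(DeleteError::UnknownVm)?;
    compute.delete_vm(vm_id, &secret).map_err(DeleteError::Compute)?;
    store.vms.remove(vm_id);
    Ok(())
}

/// Test double: every call fails while `up` is false.
struct FakeCompute {
    up: bool,
}

impl Compute for FakeCompute {
    fn delete_vm(&mut self, _vm_id: &str, _secret: &str) -> Result<(), String> {
        if self.up { Ok(()) } else { Err("unreachable".to_string()) }
    }
}
```

If the compute call fails, the row and its secret stay in place, so the delete can be retried later with the same handle; the `ON DELETE RESTRICT` foreign key then backs up the `delete_user` check at the schema level.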