deploy_vm consistently rejected at ZOS workload phase on FreeFarm mainnet (regression since 2026-05-23) #125

Closed
opened 2026-05-24 20:13:23 +00:00 by mik-tf · 5 comments
Owner

ComputeService.deploy_vm against TFGrid mainnet consistently fails with the ZOS-side error vm deployment entered error state, independent of the chosen node or image. Last known good was the deploy_vm round trip 2026-05-23 16:25 UTC (vm sid 000t on tfnode-12, twin 6905). Today (2026-05-24) every deploy attempt fails at the ZOS workload phase even though TFChain accepts both contracts and Grid Proxy shows ample free capacity.

Reproduction

Workstation operator: hero_compute self-hosted from the core/TFGRID_MNEMONIC (twin 6905) on TFGrid mainnet. Stack: hero_router + hero_tfgrid_deployer_server + my_compute_zos_server, all freshly built from origin/development HEAD as of this filing. cargo test --workspace clean.

deployer.provision_vm (which calls ComputeService.deploy_vm internally) was tried with:

  • node tfnode-1 (sid 000u), node tfnode-12 (sid 000q)
  • image ubuntu-22.04 (deployer default), image Ubuntu 24.04 (catalog default), image https://hub.grid.tf/tf-official-vms/ubuntu-24.04-latest.flist (explicit flist URL)
  • slice_count 2 (= 8 GB demo profile)

All four attempts produced the same error:

{"code":-32603,"data":"Redis operation error: Internal error: backend error: vm deployment entered error state [deploy-phase=zos-workload] - ZOS daemon on the target node rejected the workload after contract submission. ..."}

Timing: each attempt creates 2 contracts on TFChain (network + VM, ~30s apart), then ZOS rejects the workload ~30-60s later. Total elapsed: ~60-90s.

Grid state checks

  • Twin 6905: 40 to 41 baseline contracts before each attempt, no FreeZone production touched.
  • Node 1 (tfnode-1): status=up, total MRU 188 GB, used 73 GB, free 115.6 GB. All 26 slices status=free per ComputeService.list_slices.
  • Node 12 (tfnode-12): online per ComputeService.list_nodes.
  • Each failed attempt left exactly 2 orphan contracts in state Created (network + VM) on TFChain. All recovered via my_compute_zos_server --cancel-contracts <vm> <net> (4 pairs cleaned this session: 2095131, 2095132, 2095133, 2095134, 2095135, 2095136, 2095137, 2095138).

Hypothesis

The vm deployment entered error state text originates from tfgrid_sdk_rust's GridClient when ZOS rejects the workload at the daemon side. Possible causes:

  1. ZOS daemon regression on FreeFarm nodes 1 and 12 (both fail identically)
  2. tfgrid_sdk_rust pinned at b8774c34 is now incompatible with current TFChain or ZOS protocol
  3. Mycelium connectivity between operator workstation and FreeFarm node ZOS daemons is broken (workstation mycelium is running with 10 peers connected)
  4. ZOS workload spec validation: cpu_count, memory_bytes, or mycelium_seed format changed expectations

What was verified working today: list_nodes, list_slices, list_images, node_register, node_status, deploy_vm filter / param validation (everything pre-contract-submission). Contract creation on TFChain also works (we created 8 contracts and all reached state Created, then we cancelled all 8). So the operator wallet, network connectivity to TFChain, and the GridClient are all healthy. The break is specifically at the post-contract ZOS workload stage.

Asks

  1. Reproduce on QA (zoscompute.gent01.qa.grid.tf topology) to confirm whether the failure is unique to mainnet.
  2. Check ZOS node-side logs on FreeFarm node 1 or 12 around 2026-05-24 19:46 to 20:11 UTC for any workload validation rejection log lines (twin 6905 is the source twin).
  3. If a tfgrid_sdk_rust bump is needed, please flag and we will run the dependency bump under our own pre-merge gate.

This is the gating issue for the demo-deployer arc closure: home#235 closure depends on at least one full successful round trip through deployer.provision_vm.

Author
Owner

Investigation update from a 3-deploy probe at s157a close (2026-05-24 22:30-22:40Z), now that hero_compute has the deploy_vm Err-path orphan rollback landed at 8be3294 (closes #119 reopened):

1. Repro is NOT node-specific. Two consecutive provision_vm attempts via hero_tfgrid_deployer_server against FreeFarm node 1 (tfnode-1, sid 000u) AND one attempt against FreeFarm node 12 (tfnode-12, sid 000q) all failed identically at the ZOS workload phase, ~110-115s wall clock per attempt. Both nodes report status=online on Grid Proxy. The error shape is bit-for-bit identical across nodes:

vm deployment entered error state [deploy-phase=zos-workload]

Six contracts created on twin 6905 (2095139+2095140 on node 1, 2095141+2095142 on node 1 with patched SDK, 2095143+2095144 on node 12). All six auto-cancelled by the new s157a rollback within ~6s of the SDK Err.

2. The SDK is not hiding the reason — ZOS itself provides none. We patched tfgrid_sdk_rust/src/grid_client/mod.rs at lines 1063-1068 and 1219-1224 locally to surface vm_workload.result.error: String (the field is already there in zos::ResultData, just discarded by GridError::backend("vm deployment entered error state")). With the patched binary, the surfaced error reads:

vm deployment entered error state (001t=)

The format is workload_name=error_string. The error string is empty. ZOS marks the VM workload STATE_ERROR but writes no message into result.error. This holds across all three deploys on both nodes.
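For anyone scripting against the patched binary, here is a minimal sketch of splitting that surfaced message back into its two parts. It assumes only the shape quoted above (a trailing parenthesised `workload_name=error_string`); the function names are hypothetical and are not part of tfgrid_sdk_rust.

```rust
/// Splits the trailing "(workload_name=error_string)" detail of a surfaced
/// error message into its two parts. Returns None when the message does not
/// carry that detail (for example, the unpatched SDK message).
/// Hypothetical helper, not part of tfgrid_sdk_rust.
fn parse_surfaced_error(msg: &str) -> Option<(&str, &str)> {
    let open = msg.rfind('(')?;
    let close = msg.rfind(')')?;
    if close <= open {
        return None;
    }
    // '(' is a single byte, so open + 1 is always a valid char boundary.
    let detail = &msg[open + 1..close];
    detail.split_once('=')
}

/// True when ZOS flagged the workload as errored but supplied no reason.
fn is_silent_rejection(msg: &str) -> bool {
    matches!(parse_surfaced_error(msg), Some((_, reason)) if reason.is_empty())
}
```

Applied to the message above, `parse_surfaced_error` yields the workload name `001t` and an empty reason, which is what `is_silent_rejection` flags.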

3. Only the VM workload errors; the network workload provisions OK. Every failed deploy minted exactly two contracts on chain: one network contract (e.g. 2095139, deployment_data.type=network, name=rust_net_<ts>) and one VM contract (e.g. 2095140, deployment_data.type=vm, name=001t). The vm_changes.iter().filter(state==Error) in the SDK matches only the VM workload, never the network workload. So the substrate side and the network workload are healthy; the ZOS daemon rejects the VM workload specifically, silently.

4. Window of regression. s149 (2026-05-23) successfully deployed and round-tripped a VM on node 12 with the same flow. s156 (2026-05-24 ~16:44Z), s157 (2026-05-24 ~20:00Z), and now s157a (2026-05-24 ~22:30Z) all fail with the identical opaque pattern. The regression window is approximately 24h, on the TFGrid side: we ruled out our flist URL form, image variant, and SDK rev (the SDK is on the same b8774c34 mainnet pin throughout).

5. Action items.

a. We cannot diagnose further from off-node — result.error is empty so the SDK has no more to give. Next step needs node-side ZOS logs on FreeFarm node 1 OR node 12 around the workload errors (e.g. zinit log zos.boot, journalctl -u zos, or the ZOS state.json for the failed deployments). If anyone has node-side access on FreeFarm, the workloads we just tried are: contract 2095140 (vm 001t on node 1), 2095142 (vm 001t on node 1), 2095144 (vm 001u on node 12), all under twin 6905, all in state=Deleted now after rollback.

b. The SDK-side gap (result.error discarded) is worth a small upstream PR to threefoldtech/tfgrid-sdk-rust so future investigations don't need a local fork. Diff is ~10 lines split across two near-identical spots in grid_client/mod.rs. Not needed for #125 itself (the error is empty), but useful infrastructure for any future "what did ZOS reject" debugging.

c. With s157a's rollback landed, this failure mode no longer leaks contracts — the daemon cleans up automatically on every Err. The bug remains gating in that no real VM can be provisioned on mainnet right now, but the cost is now bounded to substrate fees on the (instantly cancelled) network+vm contracts per attempt.
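The Err-path rollback described in (c) can be sketched roughly as follows. This is not the code at 8be3294: the `ContractLedger` trait, `MemLedger`, and `deploy_with_rollback` are hypothetical stand-ins that only illustrate the pattern of cancelling every contract created since the attempt started while still returning the original error.

```rust
/// Hypothetical minimal view of the chain operations a rollback needs.
trait ContractLedger {
    /// IDs of contracts created at or after `start_unix`.
    fn created_since(&self, start_unix: u64) -> Vec<u64>;
    /// Cancels one contract by ID.
    fn cancel(&mut self, contract_id: u64) -> Result<(), String>;
}

/// In-memory ledger used only to exercise the sketch: (contract_id, created_at).
struct MemLedger {
    contracts: Vec<(u64, u64)>,
}

impl ContractLedger for MemLedger {
    fn created_since(&self, start_unix: u64) -> Vec<u64> {
        self.contracts
            .iter()
            .filter(|(_, created)| *created >= start_unix)
            .map(|(id, _)| *id)
            .collect()
    }

    fn cancel(&mut self, contract_id: u64) -> Result<(), String> {
        let before = self.contracts.len();
        self.contracts.retain(|(id, _)| *id != contract_id);
        if self.contracts.len() < before {
            Ok(())
        } else {
            Err(format!("contract {contract_id} not found"))
        }
    }
}

/// Runs `deploy`; on Err, cancels every contract created since `start_unix`
/// and then returns the original error unchanged.
fn deploy_with_rollback<L, F>(ledger: &mut L, start_unix: u64, deploy: F) -> Result<String, String>
where
    L: ContractLedger,
    F: FnOnce(&mut L) -> Result<String, String>,
{
    match deploy(ledger) {
        Ok(vm) => Ok(vm),
        Err(e) => {
            for id in ledger.created_since(start_unix) {
                // Best effort: a failed cancel must not mask the deploy error.
                let _ = ledger.cancel(id);
            }
            Err(e)
        }
    }
}
```

Keying the scan on the attempt's start time, rather than on IDs returned by the deploy call, matters here because the SDK returns Err before the caller ever sees the contract IDs.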

Reproduction inputs (in case someone wants to retry):

  • provision_vm via deployer (port 9988 → /hero_tfgrid_deployer/rpc) with node_sid=000u or 000q, default image ubuntu-22.04, default slice_count 2
  • Workspace: this hero_compute build at 8be3294 (origin/development)
  • Twin: 6905 (Hero treasury wallet, mainnet)
Author
Owner

ROOT CAUSE IDENTIFIED — rootfs_size_bytes massively overflows SSD capacity

Following the user's tip ("hero_demo with OpenTofu works"), I dug into the differential between our SDK path and the canonical Go/Terraform path. The bug is on OUR side, not TFGrid.

The smoking gun — slice catalog on FreeFarm node 1 (sid 000u) right now reports disk_gb=5128 per slice (24 slices total). Node 12 reports disk_gb=182926 per slice (2 slices total). For a default 2-slice deploy the resulting rootfs_size_bytes is ~10 TB on node 1 and ~365 TB on node 12. ZOS rejects silently with state=Error and empty error string because rootfs is SSD-backed and the request is orders of magnitude larger than the available SSD.

The bug chain:

  1. crates/my_compute_zos_server/src/cloud/node_capacity.rs:381 computes total_disk_gb from SSD plus HDD combined:

    total_disk_gb: (total.sru + total.hru) / GIB,
    

    For node 1, Grid Proxy reports sru=1863 GB and hru=134112 GB (134 TB of spinning disk), so total_disk_gb = ~136,000 GB.

  2. cloud/node_capacity.rs:171 size_node_catalog then divides that by slice_count with headroom:

    disk_per_slice_gb = (free_disk_gb * 100) / (hr * slice_count)
    

    yielding ~5128 GB per slice on a 24-slice node 1, ~182926 GB per slice on a 2-slice node 12.

  3. cloud/rpc.rs:1120 then sums those at deploy time:

    let total_disk_gb: u64 = allocated.iter().map(|s| s.disk_gb).sum();
    

    and cloud/rpc.rs:1253 passes that to the SDK as rootfs bytes:

    total_disk_gb * 1024 * 1024 * 1024
    
  4. The Rust SDK serializes this as the workload's "size" field in the zmachine workload data. ZOS allocates rootfs from SSD only. ~10 TB rootfs request, ~1.3 TB SSD free. ZOS bails before the VM ever boots, sets state=Error.

  5. The Rust SDK's result.error reads empty (confirmed via local SDK patch surfacing vm_workload.result.error: String in grid_client/mod.rs:1063-1068). ZOS does not currently write a reason into the workload result for this failure mode.
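The arithmetic in the chain above can be replayed with a short sketch. It reuses the per-slice formula quoted in step 2; the function names are hypothetical, and the headroom value of 110 is an assumption taken from the "110% headroom" figure mentioned in the proposed fix, not read from the code.

```rust
const GIB: u64 = 1024 * 1024 * 1024;

/// Per-slice disk size in GB, following the formula quoted from
/// size_node_catalog. `headroom_pct` stands in for `hr` (assumed to be 110).
/// Callers must pass a non-zero headroom and slice count.
fn disk_per_slice_gb(disk_gb: u64, headroom_pct: u64, slice_count: u64) -> u64 {
    (disk_gb * 100) / (headroom_pct * slice_count)
}

/// Rootfs size in bytes for a deploy that allocates `slices` slices,
/// mirroring the sum in step 3 followed by the GB-to-bytes multiplication.
fn rootfs_bytes(per_slice_gb: u64, slices: u64) -> u64 {
    per_slice_gb * slices * GIB
}
```

With node 1's total figures (sru 1863 GB, hru 134112 GB), 24 slices, and 110% headroom, `disk_per_slice_gb(1863 + 134112, 110, 24)` gives 5150 GB per slice, while the SSD-only `disk_per_slice_gb(1863, 110, 24)` gives 70 GB. The live catalog value of 5128 is slightly lower, presumably because the real formula starts from free rather than total disk. A 2-slice deploy at 5150 GB per slice requests roughly 10 TB of rootfs, far beyond the ~1.3 TB of free SSD.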

Why s149 worked and s156+ does not: most likely the s149 node had little or no HDD, so the math produced a sane rootfs size. The first FreeFarm node with substantial HDD that came online and got registered tipped the catalog past what SSD can satisfy. Both currently-registered nodes (1 + 12) have substantial HDD, so both fail identically.

Why OpenTofu (hero_demo/deploy/single-vm/tf/main.tf) works: the OpenTofu config sets rootfs_size = 2048 MB (or 16384 MB for nu-shell setups) by hand, and the data disk is a separate workload mounted at /data. The rootfs is sized to fit on SSD, and the data disk can land on HDD via mount. The Rust path conflates rootfs and total disk into one field, then sources both from sru+hru.

Proposed fix (small, structural):

a. node_capacity.rs:381 should size total_disk_gb from sru only, not sru+hru. HDD is irrelevant to rootfs and should not feed the slice catalog's disk dimension. With this fix, node 1 yields ~70 GB rootfs per slice (1863 GB SSD across 24 slices with 110% headroom), which is healthy.

b. Optional but desirable: have cloud/rpc.rs:deploy_vm cap rootfs at a fixed sane value (e.g., 20-50 GB) and pass any larger disk request through a separate volumes field on VmSpec (which the SDK supports — VmSpec.volumes: Vec<VolumeMountSpec>). This mirrors the Go SDK / Terraform pattern of disks + mounts and makes rootfs vs data-disk semantics explicit.

c. Until either lands: existing registrations on nodes with substantial HDD must be reset. ComputeService.node_unregister + node_register for nodes 1 and 12 will re-size the catalog after the node_capacity fix lands.
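Option (b) could look roughly like the sketch below. `DiskPlan` and `plan_disks` are hypothetical names used only for illustration; the mapping of `data_volume_gb` onto `VmSpec.volumes` is left out because the fields of `VolumeMountSpec` are not shown in this thread.

```rust
/// Hypothetical result of splitting a disk request: a capped rootfs plus an
/// optional separate data volume, in GB.
#[derive(Debug, PartialEq)]
struct DiskPlan {
    rootfs_gb: u64,
    data_volume_gb: Option<u64>,
}

/// Caps the rootfs at `rootfs_cap_gb` and moves any remainder of the request
/// into a separate data volume instead of inflating the rootfs.
fn plan_disks(requested_gb: u64, rootfs_cap_gb: u64) -> DiskPlan {
    if requested_gb <= rootfs_cap_gb {
        DiskPlan {
            rootfs_gb: requested_gb,
            data_volume_gb: None,
        }
    } else {
        DiskPlan {
            rootfs_gb: rootfs_cap_gb,
            data_volume_gb: Some(requested_gb - rootfs_cap_gb),
        }
    }
}
```

With a 50 GB cap, a 140 GB request would become a 50 GB rootfs plus a 90 GB data volume, which matches the disks + mounts split the OpenTofu config uses by hand.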

Triangulation done in this session (3 live deploys today, all auto-rolled-back by s157a 8be3294, zero residue on Grid Proxy):

  • Default ubuntu-22.04 image, node 1 → state=Error, empty SDK error, 2 orphans cancelled (2095139, 2095140)
  • Default image, node 1, with SDK locally patched to surface vm_workload.result.error → state=Error, error string EMPTY, 2 orphans cancelled (2095141, 2095142)
  • Default image, node 12 → identical failure mode, 2 orphans cancelled (2095143, 2095144)
  • Explicit https://hub.grid.tf/tf-official-vms/ubuntu-24.04-latest.flist image, node 1 → identical failure mode. Flist URL form is not the cause.

The auto-rollback (grid_driver::scan_orphan_contracts_since + cancel_one_on_tfgrid) worked four-for-four across both nodes, on every Err path. Zero new state=Created contracts under twin 6905 since session start.

Next: implement (a) above on a small development_mik branch, re-register nodes 1 and 12, retry one deploy, expect Running state with mycelium_ip populated. Will pick this up in s157b after a session boundary so the rollback fix lands clean first.

mik-tf reopened this issue 2026-05-25 02:07:26 +00:00
Author
Owner

s157c update — 2 SDK challenge bugs found + failure-mode shift

Context for Mahmoud: s157c picked this up overnight. Three things you need to know.

1. We can rent dedicated nodes from twin 14199 ops wallet — confirmed with a successful rent_node(3467) on farm 646 (JimboTFT, Canada). Discovery: the substrate gate for public rent is extraFee > 0 on the node (not rentable: True alone — that flag alone fires OnlyTwinAdminCanDeploy). FreeFarm rentable big-class nodes (2010-2025) are all status: down right now (~25 minute timing observed) — that's why FreeFarm dedicated path looked open at s157b /stop and was closed by the time s157c started.

2. Wire-payload diff against tfgrid-sdk-go found two workload_challenge bugs for ZMACHINE_TYPE in tfgrid-sdk-rust HEAD b8774c34 — same class of bug as commit 74d9ed2 (fix(gateway): correct field order in workload_challenge to match ZOS) that just landed for gateway workloads:

  • MachineNetwork field-order swap: Rust emits Interfaces before Mycelium in the challenge; Go SDK struct order is PublicIP → Planetary → Mycelium → Interfaces. Swapping the order in tfgrid-sdk-rust/src/grid_client/deployment.rs:680-683 is the primary fix.
  • Missing corex field: the ZMachineData struct includes corex: bool but the challenge code does NOT include it. Go SDK struct has Corex between Env and GPU. Insert at deployment.rs:702 to match.

3. After applying both patches against the rented node, the failure mode changed: from vm deployment entered error state [deploy-phase=zos-workload] in ~110s (ZOS signature reject) to TFGrid deploy timed out after 300s (substrate-accept, ZOS provisioning, just not Running by the D-27 5-min timeout). Bumping the timeout to 900s now; this looks like the patches are correct and the VM needs more time to come up on a fresh node.

Direct ask: Can you confirm those two challenge fixes look right? And: what changed on mainnet ZOS between 2026-05-23 and 2026-05-24 — was it a coordinated SDK + ZOS protocol bump where these challenge fields were added? Will continue probing while you sleep; if we land a successful deploy we will report back here.

State preserved on TFChain: twin 14199 baseline = 1 active RentContract on node 3467 (contract_id 2095159, billing hourly), all probe-orphan node contracts auto-cancelled by the s157a rollback. Twin 6905 treasury untouched at 40 contracts.

Author
Owner

s157c session close — null results that narrow the investigation

Correction to my earlier comment 36619: the two SDK fixes I suggested were wrong. After reading zosbase canonical Challenge() directly (https://github.com/threefoldtech/zosbase/blob/master/pkg/gridtypes/zos/zmachine.go, ZMachine.Challenge() line 191 and MachineNetwork.Challenge() line 30):

  • MachineNetwork.Challenge() canonical order IS PublicIP → Planetary → Interfaces → Mycelium (matches the ORIGINAL Rust SDK at tfgrid-sdk-rust@b8774c34/src/grid_client/deployment.rs:673-683). The Go struct order has Mycelium before Interfaces, but the Challenge method does NOT follow struct order. My swap patch was wrong and actively regressed deploys (workload never appeared on node when applied).
  • ZMachine.Challenge() does NOT include corex either. So inserting corex was neutral (or also wrong).
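To illustrate why the swap mattered, here is a deliberately simplified sketch. It is not the zosbase or tfgrid-sdk-rust challenge encoding: the struct, its field types, and the plain string concatenation are all assumptions. It only shows that emitting the same fields in struct order instead of the canonical Challenge order produces different bytes, and therefore a different signature input.

```rust
/// Illustrative stand-in for the network part of a zmachine workload.
/// Field types are assumptions, not the real SDK definitions.
struct MachineNetwork {
    public_ip: String,
    planetary: bool,
    interfaces: Vec<String>,
    mycelium: Option<String>,
}

/// Canonical order per the comment above:
/// PublicIP, Planetary, Interfaces, Mycelium.
fn challenge_canonical(n: &MachineNetwork) -> String {
    let mut out = String::new();
    out.push_str(&n.public_ip);
    out.push_str(&n.planetary.to_string());
    for iface in &n.interfaces {
        out.push_str(iface);
    }
    if let Some(seed) = &n.mycelium {
        out.push_str(seed);
    }
    out
}

/// The mistaken variant: follows Go struct order (Mycelium before Interfaces).
fn challenge_struct_order(n: &MachineNetwork) -> String {
    let mut out = String::new();
    out.push_str(&n.public_ip);
    out.push_str(&n.planetary.to_string());
    if let Some(seed) = &n.mycelium {
        out.push_str(seed);
    }
    for iface in &n.interfaces {
        out.push_str(iface);
    }
    out
}
```

Whenever both an interface and a mycelium entry are present, the two functions disagree, so a node verifying against the canonical order would not accept a payload signed over the struct-order bytes.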

6 deploy probes ran tonight on rented dedicated node 3467 (farm 646, JimboTFT, Canada). All failed identically with the vm deployment entered error state [deploy-phase=zos-workload] rejection and an empty error string. Variables we ruled out:

  1. SDK challenge fields (corex, mycelium/interfaces order) — both wrong hypotheses
  2. ssh_keys empty vs populated — same failure
  3. Data volume workload present vs absent (s149 worked without volume, s157b added it) — same failure either way
  4. Shared-tenancy vs dedicated-rented node class — both fail (s157b tested 3 shared-tenancy nodes; s157c tested 1 rented dedicated)
  5. SDK pin bump — already at HEAD b8774c34; no newer commits on master to bump to

State preserved on TFChain: RentContract 2095159 cancelled (substrate Deleted, ~32s); twin 14199 active contracts = 0; all probe-orphan node contracts auto-cancelled by the s157a rollback path; twin 6905 treasury untouched at 40 contracts.

Direct ask for you, Mahmoud: the empty result.error from ZOS is the worst possible signal — we have ruled out the obvious hypotheses but have no positive direction. Two specific questions for when you wake:

  • Question A: Is your zoscompute.gent01.qa.grid.tf (QA chain) the only place a deploy currently works for you? Has anything on mainnet ZOS daemons changed between 2026-05-23 and 2026-05-24 that requires SDK / wire-payload changes we don't yet have?
  • Question B: Is there a way to get ZOS-side diagnostic logs / a non-empty result.error from the workload rejection? Or a known-good wire-payload from a successful recent mainnet deploy we can diff against?

If the answer to Question A is "mainnet broken for everyone, awaiting upstream fix", that's actionable — we hold. If you have a working mainnet recipe, that's a 1-session-to-arc-closure unblock.

Session artefacts: Forge comment 36619 above carries the full s157c chain; full session manifest is in our workspace sessions/157c.yml. We will pick up here when you reply or when we have new signal.

Author
Owner

s157d close — fixed, on our side

Found and fixed. Mahmoud, you do NOT need to investigate this: the bug was entirely on our side, and the empty result.error that ZOS returns gave no way to trace it.

Root cause: hero_compute's deploy_vm passed the user-supplied image string straight to the TFGrid SDK as the zmachine workload's flist field. When a caller used the friendly name from list_images (e.g. "Ubuntu 24.04"), ZOS received the literal string "Ubuntu 24.04" as its flist URL, set workload state to Error with empty result.error, and we surfaced the same opaque vm deployment entered error state. s149 worked because that loop happened to pass the URL directly.

How we found it: grepping tfgrid_sdk_rust source for tracing macros returned ZERO. But grepping for workload.result.state led to a function called trace_step() at src/grid_client/mod.rs:2361:

```rust
fn trace_step(message: impl AsRef<str>) {
    if std::env::var_os("TFGRID_DEBUG").is_none() {
        return;
    }
    eprintln!("[tfgrid-debug] {}", message.as_ref());
}
```

The SDK has a built-in debug system gated on TFGRID_DEBUG=1 — never documented anywhere visible. Setting it produced lines like workload states for contract 2095178: data=init, 0052=init, data=ok, 0052=error and the full workload JSON payload, immediately showing flist: "Ubuntu 24.04" (the name) where ZOS expected a URL.
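For anyone scanning this output later, a small sketch of pulling the debug lines out of captured stderr and reducing a workload-states line to the last state per workload. Both helper names are hypothetical, and the line format is assumed from the single example quoted above; the SDK may emit other shapes.

```rust
use std::collections::BTreeMap;

const DEBUG_PREFIX: &str = "[tfgrid-debug] ";

/// Returns the debug messages found in captured stderr, prefix removed.
fn debug_messages(stderr: &str) -> Vec<&str> {
    stderr
        .lines()
        .filter_map(|line| line.strip_prefix(DEBUG_PREFIX))
        .collect()
}

/// Parses "workload states for contract N: a=x, b=y, ..." (format assumed)
/// and keeps the last state reported for each workload name.
fn final_states(message: &str) -> BTreeMap<&str, &str> {
    let mut states = BTreeMap::new();
    if let Some((_, list)) = message.split_once(": ") {
        for pair in list.split(", ") {
            if let Some((name, state)) = pair.split_once('=') {
                // Later entries overwrite earlier ones for the same name.
                states.insert(name.trim(), state.trim());
            }
        }
    }
    states
}
```

Note that the gate in `trace_step` only checks whether the variable is set, so any value (including `TFGRID_DEBUG=0`) enables the output.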

Fix landed: hero_compute@1f59151 on development. Adds a 5-entry IMAGE_REFERENCE_MAP mirroring list_images + a resolve_image_reference() helper called once at the top of deploy_vm. Pass-through for https:// / http:// URLs; lookup by name for known entries; friendly InvalidInput error for anything else (lists the valid names plus the URL form).
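A minimal sketch of the resolver behaviour described above, not the code in the commit. The single map entry and its name-to-URL pairing are assumptions taken from the probes in this thread (the real map has five entries), and the `ResolveError` type is a hypothetical stand-in for whatever error type the daemon uses.

```rust
/// Hypothetical stand-in for the real IMAGE_REFERENCE_MAP. Only one entry is
/// shown, and its pairing is an assumption based on the probes in this thread.
const IMAGE_REFERENCE_MAP: &[(&str, &str)] = &[(
    "Ubuntu 24.04",
    "https://hub.grid.tf/tf-official-vms/ubuntu-24.04-latest.flist",
)];

#[derive(Debug, PartialEq)]
enum ResolveError {
    InvalidInput(String),
}

/// Turns a caller-supplied image reference into an flist URL.
fn resolve_image_reference(image: &str) -> Result<String, ResolveError> {
    // URLs pass through untouched.
    if image.starts_with("https://") || image.starts_with("http://") {
        return Ok(image.to_string());
    }
    // Friendly names are looked up in the map.
    if let Some((_, url)) = IMAGE_REFERENCE_MAP.iter().find(|(name, _)| *name == image) {
        return Ok((*url).to_string());
    }
    // Anything else fails before a contract is ever created.
    let names: Vec<&str> = IMAGE_REFERENCE_MAP.iter().map(|(n, _)| *n).collect();
    Err(ResolveError::InvalidInput(format!(
        "unknown image {image:?}; expected one of [{}] or an http(s):// flist URL",
        names.join(", ")
    )))
}
```

Resolving at the top of `deploy_vm` means a bad name is rejected locally with a readable message, instead of costing two contracts and an opaque ZOS error.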

Live verification on rented dedicated node 3467 (Canada, farm 646 JimboTFT, RentContract 2095174 under twin 14199 ops):

  • Probe with explicit URL https://hub.grid.tf/tf-official-vms/ubuntu-24.04-latest.flist → VM sid 0053 state=running, contracts 2095179 + 2095180 persist on chain (the first successful deploy_vm of the arc since s149).
  • Post-fix probe with image name "Ubuntu 24.04" (the original failing input) → daemon resolves to URL internally → VM sid 0054 state=running, contracts 2095181 + 2095182 persist. Same rented node, distinct slice, distinct secret. Multi-tenant pattern proven.
  • Pre-merge gate green: cargo fmt --check + cargo clippy --workspace --all-targets -- -D warnings + cargo build --workspace --release all clean.

Bonus substrate insight (s157c): node.rentable: True ALONE does NOT allow non-admin-twin rents on mainnet. The substrate gate is node.extraFee > 0 on the node (the farmer's opt-in to allow public dedicated rent). Node 7609 (extraFee=0) was rejected with OnlyTwinAdminCanDeploy; node 3467 (extraFee=10000 mUSD) accepted immediately. This goes into the workspace decisions/D-29-...md (Hero demo target = any rentable+extraFee>0+up dedicated node on mainnet; not specifically FreeFarm).
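The selection rule can be sketched as a filter. The `NodeInfo` struct and its field names are hypothetical and trimmed down; the real Grid Proxy and TFChain records carry more fields and may name these differently. The two sample values used in the usage note (extraFee 0 and 10000 mUSD) are the ones reported above, with both nodes assumed to be rentable and up.

```rust
/// Hypothetical, trimmed-down view of a node record.
struct NodeInfo {
    node_id: u32,
    rentable: bool,
    extra_fee_musd: u64,
    is_up: bool,
}

/// A node is a usable dedicated-rent target for a non-admin twin only when
/// it is rentable, the farmer has set a non-zero extra fee, and it is up.
fn publicly_rentable(node: &NodeInfo) -> bool {
    node.rentable && node.extra_fee_musd > 0 && node.is_up
}

/// Returns the ids of all nodes that pass the gate above.
fn candidates(nodes: &[NodeInfo]) -> Vec<u32> {
    nodes
        .iter()
        .filter(|n| publicly_rentable(n))
        .map(|n| n.node_id)
        .collect()
}
```

With node 7609 at `extra_fee_musd: 0` and node 3467 at `extra_fee_musd: 10000`, `candidates` would return only 3467, matching what the chain accepted.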

Follow-up hero_compute polish (separate concern, not arc-blocking): hero_compute#121. wait_until_running returns before mycelium_ip is populated in the workload result, so get_vm returns state=running with an empty mycelium_ip and SSH from our workstation is blocked. On the ZOS side the VM IS reachable; this is a daemon bookkeeping gap, not a deploy gap.
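One possible shape for closing that gap, sketched under assumptions and not the fix tracked in the follow-up issue: keep polling until the VM is both running and has a non-empty `mycelium_ip`. The `VmStatus` struct, the `wait_until_reachable` name, and the closure-based poll are all hypothetical.

```rust
use std::thread::sleep;
use std::time::{Duration, Instant};

/// Hypothetical, reduced view of what a status poll returns.
#[derive(Clone, Debug, PartialEq)]
struct VmStatus {
    state: String,
    mycelium_ip: String,
}

/// Polls until the VM is running AND has a mycelium address, or until the
/// timeout elapses. Returns None on timeout.
fn wait_until_reachable<F>(mut poll: F, timeout: Duration, interval: Duration) -> Option<VmStatus>
where
    F: FnMut() -> VmStatus,
{
    let deadline = Instant::now() + timeout;
    loop {
        let status = poll();
        if status.state == "running" && !status.mycelium_ip.is_empty() {
            return Some(status);
        }
        if Instant::now() >= deadline {
            return None;
        }
        sleep(interval);
    }
}
```

Taking the poll as a closure keeps the loop independent of the daemon's RPC types, so it can be exercised with a fake status source.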

Closing this one. Thanks for the QA-instance reference + OpenRPC spec — confirming code parity helped narrow the search.

Reference
lhumina_code/hero_compute#125