wait_until_running can return Ok before Mycelium overlay has a route to the VM #121

Closed
opened 2026-05-23 16:04:23 +00:00 by mik-tf · 1 comment
Owner


After `deploy_vm`, the spawned grid-driver task writes both `state=Running` and `mycelium_ip` into hero_db atomically (`crates/my_compute_zos_server/src/cloud/rpc.rs:1139-1150`). The `mycelium_ip` field holds whatever the ZeroOS hypervisor reported back at deploy time. That is the address assigned to the new VM, not confirmation that the Mycelium overlay routes packets to it.

Callers polling for `state==Running && mycelium_ip != ""` (for example `crates/my_compute_zos_server::compute.rs:165-187` in the deployer's adapter) can therefore see Ok before the Mycelium DHT has propagated a route, and a downstream SSH connection to that address may time out even though `wait_until_running` returned success.

Two viable fixes:

- (a) add a reachability probe (`ping6 <addr>` or `nc -z6 <addr> 22`) inside `wait_until_running` before declaring success, or
- (b) rename the field to `mycelium_ip_recorded` and expose a separate `mycelium_ip_reachable` flag set only after the daemon successfully reaches the address itself.
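To make option (b) concrete, here is a minimal sketch of how the two fields could be separated. The `VmRecord` and `VmState` types and both method names are hypothetical illustrations, not the actual hero_db schema. Only the field names `mycelium_ip_recorded` and `mycelium_ip_reachable` come from the proposal above.

```rust
use std::net::Ipv6Addr;

/// Hypothetical lifecycle states; the real daemon may track more.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum VmState {
    Deploying,
    Running,
}

/// Hypothetical record shape following option (b).
#[derive(Debug, Clone)]
pub struct VmRecord {
    pub state: VmState,
    /// Address reported at deploy time; says nothing about routing.
    pub mycelium_ip_recorded: Option<Ipv6Addr>,
    /// Set only after the daemon itself has reached the address.
    pub mycelium_ip_reachable: bool,
}

impl VmRecord {
    /// Equivalent of the predicate callers poll today:
    /// state == Running && mycelium_ip != "".
    pub fn looks_running(&self) -> bool {
        self.state == VmState::Running && self.mycelium_ip_recorded.is_some()
    }

    /// Stricter predicate: additionally require confirmed reachability.
    pub fn ready_for_ssh(&self) -> bool {
        self.looks_running() && self.mycelium_ip_reachable
    }
}
```

With this split, a caller that only needs the assigned address can keep using the weaker check, while a caller about to open an SSH session would wait on `ready_for_ssh()`.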
Author
Owner

Closed at hero_compute@ba7d281 (s157f).

Fix shape — both gaps closed in one commit on cloud/grid_driver.rs::deploy_on_tfgrid:

  1. Post-deploy mycelium_ip poll via new SDK getter GridClient::get_deployment_workloads(node_twin_id, contract_id). The SDK's deploy_vm extracts mycelium_ip from workload.result.data the moment result.state == ok is first seen, but ZOS sometimes populates the field only after the state transition. The daemon now re-polls zos.deployment.get until result.data["mycelium_ip"] is non-empty (60s budget, 1s interval, WARN-on-timeout).

  2. TCP reachability probe on [mycelium_ip]:22 (60s budget, 1s connect timeout, 2s retry interval, WARN-on-timeout). Even with mycelium_ip populated, the Mycelium DHT route from the daemon's host can lag by a few seconds; the probe makes the Ok response match "you can SSH in now" rather than "in some seconds from now". Deploy still returns Ok on probe timeout; the contract is honest either way.
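The two steps above share one shape: retry an operation at a fixed interval until it succeeds or a time budget runs out. The sketch below illustrates that shape using only the Rust standard library. It is not the code from the commit. `retry_with_budget`, `wait_for_mycelium_ip`, and `wait_for_tcp` are hypothetical names, and the `fetch` closure stands in for the real `GridClient::get_deployment_workloads` lookup, whose return type is not shown here.

```rust
use std::net::{Ipv6Addr, SocketAddr, TcpStream};
use std::thread::sleep;
use std::time::{Duration, Instant};

/// Call `attempt` until it yields `Some`, sleeping `interval` between tries.
/// Gives up once another sleep would reach `budget`. On success, returns the
/// value together with the 1-based attempt number on which it arrived.
pub fn retry_with_budget<T>(
    budget: Duration,
    interval: Duration,
    mut attempt: impl FnMut() -> Option<T>,
) -> Option<(T, u32)> {
    let start = Instant::now();
    let mut n = 0u32;
    loop {
        n += 1;
        if let Some(value) = attempt() {
            return Some((value, n));
        }
        if start.elapsed() + interval >= budget {
            return None; // caller decides what to do; the issue logs a WARN
        }
        sleep(interval);
    }
}

/// Step 1: poll until `fetch` returns a parseable, non-empty address.
/// `fetch` is a hypothetical stand-in for reading result.data["mycelium_ip"].
/// Note that a malformed (non-IPv6) string is also retried until timeout.
pub fn wait_for_mycelium_ip(
    mut fetch: impl FnMut() -> String,
    budget: Duration,
    interval: Duration,
) -> Option<(Ipv6Addr, u32)> {
    retry_with_budget(budget, interval, || fetch().trim().parse::<Ipv6Addr>().ok())
}

/// Step 2: try to open a TCP connection to `addr` until one succeeds.
/// Returns the attempt number of the first successful connect.
pub fn wait_for_tcp(
    addr: SocketAddr,
    budget: Duration,
    connect_timeout: Duration,
    interval: Duration,
) -> Option<u32> {
    retry_with_budget(budget, interval, || {
        // The stream is dropped immediately; only reachability matters here.
        TcpStream::connect_timeout(&addr, connect_timeout).ok().map(drop)
    })
    .map(|(_, n)| n)
}
```

With the numbers quoted above, the first call would use a 60s budget and a 1s interval. The second would target `SocketAddr::new(ip.into(), 22)` with a 60s budget, a 1s connect timeout, and a 2s retry interval. Returning `None` rather than an error keeps the timeout non-fatal, which matches the stated choice that deploy stays Ok when the probe times out.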

The new SDK getter required a fork of tfgrid-sdk-rust. The fork lives at https://forge.ourworld.tf/lhumina_code/tfgrid_sdk_rust on the development branch (HEAD 57f6494) and adds exactly one public method on GridClient. Upstream PR to threefoldtech/tfgrid-sdk-rust to follow.

Live verify on twin 14199 / rented dedicated node 3467 (Canada, farm 646 JimboTFT):

  • rent_node 3467: RentContract 2095255 created
  • node_register: sid=0055, 6 slices, 31 GB MRU
  • deploy_vm s157f-v1: contracts 2095256 (network) + 2095257 (vm), VM sid 005c, image=Ubuntu 24.04 resolved
  • mycelium_ip poll: populated 461:b4f1:e80a:84f6:ff0f:60d6:ab2:358 at attempt 5 (~5s after SDK return)
  • SSH reachability probe: succeeded at attempt 2, 3.17s elapsed
  • ssh root@[mycelium_ip] uname -a: returned Linux 005c 6.1.21 #1 SMP PREEMPT_DYNAMIC x86_64, PRETTY_NAME=Ubuntu 24.04 LTS, uptime 0 min

First end-to-end "Hero OS deploys a real TFGrid VM users can SSH into via Mycelium" proof.

Pre-merge gate: fmt + clippy --workspace --all-targets -- -D warnings + workspace release build + 16/16 integration tests.

Reference
lhumina_code/hero_compute#121