wait_until_running can return Ok before Mycelium overlay has a route to the VM #121

Closed

opened 2026-05-23 16:04:23 +00:00 by mik-tf · 1 comment

mik-tf commented

2026-05-23 16:04:23 +00:00

Owner

After `deploy_vm`, the spawned grid-driver task writes both `state=Running` and `mycelium_ip` into hero_db atomically (`crates/my_compute_zos_server/src/cloud/rpc.rs:1139-1150`). The `mycelium_ip` field holds whatever the ZeroOS hypervisor reported at deploy time: the address assigned to the new VM, not confirmation that the Mycelium overlay routes packets to it. Callers polling for `state==Running && mycelium_ip != ""` (for example `crates/my_compute_zos_server::compute.rs:165-187` in the deployer's adapter) can therefore see Ok before the Mycelium DHT has propagated a route, and a downstream SSH connection to that address may time out even though `wait_until_running` returned success.

Two viable fixes:

(a) add a reachability probe (`ping6 <addr>` or `nc -z6 <addr> 22`) inside `wait_until_running` before declaring success, or
(b) rename the field to `mycelium_ip_recorded` and expose a separate `mycelium_ip_reachable` flag that is set only after the daemon itself has successfully reached the address.
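
To make option (b) concrete, here is a minimal sketch of how the two fields could separate "address recorded" from "address reachable". The `VmRecord` and `VmState` types and both method names are hypothetical and are not the hero_db schema; only the field names `mycelium_ip_recorded` and `mycelium_ip_reachable` come from the proposal above.

```rust
use std::net::Ipv6Addr;

/// Hypothetical VM lifecycle states (an assumption, not the real enum).
#[derive(Debug, Clone, PartialEq)]
pub enum VmState {
    Deploying,
    Running,
    Failed,
}

/// Hypothetical record shape for fix (b).
#[derive(Debug, Clone)]
pub struct VmRecord {
    pub state: VmState,
    /// Address reported by the hypervisor at deploy time ("" until known).
    pub mycelium_ip_recorded: String,
    /// Set only after the daemon itself has reached the address.
    pub mycelium_ip_reachable: bool,
}

impl VmRecord {
    /// The check callers apply today: Running plus a non-empty address.
    pub fn looks_running(&self) -> bool {
        self.state == VmState::Running && !self.mycelium_ip_recorded.is_empty()
    }

    /// Stricter check: the address must also parse as IPv6 and the daemon
    /// must have confirmed that it can reach it.
    pub fn is_ssh_ready(&self) -> bool {
        self.looks_running()
            && self.mycelium_ip_recorded.parse::<Ipv6Addr>().is_ok()
            && self.mycelium_ip_reachable
    }
}
```

With this shape, a caller that only needs the address can keep using the weaker check, while anything that is about to open an SSH session waits on `is_ssh_ready`.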

mik-tf referenced this issue

2026-05-25 03:32:40 +00:00

deploy_vm consistently rejected at ZOS workload phase on FreeFarm mainnet (regression since 2026-05-23) #125

mik-tf referenced this issue from lhumina_code/home

2026-05-25 03:39:01 +00:00

[META] Hero OS demo-deployer arc tracker (cockpit + proxy + content + deployer + manifest + integration) #235

mik-tf referenced this issue from lhumina_code/home

2026-05-25 03:54:01 +00:00

[META] Hero OS demo-deployer arc tracker (cockpit + proxy + content + deployer + manifest + integration) #235

mik-tf referenced this issue from a commit

2026-05-25 14:09:49 +00:00

grid_client: add public get_deployment_workloads getter for hero_compute#121

mik-tf closed this issue

2026-05-25 14:09:49 +00:00

mik-tf referenced this issue from a commit

2026-05-25 14:13:57 +00:00

hero_compute: close hero_compute#121 — mycelium_ip poll + SSH reachability probe in deploy_vm

mik-tf commented

2026-05-25 14:14:39 +00:00

Author

Owner

Closed at hero_compute@ba7d281 (s157f).

Fix shape — both gaps closed in one commit on cloud/grid_driver.rs::deploy_on_tfgrid:

1) Post-deploy mycelium_ip poll via new SDK getter GridClient::get_deployment_workloads(node_twin_id, contract_id). The SDK's deploy_vm extracts mycelium_ip from workload.result.data the moment result.state == ok is first seen, but ZOS sometimes populates the field only after the state transition. The daemon now re-polls zos.deployment.get until result.data["mycelium_ip"] is non-empty (60s budget, 1s interval, WARN-on-timeout).
2) TCP reachability probe on [mycelium_ip]:22 (60s budget, 1s connect timeout, 2s retry interval, WARN-on-timeout). Even with mycelium_ip populated, the Mycelium DHT route from the daemon's host can lag by a few seconds; the probe makes the Ok response match "you can SSH in now" rather than "in some seconds from now". Deploy stays Ok on probe timeout; the contract is honest either way.
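
The retry-with-budget pattern behind both steps can be sketched with the standard library alone. `poll_until`, `ssh_port_open` and `wait_for_ssh` are hypothetical names, not functions from hero_compute, and the sketch uses blocking `std::net` calls rather than whatever runtime the daemon actually uses. Only the 60s budget, 1s connect timeout and 2s retry interval are taken from the description above.

```rust
use std::net::{Ipv6Addr, SocketAddr, TcpStream};
use std::thread::sleep;
use std::time::{Duration, Instant};

/// Call `attempt` until it yields `Some(value)` or the budget runs out.
/// Returns the value together with the 1-based attempt number.
/// Gives up as soon as one more sleep would exceed `budget`.
pub fn poll_until<T>(
    budget: Duration,
    interval: Duration,
    mut attempt: impl FnMut() -> Option<T>,
) -> Option<(T, u32)> {
    let start = Instant::now();
    let mut n = 0u32;
    loop {
        n += 1;
        if let Some(value) = attempt() {
            return Some((value, n));
        }
        if start.elapsed() + interval > budget {
            return None;
        }
        sleep(interval);
    }
}

/// One TCP connect attempt to port 22 on `ip`, bounded by `connect_timeout`
/// (which must be non-zero).
pub fn ssh_port_open(ip: Ipv6Addr, connect_timeout: Duration) -> bool {
    let addr = SocketAddr::from((ip, 22));
    TcpStream::connect_timeout(&addr, connect_timeout).is_ok()
}

/// Reachability probe using the budgets quoted in the comment:
/// 60s total, 1s per connect, 2s between retries.
/// Returns false on timeout; the caller decides whether that is fatal.
pub fn wait_for_ssh(ip: Ipv6Addr) -> bool {
    poll_until(Duration::from_secs(60), Duration::from_secs(2), || {
        ssh_port_open(ip, Duration::from_secs(1)).then_some(())
    })
    .is_some()
}
```

The same `poll_until` helper would fit the mycelium_ip poll by passing a closure that fetches the workload result and returns `Some(ip)` once the field is non-empty. Returning `false` instead of an error mirrors the WARN-on-timeout behavior: the deploy result is not turned into a failure by the probe.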

The new SDK getter required a fork of tfgrid-sdk-rust. The fork lives at https://forge.ourworld.tf/lhumina_code/tfgrid_sdk_rust on the development branch (HEAD 57f6494) and adds exactly one public method on GridClient. Upstream PR to threefoldtech/tfgrid-sdk-rust to follow.

Live verify on twin 14199 / rented dedicated node 3467 (Canada, farm 646 JimboTFT):

- rent_node 3467: RentContract 2095255 created
- node_register: sid=0055, 6 slices, 31 GB MRU
- deploy_vm s157f-v1: contracts 2095256 (network) + 2095257 (vm), VM sid 005c, image=Ubuntu 24.04 resolved
- mycelium_ip poll: populated 461:b4f1:e80a:84f6:ff0f:60d6:ab2:358 at attempt 5 (~5s after SDK return)
- SSH reachability probe: succeeded at attempt 2, 3.17s elapsed
- ssh root@[mycelium_ip] uname -a: returned Linux 005c 6.1.21 #1 SMP PREEMPT_DYNAMIC x86_64, PRETTY_NAME=Ubuntu 24.04 LTS, uptime 0 min

First end-to-end "Hero OS deploys a real TFGrid VM users can SSH into via Mycelium" proof.

Pre-merge gate: fmt + clippy --workspace --all-targets -- -D warnings + workspace release build + 16/16 integration tests.
