wait_until_running can return Ok before Mycelium overlay has a route to the VM #121
lhumina_code/hero_compute#121
After `deploy_vm`, the spawned grid-driver task writes both `state=Running` and `mycelium_ip` into hero_db atomically (crates/my_compute_zos_server/src/cloud/rpc.rs:1139-1150). The `mycelium_ip` field holds whatever the ZeroOS hypervisor reported back at deploy time, which is the address the new VM has been assigned, not a value confirming the Mycelium overlay routes packets to it. Callers polling for `state==Running && mycelium_ip != ""` (for example `crates/my_compute_zos_server::compute.rs:165-187` in the deployer's adapter) can therefore see Ok before the Mycelium DHT has propagated a route, and a downstream SSH connection to that address may time out despite `wait_until_running` having returned success. Two viable fixes: (a) add a reachability probe (`ping6 <addr>` or `nc -z6 <addr> 22`) inside `wait_until_running` before declaring success, or (b) rename the field to `mycelium_ip_recorded` and expose a separate `mycelium_ip_reachable` flag set only after the daemon successfully reaches the address itself.

Closed at hero_compute@ba7d281 (s157f).
Fix shape — both gaps closed in one commit on cloud/grid_driver.rs::deploy_on_tfgrid:
Post-deploy mycelium_ip poll via new SDK getter GridClient::get_deployment_workloads(node_twin_id, contract_id). The SDK's deploy_vm extracts mycelium_ip from workload.result.data the moment result.state == ok is first seen; ZOS sometimes lags the populated field behind the state transition. The daemon now re-polls zos.deployment.get until result.data["mycelium_ip"] is non-empty (60s budget, 1s interval, WARN-on-timeout).
TCP reachability probe on [mycelium_ip]:22 (60s budget, 1s connect timeout, 2s retry interval, WARN-on-timeout). Even with mycelium_ip populated, the Mycelium DHT route from the daemon's host can lag by a few seconds; the probe makes the Ok response mean "you can SSH in now" rather than "you can SSH in a few seconds from now". Deploy stays Ok on probe timeout; the contract is honest either way.
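Both steps above share one shape: retry an operation within a fixed budget and fall back to a warning on timeout. The following is a minimal sketch of that shape, not the code from the commit. The names `poll_until` and `wait_for_tcp` are hypothetical, and the sketch assumes blocking `std` networking, whereas the daemon may well be async.

```rust
use std::net::{SocketAddr, TcpStream};
use std::thread::sleep;
use std::time::{Duration, Instant};

/// Hypothetical helper: calls `attempt` until it yields `Some(value)` or the
/// budget runs out. Returns `None` on timeout so the caller can log a WARN
/// and continue instead of failing the deploy.
fn poll_until<T>(
    budget: Duration,
    interval: Duration,
    mut attempt: impl FnMut() -> Option<T>,
) -> Option<T> {
    let deadline = Instant::now() + budget;
    loop {
        if let Some(value) = attempt() {
            return Some(value);
        }
        // Stop if another sleep would carry us past the deadline.
        if Instant::now() + interval > deadline {
            return None;
        }
        sleep(interval);
    }
}

/// Hypothetical helper: true once a TCP connection to `addr` succeeds.
/// Each attempt is bounded by `connect_timeout`; attempts are spaced by `retry`.
fn wait_for_tcp(
    addr: SocketAddr,
    budget: Duration,
    connect_timeout: Duration,
    retry: Duration,
) -> bool {
    poll_until(budget, retry, || {
        TcpStream::connect_timeout(&addr, connect_timeout)
            .ok()
            .map(|_stream| ())
    })
    .is_some()
}
```

With the values quoted above, the probe would correspond to calling `wait_for_tcp` on port 22 of the Mycelium address with a 60s budget, a 1s connect timeout, and a 2s retry interval. The `mycelium_ip` poll would pass `poll_until` a closure that returns `Some(ip)` once the field is non-empty. A `false` or `None` result maps to the WARN-on-timeout path rather than an error.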
The new SDK getter required a fork of tfgrid-sdk-rust. The fork lives at https://forge.ourworld.tf/lhumina_code/tfgrid_sdk_rust on the development branch (HEAD 57f6494) and adds exactly one public method on GridClient. Upstream PR to threefoldtech/tfgrid-sdk-rust to follow.
Live verify on twin 14199 / rented dedicated node 3467 (Canada, farm 646 JimboTFT):
First end-to-end "Hero OS deploys a real TFGrid VM users can SSH into via Mycelium" proof.
Pre-merge gate: fmt + clippy --workspace --all-targets -- -D warnings + workspace release build + 16/16 integration tests.