[ops] Fresh TF Grid VM Mycelium route propagation takes 15+ min — make publicip=true the nu-shell deploy default #165
Symptom
When provisioning a fresh TF Grid VM on a freefarm node that does NOT already host other TFGrid deployments (e.g. node 2007, new to our twin), the Mycelium route to the new VM's Mycelium IP does not become reachable from the operator's laptop for 10-20+ minutes after `tofu apply` returns.

Observed on 2026-04-24 while provisioning `herodemo` on node 2007 (freefarm, Belgium):

- `tofu apply` complete at T+0 with `mycelium_ip = "5eb:16b5:c874:b249:ff0f:3053:dc5e:6390"`
- `ssh driver@[mycelium_ip]` returned `No route to host` for 20+ tries over ~5 min
- With `publicip = true` (from the farm's 99-address IPv4 pool), SSH worked instantly on the new `185.69.166.153` address.

The VM was healthy the whole time; it was the Mycelium overlay-routing table on the operator's local mycelium daemon that hadn't populated the route.
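To put a number on the delay described above, a small probe can poll the VM's SSH port and report how long the route took to come up. This is only a sketch: `wait_for_port` and its default values are hypothetical names and numbers chosen for illustration, not part of any existing tooling in this repo, and a successful TCP connect is used as a stand-in for "SSH works".

```python
import socket
import time


def wait_for_port(host, port=22, timeout=1200.0, interval=10.0, connect_timeout=5.0):
    """Poll host:port until a TCP connect succeeds.

    Returns the seconds elapsed before the first successful connect, or
    None if no connect succeeded before `timeout` seconds had passed.
    (Hypothetical helper; defaults are illustrative assumptions.)
    """
    start = time.monotonic()
    while True:
        try:
            # create_connection accepts IPv4 and IPv6 literals (no brackets).
            with socket.create_connection((host, port), timeout=connect_timeout):
                return time.monotonic() - start
        except OSError:
            # Covers "No route to host", refused connections, and timeouts.
            pass
        if time.monotonic() - start >= timeout:
            return None
        time.sleep(interval)
```

Calling `wait_for_port("5eb:16b5:c874:b249:ff0f:3053:dc5e:6390")` right after `tofu apply` returns would give the observed convergence time in seconds, or `None` if the route never appeared within the window.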
Root cause (hypothesis)
Mycelium routing is source-advertised and converges in seconds-to-minutes for active nodes with established peer connections, but convergence delays stack up for:

- `random_bytes.mycelium_ip_seed` addresses that don't match any cached entries from a previous deploy

Re-running `mycelium-restart` locally sometimes helps (by flushing and resubscribing to the peer list), but not always.

Demo impact
For interactive demo-build flows (Ansible/scripts over SSH), waiting 15 min for the SSH channel to come up breaks the flow. The typical solution, `mycelium-restart`, requires sudo and often doesn't accelerate the convergence enough.

Proposed fix
1. Make `publicip = true` the default for nu-shell single-VM demos

All of our active demo environments (hero, heronu, herodev, herozero) use `publicip = true` for a reason we'd forgotten: it avoids this exact problem. `heronu` briefly tried `publicip = false` and we spent ~15 min chasing Mycelium convergence before flipping back.

Update `hero_zero/deploy/single-vm/envs/*/tf/credentials.auto.tfvars.example` to default to `publicip = true`, and document why in the comment. The 99-IP freefarm pool has ample capacity.

2. Keep Mycelium as a fallback (not primary)
Mycelium is genuinely useful when IPv4 is unavailable (some farms don't expose IPv4 pools). Keep it active; just don't depend on it as the bootstrap path. With `publicip = true` and Mycelium, operators get IPv4 immediately and Mycelium converges in the background: belt-and-suspenders.

3. Document the failure mode
Add a short section to `docs_hero/ops/deployment.md` (if/when we write it) noting: "On fresh VMs, SSH via Mycelium may take 10-20 min to work while the overlay converges. Public IPv4 avoids this."

Related
- `make demo` target (should default to publicip=true)

Signed-off-by: mik-tf
`publicip=true` the nu-shell deploy default #33

Moved to hero_demo#33; see lhumina_code/hero_demo#33
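The belt-and-suspenders ordering from item 2 of the proposed fix (public IPv4 first, Mycelium as fallback) could be expressed in a deploy script roughly as below. This is a sketch under assumptions: `tcp_reachable` and `pick_bootstrap_host` are hypothetical helpers, not existing functions in the deploy tooling, and reachability is approximated by a TCP connect to port 22.

```python
import socket


def tcp_reachable(host, port=22, connect_timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=connect_timeout):
            return True
    except OSError:
        # Unreachable, refused, or timed out.
        return False


def pick_bootstrap_host(public_ipv4, mycelium_ip, port=22, probe=tcp_reachable):
    """Prefer the public IPv4 address; fall back to the Mycelium address.

    Returns the first address that `probe` reports reachable, or None if
    neither answers. Either argument may be None (e.g. no IPv4 pool).
    """
    for host in (public_ipv4, mycelium_ip):
        if host and probe(host, port):
            return host
    return None
```

With the addresses from the report, `pick_bootstrap_host("185.69.166.153", "5eb:16b5:c874:b249:ff0f:3053:dc5e:6390")` would return the IPv4 address as soon as it answers, and only hand back the Mycelium address when IPv4 is missing or unreachable. The `probe` parameter is injectable so the ordering can be checked without network access.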