[ops] Fresh TF Grid VM Mycelium route propagation takes 15+ min — make publicip=true the nu-shell deploy default #33

Open
opened 2026-04-28 12:21:17 +00:00 by mik-tf · 0 comments
Owner

Symptom

When provisioning a fresh TF Grid VM on a freefarm node that does NOT already host other TFGrid deployments (e.g. node 2007, new to our twin), the new VM's Mycelium IP stays unreachable from the operator's laptop for 10-20+ minutes after tofu apply returns.

Observed on 2026-04-24 while provisioning herodemo on node 2007 (freefarm, Belgium):

  • tofu apply complete at T+0 with mycelium_ip = "5eb:16b5:c874:b249:ff0f:3053:dc5e:6390"
  • ssh driver@[mycelium_ip] returned No route to host for 20+ tries over ~5 min
  • Added publicip = true (allocating an address from the farm's 99-address IPv4 pool) and re-applied; SSH worked instantly on the new 185.69.166.153 address.

The VM was healthy the whole time; the operator's local mycelium daemon simply had not yet learned a route to it in its overlay routing table.
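
To put a number on the delay described above, one option is to poll the SSH port and record when it first accepts a connection. The sketch below is illustrative only: `wait_for_port` and its parameters are hypothetical names, not part of any Hero or TF Grid tooling, and it assumes that a successful TCP connect to port 22 is a good enough proxy for "SSH works".

```python
import socket
import time
from typing import Optional


def wait_for_port(host: str, port: int = 22, timeout: float = 1200.0,
                  interval: float = 5.0,
                  connect_timeout: float = 3.0) -> Optional[float]:
    """Poll host:port over TCP until it accepts a connection.

    Returns the seconds elapsed before the first successful connect, or
    None if `timeout` seconds pass without one. `host` may be an IPv4
    address or a bare IPv6 literal such as a Mycelium address.
    """
    start = time.monotonic()
    deadline = start + timeout
    while True:
        try:
            # create_connection picks the address family from the host string.
            with socket.create_connection((host, port), timeout=connect_timeout):
                return time.monotonic() - start
        except OSError:
            # Covers "No route to host", refused connections and timeouts.
            pass
        if time.monotonic() + interval > deadline:
            return None
        time.sleep(interval)
```

For example, `wait_for_port("5eb:16b5:c874:b249:ff0f:3053:dc5e:6390")` would block until the Mycelium route converges (or 20 minutes pass) and report how long that took, which makes the "10-20+ minutes" figure measurable across deploys.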

Root cause (hypothesis)

Mycelium routing is source-advertised and converges in seconds-to-minutes for active nodes with established peer connections, but convergence delays stack up for:

  • First time seeing a new mycelium seed (no prior route table entry)
  • Nodes whose mycelium daemons don't have tight peer paths to the operator's laptop's upstream mycelium peers
  • Fresh random_bytes.mycelium_ip_seed addresses that don't match any cached entries from a previous deploy

Re-running mycelium-restart locally sometimes helps (by flushing routes and resubscribing to the peer list), but not always.

Demo impact

For interactive demo-build flows (Ansible/scripts over SSH), waiting 15 min for the SSH channel to come up breaks the flow. The usual workaround, mycelium-restart, requires sudo and often does not speed up convergence enough.

Proposed fix

1. Make publicip = true the default for nu-shell single-VM demos

All of our active demo environments (hero, heronu, herodev, herozero) use publicip = true for a reason we'd forgotten: it avoids this exact problem. heronu briefly tried publicip = false and we spent ~15 min chasing Mycelium convergence before flipping back.

Change the default in hero_zero/deploy/single-vm/envs/*/tf/credentials.auto.tfvars.example to publicip = true, and explain why in a comment next to it. The 99-IP freefarm pool has ample capacity.
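
To keep this default from regressing, a small check could flag any example file that does not set it. This is a minimal sketch, not existing tooling: `envs_missing_publicip` is a hypothetical helper, and it assumes the setting appears as an uncommented `publicip = true` line in each file.

```python
import glob
import re
from typing import List

# Matches an uncommented "publicip = true" assignment at the start of a line.
PUBLICIP_TRUE = re.compile(r"^\s*publicip\s*=\s*true\b", re.MULTILINE)


def envs_missing_publicip(pattern: str) -> List[str]:
    """Return the files matching `pattern` that lack `publicip = true`."""
    missing = []
    for path in sorted(glob.glob(pattern)):
        with open(path, encoding="utf-8") as fh:
            if not PUBLICIP_TRUE.search(fh.read()):
                missing.append(path)
    return missing
```

Run from the repository root, `envs_missing_publicip("hero_zero/deploy/single-vm/envs/*/tf/credentials.auto.tfvars.example")` would return an empty list once every environment follows the new default.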

2. Keep Mycelium as a fallback (not primary)

Mycelium is genuinely useful when IPv4 is unavailable (some farms don't expose IPv4 pools). Keep it active; just don't depend on it as the bootstrap path. With publicip = true and Mycelium both enabled, operators get IPv4 immediately while Mycelium converges in the background: a belt-and-suspenders setup.
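
The "IPv4 first, Mycelium as fallback" ordering can be sketched as a small selection helper for deploy scripts. The names here (`pick_ssh_host`, `is_reachable`) are hypothetical, and the reachability probe is injected so that any check (TCP connect, ping, an `ssh` dry run) can be plugged in.

```python
from typing import Callable, Iterable, Optional, Tuple


def pick_ssh_host(candidates: Iterable[Tuple[str, str]],
                  is_reachable: Callable[[str], bool]) -> Optional[Tuple[str, str]]:
    """Return the first (label, address) pair whose address is reachable.

    `candidates` is ordered by preference; empty addresses (for example a
    farm without an IPv4 pool) are skipped. Returns None if nothing answers.
    """
    for label, address in candidates:
        if address and is_reachable(address):
            return label, address
    return None


# Preference order proposed in this issue: public IPv4, then Mycelium.
# The addresses are the ones observed on node 2007.
CANDIDATES = [
    ("ipv4", "185.69.166.153"),
    ("mycelium", "5eb:16b5:c874:b249:ff0f:3053:dc5e:6390"),
]
```

With a real probe, `pick_ssh_host(CANDIDATES, probe)` would return the IPv4 entry right after tofu apply and only fall through to Mycelium when no public address was allocated or it does not answer.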

3. Document the failure mode

Add a short section to docs_hero/ops/deployment.md (if/when we write it) noting: "On fresh VMs, SSH via Mycelium may take 10-20 min to work while the overlay converges. Public IPv4 avoids this."

Signed-off-by: mik-tf


Originally filed as home#165 on 2026-04-24 by mik-tf — moved to hero_demo as part of consolidating issue tracking.

Related

  • https://forge.ourworld.tf/lhumina_code/home/issues/160 - consolidated demo state
  • https://forge.ourworld.tf/lhumina_code/home/issues/161 - disaster recovery pattern
  • https://forge.ourworld.tf/lhumina_code/home/issues/163 - make demo target (should default to publicip=true)
Reference
lhumina_code/hero_demo#33