[deployer] Pre-warm a pool of tester VMs so onboarding is fast and reliable #266

Open
opened 2026-06-07 15:59:25 +00:00 by mik-tf · 1 comment
Owner

The demo runs on a dedicated node we already pay for, so leaving virtual machines idle on it costs nothing extra. Today we create a tester VM on demand each time we add someone, which makes that person wait while a brand-new machine boots and joins the network. Sometimes the network route never comes up in time and the install fails outright. Instead, we could pre-provision a pool of tester VMs up front, each already booted with the admin SSH keys and left ready.

A periodic health check would ping each pool machine to confirm it is reachable, then tear down and recreate any that are unresponsive, so the pool stays known-good. Adding a tester then becomes preparing their account and running the Hero stack setup on a machine that is already booted and reachable, which moves the slow and flaky part away from the moment someone is actually waiting.

A natural follow-up is to also pre-install the Hero binaries on the pool machines so that only the per-user configuration runs at assignment, which would make onboarding both reliable and fast.

This needs the machine records to carry a pool-and-assignment model rather than one machine created per user, and the provision step to split into two operations: create a pool machine, and assign a machine to a user. Capacity should be sized on real placement rather than raw slice counts, and the recreate path needs to handle teardown reliably.

Signed-by: mik-tf <mik-tf@noreply.invalid>
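To make the proposal above concrete, here is a minimal sketch of the pool-and-assignment model with a health-check pass. It is not the deployer's actual code: `WarmPool`, `PoolMachine`, `State`, `target_ready`, and the `create`, `destroy`, and `is_reachable` callbacks are all hypothetical names standing in for whatever the deployer uses to create a VM, tear one down, and ping it. It also assumes that only unassigned machines are recreated automatically, since an assigned machine holds a tester's setup.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Dict, List, Optional


class State(Enum):
    READY = "ready"        # booted with admin SSH keys, not yet assigned
    ASSIGNED = "assigned"  # handed to a tester


@dataclass
class PoolMachine:
    machine_id: str
    state: State = State.READY
    assigned_to: Optional[str] = None


class WarmPool:
    """Hypothetical warm pool: machines exist before any user needs one."""

    def __init__(
        self,
        target_ready: int,
        create: Callable[[], str],            # assumed: boots a VM, returns its id
        destroy: Callable[[str], None],       # assumed: tears a VM down
        is_reachable: Callable[[str], bool],  # assumed: ping-style check
    ) -> None:
        self.target_ready = target_ready
        self.create = create
        self.destroy = destroy
        self.is_reachable = is_reachable
        self.machines: Dict[str, PoolMachine] = {}

    def ready(self) -> List[PoolMachine]:
        return [m for m in self.machines.values() if m.state is State.READY]

    def reconcile(self) -> List[str]:
        """Health check: recreate unreachable ready machines, then top up.

        Returns the ids of the machines that were torn down.
        """
        replaced = []
        for m in list(self.machines.values()):
            # Assumption: assigned machines are left alone, not auto-recreated.
            if m.state is State.READY and not self.is_reachable(m.machine_id):
                self.destroy(m.machine_id)
                del self.machines[m.machine_id]
                replaced.append(m.machine_id)
        while len(self.ready()) < self.target_ready:
            new_id = self.create()
            self.machines[new_id] = PoolMachine(new_id)
        return replaced

    def assign(self, user: str) -> PoolMachine:
        """Assign step: pick a machine that is already booted and reachable."""
        for m in self.ready():
            if self.is_reachable(m.machine_id):
                m.state = State.ASSIGNED
                m.assigned_to = user
                return m
        raise RuntimeError("no ready machine available in the pool")
```

In this sketch, `reconcile()` would run on a timer and `assign(user)` at onboarding, followed by the per-user Hero stack setup on the returned machine. Sizing `target_ready` against real placement on the node, and making `destroy` reliable, are left open here exactly as they are in the issue.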

Author
Owner

One refinement on the golden-image follow-up mentioned above: pre-installing the binaries on pool machines is probably not worth it. The binaries account for only about two minutes of the install, and main and development rebuild often, so a pre-baked image would go stale quickly for marginal gain. The warm pool on its own already takes onboarding from around twenty minutes down to a few minutes by removing the brand-new machine boot and network wait. So let's do the warm pool first and only revisit pre-baking if we find a clean way to keep the image fresh.

Signed-by: mik-tf <mik-tf@noreply.invalid>

Reference
lhumina_code/home#266