zos_server: slice catalog over-commits the node (bootstrap_nodes carves 100% of capacity, ignores ZOS overhead) #114


Open

opened 2026-05-20 07:01:06 +00:00 by mahmoud · 0 comments


## Problem

`bootstrap_nodes` (`crates/my_compute_zos_server/src/cloud/rpc.rs:375-408`) carves a node into slices that sum to ~100% of nominal capacity, leaving nothing for ZOS per-VM overhead:

```rust
let usable_memory_gb  = total_memory_gb - reserved_memory_gb;   // 31 - 1 = 30
let slice_count       = usable_memory_gb / slice_size_gb;       // 30 / 4 = 7 (integer division)
let disk_per_slice_gb = total_disk_gb / slice_count;            // 931 / 7 = 133  <- 100% of disk
```

Real-world overhead measured on node 1774: a VM on a nominal `{4 GB RAM, 133 GB SSD}` slice actually consumes **~7 GB RAM / ~153 GB SSD** on the grid. Seven such VMs would need roughly 49 GB RAM and 1,071 GB SSD, well beyond the node's 31 GB and 931 GB, so the 7-slice catalog can never be fully deployed. The catalog lies, and the UI offers slices that cannot be placed.
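
To make the gap concrete, here is a minimal sketch that reruns the catalog arithmetic next to the measured footprint. The function names are hypothetical and not taken from `rpc.rs`. The ~7 GB and ~153 GB figures are the approximate measurements quoted above, so the result is an estimate rather than a guaranteed placement count.

```rust
/// Slice count as the current catalog math computes it
/// (integer division on memory only; mirrors the snippet above).
fn nominal_slice_count(total_memory_gb: u64, reserved_memory_gb: u64, slice_size_gb: u64) -> u64 {
    if slice_size_gb == 0 {
        return 0;
    }
    total_memory_gb.saturating_sub(reserved_memory_gb) / slice_size_gb
}

/// Estimated number of VMs that fit when each VM really costs
/// `real_mem_gb` of RAM and `real_disk_gb` of SSD (measured values,
/// i.e. nominal slice size plus ZOS overhead).
fn deployable_count(
    usable_memory_gb: u64,
    usable_disk_gb: u64,
    real_mem_gb: u64,
    real_disk_gb: u64,
) -> u64 {
    if real_mem_gb == 0 || real_disk_gb == 0 {
        return 0;
    }
    // The scarcer resource decides how many VMs can coexist.
    (usable_memory_gb / real_mem_gb).min(usable_disk_gb / real_disk_gb)
}
```

With the node 1774 numbers, `nominal_slice_count(31, 1, 4)` yields 7, while `deployable_count(30, 931, 7, 153)` yields 4: memory is the binding constraint, and disk alone would allow 6.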

This is the root cause behind the opaque `vm deployment entered error state` failures. PR #113 adds a live-capacity *precheck* in `deploy_vm` so doomed requests fail fast with a clear message, but that is a guard, not a fix: the catalog itself is still wrong, so the dashboard still shows e.g. "6 free slices" that can't all be deployed.

## Proposed fix

1. Reserve disk headroom: add `reserved_disk_gb` (or a percentage) so `disk_per_slice_gb = (total_disk_gb - reserved_disk_gb) / slice_count`.
2. Account for memory overhead: raise the effective per-slice memory cost (or `reserved_memory_gb`) to cover the observed ~3 GB/VM ZOS overhead, lowering `slice_count` to a number that can actually be deployed concurrently.
3. Re-register path: node + slices are persisted in hero_db at register time, so existing nodes won't recompute. Add a `node_reregister` admin action (wipe `compute_node`/`slice` for the node, re-run `node_register`), or a migration that recomputes sizes in place.
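
Items 1 and 2 could be combined along these lines. This is only a sketch: `CarveParams`, `Catalog`, `carve`, and the field `memory_overhead_per_vm_gb` are hypothetical names, and the real `bootstrap_nodes` signature is not shown in this issue. Only `reserved_disk_gb` and the other variable names come from the text above.

```rust
/// Inputs for carving a node into slices (hypothetical struct).
struct CarveParams {
    total_memory_gb: u64,
    reserved_memory_gb: u64,
    total_disk_gb: u64,
    /// Disk headroom held back for ZOS (item 1).
    reserved_disk_gb: u64,
    slice_size_gb: u64,
    /// Extra RAM each VM costs beyond its nominal size (item 2; hypothetical name).
    memory_overhead_per_vm_gb: u64,
}

/// Resulting catalog shape (hypothetical struct).
struct Catalog {
    slice_count: u64,
    disk_per_slice_gb: u64,
}

/// Carve slices using the effective per-VM memory cost and reserved disk.
/// Returns `None` when not even one slice fits.
fn carve(p: &CarveParams) -> Option<Catalog> {
    let usable_memory_gb = p.total_memory_gb.saturating_sub(p.reserved_memory_gb);
    let effective_slice_gb = p.slice_size_gb + p.memory_overhead_per_vm_gb;
    if effective_slice_gb == 0 {
        return None;
    }
    let slice_count = usable_memory_gb / effective_slice_gb;
    if slice_count == 0 {
        return None;
    }
    let usable_disk_gb = p.total_disk_gb.saturating_sub(p.reserved_disk_gb);
    Some(Catalog {
        slice_count,
        disk_per_slice_gb: usable_disk_gb / slice_count,
    })
}
```

For node 1774 with a 3 GB memory overhead and an assumed 80 GB disk reserve (4 VMs times the ~20 GB disk gap observed per VM), this gives 4 slices of 212 GB each. Whether disk overhead is fixed per VM or grows with slice size is not established here, so a percentage-based reserve may turn out to be the better fit.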

## Acceptance

- A freshly registered node's slice catalog can be fully deployed without any slice hitting `vm deployment entered error state`.
- The dashboard slice count reflects deployable capacity, not nominal.
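
The first criterion could be turned into an invariant suitable for a unit test: the summed real footprint of all slices must fit within usable capacity. The sketch below assumes a fixed per-VM overhead. The types and the `catalog_fits` function are hypothetical and do not reflect the actual hero_db models.

```rust
/// Nominal slice size as advertised in the catalog (hypothetical type).
#[derive(Clone, Copy)]
struct Slice {
    memory_gb: u64,
    disk_gb: u64,
}

/// Assumed fixed ZOS overhead added to every deployed VM (hypothetical type).
#[derive(Clone, Copy)]
struct Overhead {
    memory_gb: u64,
    disk_gb: u64,
}

/// True when every slice in the catalog can be deployed at the same time,
/// counting overhead on top of each slice's nominal size.
fn catalog_fits(
    slices: &[Slice],
    overhead: Overhead,
    usable_memory_gb: u64,
    usable_disk_gb: u64,
) -> bool {
    let memory_needed: u64 = slices.iter().map(|s| s.memory_gb + overhead.memory_gb).sum();
    let disk_needed: u64 = slices.iter().map(|s| s.disk_gb + overhead.disk_gb).sum();
    memory_needed <= usable_memory_gb && disk_needed <= usable_disk_gb
}
```

Against node 1774 (30 GB usable RAM, 931 GB SSD) and an overhead of roughly 3 GB / 20 GB per VM, the current 7 slices of `{4 GB, 133 GB}` fail this check, while 4 slices of `{4 GB, 212 GB}` pass.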

## Context

Diagnosed on `mahmoud-ashraf-devbox`, node 1774 (mitana farm, dedicated): 8 cru / 31 GB mru / 931.5 GB sru, carved into 7 slices of {4 GB, 133 GB}. Related: PR #113 (precheck).
