zos_server: slice catalog over-commits the node (bootstrap_nodes carves 100% of capacity, ignores ZOS overhead) #114

Open
opened 2026-05-20 07:01:06 +00:00 by mahmoud · 0 comments
Owner

Problem

bootstrap_nodes (crates/my_compute_zos_server/src/cloud/rpc.rs:375-408) carves a node into slices that sum to ~100% of nominal capacity, leaving nothing for ZOS per-VM overhead:

let usable_memory_gb  = total_memory_gb - reserved_memory_gb;   // 31 - 1 = 30
let slice_count       = usable_memory_gb / slice_size_gb;       // 30 / 4 = 7 (integer division)
let disk_per_slice_gb = total_disk_gb / slice_count;            // 931 / 7 = 133  <- 100% of disk, no headroom

Real-world overhead measured on node 1774: a VM deployed into a nominal {4 GB RAM, 133 GB SSD} slice actually consumes ~7 GB RAM / ~153 GB SSD on the grid. Seven such VMs would need ~49 GB RAM and ~1071 GB SSD against the node's 31 GB and 931 GB, so the 7-slice catalog can never be fully deployed: the catalog overstates capacity, and the UI offers slices that cannot be placed.
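The over-commit can be reproduced with a few lines of arithmetic. The sketch below mirrors the three quoted lines and then multiplies the measured per-VM footprint by the slice count. The `Carve` struct and both function names are hypothetical helpers for illustration, not code from `rpc.rs`, and the `u64` types are an assumption.

```rust
/// Slice layout produced by the current carving arithmetic.
#[derive(Debug, PartialEq)]
struct Carve {
    slice_count: u64,
    disk_per_slice_gb: u64,
}

/// Mirrors the three quoted lines (integer division throughout).
/// Like any integer division in Rust, this panics if `slice_count` is 0.
fn carve_nominal(
    total_memory_gb: u64,
    reserved_memory_gb: u64,
    slice_size_gb: u64,
    total_disk_gb: u64,
) -> Carve {
    let usable_memory_gb = total_memory_gb - reserved_memory_gb;
    let slice_count = usable_memory_gb / slice_size_gb;
    let disk_per_slice_gb = total_disk_gb / slice_count;
    Carve { slice_count, disk_per_slice_gb }
}

/// Memory and disk needed to run every slice at once,
/// given the measured footprint of a single VM.
fn full_deploy_demand(slice_count: u64, vm_memory_gb: u64, vm_disk_gb: u64) -> (u64, u64) {
    (slice_count * vm_memory_gb, slice_count * vm_disk_gb)
}
```

With the node 1774 figures, `carve_nominal(31, 1, 4, 931)` gives 7 slices of 133 GB, and `full_deploy_demand(7, 7, 153)` gives (49, 1071), which exceeds both 31 GB of memory and 931 GB of disk.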

This is the root cause behind the opaque "vm deployment entered error state" failures. PR #113 adds a live-capacity precheck in deploy_vm so doomed requests fail fast with a clear message, but that is a guard, not a fix: the catalog itself is still wrong, so the dashboard still shows e.g. "6 free slices" that cannot all be deployed.

Proposed fix

  1. Reserve disk headroom: add reserved_disk_gb (or a percentage) so disk_per_slice_gb = (total_disk_gb - reserved_disk_gb) / slice_count.
  2. Account for memory overhead: raise the effective per-slice memory cost (or reserved_memory_gb) to cover the observed ~3 GB/VM ZOS overhead, lowering slice_count to a number that can actually be deployed concurrently.
  3. Re-register path: the node and its slices are persisted in hero_db at register time, so existing nodes won't recompute. Add a node_reregister admin action (wipe compute_node/slice for the node, then re-run node_register), or a migration that recomputes sizes in place.
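Items 1 and 2 could be combined roughly as follows. This is a minimal sketch, not the actual change: `CarveParams`, `SlicePlan`, `carve_with_overhead` and the `vm_memory_overhead_gb` field are hypothetical names, and only `reserved_disk_gb` comes from the proposal above. How disk overhead scales with slice size is not established here, so the disk reservation is a flat amount.

```rust
/// Inputs to the carve. `vm_memory_overhead_gb` is a hypothetical field
/// holding the per-VM ZOS memory overhead (about 3 GB was observed).
struct CarveParams {
    total_memory_gb: u64,
    reserved_memory_gb: u64,
    slice_size_gb: u64,
    vm_memory_overhead_gb: u64,
    total_disk_gb: u64,
    reserved_disk_gb: u64,
}

#[derive(Debug, PartialEq)]
struct SlicePlan {
    slice_count: u64,
    memory_per_slice_gb: u64,
    disk_per_slice_gb: u64,
}

/// Carves a node while charging each slice its nominal memory plus the
/// per-VM overhead, and holding back `reserved_disk_gb` of disk.
/// Returns `None` when the reservations exceed capacity or no slice fits.
fn carve_with_overhead(p: &CarveParams) -> Option<SlicePlan> {
    let usable_memory_gb = p.total_memory_gb.checked_sub(p.reserved_memory_gb)?;
    let effective_slice_gb = p.slice_size_gb + p.vm_memory_overhead_gb;
    if effective_slice_gb == 0 {
        return None;
    }
    let slice_count = usable_memory_gb / effective_slice_gb;
    if slice_count == 0 {
        return None;
    }
    let usable_disk_gb = p.total_disk_gb.checked_sub(p.reserved_disk_gb)?;
    Some(SlicePlan {
        slice_count,
        memory_per_slice_gb: p.slice_size_gb,
        disk_per_slice_gb: usable_disk_gb / slice_count,
    })
}
```

For node 1774 with a 3 GB overhead, this yields 30 / 7 = 4 slices instead of 7. With an arbitrary illustrative `reserved_disk_gb` of 80 (a placeholder, not a measured value), each slice gets (931 - 80) / 4 = 212 GB.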

Acceptance

  • A freshly registered node's slice catalog can be fully deployed without any slice hitting the "vm deployment entered error state" failure.
  • The dashboard slice count reflects deployable capacity, not nominal.
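One way to cross-check the second criterion is to compare the catalog's free-slice count with what the node's live free capacity can actually hold. The helper below is a hypothetical sketch. Its name and parameters are assumptions, and the per-VM footprint passed in would be the measured consumption (nominal size plus overhead), not the nominal slice size.

```rust
/// Number of free catalog slices that could really be deployed right now,
/// bounded by live free memory and disk. A zero footprint is treated as
/// "no bound from that resource".
fn deployable_free_slices(
    free_catalog_slices: u64,
    free_memory_gb: u64,
    free_disk_gb: u64,
    vm_memory_gb: u64,
    vm_disk_gb: u64,
) -> u64 {
    let by_memory = if vm_memory_gb == 0 { u64::MAX } else { free_memory_gb / vm_memory_gb };
    let by_disk = if vm_disk_gb == 0 { u64::MAX } else { free_disk_gb / vm_disk_gb };
    free_catalog_slices.min(by_memory).min(by_disk)
}
```

As an illustration with the node 1774 figures: after one VM is deployed, roughly 24 GB of memory and 778 GB of disk remain, so `deployable_free_slices(6, 24, 778, 7, 153)` returns 3 even though the catalog reports 6 free slices. Once the catalog is carved correctly, the two numbers should agree on a freshly registered node.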

Context

Diagnosed on mahmoud-ashraf-devbox, node 1774 (mitana farm, dedicated): 8 cru / 31 GB mru / 931.5 GB sru, carved into 7 slices of {4 GB, 133 GB}. Related: PR #113 (precheck).

Reference
lhumina_code/hero_compute#114