zos_server: slice catalog over-commits the node (bootstrap_nodes carves 100% of capacity, ignores ZOS overhead) #114
Reference
lhumina_code/hero_compute#114
Problem
bootstrap_nodes (crates/my_compute_zos_server/src/cloud/rpc.rs:375-408) carves a node into slices that sum to ~100% of nominal capacity, leaving nothing for ZOS per-VM overhead.

Real-world overhead measured on node 1774: a nominal {4 GB RAM, 133 GB SSD} slice's VM actually consumes ~7 GB RAM / ~153 GB SSD on the grid. So the 7-slice catalog can never be fully deployed: the catalog lies, and the UI offers slices that cannot be placed.

This is the root cause behind the opaque "vm deployment entered error state" failures. PR #113 adds a live-capacity precheck in deploy_vm so doomed requests fail fast with a clear message, but that's a guard, not a fix: the catalog itself is still wrong, so the dashboard still shows e.g. "6 free slices" that can't all be deployed.

Proposed fix
- [...] reserved_disk_gb (or a percentage) so disk_per_slice_gb = (total_disk_gb - reserved_disk_gb) / slice_count.
- [...] (reserved_memory_gb) to cover the observed ~3 GB/VM ZOS overhead, lowering slice_count to a number that can actually be deployed concurrently.
- [...] node_reregister admin action (wipe compute_node/slice for the node, re-run node_register), or a migration that recomputes sizes in place.

Acceptance
- [...] "vm deployment entered error state".

Context
Diagnosed on mahmoud-ashraf-devbox, node 1774 (mitana farm, dedicated): 8 cru / 31 GB mru / 931.5 GB sru, carved into 7 slices of {4 GB, 133 GB}. Related: PR #113 (precheck).
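
For reference, the arithmetic behind the over-commit can be sketched in a few lines of Rust. This is an illustration only, not the code in rpc.rs: the NodeCapacity and SliceSpec types and the deployable_slices and disk_per_slice_gb helpers are hypothetical names. The ~20 GB per-VM disk overhead is inferred from the measured 153 GB against the nominal 133 GB, and the 80 GB disk reserve is an arbitrary example value.

```rust
/// Nominal node capacity (figures for node 1774 are taken from this issue).
#[derive(Debug, Clone, Copy)]
struct NodeCapacity {
    memory_gb: f64,
    disk_gb: f64,
}

/// Nominal slice size plus an assumed per-VM ZOS overhead (hypothetical type).
#[derive(Debug, Clone, Copy)]
struct SliceSpec {
    memory_gb: f64,
    disk_gb: f64,
    overhead_memory_gb: f64,
    overhead_disk_gb: f64,
}

impl SliceSpec {
    /// Memory a deployed VM really consumes: nominal size plus overhead.
    fn real_memory_gb(&self) -> f64 {
        self.memory_gb + self.overhead_memory_gb
    }

    /// Disk a deployed VM really consumes: nominal size plus overhead.
    fn real_disk_gb(&self) -> f64 {
        self.disk_gb + self.overhead_disk_gb
    }
}

/// Number of slices that fit concurrently once overhead is counted;
/// the scarcer of memory and disk decides.
fn deployable_slices(node: NodeCapacity, slice: SliceSpec) -> u32 {
    let by_memory = (node.memory_gb / slice.real_memory_gb()).floor();
    let by_disk = (node.disk_gb / slice.real_disk_gb()).floor();
    by_memory.min(by_disk).max(0.0) as u32
}

/// The disk formula from the proposed fix:
/// (total_disk_gb - reserved_disk_gb) / slice_count.
fn disk_per_slice_gb(total_disk_gb: f64, reserved_disk_gb: f64, slice_count: u32) -> f64 {
    (total_disk_gb - reserved_disk_gb) / f64::from(slice_count)
}

fn main() {
    let node = NodeCapacity { memory_gb: 31.0, disk_gb: 931.5 };
    // Nominal {4 GB, 133 GB} slice; overheads derived from the measured ~7 GB / ~153 GB.
    let slice = SliceSpec {
        memory_gb: 4.0,
        disk_gb: 133.0,
        overhead_memory_gb: 3.0,
        overhead_disk_gb: 20.0,
    };

    let catalog_slices = 7;
    let deployable = deployable_slices(node, slice);
    println!(
        "catalog advertises {} slices, {} fit once overhead is counted",
        catalog_slices, deployable
    );

    // Example only: an 80 GB reserve spread over the deployable slice count.
    println!(
        "disk per slice with an 80 GB reserve: {} GB",
        disk_per_slice_gb(node.disk_gb, 80.0, deployable)
    );
}
```

With these inputs memory is the binding constraint: 31 GB / 7 GB allows 4 concurrent VMs, while disk alone would allow 6, so a 7-slice catalog over-states what the node can host.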