fix(zos_server): pre-check live node capacity before deploy_vm #113
lhumina_code/hero_compute!113
Problem
Deploying a VM that requests more resources than the target node can actually host fails asynchronously with an opaque message: "vm deployment entered error state". The user has no way to tell that it was a capacity issue.
Observed on mahmoud-ashraf-devbox, node 1774 (dedicated): a 6-slice request (6 × {4 GB RAM, 133 GB SSD} = 24 GB RAM / 798 GB SSD) failed repeatedly, while the node had only ~23 GB RAM / ~778 GB SSD free (one slice already in use by a running VM, plus ZOS/system overhead).
Root cause
deploy_vm only guarded against the local free-slice count (rpc.rs:545). That count comes from bootstrap_nodes, which carves the node at ~100% of nominal capacity. No headroom is left for ZOS per-VM overhead (a VM on a nominal {4 GB, 133 GB} slice actually consumes ~7 GB RAM / ~153 GB SSD on the grid). So the 7-slice catalog over-commits the node: a request can pass the slice-count guard yet exceed real free capacity, and only ZOS rejects it, at deploy time, after an on-chain contract has been submitted.
Fix (this PR: fast-fail precheck)
Before creating the VM record or submitting the contract, query Grid Proxy for the target node's live used/total cpu/mem/disk (the query_node_from_proxy helper already returns all of these) and reject the request synchronously with a clear InvalidInput if it doesn't fit.
Details:
- [...] used figures lag real time; a deploy sitting right at the edge would otherwise still be rejected on-chain. (Local const for now; easy to promote to a config knob.)
- [...] reserved_memory_gb.
Out of scope (tracked separately)
- [...] bootstrap_nodes): carve disk/memory with headroom so the catalog stops offering un-deployable slices. Needs a node re-register path. → separate issue.
Test plan
- cargo check -p my_compute_zos_server passes locally
- [...] my_compute_zos_server on devbox
- [...] InvalidInput with the capacity message (no on-chain contract submitted)
Tracking the out-of-scope follow-ups referenced above:
Verified on devbox (node 1774) after merge: a 6-slice deploy is rejected instantly with
insufficient capacity on tfnode-1774: requested 6 slice(s) need 6 vCPU / 26 GB RAM / 877 GB SSD (incl. 10% headroom), but only 7 vCPU / 23 GB RAM / 778 GB SSD free
No VM record was created, and slices stayed at 6 free / 1 in_use. The reported free values confirm that a 5-slice request (needs 5 / 22 / 731) passes. The error surfaces as -32603 with the message in data; this is pre-existing framework wrapping shared by the existing slice_count>256 and empty-name guards.
Filed the framework-level error-code cleanup (the -32603 / Redis operation error wrapping noted above) as hero_rpc#83. It is codegen + dispatch, affecting all OSIS services, not a hero_compute change.
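For readers following the numbers in the verification comment, here is a minimal Rust sketch of the fit check described under Fix. It is not the code in this PR. The names Resources, SLICE, with_headroom, required and precheck are hypothetical, the free values are passed in directly instead of being fetched through query_node_from_proxy, and the error is a plain String rather than InvalidInput. Two details are assumptions inferred from the reported figures: 1 vCPU per slice with no headroom on CPU, and integer (truncating) arithmetic for the 10% headroom.

```rust
/// Resource triple used for both "needed" and "free" figures (hypothetical type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct Resources {
    vcpu: u64,
    mem_gb: u64,
    ssd_gb: u64,
}

/// Nominal slice from the PR text: 4 GB RAM, 133 GB SSD.
/// 1 vCPU per slice is an assumption inferred from "6 slice(s) need 6 vCPU".
const SLICE: Resources = Resources { vcpu: 1, mem_gb: 4, ssd_gb: 133 };

/// Headroom percentage quoted in the error message.
const HEADROOM_PERCENT: u64 = 10;

/// Add headroom using truncating integer math (assumed; it matches 24 -> 26 and 798 -> 877).
fn with_headroom(value: u64) -> u64 {
    value * (100 + HEADROOM_PERCENT) / 100
}

/// Resources a request of `slices` slices must find free on the node.
fn required(slices: u64) -> Resources {
    Resources {
        // Assumption: no headroom applied to vCPU.
        vcpu: SLICE.vcpu * slices,
        mem_gb: with_headroom(SLICE.mem_gb * slices),
        ssd_gb: with_headroom(SLICE.ssd_gb * slices),
    }
}

/// Reject synchronously if the request does not fit into the node's live free capacity.
fn precheck(node: &str, slices: u64, free: Resources) -> Result<(), String> {
    let need = required(slices);
    let fits = need.vcpu <= free.vcpu
        && need.mem_gb <= free.mem_gb
        && need.ssd_gb <= free.ssd_gb;
    if fits {
        return Ok(());
    }
    Err(format!(
        "insufficient capacity on {}: requested {} slice(s) need {} vCPU / {} GB RAM / {} GB SSD \
         (incl. {}% headroom), but only {} vCPU / {} GB RAM / {} GB SSD free",
        node, slices, need.vcpu, need.mem_gb, need.ssd_gb,
        HEADROOM_PERCENT, free.vcpu, free.mem_gb, free.ssd_gb
    ))
}
```

With the free values reported for node 1774 (7 vCPU / 23 GB RAM / 778 GB SSD), precheck("tfnode-1774", 6, free) returns an error because 26 GB RAM and 877 GB SSD exceed what is free, while a 5-slice request (5 / 22 / 731) passes.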