fix(zos_server): pre-check live node capacity before deploy_vm #113

Merged
mahmoud merged 1 commit from fix/deploy-vm-capacity-precheck into development 2026-05-20 07:00:51 +00:00

Problem

Deploying a VM that requests more resources than the target node can actually host fails asynchronously with an opaque message: vm deployment entered error state. The user has no way to tell it was a capacity issue.

Observed on mahmoud-ashraf-devbox, node 1774 (dedicated): a 6-slice request (6 × {4 GB RAM, 133 GB SSD} = 24 GB RAM / 798 GB SSD) failed repeatedly, while the node only had ~23 GB RAM / ~778 GB SSD free (one slice already in use by a running VM, plus ZOS/system overhead).

Root cause

deploy_vm only guarded against the local free-slice count (rpc.rs:545). That count comes from bootstrap_nodes, which carves the node at ~100% of nominal capacity:

let slice_count       = (total_memory_gb - reserved_memory_gb) / slice_size_gb; // 30/4 = 7
let disk_per_slice_gb = total_disk_gb / slice_count;                            // 931/7 = 133 → 100% of disk

No headroom is left for ZOS per-VM overhead (a nominal {4 GB, 133 GB} slice's VM actually consumes ~7 GB RAM / ~153 GB SSD on the grid). So the 7-slice catalog over-commits the node: a request can pass the slice-count guard yet exceed real free capacity, and only ZOS — at deploy time, after an on-chain contract is submitted — rejects it.
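
To make the over-commit concrete, here is a minimal sketch that replays the carving arithmetic and the node 1774 request. The names `Catalog`, `carve`, `passes_slice_count_guard`, and `fits_live_capacity` are hypothetical and not taken from rpc.rs; the 30 GB usable memory, 931 GB disk, and 23 GB / 778 GB free figures are the ones quoted in this description.

```rust
/// Slice catalog produced by carving a node (hypothetical type for illustration).
#[derive(Debug, PartialEq)]
struct Catalog {
    slice_count: u64,
    disk_per_slice_gb: u64,
}

/// Mirrors the two carving lines above, using integer division.
/// `usable_memory_gb` stands for `total_memory_gb - reserved_memory_gb`.
/// Panics if the node yields zero slices; a real implementation would guard that.
fn carve(usable_memory_gb: u64, total_disk_gb: u64, slice_size_gb: u64) -> Catalog {
    let slice_count = usable_memory_gb / slice_size_gb;
    let disk_per_slice_gb = total_disk_gb / slice_count;
    Catalog { slice_count, disk_per_slice_gb }
}

/// The only guard described as existing before this PR: compare against free slices.
fn passes_slice_count_guard(requested: u64, free_slices: u64) -> bool {
    requested <= free_slices
}

/// Compare the nominal request against the node's live free memory and disk.
fn fits_live_capacity(
    requested: u64,
    slice_size_gb: u64,
    catalog: &Catalog,
    free_mem_gb: u64,
    free_disk_gb: u64,
) -> bool {
    requested * slice_size_gb <= free_mem_gb
        && requested * catalog.disk_per_slice_gb <= free_disk_gb
}
```

With `carve(30, 931, 4)` the catalog has 7 slices of 133 GB, which together cover all 931 GB of disk. A 6-slice request then passes `passes_slice_count_guard(6, 6)` but fails `fits_live_capacity(6, 4, &catalog, 23, 778)`, which is the gap this PR closes.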

Fix (this PR — fast-fail precheck)

Before creating the VM record or submitting the contract, query Grid Proxy for the target node's live used/total cpu/mem/disk (the query_node_from_proxy helper already returns all of these) and reject the request synchronously with a clear InvalidInput if it doesn't fit:

insufficient capacity on tfnode-1774: requested 6 slice(s) need 6 vCPU / 26 GB RAM / 877 GB SSD
(incl. 10% headroom), but only 7 vCPU / 23 GB RAM / 778 GB SSD free
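
The figures in that message can be reproduced with a small sketch. It assumes the 10% headroom is applied with integer arithmetic that rounds down, which is inferred from the numbers above (24 → 26, 798 → 877) rather than confirmed by the source; `HEADROOM_PERCENT`, `with_headroom`, and `capacity_error` are hypothetical names.

```rust
/// Headroom applied to the request, in percent (the PR describes a local const).
const HEADROOM_PERCENT: u64 = 10;

/// Pad a nominal figure by the headroom, rounding down.
/// Rounding down is an assumption that matches the figures in the message.
fn with_headroom(nominal: u64) -> u64 {
    nominal * (100 + HEADROOM_PERCENT) / 100
}

/// Build the rejection message; tuples are (vCPU, RAM in GB, SSD in GB).
fn capacity_error(
    node: &str,
    slices: u64,
    need: (u64, u64, u64),
    free: (u64, u64, u64),
) -> String {
    format!(
        "insufficient capacity on {}: requested {} slice(s) need {} vCPU / {} GB RAM / {} GB SSD \
         (incl. {}% headroom), but only {} vCPU / {} GB RAM / {} GB SSD free",
        node, slices, need.0, need.1, need.2, HEADROOM_PERCENT, free.0, free.1, free.2
    )
}
```

For the 6-slice request, `with_headroom(24)` gives 26 and `with_headroom(798)` gives 877; for a 5-slice request, `with_headroom(20)` gives 22 and `with_headroom(665)` gives 731.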

Details:

  • A 10% headroom pads the request because Grid Proxy's "used" figures lag behind real time; without it, a deploy sitting right at the edge would pass the precheck and still be rejected on-chain. (Local const for now; easy to promote to a config knob.)
  • Memory free also subtracts the existing reserved_memory_gb.
  • Fail open: if Grid Proxy is unreachable, log a warning and proceed — the on-chain deploy remains the ultimate gate, and a proxy outage shouldn't block all deploys.
  • Placed before any state mutation, so a rejected deploy creates no VM record and marks no slices in-use — nothing to roll back.
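
The control flow described by these bullets can be sketched as follows. This is an illustration under stated assumptions, not the code in rpc.rs: `Capacity`, `Precheck`, `free_memory_gb`, `precheck`, and `deploy_vm_sketch` are hypothetical names, the proxy lookup is modelled as a plain `Result`, and a `Vec<String>` stands in for the VM record store.

```rust
/// Resource triple used for both the padded request and the node's free capacity.
#[derive(Debug, Clone, Copy, PartialEq)]
struct Capacity {
    vcpu: u64,
    mem_gb: u64,
    disk_gb: u64,
}

#[derive(Debug, PartialEq)]
enum Precheck {
    Fits,
    Rejected(String),
    SkippedProxyUnreachable,
}

/// Free memory = total - used - reserved, clamped at zero.
fn free_memory_gb(total_gb: u64, used_gb: u64, reserved_gb: u64) -> u64 {
    total_gb.saturating_sub(used_gb).saturating_sub(reserved_gb)
}

/// Fail open: a proxy error only logs a warning and skips the check.
fn precheck(need: Capacity, live_free: Result<Capacity, String>) -> Precheck {
    match live_free {
        Err(reason) => {
            eprintln!("warning: capacity precheck skipped, Grid Proxy unreachable: {reason}");
            Precheck::SkippedProxyUnreachable
        }
        Ok(free)
            if need.vcpu <= free.vcpu
                && need.mem_gb <= free.mem_gb
                && need.disk_gb <= free.disk_gb =>
        {
            Precheck::Fits
        }
        Ok(free) => Precheck::Rejected(format!(
            "insufficient capacity: need {:?}, free {:?}",
            need, free
        )),
    }
}

/// The precheck runs before any state mutation, so a rejection leaves
/// `vm_records` untouched and there is nothing to roll back.
fn deploy_vm_sketch(
    name: &str,
    need: Capacity,
    live_free: Result<Capacity, String>,
    vm_records: &mut Vec<String>,
) -> Result<(), String> {
    if let Precheck::Rejected(msg) = precheck(need, live_free) {
        return Err(msg);
    }
    // Only now create the VM record (contract submission would follow here).
    vm_records.push(name.to_string());
    Ok(())
}
```

Modelling the skipped check as its own `Precheck` variant keeps the fail-open path visible to callers and tests instead of folding it silently into `Fits`.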

Out of scope (tracked separately)

  • Slice-catalog over-commit (the root sizing bug in bootstrap_nodes): carve disk/memory with headroom so the catalog stops offering un-deployable slices. Needs a node re-register path. → separate issue.
  • GridError → human message mapping for non-capacity on-chain failures. → separate issue.

Test plan

  • [x] cargo check -p my_compute_zos_server passes locally
  • [ ] Build + deploy patched my_compute_zos_server on devbox
  • [ ] 6-slice deploy on node 1774 → instant InvalidInput with the capacity message (no on-chain contract submitted)
  • [ ] 5-slice deploy on node 1774 → still succeeds
  • [ ] Grid Proxy unreachable → warning logged, deploy proceeds (fail-open)
fix(zos_server): pre-check live node capacity before deploy_vm
Some checks failed
Test / test (pull_request) Failing after 3m32s
a935f88bba
deploy_vm only guarded against the local free-slice count, which is
derived from a slice catalog that carves the node at ~100% of nominal
capacity (disk_per_slice = total_disk / slice_count) and ignores ZOS
per-VM overhead. A request could pass that guard yet exceed what the node
can actually host, so the on-chain deploy failed asynchronously with the
opaque "vm deployment entered error state".

Add a synchronous precheck: before creating the VM record or submitting
the contract, query Grid Proxy for the target node's live used/total
cpu/mem/disk and reject the request with a clear InvalidInput message if
it (plus a 10% headroom for proxy lag) doesn't fit. Fail open when the
proxy is unreachable — the on-chain deploy stays the ultimate gate and a
proxy outage shouldn't block all deploys.

Refs #33-style fast-fail: turns the cryptic async error into an
immediate, actionable one. Slice-catalog over-commit (the root sizing
bug) is tracked separately.
mahmoud merged commit 3b87dc94b8 into development 2026-05-20 07:00:51 +00:00
mahmoud deleted branch fix/deploy-vm-capacity-precheck 2026-05-20 07:00:56 +00:00

Tracking the out-of-scope follow-ups referenced above:

  • Slice-catalog over-commit (root sizing bug): #114
  • GridError → human message mapping: #115

Verified on devbox (node 1774) after merge: a 6-slice deploy is rejected instantly with the message "insufficient capacity on tfnode-1774: requested 6 slice(s) need 6 vCPU / 26 GB RAM / 877 GB SSD (incl. 10% headroom), but only 7 vCPU / 23 GB RAM / 778 GB SSD free". No VM record was created, and the slices stayed at 6 free / 1 in_use. The free values reported in the message confirm that a 5-slice request (needs 5 vCPU / 22 GB RAM / 731 GB SSD) fits. The error surfaces as -32603 with the message in data; this is pre-existing framework wrapping shared by the existing slice_count>256 and empty-name guards.

Filed the framework-level error-code cleanup (the -32603 / Redis operation error wrapping noted above) as hero_rpc#83 — it is codegen + dispatch, affecting all OSIS services, not a hero_compute change.
Reference
lhumina_code/hero_compute!113