zos_server: map GridError variants to human-readable deploy messages #115

Open
opened 2026-05-20 07:01:06 +00:00 by mahmoud · 0 comments
Owner

Problem

When a TFGrid deploy fails, deploy_vm's background task surfaces the raw GridError via vm_log_fail (crates/my_compute_zos_server/src/cloud/rpc.rs:693-704). For most failures the user sees:

ERROR: backend error: vm deployment entered error state

…which gives no hint about why (capacity, placement, contract, network, node offline, etc.).

PR #113 handles the most common cause (insufficient capacity) by failing fast before submitting the contract. But other on-chain failures still surface as this opaque string.

Proposed fix

In grid_driver (or wherever the GridError is converted to a log line), match common GridError variants and emit clearer messages, e.g.:

  • workload error state -> probe the workload result/reason from the deployment if available and include it
  • contract creation/billing failures -> "contract rejected: <reason>"
  • node unreachable / offline -> "node <id> is not reachable"
  • network/mycelium setup failures -> "network setup failed: <reason>"

Fall back to the raw error for unmapped variants. Keep the mapping in one place so it's reusable by both the sync and async paths.
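
A minimal sketch of what a single mapping function could look like. The real GridError variants and fields are not shown in this issue, so the enum below is a hypothetical stand-in, and `deploy_failure_message` is an illustrative name rather than an existing function in the crate:

```rust
use std::fmt;

/// Hypothetical stand-in for the real `GridError`. The actual variant
/// names and fields are assumptions made for this sketch.
#[derive(Debug)]
pub enum GridError {
    /// Workload entered the error state; `reason` holds the probed
    /// workload result, when one could be retrieved.
    WorkloadErrorState { reason: Option<String> },
    /// Contract creation or billing failed on chain.
    ContractRejected { reason: String },
    /// Node is offline or could not be reached.
    NodeUnreachable { node_id: u32 },
    /// Network or mycelium setup failed.
    NetworkSetup { reason: String },
    /// Any failure without a dedicated variant.
    Other(String),
}

/// The raw form, i.e. what the user sees today.
impl fmt::Display for GridError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            GridError::WorkloadErrorState { .. } => {
                write!(f, "backend error: vm deployment entered error state")
            }
            GridError::ContractRejected { reason } => {
                write!(f, "backend error: contract: {}", reason)
            }
            GridError::NodeUnreachable { node_id } => {
                write!(f, "backend error: node {}", node_id)
            }
            GridError::NetworkSetup { reason } => {
                write!(f, "backend error: network: {}", reason)
            }
            GridError::Other(msg) => write!(f, "backend error: {}", msg),
        }
    }
}

/// Single place that turns a `GridError` into a user-facing message.
/// Takes a reference and returns a `String`, so both the sync and the
/// async path can call it before logging.
pub fn deploy_failure_message(err: &GridError) -> String {
    match err {
        // Include the workload reason only when one was probed.
        GridError::WorkloadErrorState { reason: Some(r) } => {
            format!("vm deployment entered error state: {}", r)
        }
        GridError::ContractRejected { reason } => {
            format!("contract rejected: {}", reason)
        }
        GridError::NodeUnreachable { node_id } => {
            format!("node {} is not reachable", node_id)
        }
        GridError::NetworkSetup { reason } => {
            format!("network setup failed: {}", reason)
        }
        // Unmapped variants (and an error state with no reason) fall
        // back to the raw error text.
        other => other.to_string(),
    }
}
```

With this shape, the background task would pass `deploy_failure_message(&err)` to vm_log_fail instead of the raw error. Probing the workload result happens before the mapping, so the function itself stays pure and easy to unit test.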

Context

Low priority polish. Diagnosed alongside PR #113 and the slice-carving issue.

Reference
lhumina_code/hero_compute#115