deploy_vm returns Ok before the spawned substrate submission finishes #120

Closed
opened 2026-05-23 16:04:22 +00:00 by mik-tf · 1 comment
Owner

`ComputeService.deploy_vm` (at `crates/my_compute_zos_server/src/cloud/rpc.rs:1116-1213`) returns a successful response carrying a `vm_sid` as soon as the local slice allocation completes, but the substrate `createDeploymentContract` calls run in a `tokio::spawn`'d task that the RPC handler does not await. Callers persist their local state (`vm_secret`, `vm_sid`, owner mapping) based on the synchronous `Ok` return, before the on-chain write has actually landed.

If the caller crashes or the network drops between the daemon's `Ok` response and the caller's own state write, the spawned task can still create one or two on-chain contracts that the caller no longer has a reference to. Those orphaned contracts keep costing TFT indefinitely. This is the symmetric twin of issue #119 (`delete_vm`).

Fix: await the spawned submission task and return `Ok` only after the substrate ack lands, or expose an explicit pending state the caller can poll. See `cloud/rpc.rs:1116-1213` and the spawned task block at line 1116.
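The two fix options can be sketched roughly as follows. This is a minimal illustration, not the code in `cloud/rpc.rs`: it uses `std::thread` in place of `tokio::spawn` so it compiles without dependencies (joining the thread stands in for awaiting the tokio `JoinHandle`), and `submit_contract`, `DeployState`, `PendingDeploy`, `deploy_vm_awaited`, `deploy_vm_pending`, the `"vm-1"` sid and the contract id `42` are all hypothetical names and values chosen for the example.

```rust
use std::sync::{Arc, Mutex};
use std::thread;

/// Hypothetical outcome of the on-chain submission.
#[derive(Debug, Clone, PartialEq)]
pub enum DeployState {
    Pending,
    Confirmed { contract_id: u64 },
    Failed(String),
}

/// Hypothetical stand-in for the `createDeploymentContract` call.
fn submit_contract(should_fail: bool) -> Result<u64, String> {
    if should_fail {
        Err("substrate rejected the extrinsic".to_string())
    } else {
        Ok(42)
    }
}

/// Option A: only return Ok once the submission task has finished.
pub fn deploy_vm_awaited(should_fail: bool) -> Result<(String, u64), String> {
    let vm_sid = "vm-1".to_string();
    let handle = thread::spawn(move || submit_contract(should_fail));
    // Joining is the analogue of `.await`ing the spawned task's handle.
    // The first `?` covers a panicked task, the second the submission error.
    let contract_id = handle
        .join()
        .map_err(|_| "submission task panicked".to_string())??;
    Ok((vm_sid, contract_id))
}

/// Option B: return immediately, but with an explicit state the caller polls.
pub struct PendingDeploy {
    pub vm_sid: String,
    state: Arc<Mutex<DeployState>>,
    worker: Option<thread::JoinHandle<()>>,
}

impl PendingDeploy {
    /// Non-blocking snapshot of the current state.
    pub fn state(&self) -> DeployState {
        self.state.lock().unwrap().clone()
    }

    /// Block until the submission has finished, then report the final state.
    pub fn wait(&mut self) -> DeployState {
        if let Some(worker) = self.worker.take() {
            let _ = worker.join();
        }
        self.state()
    }
}

pub fn deploy_vm_pending(should_fail: bool) -> PendingDeploy {
    let state = Arc::new(Mutex::new(DeployState::Pending));
    let shared = Arc::clone(&state);
    let worker = thread::spawn(move || {
        let outcome = match submit_contract(should_fail) {
            Ok(contract_id) => DeployState::Confirmed { contract_id },
            Err(e) => DeployState::Failed(e),
        };
        *shared.lock().unwrap() = outcome;
    });
    PendingDeploy {
        vm_sid: "vm-1".to_string(),
        state,
        worker: Some(worker),
    }
}
```

With option A the caller never sees `Ok` for a contract that has not landed. With option B the caller would persist the `vm_sid` together with a `Pending` marker and reconcile once `state()` reports `Confirmed` or `Failed`.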
Author
Owner

Cross-referencing the s156 walk evidence detailed at https://forge.ourworld.tf/lhumina_code/hero_compute/issues/119: the D-27 substrate-await fix at 39d9b8a is incomplete on the error path for both `deploy_vm` and (by symmetry) `delete_vm`. The same rollback-on-error pattern that #119 needs on the deploy side is needed on the delete side too: if `delete_vm` errors after publishing a cancel intent to substrate, the on-chain state may end up half-cancelled with no client-side record. Tracking both fixes together since the structure is identical.

Signed-by: mik-tf <mik-tf@noreply.invalid>
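One possible shape for the rollback-on-error pattern on the deploy side is sketched below. It is an assumption-laden illustration rather than the actual fix: `MockChain`, `create_contract`, `cancel_contract` and `deploy_with_rollback` are hypothetical names, and the in-memory set stands in for on-chain contract state.

```rust
use std::collections::BTreeSet;

/// Hypothetical in-memory stand-in for the substrate client.
#[derive(Default)]
pub struct MockChain {
    next_id: u64,
    creates: u64,
    /// Contracts that currently exist (and would be billed) on chain.
    pub live: BTreeSet<u64>,
    /// Fail the Nth create call (1-based); 0 means never fail.
    pub fail_on_create: u64,
}

impl MockChain {
    fn create_contract(&mut self) -> Result<u64, String> {
        self.creates += 1;
        if self.creates == self.fail_on_create {
            return Err(format!("create #{} failed", self.creates));
        }
        self.next_id += 1;
        self.live.insert(self.next_id);
        Ok(self.next_id)
    }

    fn cancel_contract(&mut self, id: u64) -> Result<(), String> {
        if self.live.remove(&id) {
            Ok(())
        } else {
            Err(format!("contract {} not found", id))
        }
    }
}

/// Create `wanted` contracts. If any creation fails, cancel the ones already
/// created so that an error return never leaves billable contracts behind.
pub fn deploy_with_rollback(chain: &mut MockChain, wanted: usize) -> Result<Vec<u64>, String> {
    let mut created = Vec::new();
    for _ in 0..wanted {
        match chain.create_contract() {
            Ok(id) => created.push(id),
            Err(e) => {
                // Roll back in reverse order and report anything that could
                // not be cancelled, so the caller can reconcile it later.
                let leaked: Vec<u64> = created
                    .into_iter()
                    .rev()
                    .filter(|id| chain.cancel_contract(*id).is_err())
                    .collect();
                return Err(format!("{}; leaked contracts: {:?}", e, leaked));
            }
        }
    }
    Ok(created)
}
```

Reporting the contracts that could not be cancelled inside the error gives the caller a record of any half-finished on-chain state, which is the gap described above. The delete side would need the mirror image: a record of which cancels were published before the error.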
Reference
lhumina_code/hero_compute#120