deploy_webgateway can create the gateway on the grid yet return an error to the caller, and there is no way to query an existing gateway #133

Open
opened 2026-06-04 19:55:21 +00:00 by mik-tf · 1 comment
Owner

When the deployer asks the compute daemon to create a per-tester web gateway, we have seen cases where the gateway is actually created on the grid (the domain resolves and serves the tester) but the deploy call still returns an error, or times out, to the caller. The caller then assumes the gateway was rolled back and records nothing about it, which leaves a live gateway the caller cannot track or clean up, and causes downstream setup that depends on knowing the gateway domain to be skipped.

Two changes would fix this at the source.

First, deploy_webgateway should not return an error once the on-chain contract has been created: it should wait until the gateway is ready and return success with the final domain, or fully roll back and return an error, with nothing in between.

Second, please add a way to look up an existing gateway by name (returning its domain and id), so a caller that lost the response, or is reconciling after a restart, can adopt the live gateway instead of creating a duplicate.

We hit this with a tester whose gateway is live and serving but was completely unknown to the deployer, so the deployer could neither record its domain nor set up the tester login protection that depends on it. On the deployer side we now retry and refuse to publish a tester without its login protection, but the reliable fix is in the daemon so the gateway state is never ambiguous.

Signed-by: mik-tf <mik-tf@noreply.invalid>
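To make the requested contract concrete, here is a minimal caller-side sketch in Python of how a lookup-by-name would let the deployer adopt a live gateway after a lost response or a restart. Everything named here is an assumption for illustration: `Gateway`, `GatewayClient`, `get_webgateway`, and `ensure_gateway` are hypothetical, and the `deploy_webgateway` signature shown is not taken from the daemon's actual API.

```python
from dataclasses import dataclass
from typing import Optional, Protocol


@dataclass(frozen=True)
class Gateway:
    """Hypothetical result shape: the id and domain the issue asks for."""
    id: str
    domain: str


class GatewayClient(Protocol):
    """Hypothetical client interface; signatures are assumptions."""

    def deploy_webgateway(self, name: str) -> Gateway: ...

    def get_webgateway(self, name: str) -> Optional[Gateway]: ...


def ensure_gateway(client: GatewayClient, name: str) -> Gateway:
    """Return the gateway for `name`, adopting a live one instead of duplicating it."""
    # Reconciliation after a restart: adopt a gateway that already exists.
    existing = client.get_webgateway(name)
    if existing is not None:
        return existing
    try:
        return client.deploy_webgateway(name)
    except Exception:
        # The deploy may have succeeded on the grid even though the call
        # failed or timed out, so look the gateway up before giving up.
        existing = client.get_webgateway(name)
        if existing is None:
            raise  # nothing live under this name: treat as rolled back
        return existing
```

With this shape, an error from the deploy call only reaches the caller when the lookup also finds nothing, which is the "success with the final domain, or fully rolled back" outcome described above. It does not replace the daemon-side fix; it only shows why the lookup makes recovery possible.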

Author
Owner

Confirming this from live onboarding on QAnet (twin 703), and it is reproducible. deploy_webgateway for a name gateway reaches on-chain ready but intermittently returns an empty fqdn. We saw it recur across several gateway attempts for one tester, while fresh names succeeded in the same window, so it is a transient read-back miss rather than a per-name problem.

The daemon already knows the deterministic answer: the log line on success prints the selected gateway node zone (for example gent01.qa.grid.tf), and a name gateway's fqdn is always name.zone. The caller cannot capture the domain when it comes back empty, so it skips the per-tester OAuth gate and the install stalls.

We added a consumer-side workaround in the deployer that derives name.zone when deploy_webgateway returns ready without an fqdn, which unblocks onboarding, but the durable fix belongs here: have deploy_webgateway fill the fqdn from name plus the selected node zone before returning, instead of relying on the SDK read-back. A lookup-by-name that returns an existing gateway's domain and id, as already suggested in this issue, would also let a caller recover the domain after a transient miss.
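The name.zone derivation described above is small enough to sketch. This is an illustration in Python, not the deployer's or the daemon's actual code: `resolve_fqdn` and its parameters are hypothetical, and it assumes the selected gateway node zone is available to whoever calls it.

```python
def resolve_fqdn(name: str, zone: str, reported_fqdn: str = "") -> str:
    """Prefer the fqdn that was read back; otherwise derive name.zone.

    Hypothetical helper: `zone` is assumed to be the selected gateway
    node zone, for example "gent01.qa.grid.tf".
    """
    reported = reported_fqdn.strip()
    if reported:
        return reported  # the read-back worked, trust it

    # Transient read-back miss: fall back to the deterministic name.zone form.
    name = name.strip().strip(".")
    zone = zone.strip().strip(".")
    if not name or not zone:
        raise ValueError("both name and zone are required to derive an fqdn")
    return f"{name}.{zone}"


# "tester1" is a made-up gateway name used only for this example.
print(resolve_fqdn("tester1", "gent01.qa.grid.tf"))
# prints "tester1.gent01.qa.grid.tf"
```

Doing this inside deploy_webgateway before it returns would mean callers never see a ready gateway with an empty fqdn, so the consumer-side workaround would no longer be needed.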
Reference
lhumina_code/hero_compute#133