fix(cli): use UDS connect-probe health check for _ui and _admin #22

Merged
mik-tf merged 1 commit from development_mik into development 2026-05-23 15:34:42 +00:00
Closes #21

The action specs for `hero_assistance_ui` and `hero_assistance_admin` configured hero_proc with `http_url: "http://localhost/health"`, but both daemons bind UDS only (`app.sock` and `admin.sock` respectively). Every probe attempt failed because nothing on the host serves `http://localhost:80/health`, and after the retry budget elapsed hero_proc killed each daemon. With `max_attempts=3` the actions ended up permanently failed; observed restart cadence before retries exhausted was roughly 30 to 35 seconds.

Switch both health checks to the same shape `hero_assistance_server` already uses: `openrpc_socket: Some(<their UDS path>)`. Per `hero_proc_server/src/types/config_ext.rs` the `OpenRpcSocket` variant is a connect-only probe, so this works regardless of whether the daemon exposes `/rpc` or `/openrpc.json` (the customer UI's router only has `/rpc`; the admin has both).
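For readers unfamiliar with the idea, a connect-only probe can be sketched in a few lines of Rust. This is a minimal illustration, not hero_proc's actual implementation: the function name `uds_connect_probe` is hypothetical, and the real `OpenRpcSocket` handling may add timeouts or retries that are not shown here.

```rust
use std::os::unix::net::UnixStream;
use std::path::Path;

/// Hypothetical helper (not taken from hero_proc): a connect-only probe.
/// It reports healthy as soon as something accepts a connection on the
/// socket, and never sends a request, so it does not matter which HTTP
/// routes the daemon exposes.
fn uds_connect_probe(socket_path: &Path) -> bool {
    // The stream is dropped immediately after a successful connect.
    UnixStream::connect(socket_path).is_ok()
}
```

Because nothing is written to the socket, the same probe applies unchanged to a daemon that serves only `/rpc` and to one that also serves `/openrpc.json`.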

Add `phase24c_build_service_definition_health_checks_use_uds_connect_probe` pinning the contract across all three actions so a regression to `http_url` is caught at test time.
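The shape of such a pin test might look roughly like the sketch below. Everything in it is an assumption made for illustration: `HealthCheckSpec`, `build_health_checks`, and `violations` are hypothetical stand-ins rather than the real hero_proc or hero_assistance types, the socket names are bare file names rather than full paths, and the pairing of `rpc.sock` with the server is inferred from the order in which the sockets are listed in this PR.

```rust
/// Hypothetical stand-in for a health-check spec; the real type may differ.
#[derive(Debug, Clone)]
struct HealthCheckSpec {
    http_url: Option<String>,
    openrpc_socket: Option<String>,
}

/// Hypothetical stand-in for the three action definitions.
fn build_health_checks() -> Vec<(&'static str, HealthCheckSpec)> {
    let uds = |sock: &str| HealthCheckSpec {
        http_url: None,
        openrpc_socket: Some(sock.to_string()),
    };
    vec![
        ("hero_assistance_server", uds("rpc.sock")),
        ("hero_assistance_ui", uds("app.sock")),
        ("hero_assistance_admin", uds("admin.sock")),
    ]
}

/// Names of actions that break the contract: either an HTTP probe is
/// configured, or no UDS connect probe is configured at all.
fn violations(checks: &[(&'static str, HealthCheckSpec)]) -> Vec<&'static str> {
    checks
        .iter()
        .filter(|(_, spec)| spec.http_url.is_some() || spec.openrpc_socket.is_none())
        .map(|(name, _)| *name)
        .collect()
}
```

A test would then assert that `violations(&build_health_checks())` is empty, so that reintroducing `http_url` on any of the three actions fails the suite.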

Live verify against the rewritten action spec on an installed `hero_assistance`: all three daemons (server, ui, admin) stayed in `running` phase under hero_proc for 5.5 minutes with the same PIDs (no restart cycle), and `curl --unix-socket` against `rpc.sock`, `app.sock`, and `admin.sock` all returned 200 with the expected health JSON.
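The manual `curl --unix-socket` check could also be scripted. The sketch below is one possible way to do it with only the Rust standard library. It is not part of this PR: `http_status_over_uds` is a hypothetical helper, and the request path passed to it (for example `/health`) is an assumption, since the description does not state which path was queried over each socket.

```rust
use std::io::{Read, Write};
use std::os::unix::net::UnixStream;
use std::path::Path;
use std::time::Duration;

/// Hypothetical helper: send a plain HTTP/1.1 GET over a Unix domain socket
/// and return the status code parsed from the response's status line.
fn http_status_over_uds(socket: &Path, request_path: &str) -> std::io::Result<Option<u16>> {
    let mut stream = UnixStream::connect(socket)?;
    // Avoid hanging forever if the peer keeps the connection open.
    stream.set_read_timeout(Some(Duration::from_secs(2)))?;
    // `Connection: close` asks the server to end the response by closing.
    write!(
        stream,
        "GET {request_path} HTTP/1.1\r\nHost: localhost\r\nConnection: close\r\n\r\n"
    )?;
    let mut response = String::new();
    stream.read_to_string(&mut response)?;
    // Status line looks like "HTTP/1.1 200 OK"; the second token is the code.
    Ok(response
        .split_whitespace()
        .nth(1)
        .and_then(|code| code.parse().ok()))
}
```

Calling it once per socket and comparing the result with `Some(200)` mirrors the three manual checks described above.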

Pre-merge gate clean: `cargo fmt --check`, `cargo clippy --release --workspace --all-targets -- -D warnings`, `cargo build --workspace --release`. Workspace tests 255 pass / 2 fail / 14 ignored (+1 from the new pin test vs the 254/1/14 baseline; the 2 fails are the documented pre-existing flakes `phase24b_ui_add_access_fails_when_hero_proc_unreachable` and the transient `phase10_multi_project_merged_stream_tags_by_project_id`).

The action specs for hero_assistance_ui and hero_assistance_admin
configured hero_proc with http_url: "http://localhost/health", but both
daemons bind UDS only (app.sock and admin.sock respectively). Every
probe attempt failed because nothing on the host serves
http://localhost:80/health, and after the retry budget elapsed
hero_proc killed each daemon. With max_attempts=3 the actions ended up
permanently failed; observed restart cadence before retries exhausted
was roughly 30 to 35 seconds.

Switch both health checks to the same shape hero_assistance_server
already uses: openrpc_socket: Some(<their UDS path>). Per
hero_proc_server/src/types/config_ext.rs the OpenRpcSocket variant is
a connect-only probe, so this works regardless of whether the daemon
exposes /rpc or /openrpc.json (hero_assistance_ui's router only has
/rpc; hero_assistance_admin has both).

Add phase24c_build_service_definition_health_checks_use_uds_connect_probe
pinning the contract across all three actions so a regression to
http_url is caught at test time.

Live verify against the rewritten action spec on an installed
hero_assistance: all three daemons (server, ui, admin) stayed in
running phase under hero_proc for 5.5 minutes with the same PIDs (no
restart cycle), and curl --unix-socket against rpc.sock, app.sock, and
admin.sock all returned 200 with the expected health JSON.

Pre-merge gate clean: cargo fmt --check, cargo clippy --release
--workspace --all-targets -- -D warnings, cargo build --workspace
--release. Workspace tests 255 pass / 2 fail / 14 ignored (+1 from the
new pin test vs the 254/1/14 baseline; the 2 fails are the documented
pre-existing flakes phase24b_ui_add_access_fails_when_hero_proc_unreachable
and the transient phase10_multi_project_merged_stream_tags_by_project_id).

Closes #21

Signed-by: mik-tf <mik-tf@noreply.invalid>
mik-tf merged commit ee2be7d342 into development 2026-05-23 15:34:42 +00:00
mik-tf deleted branch development_mik 2026-05-23 15:34:42 +00:00
Reference
lhumina_code/hero_assistance!22