fix(cli): health-check collab_web over its Unix socket, not TCP #69

Merged
sameh-farouk merged 1 commit from development_sameh into development 2026-06-03 16:04:12 +00:00

Problem

hero_collab_web won't stay running under hero_collab --start / lab service hero_collab --start — it restart-loops to failed within ~14s. Only a manual launch (no hero_proc supervision) survives. The chat/canvas UI then 404s with "daemon not running."

Root cause (systematic investigation)

Not the daemon — the health check. The web action registers:

http_url: Some("http://localhost/health".into())

hero_proc's HTTP probe parses a plain http:// URL as a TCP target. With no port it resolves to bare localhost, and the connect is always refused (collab_web serves on the web.sock Unix socket, nothing on TCP). Proven:

  • http://localhost/health over TCP → HTTP 000 (refused)
  • /health over web.sock (unix) → HTTP 200

Every probe fails → after retries (3) hero_proc signals the action to stop → kills the healthy daemon → retry → failed (exit_code=-1). The grace window is start_period_ms: 5000 plus 3 × interval_ms: 3000, about 14s in total, which is why a quick check "looked fine" before the daemon died.
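
The ~14s figure follows directly from the three timing values quoted above. As a rough sketch, assuming the window is simply the start period plus one interval per retry (hero_proc's real scheduling may add probe timeouts or jitter), it can be computed like this. The `HealthTiming` struct is hypothetical and only mirrors the field names mentioned in this PR.

```rust
use std::time::Duration;

/// Hypothetical mirror of the health-check timing fields named in this PR.
struct HealthTiming {
    start_period_ms: u64,
    interval_ms: u64,
    retries: u32,
}

impl HealthTiming {
    /// Approximate time before an always-failing probe exhausts its retries:
    /// the start period, then one full interval per retry.
    fn kill_window(&self) -> Duration {
        let total_ms = self.start_period_ms + self.interval_ms * u64::from(self.retries);
        Duration::from_millis(total_ms)
    }
}
```

With `start_period_ms: 5000`, `interval_ms: 3000` and `retries: 3`, `kill_window()` yields 14 seconds, so any verification run shorter than that cannot distinguish a healthy registration from a broken one.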

The server action is unaffected because it correctly uses an openrpc_socket connect-probe on rpc.sock.

Fix

hero_proc supports HTTP-over-UDS health checks via http+unix:///abs/socket.sock[/path] (hero_proc_server::process::check_health). Point the web action's health check at the resolved web.sock using that form:

http_url: Some(format!("http+unix://{ui_sock}/health"))
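
To make the `http+unix:///abs/socket.sock[/path]` shape concrete, here is a small sketch that builds such a URL from a socket path and splits one back into socket and request path. Both helpers are hypothetical. In particular, splitting at the first `.sock` is an assumption made for illustration and is not confirmed to be how `check_health` parses the URL; the socket path used in the usage note is also invented.

```rust
use std::path::Path;

/// Builds a health URL in the `http+unix://` form for an absolute socket path.
/// Returns None for relative paths, since the form requires an absolute one.
fn health_url(sock: &Path) -> Option<String> {
    if !sock.is_absolute() {
        return None;
    }
    // An absolute path starts with '/', which supplies the third slash.
    Some(format!("http+unix://{}/health", sock.display()))
}

/// Splits an `http+unix://` URL into (socket path, request path).
/// ASSUMPTION: the socket path ends at the first ".sock"; the real parser
/// in hero_proc may use a different rule.
fn split_unix_url(url: &str) -> Option<(&str, &str)> {
    let rest = url.strip_prefix("http+unix://")?;
    if !rest.starts_with('/') {
        return None; // socket path must be absolute
    }
    let end = rest.find(".sock")? + ".sock".len();
    let (sock, path) = rest.split_at(end);
    match path {
        "" => Some((sock, "/")), // no request path given: default to root
        p if p.starts_with('/') => Some((sock, p)),
        _ => None, // e.g. "web.socket": not a clean ".sock" boundary
    }
}
```

For a made-up socket at `/run/hero_collab/web.sock`, `health_url` produces `http+unix:///run/hero_collab/web.sock/health`, and `split_unix_url` recovers `/run/hero_collab/web.sock` and `/health`. A plain `http://localhost/health` is rejected by `split_unix_url`, which is the distinction the fix relies on.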

Verification

With the fix, collab_web stays phase=running attempt=0 under hero_proc for 40s+ (well past the old ~14s kill window); same pid throughout; web.sock stable. Registered http_url is now http+unix:///…/hero_collab/web.sock/health.

Bug present identically on main — will cherry-pick after merge.

Fixes the "hero_proc kills collab_web under supervision" issue; the manual-launch workaround is no longer needed.

hero_collab_web binds a Unix socket (web.sock), but its hero_proc
health check was registered as `http://localhost/health`. hero_proc's
HTTP probe parses a plain `http://` URL as a TCP target; with no port
it resolves to `localhost` and the connect is always refused. The
probe therefore failed every interval, and after `retries` hero_proc
signalled the action to stop — killing the *healthy* daemon and
restart-looping it to `failed` (exit_code=-1) within ~14s
(start_period 5s + 3×3s).

Symptoms: collab_web would not stay running under `hero_collab --start`
/ `lab service hero_collab --start`; only a manual launch (no hero_proc
health check) survived. The chat/canvas UI then 404'd with
"daemon not running" once the web.sock disappeared.

Root cause was the URL scheme, not the daemon. hero_proc supports
HTTP-over-UDS health checks via `http+unix:///abs/socket.sock[/path]`
(see hero_proc_server::process::check_health). Point the web action's
health check at the resolved web.sock using that form — parity with
the server action, which already uses an `openrpc_socket` connect
probe on rpc.sock.

Verified: with the fix, collab_web stays `phase=running attempt=0`
under hero_proc for 40s+ (well past the old ~14s kill window); the
health_check http_url is `http+unix:///.../hero_collab/web.sock/health`.
sameh-farouk merged commit 8e3ad9af51 into development 2026-06-03 16:04:12 +00:00
sameh-farouk deleted branch development_sameh 2026-06-03 16:04:12 +00:00
Reference
lhumina_code/hero_collab!69