fix(cli): health-check collab_web over its Unix socket, not TCP #69
Problem
hero_collab_web won't stay running under hero_collab --start / lab service hero_collab --start: it restart-loops to failed within ~14s. Only a manual launch (no hero_proc supervision) survives. The chat/canvas UI then 404s with "daemon not running."
Root cause (systematic investigation)
Not the daemon but the health check. The web action registers its health check as a plain http:// URL, which hero_proc's HTTP probe parses as a TCP target. With no port it resolves to bare localhost, and the connect is always refused (collab_web serves on the web.sock Unix socket, nothing on TCP). Proven:
http://localhost/health over TCP → HTTP 000 (refused)
/health over web.sock (unix) → HTTP 200
Every probe fails → after retries (3) hero_proc signals the action to stop → kills the healthy daemon → retry → failed (exit_code=-1). The start_period_ms: 5000 + 3 × interval_ms: 3000 ≈ 14s grace is why a quick check "looked fine" before it died.
The server action is unaffected because it correctly uses an openrpc_socket connect-probe on rpc.sock.
Fix
hero_proc supports HTTP-over-UDS health checks via http+unix:///abs/socket.sock[/path] (hero_proc_server::process::check_health). Point the web action's health check at the resolved web.sock using that form.
Verification
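The TCP-versus-socket comparison can be reproduced with a small probe. The following is a minimal Rust sketch using only the standard library. `probe_unix_http` is a hypothetical helper name chosen for illustration, and this is not hero_proc's actual `check_health` implementation. It sends a `GET` request over a Unix socket and returns the HTTP status code.

```rust
use std::io::{Read, Write};
use std::os::unix::net::UnixStream;
use std::path::Path;
use std::time::Duration;

/// Hypothetical helper: send `GET <path>` over the Unix socket at `socket`
/// and return the HTTP status code from the response's status line.
fn probe_unix_http(socket: &Path, path: &str) -> std::io::Result<u16> {
    // Connecting fails immediately if nothing is listening on the socket.
    let mut stream = UnixStream::connect(socket)?;
    stream.set_read_timeout(Some(Duration::from_secs(2)))?;
    stream.set_write_timeout(Some(Duration::from_secs(2)))?;

    // `Connection: close` asks the server to close the stream after replying.
    let request =
        format!("GET {path} HTTP/1.1\r\nHost: localhost\r\nConnection: close\r\n\r\n");
    stream.write_all(request.as_bytes())?;

    // Read until the first line (the status line) is complete or the peer closes.
    let mut head = Vec::new();
    let mut chunk = [0u8; 256];
    while !head.contains(&b'\n') {
        let n = stream.read(&mut chunk)?;
        if n == 0 {
            break;
        }
        head.extend_from_slice(&chunk[..n]);
    }

    // A status line looks like "HTTP/1.1 200 OK"; the second token is the code.
    let text = String::from_utf8_lossy(&head);
    text.split_whitespace()
        .nth(1)
        .and_then(|code| code.parse::<u16>().ok())
        .ok_or_else(|| {
            std::io::Error::new(std::io::ErrorKind::InvalidData, "no HTTP status line")
        })
}
```

Called with the resolved web.sock path and `/health`, a healthy collab_web would be expected to return 200, while a path with no listener returns a connection error, which mirrors the refused TCP probe.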
With the fix, collab_web stays phase=running attempt=0 under hero_proc for 40s+ (well past the old ~14s kill window); same pid throughout; web.sock stable. Registered http_url is now http+unix:///…/hero_collab/web.sock/health.
Bug present identically on main; will cherry-pick after merge.
Fixes the "hero_proc kills collab_web under supervision" issue; the manual-launch workaround is no longer needed.
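For reference, the `http+unix:///abs/socket.sock[/path]` form described under Fix can be assembled from a resolved socket path. This is a minimal sketch, not the code in this PR: `unix_health_url` is a hypothetical helper, and the `/run/hero_collab/web.sock` path in the comment is an invented example, not the real resolved location.

```rust
use std::path::Path;

/// Hypothetical helper: build an `http+unix://` URL from an absolute socket
/// path and an optional HTTP path such as "/health".
/// Returns `None` if the socket path is relative or not valid UTF-8.
fn unix_health_url(socket: &Path, http_path: &str) -> Option<String> {
    // The form needs an absolute socket path, which yields the triple slash.
    if !socket.is_absolute() {
        return None;
    }
    let socket = socket.to_str()?;
    // Normalise the HTTP path so exactly one '/' separates it from the socket.
    let http_path = http_path.trim_start_matches('/');
    if http_path.is_empty() {
        Some(format!("http+unix://{socket}"))
    } else {
        // e.g. (invented path) /run/hero_collab/web.sock + /health
        //   -> http+unix:///run/hero_collab/web.sock/health
        Some(format!("http+unix://{socket}/{http_path}"))
    }
}
```

Rejecting relative paths up front avoids registering a URL that the probe could misread, which is the same class of failure as the original plain `http://` URL.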