embedder_server: startup race vs hero_embedderd model load + tokio panic on async-context drop of blocking client #23
Reference: lhumina_code/hero_embedder#23
Summary
`hero_embedder_server` racks up startup retries and is left in a permanent `failed` state when `hero_embedderd` takes longer than ~3 seconds to start serving its `/health` endpoint. On a fresh boot the daemon needs to mmap ~2 GB of ONNX models (`bge-small`, `bge-base`, `bge-reranker-base`) before it starts listening on `127.0.0.1:8092`; until then the server's startup probe (`is_reachable()`, 3 s connect timeout) fails. hero_proc respawns the server several times in quick succession, exhausts the retry budget, and then stops; at that point a manual `hero_proc job retry hero_embedder hero_embedder_server` is required to bring the system up.

The server fails on every attempt with:
while in parallel the daemon is healthy a few seconds later:
Reproduction
1. `hero_proc service stop hero_embedder`.
2. `service_embedder start --reset` (or `hero_embedder --start`).
3. `hero_proc job list hero_embedder`. Observed: `rpc.sock` is missing, the dashboard sees `hero_router` 404s for `/rpc`, and (post #20) every panel renders the "Backend unavailable…" alert.
4. `hero_proc job retry hero_embedder hero_embedder_server` succeeds and the server stays up afterwards (because the daemon is by now ready).

Root cause
`crates/hero_embedder_server/src/main.rs::discover_embedderd` calls `EmbedderdClient::new(url)?.is_reachable()` exactly once at startup. `EmbedderdClient`'s `connect_timeout` is 5 s and the `is_reachable` request itself uses a 3 s timeout, so the function returns `Err("daemon not reachable")` after at most ~5 s if the daemon hasn't bound the port yet.

There is no retry-with-backoff at the server level, and no hero_proc-level dependency declaration that would gate `hero_embedder_server` startup on `hero_embedderd`'s `/health` returning 200. So whoever loses the race loses for good (until a manual retry).

Suggested fix direction
In `crates/hero_embedder_server/src/main.rs`, give `discover_embedderd` an explicit retry budget. Sketch:

The 30 s budget covers a cold model load on this hardware (~5–8 s) with significant headroom. After the budget expires, we still emit the existing actionable error message so operators can diagnose a genuine misconfiguration.
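The retry budget described above can be sketched as a generic helper. This is illustrative, not the repo's actual code: `discover_with_retry` is a hypothetical name, the closure stands in for `EmbedderdClient::is_reachable()`, and the backoff schedule is an assumption.

```rust
use std::thread;
use std::time::{Duration, Instant};

/// Re-poll a reachability probe until it succeeds or the budget expires.
/// `probe` stands in for EmbedderdClient::is_reachable() (hypothetical).
fn discover_with_retry<F>(mut probe: F, budget: Duration) -> Result<(), String>
where
    F: FnMut() -> bool,
{
    let start = Instant::now();
    let mut delay = Duration::from_millis(500); // initial backoff step
    loop {
        if probe() {
            // Daemon already up: returns on the first probe, no added latency.
            return Ok(());
        }
        if start.elapsed() >= budget {
            // Preserve the existing actionable error for operators.
            return Err("daemon not reachable".to_string());
        }
        thread::sleep(delay);
        // Capped exponential backoff: 0.5 s, 1 s, 2 s, 4 s, 5 s, 5 s, ...
        delay = (delay * 2).min(Duration::from_secs(5));
    }
}
```

With `budget = Duration::from_secs(30)` this covers the cold model load with headroom while keeping the fast path (daemon already listening) a single probe.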
Equivalent alternative: declare a hero_proc dependency in the action registration so `hero_embedder_server` only starts after `hero_embedderd`'s `/health` returns 200. That fix lives in `hero_skills` and is more invasive across the stack; the in-server retry above is self-contained and works regardless of the surrounding orchestrator.

Acceptance criteria
- After `service_embedder start --reset`, all three jobs (`hero_embedderd`, `hero_embedder_server`, `hero_embedder_ui`) reach `running` without a manual `job retry`.
- `~/hero/var/sockets/hero_embedder/rpc.sock` exists within ~10 s of the start command.
- If `hero_embedderd` is genuinely missing or misconfigured, the server still emits the existing error pointing at the URL after the 30 s budget; no silent infinite hang.
- When the daemon is already reachable, `discover_embedderd` returns immediately on the first probe, with no extra latency.

Notes
- The tokio panic `Cannot drop a runtime in a context where blocking is not allowed` fires when `discover_embedderd` is called directly from the `#[tokio::main]` async runtime (because `reqwest::blocking::Client::builder()` spawns and drops a runtime internally). That's a separate defect tracked in the same PR via `tokio::task::spawn_blocking`. Neither fix obsoletes the other: the panic fix lets the function run at all; this race fix lets it succeed under realistic timing.
- Model set on this host: `bge-small` (FP32 + INT8), `bge-base` (FP32 + INT8), `bge-reranker-base`; roughly 2 GB total mmapped. A faster / smaller model set would reduce the race window but would not eliminate it; the retry is the correct robustness fix.