[nu-demo] hero_embedder_server starts before hero_embedderd finishes loading models — needs dependency ordering #168
Symptom
On service restart, `hero_embedder_server` fails on its first attempts. `hero_embedderd` takes ~15s to load the 4 ONNX models (bge-small, bge-base, bge-reranker-base, etc.), and `hero_embedder_server` checks `HERO_EMBEDDERD_URL` connectivity at startup, refusing to run if the daemon isn't ready.

The action's retry_policy says `max_attempts=5, backoff=true, delay_ms=2000`. That should eventually succeed once the daemon is up (~15s after start), but in practice all 5 retries fall within the 15s window and fail, after which hero_proc marks the job `failed` and stops retrying.

Manually retrying via `hero_proc job retry hero_embedder hero_embedder_server` at any point AFTER the daemon is healthy succeeds.

Root cause

hero_proc service definitions can't express "start hero_embedder_server AFTER hero_embedderd is healthy." The `service add --after <service>` option exists for service-level ordering, but not for action-level ordering WITHIN a service. When a service has multiple actions, they all start concurrently (or in a semi-random order), and there is no way to gate action B on action A's health check passing.
Fixes (ordered by effort)
1. Bump the retry policy
Cheapest: in `service_embedder.nu`, set `hero_embedder_server`'s `retry_policy` to `max_attempts=20, stability_period_ms=60000` so it keeps retrying past the daemon's 15s warmup. Works, but wastes CPU on failed-connect attempts.

2. Add an action-level dependency mechanism
Extend the ActionSpec schema with an optional `depends_on_action: Option<Vec<String>>` field; the hero_proc supervisor then waits until each listed action's health check passes before starting this one. A minor hero_proc code change, and the cleanest solution.

3. Run the server binary as a child process of hero_embedderd instead of a sibling
Refactor hero_embedder_server to be a thread within hero_embedderd (or use a Unix socket that only gets bound after the daemon declares itself ready). Not backward compatible, but architecturally cleaner; sibling issue #145 already suggests converting to an async RPC model.
4. Supervisor-level startup probe
hero_proc could have a "wait-for-port / wait-for-socket" pre-start hook per action, e.g. `hero_embedder_server.pre_start = "wait_tcp 127.0.0.1 8092 30s"`. A small addition that covers dozens of similar ordering issues across the Hero stack.

Demo workaround (applied on herodemo 2026-04-24)
After the service starts and `hero_embedderd` comes online, manually retry the failed server job with `hero_proc job retry hero_embedder hero_embedder_server`. This always works because the daemon is up by then.
Related
Signed-off-by: mik-tf
Fixed in current `service_embedder.nu` via fix option #1 from the issue body ("Bump the retry policy"), implemented via `start_timeout_ms` rather than `max_attempts`: different mechanism, same outcome.

Verification (`service_embedder.nu`):

- `update retry_policy {|t| $t.retry_policy | merge {start_timeout_ms: 180000}}` for `hero_embedderd` (180s window).
- `update retry_policy {|t| $t.retry_policy | merge {start_timeout_ms: 120000}}` for `hero_embedder_server` (120s window).

With the daemon warming up the 4 ONNX models in ~15s, a 120-second start_timeout window gives the server's existing retry budget (`max_attempts=5, backoff=true, delay_ms=2000`) ample room to span the warmup and succeed on a later attempt without hero_proc giving up. The original symptom (server marked `failed` because all 5 retries fell within the 15s window) is no longer reachable.

Functional confirmation: the herodemo bring-up sessions on 2026-04-25 / 2026-04-26 saw `hero_embedder_server` reach `running` state on every restart, without the manual `hero_proc job retry hero_embedder hero_embedder_server` workaround the issue body describes.

Architectural follow-up (NOT this fix): the issue's option #2, adding a true `depends_on_action: Option<Vec<String>>` to ActionSpec so the supervisor gates B on A's health check, remains the cleanest long-term answer, especially as more services adopt the daemon-plus-server pattern (already happening in hero_office/onlyoffice). That work would land in hero_proc / hero_rpc, not in this module, and will be tracked separately when the time comes. Meta-tracker: home#193.
Signed-off-by: mik-tf