service.status RPC latency under concurrent load limits effective parallelism to ~9x #121
Labels
No labels
prio_critical
prio_low
type_bug
type_contact
type_issue
type_lead
type_question
type_story
type_task
No milestone
No project
No assignees
2 participants
Notifications
Due date
No due date set.
Dependencies
No dependencies set.
Reference
lhumina_code/hero_proc#121
Loading…
Add table
Add a link
Reference in a new issue
No description provided.
Delete branch "%!s()"
Deleting a branch is permanent. Although the deleted branch may continue to exist for a short time before it actually gets removed, it CANNOT be undone in most cases. Continue?
On a local hero_proc 0.7.0 with 101 supervised services, a single service.status call over UDS takes ~225ms with curl overhead (roughly 50-100ms server-side). Running 20 sequential service.status calls totals 37 seconds (1.8s per call), and 100 concurrent service.status calls complete in 13.7s, implying the daemon caps effective parallelism at roughly 9x regardless of how many futures the client fans out. service.stats shows the same scaling. hero_cockpit#6 (lhumina_code/hero_cockpit#6) surfaced this because hero_cockpit_server.handle_list_services issues N service.status plus N service.stats per service.list_full row, and even with futures::future::join_all parallelization the round-trip exceeds the hero_router 10s upstream timeout for any reasonably sized cockpit stack. Two viable directions: (1) add a bulk service.status_all RPC that returns state plus restarts plus pid plus memory_bytes plus cpu_percent for every supervised service in one call, or (2) profile and reduce the per-call latency of service.status under concurrent load (the daemon appears to serialize on something inside the supervisor handler). Either path resolves the cockpit /services page rendering and unblocks similar fan-out patterns in other admin UIs.
Closed by
e833dc9— addsservice.status_allbulk RPC.Took option (a) from the issue (bulk RPC) over option (b) (profile + reduce per-call latency) because option (a) has structural amortization that option (b) cannot match without rewriting the SQL layer:
monitoring::processes_stats(&pids)covers every running PID in one sysinfo refresh (vs one mutex per service in the fan-out).get_for_service+get_job_ids+ N xjobs.get. The existingservice.statushandler walks the same chain 3 times per call (viaservice_running_jobs,service_last_terminal_state,count_restarts); the new bulk handler walks it once and derives state, pid, and restarts inline from the materialized job list.Local smoke at 105 supervised services, steady state:
The bulk row for any given service matches
service.statusbyte-for-byte on the overlapping fields (pid, current_run_id, restarts, state, health_reason); the extra fields (memory_bytes, cpu_percent, context) fold inservice.stats.Cockpit adoption landed alongside in
lhumina_code/hero_cockpit@722ace2; that closes hero_cockpit#6 too.For audit trail: the underlying serialization point this issue identified ("the daemon appears to serialize on something inside the supervisor handler") was the
Arc<std::sync::Mutex<rusqlite::Connection>>indb/factory.rs, which forced all 57 sub-API call sites to lock-then-blocking-SQLite from inside async tasks. That root cause was removed in #124 (a58bdf0ondevelopment) — the daemon now usesdeadpool_sqlite::Poolwithinteract(), so multiple SQLite operations can be in flight concurrently and async workers are no longer parked on a sync mutex.The
e833dc9 feat(rpc): add service.status_all bulk RPCworkaround that closed this issue stays in place — it's still a valid latency optimization for any client that needs N service statuses at once, even after the underlying cap is removed.Stage 6 verification on a release daemon (50 parallel
service.status_allcalls): completed in ~120–300ms total (vs the ~9× serial cap = ~13.7s for 100 calls measured in the original repro). The cap is gone empirically too, not just architecturally.Informational only — leaving this closed.
sameh-farouk referenced this issue2026-05-25 13:38:06 +00:00