service.status RPC latency under concurrent load limits effective parallelism to ~9x #121

Closed
opened 2026-05-23 05:09:37 +00:00 by mik-tf · 2 comments
Owner

On a local hero_proc 0.7.0 with 101 supervised services, a single service.status call over UDS takes ~225ms with curl overhead (roughly 50-100ms server-side). Running 20 sequential service.status calls totals 37 seconds (1.8s per call), and 100 concurrent service.status calls complete in 13.7s, implying the daemon caps effective parallelism at roughly 9x regardless of how many futures the client fans out. service.stats shows the same scaling. hero_cockpit#6 (lhumina_code/hero_cockpit#6) surfaced this because hero_cockpit_server.handle_list_services issues N service.status plus N service.stats per service.list_full row, and even with futures::future::join_all parallelization the round-trip exceeds the hero_router 10s upstream timeout for any reasonably sized cockpit stack. Two viable directions: (1) add a bulk service.status_all RPC that returns state plus restarts plus pid plus memory_bytes plus cpu_percent for every supervised service in one call, or (2) profile and reduce the per-call latency of service.status under concurrent load (the daemon appears to serialize on something inside the supervisor handler). Either path resolves the cockpit /services page rendering and unblocks similar fan-out patterns in other admin UIs.

On a local hero_proc 0.7.0 with 101 supervised services, a single service.status call over UDS takes ~225ms with curl overhead (roughly 50-100ms server-side). Running 20 sequential service.status calls totals 37 seconds (1.8s per call), and 100 concurrent service.status calls complete in 13.7s, implying the daemon caps effective parallelism at roughly 9x regardless of how many futures the client fans out. service.stats shows the same scaling. hero_cockpit#6 (https://forge.ourworld.tf/lhumina_code/hero_cockpit/issues/6) surfaced this because hero_cockpit_server.handle_list_services issues N service.status plus N service.stats per service.list_full row, and even with futures::future::join_all parallelization the round-trip exceeds the hero_router 10s upstream timeout for any reasonably sized cockpit stack. Two viable directions: (1) add a bulk service.status_all RPC that returns state plus restarts plus pid plus memory_bytes plus cpu_percent for every supervised service in one call, or (2) profile and reduce the per-call latency of service.status under concurrent load (the daemon appears to serialize on something inside the supervisor handler). Either path resolves the cockpit /services page rendering and unblocks similar fan-out patterns in other admin UIs.
Author
Owner

Closed by e833dc9 — adds service.status_all bulk RPC.

Took option (a) from the issue (bulk RPC) over option (b) (profile + reduce per-call latency) because option (a) has structural amortization that option (b) cannot match without rewriting the SQL layer:

  • Single SYSMON mutex grab via monitoring::processes_stats(&pids) covers every running PID in one sysinfo refresh (vs one mutex per service in the fan-out).
  • Single SQL chain per service: get_for_service + get_job_ids + N x jobs.get. The existing service.status handler walks the same chain 3 times per call (via service_running_jobs, service_last_terminal_state, count_restarts); the new bulk handler walks it once and derives state, pid, and restarts inline from the materialized job list.

Local smoke at 105 supervised services, steady state:

pattern wall-clock
OLD: serial N x service.status + N x service.stats (daemon-side, direct UDS, single client) 347 ms
NEW: bulk service.status_all (daemon-side, direct UDS) 10 ms
NEW: bulk service.status_all (cockpit -> hero_proc, through hero_router) 14 ms

The bulk row for any given service matches service.status byte-for-byte on the overlapping fields (pid, current_run_id, restarts, state, health_reason); the extra fields (memory_bytes, cpu_percent, context) fold in service.stats.

Cockpit adoption landed alongside in lhumina_code/hero_cockpit@722ace2; that closes hero_cockpit#6 too.

Closed by https://forge.ourworld.tf/lhumina_code/hero_proc/commit/e833dc9 — adds `service.status_all` bulk RPC. Took option (a) from the issue (bulk RPC) over option (b) (profile + reduce per-call latency) because option (a) has structural amortization that option (b) cannot match without rewriting the SQL layer: - Single SYSMON mutex grab via `monitoring::processes_stats(&pids)` covers every running PID in one sysinfo refresh (vs one mutex per service in the fan-out). - Single SQL chain per service: `get_for_service` + `get_job_ids` + N x `jobs.get`. The existing `service.status` handler walks the same chain 3 times per call (via `service_running_jobs`, `service_last_terminal_state`, `count_restarts`); the new bulk handler walks it once and derives state, pid, and restarts inline from the materialized job list. Local smoke at 105 supervised services, steady state: | pattern | wall-clock | |---|---| | OLD: serial N x service.status + N x service.stats (daemon-side, direct UDS, single client) | 347 ms | | NEW: bulk service.status_all (daemon-side, direct UDS) | 10 ms | | NEW: bulk service.status_all (cockpit -> hero_proc, through hero_router) | 14 ms | The bulk row for any given service matches `service.status` byte-for-byte on the overlapping fields (pid, current_run_id, restarts, state, health_reason); the extra fields (memory_bytes, cpu_percent, context) fold in `service.stats`. Cockpit adoption landed alongside in https://forge.ourworld.tf/lhumina_code/hero_cockpit/commit/722ace2; that closes hero_cockpit#6 too.
Member

For audit trail: the underlying serialization point this issue identified ("the daemon appears to serialize on something inside the supervisor handler") was the Arc<std::sync::Mutex<rusqlite::Connection>> in db/factory.rs, which forced all 57 sub-API call sites to lock-then-blocking-SQLite from inside async tasks. That root cause was removed in #124 (a58bdf0 on development) — the daemon now uses deadpool_sqlite::Pool with interact(), so multiple SQLite operations can be in flight concurrently and async workers are no longer parked on a sync mutex.

The e833dc9 feat(rpc): add service.status_all bulk RPC workaround that closed this issue stays in place — it's still a valid latency optimization for any client that needs N service statuses at once, even after the underlying cap is removed.

Stage 6 verification on a release daemon (50 parallel service.status_all calls): completed in ~120–300ms total (vs the ~9× serial cap = ~13.7s for 100 calls measured in the original repro). The cap is gone empirically too, not just architecturally.

Informational only — leaving this closed.

For audit trail: the underlying serialization point this issue identified ("the daemon appears to serialize on something inside the supervisor handler") was the `Arc<std::sync::Mutex<rusqlite::Connection>>` in `db/factory.rs`, which forced all 57 sub-API call sites to lock-then-blocking-SQLite from inside async tasks. That root cause was removed in #124 (`a58bdf0` on `development`) — the daemon now uses `deadpool_sqlite::Pool` with `interact()`, so multiple SQLite operations can be in flight concurrently and async workers are no longer parked on a sync mutex. The `e833dc9 feat(rpc): add service.status_all bulk RPC` workaround that closed this issue stays in place — it's still a valid latency optimization for any client that needs N service statuses at once, even after the underlying cap is removed. Stage 6 verification on a release daemon (50 parallel `service.status_all` calls): completed in ~120–300ms total (vs the ~9× serial cap = ~13.7s for 100 calls measured in the original repro). The cap is gone empirically too, not just architecturally. Informational only — leaving this closed.
Sign in to join this conversation.
No milestone
No project
No assignees
2 participants
Notifications
Due date
The due date is invalid or out of range. Please use the format "yyyy-mm-dd".

No due date set.

Dependencies

No dependencies set.

Reference
lhumina_code/hero_proc#121
No description provided.