cockpit.list_services times out: serialized N+1 RPC fan-out to hero_proc #6

Closed
opened 2026-05-23 04:11:59 +00:00 by mik-tf · 3 comments
Owner

Found while running a verification pass against a local cockpit install.

The `cockpit.list_services` handler at `crates/hero_cockpit_server/src/main.rs:460` calls `service.list_full`, then for each returned service awaits `service_status` followed by `service_stats`, one call at a time. With around 90 services registered in hero_proc on a realistic VM, that is roughly 180 sequential RPC round-trips per page load, and the request reliably exceeds the router's default 10-second upstream timeout.

The visible symptom: opening `/services` (the primary cockpit page after login) returns the plaintext string `upstream timeout` instead of the services table, which makes every cockpit lifecycle button unreachable.

Likely fix: fan the per-service `service_status` and `service_stats` calls out concurrently with `futures::future::join_all`, or extend `service.list_full` to return state and stats inline and drop the secondary calls entirely. Reproduced locally with 91 services discovered by hero_router. Happy to open a PR once a preferred shape is confirmed.
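To make the two shapes concrete, here is a minimal std-only sketch contrasting the serialized loop with a concurrent fan-out. It is not the cockpit code: the `service_status` and `service_stats` functions below are hypothetical stand-ins that sleep for an assumed 20 ms round-trip, and scoped threads are swapped in for `futures::future::join_all` so the example runs without any crates.

```rust
use std::thread;
use std::time::{Duration, Instant};

// Assumed round-trip time for one simulated RPC (not a measured value).
const SIMULATED_RTT: Duration = Duration::from_millis(20);

// Hypothetical stand-in for the per-service status RPC.
fn service_status(name: &str) -> String {
    thread::sleep(SIMULATED_RTT);
    format!("{name}:running")
}

// Hypothetical stand-in for the per-service stats RPC.
fn service_stats(name: &str) -> u64 {
    thread::sleep(SIMULATED_RTT);
    name.len() as u64
}

/// N+1 shape: two round-trips per service, one after another.
fn list_serialized(names: &[String]) -> Vec<(String, u64)> {
    names
        .iter()
        .map(|n| (service_status(n), service_stats(n)))
        .collect()
}

/// Concurrent shape: each service's two calls run back to back on their own
/// scoped thread, so different services overlap with each other.
fn list_concurrent(names: &[String]) -> Vec<(String, u64)> {
    thread::scope(|scope| {
        let handles: Vec<_> = names
            .iter()
            .map(|n| scope.spawn(move || (service_status(n), service_stats(n))))
            .collect();
        // Joining in spawn order keeps the rows in the same order as `names`.
        handles
            .into_iter()
            .map(|h| h.join().expect("worker thread panicked"))
            .collect()
    })
}

fn main() {
    let names: Vec<String> = (0..8).map(|i| format!("svc{i}")).collect();

    let started = Instant::now();
    let serial_rows = list_serialized(&names);
    let serial = started.elapsed();

    let started = Instant::now();
    let concurrent_rows = list_concurrent(&names);
    let concurrent = started.elapsed();

    assert_eq!(serial_rows, concurrent_rows);
    println!("serialized: {serial:?}, concurrent: {concurrent:?}");
}
```

The serialized version's wall-clock grows with the number of services, while the concurrent one stays close to the cost of a single service's pair of calls, as long as the backend can actually serve the calls in parallel.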
Author
Owner

Partial fix landed in [`c0a2a10`](https://forge.ourworld.tf/lhumina_code/hero_cockpit/commit/c0a2a10) on `development`. The per-row `service.status` and `service.stats` calls now fire concurrently via `tokio::join!` plus `futures::future::join_all` instead of the serialized loop, collapsing the cockpit-side fan-out from 200s+ to about 22s on a 101-service local stack. The remaining gap above the hero_router 10s upstream timeout is hero_proc-side: a single `service.status` call averages 1.8s under concurrent load and the daemon caps effective parallelism at roughly 9x. Filed as a separate follow-up so this one can close on the N+1 fix that was its title.
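The residual latency can be sanity-checked with a back-of-envelope model. The sketch below assumes (this is not stated in the thread) that each service counts as one unit of work because its status and stats calls overlap under `tokio::join!`, and that the daemon serves a fixed number of units at a time. The function name is hypothetical.

```rust
/// Rough wall-clock estimate: total work divided by how many calls the
/// daemon effectively serves at once. A simplification, not a measurement.
fn estimated_wall_clock_secs(units: u32, per_call_secs: f64, parallelism: f64) -> f64 {
    units as f64 * per_call_secs / parallelism
}

fn main() {
    // Figures quoted in the comment above.
    let services = 101;
    let per_call_secs = 1.8;
    let effective_parallelism = 9.0;
    let router_timeout_secs = 10.0;

    let estimate = estimated_wall_clock_secs(services, per_call_secs, effective_parallelism);
    println!("estimated wall-clock: {estimate:.1} s (timeout {router_timeout_secs:.1} s)");

    // Parallelism that would be needed to fit under the timeout at this per-call cost.
    let needed = services as f64 * per_call_secs / router_timeout_secs;
    println!("parallelism needed to fit: {needed:.1}x");
}
```

Under these assumptions the estimate is about 20 s, in the same range as the roughly 22 s observed, and roughly double the effective parallelism would be needed to fit under the 10 s timeout without lowering the per-call cost.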
Author
Owner

Follow-up filed at [hero_proc#121](https://forge.ourworld.tf/lhumina_code/hero_proc/issues/121) for the residual daemon-side latency.
Author
Owner

Fully closed by https://forge.ourworld.tf/lhumina_code/hero_cockpit/commit/722ace2: `handle_list_services` now uses the new `service.status_all` bulk RPC from https://forge.ourworld.tf/lhumina_code/hero_proc/commit/e833dc9.

The s147 partial fix (https://forge.ourworld.tf/lhumina_code/hero_cockpit/commit/c0a2a10) parallelized the cockpit-side fan-out with `tokio::join!` + `join_all`, but the daemon-side per-call cost still dominated: 1.8 s/call at ~9x effective parallelism comes to ~22 s on 101 services, still over the hero_router 10 s upstream timeout. The new bulk RPC eliminates the per-call sysinfo mutex and the 3x-redundant SQL chain per call.

Local smoke through hero_router on 105 services:

| call | wall-clock |
|---|---|
| 1 | 14 ms |
| 2 | 15 ms |
| 3 | 14 ms |

No 504. The page renders the full table with `state`, `pid`, `mem_rss_bytes`, `cpu_percent`, `restarts`, `current_run_id`, and `enabled` for every supervised service.

Also drops `futures` from the workspace deps; it was only used by the now-removed `join_all`.
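For readers unfamiliar with the bulk shape, here is a hedged sketch of how a handler might join the service list with a single bulk reply. It does not reproduce `handle_list_services` or the real `service.status_all` wire format: the field names come from the list above, but their types, the `Row` struct, the `build_rows` helper, and the keyed-by-name reply are all assumptions for illustration.

```rust
use std::collections::HashMap;

/// Per-service fields named in the comment above; the types are guesses.
#[derive(Debug, Clone, PartialEq)]
struct ServiceStatus {
    state: String,
    pid: Option<u32>,
    mem_rss_bytes: u64,
    cpu_percent: f32,
    restarts: u32,
    current_run_id: Option<u64>,
    enabled: bool,
}

/// Hypothetical table row: a service name plus its status, if the bulk reply had one.
#[derive(Debug, PartialEq)]
struct Row {
    name: String,
    status: Option<ServiceStatus>,
}

/// Joins the service list with one bulk reply keyed by service name, so the
/// handler needs a single status round-trip instead of two per service.
fn build_rows(names: &[String], mut bulk: HashMap<String, ServiceStatus>) -> Vec<Row> {
    names
        .iter()
        .map(|name| Row {
            name: name.clone(),
            // Services missing from the bulk reply get `None` rather than an error.
            status: bulk.remove(name),
        })
        .collect()
}

fn main() {
    let names = vec!["alpha".to_string(), "beta".to_string()];

    // Stand-in for the decoded bulk reply; the values are made up.
    let mut bulk = HashMap::new();
    bulk.insert(
        "alpha".to_string(),
        ServiceStatus {
            state: "running".to_string(),
            pid: Some(4242),
            mem_rss_bytes: 12_582_912,
            cpu_percent: 0.7,
            restarts: 0,
            current_run_id: Some(1),
            enabled: true,
        },
    );

    for row in build_rows(&names, bulk) {
        println!("{row:?}");
    }
}
```

The point of this shape is that the number of round-trips no longer depends on the number of services, which is consistent with the flat 14 to 15 ms timings in the table.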
Reference
lhumina_code/hero_cockpit#6