hero_admin_lib: admin/web binaries hang pre-main against latest hero_rpc #12

Open
opened 2026-05-28 15:46:17 +00:00 by timur · 0 comments
Owner

Symptom

Every scaffolded _admin or _server binary that depends on hero_admin_lib (currently pinned to 89f041897633d6a6ca4af2c804147ae2c829e423 in downstream Cargo.lock files) hangs at startup with zero stdout/stderr when invoked with any argument set, including --info. The binary process exists in ps, sits at 0% CPU forever in macOS UE state, and is unkillable except via reboot. Same behaviour reproducible on two different Hero service repos (hero_service + recipe_server reference).

In contrast, a peer _web binary in the same workspace — same service_base!() macro, same validate_service_toml + handle_info_flag startup boilerplate, just without the hero_admin_lib dep — works perfectly (--info prints the embedded service.toml and exits 0).

What's been ruled out

  • Not a build / link issue: binary signs cleanly via macOS ad-hoc signing, otool -L shows the same dylib set as the working _web binary.
  • Not a service_base!() issue: the macro just declares const SERVICE_TOML + const BUILD_NR (no runtime side effects).
  • Not a tokio runtime init issue: reproducible with #[tokio::main(flavor = "current_thread")]. Also reproducible with a sync fn main() { eprintln!("hi"); std::process::exit(0); } in the same crate, which produces zero output. So the hang is in pre-main static initialization, before main() runs.
  • Not a per-platform tokio thread limit: other Rust binaries in the workspace (hero_service_web, a hand-rolled hello-world) start fine concurrently.
  • Not a zombie-induced kernel deadlock alone: the hang reproduces even with the system freshly rebooted (and zombies accumulate as a consequence of the hang, not a cause).

What's left as a hypothesis

Something in hero_admin_lib's dep tree runs a pre-main static initializer (a ctor! macro or distributed-slice registration in a transitive C-FFI dep) that blocks on a system resource. Likely candidates worth probing first:

  • hero_admin_lib::middleware::base_path_middleware and hero_admin_lib::routes — anything that touches rust-embed for the embedded asset blob (shared_static_handler).
  • hero_admin_lib::socket::admin_socket_path — the helper that resolves $HERO_SOCKET_DIR via herolib_core::base::resolve_socket_dir. Could be hitting a deadlock with hero_proc if hero_proc was registered against a different socket dir.
  • Any transitive axum-server or tower-http middleware that mounts middleware at static-init time.
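
For readers less familiar with how code can run before main(), the sketch below shows the mechanism using only the standard library. It is an illustration under assumptions, not code from hero_admin_lib or any of its dependencies: the names INIT_RAN, pre_main_init and PRE_MAIN are hypothetical, and the section names are the ones conventionally used for initializer function pointers on macOS (Mach-O) and Linux (ELF). The unsafe(...) attribute form assumes Rust 1.82 or newer.

```rust
use std::sync::atomic::{AtomicBool, Ordering};

/// Set by the initializer below so that `main` can observe it ran first.
static INIT_RAN: AtomicBool = AtomicBool::new(false);

/// Hypothetical pre-main initializer. If this function blocked (on a lock,
/// a socket, or an FFI call), the process would stall before `main`.
extern "C" fn pre_main_init() {
    INIT_RAN.store(true, Ordering::SeqCst);
}

// Place a pointer to the function in the section that the loader/runtime
// walks before calling `main`. On other targets no section is set and the
// initializer simply does not run.
#[used]
#[cfg_attr(target_os = "macos", unsafe(link_section = "__DATA,__mod_init_func"))]
#[cfg_attr(target_os = "linux", unsafe(link_section = ".init_array"))]
static PRE_MAIN: extern "C" fn() = pre_main_init;

/// Reports whether the initializer has already run.
fn init_ran() -> bool {
    INIT_RAN.load(Ordering::SeqCst)
}

fn main() {
    eprintln!("initializer ran before main: {}", init_ran());
}
```

If pre_main_init blocked instead of returning, the process would never reach main() and would print nothing, which is consistent with the symptom described above. Crates that register constructors typically generate an equivalent static behind a macro or attribute.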

Reproduction

# Clone any service that depends on hero_admin_lib at 89f04189:
git clone https://forge.ourworld.tf/lhumina_code/hero_service
cd hero_service
cargo build -p hero_service_admin
export PATH_ROOT=$HOME/hero HERO_SOCKET_DIR=$HOME/hero/var/sockets/default
./target/debug/hero_service_admin --info
# (hangs forever, zero output, process becomes UE-state zombie)

Meanwhile:

./target/debug/hero_service_web --info
# (prints the embedded service.toml and exits 0)
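
Because the hung process cannot be killed, it may help to run the reproduction through a small harness that gives up after a timeout instead of blocking the calling shell or tool. The following is a minimal sketch using only the Rust standard library. The helper name run_with_timeout, the 5 second limit and the 50 ms poll interval are assumptions chosen for illustration; the binary path is the one from the steps above.

```rust
use std::io;
use std::process::{Command, ExitStatus, Stdio};
use std::thread;
use std::time::{Duration, Instant};

/// Runs `program` with `args` and waits at most `timeout` for it to exit.
/// Returns `Ok(Some(status))` if it exited in time, `Ok(None)` otherwise.
fn run_with_timeout(
    program: &str,
    args: &[&str],
    timeout: Duration,
) -> io::Result<Option<ExitStatus>> {
    let mut child = Command::new(program)
        .args(args)
        .stdin(Stdio::null())
        .spawn()?;
    let deadline = Instant::now() + timeout;
    loop {
        // try_wait never blocks, so the harness itself cannot get stuck here.
        if let Some(status) = child.try_wait()? {
            return Ok(Some(status));
        }
        if Instant::now() >= deadline {
            // Best effort only: a process in an uninterruptible state may
            // ignore the kill, so we deliberately avoid a blocking wait().
            let _ = child.kill();
            let _ = child.try_wait();
            return Ok(None);
        }
        thread::sleep(Duration::from_millis(50));
    }
}

fn main() {
    // Path taken from the reproduction steps; adjust to the binary under test.
    let bin = "./target/debug/hero_service_admin";
    match run_with_timeout(bin, &["--info"], Duration::from_secs(5)) {
        Ok(Some(status)) => println!("{bin} exited: {status}"),
        Ok(None) => println!("{bin} did not exit within 5s (treating as hung)"),
        Err(err) => println!("could not start {bin}: {err}"),
    }
}
```

The harness reports a hang rather than waiting on it, so a caller that needs the --info output first could fail fast instead of stalling. It does not make the stuck process go away.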

Newer hero_admin_lib (f3c5b07a) — different break

Running cargo update -p hero_admin_lib to move to the current tip (f3c5b07a) surfaces a different problem: ~25 compile errors in scaffolded admin/main.rs files against the newer hero_admin_lib API (route signature changes, removed helpers). The two issues are independent: the hang is in 89f04189, the API drift is in f3c5b07a. Both need triage. The goal is a unified fix that lets the scaffolded admin/main.rs (per the hero_rpc generator template) compile and start cleanly against the latest hero_admin_lib.

Why this matters

With _admin hanging:

  • lab service <name> --start can't lifecycle-manage the service (lab calls <bin> --info first to read service.toml metadata; that hangs forever).
  • The JS-client-driven UI work (hero_router serves /<svc>/js/client.js, admin pages embed <script type="module"> to call it browser-side) can't be tested in a real browser session.

Downstream demos that depend on admin starting (the hero_service template, every UI dashboard) are blocked.

Asks for the triage agent

  1. Pull hero_website_framework origin/development, build a tiny admin binary that just has hero_admin_lib as a dep + a sync main that prints "hi". Confirm it hangs.
  2. Bisect the deps in hero_admin_lib/Cargo.toml to find the offending transitive crate (likely something with a ctor or FFI initializer).
  3. Either fix the pre-main init or stop pulling the problematic transitive at the hero_admin_lib level.
  4. Update hero_admin_lib's API to match what the post-hero_rpc#142 scaffolded admin/main.rs expects (or land coordinated updates in hero_rpc's ui_emit.rs + scaffold.rs to track the new API).
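
The bisection in step 2 can be sketched as a binary search over the dependency list. This is only an outline under two assumptions: exactly one dependency causes the hang, and a subset hangs if and only if it contains that dependency. The function name bisect_culprit and the dep_a..dep_e names are hypothetical placeholders, and the closure stands in for "build a tiny binary with only these deps and check whether --info returns".

```rust
/// Finds the single entry of `deps` for which `hangs` reports true,
/// assuming exactly one culprit and that `hangs(subset)` is true
/// if and only if `subset` contains it.
fn bisect_culprit<'a>(
    deps: &[&'a str],
    mut hangs: impl FnMut(&[&'a str]) -> bool,
) -> Option<&'a str> {
    // Nothing to find if the full set does not reproduce the hang.
    if deps.is_empty() || !hangs(deps) {
        return None;
    }
    let mut candidates = deps;
    while candidates.len() > 1 {
        // Test the left half; if it is clean, the culprit is on the right.
        let (left, right) = candidates.split_at(candidates.len() / 2);
        candidates = if hangs(left) { left } else { right };
    }
    candidates.first().copied()
}

fn main() {
    // Placeholder names, not the real contents of hero_admin_lib/Cargo.toml.
    let deps = ["dep_a", "dep_b", "dep_c", "dep_d", "dep_e"];
    // Stand-in predicate: pretend that dep_d is the one that hangs.
    let culprit = bisect_culprit(&deps, |subset| subset.contains(&"dep_d"));
    println!("{culprit:?}"); // prints Some("dep_d")
}
```

With n direct dependencies this needs roughly log2(n) builds after the initial confirmation. If the hang turns out to require a combination of crates, the single-culprit assumption breaks and a wider search is needed.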

Related: hero_rpc#142 SDK pipeline reversal (https://forge.ourworld.tf/lhumina_code/hero_rpc/commit/3f53db8) and the hero_rpc2 + hero_rpc_openrpc fold issue tracked in lhumina_code/hero_rpc (see latest issue list).

Reference
lhumina_code/hero_website_framework#12