[High][Performance/Architecture] Admin APIs serialize behind a single Node mutex and hold it across long waits #31

Open
opened 2026-03-19 22:43:22 +00:00 by thabeta · 0 comments
Owner

Summary

The HTTP and JSON-RPC layers store the node as Arc<Mutex<Node<_>>>, and several handlers hold that mutex while awaiting long-running work.

This is especially problematic for:

  • message long-poll receive
  • proxy connect flows
  • any future API call that waits on network or timeouts

Why this matters

A coarse node-wide mutex turns independent admin operations into a serialized queue. In practice, one slow or waiting request can block unrelated state inspection or admin actions.

This is not just a micro-optimization issue. It changes the runtime behavior of the control surface under load.

Evidence

Shared state type:

  • mycelium-api/src/lib.rs:48-50

HTTP long-poll message receive holds the node lock through timeout/wait:

  • mycelium-api/src/message.rs:142-149

HTTP proxy connect holds the node lock across async connect:

  • mycelium-api/src/lib.rs:359-365

JSON-RPC long-poll message receive similarly holds the node lock:

  • mycelium-api/src/rpc.rs:396-403

JSON-RPC proxy connect does the same:

  • mycelium-api/src/rpc.rs:301-305

Expected behavior

Slow API calls should not block unrelated read-only or control operations on the whole node.

Actual behavior

The API surface is effectively serialized on a single mutex in key paths.

Suggested fix

  • Split the node into independently lockable subsystems or expose cloneable handles instead of Mutex<Node>.
  • Avoid holding the node lock across long waits; extract the relevant subsystem handle first.
  • Add concurrency tests around message long-poll plus concurrent admin requests.

Risk

High for operability and perceived reliability. Under contention, the admin surface can appear hung even when the node is otherwise healthy.


Reference
geomind_code/mycelium_network#31