Native TCP transport for explorer — remove socat dependency for multi-machine deployments #30

Open
opened 2026-03-17 11:11:07 +00:00 by mahmoud · 0 comments
Owner

Problem

The entire hero_compute RPC stack uses Unix sockets
only. Nodes on different physical machines cannot
send heartbeats to a remote explorer or receive
proxied VM calls without an external socat bridge.

Current workaround: socat bridges set up manually
or via make start scripts. This is fragile, requires
socat installed on every node, and is not
container-friendly.

What Needs to Change

Five components need updating:

  1. hero_rpc: UnixRpcServer needs an optional
    TcpListener alongside the existing UnixListener.
    The change should be backward compatible, with
    the Unix socket remaining the default.

  2. hero_compute_explorer/src/main.rs — When
    EXPLORER_TCP_ADDR env var is set (e.g.
    0.0.0.0:9002), bind a TCP listener in addition
    to the Unix socket.

  3. hero_compute_server/src/heartbeat_sender.rs —
    Parse tcp://host:port in EXPLORER_ADDRESSES
    and use TcpStream::connect() instead of
    UnixStream::connect() for remote explorers.

  4. hero_compute_explorer/src/explorer/proxy.rs —
    Support TcpStream for remote nodes alongside
    UnixStream for local nodes. The node's
    socket_path field becomes a URI:
    unix:///path/to/socket (local node)
    tcp://host:port (remote node)

  5. schemas/explorer/explorer.oschema —
    Rename or redefine socket_path field to
    accept both unix:// and tcp:// URIs.
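
A minimal sketch of items 1 and 2, under stated assumptions: it uses
blocking std I/O and one thread per connection purely to stay
self-contained, whereas the real hero_rpc server may well be async.
The names serve and handle_connection are hypothetical, and the
line-echo handler only stands in for actual RPC dispatch.

```rust
use std::io::{self, BufRead, BufReader, Read, Write};
use std::net::{SocketAddr, TcpListener};
use std::os::unix::net::UnixListener;
use std::path::Path;
use std::thread;

/// Hypothetical handler: echoes one newline-terminated line back to the
/// caller. A real server would decode and dispatch an RPC request here.
/// It is generic over the stream type, so both transports share it.
fn handle_connection<S: Read + Write>(stream: S) -> io::Result<()> {
    let mut reader = BufReader::new(stream);
    let mut line = String::new();
    reader.read_line(&mut line)?;
    reader.get_mut().write_all(line.as_bytes())
}

/// Always binds the Unix socket; binds TCP only when `tcp_addr` is `Some`.
/// Returns the TCP address actually bound (useful when port 0 is requested).
pub fn serve(unix_path: &Path, tcp_addr: Option<SocketAddr>) -> io::Result<Option<SocketAddr>> {
    let unix = UnixListener::bind(unix_path)?;
    thread::spawn(move || {
        for stream in unix.incoming() {
            // Stop accepting on the first accept error.
            let Ok(stream) = stream else { break };
            thread::spawn(move || {
                let _ = handle_connection(stream);
            });
        }
    });

    // No TCP address configured: Unix-only, i.e. the current behavior.
    let Some(addr) = tcp_addr else { return Ok(None) };

    let tcp = TcpListener::bind(addr)?;
    let bound = tcp.local_addr()?;
    thread::spawn(move || {
        for stream in tcp.incoming() {
            let Ok(stream) = stream else { break };
            thread::spawn(move || {
                let _ = handle_connection(stream);
            });
        }
    });
    Ok(Some(bound))
}
```

Because the handler only requires Read + Write, adding the TCP
listener does not touch the Unix code path, which is one way to keep
the change backward compatible.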

New Environment Variables

EXPLORER_TCP_ADDR=0.0.0.0:9002
On the explorer — enables TCP listener.
If unset, TCP is disabled (Unix only,
current behavior).
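
One possible way to read this variable, as a sketch rather than the
actual implementation. The function names are hypothetical. Treating
an empty value the same as unset is an assumption, and std's
SocketAddr parser accepts only IP:port (no hostnames), which matches
the 0.0.0.0:9002 example.

```rust
use std::net::{AddrParseError, SocketAddr};

/// Interprets the raw value of EXPLORER_TCP_ADDR.
/// `None` (or an empty string, by assumption) means TCP stays disabled.
pub fn tcp_listen_addr(raw: Option<&str>) -> Result<Option<SocketAddr>, AddrParseError> {
    match raw.map(str::trim) {
        None | Some("") => Ok(None),
        // A malformed value is an error rather than a silent fallback,
        // so a typo does not quietly leave the explorer Unix-only.
        Some(value) => value.parse().map(Some),
    }
}

/// Convenience wrapper that reads the process environment.
pub fn tcp_listen_addr_from_env() -> Result<Option<SocketAddr>, AddrParseError> {
    tcp_listen_addr(std::env::var("EXPLORER_TCP_ADDR").ok().as_deref())
}
```

Keeping the parsing separate from the environment lookup makes the
"unset means Unix only" rule easy to unit-test.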

EXPLORER_ADDRESSES=tcp://135.181.217.244:9002
On the node — where to send heartbeats.
Supports both:
unix:///path/to/explorer.sock (local)
tcp://host:port (remote)
Comma-separated for multiple explorers.
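
A sketch of how the node side might parse this list and pick the
matching stream type. Endpoint, Conn, parse_endpoint,
parse_explorer_addresses and connect are hypothetical names, not
existing hero_compute APIs. Bare paths without a scheme are rejected
here; whether to accept them for existing configurations is left to
the implementation.

```rust
use std::io;
use std::net::TcpStream;
use std::os::unix::net::UnixStream;
use std::path::PathBuf;

/// A parsed explorer (or node) address.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum Endpoint {
    Unix(PathBuf),
    /// Stored as "host:port" so it can be resolved at connect time.
    Tcp(String),
}

/// Either kind of connected stream.
pub enum Conn {
    Unix(UnixStream),
    Tcp(TcpStream),
}

pub fn parse_endpoint(uri: &str) -> Result<Endpoint, String> {
    if let Some(path) = uri.strip_prefix("unix://") {
        // unix:///path/to/socket leaves "/path/to/socket" after the scheme.
        if path.starts_with('/') {
            Ok(Endpoint::Unix(PathBuf::from(path)))
        } else {
            Err(format!("unix URI needs an absolute path: {uri}"))
        }
    } else if let Some(authority) = uri.strip_prefix("tcp://") {
        match authority.rsplit_once(':') {
            Some((host, port)) if !host.is_empty() && port.parse::<u16>().is_ok() => {
                Ok(Endpoint::Tcp(authority.to_string()))
            }
            _ => Err(format!("tcp URI needs host:port: {uri}")),
        }
    } else {
        Err(format!("unsupported scheme: {uri}"))
    }
}

/// Splits a comma-separated list, ignoring surrounding whitespace and
/// empty entries. The first invalid entry fails the whole list.
pub fn parse_explorer_addresses(raw: &str) -> Result<Vec<Endpoint>, String> {
    raw.split(',')
        .map(str::trim)
        .filter(|entry| !entry.is_empty())
        .map(parse_endpoint)
        .collect()
}

/// Opens the stream type that matches the endpoint's scheme.
pub fn connect(endpoint: &Endpoint) -> io::Result<Conn> {
    match endpoint {
        Endpoint::Unix(path) => UnixStream::connect(path).map(Conn::Unix),
        Endpoint::Tcp(addr) => TcpStream::connect(addr.as_str()).map(Conn::Tcp),
    }
}
```

The same Endpoint type could back the socket_path URI in the explorer
proxy, so both sides share one parser.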

Security Note

TCP transport will initially be unauthenticated
at the RPC layer (as the Unix socket is today).
TLS/mTLS is a follow-up once the basic transport
works. Until then, access to the TCP port must be
restricted at the network level (firewall rules,
VPN, mycelium network).

Backward Compatibility

  • Unix socket behavior must be unchanged
  • Existing deployments with no EXPLORER_TCP_ADDR
    set must continue to work identically
  • socat bridge approach remains valid as fallback

Why This Matters

This is the prerequisite for:

  • ThreeFold Grid backend integration
    (TF nodes are remote by definition)
  • Container deployments
  • Production multi-node setups beyond ~5 nodes
  • Removing the socat dependency

Note on herolib_core

herolib_core already has OpenRpcTransport::http()
for the client side, so the client-side building
blocks exist. The server side (UnixRpcServer) has
no TCP equivalent and needs one added.

Definition of Done

  • Explorer binds TCP when EXPLORER_TCP_ADDR is set
  • Heartbeat sender connects via TCP to remote
    explorers
  • Explorer proxy forwards calls to remote nodes
    via TCP
  • socket_path field supports tcp:// URIs
  • EXPLORER_ADDRESSES accepts both unix:// and
    tcp:// schemes
  • Existing Unix-only deployments unaffected
  • socat bridge scripts kept as fallback but
    documented as optional
  • docs/setup.md updated with multi-machine
    TCP setup instructions
mahmoud self-assigned this 2026-03-17 11:11:17 +00:00
Reference
lhumina_code/hero_compute#30