[High] Race Condition in Route Update Propagation #22


Closed

opened 2026-02-11 19:31:45 +00:00 by thabeta · 1 comment

thabeta commented

2026-02-11 19:31:45 +00:00

Owner

Issue

Concurrent updates to the routing table during peer synchronization can cause stale metric caching in the Babel protocol implementation.

Location

mycelium/src/babel/route_request.rs

Problem Description

When multiple peers send route updates simultaneously, the route_request handler does not use atomic operations or sufficient locking to ensure all metric updates are consistently applied. This can lead to:

- Incorrect path metrics being used for route decisions
- Packets being routed through suboptimal paths
- Transient routing loops during network topology changes

Impact

- Severity: HIGH (affects routing correctness)
- Frequency: Occurs under high peer churn or in large mesh networks
- User Impact: Unstable routing, higher latency, potential packet loss

Remediation

1. Use a version-based or epoch-based approach so that route updates are applied atomically
2. Guard route table access with a read-write lock (for example `RwLock`)
3. Add integration tests that stress-test concurrent route updates
4. Document the thread-safety guarantees of the routing table
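
A minimal sketch of how the first two remediation points could fit together, assuming a simplified table that maps a subnet string to a `u16` metric. The names `RouteTable`, `Inner`, `apply_updates`, and `lookup` are hypothetical and are not taken from the mycelium code base; the real routing table is likely structured differently.

```rust
use std::collections::HashMap;
use std::sync::RwLock;

/// Hypothetical route key used only for this illustration.
type Subnet = String;

#[derive(Debug, Default)]
struct Inner {
    /// Incremented once per applied batch so readers can tell which
    /// generation of the table their answer came from.
    epoch: u64,
    metrics: HashMap<Subnet, u16>,
}

/// Illustrative routing table: many concurrent readers, one writer at a time.
#[derive(Debug, Default)]
pub struct RouteTable {
    inner: RwLock<Inner>,
}

impl RouteTable {
    pub fn new() -> Self {
        Self::default()
    }

    /// Applies a whole batch of metric updates under a single write lock,
    /// so a reader sees either none of the batch or all of it.
    /// Returns the epoch assigned to this batch.
    pub fn apply_updates(&self, updates: &[(Subnet, u16)]) -> u64 {
        let mut inner = self.inner.write().unwrap();
        for (subnet, metric) in updates {
            inner.metrics.insert(subnet.clone(), *metric);
        }
        inner.epoch += 1;
        inner.epoch
    }

    /// Reads a metric together with the epoch it belongs to.
    pub fn lookup(&self, subnet: &str) -> (u64, Option<u16>) {
        let inner = self.inner.read().unwrap();
        (inner.epoch, inner.metrics.get(subnet).copied())
    }
}
```

Returning the epoch alongside the metric lets a caller detect that the table changed between two lookups, which is one way to make "stale" observable in tests.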

Testing

- Create a chaos test with 100+ peers sending contradictory routes
- Verify no stale metrics are observed in routing decisions
- Measure update propagation latency under concurrent load
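
One possible shape for such a stress test, sketched with plain threads rather than real peers. The function `count_torn_reads` and the two fixed route keys are hypothetical stand-ins: each simulated peer writes the same metric to both routes under one write lock, and a reader counts how often it sees the two routes disagree, which would indicate a partially applied update.

```rust
use std::collections::HashMap;
use std::sync::{Arc, RwLock};
use std::thread;

/// Stand-in for the routing table: two routes that every writer
/// always updates together.
type Table = Arc<RwLock<HashMap<&'static str, u16>>>;

/// Spawns `peers` writer threads and one reader thread.
/// Returns how many times the reader saw the two routes disagree.
pub fn count_torn_reads(peers: u16, rounds: u16) -> usize {
    let table: Table = Arc::new(RwLock::new(HashMap::from([("a", 0), ("b", 0)])));

    let writers: Vec<_> = (0..peers)
        .map(|peer| {
            let table = Arc::clone(&table);
            thread::spawn(move || {
                for round in 0..rounds {
                    // Different peers write different, conflicting metrics.
                    let metric = peer.wrapping_mul(31).wrapping_add(round);
                    // Both routes change under one write lock.
                    let mut routes = table.write().unwrap();
                    routes.insert("a", metric);
                    routes.insert("b", metric);
                }
            })
        })
        .collect();

    let reader = {
        let table = Arc::clone(&table);
        thread::spawn(move || {
            let mut torn = 0;
            for _ in 0..10_000 {
                let routes = table.read().unwrap();
                if routes["a"] != routes["b"] {
                    torn += 1;
                }
            }
            torn
        })
    };

    for writer in writers {
        writer.join().unwrap();
    }
    reader.join().unwrap()
}
```

With the lock in place, `count_torn_reads(100, 50)` is expected to return 0. A test against the real routing table would replace the map with the actual table type and the writers with simulated peer updates.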


lee commented

2026-03-20 11:39:33 +00:00

Owner

Route requests are handled by reads from the routing table (https://forge.ourworld.tf/geomind_code/mycelium_network/issues/17#issuecomment-14018), which at that point holds the most up-to-date calculated metrics. There may be updates still queued for processing, but that is always possible: an update could equally still be in flight, in which case the receiving node does not know about it yet. Note that the Babel spec accounts for this by preventing routing loops.
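
For context, the loop-avoidance rule in the Babel spec (RFC 8966) is the feasibility condition: an update is accepted for route selection only if it is a retraction, if no feasibility distance is recorded for the source, or if its (seqno, metric) pair is strictly better than the recorded feasibility distance. The sketch below illustrates that rule in isolation; the names `Dist`, `seqno_newer`, and `is_feasible` are hypothetical and do not come from the mycelium code.

```rust
/// Metric value that Babel uses to signal a retraction.
const INFINITY: u16 = 0xFFFF;

/// A (seqno, metric) pair, as used for feasibility distances.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Dist {
    pub seqno: u16,
    pub metric: u16,
}

/// Returns true if `a` is newer than `b`, comparing modulo 2^16
/// so that sequence numbers may wrap around.
fn seqno_newer(a: u16, b: u16) -> bool {
    let diff = a.wrapping_sub(b);
    diff != 0 && diff < 0x8000
}

/// Feasibility condition: `fd` is the feasibility distance stored for
/// the update's source, or `None` if nothing is recorded yet.
pub fn is_feasible(update: Dist, fd: Option<Dist>) -> bool {
    // Retractions are always feasible.
    if update.metric == INFINITY {
        return true;
    }
    match fd {
        // No recorded distance for this source: accept.
        None => true,
        // Otherwise the update must be strictly better.
        Some(fd) => {
            seqno_newer(update.seqno, fd.seqno)
                || (update.seqno == fd.seqno && update.metric < fd.metric)
        }
    }
}
```

Because a stale or delayed update with an equal or worse (seqno, metric) pair is rejected by this check, a node acting on slightly outdated information does not by itself create a loop.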