[PERF] Shard RoutingTable to reduce lock contention #17
Loading…
Add table
Add a link
Reference in a new issue
No description provided.
Delete branch "%!s()"
Deleting a branch is permanent. Although the deleted branch may continue to exist for a short time before it actually gets removed, it CANNOT be undone in most cases. Continue?
Description
The global
RwLockon theRoutingTablecauses thread contention during high-frequency route updates.Proposed Fix
Implement sharding by subnet prefix or hash to allow multiple workers to update segments of the table concurrently.
The full architecture of the routing table is as follows:
The table (prefix trie) itself is a left-right datastructure. There are 2 copies of the trie. Internally there is a pointer to the readable copy, and a pointer to the writeable copy. Every time a read is done, this goes through the readable copy. When a new subnet is inserted/old one deleted, the read pointer moves to the other copy. Once all existing reads are done, the modification/write progresses and modifies the first copy (which now no longer has any readers). After the write has finished, the readers move back ot this copy and once no old reader exists on the second copy, the write is applied to said copy. This way reads can always progress even if a write is currently in progress.
The value associated with every subnet is held in an Arc, so we cheaply hand out copies to it. Modification to the route list of a the subnet are done through a closure, which is applied in a CAS fashion on the Arc. Essentially, on every modification to the list, a copy of the current list is made, which is then modified, and swapped with the original (CAS style). If the CAS op fails, a new copy of the current route list is obtained and the closure is applied again, until it succeeds.
This means that essentially al writes to the routing table are deterministic and not racing.
Furthermore, the handler for updates can spawn on multiple tasks (which should get distributed over multiple threads). Assignment to a task is based on the (hash of) the relevant subnet, meaning there is a deterministic mapping between subnets and the task that handles them. This generally avoids race conditions in the CAS operation (CAS should always succeed first try so there is no overhead of retries), while also allowing multiple updates to be handled fully in parallel (especially if the subnet is known to the peer, as the only actual lock taken is when a new subnet is inserted/existing subnet is removed).
For other data structures which map subnets to data, we generally use dashmap, which is internally sharded to allow for parallel writes most of the time
mik-tf referenced this issue2026-05-19 17:31:39 +00:00