[CRITICAL] NodeIndex swap-remove corruption in ServiceGraph #15
Problem
ServiceGraph uses petgraph's NodeIndex as ServiceId, which serves as the key type in multiple HashMaps across the supervisor: timers, process_tasks, health_attempts, pending_restarts.
When remove_service calls reindex_for_swap_remove, petgraph performs a swap-remove: the last node is moved into the removed node's slot, inheriting its NodeIndex. All the supervisor's HashMaps still reference the old ServiceId, which now points to a different service.
Impact
- remove_service cleans up timers/tasks for the removed service, but silently corrupts state for whatever service was swapped into its slot
- reload calls remap_service_ids to handle this, but remove_service does NOT call remap
Reproduction
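The swap-remove hazard described above can be illustrated without the project's code. The sketch below is a std-only stand-in, not the original reproduction steps: Vec::swap_remove plays the role of petgraph's node removal (both move the last element into the vacated slot), ServiceId is aliased to usize, and the timers map and the demo function are hypothetical names chosen for illustration.

```rust
use std::collections::HashMap;

// Assumption: a plain usize stands in for petgraph's NodeIndex.
type ServiceId = usize;

/// Removes service 0 the way the issue describes remove_service behaving:
/// the removed service's own map entry is cleaned up, but nothing is remapped.
/// Returns (name now stored at id 0, whether id 0 has a timer entry,
/// whether the stale key 2 is still present in the map).
fn demo() -> (&'static str, bool, bool) {
    let mut services = vec!["db", "cache", "web"];

    // Hypothetical supervisor map keyed by the raw index.
    let mut timers: HashMap<ServiceId, &'static str> = HashMap::new();
    for (id, name) in services.iter().enumerate() {
        timers.insert(id, *name);
    }

    let removed: ServiceId = 0;
    timers.remove(&removed); // cleanup for the removed service only
    services.swap_remove(removed); // "web" moves from slot 2 into slot 0

    let now_at_zero = services[0];
    let timer_found_for_zero = timers.contains_key(&0);
    let stale_key_left = timers.contains_key(&2);
    (now_at_zero, timer_found_for_zero, stale_key_left)
}
```

Calling demo() returns ("web", false, true): "web" now lives at id 0 but has no timer under that key, while its old entry is still stored under id 2, where the next service added to the graph would inherit it.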
Files
- crates/my_init_server/src/graph.rs: remove_service, reindex_for_swap_remove
- crates/my_init_server/src/supervisor/mod.rs: HashMap fields using ServiceId keys
Suggested Fix
Use stable identifiers (e.g., u64 with generation counter, or name-based lookup) instead of raw petgraph NodeIndex. Or rebuild all ServiceId-keyed maps after any structural graph change.

Confirmed by code inspection at crates/my_init_server/src/graph.rs:575-606.
remove_service calls reindex_for_swap_remove but does not remap the ServiceId-keyed HashMap entries in the supervisor (timers, process_tasks, health_attempts, pending_restarts). The reload handler has a remap_service_ids call that handles this, but remove_service bypasses it entirely. This is a real data corruption bug that triggers whenever services are removed at runtime.
Evidence:
reindex_for_swap_remove at graph.rs:606, called from remove_service at graph.rs:596.
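One way the missing remap step could look is sketched below, as a std-only illustration rather than the project's actual code. It assumes the removal routine can report which index was moved into the vacated slot. The names swap_remove_node and remap_after_remove are hypothetical, ServiceId is aliased to usize in place of NodeIndex, and the real signature of remap_service_ids is not shown in this issue and may differ.

```rust
use std::collections::HashMap;

// Assumption: a plain usize stands in for petgraph's NodeIndex.
type ServiceId = usize;

/// Removes the node at `id` with swap-remove semantics and reports the old
/// index of the node that was moved into the vacated slot, if any.
/// Panics if `id` is out of range.
fn swap_remove_node(nodes: &mut Vec<String>, id: ServiceId) -> Option<ServiceId> {
    assert!(id < nodes.len(), "unknown service id");
    let last = nodes.len() - 1;
    nodes.swap_remove(id);
    // Nothing moves when the removed node was already the last one.
    (id != last).then_some(last)
}

/// Drops the removed service's entry, then re-keys the moved service's
/// entry (if it has one) from its old index to the vacated index.
fn remap_after_remove<V>(
    map: &mut HashMap<ServiceId, V>,
    removed: ServiceId,
    moved_from: Option<ServiceId>,
) {
    map.remove(&removed);
    if let Some(old) = moved_from {
        if let Some(value) = map.remove(&old) {
            map.insert(removed, value);
        }
    }
}
```

Under this sketch, remove_service would call remap_after_remove once per ServiceId-keyed map (timers, process_tasks, health_attempts, pending_restarts) right after the graph removal, so no map is left holding a key for a slot that has changed owner. The stable-identifier alternative avoids the remap entirely, at the cost of an extra lookup from identifier to NodeIndex.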