[CRITICAL] NodeIndex swap-remove corruption in ServiceGraph #15

Open
opened 2026-05-11 10:52:00 +00:00 by thabeta · 1 comment
Owner

Problem

ServiceGraph uses petgraph's NodeIndex as ServiceId, which is used as keys in multiple HashMaps across the supervisor: timers, process_tasks, health_attempts, pending_restarts.

When remove_service calls reindex_for_swap_remove, petgraph performs a swap-remove: the last node is moved into the removed node's slot, inheriting its NodeIndex. All the supervisor's HashMaps still reference the old ServiceId -- which now points to a different service.

Impact

  • remove_service cleans up timers/tasks for the removed service, but silently corrupts state for whatever service was swapped into its slot
  • reload calls remap_service_ids to handle this, but remove_service does NOT call remap
  • This is a latent bug that manifests whenever services are removed at runtime

Reproduction

  1. Add services A, B, C (gets NodeIndex 0, 1, 2)
  2. Schedule a timer for service B (stored in timers HashMap keyed by ServiceId(1))
  3. Remove service A -- petgraph swaps C into slot 0 (no issue)
  4. Now remove service B -- petgraph swaps C into slot 1
  5. The timer entry for ServiceId(1) now points to service C instead of B
  6. When the timer fires, it operates on the wrong service

Files

  • crates/my_init_server/src/graph.rs -- remove_service, reindex_for_swap_remove
  • crates/my_init_server/src/supervisor/mod.rs -- HashMap fields using ServiceId keys

Suggested Fix

Use stable identifiers (e.g., u64 with generation counter, or name-based lookup) instead of raw petgraph NodeIndex. Or rebuild all ServiceId-keyed maps after any structural graph change.

## Problem `ServiceGraph` uses petgraph's `NodeIndex` as `ServiceId`, which is used as keys in multiple HashMaps across the supervisor: `timers`, `process_tasks`, `health_attempts`, `pending_restarts`. When `remove_service` calls `reindex_for_swap_remove`, petgraph performs a swap-remove: the last node is moved into the removed node's slot, inheriting its `NodeIndex`. All the supervisor's HashMaps still reference the old `ServiceId` -- which now points to a **different service**. ## Impact - `remove_service` cleans up timers/tasks for the removed service, but silently corrupts state for whatever service was swapped into its slot - `reload` calls `remap_service_ids` to handle this, but `remove_service` does NOT call remap - This is a latent bug that manifests whenever services are removed at runtime ## Reproduction 1. Add services A, B, C (gets NodeIndex 0, 1, 2) 2. Schedule a timer for service B (stored in timers HashMap keyed by ServiceId(1)) 3. Remove service A -- petgraph swaps C into slot 0 (no issue) 4. Now remove service B -- petgraph swaps C into slot 1 5. The timer entry for ServiceId(1) now points to service C instead of B 6. When the timer fires, it operates on the wrong service ## Files - `crates/my_init_server/src/graph.rs` -- `remove_service`, `reindex_for_swap_remove` - `crates/my_init_server/src/supervisor/mod.rs` -- HashMap fields using ServiceId keys ## Suggested Fix Use stable identifiers (e.g., u64 with generation counter, or name-based lookup) instead of raw petgraph `NodeIndex`. Or rebuild all ServiceId-keyed maps after any structural graph change.
Member

Classification: valid-bug — NodeIndex swap-remove corrupts supervisor HashMap entries when services are removed at runtime. Data corruption confirmed by code inspection.

Confirmed by code inspection at crates/my_init_server/src/graph.rs:575-606. remove_service calls reindex_for_swap_remove but does not remap the ServiceId-keyed HashMap entries in the supervisor (timers, process_tasks, health_attempts, pending_restarts). The reload handler has a remap_service_ids call that handles this, but remove_service bypasses it entirely. A real data corruption that triggers whenever services are removed at runtime.

Evidence: reindex_for_swap_remove at graph.rs:606, called from remove_service at graph.rs:596.

> Classification: valid-bug — NodeIndex swap-remove corrupts supervisor HashMap entries when services are removed at runtime. Data corruption confirmed by code inspection. Confirmed by code inspection at crates/my_init_server/src/graph.rs:575-606. `remove_service` calls `reindex_for_swap_remove` but does not remap the ServiceId-keyed HashMap entries in the supervisor (`timers`, `process_tasks`, `health_attempts`, `pending_restarts`). The `reload` handler has a `remap_service_ids` call that handles this, but `remove_service` bypasses it entirely. A real data corruption that triggers whenever services are removed at runtime. Evidence: `reindex_for_swap_remove` at graph.rs:606, called from `remove_service` at graph.rs:596.
Sign in to join this conversation.
No milestone
No project
No assignees
2 participants
Notifications
Due date
The due date is invalid or out of range. Please use the format "yyyy-mm-dd".

No due date set.

Dependencies

No dependencies set.

Reference
geomind_code/my_init#15
No description provided.