TOCTOU in cleanup_stale_sockets: stale-delete can nuke a just-respawned socket #49

Closed
opened 2026-04-22 14:54:32 +00:00 by rawdaGastan · 4 comments
Member

Category: race / medium

What

cleanup_stale_sockets() first probes every socket in $HERO_SOCKET_DIR, collects the unresponsive ones into a Vec, then iterates that list and remove_file()s each entry. A service that respawns during the probe loop can have its brand-new socket deleted by the router.

Where

crates/hero_router/src/scanner.rs:154-217

Relevant sequence:

  1. L182 / L190 — is_socket_responsive(&path).await → returns false.
  2. Path pushed into stale_paths.
  3. Outer loop continues probing other sockets (each probe has a 1 s timeout — see is_socket_responsive).
  4. L198 — std::fs::remove_file(&path) is called, possibly seconds later on a large directory.

Between step 2 and step 4 the service can be restarted by hero_proc, bind a new socket at the same path, and be fully operational — then we delete its socket file out from under it.

Why it's wrong

Unix sockets bound via bind(2) do not keep the filesystem entry alive on their own — if the file is unlinked, existing connections still work but no new clients can connect(2). The service looks healthy from its own perspective (no error, accept loop is fine) but is invisible to the outside world until it rebinds or restarts. Our next scan will then (correctly) mark it inactive, and the user sees flapping.
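The unlink semantics described above can be demonstrated with a few lines of std-only Rust (Unix-only; the temp-dir path is illustrative): an established connection survives the unlink, but a fresh connect(2) fails with ENOENT.

```rust
use std::io::{Read, Write};
use std::os::unix::net::{UnixListener, UnixStream};

fn main() -> std::io::Result<()> {
    let dir = std::env::temp_dir().join("uds_unlink_demo");
    std::fs::create_dir_all(&dir)?;
    let path = dir.join("demo.sock");
    let _ = std::fs::remove_file(&path); // clean slate

    let listener = UnixListener::bind(&path)?;

    // A client that connected before the unlink keeps working.
    let mut client = UnixStream::connect(&path)?;
    let (mut server_side, _) = listener.accept()?;

    // Unlink the filesystem entry, as the buggy cleanup loop does.
    std::fs::remove_file(&path)?;

    // The established connection is unaffected...
    server_side.write_all(b"still alive")?;
    let mut buf = [0u8; 11];
    client.read_exact(&mut buf)?;
    assert_eq!(&buf, b"still alive");

    // ...but new clients can no longer connect: the path is gone (ENOENT).
    let err = UnixStream::connect(&path).unwrap_err();
    assert_eq!(err.kind(), std::io::ErrorKind::NotFound);
    Ok(())
}
```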

Reproduction

Worst-case: a service with a slow startup that churns via hero_proc restart loops. A probe issued just before the restart sees connection refused (bind not yet done), marks it stale, and by the time we reach the delete step the service has rebound and is serving — we then unlink the live socket.

Can also be triggered manually:

# While hero_router scanner is running:
rm ~/hero/var/sockets/some_service/rpc.sock   # force probe to fail
# Simultaneously restart the service — window is ~1-3s

Suggested fix

Re-probe with a short timeout immediately before unlinking, and ignore ENOENT:

for path in stale_paths {
    // Re-check responsiveness right before delete. If the socket is
    // now alive, skip — it was respawned during the probe loop.
    if self.is_socket_responsive(&path).await {
        continue;
    }
    match std::fs::remove_file(&path) {
        Ok(_) => { /* ... */ }
        Err(e) if e.kind() == std::io::ErrorKind::NotFound => { /* already gone */ }
        Err(e) => warn!("failed to delete stale socket {}: {}", path.display(), e),
    }
}

Even better: skip cleanup entirely for sockets whose directory corresponds to a currently-running hero_proc service (since hero_proc is the authoritative lifecycle source). Scanner can consult the SDK's job_list for running processes and refuse to unlink their sockets.
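A guard along those lines might look like the sketch below. Note the assumptions: the `$HERO_SOCKET_DIR/<service>/rpc.sock` layout is inferred from the reproduction path above, and how the set of running service names is obtained from the SDK's job_list is left abstract — this is not the real SDK API.

```rust
use std::collections::HashSet;
use std::path::Path;

/// Hypothetical guard: true if `socket_path` sits in a per-service directory
/// whose name matches a service that hero_proc currently reports as running.
/// The scanner would refuse to unlink such sockets.
fn owned_by_running_service(socket_path: &Path, running: &HashSet<String>) -> bool {
    socket_path
        .parent()                            // .../sockets/<service>
        .and_then(|dir| dir.file_name())
        .and_then(|name| name.to_str())
        .map(|service| running.contains(service))
        .unwrap_or(false)
}
```

The `running` set would be fetched once per sweep (not per path) so the guard stays cheap even on large socket directories.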

Alternative

Require N consecutive failed probes (with delay between) before marking stale, to widen the window past normal restart times.
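The bookkeeping for that alternative is small — a per-path failure counter that any successful probe resets. A minimal sketch (the threshold value and names are illustrative):

```rust
use std::collections::HashMap;
use std::path::PathBuf;

/// Illustrative: tune so N probes span longer than a normal restart.
const STALE_THRESHOLD: u32 = 3;

/// Debounce: a path is only reported stale after STALE_THRESHOLD
/// consecutive failed probes; a single successful probe resets it.
struct StaleTracker {
    failures: HashMap<PathBuf, u32>,
}

impl StaleTracker {
    fn new() -> Self {
        Self { failures: HashMap::new() }
    }

    /// Record one probe result; returns true once the path crosses the threshold.
    fn record(&mut self, path: &PathBuf, responsive: bool) -> bool {
        if responsive {
            self.failures.remove(path);
            return false;
        }
        let n = self.failures.entry(path.clone()).or_insert(0);
        *n += 1;
        *n >= STALE_THRESHOLD
    }
}
```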


rawdaGastan added this to the ACTIVE project 2026-04-23 10:47:25 +00:00
Author
Member

Implementation Spec for Issue #49

Objective

Eliminate the TOCTOU window in cleanup_stale_sockets by re-probing each socket immediately before remove_file and treating ENOENT on delete as a no-op, so a socket that respawned (or was already cleaned by another process) between the initial probe and the delete is not incorrectly unlinked or logged as an error.

Requirements

  • Re-probe immediately before delete; skip unlink if the socket came back to life.
  • Treat ENOENT on delete as success (file already gone — another process removed it; not an error).
  • Other remove_file errors (EPERM, EIO, etc.) still logged at warn! as today.
  • No regression to the success path: unresponsive stale sockets still get cleaned up.
  • Cheap: the pre-delete re-probe reuses the 1s timeout that is_socket_responsive already uses — no new config knob.
  • Keep the diff surgical; no refactor of unrelated code.

Files to Modify/Create

  • crates/hero_router/src/scanner.rs — modify the post-collection delete loop inside cleanup_stale_sockets (lines 196-211); add a #[cfg(test)] module covering the ENOENT invariant.

Implementation Plan

Step 1: Patch the delete loop in cleanup_stale_sockets

Files: crates/hero_router/src/scanner.rs

Replace the existing block at lines 196-211 with a pre-delete re-probe and ENOENT-tolerant error handling:

let mut deleted_count = 0usize;
for path in stale_paths {
    // TOCTOU guard (issue #49): re-probe immediately before deleting.
    // A service may have respawned between the initial probe loop and now,
    // binding a fresh socket at the same path. Unlinking that would make
    // the just-respawned service silently unreachable to new clients.
    if self.is_socket_responsive(&path).await {
        debug!(
            "  Socket respawned before cleanup, skipping: {}",
            path.display()
        );
        continue;
    }

    match std::fs::remove_file(&path) {
        Ok(_) => {
            info!("  Deleted stale socket: {}", path.display());
            deleted_count += 1;
        }
        Err(e) if e.kind() == std::io::ErrorKind::NotFound => {
            // Another process (or a race with our own earlier sweep) already
            // unlinked the file. Not an error.
            debug!(
                "  Stale socket already gone (ENOENT): {}",
                path.display()
            );
        }
        Err(e) => {
            warn!(
                "  Failed to delete stale socket {}: {}",
                path.display(),
                e
            );
        }
    }
}

Subtasks:

  • is_socket_responsive (line ~220) is unchanged — it already uses a 1s connect timeout, which is exactly what we want for the re-probe.
  • No changes to Scanner fields, constructor, or any caller of cleanup_stale_sockets.
  • debug is already imported via use tracing::{debug, info, warn};.
  • std::io::ErrorKind is reached via the qualified path; no new use needed.

Dependencies: none.

Step 2: Unit test for the ENOENT invariant

Files: crates/hero_router/src/scanner.rs

The respawn race is timing-dependent and not cleanly reproducible in CI without a real UDS harness. The ENOENT path is trivially testable and documents the invariant the fix relies on.

Append a #[cfg(test)] mod tests block at the end of scanner.rs:

#[cfg(test)]
mod tests {
    use super::*;

    /// Documents the invariant the issue #49 fix relies on: removing a path
    /// that no longer exists returns `ErrorKind::NotFound`, which the cleanup
    /// loop now treats as a no-op rather than a warning.
    #[test]
    fn remove_file_on_missing_path_reports_notfound() {
        let dir = tempfile::tempdir().unwrap();
        let missing = dir.path().join("does_not_exist.sock");
        let err = std::fs::remove_file(&missing).expect_err("should fail");
        assert_eq!(err.kind(), std::io::ErrorKind::NotFound);
    }
}

Subtasks:

  • tempfile is already in [dev-dependencies] — no Cargo changes.
  • Append after the final function in the file (the module currently has no tests).

Dependencies: Step 1.

Acceptance Criteria

  • A socket that goes from "unresponsive" to "responsive" between probe and delete is NOT unlinked (guarded by the is_socket_responsive(&path).await re-probe; verified by code review — race is timing-dependent).
  • A path that was deleted between probe and our own delete (ENOENT) is treated as success, not logged as an error (verified by the new test and by code review of the NotFound arm).
  • The existing unresponsive-socket cleanup path still deletes and still increments deleted_count (unchanged semantics in the Ok(_) arm).
  • cargo test --workspace passes.
  • cargo clippy --workspace --all-targets -- -D warnings passes.

Notes

  • Re-probe cost: adds at most one extra 1s connect per stale candidate. In normal operation stale_paths is small (most sockets are live and filtered out in the first pass), so the added latency is bounded and acceptable.
  • hero_proc integration is deferred. The issue lists consulting hero_proc for running jobs as "even better", not required. Sticking to items (1) and (2) keeps this PR surgical.
  • No helper extraction. Extracting a probe-then-delete helper would grow the diff beyond a targeted race fix; leaving the method intact and editing 15 lines inside the loop body keeps review cost low.
  • Race reproducibility: verified by inspection — is_socket_responsive returning true unconditionally causes continue before remove_file is reached. Manual reproduction requires a UDS bind/delete race and is not automated.
Author
Member

Test Results

  • Total: 72
  • Passed: 72
  • Failed: 0
  • Ignored: 0

Lint

  • cargo clippy --workspace --all-targets -- -D warnings: pass
  • cargo fmt --all --check: pass

New test added in this PR

  • scanner::tests::remove_file_on_missing_path_reports_notfound
Author
Member

Implementation Summary

Closing #49 via PR (linked in a follow-up comment).

Changes

crates/hero_router/src/scanner.rs (+32 / -0)

  • cleanup_stale_sockets: inside the post-collection delete loop, added a pre-delete re-probe and an ErrorKind::NotFound arm.
    • Before std::fs::remove_file(&path), call self.is_socket_responsive(&path).await. If the socket is now responsive, log at debug! and continue — the service respawned between the initial probe pass and the cleanup loop, and unlinking would make it silently unreachable to new clients.
    • After remove_file, match Err(e) if e.kind() == std::io::ErrorKind::NotFound and log at debug! — another process (or an earlier sweep of ours) already removed the file; not an error. Other I/O errors (EPERM, EIO, etc.) remain logged at warn! with the same format as before.
  • Preserved existing logging style (emojis on info!/warn! arms untouched; new debug! lines use plain text).
  • is_socket_responsive, the Scanner struct, and all callers of cleanup_stale_sockets are unchanged.
  • Added a #[cfg(test)] mod tests block with remove_file_on_missing_path_reports_notfound, which documents the invariant the NotFound arm relies on.

Why no helper extraction or hero_proc integration

The issue's "even better" suggestion — consulting hero_proc's job_list and refusing to unlink sockets owned by live jobs — would widen the diff into SDK wiring and was explicitly deferred. The pre-delete re-probe plus NotFound handling covers the two concrete failure modes spelled out in the reproduction section, and is a surgical 15-line change inside the existing loop body.

Extracting the per-path "probe-then-delete" step into a private async helper was considered and rejected for the same reason — it grows the diff beyond a targeted race fix and requires passing &self anyway.

Cost

The re-probe adds one extra 1s connect per stale candidate in the worst case. In steady-state operation stale_paths is small (most sockets are live and filtered in the first pass), so the additional latency is bounded.

Race reproducibility

The respawn race is timing-dependent. Verified by inspection: is_socket_responsive returning true causes continue before remove_file is reached, so a respawned socket is not unlinked. The NotFound path is covered by the new test.

Test results

  • cargo test --workspace — 72 passed / 0 failed / 0 ignored.
  • cargo clippy --workspace --all-targets -- -D warnings — clean.
  • cargo fmt --all --check — clean.

Manual reproduction (suggested for reviewer)

# While hero_router scanner is running against a socket dir:
rm ~/hero/var/sockets/some_service/rpc.sock       # force next probe to fail
# Before the scanner's cleanup loop reaches this path, rebind the socket:
# (e.g. restart the service via hero_proc in another terminal).
# Expected: scanner logs
#   DEBUG hero_router::scanner: Socket respawned before cleanup, skipping: <path>
# and leaves the new socket intact.
Author
Member

Pull request opened: #54 (https://forge.ourworld.tf/lhumina_code/hero_router/pulls/54)

This PR implements the changes discussed in this issue.

Reference
lhumina_code/hero_router#49