TOCTOU in cleanup_stale_sockets: stale-delete can nuke a just-respawned socket #49
Labels
No labels
prio_critical
prio_low
type_bug
type_contact
type_issue
type_lead
type_question
type_story
type_task
No milestone
No project
No assignees
1 participant
Notifications
Due date
No due date set.
Dependencies
No dependencies set.
Reference
lhumina_code/hero_router#49
Loading…
Add table
Add a link
Reference in a new issue
No description provided.
Delete branch "%!s()"
Deleting a branch is permanent. Although the deleted branch may continue to exist for a short time before it actually gets removed, it CANNOT be undone in most cases. Continue?
Category: race / medium
What
cleanup_stale_sockets()first probes every socket in$HERO_SOCKET_DIR, collects the unresponsive ones into aVec, then iterates that list andremove_file()s each entry. A service that respawns during the probe loop can have its brand-new socket deleted by the router.Where
crates/hero_router/src/scanner.rs:154-217Relevant sequence:
is_socket_responsive(&path).await→ returnsfalse.stale_paths.is_socket_responsive).std::fs::remove_file(&path)is called, possibly seconds later on a large directory.Between step 2 and step 4 the service can be restarted by
hero_proc, bind a new socket at the same path, and be fully operational — then we delete its socket file out from under it.Why it's wrong
Unix sockets bound via
bind(2)do not keep the filesystem entry alive on their own — if the file is unlinked, existing connections still work but no new clients canconnect(2). The service looks healthy from its own perspective (no error, accept loop is fine) but is invisible to the outside world until it rebinds or restarts. Our next scan will then (correctly) mark it inactive, and the user sees flapping.Reproduction
Worst-case: a service with a slow startup that churns via
hero_procrestart loops. A probe issued just before the restart seesconnection refused(bind not yet done), marks it stale, and by the time we reach the delete step the service has rebound and is serving — we then unlink the live socket.Can also be triggered manually:
Suggested fix
Re-probe with a short timeout immediately before unlinking, and ignore
ENOENT:Even better: skip cleanup entirely for sockets whose directory corresponds to a currently-running
hero_procservice (since hero_proc is the authoritative lifecycle source). Scanner can consult the SDK'sjob_listfor running processes and refuse to unlink their sockets.Alternative
Require N consecutive failed probes (with delay between) before marking stale, to widen the window past normal restart times.
Implementation Spec for Issue #49
Objective
Eliminate the TOCTOU window in
cleanup_stale_socketsby re-probing each socket immediately beforeremove_fileand treatingENOENTon delete as a no-op, so a socket that respawned (or was already cleaned by another process) between the initial probe and the delete is not incorrectly unlinked or logged as an error.Requirements
ENOENTon delete as success (file already gone — another process removed it; not an error).remove_fileerrors (EPERM,EIO, etc.) still logged atwarn!as today.is_socket_responsivealready uses — no new config knob.Files to Modify/Create
crates/hero_router/src/scanner.rs— modify the post-collection delete loop insidecleanup_stale_sockets(lines 196-211); add a#[cfg(test)]module covering the ENOENT invariant.Implementation Plan
Step 1: Patch the delete loop in
cleanup_stale_socketsFiles:
crates/hero_router/src/scanner.rsReplace the existing block at lines 196-211 with a pre-delete re-probe and ENOENT-tolerant error handling:
Subtasks:
is_socket_responsive(line ~220) is unchanged — it already uses a 1s connect timeout, which is exactly what we want for the re-probe.Scannerfields, constructor, or any caller ofcleanup_stale_sockets.debugis already imported viause tracing::{debug, info, warn};.std::io::ErrorKindis reached via the qualified path; no newuseneeded.Dependencies: none.
Step 2: Unit test for the ENOENT invariant
Files:
crates/hero_router/src/scanner.rsThe respawn race is timing-dependent and not cleanly reproducible in CI without a real UDS harness. The ENOENT path is trivially testable and documents the invariant the fix relies on.
Append a
#[cfg(test)] mod testsblock at the end ofscanner.rs:Subtasks:
tempfileis already in[dev-dependencies]— no Cargo changes.Dependencies: Step 1.
Acceptance Criteria
is_socket_responsive(&path).awaitre-probe; verified by code review — race is timing-dependent).NotFoundarm).deleted_count(unchanged semantics in theOk(_)arm).cargo test --workspacepasses.cargo clippy --workspace --all-targets -- -D warningspasses.Notes
stale_pathsis small (most sockets are live and filtered out in the first pass), so the added latency is bounded and acceptable.hero_procfor running jobs as "even better", not required. Sticking to items (1) and (2) keeps this PR surgical.probe-then-deletehelper would grow the diff beyond a targeted race fix; leaving the method intact and editing 15 lines inside the loop body keeps review cost low.is_socket_responsivereturningtrueunconditionally causescontinuebeforeremove_fileis reached. Manual reproduction requires a UDS bind/delete race and is not automated.Test Results
Lint
cargo clippy --workspace --all-targets -- -D warnings: passcargo fmt --all --check: passNew test added in this PR
scanner::tests::remove_file_on_missing_path_reports_notfoundImplementation Summary
Closing #49 via PR (linked in a follow-up comment).
Changes
crates/hero_router/src/scanner.rs(+32 / -0)cleanup_stale_sockets: inside the post-collection delete loop, added a pre-delete re-probe and anErrorKind::NotFoundarm.std::fs::remove_file(&path), callself.is_socket_responsive(&path).await. If the socket is now responsive, log atdebug!andcontinue— the service respawned between the initial probe pass and the cleanup loop, and unlinking would make it silently unreachable to new clients.remove_file, matchErr(e) if e.kind() == std::io::ErrorKind::NotFoundand log atdebug!— another process (or an earlier sweep of ours) already removed the file; not an error. Other I/O errors (EPERM,EIO, etc.) remain logged atwarn!with the same format as before.info!/warn!arms untouched; newdebug!lines use plain text).is_socket_responsive, theScannerstruct, and all callers ofcleanup_stale_socketsare unchanged.#[cfg(test)] mod testsblock withremove_file_on_missing_path_reports_notfound, which documents the invariant theNotFoundarm relies on.Why no helper extraction or hero_proc integration
The issue's "even better" suggestion — consulting
hero_proc'sjob_listand refusing to unlink sockets owned by live jobs — would widen the diff into SDK wiring and was explicitly deferred. The pre-delete re-probe plusNotFoundhandling covers the two concrete failure modes spelled out in the reproduction section, and is a surgical 15-line change inside the existing loop body.Extracting the per-path "probe-then-delete" step into a private async helper was considered and rejected for the same reason — it grows the diff beyond a targeted race fix and requires passing
&selfanyway.Cost
The re-probe adds one extra 1s connect per stale candidate in the worst case. In steady-state operation
stale_pathsis small (most sockets are live and filtered in the first pass), so the additional latency is bounded.Race reproducibility
The respawn race is timing-dependent. Verified by inspection:
is_socket_responsivereturningtruecausescontinuebeforeremove_fileis reached, so a respawned socket is not unlinked. TheNotFoundpath is covered by the new test.Test results
cargo test --workspace— 72 passed / 0 failed / 0 ignored.cargo clippy --workspace --all-targets -- -D warnings— clean.cargo fmt --all --check— clean.Manual reproduction (suggested for reviewer)
Pull request opened: #54
This PR implements the changes discussed in this issue.