Logger lacks a drain/flush-sync API (root cause of read-after-write races) #141

New issue

Open

opened 2026-06-08 12:04:42 +00:00 by mahmoud · 0 comments

mahmoud commented

2026-06-08 12:04:42 +00:00

Owner

Problem

herolib_core's file logger has no drain / flush-sync / read-after-write API. Writes are buffered and flushed asynchronously, and there is no way for a caller to wait until "everything written so far is on disk and queryable."

This single gap is the root cause behind several work-arounds in the hero_proc integration suite (and a real, if minor, product sharp edge):

run_all concurrency cap (default 3, HERO_PROC_TEST_CONCURRENCY) — needed because unbounded concurrency outruns the logger.
poll_until scattered across ~40 read sites — tests must poll because a log/count/file read immediately after a job can see stale/missing data.
clean_test_data_removes_everything FS→store assertion shift — a deleted log dir can be recreated by a late buffered flush, so a raw-filesystem "is it empty" check is inherently racy. The test now asserts the queryable store instead.

Why it keeps resurfacing

It is a product gap, not just test flakiness: any consumer that writes logs and then immediately reads/deletes them can observe the same race. The suite just hits it hardest because it does delete-then-assert-empty in tight loops.

Proposed follow-up

Add a flush/drain primitive to the logger in herolib_core (e.g. LogStore::flush() / drain() that blocks until the in-flight buffer is persisted), then:

hero_proc can call it before a delete/cleanup so on-disk state is consistent;
the integration suite can drop most poll_until shims and revert the cleanup test to a direct assertion.

Context

Surfaced while stabilizing the integration suite for #136 (PR #139). PR #139 fixes the real server-side log-leak bugs and works around the logger limitation; this issue tracks the underlying drain-API gap as the durable fix.

## Problem `herolib_core`'s file logger has **no drain / flush-sync / read-after-write API**. Writes are buffered and flushed asynchronously, and there is no way for a caller to wait until "everything written so far is on disk and queryable." This single gap is the root cause behind several work-arounds in the hero_proc integration suite (and a real, if minor, product sharp edge): - **`run_all` concurrency cap** (default 3, `HERO_PROC_TEST_CONCURRENCY`) — needed because unbounded concurrency outruns the logger. - **`poll_until` scattered across ~40 read sites** — tests must poll because a log/count/file read immediately after a job can see stale/missing data. - **`clean_test_data_removes_everything` FS→store assertion shift** — a deleted log dir can be *recreated* by a late buffered flush, so a raw-filesystem "is it empty" check is inherently racy. The test now asserts the queryable store instead. ## Why it keeps resurfacing It is a **product gap**, not just test flakiness: any consumer that writes logs and then immediately reads/deletes them can observe the same race. The suite just hits it hardest because it does delete-then-assert-empty in tight loops. ## Proposed follow-up Add a flush/drain primitive to the logger in `herolib_core` (e.g. `LogStore::flush()` / `drain()` that blocks until the in-flight buffer is persisted), then: - hero_proc can call it before a delete/cleanup so on-disk state is consistent; - the integration suite can drop most `poll_until` shims and revert the cleanup test to a direct assertion. ## Context Surfaced while stabilizing the integration suite for #136 (PR #139). PR #139 fixes the real server-side log-leak bugs and works around the logger limitation; this issue tracks the underlying drain-API gap as the durable fix.