hero_log: sources_in_root cache (5s TTL) makes newly-created source dirs invisible to wildcard queries #142

Open
opened 2026-05-21 15:00:49 +00:00 by sameh-farouk · 0 comments
Member

Symptom

functional::runs::logs_query_by_service_src_wildcard passes in a fresh sandbox but fails when run after another test that inserts logs against the same daemon. Specifically: after structured_logs_all_levels populates core.run_log_levels.* entries, a subsequent wildcard query for core.run-svc-query.* (a different src family) returns 0 entries even though the test inserts 12 of them.

Repro

SANDBOX=$(mktemp -d); mkdir -p $SANDBOX/var/sockets/hero_proc
PATH_ROOT=$SANDBOX ./target/release/hero_proc_server &
sleep 2
./target/release/hero_proc_test --filter functional::runs::structured_logs_all_levels  # passes
./target/release/hero_proc_test --filter functional::runs::logs_query_by_service_src_wildcard  # FAILS (count=0)
./target/release/hero_proc_test --filter functional::runs::logs_query_by_service_src_wildcard  # FAILS again on the same daemon

Root cause

In herolib_core/src/logger/query.rs::search_logs (lines 138-164), there is a fast path that skips the source-tree walk only when src_prefix resolves to an exact existing source directory:

let mut sources: Vec<String> = if !query.src_prefix.is_empty() {
    let dotted = &query.src_prefix;
    let dir = cfg.root.join(dotted.replace('.', "/"));
    if dir.is_dir() && dir_has_daily_log(&dir) {
        vec![dotted.clone()]
    } else {
        let mut s = sources_in_root(&cfg.root)?;     // ← cached, 5s TTL
        s.retain(|s| src_matches_prefix(s, dotted));
        s
    }
};

For a wildcard query like core.run_svc_query (which is a 2-segment prefix that doesn't name a single directory because entries are stored as core/run_svc_query/main/ and .../worker/), the code falls back to sources_in_root — a cached source listing with a 5-second TTL.

The cache is populated the first time anyone queries that root. Once populated, new source directories created AFTER the cache population are INVISIBLE until the TTL expires. The test inserts 12 entries (creating new src dirs run_svc_query.main and run_svc_query.worker) and waits only 300 ms before querying — well within the 5 s TTL, so the new dirs aren't returned by sources_in_root and the wildcard match retains nothing.

The fast path comment even documents this:

// 1. `sources_in_root` is cached with a 5s TTL. A source dir created
//    AFTER the cache was populated is invisible to queries until the
//    TTL expires — racing against newly-spawned jobs in tests.

Why this is in herolib_core, not hero_proc

The cache lives in herolib_core::logger::query, shared by every hero_log consumer. Any wildcard query on a freshly-created source family will hit the same staleness window across the whole ecosystem (not just hero_proc tests). Fix must land in herolib_core.

Possible fixes (in priority order)

  1. Invalidate sources_in_root cache on log writes — when a new src is registered (a new directory created), bump a counter that the cache reader checks before returning. Most direct.
  2. Shorten the cache TTL drastically (e.g. 200ms) — quick band-aid; impacts hot-query throughput.
  3. Bypass cache for prefix queries — only use the cache for src_prefix = "" (list-all). Anything else walks the directory tree once.
  4. Expose a force_refresh() API — let consumers opt out per query.

Surfaced while fixing the rest of the runs/logs cluster (hero_proc commits land separately).

The structured_logs and wildcard tests individually pass; only sequential runs trigger this. uc15 oneshot logs (hero_proc#117) is a different bug (executor flush ordering, not cache TTL).

cc the herolib_core logger owner.

## Symptom `functional::runs::logs_query_by_service_src_wildcard` passes in a fresh sandbox but fails when run after another test that inserts logs against the same daemon. Specifically: after `structured_logs_all_levels` populates `core.run_log_levels.*` entries, a subsequent wildcard query for `core.run-svc-query.*` (a different src family) returns 0 entries even though the test inserts 12 of them. ## Repro ```bash SANDBOX=$(mktemp -d); mkdir -p $SANDBOX/var/sockets/hero_proc PATH_ROOT=$SANDBOX ./target/release/hero_proc_server & sleep 2 ./target/release/hero_proc_test --filter functional::runs::structured_logs_all_levels # passes ./target/release/hero_proc_test --filter functional::runs::logs_query_by_service_src_wildcard # FAILS (count=0) ./target/release/hero_proc_test --filter functional::runs::logs_query_by_service_src_wildcard # FAILS again on the same daemon ``` ## Root cause In `herolib_core/src/logger/query.rs::search_logs` (lines 138-164), there is a fast path that skips the source-tree walk only when `src_prefix` resolves to an *exact* existing source directory: ```rust let mut sources: Vec<String> = if !query.src_prefix.is_empty() { let dotted = &query.src_prefix; let dir = cfg.root.join(dotted.replace('.', "/")); if dir.is_dir() && dir_has_daily_log(&dir) { vec![dotted.clone()] } else { let mut s = sources_in_root(&cfg.root)?; // ← cached, 5s TTL s.retain(|s| src_matches_prefix(s, dotted)); s } }; ``` For a wildcard query like `core.run_svc_query` (which is a 2-segment prefix that doesn't name a single directory because entries are stored as `core/run_svc_query/main/` and `.../worker/`), the code falls back to `sources_in_root` — a cached source listing with a 5-second TTL. The cache is populated the first time anyone queries that root. Once populated, new source directories created AFTER the cache population are INVISIBLE until the TTL expires. The test inserts 12 entries (creating new src dirs `run_svc_query.main` and `run_svc_query.worker`) and waits only 300 ms before querying — well within the 5 s TTL, so the new dirs aren't returned by `sources_in_root` and the wildcard match retains nothing. The fast path comment even documents this: ``` // 1. `sources_in_root` is cached with a 5s TTL. A source dir created // AFTER the cache was populated is invisible to queries until the // TTL expires — racing against newly-spawned jobs in tests. ``` ## Why this is in `herolib_core`, not `hero_proc` The cache lives in `herolib_core::logger::query`, shared by every hero_log consumer. Any wildcard query on a freshly-created source family will hit the same staleness window across the whole ecosystem (not just hero_proc tests). Fix must land in herolib_core. ## Possible fixes (in priority order) 1. **Invalidate `sources_in_root` cache on log writes** — when a new src is registered (a new directory created), bump a counter that the cache reader checks before returning. Most direct. 2. **Shorten the cache TTL drastically** (e.g. 200ms) — quick band-aid; impacts hot-query throughput. 3. **Bypass cache for prefix queries** — only use the cache for `src_prefix = ""` (list-all). Anything else walks the directory tree once. 4. **Expose a `force_refresh()` API** — let consumers opt out per query. ## Related Surfaced while fixing the rest of the runs/logs cluster (hero_proc commits land separately). The structured_logs and wildcard tests individually pass; only sequential runs trigger this. uc15 oneshot logs (`hero_proc#117`) is a different bug (executor flush ordering, not cache TTL). cc the herolib_core logger owner.
Sign in to join this conversation.
No milestone
No project
No assignees
1 participant
Notifications
Due date
The due date is invalid or out of range. Please use the format "yyyy-mm-dd".

No due date set.

Dependencies

No dependencies set.

Reference
lhumina_code/hero_lib#142
No description provided.