logs.get returns empty for completed jobs — likely log-flush race in executor (uc15 + runs::structured_logs/wildcard cluster) #117

New issue

Closed

opened 2026-05-21 13:40:37 +00:00 by sameh-farouk · 1 comment

sameh-farouk commented

2026-05-21 13:40:37 +00:00

Member

Symptom

A service_define_oneshot job that completes successfully has empty logs.get output, even when the script body explicitly echos a recognizable marker. uc15_oneshot_service_runs_to_completion in hero_proc_test exhibits this:

c.service_define_oneshot(ServiceDefineOneshotInput {
    name: "svc15".into(),
    exec: "echo ONESHOT_DONE; exit 0".into(),
    opts: None,
})
// ... service.start ... poll until !running ... service.children → job_id ...
let logs = c.job_logs(JobLogsInput { id: job_id, ... }).await?.value;
anyhow::ensure!(
    logs.iter().any(|l| l.line.contains("ONESHOT_DONE")),
    "logs must contain ONESHOT_DONE"
);

After today's 477889a fix (which got the service-define+children plumbing working), uc15 now reaches the log-assertion step but the logs vector is empty. The job ran (state transitioned to inactive/succeeded), the script DID exit 0, but logs.get returns nothing.

Repro

# After 477889a is in HEAD
SANDBOX=/tmp/uc15_repro; rm -rf $SANDBOX; mkdir -p $SANDBOX/{var/sockets/hero_proc,var/db}
PATH_ROOT=$SANDBOX ./target/release/hero_proc_server &
sleep 2
PATH_ROOT=$SANDBOX ./target/release/hero_proc_test --filter functional::uc_12_16::uc15_oneshot_service_runs_to_completion
# → FAIL "logs must contain ONESHOT_DONE"

Likely scope

Smells like a log-flush ordering bug: the executor marks the job terminal before the log sink finishes flushing the stdout buffer to SQLite. Related symptoms in the same test run:

runs::structured_logs_all_levels — level=0 (debug) should have >=1 log entry, got 0
runs::logs_query_by_service_src_wildcard — wildcard src query should return >=12 entries, got 0

All three are "wrote logs, can't read them back" failures. Worth investigating as a cluster — possibly the same root cause.

Surface to investigate

crates/hero_proc_server/src/supervisor/executor.rs::run_job — where stdout/stderr are piped and apply_exit_status is called. Verify the log sink is awaited/flushed before phase transitions to terminal.
crates/hero_proc_server/src/logging/ — LogSink / LogStore flush semantics.
crates/hero_proc_server/src/rpc/logs.rs (if exists) — what logs.get actually reads.

Why I didn't bundle this with #113

#113 was service.define interpreter + service.children job_id — both in rpc/service.rs + service/define.rs. This is logging/ + supervisor/executor.rs — different surface, different root cause, different reviewer (likely @maintainer for the executor half).

Surfaced while landing #113. Worth its own ticket.

## Symptom A `service_define_oneshot` job that completes successfully has empty `logs.get` output, even when the script body explicitly `echo`s a recognizable marker. `uc15_oneshot_service_runs_to_completion` in `hero_proc_test` exhibits this: ```rust c.service_define_oneshot(ServiceDefineOneshotInput { name: "svc15".into(), exec: "echo ONESHOT_DONE; exit 0".into(), opts: None, }) // ... service.start ... poll until !running ... service.children → job_id ... let logs = c.job_logs(JobLogsInput { id: job_id, ... }).await?.value; anyhow::ensure!( logs.iter().any(|l| l.line.contains("ONESHOT_DONE")), "logs must contain ONESHOT_DONE" ); ``` After today's `477889a` fix (which got the service-define+children plumbing working), `uc15` now reaches the log-assertion step but the logs vector is empty. The job ran (state transitioned to inactive/succeeded), the script DID exit 0, but `logs.get` returns nothing. ## Repro ```bash # After 477889a is in HEAD SANDBOX=/tmp/uc15_repro; rm -rf $SANDBOX; mkdir -p $SANDBOX/{var/sockets/hero_proc,var/db} PATH_ROOT=$SANDBOX ./target/release/hero_proc_server & sleep 2 PATH_ROOT=$SANDBOX ./target/release/hero_proc_test --filter functional::uc_12_16::uc15_oneshot_service_runs_to_completion # → FAIL "logs must contain ONESHOT_DONE" ``` ## Likely scope Smells like a log-flush ordering bug: the executor marks the job terminal before the log sink finishes flushing the stdout buffer to SQLite. Related symptoms in the same test run: - `runs::structured_logs_all_levels` — `level=0 (debug) should have >=1 log entry, got 0` - `runs::logs_query_by_service_src_wildcard` — `wildcard src query should return >=12 entries, got 0` All three are "wrote logs, can't read them back" failures. Worth investigating as a cluster — possibly the same root cause. ## Surface to investigate - `crates/hero_proc_server/src/supervisor/executor.rs::run_job` — where stdout/stderr are piped and `apply_exit_status` is called. Verify the log sink is awaited/flushed before phase transitions to terminal. - `crates/hero_proc_server/src/logging/` — `LogSink` / `LogStore` flush semantics. - `crates/hero_proc_server/src/rpc/logs.rs` (if exists) — what `logs.get` actually reads. ## Why I didn't bundle this with #113 #113 was `service.define` interpreter + `service.children` job_id — both in `rpc/service.rs` + `service/define.rs`. This is `logging/` + `supervisor/executor.rs` — different surface, different root cause, different reviewer (likely @maintainer for the executor half). Surfaced while landing #113. Worth its own ticket.

mahmoud referenced this issue

2026-06-08 15:49:57 +00:00

chore(server): remove dead code + cruft (#138 cleanup) #144

mahmoud referenced this issue from a commit

2026-06-08 15:59:26 +00:00

chore(server): remove dead code + cruft (#138 cleanup) (#144)

mahmoud commented

2026-06-08 16:25:40 +00:00

Owner

Closing as superseded — uc15_oneshot_service_runs_to_completion and the runs::structured_logs/wildcard cluster now pass (test-level read-after-write polling added during the suite hardening). The underlying root cause (the logger has no drain/flush-sync API) is tracked in #141; the symptom-level repro here is resolved.

Closing as superseded — `uc15_oneshot_service_runs_to_completion` and the `runs::structured_logs`/wildcard cluster now pass (test-level read-after-write polling added during the suite hardening). The underlying root cause (the logger has no drain/flush-sync API) is tracked in #141; the symptom-level repro here is resolved.

mahmoud closed this issue

2026-06-08 16:25:40 +00:00