Supervisor `hero_proc.db` has no retention or VACUUM scheduling — grows unboundedly on long-lived daemons #131

New issue

Open

opened 2026-05-26 01:15:50 +00:00 by sameh-farouk · 0 comments

sameh-farouk commented

2026-05-26 01:15:50 +00:00

Member

Problem

hero_proc.db (the supervisor's SQLite store for actions, services, jobs, runs, secrets) has no automatic retention policy and no scheduled VACUUM. On any long-lived daemon, the DB grows unboundedly:

Every job execution inserts a row in jobs (with full spec_json, stats, stderr, etc.).
Every run.submit inserts a row in runs + N child jobs.
Deletes (e.g. job_ops::delete_jobs, action.clean_by_tag, system.wipe_all) free pages but don't shrink the file without VACUUM.
An archived INTEGER flag exists on jobs/runs, but no policy ever sets it and no compactor ever acts on it.

This is independent of #126 (test-leakage). Even with zero test pollution, normal supervisor operation accumulates job/run history forever.

Evidence this matters in practice

From #122 autopsy (@mik-tf comment 36585):

352 MB DB after ~2,429 jobs + 45 leaked actions on a single workstation after 1h17m of supervisor uptime.

From #126 (mik-tf):

Recovery is DELETE FROM actions WHERE name LIKE 'sched-%' plus VACUUM, which reduces the database from 352 MB to 909 KB.

A 388× shrink-after-VACUUM means the file is mostly empty pages. SQLite never reclaims this on its own without auto_vacuum=FULL/INCREMENTAL set at DB-creation time, or an explicit VACUUM command.

This is the same architectural failure mode as #87 ("Log-store runaway: 58 GB in 24h") — except the log SQLite has since been moved to the file-based LogStore. The supervisor DB still has it.

Why this contributes to wedge severity

Bigger DB → slower full-table scans in scheduler ticks, run-status queries, service.list. The #122 wedge ran 9,276 invalid-cron evaluations against a 352 MB DB; the CPU-burn pathology amplified with table size. Retention + VACUUM doesn't fix #122 directly (the leak source is #126), but it bounds the worst-case blast radius.

Proposed scope

Retention policy on jobs + runs
- Configurable max_age_days (e.g. 30) and max_rows_per_table (e.g. 100k) — operator-visible, sensible defaults.
- Background compactor (already in supervisor): periodically (every 1h?) archive then delete rows past the threshold.
- Distinguish: terminal/finished rows are candidates; running/pending rows are never touched.
- Use the existing archived INTEGER column or replace it with explicit deletion + an archive table that gets rotated to a separate file (less ambitious: just delete).
Scheduled VACUUM
- Enable PRAGMA auto_vacuum=INCREMENTAL at DB creation (NOTE: this only takes effect on a fresh DB; requires VACUUM once to convert existing DBs).
- Periodic PRAGMA incremental_vacuum from the compactor (cheap; reclaims a few pages at a time).
- Or simpler: explicit VACUUM once per night/week from the compactor — blocking, but acceptable during low-traffic windows.
ANALYZE refresh
- Run ANALYZE after large compactions so the query planner has accurate cardinality stats. Otherwise post-cleanup, indexed queries can degrade until the next natural ANALYZE.
Operator-visible knobs
- system.compactor_status RPC returning last-run, next-run, rows-archived, bytes-reclaimed.
- Surface in the admin UI (/admin/maintenance or similar) alongside DB size + PRAGMA page_count * page_size.

Out of scope (for this issue)

Moving job/run history out of SQLite entirely (would be a different design discussion — see #124 comment 36972 for why SQLite is the right call given the current product shape).
Schema normalization of JSON-blob columns (spec_json, tags_json, deps_json, job_sequence_json). Separate issue if/when query patterns motivate it.
Pool size tuning (separate concern; not retention-related).

Acceptance criteria

A long-running supervisor with continuous job activity has a bounded steady-state DB size.
After a burst (e.g. wipe_all + reload), the file shrinks within one compactor cycle without manual VACUUM.
Tests verify the compactor archives jobs older than max_age_days and that running/pending rows are never touched.
system.compactor_status reflects the most recent run.

Priority

P2 / hygiene. Not blocking #122 (root cause is #126), but the universal class of "long-lived workstation supervisor accumulates unbounded history" is real, and this is the durable fix.

cc @mik-tf — the VACUUM observation in #126 is what crystallized this.

## Problem `hero_proc.db` (the supervisor's SQLite store for `actions`, `services`, `jobs`, `runs`, `secrets`) has no automatic retention policy and no scheduled `VACUUM`. On any long-lived daemon, the DB grows unboundedly: - Every job execution inserts a row in `jobs` (with full `spec_json`, stats, stderr, etc.). - Every `run.submit` inserts a row in `runs` + N child jobs. - Deletes (e.g. `job_ops::delete_jobs`, `action.clean_by_tag`, `system.wipe_all`) free pages but **don't shrink the file** without `VACUUM`. - An `archived INTEGER` flag exists on `jobs`/`runs`, but **no policy ever sets it** and **no compactor ever acts on it**. This is independent of #126 (test-leakage). Even with zero test pollution, normal supervisor operation accumulates job/run history forever. ## Evidence this matters in practice **From #122 autopsy** ([@mik-tf comment 36585](https://forge.ourworld.tf/lhumina_code/hero_proc/issues/122#issuecomment-36585)): > 352 MB DB after ~2,429 jobs + 45 leaked actions on a single workstation after 1h17m of supervisor uptime. **From #126** ([mik-tf](https://forge.ourworld.tf/lhumina_code/hero_proc/issues/126)): > Recovery is `DELETE FROM actions WHERE name LIKE 'sched-%'` plus `VACUUM`, which reduces the database from **352 MB to 909 KB**. A 388× shrink-after-VACUUM means the file is mostly empty pages. SQLite never reclaims this on its own without `auto_vacuum=FULL`/`INCREMENTAL` set at DB-creation time, or an explicit `VACUUM` command. This is the same architectural failure mode as #87 ("Log-store runaway: 58 GB in 24h") — except the log SQLite has since been moved to the file-based `LogStore`. The supervisor DB still has it. ## Why this contributes to wedge severity Bigger DB → slower full-table scans in scheduler ticks, run-status queries, `service.list`. The #122 wedge ran 9,276 invalid-cron evaluations against a 352 MB DB; the CPU-burn pathology amplified with table size. Retention + VACUUM doesn't fix #122 directly (the leak source is #126), but it bounds the worst-case blast radius. ## Proposed scope 1. **Retention policy on `jobs` + `runs`** - Configurable `max_age_days` (e.g. 30) and `max_rows_per_table` (e.g. 100k) — operator-visible, sensible defaults. - Background compactor (already in supervisor): periodically (`every 1h`?) archive then delete rows past the threshold. - Distinguish: terminal/finished rows are candidates; running/pending rows are never touched. - Use the existing `archived INTEGER` column or replace it with explicit deletion + an `archive` table that gets rotated to a separate file (less ambitious: just delete). 2. **Scheduled `VACUUM`** - Enable `PRAGMA auto_vacuum=INCREMENTAL` at DB creation (NOTE: this only takes effect on a fresh DB; requires `VACUUM` once to convert existing DBs). - Periodic `PRAGMA incremental_vacuum` from the compactor (cheap; reclaims a few pages at a time). - Or simpler: explicit `VACUUM` once per night/week from the compactor — blocking, but acceptable during low-traffic windows. 3. **`ANALYZE` refresh** - Run `ANALYZE` after large compactions so the query planner has accurate cardinality stats. Otherwise post-cleanup, indexed queries can degrade until the next natural `ANALYZE`. 4. **Operator-visible knobs** - `system.compactor_status` RPC returning last-run, next-run, rows-archived, bytes-reclaimed. - Surface in the admin UI (`/admin/maintenance` or similar) alongside DB size + `PRAGMA page_count * page_size`. ## Out of scope (for this issue) - Moving job/run history out of SQLite entirely (would be a different design discussion — see [#124 comment 36972](https://forge.ourworld.tf/lhumina_code/hero_proc/issues/124#issuecomment-36972) for why SQLite is the right call given the current product shape). - Schema normalization of JSON-blob columns (`spec_json`, `tags_json`, `deps_json`, `job_sequence_json`). Separate issue if/when query patterns motivate it. - Pool size tuning (separate concern; not retention-related). ## Acceptance criteria - A long-running supervisor with continuous job activity has a bounded steady-state DB size. - After a burst (e.g. wipe_all + reload), the file shrinks within one compactor cycle without manual `VACUUM`. - Tests verify the compactor archives jobs older than `max_age_days` and that `running`/`pending` rows are never touched. - `system.compactor_status` reflects the most recent run. ## Priority P2 / hygiene. Not blocking #122 (root cause is #126), but the universal class of "long-lived workstation supervisor accumulates unbounded history" is real, and this is the durable fix. cc @mik-tf — the VACUUM observation in #126 is what crystallized this.