Supervisor hero_proc.db has no retention or VACUUM scheduling — grows unboundedly on long-lived daemons #131
Problem
`hero_proc.db` (the supervisor's SQLite store for `actions`, `services`, `jobs`, `runs`, `secrets`) has no automatic retention policy and no scheduled `VACUUM`. On any long-lived daemon, the DB grows unboundedly:
- Rows in `jobs` (with full `spec_json`, stats, stderr, etc.).
- `run.submit` inserts a row in `runs` + N child jobs.
- Deletion paths (`job_ops::delete_jobs`, `action.clean_by_tag`, `system.wipe_all`) free pages but don't shrink the file without `VACUUM`.
- An `archived INTEGER` flag exists on `jobs`/`runs`, but no policy ever sets it and no compactor ever acts on it.

This is independent of #126 (test-leakage). Even with zero test pollution, normal supervisor operation accumulates job/run history forever.
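The claim that deletes free pages without shrinking the file can be reproduced outside the supervisor. The sketch below is only an illustration using Python's standard `sqlite3` module against a throwaway database; the `jobs` table here is a stand-in with a single `spec_json` column, not the real hero_proc schema, and `demo_delete_vs_vacuum` is a hypothetical helper name.

```python
import os
import sqlite3
import tempfile


def demo_delete_vs_vacuum(rows: int = 2000, payload_bytes: int = 2000) -> dict:
    """Fill a throwaway table, delete every row, then VACUUM; report file sizes."""
    path = os.path.join(tempfile.mkdtemp(), "demo.db")
    # isolation_level=None puts the connection in autocommit mode, which is
    # required because VACUUM cannot run inside an open transaction.
    conn = sqlite3.connect(path, isolation_level=None)
    conn.execute("CREATE TABLE jobs (id INTEGER PRIMARY KEY, spec_json TEXT)")

    conn.execute("BEGIN")
    conn.executemany(
        "INSERT INTO jobs (spec_json) VALUES (?)",
        (("x" * payload_bytes,) for _ in range(rows)),
    )
    conn.execute("COMMIT")
    size_full = os.path.getsize(path)

    # Deleting moves pages onto the freelist; the file itself stays the same size.
    conn.execute("DELETE FROM jobs")
    size_after_delete = os.path.getsize(path)
    free_pages = conn.execute("PRAGMA freelist_count").fetchone()[0]

    # Only VACUUM rewrites the file and hands the space back to the filesystem.
    conn.execute("VACUUM")
    size_after_vacuum = os.path.getsize(path)
    conn.close()

    return {
        "size_full": size_full,
        "size_after_delete": size_after_delete,
        "free_pages_after_delete": free_pages,
        "size_after_vacuum": size_after_vacuum,
    }


if __name__ == "__main__":
    for key, value in demo_delete_vs_vacuum().items():
        print(f"{key}: {value}")
```

With the default `auto_vacuum=NONE`, the size after `DELETE` should match the full size while `freelist_count` is large, and only the size after `VACUUM` drops.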
Evidence this matters in practice
From #122 autopsy (@mik-tf comment 36585):
From #126 (mik-tf):
A 388× shrink-after-VACUUM means the file is mostly empty pages. SQLite reuses free pages for new rows but never returns them to the filesystem on its own: that requires `auto_vacuum=FULL` (or `auto_vacuum=INCREMENTAL` plus `PRAGMA incremental_vacuum`) set at DB-creation time, or an explicit `VACUUM` command.
This is the same architectural failure mode as #87 ("Log-store runaway: 58 GB in 24h"), except the log SQLite has since been moved to the file-based `LogStore`. The supervisor DB still has it.
Why this contributes to wedge severity
A bigger DB means slower full-table scans in scheduler ticks, run-status queries, and `service.list`. The #122 wedge ran 9,276 invalid-cron evaluations against a 352 MB DB; the CPU-burn pathology amplified with table size. Retention + VACUUM doesn't fix #122 directly (the leak source is #126), but it bounds the worst-case blast radius.
Proposed scope
Retention policy on `jobs` + `runs`
- `max_age_days` (e.g. 30) and `max_rows_per_table` (e.g. 100k): operator-visible, sensible defaults.
- A periodic compactor pass (every 1h?) to archive then delete rows past the threshold.
- Use the existing `archived INTEGER` column, or replace it with explicit deletion + an `archive` table that gets rotated to a separate file (less ambitious: just delete).
Scheduled `VACUUM`
- `PRAGMA auto_vacuum=INCREMENTAL` at DB creation (NOTE: this only takes effect on a fresh DB; requires `VACUUM` once to convert existing DBs).
- `PRAGMA incremental_vacuum` from the compactor (cheap; reclaims a few pages at a time).
- Full `VACUUM` once per night/week from the compactor: blocking, but acceptable during low-traffic windows.
`ANALYZE` refresh
- `ANALYZE` after large compactions so the query planner has accurate cardinality stats. Otherwise, post-cleanup, indexed queries can degrade until the next natural `ANALYZE`.
Operator-visible knobs
- `system.compactor_status` RPC returning last-run, next-run, rows-archived, bytes-reclaimed.
- An admin view (`/admin/maintenance` or similar) alongside DB size + `PRAGMA page_count * page_size`.
Out of scope (for this issue)
- Changes to the JSON columns (`spec_json`, `tags_json`, `deps_json`, `job_sequence_json`). Separate issue if/when query patterns motivate it.
Acceptance criteria
- `VACUUM`.
- `max_age_days` and that `running`/`pending` rows are never touched.
- `system.compactor_status` reflects the most recent run.
Priority
P2 / hygiene. Not blocking #122 (root cause is #126), but the universal class of "long-lived workstation supervisor accumulates unbounded history" is real, and this is the durable fix.
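To make the proposed scope concrete, here is a rough sketch of a single compactor pass: age-based deletion that skips `running`/`pending` rows, then `PRAGMA incremental_vacuum`, then `ANALYZE`. It is written in Python against the standard `sqlite3` module purely for illustration and is not the supervisor's implementation. The `jobs(id, status, created_at, spec_json)` layout, the `created_at` column holding Unix seconds, and the function names are all assumptions. `max_rows_per_table` and archiving are left out for brevity.

```python
import sqlite3
import time

# Assumption: rows in these states must never be deleted by retention.
ACTIVE_STATUSES = ("running", "pending")


def ensure_incremental_auto_vacuum(conn: sqlite3.Connection) -> None:
    """Convert an existing DB to auto_vacuum=INCREMENTAL (needs one full VACUUM)."""
    mode = conn.execute("PRAGMA auto_vacuum").fetchone()[0]  # 0=NONE, 1=FULL, 2=INCREMENTAL
    if mode != 2:
        conn.execute("PRAGMA auto_vacuum = INCREMENTAL")
        conn.execute("VACUUM")  # rewrites the file so the new mode takes effect


def db_bytes(conn: sqlite3.Connection) -> int:
    """Database size as page_count * page_size."""
    page_count = conn.execute("PRAGMA page_count").fetchone()[0]
    page_size = conn.execute("PRAGMA page_size").fetchone()[0]
    return page_count * page_size


def compact_once(conn: sqlite3.Connection, max_age_days: int = 30, now=None) -> dict:
    """One compactor pass over the hypothetical `jobs` table.

    `conn` is expected to be opened with isolation_level=None (autocommit).
    """
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400
    before = db_bytes(conn)

    # Retention: delete old rows, but never ones that are still active.
    cur = conn.execute(
        "DELETE FROM jobs WHERE created_at < ? AND status NOT IN (?, ?)",
        (cutoff, *ACTIVE_STATUSES),
    )
    rows_deleted = cur.rowcount

    # Return freelist pages to the filesystem; fetchall() drives the
    # statement to completion.
    conn.execute("PRAGMA incremental_vacuum").fetchall()
    after = db_bytes(conn)

    # Refresh planner statistics after the bulk delete.
    conn.execute("ANALYZE")

    return {
        "last_run": now,
        "rows_deleted": rows_deleted,
        "bytes_reclaimed": before - after,
    }
```

The returned dict is shaped after the fields suggested for `system.compactor_status`. A nightly or weekly full `VACUUM` would be a separate, blocking call on the same connection.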
cc @mik-tf — the VACUUM observation in #126 is what crystallized this.