snapshot retention: SNAPSHOT_RETENTION=3 not enforced — gc sorts by source mtime #31
Reference
lhumina_code/hero_codescalers#31
Problem
`SNAPSHOT_RETENTION = 3` in `crates/hero_codescalers_server/src/upgrade.rs:66`, but `/var/lib/hero_codescalers/rollout-snapshots/` currently holds 4 directories, each ~613 MB (≈2.4 GB).
Root cause
`gc_old_snapshots()` (line 965) sorts via `list_snapshot_dirs()` by `metadata.modified()`, then `.skip(keep)`. But `create_snapshot()` uses `rsync -a`, which preserves the source mtimes. Every snapshot ends up with the same mtime as `/home/template/hero/bin` itself, so the sort order is filesystem-iteration-dependent and the trim doesn't reliably drop the oldest snapshot.
Observed on herodev:
All four snapshots have the same mtime (08:35 — when the template tree was last built), even though the rollout IDs span hours apart.
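The effect described above can be reproduced without touching the filesystem. The following is a minimal sketch, not the code in `upgrade.rs`: the `Snap` struct and the helper names are hypothetical, and it assumes the trim sorts newest-first and deletes everything after the first `keep` entries. Because Rust's `sort_by` is stable, entries with equal mtimes stay in whatever order the directory iteration produced them.

```rust
use std::time::{Duration, SystemTime, UNIX_EPOCH};

/// Hypothetical stand-in for one snapshot directory entry.
#[derive(Clone, Debug)]
struct Snap {
    name: String,
    mtime: SystemTime,
}

/// Mimics the described trim: newest first by mtime, keep `keep`, delete the rest.
fn to_delete_by_mtime(mut snaps: Vec<Snap>, keep: usize) -> Vec<String> {
    // sort_by is stable: entries with equal mtime keep their input order.
    snaps.sort_by(|a, b| b.mtime.cmp(&a.mtime));
    snaps.into_iter().skip(keep).map(|s| s.name).collect()
}

/// Builds snapshots that all share one mtime, as `rsync -a` would leave them.
fn same_mtime_snaps(names: &[&str]) -> Vec<Snap> {
    let t = UNIX_EPOCH + Duration::from_secs(1_700_000_000);
    names
        .iter()
        .map(|n| Snap { name: n.to_string(), mtime: t })
        .collect()
}
```

With identical mtimes, feeding the same four names in two different iteration orders selects a different directory for deletion each time, and nothing guarantees that the one selected is the oldest.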
Suggested fix
Replace the mtime sort in `list_snapshot_dirs()` with one of:
- The directory name, `upg_<timestamp_ms>_<rand_hex>`: a lexical sort on the name gives chronological order without filesystem help.
- The directory's creation or change time (`metadata.created()` / `ctime` via `MetadataExt`): set when the directory was created/renamed, not preserved by rsync.
- A timestamp set after `rsync` completes inside `create_snapshot()`.
Operational impact
Disk usage grows monotonically until something else clears `/var/lib/hero_codescalers/`. With 91 cells × 7 MB-ish per binary and 33 services per snapshot, each retained snapshot is hundreds of MB.
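The name-based option from the suggested fix could look roughly like the sketch below. It is an illustration under assumptions, not the project's code: `snapshot_timestamp_ms` and `snapshots_to_delete` are hypothetical names, the input is assumed to be the list of directory names under the snapshot root, and the timestamp is parsed numerically rather than compared lexically (the two agree as long as every timestamp has the same number of digits).

```rust
/// Extracts the millisecond timestamp from a name of the form
/// `upg_<timestamp_ms>_<rand_hex>`; returns None for anything else.
fn snapshot_timestamp_ms(name: &str) -> Option<u64> {
    let rest = name.strip_prefix("upg_")?;
    let (ts, _rand) = rest.split_once('_')?;
    ts.parse().ok()
}

/// Returns the names to delete, keeping the `keep` newest snapshots.
/// Names that do not parse are left alone (never returned).
fn snapshots_to_delete(names: &[String], keep: usize) -> Vec<String> {
    let mut parsed: Vec<(u64, &String)> = names
        .iter()
        .filter_map(|n| snapshot_timestamp_ms(n).map(|ts| (ts, n)))
        .collect();
    // Newest first; ties fall back to the name so the order is deterministic.
    parsed.sort_by(|a, b| b.cmp(a));
    parsed.into_iter().skip(keep).map(|(_, n)| n.clone()).collect()
}
```

Skipping unparseable names is a deliberately conservative choice here: a stray directory is never deleted by the trim, at the cost of not counting toward the retention limit either.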