snapshot retention: SNAPSHOT_RETENTION=3 not enforced — gc sorts by source mtime #31


Closed

opened 2026-05-25 09:08:14 +00:00 by zaelgohary · 1 comment

zaelgohary commented

2026-05-25 09:08:14 +00:00

Member

Problem

SNAPSHOT_RETENTION = 3 is set in crates/hero_codescalers_server/src/upgrade.rs:66, but /var/lib/hero_codescalers/rollout-snapshots/ currently holds 4 directories of ~613 MB each (≈2.4 GB in total).

Root cause

gc_old_snapshots() (line 965) takes the list from list_snapshot_dirs(), which sorts by metadata.modified(), and then applies .skip(keep). But create_snapshot() uses rsync -a, which preserves the source mtimes. Every snapshot directory therefore ends up with the same mtime as /home/template/hero/bin itself. The sort compares equal keys, so the resulting order depends on filesystem iteration order and the trim does not reliably drop the oldest snapshot.

Observed on herodev:

drwxr-xr-x 1 template template 4088 May 25 08:35 upg_1779691156667_f04d9a5d
drwxr-xr-x 1 template template 4088 May 25 08:35 upg_1779692906594_d858a641
drwxr-xr-x 1 template template 4088 May 25 08:35 upg_1779693559892_22409c1a
drwxr-xr-x 1 template template 4088 May 25 08:35 upg_1779699648123_9e08a279

All four snapshots have the same mtime (08:35, when the template tree was last built), even though the timestamps embedded in the rollout IDs span about 2 hours 21 minutes.

Suggested fix

Replace the mtime sort in list_snapshot_dirs() with one of:

1. Sort by directory name (most reliable). The id is upg_<timestamp_ms>_<rand_hex>, so a lexical sort on the name gives chronological order without filesystem help.
2. Sort by creation or change time: metadata.created() (birth time, where the platform and filesystem support it) or ctime via MetadataExt (inode change time). Neither is preserved by rsync, but ctime also changes on a later chmod, chown, or rename, so it is less robust than option 1.
3. Touch the dst dir mtime after rsync completes inside create_snapshot().
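A minimal sketch of option 1, assuming the upg_<timestamp_ms>_<rand_hex> naming described above. The helper names snapshot_timestamp_ms and snapshots_to_delete are hypothetical and do not come from upgrade.rs. The sketch parses the embedded timestamp rather than comparing names lexically, so it does not depend on the timestamp keeping a fixed digit count, and it skips entries that do not match the pattern instead of deleting them.

```rust
/// Extract the millisecond timestamp from a snapshot id such as
/// `upg_1779691156667_f04d9a5d`. Returns None for names that do not
/// follow the assumed `upg_<timestamp_ms>_<rand_hex>` pattern.
fn snapshot_timestamp_ms(name: &str) -> Option<u64> {
    let rest = name.strip_prefix("upg_")?;
    let (ts, _rand) = rest.split_once('_')?;
    ts.parse().ok()
}

/// Given the directory names found in the snapshot root, return the names
/// that should be deleted so that only the `keep` newest snapshots remain.
/// Names that do not parse are ignored (never selected for deletion).
fn snapshots_to_delete(names: &[String], keep: usize) -> Vec<String> {
    let mut dated: Vec<(u64, &String)> = names
        .iter()
        .filter_map(|n| snapshot_timestamp_ms(n).map(|ts| (ts, n)))
        .collect();

    // Newest first. Comparing the whole tuple breaks timestamp ties on the
    // name, so the order is deterministic regardless of iteration order.
    dated.sort_by(|a, b| b.cmp(a));

    dated
        .into_iter()
        .skip(keep)
        .map(|(_, n)| n.clone())
        .collect()
}
```

Fed the four directory names observed on herodev with keep = 3, this selects only upg_1779691156667_f04d9a5d for deletion, whatever order the names arrive in.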

Operational impact

Disk usage grows monotonically until something else clears /var/lib/hero_codescalers/. With 91 cells × 7 MB-ish per binary and 33 services per snapshot, each retained snapshot is hundreds of MB.


zaelgohary commented

2026-05-25 12:49:18 +00:00

Author

Member

7dec1df
