snapshot retention: SNAPSHOT_RETENTION=3 not enforced — gc sorts by source mtime #31

Closed
opened 2026-05-25 09:08:14 +00:00 by zaelgohary · 1 comment
Member

Problem

SNAPSHOT_RETENTION = 3 in crates/hero_codescalers_server/src/upgrade.rs:66, but /var/lib/hero_codescalers/rollout-snapshots/ currently holds 4 directories, each ~613 MB (≈2.4 GB).

Root cause

gc_old_snapshots() (line 965) sorts via list_snapshot_dirs() by metadata.modified() then .skip(keep). But create_snapshot() uses rsync -a which preserves the source mtimes. Every snapshot ends up with the same mtime as /home/template/hero/bin itself, so the sort order is filesystem-iteration-dependent and the trim doesn't reliably drop the oldest snapshot.

Observed on herodev:

drwxr-xr-x 1 template template 4088 May 25 08:35 upg_1779691156667_f04d9a5d
drwxr-xr-x 1 template template 4088 May 25 08:35 upg_1779692906594_d858a641
drwxr-xr-x 1 template template 4088 May 25 08:35 upg_1779693559892_22409c1a
drwxr-xr-x 1 template template 4088 May 25 08:35 upg_1779699648123_9e08a279

All four snapshots have the same mtime (08:35 — when the template tree was last built), even though the rollout IDs span hours apart.
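To illustrate why identical mtimes defeat the trim, here is a minimal sketch. It is not the code from upgrade.rs: `pick_for_deletion` and `snapshots` are hypothetical stand-ins, and it assumes the real GC sorts newest-first with a stable sort before `.skip(keep)`. With every key equal, the sort leaves the input untouched, so the directory that gets dropped is simply whichever one the directory listing returned last.

```rust
use std::time::{Duration, SystemTime, UNIX_EPOCH};

/// Hypothetical stand-in for the GC: newest-first by mtime, delete everything past `keep`.
fn pick_for_deletion(mut dirs: Vec<(String, SystemTime)>, keep: usize) -> Vec<String> {
    // `sort_by` is stable: entries with equal mtimes keep their input order.
    dirs.sort_by(|a, b| b.1.cmp(&a.1));
    dirs.into_iter().skip(keep).map(|(name, _)| name).collect()
}

/// Build fake snapshot entries that all share one mtime, as rsync -a produces.
fn snapshots(names: &[&str]) -> Vec<(String, SystemTime)> {
    // Arbitrary value; only the fact that all entries are equal matters here.
    let same_mtime = UNIX_EPOCH + Duration::from_secs(1_700_000_000);
    names.iter().map(|n| (n.to_string(), same_mtime)).collect()
}

fn main() {
    let listing = [
        "upg_1779691156667_f04d9a5d",
        "upg_1779692906594_d858a641",
        "upg_1779693559892_22409c1a",
        "upg_1779699648123_9e08a279",
    ];
    let mut reversed = listing;
    reversed.reverse();

    // Same snapshots, two possible iteration orders, two different victims.
    println!("{:?}", pick_for_deletion(snapshots(&listing), 3));
    println!("{:?}", pick_for_deletion(snapshots(&reversed), 3));
}
```

In the first ordering the sketch selects the newest snapshot for deletion, in the second the oldest, which matches the "filesystem-iteration-dependent" behaviour described above.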

Suggested fix

Replace mtime sort in list_snapshot_dirs() with one of:

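  1. Sort by directory name (most reliable). The id is upg_<timestamp_ms>_<rand_hex>, so a lexical sort on the name gives chronological order without filesystem help, as long as the timestamp keeps the same number of digits (true for 13-digit millisecond timestamps until the year 2286). Parsing the timestamp and comparing it numerically avoids that caveat.
  2. Sort by birth time or ctime. These are two different timestamps: metadata.created() returns the birth time, which is not available on every filesystem or platform, while ctime via MetadataExt is the inode change time, updated whenever the directory's metadata changes. rsync -a preserves neither.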
  3. Touch the dst dir mtime after rsync completes inside create_snapshot().
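Option 1 could look roughly like the sketch below. The helper names `snapshot_timestamp_ms` and `snapshots_to_delete` are hypothetical and do not come from upgrade.rs; the only thing taken from this issue is the `upg_<timestamp_ms>_<rand_hex>` id shape. The sketch parses the timestamp and compares it numerically rather than relying on a lexical sort, and it ignores any entry that does not match the pattern so that unrelated directories are never selected for deletion.

```rust
/// Extract `<timestamp_ms>` from an id shaped like `upg_<timestamp_ms>_<rand_hex>`.
fn snapshot_timestamp_ms(name: &str) -> Option<u64> {
    let rest = name.strip_prefix("upg_")?;
    let (ts, _rand_hex) = rest.split_once('_')?;
    ts.parse().ok()
}

/// Return the snapshot names to delete (oldest first), keeping the `keep` newest.
/// Names that do not parse are skipped, so they are never deleted.
fn snapshots_to_delete(names: &[&str], keep: usize) -> Vec<String> {
    let mut parsed: Vec<(u64, &str)> = names
        .iter()
        .filter_map(|n| snapshot_timestamp_ms(n).map(|ts| (ts, *n)))
        .collect();
    // Newest first; the name breaks ties between ids from the same millisecond.
    parsed.sort_by(|a, b| b.cmp(a));
    let mut doomed: Vec<String> = parsed
        .into_iter()
        .skip(keep)
        .map(|(_, n)| n.to_string())
        .collect();
    doomed.reverse();
    doomed
}

fn main() {
    // The four directories observed on herodev, in arbitrary listing order.
    let names = [
        "upg_1779693559892_22409c1a",
        "upg_1779699648123_9e08a279",
        "upg_1779691156667_f04d9a5d",
        "upg_1779692906594_d858a641",
    ];
    // With keep = 3 this selects only the oldest id, upg_1779691156667_f04d9a5d.
    println!("{:?}", snapshots_to_delete(&names, 3));
}
```

Because the ordering comes from the id itself, the result no longer depends on mtimes or on the order in which the filesystem lists the directories.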

Operational impact

Disk usage grows monotonically until something else clears /var/lib/hero_codescalers/. With 91 cells × 7 MB-ish per binary and 33 services per snapshot, each retained snapshot is hundreds of MB.

Author
Member
7dec1df
Reference
lhumina_code/hero_codescalers#31