feat: idempotent seeding — add stable SIDs to all mock seed files #2

Merged
mik-tf merged 2 commits from development_idempotent_seeding into development 2026-02-12 02:34:39 +00:00
Owner

Problem

Every time hero_osis restarts with --seed-dir, it creates duplicate records for all seeded entities. This is because seed TOML files have no sid field, so seed_domain() injects sid = "0000" (global_id=0), which causes db.set() to generate a brand new random SID on every restart.

Visible impact: the Companies list (and every other entity list) gains one more copy of each seeded entity on every restart.

Root Cause

TOML file (no sid) → seed_domain() injects sid="0000" → SmartId { global_id: 0 }
→ db.set() sees global_id==0 → id_new() → new random SID → INSERT (never upsert)
→ restart → repeat → duplicates accumulate
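
The flow above can be reproduced with a small toy model. This is only a sketch under stated assumptions: `MockDb` and `seed` are hypothetical stand-ins rather than the hero_osis or hero_lib code, a plain counter starting at 2 stands in for `id_new()`, and SIDs are assumed to be plain base36 encodings of `global_id`.

```python
import itertools


class MockDb:
    """Toy stand-in for the store behind db.set(); NOT the real implementation."""

    def __init__(self):
        self.records = {}
        # Assumption: a counter starting at 2 stands in for id_new().
        self._next_id = itertools.count(2)

    def set(self, global_id, record):
        if global_id == 0:
            # No usable ID supplied: allocate a fresh one, so this is always an insert.
            global_id = next(self._next_id)
        # With a non-zero ID this line overwrites in place (upsert).
        self.records[global_id] = record
        return global_id


def seed(db, seed_records):
    """Hypothetical seeding loop: a missing sid falls back to "0000"."""
    for record in seed_records:
        sid = record.get("sid", "0000")
        db.set(int(sid, 36), record)  # "0000" decodes to global_id 0


db = MockDb()
seeds = [{"_type": "Company", "name": "Andreessen Horowitz"}]  # no sid field
for _ in range(3):  # first boot plus two restarts
    seed(db, seeds)

print(len(db.records))  # 3
```

Each pass through `seed` adds another copy of the same company because the fallback ID of 0 never matches an existing key.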

Solution

Add a stable sid field to all 608 mock seed TOML files. Example:

_type = "Company"
sid = "s001"
name = "Andreessen Horowitz"
...

The s-prefix SIDs (base36 starting at global_id=1,306,369) are safely above the auto-increment range (starting at global_id=2). Since db.set() already performs an upsert when global_id != 0, restarting the server now overwrites existing records instead of creating duplicates.

SID Convention

  • Format: s001, s002, ..., s00a, s00b, ... (base36, s-prefixed)
  • Sequential per (context, domain, entity type) group
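  • Headroom: auto-assigned IDs start at global_id=2 and seed SIDs start at global_id=1,306,369, so roughly 1.3M user-created records per entity type are needed before any collision risk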
  • Script included at scripts/add_seed_sids.py for reproducibility
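
The arithmetic behind the convention can be checked with a few lines of Python. This sketch assumes a SID is simply the base36 encoding of `global_id`, zero-padded to four characters, which matches the `s001` = 1,306,369 figure quoted in this PR. The helper names are illustrative and are not taken from `scripts/add_seed_sids.py`.

```python
DIGITS = "0123456789abcdefghijklmnopqrstuvwxyz"


def sid_to_global_id(sid: str) -> int:
    """Decode a base36 SID string into its integer global_id."""
    return int(sid, 36)


def global_id_to_sid(global_id: int, width: int = 4) -> str:
    """Encode an integer as lowercase base36, left-padded with zeros."""
    out = ""
    while global_id:
        global_id, remainder = divmod(global_id, 36)
        out = DIGITS[remainder] + out
    return out.rjust(width, "0")


def seed_sid(index: int) -> str:
    """Return the index-th seed SID of a group (1-based)."""
    return global_id_to_sid(sid_to_global_id("s000") + index)


print(sid_to_global_id("s001"))      # 1306369
print(seed_sid(9), seed_sid(10))     # s009 s00a
print(sid_to_global_id("s001") - 2)  # 1306367 IDs between the two ranges
```

The `s` digit has value 28, so `s001` is 28 × 36³ + 1 = 1,306,369.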

Changes

  • 608 TOML files: +1 line each (sid = "sXXX" after _type)
  • 1 new script: scripts/add_seed_sids.py
  • Zero code changes — relies entirely on existing db.set() upsert behavior
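
For readers who want a feel for the per-file change, here is a minimal sketch of inserting a `sid` line directly after `_type`. It is not the contents of `scripts/add_seed_sids.py`, which is not reproduced here. `add_sid_line` is a hypothetical helper, and it works on raw text lines rather than parsing TOML, so it assumes `_type` and `sid` are simple top-level keys.

```python
import re

# Line-based patterns; assumes simple `key = value` lines at the top level.
SID_LINE = re.compile(r"^[ \t]*sid[ \t]*=", re.MULTILINE)
TYPE_LINE = re.compile(r"^([ \t]*_type[ \t]*=.*)$", re.MULTILINE)


def add_sid_line(toml_text: str, sid: str) -> str:
    """Insert `sid = "<sid>"` after the first `_type` line if no sid exists yet."""
    if SID_LINE.search(toml_text):
        return toml_text  # already has a sid: leave the file untouched
    new_text, count = TYPE_LINE.subn(
        lambda m: f'{m.group(1)}\nsid = "{sid}"', toml_text, count=1
    )
    if count == 0:
        raise ValueError("no _type line found")
    return new_text


before = '_type = "Company"\nname = "Andreessen Horowitz"\n'
after = add_sid_line(before, "s001")
print(after)
```

Running the helper a second time on `after` returns it unchanged, so re-running such a script would not stack extra `sid` lines.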

Follow-up: Seed Mode (create-only vs upsert)

Currently, seeding always upserts, so user edits to seeded entities are overwritten on every restart. A future enhancement would add a --seed-mode flag:

  • upsert (current behavior): always overwrite seed data on restart
  • create-only: skip seeding if the entity already exists, preserving user edits

This requires changes in both hero_lib (skip-if-exists logic in seed_domain()) and hero_osis (CLI flag + passthrough). Tracked as separate issues:

  • hero_lib: seed_domain() create-only mode
  • hero_osis: --seed-mode CLI flag
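
The difference between the two proposed modes can be sketched as follows. Nothing here exists yet: `seed_record` is a hypothetical function, a plain dict stands in for the database, and the edited name is an invented example value.

```python
def seed_record(store: dict, global_id: int, record: dict, mode: str = "upsert") -> bool:
    """Write one seed record; return True if the store was modified.

    mode="upsert":      always overwrite (the behavior described in this PR)
    mode="create-only": skip when the ID already exists (proposed follow-up)
    """
    if mode not in ("upsert", "create-only"):
        raise ValueError(f"unknown seed mode: {mode}")
    if mode == "create-only" and global_id in store:
        return False  # keep whatever is stored, including user edits
    store[global_id] = dict(record)
    return True


store = {}
seed = {"_type": "Company", "name": "Andreessen Horowitz"}
seed_record(store, 1306369, seed)    # first boot: record created
store[1306369]["name"] = "a16z"      # a user edits the seeded entity

seed_record(store, 1306369, seed, mode="create-only")
print(store[1306369]["name"])        # a16z (edit preserved)

seed_record(store, 1306369, seed, mode="upsert")
print(store[1306369]["name"])        # Andreessen Horowitz (edit overwritten)
```

A create-only mode trades freshness for safety: later changes to the seed files would no longer reach records that already exist.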
feat: add stable SIDs to all mock seed files for idempotent seeding
Some checks failed
Build and Test / build (pull_request) Failing after 1m25s
Build and Test / build (push) Failing after 1m49s
01f8e63fe7
Add sid field to all 608 TOML seed files under data/mock/ to prevent
duplicate records from being created on every server restart.

Problem:
Seed files had no sid field, so seed_domain() injected sid="0000"
(global_id=0), causing db.set() to generate a new random SID on every
restart. This created duplicate entities each time hero_osis started.

Solution:
Each seed file now has a stable sid (e.g. sid="s001") assigned
sequentially per (context, domain, type) group. The 's' prefix places
these at global_id ~1.3M, safely above the auto-increment range which
starts at global_id=2. Since db.set() upserts when global_id != 0,
restarting the server now overwrites existing records instead of
creating duplicates.

Convention:
- SIDs use base36 format starting from 's001' (global_id=1,306,369)
- Sequential per entity type within each context/domain
- 1.3M+ user creates per type needed before any collision
- Included scripts/add_seed_sids.py for reproducibility

No code changes - relies entirely on existing db.set() upsert behavior.
mik-tf changed title from feat: idempotent seeding — add stable SIDs to all mock seed files to WIP: feat: idempotent seeding — add stable SIDs to all mock seed files 2026-02-11 15:20:43 +00:00
Author
Owner

E2E Test Results

Tested locally with a full clean-slate boot + restart cycle to confirm idempotent seeding.

Test Setup

  1. Stopped all running hero_osis / hero_zero processes
  2. Cleaned hero_redis data directory (fresh empty DB)
  3. Copied the modified seed files (with stable SIDs) into the runtime seed directory
  4. Built hero_osis from source (cargo build --release --no-default-features --features all-domains)
  5. Started hero_redis (port 6666) and hero_osis (--seed-dir ... --contexts default)

Test 1: First Boot (seed into empty DB)

All 608 seed files ingested. Queried company.list + company.get across 3 contexts via JSON-RPC:

| Context   | Companies | Duplicates |
|-----------|-----------|------------|
| geomind   | 10        | 0          |
| threefold | 9         | 0          |
| incubaid  | 13        | 0          |

All SIDs unique, all company names unique within each context.

Test 2: Restart (re-seed into existing DB)

Stopped hero_osis, restarted it (same --seed-dir flag, same data). Seeding ran again automatically on startup. Queried the same endpoints:

| Context   | Boot 1 | Boot 2 | Match |
|-----------|--------|--------|-------|
| geomind   | 10     | 10     | OK    |
| threefold | 9      | 9      | OK    |
| incubaid  | 13     | 13     | OK    |

Counts identical. Zero duplicates. Seeding is idempotent.

How It Works

  • Each seed TOML now has a stable sid field (e.g. sid = "s001")
  • seed_domain() in herolib-osis sees the existing sid, passes it through to db.set()
  • SmartId::parse("s001") yields global_id = 1,306,369 (non-zero)
  • db.set() with non-zero global_id performs an upsert instead of generating a new ID
  • On restart, same SID = same record updated in place, no duplicates created
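
The two-boot check can be mirrored in a small simulation. This is a sketch, not the test that was actually run: the company counts come from the tables above, the company names are placeholders, and the in-memory dict keyed by (context, type, global_id) is an assumption based on SIDs being sequential per group and therefore reused across contexts.

```python
from collections import Counter

DIGITS = "0123456789abcdefghijklmnopqrstuvwxyz"
COMPANY_COUNTS = {"geomind": 10, "threefold": 9, "incubaid": 13}  # from the tables above


def to_base36(n: int) -> str:
    """Encode a non-negative integer as lowercase base36."""
    out = ""
    while n:
        n, r = divmod(n, 36)
        out = DIGITS[r] + out
    return out or "0"


def build_seed_files():
    """Placeholder seed records with stable SIDs s001, s002, ... per context."""
    files = []
    for context, count in COMPANY_COUNTS.items():
        for i in range(1, count + 1):
            sid = "s" + to_base36(i).rjust(3, "0")
            files.append((context, {"_type": "Company", "sid": sid, "name": f"company-{i}"}))
    return files


def boot(store, seed_files):
    """Seed every file; a stable non-zero ID means the same key is rewritten."""
    for context, record in seed_files:
        key = (context, record["_type"], int(record["sid"], 36))
        store[key] = record
    return Counter(context for context, _, _ in store)


store = {}
seed_files = build_seed_files()
boot1 = dict(boot(store, seed_files))
boot2 = dict(boot(store, seed_files))
print(boot1 == boot2 == COMPANY_COUNTS)  # True
```

Because every record maps to the same key on both boots, the second pass only overwrites and the per-context counts stay the same.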

Cleanup

After testing, stopped all processes and restored original seed files from backup. No changes were made to the branch during testing — the PR is exactly as committed.

mik-tf changed title from WIP: feat: idempotent seeding — add stable SIDs to all mock seed files to feat: idempotent seeding — add stable SIDs to all mock seed files 2026-02-11 15:30:59 +00:00
mik-tf force-pushed development_idempotent_seeding from 01f8e63fe7
Some checks failed
Build and Test / build (pull_request) Failing after 1m25s
Build and Test / build (push) Failing after 1m49s
to 15ec978f86
Some checks failed
Build and Test / build (pull_request) Failing after 1m23s
2026-02-12 02:34:28 +00:00
mik-tf merged commit cedf6ccfad into development 2026-02-12 02:34:39 +00:00
Reference
lhumina_code/hero_osis!2