[ops] Disaster recovery for heronu demo VM — runtime state dump + /data snapshot to object storage #161
## Why
The heronu demo VM (TF Grid freefarm node 1) holds ~12 hours of hand-patching and demo data. If the VM dies, the Mycelium route flaps for days, or the node is reclaimed, recovery today means redoing the entire patch cycle from scratch — even with the `development_mik_nu_demo` branches now pushed (see #160).
Runtime state that is **not** in git:
- `hero_proc action get <name> --format yaml` for every action — env vars, script paths, restart policies, health checks. Some carry the `AIBROKER_API_ENDPOINT`, `HERO_AGENT_ROUTING_MODE`, `FORGEJO_TOKEN` config that makes the demo work.
- `/home/driver/hero/var/**` — OSIS data (business, calendar, projects, media, identity across 5 contexts), hero_books namespaces (hero = 163 docs, geomind = 1733+ docs indexing), embedder HNSW indexes, hero_foundry webdav content.
- `/home/driver/code/docs_*` — 4 cloned doc libraries (~800 MB).
- `~/hero/var/agent/mcp.json` — trimmed to hero_books only.

None of this is reconstructible from the code alone — it is either the result of out-of-band operator commands (`hero_proc action set`) or of a destructive seed-migration flow.
## What Tier 1 disaster recovery looks like
Two artifacts, generated periodically and stored outside the VM, plus an operator runbook:
### 1. Runtime state dump (weekly or post-deploy, ~1 MB)
A single JSON blob capturing every `hero_proc` action definition and the checked-out commit of each deployed repo.
Commit this to `lhumina_code/demo_state` (new repo) or push to an S3-compatible bucket. A recovery script reads it, runs `hero_proc action set` for each entry, and checks out the recorded commits on the new VM.
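A minimal sketch of the dump side, assuming a `hero_proc action list` subcommand exists to enumerate action names (only `action get` and `action set` appear in this issue) and that `jq` plus the Go `yq` are installed on the VM:

```bash
#!/usr/bin/env bash
# Sketch of scripts/ops/dump_demo_state.sh. `hero_proc action list` is an
# assumed subcommand; `jq` and the Go `yq` are assumed to be installed.
set -euo pipefail

out=demo_state.json
tmp=$(mktemp -d)
trap 'rm -rf "$tmp"' EXIT

# One YAML definition per action, straight from the supervisor.
for name in $(hero_proc action list); do
  hero_proc action get "$name" --format yaml > "$tmp/$name.yaml"
done

# Record the checked-out commit of every repo under /home/driver/code.
for gitdir in /home/driver/code/*/.git; do
  dir=$(dirname "$gitdir")
  echo "$(basename "$dir") $(git -C "$dir" rev-parse HEAD)"
done > "$tmp/commits.txt"

# Fold everything into one blob: {"actions": {...}, "commits": {...}}.
actions=$(for f in "$tmp"/*.yaml; do
  yq -o=json '.' "$f" | jq --arg n "$(basename "$f" .yaml)" '{($n): .}'
done | jq -s 'add')
commits=$(jq -Rn '[inputs | split(" ") | {(.[0]): .[1]}] | add' \
  < "$tmp/commits.txt")
jq -n --argjson actions "$actions" --argjson commits "$commits" \
  '{actions: $actions, commits: $commits}' > "$out"

echo "wrote $out"
```

Keeping the per-action YAML verbatim inside the blob means each weekly commit to `demo_state` is diffable, so config drift shows up in review.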
### 2. Data tarball (daily, ~2-5 GB compressed)
A daily tar of `/data` and `/home/driver/hero/var/`, pushed to TF Grid QSFS (preferred, native) or to an S3-compatible bucket reachable via Mycelium. Size will drop significantly once the embedder HNSW indexes are excluded, since they can be regenerated from source.
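A rough cut of the tarball job, assuming `rclone` is configured with a remote named `demo-backups` pointing at the bucket or QSFS gateway (the remote name and the HNSW exclude pattern are assumptions):

```bash
#!/usr/bin/env bash
# Sketch of scripts/ops/backup_demo_data.sh. The rclone remote name and
# the '*hnsw*' exclude pattern are guesses at the real endpoint and layout.
set -euo pipefail

stamp=$(date +%Y%m%d)
archive="/tmp/demo_data_${stamp}.tar.zst"

# Tar the two stateful trees; skip embedder HNSW indexes since they can
# be regenerated from source.
tar --zstd -cf "$archive" \
    --exclude='*hnsw*' \
    /data /home/driver/hero/var/

rclone copy "$archive" demo-backups:heronu/
rm -f "$archive"
```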
### 3. Recovery runbook (in `docs_hero/ops/disaster_recovery.md`)
Step-by-step: provision a new VM via the existing Terraform, `apt install` system deps from a fixed list, `git clone` all repos at the recorded commits, download and extract the data tarball, replay the `hero_proc` action dump, and verify with smoke tests.
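The replay step might look like the sketch below. It assumes the `demo_state.json` produced above, and that `hero_proc action set` can read a YAML definition from stdin; that stdin form is an assumption, since only the bare `action set` invocation appears in this issue:

```bash
#!/usr/bin/env bash
# Sketch of the replay step in scripts/ops/restore_demo.sh. The stdin
# form of `hero_proc action set` is an assumption, not confirmed here.
set -euo pipefail

state=demo_state.json

# Pin every repo to its recorded commit.
jq -r '.commits | to_entries[] | "\(.key) \(.value)"' "$state" |
while read -r repo commit; do
  git -C "/home/driver/code/$repo" fetch --all --quiet
  git -C "/home/driver/code/$repo" checkout --quiet "$commit"
done

# Re-apply each action definition.
jq -r '.actions | keys[]' "$state" | while read -r name; do
  jq --arg n "$name" '.actions[$n]' "$state" |
    yq -P '.' - | hero_proc action set "$name" -
done
```

If `action set` overwrites existing entries, re-running this on a half-restored VM is safe, which is what the idempotency requirement in the deliverables asks for.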
## Why this is cheap and high-leverage
## Concrete deliverables
- `scripts/ops/dump_demo_state.sh` — generates the state JSON described above
- `scripts/ops/backup_demo_data.sh` — daily tar + push of `/data` and `/home/driver/hero/var/`
- `scripts/ops/restore_demo.sh` — idempotent replay onto a fresh VM
- `docs_hero/ops/disaster_recovery.md` — operator runbook
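To wire up the cadences above (daily tarball, weekly state dump), a hypothetical `/etc/cron.d` file; `REPO` stands in for wherever the `scripts/ops` checkout actually lives:

```
# /etc/cron.d/heronu-backup: hypothetical wiring, REPO is a placeholder.
0 3 * * *  driver  /home/driver/code/REPO/scripts/ops/backup_demo_data.sh
0 4 * * 0  driver  /home/driver/code/REPO/scripts/ops/dump_demo_state.sh
```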
## Related
- `publicip=true` the nu-shell deploy default #165 (moved to hero_demo#29 — see lhumina_code/hero_demo#29)
- `publicip=true` the nu-shell deploy default #33