[ops] Disaster recovery for heronu demo VM — runtime state dump + /data snapshot to object storage #161
## Why
The heronu demo VM (TF Grid freefarm node 1) holds ~12 hours of hand-patching and demo data. If the VM dies, the Mycelium route flaps for days, or the node is reclaimed, recovery today means redoing the entire patch cycle from scratch — even with the `development_mik_nu_demo` branches now pushed (see #160).
Runtime state that is **not** in git:
- `hero_proc action get <name> --format yaml` for every action — env vars, script paths, restart policies, health checks. Some carry the `AIBROKER_API_ENDPOINT`, `HERO_AGENT_ROUTING_MODE`, `FORGEJO_TOKEN` config that makes the demo work.
- `/home/driver/hero/var/**` — OSIS data (business, calendar, projects, media, identity across 5 contexts), hero_books namespaces (hero = 163 docs, geomind = 1733+ docs indexing), embedder HNSW indexes, hero_foundry webdav content.
- `/home/driver/code/docs_*` — 4 cloned doc libraries (~800 MB).
- `~/hero/var/agent/mcp.json` — trimmed to hero_books only.

None of this is reconstructible from the code alone — it is either the result of out-of-band operator commands (`hero_proc action set`) or of a destructive seed-migration flow.
## What Tier 1 disaster recovery looks like
Two artifacts, generated periodically and stored outside the VM, plus an operator runbook:
### 1. Runtime state dump (weekly or post-deploy, ~1 MB)
A single JSON blob capturing every `hero_proc` action definition and the checked-out commit of each deployed repo.
Commit this to `lhumina_code/demo_state` (new repo) or push to an S3-compatible bucket. A recovery script reads it, runs `hero_proc action set` for each entry, and checks out the recorded commits on the new VM.
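A minimal sketch of the dump side, assuming a `hero_proc action list` subcommand exists to enumerate action names (only `action get` and `action set` appear in this issue) and that `jq` plus the Go `yq` are installed on the VM:

```bash
#!/usr/bin/env bash
# Sketch of scripts/ops/dump_demo_state.sh. `hero_proc action list` is an
# assumed subcommand; `jq` and the Go `yq` are assumed to be installed.
set -euo pipefail

out=demo_state.json
tmp=$(mktemp -d)
trap 'rm -rf "$tmp"' EXIT

# One YAML definition per action, straight from the supervisor.
for name in $(hero_proc action list); do
  hero_proc action get "$name" --format yaml > "$tmp/$name.yaml"
done

# Record the checked-out commit of every repo under /home/driver/code.
for gitdir in /home/driver/code/*/.git; do
  dir=$(dirname "$gitdir")
  echo "$(basename "$dir") $(git -C "$dir" rev-parse HEAD)"
done > "$tmp/commits.txt"

# Fold everything into one blob: {"actions": {...}, "commits": {...}}.
actions=$(for f in "$tmp"/*.yaml; do
  yq -o=json '.' "$f" | jq --arg n "$(basename "$f" .yaml)" '{($n): .}'
done | jq -s 'add')
commits=$(jq -Rn '[inputs | split(" ") | {(.[0]): .[1]}] | add' \
  < "$tmp/commits.txt")
jq -n --argjson actions "$actions" --argjson commits "$commits" \
  '{actions: $actions, commits: $commits}' > "$out"

echo "wrote $out"
```

Keeping the per-action YAML verbatim inside the blob means each weekly commit to `demo_state` is diffable, so config drift shows up in review.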
### 2. Data tarball (daily, ~2-5 GB compressed)
A daily tar of `/data` and `/home/driver/hero/var/`, pushed to TF Grid QSFS (preferred, native) or to an S3-compatible bucket reachable via Mycelium. Size will drop significantly once the embedder HNSW indexes are excluded, since they can be regenerated from source.
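A rough cut of the tarball job, assuming `rclone` is configured with a remote named `demo-backups` pointing at the bucket or QSFS gateway (the remote name and the HNSW exclude pattern are assumptions):

```bash
#!/usr/bin/env bash
# Sketch of scripts/ops/backup_demo_data.sh. The rclone remote name and
# the '*hnsw*' exclude pattern are guesses at the real endpoint and layout.
set -euo pipefail

stamp=$(date +%Y%m%d)
archive="/tmp/demo_data_${stamp}.tar.zst"

# Tar the two stateful trees; skip embedder HNSW indexes since they can
# be regenerated from source.
tar --zstd -cf "$archive" \
    --exclude='*hnsw*' \
    /data /home/driver/hero/var/

rclone copy "$archive" demo-backups:heronu/
rm -f "$archive"
```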
### 3. Recovery runbook (in `docs_hero/ops/disaster_recovery.md`)
Step-by-step: provision a new VM via the existing Terraform, `apt install` system deps from a fixed list, `git clone` all repos at the recorded commits, download and extract the data tarball, replay the `hero_proc` action dump, and verify with smoke tests.
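The replay step might look like the sketch below. It assumes the `demo_state.json` produced above, and that `hero_proc action set` can read a YAML definition from stdin; that stdin form is an assumption, since only the bare `action set` invocation appears in this issue:

```bash
#!/usr/bin/env bash
# Sketch of the replay step in scripts/ops/restore_demo.sh. The stdin
# form of `hero_proc action set` is an assumption, not confirmed here.
set -euo pipefail

state=demo_state.json

# Pin every repo to its recorded commit.
jq -r '.commits | to_entries[] | "\(.key) \(.value)"' "$state" |
while read -r repo commit; do
  git -C "/home/driver/code/$repo" fetch --all --quiet
  git -C "/home/driver/code/$repo" checkout --quiet "$commit"
done

# Re-apply each action definition.
jq -r '.actions | keys[]' "$state" | while read -r name; do
  jq --arg n "$name" '.actions[$n]' "$state" |
    yq -P '.' - | hero_proc action set "$name" -
done
```

If `action set` overwrites existing entries, re-running this on a half-restored VM is safe, which is what the idempotency requirement in the deliverables asks for.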
## Why this is cheap and high-leverage
## Concrete deliverables
- `scripts/ops/dump_demo_state.sh` — generates the state JSON described above
- `scripts/ops/backup_demo_data.sh` — daily tar + push of `/data` and `/home/driver/hero/var/`
- `scripts/ops/restore_demo.sh` — idempotent replay onto a fresh VM
- `docs_hero/ops/disaster_recovery.md` — operator runbook
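To wire up the cadences above (daily tarball, weekly state dump), a hypothetical `/etc/cron.d` file; `REPO` stands in for wherever the `scripts/ops` checkout actually lives:

```
# /etc/cron.d/heronu-backup: hypothetical wiring, REPO is a placeholder.
0 3 * * *  driver  /home/driver/code/REPO/scripts/ops/backup_demo_data.sh
0 4 * * 0  driver  /home/driver/code/REPO/scripts/ops/dump_demo_state.sh
```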
## Related
- `publicip=true` the nu-shell deploy default #165 (moved to hero_demo#29 — see lhumina_code/hero_demo#29)
- `publicip=true` the nu-shell deploy default #33