[ops] Restore herodemo.gent01.grid.tf to fully-functional state — services updated, data populated, demo browseable #46
Goal
Get https://herodemo.gent01.grid.tf/ back to a fully-functional demo state: every archipelago tab shows live content, all services on the latest `origin/development` binaries, demo browseable end-to-end. This is the operational priority, separate from the reproducibility work tracked in the runbook + seed issues.
Why this is now an issue (state at 2026-04-30 ~16:30 UTC)
Today's session fixed multiple real bugs (hero_proc sysmon fd leak hero_proc#81, hero_office X-Forwarded-Proto, photos double-slash, etc.), but the operational disruption (hero_proc daemon restarts during diagnosis, partial sweep failure on `hero_os` install, mass-bounce that didn't include the per-domain `hero_osis_*` services) left the demo in a state where:
- `hero_osis_*` services have stale supervisor state: UI calls return `HTTP 404: Socket 'rpc.sock' not found for 'hero_osis_<X>'` (observed for `hero_osis_base`, `hero_osis_communication`; likely affects all 14).
- (… `local changes detected, committing before pull` error).
Acceptance criteria
- `hero_osis_*` services (immediate fix for 404s):
  - Verify each has `rpc.sock` + `ui.sock` post-restart.
- `hero_os` sweep blocker so phase 1 can complete:
  - `hero_os` source repo on the VM at `~/hero/code0/hero_os`: what local uncommitted changes are there?
  - `origin/development`.
  - `service_complete --update --release` to get the remaining 15 services rebuilt.
- `hero_proc service list`: every service green
Sequencing
This issue stays open until the demo is visibly populated end-to-end. It is independent of the reproducibility issues.
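The `rpc.sock` + `ui.sock` check from the acceptance criteria could be scripted roughly as follows. This is a minimal sketch, not the project's tooling: the `missing_sockets` helper, the `socket_root` argument, and the one-directory-per-service layout are assumptions made for illustration, and the check only confirms that the socket paths exist, not that anything is listening on them.

```python
from pathlib import Path

# Socket names taken from the acceptance criteria above.
REQUIRED_SOCKETS = ("rpc.sock", "ui.sock")


def missing_sockets(socket_root: Path, prefix: str = "hero_osis_") -> dict[str, list[str]]:
    """Map each matching service directory to the required sockets it lacks.

    Assumes (hypothetically) one directory per service under socket_root.
    Only path existence is checked; no connection is attempted.
    """
    missing: dict[str, list[str]] = {}
    for service_dir in sorted(socket_root.glob(f"{prefix}*")):
        if not service_dir.is_dir():
            continue
        absent = [name for name in REQUIRED_SOCKETS if not (service_dir / name).exists()]
        if absent:
            missing[service_dir.name] = absent
    return missing
```

An empty result would mean every `hero_osis_*` directory under the chosen root has both sockets. The actual socket root on the VM is not stated in this issue, so it has to be supplied by whoever runs the check.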
References
- hero_proc#81: sysmon fd leak fix that started this whole sweep effort
- hero_skills@4cb40f6: gentle cargo + force restart in `service_complete --update`
Signed-off-by: mik-tf
Status update — observations during execution (2026-04-30 PM session)
Sweep ran via `service_complete --update --release` with the gentle-cargo + force-restart fixes from hero_skills@4cb40f6. Phase 1 successfully built and installed binaries for 12 services (proc, router, mycelium, code, codescalers, lib_rhai, embedder, proxy, db, os, osis, collab) before phase 2 stalled on `service_livekit`.
Sysmon fix held perfectly
hero_proc#81 (the sysmon /proc fd leak fix deployed earlier in the session) stayed solid through the whole sweep: `/proc/<pid>/stat` retention regardless of how many service restarts hammered through.
The sweep would not have been survivable without the leak fix; gentle cargo alone would not have prevented the previous OOM trajectory.
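One way to spot-check for retained `/proc/<pid>/stat` handles during a sweep is to count a process's open file descriptors by target. The sketch below is an illustration only: it is Linux-specific, the helper names `fd_targets` and `leaked_stat_handles` are hypothetical, and it is not how hero_proc's sysmon measures this.

```python
import os
from collections import Counter
from pathlib import Path


def fd_targets(pid: int) -> Counter:
    """Count the open file descriptors of `pid`, grouped by link target (Linux only)."""
    counts: Counter = Counter()
    for fd in Path(f"/proc/{pid}/fd").iterdir():
        try:
            counts[os.readlink(fd)] += 1
        except OSError:
            # The descriptor was closed between listing and readlink; skip it.
            continue
    return counts


def leaked_stat_handles(pid: int) -> int:
    """Number of open descriptors whose target looks like a /proc/.../stat file."""
    return sum(
        n
        for target, n in fd_targets(pid).items()
        if target.startswith("/proc/") and target.endswith("/stat")
    )
```

Sampling `leaked_stat_handles` for the daemon's pid before and after a batch of service restarts would show whether the count grows with restarts; reading another user's process requires matching privileges.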
Sweep blocked at `service_livekit start`
Failure trace:
`service_livekit.nu` start path probes Redis on a hardcoded port 6379 (Hero's actual default is 6378), then on probe failure falls through `agent.nu` to invoke `^claude` (Claude Code CLI). Two bugs in one path. Filed as a separate hero_skills issue.
Workaround used today: `hero_proc service start hero_livekit` directly (bypasses the broken nu start logic). hero_livekit came back up healthy on the freshly-built binary.
Remaining 9 services done via manual loop
Because phase 2 stops on first failure, the services after `livekit` (biz, aibroker, logic, slides, whiteboard, indexer, foundry, voice, agent) never got their `start --reset --update`. Resolved with a manual loop on the VM.
Per-domain `hero_osis_*` recovery
A separate symptom surfaced before the sweep: every `hero_osis_<domain>/rpc.sock` was missing, breaking the Contexts, Photos, Biz, and Messages UIs (all 404s on the per-domain sockets). Root cause: the unified `hero_osis` server (which atomically binds all 17 per-domain sockets) had been killed during today's earlier mass-bounce; `hero_osis_ui` was alive but the actual server wasn't. Fixed by `service_osis start --reset`, which restored all per-domain sockets in one shot. All 5 contexts (root, default, geomind, incubaid, threefold) and the underlying OSIS data on disk were intact, so there was no data loss.
Demo throughout
https://herodemo.gent01.grid.tf/ stayed responsive (401 in <1s) for the entire ~3h sweep. Gentle cargo (nice 19 ionice idle -j 4) kept the VM at load avg 1-4, versus the load 80+ from this morning's un-niced cargo storm.
Next operational steps for #46
Related issues spawned today
Signed-off-by: mik-tf