[docs] Add §13 to runbook — Updating an existing deploy via service_complete --update --release #47

Closed
opened 2026-04-30 16:30:49 +00:00 by mik-tf · 1 comment
Owner

Goal

Add a section to docs/ops/DEPLOYMENT.md documenting the canonical "update an existing Hero OS deploy to latest origin/development" flow, so operators have a clear post-deploy update story.

Background

hero_skills@4cb40f6 shipped service_complete --update --release as the canonical "pull all latest source, gentle-build, force-restart everything" entry point. Two operator-relevant fixes are baked in:

  1. Cargo runs under nice -n 19 ionice -c 3 with -j 4 by default, so a build can no longer monopolise the host while live services are running on the same VM (lesson from the herodemo cargo storm of 2026-04-30: load avg 80+ on a 16-CPU VM, demo unresponsive while cargo build -j 16 competed for I/O).
  2. start --update now also passes --reset per service, so phase 2 actually restarts each service to pick up its freshly built binary (was a silent no-op before).
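
As a rough illustration of the first fix, the sketch below prints the kind of low-priority build command described above. The helper name gentle_cargo_cmd is hypothetical, and how hero_skills actually composes the command is an assumption; only the three variable names and their defaults come from this issue.

```shell
# Hypothetical helper: prints a low-priority cargo command line.
# Variable names and defaults are the ones listed in this issue; the way
# hero_skills really assembles the command is assumed, not confirmed.
gentle_cargo_cmd() {
  nice_level="${HERO_CARGO_NICE:-19}"      # CPU niceness, 19 = lowest priority
  ionice_class="${HERO_CARGO_IONICE_C:-3}" # I/O class, 3 = idle
  jobs="${HERO_CARGO_JOBS:-4}"             # parallel build jobs passed to cargo -j
  echo "nice -n ${nice_level} ionice -c ${ionice_class} cargo build --release -j ${jobs}"
}

# Example (not executed here): print the command with a smaller job count.
#   HERO_CARGO_JOBS=2 gentle_cargo_cmd
```

Printing the command instead of running it lets an operator review the effective settings before starting a long build on a live host.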

The mechanism works. It is not yet documented in the runbook.

Acceptance criteria

  • New section in docs/ops/DEPLOYMENT.md (number TBD — likely §13 or a sub-section under §6) titled along the lines of "Updating an existing deploy".
  • The section explains the two-phase model:
    • Phase 1 (per repo): git pull --ff-only origin development, gentle cargo build --release, copy the binary to ~/hero/bin/.
    • Phase 2 (per runtime service): service_X start --reset --update, which forces a re-register and restart.
  • One canonical command block:
    su - driver -c '
      source ~/hero/cfg/init.sh
      nu -c "use ~/hero/code/hero_skills/tools/modules/services *; service_complete --update --release"
    '
    
  • Pre-flight checks documented:
    • hero_proc daemon healthy (RSS bounded, fd count single-digits or low double-digits)
    • Demo URL responsive
  • Verification steps after the run:
    • hero_proc service list — every service green
    • Spot-check a couple of /hero_<svc>/ui/health endpoints return 200
    • Browser refresh — every archipelago tab loads
  • Time budget noted (30-90 min depending on what's drifted; demo stays usable throughout thanks to gentle cargo).
  • Env-var overrides documented:
    • HERO_CARGO_NICE (default 19)
    • HERO_CARGO_IONICE_C (default 3)
    • HERO_CARGO_JOBS (default 4)
  • What to do if phase 1 stops on a single service failure: fix that service's source state, then re-run (forge merge aborts on local uncommitted changes, which is a common cause).
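
The health spot-check from the verification steps could be scripted along these lines. This is a minimal sketch: the helper names, the base URL, and the service names in the usage comment are hypothetical, and only the /hero_<svc>/ui/health path pattern comes from this issue.

```shell
# Build the health URL for one service from a base URL and a service suffix.
# A trailing slash on the base URL is stripped to avoid "//" in the result.
health_url() {
  printf '%s/hero_%s/ui/health\n' "${1%/}" "$2"
}

# Succeeds only if the endpoint answers with HTTP 200 within 10 seconds.
check_health() {
  code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 10 "$(health_url "$1" "$2")")
  [ "$code" = "200" ]
}

# Example (not executed here; base URL and service names are placeholders):
#   for svc in books proc; do
#     check_health "https://demo.example" "$svc" || echo "unhealthy: hero_$svc"
#   done
```

Keeping URL construction separate from the curl call makes the path pattern easy to check without touching the network.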

Out of scope

  • Seeding data (separate issues — see #SEED-MEDIA and #SEED-OSIS).
  • TF-Grid-specific operational quirks (already covered as sidebars in §1-§3).

References

  • hero_skills@4cb40f6 — the canonical update mechanism.
  • hero_proc#81 — the sysmon fd leak fix that made gentle cargo work in practice.

Signed-off-by: mik-tf

Author
Owner

Shipped in 67e5765. §13 "Updating an existing deploy" with subsections 13.1-13.5 (preflight, tunable knobs, verification, single-service path, troubleshooting). Old §13 "Open work" moved to Appendix D as historical changelog.

Signed-off-by: mik-tf
Reference
lhumina_code/hero_demo#47