Phase 10 — Production keys prep + operator runbook (s2-012) #11

Open
opened 2026-05-21 19:37:58 +00:00 by mik-tf · 0 comments
Owner

Summary

Phase 10 of #1. Pre-flight gate for production deployment: validate that every payment/KYC/VM/login surface is shaped for production before a deploy lands. No live charges or KYC verifications — those stay on the launch-day go/no-go checklist.

Landed in s2-012

13 files +987 LOC across worktree hero_onboarding-track-agent-2/ on branch track-agent-2/phase-10-prod-keys, squash-merged to development.

is_production() -> bool on every provider/config surface (~80 LOC)

  • PaymentProvider trait method (default false); impls on StripeProvider (sk_live_* + webhook secret) + ClickPesaProvider (creds + non-sandbox api_url + non-empty webhook_url).
  • KycProvider trait method (default false); impl on IdenfyProvider (no demo/dev escape hatch + creds + webhook secret + non-sandbox api_url).
  • Provisioner trait method (default false); impl on PoolAssignmentProvisioner (delegates to pure-fn helper pool_assignment_is_production(demo, pool_size) — unit-testable without OSIS).
  • Free function forge_oauth::is_production(&ForgeOAuthConfig) (non-localhost base_url + non-loopback redirect_uri + non-placeholder creds).
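
The opt-in pattern described above can be sketched as follows. This is a minimal illustration, not the code that landed: the trait is trimmed to two methods, the struct field names, the `name()` method, and `MockProvider` are hypothetical, and the argument types and the body of `pool_assignment_is_production` are assumptions (only the function name and its two parameter names come from this issue).

```rust
/// Hypothetical, trimmed-down payment trait. The real trait has more methods.
pub trait PaymentProvider {
    fn name(&self) -> &'static str;

    /// Defaults to `false`, so a provider has to opt in explicitly.
    fn is_production(&self) -> bool {
        false
    }
}

/// Field names are assumptions made for this sketch.
pub struct StripeProvider {
    pub secret_key: String,
    pub webhook_secret: String,
}

impl PaymentProvider for StripeProvider {
    fn name(&self) -> &'static str {
        "stripe"
    }

    /// Production-shaped means a live key plus a non-empty webhook secret.
    fn is_production(&self) -> bool {
        self.secret_key.starts_with("sk_live_") && !self.webhook_secret.is_empty()
    }
}

/// Hypothetical provider that never overrides the default.
pub struct MockProvider;

impl PaymentProvider for MockProvider {
    fn name(&self) -> &'static str {
        "mock"
    }
}

/// Pure helper with no storage dependency, so it can be unit-tested directly.
/// Assumed rule: the demo flag is off and the pool has at least one VM.
pub fn pool_assignment_is_production(demo: bool, pool_size: usize) -> bool {
    !demo && pool_size > 0
}
```

Because the default is `false`, a newly added provider that forgets to implement the check is reported as not production-shaped rather than silently passing.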

--check-prod-config validator (~130 LOC)

  • New flag on hero_onboarding_server: builds the full provider set (reusing the existing build_* functions, no duplication), calls is_production() on each registered surface, prints one line per surface with an explicit OK/FAIL/SKIP prefix + reason, prints verdict: READY or verdict: NOT READY, and exits 0 (READY) or 1 (NOT READY).
  • Passthrough flag on hero_onboarding CLI: execs ~/hero/bin/hero_onboarding_server --check-prod-config and proxies the exit code. Both forms work; the runbook documents both.
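
The reporting and verdict logic can be pictured roughly like this. The `Status`, `SurfaceReport`, `is_ready`, and `render` names are hypothetical, and the exact column layout of the output lines is an assumption; only the OK/FAIL/SKIP prefixes, the two verdict strings, and the 0/1 exit codes come from this issue.

```rust
/// Outcome for one surface. `Skip` means the surface was not configured at all.
#[derive(Debug, Clone, PartialEq)]
pub enum Status {
    Ok,
    Fail(String),
    Skip(String),
}

pub struct SurfaceReport {
    pub surface: &'static str,
    pub status: Status,
}

impl SurfaceReport {
    /// One output line per surface: prefix, surface name, optional reason.
    pub fn line(&self) -> String {
        match &self.status {
            Status::Ok => format!("OK   {}", self.surface),
            Status::Fail(why) => format!("FAIL {}: {}", self.surface, why),
            Status::Skip(why) => format!("SKIP {}: {}", self.surface, why),
        }
    }
}

/// AND across all reports; a SKIP counts against readiness just like a FAIL.
/// Treating an empty report list as not ready is an assumption of this sketch.
pub fn is_ready(reports: &[SurfaceReport]) -> bool {
    !reports.is_empty() && reports.iter().all(|r| r.status == Status::Ok)
}

/// Returns the full text to print and the process exit code (0 or 1).
pub fn render(reports: &[SurfaceReport]) -> (String, i32) {
    let mut out: Vec<String> = reports.iter().map(SurfaceReport::line).collect();
    let ready = is_ready(reports);
    out.push(format!("verdict: {}", if ready { "READY" } else { "NOT READY" }));
    (out.join("\n"), if ready { 0 } else { 1 })
}
```

A caller would print the returned text and pass the code to `std::process::exit`, which is what lets the CLI passthrough proxy the result unchanged.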

Operator runbook (docs/operator-runbook.md, ~532 LOC)

Per-environment config matrix for dev / staging / prod × 5 trait surfaces = 15 rows. Layout: quick comparison table → per-surface section (purpose, required slots, hero_proc keys + env-var fallbacks, how-to-get-credentials, sandbox vs prod, gotchas) → --check-prod-config usage → launch-day go/no-go checklist → per-verdict troubleshooting → reference key list.

Production Forge OAuth section marked TBD — production hostname for hero_onboarding is not yet locked. Until then, --check-prod-config reports forge_oauth FAIL because the dev redirect URI still loops back to 127.0.0.1. Runbook documents the "register prod OAuth app once hostname is locked" follow-up as a launch-day item.
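
To illustrate why a 127.0.0.1 redirect URI fails the check, here is a std-only sketch of a loopback test. The `ForgeOAuthConfig` field names, the helper functions, and the list of placeholder markers are assumptions; the URL parsing is deliberately simplified (no userinfo handling) and is not the parser the project uses.

```rust
use std::net::IpAddr;

/// Field names are assumptions made for this sketch.
pub struct ForgeOAuthConfig {
    pub base_url: String,
    pub redirect_uri: String,
    pub client_id: String,
    pub client_secret: String,
}

/// Extracts the host of an absolute URL using only std (simplified).
fn host_of(url: &str) -> Option<&str> {
    let rest = url.split_once("://")?.1;
    let authority = rest.split(|c: char| c == '/' || c == '?' || c == '#').next()?;
    if let Some(stripped) = authority.strip_prefix('[') {
        // Bracketed IPv6 literal such as [::1]:8080.
        return stripped.split_once(']').map(|(host, _)| host);
    }
    let host = authority.split(':').next()?;
    if host.is_empty() { None } else { Some(host) }
}

/// True for "localhost" and for any loopback IP literal (127.0.0.0/8, ::1).
fn is_loopback_host(host: &str) -> bool {
    host.eq_ignore_ascii_case("localhost")
        || host.parse::<IpAddr>().map(|ip| ip.is_loopback()).unwrap_or(false)
}

/// Placeholder markers here are illustrative guesses.
fn is_placeholder(value: &str) -> bool {
    let v = value.trim();
    v.is_empty() || v.eq_ignore_ascii_case("changeme")
}

/// Remote base URL, remote redirect URI, and real-looking credentials.
pub fn is_production(cfg: &ForgeOAuthConfig) -> bool {
    let remote = |url: &str| host_of(url).map(|h| !is_loopback_host(h)).unwrap_or(false);
    remote(&cfg.base_url)
        && remote(&cfg.redirect_uri)
        && !is_placeholder(&cfg.client_id)
        && !is_placeholder(&cfg.client_secret)
}
```

Under these assumptions a config whose redirect URI is `http://127.0.0.1:8080/...` returns `false` even when the base URL and credentials are production-grade, which matches the FAIL described above.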

Smoke (scripts/smoke_prod_config.sh, 32 checks)

Drives --check-prod-config across 13 env-var combos, with 32 individual assertions across the per-surface output prefixes, the final verdict, and the exit code:

  • all-skip → exit 1 with 5 SKIPs
  • Stripe sandbox key → FAIL
  • Stripe live key without webhook secret → FAIL
  • Stripe live + webhook → OK
  • ClickPesa sandbox URL → FAIL
  • KYC demo flag → FAIL
  • KYC sandbox URL → FAIL
  • KYC live creds + non-sandbox URL → OK
  • Provisioner demo flag → FAIL
  • Provisioner real pool → OK
  • Forge OAuth localhost → FAIL
  • Forge OAuth remote → OK
  • all five surfaces OK → exit 0 + READY

Acceptance gates

  • cargo test --workspace 77/77 (62 from s2-011 + 15 new is_production unit tests: 2 in payment.rs, 5 in kyc.rs, 3 in provisioner.rs, 5 in forge_oauth.rs).
  • lab build --release --install --workspace VICTORY 3/3 (27.2s, build #13).
  • lab infocheck 3/3 clean / 0 findings.
  • cargo fmt --check clean (after autofix on 3 multi-arg println sites + one function signature).
  • cargo clippy --workspace --all-targets -- -D warnings clean.
  • scripts/smoke_prod_config.sh 32/32 GREEN.
  • Regression check on 16 representative routes (/, /login, /dashboard, /account, /login/forge/start, /vm/list, /kyc/start, plus all 7 admin-secret-gated endpoints with + without secret, /logout): all return the expected status codes, so the additive trait method is transparent to existing route handlers.

Note on existing smoke scripts: the workstation was under heavy load (load avg ~38) during the regression window, and the 6s server-startup wait in the existing scripts/smoke_*.sh scripts was too tight under that load (each hero_proc secret.get that returned empty took ~700ms instead of ~50ms). This is environmental, not a code regression: the direct curl-based route regression above confirms behavior. The smoke scripts pass cleanly on a less-loaded host (and passed at s2-011's acceptance).

No D-NN or L-NN minted

Phase 10 is purely additive, with no design lock-in. The check semantics (per-surface is_production + AND across the registered set + SKIP-counts-as-NOT-READY) are the obvious shape. ID slots stay at D-19 / L-09.

What --check-prod-config does NOT cover

  • No external API calls (Stripe live key could still be revoked; verify on dashboard).
  • No pool-VM reachability check (operator should SSH-probe each VM before adding to pool).
  • No live Forge OAuth round-trip (use scripts/smoke_forge_oauth.sh against a mock; live test with prod OAuth app is a go/no-go item).
  • No OSIS db writability probe (first allocate or find_or_create_user exercises that).

The launch-day go/no-go checklist in §5 of the runbook covers these gaps.

Open follow-ups

  • Prod hostname for hero_onboarding: when this locks, the operator registers a prod Forge OAuth app in the forge.ourworld.tf admin panel and drops the credentials into hero_proc secrets (context onboarding). --check-prod-config then reports forge_oauth OK instead of FAIL.
  • scripts/smoke_payments.sh and friends are timing-flaky under load: bump the server-startup wait window from seq 1 30 × 0.2s (6s) to seq 1 90 × 0.2s (18s), or add an explicit --ready-probe server flag that prints READY on stdout once the server is listening. Out of scope for s2-012.
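
The wait-window arithmetic and the polling idea can be sketched as below. The existing smoke scripts are shell; this is only a Rust rendering of the same loop, with hypothetical function names, and it probes a TCP port rather than using the proposed --ready-probe flag (which does not exist yet).

```rust
use std::net::{SocketAddr, TcpStream};
use std::thread::sleep;
use std::time::Duration;

/// Sleep budget for `attempts` polls spaced `interval` apart,
/// e.g. 30 polls at 200ms is 6s and 90 polls at 200ms is 18s.
pub fn wait_budget(attempts: u32, interval: Duration) -> Duration {
    interval * attempts
}

/// Polls until something accepts TCP connections on `addr`.
/// Returns `true` on the first successful connect, `false` once every
/// attempt has failed. A refused connection on loopback returns quickly,
/// so the total time is close to `wait_budget` in the failing case.
pub fn wait_until_listening(addr: SocketAddr, attempts: u32, interval: Duration) -> bool {
    for _ in 0..attempts {
        if TcpStream::connect(addr).is_ok() {
            return true;
        }
        sleep(interval);
    }
    false
}
```

Polling until the port answers lets a fast host proceed immediately while a loaded host uses the longer budget, so widening the window does not slow down the normal case.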

Next

s2-013 Phase 11: Refund posture + multi-currency (Q#8 + Q#9). RELEASE_REFUNDS_ENABLED=true env flag (default false; the freezone-aligned no-refund posture stays the default). Billing.balance_by_currency: map<str, i64> for multi-currency rollup. D-19 candidate if the refund posture locks in as load-bearing.
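
Since Phase 11 is not designed yet, the following is only a speculative sketch of what the two pieces might look like. The function names are hypothetical, the strict "only an explicit true enables refunds" parsing rule is an assumption, and so is the reading of the i64 values as amounts in the currency's minor unit.

```rust
use std::collections::BTreeMap;

/// Interprets the raw flag value. Anything other than an explicit "true"
/// (case-insensitive, surrounding whitespace ignored) keeps refunds off,
/// so an unset variable preserves the no-refund default.
pub fn refunds_enabled(raw: Option<&str>) -> bool {
    matches!(raw.map(str::trim), Some(v) if v.eq_ignore_ascii_case("true"))
}

/// Rolls up (currency code, amount) entries into one total per currency.
/// Amounts are assumed to be integers in the currency's minor unit.
pub fn balance_by_currency(entries: &[(&str, i64)]) -> BTreeMap<String, i64> {
    let mut totals = BTreeMap::new();
    for (currency, amount) in entries {
        *totals.entry(currency.to_string()).or_insert(0) += *amount;
    }
    totals
}
```

A caller could feed the flag from the environment with `refunds_enabled(std::env::var("RELEASE_REFUNDS_ENABLED").ok().as_deref())`. Keeping one integer total per currency avoids both floating-point rounding and any implicit exchange-rate conversion.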

Reference
lhumina_code/hero_onboarding#11