[infra][P1] CI-built static-musl binaries + --from-ci install path — make deploys minutes, not hours #54
Summary
Today's deploy on herodemo took ~3 hours wall-clock and exposed five deploy-blocking bugs (hero_router#81, hero_proc#91, hero_skills#186, hero_collab#42, and a `JobCreateInput` regression at hero_embedder), each found at compile time on a fresh build. None of them would have reached production if a CI pipeline had built statically-linked binaries on commit and the VM had simply downloaded them.
This issue proposes adding a CI-built artifact path to the install pipeline, as an additional option, not a replacement for the current build-on-VM path.
Current model: build on each VM
`service_X install --update --release` runs `forge merge` (git pull) then `cargo build --release` on the VM. Pros: simple, debuggable, no extra infrastructure. Cons:
- A cold `service_install_all` cascade is 30-60 min.
- `/data` filled before the build even finished a couple of times.
- Path leaks (`/Volumes/T7` in hero_router#81) are invisible until someone tries to build on a different machine.
Proposed: two-path install
Defaults shift over time:
`--from-source` while we're rolling out, then `--from-ci` once we trust the artifacts and have a fallback story.
`--from-ci` resolves like this:
- Pick the commit (`--commit <sha>`, or `forge_url HEAD`, or `latest tag`).
- Download the asset named `<service>-<commit>-x86_64-musl`.
- `chmod +x`, drop in `~/hero/bin/`.
Total time per service: ~5-10 seconds (download + chmod + swap), down from 30-60 seconds warm / 5-30 minutes cold.
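
The "chmod + swap" part of that flow can be sketched as below. This is a minimal illustration, not the helper that exists in `lib.nu`: the function name `install_artifact` and the temp-file-then-rename approach are assumptions made for the example, and the download itself is left out.

```shell
#!/bin/sh
# Hypothetical helper: install an already-downloaded artifact into a bin dir.
# Assumption: the swap is done by writing a temp file and renaming it.
install_artifact() {
    src=$1      # path to the downloaded file
    bin_dir=$2  # target directory, e.g. "$HOME/hero/bin"
    name=$3     # final binary name, e.g. hero_proc

    mkdir -p "$bin_dir" || return 1
    tmp="$bin_dir/.$name.tmp.$$"
    cp "$src" "$tmp" || return 1
    chmod +x "$tmp" || { rm -f "$tmp"; return 1; }
    # A rename inside one directory replaces the old file in a single step,
    # so nothing ever observes a half-written binary.
    mv -f "$tmp" "$bin_dir/$name"
}
```

Called as `install_artifact /tmp/download "$HOME/hero/bin" hero_proc`, it leaves either the old binary or the new one in place, never a partial copy.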
Static linking is the real fix
Container matching is a coping strategy — "make the deploy environment match the build environment" punts the problem instead of solving it. Statically-linked musl binaries solve it: same bytes run on any Linux x86_64 (or aarch64), no glibc compatibility table, no "build env must match deploy env," shippable to anywhere a kernel runs.
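
One way to confirm that an artifact really is static is to look at what `file` says about it. The sketch below only classifies a description string; the function name `is_static_elf` is made up for illustration, and the two phrases it matches are the ones that appear in `file` output for static binaries.

```shell
#!/bin/sh
# Returns 0 when a `file` description indicates a statically linked ELF.
# Hypothetical helper; it matches the phrases "static-pie linked" and
# "statically linked" and treats everything else as not static.
is_static_elf() {
    case "$1" in
        *ELF*"static-pie linked"*|*ELF*"statically linked"*) return 0 ;;
        *) return 1 ;;
    esac
}
```

A deploy or CI step could then run `is_static_elf "$(file -b ./hero_proc)" || exit 1` before publishing or installing the binary.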
Per-service classification:
- Easy tier: … `bundled`, rustls already in the dep tree (no openssl); `cargo build --target x86_64-unknown-linux-musl --release` should work directly.
- Hard tier (`hero_voice`, `hero_embedder`): (a) ship `libonnxruntime.so` next to the binary (still relocatable + distributable, just not single-file), (b) statically link ONNX against musl (extra work, possibly upstream-PR territory), (c) keep these two on glibc as a "near-static" exception. Pragmatic call depends on demo-target.
- … `docker run` model.
Existing infrastructure
The CI side is already partly built. From the skills index:
- `forge-release-workflow`: Forgejo workflow that builds Linux binaries (amd64 musl, optionally arm64 gnu) on tag push and uploads to Releases.
- `forge_release`: Forgejo Releases management.
- `forge_package`: binary publishing to the forge.ourworld.tf packages registry.
- `build_lib`: build system library for Hero projects.
- `forge_docker_publish`: for the OnlyOffice-style cases.
The missing piece is the deploy side: `service_X install --from-ci` and the resolution / download / verify logic.
Storage and retention
- … `<service>-<sha>` or git tag.
- `x86_64-musl` first; `aarch64-gnu` or `aarch64-musl` once Hero deploys to ARM.
Rollout plan
- … `forge-release-workflow`, build musl binaries on every push to `development`, upload as `hero_proc-<sha>-x86_64-musl`.
- … `--from-ci` install path in service_proc.nu, behind an opt-in flag. Existing `--from-source` stays default.
- … `--from-ci` path and we've verified rollback works, flip the default in service_install_all.
Out of scope
- `--from-source` stays forever as the dev-iteration / debug / disaster-recovery path.
Tradeoffs to acknowledge
- `--from-source` always works, and once an artifact is in Forgejo Releases, it's there even if CI is currently broken.
Cross-refs
- `/Volumes/T7` macOS path leak (would have been caught at CI build time, never reached deploy)
- `service_lib_rhai` rename gap (CI deploy would have surfaced this in pre-prod, not in the live deploy)
This is a multi-week project to roll out properly across the stack, but the ROI is clear: today's 3-hour deploy with 5 bugs becomes a 5-minute deploy, with that class of bug caught at CI build time, once the artifact pipeline is the default path.
Rollout sketch (post-demo, not blocking)
Parking some implementation thinking here so whoever picks this up has a starting point. Not the priority right now — demo work comes first.
Order of operations
The install-side change is the high-leverage piece. Every `service_X.nu install` today is a thin wrapper around `svc_install` in `lib.nu`. So `--from-ci` lives in `svc_install`: one change, and all 17 services that go through that helper inherit the new path.
Effort estimate per easy-tier service
Total for 16 easy-tier services: ~30-50 focused hours spread across PRs. Realistic calendar: ~1 day for lib.nu, ~1 week for hero_proc pilot end-to-end, ~2-3 weeks fanning out.
Gotchas to flag upfront
- Cargo.lock must be committed for reproducible CI builds. Today hero_embedder has `Cargo.lock` gitignored, which is exactly the source of the lockfile-drift bug we hit during this session. CI building from a moving lockfile is non-reproducible. Step zero on every service: ensure Cargo.lock is committed.
- musl-incompatible crates lurk. Most pure-Rust deps work, but anything that links to openssl-sys, libsqlite3-sys without `bundled`, native-tls, etc. needs swapping. The first service we put through CI reveals the pattern.
- Build perf: 16 services on the same Forgejo Actions runner feels slow without sccache. Hero already has `sccache.nu` patterns; wire that into the CI workflow once the basics work.
- Default-flip discipline: don't flip `--from-ci` to default until rollback works. "Deploy went bad, redeploy commit Y" must grab the CI artifact for Y, not rebuild it. `--from-source` stays the always-available out.
- Per-service ownership: 16 PRs touching 16 repos means coordinating with whoever owns each repo. Worth a heads-up before the rollout starts so per-service maintainers can flag musl-incompatibility cases ahead of time.
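
The first two gotchas lend themselves to a quick pre-flight check per repo. The sketch below is an illustration only: `musl_suspects` is a hypothetical name, and the crate list is just the three crates named above, not an exhaustive set. A hit on `libsqlite3-sys` still needs a manual look at whether the `bundled` feature is enabled.

```shell
#!/bin/sh
# Prints crates from a Cargo.lock that commonly block musl builds.
# Hypothetical helper; returns 2 if the lockfile is missing, which is
# itself a sign that Cargo.lock is not committed.
musl_suspects() {
    lockfile=$1
    if [ ! -f "$lockfile" ]; then
        echo "missing: $lockfile (is Cargo.lock committed?)" >&2
        return 2
    fi
    # Cargo.lock lists each package as a line of the form: name = "crate"
    grep -E '^name = "(openssl-sys|native-tls|libsqlite3-sys)"$' "$lockfile" \
        | sed -E 's/^name = "(.*)"$/\1/' | sort -u
}
```

Running `musl_suspects ./Cargo.lock` in a repo before porting its workflow gives an early list of dependencies to swap or reconfigure.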
When to start
After the demo ships and the team has cycles. Until then, today's deploy reality (build-on-VM, ~5-10 min warm-cache cycle once we're past the cold start) is workable.
2026-05-02 — Picking this up. Fresh audit + hero_proc pilot plan.
Context
Session 53 priority shifted to this issue. The 3-hour deploy + 5 deploy-blocking bugs at session 52 made the cost of staying on the build-on-VM model concrete. Goal for this session: prove the loop end-to-end on `hero_proc`: tag → CI → static-musl artifact in Forgejo Releases → `service_proc install --from-ci` on a fresh VM. Once one service works, the rest is mechanical.
Related: coopcloud/circle_ops#773. The `Set up job` zombie-network failure mode that was forcing CI re-runs is fixed as of 2026-04-28 (peter deployed cleanup script + expanded address pools 256→4352 networks + Prometheus zombie alerts). Re-runs should no longer be needed.
Fresh CI audit (last 5 runs on `development`, 2026-05-02)
Updated state vs 2026-04-25 audit in hero_demo#39:
- … `release.yaml`): hero_router, hero_aibroker, hero_indexer, hero_proxy, hero_db, hero_livekit, hero_os, hero_osis, hero_rpc.
- … `build-linux.yaml` but no release.yaml: they cross-compile musl in CI but don't upload artifacts.
- … `development` at all (the same set #39 flagged).
- … `check`. Worth a look during the rollout.
Why hero_proc is the right pilot
- … `development` for the last 5 runs.
- … `buildenv.sh` and `build-linux.yaml`; the only missing piece is the artifact-publishing `release.yaml`.
- … `--from-source` keeps working.
Pilot plan — concrete steps
Step 1 — Add `release.yaml` to hero_proc.
- … `development_mik_release_artifacts_hero_proc`
- … `buildenv.sh`).
- … `main`. Hero uses `development` everywhere, so the check needs to allow `development` (or be removed). Will fix in the port.
Step 2 — Tag + verify.
- … `v0.x.y-dev` on hero_proc.
- … `forge.ourworld.tf/lhumina_code/hero_proc/releases/tag/v0.x.y-dev` with `<bin>-linux-amd64-musl` for each binary in `$BINARIES`.
- `file <bin>` should report `static-pie linked, stripped`. Smoke-test by running on a fresh container.
Step 3 — `service_proc install --from-ci` in hero_skills.
- … `development_mik_release_artifacts_hero_skills_from_ci`
- … `pkg_url(repo, version, bin)` → resolves Forgejo Releases URL.
- `service_proc.nu install` learns `--from-ci [<version>]`:
- … `latest`).
- … `$BINARIES`, verify checksum (compute from response; Forgejo doesn't sign yet, see #54 §Storage), `chmod +x`, drop in `~/hero/bin/`.
- … `hero_proc service restart hero_proc` (or `service_proc start --reset` since it's not its own service).
- … `--from-source`. `--from-ci` is opt-in.
Step 4 — End-to-end test.
`service_proc install --from-ci` should produce a working hero_proc in <60s (vs 5-30 min cold cargo build).
Branch naming for the rollout
Per-repo branches: `development_mik_release_artifacts_<repo>` (e.g. `_hero_proc`, `_hero_indexer`). Each PR scoped to one repo, gated by green CI.
Out of scope for this session
- … `install_core` work: already partially closed via home#192.
Signed-off-by: mik-tf
2026-05-02 — Pilot landed.
Consumer side merged: hero_skills 3387d284 via PR #193.
Verified end-to-end against the live `lhumina_code/hero_proc/releases/tag/v0.4.4`:
- `$HOME/hero/bin/hero_proc --version` reports `hero_proc 0.4.4`
- … `--reset` work as designed
Side-finding worth recording
The publisher side already exists for far more repos than the 2026-04-25 audit captured. Per a fresh re-audit:
- … `release.yaml` (the canonical hero_router template): hero_router, hero_aibroker, hero_indexer, hero_proxy, hero_db, hero_livekit, hero_os, hero_osis, hero_rpc.
- … `build-linux.yaml` doing the same publish-to-Releases work under a different filename: hero_proc, hero_lib_rhai, hero_biz, hero_books, hero_embedder, hero_voice, hero_browser, hero_browser_mcp, hero_whiteboard, hero_foundry, hero_foundry_ui, hero_aibroker (overlap), hero_indexer (overlap), hero_db (overlap), hero_osis (overlap).
Union of distinct repos publishing artifacts on tag push: ~20. The cosmetic rename (`build-linux.yaml` → `release.yaml`) and naming-convention sweep is a separate cleanup; it doesn't gate consumer rollout.
What's next
- … `service_proc install --from-ci` end-to-end on a real environment, register + start through the supervisor, confirm full lifecycle. Specs match herodemo so the same VM can graduate to demo duty once CI-paved deploys are proven.
- … `--from-ci` into `service_router`, `service_aibroker`, `service_indexer`, `service_proxy`, `service_db`, `service_osis`, `service_books`, `service_biz`, `service_collab`, `service_foundry`, `service_logic`, `service_archipelagos`, `service_lib`, `service_code`. The helper in `lib.nu` is already generic; each module needs a one-line wire-up.
- `service_install_all --from-ci`: once individual services work, lift the flag to the whole-stack installer.
- … `hero_voice` + `hero_embedder`. Bundle the `.so` next to the binary or stay on glibc as a documented exception.
- … `build-linux.yaml` → `release.yaml` per #39's canonical naming.
Closing as stale
Signed-off-by: mik-tf
2026-05-02 — Pilot smoke-tested green on a fresh TFGrid VM.
Provisioned `heroci.gent01.grid.tf` (16 vCPU / 32 GB / 200 GB / 16 GB rootfs, public IPv4 + Mycelium fallback) and ran `service_proc install --from-ci --version v0.4.4` end-to-end. All three binaries downloaded, verified ELF, installed, and report the correct version.
Wall-clock numbers
- `tofu apply` (full VM provision)
- `apt install` baseline (curl/wget/git/file/ca-certs)
- `git clone --depth 1 hero_skills`
- `service_proc install --from-ci --version v0.4.4`
For reference: session 52's source-build hero_proc deploy took ~10 minutes of cold cargo build. Cold-cache full `service_install_all` was 30-60 min. Pilot delivers the speedup #54 called for.
Behavior verified
- … `latest → v0.4.4`)
- … `--version v0.4.4`)
- `ELF 64-bit … static-pie linked, stripped` (run on a fresh Ubuntu 24.04 with no toolchain dependency)
- `hero_proc --version` and `hero_proc_server --version` both report `0.4.4`
- `--reset` correctly forces refetch
Strategic rollout plan (next sessions)
Phase 1 (~1-2 sessions): wire `--from-ci` into the 14 easy-tier service modules with working CI. Each module is a ~4-line patch (the helper in `lib.nu` is already generic and works across repos). Group as 3-4 PRs of 4-5 services each. Smoke-test each batch on heroci.
Services: `service_router`, `service_aibroker`, `service_db`, `service_foundry`, `service_biz`, `service_books`, `service_whiteboard`, `service_proxy`, `service_osis`, `service_indexer`, `service_browser`, `service_slides`, `service_matrixchat`, `service_editor`.
Phase 2 (~1 session): fix CI on 4 currently-broken/missing repos (`hero_collab`, `hero_logic`, `hero_codescalers`, `hero_livekit`) by porting the canonical hero_router release.yaml per hero_demo#39. Then wire `--from-ci` into their service modules.
Phase 3 (~0.5 session): wire `--from-ci` into `service_install_all`, the strategic payoff. Whole-stack deploy from CI artifacts: minutes, not hours.
Out of scope for this rollout:
- … `hero_voice`, `hero_embedder`): deferred pending bundling decision (this issue §hard-tier)
- … `hero_os`): different shape, separate pipeline
- … `hero_office`): third-party stack stays containerized
Followups filed during this session
- … `development`
Signed-off-by: mik-tf
mik-tf referenced this issue 2026-05-02 14:04:26 +00:00
2026-05-03 — Phase 1 partial: 5 services live with `--from-ci`, blocker surfaced
What landed
hero_skills#195 merged at `a13c9ef0`. Adds `--from-ci` to 4 more service modules, mirroring the pilot pattern in `service_proc.nu`:
- `service_router` (asset suffix `linux-amd64-musl`, hero_router v0.2.2)
- `service_proxy` (asset suffix `linux-amd64-musl`, hero_proxy v0.5.0)
- `service_db` (asset suffix `linux-amd64`, hero_db v0.3.2)
- `service_indexer` (asset suffix `linux-amd64-musl`, hero_indexer v0.1.3)
Verified end-to-end on heroci.gent01.grid.tf: all 10 binaries land in `~/hero/bin/`, sized 1.3-11 MB, `hero_router --version` reports `0.2.1`, server/UI binaries launch.
Live `--from-ci` coverage so far: 5 services (`hero_proc` from pilot + the 4 above).
Blocker for the rest of Phase 1
The originally-planned 14 easy-tier services do NOT all have a usable release on the forge today. Concretely, of the 9 services NOT yet wired:
- `service_aibroker`: `SVX_BINARIES` includes `hero_aibroker_services` (added since v0.1.0 via commit 591e071); latest release would 404. Needs re-tag.
- `service_osis`: `buildenv.sh` adds `hero_osis_seed` after the tag; same 404 pattern. Needs re-tag.
- `service_biz`
- `service_whiteboard`
- `service_editor`: … `v*` tag pushed
- `service_foundry`
- `service_browser`
- `service_slides`
- `service_matrixchat`: … `build-linux.yaml` / `release.yaml` at all
The consumer wiring is now a one-line patch per service. The actual blocker is upstream: each repo needs a working tag-triggered release pipeline producing the binaries the module expects.
What's next (in order)
- `hero_aibroker` re-tag (in flight): bumping to v0.1.1 to get `hero_aibroker_services` published. Once green, `service_aibroker` joins on a one-line patch (asset suffix `linux-amd64-musl`).
- `hero_osis` re-tag: same shape, bump to v1.0.0-rc6 to get `hero_osis_seed` published.
- `hero_editor`, `hero_foundry`, `hero_browser`, `hero_slides` each need a `v0.1.0` cut once their CI is verified green on a tag.
- `hero_biz`, `hero_whiteboard` have tags pushed but the build-linux.yaml runs failed; needs investigation per repo.
- `hero_matrixchat` needs a `build-linux.yaml` workflow added, classed as a #39 cleanup item.
- `service_install_all --from-ci` only after the per-service rollout is complete.
The pattern is now boring and mechanical: tag a fresh `v*`, wait for CI to upload assets, add a one-line `--from-ci` branch + asset_suffix to the corresponding `service_X.nu`. PR-1 is the template; subsequent PRs will look identical.
Signed-off-by: mik-tf
2026-05-03 (later) — `hero_aibroker` joins, plus replicable recipe for the rest
What landed since the last update
- … `release.yaml` gate relaxed to allow tagging on `development` or `main` (mirrors `hero_proxy`, the working template). 4-line diff.
- … `development` HEAD.
- `release.yaml` published 4 binaries with `linux-amd64-musl` suffix: `hero_aibroker`, `hero_aibroker_server`, `hero_aibroker_ui`, `hero_aibroker_services`
- … `service_aibroker` consumer wiring. Merged at `9cad828`.
Smoke-tested end-to-end on heroci: all 4 binaries land in `/root/hero/bin/`, sized 4–12 MB, `file` reports static-pie ELF stripped.
`--from-ci` coverage now: 6 services (`hero_proc`, `hero_router`, `hero_proxy`, `hero_db`, `hero_indexer`, `hero_aibroker`).
Tangential issue filed
hero_aibroker#58: `test_server_rpc_methods` in `build.yaml` fails because it needs a live `hero_db`, which CI doesn't provide. Pre-existing, not blocking the release pipeline. Two recommended fixes documented (`#[ignore]`-by-default vs. start `hero_db` in the CI job).
Generalised recipe for the next service
The aibroker work fell into a 3-step pattern that should generalise to most of the remaining easy-tier services. Per repo:
- … `release.yaml`: does the "Verify tag is on..." gate accept `development`? `hero_proxy` is the canonical example. If main-only, ship a 4-line `mirror hero_proxy gate` PR.
- … `development` HEAD that publishes whatever binaries the corresponding `service_X.nu` module's `SVX_BINARIES` expects today (the binary list often drifts ahead of the last release).
- … `hero_skills`, smoke `service_X install --from-ci` on heroci.
Per-repo readiness audit (where I can tell from outside the repo):
… `release.yaml` exists
- `hero_router`
- `hero_indexer`
- `hero_osis` (… `hero_osis_seed` missing, needs re-tag)
- `hero_biz`
- `hero_whiteboard`
- `hero_editor`
- `hero_foundry`
- `hero_browser`
- `hero_slides`
- `hero_matrixchat`
`router` and `indexer` already have working releases, so their main-only gate is cosmetic, not blocking, until the next release-cut. They can stay as-is until the next tag-cut on those repos.
The "8 needing first/fresh release" set is now a sequenced workflow, one repo at a time, identical shape every time. Suggest tackling them in priority order. Happy to take the next one whenever you're ready.
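
The branch gate from the recipe above could be expressed roughly as follows. This is a sketch under assumptions: the real "Verify tag is on..." step in `release.yaml` is not shown in this thread, `tag_gate_ok` is a hypothetical name, and the input is assumed to be the output of `git branch -r --contains <tag>`.

```shell
#!/bin/sh
# Succeeds if any branch in a newline-separated list is an allowed
# release branch (development or main, with or without "origin/").
tag_gate_ok() {
    branches=$1
    printf '%s\n' "$branches" | sed 's/^[* ]*//' \
        | grep -Eq '^(origin/)?(development|main)$'
}
```

In a workflow step this would be used as `tag_gate_ok "$(git branch -r --contains "$TAG")" || exit 1`, where `$TAG` is whatever variable the workflow uses for the pushed tag.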
Signed-off-by: mik-tf
2026-05-03 (later still) — `hero_osis` joins, plus a load-bearing finding
What landed since the last update
- … `hero_osis`, `hero_osis_ui`, `hero_osis_seed`, `hero_bot`) which the v1.0.0-rc5 release was missing. No gate fix needed (`hero_osis` only gates the upload step on `startsWith(github.ref, 'refs/tags/v')`, not on branch).
- … `service_osis` consumer wiring, one-line patch. Merged at `b94bd7e`.
Smoke-tested end-to-end on heroci. The 3 module-expected binaries land cleanly.
`--from-ci` install coverage now: 7 services (`hero_proc`, `hero_router`, `hero_proxy`, `hero_db`, `hero_indexer`, `hero_aibroker`, `hero_osis`). 20 binaries totalling ~108 MB sitting in `/root/hero/bin/` on heroci, all from CI artifacts, no cargo run anywhere on the box.
Bumps along the way (worth recording)
- `hero_osis` repo's `FORGEJO_TOKEN` secret was unset → tag-push CI failed silently on the create-release POST (curl `-sf` swallowed the 401, python crashed on empty stdin). Token added → tag deleted + re-pushed → run #494 succeeded.
- … `docker create` (21 min on `Set up job` for one attempt). Other repos' jobs flowed through fine in parallel; it looked like single-runner saturation rather than pool-wide failure.
Load-bearing finding: `--from-ci` is install-only today
While trying to drive `service_proc start --root` on heroci as a stress-test, hit a hard wall: `service_X start` always purges existing binaries and reinstalls via cargo, regardless of how the binary got on disk. On a CI-paved host (no source repos, no `ROOTDIR`), it errors out with `ROOTDIR not set` after wiping the just-installed CI binaries.
This means the `--from-ci` install path doesn't yet enable a full CI-paved stack lifecycle. We can `--from-ci` install the binaries; we cannot `--from-ci` start the supervised stack.
Filed hero_demo#64 with two recommended fixes:
- Option A: … `--from-ci` flag to each `service_X.nu` `start` function (mirrors install pattern, ~5 lines per service).
- Option B: … `start` into a separate `register` / `up` command that doesn't reinstall (cleaner separation, matches `apt install foo && systemctl start foo` shape).
Recommendation: A for immediate rollout (mechanical, mirrors merged install pattern), B as a follow-on architectural cleanup.
Tracking as limitation L-05 in the workspace pipeline.
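
The behaviour both options are after, starting from whatever is already on disk instead of purging and rebuilding, can be illustrated with a small guard. This is a hypothetical sketch in shell, not the Nushell code in `service_X.nu`; the name `ensure_installed` is invented for the example.

```shell
#!/bin/sh
# Hypothetical guard for a start path: check that the expected binaries
# are present and executable, and report what is missing. It never
# deletes or rebuilds anything.
ensure_installed() {
    bin_dir=$1; shift
    missing=0
    for bin in "$@"; do
        if [ ! -x "$bin_dir/$bin" ]; then
            echo "missing binary: $bin_dir/$bin (run install first)" >&2
            missing=1
        fi
    done
    return "$missing"
}
```

A start command built on this would call `ensure_installed "$HOME/hero/bin" hero_proc hero_proc_server` and only then hand over to the supervisor, which matches the `apt install foo && systemctl start foo` separation described above.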
Updated rollout map
What's next
- … `service_X.nu`. `hero_editor` is the lowest-effort candidate (workflow exists, just never tag-pushed). Each service is now a self-contained ~10-30 min unit if no CI debugging is needed; longer if first-tag CI uncovers issues.
- … `--from-ci` to `start` (hero_demo#64 Option A). Independent track from #1. Once installed, this unlocks the full "fresh VM → working stack from CI artifacts" demo.
- … `hero_matrixchat` (no release workflow at all).
- … `service_install_all --from-ci`.
Signed-off-by: mik-tf
Session 55 — Phase 2 audit plan (cluster-by-cluster)
Session 54 reverted `--from-ci` from the 8 services without published artifacts. Session 55 audits why their CI doesn't publish. Surveyed all 8 `.forgejo/workflows/` + Forge API tags/releases/runs first to avoid duplicating work.
Findings
- `hero_matrixchat`
- `hero_editor`
- `hero_slides`
- `hero_biz`
- `hero_browser`
- `hero_foundry`
- `hero_whiteboard`
- `hero_books`
`hero_biz`, `hero_browser`, `hero_editor`, `hero_foundry` ship near-identical 68-74-line `build-linux.yaml` templates that all source the same `scripts/build_lib.sh` (~2370 lines, present in every repo) and call shared helpers `setup_linux_toolchain` / `build_binaries` / `publish_binaries`. They'll fail or succeed for the same reason.
Audit clustering
Rather than 8 independent audits, group by likely shared root cause:
- … `hero_editor`, `hero_slides`. Workflow is fine; just hasn't been triggered. Fix: tag + watch a run.
- … `hero_matrixchat`. Has `ci.yml` test/lint only, no `build-linux.yaml`. Fix: port a working template.
- … `hero_biz`, `hero_browser`, `hero_foundry`, `hero_whiteboard` (whiteboard is a partial outlier: inline release logic, not shared helper). Investigate `hero_biz` first; finding likely propagates to 3 siblings.
- … `hero_books`. Distinct symptom; standalone deep-dive.
Order this session
Output
8 per-repo issues filed (one per service), cross-linked where they share a root cause, plus a closing summary comment back here with effort estimates for Phase 2 implementation.
Out of scope this session: any actual CI fixes — audit + issues only.
- … `--from-ci` install path blind to this repo #13
- … `--from-ci` blind #118
- … `--from-ci` install path blind to this repo #16
- … `v*` tag has been pushed; `build-linux.yaml` exists but never triggered #5
- … `v*` tag has been pushed; `release.yaml` (inline, single-target) exists but never triggered #42
- … `ci.yml` (test/lint); needs `build-linux.yaml` #4
Session 55 — Phase 2 audit complete
8 per-repo issues filed. Root cause analysis revealed a shared-helper bug that explains 4 of 8 services at once.
Per-repo issues
- … `build-linux.yaml` at all
Root cause for clusters A + B (4 of 8 services)
`scripts/build_lib.sh::publish_binaries` writes binaries only to the Forgejo package registry (`/api/packages/<owner>/generic/<pkg>/<version>/<asset>`). It never creates a Forgejo Release nor uploads to `/api/v1/repos/<repo>/releases/<id>/assets`. `svc_install_from_ci` in hero_skills/tools/modules/services/lib.nu:510 downloads from `forge.ourworld.tf/<repo>/releases/download/<tag>/<asset>`: release assets only, not pkg registry. Net: 4 services have working CI but are invisible to `--from-ci`.
The 7 services in Phase 1 with working CI (hero_proc, hero_router, hero_proxy, hero_db, hero_indexer, hero_aibroker, hero_osis) all have inline release-creation + asset-upload logic in their workflows, not the shared helper.
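
The mismatch is easiest to see with the two URL shapes side by side. The sketch below just formats the paths quoted above; the function names are invented for illustration, and the `https://` scheme and the `FORGE` variable are assumptions (the thread gives the host without a scheme).

```shell
#!/bin/sh
# Assumed base URL; override FORGE to point somewhere else.
FORGE=${FORGE:-https://forge.ourworld.tf}

# Where the shared helper publishes: the generic package registry.
# Arguments: owner pkg version asset
pkg_registry_url() {
    printf '%s/api/packages/%s/generic/%s/%s/%s\n' "$FORGE" "$1" "$2" "$3" "$4"
}

# Where the --from-ci consumer downloads from: release assets.
# Arguments: repo (owner/name) tag asset
release_asset_url() {
    printf '%s/%s/releases/download/%s/%s\n' "$FORGE" "$1" "$2" "$3"
}
```

A binary uploaded to the first URL is never reachable at the second, which is why a green CI run can still leave a repo invisible to the consumer.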
Phase 2 effort estimates
- … `publish_release_assets` helper to `scripts/build_lib.sh`, propagate to 4 repos, re-tag each, validate on heroci
- `--from-ci` consumer wiring in service_*.nu modules for unblocked services
Phase 2 total estimate: ~12-18 h of work. Highest-leverage starting point is the shared-helper fix, which unblocks 4 services in one PR.
Out of scope this session
Ready to ship — closing audit phase.
Decision: Releases is canonical, work from the 7 already-working services
After broader assessment of where Hero sits relative to industry-standard binary distribution:
Industry signal
Static-Linux-binary distribution from a forge is overwhelmingly Releases-based in the OSS world: kubectl, gh, ripgrep, hugo, terraform, docker-compose, nu, deno, bun, foundry, cargo-binstall all expect Release assets. Generic package registries (Forgejo `/api/packages/`, GitHub Packages generic) are used for typed packages consumed by typed package managers (npm/cargo/pip/docker), not bare binaries pulled by deploy scripts.
Hero-specific reasons Releases wins
- … `curl` can pull without `FORGEJO_TOKEN`. Pkg registry needs token plumbing on every VM.
- … `/releases` page tells humans what shipped and when. Pkg registry pages are machine-only.
- … `gh release download` / `cargo binstall` / install.sh story works out of the box.
Where we already stand
The 7 working services (hero_proc, hero_router, hero_proxy, hero_db, hero_indexer, hero_aibroker, hero_osis) all do exactly this: their `build-linux.yaml` has inline `Create Release` + `Upload Release Assets` steps, plus an optional pkg-registry mirror. They are the canonical Hero pattern. No new helper needed: copy the working pattern into the 4 broken repos.
Updated Phase 2 plan
- … `build-linux.yaml` into each repo's workflow. Drop reliance on `build_lib.sh::publish_binaries` for release-asset publishing (keep it for optional pkg-registry mirror or remove it entirely). Re-tag, validate on heroci. ~2-3 h total (one PR pattern, applied 4x).
- … `build-linux.yaml` directly.
Skill ecosystem follow-up (not blocking)
- `build_lib_ci` SKILL.md template currently calls only `publish_binaries` (pkg registry). Should be updated to match the canonical hero_proc pattern (inline release-asset upload + optional pkg-registry mirror).
- `tfgrid_deploy` should switch its consumer to Release URLs to drop the auth-token requirement on TFGrid VMs.
Filed for visibility; not blocking the 4-repo Phase 2 fix above.
Phase 3 scope: WASM artifacts (hero_os + hero_archipelagos)
Adding to the roadmap. After Phase 2 (binary cluster A/B/C/D/E) lands, the next major chunk is WASM artifact distribution for the browser-side stack. Same architectural gap as cluster A, plus an additional consumer-side gap.
Producer state
- hero_os: `release.yaml`, `v*` tag, … `hero_os-web-<v>.tar.gz`)
- hero_archipelagos: `build-release.yaml`, `v*` tag, … `hero_archipelagos-wasm.tar.gz`)
Both have the same Cluster A bug: they publish to pkg registry, never to Releases. hero_archipelagos additionally has never been tagged.
Consumer state — bigger gap
`service_os.nu` line 26: "service_os install — fetch source, cargo build, copy binaries". The deploy script builds locally (~25 min cold per CLAUDE.md) and never attempts to fetch the WASM tarball CI produces.
So even fixing the producer side gives us nothing until a `svc_install_wasm_from_ci` helper exists in `hero_skills/tools/modules/services/lib.nu`, alongside the existing `svc_install_from_ci` (binary-shaped). The shapes differ: the WASM artifact is a `.tar.gz` of a directory tree that extracts into `~/hero/share/hero_os/public/` (or `/islands/`), not a single binary copied to `~/hero/bin/`.
Why Phase 3 is the highest-leverage piece of the whole CI roadmap
- … `--from-ci` Phase 2)
- … `dx build --release` → ~30 sec download + tar extract
hero_os is the front-door for every demo VM and every contributor onboarding. Sub-minute fresh-deploy is unblocked entirely by Phase 3.
Phase 3 effort estimate (~8-12 h)
- … `.tar.gz` asset instead of multiple binaries)
- `svc_install_wasm_from_ci` helper in hero_skills/lib.nu (new shape, not a one-line addition: handles tarball download + extract + content-hash bookkeeping)
- … `service_os.nu` install path to prefer `--from-ci`, fall back to local `dx build`
Updated roadmap
- … `--from-ci`
- … `--from-ci`
- … `build_lib_ci` template, `tfgrid_deploy` Releases default)
End-state ("complete CI via Hero OS nu-shell"): every Hero service (binary or WASM) ships via tag → CI → Releases → `service_<name> install --from-ci`. Bare TFGrid VM → fully working Hero OS in <2 min wall-clock. Today's 25-min cold demo deploy → ~90 sec.
2026-05-04 — Session 55 producer-side check-in: 2 of 8 unblocked
Re-audited the 8 cluster A/B/C/D/E targets. Two have shipped Release assets since the audit comment (28672):
- `hero_books`: … `linux-amd64`
- `hero_browser`: … `linux-amd64` + `linux-arm64`
Both unblocked the consumer-side wiring that was reverted at session 54. Working on the consumer-side wiring next:
- `service_books`: re-add `--from-ci` / `--version`, asset suffix `linux-amd64`, target tag `v0.1.6-rc1`. Smoke on heroci.
- `service_browser`: same shape, target tag `v0.1.4-rc5`.
Two PRs on `hero_skills`, one per service, mirroring the #196 / #197 cadence.
Updated rollout map
- … `--from-ci` (proc, router, proxy, db, indexer, aibroker, osis)
End-state target unchanged: every Hero service ships via tag → CI → Releases → `service_<name> install --from-ci`.
Signed-off-by: mik-tf
2026-05-04 — Complete current-state recap
Posting the full picture in one place after session 55's smoke validation. State below reflects what's live on heroci + Forgejo Releases + `hero_skills/development` right now.
heroci.gent01.grid.tf (CI-validation VM)
- … `178.251.27.21`
- `/root/hero/bin/`: … ELF static-pie, every one of them got there via `--from-ci` (no cargo has run on this box)
- `hero_proc` daemon: … `start --from-ci` gap)
Producer × consumer matrix (all 15 services)
… `--from-ci` wired)
- `hero_proc`
- `hero_router`
- `hero_proxy`
- `hero_db`
- `hero_indexer`
- `hero_aibroker`
- `hero_osis`
- `hero_books`
- `hero_browser`
- `hero_biz`
- `hero_foundry`
- `hero_whiteboard`
- `hero_editor`
- `hero_slides`
- `hero_matrixchat`
Coverage summary
Spot-checked binaries on heroci (all static-pie ELF, version-correct)
Outstanding architectural gaps (independent of binary rollout)
- `service_X start` always purges & rebuilds via cargo, defeating `--from-ci` installs on CI-paved hosts → hero_demo#64.
Session 55 in flight
- … `linux-amd64`, target tag `v0.1.4-rc5`)
Signed-off-by: mik-tf
2026-05-04 — Session 55 close: 9/15 services live with `--from-ci`
Both PRs merged. Coverage now 9/15 end-to-end.
Landed
- … `service_books` re-add `--from-ci` (linux-amd64, v0.1.6-rc1), squash-merged at `2f38fc89`
- … `service_browser` add `--from-ci` (linux-amd64, v0.1.4-rc5), squash-merged at `2ed37497`
Re-validated end-to-end on heroci from merged `development` HEAD
Both `--version latest` resolutions correct. All binaries `ELF 64-bit LSB pie executable, statically linked`.
Updated coverage
9/15 services fully E2E. 6/15 still producer-blocked. The natural session 56 entry point is cluster A propagation to hero_biz + hero_foundry (port the inline release-asset upload pattern from hero_books's working `build-linux.yaml` into each repo's workflow, then re-tag).
Next session entry points (in suggested order)
- … `start --from-ci` lifecycle gap, hero_demo#64. Independent track.
Signed-off-by: mik-tf
Sibling cleanup — asset naming convention
Filed home#212 — standardize CI release-asset naming on Rust target triples (honest libc per repo).
Three conventions in use today across the 9 working repos, and several assets misrepresent their libc (`hero_proc-linux-amd64` is musl; `hero_books-linux-amd64` is gnu; `hero_browser-linux-arm64` is gnu). Migration is pure-rename via the Forgejo `PATCH` API, with no rebuilds for any of the 9. Future producer-side work (the 6 still-blocked repos under this issue) adopts the new convention from the first tag-cut.
Independent of this issue's `--from-ci` rollout but the same surface area; cross-posting so anyone working on Phase 2 cluster-A propagation knows to use `*-x86_64-unknown-linux-musl` (or `-gnu`) directly in the workflow + service module rather than perpetuating `linux-amd64` / `linux-amd64-musl`.
Signed-off-by: mik-tf
Phase 2 — execution plan for the remaining 6 services
Tiered easiest → hardest, lowest-variance first. Cumulative effort ~12-15h focused work, splittable across 3-5 sessions.
Tier 1 — push tag, watch CI, ship (~1h each)
Workflow exists, has inline release logic (or close to it), no prerequisite blockers. Each is essentially `git tag v0.1.0 && git push origin v0.1.0` after sanity-checking that the workflow's binary list matches the corresponding `service_X.nu` `SVX_BINARIES`, then a one-line consumer wiring + heroci smoke.
Tier 2 — port inline-upload pattern from hero_books (~2-3h each)
Producer-side workflow is wired and tag-triggered, but uses the broken shared-helper that writes to pkg registry instead of Releases. Each needs the inline release-asset upload pattern from hero_books's `build-linux.yaml` (which just landed and works).
Tier 3 — full template port (~1-2h)
- … `build-linux.yaml` ported from hero_books's working version + first tag.
Tier 4 — debug (~2-4h)
Cross-cutting decisions adopted
- … `<bin>-x86_64-unknown-linux-musl` for musl-built, `-gnu` for glibc-built, honest about what the workflow actually compiles). This means the new producer-side workflows include the new asset naming directly, with no future rename pass needed for these 6 repos.
- … `git push origin v*.*.*` is a visible-to-others action; per-repo authorization at the moment of tagging.
Path to "complete CI via Hero OS nu-shell"
After this Phase 2 finishes (all 15 services with `--from-ci`):
- … `--from-ci` to `start` lifecycle (no purge-and-rebuild)
- … `--from-ci`
- `service_install_all --from-ci` (whole-stack default)
Then a fresh TFGrid VM → fully working Hero OS in ~90 sec wall-clock vs today's 25-min cold demo deploy.
Starting Tier 1 now: hero_slides first.
Signed-off-by: mik-tf
2026-05-04 — Slides E2E complete: 10/15 services live + lessons captured
Slides landed end-to-end
hero_slides_lib reqwest → rustls (musl unblock), workspace fmt + clippy clean (71221e1)
release.yaml adopts target-triple convention per home#212 (54646cf)
v0.1.0-rc2: 3 assets named hero_slides{,_server,_ui}-x86_64-unknown-linux-musl
service_slides.nu --from-ci wired with x86_64-unknown-linux-musl suffix
Live --from-ci coverage now: 10/15 services, namely proc, router, proxy, db, indexer, aibroker, osis, books, browser, slides (new).
Lessons captured (apply to remaining 5 producer-blocked repos)
These cost us ~2-3 hours of friction on slides that we don't have to repeat:
FORGEJO_TOKEN secret must exist in each new repo BEFORE the first tag push. hero_slides hit the exact same bump as hero_osis (c28650): the release.yaml POSTs to /api/v1/repos/.../releases with the secret; if it's unset, curl -sf swallows the 401 and the python parser crashes on empty stdin. The build itself succeeds, but the release is never created. Cosmetic failed-run noise persists on the tag's check status forever after.
Action for next 5 repos: before tagging, verify with:
If absent: set it via UI before tagging.
Run the workspace gate locally before pushing any cleanup PR. Per feedback_workspace_build_before_merge.md: cargo fmt --check && cargo clippy --workspace --all-targets -- -D warnings && cargo build --workspace --release. Cost runner cycles + 2 PR rounds on slides (separate fmt + musl PRs initially) before user redirected to bundle. Most of the producer-blocked repos likely have similar chronic fmt+clippy drift built up; bundle hygiene with the bug fix in one PR.
Adopt the home#212 target-triple naming from day one in each new repo's release.yaml. The matrix artifact: field in release.yaml should match the target: field exactly. No future rename needed when the repo's first tag ships.
Match service_X.nu SVX_BINARIES to buildenv.sh BINARIES before tagging. If they drift, --from-ci fails because the consumer asks for binaries the producer doesn't publish.
Consumer wiring suffix matches the new convention. service_X.nu calls svc_install_from_ci ... "x86_64-unknown-linux-musl" (or -gnu if the workflow is glibc), not the old linux-amd64 / linux-amd64-musl shapes.
Updated rollout map
--from-ci (proc, router, proxy, db, indexer, aibroker, osis)
linux-amd64 naming)
bun missing in CI runner)
--from-ci to start lifecycle
--from-ci
service_install_all --from-ci
Next session (56) suggested order
bun runner gap (likely a Containerfile or apt-install fix in the CI image). Workflow already uses target-triple internally; just apply lessons-learned checklist.
Each repo at ~30 min - 2 h. Estimate ~6-8 h remaining for full Phase 2 completion → coverage 15/15.
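The naming and binary-list lessons above reduce to two mechanical checks that can run before a tag is pushed. A minimal Python sketch, assuming the asset name is simply `<bin>-<target-triple>` as in the slides release; the function names and the sample binary lists are hypothetical and not part of hero_skills:

```python
# Sketch only: helper names and sample lists are illustrative assumptions.

def asset_name(binary: str, target: str) -> str:
    """Build a release-asset name in the home#212 style: <bin>-<target-triple>."""
    return f"{binary}-{target}"


def binary_drift(consumer: list[str], producer: list[str]) -> dict[str, list[str]]:
    """Compare the consumer list (SVX_BINARIES) with the producer list (BINARIES)."""
    consumer_set, producer_set = set(consumer), set(producer)
    return {
        # Requested by --from-ci but never published: the install would fail.
        "missing_from_producer": sorted(consumer_set - producer_set),
        # Published but never installed: harmless, but worth a look.
        "unused_by_consumer": sorted(producer_set - consumer_set),
    }


if __name__ == "__main__":
    target = "x86_64-unknown-linux-musl"
    svx_binaries = ["hero_slides", "hero_slides_server", "hero_slides_ui"]
    producer_binaries = ["hero_slides", "hero_slides_server"]  # deliberately drifted

    print([asset_name(b, target) for b in svx_binaries])
    print(binary_drift(svx_binaries, producer_binaries))
```

An empty `missing_from_producer` list is the condition that matters for `--from-ci`; the second list is informational.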
Signed-off-by: mik-tf
Tiering correction — hero_editor is hard-tier (ONNX)
Investigated hero_editor for the next round of producer-side fixes and discovered it actually belongs in the hard tier alongside hero_voice and hero_embedder. Updating the rollout map accordingly.
What we found
When hero_editor's CI v0.1.0-rc3 ran (after hero_editor#4 fixed the bun: command not found issue), the cargo build itself would have hit the same musl/openssl-sys wall as hero_slides did, but for a fundamentally different reason:
hero_editor_ui uses voice_activity_detector for VAD in crates/hero_editor_ui/src/voice/audio.rs (not feature-gated).
VAD pulls in ort (ONNX Runtime Rust bindings).
ort-sys has a build-dependency on ureq (defaults to native-tls) to download ONNX Runtime libraries during compilation. That's how openssl-sys gets pulled into the build environment.
This is the same architectural class as hero_voice and hero_embedder, called out in this issue's body:
The audit comment (c28672) and Phase 2 plan (c28994) misclassified hero_editor as easy-tier: the v0.1.0-rc3 build failed at bun BEFORE reaching the cargo build step that would have surfaced the ONNX issue, so the audit couldn't see it.
Real tier-2 picture for the remaining 5 producer-blocked
Hard-tier consolidated
These three now share an architectural dependency on a libonnxruntime strategy:
Three options remain unchanged: (a) bundle libonnxruntime.so next to the binary, (b) statically link ONNX against musl (likely upstream PR territory), (c) keep on glibc as documented near-static exception.
Updated session order
After step 4: easy/special tier complete (14/15). Step 5 closes out the last 1/15 along with hero_voice + hero_embedder.
Signed-off-by: mik-tf
2026-05-04 — Session 55 close: 11/15 services live + 8-item Phase 2 playbook
Closing out session 55. Coverage moved 7 → 11/15 services with --from-ci end-to-end. Three E2E completions (slides, biz; plus books and browser earlier in the session). Plus architectural wins: home#212 target-triple naming standard adopted, hero_editor moved to hard-tier.
Producer × consumer matrix (full)
--from-ci wired)
8-item Phase 2 pre-flight playbook (apply to hero_foundry / hero_whiteboard / hero_matrixchat)
These 8 distinct CI gotchas surfaced during this session — pre-flight all of them on each remaining repo BEFORE pushing the first tag, to compress the multi-iteration debug cycle we just went through with hero_biz (5 tag rounds, 8h elapsed) down to 1-2 iterations:
FORGEJO_TOKEN repo secret exists in repo settings (set via Forgejo UI before tagging)
reqwest declarations in workspace use default-features = false, features = ["rustls-tls", ...]; NEVER native-tls (pulls openssl-sys → musl breaks)
buildenv.sh::ALL_FEATURES references real, currently-existing workspace features. Stale value referencing renamed/removed features will fail at cargo build --features ... time. Set to "default" if unsure.
In release.yaml / build-linux.yaml, JSON parsing should use python3 (universally available in ghcr.io/despiegk/builder:latest), NEVER jq (not installed).
workflow_dispatch ref resolution can flake: runs sometimes pick up an OLD sha. Prefer push trigger via tag-cut for verification.
buildenv.sh path used by inline release-upload steps matches the repo's actual layout. hero_books has scripts/buildenv.sh; hero_biz has buildenv.sh at root.
rustup target add "${{ matrix.target }}" in Setup toolchain step. setup_linux_toolchain from build_lib.sh can silently skip the musl target on the despiegk/builder runner → E0463 "can't find crate for core".
Plus the structural decisions also adopted this session:
x86_64-unknown-linux-musl, aarch64-unknown-linux-gnu) per home#212. Honest about libc per repo. Already in slides + biz; precedent set.
Create Release + Upload Release Assets curl steps mirroring hero_books's working build-linux.yaml. Replaces shared-helper publish_binaries that only writes to pkg registry (cluster A bug).
Updated rollout map
--from-ci (proc, router, proxy, db, indexer, aibroker, osis)
linux-amd64 naming)
--from-ci to start lifecycle
--from-ci
service_install_all --from-ci
Side artifacts filed this session
Session 56 entry point
hero_foundry — same cluster A shape as hero_biz, should land much faster with the 8-item playbook applied up front. After foundry, whiteboard (different shape — no openssl issue, debug existing failed CI runs) and matrixchat (no release workflow, full template port).
Estimated remaining for full Phase 2 (3 services): ~3-5h with playbook discipline, vs the ~8h hero_biz took without it.
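Two playbook items interact in practice: with `curl -sf`, a 401 caused by a missing `FORGEJO_TOKEN` produces no output at all, and a bare JSON parse of that empty input then fails with an unhelpful traceback. A minimal sketch of a parse step that reports the likely cause instead. The function name is hypothetical, and the one assumption about the API is that a successful release-creation response is a JSON object carrying an `id` field:

```python
import json


def release_id_from_response(body: str) -> int:
    """Extract the release id, failing loudly on an empty or malformed body."""
    if not body.strip():
        # curl -sf prints nothing on HTTP errors such as 401.
        raise SystemExit(
            "empty response from the releases API: "
            "check that the FORGEJO_TOKEN secret exists and can write to the repo"
        )
    try:
        payload = json.loads(body)
    except json.JSONDecodeError as exc:
        raise SystemExit(f"response is not JSON: {exc}")
    if not isinstance(payload, dict) or "id" not in payload:
        raise SystemExit(f"response has no release id: {body[:200]!r}")
    return int(payload["id"])


if __name__ == "__main__":
    # Sample body for illustration; a workflow step would pass sys.stdin.read().
    print(release_id_from_response('{"id": 42, "tag_name": "v0.1.0"}'))
```

Exiting with a message that names the secret turns the "python parser crashes on empty stdin" symptom into a diagnosis that is visible directly in the run log.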
Signed-off-by: mik-tf
Session 56 close — Phase 2 cluster A complete (12/15 services E2E)
hero_foundry full E2E — v0.2.3-rc2 published with 6 target-triple-named assets (3 binaries × 2 archs).
Playbook validation (8 items applied verbatim)
Result: single successful tag (rc2 — rc1 was a stale session-53 tag pointing to old commit, skipped not retagged). Compare biz: 4 rc rounds. The playbook compressed cluster A propagation from ~8h debug to ~110min including squash-merge + smoke test.
Surprises
development_mik_release_assets from session 53 (commit 1cbf71a) was layered on top rather than reset; preserved audit trail of pre-playbook state. Squash-merge collapsed both at land time.
-D warnings errors across the workspace. Filed hero_foundry#28 (mirrors hero_biz#22 deferral). Workspace cargo build --workspace --release green = structural gate met.
Coverage
Pinned sessions
PRs landed: hero_foundry#29 + hero_skills#205. Manifest: sessions/56.yml.
Session 58 status: cluster E shipped (matrixchat) → 14/15 services E2E
What landed
Producer (hero_matrixchat):
Consumer (hero_skills):
service_matrixchat install --from-ci, mirrors service_biz/foundry/whiteboard verbatim
~/hero/bin/ (rustls fix verified, no openssl drag)
Coverage now (14/15 services E2E)
Bugs found / fixed beyond the 8-item playbook
Matrixchat needed two fixes the playbook didn't cover (now part of the cluster-E learnings):
No workspace member declared [features]: cargo build --features default errored "none of the selected packages contains this feature: default". Fix: add [features]\ndefault = [] to one library crate (hero_matrixchat_sdk, mirroring hero_foundry_core / hero_biz_app convention). Repos that already had a default-bearing library crate (foundry, biz, whiteboard) didn't hit this.
Pre-existing fmt debt from upstream 0907fca: ci.yml's cargo fmt --all --check step was failing on every push since the herolib_core integration commit. Surfaced now because it blocked our PRs from being green. Fix: mechanical cargo fmt, no behaviour change.
v0.1.0-rc1 ate both before they were diagnosed; v0.1.0-rc2 was clean first try after the unblock PR.
Workspace-build gate
Clean: cargo fmt --check && cargo clippy --workspace --all-targets -- -D warnings && cargo build --workspace --features default --release. No clippy debt, so no follow-up issue needed (unlike hero_biz#22 / hero_foundry#28).
Next session (59)
Pinned: hero_editor (cluster D hard-tier). Strategically blocked on ONNX cross-compile (voice_activity_detector → ort → onnxruntime). Likely swap with a dedicated ONNX-strategy session covering hero_voice + hero_embedder + hero_editor together per session 55's hard-tier note. If session 59 starts on editor and immediately hits the ONNX wall, retreat and convert to ONNX-strategy first.
After editor: Phase 2 is complete at 15/15 and we move to Phase 3 (deploy --from-ci on herodemo).
Signed-off-by: mik-tf
Complete roadmap: lhumina_code hero OS to green
Scope: end-to-end from Phase 2 finish through deployable AI-native demo. Sequenced for maximum leverage early. Honest effort estimates; flagged uncertainties as such.
Where we are after session 58
--from-ci.
lhumina_code/hero_skills/tools/modules/services/ has 34 service modules total: 14 wired, 20 not yet wired. Plus hero_os WASM shell, hero_archipelagos native islands, and ~7 unscoped repos.
Inventory: 34 hero_skills service modules
Releasable units outside service_*.nu: hero_os (Dioxus/WASM shell), hero_archipelagos (native islands), hero_browser_mcp, hero_foundry_ui, hero_indexer_ui, kokoro-micro. Unscoped: hero_auth, hero_cluster, hero_compute_manager, hero_coordinator, hero_launcher, hero_ledger, hero_researcher, hero_lib_rhai, hero_web_template, dist; some likely deprecated, all need triage before effort commitment.
The plan: 9 phases
Phase 2 — finish (last 1 + ONNX-blocked 3)
Goal: 15/15 killer-demo services + voice + embedder all --from-ci.
Work:
decisions/D-05-onnx-cross-compile.md. Single proof-of-concept tag on one of the three.
Definition of done:
service_*.nu modules in hero_skills wire --from-ci.
Estimate: 4 sessions. Risk: ONNX may not have a unified approach across all three; fallback is per-service ad-hoc fix (3 sessions become 3 distinct ones, total cost rises to ~6-7).
Blockers: none. Parallel-eligible: docs_hero Phase 1 content (independent track).
Phase 3: herodemo deploys entirely via --from-ci
Goal: Zero cargo build invocations on the herodemo VM. Every service install path goes through downloaded musl artifacts.
Work:
service_X start currently purges + rebuilds via cargo, defeating --from-ci installs. Need start-side awareness of "already installed from CI".
--from-ci --version v… for every service.
Definition of done:
service_proc start --from-ci --version v… works idempotently for all 17 services on a fresh VM.
Estimate: 2-3 sessions. Risk: L-05 fix may surface architectural questions about service lifecycle; could expand. Blockers: Phase 2 finish (need wired services to roll across). Parallel-eligible: docs_hero, reliability META design work.
Phase 4: hero infra services on --from-ci
Goal: AI/coordination layer (agent, code, collab, hero_do, mycelium) on the same release pipeline.
Work: Apply the playbook to each. After ONNX is solved, agent + code are likely pure-Rust and inherit cleanly. mycelium is a special case (separate upstream, may already have its own pipeline). collab has had FD-leak issues — investigate before pipelining.
Definition of done: 22/34 services wired (14 + 3 ONNX + 5 infra).
Estimate: 3-4 sessions. Risk: collab/mycelium may need refactor before pipelining is sensible. Blockers: Phase 2 finish; ONNX strategy locked. Parallel-eligible: Phase 5 inventory.
Phase 5 — auxiliary + office services
Goal: Cover the remaining 12 service modules.
Pre-work (1 session): Triage inventory. Determine which are actively maintained vs deprecated; which have unique build constraints (office stack is not pure-Rust); which are demo-critical. Output: a per-service decision (pipeline / deprecate / defer).
Then per-service:
Definition of done: Every actively-maintained service has either a --from-ci pipeline OR an explicit deprecation note in its README + workspace removal.
Estimate: 5-7 sessions (1 triage + 4-6 inheritance/bespoke). Blockers: Phase 4 (proves pattern at scale). Parallel-eligible: WASM shell pipeline.
Phase 6 — hero_os WASM shell release pipeline
Goal: Reproducible WASM bundle deploys for the Dioxus shell.
Work: Separate workflow shape: wasm-pack/trunk build, content-hashed bundle, deploy to herodemo's ~/hero/share/hero_os/public/ (or equivalent CDN/origin). Distinct from the hero_proc service pattern.
Definition of done: Tag push → WASM bundle uploaded → make install-assets-release equivalent runs on herodemo.
Estimate: 2 sessions. Risk: the WASM build is ~25 min cold; CI runtime budget may force optimizations. Blockers: none from Phase 2-5. Parallel-eligible: Phase 5.
Phase 7 — hero_archipelagos native islands
Goal: Native Dioxus islands (photos, videos, calendar, etc.) on the same release pipeline.
Work: Each island is a binary; same biz canonical pattern but per-island matrix. Likely a single workflow with per-island feature gates. Investigate whether one-binary-many-islands or many-binaries shape.
Definition of done: Every active island ships musl/arm64 release artifacts on tag push.
Estimate: 2-3 sessions. Blockers: Phase 2-3 (proves the pattern). Parallel-eligible: Phase 8.
Phase 8 — Reliability META
Goal: Close the architectural gaps that have been accumulating in limitations/.
Targets:
Definition of done: Every L-* limitation either resolved with a linked PR or explicitly accepted with a long-term tracking issue.
Estimate: 3-5 sessions (each is a real refactor). Blockers: none from earlier phases. Parallel-eligible from Phase 3+: can start as soon as deploy is stable.
Phase 9 — Ambient AI vision per hero_demo#52
Goal: The actual product — Hero OS as a sovereign AI-native personal OS.
Work (each is a session+):
Definition of done: hero_demo#52 acceptance criteria met. Demo verifiable at https://herodemo.gent01.grid.tf/ with no human onboarding.
Estimate: 6-10 sessions. This is the real product work; everything before is plumbing. Blockers: Phase 2 (services) + Phase 3 (deploy). Parallel-eligible: docs_hero Phase 1 content (the agent grounds on it).
Best path — critical sequence
Why this sequence:
Total: 25-40 sessions to "all hero OS green" depending on Phase 5 triage outcome and ONNX strategy success rate.
Roughly 25-80 hours of focused execution time depending on session length. With the multi-session pipeline discipline, this is 1-3 calendar months of part-time work or 2-4 weeks full-time.
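As a cross-check on the totals, the per-phase estimates quoted in the phase sections can be summed directly. The sketch below copies those ranges as (low, high) pairs; the only assumption is that the phases are counted back to back with no overlap:

```python
# Per-phase session estimates as (low, high), copied from the phase sections.
ESTIMATES = {
    "Phase 2": (4, 4),
    "Phase 3": (2, 3),
    "Phase 4": (3, 4),
    "Phase 5": (5, 7),
    "Phase 6": (2, 2),
    "Phase 7": (2, 3),
    "Phase 8": (3, 5),
    "Phase 9": (6, 10),
}


def total_sessions(estimates: dict[str, tuple[int, int]]) -> tuple[int, int]:
    """Sum the low and high ends of every phase estimate."""
    low = sum(lo for lo, _ in estimates.values())
    high = sum(hi for _, hi in estimates.values())
    return low, high


if __name__ == "__main__":
    low, high = total_sessions(ESTIMATES)
    print(f"sessions: {low}-{high}")  # sessions: 27-38
```

The straight sum of 27-38 sits inside the stated 25-40 range. If Phase 2 takes the ~6-7 sessions of its risk case, the upper bound rises to about 41. The 25-80 hour figure matches roughly one to two hours per session, which is an inference from the numbers rather than something the roadmap states.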
Risks + open decisions
Recommended next session
Session 59 = ONNX-strategy session (Phase 2 finish path A → strategy variant).
Maximum leverage: one investigation session unlocks 3 services. The alternative — direct attempt on hero_editor — has high probability of hitting the ONNX wall in the first 15 min and forcing a retreat anyway, so we'd pay for both the retreat AND the strategy work.
Path: investigate prebuilt onnxruntime musl distribution + ort crate's download-binaries feature, lock approach in decisions/D-05-onnx-cross-compile.md, single proof-of-concept tag on hero_editor (smallest of the three). Sessions 60-62 then apply pattern across voice + embedder + editor.
Roadmap drafted at session 58 close. To revise, comment with proposed changes; locked decisions go to decisions/D-NN-*.md. This comment is the SSOT for the meta-plan until hero_demo#52 absorbs it.
Signed-off-by: mik-tf
Session 60 — D-05 implementation pilot complete: hero_editor → 15/15 (+1 ONNX service)
Coverage: 14/15 → 15/15 (original Phase 2 set complete) + first ONNX service shipped, paving the way for hero_voice (session 61) and hero_embedder (session 62) to repeat the pattern.
Producer side: hero_editor v0.1.0-rc4
Released at https://forge.ourworld.tf/lhumina_code/hero_editor/releases/tag/v0.1.0-rc4 with 6 assets (~45 MB total):
hero_editor_server-{x86_64,aarch64}-unknown-linux-gnu (518 KB / 488 KB)
hero_editor_ui-{x86_64,aarch64}-unknown-linux-gnu (1.2 MB / 1.2 MB)
libonnxruntime.so.1.25.1-{x86_64,aarch64}-unknown-linux-gnu (22.7 MB / 19.2 MB)
D-05 fully validated end-to-end on the producer pipeline: load-dynamic ort + matrix swap to gnu + bundled Microsoft libonnxruntime.so all worked.
Consumer side
service_editor install --from-ci with libonnxruntime.so handling.
svc_verify_elf was unexported; nu degraded the Command-not-found error to External-command-failed inside try/catch, masking the real diagnosis).
services/hero_editor.toml committed direct to development with ORT_DYLIB_PATH=__HERO_BIN__/libonnxruntime.so.1.25.1 in the UI action env.
Heroci smoke
service_editor install --from-ci --root produced all 3 artifacts at /root/hero/bin/.
hero_editor_server binds its Unix socket; hero_editor_ui prints all routes and stays alive.
file and readelf -d falsely report "statically linked / no dynamic section"); upx -d on a copy reveals the underlying glibc-dynamic linkage (libc.so.6, libm.so.6, libgcc_s.so.1), confirming dlopen() will work for ort's runtime libonnxruntime.so resolution.
Producer-side fix-forward chain on rc4
Three small fixes had to land before rc4 went green; saving the lessons in the playbook:
actions/checkout@v4 had the Forgejo auth bug (extraheader + git-fetch don't agree). Editor's build.yaml had documented this and used a manual git clone since PR #4; same fix needed in build-linux.yaml.
make bundle-web needs an explicit Setup-bun step (already present in build.yaml).
build-macos.yaml (forge.ourworld.tf has no macOS runner; failures were just template carryover).
write:repository scope refresh (mirrors session 57 whiteboard pattern).
Playbook additions
Add to the 8-item Phase 2 playbook (now 14):
build-linux.yaml must mirror the toolchain/auth conventions of build.yaml: manual clone if build.yaml does, plus any non-default toolchain installs (bun, deno, etc.).
build-macos.yaml if present: forge.ourworld.tf has no macOS runner.
FORGEJO_TOKEN secret has write:repository scope before tagging.
Next sessions
Coverage projection: 15/15 + 2 → 17 services with --from-ci after sessions 61 + 62.
Session 63 — D-05 hero_embedder pilot, ONNX rollout complete (16/15 → 17/15+)
Third and final application of D-05 (load-dynamic + bundled libonnxruntime.so + matrix musl→gnu), after hero_editor (session 60) and hero_voice (session 61). The D-05 ONNX rollout is done; all 17 first-class services now ship CI-built artifacts.
Producer side
2257c36 (squash of 3 commits: fmt+clippy debt cleanup; D-05 workflow port + buildenv pin; hero_embedderd added to BINARIES, a caught defect: the only ort-loading binary in the workspace was missing from the release manifest).
cargo tree -e features -p hero_embedder_lib | grep -E 'download-binaries|copy-dylibs' returns empty. ort was already declared with default-features = false, features = ["load-dynamic", "api-24"] on the workspace dep, so step 1 of the D-05 playbook collapsed to verification.
cargo fmt --check, cargo clippy --workspace --all-targets -- -D warnings, cargo build --workspace --release (2m 33s) all clean.
libonnxruntime.so.1.25.1 × 2 archs) on first attempt, zero fix-forwards. Playbook items 14-16 (carried from sessions 57+60+61) prevented the editor's 4 fix-forwards.
Consumer side
39ab04d (squash of 3 commits):
refactor(lib): factor svc_install_onnx_runtime_download into lib.nu (rule of 3): voice + editor were carrying byte-identical helpers; embedder triggered the rule-of-three. The shared helper takes onnx_version and ci_target as args.
feat(service_embedder): add --download / --version: mirrors voice canonical shape. svx_embedderd_action prefers the bundled .so over the system /usr/local/onnxruntime install when on disk. ORT preflight (svx_ort_require) is skipped under --download because the bundled .so is the source of truth, not the system install.
fix(dispatcher): forward --download/--version to embedder install/start: closes the dispatcher gap surfaced session 62.
f33f8a7: added download = "..." URLs + ORT_DYLIB_PATH env to services/hero_embedder.toml (mirrors voice manifest pattern).
Heroci smoke (post-merge, hero_skills @ 39ab04d)
Verification:
nm -D /root/hero/bin/libonnxruntime.so.1.25.1 | grep OrtGetApiBase → OrtGetApiBase@@VERS_1.25.1 (correct ABI).
upx -d + ldd confirms glibc-dynamic linkage (libc.so.6, libm.so.6, libgcc_s.so.1).
hero_embedderd boots cleanly under ORT_DYLIB_PATH=/root/hero/bin/libonnxruntime.so.1.25.1.
Embedder semantic-search end-to-end is a UX gate per D-05 (same pattern as the voice-WS deferral in session 60: D-05 only requires the binary starts with the .so resolvable; full feature exercise is a separate session).
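The ABI check above compares the version tag on the exported `OrtGetApiBase` symbol with the version suffix of the bundled file. A small sketch of how a smoke script could automate that comparison on the text `nm -D` prints. The function names are hypothetical, and the address column in the sample line is invented for illustration:

```python
import re


def symbol_version(nm_line: str, symbol: str = "OrtGetApiBase"):
    """Return the version tag from an `nm -D` line, or None if absent."""
    match = re.search(rf"\b{re.escape(symbol)}@@VERS_([0-9]+(?:\.[0-9]+)*)", nm_line)
    return match.group(1) if match else None


def soname_version(filename: str):
    """Return the version suffix of a name like 'libonnxruntime.so.1.25.1'."""
    match = re.search(r"\.so\.([0-9]+(?:\.[0-9]+)*)$", filename)
    return match.group(1) if match else None


def abi_matches(nm_line: str, filename: str) -> bool:
    """True when the symbol's version tag equals the file's version suffix."""
    version = symbol_version(nm_line)
    return version is not None and version == soname_version(filename)


if __name__ == "__main__":
    # Sample line: the address is made up, the symbol text is as reported above.
    line = "00000000000a1b20 T OrtGetApiBase@@VERS_1.25.1"
    path = "/root/hero/bin/libonnxruntime.so.1.25.1"
    print(abi_matches(line, path))
```

A mismatch here would mean the `.so` on disk is not the build the binary was pinned against, which is cheaper to detect at install time than at the first inference call.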
Coverage delta
The 14 already-shipping services keep static-musl. The 3 ONNX services ship gnu-glibc binaries with a sibling libonnxruntime.so. Any future ONNX service needs only its own constants + one call into the now-shared svc_install_onnx_runtime_download helper.
Notes
BINARIES in hero_embedder/buildenv.sh was missing hero_embedderd, the only binary in the workspace that loads ort dynamically. Without that fix the bundled .so would have shipped with no consumer. Filed as a third producer commit before opening for merge.
--from-ci / --download not supported on start's purge-and-rebuild path) remains open and out of scope for this session.
mik-tf referenced this issue 2026-05-11 03:01:57 +00:00