Publish generated Q&A back to the library repo so it is not re-generated on every machine #141

Open

opened 2026-05-30 15:14:09 +00:00 by mik-tf · 2 comments

mik-tf commented

2026-05-30 15:14:09 +00:00

Owner

When hero_books ingests a documentation library it does two things. It runs an LLM pass to generate question and answer pairs for each page (the slow and paid step, currently routed through OpenRouter), and it embeds the result into hero_memory (fast and free, done by the local embedder). Today the generated Q&A and the vectors are written only into hero_memory's local data directory. Nothing is written back to the source library repository. So when the same public library is set up on a second machine, that machine clones the repo, finds no generated Q&A, and re-runs the entire LLM pass from scratch, paying the cost again. For a shared set of public libraries used across many installs, the same expensive work is repeated on every machine.

This used to work differently. The library repo carried per-page sidecar files under `.ai/` (for example `collections/<collection>/.ai/<page>.toml`) holding the generated Q&A keyed by a content hash, with `embeddings_generated = false`. Vectors were never stored in the repo, only the Q&A text. That made the expensive work portable: clone the repo, the Q&A is already present, skip the LLM pass, and re-embed locally for free. That write-back was removed during the recent rework. The current import pipeline comment states that the Q&A now lives in hero_memory's data dir and not the source repo, and the old `push_ai` argument is now unused. Older library repos still contain those `.ai/*.toml` files, so the format and its `content_hash` field are a known-good reference for what to restore.

The request is to restore that round trip. It has two ends, in two repos:

1. hero_memory. After `qa.extract` generates Q&A for a page, also write it to a sidecar next to the source page in the existing `.ai/<page>.toml` shape, stamped with the same `content_hash` it already computes during scan. On scan, convert, or extract, if a sidecar with a matching `content_hash` already exists, load the Q&A from it and mark the extractor as already run, so the LLM call is skipped. The skip machinery already exists (`qa.extract` already short-circuits with the reason "already extracted at this content_hash"); today it keys off the local stored record, and it would need to also recognise a sidecar that arrived with the cloned repo.
2. hero_books. After ingest, commit and push the new or updated `.ai/` files back to the library repo, ideally behind a flag so a plain read-only consumer does not attempt to push. The old import pipeline had this push-back and it needs restoring.

The invariant that keeps this safe and small: only LLM-derived text (the Q&A, and later any ontology output) is published to the repo, never the vectors. Vectors are model and dimension specific and are cheap to regenerate locally, so they stay local. The per-page `content_hash` is the single staleness key. Same hash means reuse the sidecar and skip the LLM. A different hash (a page was edited) invalidates just that page and triggers re-extraction of only that page.

Outcome: a library's Q&A is generated once on one machine, pushed to the forge, and every machine that later clones the library inherits the Q&A and only does the free local embed. Editing a page regenerates just that page.

Relevant code:

- hero_books import pipeline and the removed push-back: [crates/hero_books_server/src/web/server.rs](https://forge.ourworld.tf/lhumina_code/hero_books/src/branch/development/crates/hero_books_server/src/web/server.rs) (`import_collection_pipeline` and `import_local_pipeline`).
- hero_memory extract, convert, and scan: [api/qa.rs](https://forge.ourworld.tf/lhumina_code/hero_memory/src/branch/development/crates/hero_memory_lib/src/api/qa.rs) (skip logic), [api/convert.rs](https://forge.ourworld.tf/lhumina_code/hero_memory/src/branch/development/crates/hero_memory_lib/src/api/convert.rs) (`converted_hash`), [collections/scan.rs](https://forge.ourworld.tf/lhumina_code/hero_memory/src/branch/development/crates/hero_memory_lib/src/collections/scan.rs) (`content_hash` computation and stale-clearing).

One question for the team before implementing. Moving the Q&A into hero_memory's data dir looks deliberate. Was there a reason for it, for example a different portability or sync mechanism that is planned instead? If the sidecar approach is still the intended path, we are happy to implement both ends, starting with a single library to confirm the full clone, reuse, and skip-LLM loop end to end, then the rest.
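
To make the staleness rule concrete, here is a minimal sketch of the per-page decision, not the actual hero_memory code. The names `PageAction`, `plan_page`, and `pages_to_extract` are made up for illustration, and `content_hash` here is a dependency-free stand-in built on `DefaultHasher`; the real value is whatever `collections/scan.rs` already computes.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

/// Stand-in for the real `content_hash` computed during scan (assumption:
/// any stable hash of the page text works for the purpose of this sketch).
fn content_hash(page_text: &str) -> String {
    let mut hasher = DefaultHasher::new();
    page_text.hash(&mut hasher);
    format!("{:016x}", hasher.finish())
}

/// What ingest should do with one page (hypothetical type).
#[derive(Debug, PartialEq)]
enum PageAction {
    /// Sidecar hash matches: load its Q&A and only embed locally.
    ReuseSidecar,
    /// No sidecar, or the page was edited: run the LLM pass for this page.
    Extract,
}

/// Decide for one page, given the hash recorded in its sidecar (if any).
fn plan_page(page_text: &str, sidecar_hash: Option<&str>) -> PageAction {
    match sidecar_hash {
        Some(recorded) if recorded == content_hash(page_text) => PageAction::ReuseSidecar,
        _ => PageAction::Extract,
    }
}

/// Return the names of the pages that still need the LLM pass.
/// `pages` holds (page name, current text); `sidecars` maps page name
/// to the `content_hash` recorded in its `.ai/<page>.toml`.
fn pages_to_extract<'a>(
    pages: &[(&'a str, &'a str)],
    sidecars: &HashMap<String, String>,
) -> Vec<&'a str> {
    pages
        .iter()
        .filter(|(name, text)| {
            let recorded = sidecars.get(*name).map(String::as_str);
            plan_page(text, recorded) == PageAction::Extract
        })
        .map(|(name, _)| *name)
        .collect()
}
```

With three pages where one is unchanged, one was edited, and one has no sidecar yet, only the last two come back from `pages_to_extract`, which is the "editing a page regenerates just that page" behaviour described above.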

mik-tf commented

2026-05-30 15:17:20 +00:00

Author

Owner

Additional finding while validating on a live machine: the same rework also removed the step that turns a cloned library repo into the on-disk book tree the web UI reads, so this is wider than just the Q&A push-back.

What we observed. After cloning all four public libraries onto a tester and running the full ingest (question/answer generation plus local embedding), search works correctly when given a library name, but the Hero Books web UI shows "1 library, 0 books" (only the legacy empty default). The four libraries are present in hero_memory (their collections are ready and searchable), but they do not appear as browsable books in the web UI.

Root cause. There are two separate stores and the web UI reads the one the current ingest never fills:

- hero_memory's internal data dir (redb) holds the Q&A and vectors. The ingest writes here. Search reads here. This works.
- The library tree under `~/hero/var/books/{library}/` (with `library.toml` and `books/<book>/book.json`) is what `libraries.list` and `books.list` and therefore the web UI read. Nothing currently populates it from a library, so it stays empty.

The build-the-book-tree step was removed in the same rework. In hero_books `discover_and_convert_ebooks` (crates/hero_books_server/src/web/server.rs) the code comment states: "Auto-conversion of ebooks to book.toml has been removed, it relied on the deleted hero_books_lib::ai pipeline." So at startup the libraries are git-cloned, but the conversion that produced the `book.toml` / `book.json` tree no longer runs, and the web UI has nothing to show.

So three connected pieces were removed by the rework, and they should be considered together:

1. Writing the generated Q&A back into the repo as `.ai/<page>.toml` sidecars (the original body of this issue).
2. Pushing those sidecars to the forge so other machines inherit them (the original body of this issue).
3. Converting a cloned library repo into the `~/hero/var/books/` book tree so the libraries are browsable in the web UI (this comment).

Net effect today: a fresh machine clones the four libraries, the web UI shows zero books, and any ingest that is run re-pays the question/answer LLM cost from scratch with nothing published back.

Desired end state, restated to cover all three:

- A library is processed once on one machine: Q&A generated, written to `.ai/` sidecars keyed by content hash, book tree built, and the `.ai/` sidecars pushed to the forge.
- Every machine that later clones that library inherits the `.ai/` Q&A, skips the LLM step (content hash matches), builds the book tree so the web UI shows the books, and only re-embeds locally for free.
- Editing a page changes its content hash and re-processes just that page.
- Vectors are never published, only the LLM-derived text. Vectors are regenerated locally.

Open question to the team is unchanged and now covers all three pieces: moving Q&A into hero_memory's data dir and removing the book-tree conversion looks deliberate. Is the sidecar plus book-tree approach still the intended direction, or is a different portability and browse mechanism planned? If the sidecar approach is the path, we are happy to implement all three ends, starting with one library to confirm the full clone, browse, reuse, and skip-LLM loop end to end, then the rest.

Relevant code for piece 3:

- hero_books startup conversion that was removed: [crates/hero_books_server/src/web/server.rs](https://forge.ourworld.tf/lhumina_code/hero_books/src/branch/development/crates/hero_books_server/src/web/server.rs) (`discover_and_convert_ebooks`, `ensure_library_repos`).
- the book tree the web UI reads: [crates/hero_books_lib/src/library.rs](https://forge.ourworld.tf/lhumina_code/hero_books/src/branch/development/crates/hero_books_lib/src/library.rs) (`list_library_dirs`, `library_books_dir`, `ensure_library_dirs`) and the scanner in [crates/hero_books_server/src/web/server.rs](https://forge.ourworld.tf/lhumina_code/hero_books/src/branch/development/crates/hero_books_server/src/web/server.rs) around the `book.json` scan.
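
For anyone reproducing the "1 library, 0 books" symptom, a rough sketch of a tree check follows. It is not the scanner in `server.rs`; `count_library_tree` is a hypothetical helper, and it assumes a library is any directory carrying `library.toml` and a book is any `books/<book>/` directory carrying `book.json`, which is the layout described in this comment.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Count (libraries, books) under a books root laid out as
/// `{root}/{library}/library.toml` and `{root}/{library}/books/<book>/book.json`.
fn count_library_tree(root: &Path) -> io::Result<(usize, usize)> {
    let mut libraries = 0;
    let mut books = 0;
    for entry in fs::read_dir(root)? {
        let library_dir = entry?.path();
        // Assumption: a directory counts as a library only if it has library.toml.
        if !library_dir.join("library.toml").is_file() {
            continue;
        }
        libraries += 1;
        let books_dir = library_dir.join("books");
        if !books_dir.is_dir() {
            // Cloned but never converted: nothing browsable yet.
            continue;
        }
        for book in fs::read_dir(&books_dir)? {
            // Assumption: a book is browsable only once book.json exists.
            if book?.path().join("book.json").is_file() {
                books += 1;
            }
        }
    }
    Ok((libraries, books))
}
```

Pointing this at the books root on a fresh machine should show whether the tree was ever populated, independently of whether the hero_memory collections are searchable.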

mik-tf commented

2026-05-31 02:10:45 +00:00

Author

Owner

Following up with a concrete plan for restoring the publish-back of generated Q&A, plus a safety design so it is easy to turn off if you would rather keep everything in hero_memory.

What I found in the current code: the pieces are mostly still here, just disconnected. The exporter still copies a collection's `.ai/` files into the book tree, the `PageMetadata`/`TopicQA`/`QAPair` types still exist, herolib_git is already a dependency, and the `push_ai` flag is still wired end to end (just unused). The only parts actually deleted in the late-April refactor were the content-hash skip check and the `git add .ai/` write-back, both recoverable from the parent of that commit. On the hero_memory side nothing needs to change: `qa.extract` generates, `qa.list` reads the pairs back, and `index.add` can ingest pre-made pairs and embed them through the provider without calling the model again.

Proposed approach, kept additive so it does not change your storage model: hero_memory stays the runtime store. On top of it, hero_books gains a portable cache in each library's `.ai/`. Two roles. A privileged publish step, run by a maintainer with write access, generates Q&A once, reads it back, writes `.ai/<page>.toml` keyed by content hash, and pushes it to the library repo. A free consume step, run by every fresh machine, reads `.ai/` when the hash matches and embeds those pairs locally instead of re-generating them. We would persist only the Q&A text, not vectors, since embedding is cheap to redo locally.

To keep this low risk for you, the whole thing lands behind an off-by-default switch (the publish side reuses the existing `push_ai` flag, the consume side a new toggle), in small self-contained commits, so it stays dormant unless explicitly enabled and a single revert removes it. Separately we also need to restore the step that builds the browsable book tree, since right now the web UI shows the libraries as zero books even though search works.
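
As a rough illustration of how the two switches could gate the publish and consume roles, here is a sketch under assumptions. Only `push_ai` is a name from the existing code; `consume_ai`, `AiCacheConfig`, `PagePlan`, and `plan` are hypothetical names for this example and the real wiring may look different.

```rust
/// Hypothetical switches, both off by default.
#[derive(Debug, Clone, Copy, Default, PartialEq)]
struct AiCacheConfig {
    /// Publish side: write `.ai/<page>.toml` and push it (maintainers only).
    push_ai: bool,
    /// Consume side: trust a matching sidecar instead of calling the LLM.
    /// The name `consume_ai` is made up for this sketch.
    consume_ai: bool,
}

/// What ingest does for one page under a given config (hypothetical type).
#[derive(Debug, PartialEq)]
struct PagePlan {
    call_llm: bool,
    embed_locally: bool,
    write_sidecar: bool,
}

/// Plan one page. `sidecar_hash_matches` is true when a sidecar exists
/// and its recorded content hash equals the page's current hash.
fn plan(config: AiCacheConfig, sidecar_hash_matches: bool) -> PagePlan {
    let reuse = config.consume_ai && sidecar_hash_matches;
    PagePlan {
        // Without a reusable sidecar the page goes through the LLM as today.
        call_llm: !reuse,
        // Vectors are never published, so embedding always happens locally.
        embed_locally: true,
        // Only freshly generated Q&A needs to be written back.
        write_sidecar: config.push_ai && !reuse,
    }
}
```

With both switches left at their defaults, `plan` returns the current behaviour (call the LLM, embed locally, write nothing back), which is the "stays dormant unless explicitly enabled" property.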
One question before we wire it up: are you happy with this additive approach (the runtime store stays, `.ai/` is a portable cache layer on top), and would you prefer the publish step as a hero_books command a maintainer runs or as a separate privileged endpoint? If you would rather keep Q&A only in hero_memory, we leave the switch off and this stays out of your way.