Publish generated Q&A back to the library repo so it is not re-generated on every machine #141
Reference
lhumina_code/hero_books#141
When hero_books ingests a documentation library it does two things. It runs an LLM pass to generate question and answer pairs for each page (the slow and paid step, currently routed through OpenRouter), and it embeds the result into hero_memory (fast and free, done by the local embedder). Today the generated Q&A and the vectors are written only into hero_memory's local data directory. Nothing is written back to the source library repository. So when the same public library is set up on a second machine, that machine clones the repo, finds no generated Q&A, and re-runs the entire LLM pass from scratch, paying the cost again. For a shared set of public libraries used across many installs, the same expensive work is repeated on every machine.
This used to work differently. The library repo carried per-page sidecar files under `.ai/` (for example `collections/<collection>/.ai/<page>.toml`) holding the generated Q&A keyed by a content hash, with `embeddings_generated = false`. Vectors were never stored in the repo, only the Q&A text. That made the expensive work portable: clone the repo, the Q&A is already present, skip the LLM pass, and re-embed locally for free. That write-back was removed during the recent rework. The current import pipeline comment states that the Q&A now lives in hero_memory's data dir and not the source repo, and the old `push_ai` argument is now unused. Older library repos still contain those `.ai/*.toml` files, so the format and its `content_hash` field are a known-good reference for what to restore.
The request is to restore that round trip. It has two ends, in two repos:
hero_memory. After `qa.extract` generates Q&A for a page, also write it to a sidecar next to the source page in the existing `.ai/<page>.toml` shape, stamped with the same `content_hash` it already computes during scan. On scan, convert, or extract, if a sidecar with a matching `content_hash` already exists, load the Q&A from it and mark the extractor as already run, so the LLM call is skipped. The skip machinery already exists (`qa.extract` already short-circuits with the reason "already extracted at this content_hash"); today it keys off the local stored record, and it would need to also recognise a sidecar that arrived with the cloned repo.
hero_books. After ingest, commit and push the new or updated `.ai/` files back to the library repo, ideally behind a flag so a plain read-only consumer does not attempt to push. The old import pipeline had this push-back and it needs restoring.
The invariant that keeps this safe and small: only LLM-derived text (the Q&A, and later any ontology output) is published to the repo, never the vectors. Vectors are model and dimension specific and are cheap to regenerate locally, so they stay local. The per-page `content_hash` is the single staleness key. Same hash means reuse the sidecar and skip the LLM. A different hash (a page was edited) invalidates just that page and triggers re-extraction of only that page.
Outcome: a library's Q&A is generated once on one machine, pushed to the forge, and every machine that later clones the library inherits the Q&A and only does the free local embed. Editing a page regenerates just that page.
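To make the staleness rule concrete, here is a minimal sketch of the reuse-or-regenerate decision. Everything in it is an assumption for illustration: the `Sidecar` struct, the `decide` and `sidecar_path` helpers, and the FNV-1a hash are hypothetical stand-ins, since this issue does not specify how `content_hash` is actually computed or how the sidecar is parsed.

```rust
use std::path::{Path, PathBuf};

/// Stand-in content hash (64-bit FNV-1a, hex encoded).
/// Assumption: the real `content_hash` algorithm is not stated in this issue.
pub fn content_hash(page_text: &str) -> String {
    let mut hash: u64 = 0xcbf29ce484222325;
    for byte in page_text.as_bytes() {
        hash ^= u64::from(*byte);
        hash = hash.wrapping_mul(0x100000001b3);
    }
    format!("{:016x}", hash)
}

/// Hypothetical in-memory view of a `.ai/<page>.toml` sidecar.
#[derive(Debug, Clone, PartialEq)]
pub struct Sidecar {
    pub content_hash: String,
    /// (question, answer) pairs; text only, never vectors.
    pub qa_pairs: Vec<(String, String)>,
}

#[derive(Debug, PartialEq)]
pub enum ExtractDecision {
    /// Hash matches: reuse the published Q&A and skip the LLM call.
    ReuseSidecar(Vec<(String, String)>),
    /// No sidecar, or the page was edited: regenerate this page only.
    RunLlm,
}

/// Maps `<dir>/<page>.<ext>` to `<dir>/.ai/<page>.toml`.
/// Assumption: the sidecar directory sits next to the source page.
pub fn sidecar_path(page: &Path) -> Option<PathBuf> {
    let stem = page.file_stem()?;
    let dir = page.parent()?;
    let mut name = stem.to_os_string();
    name.push(".toml");
    Some(dir.join(".ai").join(name))
}

/// Compares the page's current hash with the sidecar's stamped hash.
pub fn decide(page_text: &str, sidecar: Option<&Sidecar>) -> ExtractDecision {
    let current = content_hash(page_text);
    match sidecar {
        Some(s) if s.content_hash == current => {
            ExtractDecision::ReuseSidecar(s.qa_pairs.clone())
        }
        _ => ExtractDecision::RunLlm,
    }
}
```

Because the decision is made per page, editing one page changes only that page's hash, so every other sidecar in the library is still reused.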
Relevant code:
- `import_collection_pipeline` and `import_local_pipeline`.
- `converted_hash`; collections/scan.rs (`content_hash` computation and stale-clearing).
One question for the team before implementing. Moving the Q&A into hero_memory's data dir looks deliberate. Was there a reason for it, for example a different portability or sync mechanism that is planned instead? If the sidecar approach is still the intended path, we are happy to implement both ends, starting with a single library to confirm the full clone, reuse, and skip-LLM loop end to end, then the rest.
Additional finding while validating on a live machine: the same rework also removed the step that turns a cloned library repo into the on-disk book tree the web UI reads, so this is wider than just the Q&A push-back.
What we observed. After cloning all four public libraries onto a tester and running the full ingest (question/answer generation plus local embedding), search works correctly when given a library name, but the Hero Books web UI shows "1 library, 0 books" (only the legacy empty default). The four libraries are present in hero_memory (their collections are ready and searchable), but they do not appear as browsable books in the web UI.
Root cause. There are two separate stores and the web UI reads the one the current ingest never fills:
- `~/hero/var/books/{library}/` (with `library.toml` and `books/<book>/book.json`) is what `libraries.list` and `books.list`, and therefore the web UI, read. Nothing currently populates it from a library, so it stays empty.
The build-the-book-tree step was removed in the same rework. In hero_books `discover_and_convert_ebooks` (crates/hero_books_server/src/web/server.rs) the code comment states: "Auto-conversion of ebooks to book.toml has been removed, it relied on the deleted hero_books_lib::ai pipeline." So at startup the libraries are git-cloned, but the conversion that produced the `book.toml`/`book.json` tree no longer runs, and the web UI has nothing to show.
So three connected pieces were removed by the rework, and they should be considered together:
- `.ai/<page>.toml` sidecars (the original body of this issue).
- `~/hero/var/books/` book tree so the libraries are browsable in the web UI (this comment).
Net effect today: a fresh machine clones the four libraries, the web UI shows zero books, and any ingest that is run re-pays the question/answer LLM cost from scratch with nothing published back.
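The "1 library, 0 books" symptom follows directly from that on-disk layout. The sketch below is not the actual scanner in hero_books; `count_book_tree` is a hypothetical helper that assumes a library is any directory under the root and a book is any `books/<book>/` directory containing a `book.json`.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Counts (libraries, books) under a root laid out as
/// `<root>/<library>/books/<book>/book.json`.
/// Assumption: every directory directly under `root` counts as a library.
pub fn count_book_tree(root: &Path) -> io::Result<(usize, usize)> {
    let mut libraries = 0;
    let mut books = 0;
    for entry in fs::read_dir(root)? {
        let library_dir = entry?.path();
        if !library_dir.is_dir() {
            continue;
        }
        libraries += 1;

        // A library without a populated `books/` directory contributes zero books.
        let books_dir = library_dir.join("books");
        if !books_dir.is_dir() {
            continue;
        }
        for book in fs::read_dir(&books_dir)? {
            if book?.path().join("book.json").is_file() {
                books += 1;
            }
        }
    }
    Ok((libraries, books))
}
```

With only an empty default library directory present, a scan of this shape reports one library and zero books, no matter how many collections are ready in hero_memory.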
Desired end state, restated to cover all three:
- Publishing machine: `.ai/` sidecars keyed by content hash, book tree built, and the `.ai/` sidecars pushed to the forge.
- Fresh machine: clones the library, finds the `.ai/` Q&A, skips the LLM step (content hash matches), builds the book tree so the web UI shows the books, and only re-embeds locally for free.
Open question to the team is unchanged and now covers all three pieces: moving Q&A into hero_memory's data dir and removing the book-tree conversion looks deliberate. Is the sidecar plus book-tree approach still the intended direction, or is a different portability and browse mechanism planned? If the sidecar approach is the path, we are happy to implement all three ends, starting with one library to confirm the full clone, browse, reuse, and skip-LLM loop end to end, then the rest.
Relevant code for piece 3:
- `discover_and_convert_ebooks`, `ensure_library_repos`.
- `list_library_dirs`, `library_books_dir`, `ensure_library_dirs`, and the scanner in crates/hero_books_server/src/web/server.rs around the `book.json` scan.
Following up with a concrete plan for restoring the publish-back of generated Q&A, plus a safety design so it is easy to turn off if you would rather keep everything in hero_memory.
What I found in the current code: the pieces are mostly still here, just disconnected. The exporter still copies a collection's `.ai/` files into the book tree, the `PageMetadata`/`TopicQA`/`QAPair` types still exist, herolib_git is already a dependency, and the `push_ai` flag is still wired end to end (just unused). The only parts actually deleted in the late-April refactor were the content-hash skip check and the `git add .ai/` write-back, both recoverable from the parent of that commit. On the hero_memory side nothing needs to change: `qa.extract` generates, `qa.list` reads the pairs back, and `index.add` can ingest pre-made pairs and embed them through the provider without calling the model again.
Proposed approach, kept additive so it does not change your storage model: hero_memory stays the runtime store. On top of it, hero_books gains a portable cache in each library's `.ai/`. Two roles. A privileged publish step, run by a maintainer with write access, generates Q&A once, reads it back, writes `.ai/<page>.toml` keyed by content hash, and pushes it to the library repo. A free consume step, run by every fresh machine, reads `.ai/` when the hash matches and embeds those pairs locally instead of re-generating them. We would persist only the Q&A text, not vectors, since embedding is cheap to redo locally.
To keep this low risk for you, the whole thing lands behind an off-by-default switch (the publish side reuses the existing `push_ai` flag, the consume side a new toggle), in small self-contained commits, so it stays dormant unless explicitly enabled and a single revert removes it. Separately we also need to restore the step that builds the browsable book tree, since right now the web UI shows the libraries as zero books even though search works.
One question before we wire it up: are you happy with this additive approach (the runtime store stays, `.ai/` is a portable cache layer on top), and would you prefer the publish step as a hero_books command a maintainer runs or as a separate privileged endpoint? If you would rather keep Q&A only in hero_memory, we leave the switch off and this stays out of your way.