[nu-demo] hero_books re-runs LLM Q&A extraction despite pre-shipped .ai/<page>.toml cache — content_hash mismatch #158

Closed · opened 2026-04-24 by mik-tf · 1 comment

## Symptom

On heronu, adding the 4 demo libraries (`docs_hero`, `docs_geomind`, `docs_mycelium`, `docs_owh`) to `libraries.txt` triggers hero_books to re-extract Q&A via LLM calls to hero_aibroker for **every page** in every collection. With ~300 pages across geomind/mycelium/ourworld, this takes 20-40 minutes and burns API tokens — even though each library **already ships with pre-extracted Q&A**.

Every cloned library has:

```
docs_<name>/
├── .ai/ebooks/*.toml           ← ebook metadata (collection ordering, page titles)
└── collections/<coll>/
    ├── *.md                    ← page content
    └── .ai/
        ├── <page>.toml         ← Q&A pairs + content_hash + vectors metadata
        └── <page>.vectors.bin  ← pre-computed embeddings (384-d BGE-small)
```

Example — `docs_mycelium/collections/ai_platform_tech/.ai/0_tech_overview.toml` (~10 KB):

```toml
content_hash = "387035a010b47317"
source_path = "0_tech_overview"
title = "Technology Stack Overview"
generated_at = "1770864619"

[[topics.technology.pairs]]
question = "What is the Long Term AI Memory component..."
answer = "The Long Term AI Memory component is..."

# ...8 more topic pairs
```

This is exactly the format hero_books would produce. **The extraction work has already been done upstream and committed.**

## Current behavior (confirmed on heronu 2026-04-24)

hero_books log shows for each page:

```
Processed: datacenters (53 Q&A pairs)            ← re-extracted via LLM
Processed: overview (27 Q&A pairs) [Q&A cached]  ← hit cache
```

The `[Q&A cached]` marker appears ONLY for pages in docs_hero. Every page in geomind / mycelium / ourworld is re-extracted.

## Root cause (hypothesis)

`hero_books_lib/src/ai/book.rs:241` computes `content_hash = compute_content_hash(&content)` on the **exported** `.md` file — i.e. the version hero_books writes into `/home/driver/hero/var/books/<ns>/books/<collection>/<page>.md` after its own post-processing step.

The pre-shipped `.ai/<page>.toml` was generated by an **upstream** pipeline (likely the heroscript → markdown conversion tool). Its `content_hash` was computed on a different input — probably the raw markdown before any normalization, the `.hero` source file before conversion, or the rendered HTML output.

Result: hero_books's hash ≠ stored hash → cache miss → full LLM re-extraction.

The Q&A pairs in the stored TOML are still valid and well-formed. The `.vectors.bin` sidecars are likewise valid — they encode real embeddings. Throwing all of that away to call Claude again is pure waste.
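To see why any divergence is fatal, here's a minimal sketch, assuming `compute_content_hash` is a 64-bit digest over the page text (the 16-hex-char value in the shipped TOML suggests one, but the actual hasher in hero_books_lib hasn't been verified). Even a trivial normalization step flips the digest:

```rust
// Hypothetical illustration — DefaultHasher stands in for whatever digest
// hero_books actually uses; the point is only that hashing pre- vs.
// post-processed content can never agree.
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

fn demo_hash(content: &str) -> String {
    let mut h = DefaultHasher::new();
    content.hash(&mut h);
    format!("{:016x}", h.finish())
}

fn main() {
    let raw = "# Tech Overview\r\n\r\nSome page body.\r\n";
    // One line-ending normalization — the kind of thing an export step does.
    let exported = raw.replace("\r\n", "\n");
    assert_ne!(demo_hash(raw), demo_hash(&exported)); // cache key diverges
}
```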

## Proposed fixes (pragmatic → proper)

### 1. Trust-if-present mode

Add a config flag (default on for libraries cloned from remote repos):

```rust
// hero_books_lib/src/ai/book.rs::process_page
if config.trust_shipped_ai
    && meta_path.exists()
    && let Ok(existing) = DocumentMetadata::load(&meta_path)
    && !existing.qa_pairs.is_empty()
{
    // Use shipped .ai/<page>.toml as-is, skip LLM
    metadata = existing;
    // Verify or recompute embeddings from shipped .vectors.bin if present
    return Ok(true); // counts as processed for stats
}
```
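A possible shape for the flag — the struct, field name, and default here are assumptions for illustration, not the actual hero_books config schema:

```rust
// Hypothetical config addition — names are illustrative only.
#[derive(Debug, Clone)]
pub struct AiConfig {
    /// When true, a well-formed shipped .ai/<page>.toml is used verbatim
    /// even if its content_hash doesn't match the exported markdown.
    pub trust_shipped_ai: bool,
}

impl Default for AiConfig {
    fn default() -> Self {
        // Default on, so freshly cloned libraries skip LLM extraction.
        Self { trust_shipped_ai: true }
    }
}
```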

### 2. Normalize content_hash computation

Match whatever the upstream pipeline does. If upstream hashes the raw source (before rendering), hero_books should do the same. Read through how `docs_mycelium` was built (probably a `build.sh` or similar in the repo) and align — one candidate rule is sketched below.
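The sketch assumes both sides can agree to hash the raw committed bytes of the page file, before any conversion or export touches them; the helper name is illustrative, not an existing hero_books API:

```rust
use std::collections::hash_map::DefaultHasher;
use std::fs;
use std::hash::{Hash, Hasher};
use std::io;
use std::path::Path;

/// Hash exactly what is committed to the repo: the unmodified file bytes,
/// before hero_books' export/post-processing runs. (DefaultHasher again
/// stands in for the real digest function.)
fn canonical_content_hash(page_path: &Path) -> io::Result<String> {
    let bytes = fs::read(page_path)?;
    let mut h = DefaultHasher::new();
    bytes.hash(&mut h);
    Ok(format!("{:016x}", h.finish()))
}
```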

### 3. Audit & document the `.ai/` contract

Define a canonical spec for what `.ai/<page>.toml` must contain and how `content_hash` is computed, and publish it as part of the `docs_hero` library contributor guide. Both the upstream generator and the hero_books consumer must follow the same rule — right now they're drifting.
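As a starting point for that spec, a serde-style model of what the shipped TOML above already contains — field names mirror the `0_tech_overview.toml` example; this is a documentation sketch, not the existing `DocumentMetadata` type:

```rust
// Requires serde with the "derive" feature (plus the toml crate to parse).
use serde::Deserialize;
use std::collections::BTreeMap;

#[derive(Debug, Deserialize)]
struct PageAiMetadata {
    /// Hex digest of the canonical input (see fix 2 for what "canonical" means).
    content_hash: String,
    source_path: String,
    title: String,
    /// Unix timestamp; stored as a string in the shipped files.
    generated_at: String,
    /// Topic name -> its Q&A pairs, matching [[topics.<name>.pairs]].
    #[serde(default)]
    topics: BTreeMap<String, Topic>,
}

#[derive(Debug, Deserialize)]
struct Topic {
    pairs: Vec<QaPair>,
}

#[derive(Debug, Deserialize)]
struct QaPair {
    question: String,
    answer: String,
}
```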

### 4. Reuse `.vectors.bin`

Even if Q&A extraction needs to re-run for some reason (e.g. schema migration), the pre-computed embeddings in `.vectors.bin` should always be reused. Upload them directly to hero_embedder instead of re-computing.
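A sketch of what reuse could look like, assuming the sidecar is a flat array of little-endian f32 values in 384-float rows (one row per embedded chunk). The actual layout is undocumented — which is itself an argument for fix 3:

```rust
use std::fs;
use std::io;
use std::path::Path;

const DIM: usize = 384; // BGE-small embedding dimension

/// Load pre-computed embeddings from a shipped sidecar, ASSUMING a flat
/// little-endian f32 layout; rejects files whose size doesn't fit that guess.
fn load_shipped_vectors(path: &Path) -> io::Result<Vec<Vec<f32>>> {
    let bytes = fs::read(path)?;
    if bytes.len() % (DIM * 4) != 0 {
        return Err(io::Error::new(
            io::ErrorKind::InvalidData,
            "sidecar size is not a multiple of 384 f32s — unknown layout",
        ));
    }
    Ok(bytes
        .chunks_exact(DIM * 4)
        .map(|row| {
            row.chunks_exact(4)
                .map(|b| f32::from_le_bytes([b[0], b[1], b[2], b[3]]))
                .collect()
        })
        .collect())
}
```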

## Impact on demo

- Without this fix: mycelium + ourworld indexing takes 20-40 minutes when adding them to `libraries.txt`, during which the AI Assistant can't ground answers in their content.
- With this fix: seconds (just read the TOML + upload vectors). The whole 40-book demo becomes ready within a minute of clone.

## Verification

After fix:

```bash
# Restart hero_books fresh on a VM with all 4 libraries cloned
hero_proc service restart hero_books
# Watch logs:
hero_proc log query hero_books --lines 200 | grep 'Indexed'
# Expect: all 40 books show [Q&A cached] or [Q&A prebuilt], 0 LLM calls to aibroker.
# Total elapsed < 60s.
```
## Related

- https://forge.ourworld.tf/lhumina_code/home/issues/148 — nu-demo architecture index
- Sibling issue (Books UI double-slash) — why the UI couldn't display the libraries once indexed

Signed-off-by: mik-tf

---

mik-tf commented:

Moved to hero_demo#26 — see https://forge.ourworld.tf/lhumina_code/hero_demo/issues/26