rpc.discover can return a stale OpenRPC spec — regenerated clients then drift #32

Closed
opened 2026-04-21 16:46:36 +00:00 by timur · 1 comment
Owner

Problem

rpc.discover is implemented by reading a cached OpenRPC document that the server holds in memory. In practice, this document can fall out of sync with the live binary — if the server registers a service after caching, if the cache is populated from a static include_str!'d file that pre-dates the last regen, or if the service registry mutates between cache-build and request time.

Observed: a running hero_logic binary (built Apr 20) serves an rpc.discover response that omits LogicService.play_start and its sibling methods, even though those methods are present in both the compiled code and the .oschema. Every downstream tool that regenerates from rpc.discover then inherits the omission.
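A drift like this could be caught mechanically by comparing the method names the binary registers against the names listed in the discover response. The sketch below is only an illustration: `missing_from_discover` is a hypothetical helper, and extracting the two name lists (from the service registry and from the OpenRPC JSON) is assumed to happen elsewhere.

```rust
use std::collections::BTreeSet;

/// Returns the methods the binary registers but the discover response omits.
/// Both inputs are plain method-name lists; pulling them out of the registry
/// and out of the OpenRPC document is outside the scope of this sketch.
fn missing_from_discover(registered: &[&str], discovered: &[&str]) -> Vec<String> {
    // Set lookup keeps the comparison simple and the output order stable.
    let discovered: BTreeSet<&str> = discovered.iter().copied().collect();
    registered
        .iter()
        .filter(|name| !discovered.contains(*name))
        .map(|name| name.to_string())
        .collect()
}
```

Run as a startup check or a unit test, a non-empty result would flag the kind of omission described above before any client is regenerated from it.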

Expected

rpc.discover should return the exact same bytes as the OpenRPC spec compiled into the running binary. No caching layer between "what the code says" and "what the discover endpoint returns." If caching is needed for performance, it should invalidate on server start, not persist across a restart.

Proposed fix

Option A: Serve rpc.discover directly from the include_str!'d spec constant. No runtime rebuild, no cache. O(1), and always identical to the spec compiled into the binary.
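A minimal sketch of Option A, under stated assumptions: the handler name `handle_rpc_discover` and the constant `OPENRPC_SPEC` are hypothetical, and a string literal stands in for the `include_str!` call so the snippet compiles without a spec file on disk.

```rust
// Stand-in for the embedded spec. In a real service this would be something
// like `include_str!("openrpc.json")` (hypothetical path), which the compiler
// turns into a `&'static str` at build time.
const OPENRPC_SPEC: &str =
    r#"{"openrpc":"1.2.6","info":{"title":"example","version":"0.0.0"},"methods":[]}"#;

/// Hypothetical handler for `rpc.discover`: returns the embedded text as-is.
/// There is no registry walk and no cache, so nothing can change after the build.
fn handle_rpc_discover() -> &'static str {
    OPENRPC_SPEC
}
```

The response can then only be as fresh as the file that was embedded, so this approach relies on the spec being regenerated before the binary is compiled.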

Option B: Keep the current cache, but rebuild it on every server start (clear on startup, first call triggers rebuild from schema). Slightly more work on first call, safe across restarts.
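One way Option B might look, assuming the cache only needs to live as long as the process: a `std::sync::OnceLock` starts empty on every process start and is filled on the first call. `build_spec_from_schema` and `BUILD_COUNT` are hypothetical names; the counter exists only to make the single rebuild observable.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::OnceLock;

// Process-lifetime cache: empty whenever the process starts, so nothing
// survives a restart.
static SPEC_CACHE: OnceLock<String> = OnceLock::new();

// Counts rebuilds; present only to make the behaviour observable.
static BUILD_COUNT: AtomicUsize = AtomicUsize::new(0);

/// Hypothetical stand-in for rebuilding the spec from the schema.
fn build_spec_from_schema() -> String {
    BUILD_COUNT.fetch_add(1, Ordering::SeqCst);
    String::from(r#"{"openrpc":"1.2.6","info":{"title":"example","version":"0.0.0"},"methods":[]}"#)
}

/// Hypothetical handler: the first call builds the spec, later calls reuse it.
fn handle_rpc_discover() -> &'static str {
    SPEC_CACHE.get_or_init(build_spec_from_schema).as_str()
}
```

This covers the restart case only. A service registered after the first call would still be missing from the cached document, which is one reason Option A is the simpler choice.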

Either is fine. Option A is simpler.

Related

  • #29 (the issue that led to discovery of this drift; tactical fix)
  • #(big-move) (the proper fix: move Python codegen to build-time so rpc.discover isn't the canonical source of truth anymore)

This issue is specifically about the discover endpoint's correctness, regardless of what consumes it.

Author
Owner

Promoting from "can land any time" to prerequisite for hero_logic#13.

#13's flow library + Service Agent v3 work depends on LogicService.flow_library_search returning matches based on the current set of methods, and on the agent then generating Python that calls the current generated client. If rpc.discover on hero_logic serves a stale embedded spec, the router's cached ~/.hero/var/router/python/hero_logic_client.py won't match what the running service actually accepts. The LLM would then generate calls to methods that no longer exist (or miss methods that just landed), and #13 would hit non-deterministic failures that look like agent bugs but are actually cache drift.

Proposed scope unchanged from this issue's original body: serve rpc.discover directly from the include_str!'d constant — no in-memory cache between the embedded spec and the response. Optionally: invalidate the router's per-service hash on service-process-restart so a rebuilt service triggers regeneration without waiting for the next scanner pass.
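The optional router-side invalidation could be sketched roughly as follows. Everything here is an assumption made for illustration: the `SpecHashes` type, its method names, and the use of `DefaultHasher` are not taken from the router's actual code.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

/// Hypothetical router-side record: service name -> hash of the spec that the
/// cached client was generated from.
#[derive(Default)]
struct SpecHashes {
    by_service: HashMap<String, u64>,
}

/// Hashes the spec text. `DefaultHasher`'s algorithm is not guaranteed to stay
/// the same across Rust releases, so a hash persisted to disk would want a
/// stable digest instead.
fn spec_hash(spec: &str) -> u64 {
    let mut hasher = DefaultHasher::new();
    spec.hash(&mut hasher);
    hasher.finish()
}

impl SpecHashes {
    /// True when the client must be regenerated: no recorded hash, or the spec changed.
    fn needs_regen(&self, service: &str, spec: &str) -> bool {
        self.by_service.get(service) != Some(&spec_hash(spec))
    }

    /// Remember which spec the freshly generated client corresponds to.
    fn record(&mut self, service: &str, spec: &str) {
        self.by_service.insert(service.to_string(), spec_hash(spec));
    }

    /// Called when the service process restarts: forget the hash so the next
    /// check regenerates without waiting for a scanner pass.
    fn invalidate(&mut self, service: &str) {
        self.by_service.remove(service);
    }
}
```

With this shape, a restart calls `invalidate("hero_logic")`, and the next `needs_regen` check returns true even if the spec text happens to be unchanged.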

Ready to pick this one up next.

timur closed this issue 2026-05-05 11:26:24 +00:00
Reference
lhumina_code/hero_rpc#32