feat: implementation gaps to match PRD (resumable flows, generic pause, herolib_base, unified sub-flow API) #29

Closed
opened 2026-05-13 13:04:32 +00:00 by timur · 0 comments
Owner

Tracks the work to bring the codebase in line with the new PRD.md. Concrete deltas grouped by surface.

1. Generic pause / resume primitive (subsumes the original ask_user design from #28)

Rename and generalize:

  • New core primitive: flow.pause(name, *, schema=None, ui=None) in hero_tracing.py. Returns the resume payload on replay. Exits subprocess with code 75 on first hit, persists a ResumeRequest.
  • ask_user.text / number / choice / multi_choice / confirm become UI-flavored helpers over flow.pause(..., ui={...}). They do not become a separate mechanism.
  • New RPC: play_resume(play_sid, resume_id, payload_json) -> {ok, resumed_at} — one method, used by UI form posts, webhooks, cron triggers, and inter-service signals alike.
  • New RPC: play_pending_resumes(play_sid) -> [ResumeRequest].
  • New Play status: awaiting_resume. Lifecycle: pending → running → awaiting_resume → running → ... → success | failed | cancelled | timed_out.
  • ResumeRequest schema: {id, name, schema, ui, asked_at_span_id, asked_at, payload, resumed_at}. id is deterministic from span path + call sequence.
  • The UI rendering hint is ui != null filtering; non-UI resumes don't render forms.

2. Step memoization + replay

  • Compute step_key = sha1(workflow_version_sid + '|' + parent_path + '|' + flow_name + '|' + canonical_json(sorted_kwargs)) in the @flow wrapper.
  • On replay, short-circuit @flow calls whose key is in _STEP_CACHE. Emit the span with status=replayed.
  • Persist completed step outputs via a new step_output event over the span socket; server writes Play.step_outputs.
  • Boot stub loads HERO_REPLAY_STEP_OUTPUTS_FILE and HERO_REPLAY_RESUMES_FILE (env-pointed) into _STEP_CACHE and _ANSWER_CACHE before the user's flow runs.
  • Validate @flow(outputs=...) declarations on cache write; raise on non-serializable returns.
  • Workflow-version invalidation: workflow_version_sid is part of the step key; mismatch invalidates globally and the UI prompts to restart fresh or rollback.
  • Discourage time-based kwargs in @flow signatures (defeats memoization). Pull non-determinism inside the body.

3. Unified sub-flow API: flow.invoke(name, *, spawn=False, **inputs)

  • Single authoring function. Default spawn=False = in-process (current behaviour from F7).
  • spawn=True = calls LogicService.play_run_async over the local RPC socket, waits via play_wait(timeout_ms=0), returns the child's output_data.
  • Both produce a kind=subflow span on the parent; spawned variant additionally carries child_play_sid.
  • Decide on the future of the from <flow_name> import <fn> meta-path importer: either keep as sugar that compiles to flow.invoke(name, ...), or remove in favour of one explicit API. Recommend removing to keep one mental model.

4. Schema additions (logic.oschema)

  • Play.pending_resumes: [ResumeRequest]
  • Play.received_resumes: str (JSON {resume_id: payload})
  • Play.step_outputs: str (JSON {step_key: output})
  • Play.total_cost_usd: f64
  • New value type ResumeRequest (shape above).
  • PlayStatus (renamed from ExecutionStatus) adds awaiting_resume.
  • SpanStatus adds replayed.
  • Span adds kind: SpanKind, source_file: str, source_line: u32.
  • New value type SpanKind = flow_root | step | rpc | subflow | other.
  • New value type FlowField (rename of FlowInput); Workflow.outputs: [FlowField] is populated from @flow(outputs=...) on save.

5. RPC additions

  • play_start(workflow_sid, input_data, name, prefill_resumes) -> Play
  • play_run_async(workflow_sid, input_data, parent_span_id, prefill_resumes) -> str
  • play_resume(play_sid, resume_id, payload_json) -> {ok, resumed_at}
  • play_pending_resumes(play_sid) -> [ResumeRequest]

6. Admin UI — bottom-bar island

In the play-detail page, below the flow tree:

  • Resizable + collapsible (state in localStorage).
  • Three tabs: Logs (live span events), Pending resumes (every ResumeRequest with ui != null, badged with count, renders form per ui.kind, posts to play_resume), Events (full Play.spans history).
  • Graph view: mark replayed spans with dashed border + "↻ replayed" tooltip.
  • Graph view: subflow card collapsed by default; expand renders child Play's full graph inline (recursive); toolbar bulk controls Expand all / Collapse all / Expand to failures.

7. herolib_base lifecycle migration

All three binaries (hero_logic, hero_logic_server, hero_logic_admin) move to the latest hero standard:

  • Add service.toml at each crate root declaring kind, sockets, protocols, env vars.
  • service_base!(); macro at module scope.
  • main.rs order: validate_service_toml → handle_info_flag → Args::parse → print_startup_banner → prepare_sockets (server/admin only).
  • Remove hand-rolled print_startup_info(), socket dir resolution, stale-socket cleanup.
  • BUILD_NR constant via option_env!("HERO_BUILD_NR").
  • hero_logic CLI keeps --start / --stop semantics; replace the HeroServices::new(...) registration path with the lab/proc pattern from herolib_base and hero_service.

8. Pre-filled resumes for non-interactive runs

  • play_start / play_run_async accept prefill_resumes: {resume_name_or_pattern → payload}.
  • End-to-end test scripts in /examples/ use this so plays run headlessly.
  • If a pause is hit that isn't pre-filled, the run fails fast (no hanging waits).

9. /examples/ E2E scaffold

  • Repo-root /examples/ directory exists (README.md already added). Populate with at least one driver script per major scenario: simple play, sub-flow composition, pause/resume cycle, multi-pause replay.

Acceptance

  • A flow that calls flow.pause("approve", ui={"kind":"confirm"}) exits with awaiting_resume; UI renders a confirm button; play_resume resumes the play and the flow returns the payload.
  • A flow with three steps where step 2 pauses replays steps 1 + 2 from cache and only step 3 onwards executes fresh.
  • A flow that does an OSIS *_set followed by a pause does NOT double-create the OSIS record on resume.
  • flow.invoke(name, **inputs) and flow.invoke(name, spawn=True, **inputs) both work; the latter shows the child Play expandable inline in the graph view.
  • lab infocheck reports 0 issues for all three binaries; startup banners are consistent with service.toml; --info returns the manifest.
  • /examples/ directory has at least one runnable driver covering pause/resume with prefill_resumes.
Tracks the work to bring the codebase in line with the new [PRD.md](../src/branch/development/PRD.md). Concrete deltas grouped by surface. ## 1. Generic pause / resume primitive (subsumes the original `ask_user` design from #28) Rename and generalize: - New core primitive: `flow.pause(name, *, schema=None, ui=None)` in `hero_tracing.py`. Returns the resume payload on replay. Exits subprocess with code 75 on first hit, persists a `ResumeRequest`. - `ask_user.text / number / choice / multi_choice / confirm` become **UI-flavored helpers** over `flow.pause(..., ui={...})`. They do not become a separate mechanism. - New RPC: `play_resume(play_sid, resume_id, payload_json) -> {ok, resumed_at}` — one method, used by UI form posts, webhooks, cron triggers, and inter-service signals alike. - New RPC: `play_pending_resumes(play_sid) -> [ResumeRequest]`. - New Play status: `awaiting_resume`. Lifecycle: `pending → running → awaiting_resume → running → ... → success | failed | cancelled | timed_out`. - `ResumeRequest` schema: `{id, name, schema, ui, asked_at_span_id, asked_at, payload, resumed_at}`. `id` is deterministic from span path + call sequence. - The UI rendering hint is `ui != null` filtering; non-UI resumes don't render forms. ## 2. Step memoization + replay - Compute `step_key = sha1(workflow_version_sid + '|' + parent_path + '|' + flow_name + '|' + canonical_json(sorted_kwargs))` in the `@flow` wrapper. - On replay, short-circuit `@flow` calls whose key is in `_STEP_CACHE`. Emit the span with `status=replayed`. - Persist completed step outputs via a new `step_output` event over the span socket; server writes `Play.step_outputs`. - Boot stub loads `HERO_REPLAY_STEP_OUTPUTS_FILE` and `HERO_REPLAY_RESUMES_FILE` (env-pointed) into `_STEP_CACHE` and `_ANSWER_CACHE` before the user's flow runs. - Validate `@flow(outputs=...)` declarations on cache write; raise on non-serializable returns. - Workflow-version invalidation: `workflow_version_sid` is part of the step key; mismatch invalidates globally and the UI prompts to restart fresh or rollback. - Discourage time-based kwargs in `@flow` signatures (defeats memoization). Pull non-determinism inside the body. ## 3. Unified sub-flow API: `flow.invoke(name, *, spawn=False, **inputs)` - Single authoring function. Default `spawn=False` = in-process (current behaviour from F7). - `spawn=True` = calls `LogicService.play_run_async` over the local RPC socket, waits via `play_wait(timeout_ms=0)`, returns the child's `output_data`. - Both produce a `kind=subflow` span on the parent; spawned variant additionally carries `child_play_sid`. - Decide on the future of the `from <flow_name> import <fn>` meta-path importer: either keep as sugar that compiles to `flow.invoke(name, ...)`, or remove in favour of one explicit API. Recommend removing to keep one mental model. ## 4. Schema additions (`logic.oschema`) - `Play.pending_resumes: [ResumeRequest]` - `Play.received_resumes: str` (JSON `{resume_id: payload}`) - `Play.step_outputs: str` (JSON `{step_key: output}`) - `Play.total_cost_usd: f64` - New value type `ResumeRequest` (shape above). - `PlayStatus` (renamed from `ExecutionStatus`) adds `awaiting_resume`. - `SpanStatus` adds `replayed`. - `Span` adds `kind: SpanKind`, `source_file: str`, `source_line: u32`. - New value type `SpanKind = flow_root | step | rpc | subflow | other`. - New value type `FlowField` (rename of `FlowInput`); `Workflow.outputs: [FlowField]` is populated from `@flow(outputs=...)` on save. ## 5. RPC additions - `play_start(workflow_sid, input_data, name, prefill_resumes) -> Play` - `play_run_async(workflow_sid, input_data, parent_span_id, prefill_resumes) -> str` - `play_resume(play_sid, resume_id, payload_json) -> {ok, resumed_at}` - `play_pending_resumes(play_sid) -> [ResumeRequest]` ## 6. Admin UI — bottom-bar island In the play-detail page, below the flow tree: - Resizable + collapsible (state in localStorage). - Three tabs: **Logs** (live span events), **Pending resumes** (every `ResumeRequest` with `ui != null`, badged with count, renders form per `ui.kind`, posts to `play_resume`), **Events** (full Play.spans history). - Graph view: mark replayed spans with dashed border + "↻ replayed" tooltip. - Graph view: subflow card collapsed by default; expand renders child Play's full graph inline (recursive); toolbar bulk controls Expand all / Collapse all / **Expand to failures**. ## 7. `herolib_base` lifecycle migration All three binaries (`hero_logic`, `hero_logic_server`, `hero_logic_admin`) move to the latest hero standard: - Add `service.toml` at each crate root declaring kind, sockets, protocols, env vars. - `service_base!();` macro at module scope. - `main.rs` order: `validate_service_toml → handle_info_flag → Args::parse → print_startup_banner → prepare_sockets` (server/admin only). - Remove hand-rolled `print_startup_info()`, socket dir resolution, stale-socket cleanup. - `BUILD_NR` constant via `option_env!("HERO_BUILD_NR")`. - `hero_logic` CLI keeps `--start` / `--stop` semantics; replace the `HeroServices::new(...)` registration path with the lab/proc pattern from [`herolib_base`](https://forge.ourworld.tf/lhumina_code/hero_skills/src/branch/development/claude/skills/herolib_base/SKILL.md) and [`hero_service`](https://forge.ourworld.tf/lhumina_code/hero_skills/src/branch/development/claude/skills/hero_service/SKILL.md). ## 8. Pre-filled resumes for non-interactive runs - `play_start` / `play_run_async` accept `prefill_resumes: {resume_name_or_pattern → payload}`. - End-to-end test scripts in `/examples/` use this so plays run headlessly. - If a pause is hit that isn't pre-filled, the run **fails fast** (no hanging waits). ## 9. `/examples/` E2E scaffold - Repo-root `/examples/` directory exists (`README.md` already added). Populate with at least one driver script per major scenario: simple play, sub-flow composition, pause/resume cycle, multi-pause replay. ## Acceptance - A flow that calls `flow.pause("approve", ui={"kind":"confirm"})` exits with `awaiting_resume`; UI renders a confirm button; `play_resume` resumes the play and the flow returns the payload. - A flow with three steps where step 2 pauses replays steps 1 + 2 from cache and only step 3 onwards executes fresh. - A flow that does an OSIS `*_set` followed by a pause does NOT double-create the OSIS record on resume. - `flow.invoke(name, **inputs)` and `flow.invoke(name, spawn=True, **inputs)` both work; the latter shows the child Play expandable inline in the graph view. - `lab infocheck` reports 0 issues for all three binaries; startup banners are consistent with `service.toml`; `--info` returns the manifest. - `/examples/` directory has at least one runnable driver covering pause/resume with `prefill_resumes`.
timur closed this issue 2026-05-14 10:07:30 +00:00
Sign in to join this conversation.
No labels
No milestone
No project
No assignees
1 participant
Notifications
Due date
The due date is invalid or out of range. Please use the format "yyyy-mm-dd".

No due date set.

Dependencies

No dependencies set.

Reference
lhumina_code/hero_logic#29
No description provided.