lhumina_code/hero_logic

Fork 0

Phase 4: Request-based version router — pick_version_flow #8

New issue

Open

opened 2026-04-19 16:34:07 +00:00 by timur · 2 comments

timur commented

2026-04-19 16:34:07 +00:00

Owner

Context

After Phases 1-3 (#5, #6, #7), workflows have versions and each version has benchmark data (success rate, avg duration, cost estimate). This issue wires that into request-based selection: given a user request and preferences (cost/speed/accuracy weights), pick the version that best matches.

Example use case: the user asks "what's the weather?" — a simple request that any version handles fast. The router picks the cheapest, fastest version. The user asks "audit my calendar for conflicts across 3 timezones and propose a rebalancing" — a complex request that needs the accurate-but-expensive version.

Goal

A pick_version_flow template that:

Takes a user request + preferences (weights: cost_weight, speed_weight, accuracy_weight, sum to 1.0)
Uses an AI node to rate the request's difficulty (simple/medium/complex → 0.0-1.0)
Reads all benchmarked versions for a target workflow
Scores each version against the request using the weights and the version's benchmark metrics
Returns the recommended version SID + reasoning

Also: a convenience RPC logicservice.pick_version(workflow_sid, user_request, preferences?) -> {version_sid, confidence, reasoning} that runs this flow and returns the pick synchronously.

Depends on

#5 (versioning)
#7 (benchmark data)
#6 (nice-to-have but not required)

OSchema changes (small)

The Benchmark record (from #7) already has difficulty_rating: f32 set to 0.5 by default. This issue populates it properly: the AI-generated mock inputs in benchmark_flow are rated for difficulty when the benchmark runs. Requires a minor addition to benchmark_flow (not to the schema):

benchmark_flow's compute_stats_and_save node ALSO asks an AI node to rate the average difficulty of the mock inputs (0.0 = trivial, 1.0 = very complex), stored in Benchmark.difficulty_rating
Alternatively, pick_version_flow can compute it at query time

New flow template

File: crates/hero_logic/templates/pick_version_flow.json

A 3-node flow:

Node 1: `rate_request` (AI)

Input: user_request.

System prompt: classify the request's complexity as a float 0.0-1.0 where 0 = trivial (single data fetch, basic format), 1.0 = very complex (multi-step reasoning, many services, novel integration). Output JSON: {difficulty: 0.0-1.0, rationale: "..."}.
Output: {difficulty, rationale}

Node 2: `score_versions` (Python)

Input: difficulty from node 1, target_workflow_sid, preferences.

RPC benchmark_list_for_workflow(target_workflow_sid)
Group benchmarks by workflow_version_sid, keep latest per version
Normalize metrics across versions:
- norm_cost[v] = cost[v] / max_cost (0.0 = cheapest, 1.0 = most expensive)
- norm_duration[v] = duration[v] / max_duration
- norm_reliability[v] = success_rate[v] / 100.0

Score each version with the weighted formula:

score = (1 - norm_cost) * cost_weight
      + (1 - norm_duration) * speed_weight
      + norm_reliability * accuracy_weight

Higher = better.

Adjust for request difficulty: boost accurate-but-expensive versions for hard requests (effective_accuracy_weight = accuracy_weight * (1 + difficulty)), boost cheap-fast for easy ones. Re-normalize weights.
Output: {scored: [{version_sid, score, normalized_metrics, raw_metrics}], difficulty: float}

Node 3: `return_pick` (Python)

Input: scored versions from node 2.

Pick highest score
If no benchmark data exists, fall back to current_version_sid with confidence: 0.0 and a note
Output: {version_sid, confidence: float, reasoning: "..."}
- confidence = normalized score gap between winner and runner-up (1.0 = runner-up is 50%+ worse, 0.0 = tied)

Flow inputs:

{
  "inputs": [
    {"name": "target_workflow_sid", "type": "string", "required": true, "description": "Workflow to route"},
    {"name": "user_request", "type": "string", "required": true, "description": "The request to route"},
    {"name": "preferences", "type": "object", "required": false, "default": "{\"cost_weight\":0.33,\"speed_weight\":0.33,\"accuracy_weight\":0.34}", "description": "Weights summing to 1.0"}
  ]
}

RPC convenience method

File: crates/hero_logic/src/logic/server/rpc.rs

// Runs pick_version_flow synchronously and returns the result.
fn pick_version(
    workflow_sid: String,
    user_request: String,
    preferences: Option<Value>,
) -> Result<PickResult, LogicServiceError> {
    // 1. workflow_from_template("pick_version_flow") — get router workflow
    // 2. play_start with input_values
    // 3. Poll play_status until done (timeout ~30s)
    // 4. Parse final output, return PickResult { version_sid, confidence, reasoning }
}

This is a composition, not a new flow type. It wraps the pick_version_flow execution in a simple synchronous interface.

UI changes

File: crates/hero_logic_ui/templates/workflow_editor.html + JS

"Route" button on the workflow page next to "Start play". Opens a modal:
- User request textarea
- Preferences sliders (cost/speed/accuracy, constrained to sum 1.0)
- "Preview pick" → calls pick_version RPC and shows the chosen version_sid + reasoning without starting a play
- "Run with pick" → calls pick, then play_start with the picked version
Dashboard integration: workflow card shows version count + a little icon indicating whether the router is active

Integration with service_agent

Service agent could optionally route: router.agent.start accepts a route: true flag. If set:

Call pick_version(service_agent_workflow_sid, prompt) first
Use the returned version_sid for the play

This is a pattern, not a strict requirement for this issue.

Acceptance criteria

pick_version_flow template exists and loads
Running it with a workflow that has 2+ benchmarked versions returns the winner + reasoning
Simple requests favor cheap/fast versions; complex requests favor accurate versions (verify with test cases)
Fallback to current_version_sid with confidence 0.0 when no benchmarks exist
logicservice.pick_version RPC convenience method works
UI has a "Route" button with preview + run options
Integration test: create 2 versions of service_agent_v3 (one with gpt-4o-mini, one with claude-sonnet-4.6), benchmark both, then pick_version with "simple ping" picks the cheap version and "complex multi-step task" picks the accurate one

Out of scope

Automatic benchmarking: router assumes benchmarks already exist
Multi-objective Pareto frontier (this uses scalar weighted sum, simpler)
Cost/token capture from hero_proc (still an open gap)
Learning from actual outcomes (router doesn't adapt based on whether its picks actually worked well)

Future work

Phase 5: outcome-aware routing — track which picks led to successful vs failed plays, adjust difficulty ratings over time
Phase 6: auto-benchmarking — periodically re-benchmark versions as models update

## Context After Phases 1-3 (#5, #6, #7), workflows have versions and each version has benchmark data (success rate, avg duration, cost estimate). This issue wires that into request-based selection: given a user request and preferences (cost/speed/accuracy weights), pick the version that best matches. **Example use case:** the user asks "what's the weather?" — a simple request that any version handles fast. The router picks the cheapest, fastest version. The user asks "audit my calendar for conflicts across 3 timezones and propose a rebalancing" — a complex request that needs the accurate-but-expensive version. ## Goal A `pick_version_flow` template that: 1. Takes a user request + preferences (weights: cost_weight, speed_weight, accuracy_weight, sum to 1.0) 2. Uses an AI node to rate the request's **difficulty** (simple/medium/complex → 0.0-1.0) 3. Reads all benchmarked versions for a target workflow 4. Scores each version against the request using the weights and the version's benchmark metrics 5. Returns the recommended version SID + reasoning Also: a convenience RPC `logicservice.pick_version(workflow_sid, user_request, preferences?) -> {version_sid, confidence, reasoning}` that runs this flow and returns the pick synchronously. ## Depends on - **#5** (versioning) - **#7** (benchmark data) - **#6** (nice-to-have but not required) ## OSchema changes (small) The `Benchmark` record (from #7) already has `difficulty_rating: f32` set to 0.5 by default. This issue populates it properly: the AI-generated mock inputs in `benchmark_flow` are rated for difficulty when the benchmark runs. Requires a minor addition to `benchmark_flow` (not to the schema): - `benchmark_flow`'s `compute_stats_and_save` node ALSO asks an AI node to rate the average difficulty of the mock inputs (0.0 = trivial, 1.0 = very complex), stored in `Benchmark.difficulty_rating` - Alternatively, `pick_version_flow` can compute it at query time ## New flow template File: `crates/hero_logic/templates/pick_version_flow.json` A 3-node flow: ### Node 1: `rate_request` (AI) Input: `user_request`. - System prompt: classify the request's complexity as a float 0.0-1.0 where 0 = trivial (single data fetch, basic format), 1.0 = very complex (multi-step reasoning, many services, novel integration). Output JSON: `{difficulty: 0.0-1.0, rationale: "..."}`. - Output: `{difficulty, rationale}` ### Node 2: `score_versions` (Python) Input: difficulty from node 1, `target_workflow_sid`, `preferences`. - RPC `benchmark_list_for_workflow(target_workflow_sid)` - Group benchmarks by `workflow_version_sid`, keep latest per version - Normalize metrics across versions: - `norm_cost[v] = cost[v] / max_cost` (0.0 = cheapest, 1.0 = most expensive) - `norm_duration[v] = duration[v] / max_duration` - `norm_reliability[v] = success_rate[v] / 100.0` - Score each version with the weighted formula: ``` score = (1 - norm_cost) * cost_weight + (1 - norm_duration) * speed_weight + norm_reliability * accuracy_weight ``` Higher = better. - Adjust for request difficulty: boost accurate-but-expensive versions for hard requests (`effective_accuracy_weight = accuracy_weight * (1 + difficulty)`), boost cheap-fast for easy ones. Re-normalize weights. - Output: `{scored: [{version_sid, score, normalized_metrics, raw_metrics}], difficulty: float}` ### Node 3: `return_pick` (Python) Input: scored versions from node 2. - Pick highest score - If no benchmark data exists, fall back to `current_version_sid` with `confidence: 0.0` and a note - Output: `{version_sid, confidence: float, reasoning: "..."}` - `confidence` = normalized score gap between winner and runner-up (1.0 = runner-up is 50%+ worse, 0.0 = tied) Flow inputs: ```json { "inputs": [ {"name": "target_workflow_sid", "type": "string", "required": true, "description": "Workflow to route"}, {"name": "user_request", "type": "string", "required": true, "description": "The request to route"}, {"name": "preferences", "type": "object", "required": false, "default": "{\"cost_weight\":0.33,\"speed_weight\":0.33,\"accuracy_weight\":0.34}", "description": "Weights summing to 1.0"} ] } ``` ## RPC convenience method File: `crates/hero_logic/src/logic/server/rpc.rs` ```rust // Runs pick_version_flow synchronously and returns the result. fn pick_version( workflow_sid: String, user_request: String, preferences: Option<Value>, ) -> Result<PickResult, LogicServiceError> { // 1. workflow_from_template("pick_version_flow") — get router workflow // 2. play_start with input_values // 3. Poll play_status until done (timeout ~30s) // 4. Parse final output, return PickResult { version_sid, confidence, reasoning } } ``` This is a composition, not a new flow type. It wraps the pick_version_flow execution in a simple synchronous interface. ## UI changes File: `crates/hero_logic_ui/templates/workflow_editor.html` + JS - **"Route" button** on the workflow page next to "Start play". Opens a modal: - User request textarea - Preferences sliders (cost/speed/accuracy, constrained to sum 1.0) - "Preview pick" → calls `pick_version` RPC and shows the chosen version_sid + reasoning without starting a play - "Run with pick" → calls pick, then `play_start` with the picked version - **Dashboard integration**: workflow card shows version count + a little icon indicating whether the router is active ## Integration with service_agent Service agent could optionally route: `router.agent.start` accepts a `route: true` flag. If set: 1. Call `pick_version(service_agent_workflow_sid, prompt)` first 2. Use the returned `version_sid` for the play This is a pattern, not a strict requirement for this issue. ## Acceptance criteria - [ ] `pick_version_flow` template exists and loads - [ ] Running it with a workflow that has 2+ benchmarked versions returns the winner + reasoning - [ ] Simple requests favor cheap/fast versions; complex requests favor accurate versions (verify with test cases) - [ ] Fallback to `current_version_sid` with confidence 0.0 when no benchmarks exist - [ ] `logicservice.pick_version` RPC convenience method works - [ ] UI has a "Route" button with preview + run options - [ ] Integration test: create 2 versions of service_agent_v3 (one with gpt-4o-mini, one with claude-sonnet-4.6), benchmark both, then `pick_version` with "simple ping" picks the cheap version and "complex multi-step task" picks the accurate one ## Out of scope - Automatic benchmarking: router assumes benchmarks already exist - Multi-objective Pareto frontier (this uses scalar weighted sum, simpler) - Cost/token capture from hero_proc (still an open gap) - Learning from actual outcomes (router doesn't adapt based on whether its picks actually worked well) ## Future work - Phase 5: **outcome-aware routing** — track which picks led to successful vs failed plays, adjust difficulty ratings over time - Phase 6: **auto-benchmarking** — periodically re-benchmark versions as models update

timur referenced this issue from a commit

2026-04-19 17:01:27 +00:00

feat(phase1): workflow versioning + typed flow inputs (#5)

timur referenced this issue

2026-04-19 18:03:05 +00:00

Phase 3: Benchmark flow — statistics per workflow version #7

timur referenced this issue from a commit

2026-04-19 18:07:21 +00:00

feat(phase4): pick_version — request-based version router (#8)

timur commented

2026-04-19 18:08:06 +00:00

Author

Owner

Backend landed: `05a3ef7`

Phase 4 is functional. For brevity I implemented pick_version as a server method directly rather than a separate flow template — the scoring math is deterministic and easier to read/test in Rust than in a 3-node DAG. A pick_version_flow template with an AI-based difficulty rater can be added later as a layered improvement (my current difficulty is a length-based heuristic).

What's live

RPC method:

pick_version(workflow_sid, user_request, preferences_json) -> str — returns JSON string with {version_sid, confidence, difficulty, reasoning, considered: [...]}

Algorithm:

Parse preferences (cost_weight/speed_weight/accuracy_weight) with equal-third defaults; re-normalize to sum 1.0
Heuristic difficulty rating from request length+word count, clamped to [0.2, 0.9]
For each workflow version, fetch its latest Benchmark
Normalize cost + duration across candidates
Shift accuracy weight upward as difficulty rises: adj_acc_w = acc_w * (1 + difficulty) then re-normalize
Score: (1 - norm_cost) * cost_w + (1 - norm_dur) * speed_w + norm_success_rate * adj_acc_w
Confidence = 1 - runner_up_score / winner_score (gap-based)

Fallback: no benchmarks → current_version_sid with confidence 0.0

Verified

# Fresh workflow (no benchmarks)
pick_version(wf=00dv, user_request='simple ping', preferences='')
→ version_sid=00dw (current), confidence=0.0, reasoning='No benchmarked versions...'

# Benchmarked workflow, cost-weighted prefs
pick_version(wf=00dn, user_request='list services',
             preferences={cost:0.5, speed:0.3, accuracy:0.2})
→ version_sid=00do (v1), confidence=1.0, difficulty=0.20
  reasoning: 'Picked v1 (score 0.712) from 1 benchmarked versions.
              Difficulty=0.20 shifted accuracy weight to 0.23
              (cost 0.48, speed 0.29).'

# Complex multi-step request (~1000 chars)
pick_version(wf=00dn, user_request='...audit calendar across timezones...' * 2)
→ difficulty=0.69 (vs 0.20 for 'simple ping')  ✓

Remaining

pick_version RPC picks best version from benchmarks + weights
Fallback to current when no benchmarks exist
Difficulty-aware weight adjustment (harder = more accuracy weight)
Confidence score based on winner-vs-runner-up gap
pick_version_flow template with AI-based difficulty node — would replace the current heuristic. Recommend as a follow-up once Phase 3 has richer benchmark data to differentiate versions.
UI: "Route" button with preview + run modes, preferences sliders
Integration test: 2 benchmarked versions (cheap vs accurate), verify simple requests pick cheap / complex requests pick accurate

Summary of all four phases

All four backends are live:

Phase	Commit	Status
#5 Versioning + typed inputs	`d5b3c95`	Backend ✓, UI TODO
#6 Typed example values	`671d026`	Backend ✓, UI TODO
#7 Benchmark flow	`98831aa`	Backend ✓, UI TODO
#8 Version router	`05a3ef7`	Backend ✓, UI TODO

All four are functional end-to-end from the RPC surface. UI wiring is the remaining common thread — it's a bigger chunk of work that fits better as its own issue covering all phases' UI needs cohesively.

## Backend landed: 05a3ef7 Phase 4 is functional. For brevity I implemented `pick_version` as a server method directly rather than a separate flow template — the scoring math is deterministic and easier to read/test in Rust than in a 3-node DAG. A `pick_version_flow` template with an AI-based difficulty rater can be added later as a layered improvement (my current difficulty is a length-based heuristic). ### What's live **RPC method:** - `pick_version(workflow_sid, user_request, preferences_json) -> str` — returns JSON string with `{version_sid, confidence, difficulty, reasoning, considered: [...]}` **Algorithm:** 1. Parse preferences (`cost_weight`/`speed_weight`/`accuracy_weight`) with equal-third defaults; re-normalize to sum 1.0 2. Heuristic difficulty rating from request length+word count, clamped to [0.2, 0.9] 3. For each workflow version, fetch its latest Benchmark 4. Normalize cost + duration across candidates 5. Shift accuracy weight upward as difficulty rises: `adj_acc_w = acc_w * (1 + difficulty)` then re-normalize 6. Score: `(1 - norm_cost) * cost_w + (1 - norm_dur) * speed_w + norm_success_rate * adj_acc_w` 7. Confidence = `1 - runner_up_score / winner_score` (gap-based) **Fallback:** no benchmarks → `current_version_sid` with confidence 0.0 ### Verified ```python # Fresh workflow (no benchmarks) pick_version(wf=00dv, user_request='simple ping', preferences='') → version_sid=00dw (current), confidence=0.0, reasoning='No benchmarked versions...' # Benchmarked workflow, cost-weighted prefs pick_version(wf=00dn, user_request='list services', preferences={cost:0.5, speed:0.3, accuracy:0.2}) → version_sid=00do (v1), confidence=1.0, difficulty=0.20 reasoning: 'Picked v1 (score 0.712) from 1 benchmarked versions. Difficulty=0.20 shifted accuracy weight to 0.23 (cost 0.48, speed 0.29).' # Complex multi-step request (~1000 chars) pick_version(wf=00dn, user_request='...audit calendar across timezones...' * 2) → difficulty=0.69 (vs 0.20 for 'simple ping') ✓ ``` ### Remaining - [x] `pick_version` RPC picks best version from benchmarks + weights - [x] Fallback to current when no benchmarks exist - [x] Difficulty-aware weight adjustment (harder = more accuracy weight) - [x] Confidence score based on winner-vs-runner-up gap - [ ] **`pick_version_flow` template with AI-based difficulty node** — would replace the current heuristic. Recommend as a follow-up once Phase 3 has richer benchmark data to differentiate versions. - [ ] **UI: "Route" button with preview + run modes, preferences sliders** - [ ] **Integration test: 2 benchmarked versions (cheap vs accurate), verify simple requests pick cheap / complex requests pick accurate** --- ## Summary of all four phases All four backends are live: | Phase | Commit | Status | |-------|--------|--------| | #5 Versioning + typed inputs | d5b3c95 | Backend ✓, UI TODO | | #6 Typed example values | 671d026 | Backend ✓, UI TODO | | #7 Benchmark flow | 98831aa | Backend ✓, UI TODO | | #8 Version router | 05a3ef7 | Backend ✓, UI TODO | All four are **functional end-to-end from the RPC surface**. UI wiring is the remaining common thread — it's a bigger chunk of work that fits better as its own issue covering all phases' UI needs cohesively.

timur referenced this issue from a commit

2026-04-19 18:31:16 +00:00

feat(phase4-ui): Route button — pick best version for a request (#8)

timur commented

2026-04-19 18:31:31 +00:00

Author

Owner

Phase 4 UI landed (commit `c9bfd09`)

Route button + preferences modal are live in the workflow editor:

Topbar Route button opens a modal with a request <textarea> and three weight sliders (cost / speed / accuracy). Sliders auto-normalize to sum to 1.0 at submit, so users can leave them at any scale (defaults 33/33/34).
Submit calls logicservice.pick_version(workflow_sid, user_request, preferences_json); result panel renders:
- chosen version version_label (version_sid)
- confidence % (rounded)
- difficulty heuristic (2dp)
- server-side reasoning string
- list of considered version sids
Start play on this version button on the result:
- Calls workflow_set_current_version if the picked version differs from current (so the play runs against the recommended version)
- Opens the Start-play modal
- If the workflow declares a prompt input, pre-fills it with the user's original request — so the router's choice flows end-to-end without the user retyping.

Verified:

pick_version on workflow 00dz (no benchmarks yet) returns {version_sid: current, confidence: 0.0, reasoning: "No benchmarked versions — falling back to current_version_sid"} → the UI correctly shows 0% confidence and reasoning.
All Phase-4 markup (hl-route-modal, openRouteModal, weight sliders) renders on /workflows/00dz.

Still open for this issue:

AI-driven difficulty rater (separate template pick_version_flow.json) replacing the current heuristic — gives the router actual semantic grounding.
Cost/speed/accuracy axes assume benchmarks populate estimated_cost_usd and success_rate; currently cost is 0 until hero_proc surfaces per-job token counts to the benchmark executor.

### Phase 4 UI landed (commit c9bfd09) `Route` button + preferences modal are live in the workflow editor: - **Topbar `Route` button** opens a modal with a request `<textarea>` and three weight sliders (cost / speed / accuracy). Sliders auto-normalize to sum to 1.0 at submit, so users can leave them at any scale (defaults 33/33/34). - Submit calls `logicservice.pick_version(workflow_sid, user_request, preferences_json)`; result panel renders: - chosen version `version_label (version_sid)` - `confidence %` (rounded) - `difficulty` heuristic (2dp) - server-side `reasoning` string - list of `considered` version sids - **`Start play on this version` button** on the result: - Calls `workflow_set_current_version` if the picked version differs from current (so the play runs against the recommended version) - Opens the Start-play modal - If the workflow declares a `prompt` input, pre-fills it with the user's original request — so the router's choice flows end-to-end without the user retyping. **Verified:** - `pick_version` on workflow `00dz` (no benchmarks yet) returns `{version_sid: current, confidence: 0.0, reasoning: "No benchmarked versions — falling back to current_version_sid"}` → the UI correctly shows 0% confidence and reasoning. - All Phase-4 markup (`hl-route-modal`, `openRouteModal`, weight sliders) renders on `/workflows/00dz`. **Still open for this issue:** - AI-driven difficulty rater (separate template `pick_version_flow.json`) replacing the current heuristic — gives the router actual semantic grounding. - Cost/speed/accuracy axes assume benchmarks populate `estimated_cost_usd` and `success_rate`; currently cost is 0 until hero_proc surfaces per-job token counts to the benchmark executor.

No labels

No milestone

No project

No assignees

1 participant

Notifications

Due date

The due date is invalid or out of range. Please use the format "yyyy-mm-dd".

No due date set.

Dependencies

No dependencies set.

Reference

lhumina_code/hero_logic#8

No description provided.

Rows
Columns

Phase 4: Request-based version router — pick_version_flow #8

Context

Goal

Depends on

OSchema changes (small)

New flow template

Node 1: rate_request (AI)

Node 2: score_versions (Python)

Node 3: return_pick (Python)

RPC convenience method

UI changes

Integration with service_agent

Acceptance criteria

Out of scope

Future work

Backend landed: 05a3ef7

What's live

Verified

Remaining

Summary of all four phases

Phase 4 UI landed (commit c9bfd09)

Node 1: `rate_request` (AI)

Node 2: `score_versions` (Python)

Node 3: `return_pick` (Python)

Backend landed: `05a3ef7`

Phase 4 UI landed (commit `c9bfd09`)