Phase 4: Request-based version router — pick_version_flow #8

Open
opened 2026-04-19 16:34:07 +00:00 by timur · 2 comments
Owner

Context

After Phases 1-3 (#5, #6, #7), workflows have versions and each version has benchmark data (success rate, avg duration, cost estimate). This issue wires that into request-based selection: given a user request and preferences (cost/speed/accuracy weights), pick the version that best matches.

Example use case: the user asks "what's the weather?" — a simple request that any version handles fast. The router picks the cheapest, fastest version. The user asks "audit my calendar for conflicts across 3 timezones and propose a rebalancing" — a complex request that needs the accurate-but-expensive version.

Goal

A pick_version_flow template that:

  1. Takes a user request + preferences (weights: cost_weight, speed_weight, accuracy_weight, sum to 1.0)
  2. Uses an AI node to rate the request's difficulty (simple/medium/complex → 0.0-1.0)
  3. Reads all benchmarked versions for a target workflow
  4. Scores each version against the request using the weights and the version's benchmark metrics
  5. Returns the recommended version SID + reasoning

Also: a convenience RPC logicservice.pick_version(workflow_sid, user_request, preferences?) -> {version_sid, confidence, reasoning} that runs this flow and returns the pick synchronously.

Depends on

  • #5 (versioning)
  • #7 (benchmark data)
  • #6 (nice-to-have but not required)

OSchema changes (small)

The Benchmark record (from #7) already has difficulty_rating: f32 set to 0.5 by default. This issue populates it properly: the AI-generated mock inputs in benchmark_flow are rated for difficulty when the benchmark runs. Requires a minor addition to benchmark_flow (not to the schema):

  • benchmark_flow's compute_stats_and_save node ALSO asks an AI node to rate the average difficulty of the mock inputs (0.0 = trivial, 1.0 = very complex), stored in Benchmark.difficulty_rating
  • Alternatively, pick_version_flow can compute it at query time

New flow template

File: crates/hero_logic/templates/pick_version_flow.json

A 3-node flow:

Node 1: rate_request (AI)

Input: user_request.

  • System prompt: classify the request's complexity as a float 0.0-1.0 where 0 = trivial (single data fetch, basic format), 1.0 = very complex (multi-step reasoning, many services, novel integration). Output JSON: {difficulty: 0.0-1.0, rationale: "..."}.
  • Output: {difficulty, rationale}

Node 2: score_versions (Python)

Input: difficulty from node 1, target_workflow_sid, preferences.

  • RPC benchmark_list_for_workflow(target_workflow_sid)
  • Group benchmarks by workflow_version_sid, keep latest per version
  • Normalize metrics across versions:
    • norm_cost[v] = cost[v] / max_cost (0.0 = cheapest, 1.0 = most expensive)
    • norm_duration[v] = duration[v] / max_duration
    • norm_reliability[v] = success_rate[v] / 100.0
  • Score each version with the weighted formula:
    score = (1 - norm_cost) * cost_weight
          + (1 - norm_duration) * speed_weight
          + norm_reliability * accuracy_weight
    
    Higher = better.
  • Adjust for request difficulty: boost accurate-but-expensive versions for hard requests (effective_accuracy_weight = accuracy_weight * (1 + difficulty)), boost cheap-fast for easy ones. Re-normalize weights.
  • Output: {scored: [{version_sid, score, normalized_metrics, raw_metrics}], difficulty: float}

Node 3: return_pick (Python)

Input: scored versions from node 2.

  • Pick highest score
  • If no benchmark data exists, fall back to current_version_sid with confidence: 0.0 and a note
  • Output: {version_sid, confidence: float, reasoning: "..."}
    • confidence = normalized score gap between winner and runner-up (1.0 = runner-up is 50%+ worse, 0.0 = tied)

Flow inputs:

{
  "inputs": [
    {"name": "target_workflow_sid", "type": "string", "required": true, "description": "Workflow to route"},
    {"name": "user_request", "type": "string", "required": true, "description": "The request to route"},
    {"name": "preferences", "type": "object", "required": false, "default": "{\"cost_weight\":0.33,\"speed_weight\":0.33,\"accuracy_weight\":0.34}", "description": "Weights summing to 1.0"}
  ]
}

RPC convenience method

File: crates/hero_logic/src/logic/server/rpc.rs

// Runs pick_version_flow synchronously and returns the result.
fn pick_version(
    workflow_sid: String,
    user_request: String,
    preferences: Option<Value>,
) -> Result<PickResult, LogicServiceError> {
    // 1. workflow_from_template("pick_version_flow") — get router workflow
    // 2. play_start with input_values
    // 3. Poll play_status until done (timeout ~30s)
    // 4. Parse final output, return PickResult { version_sid, confidence, reasoning }
}

This is a composition, not a new flow type. It wraps the pick_version_flow execution in a simple synchronous interface.

UI changes

File: crates/hero_logic_ui/templates/workflow_editor.html + JS

  • "Route" button on the workflow page next to "Start play". Opens a modal:
    • User request textarea
    • Preferences sliders (cost/speed/accuracy, constrained to sum 1.0)
    • "Preview pick" → calls pick_version RPC and shows the chosen version_sid + reasoning without starting a play
    • "Run with pick" → calls pick, then play_start with the picked version
  • Dashboard integration: workflow card shows version count + a little icon indicating whether the router is active

Integration with service_agent

Service agent could optionally route: router.agent.start accepts a route: true flag. If set:

  1. Call pick_version(service_agent_workflow_sid, prompt) first
  2. Use the returned version_sid for the play

This is a pattern, not a strict requirement for this issue.

Acceptance criteria

  • pick_version_flow template exists and loads
  • Running it with a workflow that has 2+ benchmarked versions returns the winner + reasoning
  • Simple requests favor cheap/fast versions; complex requests favor accurate versions (verify with test cases)
  • Fallback to current_version_sid with confidence 0.0 when no benchmarks exist
  • logicservice.pick_version RPC convenience method works
  • UI has a "Route" button with preview + run options
  • Integration test: create 2 versions of service_agent_v3 (one with gpt-4o-mini, one with claude-sonnet-4.6), benchmark both, then pick_version with "simple ping" picks the cheap version and "complex multi-step task" picks the accurate one

Out of scope

  • Automatic benchmarking: router assumes benchmarks already exist
  • Multi-objective Pareto frontier (this uses scalar weighted sum, simpler)
  • Cost/token capture from hero_proc (still an open gap)
  • Learning from actual outcomes (router doesn't adapt based on whether its picks actually worked well)

Future work

  • Phase 5: outcome-aware routing — track which picks led to successful vs failed plays, adjust difficulty ratings over time
  • Phase 6: auto-benchmarking — periodically re-benchmark versions as models update
## Context After Phases 1-3 (#5, #6, #7), workflows have versions and each version has benchmark data (success rate, avg duration, cost estimate). This issue wires that into request-based selection: given a user request and preferences (cost/speed/accuracy weights), pick the version that best matches. **Example use case:** the user asks "what's the weather?" — a simple request that any version handles fast. The router picks the cheapest, fastest version. The user asks "audit my calendar for conflicts across 3 timezones and propose a rebalancing" — a complex request that needs the accurate-but-expensive version. ## Goal A `pick_version_flow` template that: 1. Takes a user request + preferences (weights: cost_weight, speed_weight, accuracy_weight, sum to 1.0) 2. Uses an AI node to rate the request's **difficulty** (simple/medium/complex → 0.0-1.0) 3. Reads all benchmarked versions for a target workflow 4. Scores each version against the request using the weights and the version's benchmark metrics 5. Returns the recommended version SID + reasoning Also: a convenience RPC `logicservice.pick_version(workflow_sid, user_request, preferences?) -> {version_sid, confidence, reasoning}` that runs this flow and returns the pick synchronously. ## Depends on - **#5** (versioning) - **#7** (benchmark data) - **#6** (nice-to-have but not required) ## OSchema changes (small) The `Benchmark` record (from #7) already has `difficulty_rating: f32` set to 0.5 by default. This issue populates it properly: the AI-generated mock inputs in `benchmark_flow` are rated for difficulty when the benchmark runs. Requires a minor addition to `benchmark_flow` (not to the schema): - `benchmark_flow`'s `compute_stats_and_save` node ALSO asks an AI node to rate the average difficulty of the mock inputs (0.0 = trivial, 1.0 = very complex), stored in `Benchmark.difficulty_rating` - Alternatively, `pick_version_flow` can compute it at query time ## New flow template File: `crates/hero_logic/templates/pick_version_flow.json` A 3-node flow: ### Node 1: `rate_request` (AI) Input: `user_request`. - System prompt: classify the request's complexity as a float 0.0-1.0 where 0 = trivial (single data fetch, basic format), 1.0 = very complex (multi-step reasoning, many services, novel integration). Output JSON: `{difficulty: 0.0-1.0, rationale: "..."}`. - Output: `{difficulty, rationale}` ### Node 2: `score_versions` (Python) Input: difficulty from node 1, `target_workflow_sid`, `preferences`. - RPC `benchmark_list_for_workflow(target_workflow_sid)` - Group benchmarks by `workflow_version_sid`, keep latest per version - Normalize metrics across versions: - `norm_cost[v] = cost[v] / max_cost` (0.0 = cheapest, 1.0 = most expensive) - `norm_duration[v] = duration[v] / max_duration` - `norm_reliability[v] = success_rate[v] / 100.0` - Score each version with the weighted formula: ``` score = (1 - norm_cost) * cost_weight + (1 - norm_duration) * speed_weight + norm_reliability * accuracy_weight ``` Higher = better. - Adjust for request difficulty: boost accurate-but-expensive versions for hard requests (`effective_accuracy_weight = accuracy_weight * (1 + difficulty)`), boost cheap-fast for easy ones. Re-normalize weights. - Output: `{scored: [{version_sid, score, normalized_metrics, raw_metrics}], difficulty: float}` ### Node 3: `return_pick` (Python) Input: scored versions from node 2. - Pick highest score - If no benchmark data exists, fall back to `current_version_sid` with `confidence: 0.0` and a note - Output: `{version_sid, confidence: float, reasoning: "..."}` - `confidence` = normalized score gap between winner and runner-up (1.0 = runner-up is 50%+ worse, 0.0 = tied) Flow inputs: ```json { "inputs": [ {"name": "target_workflow_sid", "type": "string", "required": true, "description": "Workflow to route"}, {"name": "user_request", "type": "string", "required": true, "description": "The request to route"}, {"name": "preferences", "type": "object", "required": false, "default": "{\"cost_weight\":0.33,\"speed_weight\":0.33,\"accuracy_weight\":0.34}", "description": "Weights summing to 1.0"} ] } ``` ## RPC convenience method File: `crates/hero_logic/src/logic/server/rpc.rs` ```rust // Runs pick_version_flow synchronously and returns the result. fn pick_version( workflow_sid: String, user_request: String, preferences: Option<Value>, ) -> Result<PickResult, LogicServiceError> { // 1. workflow_from_template("pick_version_flow") — get router workflow // 2. play_start with input_values // 3. Poll play_status until done (timeout ~30s) // 4. Parse final output, return PickResult { version_sid, confidence, reasoning } } ``` This is a composition, not a new flow type. It wraps the pick_version_flow execution in a simple synchronous interface. ## UI changes File: `crates/hero_logic_ui/templates/workflow_editor.html` + JS - **"Route" button** on the workflow page next to "Start play". Opens a modal: - User request textarea - Preferences sliders (cost/speed/accuracy, constrained to sum 1.0) - "Preview pick" → calls `pick_version` RPC and shows the chosen version_sid + reasoning without starting a play - "Run with pick" → calls pick, then `play_start` with the picked version - **Dashboard integration**: workflow card shows version count + a little icon indicating whether the router is active ## Integration with service_agent Service agent could optionally route: `router.agent.start` accepts a `route: true` flag. If set: 1. Call `pick_version(service_agent_workflow_sid, prompt)` first 2. Use the returned `version_sid` for the play This is a pattern, not a strict requirement for this issue. ## Acceptance criteria - [ ] `pick_version_flow` template exists and loads - [ ] Running it with a workflow that has 2+ benchmarked versions returns the winner + reasoning - [ ] Simple requests favor cheap/fast versions; complex requests favor accurate versions (verify with test cases) - [ ] Fallback to `current_version_sid` with confidence 0.0 when no benchmarks exist - [ ] `logicservice.pick_version` RPC convenience method works - [ ] UI has a "Route" button with preview + run options - [ ] Integration test: create 2 versions of service_agent_v3 (one with gpt-4o-mini, one with claude-sonnet-4.6), benchmark both, then `pick_version` with "simple ping" picks the cheap version and "complex multi-step task" picks the accurate one ## Out of scope - Automatic benchmarking: router assumes benchmarks already exist - Multi-objective Pareto frontier (this uses scalar weighted sum, simpler) - Cost/token capture from hero_proc (still an open gap) - Learning from actual outcomes (router doesn't adapt based on whether its picks actually worked well) ## Future work - Phase 5: **outcome-aware routing** — track which picks led to successful vs failed plays, adjust difficulty ratings over time - Phase 6: **auto-benchmarking** — periodically re-benchmark versions as models update
Author
Owner

Backend landed: 05a3ef7

Phase 4 is functional. For brevity I implemented pick_version as a server method directly rather than a separate flow template — the scoring math is deterministic and easier to read/test in Rust than in a 3-node DAG. A pick_version_flow template with an AI-based difficulty rater can be added later as a layered improvement (my current difficulty is a length-based heuristic).

What's live

RPC method:

  • pick_version(workflow_sid, user_request, preferences_json) -> str — returns JSON string with {version_sid, confidence, difficulty, reasoning, considered: [...]}

Algorithm:

  1. Parse preferences (cost_weight/speed_weight/accuracy_weight) with equal-third defaults; re-normalize to sum 1.0
  2. Heuristic difficulty rating from request length+word count, clamped to [0.2, 0.9]
  3. For each workflow version, fetch its latest Benchmark
  4. Normalize cost + duration across candidates
  5. Shift accuracy weight upward as difficulty rises: adj_acc_w = acc_w * (1 + difficulty) then re-normalize
  6. Score: (1 - norm_cost) * cost_w + (1 - norm_dur) * speed_w + norm_success_rate * adj_acc_w
  7. Confidence = 1 - runner_up_score / winner_score (gap-based)

Fallback: no benchmarks → current_version_sid with confidence 0.0

Verified

# Fresh workflow (no benchmarks)
pick_version(wf=00dv, user_request='simple ping', preferences='')
 version_sid=00dw (current), confidence=0.0, reasoning='No benchmarked versions...'

# Benchmarked workflow, cost-weighted prefs
pick_version(wf=00dn, user_request='list services',
             preferences={cost:0.5, speed:0.3, accuracy:0.2})
 version_sid=00do (v1), confidence=1.0, difficulty=0.20
  reasoning: 'Picked v1 (score 0.712) from 1 benchmarked versions.
              Difficulty=0.20 shifted accuracy weight to 0.23
              (cost 0.48, speed 0.29).'

# Complex multi-step request (~1000 chars)
pick_version(wf=00dn, user_request='...audit calendar across timezones...' * 2)
 difficulty=0.69 (vs 0.20 for 'simple ping')  

Remaining

  • pick_version RPC picks best version from benchmarks + weights
  • Fallback to current when no benchmarks exist
  • Difficulty-aware weight adjustment (harder = more accuracy weight)
  • Confidence score based on winner-vs-runner-up gap
  • pick_version_flow template with AI-based difficulty node — would replace the current heuristic. Recommend as a follow-up once Phase 3 has richer benchmark data to differentiate versions.
  • UI: "Route" button with preview + run modes, preferences sliders
  • Integration test: 2 benchmarked versions (cheap vs accurate), verify simple requests pick cheap / complex requests pick accurate

Summary of all four phases

All four backends are live:

Phase Commit Status
#5 Versioning + typed inputs d5b3c95 Backend ✓, UI TODO
#6 Typed example values 671d026 Backend ✓, UI TODO
#7 Benchmark flow 98831aa Backend ✓, UI TODO
#8 Version router 05a3ef7 Backend ✓, UI TODO

All four are functional end-to-end from the RPC surface. UI wiring is the remaining common thread — it's a bigger chunk of work that fits better as its own issue covering all phases' UI needs cohesively.

## Backend landed: 05a3ef7 Phase 4 is functional. For brevity I implemented `pick_version` as a server method directly rather than a separate flow template — the scoring math is deterministic and easier to read/test in Rust than in a 3-node DAG. A `pick_version_flow` template with an AI-based difficulty rater can be added later as a layered improvement (my current difficulty is a length-based heuristic). ### What's live **RPC method:** - `pick_version(workflow_sid, user_request, preferences_json) -> str` — returns JSON string with `{version_sid, confidence, difficulty, reasoning, considered: [...]}` **Algorithm:** 1. Parse preferences (`cost_weight`/`speed_weight`/`accuracy_weight`) with equal-third defaults; re-normalize to sum 1.0 2. Heuristic difficulty rating from request length+word count, clamped to [0.2, 0.9] 3. For each workflow version, fetch its latest Benchmark 4. Normalize cost + duration across candidates 5. Shift accuracy weight upward as difficulty rises: `adj_acc_w = acc_w * (1 + difficulty)` then re-normalize 6. Score: `(1 - norm_cost) * cost_w + (1 - norm_dur) * speed_w + norm_success_rate * adj_acc_w` 7. Confidence = `1 - runner_up_score / winner_score` (gap-based) **Fallback:** no benchmarks → `current_version_sid` with confidence 0.0 ### Verified ```python # Fresh workflow (no benchmarks) pick_version(wf=00dv, user_request='simple ping', preferences='') → version_sid=00dw (current), confidence=0.0, reasoning='No benchmarked versions...' # Benchmarked workflow, cost-weighted prefs pick_version(wf=00dn, user_request='list services', preferences={cost:0.5, speed:0.3, accuracy:0.2}) → version_sid=00do (v1), confidence=1.0, difficulty=0.20 reasoning: 'Picked v1 (score 0.712) from 1 benchmarked versions. Difficulty=0.20 shifted accuracy weight to 0.23 (cost 0.48, speed 0.29).' # Complex multi-step request (~1000 chars) pick_version(wf=00dn, user_request='...audit calendar across timezones...' * 2) → difficulty=0.69 (vs 0.20 for 'simple ping') ✓ ``` ### Remaining - [x] `pick_version` RPC picks best version from benchmarks + weights - [x] Fallback to current when no benchmarks exist - [x] Difficulty-aware weight adjustment (harder = more accuracy weight) - [x] Confidence score based on winner-vs-runner-up gap - [ ] **`pick_version_flow` template with AI-based difficulty node** — would replace the current heuristic. Recommend as a follow-up once Phase 3 has richer benchmark data to differentiate versions. - [ ] **UI: "Route" button with preview + run modes, preferences sliders** - [ ] **Integration test: 2 benchmarked versions (cheap vs accurate), verify simple requests pick cheap / complex requests pick accurate** --- ## Summary of all four phases All four backends are live: | Phase | Commit | Status | |-------|--------|--------| | #5 Versioning + typed inputs | d5b3c95 | Backend ✓, UI TODO | | #6 Typed example values | 671d026 | Backend ✓, UI TODO | | #7 Benchmark flow | 98831aa | Backend ✓, UI TODO | | #8 Version router | 05a3ef7 | Backend ✓, UI TODO | All four are **functional end-to-end from the RPC surface**. UI wiring is the remaining common thread — it's a bigger chunk of work that fits better as its own issue covering all phases' UI needs cohesively.
Author
Owner

Phase 4 UI landed (commit c9bfd09)

Route button + preferences modal are live in the workflow editor:

  • Topbar Route button opens a modal with a request <textarea> and three weight sliders (cost / speed / accuracy). Sliders auto-normalize to sum to 1.0 at submit, so users can leave them at any scale (defaults 33/33/34).
  • Submit calls logicservice.pick_version(workflow_sid, user_request, preferences_json); result panel renders:
    • chosen version version_label (version_sid)
    • confidence % (rounded)
    • difficulty heuristic (2dp)
    • server-side reasoning string
    • list of considered version sids
  • Start play on this version button on the result:
    • Calls workflow_set_current_version if the picked version differs from current (so the play runs against the recommended version)
    • Opens the Start-play modal
    • If the workflow declares a prompt input, pre-fills it with the user's original request — so the router's choice flows end-to-end without the user retyping.

Verified:

  • pick_version on workflow 00dz (no benchmarks yet) returns {version_sid: current, confidence: 0.0, reasoning: "No benchmarked versions — falling back to current_version_sid"} → the UI correctly shows 0% confidence and reasoning.
  • All Phase-4 markup (hl-route-modal, openRouteModal, weight sliders) renders on /workflows/00dz.

Still open for this issue:

  • AI-driven difficulty rater (separate template pick_version_flow.json) replacing the current heuristic — gives the router actual semantic grounding.
  • Cost/speed/accuracy axes assume benchmarks populate estimated_cost_usd and success_rate; currently cost is 0 until hero_proc surfaces per-job token counts to the benchmark executor.
### Phase 4 UI landed (commit c9bfd09) `Route` button + preferences modal are live in the workflow editor: - **Topbar `Route` button** opens a modal with a request `<textarea>` and three weight sliders (cost / speed / accuracy). Sliders auto-normalize to sum to 1.0 at submit, so users can leave them at any scale (defaults 33/33/34). - Submit calls `logicservice.pick_version(workflow_sid, user_request, preferences_json)`; result panel renders: - chosen version `version_label (version_sid)` - `confidence %` (rounded) - `difficulty` heuristic (2dp) - server-side `reasoning` string - list of `considered` version sids - **`Start play on this version` button** on the result: - Calls `workflow_set_current_version` if the picked version differs from current (so the play runs against the recommended version) - Opens the Start-play modal - If the workflow declares a `prompt` input, pre-fills it with the user's original request — so the router's choice flows end-to-end without the user retyping. **Verified:** - `pick_version` on workflow `00dz` (no benchmarks yet) returns `{version_sid: current, confidence: 0.0, reasoning: "No benchmarked versions — falling back to current_version_sid"}` → the UI correctly shows 0% confidence and reasoning. - All Phase-4 markup (`hl-route-modal`, `openRouteModal`, weight sliders) renders on `/workflows/00dz`. **Still open for this issue:** - AI-driven difficulty rater (separate template `pick_version_flow.json`) replacing the current heuristic — gives the router actual semantic grounding. - Cost/speed/accuracy axes assume benchmarks populate `estimated_cost_usd` and `success_rate`; currently cost is 0 until hero_proc surfaces per-job token counts to the benchmark executor.
Sign in to join this conversation.
No labels
No milestone
No project
No assignees
1 participant
Notifications
Due date
The due date is invalid or out of range. Please use the format "yyyy-mm-dd".

No due date set.

Dependencies

No dependencies set.

Reference
lhumina_code/hero_logic#8
No description provided.