Redis Queue Naming Proposal (Multi-Actor, Multi-Type, Scalable)

Goal

  • Define a consistent, future-proof Redis naming scheme that:
    • Supports multiple actor types (OSIS, SAL, V, Python)
    • Supports multiple pools/groups and instances per type
    • Enables fair load-balancing and targeted dispatch
    • Works with both “hash-output” actors and “reply-queue” actors
    • Keeps migration straightforward from the current keys

Motivation

  • Today, multiple non-unified patterns exist:
    • Per-actor keys like "hero:job:{actor_id}" consumed by in-crate Rhai actor
    • Per-type keys like "hero:job:actor_queue:{suffix}" used by other components
    • Protocol docs that reference "hero:work_queue:{actor_id}" and "hero:reply:{job_id}"
  • This fragmentation causes stuck “Dispatched” jobs whenever the LPUSH target doesn't match the BLPOP listener. We need one canonical scheme with well-defined fallbacks.

1) Canonical Key Names

Prefix conventions

  • Namespace prefix: hero:
  • All queues collected under hero:q:* to separate from job hashes hero:job:*
  • All metadata under hero:meta:* for discoverability

Job and result keys

  • Job hash (unchanged): hero:job:{job_id}
  • Reply queue: hero:q:reply:{job_id}

Work queues (new canonical)

  • Type queue (shared): hero:q:work:type:{script_type}
    • Examples:
      • hero:q:work:type:osis
      • hero:q:work:type:sal
      • hero:q:work:type:v
      • hero:q:work:type:python
  • Group queue (optional, shared within a group): hero:q:work:type:{script_type}:group:{group}
    • Examples:
      • hero:q:work:type:osis:group:default
      • hero:q:work:type:sal:group:io
  • Instance queue (most specific, used for targeted dispatch): hero:q:work:type:{script_type}:group:{group}:inst:{instance}
    • Examples:
      • hero:q:work:type:osis:group:default:inst:1
      • hero:q:work:type:sal:group:io:inst:3

Control queues (optional, future)

  • Stop/control per-type: hero:q:ctl:type:{script_type}
  • Stop/control per-instance: hero:q:ctl:type:{script_type}:group:{group}:inst:{instance}

Actor presence and metadata

  • Instance presence (ephemeral, with TTL refresh): hero:meta:actor:inst:{script_type}:{group}:{instance}
    • Value: JSON { pid, hostname, started_at, version, capabilities, last_heartbeat }
    • Used by the supervisor to discover live consumers and to select targeted queueing
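
To make the payload concrete, here is one way the presence value could be modeled in Rust with serde. The field names come from the JSON above; the struct itself and the concrete types are assumptions, not existing code.

```rust
use serde::{Deserialize, Serialize};

/// Presence payload stored at hero:meta:actor:inst:{script_type}:{group}:{instance}.
/// Field names follow the proposal; concrete types are assumptions.
#[derive(Serialize, Deserialize)]
struct ActorPresence {
    pid: u32,
    hostname: String,
    started_at: String,        // e.g. an RFC 3339 timestamp
    version: String,
    capabilities: Vec<String>,
    last_heartbeat: String,    // updated on every TTL refresh
}
```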

2) Dispatch Strategy

  • Default: Push to the Type queue hero:q:work:type:{script_type}
    • Allows N instances to BLPOP the same shared queue (competing consumers: each job goes to exactly one instance).
  • Targeted: If user or scheduler specifies a group and/or instance, push to the most specific queue
    • Instance queue (highest specificity):
      • hero:q:work:type:{script_type}:group:{group}:inst:{instance}
    • Else Group queue:
      • hero:q:work:type:{script_type}:group:{group}
    • Else Type queue (fallback):
      • hero:q:work:type:{script_type}
  • Priority queues (optional extension):
    • Append :prio:{level} to any of the above
    • Actors BLPOP a list of queues in priority order

Example routing

  • No group/instance specified:
    • LPUSH hero:q:work:type:osis {job_id}
  • Group specified ("default"), no instance:
    • LPUSH hero:q:work:type:osis:group:default {job_id}
  • Specific instance:
    • LPUSH hero:q:work:type:osis:group:default:inst:2 {job_id}
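
A minimal sketch of this fallback in Rust using the redis crate. dispatch_job is a hypothetical name, and the choice that an instance given without a group falls back to the type queue is a policy assumption:

```rust
use redis::Connection;

/// Push a job id onto the most specific queue available
/// (instance -> group -> type). Hypothetical sketch, not existing Supervisor code.
fn dispatch_job(
    con: &mut Connection,
    script_type: &str,
    group: Option<&str>,
    instance: Option<u32>,
    job_id: &str,
) -> redis::RedisResult<()> {
    let queue = match (group, instance) {
        (Some(g), Some(i)) => format!("hero:q:work:type:{script_type}:group:{g}:inst:{i}"),
        (Some(g), None) => format!("hero:q:work:type:{script_type}:group:{g}"),
        // No group (even if an instance is given): fall back to the shared type queue.
        _ => format!("hero:q:work:type:{script_type}"),
    };
    redis::cmd("LPUSH").arg(&queue).arg(job_id).query::<i64>(con)?;
    Ok(())
}
```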

3) Actor Consumption Strategy

  • Actor identifies itself with:
    • script_type (osis/sal/v/python)
    • group (defaults to "default")
    • instance number (unique within group)
  • Actor registers presence:
    • SET hero:meta:actor:inst:{script_type}:{group}:{instance} {...} EX 15
    • Periodically refresh to act as heartbeat
  • Actor BLPOP order:
    1. Instance queue (most specific)
    2. Group queue
    3. Type queue
  • This ensures targeted jobs are taken first (if any); otherwise the actor falls back to the group or shared type queue (see the sketch below).
  • Actors that implement reply-queue semantics also LPUSH the result to hero:q:reply:{job_id} on completion; hash-only actors just update hero:job:{job_id} with status and output.
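
Sketch of one actor loop tick under this scheme, assuming the redis crate: refresh the presence key, then BLPOP the three queues in priority order (BLPOP checks its keys left to right). poll_once is a hypothetical name; real code would also execute the popped job.

```rust
use redis::Connection;

/// One tick of the actor loop: heartbeat, then priority-ordered BLPOP.
/// Returns the popped job id, if any. Error handling is elided.
fn poll_once(
    con: &mut Connection,
    script_type: &str,
    group: &str,
    instance: u32,
    presence_json: &str,
) -> redis::RedisResult<Option<String>> {
    // Presence heartbeat: 15 s TTL, refreshed on every tick.
    let presence_key = format!("hero:meta:actor:inst:{script_type}:{group}:{instance}");
    redis::cmd("SET")
        .arg(&presence_key)
        .arg(presence_json)
        .arg("EX")
        .arg(15)
        .query::<()>(con)?;

    // BLPOP inspects its keys left to right, so the most specific queue wins.
    let keys = [
        format!("hero:q:work:type:{script_type}:group:{group}:inst:{instance}"),
        format!("hero:q:work:type:{script_type}:group:{group}"),
        format!("hero:q:work:type:{script_type}"),
    ];
    // 5 s timeout keeps the loop short enough to refresh the heartbeat above.
    let popped: Option<(String, String)> = redis::cmd("BLPOP")
        .arg(&keys[..])
        .arg(5)
        .query(con)?;
    Ok(popped.map(|(_queue, job_id)| job_id))
}
```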

4) Backward Compatibility And Migration

  • During transition, the Supervisor can LPUSH to both (see the pipelined sketch after this list):
    • New canonical queues (hero:q:work:type:...)
    • Selected legacy queues (hero:job:actor_queue:{suffix}, hero:job:{actor_id}, hero:work_queue:...)
  • Actors:
    • Update actors to BLPOP the canonical queues first, then legacy fallback
  • Phased plan:
    1. Introduce canonical queues alongside legacy; Supervisor pushes to both (compat mode)
    2. Switch actors to consume canonical first
    3. Deprecate legacy queues and remove dual-push
  • No change to job hashes hero:job:{job_id}
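
A compat-mode push could pipeline both LPUSHes, for example as below. The legacy key shown is one of several possible legacy targets, and consumers must tolerate seeing the same job_id on two queues (e.g. by claiming the job via its hero:job:{job_id} status before executing):

```rust
use redis::Connection;

/// Dual-push for the migration window: canonical type queue plus one
/// legacy queue, in a single pipeline. Sketch only.
fn dual_push(con: &mut Connection, script_type: &str, job_id: &str) -> redis::RedisResult<()> {
    let canonical = format!("hero:q:work:type:{script_type}");
    let legacy = format!("hero:job:actor_queue:{script_type}");
    redis::pipe()
        .cmd("LPUSH").arg(&canonical).arg(job_id).ignore()
        .cmd("LPUSH").arg(&legacy).arg(job_id).ignore()
        .query(con)
}
```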

5) Required Code Changes (by file)

Supervisor (routing and reply queue)

Actor (consumption and presence)

  • In-crate Rhai actor:
    • Queue key construction and BLPOP list:
      • spawn_rhai_actor()
      • Current queue_key at [core/actor/src/lib.rs:220]
      • Replace single-queue BLPOP with multi-key BLPOP in priority order:
        1. hero:q:work:type:{script_type}:group:{group}:inst:{instance}
        2. hero:q:work:type:{script_type}:group:{group}
        3. hero:q:work:type:{script_type}
      • For migration, optionally include legacy queues last.
    • Presence registration (periodic SET with TTL):
      • Add at actor startup and refresh on loop tick
  • For actors that implement reply queues:
    • After finishing job, LPUSH hero:q:reply:{job_id} {result}
    • For hash-only actors, continue to call Job::set_result() (both result paths are sketched below)
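
Both result paths can go out in one pipeline on completion. A sketch, assuming hypothetical hash fields "status" and "output" on hero:job:{job_id}:

```rust
use redis::Connection;

/// Publish a finished job's result to the reply queue and mirror it into the
/// job hash for hash-only consumers. The hash field names are assumptions.
fn finish_job(con: &mut Connection, job_id: &str, result: &str) -> redis::RedisResult<()> {
    redis::pipe()
        .cmd("HSET")
        .arg(format!("hero:job:{job_id}"))
        .arg("status").arg("Finished")
        .arg("output").arg(result)
        .ignore()
        .cmd("LPUSH")
        .arg(format!("hero:q:reply:{job_id}"))
        .arg(result)
        .ignore()
        .query(con)
}
```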

Shared constants (avoid string drift)

  • Introduce constants and helpers in a central crate (hero_job) to build keys consistently:
    • fn job_hash_key(job_id) -> "hero:job:{job_id}"
    • fn reply_queue_key(job_id) -> "hero:q:reply:{job_id}"
    • fn work_queue_type(script_type) -> "hero:q:work:type:{type}"
    • fn work_queue_group(script_type, group) -> "hero:q:work:type:{type}:group:{group}"
    • fn work_queue_instance(script_type, group, inst) -> "hero:q:work:type:{type}:group:{group}:inst:{inst}"
  • Replace open-coded key strings with these helpers wherever queue and job keys are currently built inline; a sketch of the helpers follows.
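
A sketch of these helpers as they might appear in hero_job; the signatures mirror the list above:

```rust
/// Canonical key builders, so no caller formats Redis keys by hand.
pub fn job_hash_key(job_id: &str) -> String {
    format!("hero:job:{job_id}")
}

pub fn reply_queue_key(job_id: &str) -> String {
    format!("hero:q:reply:{job_id}")
}

pub fn work_queue_type(script_type: &str) -> String {
    format!("hero:q:work:type:{script_type}")
}

pub fn work_queue_group(script_type: &str, group: &str) -> String {
    format!("{}:group:{group}", work_queue_type(script_type))
}

pub fn work_queue_instance(script_type: &str, group: &str, inst: u32) -> String {
    format!("{}:inst:{inst}", work_queue_group(script_type, group))
}
```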

Interfaces

  • OpenRPC/WebSocket servers do not need to know queue names; they call the Supervisor API. No changes except to follow the Supervisor's behavior for “run-and-wait” vs “create+start+get_output” flows.
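
For reference, “run-and-wait” on the Supervisor side reduces to an LPUSH followed by a blocking read of the job's reply queue. A sketch; run_and_wait and the 30 s timeout are assumptions:

```rust
use redis::Connection;

/// Enqueue a job on the shared type queue, then block on its reply queue.
/// Returns None if no reply arrives before the timeout.
fn run_and_wait(
    con: &mut Connection,
    script_type: &str,
    job_id: &str,
) -> redis::RedisResult<Option<String>> {
    redis::cmd("LPUSH")
        .arg(format!("hero:q:work:type:{script_type}"))
        .arg(job_id)
        .query::<i64>(con)?;
    let reply: Option<(String, String)> = redis::cmd("BLPOP")
        .arg(format!("hero:q:reply:{job_id}"))
        .arg(30)
        .query(con)?;
    Ok(reply.map(|(_key, result)| result))
}
```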

6) Example Scenarios

Scenario A: Single OSIS pool with two instances

  • Actors:
    • osis group=default inst=1
    • osis group=default inst=2
  • Incoming job (no targeting):
    • LPUSH hero:q:work:type:osis {job_id}
  • Actors BLPOP order:
    • inst queue
    • group queue
    • type queue (this one supplies the job)
  • Effective result: the two instances share the load, approximating round-robin behavior.

Scenario B: SAL pool “io” with instance 3; targeted dispatch

  • Job sets target group=io and instance=3
  • Supervisor LPUSH hero:q:work:type:sal:group:io:inst:3 {job_id}
  • Only that instance consumes it, enabling pinning to a specific worker.

Scenario C: Mixed old and new actors (migration window)

  • Supervisor pushes to canonical queue(s) and to a legacy queue hero:job:actor_queue:osis
  • New actors consume canonical queues
  • Legacy actors consume legacy queue
  • No job is stuck; both ecosystems coexist until the legacy path is removed.

7) Phased Migration Plan

Phase 0 (Docs + helpers)

  • Add helpers in hero_job to compute keys (see “Shared constants”)
  • Document the new scheme and consumption order (this file)

Phase 1 (Supervisor)

Phase 2 (Actors)

  • Update spawn_rhai_actor() to BLPOP from canonical queues in priority order and to register presence keys
  • Optionally emit reply to hero:q:reply:{job_id} in addition to hash-based result (feature flag)

Phase 3 (Cleanup)

  • After all actors and Supervisor deployments are updated and stable, remove the legacy dual-push and fallback consume paths

8) Optional Enhancements

  • Priority queues:
    • Suffix queues with :prio:{0|1|2}; actors BLPOP all priority-0 queues first (inst, group, type), then the priority-1 set, and so on (see the sketch after this list)
  • Rate limiting/back-pressure:
    • Use presence metadata to signal a busy state or report in-flight job counts; the Supervisor can then target instance queues accordingly.
  • Resilience:
    • Move to Redis Streams for job event logs; lists remain fine for simple FIFO processing.
  • Observability:
    • hero:meta:actor:* and hero:meta:queue:stats:* to keep simple metrics for dashboards.
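
Sketch of the priority-ordered BLPOP key list referenced above; priority_key_list is a hypothetical helper:

```rust
/// Build a BLPOP key list for the :prio:{level} extension: every queue at
/// priority 0 first (inst, group, type), then priority 1, and so on.
fn priority_key_list(script_type: &str, group: &str, inst: u32, levels: u8) -> Vec<String> {
    let base = [
        format!("hero:q:work:type:{script_type}:group:{group}:inst:{inst}"),
        format!("hero:q:work:type:{script_type}:group:{group}"),
        format!("hero:q:work:type:{script_type}"),
    ];
    (0..levels)
        .flat_map(|p| base.iter().map(move |b| format!("{b}:prio:{p}")))
        .collect()
}
```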

9) Summary

  • Canonicalize to hero:q:work:type:{...} (+ group, + instance), and hero:q:reply:{job_id}
  • Actors consume instance → group → type
  • Supervisor pushes to most specific queue available, defaulting to type
  • Provide helpers to build keys and remove ad-hoc string formatting
  • Migrate with a dual-push (canonical + legacy) phase to avoid downtime

Proposed touchpoints to implement