Redis Queue Naming Proposal (Multi-Actor, Multi-Type, Scalable)
Goal
- Define a consistent, future-proof Redis naming scheme that:
- Supports multiple actor types (OSIS, SAL, V, Python)
- Supports multiple pools/groups and instances per type
- Enables fair load-balancing and targeted dispatch
- Works with both “hash-output” actors and “reply-queue” actors
- Keeps migration straightforward from the current keys
Motivation
- Today, multiple non-unified patterns exist:
- Per-actor keys like "hero:job:{actor_id}" consumed by in-crate Rhai actor
- Per-type keys like "hero:job:actor_queue:{suffix}" used by other components
- Protocol docs that reference "hero:work_queue:{actor_id}" and "hero:reply:{job_id}"
- This fragmentation causes stuck “Dispatched” jobs when the LPUSH target doesn’t match the BLPOP listener. We need one canonical scheme, with well-defined fallbacks.
1) Canonical Key Names
Prefix conventions
- Namespace prefix: hero:
- All queues collected under hero:q:* to separate from job hashes hero:job:*
- All metadata under hero:meta:* for discoverability
Job and result keys
- Job hash (unchanged): hero:job:{job_id}
- Reply queue: hero:q:reply:{job_id}
Work queues (new canonical)
- Type queue (shared): hero:q:work:type:{script_type}
- Examples:
- hero:q:work:type:osis
- hero:q:work:type:sal
- hero:q:work:type:v
- hero:q:work:type:python
- Group queue (optional, shared within a group): hero:q:work:type:{script_type}:group:{group}
- Examples:
- hero:q:work:type:osis:group:default
- hero:q:work:type:sal:group:io
- Instance queue (most specific, used for targeted dispatch): hero:q:work:type:{script_type}:group:{group}:inst:{instance}
- Examples:
- hero:q:work:type:osis:group:default:inst:1
- hero:q:work:type:sal:group:io:inst:3
Control queues (optional, future)
- Stop/control per-type: hero:q:ctl:type:{script_type}
- Stop/control per-instance: hero:q:ctl:type:{script_type}:group:{group}:inst:{instance}
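A minimal sketch of pushing a stop signal, assuming the Rust `redis` crate; the payload format is not specified by this proposal, so a bare job_id string is used here purely for illustration:

```rust
use redis::Commands;

/// Hypothetical stop signal: LPUSH onto the per-type control queue.
/// The job_id payload is an assumption, not a settled protocol.
fn send_stop(con: &mut redis::Connection, script_type: &str, job_id: &str) -> redis::RedisResult<()> {
    let key = format!("hero:q:ctl:type:{script_type}");
    let _: i64 = con.lpush(key, job_id)?; // LPUSH returns the new list length
    Ok(())
}
```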
Actor presence and metadata
- Instance presence (ephemeral, with TTL refresh): hero:meta:actor:inst:{script_type}:{group}:{instance}
- Value: JSON { pid, hostname, started_at, version, capabilities, last_heartbeat }
- Used by the supervisor to discover live consumers and to select targeted queueing
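A minimal registration sketch, assuming the `redis` crate and a pre-serialized JSON payload with the fields listed above:

```rust
/// Register (or refresh) an actor's presence key with a 15-second TTL.
/// If the actor dies, the key expires and the supervisor stops seeing it.
fn register_presence(
    con: &mut redis::Connection,
    script_type: &str,
    group: &str,
    instance: u32,
    payload_json: &str, // JSON: pid, hostname, started_at, version, capabilities, last_heartbeat
) -> redis::RedisResult<()> {
    let key = format!("hero:meta:actor:inst:{script_type}:{group}:{instance}");
    redis::cmd("SET").arg(key).arg(payload_json).arg("EX").arg(15).query(con)
}
```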
2) Dispatch Strategy
- Default: Push to the Type queue hero:q:work:type:{script_type}
- Allows N instances to BLPOP the same shared queue (the standard competing-consumers pattern: each job goes to exactly one instance)
- Targeted: if the user or scheduler specifies a group and/or instance, push to the most specific queue available:
- Instance queue (highest specificity):
- hero:q:work:type:{script_type}:group:{group}:inst:{instance}
- Else Group queue:
- hero:q:work:type:{script_type}:group:{group}
- Else Type queue (fallback):
- hero:q:work:type:{script_type}
- Priority queues (optional extension):
- Append :prio:{level} to any of the above
- Actors BLPOP a list of queues in priority order
Example routing
- No group/instance specified:
- LPUSH hero:q:work:type:osis {job_id}
- Group specified ("default"), no instance:
- LPUSH hero:q:work:type:osis:group:default {job_id}
- Specific instance:
- LPUSH hero:q:work:type:osis:group:default:inst:2 {job_id}
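A sketch of the selection logic, assuming the `redis` crate; the `group`/`instance` parameters stand in for optional targeting fields on the Job (exact field names may differ), and the inline format! calls could be replaced by the helpers proposed in section 5:

```rust
use redis::Commands;

/// Push a job id onto the most specific canonical queue available.
fn dispatch_job(
    con: &mut redis::Connection,
    script_type: &str,
    group: Option<&str>,
    instance: Option<u32>,
    job_id: &str,
) -> redis::RedisResult<()> {
    let queue = match instance {
        // Instance targeting implies a group; fall back to "default" if unset.
        Some(i) => {
            let g = group.unwrap_or("default");
            format!("hero:q:work:type:{script_type}:group:{g}:inst:{i}")
        }
        None => match group {
            Some(g) => format!("hero:q:work:type:{script_type}:group:{g}"),
            None => format!("hero:q:work:type:{script_type}"),
        },
    };
    let _: i64 = con.lpush(queue, job_id)?;
    Ok(())
}
```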
3) Actor Consumption Strategy
- Actor identifies itself with:
- script_type (osis/sal/v/python)
- group (defaults to "default")
- instance number (unique within group)
- Actor registers presence:
- SET hero:meta:actor:inst:{script_type}:{group}:{instance} {...} EX 15
- Periodically refresh to act as heartbeat
- Actor BLPOP order:
- Instance queue (most specific)
- Group queue
- Type queue
- This ensures targeted jobs are taken first (if any), otherwise fall back to group or shared type queue.
- Actors that implement reply-queue semantics will also LPUSH to hero:q:reply:{job_id} on completion. Others just update hero:job:{job_id} with status+output.
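A minimal consumption-loop sketch, assuming the `redis` crate; the 5-second BLPOP timeout doubles as the heartbeat tick for refreshing the presence key:

```rust
/// BLPOP the canonical queues most-specific-first, so targeted jobs win.
fn consume_loop(
    con: &mut redis::Connection,
    script_type: &str,
    group: &str,
    instance: u32,
) -> redis::RedisResult<()> {
    let queues = vec![
        format!("hero:q:work:type:{script_type}:group:{group}:inst:{instance}"),
        format!("hero:q:work:type:{script_type}:group:{group}"),
        format!("hero:q:work:type:{script_type}"),
    ];
    loop {
        // BLPOP scans the keys left to right; None means the timeout elapsed.
        let popped: Option<(String, String)> = redis::cmd("BLPOP")
            .arg(&queues)
            .arg(5) // seconds
            .query(con)?;
        if let Some((_queue, job_id)) = popped {
            // Load hero:job:{job_id}, execute, write status+output to the hash,
            // and LPUSH hero:q:reply:{job_id} if reply semantics are enabled.
            let _ = job_id;
        }
        // Refresh the hero:meta:actor:inst:{...} presence key here (heartbeat).
    }
}
```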
4) Backward Compatibility And Migration
- During transition, Supervisor can LPUSH to both:
- New canonical queues (hero:q:work:type:...)
- Selected legacy queues (hero:job:actor_queue:{suffix}, hero:job:{actor_id}, hero:work_queue:...)
- Actors:
- Update actors to BLPOP the canonical queues first, then legacy fallback
- Phased plan:
- Introduce canonical queues alongside legacy; Supervisor pushes to both (compat mode)
- Switch actors to consume canonical first
- Deprecate legacy queues and remove dual-push
- No change to job hashes hero:job:{job_id}
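A compat-mode sketch of the dual-push, assuming the `redis` crate; the legacy queue mapping and the option flag are illustrative, not a settled config shape:

```rust
use redis::Commands;

/// During migration, mirror every push to a legacy queue while compat mode
/// is enabled, so old and new consumers both stay fed.
fn dispatch_with_compat(
    con: &mut redis::Connection,
    canonical_queue: &str,
    legacy_queue: Option<&str>, // e.g. hero:job:actor_queue:{suffix}
    job_id: &str,
) -> redis::RedisResult<()> {
    let _: i64 = con.lpush(canonical_queue, job_id)?;
    if let Some(legacy) = legacy_queue {
        let _: i64 = con.lpush(legacy, job_id)?;
    }
    Ok(())
}
```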
5) Required Code Changes (by file)
Supervisor (routing and reply queue)
- Replace queue computation with canonical builder:
- rust.Supervisor::get_actor_queue_key()
- Change to build canonical keys given script_type (+ optional group/instance from Job or policy)
- Update start logic to LPUSH to canonical queue(s):
- rust.Supervisor::start_job_using_connection()
- Use only canonical queue(s). In migration phase, also LPUSH legacy queues.
- Standardize reply queue name:
- rust.Supervisor::run_job_and_await_result()
- Use hero:q:reply:{job_id}
- Keep the “poll job hash” fallback for actors that don’t use reply queues (see the sketch after this list)
- Stop queue naming:
- rust.Supervisor::stop_job()
- Use hero:q:ctl:type:{script_type} in canonical mode
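A run-and-wait sketch for the reply-queue path with the hash-poll fallback, assuming the `redis` crate; the 30-second timeout and the "output" hash field name are assumptions to be matched against the actual job hash schema:

```rust
/// Wait for a reply-queue result, falling back to the hash written by
/// hash-only actors. Returns None if neither path produced output in time.
fn await_result(con: &mut redis::Connection, job_id: &str) -> redis::RedisResult<Option<String>> {
    let popped: Option<(String, String)> = redis::cmd("BLPOP")
        .arg(format!("hero:q:reply:{job_id}"))
        .arg(30) // seconds; tune per deployment
        .query(con)?;
    if let Some((_key, result)) = popped {
        return Ok(Some(result));
    }
    // Fallback: read the job hash; "output" is an assumed field name.
    redis::cmd("HGET")
        .arg(format!("hero:job:{job_id}"))
        .arg("output")
        .query(con)
}
```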
Actor (consumption and presence)
- In-crate Rhai actor:
- Queue key construction and BLPOP list:
- rust.spawn_rhai_actor()
- Current queue_key at [core/actor/src/lib.rs:220]
- Replace single-queue BLPOP with multi-key BLPOP in priority order:
- hero:q:work:type:{script_type}:group:{group}:inst:{instance}
- hero:q:work:type:{script_type}:group:{group}
- hero:q:work:type:{script_type}
- For migration, optionally include legacy queues last.
- Presence registration (periodic SET with TTL):
- Add at actor startup and refresh on loop tick
- For actors that implement reply queues:
- After finishing job, LPUSH hero:q:reply:{job_id} {result}
- For hash-only actors, continue to call rust.Job::set_result()
Shared constants (avoid string drift)
- Introduce constants and helpers in a central crate (hero_job) to build keys consistently (a sketch follows this list):
- fn job_hash_key(job_id) -> "hero:job:{job_id}"
- fn reply_queue_key(job_id) -> "hero:q:reply:{job_id}"
- fn work_queue_type(script_type) -> "hero:q:work:type:{type}"
- fn work_queue_group(script_type, group) -> "hero:q:work:type:{type}:group:{group}"
- fn work_queue_instance(script_type, group, inst) -> "hero:q:work:type:{type}:group:{group}:inst:{inst}"
- Replace open-coded strings in:
- rust.Supervisor
- rust.Actor code
- Any CLI/TUI or interface components that reference queues
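A sketch of the hero_job helpers, assuming &str inputs and a numeric instance id:

```rust
/// Canonical key builders: pure string formatting in one place so every
/// component agrees on the exact layout.
pub fn job_hash_key(job_id: &str) -> String {
    format!("hero:job:{job_id}")
}

pub fn reply_queue_key(job_id: &str) -> String {
    format!("hero:q:reply:{job_id}")
}

pub fn work_queue_type(script_type: &str) -> String {
    format!("hero:q:work:type:{script_type}")
}

pub fn work_queue_group(script_type: &str, group: &str) -> String {
    format!("hero:q:work:type:{script_type}:group:{group}")
}

pub fn work_queue_instance(script_type: &str, group: &str, inst: u32) -> String {
    format!("hero:q:work:type:{script_type}:group:{group}:inst:{inst}")
}
```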
Interfaces
- OpenRPC/WebSocket servers do not need to know queue names; they call the Supervisor API. No changes are required beyond following the Supervisor’s behavior for “run-and-wait” vs “create+start+get_output” flows.
6) Example Scenarios
Scenario A: Single OSIS pool with two instances
- Actors:
- osis group=default inst=1
- osis group=default inst=2
- Incoming job (no targeting):
- LPUSH hero:q:work:type:osis {job_id}
- Actors BLPOP order:
- inst queue
- group queue
type queue (this one supplies the job)
- Effective result: round-robin-like sharing; the two instances split the load.
Scenario B: SAL pool “io” with instance 3; targeted dispatch
- Job sets target group=io and instance=3
- Supervisor LPUSH hero:q:work:type:sal:group:io:inst:3 {job_id}
- Only that instance consumes it, enabling pinning to a specific worker.
Scenario C: Mixed old and new actors (migration window)
- Supervisor pushes to canonical queue(s) and to a legacy queue hero:job:actor_queue:osis
- New actors consume canonical queues
- Legacy actors consume legacy queue
- No job is stuck; both ecosystems coexist until the legacy path is removed.
7) Phased Migration Plan
Phase 0 (Docs + helpers)
- Add helpers in hero_job to compute keys (see “Shared constants”)
- Document the new scheme and consumption order (this file)
Phase 1 (Supervisor)
- Update rust.Supervisor::get_actor_queue_key() and rust.Supervisor::start_job_using_connection() to use canonical queues
- Keep dual-push to legacy queues behind a feature flag or config for rollout
- Standardize reply queue to hero:q:reply:{job_id} in rust.Supervisor::run_job_and_await_result()
Phase 2 (Actors)
- Update rust.spawn_rhai_actor() to BLPOP from canonical queues in priority order and to register presence keys
- Optionally emit reply to hero:q:reply:{job_id} in addition to hash-based result (feature flag)
Phase 3 (Cleanup)
- After all actors and Supervisor deployments are updated and stable, remove the legacy dual-push and fallback consume paths
8) Optional Enhancements
- Priority queues:
- Suffix queues with :prio:{0|1|2}; actors BLPOP [inst prio0, group prio0, type prio0, inst prio1, group prio1, type prio1, ...]
- Rate limiting/back-pressure:
- Use presence metadata to signal busy state or report in-flight jobs; the Supervisor can target instance queues accordingly.
- Resilience:
- Move to Redis Streams for job event logs; lists remain fine for simple FIFO processing.
- Observability:
- hero:meta:actor:* and hero:meta:queue:stats:* to keep simple metrics for dashboards.
9) Summary
- Canonicalize to hero:q:work:type:{...} (+ group, + instance), and hero:q:reply:{job_id}
- Actors consume instance → group → type
- Supervisor pushes to most specific queue available, defaulting to type
- Provide helpers to build keys and remove ad-hoc string formatting
- Migrate with a dual-push (canonical + legacy) phase to avoid downtime
Proposed touchpoints to implement