# Redis Queue Naming Proposal (Multi-Actor, Multi-Type, Scalable)

Goal
- Define a consistent, future-proof Redis naming scheme that:
  - Supports multiple actor types (OSIS, SAL, V, Python)
  - Supports multiple pools/groups and instances per type
  - Enables fair load-balancing and targeted dispatch
  - Works with both "hash-output" actors and "reply-queue" actors
  - Keeps migration from the current keys straightforward

Motivation
- Today, multiple non-unified patterns exist:
  - Per-actor keys like "hero:job:{actor_id}" consumed by the in-crate Rhai actor
  - Per-type keys like "hero:job:actor_queue:{suffix}" used by other components
  - Protocol docs that reference "hero:work_queue:{actor_id}" and "hero:reply:{job_id}"
- This fragmentation causes stuck "Dispatched" jobs whenever the LPUSH target doesn't match the BLPOP listener. We need one canonical scheme with well-defined fallbacks.

## 1) Canonical Key Names

Prefix conventions
- Namespace prefix: hero:
- All queues live under hero:q:* to separate them from job hashes hero:job:*
- All metadata lives under hero:meta:* for discoverability

Job and result keys
- Job hash (unchanged): hero:job:{job_id}
- Reply queue: hero:q:reply:{job_id}

Work queues (new canonical)
- Type queue (shared): hero:q:work:type:{script_type}
  - Examples:
    - hero:q:work:type:osis
    - hero:q:work:type:sal
    - hero:q:work:type:v
    - hero:q:work:type:python
- Group queue (optional, shared within a group): hero:q:work:type:{script_type}:group:{group}
  - Examples:
    - hero:q:work:type:osis:group:default
    - hero:q:work:type:sal:group:io
- Instance queue (most specific, used for targeted dispatch): hero:q:work:type:{script_type}:group:{group}:inst:{instance}
  - Examples:
    - hero:q:work:type:osis:group:default:inst:1
    - hero:q:work:type:sal:group:io:inst:3

Control queues (optional, future)
- Stop/control per type: hero:q:ctl:type:{script_type}
- Stop/control per instance: hero:q:ctl:type:{script_type}:group:{group}:inst:{instance}

Actor presence and metadata
- Instance presence (ephemeral, with TTL refresh): hero:meta:actor:inst:{script_type}:{group}:{instance}
  - Value: JSON { pid, hostname, started_at, version, capabilities, last_heartbeat }
  - Used by the Supervisor to discover live consumers and to select targeted queueing

## 2) Dispatch Strategy
- Default: push to the type queue hero:q:work:type:{script_type}
  - This allows N instances to BLPOP the same shared queue (standard fan-out).
- Targeted: if the user or scheduler specifies a group and/or instance, push to the most specific queue
  - Instance queue (highest specificity): hero:q:work:type:{script_type}:group:{group}:inst:{instance}
  - Else group queue: hero:q:work:type:{script_type}:group:{group}
  - Else type queue (fallback): hero:q:work:type:{script_type}
- Priority queues (optional extension):
  - Append :prio:{level} to any of the above
  - Actors BLPOP a list of queues in priority order

Example routing (see also the sketch after this list)
- No group/instance specified:
  - LPUSH hero:q:work:type:osis {job_id}
- Group specified ("default"), no instance:
  - LPUSH hero:q:work:type:osis:group:default {job_id}
- Specific instance:
  - LPUSH hero:q:work:type:osis:group:default:inst:2 {job_id}
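To make the routing rule concrete, here is a minimal sketch of the Supervisor-side queue selection and push, assuming the redis crate's async API; the `Target` struct and the function names are illustrative, not the existing Supervisor code:

```rust
use redis::AsyncCommands;

/// Hypothetical targeting info carried by the Job or dispatch policy.
struct Target<'a> {
    script_type: &'a str,   // "osis" | "sal" | "v" | "python"
    group: Option<&'a str>, // e.g. "default", "io"
    instance: Option<u32>,  // unique within the group
}

/// Pick the most specific canonical queue: instance > group > type.
fn work_queue_for(t: &Target) -> String {
    match (t.group, t.instance) {
        (Some(g), Some(i)) => {
            format!("hero:q:work:type:{}:group:{}:inst:{}", t.script_type, g, i)
        }
        (Some(g), None) => format!("hero:q:work:type:{}:group:{}", t.script_type, g),
        // An instance without a group is not addressable; fall back to the type queue.
        _ => format!("hero:q:work:type:{}", t.script_type),
    }
}

/// LPUSH the job id onto the selected queue; actors BLPOP from the other end.
async fn dispatch(
    conn: &mut redis::aio::MultiplexedConnection,
    target: &Target<'_>,
    job_id: &str,
) -> redis::RedisResult<()> {
    let queue = work_queue_for(target);
    let _: i64 = conn.lpush(&queue, job_id).await?;
    Ok(())
}
```

During the migration window (section 4), the same function would also LPUSH to the selected legacy queue behind a config flag.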
## 3) Actor Consumption Strategy
- Each actor identifies itself with:
  - script_type (osis/sal/v/python)
  - group (defaults to "default")
  - instance number (unique within the group)
- The actor registers its presence:
  - SET hero:meta:actor:inst:{script_type}:{group}:{instance} {...} EX 15
  - Refreshed periodically so it doubles as a heartbeat
- Actor BLPOP order:
  1) Instance queue (most specific)
  2) Group queue
  3) Type queue
- This ensures targeted jobs are taken first (if any); otherwise the actor falls back to the group or shared type queue, as sketched below.
- Actors that implement reply-queue semantics also LPUSH the result to hero:q:reply:{job_id} on completion. Hash-only actors just update hero:job:{job_id} with status and output.
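A hedged sketch of that consumption loop, assuming the redis crate; the helper names, the 5-second BLPOP timeout, and the minimal presence payload are illustrative (the proposal's full JSON adds pid, hostname, version, and so on):

```rust
use redis::AsyncCommands;
use std::time::{SystemTime, UNIX_EPOCH};

/// Refresh the instance presence key with a 15s TTL so it doubles as a heartbeat.
async fn register_presence(
    conn: &mut redis::aio::MultiplexedConnection,
    script_type: &str,
    group: &str,
    instance: u32,
) -> redis::RedisResult<()> {
    let key = format!("hero:meta:actor:inst:{}:{}:{}", script_type, group, instance);
    let now = SystemTime::now().duration_since(UNIX_EPOCH).unwrap().as_secs();
    // Minimal payload; a real actor would include pid, hostname, capabilities, ...
    let value = format!(r#"{{"last_heartbeat":{}}}"#, now);
    conn.set_ex(&key, value, 15).await
}

/// BLPOP the canonical queues most-specific-first so targeted jobs win.
async fn next_job_id(
    conn: &mut redis::aio::MultiplexedConnection,
    script_type: &str,
    group: &str,
    instance: u32,
) -> redis::RedisResult<Option<String>> {
    let queues = [
        format!("hero:q:work:type:{}:group:{}:inst:{}", script_type, group, instance),
        format!("hero:q:work:type:{}:group:{}", script_type, group),
        format!("hero:q:work:type:{}", script_type),
        // Migration only: legacy queues could be appended last here.
    ];
    // BLPOP checks keys in the order given and returns (queue, job_id), or nil on timeout.
    let popped: Option<(String, String)> = conn.blpop(&queues[..], 5.0).await?;
    Ok(popped.map(|(_queue, job_id)| job_id))
}
```

On startup and on every loop tick the actor would call register_presence, then next_job_id, process the job, and either LPUSH the result to hero:q:reply:{job_id} or update the job hash.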
## 4) Backward Compatibility and Migration
- During the transition, the Supervisor can LPUSH to both:
  - New canonical queues (hero:q:work:type:...)
  - Selected legacy queues (hero:job:actor_queue:{suffix}, hero:job:{actor_id}, hero:work_queue:...)
- Actors:
  - Update actors to BLPOP the canonical queues first, then the legacy fallbacks
- Phased plan:
  1) Introduce canonical queues alongside the legacy ones; the Supervisor pushes to both (compat mode)
  2) Switch actors to consume the canonical queues first
  3) Deprecate the legacy queues and remove the dual-push
- No change to job hashes hero:job:{job_id}

## 5) Required Code Changes (by file)

Supervisor (routing and reply queue)
- Replace queue computation with the canonical builder:
  - [rust.Supervisor::get_actor_queue_key()](core/supervisor/src/lib.rs:410)
  - Change it to build canonical keys from script_type (plus an optional group/instance from the Job or policy)
- Update start logic to LPUSH to the canonical queue(s):
  - [rust.Supervisor::start_job_using_connection()](core/supervisor/src/lib.rs:599)
  - Use only canonical queue(s); during the migration phase, also LPUSH to legacy queues
- Standardize the reply queue name:
  - [rust.Supervisor::run_job_and_await_result()](core/supervisor/src/lib.rs:689)
  - Use hero:q:reply:{job_id}
  - Keep the "poll job hash" fallback for actors that don't use reply queues
- Stop queue naming:
  - [rust.Supervisor::stop_job()](core/supervisor/src/lib.rs:789)
  - Use hero:q:ctl:type:{script_type} in canonical mode

Actor (consumption and presence)
- In-crate Rhai actor:
  - Queue key construction and BLPOP list:
    - [rust.spawn_rhai_actor()](core/actor/src/lib.rs:211)
    - Current queue_key at [core/actor/src/lib.rs](core/actor/src/lib.rs:220)
  - Replace the single-queue BLPOP with a multi-key BLPOP in priority order:
    1) hero:q:work:type:{script_type}:group:{group}:inst:{instance}
    2) hero:q:work:type:{script_type}:group:{group}
    3) hero:q:work:type:{script_type}
  - For migration, optionally include legacy queues last
- Presence registration (periodic SET with TTL):
  - Add at actor startup and refresh on each loop tick
- For actors that implement reply queues:
  - After finishing a job, LPUSH hero:q:reply:{job_id} {result}
- For hash-only actors, continue to call [rust.Job::set_result()](core/job/src/lib.rs:322)

Shared constants (avoid string drift)
- Introduce constants and helpers in a central crate (hero_job) to build keys consistently (see the sketch after this list):
  - fn job_hash_key(job_id) -> "hero:job:{job_id}"
  - fn reply_queue_key(job_id) -> "hero:q:reply:{job_id}"
  - fn work_queue_type(script_type) -> "hero:q:work:type:{type}"
  - fn work_queue_group(script_type, group) -> "hero:q:work:type:{type}:group:{group}"
  - fn work_queue_instance(script_type, group, inst) -> "hero:q:work:type:{type}:group:{group}:inst:{inst}"
- Replace open-coded strings in:
  - [rust.Supervisor](core/supervisor/src/lib.rs:1)
  - [rust.Actor code](core/actor/src/lib.rs:1)
  - Any CLI/TUI or interface components that reference queues
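A minimal sketch of what these hero_job helpers could look like; the concrete signatures (string slices in, owned Strings out) are an assumption, since the proposal leaves the exact types open:

```rust
/// Key/queue builders for the canonical scheme. Centralizing them in hero_job
/// keeps the Supervisor, the actors, and any CLI/TUI from drifting apart on strings.
pub fn job_hash_key(job_id: &str) -> String {
    format!("hero:job:{}", job_id)
}

pub fn reply_queue_key(job_id: &str) -> String {
    format!("hero:q:reply:{}", job_id)
}

pub fn work_queue_type(script_type: &str) -> String {
    format!("hero:q:work:type:{}", script_type)
}

pub fn work_queue_group(script_type: &str, group: &str) -> String {
    format!("{}:group:{}", work_queue_type(script_type), group)
}

pub fn work_queue_instance(script_type: &str, group: &str, instance: u32) -> String {
    format!("{}:inst:{}", work_queue_group(script_type, group), instance)
}
```

Building the group and instance keys on top of the type key guarantees the three levels can never disagree on the shared prefix.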
Interfaces
- OpenRPC/WebSocket servers do not need to know queue names; they call the Supervisor API. No changes are needed beyond following the Supervisor's behavior for the "run-and-wait" vs "create + start + get_output" flows.

## 6) Example Scenarios

Scenario A: Single OSIS pool with two instances
- Actors:
  - osis group=default inst=1
  - osis group=default inst=2
- Incoming job (no targeting):
  - LPUSH hero:q:work:type:osis {job_id}
- Actor BLPOP order:
  - instance queue
  - group queue
  - type queue (this one supplies the job)
- Effective result: classic round-robin-like behavior; the two workers share the load.

Scenario B: SAL pool "io" with instance 3; targeted dispatch
- The job sets target group=io and instance=3
- The Supervisor LPUSHes to hero:q:work:type:sal:group:io:inst:3 {job_id}
- Only that instance consumes it, enabling pinning to a specific worker.

Scenario C: Mixed old and new actors (migration window)
- The Supervisor pushes to the canonical queue(s) and to a legacy queue hero:job:actor_queue:osis
- New actors consume the canonical queues
- Legacy actors consume the legacy queue
- No job is stuck; both ecosystems coexist until the legacy path is removed.

## 7) Phased Migration Plan

Phase 0 (Docs + helpers)
- Add helpers in hero_job to compute keys (see "Shared constants")
- Document the new scheme and consumption order (this file)

Phase 1 (Supervisor)
- Update [rust.Supervisor::get_actor_queue_key()](core/supervisor/src/lib.rs:410) and [rust.Supervisor::start_job_using_connection()](core/supervisor/src/lib.rs:599) to use canonical queues
- Keep the dual-push to legacy queues behind a feature flag or config for rollout
- Standardize the reply queue to hero:q:reply:{job_id} in [rust.Supervisor::run_job_and_await_result()](core/supervisor/src/lib.rs:689)

Phase 2 (Actors)
- Update [rust.spawn_rhai_actor()](core/actor/src/lib.rs:211) to BLPOP from the canonical queues in priority order and to register presence keys
- Optionally emit a reply to hero:q:reply:{job_id} in addition to the hash-based result (feature flag)

Phase 3 (Cleanup)
- Once all actor and Supervisor deployments are updated and stable, remove the legacy dual-push and the fallback consume paths

## 8) Optional Enhancements
- Priority queues:
  - Suffix queues with :prio:{0|1|2}; actors BLPOP [inst prio0, group prio0, type prio0, inst prio1, group prio1, type prio1, ...]
- Rate limiting / back-pressure:
  - Use metadata to signal a busy state or report in-flight jobs; the Supervisor can target instance queues accordingly.
- Resilience:
  - Move to Redis Streams for job event logs; lists remain fine for simple FIFO processing.
- Observability:
  - hero:meta:actor:* and hero:meta:queue:stats:* keep simple metrics for dashboards.

## 9) Summary
- Canonicalize to hero:q:work:type:{...} (+ group, + instance) and hero:q:reply:{job_id}
- Actors consume instance → group → type
- The Supervisor pushes to the most specific queue available, defaulting to the type queue
- Provide helpers to build keys and remove ad-hoc string formatting
- Migrate with a dual-push (canonical + legacy) phase to avoid downtime

Proposed touchpoints to implement (clickable references)
- [rust.Supervisor::get_actor_queue_key()](core/supervisor/src/lib.rs:410)
- [rust.Supervisor::start_job_using_connection()](core/supervisor/src/lib.rs:599)
- [rust.Supervisor::run_job_and_await_result()](core/supervisor/src/lib.rs:689)
- [rust.spawn_rhai_actor()](core/actor/src/lib.rs:211)
- [core/actor/src/lib.rs](core/actor/src/lib.rs:220)
- [rust.Job::set_result()](core/job/src/lib.rs:322)
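To close the loop on the run_job_and_await_result touchpoint above, a minimal sketch of the reply-queue wait, again assuming the redis crate; the function name and timeout handling are illustrative, and the existing job-hash poll remains the fallback for hash-only actors:

```rust
use redis::AsyncCommands;

/// Wait for a result on the canonical reply queue hero:q:reply:{job_id}.
/// Callers fall back to polling hero:job:{job_id} if this returns None.
async fn await_reply(
    conn: &mut redis::aio::MultiplexedConnection,
    job_id: &str,
    timeout_secs: f64,
) -> redis::RedisResult<Option<String>> {
    let reply_queue = format!("hero:q:reply:{}", job_id);
    // BLPOP yields (queue_name, payload), or nil once the timeout elapses.
    let popped: Option<(String, String)> = conn.blpop(&reply_queue, timeout_secs).await?;
    Ok(popped.map(|(_queue, result)| result))
}
```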