baobab/docs/REDIS_QUEUES_NAMING_PROPOSAL.md
Maxime Van Hees 0ebda7c1aa Updates
2025-08-14 14:14:34 +02:00

# Redis Queue Naming Proposal (Multi-Actor, Multi-Type, Scalable)
Goal
- Define a consistent, future-proof Redis naming scheme that:
  - Supports multiple actor types (OSIS, SAL, V, Python)
  - Supports multiple pools/groups and instances per type
  - Enables fair load-balancing and targeted dispatch
  - Works with both “hash-output” actors and “reply-queue” actors
  - Keeps migration from the current keys straightforward

Motivation
- Today, several inconsistent patterns coexist:
  - Per-actor keys like "hero:job:{actor_id}", consumed by the in-crate Rhai actor
  - Per-type keys like "hero:job:actor_queue:{suffix}", used by other components
  - Protocol docs that reference "hero:work_queue:{actor_id}" and "hero:reply:{job_id}"
- This fragmentation leaves jobs stuck in “Dispatched” when the LPUSH target doesn't match the queue an actor is BLPOP-ing. We need one canonical scheme with well-defined fallbacks.
## 1) Canonical Key Names
Prefix conventions
- Namespace prefix: hero:
- All queues live under hero:q:*, separate from the job hashes under hero:job:*
- All metadata lives under hero:meta:* for discoverability

Job and result keys
- Job hash (unchanged): hero:job:{job_id}
- Reply queue: hero:q:reply:{job_id}

Work queues (new canonical)
- Type queue (shared): hero:q:work:type:{script_type}
  - Examples:
    - hero:q:work:type:osis
    - hero:q:work:type:sal
    - hero:q:work:type:v
    - hero:q:work:type:python
- Group queue (optional, shared within a group): hero:q:work:type:{script_type}:group:{group}
  - Examples:
    - hero:q:work:type:osis:group:default
    - hero:q:work:type:sal:group:io
- Instance queue (most specific, used for targeted dispatch): hero:q:work:type:{script_type}:group:{group}:inst:{instance}
  - Examples:
    - hero:q:work:type:osis:group:default:inst:1
    - hero:q:work:type:sal:group:io:inst:3

Control queues (optional, future)
- Stop/control per-type: hero:q:ctl:type:{script_type}
- Stop/control per-instance: hero:q:ctl:type:{script_type}:group:{group}:inst:{instance}

Actor presence and metadata
- Instance presence (ephemeral, with TTL refresh): hero:meta:actor:inst:{script_type}:{group}:{instance}
  - Value: JSON { pid, hostname, started_at, version, capabilities, last_heartbeat }
  - Used by the supervisor to discover live consumers and to pick a queue for targeted dispatch
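
As a rough illustration of the presence entry described above, the sketch below builds the key and a JSON value using only the standard library. The `Presence` struct, its field types (Unix-second timestamps, a list of capability strings), and the function names are assumptions made for this example; the proposal only lists the field names.

```rust
/// Presence record fields as listed in the proposal; the Rust types are assumptions.
pub struct Presence {
    pub pid: u32,
    pub hostname: String,
    pub started_at: u64, // Unix seconds (assumed representation)
    pub version: String,
    pub capabilities: Vec<String>,
    pub last_heartbeat: u64, // Unix seconds (assumed representation)
}

/// Builds hero:meta:actor:inst:{script_type}:{group}:{instance}.
pub fn presence_key(script_type: &str, group: &str, instance: u32) -> String {
    format!("hero:meta:actor:inst:{script_type}:{group}:{instance}")
}

impl Presence {
    /// Hand-rolled JSON for illustration only. It does not escape special
    /// characters, so real code would use a JSON library instead.
    pub fn to_json(&self) -> String {
        let caps: Vec<String> = self
            .capabilities
            .iter()
            .map(|c| format!("\"{c}\""))
            .collect();
        format!(
            "{{\"pid\":{},\"hostname\":\"{}\",\"started_at\":{},\"version\":\"{}\",\"capabilities\":[{}],\"last_heartbeat\":{}}}",
            self.pid,
            self.hostname,
            self.started_at,
            self.version,
            caps.join(","),
            self.last_heartbeat
        )
    }
}
```

The key and the JSON string would then be written with a SET that carries an expiry, so the entry disappears on its own if the actor stops refreshing it.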
## 2) Dispatch Strategy
- Default: push to the Type queue hero:q:work:type:{script_type}
  - N instances can BLPOP the same shared queue as competing consumers; each job ID is delivered to exactly one of them.
- Targeted: if the user or scheduler specifies a group and/or instance, push to the most specific queue available
  - Instance queue (highest specificity):
    - hero:q:work:type:{script_type}:group:{group}:inst:{instance}
  - Else Group queue:
    - hero:q:work:type:{script_type}:group:{group}
  - Else Type queue (fallback):
    - hero:q:work:type:{script_type}
- Priority queues (optional extension):
  - Append :prio:{level} to any of the above
  - Actors BLPOP a list of queues in priority order

Example routing
- No group/instance specified:
  - LPUSH hero:q:work:type:osis {job_id}
- Group specified ("default"), no instance:
  - LPUSH hero:q:work:type:osis:group:default {job_id}
- Specific instance:
  - LPUSH hero:q:work:type:osis:group:default:inst:2 {job_id}
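
The routing rules above can be sketched as a single pure function that returns the LPUSH target. The `Target` struct and `dispatch_queue` are hypothetical names, not existing Supervisor code. One case the proposal leaves open is an instance given without a group; this sketch assumes it falls back to the "default" group.

```rust
/// Optional routing hints for a job; the struct and field names are hypothetical.
#[derive(Default)]
pub struct Target<'a> {
    pub group: Option<&'a str>,
    pub instance: Option<u32>,
}

/// Returns the most specific work queue for a job of the given script type.
/// Assumption: an instance given without a group uses the group "default".
pub fn dispatch_queue(script_type: &str, target: &Target<'_>) -> String {
    let type_q = format!("hero:q:work:type:{script_type}");
    match (target.group, target.instance) {
        // Instance queue: highest specificity.
        (group, Some(inst)) => {
            let group = group.unwrap_or("default");
            format!("{type_q}:group:{group}:inst:{inst}")
        }
        // Group queue: shared by every instance in the group.
        (Some(group), None) => format!("{type_q}:group:{group}"),
        // Type queue: fallback shared by all instances of the type.
        (None, None) => type_q,
    }
}
```

Keeping the decision in one function means the Supervisor has a single place that chooses the queue, which is the mismatch the proposal sets out to remove.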
## 3) Actor Consumption Strategy
- Actor identifies itself with:
  - script_type (osis/sal/v/python)
  - group (defaults to "default")
  - instance number (unique within group)
- Actor registers presence:
  - SET hero:meta:actor:inst:{script_type}:{group}:{instance} {...} EX 15
  - Refresh periodically, before the 15-second TTL expires, so the key doubles as a heartbeat
- Actor BLPOP order:
  1) Instance queue (most specific)
  2) Group queue
  3) Type queue
- This ensures targeted jobs, if any, are taken first; otherwise the actor falls back to the group queue and then to the shared type queue.
- Actors that implement reply-queue semantics also LPUSH to hero:q:reply:{job_id} on completion. Others only update hero:job:{job_id} with status and output.
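
A minimal sketch of the consumption order, assuming a hypothetical `ActorIdentity` struct: it produces the key list for one multi-key BLPOP. Redis checks the keys of a BLPOP in the order they are given and pops from the first non-empty list, so listing the instance queue first yields the priority described above.

```rust
/// Identity an actor starts with; the struct name is hypothetical.
pub struct ActorIdentity {
    pub script_type: String,
    pub group: String,
    pub instance: u32,
}

/// Keys to pass to a single multi-key BLPOP, most specific first.
pub fn blpop_keys(id: &ActorIdentity) -> Vec<String> {
    let type_q = format!("hero:q:work:type:{}", id.script_type);
    let group_q = format!("{type_q}:group:{}", id.group);
    let inst_q = format!("{group_q}:inst:{}", id.instance);
    vec![inst_q, group_q, type_q]
}
```

For an OSIS actor in group "default" with instance 1, the list starts with hero:q:work:type:osis:group:default:inst:1 and ends with hero:q:work:type:osis.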
## 4) Backward Compatibility And Migration
- During the transition, the Supervisor can LPUSH to both:
  - New canonical queues (hero:q:work:type:...)
  - Selected legacy queues (hero:job:actor_queue:{suffix}, hero:job:{actor_id}, hero:work_queue:...)
- Actors:
  - Update actors to BLPOP the canonical queues first, then the legacy queues as a fallback
- Phased plan:
  1) Introduce canonical queues alongside legacy; Supervisor pushes to both (compat mode)
  2) Switch actors to consume canonical first
  3) Deprecate legacy queues and remove dual-push
- No change to job hashes hero:job:{job_id}
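
The dual-push step could look roughly like the function below, which returns every list the Supervisor would LPUSH a job ID to. The names `push_targets`, `legacy_queues`, and `compat_mode` are hypothetical; the proposal only says the behavior should be switchable during rollout.

```rust
/// Queues the Supervisor would LPUSH a job ID to.
/// `legacy_queues` and `compat_mode` are hypothetical names for the
/// configured legacy keys and the migration switch.
pub fn push_targets(canonical: &str, legacy_queues: &[&str], compat_mode: bool) -> Vec<String> {
    // The canonical queue is always a target.
    let mut targets = vec![canonical.to_string()];
    // Legacy queues are added only while the migration switch is on.
    if compat_mode {
        targets.extend(legacy_queues.iter().map(|q| q.to_string()));
    }
    targets
}
```

Dual-push places the same job ID on more than one list. The proposal does not say how a deployment running both actor generations avoids executing a job twice, so this sketch leaves that question open as well.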
## 5) Required Code Changes (by file)
Supervisor (routing and reply queue)
- Replace queue computation with canonical builder:
  - [rust.Supervisor::get_actor_queue_key()](core/supervisor/src/lib.rs:410)
  - Change to build canonical keys given script_type (+ optional group/instance from Job or policy)
- Update start logic to LPUSH to canonical queue(s):
  - [rust.Supervisor::start_job_using_connection()](core/supervisor/src/lib.rs:599)
  - Use only canonical queue(s). In the migration phase, also LPUSH to legacy queues.
- Standardize reply queue name:
  - [rust.Supervisor::run_job_and_await_result()](core/supervisor/src/lib.rs:689)
  - Use hero:q:reply:{job_id}
  - Keep the “poll job hash” fallback for actors that don't use reply queues
- Stop queue naming:
  - [rust.Supervisor::stop_job()](core/supervisor/src/lib.rs:789)
  - Use hero:q:ctl:type:{script_type} in canonical mode

Actor (consumption and presence)
- In-crate Rhai actor:
  - Queue key construction and BLPOP list:
    - [rust.spawn_rhai_actor()](core/actor/src/lib.rs:211)
    - Current queue_key at [core/actor/src/lib.rs:220]
    - Replace single-queue BLPOP with multi-key BLPOP in priority order:
      1) hero:q:work:type:{script_type}:group:{group}:inst:{instance}
      2) hero:q:work:type:{script_type}:group:{group}
      3) hero:q:work:type:{script_type}
    - For migration, optionally include legacy queues last.
  - Presence registration (periodic SET with TTL):
    - Add at actor startup and refresh on loop tick
- For actors that implement reply queues:
  - After finishing a job, LPUSH hero:q:reply:{job_id} {result}
- For hash-only actors, continue to call [rust.Job::set_result()](core/job/src/lib.rs:322)

Shared constants (avoid string drift)
- Introduce constants and helpers in a central crate (hero_job) to build keys consistently:
  - fn job_hash_key(job_id) -> "hero:job:{job_id}"
  - fn reply_queue_key(job_id) -> "hero:q:reply:{job_id}"
  - fn work_queue_type(script_type) -> "hero:q:work:type:{type}"
  - fn work_queue_group(script_type, group) -> "hero:q:work:type:{type}:group:{group}"
  - fn work_queue_instance(script_type, group, inst) -> "hero:q:work:type:{type}:group:{group}:inst:{inst}"
- Replace open-coded strings in:
  - [rust.Supervisor](core/supervisor/src/lib.rs:1)
  - [rust.Actor code](core/actor/src/lib.rs:1)
  - Any CLI/TUI or interface components that reference queues
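
One possible shape for the helpers listed under “Shared constants”, written against the standard library only. The function names follow the list above; the parameter types (`&str` for IDs and names, `u32` for the instance number) are assumptions, since the proposal gives untyped signatures.

```rust
/// Job hash key: hero:job:{job_id}
pub fn job_hash_key(job_id: &str) -> String {
    format!("hero:job:{job_id}")
}

/// Reply queue key: hero:q:reply:{job_id}
pub fn reply_queue_key(job_id: &str) -> String {
    format!("hero:q:reply:{job_id}")
}

/// Shared type queue: hero:q:work:type:{type}
pub fn work_queue_type(script_type: &str) -> String {
    format!("hero:q:work:type:{script_type}")
}

/// Group queue, built on top of the type queue so the prefix cannot drift.
pub fn work_queue_group(script_type: &str, group: &str) -> String {
    format!("{}:group:{group}", work_queue_type(script_type))
}

/// Instance queue, built on top of the group queue.
pub fn work_queue_instance(script_type: &str, group: &str, inst: u32) -> String {
    format!("{}:inst:{inst}", work_queue_group(script_type, group))
}
```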
Interfaces
- OpenRPC/WebSocket servers do not need to know queue names; they call the Supervisor API. No changes are needed except to follow the Supervisor's behavior for “run-and-wait” vs “create+start+get_output” flows.
## 6) Example Scenarios
Scenario A: Single OSIS pool with two instances
- Actors:
  - osis group=default inst=1
  - osis group=default inst=2
- Incoming job (no targeting):
  - LPUSH hero:q:work:type:osis {job_id}
- Actors BLPOP order:
  - inst queue
  - group queue
  - type queue (the only one holding the job)
- Effective result: round-robin-like load sharing; each job goes to exactly one of the two workers.

Scenario B: SAL pool “io” with instance 3; targeted dispatch
- Job sets target group=io and instance=3
- Supervisor LPUSH hero:q:work:type:sal:group:io:inst:3 {job_id}
- Only that instance consumes it, enabling pinning to a specific worker.

Scenario C: Mixed old and new actors (migration window)
- Supervisor pushes to canonical queue(s) and to a legacy queue hero:job:actor_queue:osis
- New actors consume canonical queues
- Legacy actors consume the legacy queue
- No job is stuck; both ecosystems coexist until the legacy path is removed.
## 7) Phased Migration Plan
Phase 0 (Docs + helpers)
- Add helpers in hero_job to compute keys (see “Shared constants”)
- Document the new scheme and consumption order (this file)

Phase 1 (Supervisor)
- Update [rust.Supervisor::get_actor_queue_key()](core/supervisor/src/lib.rs:410) and [rust.Supervisor::start_job_using_connection()](core/supervisor/src/lib.rs:599) to use canonical queues
- Keep dual-push to legacy queues behind a feature flag or config for rollout
- Standardize reply queue to hero:q:reply:{job_id} in [rust.Supervisor::run_job_and_await_result()](core/supervisor/src/lib.rs:689)

Phase 2 (Actors)
- Update [rust.spawn_rhai_actor()](core/actor/src/lib.rs:211) to BLPOP from canonical queues in priority order and to register presence keys
- Optionally emit reply to hero:q:reply:{job_id} in addition to hash-based result (feature flag)

Phase 3 (Cleanup)
- After all actors and Supervisor deployments are updated and stable, remove the legacy dual-push and fallback consume paths
## 8) Optional Enhancements
- Priority queues:
  - Suffix queues with :prio:{0|1|2}; actors BLPOP [inst prio0, group prio0, type prio0, inst prio1, group prio1, type prio1, ...]
- Rate limiting/back-pressure:
  - Use metadata to signal busy state or reported in-flight jobs; Supervisor can target instance queues accordingly.
- Resilience:
  - Move to Redis Streams for job event logs; lists remain fine for simple FIFO processing.
- Observability:
  - hero:meta:actor:* and hero:meta:queue:stats:* to keep simple metrics for dashboards.
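
For the priority-queue enhancement, the BLPOP key list could be generated as in the sketch below. It assumes priority levels are numbered from 0 (served first) and that every queue carries the :prio:{level} suffix; the function name and the `levels` parameter are hypothetical.

```rust
/// BLPOP keys ordered by priority level first, then by specificity:
/// [inst prio0, group prio0, type prio0, inst prio1, ...].
/// Assumption: levels run from 0 (served first) to `levels - 1`.
pub fn prioritized_blpop_keys(
    script_type: &str,
    group: &str,
    instance: u32,
    levels: u8,
) -> Vec<String> {
    let type_q = format!("hero:q:work:type:{script_type}");
    let group_q = format!("{type_q}:group:{group}");
    let inst_q = format!("{group_q}:inst:{instance}");

    let mut keys = Vec::new();
    for level in 0..levels {
        // Within one level, keep the instance, group, type order.
        for base in [&inst_q, &group_q, &type_q] {
            keys.push(format!("{base}:prio:{level}"));
        }
    }
    keys
}
```

With three levels this yields nine keys per actor, so the list passed to BLPOP grows with the number of levels multiplied by three.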
## 9) Summary
- Canonicalize to hero:q:work:type:{...} (+ group, + instance), and hero:q:reply:{job_id}
- Actors consume instance → group → type
- Supervisor pushes to most specific queue available, defaulting to type
- Provide helpers to build keys and remove ad-hoc string formatting
- Migrate with a dual-push (canonical + legacy) phase to avoid downtime

Proposed touchpoints to implement (clickable references)
- [rust.Supervisor::get_actor_queue_key()](core/supervisor/src/lib.rs:410)
- [rust.Supervisor::start_job_using_connection()](core/supervisor/src/lib.rs:599)
- [rust.Supervisor::run_job_and_await_result()](core/supervisor/src/lib.rs:689)
- [rust.spawn_rhai_actor()](core/actor/src/lib.rs:211)
- [core/actor/src/lib.rs](core/actor/src/lib.rs:220)
- [rust.Job::set_result()](core/job/src/lib.rs:322)