Redis Queue Naming Proposal (Multi-Actor, Multi-Type, Scalable)

Goal

  • Define a consistent, future-proof Redis naming scheme that:
    • Supports multiple actor types (OSIS, SAL, V, Python)
    • Supports multiple pools/groups and instances per type
    • Enables fair load-balancing and targeted dispatch
    • Works with both “hash-output” actors and “reply-queue” actors
    • Keeps migration straightforward from the current keys

Motivation

  • Today, multiple non-unified patterns exist:
    • Per-actor keys like "hero:job:{actor_id}" consumed by in-crate Rhai actor
    • Per-type keys like "hero:job:actor_queue:{suffix}" used by other components
    • Protocol docs that reference "hero:work_queue:{actor_id}" and "hero:reply:{job_id}"
  • This fragmentation causes stuck “Dispatched” jobs whenever the LPUSH target doesn't match the BLPOP listener. We need one canonical scheme with well-defined fallbacks.

1) Canonical Key Names

Prefix conventions

  • Namespace prefix: hero:
  • All queues collected under hero:q:* to separate from job hashes hero:job:*
  • All metadata under hero:meta:* for discoverability

Job and result keys

  • Job hash (unchanged): hero:job:{job_id}
  • Reply queue: hero:q:reply:{job_id}

Work queues (new canonical)

  • Type queue (shared): hero:q:work:type:{script_type}
    • Examples:
      • hero:q:work:type:osis
      • hero:q:work:type:sal
      • hero:q:work:type:v
      • hero:q:work:type:python
  • Group queue (optional, shared within a group): hero:q:work:type:{script_type}:group:{group}
    • Examples:
      • hero:q:work:type:osis:group:default
      • hero:q:work:type:sal:group:io
  • Instance queue (most specific, used for targeted dispatch): hero:q:work:type:{script_type}:group:{group}:inst:{instance}
    • Examples:
      • hero:q:work:type:osis:group:default:inst:1
      • hero:q:work:type:sal:group:io:inst:3

Control queues (optional, future)

  • Stop/control per-type: hero:q:ctl:type:{script_type}
  • Stop/control per-instance: hero:q:ctl:type:{script_type}:group:{group}:inst:{instance}

Actor presence and metadata

  • Instance presence (ephemeral, with TTL refresh): hero:meta:actor:inst:{script_type}:{group}:{instance}
    • Value: JSON { pid, hostname, started_at, version, capabilities, last_heartbeat }
    • Used by the supervisor to discover live consumers and to select targeted queueing
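
To make the payload concrete, here is one way the presence value could be modeled in Rust with serde. The field names come from the JSON above; the struct itself and the concrete types are assumptions, not existing code.

```rust
use serde::{Deserialize, Serialize};

/// Presence payload stored at hero:meta:actor:inst:{script_type}:{group}:{instance}.
/// Field names follow the proposal; concrete types are assumptions.
#[derive(Serialize, Deserialize)]
struct ActorPresence {
    pid: u32,
    hostname: String,
    started_at: String,        // e.g. an RFC 3339 timestamp
    version: String,
    capabilities: Vec<String>,
    last_heartbeat: String,    // updated on every TTL refresh
}
```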

2) Dispatch Strategy

  • Default: Push to the Type queue hero:q:work:type:{script_type}
    • Allows N instances to BLPOP the same shared queue (competing consumers: each job goes to exactly one instance).
  • Targeted: If user or scheduler specifies a group and/or instance, push to the most specific queue
    • Instance queue (highest specificity):
      • hero:q:work:type:{script_type}:group:{group}:inst:{instance}
    • Else Group queue:
      • hero:q:work:type:{script_type}:group:{group}
    • Else Type queue (fallback):
      • hero:q:work:type:{script_type}
  • Priority queues (optional extension):
    • Append :prio:{level} to any of the above
    • Actors BLPOP a list of queues in priority order

Example routing

  • No group/instance specified:
    • LPUSH hero:q:work:type:osis {job_id}
  • Group specified ("default"), no instance:
    • LPUSH hero:q:work:type:osis:group:default {job_id}
  • Specific instance:
    • LPUSH hero:q:work:type:osis:group:default:inst:2 {job_id}
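
A minimal sketch of this fallback in Rust using the redis crate. dispatch_job is a hypothetical name, and the choice that an instance given without a group falls back to the type queue is a policy assumption:

```rust
use redis::Connection;

/// Push a job id onto the most specific queue available
/// (instance -> group -> type). Hypothetical sketch, not existing Supervisor code.
fn dispatch_job(
    con: &mut Connection,
    script_type: &str,
    group: Option<&str>,
    instance: Option<u32>,
    job_id: &str,
) -> redis::RedisResult<()> {
    let queue = match (group, instance) {
        (Some(g), Some(i)) => format!("hero:q:work:type:{script_type}:group:{g}:inst:{i}"),
        (Some(g), None) => format!("hero:q:work:type:{script_type}:group:{g}"),
        // No group (even if an instance is given): fall back to the shared type queue.
        _ => format!("hero:q:work:type:{script_type}"),
    };
    redis::cmd("LPUSH").arg(&queue).arg(job_id).query::<i64>(con)?;
    Ok(())
}
```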

3) Actor Consumption Strategy

  • Actor identifies itself with:
    • script_type (osis/sal/v/python)
    • group (defaults to "default")
    • instance number (unique within group)
  • Actor registers presence:
    • SET hero:meta:actor:inst:{script_type}:{group}:{instance} {...} EX 15
    • Periodically refresh to act as heartbeat
  • Actor BLPOP order:
    1. Instance queue (most specific)
    2. Group queue
    3. Type queue
  • This ensures targeted jobs are taken first (if any); otherwise the actor falls back to the group or shared type queue (see the sketch below).
  • Actors that implement reply-queue semantics also LPUSH the result to hero:q:reply:{job_id} on completion; hash-only actors just update hero:job:{job_id} with status and output.
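
Sketch of one actor loop tick under this scheme, assuming the redis crate: refresh the presence key, then BLPOP the three queues in priority order (BLPOP checks its keys left to right). poll_once is a hypothetical name; real code would also execute the popped job.

```rust
use redis::Connection;

/// One tick of the actor loop: heartbeat, then priority-ordered BLPOP.
/// Returns the popped job id, if any. Error handling is elided.
fn poll_once(
    con: &mut Connection,
    script_type: &str,
    group: &str,
    instance: u32,
    presence_json: &str,
) -> redis::RedisResult<Option<String>> {
    // Presence heartbeat: 15 s TTL, refreshed on every tick.
    let presence_key = format!("hero:meta:actor:inst:{script_type}:{group}:{instance}");
    redis::cmd("SET")
        .arg(&presence_key)
        .arg(presence_json)
        .arg("EX")
        .arg(15)
        .query::<()>(con)?;

    // BLPOP inspects its keys left to right, so the most specific queue wins.
    let keys = [
        format!("hero:q:work:type:{script_type}:group:{group}:inst:{instance}"),
        format!("hero:q:work:type:{script_type}:group:{group}"),
        format!("hero:q:work:type:{script_type}"),
    ];
    // 5 s timeout keeps the loop short enough to refresh the heartbeat above.
    let popped: Option<(String, String)> = redis::cmd("BLPOP")
        .arg(&keys[..])
        .arg(5)
        .query(con)?;
    Ok(popped.map(|(_queue, job_id)| job_id))
}
```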

4) Backward Compatibility And Migration

  • During transition, the Supervisor can LPUSH to both (see the pipelined sketch after this list):
    • New canonical queues (hero:q:work:type:...)
    • Selected legacy queues (hero:job:actor_queue:{suffix}, hero:job:{actor_id}, hero:work_queue:...)
  • Actors:
    • Update actors to BLPOP the canonical queues first, then legacy fallback
  • Phased plan:
    1. Introduce canonical queues alongside legacy; Supervisor pushes to both (compat mode)
    2. Switch actors to consume canonical first
    3. Deprecate legacy queues and remove dual-push
  • No change to job hashes hero:job:{job_id}
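
A compat-mode push could pipeline both LPUSHes, for example as below. The legacy key shown is one of several possible legacy targets, and consumers must tolerate seeing the same job_id on two queues (e.g. by claiming the job via its hero:job:{job_id} status before executing):

```rust
use redis::Connection;

/// Dual-push for the migration window: canonical type queue plus one
/// legacy queue, in a single pipeline. Sketch only.
fn dual_push(con: &mut Connection, script_type: &str, job_id: &str) -> redis::RedisResult<()> {
    let canonical = format!("hero:q:work:type:{script_type}");
    let legacy = format!("hero:job:actor_queue:{script_type}");
    redis::pipe()
        .cmd("LPUSH").arg(&canonical).arg(job_id).ignore()
        .cmd("LPUSH").arg(&legacy).arg(job_id).ignore()
        .query(con)
}
```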

5) Required Code Changes (by file)

Supervisor (routing and reply queue)

Actor (consumption and presence)

  • In-crate Rhai actor:
    • Queue key construction and BLPOP list:
      • spawn_rhai_actor()
      • Current queue_key at [core/actor/src/lib.rs:220]
      • Replace single-queue BLPOP with multi-key BLPOP in priority order:
        1. hero:q:work:type:{script_type}:group:{group}:inst:{instance}
        2. hero:q:work:type:{script_type}:group:{group}
        3. hero:q:work:type:{script_type}
      • For migration, optionally include legacy queues last.
    • Presence registration (periodic SET with TTL):
      • Add at actor startup and refresh on loop tick
  • For actors that implement reply queues:
    • After finishing job, LPUSH hero:q:reply:{job_id} {result}
    • For hash-only actors, continue to call Job::set_result() (both result paths are sketched below)
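
Both result paths can go out in one pipeline on completion. A sketch, assuming hypothetical hash fields "status" and "output" on hero:job:{job_id}:

```rust
use redis::Connection;

/// Publish a finished job's result to the reply queue and mirror it into the
/// job hash for hash-only consumers. The hash field names are assumptions.
fn finish_job(con: &mut Connection, job_id: &str, result: &str) -> redis::RedisResult<()> {
    redis::pipe()
        .cmd("HSET")
        .arg(format!("hero:job:{job_id}"))
        .arg("status").arg("Finished")
        .arg("output").arg(result)
        .ignore()
        .cmd("LPUSH")
        .arg(format!("hero:q:reply:{job_id}"))
        .arg(result)
        .ignore()
        .query(con)
}
```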

Shared constants (avoid string drift)

  • Introduce constants and helpers in a central crate (hero_job) to build keys consistently:
    • fn job_hash_key(job_id) -> "hero:job:{job_id}"
    • fn reply_queue_key(job_id) -> "hero:q:reply:{job_id}"
    • fn work_queue_type(script_type) -> "hero:q:work:type:{type}"
    • fn work_queue_group(script_type, group) -> "hero:q:work:type:{type}:group:{group}"
    • fn work_queue_instance(script_type, group, inst) -> "hero:q:work:type:{type}:group:{group}:inst:{inst}"
  • Replace open-coded key strings with these helpers wherever queue and job keys are currently built inline; a sketch of the helpers follows.
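
A sketch of these helpers as they might appear in hero_job; the signatures mirror the list above:

```rust
/// Canonical key builders, so no caller formats Redis keys by hand.
pub fn job_hash_key(job_id: &str) -> String {
    format!("hero:job:{job_id}")
}

pub fn reply_queue_key(job_id: &str) -> String {
    format!("hero:q:reply:{job_id}")
}

pub fn work_queue_type(script_type: &str) -> String {
    format!("hero:q:work:type:{script_type}")
}

pub fn work_queue_group(script_type: &str, group: &str) -> String {
    format!("{}:group:{group}", work_queue_type(script_type))
}

pub fn work_queue_instance(script_type: &str, group: &str, inst: u32) -> String {
    format!("{}:inst:{inst}", work_queue_group(script_type, group))
}
```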

Interfaces

  • OpenRPC/WebSocket servers do not need to know queue names; they call the Supervisor API. No changes except to follow the Supervisor's behavior for “run-and-wait” vs “create+start+get_output” flows.
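
For reference, “run-and-wait” on the Supervisor side reduces to an LPUSH followed by a blocking read of the job's reply queue. A sketch; run_and_wait and the 30 s timeout are assumptions:

```rust
use redis::Connection;

/// Enqueue a job on the shared type queue, then block on its reply queue.
/// Returns None if no reply arrives before the timeout.
fn run_and_wait(
    con: &mut Connection,
    script_type: &str,
    job_id: &str,
) -> redis::RedisResult<Option<String>> {
    redis::cmd("LPUSH")
        .arg(format!("hero:q:work:type:{script_type}"))
        .arg(job_id)
        .query::<i64>(con)?;
    let reply: Option<(String, String)> = redis::cmd("BLPOP")
        .arg(format!("hero:q:reply:{job_id}"))
        .arg(30)
        .query(con)?;
    Ok(reply.map(|(_key, result)| result))
}
```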

6) Example Scenarios

Scenario A: Single OSIS pool with two instances

  • Actors:
    • osis group=default inst=1
    • osis group=default inst=2
  • Incoming job (no targeting):
    • LPUSH hero:q:work:type:osis {job_id}
  • Actors BLPOP order:
    • inst queue
    • group queue
    • type queue (this one supplies the job)
  • Effective result: the two instances share the load, approximating round-robin behavior.

Scenario B: SAL pool “io” with instance 3; targeted dispatch

  • Job sets target group=io and instance=3
  • Supervisor LPUSH hero:q:work:type:sal:group:io:inst:3 {job_id}
  • Only that instance consumes it, enabling pinning to a specific worker.

Scenario C: Mixed old and new actors (migration window)

  • Supervisor pushes to canonical queue(s) and to a legacy queue hero:job:actor_queue:osis
  • New actors consume canonical queues
  • Legacy actors consume legacy queue
  • No job is stuck; both ecosystems coexist until the legacy path is removed.

7) Phased Migration Plan

Phase 0 (Docs + helpers)

  • Add helpers in hero_job to compute keys (see “Shared constants”)
  • Document the new scheme and consumption order (this file)

Phase 1 (Supervisor)

Phase 2 (Actors)

  • Update spawn_rhai_actor() to BLPOP from canonical queues in priority order and to register presence keys
  • Optionally emit reply to hero:q:reply:{job_id} in addition to hash-based result (feature flag)

Phase 3 (Cleanup)

  • After all actors and Supervisor deployments are updated and stable, remove the legacy dual-push and fallback consume paths

8) Optional Enhancements

  • Priority queues:
    • Suffix queues with :prio:{0|1|2}; actors BLPOP all priority-0 queues first (inst, group, type), then the priority-1 set, and so on (see the sketch after this list)
  • Rate limiting/back-pressure:
    • Use presence metadata to signal a busy state or report in-flight job counts; the Supervisor can then target instance queues accordingly.
  • Resilience:
    • Move to Redis Streams for job event logs; lists remain fine for simple FIFO processing.
  • Observability:
    • hero:meta:actor:* and hero:meta:queue:stats:* to keep simple metrics for dashboards.
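
Sketch of the priority-ordered BLPOP key list referenced above; priority_key_list is a hypothetical helper:

```rust
/// Build a BLPOP key list for the :prio:{level} extension: every queue at
/// priority 0 first (inst, group, type), then priority 1, and so on.
fn priority_key_list(script_type: &str, group: &str, inst: u32, levels: u8) -> Vec<String> {
    let base = [
        format!("hero:q:work:type:{script_type}:group:{group}:inst:{inst}"),
        format!("hero:q:work:type:{script_type}:group:{group}"),
        format!("hero:q:work:type:{script_type}"),
    ];
    (0..levels)
        .flat_map(|p| base.iter().map(move |b| format!("{b}:prio:{p}")))
        .collect()
}
```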

9) Summary

  • Canonicalize to hero:q:work:type:{...} (+ group, + instance), and hero:q:reply:{job_id}
  • Actors consume instance → group → type
  • Supervisor pushes to most specific queue available, defaulting to type
  • Provide helpers to build keys and remove ad-hoc string formatting
  • Migrate with a dual-push (canonical + legacy) phase to avoid downtime

Proposed touchpoints to implement