reliability service #104

Open
opened 2026-05-16 03:41:38 +00:00 by despiegk · 1 comment
Owner

Hero Proc Service Reliability Notes

We need to do deeper research and cleanup around how a service works.

A service is a combination of actions. Each action represents one thing that needs to be done to keep the service alive.

Anything marked as a process must be highly reliable. The service runner should keep trying until the service is alive and healthy.

Actions and Singleton Behavior

An action can have a singleton property.

When an action is marked as singleton, it means there is a PID file somewhere for that action. We need to verify that singleton handling is implemented correctly per action.

The lifecycle should be:

  1. Check the PID file.
  2. Find the existing PID.
  3. If the process is still running, terminate it properly.
  4. Remove stale sockets if needed.
  5. Restart the action.
  6. Keep retrying until it works.
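The first three steps of this lifecycle can be sketched as follows. This is a minimal illustration, not the existing hero_proc code: the function names `parse_pid`, `read_pid_file` and `is_alive` are hypothetical, and the PID file is assumed to hold a single decimal PID. The liveness probe shells out to `kill -0`, which sends no signal and only reports whether the PID can be signalled.

```rust
use std::fs;
use std::io;
use std::path::Path;
use std::process::{Command, Stdio};

/// Parse PID file contents: one positive integer, surrounding whitespace allowed.
/// Anything else (empty file, garbage, 0) is treated as "no usable PID".
fn parse_pid(contents: &str) -> Option<u32> {
    match contents.trim().parse::<u32>() {
        Ok(pid) if pid > 0 => Some(pid),
        _ => None,
    }
}

/// Read the PID recorded for an action (hypothetical helper).
/// A missing file means there is no previous instance to deal with.
fn read_pid_file(path: &Path) -> io::Result<Option<u32>> {
    match fs::read_to_string(path) {
        Ok(contents) => Ok(parse_pid(&contents)),
        Err(e) if e.kind() == io::ErrorKind::NotFound => Ok(None),
        Err(e) => Err(e),
    }
}

/// Probe liveness with `kill -0 <pid>`.
/// Caveat: this also reports false for a live process owned by another user,
/// because the permission check fails before the existence check succeeds.
fn is_alive(pid: u32) -> bool {
    Command::new("kill")
        .arg("-0")
        .arg(pid.to_string())
        .stdout(Stdio::null())
        .stderr(Stdio::null())
        .status()
        .map(|status| status.success())
        .unwrap_or(false)
}
```

A runner would call `read_pid_file`, and only when it returns `Some(pid)` with `is_alive(pid)` true would it move on to terminating the old instance.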

Action Metadata

Sockets and TCP ports should already be defined as part of the action itself. That means the action should explicitly know:

  • which Unix sockets it occupies
  • which TCP ports it uses

This allows the service runner to be much more robust.

Socket and Port Cleanup

Before starting an action, we should inspect the sockets and ports defined on the action.

If sockets exist, we can determine whether another process is using them.

The runner should then:

  1. Detect stale sockets.
  2. Detect processes occupying expected sockets or ports.
  3. Remove stale sockets.
  4. Stop conflicting processes where appropriate.
  5. Start the action cleanly.

This is important when restarting hero_proc itself. After a restart, the service runner should be able to recover the full service state and go through the same cleanup/start cycle again.

The runner may clean up other processes only if kill_other or singleton is set on the action (setting both is fine).
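The gating rule can be written as a small pure function, which keeps it easy to test. The sketch below is an assumption about how the decision could be modelled: the `ResourceState` and `CleanupStep` enums and `plan_cleanup` are hypothetical names, and the `Refuse` outcome (what to do when a live process holds the resource and neither flag is set) is not specified in these notes.

```rust
/// What the runner found when it inspected one declared socket or port.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ResourceState {
    /// Nothing is there; the action can bind it.
    Free,
    /// A leftover socket file with no process behind it.
    Stale,
    /// A live process with this PID holds the socket or port.
    HeldBy(u32),
}

/// What the runner should do about that resource before starting the action.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum CleanupStep {
    Nothing,
    RemoveStale,
    StopProcess(u32),
    /// Assumption: without permission to stop the holder, the start is refused.
    Refuse(u32),
}

/// Stale leftovers are always removed; a live holder is only stopped
/// when the action is a singleton or has kill_other configured.
fn plan_cleanup(state: ResourceState, singleton: bool, kill_other: bool) -> CleanupStep {
    match state {
        ResourceState::Free => CleanupStep::Nothing,
        ResourceState::Stale => CleanupStep::RemoveStale,
        ResourceState::HeldBy(pid) if singleton || kill_other => CleanupStep::StopProcess(pid),
        ResourceState::HeldBy(pid) => CleanupStep::Refuse(pid),
    }
}
```

Keeping the decision separate from the side effects means the same rule can be reused on a normal restart and on recovery after hero_proc itself restarts.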

Runs and Jobs

Each service run should keep working within the same run context.

If an action fails, we should not keep creating new jobs forever. Instead, we should reuse the same job and track how many times it was restarted.

Add a new field to the job model:

restart_count: u32

This is clearer than occurrence.

Each time the binary or script inside the action is restarted, increment restart_count.
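As a rough illustration of the intended accounting, the sketch below uses a cut-down `Job` with only the two fields that matter here. The real job model has more fields, and `record_restart` is a hypothetical helper name.

```rust
/// Cut-down job model: only the fields relevant to restart accounting.
#[derive(Debug, Default, Clone, PartialEq, Eq)]
struct Job {
    /// Stable identity; never changes across restarts.
    id: u32,
    /// How many times the binary or script inside the action was restarted.
    restart_count: u32,
}

impl Job {
    /// Record one restart of the underlying binary or script and
    /// return the new count. Saturates instead of overflowing.
    fn record_restart(&mut self) -> u32 {
        self.restart_count = self.restart_count.saturating_add(1);
        self.restart_count
    }
}
```

A fresh job starts at 0, so the first respawn reports 1 while `id` stays the same.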

Logging

Each job should have one stable job ID.

When the job restarts, we should not create endless new job logs.

Instead:

  • keep the same job ID
  • increment restart_count
  • replace or rotate the active log cleanly
  • remove the old “running” lock/log state before creating a new one

This avoids the current mess where we get too many job blocks and makes service behavior much easier to understand.

Main Goal

Make services self-healing.

A service should continuously try to keep its actions alive, clean up stale state, reuse the same job identity, and restart reliably until the action succeeds.

Author
Owner

Implementation Spec for Issue #104 — Reliability Service

Objective

Make hero_proc services self-healing by giving every action explicit knowledge of the sockets and TCP ports it owns, making singleton lifecycle deterministic (PID file → kill old → cleanup sockets/ports → restart → keep retrying), and making restart accounting visible by adding a restart_count field on Job. This first slice keeps the same job identity across restarts and rotates its running log instead of creating an endless stream of new job records.

Requirements

  • ActionSpec exposes the sockets and TCP ports an action is expected to own.
  • A central pre-spawn cleanup routine inspects those sockets/ports, removes stale UDS files, and (only when singleton or kill_other is set) stops conflicting holders before spawning.
  • Job gains restart_count: u32 persisted in SQLite (kept alongside attempt).
  • The supervisor reuses the existing job row for an is_process action that needs to be respawned (no new SQL row, no new logs prefix); restart_count is incremented and the prior "running" log block is rotated/cleared.
  • Behaviour is identical on a hero_proc cold restart: the recovery path runs the same cleanup → start cycle for every known process action.
  • All current functional tests still pass; new tests cover socket/port cleanup and restart_count increment.

Files to Modify/Create

  • crates/hero_proc_server/src/db/actions/model.rs — extend ActionSpec with sockets: Vec<String> and tcp_ports: Vec<u16> (defaulted, backwards-compatible serde). Update Default, validation, and the unit tests at the bottom of the file. The new fields describe what the action owns, while KillOther keeps describing what to reap.
  • crates/hero_proc_server/src/db/jobs/model.rs — add restart_count: u32 to Job and JobSummary, update Default, to_summary, schema and additive migration ALTER TABLE jobs ADD COLUMN restart_count INTEGER NOT NULL DEFAULT 0, insert_job, update_job, SELECT lists, and row_to_job.
  • crates/hero_proc_server/src/pid/mod.rs — add force_kill_owners(sockets: &[String], tcp_ports: &[u16]) -> Vec<u32> and cleanup_stale_sockets(sockets: &[String]).
  • crates/hero_proc_server/src/process.rs — add helpers pids_holding_tcp_port(port: u16) -> Vec<u32> and pids_holding_unix_socket(path: &Path) -> Vec<u32> (lsof-based, Linux/macOS gated).
  • crates/hero_proc_server/src/supervisor/executor.rs
    1. Before the existing kill_other block, call prepare_action_resources(&job), which removes stale UDS files and, when singleton || kill_other.is_some(), reclaims live owners via pid::force_kill_owners.
    2. In retry/failure branches for is_process jobs: bump restart_count, keep phase = Retrying, and rotate the running log.
  • crates/hero_proc_server/src/supervisor/mod.rs — make autostart_process_jobs reuse the existing job row when a process action needs respawning (no new Job). Update list_process_jobs_needing_restart and list_is_process_terminal_jobs accordingly.
  • crates/hero_proc_server/src/logging/store.rs — add rotate_job_logs(job_id: u32, restart_count: u32) to archive or clear prior running entries before respawn.
  • crates/hero_proc_server/openrpc.json + the two checked-in generated client mirrors — extend Job and ActionSpec schemas.
  • crates/hero_proc_app/src/types.rs — mirror restart_count, sockets, tcp_ports.
  • crates/hero_proc_test/src/tests/functional/singleton.rs — case for UDS/TCP-port reclaim on respawn.
  • crates/hero_proc_test/src/tests/functional/uc_31_34_action_cascade_process.rs — case for stable job id + restart_count increment.

Implementation Plan

Step 1: Extend models

Files: crates/hero_proc_server/src/db/actions/model.rs, crates/hero_proc_server/src/db/jobs/model.rs, crates/hero_proc_app/src/types.rs

  • Add sockets, tcp_ports to ActionSpec (#[serde(default, skip_serializing_if = "Vec::is_empty")]).
  • Add restart_count: u32 to Job / JobSummary (#[serde(default)]).
  • Update defaults, to_summary, schema, additive migration, every INSERT/UPDATE and row_to_job.
  • Extend unit tests with a restart_count = 3 round-trip.
    Dependencies: none.
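
One detail of the additive migration is worth sketching: SQLite's `ALTER TABLE ... ADD COLUMN` has no `IF NOT EXISTS` form, so running the statement twice fails. A guard like the one below avoids that. It assumes the caller has already collected the existing column names (for example from `PRAGMA table_info(jobs)`); the function name `pending_migration` is hypothetical.

```rust
/// The additive migration statement from the spec.
const ADD_RESTART_COUNT: &str =
    "ALTER TABLE jobs ADD COLUMN restart_count INTEGER NOT NULL DEFAULT 0";

/// Return the statement to run, or None when the column is already present.
/// SQLite treats column names case-insensitively, so the comparison does too.
fn pending_migration(existing_columns: &[&str]) -> Option<&'static str> {
    let already_there = existing_columns
        .iter()
        .any(|column| column.eq_ignore_ascii_case("restart_count"));
    if already_there {
        None
    } else {
        Some(ADD_RESTART_COUNT)
    }
}
```

Because the new column is `NOT NULL DEFAULT 0`, historical rows read back with `restart_count = 0` without a backfill step.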

Step 2: Resource-aware process helpers

Files: crates/hero_proc_server/src/process.rs, crates/hero_proc_server/src/pid/mod.rs

  • Add pids_holding_tcp_port / pids_holding_unix_socket using lsof.
  • Add cleanup_stale_sockets and force_kill_owners in pid/mod.rs.
  • Cover stale UDS path with tempfile-based unit tests.
    Dependencies: Step 1.
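
A possible shape for the TCP helper is sketched below. It assumes `lsof` is invoked with `-t` (terse output, one PID per line) and restricted to listening sockets with `-sTCP:LISTEN`. The parsing function `parse_lsof_pids` is a hypothetical name, and the real helper may need different flags or the existing kill_other logic instead.

```rust
use std::process::Command;

/// Parse `lsof -t` output: one PID per line.
/// Non-numeric lines are skipped and duplicates are removed.
fn parse_lsof_pids(output: &str) -> Vec<u32> {
    let mut pids: Vec<u32> = output
        .lines()
        .filter_map(|line| line.trim().parse::<u32>().ok())
        .collect();
    pids.sort_unstable();
    pids.dedup();
    pids
}

/// PIDs listening on the given TCP port, according to lsof.
/// An empty result means "no holder found" OR "lsof could not be run";
/// this sketch does not distinguish the two cases.
fn pids_holding_tcp_port(port: u16) -> Vec<u32> {
    let result = Command::new("lsof")
        .arg("-t")
        .arg(format!("-iTCP:{port}"))
        .arg("-sTCP:LISTEN")
        .output();
    match result {
        Ok(out) => parse_lsof_pids(&String::from_utf8_lossy(&out.stdout)),
        Err(_) => Vec::new(),
    }
}
```

Splitting the parser from the process call keeps the parsing logic unit-testable on machines where `lsof` is not installed.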

Step 3: Pre-spawn cleanup in the executor

Files: crates/hero_proc_server/src/supervisor/executor.rs

  • Add a private prepare_action_resources(job: &Job) called right before kill_other.
  • Always remove stale UDS files; reclaim live owners only when singleton || kill_other.is_some().
  • The routine must be idempotent and safe to re-invoke.
    Dependencies: Steps 1, 2.
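
The stale-socket half of this step can be sketched with the standard library alone. The approach below is one common technique and not necessarily what hero_proc will use: it connects to the socket path and treats "connection refused" as proof that no process is listening. The `SocketState` enum and `probe_unix_socket` are hypothetical, and returning the list of removed paths from `cleanup_stale_sockets` is an assumption, since the spec does not give a return type.

```rust
use std::fs;
use std::io;
use std::os::unix::fs::FileTypeExt;
use std::os::unix::net::UnixStream;
use std::path::Path;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum SocketState {
    /// No file at that path.
    Missing,
    /// Socket file exists but nothing accepts connections on it.
    Stale,
    /// Something is listening.
    Live,
    /// Not a socket, or an error we cannot interpret; never removed.
    Unknown,
}

/// Classify a Unix socket path by trying to connect to it.
/// Note that probing a live service briefly opens a connection to it.
fn probe_unix_socket(path: &Path) -> SocketState {
    let meta = match fs::symlink_metadata(path) {
        Ok(meta) => meta,
        Err(e) if e.kind() == io::ErrorKind::NotFound => return SocketState::Missing,
        Err(_) => return SocketState::Unknown,
    };
    if !meta.file_type().is_socket() {
        // Refuse to classify regular files as stale sockets.
        return SocketState::Unknown;
    }
    match UnixStream::connect(path) {
        Ok(_) => SocketState::Live,
        Err(e) if e.kind() == io::ErrorKind::ConnectionRefused => SocketState::Stale,
        Err(e) if e.kind() == io::ErrorKind::NotFound => SocketState::Missing,
        Err(_) => SocketState::Unknown,
    }
}

/// Remove every stale socket file and return the paths that were removed.
/// Safe to call repeatedly: a file that is already gone is not an error.
fn cleanup_stale_sockets(sockets: &[String]) -> Vec<String> {
    let mut removed = Vec::new();
    for socket in sockets {
        if probe_unix_socket(Path::new(socket)) != SocketState::Stale {
            continue;
        }
        match fs::remove_file(socket) {
            Ok(()) => removed.push(socket.clone()),
            Err(e) if e.kind() == io::ErrorKind::NotFound => {}
            Err(e) => eprintln!("could not remove stale socket {socket}: {e}"),
        }
    }
    removed
}
```

Live sockets are left alone here on purpose; stopping their owners is the separate, flag-gated part of the routine.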

Step 4: Reuse-row restart loop

Files: crates/hero_proc_server/src/supervisor/mod.rs, crates/hero_proc_server/src/supervisor/executor.rs, crates/hero_proc_server/src/db/factory.rs, crates/hero_proc_server/src/db/jobs/model.rs

  • list_process_jobs_needing_restart (and underlying SQL helper) returns Retrying/Failed rows too.
  • autostart_process_jobs mutates the existing row in place: reset transient state, increment restart_count, drop the duplicate-check block.
  • Executor's retry/failure branch (for is_process): set Retrying, bump restart_count, reset transient fields.
    Dependencies: Steps 1, 3.

Step 5: Rotate the running log on restart

Files: crates/hero_proc_server/src/logging/store.rs, crates/hero_proc_server/src/supervisor/executor.rs

  • Implement rotate_job_logs(job_id, restart_count) — either delete or re-tag previous "running" entries.
  • Call it from the executor right before respawn.
    Dependencies: Step 4.
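
The spec leaves open whether rotation deletes or re-tags old entries. The sketch below shows the re-tag variant against a toy in-memory store, purely to illustrate the invariant that only the latest restart may have "running" entries. `LogEntry`, `LogStore` and the `restart` tag on each entry are hypothetical; the real store in logging/store.rs will look different.

```rust
/// One log line, tagged with the restart it was written under.
#[derive(Debug, Clone, PartialEq, Eq)]
struct LogEntry {
    job_id: u32,
    restart: u32,
    running: bool,
    line: String,
}

/// Toy in-memory stand-in for the log store.
#[derive(Debug, Default)]
struct LogStore {
    entries: Vec<LogEntry>,
}

impl LogStore {
    /// Append a line to the active ("running") block of a job.
    fn append(&mut self, job_id: u32, restart: u32, line: &str) {
        self.entries.push(LogEntry {
            job_id,
            restart,
            running: true,
            line: line.to_string(),
        });
    }

    /// Re-tag every running entry of `job_id` that belongs to an earlier
    /// restart than `restart_count`. Returns how many entries were rotated.
    /// Other jobs are never touched.
    fn rotate_job_logs(&mut self, job_id: u32, restart_count: u32) -> usize {
        let mut rotated = 0;
        for entry in self.entries.iter_mut() {
            if entry.job_id == job_id && entry.running && entry.restart < restart_count {
                entry.running = false;
                rotated += 1;
            }
        }
        rotated
    }
}
```

Calling the rotation with the already-incremented `restart_count` right before respawn means a second call with the same count is a no-op.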

Step 6: Plumb through OpenRPC & app types

Files: crates/hero_proc_server/openrpc.json, both checked-in client mirrors, crates/hero_proc_app/src/types.rs

  • Extend Job and ActionSpec schemas; regenerate the mirrors.
    Dependencies: Step 1.

Step 7: Tests

Files: crates/hero_proc_test/src/tests/functional/singleton.rs, crates/hero_proc_test/src/tests/functional/uc_31_34_action_cascade_process.rs

  • Singleton: previous instance holds a UDS, gets cleanly reclaimed before respawn.
  • Process restart: same job.id, restart_count goes 0 → 1 → 2, old running:true entries are gone.
    Dependencies: Steps 1–5.

Acceptance Criteria

  • ActionSpec carries sockets and tcp_ports; round-trips through SQLite and OpenRPC.
  • Job carries restart_count; additive migration on existing DBs.
  • Singleton with a live previous PID is stopped before the new instance starts.
  • Stale UDS files in ActionSpec.sockets are removed before every start; live owners reclaimed only when singleton || kill_other.
  • When a process action fails or disappears, the supervisor reuses the same Job.id and bumps restart_count.
  • The active "running" log block is rotated on each restart.
  • All existing hero_proc_test functional tests pass; the two new tests pass.

Notes

  • Scope is intentionally the "first reliable slice": models + cleanup + restart_count + log rotation. Deferred: exponential-backoff continuous loop, HealthCheck-driven respawn, UI surface for restart_count.
  • sockets / tcp_ports describe what an action binds; KillOther keeps describing what to reap from others. Existing service definitions keep working without setting the new fields.
  • Migration is additive only. restart_count defaults to 0 for historical rows.
  • macOS lsof UDS output differs from Linux — reuse the existing kill_other socket logic so platform behaviour does not drift.
  • The OpenRPC generated client files are checked in; regenerate via crates/hero_proc_sdk/build.rs or update in lockstep to keep admin UI / app consumers compiling.
Reference
lhumina_code/hero_proc#104