upgrade.run: child status reconciliation drifts — children stay running/pending after their hero_proc job exits #28

Closed

opened 2026-05-25 06:49:08 +00:00 by zaelgohary · 0 comments

Observation

From a real upgrade.run(dry_run=false) against 7 managed users on herodev (upgrade upg_1779691156667_f04d9a5d), the daemon-tracked child phases diverged from the actual hero_proc job phases:

  rawan      service_mycelium          daemon=running actual=failed exit=1 err=exited with code 1
  mik        service_proc              daemon=running actual=failed exit=1 err=exited with code 1
  zainab     service_mycelium          daemon=running (no child_job_id recorded)
  mahmoud    service_router            daemon=running (no child_job_id recorded)
  timur      service_proc              daemon=pending (no child_job_id recorded)
  nabil      service_proc              daemon=pending (no child_job_id recorded)
  nabil      service_mycelium          daemon=pending (no child_job_id recorded)

- upgrade.status showed partial with [failed=14 pending=3 running=4] for 5+ minutes after the actual hero_proc jobs exited.
- The daemon never advanced these children to a terminal phase.
- Even children without a child_job_id are flagged as running/pending: the orchestrator clearly enqueued something but didn't record the resulting job id.

Likely fix areas

1. Child polling loop in the orchestrator: confirm it polls each child's target hero_proc and reads phase/exit_code reliably. The daemon's startup log already mentions upgrade.reconcile_interrupted failed at startup, suggesting reconciliation isn't robust.
2. child_job_id recording: when the orchestrator enqueues against the user's hero_proc, store the returned id atomically with the row. Right now a few rows are running/pending with child_job_id: null, meaning the enqueue presumably happened but the response wasn't persisted.
3. Timeout: children should auto-fail after some grace period if their tracked job_id can't be reconciled.
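
A minimal sketch of how the three fix areas could fit together, not the orchestrator's actual code. Everything named here is hypothetical: the ChildRow and JobStatus shapes, record_enqueue, reconcile_child, the fetch_job callback standing in for a lookup against the user's hero_proc, the "succeeded" phase name, and the 120-second grace period.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# "failed" appears in the report above; "succeeded" is an assumed name.
TERMINAL_PHASES = {"succeeded", "failed"}


@dataclass
class JobStatus:
    """What a hero_proc job lookup is assumed to return."""
    phase: str
    exit_code: Optional[int] = None


@dataclass
class ChildRow:
    """One daemon-tracked child of an upgrade (hypothetical schema)."""
    user: str
    service: str
    phase: str = "pending"
    child_job_id: Optional[str] = None
    last_seen_at: float = 0.0  # when the row was created or last reconciled
    error: Optional[str] = None


def record_enqueue(child: ChildRow, job_id: str, now: float) -> None:
    """Set the job id and the phase change in one step.

    In a real store both fields would go into a single write or
    transaction, so a row is never 'running' with child_job_id unset.
    """
    child.child_job_id = job_id
    child.phase = "running"
    child.last_seen_at = now


def reconcile_child(
    child: ChildRow,
    fetch_job: Callable[[str], Optional[JobStatus]],
    now: float,
    grace_seconds: float = 120.0,
) -> None:
    """Advance one child row from the job's actual status, or time it out."""
    if child.phase in TERMINAL_PHASES:
        return  # nothing left to reconcile

    status = fetch_job(child.child_job_id) if child.child_job_id else None
    if status is None:
        # No job id recorded, or the job could not be looked up:
        # auto-fail once the grace period has elapsed.
        if now - child.last_seen_at > grace_seconds:
            child.phase = "failed"
            child.error = "could not reconcile child job within grace period"
        return

    # The job was found: mirror its phase and refresh the timestamp.
    child.last_seen_at = now
    child.phase = status.phase
    if status.phase == "failed" and status.exit_code is not None:
        child.error = f"exited with code {status.exit_code}"
```

Called once per child on every polling tick, this would move a row like rawan/service_mycelium to failed as soon as its job reports exit code 1, and would fail rows with no child_job_id once the grace period passes instead of leaving them pending indefinitely.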
