upgrade.run: child status reconciliation drifts — children stay running/pending after their hero_proc job exits #28
Observation
From a real `upgrade.run(dry_run=false)` against 7 managed users on herodev (upgrade `upg_1779691156667_f04d9a5d`), the daemon-tracked child phases diverged from the actual hero_proc job phases:

- `upgrade.status` showed `partial` with `[failed=14 pending=3 running=4]` for 5+ minutes after the actual hero_proc jobs exited.
- A few children with a null `child_job_id` are flagged as `running/pending`: the orchestrator clearly enqueued something but didn't record the resulting job id.

Likely fix areas
- … `phase/exit_code` reliably. The daemon's startup log already mentions `upgrade.reconcile_interrupted failed at startup`, suggesting reconciliation isn't robust.
- `child_job_id` recording: when the orchestrator enqueues against the user's hero_proc, store the returned id atomically with the row. Right now a few rows are `running/pending` with `child_job_id: null`, meaning the enqueue presumably happened but the response wasn't persisted.
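To make the two fix areas concrete, here is a minimal sketch of what they could look like. It is not the daemon's actual code. The `children` table, its columns, the phase names `succeeded`/`failed`, and the `enqueue` and `lookup` callables (stand-ins for whatever client talks to hero_proc) are all assumptions made for illustration. The idea is that a row only moves to `running` in the same transaction that stores its job id, and that a reconciliation pass copies terminal `phase`/`exit_code` values back from the job.

```python
import sqlite3
from typing import Callable, Optional, Tuple

# Hypothetical terminal phase names; the real daemon may use different ones.
TERMINAL = {"succeeded", "failed"}


def init_db(conn: sqlite3.Connection) -> None:
    # Hypothetical schema for the daemon-tracked child rows.
    conn.execute(
        """CREATE TABLE IF NOT EXISTS children (
               id INTEGER PRIMARY KEY,
               user TEXT NOT NULL,
               phase TEXT NOT NULL DEFAULT 'pending',
               child_job_id TEXT,
               exit_code INTEGER)"""
    )


def enqueue_child(conn: sqlite3.Connection, child_id: int,
                  enqueue: Callable[[], str]) -> str:
    """Enqueue the job, then store phase and job id in a single UPDATE."""
    with conn:  # commits on success, rolls back if anything raises
        job_id = enqueue()  # assumed to return the hero_proc job id
        conn.execute(
            "UPDATE children SET phase = 'running', child_job_id = ? WHERE id = ?",
            (job_id, child_id),
        )
    return job_id


def reconcile(conn: sqlite3.Connection,
              lookup: Callable[[str], Optional[Tuple[str, Optional[int]]]]) -> int:
    """Sync non-terminal rows with their jobs; return the number of rows updated."""
    rows = conn.execute(
        "SELECT id, phase, child_job_id FROM children "
        "WHERE phase IN ('pending', 'running')"
    ).fetchall()
    updated = 0
    with conn:
        for child_id, phase, job_id in rows:
            if job_id is None:
                if phase != "running":
                    continue  # still pending and never enqueued: leave it alone
                # Assumption: a running row without a job id cannot be tracked.
                new_phase, exit_code = "failed", None
            else:
                status = lookup(job_id)  # (phase, exit_code), or None if unknown
                if status is None or status[0] not in TERMINAL:
                    continue  # job still in flight
                new_phase, exit_code = status
            conn.execute(
                "UPDATE children SET phase = ?, exit_code = ? WHERE id = ?",
                (new_phase, exit_code, child_id),
            )
            updated += 1
    return updated
```

In this sketch, if `enqueue` raises, the row stays `pending` with no job id, so it can be retried. Marking an untrackable `running` row as `failed` is one possible policy; re-enqueueing it is another, and which is right depends on whether the upgrade step is safe to repeat.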