upgrade: cell grace period + poll error budget + resilient reconcile #30

Merged
zaelgohary merged 2 commits from development_fix_28_polling_grace_reconcile into development 2026-05-25 10:35:28 +00:00

Summary

Rollout cells can no longer pin a rollout open indefinitely on a wedged child or transient hero_proc unreachability. Daemon startup also survives a single corrupt upgrade record instead of bailing on the whole reconcile pass.
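A minimal sketch of how the two bounds could keep a cell from polling forever. The constant names and values come from this PR; everything else (`Poll`, `CellOutcome`, `drive_cell`, the `"failed"` phase string, and the simulated clock) is hypothetical and stands in for the real polling code around `jobs::get_at`.

```rust
use std::time::Duration;

// Values taken from this PR description.
const CELL_GRACE_PERIOD: Duration = Duration::from_secs(600);
const MAX_CONSECUTIVE_POLL_ERRORS: u32 = 30;

/// Hypothetical result of one poll; stands in for what jobs::get_at returns.
#[derive(Debug, Clone, PartialEq)]
enum Poll {
    Phase(String), // the response carried a phase
    MissingPhase,  // the response had no phase field
    Error,         // the call itself failed
}

/// Hypothetical terminal outcome for a cell.
#[derive(Debug, PartialEq)]
enum CellOutcome {
    Succeeded,
    Failed(String),
}

/// Polls until the cell reaches a terminal outcome.
/// Elapsed time is simulated by adding `poll_interval` after each poll,
/// so the sketch runs without sleeping.
fn drive_cell(mut poll: impl FnMut() -> Poll, poll_interval: Duration) -> CellOutcome {
    let mut elapsed = Duration::ZERO;
    let mut consecutive_errors = 0u32;
    loop {
        // Bound 1: a cell that outlives the grace period is failed.
        if elapsed > CELL_GRACE_PERIOD {
            return CellOutcome::Failed("grace period exceeded".to_string());
        }
        match poll() {
            Poll::Phase(phase) => {
                // Any readable phase resets the error streak.
                consecutive_errors = 0;
                match phase.as_str() {
                    "succeeded" => return CellOutcome::Succeeded,
                    // Assumed name for the failing terminal phase.
                    "failed" => return CellOutcome::Failed("child reported failed".to_string()),
                    _ => {} // non-terminal phase: keep polling
                }
            }
            // Bound 2: errors and missing phases spend the budget
            // instead of being treated as "running".
            Poll::MissingPhase | Poll::Error => {
                consecutive_errors += 1;
                if consecutive_errors >= MAX_CONSECUTIVE_POLL_ERRORS {
                    return CellOutcome::Failed("poll error budget exhausted".to_string());
                }
            }
        }
        elapsed += poll_interval;
    }
}
```

In this sketch a child that reports a non-terminal phase forever is stopped by the grace period, while an unreachable backend is stopped by the error budget; whether the real code resets the error count on a successful poll is an assumption.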

Related Issue

Partially addresses #28 (polling drift + startup reconciliation robustness + missing grace-period auto-fail).

Changes

  • Added a 600s CELL_GRACE_PERIOD; a cell is marked failed if it exceeds the grace period without reaching a terminal phase.
  • Added a MAX_CONSECUTIVE_POLL_ERRORS budget of 30; a failed jobs::get_at call or a response with no phase no longer loops forever (the previous unwrap_or("running") could silently wedge a cell).
  • Split per-record handling out of reconcile_interrupted into reconcile_one, so one bad record is logged and skipped instead of aborting startup for every other in-flight upgrade.
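
The log-and-skip behavior from the last bullet can be sketched as follows. Only the function names `reconcile_interrupted` and `reconcile_one` come from this PR; the `StoredRecord` type, the status strings, the signatures, and the return value are hypothetical and chosen to keep the example self-contained.

```rust
/// Hypothetical stand-in for a persisted upgrade record.
struct StoredRecord {
    id: String,
    raw: String, // stored status text, possibly corrupt
}

/// Hypothetical per-record step: fails on anything it cannot interpret.
fn reconcile_one(record: &StoredRecord) -> Result<(), String> {
    let status = record.raw.trim();
    if matches!(status, "running" | "succeeded" | "failed") {
        Ok(())
    } else {
        Err(format!("unrecognized status {status:?}"))
    }
}

/// Reconciles every record and returns (reconciled ids, skipped ids).
/// An error from one record is logged and does not stop the loop.
fn reconcile_interrupted(records: &[StoredRecord]) -> (Vec<String>, Vec<String>) {
    let mut reconciled = Vec::new();
    let mut skipped = Vec::new();
    for record in records {
        match reconcile_one(record) {
            Ok(()) => reconciled.push(record.id.clone()),
            Err(err) => {
                // Log and move on; the other in-flight upgrades still get reconciled.
                eprintln!("skipping upgrade {}: {err}", record.id);
                skipped.push(record.id.clone());
            }
        }
    }
    (reconciled, skipped)
}
```

The key design point is that the error is handled inside the loop body rather than propagated with `?` from the loop, which is what would abort the whole pass on the first bad record.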

Test Results

Full rollout upg_1779696877140_81382f54 reached status=succeeded with 91/91 children in succeeded phase in ~47s on the new binary.

zaelgohary merged commit 4fa6e27ce2 into development 2026-05-25 10:35:28 +00:00
zaelgohary deleted branch development_fix_28_polling_grace_reconcile 2026-05-25 10:35:28 +00:00
Reference
lhumina_code/hero_codescalers!30