Skip to main content

max / makenotwork

1.1 KB · 20 lines History Blame Raw
1 -- Record when a tier is left in a partial / mixed-version state.
2 --
3 -- A canary promote that fails mid-rollout tries to roll the touched nodes
4 -- back to the prior version (CF3 canary rollback). That compensation is
5 -- best-effort: a node whose rollback also fails, or a first-deploy failure
6 -- with no prior version to restore, leaves the tier genuinely inconsistent —
7 -- some nodes on the new (broken) version, some on the old. The rollback
8 -- endpoint has the same exposure if a node deploy fails partway through the
9 -- fleet.
10 --
11 -- Before this column that condition was invisible: `tier_state` still showed
12 -- a single `current_version` (never advanced on failure), so `/state` and the
13 -- TUI reported a clean tier while the fleet was split-brain — the operator
14 -- only learned of it by reading `deploys` rows or SSHing the nodes.
15 --
16 -- `partial_reason` is NULL when the tier is consistent and a human-readable
17 -- explanation when it is not. It is set by the promote/rollback failure paths
18 -- and cleared by any subsequent clean full promote or rollback of the tier.
19 ALTER TABLE tier_state ADD COLUMN partial_reason TEXT;
20