Skip to main content

max / makenotwork

5.9 KB · 135 lines History Blame Raw
1 # migration_dry_run: failure modes + what the operator sees
2
3 `migration_dry_run` is the load-bearing pre-flight gate. It restores the
4 latest prod backup into the scratch postgres, then runs the worktree's
5 migrations on top. The point is to catch migrations that work on a fresh DB
6 but break on real prod data, *before* the binary tries them in production.
7
8 Gate definition (`MNW/sando/daemon/src/gates.rs::migration_dry_run`):
9
10 1. Look up latest row in `backups` table (set by `POST /backup/fetch`).
11 2. `reset_scratch` — drop every non-system schema, recreate `public`.
12 3. `restore_dump``gunzip -c <backup> | psql <scratch_url>`.
13 4. `run_migrator``sqlx::migrate::Migrator::new(<worktree>/server/migrations).run()`.
14
15 All four steps' failure cases below.
16
17 ## Operator surface
18
19 Each invocation writes a row to `gate_runs`:
20
21 | col | meaning |
22 |--------------|---------------------------------------------------------------|
23 | version | the version being built |
24 | tier | always "mm" |
25 | gate_kind | "migration_dry_run" |
26 | passed | 1 = green, 0 = red |
27 | detail | error tail (up to ~4 KB) — what to read first |
28 | finished_at | wall-clock end |
29
30 In the WS `/events` stream you also get `gate_start` and `gate_done` envelopes.
31 The TUI surfaces these in the gate strip; on red, click into `detail`.
32
33 ## Known failure modes
34
35 ### 1. No backup fetched
36
37 `detail = "no backup fetched; call /backup/fetch first"`
38
39 Cause: `backups` table is empty (sando state DB is fresh, or the daily timer
40 hasn't fired yet). Recovery: `curl -X POST $SANDO_DAEMON/backup/fetch`.
41
42 ### 2. scratch_db_url unset
43
44 `detail = "scratch_db_url unset in daemon config"`
45
46 Cause: `/etc/sando/sando-daemon.toml` is missing the `scratch_db_url` line.
47 Recovery: add it, restart sandod.
48
49 ### 3. Scratch reset failed
50
51 `detail = "scratch reset: <pg-error>"`
52
53 Cause: postgres is down on the Sando host, or the role/db disappeared.
54 Recovery: `systemctl status postgresql`, recreate `sando` role + `sando_scratch`
55 db if needed (see bootstrap-node.sh template).
56
57 ### 4. Restore failed
58
59 `detail = "restore: <gunzip|psql error>"`
60
61 Cause: backup file is corrupt or truncated. Verify with
62 `zcat /srv/sando/backups/latest.sql.gz | head` — should start with `--`
63 postgres dump preamble. If corrupt: re-fetch from prod with
64 `POST /backup/fetch`. If repeatable, prod's backup itself is bad.
65
66 ### 5. Migration drift (the load-bearing case)
67
68 `detail = "migration <N> was previously applied but is missing in the resolved migrations"`
69
70 **This is the gate doing its job.** The backup carries
71 `_sqlx_migrations` rows for every migration prod has applied. If the worktree
72 is missing one of those files, sqlx refuses to run — because applying the
73 worktree against this DB would skip a step prod thinks is done.
74
75 **Real example (2026-05-31):** sha `eee96a7` was pushed to sandod before
76 migrations 123-132 landed in main. `migration_dry_run` failed with
77 "migration 123 was previously applied but is missing in the resolved
78 migrations". Recovery: push the up-to-date `main`.
79
80 This is also the only signal you get for a forgotten migration file. If
81 someone deletes `migrations/123_foo.sql` without truly rolling back the
82 schema change in prod, this gate is what catches it.
83
84 ### 6. Migration content changed
85
86 `detail = "migration <N> was previously applied but has been modified"`
87 (or similar — sqlx phrasing varies by version).
88
89 Cause: the worktree's `123_foo.sql` has a different checksum than the version
90 prod recorded in `_sqlx_migrations`. Don't fix by overwriting prod's
91 checksum — that hides a real divergence. Investigate which version was
92 "right" and add a follow-up migration that produces the intended state.
93
94 ### 7. Migration content broken against prod data
95
96 `detail = "migration <N>: <pg syntax/constraint/data error>"`
97
98 Cause: the new migration runs fine on an empty schema (which is what
99 local-dev `cargo test` exercises) but fails on actual prod data. Examples:
100 adding `NOT NULL` to a column with existing nulls; `DROP COLUMN` referenced
101 by a view; `UNIQUE` constraint violated by existing rows.
102
103 **This is the 2026-05-22-incident class.** Without `migration_dry_run`, the
104 binary deploys, starts up, runs the migration, partially-applies it, exits
105 non-zero, systemd crash-loops, prod is down. With the gate, the failure
106 happens on the Sando host's scratch DB and prod stays up.
107
108 Recovery: fix the migration. Common patterns:
109 - `NOT NULL` + default + then alter to non-default
110 - `DROP COLUMN` only after dropping dependents
111 - backfill via separate migration before the constraint
112
113 ## Things this gate does NOT catch
114
115 - Migrations that succeed but break the *application* — only caught by
116 `cargo_test` (red) or `boot_smoke` (binary fails to start).
117 - Migrations that are slow on prod-scale data (scratch DB has prod *content*
118 but no prod *load*).
119 - Migrations that need privileged operations not granted to `sando` role
120 (the scratch role isn't superuser by design — see Phase 0 decisions).
121
122 ## Operator playbook (red gate)
123
124 1. Read `detail`. It's almost always the right answer in the first line.
125 2. If it's a drift case (#5), check `git log -- server/migrations/` for what
126 prod has that the worktree doesn't.
127 3. If it's a content failure (#7), reproduce locally:
128 `cargo sqlx migrate run --source server/migrations` against a freshly
129 restored backup. Iterate on the migration file.
130 4. Push the fix. Sandod's mutex aborts the old build; the new sha runs the
131 gate from scratch.
132 5. Never bypass — there's no `?force=true` and there shouldn't be. If you
133 really need to ship around a known-bad migration, that's an explicit
134 `--hotfix` promote, and you own the prod consequences.
135