max / makenotwork

5.9 KB · 135 lines History Blame Raw

1	# migration_dry_run: failure modes + what the operator sees
2
3	`migration_dry_run` is the load-bearing pre-flight gate. It restores the
4	latest prod backup into the scratch postgres, then runs the worktree's
5	migrations on top. The point is to catch migrations that work on a fresh DB
6	but break on real prod data, before the binary tries them in production.
7
8	Gate definition (`MNW/sando/daemon/src/gates.rs::migration_dry_run`):
9
10	1. Look up latest row in `backups` table (set by `POST /backup/fetch`).
11	2. `reset_scratch` — drop every non-system schema, recreate `public`.
12	3. `restore_dump` — `gunzip -c <backup> \| psql <scratch_url>`.
13	4. `run_migrator` — `sqlx::migrate::Migrator::new(<worktree>/server/migrations).run()`.
14
15	All four steps' failure cases below.
16
17	## Operator surface
18
19	Each invocation writes a row to `gate_runs`:
20
21	\| col \| meaning \|
22	\|--------------\|---------------------------------------------------------------\|
23	\| version \| the version being built \|
24	\| tier \| always "mm" \|
25	\| gate_kind \| "migration_dry_run" \|
26	\| passed \| 1 = green, 0 = red \|
27	\| detail \| error tail (up to ~4 KB) — what to read first \|
28	\| finished_at \| wall-clock end \|
29
30	In the WS `/events` stream you also get `gate_start` and `gate_done` envelopes.
31	The TUI surfaces these in the gate strip; on red, click into `detail`.
32
33	## Known failure modes
34
35	### 1. No backup fetched
36
37	`detail = "no backup fetched; call /backup/fetch first"`
38
39	Cause: `backups` table is empty (sando state DB is fresh, or the daily timer
40	hasn't fired yet). Recovery: `curl -X POST $SANDO_DAEMON/backup/fetch`.
41
42	### 2. scratch_db_url unset
43
44	`detail = "scratch_db_url unset in daemon config"`
45
46	Cause: `/etc/sando/sando-daemon.toml` is missing the `scratch_db_url` line.
47	Recovery: add it, restart sandod.
48
49	### 3. Scratch reset failed
50
51	`detail = "scratch reset: <pg-error>"`
52
53	Cause: postgres is down on the Sando host, or the role/db disappeared.
54	Recovery: `systemctl status postgresql`, recreate `sando` role + `sando_scratch`
55	db if needed (see bootstrap-node.sh template).
56
57	### 4. Restore failed
58
59	`detail = "restore: <gunzip\|psql error>"`
60
61	Cause: backup file is corrupt or truncated. Verify with
62	`zcat /srv/sando/backups/latest.sql.gz \| head` — should start with `--`
63	postgres dump preamble. If corrupt: re-fetch from prod with
64	`POST /backup/fetch`. If repeatable, prod's backup itself is bad.
65
66	### 5. Migration drift (the load-bearing case)
67
68	`detail = "migration <N> was previously applied but is missing in the resolved migrations"`
69
70	This is the gate doing its job. The backup carries
71	`_sqlx_migrations` rows for every migration prod has applied. If the worktree
72	is missing one of those files, sqlx refuses to run — because applying the
73	worktree against this DB would skip a step prod thinks is done.
74
75	Real example (2026-05-31): sha `eee96a7` was pushed to sandod before
76	migrations 123-132 landed in main. `migration_dry_run` failed with
77	"migration 123 was previously applied but is missing in the resolved
78	migrations". Recovery: push the up-to-date `main`.
79
80	This is also the only signal you get for a forgotten migration file. If
81	someone deletes `migrations/123_foo.sql` without truly rolling back the
82	schema change in prod, this gate is what catches it.
83
84	### 6. Migration content changed
85
86	`detail = "migration <N> was previously applied but has been modified"`
87	(or similar — sqlx phrasing varies by version).
88
89	Cause: the worktree's `123_foo.sql` has a different checksum than the version
90	prod recorded in `_sqlx_migrations`. Don't fix by overwriting prod's
91	checksum — that hides a real divergence. Investigate which version was
92	"right" and add a follow-up migration that produces the intended state.
93
94	### 7. Migration content broken against prod data
95
96	`detail = "migration <N>: <pg syntax/constraint/data error>"`
97
98	Cause: the new migration runs fine on an empty schema (which is what
99	local-dev `cargo test` exercises) but fails on actual prod data. Examples:
100	adding `NOT NULL` to a column with existing nulls; `DROP COLUMN` referenced
101	by a view; `UNIQUE` constraint violated by existing rows.
102
103	This is the 2026-05-22-incident class. Without `migration_dry_run`, the
104	binary deploys, starts up, runs the migration, partially-applies it, exits
105	non-zero, systemd crash-loops, prod is down. With the gate, the failure
106	happens on the Sando host's scratch DB and prod stays up.
107
108	Recovery: fix the migration. Common patterns:
109	- `NOT NULL` + default + then alter to non-default
110	- `DROP COLUMN` only after dropping dependents
111	- backfill via separate migration before the constraint
112
113	## Things this gate does NOT catch
114
115	- Migrations that succeed but break the application — only caught by
116	`cargo_test` (red) or `boot_smoke` (binary fails to start).
117	- Migrations that are slow on prod-scale data (scratch DB has prod content
118	but no prod load).
119	- Migrations that need privileged operations not granted to `sando` role
120	(the scratch role isn't superuser by design — see Phase 0 decisions).
121
122	## Operator playbook (red gate)
123
124	1. Read `detail`. It's almost always the right answer in the first line.
125	2. If it's a drift case (#5), check `git log -- server/migrations/` for what
126	prod has that the worktree doesn't.
127	3. If it's a content failure (#7), reproduce locally:
128	`cargo sqlx migrate run --source server/migrations` against a freshly
129	restored backup. Iterate on the migration file.
130	4. Push the fix. Sandod's mutex aborts the old build; the new sha runs the
131	gate from scratch.
132	5. Never bypass — there's no `?force=true` and there shouldn't be. If you
133	really need to ship around a known-bad migration, that's an explicit
134	`--hotfix` promote, and you own the prod consequences.
135