# migration_dry_run: failure modes + what the operator sees

`migration_dry_run` is the load-bearing pre-flight gate. It restores the
latest prod backup into the scratch postgres, then runs the worktree's
migrations on top. The point is to catch migrations that work on a fresh DB
but break on real prod data, *before* the binary tries them in production.

Gate definition (`MNW/sando/daemon/src/gates.rs::migration_dry_run`):

1. Look up the latest row in the `backups` table (set by `POST /backup/fetch`).
2. `reset_scratch`: drop every non-system schema, recreate `public`.
3. `restore_dump`: `gunzip -c <backup> | psql <scratch_url>`.
4. `run_migrator`: `sqlx::migrate::Migrator::new(<worktree>/server/migrations).run()`.

The failure cases for all four steps are listed below.

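The four steps run strictly in order and the first failure decides what lands in `detail`. A minimal sketch of that short-circuit shape follows; it is not the code in `gates.rs`. The names `Step`, `GateOutcome`, `run_gate` and the `run_step` hook are hypothetical, and the empty `detail` on success is an assumption.

```rust
/// The four steps of the gate, in the order the list above gives them.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Step {
    LookupBackup,
    ResetScratch,
    RestoreDump,
    RunMigrator,
}

/// Outcome as it might land in `gate_runs`: `passed` plus a `detail` string.
#[derive(Debug, PartialEq)]
struct GateOutcome {
    passed: bool,
    detail: String,
}

/// Run the steps in order and stop at the first failure.
/// `run_step` is a hypothetical hook standing in for the real step bodies.
fn run_gate<F>(mut run_step: F) -> GateOutcome
where
    F: FnMut(Step) -> Result<(), String>,
{
    let steps = [
        Step::LookupBackup,
        Step::ResetScratch,
        Step::RestoreDump,
        Step::RunMigrator,
    ];
    for step in steps {
        if let Err(err) = run_step(step) {
            // Prefixes follow the `detail` strings in "Known failure modes";
            // lookup and migrator errors are passed through unprefixed here.
            let detail = match step {
                Step::ResetScratch => format!("scratch reset: {err}"),
                Step::RestoreDump => format!("restore: {err}"),
                Step::LookupBackup | Step::RunMigrator => err,
            };
            return GateOutcome { passed: false, detail };
        }
    }
    // Assumption: a green run carries an empty detail.
    GateOutcome { passed: true, detail: String::new() }
}
```

Because the loop returns on the first error, a restore failure means the migrator never ran, so a red `restore:` row says nothing about the migrations themselves.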
## Operator surface

Each invocation writes a row to `gate_runs`:

| column | meaning |
|---|---|
| version | the version being built |
| tier | always "mm" |
| gate_kind | "migration_dry_run" |
| passed | 1 = green, 0 = red |
| detail | error tail (up to ~4 KB); what to read first |
| finished_at | wall-clock end |

In the WS `/events` stream you also get `gate_start` and `gate_done` envelopes.
The TUI surfaces these in the gate strip; on red, click into `detail`.

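Since `detail` holds an error *tail*, the end of the output is what survives when the error is long. One way such a cap could work is sketched below, assuming "~4 KB" means 4096 bytes. `DETAIL_MAX_BYTES` and `error_tail` are hypothetical names, not taken from the daemon.

```rust
/// Upper bound on `detail`, assuming "~4 KB" means 4096 bytes.
const DETAIL_MAX_BYTES: usize = 4096;

/// Keep the last `max_bytes` of `output`, never splitting a UTF-8 character.
fn error_tail(output: &str, max_bytes: usize) -> &str {
    if output.len() <= max_bytes {
        return output;
    }
    let mut start = output.len() - max_bytes;
    // Move forward to the next character boundary so the slice stays valid UTF-8.
    while !output.is_char_boundary(start) {
        start += 1;
    }
    &output[start..]
}
```

Keeping the tail rather than the head fits tools like `psql`, whose final lines usually carry the error that stopped the run.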
## Known failure modes

### 1. No backup fetched

`detail = "no backup fetched; call /backup/fetch first"`

Cause: `backups` table is empty (sando state DB is fresh, or the daily timer
hasn't fired yet). Recovery: `curl -X POST $SANDO_DAEMON/backup/fetch`.

### 2. scratch_db_url unset

`detail = "scratch_db_url unset in daemon config"`

Cause: `/etc/sando/sando-daemon.toml` is missing the `scratch_db_url` line.
Recovery: add it, restart sandod.

### 3. Scratch reset failed

`detail = "scratch reset: <pg-error>"`

Cause: postgres is down on the Sando host, or the role/db disappeared.
Recovery: `systemctl status postgresql`, then recreate the `sando` role and the
`sando_scratch` db if needed (see the bootstrap-node.sh template).

### 4. Restore failed

`detail = "restore: <gunzip|psql error>"`

Cause: the backup file is corrupt or truncated. Verify with
`zcat /srv/sando/backups/latest.sql.gz | head`; the output should start with
the `--` comment lines of the postgres dump preamble. If it is corrupt, re-fetch
from prod with `POST /backup/fetch`. If the failure repeats after a re-fetch,
prod's backup itself is bad.

### 5. Migration drift (the load-bearing case)

`detail = "migration <N> was previously applied but is missing in the resolved migrations"`

**This is the gate doing its job.** The backup carries
`_sqlx_migrations` rows for every migration prod has applied. If the worktree
is missing one of those files, sqlx refuses to run, because applying the
worktree against this DB would skip a step prod thinks is done.

**Real example (2026-05-31):** sha `eee96a7` was pushed to sandod before
migrations 123-132 landed in main. `migration_dry_run` failed with
"migration 123 was previously applied but is missing in the resolved
migrations". Recovery: push the up-to-date `main`.

This is also the only signal you get for a forgotten migration file. If
someone deletes `migrations/123_foo.sql` without truly rolling back the
schema change in prod, this gate is what catches it.

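The drift check is a set comparison: every version recorded in `_sqlx_migrations` must still have a file in the worktree. The sketch below illustrates that logic only; it is not sqlx's own code. `first_missing` is a hypothetical helper, and returning the lowest missing version is an assumption chosen to match the "migration 123" message in the example above.

```rust
use std::collections::BTreeSet;

/// Return the lowest migration version that the restored DB has applied
/// but the worktree no longer resolves, if any.
fn first_missing(applied: &[i64], resolved: &[i64]) -> Option<i64> {
    let resolved: BTreeSet<i64> = resolved.iter().copied().collect();
    applied
        .iter()
        .copied()
        .filter(|v| !resolved.contains(v))
        .min()
}
```

With prod at 1..=132 and a worktree that only has 1..=122, this returns `Some(123)`. A worktree that is *ahead* of prod returns `None`: extra files are new migrations waiting to run, not drift.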
### 6. Migration content changed

`detail = "migration <N> was previously applied but has been modified"`
(or similar; sqlx phrasing varies by version).

Cause: the worktree's `123_foo.sql` has a different checksum than the version
prod recorded in `_sqlx_migrations`. Don't fix this by overwriting prod's
checksum; that hides a real divergence. Investigate which version was
"right" and add a follow-up migration that produces the intended state.

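A small sketch of the comparison behind this failure: a checksum recorded at apply time is compared with one computed from the file now in the worktree. The hash below is a stand-in from Rust's standard library and is explicitly *not* the algorithm sqlx uses. `checksum` and `is_modified` are hypothetical names.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Stand-in checksum over the migration file's text. NOT sqlx's algorithm;
/// it only needs to change whenever the content changes.
fn checksum(sql: &str) -> u64 {
    let mut hasher = DefaultHasher::new();
    sql.hash(&mut hasher);
    hasher.finish()
}

/// Compare the checksum recorded at apply time with the worktree file's.
fn is_modified(recorded: u64, worktree_sql: &str) -> bool {
    recorded != checksum(worktree_sql)
}
```

In this sketch any edit to the file, including whitespace, changes the checksum. That is why editing an already-applied migration in place trips the check even when the resulting schema would be identical.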
### 7. Migration content broken against prod data

`detail = "migration <N>: <pg syntax/constraint/data error>"`

Cause: the new migration runs fine on an empty schema (which is what
local-dev `cargo test` exercises) but fails on actual prod data. Examples:
adding `NOT NULL` to a column with existing nulls; `DROP COLUMN` on a column
referenced by a view; a `UNIQUE` constraint violated by existing rows.

**This is the 2026-05-22-incident class.** Without `migration_dry_run`, the
binary deploys, starts up, runs the migration, partially applies it, exits
non-zero, systemd crash-loops, and prod is down. With the gate, the failure
happens on the Sando host's scratch DB and prod stays up.

Recovery: fix the migration. Common patterns:
- `NOT NULL`: add it together with a default, then alter the default away afterwards
- `DROP COLUMN` only after dropping dependents
- backfill via a separate migration before adding the constraint

## Things this gate does NOT catch

- Migrations that succeed but break the *application*: only caught by
`cargo_test` (red) or `boot_smoke` (binary fails to start).
- Migrations that are slow on prod-scale data (scratch DB has prod *content*
but no prod *load*).
- Migrations that need privileged operations not granted to the `sando` role
(the scratch role isn't superuser by design; see Phase 0 decisions).

## Operator playbook (red gate)

1. Read `detail`. The first line almost always tells you what went wrong.
2. If it's a drift case (#5), check `git log -- server/migrations/` for what
prod has that the worktree doesn't.
3. If it's a content failure (#7), reproduce locally:
`cargo sqlx migrate run --source server/migrations` against a freshly
restored backup. Iterate on the migration file.
4. Push the fix. Sandod's mutex aborts the old build; the new sha runs the
gate from scratch.
5. Never bypass: there's no `?force=true` and there shouldn't be. If you
really need to ship around a known-bad migration, that's an explicit
`--hotfix` promote, and you own the prod consequences.

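Step 1 of the playbook can be sketched as a first-line classifier that maps `detail` to the failure-mode numbers used in this document. This is illustrative only: `FailureMode` and `classify` are hypothetical names. The substring checks for #5 and #6 assume the sqlx phrasing quoted above, which varies by version.

```rust
/// Failure modes, numbered as in "Known failure modes".
#[derive(Debug, PartialEq)]
enum FailureMode {
    NoBackup,         // #1
    ScratchUrlUnset,  // #2
    ScratchReset,     // #3
    Restore,          // #4
    Drift,            // #5
    ContentChanged,   // #6
    BrokenOnProdData, // #7
    Unknown,
}

/// Classify a red gate by the first line of its `detail`.
fn classify(detail: &str) -> FailureMode {
    let first = detail.lines().next().unwrap_or("");
    if first.starts_with("no backup fetched") {
        FailureMode::NoBackup
    } else if first.starts_with("scratch_db_url unset") {
        FailureMode::ScratchUrlUnset
    } else if first.starts_with("scratch reset:") {
        FailureMode::ScratchReset
    } else if first.starts_with("restore:") {
        FailureMode::Restore
    } else if first.starts_with("migration ") {
        // The three migrator failures share a prefix; tell them apart by message.
        if first.contains("is missing in the resolved migrations") {
            FailureMode::Drift
        } else if first.contains("has been modified") {
            FailureMode::ContentChanged
        } else {
            FailureMode::BrokenOnProdData
        }
    } else {
        FailureMode::Unknown
    }
}
```

Anything that lands in `Unknown` is a signal to read the full `detail` by hand rather than to guess at a recovery.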