# migration_dry_run: failure modes + what the operator sees

`migration_dry_run` is the load-bearing pre-flight gate. It restores the latest prod backup into the scratch postgres, then runs the worktree's migrations on top. The point is to catch migrations that work on a fresh DB but break on real prod data, *before* the binary tries them in production.

Gate definition (`MNW/sando/daemon/src/gates.rs::migration_dry_run`):

1. Look up latest row in `backups` table (set by `POST /backup/fetch`).
2. `reset_scratch`: drop every non-system schema, recreate `public`.
3. `restore_dump`: `gunzip -c | psql `.
4. `run_migrator`: `sqlx::migrate::Migrator::new(/server/migrations).run()`.

All four steps' failure cases below.

## Operator surface

Each invocation writes a row to `gate_runs`:

| col         | meaning                                        |
|-------------|------------------------------------------------|
| version     | the version being built                        |
| tier        | always "mm"                                    |
| gate_kind   | "migration_dry_run"                            |
| passed      | 1 = green, 0 = red                             |
| detail      | error tail (up to ~4 KB); what to read first   |
| finished_at | wall-clock end                                 |

In the WS `/events` stream you also get `gate_start` and `gate_done` envelopes. The TUI surfaces these in the gate strip; on red, click into `detail`.

## Known failure modes

### 1. No backup fetched

`detail = "no backup fetched; call /backup/fetch first"`

Cause: `backups` table is empty (sando state DB is fresh, or the daily timer hasn't fired yet).

Recovery: `curl -X POST $SANDO_DAEMON/backup/fetch`.

### 2. scratch_db_url unset

`detail = "scratch_db_url unset in daemon config"`

Cause: `/etc/sando/sando-daemon.toml` is missing the `scratch_db_url` line.

Recovery: add it, restart sandod.

### 3. Scratch reset failed

`detail = "scratch reset: "`

Cause: postgres is down on the Sando host, or the role/db disappeared.

Recovery: `systemctl status postgresql`, recreate `sando` role + `sando_scratch` db if needed (see bootstrap-node.sh template).

### 4. Restore failed

`detail = "restore: "`

Cause: backup file is corrupt or truncated. Verify with `zcat /srv/sando/backups/latest.sql.gz | head`; the output should start with the `--` postgres dump preamble.

If corrupt: re-fetch from prod with `POST /backup/fetch`. If repeatable, prod's backup itself is bad.

### 5. Migration drift (the load-bearing case)

`detail = "migration was previously applied but is missing in the resolved migrations"`

**This is the gate doing its job.** The backup carries `_sqlx_migrations` rows for every migration prod has applied. If the worktree is missing one of those files, sqlx refuses to run, because applying the worktree against this DB would skip a step prod thinks is done.

**Real example (2026-05-31):** sha `eee96a7` was pushed to sandod before migrations 123-132 landed in main. `migration_dry_run` failed with "migration 123 was previously applied but is missing in the resolved migrations". Recovery: push the up-to-date `main`.

This is also the only signal you get for a forgotten migration file. If someone deletes `migrations/123_foo.sql` without truly rolling back the schema change in prod, this gate is what catches it.

### 6. Migration content changed

`detail = "migration was previously applied but has been modified"` (or similar; sqlx phrasing varies by version).

Cause: the worktree's `123_foo.sql` has a different checksum than the version prod recorded in `_sqlx_migrations`.

Don't fix by overwriting prod's checksum: that hides a real divergence. Investigate which version was "right" and add a follow-up migration that produces the intended state.

### 7. Migration content broken against prod data

`detail = "migration : "`

Cause: the new migration runs fine on an empty schema (which is what local-dev `cargo test` exercises) but fails on actual prod data. Examples: adding `NOT NULL` to a column with existing nulls; `DROP COLUMN` referenced by a view; `UNIQUE` constraint violated by existing rows.
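The "passes on an empty schema, fails on populated data" behaviour can be reproduced in miniature. The sketch below is only an illustration: it uses Python's built-in `sqlite3` as a stand-in for the scratch postgres, and the `users` table, the index name, and the `try_migration` helper are all hypothetical, not part of sandod or the real migration set. It covers the `UNIQUE` example only.

```python
import sqlite3

# Hypothetical migration body, standing in for a file under server/migrations/.
MIGRATION = "CREATE UNIQUE INDEX users_email_key ON users (email)"


def try_migration(existing_emails):
    """Apply MIGRATION to an in-memory DB seeded with the given rows.

    Returns None on success, or the database error text on failure.
    """
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT)")
        conn.executemany(
            "INSERT INTO users (email) VALUES (?)",
            [(email,) for email in existing_emails],
        )
        conn.commit()
        try:
            conn.execute(MIGRATION)
        except sqlite3.DatabaseError as exc:
            return str(exc)
        return None
    finally:
        conn.close()


# Empty schema, as in local dev: the migration applies cleanly.
fresh_result = try_migration([])

# Prod-like content with a duplicate row: the same migration is rejected.
prod_like_result = try_migration(["a@example.com", "a@example.com"])

print("fresh:", fresh_result)
print("prod-like:", prod_like_result)
```

The migration text is identical in both calls; only the pre-existing rows differ, which is the difference between a fresh local DB and a restored prod backup.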
**This is the 2026-05-22-incident class.** Without `migration_dry_run`, the binary deploys, starts up, runs the migration, partially-applies it, exits non-zero, systemd crash-loops, prod is down. With the gate, the failure happens on the Sando host's scratch DB and prod stays up.

Recovery: fix the migration. Common patterns:

- `NOT NULL` + default + then alter to non-default
- `DROP COLUMN` only after dropping dependents
- backfill via separate migration before the constraint

## Things this gate does NOT catch

- Migrations that succeed but break the *application*; those are only caught by `cargo_test` (red) or `boot_smoke` (binary fails to start).
- Migrations that are slow on prod-scale data (scratch DB has prod *content* but no prod *load*).
- Migrations that need privileged operations not granted to `sando` role (the scratch role isn't superuser by design; see Phase 0 decisions).

## Operator playbook (red gate)

1. Read `detail`. It's almost always the right answer in the first line.
2. If it's a drift case (#5), check `git log -- server/migrations/` for what prod has that the worktree doesn't.
3. If it's a content failure (#7), reproduce locally: `cargo sqlx migrate run --source server/migrations` against a freshly restored backup. Iterate on the migration file.
4. Push the fix. Sandod's mutex aborts the old build; the new sha runs the gate from scratch.
5. Never bypass: there's no `?force=true` and there shouldn't be. If you really need to ship around a known-bad migration, that's an explicit `--hotfix` promote, and you own the prod consequences.
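Step 1 of the playbook can be sketched as a small triage helper that maps the first line of `detail` to one of the numbered failure modes. The match strings are the ones quoted in this document; the function name, the table layout, and the fallback rule for mode 7 (any other line starting with `migration`) are assumptions for illustration, not sandod code. Because sqlx phrasing varies by version, an unrecognised mode-6 message would land in mode 7 here.

```python
from typing import Optional, Tuple

# Prefix matches for modes 1-4 (strings quoted in the failure-mode list).
PREFIX_MODES = [
    (1, "no backup fetched", "no backup fetched"),
    (2, "scratch_db_url unset", "scratch_db_url unset"),
    (3, "scratch reset failed", "scratch reset:"),
    (4, "restore failed", "restore:"),
]

# Substring matches for the sqlx-generated messages (modes 5 and 6).
SUBSTRING_MODES = [
    (5, "migration drift", "is missing in the resolved migrations"),
    (6, "migration content changed", "has been modified"),
]


def classify_detail(detail: str) -> Optional[Tuple[int, str]]:
    """Return (mode number, label) for a gate_runs.detail value, or None."""
    lines = detail.strip().splitlines()
    if not lines:
        return None
    first_line = lines[0].strip()

    for number, label, prefix in PREFIX_MODES:
        if first_line.startswith(prefix):
            return (number, label)
    for number, label, needle in SUBSTRING_MODES:
        if needle in first_line:
            return (number, label)
    # Assumption: any remaining "migration ..." line is a content failure (#7).
    if first_line.startswith("migration"):
        return (7, "migration broken against prod data")
    return None


drift = classify_detail(
    "migration 123 was previously applied but is missing in the resolved migrations"
)
print(drift)
```

A `None` result means the message matches none of the known modes and needs to be read by hand.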