# migration_dry_run: failure modes + what the operator sees

`migration_dry_run` is the load-bearing pre-flight gate. It restores the
latest prod backup into the scratch postgres, then runs the worktree's
migrations on top. The point is to catch migrations that work on a fresh DB
but break on real prod data, *before* the binary tries them in production.

Gate definition (`MNW/sando/daemon/src/gates.rs::migration_dry_run`):

1. Look up latest row in `backups` table (set by `POST /backup/fetch`).
2. `reset_scratch` — drop every non-system schema, recreate `public`.
3. `restore_dump` — `gunzip -c <backup> | psql <scratch_url>`.
4. `run_migrator` — `sqlx::migrate::Migrator::new(<worktree>/server/migrations).run()`.
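
The ordering above matters because the gate stops at the first failing step, so a bad restore never reaches the migrator. A minimal sketch of that control flow, assuming each step can be modelled as a label plus a fallible function; `GateOutcome`, `Step`, and `run_steps` are hypothetical names for illustration, not the real types in `gates.rs`:

```rust
// Sketch only: these names are illustrative and do not come from gates.rs.
#[derive(Debug, PartialEq)]
pub struct GateOutcome {
    pub passed: bool,
    pub detail: String,
}

/// A step is a label plus a function that succeeds or returns an error string.
pub type Step<'a> = (&'a str, fn() -> Result<(), String>);

/// Runs the steps in order and stops at the first failure.
/// The failing step's label is prefixed onto the error to form `detail`.
pub fn run_steps(steps: &[Step<'_>]) -> GateOutcome {
    for (label, step) in steps {
        if let Err(err) = step() {
            return GateOutcome {
                passed: false,
                detail: format!("{label}: {err}"),
            };
        }
    }
    GateOutcome {
        passed: true,
        detail: String::new(),
    }
}
```

The `label: error` shape mirrors the `scratch reset: ...` and `restore: ...` details listed under the failure modes; whether the daemon builds them this way is an assumption.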

The failure cases for all four steps are listed below.

## Operator surface

Each invocation writes a row to `gate_runs`:

| col          | meaning                                                       |
|--------------|---------------------------------------------------------------|
| version      | the version being built                                       |
| tier         | always "mm"                                                   |
| gate_kind    | "migration_dry_run"                                           |
| passed       | 1 = green, 0 = red                                            |
| detail       | error tail (up to ~4 KB) — what to read first                 |
| finished_at  | wall-clock end                                                |

In the WS `/events` stream you also get `gate_start` and `gate_done` envelopes.
The TUI surfaces these in the gate strip; on red, click into `detail`.

## Known failure modes

### 1. No backup fetched

`detail = "no backup fetched; call /backup/fetch first"`

Cause: `backups` table is empty (sando state DB is fresh, or the daily timer
hasn't fired yet). Recovery: `curl -X POST $SANDO_DAEMON/backup/fetch`.

### 2. scratch_db_url unset

`detail = "scratch_db_url unset in daemon config"`

Cause: `/etc/sando/sando-daemon.toml` is missing the `scratch_db_url` line.
Recovery: add it, restart sandod.

### 3. Scratch reset failed

`detail = "scratch reset: <pg-error>"`

Cause: postgres is down on the Sando host, or the role/db disappeared.
Recovery: `systemctl status postgresql`, recreate `sando` role + `sando_scratch`
db if needed (see bootstrap-node.sh template).

### 4. Restore failed

`detail = "restore: <gunzip|psql error>"`

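Cause: backup file is corrupt or truncated. Verify with
`zcat /srv/sando/backups/latest.sql.gz | head`; the output should start with
the `--` comment lines of a postgres dump preamble. `head` only inspects the
start of the file, so also run `gzip -t` on it to catch truncation. If corrupt:
re-fetch from prod with `POST /backup/fetch`. If the re-fetched file fails the
same way, prod's backup itself is bad.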

### 5. Migration drift (the load-bearing case)

`detail = "migration <N> was previously applied but is missing in the resolved migrations"`

**This is the gate doing its job.** The backup carries
`_sqlx_migrations` rows for every migration prod has applied. If the worktree
is missing one of those files, sqlx refuses to run — because applying the
worktree against this DB would skip a step prod thinks is done.

**Real example (2026-05-31):** sha `eee96a7` was pushed to sandod before
migrations 123-132 landed in main. `migration_dry_run` failed with
"migration 123 was previously applied but is missing in the resolved
migrations". Recovery: push the up-to-date `main`.

This is also the only signal you get for a forgotten migration file. If
someone deletes `migrations/123_foo.sql` without truly rolling back the
schema change in prod, this gate is what catches it.
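
To see which versions are affected before pushing again, the drift check can be approximated offline. This is a rough sketch, not sqlx's own implementation: it assumes you have already pulled the applied versions out of the restored DB (the `version` column of `_sqlx_migrations`) and listed the file names under `server/migrations`. `version_of` and `missing_in_worktree` are hypothetical helper names.

```rust
use std::collections::BTreeSet;

/// Parses the leading integer version from a migration file name such as
/// `123_foo.sql`. Returns None for names that do not fit that shape.
pub fn version_of(file_name: &str) -> Option<i64> {
    let (prefix, rest) = file_name.split_once('_')?;
    if !rest.ends_with(".sql") {
        return None;
    }
    prefix.parse().ok()
}

/// Returns the versions recorded as applied that have no matching file in
/// the worktree, in the order they appear in `applied`.
pub fn missing_in_worktree(applied: &[i64], files: &[&str]) -> Vec<i64> {
    let present: BTreeSet<i64> = files.iter().filter_map(|f| version_of(f)).collect();
    applied
        .iter()
        .copied()
        .filter(|v| !present.contains(v))
        .collect()
}
```

A non-empty result corresponds to the drift case: every version it returns is one prod has applied and the worktree lacks.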

### 6. Migration content changed

`detail = "migration <N> was previously applied but has been modified"`
(or similar — sqlx phrasing varies by version).

Cause: the worktree's `123_foo.sql` has a different checksum than the version
prod recorded in `_sqlx_migrations`. Don't fix by overwriting prod's
checksum — that hides a real divergence. Investigate which version was
"right" and add a follow-up migration that produces the intended state.

### 7. Migration content broken against prod data

`detail = "migration <N>: <pg syntax/constraint/data error>"`

Cause: the new migration runs fine on an empty schema (which is what
local-dev `cargo test` exercises) but fails on actual prod data. Examples:
adding `NOT NULL` to a column with existing nulls; `DROP COLUMN` referenced
by a view; `UNIQUE` constraint violated by existing rows.

**This is the 2026-05-22-incident class.** Without `migration_dry_run`, the
binary deploys, starts up, runs the migration, partially-applies it, exits
non-zero, systemd crash-loops, prod is down. With the gate, the failure
happens on the Sando host's scratch DB and prod stays up.

Recovery: fix the migration. Common patterns:
- `NOT NULL`: add the column with a default first, then drop the default in a later `ALTER`
- `DROP COLUMN` only after dropping dependents
- backfill via separate migration before the constraint
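
The last pattern (backfill first, constrain second) comes down to two statements in two separate migration files. The sketch below only renders the SQL text so the order is explicit. `backfill_then_constrain` is a hypothetical helper, the table and column names in the usage are made up, and the arguments are interpolated verbatim, so it is only suitable for hand-written identifiers.

```rust
/// Returns the SQL for two consecutive migrations: first fill existing NULLs,
/// then add the NOT NULL constraint. Inputs are inserted as-is (no quoting).
pub fn backfill_then_constrain(table: &str, column: &str, fill: &str) -> [String; 2] {
    [
        // Migration N: make sure no row violates the upcoming constraint.
        format!("UPDATE {table} SET {column} = {fill} WHERE {column} IS NULL;"),
        // Migration N+1: the constraint now holds for every existing row.
        format!("ALTER TABLE {table} ALTER COLUMN {column} SET NOT NULL;"),
    ]
}
```

For example, `backfill_then_constrain("users", "email", "''")` yields an `UPDATE users SET email = '' ...` statement followed by the `ALTER TABLE users ... SET NOT NULL` statement.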

## Things this gate does NOT catch

- Migrations that succeed but break the *application* — only caught by
  `cargo_test` (red) or `boot_smoke` (binary fails to start).
- Migrations that are slow on prod-scale data (scratch DB has prod *content*
  but no prod *load*).
- Migrations that need privileged operations not granted to `sando` role
  (the scratch role isn't superuser by design — see Phase 0 decisions).

## Operator playbook (red gate)

1. Read `detail`. The first line almost always names the cause.
2. If it's a drift case (#5), check `git log -- server/migrations/` for what
   prod has that the worktree doesn't.
3. If it's a content failure (#7), reproduce locally:
   `cargo sqlx migrate run --source server/migrations` against a freshly
   restored backup. Iterate on the migration file.
4. Push the fix. Sandod's mutex aborts the old build; the new sha runs the
   gate from scratch.
5. Never bypass — there's no `?force=true` and there shouldn't be. If you
   really need to ship around a known-bad migration, that's an explicit
   `--hotfix` promote, and you own the prod consequences.
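
Step 1 of the playbook can be partly mechanised by matching the first line of `detail` against the messages listed under "Known failure modes". A minimal sketch, assuming the strings appear exactly as documented there; `FailureMode` and `classify` are hypothetical names, and the #6 wording may differ between sqlx versions.

```rust
#[derive(Debug, PartialEq)]
pub enum FailureMode {
    NoBackup,        // #1
    ScratchUrlUnset, // #2
    ScratchReset,    // #3
    Restore,         // #4
    Drift,           // #5
    ContentChanged,  // #6
    MigrationError,  // #7
    Unknown,
}

/// Classifies a `detail` string by its first line only.
pub fn classify(detail: &str) -> FailureMode {
    let first = detail.lines().next().unwrap_or("").trim();
    if first.starts_with("no backup fetched") {
        FailureMode::NoBackup
    } else if first.starts_with("scratch_db_url unset") {
        FailureMode::ScratchUrlUnset
    } else if first.starts_with("scratch reset:") {
        FailureMode::ScratchReset
    } else if first.starts_with("restore:") {
        FailureMode::Restore
    } else if first.contains("was previously applied but is missing") {
        FailureMode::Drift
    } else if first.contains("was previously applied but has been modified") {
        FailureMode::ContentChanged
    } else if first.starts_with("migration ") {
        // Checked last so the two "previously applied" cases win.
        FailureMode::MigrationError
    } else {
        FailureMode::Unknown
    }
}
```

Anything that lands in `Unknown` still needs a human to read the full `detail`.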