8f502d8 sando: add todo.md with CI/CD migration roadmap - makenotwork - Git

# Sando TODO

Open work only. Completed items move to `todo_done.md` (sibling file) when one exists. Design notes go in `plans/<name>.md`, not folded into checkboxes.

Format rule: every actionable line is a `- [ ]` checkbox. Headings group phases and themes; do not put status updates in them.

Roadmap target: replace `server/deploy/deploy.sh` and astra-hosted `server/deploy/run-ci.sh` with Sando running on the MakeMachine, gating Hetzner prod through testnot.work.

Phases are ordered for execution. Phase 0 must finish before Phase 1 is meaningful. Phases 5+ are post-cutover hardening.

## Key Paths

Read these to orient before working on Sando:

- `README.md` — quickstart, API surface, v0 limitations
- `sando.toml` — current topology (MM → A → B; C declared, not provisioned)
- `daemon/src/main.rs` — startup sequence (config → topology → migrate → sync → bare-repo bootstrap → serve)
- `daemon/src/routes.rs` — `/state`, `/promote`, `/rollback`, `/rebuild`, `/backup/fetch`, `/events`
- `daemon/src/gates.rs` — gate runners; the load-bearing logic
- `daemon/src/build.rs` — `build_and_run_mm` is the MM-tier pipeline
- `daemon/src/deploy.rs` — `deploy_local`; remote SSH stub
- `daemon/migrations/001_init.sql` — schema (tiers/nodes as rows)
- `server/deploy/deploy.sh` — current cross-compile + push-to-Hetzner script (what we are replacing)
- `server/deploy/run-ci.sh` — current astra CI script (what we are replacing)
- `_meta/docs/operations.md` — burn-in rule and hotfix policy that gates encode

---

## Phase 0 — MakeMachine bootstrap

Hardware and base provisioning. None of the remote-deploy work below matters until MM exists.

- [ ] Purchase MakeMachine hardware (Threadripper 7960X + RTX PRO 6000 Blackwell + 256 GB ECC + Gen5 NVMe; ~$14-16K per `project_inference_stack.md`).
- [ ] Install x86_64 Linux (match Hetzner prod distro/version to keep build env aligned).
- [ ] Join MM to tailnet; allocate a stable hostname and record in `_meta/infra_tailnet.md`.
- [ ] Provision `sando` system user; lock down the home dir; set up scoped SSH keys for outbound deploys.
- [ ] Install scratch Postgres locally on MM; create the `sando_scratch` role + DB used by `migration_dry_run`.
- [ ] Write the `sandod.service` systemd unit (run as `sando` user, restart on failure, `EnvironmentFile=/etc/sando/sando.env`).
- [ ] Install `sandod` binary at `/usr/local/bin/sandod`; enable + start the unit.
- [ ] Write the production `sando.toml`; bare repo path under `/srv/sando/mnw.git`; A node `testnot.work`; B node Hetzner prod.

## Phase 1 — Remote deploy

The MVP only deploys to `ssh_target=local`. Production needs real SSH/rsync.

- [ ] Implement `deploy::deploy_node` remote path: rsync the staged binary to `<ssh_target>:<release_root>/releases/<version>/server`, then `ssh <ssh_target> "ln -sfn releases/<version> current && systemctl reload-or-restart <unit>"`.
- [ ] Settle systemd unit naming convention. Current MNW server unit is `makenotwork.service`; decide whether Sando keeps that name or migrates to `mnw-server.service`. Capture in `plans/systemd-units.md` before changing anything live.
- [ ] Add `node.systemd_unit` field to `sando.toml` (default derives from the tier+role) so the convention is explicit per-node.
- [ ] Bootstrap script for adding a fresh node: creates `<release_root>`, installs the systemd unit pointing at `<release_root>/current/server`, adds the sando SSH key to `authorized_keys`. Idempotent.
- [ ] Garbage-collect old releases on the remote: keep last N (configurable, default 5) per node. Run at end of each successful deploy.
- [ ] Handle `rsync` failure mid-deploy: leave the previous `current` symlink intact; mark `deploys.outcome = 'failed'`; do not advance `tier_state`.
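
The remote path, the rsync-failure rule, and release garbage collection above can be sketched as a command plan. This is a minimal sketch, not the `deploy::deploy_node` implementation: `Node`, `remote_deploy_plan`, `run_plan`, and `releases_to_prune` are hypothetical names. The `mkdir -p` step, the `cd <release_root>` prefix (so the relative `ln -sfn` target resolves), and the rule that the release `current` points at is never pruned are assumptions added for illustration.

```rust
/// Hypothetical per-node settings. Field names mirror the placeholders in the
/// checklist, not the actual `sando.toml` schema.
pub struct Node {
    pub ssh_target: String,
    pub release_root: String,
    pub systemd_unit: String,
}

/// Build the ordered argv list for one remote deploy. The symlink swap is
/// deliberately the last step. No shell quoting is done; inputs are assumed
/// to be free of spaces and shell metacharacters.
pub fn remote_deploy_plan(node: &Node, staged_binary: &str, version: &str) -> Vec<Vec<String>> {
    let release_dir = format!("{}/releases/{}", node.release_root, version);
    let mkdir = format!("mkdir -p {}", release_dir);
    let dest = format!("{}:{}/server", node.ssh_target, release_dir);
    // Assumption: `cd` first so the relative `releases/<version>` target resolves.
    let swap = format!(
        "cd {} && ln -sfn releases/{} current && systemctl reload-or-restart {}",
        node.release_root, version, node.systemd_unit
    );
    let target = node.ssh_target.as_str();
    vec![
        argv(&["ssh", target, mkdir.as_str()]),
        argv(&["rsync", "-a", staged_binary, dest.as_str()]),
        argv(&["ssh", target, swap.as_str()]),
    ]
}

fn argv(parts: &[&str]) -> Vec<String> {
    parts.iter().map(|part| part.to_string()).collect()
}

/// Run the steps in order with `run`; stop at the first failure and return
/// its index. A failed rsync (index 1) therefore never reaches the swap.
pub fn run_plan<F>(plan: &[Vec<String>], mut run: F) -> Result<(), usize>
where
    F: FnMut(&[String]) -> bool,
{
    for (index, step) in plan.iter().enumerate() {
        if !run(step.as_slice()) {
            return Err(index);
        }
    }
    Ok(())
}

/// Given release names ordered oldest-first, return the ones to delete so
/// that only the newest `keep` remain. Assumption: the release that
/// `current` points at is never pruned, even if it is old (e.g. after a rollback).
pub fn releases_to_prune<'a>(oldest_first: &[&'a str], keep: usize, current: &str) -> Vec<&'a str> {
    let cutoff = oldest_first.len().saturating_sub(keep);
    oldest_first[..cutoff]
        .iter()
        .copied()
        .filter(|name| *name != current)
        .collect()
}
```

Putting the symlink swap last is what keeps a failed rsync harmless: `run_plan` reports the failing step's index, so the caller can record `deploys.outcome = 'failed'` and leave `tier_state` alone. In a real daemon the `run` closure would presumably wrap `std::process::Command`; here it is injected so the ordering can be tested without SSH.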

## Phase 2 — Backup pipeline + migration dry-run

`migration_dry_run` is the load-bearing gate. It needs a real backup source, not a fixture.

- [ ] Confirm astra's offsite replica (per `sync-backup-offsite.sh`) writes a deterministic latest-link path Sando can rsync from. If not, add one.
- [ ] Wire the production `sando.toml` `backup.source` to the astra rsync URL.
- [ ] Schedule a daily `POST /backup/fetch` (cron or systemd timer on MM) so a fresh backup is always within 24h of any promote attempt.
- [ ] First end-to-end `migration_dry_run` against a real prod backup; confirm it catches the 2026-05-22 incident class (drop+recreate column migration sequence).
- [ ] Document the failure modes: what does the operator see in `/state` when the dry-run fails? Capture in `plans/migration-dryrun-failures.md`.
- [ ] Decide retention on `backups` table — prune rows older than N days so SQLite doesn't grow forever.
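
The 24h freshness window and the retention decision above reduce to two small predicates. The sketch below assumes a fetch timestamp is stored per backup; `MAX_BACKUP_AGE`, `backup_is_fresh`, and `backups_to_prune` are hypothetical names, and treating a timestamp in the future as stale is an illustrative choice, not something the roadmap specifies.

```rust
use std::time::{Duration, SystemTime};

/// Freshness window from the checklist: at most 24h old at promote time.
pub const MAX_BACKUP_AGE: Duration = Duration::from_secs(24 * 60 * 60);

/// True when the backup was fetched within `max_age` of `now`.
/// Assumption: a fetch time in the future (clock skew) counts as not fresh.
pub fn backup_is_fresh(fetched_at: SystemTime, now: SystemTime, max_age: Duration) -> bool {
    match now.duration_since(fetched_at) {
        Ok(age) => age <= max_age,
        Err(_) => false,
    }
}

/// Fetch timestamps (Unix seconds) of rows older than `retain_days`,
/// i.e. the candidates for pruning from the `backups` table.
pub fn backups_to_prune(fetched_at_unix: &[u64], now_unix: u64, retain_days: u64) -> Vec<u64> {
    let cutoff = now_unix.saturating_sub(retain_days.saturating_mul(86_400));
    fetched_at_unix
        .iter()
        .copied()
        .filter(|fetched| *fetched < cutoff)
        .collect()
}
```

A promote handler could call `backup_is_fresh` before starting `migration_dry_run` and refuse to run the gate against a stale snapshot; whether that should fail the gate or trigger a fetch is still an open decision.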

## Phase 3 — Parity with current `deploy.sh`

Sando currently only ships the binary. `deploy.sh` does more. Inventory each piece and either fold it into Sando or document the explicit hand-off.

- [ ] **Caddyfile** — `deploy.sh upload_config` pushes `server/deploy/Caddyfile` to `/etc/caddy/Caddyfile` and reloads Caddy. Decide: ship as a versioned config artifact alongside the binary (cleanest), or keep Caddy config out-of-band? Capture in `plans/config-artifacts.md`.
- [ ] **systemd unit** — `deploy.sh` uploads `makenotwork.service`. With Sando the unit points at `current/server` and shouldn't change per release. Move unit ownership to the node-bootstrap script (Phase 1) and remove from per-deploy flow.
- [ ] **Backup script** — `backup-db.sh` is uploaded by `deploy.sh`. Move to node-bootstrap; not a per-release artifact.
- [ ] **Error pages** — static HTML in `server/deploy/error-pages/`. Either bake into the binary (preferred — versions with code) or ship as a `releases/<version>/error-pages/` sibling. Capture decision.
- [ ] **Security configs** — `sshd-git.conf`, `fail2ban-sshd.conf`, `setup-firewall.sh`. Move to node-bootstrap.
- [ ] **Restart warning** — `deploy.sh send_restart_warning` posts a banner before restart. Decide whether Sando emits this and through what surface (probably the existing in-app banner mechanism).
- [ ] **Cross-compile from macOS** — `deploy.sh` builds on the dev laptop via `cargo-zigbuild`. Sando builds natively on MM (x86_64 Linux). Verify the resulting binaries are byte-identical or at least behavior-equivalent across one full sprint before retiring `deploy.sh`.
- [ ] **Prod migrations** — today, who runs `sqlx migrate run` against prod? `deploy.sh` doesn't (verify). Sando should run prod migrations as part of `POST /promote/{tier}` for the prod tiers, OR there should be an explicit `POST /migrate/{tier}` operator action. Decide.

## Phase 4 — Cutover

Run Sando in parallel with `deploy.sh` until trust is built, then retire the old path.

- [ ] First successful Sando-only deploy to **testnot.work** (tier A). Old `deploy.sh` still primary for prod.
- [ ] One sprint (two months) of Sando-shadow runs: every `deploy.sh` deploy is also driven through Sando in dry-run mode (gates run, deploys go to a parallel `releases/` dir on prod but don't swap `current`). Compare outcomes.
- [ ] First Sando-only deploy to **Hetzner prod** (tier B). `deploy.sh` retained but unused.
- [ ] Move `server/deploy/deploy.sh` to `server/deploy/archive/deploy.sh.legacy` with a header explaining the cutover; do not delete (reference for the next year).
- [ ] Decommission astra CI runner (`server/deploy/run-ci.sh`). Sando's `cargo_test` gate replaces it; if any astra-specific checks are still needed (e.g., `cargo audit`), add them as additional gate kinds in `daemon/src/gates.rs`.
- [ ] Update `CLAUDE.md` and `_meta/docs/operations.md` to point at Sando, not `deploy.sh`.

## Phase 5 — Operator UX

The TUI polls. The MVP requires you to hand-insert a row for `manual_confirm`. Both are fine for one operator but rough.

- [ ] Implement `WS /events`: tail of gate starts/finishes, deploy events, build logs. Subscribe from the TUI.
- [ ] TUI: actions pane. `p` for promote (prompts for version + tier), `R` for rollback, `b` for backup fetch, `c` for `manual_confirm`.
- [ ] `POST /confirm/{tier}` endpoint that inserts a `gate_runs` row with `passed=1, gate_kind='manual_confirm'` for the current pending version. Replaces the hand-SQL workaround.
- [ ] TUI live log pane that follows the most recent build / gate run; backed by `WS /events`.
- [ ] `POST /promote` body should accept `version` as optional; default to the current MM version when target is A, predecessor's current when target is B+. Reduces ceremony.
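
The optional-`version` rule for `POST /promote` above can be expressed as one lookup if MM is treated as A's predecessor in the tier chain. The sketch below is an assumption about shape, not the handler in `daemon/src/routes.rs`: `resolve_promote_version`, the `chain` slice, and the `current` map are hypothetical, and tiers are plain strings for brevity.

```rust
use std::collections::HashMap;

/// Pick the version to promote into `target`.
/// An explicit `requested` version wins; otherwise default to the current
/// version of the tier immediately before `target` in `chain`
/// (MM for A, A for B, and so on).
pub fn resolve_promote_version(
    chain: &[&str],
    current: &HashMap<&str, String>,
    target: &str,
    requested: Option<&str>,
) -> Result<String, String> {
    if let Some(version) = requested {
        return Ok(version.to_string());
    }
    let index = chain
        .iter()
        .position(|tier| *tier == target)
        .ok_or_else(|| format!("unknown tier {}", target))?;
    if index == 0 {
        return Err(format!("tier {} has no predecessor to promote from", target));
    }
    let predecessor = chain[index - 1];
    current
        .get(predecessor)
        .cloned()
        .ok_or_else(|| format!("tier {} has no current version", predecessor))
}
```

With `chain = ["MM", "A", "B"]`, promoting to A with no body picks MM's current version and promoting to B picks A's, which matches the two cases in the checklist without special-casing either one.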

## Phase 6 — Monitoring + alerting

- [ ] Wire MM's `/metrics` endpoint into the existing MNW Prometheus scrape config; record where the scrape config lives in `_meta/` or wherever monitoring already runs.
- [ ] Add metrics: counters `sando_builds_total{outcome}`, `sando_gates_total{tier,kind,outcome}`, and `sando_deploys_total{tier,outcome}`, plus a gauge `sando_burn_in_remaining_hours{tier}` (it counts down, so it cannot be a counter).
- [ ] Alert: build failed. Page on first failure (not flap-protected — builds are infrequent).
- [ ] Alert: `migration_dry_run` failed. Page immediately. This is the 2026-05-22-class signal.
- [ ] Alert: a tier has had `current_version` unchanged for > N days while MM is green. (Operator forgot to promote.)
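
For orientation, this is roughly what the metric names above look like on the wire in the Prometheus text exposition format. The sketch hand-renders samples with the standard library only; `render_type`, `render_sample`, and `promote_is_overdue` are hypothetical helpers, and a real `/metrics` handler would more likely use a metrics crate. The burn-in metric is typed as a gauge here because its value goes down over time.

```rust
/// `# TYPE` header line; `kind` is "counter" or "gauge".
pub fn render_type(name: &str, kind: &str) -> String {
    format!("# TYPE {} {}", name, kind)
}

/// One sample line: `name{label="value",...} value`.
pub fn render_sample(name: &str, labels: &[(&str, &str)], value: f64) -> String {
    let rendered: Vec<String> = labels
        .iter()
        .map(|(key, val)| format!("{}=\"{}\"", key, escape_label_value(val)))
        .collect();
    if rendered.is_empty() {
        format!("{} {}", name, value)
    } else {
        format!("{}{{{}}} {}", name, rendered.join(","), value)
    }
}

/// Label values escape backslash, double quote, and newline.
fn escape_label_value(raw: &str) -> String {
    raw.replace('\\', "\\\\")
        .replace('"', "\\\"")
        .replace('\n', "\\n")
}

/// Condition behind the "operator forgot to promote" alert: the tier has
/// been unchanged for more than `max_days` while MM is green.
pub fn promote_is_overdue(days_unchanged: u32, mm_green: bool, max_days: u32) -> bool {
    mm_green && days_unchanged > max_days
}
```

For example, `render_sample("sando_deploys_total", &[("tier", "B"), ("outcome", "failed")], 1.0)` yields `sando_deploys_total{tier="B",outcome="failed"} 1`.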

## Phase 7 — Multi-node B+C

Today B is the only prod node. Adding C brings a second prod node plus Cloudflare (CF) Load Balancing.

- [ ] Provision tier C node (Hetzner or alternate provider — capture rationale).
- [ ] Update `sando.toml`: set `c.provisioned = true`, add `[[tier.node]]`.
- [ ] Set up Cloudflare Load Balancing with B + C as origin pool, health-checked.
- [ ] Verify sequential canary in Sando: deploy to B, wait for CF health-check to mark healthy (probably 30-60s probe interval), then deploy to C. Add a `node.health_url` field and a gate-style wait between nodes.
- [ ] Document in README that `canary = "parallel"` exists but should never be used for B+C unless you understand the failure modes.
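
The sequential canary described above, deploy to one node, wait until it reports healthy, then move on, might look like the following. This is a sketch under assumptions: `deploy_sequential` is a hypothetical name, the `healthy` closure stands in for a probe against the proposed `node.health_url`, and the probe count and interval are parameters rather than values taken from the roadmap.

```rust
use std::thread::sleep;
use std::time::Duration;

/// Deploy to each node in order. After each deploy, poll `healthy` up to
/// `max_probes` times, sleeping `probe_interval` between attempts. Stop at
/// the first node that fails to deploy or never becomes healthy, so later
/// nodes keep serving the old release.
pub fn deploy_sequential<D, H>(
    nodes: &[&str],
    mut deploy: D,
    mut healthy: H,
    max_probes: u32,
    probe_interval: Duration,
) -> Result<(), String>
where
    D: FnMut(&str) -> Result<(), String>,
    H: FnMut(&str) -> bool,
{
    for &node in nodes {
        deploy(node)?;
        let mut became_healthy = false;
        for attempt in 0..max_probes {
            if healthy(node) {
                became_healthy = true;
                break;
            }
            // No sleep after the final probe; the failure is reported right away.
            if attempt + 1 < max_probes {
                sleep(probe_interval);
            }
        }
        if !became_healthy {
            return Err(format!(
                "{} did not become healthy after {} probes",
                node, max_probes
            ));
        }
    }
    Ok(())
}
```

The early return is the point of the sequential mode: if B never turns healthy, C is not touched and remains a known-good origin in the pool. A parallel mode gives up exactly that guarantee.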

## Phase 8 — Postgres-on-D

Move Postgres off the prod app node so B+C become truly interchangeable.

- [ ] Provision Postgres-only machine D (modest spec; reliability over performance).
- [ ] Migrate the prod DB from Hetzner app node to D. Capture procedure in `plans/postgres-d-migration.md`.
- [ ] Update `server` `DATABASE_URL` everywhere (env files on B+C, scratch URL on MM stays local).
- [ ] Replica/HA story stays deferred; D is SPOF for now (per `_meta/preclear/.../decisions.md`).

## Phase 9 — Hardening

Pick up after cutover is stable.

- [ ] Tailnet ACL audit: confirm only the laptop can reach `sandod:7766`. Document the ACL.
- [ ] Decide if v0.2 needs token auth on `sandod` endpoints (revisit assumption from `decisions.md` once there's a real second operator).
- [ ] Sando self-deploy: Sando builds and deploys *itself* through its own pipeline. Bootstraps the bootstrap. Closes the chicken-and-egg loop and is satisfying.
- [ ] Backup-of-Sando-state: nightly SQLite snapshot to astra. The state DB tracks 6 months of deploys; losing it on an MM disk failure would be annoying.

## Notes / non-checkbox

- WS `/events` and the operator-UX work in Phase 5 can run in parallel with Phases 1-3 once MM exists. They are sequenced later for review clarity, not because they block anything.
- "Hotfix override" and `reset_burn_in` flag are already implemented end-to-end (see `decisions.md`); not on this list because there's nothing left to do until prod uses them.
- C tier exists in the schema as a `provisioned=false` row from day one — adding C in Phase 7 is a TOML edit, not a migration.

