max / makenotwork

sando: pop-os -> fw13 in docs/config/plans after host rename

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Author: Max Johnson <me@maxj.phd> · 2026-06-05 00:19 UTC
Commit: 6a7f0dcd3da621121a296c99990e8ae64058b974
Parent: 104bcd0
9 files changed, +39 insertions, -39 deletions
@@ -6,7 +6,7 @@
6 6 #
7 7 # Required env:
8 8 # SANDO_PUBKEY — sando user's public key on the Sando host. Get it via:
9 - # `ssh pop-os 'sudo cat /srv/sando/.ssh/id_ed25519.pub'`
9 + # `ssh fw13 'sudo cat /srv/sando/.ssh/id_ed25519.pub'`
10 10 #
11 11 # Optional env:
12 12 # DEPLOY_ROOT — defaults to /opt/mnw
@@ -2,7 +2,7 @@
2 2 # Idempotent bootstrap for a fresh Sando host (the machine running sandod).
3 3 #
4 4 # Captures the three PG footguns + system user + systemd unit + scratch DB +
5 - # .ssh setup + known_hosts seeding that pop-os accumulated by hand over the
5 + # .ssh setup + known_hosts seeding that fw13 accumulated by hand over the
6 6 # 2026-06-02 hardening session. Re-run any time the sando host is rebuilt.
7 7 #
8 8 # Run as root on the new host. The script is safe to run repeatedly — every
@@ -45,7 +45,7 @@ if [[ $EUID -ne 0 ]]; then
45 45 fi
46 46
47 47 # All paths the host should accept overrides for, with sane defaults that
48 - # match the live pop-os install.
48 + # match the live fw13 install.
49 49 SANDO_USER="${SANDO_USER:-sando}"
50 50 SANDO_HOME="${SANDO_HOME:-/srv/sando}"
51 51 SANDO_DAEMON_URL="${SANDO_DAEMON_URL:-http://127.0.0.1:7766}"
@@ -168,7 +168,7 @@ log "10/13 /etc/sando configs"
168 168 install -d -m 0755 /etc/sando
169 169 # sando-daemon.toml.example is the canonical production config (per the
170 170 # header comment). Install as-is; operator edits the listen address if
171 - # binding to a non-pop-os tailnet IP.
171 + # binding to a non-fw13 tailnet IP.
172 172 install -m 0644 -o root -g root \
173 173 "$SCRIPT_DIR/sando-daemon.toml.example" \
174 174 /etc/sando/sando-daemon.toml
@@ -1,7 +1,7 @@
1 1 # Sando daemon config (production).
2 2 # Install at /etc/sando/sando-daemon.toml on the Sando host.
3 3
4 - listen = "100.103.89.95:7766" # pop-os tailnet IP; bind tailnet-only, not 0.0.0.0
4 + listen = "100.103.89.95:7766" # fw13 tailnet IP; bind tailnet-only, not 0.0.0.0
5 5 db_path = "/srv/sando/state/sando.db"
6 6 topology_path = "/etc/sando/sando.toml"
7 7 workdir = "/srv/sando/work"
@@ -1,5 +1,5 @@
1 1 # Sando daemon systemd service
2 - # Place at /etc/systemd/system/sandod.service on the Sando host (pop-os).
2 + # Place at /etc/systemd/system/sandod.service on the Sando host (fw13).
3 3 #
4 4 # Commands:
5 5 # sudo systemctl daemon-reload
@@ -80,7 +80,7 @@ Same as security configs. Backup script is host infrastructure, not release arti
80 80
81 81 - For testnot (low traffic): skip. Service crash-loops invisibly enough.
82 82 - For prod cutover: sando must implement this. Options:
83 - - **A**: Sando POSTs `/api/internal/restart-warning` itself, requires CLI_SERVICE_TOKEN exposed to sando. Token would live in `/etc/sando/sando.env` on pop-os.
83 + - **A**: Sando POSTs `/api/internal/restart-warning` itself, requires CLI_SERVICE_TOKEN exposed to sando. Token would live in `/etc/sando/sando.env` on fw13.
84 84 - **B**: Sando exposes a `pre_deploy_hook` per-tier in `sando.toml` (shell command); operator decides.
85 85 - Recommendation: **A** for prod tiers only (`tier.restart_warning_seconds = 30` in `sando.toml`). Tier A (testnot) leaves it unset = no warning.
86 86
@@ -88,7 +88,7 @@ Phase 5 implementation, not blocking cutover-readiness.
88 88
89 89 ### 8. Cross-compile from macOS — **retire**
90 90
91 - Pop-os is x86_64 Ubuntu-derived, prod is x86_64 Ubuntu 24.04. Sando builds natively. Cargo-zigbuild path goes away once sando is canonical.
91 + fw13 is x86_64 Ubuntu-derived, prod is x86_64 Ubuntu 24.04. Sando builds natively. Cargo-zigbuild path goes away once sando is canonical.
92 92
93 93 - Verify: take a recent prod binary (from `deploy.sh`'s build) and sando's binary for the same sha, compare runtime behavior across one full sprint of testnot use.
94 94 - Once verified, mark `deploy.sh` archived and delete cargo-zigbuild from dev-machine setup notes.
@@ -109,7 +109,7 @@ MNW server runs `sqlx::migrate!("./migrations").run(&db).await` in `main.rs:73`
109 109
110 110 | Step | Replaced by Sando | Moved to node-bootstrap | Retired |
111 111 |---------------------|------------------------------|-------------------------|---------|
112 - | build_binary | yes (native on pop-os) | | |
112 + | build_binary | yes (native on fw13) | | |
113 113 | upload_config | | yes (Caddyfile, etc.) | |
114 114 | upload_binary | yes (+ mnw-admin) | | |
115 115 | send_restart_warning| yes (Phase 5, prod tier only)| | |
@@ -133,7 +133,7 @@ Folded in here while editing topology code. Schema migration in `sando/daemon/mi
133 133 ## C. Test on MM
134 134
135 135 1. `cargo test --release --features fast-tests` in `sando/daemon/` — all existing tests pass + any new staging tests added (e.g. `stage_dir copies on success`, `stage_dir errors when required-missing`, `release_contents config parses`).
136 - 2. Build sandod, install, restart on pop-os.
136 + 2. Build sandod, install, restart on fw13.
137 137 3. `POST /rebuild` against current MNW main.
138 138 4. Inspect `/srv/sando/releases/<v>/` — should contain `makenotwork`, `mnw-admin`, `error-pages/`, `static/`, `docs/`. Verify total size is sane (single-digit MB for binary, tens of MB for static+docs).
139 139 5. boot_smoke still passes (binary doesn't care what's in the dir alongside it).
@@ -200,7 +200,7 @@ Folded in here while editing topology code. Schema migration in `sando/daemon/mi
200 200
201 201 ## G. Outcomes (2026-06-02)
202 202
203 - Session 1 landed in one focused push. All 10 tasks done, 44/44 sando-daemon tests green, pipeline went `host` green end-to-end against sha `f0970b8` (version 0.9.5) on pop-os.
203 + Session 1 landed in one focused push. All 10 tasks done, 44/44 sando-daemon tests green, pipeline went `host` green end-to-end against sha `f0970b8` (version 0.9.5) on fw13.
204 204
205 205 ### What shipped (commit `f0970b8` on sando bare + mnw + srht remotes)
206 206
@@ -208,7 +208,7 @@ Session 1 landed in one focused push. All 10 tasks done, 44/44 sando-daemon test
208 208 - `build.rs::build_and_run_host` (renamed from `_mm`) iterates `cfg.release_contents`, calling `stage_entry()` per row. `cp -a` semantics; supports the merge-into-existing-dir form so multiple entries can target the same `dst` (used for `docs/` from 3 worktree sources).
209 209 - `deploy.rs` rsync gained `--delete` (no stale assets across versions) and swapped `--chmod=F0755` for `--chmod=Du=rwx,Dgo=rx,Fu=rw,Fgo=r,F+X` (binaries 0755, data files 0644).
210 210 - `bootstrap-node.sh` writes FHS-style unit: `EnvironmentFile=/etc/mnw/makenotwork.env`, `ReadWritePaths=/var/lib/mnw`, `WorkingDirectory=<release>/current`. Pre-creates `/etc/mnw` (root:service 0750) + `/var/lib/mnw` (service:service 0750).
211 - - Migration `002_rename_mm_to_host.sql` — `PRAGMA defer_foreign_keys = ON` + 5 UPDATEs (tiers, nodes, deploys, gate_runs, tier_state). Preserved all existing state on pop-os (host current=0.9.5 + a current=0.8.12 carried through).
211 + - Migration `002_rename_mm_to_host.sql` — `PRAGMA defer_foreign_keys = ON` + 5 UPDATEs (tiers, nodes, deploys, gate_runs, tier_state). Preserved all existing state on fw13 (host current=0.9.5 + a current=0.8.12 carried through).
212 212 - Post-receive hook now lives in repo at `sando/deploy/post-receive` and sources `/etc/sando/sando.env` — `SANDO_DAEMON` resolves to the tailnet listener instead of the 127.0.0.1 default. `bootstrap-sandod-host.sh` installs it.
213 213
214 214 ### Open-question answers from §F
@@ -221,7 +221,7 @@ Session 1 landed in one focused push. All 10 tasks done, 44/44 sando-daemon test
221 221
222 222 - **deploy.sh has a CSS minification step** (`npx clean-css-cli`) before rsync. Sando does not. Effect: bundle ships unminified CSS (~3x larger on the wire than deploy.sh-shipped CSS). `server/build.rs` hashes the *unminified* `style.css` for the cache-bust `?v=...`, so correctness is preserved — purely a size issue. Future fix: either eat the size cost (gzip handles most of it), move minification into `server/build.rs`, or add a build-step gate to sando. **Not addressed in Session 1.**
223 223 - **mnw-admin invocation surface is bigger than expected.** Live call sites on prod: (1) sudoers `/etc/sudoers.d/*` entry `makenotwork ALL=(git) NOPASSWD: /opt/makenotwork/mnw-admin rebuild-keys` — needs path update in Session 3; (2) `command=` prefixes in `/home/git/.ssh/authorized_keys` that `mnw-admin rebuild-keys` itself generates — auto-update on the first post-migration rebuild-keys run. Session 3 sequence: edit the sudoers file first, then run `mnw-admin rebuild-keys` once.
224 - - **The defensive assert-on-stray-`"mm"`-lookup proposed in the plan was skipped.** Tests catch it: any unrenamed site fails when the DB no longer has a row matching it. After the rename + sync.rs test run + the production restart on pop-os, no "mm" lookups remained.
224 + - **The defensive assert-on-stray-`"mm"`-lookup proposed in the plan was skipped.** Tests catch it: any unrenamed site fails when the DB no longer has a row matching it. After the rename + sync.rs test run + the production restart on fw13, no "mm" lookups remained.
225 225
226 226 ### Carry-over for Session 2
227 227
@@ -229,7 +229,7 @@ The Session 2 starting point shifted slightly because we did Session 1's prep +
229 229
230 230 - `f0970b8` is the active sha on sando's bare repo and is current on tier host. Tier a is on the stale `0.8.12` from pre-Session-1.
231 231 - testnot still has the unit shape from the *pre-Session-1* `bootstrap-node.sh`. It is in a crashloop (MissingDatabaseUrl, no env file). Session 2 reprovisioning will replace its systemd unit with the new FHS shape AND populate `/etc/mnw/makenotwork.env` from scratch.
232 - - `bootstrap-sandod-host.sh` on pop-os is the new version; re-running it is idempotent.
232 + - `bootstrap-sandod-host.sh` on fw13 is the new version; re-running it is idempotent.
233 233 - Sando's pubkey on testnot under the `deploy` user: confirmed working earlier (`sudo -u sando ssh deploy@testnot` returned). No re-auth needed.
234 234 - The bundle has not yet been deployed remotely. Tier a's `0.8.12` deploy predates `release_contents`; the testnot release dir contains only binaries. First Session 2 promotion to tier a will be the first remote deploy of the full bundle.
235 235
@@ -2,7 +2,7 @@
2 2
3 3 Captured 2026-06-03 after the cutover. Resolves §6.5 step 8 of `launchplan_final.md`: first full sando deploy to Hetzner prod, replacing `deploy.sh` as the live deploy path.
4 4
5 - Status: **complete 2026-06-03.** Prod runs `makenotwork` 0.9.5 (sha `f0970b8`) from `/opt/mnw/current/`, deployed via `POST /promote/b {"hotfix":true}` from sandod on pop-os. Outage window 3m25s (02:50:33 → 02:53:58 UTC). All features green. See §F for outcomes and §G for the four hardcoded paths that block the eventual `rm -rf /opt/makenotwork/`.
5 + Status: **complete 2026-06-03.** Prod runs `makenotwork` 0.9.5 (sha `f0970b8`) from `/opt/mnw/current/`, deployed via `POST /promote/b {"hotfix":true}` from sandod on fw13. Outage window 3m25s (02:50:33 → 02:53:58 UTC). All features green. See §F for outcomes and §G for the four hardcoded paths that block the eventual `rm -rf /opt/makenotwork/`.
6 6
7 7 ## Background — Session 1 set the layout, Session 2 proved it on testnot, Session 3 cut prod over
8 8
@@ -55,7 +55,7 @@ In order, with the exact reason each step exists:
55 55 7. **Caddyfile rewrite**: `sed -i 's|/opt/makenotwork/error-pages|/opt/mnw/current/error-pages|g'`. `caddy validate` before reload; `systemctl reload caddy`.
56 56 8. **Sudoers rewrite**: same sed pattern on `/etc/sudoers.d/mnw-git-ssh`; `visudo -c -f` to validate.
57 57 9. **`systemctl daemon-reload`** to pick up the new unit.
58 - 10. **`systemctl restart sandod`** on pop-os — sandod caches `sando.toml` at startup; the new tier B target wouldn't have taken effect without this. **First `POST /promote/b` failed with NXDOMAIN against the stale `prod-1.makenot.work` because sandod hadn't been restarted yet.** Fixed by restarting sandod and re-promoting.
58 + 10. **`systemctl restart sandod`** on fw13 — sandod caches `sando.toml` at startup; the new tier B target wouldn't have taken effect without this. **First `POST /promote/b` failed with NXDOMAIN against the stale `prod-1.makenot.work` because sandod hadn't been restarted yet.** Fixed by restarting sandod and re-promoting.
59 59 11. **`POST /promote/b {"hotfix":true}`** — `hotfix: true` bypasses the 48h burn-in on tier A (which had just promoted to 0.9.5 ~15 min prior; burn-in not yet elapsed). Sando rsync'd the 161MB bundle to `/opt/mnw/releases/0.9.5/`, swapped the `current` symlink, called `systemctl reload-or-restart makenotwork.service`.
60 60 12. **Service up 02:53:55 UTC.** Outage window ends 02:53:58 once health serves 200. 733 YARA rules compiled, all integrations (S3, Stripe, MT, WAM, git, scanner, custom domain cache) live.
61 61 13. **External smoke checks**: `/`, `/login`, `/pricing`, `/docs`, `/docs/economics`, `/docs/roadmap`, `/docs/tiers` — all 200.
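
Steps 7 and 8 run the same substitution against the Caddyfile and the sudoers file, each followed by its validator (`caddy validate`, `visudo -c -f`). A minimal sketch of rehearsing that substitution on a scratch copy first might look like the following. `rewrite_error_pages_path` and the sample Caddyfile line are hypothetical, invented for illustration; only the `sed` expression comes from the steps above.

```shell
#!/usr/bin/env bash
# Sketch only: rehearse the step 7/8 path rewrite on a scratch copy before
# touching the live file. rewrite_error_pages_path is a hypothetical helper.
set -eu

rewrite_error_pages_path() {
    local file="$1"
    # Same substitution as steps 7 and 8. '|' is the sed delimiter, so the
    # slashes in the paths need no escaping. (-i as used here is GNU sed.)
    sed -i 's|/opt/makenotwork/error-pages|/opt/mnw/current/error-pages|g' "$file"
    # Succeed only if no reference to the old path is left behind.
    ! grep -q '/opt/makenotwork/error-pages' "$file"
}

scratch="$(mktemp)"
# Hypothetical stand-in for one line of the real Caddyfile.
printf 'root * /opt/makenotwork/error-pages\n' > "$scratch"
rewrite_error_pages_path "$scratch"
cat "$scratch"
```

The same helper could be pointed at a copy of `/etc/sudoers.d/mnw-git-ssh`, with `visudo -c -f` run on the copy before it replaces the live file.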
@@ -66,14 +66,14 @@ In order, with the exact reason each step exists:
66 66
67 67 - `/opt/makenotwork/` — full contents, untouched. Soak rollback path: stop new unit, swap systemd unit back, start old binary. Plan: `rm -rf` after a week, post-0.9.6 deploy (see §G).
68 68 - `/opt/git/` — untouched. Git user's `/etc/passwd` home; mnw-admin's regenerated `authorized_keys` writes to `/opt/git/.ssh/authorized_keys` (not `/home/git/`, despite earlier confusion). The rsync to `/var/lib/mnw/git/` populated the new GIT_REPOS_PATH; the server reads from there, but git push lands in `/opt/git/` because that's git user's home. Both paths now hold the repo bytes; that's wasteful but harmless during the soak.
69 - - `/opt/makenotwork/backups/` — 885M of pg dumps. Script and cron still write there. Sando's backup-fetch on pop-os still pulls from there (configured pre-cutover). Migration to `/var/lib/mnw/backups/` is its own follow-up (touches script, crontab, pop-os sando config).
69 + - `/opt/makenotwork/backups/` — 885M of pg dumps. Script and cron still write there. Sando's backup-fetch on fw13 still pulls from there (configured pre-cutover). Migration to `/var/lib/mnw/backups/` is its own follow-up (touches script, crontab, fw13 sando config).
70 70 - `yara-rules-src/`, `rustdoc/`, `ssh/`, `.env.bak.*` — not in any env var or systemd path. Confirmed by grepping the running 0.9.5 binary's path references. Will be swept in the post-soak cleanup.
71 71
72 72 ## E. What broke and how it was caught
73 73
74 74 Three small things, all caught by smoke checks:
75 75
76 - 1. **`sandod` cached `sando.toml`.** First promote attempt returned `creating remote release dir` (an in-flight progress string that became the error message). `journalctl -u sandod` showed it was still resolving `prod-1.makenot.work`. `scp sando.toml pop-os:/tmp/`, `sudo cp /tmp/sando.toml /etc/sando/sando.toml`, `sudo systemctl restart sandod`, re-promote. Worth documenting that `sandod` does not watch the file; alternative is to add an inotify or SIGHUP handler.
76 + 1. **`sandod` cached `sando.toml`.** First promote attempt returned `creating remote release dir` (an in-flight progress string that became the error message). `journalctl -u sandod` showed it was still resolving `prod-1.makenot.work`. `scp sando.toml fw13:/tmp/`, `sudo cp /tmp/sando.toml /etc/sando/sando.toml`, `sudo systemctl restart sandod`, re-promote. Worth documenting that `sandod` does not watch the file; alternative is to add an inotify or SIGHUP handler.
77 77 2. **First doc smoke checks were wrong URLs.** `/about/economics`, `/docs/about/economics` returned 404; panicked briefly that the cutover broke doc routing. False alarm: the route is `/docs/{slug}` where slug is the filename stem (e.g., `/docs/economics`). Verified with `grep doc_page MNW/server/src/` after the panic. **Worth fixing in any future smoke script** — use the real URL scheme, not guessed-from-filesystem paths.
78 78 3. **`mnw-admin rebuild-keys` needed env loading from root context.** `sudo -u git /opt/mnw/current/mnw-admin rebuild-keys` fails with `DATABASE_URL must be set: NotPresent` because the binary's `dotenvy::from_path("/opt/makenotwork/.env")` runs as git, which can't read `.env` (mode 0600 makenotwork). Workaround: `set -a; source /etc/mnw/makenotwork.env; set +a; sudo -u git -E /opt/mnw/current/mnw-admin rebuild-keys`. Cleanest long-term fix is in §G.
79 79
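
The workaround in item 3 depends on `set -a` (allexport): while it is on, every variable assigned by the sourced file is marked for export, so a plain `KEY=value` env file ends up in the environment that child processes inherit. A small sketch of that behaviour, using a scratch file and a placeholder value rather than the real `/etc/mnw/makenotwork.env`:

```shell
#!/usr/bin/env bash
# Sketch only: shows why the workaround wraps `source` in `set -a` / `set +a`.
# The env file and its value are placeholders, not the real makenotwork.env.
set -eu

envfile="$(mktemp)"
printf 'DATABASE_URL=postgres:///example\n' > "$envfile"

# Start clean in case the variable is already exported in this shell.
unset DATABASE_URL

# Without allexport, a sourced assignment is a shell variable only,
# so a child process does not see it.
. "$envfile"
sh -c 'echo "plain source:  ${DATABASE_URL:-unset}"'

# With allexport on, each assignment made while sourcing is exported.
set -a
. "$envfile"
set +a
sh -c 'echo "set -a source: ${DATABASE_URL:-unset}"'
```

In the real command, `sudo -E` then asks sudo to preserve that environment for the `git` user. Whether sudo honours the request depends on the sudoers policy, which is an assumption to check on the target host.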
@@ -120,7 +120,7 @@ Ship as 0.9.6. Cleanup sequence after: deploy 0.9.6 via sando → `rebuild-keys`
120 120 Independent of G.1. Touches:
121 121 - `server/deploy/backup-db.sh` — hardcoded `BACKUP_DIR="/opt/makenotwork/backups"` near top.
122 122 - `makenotwork` user crontab on prod.
123 - - Sando's `backup.source` URL on pop-os (currently pulls from `/opt/makenotwork/backups/latest.sql.gz` via rrsync).
123 + - Sando's `backup.source` URL on fw13 (currently pulls from `/opt/makenotwork/backups/latest.sql.gz` via rrsync).
124 124
125 125 Easiest order: copy the existing 885M dir to `/var/lib/mnw/backups/`, edit script + crontab + sando config in one window, retire `/opt/makenotwork/backups/` after one successful daily backup lands in the new location and sando confirms it pulled cleanly.
126 126
@@ -4,12 +4,12 @@
4 4 # unlock promotion *to* the next tier, the nodes it ships to, and the canary
5 5 # policy for shipping within the tier.
6 6 #
7 - # Day-one wiring: host (pop-os, local) -> A (testnot.work) -> B (prod-1). C is
7 + # Day-one wiring: host (fw13, local) -> A (testnot.work) -> B (prod-1). C is
8 8 # declared but not provisioned; adding the second prod node later is a config
9 9 # edit (set provisioned = true, fill in [[tier.node]]).
10 10 #
11 11 # The first tier is "host" — it refers to whatever machine sandod runs on
12 - # (currently pop-os). Renamed from the legacy "mm" name in Session 1 of
12 + # (currently fw13). Renamed from the legacy "mm" name in Session 1 of
13 13 # the sando bundle redesign.
14 14
15 15 [repo]
@@ -23,7 +23,7 @@ branch = "main"
23 23 source = "ssh://backup-puller@alpha-west-1:2200/latest.sql.gz"
24 24 local_path = "/srv/sando/backups/latest.sql.gz"
25 25
26 - # ---- host: pop-os local pre-staging gate ----
26 + # ---- host: fw13 local pre-staging gate ----
27 27 [[tier]]
28 28 name = "host"
29 29 provisioned = true
@@ -33,7 +33,7 @@ gates = [
33 33 { kind = "migration_dry_run" },
34 34 { kind = "boot_smoke" },
35 35 ]
36 - # Host is the daemon's own machine (pop-os); no remote node row.
36 + # Host is the daemon's own machine (fw13); no remote node row.
37 37
38 38 # ---- A: testnot.work staging ----
39 39 [[tier]]
M sando/todo.md +16 -16
@@ -41,7 +41,7 @@ Sando has zero automated tests today — daemon + TUI have been validated by run
41 41
42 42 - [ ] Launches against `SANDO_DAEMON=http://100.103.89.95:7766` without crashing; header shows daemon URL.
43 43 - [ ] WS status: `ws ok` appears in the header within ~1s of launch (sandod is reachable).
44 - - [ ] WS reconnects: `sudo systemctl restart sandod` on pop-os; header flips `ws ok → ws ... → ws ok` within ~5s. Events resume.
44 + - [ ] WS reconnects: `sudo systemctl restart sandod` on fw13; header flips `ws ok → ws ... → ws ok` within ~5s. Events resume.
45 45 - [ ] `↑/↓` and `j/k` move the row highlight through all 4 tiers; selection persists across the 2s state refresh.
46 46 - [ ] `b` triggers backup fetch: status bar shows `[ok] backup/fetch: ...`, events log gets a `backup_fetched` line a moment later.
47 47 - [ ] `c` on tier `a` (which has `current_version=0.8.12`) records a manual_confirm; event appears.
@@ -56,7 +56,7 @@ Sando has zero automated tests today — daemon + TUI have been validated by run
56 56
57 57 55 tests passing as of 2026-05-31 (14 TUI + 41 daemon). Remaining gaps:
58 58
59 - - [x] `gates::reset_scratch` — verifies dropping every non-system schema (planted `foo` + `tower_sessions`, ran reset, asserted only `public` remains). Gated by `SANDO_TEST_PG_URL` env var so it skips on hosts without postgres. Run on pop-os with `SANDO_TEST_PG_URL=postgres:///sando_scratch?host=/var/run/postgresql cargo test`.
59 + - [x] `gates::reset_scratch` — verifies dropping every non-system schema (planted `foo` + `tower_sessions`, ran reset, asserted only `public` remains). Gated by `SANDO_TEST_PG_URL` env var so it skips on hosts without postgres. Run on fw13 with `SANDO_TEST_PG_URL=postgres:///sando_scratch?host=/var/run/postgresql cargo test`.
60 60 - [x] `deploy::deploy_local` — copies multiple binaries (`PRIMARY`/`ADMIN`), swaps symlink atomically across two consecutive deploys, gc_local_releases keeps last N by mtime + handles missing dir + noop under threshold. `sh_quote` round-trip.
61 61 - [x] `deploy::deploy_remote` failure path — against unroutable `192.0.2.1`, verifies clean ssh-attributed error (no panic / hang); ConnectTimeout bounds the test wallclock to ~10s. Plus `deploy_node` with `ssh_target="local"` short-circuits to symlink swap.
62 62 - [x] `backup::fetch` URL parsing — extracted `parse_source` → `BackupSource` enum. 10 tests: file://, rsync://, ssh:// with/without port, multi-segment ssh path, non-numeric `:foo` colon treated as part of host (not port), and all malformed-input rejections (empty, scheme-only, ftp, no path on ssh, empty user@host).
@@ -84,9 +84,9 @@ Sando has zero automated tests today — daemon + TUI have been validated by run
84 84
85 85 ---
86 86
87 - Roadmap target: replace `server/deploy/deploy.sh` and astra-hosted `server/deploy/run-ci.sh` with Sando running on **pop-os**, gating Hetzner prod through testnot.work.
87 + Roadmap target: replace `server/deploy/deploy.sh` and astra-hosted `server/deploy/run-ci.sh` with Sando running on **fw13**, gating Hetzner prod through testnot.work.
88 88
89 - **Host decision:** Sando runs on pop-os (x86_64 Ubuntu-derived, systemd). Architecturally closest to Hetzner prod, no cross-compile, no init-system split. MakeMachine and EveryCycle are now a separate project — not Sando's concern.
89 + **Host decision:** Sando runs on fw13 (x86_64 Ubuntu-derived, systemd). Architecturally closest to Hetzner prod, no cross-compile, no init-system split. MakeMachine and EveryCycle are now a separate project — not Sando's concern.
90 90
91 91 Phases are ordered for execution. Phase 0 must finish before Phase 1 is meaningful. Phases 5+ are post-cutover hardening.
92 92
@@ -108,28 +108,28 @@ Read these to orient before working on Sando:
108 108
109 109 ---
110 110
111 - ## Phase 0 — pop-os bootstrap
111 + ## Phase 0 — fw13 bootstrap
112 112
113 - - [x] Provision `sando` system user on pop-os; lock down home dir; generate SSH keypair at `/srv/sando/.ssh/id_ed25519` for outbound deploys.
114 - - [x] Install scratch Postgres locally on pop-os; create `sando_scratch` role + DB used by `migration_dry_run`. (Owner of own DB; non-superuser.)
113 + - [x] Provision `sando` system user on fw13; lock down home dir; generate SSH keypair at `/srv/sando/.ssh/id_ed25519` for outbound deploys.
114 + - [x] Install scratch Postgres locally on fw13; create `sando_scratch` role + DB used by `migration_dry_run`. (Owner of own DB; non-superuser.)
115 115 - [x] Write systemd unit for `sandod` (long-run service, restart on failure, env from `/etc/sando/sando.env`). Installed at `/etc/systemd/system/sandod.service`.
116 116 - [x] Write the production `sando.toml`; bare repo path under `/srv/sando/mnw.git`. Installed at `/etc/sando/sando.toml`; daemon config at `/etc/sando/sando-daemon.toml`.
117 117 - [x] Install `sandod` binary at `/usr/local/bin/sandod`; enable + start the service. Live on `100.103.89.95:7766`; bare repo auto-bootstrapped at `/srv/sando/mnw.git`.
118 - - [x] Verify MNW server builds reproducibly on pop-os. `makenotwork` 0.8.12 built in 132s; sqlx online mode against `sando_scratch` postgres (sandod prep-resets all non-system schemas + applies all 133 MNW migrations before invoking cargo).
119 - - [ ] Register sando pubkey with Hetzner prod (`deploy@alpha-west-1`) and testnot.work once that node exists. Pubkey: `ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIEK+vhpr1V8VnsEemN9x6tAA2S05kmv/mQ3eVgSXSkJ8 sando@pop-os`. (Moved to Phase 1 — not blocking Phase 0 exit.)
118 + - [x] Verify MNW server builds reproducibly on fw13. `makenotwork` 0.8.12 built in 132s; sqlx online mode against `sando_scratch` postgres (sandod prep-resets all non-system schemas + applies all 133 MNW migrations before invoking cargo).
119 + - [ ] Register sando pubkey with Hetzner prod (`deploy@alpha-west-1`) and testnot.work once that node exists. Pubkey: `ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIEK+vhpr1V8VnsEemN9x6tAA2S05kmv/mQ3eVgSXSkJ8 sando@fw13`. (Moved to Phase 1 — not blocking Phase 0 exit.)
120 120
121 121 ### Phase 0 follow-ups (not blocking, but visible)
122 122
123 123 - [ ] `cargo_test` gate fails on MNW today — beyond the sqlx-online fix (already in), tests likely need a separate prepared DB (or per-test isolation). Investigate when wiring up Phase 1 gates.
124 124 - [ ] Sandod observability: add `WS /events` (Phase 5) and consider streaming build/test stdout to a per-run log file rather than buffering in `Output`.
125 125 - [ ] sqlx-cli (`v0.9.0`) at `/srv/sando/.cargo/bin/sqlx` is installed for the sando user but unused — sandod uses `sqlx::migrate::Migrator` programmatically (v0.8.6). Decide later whether to drop sqlx-cli or use it for diagnostics.
126 - - [ ] pop-os WoL: `ethtool` shows no wake-on capability on the USB ethernet — WoL likely won't work; rely on manual wake or BIOS settings. Record in `_meta/` if a solution surfaces.
126 + - [ ] fw13 WoL: `ethtool` shows no wake-on capability on the USB ethernet — WoL likely won't work; rely on manual wake or BIOS settings. Record in `_meta/` if a solution surfaces.
127 127
128 128 ## Phase 1 — Remote deploy
129 129
130 130 The MVP only deploys to `ssh_target=local`. Production needs real SSH/rsync.
131 131
132 - - [x] Implement `deploy::deploy_node` remote path: rsync staged binary to `<ssh_target>:<release_root>/releases/<version>/<bin_name>`, then `ssh <ssh_target>` does `mv -Tf` symlink swap + `sudo systemctl reload-or-restart <service>`. First real promote landed 2026-05-31: pop-os → testnot, version 0.8.12.
132 + - [x] Implement `deploy::deploy_node` remote path: rsync staged binary to `<ssh_target>:<release_root>/releases/<version>/<bin_name>`, then `ssh <ssh_target>` does `mv -Tf` symlink swap + `sudo systemctl reload-or-restart <service>`. First real promote landed 2026-05-31: fw13 → testnot, version 0.8.12.
133 133 - [x] Add `node.service_name` to `sando.toml` (default `makenotwork.service`).
134 134 - [x] Bootstrap script for adding a fresh node: `MNW/sando/deploy/bootstrap-node.sh`. (See Phase 3 — node-bootstrap script for full details.)
135 135 - [x] Garbage-collect old releases on the remote: keep last N=5 per node, sorted by mtime. Runs at end of each successful deploy (local + remote variants). Tied via `RELEASES_TO_KEEP` const.
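
The remote path above boils down to two filesystem operations: an `mv -Tf` symlink swap, then pruning to the newest N releases by mtime. The real logic lives in sando's Rust deploy code, so the following is only a shell illustration under assumptions. `swap_current` and `gc_releases` are hypothetical helper names, and the `<root>/releases/<version>` plus `<root>/current` layout is inferred from the paths quoted in this document.

```shell
#!/usr/bin/env bash
# Sketch only: shell illustration of the swap-then-gc sequence. The real
# implementation is Rust; both helper names here are hypothetical.
set -eu

# Point <root>/current at releases/<version> with no moment where the link
# is missing: create the new link under a temporary name, then rename it
# over the old one. -T (GNU mv) treats the destination as a plain file, so
# mv replaces the link instead of moving into the directory it points to.
swap_current() {
    local root="$1" version="$2"
    ln -sfn "releases/$version" "$root/current.new"
    mv -Tf "$root/current.new" "$root/current"
}

# Keep the newest <keep> release directories by mtime and delete the rest.
# Parsing `ls` output assumes version names contain no newlines.
gc_releases() {
    local root="$1" keep="$2"
    [ -d "$root/releases" ] || return 0   # missing dir: nothing to do
    ls -1t "$root/releases" | tail -n "+$((keep + 1))" | while IFS= read -r old; do
        rm -rf -- "$root/releases/$old"
    done
}
```

A call sequence such as `swap_current /opt/mnw 0.9.5` followed by `gc_releases /opt/mnw 5` mirrors the order described above. This sketch does not special-case the release that `current` points to, so a rollback to an old version would need extra care.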
@@ -151,7 +151,7 @@ Sando's deploy machinery is done, but testnot's MNW runtime needs the rest befor
151 151
152 152 - [x] ~~Confirm astra's offsite replica writes a deterministic latest-link path.~~ Pivoted: pull direct from prod (`backup-puller@alpha-west-1:2200`, rrsync-locked to `/opt/makenotwork/backups/`). Astra offsite is separately broken — see carryover below.
153 153 - [x] Wire the production `sando.toml` `backup.source` — `ssh://backup-puller@alpha-west-1:2200/latest.sql.gz` with `latest.sql.gz` as a hard link on prod.
154 - - [x] Schedule a daily `POST /backup/fetch` (systemd timer on pop-os). `sandod-backup-fetch.{service,timer}` in `MNW/sando/deploy/`. Runs daily at 04:00 UTC (one hour after prod's 03:00 UTC backup-db.sh). Service uses `EnvironmentFile=/etc/sando/sando.env` for `$SANDO_DAEMON`. Verified 2026-05-31: one-shot test pulled 36MB backup, recorded in `backups` table.
154 + - [x] Schedule a daily `POST /backup/fetch` (systemd timer on fw13). `sandod-backup-fetch.{service,timer}` in `MNW/sando/deploy/`. Runs daily at 04:00 UTC (one hour after prod's 03:00 UTC backup-db.sh). Service uses `EnvironmentFile=/etc/sando/sando.env` for `$SANDO_DAEMON`. Verified 2026-05-31: one-shot test pulled 36MB backup, recorded in `backups` table.
155 155 - [x] First end-to-end `migration_dry_run` against a real prod backup. Passed 2026-05-31 for sha 4541ebc in 1.2s: restored 36MB dump + applied all 133 migrations cleanly. Sha eee96a7 correctly failed `migration_dry_run` because it lacked migrations 123-132 that prod has applied — exactly the prod-vs-repo drift the gate is designed to catch.
156 156 - [x] Document the failure modes: `plans/migration-dryrun-failures.md`. Covers all 7 fail modes (no backup, scratch_url unset, scratch reset, restore, drift, checksum mismatch, content broken against prod data) with operator playbook.
157 157 - [x] Decide retention on `backups` table. 30 days; pruned at end of `backup::fetch`. `DELETE FROM backups WHERE fetched_at < datetime('now', '-30 days')`.
@@ -172,7 +172,7 @@ Decisions captured in `plans/config-artifacts.md`. Summary: Caddyfile / systemd
172 172 - [x] **mnw-admin binary** — `cfg.bin_names: Vec<String>` (default `["server"]`, MNW uses `["makenotwork","mnw-admin"]`). `deploy_local` copies each from worktree's `target/release/<bin>`; `deploy_node` rsyncs the whole staged dir. `Config::primary_bin()` returns first entry for systemd reference. `versions.artifact_path` stores the primary; release dir is derived as `.parent()`. Verified on testnot 2026-05-31.
173 173 - [x] **Security configs** — decided: bootstrap-only. (§5.)
174 174 - [ ] **Restart warning** — Phase 5, prod-tier only via `tier.restart_warning_seconds` in `sando.toml`; needs `CLI_SERVICE_TOKEN` in `/etc/sando/sando.env`. (§7.)
175 - - [x] **Cross-compile from macOS** — decided: retire after one sprint of testnot parity verification. Pop-os builds natively. (§8.)
175 + - [x] **Cross-compile from macOS** — decided: retire after one sprint of testnot parity verification. fw13 builds natively. (§8.)
176 176 - [x] **Prod migrations** — decided: server self-applies on startup. Sando does NOT run them. `migration_dry_run` gate is the prod safety net. (§9.)
177 177 - [x] **Node-bootstrap script** — `MNW/sando/deploy/bootstrap-node.sh`. Idempotent. Takes `SANDO_PUBKEY` (required), `BIN_NAME`, `SERVICE_NAME`, `SERVICE_USER`, `DEPLOY_ROOT` env. Installs base packages (rsync/ufw/fail2ban), optionally postgres/tailscale/caddy, creates deploy user + dirs + sudoers entry + systemd unit, sets up UFW. Deliberately does NOT touch Caddyfile content, certs, postgres role/db, or secrets — those are operator-decisions per-node. testnot was done by hand and matches roughly what the script produces. Test by re-running on the next node added (tier B Hetzner prod move or tier C).
178 178
@@ -200,7 +200,7 @@ The TUI polls. The MVP requires you to hand-insert a row for `manual_confirm`. B
200 200
201 201 ## Phase 6 — Monitoring + alerting
202 202
203 - - [ ] Wire pop-os `/metrics` endpoint into the existing MNW Prometheus scrape config; record where the scrape config lives in `_meta/` or wherever monitoring already runs.
203 + - [ ] Wire fw13 `/metrics` endpoint into the existing MNW Prometheus scrape config; record where the scrape config lives in `_meta/` or wherever monitoring already runs.
204 204 - [ ] Add counters: `sando_builds_total{outcome}`, `sando_gates_total{tier,kind,outcome}`, `sando_deploys_total{tier,outcome}`, `sando_burn_in_remaining_hours{tier}`.
205 205 - [ ] Alert: build failed. Page on first failure (not flap-protected — builds are infrequent).
206 206 - [ ] Alert: migration_dry_run failed. Page immediately. This is the 2026-05-22-class signal.
@@ -222,7 +222,7 @@ Move Postgres off the prod app node so B+C become truly interchangeable.
222 222
223 223 - [ ] Provision Postgres-only machine D (modest spec; reliability over performance).
224 224 - [ ] Migrate the prod DB from Hetzner app node to D. Capture procedure in `plans/postgres-d-migration.md`.
225 - - [ ] Update `server` `DATABASE_URL` everywhere (env files on B+C, scratch URL on pop-os stays local).
225 + - [ ] Update `server` `DATABASE_URL` everywhere (env files on B+C, scratch URL on fw13 stays local).
226 226 - [ ] Replica/HA story stays deferred; D is SPOF for now (per `_meta/preclear/.../decisions.md`).
227 227
228 228 ## Phase 9 — Hardening
@@ -232,7 +232,7 @@ Pick up after cutover is stable.
232 232 - [ ] Tailnet ACL audit: confirm only the laptop can reach `sandod:7766`. Document the ACL.
233 233 - [ ] Decide if v0.2 needs token auth on `sandod` endpoints (revisit assumption from `decisions.md` once there's a real second operator).
234 234 - [ ] Sando self-deploy: Sando builds and deploys *itself* through its own pipeline. Bootstraps the bootstrap. Closes the chicken-and-egg loop and is satisfying.
235 - - [ ] Backup-of-Sando-state: nightly SQLite snapshot to astra. The state DB tracks 6 months of deploys; losing it on a pop-os disk failure would be annoying.
235 + - [ ] Backup-of-Sando-state: nightly SQLite snapshot to astra. The state DB tracks 6 months of deploys; losing it on a fw13 disk failure would be annoying.
236 236
237 237 ## Notes / non-checkbox
238 238