max / makenotwork
9 files changed, +39 insertions, -39 deletions
| @@ -6,7 +6,7 @@ | |||
| 6 | 6 | # | |
| 7 | 7 | # Required env: | |
| 8 | 8 | # SANDO_PUBKEY — sando user's public key on the Sando host. Get it via: | |
| 9 | - | # `ssh pop-os 'sudo cat /srv/sando/.ssh/id_ed25519.pub'` | |
| 9 | + | # `ssh fw13 'sudo cat /srv/sando/.ssh/id_ed25519.pub'` | |
| 10 | 10 | # | |
| 11 | 11 | # Optional env: | |
| 12 | 12 | # DEPLOY_ROOT — defaults to /opt/mnw |
| @@ -2,7 +2,7 @@ | |||
| 2 | 2 | # Idempotent bootstrap for a fresh Sando host (the machine running sandod). | |
| 3 | 3 | # | |
| 4 | 4 | # Captures the three PG footguns + system user + systemd unit + scratch DB + | |
| 5 | - | # .ssh setup + known_hosts seeding that pop-os accumulated by hand over the | |
| 5 | + | # .ssh setup + known_hosts seeding that fw13 accumulated by hand over the | |
| 6 | 6 | # 2026-06-02 hardening session. Re-run any time the sando host is rebuilt. | |
| 7 | 7 | # | |
| 8 | 8 | # Run as root on the new host. The script is safe to run repeatedly — every | |
| @@ -45,7 +45,7 @@ if [[ $EUID -ne 0 ]]; then | |||
| 45 | 45 | fi | |
| 46 | 46 | ||
| 47 | 47 | # All paths the host should accept overrides for, with sane defaults that | |
| 48 | - | # match the live pop-os install. | |
| 48 | + | # match the live fw13 install. | |
| 49 | 49 | SANDO_USER="${SANDO_USER:-sando}" | |
| 50 | 50 | SANDO_HOME="${SANDO_HOME:-/srv/sando}" | |
| 51 | 51 | SANDO_DAEMON_URL="${SANDO_DAEMON_URL:-http://127.0.0.1:7766}" | |
| @@ -168,7 +168,7 @@ log "10/13 /etc/sando configs" | |||
| 168 | 168 | install -d -m 0755 /etc/sando | |
| 169 | 169 | # sando-daemon.toml.example is the canonical production config (per the | |
| 170 | 170 | # header comment). Install as-is; operator edits the listen address if | |
| 171 | - | # binding to a non-pop-os tailnet IP. | |
| 171 | + | # binding to a non-fw13 tailnet IP. | |
| 172 | 172 | install -m 0644 -o root -g root \ | |
| 173 | 173 | "$SCRIPT_DIR/sando-daemon.toml.example" \ | |
| 174 | 174 | /etc/sando/sando-daemon.toml |
| @@ -1,7 +1,7 @@ | |||
| 1 | 1 | # Sando daemon config (production). | |
| 2 | 2 | # Install at /etc/sando/sando-daemon.toml on the Sando host. | |
| 3 | 3 | ||
| 4 | - | listen = "100.103.89.95:7766" # pop-os tailnet IP; bind tailnet-only, not 0.0.0.0 | |
| 4 | + | listen = "100.103.89.95:7766" # fw13 tailnet IP; bind tailnet-only, not 0.0.0.0 | |
| 5 | 5 | db_path = "/srv/sando/state/sando.db" | |
| 6 | 6 | topology_path = "/etc/sando/sando.toml" | |
| 7 | 7 | workdir = "/srv/sando/work" |
| @@ -1,5 +1,5 @@ | |||
| 1 | 1 | # Sando daemon systemd service | |
| 2 | - | # Place at /etc/systemd/system/sandod.service on the Sando host (pop-os). | |
| 2 | + | # Place at /etc/systemd/system/sandod.service on the Sando host (fw13). | |
| 3 | 3 | # | |
| 4 | 4 | # Commands: | |
| 5 | 5 | # sudo systemctl daemon-reload |
| @@ -80,7 +80,7 @@ Same as security configs. Backup script is host infrastructure, not release arti | |||
| 80 | 80 | ||
| 81 | 81 | - For testnot (low traffic): skip. Service crash-loops invisibly enough. | |
| 82 | 82 | - For prod cutover: sando must implement this. Options: | |
| 83 | - | - **A**: Sando POSTs `/api/internal/restart-warning` itself, requires CLI_SERVICE_TOKEN exposed to sando. Token would live in `/etc/sando/sando.env` on pop-os. | |
| 83 | + | - **A**: Sando POSTs `/api/internal/restart-warning` itself, requires CLI_SERVICE_TOKEN exposed to sando. Token would live in `/etc/sando/sando.env` on fw13. | |
| 84 | 84 | - **B**: Sando exposes a `pre_deploy_hook` per-tier in `sando.toml` (shell command); operator decides. | |
| 85 | 85 | - Recommendation: **A** for prod tiers only (`tier.restart_warning_seconds = 30` in `sando.toml`). Tier A (testnot) leaves it unset = no warning. | |
| 86 | 86 | ||
| @@ -88,7 +88,7 @@ Phase 5 implementation, not blocking cutover-readiness. | |||
| 88 | 88 | ||
| 89 | 89 | ### 8. Cross-compile from macOS — **retire** | |
| 90 | 90 | ||
| 91 | - | Pop-os is x86_64 Ubuntu-derived, prod is x86_64 Ubuntu 24.04. Sando builds natively. Cargo-zigbuild path goes away once sando is canonical. | |
| 91 | + | fw13 is x86_64 Ubuntu-derived, prod is x86_64 Ubuntu 24.04. Sando builds natively. Cargo-zigbuild path goes away once sando is canonical. | |
| 92 | 92 | ||
| 93 | 93 | - Verify: take a recent prod binary (from `deploy.sh`'s build) and sando's binary for the same sha, compare runtime behavior across one full sprint of testnot use. | |
| 94 | 94 | - Once verified, mark `deploy.sh` archived and delete cargo-zigbuild from dev-machine setup notes. | |
| @@ -109,7 +109,7 @@ MNW server runs `sqlx::migrate!("./migrations").run(&db).await` in `main.rs:73` | |||
| 109 | 109 | ||
| 110 | 110 | | Step | Replaced by Sando | Moved to node-bootstrap | Retired | | |
| 111 | 111 | |---------------------|------------------------------|-------------------------|---------| | |
| 112 | - | | build_binary | yes (native on pop-os) | | | | |
| 112 | + | | build_binary | yes (native on fw13) | | | | |
| 113 | 113 | | upload_config | | yes (Caddyfile, etc.) | | | |
| 114 | 114 | | upload_binary | yes (+ mnw-admin) | | | | |
| 115 | 115 | | send_restart_warning| yes (Phase 5, prod tier only)| | | |
| @@ -133,7 +133,7 @@ Folded in here while editing topology code. Schema migration in `sando/daemon/mi | |||
| 133 | 133 | ## C. Test on MM | |
| 134 | 134 | ||
| 135 | 135 | 1. `cargo test --release --features fast-tests` in `sando/daemon/` — all existing tests pass + any new staging tests added (e.g. `stage_dir copies on success`, `stage_dir errors when required-missing`, `release_contents config parses`). | |
| 136 | - | 2. Build sandod, install, restart on pop-os. | |
| 136 | + | 2. Build sandod, install, restart on fw13. | |
| 137 | 137 | 3. `POST /rebuild` against current MNW main. | |
| 138 | 138 | 4. Inspect `/srv/sando/releases/<v>/` — should contain `makenotwork`, `mnw-admin`, `error-pages/`, `static/`, `docs/`. Verify total size is sane (single-digit MB for binary, tens of MB for static+docs). | |
| 139 | 139 | 5. boot_smoke still passes (binary doesn't care what's in the dir alongside it). | |
| @@ -200,7 +200,7 @@ Folded in here while editing topology code. Schema migration in `sando/daemon/mi | |||
| 200 | 200 | ||
| 201 | 201 | ## G. Outcomes (2026-06-02) | |
| 202 | 202 | ||
| 203 | - | Session 1 landed in one focused push. All 10 tasks done, 44/44 sando-daemon tests green, pipeline went `host` green end-to-end against sha `f0970b8` (version 0.9.5) on pop-os. | |
| 203 | + | Session 1 landed in one focused push. All 10 tasks done, 44/44 sando-daemon tests green, pipeline went `host` green end-to-end against sha `f0970b8` (version 0.9.5) on fw13. | |
| 204 | 204 | ||
| 205 | 205 | ### What shipped (commit `f0970b8` on sando bare + mnw + srht remotes) | |
| 206 | 206 | ||
| @@ -208,7 +208,7 @@ Session 1 landed in one focused push. All 10 tasks done, 44/44 sando-daemon test | |||
| 208 | 208 | - `build.rs::build_and_run_host` (renamed from `_mm`) iterates `cfg.release_contents`, calling `stage_entry()` per row. `cp -a` semantics; supports the merge-into-existing-dir form so multiple entries can target the same `dst` (used for `docs/` from 3 worktree sources). | |
| 209 | 209 | - `deploy.rs` rsync gained `--delete` (no stale assets across versions) and swapped `--chmod=F0755` for `--chmod=Du=rwx,Dgo=rx,Fu=rw,Fgo=r,F+X` (binaries 0755, data files 0644). | |
| 210 | 210 | - `bootstrap-node.sh` writes FHS-style unit: `EnvironmentFile=/etc/mnw/makenotwork.env`, `ReadWritePaths=/var/lib/mnw`, `WorkingDirectory=<release>/current`. Pre-creates `/etc/mnw` (root:service 0750) + `/var/lib/mnw` (service:service 0750). | |
| 211 | - | - Migration `002_rename_mm_to_host.sql` — `PRAGMA defer_foreign_keys = ON` + 5 UPDATEs (tiers, nodes, deploys, gate_runs, tier_state). Preserved all existing state on pop-os (host current=0.9.5 + a current=0.8.12 carried through). | |
| 211 | + | - Migration `002_rename_mm_to_host.sql` — `PRAGMA defer_foreign_keys = ON` + 5 UPDATEs (tiers, nodes, deploys, gate_runs, tier_state). Preserved all existing state on fw13 (host current=0.9.5 + a current=0.8.12 carried through). | |
| 212 | 212 | - Post-receive hook now lives in repo at `sando/deploy/post-receive` and sources `/etc/sando/sando.env` — `SANDO_DAEMON` resolves to the tailnet listener instead of the 127.0.0.1 default. `bootstrap-sandod-host.sh` installs it. | |
| 213 | 213 | ||
| 214 | 214 | ### Open-question answers from §F | |
| @@ -221,7 +221,7 @@ Session 1 landed in one focused push. All 10 tasks done, 44/44 sando-daemon test | |||
| 221 | 221 | ||
| 222 | 222 | - **deploy.sh has a CSS minification step** (`npx clean-css-cli`) before rsync. Sando does not. Effect: bundle ships unminified CSS (~3x larger on the wire than deploy.sh-shipped CSS). `server/build.rs` hashes the *unminified* `style.css` for the cache-bust `?v=...`, so correctness is preserved — purely a size issue. Future fix: either eat the size cost (gzip handles most of it), move minification into `server/build.rs`, or add a build-step gate to sando. **Not addressed in Session 1.** | |
| 223 | 223 | - **mnw-admin invocation surface is bigger than expected.** Live call sites on prod: (1) sudoers `/etc/sudoers.d/*` entry `makenotwork ALL=(git) NOPASSWD: /opt/makenotwork/mnw-admin rebuild-keys` — needs path update in Session 3; (2) `command=` prefixes in `/home/git/.ssh/authorized_keys` that `mnw-admin rebuild-keys` itself generates — auto-update on the first post-migration rebuild-keys run. Session 3 sequence: edit the sudoers file first, then run `mnw-admin rebuild-keys` once. | |
| 224 | - | - **The defensive assert-on-stray-`"mm"`-lookup proposed in the plan was skipped.** Tests catch it: any unrenamed site fails when the DB no longer has a row matching it. After the rename + sync.rs test run + the production restart on pop-os, no "mm" lookups remained. | |
| 224 | + | - **The defensive assert-on-stray-`"mm"`-lookup proposed in the plan was skipped.** Tests catch it: any unrenamed site fails when the DB no longer has a row matching it. After the rename + sync.rs test run + the production restart on fw13, no "mm" lookups remained. | |
| 225 | 225 | ||
| 226 | 226 | ### Carry-over for Session 2 | |
| 227 | 227 | ||
| @@ -229,7 +229,7 @@ The Session 2 starting point shifted slightly because we did Session 1's prep + | |||
| 229 | 229 | ||
| 230 | 230 | - `f0970b8` is the active sha on sando's bare repo and is current on tier host. Tier a is on the stale `0.8.12` from pre-Session-1. | |
| 231 | 231 | - testnot still has the unit shape from the *pre-Session-1* `bootstrap-node.sh`. It is in a crashloop (MissingDatabaseUrl, no env file). Session 2 reprovisioning will replace its systemd unit with the new FHS shape AND populate `/etc/mnw/makenotwork.env` from scratch. | |
| 232 | - | - `bootstrap-sandod-host.sh` on pop-os is the new version; re-running it is idempotent. | |
| 232 | + | - `bootstrap-sandod-host.sh` on fw13 is the new version; re-running it is idempotent. | |
| 233 | 233 | - Sando's pubkey on testnot under the `deploy` user: confirmed working earlier (`sudo -u sando ssh deploy@testnot` returned). No re-auth needed. | |
| 234 | 234 | - The bundle has not yet been deployed remotely. Tier a's `0.8.12` deploy predates `release_contents`; the testnot release dir contains only binaries. First Session 2 promotion to tier a will be the first remote deploy of the full bundle. | |
| 235 | 235 |
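The "What shipped" notes in the file above describe `stage_entry()` as having `cp -a` semantics with a merge-into-existing-dir form, so that several sources can target the same `dst`. The real implementation lives in Rust in `build.rs` and is not shown in this diff. The shell function below is only a sketch of that merge behaviour under assumed semantics; the function body and argument order are hypothetical.

```shell
# Hypothetical sketch of "cp -a with merge-into-existing-dir".
# Not the build.rs implementation; it only illustrates the semantics.
stage_entry() {
  local src="$1" dst="$2"
  if [ ! -e "$src" ]; then
    echo "stage_entry: missing source: $src" >&2
    return 1
  fi
  if [ -d "$src" ]; then
    mkdir -p "$dst"
    # The trailing "/." copies the directory's contents, so repeated calls
    # against the same dst merge into it instead of nesting dst/<src name>.
    cp -a "$src/." "$dst/"
  else
    mkdir -p "$(dirname "$dst")"
    cp -a "$src" "$dst"
  fi
}

# Example (paths are illustrative):
#   stage_entry worktree/docs        release/docs
#   stage_entry worktree/server/docs release/docs   # merges into the same dir
```

If two sources contain a file with the same relative path, the later call overwrites the earlier copy, so entry order matters in this sketch.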
| @@ -2,7 +2,7 @@ | |||
| 2 | 2 | ||
| 3 | 3 | Captured 2026-06-03 after the cutover. Resolves §6.5 step 8 of `launchplan_final.md`: first full sando deploy to Hetzner prod, replacing `deploy.sh` as the live deploy path. | |
| 4 | 4 | ||
| 5 | - | Status: **complete 2026-06-03.** Prod runs `makenotwork` 0.9.5 (sha `f0970b8`) from `/opt/mnw/current/`, deployed via `POST /promote/b {"hotfix":true}` from sandod on pop-os. Outage window 3m25s (02:50:33 → 02:53:58 UTC). All features green. See §F for outcomes and §G for the four hardcoded paths that block the eventual `rm -rf /opt/makenotwork/`. | |
| 5 | + | Status: **complete 2026-06-03.** Prod runs `makenotwork` 0.9.5 (sha `f0970b8`) from `/opt/mnw/current/`, deployed via `POST /promote/b {"hotfix":true}` from sandod on fw13. Outage window 3m25s (02:50:33 → 02:53:58 UTC). All features green. See §F for outcomes and §G for the four hardcoded paths that block the eventual `rm -rf /opt/makenotwork/`. | |
| 6 | 6 | ||
| 7 | 7 | ## Background — Session 1 set the layout, Session 2 proved it on testnot, Session 3 cut prod over | |
| 8 | 8 | ||
| @@ -55,7 +55,7 @@ In order, with the exact reason each step exists: | |||
| 55 | 55 | 7. **Caddyfile rewrite**: `sed -i 's|/opt/makenotwork/error-pages|/opt/mnw/current/error-pages|g'`. `caddy validate` before reload; `systemctl reload caddy`. | |
| 56 | 56 | 8. **Sudoers rewrite**: same sed pattern on `/etc/sudoers.d/mnw-git-ssh`; `visudo -c -f` to validate. | |
| 57 | 57 | 9. **`systemctl daemon-reload`** to pick up the new unit. | |
| 58 | - | 10. **`systemctl restart sandod`** on pop-os — sandod caches `sando.toml` at startup; the new tier B target wouldn't have taken effect without this. **First `POST /promote/b` failed with NXDOMAIN against the stale `prod-1.makenot.work` because sandod hadn't been restarted yet.** Fixed by restarting sandod and re-promoting. | |
| 58 | + | 10. **`systemctl restart sandod`** on fw13 — sandod caches `sando.toml` at startup; the new tier B target wouldn't have taken effect without this. **First `POST /promote/b` failed with NXDOMAIN against the stale `prod-1.makenot.work` because sandod hadn't been restarted yet.** Fixed by restarting sandod and re-promoting. | |
| 59 | 59 | 11. **`POST /promote/b {"hotfix":true}`** — `hotfix: true` bypasses the 48h burn-in on tier A (which had just promoted to 0.9.5 ~15 min prior; burn-in not yet elapsed). Sando rsync'd the 161MB bundle to `/opt/mnw/releases/0.9.5/`, swapped the `current` symlink, called `systemctl reload-or-restart makenotwork.service`. | |
| 60 | 60 | 12. **Service up 02:53:55 UTC.** Outage window ends 02:53:58 once health serves 200. 733 YARA rules compiled, all integrations (S3, Stripe, MT, WAM, git, scanner, custom domain cache) live. | |
| 61 | 61 | 13. **External smoke checks**: `/`, `/login`, `/pricing`, `/docs`, `/docs/economics`, `/docs/roadmap`, `/docs/tiers` — all 200. | |
| @@ -66,14 +66,14 @@ In order, with the exact reason each step exists: | |||
| 66 | 66 | ||
| 67 | 67 | - `/opt/makenotwork/` — full contents, untouched. Soak rollback path: stop new unit, swap systemd unit back, start old binary. Plan: `rm -rf` after a week, post-0.9.6 deploy (see §G). | |
| 68 | 68 | - `/opt/git/` — untouched. Git user's `/etc/passwd` home; mnw-admin's regenerated `authorized_keys` writes to `/opt/git/.ssh/authorized_keys` (not `/home/git/`, despite earlier confusion). The rsync to `/var/lib/mnw/git/` populated the new GIT_REPOS_PATH; the server reads from there, but git push lands in `/opt/git/` because that's git user's home. Both paths now hold the repo bytes; that's wasteful but harmless during the soak. | |
| 69 | - | - `/opt/makenotwork/backups/` — 885M of pg dumps. Script and cron still write there. Sando's backup-fetch on pop-os still pulls from there (configured pre-cutover). Migration to `/var/lib/mnw/backups/` is its own follow-up (touches script, crontab, pop-os sando config). | |
| 69 | + | - `/opt/makenotwork/backups/` — 885M of pg dumps. Script and cron still write there. Sando's backup-fetch on fw13 still pulls from there (configured pre-cutover). Migration to `/var/lib/mnw/backups/` is its own follow-up (touches script, crontab, fw13 sando config). | |
| 70 | 70 | - `yara-rules-src/`, `rustdoc/`, `ssh/`, `.env.bak.*` — not in any env var or systemd path. Confirmed by grepping the running 0.9.5 binary's path references. Will be swept in the post-soak cleanup. | |
| 71 | 71 | ||
| 72 | 72 | ## E. What broke and how it was caught | |
| 73 | 73 | ||
| 74 | 74 | Three small things, all caught by smoke checks: | |
| 75 | 75 | ||
| 76 | - | 1. **`sandod` cached `sando.toml`.** First promote attempt returned `creating remote release dir` (an in-flight progress string that became the error message). `journalctl -u sandod` showed it was still resolving `prod-1.makenot.work`. `scp sando.toml pop-os:/tmp/`, `sudo cp /tmp/sando.toml /etc/sando/sando.toml`, `sudo systemctl restart sandod`, re-promote. Worth documenting that `sandod` does not watch the file; alternative is to add an inotify or SIGHUP handler. | |
| 76 | + | 1. **`sandod` cached `sando.toml`.** First promote attempt returned `creating remote release dir` (an in-flight progress string that became the error message). `journalctl -u sandod` showed it was still resolving `prod-1.makenot.work`. `scp sando.toml fw13:/tmp/`, `sudo cp /tmp/sando.toml /etc/sando/sando.toml`, `sudo systemctl restart sandod`, re-promote. Worth documenting that `sandod` does not watch the file; alternative is to add an inotify or SIGHUP handler. | |
| 77 | 77 | 2. **First doc smoke checks were wrong URLs.** `/about/economics`, `/docs/about/economics` returned 404; panicked briefly that the cutover broke doc routing. False alarm: the route is `/docs/{slug}` where slug is the filename stem (e.g., `/docs/economics`). Verified with `grep doc_page MNW/server/src/` after the panic. **Worth fixing in any future smoke script** — use the real URL scheme, not guessed-from-filesystem paths. | |
| 78 | 78 | 3. **`mnw-admin rebuild-keys` needed env loading from root context.** `sudo -u git /opt/mnw/current/mnw-admin rebuild-keys` fails with `DATABASE_URL must be set: NotPresent` because the binary's `dotenvy::from_path("/opt/makenotwork/.env")` runs as git, which can't read `.env` (mode 0600 makenotwork). Workaround: `set -a; source /etc/mnw/makenotwork.env; set +a; sudo -u git -E /opt/mnw/current/mnw-admin rebuild-keys`. Cleanest long-term fix is in §G. | |
| 79 | 79 | ||
| @@ -120,7 +120,7 @@ Ship as 0.9.6. Cleanup sequence after: deploy 0.9.6 via sando → `rebuild-keys` | |||
| 120 | 120 | Independent of G.1. Touches: | |
| 121 | 121 | - `server/deploy/backup-db.sh` — hardcoded `BACKUP_DIR="/opt/makenotwork/backups"` near top. | |
| 122 | 122 | - `makenotwork` user crontab on prod. | |
| 123 | - | - Sando's `backup.source` URL on pop-os (currently pulls from `/opt/makenotwork/backups/latest.sql.gz` via rrsync). | |
| 123 | + | - Sando's `backup.source` URL on fw13 (currently pulls from `/opt/makenotwork/backups/latest.sql.gz` via rrsync). | |
| 124 | 124 | ||
| 125 | 125 | Easiest order: copy the existing 885M dir to `/var/lib/mnw/backups/`, edit script + crontab + sando config in one window, retire `/opt/makenotwork/backups/` after one successful daily backup lands in the new location and sando confirms it pulled cleanly. | |
| 126 | 126 |
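Steps 7 and 8 in the file above apply the same `sed` path rewrite to the Caddyfile and the sudoers file, then validate (`caddy validate`, `visudo -c -f`). The sketch below is a variation, not the procedure that was run: it validates a candidate copy first and touches the live file only if validation passes. The helper name `rewrite_path_checked` is hypothetical, and the validator is whatever command the caller passes in.

```shell
# Hypothetical helper: rewrite OLD to NEW in FILE, replacing FILE only if the
# validator command (remaining args, candidate path appended last) succeeds.
# Note: OLD is used as a sed regex, so regex metacharacters would need escaping.
rewrite_path_checked() {
  local file="$1" old="$2" new="$3"
  shift 3
  local candidate
  candidate="$(mktemp "${file}.XXXXXX")" || return 1
  # "|" as the sed delimiter avoids escaping the slashes in the paths.
  if ! sed "s|${old}|${new}|g" "$file" > "$candidate"; then
    rm -f "$candidate"
    return 1
  fi
  if ! "$@" "$candidate"; then
    rm -f "$candidate"
    echo "validation failed; $file left untouched" >&2
    return 1
  fi
  # Write through the existing file so its owner and mode are preserved.
  cat "$candidate" > "$file" && rm -f "$candidate"
}

# Example (run as root; paths taken from the steps above):
#   rewrite_path_checked /etc/sudoers.d/mnw-git-ssh \
#     /opt/makenotwork /opt/mnw/current visudo -c -f
```

A reload (`systemctl reload caddy`) would still be a separate step after a successful rewrite.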
| @@ -4,12 +4,12 @@ | |||
| 4 | 4 | # unlock promotion *to* the next tier, the nodes it ships to, and the canary | |
| 5 | 5 | # policy for shipping within the tier. | |
| 6 | 6 | # | |
| 7 | - | # Day-one wiring: host (pop-os, local) -> A (testnot.work) -> B (prod-1). C is | |
| 7 | + | # Day-one wiring: host (fw13, local) -> A (testnot.work) -> B (prod-1). C is | |
| 8 | 8 | # declared but not provisioned; adding the second prod node later is a config | |
| 9 | 9 | # edit (set provisioned = true, fill in [[tier.node]]). | |
| 10 | 10 | # | |
| 11 | 11 | # The first tier is "host" — it refers to whatever machine sandod runs on | |
| 12 | - | # (currently pop-os). Renamed from the legacy "mm" name in Session 1 of | |
| 12 | + | # (currently fw13). Renamed from the legacy "mm" name in Session 1 of | |
| 13 | 13 | # the sando bundle redesign. | |
| 14 | 14 | ||
| 15 | 15 | [repo] | |
| @@ -23,7 +23,7 @@ branch = "main" | |||
| 23 | 23 | source = "ssh://backup-puller@alpha-west-1:2200/latest.sql.gz" | |
| 24 | 24 | local_path = "/srv/sando/backups/latest.sql.gz" | |
| 25 | 25 | ||
| 26 | - | # ---- host: pop-os local pre-staging gate ---- | |
| 26 | + | # ---- host: fw13 local pre-staging gate ---- | |
| 27 | 27 | [[tier]] | |
| 28 | 28 | name = "host" | |
| 29 | 29 | provisioned = true | |
| @@ -33,7 +33,7 @@ gates = [ | |||
| 33 | 33 | { kind = "migration_dry_run" }, | |
| 34 | 34 | { kind = "boot_smoke" }, | |
| 35 | 35 | ] | |
| 36 | - | # Host is the daemon's own machine (pop-os); no remote node row. | |
| 36 | + | # Host is the daemon's own machine (fw13); no remote node row. | |
| 37 | 37 | ||
| 38 | 38 | # ---- A: testnot.work staging ---- | |
| 39 | 39 | [[tier]] |
| @@ -41,7 +41,7 @@ Sando has zero automated tests today — daemon + TUI have been validated by run | |||
| 41 | 41 | ||
| 42 | 42 | - [ ] Launches against `SANDO_DAEMON=http://100.103.89.95:7766` without crashing; header shows daemon URL. | |
| 43 | 43 | - [ ] WS status: `ws ok` appears in the header within ~1s of launch (sandod is reachable). | |
| 44 | - | - [ ] WS reconnects: `sudo systemctl restart sandod` on pop-os; header flips `ws ok → ws ... → ws ok` within ~5s. Events resume. | |
| 44 | + | - [ ] WS reconnects: `sudo systemctl restart sandod` on fw13; header flips `ws ok → ws ... → ws ok` within ~5s. Events resume. | |
| 45 | 45 | - [ ] `↑/↓` and `j/k` move the row highlight through all 4 tiers; selection persists across the 2s state refresh. | |
| 46 | 46 | - [ ] `b` triggers backup fetch: status bar shows `[ok] backup/fetch: ...`, events log gets a `backup_fetched` line a moment later. | |
| 47 | 47 | - [ ] `c` on tier `a` (which has `current_version=0.8.12`) records a manual_confirm; event appears. | |
| @@ -56,7 +56,7 @@ Sando has zero automated tests today — daemon + TUI have been validated by run | |||
| 56 | 56 | ||
| 57 | 57 | 55 tests passing as of 2026-05-31 (14 TUI + 41 daemon). Remaining gaps: | |
| 58 | 58 | ||
| 59 | - | - [x] `gates::reset_scratch` — verifies dropping every non-system schema (planted `foo` + `tower_sessions`, ran reset, asserted only `public` remains). Gated by `SANDO_TEST_PG_URL` env var so it skips on hosts without postgres. Run on pop-os with `SANDO_TEST_PG_URL=postgres:///sando_scratch?host=/var/run/postgresql cargo test`. | |
| 59 | + | - [x] `gates::reset_scratch` — verifies dropping every non-system schema (planted `foo` + `tower_sessions`, ran reset, asserted only `public` remains). Gated by `SANDO_TEST_PG_URL` env var so it skips on hosts without postgres. Run on fw13 with `SANDO_TEST_PG_URL=postgres:///sando_scratch?host=/var/run/postgresql cargo test`. | |
| 60 | 60 | - [x] `deploy::deploy_local` — copies multiple binaries (`PRIMARY`/`ADMIN`), swaps symlink atomically across two consecutive deploys, gc_local_releases keeps last N by mtime + handles missing dir + noop under threshold. `sh_quote` round-trip. | |
| 61 | 61 | - [x] `deploy::deploy_remote` failure path — against unroutable `192.0.2.1`, verifies clean ssh-attributed error (no panic / hang); ConnectTimeout bounds the test wallclock to ~10s. Plus `deploy_node` with `ssh_target="local"` short-circuits to symlink swap. | |
| 62 | 62 | - [x] `backup::fetch` URL parsing — extracted `parse_source` → `BackupSource` enum. 10 tests: file://, rsync://, ssh:// with/without port, multi-segment ssh path, non-numeric `:foo` colon treated as part of host (not port), and all malformed-input rejections (empty, scheme-only, ftp, no path on ssh, empty user@host). | |
| @@ -84,9 +84,9 @@ Sando has zero automated tests today — daemon + TUI have been validated by run | |||
| 84 | 84 | ||
| 85 | 85 | --- | |
| 86 | 86 | ||
| 87 | - | Roadmap target: replace `server/deploy/deploy.sh` and astra-hosted `server/deploy/run-ci.sh` with Sando running on **pop-os**, gating Hetzner prod through testnot.work. | |
| 87 | + | Roadmap target: replace `server/deploy/deploy.sh` and astra-hosted `server/deploy/run-ci.sh` with Sando running on **fw13**, gating Hetzner prod through testnot.work. | |
| 88 | 88 | ||
| 89 | - | **Host decision:** Sando runs on pop-os (x86_64 Ubuntu-derived, systemd). Architecturally closest to Hetzner prod, no cross-compile, no init-system split. MakeMachine and EveryCycle are now a separate project — not Sando's concern. | |
| 89 | + | **Host decision:** Sando runs on fw13 (x86_64 Ubuntu-derived, systemd). Architecturally closest to Hetzner prod, no cross-compile, no init-system split. MakeMachine and EveryCycle are now a separate project — not Sando's concern. | |
| 90 | 90 | ||
| 91 | 91 | Phases are ordered for execution. Phase 0 must finish before Phase 1 is meaningful. Phases 5+ are post-cutover hardening. | |
| 92 | 92 | ||
| @@ -108,28 +108,28 @@ Read these to orient before working on Sando: | |||
| 108 | 108 | ||
| 109 | 109 | --- | |
| 110 | 110 | ||
| 111 | - | ## Phase 0 — pop-os bootstrap | |
| 111 | + | ## Phase 0 — fw13 bootstrap | |
| 112 | 112 | ||
| 113 | - | - [x] Provision `sando` system user on pop-os; lock down home dir; generate SSH keypair at `/srv/sando/.ssh/id_ed25519` for outbound deploys. | |
| 114 | - | - [x] Install scratch Postgres locally on pop-os; create `sando_scratch` role + DB used by `migration_dry_run`. (Owner of own DB; non-superuser.) | |
| 113 | + | - [x] Provision `sando` system user on fw13; lock down home dir; generate SSH keypair at `/srv/sando/.ssh/id_ed25519` for outbound deploys. | |
| 114 | + | - [x] Install scratch Postgres locally on fw13; create `sando_scratch` role + DB used by `migration_dry_run`. (Owner of own DB; non-superuser.) | |
| 115 | 115 | - [x] Write systemd unit for `sandod` (long-run service, restart on failure, env from `/etc/sando/sando.env`). Installed at `/etc/systemd/system/sandod.service`. | |
| 116 | 116 | - [x] Write the production `sando.toml`; bare repo path under `/srv/sando/mnw.git`. Installed at `/etc/sando/sando.toml`; daemon config at `/etc/sando/sando-daemon.toml`. | |
| 117 | 117 | - [x] Install `sandod` binary at `/usr/local/bin/sandod`; enable + start the service. Live on `100.103.89.95:7766`; bare repo auto-bootstrapped at `/srv/sando/mnw.git`. | |
| 118 | - | - [x] Verify MNW server builds reproducibly on pop-os. `makenotwork` 0.8.12 built in 132s; sqlx online mode against `sando_scratch` postgres (sandod prep-resets all non-system schemas + applies all 133 MNW migrations before invoking cargo). | |
| 119 | - | - [ ] Register sando pubkey with Hetzner prod (`deploy@alpha-west-1`) and testnot.work once that node exists. Pubkey: `ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIEK+vhpr1V8VnsEemN9x6tAA2S05kmv/mQ3eVgSXSkJ8 sando@pop-os`. (Moved to Phase 1 — not blocking Phase 0 exit.) | |
| 118 | + | - [x] Verify MNW server builds reproducibly on fw13. `makenotwork` 0.8.12 built in 132s; sqlx online mode against `sando_scratch` postgres (sandod prep-resets all non-system schemas + applies all 133 MNW migrations before invoking cargo). | |
| 119 | + | - [ ] Register sando pubkey with Hetzner prod (`deploy@alpha-west-1`) and testnot.work once that node exists. Pubkey: `ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIEK+vhpr1V8VnsEemN9x6tAA2S05kmv/mQ3eVgSXSkJ8 sando@fw13`. (Moved to Phase 1 — not blocking Phase 0 exit.) | |
| 120 | 120 | ||
| 121 | 121 | ### Phase 0 follow-ups (not blocking, but visible) | |
| 122 | 122 | ||
| 123 | 123 | - [ ] `cargo_test` gate fails on MNW today — beyond the sqlx-online fix (already in), tests likely need a separate prepared DB (or per-test isolation). Investigate when wiring up Phase 1 gates. | |
| 124 | 124 | - [ ] Sandod observability: add `WS /events` (Phase 5) and consider streaming build/test stdout to a per-run log file rather than buffering in `Output`. | |
| 125 | 125 | - [ ] sqlx-cli (`v0.9.0`) at `/srv/sando/.cargo/bin/sqlx` is installed for the sando user but unused — sandod uses `sqlx::migrate::Migrator` programmatically (v0.8.6). Decide later whether to drop sqlx-cli or use it for diagnostics. | |
| 126 | - | - [ ] pop-os WoL: `ethtool` shows no wake-on capability on the USB ethernet — WoL likely won't work; rely on manual wake or BIOS settings. Record in `_meta/` if a solution surfaces. | |
| 126 | + | - [ ] fw13 WoL: `ethtool` shows no wake-on capability on the USB ethernet — WoL likely won't work; rely on manual wake or BIOS settings. Record in `_meta/` if a solution surfaces. | |
| 127 | 127 | ||
| 128 | 128 | ## Phase 1 — Remote deploy | |
| 129 | 129 | ||
| 130 | 130 | The MVP only deploys to `ssh_target=local`. Production needs real SSH/rsync. | |
| 131 | 131 | ||
| 132 | - | - [x] Implement `deploy::deploy_node` remote path: rsync staged binary to `<ssh_target>:<release_root>/releases/<version>/<bin_name>`, then `ssh <ssh_target>` does `mv -Tf` symlink swap + `sudo systemctl reload-or-restart <service>`. First real promote landed 2026-05-31: pop-os → testnot, version 0.8.12. | |
| 132 | + | - [x] Implement `deploy::deploy_node` remote path: rsync staged binary to `<ssh_target>:<release_root>/releases/<version>/<bin_name>`, then `ssh <ssh_target>` does `mv -Tf` symlink swap + `sudo systemctl reload-or-restart <service>`. First real promote landed 2026-05-31: fw13 → testnot, version 0.8.12. | |
| 133 | 133 | - [x] Add `node.service_name` to `sando.toml` (default `makenotwork.service`). | |
| 134 | 134 | - [x] Bootstrap script for adding a fresh node: `MNW/sando/deploy/bootstrap-node.sh`. (See Phase 3 — node-bootstrap script for full details.) | |
| 135 | 135 | - [x] Garbage-collect old releases on the remote: keep last N=5 per node, sorted by mtime. Runs at end of each successful deploy (local + remote variants). Tied via `RELEASES_TO_KEEP` const. | |
| @@ -151,7 +151,7 @@ Sando's deploy machinery is done, but testnot's MNW runtime needs the rest befor | |||
| 151 | 151 | ||
| 152 | 152 | - [x] ~~Confirm astra's offsite replica writes a deterministic latest-link path.~~ Pivoted: pull direct from prod (`backup-puller@alpha-west-1:2200`, rrsync-locked to `/opt/makenotwork/backups/`). Astra offsite is separately broken — see carryover below. | |
| 153 | 153 | - [x] Wire the production `sando.toml` `backup.source` — `ssh://backup-puller@alpha-west-1:2200/latest.sql.gz` with `latest.sql.gz` as a hard link on prod. | |
| 154 | - | - [x] Schedule a daily `POST /backup/fetch` (systemd timer on pop-os). `sandod-backup-fetch.{service,timer}` in `MNW/sando/deploy/`. Runs daily at 04:00 UTC (one hour after prod's 03:00 UTC backup-db.sh). Service uses `EnvironmentFile=/etc/sando/sando.env` for `$SANDO_DAEMON`. Verified 2026-05-31: one-shot test pulled 36MB backup, recorded in `backups` table. | |
| 154 | + | - [x] Schedule a daily `POST /backup/fetch` (systemd timer on fw13). `sandod-backup-fetch.{service,timer}` in `MNW/sando/deploy/`. Runs daily at 04:00 UTC (one hour after prod's 03:00 UTC backup-db.sh). Service uses `EnvironmentFile=/etc/sando/sando.env` for `$SANDO_DAEMON`. Verified 2026-05-31: one-shot test pulled 36MB backup, recorded in `backups` table. | |
| 155 | 155 | - [x] First end-to-end `migration_dry_run` against a real prod backup. Passed 2026-05-31 for sha 4541ebc in 1.2s: restored 36MB dump + applied all 133 migrations cleanly. Sha eee96a7 correctly failed `migration_dry_run` because it lacked migrations 123-132 that prod has applied — exactly the prod-vs-repo drift the gate is designed to catch. | |
| 156 | 156 | - [x] Document the failure modes: `plans/migration-dryrun-failures.md`. Covers all 7 fail modes (no backup, scratch_url unset, scratch reset, restore, drift, checksum mismatch, content broken against prod data) with operator playbook. | |
| 157 | 157 | - [x] Decide retention on `backups` table. 30 days; pruned at end of `backup::fetch`. `DELETE FROM backups WHERE fetched_at < datetime('now', '-30 days')`. | |
| @@ -172,7 +172,7 @@ Decisions captured in `plans/config-artifacts.md`. Summary: Caddyfile / systemd | |||
| 172 | 172 | - [x] **mnw-admin binary** — `cfg.bin_names: Vec<String>` (default `["server"]`, MNW uses `["makenotwork","mnw-admin"]`). `deploy_local` copies each from worktree's `target/release/<bin>`; `deploy_node` rsyncs the whole staged dir. `Config::primary_bin()` returns first entry for systemd reference. `versions.artifact_path` stores the primary; release dir is derived as `.parent()`. Verified on testnot 2026-05-31. | |
| 173 | 173 | - [x] **Security configs** — decided: bootstrap-only. (§5.) | |
| 174 | 174 | - [ ] **Restart warning** — Phase 5, prod-tier only via `tier.restart_warning_seconds` in `sando.toml`; needs `CLI_SERVICE_TOKEN` in `/etc/sando/sando.env`. (§7.) | |
| 175 | - | - [x] **Cross-compile from macOS** — decided: retire after one sprint of testnot parity verification. Pop-os builds natively. (§8.) | |
| 175 | + | - [x] **Cross-compile from macOS** — decided: retire after one sprint of testnot parity verification. fw13 builds natively. (§8.) | |
| 176 | 176 | - [x] **Prod migrations** — decided: server self-applies on startup. Sando does NOT run them. `migration_dry_run` gate is the prod safety net. (§9.) | |
| 177 | 177 | - [x] **Node-bootstrap script** — `MNW/sando/deploy/bootstrap-node.sh`. Idempotent. Takes `SANDO_PUBKEY` (required), `BIN_NAME`, `SERVICE_NAME`, `SERVICE_USER`, `DEPLOY_ROOT` env. Installs base packages (rsync/ufw/fail2ban), optionally postgres/tailscale/caddy, creates deploy user + dirs + sudoers entry + systemd unit, sets up UFW. Deliberately does NOT touch Caddyfile content, certs, postgres role/db, or secrets — those are operator-decisions per-node. testnot was done by hand and matches roughly what the script produces. Test by re-running on the next node added (tier B Hetzner prod move or tier C). | |
| 178 | 178 | ||
| @@ -200,7 +200,7 @@ The TUI polls. The MVP requires you to hand-insert a row for `manual_confirm`. B | |||
| 200 | 200 | ||
| 201 | 201 | ## Phase 6 — Monitoring + alerting | |
| 202 | 202 | ||
| 203 | - | - [ ] Wire pop-os `/metrics` endpoint into the existing MNW Prometheus scrape config; record where the scrape config lives in `_meta/` or wherever monitoring already runs. | |
| 203 | + | - [ ] Wire fw13 `/metrics` endpoint into the existing MNW Prometheus scrape config; record where the scrape config lives in `_meta/` or wherever monitoring already runs. | |
| 204 | 204 | - [ ] Add counters: `sando_builds_total{outcome}`, `sando_gates_total{tier,kind,outcome}`, `sando_deploys_total{tier,outcome}`, `sando_burn_in_remaining_hours{tier}`. | |
| 205 | 205 | - [ ] Alert: build failed. Page on first failure (not flap-protected — builds are infrequent). | |
| 206 | 206 | - [ ] Alert: migration_dry_run failed. Page immediately. This is the 2026-05-22-class signal. | |
| @@ -222,7 +222,7 @@ Move Postgres off the prod app node so B+C become truly interchangeable. | |||
| 222 | 222 | ||
| 223 | 223 | - [ ] Provision Postgres-only machine D (modest spec; reliability over performance). | |
| 224 | 224 | - [ ] Migrate the prod DB from Hetzner app node to D. Capture procedure in `plans/postgres-d-migration.md`. | |
| 225 | - | - [ ] Update `server` `DATABASE_URL` everywhere (env files on B+C, scratch URL on pop-os stays local). | |
| 225 | + | - [ ] Update `server` `DATABASE_URL` everywhere (env files on B+C, scratch URL on fw13 stays local). | |
| 226 | 226 | - [ ] Replica/HA story stays deferred; D is SPOF for now (per `_meta/preclear/.../decisions.md`). | |
| 227 | 227 | ||
| 228 | 228 | ## Phase 9 — Hardening | |
| @@ -232,7 +232,7 @@ Pick up after cutover is stable. | |||
| 232 | 232 | - [ ] Tailnet ACL audit: confirm only the laptop can reach `sandod:7766`. Document the ACL. | |
| 233 | 233 | - [ ] Decide if v0.2 needs token auth on `sandod` endpoints (revisit assumption from `decisions.md` once there's a real second operator). | |
| 234 | 234 | - [ ] Sando self-deploy: Sando builds and deploys *itself* through its own pipeline. Bootstraps the bootstrap. Closes the chicken-and-egg loop and is satisfying. | |
| 235 | - | - [ ] Backup-of-Sando-state: nightly SQLite snapshot to astra. The state DB tracks 6 months of deploys; losing it on a pop-os disk failure would be annoying. | |
| 235 | + | - [ ] Backup-of-Sando-state: nightly SQLite snapshot to astra. The state DB tracks 6 months of deploys; losing it on a fw13 disk failure would be annoying. | |
| 236 | 236 | ||
| 237 | 237 | ## Notes / non-checkbox | |
| 238 | 238 |