max / makenotwork

sando: todo.md — adopt pop-os completion marks; Session 5 (launch + soak)

Brings in ~50 historical [x] completion marks that landed on pop-os while main was advancing toward consolidation. Replaces the Session 4 soak-clock TODO with a Session 5 block reflecting today's actual completion: v0.9.7 launched via Sando, /opt/{git,makenotwork,backups} removed, per-DB backups under /var/lib/mnw/backups/, offsite to astra wired up. Only Session-4 follow-up still open is the systemd drop-in removal (deferred to next mnw-cli deploy --config).
Author: Max Johnson <me@maxj.phd> · 2026-06-04 01:11 UTC
Commit: a3bc35f3cba3e041c4cfe358413ca41b0a90c63a
Parent: 7284805
1 file changed, +60 insertions, -4 deletions
M sando/todo.md +60 -4
@@ -19,11 +19,12 @@ Claude-only follow-ups (no user input needed; pick the next slice):
19 19 - Phase 4 prep — first Sando-only deploy to testnot (needs Track B — see below)
20 20 - Sando test suite — see "Testing" section below; sandod and TUI have zero unit/integration tests today
21 21
22 - Session 4 — 0.9.6 path-decoupling shipped 2026-06-03 (commits `bfba435`, `445bfb7`). Remaining:
22 + Session 5 — 0.9.7 launched 2026-06-03 via Sando through host → A → B (hotfix=true, skip-burn-in). Soak cleanup closed (launchplan_final §1). Remaining:
23 23
24 - - [ ] **Soak clock for `rm -rf /opt/makenotwork/`** — 0.9.6 cut over 2026-06-03 03:46 UTC, `rebuild-keys` ran the same minute, all 4 authorized_keys repointed at `/opt/mnw/current/mnw-admin`. Eligible for cleanup ~2026-06-10. Pre-cleanup checks per launchplan §7: `journalctl -u makenotwork --since "1 week ago" | grep /opt/makenotwork/` empty, plus the §7 "Then, cleanup" sublist (rm /opt/makenotwork, rm /opt/git after the duality decision, migrate backups dir, leave makenotwork shell as bash for sando).
25 - - [ ] **Remove live drop-in** `/etc/systemd/system/mnw-cli.service.d/fhs-git-path.conf` on prod — it added `ReadWritePaths=/var/lib/mnw` to fix the EROFS that broke every creator git push after Session 3 (mnw-cli's systemd namespace had `/opt/git` writable but not `/var/lib/mnw/git`). The unit file in `mnw-cli/deploy/mnw-cli.service` is now patched to include the path, so the drop-in becomes redundant next time `./mnw-cli/deploy/deploy.sh --config` runs. Until then both apply (harmless dupe).
26 - - [ ] **Discovered pre-existing prod bug, fixed live:** `/etc/mnw/makenotwork.env` was unreadable by the git user (mode 640 root:makenotwork), so any `mnw-admin git-auth` invocation via authorized_keys command= panicked with "DATABASE_URL must be set". Same was true of the legacy `/opt/makenotwork/.env` (had been silently broken on 0.9.5 too). Applied `setfacl u:git:r` on both env files and `setfacl u:git:x /etc/mnw` directly on prod; codified the ACL block (conditional on git user existing) in `bootstrap-node.sh`. Next bootstrap will set it automatically.
24 + - [x] **Soak cleanup eligible 2026-06-10 — shortened and shipped 2026-06-03.** Gate verified clean since the 06-03 02:53 migration boot. Removed `/opt/git` (99M, stale duplicate of `/var/lib/mnw/git`), `/opt/makenotwork` (177M, post-yara-relocation), `/opt/backups` (277M, root pg_backup output). 553M reclaimed. yara-rules relocated from `/opt/makenotwork/yara-rules` → real `/opt/mnw/yara-rules` (733 rules compiled fine from new path).
25 + - [x] **Backups rebuilt under `/var/lib/mnw/backups/<db>/`** (makenotwork + multithreaded, per-DB subdirs), per-user crons (03:00 + 03:05), offsite to astra `/opt/backups/mnw/<db>/` via Tailscale SSH `tag:prod → max@tag:testing` rule. `backup-puller` rrsync re-rooted at `/var/lib/mnw/backups`; sando `backup.source` updated to `ssh://backup-puller@alpha-west-1:2200/makenotwork/latest.sql.gz`; `/backup/fetch` verified 38MB matched prod size.
26 + - [x] **Pre-existing meta.git ownership drift fixed inline** — `mnw-cli:git` → `git:git` (tightened `safe.directory` was rejecting it). Surfaced by post-rm ls-remote regression test.
27 + - [ ] **Remove live drop-in** `/etc/systemd/system/mnw-cli.service.d/fhs-git-path.conf` on prod. The unit file in `mnw-cli/deploy/mnw-cli.service` is patched to include `ReadWritePaths=/var/lib/mnw`, so the drop-in becomes redundant next time `./mnw-cli/deploy/deploy.sh --config` runs. Until then both apply (harmless dupe).
27 28
28 29 Decision-gated (needs user input first):
29 30
@@ -55,8 +56,19 @@ Sando has zero automated tests today — daemon + TUI have been validated by run
55 56
56 57 55 tests passing as of 2026-05-31 (14 TUI + 41 daemon). Remaining gaps:
57 58
59 + - [x] `gates::reset_scratch` — verifies dropping every non-system schema (planted `foo` + `tower_sessions`, ran reset, asserted only `public` remains). Gated by `SANDO_TEST_PG_URL` env var so it skips on hosts without postgres. Run on pop-os with `SANDO_TEST_PG_URL=postgres:///sando_scratch?host=/var/run/postgresql cargo test`.
60 + - [x] `deploy::deploy_local` — copies multiple binaries (`PRIMARY`/`ADMIN`), swaps symlink atomically across two consecutive deploys, gc_local_releases keeps last N by mtime + handles missing dir + noop under threshold. `sh_quote` round-trip.
61 + - [x] `deploy::deploy_remote` failure path — against unroutable `192.0.2.1`, verifies clean ssh-attributed error (no panic / hang); ConnectTimeout bounds the test wallclock to ~10s. Plus `deploy_node` with `ssh_target="local"` short-circuits to symlink swap.
62 + - [x] `backup::fetch` URL parsing — extracted `parse_source` → `BackupSource` enum. 10 tests: file://, rsync://, ssh:// with/without port, multi-segment ssh path, non-numeric `:foo` colon treated as part of host (not port), and all malformed-input rejections (empty, scheme-only, ftp, no path on ssh, empty user@host).
63 + - [x] `events::emit` no-subscribers no-op; `emit_reaches_a_subscriber`; envelope serializes with flat `kind` field (locks the WS/TUI contract); `lagged_subscriber_observes_recv_error_lagged` exercises broadcast capacity.
58 64 - [ ] `events_ws` handler end-to-end — drive WS through a slow client, assert `{"kind":"lagged",...}` frame arrives. Possible (bind axum to ephemeral port + tungstenite client) but the bus-level lag detection is already locked in by `lagged_subscriber_observes_recv_error_lagged`. Diminishing returns vs effort. Deferred.
59 65 - [ ] `build` mutex behavior — requires real cargo or a slow stub. Treated as a manual checklist item under "TUI hands-on" instead. (Already validated by hand 2026-05-31.)
66 + - [x] `routes::confirm` — rejects when tier has no `current_version` (409 Conflict — surfaced that GateBlocked maps to 409 not 400, locked in), accepts + inserts a passing gate_runs row when set, 404 on unknown tier.
67 + - [x] `routes::promote` — refuses promote-to-first-tier (409), errors when neither body nor predecessor has a version, 404 when explicit version's `versions` row is missing.
68 + - [x] `unsatisfied_gates` — 6 tests: empty, failed-kind flagging, latest-row-wins (red→green flap clears), hotfix skips burn_in only, ignores other tiers/versions, **null `passed` treated as failing** (locks the in-flight-race safety property).
69 + - [x] `run_migrator` errors on missing migrations dir.
70 + - [x] sqlx migrations exercised via existing `sync` tests.
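
The `unsatisfied_gates` properties listed above (latest row wins, hotfix skips `burn_in` only, other tiers/versions ignored, null `passed` counts as failing) can be sketched as a pure function. This is a minimal illustration, not Sando's code: the `GateRun` struct, its field names, the `seq` ordering key, and the function signature are all assumptions.

```rust
use std::collections::BTreeMap;

/// One recorded gate run. Field names are hypothetical, not Sando's schema.
#[derive(Debug, Clone)]
pub struct GateRun {
    pub tier: String,
    pub version: String,
    pub gate_kind: String,
    /// `None` models a run that has started but not yet reported a result.
    pub passed: Option<bool>,
    /// Assumed ordering key (for example a row id); larger means more recent.
    pub seq: u64,
}

/// Returns the required gate kinds that are not currently green for
/// `tier` + `version`.
pub fn unsatisfied_gates(
    runs: &[GateRun],
    required: &[&str],
    tier: &str,
    version: &str,
    hotfix: bool,
) -> Vec<String> {
    // Keep only the most recent run per gate kind, ignoring other
    // tiers and versions entirely.
    let mut latest: BTreeMap<&str, &GateRun> = BTreeMap::new();
    for run in runs.iter().filter(|r| r.tier == tier && r.version == version) {
        let kind = run.gate_kind.as_str();
        let newer = latest.get(kind).map_or(true, |seen| run.seq > seen.seq);
        if newer {
            latest.insert(kind, run);
        }
    }

    required
        .iter()
        .copied()
        // A hotfix waives burn_in and nothing else.
        .filter(|kind| !(hotfix && *kind == "burn_in"))
        // A missing row and a NULL `passed` both count as unsatisfied.
        .filter(|kind| !matches!(latest.get(*kind).map(|r| r.passed), Some(Some(true))))
        .map(str::to_string)
        .collect()
}
```

Treating `None` the same as a failure is what keeps an in-flight run from unblocking a promote before its result lands.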
71 +
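
The `backup::fetch` URL cases in the list above (numeric port split off, non-numeric `:foo` kept as part of the host, malformed inputs rejected) could look roughly like this. Only the names `parse_source` and `BackupSource` come from the todo entry; the variants, fields, error type, and the choice to store the ssh path without its leading slash are assumptions.

```rust
/// Where a backup dump is fetched from. Variants and fields are hypothetical.
#[derive(Debug, PartialEq, Eq)]
pub enum BackupSource {
    File { path: String },
    Rsync { url: String },
    Ssh { user_host: String, port: Option<u16>, path: String },
}

pub fn parse_source(src: &str) -> Result<BackupSource, String> {
    let (scheme, rest) = src
        .split_once("://")
        .ok_or_else(|| format!("missing scheme: {src:?}"))?;
    if rest.is_empty() {
        return Err(format!("nothing after scheme: {src:?}"));
    }
    match scheme {
        "file" => Ok(BackupSource::File { path: rest.to_string() }),
        "rsync" => Ok(BackupSource::Rsync { url: src.to_string() }),
        "ssh" => parse_ssh(rest).ok_or_else(|| format!("malformed ssh source: {src:?}")),
        other => Err(format!("unsupported scheme: {other:?}")),
    }
}

fn parse_ssh(rest: &str) -> Option<BackupSource> {
    // Everything before the first '/' is the authority; the rest is the path.
    let (authority, path) = rest.split_once('/')?;
    if path.is_empty() {
        return None;
    }
    // Split off a trailing `:port` only when it parses as a number;
    // otherwise the colon stays part of the host.
    let (user_host, port) = match authority.rsplit_once(':') {
        Some((host, p)) => match p.parse::<u16>() {
            Ok(n) => (host, Some(n)),
            Err(_) => (authority, None),
        },
        None => (authority, None),
    };
    // Reject an empty authority or an empty side of `user@host`.
    let well_formed = match user_host.split_once('@') {
        Some((user, host)) => !user.is_empty() && !host.is_empty(),
        None => !user_host.is_empty(),
    };
    if !well_formed {
        return None;
    }
    Some(BackupSource::Ssh {
        user_host: user_host.to_string(),
        port,
        path: path.to_string(),
    })
}
```

Parsing into an enum up front means the fetch code can match on the variant instead of re-inspecting the string.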
60 72 ### End-to-end harness
61 73
62 74 - [ ] Single-binary smoke: spin up sandod against tmpdir config + a tmp postgres; push a fixture commit; assert the full pipeline (build → gates → MM tier_state advance) completes in under 30s. Run on CI for every sando PR.
@@ -64,6 +76,12 @@ Sando has zero automated tests today — daemon + TUI have been validated by run
64 76
65 77 ### TUI unit tests
66 78
79 + - [x] `format_event` — golden tests for build_ok, gate_done (pass+fail), backup_fetched, deploy_failed, unknown kind, malformed JSON.
80 + - [x] `ws_url_from`: `http://` → `ws://`, `https://` → `wss://`, only replaces scheme once, unknown scheme passes through.
81 + - [x] `Action::Display` impl produces `backup/fetch`, `promote/<tier>`, etc.
82 + - [x] `Shared::push_event` ring-buffer cap at 200; oldest entries drop in FIFO order.
83 + - [x] `truncate` short-string passthrough vs long-string ellipsis.
84 +
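
The `ws_url_from` rules above (swap the scheme, touch only the leading occurrence, pass unknown schemes through) fit in a few lines. A minimal sketch, assuming the function takes the daemon URL as `&str` and returns an owned `String`; the real signature is not shown in the diff.

```rust
/// Maps an HTTP(S) daemon URL to its WebSocket equivalent.
/// Only the leading scheme is rewritten; anything else is returned unchanged.
pub fn ws_url_from(daemon_url: &str) -> String {
    if let Some(rest) = daemon_url.strip_prefix("https://") {
        format!("wss://{rest}")
    } else if let Some(rest) = daemon_url.strip_prefix("http://") {
        format!("ws://{rest}")
    } else {
        // Unknown scheme: pass through untouched.
        daemon_url.to_string()
    }
}
```

Using `strip_prefix` rather than a global string replace is what guarantees a URL embedded in a query string is left alone.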
67 85 ---
68 86
69 87 Roadmap target: replace `server/deploy/deploy.sh` and astra-hosted `server/deploy/run-ci.sh` with Sando running on **pop-os**, gating Hetzner prod through testnot.work.
@@ -92,6 +110,12 @@ Read these to orient before working on Sando:
92 110
93 111 ## Phase 0 — pop-os bootstrap
94 112
113 + - [x] Provision `sando` system user on pop-os; lock down home dir; generate SSH keypair at `/srv/sando/.ssh/id_ed25519` for outbound deploys.
114 + - [x] Install scratch Postgres locally on pop-os; create `sando_scratch` role + DB used by `migration_dry_run`. (Owner of own DB; non-superuser.)
115 + - [x] Write systemd unit for `sandod` (long-run service, restart on failure, env from `/etc/sando/sando.env`). Installed at `/etc/systemd/system/sandod.service`.
116 + - [x] Write the production `sando.toml`; bare repo path under `/srv/sando/mnw.git`. Installed at `/etc/sando/sando.toml`; daemon config at `/etc/sando/sando-daemon.toml`.
117 + - [x] Install `sandod` binary at `/usr/local/bin/sandod`; enable + start the service. Live on `100.103.89.95:7766`; bare repo auto-bootstrapped at `/srv/sando/mnw.git`.
118 + - [x] Verify MNW server builds reproducibly on pop-os. `makenotwork` 0.8.12 built in 132s; sqlx online mode against `sando_scratch` postgres (sandod prep-resets all non-system schemas + applies all 133 MNW migrations before invoking cargo).
95 119 - [ ] Register sando pubkey with Hetzner prod (`deploy@alpha-west-1`) and testnot.work once that node exists. Pubkey: `ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIEK+vhpr1V8VnsEemN9x6tAA2S05kmv/mQ3eVgSXSkJ8 sando@pop-os`. (Moved to Phase 1 — not blocking Phase 0 exit.)
96 120
97 121 ### Phase 0 follow-ups (not blocking, but visible)
@@ -105,6 +129,12 @@ Read these to orient before working on Sando:
105 129
106 130 The MVP only deploys to `ssh_target=local`. Production needs real SSH/rsync.
107 131
132 + - [x] Implement `deploy::deploy_node` remote path: rsync staged binary to `<ssh_target>:<release_root>/releases/<version>/<bin_name>`, then `ssh <ssh_target>` does `mv -Tf` symlink swap + `sudo systemctl reload-or-restart <service>`. First real promote landed 2026-05-31: pop-os → testnot, version 0.8.12.
133 + - [x] Add `node.service_name` to `sando.toml` (default `makenotwork.service`).
134 + - [x] Bootstrap script for adding a fresh node: `MNW/sando/deploy/bootstrap-node.sh`. (See Phase 3 — node-bootstrap script for full details.)
135 + - [x] Garbage-collect old releases on the remote: keep last N=5 per node, sorted by mtime. Runs at end of each successful deploy (local + remote variants). Tied via `RELEASES_TO_KEEP` const.
136 + - [x] Handle `rsync` failure mid-deploy: leave the previous `current` symlink intact; mark `deploys.outcome = 'failed'`; do not advance `tier_state`. (Verified the routes.rs path; rsync runs before symlink swap so failure naturally leaves `current` untouched.)
137 +
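
The release GC rule above (keep the last N=5 by mtime) can be separated from filesystem access so it is easy to test. A sketch under assumptions: `RELEASES_TO_KEEP` is named in the todo, but `releases_to_remove` and its signature are hypothetical, and the caller is assumed to have collected each release directory's mtime already.

```rust
use std::path::PathBuf;
use std::time::SystemTime;

/// Number of releases retained per node (value taken from the todo entry).
pub const RELEASES_TO_KEEP: usize = 5;

/// Given `(release_dir, mtime)` pairs, returns the directories that fall
/// outside the newest `keep` entries and are therefore eligible for removal.
pub fn releases_to_remove(
    mut releases: Vec<(PathBuf, SystemTime)>,
    keep: usize,
) -> Vec<PathBuf> {
    // Newest first, so everything past index `keep` is stale.
    releases.sort_by(|a, b| b.1.cmp(&a.1));
    releases
        .into_iter()
        .skip(keep)
        .map(|(path, _)| path)
        .collect()
}
```

With fewer than `keep` entries, or none at all, the result is empty, which matches the "noop under threshold" case noted for `gc_local_releases`.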
108 138 ### Phase 1 — Track B: testnot live-app setup (NOT blocking Phase 2)
109 139
110 140 Sando's deploy machinery is done, but testnot's MNW runtime needs the rest before its `makenotwork.service` can stay up:
@@ -119,14 +149,33 @@ Sando's deploy machinery is done, but testnot's MNW runtime needs the rest befor
119 149
120 150 `migration_dry_run` is the load-bearing gate. It needs a real backup source, not a fixture.
121 151
152 + - [x] ~~Confirm astra's offsite replica writes a deterministic latest-link path.~~ Pivoted: pull direct from prod (`backup-puller@alpha-west-1:2200`, rrsync-locked to `/opt/makenotwork/backups/`). Astra offsite is separately broken — see carryover below.
153 + - [x] Wire the production `sando.toml` `backup.source` — `ssh://backup-puller@alpha-west-1:2200/latest.sql.gz` with `latest.sql.gz` as a hard link on prod.
154 + - [x] Schedule a daily `POST /backup/fetch` (systemd timer on pop-os). `sandod-backup-fetch.{service,timer}` in `MNW/sando/deploy/`. Runs daily at 04:00 UTC (one hour after prod's 03:00 UTC backup-db.sh). Service uses `EnvironmentFile=/etc/sando/sando.env` for `$SANDO_DAEMON`. Verified 2026-05-31: one-shot test pulled 36MB backup, recorded in `backups` table.
155 + - [x] First end-to-end `migration_dry_run` against a real prod backup. Passed 2026-05-31 for sha 4541ebc in 1.2s: restored 36MB dump + applied all 133 migrations cleanly. Sha eee96a7 correctly failed `migration_dry_run` because it lacked migrations 123-132 that prod has applied — exactly the prod-vs-repo drift the gate is designed to catch.
156 + - [x] Document the failure modes: `plans/migration-dryrun-failures.md`. Covers all 7 fail modes (no backup, scratch_url unset, scratch reset, restore, drift, checksum mismatch, content broken against prod data) with operator playbook.
157 + - [x] Decide retention on `backups` table. 30 days; pruned at end of `backup::fetch`. `DELETE FROM backups WHERE fetched_at < datetime('now', '-30 days')`.
158 +
122 159 ### Phase 2 carryovers / adjacent fires
123 160
124 161 - [ ] **Offsite backup sync from prod → astra still broken.** Diagnosed 2026-05-31: `sync-backup-offsite.sh` was never deployed to prod (`deploy.sh` gap when it was added). `makenotwork@prod` had no SSH key. Generated key + installed pubkey on `max@astra:~/.ssh/authorized_keys`, created `/opt/backups/mnw` on astra. **Blocked** on Tailscale ACL: astra runs only Tailscale SSH (no regular sshd on a bypass port), and the ACL denies `tag:tagged-devices` (alpha-west-1) → astra as user `max`. Needs ACL update in the Tailscale admin console, then deploy `sync-backup-offsite.sh` to `/opt/makenotwork/` and test. Makenotwork@prod pubkey: `ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAILzyQQ7pmBIZat8fABlpG/opwh4w5GhLIfkX2qxKxuT0 makenotwork@alpha-west-1`.
162 + - [x] **Prod backup `latest.sql.gz` hard link.** `backup-db.sh` now maintains `latest.sql.gz` atomically (`ln -f $LATEST.new && mv -Tf .new latest.sql.gz`). Deployed 2026-05-31; manual run verified (nlinks=2).
163 +
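
The `latest.sql.gz` maintenance above is the usual link-then-rename pattern: hard-link the fresh dump under a staging name, then rename the staging name over `latest.sql.gz` so readers never see a missing or partial file. The real implementation is the shell in `backup-db.sh`; the Rust below is only an illustration of the same two steps, and `publish_latest` and the staging file name are hypothetical.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Points `<dir>/latest.sql.gz` at `<dir>/<dump_name>` via a hard link,
/// replacing any previous `latest.sql.gz` in a single rename.
pub fn publish_latest(dir: &Path, dump_name: &str) -> io::Result<()> {
    let dump = dir.join(dump_name);
    let staged = dir.join("latest.sql.gz.new");
    let latest = dir.join("latest.sql.gz");

    // Equivalent of `ln -f`: clear a stale staging link if one is left over.
    match fs::remove_file(&staged) {
        Ok(()) => {}
        Err(e) if e.kind() == io::ErrorKind::NotFound => {}
        Err(e) => return Err(e),
    }
    fs::hard_link(&dump, &staged)?;

    // Equivalent of `mv -Tf`: rename replaces the destination atomically
    // when both names live on the same filesystem.
    fs::rename(&staged, &latest)
}
```

One caveat: on POSIX systems, renaming between two hard links to the same file is a successful no-op, so republishing an unchanged dump can leave the staging name behind until the next run clears it.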
125 164 ## Phase 3 — Parity with current `deploy.sh`
126 165
127 166 Decisions captured in `plans/config-artifacts.md`. Summary: Caddyfile / systemd unit / backup script / security configs all move to **one-time node-bootstrap**, not per-deploy. error-pages bake into binary (MNW PR) with sibling fallback. mnw-admin ships alongside server via `bin_names: Vec<String>`. Restart warning is Phase 5, prod-tier-only. Prod migrations: server self-applies on startup (`main.rs:73`), sando does not.
128 167
168 + - [x] **Caddyfile** — decided: bootstrap-only. Not per-deploy. (`plans/config-artifacts.md` §1.)
169 + - [x] **systemd unit** — decided: bootstrap-only. (§4.)
170 + - [x] **Backup script** — decided: bootstrap-only. (§6.)
171 + - [x] **Error pages** — short-term done: ship as release-dir sibling. `build_and_run_mm` `cp -a` from `worktree/server/deploy/error-pages/` into the staged release dir; deploy_node's rsync of the whole dir picks it up. Verified on testnot 2026-05-31. Long-term `include_dir!` bake-in still a separate MNW PR.
172 + - [x] **mnw-admin binary** — `cfg.bin_names: Vec<String>` (default `["server"]`, MNW uses `["makenotwork","mnw-admin"]`). `deploy_local` copies each from worktree's `target/release/<bin>`; `deploy_node` rsyncs the whole staged dir. `Config::primary_bin()` returns first entry for systemd reference. `versions.artifact_path` stores the primary; release dir is derived as `.parent()`. Verified on testnot 2026-05-31.
173 + - [x] **Security configs** — decided: bootstrap-only. (§5.)
129 174 - [ ] **Restart warning** — Phase 5, prod-tier only via `tier.restart_warning_seconds` in `sando.toml`; needs `CLI_SERVICE_TOKEN` in `/etc/sando/sando.env`. (§7.)
175 + - [x] **Cross-compile from macOS** — decided: retire after one sprint of testnot parity verification. Pop-os builds natively. (§8.)
176 + - [x] **Prod migrations** — decided: server self-applies on startup. Sando does NOT run them. `migration_dry_run` gate is the prod safety net. (§9.)
177 + - [x] **Node-bootstrap script** — `MNW/sando/deploy/bootstrap-node.sh`. Idempotent. Takes `SANDO_PUBKEY` (required), `BIN_NAME`, `SERVICE_NAME`, `SERVICE_USER`, `DEPLOY_ROOT` env. Installs base packages (rsync/ufw/fail2ban), optionally postgres/tailscale/caddy, creates deploy user + dirs + sudoers entry + systemd unit, sets up UFW. Deliberately does NOT touch Caddyfile content, certs, postgres role/db, or secrets — those are operator-decisions per-node. testnot was done by hand and matches roughly what the script produces. Test by re-running on the next node added (tier B Hetzner prod move or tier C).
178 +
130 179 ## Phase 4 — Cutover
131 180
132 181 Run Sando in parallel with `deploy.sh` until trust is built, then retire the old path.
@@ -142,6 +191,13 @@ Run Sando in parallel with `deploy.sh` until trust is built, then retire the old
142 191
143 192 The TUI polls. The MVP requires you to hand-insert a row for `manual_confirm`. Both are fine for one operator but rough.
144 193
194 + - [x] Build mutex: single-slot `AppState.active_build: Mutex<Option<AbortHandle>>`; newer `/rebuild` aborts any in-flight build. Cargo commands set `.kill_on_drop(true)` so abort propagates SIGKILL to cargo + rustc children. (Landed 2026-05-31 after observing two concurrent builds racing the scratch DB.)
195 + - [x] Implement `WS /events`: tail of gate starts/finishes, deploy events, build logs. Event enum in `daemon/src/events.rs`; `broadcast::channel(256)` in `AppState`; emit sites in build.rs, gates.rs, routes.rs (rebuild, promote, rollback, confirm, backup_fetch). Verified 2026-05-31: live JSON envelopes stream to a python `websockets` client.
196 + - [x] TUI: actions pane. `↑↓`/`jk` select tier; `p` promote (no body — defaults version); `R` rollback; `b` backup fetch; `c` manual_confirm. Action results land in the events log. Daemon URL via `$SANDO_DAEMON`. Built in `tui/src/main.rs` 2026-05-31.
197 + - [x] `POST /confirm/{tier}` endpoint — inserts `gate_runs` row with `passed=1, gate_kind='manual_confirm'` for the tier's `current_version`. Replaces hand-SQL workaround. Verified 2026-05-31 against tier `a`.
198 + - [x] TUI live log pane that follows the most recent build / gate run; backed by `WS /events`. 200-event ring buffer, human-formatted per kind. WS auto-reconnects every 3s. Header shows ws connection state.
199 + - [x] `POST /promote` body — `version` now optional; defaults to predecessor tier's `current_version`. (Unblocks the "promote what just baked" flow.)
200 +
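
The optional-`version` behavior of `POST /promote` reduces to a small resolution step: an explicit version wins, otherwise the predecessor tier's `current_version` is used, and the first tier has no predecessor to promote from. A sketch under assumptions: `Tier`, `PromoteError`, and `resolve_promote_version` are hypothetical names, tiers are assumed to be held in promotion order, and the lookup of the `versions` row (the 404 case in the test notes) is left out.

```rust
/// A deploy tier in promotion order. Hypothetical shape, not Sando's type.
pub struct Tier {
    pub name: String,
    pub current_version: Option<String>,
}

#[derive(Debug, PartialEq, Eq)]
pub enum PromoteError {
    /// Target tier is not configured.
    UnknownTier,
    /// The first tier has no predecessor (reported as 409 in the test notes).
    FirstTier,
    /// No version in the request body and none on the predecessor.
    NoVersion,
}

/// Picks the version to promote into `target`.
pub fn resolve_promote_version(
    tiers: &[Tier],
    target: &str,
    requested: Option<&str>,
) -> Result<String, PromoteError> {
    let idx = tiers
        .iter()
        .position(|t| t.name == target)
        .ok_or(PromoteError::UnknownTier)?;
    if idx == 0 {
        return Err(PromoteError::FirstTier);
    }
    if let Some(version) = requested {
        return Ok(version.to_string());
    }
    // Default: promote whatever the predecessor tier is currently running.
    tiers[idx - 1]
        .current_version
        .clone()
        .ok_or(PromoteError::NoVersion)
}
```

This default is what lets the TUI's `p` action send an empty body and still promote the version that just finished baking one tier down.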
145 201 ## Phase 6 — Monitoring + alerting
146 202
147 203 - [ ] Wire pop-os `/metrics` endpoint into the existing MNW Prometheus scrape config; record where the scrape config lives in `_meta/` or wherever monitoring already runs.