max / makenotwork
1 file changed, +60 insertions, -4 deletions
| @@ -19,11 +19,12 @@ Claude-only follow-ups (no user input needed; pick the next slice): | |||
| 19 | 19 | - Phase 4 prep — first Sando-only deploy to testnot (needs Track B — see below) | |
| 20 | 20 | - Sando test suite — see "Testing" section below; sandod and TUI have zero unit/integration tests today | |
| 21 | 21 | ||
| 22 | - | Session 4 — 0.9.6 path-decoupling shipped 2026-06-03 (commits `bfba435`, `445bfb7`). Remaining: | |
| 22 | + | Session 5 — 0.9.7 launched 2026-06-03 via Sando through host → A → B (hotfix=true, skip-burn-in). Soak cleanup closed (launchplan_final §1). Remaining: | |
| 23 | 23 | ||
| 24 | - | - [ ] **Soak clock for `rm -rf /opt/makenotwork/`** — 0.9.6 cut over 2026-06-03 03:46 UTC, `rebuild-keys` ran the same minute, all 4 authorized_keys repointed at `/opt/mnw/current/mnw-admin`. Eligible for cleanup ~2026-06-10. Pre-cleanup checks per launchplan §7: `journalctl -u makenotwork --since "1 week ago" | grep /opt/makenotwork/` empty, plus the §7 "Then, cleanup" sublist (rm /opt/makenotwork, rm /opt/git after the duality decision, migrate backups dir, leave makenotwork shell as bash for sando). | |
| 25 | - | - [ ] **Remove live drop-in** `/etc/systemd/system/mnw-cli.service.d/fhs-git-path.conf` on prod — it added `ReadWritePaths=/var/lib/mnw` to fix the EROFS that broke every creator git push after Session 3 (mnw-cli's systemd namespace had `/opt/git` writable but not `/var/lib/mnw/git`). The unit file in `mnw-cli/deploy/mnw-cli.service` is now patched to include the path, so the drop-in becomes redundant next time `./mnw-cli/deploy/deploy.sh --config` runs. Until then both apply (harmless dupe). | |
| 26 | - | - [ ] **Discovered pre-existing prod bug, fixed live:** `/etc/mnw/makenotwork.env` was unreadable by the git user (mode 640 root:makenotwork), so any `mnw-admin git-auth` invocation via authorized_keys command= panicked with "DATABASE_URL must be set". Same was true of the legacy `/opt/makenotwork/.env` (had been silently broken on 0.9.5 too). Applied `setfacl u:git:r` on both env files and `setfacl u:git:x /etc/mnw` directly on prod; codified the ACL block (conditional on git user existing) in `bootstrap-node.sh`. Next bootstrap will set it automatically. | |
| 24 | + | - [x] **Soak cleanup eligible 2026-06-10 — shortened and shipped 2026-06-03.** Gate verified clean since the 06-03 02:53 migration boot. Removed `/opt/git` (99M, stale duplicate of `/var/lib/mnw/git`), `/opt/makenotwork` (177M, post-yara-relocation), `/opt/backups` (277M, root pg_backup output). 553M reclaimed. yara-rules relocated from `/opt/makenotwork/yara-rules` → real `/opt/mnw/yara-rules` (733 rules compiled fine from new path). | |
| 25 | + | - [x] **Backups rebuilt under `/var/lib/mnw/backups/<db>/`** (makenotwork + multithreaded, per-DB subdirs), per-user crons (03:00 + 03:05), offsite to astra `/opt/backups/mnw/<db>/` via Tailscale SSH `tag:prod → max@tag:testing` rule. `backup-puller` rrsync re-rooted at `/var/lib/mnw/backups`; sando `backup.source` updated to `ssh://backup-puller@alpha-west-1:2200/makenotwork/latest.sql.gz`; `/backup/fetch` verified 38MB matched prod size. | |
| 26 | + | - [x] **Pre-existing meta.git ownership drift fixed inline** — `mnw-cli:git` → `git:git` (tightened `safe.directory` was rejecting it). Surfaced by post-rm ls-remote regression test. | |
| 27 | + | - [ ] **Remove live drop-in** `/etc/systemd/system/mnw-cli.service.d/fhs-git-path.conf` on prod. The unit file in `mnw-cli/deploy/mnw-cli.service` is patched to include `ReadWritePaths=/var/lib/mnw`, so the drop-in becomes redundant next time `./mnw-cli/deploy/deploy.sh --config` runs. Until then both apply (harmless dupe). | |
| 27 | 28 | ||
| 28 | 29 | Decision-gated (needs user input first): | |
| 29 | 30 | ||
| @@ -55,8 +56,19 @@ Sando has zero automated tests today — daemon + TUI have been validated by run | |||
| 55 | 56 | ||
| 56 | 57 | 55 tests passing as of 2026-05-31 (14 TUI + 41 daemon). Remaining gaps: | |
| 57 | 58 | ||
| 59 | + | - [x] `gates::reset_scratch` — verifies dropping every non-system schema (planted `foo` + `tower_sessions`, ran reset, asserted only `public` remains). Gated by `SANDO_TEST_PG_URL` env var so it skips on hosts without postgres. Run on pop-os with `SANDO_TEST_PG_URL=postgres:///sando_scratch?host=/var/run/postgresql cargo test`. | |
| 60 | + | - [x] `deploy::deploy_local` — copies multiple binaries (`PRIMARY`/`ADMIN`) and swaps the symlink atomically across two consecutive deploys. `gc_local_releases` keeps the last N by mtime, handles a missing dir, and is a no-op under the threshold. `sh_quote` round-trips. | |
| 61 | + | - [x] `deploy::deploy_remote` failure path — against unroutable `192.0.2.1`, verifies clean ssh-attributed error (no panic / hang); ConnectTimeout bounds the test wallclock to ~10s. Plus `deploy_node` with `ssh_target="local"` short-circuits to symlink swap. | |
| 62 | + | - [x] `backup::fetch` URL parsing — extracted `parse_source` → `BackupSource` enum. 10 tests: file://, rsync://, ssh:// with/without port, multi-segment ssh path, non-numeric `:foo` colon treated as part of host (not port), and all malformed-input rejections (empty, scheme-only, ftp, no path on ssh, empty user@host). | |
| 63 | + | - [x] `events::emit` no-subscribers no-op; `emit_reaches_a_subscriber`; envelope serializes with flat `kind` field (locks the WS/TUI contract); `lagged_subscriber_observes_recv_error_lagged` exercises broadcast capacity. | |
| 58 | 64 | - [ ] `events_ws` handler end-to-end — drive WS through a slow client, assert `{"kind":"lagged",...}` frame arrives. Possible (bind axum to ephemeral port + tungstenite client) but the bus-level lag detection is already locked in by `lagged_subscriber_observes_recv_error_lagged`. Diminishing returns vs effort. Deferred. | |
| 59 | 65 | - [ ] `build` mutex behavior — requires real cargo or a slow stub. Treated as a manual checklist item under "TUI hands-on" instead. (Already validated by hand 2026-05-31.) | |
| 66 | + | - [x] `routes::confirm` — rejects when tier has no `current_version` (409 Conflict — surfaced that GateBlocked maps to 409 not 400, locked in), accepts + inserts a passing gate_runs row when set, 404 on unknown tier. | |
| 67 | + | - [x] `routes::promote` — refuses promote-to-first-tier (409), errors when neither body nor predecessor has a version, 404 when explicit version's `versions` row is missing. | |
| 68 | + | - [x] `unsatisfied_gates` — 6 tests: empty, failed-kind flagging, latest-row-wins (red→green flap clears), hotfix skips burn_in only, ignores other tiers/versions, **null `passed` treated as failing** (locks the in-flight-race safety property). | |
| 69 | + | - [x] `run_migrator` errors on missing migrations dir. | |
| 70 | + | - [x] sqlx migrations exercised via existing `sync` tests. | |
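The `backup::fetch` parsing rules listed above (optional numeric port, non-numeric colon kept in the host, malformed inputs rejected) can be sketched as follows. This is a minimal illustration, not the daemon's code: the variant fields of `BackupSource`, the `String` error type, and storing the ssh path without its leading slash are all assumptions.

```rust
/// Hypothetical shape; the real `BackupSource` enum may carry different fields.
#[derive(Debug, PartialEq)]
pub enum BackupSource {
    File { path: String },
    Rsync { url: String },
    Ssh { target: String, port: Option<u16>, path: String },
}

pub fn parse_source(src: &str) -> Result<BackupSource, String> {
    if let Some(path) = src.strip_prefix("file://") {
        if path.is_empty() {
            return Err("file:// source has no path".to_string());
        }
        return Ok(BackupSource::File { path: path.to_string() });
    }
    if let Some(rest) = src.strip_prefix("rsync://") {
        if rest.is_empty() {
            return Err("rsync:// source has no host or path".to_string());
        }
        return Ok(BackupSource::Rsync { url: src.to_string() });
    }
    if let Some(rest) = src.strip_prefix("ssh://") {
        // Everything before the first '/' is the authority; the rest is the path.
        let (authority, path) = rest
            .split_once('/')
            .ok_or_else(|| "ssh:// source has no path".to_string())?;
        if path.is_empty() {
            return Err("ssh:// source has an empty path".to_string());
        }
        // Only a trailing `:<digits>` counts as a port; any other colon stays in the host.
        let (target, port) = match authority.rsplit_once(':') {
            Some((host, p)) if !p.is_empty() && p.bytes().all(|b| b.is_ascii_digit()) => {
                let port = p.parse::<u16>().map_err(|e| format!("bad port {p}: {e}"))?;
                (host, Some(port))
            }
            _ => (authority, None),
        };
        if target.is_empty() || target.starts_with('@') || target.ends_with('@') {
            return Err("ssh:// source has an empty user or host".to_string());
        }
        return Ok(BackupSource::Ssh {
            target: target.to_string(),
            port,
            path: path.to_string(),
        });
    }
    // Empty strings and unsupported schemes such as ftp:// end up here.
    Err(format!("unsupported backup source: {src:?}"))
}
```

Under these assumptions, `ssh://backup-puller@alpha-west-1:2200/makenotwork/latest.sql.gz` parses to an `Ssh` source with port 2200, while `ssh://u@host:foo/x` keeps `:foo` in the target and leaves the port unset.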
| 71 | + | ||
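The `unsatisfied_gates` properties in the list above (latest row wins, hotfix skips only `burn_in`, other tiers and versions are ignored, null `passed` counts as failing) could look roughly like this. It is a sketch under stated assumptions: the `GateRun` struct, the use of a row `id` to order runs, and treating a gate with no row at all as unsatisfied are illustrative choices, not taken from the daemon.

```rust
/// Hypothetical in-memory view of a `gate_runs` row.
#[derive(Debug, Clone)]
pub struct GateRun {
    pub id: i64, // assumed monotonically increasing; larger means newer
    pub tier: String,
    pub version: String,
    pub gate_kind: String,
    pub passed: Option<bool>, // None models a NULL `passed` (run still in flight)
}

/// Returns the required gate kinds that do not currently have a passing latest run.
pub fn unsatisfied_gates(
    required: &[&str],
    runs: &[GateRun],
    tier: &str,
    version: &str,
    hotfix: bool,
) -> Vec<String> {
    let mut out = Vec::new();
    for kind in required {
        // A hotfix skips the burn-in gate and nothing else.
        if hotfix && *kind == "burn_in" {
            continue;
        }
        // Only the newest row for this (tier, version, kind) matters.
        let latest = runs
            .iter()
            .filter(|r| r.tier == tier && r.version == version && r.gate_kind == *kind)
            .max_by_key(|r| r.id);
        // A missing row or a NULL `passed` is treated as failing.
        let ok = matches!(latest, Some(r) if r.passed == Some(true));
        if !ok {
            out.push((*kind).to_string());
        }
    }
    out
}
```

Treating `None` as failing is what gives the in-flight-race safety property: a promote that arrives while a gate is still running cannot slip through on an unfinished row.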
| 60 | 72 | ### End-to-end harness | |
| 61 | 73 | ||
| 62 | 74 | - [ ] Single-binary smoke: spin up sandod against tmpdir config + a tmp postgres; push a fixture commit; assert the full pipeline (build → gates → MM tier_state advance) completes in under 30s. Run on CI for every sando PR. | |
| @@ -64,6 +76,12 @@ Sando has zero automated tests today — daemon + TUI have been validated by run | |||
| 64 | 76 | ||
| 65 | 77 | ### TUI unit tests | |
| 66 | 78 | ||
| 79 | + | - [x] `format_event` — golden tests for build_ok, gate_done (pass+fail), backup_fetched, deploy_failed, unknown kind, malformed JSON. | |
| 80 | + | - [x] `ws_url_from`: `http://` → `ws://`, `https://` → `wss://`, only replaces scheme once, unknown scheme passes through. | |
| 81 | + | - [x] `Action::Display` impl produces `backup/fetch`, `promote/<tier>`, etc. | |
| 82 | + | - [x] `Shared::push_event` ring-buffer cap at 200; oldest entries drop in FIFO order. | |
| 83 | + | - [x] `truncate` short-string passthrough vs long-string ellipsis. | |
| 84 | + | ||
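The three TUI helpers tested above are small enough to sketch in full. The signatures below are assumptions (the real `Shared::push_event` is a method on shared state, and the real `truncate` may count bytes or use a different ellipsis); only the behaviours named in the list are modelled.

```rust
use std::collections::VecDeque;

/// Cap taken from the list above; the constant name is hypothetical.
pub const EVENT_CAP: usize = 200;

/// Maps the daemon's HTTP URL to its WebSocket equivalent.
/// Only the leading scheme is rewritten; unknown schemes pass through unchanged.
pub fn ws_url_from(http_url: &str) -> String {
    if let Some(rest) = http_url.strip_prefix("https://") {
        format!("wss://{rest}")
    } else if let Some(rest) = http_url.strip_prefix("http://") {
        format!("ws://{rest}")
    } else {
        http_url.to_string()
    }
}

/// Appends an event line, dropping the oldest entry once the cap is reached (FIFO).
pub fn push_event(log: &mut VecDeque<String>, line: String) {
    if log.len() >= EVENT_CAP {
        log.pop_front();
    }
    log.push_back(line);
}

/// Returns short strings unchanged; longer ones are cut to `max_chars`
/// characters including a trailing ellipsis.
pub fn truncate(s: &str, max_chars: usize) -> String {
    if s.chars().count() <= max_chars {
        return s.to_string();
    }
    let keep = max_chars.saturating_sub(1);
    let mut out: String = s.chars().take(keep).collect();
    out.push('…');
    out
}
```

Using `strip_prefix` instead of a general string replace is what guarantees the scheme is rewritten only once, even if the URL contains `http://` again later in its path.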
| 67 | 85 | --- | |
| 68 | 86 | ||
| 69 | 87 | Roadmap target: replace `server/deploy/deploy.sh` and astra-hosted `server/deploy/run-ci.sh` with Sando running on **pop-os**, gating Hetzner prod through testnot.work. | |
| @@ -92,6 +110,12 @@ Read these to orient before working on Sando: | |||
| 92 | 110 | ||
| 93 | 111 | ## Phase 0 — pop-os bootstrap | |
| 94 | 112 | ||
| 113 | + | - [x] Provision `sando` system user on pop-os; lock down home dir; generate SSH keypair at `/srv/sando/.ssh/id_ed25519` for outbound deploys. | |
| 114 | + | - [x] Install scratch Postgres locally on pop-os; create `sando_scratch` role + DB used by `migration_dry_run`. (Owner of own DB; non-superuser.) | |
| 115 | + | - [x] Write systemd unit for `sandod` (long-run service, restart on failure, env from `/etc/sando/sando.env`). Installed at `/etc/systemd/system/sandod.service`. | |
| 116 | + | - [x] Write the production `sando.toml`; bare repo path under `/srv/sando/mnw.git`. Installed at `/etc/sando/sando.toml`; daemon config at `/etc/sando/sando-daemon.toml`. | |
| 117 | + | - [x] Install `sandod` binary at `/usr/local/bin/sandod`; enable + start the service. Live on `100.103.89.95:7766`; bare repo auto-bootstrapped at `/srv/sando/mnw.git`. | |
| 118 | + | - [x] Verify MNW server builds reproducibly on pop-os. `makenotwork` 0.8.12 built in 132s; sqlx online mode against `sando_scratch` postgres (sandod prep-resets all non-system schemas + applies all 133 MNW migrations before invoking cargo). | |
| 95 | 119 | - [ ] Register sando pubkey with Hetzner prod (`deploy@alpha-west-1`) and testnot.work once that node exists. Pubkey: `ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIEK+vhpr1V8VnsEemN9x6tAA2S05kmv/mQ3eVgSXSkJ8 sando@pop-os`. (Moved to Phase 1 — not blocking Phase 0 exit.) | |
| 96 | 120 | ||
| 97 | 121 | ### Phase 0 follow-ups (not blocking, but visible) | |
| @@ -105,6 +129,12 @@ Read these to orient before working on Sando: | |||
| 105 | 129 | ||
| 106 | 130 | The MVP only deploys to `ssh_target=local`. Production needs real SSH/rsync. | |
| 107 | 131 | ||
| 132 | + | - [x] Implement `deploy::deploy_node` remote path: rsync staged binary to `<ssh_target>:<release_root>/releases/<version>/<bin_name>`, then `ssh <ssh_target>` does `mv -Tf` symlink swap + `sudo systemctl reload-or-restart <service>`. First real promote landed 2026-05-31: pop-os → testnot, version 0.8.12. | |
| 133 | + | - [x] Add `node.service_name` to `sando.toml` (default `makenotwork.service`). | |
| 134 | + | - [x] Bootstrap script for adding a fresh node: `MNW/sando/deploy/bootstrap-node.sh`. (See Phase 3 — node-bootstrap script for full details.) | |
| 135 | + | - [x] Garbage-collect old releases on the remote: keep last N=5 per node, sorted by mtime. Runs at end of each successful deploy (local + remote variants). N comes from the `RELEASES_TO_KEEP` const. | |
| 136 | + | - [x] Handle `rsync` failure mid-deploy: leave the previous `current` symlink intact; mark `deploys.outcome = 'failed'`; do not advance `tier_state`. (Verified the routes.rs path; rsync runs before symlink swap so failure naturally leaves `current` untouched.) | |
| 137 | + | ||
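The keep-last-N-by-mtime rule from the garbage-collection item above reduces to a pure selection step, sketched here. The function name, the `(name, mtime)` tuple input, and the name-based tie-break are assumptions for illustration; the real code reads release directories from disk or over ssh.

```rust
/// Value taken from the list above (N=5).
pub const RELEASES_TO_KEEP: usize = 5;

/// Given (release name, mtime in seconds) pairs, returns the names to delete:
/// everything except the `keep` most recently modified releases.
pub fn releases_to_prune(mut releases: Vec<(String, u64)>, keep: usize) -> Vec<String> {
    // Newest first; ties are broken by name so the result is deterministic.
    releases.sort_by(|a, b| b.1.cmp(&a.1).then_with(|| b.0.cmp(&a.0)));
    releases
        .into_iter()
        .skip(keep)
        .map(|(name, _)| name)
        .collect()
}
```

With fewer than `keep` releases, or none at all, the result is empty, which matches the "noop under threshold" and "missing dir" cases described for the local variant.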
| 108 | 138 | ### Phase 1 — Track B: testnot live-app setup (NOT blocking Phase 2) | |
| 109 | 139 | ||
| 110 | 140 | Sando's deploy machinery is done, but testnot's MNW runtime needs the rest before its `makenotwork.service` can stay up: | |
| @@ -119,14 +149,33 @@ Sando's deploy machinery is done, but testnot's MNW runtime needs the rest befor | |||
| 119 | 149 | ||
| 120 | 150 | `migration_dry_run` is the load-bearing gate. It needs a real backup source, not a fixture. | |
| 121 | 151 | ||
| 152 | + | - [x] ~~Confirm astra's offsite replica writes a deterministic latest-link path.~~ Pivoted: pull direct from prod (`backup-puller@alpha-west-1:2200`, rrsync-locked to `/opt/makenotwork/backups/`). Astra offsite is separately broken — see carryover below. | |
| 153 | + | - [x] Wire the production `sando.toml` `backup.source` — `ssh://backup-puller@alpha-west-1:2200/latest.sql.gz` with `latest.sql.gz` as a hard link on prod. | |
| 154 | + | - [x] Schedule a daily `POST /backup/fetch` (systemd timer on pop-os). `sandod-backup-fetch.{service,timer}` in `MNW/sando/deploy/`. Runs daily at 04:00 UTC (one hour after prod's 03:00 UTC backup-db.sh). Service uses `EnvironmentFile=/etc/sando/sando.env` for `$SANDO_DAEMON`. Verified 2026-05-31: one-shot test pulled 36MB backup, recorded in `backups` table. | |
| 155 | + | - [x] First end-to-end `migration_dry_run` against a real prod backup. Passed 2026-05-31 for sha 4541ebc in 1.2s: restored 36MB dump + applied all 133 migrations cleanly. Sha eee96a7 correctly failed `migration_dry_run` because it lacked migrations 123-132 that prod has applied — exactly the prod-vs-repo drift the gate is designed to catch. | |
| 156 | + | - [x] Document the failure modes: `plans/migration-dryrun-failures.md`. Covers all 7 fail modes (no backup, scratch_url unset, scratch reset, restore, drift, checksum mismatch, content broken against prod data) with operator playbook. | |
| 157 | + | - [x] Decide retention on `backups` table. 30 days; pruned at end of `backup::fetch`. `DELETE FROM backups WHERE fetched_at < datetime('now', '-30 days')`. | |
| 158 | + | ||
| 122 | 159 | ### Phase 2 carryovers / adjacent fires | |
| 123 | 160 | ||
| 124 | 161 | - [ ] **Offsite backup sync from prod → astra still broken.** Diagnosed 2026-05-31: `sync-backup-offsite.sh` was never deployed to prod (`deploy.sh` gap when it was added). `makenotwork@prod` had no SSH key. Generated key + installed pubkey on `max@astra:~/.ssh/authorized_keys`, created `/opt/backups/mnw` on astra. **Blocked** on Tailscale ACL: astra runs only Tailscale SSH (no regular sshd on a bypass port), and the ACL denies `tag:tagged-devices` (alpha-west-1) → astra as user `max`. Needs ACL update in the Tailscale admin console, then deploy `sync-backup-offsite.sh` to `/opt/makenotwork/` and test. Makenotwork@prod pubkey: `ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAILzyQQ7pmBIZat8fABlpG/opwh4w5GhLIfkX2qxKxuT0 makenotwork@alpha-west-1`. | |
| 162 | + | - [x] **Prod backup `latest.sql.gz` hard link.** `backup-db.sh` now maintains `latest.sql.gz` atomically (`ln -f $LATEST.new && mv -Tf .new latest.sql.gz`). Deployed 2026-05-31; manual run verified (nlinks=2). | |
| 163 | + | ||
| 125 | 164 | ## Phase 3 — Parity with current `deploy.sh` | |
| 126 | 165 | ||
| 127 | 166 | Decisions captured in `plans/config-artifacts.md`. Summary: Caddyfile / systemd unit / backup script / security configs all move to **one-time node-bootstrap**, not per-deploy. error-pages bake into binary (MNW PR) with sibling fallback. mnw-admin ships alongside server via `bin_names: Vec<String>`. Restart warning is Phase 5, prod-tier-only. Prod migrations: server self-applies on startup (`main.rs:73`), sando does not. | |
| 128 | 167 | ||
| 168 | + | - [x] **Caddyfile** — decided: bootstrap-only. Not per-deploy. (`plans/config-artifacts.md` §1.) | |
| 169 | + | - [x] **systemd unit** — decided: bootstrap-only. (§4.) | |
| 170 | + | - [x] **Backup script** — decided: bootstrap-only. (§6.) | |
| 171 | + | - [x] **Error pages** — short-term done: ship as release-dir sibling. `build_and_run_mm` `cp -a` from `worktree/server/deploy/error-pages/` into the staged release dir; deploy_node's rsync of the whole dir picks it up. Verified on testnot 2026-05-31. Long-term `include_dir!` bake-in still a separate MNW PR. | |
| 172 | + | - [x] **mnw-admin binary** — `cfg.bin_names: Vec<String>` (default `["server"]`, MNW uses `["makenotwork","mnw-admin"]`). `deploy_local` copies each from worktree's `target/release/<bin>`; `deploy_node` rsyncs the whole staged dir. `Config::primary_bin()` returns first entry for systemd reference. `versions.artifact_path` stores the primary; release dir is derived as `.parent()`. Verified on testnot 2026-05-31. | |
| 173 | + | - [x] **Security configs** — decided: bootstrap-only. (§5.) | |
| 129 | 174 | - [ ] **Restart warning** — Phase 5, prod-tier only via `tier.restart_warning_seconds` in `sando.toml`; needs `CLI_SERVICE_TOKEN` in `/etc/sando/sando.env`. (§7.) | |
| 175 | + | - [x] **Cross-compile from macOS** — decided: retire after one sprint of testnot parity verification. Pop-os builds natively. (§8.) | |
| 176 | + | - [x] **Prod migrations** — decided: server self-applies on startup. Sando does NOT run them. `migration_dry_run` gate is the prod safety net. (§9.) | |
| 177 | + | - [x] **Node-bootstrap script** — `MNW/sando/deploy/bootstrap-node.sh`. Idempotent. Takes `SANDO_PUBKEY` (required), `BIN_NAME`, `SERVICE_NAME`, `SERVICE_USER`, `DEPLOY_ROOT` env. Installs base packages (rsync/ufw/fail2ban), optionally postgres/tailscale/caddy, creates deploy user + dirs + sudoers entry + systemd unit, sets up UFW. Deliberately does NOT touch Caddyfile content, certs, postgres role/db, or secrets; those are per-node operator decisions. testnot was set up by hand and roughly matches what the script produces. Test by re-running the script on the next node added (tier B Hetzner prod move or tier C). | |
| 178 | + | ||
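The `bin_names` arrangement described in the mnw-admin item above (first entry is the primary binary, `artifact_path` stores the primary, release dir is its parent) can be sketched like this. The struct layout, the `Option` return types, and the `artifact_path` helper are assumptions; only the default `["server"]`, the MNW value, and the `<release_root>/releases/<version>/<bin_name>` layout come from this plan.

```rust
use std::path::{Path, PathBuf};

/// Hypothetical slice of the sando config; the real `Config` has more fields.
pub struct Config {
    pub bin_names: Vec<String>,
}

impl Default for Config {
    fn default() -> Self {
        Config { bin_names: vec!["server".to_string()] }
    }
}

impl Config {
    /// The first entry is the binary the systemd unit refers to.
    pub fn primary_bin(&self) -> Option<&str> {
        self.bin_names.first().map(String::as_str)
    }
}

/// Path recorded for a version: `<release_root>/releases/<version>/<primary bin>`.
pub fn artifact_path(release_root: &Path, version: &str, cfg: &Config) -> Option<PathBuf> {
    let bin = cfg.primary_bin()?;
    Some(release_root.join("releases").join(version).join(bin))
}

/// The release dir is not stored separately; it is the artifact's parent directory.
pub fn release_dir(artifact_path: &Path) -> Option<&Path> {
    artifact_path.parent()
}
```

Deriving the release dir from the stored artifact path keeps a single source of truth per version, so secondary binaries such as `mnw-admin` are found as siblings of the primary.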
| 130 | 179 | ## Phase 4 — Cutover | |
| 131 | 180 | ||
| 132 | 181 | Run Sando in parallel with `deploy.sh` until trust is built, then retire the old path. | |
| @@ -142,6 +191,13 @@ Run Sando in parallel with `deploy.sh` until trust is built, then retire the old | |||
| 142 | 191 | ||
| 143 | 192 | The TUI polls. The MVP requires you to hand-insert a row for `manual_confirm`. Both are fine for one operator but rough. | |
| 144 | 193 | ||
| 194 | + | - [x] Build mutex: single-slot `AppState.active_build: Mutex<Option<AbortHandle>>`; newer `/rebuild` aborts any in-flight build. Cargo commands set `.kill_on_drop(true)` so abort propagates SIGKILL to cargo + rustc children. (Landed 2026-05-31 after observing two concurrent builds racing the scratch DB.) | |
| 195 | + | - [x] Implement `WS /events`: tail of gate starts/finishes, deploy events, build logs. Event enum in `daemon/src/events.rs`; `broadcast::channel(256)` in `AppState`; emit sites in build.rs, gates.rs, routes.rs (rebuild, promote, rollback, confirm, backup_fetch). Verified 2026-05-31: live JSON envelopes stream to a python `websockets` client. | |
| 196 | + | - [x] TUI: actions pane. `↑↓`/`jk` select tier; `p` promote (no body — defaults version); `R` rollback; `b` backup fetch; `c` manual_confirm. Action results land in the events log. Daemon URL via `$SANDO_DAEMON`. Built in `tui/src/main.rs` 2026-05-31. | |
| 197 | + | - [x] `POST /confirm/{tier}` endpoint — inserts `gate_runs` row with `passed=1, gate_kind='manual_confirm'` for the tier's `current_version`. Replaces hand-SQL workaround. Verified 2026-05-31 against tier `a`. | |
| 198 | + | - [x] TUI live log pane that follows the most recent build / gate run; backed by `WS /events`. 200-event ring buffer, human-formatted per kind. WS auto-reconnects every 3s. Header shows ws connection state. | |
| 199 | + | - [x] `POST /promote` body — `version` now optional; defaults to predecessor tier's `current_version`. (Unblocks the "promote what just baked" flow.) | |
| 200 | + | ||
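The optional-`version` behaviour of `POST /promote` described above, together with the refusal cases covered by the `routes::promote` tests, can be sketched as a pure resolution step. The `Tier` struct, the `PromoteError` variants, and passing known versions as a slice are assumptions; the real handler works against the database and maps errors to HTTP status codes.

```rust
/// Hypothetical error set; the plan mentions 409 for first-tier promotes and 404 for a missing version row.
#[derive(Debug, PartialEq)]
pub enum PromoteError {
    UnknownTier,
    FirstTier,
    NoVersion,
    VersionNotFound,
}

/// Hypothetical view of one configured tier, in promotion order.
pub struct Tier {
    pub name: String,
    pub current_version: Option<String>,
}

/// Picks the version to promote into `target`.
/// An explicit body version wins; otherwise the predecessor tier's current version is used.
pub fn resolve_promote_version(
    tiers: &[Tier],
    target: &str,
    body_version: Option<&str>,
    known_versions: &[&str],
) -> Result<String, PromoteError> {
    let idx = tiers
        .iter()
        .position(|t| t.name == target)
        .ok_or(PromoteError::UnknownTier)?;
    // The first tier has no predecessor to promote from.
    if idx == 0 {
        return Err(PromoteError::FirstTier);
    }
    match body_version {
        Some(v) => {
            // An explicit version must already have a `versions` row.
            if !known_versions.iter().any(|k| *k == v) {
                return Err(PromoteError::VersionNotFound);
            }
            Ok(v.to_string())
        }
        None => tiers[idx - 1]
            .current_version
            .clone()
            .ok_or(PromoteError::NoVersion),
    }
}
```

With tiers `a` then `b` and `a` currently on 0.9.7, a body-less promote to `b` resolves to 0.9.7, which is the "promote what just baked" flow.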
| 145 | 201 | ## Phase 6 — Monitoring + alerting | |
| 146 | 202 | ||
| 147 | 203 | - [ ] Wire pop-os `/metrics` endpoint into the existing MNW Prometheus scrape config; record where the scrape config lives in `_meta/` or wherever monitoring already runs. |