# Session 3 — first sando-driven prod deploy

Captured 2026-06-03 after the cutover. Resolves §6.5 step 8 of `launchplan_final.md`: first full sando deploy to Hetzner prod, replacing `deploy.sh` as the live deploy path.

Status: **complete 2026-06-03.** Prod runs `makenotwork` 0.9.5 (sha `f0970b8`) from `/opt/mnw/current/`, deployed via `POST /promote/b {"hotfix":true}` from sandod on fw13. Outage window 3m25s (02:50:33 → 02:53:58 UTC). All features green. See §F for outcomes and §G for the four hardcoded paths that block the eventual `rm -rf /opt/makenotwork/`.

## Background — Session 1 set the layout, Session 2 proved it on testnot, Session 3 cut prod over

Session 1 redesigned the on-disk layout (`/opt/mnw/releases/<v>/` + `current` symlink; `/etc/mnw/makenotwork.env`; `/var/lib/mnw/` for state) and shipped the sando-side code that produces the full versioned bundle (binaries + static + docs + error-pages + assumptions). Session 2 reprovisioned testnot under that layout; the first remote deploy of the full bundle landed cleanly after three small gotchas (sqlx URL form, pg_ident map, `ASSUMPTIONS_PATH` mismatch — all logged in `launchplan_final.md` §6.9).

Session 3 is the real-stakes one: prod was on 0.9.1 via `deploy.sh`, and `/opt/makenotwork/` had eight months of accreted state (885M of backups, .env, yara-rules, ssh dir, rustdoc, sudoers entries, cron jobs, Caddyfile references). The Session 1 plan enumerated some of the move sequence but understated the surface area; the actual cutover surfaced several things worth documenting so the next major reprovision (or a disaster-recovery rebuild) doesn't rediscover them.

## A. Inventory taken before any prod write

`/opt/makenotwork/` contents (`makenotwork:makenotwork` unless noted):

- `makenotwork`, `mnw-admin` — 0.9.1 binaries (`root:root`)
- `.env` (110 lines), 5× `.env.bak.*` files (`root:root`)
- `docs/`, `static/`, `error-pages/` — content (will be replaced by release bundle)
- `backups/` — 885M
- `yara-rules/` — 8.5M compiled, `root:root`
- `yara-rules-src/` — upstream YARA sources (compiled to `yara-rules/`), `root:root`
- `rustdoc/` — generated docs, `501:staff` (uploaded from Mac via `deploy.sh`)
- `ssh/` — `known_hosts` for build runner, `root:root`
- `backup-db.sh` — cron'd daily at 03:00 UTC from `makenotwork`'s crontab
- `deploy/` — `deploy.sh` staging area, `root:root`

Other prod state in play:

- `/opt/git/` — 99M, `git:git`. Both git user's home (`/etc/passwd` says `git:x:995:986::/opt/git:/bin/sh`) *and* the GIT_REPOS_PATH target. Conflating these turns out to matter (§D, §G.3).
- `/etc/caddy/Caddyfile` — three `root * /opt/makenotwork/error-pages` lines.
- `/etc/sudoers.d/mnw-git-ssh` — `makenotwork ALL=(git) NOPASSWD: /opt/makenotwork/mnw-admin rebuild-keys`.
- `/etc/sudoers.d/mnw-cli-git` — `mnw-cli ALL=(git) NOPASSWD: /usr/bin/git-*, /usr/bin/tee, /usr/bin/chmod`. No /opt path references; left alone.
- `makenotwork` user crontab: `0 3 * * * /opt/makenotwork/backup-db.sh >> /opt/makenotwork/backups/backup.log 2>&1`.
- Root crontab: `0 3 * * * /opt/backups/pg_backup.sh >> /var/log/pg_backup.log 2>&1` — unrelated, left alone.

## B. Pre-flight (no prod impact)

1. **`sando.toml` tier B fixed.** Was `deploy@prod-1.makenot.work` (NXDOMAIN, no port). Now `makenotwork@alpha-west-1` with port handling via `~sando/.ssh/config` Host block. Chose to keep service user as `makenotwork` rather than introduce a `deploy` user — avoids chowning 885M of backups and redoing pg peer auth that's been stable for months. The same reasoning applies to a hypothetical tier C: keep the existing user, don't introduce a new one for cosmetic uniformity with testnot.
2. **Sando pubkey installed** in `/home/makenotwork/.ssh/authorized_keys` (mode 0600, owned by `makenotwork`).
3. **`chsh -s /bin/bash makenotwork`** — the login shell was `/usr/sbin/nologin`. SSH sessions were being rejected by the shell; key auth itself was not failing. Worth detecting/fixing in `bootstrap-node.sh` for future provisions where someone has hardened the runtime user.
4. **`/srv/sando/.ssh/config`** Host block for port 2200; `known_hosts` seeded via `ssh-keyscan -p 2200`.
5. **Dry-run rsync** from sando → prod's `/opt/mnw/releases/_probe/` succeeded (after `bootstrap-node.sh` created `/opt/mnw/`).
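
The Host block from steps 1 and 4 could be generated along these lines. This is a minimal sketch, not the block that was installed: `ssh_host_block` is a hypothetical helper, and the address in the usage comment is a documentation-range placeholder because the node's real address is not recorded here.

```bash
#!/usr/bin/env bash
# Sketch: print an ssh_config Host block so sando.toml can say "user@alias"
# and leave the non-default port to ssh itself.
# ssh_host_block is a hypothetical name; the HostName value is supplied by the caller.
ssh_host_block() {   # $1 = alias, $2 = hostname or IP, $3 = port, $4 = user
  printf 'Host %s\n    HostName %s\n    Port %s\n    User %s\n' "$1" "$2" "$3" "$4"
}

# Usage (203.0.113.10 is a placeholder, not the real node):
#   ssh_host_block alpha-west-1 203.0.113.10 2200 makenotwork >> /srv/sando/.ssh/config
#   ssh-keyscan -p 2200 203.0.113.10 >> /srv/sando/.ssh/known_hosts
```

Keeping the port in the Host block means the tier definition stays a plain `user@alias`, which matches how tier B is written in `sando.toml`.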

## C. Cutover sequence (3m25s outage)

In order, with the exact reason each step exists:

1. **`systemctl stop makenotwork`** — 02:50:33 UTC. Outage window starts.
2. **Backups taken**: `/etc/systemd/system/makenotwork.service → /root/makenotwork.service.bak-pre-cutover`; `/opt/makenotwork/.env → /root/dotenv.bak-pre-cutover`; Caddyfile, sudoers, crontab also backed up to `/root/*.bak-pre-cutover`. Rollback path for any step failing before service restart.
3. **`bootstrap-node.sh`** with `SERVICE_USER=makenotwork SANDO_PUBKEY=… INSTALL_POSTGRES=0 INSTALL_CADDY=0 INSTALL_TAILSCALE=0 ENABLE_FIREWALL=0` — postgres/caddy/tailscale/UFW already configured on prod, don't touch. Created `/opt/mnw/`, `/etc/mnw/`, `/var/lib/mnw/`, the new systemd unit, the unused `deploy` user (harmless), the sudoers entry for `deploy`. The new unit references `EnvironmentFile=/etc/mnw/makenotwork.env` and `ReadWritePaths=/var/lib/mnw`, with `RestartPreventExitStatus=2` (MNW server convention: exit 2 = migration failure, don't crashloop).
4. **`cp /opt/makenotwork/.env /etc/mnw/makenotwork.env`** (copy, not move — original stays for one-week rollback). Then `chown root:makenotwork` and `chmod 0640` on the copy. Then `sed` rewrites of `DOCS_PATH`, `ASSUMPTIONS_PATH`, `YARA_RULES_DIR`, `GIT_REPOS_PATH` for the new layout. `HOST`, `PORT`, `DATABASE_URL`, `HOST_URL` unchanged.
5. **`ln -s /opt/makenotwork/yara-rules /opt/mnw/yara-rules`** — yara-rules is operator-managed (independent update cadence), not in the release bundle (Session 1 layout principle: category #3). The symlink lets the new env's `YARA_RULES_DIR=/opt/mnw/yara-rules` continue to resolve. When `/opt/makenotwork/` is eventually removed, the rules dir moves to a permanent path (probably `/var/lib/mnw/yara-rules` or `/etc/mnw/yara-rules`) and the symlink retargets.
6. **`rsync -aHX /opt/git/ /var/lib/mnw/git/`** — preserves `git:git` ownership (`-a`), hard links (`-H`), and extended attributes (`-X`). `chmod 0755 /var/lib/mnw` so the git user can traverse (the default was 0750 `makenotwork:makenotwork`, which blocked `git-receive-pack`, running as git, from reaching the repos).
7. **Caddyfile rewrite**: `sed -i 's|/opt/makenotwork/error-pages|/opt/mnw/current/error-pages|g'`. `caddy validate` before reload; `systemctl reload caddy`.
8. **Sudoers rewrite**: same sed pattern on `/etc/sudoers.d/mnw-git-ssh`; `visudo -c -f` to validate.
9. **`systemctl daemon-reload`** to pick up the new unit.
10. **`systemctl restart sandod`** on fw13 — sandod caches `sando.toml` at startup; the new tier B target wouldn't have taken effect without this. **First `POST /promote/b` failed with NXDOMAIN against the stale `prod-1.makenot.work` because sandod hadn't been restarted yet.** Fixed by restarting sandod and re-promoting.
11. **`POST /promote/b {"hotfix":true}`**: `hotfix: true` bypasses the 48h burn-in on tier A (which had just promoted to 0.9.5 ~15 min prior; burn-in not yet elapsed). Sando rsync'd the 161MB bundle to `/opt/mnw/releases/0.9.5/`, swapped the `current` symlink, called `systemctl reload-or-restart makenotwork.service`.
12. **Service up 02:53:55 UTC.** Outage window ends 02:53:58 once health serves 200. 733 YARA rules compiled, all integrations (S3, Stripe, MT, WAM, git, scanner, custom domain cache) live.
13. **External smoke checks**: `/`, `/login`, `/pricing`, `/docs`, `/docs/economics`, `/docs/roadmap`, `/docs/tiers` — all 200.
14. **`rebuild-keys` to regenerate `/opt/git/.ssh/authorized_keys`**: `mnw-admin` run standalone does not pick up the new env file. It calls `dotenvy` on the hardcoded `/opt/makenotwork/.env` (mode 0600 `makenotwork:makenotwork`), which git cannot read. Worked around by sourcing the env as root, then `sudo -u git -E`. **Regenerated keys still contain `command="/opt/makenotwork/mnw-admin git-auth ..."`** — see §G.1.
15. **Git push test**: `git ls-remote git@ssh.makenot.work:max/meta.git` returns refs cleanly. Cutover verified end-to-end.
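
The four env rewrites in step 4 can be sketched as a stdin-to-stdout filter. This is a sketch under assumptions, not the commands run during the cutover: `rewrite_env` is a hypothetical helper, and the `DOCS_PATH` target is a guess at the new layout. The other three targets are the values quoted elsewhere in this note.

```bash
#!/usr/bin/env bash
# Sketch: rewrite only the four layout-dependent keys and pass every other
# line (HOST, PORT, DATABASE_URL, HOST_URL, ...) through untouched.
# DOCS_PATH's target is an assumption; the other three appear in this document.
rewrite_env() {
  sed \
    -e 's|^DOCS_PATH=.*|DOCS_PATH=/opt/mnw/current/docs|' \
    -e 's|^ASSUMPTIONS_PATH=.*|ASSUMPTIONS_PATH=/opt/mnw/current/docs/assumptions.toml|' \
    -e 's|^YARA_RULES_DIR=.*|YARA_RULES_DIR=/opt/mnw/yara-rules|' \
    -e 's|^GIT_REPOS_PATH=.*|GIT_REPOS_PATH=/var/lib/mnw/git|'
}

# Usage (reads the old file, writes the new one, leaves the original in place):
#   rewrite_env < /opt/makenotwork/.env > /etc/mnw/makenotwork.env
```

Writing to a new file rather than editing in place keeps the old `.env` byte-identical, which is what the one-week rollback path relies on.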

## D. What stayed in place (intentional)

- `/opt/makenotwork/` — full contents, untouched. Soak rollback path: stop new unit, swap systemd unit back, start old binary. Plan: `rm -rf` after a week, post-0.9.6 deploy (see §G).
- `/opt/git/` — untouched. Git user's `/etc/passwd` home; mnw-admin's regenerated `authorized_keys` writes to `/opt/git/.ssh/authorized_keys` (not `/home/git/`, despite earlier confusion). The rsync to `/var/lib/mnw/git/` populated the new GIT_REPOS_PATH; the server reads from there, but git push lands in `/opt/git/` because that's git user's home. Both paths now hold the repo bytes; that's wasteful, and tolerable only for the soak because the two copies drift on the first push (§G.3).
- `/opt/makenotwork/backups/` — 885M of pg dumps. Script and cron still write there. Sando's backup-fetch on fw13 still pulls from there (configured pre-cutover). Migration to `/var/lib/mnw/backups/` is its own follow-up (touches script, crontab, fw13 sando config).
- `yara-rules-src/`, `rustdoc/`, `ssh/`, `.env.bak.*` — not in any env var or systemd path. Confirmed by grepping the running 0.9.5 binary's path references. Will be swept in the post-soak cleanup.

## E. What broke and how it was caught

Three small things, all caught during the cutover:

1. **`sandod` cached `sando.toml`.** First promote attempt returned `creating remote release dir` (an in-flight progress string that became the error message). `journalctl -u sandod` showed it was still resolving `prod-1.makenot.work`. `scp sando.toml fw13:/tmp/`, `sudo cp /tmp/sando.toml /etc/sando/sando.toml`, `sudo systemctl restart sandod`, re-promote. Worth documenting that `sandod` does not watch the file; the alternative is to add an inotify watch or a SIGHUP handler.
2. **First doc smoke checks were wrong URLs.** `/about/economics`, `/docs/about/economics` returned 404; panicked briefly that the cutover broke doc routing. False alarm: the route is `/docs/{slug}` where slug is the filename stem (e.g., `/docs/economics`). Verified with `grep doc_page MNW/server/src/` after the panic. **Worth fixing in any future smoke script** — use the real URL scheme, not guessed-from-filesystem paths.
3. **`mnw-admin rebuild-keys` needed env loading from root context.** `sudo -u git /opt/mnw/current/mnw-admin rebuild-keys` fails with `DATABASE_URL must be set: NotPresent` because the binary's `dotenvy::from_path("/opt/makenotwork/.env")` runs as git, which can't read `.env` (mode 0600 makenotwork). Workaround: `set -a; source /etc/mnw/makenotwork.env; set +a; sudo -u git -E /opt/mnw/current/mnw-admin rebuild-keys`. Cleanest long-term fix is in §G.
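
For item 2, a future smoke script could derive doc URLs from the route rule instead of from directory structure. A minimal sketch, assuming the docs are `.md` files (the extension is an assumption) and using hypothetical helper names `doc_url` and `smoke`:

```bash
#!/usr/bin/env bash
# Sketch: map a docs source file to its URL under the /docs/{slug} route,
# where slug is the filename stem and any parent directories are ignored.
# The .md extension is an assumption about the source files.
doc_url() {   # $1 = path to a doc source file
  printf '/docs/%s\n' "$(basename "$1" .md)"
}

# Print "<http status> <path>" for one URL. $1 = base URL, $2 = path.
smoke() {
  printf '%s %s\n' "$(curl -s -o /dev/null -w '%{http_code}' "$1$2")" "$2"
}

# Usage:
#   smoke https://makenot.work "$(doc_url server/docs/business/economics.md)"
```

Under this rule a file nested in a subdirectory still maps to `/docs/<stem>`, which is exactly the case the guessed `/docs/about/economics` URL got wrong.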

## F. Outcomes (verified)

**Sando state after cutover:**

```
host cur=0.9.5 prev=0.9.5 burn_in_started=2026-06-03T02:23:28Z
a cur=0.9.5 prev=0.8.12 burn_in_started=2026-06-03T02:38:57Z
b cur=0.9.5 prev=None burn_in_started=2026-06-03T02:53:56Z
c not provisioned
```

**Prod externally:**
- `https://makenot.work/api/health` → `{"status":"operational","version":"0.9.5","checks":{"database":true}}`.
- `/`, `/login`, `/pricing`, `/docs`, `/docs/economics`, `/docs/roadmap`, `/docs/tiers` → 200.
- Git: `git ls-remote git@ssh.makenot.work:max/meta.git` → returns refs.

**Prod internally:**
- `systemctl status makenotwork` → active, PID 3123111, listening 0.0.0.0:3000.
- 733 YARA rules compiled from `/opt/mnw/yara-rules` (symlink).
- All integrations enabled per startup log: `s3=true, synckit_s3=false, stripe=true, scanner=true, mt=true, wam=true, git=true`.

**deploy.sh path retained.** Not retired; remains as break-glass per `feedback_prefer_sando_over_deploy_sh` (sando is preferred *default*; deploy.sh stays runnable for outages where sando host is down).

## G. Open follow-ups

### G.1 The hardcoded `/opt/makenotwork/` paths (blocks the cleanup milestone)

Session 1 outcomes claimed "`command=` prefixes auto-update on the first post-migration `rebuild-keys` run." That's wrong — confirmed during step 14. The path is a `const` in the binary, not pulled from env. Four sites need lifting before `/opt/makenotwork/` can be removed:

| File | Line | Current value | Target |
|---|---|---|---|
| `server/src/git_ssh.rs` | 15 | `const MNW_ADMIN_PATH: &str = "/opt/makenotwork/mnw-admin"` | `/opt/mnw/current/mnw-admin` |
| `server/src/bin/mnw-admin.rs` | 122 | `dotenvy::from_path("/opt/makenotwork/.env")` | `/etc/mnw/makenotwork.env` |
| `server/src/build_runner.rs` | 467 | `const BUILD_SSH_KNOWN_HOSTS: &str = "/opt/makenotwork/ssh/known_hosts"` | `/etc/mnw/known_hosts` (or delete if dead — verify usage first) |
| `server/src/routes/api/ssh_keys.rs` | 165 | `args(["-u", "git", "/opt/makenotwork/mnw-admin", "rebuild-keys"])` | `/opt/mnw/current/mnw-admin` |

Ship as 0.9.6. Cleanup sequence after: deploy 0.9.6 via sando → `rebuild-keys` once (regenerates `authorized_keys` with new path in command=) → soak one week → `rm -rf /opt/makenotwork/`.
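
Before the final `rm -rf`, a guard along these lines could confirm nothing still points at the legacy prefix. It is a sketch only: `legacy_refs` is a hypothetical helper, and the path list in the usage comment is drawn from this note and may be incomplete.

```bash
#!/usr/bin/env bash
# Sketch: print every file under the given paths that still contains the
# legacy prefix, and return non-zero if any were found.
# grep flags: -r recurse, -l names only, -s quiet about unreadable files,
# -F fixed string. Compiled binaries are searched too, so a leftover
# const in a release binary is reported by file name.
legacy_refs() {   # $@ = files or directories to scan
  local hits
  hits="$(grep -rlsF -- "/opt/makenotwork" "$@")"
  [ -z "$hits" ] && return 0
  printf '%s\n' "$hits"
  return 1
}

# Usage, run as root after 0.9.6 and the rebuild-keys pass:
#   legacy_refs /etc/caddy/Caddyfile /etc/sudoers.d /etc/mnw \
#     /etc/systemd/system/makenotwork.service \
#     /opt/git/.ssh/authorized_keys /opt/mnw/current/mnw-admin \
#     && echo "no legacy references found"
```

Crontabs are not covered by that path list; `crontab -l -u makenotwork` would need a separate look while the backup script still lives under the old prefix (§G.2).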

### G.2 The backups dir migration

Independent of G.1. Touches:
- `server/deploy/backup-db.sh` — hardcoded `BACKUP_DIR="/opt/makenotwork/backups"` near top.
- `makenotwork` user crontab on prod.
- Sando's `backup.source` URL on fw13 (currently pulls from `/opt/makenotwork/backups/latest.sql.gz` via rrsync).

Easiest order: copy the existing 885M dir to `/var/lib/mnw/backups/`, edit script + crontab + sando config in one window, retire `/opt/makenotwork/backups/` after one successful daily backup lands in the new location and sando confirms it pulled cleanly.
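
The "one successful daily backup" condition could be checked mechanically before retiring the old directory. A minimal sketch, assuming the script keeps writing `latest.sql.gz` in the new location; `backup_is_fresh` and the 26-hour default are hypothetical choices, not existing tooling:

```bash
#!/usr/bin/env bash
# Sketch: succeed only if <dir>/latest.sql.gz exists, was modified within
# the allowed window, and is an intact gzip stream.
# The default of 1560 minutes (26h) leaves slack around a daily 03:00 run.
backup_is_fresh() {   # $1 = backup dir, $2 = max age in minutes (optional)
  local f="$1/latest.sql.gz" max_age="${2:-1560}"
  [ -f "$f" ] || return 1
  # find prints the path only when the mtime is newer than max_age minutes;
  # -L follows a symlink in case latest.sql.gz points at a dated dump.
  [ -n "$(find -L "$f" -mmin "-$max_age")" ] || return 1
  gzip -t < "$f"   # non-zero if the archive is truncated or corrupt
}

# Usage:
#   backup_is_fresh /var/lib/mnw/backups && echo "safe to retire the old dir"
```

This covers only the prod side; whether sando pulled the file cleanly still has to be confirmed on fw13.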

### G.3 The `/opt/git` vs `/var/lib/mnw/git` duality

Both directories currently hold the same repos. Git pushes land in `/opt/git/` (git user's home from `/etc/passwd`). Server reads from `/var/lib/mnw/git/` (GIT_REPOS_PATH). They drift the moment someone pushes.

Two ways out:
- (a) `usermod -d /var/lib/mnw/git git` to make git's home match GIT_REPOS_PATH. Single source of truth. Risk: any cron / script that reads git's home (none I found, but worth grepping) breaks.
- (b) Revert GIT_REPOS_PATH to `/opt/git/`. Avoids the move but locks the path forever and reverts a piece of Session 1's FHS migration.

(a) is the right answer. Do it during the post-0.9.6 soak window.
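
Whichever option is taken, the invariant to hold afterwards is that git's home and `GIT_REPOS_PATH` are the same directory. A sketch of that check, with hypothetical helper names `passwd_home` and `git_paths_agree`:

```bash
#!/usr/bin/env bash
# Sketch: read passwd-format lines on stdin and print the home directory
# (field 6) of the named user.
passwd_home() {   # $1 = user name
  awk -F: -v u="$1" '$1 == u { print $6 }'
}

# Succeed when the two paths are equal, ignoring one trailing slash.
git_paths_agree() {   # $1 = git user's home, $2 = GIT_REPOS_PATH
  [ "${1%/}" = "${2%/}" ]
}

# Usage on a node, with GIT_REPOS_PATH taken from the env file:
#   git_paths_agree "$(getent passwd git | passwd_home git)" "$GIT_REPOS_PATH" \
#     || echo "git home and GIT_REPOS_PATH differ: pushes and reads will drift"
```

With the passwd entry recorded in §A, the first argument resolves to `/opt/git`, so the check fails today and should pass once option (a) is applied.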

### G.4 `bootstrap-node.sh` polish

From this cutover and Session 2:

- **Detect `nologin` shell** on `SERVICE_USER` and refuse with a clear error (or auto-`chsh`). Costs ~1 min of cutover time if you don't know to check.
- **Sibling `bootstrap-node-postgres.sh`** for the common pg_ident map case (when SERVICE_USER ≠ pg role name). Or document the manual steps in the script's "next steps" output.
- **README-postgres.md note** on the sqlx URL form: `postgres:///db?host=/var/run/postgresql&user=name`, not `postgres://user@/db?host=...`.
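
The first and third bullets could look roughly like this inside `bootstrap-node.sh`. This is a sketch with hypothetical function names, not existing script code; it takes the refuse-with-an-error option rather than auto-`chsh`.

```bash
#!/usr/bin/env bash
# Sketch: succeed only if a login shell can run the remote commands that
# an SSH-driven deploy needs. nologin and false are treated as blocking.
shell_allows_ssh() {   # $1 = login shell path
  case "$(basename "$1")" in
    nologin|false) return 1 ;;
    *) return 0 ;;
  esac
}

# Look up the user's shell (passwd field 7) and refuse with a clear message.
check_service_user_shell() {   # $1 = user name
  local shell
  shell="$(getent passwd "$1" | cut -d: -f7)"
  if ! shell_allows_ssh "$shell"; then
    echo "error: $1 has shell $shell; fix with: chsh -s /bin/bash $1" >&2
    return 1
  fi
}

# Print the socket-style URL in the form sqlx accepted in Session 2.
pg_socket_url() {   # $1 = database, $2 = role
  printf 'postgres:///%s?host=/var/run/postgresql&user=%s\n' "$1" "$2"
}

# Usage:
#   check_service_user_shell "$SERVICE_USER" || exit 1
#   pg_socket_url db name
```

Refusing is the more conservative choice: a `nologin` shell may be deliberate hardening, and silently changing it would undo that without the operator noticing.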

### G.5 `ASSUMPTIONS_PATH` mismatch

`sando-daemon.toml` puts the file at `<release>/docs/assumptions.toml`; prod's pre-existing env expected `<release>/docs/business/assumptions.toml` (matching the source layout `server/docs/business/assumptions.toml`). Worked around with an env edit during cutover, but both prod and testnot now have non-canonical `ASSUMPTIONS_PATH=/opt/mnw/current/docs/assumptions.toml`. Fix: change `release_contents[3].dst` in `sando-daemon.toml` to `docs/business/assumptions.toml` and revert the env path on both nodes. Small, do it during the 0.9.6 sprint.

## H. Key paths (for orientation)

- `MNW/sando/sando.toml` — tier B definition (`makenotwork@alpha-west-1`).
- `MNW/sando/deploy/bootstrap-node.sh` — node-bootstrap; ran on prod with `SERVICE_USER=makenotwork`.
- `MNW/sando/daemon/sando-daemon.toml` — release_contents (note §G.5 ASSUMPTIONS_PATH mismatch).
- `MNW/server/src/{git_ssh.rs, build_runner.rs, bin/mnw-admin.rs, routes/api/ssh_keys.rs}` — the four hardcoded path sites.
- `MNW/server/deploy/backup-db.sh` — hardcoded backup dir.
- `/etc/systemd/system/makenotwork.service` (prod) — new FHS unit.
- `/etc/mnw/makenotwork.env` (prod) — new env file location.
- `/etc/sudoers.d/mnw-git-ssh` (prod) — updated to `/opt/mnw/current/mnw-admin`.
- `/etc/caddy/Caddyfile` (prod) — three error-pages refs updated.
- `/opt/makenotwork/` (prod) — full pre-cutover state, kept for soak rollback.
- `launchplan_final.md` §6.5 step 8 — original plan this session closes.
- `launchplan_final.md` §6.9 — Session 2/3 gotchas summary.
- `launchplan_final.md` §7 — 0.9.6 path-decoupling spec.