# Session 3 — first sando-driven prod deploy

Captured 2026-06-03 after the cutover. Resolves §6.5 step 8 of `launchplan_final.md`: first full sando deploy to Hetzner prod, replacing `deploy.sh` as the default deploy path.

Status: **complete 2026-06-03.** Prod runs `makenotwork` 0.9.5 (sha `f0970b8`) from `/opt/mnw/current/`, deployed via `POST /promote/b {"hotfix":true}` from sandod on fw13. Outage window 3m25s (02:50:33 → 02:53:58 UTC). All features green. See §F for outcomes and §G for the four hardcoded paths that block the eventual `rm -rf /opt/makenotwork/`.

## Background — Session 1 set the layout, Session 2 proved it on testnot, Session 3 cut prod over

Session 1 redesigned the on-disk layout (`/opt/mnw/releases/<v>/` + `current` symlink; `/etc/mnw/makenotwork.env`; `/var/lib/mnw/` for state) and shipped the sando-side code that produces the full versioned bundle (binaries + static + docs + error-pages + assumptions). Session 2 reprovisioned testnot under that layout; the first remote deploy of the full bundle landed cleanly after three small gotchas (sqlx URL form, pg_ident map, `ASSUMPTIONS_PATH` mismatch — all logged in `launchplan_final.md` §6.9).

Session 3 is the real-stakes one: prod was on 0.9.1 via `deploy.sh`, and `/opt/makenotwork/` had eight months of accreted state (885M of backups, .env, yara-rules, ssh dir, rustdoc, sudoers entries, cron jobs, Caddyfile references). The Session 1 plan enumerated some of the move sequence but understated the surface area; the actual cutover surfaced several things worth documenting so the next major reprovision (or a disaster-recovery rebuild) doesn't re-discover them.

## A. Inventory taken before any prod write

`/opt/makenotwork/` contents (`makenotwork:makenotwork` unless noted):

- `makenotwork`, `mnw-admin` — 0.9.1 binaries (`root:root`)
- `.env` (110 lines), 5× `.env.bak.*` files (`root:root`)
- `docs/`, `static/`, `error-pages/` — content (will be replaced by release bundle)
- `backups/` — 885M
- `yara-rules/` — 8.5M compiled, `root:root`
- `yara-rules-src/` — upstream YARA sources (compiled to `yara-rules/`), `root:root`
- `rustdoc/` — generated docs, `501:staff` (uploaded from Mac via `deploy.sh`)
- `ssh/` — `known_hosts` for build runner, `root:root`
- `backup-db.sh` — cron'd daily at 03:00 UTC from `makenotwork`'s crontab
- `deploy/` — `deploy.sh` staging area, `root:root`

Other prod state in play:

- `/opt/git/` — 99M, `git:git`. Both the git user's home (`/etc/passwd` says `git:x:995:986::/opt/git:/bin/sh`) *and* the GIT_REPOS_PATH target. Conflating these turns out to matter (§D, §G.3).
- `/etc/caddy/Caddyfile` — three `root * /opt/makenotwork/error-pages` lines.
- `/etc/sudoers.d/mnw-git-ssh` — `makenotwork ALL=(git) NOPASSWD: /opt/makenotwork/mnw-admin rebuild-keys`.
- `/etc/sudoers.d/mnw-cli-git` — `mnw-cli ALL=(git) NOPASSWD: /usr/bin/git-*, /usr/bin/tee, /usr/bin/chmod`. No /opt path references; left alone.
- `makenotwork` user crontab: `0 3 * * * /opt/makenotwork/backup-db.sh >> /opt/makenotwork/backups/backup.log 2>&1`.
- Root crontab: `0 3 * * * /opt/backups/pg_backup.sh >> /var/log/pg_backup.log 2>&1` — unrelated, left alone.

## B. Pre-flight (no prod impact)

1. **`sando.toml` tier B fixed.** Was `deploy@prod-1.makenot.work` (NXDOMAIN, no port). Now `makenotwork@alpha-west-1` with port handling via `~sando/.ssh/config` Host block. Chose to keep service user as `makenotwork` rather than introduce a `deploy` user — avoids chowning 885M of backups and redoing pg peer auth that's been stable for months. The same reasoning applies to a hypothetical tier C: keep the existing user, don't introduce a new one for cosmetic uniformity with testnot.
2. **Sando pubkey installed** in `/home/makenotwork/.ssh/authorized_keys` (mode 0600, owned makenotwork).
3. **`chsh -s /bin/bash makenotwork`** — was `/usr/sbin/nologin`. Key auth was succeeding; the connection was being rejected because of the login shell. Worth detecting/fixing in `bootstrap-node.sh` for future provisions where someone has hardened the runtime user.
4. **`/srv/sando/.ssh/config`** Host block for port 2200; `known_hosts` seeded via `ssh-keyscan -p 2200`.
5. **Dry-run rsync** from sando → prod's `/opt/mnw/releases/_probe/` succeeded (after `bootstrap-node.sh` created `/opt/mnw/`).

## C. Cutover sequence (3m25s outage)

In order, with the exact reason each step exists:

1. **`systemctl stop makenotwork`** — 02:50:33 UTC. Outage window starts.
2. **Backups taken**: `/etc/systemd/system/makenotwork.service → /root/makenotwork.service.bak-pre-cutover`; `/opt/makenotwork/.env → /root/dotenv.bak-pre-cutover`; Caddyfile, sudoers, crontab also backed up to `/root/*.bak-pre-cutover`. Rollback path for any step failing before service restart.
3. **`bootstrap-node.sh`** with `SERVICE_USER=makenotwork SANDO_PUBKEY=… INSTALL_POSTGRES=0 INSTALL_CADDY=0 INSTALL_TAILSCALE=0 ENABLE_FIREWALL=0` — postgres/caddy/tailscale/UFW already configured on prod, don't touch. Created `/opt/mnw/`, `/etc/mnw/`, `/var/lib/mnw/`, the new systemd unit, the unused `deploy` user (harmless), the sudoers entry for `deploy`. The new unit references `EnvironmentFile=/etc/mnw/makenotwork.env` and `ReadWritePaths=/var/lib/mnw`, with `RestartPreventExitStatus=2` (MNW server convention: exit 2 = migration failure, don't crashloop).
4. **`cp /opt/makenotwork/.env /etc/mnw/makenotwork.env`** (copy, not move — original stays for one-week rollback). Then `chown root:makenotwork` and `chmod 0640` on the copy, followed by `sed` rewrites of `DOCS_PATH`, `ASSUMPTIONS_PATH`, `YARA_RULES_DIR`, `GIT_REPOS_PATH` for the new layout. `HOST`, `PORT`, `DATABASE_URL`, `HOST_URL` unchanged.
5. **`ln -s /opt/makenotwork/yara-rules /opt/mnw/yara-rules`** — yara-rules is operator-managed (independent update cadence), not in the release bundle (Session 1 layout principle: category #3). The symlink lets the new env's `YARA_RULES_DIR=/opt/mnw/yara-rules` continue to resolve. When `/opt/makenotwork/` is eventually removed, the rules dir moves to a permanent path (probably `/var/lib/mnw/yara-rules` or `/etc/mnw/yara-rules`) and the symlink retargets.
6. **`rsync -aHX /opt/git/ /var/lib/mnw/git/`** — preserves `git:git` ownership (`-a`), hard links (`-H`), and extended attributes (`-X`). `chmod 0755 /var/lib/mnw` so the git user can traverse (default was 0750 makenotwork:makenotwork, which blocked `git-receive-pack`, running as git, from reaching the repos).
7. **Caddyfile rewrite**: `sed -i 's|/opt/makenotwork/error-pages|/opt/mnw/current/error-pages|g'`. `caddy validate` before reload; `systemctl reload caddy`.
8. **Sudoers rewrite**: same sed pattern on `/etc/sudoers.d/mnw-git-ssh`; `visudo -c -f` to validate.
9. **`systemctl daemon-reload`** to pick up the new unit.
10. **`systemctl restart sandod`** on fw13 — sandod caches `sando.toml` at startup, so the new tier B target does not take effect without a restart. **The first `POST /promote/b` was issued before this restart and failed with NXDOMAIN against the stale `prod-1.makenot.work`.** Fixed by restarting sandod and re-promoting.
11. **`POST /promote/b {"hotfix":true}`** — `hotfix: true` bypasses the 48h burn-in on tier A (which had just promoted to 0.9.5 ~15 min prior; burn-in not yet elapsed). Sando rsync'd the 161MB bundle to `/opt/mnw/releases/0.9.5/`, swapped the `current` symlink, called `systemctl reload-or-restart makenotwork.service`.
12. **Service up 02:53:55 UTC.** Outage window ends 02:53:58 once health serves 200. 733 YARA rules compiled, all integrations (S3, Stripe, MT, WAM, git, scanner, custom domain cache) live.
13. **External smoke checks**: `/`, `/login`, `/pricing`, `/docs`, `/docs/economics`, `/docs/roadmap`, `/docs/tiers` — all 200.
14. **`rebuild-keys` to regenerate `/opt/git/.ssh/authorized_keys`** — the env is not loaded when mnw-admin runs as git: the binary calls `dotenvy::from_path` on `/opt/makenotwork/.env`, which is mode 0600 `makenotwork:makenotwork` and unreadable by git. Worked around by sourcing the env as root, then `sudo -u git -E`. **Regenerated keys still contain `command="/opt/makenotwork/mnw-admin git-auth ..."`** — see §G.
15. **Git push test** — `git ls-remote git@ssh.makenot.work:max/meta.git` returns refs cleanly. Cutover verified end-to-end.
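
Steps 7 and 8 share one pattern: rewrite a path, validate, only then let the change go live. A minimal sketch of that pattern follows. It assumes a hypothetical helper `rewrite_validated` (not part of sando or `bootstrap-node.sh`) that appends the candidate file's path to whatever validator command it is given.

```sh
# Hypothetical helper, not part of sando or bootstrap-node.sh.
# Usage: rewrite_validated FILE OLD NEW VALIDATOR [ARGS...]
# The candidate file's path is appended to the validator command.
rewrite_validated() {
  rv_file=$1 rv_old=$2 rv_new=$3
  shift 3
  rv_tmp=$(mktemp) || return 1
  # '|' as the sed delimiter avoids escaping slashes. OLD is treated as a
  # regex, so paths containing '.', '|' or '&' would need escaping first.
  if sed "s|$rv_old|$rv_new|g" "$rv_file" > "$rv_tmp" && "$@" "$rv_tmp"; then
    # Overwrite in place (not mv) so FILE keeps its owner and mode.
    cat "$rv_tmp" > "$rv_file"
    rv_status=$?
  else
    rv_status=1   # sed or the validator failed: FILE is left untouched
  fi
  rm -f "$rv_tmp"
  return "$rv_status"
}

# Possible invocations for steps 7 and 8 (run as root; untested on prod):
#   rewrite_validated /etc/caddy/Caddyfile /opt/makenotwork/error-pages \
#     /opt/mnw/current/error-pages caddy validate --adapter caddyfile --config
#   rewrite_validated /etc/sudoers.d/mnw-git-ssh /opt/makenotwork/mnw-admin \
#     /opt/mnw/current/mnw-admin visudo -c -f
```

Validating the candidate before touching the live file means a bad rewrite never reaches Caddy or sudo. The in-place `cat` is not atomic, which seems acceptable inside an outage window but would not be for a hot edit.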

## D. What stayed in place (intentional)

- `/opt/makenotwork/` — full contents, untouched. Soak rollback path: stop new unit, swap systemd unit back, start old binary. Plan: `rm -rf` after a week, post-0.9.6 deploy (see §G).
- `/opt/git/` — untouched. Git user's `/etc/passwd` home; mnw-admin's regenerated `authorized_keys` writes to `/opt/git/.ssh/authorized_keys` (not `/home/git/`, despite earlier confusion). The rsync to `/var/lib/mnw/git/` populated the new GIT_REPOS_PATH; the server reads from there, but git push lands in `/opt/git/` because that's the git user's home. Both paths now hold the repo bytes; that's wasteful but harmless during the soak.
- `/opt/makenotwork/backups/` — 885M of pg dumps. Script and cron still write there. Sando's backup-fetch on fw13 still pulls from there (configured pre-cutover). Migration to `/var/lib/mnw/backups/` is its own follow-up (touches script, crontab, fw13 sando config).
- `yara-rules-src/`, `rustdoc/`, `ssh/`, `.env.bak.*` — not in any env var or systemd path. Confirmed by grepping the running 0.9.5 binary's path references. Will be swept in the post-soak cleanup.

## E. What broke and how it was caught

Three small things, all caught by smoke checks:

1. **`sandod` cached `sando.toml`.** First promote attempt returned `creating remote release dir` (an in-flight progress string that became the error message). `journalctl -u sandod` showed it was still resolving `prod-1.makenot.work`. `scp sando.toml fw13:/tmp/`, `sudo cp /tmp/sando.toml /etc/sando/sando.toml`, `sudo systemctl restart sandod`, re-promote. Worth documenting that `sandod` does not watch the file; the alternative is to add an inotify watch or a SIGHUP handler.
2. **First doc smoke checks were wrong URLs.** `/about/economics`, `/docs/about/economics` returned 404; panicked briefly that the cutover broke doc routing. False alarm: the route is `/docs/{slug}` where slug is the filename stem (e.g., `/docs/economics`). Verified with `grep doc_page MNW/server/src/` after the panic. **Worth fixing in any future smoke script** — use the real URL scheme, not guessed-from-filesystem paths.
3. **`mnw-admin rebuild-keys` needed env loading from root context.** `sudo -u git /opt/mnw/current/mnw-admin rebuild-keys` fails with `DATABASE_URL must be set: NotPresent` because the binary's `dotenvy::from_path("/opt/makenotwork/.env")` runs as git, which can't read `.env` (mode 0600 makenotwork). Workaround: `set -a; source /etc/mnw/makenotwork.env; set +a; sudo -u git -E /opt/mnw/current/mnw-admin rebuild-keys`. Cleanest long-term fix is in §G.
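
For item 2, a smoke script could derive each URL from the routing rule instead of the filesystem layout. The sketch below assumes doc sources end in `.md` and that directories never appear in the slug; both are inferred from the 404s above, not checked against the router. `doc_route` is a hypothetical name.

```sh
# Hypothetical smoke-check helper: map a doc source file to its served route.
# Assumes the route is /docs/{filename stem} and that sources end in .md.
doc_route() {
  dr_name=${1##*/}                      # strip directories, keep the file name
  printf '/docs/%s\n' "${dr_name%.md}"  # strip the assumed .md extension
}

# Possible use in a smoke loop (the glob is an assumption about the source tree):
#   for f in docs/*/*.md; do
#     curl -fsS -o /dev/null "https://makenot.work$(doc_route "$f")" || echo "FAIL $f"
#   done
```

Under these assumptions a source file such as `docs/about/economics.md` (an illustrative path) maps to `/docs/economics`, the URL that returned 200, rather than to `/docs/about/economics`, which returned 404.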

## F. Outcomes (verified)

**Sando state after cutover:**

```
host cur=0.9.5 prev=0.9.5 burn_in_started=2026-06-03T02:23:28Z
a cur=0.9.5 prev=0.8.12 burn_in_started=2026-06-03T02:38:57Z
b cur=0.9.5 prev=None burn_in_started=2026-06-03T02:53:56Z
c not provisioned
```

**Prod externally:**
- `https://makenot.work/api/health` → `{"status":"operational","version":"0.9.5","checks":{"database":true}}`.
- `/`, `/login`, `/pricing`, `/docs`, `/docs/economics`, `/docs/roadmap`, `/docs/tiers` → 200.
- Git: `git ls-remote git@ssh.makenot.work:max/meta.git` → returns refs.

**Prod internally:**
- `systemctl status makenotwork` → active, PID 3123111, listening 0.0.0.0:3000.
- 733 YARA rules compiled from `/opt/mnw/yara-rules` (symlink).
- All integrations enabled per startup log: `s3=true, synckit_s3=false, stripe=true, scanner=true, mt=true, wam=true, git=true`.

**deploy.sh path retained.** Not retired; remains as break-glass per `feedback_prefer_sando_over_deploy_sh` (sando is preferred *default*; deploy.sh stays runnable for outages where sando host is down).

## G. Open follow-ups

### G.1 The hardcoded `/opt/makenotwork/` paths (blocks the cleanup milestone)

Session 1 outcomes claimed "`command=` prefixes auto-update on the first post-migration `rebuild-keys` run." That's wrong — confirmed during step 14 of §C. The path is a `const` in the binary, not pulled from env. Four sites need lifting before `/opt/makenotwork/` can be removed:

| File | Line | Current | Replacement |
|---|---|---|---|
| `server/src/git_ssh.rs` | 15 | `const MNW_ADMIN_PATH: &str = "/opt/makenotwork/mnw-admin"` | `/opt/mnw/current/mnw-admin` |
| `server/src/bin/mnw-admin.rs` | 122 | `dotenvy::from_path("/opt/makenotwork/.env")` | `/etc/mnw/makenotwork.env` |
| `server/src/build_runner.rs` | 467 | `const BUILD_SSH_KNOWN_HOSTS: &str = "/opt/makenotwork/ssh/known_hosts"` | `/etc/mnw/known_hosts` (or delete if dead — verify usage first) |
| `server/src/routes/api/ssh_keys.rs` | 165 | `args(["-u", "git", "/opt/makenotwork/mnw-admin", "rebuild-keys"])` | `/opt/mnw/current/mnw-admin` |

Ship as 0.9.6. Cleanup sequence after: deploy 0.9.6 via sando → `rebuild-keys` once (regenerates `authorized_keys` with new path in command=) → soak one week → `rm -rf /opt/makenotwork/`.
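
Before the final `rm -rf`, it may be worth gating on a check that nothing still points at the old prefix. A minimal sketch, assuming the operator supplies the list of files to inspect; `stale_path_refs` is a hypothetical helper, and crontabs would need a separate `crontab -l` check since they are not plain files here.

```sh
# Hypothetical pre-removal gate. Prints every file that still mentions the
# old prefix (or cannot be read) and returns 1 if there is at least one.
stale_path_refs() {
  spr_found=0
  for spr_file in "$@"; do
    if [ ! -r "$spr_file" ]; then
      # Treat unreadable files as failures rather than silently passing.
      printf 'unreadable: %s\n' "$spr_file" >&2
      spr_found=1
    elif grep -qF '/opt/makenotwork/' "$spr_file"; then
      printf '%s\n' "$spr_file"
      spr_found=1
    fi
  done
  return "$spr_found"
}

# Possible use after the 0.9.6 rebuild-keys run (as root):
#   stale_path_refs /opt/git/.ssh/authorized_keys /etc/sudoers.d/mnw-git-ssh \
#     /etc/caddy/Caddyfile /etc/mnw/makenotwork.env && echo "no stale references"
```

This would have flagged the regenerated `authorized_keys` from step 14, whose `command=` prefix still carried the old path.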

### G.2 The backups dir migration

Independent of G.1. Touches:
- `server/deploy/backup-db.sh` — hardcoded `BACKUP_DIR="/opt/makenotwork/backups"` near top.
- `makenotwork` user crontab on prod.
- Sando's `backup.source` URL on fw13 (currently pulls from `/opt/makenotwork/backups/latest.sql.gz` via rrsync).

Easiest order: copy the existing 885M dir to `/var/lib/mnw/backups/`, edit script + crontab + sando config in one window, retire `/opt/makenotwork/backups/` after one successful daily backup lands in the new location and sando confirms it pulled cleanly.
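
The "one successful daily backup lands" condition can be checked mechanically. A sketch, assuming dumps are regular files named `*.sql.gz` (inferred from `latest.sql.gz`; the real naming in `backup-db.sh` was not checked); `has_fresh_backup` is a hypothetical helper.

```sh
# Hypothetical check: succeed if DIR holds a *.sql.gz regular file modified
# within the last HOURS hours (default 24).
has_fresh_backup() {
  hfb_dir=$1 hfb_hours=${2:-24}
  # -mmin is a GNU/BSD find extension, not POSIX.
  hfb_hit=$(find "$hfb_dir" -type f -name '*.sql.gz' \
              -mmin "-$((hfb_hours * 60))" 2>/dev/null | head -n 1)
  [ -n "$hfb_hit" ]
}

# Possible use the morning after the migration window:
#   has_fresh_backup /var/lib/mnw/backups 24 || echo "no new dump in new location"
```

This only proves a file landed. Whether sando pulled it cleanly still has to be confirmed on fw13.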

### G.3 The `/opt/git` vs `/var/lib/mnw/git` duality

Both directories currently hold the same repos. Git pushes land in `/opt/git/` (git user's home from `/etc/passwd`). Server reads from `/var/lib/mnw/git/` (GIT_REPOS_PATH). They drift the moment someone pushes.

Two ways out:
- (a) `usermod -d /var/lib/mnw/git git` to make git's home match GIT_REPOS_PATH. Single source of truth. Risk: any cron / script that reads git's home (none I found, but worth grepping) breaks.
- (b) Revert GIT_REPOS_PATH to `/opt/git/`. Avoids the move but locks the path forever and reverts a piece of Session 1's FHS migration.

(a) is the right answer. Do it during the post-0.9.6 soak window.
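
Until (a) lands, drift between the two trees can be detected with a coarse byte-level comparison. A sketch using `diff -rq`; `drifted_entries` is a hypothetical helper, and comparing `git for-each-ref` output per repo would be more precise, since a repack can change bytes without changing refs.

```sh
# Hypothetical drift check. Prints the name of every top-level entry under
# OLD that is missing from, or differs in, NEW; returns 1 if any drifted.
# The '*' glob skips dotfiles, so OLD/.ssh (which exists only in git's home)
# is not reported.
drifted_entries() {
  de_old=$1 de_new=$2 de_rc=0
  for de_path in "$de_old"/*; do
    [ -e "$de_path" ] || continue       # empty directory: glob did not match
    de_name=${de_path##*/}
    if ! diff -rq "$de_path" "$de_new/$de_name" > /dev/null 2>&1; then
      printf '%s\n' "$de_name"
      de_rc=1
    fi
  done
  return "$de_rc"
}

# Possible use (as root, so both trees are readable):
#   drifted_entries /opt/git /var/lib/mnw/git || echo "re-sync before usermod"
```

Any output here means a push has landed since the rsync in step 6 of §C, so the copy would need refreshing before the home directory is switched.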

### G.4 `bootstrap-node.sh` polish

From this cutover and Session 2:

- **Detect `nologin` shell** on `SERVICE_USER` and refuse with a clear error (or auto-`chsh`). Costs ~1 min of cutover time if you don't know to check.
- **Sibling `bootstrap-node-postgres.sh`** for the common pg_ident map case (when SERVICE_USER ≠ pg role name). Or document the manual steps in the script's "next steps" output.
- **README-postgres.md note** on the sqlx URL form: `postgres:///db?host=/var/run/postgresql&user=name`, not `postgres://user@/db?host=...`.
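
The `nologin` detection in the first bullet could look roughly like this. The sketch reads a passwd-format file directly, which misses users served by LDAP or other NSS sources (`getent passwd` would cover those); `login_shell_ok` and its optional second argument are hypothetical, not existing `bootstrap-node.sh` code.

```sh
# Hypothetical pre-flight check for bootstrap-node.sh.
# Usage: login_shell_ok USER [PASSWD_FILE]   (PASSWD_FILE defaults to /etc/passwd)
# Returns 0 if USER has a usable login shell, 1 for nologin/false,
# 2 if USER has no entry (an empty shell field is also reported as 2).
login_shell_ok() {
  ls_user=$1 ls_passwd=${2:-/etc/passwd}
  ls_shell=$(awk -F: -v u="$ls_user" '$1 == u { print $7; exit }' "$ls_passwd")
  case $ls_shell in
    '')
      printf 'no passwd entry (or empty shell) for %s\n' "$ls_user" >&2
      return 2 ;;
    */nologin|*/false)
      printf '%s has shell %s; SSH sessions will be refused\n' "$ls_user" "$ls_shell" >&2
      return 1 ;;
    *)
      return 0 ;;
  esac
}

# Possible use near the top of the script:
#   login_shell_ok "$SERVICE_USER" || { echo "fix with: chsh -s /bin/bash $SERVICE_USER" >&2; exit 1; }
```

Refusing with a message seems safer than an automatic `chsh`, since a `nologin` shell on a runtime user is usually deliberate hardening that the operator should consciously undo.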

### G.5 `ASSUMPTIONS_PATH` mismatch

`sando-daemon.toml` puts the file at `<release>/docs/assumptions.toml`; prod's pre-existing env expected `<release>/docs/business/assumptions.toml` (matching the source layout `server/docs/business/assumptions.toml`). Worked around with an env edit during cutover but both prod and testnot now have non-canonical `ASSUMPTIONS_PATH=/opt/mnw/current/docs/assumptions.toml`. Fix: change `release_contents[3].dst` in `sando-daemon.toml` to `docs/business/assumptions.toml` and revert the env path on both nodes. Small, do it during the 0.9.6 sprint.

## H. Key paths (for orientation)

- `MNW/sando/sando.toml` — tier B definition (`makenotwork@alpha-west-1`).
- `MNW/sando/deploy/bootstrap-node.sh` — node-bootstrap; ran on prod with `SERVICE_USER=makenotwork`.
- `MNW/sando/daemon/sando-daemon.toml` — release_contents (note §G.5 ASSUMPTIONS_PATH mismatch).
- `MNW/server/src/{git_ssh.rs, build_runner.rs, bin/mnw-admin.rs, routes/api/ssh_keys.rs}` — the four hardcoded path sites.
- `MNW/server/deploy/backup-db.sh` — hardcoded backup dir.
- `/etc/systemd/system/makenotwork.service` (prod) — new FHS unit.
- `/etc/mnw/makenotwork.env` (prod) — new env file location.
- `/etc/sudoers.d/mnw-git-ssh` (prod) — updated to `/opt/mnw/current/mnw-admin`.
- `/etc/caddy/Caddyfile` (prod) — three error-pages refs updated.
- `/opt/makenotwork/` (prod) — full pre-cutover state, kept for soak rollback.
- `launchplan_final.md` §6.5 step 8 — original plan this session closes.
- `launchplan_final.md` §6.9 — Session 2/3 gotchas summary.
- `launchplan_final.md` §7 — 0.9.6 path-decoupling spec.