max / makenotwork
1 file changed, +138 insertions, -0 deletions
| @@ -0,0 +1,138 @@ | |||
| 1 | + | # Sando TODO | |
| 2 | + | ||
| 3 | + | Open work only. Completed items move to `todo_done.md` (sibling file) when one exists. Design notes go in `plans/<name>.md`, not folded into checkboxes. | |
| 4 | + | ||
| 5 | + | Format rule: every actionable line is a `- [ ]` checkbox. Headings group phases and themes; do not put status updates in them. | |
| 6 | + | ||
| 7 | + | Roadmap target: replace `server/deploy/deploy.sh` and astra-hosted `server/deploy/run-ci.sh` with Sando running on the MakeMachine, gating Hetzner prod through testnot.work. | |
| 8 | + | ||
| 9 | + | Phases are ordered for execution. Phase 0 must finish before Phase 1 is meaningful. Phases 5+ are post-cutover hardening. | |
| 10 | + | ||
| 11 | + | ## Key Paths | |
| 12 | + | ||
| 13 | + | Read these to orient before working on Sando: | |
| 14 | + | ||
| 15 | + | - `README.md` — quickstart, API surface, v0 limitations | |
| 16 | + | - `sando.toml` — current topology (MM → A → B; C declared, not provisioned) | |
| 17 | + | - `daemon/src/main.rs` — startup sequence (config → topology → migrate → sync → bare-repo bootstrap → serve) | |
| 18 | + | - `daemon/src/routes.rs` — `/state`, `/promote`, `/rollback`, `/rebuild`, `/backup/fetch`, `/events` | |
| 19 | + | - `daemon/src/gates.rs` — gate runners; the load-bearing logic | |
| 20 | + | - `daemon/src/build.rs` — `build_and_run_mm` is the MM-tier pipeline | |
| 21 | + | - `daemon/src/deploy.rs` — `deploy_local`; remote SSH stub | |
| 22 | + | - `daemon/migrations/001_init.sql` — schema (tiers/nodes as rows) | |
| 23 | + | - `server/deploy/deploy.sh` — current cross-compile + push-to-Hetzner script (what we are replacing) | |
| 24 | + | - `server/deploy/run-ci.sh` — current astra CI script (what we are replacing) | |
| 25 | + | - `_meta/docs/operations.md` — burn-in rule and hotfix policy that gates encode | |
| 26 | + | ||
| 27 | + | --- | |
| 28 | + | ||
| 29 | + | ## Phase 0 — MakeMachine bootstrap | |
| 30 | + | ||
| 31 | + | Hardware and base provisioning. None of the remote-deploy work below matters until MM exists. | |
| 32 | + | ||
| 33 | + | - [ ] Purchase MakeMachine hardware (Threadripper 7960X + RTX PRO 6000 Blackwell + 256 GB ECC + Gen5 NVMe; ~$14-16K per `project_inference_stack.md`). | |
| 34 | + | - [ ] Install x86_64 Linux (match Hetzner prod distro/version to keep build env aligned). | |
| 35 | + | - [ ] Join MM to tailnet; allocate a stable hostname and record in `_meta/infra_tailnet.md`. | |
| 36 | + | - [ ] Provision `sando` system user; lock down the home dir; set up scoped SSH keys for outbound deploys. | |
| 37 | + | - [ ] Install scratch Postgres locally on MM; create the `sando_scratch` role + DB used by `migration_dry_run`. | |
| 38 | + | - [ ] Write the `sandod.service` systemd unit (run as `sando` user, restart on failure, `EnvironmentFile=/etc/sando/sando.env`). | |
| 39 | + | - [ ] Install `sandod` binary at `/usr/local/bin/sandod`; enable + start the unit. | |
| 40 | + | - [ ] Write the production `sando.toml`; bare repo path under `/srv/sando/mnw.git`; A node `testnot.work`; B node Hetzner prod. | |
| 41 | + | ||
| 42 | + | ## Phase 1 — Remote deploy | |
| 43 | + | ||
| 44 | + | The MVP only deploys to `ssh_target=local`. Production needs real SSH/rsync. | |
| 45 | + | ||
| 46 | + | - [ ] Implement `deploy::deploy_node` remote path: rsync the staged binary to `<ssh_target>:<release_root>/releases/<version>/server`, then `ssh <ssh_target> "cd <release_root> && ln -sfn releases/<version> current && systemctl reload-or-restart <unit>"` (the `cd` matters: the relative `ln` would otherwise run in the SSH login directory). | |
| 47 | + | - [ ] Settle systemd unit naming convention. Current MNW server unit is `makenotwork.service`; decide whether Sando keeps that name or migrates to `mnw-server.service`. Capture in `plans/systemd-units.md` before changing anything live. | |
| 48 | + | - [ ] Add `node.systemd_unit` field to `sando.toml` (default derives from the tier+role) so the convention is explicit per-node. | |
| 49 | + | - [ ] Bootstrap script for adding a fresh node: creates `<release_root>`, installs the systemd unit pointing at `<release_root>/current/server`, adds the sando SSH key to `authorized_keys`. Idempotent. | |
| 50 | + | - [ ] Garbage-collect old releases on the remote: keep last N (configurable, default 5) per node. Run at end of each successful deploy. | |
| 51 | + | - [ ] Handle `rsync` failure mid-deploy: leave the previous `current` symlink intact; mark `deploys.outcome = 'failed'`; do not advance `tier_state`. | |
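The remote path described in the Phase 1 checklist can be sketched as a pure planning step that only builds the argument vectors, leaving execution and failure handling to the caller. This is a minimal sketch, not the real `deploy::deploy_node`: the `Node` struct, `DeployPlan`, and `plan_remote_deploy` are hypothetical names, the `mkdir -p` preparation step is an assumption, and values are interpolated unquoted on the assumption that they contain no shell metacharacters.

```rust
/// Hypothetical node description; field names follow the TODO text,
/// not necessarily the real `sando.toml` schema.
pub struct Node {
    pub ssh_target: String,
    pub release_root: String,
    pub systemd_unit: String,
}

/// Argument vectors for the remote steps, in execution order.
pub struct DeployPlan {
    /// Assumed extra step: make sure the release directory exists.
    pub prepare: Vec<String>,
    /// rsync of the staged binary into `releases/<version>/server`.
    pub upload: Vec<String>,
    /// Symlink swap plus service reload, run only if the upload succeeded.
    pub activate: Vec<String>,
}

pub fn plan_remote_deploy(node: &Node, version: &str, staged_binary: &str) -> DeployPlan {
    let release_dir = format!("{}/releases/{}", node.release_root, version);
    // `cd` first so the relative symlink target resolves under release_root.
    let activate_cmd = format!(
        "cd {} && ln -sfn releases/{} current && systemctl reload-or-restart {}",
        node.release_root, version, node.systemd_unit
    );
    DeployPlan {
        prepare: vec![
            "ssh".into(),
            node.ssh_target.clone(),
            format!("mkdir -p {}", release_dir),
        ],
        upload: vec![
            "rsync".into(),
            "-a".into(),
            staged_binary.to_string(),
            format!("{}:{}/server", node.ssh_target, release_dir),
        ],
        activate: vec!["ssh".into(), node.ssh_target.clone(), activate_cmd],
    }
}
```

Keeping `activate` as a separate step is what makes the rsync-failure item cheap to satisfy: if `upload` fails, the caller never runs `activate`, so the previous `current` symlink is untouched.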
| 52 | + | ||
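The release garbage-collection item in Phase 1 (keep the last N per node) reduces to a small selection function. The sketch below assumes the caller already has the release directory names sorted oldest to newest; `releases_to_prune` is a hypothetical helper, and skipping whatever `current` points at is an added safety assumption (after a rollback, `current` may be one of the older releases).

```rust
/// Given release names ordered oldest to newest, return the ones that may be
/// deleted: everything except the newest `keep`, and never the release that
/// `current` points at.
pub fn releases_to_prune<'a>(ordered: &[&'a str], current: &str, keep: usize) -> Vec<&'a str> {
    // Number of releases that fall outside the "newest `keep`" window.
    let cutoff = ordered.len().saturating_sub(keep);
    ordered[..cutoff]
        .iter()
        .copied()
        .filter(|name| *name != current)
        .collect()
}
```

With seven releases and the default of 5, the two oldest are selected; if `current` is one of those two, only the other is returned.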
| 53 | + | ## Phase 2 — Backup pipeline + migration dry-run | |
| 54 | + | ||
| 55 | + | `migration_dry_run` is the load-bearing gate. It needs a real backup source, not a fixture. | |
| 56 | + | ||
| 57 | + | - [ ] Confirm astra's offsite replica (per `sync-backup-offsite.sh`) writes a deterministic latest-link path Sando can rsync from. If not, add one. | |
| 58 | + | - [ ] Wire the production `sando.toml` `backup.source` to the astra rsync URL. | |
| 59 | + | - [ ] Schedule a daily `POST /backup/fetch` (cron or systemd timer on MM) so a fresh backup is always within 24h of any promote attempt. | |
| 60 | + | - [ ] First end-to-end `migration_dry_run` against a real prod backup; confirm it catches the 2026-05-22 incident class (drop+recreate column migration sequence). | |
| 61 | + | - [ ] Document the failure modes: what does the operator see in `/state` when the dry-run fails? Capture in `plans/migration-dryrun-failures.md`. | |
| 62 | + | - [ ] Decide retention on `backups` table — prune rows older than N days so SQLite doesn't grow forever. | |
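The "fresh backup within 24h of any promote attempt" requirement can be enforced with a small check before `migration_dry_run` starts. This is a sketch under assumptions: `backup_is_fresh` and `MAX_BACKUP_AGE` are hypothetical names, and treating a fetch timestamp in the future (clock skew) as stale is a conservative choice rather than something the plan specifies.

```rust
use std::time::{Duration, SystemTime};

/// Assumed policy constant: a backup older than 24 hours is considered stale.
pub const MAX_BACKUP_AGE: Duration = Duration::from_secs(24 * 60 * 60);

/// Returns true when the most recent backup is recent enough to back a promote.
pub fn backup_is_fresh(fetched_at: SystemTime, now: SystemTime, max_age: Duration) -> bool {
    match now.duration_since(fetched_at) {
        Ok(age) => age <= max_age,
        // `fetched_at` is later than `now`: clocks disagree, so do not trust it.
        Err(_) => false,
    }
}
```

Passing `now` in explicitly, instead of calling `SystemTime::now()` inside, keeps the check deterministic in tests.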
| 63 | + | ||
| 64 | + | ## Phase 3 — Parity with current `deploy.sh` | |
| 65 | + | ||
| 66 | + | Sando currently only ships the binary. `deploy.sh` does more. Inventory each piece and either fold it into Sando or document the explicit hand-off. | |
| 67 | + | ||
| 68 | + | - [ ] **Caddyfile** — `deploy.sh upload_config` pushes `server/deploy/Caddyfile` to `/etc/caddy/Caddyfile` and reloads Caddy. Decide: ship as a versioned config artifact alongside the binary (cleanest), or keep Caddy config out-of-band? Capture in `plans/config-artifacts.md`. | |
| 69 | + | - [ ] **systemd unit** — `deploy.sh` uploads `makenotwork.service`. With Sando the unit points at `current/server` and shouldn't change per release. Move unit ownership to the node-bootstrap script (Phase 1) and remove from per-deploy flow. | |
| 70 | + | - [ ] **Backup script** — `backup-db.sh` is uploaded by `deploy.sh`. Move to node-bootstrap; not a per-release artifact. | |
| 71 | + | - [ ] **Error pages** — static HTML in `server/deploy/error-pages/`. Either bake into the binary (preferred — versions with code) or ship as a `releases/<version>/error-pages/` sibling. Capture decision. | |
| 72 | + | - [ ] **Security configs** — `sshd-git.conf`, `fail2ban-sshd.conf`, `setup-firewall.sh`. Move to node-bootstrap. | |
| 73 | + | - [ ] **Restart warning** — `deploy.sh send_restart_warning` posts a banner before restart. Decide whether Sando emits this and through what surface (probably the existing in-app banner mechanism). | |
| 74 | + | - [ ] **Cross-compile from macOS** — `deploy.sh` builds on the dev laptop via `cargo-zigbuild`. Sando builds natively on MM (x86_64 Linux). Verify the resulting binaries are byte-identical or at least behavior-equivalent across one full sprint before retiring `deploy.sh`. | |
| 75 | + | - [ ] **Prod migrations** — today, who runs `sqlx migrate run` against prod? `deploy.sh` doesn't (verify). Sando should run prod migrations as part of `POST /promote/{tier}` for the prod tiers, OR there should be an explicit `POST /migrate/{tier}` operator action. Decide. | |
| 76 | + | ||
| 77 | + | ## Phase 4 — Cutover | |
| 78 | + | ||
| 79 | + | Run Sando in parallel with `deploy.sh` until trust is built, then retire the old path. | |
| 80 | + | ||
| 81 | + | - [ ] First successful Sando-only deploy to **testnot.work** (tier A). Old `deploy.sh` still primary for prod. | |
| 82 | + | - [ ] One sprint (two months) of Sando-shadow runs: every `deploy.sh` deploy is also driven through Sando in dry-run mode (gates run, deploys go to a parallel `releases/` dir on prod but don't swap `current`). Compare outcomes. | |
| 83 | + | - [ ] First Sando-only deploy to **Hetzner prod** (tier B). `deploy.sh` retained but unused. | |
| 84 | + | - [ ] Move `server/deploy/deploy.sh` to `server/deploy/archive/deploy.sh.legacy` with a header explaining the cutover; do not delete (reference for the next year). | |
| 85 | + | - [ ] Decommission astra CI runner (`server/deploy/run-ci.sh`). Sando's `cargo_test` gate replaces it; if any astra-specific checks are still needed (e.g., `cargo audit`), add them as additional gate kinds in `daemon/src/gates.rs`. | |
| 86 | + | - [ ] Update `CLAUDE.md` and `_meta/docs/operations.md` to point at Sando, not `deploy.sh`. | |
| 87 | + | ||
| 88 | + | ## Phase 5 — Operator UX | |
| 89 | + | ||
| 90 | + | The TUI polls. The MVP requires you to hand-insert a row for `manual_confirm`. Both are fine for one operator but rough. | |
| 91 | + | ||
| 92 | + | - [ ] Implement `WS /events`: tail of gate starts/finishes, deploy events, build logs. Subscribe from the TUI. | |
| 93 | + | - [ ] TUI: actions pane. `p` for promote (prompts for version + tier), `R` for rollback, `b` for backup fetch, `c` for manual_confirm. | |
| 94 | + | - [ ] `POST /confirm/{tier}` endpoint that inserts a `gate_runs` row with `passed=1, gate_kind='manual_confirm'` for the current pending version. Replaces the hand-SQL workaround. | |
| 95 | + | - [ ] TUI live log pane that follows the most recent build / gate run; backed by `WS /events`. | |
| 96 | + | - [ ] `POST /promote` body should accept `version` as optional; default to the current MM version when target is A, predecessor's current when target is B+. Reduces ceremony. | |
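The optional-`version` rule for `POST /promote` can be expressed as one lookup, since "current MM version when target is A" and "predecessor's current when target is B+" are the same rule once the tiers are held in pipeline order. The following is a hedged sketch: `resolve_version`, `ResolveError`, and the slice-of-tuples representation are hypothetical and do not reflect the actual schema or route handler.

```rust
#[derive(Debug, PartialEq)]
pub enum ResolveError {
    UnknownTier(String),
    /// The first tier (MM) has nothing upstream to default from.
    NoPredecessor(String),
    /// The upstream tier has never had a version deployed.
    PredecessorEmpty(String),
}

/// `tiers` is the pipeline in order, each with its current version if any,
/// e.g. `[("mm", Some("1.4.0")), ("a", Some("1.3.0")), ("b", None)]`.
pub fn resolve_version(
    tiers: &[(&str, Option<&str>)],
    target: &str,
    requested: Option<&str>,
) -> Result<String, ResolveError> {
    let idx = tiers
        .iter()
        .position(|(name, _)| *name == target)
        .ok_or_else(|| ResolveError::UnknownTier(target.to_string()))?;
    // An explicit version always wins over the default.
    if let Some(version) = requested {
        return Ok(version.to_string());
    }
    if idx == 0 {
        return Err(ResolveError::NoPredecessor(target.to_string()));
    }
    let (pred_name, pred_current) = tiers[idx - 1];
    pred_current
        .map(|v| v.to_string())
        .ok_or_else(|| ResolveError::PredecessorEmpty(pred_name.to_string()))
}
```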
| 97 | + | ||
| 98 | + | ## Phase 6 — Monitoring + alerting | |
| 99 | + | ||
| 100 | + | - [ ] Wire MM's `/metrics` endpoint into the existing MNW Prometheus scrape config; record where the scrape config lives in `_meta/` or wherever monitoring already runs. | |
| 101 | + | - [ ] Add metrics: counters `sando_builds_total{outcome}`, `sando_gates_total{tier,kind,outcome}`, `sando_deploys_total{tier,outcome}`, and a gauge `sando_burn_in_remaining_hours{tier}` (it decreases over time, so it cannot be a counter). | |
| 102 | + | - [ ] Alert: build failed. Page on first failure (not flap-protected — builds are infrequent). | |
| 103 | + | - [ ] Alert: migration_dry_run failed. Page immediately. This is the 2026-05-22-class signal. | |
| 104 | + | - [ ] Alert: a tier has had `current_version` unchanged for > N days while MM is green. (Operator forgot to promote.) | |
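For orientation, this is roughly what the Phase 6 metrics would look like on `/metrics` in the Prometheus text exposition format. The renderer below is a std-only sketch with hypothetical names (`Kind`, `render_metric`); a real daemon would more likely use an existing metrics crate. It assumes the burn-in metric is exposed as a gauge, since a remaining-hours value goes down.

```rust
pub enum Kind {
    Counter,
    Gauge,
}

/// Escape a label value per the text exposition format:
/// backslash, double quote, and newline.
fn escape_label(value: &str) -> String {
    value
        .replace('\\', "\\\\")
        .replace('"', "\\\"")
        .replace('\n', "\\n")
}

/// Render one metric family: a `# TYPE` line followed by one line per sample.
/// Each sample is a list of (label, value) pairs plus the numeric value.
pub fn render_metric(name: &str, kind: Kind, samples: &[(&[(&str, &str)], f64)]) -> String {
    let type_name = match kind {
        Kind::Counter => "counter",
        Kind::Gauge => "gauge",
    };
    let mut out = format!("# TYPE {} {}\n", name, type_name);
    for (labels, value) in samples {
        let rendered: Vec<String> = labels
            .iter()
            .map(|(k, v)| format!("{}=\"{}\"", k, escape_label(v)))
            .collect();
        out.push_str(&format!("{}{{{}}} {}\n", name, rendered.join(","), value));
    }
    out
}
```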
| 105 | + | ||
| 106 | + | ## Phase 7 — Multi-node B+C | |
| 107 | + | ||
| 108 | + | Today B is the only prod node. Adding C means provisioning a second prod node and putting CF Load Balancing in front of both. | |
| 109 | + | ||
| 110 | + | - [ ] Provision tier C node (Hetzner or alternate provider — capture rationale). | |
| 111 | + | - [ ] Update `sando.toml`: set `c.provisioned = true`, add `[[tier.node]]`. | |
| 112 | + | - [ ] Set up Cloudflare Load Balancing with B + C as origin pool, health-checked. | |
| 113 | + | - [ ] Verify sequential canary in Sando: deploy to B, wait for CF health-check to mark healthy (probably 30-60s probe interval), then deploy to C. Add a `node.health_url` field and a gate-style wait between nodes. | |
| 114 | + | - [ ] Document in README that `canary = "parallel"` exists but should never be used for B+C unless you understand the failure modes. | |
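The sequential canary in Phase 7 (deploy to B, wait until it is healthy, then deploy to C) can be sketched as a loop over nodes with a bounded health wait between them. Everything here is illustrative: `deploy_sequential` and `CanaryError` are hypothetical, and the deploy and health probes are passed in as closures so the control flow can be exercised without SSH or HTTP. In practice the health closure would poll the proposed `node.health_url`.

```rust
use std::thread;
use std::time::Duration;

#[derive(Debug, PartialEq)]
pub enum CanaryError {
    DeployFailed { node: String },
    NeverHealthy { node: String },
}

/// Deploy to each node in order. After each deploy, probe health up to
/// `max_probes` times, sleeping `interval` between failed probes. Stop at the
/// first node that fails to deploy or never becomes healthy, so later nodes
/// keep running the previous version.
pub fn deploy_sequential<D, H>(
    nodes: &[&str],
    max_probes: u32,
    interval: Duration,
    mut deploy: D,
    mut healthy: H,
) -> Result<(), CanaryError>
where
    D: FnMut(&str) -> bool,
    H: FnMut(&str) -> bool,
{
    for &node in nodes {
        if !deploy(node) {
            return Err(CanaryError::DeployFailed { node: node.to_string() });
        }
        let mut became_healthy = false;
        for _ in 0..max_probes {
            if healthy(node) {
                became_healthy = true;
                break;
            }
            thread::sleep(interval);
        }
        if !became_healthy {
            return Err(CanaryError::NeverHealthy { node: node.to_string() });
        }
    }
    Ok(())
}
```

The early return is the point of the sequential mode: if B never reports healthy, C is not touched and still serves the old release.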
| 115 | + | ||
| 116 | + | ## Phase 8 — Postgres-on-D | |
| 117 | + | ||
| 118 | + | Move Postgres off the prod app node so B+C become truly interchangeable. | |
| 119 | + | ||
| 120 | + | - [ ] Provision Postgres-only machine D (modest spec; reliability over performance). | |
| 121 | + | - [ ] Migrate the prod DB from Hetzner app node to D. Capture procedure in `plans/postgres-d-migration.md`. | |
| 122 | + | - [ ] Update `server` `DATABASE_URL` everywhere (env files on B+C, scratch URL on MM stays local). | |
| 123 | + | - [ ] Replica/HA story stays deferred; D is SPOF for now (per `_meta/preclear/.../decisions.md`). | |
| 124 | + | ||
| 125 | + | ## Phase 9 — Hardening | |
| 126 | + | ||
| 127 | + | Pick up after cutover is stable. | |
| 128 | + | ||
| 129 | + | - [ ] Tailnet ACL audit: confirm only the laptop can reach `sandod:7766`. Document the ACL. | |
| 130 | + | - [ ] Decide if v0.2 needs token auth on `sandod` endpoints (revisit assumption from `decisions.md` once there's a real second operator). | |
| 131 | + | - [ ] Sando self-deploy: Sando builds and deploys *itself* through its own pipeline. Bootstraps the bootstrap. Closes the chicken-and-egg loop and is satisfying. | |
| 132 | + | - [ ] Backup-of-Sando-state: nightly SQLite snapshot to astra. The state DB tracks 6 months of deploys; losing it on a MM disk failure would be annoying. | |
| 133 | + | ||
| 134 | + | ## Notes / non-checkbox | |
| 135 | + | ||
| 136 | + | - WS `/events` and the operator-UX work in Phase 5 can run in parallel with Phases 1-3 once MM exists. They are listed later for review clarity, not because anything blocks them. | |
| 137 | + | - "Hotfix override" and `reset_burn_in` flag are already implemented end-to-end (see `decisions.md`); not on this list because there's nothing left to do until prod uses them. | |
| 138 | + | - C tier exists in the schema as a `provisioned=false` row from day one — adding C in Phase 7 is a TOML edit, not a migration. |