max / makenotwork

sando: add todo.md with CI/CD migration roadmap

Ten phases (0-9), from MakeMachine bootstrap through cutover and post-cutover hardening. Phase 0 is hardware + base provisioning. Phases 1-3 build remote deploys, a real backup pipeline, and parity with deploy.sh. Phase 4 is the parallel-run-then-retire cutover. Phases 5-9 are operator UX, monitoring, multi-node B+C, Postgres-on-D, and hardening. The Key Paths section points at the files to read first.
Author: Max J. <87768334+MaxJMath@users.noreply.github.com> · 2026-05-23 02:32 UTC
Commit: 8f502d845aff5119821a3d0d5dce388d339b1e81
Parent: 68922df
1 file changed, +138 insertions, -0 deletions
A sando/todo.md +138
@@ -0,0 +1,138 @@
1 + # Sando TODO
2 +
3 + Open work only. Completed items move to `todo_done.md` (sibling file) when one exists. Design notes go in `plans/<name>.md`, not folded into checkboxes.
4 +
5 + Format rule: every actionable line is a `- [ ]` checkbox. Headings group phases and themes; do not put status updates in them.
6 +
7 + Roadmap target: replace `server/deploy/deploy.sh` and astra-hosted `server/deploy/run-ci.sh` with Sando running on the MakeMachine, gating Hetzner prod through testnot.work.
8 +
9 + Phases are ordered for execution. Phase 0 must finish before Phase 1 is meaningful. Phases 5+ are post-cutover hardening.
10 +
11 + ## Key Paths
12 +
13 + Read these to orient before working on Sando:
14 +
15 + - `README.md` — quickstart, API surface, v0 limitations
16 + - `sando.toml` — current topology (MM → A → B; C declared, not provisioned)
17 + - `daemon/src/main.rs` — startup sequence (config → topology → migrate → sync → bare-repo bootstrap → serve)
18 + - `daemon/src/routes.rs` — `/state`, `/promote`, `/rollback`, `/rebuild`, `/backup/fetch`, `/events`
19 + - `daemon/src/gates.rs` — gate runners; the load-bearing logic
20 + - `daemon/src/build.rs` — `build_and_run_mm` is the MM-tier pipeline
21 + - `daemon/src/deploy.rs` — `deploy_local`; remote SSH stub
22 + - `daemon/migrations/001_init.sql` — schema (tiers/nodes as rows)
23 + - `server/deploy/deploy.sh` — current cross-compile + push-to-Hetzner script (what we are replacing)
24 + - `server/deploy/run-ci.sh` — current astra CI script (what we are replacing)
25 + - `_meta/docs/operations.md` — burn-in rule and hotfix policy that gates encode
26 +
27 + ---
28 +
29 + ## Phase 0 — MakeMachine bootstrap
30 +
31 + Hardware and base provisioning. None of the remote-deploy work below matters until MM exists.
32 +
33 + - [ ] Purchase MakeMachine hardware (Threadripper 7960X + RTX PRO 6000 Blackwell + 256 GB ECC + Gen5 NVMe; ~$14-16K per `project_inference_stack.md`).
34 + - [ ] Install x86_64 Linux (match Hetzner prod distro/version to keep build env aligned).
35 + - [ ] Join MM to tailnet; allocate a stable hostname and record in `_meta/infra_tailnet.md`.
36 + - [ ] Provision `sando` system user; lock down the home dir; set up scoped SSH keys for outbound deploys.
37 + - [ ] Install scratch Postgres locally on MM; create the `sando_scratch` role + DB used by `migration_dry_run`.
38 + - [ ] Write the `sandod.service` systemd unit (run as `sando` user, restart on failure, `EnvironmentFile=/etc/sando/sando.env`).
39 + - [ ] Install `sandod` binary at `/usr/local/bin/sandod`; enable + start the unit.
40 + - [ ] Write the production `sando.toml`; bare repo path under `/srv/sando/mnw.git`; A node `testnot.work`; B node Hetzner prod.
41 +
42 + ## Phase 1 — Remote deploy
43 +
44 + The MVP only deploys to `ssh_target=local`. Production needs real SSH/rsync.
45 +
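46 + - [ ] Implement `deploy::deploy_node` remote path: rsync the staged binary to `<ssh_target>:<release_root>/releases/<version>/server`, then `ssh <ssh_target> "cd <release_root> && ln -sfn releases/<version> current && systemctl reload-or-restart <unit>"` (the `cd` matters: without it the relative symlink is created in the SSH user's home dir, not under `<release_root>`).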
47 + - [ ] Settle systemd unit naming convention. Current MNW server unit is `makenotwork.service`; decide whether Sando keeps that name or migrates to `mnw-server.service`. Capture in `plans/systemd-units.md` before changing anything live.
48 + - [ ] Add `node.systemd_unit` field to `sando.toml` (default derives from the tier+role) so the convention is explicit per-node.
49 + - [ ] Bootstrap script for adding a fresh node: creates `<release_root>`, installs the systemd unit pointing at `<release_root>/current/server`, adds the sando SSH key to `authorized_keys`. Idempotent.
50 + - [ ] Garbage-collect old releases on the remote: keep last N (configurable, default 5) per node. Run at end of each successful deploy.
51 + - [ ] Handle `rsync` failure mid-deploy: leave the previous `current` symlink intact; mark `deploys.outcome = 'failed'`; do not advance `tier_state`.
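The remote path and the keep-last-N rule in the Phase 1 items can be sketched as two pure functions. This is a minimal sketch, not the contents of `daemon/src/deploy.rs`: `Step`, `plan_remote_deploy`, and `releases_to_prune` are hypothetical names, the extra `mkdir -p` step is an assumption (plain `rsync` does not create missing parent directories), and the inputs are assumed to contain no shell metacharacters.

```rust
/// One external command, stored as argv so nothing passes through a local shell.
#[derive(Debug, PartialEq)]
pub struct Step {
    pub program: &'static str,
    pub args: Vec<String>,
}

/// Plan a remote deploy: create the release dir, upload the binary, then
/// swap `current` and restart the unit. The swap comes last, so a failed
/// rsync leaves the previous `current` symlink untouched.
pub fn plan_remote_deploy(
    ssh_target: &str,
    release_root: &str,
    version: &str,
    unit: &str,
    staged_binary: &str,
) -> Vec<Step> {
    let release_dir = format!("{release_root}/releases/{version}");
    vec![
        Step {
            program: "ssh",
            args: vec![ssh_target.to_string(), format!("mkdir -p {release_dir}")],
        },
        Step {
            program: "rsync",
            args: vec![
                "-a".to_string(),
                staged_binary.to_string(),
                format!("{ssh_target}:{release_dir}/server"),
            ],
        },
        Step {
            program: "ssh",
            args: vec![
                ssh_target.to_string(),
                // `cd` first so the relative symlink lands under release_root.
                format!(
                    "cd {release_root} && ln -sfn releases/{version} current \
                     && systemctl reload-or-restart {unit}"
                ),
            ],
        },
    ]
}

/// Pick releases to delete: everything except the newest `keep` entries,
/// and never the release that `current` points at.
/// Assumes `oldest_first` is already ordered by deploy time.
pub fn releases_to_prune<'a>(
    oldest_first: &[&'a str],
    keep: usize,
    current: &str,
) -> Vec<&'a str> {
    let cutoff = oldest_first.len().saturating_sub(keep);
    oldest_first[..cutoff]
        .iter()
        .copied()
        .filter(|v| *v != current)
        .collect()
}
```

A caller would run the steps in order and stop at the first non-zero exit, recording `deploys.outcome = 'failed'` without advancing `tier_state`. Keeping the planning pure makes the ordering guarantee easy to unit-test without a remote host.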
52 +
53 + ## Phase 2 — Backup pipeline + migration dry-run
54 +
55 + `migration_dry_run` is the load-bearing gate. It needs a real backup source, not a fixture.
56 +
57 + - [ ] Confirm astra's offsite replica (per `sync-backup-offsite.sh`) writes a deterministic latest-link path Sando can rsync from. If not, add one.
58 + - [ ] Wire the production `sando.toml` `backup.source` to the astra rsync URL.
59 + - [ ] Schedule a daily `POST /backup/fetch` (cron or systemd timer on MM) so the newest backup is never more than 24h old at any promote attempt.
60 + - [ ] First end-to-end `migration_dry_run` against a real prod backup; confirm it catches the 2026-05-22 incident class (drop+recreate column migration sequence).
61 + - [ ] Document the failure modes: what does the operator see in `/state` when the dry-run fails? Capture in `plans/migration-dryrun-failures.md`.
62 + - [ ] Decide retention on `backups` table — prune rows older than N days so SQLite doesn't grow forever.
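The 24h freshness expectation and the retention cutoff from the Phase 2 items reduce to two small time comparisons. The sketch below is illustrative only: `backup_is_fresh`, `prune_cutoff`, and `MAX_BACKUP_AGE` are hypothetical names, and treating a backup timestamped in the future as stale is an assumption, not documented Sando behaviour.

```rust
use std::time::{Duration, SystemTime};

/// Maximum backup age tolerated at promote time (24h, per the daily fetch).
pub const MAX_BACKUP_AGE: Duration = Duration::from_secs(24 * 60 * 60);

/// True when the backup was fetched no more than `max_age` before `now`.
/// A timestamp in the future (clock skew) is treated as not fresh,
/// which is the conservative choice for a gate.
pub fn backup_is_fresh(fetched_at: SystemTime, now: SystemTime, max_age: Duration) -> bool {
    match now.duration_since(fetched_at) {
        Ok(age) => age <= max_age,
        Err(_) => false,
    }
}

/// Rows in `backups` fetched before the returned instant are prunable.
/// Returns None if the subtraction cannot be represented.
pub fn prune_cutoff(now: SystemTime, retention_days: u64) -> Option<SystemTime> {
    now.checked_sub(Duration::from_secs(retention_days * 24 * 60 * 60))
}
```

Passing `now` in explicitly, rather than calling `SystemTime::now()` inside, keeps both functions deterministic under test.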
63 +
64 + ## Phase 3 — Parity with current `deploy.sh`
65 +
66 + Sando currently only ships the binary. `deploy.sh` does more. Inventory each piece and either fold it into Sando or document the explicit hand-off.
67 +
68 + - [ ] **Caddyfile** — `deploy.sh upload_config` pushes `server/deploy/Caddyfile` to `/etc/caddy/Caddyfile` and reloads Caddy. Decide: ship as a versioned config artifact alongside the binary (cleanest), or keep Caddy config out-of-band? Capture in `plans/config-artifacts.md`.
69 + - [ ] **systemd unit** — `deploy.sh` uploads `makenotwork.service`. With Sando the unit points at `current/server` and shouldn't change per release. Move unit ownership to the node-bootstrap script (Phase 1) and remove from per-deploy flow.
70 + - [ ] **Backup script** — `backup-db.sh` is uploaded by `deploy.sh`. Move to node-bootstrap; not a per-release artifact.
71 + - [ ] **Error pages** — static HTML in `server/deploy/error-pages/`. Either bake into the binary (preferred — versions with code) or ship as a `releases/<version>/error-pages/` sibling. Capture decision.
72 + - [ ] **Security configs** — `sshd-git.conf`, `fail2ban-sshd.conf`, `setup-firewall.sh`. Move to node-bootstrap.
73 + - [ ] **Restart warning** — `deploy.sh send_restart_warning` posts a banner before restart. Decide whether Sando emits this and through what surface (probably the existing in-app banner mechanism).
74 + - [ ] **Cross-compile from macOS** — `deploy.sh` builds on the dev laptop via `cargo-zigbuild`. Sando builds natively on MM (x86_64 Linux). Verify the resulting binaries are byte-identical or at least behavior-equivalent across one full sprint before retiring `deploy.sh`.
75 + - [ ] **Prod migrations** — today, who runs `sqlx migrate run` against prod? `deploy.sh` doesn't (verify). Sando should run prod migrations as part of `POST /promote/{tier}` for the prod tiers, OR there should be an explicit `POST /migrate/{tier}` operator action. Decide.
76 +
77 + ## Phase 4 — Cutover
78 +
79 + Run Sando in parallel with `deploy.sh` until trust is built, then retire the old path.
80 +
81 + - [ ] First successful Sando-only deploy to **testnot.work** (tier A). Old `deploy.sh` still primary for prod.
82 + - [ ] One sprint (two months) of Sando-shadow runs: every `deploy.sh` deploy is also driven through Sando in dry-run mode (gates run, deploys go to a parallel `releases/` dir on prod but don't swap `current`). Compare outcomes.
83 + - [ ] First Sando-only deploy to **Hetzner prod** (tier B). `deploy.sh` retained but unused.
84 + - [ ] Move `server/deploy/deploy.sh` to `server/deploy/archive/deploy.sh.legacy` with a header explaining the cutover; do not delete (reference for the next year).
85 + - [ ] Decommission astra CI runner (`server/deploy/run-ci.sh`). Sando's `cargo_test` gate replaces it; if any astra-specific checks are still needed (e.g., `cargo audit`), add them as additional gate kinds in `daemon/src/gates.rs`.
86 + - [ ] Update `CLAUDE.md` and `_meta/docs/operations.md` to point at Sando, not `deploy.sh`.
87 +
88 + ## Phase 5 — Operator UX
89 +
90 + The TUI polls. The MVP requires you to hand-insert a row for `manual_confirm`. Both are fine for one operator but rough.
91 +
92 + - [ ] Implement `WS /events`: tail of gate starts/finishes, deploy events, build logs. Subscribe from the TUI.
93 + - [ ] TUI: actions pane. `p` for promote (prompts for version + tier), `R` for rollback, `b` for backup fetch, `c` for manual_confirm.
94 + - [ ] `POST /confirm/{tier}` endpoint that inserts a `gate_runs` row with `passed=1, gate_kind='manual_confirm'` for the current pending version. Replaces the hand-SQL workaround.
95 + - [ ] TUI live log pane that follows the most recent build / gate run; backed by `WS /events`.
96 + - [ ] `POST /promote` body should accept `version` as optional; default to the current MM version when target is A, predecessor's current when target is B+. Reduces ceremony.
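The optional-`version` rule in the last item amounts to "use the predecessor tier's current version", since MM is A's predecessor. A minimal sketch under that reading follows. `TierState` and `resolve_promote_version` are hypothetical names, and the tier list is assumed to be in promotion order starting with MM.

```rust
/// Snapshot of one tier, in promotion order (MM first).
pub struct TierState<'a> {
    pub name: &'a str,
    pub current_version: Option<&'a str>,
}

/// Resolve the version for a promote request.
/// An explicit `requested` version wins; otherwise fall back to the
/// current version of the tier immediately before `target`.
pub fn resolve_promote_version<'a>(
    requested: Option<&'a str>,
    target: &str,
    tiers: &[TierState<'a>],
) -> Result<&'a str, String> {
    let idx = tiers
        .iter()
        .position(|t| t.name == target)
        .ok_or_else(|| format!("unknown tier {target}"))?;
    if idx == 0 {
        // The first tier is built, not promoted into.
        return Err(format!("tier {target} has no predecessor"));
    }
    if let Some(version) = requested {
        return Ok(version);
    }
    let predecessor = &tiers[idx - 1];
    predecessor
        .current_version
        .ok_or_else(|| format!("tier {} has no current version", predecessor.name))
}
```

Returning an error when the predecessor has no current version, instead of guessing, keeps a reduced-ceremony promote from silently deploying nothing.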
97 +
98 + ## Phase 6 — Monitoring + alerting
99 +
100 + - [ ] Wire MM's `/metrics` endpoint into the existing MNW Prometheus scrape config; record where the scrape config lives in `_meta/` or wherever monitoring already runs.
101 + - [ ] Add metrics: counters `sando_builds_total{outcome}`, `sando_gates_total{tier,kind,outcome}`, `sando_deploys_total{tier,outcome}`, plus a gauge `sando_burn_in_remaining_hours{tier}` (it decreases over time, so it cannot be a counter).
102 + - [ ] Alert: build failed. Page on first failure (not flap-protected — builds are infrequent).
103 + - [ ] Alert: migration_dry_run failed. Page immediately. This is the 2026-05-22-class signal.
104 + - [ ] Alert: a tier has had `current_version` unchanged for > N days while MM is green. (Operator forgot to promote.)
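To make the metric names above concrete, here is a hand-rolled sketch of a tiny registry that renders sample lines in the Prometheus text exposition format. It is not how `sandod` serves `/metrics`: the `Metrics` type and its methods are hypothetical, `# TYPE` lines and label-value escaping are omitted, and `sando_burn_in_remaining_hours` is modelled as a gauge because its value goes down. A real daemon would more likely use an existing metrics crate.

```rust
use std::collections::BTreeMap;

/// Minimal in-memory registry. BTreeMap keeps the output order stable.
#[derive(Default)]
pub struct Metrics {
    counters: BTreeMap<String, u64>,
    gauges: BTreeMap<String, f64>,
}

impl Metrics {
    /// Build `name{k="v",...}`; label values are assumed to need no escaping.
    fn key(name: &str, labels: &[(&str, &str)]) -> String {
        if labels.is_empty() {
            return name.to_string();
        }
        let body: Vec<String> = labels
            .iter()
            .map(|(k, v)| format!("{k}=\"{v}\""))
            .collect();
        format!("{name}{{{}}}", body.join(","))
    }

    /// Increment a monotonically increasing counter by one.
    pub fn inc(&mut self, name: &str, labels: &[(&str, &str)]) {
        *self.counters.entry(Self::key(name, labels)).or_insert(0) += 1;
    }

    /// Set a gauge to an arbitrary value (it may go up or down).
    pub fn set_gauge(&mut self, name: &str, labels: &[(&str, &str)], value: f64) {
        self.gauges.insert(Self::key(name, labels), value);
    }

    /// Render one `series value` line per sample, counters first.
    pub fn render(&self) -> String {
        let mut out = String::new();
        for (series, value) in &self.counters {
            out.push_str(&format!("{series} {value}\n"));
        }
        for (series, value) in &self.gauges {
            out.push_str(&format!("{series} {value}\n"));
        }
        out
    }
}
```

Incrementing `sando_builds_total` with `outcome="failed"` is the event the build-failed alert would key on, and the migration_dry_run alert would watch `sando_gates_total` with the matching `kind` and `outcome` labels.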
105 +
106 + ## Phase 7 — Multi-node B+C
107 +
108 + Today B is the only prod node. Adding C gives a second prod node, with both fronted by CF Load Balancing.
109 +
110 + - [ ] Provision tier C node (Hetzner or alternate provider — capture rationale).
111 + - [ ] Update `sando.toml`: set `c.provisioned = true`, add `[[tier.node]]`.
112 + - [ ] Set up Cloudflare Load Balancing with B + C as origin pool, health-checked.
113 + - [ ] Verify sequential canary in Sando: deploy to B, wait for CF health-check to mark healthy (probably 30-60s probe interval), then deploy to C. Add a `node.health_url` field and a gate-style wait between nodes.
114 + - [ ] Document in README that `canary = "parallel"` exists but should never be used for B+C unless you understand the failure modes.
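The sequential canary described above (deploy to B, wait until healthy, then deploy to C) can be sketched as a loop over nodes with injected deploy and health-probe callbacks. This is a sketch under assumptions: `rollout_sequential` and `RolloutError` are hypothetical names, and the probe closure is expected to do its own sleeping between attempts (for example, polling the proposed `node.health_url`).

```rust
#[derive(Debug, PartialEq)]
pub enum RolloutError {
    /// The deploy step itself failed on this node.
    DeployFailed(String),
    /// The node never reported healthy within the probe budget.
    Unhealthy(String),
}

/// Deploy to each node in order. A node must pass a health probe within
/// `max_probes` attempts before the next node is touched, so a bad
/// release stops at the first node.
pub fn rollout_sequential<D, H>(
    nodes: &[&str],
    max_probes: u32,
    mut deploy: D,
    mut healthy: H,
) -> Result<(), RolloutError>
where
    D: FnMut(&str) -> bool,
    H: FnMut(&str) -> bool,
{
    for &node in nodes {
        if !deploy(node) {
            return Err(RolloutError::DeployFailed(node.to_string()));
        }
        let mut became_healthy = false;
        for _ in 0..max_probes {
            if healthy(node) {
                became_healthy = true;
                break;
            }
        }
        if !became_healthy {
            return Err(RolloutError::Unhealthy(node.to_string()));
        }
    }
    Ok(())
}
```

With `nodes = ["B", "C"]`, an unhealthy B returns early and C keeps serving the previous release. A parallel canary cannot give that guarantee.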
115 +
116 + ## Phase 8 — Postgres-on-D
117 +
118 + Move Postgres off the prod app node so B+C become truly interchangeable.
119 +
120 + - [ ] Provision Postgres-only machine D (modest spec; reliability over performance).
121 + - [ ] Migrate the prod DB from Hetzner app node to D. Capture procedure in `plans/postgres-d-migration.md`.
122 + - [ ] Update `server` `DATABASE_URL` everywhere (env files on B+C, scratch URL on MM stays local).
123 + - [ ] Replica/HA story stays deferred; D is SPOF for now (per `_meta/preclear/.../decisions.md`).
124 +
125 + ## Phase 9 — Hardening
126 +
127 + Pick up after cutover is stable.
128 +
129 + - [ ] Tailnet ACL audit: confirm only the laptop can reach `sandod:7766`. Document the ACL.
130 + - [ ] Decide if v0.2 needs token auth on `sandod` endpoints (revisit assumption from `decisions.md` once there's a real second operator).
131 + - [ ] Sando self-deploy: Sando builds and deploys *itself* through its own pipeline. Bootstraps the bootstrap. Closes the chicken-and-egg loop and is satisfying.
132 + - [ ] Backup-of-Sando-state: nightly SQLite snapshot to astra. The state DB tracks 6 months of deploys; losing it on a MM disk failure would be annoying.
133 +
134 + ## Notes / non-checkbox
135 +
136 + - WS `/events` and the operator-UX work in Phase 5 can run in parallel with Phases 1-3 once MM exists. They are sequenced afterwards for review clarity, not because anything blocks on them.
137 + - "Hotfix override" and `reset_burn_in` flag are already implemented end-to-end (see `decisions.md`); not on this list because there's nothing left to do until prod uses them.
138 + - C tier exists in the schema as a `provisioned=false` row from day one — adding C in Phase 7 is a TOML edit, not a migration.