max / makenotwork

sando: WIP - backup automation, events streaming, observability fixes

Bundles in-flight sando work with two boot-related fixes that surfaced during the 2026-06-01 MNW launch:

- gates::boot_smoke now injects DATABASE_URL (scratch DB) and SCAN_ENABLED=false. The previous spawn passed only SANDO_BOOT_SMOKE=1 and the server panicked instantly on Config::from_env's MissingDatabaseUrl, failing every rebuild.
- routes::get_state falls back to the most-recently-attempted version for a tier when current_version is unset. Before, a never-green tier exposed an empty gates array via /state; debugging required SSH + direct SQLite. This is what hid the MM gate failures all morning.

Broader WIP (operator's stream): backup automation (sandod-backup-fetch systemd unit + timer, sync.rs hooks), events module split, deploy.rs restart-warning + symlink swap, bootstrap-node.sh for fresh target nodes, plans docs, TUI dashboard rework, todo.md updates.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Author: Max Johnson <me@maxj.phd> · 2026-06-01 19:31 UTC
Commit: f76f08cddfd8ad1434664cbf7474c07b04fa6a2e
Parent: 869c5e0
26 files changed, +3204 insertions, -254 deletions
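
The two boot-related fixes named in the commit message are not visible in the hunks shown here, so the following is only a minimal std-only sketch of what they describe, not the code from `gates.rs` or `routes.rs`. The struct `TierRow`, its field `last_attempted_version`, and the functions `version_for_state`, `boot_smoke_command`, and `env_of` are hypothetical names chosen for illustration. Only the three environment variables and the fallback rule itself come from the message.

```rust
use std::ffi::OsStr;
use std::process::Command;

/// Hypothetical stand-in for a tier's stored state; field names are assumptions.
struct TierRow {
    current_version: Option<String>,
    last_attempted_version: Option<String>,
}

/// Pick the version whose gate rows `/state` should report: the tier's
/// current version if set, otherwise the most recently attempted one.
fn version_for_state(t: &TierRow) -> Option<&str> {
    t.current_version
        .as_deref()
        .or(t.last_attempted_version.as_deref())
}

/// Build (but do not spawn) a boot-smoke command carrying the environment
/// the commit message says the server needs in order to start.
fn boot_smoke_command(bin: &str, scratch_db_url: &str) -> Command {
    let mut cmd = Command::new(bin);
    cmd.env("SANDO_BOOT_SMOKE", "1")
        .env("DATABASE_URL", scratch_db_url)
        .env("SCAN_ENABLED", "false");
    cmd
}

/// Read back one explicitly set environment variable from a `Command`.
fn env_of(cmd: &Command, key: &str) -> Option<String> {
    cmd.get_envs()
        .find(|(k, _)| *k == OsStr::new(key))
        .and_then(|(_, v)| v.map(|v| v.to_string_lossy().into_owned()))
}
```

With a never-green tier (`current_version` unset), `version_for_state` returns the attempted version, so its failing gates would be reportable instead of an empty array.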
M sando/LICENSE +1 -1
@@ -1,6 +1,6 @@
1 1 MIT License
2 2
3 - Copyright (c) 2026 Max Jacobson
3 + Copyright (c) 2026 Make Creative, LLC
4 4
5 5 Permission is hereby granted, free of charge, to any person obtaining a copy
6 6 of this software and associated documentation files (the "Software"), to deal
M sando/README.md +22 -10
@@ -103,12 +103,29 @@ curl -X POST http://127.0.0.1:7766/promote/a \
103 103 | Method | Path | Body | Purpose |
104 104 |--------|------|------|---------|
105 105 | GET | `/state` | — | Tier list + current/previous version + last gate outcomes |
106 - | POST | `/rebuild` | `{sha?: string}` | Force a build; if `sha` is absent, resolves the configured deploy branch |
107 - | POST | `/promote/{tier}` | `{version, hotfix?, reset_burn_in?}` | Verify predecessor gates, deploy to tier nodes, advance state |
106 + | POST | `/rebuild` | `{sha?: string}` | Force a build; if `sha` is absent, resolves the configured deploy branch. Aborts any in-flight build (latest wins). |
107 + | POST | `/promote/{tier}` | `{version?, hotfix?, reset_burn_in?}` | Verify predecessor gates, deploy to tier nodes, advance state. `version` defaults to the predecessor tier's `current_version`. |
108 108 | POST | `/rollback/{tier}` | — | Swap `current` symlink to `previous_version` on every node in the tier |
109 - | POST | `/backup/fetch` | — | Pull the prod backup to `backup.local_path` (file:// or rsync://) |
109 + | POST | `/confirm/{tier}` | — | Insert a passing `manual_confirm` gate row for the tier's `current_version`. Replaces hand-SQL. |
110 + | POST | `/backup/fetch` | — | Pull the prod backup. Supports `file://`, `rsync://`, `ssh://user@host[:port]/path`. |
110 111 | GET | `/metrics` | — | Prometheus exposition |
111 - | GET | `/events` | — | WebSocket stream of deploy + gate events (not yet implemented) |
112 + | GET | `/events` | — | WebSocket stream of typed events (RebuildRequested, BuildStart/Ok/Failed, GateStart/Done, DeployStart/Ok/Failed, PromoteComplete, Rollback, BackupFetched, ManualConfirm, BuildAborted). |
113 +
114 + ## TUI
115 +
116 + `sando` (the TUI binary) connects to `$SANDO_DAEMON` (default `http://127.0.0.1:7766`), polls `/state` every 2s, and subscribes to `/events` over WS. Keybindings:
117 +
118 + | key | action |
119 + |-----|--------|
120 + | ↑/↓ or j/k | select tier |
121 + | p | `POST /promote/<selected>` (no body — version defaults to predecessor's current) |
122 + | R | `POST /rollback/<selected>` |
123 + | b | `POST /backup/fetch` |
124 + | c | `POST /confirm/<selected>` |
125 + | r | refresh hint (poller is already every 2s) |
126 + | q / Esc / Ctrl-C | quit |
127 +
128 + Action results show up in the events log a moment later (the actions themselves emit events from the daemon side).
112 129
113 130 ## Hotfix flow
114 131
@@ -123,14 +140,9 @@ curl -X POST http://127.0.0.1:7766/promote/a \
123 140
124 141 ## v0 limitations
125 142
126 - - Remote deploys (real SSH/rsync) are stubbed. Use `ssh_target = "local"` and
127 - a local `release_root` for dev. Production wiring is a follow-up.
128 143 - `migration_dry_run` requires a scratch Postgres at `scratch_db_url`. The
129 - gate drops and recreates `public` on every run; do not point this at
144 + gate drops every non-system schema on every run; do not point this at
130 145 anything that matters.
131 - - `/events` WebSocket is not implemented; the TUI polls `/state` every 2s.
132 - - `manual_confirm` has no operator-facing trigger yet (you have to insert a
133 - `gate_runs` row with `passed=1` by hand to satisfy it).
134 146
135 147 ## License
136 148
@@ -185,6 +185,12 @@ source = "registry+https://github.com/rust-lang/crates.io-index"
185 185 checksum = "9330f8b2ff13f34540b44e946ef35111825727b38d33286ef986142615121801"
186 186
187 187 [[package]]
188 + name = "cfg_aliases"
189 + version = "0.2.1"
190 + source = "registry+https://github.com/rust-lang/crates.io-index"
191 + checksum = "613afe47fcd5fac7ccf1db93babcb082c5994d996f20b8b159f2ad1658eb5724"
192 +
193 + [[package]]
188 194 name = "chrono"
189 195 version = "0.4.44"
190 196 source = "registry+https://github.com/rust-lang/crates.io-index"
@@ -382,6 +388,12 @@ dependencies = [
382 388 ]
383 389
384 390 [[package]]
391 + name = "fastrand"
392 + version = "2.4.1"
393 + source = "registry+https://github.com/rust-lang/crates.io-index"
394 + checksum = "9f1f227452a390804cdb637b74a86990f2a7d7ba4b7d5693aac9b4dd6defd8d6"
395 +
396 + [[package]]
385 397 name = "find-msvc-tools"
386 398 version = "0.1.9"
387 399 source = "registry+https://github.com/rust-lang/crates.io-index"
@@ -522,8 +534,10 @@ source = "registry+https://github.com/rust-lang/crates.io-index"
522 534 checksum = "ff2abc00be7fca6ebc474524697ae276ad847ad0a6b3faa4bcb027e9a4614ad0"
523 535 dependencies = [
524 536 "cfg-if",
537 + "js-sys",
525 538 "libc",
526 539 "wasi",
540 + "wasm-bindgen",
527 541 ]
528 542
529 543 [[package]]
@@ -533,9 +547,11 @@ source = "registry+https://github.com/rust-lang/crates.io-index"
533 547 checksum = "899def5c37c4fd7b2664648c28120ecec138e4d395b459e5ca34f9cce2dd77fd"
534 548 dependencies = [
535 549 "cfg-if",
550 + "js-sys",
536 551 "libc",
537 552 "r-efi",
538 553 "wasip2",
554 + "wasm-bindgen",
539 555 ]
540 556
541 557 [[package]]
@@ -681,6 +697,23 @@ dependencies = [
681 697 "pin-project-lite",
682 698 "smallvec",
683 699 "tokio",
700 + "want",
701 + ]
702 +
703 + [[package]]
704 + name = "hyper-rustls"
705 + version = "0.27.9"
706 + source = "registry+https://github.com/rust-lang/crates.io-index"
707 + checksum = "33ca68d021ef39cf6463ab54c1d0f5daf03377b70561305bb89a8f83aab66e0f"
708 + dependencies = [
709 + "http",
710 + "hyper",
711 + "hyper-util",
712 + "rustls",
713 + "tokio",
714 + "tokio-rustls",
715 + "tower-service",
716 + "webpki-roots",
684 717 ]
685 718
686 719 [[package]]
@@ -689,13 +722,21 @@ version = "0.1.20"
689 722 source = "registry+https://github.com/rust-lang/crates.io-index"
690 723 checksum = "96547c2556ec9d12fb1578c4eaf448b04993e7fb79cbaad930a656880a6bdfa0"
691 724 dependencies = [
725 + "base64",
692 726 "bytes",
727 + "futures-channel",
728 + "futures-util",
693 729 "http",
694 730 "http-body",
695 731 "hyper",
732 + "ipnet",
733 + "libc",
734 + "percent-encoding",
696 735 "pin-project-lite",
736 + "socket2",
697 737 "tokio",
698 738 "tower-service",
739 + "tracing",
699 740 ]
700 741
701 742 [[package]]
@@ -836,6 +877,12 @@ dependencies = [
836 877 ]
837 878
838 879 [[package]]
880 + name = "ipnet"
881 + version = "2.12.0"
882 + source = "registry+https://github.com/rust-lang/crates.io-index"
883 + checksum = "d98f6fed1fde3f8c21bc40a1abb88dd75e67924f9cffc3ef95607bad8017f8e2"
884 +
885 + [[package]]
839 886 name = "itoa"
840 887 version = "1.0.18"
841 888 source = "registry+https://github.com/rust-lang/crates.io-index"
@@ -909,6 +956,12 @@ dependencies = [
909 956 ]
910 957
911 958 [[package]]
959 + name = "linux-raw-sys"
960 + version = "0.12.1"
961 + source = "registry+https://github.com/rust-lang/crates.io-index"
962 + checksum = "32a66949e030da00e8c7d4434b251670a91556f4144941d37452769c25d58a53"
963 +
964 + [[package]]
912 965 name = "litemap"
913 966 version = "0.8.2"
914 967 source = "registry+https://github.com/rust-lang/crates.io-index"
@@ -943,6 +996,12 @@ dependencies = [
943 996 ]
944 997
945 998 [[package]]
999 + name = "lru-slab"
1000 + version = "0.1.2"
1001 + source = "registry+https://github.com/rust-lang/crates.io-index"
1002 + checksum = "112b39cec0b298b6c1999fee3e31427f74f676e4cb9879ed1a121b43661a4154"
1003 +
1004 + [[package]]
946 1005 name = "matchers"
947 1006 version = "0.2.0"
948 1007 source = "registry+https://github.com/rust-lang/crates.io-index"
@@ -1225,6 +1284,61 @@ dependencies = [
1225 1284 ]
1226 1285
1227 1286 [[package]]
1287 + name = "quinn"
1288 + version = "0.11.9"
1289 + source = "registry+https://github.com/rust-lang/crates.io-index"
1290 + checksum = "b9e20a958963c291dc322d98411f541009df2ced7b5a4f2bd52337638cfccf20"
1291 + dependencies = [
1292 + "bytes",
1293 + "cfg_aliases",
1294 + "pin-project-lite",
1295 + "quinn-proto",
1296 + "quinn-udp",
1297 + "rustc-hash",
1298 + "rustls",
1299 + "socket2",
1300 + "thiserror",
1301 + "tokio",
1302 + "tracing",
1303 + "web-time",
1304 + ]
1305 +
1306 + [[package]]
1307 + name = "quinn-proto"
1308 + version = "0.11.14"
1309 + source = "registry+https://github.com/rust-lang/crates.io-index"
1310 + checksum = "434b42fec591c96ef50e21e886936e66d3cc3f737104fdb9b737c40ffb94c098"
1311 + dependencies = [
1312 + "bytes",
1313 + "getrandom 0.3.4",
1314 + "lru-slab",
1315 + "rand 0.9.4",
1316 + "ring",
1317 + "rustc-hash",
1318 + "rustls",
1319 + "rustls-pki-types",
1320 + "slab",
1321 + "thiserror",
1322 + "tinyvec",
1323 + "tracing",
1324 + "web-time",
1325 + ]
1326 +
1327 + [[package]]
1328 + name = "quinn-udp"
1329 + version = "0.5.14"
1330 + source = "registry+https://github.com/rust-lang/crates.io-index"
1331 + checksum = "addec6a0dcad8a8d96a771f815f0eaf55f9d1805756410b39f5fa81332574cbd"
1332 + dependencies = [
1333 + "cfg_aliases",
1334 + "libc",
1335 + "once_cell",
1336 + "socket2",
1337 + "tracing",
1338 + "windows-sys 0.60.2",
1339 + ]
1340 +
1341 + [[package]]
1228 1342 name = "quote"
1229 1343 version = "1.0.45"
1230 1344 source = "registry+https://github.com/rust-lang/crates.io-index"
@@ -1361,6 +1475,58 @@ source = "registry+https://github.com/rust-lang/crates.io-index"
1361 1475 checksum = "dc897dd8d9e8bd1ed8cdad82b5966c3e0ecae09fb1907d58efaa013543185d0a"
1362 1476
1363 1477 [[package]]
1478 + name = "reqwest"
1479 + version = "0.12.28"
1480 + source = "registry+https://github.com/rust-lang/crates.io-index"
1481 + checksum = "eddd3ca559203180a307f12d114c268abf583f59b03cb906fd0b3ff8646c1147"
1482 + dependencies = [
1483 + "base64",
1484 + "bytes",
1485 + "futures-core",
1486 + "http",
1487 + "http-body",
1488 + "http-body-util",
1489 + "hyper",
1490 + "hyper-rustls",
1491 + "hyper-util",
1492 + "js-sys",
1493 + "log",
1494 + "percent-encoding",
1495 + "pin-project-lite",
1496 + "quinn",
1497 + "rustls",
1498 + "rustls-pki-types",
1499 + "serde",
1500 + "serde_json",
1501 + "serde_urlencoded",
1502 + "sync_wrapper",
1503 + "tokio",
1504 + "tokio-rustls",
1505 + "tower",
1506 + "tower-http",
1507 + "tower-service",
1508 + "url",
1509 + "wasm-bindgen",
1510 + "wasm-bindgen-futures",
1511 + "web-sys",
1512 + "webpki-roots",
1513 + ]
1514 +
1515 + [[package]]
1516 + name = "ring"
1517 + version = "0.17.14"
1518 + source = "registry+https://github.com/rust-lang/crates.io-index"
1519 + checksum = "a4689e6c2294d81e88dc6261c768b63bc4fcdb852be6d1352498b114f61383b7"
1520 + dependencies = [
1521 + "cc",
1522 + "cfg-if",
1523 + "getrandom 0.2.17",
1524 + "libc",
1525 + "untrusted",
1526 + "windows-sys 0.52.0",
1527 + ]
1528 +
1529 + [[package]]
1364 1530 name = "rsa"
1365 1531 version = "0.9.10"
1366 1532 source = "registry+https://github.com/rust-lang/crates.io-index"
@@ -1381,6 +1547,60 @@ dependencies = [
1381 1547 ]
1382 1548
1383 1549 [[package]]
1550 + name = "rustc-hash"
1551 + version = "2.1.2"
1552 + source = "registry+https://github.com/rust-lang/crates.io-index"
1553 + checksum = "94300abf3f1ae2e2b8ffb7b58043de3d399c73fa6f4b73826402a5c457614dbe"
1554 +
1555 + [[package]]
1556 + name = "rustix"
1557 + version = "1.1.4"
1558 + source = "registry+https://github.com/rust-lang/crates.io-index"
1559 + checksum = "b6fe4565b9518b83ef4f91bb47ce29620ca828bd32cb7e408f0062e9930ba190"
1560 + dependencies = [
1561 + "bitflags",
1562 + "errno",
1563 + "libc",
1564 + "linux-raw-sys",
1565 + "windows-sys 0.61.2",
1566 + ]
1567 +
1568 + [[package]]
1569 + name = "rustls"
1570 + version = "0.23.40"
1571 + source = "registry+https://github.com/rust-lang/crates.io-index"
1572 + checksum = "ef86cd5876211988985292b91c96a8f2d298df24e75989a43a3c73f2d4d8168b"
1573 + dependencies = [
1574 + "once_cell",
1575 + "ring",
1576 + "rustls-pki-types",
1577 + "rustls-webpki",
1578 + "subtle",
1579 + "zeroize",
1580 + ]
1581 +
1582 + [[package]]
1583 + name = "rustls-pki-types"
1584 + version = "1.14.1"
1585 + source = "registry+https://github.com/rust-lang/crates.io-index"
1586 + checksum = "30a7197ae7eb376e574fe940d068c30fe0462554a3ddbe4eca7838e049c937a9"
1587 + dependencies = [
1588 + "web-time",
1589 + "zeroize",
1590 + ]
1591 +
1592 + [[package]]
1593 + name = "rustls-webpki"
1594 + version = "0.103.13"
1595 + source = "registry+https://github.com/rust-lang/crates.io-index"
1596 + checksum = "61c429a8649f110dddef65e2a5ad240f747e85f7758a6bccc7e5777bd33f756e"
1597 + dependencies = [
1598 + "ring",
1599 + "rustls-pki-types",
1600 + "untrusted",
1601 + ]
1602 +
1603 + [[package]]
1384 1604 name = "rustversion"
1385 1605 version = "1.0.22"
1386 1606 source = "registry+https://github.com/rust-lang/crates.io-index"
@@ -1399,14 +1619,18 @@ dependencies = [
1399 1619 "anyhow",
1400 1620 "axum",
1401 1621 "chrono",
1622 + "http-body-util",
1402 1623 "metrics",
1403 1624 "metrics-exporter-prometheus",
1625 + "reqwest",
1404 1626 "serde",
1405 1627 "serde_json",
1406 1628 "sqlx",
1629 + "tempfile",
1407 1630 "thiserror",
1408 1631 "tokio",
1409 1632 "toml",
1633 + "tower",
1410 1634 "tracing",
1411 1635 "tracing-subscriber",
1412 1636 ]
@@ -1836,6 +2060,9 @@ name = "sync_wrapper"
1836 2060 version = "1.0.2"
1837 2061 source = "registry+https://github.com/rust-lang/crates.io-index"
1838 2062 checksum = "0bf256ce5efdfa370213c1dabab5935a12e49f2c58d15e9eac2870d3b4f27263"
2063 + dependencies = [
2064 + "futures-core",
2065 + ]
1839 2066
1840 2067 [[package]]
1841 2068 name = "synstructure"
@@ -1849,6 +2076,19 @@ dependencies = [
1849 2076 ]
1850 2077
1851 2078 [[package]]
2079 + name = "tempfile"
2080 + version = "3.27.0"
2081 + source = "registry+https://github.com/rust-lang/crates.io-index"
2082 + checksum = "32497e9a4c7b38532efcdebeef879707aa9f794296a4f0244f6f69e9bc8574bd"
2083 + dependencies = [
2084 + "fastrand",
2085 + "getrandom 0.3.4",
2086 + "once_cell",
2087 + "rustix",
2088 + "windows-sys 0.61.2",
2089 + ]
2090 +
2091 + [[package]]
1852 2092 name = "thiserror"
1853 2093 version = "2.0.18"
1854 2094 source = "registry+https://github.com/rust-lang/crates.io-index"
@@ -1930,6 +2170,16 @@ dependencies = [
1930 2170 ]
1931 2171
1932 2172 [[package]]
2173 + name = "tokio-rustls"
2174 + version = "0.26.4"
2175 + source = "registry+https://github.com/rust-lang/crates.io-index"
2176 + checksum = "1729aa945f29d91ba541258c8df89027d5792d85a8841fb65e8bf0f4ede4ef61"
2177 + dependencies = [
2178 + "rustls",
2179 + "tokio",
2180 + ]
2181 +
2182 + [[package]]
1933 2183 name = "tokio-stream"
1934 2184 version = "0.1.18"
1935 2185 source = "registry+https://github.com/rust-lang/crates.io-index"
@@ -2010,6 +2260,24 @@ dependencies = [
2010 2260 ]
2011 2261
2012 2262 [[package]]
2263 + name = "tower-http"
2264 + version = "0.6.11"
2265 + source = "registry+https://github.com/rust-lang/crates.io-index"
2266 + checksum = "4cfcf7e2740e6fc6d4d688b4ef00650406bb94adf4731e43c096c3a19fe40840"
2267 + dependencies = [
2268 + "bitflags",
2269 + "bytes",
2270 + "futures-util",
2271 + "http",
2272 + "http-body",
2273 + "pin-project-lite",
2274 + "tower",
2275 + "tower-layer",
2276 + "tower-service",
2277 + "url",
2278 + ]
2279 +
2280 + [[package]]
2013 2281 name = "tower-layer"
2014 2282 version = "0.3.3"
2015 2283 source = "registry+https://github.com/rust-lang/crates.io-index"
@@ -2097,6 +2365,12 @@ dependencies = [
2097 2365 ]
2098 2366
2099 2367 [[package]]
2368 + name = "try-lock"
2369 + version = "0.2.5"
2370 + source = "registry+https://github.com/rust-lang/crates.io-index"
2371 + checksum = "e421abadd41a4225275504ea4d6566923418b7f05506fbc9c0fe86ba7396114b"
2372 +
2373 + [[package]]
2100 2374 name = "tungstenite"
2101 2375 version = "0.29.0"
2102 2376 source = "registry+https://github.com/rust-lang/crates.io-index"
@@ -2146,6 +2420,12 @@ source = "registry+https://github.com/rust-lang/crates.io-index"
2146 2420 checksum = "7df058c713841ad818f1dc5d3fd88063241cc61f49f5fbea4b951e8cf5a8d71d"
2147 2421
2148 2422 [[package]]
2423 + name = "untrusted"
2424 + version = "0.9.0"
2425 + source = "registry+https://github.com/rust-lang/crates.io-index"
2426 + checksum = "8ecb6da28b8a351d773b68d5825ac39017e680750f980f3a1a85cd8dd28a47c1"
2427 +
2428 + [[package]]
2149 2429 name = "url"
2150 2430 version = "2.5.8"
2151 2431 source = "registry+https://github.com/rust-lang/crates.io-index"
@@ -2182,6 +2462,15 @@ source = "registry+https://github.com/rust-lang/crates.io-index"
2182 2462 checksum = "0b928f33d975fc6ad9f86c8f283853ad26bdd5b10b7f1542aa2fa15e2289105a"
2183 2463
2184 2464 [[package]]
2465 + name = "want"
2466 + version = "0.3.1"
2467 + source = "registry+https://github.com/rust-lang/crates.io-index"
2468 + checksum = "bfa7760aed19e106de2c7c0b581b509f2f25d3dacaf737cb82ac61bc6d760b0e"
2469 + dependencies = [
2470 + "try-lock",
2471 + ]
2472 +
2473 + [[package]]
2185 2474 name = "wasi"
2186 2475 version = "0.11.1+wasi-snapshot-preview1"
2187 2476 source = "registry+https://github.com/rust-lang/crates.io-index"
@@ -2216,6 +2505,16 @@ dependencies = [
2216 2505 ]
2217 2506
2218 2507 [[package]]
2508 + name = "wasm-bindgen-futures"
2509 + version = "0.4.72"
2510 + source = "registry+https://github.com/rust-lang/crates.io-index"
2511 + checksum = "9473dbd2991ae90b6291c3c32c30c6187ac49aa32f9905d1cce280ec1e110b0f"
2512 + dependencies = [
2513 + "js-sys",
2514 + "wasm-bindgen",
2515 + ]
2516 +
2517 + [[package]]
2219 2518 name = "wasm-bindgen-macro"
2220 2519 version = "0.2.122"
2221 2520 source = "registry+https://github.com/rust-lang/crates.io-index"
@@ -2258,6 +2557,25 @@ dependencies = [
2258 2557 ]
2259 2558
2260 2559 [[package]]
2560 + name = "web-time"
2561 + version = "1.1.0"
2562 + source = "registry+https://github.com/rust-lang/crates.io-index"
2563 + checksum = "5a6580f308b1fad9207618087a65c04e7a10bc77e02c8e84e9b00dd4b12fa0bb"
2564 + dependencies = [
2565 + "js-sys",
2566 + "wasm-bindgen",
2567 + ]
2568 +
2569 + [[package]]
2570 + name = "webpki-roots"
2571 + version = "1.0.7"
2572 + source = "registry+https://github.com/rust-lang/crates.io-index"
2573 + checksum = "52f5ee44c96cf55f1b349600768e3ece3a8f26010c05265ab73f945bb1a2eb9d"
2574 + dependencies = [
2575 + "rustls-pki-types",
2576 + ]
2577 +
2578 + [[package]]
2261 2579 name = "whoami"
2262 2580 version = "1.6.1"
2263 2581 source = "registry+https://github.com/rust-lang/crates.io-index"
@@ -2354,7 +2672,25 @@ version = "0.48.0"
2354 2672 source = "registry+https://github.com/rust-lang/crates.io-index"
2355 2673 checksum = "677d2418bec65e3338edb076e806bc1ec15693c5d0104683f2efe857f61056a9"
2356 2674 dependencies = [
2357 - "windows-targets",
2675 + "windows-targets 0.48.5",
2676 + ]
2677 +
2678 + [[package]]
2679 + name = "windows-sys"
2680 + version = "0.52.0"
2681 + source = "registry+https://github.com/rust-lang/crates.io-index"
2682 + checksum = "282be5f36a8ce781fad8c8ae18fa3f9beff57ec1b52cb3de0789201425d9a33d"
2683 + dependencies = [
2684 + "windows-targets 0.52.6",
2685 + ]
2686 +
2687 + [[package]]
2688 + name = "windows-sys"
2689 + version = "0.60.2"
2690 + source = "registry+https://github.com/rust-lang/crates.io-index"
2691 + checksum = "f2f500e4d28234f72040990ec9d39e3a6b950f9f22d3dba18416c35882612bcb"
2692 + dependencies = [
2693 + "windows-targets 0.53.5",
2358 2694 ]
2359 2695
2360 2696 [[package]]
@@ -2372,13 +2708,46 @@ version = "0.48.5"
2372 2708 source = "registry+https://github.com/rust-lang/crates.io-index"
2373 2709 checksum = "9a2fa6e2155d7247be68c096456083145c183cbbbc2764150dda45a87197940c"
2374 2710 dependencies = [
2375 - "windows_aarch64_gnullvm",
Lines truncated
@@ -22,3 +22,9 @@ metrics-exporter-prometheus = { version = "0.18.1", default-features = false }
22 22 anyhow = "1.0.102"
23 23 thiserror = "2.0.18"
24 24 chrono = { version = "0.4", features = ["serde"] }
25 +
26 + [dev-dependencies]
27 + tempfile = "3.20"
28 + tower = { version = "0.5", features = ["util"] }
29 + http-body-util = "0.1"
30 + reqwest = { version = "0.12", default-features = false, features = ["json", "rustls-tls"] }
@@ -1,9 +1,11 @@
1 1 //! Fetch the prod backup that `migration_dry_run` runs against.
2 2 //!
3 - //! Sources supported in v0:
4 - //! - `file:///abs/path/to/dump.sql.gz` — local copy. Used for localhost dev.
5 - //! - `rsync://host/module/path` — shells out to `rsync`. Used when MM
6 - //! pulls from an astra/Hetzner replica.
3 + //! Sources supported:
4 + //! - `file:///abs/path/to/dump.sql.gz` — local copy (dev).
5 + //! - `rsync://host/module/path` — rsync daemon protocol.
6 + //! - `ssh://user@host[:port]/path/file.sql.gz` — rsync-over-ssh. Used to pull
7 + //! prod backups from
8 + //! `backup-puller@alpha-west-1`.
7 9 //!
8 10 //! The fetch is command-driven: the operator triggers it via /backup/fetch, it
9 11 //! is not implicit in promote. That keeps the slowest, most failure-prone step
@@ -11,7 +13,7 @@
11 13
12 14 use crate::config::Config;
13 15 use crate::topology::Topology;
14 - use anyhow::{Context, Result};
16 + use anyhow::{Context, Result, bail};
15 17 use chrono::Utc;
16 18 use sqlx::SqlitePool;
17 19 use std::path::Path;
@@ -25,6 +27,61 @@ pub struct FetchedBackup {
25 27 pub byte_size: Option<i64>,
26 28 }
27 29
30 + /// Parsed `backup.source` URL. Owned strings so the parsed form outlives the
31 + /// (possibly transient) URL we read from config.
32 + #[derive(Debug, Clone, PartialEq, Eq)]
33 + pub(crate) enum BackupSource {
34 + /// Local file copy. Path follows the `file://` prefix.
35 + File { path: String },
36 + /// rsync daemon protocol. Full URL stays intact (rsync handles it).
37 + RsyncDaemon { url: String },
38 + /// rsync-over-ssh. Port is optional.
39 + Ssh {
40 + user_host: String,
41 + port: Option<u16>,
42 + path: String,
43 + },
44 + }
45 +
46 + /// Parse a `backup.source` URL into a `BackupSource`. Rejects unsupported
47 + /// schemes and malformed `ssh://` URLs (no path part).
48 + pub(crate) fn parse_source(s: &str) -> Result<BackupSource> {
49 + if let Some(rest) = s.strip_prefix("file://") {
50 + if rest.is_empty() {
51 + bail!("file:// URL is missing a path: {s}");
52 + }
53 + return Ok(BackupSource::File { path: rest.into() });
54 + }
55 + if s.starts_with("rsync://") {
56 + return Ok(BackupSource::RsyncDaemon { url: s.into() });
57 + }
58 + if let Some(rest) = s.strip_prefix("ssh://") {
59 + let (user_host_port, path_rest) = rest
60 + .split_once('/')
61 + .with_context(|| format!("ssh:// URL missing path: {s}"))?;
62 + if user_host_port.is_empty() {
63 + bail!("ssh:// URL missing user@host: {s}");
64 + }
65 + let path = format!("/{path_rest}");
66 + let (user_host, port) = match user_host_port.rsplit_once(':') {
67 + Some((uh, p)) => {
68 + // Heuristic: trailing `:digits` after the final `:` is the port.
69 + // Anything else (IPv6 literal, etc.) gets left alone.
70 + match p.parse::<u16>() {
71 + Ok(n) => (uh.to_string(), Some(n)),
72 + Err(_) => (user_host_port.to_string(), None),
73 + }
74 + }
75 + None => (user_host_port.to_string(), None),
76 + };
77 + if user_host.is_empty() {
78 + bail!("ssh:// URL has empty host (port {:?})", port);
79 + }
80 + return Ok(BackupSource::Ssh { user_host, port, path });
81 + }
82 + bail!("unsupported backup source scheme: {s}");
83 + }
84 +
28 85 pub async fn fetch(
29 86 pool: &SqlitePool,
30 87 _cfg: &Arc<Config>,
@@ -37,23 +94,45 @@ pub async fn fetch(
37 94 tokio::fs::create_dir_all(parent).await?;
38 95 }
39 96
40 - if let Some(rest) = source.strip_prefix("file://") {
41 - tokio::fs::copy(rest, &local_path)
42 - .await
43 - .with_context(|| format!("copy {rest} -> {local_path}"))?;
44 - } else if source.starts_with("rsync://") {
45 - let out = Command::new("rsync")
46 - .args(["-az", "--inplace", &source, &local_path])
47 - .output()
48 - .await
49 - .context("spawning rsync")?;
50 - anyhow::ensure!(
51 - out.status.success(),
52 - "rsync failed: {}",
53 - String::from_utf8_lossy(&out.stderr),
54 - );
55 - } else {
56 - anyhow::bail!("unsupported backup source scheme: {source}");
97 + let parsed = parse_source(&source)?;
98 + match parsed {
99 + BackupSource::File { path } => {
100 + tokio::fs::copy(&path, &local_path)
101 + .await
102 + .with_context(|| format!("copy {path} -> {local_path}"))?;
103 + }
104 + BackupSource::RsyncDaemon { url } => {
105 + let out = Command::new("rsync")
106 + .args(["-az", "--inplace", &url, &local_path])
107 + .output()
108 + .await
109 + .context("spawning rsync")?;
110 + anyhow::ensure!(
111 + out.status.success(),
112 + "rsync (daemon) failed: {}",
113 + String::from_utf8_lossy(&out.stderr),
114 + );
115 + }
116 + BackupSource::Ssh { user_host, port, path } => {
117 + let ssh_cmd = match port {
118 + Some(p) => format!("ssh -p {p} -o BatchMode=yes -o StrictHostKeyChecking=accept-new"),
119 + None => "ssh -o BatchMode=yes -o StrictHostKeyChecking=accept-new".into(),
120 + };
121 + let remote = format!("{user_host}:{path}");
122 + let out = Command::new("rsync")
123 + .args(["-a", "--partial"])
124 + .arg("-e").arg(&ssh_cmd)
125 + .arg(&remote)
126 + .arg(&local_path)
127 + .output()
128 + .await
129 + .context("spawning rsync")?;
130 + anyhow::ensure!(
131 + out.status.success(),
132 + "rsync (ssh) failed: {}",
133 + String::from_utf8_lossy(&out.stderr),
134 + );
135 + }
57 136 }
58 137
59 138 let meta = tokio::fs::metadata(&local_path).await?;
@@ -69,5 +148,106 @@ pub async fn fetch(
69 148 .execute(pool)
70 149 .await?;
71 150
151 + // Retention: prune rows fetched more than 30 days ago. The on-disk file
152 + // is overwritten each fetch (single `local_path`), so old rows reference
153 + // a path that no longer exists — keep the table from growing for no
154 + // good reason.
155 + sqlx::query(
156 + "DELETE FROM backups WHERE fetched_at < datetime('now', '-30 days')",
157 + )
158 + .execute(pool)
159 + .await?;
160 +
72 161 Ok(FetchedBackup { source, local_path, byte_size: Some(size) })
73 162 }
163 +
164 + #[cfg(test)]
165 + mod tests {
166 + use super::*;
167 +
168 + #[test]
169 + fn parses_file_url() {
170 + let s = parse_source("file:///opt/backups/latest.sql.gz").unwrap();
171 + assert_eq!(s, BackupSource::File { path: "/opt/backups/latest.sql.gz".into() });
172 + }
173 +
174 + #[test]
175 + fn file_url_without_path_errors() {
176 + assert!(parse_source("file://").is_err());
177 + }
178 +
179 + #[test]
180 + fn parses_rsync_daemon_url() {
181 + let s = parse_source("rsync://astra/mnw/latest.sql.gz").unwrap();
182 + assert_eq!(s, BackupSource::RsyncDaemon { url: "rsync://astra/mnw/latest.sql.gz".into() });
183 + }
184 +
185 + #[test]
186 + fn parses_ssh_url_with_port() {
187 + let s = parse_source("ssh://backup-puller@alpha-west-1:2200/latest.sql.gz").unwrap();
188 + assert_eq!(
189 + s,
190 + BackupSource::Ssh {
191 + user_host: "backup-puller@alpha-west-1".into(),
192 + port: Some(2200),
193 + path: "/latest.sql.gz".into(),
194 + }
195 + );
196 + }
197 +
198 + #[test]
199 + fn parses_ssh_url_without_port() {
200 + let s = parse_source("ssh://max@astra/opt/backups/mnw/latest.sql.gz").unwrap();
201 + assert_eq!(
202 + s,
203 + BackupSource::Ssh {
204 + user_host: "max@astra".into(),
205 + port: None,
206 + path: "/opt/backups/mnw/latest.sql.gz".into(),
207 + }
208 + );
209 + }
210 +
211 + #[test]
212 + fn ssh_url_without_path_errors() {
213 + // `split_once('/')` — `ssh://user@host` has no `/` after the scheme.
214 + assert!(parse_source("ssh://backup-puller@alpha-west-1").is_err());
215 + }
216 +
217 + #[test]
218 + fn ssh_url_without_user_host_errors() {
219 + // Empty user@host: `ssh:///foo`. Caught by the empty-prefix check.
220 + assert!(parse_source("ssh:///latest.sql.gz").is_err());
221 + }
222 +
223 + #[test]
224 + fn ssh_url_with_non_numeric_after_colon_treats_as_part_of_host() {
225 + // `host:notaport` should NOT parse `notaport` as a port. Leave the
226 + // colon part of user_host; libssh/rsync will reject if truly wrong.
227 + let s = parse_source("ssh://user@host:notaport/path").unwrap();
228 + assert_eq!(
229 + s,
230 + BackupSource::Ssh {
231 + user_host: "user@host:notaport".into(),
232 + port: None,
233 + path: "/path".into(),
234 + }
235 + );
236 + }
237 +
238 + #[test]
239 + fn rejects_unknown_scheme() {
240 + assert!(parse_source("ftp://example.com/file").is_err());
241 + assert!(parse_source("just-a-path.sql.gz").is_err());
242 + assert!(parse_source("").is_err());
243 + }
244 +
245 + #[test]
246 + fn ssh_url_preserves_multi_segment_path() {
247 + let s = parse_source("ssh://a@b:22/opt/foo/bar/baz.sql.gz").unwrap();
248 + match s {
249 + BackupSource::Ssh { path, .. } => assert_eq!(path, "/opt/foo/bar/baz.sql.gz"),
250 + _ => panic!("wrong variant"),
251 + }
252 + }
253 + }
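
The `ssh://` port heuristic in `parse_source` above is worth exercising on inputs the tests do not cover. The sketch below restates that branch using only std (returning `Option` instead of `anyhow::Result`). The function name `split_ssh_url` is hypothetical, and this is a reading aid under those assumptions rather than a replacement for the crate's parser.

```rust
/// Std-only restatement of the `ssh://` branch of `parse_source`.
/// Returns (user_host, port, path), or None where the original would bail.
fn split_ssh_url(s: &str) -> Option<(String, Option<u16>, String)> {
    let rest = s.strip_prefix("ssh://")?;
    // Everything before the first '/' is user@host[:port]; the rest is the path.
    let (user_host_port, path_rest) = rest.split_once('/')?;
    if user_host_port.is_empty() {
        return None;
    }
    let path = format!("/{path_rest}");
    let (user_host, port) = match user_host_port.rsplit_once(':') {
        // Text after the final ':' counts as a port only if it parses as u16.
        Some((uh, p)) => match p.parse::<u16>() {
            Ok(n) => (uh.to_string(), Some(n)),
            Err(_) => (user_host_port.to_string(), None),
        },
        None => (user_host_port.to_string(), None),
    };
    if user_host.is_empty() {
        return None;
    }
    Some((user_host, port, path))
}

// Behaviour on edge cases, traced by hand against the logic above:
//   ssh://user@[::1]:2200/x  -> ("user@[::1]", Some(2200), "/x")
//   ssh://user@::1/x         -> ("user@:",     Some(1),    "/x")  // bare IPv6 misread
//   ssh://host:99999/x       -> ("host:99999", None,       "/x")  // exceeds u16
```

An unbracketed IPv6 literal whose last group is all digits is split at its final colon, so that group is read as the port. The bracketed form with an explicit port avoids the ambiguity at the parsing stage. Whether rsync and ssh accept the resulting `user_host` is outside what this sketch checks.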
@@ -21,7 +21,10 @@ pub struct BuildArtifact {
21 21 pub version: String,
22 22 pub git_sha: String,
23 23 pub worktree: PathBuf,
24 - pub binary_path: PathBuf,
24 + /// One entry per `cfg.bin_names` in declared order. First is the primary
25 + /// (referenced by the systemd unit's ExecStart). Paths are inside the
26 + /// worktree's `target/release/`.
27 + pub binary_paths: Vec<PathBuf>,
25 28 }
26 29
27 30 pub async fn run(
@@ -29,6 +32,7 @@ pub async fn run(
29 32 cfg: Arc<Config>,
30 33 topo: Arc<Topology>,
31 34 sha: String,
35 + events: crate::events::EventTx,
32 36 ) -> Result<BuildArtifact> {
33 37 let worktree = cfg.workdir.join(&sha);
34 38 let bare = PathBuf::from(&topo.repo.bare_path);
@@ -38,20 +42,47 @@ pub async fn run(
38 42 let version = read_pkg_version(&server_dir.join("Cargo.toml")).await
39 43 .with_context(|| format!("reading version from {}/Cargo.toml", server_dir.display()))?;
40 44
41 - tracing::info!(sha = %sha, version = %version, dir = %server_dir.display(), "cargo build --release start");
42 - let started = std::time::Instant::now();
43 - let out = Command::new("cargo")
45 + // sqlx compile-time query checking needs a live DB with the current schema.
46 + // We point cargo at the scratch DB and prep it (drop public, re-migrate)
47 + // before invoking cargo build. The same DB is reset again by
48 + // `migration_dry_run` later if it runs as a gate.
49 + let mut cargo_cmd = Command::new("cargo");
50 + cargo_cmd
44 51 .arg("build")
45 52 .arg("--release")
46 53 .current_dir(&server_dir)
54 + .kill_on_drop(true);
55 + if let Some(scratch_url) = cfg.scratch_db_url.as_deref() {
56 + tracing::info!(sha = %sha, "preparing scratch DB schema for sqlx compile-time checks");
57 + crate::gates::reset_scratch(scratch_url).await
58 + .context("scratch DB reset before build")?;
59 + crate::gates::run_migrator(scratch_url, &server_dir.join("migrations")).await
60 + .context("applying MNW migrations to scratch DB before build")?;
61 + cargo_cmd.env("DATABASE_URL", scratch_url);
62 + } else {
63 + tracing::warn!("scratch_db_url unset; sqlx will fall back to offline mode and may fail");
64 + }
65 +
66 + tracing::info!(sha = %sha, version = %version, dir = %server_dir.display(), "cargo build --release start");
67 + crate::events::emit(&events, crate::events::Event::BuildStart {
68 + sha: sha.clone(), version: version.clone(),
69 + });
70 + let started = std::time::Instant::now();
71 + let out = cargo_cmd
47 72 .output()
48 73 .await
49 74 .context("spawning cargo build")?;
50 75 let elapsed_s = started.elapsed().as_secs();
51 76 if !out.status.success() {
52 77 tracing::error!(sha = %sha, version = %version, elapsed_s, "cargo build --release failed");
78 + crate::events::emit(&events, crate::events::Event::BuildFailed {
79 + sha: sha.clone(), version: version.clone(), elapsed_s,
80 + });
53 81 } else {
54 82 tracing::info!(sha = %sha, version = %version, elapsed_s, "cargo build --release ok");
83 + crate::events::emit(&events, crate::events::Event::BuildOk {
84 + sha: sha.clone(), version: version.clone(), elapsed_s,
85 + });
55 86 }
56 87 anyhow::ensure!(
57 88 out.status.success(),
@@ -59,12 +90,16 @@ pub async fn run(
59 90 tail(&out.stderr, 4_000),
60 91 );
61 92
62 - let binary_path = server_dir.join("target/release/server");
63 - anyhow::ensure!(
64 - binary_path.exists(),
65 - "expected binary at {} after build",
66 - binary_path.display(),
67 - );
93 + let release_dir = server_dir.join("target/release");
94 + let mut binary_paths = Vec::with_capacity(cfg.bin_names.len());
95 + for name in &cfg.bin_names {
96 + let p = release_dir.join(name);
97 + anyhow::ensure!(p.exists(), "expected binary at {} after build", p.display());
98 + binary_paths.push(p);
99 + }
100 + // Primary binary path is the one we record in `versions.artifact_path`
101 + // (everything downstream — promote, rollback — looks it up by version).
102 + let primary = binary_paths[0].clone();
68 103
69 104 sqlx::query(
70 105 "INSERT OR IGNORE INTO versions (version, git_sha, built_at, artifact_path)
@@ -73,11 +108,11 @@ pub async fn run(
73 108 .bind(&version)
74 109 .bind(&sha)
75 110 .bind(Utc::now().to_rfc3339())
76 - .bind(binary_path.to_string_lossy().as_ref())
111 + .bind(primary.to_string_lossy().as_ref())
77 112 .execute(&pool)
78 113 .await?;
79 114
80 - Ok(BuildArtifact { version, git_sha: sha, worktree, binary_path })
115 + Ok(BuildArtifact { version, git_sha: sha, worktree, binary_paths })
81 116 }
82 117
83 118 /// Full MM-tier pipeline: build, deploy the binary into MM's release_root,
@@ -88,14 +123,36 @@ pub async fn build_and_run_mm(
88 123 cfg: Arc<Config>,
89 124 topo: Arc<Topology>,
90 125 sha: String,
126 + events: crate::events::EventTx,
91 127 ) -> Result<()> {
92 - let art = run(pool.clone(), cfg.clone(), topo.clone(), sha).await?;
128 + let art = run(pool.clone(), cfg.clone(), topo.clone(), sha, events.clone()).await?;
93 129
94 130 // Stage the binary in MM's release_root so future gates and the MM
95 131 // self-deploy point at a stable path, not the worktree's target/.
96 132 let mm_release_root = &cfg.release_root;
97 - let staged = deploy::deploy_local(mm_release_root, &art.version, &art.binary_path).await?;
98 - let staged_bin = staged.join("server");
133 + let staged = deploy::deploy_local(mm_release_root, &art.version, &art.binary_paths).await?;
134 +
135 + // Bring error-pages alongside the binaries so the deploy rsync ships the
136 + // static HTML to every node. Caddy on each node references
137 + // <release_root>/current/error-pages/. Skipped silently if the worktree
138 + // doesn't have them (older shas, or non-MNW projects using this daemon).
139 + let error_pages_src = art.worktree.join("server/deploy/error-pages");
140 + if error_pages_src.exists() {
141 + let out = Command::new("cp")
142 + .arg("-a")
143 + .arg(&error_pages_src)
144 + .arg(staged.join("error-pages"))
145 + .output()
146 + .await
147 + .context("spawning cp for error-pages")?;
148 + anyhow::ensure!(
149 + out.status.success(),
150 + "copying error-pages into staged dir: {}",
151 + String::from_utf8_lossy(&out.stderr),
152 + );
153 + }
154 +
155 + let staged_bin = staged.join(cfg.primary_bin());
99 156 sqlx::query("UPDATE versions SET artifact_path = ? WHERE version = ?")
100 157 .bind(staged_bin.to_string_lossy().as_ref())
101 158 .bind(&art.version)
@@ -112,6 +169,7 @@ pub async fn build_and_run_mm(
112 169 tier: "mm".to_string(),
113 170 version: art.version.clone(),
114 171 worktree: art.worktree.clone(),
172 + events: events.clone(),
115 173 };
116 174 let ok = gates::run_all(&ctx, &mm.gates).await?;
117 175
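The build step above treats the first configured bin name as the primary binary. The following is a minimal std-only sketch of resolving those paths while handling an empty `bin_names` list explicitly. `resolve_binaries` and its `String` error type are illustrative assumptions, not part of this commit, which uses `anyhow`.

```rust
use std::path::{Path, PathBuf};

/// Hypothetical helper (not in the commit): map configured bin names to paths
/// under `release_dir`. Returns `(primary, all_paths)`, or an error message if
/// the list is empty or a binary is missing.
fn resolve_binaries(
    release_dir: &Path,
    bin_names: &[String],
) -> Result<(PathBuf, Vec<PathBuf>), String> {
    let mut paths = Vec::with_capacity(bin_names.len());
    for name in bin_names {
        let p = release_dir.join(name);
        if !p.exists() {
            return Err(format!("expected binary at {} after build", p.display()));
        }
        paths.push(p);
    }
    // `first()` yields None for an empty list, so the empty case becomes an
    // ordinary error instead of an out-of-bounds panic from indexing.
    let primary = paths
        .first()
        .cloned()
        .ok_or_else(|| "bin_names is empty; need at least one binary".to_string())?;
    Ok((primary, paths))
}
```

Returning the primary path together with the full list keeps the "first entry is primary" rule in one place, so callers do not need to index the vector themselves.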
@@ -16,9 +16,22 @@ pub struct Config {
16 16 /// you care about.
17 17 #[serde(default)]
18 18 pub scratch_db_url: Option<String>,
19 + /// Names of cargo bin targets the server crate produces (files under
20 + /// `target/release/`). First entry is the primary binary (referenced from
21 + /// the systemd unit's ExecStart). Defaults to `["server"]`; MNW ships
22 + /// `["makenotwork", "mnw-admin"]`.
23 + #[serde(default = "default_bin_names")]
24 + pub bin_names: Vec<String>,
19 25 }
20 26
27 + fn default_bin_names() -> Vec<String> { vec!["server".into()] }
28 +
21 29 impl Config {
30 + /// Primary binary — the one the systemd unit's ExecStart points at.
31 + pub fn primary_bin(&self) -> &str {
32 + self.bin_names.first().map(|s| s.as_str()).unwrap_or("server")
33 + }
34 +
22 35 pub fn load() -> Result<Self> {
23 36 let path = std::env::var("SANDO_CONFIG").unwrap_or_else(|_| "sando-daemon.toml".into());
24 37 let raw = std::fs::read_to_string(&path)
@@ -1,40 +1,58 @@
1 1 //! Atomic symlink-swap deploys.
2 2 //!
3 - //! Layout on every target (MM, A nodes, B nodes, ...):
3 + //! Layout on every target (local host, A nodes, B nodes, ...):
4 4 //!
5 5 //! <release_root>/
6 6 //! releases/
7 7 //! 0.8.1/
8 - //! server <- the binary
8 + //! <bin_name>
9 9 //! 0.8.2/
10 - //! server
10 + //! <bin_name>
11 11 //! current -> releases/0.8.2
12 12 //!
13 - //! `ln -sfn` makes the swap atomic on Linux. systemd units should point at
14 - //! `<release_root>/current/server` so a swap + reload picks up the new binary
15 - //! without a window where the unit references a missing path.
13 + //! `ln -sfn` swaps the symlink. systemd units point at
14 + //! `<release_root>/current/<bin_name>` so reload-or-restart picks up the new
15 + //! binary without ever pointing at a missing path.
16 16 //!
17 - //! v0 only implements local deploys (used for MM and for localhost-dev
18 - //! "remote" nodes whose ssh_target is `local`). Real SSH/rsync deploys are
19 - //! follow-up work — see the `remote_deploy_stub` branch.
17 + //! For nodes with `ssh_target` set to anything other than `"local"`, deploy
18 + //! goes via rsync + ssh; the bootstrap (creating release_root, installing the
19 + //! service unit, granting sudo for systemctl) is out of scope here — it
20 + //! happens once per node, not per deploy.
20 21
21 22 use crate::topology::Node;
22 23 use anyhow::{Context, Result};
23 24 use std::path::{Path, PathBuf};
24 25 use tokio::process::Command;
25 26
26 - pub async fn deploy_local(release_root: &Path, version: &str, binary: &Path) -> Result<PathBuf> {
27 + /// SSH options used everywhere we shell out to ssh — fail fast, no prompts.
28 + const SSH_FLAGS: &[&str] = &[
29 + "-o", "BatchMode=yes",
30 + "-o", "ConnectTimeout=10",
31 + "-o", "StrictHostKeyChecking=accept-new",
32 + ];
33 +
34 + /// Keep this many release dirs per node; older ones get gc'd after a
35 + /// successful deploy. Fixed for now; promote to config if the constant ever
36 + /// needs to vary by tier.
37 + const RELEASES_TO_KEEP: usize = 5;
38 +
39 + pub async fn deploy_local(
40 + release_root: &Path,
41 + version: &str,
42 + binaries: &[PathBuf],
43 + ) -> Result<PathBuf> {
27 44 let release_dir = release_root.join("releases").join(version);
28 45 tokio::fs::create_dir_all(&release_dir).await?;
29 - let dest = release_dir.join("server");
30 - tokio::fs::copy(binary, &dest)
31 - .await
32 - .with_context(|| format!("copy {} -> {}", binary.display(), dest.display()))?;
46 + for binary in binaries {
47 + let name = binary.file_name()
48 + .context("binary path has no file name")?;
49 + let dest = release_dir.join(name);
50 + tokio::fs::copy(binary, &dest)
51 + .await
52 + .with_context(|| format!("copy {} -> {}", binary.display(), dest.display()))?;
53 + }
33 54
34 55 let current = release_root.join("current");
35 - // ln -sfn is atomic on Linux; on macOS the dev path is non-prod so the
36 - // race is irrelevant. We shell out rather than using std::os::unix::fs
37 - // symlink + rename because the rename-over-symlink pattern is platform-fussy.
38 56 let target = format!("releases/{version}");
39 57 let out = Command::new("ln")
40 58 .args(["-sfn", &target])
@@ -46,26 +64,372 @@ pub async fn deploy_local(release_root: &Path, version: &str, binary: &Path) ->
46 64 "symlink swap failed: {}",
47 65 String::from_utf8_lossy(&out.stderr),
48 66 );
67 +
68 + if let Err(e) = gc_local_releases(release_root).await {
69 + tracing::warn!(error = %e, "local release GC failed (non-fatal)");
70 + }
49 71 Ok(release_dir)
50 72 }
51 73
52 - pub async fn deploy_node(node: &Node, version: &str, binary: &Path) -> Result<PathBuf> {
74 + /// Deploy `staged_release_dir` (a directory built on the Sando host by
75 + /// `deploy_local`) to `node`. For `ssh_target=local`, this is just a symlink
76 + /// swap (no service restart); for remote nodes, we rsync the whole dir.
77 + ///
78 + /// `primary_bin` is only used for logging — every file present in the staged
79 + /// dir gets shipped.
80 + pub async fn deploy_node(
81 + node: &Node,
82 + version: &str,
83 + staged_release_dir: &Path,
84 + primary_bin: &str,
85 + ) -> Result<PathBuf> {
53 86 if node.ssh_target == "local" || node.ssh_target.is_empty() {
54 - return deploy_local(Path::new(&node.release_root), version, binary).await;
87 + // Local deploy already happened when we staged on the Sando host.
88 + // Just re-point `current` at the staged dir.
89 + return reset_local_current(Path::new(&node.release_root), version).await;
90 + }
91 + deploy_remote(node, version, staged_release_dir, primary_bin).await
92 + }
93 +
94 + async fn reset_local_current(release_root: &Path, version: &str) -> Result<PathBuf> {
95 + let current = release_root.join("current");
96 + let target = format!("releases/{version}");
97 + let out = Command::new("ln")
98 + .args(["-sfn", &target])
99 + .arg(&current)
100 + .output()
101 + .await?;
102 + anyhow::ensure!(
103 + out.status.success(),
104 + "symlink swap failed: {}",
105 + String::from_utf8_lossy(&out.stderr),
106 + );
107 + Ok(release_root.join("releases").join(version))
108 + }
109 +
110 + async fn deploy_remote(
111 + node: &Node,
112 + version: &str,
113 + staged_release_dir: &Path,
114 + primary_bin: &str,
115 + ) -> Result<PathBuf> {
116 + let release_root = &node.release_root;
117 + let ssh_target = &node.ssh_target;
118 + let service = &node.service_name;
119 + let release_dir = format!("{release_root}/releases/{version}");
120 +
121 + tracing::info!(node = %node.name, version, "deploy: mkdir release dir");
122 + ssh(ssh_target, &format!("set -e; mkdir -p {q}", q = sh_quote(&release_dir)))
123 + .await
124 + .context("creating remote release dir")?;
125 +
126 + tracing::info!(node = %node.name, version, primary = %primary_bin, "deploy: rsync release dir");
127 + // Rsync the whole staged dir (all binaries + any sibling artifacts like
128 + // error-pages). Trailing slash on source = contents of dir, not the dir
129 + // itself. --chmod=F0755,D0755 ensures binaries land executable; note the
130 + // F mask applies to every regular file, so data files also get 0755.
131 + let rsync_src = format!("{}/", staged_release_dir.display());
132 + let rsync_dest = format!("{ssh_target}:{release_dir}/");
133 + let mut rsync = Command::new("rsync");
134 + rsync
135 + .arg("-az")
136 + .arg("--partial")
137 + .arg("--chmod=F0755,D0755")
138 + .arg("-e")
139 + .arg(format!(
140 + "ssh {}",
141 + SSH_FLAGS.iter().map(|s| s.to_string()).collect::<Vec<_>>().join(" ")
142 + ))
143 + .arg(&rsync_src)
144 + .arg(&rsync_dest);
145 + let out = rsync.output().await.context("spawning rsync")?;
146 + anyhow::ensure!(
147 + out.status.success(),
148 + "rsync failed (current symlink left intact): {}",
149 + String::from_utf8_lossy(&out.stderr),
150 + );
151 +
152 + tracing::info!(node = %node.name, version, "deploy: symlink swap + service reload");
153 + // Symlink swap is atomic via `mv -T` of a freshly-created symlink over
154 + // the old one (the rename(2) is the atomic step; `ln -sfn` does
155 + // unlink+symlink which has a window).
156 + let swap_and_restart = format!(
157 + "set -e; \
158 + cd {root}; \
159 + ln -sfn releases/{ver} current.new; \
160 + mv -Tf current.new current; \
161 + sudo /bin/systemctl reload-or-restart {svc}",
162 + root = sh_quote(release_root),
163 + ver = sh_quote(version),
164 + svc = sh_quote(service),
165 + );
166 + ssh(ssh_target, &swap_and_restart)
167 + .await
168 + .context("symlink swap + systemctl reload-or-restart")?;
169 +
170 + if let Err(e) = gc_remote_releases(ssh_target, release_root).await {
171 + tracing::warn!(error = %e, "remote release GC failed (non-fatal)");
172 + }
173 +
174 + Ok(PathBuf::from(release_root).join("releases").join(version))
175 + }
176 +
177 + async fn ssh(target: &str, script: &str) -> Result<()> {
178 + let mut cmd = Command::new("ssh");
179 + cmd.args(SSH_FLAGS).arg(target).arg(script);
180 + let out = cmd.output().await.context("spawning ssh")?;
181 + anyhow::ensure!(
182 + out.status.success(),
183 + "ssh {target} failed: {}",
184 + String::from_utf8_lossy(&out.stderr),
185 + );
186 + Ok(())
187 + }
188 +
189 + async fn gc_local_releases(release_root: &Path) -> Result<()> {
190 + let releases = release_root.join("releases");
191 + if !releases.exists() {
192 + return Ok(());
193 + }
194 + let mut entries = Vec::new();
195 + let mut rd = tokio::fs::read_dir(&releases).await?;
196 + while let Some(entry) = rd.next_entry().await? {
197 + if !entry.file_type().await?.is_dir() {
198 + continue;
199 + }
200 + let meta = entry.metadata().await?;
201 + entries.push((entry.path(), meta.modified()?));
55 202 }
56 - remote_deploy_stub(node, version, binary).await
203 + entries.sort_by(|a, b| b.1.cmp(&a.1));
204 + for (path, _) in entries.into_iter().skip(RELEASES_TO_KEEP) {
205 + if let Err(e) = tokio::fs::remove_dir_all(&path).await {
206 + tracing::warn!(path = %path.display(), error = %e, "gc: rm failed");
207 + } else {
208 + tracing::debug!(path = %path.display(), "gc: removed old release");
209 + }
210 + }
211 + Ok(())
57 212 }
58 213
59 - async fn remote_deploy_stub(node: &Node, version: &str, _binary: &Path) -> Result<PathBuf> {
60 - // Real implementation: rsync the binary to <ssh_target>:<release_root>/releases/<version>/server,
61 - // then ssh <ssh_target> "ln -sfn releases/<version> current && systemctl reload-or-restart <unit>".
62 - // Wiring this up needs a story for systemd unit naming and ssh key/auth conventions; deferring
63 - // until the localhost smoke loop is settled and we know which knobs matter.
64 - anyhow::bail!(
65 - "remote deploy not yet implemented (node {} -> {}); use ssh_target=local for dev",
66 - node.name,
67 - node.ssh_target,
214 + async fn gc_remote_releases(ssh_target: &str, release_root: &str) -> Result<()> {
215 + // `ls -t` orders by mtime desc. Skip the first N, rm the rest. `xargs -r`
216 + // is a no-op when stdin is empty (avoids `rm` complaining).
217 + let script = format!(
218 + "set -e; cd {root}/releases 2>/dev/null || exit 0; \
219 + ls -1t | tail -n +{keep_plus_one} | xargs -r -I{{}} rm -rf -- {{}}",
220 + root = sh_quote(release_root),
221 + keep_plus_one = RELEASES_TO_KEEP + 1,
68 222 );
69 - #[allow(unreachable_code)]
70 - Ok(PathBuf::from(&node.release_root).join("releases").join(version))
223 + ssh(ssh_target, &script).await
224 + }
225 +
226 + /// Single-quote a string for safe inclusion in a /bin/sh command, escaping
227 + /// any single quote inside. Not bulletproof for adversarial input, but every
228 + /// path here comes from our own config files.
229 + fn sh_quote(s: &str) -> String {
230 + let escaped = s.replace('\'', r"'\''");
231 + format!("'{escaped}'")
232 + }
233 +
234 + #[cfg(test)]
235 + mod tests {
236 + use super::*;
237 + use std::time::SystemTime;
238 +
239 + #[test]
240 + fn sh_quote_no_quote() {
241 + assert_eq!(sh_quote("hello"), "'hello'");
242 + assert_eq!(sh_quote("/opt/mnw/releases/0.8.12"), "'/opt/mnw/releases/0.8.12'");
243 + }
244 +
245 + #[test]
246 + fn sh_quote_with_quote() {
247 + // The string `it's` becomes `'it'\''s'` — close, escape, open.
248 + assert_eq!(sh_quote("it's"), r"'it'\''s'");
249 + }
250 +
251 + #[tokio::test]
252 + async fn deploy_local_copies_multiple_binaries_and_swaps_symlink() {
253 + let tmp = tempfile::tempdir().unwrap();
254 + let root = tmp.path();
255 +
256 + // Source binaries (worktree's target/release/)
257 + let src_dir = root.join("src");
258 + tokio::fs::create_dir_all(&src_dir).await.unwrap();
259 + let primary = src_dir.join("makenotwork");
260 + let admin = src_dir.join("mnw-admin");
261 + tokio::fs::write(&primary, b"PRIMARY").await.unwrap();
262 + tokio::fs::write(&admin, b"ADMIN").await.unwrap();
263 +
264 + // Release root (where staged versions live)
265 + let release_root = root.join("releases-root");
266 + tokio::fs::create_dir_all(&release_root).await.unwrap();
267 +
268 + let staged = deploy_local(
269 + &release_root,
270 + "0.8.12",
271 + &[primary.clone(), admin.clone()],
272 + )
273 + .await
274 + .expect("deploy_local should succeed");
275 +
276 + assert_eq!(staged, release_root.join("releases").join("0.8.12"));
277 + assert_eq!(tokio::fs::read(staged.join("makenotwork")).await.unwrap(), b"PRIMARY");
278 + assert_eq!(tokio::fs::read(staged.join("mnw-admin")).await.unwrap(), b"ADMIN");
279 +
280 + // current symlink should resolve to staged
281 + let current = release_root.join("current");
282 + let target = tokio::fs::read_link(&current).await.unwrap();
283 + assert_eq!(target.to_string_lossy(), "releases/0.8.12");
284 + // And reading through `current/` should give the new content.
285 + let via_current = tokio::fs::read(current.join("makenotwork")).await.unwrap();
286 + assert_eq!(via_current, b"PRIMARY");
287 + }
288 +
289 + #[tokio::test]
290 + async fn deploy_local_second_release_swaps_symlink_and_keeps_old_dir() {
291 + let tmp = tempfile::tempdir().unwrap();
292 + let root = tmp.path();
293 + let src_dir = root.join("src");
294 + tokio::fs::create_dir_all(&src_dir).await.unwrap();
295 + let bin = src_dir.join("server");
296 + tokio::fs::write(&bin, b"V1").await.unwrap();
297 +
298 + let release_root = root.join("rr");
299 + tokio::fs::create_dir_all(&release_root).await.unwrap();
300 +
301 + deploy_local(&release_root, "0.1.0", &[bin.clone()]).await.unwrap();
302 + // Rewrite source then deploy 0.2.0.
303 + tokio::fs::write(&bin, b"V2").await.unwrap();
304 + deploy_local(&release_root, "0.2.0", &[bin.clone()]).await.unwrap();
305 +
306 + // Both versions present on disk.
307 + assert!(release_root.join("releases/0.1.0/server").exists());
308 + assert!(release_root.join("releases/0.2.0/server").exists());
309 + // current points at the new one.
310 + let target = tokio::fs::read_link(release_root.join("current")).await.unwrap();
311 + assert_eq!(target.to_string_lossy(), "releases/0.2.0");
312 + let via_current = tokio::fs::read(release_root.join("current/server")).await.unwrap();
313 + assert_eq!(via_current, b"V2");
314 + }
315 +
316 + #[tokio::test]
317 + async fn gc_local_releases_keeps_last_n_by_mtime() {
318 + // Build > RELEASES_TO_KEEP fake release dirs with distinct mtimes,
319 + // then run gc and check which survived.
320 + let tmp = tempfile::tempdir().unwrap();
321 + let root = tmp.path();
322 + let releases = root.join("releases");
323 + tokio::fs::create_dir_all(&releases).await.unwrap();
324 +
325 + let total = RELEASES_TO_KEEP + 3;
326 + let mut names = Vec::new();
327 + for i in 0..total {
328 + let name = format!("v{i:02}");
329 + let dir = releases.join(&name);
330 + tokio::fs::create_dir(&dir).await.unwrap();
331 + // Stagger mtimes deterministically. tokio's File doesn't expose
332 + // set_times, so reach for std::fs::File + std::fs::FileTimes
333 + // (stable since 1.75). Synchronous is fine here — this is test
334 + // setup, not the hot path.
335 + let f = std::fs::File::open(&dir).unwrap();
336 + let when = SystemTime::UNIX_EPOCH + std::time::Duration::from_secs(1_700_000_000 + i as u64);
337 + let times = std::fs::FileTimes::new().set_modified(when);
338 + f.set_times(times).unwrap();
339 + names.push(name);
340 + }
341 +
342 + gc_local_releases(root).await.unwrap();
343 +
344 + // The last RELEASES_TO_KEEP by mtime (i.e. highest i) survive.
345 + let surviving_expected: Vec<_> = names
346 + .iter()
347 + .skip(total - RELEASES_TO_KEEP)
348 + .cloned()
349 + .collect();
350 + for name in &surviving_expected {
351 + assert!(
352 + releases.join(name).exists(),
353 + "expected to survive: {name}"
354 + );
355 + }
356 + for name in names.iter().take(total - RELEASES_TO_KEEP) {
357 + assert!(
358 + !releases.join(name).exists(),
359 + "expected to be pruned: {name}"
360 + );
361 + }
362 + }
363 +
364 + #[tokio::test]
365 + async fn gc_local_releases_noop_when_below_threshold() {
366 + let tmp = tempfile::tempdir().unwrap();
367 + let root = tmp.path();
368 + let releases = root.join("releases");
369 + tokio::fs::create_dir_all(&releases).await.unwrap();
370 + for i in 0..3 {
371 + tokio::fs::create_dir(releases.join(format!("v{i}"))).await.unwrap();
372 + }
373 + gc_local_releases(root).await.unwrap();
374 + for i in 0..3 {
375 + assert!(releases.join(format!("v{i}")).exists());
376 + }
377 + }
378 +
379 + #[tokio::test]
380 + async fn gc_local_releases_noop_when_releases_dir_missing() {
381 + let tmp = tempfile::tempdir().unwrap();
382 + gc_local_releases(tmp.path()).await.unwrap();
383 + }
384 +
385 + #[tokio::test]
386 + async fn deploy_remote_fails_cleanly_when_host_unreachable() {
387 + // 192.0.2.0/24 is reserved for documentation and routes nowhere.
388 + // ConnectTimeout=10 limits the test wallclock to ~10s worst case.
389 + let tmp = tempfile::tempdir().unwrap();
390 + let staged = tmp.path().join("releases").join("0.0.1");
391 + tokio::fs::create_dir_all(&staged).await.unwrap();
392 + tokio::fs::write(staged.join("server"), b"x").await.unwrap();
393 +
394 + let node = crate::topology::Node {
395 + name: "unreachable".into(),
396 + ssh_target: "deploy@192.0.2.1".into(),
397 + release_root: "/opt/never".into(),
398 + service_name: "makenotwork.service".into(),
399 + };
400 +
401 + let result = deploy_node(&node, "0.0.1", &staged, "server").await;
402 + let err = result.expect_err("deploy to unreachable host should fail");
403 + let msg = format!("{err:#}");
404 + // The ssh helper returns `ssh <target> failed: ...`. Don't pin the
405 + // exact wording, just that the failure is attributed and that no
406 + // panic / hang happened.
407 + assert!(
408 + msg.contains("ssh") || msg.contains("rsync") || msg.contains("connection"),
409 + "unexpected error: {msg}"
410 + );
411 + }
412 +
413 + #[tokio::test]
414 + async fn deploy_node_with_local_ssh_target_swaps_symlink() {
415 + // ssh_target="local" should route to the local fast-path: just a
416 + // symlink swap, no remote calls. Helpful for dev loops.
417 + let tmp = tempfile::tempdir().unwrap();
418 + let release_root = tmp.path().to_path_buf();
419 + let staged = release_root.join("releases").join("0.0.1");
420 + tokio::fs::create_dir_all(&staged).await.unwrap();
421 + tokio::fs::write(staged.join("server"), b"x").await.unwrap();
422 +
423 + let node = crate::topology::Node {
424 + name: "local-dev".into(),
425 + ssh_target: "local".into(),
426 + release_root: release_root.to_string_lossy().into_owned(),
427 + service_name: "makenotwork.service".into(),
428 + };
429 +
430 + let out = deploy_node(&node, "0.0.1", &staged, "server").await.unwrap();
431 + assert_eq!(out, staged);
432 + let target = tokio::fs::read_link(release_root.join("current")).await.unwrap();
433 + assert_eq!(target.to_string_lossy(), "releases/0.0.1");
434 + }
71 435 }
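The remote path above swaps `current` by creating a fresh symlink (`current.new`) and renaming it over the old one. The following is a minimal std-only sketch of the same idea done in-process on a Unix host, rather than by shelling out to `ln` and `mv`. The function name `swap_current` is an illustrative assumption and not part of this commit.

```rust
use std::fs;
use std::io;
use std::os::unix::fs::symlink;
use std::path::Path;

/// Hypothetical helper (not in the commit): point `<release_root>/current` at
/// `releases/<version>` by creating a temporary symlink and renaming it into
/// place. rename(2) replaces the destination in one step, so a reader sees
/// either the old link or the new one, never a missing path.
fn swap_current(release_root: &Path, version: &str) -> io::Result<()> {
    let tmp = release_root.join("current.new");
    let current = release_root.join("current");

    // Clear a leftover temp link from an interrupted earlier run, if any.
    match fs::remove_file(&tmp) {
        Ok(()) => {}
        Err(e) if e.kind() == io::ErrorKind::NotFound => {}
        Err(e) => return Err(e),
    }

    // Relative target, matching the `releases/<version>` layout in the module docs.
    symlink(format!("releases/{version}"), &tmp)?;
    // rename operates on the link itself; it does not follow `current`.
    fs::rename(&tmp, &current)
}
```

The sketch does not check that `releases/<version>` exists; a caller would stage the release directory first, as `deploy_local` does.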
@@ -0,0 +1,126 @@
1 + //! Event bus for live operator visibility.
2 + //!
3 + //! Sites that previously logged via `tracing::info!` also emit a typed event
4 + //! onto a `broadcast::Sender<EventEnvelope>`. The WS handler at `/events`
5 + //! subscribes to the bus and forwards each envelope to the connected TUI as
6 + //! a JSON text frame.
7 +
8 + use chrono::{DateTime, Utc};
9 + use serde::Serialize;
10 + use tokio::sync::broadcast;
11 +
12 + /// Capacity of the broadcast channel. Slow subscribers that fall behind by
13 + /// more than this many events get `RecvError::Lagged`; the WS handler treats
14 + /// that as a recoverable hiccup, not a disconnect.
15 + pub const CAPACITY: usize = 256;
16 +
17 + pub type EventTx = broadcast::Sender<EventEnvelope>;
18 +
19 + #[derive(Clone, Debug, Serialize)]
20 + pub struct EventEnvelope {
21 + pub at: DateTime<Utc>,
22 + #[serde(flatten)]
23 + pub event: Event,
24 + }
25 +
26 + #[derive(Clone, Debug, Serialize)]
27 + #[serde(tag = "kind", rename_all = "snake_case")]
28 + pub enum Event {
29 + /// A /rebuild was accepted (post-receive hook or operator).
30 + RebuildRequested { sha: String },
31 + /// A previous in-flight build was aborted because a newer /rebuild arrived.
32 + BuildAborted { sha_aborted: String },
33 + BuildStart { sha: String, version: String },
34 + BuildOk { sha: String, version: String, elapsed_s: u64 },
35 + BuildFailed { sha: String, version: String, elapsed_s: u64 },
36 + GateStart { tier: String, version: String, gate: String },
37 + GateDone { tier: String, version: String, gate: String, passed: bool },
38 + DeployStart { tier: String, node: String, version: String },
39 + DeployOk { tier: String, node: String, version: String },
40 + DeployFailed { tier: String, node: String, version: String, error: String },
41 + PromoteComplete { tier: String, version: String },
42 + Rollback { tier: String, from: String, to: String },
43 + BackupFetched { source: String, byte_size: i64 },
44 + ManualConfirm { tier: String, version: String },
45 + }
46 +
47 + pub fn channel() -> EventTx {
48 + broadcast::channel(CAPACITY).0
49 + }
50 +
51 + /// Send an event without caring whether anyone is listening. The `send` call
52 + /// fails only when there are zero subscribers, which is the normal case for
53 + /// most operator-tool deployments.
54 + pub fn emit(tx: &EventTx, event: Event) {
55 + let envelope = EventEnvelope { at: Utc::now(), event };
56 + let _ = tx.send(envelope);
57 + }
58 +
59 + #[cfg(test)]
60 + mod tests {
61 + use super::*;
62 +
63 + #[test]
64 + fn emit_with_zero_subscribers_does_not_panic() {
65 + // The whole point of `let _ = tx.send(...)` is that emitting into an
66 + // unsubscribed bus is fine. Verify the contract — if this regresses
67 + // to `.unwrap()` someday, every build/deploy site will start
68 + // crashing.
69 + let tx = channel();
70 + emit(&tx, Event::RebuildRequested { sha: "abc".into() });
71 + emit(&tx, Event::BackupFetched { source: "x".into(), byte_size: 1 });
72 + }
73 +
74 + #[tokio::test]
75 + async fn emit_reaches_a_subscriber() {
76 + let tx = channel();
77 + let mut rx = tx.subscribe();
78 + emit(&tx, Event::PromoteComplete { tier: "a".into(), version: "0.8.12".into() });
79 + let env = rx.recv().await.expect("envelope");
80 + match env.event {
81 + Event::PromoteComplete { tier, version } => {
82 + assert_eq!(tier, "a");
83 + assert_eq!(version, "0.8.12");
84 + }
85 + _ => panic!("wrong event kind"),
86 + }
87 + }
88 +
89 + #[tokio::test]
90 + async fn envelope_serializes_with_flat_kind() {
91 + // Contract for the WS handler + TUI's `format_event`: the JSON has a
92 + // top-level `kind` field, not nested under `event`. Locking this in.
93 + let env = EventEnvelope {
94 + at: Utc::now(),
95 + event: Event::GateStart {
96 + tier: "mm".into(),
97 + version: "0.8.12".into(),
98 + gate: "cargo_test".into(),
99 + },
100 + };
101 + let s = serde_json::to_string(&env).unwrap();
102 + let v: serde_json::Value = serde_json::from_str(&s).unwrap();
103 + assert_eq!(v["kind"], "gate_start");
104 + assert_eq!(v["tier"], "mm");
105 + assert_eq!(v["gate"], "cargo_test");
106 + // No nested `event` object.
107 + assert!(v.get("event").is_none());
108 + }
109 +
110 + #[tokio::test]
111 + async fn lagged_subscriber_observes_recv_error_lagged() {
112 + // If a subscriber falls behind by more than CAPACITY, the next
113 + // recv() returns RecvError::Lagged(n) — not Closed, not a panic.
114 + // The WS handler turns this into a `lagged` envelope.
115 + let tx = channel();
116 + let mut rx = tx.subscribe();
117 + for i in 0..(CAPACITY + 10) {
118 + emit(&tx, Event::RebuildRequested { sha: format!("{i}") });
119 + }
120 + let err = rx.recv().await.expect_err("expected Lagged");
121 + match err {
122 + tokio::sync::broadcast::error::RecvError::Lagged(n) => assert!(n >= 10),
123 + other => panic!("unexpected error: {other:?}"),
124 + }
125 + }
126 + }
@@ -4,6 +4,7 @@
4 4 //! and the TUI can show them.
5 5
6 6 use crate::config::Config;
7 + use crate::events::{self, Event, EventTx};
7 8 use crate::topology::Gate;
8 9 use anyhow::Result;
9 10 use chrono::Utc;
@@ -18,6 +19,7 @@ pub struct GateCtx {
18 19 pub tier: String,
19 20 pub version: String,
20 21 pub worktree: PathBuf,
22 + pub events: EventTx,
21 23 }
22 24
23 25 #[derive(Debug, Clone)]
@@ -44,6 +46,11 @@ pub async fn run(ctx: &GateCtx, gate: &Gate) -> Result<GateOutcome> {
44 46 .await?;
45 47
46 48 tracing::info!(tier = %ctx.tier, version = %ctx.version, gate = kind, "gate start");
49 + events::emit(&ctx.events, Event::GateStart {
50 + tier: ctx.tier.clone(),
51 + version: ctx.version.clone(),
52 + gate: kind.into(),
53 + });
47 54
48 55 let outcome = match gate {
49 56 Gate::CargoTest => cargo_test(ctx).await,
@@ -72,20 +79,30 @@ pub async fn run(ctx: &GateCtx, gate: &Gate) -> Result<GateOutcome> {
72 79 tier = %ctx.tier, version = %ctx.version, gate = kind,
73 80 passed = outcome.passed, "gate done",
74 81 );
82 + events::emit(&ctx.events, Event::GateDone {
83 + tier: ctx.tier.clone(),
84 + version: ctx.version.clone(),
85 + gate: kind.into(),
86 + passed: outcome.passed,
87 + });
75 88
76 89 Ok(outcome)
77 90 }
78 91
79 92 /// Run a sequence of gates; stops on the first failure (no point running the
80 - /// rest if a prerequisite failed). Returns true iff every gate passed.
93 + /// Run every gate in order and return true iff all passed. We deliberately do
94 + /// NOT short-circuit on first failure — every gate's outcome is recorded in
95 + /// `gate_runs`, which is the operator's only visibility into pipeline health.
96 + /// Hiding later gates because an earlier one failed makes diagnosis worse.
81 97 pub async fn run_all(ctx: &GateCtx, gates: &[Gate]) -> Result<bool> {
98 + let mut all_ok = true;
82 99 for g in gates {
83 100 let o = run(ctx, g).await?;
84 101 if !o.passed {
85 - return Ok(false);
102 + all_ok = false;
86 103 }
87 104 }
88 - Ok(true)
105 + Ok(all_ok)
89 106 }
90 107
91 108 fn kind_str(g: &Gate) -> &'static str {
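The `run_all` change above replaces an early return with an accumulated flag, so every gate runs and records its outcome. The following is a minimal std-only sketch of that pattern. `Check` and `run_all_checks` are hypothetical stand-ins for `Gate` and `run_all`: they are synchronous and have no database writes or events.

```rust
/// Hypothetical stand-in for a gate: a name plus a check returning pass/fail.
struct Check {
    name: &'static str,
    run: fn() -> bool,
}

/// Run every check in order, record each outcome, and report overall success.
/// Unlike `iter().all(..)`, this never stops at the first failure, so later
/// outcomes stay visible even when an earlier check fails.
fn run_all_checks(checks: &[Check]) -> (bool, Vec<(&'static str, bool)>) {
    let mut all_ok = true;
    let mut outcomes = Vec::with_capacity(checks.len());
    for c in checks {
        let passed = (c.run)();
        outcomes.push((c.name, passed));
        // Accumulate instead of returning early.
        all_ok &= passed;
    }
    (all_ok, outcomes)
}
```

The trade-off is wall-clock time: a failing early gate no longer saves the cost of the later ones. The doc comment above accepts that cost in exchange for complete `gate_runs` records.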
@@ -102,11 +119,15 @@ fn kind_str(g: &Gate) -> &'static str {
102 119
103 120 async fn cargo_test(ctx: &GateCtx) -> Result<GateOutcome> {
104 121 let server_dir = ctx.worktree.join("server");
105 - let out = Command::new("cargo")
106 - .args(["test", "--release"])
107 - .current_dir(&server_dir)
108 - .output()
109 - .await?;
122 + let mut cmd = Command::new("cargo");
123 + cmd.args(["test", "--release"]).current_dir(&server_dir).kill_on_drop(true);
124 + // Same online-mode rationale as the build step: sqlx query macros need a
125 + // live DB to type-check against. The scratch DB is left in migrated state
126 + // by the preceding build, so we can reuse it here.
127 + if let Some(scratch_url) = ctx.cfg.scratch_db_url.as_deref() {
128 + cmd.env("DATABASE_URL", scratch_url);
129 + }
130 + let out = cmd.output().await?;
110 131 Ok(GateOutcome {
111 132 passed: out.status.success(),
112 133 detail: Some(tail(&out.stderr, 4_000)),
@@ -148,12 +169,30 @@ async fn migration_dry_run(ctx: &GateCtx) -> Result<GateOutcome> {
148 169 }
149 170 }
150 171
151 - async fn reset_scratch(db_url: &str) -> Result<()> {
172 + pub(crate) async fn reset_scratch(db_url: &str) -> Result<()> {
152 173 use sqlx::postgres::PgPoolOptions;
153 174 use sqlx::Executor;
154 175 let pool = PgPoolOptions::new().max_connections(1).connect(db_url).await?;
155 - pool.execute("DROP SCHEMA IF EXISTS public CASCADE; CREATE SCHEMA public;")
156 - .await?;
176 + // Drop every non-system schema, not just public — migrations create custom
177 + // schemas (e.g. tower_sessions) that survive `DROP SCHEMA public CASCADE`
178 + // and then collide on the next migration run.
179 + pool.execute(
180 + r#"
181 + DO $$
182 + DECLARE s text;
183 + BEGIN
184 + FOR s IN
185 + SELECT nspname FROM pg_namespace
186 + WHERE nspname NOT LIKE 'pg_%'
187 + AND nspname NOT IN ('information_schema')
188 + LOOP
189 + EXECUTE format('DROP SCHEMA IF EXISTS %I CASCADE', s);
190 + END LOOP;
191 + EXECUTE 'CREATE SCHEMA public';
192 + END $$;
193 + "#,
194 + )
195 + .await?;
157 196 pool.close().await;
158 197 Ok(())
159 198 }
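Review note on the schema filter in `reset_scratch`: in SQL `LIKE`, `_` is a wildcard for any single character, so `'pg_%'` matches every name that starts with `pg` and has at least one more character, not only names with the literal `pg_` prefix. The sketch below emulates that predicate in plain Rust to show which schema names the loop would skip. The helper names and the `pgboss` schema are hypothetical, chosen only for illustration.

```rust
/// Emulates the SQL predicate `name LIKE 'pg_%'`: literal "pg", then exactly
/// one arbitrary character (`_`), then anything (`%`).
fn like_pg_underscore_percent(name: &str) -> bool {
    let mut chars = name.chars();
    chars.next() == Some('p') && chars.next() == Some('g') && chars.next().is_some()
}

/// Mirrors the WHERE clause in the diff: a schema is dropped when it is
/// neither matched by the LIKE pattern nor named `information_schema`.
fn would_be_dropped(name: &str) -> bool {
    !like_pg_underscore_percent(name) && name != "information_schema"
}

fn main() {
    let schemas = [
        "public",
        "tower_sessions",
        "pg_catalog",
        "information_schema",
        "pgboss", // hypothetical user schema without the literal `pg_` prefix
    ];
    for name in schemas {
        println!("{name}: dropped = {}", would_be_dropped(name));
    }
}
```

If a migration ever creates a schema such as the hypothetical `pgboss`, it would survive the reset. Escaping the underscore (`'pg\_%'`) or using a regex match (`nspname !~ '^pg_'`) should restrict the filter to the reserved prefix. The test added in this commit uses the same pattern, so it would not catch the difference.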
@@ -177,7 +216,7 @@ async fn restore_dump(db_url: &str, dump: &str) -> Result<()> {
177 216 Ok(())
178 217 }
179 218
180 - async fn run_migrator(db_url: &str, dir: &std::path::Path) -> Result<()> {
219 + pub(crate) async fn run_migrator(db_url: &str, dir: &std::path::Path) -> Result<()> {
181 220 use sqlx::postgres::PgPoolOptions;
182 221 let pool = PgPoolOptions::new().max_connections(1).connect(db_url).await?;
183 222 let migrator = sqlx::migrate::Migrator::new(dir).await?;
@@ -205,11 +244,21 @@ async fn boot_smoke(ctx: &GateCtx) -> Result<GateOutcome> {
205 244 // seconds without exiting. Panics in main, missing config, port-bind
206 245 // failures show up here. Anything more ambitious (probing /healthz on a
207 246 // real port) needs server config we don't generically know.
208 - let mut child = match tokio::process::Command::new(&bin)
209 - .env("SANDO_BOOT_SMOKE", "1")
210 - .kill_on_drop(true)
211 - .spawn()
212 - {
247 + //
248 + // The server requires DATABASE_URL or it panics on config load before
249 + // we can observe anything. We point it at the scratch DB (already
250 + // migrated by the build step and refreshed by migration_dry_run if
251 + // that gate ran first). SCAN_ENABLED=false skips loading YARA rules
252 + // from /opt/makenotwork/yara-rules which doesn't exist on the build
253 + // host. Other config has sane optional defaults.
254 + let mut cmd = tokio::process::Command::new(&bin);
255 + cmd.env("SANDO_BOOT_SMOKE", "1")
256 + .env("SCAN_ENABLED", "false")
257 + .kill_on_drop(true);
258 + if let Some(scratch_url) = ctx.cfg.scratch_db_url.as_deref() {
259 + cmd.env("DATABASE_URL", scratch_url);
260 + }
261 + let mut child = match cmd.spawn() {
213 262 Ok(c) => c,
214 263 Err(e) => return Ok(GateOutcome { passed: false, detail: Some(format!("spawn: {e}")) }),
215 264 };
@@ -279,3 +328,59 @@ fn tail(buf: &[u8], max: usize) -> String {
279 328 let s = String::from_utf8_lossy(buf);
280 329 if s.len() <= max { s.into_owned() } else { format!("...{}", &s[s.len() - max..]) }
281 330 }
331 +
332 + #[cfg(test)]
333 + mod tests {
334 + use super::*;
335 +
336 + /// reset_scratch must drop every non-system schema, not just `public` —
337 + /// otherwise migrations that create custom schemas (e.g. tower_sessions)
338 + /// collide on the next run. This regressed once (Phase 0) and the fix is
339 + /// load-bearing for migration_dry_run.
340 + ///
341 + /// Gated on `SANDO_TEST_PG_URL` so it only runs where postgres is
342 + /// available. Set `SANDO_TEST_PG_URL=postgres:///sando_scratch?host=/var/run/postgresql`
343 + /// (or similar) before `cargo test`.
344 + #[tokio::test]
345 + async fn reset_scratch_drops_all_non_system_schemas() {
346 + let Ok(url) = std::env::var("SANDO_TEST_PG_URL") else {
347 + eprintln!("skipping: SANDO_TEST_PG_URL not set");
348 + return;
349 + };
350 + use sqlx::Executor;
351 + use sqlx::postgres::PgPoolOptions;
352 +
353 + let pool = PgPoolOptions::new().max_connections(1).connect(&url).await.unwrap();
354 + // Plant two non-system schemas + a table in each.
355 + pool.execute("DROP SCHEMA IF EXISTS foo CASCADE; CREATE SCHEMA foo; CREATE TABLE foo.t (i int);")
356 + .await.unwrap();
357 + pool.execute("DROP SCHEMA IF EXISTS tower_sessions CASCADE; CREATE SCHEMA tower_sessions; CREATE TABLE tower_sessions.session (id text);")
358 + .await.unwrap();
359 + pool.close().await;
360 +
361 + reset_scratch(&url).await.expect("reset_scratch");
362 +
363 + let pool = PgPoolOptions::new().max_connections(1).connect(&url).await.unwrap();
364 + let rows: Vec<(String,)> = sqlx::query_as(
365 + "SELECT nspname FROM pg_namespace WHERE nspname NOT LIKE 'pg_%' AND nspname <> 'information_schema'",
366 + )
367 + .fetch_all(&pool)
368 + .await
369 + .unwrap();
370 + let names: Vec<String> = rows.into_iter().map(|(s,)| s).collect();
371 + // After reset, only `public` should remain among non-system schemas.
372 + assert_eq!(names, vec!["public".to_string()], "got: {names:?}");
373 + pool.close().await;
374 + }
375 +
376 + /// Sanity: applying MNW migrations from a *non-existent* dir errors,
377 + /// rather than silently no-op'ing. Cheap pure check, no postgres needed
378 + /// (the sqlx::migrate::Migrator::new constructor itself reads the dir).
379 + #[tokio::test]
380 + async fn run_migrator_errors_on_missing_dir() {
381 + // The first thing run_migrator does is `Migrator::new(dir)`, which
382 + // needs a real dir to read migration files from.
383 + let res = run_migrator("postgres:///does-not-matter", std::path::Path::new("/nonexistent/sando-test-migrations")).await;
384 + assert!(res.is_err());
385 + }
386 + }
@@ -9,6 +9,7 @@ mod config;
9 9 mod db;
10 10 mod deploy;
11 11 mod error;
12 + mod events;
12 13 mod gates;
13 14 mod git;
14 15 mod metrics;
@@ -44,7 +45,14 @@ async fn main() -> Result<()> {
44 45
45 46 let prom = metrics::init();
46 47 let addr: SocketAddr = cfg.listen.parse()?;
47 - let app_state = state::AppState { pool, topo, cfg, prom };
48 + let app_state = state::AppState {
49 + pool,
50 + topo,
51 + cfg,
52 + prom,
53 + active_build: Arc::new(tokio::sync::Mutex::new(None)),
54 + events: events::channel(),
55 + };
48 56 let app = routes::router(app_state);
49 57 tracing::info!(%addr, "sando daemon listening");
50 58 let listener = tokio::net::TcpListener::bind(addr).await?;
@@ -14,6 +14,7 @@ pub fn router(state: AppState) -> Router {
14 14 .route("/promote/{tier}", post(promote))
15 15 .route("/rollback/{tier}", post(rollback))
16 16 .route("/rebuild", post(rebuild))
17 + .route("/confirm/{tier}", post(confirm))
17 18 .route("/backup/fetch", post(backup_fetch))
18 19 .route("/events", get(events_ws))
19 20 .with_state(state)
@@ -67,8 +68,25 @@ async fn get_state(State(s): State<AppState>) -> Result<Json<StateView>> {
67 68 .fetch_all(&s.pool)
68 69 .await?;
69 70
70 - let gates: Vec<GateView> = if let Some(ver) = current_version.as_ref() {
71 - // Most recent gate_runs row per gate_kind for (tier, current_version).
71 + // Surface gates for current_version when set, otherwise for the most
72 + // recently attempted version on this tier. Without the fallback, a
73 + // tier that has never gone green (MM after a build failure, B before
74 + // first deploy) exposes no gate detail via /state — debugging required
75 + // SSH and direct SQLite access. See sando todo: gate observability.
76 + let gate_version: Option<String> = if current_version.is_some() {
77 + current_version.clone()
78 + } else {
79 + sqlx::query_scalar(
80 + "SELECT version FROM gate_runs WHERE tier = ?
81 + ORDER BY id DESC LIMIT 1",
82 + )
83 + .bind(&name)
84 + .fetch_optional(&s.pool)
85 + .await?
86 + };
87 +
88 + let gates: Vec<GateView> = if let Some(ver) = gate_version.as_ref() {
89 + // Most recent gate_runs row per gate_kind for (tier, ver).
72 90 sqlx::query(
73 91 "SELECT gate_kind, passed, finished_at, detail
74 92 FROM gate_runs g
@@ -109,9 +127,12 @@ async fn get_state(State(s): State<AppState>) -> Result<Json<StateView>> {
109 127 Ok(Json(StateView { tiers }))
110 128 }
111 129
112 - #[derive(Deserialize)]
130 + #[derive(Deserialize, Default)]
113 131 struct PromoteBody {
114 - version: String,
132 + /// Optional. If absent, defaults to the predecessor tier's `current_version`
133 + /// (i.e. promote whatever just finished baking on the previous tier).
134 + #[serde(default)]
135 + version: Option<String>,
115 136 #[serde(default)]
116 137 hotfix: bool,
117 138 #[serde(default)]
@@ -121,8 +142,9 @@ struct PromoteBody {
121 142 async fn promote(
122 143 State(s): State<AppState>,
123 144 Path(tier): Path<String>,
124 - Json(body): Json<PromoteBody>,
145 + body: Option<Json<PromoteBody>>,
125 146 ) -> Result<Json<serde_json::Value>> {
147 + let body = body.map(|Json(b)| b).unwrap_or_default();
126 148 let idx = s.topo.tiers.iter().position(|t| t.name == tier)
127 149 .ok_or(crate::error::Error::NotFound)?;
128 150 if idx == 0 {
@@ -133,9 +155,24 @@ async fn promote(
133 155 let target = &s.topo.tiers[idx];
134 156 let source = &s.topo.tiers[idx - 1];
135 157
158 + // Resolve version: explicit if given, else the source tier's current.
159 + let version = match body.version {
160 + Some(v) => v,
161 + None => sqlx::query_scalar::<_, Option<String>>(
162 + "SELECT current_version FROM tier_state WHERE tier = ?",
163 + )
164 + .bind(&source.name)
165 + .fetch_optional(&s.pool).await
166 + .map_err(crate::error::Error::Db)?
167 + .flatten()
168 + .ok_or_else(|| crate::error::Error::GateBlocked(
169 + format!("no version specified and tier {} has no current_version", source.name),
170 + ))?,
171 + };
172 +
136 173 // 1. Predecessor must have all of its gates green for this version (with
137 174 // optional hotfix override that skips burn_in).
138 - let pending = unsatisfied_gates(&s.pool, &source.name, &body.version, body.hotfix).await?;
175 + let pending = unsatisfied_gates(&s.pool, &source.name, &version, body.hotfix).await?;
139 176 if !pending.is_empty() {
140 177 return Err(crate::error::Error::GateBlocked(format!(
141 178 "{} gate(s) not satisfied on tier {}: {}",
@@ -149,7 +186,7 @@ async fn promote(
149 186 let bin: Option<(String,)> = sqlx::query_as(
150 187 "SELECT artifact_path FROM versions WHERE version = ?",
151 188 )
152 - .bind(&body.version)
189 + .bind(&version)
153 190 .fetch_optional(&s.pool)
154 191 .await
155 192 .map_err(crate::error::Error::Db)?;
@@ -157,23 +194,49 @@ async fn promote(
157 194 return Err(crate::error::Error::NotFound);
158 195 };
159 196 let bin_path = std::path::PathBuf::from(bin);
197 + // `artifact_path` is the primary binary; the staged release dir is its parent.
198 + let staged_dir = bin_path.parent()
199 + .ok_or_else(|| crate::error::Error::Other(anyhow::anyhow!("artifact_path has no parent")))?
200 + .to_path_buf();
160 201
161 202 // 3. Deploy to each node. Sequential canary is the only policy
162 203 // implemented in v0; parallel is a one-line change once we trust the
163 204 // sequential path.
164 205 for node in &target.nodes {
165 - crate::deploy::deploy_node(node, &body.version, &bin_path)
166 - .await
167 - .map_err(crate::error::Error::Other)?;
168 - let now = chrono::Utc::now().to_rfc3339();
206 + let started = chrono::Utc::now().to_rfc3339();
207 + crate::events::emit(&s.events, crate::events::Event::DeployStart {
208 + tier: target.name.clone(), node: node.name.clone(), version: version.clone(),
209 + });
210 + let result = crate::deploy::deploy_node(node, &version, &staged_dir, s.cfg.primary_bin()).await;
211 + let finished = chrono::Utc::now().to_rfc3339();
212 + let (outcome, err_msg) = match &result {
213 + Ok(_) => ("ok", None),
214 + Err(e) => ("failed", Some(format!("{e:#}"))),
215 + };
169 216 sqlx::query(
170 217 "INSERT INTO deploys (version, tier, node, started_at, finished_at, outcome, hotfix, reset_burn_in)
171 - VALUES (?, ?, ?, ?, ?, 'ok', ?, ?)",
218 + VALUES (?, ?, ?, ?, ?, ?, ?, ?)",
172 219 )
173 - .bind(&body.version).bind(&target.name).bind(&node.name)
174 - .bind(&now).bind(&now)
220 + .bind(&version).bind(&target.name).bind(&node.name)
221 + .bind(&started).bind(&finished).bind(outcome)
175 222 .bind(body.hotfix as i64).bind(body.reset_burn_in as i64)
176 223 .execute(&s.pool).await.map_err(crate::error::Error::Db)?;
224 + if let Err(e) = result {
225 + let msg = err_msg.unwrap_or_default();
226 + tracing::error!(
227 + tier = %target.name, node = %node.name, version = %version,
228 + error = %msg,
229 + "deploy failed; current symlink left intact, tier_state not advanced"
230 + );
231 + crate::events::emit(&s.events, crate::events::Event::DeployFailed {
232 + tier: target.name.clone(), node: node.name.clone(),
233 + version: version.clone(), error: msg,
234 + });
235 + return Err(crate::error::Error::Other(e));
236 + }
237 + crate::events::emit(&s.events, crate::events::Event::DeployOk {
238 + tier: target.name.clone(), node: node.name.clone(), version: version.clone(),
239 + });
177 240 }
178 241
179 242 // 4. Advance tier_state. burn_in_started_at is set to now so the target
@@ -189,7 +252,7 @@ async fn promote(
189 252 WHERE tier = ?",
190 253 )
191 254 .bind(prev)
192 - .bind(&body.version)
255 + .bind(&version)
193 256 .bind(chrono::Utc::now().to_rfc3339())
194 257 .bind(&target.name)
195 258 .execute(&s.pool).await.map_err(crate::error::Error::Db)?;
@@ -200,15 +263,18 @@ async fn promote(
200 263 .execute(&s.pool).await.map_err(crate::error::Error::Db)?;
201 264 }
202 265
266 + crate::events::emit(&s.events, crate::events::Event::PromoteComplete {
267 + tier: target.name.clone(), version: version.clone(),
268 + });
203 269 tracing::info!(
204 - version = %body.version, tier = %target.name,
270 + version = %version, tier = %target.name,
205 271 hotfix = body.hotfix, reset_burn_in = body.reset_burn_in,
206 272 "promote complete",
207 273 );
208 274
209 275 Ok(Json(serde_json::json!({
210 276 "tier": target.name,
211 - "version": body.version,
277 + "version": version,
212 278 "nodes_deployed": target.nodes.iter().map(|n| n.name.clone()).collect::<Vec<_>>(),
213 279 })))
214 280 }
@@ -275,9 +341,12 @@ async fn rollback(
275 341 ));
276 342 };
277 343 let bin_path = std::path::PathBuf::from(bin);
344 + let staged_dir = bin_path.parent()
345 + .ok_or_else(|| crate::error::Error::Other(anyhow::anyhow!("artifact_path has no parent")))?
346 + .to_path_buf();
278 347
279 348 for node in &target.nodes {
280 - crate::deploy::deploy_node(node, &previous, &bin_path)
349 + crate::deploy::deploy_node(node, &previous, &staged_dir, s.cfg.primary_bin())
281 350 .await
282 351 .map_err(crate::error::Error::Other)?;
283 352 }
@@ -292,6 +361,9 @@ async fn rollback(
292 361 .execute(&s.pool).await.map_err(crate::error::Error::Db)?;
293 362
294 363 tracing::warn!(tier = %tier, from = %current, to = %previous, "rollback complete");
364 + crate::events::emit(&s.events, crate::events::Event::Rollback {
365 + tier: tier.clone(), from: current.clone(), to: previous.clone(),
366 + });
295 367
296 368 Ok(Json(serde_json::json!({
297 369 "tier": tier,
@@ -323,24 +395,80 @@ async fn rebuild(
323 395 };
324 396
325 397 tracing::info!(sha = %sha, "rebuild requested");
398 + crate::events::emit(&s.events, crate::events::Event::RebuildRequested { sha: sha.clone() });
399 +
400 + // Latest /rebuild wins: abort any in-flight build before spawning a new
401 + // one. Aborting drops the spawned task's future, which drops any
402 + // tokio::process::Child it owns; with `kill_on_drop(true)` set on the
403 + // cargo Command, SIGKILL propagates to cargo + its rustc children.
404 + let mut slot = s.active_build.lock().await;
405 + if let Some(prev) = slot.take() {
406 + if !prev.is_finished() {
407 + tracing::warn!("aborting in-flight build for newer /rebuild request");
408 + crate::events::emit(&s.events, crate::events::Event::BuildAborted { sha_aborted: sha.clone() });
409 + prev.abort();
410 + }
411 + }
326 412
327 413 let pool = s.pool.clone();
328 414 let cfg = s.cfg.clone();
329 415 let topo = s.topo.clone();
416 + let events_for_task = s.events.clone();
330 417 let sha_for_task = sha.clone();
331 - tokio::spawn(async move {
332 - if let Err(e) = crate::build::build_and_run_mm(pool, cfg, topo, sha_for_task.clone()).await {
418 + let handle = tokio::spawn(async move {
419 + if let Err(e) = crate::build::build_and_run_mm(pool, cfg, topo, sha_for_task.clone(), events_for_task).await {
333 420 tracing::error!(sha = %sha_for_task, error = %e, "rebuild pipeline failed");
334 421 }
335 422 });
423 + *slot = Some(handle.abort_handle());
336 424
337 425 Ok(Json(serde_json::json!({ "accepted": true, "sha": sha })))
338 426 }
339 427
428 + async fn confirm(
429 + State(s): State<AppState>,
430 + Path(tier): Path<String>,
431 + ) -> Result<Json<serde_json::Value>> {
432 + // Operator-driven satisfaction of a `manual_confirm` gate. Looks up the
433 + // pending version (current MM version, or the tier's own if non-mm) and
434 + // inserts a passing gate_runs row so /promote can advance.
435 + let target = s.topo.tiers.iter().find(|t| t.name == tier)
436 + .ok_or(crate::error::Error::NotFound)?;
437 +
438 + let version: Option<String> = sqlx::query_scalar(
439 + "SELECT current_version FROM tier_state WHERE tier = ?",
440 + )
441 + .bind(&target.name)
442 + .fetch_optional(&s.pool).await.map_err(crate::error::Error::Db)?.flatten();
443 + let version = version.ok_or_else(|| crate::error::Error::GateBlocked(
444 + format!("tier {tier} has no current_version; nothing to confirm"),
445 + ))?;
446 +
447 + let now = chrono::Utc::now().to_rfc3339();
448 + sqlx::query(
449 + "INSERT INTO gate_runs (version, tier, gate_kind, started_at, finished_at, passed, detail)
450 + VALUES (?, ?, 'manual_confirm', ?, ?, 1, 'operator confirmed via POST /confirm')",
451 + )
452 + .bind(&version).bind(&target.name).bind(&now).bind(&now)
453 + .execute(&s.pool).await.map_err(crate::error::Error::Db)?;
454 +
455 + tracing::info!(tier = %tier, version = %version, "manual_confirm recorded");
456 + crate::events::emit(&s.events, crate::events::Event::ManualConfirm {
457 + tier: tier.clone(),
458 + version: version.clone(),
459 + });
460 +
461 + Ok(Json(serde_json::json!({ "tier": tier, "version": version })))
462 + }
463 +
340 464 async fn backup_fetch(State(s): State<AppState>) -> Result<Json<serde_json::Value>> {
341 465 let fb = crate::backup::fetch(&s.pool, &s.cfg, &s.topo)
342 466 .await
343 467 .map_err(crate::error::Error::Other)?;
468 + crate::events::emit(&s.events, crate::events::Event::BackupFetched {
469 + source: fb.source.clone(),
470 + byte_size: fb.byte_size.unwrap_or(0),
471 + });
344 472 Ok(Json(serde_json::json!({
345 473 "source": fb.source,
346 474 "local_path": fb.local_path,
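Review note on the `/rebuild` handler earlier in this hunk: `BuildAborted` is emitted with `sha_aborted: sha.clone()`, where `sha` belongs to the new request. If the event is meant to name the build being aborted, the slot would need to remember that sha alongside the handle. The following is a minimal std-only sketch of that idea. `ActiveBuild` and `start_build` are hypothetical, an `AtomicBool` cancel flag stands in for tokio's `AbortHandle`, and the `is_finished()` check from the diff is omitted.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::{Arc, Mutex};

/// Hypothetical slot entry: the sha being built plus a cancel flag.
/// The flag stands in for tokio's `AbortHandle` so the sketch needs only std.
struct ActiveBuild {
    sha: String,
    cancelled: Arc<AtomicBool>,
}

/// Installs `new_sha` as the active build. If an earlier build is still in
/// the slot, it is cancelled and its sha is returned to the caller.
fn start_build(
    slot: &Mutex<Option<ActiveBuild>>,
    new_sha: &str,
) -> (Arc<AtomicBool>, Option<String>) {
    let mut guard = slot.lock().unwrap();
    // Latest request wins: cancel whatever was in the slot and keep its sha.
    let aborted = guard.take().map(|prev| {
        prev.cancelled.store(true, Ordering::SeqCst);
        prev.sha
    });
    let cancelled = Arc::new(AtomicBool::new(false));
    *guard = Some(ActiveBuild {
        sha: new_sha.to_string(),
        cancelled: cancelled.clone(),
    });
    (cancelled, aborted)
}

fn main() {
    let slot = Mutex::new(None);
    let (first_flag, aborted) = start_build(&slot, "aaa111");
    println!("aborted by first request: {aborted:?}");
    let (_second_flag, aborted) = start_build(&slot, "bbb222");
    println!("aborted by second request: {aborted:?}");
    println!("first build cancelled: {}", first_flag.load(Ordering::SeqCst));
}
```

With the aborted sha returned from the slot, the event could carry the old build's sha while the log line and response keep reporting the new one.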
@@ -348,8 +476,381 @@ async fn backup_fetch(State(s): State<AppState>) -> Result<Json<serde_json::Valu
348 476 })))
349 477 }
350 478
351 - async fn events_ws(ws: WebSocketUpgrade, State(_s): State<AppState>) -> impl IntoResponse {
352 - ws.on_upgrade(|_socket| async move {
353 - // tail of deploy/gate events for the TUI
479 + async fn events_ws(ws: WebSocketUpgrade, State(s): State<AppState>) -> impl IntoResponse {
480 + use axum::extract::ws::Message;
481 + use tokio::sync::broadcast::error::RecvError;
482 +
483 + ws.on_upgrade(move |mut socket| async move {
484 + let mut rx = s.events.subscribe();
485 + loop {
486 + match rx.recv().await {
487 + Ok(env) => {
488 + let json = match serde_json::to_string(&env) {
489 + Ok(s) => s,
490 + Err(e) => {
491 + tracing::warn!(error = %e, "events ws: serialize failed");
492 + continue;
493 + }
494 + };
495 + if socket.send(Message::Text(json.into())).await.is_err() {
496 + break;
497 + }
498 + }
499 + Err(RecvError::Lagged(n)) => {
500 + let _ = socket.send(Message::Text(
501 + format!(r#"{{"kind":"lagged","skipped":{n}}}"#).into(),
502 + )).await;
503 + }
504 + Err(RecvError::Closed) => break,
505 + }
506 + }
354 507 })
355 508 }
509 +
510 + #[cfg(test)]
511 + mod tests {
512 + use super::*;
513 + use crate::config::Config;
514 + use crate::topology::{BackupConfig, CanaryPolicy, Gate, Node, RepoConfig, Tier, Topology};
515 + use axum::body::Body;
516 + use axum::http::{Request, StatusCode};
517 + use http_body_util::BodyExt;
518 + use metrics_exporter_prometheus::PrometheusBuilder;
519 + use sqlx::sqlite::SqlitePoolOptions;
520 + use sqlx::SqlitePool;
521 + use std::path::PathBuf;
522 + use std::sync::Arc;
523 + use tower::ServiceExt;
524 +
525 + async fn fresh_pool() -> SqlitePool {
526 + let pool = SqlitePoolOptions::new()
527 + .max_connections(1)
528 + .connect("sqlite::memory:")
529 + .await
530 + .unwrap();
531 + sqlx::migrate!("./migrations").run(&pool).await.unwrap();
532 + pool
533 + }
534 +
535 + /// Two-tier topology used by the route tests: mm (provisioned, no nodes)
536 + /// → a (provisioned, one local node). Mirrors the production shape
537 + /// without involving real ssh / postgres.
538 + fn test_topo() -> Topology {
539 + Topology {
540 + repo: RepoConfig { bare_path: "/tmp/test.git".into(), branch: "main".into() },
541 + backup: BackupConfig {
542 + source: "file:///tmp/test-backup.sql".into(),
543 + local_path: "/tmp/local-backup.sql".into(),
544 + },
545 + tiers: vec![
546 + Tier {
547 + name: "mm".into(),
548 + provisioned: true,
549 + gates: vec![],
550 + canary: CanaryPolicy::Sequential,
551 + nodes: vec![],
552 + },
553 + Tier {
554 + name: "a".into(),
555 + provisioned: true,
556 + gates: vec![Gate::BootSmoke],
557 + canary: CanaryPolicy::Sequential,
558 + nodes: vec![Node {
559 + name: "a-local".into(),
560 + ssh_target: "local".into(),
561 + release_root: "/tmp/a-node".into(),
562 + service_name: "makenotwork.service".into(),
563 + }],
564 + },
565 + ],
566 + }
567 + }
568 +
569 + fn test_cfg() -> Config {
570 + Config {
571 + listen: "127.0.0.1:0".into(),
572 + db_path: PathBuf::from(":memory:"),
573 + topology_path: PathBuf::from("/tmp/test-sando.toml"),
574 + workdir: PathBuf::from("/tmp/sando-work"),
575 + release_root: PathBuf::from("/tmp/sando-releases"),
576 + scratch_db_url: None,
577 + bin_names: vec!["makenotwork".into()],
578 + }
579 + }
580 +
581 + async fn test_state() -> AppState {
582 + let pool = fresh_pool().await;
583 + // Seed tier rows so FKs on tier_state / gate_runs are satisfied.
584 + for (i, name) in ["mm", "a"].iter().enumerate() {
585 + sqlx::query(
586 + "INSERT INTO tiers (name, ord, provisioned, canary) VALUES (?, ?, 1, 'sequential')",
587 + )
588 + .bind(name).bind(i as i64).execute(&pool).await.unwrap();
589 + sqlx::query("INSERT INTO tier_state (tier) VALUES (?)")
590 + .bind(name).execute(&pool).await.unwrap();
591 + }
592 + // Don't call install_recorder in tests — it touches a process-global
593 + // and conflicts when tests run in parallel.
594 + let prom = PrometheusBuilder::new().build_recorder().handle();
595 + AppState {
596 + pool,
597 + topo: Arc::new(test_topo()),
598 + cfg: Arc::new(test_cfg()),
599 + prom,
600 + active_build: Arc::new(tokio::sync::Mutex::new(None)),
601 + events: crate::events::channel(),
602 + }
603 + }
604 +
605 + async fn body_string(resp: axum::response::Response) -> String {
606 + let bytes = resp.into_body().collect().await.unwrap().to_bytes();
607 + String::from_utf8(bytes.to_vec()).unwrap()
608 + }
609 +
610 + /// Insert the FK prerequisites for inserting gate_runs/tier_state rows.
611 + async fn seed(pool: &SqlitePool, tier: &str, version: &str) {
612 + sqlx::query("INSERT INTO tiers (name, ord, provisioned, canary) VALUES (?, 0, 1, 'sequential') ON CONFLICT DO NOTHING")
613 + .bind(tier).execute(pool).await.unwrap();
614 + sqlx::query("INSERT INTO versions (version, git_sha, built_at, artifact_path) VALUES (?, 'sha', datetime('now'), '/tmp/x') ON CONFLICT DO NOTHING")
615 + .bind(version).execute(pool).await.unwrap();
616 + sqlx::query("INSERT INTO tier_state (tier, current_version) VALUES (?, NULL) ON CONFLICT DO NOTHING")
617 + .bind(tier).execute(pool).await.unwrap();
618 + }
619 +
620 + async fn insert_gate(pool: &SqlitePool, tier: &str, version: &str, kind: &str, passed: i64) {
621 + sqlx::query(
622 + "INSERT INTO gate_runs (version, tier, gate_kind, started_at, finished_at, passed) \
623 + VALUES (?, ?, ?, datetime('now'), datetime('now'), ?)",
624 + )
625 + .bind(version).bind(tier).bind(kind).bind(passed)
626 + .execute(pool).await.unwrap();
627 + }
628 +
629 + // ---- unsatisfied_gates ----
630 +
631 + #[tokio::test]
632 + async fn unsatisfied_gates_empty_when_no_runs() {
633 + // No gate_runs rows means there's nothing to check — caller treats
634 + // empty as "all green" which is correct iff the predecessor tier
635 + // has no configured gates. The topology validation is upstream.
636 + let pool = fresh_pool().await;
637 + seed(&pool, "mm", "0.8.12").await;
638 + let pending = unsatisfied_gates(&pool, "mm", "0.8.12", false).await.unwrap();
639 + assert_eq!(pending, Vec::<String>::new());
640 + }
641 +
642 + #[tokio::test]
643 + async fn unsatisfied_gates_flags_failed_kind() {
644 + let pool = fresh_pool().await;
645 + seed(&pool, "mm", "0.8.12").await;
646 + insert_gate(&pool, "mm", "0.8.12", "cargo_test", 0).await;
647 + insert_gate(&pool, "mm", "0.8.12", "boot_smoke", 1).await;
648 + let pending = unsatisfied_gates(&pool, "mm", "0.8.12", false).await.unwrap();
649 + assert_eq!(pending, vec!["cargo_test".to_string()]);
650 + }
651 +
652 + #[tokio::test]
653 + async fn unsatisfied_gates_latest_row_wins() {
654 + // Two runs of the same gate; only the latest counts. A flap from
655 + // red to green should clear the pending entry.
656 + let pool = fresh_pool().await;
657 + seed(&pool, "mm", "0.8.12").await;
658 + insert_gate(&pool, "mm", "0.8.12", "cargo_test", 0).await;
659 + insert_gate(&pool, "mm", "0.8.12", "cargo_test", 1).await;
660 + let pending = unsatisfied_gates(&pool, "mm", "0.8.12", false).await.unwrap();
661 + assert!(pending.is_empty());
662 + }
663 +
664 + #[tokio::test]
665 + async fn unsatisfied_gates_hotfix_skips_only_burn_in() {
666 + // hotfix=true is supposed to bypass burn_in failures specifically —
667 + // not cargo_test, not boot_smoke. Lock the semantic so a future
668 + // rename doesn't accidentally widen it.
669 + let pool = fresh_pool().await;
670 + seed(&pool, "a", "0.8.12").await;
671 + insert_gate(&pool, "a", "0.8.12", "burn_in", 0).await;
672 + insert_gate(&pool, "a", "0.8.12", "cargo_test", 0).await;
673 +
674 + let normal = unsatisfied_gates(&pool, "a", "0.8.12", false).await.unwrap();
675 + let mut sorted = normal.clone();
676 + sorted.sort();
677 + assert_eq!(sorted, vec!["burn_in".to_string(), "cargo_test".to_string()]);
678 +
679 + let with_hotfix = unsatisfied_gates(&pool, "a", "0.8.12", true).await.unwrap();
680 + assert_eq!(with_hotfix, vec!["cargo_test".to_string()]);
681 + }
682 +
683 + #[tokio::test]
684 + async fn unsatisfied_gates_ignores_other_tiers_and_versions() {
685 + let pool = fresh_pool().await;
686 + seed(&pool, "mm", "0.8.12").await;
687 + seed(&pool, "mm", "0.8.11").await;
688 + seed(&pool, "a", "0.8.12").await;
689 + // Mark mm/0.8.12 cargo_test failing, but unrelated tiers/versions
690 + // shouldn't pollute the query.
691 + insert_gate(&pool, "mm", "0.8.12", "cargo_test", 0).await;
692 + insert_gate(&pool, "a", "0.8.12", "cargo_test", 0).await;
693 + insert_gate(&pool, "mm", "0.8.11", "cargo_test", 0).await;
Lines truncated
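Review note on the `unsatisfied_gates` tests above: the real function is a SQL query that this diff does not show, so the following is only an in-memory sketch of the semantics the tests pin down (latest row per gate kind wins, and `hotfix` skips `burn_in` only). The `unsatisfied` helper and its tuple rows are hypothetical.

```rust
use std::collections::HashMap;

/// In-memory stand-in for `gate_runs` rows of one (tier, version):
/// each entry is (gate_kind, passed), listed in insertion order.
fn unsatisfied(rows: &[(&str, bool)], hotfix: bool) -> Vec<String> {
    // Later rows overwrite earlier ones, so the latest run per gate kind wins.
    let mut latest: HashMap<&str, bool> = HashMap::new();
    for &(kind, passed) in rows {
        latest.insert(kind, passed);
    }
    let mut pending: Vec<String> = latest
        .into_iter()
        // A gate is pending when its latest run failed, unless it is
        // burn_in and the hotfix override is set.
        .filter(|&(kind, passed)| !passed && !(hotfix && kind == "burn_in"))
        .map(|(kind, _)| kind.to_string())
        .collect();
    pending.sort(); // HashMap iteration order is arbitrary; sort for stable output
    pending
}

fn main() {
    let rows = [("burn_in", false), ("cargo_test", false), ("cargo_test", true)];
    println!("normal: {:?}", unsatisfied(&rows, false));
    println!("hotfix: {:?}", unsatisfied(&rows, true));
}
```

As the first test's comment points out, an empty result only means "all green" if the tier's configured gates have actually run; a sketch like this shares that limitation because it looks only at recorded rows.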
@@ -1,8 +1,11 @@
1 1 use crate::config::Config;
2 + use crate::events::EventTx;
2 3 use crate::topology::Topology;
3 4 use metrics_exporter_prometheus::PrometheusHandle;
4 5 use sqlx::SqlitePool;
5 6 use std::sync::Arc;
7 + use tokio::sync::Mutex;
8 + use tokio::task::AbortHandle;
6 9
7 10 #[derive(Clone)]
8 11 pub struct AppState {
@@ -10,4 +13,10 @@ pub struct AppState {
10 13 pub topo: Arc<Topology>,
11 14 pub cfg: Arc<Config>,
12 15 pub prom: PrometheusHandle,
16 + /// Single-slot guard for the build pipeline. A new /rebuild aborts any
17 + /// in-flight build (cargo + gates) so the latest push always wins.
18 + pub active_build: Arc<Mutex<Option<AbortHandle>>>,
19 + /// Broadcast bus for live operator events. WS /events subscribes; all
20 + /// build/gate/deploy code sites emit on this.
21 + pub events: EventTx,
13 22 }
@@ -151,6 +151,7 @@ mod tests {
151 151 name: name.into(),
152 152 ssh_target: format!("deploy@{name}"),
153 153 release_root: "/opt/mnw".into(),
154 + service_name: "makenotwork.service".into(),
154 155 }
155 156 }
156 157
@@ -39,8 +39,14 @@ pub struct Node {
39 39 pub name: String,
40 40 pub ssh_target: String,
41 41 pub release_root: String,
42 + /// systemd unit name to reload-or-restart after the symlink swap.
43 + /// Defaults to "makenotwork.service" because that's MNW's prod unit.
44 + #[serde(default = "default_service_name")]
45 + pub service_name: String,
42 46 }
43 47
48 + fn default_service_name() -> String { "makenotwork.service".into() }
49 +
44 50 #[derive(Debug, Clone, Copy, Serialize, Deserialize, Default)]
45 51 #[serde(rename_all = "snake_case")]
46 52 pub enum CanaryPolicy {
@@ -0,0 +1,164 @@
1 + #!/usr/bin/env bash
2 + # Idempotent bootstrap for a fresh MNW node (tier A/B/C deploy target).
3 + #
4 + # Run on the new node as root. After this finishes, sandod on the Sando host
5 + # can rsync + deploy to <ssh_target>:/opt/mnw/.
6 + #
7 + # Required env:
8 + # SANDO_PUBKEY — sando user's public key on the Sando host. Get it via:
9 + # `ssh pop-os 'sudo cat /srv/sando/.ssh/id_ed25519.pub'`
10 + #
11 + # Optional env:
12 + # DEPLOY_ROOT — defaults to /opt/mnw
13 + # BIN_NAME — primary binary name (matches sando-daemon.toml's
14 + # bin_names[0]). Defaults to "makenotwork".
15 + # SERVICE_NAME — systemd unit name. Defaults to "makenotwork.service".
16 + # SERVICE_USER — runtime user for the binary. Defaults to "deploy".
17 + # ENABLE_FIREWALL — "1" to set up UFW (22/80/443). Defaults to "1".
18 + # INSTALL_CADDY — "1" to apt-install caddy (config is operator's job).
19 + # Defaults to "1".
20 + # INSTALL_POSTGRES — "1" to apt-install postgresql. Defaults to "1".
21 + # INSTALL_TAILSCALE — "1" to apt-install tailscale (NOT authenticated;
22 + # operator runs `tailscale up`). Defaults to "1".
23 + #
24 + # What this does NOT do (operator's job):
25 + # - tailscale up (auth)
26 + # - DNS records
27 + # - Caddyfile content + Cloudflare origin certs + private keys
28 + # - postgres role + db + .env / DATABASE_URL
29 + # - any secrets
30 +
31 + set -euo pipefail
32 +
33 + if [[ $EUID -ne 0 ]]; then
34 + echo "must run as root" >&2
35 + exit 1
36 + fi
37 + if [[ -z "${SANDO_PUBKEY:-}" ]]; then
38 + echo "SANDO_PUBKEY env var is required" >&2
39 + exit 1
40 + fi
41 +
42 + DEPLOY_ROOT="${DEPLOY_ROOT:-/opt/mnw}"
43 + BIN_NAME="${BIN_NAME:-makenotwork}"
44 + SERVICE_NAME="${SERVICE_NAME:-makenotwork.service}"
45 + SERVICE_USER="${SERVICE_USER:-deploy}"
46 + ENABLE_FIREWALL="${ENABLE_FIREWALL:-1}"
47 + INSTALL_CADDY="${INSTALL_CADDY:-1}"
48 + INSTALL_POSTGRES="${INSTALL_POSTGRES:-1}"
49 + INSTALL_TAILSCALE="${INSTALL_TAILSCALE:-1}"
50 +
51 + export DEBIAN_FRONTEND=noninteractive
52 +
53 + log() { echo "[bootstrap] $*"; }
54 +
55 + log "1/8 base packages"
56 + apt-get update -qq
57 + apt-get install -y -qq curl gnupg ca-certificates rsync ufw fail2ban > /dev/null
58 +
59 + if [[ "$INSTALL_POSTGRES" == "1" ]]; then
60 + log "2/8 postgresql"
61 + apt-get install -y -qq postgresql > /dev/null
62 + else
63 + log "2/8 skipping postgresql"
64 + fi
65 +
66 + if [[ "$INSTALL_TAILSCALE" == "1" ]]; then
67 + log "3/8 tailscale (not authenticating)"
68 + if ! command -v tailscale >/dev/null; then
69 + # Ubuntu codename. tailscale's repo is published per-codename;
70 + # noble (24.04) keys work on 24.04+ derivatives.
71 + codename=$(. /etc/os-release && echo "$VERSION_CODENAME")
72 + curl -fsSL "https://pkgs.tailscale.com/stable/ubuntu/${codename}.noarmor.gpg" \
73 + > /usr/share/keyrings/tailscale-archive-keyring.gpg
74 + curl -fsSL "https://pkgs.tailscale.com/stable/ubuntu/${codename}.tailscale-keyring.list" \
75 + > /etc/apt/sources.list.d/tailscale.list
76 + apt-get update -qq
77 + apt-get install -y -qq tailscale > /dev/null
78 + systemctl enable --now tailscaled
79 + fi
80 + else
81 + log "3/8 skipping tailscale"
82 + fi
83 +
84 + if [[ "$INSTALL_CADDY" == "1" ]]; then
85 + log "4/8 caddy (no Caddyfile — operator's job)"
86 + if ! command -v caddy >/dev/null; then
87 + curl -fsSL https://dl.cloudsmith.io/public/caddy/stable/gpg.key \
88 + | gpg --dearmor -o /usr/share/keyrings/caddy-stable-archive-keyring.gpg
89 + curl -fsSL https://dl.cloudsmith.io/public/caddy/stable/debian.deb.txt \
90 + > /etc/apt/sources.list.d/caddy-stable.list
91 + apt-get update -qq
92 + apt-get install -y -qq caddy > /dev/null
93 + fi
94 + else
95 + log "4/8 skipping caddy"
96 + fi
97 +
98 + log "5/8 deploy user + dirs"
99 + if ! id "$SERVICE_USER" &>/dev/null; then
100 + useradd -m -d "/home/$SERVICE_USER" -s /bin/bash "$SERVICE_USER"
101 + fi
102 + install -d -o "$SERVICE_USER" -g "$SERVICE_USER" -m 0700 "/home/$SERVICE_USER/.ssh"
103 + if ! grep -qF "$SANDO_PUBKEY" "/home/$SERVICE_USER/.ssh/authorized_keys" 2>/dev/null; then
104 + echo "$SANDO_PUBKEY" >> "/home/$SERVICE_USER/.ssh/authorized_keys"
105 + fi
106 + chown "$SERVICE_USER:$SERVICE_USER" "/home/$SERVICE_USER/.ssh/authorized_keys"
107 + chmod 0600 "/home/$SERVICE_USER/.ssh/authorized_keys"
108 + install -d -o "$SERVICE_USER" -g "$SERVICE_USER" -m 0755 "$DEPLOY_ROOT" "$DEPLOY_ROOT/releases"
109 +
110 + log "6/8 sudoers (systemctl on $SERVICE_NAME for $SERVICE_USER)"
111 + cat > "/etc/sudoers.d/${SERVICE_USER}-mnw" <<EOF
112 + $SERVICE_USER ALL=(ALL) NOPASSWD: /bin/systemctl reload-or-restart $SERVICE_NAME, /bin/systemctl restart $SERVICE_NAME, /bin/systemctl status $SERVICE_NAME
113 + EOF
114 + chmod 0440 "/etc/sudoers.d/${SERVICE_USER}-mnw"
115 + visudo -c -f "/etc/sudoers.d/${SERVICE_USER}-mnw" >/dev/null
116 +
117 + log "7/8 systemd unit ($SERVICE_NAME) — points at $DEPLOY_ROOT/current/$BIN_NAME"
118 + cat > "/etc/systemd/system/$SERVICE_NAME" <<EOF
119 + [Unit]
120 + Description=Makenotwork
121 + After=network.target
122 +
123 + [Service]
124 + Type=simple
125 + User=$SERVICE_USER
126 + Group=$SERVICE_USER
127 + WorkingDirectory=$DEPLOY_ROOT
128 + ExecStart=$DEPLOY_ROOT/current/$BIN_NAME
129 + EnvironmentFile=-$DEPLOY_ROOT/.env
130 + Restart=on-failure
131 + RestartSec=30
132 + # Exit 2 = migration failure (MNW server convention). Don't restart;
133 + # operator must intervene before the next deploy.
134 + RestartPreventExitStatus=2
135 + StandardOutput=journal
136 + StandardError=journal
137 + SyslogIdentifier=$BIN_NAME
138 +
139 + [Install]
140 + WantedBy=multi-user.target
141 + EOF
142 + systemctl daemon-reload
143 + systemctl enable "$SERVICE_NAME" >/dev/null 2>&1 || true
144 +
145 + if [[ "$ENABLE_FIREWALL" == "1" ]]; then
146 + log "8/8 firewall (UFW: 22/80/443 in, all else deny)"
147 + ufw --force reset > /dev/null
148 + ufw default deny incoming > /dev/null
149 + ufw default allow outgoing > /dev/null
150 + ufw allow 22/tcp > /dev/null
151 + ufw allow 80/tcp > /dev/null
152 + ufw allow 443/tcp > /dev/null
153 + ufw --force enable > /dev/null
154 + else
155 + log "8/8 skipping firewall"
156 + fi
157 +
158 + echo
159 + log "Done. Next steps for the operator:"
160 + echo " - tailscale up (auth this node to the tailnet)"
161 + echo " - DNS A/AAAA records for the domain you'll serve"
162 + echo " - Install /etc/caddy/Caddyfile + Cloudflare Origin CA cert + key"
163 + echo " - postgres: create role+db, drop secrets into $DEPLOY_ROOT/.env"
164 + echo " - Run a sando deploy from the Sando host: POST /promote/<tier>"
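Step 5/8 above appends `SANDO_PUBKEY` only when it is not already present, which is what keeps a second run of the script from duplicating the entry. A minimal sketch of that pattern as a standalone function, for trying the behavior against a scratch file. The function name `ensure_key_line` and the sample key are hypothetical and not part of the script.

```shell
set -eu

# ensure_key_line KEY FILE
# Append KEY to FILE unless an identical line is already there.
# -F treats the key as a fixed string, -x requires a whole-line match,
# and 2>/dev/null covers the case where FILE does not exist yet.
ensure_key_line() {
    if ! grep -qxF -- "$1" "$2" 2>/dev/null; then
        printf '%s\n' "$1" >> "$2"
    fi
}

# Usage against a throwaway file (hypothetical key material).
# The second call is a no-op, so the file keeps a single line.
scratch="$(mktemp)"
ensure_key_line "ssh-ed25519 AAAAexample sando@host" "$scratch"
ensure_key_line "ssh-ed25519 AAAAexample sando@host" "$scratch"
wc -l < "$scratch"
rm -f "$scratch"
```

Unlike the script's `grep -qF`, this sketch adds `-x`, so a key that happens to be a substring of another entry is not treated as already installed.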
@@ -5,5 +5,6 @@ listen = "100.103.89.95:7766" # pop-os tailnet IP; bind tailnet-only, not 0.0.
5 5 db_path = "/srv/sando/state/sando.db"
6 6 topology_path = "/etc/sando/sando.toml"
7 7 workdir = "/srv/sando/work"
8 - release_root = "/srv/sando/releases"
8 + release_root = "/srv/sando"
9 9 scratch_db_url = "postgres:///sando_scratch?host=/var/run/postgresql"
10 + bin_names = ["makenotwork", "mnw-admin"]
@@ -0,0 +1,18 @@
1 + # One-shot: tells sandod to pull the latest prod backup.
2 + # Paired with sandod-backup-fetch.timer for daily execution.
3 + #
4 + # Place at /etc/systemd/system/sandod-backup-fetch.service on the Sando host.
5 +
6 + [Unit]
7 + Description=Sando: fetch latest prod backup
8 + After=sandod.service network-online.target
9 + Wants=network-online.target
10 + Requires=sandod.service
11 +
12 + [Service]
13 + Type=oneshot
14 + # Reuse the same env file the daemon does — gives us $SANDO_DAEMON.
15 + EnvironmentFile=/etc/sando/sando.env
16 + ExecStart=/usr/bin/curl -fsS --max-time 600 -X POST ${SANDO_DAEMON}/backup/fetch
17 + # Service exits non-zero if the daemon refuses; fine — we want the timer to
18 + # log + retry on the next cycle. Don't restart aggressively.
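For a manual run outside the timer, the same request can be issued from a shell. A minimal sketch, assuming `SANDO_DAEMON` holds the daemon's base URL as it does in `/etc/sando/sando.env`. The helper names `backup_fetch_url` and `trigger_backup_fetch` are hypothetical.

```shell
# backup_fetch_url: print the endpoint the unit's ExecStart posts to.
# Returns non-zero if SANDO_DAEMON is unset or empty, and drops one
# trailing slash so the path is not doubled.
backup_fetch_url() {
    if [ -z "${SANDO_DAEMON:-}" ]; then
        echo "SANDO_DAEMON must be set (see /etc/sando/sando.env)" >&2
        return 1
    fi
    printf '%s/backup/fetch\n' "${SANDO_DAEMON%/}"
}

# trigger_backup_fetch: same curl flags as the unit. -f fails on HTTP
# errors, -sS is quiet but still reports errors, and --max-time caps
# the request at 600 seconds.
trigger_backup_fetch() {
    url="$(backup_fetch_url)" || return 1
    curl -fsS --max-time 600 -X POST "$url"
}
```

On the Sando host itself, `systemctl start sandod-backup-fetch.service` runs the unit directly and leaves the result in the journal, which is usually the simpler way to test it.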
@@ -0,0 +1,17 @@
1 + # Daily trigger for backup fetch. Prod's own backup-db.sh runs at 03:00 UTC;
2 + # we fetch at 04:00 UTC to leave headroom for offsite sync to complete first.
3 + #
4 + # Place at /etc/systemd/system/sandod-backup-fetch.timer on the Sando host.
5 + # Enable: systemctl enable --now sandod-backup-fetch.timer
6 +
7 + [Unit]
8 + Description=Sando: daily prod-backup fetch
9 +
10 + [Timer]
11 + OnCalendar=*-*-* 04:00:00 UTC
12 + # If the box was off when the timer fired, run on next boot.
13 + Persistent=true
14 + Unit=sandod-backup-fetch.service
15 +
16 + [Install]
17 + WantedBy=timers.target
@@ -0,0 +1,126 @@
1 + # Config artifacts vs binary artifacts
2 +
3 + Phase 3 design doc. Resolves: which of `deploy.sh`'s per-deploy actions sando absorbs, which move to one-time node-bootstrap, which sando explicitly skips.
4 +
5 + Status: draft. Decisions below are recommendations; checkboxes match `MNW/sando/todo.md` Phase 3.
6 +
7 + ## Inventory of `deploy.sh`'s actions
8 +
9 + | Action | Frequency | What it does |
10 + |------------------------------|----------------|----------------------------------------------------------------|
11 + | `build_binary` | per-deploy | cargo-zigbuild on macOS → x86_64 Linux musl/glibc |
12 + | `upload_config: Caddyfile` | per-deploy | scp `Caddyfile` → `/etc/caddy/Caddyfile`, `systemctl reload caddy` |
13 + | `upload_config: error-pages` | per-deploy | scp `error-pages/*.html` → `/opt/makenotwork/error-pages/` |
14 + | `upload_config: security` | per-deploy | scp `sshd-git.conf`, `fail2ban-sshd.conf`, `setup-firewall.sh` |
15 + | `upload_config: chmod` | per-deploy | chmod +x on setup-* scripts |
16 + | `upload_binary` | per-deploy | scp `makenotwork` + `mnw-admin` → `/opt/makenotwork/` |
17 + | `send_restart_warning` | per-deploy | POST `/api/internal/restart-warning` (30s notice), sleep 30s |
18 + | `restart_app` | per-deploy | `systemctl restart makenotwork`; curl 127.0.0.1:3000 to verify |
19 + | `sqlx migrate run` (implied) | startup | server runs migrations on startup in `main.rs:73` |
20 +
21 + ## Decision per item
22 +
23 + ### 1. Caddyfile — **bootstrap-only, not per-deploy**
24 +
25 + Caddy config is stable infrastructure. Most releases don't touch it. Per-deploy uploads couple binary version to config version unnecessarily and risk reload churn for unchanged config.
26 +
27 + - Node-bootstrap script installs `/etc/caddy/Caddyfile` once.
28 + - Updating Caddy config is an explicit operator action (`sando-cli push-caddy` or just `scp + systemctl reload caddy` manually), tracked but not per-release.
29 + - Revisit if Caddy config changes start landing >1x per sprint, then move to per-release artifact under `releases/<version>/Caddyfile` with a deploy hook.
30 +
31 + **Per-project alternative tracked:** if a Caddyfile change accompanies a binary change (rare), the operator must run the explicit Caddy-push step alongside `sando promote`.
32 +
33 + ### 2. error-pages — **bake into binary**
34 +
35 + Error pages version with code. They reference brand glyphs (diamond mark) and copy that drifts with the rest of the site.
36 +
37 + - Use `include_dir!` or `include_bytes!` to embed `server/deploy/error-pages/*.html` into the binary.
38 + - Update Caddy `handle_errors` blocks to point at an in-app fallback route (e.g. `/__errors/404.html`) instead of `/opt/makenotwork/error-pages/`. That route can serve the embedded HTML.
39 +
40 + Cost: small MNW server PR (separate from sando). Marks `deploy.sh upload_config: error-pages` step removable.
41 +
42 + Until that lands: ship error-pages as a sibling under `releases/<version>/error-pages/`. Caddy still reads from `/opt/makenotwork/error-pages/`, which is symlinked to `current/error-pages`. (Track A on testnot already has the `current` symlink working; just symlink error-pages in parallel.)
43 +
44 + ### 3. mnw-admin binary — **ship alongside server**
45 +
46 + `mnw-admin` is part of the release; deploy.sh uploads it. Sando should too.
47 +
48 + - Extend `cfg.bin_name: String` → `cfg.bin_names: Vec<String>` (e.g. `["makenotwork", "mnw-admin"]`).
49 + - `deploy_local` + `deploy_node` iterate over the list, rsyncing each to `releases/<version>/<bin>`.
50 + - Build step looks up each in `server/target/release/<bin>`.
51 +
52 + Default stays `["server"]` for backwards-compat with the existing example config.
53 +
54 + ### 4. systemd unit (`makenotwork.service`) — **bootstrap-only**
55 +
56 + The unit references `<release_root>/current/makenotwork`. Once installed, it doesn't change per release.
57 +
58 + - Node-bootstrap script installs `/etc/systemd/system/makenotwork.service`.
59 + - `deploy.sh`'s upload of the unit was a re-upload-every-time pattern. Sando does not re-upload it.
60 + - If the unit ever needs to change (e.g. resource limits, env file path), that's a one-shot operator action, not a per-deploy step.
61 +
62 + ### 5. Security configs (sshd-git, fail2ban, firewall) — **bootstrap-only**
63 +
64 + These are one-time host hardening. They have no release coupling.
65 +
66 + - Node-bootstrap script installs them on first provision.
67 + - Updates are out-of-band operator actions (or fold into a `sando push-config` later).
68 +
69 + ### 6. backup-db.sh — **bootstrap-only**
70 +
71 + Same as security configs. Backup script is host infrastructure, not release artifact.
72 +
73 + - Node-bootstrap installs `backup-db.sh` and its cron entry.
74 + - Updates out-of-band.
75 + - Bonus: backup-db.sh should be updated to (a) maintain `latest.sql.gz` hard link, (b) push to astra for true offsite — currently broken (see separate "offsite sync broken" ticket).
76 +
77 + ### 7. Restart warning — **defer to Phase 5; track for prod cutover**
78 +
79 + `deploy.sh` posts a 30s warning, sleeps 30s, then restarts. Sando does NOT yet do this.
80 +
81 + - For testnot (low traffic): skip. Service crash-loops invisibly enough.
82 + - For prod cutover: sando must implement this. Options:
83 + - **A**: Sando POSTs `/api/internal/restart-warning` itself, which requires exposing `CLI_SERVICE_TOKEN` to sando. The token would live in `/etc/sando/sando.env` on pop-os.
84 + - **B**: Sando exposes a `pre_deploy_hook` per-tier in `sando.toml` (shell command); operator decides.
85 + - Recommendation: **A** for prod tiers only (`tier.restart_warning_seconds = 30` in `sando.toml`). Tier A (testnot) leaves it unset = no warning.
86 +
87 + Phase 5 implementation, not blocking cutover-readiness.
88 +
89 + ### 8. Cross-compile from macOS — **retire**
90 +
91 + Pop-os is x86_64 Ubuntu-derived, prod is x86_64 Ubuntu 24.04. Sando builds natively. Cargo-zigbuild path goes away once sando is canonical.
92 +
93 + - Verify: take a recent prod binary (from `deploy.sh`'s build) and sando's binary for the same sha, compare runtime behavior across one full sprint of testnot use.
94 + - Once verified, mark `deploy.sh` archived and delete cargo-zigbuild from dev-machine setup notes.
95 +
96 + ### 9. Prod migrations — **server-self-applies on startup; sando does NOT**
97 +
98 + MNW server runs `sqlx::migrate!("./migrations").run(&db).await` in `main.rs:73` at startup. This means:
99 +
100 + - A new binary starting up applies any pending migrations against the live prod DB.
101 + - Sando does not need an explicit `POST /migrate/{tier}` endpoint.
102 + - The `migration_dry_run` gate's purpose is to catch migration FAILURE before the live binary tries to run them — that's the prod safety net.
103 + - Risk: a partially-applied migration (e.g. multi-statement, the 2026-05-22 incident class) can leave the DB in a broken state mid-startup. Sandboxing the migration via `migration_dry_run` catches this first; the live server then either succeeds or fails and crash-loops on the same migration sequence.
104 + - Open question: should sando refuse to promote if `migration_dry_run` flags the upcoming version as a destructive migration (drop+recreate column)? Phase 5+ enhancement.
105 +
106 + **Action:** none — current architecture is correct. Document this in `plans/migration-dryrun-failures.md` (Phase 2 follow-up).
107 +
108 + ## Net effect on `deploy.sh`
109 +
110 + | Step | Replaced by Sando | Moved to node-bootstrap | Retired |
111 + |---------------------|------------------------------|-------------------------|---------|
112 + | build_binary | yes (native on pop-os) | | |
113 + | upload_config | | yes (Caddyfile, etc.) | |
114 + | upload_binary | yes (+ mnw-admin) | | |
115 + | send_restart_warning| yes (Phase 5, prod tier only)| | |
116 + | restart_app | yes (reload-or-restart) | | |
117 +
118 + Once items 2-9 above land, `deploy.sh` becomes redundant and moves to `server/deploy/archive/`.
119 +
120 + ## Implementation order
121 +
122 + 1. **`bin_names: Vec<String>`** — small, unblocks mnw-admin shipping (#3).
123 + 2. **error-pages as release sibling + symlink** — small, unblocks #2 until bake-into-binary lands.
124 + 3. **node-bootstrap script** — folds Caddyfile (#1), unit (#4), security (#5), backup (#6) into one idempotent script. Already a Phase 1 carryover.
125 + 4. **Phase 5: restart_warning hook** — when prod cutover gets scheduled.
126 + 5. **Prod cutover sprint** — verify binary parity (#8), retire `deploy.sh` (#9 needs no action).
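Item 3's per-binary copy into `releases/<version>/<bin>`, followed by the `current` symlink the systemd unit points at, can be sketched in shell. This illustrates the on-disk layout only and is not sando's Rust implementation. The function name `stage_release`, its argument order, and the use of `install` and `ln -sfn` in place of rsync are all assumptions.

```shell
# stage_release BUILD_DIR RELEASE_ROOT VERSION BIN...
# Copies each named binary into RELEASE_ROOT/releases/VERSION/ and then
# repoints RELEASE_ROOT/current at that directory.
stage_release() {
    build_dir="$1"; release_root="$2"; version="$3"
    shift 3
    dest="$release_root/releases/$version"
    mkdir -p "$dest"
    for bin in "$@"; do
        # Bail out before the symlink swap if any binary is missing,
        # so `current` never points at a partial release.
        if [ ! -f "$build_dir/$bin" ]; then
            echo "missing binary: $build_dir/$bin" >&2
            return 1
        fi
        install -m 0755 "$build_dir/$bin" "$dest/$bin"
    done
    # Relative link: current -> releases/VERSION.
    # -n replaces the existing link instead of following it.
    # Note this is not atomic: ln removes and recreates the link.
    ln -sfn "releases/$version" "$release_root/current"
}
```

A call such as `stage_release server/target/release /opt/mnw v1 makenotwork mnw-admin` would mirror the `bin_names = ["makenotwork", "mnw-admin"]` config entry. A failed call can leave a partly filled `releases/<version>/` directory behind, but `current` keeps pointing at the previous release.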
M sando/todo.md +125 -26