max / makenotwork
26 files changed, +3204 insertions, -254 deletions
| @@ -1,6 +1,6 @@ | |||
| 1 | 1 | MIT License | |
| 2 | 2 | ||
| 3 | - | Copyright (c) 2026 Max Jacobson | |
| 3 | + | Copyright (c) 2026 Make Creative, LLC | |
| 4 | 4 | ||
| 5 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy | |
| 6 | 6 | of this software and associated documentation files (the "Software"), to deal |
| @@ -103,12 +103,29 @@ curl -X POST http://127.0.0.1:7766/promote/a \ | |||
| 103 | 103 | | Method | Path | Body | Purpose | | |
| 104 | 104 | |--------|------|------|---------| | |
| 105 | 105 | | GET | `/state` | — | Tier list + current/previous version + last gate outcomes | | |
| 106 | - | | POST | `/rebuild` | `{sha?: string}` | Force a build; if `sha` is absent, resolves the configured deploy branch | | |
| 107 | - | | POST | `/promote/{tier}` | `{version, hotfix?, reset_burn_in?}` | Verify predecessor gates, deploy to tier nodes, advance state | | |
| 106 | + | | POST | `/rebuild` | `{sha?: string}` | Force a build; if `sha` is absent, resolves the configured deploy branch. Aborts any in-flight build (latest wins). | | |
| 107 | + | | POST | `/promote/{tier}` | `{version?, hotfix?, reset_burn_in?}` | Verify predecessor gates, deploy to tier nodes, advance state. `version` defaults to the predecessor tier's `current_version`. | | |
| 108 | 108 | | POST | `/rollback/{tier}` | — | Swap `current` symlink to `previous_version` on every node in the tier | | |
| 109 | - | | POST | `/backup/fetch` | — | Pull the prod backup to `backup.local_path` (file:// or rsync://) | | |
| 109 | + | | POST | `/confirm/{tier}` | — | Insert a passing `manual_confirm` gate row for the tier's `current_version`. Replaces hand-SQL. | | |
| 110 | + | | POST | `/backup/fetch` | — | Pull the prod backup. Supports `file://`, `rsync://`, `ssh://user@host[:port]/path`. | | |
| 110 | 111 | | GET | `/metrics` | — | Prometheus exposition | | |
| 111 | - | | GET | `/events` | — | WebSocket stream of deploy + gate events (not yet implemented) | | |
| 112 | + | | GET | `/events` | — | WebSocket stream of typed events (RebuildRequested, BuildStart/Ok/Failed, GateStart/Done, DeployStart/Ok/Failed, PromoteComplete, Rollback, BackupFetched, ManualConfirm, BuildAborted). | | |
| 113 | + | ||
| 114 | + | ## TUI | |
| 115 | + | ||
| 116 | + | `sando` (the TUI binary) connects to `$SANDO_DAEMON` (default `http://127.0.0.1:7766`), polls `/state` every 2s, and subscribes to `/events` over WS. Keybindings: | |
| 117 | + | ||
| 118 | + | | key | action | | |
| 119 | + | |-----|--------| | |
| 120 | + | | ↑/↓ or j/k | select tier | | |
| 121 | + | | p | `POST /promote/<selected>` (no body — version defaults to predecessor's current) | | |
| 122 | + | | R | `POST /rollback/<selected>` | | |
| 123 | + | | b | `POST /backup/fetch` | | |
| 124 | + | | c | `POST /confirm/<selected>` | | |
| 125 | + | | r | refresh hint (the poller already runs every 2s) | |
| 126 | + | | q / Esc / Ctrl-C | quit | | |
| 127 | + | ||
| 128 | + | Action results show up in the events log a moment later (the actions themselves emit events from the daemon side). | |
| 112 | 129 | ||
| 113 | 130 | ## Hotfix flow | |
| 114 | 131 | ||
| @@ -123,14 +140,9 @@ curl -X POST http://127.0.0.1:7766/promote/a \ | |||
| 123 | 140 | ||
| 124 | 141 | ## v0 limitations | |
| 125 | 142 | ||
| 126 | - | - Remote deploys (real SSH/rsync) are stubbed. Use `ssh_target = "local"` and | |
| 127 | - | a local `release_root` for dev. Production wiring is a follow-up. | |
| 128 | 143 | - `migration_dry_run` requires a scratch Postgres at `scratch_db_url`. The | |
| 129 | - | gate drops and recreates `public` on every run; do not point this at | |
| 144 | + | gate drops every non-system schema on every run; do not point this at | |
| 130 | 145 | anything that matters. | |
| 131 | - | - `/events` WebSocket is not implemented; the TUI polls `/state` every 2s. | |
| 132 | - | - `manual_confirm` has no operator-facing trigger yet (you have to insert a | |
| 133 | - | `gate_runs` row with `passed=1` by hand to satisfy it). | |
| 134 | 146 | ||
| 135 | 147 | ## License | |
| 136 | 148 |
| @@ -185,6 +185,12 @@ source = "registry+https://github.com/rust-lang/crates.io-index" | |||
| 185 | 185 | checksum = "9330f8b2ff13f34540b44e946ef35111825727b38d33286ef986142615121801" | |
| 186 | 186 | ||
| 187 | 187 | [[package]] | |
| 188 | + | name = "cfg_aliases" | |
| 189 | + | version = "0.2.1" | |
| 190 | + | source = "registry+https://github.com/rust-lang/crates.io-index" | |
| 191 | + | checksum = "613afe47fcd5fac7ccf1db93babcb082c5994d996f20b8b159f2ad1658eb5724" | |
| 192 | + | ||
| 193 | + | [[package]] | |
| 188 | 194 | name = "chrono" | |
| 189 | 195 | version = "0.4.44" | |
| 190 | 196 | source = "registry+https://github.com/rust-lang/crates.io-index" | |
| @@ -382,6 +388,12 @@ dependencies = [ | |||
| 382 | 388 | ] | |
| 383 | 389 | ||
| 384 | 390 | [[package]] | |
| 391 | + | name = "fastrand" | |
| 392 | + | version = "2.4.1" | |
| 393 | + | source = "registry+https://github.com/rust-lang/crates.io-index" | |
| 394 | + | checksum = "9f1f227452a390804cdb637b74a86990f2a7d7ba4b7d5693aac9b4dd6defd8d6" | |
| 395 | + | ||
| 396 | + | [[package]] | |
| 385 | 397 | name = "find-msvc-tools" | |
| 386 | 398 | version = "0.1.9" | |
| 387 | 399 | source = "registry+https://github.com/rust-lang/crates.io-index" | |
| @@ -522,8 +534,10 @@ source = "registry+https://github.com/rust-lang/crates.io-index" | |||
| 522 | 534 | checksum = "ff2abc00be7fca6ebc474524697ae276ad847ad0a6b3faa4bcb027e9a4614ad0" | |
| 523 | 535 | dependencies = [ | |
| 524 | 536 | "cfg-if", | |
| 537 | + | "js-sys", | |
| 525 | 538 | "libc", | |
| 526 | 539 | "wasi", | |
| 540 | + | "wasm-bindgen", | |
| 527 | 541 | ] | |
| 528 | 542 | ||
| 529 | 543 | [[package]] | |
| @@ -533,9 +547,11 @@ source = "registry+https://github.com/rust-lang/crates.io-index" | |||
| 533 | 547 | checksum = "899def5c37c4fd7b2664648c28120ecec138e4d395b459e5ca34f9cce2dd77fd" | |
| 534 | 548 | dependencies = [ | |
| 535 | 549 | "cfg-if", | |
| 550 | + | "js-sys", | |
| 536 | 551 | "libc", | |
| 537 | 552 | "r-efi", | |
| 538 | 553 | "wasip2", | |
| 554 | + | "wasm-bindgen", | |
| 539 | 555 | ] | |
| 540 | 556 | ||
| 541 | 557 | [[package]] | |
| @@ -681,6 +697,23 @@ dependencies = [ | |||
| 681 | 697 | "pin-project-lite", | |
| 682 | 698 | "smallvec", | |
| 683 | 699 | "tokio", | |
| 700 | + | "want", | |
| 701 | + | ] | |
| 702 | + | ||
| 703 | + | [[package]] | |
| 704 | + | name = "hyper-rustls" | |
| 705 | + | version = "0.27.9" | |
| 706 | + | source = "registry+https://github.com/rust-lang/crates.io-index" | |
| 707 | + | checksum = "33ca68d021ef39cf6463ab54c1d0f5daf03377b70561305bb89a8f83aab66e0f" | |
| 708 | + | dependencies = [ | |
| 709 | + | "http", | |
| 710 | + | "hyper", | |
| 711 | + | "hyper-util", | |
| 712 | + | "rustls", | |
| 713 | + | "tokio", | |
| 714 | + | "tokio-rustls", | |
| 715 | + | "tower-service", | |
| 716 | + | "webpki-roots", | |
| 684 | 717 | ] | |
| 685 | 718 | ||
| 686 | 719 | [[package]] | |
| @@ -689,13 +722,21 @@ version = "0.1.20" | |||
| 689 | 722 | source = "registry+https://github.com/rust-lang/crates.io-index" | |
| 690 | 723 | checksum = "96547c2556ec9d12fb1578c4eaf448b04993e7fb79cbaad930a656880a6bdfa0" | |
| 691 | 724 | dependencies = [ | |
| 725 | + | "base64", | |
| 692 | 726 | "bytes", | |
| 727 | + | "futures-channel", | |
| 728 | + | "futures-util", | |
| 693 | 729 | "http", | |
| 694 | 730 | "http-body", | |
| 695 | 731 | "hyper", | |
| 732 | + | "ipnet", | |
| 733 | + | "libc", | |
| 734 | + | "percent-encoding", | |
| 696 | 735 | "pin-project-lite", | |
| 736 | + | "socket2", | |
| 697 | 737 | "tokio", | |
| 698 | 738 | "tower-service", | |
| 739 | + | "tracing", | |
| 699 | 740 | ] | |
| 700 | 741 | ||
| 701 | 742 | [[package]] | |
| @@ -836,6 +877,12 @@ dependencies = [ | |||
| 836 | 877 | ] | |
| 837 | 878 | ||
| 838 | 879 | [[package]] | |
| 880 | + | name = "ipnet" | |
| 881 | + | version = "2.12.0" | |
| 882 | + | source = "registry+https://github.com/rust-lang/crates.io-index" | |
| 883 | + | checksum = "d98f6fed1fde3f8c21bc40a1abb88dd75e67924f9cffc3ef95607bad8017f8e2" | |
| 884 | + | ||
| 885 | + | [[package]] | |
| 839 | 886 | name = "itoa" | |
| 840 | 887 | version = "1.0.18" | |
| 841 | 888 | source = "registry+https://github.com/rust-lang/crates.io-index" | |
| @@ -909,6 +956,12 @@ dependencies = [ | |||
| 909 | 956 | ] | |
| 910 | 957 | ||
| 911 | 958 | [[package]] | |
| 959 | + | name = "linux-raw-sys" | |
| 960 | + | version = "0.12.1" | |
| 961 | + | source = "registry+https://github.com/rust-lang/crates.io-index" | |
| 962 | + | checksum = "32a66949e030da00e8c7d4434b251670a91556f4144941d37452769c25d58a53" | |
| 963 | + | ||
| 964 | + | [[package]] | |
| 912 | 965 | name = "litemap" | |
| 913 | 966 | version = "0.8.2" | |
| 914 | 967 | source = "registry+https://github.com/rust-lang/crates.io-index" | |
| @@ -943,6 +996,12 @@ dependencies = [ | |||
| 943 | 996 | ] | |
| 944 | 997 | ||
| 945 | 998 | [[package]] | |
| 999 | + | name = "lru-slab" | |
| 1000 | + | version = "0.1.2" | |
| 1001 | + | source = "registry+https://github.com/rust-lang/crates.io-index" | |
| 1002 | + | checksum = "112b39cec0b298b6c1999fee3e31427f74f676e4cb9879ed1a121b43661a4154" | |
| 1003 | + | ||
| 1004 | + | [[package]] | |
| 946 | 1005 | name = "matchers" | |
| 947 | 1006 | version = "0.2.0" | |
| 948 | 1007 | source = "registry+https://github.com/rust-lang/crates.io-index" | |
| @@ -1225,6 +1284,61 @@ dependencies = [ | |||
| 1225 | 1284 | ] | |
| 1226 | 1285 | ||
| 1227 | 1286 | [[package]] | |
| 1287 | + | name = "quinn" | |
| 1288 | + | version = "0.11.9" | |
| 1289 | + | source = "registry+https://github.com/rust-lang/crates.io-index" | |
| 1290 | + | checksum = "b9e20a958963c291dc322d98411f541009df2ced7b5a4f2bd52337638cfccf20" | |
| 1291 | + | dependencies = [ | |
| 1292 | + | "bytes", | |
| 1293 | + | "cfg_aliases", | |
| 1294 | + | "pin-project-lite", | |
| 1295 | + | "quinn-proto", | |
| 1296 | + | "quinn-udp", | |
| 1297 | + | "rustc-hash", | |
| 1298 | + | "rustls", | |
| 1299 | + | "socket2", | |
| 1300 | + | "thiserror", | |
| 1301 | + | "tokio", | |
| 1302 | + | "tracing", | |
| 1303 | + | "web-time", | |
| 1304 | + | ] | |
| 1305 | + | ||
| 1306 | + | [[package]] | |
| 1307 | + | name = "quinn-proto" | |
| 1308 | + | version = "0.11.14" | |
| 1309 | + | source = "registry+https://github.com/rust-lang/crates.io-index" | |
| 1310 | + | checksum = "434b42fec591c96ef50e21e886936e66d3cc3f737104fdb9b737c40ffb94c098" | |
| 1311 | + | dependencies = [ | |
| 1312 | + | "bytes", | |
| 1313 | + | "getrandom 0.3.4", | |
| 1314 | + | "lru-slab", | |
| 1315 | + | "rand 0.9.4", | |
| 1316 | + | "ring", | |
| 1317 | + | "rustc-hash", | |
| 1318 | + | "rustls", | |
| 1319 | + | "rustls-pki-types", | |
| 1320 | + | "slab", | |
| 1321 | + | "thiserror", | |
| 1322 | + | "tinyvec", | |
| 1323 | + | "tracing", | |
| 1324 | + | "web-time", | |
| 1325 | + | ] | |
| 1326 | + | ||
| 1327 | + | [[package]] | |
| 1328 | + | name = "quinn-udp" | |
| 1329 | + | version = "0.5.14" | |
| 1330 | + | source = "registry+https://github.com/rust-lang/crates.io-index" | |
| 1331 | + | checksum = "addec6a0dcad8a8d96a771f815f0eaf55f9d1805756410b39f5fa81332574cbd" | |
| 1332 | + | dependencies = [ | |
| 1333 | + | "cfg_aliases", | |
| 1334 | + | "libc", | |
| 1335 | + | "once_cell", | |
| 1336 | + | "socket2", | |
| 1337 | + | "tracing", | |
| 1338 | + | "windows-sys 0.60.2", | |
| 1339 | + | ] | |
| 1340 | + | ||
| 1341 | + | [[package]] | |
| 1228 | 1342 | name = "quote" | |
| 1229 | 1343 | version = "1.0.45" | |
| 1230 | 1344 | source = "registry+https://github.com/rust-lang/crates.io-index" | |
| @@ -1361,6 +1475,58 @@ source = "registry+https://github.com/rust-lang/crates.io-index" | |||
| 1361 | 1475 | checksum = "dc897dd8d9e8bd1ed8cdad82b5966c3e0ecae09fb1907d58efaa013543185d0a" | |
| 1362 | 1476 | ||
| 1363 | 1477 | [[package]] | |
| 1478 | + | name = "reqwest" | |
| 1479 | + | version = "0.12.28" | |
| 1480 | + | source = "registry+https://github.com/rust-lang/crates.io-index" | |
| 1481 | + | checksum = "eddd3ca559203180a307f12d114c268abf583f59b03cb906fd0b3ff8646c1147" | |
| 1482 | + | dependencies = [ | |
| 1483 | + | "base64", | |
| 1484 | + | "bytes", | |
| 1485 | + | "futures-core", | |
| 1486 | + | "http", | |
| 1487 | + | "http-body", | |
| 1488 | + | "http-body-util", | |
| 1489 | + | "hyper", | |
| 1490 | + | "hyper-rustls", | |
| 1491 | + | "hyper-util", | |
| 1492 | + | "js-sys", | |
| 1493 | + | "log", | |
| 1494 | + | "percent-encoding", | |
| 1495 | + | "pin-project-lite", | |
| 1496 | + | "quinn", | |
| 1497 | + | "rustls", | |
| 1498 | + | "rustls-pki-types", | |
| 1499 | + | "serde", | |
| 1500 | + | "serde_json", | |
| 1501 | + | "serde_urlencoded", | |
| 1502 | + | "sync_wrapper", | |
| 1503 | + | "tokio", | |
| 1504 | + | "tokio-rustls", | |
| 1505 | + | "tower", | |
| 1506 | + | "tower-http", | |
| 1507 | + | "tower-service", | |
| 1508 | + | "url", | |
| 1509 | + | "wasm-bindgen", | |
| 1510 | + | "wasm-bindgen-futures", | |
| 1511 | + | "web-sys", | |
| 1512 | + | "webpki-roots", | |
| 1513 | + | ] | |
| 1514 | + | ||
| 1515 | + | [[package]] | |
| 1516 | + | name = "ring" | |
| 1517 | + | version = "0.17.14" | |
| 1518 | + | source = "registry+https://github.com/rust-lang/crates.io-index" | |
| 1519 | + | checksum = "a4689e6c2294d81e88dc6261c768b63bc4fcdb852be6d1352498b114f61383b7" | |
| 1520 | + | dependencies = [ | |
| 1521 | + | "cc", | |
| 1522 | + | "cfg-if", | |
| 1523 | + | "getrandom 0.2.17", | |
| 1524 | + | "libc", | |
| 1525 | + | "untrusted", | |
| 1526 | + | "windows-sys 0.52.0", | |
| 1527 | + | ] | |
| 1528 | + | ||
| 1529 | + | [[package]] | |
| 1364 | 1530 | name = "rsa" | |
| 1365 | 1531 | version = "0.9.10" | |
| 1366 | 1532 | source = "registry+https://github.com/rust-lang/crates.io-index" | |
| @@ -1381,6 +1547,60 @@ dependencies = [ | |||
| 1381 | 1547 | ] | |
| 1382 | 1548 | ||
| 1383 | 1549 | [[package]] | |
| 1550 | + | name = "rustc-hash" | |
| 1551 | + | version = "2.1.2" | |
| 1552 | + | source = "registry+https://github.com/rust-lang/crates.io-index" | |
| 1553 | + | checksum = "94300abf3f1ae2e2b8ffb7b58043de3d399c73fa6f4b73826402a5c457614dbe" | |
| 1554 | + | ||
| 1555 | + | [[package]] | |
| 1556 | + | name = "rustix" | |
| 1557 | + | version = "1.1.4" | |
| 1558 | + | source = "registry+https://github.com/rust-lang/crates.io-index" | |
| 1559 | + | checksum = "b6fe4565b9518b83ef4f91bb47ce29620ca828bd32cb7e408f0062e9930ba190" | |
| 1560 | + | dependencies = [ | |
| 1561 | + | "bitflags", | |
| 1562 | + | "errno", | |
| 1563 | + | "libc", | |
| 1564 | + | "linux-raw-sys", | |
| 1565 | + | "windows-sys 0.61.2", | |
| 1566 | + | ] | |
| 1567 | + | ||
| 1568 | + | [[package]] | |
| 1569 | + | name = "rustls" | |
| 1570 | + | version = "0.23.40" | |
| 1571 | + | source = "registry+https://github.com/rust-lang/crates.io-index" | |
| 1572 | + | checksum = "ef86cd5876211988985292b91c96a8f2d298df24e75989a43a3c73f2d4d8168b" | |
| 1573 | + | dependencies = [ | |
| 1574 | + | "once_cell", | |
| 1575 | + | "ring", | |
| 1576 | + | "rustls-pki-types", | |
| 1577 | + | "rustls-webpki", | |
| 1578 | + | "subtle", | |
| 1579 | + | "zeroize", | |
| 1580 | + | ] | |
| 1581 | + | ||
| 1582 | + | [[package]] | |
| 1583 | + | name = "rustls-pki-types" | |
| 1584 | + | version = "1.14.1" | |
| 1585 | + | source = "registry+https://github.com/rust-lang/crates.io-index" | |
| 1586 | + | checksum = "30a7197ae7eb376e574fe940d068c30fe0462554a3ddbe4eca7838e049c937a9" | |
| 1587 | + | dependencies = [ | |
| 1588 | + | "web-time", | |
| 1589 | + | "zeroize", | |
| 1590 | + | ] | |
| 1591 | + | ||
| 1592 | + | [[package]] | |
| 1593 | + | name = "rustls-webpki" | |
| 1594 | + | version = "0.103.13" | |
| 1595 | + | source = "registry+https://github.com/rust-lang/crates.io-index" | |
| 1596 | + | checksum = "61c429a8649f110dddef65e2a5ad240f747e85f7758a6bccc7e5777bd33f756e" | |
| 1597 | + | dependencies = [ | |
| 1598 | + | "ring", | |
| 1599 | + | "rustls-pki-types", | |
| 1600 | + | "untrusted", | |
| 1601 | + | ] | |
| 1602 | + | ||
| 1603 | + | [[package]] | |
| 1384 | 1604 | name = "rustversion" | |
| 1385 | 1605 | version = "1.0.22" | |
| 1386 | 1606 | source = "registry+https://github.com/rust-lang/crates.io-index" | |
| @@ -1399,14 +1619,18 @@ dependencies = [ | |||
| 1399 | 1619 | "anyhow", | |
| 1400 | 1620 | "axum", | |
| 1401 | 1621 | "chrono", | |
| 1622 | + | "http-body-util", | |
| 1402 | 1623 | "metrics", | |
| 1403 | 1624 | "metrics-exporter-prometheus", | |
| 1625 | + | "reqwest", | |
| 1404 | 1626 | "serde", | |
| 1405 | 1627 | "serde_json", | |
| 1406 | 1628 | "sqlx", | |
| 1629 | + | "tempfile", | |
| 1407 | 1630 | "thiserror", | |
| 1408 | 1631 | "tokio", | |
| 1409 | 1632 | "toml", | |
| 1633 | + | "tower", | |
| 1410 | 1634 | "tracing", | |
| 1411 | 1635 | "tracing-subscriber", | |
| 1412 | 1636 | ] | |
| @@ -1836,6 +2060,9 @@ name = "sync_wrapper" | |||
| 1836 | 2060 | version = "1.0.2" | |
| 1837 | 2061 | source = "registry+https://github.com/rust-lang/crates.io-index" | |
| 1838 | 2062 | checksum = "0bf256ce5efdfa370213c1dabab5935a12e49f2c58d15e9eac2870d3b4f27263" | |
| 2063 | + | dependencies = [ | |
| 2064 | + | "futures-core", | |
| 2065 | + | ] | |
| 1839 | 2066 | ||
| 1840 | 2067 | [[package]] | |
| 1841 | 2068 | name = "synstructure" | |
| @@ -1849,6 +2076,19 @@ dependencies = [ | |||
| 1849 | 2076 | ] | |
| 1850 | 2077 | ||
| 1851 | 2078 | [[package]] | |
| 2079 | + | name = "tempfile" | |
| 2080 | + | version = "3.27.0" | |
| 2081 | + | source = "registry+https://github.com/rust-lang/crates.io-index" | |
| 2082 | + | checksum = "32497e9a4c7b38532efcdebeef879707aa9f794296a4f0244f6f69e9bc8574bd" | |
| 2083 | + | dependencies = [ | |
| 2084 | + | "fastrand", | |
| 2085 | + | "getrandom 0.3.4", | |
| 2086 | + | "once_cell", | |
| 2087 | + | "rustix", | |
| 2088 | + | "windows-sys 0.61.2", | |
| 2089 | + | ] | |
| 2090 | + | ||
| 2091 | + | [[package]] | |
| 1852 | 2092 | name = "thiserror" | |
| 1853 | 2093 | version = "2.0.18" | |
| 1854 | 2094 | source = "registry+https://github.com/rust-lang/crates.io-index" | |
| @@ -1930,6 +2170,16 @@ dependencies = [ | |||
| 1930 | 2170 | ] | |
| 1931 | 2171 | ||
| 1932 | 2172 | [[package]] | |
| 2173 | + | name = "tokio-rustls" | |
| 2174 | + | version = "0.26.4" | |
| 2175 | + | source = "registry+https://github.com/rust-lang/crates.io-index" | |
| 2176 | + | checksum = "1729aa945f29d91ba541258c8df89027d5792d85a8841fb65e8bf0f4ede4ef61" | |
| 2177 | + | dependencies = [ | |
| 2178 | + | "rustls", | |
| 2179 | + | "tokio", | |
| 2180 | + | ] | |
| 2181 | + | ||
| 2182 | + | [[package]] | |
| 1933 | 2183 | name = "tokio-stream" | |
| 1934 | 2184 | version = "0.1.18" | |
| 1935 | 2185 | source = "registry+https://github.com/rust-lang/crates.io-index" | |
| @@ -2010,6 +2260,24 @@ dependencies = [ | |||
| 2010 | 2260 | ] | |
| 2011 | 2261 | ||
| 2012 | 2262 | [[package]] | |
| 2263 | + | name = "tower-http" | |
| 2264 | + | version = "0.6.11" | |
| 2265 | + | source = "registry+https://github.com/rust-lang/crates.io-index" | |
| 2266 | + | checksum = "4cfcf7e2740e6fc6d4d688b4ef00650406bb94adf4731e43c096c3a19fe40840" | |
| 2267 | + | dependencies = [ | |
| 2268 | + | "bitflags", | |
| 2269 | + | "bytes", | |
| 2270 | + | "futures-util", | |
| 2271 | + | "http", | |
| 2272 | + | "http-body", | |
| 2273 | + | "pin-project-lite", | |
| 2274 | + | "tower", | |
| 2275 | + | "tower-layer", | |
| 2276 | + | "tower-service", | |
| 2277 | + | "url", | |
| 2278 | + | ] | |
| 2279 | + | ||
| 2280 | + | [[package]] | |
| 2013 | 2281 | name = "tower-layer" | |
| 2014 | 2282 | version = "0.3.3" | |
| 2015 | 2283 | source = "registry+https://github.com/rust-lang/crates.io-index" | |
| @@ -2097,6 +2365,12 @@ dependencies = [ | |||
| 2097 | 2365 | ] | |
| 2098 | 2366 | ||
| 2099 | 2367 | [[package]] | |
| 2368 | + | name = "try-lock" | |
| 2369 | + | version = "0.2.5" | |
| 2370 | + | source = "registry+https://github.com/rust-lang/crates.io-index" | |
| 2371 | + | checksum = "e421abadd41a4225275504ea4d6566923418b7f05506fbc9c0fe86ba7396114b" | |
| 2372 | + | ||
| 2373 | + | [[package]] | |
| 2100 | 2374 | name = "tungstenite" | |
| 2101 | 2375 | version = "0.29.0" | |
| 2102 | 2376 | source = "registry+https://github.com/rust-lang/crates.io-index" | |
| @@ -2146,6 +2420,12 @@ source = "registry+https://github.com/rust-lang/crates.io-index" | |||
| 2146 | 2420 | checksum = "7df058c713841ad818f1dc5d3fd88063241cc61f49f5fbea4b951e8cf5a8d71d" | |
| 2147 | 2421 | ||
| 2148 | 2422 | [[package]] | |
| 2423 | + | name = "untrusted" | |
| 2424 | + | version = "0.9.0" | |
| 2425 | + | source = "registry+https://github.com/rust-lang/crates.io-index" | |
| 2426 | + | checksum = "8ecb6da28b8a351d773b68d5825ac39017e680750f980f3a1a85cd8dd28a47c1" | |
| 2427 | + | ||
| 2428 | + | [[package]] | |
| 2149 | 2429 | name = "url" | |
| 2150 | 2430 | version = "2.5.8" | |
| 2151 | 2431 | source = "registry+https://github.com/rust-lang/crates.io-index" | |
| @@ -2182,6 +2462,15 @@ source = "registry+https://github.com/rust-lang/crates.io-index" | |||
| 2182 | 2462 | checksum = "0b928f33d975fc6ad9f86c8f283853ad26bdd5b10b7f1542aa2fa15e2289105a" | |
| 2183 | 2463 | ||
| 2184 | 2464 | [[package]] | |
| 2465 | + | name = "want" | |
| 2466 | + | version = "0.3.1" | |
| 2467 | + | source = "registry+https://github.com/rust-lang/crates.io-index" | |
| 2468 | + | checksum = "bfa7760aed19e106de2c7c0b581b509f2f25d3dacaf737cb82ac61bc6d760b0e" | |
| 2469 | + | dependencies = [ | |
| 2470 | + | "try-lock", | |
| 2471 | + | ] | |
| 2472 | + | ||
| 2473 | + | [[package]] | |
| 2185 | 2474 | name = "wasi" | |
| 2186 | 2475 | version = "0.11.1+wasi-snapshot-preview1" | |
| 2187 | 2476 | source = "registry+https://github.com/rust-lang/crates.io-index" | |
| @@ -2216,6 +2505,16 @@ dependencies = [ | |||
| 2216 | 2505 | ] | |
| 2217 | 2506 | ||
| 2218 | 2507 | [[package]] | |
| 2508 | + | name = "wasm-bindgen-futures" | |
| 2509 | + | version = "0.4.72" | |
| 2510 | + | source = "registry+https://github.com/rust-lang/crates.io-index" | |
| 2511 | + | checksum = "9473dbd2991ae90b6291c3c32c30c6187ac49aa32f9905d1cce280ec1e110b0f" | |
| 2512 | + | dependencies = [ | |
| 2513 | + | "js-sys", | |
| 2514 | + | "wasm-bindgen", | |
| 2515 | + | ] | |
| 2516 | + | ||
| 2517 | + | [[package]] | |
| 2219 | 2518 | name = "wasm-bindgen-macro" | |
| 2220 | 2519 | version = "0.2.122" | |
| 2221 | 2520 | source = "registry+https://github.com/rust-lang/crates.io-index" | |
| @@ -2258,6 +2557,25 @@ dependencies = [ | |||
| 2258 | 2557 | ] | |
| 2259 | 2558 | ||
| 2260 | 2559 | [[package]] | |
| 2560 | + | name = "web-time" | |
| 2561 | + | version = "1.1.0" | |
| 2562 | + | source = "registry+https://github.com/rust-lang/crates.io-index" | |
| 2563 | + | checksum = "5a6580f308b1fad9207618087a65c04e7a10bc77e02c8e84e9b00dd4b12fa0bb" | |
| 2564 | + | dependencies = [ | |
| 2565 | + | "js-sys", | |
| 2566 | + | "wasm-bindgen", | |
| 2567 | + | ] | |
| 2568 | + | ||
| 2569 | + | [[package]] | |
| 2570 | + | name = "webpki-roots" | |
| 2571 | + | version = "1.0.7" | |
| 2572 | + | source = "registry+https://github.com/rust-lang/crates.io-index" | |
| 2573 | + | checksum = "52f5ee44c96cf55f1b349600768e3ece3a8f26010c05265ab73f945bb1a2eb9d" | |
| 2574 | + | dependencies = [ | |
| 2575 | + | "rustls-pki-types", | |
| 2576 | + | ] | |
| 2577 | + | ||
| 2578 | + | [[package]] | |
| 2261 | 2579 | name = "whoami" | |
| 2262 | 2580 | version = "1.6.1" | |
| 2263 | 2581 | source = "registry+https://github.com/rust-lang/crates.io-index" | |
| @@ -2354,7 +2672,25 @@ version = "0.48.0" | |||
| 2354 | 2672 | source = "registry+https://github.com/rust-lang/crates.io-index" | |
| 2355 | 2673 | checksum = "677d2418bec65e3338edb076e806bc1ec15693c5d0104683f2efe857f61056a9" | |
| 2356 | 2674 | dependencies = [ | |
| 2357 | - | "windows-targets", | |
| 2675 | + | "windows-targets 0.48.5", | |
| 2676 | + | ] | |
| 2677 | + | ||
| 2678 | + | [[package]] | |
| 2679 | + | name = "windows-sys" | |
| 2680 | + | version = "0.52.0" | |
| 2681 | + | source = "registry+https://github.com/rust-lang/crates.io-index" | |
| 2682 | + | checksum = "282be5f36a8ce781fad8c8ae18fa3f9beff57ec1b52cb3de0789201425d9a33d" | |
| 2683 | + | dependencies = [ | |
| 2684 | + | "windows-targets 0.52.6", | |
| 2685 | + | ] | |
| 2686 | + | ||
| 2687 | + | [[package]] | |
| 2688 | + | name = "windows-sys" | |
| 2689 | + | version = "0.60.2" | |
| 2690 | + | source = "registry+https://github.com/rust-lang/crates.io-index" | |
| 2691 | + | checksum = "f2f500e4d28234f72040990ec9d39e3a6b950f9f22d3dba18416c35882612bcb" | |
| 2692 | + | dependencies = [ | |
| 2693 | + | "windows-targets 0.53.5", | |
| 2358 | 2694 | ] | |
| 2359 | 2695 | ||
| 2360 | 2696 | [[package]] | |
| @@ -2372,13 +2708,46 @@ version = "0.48.5" | |||
| 2372 | 2708 | source = "registry+https://github.com/rust-lang/crates.io-index" | |
| 2373 | 2709 | checksum = "9a2fa6e2155d7247be68c096456083145c183cbbbc2764150dda45a87197940c" | |
| 2374 | 2710 | dependencies = [ | |
| 2375 | - | "windows_aarch64_gnullvm", |
Lines truncated
| @@ -22,3 +22,9 @@ metrics-exporter-prometheus = { version = "0.18.1", default-features = false } | |||
| 22 | 22 | anyhow = "1.0.102" | |
| 23 | 23 | thiserror = "2.0.18" | |
| 24 | 24 | chrono = { version = "0.4", features = ["serde"] } | |
| 25 | + | ||
| 26 | + | [dev-dependencies] | |
| 27 | + | tempfile = "3.20" | |
| 28 | + | tower = { version = "0.5", features = ["util"] } | |
| 29 | + | http-body-util = "0.1" | |
| 30 | + | reqwest = { version = "0.12", default-features = false, features = ["json", "rustls-tls"] } |
| @@ -1,9 +1,11 @@ | |||
| 1 | 1 | //! Fetch the prod backup that `migration_dry_run` runs against. | |
| 2 | 2 | //! | |
| 3 | - | //! Sources supported in v0: | |
| 4 | - | //! - `file:///abs/path/to/dump.sql.gz` — local copy. Used for localhost dev. | |
| 5 | - | //! - `rsync://host/module/path` — shells out to `rsync`. Used when MM | |
| 6 | - | //! pulls from an astra/Hetzner replica. | |
| 3 | + | //! Sources supported: | |
| 4 | + | //! - `file:///abs/path/to/dump.sql.gz` — local copy (dev). | |
| 5 | + | //! - `rsync://host/module/path` — rsync daemon protocol. | |
| 6 | + | //! - `ssh://user@host[:port]/path/file.sql.gz` — rsync-over-ssh. Used to pull | |
| 7 | + | //! prod backups from | |
| 8 | + | //! `backup-puller@alpha-west-1`. | |
| 7 | 9 | //! | |
| 8 | 10 | //! The fetch is command-driven: the operator triggers it via /backup/fetch, it | |
| 9 | 11 | //! is not implicit in promote. That keeps the slowest, most failure-prone step | |
| @@ -11,7 +13,7 @@ | |||
| 11 | 13 | ||
| 12 | 14 | use crate::config::Config; | |
| 13 | 15 | use crate::topology::Topology; | |
| 14 | - | use anyhow::{Context, Result}; | |
| 16 | + | use anyhow::{Context, Result, bail}; | |
| 15 | 17 | use chrono::Utc; | |
| 16 | 18 | use sqlx::SqlitePool; | |
| 17 | 19 | use std::path::Path; | |
| @@ -25,6 +27,61 @@ pub struct FetchedBackup { | |||
| 25 | 27 | pub byte_size: Option<i64>, | |
| 26 | 28 | } | |
| 27 | 29 | ||
| 30 | + | /// Parsed `backup.source` URL. Owned strings so the parsed form outlives the | |
| 31 | + | /// (possibly transient) URL we read from config. | |
| 32 | + | #[derive(Debug, Clone, PartialEq, Eq)] | |
| 33 | + | pub(crate) enum BackupSource { | |
| 34 | + |     /// Local file copy. Path follows the `file://` prefix. | |
| 35 | + |     File { path: String }, | |
| 36 | + |     /// rsync daemon protocol. Full URL stays intact (rsync handles it). | |
| 37 | + |     RsyncDaemon { url: String }, | |
| 38 | + |     /// rsync-over-ssh. Port is optional. | |
| 39 | + |     Ssh { | |
| 40 | + |         user_host: String, | |
| 41 | + |         port: Option<u16>, | |
| 42 | + |         path: String, | |
| 43 | + |     }, | |
| 44 | + | } | |
| 45 | + | ||
| 46 | + | /// Parse a `backup.source` URL into a `BackupSource`. Rejects unsupported | |
| 47 | + | /// schemes, an empty `file://` path, and `ssh://` URLs with no host or path. | |
| 48 | + | pub(crate) fn parse_source(s: &str) -> Result<BackupSource> { | |
| 49 | + |     if let Some(rest) = s.strip_prefix("file://") { | |
| 50 | + |         if rest.is_empty() { | |
| 51 | + |             bail!("file:// URL is missing a path: {s}"); | |
| 52 | + |         } | |
| 53 | + |         return Ok(BackupSource::File { path: rest.into() }); | |
| 54 | + |     } | |
| 55 | + |     if s.starts_with("rsync://") { | |
| 56 | + |         return Ok(BackupSource::RsyncDaemon { url: s.into() }); | |
| 57 | + |     } | |
| 58 | + |     if let Some(rest) = s.strip_prefix("ssh://") { | |
| 59 | + |         let (user_host_port, path_rest) = rest | |
| 60 | + |             .split_once('/') | |
| 61 | + |             .with_context(|| format!("ssh:// URL missing path: {s}"))?; | |
| 62 | + |         if user_host_port.is_empty() { | |
| 63 | + |             bail!("ssh:// URL missing user@host: {s}"); | |
| 64 | + |         } | |
| 65 | + |         let path = format!("/{path_rest}"); | |
| 66 | + |         let (user_host, port) = match user_host_port.rsplit_once(':') { | |
| 67 | + |             Some((uh, p)) => { | |
| 68 | + |                 // Heuristic: a trailing `:N` where N parses as u16 is the port. | |
| 69 | + |                 // Anything else (bracketed IPv6 literal, etc.) gets left alone. | |
| 70 | + |                 match p.parse::<u16>() { | |
| 71 | + |                     Ok(n) => (uh.to_string(), Some(n)), | |
| 72 | + |                     Err(_) => (user_host_port.to_string(), None), | |
| 73 | + |                 } | |
| 74 | + |             } | |
| 75 | + |             None => (user_host_port.to_string(), None), | |
| 76 | + |         }; | |
| 77 | + |         if user_host.is_empty() { | |
| 78 | + |             bail!("ssh:// URL has empty host (port {:?})", port); | |
| 79 | + |         } | |
| 80 | + |         return Ok(BackupSource::Ssh { user_host, port, path }); | |
| 81 | + |     } | |
| 82 | + |     bail!("unsupported backup source scheme: {s}"); | |
| 83 | + | } | |
| 84 | + | ||
| 28 | 85 | pub async fn fetch( | |
| 29 | 86 | pool: &SqlitePool, | |
| 30 | 87 | _cfg: &Arc<Config>, | |
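
The port heuristic in `parse_source` is easier to judge with a few concrete inputs. The sketch below reproduces only the `rsplit_once(':')` step, using the standard library and no `anyhow`. `split_port` is a hypothetical helper name, not a function from the diff, and the sample hosts other than `backup-puller@alpha-west-1` are made up for illustration.

```rust
/// Split `user@host[:port]` into the host part and an optional port.
/// Same heuristic as the diff: only a trailing `:N` that parses as `u16`
/// is treated as a port; otherwise the input is kept whole.
fn split_port(user_host_port: &str) -> (String, Option<u16>) {
    match user_host_port.rsplit_once(':') {
        Some((uh, p)) => match p.parse::<u16>() {
            Ok(n) => (uh.to_string(), Some(n)),
            Err(_) => (user_host_port.to_string(), None),
        },
        None => (user_host_port.to_string(), None),
    }
}

fn main() {
    let inputs = [
        "backup-puller@alpha-west-1",
        "backup-puller@alpha-west-1:2222",
        "user@[::1]",    // bracketed IPv6, no port
        "user@[::1]:22", // bracketed IPv6 with port
        "user@::1",      // unbracketed IPv6: the final group is read as a port
    ];
    for input in inputs {
        println!("{input:?} -> {:?}", split_port(input));
    }
}
```

Bracketed IPv6 literals come through intact because `1]` does not parse as a number. An unbracketed literal such as `user@::1` does not: its last group is taken as the port, so such hosts would need brackets.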
| @@ -37,23 +94,45 @@ pub async fn fetch( | |||
| 37 | 94 | tokio::fs::create_dir_all(parent).await?; | |
| 38 | 95 | } | |
| 39 | 96 | ||
| 40 | - | if let Some(rest) = source.strip_prefix("file://") { | |
| 41 | - | tokio::fs::copy(rest, &local_path) | |
| 42 | - | .await | |
| 43 | - | .with_context(|| format!("copy {rest} -> {local_path}"))?; | |
| 44 | - | } else if source.starts_with("rsync://") { | |
| 45 | - | let out = Command::new("rsync") | |
| 46 | - | .args(["-az", "--inplace", &source, &local_path]) | |
| 47 | - | .output() | |
| 48 | - | .await | |
| 49 | - | .context("spawning rsync")?; | |
| 50 | - | anyhow::ensure!( | |
| 51 | - | out.status.success(), | |
| 52 | - | "rsync failed: {}", | |
| 53 | - | String::from_utf8_lossy(&out.stderr), | |
| 54 | - | ); | |
| 55 | - | } else { | |
| 56 | - | anyhow::bail!("unsupported backup source scheme: {source}"); | |
| 97 | + |     let parsed = parse_source(&source)?; | |
| 98 | + |     match parsed { | |
| 99 | + |         BackupSource::File { path } => { | |
| 100 | + |             tokio::fs::copy(&path, &local_path) | |
| 101 | + |                 .await | |
| 102 | + |                 .with_context(|| format!("copy {path} -> {local_path}"))?; | |
| 103 | + |         } | |
| 104 | + |         BackupSource::RsyncDaemon { url } => { | |
| 105 | + |             let out = Command::new("rsync") | |
| 106 | + |                 .args(["-az", "--inplace", &url, &local_path]) | |
| 107 | + |                 .output() | |
| 108 | + |                 .await | |
| 109 | + |                 .context("spawning rsync")?; | |
| 110 | + |             anyhow::ensure!( | |
| 111 | + |                 out.status.success(), | |
| 112 | + |                 "rsync (daemon) failed: {}", | |
| 113 | + |                 String::from_utf8_lossy(&out.stderr), | |
| 114 | + |             ); | |
| 115 | + |         } | |
| 116 | + |         BackupSource::Ssh { user_host, port, path } => { | |
| 117 | + |             let ssh_cmd = match port { | |
| 118 | + |                 Some(p) => format!("ssh -p {p} -o BatchMode=yes -o StrictHostKeyChecking=accept-new"), | |
| 119 | + |                 None => "ssh -o BatchMode=yes -o StrictHostKeyChecking=accept-new".into(), | |
| 120 | + |             }; | |
| 121 | + |             let remote = format!("{user_host}:{path}"); | |
| 122 | + |             let out = Command::new("rsync") | |
| 123 | + |                 .args(["-a", "--partial"]) | |
| 124 | + |                 .arg("-e").arg(&ssh_cmd) | |
| 125 | + |                 .arg(&remote) | |
| 126 | + |                 .arg(&local_path) | |
| 127 | + |                 .output() | |
| 128 | + |                 .await | |
| 129 | + |                 .context("spawning rsync")?; | |
| 130 | + |             anyhow::ensure!( | |
| 131 | + |                 out.status.success(), | |
| 132 | + |                 "rsync (ssh) failed: {}", | |
| 133 | + |                 String::from_utf8_lossy(&out.stderr), | |
| 134 | + |             );  | |
| 135 | + |         } | |
| 57 | 136 |     } | |
| 58 | 137 | ||
| 59 | 138 | let meta = tokio::fs::metadata(&local_path).await?; | |
| @@ -69,5 +148,106 @@ pub async fn fetch( | |||
| 69 | 148 | .execute(pool) | |
| 70 | 149 | .await?; | |
| 71 | 150 | ||
| 151 | + | // Retention: prune rows fetched more than 30 days ago. The on-disk file | |
| 152 | + | // is overwritten each fetch (single `local_path`), so old rows reference | |
| 153 | + | // a path that no longer exists — keep the table from growing for no | |
| 154 | + | // good reason. | |
| 155 | + | sqlx::query( | |
| 156 | + | "DELETE FROM backups WHERE fetched_at < datetime('now', '-30 days')", | |
| 157 | + | ) | |
| 158 | + | .execute(pool) | |
| 159 | + | .await?; | |
| 160 | + | ||
| 72 | 161 | Ok(FetchedBackup { source, local_path, byte_size: Some(size) }) | |
| 73 | 162 | } | |
| 163 | + | ||
| 164 | + | #[cfg(test)] | |
| 165 | + | mod tests { | |
| 166 | + | use super::*; | |
| 167 | + | ||
| 168 | + | #[test] | |
| 169 | + | fn parses_file_url() { | |
| 170 | + | let s = parse_source("file:///opt/backups/latest.sql.gz").unwrap(); | |
| 171 | + | assert_eq!(s, BackupSource::File { path: "/opt/backups/latest.sql.gz".into() }); | |
| 172 | + | } | |
| 173 | + | ||
| 174 | + | #[test] | |
| 175 | + | fn file_url_without_path_errors() { | |
| 176 | + | assert!(parse_source("file://").is_err()); | |
| 177 | + | } | |
| 178 | + | ||
| 179 | + | #[test] | |
| 180 | + | fn parses_rsync_daemon_url() { | |
| 181 | + | let s = parse_source("rsync://astra/mnw/latest.sql.gz").unwrap(); | |
| 182 | + | assert_eq!(s, BackupSource::RsyncDaemon { url: "rsync://astra/mnw/latest.sql.gz".into() }); | |
| 183 | + | } | |
| 184 | + | ||
| 185 | + | #[test] | |
| 186 | + | fn parses_ssh_url_with_port() { | |
| 187 | + | let s = parse_source("ssh://backup-puller@alpha-west-1:2200/latest.sql.gz").unwrap(); | |
| 188 | + | assert_eq!( | |
| 189 | + | s, | |
| 190 | + | BackupSource::Ssh { | |
| 191 | + | user_host: "backup-puller@alpha-west-1".into(), | |
| 192 | + | port: Some(2200), | |
| 193 | + | path: "/latest.sql.gz".into(), | |
| 194 | + | } | |
| 195 | + | ); | |
| 196 | + | } | |
| 197 | + | ||
| 198 | + | #[test] | |
| 199 | + | fn parses_ssh_url_without_port() { | |
| 200 | + | let s = parse_source("ssh://max@astra/opt/backups/mnw/latest.sql.gz").unwrap(); | |
| 201 | + | assert_eq!( | |
| 202 | + | s, | |
| 203 | + | BackupSource::Ssh { | |
| 204 | + | user_host: "max@astra".into(), | |
| 205 | + | port: None, | |
| 206 | + | path: "/opt/backups/mnw/latest.sql.gz".into(), | |
| 207 | + | } | |
| 208 | + | ); | |
| 209 | + | } | |
| 210 | + | ||
| 211 | + | #[test] | |
| 212 | + | fn ssh_url_without_path_errors() { | |
| 213 | + | // `split_once('/')` — `ssh://user@host` has no `/` after the scheme. | |
| 214 | + | assert!(parse_source("ssh://backup-puller@alpha-west-1").is_err()); | |
| 215 | + | } | |
| 216 | + | ||
| 217 | + | #[test] | |
| 218 | + | fn ssh_url_without_user_host_errors() { | |
| 219 | + | // Empty user@host: `ssh:///foo`. Caught by the empty-prefix check. | |
| 220 | + | assert!(parse_source("ssh:///latest.sql.gz").is_err()); | |
| 221 | + | } | |
| 222 | + | ||
| 223 | + | #[test] | |
| 224 | + | fn ssh_url_with_non_numeric_after_colon_treats_as_part_of_host() { | |
| 225 | + | // `host:notaport` should NOT parse `notaport` as a port. Leave the | |
| 226 | + | // colon part of user_host; ssh/rsync will reject if truly wrong. | |
| 227 | + | let s = parse_source("ssh://user@host:notaport/path").unwrap(); | |
| 228 | + | assert_eq!( | |
| 229 | + | s, | |
| 230 | + | BackupSource::Ssh { | |
| 231 | + | user_host: "user@host:notaport".into(), | |
| 232 | + | port: None, | |
| 233 | + | path: "/path".into(), | |
| 234 | + | } | |
| 235 | + | ); | |
| 236 | + | } | |
| 237 | + | ||
| 238 | + | #[test] | |
| 239 | + | fn rejects_unknown_scheme() { | |
| 240 | + | assert!(parse_source("ftp://example.com/file").is_err()); | |
| 241 | + | assert!(parse_source("just-a-path.sql.gz").is_err()); | |
| 242 | + | assert!(parse_source("").is_err()); | |
| 243 | + | } | |
| 244 | + | ||
| 245 | + | #[test] | |
| 246 | + | fn ssh_url_preserves_multi_segment_path() { | |
| 247 | + | let s = parse_source("ssh://a@b:22/opt/foo/bar/baz.sql.gz").unwrap(); | |
| 248 | + | match s { | |
| 249 | + | BackupSource::Ssh { path, .. } => assert_eq!(path, "/opt/foo/bar/baz.sql.gz"), | |
| 250 | + | _ => panic!("wrong variant"), | |
| 251 | + | } | |
| 252 | + | } | |
| 253 | + | } |
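
The tests above pin down the behaviour of `parse_source`, but its body is not part of this hunk. The following is a minimal sketch of a parser that would satisfy those tests, not the author's implementation. It makes three assumptions: the `port` field is a `u16`, the error type is a plain `String` (the real code appears to use `anyhow`), and the variant shapes match those used in the `match` in `fetch`.

```rust
// Sketch only: the enum shape mirrors the variants matched in `fetch`;
// the `u16` port type and the `String` error type are assumptions.
#[derive(Debug, PartialEq, Eq)]
enum BackupSource {
    File { path: String },
    RsyncDaemon { url: String },
    Ssh { user_host: String, port: Option<u16>, path: String },
}

fn parse_source(source: &str) -> Result<BackupSource, String> {
    if let Some(rest) = source.strip_prefix("file://") {
        // `file://` with nothing after it has no path to copy from.
        if rest.is_empty() {
            return Err(format!("file:// source has no path: {source}"));
        }
        return Ok(BackupSource::File { path: rest.to_string() });
    }
    if source.starts_with("rsync://") {
        // rsync understands the daemon URL itself, so keep it whole.
        return Ok(BackupSource::RsyncDaemon { url: source.to_string() });
    }
    if let Some(rest) = source.strip_prefix("ssh://") {
        // Everything before the first `/` is the authority (user@host[:port]).
        let (authority, path) = rest
            .split_once('/')
            .ok_or_else(|| format!("ssh:// source has no path: {source}"))?;
        if authority.is_empty() {
            return Err(format!("ssh:// source has no user@host: {source}"));
        }
        // Only treat the suffix after the last `:` as a port if it is numeric;
        // otherwise leave it attached to the host.
        let (user_host, port) = match authority.rsplit_once(':') {
            Some((host, p)) => match p.parse::<u16>() {
                Ok(n) => (host, Some(n)),
                Err(_) => (authority, None),
            },
            None => (authority, None),
        };
        return Ok(BackupSource::Ssh {
            user_host: user_host.to_string(),
            port,
            // `split_once` consumed the leading slash; restore it.
            path: format!("/{path}"),
        });
    }
    Err(format!("unsupported backup source scheme: {source}"))
}
```

Splitting on the first `/` before looking for a port keeps colons inside the path from being mistaken for a port separator.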
| @@ -21,7 +21,10 @@ pub struct BuildArtifact { | |||
| 21 | 21 | pub version: String, | |
| 22 | 22 | pub git_sha: String, | |
| 23 | 23 | pub worktree: PathBuf, | |
| 24 | - | pub binary_path: PathBuf, | |
| 24 | + | /// One entry per `cfg.bin_names` in declared order. First is the primary | |
| 25 | + | /// (referenced by the systemd unit's ExecStart). Paths are inside the | |
| 26 | + | /// worktree's `target/release/`. | |
| 27 | + | pub binary_paths: Vec<PathBuf>, | |
| 25 | 28 | } | |
| 26 | 29 | ||
| 27 | 30 | pub async fn run( | |
| @@ -29,6 +32,7 @@ pub async fn run( | |||
| 29 | 32 | cfg: Arc<Config>, | |
| 30 | 33 | topo: Arc<Topology>, | |
| 31 | 34 | sha: String, | |
| 35 | + | events: crate::events::EventTx, | |
| 32 | 36 | ) -> Result<BuildArtifact> { | |
| 33 | 37 | let worktree = cfg.workdir.join(&sha); | |
| 34 | 38 | let bare = PathBuf::from(&topo.repo.bare_path); | |
| @@ -38,20 +42,47 @@ pub async fn run( | |||
| 38 | 42 | let version = read_pkg_version(&server_dir.join("Cargo.toml")).await | |
| 39 | 43 | .with_context(|| format!("reading version from {}/Cargo.toml", server_dir.display()))?; | |
| 40 | 44 | ||
| 41 | - | tracing::info!(sha = %sha, version = %version, dir = %server_dir.display(), "cargo build --release start"); | |
| 42 | - | let started = std::time::Instant::now(); | |
| 43 | - | let out = Command::new("cargo") | |
| 45 | + | // sqlx compile-time query checking needs a live DB with the current schema. | |
| 46 | + | // We point cargo at the scratch DB and prep it (drop public, re-migrate) | |
| 47 | + | // before invoking cargo build. The same DB is reset again by | |
| 48 | + | // `migration_dry_run` later if it runs as a gate. | |
| 49 | + | let mut cargo_cmd = Command::new("cargo"); | |
| 50 | + | cargo_cmd | |
| 44 | 51 | .arg("build") | |
| 45 | 52 | .arg("--release") | |
| 46 | 53 | .current_dir(&server_dir) | |
| 54 | + | .kill_on_drop(true); | |
| 55 | + | if let Some(scratch_url) = cfg.scratch_db_url.as_deref() { | |
| 56 | + | tracing::info!(sha = %sha, "preparing scratch DB schema for sqlx compile-time checks"); | |
| 57 | + | crate::gates::reset_scratch(scratch_url).await | |
| 58 | + | .context("scratch DB reset before build")?; | |
| 59 | + | crate::gates::run_migrator(scratch_url, &server_dir.join("migrations")).await | |
| 60 | + | .context("applying MNW migrations to scratch DB before build")?; | |
| 61 | + | cargo_cmd.env("DATABASE_URL", scratch_url); | |
| 62 | + | } else { | |
| 63 | + | tracing::warn!("scratch_db_url unset; sqlx will fall back to offline mode and may fail"); | |
| 64 | + | } | |
| 65 | + | ||
| 66 | + | tracing::info!(sha = %sha, version = %version, dir = %server_dir.display(), "cargo build --release start"); | |
| 67 | + | crate::events::emit(&events, crate::events::Event::BuildStart { | |
| 68 | + | sha: sha.clone(), version: version.clone(), | |
| 69 | + | }); | |
| 70 | + | let started = std::time::Instant::now(); | |
| 71 | + | let out = cargo_cmd | |
| 47 | 72 | .output() | |
| 48 | 73 | .await | |
| 49 | 74 | .context("spawning cargo build")?; | |
| 50 | 75 | let elapsed_s = started.elapsed().as_secs(); | |
| 51 | 76 | if !out.status.success() { | |
| 52 | 77 | tracing::error!(sha = %sha, version = %version, elapsed_s, "cargo build --release failed"); | |
| 78 | + | crate::events::emit(&events, crate::events::Event::BuildFailed { | |
| 79 | + | sha: sha.clone(), version: version.clone(), elapsed_s, | |
| 80 | + | }); | |
| 53 | 81 | } else { | |
| 54 | 82 | tracing::info!(sha = %sha, version = %version, elapsed_s, "cargo build --release ok"); | |
| 83 | + | crate::events::emit(&events, crate::events::Event::BuildOk { | |
| 84 | + | sha: sha.clone(), version: version.clone(), elapsed_s, | |
| 85 | + | }); | |
| 55 | 86 | } | |
| 56 | 87 | anyhow::ensure!( | |
| 57 | 88 | out.status.success(), | |
| @@ -59,12 +90,16 @@ pub async fn run( | |||
| 59 | 90 | tail(&out.stderr, 4_000), | |
| 60 | 91 | ); | |
| 61 | 92 | ||
| 62 | - | let binary_path = server_dir.join("target/release/server"); | |
| 63 | - | anyhow::ensure!( | |
| 64 | - | binary_path.exists(), | |
| 65 | - | "expected binary at {} after build", | |
| 66 | - | binary_path.display(), | |
| 67 | - | ); | |
| 93 | + | let release_dir = server_dir.join("target/release"); | |
| 94 | + | let mut binary_paths = Vec::with_capacity(cfg.bin_names.len()); | |
| 95 | + | for name in &cfg.bin_names { | |
| 96 | + | let p = release_dir.join(name); | |
| 97 | + | anyhow::ensure!(p.exists(), "expected binary at {} after build", p.display()); | |
| 98 | + | binary_paths.push(p); | |
| 99 | + | } | |
| 100 | + | // Primary binary path is the one we record in `versions.artifact_path` | |
| 101 | + | // (everything downstream — promote, rollback — looks it up by version). | |
| 102 | + | let primary = binary_paths[0].clone(); | |
| 68 | 103 | ||
| 69 | 104 | sqlx::query( | |
| 70 | 105 | "INSERT OR IGNORE INTO versions (version, git_sha, built_at, artifact_path) | |
| @@ -73,11 +108,11 @@ pub async fn run( | |||
| 73 | 108 | .bind(&version) | |
| 74 | 109 | .bind(&sha) | |
| 75 | 110 | .bind(Utc::now().to_rfc3339()) | |
| 76 | - | .bind(binary_path.to_string_lossy().as_ref()) | |
| 111 | + | .bind(primary.to_string_lossy().as_ref()) | |
| 77 | 112 | .execute(&pool) | |
| 78 | 113 | .await?; | |
| 79 | 114 | ||
| 80 | - | Ok(BuildArtifact { version, git_sha: sha, worktree, binary_path }) | |
| 115 | + | Ok(BuildArtifact { version, git_sha: sha, worktree, binary_paths }) | |
| 81 | 116 | } | |
| 82 | 117 | ||
| 83 | 118 | /// Full MM-tier pipeline: build, deploy the binary into MM's release_root, | |
| @@ -88,14 +123,36 @@ pub async fn build_and_run_mm( | |||
| 88 | 123 | cfg: Arc<Config>, | |
| 89 | 124 | topo: Arc<Topology>, | |
| 90 | 125 | sha: String, | |
| 126 | + | events: crate::events::EventTx, | |
| 91 | 127 | ) -> Result<()> { | |
| 92 | - | let art = run(pool.clone(), cfg.clone(), topo.clone(), sha).await?; | |
| 128 | + | let art = run(pool.clone(), cfg.clone(), topo.clone(), sha, events.clone()).await?; | |
| 93 | 129 | ||
| 94 | 130 | // Stage the binary in MM's release_root so future gates and the MM | |
| 95 | 131 | // self-deploy point at a stable path, not the worktree's target/. | |
| 96 | 132 | let mm_release_root = &cfg.release_root; | |
| 97 | - | let staged = deploy::deploy_local(mm_release_root, &art.version, &art.binary_path).await?; | |
| 98 | - | let staged_bin = staged.join("server"); | |
| 133 | + | let staged = deploy::deploy_local(mm_release_root, &art.version, &art.binary_paths).await?; | |
| 134 | + | ||
| 135 | + | // Bring error-pages alongside the binaries so the deploy rsync ships the | |
| 136 | + | // static HTML to every node. Caddy on each node references | |
| 137 | + | // <release_root>/current/error-pages/. Skipped silently if the worktree | |
| 138 | + | // doesn't have them (older shas, or non-MNW projects using this daemon). | |
| 139 | + | let error_pages_src = art.worktree.join("server/deploy/error-pages"); | |
| 140 | + | if error_pages_src.exists() { | |
| 141 | + | let out = Command::new("cp") | |
| 142 | + | .arg("-a") | |
| 143 | + | .arg(&error_pages_src) | |
| 144 | + | .arg(staged.join("error-pages")) | |
| 145 | + | .output() | |
| 146 | + | .await | |
| 147 | + | .context("spawning cp for error-pages")?; | |
| 148 | + | anyhow::ensure!( | |
| 149 | + | out.status.success(), | |
| 150 | + | "copying error-pages into staged dir: {}", | |
| 151 | + | String::from_utf8_lossy(&out.stderr), | |
| 152 | + | ); | |
| 153 | + | } | |
| 154 | + | ||
| 155 | + | let staged_bin = staged.join(cfg.primary_bin()); | |
| 99 | 156 | sqlx::query("UPDATE versions SET artifact_path = ? WHERE version = ?") | |
| 100 | 157 | .bind(staged_bin.to_string_lossy().as_ref()) | |
| 101 | 158 | .bind(&art.version) | |
| @@ -112,6 +169,7 @@ pub async fn build_and_run_mm( | |||
| 112 | 169 | tier: "mm".to_string(), | |
| 113 | 170 | version: art.version.clone(), | |
| 114 | 171 | worktree: art.worktree.clone(), | |
| 172 | + | events: events.clone(), | |
| 115 | 173 | }; | |
| 116 | 174 | let ok = gates::run_all(&ctx, &mm.gates).await?; | |
| 117 | 175 |
| @@ -16,9 +16,22 @@ pub struct Config { | |||
| 16 | 16 | /// you care about. | |
| 17 | 17 | #[serde(default)] | |
| 18 | 18 | pub scratch_db_url: Option<String>, | |
| 19 | + | /// Names of cargo bin targets the server crate produces (files under | |
| 20 | + | /// `target/release/`). First entry is the primary unit (referenced from | |
| 21 | + | /// the systemd unit's ExecStart). Defaults to `["server"]`; MNW ships | |
| 22 | + | /// `["makenotwork", "mnw-admin"]`. | |
| 23 | + | #[serde(default = "default_bin_names")] | |
| 24 | + | pub bin_names: Vec<String>, | |
| 19 | 25 | } | |
| 20 | 26 | ||
| 27 | + | fn default_bin_names() -> Vec<String> { vec!["server".into()] } | |
| 28 | + | ||
| 21 | 29 | impl Config { | |
| 30 | + | /// Primary binary — the one the systemd unit's ExecStart points at. | |
| 31 | + | pub fn primary_bin(&self) -> &str { | |
| 32 | + | self.bin_names.first().map(|s| s.as_str()).unwrap_or("server") | |
| 33 | + | } | |
| 34 | + | ||
| 22 | 35 | pub fn load() -> Result<Self> { | |
| 23 | 36 | let path = std::env::var("SANDO_CONFIG").unwrap_or_else(|_| "sando-daemon.toml".into()); | |
| 24 | 37 | let raw = std::fs::read_to_string(&path) |
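
One edge case in the `bin_names` change above is worth spelling out. `#[serde(default = ...)]` only applies when the key is missing, so an explicit `bin_names = []` still deserializes to an empty list. `primary_bin()` falls back to `"server"` in that case, while the `binary_paths[0]` index in the build step would panic. The sketch below reproduces the fallback with a stripped-down `Config` and adds a hypothetical `validate_bin_names` helper (not in the diff) that turns the empty case into an error.

```rust
// Stripped-down stand-in for the real `Config`; only the field relevant here.
struct Config {
    bin_names: Vec<String>,
}

impl Config {
    /// Same fallback as in the diff: first entry, or "server" if the list is empty.
    fn primary_bin(&self) -> &str {
        self.bin_names.first().map(|s| s.as_str()).unwrap_or("server")
    }

    /// Hypothetical helper (an assumption, not part of the change): reject an
    /// explicitly empty list up front instead of panicking on `[0]` later.
    fn validate_bin_names(&self) -> Result<(), String> {
        if self.bin_names.is_empty() {
            return Err("bin_names must list at least one cargo bin target".to_string());
        }
        Ok(())
    }
}
```

Calling such a check once at load time would keep `primary_bin()` and the build step in agreement about which binary is primary.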
| @@ -1,40 +1,58 @@ | |||
| 1 | 1 | //! Atomic symlink-swap deploys. | |
| 2 | 2 | //! | |
| 3 | - | //! Layout on every target (MM, A nodes, B nodes, ...): | |
| 3 | + | //! Layout on every target (local host, A nodes, B nodes, ...): | |
| 4 | 4 | //! | |
| 5 | 5 | //! <release_root>/ | |
| 6 | 6 | //! releases/ | |
| 7 | 7 | //! 0.8.1/ | |
| 8 | - | //! server <- the binary | |
| 8 | + | //! <bin_name> | |
| 9 | 9 | //! 0.8.2/ | |
| 10 | - | //! server | |
| 10 | + | //! <bin_name> | |
| 11 | 11 | //! current -> releases/0.8.2 | |
| 12 | 12 | //! | |
| 13 | - | //! `ln -sfn` makes the swap atomic on Linux. systemd units should point at | |
| 14 | - | //! `<release_root>/current/server` so a swap + reload picks up the new binary | |
| 15 | - | //! without a window where the unit references a missing path. | |
| 13 | + | //! `ln -sfn` swaps the symlink. systemd units point at | |
| 14 | + | //! `<release_root>/current/<bin_name>` so reload-or-restart picks up the new | |
| 15 | + | //! binary without ever pointing at a missing path. | |
| 16 | 16 | //! | |
| 17 | - | //! v0 only implements local deploys (used for MM and for localhost-dev | |
| 18 | - | //! "remote" nodes whose ssh_target is `local`). Real SSH/rsync deploys are | |
| 19 | - | //! follow-up work — see the `remote_deploy_stub` branch. | |
| 17 | + | //! For nodes with `ssh_target` set to anything other than `"local"`, deploy | |
| 18 | + | //! goes via rsync + ssh; the bootstrap (creating release_root, installing the | |
| 19 | + | //! service unit, granting sudo for systemctl) is out of scope here — it | |
| 20 | + | //! happens once per node, not per deploy. | |
| 20 | 21 | ||
| 21 | 22 | use crate::topology::Node; | |
| 22 | 23 | use anyhow::{Context, Result}; | |
| 23 | 24 | use std::path::{Path, PathBuf}; | |
| 24 | 25 | use tokio::process::Command; | |
| 25 | 26 | ||
| 26 | - | pub async fn deploy_local(release_root: &Path, version: &str, binary: &Path) -> Result<PathBuf> { | |
| 27 | + | /// SSH options used everywhere we shell out to ssh — fail fast, no prompts. | |
| 28 | + | const SSH_FLAGS: &[&str] = &[ | |
| 29 | + | "-o", "BatchMode=yes", | |
| 30 | + | "-o", "ConnectTimeout=10", | |
| 31 | + | "-o", "StrictHostKeyChecking=accept-new", | |
| 32 | + | ]; | |
| 33 | + | ||
| 34 | + | /// Keep this many release dirs per node; older ones get gc'd after a | |
| 35 | + | /// successful deploy. Fixed for now; promote to config if the constant ever | |
| 36 | + | /// needs to vary by tier. | |
| 37 | + | const RELEASES_TO_KEEP: usize = 5; | |
| 38 | + | ||
| 39 | + | pub async fn deploy_local( | |
| 40 | + | release_root: &Path, | |
| 41 | + | version: &str, | |
| 42 | + | binaries: &[PathBuf], | |
| 43 | + | ) -> Result<PathBuf> { | |
| 27 | 44 | let release_dir = release_root.join("releases").join(version); | |
| 28 | 45 | tokio::fs::create_dir_all(&release_dir).await?; | |
| 29 | - | let dest = release_dir.join("server"); | |
| 30 | - | tokio::fs::copy(binary, &dest) | |
| 31 | - | .await | |
| 32 | - | .with_context(|| format!("copy {} -> {}", binary.display(), dest.display()))?; | |
| 46 | + | for binary in binaries { | |
| 47 | + | let name = binary.file_name() | |
| 48 | + | .context("binary path has no file name")?; | |
| 49 | + | let dest = release_dir.join(name); | |
| 50 | + | tokio::fs::copy(binary, &dest) | |
| 51 | + | .await | |
| 52 | + | .with_context(|| format!("copy {} -> {}", binary.display(), dest.display()))?; | |
| 53 | + | } | |
| 33 | 54 | ||
| 34 | 55 | let current = release_root.join("current"); | |
| 35 | - | // ln -sfn is atomic on Linux; on macOS the dev path is non-prod so the | |
| 36 | - | // race is irrelevant. We shell out rather than using std::os::unix::fs | |
| 37 | - | // symlink + rename because the rename-over-symlink pattern is platform-fussy. | |
| 38 | 56 | let target = format!("releases/{version}"); | |
| 39 | 57 | let out = Command::new("ln") | |
| 40 | 58 | .args(["-sfn", &target]) | |
| @@ -46,26 +64,372 @@ pub async fn deploy_local(release_root: &Path, version: &str, binary: &Path) -> | |||
| 46 | 64 | "symlink swap failed: {}", | |
| 47 | 65 | String::from_utf8_lossy(&out.stderr), | |
| 48 | 66 | ); | |
| 67 | + | ||
| 68 | + | if let Err(e) = gc_local_releases(release_root).await { | |
| 69 | + | tracing::warn!(error = %e, "local release GC failed (non-fatal)"); | |
| 70 | + | } | |
| 49 | 71 | Ok(release_dir) | |
| 50 | 72 | } | |
| 51 | 73 | ||
| 52 | - | pub async fn deploy_node(node: &Node, version: &str, binary: &Path) -> Result<PathBuf> { | |
| 74 | + | /// Deploy `staged_release_dir` (a directory built on the Sando host by | |
| 75 | + | /// `deploy_local`) to `node`. For `ssh_target=local`, this is just symlink | |
| 76 | + | /// swap + restart; for remote nodes, we rsync the whole dir. | |
| 77 | + | /// | |
| 78 | + | /// `primary_bin` is only used for logging — every file present in the staged | |
| 79 | + | /// dir gets shipped. | |
| 80 | + | pub async fn deploy_node( | |
| 81 | + | node: &Node, | |
| 82 | + | version: &str, | |
| 83 | + | staged_release_dir: &Path, | |
| 84 | + | primary_bin: &str, | |
| 85 | + | ) -> Result<PathBuf> { | |
| 53 | 86 | if node.ssh_target == "local" || node.ssh_target.is_empty() { | |
| 54 | - | return deploy_local(Path::new(&node.release_root), version, binary).await; | |
| 87 | + | // Local deploy already happened when we staged on the Sando host. | |
| 88 | + | // Just re-point `current` at the staged dir. | |
| 89 | + | return reset_local_current(Path::new(&node.release_root), version).await; | |
| 90 | + | } | |
| 91 | + | deploy_remote(node, version, staged_release_dir, primary_bin).await | |
| 92 | + | } | |
| 93 | + | ||
| 94 | + | async fn reset_local_current(release_root: &Path, version: &str) -> Result<PathBuf> { | |
| 95 | + | let current = release_root.join("current"); | |
| 96 | + | let target = format!("releases/{version}"); | |
| 97 | + | let out = Command::new("ln") | |
| 98 | + | .args(["-sfn", &target]) | |
| 99 | + | .arg(&current) | |
| 100 | + | .output() | |
| 101 | + | .await?; | |
| 102 | + | anyhow::ensure!( | |
| 103 | + | out.status.success(), | |
| 104 | + | "symlink swap failed: {}", | |
| 105 | + | String::from_utf8_lossy(&out.stderr), | |
| 106 | + | ); | |
| 107 | + | Ok(release_root.join("releases").join(version)) | |
| 108 | + | } | |
| 109 | + | ||
| 110 | + | async fn deploy_remote( | |
| 111 | + | node: &Node, | |
| 112 | + | version: &str, | |
| 113 | + | staged_release_dir: &Path, | |
| 114 | + | primary_bin: &str, | |
| 115 | + | ) -> Result<PathBuf> { | |
| 116 | + | let release_root = &node.release_root; | |
| 117 | + | let ssh_target = &node.ssh_target; | |
| 118 | + | let service = &node.service_name; | |
| 119 | + | let release_dir = format!("{release_root}/releases/{version}"); | |
| 120 | + | ||
| 121 | + | tracing::info!(node = %node.name, version, "deploy: mkdir release dir"); | |
| 122 | + | ssh(ssh_target, &format!("set -e; mkdir -p {q}", q = sh_quote(&release_dir))) | |
| 123 | + | .await | |
| 124 | + | .context("creating remote release dir")?; | |
| 125 | + | ||
| 126 | + | tracing::info!(node = %node.name, version, primary = %primary_bin, "deploy: rsync release dir"); | |
| 127 | + | // Rsync the whole staged dir (all binaries + any sibling artifacts like | |
| 128 | + | // error-pages). Trailing slash on source = contents of dir, not the dir | |
| 129 | + | // itself. --chmod ensures binaries land executable; the regular-file | |
| 130 | + | // mask leaves data files at 0644. | |
| 131 | + | let rsync_src = format!("{}/", staged_release_dir.display()); | |
| 132 | + | let rsync_dest = format!("{ssh_target}:{release_dir}/"); | |
| 133 | + | let mut rsync = Command::new("rsync"); | |
| 134 | + | rsync | |
| 135 | + | .arg("-az") | |
| 136 | + | .arg("--partial") | |
| 137 | + | .arg("--chmod=F0755,D0755") | |
| 138 | + | .arg("-e") | |
| 139 | + | .arg(format!( | |
| 140 | + | "ssh {}", | |
| 141 | + | SSH_FLAGS.iter().map(|s| s.to_string()).collect::<Vec<_>>().join(" ") | |
| 142 | + | )) | |
| 143 | + | .arg(&rsync_src) | |
| 144 | + | .arg(&rsync_dest); | |
| 145 | + | let out = rsync.output().await.context("spawning rsync")?; | |
| 146 | + | anyhow::ensure!( | |
| 147 | + | out.status.success(), | |
| 148 | + | "rsync failed (current symlink left intact): {}", | |
| 149 | + | String::from_utf8_lossy(&out.stderr), | |
| 150 | + | ); | |
| 151 | + | ||
| 152 | + | tracing::info!(node = %node.name, version, "deploy: symlink swap + service reload"); | |
| 153 | + | // Symlink swap is atomic via `mv -T` of a freshly-created symlink over | |
| 154 | + | // the old one (the rename(2) is the atomic step; `ln -sfn` does | |
| 155 | + | // unlink+symlink which has a window). | |
| 156 | + | let swap_and_restart = format!( | |
| 157 | + | "set -e; \ | |
| 158 | + | cd {root}; \ | |
| 159 | + | ln -sfn releases/{ver} current.new; \ | |
| 160 | + | mv -Tf current.new current; \ | |
| 161 | + | sudo /bin/systemctl reload-or-restart {svc}", | |
| 162 | + | root = sh_quote(release_root), | |
| 163 | + | ver = sh_quote(version), | |
| 164 | + | svc = sh_quote(service), | |
| 165 | + | ); | |
| 166 | + | ssh(ssh_target, &swap_and_restart) | |
| 167 | + | .await | |
| 168 | + | .context("symlink swap + systemctl reload-or-restart")?; | |
| 169 | + | ||
| 170 | + | if let Err(e) = gc_remote_releases(ssh_target, release_root).await { | |
| 171 | + | tracing::warn!(error = %e, "remote release GC failed (non-fatal)"); | |
| 172 | + | } | |
| 173 | + | ||
| 174 | + | Ok(PathBuf::from(release_root).join("releases").join(version)) | |
| 175 | + | } | |
| 176 | + | ||
| 177 | + | async fn ssh(target: &str, script: &str) -> Result<()> { | |
| 178 | + | let mut cmd = Command::new("ssh"); | |
| 179 | + | cmd.args(SSH_FLAGS).arg(target).arg(script); | |
| 180 | + | let out = cmd.output().await.context("spawning ssh")?; | |
| 181 | + | anyhow::ensure!( | |
| 182 | + | out.status.success(), | |
| 183 | + | "ssh {target} failed: {}", | |
| 184 | + | String::from_utf8_lossy(&out.stderr), | |
| 185 | + | ); | |
| 186 | + | Ok(()) | |
| 187 | + | } | |
| 188 | + | ||
| 189 | + | async fn gc_local_releases(release_root: &Path) -> Result<()> { | |
| 190 | + | let releases = release_root.join("releases"); | |
| 191 | + | if !releases.exists() { | |
| 192 | + | return Ok(()); | |
| 193 | + | } | |
| 194 | + | let mut entries = Vec::new(); | |
| 195 | + | let mut rd = tokio::fs::read_dir(&releases).await?; | |
| 196 | + | while let Some(entry) = rd.next_entry().await? { | |
| 197 | + | if !entry.file_type().await?.is_dir() { | |
| 198 | + | continue; | |
| 199 | + | } | |
| 200 | + | let meta = entry.metadata().await?; | |
| 201 | + | entries.push((entry.path(), meta.modified()?)); | |
| 55 | 202 | } | |
| 56 | - | remote_deploy_stub(node, version, binary).await | |
| 203 | + | entries.sort_by(|a, b| b.1.cmp(&a.1)); | |
| 204 | + | for (path, _) in entries.into_iter().skip(RELEASES_TO_KEEP) { | |
| 205 | + | if let Err(e) = tokio::fs::remove_dir_all(&path).await { | |
| 206 | + | tracing::warn!(path = %path.display(), error = %e, "gc: rm failed"); | |
| 207 | + | } else { | |
| 208 | + | tracing::debug!(path = %path.display(), "gc: removed old release"); | |
| 209 | + | } | |
| 210 | + | } | |
| 211 | + | Ok(()) | |
| 57 | 212 | } | |
| 58 | 213 | ||
| 59 | - | async fn remote_deploy_stub(node: &Node, version: &str, _binary: &Path) -> Result<PathBuf> { | |
| 60 | - | // Real implementation: rsync the binary to <ssh_target>:<release_root>/releases/<version>/server, | |
| 61 | - | // then ssh <ssh_target> "ln -sfn releases/<version> current && systemctl reload-or-restart <unit>". | |
| 62 | - | // Wiring this up needs a story for systemd unit naming and ssh key/auth conventions; deferring | |
| 63 | - | // until the localhost smoke loop is settled and we know which knobs matter. | |
| 64 | - | anyhow::bail!( | |
| 65 | - | "remote deploy not yet implemented (node {} -> {}); use ssh_target=local for dev", | |
| 66 | - | node.name, | |
| 67 | - | node.ssh_target, | |
| 214 | + | async fn gc_remote_releases(ssh_target: &str, release_root: &str) -> Result<()> { | |
| 215 | + | // `ls -t` orders by mtime desc. Skip the first N, rm the rest. `xargs -r` | |
| 216 | + | // is a no-op when stdin is empty (avoids `rm` complaining). | |
| 217 | + | let script = format!( | |
| 218 | + | "set -e; cd {root}/releases 2>/dev/null || exit 0; \ | |
| 219 | + | ls -1t | tail -n +{keep_plus_one} | xargs -r -I{{}} rm -rf -- {{}}", | |
| 220 | + | root = sh_quote(release_root), | |
| 221 | + | keep_plus_one = RELEASES_TO_KEEP + 1, | |
| 68 | 222 | ); | |
| 69 | - | #[allow(unreachable_code)] | |
| 70 | - | Ok(PathBuf::from(&node.release_root).join("releases").join(version)) | |
| 223 | + | ssh(ssh_target, &script).await | |
| 224 | + | } | |
| 225 | + | ||
| 226 | + | /// Single-quote a string for safe inclusion in a /bin/sh command, escaping | |
| 227 | + | /// any single quote inside. Not bulletproof for adversarial input, but every | |
| 228 | + | /// path here comes from our own config files. | |
| 229 | + | fn sh_quote(s: &str) -> String { | |
| 230 | + | let escaped = s.replace('\'', r"'\''"); | |
| 231 | + | format!("'{escaped}'") | |
| 232 | + | } | |
| 233 | + | ||
| 234 | + | #[cfg(test)] | |
| 235 | + | mod tests { | |
| 236 | + | use super::*; | |
| 237 | + | use std::time::SystemTime; | |
| 238 | + | ||
| 239 | + | #[test] | |
| 240 | + | fn sh_quote_no_quote() { | |
| 241 | + | assert_eq!(sh_quote("hello"), "'hello'"); | |
| 242 | + | assert_eq!(sh_quote("/opt/mnw/releases/0.8.12"), "'/opt/mnw/releases/0.8.12'"); | |
| 243 | + | } | |
| 244 | + | ||
| 245 | + | #[test] | |
| 246 | + | fn sh_quote_with_quote() { | |
| 247 | + | // The string `it's` becomes `'it'\''s'` — close, escape, open. | |
| 248 | + | assert_eq!(sh_quote("it's"), r"'it'\''s'"); | |
| 249 | + | } | |
| 250 | + | ||
| 251 | + | #[tokio::test] | |
| 252 | + | async fn deploy_local_copies_multiple_binaries_and_swaps_symlink() { | |
| 253 | + | let tmp = tempfile::tempdir().unwrap(); | |
| 254 | + | let root = tmp.path(); | |
| 255 | + | ||
| 256 | + | // Source binaries (worktree's target/release/) | |
| 257 | + | let src_dir = root.join("src"); | |
| 258 | + | tokio::fs::create_dir_all(&src_dir).await.unwrap(); | |
| 259 | + | let primary = src_dir.join("makenotwork"); | |
| 260 | + | let admin = src_dir.join("mnw-admin"); | |
| 261 | + | tokio::fs::write(&primary, b"PRIMARY").await.unwrap(); | |
| 262 | + | tokio::fs::write(&admin, b"ADMIN").await.unwrap(); | |
| 263 | + | ||
| 264 | + | // Release root (where staged versions live) | |
| 265 | + | let release_root = root.join("releases-root"); | |
| 266 | + | tokio::fs::create_dir_all(&release_root).await.unwrap(); | |
| 267 | + | ||
| 268 | + | let staged = deploy_local( | |
| 269 | + | &release_root, | |
| 270 | + | "0.8.12", | |
| 271 | + | &[primary.clone(), admin.clone()], | |
| 272 | + | ) | |
| 273 | + | .await | |
| 274 | + | .expect("deploy_local should succeed"); | |
| 275 | + | ||
| 276 | + | assert_eq!(staged, release_root.join("releases").join("0.8.12")); | |
| 277 | + | assert_eq!(tokio::fs::read(staged.join("makenotwork")).await.unwrap(), b"PRIMARY"); | |
| 278 | + | assert_eq!(tokio::fs::read(staged.join("mnw-admin")).await.unwrap(), b"ADMIN"); | |
| 279 | + | ||
| 280 | + | // current symlink should resolve to staged | |
| 281 | + | let current = release_root.join("current"); | |
| 282 | + | let target = tokio::fs::read_link(&current).await.unwrap(); | |
| 283 | + | assert_eq!(target.to_string_lossy(), "releases/0.8.12"); | |
| 284 | + | // And reading through `current/` should give the new content. | |
| 285 | + | let via_current = tokio::fs::read(current.join("makenotwork")).await.unwrap(); | |
| 286 | + | assert_eq!(via_current, b"PRIMARY"); | |
| 287 | + | } | |
| 288 | + | ||
| 289 | + | #[tokio::test] | |
| 290 | + | async fn deploy_local_second_release_swaps_symlink_and_keeps_old_dir() { | |
| 291 | + | let tmp = tempfile::tempdir().unwrap(); | |
| 292 | + | let root = tmp.path(); | |
| 293 | + | let src_dir = root.join("src"); | |
| 294 | + | tokio::fs::create_dir_all(&src_dir).await.unwrap(); | |
| 295 | + | let bin = src_dir.join("server"); | |
| 296 | + | tokio::fs::write(&bin, b"V1").await.unwrap(); | |
| 297 | + | ||
| 298 | + | let release_root = root.join("rr"); | |
| 299 | + | tokio::fs::create_dir_all(&release_root).await.unwrap(); | |
| 300 | + | ||
| 301 | + | deploy_local(&release_root, "0.1.0", &[bin.clone()]).await.unwrap(); | |
| 302 | + | // Rewrite source then deploy 0.2.0. | |
| 303 | + | tokio::fs::write(&bin, b"V2").await.unwrap(); | |
| 304 | + | deploy_local(&release_root, "0.2.0", &[bin.clone()]).await.unwrap(); | |
| 305 | + | ||
| 306 | + | // Both versions present on disk. | |
| 307 | + | assert!(release_root.join("releases/0.1.0/server").exists()); | |
| 308 | + | assert!(release_root.join("releases/0.2.0/server").exists()); | |
| 309 | + | // current points at the new one. | |
| 310 | + | let target = tokio::fs::read_link(release_root.join("current")).await.unwrap(); | |
| 311 | + | assert_eq!(target.to_string_lossy(), "releases/0.2.0"); | |
| 312 | + | let via_current = tokio::fs::read(release_root.join("current/server")).await.unwrap(); | |
| 313 | + | assert_eq!(via_current, b"V2"); | |
| 314 | + | } | |
| 315 | + | ||
| 316 | + | #[tokio::test] | |
| 317 | + | async fn gc_local_releases_keeps_last_n_by_mtime() { | |
| 318 | + | // Build > RELEASES_TO_KEEP fake release dirs with distinct mtimes, | |
| 319 | + | // then run gc and check which survived. | |
| 320 | + | let tmp = tempfile::tempdir().unwrap(); | |
| 321 | + | let root = tmp.path(); | |
| 322 | + | let releases = root.join("releases"); | |
| 323 | + | tokio::fs::create_dir_all(&releases).await.unwrap(); | |
| 324 | + | ||
| 325 | + | let total = RELEASES_TO_KEEP + 3; | |
| 326 | + | let mut names = Vec::new(); | |
| 327 | + | for i in 0..total { | |
| 328 | + | let name = format!("v{i:02}"); | |
| 329 | + | let dir = releases.join(&name); | |
| 330 | + | tokio::fs::create_dir(&dir).await.unwrap(); | |
| 331 | + | // Stagger mtimes deterministically. tokio's File doesn't expose | |
| 332 | + | // set_times, so reach for std::fs::File + std::fs::FileTimes | |
| 333 | + | // (stable since 1.75). Synchronous is fine here — this is test | |
| 334 | + | // setup, not the hot path. | |
| 335 | + | let f = std::fs::File::open(&dir).unwrap(); | |
| 336 | + | let when = SystemTime::UNIX_EPOCH + std::time::Duration::from_secs(1_700_000_000 + i as u64); | |
| 337 | + | let times = std::fs::FileTimes::new().set_modified(when); | |
| 338 | + | f.set_times(times).unwrap(); | |
| 339 | + | names.push(name); | |
| 340 | + | } | |
| 341 | + | ||
| 342 | + | gc_local_releases(root).await.unwrap(); | |
| 343 | + | ||
| 344 | + | // The last RELEASES_TO_KEEP by mtime (i.e. highest i) survive. | |
| 345 | + | let surviving_expected: Vec<_> = names | |
| 346 | + | .iter() | |
| 347 | + | .skip(total - RELEASES_TO_KEEP) | |
| 348 | + | .cloned() | |
| 349 | + | .collect(); | |
| 350 | + | for name in &surviving_expected { | |
| 351 | + | assert!( | |
| 352 | + | releases.join(name).exists(), | |
| 353 | + | "expected to survive: {name}" | |
| 354 | + | ); | |
| 355 | + | } | |
| 356 | + | for name in names.iter().take(total - RELEASES_TO_KEEP) { | |
| 357 | + | assert!( | |
| 358 | + | !releases.join(name).exists(), | |
| 359 | + | "expected to be pruned: {name}" | |
| 360 | + | ); | |
| 361 | + | } | |
| 362 | + | } | |
| 363 | + | ||
| 364 | + | #[tokio::test] | |
| 365 | + | async fn gc_local_releases_noop_when_below_threshold() { | |
| 366 | + | let tmp = tempfile::tempdir().unwrap(); | |
| 367 | + | let root = tmp.path(); | |
| 368 | + | let releases = root.join("releases"); | |
| 369 | + | tokio::fs::create_dir_all(&releases).await.unwrap(); | |
| 370 | + | for i in 0..3 { | |
| 371 | + | tokio::fs::create_dir(releases.join(format!("v{i}"))).await.unwrap(); | |
| 372 | + | } | |
| 373 | + | gc_local_releases(root).await.unwrap(); | |
| 374 | + | for i in 0..3 { | |
| 375 | + | assert!(releases.join(format!("v{i}")).exists()); | |
| 376 | + | } | |
| 377 | + | } | |
| 378 | + | ||
| 379 | + | #[tokio::test] | |
| 380 | + | async fn gc_local_releases_noop_when_releases_dir_missing() { | |
| 381 | + | let tmp = tempfile::tempdir().unwrap(); | |
| 382 | + | gc_local_releases(tmp.path()).await.unwrap(); | |
| 383 | + | } | |
| 384 | + | ||
| 385 | + | #[tokio::test] | |
| 386 | + | async fn deploy_remote_fails_cleanly_when_host_unreachable() { | |
| 387 | + | // 192.0.2.0/24 is reserved for documentation and routes nowhere. | |
| 388 | + | // ConnectTimeout=10 limits the test wallclock to ~10s worst case. | |
| 389 | + | let tmp = tempfile::tempdir().unwrap(); | |
| 390 | + | let staged = tmp.path().join("releases").join("0.0.1"); | |
| 391 | + | tokio::fs::create_dir_all(&staged).await.unwrap(); | |
| 392 | + | tokio::fs::write(staged.join("server"), b"x").await.unwrap(); | |
| 393 | + | ||
| 394 | + | let node = crate::topology::Node { | |
| 395 | + | name: "unreachable".into(), | |
| 396 | + | ssh_target: "deploy@192.0.2.1".into(), | |
| 397 | + | release_root: "/opt/never".into(), | |
| 398 | + | service_name: "makenotwork.service".into(), | |
| 399 | + | }; | |
| 400 | + | ||
| 401 | + | let result = deploy_node(&node, "0.0.1", &staged, "server").await; | |
| 402 | + | let err = result.expect_err("deploy to unreachable host should fail"); | |
| 403 | + | let msg = format!("{err:#}"); | |
| 404 | + | // The ssh helper returns `ssh <target> failed: ...`. Don't pin the | |
| 405 | + | // exact wording, just that the failure is attributed and that no | |
| 406 | + | // panic / hang happened. | |
| 407 | + | assert!( | |
| 408 | + | msg.contains("ssh") || msg.contains("rsync") || msg.contains("connection"), | |
| 409 | + | "unexpected error: {msg}" | |
| 410 | + | ); | |
| 411 | + | } | |
| 412 | + | ||
| 413 | + | #[tokio::test] | |
| 414 | + | async fn deploy_node_with_local_ssh_target_swaps_symlink() { | |
| 415 | + | // ssh_target="local" should route to the local fast-path: just a | |
| 416 | + | // symlink swap, no remote calls. Helpful for dev loops. | |
| 417 | + | let tmp = tempfile::tempdir().unwrap(); | |
| 418 | + | let release_root = tmp.path().to_path_buf(); | |
| 419 | + | let staged = release_root.join("releases").join("0.0.1"); | |
| 420 | + | tokio::fs::create_dir_all(&staged).await.unwrap(); | |
| 421 | + | tokio::fs::write(staged.join("server"), b"x").await.unwrap(); | |
| 422 | + | ||
| 423 | + | let node = crate::topology::Node { | |
| 424 | + | name: "local-dev".into(), | |
| 425 | + | ssh_target: "local".into(), | |
| 426 | + | release_root: release_root.to_string_lossy().into_owned(), | |
| 427 | + | service_name: "makenotwork.service".into(), | |
| 428 | + | }; | |
| 429 | + | ||
| 430 | + | let out = deploy_node(&node, "0.0.1", &staged, "server").await.unwrap(); | |
| 431 | + | assert_eq!(out, staged); | |
| 432 | + | let target = tokio::fs::read_link(release_root.join("current")).await.unwrap(); | |
| 433 | + | assert_eq!(target.to_string_lossy(), "releases/0.0.1"); | |
| 434 | + | } | |
| 71 | 435 | } |
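
The `gc_local_releases_keeps_last_n_by_mtime` test above pins the retention policy: order release directories by mtime and keep the newest `RELEASES_TO_KEEP`. The body of `gc_local_releases` is not shown in this hunk, so the following is only a sketch of the selection step as a pure function. `releases_to_prune` and `fake_releases` are hypothetical names, and the tie-break on directory name is an assumption.

```rust
use std::time::{Duration, SystemTime};

/// Hypothetical helper (not from the diff): given (name, mtime) pairs,
/// return the names to delete so that only the `keep` newest remain.
fn releases_to_prune(mut entries: Vec<(String, SystemTime)>, keep: usize) -> Vec<String> {
    // Oldest first. Tie-break on name (an assumption) so the result is deterministic.
    entries.sort_by(|a, b| a.1.cmp(&b.1).then_with(|| a.0.cmp(&b.0)));
    // Everything before the cut is older than the `keep` newest entries.
    let cut = entries.len().saturating_sub(keep);
    entries.into_iter().take(cut).map(|(name, _)| name).collect()
}

/// Builds `n` fake releases with staggered mtimes, mirroring the test setup above.
fn fake_releases(n: u64) -> Vec<(String, SystemTime)> {
    (0..n)
        .map(|i| {
            let when = SystemTime::UNIX_EPOCH + Duration::from_secs(1_700_000_000 + i);
            (format!("v{i:02}"), when)
        })
        .collect()
}
```

Keeping the selection separate from the filesystem walk would let the policy be tested without touching mtimes on disk. When there are fewer entries than `keep`, `saturating_sub` yields a cut of zero and nothing is pruned, which matches the below-threshold test.
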
| @@ -0,0 +1,126 @@ | |||
| 1 | + | //! Event bus for live operator visibility. | |
| 2 | + | //! | |
| 3 | + | //! Sites that previously logged via `tracing::info!` also emit a typed event | |
| 4 | + | //! onto a `broadcast::Sender<EventEnvelope>`. The WS handler at `/events` | |
| 5 | + | //! subscribes to the bus and forwards each envelope to the connected TUI as | |
| 6 | + | //! a JSON text frame. | |
| 7 | + | ||
| 8 | + | use chrono::{DateTime, Utc}; | |
| 9 | + | use serde::Serialize; | |
| 10 | + | use tokio::sync::broadcast; | |
| 11 | + | ||
| 12 | + | /// Capacity of the broadcast channel. Slow subscribers that fall behind by | |
| 13 | + | /// more than this many events get `RecvError::Lagged`; the WS handler treats | |
| 14 | + | /// that as a recoverable hiccup, not a disconnect. | |
| 15 | + | pub const CAPACITY: usize = 256; | |
| 16 | + | ||
| 17 | + | pub type EventTx = broadcast::Sender<EventEnvelope>; | |
| 18 | + | ||
| 19 | + | #[derive(Clone, Debug, Serialize)] | |
| 20 | + | pub struct EventEnvelope { | |
| 21 | + | pub at: DateTime<Utc>, | |
| 22 | + | #[serde(flatten)] | |
| 23 | + | pub event: Event, | |
| 24 | + | } | |
| 25 | + | ||
| 26 | + | #[derive(Clone, Debug, Serialize)] | |
| 27 | + | #[serde(tag = "kind", rename_all = "snake_case")] | |
| 28 | + | pub enum Event { | |
| 29 | + | /// A /rebuild was accepted (post-receive hook or operator). | |
| 30 | + | RebuildRequested { sha: String }, | |
| 31 | + | /// A previous in-flight build was aborted because a newer /rebuild arrived. | |
| 32 | + | BuildAborted { sha_aborted: String }, | |
| 33 | + | BuildStart { sha: String, version: String }, | |
| 34 | + | BuildOk { sha: String, version: String, elapsed_s: u64 }, | |
| 35 | + | BuildFailed { sha: String, version: String, elapsed_s: u64 }, | |
| 36 | + | GateStart { tier: String, version: String, gate: String }, | |
| 37 | + | GateDone { tier: String, version: String, gate: String, passed: bool }, | |
| 38 | + | DeployStart { tier: String, node: String, version: String }, | |
| 39 | + | DeployOk { tier: String, node: String, version: String }, | |
| 40 | + | DeployFailed { tier: String, node: String, version: String, error: String }, | |
| 41 | + | PromoteComplete { tier: String, version: String }, | |
| 42 | + | Rollback { tier: String, from: String, to: String }, | |
| 43 | + | BackupFetched { source: String, byte_size: i64 }, | |
| 44 | + | ManualConfirm { tier: String, version: String }, | |
| 45 | + | } | |
| 46 | + | ||
| 47 | + | pub fn channel() -> EventTx { | |
| 48 | + | broadcast::channel(CAPACITY).0 | |
| 49 | + | } | |
| 50 | + | ||
| 51 | + | /// Send an event without caring whether anyone is listening. The `send` call | |
| 52 | + | /// fails only when there are zero subscribers, which is the normal case for | |
| 53 | + | /// most operator-tool deployments. | |
| 54 | + | pub fn emit(tx: &EventTx, event: Event) { | |
| 55 | + | let envelope = EventEnvelope { at: Utc::now(), event }; | |
| 56 | + | let _ = tx.send(envelope); | |
| 57 | + | } | |
| 58 | + | ||
| 59 | + | #[cfg(test)] | |
| 60 | + | mod tests { | |
| 61 | + | use super::*; | |
| 62 | + | ||
| 63 | + | #[test] | |
| 64 | + | fn emit_with_zero_subscribers_does_not_panic() { | |
| 65 | + | // The whole point of `let _ = tx.send(...)` is that emitting into an | |
| 66 | + | // unsubscribed bus is fine. Verify the contract — if this regresses | |
| 67 | + | // to `.unwrap()` someday, every build/deploy site will start | |
| 68 | + | // crashing. | |
| 69 | + | let tx = channel(); | |
| 70 | + | emit(&tx, Event::RebuildRequested { sha: "abc".into() }); | |
| 71 | + | emit(&tx, Event::BackupFetched { source: "x".into(), byte_size: 1 }); | |
| 72 | + | } | |
| 73 | + | ||
| 74 | + | #[tokio::test] | |
| 75 | + | async fn emit_reaches_a_subscriber() { | |
| 76 | + | let tx = channel(); | |
| 77 | + | let mut rx = tx.subscribe(); | |
| 78 | + | emit(&tx, Event::PromoteComplete { tier: "a".into(), version: "0.8.12".into() }); | |
| 79 | + | let env = rx.recv().await.expect("envelope"); | |
| 80 | + | match env.event { | |
| 81 | + | Event::PromoteComplete { tier, version } => { | |
| 82 | + | assert_eq!(tier, "a"); | |
| 83 | + | assert_eq!(version, "0.8.12"); | |
| 84 | + | } | |
| 85 | + | _ => panic!("wrong event kind"), | |
| 86 | + | } | |
| 87 | + | } | |
| 88 | + | ||
| 89 | + | #[tokio::test] | |
| 90 | + | async fn envelope_serializes_with_flat_kind() { | |
| 91 | + | // Contract for the WS handler + TUI's `format_event`: the JSON has a | |
| 92 | + | // top-level `kind` field, not nested under `event`. Locking this in. | |
| 93 | + | let env = EventEnvelope { | |
| 94 | + | at: Utc::now(), | |
| 95 | + | event: Event::GateStart { | |
| 96 | + | tier: "mm".into(), | |
| 97 | + | version: "0.8.12".into(), | |
| 98 | + | gate: "cargo_test".into(), | |
| 99 | + | }, | |
| 100 | + | }; | |
| 101 | + | let s = serde_json::to_string(&env).unwrap(); | |
| 102 | + | let v: serde_json::Value = serde_json::from_str(&s).unwrap(); | |
| 103 | + | assert_eq!(v["kind"], "gate_start"); | |
| 104 | + | assert_eq!(v["tier"], "mm"); | |
| 105 | + | assert_eq!(v["gate"], "cargo_test"); | |
| 106 | + | // No nested `event` object. | |
| 107 | + | assert!(v.get("event").is_none()); | |
| 108 | + | } | |
| 109 | + | ||
| 110 | + | #[tokio::test] | |
| 111 | + | async fn lagged_subscriber_observes_recv_error_lagged() { | |
| 112 | + | // If a subscriber falls behind by more than CAPACITY, the next | |
| 113 | + | // recv() returns RecvError::Lagged(n) — not Closed, not a panic. | |
| 114 | + | // The WS handler turns this into a `lagged` envelope. | |
| 115 | + | let tx = channel(); | |
| 116 | + | let mut rx = tx.subscribe(); | |
| 117 | + | for i in 0..(CAPACITY + 10) { | |
| 118 | + | emit(&tx, Event::RebuildRequested { sha: format!("{i}") }); | |
| 119 | + | } | |
| 120 | + | let err = rx.recv().await.expect_err("expected Lagged"); | |
| 121 | + | match err { | |
| 122 | + | tokio::sync::broadcast::error::RecvError::Lagged(n) => assert!(n >= 10), | |
| 123 | + | other => panic!("unexpected error: {other:?}"), | |
| 124 | + | } | |
| 125 | + | } | |
| 126 | + | } |
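
The `lagged_subscriber_observes_recv_error_lagged` test relies on how a bounded broadcast buffer treats a slow reader: the oldest entries are overwritten, and the reader is told how many it missed before resuming at the oldest retained entry. The sketch below models that behaviour with the standard library only. It is not tokio's implementation; `Ring` and `Recv` are hypothetical names introduced for illustration.

```rust
use std::collections::VecDeque;

/// Hypothetical stand-in for a bounded broadcast buffer (not tokio's type).
struct Ring<T> {
    buf: VecDeque<T>,
    cap: usize,
    next_seq: u64, // sequence number the next pushed item will receive
}

#[derive(Debug, PartialEq)]
enum Recv<T> {
    Item(T),
    Lagged(u64), // number of items the reader missed
    Empty,
}

impl<T: Clone> Ring<T> {
    fn new(cap: usize) -> Self {
        Ring { buf: VecDeque::with_capacity(cap), cap, next_seq: 0 }
    }

    /// Appends an item, evicting the oldest one when the buffer is full.
    fn push(&mut self, item: T) {
        if self.buf.len() == self.cap {
            self.buf.pop_front();
        }
        self.buf.push_back(item);
        self.next_seq += 1;
    }

    /// `cursor` is the sequence number the reader wants next.
    fn recv(&self, cursor: &mut u64) -> Recv<T> {
        let oldest = self.next_seq - self.buf.len() as u64;
        if *cursor < oldest {
            // The reader fell behind: report the gap and skip to the oldest retained item.
            let missed = oldest - *cursor;
            *cursor = oldest;
            return Recv::Lagged(missed);
        }
        if *cursor == self.next_seq {
            return Recv::Empty;
        }
        let idx = (*cursor - oldest) as usize;
        *cursor += 1;
        Recv::Item(self.buf[idx].clone())
    }
}
```

With a capacity of 256 and 266 pushes, a reader starting at sequence 0 sees `Lagged(10)` in this model, which is consistent with the `n >= 10` assertion in the test.
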
| @@ -4,6 +4,7 @@ | |||
| 4 | 4 | //! and the TUI can show them. | |
| 5 | 5 | ||
| 6 | 6 | use crate::config::Config; | |
| 7 | + | use crate::events::{self, Event, EventTx}; | |
| 7 | 8 | use crate::topology::Gate; | |
| 8 | 9 | use anyhow::Result; | |
| 9 | 10 | use chrono::Utc; | |
| @@ -18,6 +19,7 @@ pub struct GateCtx { | |||
| 18 | 19 | pub tier: String, | |
| 19 | 20 | pub version: String, | |
| 20 | 21 | pub worktree: PathBuf, | |
| 22 | + | pub events: EventTx, | |
| 21 | 23 | } | |
| 22 | 24 | ||
| 23 | 25 | #[derive(Debug, Clone)] | |
| @@ -44,6 +46,11 @@ pub async fn run(ctx: &GateCtx, gate: &Gate) -> Result<GateOutcome> { | |||
| 44 | 46 | .await?; | |
| 45 | 47 | ||
| 46 | 48 | tracing::info!(tier = %ctx.tier, version = %ctx.version, gate = kind, "gate start"); | |
| 49 | + | events::emit(&ctx.events, Event::GateStart { | |
| 50 | + | tier: ctx.tier.clone(), | |
| 51 | + | version: ctx.version.clone(), | |
| 52 | + | gate: kind.into(), | |
| 53 | + | }); | |
| 47 | 54 | ||
| 48 | 55 | let outcome = match gate { | |
| 49 | 56 | Gate::CargoTest => cargo_test(ctx).await, | |
| @@ -72,20 +79,30 @@ pub async fn run(ctx: &GateCtx, gate: &Gate) -> Result<GateOutcome> { | |||
| 72 | 79 | tier = %ctx.tier, version = %ctx.version, gate = kind, | |
| 73 | 80 | passed = outcome.passed, "gate done", | |
| 74 | 81 | ); | |
| 82 | + | events::emit(&ctx.events, Event::GateDone { | |
| 83 | + | tier: ctx.tier.clone(), | |
| 84 | + | version: ctx.version.clone(), | |
| 85 | + | gate: kind.into(), | |
| 86 | + | passed: outcome.passed, | |
| 87 | + | }); | |
| 75 | 88 | ||
| 76 | 89 | Ok(outcome) | |
| 77 | 90 | } | |
| 78 | 91 | ||
| 79 | 92 | /// Run a sequence of gates; stops on the first failure (no point running the | |
| 80 | - | /// rest if a prerequisite failed). Returns true iff every gate passed. | |
| 93 | + | /// Run every gate in order and return true iff all passed. We deliberately do | |
| 94 | + | /// NOT short-circuit on first failure — every gate's outcome is recorded in | |
| 95 | + | /// `gate_runs`, which is the operator's only visibility into pipeline health. | |
| 96 | + | /// Hiding later gates because an earlier one failed makes diagnosis worse. | |
| 81 | 97 | pub async fn run_all(ctx: &GateCtx, gates: &[Gate]) -> Result<bool> { | |
| 98 | + | let mut all_ok = true; | |
| 82 | 99 | for g in gates { | |
| 83 | 100 | let o = run(ctx, g).await?; | |
| 84 | 101 | if !o.passed { | |
| 85 | - | return Ok(false); | |
| 102 | + | all_ok = false; | |
| 86 | 103 | } | |
| 87 | 104 | } | |
| 88 | - | Ok(true) | |
| 105 | + | Ok(all_ok) | |
| 89 | 106 | } | |
| 90 | 107 | ||
| 91 | 108 | fn kind_str(g: &Gate) -> &'static str { | |
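
The `run_all` change in this hunk replaces the early `return Ok(false)` with an accumulated flag, so every gate runs and is recorded even after one fails. Note that the unchanged context line just above the new doc comment ("Run a sequence of gates; stops on the first failure...") still describes the old behaviour, so the resulting comment contradicts itself. The sketch below isolates the new control flow without the async and database parts; `Outcome`, `run_all_sketch`, and the `recorded` vector are hypothetical stand-ins, and the gate names are illustrative.

```rust
/// Hypothetical stand-in for a gate result; the real `GateOutcome` also carries `detail`.
struct Outcome {
    passed: bool,
}

/// Runs every gate in order, records each outcome, and reports whether all passed.
/// Unlike the previous version, a failure does not stop later gates from running.
fn run_all_sketch<F>(gates: &[&str], mut run: F, recorded: &mut Vec<(String, bool)>) -> bool
where
    F: FnMut(&str) -> Outcome,
{
    let mut all_ok = true;
    for &g in gates {
        let o = run(g);
        // Stand-in for the `gate_runs` row written by the real `run`.
        recorded.push((g.to_string(), o.passed));
        if !o.passed {
            all_ok = false;
        }
    }
    all_ok
}
```

The trade-off is longer wall-clock time on a failing pipeline in exchange for a complete picture of which gates are red.
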
| @@ -102,11 +119,15 @@ fn kind_str(g: &Gate) -> &'static str { | |||
| 102 | 119 | ||
| 103 | 120 | async fn cargo_test(ctx: &GateCtx) -> Result<GateOutcome> { | |
| 104 | 121 | let server_dir = ctx.worktree.join("server"); | |
| 105 | - | let out = Command::new("cargo") | |
| 106 | - | .args(["test", "--release"]) | |
| 107 | - | .current_dir(&server_dir) | |
| 108 | - | .output() | |
| 109 | - | .await?; | |
| 122 | + | let mut cmd = Command::new("cargo"); | |
| 123 | + | cmd.args(["test", "--release"]).current_dir(&server_dir).kill_on_drop(true); | |
| 124 | + | // Same online-mode rationale as the build step: sqlx query macros need a | |
| 125 | + | // live DB to type-check against. The scratch DB is left in migrated state | |
| 126 | + | // by the preceding build, so we can reuse it here. | |
| 127 | + | if let Some(scratch_url) = ctx.cfg.scratch_db_url.as_deref() { | |
| 128 | + | cmd.env("DATABASE_URL", scratch_url); | |
| 129 | + | } | |
| 130 | + | let out = cmd.output().await?; | |
| 110 | 131 | Ok(GateOutcome { | |
| 111 | 132 | passed: out.status.success(), | |
| 112 | 133 | detail: Some(tail(&out.stderr, 4_000)), | |
| @@ -148,12 +169,30 @@ async fn migration_dry_run(ctx: &GateCtx) -> Result<GateOutcome> { | |||
| 148 | 169 | } | |
| 149 | 170 | } | |
| 150 | 171 | ||
| 151 | - | async fn reset_scratch(db_url: &str) -> Result<()> { | |
| 172 | + | pub(crate) async fn reset_scratch(db_url: &str) -> Result<()> { | |
| 152 | 173 | use sqlx::postgres::PgPoolOptions; | |
| 153 | 174 | use sqlx::Executor; | |
| 154 | 175 | let pool = PgPoolOptions::new().max_connections(1).connect(db_url).await?; | |
| 155 | - | pool.execute("DROP SCHEMA IF EXISTS public CASCADE; CREATE SCHEMA public;") | |
| 156 | - | .await?; | |
| 176 | + | // Drop every non-system schema, not just public — migrations create custom | |
| 177 | + | // schemas (e.g. tower_sessions) that survive `DROP SCHEMA public CASCADE` | |
| 178 | + | // and then collide on the next migration run. | |
| 179 | + | pool.execute( | |
| 180 | + | r#" | |
| 181 | + | DO $$ | |
| 182 | + | DECLARE s text; | |
| 183 | + | BEGIN | |
| 184 | + | FOR s IN | |
| 185 | + | SELECT nspname FROM pg_namespace | |
| 186 | + | WHERE nspname NOT LIKE 'pg_%' | |
| 187 | + | AND nspname NOT IN ('information_schema') | |
| 188 | + | LOOP | |
| 189 | + | EXECUTE format('DROP SCHEMA IF EXISTS %I CASCADE', s); | |
| 190 | + | END LOOP; | |
| 191 | + | EXECUTE 'CREATE SCHEMA public'; | |
| 192 | + | END $$; | |
| 193 | + | "#, | |
| 194 | + | ) | |
| 195 | + | .await?; | |
| 157 | 196 | pool.close().await; | |
| 158 | 197 | Ok(()) | |
| 159 | 198 | } | |
| @@ -177,7 +216,7 @@ async fn restore_dump(db_url: &str, dump: &str) -> Result<()> { | |||
| 177 | 216 | Ok(()) | |
| 178 | 217 | } | |
| 179 | 218 | ||
| 180 | - | async fn run_migrator(db_url: &str, dir: &std::path::Path) -> Result<()> { | |
| 219 | + | pub(crate) async fn run_migrator(db_url: &str, dir: &std::path::Path) -> Result<()> { | |
| 181 | 220 | use sqlx::postgres::PgPoolOptions; | |
| 182 | 221 | let pool = PgPoolOptions::new().max_connections(1).connect(db_url).await?; | |
| 183 | 222 | let migrator = sqlx::migrate::Migrator::new(dir).await?; | |
| @@ -205,11 +244,21 @@ async fn boot_smoke(ctx: &GateCtx) -> Result<GateOutcome> { | |||
| 205 | 244 | // seconds without exiting. Panics in main, missing config, port-bind | |
| 206 | 245 | // failures show up here. Anything more ambitious (probing /healthz on a | |
| 207 | 246 | // real port) needs server config we don't generically know. | |
| 208 | - | let mut child = match tokio::process::Command::new(&bin) | |
| 209 | - | .env("SANDO_BOOT_SMOKE", "1") | |
| 210 | - | .kill_on_drop(true) | |
| 211 | - | .spawn() | |
| 212 | - | { | |
| 247 | + | // | |
| 248 | + | // The server requires DATABASE_URL or it panics on config load before | |
| 249 | + | // we can observe anything. We point it at the scratch DB (already | |
| 250 | + | // migrated by the build step and refreshed by migration_dry_run if | |
| 251 | + | // that gate ran first). SCAN_ENABLED=false skips loading YARA rules | |
| 252 | + | // from /opt/makenotwork/yara-rules, which doesn't exist on the build | |
| 253 | + | // host. Other config has sane optional defaults. | |
| 254 | + | let mut cmd = tokio::process::Command::new(&bin); | |
| 255 | + | cmd.env("SANDO_BOOT_SMOKE", "1") | |
| 256 | + | .env("SCAN_ENABLED", "false") | |
| 257 | + | .kill_on_drop(true); | |
| 258 | + | if let Some(scratch_url) = ctx.cfg.scratch_db_url.as_deref() { | |
| 259 | + | cmd.env("DATABASE_URL", scratch_url); | |
| 260 | + | } | |
| 261 | + | let mut child = match cmd.spawn() { | |
| 213 | 262 | Ok(c) => c, | |
| 214 | 263 | Err(e) => return Ok(GateOutcome { passed: false, detail: Some(format!("spawn: {e}")) }), | |
| 215 | 264 | }; | |
| @@ -279,3 +328,59 @@ fn tail(buf: &[u8], max: usize) -> String { | |||
| 279 | 328 | let s = String::from_utf8_lossy(buf); | |
| 280 | 329 | if s.len() <= max { s.into_owned() } else { format!("...{}", &s[s.len() - max..]) } | |
| 281 | 330 | } | |
| 331 | + | ||
| 332 | + | #[cfg(test)] | |
| 333 | + | mod tests { | |
| 334 | + | use super::*; | |
| 335 | + | ||
| 336 | + | /// reset_scratch must drop every non-system schema, not just `public` — | |
| 337 | + | /// otherwise migrations that create custom schemas (e.g. tower_sessions) | |
| 338 | + | /// collide on the next run. This regressed once (Phase 0) and the fix is | |
| 339 | + | /// load-bearing for migration_dry_run. | |
| 340 | + | /// | |
| 341 | + | /// Gated on `SANDO_TEST_PG_URL` so it only runs where postgres is | |
| 342 | + | /// available. Set `SANDO_TEST_PG_URL=postgres:///sando_scratch?host=/var/run/postgresql` | |
| 343 | + | /// (or similar) before `cargo test`. | |
| 344 | + | #[tokio::test] | |
| 345 | + | async fn reset_scratch_drops_all_non_system_schemas() { | |
| 346 | + | let Ok(url) = std::env::var("SANDO_TEST_PG_URL") else { | |
| 347 | + | eprintln!("skipping: SANDO_TEST_PG_URL not set"); | |
| 348 | + | return; | |
| 349 | + | }; | |
| 350 | + | use sqlx::Executor; | |
| 351 | + | use sqlx::postgres::PgPoolOptions; | |
| 352 | + | ||
| 353 | + | let pool = PgPoolOptions::new().max_connections(1).connect(&url).await.unwrap(); | |
| 354 | + | // Plant two non-system schemas + a table in each. | |
| 355 | + | pool.execute("DROP SCHEMA IF EXISTS foo CASCADE; CREATE SCHEMA foo; CREATE TABLE foo.t (i int);") | |
| 356 | + | .await.unwrap(); | |
| 357 | + | pool.execute("DROP SCHEMA IF EXISTS tower_sessions CASCADE; CREATE SCHEMA tower_sessions; CREATE TABLE tower_sessions.session (id text);") | |
| 358 | + | .await.unwrap(); | |
| 359 | + | pool.close().await; | |
| 360 | + | ||
| 361 | + | reset_scratch(&url).await.expect("reset_scratch"); | |
| 362 | + | ||
| 363 | + | let pool = PgPoolOptions::new().max_connections(1).connect(&url).await.unwrap(); | |
| 364 | + | let rows: Vec<(String,)> = sqlx::query_as( | |
| 365 | + | "SELECT nspname FROM pg_namespace WHERE nspname NOT LIKE 'pg_%' AND nspname <> 'information_schema'", | |
| 366 | + | ) | |
| 367 | + | .fetch_all(&pool) | |
| 368 | + | .await | |
| 369 | + | .unwrap(); | |
| 370 | + | let names: Vec<String> = rows.into_iter().map(|(s,)| s).collect(); | |
| 371 | + | // After reset, only `public` should remain among non-system schemas. | |
| 372 | + | assert_eq!(names, vec!["public".to_string()], "got: {names:?}"); | |
| 373 | + | pool.close().await; | |
| 374 | + | } | |
| 375 | + | ||
| 376 | + | /// Sanity: running MNW migrations from a *non-existent* dir errors rather | |
| 377 | + | /// than silently no-op'ing. Needs no postgres: whichever step fails first, | |
| 378 | + | /// the pool connect or `Migrator::new(dir)`, surfaces as `Err`. | |
| 379 | + | #[tokio::test] | |
| 380 | + | async fn run_migrator_errors_on_missing_dir() { | |
| 381 | + | // run_migrator connects the pool before calling `Migrator::new(dir)`, so | |
| 382 | + | // this only pins "returns Err", not which of the two steps failed. | |
| 383 | + | let res = run_migrator("postgres:///does-not-matter", std::path::Path::new("/nonexistent/sando-test-migrations")).await; | |
| 384 | + | assert!(res.is_err()); | |
| 385 | + | } | |
| 386 | + | } |
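
One thing this diff does not touch: `tail` (context lines near the end of this file) slices at `s.len() - max`, which is a byte offset. `String::from_utf8_lossy` can yield multi-byte characters, including U+FFFD for invalid input, and slicing a `str` off a character boundary panics. A possible hardening is sketched below under the assumption that dropping a few extra bytes from the front of the tail is acceptable; `tail_safe` is a hypothetical name, not a function in the repository.

```rust
/// Hypothetical variant of `tail` that never slices inside a multi-byte character.
fn tail_safe(buf: &[u8], max: usize) -> String {
    let s = String::from_utf8_lossy(buf);
    if s.len() <= max {
        return s.into_owned();
    }
    // Start where the original does, then move forward to the next char boundary.
    // `is_char_boundary(s.len())` is true, so the loop always terminates.
    let mut start = s.len() - max;
    while !s.is_char_boundary(start) {
        start += 1;
    }
    format!("...{}", &s[start..])
}
```

The result may be slightly shorter than `max` bytes when the cut lands mid-character, which should be harmless for a log excerpt stored in `detail`.
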
| @@ -9,6 +9,7 @@ mod config; | |||
| 9 | 9 | mod db; | |
| 10 | 10 | mod deploy; | |
| 11 | 11 | mod error; | |
| 12 | + | mod events; | |
| 12 | 13 | mod gates; | |
| 13 | 14 | mod git; | |
| 14 | 15 | mod metrics; | |
| @@ -44,7 +45,14 @@ async fn main() -> Result<()> { | |||
| 44 | 45 | ||
| 45 | 46 | let prom = metrics::init(); | |
| 46 | 47 | let addr: SocketAddr = cfg.listen.parse()?; | |
| 47 | - | let app_state = state::AppState { pool, topo, cfg, prom }; | |
| 48 | + | let app_state = state::AppState { | |
| 49 | + | pool, | |
| 50 | + | topo, | |
| 51 | + | cfg, | |
| 52 | + | prom, | |
| 53 | + | active_build: Arc::new(tokio::sync::Mutex::new(None)), | |
| 54 | + | events: events::channel(), | |
| 55 | + | }; | |
| 48 | 56 | let app = routes::router(app_state); | |
| 49 | 57 | tracing::info!(%addr, "sando daemon listening"); | |
| 50 | 58 | let listener = tokio::net::TcpListener::bind(addr).await?; |
| @@ -14,6 +14,7 @@ pub fn router(state: AppState) -> Router { | |||
| 14 | 14 | .route("/promote/{tier}", post(promote)) | |
| 15 | 15 | .route("/rollback/{tier}", post(rollback)) | |
| 16 | 16 | .route("/rebuild", post(rebuild)) | |
| 17 | + | .route("/confirm/{tier}", post(confirm)) | |
| 17 | 18 | .route("/backup/fetch", post(backup_fetch)) | |
| 18 | 19 | .route("/events", get(events_ws)) | |
| 19 | 20 | .with_state(state) | |
| @@ -67,8 +68,25 @@ async fn get_state(State(s): State<AppState>) -> Result<Json<StateView>> { | |||
| 67 | 68 | .fetch_all(&s.pool) | |
| 68 | 69 | .await?; | |
| 69 | 70 | ||
| 70 | - | let gates: Vec<GateView> = if let Some(ver) = current_version.as_ref() { | |
| 71 | - | // Most recent gate_runs row per gate_kind for (tier, current_version). | |
| 71 | + | // Surface gates for current_version when set, otherwise for the most | |
| 72 | + | // recently attempted version on this tier. Without the fallback, a | |
| 73 | + | // tier that has never gone green (MM after a build failure, B before | |
| 74 | + | // first deploy) exposes no gate detail via /state — debugging required | |
| 75 | + | // SSH and direct SQLite access. See sando todo: gate observability. | |
| 76 | + | let gate_version: Option<String> = if current_version.is_some() { | |
| 77 | + | current_version.clone() | |
| 78 | + | } else { | |
| 79 | + | sqlx::query_scalar( | |
| 80 | + | "SELECT version FROM gate_runs WHERE tier = ? | |
| 81 | + | ORDER BY id DESC LIMIT 1", | |
| 82 | + | ) | |
| 83 | + | .bind(&name) | |
| 84 | + | .fetch_optional(&s.pool) | |
| 85 | + | .await? | |
| 86 | + | }; | |
| 87 | + | ||
| 88 | + | let gates: Vec<GateView> = if let Some(ver) = gate_version.as_ref() { | |
| 89 | + | // Most recent gate_runs row per gate_kind for (tier, ver). | |
| 72 | 90 | sqlx::query( | |
| 73 | 91 | "SELECT gate_kind, passed, finished_at, detail | |
| 74 | 92 | FROM gate_runs g | |
| @@ -109,9 +127,12 @@ async fn get_state(State(s): State<AppState>) -> Result<Json<StateView>> { | |||
| 109 | 127 | Ok(Json(StateView { tiers })) | |
| 110 | 128 | } | |
| 111 | 129 | ||
| 112 | - | #[derive(Deserialize)] | |
| 130 | + | #[derive(Deserialize, Default)] | |
| 113 | 131 | struct PromoteBody { | |
| 114 | - | version: String, | |
| 132 | + | /// Optional. If absent, defaults to the predecessor tier's `current_version` | |
| 133 | + | /// (i.e. promote whatever just finished baking on the previous tier). | |
| 134 | + | #[serde(default)] | |
| 135 | + | version: Option<String>, | |
| 115 | 136 | #[serde(default)] | |
| 116 | 137 | hotfix: bool, | |
| 117 | 138 | #[serde(default)] | |
| @@ -121,8 +142,9 @@ struct PromoteBody { | |||
| 121 | 142 | async fn promote( | |
| 122 | 143 | State(s): State<AppState>, | |
| 123 | 144 | Path(tier): Path<String>, | |
| 124 | - | Json(body): Json<PromoteBody>, | |
| 145 | + | body: Option<Json<PromoteBody>>, | |
| 125 | 146 | ) -> Result<Json<serde_json::Value>> { | |
| 147 | + | let body = body.map(|Json(b)| b).unwrap_or_default(); | |
| 126 | 148 | let idx = s.topo.tiers.iter().position(|t| t.name == tier) | |
| 127 | 149 | .ok_or(crate::error::Error::NotFound)?; | |
| 128 | 150 | if idx == 0 { | |
| @@ -133,9 +155,24 @@ async fn promote( | |||
| 133 | 155 | let target = &s.topo.tiers[idx]; | |
| 134 | 156 | let source = &s.topo.tiers[idx - 1]; | |
| 135 | 157 | ||
| 158 | + | // Resolve version: explicit if given, else the source tier's current. | |
| 159 | + | let version = match body.version { | |
| 160 | + | Some(v) => v, | |
| 161 | + | None => sqlx::query_scalar::<_, Option<String>>( | |
| 162 | + | "SELECT current_version FROM tier_state WHERE tier = ?", | |
| 163 | + | ) | |
| 164 | + | .bind(&source.name) | |
| 165 | + | .fetch_optional(&s.pool).await | |
| 166 | + | .map_err(crate::error::Error::Db)? | |
| 167 | + | .flatten() | |
| 168 | + | .ok_or_else(|| crate::error::Error::GateBlocked( | |
| 169 | + | format!("no version specified and tier {} has no current_version", source.name), | |
| 170 | + | ))?, | |
| 171 | + | }; | |
| 172 | + | ||
| 136 | 173 | // 1. Predecessor must have all of its gates green for this version (with | |
| 137 | 174 | // optional hotfix override that skips burn_in). | |
| 138 | - | let pending = unsatisfied_gates(&s.pool, &source.name, &body.version, body.hotfix).await?; | |
| 175 | + | let pending = unsatisfied_gates(&s.pool, &source.name, &version, body.hotfix).await?; | |
| 139 | 176 | if !pending.is_empty() { | |
| 140 | 177 | return Err(crate::error::Error::GateBlocked(format!( | |
| 141 | 178 | "{} gate(s) not satisfied on tier {}: {}", | |
| @@ -149,7 +186,7 @@ async fn promote( | |||
| 149 | 186 | let bin: Option<(String,)> = sqlx::query_as( | |
| 150 | 187 | "SELECT artifact_path FROM versions WHERE version = ?", | |
| 151 | 188 | ) | |
| 152 | - | .bind(&body.version) | |
| 189 | + | .bind(&version) | |
| 153 | 190 | .fetch_optional(&s.pool) | |
| 154 | 191 | .await | |
| 155 | 192 | .map_err(crate::error::Error::Db)?; | |
| @@ -157,23 +194,49 @@ async fn promote( | |||
| 157 | 194 | return Err(crate::error::Error::NotFound); | |
| 158 | 195 | }; | |
| 159 | 196 | let bin_path = std::path::PathBuf::from(bin); | |
| 197 | + | // `artifact_path` is the primary binary; the staged release dir is its parent. | |
| 198 | + | let staged_dir = bin_path.parent() | |
| 199 | + | .ok_or_else(|| crate::error::Error::Other(anyhow::anyhow!("artifact_path has no parent")))? | |
| 200 | + | .to_path_buf(); | |
| 160 | 201 | ||
| 161 | 202 | // 3. Deploy to each node. Sequential canary is the only policy | |
| 162 | 203 | // implemented in v0; parallel is a one-line change once we trust the | |
| 163 | 204 | // sequential path. | |
| 164 | 205 | for node in &target.nodes { | |
| 165 | - | crate::deploy::deploy_node(node, &body.version, &bin_path) | |
| 166 | - | .await | |
| 167 | - | .map_err(crate::error::Error::Other)?; | |
| 168 | - | let now = chrono::Utc::now().to_rfc3339(); | |
| 206 | + | let started = chrono::Utc::now().to_rfc3339(); | |
| 207 | + | crate::events::emit(&s.events, crate::events::Event::DeployStart { | |
| 208 | + | tier: target.name.clone(), node: node.name.clone(), version: version.clone(), | |
| 209 | + | }); | |
| 210 | + | let result = crate::deploy::deploy_node(node, &version, &staged_dir, s.cfg.primary_bin()).await; | |
| 211 | + | let finished = chrono::Utc::now().to_rfc3339(); | |
| 212 | + | let (outcome, err_msg) = match &result { | |
| 213 | + | Ok(_) => ("ok", None), | |
| 214 | + | Err(e) => ("failed", Some(format!("{e:#}"))), | |
| 215 | + | }; | |
| 169 | 216 | sqlx::query( | |
| 170 | 217 | "INSERT INTO deploys (version, tier, node, started_at, finished_at, outcome, hotfix, reset_burn_in) | |
| 171 | - | VALUES (?, ?, ?, ?, ?, 'ok', ?, ?)", | |
| 218 | + | VALUES (?, ?, ?, ?, ?, ?, ?, ?)", | |
| 172 | 219 | ) | |
| 173 | - | .bind(&body.version).bind(&target.name).bind(&node.name) | |
| 174 | - | .bind(&now).bind(&now) | |
| 220 | + | .bind(&version).bind(&target.name).bind(&node.name) | |
| 221 | + | .bind(&started).bind(&finished).bind(outcome) | |
| 175 | 222 | .bind(body.hotfix as i64).bind(body.reset_burn_in as i64) | |
| 176 | 223 | .execute(&s.pool).await.map_err(crate::error::Error::Db)?; | |
| 224 | + | if let Err(e) = result { | |
| 225 | + | let msg = err_msg.unwrap_or_default(); | |
| 226 | + | tracing::error!( | |
| 227 | + | tier = %target.name, node = %node.name, version = %version, | |
| 228 | + | error = %msg, | |
| 229 | + | "deploy failed; current symlink left intact, tier_state not advanced" | |
| 230 | + | ); | |
| 231 | + | crate::events::emit(&s.events, crate::events::Event::DeployFailed { | |
| 232 | + | tier: target.name.clone(), node: node.name.clone(), | |
| 233 | + | version: version.clone(), error: msg, | |
| 234 | + | }); | |
| 235 | + | return Err(crate::error::Error::Other(e)); | |
| 236 | + | } | |
| 237 | + | crate::events::emit(&s.events, crate::events::Event::DeployOk { | |
| 238 | + | tier: target.name.clone(), node: node.name.clone(), version: version.clone(), | |
| 239 | + | }); | |
| 177 | 240 | } | |
| 178 | 241 | ||
| 179 | 242 | // 4. Advance tier_state. burn_in_started_at is set to now so the target | |
| @@ -189,7 +252,7 @@ async fn promote( | |||
| 189 | 252 | WHERE tier = ?", | |
| 190 | 253 | ) | |
| 191 | 254 | .bind(prev) | |
| 192 | - | .bind(&body.version) | |
| 255 | + | .bind(&version) | |
| 193 | 256 | .bind(chrono::Utc::now().to_rfc3339()) | |
| 194 | 257 | .bind(&target.name) | |
| 195 | 258 | .execute(&s.pool).await.map_err(crate::error::Error::Db)?; | |
| @@ -200,15 +263,18 @@ async fn promote( | |||
| 200 | 263 | .execute(&s.pool).await.map_err(crate::error::Error::Db)?; | |
| 201 | 264 | } | |
| 202 | 265 | ||
| 266 | + | crate::events::emit(&s.events, crate::events::Event::PromoteComplete { | |
| 267 | + | tier: target.name.clone(), version: version.clone(), | |
| 268 | + | }); | |
| 203 | 269 | tracing::info!( | |
| 204 | - | version = %body.version, tier = %target.name, | |
| 270 | + | version = %version, tier = %target.name, | |
| 205 | 271 | hotfix = body.hotfix, reset_burn_in = body.reset_burn_in, | |
| 206 | 272 | "promote complete", | |
| 207 | 273 | ); | |
| 208 | 274 | ||
| 209 | 275 | Ok(Json(serde_json::json!({ | |
| 210 | 276 | "tier": target.name, | |
| 211 | - | "version": body.version, | |
| 277 | + | "version": version, | |
| 212 | 278 | "nodes_deployed": target.nodes.iter().map(|n| n.name.clone()).collect::<Vec<_>>(), | |
| 213 | 279 | }))) | |
| 214 | 280 | } | |
| @@ -275,9 +341,12 @@ async fn rollback( | |||
| 275 | 341 | )); | |
| 276 | 342 | }; | |
| 277 | 343 | let bin_path = std::path::PathBuf::from(bin); | |
| 344 | + | let staged_dir = bin_path.parent() | |
| 345 | + | .ok_or_else(|| crate::error::Error::Other(anyhow::anyhow!("artifact_path has no parent")))? | |
| 346 | + | .to_path_buf(); | |
| 278 | 347 | ||
| 279 | 348 | for node in &target.nodes { | |
| 280 | - | crate::deploy::deploy_node(node, &previous, &bin_path) | |
| 349 | + | crate::deploy::deploy_node(node, &previous, &staged_dir, s.cfg.primary_bin()) | |
| 281 | 350 | .await | |
| 282 | 351 | .map_err(crate::error::Error::Other)?; | |
| 283 | 352 | } | |
| @@ -292,6 +361,9 @@ async fn rollback( | |||
| 292 | 361 | .execute(&s.pool).await.map_err(crate::error::Error::Db)?; | |
| 293 | 362 | ||
| 294 | 363 | tracing::warn!(tier = %tier, from = %current, to = %previous, "rollback complete"); | |
| 364 | + | crate::events::emit(&s.events, crate::events::Event::Rollback { | |
| 365 | + | tier: tier.clone(), from: current.clone(), to: previous.clone(), | |
| 366 | + | }); | |
| 295 | 367 | ||
| 296 | 368 | Ok(Json(serde_json::json!({ | |
| 297 | 369 | "tier": tier, | |
| @@ -323,24 +395,80 @@ async fn rebuild( | |||
| 323 | 395 | }; | |
| 324 | 396 | ||
| 325 | 397 | tracing::info!(sha = %sha, "rebuild requested"); | |
| 398 | + | crate::events::emit(&s.events, crate::events::Event::RebuildRequested { sha: sha.clone() }); | |
| 399 | + | ||
| 400 | + | // Latest /rebuild wins: abort any in-flight build before spawning a new | |
| 401 | + | // one. Aborting drops the spawned task's future, which drops any | |
| 402 | + | // tokio::process::Child it owns; with `kill_on_drop(true)` set on the | |
| 403 | + | // cargo Command, SIGKILL propagates to cargo + its rustc children. | |
| 404 | + | let mut slot = s.active_build.lock().await; | |
| 405 | + | if let Some(prev) = slot.take() { | |
| 406 | + | if !prev.is_finished() { | |
| 407 | + | tracing::warn!("aborting in-flight build for newer /rebuild request"); | |
| 408 | + | crate::events::emit(&s.events, crate::events::Event::BuildAborted { sha_aborted: sha.clone() }); | |
| 409 | + | prev.abort(); | |
| 410 | + | } | |
| 411 | + | } | |
| 326 | 412 | ||
| 327 | 413 | let pool = s.pool.clone(); | |
| 328 | 414 | let cfg = s.cfg.clone(); | |
| 329 | 415 | let topo = s.topo.clone(); | |
| 416 | + | let events_for_task = s.events.clone(); | |
| 330 | 417 | let sha_for_task = sha.clone(); | |
| 331 | - | tokio::spawn(async move { | |
| 332 | - | if let Err(e) = crate::build::build_and_run_mm(pool, cfg, topo, sha_for_task.clone()).await { | |
| 418 | + | let handle = tokio::spawn(async move { | |
| 419 | + | if let Err(e) = crate::build::build_and_run_mm(pool, cfg, topo, sha_for_task.clone(), events_for_task).await { | |
| 333 | 420 | tracing::error!(sha = %sha_for_task, error = %e, "rebuild pipeline failed"); | |
| 334 | 421 | } | |
| 335 | 422 | }); | |
| 423 | + | *slot = Some(handle.abort_handle()); | |
| 336 | 424 | ||
| 337 | 425 | Ok(Json(serde_json::json!({ "accepted": true, "sha": sha }))) | |
| 338 | 426 | } | |
| 339 | 427 | ||
| 428 | + | async fn confirm( | |
| 429 | + | State(s): State<AppState>, | |
| 430 | + | Path(tier): Path<String>, | |
| 431 | + | ) -> Result<Json<serde_json::Value>> { | |
| 432 | + | // Operator-driven satisfaction of a `manual_confirm` gate. Looks up the | |
| 433 | + | // tier's own `current_version` in tier_state (an error if it is unset) and | |
| 434 | + | // inserts a passing gate_runs row so /promote can advance. | |
| 435 | + | let target = s.topo.tiers.iter().find(|t| t.name == tier) | |
| 436 | + | .ok_or(crate::error::Error::NotFound)?; | |
| 437 | + | ||
| 438 | + | let version: Option<String> = sqlx::query_scalar( | |
| 439 | + | "SELECT current_version FROM tier_state WHERE tier = ?", | |
| 440 | + | ) | |
| 441 | + | .bind(&target.name) | |
| 442 | + | .fetch_optional(&s.pool).await.map_err(crate::error::Error::Db)?.flatten(); | |
| 443 | + | let version = version.ok_or_else(|| crate::error::Error::GateBlocked( | |
| 444 | + | format!("tier {tier} has no current_version; nothing to confirm"), | |
| 445 | + | ))?; | |
| 446 | + | ||
| 447 | + | let now = chrono::Utc::now().to_rfc3339(); | |
| 448 | + | sqlx::query( | |
| 449 | + | "INSERT INTO gate_runs (version, tier, gate_kind, started_at, finished_at, passed, detail) | |
| 450 | + | VALUES (?, ?, 'manual_confirm', ?, ?, 1, 'operator confirmed via POST /confirm')", | |
| 451 | + | ) | |
| 452 | + | .bind(&version).bind(&target.name).bind(&now).bind(&now) | |
| 453 | + | .execute(&s.pool).await.map_err(crate::error::Error::Db)?; | |
| 454 | + | ||
| 455 | + | tracing::info!(tier = %tier, version = %version, "manual_confirm recorded"); | |
| 456 | + | crate::events::emit(&s.events, crate::events::Event::ManualConfirm { | |
| 457 | + | tier: tier.clone(), | |
| 458 | + | version: version.clone(), | |
| 459 | + | }); | |
| 460 | + | ||
| 461 | + | Ok(Json(serde_json::json!({ "tier": tier, "version": version }))) | |
| 462 | + | } | |
| 463 | + | ||
| 340 | 464 | async fn backup_fetch(State(s): State<AppState>) -> Result<Json<serde_json::Value>> { | |
| 341 | 465 | let fb = crate::backup::fetch(&s.pool, &s.cfg, &s.topo) | |
| 342 | 466 | .await | |
| 343 | 467 | .map_err(crate::error::Error::Other)?; | |
| 468 | + | crate::events::emit(&s.events, crate::events::Event::BackupFetched { | |
| 469 | + | source: fb.source.clone(), | |
| 470 | + | byte_size: fb.byte_size.unwrap_or(0), | |
| 471 | + | }); | |
| 344 | 472 | Ok(Json(serde_json::json!({ | |
| 345 | 473 | "source": fb.source, | |
| 346 | 474 | "local_path": fb.local_path, | |
| @@ -348,8 +476,381 @@ async fn backup_fetch(State(s): State<AppState>) -> Result<Json<serde_json::Valu | |||
| 348 | 476 | }))) | |
| 349 | 477 | } | |
| 350 | 478 | ||
| 351 | - | async fn events_ws(ws: WebSocketUpgrade, State(_s): State<AppState>) -> impl IntoResponse { | |
| 352 | - | ws.on_upgrade(|_socket| async move { | |
| 353 | - | // tail of deploy/gate events for the TUI | |
| 479 | + | async fn events_ws(ws: WebSocketUpgrade, State(s): State<AppState>) -> impl IntoResponse { | |
| 480 | + | use axum::extract::ws::Message; | |
| 481 | + | use tokio::sync::broadcast::error::RecvError; | |
| 482 | + | ||
| 483 | + | ws.on_upgrade(move |mut socket| async move { | |
| 484 | + | let mut rx = s.events.subscribe(); | |
| 485 | + | loop { | |
| 486 | + | match rx.recv().await { | |
| 487 | + | Ok(env) => { | |
| 488 | + | let json = match serde_json::to_string(&env) { | |
| 489 | + | Ok(s) => s, | |
| 490 | + | Err(e) => { | |
| 491 | + | tracing::warn!(error = %e, "events ws: serialize failed"); | |
| 492 | + | continue; | |
| 493 | + | } | |
| 494 | + | }; | |
| 495 | + | if socket.send(Message::Text(json.into())).await.is_err() { | |
| 496 | + | break; | |
| 497 | + | } | |
| 498 | + | } | |
| 499 | + | Err(RecvError::Lagged(n)) => { | |
| 500 | + | let _ = socket.send(Message::Text( | |
| 501 | + | format!(r#"{{"kind":"lagged","skipped":{n}}}"#).into(), | |
| 502 | + | )).await; | |
| 503 | + | } | |
| 504 | + | Err(RecvError::Closed) => break, | |
| 505 | + | } | |
| 506 | + | } | |
| 354 | 507 | }) | |
| 355 | 508 | } | |
| 509 | + | ||
| 510 | + | #[cfg(test)] | |
| 511 | + | mod tests { | |
| 512 | + | use super::*; | |
| 513 | + | use crate::config::Config; | |
| 514 | + | use crate::topology::{BackupConfig, CanaryPolicy, Gate, Node, RepoConfig, Tier, Topology}; | |
| 515 | + | use axum::body::Body; | |
| 516 | + | use axum::http::{Request, StatusCode}; | |
| 517 | + | use http_body_util::BodyExt; | |
| 518 | + | use metrics_exporter_prometheus::PrometheusBuilder; | |
| 519 | + | use sqlx::sqlite::SqlitePoolOptions; | |
| 520 | + | use sqlx::SqlitePool; | |
| 521 | + | use std::path::PathBuf; | |
| 522 | + | use std::sync::Arc; | |
| 523 | + | use tower::ServiceExt; | |
| 524 | + | ||
| 525 | + | async fn fresh_pool() -> SqlitePool { | |
| 526 | + | let pool = SqlitePoolOptions::new() | |
| 527 | + | .max_connections(1) | |
| 528 | + | .connect("sqlite::memory:") | |
| 529 | + | .await | |
| 530 | + | .unwrap(); | |
| 531 | + | sqlx::migrate!("./migrations").run(&pool).await.unwrap(); | |
| 532 | + | pool | |
| 533 | + | } | |
| 534 | + | ||
| 535 | + | /// Two-tier topology used by the route tests: mm (provisioned, no nodes) | |
| 536 | + | /// → a (provisioned, one local node). Mirrors the production shape | |
| 537 | + | /// without involving real ssh / postgres. | |
| 538 | + | fn test_topo() -> Topology { | |
| 539 | + | Topology { | |
| 540 | + | repo: RepoConfig { bare_path: "/tmp/test.git".into(), branch: "main".into() }, | |
| 541 | + | backup: BackupConfig { | |
| 542 | + | source: "file:///tmp/test-backup.sql".into(), | |
| 543 | + | local_path: "/tmp/local-backup.sql".into(), | |
| 544 | + | }, | |
| 545 | + | tiers: vec![ | |
| 546 | + | Tier { | |
| 547 | + | name: "mm".into(), | |
| 548 | + | provisioned: true, | |
| 549 | + | gates: vec![], | |
| 550 | + | canary: CanaryPolicy::Sequential, | |
| 551 | + | nodes: vec![], | |
| 552 | + | }, | |
| 553 | + | Tier { | |
| 554 | + | name: "a".into(), | |
| 555 | + | provisioned: true, | |
| 556 | + | gates: vec![Gate::BootSmoke], | |
| 557 | + | canary: CanaryPolicy::Sequential, | |
| 558 | + | nodes: vec![Node { | |
| 559 | + | name: "a-local".into(), | |
| 560 | + | ssh_target: "local".into(), | |
| 561 | + | release_root: "/tmp/a-node".into(), | |
| 562 | + | service_name: "makenotwork.service".into(), | |
| 563 | + | }], | |
| 564 | + | }, | |
| 565 | + | ], | |
| 566 | + | } | |
| 567 | + | } | |
| 568 | + | ||
| 569 | + | fn test_cfg() -> Config { | |
| 570 | + | Config { | |
| 571 | + | listen: "127.0.0.1:0".into(), | |
| 572 | + | db_path: PathBuf::from(":memory:"), | |
| 573 | + | topology_path: PathBuf::from("/tmp/test-sando.toml"), | |
| 574 | + | workdir: PathBuf::from("/tmp/sando-work"), | |
| 575 | + | release_root: PathBuf::from("/tmp/sando-releases"), | |
| 576 | + | scratch_db_url: None, | |
| 577 | + | bin_names: vec!["makenotwork".into()], | |
| 578 | + | } | |
| 579 | + | } | |
| 580 | + | ||
| 581 | + | async fn test_state() -> AppState { | |
| 582 | + | let pool = fresh_pool().await; | |
| 583 | + | // Seed tier rows so FKs on tier_state / gate_runs are satisfied. | |
| 584 | + | for (i, name) in ["mm", "a"].iter().enumerate() { | |
| 585 | + | sqlx::query( | |
| 586 | + | "INSERT INTO tiers (name, ord, provisioned, canary) VALUES (?, ?, 1, 'sequential')", | |
| 587 | + | ) | |
| 588 | + | .bind(name).bind(i as i64).execute(&pool).await.unwrap(); | |
| 589 | + | sqlx::query("INSERT INTO tier_state (tier) VALUES (?)") | |
| 590 | + | .bind(name).execute(&pool).await.unwrap(); | |
| 591 | + | } | |
| 592 | + | // Don't call install_recorder in tests — it touches a process-global | |
| 593 | + | // and conflicts when tests run in parallel. | |
| 594 | + | let prom = PrometheusBuilder::new().build_recorder().handle(); | |
| 595 | + | AppState { | |
| 596 | + | pool, | |
| 597 | + | topo: Arc::new(test_topo()), | |
| 598 | + | cfg: Arc::new(test_cfg()), | |
| 599 | + | prom, | |
| 600 | + | active_build: Arc::new(tokio::sync::Mutex::new(None)), | |
| 601 | + | events: crate::events::channel(), | |
| 602 | + | } | |
| 603 | + | } | |
| 604 | + | ||
| 605 | + | async fn body_string(resp: axum::response::Response) -> String { | |
| 606 | + | let bytes = resp.into_body().collect().await.unwrap().to_bytes(); | |
| 607 | + | String::from_utf8(bytes.to_vec()).unwrap() | |
| 608 | + | } | |
| 609 | + | ||
| 610 | + | /// Insert the FK prerequisites for inserting gate_runs/tier_state rows. | |
| 611 | + | async fn seed(pool: &SqlitePool, tier: &str, version: &str) { | |
| 612 | + | sqlx::query("INSERT INTO tiers (name, ord, provisioned, canary) VALUES (?, 0, 1, 'sequential') ON CONFLICT DO NOTHING") | |
| 613 | + | .bind(tier).execute(pool).await.unwrap(); | |
| 614 | + | sqlx::query("INSERT INTO versions (version, git_sha, built_at, artifact_path) VALUES (?, 'sha', datetime('now'), '/tmp/x') ON CONFLICT DO NOTHING") | |
| 615 | + | .bind(version).execute(pool).await.unwrap(); | |
| 616 | + | sqlx::query("INSERT INTO tier_state (tier, current_version) VALUES (?, NULL) ON CONFLICT DO NOTHING") | |
| 617 | + | .bind(tier).execute(pool).await.unwrap(); | |
| 618 | + | } | |
| 619 | + | ||
| 620 | + | async fn insert_gate(pool: &SqlitePool, tier: &str, version: &str, kind: &str, passed: i64) { | |
| 621 | + | sqlx::query( | |
| 622 | + | "INSERT INTO gate_runs (version, tier, gate_kind, started_at, finished_at, passed) \ | |
| 623 | + | VALUES (?, ?, ?, datetime('now'), datetime('now'), ?)", | |
| 624 | + | ) | |
| 625 | + | .bind(version).bind(tier).bind(kind).bind(passed) | |
| 626 | + | .execute(pool).await.unwrap(); | |
| 627 | + | } | |
| 628 | + | ||
| 629 | + | // ---- unsatisfied_gates ---- | |
| 630 | + | ||
| 631 | + | #[tokio::test] | |
| 632 | + | async fn unsatisfied_gates_empty_when_no_runs() { | |
| 633 | + | // No gate_runs rows means there's nothing to check — caller treats | |
| 634 | + | // empty as "all green" which is correct iff the predecessor tier | |
| 635 | + | // has no configured gates. The topology validation is upstream. | |
| 636 | + | let pool = fresh_pool().await; | |
| 637 | + | seed(&pool, "mm", "0.8.12").await; | |
| 638 | + | let pending = unsatisfied_gates(&pool, "mm", "0.8.12", false).await.unwrap(); | |
| 639 | + | assert_eq!(pending, Vec::<String>::new()); | |
| 640 | + | } | |
| 641 | + | ||
| 642 | + | #[tokio::test] | |
| 643 | + | async fn unsatisfied_gates_flags_failed_kind() { | |
| 644 | + | let pool = fresh_pool().await; | |
| 645 | + | seed(&pool, "mm", "0.8.12").await; | |
| 646 | + | insert_gate(&pool, "mm", "0.8.12", "cargo_test", 0).await; | |
| 647 | + | insert_gate(&pool, "mm", "0.8.12", "boot_smoke", 1).await; | |
| 648 | + | let pending = unsatisfied_gates(&pool, "mm", "0.8.12", false).await.unwrap(); | |
| 649 | + | assert_eq!(pending, vec!["cargo_test".to_string()]); | |
| 650 | + | } | |
| 651 | + | ||
| 652 | + | #[tokio::test] | |
| 653 | + | async fn unsatisfied_gates_latest_row_wins() { | |
| 654 | + | // Two runs of the same gate; only the latest counts. A flap from | |
| 655 | + | // red to green should clear the pending entry. | |
| 656 | + | let pool = fresh_pool().await; | |
| 657 | + | seed(&pool, "mm", "0.8.12").await; | |
| 658 | + | insert_gate(&pool, "mm", "0.8.12", "cargo_test", 0).await; | |
| 659 | + | insert_gate(&pool, "mm", "0.8.12", "cargo_test", 1).await; | |
| 660 | + | let pending = unsatisfied_gates(&pool, "mm", "0.8.12", false).await.unwrap(); | |
| 661 | + | assert!(pending.is_empty()); | |
| 662 | + | } | |
| 663 | + | ||
| 664 | + | #[tokio::test] | |
| 665 | + | async fn unsatisfied_gates_hotfix_skips_only_burn_in() { | |
| 666 | + | // hotfix=true is supposed to bypass burn_in failures specifically — | |
| 667 | + | // not cargo_test, not boot_smoke. Lock the semantics so a future | |
| 668 | + | // rename doesn't accidentally widen it. | |
| 669 | + | let pool = fresh_pool().await; | |
| 670 | + | seed(&pool, "a", "0.8.12").await; | |
| 671 | + | insert_gate(&pool, "a", "0.8.12", "burn_in", 0).await; | |
| 672 | + | insert_gate(&pool, "a", "0.8.12", "cargo_test", 0).await; | |
| 673 | + | ||
| 674 | + | let normal = unsatisfied_gates(&pool, "a", "0.8.12", false).await.unwrap(); | |
| 675 | + | let mut sorted = normal.clone(); | |
| 676 | + | sorted.sort(); | |
| 677 | + | assert_eq!(sorted, vec!["burn_in".to_string(), "cargo_test".to_string()]); | |
| 678 | + | ||
| 679 | + | let with_hotfix = unsatisfied_gates(&pool, "a", "0.8.12", true).await.unwrap(); | |
| 680 | + | assert_eq!(with_hotfix, vec!["cargo_test".to_string()]); | |
| 681 | + | } | |
| 682 | + | ||
| 683 | + | #[tokio::test] | |
| 684 | + | async fn unsatisfied_gates_ignores_other_tiers_and_versions() { | |
| 685 | + | let pool = fresh_pool().await; | |
| 686 | + | seed(&pool, "mm", "0.8.12").await; | |
| 687 | + | seed(&pool, "mm", "0.8.11").await; | |
| 688 | + | seed(&pool, "a", "0.8.12").await; | |
| 689 | + | // Mark mm/0.8.12 cargo_test failing, but unrelated tiers/versions | |
| 690 | + | // shouldn't pollute the query. | |
| 691 | + | insert_gate(&pool, "mm", "0.8.12", "cargo_test", 0).await; | |
| 692 | + | insert_gate(&pool, "a", "0.8.12", "cargo_test", 0).await; | |
| 693 | + | insert_gate(&pool, "mm", "0.8.11", "cargo_test", 0).await; |
Lines truncated
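The `unsatisfied_gates` tests above pin down three rules: only rows for the requested tier and version count, the latest run of each gate kind wins, and `hotfix` skips `burn_in` failures only. The body of `unsatisfied_gates` is not shown in this diff, so the following is a minimal in-memory sketch of those rules inferred from the tests, not the SQL the daemon runs. `GateRun` and `unsatisfied` are hypothetical names, and the slice is assumed to be ordered oldest first.

```rust
use std::collections::HashMap;

/// Stand-in for one `gate_runs` row (hypothetical type, for illustration only).
struct GateRun<'a> {
    tier: &'a str,
    version: &'a str,
    kind: &'a str,
    passed: bool,
}

/// Gate kinds whose most recent run for (tier, version) failed.
/// Assumes `runs` is in insertion order, oldest first.
fn unsatisfied(runs: &[GateRun], tier: &str, version: &str, hotfix: bool) -> Vec<String> {
    let mut latest: HashMap<&str, bool> = HashMap::new();
    for r in runs.iter().filter(|r| r.tier == tier && r.version == version) {
        // A later row for the same kind overwrites the earlier outcome.
        latest.insert(r.kind, r.passed);
    }
    let mut pending: Vec<String> = latest
        .into_iter()
        // hotfix bypasses burn_in failures only; every other failure still blocks.
        .filter(|(kind, passed)| !*passed && !(hotfix && *kind == "burn_in"))
        .map(|(kind, _)| kind.to_string())
        .collect();
    pending.sort(); // deterministic order for callers and tests
    pending
}

fn main() {
    let runs = [
        GateRun { tier: "a", version: "0.8.12", kind: "burn_in", passed: false },
        GateRun { tier: "a", version: "0.8.12", kind: "cargo_test", passed: false },
        GateRun { tier: "a", version: "0.8.12", kind: "boot_smoke", passed: false },
        GateRun { tier: "a", version: "0.8.12", kind: "boot_smoke", passed: true },
    ];
    println!("normal: {:?}", unsatisfied(&runs, "a", "0.8.12", false));
    println!("hotfix: {:?}", unsatisfied(&runs, "a", "0.8.12", true));
}
```

Sorting the result is a choice made for this sketch. The tests above sort the two-element result themselves before comparing, which suggests the real query does not guarantee an order.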
| @@ -1,8 +1,11 @@ | |||
| 1 | 1 | use crate::config::Config; | |
| 2 | + | use crate::events::EventTx; | |
| 2 | 3 | use crate::topology::Topology; | |
| 3 | 4 | use metrics_exporter_prometheus::PrometheusHandle; | |
| 4 | 5 | use sqlx::SqlitePool; | |
| 5 | 6 | use std::sync::Arc; | |
| 7 | + | use tokio::sync::Mutex; | |
| 8 | + | use tokio::task::AbortHandle; | |
| 6 | 9 | ||
| 7 | 10 | #[derive(Clone)] | |
| 8 | 11 | pub struct AppState { | |
| @@ -10,4 +13,10 @@ pub struct AppState { | |||
| 10 | 13 | pub topo: Arc<Topology>, | |
| 11 | 14 | pub cfg: Arc<Config>, | |
| 12 | 15 | pub prom: PrometheusHandle, | |
| 16 | + | /// Single-slot guard for the build pipeline. A new /rebuild aborts any | |
| 17 | + | /// in-flight build (cargo + gates) so the latest push always wins. | |
| 18 | + | pub active_build: Arc<Mutex<Option<AbortHandle>>>, | |
| 19 | + | /// Broadcast bus for live operator events. WS /events subscribes; all | |
| 20 | + | /// build/gate/deploy code sites emit on this. | |
| 21 | + | pub events: EventTx, | |
| 13 | 22 | } |
| @@ -151,6 +151,7 @@ mod tests { | |||
| 151 | 151 | name: name.into(), | |
| 152 | 152 | ssh_target: format!("deploy@{name}"), | |
| 153 | 153 | release_root: "/opt/mnw".into(), | |
| 154 | + | service_name: "makenotwork.service".into(), | |
| 154 | 155 | } | |
| 155 | 156 | } | |
| 156 | 157 |
| @@ -39,8 +39,14 @@ pub struct Node { | |||
| 39 | 39 | pub name: String, | |
| 40 | 40 | pub ssh_target: String, | |
| 41 | 41 | pub release_root: String, | |
| 42 | + | /// systemd unit name to reload-or-restart after the symlink swap. | |
| 43 | + | /// Defaults to "makenotwork.service" because that's MNW's prod unit. | |
| 44 | + | #[serde(default = "default_service_name")] | |
| 45 | + | pub service_name: String, | |
| 42 | 46 | } | |
| 43 | 47 | ||
| 48 | + | fn default_service_name() -> String { "makenotwork.service".into() } | |
| 49 | + | ||
| 44 | 50 | #[derive(Debug, Clone, Copy, Serialize, Deserialize, Default)] | |
| 45 | 51 | #[serde(rename_all = "snake_case")] | |
| 46 | 52 | pub enum CanaryPolicy { |
| @@ -0,0 +1,164 @@ | |||
| 1 | + | #!/usr/bin/env bash | |
| 2 | + | # Idempotent bootstrap for a fresh MNW node (tier A/B/C deploy target). | |
| 3 | + | # | |
| 4 | + | # Run on the new node as root. After this finishes, sandod on the Sando host | |
| 5 | + | # can rsync + deploy to <ssh_target>:/opt/mnw/. | |
| 6 | + | # | |
| 7 | + | # Required env: | |
| 8 | + | # SANDO_PUBKEY — sando user's public key on the Sando host. Get it via: | |
| 9 | + | # `ssh pop-os 'sudo cat /srv/sando/.ssh/id_ed25519.pub'` | |
| 10 | + | # | |
| 11 | + | # Optional env: | |
| 12 | + | # DEPLOY_ROOT — defaults to /opt/mnw | |
| 13 | + | # BIN_NAME — primary binary name (matches sando-daemon.toml's | |
| 14 | + | # bin_names[0]). Defaults to "makenotwork". | |
| 15 | + | # SERVICE_NAME — systemd unit name. Defaults to "makenotwork.service". | |
| 16 | + | # SERVICE_USER — runtime user for the binary. Defaults to "deploy". | |
| 17 | + | # ENABLE_FIREWALL — "1" to set up UFW (22/80/443). Defaults to "1". | |
| 18 | + | # INSTALL_CADDY — "1" to apt-install caddy (config is operator's job). | |
| 19 | + | # Defaults to "1". | |
| 20 | + | # INSTALL_POSTGRES — "1" to apt-install postgresql. Defaults to "1". | |
| 21 | + | # INSTALL_TAILSCALE — "1" to apt-install tailscale (NOT authenticated; | |
| 22 | + | # operator runs `tailscale up`). Defaults to "1". | |
| 23 | + | # | |
| 24 | + | # What this does NOT do (operator's job): | |
| 25 | + | # - tailscale up (auth) | |
| 26 | + | # - DNS records | |
| 27 | + | # - Caddyfile content + Cloudflare origin certs + private keys | |
| 28 | + | # - postgres role + db + .env / DATABASE_URL | |
| 29 | + | # - any secrets | |
| 30 | + | ||
| 31 | + | set -euo pipefail | |
| 32 | + | ||
| 33 | + | if [[ $EUID -ne 0 ]]; then | |
| 34 | + | echo "must run as root" >&2 | |
| 35 | + | exit 1 | |
| 36 | + | fi | |
| 37 | + | if [[ -z "${SANDO_PUBKEY:-}" ]]; then | |
| 38 | + | echo "SANDO_PUBKEY env var is required" >&2 | |
| 39 | + | exit 1 | |
| 40 | + | fi | |
| 41 | + | ||
| 42 | + | DEPLOY_ROOT="${DEPLOY_ROOT:-/opt/mnw}" | |
| 43 | + | BIN_NAME="${BIN_NAME:-makenotwork}" | |
| 44 | + | SERVICE_NAME="${SERVICE_NAME:-makenotwork.service}" | |
| 45 | + | SERVICE_USER="${SERVICE_USER:-deploy}" | |
| 46 | + | ENABLE_FIREWALL="${ENABLE_FIREWALL:-1}" | |
| 47 | + | INSTALL_CADDY="${INSTALL_CADDY:-1}" | |
| 48 | + | INSTALL_POSTGRES="${INSTALL_POSTGRES:-1}" | |
| 49 | + | INSTALL_TAILSCALE="${INSTALL_TAILSCALE:-1}" | |
| 50 | + | ||
| 51 | + | export DEBIAN_FRONTEND=noninteractive | |
| 52 | + | ||
| 53 | + | log() { echo "[bootstrap] $*"; } | |
| 54 | + | ||
| 55 | + | log "1/8 base packages" | |
| 56 | + | apt-get update -qq | |
| 57 | + | apt-get install -y -qq curl gnupg ca-certificates rsync ufw fail2ban > /dev/null | |
| 58 | + | ||
| 59 | + | if [[ "$INSTALL_POSTGRES" == "1" ]]; then | |
| 60 | + | log "2/8 postgresql" | |
| 61 | + | apt-get install -y -qq postgresql > /dev/null | |
| 62 | + | else | |
| 63 | + | log "2/8 skipping postgresql" | |
| 64 | + | fi | |
| 65 | + | ||
| 66 | + | if [[ "$INSTALL_TAILSCALE" == "1" ]]; then | |
| 67 | + | log "3/8 tailscale (not authenticating)" | |
| 68 | + | if ! command -v tailscale >/dev/null; then | |
| 69 | + | # Ubuntu codename. tailscale's repo is published per-codename; | |
| 70 | + | # noble (24.04) keys work on 24.04+ derivatives. | |
| 71 | + | codename=$(. /etc/os-release && echo "$VERSION_CODENAME") | |
| 72 | + | curl -fsSL "https://pkgs.tailscale.com/stable/ubuntu/${codename}.noarmor.gpg" \ | |
| 73 | + | > /usr/share/keyrings/tailscale-archive-keyring.gpg | |
| 74 | + | curl -fsSL "https://pkgs.tailscale.com/stable/ubuntu/${codename}.tailscale-keyring.list" \ | |
| 75 | + | > /etc/apt/sources.list.d/tailscale.list | |
| 76 | + | apt-get update -qq | |
| 77 | + | apt-get install -y -qq tailscale > /dev/null | |
| 78 | + | systemctl enable --now tailscaled | |
| 79 | + | fi | |
| 80 | + | else | |
| 81 | + | log "3/8 skipping tailscale" | |
| 82 | + | fi | |
| 83 | + | ||
| 84 | + | if [[ "$INSTALL_CADDY" == "1" ]]; then | |
| 85 | + | log "4/8 caddy (no Caddyfile — operator's job)" | |
| 86 | + | if ! command -v caddy >/dev/null; then | |
| 87 | + | curl -fsSL https://dl.cloudsmith.io/public/caddy/stable/gpg.key \ | |
| 88 | + | | gpg --dearmor -o /usr/share/keyrings/caddy-stable-archive-keyring.gpg | |
| 89 | + | curl -fsSL https://dl.cloudsmith.io/public/caddy/stable/debian.deb.txt \ | |
| 90 | + | > /etc/apt/sources.list.d/caddy-stable.list | |
| 91 | + | apt-get update -qq | |
| 92 | + | apt-get install -y -qq caddy > /dev/null | |
| 93 | + | fi | |
| 94 | + | else | |
| 95 | + | log "4/8 skipping caddy" | |
| 96 | + | fi | |
| 97 | + | ||
| 98 | + | log "5/8 deploy user + dirs" | |
| 99 | + | if ! id "$SERVICE_USER" &>/dev/null; then | |
| 100 | + | useradd -m -d "/home/$SERVICE_USER" -s /bin/bash "$SERVICE_USER" | |
| 101 | + | fi | |
| 102 | + | install -d -o "$SERVICE_USER" -g "$SERVICE_USER" -m 0700 "/home/$SERVICE_USER/.ssh" | |
| 103 | + | if ! grep -qF "$SANDO_PUBKEY" "/home/$SERVICE_USER/.ssh/authorized_keys" 2>/dev/null; then | |
| 104 | + | echo "$SANDO_PUBKEY" >> "/home/$SERVICE_USER/.ssh/authorized_keys" | |
| 105 | + | fi | |
| 106 | + | chown "$SERVICE_USER:$SERVICE_USER" "/home/$SERVICE_USER/.ssh/authorized_keys" | |
| 107 | + | chmod 0600 "/home/$SERVICE_USER/.ssh/authorized_keys" | |
| 108 | + | install -d -o "$SERVICE_USER" -g "$SERVICE_USER" -m 0755 "$DEPLOY_ROOT" "$DEPLOY_ROOT/releases" | |
| 109 | + | ||
| 110 | + | log "6/8 sudoers (systemctl on $SERVICE_NAME for $SERVICE_USER)" | |
| 111 | + | cat > "/etc/sudoers.d/${SERVICE_USER}-mnw" <<EOF | |
| 112 | + | $SERVICE_USER ALL=(ALL) NOPASSWD: /bin/systemctl reload-or-restart $SERVICE_NAME, /bin/systemctl restart $SERVICE_NAME, /bin/systemctl status $SERVICE_NAME | |
| 113 | + | EOF | |
| 114 | + | chmod 0440 "/etc/sudoers.d/${SERVICE_USER}-mnw" | |
| 115 | + | visudo -c -f "/etc/sudoers.d/${SERVICE_USER}-mnw" >/dev/null | |
| 116 | + | ||
| 117 | + | log "7/8 systemd unit ($SERVICE_NAME) — points at $DEPLOY_ROOT/current/$BIN_NAME" | |
| 118 | + | cat > "/etc/systemd/system/$SERVICE_NAME" <<EOF | |
| 119 | + | [Unit] | |
| 120 | + | Description=Makenotwork | |
| 121 | + | After=network.target | |
| 122 | + | ||
| 123 | + | [Service] | |
| 124 | + | Type=simple | |
| 125 | + | User=$SERVICE_USER | |
| 126 | + | Group=$SERVICE_USER | |
| 127 | + | WorkingDirectory=$DEPLOY_ROOT | |
| 128 | + | ExecStart=$DEPLOY_ROOT/current/$BIN_NAME | |
| 129 | + | EnvironmentFile=-$DEPLOY_ROOT/.env | |
| 130 | + | Restart=on-failure | |
| 131 | + | RestartSec=30 | |
| 132 | + | # Exit 2 = migration failure (MNW server convention). Don't restart; | |
| 133 | + | # operator must intervene before the next deploy. | |
| 134 | + | RestartPreventExitStatus=2 | |
| 135 | + | StandardOutput=journal | |
| 136 | + | StandardError=journal | |
| 137 | + | SyslogIdentifier=$BIN_NAME | |
| 138 | + | ||
| 139 | + | [Install] | |
| 140 | + | WantedBy=multi-user.target | |
| 141 | + | EOF | |
| 142 | + | systemctl daemon-reload | |
| 143 | + | systemctl enable "$SERVICE_NAME" >/dev/null 2>&1 || true | |
| 144 | + | ||
| 145 | + | if [[ "$ENABLE_FIREWALL" == "1" ]]; then | |
| 146 | + | log "8/8 firewall (UFW: 22/80/443 in, all else deny)" | |
| 147 | + | ufw --force reset > /dev/null | |
| 148 | + | ufw default deny incoming > /dev/null | |
| 149 | + | ufw default allow outgoing > /dev/null | |
| 150 | + | ufw allow 22/tcp > /dev/null | |
| 151 | + | ufw allow 80/tcp > /dev/null | |
| 152 | + | ufw allow 443/tcp > /dev/null | |
| 153 | + | ufw --force enable > /dev/null | |
| 154 | + | else | |
| 155 | + | log "8/8 skipping firewall" | |
| 156 | + | fi | |
| 157 | + | ||
| 158 | + | echo | |
| 159 | + | log "Done. Next steps for the operator:" | |
| 160 | + | echo " - tailscale up (auth this node to the tailnet)" | |
| 161 | + | echo " - DNS A/AAAA records for the domain you'll serve" | |
| 162 | + | echo " - Install /etc/caddy/Caddyfile + Cloudflare Origin CA cert + key" | |
| 163 | + | echo " - postgres: create role+db, drop secrets into $DEPLOY_ROOT/.env" | |
| 164 | + | echo " - Run a sando deploy from the Sando host: POST /promote/<tier>" |
| @@ -5,5 +5,6 @@ listen = "100.103.89.95:7766" # pop-os tailnet IP; bind tailnet-only, not 0.0. | |||
| 5 | 5 | db_path = "/srv/sando/state/sando.db" | |
| 6 | 6 | topology_path = "/etc/sando/sando.toml" | |
| 7 | 7 | workdir = "/srv/sando/work" | |
| 8 | - | release_root = "/srv/sando/releases" | |
| 8 | + | release_root = "/srv/sando" | |
| 9 | 9 | scratch_db_url = "postgres:///sando_scratch?host=/var/run/postgresql" | |
| 10 | + | bin_names = ["makenotwork", "mnw-admin"] |
| @@ -0,0 +1,18 @@ | |||
| 1 | + | # One-shot: tells sandod to pull the latest prod backup. | |
| 2 | + | # Paired with sandod-backup-fetch.timer for daily execution. | |
| 3 | + | # | |
| 4 | + | # Place at /etc/systemd/system/sandod-backup-fetch.service on the Sando host. | |
| 5 | + | ||
| 6 | + | [Unit] | |
| 7 | + | Description=Sando: fetch latest prod backup | |
| 8 | + | After=sandod.service network-online.target | |
| 9 | + | Wants=network-online.target | |
| 10 | + | Requires=sandod.service | |
| 11 | + | ||
| 12 | + | [Service] | |
| 13 | + | Type=oneshot | |
| 14 | + | # Reuse the same env file the daemon does — gives us $SANDO_DAEMON. | |
| 15 | + | EnvironmentFile=/etc/sando/sando.env | |
| 16 | + | ExecStart=/usr/bin/curl -fsS --max-time 600 -X POST ${SANDO_DAEMON}/backup/fetch | |
| 17 | + | # Service exits non-zero if the daemon refuses; fine — we want the timer to | |
| 18 | + | # log + retry on the next cycle. Don't restart aggressively. |
| @@ -0,0 +1,17 @@ | |||
| 1 | + | # Daily trigger for backup fetch. Prod's own backup-db.sh runs at 03:00 UTC; | |
| 2 | + | # we fetch at 04:00 UTC to leave headroom for offsite sync to complete first. | |
| 3 | + | # | |
| 4 | + | # Place at /etc/systemd/system/sandod-backup-fetch.timer on the Sando host. | |
| 5 | + | # Enable: systemctl enable --now sandod-backup-fetch.timer | |
| 6 | + | ||
| 7 | + | [Unit] | |
| 8 | + | Description=Sando: daily prod-backup fetch | |
| 9 | + | ||
| 10 | + | [Timer] | |
| 11 | + | OnCalendar=*-*-* 04:00:00 UTC | |
| 12 | + | # If the box was off when the timer fired, run on next boot. | |
| 13 | + | Persistent=true | |
| 14 | + | Unit=sandod-backup-fetch.service | |
| 15 | + | ||
| 16 | + | [Install] | |
| 17 | + | WantedBy=timers.target |
| @@ -0,0 +1,126 @@ | |||
| 1 | + | # Config artifacts vs binary artifacts | |
| 2 | + | ||
| 3 | + | Phase 3 design doc. Resolves: which of `deploy.sh`'s per-deploy actions sando absorbs, which move to one-time node-bootstrap, which sando explicitly skips. | |
| 4 | + | ||
| 5 | + | Status: draft. Decisions below are recommendations; checkboxes match `MNW/sando/todo.md` Phase 3. | |
| 6 | + | ||
| 7 | + | ## Inventory of `deploy.sh`'s actions | |
| 8 | + | ||
| 9 | + | | Action | Frequency | What it does | | |
| 10 | + | |------------------------------|----------------|----------------------------------------------------------------| | |
| 11 | + | | `build_binary` | per-deploy | cargo-zigbuild on macOS → x86_64 Linux musl/glibc | | |
| 12 | + | | `upload_config: Caddyfile` | per-deploy | scp `Caddyfile` → `/etc/caddy/Caddyfile`, `systemctl reload caddy` | | |
| 13 | + | | `upload_config: error-pages` | per-deploy | scp `error-pages/*.html` → `/opt/makenotwork/error-pages/` | | |
| 14 | + | | `upload_config: security` | per-deploy | scp `sshd-git.conf`, `fail2ban-sshd.conf`, `setup-firewall.sh` | | |
| 15 | + | | `upload_config: chmod` | per-deploy | chmod +x on setup-* scripts | | |
| 16 | + | | `upload_binary` | per-deploy | scp `makenotwork` + `mnw-admin` → `/opt/makenotwork/` | | |
| 17 | + | | `send_restart_warning` | per-deploy | POST `/api/internal/restart-warning` (30s notice), sleep 30s | | |
| 18 | + | | `restart_app` | per-deploy | `systemctl restart makenotwork`; curl 127.0.0.1:3000 to verify | | |
| 19 | + | | `sqlx migrate run` (implied) | startup | server runs migrations on startup in `main.rs:73` | | |
| 20 | + | ||
| 21 | + | ## Decision per item | |
| 22 | + | ||
| 23 | + | ### 1. Caddyfile — **bootstrap-only, not per-deploy** | |
| 24 | + | ||
| 25 | + | Caddy config is stable infrastructure. Most releases don't touch it. Per-deploy uploads couple binary version to config version unnecessarily and risk reload churn for unchanged config. | |
| 26 | + | ||
| 27 | + | - Node-bootstrap script installs `/etc/caddy/Caddyfile` once. | |
| 28 | + | - Updating Caddy config is an explicit operator action (`sando-cli push-caddy` or just `scp + systemctl reload caddy` manually), tracked but not per-release. | |
| 29 | + | - Revisit if Caddy config changes start landing >1x per sprint, then move to per-release artifact under `releases/<version>/Caddyfile` with a deploy hook. | |
| 30 | + | ||
| 31 | + | **Per-project alternative tracked:** if a Caddyfile change accompanies a binary change (rare), the operator must run the explicit Caddy-push step alongside `sando promote`. | |
| 32 | + | ||
| 33 | + | ### 2. error-pages — **bake into binary** | |
| 34 | + | ||
| 35 | + | Error pages version with code. They reference brand glyphs (diamond mark) and copy that drifts with the rest of the site. | |
| 36 | + | ||
| 37 | + | - Use `include_dir!` or `include_bytes!` to embed `server/deploy/error-pages/*.html` into the binary. | |
| 38 | + | - Update Caddy `handle_errors` blocks to point at an in-app fallback route (e.g. `/__errors/404.html`) instead of `/opt/makenotwork/error-pages/`. That route can serve the embedded HTML. | |
| 39 | + | ||
| 40 | + | Cost: small MNW server PR (separate from sando). Marks `deploy.sh upload_config: error-pages` step removable. | |
| 41 | + | ||
| 42 | + | Until that lands: ship error-pages as a sibling directory under `releases/<version>/error-pages/`. Caddy still reads from `/opt/makenotwork/error-pages/`, which becomes a symlink to `current/error-pages`. (Track A on testnot already has the `current` symlink working; the error-pages symlink follows the same pattern.) | |
| 43 | + | ||
| 44 | + | ### 3. mnw-admin binary — **ship alongside server** | |
| 45 | + | ||
| 46 | + | `mnw-admin` is part of the release; deploy.sh uploads it. Sando should too. | |
| 47 | + | ||
| 48 | + | - Extend `cfg.bin_name: String` → `cfg.bin_names: Vec<String>` (e.g. `["makenotwork", "mnw-admin"]`). | |
| 49 | + | - `deploy_local` + `deploy_node` iterate over the list, rsyncing each to `releases/<version>/<bin>`. | |
| 50 | + | - Build step looks up each in `server/target/release/<bin>`. | |
| 51 | + | ||
| 52 | + | Default stays `["server"]` for backwards-compat with the existing example config. | |
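A minimal sketch of the `bin_names` change, assuming a simplified `Config` struct and a hypothetical helper `release_copies`. Neither is sando's actual code, and the real `deploy_local` / `deploy_node` signatures are not shown in this doc. It only illustrates the path mapping described above: each binary is looked up in the build output directory and placed under `releases/<version>/<bin>`.

```rust
use std::path::{Path, PathBuf};

/// Hypothetical slice of sando's config; only `bin_names` comes from the design above.
#[derive(Debug, Clone)]
pub struct Config {
    pub bin_names: Vec<String>,
}

impl Default for Config {
    fn default() -> Self {
        // Default stays ["server"] to match the existing example config.
        Self { bin_names: vec!["server".to_string()] }
    }
}

/// Returns one (source, destination) pair per configured binary.
/// `target_dir` is the build output directory (e.g. `server/target/release`);
/// `release_root` is the directory that contains `releases/`.
pub fn release_copies(
    cfg: &Config,
    target_dir: &Path,
    release_root: &Path,
    version: &str,
) -> Vec<(PathBuf, PathBuf)> {
    cfg.bin_names
        .iter()
        .map(|bin| {
            let src = target_dir.join(bin);
            let dst = release_root.join("releases").join(version).join(bin);
            (src, dst)
        })
        .collect()
}
```

A deploy step would then iterate over the returned pairs and rsync each source to its destination. Keeping the mapping a pure function makes it testable without touching the filesystem.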
| 53 | + | ||
| 54 | + | ### 4. systemd unit (`makenotwork.service`) — **bootstrap-only** | |
| 55 | + | ||
| 56 | + | The unit references `<release_root>/current/makenotwork`. Once installed, it doesn't change per release. | |
| 57 | + | ||
| 58 | + | - Node-bootstrap script installs `/etc/systemd/system/makenotwork.service`. | |
| 59 | + | - `deploy.sh` re-uploaded the unit on every deploy. Sando does not upload it at all. | |
| 60 | + | - If the unit ever needs to change (e.g. resource limits, env file path), that's a one-shot operator action, not a per-deploy step. | |
| 61 | + | ||
| 62 | + | ### 5. Security configs (sshd-git, fail2ban, firewall) — **bootstrap-only** | |
| 63 | + | ||
| 64 | + | These are one-time host hardening. They have no release coupling. | |
| 65 | + | ||
| 66 | + | - Node-bootstrap script installs them on first provision. | |
| 67 | + | - Updates are out-of-band operator actions (or fold into a `sando push-config` later). | |
| 68 | + | ||
| 69 | + | ### 6. backup-db.sh — **bootstrap-only** | |
| 70 | + | ||
| 71 | + | Same as security configs: the backup script is host infrastructure, not a release artifact. | |
| 72 | + | ||
| 73 | + | - Node-bootstrap installs `backup-db.sh` and its cron entry. | |
| 74 | + | - Updates out-of-band. | |
| 75 | + | - Bonus: backup-db.sh should be updated to (a) maintain `latest.sql.gz` hard link, (b) push to astra for true offsite — currently broken (see separate "offsite sync broken" ticket). | |
| 76 | + | ||
| 77 | + | ### 7. Restart warning — **defer to Phase 5; track for prod cutover** | |
| 78 | + | ||
| 79 | + | `deploy.sh` posts a 30s warning, sleeps 30s, then restarts. Sando does NOT yet do this. | |
| 80 | + | ||
| 81 | + | - For testnot (low traffic): skip. A restart there is brief enough to go unnoticed. | |
| 82 | + | - For prod cutover: sando must implement this. Options: | |
| 83 | + | - **A**: Sando POSTs `/api/internal/restart-warning` itself, requires CLI_SERVICE_TOKEN exposed to sando. Token would live in `/etc/sando/sando.env` on pop-os. | |
| 84 | + | - **B**: Sando exposes a `pre_deploy_hook` per-tier in `sando.toml` (shell command); operator decides. | |
| 85 | + | - Recommendation: **A** for prod tiers only (`tier.restart_warning_seconds = 30` in `sando.toml`). Tier A (testnot) leaves it unset = no warning. | |
| 86 | + | ||
| 87 | + | Phase 5 implementation, not blocking cutover-readiness. | |
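Option A could be modelled roughly as follows. This is a sketch under assumptions: `TierConfig`, `PreRestartStep`, and `pre_restart_steps` are hypothetical names, and only the `restart_warning_seconds` setting and the `/api/internal/restart-warning` path come from this doc. The HTTP call and token handling are deliberately left out; the function only decides which steps run before the restart.

```rust
use std::time::Duration;

/// Hypothetical per-tier settings; only `restart_warning_seconds` is from the design.
#[derive(Debug, Clone, Default)]
pub struct TierConfig {
    pub name: String,
    /// `tier.restart_warning_seconds` in `sando.toml`; `None` means no warning.
    pub restart_warning_seconds: Option<u64>,
}

/// Steps to run before restarting the service on a tier.
#[derive(Debug, PartialEq, Eq)]
pub enum PreRestartStep {
    /// POST to the app's internal warning endpoint.
    PostWarning { path: &'static str },
    /// Give connected users time to see the notice.
    Wait(Duration),
}

/// Plans the pre-restart steps: a warning followed by a wait when the tier
/// sets a positive `restart_warning_seconds`, nothing otherwise.
pub fn pre_restart_steps(tier: &TierConfig) -> Vec<PreRestartStep> {
    match tier.restart_warning_seconds {
        Some(secs) if secs > 0 => vec![
            PreRestartStep::PostWarning { path: "/api/internal/restart-warning" },
            PreRestartStep::Wait(Duration::from_secs(secs)),
        ],
        // Unset (or zero) means restart immediately, as on testnot.
        _ => Vec::new(),
    }
}
```

With this shape, a prod tier configured with `restart_warning_seconds = 30` yields a warning plus a 30s wait, while Tier A yields an empty plan and restarts straight away.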
| 88 | + | ||
| 89 | + | ### 8. Cross-compile from macOS — **retire** | |
| 90 | + | ||
| 91 | + | Pop-os is x86_64 Ubuntu-derived, prod is x86_64 Ubuntu 24.04. Sando builds natively. Cargo-zigbuild path goes away once sando is canonical. | |
| 92 | + | ||
| 93 | + | - Verify: take a recent prod binary (from `deploy.sh`'s build) and sando's binary for the same sha, compare runtime behavior across one full sprint of testnot use. | |
| 94 | + | - Once verified, mark `deploy.sh` archived and delete cargo-zigbuild from dev-machine setup notes. | |
| 95 | + | ||
| 96 | + | ### 9. Prod migrations — **server-self-applies on startup; sando does NOT** | |
| 97 | + | ||
| 98 | + | MNW server runs `sqlx::migrate!("./migrations").run(&db).await` in `main.rs:73` at startup. This means: | |
| 99 | + | ||
| 100 | + | - A new binary starting up applies any pending migrations against the live prod DB. | |
| 101 | + | - Sando does not need an explicit `POST /migrate/{tier}` endpoint. | |
| 102 | + | - The `migration_dry_run` gate's purpose is to catch migration failures before the live binary tries to run them; that is the prod safety net. | |
| 103 | + | - Risk: a partially-applied migration (e.g. multi-statement, the 2026-05-22 incident class) can leave the DB in a broken state mid-startup. Sandboxing the migration via `migration_dry_run` catches this; the live server then either succeeds or fails and crash-loops on the same migration sequence. | |
| 104 | + | - Open question: should sando refuse to promote if `migration_dry_run` flags the upcoming version as a destructive migration (drop+recreate column)? Phase 5+ enhancement. | |
| 105 | + | ||
| 106 | + | **Action:** none — current architecture is correct. Document this in `plans/migration-dryrun-failures.md` (Phase 2 follow-up). | |
| 107 | + | ||
| 108 | + | ## Net effect on `deploy.sh` | |
| 109 | + | ||
| 110 | + | | Step                 | Replaced by Sando             | Moved to node-bootstrap | Retired | | |
| 111 | + | |----------------------|-------------------------------|-------------------------|---------| | |
| 112 | + | | build_binary         | yes (native on pop-os)        |                         |         | | |
| 113 | + | | upload_config        |                               | yes (Caddyfile, etc.)   |         | | |
| 114 | + | | upload_binary        | yes (+ mnw-admin)             |                         |         | | |
| 115 | + | | send_restart_warning | yes (Phase 5, prod tier only) |                         |         | | |
| 116 | + | | restart_app          | yes (reload-or-restart)       |                         |         | | |
| 117 | + | ||
| 118 | + | Once items 2-9 above land, `deploy.sh` becomes redundant and moves to `server/deploy/archive/`. | |
| 119 | + | ||
| 120 | + | ## Implementation order | |
| 121 | + | ||
| 122 | + | 1. **`bin_names: Vec<String>`** — small, unblocks mnw-admin shipping (#3). | |
| 123 | + | 2. **error-pages as release sibling + symlink** — small, unblocks #2 until bake-into-binary lands. | |
| 124 | + | 3. **node-bootstrap script** — folds Caddyfile (#1), unit (#4), security (#5), backup (#6) into one idempotent script. Already a Phase 1 carryover. | |
| 125 | + | 4. **Phase 5: restart_warning hook** — when prod cutover gets scheduled. | |
| 126 | + | 5. **Prod cutover sprint** — verify binary parity (#8), retire `deploy.sh` (#9 needs no action). |