max / pom
6 files changed, +721 insertions, -112 deletions
| @@ -0,0 +1,308 @@ | |||
| 1 | + | # PoM — Completed Work | |
| 2 | + | ||
| 3 | + | Archived completed phases from todo.md. All items here are done. | |
| 4 | + | ||
| 5 | + | --- | |
| 6 | + | ||
| 7 | + | ## Phase 1 — Core Infrastructure | |
| 8 | + | Health checks, test orchestration, CLI, MCP server, SQLite storage. | |
| 9 | + | ||
| 10 | + | ### Done | |
| 11 | + | - [x] HTTP health checks with configurable targets and timeouts | |
| 12 | + | - [x] SSH test orchestration with CI output parsing | |
| 13 | + | - [x] CLI commands: health, test, status, history, prune | |
| 14 | + | - [x] MCP server mode (stdio transport) | |
| 15 | + | - [x] SQLite storage with WAL mode | |
| 16 | + | - [x] Per-target interval overrides | |
| 17 | + | ||
| 18 | + | ## Phase 2 — Serve Mode | |
| 19 | + | Background daemon with periodic health checks. | |
| 20 | + | ||
| 21 | + | ### Done | |
| 22 | + | - [x] Serve mode with per-target health check intervals | |
| 23 | + | - [x] Daily prune task | |
| 24 | + | - [x] Graceful shutdown (SIGINT/SIGTERM) | |
| 25 | + | - [x] Systemd service on hetzner | |
| 26 | + | ||
| 27 | + | ## Phase 3 — HTTP API + MNW Integration | |
| 28 | + | Expose data to consumers, wire into MNW health page. | |
| 29 | + | ||
| 30 | + | ### Done | |
| 31 | + | - [x] Axum HTTP API (`/api/status`, `/api/status/{target}`) | |
| 32 | + | - [x] Uptime percentage queries (24h, 7d) | |
| 33 | + | - [x] MNW `/health` page shows External Monitor card | |
| 34 | + | - [x] MNW `/api/health` JSON includes `external_monitoring` field | |
| 35 | + | - [x] Graceful fallback when PoM unavailable | |
| 36 | + | ||
| 37 | + | ## Phase 4 — Peer Mesh | |
| 38 | + | Syncthing-style peer network. Each PoM instance has a UUID, discovers peers by address, and shares monitoring data across the mesh. Any instance can see the full network state. | |
| 39 | + | ||
| 40 | + | ### Done (4A — Instance Identity) | |
| 41 | + | - [x] Auto-generate UUID on first run, store in data dir (`~/.local/share/pom/instance_id`) | |
| 42 | + | - [x] Instance name in config (`[instance]` section, defaults to hostname) | |
| 43 | + | - [x] `GET /api/peer/info` endpoint (returns instance ID, name, version, target list, started_at) | |
| 44 | + | ||
| 45 | + | ### Done (4B — Peer Configuration) | |
| 46 | + | - [x] `[instance]` config section (name, optional ID override) | |
| 47 | + | - [x] `[peers.<name>]` config section (address, on_missing, grace_count) | |
| 48 | + | - [x] Peer connection on serve startup (exchange instance info via `/api/peer/info`) | |
| 49 | + | - [x] Validate peer identity: store UUID on first connect, warn if UUID changes unexpectedly | |
| 50 | + | ||
| 51 | + | ### Done (4C — Peer Health Monitoring) | |
| 52 | + | - [x] Periodic peer heartbeat (poll each peer's `/api/peer/info`, configurable interval, default 60s) | |
| 53 | + | - [x] Peer status tracking in SQLite (`peer_identities`, `peer_heartbeats` tables) | |
| 54 | + | - [x] `on_missing` behavior: fire action when peer heartbeat fails (after configurable grace period) | |
| 55 | + | - [x] State machine: Unknown -> Online/GracePeriod -> Missing, with recovery detection | |
| 56 | + | - [x] Prune task also cleans `peer_heartbeats` | |
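The state progression listed above (Unknown -> Online/GracePeriod -> Missing, with recovery detection) can be sketched roughly as follows. This is only an illustration, not the code in `peer.rs`: the `PeerTracker` and `Transition` types are hypothetical, and it assumes `grace_count` means the number of failed heartbeats tolerated before a peer is marked Missing.

```rust
/// Peer liveness states, following the progression named in the list above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum PeerState {
    Unknown,
    Online,
    GracePeriod,
    Missing,
}

/// What the caller should do after a heartbeat result is applied (hypothetical type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Transition {
    None,
    /// The peer just exceeded the grace threshold: fire the `on_missing` action.
    WentMissing,
    /// The peer answered again after having been marked Missing.
    Recovered,
}

/// Hypothetical per-peer tracker; field names are assumptions.
#[derive(Debug)]
pub struct PeerTracker {
    pub state: PeerState,
    pub consecutive_failures: u32,
    /// Assumed meaning: failed heartbeats tolerated before Missing.
    pub grace_count: u32,
}

impl PeerTracker {
    pub fn new(grace_count: u32) -> Self {
        PeerTracker {
            state: PeerState::Unknown,
            consecutive_failures: 0,
            grace_count,
        }
    }

    /// Apply one heartbeat result and report whether an action is due.
    pub fn record(&mut self, heartbeat_ok: bool) -> Transition {
        if heartbeat_ok {
            let was_missing = self.state == PeerState::Missing;
            self.state = PeerState::Online;
            self.consecutive_failures = 0;
            return if was_missing {
                Transition::Recovered
            } else {
                Transition::None
            };
        }

        self.consecutive_failures += 1;
        if self.consecutive_failures > self.grace_count {
            // Only report the transition once, on the first check past the threshold.
            let already_missing = self.state == PeerState::Missing;
            self.state = PeerState::Missing;
            if already_missing {
                Transition::None
            } else {
                Transition::WentMissing
            }
        } else {
            self.state = PeerState::GracePeriod;
            Transition::None
        }
    }
}
```

Returning a `Transition` rather than firing the action inside `record` keeps the state machine pure, which is one way to make it unit-testable without a database or network.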
| 57 | + | ||
| 58 | + | ### Done (4D — Status Sharing) | |
| 59 | + | - [x] `GET /api/peer/status` endpoint (returns this instance's full target + peer status) | |
| 60 | + | - [x] Each instance periodically fetches peer status to build combined view | |
| 61 | + | - [x] `GET /api/mesh` endpoint (aggregated view: all instances, all targets, all peer statuses) | |
| 62 | + | - [x] CLI: `pom mesh [--json]` command to show network state | |
| 63 | + | - [x] MCP tool: `get_mesh_status` surfaces mesh state | |
| 64 | + | ||
| 65 | + | ### Done (4E — Code + Config) | |
| 66 | + | - [x] Per-host config files (`deploy/pom-hetzner.toml`, `deploy/pom-astra.toml`) | |
| 67 | + | - [x] Updated `deploy/deploy.sh` to use per-host configs | |
| 68 | + | - [x] Listen on `0.0.0.0:9100` in deploy configs (Tailscale peer access) | |
| 69 | + | ||
| 70 | + | ### Done (4E — Deploy) | |
| 71 | + | - [x] Install Tailscale on hetzner (`100.120.174.96`) | |
| 72 | + | - [x] Update astra peer config to use hetzner's Tailscale IP | |
| 73 | + | - [x] Fix `blocking_read()` panic in `spawn_heartbeat_tasks` (must be async) | |
| 74 | + | - [x] Deploy v0.2.0 to hetzner + astra | |
| 75 | + | - [x] Verify: `/api/peer/info` returns correct identity on each | |
| 76 | + | - [x] Verify: `/api/mesh` shows both instances online (~65ms latency) | |
| 77 | + | - [x] Update deploy scripts to use Tailscale IPs | |
| 78 | + | ||
| 79 | + | ## Phase 5 — Alerting (pre-beta) | |
| 80 | + | Email alerts triggered by target status changes or peer disappearance. Peers with `on_missing = "alert"` use this system. | |
| 81 | + | ||
| 82 | + | ### Done | |
| 83 | + | - [x] Postmark API integration (`src/alerts.rs` — Alerter struct, `X-Postmark-Server-Token` header) | |
| 84 | + | - [x] Alert configuration in pom.toml (`[alerts]` section: postmark_token, to, from, cooldown_secs) | |
| 85 | + | - [x] Status change detection (query previous health check before insert, compare statuses, fire on transition) | |
| 86 | + | - [x] Cooldown logic (alerts table tracks sent_at, skip if within cooldown window) | |
| 87 | + | - [x] Recovery alerts (notify when target returns to operational) | |
| 88 | + | - [x] Peer-triggered alerts (peer goes missing or recovers, with `on_missing = "alert"`) | |
| 89 | + | - [x] Dev mode (no postmark_token → alerts logged to stdout) | |
| 90 | + | - [x] DB migration v2 (alerts table + index) | |
| 91 | + | - [x] Deploy configs updated (`deploy/pom-hetzner.toml`, `deploy/pom-astra.toml`) | |
| 92 | + | - [x] 11 new tests (3 unit, 5 integration, 3 config) | |
| 93 | + | - [x] Set postmark_token in production deploy configs | |
| 94 | + | - [x] Create `pom-alerts@makenot.work` sender signature in Postmark dashboard | |
| 95 | + | ||
| 96 | + | ## Phase 6 — TLS Certificate Monitoring | |
| 97 | + | Probe TLS certs, track expiry, alert before outage. | |
| 98 | + | ||
| 99 | + | ### Done | |
| 100 | + | - [x] TLS certificate check: connect to target, TLS handshake, read leaf cert expiry (`src/checks/tls.rs`) | |
| 101 | + | - [x] Per-target TLS config: `[targets.mnw.tls]` with host, port (default 443), warn_days (default 14) | |
| 102 | + | - [x] Configurable check interval: `tls_check_interval_secs` on `[serve]` (default 3600) | |
| 103 | + | - [x] DB migration v3: `tls_checks` table with index | |
| 104 | + | - [x] Store cert check results per target (insert/query, `TlsCheckRow`) | |
| 105 | + | - [x] Prune old TLS checks in daily prune task (5-tuple return) | |
| 106 | + | - [x] TLS data in API response: `tls` field on `/api/status/{target}` (skip_serializing_if None) | |
| 107 | + | - [x] CLI display: TLS line in `pom status` (OK/WARN/ERR with days remaining + expiry date) | |
| 108 | + | - [x] Serve loop: TLS check task per target on its own interval | |
| 109 | + | - [x] Alerts: expiry warning, error, and recovery (with cooldown) | |
| 110 | + | - [x] Deploy configs updated (hetzner + astra: `[targets.mnw.tls] host = "makenot.work"`) | |
| 111 | + | - [x] 17 new tests (2 unit, 5 config, 10 integration) | |
| 112 | + | - [x] Dependencies: x509-parser 0.16, tokio-rustls 0.26, rustls-pki-types 1, webpki-roots 1 | |
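As a rough illustration of how `warn_days` could map onto the OK/WARN/ERR line mentioned above, here is a minimal sketch. The `TlsLevel` enum, the function names, the exact line format, and the example date are all assumptions; the real logic in `src/checks/tls.rs` and `display.rs` may differ (for example, ERR presumably also covers handshake failures, which this sketch ignores).

```rust
/// Hypothetical severity levels for a certificate check.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum TlsLevel {
    Valid,
    Warn,
    Expired,
}

/// Classify a certificate by whole days until expiry.
/// Assumption: a certificate is in the warning window when
/// `days_remaining <= warn_days`, and expired when negative.
pub fn classify_cert(days_remaining: i64, warn_days: i64) -> TlsLevel {
    if days_remaining < 0 {
        TlsLevel::Expired
    } else if days_remaining <= warn_days {
        TlsLevel::Warn
    } else {
        TlsLevel::Valid
    }
}

/// Render a one-line summary. The wording is illustrative only.
pub fn tls_status_line(days_remaining: i64, warn_days: i64, expires_on: &str) -> String {
    let label = match classify_cert(days_remaining, warn_days) {
        TlsLevel::Valid => "OK",
        TlsLevel::Warn => "WARN",
        TlsLevel::Expired => "ERR",
    };
    format!("TLS: {label} ({days_remaining}d remaining, expires {expires_on})")
}
```

Working in whole days rather than timestamps avoids the kind of flapping that a time-of-day comparison can cause near the threshold.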
| 113 | + | ||
| 114 | + | ## Phase 7 — Response Validation | |
| 115 | + | Verify response bodies match expected patterns, not just HTTP status codes. | |
| 116 | + | ||
| 117 | + | ### Done | |
| 118 | + | - [x] `HealthExpectation` config struct: `status_code`, `json_fields` (dot-path), `body_contains` | |
| 119 | + | - [x] `[targets.mnw.health.expect]` TOML config section (all fields optional) | |
| 120 | + | - [x] `resolve_json_path()` — walk dot-separated paths through nested JSON | |
| 121 | + | - [x] `validate_expectations()` — check status code, body substring, JSON field values | |
| 122 | + | - [x] Refactored `check_health` to `response.text()` + `serde_json::from_str` (preserves raw body) | |
| 123 | + | - [x] Expectation failures override to Degraded with joined error descriptions | |
| 124 | + | - [x] Deploy configs updated (hetzner + astra: `status_code = 200`, `json_fields.status = "operational"`) | |
| 125 | + | - [x] 17 new unit tests (resolve_json_path, validate_expectations, config parsing) | |
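The dot-path lookup behind `json_fields` can be sketched as below. The project uses `serde_json`; to keep this example dependency-free it substitutes a tiny hand-rolled `Json` enum, so treat the types and the exact signature as assumptions rather than the actual `resolve_json_path()`.

```rust
use std::collections::BTreeMap;

/// Minimal stand-in for a JSON value (the real code works on serde_json values).
#[derive(Debug, Clone, PartialEq)]
pub enum Json {
    Null,
    Bool(bool),
    Num(f64),
    Str(String),
    Arr(Vec<Json>),
    Obj(BTreeMap<String, Json>),
}

/// Walk a dot-separated path such as "db.status" through nested objects.
/// Returns None as soon as a segment is missing or a non-object is reached.
pub fn resolve_json_path<'a>(root: &'a Json, path: &str) -> Option<&'a Json> {
    let mut current = root;
    for key in path.split('.') {
        match current {
            Json::Obj(map) => current = map.get(key)?,
            _ => return None,
        }
    }
    Some(current)
}

/// Compare a resolved field against an expected string value.
/// A missing field counts as a failed expectation.
pub fn field_matches(root: &Json, path: &str, expected: &str) -> bool {
    matches!(resolve_json_path(root, path), Some(Json::Str(s)) if s == expected)
}
```

With a body equivalent to `{"status": "operational", "db": {"status": "operational"}}`, `field_matches(&body, "db.status", "operational")` is true, while a path through a scalar such as `"status.nested"` resolves to `None`.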
| 126 | + | ||
| 127 | + | ## Phase 8 — Latency Trending + Anomaly Detection | |
| 128 | + | Track performance over time, detect drift before it becomes an outage. | |
| 129 | + | ||
| 130 | + | ### Done | |
| 131 | + | - [x] `LatencyStats` + `LatencyBucket` types with `from_times()` and `bucket_by_time()` (types.rs, 9 unit tests) | |
| 132 | + | - [x] DB queries: `get_response_times`, `get_recent_response_times` (db.rs — operational-only filtering) | |
| 133 | + | - [x] `TrendingConfig` (baseline_window_hours, spike_threshold) wired into `HealthConfig` (config.rs, 3 tests) | |
| 134 | + | - [x] `detect_latency_drift()` — 3 consecutive checks over baseline threshold (checks/http.rs, 6 unit tests) | |
| 135 | + | - [x] Drift + recovery alerts with cooldown (alerts.rs) | |
| 136 | + | - [x] Drift detection in serve loop with `in_drift` state tracking (cli.rs) | |
| 137 | + | - [x] `latency_24h` on `/api/status/{target}`, `GET /api/trends/{target}?hours=&bucket_minutes=` (api.rs) | |
| 138 | + | - [x] Latency line in CLI `pom status` output (display.rs, 2 tests) | |
| 139 | + | - [x] Latency stats in MCP `get_status` tool (tools/health.rs) | |
| 140 | + | - [x] Deploy configs: `[targets.mnw.health.trending]` (pom-hetzner.toml, pom-astra.toml) | |
| 141 | + | - [x] MNW health page: avg/p95 latency in PoM card (health.rs, public.rs, health.html) | |
| 142 | + | - [x] 8 new integration tests (response times, trends API, latency in status, config parsing) | |
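The "3 consecutive checks over baseline threshold" rule can be illustrated with a short sketch. It assumes the baseline is the mean response time over the baseline window and that `spike_threshold` is a multiplier on that mean; neither detail is stated above, so the real `detect_latency_drift()` may compute things differently.

```rust
/// Number of consecutive slow checks required before drift is reported.
const CONSECUTIVE_REQUIRED: usize = 3;

/// Return true when the most recent checks all exceed `mean(baseline) * spike_threshold`.
///
/// `baseline_ms`: response times from the baseline window (assumed operational-only).
/// `recent_ms`: the latest response times, oldest first.
pub fn detect_latency_drift(baseline_ms: &[u64], recent_ms: &[u64], spike_threshold: f64) -> bool {
    if baseline_ms.is_empty() || recent_ms.len() < CONSECUTIVE_REQUIRED {
        // Not enough data to make a call either way.
        return false;
    }

    let mean = baseline_ms.iter().sum::<u64>() as f64 / baseline_ms.len() as f64;
    let limit = mean * spike_threshold;

    // Only the trailing window matters: one fast check resets the streak.
    recent_ms[recent_ms.len() - CONSECUTIVE_REQUIRED..]
        .iter()
        .all(|&ms| ms as f64 > limit)
}
```

Requiring a streak rather than a single slow sample is what keeps one-off spikes from triggering an alert.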
| 143 | + | ||
| 144 | + | ## Phase 9 — Smart Test Prompting | |
| 145 | + | Detect when tests should be re-run based on staleness and version changes. | |
| 146 | + | ||
| 147 | + | ### Done | |
| 148 | + | - [x] `TestStaleness` struct: stale flag, reason, current/tested versions, last_test_at, days_since_test (types.rs) | |
| 149 | + | - [x] `get_version_at_time()` DB query: extract version from health check closest to a given timestamp (db.rs) | |
| 150 | + | - [x] `staleness_days` config field on `TestsConfig` (default 7) (config.rs, 2 config tests) | |
| 151 | + | - [x] `compute_test_staleness()` pure function: no-tests, age-based, version-change triggers (checks/http.rs, 5 unit tests) | |
| 152 | + | - [x] `test_staleness` field on API `TargetStatus` (skip_serializing_if None) (api.rs) | |
| 153 | + | - [x] `build_target_status` computes staleness for targets with test config (api.rs) | |
| 154 | + | - [x] CLI `pom status` shows "Tests: STALE" line with reason (display.rs, 4 display tests) | |
| 155 | + | - [x] CLI JSON output includes `test_staleness` object (cli.rs) | |
| 156 | + | - [x] MCP `get_status` shows staleness info when stale (tools/health.rs) | |
| 157 | + | - [x] Deploy configs: `staleness_days = 7` (pom-hetzner.toml, pom-astra.toml) | |
| 158 | + | - [x] 8 integration tests (version_at_time, staleness by version/age/fresh, config parsing, MCP tool, no-config omits field) | |
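The three staleness triggers (no tests, version change, age) lend themselves to a pure function. The sketch below is one possible shape, not the actual `compute_test_staleness()`: the `StaleReason` enum, the parameter list, the order in which triggers are checked, and the `>=` comparison against `staleness_days` are all assumptions.

```rust
/// Why a target's tests are considered stale (hypothetical type).
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum StaleReason {
    NoTests,
    VersionChanged { tested: String, current: String },
    TooOld { days_since_test: u64 },
}

/// Decide whether tests should be re-run. Returns None when tests are fresh.
///
/// `days_since_test` is None when no test run has ever been recorded.
pub fn compute_test_staleness(
    days_since_test: Option<u64>,
    tested_version: Option<&str>,
    current_version: Option<&str>,
    staleness_days: u64,
) -> Option<StaleReason> {
    let days = match days_since_test {
        Some(d) => d,
        None => return Some(StaleReason::NoTests),
    };

    // A version change is checked before age, so a recent run against an
    // old build is still flagged.
    if let (Some(tested), Some(current)) = (tested_version, current_version) {
        if tested != current {
            return Some(StaleReason::VersionChanged {
                tested: tested.to_string(),
                current: current.to_string(),
            });
        }
    }

    if days >= staleness_days {
        return Some(StaleReason::TooOld { days_since_test: days });
    }
    None
}
```

Keeping the function free of I/O means every trigger can be covered by plain unit tests, with the DB lookups done by the caller.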
| 159 | + | ||
| 160 | + | ## Phase 10 — Downtime Log + Incident History | |
| 161 | + | Structured timeline of status transitions for post-incident review. | |
| 162 | + | ||
| 163 | + | ### Done | |
| 164 | + | - [x] DB migration v4: `incidents` table (id, target, started_at, ended_at, duration_secs, from_status, to_status) | |
| 165 | + | - [x] `IncidentRow` struct (sqlx::FromRow + Serialize) | |
| 166 | + | - [x] Incident queries: `insert_incident`, `close_open_incidents`, `get_open_incident`, `get_recent_incidents` | |
| 167 | + | - [x] Automatic incident open on transition away from operational (serve loop) | |
| 168 | + | - [x] Automatic incident close (with duration) on recovery to operational | |
| 169 | + | - [x] Status change between non-operational states: close old incident, open new | |
| 170 | + | - [x] `current_incident` + `incidents` (last 10) on API `/api/status/{target}` (skip_serializing_if) | |
| 171 | + | - [x] CLI `pom status` shows active incident line | |
| 172 | + | - [x] MCP `get_status` shows active + recent incidents | |
| 173 | + | - [x] Prune cleans closed incidents (6-tuple return from `prune_old_records`) | |
| 174 | + | - [x] 10 new tests (migration, lifecycle, target isolation, prune, API) | |
| 175 | + | - [x] Surface in MNW health page (incident timeline, recent incidents list, expandable check lists, formatted timestamps) | |
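The three incident rules above (open on leaving operational, close on recovery, close-then-open between two non-operational states) collapse into one small function. This is a hedged in-memory sketch: the `Status` variants, the `Incident` struct, and `apply_transition` are hypothetical stand-ins for the SQLite-backed `insert_incident` / `close_open_incidents` flow.

```rust
/// Hypothetical status set; the real enum may have different variants.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Status {
    Operational,
    Degraded,
    Unreachable,
}

/// In-memory stand-in for a row of the `incidents` table.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Incident {
    pub started_at: u64,
    pub ended_at: Option<u64>,
    pub duration_secs: Option<u64>,
    pub from_status: Status,
    pub to_status: Status,
}

/// Apply a status change observed at `now` (seconds) to the incident log.
pub fn apply_transition(log: &mut Vec<Incident>, prev: Status, next: Status, now: u64) {
    if prev == next {
        return; // No transition, nothing to record.
    }

    // Any status change ends the currently open incident, if there is one.
    if let Some(open) = log.iter_mut().rev().find(|i| i.ended_at.is_none()) {
        open.ended_at = Some(now);
        open.duration_secs = Some(now.saturating_sub(open.started_at));
    }

    // Landing on a non-operational status opens a new incident.
    if next != Status::Operational {
        log.push(Incident {
            started_at: now,
            ended_at: None,
            duration_secs: None,
            from_status: prev,
            to_status: next,
        });
    }
}
```

Because closing always happens before opening, at most one incident is open at any time.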
| 176 | + | ||
| 177 | + | ## Audit Remediation (Second Audit, 2026-03-11) | |
| 178 | + | 5 findings, 3 cold spots. All resolved. | |
| 179 | + | ||
| 180 | + | ### Done | |
| 181 | + | - [x] Extract CLI command handlers from main.rs into cli.rs (main.rs: 587 -> 130 LOC, cli.rs: 466 LOC) | |
| 182 | + | - [x] Add typed PomError enum with thiserror (8 variants, replaces Box<dyn Error> across 9 files) | |
| 183 | + | - [x] Add .DS_Store and IDE dirs (.idea/, .vscode/) to .gitignore | |
| 184 | + | - [x] Add module-level //! docs to main.rs (config.rs already had one) | |
| 185 | + | - [x] Add migration versioning (schema_version table, numbered migrations, pre-migration DB detection, 3 tests) | |
| 186 | + | - [x] Add CLI display tests (extract formatting into display.rs, 27 tests: health snapshots, test results, status, history, prune, mesh) | |
| 187 | + | ||
| 188 | + | ## Audit Remediation (First Audit, 2026-03-10) | |
| 189 | + | 11 findings, 8 cold spots. All resolved. | |
| 190 | + | ||
| 191 | + | ### Done | |
| 192 | + | - [x] Add DB indexes: `health_checks(target, id DESC)`, `health_checks(target, checked_at)`, `test_runs(target, id DESC)`, `peer_heartbeats(peer_name, id DESC)` (db.rs, init_schema) | |
| 193 | + | - [x] Fix 4 clippy `collapsible_if` warnings (api.rs, peer.rs, main.rs — used Rust 2024 let chains) | |
| 194 | + | - [x] Decouple mesh write lock from DB writes in heartbeat handlers (peer.rs — block-scoped lock, DB writes after drop) | |
| 195 | + | - [x] Decouple mesh read lock from DB queries in peer_status and mesh_view handlers (api.rs — same pattern) | |
| 196 | + | - [x] Log `/api/peer/status` fetch failures instead of silently ignoring (peer.rs, tracing::debug) | |
| 197 | + | - [x] Include peer heartbeat prune count in `prune_old_records` return value (db.rs — now returns 3-tuple) | |
| 198 | + | - [x] Add `//!` module docs to db.rs, config.rs, peer.rs, types.rs, lib.rs (api.rs already had one) | |
| 199 | + | - [x] Change `PeerConfig.on_missing` from `String` to `OnMissing` enum with `#[derive(Deserialize)]` + `#[default]` | |
| 200 | + | - [x] Add API endpoint integration tests (5 tests: /api/status, /api/status/{target} 404, /api/peer/info, peer disabled, /api/mesh) | |
| 201 | + | - [x] Add heartbeat state machine unit tests (5 tests: grace transitions, recovery, first-contact UUID, DB recording) | |
| 202 | + | - [x] Add config parsing tests (4 tests: full parse, defaults, on_missing default, hostname fallback) | |
| 203 | + | - [x] Add HTTP health check response classification tests (8 tests: operational, degraded, unknown status, error codes, missing fields, non-JSON) | |
| 204 | + | - [x] Extract `HealthStatus::icon()` method, eliminating 3 repeated match blocks in main.rs | |
| 205 | + | - [x] Add types.rs tests (4 tests: Display/FromStr roundtrip, icon mapping, serde roundtrip, invalid parse) | |
| 206 | + | ||
| 207 | + | ## Phase 11 — Route Specs (pre-beta) | |
| 208 | + | ||
| 209 | + | Define expected routes per target in config. PoM periodically checks each route and alerts if any return a non-2xx status. Catches missing pages, broken deploys, misconfigured paths. | |
| 210 | + | ||
| 211 | + | ### Done | |
| 212 | + | - [x] `expected_routes` config field on `[targets.<name>]` — list of paths to check (e.g. `["/", "/docs", "/docs/faq", "/pricing"]`) | |
| 213 | + | - [x] `route_check_interval_secs` on `[serve]` (default 300 = 5 min) | |
| 214 | + | - [x] Route check module (`src/checks/routes.rs`) — sequential GET per path, 2xx = OK | |
| 215 | + | - [x] Route check task in serve loop (separate interval from health checks) | |
| 216 | + | - [x] Route check results stored in DB (migration v5: `route_checks` table with indexes) | |
| 217 | + | - [x] `RouteCheckRow`, `insert_route_check`, `get_latest_route_checks` queries | |
| 218 | + | - [x] Prune includes route_checks (7-tuple return from `prune_old_records`) | |
| 219 | + | - [x] Alert on route failure (non-2xx on any expected route, with cooldown key `route:{target}`) | |
| 220 | + | - [x] Recovery alert when a previously failing route returns 2xx again (no cooldown) | |
| 221 | + | - [x] `route_status` field on `/api/status/{target}` (list of paths with last status, skip_serializing_if empty) | |
| 222 | + | - [x] CLI `pom status` shows route check summary (e.g. "Routes: 9/9 OK" or "Routes: 7/9 (FAIL: /docs/faq, /pricing)") | |
| 223 | + | - [x] MNW health page: route status in PoM card | |
| 224 | + | - [x] Deploy configs updated (hetzner + astra: MNW 9 routes, MT 1 route) | |
| 225 | + | - [x] 18 new tests (4 config, 5 route check unit, 3 display, 1 alert, 5 integration) | |
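The route summary strings quoted above ("Routes: 9/9 OK", "Routes: 7/9 (FAIL: /docs/faq, /pricing)") could be produced by something like the following. The function names and the `(path, ok)` tuple input are assumptions for illustration; the real formatting lives in `display.rs` and works on stored route check rows.

```rust
/// A status code counts as OK when it is in the 2xx range.
pub fn is_route_ok(status_code: u16) -> bool {
    (200..300).contains(&status_code)
}

/// Summarise route check results as a single CLI line.
/// `results` pairs each checked path with whether its last check passed.
pub fn route_summary(results: &[(&str, bool)]) -> String {
    let total = results.len();
    let failed: Vec<&str> = results
        .iter()
        .filter(|r| !r.1)
        .map(|r| r.0)
        .collect();
    let ok = total - failed.len();

    if failed.is_empty() {
        format!("Routes: {ok}/{total} OK")
    } else {
        format!("Routes: {ok}/{total} (FAIL: {})", failed.join(", "))
    }
}
```

Listing the failing paths inline keeps the status output to one line while still pointing at the broken routes.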
| 226 | + | ||
| 227 | + | ## Phase 12 — External Target: htpy.app | |
| 228 | + | ||
| 229 | + | Monitor https://htpy.app (homotopy-rs, repo at `/Users/max/Math/sseq-work/homotopy-rs`). PoM already supports multiple targets — this adds htpy.app as a third monitored site alongside MNW and MT. | |
| 230 | + | ||
| 231 | + | ### Done | |
| 232 | + | - [x] Add `[targets.htpy]` to deploy configs (pom-hetzner.toml, pom-astra.toml) with health URL, route checks, TLS | |
| 233 | + | - [x] Health check via Tailscale (`http://100.99.153.68:8080/archive/S_2`) with `body_contains = "htpy"` expectation | |
| 234 | + | - [x] Route check: `/archive/S_2` (the default redirect target from `/`) | |
| 235 | + | - [x] TLS monitoring for htpy.app (`[targets.htpy.tls] host = "htpy.app"`) | |
| 236 | + | - [x] Fix `classify_non_json` — non-JSON 2xx responses now promoted to Operational when all expectations pass | |
| 237 | + | - [x] Verified on both hetzner (9ms) and astra (185ms): operational, TLS valid (87d), routes 1/1 OK | |
| 238 | + | ||
| 239 | + | ### Not applicable | |
| 240 | + | - MNW health page: htpy.app is a separate service, doesn't belong on MNW's health dashboard | |
| 241 | + | ||
| 242 | + | ## Audit Action Items (2026-03-13, third audit — pre-launch skeptical lens) | |
| 243 | + | ||
| 244 | + | ### Done | |
| 245 | + | - [x] **CRITICAL:** Remove Postmark API token from deployment configs (`deploy/pom-hetzner.toml`, `deploy/pom-astra.toml`) — moved to `POM_POSTMARK_TOKEN` env var, loaded in config.rs, systemd `EnvironmentFile=/etc/pom/env` | |
| 246 | + | - [x] Add API authentication (bearer token middleware on all /api/* routes, `POM_API_TOKEN` env var or `[serve] api_token` config, 5 tests) | |
| 247 | + | - [x] Add peer mesh authentication (`[peers.X] token` field, heartbeat client sends `Authorization: Bearer` header, MNW health.rs updated to send token) | |
| 248 | + | - [x] Add integration tests for core functions (check_health, check_tls — 9 new integration tests with mock servers) | |
| 249 | + | - [x] Add self-monitoring capability (`/api/health` endpoint returns `{"status":"operational","version":"..."}`, no auth required) | |
| 250 | + | - [x] Shell-escape SSH test filter parameter (`checks/ssh.rs` — alphanumeric + `_:-` allowlist, returns error TestRun on invalid chars) | |
| 251 | + | - [x] Reject peer responses on UUID mismatch instead of just logging a warning (`peer.rs` — upgraded to tracing::error, skips status update, increments consecutive_failures) | |
| 252 | + | - [x] Add rate limiting to API endpoints (fixed-window 60 req/min middleware on authenticated routes, 1 unit test) | |
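The SSH filter allowlist (alphanumeric plus `_:-`) is simple enough to sketch directly. The function name and the `Result` return type are assumptions; according to the item above, the real code in `checks/ssh.rs` returns an error `TestRun` instead.

```rust
/// Accept a test filter only if every character is ASCII alphanumeric
/// or one of `_`, `:`, `-`. Anything else is rejected before the value
/// can reach a remote shell command line.
pub fn validate_test_filter(filter: &str) -> Result<&str, String> {
    let bad = filter
        .chars()
        .find(|c| !(c.is_ascii_alphanumeric() || matches!(*c, '_' | ':' | '-')));

    match bad {
        Some(ch) => Err(format!("invalid character {ch:?} in test filter")),
        None => Ok(filter),
    }
}
```

An allowlist is used here rather than escaping because the set of characters a test filter legitimately needs is small, which sidesteps shell quoting rules entirely.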
| 253 | + | ||
| 254 | + | ## Audit (Run 4, 2026-03-14) | |
| 255 | + | ||
| 256 | + | Full code audit of Phases 11-12 additions. 1 HIGH, 4 MEDIUM, 5 LOW findings. | |
| 257 | + | ||
| 258 | + | ### Done | |
| 259 | + | - [x] Audit route checks module (`src/checks/routes.rs`) — base_url parsing, error handling, edge cases | |
| 260 | + | - [x] Audit `classify_non_json` Operational promotion — verified correct, no false positives | |
| 261 | + | - [x] Audit deploy configs for consistency (htpy Tailscale IP, route lists, expectation accuracy) | |
| 262 | + | - [x] Review test coverage gaps in Phase 11-12 code | |
| 263 | + | ||
| 264 | + | ### Done (from audit findings) | |
| 265 | + | - [x] **HIGH:** Disable redirect following in route check client (`redirect(Policy::none())`) — was silently following redirects | |
| 266 | + | - [x] **MEDIUM:** Fix startup thundering herd — consume first tick of health/TLS/route/prune intervals before entering loop | |
| 267 | + | - [x] **MEDIUM:** Fix recovery cooldown interaction — `get_latest_alert_for_target` now excludes `%recovery%` alert types | |
| 268 | + | - [x] **MEDIUM:** Set `MissedTickBehavior::Delay` on route check interval to prevent back-to-back storms | |
| 269 | + | - [x] **LOW:** Validate `expected_routes` paths start with `/` at config load time | |
| 270 | + | - [x] 3 new tests (recovery cooldown, empty routes, path validation) | |
| 271 | + | ||
| 272 | + | ### Done (remaining findings — resolved) | |
| 273 | + | - [x] **MEDIUM:** Graceful shutdown — CancellationToken + `tokio::select!` in all task loops, `with_graceful_shutdown` on API server, 5s grace period (`cli.rs`) | |
| 274 | + | - [x] **LOW:** Remove redundant htpy route check — removed `expected_routes` from htpy target (deploy configs) | |
| 275 | + | - [x] **LOW:** Monitor for silent task panics — 60s watchdog checks `JoinHandle::is_finished()` in shutdown loop (`cli.rs`) | |
| 276 | + | ||
| 277 | + | ## Phase 13 — Per-Test Tracking & Duration Trending (Mar 2026) | |
| 278 | + | `TestDetail` struct + `details` field on `TestSummary`. Parse individual test lines from cargo output. Migration v7: `test_details` table. `insert_test_details`, `get_test_regressions`, `get_test_durations` DB queries. Duration drift detection (baseline 10 runs, 1.5x threshold). Wired into CLI, MCP, API. 10 new tests. MT test target added to astra + hetzner configs. | |
| 279 | + | ||
| 280 | + | ## Phase 6 — TLS (additional, Mar 2026) | |
| 281 | + | Domain WHOIS/registration check (registrar, expiry, nameservers). DNS record verification (A/AAAA/CNAME resolve to expected IPs). | |
| 282 | + | ||
| 283 | + | ## Run 6 Audit Items (Mar 2026) | |
| 284 | + | Fixed 6 collapsible_if clippy warnings (cli.rs, config.rs, checks/http.rs). | |
| 285 | + | ||
| 286 | + | ## Run 8 Audit Items (Mar 2026) | |
| 287 | + | Hardened `escape_js` in dashboard.rs: added newline, carriage return, `<` (`\x3c`), null escaping + 4 tests. | |
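A hardened escaper along the lines described might look like the sketch below. Only the newline, carriage return, `<` (as `\x3c`), and null cases come from the entry above; the handling of backslashes and quotes, and the choice of `\x00` for null, are assumptions about what `escape_js` in `dashboard.rs` already did.

```rust
/// Escape a string for embedding inside a JavaScript string literal
/// within an inline <script> block.
pub fn escape_js(input: &str) -> String {
    let mut out = String::with_capacity(input.len());
    for c in input.chars() {
        match c {
            // Assumed pre-existing cases: backslash and both quote styles.
            '\\' => out.push_str("\\\\"),
            '\'' => out.push_str("\\'"),
            '"' => out.push_str("\\\""),
            // Cases named in the audit item.
            '\n' => out.push_str("\\n"),
            '\r' => out.push_str("\\r"),
            // Escaping `<` stops a value from closing the script element.
            '<' => out.push_str("\\x3c"),
            '\0' => out.push_str("\\x00"),
            other => out.push(other),
        }
    }
    out
}
```

Escaping `<` matters because an HTML parser ends a script element at `</script>` regardless of JavaScript string quoting.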
| 288 | + | ||
| 289 | + | --- | |
| 290 | + | ||
| 291 | + | ## DNS/Route Stale Data Fix (2026-03-25) | |
| 292 | + | ||
| 293 | + | - [x] Switch Cloudflare-proxied DNS records to resolution-only checks | |
| 294 | + | - [x] Filter `route_status` and `dns_status` in API to only configured entries | |
| 295 | + | - [x] Add `prune_stale_routes()` and `prune_stale_dns()` DB functions | |
| 296 | + | - [x] Call prune functions at task startup | |
| 297 | + | - [x] Update integration tests for new filtering behavior | |
| 298 | + | - [x] Deploy to hetzner (pruned 890 stale route check rows on startup) | |
| 299 | + | ||
| 300 | + | --- | |
| 301 | + | ||
| 302 | + | ## Rust Patterns Audit (2026-03-21) | |
| 303 | + | ||
| 304 | + | - [x] Create `AlertCategory` enum (18 variants) replacing string literals | |
| 305 | + | - [x] Create `DnsRecordType` enum (A/Aaaa/Cname/Mx/Txt) replacing raw strings | |
| 306 | + | - [x] Add 30s timeout wrapper around email sends | |
| 307 | + | - [x] Eliminate HealthSnapshot clone under lock in API handlers | |
| 308 | + | - [x] Use `Cow<'_, str>` for JSON path response instead of String clone | |
| @@ -0,0 +1,115 @@ | |||
| 1 | + | # PoM (Peace of Mind) -- Audit History | |
| 2 | + | ||
| 3 | + | Full chronological audit log. See [audit_review.md](./audit_review.md) for current state. | |
| 4 | + | ||
| 5 | + | ## Changes Since Last Audit | |
| 6 | + | ||
| 7 | + | ### Tenth audit (2026-03-28, Run 12 cross-project) | |
| 8 | + | - **Test count:** 359 (222 unit + 8 cli + 129 integration). 0 clippy warnings. 0 failures. | |
| 9 | + | - **Grade:** A (maintained). v0.3.2. | |
| 10 | + | - **CORS monitoring:** New check type added for monitoring CORS headers on targets. | |
| 11 | + | - **New dependency advisories (action items):** | |
| 12 | + | - aws-lc-sys 0.38.0 (RUSTSEC-2026-0044 + -0048, severity 7.4 HIGH) — upgrade to 0.39.0 via `cargo update -p aws-lc-sys` | |
| 13 | + | - rustls-webpki 0.103.9 (RUSTSEC-2026-0049) — upgrade to 0.103.10 via `cargo update -p rustls-webpki` | |
| 14 | + | - paste unmaintained (RUSTSEC-2024-0436) — upstream via rmcp, warning only | |
| 15 | + | - **Mandatory surprise:** None. Previous surprises (rate limiter relaxed ordering, write!().unwrap() infallibility) still valid. | |
| 16 | + | - **No new code findings.** All previous items remain resolved. | |
| 17 | + | ||
| 18 | + | ### DNS/Route stale data fix (2026-03-25) | |
| 19 | + | - **Test count:** 352 (unchanged). 0 clippy warnings. | |
| 20 | + | - **Config:** Switched all 4 Cloudflare-proxied DNS records from `expected = ["IP"]` to `expected = []` (resolution-only). DNS checks were always failing because Cloudflare returns rotating proxy IPs, not the origin IP. | |
| 21 | + | - **API filtering:** `route_status` and `dns_status` in `/api/status/{target}` now filtered to only entries matching current config. Stale routes (e.g. `/docs/about`, `/signup`) and stale DNS records no longer appear in API responses. | |
| 22 | + | - **DB pruning:** Added `prune_stale_routes()` and `prune_stale_dns()` to `db.rs`. Called once at task startup in `routes.rs` and `dns.rs` to clean up historical data when config changes. Pruned 890 stale route check rows on first deploy. | |
| 23 | + | - **Integration tests:** Updated `api_status_includes_route_status` and `api_status_includes_dns_status` to use configs with matching route/DNS entries. | |
| 24 | + | - **Deployed to hetzner** — v0.3.2 binary + updated config. | |
| 25 | + | ||
| 26 | + | ### Eighth audit (2026-03-18, Run 9 cross-project) | |
| 27 | + | - **Test count:** 344 (unchanged). 0 clippy warnings. | |
| 28 | + | - **Grade:** A (maintained). v0.3.1 (deployed 2026-03-18). | |
| 29 | + | - **Dashboard UI shipped.** Per-test tracking, regression detection, duration drift. | |
| 30 | + | - **cli/ directory module split** completed (1,035-line cli.rs -> 8 files). | |
| 31 | + | - **Observations (pre-existing, not regressions):** | |
| 32 | + | - Mutex `.unwrap()` in rate limiter (api.rs:41): if a thread panics while holding the lock, subsequent calls also panic. Impact: LOW (rate limiter only, not core logic). Design choice: acceptable for a monitoring tool. | |
| 33 | + | - `serde_json::to_value(d).unwrap_or_default()` in API details field — silently becomes null on serialization failure. Impact: LOW, safe fallback. | |
| 34 | + | - **No new findings requiring action.** Grade maintained at A. | |
| 35 | + | - **Mandatory surprise:** Rate limiter uses `fetch_add` with Relaxed ordering — can allow up to max_per_window+1 requests due to check-then-increment race. Known trade-off of lock-free rate limiting, documented. | |
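To make the check-then-increment trade-off concrete, here is a minimal fixed-window limiter built on relaxed atomics. It is a sketch under assumptions, not the project's `RateLimiter`: the field names are hypothetical, and time is passed in as seconds so the behaviour is deterministic.

```rust
use std::sync::atomic::{AtomicU32, AtomicU64, Ordering};

/// Hypothetical lock-free fixed-window limiter.
pub struct RateLimiter {
    max_per_window: u32,
    window_secs: u64,
    window_start: AtomicU64,
    count: AtomicU32,
}

impl RateLimiter {
    pub fn new(max_per_window: u32, window_secs: u64) -> Self {
        RateLimiter {
            max_per_window,
            window_secs,
            window_start: AtomicU64::new(0),
            count: AtomicU32::new(0),
        }
    }

    /// Return true if a request arriving at `now_secs` is allowed.
    pub fn allow(&self, now_secs: u64) -> bool {
        // Start a fresh window once the current one has elapsed.
        let start = self.window_start.load(Ordering::Relaxed);
        if now_secs.saturating_sub(start) >= self.window_secs {
            self.window_start.store(now_secs, Ordering::Relaxed);
            self.count.store(0, Ordering::Relaxed);
        }

        // Check, then increment. Two threads can both pass the check
        // before either increments, so a window may admit slightly more
        // than `max_per_window` requests under contention.
        if self.count.load(Ordering::Relaxed) >= self.max_per_window {
            return false;
        }
        self.count.fetch_add(1, Ordering::Relaxed);
        true
    }
}
```

Single-threaded, the limit is exact; the overshoot only appears when requests race, which is the trade-off the observation above accepts in exchange for avoiding a lock on the hot path.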
| 36 | + | ||
| 37 | + | ### Fifth audit (2026-03-16, Run 6 cross-project) | |
| 38 | + | - **Test count:** 238 -> 344 (220 unit + 124 integration, +106 tests) | |
| 39 | + | - **Grade:** A (maintained). No new findings above LOW. | |
| 40 | + | - **Source LOC:** 10,113 (up from ~3.5K) | |
| 41 | + | - **Clippy:** 2 warnings (collapsible_if in cli.rs — LOW) | |
| 42 | + | - **Production unwraps:** 76 total — 64 infallible write! on String, 12 safe-by-construction. Effectively zero risky unwraps. | |
| 43 | + | - **Mandatory surprise:** The `write!().unwrap()` pattern on a `String` is provably infallible, so it is fine as written. | |
| 44 | + | - **Previous items verified:** All previous remediated items confirmed intact. | |
| 45 | + | - **Note:** cli.rs at 1,036 lines — approaching the 500-line branching guideline but mostly flat match arms. | |
| 46 | + | - **Infrastructure check:** Blocked by Tailscale SSH re-authentication. Deferred. | |
| 47 | + | ||
| 48 | + | ### Fourth audit remediation (2026-03-14) | |
| 49 | + | - **Grade:** A- -> A. All remaining findings resolved. | |
| 50 | + | - **Test count:** 229 -> 238 (+9 integration tests) | |
| 51 | + | - **Graceful shutdown:** Replaced `handle.abort()` with CancellationToken + `tokio::select!` in all task loops. API server uses `with_graceful_shutdown`. 5s grace period on SIGINT/SIGTERM. | |
| 52 | + | - **Task panic detection:** 60s watchdog checks `JoinHandle::is_finished()` on all background tasks. | |
| 53 | + | - **Rate limiting:** Fixed-window 60 req/min middleware on authenticated API routes. Custom `RateLimiter` struct. | |
| 54 | + | - **Self-monitoring:** `GET /api/health` endpoint (public, no auth) returns `{"status":"operational","version":"..."}`. | |
| 55 | + | - **Integration tests:** 5 check_health tests (mock axum servers: operational, degraded, unreachable, expectations pass/fail), 1 check_tls test (self-signed cert via rcgen), 2 /api/health tests, 1 rate limiter test. | |
| 56 | + | - **Deploy config cleanup:** Removed redundant htpy `expected_routes` (duplicated health check URL). | |
| 57 | + | - **Dependency:** Added `tokio-util` for CancellationToken. | |
| 58 | + | - **Cold spots:** 0 remaining (was 3). All previous architectural and testing gaps closed. | |
| 59 | + | ||
| 60 | + | ### Third audit (2026-03-13, pre-launch skeptical lens) | |
| 61 | + | - **Grade:** A -> A-. Postmark API token in plaintext deployment configs is a real issue. | |
| 62 | + | - **Test count:** 56 -> 187 (+131 tests) | |
| 63 | + | - **New findings:** Plaintext API token, no API auth, no peer mesh auth, no integration tests for core functions, no self-monitoring. | |
| 64 | + | - **38 unwraps in non-test code** — all verified safe (write to String or guarded by prior checks). | |
| 65 | + | ||
| 66 | + | **Post-audit remediation (2026-03-13):** | |
| 67 | + | - All 3 critical/medium findings resolved: Postmark token to env var, API bearer auth (5 tests), peer mesh auth | |
| 68 | + | - 2 low findings resolved: SSH filter validation, peer UUID mismatch rejection | |
| 69 | + | - Test count: 187 -> 195 (+8 tests) | |
| 70 | + | - Documentation upgraded to A: All struct fields documented (HealthSnapshot, HealthStatus, HealthDetails, TestRun, TestStaleness, PeerStatus, OnMissing, all config types, all API response types). All 8 error variants documented. 11 config defaults with rationale comments. prune_old_records return tuple documented. description.md rewritten, architecture.md created (191 lines), README created (62 lines). | |
| 71 | + | ||
| 72 | + | ### Observability Upgrade (2026-03-13) | |
| 73 | + | - **Observability:** A- -> A | |
| 74 | + | - Added 57 `#[instrument(skip_all)]` annotations across 9 files: db.rs (28), alerts.rs (9), tools/mod.rs (8), tools/health.rs (5), tools/tests.rs (3), checks/http.rs (1), checks/tls.rs (1), checks/ssh.rs (1), peer.rs (1) | |
| 75 | + | - Added Multithreaded forum as monitoring target: `pom-astra.toml` (localhost:3400), `pom-hetzner.toml` (Tailscale IP) | |
| 76 | + | - Added test runner targets for GO, BB, AF, SK to `pom-astra.toml` | |
| 77 | + | - All 208 tests pass. `cargo check` passes clean. | |
| 78 | + | ||
| 79 | + | ### Adversarial Test Audit (2026-03-13) | |
| 80 | + | ||
| 81 | + | **Goal:** Write tests that try to break the system. Find edge cases, race conditions, boundary conditions, and logic errors. | |
| 82 | + | ||
| 83 | + | **Results:** | |
| 84 | + | - **Test count:** 195 -> 208 (+13 tests) | |
| 85 | + | - **CRITICAL fix:** Alert cooldown key mismatch — `record_alert` used `target` but lookup used `alert_key` (`"health:{target}"`), so cooldowns never matched and alerts fired every check. Fixed by using `alert_key` consistently. | |
| 86 | + | - **HIGH fix:** TLS expiry check inconsistent at day boundary — time-of-day comparison could cause flapping. Changed to `date_naive()` comparison for stable day-level logic. | |
| 87 | + | - **HIGH fix:** UUID mismatch left stale peer state — now resets state, clears failures, persists via `update_peer_identity()` to prevent showing stale data after peer identity change. | |
| 88 | + | - **HIGH fix:** `prune_old_records` had no guard for `days <= 0`, so such a call could delete all records. Added an early return for `days <= 0` (no-op). | |
| 89 | + | - **HIGH fix:** SSH timeout ignored config value — hardcoded `ConnectTimeout=10` in SSH args. Changed to use `config.timeout_secs`. | |
| 90 | + | - **Added `rcgen` dev dependency** for TLS cert generation in tests. | |
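The cooldown key mismatch is easiest to avoid when the write path and the lookup path build the key through one helper. The following is a minimal sketch of that idea, not PoM's actual code: `AlertLog`, `in_cooldown`, and the in-memory `HashMap` are hypothetical stand-ins for the real alert history storage, and only the `"health:{target}"` key shape comes from the audit notes.

```rust
use std::collections::HashMap;

/// Builds the cooldown key for a health alert. Routing both the write and
/// the lookup through this helper keeps the two sides from drifting apart.
fn alert_key(target: &str) -> String {
    format!("health:{target}")
}

/// Hypothetical in-memory stand-in for the alert history table.
#[derive(Default)]
struct AlertLog {
    /// Maps a cooldown key to the unix time (seconds) of the last alert.
    last_sent: HashMap<String, u64>,
}

impl AlertLog {
    /// Stores the time of the most recent alert under `key`.
    fn record_alert(&mut self, key: &str, now: u64) {
        self.last_sent.insert(key.to_string(), now);
    }

    /// Returns true while fewer than `cooldown_secs` have passed since the
    /// last alert recorded under `key`. Unknown keys are never in cooldown.
    fn in_cooldown(&self, key: &str, now: u64, cooldown_secs: u64) -> bool {
        match self.last_sent.get(key) {
            Some(&sent) => now.saturating_sub(sent) < cooldown_secs,
            None => false,
        }
    }
}
```

Recording under the bare target name and looking up under `alert_key(target)` reproduces the original bug: the lookup never finds an entry, so every check is treated as outside the cooldown.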
| 91 | + | ||
| 92 | + | ### Second audit (2026-03-11) | |
| 93 | + | | Change | Detail | | |
| 94 | + | |--------|--------| | |
| 95 | + | | Tests | +39 tests (17 -> 56). 28 unit + 28 integration. Tests/KLOC: 5.8 -> 18.4. | | |
| 96 | + | | Lock contention | Addressed in both peer.rs (heartbeat handlers) and api.rs (status/mesh handlers). Data collected under lock, DB writes after release. | | |
| 97 | + | | DB indexes | 4 indexes added: health_checks(target, id DESC), health_checks(target, checked_at), test_runs(target, id DESC), peer_heartbeats(peer_name, id DESC). | | |
| 98 | + | | Clippy | 4 warnings -> 0. Used Rust 2024 let chains instead of nested if-let. | | |
| 99 | + | | Type safety | PeerConfig.on_missing changed from String to OnMissing enum with serde deserialization. | | |
| 100 | + | | Module docs | Added //! docs to db.rs, config.rs, peer.rs, types.rs, lib.rs. | | |
| 101 | + | | Error handling | /api/peer/status fetch failures now logged at debug level instead of silenced. | | |
| 102 | + | | Prune | prune_old_records now returns 3-tuple including peer heartbeat count. | | |
| 103 | + | | Code extraction | HealthStatus::icon() method eliminates 3 repeated match blocks. | | |
| 104 | + | | HTTP checks | Response classification extracted into pure functions for testability. | | |
| 105 | + | ||
| 106 | + | ## Metrics Over Time | |
| 107 | + | ||
| 108 | + | | Audit Date | LOC | Rust Files | Tests | Tests/KLOC | Clippy Warnings | Cold Spots | Overall | | |
| 109 | + | |------------|-----|-----------|-------|-----------|----------------|------------|---------| | |
| 110 | + | | 2026-03-10 | 2,934 | 15 | 17 | 5.8 | 4 | 8 | B+ | | |
| 111 | + | | 2026-03-11 | 3,039 | 14 | 56 | 18.4 | 0 | 3 | A | | |
| 112 | + | | 2026-03-13 | ~3K | ~14 | 208 | ~69 | 0 | 3 | A- | | |
| 113 | + | | 2026-03-14 | ~3.5K | ~16 | 238 | ~68 | 0 | 0 | A | | |
| 114 | + | | 2026-03-16 | 10.1K | 23 | 344 | ~34 | 2 | 0 | A | | |
| 115 | + | | 2026-03-18 | 10.1K | 23 | 344 | ~34 | 0 | 0 | A | |
| @@ -144,114 +144,6 @@ Filed in `docs/mnw/pom/todo.md`. | |||
| 144 | 144 | 10. ~~Add heartbeat state machine tests~~ -- Done (9 tests) | |
| 145 | 145 | 11. ~~Add config parsing tests~~ -- Done (4 tests) | |
| 146 | 146 | ||
| 147 | - | ## Changes Since Last Audit | |
| 148 | - | ||
| 149 | - | ### Tenth audit (2026-03-28, Run 12 cross-project) | |
| 150 | - | - **Test count:** 359 (222 unit + 8 cli + 129 integration). 0 clippy warnings. 0 failures. | |
| 151 | - | - **Grade:** A (maintained). v0.3.2. | |
| 152 | - | - **CORS monitoring:** New check type added for monitoring CORS headers on targets. | |
| 153 | - | - **New dependency advisories (action items):** | |
| 154 | - | - aws-lc-sys 0.38.0 (RUSTSEC-2026-0044 + -0048, severity 7.4 HIGH) — upgrade to 0.39.0 via `cargo update -p aws-lc-sys` | |
| 155 | - | - rustls-webpki 0.103.9 (RUSTSEC-2026-0049) — upgrade to 0.103.10 via `cargo update -p rustls-webpki` | |
| 156 | - | - paste unmaintained (RUSTSEC-2024-0436) — upstream via rmcp, warning only | |
| 157 | - | - **Mandatory surprise:** None. Previous surprises (rate limiter relaxed ordering, write!().unwrap() infallibility) still valid. | |
| 158 | - | - **No new code findings.** All previous items remain resolved. | |
| 159 | - | ||
| 160 | - | ### DNS/Route stale data fix (2026-03-25) | |
| 161 | - | - **Test count:** 352 (unchanged). 0 clippy warnings. | |
| 162 | - | - **Config:** Switched all 4 Cloudflare-proxied DNS records from `expected = ["IP"]` to `expected = []` (resolution-only). DNS checks were always failing because Cloudflare returns rotating proxy IPs, not the origin IP. | |
| 163 | - | - **API filtering:** `route_status` and `dns_status` in `/api/status/{target}` now filtered to only entries matching current config. Stale routes (e.g. `/docs/about`, `/signup`) and stale DNS records no longer appear in API responses. | |
| 164 | - | - **DB pruning:** Added `prune_stale_routes()` and `prune_stale_dns()` to `db.rs`. Called once at task startup in `routes.rs` and `dns.rs` to clean up historical data when config changes. Pruned 890 stale route check rows on first deploy. | |
| 165 | - | - **Integration tests:** Updated `api_status_includes_route_status` and `api_status_includes_dns_status` to use configs with matching route/DNS entries. | |
| 166 | - | - **Deployed to hetzner** — v0.3.2 binary + updated config. | |
| 167 | - | ||
| 168 | - | ### Eighth audit (2026-03-18, Run 9 cross-project) | |
| 169 | - | - **Test count:** 344 (unchanged). 0 clippy warnings. | |
| 170 | - | - **Grade:** A (maintained). v0.3.1 (deployed 2026-03-18). | |
| 171 | - | - **Dashboard UI shipped.** Per-test tracking, regression detection, duration drift. | |
| 172 | - | - **cli/ directory module split** completed (1,035-line cli.rs -> 8 files). | |
| 173 | - | - **Observations (pre-existing, not regressions):** | |
| 174 | - | - Mutex `.unwrap()` in rate limiter (api.rs:41) — if thread panics while holding lock, subsequent calls panic. Impact: LOW (rate limiter only, not core logic). Design choice: acceptable for monitoring tool. | |
| 175 | - | - `serde_json::to_value(d).unwrap_or_default()` in API details field — silently becomes null on serialization failure. Impact: LOW, safe fallback. | |
| 176 | - | - **No new findings requiring action.** Grade maintained at A. | |
| 177 | - | - **Mandatory surprise:** Rate limiter uses `fetch_add` with Relaxed ordering — can allow up to max_per_window+1 requests due to check-then-increment race. Known trade-off of lock-free rate limiting, documented. | |
| 178 | - | ||
| 179 | - | ### Fifth audit (2026-03-16, Run 6 cross-project) | |
| 180 | - | - **Test count:** 238 -> 344 (220 unit + 124 integration, +106 tests) | |
| 181 | - | - **Grade:** A (maintained). No new findings above LOW. | |
| 182 | - | - **Source LOC:** 10,113 (up from ~3.5K) | |
| 183 | - | - **Clippy:** 2 warnings (collapsible_if in cli.rs — LOW) | |
| 184 | - | - **Production unwraps:** 76 total — 64 infallible write! on String, 12 safe-by-construction. Effectively zero risky unwraps. | |
| 185 | - | - **Mandatory surprise:** write!().unwrap() pattern provably infallible — Actually fine. | |
| 186 | - | - **Previous items verified:** All previous remediated items confirmed intact. | |
| 187 | - | - **Note:** cli.rs at 1,036 lines — approaching the 500-line branching guideline but mostly flat match arms. | |
| 188 | - | - **Infrastructure check:** Blocked by Tailscale SSH re-authentication. Deferred. | |
| 189 | - | ||
| 190 | - | ### Fourth audit remediation (2026-03-14) | |
| 191 | - | - **Grade:** A- -> A. All remaining findings resolved. | |
| 192 | - | - **Test count:** 229 -> 238 (+9 integration tests) | |
| 193 | - | - **Graceful shutdown:** Replaced `handle.abort()` with CancellationToken + `tokio::select!` in all task loops. API server uses `with_graceful_shutdown`. 5s grace period on SIGINT/SIGTERM. | |
| 194 | - | - **Task panic detection:** 60s watchdog checks `JoinHandle::is_finished()` on all background tasks. | |
| 195 | - | - **Rate limiting:** Fixed-window 60 req/min middleware on authenticated API routes. Custom `RateLimiter` struct. | |
| 196 | - | - **Self-monitoring:** `GET /api/health` endpoint (public, no auth) returns `{"status":"operational","version":"..."}`. | |
| 197 | - | - **Integration tests:** 5 check_health tests (mock axum servers: operational, degraded, unreachable, expectations pass/fail), 1 check_tls test (self-signed cert via rcgen), 2 /api/health tests, 1 rate limiter test. | |
| 198 | - | - **Deploy config cleanup:** Removed redundant htpy `expected_routes` (duplicated health check URL). | |
| 199 | - | - **Dependency:** Added `tokio-util` for CancellationToken. | |
| 200 | - | - **Cold spots:** 0 remaining (was 3). All previous architectural and testing gaps closed. | |
| 201 | - | ||
| 202 | - | ### Third audit (2026-03-13, pre-launch skeptical lens) | |
| 203 | - | - **Grade:** A -> A-. Postmark API token in plaintext deployment configs is a real issue. | |
| 204 | - | - **Test count:** 56 -> 187 (+131 tests) | |
| 205 | - | - **New findings:** Plaintext API token, no API auth, no peer mesh auth, no integration tests for core functions, no self-monitoring. | |
| 206 | - | - **38 unwraps in non-test code** — all verified safe (write to String or guarded by prior checks). | |
| 207 | - | ||
| 208 | - | **Post-audit remediation (2026-03-13):** | |
| 209 | - | - All 3 critical/medium findings resolved: Postmark token to env var, API bearer auth (5 tests), peer mesh auth | |
| 210 | - | - 2 low findings resolved: SSH filter validation, peer UUID mismatch rejection | |
| 211 | - | - Test count: 187 -> 195 (+8 tests) | |
| 212 | - | - Documentation upgraded to A: All struct fields documented (HealthSnapshot, HealthStatus, HealthDetails, TestRun, TestStaleness, PeerStatus, OnMissing, all config types, all API response types). All 8 error variants documented. 11 config defaults with rationale comments. prune_old_records return tuple documented. description.md rewritten, architecture.md created (191 lines), README created (62 lines). | |
| 213 | - | ||
| 214 | - | ### Observability Upgrade (2026-03-13) | |
| 215 | - | - **Observability:** A- -> A | |
| 216 | - | - Added 57 `#[instrument(skip_all)]` annotations across 9 files: db.rs (28), alerts.rs (9), tools/mod.rs (8), tools/health.rs (5), tools/tests.rs (3), checks/http.rs (1), checks/tls.rs (1), checks/ssh.rs (1), peer.rs (1) | |
| 217 | - | - Added Multithreaded forum as monitoring target: `pom-astra.toml` (localhost:3400), `pom-hetzner.toml` (Tailscale IP) | |
| 218 | - | - Added test runner targets for GO, BB, AF, SK to `pom-astra.toml` | |
| 219 | - | - All 208 tests pass. `cargo check` passes clean. | |
| 220 | - | ||
| 221 | - | ### Adversarial Test Audit (2026-03-13) | |
| 222 | - | ||
| 223 | - | **Goal:** Write tests that try to break the system. Find edge cases, race conditions, boundary conditions, and logic errors. | |
| 224 | - | ||
| 225 | - | **Results:** | |
| 226 | - | - **Test count:** 195 -> 208 (+13 tests) | |
| 227 | - | - **CRITICAL fix:** Alert cooldown key mismatch — `record_alert` used `target` but lookup used `alert_key` (`"health:{target}"`), so cooldowns never matched and alerts fired every check. Fixed by using `alert_key` consistently. | |
| 228 | - | - **HIGH fix:** TLS expiry check inconsistent at day boundary — time-of-day comparison could cause flapping. Changed to `date_naive()` comparison for stable day-level logic. | |
| 229 | - | - **HIGH fix:** UUID mismatch left stale peer state — now resets state, clears failures, persists via `update_peer_identity()` to prevent showing stale data after peer identity change. | |
| 230 | - | - **HIGH fix:** `prune_old_records` no guard for days <= 0 — could delete all records. Added early return for `days <= 0` (no-op). | |
| 231 | - | - **HIGH fix:** SSH timeout ignored config value — hardcoded `ConnectTimeout=10` in SSH args. Changed to use `config.timeout_secs`. | |
| 232 | - | - **Added `rcgen` dev dependency** for TLS cert generation in tests. | |
| 233 | - | ||
| 234 | - | ### Second audit (2026-03-11) | |
| 235 | - | | Change | Detail | | |
| 236 | - | |--------|--------| | |
| 237 | - | | Tests | +39 tests (17 -> 56). 28 unit + 28 integration. Tests/KLOC: 5.8 -> 18.4. | | |
| 238 | - | | Lock contention | Addressed in both peer.rs (heartbeat handlers) and api.rs (status/mesh handlers). Data collected under lock, DB writes after release. | | |
| 239 | - | | DB indexes | 4 indexes added: health_checks(target, id DESC), health_checks(target, checked_at), test_runs(target, id DESC), peer_heartbeats(peer_name, id DESC). | | |
| 240 | - | | Clippy | 4 warnings -> 0. Used Rust 2024 let chains instead of nested if-let. | | |
| 241 | - | | Type safety | PeerConfig.on_missing changed from String to OnMissing enum with serde deserialization. | | |
| 242 | - | | Module docs | Added //! docs to db.rs, config.rs, peer.rs, types.rs, lib.rs. | | |
| 243 | - | | Error handling | /api/peer/status fetch failures now logged at debug level instead of silenced. | | |
| 244 | - | | Prune | prune_old_records now returns 3-tuple including peer heartbeat count. | | |
| 245 | - | | Code extraction | HealthStatus::icon() method eliminates 3 repeated match blocks. | | |
| 246 | - | | HTTP checks | Response classification extracted into pure functions for testability. | | |
| 247 | - | ||
| 248 | - | ## Metrics Over Time | |
| 249 | - | ||
| 250 | - | | Audit Date | LOC | Rust Files | Tests | Tests/KLOC | Clippy Warnings | Cold Spots | Overall | | |
| 251 | - | |------------|-----|-----------|-------|-----------|----------------|------------|---------| | |
| 252 | - | | 2026-03-10 | 2,934 | 15 | 17 | 5.8 | 4 | 8 | B+ | | |
| 253 | - | | 2026-03-11 | 3,039 | 14 | 56 | 18.4 | 0 | 3 | A | | |
| 254 | - | | 2026-03-13 | ~3K | ~14 | 208 | ~69 | 0 | 3 | A- | | |
| 255 | - | | 2026-03-14 | ~3.5K | ~16 | 238 | ~68 | 0 | 0 | A | | |
| 256 | - | | 2026-03-16 | 10.1K | 23 | 344 | ~34 | 2 | 0 | A | | |
| 257 | - | | 2026-03-18 | 10.1K | 23 | 344 | ~34 | 0 | 0 | A | | |
| 147 | + | --- | |
| 148 | + | ||
| 149 | + | See [audit_history.md](./audit_history.md) for the full chronological audit log. |
| @@ -0,0 +1,90 @@ | |||
| 1 | + | # PoM -- Competitive Analysis | |
| 2 | + | ||
| 3 | + | Last updated: 2026-04-02 | |
| 4 | + | ||
| 5 | + | ## Positioning | |
| 6 | + | ||
| 7 | + | PoM (Peace of Mind) is a single-binary production monitor built for indie developers and small teams. It runs as a peer mesh -- two instances cross-check each other with no central dashboard required. CLI-first, with an optional HTTP API and Claude integration (MCP server mode). | |
| 8 | + | ||
| 9 | + | The key differentiators are the peer mesh architecture (no single point of failure for monitoring), the CLI-first interface (inspect via SSH, no browser needed), and the Claude MCP integration (AI-assisted diagnostics). PoM monitors what matters for small deployments: uptime, TLS certificates, DNS records, domain registration, route availability, and test freshness. | |
| 10 | + | ||
| 11 | + | ## Pricing Comparison | |
| 12 | + | ||
| 13 | + | | Tool | Price | Model | | |
| 14 | + | |------|-------|-------| | |
| 15 | + | | **PoM** | Free | Source-available (PolyForm NC) | | |
| 16 | + | | Uptime Robot | $0-$58/mo | Freemium (50 monitors free) | | |
| 17 | + | | Pingdom | $15-$100/mo | SaaS | | |
| 18 | + | | Datadog | $15-$23/host/mo | SaaS | | |
| 19 | + | | New Relic | $0-$0.35/GB | Freemium | | |
| 20 | + | | Grafana + Prometheus | Free (self-host) | Open source | | |
| 21 | + | | StatusCake | $0-$67/mo | Freemium | | |
| 22 | + | | Hetrix Tools | $0-$20/mo | Freemium | | |
| 23 | + | ||
| 24 | + | ## Feature Matrix | |
| 25 | + | ||
| 26 | + | | Feature | PoM | Uptime Robot | Pingdom | Datadog | Grafana+Prom | | |
| 27 | + | |---------|:---:|:-----------:|:-------:|:-------:|:------------:| | |
| 28 | + | | HTTP health checks | Y | Y | Y | Y | Y | | |
| 29 | + | | TLS certificate monitoring | Y | Y | Y | Y | N* | | |
| 30 | + | | DNS record verification | Y | N | N | Y | N* | | |
| 31 | + | | WHOIS domain expiry | Y | N | N | N | N* | | |
| 32 | + | | Route availability checks | Y | N | Y | Y | N* | | |
| 33 | + | | CORS preflight checks | Y | N | N | N | N | | |
| 34 | + | | Peer mesh (cross-monitoring) | Y | N | N | N | N | | |
| 35 | + | | CLI-first interface | Y | N | N | N | N | | |
| 36 | + | | Claude MCP integration | Y | N | N | N | N | | |
| 37 | + | | SSH test execution | Y | N | N | N | N | | |
| 38 | + | | Latency drift detection | Y | N | Y | Y | Y | | |
| 39 | + | | Test duration drift | Y | N | N | N | N | | |
| 40 | + | | Email alerts | Y | Y | Y | Y | Y | | |
| 41 | + | | Status page | N | Y | Y | Y | Y** | | |
| 42 | + | | Mobile app | N | Y | Y | Y | Y** | | |
| 43 | + | | APM / traces | N | N | N | Y | Y | | |
| 44 | + | | Log aggregation | N | N | N | Y | Y | | |
| 45 | + | | Self-hosted | Y | N | N | N | Y | | |
| 46 | + | | Single binary | Y | N/A | N/A | N/A | N | | |
| 47 | + | ||
| 48 | + | \* Requires additional exporters. \*\* Via Grafana dashboards. | |
| 49 | + | ||
| 50 | + | ## Competitor Deep Dives | |
| 51 | + | ||
| 52 | + | ### 1. Uptime Robot | |
| 53 | + | ||
| 54 | + | Simple uptime monitoring SaaS. Free tier with 50 monitors at 5-minute intervals. Pro adds 1-minute intervals, SSL monitoring, status pages. The default choice for indie developers. | |
| 55 | + | ||
| 56 | + | **What PoM lacks:** status pages, mobile app, SMS/Slack/webhook alerts, maintenance windows. **What Uptime Robot lacks:** peer mesh, CLI interface, DNS/WHOIS monitoring, SSH test execution, AI integration. | |
| 57 | + | ||
| 58 | + | ### 2. Datadog | |
| 59 | + | ||
| 60 | + | Enterprise observability platform (APM, logs, metrics, dashboards). Powerful but expensive and invasive (requires agents on every host). Overkill for small deployments. | |
| 61 | + | ||
| 62 | + | **What PoM lacks:** APM, distributed tracing, dashboards, log aggregation, 800+ integrations. **What Datadog lacks:** peer mesh, CLI-first operation, single binary simplicity, affordability for indie teams. | |
| 63 | + | ||
| 64 | + | ### 3. Grafana + Prometheus | |
| 65 | + | ||
| 66 | + | Open-source metrics and visualization stack. Extremely flexible, industry standard. Requires significant setup (Prometheus server, exporters, Grafana instance, alertmanager). No built-in TLS/DNS/WHOIS monitoring without custom exporters. | |
| 67 | + | ||
| 68 | + | **What PoM lacks:** rich dashboards, metric visualization, alertmanager flexibility, ecosystem of exporters. **What Grafana+Prom lacks:** out-of-box TLS/DNS/WHOIS, peer mesh, single binary, zero-config setup. | |
| 69 | + | ||
| 70 | + | ### 4. StatusCake | |
| 71 | + | ||
| 72 | + | Web-based uptime and page speed monitoring. Free tier with 10 monitors. Pro adds SSL, domain, and server monitoring. Similar scope to Uptime Robot but with more check types. | |
| 73 | + | ||
| 74 | + | **What PoM lacks:** page speed testing, server monitoring agents, status pages, Slack/Teams integration. | |
| 75 | + | ||
| 76 | + | ## What We Offer That Competitors Don't | |
| 77 | + | ||
| 78 | + | - **Peer mesh** -- two PoM instances monitor each other. If one goes down, the other detects it. There is no central dashboard to act as a single point of failure. | |
| 79 | + | - **CLI-first** -- inspect status, run checks, query history from the terminal via SSH. No browser required. | |
| 80 | + | - **Claude MCP integration** -- expose health checks, test execution, and mesh status as MCP tools for AI-assisted diagnostics. | |
| 81 | + | - **SSH test execution** -- trigger and parse CI test runs on remote servers, track test freshness and duration drift. | |
| 82 | + | - **Single binary, zero dependencies** -- no Docker, no agents, and no external services needed to run checks. SQLite stores history; Postmark is used only to send email alerts. | |
| 83 | + | - **Monitoring-offline meta-alert** -- detects when all targets are unreachable simultaneously (likely a PoM network issue, not actual outages). Prevents false alarm cascades. | |
| 84 | + | ||
| 85 | + | ## Target Users | |
| 86 | + | ||
| 87 | + | - Indie developers running 1-5 services who want monitoring without SaaS costs | |
| 88 | + | - Small teams that operate via SSH and prefer CLI tools over web dashboards | |
| 89 | + | - Anyone who wants peer-verified monitoring (not trusting a single monitoring vendor) | |
| 90 | + | - Claude Code users who want AI-assisted production diagnostics |
| @@ -0,0 +1,202 @@ | |||
| 1 | + | # PoM Operational Runbook | |
| 2 | + | ||
| 3 | + | Procedures for responding to alerts, managing the service, and troubleshooting common issues. | |
| 4 | + | ||
| 5 | + | ## Alert Response Guide | |
| 6 | + | ||
| 7 | + | ### Health Status Change (Operational -> Error/Unreachable) | |
| 8 | + | ||
| 9 | + | **Symptoms:** Email alert with target status change. | |
| 10 | + | ||
| 11 | + | **Steps:** | |
| 12 | + | 1. Verify manually: `curl -v https://makenot.work/api/health` | |
| 13 | + | 2. If **Unreachable**: check network (Tailscale, firewall, DNS resolution) | |
| 14 | + | 3. If **Error** (5xx): SSH into the target server, check application logs | |
| 15 | + | ```sh | |
| 16 | + | ssh root@100.120.174.96 'journalctl -u makenotwork --since "10 minutes ago"' | |
| 17 | + | ``` | |
| 18 | + | 4. If **Degraded** (2xx but unexpected body): check application state, database connectivity | |
| 19 | + | 5. Restart the service if needed: `ssh root@100.120.174.96 systemctl restart makenotwork` | |
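The triage steps above follow a simple mapping from the check result to a status. A minimal sketch of that mapping as a pure function follows. It assumes the variant names shown here (the runbook names the statuses, but this enum's shape is illustrative) and assumes that any non-2xx response counts as Error, since the runbook only spells out the 5xx case.

```rust
/// Status names follow the runbook; the exact enum shape is an assumption.
#[derive(Debug, PartialEq)]
enum HealthStatus {
    Operational,
    Degraded,
    Error,
    Unreachable,
}

/// Classifies one health check result.
///
/// `status` is `None` when no HTTP response arrived at all (timeout,
/// connection refused, DNS failure). `body_matches` says whether the
/// response body met the configured expectations.
fn classify(status: Option<u16>, body_matches: bool) -> HealthStatus {
    match status {
        None => HealthStatus::Unreachable,
        Some(code) if (200..300).contains(&code) => {
            if body_matches {
                HealthStatus::Operational
            } else {
                HealthStatus::Degraded
            }
        }
        // Assumption: every other status code is treated as Error.
        // The runbook only names the 5xx case explicitly.
        Some(_) => HealthStatus::Error,
    }
}
```

Keeping the classification free of I/O means it can be unit tested without a mock server.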
| 20 | + | ||
| 21 | + | ### TLS Certificate Expiry | |
| 22 | + | ||
| 23 | + | **Symptoms:** Alert when certificate expires within 14 days. | |
| 24 | + | ||
| 25 | + | **Steps:** | |
| 26 | + | 1. Verify: `openssl s_client -connect makenot.work:443 -servername makenot.work </dev/null 2>/dev/null | openssl x509 -noout -dates` | |
| 27 | + | 2. Cloudflare Origin CA certs (15-year): no renewal needed. If alert fires, check Caddy config. | |
| 28 | + | 3. If Caddy is serving wrong cert: verify cert paths in `/etc/caddy/Caddyfile` | |
| 29 | + | 4. For custom domains (on-demand TLS): Caddy auto-renews via ACME. Check Caddy logs. | |
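The 14-day warning is more stable when it compares calendar days rather than exact timestamps, which is what the audit notes describe for the TLS expiry fix. The sketch below swaps in plain integer arithmetic on unix timestamps (UTC days) in place of the `date_naive()` comparison mentioned in the audit log; the function names and the inclusive `<=` threshold are assumptions.

```rust
const SECS_PER_DAY: i64 = 86_400;

/// Whole UTC calendar days between `now_unix` and `not_after_unix`.
/// Both timestamps are truncated to their day first, so the result does
/// not change with the time of day at which the check runs.
fn days_until_expiry(now_unix: i64, not_after_unix: i64) -> i64 {
    not_after_unix.div_euclid(SECS_PER_DAY) - now_unix.div_euclid(SECS_PER_DAY)
}

/// Returns true when the certificate expires within `warn_days` days.
/// Assumption: the threshold is inclusive.
fn tls_expiry_warning(now_unix: i64, not_after_unix: i64, warn_days: i64) -> bool {
    days_until_expiry(now_unix, not_after_unix) <= warn_days
}
```

Because both sides are truncated to the day, two checks run at 00:01 and 23:59 on the same date give the same answer, which avoids flapping at the boundary.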
| 30 | + | ||
| 31 | + | ### TLS Check Failed | |
| 32 | + | ||
| 33 | + | **Symptoms:** Handshake timeout, certificate parse failure, or connection refused. | |
| 34 | + | ||
| 35 | + | **Steps:** | |
| 36 | + | 1. Verify: `openssl s_client -connect makenot.work:443 -servername makenot.work` | |
| 37 | + | 2. Check Caddy status: `ssh root@100.120.174.96 systemctl status caddy` | |
| 38 | + | 3. Check if port 443 is open: `ssh root@100.120.174.96 ss -tlnp | grep 443` | |
| 39 | + | 4. If Caddy is down, restart: `ssh root@100.120.174.96 systemctl restart caddy` | |
| 40 | + | ||
| 41 | + | ### Peer Missing | |
| 42 | + | ||
| 43 | + | **Symptoms:** Peer (astra or hetzner) unreachable for 3+ consecutive heartbeats (3+ minutes). | |
| 44 | + | ||
| 45 | + | **Steps:** | |
| 46 | + | 1. SSH into the peer: `ssh max@100.106.221.39` (astra) or `ssh root@100.120.174.96` (hetzner) | |
| 47 | + | 2. Check PoM service: `systemctl status pom` | |
| 48 | + | 3. Check Tailscale connectivity: `tailscale ping <peer-ip>` | |
| 49 | + | 4. If PoM is down: `systemctl restart pom` | |
| 50 | + | 5. If Tailscale is down: `systemctl restart tailscaled` | |
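The "3+ consecutive heartbeats" rule can be pictured as a small counter that resets on any success. This is a sketch under assumptions, not the mesh implementation: `PeerTracker` and its fields are hypothetical names, and the choice to signal only at the exact moment the threshold is crossed (so one outage yields one alert) is illustrative.

```rust
/// Hypothetical per-peer failure tracker; names are illustrative.
struct PeerTracker {
    consecutive_failures: u32,
    threshold: u32,
}

impl PeerTracker {
    fn new(threshold: u32) -> Self {
        PeerTracker { consecutive_failures: 0, threshold }
    }

    /// Records one heartbeat result. Returns true only on the heartbeat
    /// where the failure count reaches the threshold, so a long outage
    /// triggers a single "peer missing" signal rather than one per minute.
    fn record(&mut self, heartbeat_ok: bool) -> bool {
        if heartbeat_ok {
            // Any successful heartbeat clears the failure streak.
            self.consecutive_failures = 0;
            return false;
        }
        self.consecutive_failures = self.consecutive_failures.saturating_add(1);
        self.consecutive_failures == self.threshold
    }
}
```

With a 60-second heartbeat and a threshold of 3, the signal fires roughly three minutes into an outage, which matches the symptom described above.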
| 51 | + | ||
| 52 | + | ### Latency Drift | |
| 53 | + | ||
| 54 | + | **Symptoms:** Sustained response time increase (>2x the 7-day baseline). | |
| 55 | + | ||
| 56 | + | **Steps:** | |
| 57 | + | 1. Check server load: `ssh root@100.120.174.96 top -bn1 | head -5` | |
| 58 | + | 2. Check PostgreSQL: `ssh root@100.120.174.96 "psql -c 'SELECT count(*) FROM pg_stat_activity;' makenotwork"` | |
| 59 | + | 3. Check for slow queries: `ssh root@100.120.174.96 "psql -c \"SELECT query, calls, mean_exec_time FROM pg_stat_statements ORDER BY mean_exec_time DESC LIMIT 5;\" makenotwork"` | |
| 60 | + | 4. Check disk I/O: `ssh root@100.120.174.96 iostat -x 1 3` | |
| 61 | + | 5. If database-related: consider `VACUUM ANALYZE` on affected tables | |
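For orientation, the ">2x the 7-day baseline" condition can be sketched as a comparison of two averages. The function names, the use of a plain mean, and the idea of a separate "recent" window standing in for "sustained" are all assumptions; PoM's actual drift detection may differ.

```rust
/// Arithmetic mean of the samples, or `None` for an empty slice.
fn mean(samples: &[f64]) -> Option<f64> {
    if samples.is_empty() {
        None
    } else {
        Some(samples.iter().sum::<f64>() / samples.len() as f64)
    }
}

/// Returns true when the mean of `recent_ms` exceeds `factor` times the
/// mean of `baseline_ms`. With no data on either side, or a zero baseline,
/// it reports no drift rather than guessing.
fn is_latency_drift(baseline_ms: &[f64], recent_ms: &[f64], factor: f64) -> bool {
    match (mean(baseline_ms), mean(recent_ms)) {
        (Some(base), Some(recent)) => base > 0.0 && recent > base * factor,
        _ => false,
    }
}
```

Averaging over a recent window, rather than testing a single sample, keeps one slow request from raising a drift alert.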
| 62 | + | ||
| 63 | + | ### Route Failure | |
| 64 | + | ||
| 65 | + | **Symptoms:** Specific paths (e.g., `/login`, `/docs`) returning non-2xx. | |
| 66 | + | ||
| 67 | + | **Steps:** | |
| 68 | + | 1. Verify: `curl -sI https://makenot.work/login` | |
| 69 | + | 2. If 502/503: application is down or Caddy can't reach it | |
| 70 | + | 3. If 404: route may have been removed in a deploy -- check recent deploys | |
| 71 | + | 4. If 500: application error -- check logs with `journalctl -u makenotwork` | |
| 72 | + | ||
| 73 | + | ### DNS Mismatch | |
| 74 | + | ||
| 75 | + | **Symptoms:** DNS records don't match expected values. | |
| 76 | + | ||
| 77 | + | **Steps:** | |
| 78 | + | 1. Verify: `dig makenot.work +short` and compare with expected | |
| 79 | + | 2. Check Cloudflare DNS dashboard for unexpected changes | |
| 80 | + | 3. If propagation issue: wait 5-10 minutes and recheck | |
| 81 | + | 4. If intentional change: update PoM config to match new expected values | |
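When updating expected values, note that the audit notes describe an empty `expected` list as a resolution-only check, which is how the Cloudflare-proxied records are configured. A minimal sketch of that comparison follows; the function name and the rule that every expected value must appear among the resolved values are assumptions.

```rust
/// Decides whether a DNS check passes.
///
/// * No resolved values: the lookup failed, so the check fails.
/// * Empty `expected`: resolution-only mode, any answer passes.
/// * Otherwise (assumption): every expected value must be present in the
///   resolved set; extra resolved values are tolerated.
fn dns_check_passes(resolved: &[&str], expected: &[&str]) -> bool {
    if resolved.is_empty() {
        return false;
    }
    if expected.is_empty() {
        return true;
    }
    expected.iter().all(|want| resolved.contains(want))
}
```

Resolution-only mode suits proxied records because the proxy returns rotating addresses that never equal the origin IP.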
| 82 | + | ||
| 83 | + | ### WHOIS Domain Expiry | |
| 84 | + | ||
| 85 | + | **Symptoms:** Domain registration expires within 30 days. | |
| 86 | + | ||
| 87 | + | **Steps:** | |
| 88 | + | 1. Verify: `whois makenot.work | grep -i expir` | |
| 89 | + | 2. Renew domain with registrar (Cloudflare Registrar for makenot.work) | |
| 90 | + | 3. Confirm renewal: re-run WHOIS check | |
| 91 | + | ||
| 92 | + | ### Monitoring Offline (All Targets Unreachable) | |
| 93 | + | ||
| 94 | + | **Symptoms:** All monitored targets are down simultaneously. | |
| 95 | + | ||
| 96 | + | **Steps:** | |
| 97 | + | 1. This almost certainly means PoM's own network is down, not that every target is down | |
| 98 | + | 2. Check the PoM instance's network: `ping 1.1.1.1`, `tailscale status` | |
| 99 | + | 3. Check DNS resolution: `dig makenot.work` | |
| 100 | + | 4. If network is fine, check if all targets actually are down (unlikely but possible) | |
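The meta-alert logic can be sketched as a decision over one round of reachability results. `AlertPlan` and `plan_alerts` are hypothetical names used only for illustration; the sketch assumes one boolean per target and collapses the all-down case into a single alert.

```rust
/// What to send after one round of checks. Names are illustrative.
#[derive(Debug, PartialEq)]
enum AlertPlan {
    /// Every target is reachable (or there are no targets).
    Nothing,
    /// Some targets are down; the count says how many.
    PerTarget(usize),
    /// Every target is down at once: more likely a problem with the
    /// monitor's own network than a simultaneous outage.
    MonitoringOffline,
}

/// `reachable` holds one entry per monitored target.
fn plan_alerts(reachable: &[bool]) -> AlertPlan {
    let down = reachable.iter().filter(|&&up| !up).count();
    if down == 0 {
        AlertPlan::Nothing
    } else if down == reachable.len() {
        // Note: with a single configured target this branch is taken
        // whenever that target is down.
        AlertPlan::MonitoringOffline
    } else {
        AlertPlan::PerTarget(down)
    }
}
```

Sending one meta-alert in place of one alert per target is what keeps a local network failure from producing a cascade of false alarms.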
| 101 | + | ||
| 102 | + | ### Test Run Stale | |
| 103 | + | ||
| 104 | + | **Symptoms:** No test run recorded in 7+ days. | |
| 105 | + | ||
| 106 | + | **Steps:** | |
| 107 | + | 1. SSH into astra and run tests manually: `/home/max/staging/run-tests.sh` | |
| 108 | + | 2. If tests fail: investigate failures, fix, re-run | |
| 109 | + | 3. If SSH test execution fails: check SSH key, connectivity, permissions | |
| 110 | + | ||
| 111 | + | ## Service Management | |
| 112 | + | ||
| 113 | + | ### Starting/Stopping | |
| 114 | + | ||
| 115 | + | ```sh | |
| 116 | + | # Hetzner | |
| 117 | + | ssh root@100.120.174.96 systemctl start pom | |
| 118 | + | ssh root@100.120.174.96 systemctl stop pom | |
| 119 | + | ssh root@100.120.174.96 systemctl restart pom | |
| 120 | + | ||
| 121 | + | # Astra | |
| 122 | + | ssh max@100.106.221.39 sudo systemctl start pom | |
| 123 | + | ssh max@100.106.221.39 sudo systemctl stop pom | |
| 124 | + | ssh max@100.106.221.39 sudo systemctl restart pom | |
| 125 | + | ``` | |
| 126 | + | ||
| 127 | + | ### Checking Status | |
| 128 | + | ||
| 129 | + | ```sh | |
| 130 | + | # Service status | |
| 131 | + | ssh root@100.120.174.96 systemctl status pom | |
| 132 | + | ||
| 133 | + | # Application logs | |
| 134 | + | ssh root@100.120.174.96 'journalctl -u pom --since "1 hour ago"' | |
| 135 | + | ||
| 136 | + | # API health | |
| 137 | + | curl http://100.120.174.96:9100/api/health | |
| 138 | + | ||
| 139 | + | # Full status (requires API token) | |
| 140 | + | curl -H "Authorization: Bearer <token>" http://100.120.174.96:9100/api/status | |
| 141 | + | ||
| 142 | + | # Mesh view (self + peers) | |
| 143 | + | curl -H "Authorization: Bearer <token>" http://100.120.174.96:9100/api/mesh | |
| 144 | + | ``` | |
| 145 | + | ||
| 146 | + | ### Deploying Updates | |
| 147 | + | ||
| 148 | + | ```sh | |
| 149 | + | cd ~/Code/MNW/pom | |
| 150 | + | ./deploy/deploy.sh # Deploy to both astra and hetzner | |
| 151 | + | ``` | |
| 152 | + | ||
| 153 | + | The deploy script cross-compiles for both architectures, uploads binaries, and restarts services. | |
| 154 | + | ||
| 155 | + | ### Configuration Changes | |
| 156 | + | ||
| 157 | + | Config lives at `/etc/pom/pom.toml` on each instance. After editing: | |
| 158 | + | ||
| 159 | + | ```sh | |
| 160 | + | ssh root@100.120.174.96 systemctl restart pom | |
| 161 | + | ``` | |
| 162 | + | ||
| 163 | + | Alert credentials are in `/etc/pom/env` (Postmark token, API token). | |
| 164 | + | ||
| 165 | + | ## Check Intervals | |
| 166 | + | ||
| 167 | + | | Check Type | Default Interval | Notes | | |
| 168 | + | |------------|-----------------|-------| | |
| 169 | + | | Health (HTTP) | 5 minutes | 10-second timeout per request | | |
| 170 | + | | TLS certificate | 1 hour | Warns at 14 days before expiry | | |
| 171 | + | | Route availability | 5 minutes | Checks all configured paths | | |
| 172 | + | | DNS records | 1 hour | Compares against expected values | | |
| 173 | + | | WHOIS expiry | 1 hour | Warns at 30 days before expiry | | |
| 174 | + | | CORS preflight | 1 hour | OPTIONS request validation | | |
| 175 | + | | Peer heartbeat | 60 seconds | 3 failures before alert (grace period) | | |
| 176 | + | | Data pruning | Daily | Retains 30 days of history | | |
| 177 | + | ||
| 178 | + | ## Alert Cooldowns | |
| 179 | + | ||
| 180 | + | - **Default cooldown:** 5 minutes between repeated alerts for the same target | |
| 181 | + | - **Recovery alerts:** Always sent immediately (bypass cooldown) | |
| 182 | + | - **Monitoring-offline:** Special meta-alert when all targets are unreachable | |
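The cooldown rules above reduce to a small decision: recovery alerts always go out, and failure alerts go out only when the cooldown has elapsed. The sketch below is illustrative rather than PoM's implementation; `AlertKind` and `should_send` are hypothetical names, and treating "exactly at the cooldown" as allowed is an assumption.

```rust
use std::time::Duration;

/// Illustrative alert categories; not PoM's actual types.
#[derive(Debug)]
enum AlertKind {
    Failure,
    Recovery,
}

/// Decides whether an alert should be sent now.
///
/// `since_last` is the time since the previous alert for the same target,
/// or `None` if no alert has been sent for it yet.
fn should_send(kind: &AlertKind, since_last: Option<Duration>, cooldown: Duration) -> bool {
    match kind {
        // Recovery alerts bypass the cooldown entirely.
        AlertKind::Recovery => true,
        AlertKind::Failure => match since_last {
            None => true,
            // Assumption: an alert exactly at the cooldown boundary is allowed.
            Some(elapsed) => elapsed >= cooldown,
        },
    }
}
```

With the default 5-minute cooldown, a target that keeps failing on a 5-minute check interval would still alert on each check, so the cooldown mainly matters for checks that run more often.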
| 183 | + | ||
| 184 | + | ## Production Instances | |
| 185 | + | ||
| 186 | + | | Instance | IP | Architecture | Config | | |
| 187 | + | |----------|-----|-------------|--------| | |
| 188 | + | | Hetzner | `100.120.174.96:9100` | x86_64 | `/etc/pom/pom.toml` | | |
| 189 | + | | Astra | `100.106.221.39:9100` | aarch64 | `/etc/pom/pom.toml` | | |
| 190 | + | ||
| 191 | + | Both instances monitor the same targets and cross-check each other via the peer mesh. | |
| 192 | + | ||
| 193 | + | ## Key Files | |
| 194 | + | ||
| 195 | + | | What | Where | | |
| 196 | + | |------|-------| | |
| 197 | + | | Config | `/etc/pom/pom.toml` | | |
| 198 | + | | Credentials | `/etc/pom/env` | | |
| 199 | + | | Database | `/var/lib/pom/pom.db` (SQLite) | | |
| 200 | + | | Instance ID | `/var/lib/pom/instance_id` | | |
| 201 | + | | systemd unit | `/etc/systemd/system/pom.service` | | |
| 202 | + | | Deploy script | `deploy/deploy.sh` | |
| @@ -1,6 +1,8 @@ | |||
| 1 | 1 | # PoM Todo | |
| 2 | 2 | ||
| 3 | - | Done: Phases 1-13 complete. Per-test tracking + regression detection + duration drift added. 352 tests (124 lib + 228 integration). Grade: A (Run 10). v0.3.2 (redeployed 2026-03-25). cli/ split into directory module. Dashboard UI shipped. DNS checks fixed for Cloudflare-proxied domains. Stale route/DNS data pruning added. | |
| 3 | + | Done: All phases (1-13). Active: None. Next: Post-beta items below. | |
| 4 | + | ||
| 5 | + | v0.3.2. Audit grade A. Dashboard UI, regression detection, duration drift. Monitors MNW + MT + htpy.app. | |
| 4 | 6 | ||
| 5 | 7 | Completed work archived in `docs/archive/pom_todo_done.md`. | |
| 6 | 8 |