max / pom

Add project docs: audit history, competition analysis, runbook

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Author: Max J. <87768334+MaxJMath@users.noreply.github.com> · 2026-04-12 23:39 UTC
Commit: e8ed35e73aea237197b1fda84f5423b4c7af8f2b
Parent: b10b7fd
6 files changed, +721 insertions, -112 deletions
@@ -0,0 +1,308 @@
1 + # PoM — Completed Work
2 +
3 + Archived completed phases from todo.md. All items here are done.
4 +
5 + ---
6 +
7 + ## Phase 1 — Core Infrastructure
8 + Health checks, test orchestration, CLI, MCP server, SQLite storage.
9 +
10 + ### Done
11 + - [x] HTTP health checks with configurable targets and timeouts
12 + - [x] SSH test orchestration with CI output parsing
13 + - [x] CLI commands: health, test, status, history, prune
14 + - [x] MCP server mode (stdio transport)
15 + - [x] SQLite storage with WAL mode
16 + - [x] Per-target interval overrides
17 +
18 + ## Phase 2 — Serve Mode
19 + Background daemon with periodic health checks.
20 +
21 + ### Done
22 + - [x] Serve mode with per-target health check intervals
23 + - [x] Daily prune task
24 + - [x] Graceful shutdown (SIGINT/SIGTERM)
25 + - [x] Systemd service on hetzner
26 +
27 + ## Phase 3 — HTTP API + MNW Integration
28 + Expose data to consumers, wire into MNW health page.
29 +
30 + ### Done
31 + - [x] Axum HTTP API (`/api/status`, `/api/status/{target}`)
32 + - [x] Uptime percentage queries (24h, 7d)
33 + - [x] MNW `/health` page shows External Monitor card
34 + - [x] MNW `/api/health` JSON includes `external_monitoring` field
35 + - [x] Graceful fallback when PoM unavailable
36 +
37 + ## Phase 4 — Peer Mesh
38 + Syncthing-style peer network. Each PoM instance has a UUID, discovers peers by address, and shares monitoring data across the mesh. Any instance can see the full network state.
39 +
40 + ### Done (4A — Instance Identity)
41 + - [x] Auto-generate UUID on first run, store in data dir (`~/.local/share/pom/instance_id`)
42 + - [x] Instance name in config (`[instance]` section, defaults to hostname)
43 + - [x] `GET /api/peer/info` endpoint (returns instance ID, name, version, target list, started_at)
44 +
45 + ### Done (4B — Peer Configuration)
46 + - [x] `[instance]` config section (name, optional ID override)
47 + - [x] `[peers.<name>]` config section (address, on_missing, grace_count)
48 + - [x] Peer connection on serve startup (exchange instance info via `/api/peer/info`)
49 + - [x] Validate peer identity: store UUID on first connect, warn if UUID changes unexpectedly
50 +
51 + ### Done (4C — Peer Health Monitoring)
52 + - [x] Periodic peer heartbeat (poll each peer's `/api/peer/info`, configurable interval, default 60s)
53 + - [x] Peer status tracking in SQLite (`peer_identities`, `peer_heartbeats` tables)
54 + - [x] `on_missing` behavior: fire action when peer heartbeat fails (after configurable grace period)
55 + - [x] State machine: Unknown -> Online/GracePeriod -> Missing, with recovery detection
56 + - [x] Prune task also cleans `peer_heartbeats`
57 +
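Sketch, not from the commit: one way the 4C heartbeat state machine could look. The state names and `consecutive_failures` come from the notes in this diff; `PeerTracker`, `Transition`, and the reading of `grace_count` as "failures tolerated before Missing" are assumptions made for illustration.

```rust
/// Peer liveness states named in the 4C checklist.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum PeerState {
    Unknown,
    Online,
    GracePeriod,
    Missing,
}

/// What the caller should react to after a heartbeat result (hypothetical type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Transition {
    None,
    WentMissing,
    Recovered,
}

/// Hypothetical per-peer tracker; the real struct layout is not shown in the diff.
#[derive(Debug)]
pub struct PeerTracker {
    pub state: PeerState,
    pub consecutive_failures: u32,
    grace_count: u32,
}

impl PeerTracker {
    pub fn new(grace_count: u32) -> Self {
        Self { state: PeerState::Unknown, consecutive_failures: 0, grace_count }
    }

    /// Apply one heartbeat result and report whether an action should fire.
    pub fn record(&mut self, heartbeat_ok: bool) -> Transition {
        if heartbeat_ok {
            let was_missing = self.state == PeerState::Missing;
            self.state = PeerState::Online;
            self.consecutive_failures = 0;
            return if was_missing { Transition::Recovered } else { Transition::None };
        }
        self.consecutive_failures += 1;
        // Assumption: up to `grace_count` failures are tolerated before Missing.
        if self.consecutive_failures > self.grace_count {
            let newly_missing = self.state != PeerState::Missing;
            self.state = PeerState::Missing;
            if newly_missing { Transition::WentMissing } else { Transition::None }
        } else {
            self.state = PeerState::GracePeriod;
            Transition::None
        }
    }
}
```

Returning a `Transition` keeps the `on_missing` action out of the state logic, so the action fires once per outage rather than on every failed heartbeat.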
58 + ### Done (4D — Status Sharing)
59 + - [x] `GET /api/peer/status` endpoint (returns this instance's full target + peer status)
60 + - [x] Each instance periodically fetches peer status to build combined view
61 + - [x] `GET /api/mesh` endpoint (aggregated view: all instances, all targets, all peer statuses)
62 + - [x] CLI: `pom mesh [--json]` command to show network state
63 + - [x] MCP tool: `get_mesh_status` surfaces mesh state
64 +
65 + ### Done (4E — Code + Config)
66 + - [x] Per-host config files (`deploy/pom-hetzner.toml`, `deploy/pom-astra.toml`)
67 + - [x] Updated `deploy/deploy.sh` to use per-host configs
68 + - [x] Listen on `0.0.0.0:9100` in deploy configs (Tailscale peer access)
69 +
70 + ### Done (4E — Deploy)
71 + - [x] Install Tailscale on hetzner (`100.120.174.96`)
72 + - [x] Update astra peer config to use hetzner's Tailscale IP
73 + - [x] Fix `blocking_read()` panic in `spawn_heartbeat_tasks` (must be async)
74 + - [x] Deploy v0.2.0 to hetzner + astra
75 + - [x] Verify: `/api/peer/info` returns correct identity on each
76 + - [x] Verify: `/api/mesh` shows both instances online (~65ms latency)
77 + - [x] Update deploy scripts to use Tailscale IPs
78 +
79 + ## Phase 5 — Alerting (pre-beta)
80 + Email alerts triggered by target status changes or peer disappearance. Peers with `on_missing = "alert"` use this system.
81 +
82 + ### Done
83 + - [x] Postmark API integration (`src/alerts.rs` — Alerter struct, `X-Postmark-Server-Token` header)
84 + - [x] Alert configuration in pom.toml (`[alerts]` section: postmark_token, to, from, cooldown_secs)
85 + - [x] Status change detection (query previous health check before insert, compare statuses, fire on transition)
86 + - [x] Cooldown logic (alerts table tracks sent_at, skip if within cooldown window)
87 + - [x] Recovery alerts (notify when target returns to operational)
88 + - [x] Peer-triggered alerts (fire when a peer with `on_missing = "alert"` goes missing or recovers)
89 + - [x] Dev mode (no postmark_token → alerts logged to stdout)
90 + - [x] DB migration v2 (alerts table + index)
91 + - [x] Deploy configs updated (`deploy/pom-hetzner.toml`, `deploy/pom-astra.toml`)
92 + - [x] 11 new tests (3 unit, 5 integration, 3 config)
93 + - [x] Set postmark_token in production deploy configs
94 + - [x] Create `pom-alerts@makenot.work` sender signature in Postmark dashboard
95 +
96 + ## Phase 6 — TLS Certificate Monitoring
97 + Probe TLS certs, track expiry, alert before outage.
98 +
99 + ### Done
100 + - [x] TLS certificate check: connect to target, TLS handshake, read leaf cert expiry (`src/checks/tls.rs`)
101 + - [x] Per-target TLS config: `[targets.mnw.tls]` with host, port (default 443), warn_days (default 14)
102 + - [x] Configurable check interval: `tls_check_interval_secs` on `[serve]` (default 3600)
103 + - [x] DB migration v3: `tls_checks` table with index
104 + - [x] Store cert check results per target (insert/query, `TlsCheckRow`)
105 + - [x] Prune old TLS checks in daily prune task (5-tuple return)
106 + - [x] TLS data in API response: `tls` field on `/api/status/{target}` (skip_serializing_if None)
107 + - [x] CLI display: TLS line in `pom status` (OK/WARN/ERR with days remaining + expiry date)
108 + - [x] Serve loop: TLS check task per target on its own interval
109 + - [x] Alerts: expiry warning, error, and recovery (with cooldown)
110 + - [x] Deploy configs updated (hetzner + astra: `[targets.mnw.tls] host = "makenot.work"`)
111 + - [x] 17 new tests (2 unit, 5 config, 10 integration)
112 + - [x] Dependencies: x509-parser 0.16, tokio-rustls 0.26, rustls-pki-types 1, webpki-roots 1
113 +
114 + ## Phase 7 — Response Validation
115 + Verify response bodies match expected patterns, not just HTTP status codes.
116 +
117 + ### Done
118 + - [x] `HealthExpectation` config struct: `status_code`, `json_fields` (dot-path), `body_contains`
119 + - [x] `[targets.mnw.health.expect]` TOML config section (all fields optional)
120 + - [x] `resolve_json_path()` — walk dot-separated paths through nested JSON
121 + - [x] `validate_expectations()` — check status code, body substring, JSON field values
122 + - [x] Refactored `check_health` to `response.text()` + `serde_json::from_str` (preserves raw body)
123 + - [x] Expectation failures override to Degraded with joined error descriptions
124 + - [x] Deploy configs updated (hetzner + astra: `status_code = 200`, `json_fields.status = "operational"`)
125 + - [x] 17 new unit tests (resolve_json_path, validate_expectations, config parsing)
126 +
127 + ## Phase 8 — Latency Trending + Anomaly Detection
128 + Track performance over time, detect drift before it becomes an outage.
129 +
130 + ### Done
131 + - [x] `LatencyStats` + `LatencyBucket` types with `from_times()` and `bucket_by_time()` (types.rs, 9 unit tests)
132 + - [x] DB queries: `get_response_times`, `get_recent_response_times` (db.rs — operational-only filtering)
133 + - [x] `TrendingConfig` (baseline_window_hours, spike_threshold) wired into `HealthConfig` (config.rs, 3 tests)
134 + - [x] `detect_latency_drift()` — 3 consecutive checks over baseline threshold (checks/http.rs, 6 unit tests)
135 + - [x] Drift + recovery alerts with cooldown (alerts.rs)
136 + - [x] Drift detection in serve loop with `in_drift` state tracking (cli.rs)
137 + - [x] `latency_24h` on `/api/status/{target}`, `GET /api/trends/{target}?hours=&bucket_minutes=` (api.rs)
138 + - [x] Latency line in CLI `pom status` output (display.rs, 2 tests)
139 + - [x] Latency stats in MCP `get_status` tool (tools/health.rs)
140 + - [x] Deploy configs: `[targets.mnw.health.trending]` (pom-hetzner.toml, pom-astra.toml)
141 + - [x] MNW health page: avg/p95 latency in PoM card (health.rs, public.rs, health.html)
142 + - [x] 8 new integration tests (response times, trends API, latency in status, config parsing)
143 +
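Sketch, not from the commit: the Phase 8 rule "3 consecutive checks over baseline threshold" written as a pure function. The function name, the use of a mean as the baseline, and treating `spike_threshold` as a multiplier are assumptions; the diff does not show how `detect_latency_drift()` computes its baseline.

```rust
/// Returns true when the last three response times all exceed
/// `baseline average * spike_threshold`.
/// Assumptions: times are in milliseconds, the baseline is a plain mean,
/// and `spike_threshold` is a multiplier (e.g. 2.0 means twice the baseline).
pub fn latency_drift(baseline_ms: &[u64], recent_ms: &[u64], spike_threshold: f64) -> bool {
    const CONSECUTIVE: usize = 3;
    if baseline_ms.is_empty() || recent_ms.len() < CONSECUTIVE {
        return false; // not enough data to judge
    }
    let baseline_avg = baseline_ms.iter().sum::<u64>() as f64 / baseline_ms.len() as f64;
    let limit = baseline_avg * spike_threshold;
    // Only the most recent CONSECUTIVE samples matter.
    recent_ms[recent_ms.len() - CONSECUTIVE..]
        .iter()
        .all(|&t| (t as f64) > limit)
}
```

Requiring all of the last three samples to exceed the limit means a single slow response does not trigger an alert.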
144 + ## Phase 9 — Smart Test Prompting
145 + Detect when tests should be re-run based on staleness and version changes.
146 +
147 + ### Done
148 + - [x] `TestStaleness` struct: stale flag, reason, current/tested versions, last_test_at, days_since_test (types.rs)
149 + - [x] `get_version_at_time()` DB query: extract version from health check closest to a given timestamp (db.rs)
150 + - [x] `staleness_days` config field on `TestsConfig` (default 7) (config.rs, 2 config tests)
151 + - [x] `compute_test_staleness()` pure function: no-tests, age-based, version-change triggers (checks/http.rs, 5 unit tests)
152 + - [x] `test_staleness` field on API `TargetStatus` (skip_serializing_if None) (api.rs)
153 + - [x] `build_target_status` computes staleness for targets with test config (api.rs)
154 + - [x] CLI `pom status` shows "Tests: STALE" line with reason (display.rs, 4 display tests)
155 + - [x] CLI JSON output includes `test_staleness` object (cli.rs)
156 + - [x] MCP `get_status` shows staleness info when stale (tools/health.rs)
157 + - [x] Deploy configs: `staleness_days = 7` (pom-hetzner.toml, pom-astra.toml)
158 + - [x] 8 integration tests (version_at_time, staleness by version/age/fresh, config parsing, MCP tool, no-config omits field)
159 +
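Sketch, not from the commit: the three Phase 9 staleness triggers (no tests, version change, age) as a pure function. `StaleReason`, the function name, the order in which triggers are checked, and the `>=` comparison against `staleness_days` are assumptions for illustration.

```rust
/// Why a target's tests are considered stale (hypothetical type).
#[derive(Debug, PartialEq, Eq)]
pub enum StaleReason {
    NeverTested,
    VersionChanged { tested: String, current: String },
    TooOld { days_since_test: u64 },
}

/// Returns `None` when tests are fresh, otherwise the first matching reason.
pub fn staleness_reason(
    days_since_test: Option<u64>,
    tested_version: Option<&str>,
    current_version: Option<&str>,
    staleness_days: u64,
) -> Option<StaleReason> {
    // No recorded test run at all.
    let days = match days_since_test {
        Some(d) => d,
        None => return Some(StaleReason::NeverTested),
    };
    // Version change is only detectable when both versions are known.
    if let (Some(tested), Some(current)) = (tested_version, current_version) {
        if tested != current {
            return Some(StaleReason::VersionChanged {
                tested: tested.to_string(),
                current: current.to_string(),
            });
        }
    }
    // Assumption: a run exactly `staleness_days` old already counts as stale.
    if days >= staleness_days {
        return Some(StaleReason::TooOld { days_since_test: days });
    }
    None
}
```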
160 + ## Phase 10 — Downtime Log + Incident History
161 + Structured timeline of status transitions for post-incident review.
162 +
163 + ### Done
164 + - [x] DB migration v4: `incidents` table (id, target, started_at, ended_at, duration_secs, from_status, to_status)
165 + - [x] `IncidentRow` struct (sqlx::FromRow + Serialize)
166 + - [x] Incident queries: `insert_incident`, `close_open_incidents`, `get_open_incident`, `get_recent_incidents`
167 + - [x] Automatic incident open on transition away from operational (serve loop)
168 + - [x] Automatic incident close (with duration) on recovery to operational
169 + - [x] Status change between non-operational states: close old incident, open new
170 + - [x] `current_incident` + `incidents` (last 10) on API `/api/status/{target}` (skip_serializing_if)
171 + - [x] CLI `pom status` shows active incident line
172 + - [x] MCP `get_status` shows active + recent incidents
173 + - [x] Prune cleans closed incidents (6-tuple return from `prune_old_records`)
174 + - [x] 10 new tests (migration, lifecycle, target isolation, prune, API)
175 + - [x] Surface in MNW health page (incident timeline, recent incidents list, expandable check lists, formatted timestamps)
176 +
177 + ## Audit Remediation (Second Audit, 2026-03-11)
178 + 5 findings, 3 cold spots. All resolved.
179 +
180 + ### Done
181 + - [x] Extract CLI command handlers from main.rs into cli.rs (main.rs: 587 -> 130 LOC, cli.rs: 466 LOC)
182 + - [x] Add typed PomError enum with thiserror (8 variants, replaces Box<dyn Error> across 9 files)
183 + - [x] Add .DS_Store and IDE dirs (.idea/, .vscode/) to .gitignore
184 + - [x] Add module-level //! docs to main.rs (config.rs already had one)
185 + - [x] Add migration versioning (schema_version table, numbered migrations, pre-migration DB detection, 3 tests)
186 + - [x] Add CLI display tests (extract formatting into display.rs, 27 tests: health snapshots, test results, status, history, prune, mesh)
187 +
188 + ## Audit Remediation (First Audit, 2026-03-10)
189 + First audit. 11 findings, 8 cold spots. All resolved.
190 +
191 + ### Done
192 + - [x] Add DB indexes: `health_checks(target, id DESC)`, `health_checks(target, checked_at)`, `test_runs(target, id DESC)`, `peer_heartbeats(peer_name, id DESC)` (db.rs, init_schema)
193 + - [x] Fix 4 clippy `collapsible_if` warnings (api.rs, peer.rs, main.rs — used Rust 2024 let chains)
194 + - [x] Decouple mesh write lock from DB writes in heartbeat handlers (peer.rs — block-scoped lock, DB writes after drop)
195 + - [x] Decouple mesh read lock from DB queries in peer_status and mesh_view handlers (api.rs — same pattern)
196 + - [x] Log `/api/peer/status` fetch failures instead of silently ignoring (peer.rs, tracing::debug)
197 + - [x] Include peer heartbeat prune count in `prune_old_records` return value (db.rs — now returns 3-tuple)
198 + - [x] Add `//!` module docs to db.rs, config.rs, peer.rs, types.rs, lib.rs (api.rs already had one)
199 + - [x] Change `PeerConfig.on_missing` from `String` to `OnMissing` enum with `#[derive(Deserialize)]` + `#[default]`
200 + - [x] Add API endpoint integration tests (5 tests: /api/status, /api/status/{target} 404, /api/peer/info, peer disabled, /api/mesh)
201 + - [x] Add heartbeat state machine unit tests (5 tests: grace transitions, recovery, first-contact UUID, DB recording)
202 + - [x] Add config parsing tests (4 tests: full parse, defaults, on_missing default, hostname fallback)
203 + - [x] Add HTTP health check response classification tests (8 tests: operational, degraded, unknown status, error codes, missing fields, non-JSON)
204 + - [x] Extract `HealthStatus::icon()` method, eliminating 3 repeated match blocks in main.rs
205 + - [x] Add types.rs tests (4 tests: Display/FromStr roundtrip, icon mapping, serde roundtrip, invalid parse)
206 +
207 + ## Phase 11 — Route Specs (pre-beta)
208 +
209 + Define expected routes per target in config. PoM periodically checks each route and alerts if any returns a non-2xx status. Catches missing pages, broken deploys, misconfigured paths.
210 +
211 + ### Done
212 + - [x] `expected_routes` config field on `[targets.<name>]` — list of paths to check (e.g. `["/", "/docs", "/docs/faq", "/pricing"]`)
213 + - [x] `route_check_interval_secs` on `[serve]` (default 300 = 5 min)
214 + - [x] Route check module (`src/checks/routes.rs`) — sequential GET per path, 2xx = OK
215 + - [x] Route check task in serve loop (separate interval from health checks)
216 + - [x] Route check results stored in DB (migration v5: `route_checks` table with indexes)
217 + - [x] `RouteCheckRow`, `insert_route_check`, `get_latest_route_checks` queries
218 + - [x] Prune includes route_checks (7-tuple return from `prune_old_records`)
219 + - [x] Alert on route failure (non-2xx on any expected route, with cooldown key `route:{target}`)
220 + - [x] Recovery alert when previously-failing route returns 2xx (no cooldown)
221 + - [x] `route_status` field on `/api/status/{target}` (list of paths with last status, skip_serializing_if empty)
222 + - [x] CLI `pom status` shows route check summary (e.g. "Routes: 9/9 OK" or "Routes: 7/9 (FAIL: /docs/faq, /pricing)")
223 + - [x] MNW health page: route status in PoM card
224 + - [x] Deploy configs updated (hetzner + astra: MNW 9 routes, MT 1 route)
225 + - [x] 18 new tests (4 config, 5 route check unit, 3 display, 1 alert, 5 integration)
226 +
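Sketch, not from the commit: how the `pom status` route summary line quoted in Phase 11 could be built from per-path status codes. The function name and the `(path, status)` input shape are assumptions; the "2xx = OK" rule and the two output formats are taken from the checklist.

```rust
/// Build the summary line from (path, HTTP status) pairs.
/// Any status outside 200..=299 counts as a failure, including redirects.
pub fn route_summary(results: &[(&str, u16)]) -> String {
    let total = results.len();
    let failed: Vec<&str> = results
        .iter()
        .filter(|(_, status)| !(200..300).contains(status))
        .map(|(path, _)| *path)
        .collect();
    let ok = total - failed.len();
    if failed.is_empty() {
        format!("Routes: {ok}/{total} OK")
    } else {
        format!("Routes: {ok}/{total} (FAIL: {})", failed.join(", "))
    }
}
```

For example, `route_summary(&[("/", 200), ("/docs/faq", 404), ("/pricing", 500)])` yields `Routes: 1/3 (FAIL: /docs/faq, /pricing)`.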
227 + ## Phase 12 — External Target: htpy.app
228 +
229 + Monitor https://htpy.app (homotopy-rs, repo at `/Users/max/Math/sseq-work/homotopy-rs`). PoM already supports multiple targets — this adds htpy.app as a third monitored site alongside MNW and MT.
230 +
231 + ### Done
232 + - [x] Add `[targets.htpy]` to deploy configs (pom-hetzner.toml, pom-astra.toml) with health URL, route checks, TLS
233 + - [x] Health check via Tailscale (`http://100.99.153.68:8080/archive/S_2`) with `body_contains = "htpy"` expectation
234 + - [x] Route check: `/archive/S_2` (the default redirect target from `/`)
235 + - [x] TLS monitoring for htpy.app (`[targets.htpy.tls] host = "htpy.app"`)
236 + - [x] Fix `classify_non_json` — non-JSON 2xx responses now promoted to Operational when all expectations pass
237 + - [x] Verified on both hetzner (9ms) and astra (185ms): operational, TLS valid (87d), routes 1/1 OK
238 +
239 + ### Not applicable
240 + - MNW health page: htpy.app is a separate service, doesn't belong on MNW's health dashboard
241 +
242 + ## Audit Action Items (2026-03-13, third audit — pre-launch skeptical lens)
243 +
244 + ### Done
245 + - [x] **CRITICAL:** Remove Postmark API token from deployment configs (`deploy/pom-hetzner.toml`, `deploy/pom-astra.toml`) — moved to `POM_POSTMARK_TOKEN` env var, loaded in config.rs, systemd `EnvironmentFile=/etc/pom/env`
246 + - [x] Add API authentication (bearer token middleware on all /api/* routes, `POM_API_TOKEN` env var or `[serve] api_token` config, 5 tests)
247 + - [x] Add peer mesh authentication (`[peers.X] token` field, heartbeat client sends `Authorization: Bearer` header, MNW health.rs updated to send token)
248 + - [x] Add integration tests for core functions (check_health, check_tls — 9 new integration tests with mock servers)
249 + - [x] Add self-monitoring capability (`/api/health` endpoint returns `{"status":"operational","version":"..."}`, no auth required)
250 + - [x] Shell-escape SSH test filter parameter (`checks/ssh.rs` — alphanumeric + `_:-` allowlist, returns error TestRun on invalid chars)
251 + - [x] Reject peer responses on UUID mismatch instead of just logging a warning (`peer.rs` — upgraded to tracing::error, skips status update, increments consecutive_failures)
252 + - [x] Add rate limiting to API endpoints (fixed-window 60 req/min middleware on authenticated routes, 1 unit test)
253 +
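Sketch, not from the commit: the SSH test filter allowlist described in the third audit (alphanumeric plus `_:-`). The function name is hypothetical, "alphanumeric" is read as ASCII-only, and the empty string passes vacuously; whether the real check in `checks/ssh.rs` makes the same choices is not shown in the diff.

```rust
/// Returns true when every character is ASCII alphanumeric or one of `_`, `:`, `-`.
/// Anything else (spaces, quotes, `;`, `$`, backticks, ...) is rejected, so the
/// filter can be passed to a remote shell without further quoting.
pub fn is_safe_test_filter(filter: &str) -> bool {
    filter
        .chars()
        .all(|c| c.is_ascii_alphanumeric() || matches!(c, '_' | ':' | '-'))
}
```

An allowlist is simpler to reason about than escaping: a Rust test path such as `db::tests::prune_old` passes, while `foo; rm -rf /` is rejected outright.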
254 + ## Audit (Run 4, 2026-03-14)
255 +
256 + Full code audit of Phases 11-12 additions. 1 HIGH, 4 MEDIUM, 5 LOW findings.
257 +
258 + ### Done
259 + - [x] Audit route checks module (`src/checks/routes.rs`) — base_url parsing, error handling, edge cases
260 + - [x] Audit `classify_non_json` Operational promotion — verified correct, no false positives
261 + - [x] Audit deploy configs for consistency (htpy Tailscale IP, route lists, expectation accuracy)
262 + - [x] Review test coverage gaps in Phase 11-12 code
263 +
264 + ### Done (from audit findings)
265 + - [x] **HIGH:** Disable redirect following in route check client (`redirect(Policy::none())`) — was silently following redirects
266 + - [x] **MEDIUM:** Fix startup thundering herd — consume first tick of health/TLS/route/prune intervals before entering loop
267 + - [x] **MEDIUM:** Fix recovery cooldown interaction — `get_latest_alert_for_target` now excludes `%recovery%` alert types
268 + - [x] **MEDIUM:** Set `MissedTickBehavior::Delay` on route check interval to prevent back-to-back storms
269 + - [x] **LOW:** Validate `expected_routes` paths start with `/` at config load time
270 + - [x] 3 new tests (recovery cooldown, empty routes, path validation)
271 +
272 + ### Done (remaining findings — resolved)
273 + - [x] **MEDIUM:** Graceful shutdown — CancellationToken + `tokio::select!` in all task loops, `with_graceful_shutdown` on API server, 5s grace period (`cli.rs`)
274 + - [x] **LOW:** Remove redundant htpy route check — removed `expected_routes` from htpy target (deploy configs)
275 + - [x] **LOW:** Monitor for silent task panics — 60s watchdog checks `JoinHandle::is_finished()` in shutdown loop (`cli.rs`)
276 +
277 + ## Phase 13 — Per-Test Tracking & Duration Trending (Mar 2026)
278 + `TestDetail` struct + `details` field on `TestSummary`. Parse individual test lines from cargo output. Migration v7: `test_details` table. `insert_test_details`, `get_test_regressions`, `get_test_durations` DB queries. Duration drift detection (baseline 10 runs, 1.5x threshold). Wired into CLI, MCP, API. 10 new tests. MT test target added to astra + hetzner configs.
279 +
280 + ## Phase 6 — TLS (additional, Mar 2026)
281 + Domain WHOIS/registration check (registrar, expiry, nameservers). DNS record verification (A/AAAA/CNAME resolve to expected IPs).
282 +
283 + ## Run 6 Audit Items (Mar 2026)
284 + Fixed 6 collapsible_if clippy warnings (cli.rs, config.rs, checks/http.rs).
285 +
286 + ## Run 8 Audit Items (Mar 2026)
287 + Hardened `escape_js` in dashboard.rs: added newline, carriage return, `<` (`\x3c`), null escaping + 4 tests.
288 +
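Sketch, not from the commit: a JavaScript string escaper covering the four cases the Run 8 note lists (newline, carriage return, `<` as `\x3c`, null). The backslash and quote handling, and the choice of `\x00` for null, are assumptions; the actual `escape_js` in dashboard.rs is not shown in the diff.

```rust
/// Escape text for embedding inside a JavaScript string literal in an HTML page.
pub fn escape_js(input: &str) -> String {
    let mut out = String::with_capacity(input.len());
    for c in input.chars() {
        match c {
            // Assumed pre-existing cases: backslash and both quote styles.
            '\\' => out.push_str("\\\\"),
            '\'' => out.push_str("\\'"),
            '"' => out.push_str("\\\""),
            // Cases named in the audit note.
            '\n' => out.push_str("\\n"),
            '\r' => out.push_str("\\r"),
            '<' => out.push_str("\\x3c"), // prevents an early `</script>` close
            '\0' => out.push_str("\\x00"),
            other => out.push(other),
        }
    }
    out
}
```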
289 + ---
290 +
291 + ## DNS/Route Stale Data Fix (2026-03-25)
292 +
293 + - [x] Switch Cloudflare-proxied DNS records to resolution-only checks
294 + - [x] Filter `route_status` and `dns_status` in API to only configured entries
295 + - [x] Add `prune_stale_routes()` and `prune_stale_dns()` DB functions
296 + - [x] Call prune functions at task startup
297 + - [x] Update integration tests for new filtering behavior
298 + - [x] Deploy to hetzner (pruned 890 stale route check rows on startup)
299 +
300 + ---
301 +
302 + ## Rust Patterns Audit (2026-03-21)
303 +
304 + - [x] Create `AlertCategory` enum (18 variants) replacing string literals
305 + - [x] Create `DnsRecordType` enum (A/Aaaa/Cname/Mx/Txt) replacing raw strings
306 + - [x] Add 30s timeout wrapper around email sends
307 + - [x] Eliminate HealthSnapshot clone under lock in API handlers
308 + - [x] Use `Cow<'_, str>` for JSON path response instead of String clone
@@ -0,0 +1,115 @@
1 + # PoM (Peace of Mind) -- Audit History
2 +
3 + Full chronological audit log. See [audit_review.md](./audit_review.md) for current state.
4 +
5 + ## Changes Since Last Audit
6 +
7 + ### Tenth audit (2026-03-28, Run 12 cross-project)
8 + - **Test count:** 359 (222 unit + 8 cli + 129 integration). 0 clippy warnings. 0 failures.
9 + - **Grade:** A (maintained). v0.3.2.
10 + - **CORS monitoring:** New check type added for monitoring CORS headers on targets.
11 + - **New dependency advisories (action items):**
12 + - aws-lc-sys 0.38.0 (RUSTSEC-2026-0044 + -0048, severity 7.4 HIGH) — upgrade to 0.39.0 via `cargo update -p aws-lc-sys`
13 + - rustls-webpki 0.103.9 (RUSTSEC-2026-0049) — upgrade to 0.103.10 via `cargo update -p rustls-webpki`
14 + - paste unmaintained (RUSTSEC-2024-0436) — upstream via rmcp, warning only
15 + - **Mandatory surprise:** None. Previous surprises (rate limiter relaxed ordering, write!().unwrap() infallibility) still valid.
16 + - **No new code findings.** All previous items remain resolved.
17 +
18 + ### DNS/Route stale data fix (2026-03-25)
19 + - **Test count:** 352 (unchanged). 0 clippy warnings.
20 + - **Config:** Switched all 4 Cloudflare-proxied DNS records from `expected = ["IP"]` to `expected = []` (resolution-only). DNS checks were always failing because Cloudflare returns rotating proxy IPs, not the origin IP.
21 + - **API filtering:** `route_status` and `dns_status` in `/api/status/{target}` now filtered to only entries matching current config. Stale routes (e.g. `/docs/about`, `/signup`) and stale DNS records no longer appear in API responses.
22 + - **DB pruning:** Added `prune_stale_routes()` and `prune_stale_dns()` to `db.rs`. Called once at task startup in `routes.rs` and `dns.rs` to clean up historical data when config changes. Pruned 890 stale route check rows on first deploy.
23 + - **Integration tests:** Updated `api_status_includes_route_status` and `api_status_includes_dns_status` to use configs with matching route/DNS entries.
24 + - **Deployed to hetzner** — v0.3.2 binary + updated config.
25 +
26 + ### Eighth audit (2026-03-18, Run 9 cross-project)
27 + - **Test count:** 344 (unchanged). 0 clippy warnings.
28 + - **Grade:** A (maintained). v0.3.1 (deployed 2026-03-18).
29 + - **Dashboard UI shipped.** Per-test tracking, regression detection, duration drift.
30 + - **cli/ directory module split** completed (1,035-line cli.rs -> 8 files).
31 + - **Observations (pre-existing, not regressions):**
32 + - Mutex `.unwrap()` in rate limiter (api.rs:41) — if thread panics while holding lock, subsequent calls panic. Impact: LOW (rate limiter only, not core logic). Design choice: acceptable for monitoring tool.
33 + - `serde_json::to_value(d).unwrap_or_default()` in API details field — silently becomes null on serialization failure. Impact: LOW, safe fallback.
34 + - **No new findings requiring action.** Grade maintained at A.
35 + - **Mandatory surprise:** Rate limiter uses `fetch_add` with Relaxed ordering — can allow up to max_per_window+1 requests due to check-then-increment race. Known trade-off of lock-free rate limiting, documented.
36 +
37 + ### Fifth audit (2026-03-16, Run 6 cross-project)
38 + - **Test count:** 238 -> 344 (220 unit + 124 integration, +106 tests)
39 + - **Grade:** A (maintained). No new findings above LOW.
40 + - **Source LOC:** 10,113 (up from ~3.5K)
41 + - **Clippy:** 2 warnings (collapsible_if in cli.rs — LOW)
42 + - **Production unwraps:** 76 total — 64 infallible write! on String, 12 safe-by-construction. Effectively zero risky unwraps.
43 + - **Mandatory surprise:** The `write!().unwrap()` pattern on `String` is provably infallible, so it is actually fine.
44 + - **Previous items verified:** All previous remediated items confirmed intact.
45 + - **Note:** cli.rs is at 1,036 lines, past the 500-line branching guideline, but it is mostly flat match arms.
46 + - **Infrastructure check:** Blocked by Tailscale SSH re-authentication. Deferred.
47 +
48 + ### Fourth audit remediation (2026-03-14)
49 + - **Grade:** A- -> A. All remaining findings resolved.
50 + - **Test count:** 229 -> 238 (+9 integration tests)
51 + - **Graceful shutdown:** Replaced `handle.abort()` with CancellationToken + `tokio::select!` in all task loops. API server uses `with_graceful_shutdown`. 5s grace period on SIGINT/SIGTERM.
52 + - **Task panic detection:** 60s watchdog checks `JoinHandle::is_finished()` on all background tasks.
53 + - **Rate limiting:** Fixed-window 60 req/min middleware on authenticated API routes. Custom `RateLimiter` struct.
54 + - **Self-monitoring:** `GET /api/health` endpoint (public, no auth) returns `{"status":"operational","version":"..."}`.
55 + - **Integration tests:** 5 check_health tests (mock axum servers: operational, degraded, unreachable, expectations pass/fail), 1 check_tls test (self-signed cert via rcgen), 2 /api/health tests, 1 rate limiter test.
56 + - **Deploy config cleanup:** Removed redundant htpy `expected_routes` (duplicated health check URL).
57 + - **Dependency:** Added `tokio-util` for CancellationToken.
58 + - **Cold spots:** 0 remaining (was 3). All previous architectural and testing gaps closed.
59 +
60 + ### Third audit (2026-03-13, pre-launch skeptical lens)
61 + - **Grade:** A -> A-. Postmark API token in plaintext deployment configs is a real issue.
62 + - **Test count:** 56 -> 187 (+131 tests)
63 + - **New findings:** Plaintext API token, no API auth, no peer mesh auth, no integration tests for core functions, no self-monitoring.
64 + - **38 unwraps in non-test code** — all verified safe (write to String or guarded by prior checks).
65 +
66 + **Post-audit remediation (2026-03-13):**
67 + - All 3 critical/medium findings resolved: Postmark token to env var, API bearer auth (5 tests), peer mesh auth
68 + - 2 low findings resolved: SSH filter validation, peer UUID mismatch rejection
69 + - Test count: 187 -> 195 (+8 tests)
70 + - Documentation upgraded to A: All struct fields documented (HealthSnapshot, HealthStatus, HealthDetails, TestRun, TestStaleness, PeerStatus, OnMissing, all config types, all API response types). All 8 error variants documented. 11 config defaults with rationale comments. prune_old_records return tuple documented. description.md rewritten, architecture.md created (191 lines), README created (62 lines).
71 +
72 + ### Observability Upgrade (2026-03-13)
73 + - **Observability:** A- -> A
74 + - Added 57 `#[instrument(skip_all)]` annotations across 9 files: db.rs (28), alerts.rs (9), tools/mod.rs (8), tools/health.rs (5), tools/tests.rs (3), checks/http.rs (1), checks/tls.rs (1), checks/ssh.rs (1), peer.rs (1)
75 + - Added Multithreaded forum as monitoring target: `pom-astra.toml` (localhost:3400), `pom-hetzner.toml` (Tailscale IP)
76 + - Added test runner targets for GO, BB, AF, SK to `pom-astra.toml`
77 + - All 208 tests pass. `cargo check` passes clean.
78 +
79 + ### Adversarial Test Audit (2026-03-13)
80 +
81 + **Goal:** Write tests that try to break the system. Find edge cases, race conditions, boundary conditions, and logic errors.
82 +
83 + **Results:**
84 + - **Test count:** 195 -> 208 (+13 tests)
85 + - **CRITICAL fix:** Alert cooldown key mismatch — `record_alert` used `target` but lookup used `alert_key` (`"health:{target}"`), so cooldowns never matched and alerts fired every check. Fixed by using `alert_key` consistently.
86 + - **HIGH fix:** TLS expiry check inconsistent at day boundary — time-of-day comparison could cause flapping. Changed to `date_naive()` comparison for stable day-level logic.
87 + - **HIGH fix:** UUID mismatch left stale peer state — now resets state, clears failures, persists via `update_peer_identity()` to prevent showing stale data after peer identity change.
88 + - **HIGH fix:** `prune_old_records` had no guard for `days <= 0`, so a non-positive value could delete all records. Added an early return for `days <= 0` (no-op).
89 + - **HIGH fix:** SSH timeout ignored config value — hardcoded `ConnectTimeout=10` in SSH args. Changed to use `config.timeout_secs`.
90 + - **Added `rcgen` dev dependency** for TLS cert generation in tests.
91 +
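Sketch, not from the commit: why the cooldown key mismatch mattered. If alerts are recorded under one key and looked up under another, the lookup never finds a previous send and every check fires. The in-memory `HashMap` stands in for the alerts table, and `Cooldowns` and `should_send` are hypothetical names; only the `health:{target}` and `route:{target}` key shapes come from the notes.

```rust
use std::collections::HashMap;

/// In-memory stand-in for the alerts table: alert key -> last sent time (seconds).
pub struct Cooldowns {
    cooldown_secs: u64,
    last_sent: HashMap<String, u64>,
}

impl Cooldowns {
    pub fn new(cooldown_secs: u64) -> Self {
        Self { cooldown_secs, last_sent: HashMap::new() }
    }

    /// Returns true and records the send when the key is outside its cooldown.
    /// The same `alert_key` is used for both lookup and recording.
    pub fn should_send(&mut self, alert_key: &str, now_secs: u64) -> bool {
        let cooldown = self.cooldown_secs;
        let in_cooldown = self
            .last_sent
            .get(alert_key)
            .map_or(false, |&sent| now_secs.saturating_sub(sent) < cooldown);
        if in_cooldown {
            return false;
        }
        self.last_sent.insert(alert_key.to_string(), now_secs);
        true
    }
}
```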
92 + ### Second audit (2026-03-11)
93 + | Change | Detail |
94 + |--------|--------|
95 + | Tests | +39 tests (17 -> 56). 28 unit + 28 integration. Tests/KLOC: 5.8 -> 18.4. |
96 + | Lock contention | Addressed in both peer.rs (heartbeat handlers) and api.rs (status/mesh handlers). Data collected under lock, DB writes after release. |
97 + | DB indexes | 4 indexes added: health_checks(target, id DESC), health_checks(target, checked_at), test_runs(target, id DESC), peer_heartbeats(peer_name, id DESC). |
98 + | Clippy | 4 warnings -> 0. Used Rust 2024 let chains instead of nested if-let. |
99 + | Type safety | PeerConfig.on_missing changed from String to OnMissing enum with serde deserialization. |
100 + | Module docs | Added //! docs to db.rs, config.rs, peer.rs, types.rs, lib.rs. |
101 + | Error handling | /api/peer/status fetch failures now logged at debug level instead of silenced. |
102 + | Prune | prune_old_records now returns 3-tuple including peer heartbeat count. |
103 + | Code extraction | HealthStatus::icon() method eliminates 3 repeated match blocks. |
104 + | HTTP checks | Response classification extracted into pure functions for testability. |
105 +
106 + ## Metrics Over Time
107 +
108 + | Audit Date | LOC | Rust Files | Tests | Tests/KLOC | Clippy Warnings | Cold Spots | Overall |
109 + |------------|-----|-----------|-------|-----------|----------------|------------|---------|
110 + | 2026-03-10 | 2,934 | 15 | 17 | 5.8 | 4 | 8 | B+ |
111 + | 2026-03-11 | 3,039 | 14 | 56 | 18.4 | 0 | 3 | A |
112 + | 2026-03-13 | ~3K | ~14 | 208 | ~69 | 0 | 3 | A- |
113 + | 2026-03-14 | ~3.5K | ~16 | 238 | ~68 | 0 | 0 | A |
114 + | 2026-03-16 | 10.1K | 23 | 344 | ~34 | 2 | 0 | A |
115 + | 2026-03-18 | 10.1K | 23 | 344 | ~34 | 0 | 0 | A |
@@ -144,114 +144,6 @@ Filed in `docs/mnw/pom/todo.md`.
144 144 10. ~~Add heartbeat state machine tests~~ -- Done (9 tests)
145 145 11. ~~Add config parsing tests~~ -- Done (4 tests)
146 146
147 - ## Changes Since Last Audit
148 -
149 - ### Tenth audit (2026-03-28, Run 12 cross-project)
150 - - **Test count:** 359 (222 unit + 8 cli + 129 integration). 0 clippy warnings. 0 failures.
151 - - **Grade:** A (maintained). v0.3.2.
152 - - **CORS monitoring:** New check type added for monitoring CORS headers on targets.
153 - - **New dependency advisories (action items):**
154 - - aws-lc-sys 0.38.0 (RUSTSEC-2026-0044 + -0048, severity 7.4 HIGH) — upgrade to 0.39.0 via `cargo update -p aws-lc-sys`
155 - - rustls-webpki 0.103.9 (RUSTSEC-2026-0049) — upgrade to 0.103.10 via `cargo update -p rustls-webpki`
156 - - paste unmaintained (RUSTSEC-2024-0436) — upstream via rmcp, warning only
157 - - **Mandatory surprise:** None. Previous surprises (rate limiter relaxed ordering, write!().unwrap() infallibility) still valid.
158 - - **No new code findings.** All previous items remain resolved.
159 -
160 - ### DNS/Route stale data fix (2026-03-25)
161 - - **Test count:** 352 (unchanged). 0 clippy warnings.
162 - - **Config:** Switched all 4 Cloudflare-proxied DNS records from `expected = ["IP"]` to `expected = []` (resolution-only). DNS checks were always failing because Cloudflare returns rotating proxy IPs, not the origin IP.
163 - - **API filtering:** `route_status` and `dns_status` in `/api/status/{target}` now filtered to only entries matching current config. Stale routes (e.g. `/docs/about`, `/signup`) and stale DNS records no longer appear in API responses.
164 - - **DB pruning:** Added `prune_stale_routes()` and `prune_stale_dns()` to `db.rs`. Called once at task startup in `routes.rs` and `dns.rs` to clean up historical data when config changes. Pruned 890 stale route check rows on first deploy.
165 - - **Integration tests:** Updated `api_status_includes_route_status` and `api_status_includes_dns_status` to use configs with matching route/DNS entries.
166 - - **Deployed to hetzner** — v0.3.2 binary + updated config.
167 -
168 - ### Eighth audit (2026-03-18, Run 9 cross-project)
169 - - **Test count:** 344 (unchanged). 0 clippy warnings.
170 - - **Grade:** A (maintained). v0.3.1 (deployed 2026-03-18).
171 - - **Dashboard UI shipped.** Per-test tracking, regression detection, duration drift.
172 - - **cli/ directory module split** completed (1,035-line cli.rs -> 8 files).
173 - - **Observations (pre-existing, not regressions):**
174 - - Mutex `.unwrap()` in rate limiter (api.rs:41) — if thread panics while holding lock, subsequent calls panic. Impact: LOW (rate limiter only, not core logic). Design choice: acceptable for monitoring tool.
175 - - `serde_json::to_value(d).unwrap_or_default()` in API details field — silently becomes null on serialization failure. Impact: LOW, safe fallback.
176 - - **No new findings requiring action.** Grade maintained at A.
177 - - **Mandatory surprise:** Rate limiter uses `fetch_add` with Relaxed ordering — can allow up to max_per_window+1 requests due to check-then-increment race. Known trade-off of lock-free rate limiting, documented.
178 -
179 - ### Fifth audit (2026-03-16, Run 6 cross-project)
180 - - **Test count:** 238 -> 344 (220 unit + 124 integration, +106 tests)
181 - - **Grade:** A (maintained). No new findings above LOW.
182 - - **Source LOC:** 10,113 (up from ~3.5K)
183 - - **Clippy:** 2 warnings (collapsible_if in cli.rs — LOW)
184 - - **Production unwraps:** 76 total — 64 infallible write! on String, 12 safe-by-construction. Effectively zero risky unwraps.
185 - - **Mandatory surprise:** write!().unwrap() pattern provably infallible — Actually fine.
186 - - **Previous items verified:** All previous remediated items confirmed intact.
187 - - **Note:** cli.rs at 1,036 lines — approaching the 500-line branching guideline but mostly flat match arms.
188 - - **Infrastructure check:** Blocked by Tailscale SSH re-authentication. Deferred.
189 -
190 - ### Fourth audit remediation (2026-03-14)
191 - - **Grade:** A- -> A. All remaining findings resolved.
192 - - **Test count:** 229 -> 238 (+9 integration tests)
193 - - **Graceful shutdown:** Replaced `handle.abort()` with CancellationToken + `tokio::select!` in all task loops. API server uses `with_graceful_shutdown`. 5s grace period on SIGINT/SIGTERM.
194 - - **Task panic detection:** 60s watchdog checks `JoinHandle::is_finished()` on all background tasks.
195 - - **Rate limiting:** Fixed-window 60 req/min middleware on authenticated API routes. Custom `RateLimiter` struct.
196 - - **Self-monitoring:** `GET /api/health` endpoint (public, no auth) returns `{"status":"operational","version":"..."}`.
197 - - **Integration tests:** 5 check_health tests (mock axum servers: operational, degraded, unreachable, expectations pass/fail), 1 check_tls test (self-signed cert via rcgen), 2 /api/health tests, 1 rate limiter test.
198 - - **Deploy config cleanup:** Removed redundant htpy `expected_routes` (duplicated health check URL).
199 - - **Dependency:** Added `tokio-util` for CancellationToken.
200 - - **Cold spots:** 0 remaining (was 3). All previous architectural and testing gaps closed.
201 -
202 - ### Third audit (2026-03-13, pre-launch skeptical lens)
203 - - **Grade:** A -> A-. Postmark API token in plaintext deployment configs is a real issue.
204 - - **Test count:** 56 -> 187 (+131 tests)
205 - - **New findings:** Plaintext API token, no API auth, no peer mesh auth, no integration tests for core functions, no self-monitoring.
206 - - **38 unwraps in non-test code** — all verified safe (write to String or guarded by prior checks).
207 -
208 - **Post-audit remediation (2026-03-13):**
209 - - All 3 critical/medium findings resolved: Postmark token to env var, API bearer auth (5 tests), peer mesh auth
210 - - 2 low findings resolved: SSH filter validation, peer UUID mismatch rejection
211 - - Test count: 187 -> 195 (+8 tests)
212 - - Documentation upgraded to A: All struct fields documented (HealthSnapshot, HealthStatus, HealthDetails, TestRun, TestStaleness, PeerStatus, OnMissing, all config types, all API response types). All 8 error variants documented. 11 config defaults with rationale comments. prune_old_records return tuple documented. description.md rewritten, architecture.md created (191 lines), README created (62 lines).
213 -
214 - ### Observability Upgrade (2026-03-13)
215 - - **Observability:** A- -> A
216 - - Added 57 `#[instrument(skip_all)]` annotations across 9 files: db.rs (28), alerts.rs (9), tools/mod.rs (8), tools/health.rs (5), tools/tests.rs (3), checks/http.rs (1), checks/tls.rs (1), checks/ssh.rs (1), peer.rs (1)
217 - - Added Multithreaded forum as monitoring target: `pom-astra.toml` (localhost:3400), `pom-hetzner.toml` (Tailscale IP)
218 - - Added test runner targets for GO, BB, AF, SK to `pom-astra.toml`
219 - - All 208 tests pass. `cargo check` passes clean.
220 -
221 - ### Adversarial Test Audit (2026-03-13)
222 -
223 - **Goal:** Write tests that try to break the system. Find edge cases, race conditions, boundary conditions, and logic errors.
224 -
225 - **Results:**
226 - - **Test count:** 195 -> 208 (+13 tests)
227 - - **CRITICAL fix:** Alert cooldown key mismatch — `record_alert` used `target` but lookup used `alert_key` (`"health:{target}"`), so cooldowns never matched and alerts fired every check. Fixed by using `alert_key` consistently.
228 - - **HIGH fix:** TLS expiry check inconsistent at day boundary — time-of-day comparison could cause flapping. Changed to `date_naive()` comparison for stable day-level logic.
229 - - **HIGH fix:** UUID mismatch left stale peer state — now resets state, clears failures, persists via `update_peer_identity()` to prevent showing stale data after peer identity change.
230 - - **HIGH fix:** `prune_old_records` no guard for days <= 0 — could delete all records. Added early return for `days <= 0` (no-op).
231 - - **HIGH fix:** SSH timeout ignored config value — hardcoded `ConnectTimeout=10` in SSH args. Changed to use `config.timeout_secs`.
232 - - **Added `rcgen` dev dependency** for TLS cert generation in tests.
233 -
234 - ### Second audit (2026-03-11)
235 - | Change | Detail |
236 - |--------|--------|
237 - | Tests | +39 tests (17 -> 56). 28 unit + 28 integration. Tests/KLOC: 5.8 -> 18.4. |
238 - | Lock contention | Addressed in both peer.rs (heartbeat handlers) and api.rs (status/mesh handlers). Data collected under lock, DB writes after release. |
239 - | DB indexes | 4 indexes added: health_checks(target, id DESC), health_checks(target, checked_at), test_runs(target, id DESC), peer_heartbeats(peer_name, id DESC). |
240 - | Clippy | 4 warnings -> 0. Used Rust 2024 let chains instead of nested if-let. |
241 - | Type safety | PeerConfig.on_missing changed from String to OnMissing enum with serde deserialization. |
242 - | Module docs | Added //! docs to db.rs, config.rs, peer.rs, types.rs, lib.rs. |
243 - | Error handling | /api/peer/status fetch failures now logged at debug level instead of silenced. |
244 - | Prune | prune_old_records now returns 3-tuple including peer heartbeat count. |
245 - | Code extraction | HealthStatus::icon() method eliminates 3 repeated match blocks. |
246 - | HTTP checks | Response classification extracted into pure functions for testability. |
247 -
248 - ## Metrics Over Time
249 -
250 - | Audit Date | LOC | Rust Files | Tests | Tests/KLOC | Clippy Warnings | Cold Spots | Overall |
251 - |------------|-----|-----------|-------|-----------|----------------|------------|---------|
252 - | 2026-03-10 | 2,934 | 15 | 17 | 5.8 | 4 | 8 | B+ |
253 - | 2026-03-11 | 3,039 | 14 | 56 | 18.4 | 0 | 3 | A |
254 - | 2026-03-13 | ~3K | ~14 | 208 | ~69 | 0 | 3 | A- |
255 - | 2026-03-14 | ~3.5K | ~16 | 238 | ~68 | 0 | 0 | A |
256 - | 2026-03-16 | 10.1K | 23 | 344 | ~34 | 2 | 0 | A |
257 - | 2026-03-18 | 10.1K | 23 | 344 | ~34 | 0 | 0 | A |
147 + ---
148 +
149 + See [audit_history.md](./audit_history.md) for full chronological audit log.
@@ -0,0 +1,90 @@
1 + # PoM -- Competitive Analysis
2 +
3 + Last updated: 2026-04-02
4 +
5 + ## Positioning
6 +
7 + PoM (Peace of Mind) is a single-binary production monitor built for indie developers and small teams. It runs as a peer mesh -- two instances cross-check each other with no central dashboard required. CLI-first, with an optional HTTP API and Claude integration (MCP server mode).
8 +
9 + The key differentiators are the peer mesh architecture (no single point of failure for monitoring), the CLI-first interface (inspect via SSH, no browser needed), and the Claude MCP integration (AI-assisted diagnostics). PoM monitors what matters for small deployments: uptime, TLS certificates, DNS records, domain registration, route availability, and test freshness.
10 +
11 + ## Pricing Comparison
12 +
13 + | Tool | Price | Model |
14 + |------|-------|-------|
15 + | **PoM** | Free | Source-available (PolyForm NC) |
16 + | Uptime Robot | $0-$58/mo | Freemium (50 monitors free) |
17 + | Pingdom | $15-$100/mo | SaaS |
18 + | Datadog | $15-$23/host/mo | SaaS |
19 + | New Relic | $0-$0.35/GB | Freemium |
20 + | Grafana + Prometheus | Free (self-host) | Open source |
21 + | StatusCake | $0-$67/mo | Freemium |
22 + | Hetrix Tools | $0-$20/mo | Freemium |
23 +
24 + ## Feature Matrix
25 +
26 + | Feature | PoM | Uptime Robot | Pingdom | Datadog | Grafana+Prom |
27 + |---------|:---:|:-----------:|:-------:|:-------:|:------------:|
28 + | HTTP health checks | Y | Y | Y | Y | Y |
29 + | TLS certificate monitoring | Y | Y | Y | Y | N* |
30 + | DNS record verification | Y | N | N | Y | N* |
31 + | WHOIS domain expiry | Y | N | N | N | N* |
32 + | Route availability checks | Y | N | Y | Y | N* |
33 + | CORS preflight checks | Y | N | N | N | N |
34 + | Peer mesh (cross-monitoring) | Y | N | N | N | N |
35 + | CLI-first interface | Y | N | N | N | N |
36 + | Claude MCP integration | Y | N | N | N | N |
37 + | SSH test execution | Y | N | N | N | N |
38 + | Latency drift detection | Y | N | Y | Y | Y |
39 + | Test duration drift | Y | N | N | N | N |
40 + | Email alerts | Y | Y | Y | Y | Y |
41 + | Status page | N | Y | Y | Y | Y** |
42 + | Mobile app | N | Y | Y | Y | Y** |
43 + | APM / traces | N | N | N | Y | Y |
44 + | Log aggregation | N | N | N | Y | Y |
45 + | Self-hosted | Y | N | N | N | Y |
46 + | Single binary | Y | N/A | N/A | N/A | N |
47 +
48 + \* Requires additional exporters. \*\* Via Grafana dashboards.
49 +
50 + ## Competitor Deep Dives
51 +
52 + ### 1. Uptime Robot
53 +
54 + Simple uptime monitoring SaaS. Free tier with 50 monitors at 5-minute intervals. Pro adds 1-minute intervals, SSL monitoring, status pages. The default choice for indie developers.
55 +
56 + **What PoM lacks:** status pages, mobile app, SMS/Slack/webhook alerts, maintenance windows. **What Uptime Robot lacks:** peer mesh, CLI interface, DNS/WHOIS monitoring, SSH test execution, AI integration.
57 +
58 + ### 2. Datadog
59 +
60 + Enterprise observability platform (APM, logs, metrics, dashboards). Powerful but expensive and invasive (requires agents on every host). Overkill for small deployments.
61 +
62 + **What PoM lacks:** APM, distributed tracing, dashboards, log aggregation, 800+ integrations. **What Datadog lacks:** peer mesh, CLI-first operation, single binary simplicity, affordability for indie teams.
63 +
64 + ### 3. Grafana + Prometheus
65 +
66 + Open-source metrics and visualization stack. Extremely flexible, industry standard. Requires significant setup (Prometheus server, exporters, Grafana instance, alertmanager). No built-in TLS/DNS/WHOIS monitoring without custom exporters.
67 +
68 + **What PoM lacks:** rich dashboards, metric visualization, alertmanager flexibility, ecosystem of exporters. **What Grafana+Prom lacks:** out-of-box TLS/DNS/WHOIS, peer mesh, single binary, zero-config setup.
69 +
70 + ### 4. StatusCake
71 +
72 + Web-based uptime and page speed monitoring. Free tier with 10 monitors. Pro adds SSL, domain, and server monitoring. Similar scope to Uptime Robot but with more check types.
73 +
74 + **What PoM lacks:** page speed testing, server monitoring agents, status pages, Slack/Teams integration.
75 +
76 + ## What We Offer That Competitors Don't
77 +
78 + - **Peer mesh** -- two PoM instances monitor each other. If one goes down, the other detects it. There is no central dashboard that could become a single point of failure.
79 + - **CLI-first** -- inspect status, run checks, query history from the terminal via SSH. No browser required.
80 + - **Claude MCP integration** -- expose health checks, test execution, and mesh status as MCP tools for AI-assisted diagnostics.
81 + - **SSH test execution** -- trigger and parse CI test runs on remote servers, track test freshness and duration drift.
82 + - **Single binary, zero dependencies** -- no Docker, no agents, and no external services beyond Postmark for email alerts. History is stored in SQLite.
83 + - **Monitoring-offline meta-alert** -- detects when all targets are unreachable simultaneously (likely a PoM network issue, not actual outages). Prevents false alarm cascades.
84 +
85 + ## Target Users
86 +
87 + - Indie developers running 1-5 services who want monitoring without SaaS costs
88 + - Small teams that operate via SSH and prefer CLI tools over web dashboards
89 + - Anyone who wants peer-verified monitoring (not trusting a single monitoring vendor)
90 + - Claude Code users who want AI-assisted production diagnostics
@@ -0,0 +1,202 @@
1 + # PoM Operational Runbook
2 +
3 + Procedures for responding to alerts, managing the service, and troubleshooting common issues.
4 +
5 + ## Alert Response Guide
6 +
7 + ### Health Status Change (Operational -> Error/Unreachable)
8 +
9 + **Symptoms:** Email alert with target status change.
10 +
11 + **Steps:**
12 + 1. Verify manually: `curl -v https://makenot.work/api/health`
13 + 2. If **Unreachable**: check network (Tailscale, firewall, DNS resolution)
14 + 3. If **Error** (5xx): SSH into the target server, check application logs
15 + ```sh
16 + ssh root@100.120.174.96 'journalctl -u makenotwork --since "10 minutes ago"'
17 + ```
18 + 4. If **Degraded** (2xx but unexpected body): check application state, database connectivity
19 + 5. Restart the service if needed: `ssh root@100.120.174.96 systemctl restart makenotwork`
20 +
21 + ### TLS Certificate Expiry
22 +
23 + **Symptoms:** Alert when certificate expires within 14 days.
24 +
25 + **Steps:**
26 + 1. Verify: `openssl s_client -connect makenot.work:443 -servername makenot.work </dev/null 2>/dev/null | openssl x509 -noout -dates`
27 + 2. Cloudflare Origin CA certs (15-year): no renewal needed. If alert fires, check Caddy config.
28 + 3. If Caddy is serving wrong cert: verify cert paths in `/etc/caddy/Caddyfile`
29 + 4. For custom domains (on-demand TLS): Caddy auto-renews via ACME. Check Caddy logs.
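The 14-day warning threshold can be illustrated with a day-level comparison, which avoids flapping around the time-of-day boundary. This is a minimal sketch, not PoM's actual code: the function names and the use of raw Unix timestamps are assumptions made to keep the example dependency-free.

```rust
/// Seconds in one UTC day.
const SECS_PER_DAY: i64 = 86_400;

/// Whole calendar days (UTC) between `now_unix` and `not_after_unix`.
/// Both timestamps are truncated to their day number first, so the result
/// does not change with the time of day at which the check runs.
/// Hypothetical helper, not PoM's API.
pub fn days_until_expiry(now_unix: i64, not_after_unix: i64) -> i64 {
    not_after_unix.div_euclid(SECS_PER_DAY) - now_unix.div_euclid(SECS_PER_DAY)
}

/// True when the certificate expires within `warn_days` days (inclusive).
pub fn should_warn(now_unix: i64, not_after_unix: i64, warn_days: i64) -> bool {
    days_until_expiry(now_unix, not_after_unix) <= warn_days
}
```

Because both sides are reduced to day numbers before subtracting, a check at 23:00 and a check at 01:00 on the same day give the same answer.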
30 +
31 + ### TLS Check Failed
32 +
33 + **Symptoms:** Handshake timeout, certificate parse failure, or connection refused.
34 +
35 + **Steps:**
36 + 1. Verify: `openssl s_client -connect makenot.work:443 -servername makenot.work`
37 + 2. Check Caddy status: `ssh root@100.120.174.96 systemctl status caddy`
38 + 3. Check if port 443 is open: `ssh root@100.120.174.96 ss -tlnp | grep 443`
39 + 4. If Caddy is down, restart: `ssh root@100.120.174.96 systemctl restart caddy`
40 +
41 + ### Peer Missing
42 +
43 + **Symptoms:** Peer (astra or hetzner) unreachable for 3+ consecutive heartbeats (3+ minutes).
44 +
45 + **Steps:**
46 + 1. SSH into the peer: `ssh max@100.106.221.39` (astra) or `ssh root@100.120.174.96` (hetzner)
47 + 2. Check PoM service: `systemctl status pom`
48 + 3. Check Tailscale connectivity: `tailscale ping <peer-ip>`
49 + 4. If PoM is down: `systemctl restart pom`
50 + 5. If Tailscale is down: `systemctl restart tailscaled`
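The "3+ consecutive heartbeats" rule behind this alert can be sketched as a small counter. The type and method names below are hypothetical and only illustrate the threshold logic described in the symptoms; they are not taken from PoM's source.

```rust
/// Tracks consecutive heartbeat failures for one peer.
/// Hypothetical type for illustration only.
pub struct PeerTracker {
    consecutive_failures: u32,
    threshold: u32,
}

impl PeerTracker {
    pub fn new(threshold: u32) -> Self {
        Self { consecutive_failures: 0, threshold }
    }

    /// Records one heartbeat result. Returns true only on the heartbeat
    /// where the peer first crosses the threshold, so the caller alerts once.
    pub fn record(&mut self, ok: bool) -> bool {
        if ok {
            // Any successful heartbeat resets the streak.
            self.consecutive_failures = 0;
            return false;
        }
        self.consecutive_failures += 1;
        self.consecutive_failures == self.threshold
    }

    /// True while the failure streak is at or above the threshold.
    pub fn is_missing(&self) -> bool {
        self.consecutive_failures >= self.threshold
    }
}
```

With a 60-second heartbeat and a threshold of 3, the alert fires roughly three minutes after the peer stops responding.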
51 +
52 + ### Latency Drift
53 +
54 + **Symptoms:** Sustained response time increase (>2x the 7-day baseline).
55 +
56 + **Steps:**
57 + 1. Check server load: `ssh root@100.120.174.96 top -bn1 | head -5`
58 + 2. Check PostgreSQL: `ssh root@100.120.174.96 "psql -c 'SELECT count(*) FROM pg_stat_activity;' makenotwork"`
59 + 3. Check for slow queries: `ssh root@100.120.174.96 "psql -c \"SELECT query, calls, mean_exec_time FROM pg_stat_statements ORDER BY mean_exec_time DESC LIMIT 5;\" makenotwork"`
60 + 4. Check disk I/O: `ssh root@100.120.174.96 iostat -x 1 3`
61 + 5. If database-related: consider `VACUUM ANALYZE` on affected tables
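For context on what triggers this alert, the drift condition (response time above 2x the 7-day baseline) might be sketched as follows. How PoM defines "sustained" is not stated here, so this sketch assumes it means that every one of the last `window` samples exceeds the threshold. The function name and parameters are illustrative.

```rust
/// Returns true when each of the most recent `window` samples exceeds
/// `factor` times the mean of the baseline samples.
/// Assumption: "sustained" means all samples in the recent window are slow.
pub fn is_latency_drift(
    baseline_ms: &[f64],
    recent_ms: &[f64],
    factor: f64,
    window: usize,
) -> bool {
    // Without a baseline or enough recent samples, no verdict is possible.
    if baseline_ms.is_empty() || window == 0 || recent_ms.len() < window {
        return false;
    }
    let baseline = baseline_ms.iter().sum::<f64>() / baseline_ms.len() as f64;
    let threshold = baseline * factor;
    recent_ms[recent_ms.len() - window..]
        .iter()
        .all(|&ms| ms > threshold)
}
```

Requiring the whole window to exceed the threshold means a single slow response does not raise an alert.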
62 +
63 + ### Route Failure
64 +
65 + **Symptoms:** Specific paths (e.g., `/login`, `/docs`) returning non-2xx.
66 +
67 + **Steps:**
68 + 1. Verify: `curl -sI https://makenot.work/login`
69 + 2. If 502/503: application is down or Caddy can't reach it
70 + 3. If 404: route may have been removed in a deploy -- check recent deploys
71 + 4. If 500: application error -- check logs with `journalctl -u makenotwork`
72 +
73 + ### DNS Mismatch
74 +
75 + **Symptoms:** DNS records don't match expected values.
76 +
77 + **Steps:**
78 + 1. Verify: `dig makenot.work +short` and compare with expected
79 + 2. Check Cloudflare DNS dashboard for unexpected changes
80 + 3. If propagation issue: wait 5-10 minutes and recheck
81 + 4. If intentional change: update PoM config to match new expected values
82 +
83 + ### WHOIS Domain Expiry
84 +
85 + **Symptoms:** Domain registration expires within 30 days.
86 +
87 + **Steps:**
88 + 1. Verify: `whois makenot.work | grep -i expir`
89 + 2. Renew domain with registrar (Cloudflare Registrar for makenot.work)
90 + 3. Confirm renewal: re-run WHOIS check
91 +
92 + ### Monitoring Offline (All Targets Unreachable)
93 +
94 + **Symptoms:** All monitored targets are down simultaneously.
95 +
96 + **Steps:**
97 + 1. This almost certainly means PoM's own network is down, not all targets
98 + 2. Check the PoM instance's network: `ping 1.1.1.1`, `tailscale status`
99 + 3. Check DNS resolution: `dig makenot.work`
100 + 4. If network is fine, check if all targets actually are down (unlikely but possible)
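The meta-alert condition can be sketched as a check over the latest status of every target. The enum mirrors the status names used in this runbook, but the type and function are hypothetical and not copied from PoM.

```rust
/// Health states named in this runbook. Hypothetical enum for illustration.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Status {
    Operational,
    Degraded,
    Error,
    Unreachable,
}

/// True when at least one target is configured and every target is
/// unreachable, which points at the monitor's own network rather than
/// at the targets.
pub fn monitoring_offline(statuses: &[Status]) -> bool {
    !statuses.is_empty() && statuses.iter().all(|s| *s == Status::Unreachable)
}
```

A target that returns an error still proves the network path works, so only `Unreachable` counts toward the meta-alert in this sketch.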
101 +
102 + ### Test Run Stale
103 +
104 + **Symptoms:** No test run recorded in 7+ days.
105 +
106 + **Steps:**
107 + 1. SSH into astra and run tests manually: `/home/max/staging/run-tests.sh`
108 + 2. If tests fail: investigate failures, fix, re-run
109 + 3. If SSH test execution fails: check SSH key, connectivity, permissions
110 +
111 + ## Service Management
112 +
113 + ### Starting/Stopping
114 +
115 + ```sh
116 + # Hetzner
117 + ssh root@100.120.174.96 systemctl start pom
118 + ssh root@100.120.174.96 systemctl stop pom
119 + ssh root@100.120.174.96 systemctl restart pom
120 +
121 + # Astra
122 + ssh max@100.106.221.39 sudo systemctl start pom
123 + ssh max@100.106.221.39 sudo systemctl stop pom
124 + ssh max@100.106.221.39 sudo systemctl restart pom
125 + ```
126 +
127 + ### Checking Status
128 +
129 + ```sh
130 + # Service status
131 + ssh root@100.120.174.96 systemctl status pom
132 +
133 + # Application logs
134 + ssh root@100.120.174.96 'journalctl -u pom --since "1 hour ago"'
135 +
136 + # API health
137 + curl http://100.120.174.96:9100/api/health
138 +
139 + # Full status (requires API token)
140 + curl -H "Authorization: Bearer <token>" http://100.120.174.96:9100/api/status
141 +
142 + # Mesh view (self + peers)
143 + curl -H "Authorization: Bearer <token>" http://100.120.174.96:9100/api/mesh
144 + ```
145 +
146 + ### Deploying Updates
147 +
148 + ```sh
149 + cd ~/Code/MNW/pom
150 + ./deploy/deploy.sh # Deploy to both astra and hetzner
151 + ```
152 +
153 + The deploy script cross-compiles for both architectures, uploads binaries, and restarts services.
154 +
155 + ### Configuration Changes
156 +
157 + Config lives at `/etc/pom/pom.toml` on each instance. After editing:
158 +
159 + ```sh
160 + ssh root@100.120.174.96 systemctl restart pom
161 + ```
162 +
163 + Alert credentials are in `/etc/pom/env` (Postmark token, API token).
164 +
165 + ## Check Intervals
166 +
167 + | Check Type | Default Interval | Notes |
168 + |------------|-----------------|-------|
169 + | Health (HTTP) | 5 minutes | 10-second timeout per request |
170 + | TLS certificate | 1 hour | Warns at 14 days before expiry |
171 + | Route availability | 5 minutes | Checks all configured paths |
172 + | DNS records | 1 hour | Compares against expected values |
173 + | WHOIS expiry | 1 hour | Warns at 30 days before expiry |
174 + | CORS preflight | 1 hour | OPTIONS request validation |
175 + | Peer heartbeat | 60 seconds | 3 failures before alert (grace period) |
176 + | Data pruning | Daily | Retains 30 days of history |
177 +
178 + ## Alert Cooldowns
179 +
180 + - **Default cooldown:** 5 minutes between repeated alerts for the same target
181 + - **Recovery alerts:** Always sent immediately (bypass cooldown)
182 + - **Monitoring-offline:** Special meta-alert when all targets are unreachable
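The cooldown rules above can be sketched as a small tracker keyed by alert key. This is a minimal sketch under assumptions: the struct, the method names, and the use of Unix seconds are illustrative, and the choice that recovery alerts leave the cooldown timer untouched is a guess rather than documented behavior.

```rust
use std::collections::HashMap;

/// Remembers when each alert key last fired.
/// Hypothetical type; not PoM's actual implementation.
pub struct AlertCooldown {
    cooldown_secs: u64,
    last_sent: HashMap<String, u64>, // alert key -> Unix seconds of last alert
}

impl AlertCooldown {
    pub fn new(cooldown_secs: u64) -> Self {
        Self { cooldown_secs, last_sent: HashMap::new() }
    }

    /// Decides whether to send an alert at time `now` (Unix seconds).
    /// Recovery alerts always go out and do not touch the timer (assumption).
    pub fn should_send(&mut self, key: &str, now: u64, is_recovery: bool) -> bool {
        if is_recovery {
            return true;
        }
        if let Some(&last) = self.last_sent.get(key) {
            if now.saturating_sub(last) < self.cooldown_secs {
                return false; // still inside the cooldown window
            }
        }
        // Record and look up under the same key so cooldowns actually match.
        self.last_sent.insert(key.to_string(), now);
        true
    }
}
```

Using one key for both the record and the lookup matters: a mismatch between the two would make every lookup miss and every check fire an alert.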
183 +
184 + ## Production Instances
185 +
186 + | Instance | Address (IP:port) | Architecture | Config |
187 + |----------|-----|-------------|--------|
188 + | Hetzner | `100.120.174.96:9100` | x86_64 | `/etc/pom/pom.toml` |
189 + | Astra | `100.106.221.39:9100` | aarch64 | `/etc/pom/pom.toml` |
190 +
191 + Both instances monitor the same targets and cross-check each other via the peer mesh.
192 +
193 + ## Key Files
194 +
195 + | What | Where |
196 + |------|-------|
197 + | Config | `/etc/pom/pom.toml` |
198 + | Credentials | `/etc/pom/env` |
199 + | Database | `/var/lib/pom/pom.db` (SQLite) |
200 + | Instance ID | `/var/lib/pom/instance_id` |
201 + | systemd unit | `/etc/systemd/system/pom.service` |
202 + | Deploy script | `deploy/deploy.sh` |
M docs/todo.md +3 -1
@@ -1,6 +1,8 @@
1 1 # PoM Todo
2 2
3 - Done: Phases 1-13 complete. Per-test tracking + regression detection + duration drift added. 352 tests (124 lib + 228 integration). Grade: A (Run 10). v0.3.2 (redeployed 2026-03-25). cli/ split into directory module. Dashboard UI shipped. DNS checks fixed for Cloudflare-proxied domains. Stale route/DNS data pruning added.
3 + Done: All phases (1-13). Active: None. Next: Post-beta items below.
4 +
5 + v0.3.2. Audit grade A. Dashboard UI, regression detection, duration drift. Monitors MNW + MT + htpy.app.
4 6
5 7 Completed work archived in `docs/archive/pom_todo_done.md`.
6 8