max / pom

Add project docs: audit history, competition analysis, runbook

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Author: Max J. <87768334+MaxJMath@users.noreply.github.com> · 2026-04-12 23:39 UTC

Commit: e8ed35e73aea237197b1fda84f5423b4c7af8f2b

Parent: b10b7fd

6 files changed, +721 insertions, -112 deletions

A docs/archive/pom_todo_done.md +308

		@@ -0,0 +1,308 @@
1	+	# PoM — Completed Work
2	+
3	+	Archived completed phases from todo.md. All items here are done.
4	+
5	+	---
6	+
7	+	## Phase 1 — Core Infrastructure
8	+	Health checks, test orchestration, CLI, MCP server, SQLite storage.
9	+
10	+	### Done
11	+	- [x] HTTP health checks with configurable targets and timeouts
12	+	- [x] SSH test orchestration with CI output parsing
13	+	- [x] CLI commands: health, test, status, history, prune
14	+	- [x] MCP server mode (stdio transport)
15	+	- [x] SQLite storage with WAL mode
16	+	- [x] Per-target interval overrides
17	+
18	+	## Phase 2 — Serve Mode
19	+	Background daemon with periodic health checks.
20	+
21	+	### Done
22	+	- [x] Serve mode with per-target health check intervals
23	+	- [x] Daily prune task
24	+	- [x] Graceful shutdown (SIGINT/SIGTERM)
25	+	- [x] Systemd service on hetzner
26	+
27	+	## Phase 3 — HTTP API + MNW Integration
28	+	Expose data to consumers, wire into MNW health page.
29	+
30	+	### Done
31	+	- [x] Axum HTTP API (`/api/status`, `/api/status/{target}`)
32	+	- [x] Uptime percentage queries (24h, 7d)
33	+	- [x] MNW `/health` page shows External Monitor card
34	+	- [x] MNW `/api/health` JSON includes `external_monitoring` field
35	+	- [x] Graceful fallback when PoM unavailable
36	+
37	+	## Phase 4 — Peer Mesh
38	+	Syncthing-style peer network. Each PoM instance has a UUID, discovers peers by address, and shares monitoring data across the mesh. Any instance can see the full network state.
39	+
40	+	### Done (4A — Instance Identity)
41	+	- [x] Auto-generate UUID on first run, store in data dir (`~/.local/share/pom/instance_id`)
42	+	- [x] Instance name in config (`[instance]` section, defaults to hostname)
43	+	- [x] `GET /api/peer/info` endpoint (returns instance ID, name, version, target list, started_at)
44	+
45	+	### Done (4B — Peer Configuration)
46	+	- [x] `[instance]` config section (name, optional ID override)
47	+	- [x] `[peers.<name>]` config section (address, on_missing, grace_count)
48	+	- [x] Peer connection on serve startup (exchange instance info via `/api/peer/info`)
49	+	- [x] Validate peer identity: store UUID on first connect, warn if UUID changes unexpectedly
50	+
51	+	### Done (4C — Peer Health Monitoring)
52	+	- [x] Periodic peer heartbeat (poll each peer's `/api/peer/info`, configurable interval, default 60s)
53	+	- [x] Peer status tracking in SQLite (`peer_identities`, `peer_heartbeats` tables)
54	+	- [x] `on_missing` behavior: fire action when peer heartbeat fails (after configurable grace period)
55	+	- [x] State machine: Unknown -> Online/GracePeriod -> Missing, with recovery detection
56	+	- [x] Prune task also cleans `peer_heartbeats`
57	+
58	+	### Done (4D — Status Sharing)
59	+	- [x] `GET /api/peer/status` endpoint (returns this instance's full target + peer status)
60	+	- [x] Each instance periodically fetches peer status to build combined view
61	+	- [x] `GET /api/mesh` endpoint (aggregated view: all instances, all targets, all peer statuses)
62	+	- [x] CLI: `pom mesh [--json]` command to show network state
63	+	- [x] MCP tool: `get_mesh_status` surfaces mesh state
64	+
65	+	### Done (4E — Code + Config)
66	+	- [x] Per-host config files (`deploy/pom-hetzner.toml`, `deploy/pom-astra.toml`)
67	+	- [x] Updated `deploy/deploy.sh` to use per-host configs
68	+	- [x] Listen on `0.0.0.0:9100` in deploy configs (Tailscale peer access)
69	+
70	+	### Done (4E — Deploy)
71	+	- [x] Install Tailscale on hetzner (`100.120.174.96`)
72	+	- [x] Update astra peer config to use hetzner's Tailscale IP
73	+	- [x] Fix `blocking_read()` panic in `spawn_heartbeat_tasks` (must be async)
74	+	- [x] Deploy v0.2.0 to hetzner + astra
75	+	- [x] Verify: `/api/peer/info` returns correct identity on each
76	+	- [x] Verify: `/api/mesh` shows both instances online (~65ms latency)
77	+	- [x] Update deploy scripts to use Tailscale IPs
78	+
79	+	## Phase 5 — Alerting (pre-beta)
80	+	Email alerts triggered by target status changes or peer disappearance. Peers with `on_missing = "alert"` use this system.
81	+
82	+	### Done
83	+	- [x] Postmark API integration (`src/alerts.rs` — Alerter struct, `X-Postmark-Server-Token` header)
84	+	- [x] Alert configuration in pom.toml (`[alerts]` section: postmark_token, to, from, cooldown_secs)
85	+	- [x] Status change detection (query previous health check before insert, compare statuses, fire on transition)
86	+	- [x] Cooldown logic (alerts table tracks sent_at, skip if within cooldown window)
87	+	- [x] Recovery alerts (notify when target returns to operational)
88	+	- [x] Peer-triggered alerts (peer goes missing/recovering with `on_missing = "alert"`)
89	+	- [x] Dev mode (no postmark_token → alerts logged to stdout)
90	+	- [x] DB migration v2 (alerts table + index)
91	+	- [x] Deploy configs updated (`deploy/pom-hetzner.toml`, `deploy/pom-astra.toml`)
92	+	- [x] 11 new tests (3 unit, 5 integration, 3 config)
93	+	- [x] Set postmark_token in production deploy configs
94	+	- [x] Create `pom-alerts@makenot.work` sender signature in Postmark dashboard
95	+
96	+	## Phase 6 — TLS Certificate Monitoring
97	+	Probe TLS certs, track expiry, alert before outage.
98	+
99	+	### Done
100	+	- [x] TLS certificate check: connect to target, TLS handshake, read leaf cert expiry (`src/checks/tls.rs`)
101	+	- [x] Per-target TLS config: `[targets.mnw.tls]` with host, port (default 443), warn_days (default 14)
102	+	- [x] Configurable check interval: `tls_check_interval_secs` on `[serve]` (default 3600)
103	+	- [x] DB migration v3: `tls_checks` table with index
104	+	- [x] Store cert check results per target (insert/query, `TlsCheckRow`)
105	+	- [x] Prune old TLS checks in daily prune task (5-tuple return)
106	+	- [x] TLS data in API response: `tls` field on `/api/status/{target}` (skip_serializing_if None)
107	+	- [x] CLI display: TLS line in `pom status` (OK/WARN/ERR with days remaining + expiry date)
108	+	- [x] Serve loop: TLS check task per target on its own interval
109	+	- [x] Alerts: expiry warning, error, and recovery (with cooldown)
110	+	- [x] Deploy configs updated (hetzner + astra: `[targets.mnw.tls] host = "makenot.work"`)
111	+	- [x] 17 new tests (2 unit, 5 config, 10 integration)
112	+	- [x] Dependencies: x509-parser 0.16, tokio-rustls 0.26, rustls-pki-types 1, webpki-roots 1
113	+
114	+	## Phase 7 — Response Validation
115	+	Verify response bodies match expected patterns, not just HTTP status codes.
116	+
117	+	### Done
118	+	- [x] `HealthExpectation` config struct: `status_code`, `json_fields` (dot-path), `body_contains`
119	+	- [x] `[targets.mnw.health.expect]` TOML config section (all fields optional)
120	+	- [x] `resolve_json_path()` — walk dot-separated paths through nested JSON
121	+	- [x] `validate_expectations()` — check status code, body substring, JSON field values
122	+	- [x] Refactored `check_health` to `response.text()` + `serde_json::from_str` (preserves raw body)
123	+	- [x] Expectation failures override to Degraded with joined error descriptions
124	+	- [x] Deploy configs updated (hetzner + astra: `status_code = 200`, `json_fields.status = "operational"`)
125	+	- [x] 17 new unit tests (resolve_json_path, validate_expectations, config parsing)
126	+
127	+	## Phase 8 — Latency Trending + Anomaly Detection
128	+	Track performance over time, detect drift before it becomes an outage.
129	+
130	+	### Done
131	+	- [x] `LatencyStats` + `LatencyBucket` types with `from_times()` and `bucket_by_time()` (types.rs, 9 unit tests)
132	+	- [x] DB queries: `get_response_times`, `get_recent_response_times` (db.rs — operational-only filtering)
133	+	- [x] `TrendingConfig` (baseline_window_hours, spike_threshold) wired into `HealthConfig` (config.rs, 3 tests)
134	+	- [x] `detect_latency_drift()` — 3 consecutive checks over baseline threshold (checks/http.rs, 6 unit tests)
135	+	- [x] Drift + recovery alerts with cooldown (alerts.rs)
136	+	- [x] Drift detection in serve loop with `in_drift` state tracking (cli.rs)
137	+	- [x] `latency_24h` on `/api/status/{target}`, `GET /api/trends/{target}?hours=&bucket_minutes=` (api.rs)
138	+	- [x] Latency line in CLI `pom status` output (display.rs, 2 tests)
139	+	- [x] Latency stats in MCP `get_status` tool (tools/health.rs)
140	+	- [x] Deploy configs: `[targets.mnw.health.trending]` (pom-hetzner.toml, pom-astra.toml)
141	+	- [x] MNW health page: avg/p95 latency in PoM card (health.rs, public.rs, health.html)
142	+	- [x] 8 new integration tests (response times, trends API, latency in status, config parsing)
143	+
144	+	## Phase 9 — Smart Test Prompting
145	+	Detect when tests should be re-run based on staleness and version changes.
146	+
147	+	### Done
148	+	- [x] `TestStaleness` struct: stale flag, reason, current/tested versions, last_test_at, days_since_test (types.rs)
149	+	- [x] `get_version_at_time()` DB query: extract version from health check closest to a given timestamp (db.rs)
150	+	- [x] `staleness_days` config field on `TestsConfig` (default 7) (config.rs, 2 config tests)
151	+	- [x] `compute_test_staleness()` pure function: no-tests, age-based, version-change triggers (checks/http.rs, 5 unit tests)
152	+	- [x] `test_staleness` field on API `TargetStatus` (skip_serializing_if None) (api.rs)
153	+	- [x] `build_target_status` computes staleness for targets with test config (api.rs)
154	+	- [x] CLI `pom status` shows "Tests: STALE" line with reason (display.rs, 4 display tests)
155	+	- [x] CLI JSON output includes `test_staleness` object (cli.rs)
156	+	- [x] MCP `get_status` shows staleness info when stale (tools/health.rs)
157	+	- [x] Deploy configs: `staleness_days = 7` (pom-hetzner.toml, pom-astra.toml)
158	+	- [x] 8 integration tests (version_at_time, staleness by version/age/fresh, config parsing, MCP tool, no-config omits field)
159	+
160	+	## Phase 10 — Downtime Log + Incident History
161	+	Structured timeline of status transitions for post-incident review.
162	+
163	+	### Done
164	+	- [x] DB migration v4: `incidents` table (id, target, started_at, ended_at, duration_secs, from_status, to_status)
165	+	- [x] `IncidentRow` struct (sqlx::FromRow + Serialize)
166	+	- [x] Incident queries: `insert_incident`, `close_open_incidents`, `get_open_incident`, `get_recent_incidents`
167	+	- [x] Automatic incident open on transition away from operational (serve loop)
168	+	- [x] Automatic incident close (with duration) on recovery to operational
169	+	- [x] Status change between non-operational states: close old incident, open new
170	+	- [x] `current_incident` + `incidents` (last 10) on API `/api/status/{target}` (skip_serializing_if)
171	+	- [x] CLI `pom status` shows active incident line
172	+	- [x] MCP `get_status` shows active + recent incidents
173	+	- [x] Prune cleans closed incidents (6-tuple return from `prune_old_records`)
174	+	- [x] 10 new tests (migration, lifecycle, target isolation, prune, API)
175	+	- [x] Surface in MNW health page (incident timeline, recent incidents list, expandable check lists, formatted timestamps)
176	+
177	+	## Audit Remediation (Second Audit, 2026-03-11)
178	+	5 findings, 3 cold spots. All resolved.
179	+
180	+	### Done
181	+	- [x] Extract CLI command handlers from main.rs into cli.rs (main.rs: 587 -> 130 LOC, cli.rs: 466 LOC)
182	+	- [x] Add typed PomError enum with thiserror (8 variants, replaces Box<dyn Error> across 9 files)
183	+	- [x] Add .DS_Store and IDE dirs (.idea/, .vscode/) to .gitignore
184	+	- [x] Add module-level //! docs to main.rs (config.rs already had one)
185	+	- [x] Add migration versioning (schema_version table, numbered migrations, pre-migration DB detection, 3 tests)
186	+	- [x] Add CLI display tests (extract formatting into display.rs, 27 tests: health snapshots, test results, status, history, prune, mesh)
187	+
188	+	## Audit Remediation (First Audit, 2026-03-10)
189	+	First audit. 11 findings, 8 cold spots. All resolved.
190	+
191	+	### Done
192	+	- [x] Add DB indexes: `health_checks(target, id DESC)`, `health_checks(target, checked_at)`, `test_runs(target, id DESC)`, `peer_heartbeats(peer_name, id DESC)` (db.rs, init_schema)
193	+	- [x] Fix 4 clippy `collapsible_if` warnings (api.rs, peer.rs, main.rs — used Rust 2024 let chains)
194	+	- [x] Decouple mesh write lock from DB writes in heartbeat handlers (peer.rs — block-scoped lock, DB writes after drop)
195	+	- [x] Decouple mesh read lock from DB queries in peer_status and mesh_view handlers (api.rs — same pattern)
196	+	- [x] Log `/api/peer/status` fetch failures instead of silently ignoring (peer.rs, tracing::debug)
197	+	- [x] Include peer heartbeat prune count in `prune_old_records` return value (db.rs — now returns 3-tuple)
198	+	- [x] Add `//!` module docs to db.rs, config.rs, peer.rs, types.rs, lib.rs (api.rs already had one)
199	+	- [x] Change `PeerConfig.on_missing` from `String` to `OnMissing` enum with `#[derive(Deserialize)]` + `#[default]`
200	+	- [x] Add API endpoint integration tests (5 tests: /api/status, /api/status/{target} 404, /api/peer/info, peer disabled, /api/mesh)
201	+	- [x] Add heartbeat state machine unit tests (5 tests: grace transitions, recovery, first-contact UUID, DB recording)
202	+	- [x] Add config parsing tests (4 tests: full parse, defaults, on_missing default, hostname fallback)
203	+	- [x] Add HTTP health check response classification tests (8 tests: operational, degraded, unknown status, error codes, missing fields, non-JSON)
204	+	- [x] Extract `HealthStatus::icon()` method, eliminating 3 repeated match blocks in main.rs
205	+	- [x] Add types.rs tests (4 tests: Display/FromStr roundtrip, icon mapping, serde roundtrip, invalid parse)
206	+
207	+	## Phase 11 — Route Specs (pre-beta)
208	+
209	+	Define expected routes per target in config. PoM periodically checks each route and alerts if any return non-200. Catches missing pages, broken deploys, misconfigured paths.
210	+
211	+	### Done
212	+	- [x] `expected_routes` config field on `[targets.<name>]` — list of paths to check (e.g. `["/", "/docs", "/docs/faq", "/pricing"]`)
213	+	- [x] `route_check_interval_secs` on `[serve]` (default 300 = 5 min)
214	+	- [x] Route check module (`src/checks/routes.rs`) — sequential GET per path, 2xx = OK
215	+	- [x] Route check task in serve loop (separate interval from health checks)
216	+	- [x] Route check results stored in DB (migration v5: `route_checks` table with indexes)
217	+	- [x] `RouteCheckRow`, `insert_route_check`, `get_latest_route_checks` queries
218	+	- [x] Prune includes route_checks (7-tuple return from `prune_old_records`)
219	+	- [x] Alert on route failure (non-200 on any expected route, with cooldown key `route:{target}`)
220	+	- [x] Recovery alert when previously-failing route returns 200 (no cooldown)
221	+	- [x] `route_status` field on `/api/status/{target}` (list of paths with last status, skip_serializing_if empty)
222	+	- [x] CLI `pom status` shows route check summary (e.g. "Routes: 9/9 OK" or "Routes: 7/9 (FAIL: /docs/faq, /pricing)")
223	+	- [x] MNW health page: route status in PoM card
224	+	- [x] Deploy configs updated (hetzner + astra: MNW 9 routes, MT 1 route)
225	+	- [x] 18 new tests (4 config, 5 route check unit, 3 display, 1 alert, 5 integration)
226	+
227	+	## Phase 12 — External Target: htpy.app
228	+
229	+	Monitor https://htpy.app (homotopy-rs, repo at `/Users/max/Math/sseq-work/homotopy-rs`). PoM already supports multiple targets — this adds htpy.app as a third monitored site alongside MNW and MT.
230	+
231	+	### Done
232	+	- [x] Add `[targets.htpy]` to deploy configs (pom-hetzner.toml, pom-astra.toml) with health URL, route checks, TLS
233	+	- [x] Health check via Tailscale (`http://100.99.153.68:8080/archive/S_2`) with `body_contains = "htpy"` expectation
234	+	- [x] Route check: `/archive/S_2` (the default redirect target from `/`)
235	+	- [x] TLS monitoring for htpy.app (`[targets.htpy.tls] host = "htpy.app"`)
236	+	- [x] Fix `classify_non_json` — non-JSON 2xx responses now promoted to Operational when all expectations pass
237	+	- [x] Verified on both hetzner (9ms) and astra (185ms): operational, TLS valid (87d), routes 1/1 OK
238	+
239	+	### Not applicable
240	+	- MNW health page: htpy.app is a separate service, doesn't belong on MNW's health dashboard
241	+
242	+	## Audit Action Items (2026-03-13, third audit — pre-launch skeptical lens)
243	+
244	+	### Done
245	+	- [x] CRITICAL: Remove Postmark API token from deployment configs (`deploy/pom-hetzner.toml`, `deploy/pom-astra.toml`) — moved to `POM_POSTMARK_TOKEN` env var, loaded in config.rs, systemd `EnvironmentFile=/etc/pom/env`
246	+	- [x] Add API authentication (bearer token middleware on all /api/* routes, `POM_API_TOKEN` env var or `[serve] api_token` config, 5 tests)
247	+	- [x] Add peer mesh authentication (`[peers.X] token` field, heartbeat client sends `Authorization: Bearer` header, MNW health.rs updated to send token)
248	+	- [x] Add integration tests for core functions (check_health, check_tls — 9 new integration tests with mock servers)
249	+	- [x] Add self-monitoring capability (`/api/health` endpoint returns `{"status":"operational","version":"..."}`, no auth required)
250	+	- [x] Shell-escape SSH test filter parameter (`checks/ssh.rs` — alphanumeric + `_:-` allowlist, returns error TestRun on invalid chars)
251	+	- [x] Reject peer responses on UUID mismatch instead of just logging a warning (`peer.rs` — upgraded to tracing::error, skips status update, increments consecutive_failures)
252	+	- [x] Add rate limiting to API endpoints (fixed-window 60 req/min middleware on authenticated routes, 1 unit test)
253	+
254	+	## Audit (Run 4, 2026-03-14)
255	+
256	+	Full code audit of Phases 11-12 additions. 1 HIGH, 4 MEDIUM, 5 LOW findings.
257	+
258	+	### Done
259	+	- [x] Audit route checks module (`src/checks/routes.rs`) — base_url parsing, error handling, edge cases
260	+	- [x] Audit `classify_non_json` Operational promotion — verified correct, no false positives
261	+	- [x] Audit deploy configs for consistency (htpy Tailscale IP, route lists, expectation accuracy)
262	+	- [x] Review test coverage gaps in Phase 11-12 code
263	+
264	+	### Done (from audit findings)
265	+	- [x] HIGH: Disable redirect following in route check client (`redirect(Policy::none())`) — was silently following redirects
266	+	- [x] MEDIUM: Fix startup thundering herd — consume first tick of health/TLS/route/prune intervals before entering loop
267	+	- [x] MEDIUM: Fix recovery cooldown interaction — `get_latest_alert_for_target` now excludes `%recovery%` alert types
268	+	- [x] MEDIUM: Set `MissedTickBehavior::Delay` on route check interval to prevent back-to-back storms
269	+	- [x] LOW: Validate `expected_routes` paths start with `/` at config load time
270	+	- [x] 3 new tests (recovery cooldown, empty routes, path validation)
271	+
272	+	### Done (remaining findings — resolved)
273	+	- [x] MEDIUM: Graceful shutdown — CancellationToken + `tokio::select!` in all task loops, `with_graceful_shutdown` on API server, 5s grace period (`cli.rs`)
274	+	- [x] LOW: Remove redundant htpy route check — removed `expected_routes` from htpy target (deploy configs)
275	+	- [x] LOW: Monitor for silent task panics — 60s watchdog checks `JoinHandle::is_finished()` in shutdown loop (`cli.rs`)
276	+
277	+	## Phase 13 — Per-Test Tracking & Duration Trending (Mar 2026)
278	+	`TestDetail` struct + `details` field on `TestSummary`. Parse individual test lines from cargo output. Migration v7: `test_details` table. `insert_test_details`, `get_test_regressions`, `get_test_durations` DB queries. Duration drift detection (baseline 10 runs, 1.5x threshold). Wired into CLI, MCP, API. 10 new tests. MT test target added to astra + hetzner configs.
279	+
280	+	## Phase 6 — TLS (additional, Mar 2026)
281	+	Domain WHOIS/registration check (registrar, expiry, nameservers). DNS record verification (A/AAAA/CNAME resolve to expected IPs).
282	+
283	+	## Run 6 Audit Items (Mar 2026)
284	+	Fixed 6 collapsible_if clippy warnings (cli.rs, config.rs, checks/http.rs).
285	+
286	+	## Run 8 Audit Items (Mar 2026)
287	+	Hardened `escape_js` in dashboard.rs: added newline, carriage return, `<` (`\x3c`), null escaping + 4 tests.
288	+
289	+	---
290	+
291	+	## DNS/Route Stale Data Fix (2026-03-25)
292	+
293	+	- [x] Switch Cloudflare-proxied DNS records to resolution-only checks
294	+	- [x] Filter `route_status` and `dns_status` in API to only configured entries
295	+	- [x] Add `prune_stale_routes()` and `prune_stale_dns()` DB functions
296	+	- [x] Call prune functions at task startup
297	+	- [x] Update integration tests for new filtering behavior
298	+	- [x] Deploy to hetzner (pruned 890 stale route check rows on startup)
299	+
300	+	---
301	+
302	+	## Rust Patterns Audit (2026-03-21)
303	+
304	+	- [x] Create `AlertCategory` enum (18 variants) replacing string literals
305	+	- [x] Create `DnsRecordType` enum (A/Aaaa/Cname/Mx/Txt) replacing raw strings
306	+	- [x] Add 30s timeout wrapper around email sends
307	+	- [x] Eliminate HealthSnapshot clone under lock in API handlers
308	+	- [x] Use `Cow<'_, str>` for JSON path response instead of String clone
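
Reviewer note on the Phase 8 entry above ("3 consecutive checks over baseline threshold"): the rule can be sketched as below. This is a minimal illustration and not the code in `checks/http.rs`. The function name `detect_latency_drift_sketch`, its signature, and the choice of the arithmetic mean as the baseline are all assumptions; only the "consecutive checks over a threshold" idea and the `spike_threshold` name come from the archived notes.

```rust
/// Hypothetical sketch of consecutive-sample drift detection.
/// Assumption: the baseline is the arithmetic mean of `baseline_ms`, and the
/// limit is that mean multiplied by `spike_threshold`.
/// Returns true only when the last `consecutive` recent samples all exceed the limit.
fn detect_latency_drift_sketch(
    baseline_ms: &[u64],
    recent_ms: &[u64],
    spike_threshold: f64,
    consecutive: usize,
) -> bool {
    // Without a baseline or enough recent samples, no drift can be claimed.
    if baseline_ms.is_empty() || consecutive == 0 || recent_ms.len() < consecutive {
        return false;
    }
    let mean = baseline_ms.iter().sum::<u64>() as f64 / baseline_ms.len() as f64;
    let limit = mean * spike_threshold;
    // Inspect only the trailing window of `consecutive` samples.
    recent_ms[recent_ms.len() - consecutive..]
        .iter()
        .all(|&ms| ms as f64 > limit)
}
```

Requiring every sample in the trailing window to exceed the limit means a single slow response does not trigger an alert, which matches the stated intent of catching drift rather than one-off spikes.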

A docs/audit_history.md +115

		@@ -0,0 +1,115 @@
1	+	# PoM (Peace of Mind) -- Audit History
2	+
3	+	Full chronological audit log. See [audit_review.md](./audit_review.md) for current state.
4	+
5	+	## Changes Since Last Audit
6	+
7	+	### Tenth audit (2026-03-28, Run 12 cross-project)
8	+	- Test count: 359 (222 unit + 8 cli + 129 integration). 0 clippy warnings. 0 failures.
9	+	- Grade: A (maintained). v0.3.2.
10	+	- CORS monitoring: New check type added for monitoring CORS headers on targets.
11	+	- New dependency advisories (action items):
12	+	- aws-lc-sys 0.38.0 (RUSTSEC-2026-0044 + -0048, severity 7.4 HIGH) — upgrade to 0.39.0 via `cargo update -p aws-lc-sys`
13	+	- rustls-webpki 0.103.9 (RUSTSEC-2026-0049) — upgrade to 0.103.10 via `cargo update -p rustls-webpki`
14	+	- paste unmaintained (RUSTSEC-2024-0436) — upstream via rmcp, warning only
15	+	- Mandatory surprise: None. Previous surprises (rate limiter relaxed ordering, write!().unwrap() infallibility) still valid.
16	+	- No new code findings. All previous items remain resolved.
17	+
18	+	### DNS/Route stale data fix (2026-03-25)
19	+	- Test count: 352 (unchanged). 0 clippy warnings.
20	+	- Config: Switched all 4 Cloudflare-proxied DNS records from `expected = ["IP"]` to `expected = []` (resolution-only). DNS checks were always failing because Cloudflare returns rotating proxy IPs, not the origin IP.
21	+	- API filtering: `route_status` and `dns_status` in `/api/status/{target}` now filtered to only entries matching current config. Stale routes (e.g. `/docs/about`, `/signup`) and stale DNS records no longer appear in API responses.
22	+	- DB pruning: Added `prune_stale_routes()` and `prune_stale_dns()` to `db.rs`. Called once at task startup in `routes.rs` and `dns.rs` to clean up historical data when config changes. Pruned 890 stale route check rows on first deploy.
23	+	- Integration tests: Updated `api_status_includes_route_status` and `api_status_includes_dns_status` to use configs with matching route/DNS entries.
24	+	- Deployed to hetzner — v0.3.2 binary + updated config.
25	+
26	+	### Eighth audit (2026-03-18, Run 9 cross-project)
27	+	- Test count: 344 (unchanged). 0 clippy warnings.
28	+	- Grade: A (maintained). v0.3.1 (deployed 2026-03-18).
29	+	- Dashboard UI shipped. Per-test tracking, regression detection, duration drift.
30	+	- cli/ directory module split completed (1,035-line cli.rs -> 8 files).
31	+	- Observations (pre-existing, not regressions):
32	+	- Mutex `.unwrap()` in rate limiter (api.rs:41) — if thread panics while holding lock, subsequent calls panic. Impact: LOW (rate limiter only, not core logic). Design choice: acceptable for monitoring tool.
33	+	- `serde_json::to_value(d).unwrap_or_default()` in API details field — silently becomes null on serialization failure. Impact: LOW, safe fallback.
34	+	- No new findings requiring action. Grade maintained at A.
35	+	- Mandatory surprise: Rate limiter uses `fetch_add` with Relaxed ordering — can allow up to max_per_window+1 requests due to check-then-increment race. Known trade-off of lock-free rate limiting, documented.
36	+
37	+	### Fifth audit (2026-03-16, Run 6 cross-project)
38	+	- Test count: 238 -> 344 (220 unit + 124 integration, +106 tests)
39	+	- Grade: A (maintained). No new findings above LOW.
40	+	- Source LOC: 10,113 (up from ~3.5K)
41	+	- Clippy: 2 warnings (collapsible_if in cli.rs — LOW)
42	+	- Production unwraps: 76 total — 64 infallible write! on String, 12 safe-by-construction. Effectively zero risky unwraps.
43	+	- Mandatory surprise: write!().unwrap() pattern provably infallible — Actually fine.
44	+	- Previous items verified: All previous remediated items confirmed intact.
45	+	- Note: cli.rs at 1,036 lines — approaching the 500-line branching guideline but mostly flat match arms.
46	+	- Infrastructure check: Blocked by Tailscale SSH re-authentication. Deferred.
47	+
48	+	### Fourth audit remediation (2026-03-14)
49	+	- Grade: A- -> A. All remaining findings resolved.
50	+	- Test count: 229 -> 238 (+9 integration tests)
51	+	- Graceful shutdown: Replaced `handle.abort()` with CancellationToken + `tokio::select!` in all task loops. API server uses `with_graceful_shutdown`. 5s grace period on SIGINT/SIGTERM.
52	+	- Task panic detection: 60s watchdog checks `JoinHandle::is_finished()` on all background tasks.
53	+	- Rate limiting: Fixed-window 60 req/min middleware on authenticated API routes. Custom `RateLimiter` struct.
54	+	- Self-monitoring: `GET /api/health` endpoint (public, no auth) returns `{"status":"operational","version":"..."}`.
55	+	- Integration tests: 5 check_health tests (mock axum servers: operational, degraded, unreachable, expectations pass/fail), 1 check_tls test (self-signed cert via rcgen), 2 /api/health tests, 1 rate limiter test.
56	+	- Deploy config cleanup: Removed redundant htpy `expected_routes` (duplicated health check URL).
57	+	- Dependency: Added `tokio-util` for CancellationToken.
58	+	- Cold spots: 0 remaining (was 3). All previous architectural and testing gaps closed.
59	+
60	+	### Third audit (2026-03-13, pre-launch skeptical lens)
61	+	- Grade: A -> A-. Postmark API token in plaintext deployment configs is a real issue.
62	+	- Test count: 56 -> 187 (+131 tests)
63	+	- New findings: Plaintext API token, no API auth, no peer mesh auth, no integration tests for core functions, no self-monitoring.
64	+	- 38 unwraps in non-test code — all verified safe (write to String or guarded by prior checks).
65	+
66	+	Post-audit remediation (2026-03-13):
67	+	- All 3 critical/medium findings resolved: Postmark token to env var, API bearer auth (5 tests), peer mesh auth
68	+	- 2 low findings resolved: SSH filter validation, peer UUID mismatch rejection
69	+	- Test count: 187 -> 195 (+8 tests)
70	+	- Documentation upgraded to A: All struct fields documented (HealthSnapshot, HealthStatus, HealthDetails, TestRun, TestStaleness, PeerStatus, OnMissing, all config types, all API response types). All 8 error variants documented. 11 config defaults with rationale comments. prune_old_records return tuple documented. description.md rewritten, architecture.md created (191 lines), README created (62 lines).
71	+
72	+	### Observability Upgrade (2026-03-13)
73	+	- Observability: A- -> A
74	+	- Added 57 `#[instrument(skip_all)]` annotations across 9 files: db.rs (28), alerts.rs (9), tools/mod.rs (8), tools/health.rs (5), tools/tests.rs (3), checks/http.rs (1), checks/tls.rs (1), checks/ssh.rs (1), peer.rs (1)
75	+	- Added Multithreaded forum as monitoring target: `pom-astra.toml` (localhost:3400), `pom-hetzner.toml` (Tailscale IP)
76	+	- Added test runner targets for GO, BB, AF, SK to `pom-astra.toml`
77	+	- All 208 tests pass. `cargo check` passes clean.
78	+
79	+	### Adversarial Test Audit (2026-03-13)
80	+
81	+	Goal: Write tests that try to break the system. Find edge cases, race conditions, boundary conditions, and logic errors.
82	+
83	+	Results:
84	+	- Test count: 195 -> 208 (+13 tests)
85	+	- CRITICAL fix: Alert cooldown key mismatch — `record_alert` used `target` but lookup used `alert_key` (`"health:{target}"`), so cooldowns never matched and alerts fired every check. Fixed by using `alert_key` consistently.
86	+	- HIGH fix: TLS expiry check inconsistent at day boundary — time-of-day comparison could cause flapping. Changed to `date_naive()` comparison for stable day-level logic.
87	+	- HIGH fix: UUID mismatch left stale peer state — now resets state, clears failures, persists via `update_peer_identity()` to prevent showing stale data after peer identity change.
88	+	- HIGH fix: `prune_old_records` no guard for days <= 0 — could delete all records. Added early return for `days <= 0` (no-op).
89	+	- HIGH fix: SSH timeout ignored config value — hardcoded `ConnectTimeout=10` in SSH args. Changed to use `config.timeout_secs`.
90	+	- Added `rcgen` dev dependency for TLS cert generation in tests.
91	+
92	+	### Second audit (2026-03-11)
93	+	| Change | Detail |
94	+	|--------|--------|
95	+	| Tests | +39 tests (17 -> 56). 28 unit + 28 integration. Tests/KLOC: 5.8 -> 18.4. |
96	+	| Lock contention | Addressed in both peer.rs (heartbeat handlers) and api.rs (status/mesh handlers). Data collected under lock, DB writes after release. |
97	+	| DB indexes | 4 indexes added: health_checks(target, id DESC), health_checks(target, checked_at), test_runs(target, id DESC), peer_heartbeats(peer_name, id DESC). |
98	+	| Clippy | 4 warnings -> 0. Used Rust 2024 let chains instead of nested if-let. |
99	+	| Type safety | PeerConfig.on_missing changed from String to OnMissing enum with serde deserialization. |
100	+	| Module docs | Added //! docs to db.rs, config.rs, peer.rs, types.rs, lib.rs. |
101	+	| Error handling | /api/peer/status fetch failures now logged at debug level instead of silenced. |
102	+	| Prune | prune_old_records now returns 3-tuple including peer heartbeat count. |
103	+	| Code extraction | HealthStatus::icon() method eliminates 3 repeated match blocks. |
104	+	| HTTP checks | Response classification extracted into pure functions for testability. |
105	+
106	+	## Metrics Over Time
107	+
108	+	| Audit Date | LOC | Rust Files | Tests | Tests/KLOC | Clippy Warnings | Cold Spots | Overall |
109	+	|------------|-----|-----------|-------|-----------|----------------|------------|---------|
110	+	| 2026-03-10 | 2,934 | 15 | 17 | 5.8 | 4 | 8 | B+ |
111	+	| 2026-03-11 | 3,039 | 14 | 56 | 18.4 | 0 | 3 | A |
112	+	| 2026-03-13 | ~3K | ~14 | 208 | ~69 | 0 | 3 | A- |
113	+	| 2026-03-14 | ~3.5K | ~16 | 238 | ~68 | 0 | 0 | A |
114	+	| 2026-03-16 | 10.1K | 23 | 344 | ~34 | 2 | 0 | A |
115	+	| 2026-03-18 | 10.1K | 23 | 344 | ~34 | 0 | 0 | A |
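
Reviewer note on the cooldown key mismatch recorded in the Adversarial Test Audit above: the fix amounts to recording and looking up an alert under the same key. The sketch below illustrates that idea with an in-memory map standing in for the alerts table. `CooldownLog`, `should_fire`, and `health_key` are hypothetical names chosen for illustration; only the `"health:{target}"` key format and the cooldown concept come from the audit notes.

```rust
use std::collections::HashMap;

/// Hypothetical in-memory stand-in for the alerts table.
struct CooldownLog {
    /// Alert key -> unix seconds of the last alert sent under that key.
    sent_at: HashMap<String, u64>,
    cooldown_secs: u64,
}

impl CooldownLog {
    fn new(cooldown_secs: u64) -> Self {
        Self { sent_at: HashMap::new(), cooldown_secs }
    }

    /// Returns true if an alert may fire at `now`, and records it under the
    /// same `alert_key` that the lookup uses, so the cooldown can match later.
    fn should_fire(&mut self, alert_key: &str, now: u64) -> bool {
        if let Some(&last) = self.sent_at.get(alert_key) {
            if now.saturating_sub(last) < self.cooldown_secs {
                return false; // still inside the cooldown window
            }
        }
        self.sent_at.insert(alert_key.to_string(), now);
        true
    }
}

/// Builds the key in the "health:{target}" form quoted in the audit entry.
fn health_key(target: &str) -> String {
    format!("health:{target}")
}
```

Had the record step stored the bare target name while the lookup used the prefixed key, `get` would never find an entry and every check would fire an alert, which is the behaviour the audit entry describes.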

M docs/audit_review.md +3 -111

			@@ -144,114 +144,6 @@ Filed in `docs/mnw/pom/todo.md`.
144	144		10. ~~Add heartbeat state machine tests~~ -- Done (9 tests)
145	145		11. ~~Add config parsing tests~~ -- Done (4 tests)
146	146
147		-	## Changes Since Last Audit
148		-
149		-	### Tenth audit (2026-03-28, Run 12 cross-project)
150		-	- Test count: 359 (222 unit + 8 cli + 129 integration). 0 clippy warnings. 0 failures.
151		-	- Grade: A (maintained). v0.3.2.
152		-	- CORS monitoring: New check type added for monitoring CORS headers on targets.
153		-	- New dependency advisories (action items):
154		-	- aws-lc-sys 0.38.0 (RUSTSEC-2026-0044 + -0048, severity 7.4 HIGH) — upgrade to 0.39.0 via `cargo update -p aws-lc-sys`
155		-	- rustls-webpki 0.103.9 (RUSTSEC-2026-0049) — upgrade to 0.103.10 via `cargo update -p rustls-webpki`
156		-	- paste unmaintained (RUSTSEC-2024-0436) — upstream via rmcp, warning only
157		-	- Mandatory surprise: None. Previous surprises (rate limiter relaxed ordering, write!().unwrap() infallibility) still valid.
158		-	- No new code findings. All previous items remain resolved.
159		-
160		-	### DNS/Route stale data fix (2026-03-25)
161		-	- Test count: 352 (unchanged). 0 clippy warnings.
162		-	- Config: Switched all 4 Cloudflare-proxied DNS records from `expected = ["IP"]` to `expected = []` (resolution-only). DNS checks were always failing because Cloudflare returns rotating proxy IPs, not the origin IP.
163		-	- API filtering: `route_status` and `dns_status` in `/api/status/{target}` now filtered to only entries matching current config. Stale routes (e.g. `/docs/about`, `/signup`) and stale DNS records no longer appear in API responses.
164		-	- DB pruning: Added `prune_stale_routes()` and `prune_stale_dns()` to `db.rs`. Called once at task startup in `routes.rs` and `dns.rs` to clean up historical data when config changes. Pruned 890 stale route check rows on first deploy.
165		-	- Integration tests: Updated `api_status_includes_route_status` and `api_status_includes_dns_status` to use configs with matching route/DNS entries.
166		-	- Deployed to hetzner — v0.3.2 binary + updated config.
167		-
168		-	### Eighth audit (2026-03-18, Run 9 cross-project)
169		-	- Test count: 344 (unchanged). 0 clippy warnings.
170		-	- Grade: A (maintained). v0.3.1 (deployed 2026-03-18).
171		-	- Dashboard UI shipped. Per-test tracking, regression detection, duration drift.
172		-	- cli/ directory module split completed (1,035-line cli.rs -> 8 files).
173		-	- Observations (pre-existing, not regressions):
174		-	- Mutex `.unwrap()` in rate limiter (api.rs:41) — if thread panics while holding lock, subsequent calls panic. Impact: LOW (rate limiter only, not core logic). Design choice: acceptable for monitoring tool.
175		-	- `serde_json::to_value(d).unwrap_or_default()` in API details field — silently becomes null on serialization failure. Impact: LOW, safe fallback.
176		-	- No new findings requiring action. Grade maintained at A.
177		-	- Mandatory surprise: Rate limiter uses `fetch_add` with Relaxed ordering — can allow up to max_per_window+1 requests due to check-then-increment race. Known trade-off of lock-free rate limiting, documented.
178		-
179		-	### Fifth audit (2026-03-16, Run 6 cross-project)
180		-	- Test count: 238 -> 344 (220 unit + 124 integration, +106 tests)
181		-	- Grade: A (maintained). No new findings above LOW.
182		-	- Source LOC: 10,113 (up from ~3.5K)
183		-	- Clippy: 2 warnings (collapsible_if in cli.rs — LOW)
184		-	- Production unwraps: 76 total — 64 infallible write! on String, 12 safe-by-construction. Effectively zero risky unwraps.
185		-	- Mandatory surprise: write!().unwrap() pattern provably infallible — Actually fine.
186		-	- Previous items verified: All previous remediated items confirmed intact.
187		-	- Note: cli.rs at 1,036 lines — approaching the 500-line branching guideline but mostly flat match arms.
188		-	- Infrastructure check: Blocked by Tailscale SSH re-authentication. Deferred.
189		-
190		-	### Fourth audit remediation (2026-03-14)
191		-	- Grade: A- -> A. All remaining findings resolved.
192		-	- Test count: 229 -> 238 (+9 integration tests)
193		-	- Graceful shutdown: Replaced `handle.abort()` with CancellationToken + `tokio::select!` in all task loops. API server uses `with_graceful_shutdown`. 5s grace period on SIGINT/SIGTERM.
194		-	- Task panic detection: 60s watchdog checks `JoinHandle::is_finished()` on all background tasks.
195		-	- Rate limiting: Fixed-window 60 req/min middleware on authenticated API routes. Custom `RateLimiter` struct.
196		-	- Self-monitoring: `GET /api/health` endpoint (public, no auth) returns `{"status":"operational","version":"..."}`.
197		-	- Integration tests: 5 check_health tests (mock axum servers: operational, degraded, unreachable, expectations pass/fail), 1 check_tls test (self-signed cert via rcgen), 2 /api/health tests, 1 rate limiter test.
198		-	- Deploy config cleanup: Removed redundant htpy `expected_routes` (duplicated health check URL).
199		-	- Dependency: Added `tokio-util` for CancellationToken.
200		-	- Cold spots: 0 remaining (was 3). All previous architectural and testing gaps closed.
201		-
202		-	### Third audit (2026-03-13, pre-launch skeptical lens)
203		-	- Grade: A -> A-. Postmark API token in plaintext deployment configs is a real issue.
204		-	- Test count: 56 -> 187 (+131 tests)
205		-	- New findings: Plaintext API token, no API auth, no peer mesh auth, no integration tests for core functions, no self-monitoring.
206		-	- 38 unwraps in non-test code — all verified safe (write to String or guarded by prior checks).
207		-
208		-	Post-audit remediation (2026-03-13):
209		-	- All 3 critical/medium findings resolved: Postmark token to env var, API bearer auth (5 tests), peer mesh auth
210		-	- 2 low findings resolved: SSH filter validation, peer UUID mismatch rejection
211		-	- Test count: 187 -> 195 (+8 tests)
212		-	- Documentation upgraded to A: All struct fields documented (HealthSnapshot, HealthStatus, HealthDetails, TestRun, TestStaleness, PeerStatus, OnMissing, all config types, all API response types). All 8 error variants documented. 11 config defaults with rationale comments. prune_old_records return tuple documented. description.md rewritten, architecture.md created (191 lines), README created (62 lines).
213		-
214		-	### Observability Upgrade (2026-03-13)
215		-	- Observability: A- -> A
216		-	- Added 57 `#[instrument(skip_all)]` annotations across 9 files: db.rs (28), alerts.rs (9), tools/mod.rs (8), tools/health.rs (5), tools/tests.rs (3), checks/http.rs (1), checks/tls.rs (1), checks/ssh.rs (1), peer.rs (1)
217		-	- Added Multithreaded forum as monitoring target: `pom-astra.toml` (localhost:3400), `pom-hetzner.toml` (Tailscale IP)
218		-	- Added test runner targets for GO, BB, AF, SK to `pom-astra.toml`
219		-	- All 208 tests pass. `cargo check` passes clean.
220		-
221		-	### Adversarial Test Audit (2026-03-13)
222		-
223		-	Goal: Write tests that try to break the system. Find edge cases, race conditions, boundary conditions, and logic errors.
224		-
225		-	Results:
226		-	- Test count: 195 -> 208 (+13 tests)
227		-	- CRITICAL fix: Alert cooldown key mismatch — `record_alert` used `target` but lookup used `alert_key` (`"health:{target}"`), so cooldowns never matched and alerts fired every check. Fixed by using `alert_key` consistently.
228		-	- HIGH fix: TLS expiry check inconsistent at day boundary — time-of-day comparison could cause flapping. Changed to `date_naive()` comparison for stable day-level logic.
229		-	- HIGH fix: UUID mismatch left stale peer state — now resets state, clears failures, persists via `update_peer_identity()` to prevent showing stale data after peer identity change.
230		-	- HIGH fix: `prune_old_records` no guard for days <= 0 — could delete all records. Added early return for `days <= 0` (no-op).
231		-	- HIGH fix: SSH timeout ignored config value — hardcoded `ConnectTimeout=10` in SSH args. Changed to use `config.timeout_secs`.
232		-	- Added `rcgen` dev dependency for TLS cert generation in tests.
233		-
234		-	### Second audit (2026-03-11)
235		-	\| Change \| Detail \|
236		-	\|--------\|--------\|
237		-	\| Tests \| +39 tests (17 -> 56). 28 unit + 28 integration. Tests/KLOC: 5.8 -> 18.4. \|
238		-	\| Lock contention \| Addressed in both peer.rs (heartbeat handlers) and api.rs (status/mesh handlers). Data collected under lock, DB writes after release. \|
239		-	\| DB indexes \| 4 indexes added: health_checks(target, id DESC), health_checks(target, checked_at), test_runs(target, id DESC), peer_heartbeats(peer_name, id DESC). \|
240		-	\| Clippy \| 4 warnings -> 0. Used Rust 2024 let chains instead of nested if-let. \|
241		-	\| Type safety \| PeerConfig.on_missing changed from String to OnMissing enum with serde deserialization. \|
242		-	\| Module docs \| Added //! docs to db.rs, config.rs, peer.rs, types.rs, lib.rs. \|
243		-	\| Error handling \| /api/peer/status fetch failures now logged at debug level instead of silenced. \|
244		-	\| Prune \| prune_old_records now returns 3-tuple including peer heartbeat count. \|
245		-	\| Code extraction \| HealthStatus::icon() method eliminates 3 repeated match blocks. \|
246		-	\| HTTP checks \| Response classification extracted into pure functions for testability. \|
247		-
248		-	## Metrics Over Time
249		-
250		-	\| Audit Date \| LOC \| Rust Files \| Tests \| Tests/KLOC \| Clippy Warnings \| Cold Spots \| Overall \|
251		-	\|------------\|-----\|-----------\|-------\|-----------\|----------------\|------------\|---------\|
252		-	\| 2026-03-10 \| 2,934 \| 15 \| 17 \| 5.8 \| 4 \| 8 \| B+ \|
253		-	\| 2026-03-11 \| 3,039 \| 14 \| 56 \| 18.4 \| 0 \| 3 \| A \|
254		-	\| 2026-03-13 \| ~3K \| ~14 \| 208 \| ~69 \| 0 \| 3 \| A- \|
255		-	\| 2026-03-14 \| ~3.5K \| ~16 \| 238 \| ~68 \| 0 \| 0 \| A \|
256		-	\| 2026-03-16 \| 10.1K \| 23 \| 344 \| ~34 \| 2 \| 0 \| A \|
257		-	\| 2026-03-18 \| 10.1K \| 23 \| 344 \| ~34 \| 0 \| 0 \| A \|
	147	+	---
	148	+
	149	+	See [audit_history.md](./audit_history.md) for full chronological audit log.

A docs/competition.md +90

		@@ -0,0 +1,90 @@
1	+	# PoM -- Competitive Analysis
2	+
3	+	Last updated: 2026-04-02
4	+
5	+	## Positioning
6	+
7	+	PoM (Peace of Mind) is a single-binary production monitor built for indie developers and small teams. It runs as a peer mesh -- two instances cross-check each other with no central dashboard required. CLI-first, with an optional HTTP API and Claude integration (MCP server mode).
8	+
9	+	The key differentiators are the peer mesh architecture (no single point of failure for monitoring), the CLI-first interface (inspect via SSH, no browser needed), and the Claude MCP integration (AI-assisted diagnostics). PoM monitors what matters for small deployments: uptime, TLS certificates, DNS records, domain registration, route availability, and test freshness.
10	+
11	+	## Pricing Comparison
12	+
13	+	\| Tool \| Price \| Model \|
14	+	\|------\|-------\|-------\|
15	+	\| PoM \| Free \| Source-available (PolyForm NC) \|
16	+	\| Uptime Robot \| $0-$58/mo \| Freemium (50 monitors free) \|
17	+	\| Pingdom \| $15-$100/mo \| SaaS \|
18	+	\| Datadog \| $15-$23/host/mo \| SaaS \|
19	+	\| New Relic \| $0-$0.35/GB \| Freemium \|
20	+	\| Grafana + Prometheus \| Free (self-host) \| Open source \|
21	+	\| StatusCake \| $0-$67/mo \| Freemium \|
22	+	\| Hetrix Tools \| $0-$20/mo \| Freemium \|
23	+
24	+	## Feature Matrix
25	+
26	+	\| Feature \| PoM \| Uptime Robot \| Pingdom \| Datadog \| Grafana+Prom \|
27	+	\|---------\|:---:\|:-----------:\|:-------:\|:-------:\|:------------:\|
28	+	\| HTTP health checks \| Y \| Y \| Y \| Y \| Y \|
29	+	\| TLS certificate monitoring \| Y \| Y \| Y \| Y \| N* \|
30	+	\| DNS record verification \| Y \| N \| N \| Y \| N* \|
31	+	\| WHOIS domain expiry \| Y \| N \| N \| N \| N* \|
32	+	\| Route availability checks \| Y \| N \| Y \| Y \| N* \|
33	+	\| CORS preflight checks \| Y \| N \| N \| N \| N \|
34	+	\| Peer mesh (cross-monitoring) \| Y \| N \| N \| N \| N \|
35	+	\| CLI-first interface \| Y \| N \| N \| N \| N \|
36	+	\| Claude MCP integration \| Y \| N \| N \| N \| N \|
37	+	\| SSH test execution \| Y \| N \| N \| N \| N \|
38	+	\| Latency drift detection \| Y \| N \| Y \| Y \| Y \|
39	+	\| Test duration drift \| Y \| N \| N \| N \| N \|
40	+	\| Email alerts \| Y \| Y \| Y \| Y \| Y \|
41	+	\| Status page \| N \| Y \| Y \| Y \| Y** \|
42	+	\| Mobile app \| N \| Y \| Y \| Y \| Y** \|
43	+	\| APM / traces \| N \| N \| N \| Y \| Y \|
44	+	\| Log aggregation \| N \| N \| N \| Y \| Y \|
45	+	\| Self-hosted \| Y \| N \| N \| N \| Y \|
46	+	\| Single binary \| Y \| N/A \| N/A \| N/A \| N \|
47	+
48	+	\* Requires additional exporters. \*\* Via Grafana dashboards.
49	+
50	+	## Competitor Deep Dives
51	+
52	+	### 1. Uptime Robot
53	+
54	+	Simple uptime monitoring SaaS. Free tier with 50 monitors at 5-minute intervals. Pro adds 1-minute intervals, SSL monitoring, status pages. The default choice for indie developers.
55	+
56	+	What PoM lacks: status pages, mobile app, SMS/Slack/webhook alerts, maintenance windows. What Uptime Robot lacks: peer mesh, CLI interface, DNS/WHOIS monitoring, SSH test execution, AI integration.
57	+
58	+	### 2. Datadog
59	+
60	+	Enterprise observability platform (APM, logs, metrics, dashboards). Powerful but expensive and invasive (requires agents on every host). Overkill for small deployments.
61	+
62	+	What PoM lacks: APM, distributed tracing, dashboards, log aggregation, 800+ integrations. What Datadog lacks: peer mesh, CLI-first operation, single binary simplicity, affordability for indie teams.
63	+
64	+	### 3. Grafana + Prometheus
65	+
66	+	Open-source metrics and visualization stack. Extremely flexible, industry standard. Requires significant setup (Prometheus server, exporters, Grafana instance, alertmanager). No built-in TLS/DNS/WHOIS monitoring without custom exporters.
67	+
68	+	What PoM lacks: rich dashboards, metric visualization, alertmanager flexibility, ecosystem of exporters. What Grafana+Prom lacks: out-of-box TLS/DNS/WHOIS, peer mesh, single binary, zero-config setup.
69	+
70	+	### 4. StatusCake
71	+
72	+	Web-based uptime and page speed monitoring. Free tier with 10 monitors. Pro adds SSL, domain, and server monitoring. Similar scope to Uptime Robot but with more check types.
73	+
74	+	What PoM lacks: page speed testing, server monitoring agents, status pages, Slack/Teams integration.
75	+
76	+	## What We Offer That Competitors Don't
77	+
78	+	- Peer mesh -- two PoM instances monitor each other. If one goes down, the other detects it. There is no central dashboard to become a single point of failure.
79	+	- CLI-first -- inspect status, run checks, query history from the terminal via SSH. No browser required.
80	+	- Claude MCP integration -- expose health checks, test execution, and mesh status as MCP tools for AI-assisted diagnostics.
81	+	- SSH test execution -- trigger and parse CI test runs on remote servers, track test freshness and duration drift.
82	+	- Single binary, zero dependencies -- no Docker, no external services, no agents. SQLite for history, Postmark for email alerts.
83	+	- Monitoring-offline meta-alert -- detects when all targets are unreachable simultaneously (likely a PoM network issue, not actual outages). Prevents false alarm cascades.
84	+
85	+	## Target Users
86	+
87	+	- Indie developers running 1-5 services who want monitoring without SaaS costs
88	+	- Small teams that operate via SSH and prefer CLI tools over web dashboards
89	+	- Anyone who wants peer-verified monitoring (not trusting a single monitoring vendor)
90	+	- Claude Code users who want AI-assisted production diagnostics

A docs/runbook.md +202

		@@ -0,0 +1,202 @@
1	+	# PoM Operational Runbook
2	+
3	+	Procedures for responding to alerts, managing the service, and troubleshooting common issues.
4	+
5	+	## Alert Response Guide
6	+
7	+	### Health Status Change (Operational -> Error/Unreachable)
8	+
9	+	Symptoms: Email alert with target status change.
10	+
11	+	Steps:
12	+	1. Verify manually: `curl -v https://makenot.work/api/health`
13	+	2. If Unreachable: check network (Tailscale, firewall, DNS resolution)
14	+	3. If Error (5xx): SSH into the target server, check application logs
15	+	```sh
16	+	ssh root@100.120.174.96 'journalctl -u makenotwork --since "10 minutes ago"'
17	+	```
18	+	4. If Degraded (2xx but unexpected body): check application state, database connectivity
19	+	5. Restart the service if needed: `ssh root@100.120.174.96 systemctl restart makenotwork`
20	+
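The triage above depends on how a check result maps to a status. A minimal sketch of that mapping, assuming a local `HealthStatus` enum and a hypothetical `classify` function (neither is PoM's actual code), and assuming any non-2xx response is reported as Error:

```rust
/// Health states named in the runbook steps above.
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
pub enum HealthStatus {
    Operational,
    Degraded,
    Error,
    Unreachable,
}

/// Classify one check result (hypothetical helper).
/// `status` is `None` when no HTTP response arrived at all.
/// `body_ok` says whether the body matched the configured expectation.
pub fn classify(status: Option<u16>, body_ok: bool) -> HealthStatus {
    match status {
        // No response: network, firewall, or DNS problem.
        None => HealthStatus::Unreachable,
        // 2xx: healthy only if the body also looks right.
        Some(code) if (200..300).contains(&code) => {
            if body_ok {
                HealthStatus::Operational
            } else {
                HealthStatus::Degraded
            }
        }
        // Assumption: every other status code is reported as Error.
        Some(_) => HealthStatus::Error,
    }
}
```

Keeping the mapping a pure function of status code and body match means each runbook branch can be unit-tested without a live server.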
21	+	### TLS Certificate Expiry
22	+
23	+	Symptoms: Alert when certificate expires within 14 days.
24	+
25	+	Steps:
26	+	1. Verify: `openssl s_client -connect makenot.work:443 2>/dev/null \| openssl x509 -noout -dates`
27	+	2. Cloudflare Origin CA certs (15-year): no renewal needed. If alert fires, check Caddy config.
28	+	3. If Caddy is serving wrong cert: verify cert paths in `/etc/caddy/Caddyfile`
29	+	4. For custom domains (on-demand TLS): Caddy auto-renews via ACME. Check Caddy logs.
30	+
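The 14-day warning is easiest to reason about as a comparison of whole calendar days rather than exact timestamps, which is the approach the audit log describes for avoiding flapping at day boundaries. A std-only sketch under that assumption; the function names and the use of Unix seconds are illustrative, not PoM's actual API:

```rust
const SECS_PER_DAY: i64 = 86_400;

/// Whole calendar days (UTC) between the day containing `now` and the day
/// containing `not_after`. Both arguments are Unix timestamps in seconds.
pub fn days_until_expiry(now: i64, not_after: i64) -> i64 {
    not_after.div_euclid(SECS_PER_DAY) - now.div_euclid(SECS_PER_DAY)
}

/// True when the certificate expires within `warn_days` days.
/// Time of day is ignored, so the answer is stable within a single day.
pub fn expiring_soon(now: i64, not_after: i64, warn_days: i64) -> bool {
    days_until_expiry(now, not_after) <= warn_days
}
```

Because only the day index is compared, a check at 01:00 and a check at 23:00 on the same day give the same answer.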
31	+	### TLS Check Failed
32	+
33	+	Symptoms: Handshake timeout, certificate parse failure, or connection refused.
34	+
35	+	Steps:
36	+	1. Verify: `openssl s_client -connect makenot.work:443 -servername makenot.work`
37	+	2. Check Caddy status: `ssh root@100.120.174.96 systemctl status caddy`
38	+	3. Check if port 443 is open: `ssh root@100.120.174.96 ss -tlnp \| grep 443`
39	+	4. If Caddy is down, restart: `ssh root@100.120.174.96 systemctl restart caddy`
40	+
41	+	### Peer Missing
42	+
43	+	Symptoms: Peer (astra or hetzner) unreachable for 3+ consecutive heartbeats (3+ minutes).
44	+
45	+	Steps:
46	+	1. SSH into the peer: `ssh max@100.106.221.39` (astra) or `ssh root@100.120.174.96` (hetzner)
47	+	2. Check PoM service: `systemctl status pom`
48	+	3. Check Tailscale connectivity: `tailscale ping <peer-ip>`
49	+	4. If PoM is down: `systemctl restart pom`
50	+	5. If Tailscale is down: `systemctl restart tailscaled`
51	+
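The "3+ consecutive heartbeats" rule can be pictured as a counter that resets on any success. A minimal sketch, assuming a hypothetical `PeerTracker` type (PoM's real heartbeat state machine may track more than this):

```rust
/// Counts consecutive failed heartbeats for one peer (hypothetical type).
#[derive(Debug, Default)]
pub struct PeerTracker {
    consecutive_failures: u32,
}

impl PeerTracker {
    /// Record one heartbeat result. Returns true when the peer has failed
    /// at least `threshold` heartbeats in a row and should be reported missing.
    pub fn record(&mut self, reachable: bool, threshold: u32) -> bool {
        if reachable {
            // Any successful heartbeat clears the failure streak.
            self.consecutive_failures = 0;
        } else {
            self.consecutive_failures = self.consecutive_failures.saturating_add(1);
        }
        self.consecutive_failures >= threshold
    }
}
```

With a 60-second heartbeat and a threshold of 3, a peer is reported missing on the third failed heartbeat in a row, and a single successful heartbeat clears the streak.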
52	+	### Latency Drift
53	+
54	+	Symptoms: Sustained response time increase (>2x the 7-day baseline).
55	+
56	+	Steps:
57	+	1. Check server load: `ssh root@100.120.174.96 top -bn1 \| head -5`
58	+	2. Check PostgreSQL: `ssh root@100.120.174.96 "psql -c 'SELECT count(*) FROM pg_stat_activity;' makenotwork"`
59	+	3. Check for slow queries: `ssh root@100.120.174.96 "psql -c \"SELECT query, calls, mean_exec_time FROM pg_stat_statements ORDER BY mean_exec_time DESC LIMIT 5;\" makenotwork"`
60	+	4. Check disk I/O: `ssh root@100.120.174.96 iostat -x 1 3`
61	+	5. If database-related: consider `VACUUM ANALYZE` on affected tables
62	+
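The drift condition (more than 2x the 7-day baseline, sustained) can be sketched as a comparison of recent samples against the baseline mean. The runbook does not define "sustained", so this sketch assumes it means every recent sample exceeds the threshold; the function names are illustrative:

```rust
/// Arithmetic mean of the samples, or `None` for an empty slice.
pub fn mean(samples: &[f64]) -> Option<f64> {
    if samples.is_empty() {
        None
    } else {
        Some(samples.iter().sum::<f64>() / samples.len() as f64)
    }
}

/// True when every recent sample exceeds `factor` times the baseline mean.
/// Assumption: "sustained" means all recent samples are above the threshold.
pub fn is_drifting(baseline_ms: &[f64], recent_ms: &[f64], factor: f64) -> bool {
    match mean(baseline_ms) {
        Some(base) if !recent_ms.is_empty() => {
            recent_ms.iter().all(|&r| r > factor * base)
        }
        // No baseline or no recent data: not enough evidence to alert.
        _ => false,
    }
}
```

Requiring all recent samples to exceed the threshold keeps a single slow request from triggering the alert.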
63	+	### Route Failure
64	+
65	+	Symptoms: Specific paths (e.g., `/login`, `/docs`) returning non-2xx.
66	+
67	+	Steps:
68	+	1. Verify: `curl -sI https://makenot.work/login`
69	+	2. If 502/503: application is down or Caddy can't reach it
70	+	3. If 404: route may have been removed in a deploy -- check recent deploys
71	+	4. If 500: application error -- check logs with `journalctl -u makenotwork`
72	+
73	+	### DNS Mismatch
74	+
75	+	Symptoms: DNS records don't match expected values.
76	+
77	+	Steps:
78	+	1. Verify: `dig makenot.work +short` and compare with expected
79	+	2. Check Cloudflare DNS dashboard for unexpected changes
80	+	3. If propagation issue: wait 5-10 minutes and recheck
81	+	4. If intentional change: update PoM config to match new expected values
82	+
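When updating expected values, recall from the audit log that an empty `expected` list means resolution-only, which is how the Cloudflare-proxied records are configured. A sketch of that rule, assuming a hypothetical `dns_check_passes` helper and assuming a match means every expected value appears among the resolved ones:

```rust
/// Decide whether a DNS check passes (hypothetical helper).
/// - No resolved values: the name did not resolve, so the check fails.
/// - Empty `expected`: resolution-only, any answer passes.
/// - Otherwise (assumption): every expected value must be present.
pub fn dns_check_passes(resolved: &[&str], expected: &[&str]) -> bool {
    if resolved.is_empty() {
        return false;
    }
    // `all` on an empty iterator is true, which gives the resolution-only case.
    expected.iter().all(|want| resolved.contains(want))
}
```

For a proxied record whose answers rotate, an empty `expected` list keeps the check meaningful (the name still resolves) without pinning it to one IP.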
83	+	### WHOIS Domain Expiry
84	+
85	+	Symptoms: Domain registration expires within 30 days.
86	+
87	+	Steps:
88	+	1. Verify: `whois makenot.work \| grep -i expir`
89	+	2. Renew domain with registrar (Cloudflare Registrar for makenot.work)
90	+	3. Confirm renewal: re-run WHOIS check
91	+
92	+	### Monitoring Offline (All Targets Unreachable)
93	+
94	+	Symptoms: All monitored targets are down simultaneously.
95	+
96	+	Steps:
97	+	1. This almost certainly means PoM's own network is down, not that every target has failed
98	+	2. Check the PoM instance's network: `ping 1.1.1.1`, `tailscale status`
99	+	3. Check DNS resolution: `dig makenot.work`
100	+	4. If network is fine, check if all targets actually are down (unlikely but possible)
101	+
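The meta-alert amounts to collapsing per-target alerts into one when nothing is reachable. A minimal sketch of that decision, using a hypothetical `AlertPlan` enum and `plan_alerts` function rather than PoM's actual alerting code:

```rust
/// What to send after one round of checks (hypothetical type).
#[derive(Debug, PartialEq, Eq)]
pub enum AlertPlan {
    /// Everything reachable, or nothing was checked.
    Nothing,
    /// Some targets are down: alert on each one by name.
    PerTarget(Vec<String>),
    /// Every target is down: send one meta-alert instead of a cascade.
    MonitoringOffline,
}

/// `results` pairs each target name with whether it was reachable.
pub fn plan_alerts(results: &[(&str, bool)]) -> AlertPlan {
    let down: Vec<String> = results
        .iter()
        .filter(|(_, reachable)| !reachable)
        .map(|(name, _)| name.to_string())
        .collect();

    if down.is_empty() {
        AlertPlan::Nothing
    } else if down.len() == results.len() {
        AlertPlan::MonitoringOffline
    } else {
        AlertPlan::PerTarget(down)
    }
}
```

Note that with a single configured target this sketch cannot tell a real outage from a local network failure, so step 4 above still applies.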
102	+	### Test Run Stale
103	+
104	+	Symptoms: No test run recorded in 7+ days.
105	+
106	+	Steps:
107	+	1. SSH into astra and run tests manually: `/home/max/staging/run-tests.sh`
108	+	2. If tests fail: investigate failures, fix, re-run
109	+	3. If SSH test execution fails: check SSH key, connectivity, permissions
110	+
111	+	## Service Management
112	+
113	+	### Starting/Stopping
114	+
115	+	```sh
116	+	# Hetzner
117	+	ssh root@100.120.174.96 systemctl start pom
118	+	ssh root@100.120.174.96 systemctl stop pom
119	+	ssh root@100.120.174.96 systemctl restart pom
120	+
121	+	# Astra
122	+	ssh max@100.106.221.39 sudo systemctl start pom
123	+	ssh max@100.106.221.39 sudo systemctl stop pom
124	+	ssh max@100.106.221.39 sudo systemctl restart pom
125	+	```
126	+
127	+	### Checking Status
128	+
129	+	```sh
130	+	# Service status
131	+	ssh root@100.120.174.96 systemctl status pom
132	+
133	+	# Application logs
134	+	ssh root@100.120.174.96 'journalctl -u pom --since "1 hour ago"'
135	+
136	+	# API health
137	+	curl http://100.120.174.96:9100/api/health
138	+
139	+	# Full status (requires API token)
140	+	curl -H "Authorization: Bearer <token>" http://100.120.174.96:9100/api/status
141	+
142	+	# Mesh view (self + peers)
143	+	curl -H "Authorization: Bearer <token>" http://100.120.174.96:9100/api/mesh
144	+	```
145	+
146	+	### Deploying Updates
147	+
148	+	```sh
149	+	cd ~/Code/MNW/pom
150	+	./deploy/deploy.sh # Deploy to both astra and hetzner
151	+	```
152	+
153	+	The deploy script cross-compiles for both architectures, uploads binaries, and restarts services.
154	+
155	+	### Configuration Changes
156	+
157	+	Config lives at `/etc/pom/pom.toml` on each instance. After editing:
158	+
159	+	```sh
160	+	ssh root@100.120.174.96 systemctl restart pom
161	+	```
162	+
163	+	Alert credentials are in `/etc/pom/env` (Postmark token, API token).
164	+
165	+	## Check Intervals
166	+
167	+	\| Check Type \| Default Interval \| Notes \|
168	+	\|------------\|-----------------\|-------\|
169	+	\| Health (HTTP) \| 5 minutes \| 10-second timeout per request \|
170	+	\| TLS certificate \| 1 hour \| Warns at 14 days before expiry \|
171	+	\| Route availability \| 5 minutes \| Checks all configured paths \|
172	+	\| DNS records \| 1 hour \| Compares against expected values \|
173	+	\| WHOIS expiry \| 1 hour \| Warns at 30 days before expiry \|
174	+	\| CORS preflight \| 1 hour \| OPTIONS request validation \|
175	+	\| Peer heartbeat \| 60 seconds \| 3 failures before alert (grace period) \|
176	+	\| Data pruning \| Daily \| Retains 30 days of history \|
177	+
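The 30-day retention translates into a cutoff timestamp, and the audit log notes that a non-positive day count must be a no-op rather than deleting everything. A sketch of that guard, assuming a hypothetical `prune_cutoff` helper that works in Unix seconds (this is not the signature of PoM's `prune_old_records`):

```rust
const SECS_PER_DAY: i64 = 86_400;

/// Compute the timestamp before which records may be deleted.
/// Returns `None` when `retain_days <= 0`, so the caller prunes nothing
/// instead of wiping the whole history.
pub fn prune_cutoff(now_secs: i64, retain_days: i64) -> Option<i64> {
    if retain_days <= 0 {
        return None;
    }
    Some(now_secs - retain_days * SECS_PER_DAY)
}
```

A caller would delete rows older than the returned cutoff and skip the delete entirely when the result is `None`.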
178	+	## Alert Cooldowns
179	+
180	+	- Default cooldown: 5 minutes between repeated alerts for the same target
181	+	- Recovery alerts: Always sent immediately (bypass cooldown)
182	+	- Monitoring-offline: Special meta-alert when all targets are unreachable
183	+
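The cooldown rules above can be sketched as a map from alert key to last-sent time, with recovery alerts bypassing the check. The `Cooldowns` type is hypothetical. The `"health:{target}"` key format comes from the audit log, which records a bug where the key used for recording differed from the key used for lookup, so this sketch uses a single key for both:

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Tracks when each alert key last fired (hypothetical type).
pub struct Cooldowns {
    window: Duration,
    last_sent: HashMap<String, Instant>,
}

impl Cooldowns {
    pub fn new(window: Duration) -> Self {
        Self { window, last_sent: HashMap::new() }
    }

    /// Returns true if an alert for `alert_key` should be sent at `now`.
    /// The same key is used for lookup and for recording.
    pub fn should_send(&mut self, alert_key: &str, is_recovery: bool, now: Instant) -> bool {
        // Recovery alerts always go out immediately.
        if is_recovery {
            return true;
        }
        match self.last_sent.get(alert_key) {
            // Still inside the cooldown window: suppress.
            Some(&prev) if now.duration_since(prev) < self.window => false,
            // First alert, or the window has elapsed: send and record.
            _ => {
                self.last_sent.insert(alert_key.to_string(), now);
                true
            }
        }
    }
}
```

Passing `now` in as a parameter, rather than calling `Instant::now()` inside, makes the 5-minute window testable without sleeping.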
184	+	## Production Instances
185	+
186	+	\| Instance \| IP \| Architecture \| Config \|
187	+	\|----------\|-----\|-------------\|--------\|
188	+	\| Hetzner \| `100.120.174.96:9100` \| x86_64 \| `/etc/pom/pom.toml` \|
189	+	\| Astra \| `100.106.221.39:9100` \| aarch64 \| `/etc/pom/pom.toml` \|
190	+
191	+	Both instances monitor the same targets and cross-check each other via the peer mesh.
192	+
193	+	## Key Files
194	+
195	+	\| What \| Where \|
196	+	\|------\|-------\|
197	+	\| Config \| `/etc/pom/pom.toml` \|
198	+	\| Credentials \| `/etc/pom/env` \|
199	+	\| Database \| `/var/lib/pom/pom.db` (SQLite) \|
200	+	\| Instance ID \| `/var/lib/pom/instance_id` \|
201	+	\| systemd unit \| `/etc/systemd/system/pom.service` \|
202	+	\| Deploy script \| `deploy/deploy.sh` \|

M docs/todo.md +3 -1

			@@ -1,6 +1,8 @@
1	1		# PoM Todo
2	2
3		-	Done: Phases 1-13 complete. Per-test tracking + regression detection + duration drift added. 352 tests (124 lib + 228 integration). Grade: A (Run 10). v0.3.2 (redeployed 2026-03-25). cli/ split into directory module. Dashboard UI shipped. DNS checks fixed for Cloudflare-proxied domains. Stale route/DNS data pruning added.
	3	+	Done: All phases (1-13). Active: None. Next: Post-beta items below.
	4	+
	5	+	v0.3.2. Audit grade A. Dashboard UI, regression detection, duration drift. Monitors MNW + MT + htpy.app.
4	6
5	7		Completed work archived in `docs/archive/pom_todo_done.md`.
6	8