max / pom

Move project docs into repo for ~/Code directory layout Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Author: Max J. <87768334+MaxJMath@users.noreply.github.com> · 2026-03-30 02:26 UTC

Commit: b10b7fd181f923ba01a5a90205b151e8381074fb

Parent: 2572bc2

3 files changed, +486 insertions, -0 deletions

A docs/architecture.md +191

		@@ -0,0 +1,191 @@
1	+	# PoM Architecture
2	+
3	+	## System Overview
4	+
5	+	PoM runs in three modes, selected by how it is invoked:
6	+
7	+	1. CLI mode (`pom health`, `pom test`, `pom status`, etc.) -- runs a single command and exits. Useful for ad-hoc checks and cron jobs.
8	+	2. Serve mode (`pom serve`) -- long-running daemon that spawns per-target health check loops, TLS check loops, peer heartbeat tasks, a daily prune task, and an HTTP API server. This is the production deployment mode.
9	+	3. MCP server mode (bare `pom` with no subcommand) -- launches an MCP server over stdio for Claude integration. Exposes health checks, test execution, history queries, and mesh status as MCP tools.
10	+
11	+	All three modes load the same TOML config and connect to the same SQLite database.
12	+
13	+	## Module Map
14	+
15	+	\| Module \| File \| Role \|
16	+	\|--------\|------\|------\|
17	+	\| `main` \| `src/main.rs` \| Entry point -- parses CLI args, dispatches to CLI handler or MCP server \|
18	+	\| `cli` \| `src/cli.rs` \| CLI command handlers (health, test, status, history, prune, serve, mesh) \|
19	+	\| `config` \| `src/config.rs` \| TOML config loading, types for targets/peers/alerts/serve settings \|
20	+	\| `types` \| `src/types.rs` \| Shared domain types: HealthSnapshot, TestRun, TlsStatus, LatencyStats, TestStaleness \|
21	+	\| `db` \| `src/db.rs` \| SQLite schema (versioned migrations), all queries for health/tests/alerts/TLS/incidents/peers \|
22	+	\| `api` \| `src/api.rs` \| Axum HTTP API: status, trends, peer info, mesh view, bearer token auth middleware \|
23	+	\| `alerts` \| `src/alerts.rs` \| Alerter struct -- sends emails via Postmark on status transitions, with cooldown tracking \|
24	+	\| `peer` \| `src/peer.rs` \| Peer mesh: identity management, heartbeat loops, grace period state machine, mesh state \|
25	+	\| `display` \| `src/display.rs` \| Pure formatting functions for CLI output (no I/O) \|
26	+	\| `error` \| `src/error.rs` \| Typed error enum (PomError) wrapping IO, DB, HTTP, JSON, config errors \|
27	+	\| `checks::http` \| `src/checks/http.rs` \| HTTP health checker, response classification, expectation validation, latency drift detection, test staleness computation \|
28	+	\| `checks::tls` \| `src/checks/tls.rs` \| TLS certificate prober -- TCP connect, TLS handshake, x509 leaf cert parsing \|
29	+	\| `checks::ssh` \| `src/checks/ssh.rs` \| Remote test runner -- executes commands over SSH, captures output \|
30	+	\| `checks::parse` \| `src/checks/parse.rs` \| CI output parser -- extracts PASS/FAIL steps and cargo test counts \|
31	+	\| `tools` \| `src/tools/mod.rs` \| MCP server definition (PomServer), tool registration via rmcp \|
32	+	\| `tools::health` \| `src/tools/health.rs` \| MCP tool implementations for health checks, history, targets, mesh status \|
33	+	\| `tools::tests` \| `src/tools/tests.rs` \| MCP tool implementations for test execution, history, raw output \|
34	+
35	+	## Data Flow
36	+
37	+	```
38	+	pom.toml (config)
39	+	\|
40	+	v
41	+	Config::load() --> targets, peers, alerts, serve settings
42	+	\|
43	+	v
44	+	db::connect() --> SQLite pool (WAL mode, versioned migrations)
45	+	\|
46	+	+---> [CLI mode] single command --> check/query --> display --> exit
47	+	\|
48	+	+---> [Serve mode]
49	+	\| \|
50	+	\| +--> per-target health check loop (configurable interval)
51	+	\| \| check_health() --> insert_health_check()
52	+	\| \| compare with previous --> alert on transition
53	+	\| \| detect latency drift --> alert if sustained
54	+	\| \| open/close incidents on status changes
55	+	\| \|
56	+	\| +--> per-target TLS check loop (hourly default)
57	+	\| \| check_tls() --> insert_tls_check()
58	+	\| \| alert on expiry warning or error
59	+	\| \|
60	+	\| +--> per-peer heartbeat loop (60s default)
61	+	\| \| GET /api/peer/info --> update mesh state
62	+	\| \| GET /api/peer/status --> cache for mesh view
63	+	\| \| grace period state machine on failure
64	+	\| \|
65	+	\| +--> daily prune task (configurable retention)
66	+	\| \|
67	+	\| +--> HTTP API server (Axum, configurable bind address)
68	+	\|
69	+	+---> [MCP mode] stdio transport --> tool calls --> same check/query logic
70	+	```
71	+
72	+	## Peer Mesh Design
73	+
74	+	Each PoM instance has a persistent UUID (stored at `~/.local/share/pom/instance_id`). Peers are configured by name with an address, a `on_missing` policy, and an optional grace count.
75	+
76	+	### Heartbeat State Machine
77	+
78	+	```
79	+	Unknown --> (success) --> Online
80	+	Unknown --> (failure) --> GracePeriod --> (failures >= grace_count) --> Missing
81	+	Online --> (failure) --> GracePeriod --> (failures >= grace_count) --> Missing
82	+	Missing --> (success) --> Online (triggers recovery alert)
83	+	```
84	+
85	+	Each heartbeat cycle:
86	+	1. GET `/api/peer/info` -- verifies identity, measures latency
87	+	2. On first contact, store the peer's UUID in `peer_identities` table
88	+	3. On subsequent contacts, reject UUID mismatches (prevents impersonation)
89	+	4. GET `/api/peer/status` -- caches the peer's full status for mesh aggregation
90	+	5. Record heartbeat result in `peer_heartbeats` table
91	+
92	+	### On Missing Policy
93	+
94	+	- `alert` -- send email alert when peer transitions to Missing, send recovery when it returns
95	+	- `log` -- log the event, no email
96	+	- `ignore` -- suppress entirely
97	+
98	+	## Database Schema
99	+
100	+	SQLite with WAL journal mode. Schema is managed through numbered migrations (currently v1-v4).
101	+
102	+	### Tables
103	+
104	+	\| Table \| Purpose \| Key Columns \|
105	+	\|-------\|---------\|-------------\|
106	+	\| `schema_version` \| Migration tracking \| version, description, applied_at \|
107	+	\| `health_checks` \| HTTP health check results \| target, status, checked_at, response_time_ms, details_json, error \|
108	+	\| `test_runs` \| SSH test execution results \| target, started_at, duration_secs, exit_code, passed, summary_json, raw_output \|
109	+	\| `peer_identities` \| First-seen peer UUIDs \| peer_name (PK), instance_id, first_seen \|
110	+	\| `peer_heartbeats` \| Heartbeat history \| peer_name, status, latency_ms, checked_at \|
111	+	\| `alerts` \| Alert history + cooldown tracking \| target, alert_type, from_status, to_status, sent_at \|
112	+	\| `tls_checks` \| TLS certificate probe results \| target, host, valid, days_remaining, not_before, not_after, subject, issuer \|
113	+	\| `incidents` \| Health incidents (open/closed) \| target, started_at, ended_at, duration_secs, from_status, to_status \|
114	+
115	+	Pre-migration databases are detected by the presence of the `health_checks` table and stamped as v1 without re-running the initial migration.
116	+
117	+	## Alert Pipeline
118	+
119	+	```
120	+	Health status change detected (previous != current)
121	+	\|
122	+	+--> operational -> non-operational: send_health_alert(), open incident
123	+	+--> non-operational -> operational: send_health_recovery(), close incidents
124	+	+--> non-operational -> different non-operational: close old incident, open new, alert
125	+	\|
126	+	TLS check detects issue
127	+	\|
128	+	+--> was OK, now invalid/error: send_tls_error_alert()
129	+	+--> was OK, now within warn_days: send_tls_expiry_alert()
130	+	+--> was bad, now OK: send_tls_recovery()
131	+	\|
132	+	Latency drift detected (all recent checks exceed baseline * threshold)
133	+	\|
134	+	+--> entered drift: send_latency_drift_alert()
135	+	+--> exited drift: send_latency_recovery()
136	+	\|
137	+	Peer transitions to Missing
138	+	\|
139	+	+--> send_peer_missing() (if on_missing = alert)
140	+	+--> peer recovers: send_peer_recovery()
141	+	```
142	+
143	+	All alerts except recoveries are subject to a per-target cooldown (default 300s). Recoveries always send immediately. Without a Postmark token, alerts are logged to stdout (dev mode).
144	+
145	+	## API Endpoints
146	+
147	+	All endpoints require `Authorization: Bearer <token>` when `api_token` is configured (in config or via `POM_API_TOKEN` env var). Without a token configured, all requests pass through.
148	+
149	+	\| Endpoint \| Method \| Description \|
150	+	\|----------\|--------\|-------------\|
151	+	\| `/api/status` \| GET \| JSON summary of all targets (latest health, uptime, latency, TLS, staleness, incidents) \|
152	+	\| `/api/status/{target}` \| GET \| Same as above for a single target \|
153	+	\| `/api/trends/{target}` \| GET \| Latency trend data with configurable window and bucket size (`?hours=24&bucket_minutes=60`) \|
154	+	\| `/api/peer/info` \| GET \| This instance's identity (id, name, version, targets, started_at) \|
155	+	\| `/api/peer/status` \| GET \| This instance's full view: identity + target statuses + peer summaries \|
156	+	\| `/api/mesh` \| GET \| Aggregated mesh view: self + each peer's cached status \|
157	+
158	+	## Check Types
159	+
160	+	### HTTP Health Check
161	+
162	+	Sends GET to the target's health URL. Classifies the response:
163	+	- JSON with `"status": "operational"` --> Operational
164	+	- JSON with `"status": "degraded"` --> Degraded
165	+	- Non-JSON 2xx --> Degraded
166	+	- Non-2xx or unknown status --> Error
167	+	- Connection failure --> Unreachable
168	+
169	+	Extracts version, uptime, checks, and monitoring from the JSON response body. Supports expectation validation: expected status code, required body substrings, and JSON field value assertions (with dot-path traversal for nested fields).
170	+
171	+	### TLS Certificate Check
172	+
173	+	Connects to host:port, completes a TLS handshake using the system trust store (webpki-roots), extracts the leaf certificate, and parses it with x509-parser. Records validity, days remaining, not_before/not_after, subject, and issuer. Alerts when days_remaining falls below the configured `warn_days` threshold (default 14).
174	+
175	+	### SSH Test Runner
176	+
177	+	Executes a configured command on a remote host via `ssh -o BatchMode=yes`. The command string comes from config (typically a CI script like `./run-ci.sh`). Supports an optional filter argument (validated to `[a-zA-Z0-9_:-]` to prevent injection). Output is parsed for PASS/FAIL step lines and `test result:` cargo test summary lines.
178	+
179	+	## Key Design Decisions
180	+
181	+	SQLite over PostgreSQL. PoM is a single-binary tool that runs on each monitoring host. SQLite keeps it self-contained with zero external dependencies. WAL mode provides concurrent reads during serve mode. Data volume is modest (a few checks per minute, pruned after 30 days).
182	+
183	+	Peer mesh over centralized monitoring. Two independent instances cross-check each other. If the Hetzner instance goes down, Astra detects it (and vice versa). No single point of failure for the monitoring layer itself.
184	+
185	+	Bearer token auth. Simple, stateless, sufficient for machine-to-machine API access between peers. Configured per-peer and per-instance. No user management needed.
186	+
187	+	Versioned migrations. The migration system detects pre-migration databases and stamps them without re-running. Each migration is a numbered SQL block. This avoids external migration tools while keeping schema evolution safe.
188	+
189	+	Separate check intervals. Health checks can have per-target interval overrides. TLS checks run on a longer interval (hourly default) since certificate state changes slowly. Peer heartbeats run on a short interval (60s default) for timely failure detection.
190	+
191	+	Cooldown on alerts. Prevents alert storms during flapping. Recovery alerts bypass cooldown so operators always know when a service comes back.

A docs/audit_review.md +257

		@@ -0,0 +1,257 @@
1	+	# PoM (Peace of Mind) -- Audit Review
2	+
3	+	Last audited: 2026-03-28 (tenth audit, Run 12 cross-project)
4	+	Previous audit: 2026-03-18 (eighth audit, Run 9 cross-project)
5	+
6	+	## Overall Grade: A
7	+
8	+	Run 12 cross-project audit. 359 tests (222 unit + 8 cli + 129 integration). 0 clippy warnings. v0.3.2. Grade stable at A. CORS monitoring added. DNS/route stale data fix deployed. New dep advisories: aws-lc-sys (HIGH 7.4), rustls-webpki.
9	+
10	+	## Scorecard
11	+
12	+	\| Dimension \| Grade \| Notes \|
13	+	\|-----------\|:-----:\|-------\|
14	+	\| Code Quality \| A \| Zero clippy warnings. Clean error handling with typed PomError enum (thiserror). No unnecessary complexity. \|
15	+	\| Architecture \| A \| Well-structured single crate. main.rs is thin (130 LOC), CLI handlers extracted to cli.rs. Module boundaries clean and purposeful. \|
16	+	\| Testing \| A \| 359 tests (222 unit + 8 cli + 129 integration) for ~10K LOC = ~36 tests/KLOC. Coverage comprehensive — DB, API, MCP tools, parsing, peer state machine, check_health (mock servers), check_tls (self-signed cert), rate limiter, CORS monitoring all tested. \|
17	+	\| Security \| A \| All SQL parameterized. SSH hardened with BatchMode + `--` option separator. Bearer token auth on API + peer mesh. Rate limiting (60 req/min). HTTP response size limits (10MB). Shell-escape validation on SSH filter. \|
18	+	\| Performance \| A \| WAL mode, proper indexes, async throughout. Appropriate for monitoring workload. \|
19	+	\| Documentation \| A \| Module-level `//!` docs on all modules. All struct fields documented. description.md rewritten, architecture.md created (191 lines), README created (62 lines). Tool descriptions are excellent. \|
20	+	\| Dependencies \| A \| 17 direct dependencies, all justified. No unused deps. rustls-tls avoids OpenSSL. Semver ranges. tokio-util for CancellationToken. \|
21	+	\| Type Safety \| A \| Domain enums (HealthStatus, PeerStatus, OnMissing) with proper serde. No stringly-typed fields in application logic. \|
22	+	\| Observability \| A \| 57 `#[instrument(skip_all)]` annotations across all modules. Debug-level logging for non-critical peer status fetch failures. Watchdog logging for silent task panics. \|
23	+	\| Concurrency \| A \| Clean RwLock discipline. Lock-then-release-then-DB pattern in peer module. CancellationToken for cooperative shutdown. \|
24	+	\| Resilience \| A \| CancellationToken + `tokio::select!` in all task loops. `with_graceful_shutdown` on API server. 5s grace period. Watchdog for task panic detection. Grace periods for peer detection. Error logging without panics. \|
25	+	\| API Consistency \| A \| Consistent JSON shapes. Proper HTTP status codes. Public `/api/health` endpoint. MCP tool descriptions well-written. \|
26	+	\| Migration Safety \| A- \| Numbered migration system (schema_version table). `IF NOT EXISTS` guards on all DDL. 5 migrations tracked. \|
27	+	\| Codebase Size \| A \| ~3,500 LOC for a full monitoring tool with CLI, MCP server, HTTP API, SQLite persistence, SSH test runner, peer mesh, and self-monitoring. Extremely efficient. \|
28	+
29	+	## Module Heatmap
30	+
31	+	\| Module \| Code \| Arch \| Test \| Security \| Perf \| Docs \| Type Safety \| Concurrency \| Resilience \|
32	+	\|--------\|:----:\|:----:\|:----:\|:--------:\|:----:\|:----:\|:-----------:\|:-----------:\|:----------:\|
33	+	\| main \| A \| A \| A \| A \| A \| A \| A \| A \| A \|
34	+	\| cli \| A \| A \| A \| A \| A \| A \| A \| A \| A \|
35	+	\| peer \| A \| A \| A \| A \| A \| A \| A \| A \| A \|
36	+	\| db \| A \| A \| A \| A+ \| A \| A \| A- \| A \| A \|
37	+	\| api \| A \| A \| A \| A \| A \| A \| A \| A \| A \|
38	+	\| config \| A \| A \| A \| A \| A \| A \| A \| - \| - \|
39	+	\| tools \| A \| A \| A \| A \| A \| A \| A \| - \| A \|
40	+	\| checks \| A \| A \| A \| A \| A \| A \| A \| - \| A \|
41	+	\| display \| A \| A \| A \| - \| - \| A \| A \| - \| - \|
42	+	\| types \| A \| A \| A \| - \| - \| A \| A+ \| - \| - \|
43	+	\| alerts \| A \| A \| A \| A \| A \| A \| A \| - \| A \|
44	+
45	+	Heatmap changes (2026-03-14 remediation):
46	+	- main: Arch B+ -> A (split into main.rs 130 LOC + cli.rs), Test B -> A (display.rs has 27+ tests), Resilience A- -> A (CancellationToken + graceful shutdown)
47	+	- api: Security A -> A (rate limiting added), Resilience A -> A (`with_graceful_shutdown`)
48	+	- cli (new): All A — extracted command handlers with CancellationToken shutdown orchestration
49	+	- display (new): All A — formatting logic with 27+ unit tests
50	+	- alerts (new): All A — extracted for visibility in heatmap
51	+
52	+	### Cold Spots
53	+
54	+	None remaining. All previous cold spots resolved:
55	+	- main.rs split into main.rs (130 LOC) + cli.rs (command handlers) + display.rs (formatting with 27+ tests)
56	+	- CLI display formatting extracted and tested via display.rs
57	+	- All modules documented with `//!` docs
58	+
59	+	### Cold Spot Remediation History (First Audit)
60	+
61	+	All 8 cold spots from the first audit were remediated before the second audit:
62	+
63	+	- ~~config.rs (Testing): C~~ -> A -- 4 config parsing tests (full parse, defaults, on_missing default, hostname fallback). `on_missing` now uses `OnMissing` enum with serde.
64	+	- ~~api.rs (Testing): C~~ -> A -- 5 API endpoint integration tests (status, target 404, peer/info, peer disabled, mesh).
65	+	- ~~api.rs (Concurrency): B~~ -> A -- Lock decoupled from DB queries in `peer_status` and `mesh_view`.
66	+	- ~~peer.rs (Testing): B-~~ -> A -- 9 heartbeat state machine tests (grace transitions, recovery, UUID first-contact, DB recording, identity, on_missing deserialization).
67	+	- ~~peer.rs (Concurrency): B~~ -> A -- Lock decoupled from DB writes in both heartbeat handlers.
68	+	- ~~db.rs (Performance): B-~~ -> A -- 4 indexes added to init_schema.
69	+	- ~~checks/http.rs (Testing): C~~ -> A -- Response classification extracted into pure functions. 8 unit tests covering all status paths.
70	+	- ~~main.rs (Testing): C~~ -> B -- Status icon logic extracted to `HealthStatus::icon()`. 4 types.rs tests. main.rs is thin orchestration over well-tested modules.
71	+
72	+	## Strengths
73	+
74	+	- Lean architecture. ~3,500 LOC delivers health checks, test orchestration, MCP server, HTTP API, peer mesh, CLI, alerting, TLS monitoring, route checks, and self-monitoring. No over-engineering.
75	+	- Clean error handling. `?` propagation with typed PomError enum (thiserror, 8 variants). No panics in production paths.
76	+	- Correct security posture. All SQL parameterized. Bearer token auth on API + peer mesh. Rate limiting (60 req/min). SSH hardened with `BatchMode=yes` + `--` separator. Shell-escape validation.
77	+	- Backward-compatible design. All config sections are `#[serde(default)]`. Existing configs work unchanged.
78	+	- Good MCP integration. 8 MCP tools with clear descriptions. Tools strip `raw_output` from test history to avoid flooding context.
79	+	- Comprehensive test suite. 238 tests (152 unit + 86 integration) at ~68/KLOC covering DB, API, MCP tools, parsing, peer state machine, config, health classification, mock server health checks, TLS cert validation, and rate limiting.
80	+	- Excellent lock discipline. Peer module collects data under lock, releases lock, then performs DB writes. No lock contention under load.
81	+	- Robust shutdown. CancellationToken with `tokio::select!` in all task loops. Graceful API server shutdown. Watchdog for silent panics. 5s grace period.
82	+
83	+	## Weaknesses
84	+
85	+	All previously identified weaknesses have been resolved:
86	+	- ~~main.rs size~~ — Split into main.rs (130 LOC) + cli.rs + display.rs
87	+	- ~~No typed error enum~~ — PomError enum with thiserror (8 variants)
88	+	- ~~Missing module docs~~ — All modules have `//!` docs
89	+	- ~~No migration versioning~~ — schema_version table with numbered migrations (v1-v5)
90	+
91	+	## Mandatory Surprise
92	+
93	+	Surprise: The peer mesh fetches `/api/peer/status` as a SECOND HTTP call after `/api/peer/info` in every heartbeat loop iteration, completely independently.
94	+
95	+	In `peer.rs:196-213`, after the heartbeat success/failure handling completes, there is a second HTTP GET to `/api/peer/status` that caches the full status data for mesh aggregation. This means every heartbeat cycle makes TWO HTTP requests to each peer.
96	+
97	+	Verdict: Actually fine.
98	+
99	+	This is intentional and well-designed. The `/api/peer/info` call is the heartbeat probe (lightweight, just identity data), while `/api/peer/status` fetches the full target/peer view for mesh aggregation. Separating them means:
100	+	1. The heartbeat latency measurement is accurate (not inflated by the larger status response).
101	+	2. A failure to fetch status data does not affect heartbeat state transitions (it logs at debug level and moves on).
102	+	3. The mesh view (`/api/mesh`) can show each peer's own view of their targets, enabling cross-instance monitoring.
103	+
104	+	The second request's failure is handled gracefully with `tracing::debug!` -- no state corruption, no retries, no panics. Good design.
105	+
106	+	## Action Items
107	+
108	+	Filed in `docs/mnw/pom/todo.md`.
109	+
110	+	### Third Audit (2026-03-13, pre-launch skeptical lens) -- 7 items, all resolved
111	+
112	+	1. ~~[CRITICAL] Remove Postmark API token from deployment configs~~ — Done — moved to POM_POSTMARK_TOKEN env var
113	+	2. ~~[MEDIUM] Add API authentication~~ — Done — bearer token middleware, 5 tests
114	+	3. ~~[MEDIUM] Add peer mesh authentication~~ — Done — per-peer token field, heartbeat sends Authorization header
115	+	4. ~~[LOW] Add integration tests for core functions~~ — Done — 9 new integration tests (check_health mock servers, check_tls self-signed cert)
116	+	5. ~~[LOW] Add self-monitoring~~ — Done — `/api/health` endpoint (public, no auth, returns status + version)
117	+	6. ~~[LOW] Shell-escape SSH test filter parameter~~ — Done (alphanumeric + `_:-` allowlist)
118	+	7. ~~[LOW] Reject peer UUID mismatch~~ — Done (tracing::error, skip status update, increment failures)
119	+
120	+	### Security Deep Dive (2026-03-13) — Complete (2/2)
121	+
122	+	8. ~~[MEDIUM] SSH `--` separator~~ — Done (`checks/ssh.rs`: `.arg("--")` between hostname and command prevents option injection via filter strings)
123	+	9. ~~[LOW] HTTP response size limit~~ — Done (`checks/http.rs`: `MAX_RESPONSE_BYTES` 10MB constant, content-length header check + post-read size verification)
124	+
125	+	### Second Audit (2026-03-11) -- 5 items
126	+
127	+	1. Extract CLI command handlers from main.rs into cli/ submodule (587 LOC, 7 commands)
128	+	2. Add typed PomError enum with thiserror (currently Box<dyn Error>, diverges from cross-project standard)
129	+	3. Add .DS_Store and IDE dirs (.idea/, .vscode/) to .gitignore
130	+	4. Add module-level //! docs to main.rs and config.rs
131	+	5. Add migration versioning (PRAGMA user_version) before next schema change
132	+
133	+	### First Audit (2026-03-10) -- 11 items, all resolved
134	+
135	+	1. ~~Add DB indexes~~ -- Done (4 indexes added to init_schema)
136	+	2. ~~Fix clippy warnings~~ -- Done (4 collapsible_if warnings fixed with Rust 2024 let chains)
137	+	3. ~~Decouple mesh lock from DB writes in heartbeat handlers~~ -- Done
138	+	4. ~~Decouple mesh read lock from DB queries in API handlers~~ -- Done
139	+	5. ~~Log /api/peer/status fetch failures~~ -- Done (tracing::debug)
140	+	6. ~~Include peer heartbeat prune count~~ -- Done (3-tuple return)
141	+	7. ~~Add module docs~~ -- Done (//! docs on db.rs, config.rs, peer.rs, types.rs, lib.rs)
142	+	8. ~~Change PeerConfig.on_missing to OnMissing enum~~ -- Done
143	+	9. ~~Add API endpoint tests~~ -- Done (5 tests)
144	+	10. ~~Add heartbeat state machine tests~~ -- Done (9 tests)
145	+	11. ~~Add config parsing tests~~ -- Done (4 tests)
146	+
147	+	## Changes Since Last Audit
148	+
149	+	### Tenth audit (2026-03-28, Run 12 cross-project)
150	+	- Test count: 359 (222 unit + 8 cli + 129 integration). 0 clippy warnings. 0 failures.
151	+	- Grade: A (maintained). v0.3.2.
152	+	- CORS monitoring: New check type added for monitoring CORS headers on targets.
153	+	- New dependency advisories (action items):
154	+	- aws-lc-sys 0.38.0 (RUSTSEC-2026-0044 + -0048, severity 7.4 HIGH) — upgrade to 0.39.0 via `cargo update -p aws-lc-sys`
155	+	- rustls-webpki 0.103.9 (RUSTSEC-2026-0049) — upgrade to 0.103.10 via `cargo update -p rustls-webpki`
156	+	- paste unmaintained (RUSTSEC-2024-0436) — upstream via rmcp, warning only
157	+	- Mandatory surprise: None. Previous surprises (rate limiter relaxed ordering, write!().unwrap() infallibility) still valid.
158	+	- No new code findings. All previous items remain resolved.
159	+
160	+	### DNS/Route stale data fix (2026-03-25)
161	+	- Test count: 352 (unchanged). 0 clippy warnings.
162	+	- Config: Switched all 4 Cloudflare-proxied DNS records from `expected = ["IP"]` to `expected = []` (resolution-only). DNS checks were always failing because Cloudflare returns rotating proxy IPs, not the origin IP.
163	+	- API filtering: `route_status` and `dns_status` in `/api/status/{target}` now filtered to only entries matching current config. Stale routes (e.g. `/docs/about`, `/signup`) and stale DNS records no longer appear in API responses.
164	+	- DB pruning: Added `prune_stale_routes()` and `prune_stale_dns()` to `db.rs`. Called once at task startup in `routes.rs` and `dns.rs` to clean up historical data when config changes. Pruned 890 stale route check rows on first deploy.
165	+	- Integration tests: Updated `api_status_includes_route_status` and `api_status_includes_dns_status` to use configs with matching route/DNS entries.
166	+	- Deployed to hetzner — v0.3.2 binary + updated config.
167	+
168	+	### Eighth audit (2026-03-18, Run 9 cross-project)
169	+	- Test count: 344 (unchanged). 0 clippy warnings.
170	+	- Grade: A (maintained). v0.3.1 (deployed 2026-03-18).
171	+	- Dashboard UI shipped. Per-test tracking, regression detection, duration drift.
172	+	- cli/ directory module split completed (1,035-line cli.rs -> 8 files).
173	+	- Observations (pre-existing, not regressions):
174	+	- Mutex `.unwrap()` in rate limiter (api.rs:41) — if thread panics while holding lock, subsequent calls panic. Impact: LOW (rate limiter only, not core logic). Design choice: acceptable for monitoring tool.
175	+	- `serde_json::to_value(d).unwrap_or_default()` in API details field — silently becomes null on serialization failure. Impact: LOW, safe fallback.
176	+	- No new findings requiring action. Grade maintained at A.
177	+	- Mandatory surprise: Rate limiter uses `fetch_add` with Relaxed ordering — can allow up to max_per_window+1 requests due to check-then-increment race. Known trade-off of lock-free rate limiting, documented.
178	+
179	+	### Fifth audit (2026-03-16, Run 6 cross-project)
180	+	- Test count: 238 -> 344 (220 unit + 124 integration, +106 tests)
181	+	- Grade: A (maintained). No new findings above LOW.
182	+	- Source LOC: 10,113 (up from ~3.5K)
183	+	- Clippy: 2 warnings (collapsible_if in cli.rs — LOW)
184	+	- Production unwraps: 76 total — 64 infallible write! on String, 12 safe-by-construction. Effectively zero risky unwraps.
185	+	- Mandatory surprise: write!().unwrap() pattern provably infallible — Actually fine.
186	+	- Previous items verified: All previous remediated items confirmed intact.
187	+	- Note: cli.rs at 1,036 lines — approaching the 500-line branching guideline but mostly flat match arms.
188	+	- Infrastructure check: Blocked by Tailscale SSH re-authentication. Deferred.
189	+
190	+	### Fourth audit remediation (2026-03-14)
191	+	- Grade: A- -> A. All remaining findings resolved.
192	+	- Test count: 229 -> 238 (+9 integration tests)
193	+	- Graceful shutdown: Replaced `handle.abort()` with CancellationToken + `tokio::select!` in all task loops. API server uses `with_graceful_shutdown`. 5s grace period on SIGINT/SIGTERM.
194	+	- Task panic detection: 60s watchdog checks `JoinHandle::is_finished()` on all background tasks.
195	+	- Rate limiting: Fixed-window 60 req/min middleware on authenticated API routes. Custom `RateLimiter` struct.
196	+	- Self-monitoring: `GET /api/health` endpoint (public, no auth) returns `{"status":"operational","version":"..."}`.
197	+	- Integration tests: 5 check_health tests (mock axum servers: operational, degraded, unreachable, expectations pass/fail), 1 check_tls test (self-signed cert via rcgen), 2 /api/health tests, 1 rate limiter test.
198	+	- Deploy config cleanup: Removed redundant htpy `expected_routes` (duplicated health check URL).
199	+	- Dependency: Added `tokio-util` for CancellationToken.
200	+	- Cold spots: 0 remaining (was 3). All previous architectural and testing gaps closed.
201	+
202	+	### Third audit (2026-03-13, pre-launch skeptical lens)
203	+	- Grade: A -> A-. Postmark API token in plaintext deployment configs is a real issue.
204	+	- Test count: 56 -> 187 (+131 tests)
205	+	- New findings: Plaintext API token, no API auth, no peer mesh auth, no integration tests for core functions, no self-monitoring.
206	+	- 38 unwraps in non-test code — all verified safe (write to String or guarded by prior checks).
207	+
208	+	Post-audit remediation (2026-03-13):
209	+	- All 3 critical/medium findings resolved: Postmark token to env var, API bearer auth (5 tests), peer mesh auth
210	+	- 2 low findings resolved: SSH filter validation, peer UUID mismatch rejection
211	+	- Test count: 187 -> 195 (+8 tests)
212	+	- Documentation upgraded to A: All struct fields documented (HealthSnapshot, HealthStatus, HealthDetails, TestRun, TestStaleness, PeerStatus, OnMissing, all config types, all API response types). All 8 error variants documented. 11 config defaults with rationale comments. prune_old_records return tuple documented. description.md rewritten, architecture.md created (191 lines), README created (62 lines).
213	+
214	+	### Observability Upgrade (2026-03-13)
215	+	- Observability: A- -> A
216	+	- Added 57 `#[instrument(skip_all)]` annotations across 9 files: db.rs (28), alerts.rs (9), tools/mod.rs (8), tools/health.rs (5), tools/tests.rs (3), checks/http.rs (1), checks/tls.rs (1), checks/ssh.rs (1), peer.rs (1)
217	+	- Added Multithreaded forum as monitoring target: `pom-astra.toml` (localhost:3400), `pom-hetzner.toml` (Tailscale IP)
218	+	- Added test runner targets for GO, BB, AF, SK to `pom-astra.toml`
219	+	- All 208 tests pass. `cargo check` passes clean.
220	+
221	+	### Adversarial Test Audit (2026-03-13)
222	+
223	+	Goal: Write tests that try to break the system. Find edge cases, race conditions, boundary conditions, and logic errors.
224	+
225	+	Results:
226	+	- Test count: 195 -> 208 (+13 tests)
227	+	- CRITICAL fix: Alert cooldown key mismatch — `record_alert` used `target` but lookup used `alert_key` (`"health:{target}"`), so cooldowns never matched and alerts fired every check. Fixed by using `alert_key` consistently.
228	+	- HIGH fix: TLS expiry check inconsistent at day boundary — time-of-day comparison could cause flapping. Changed to `date_naive()` comparison for stable day-level logic.
229	+	- HIGH fix: UUID mismatch left stale peer state — now resets state, clears failures, persists via `update_peer_identity()` to prevent showing stale data after peer identity change.
230	+	- HIGH fix: `prune_old_records` no guard for days <= 0 — could delete all records. Added early return for `days <= 0` (no-op).
231	+	- HIGH fix: SSH timeout ignored config value — hardcoded `ConnectTimeout=10` in SSH args. Changed to use `config.timeout_secs`.
232	+	- Added `rcgen` dev dependency for TLS cert generation in tests.
233	+
234	+	### Second audit (2026-03-11)
235	+	\| Change \| Detail \|
236	+	\|--------\|--------\|
237	+	\| Tests \| +39 tests (17 -> 56). 28 unit + 28 integration. Tests/KLOC: 5.8 -> 18.4. \|
238	+	\| Lock contention \| Addressed in both peer.rs (heartbeat handlers) and api.rs (status/mesh handlers). Data collected under lock, DB writes after release. \|
239	+	\| DB indexes \| 4 indexes added: health_checks(target, id DESC), health_checks(target, checked_at), test_runs(target, id DESC), peer_heartbeats(peer_name, id DESC). \|
240	+	\| Clippy \| 4 warnings -> 0. Used Rust 2024 let chains instead of nested if-let. \|
241	+	\| Type safety \| PeerConfig.on_missing changed from String to OnMissing enum with serde deserialization. \|
242	+	\| Module docs \| Added //! docs to db.rs, config.rs, peer.rs, types.rs, lib.rs. \|
243	+	\| Error handling \| /api/peer/status fetch failures now logged at debug level instead of silenced. \|
244	+	\| Prune \| prune_old_records now returns 3-tuple including peer heartbeat count. \|
245	+	\| Code extraction \| HealthStatus::icon() method eliminates 3 repeated match blocks. \|
246	+	\| HTTP checks \| Response classification extracted into pure functions for testability. \|
247	+
248	+	## Metrics Over Time
249	+
250	+	\| Audit Date \| LOC \| Rust Files \| Tests \| Tests/KLOC \| Clippy Warnings \| Cold Spots \| Overall \|
251	+	\|------------\|-----\|-----------\|-------\|-----------\|----------------\|------------\|---------\|
252	+	\| 2026-03-10 \| 2,934 \| 15 \| 17 \| 5.8 \| 4 \| 8 \| B+ \|
253	+	\| 2026-03-11 \| 3,039 \| 14 \| 56 \| 18.4 \| 0 \| 3 \| A \|
254	+	\| 2026-03-13 \| ~3K \| ~14 \| 208 \| ~69 \| 0 \| 3 \| A- \|
255	+	\| 2026-03-14 \| ~3.5K \| ~16 \| 238 \| ~68 \| 0 \| 0 \| A \|
256	+	\| 2026-03-16 \| 10.1K \| 23 \| 344 \| ~34 \| 2 \| 0 \| A \|
257	+	\| 2026-03-18 \| 10.1K \| 23 \| 344 \| ~34 \| 0 \| 0 \| A \|

A docs/todo.md +38

		@@ -0,0 +1,38 @@
1	+	# PoM Todo
2	+
3	+	Done: Phases 1-13 complete. Per-test tracking + regression detection + duration drift added. 352 tests (124 lib + 228 integration). Grade: A (Run 10). v0.3.2 (redeployed 2026-03-25). cli/ split into directory module. Dashboard UI shipped. DNS checks fixed for Cloudflare-proxied domains. Stale route/DNS data pruning added.
4	+
5	+	Completed work archived in `docs/archive/pom_todo_done.md`.
6	+
7	+	## Notification Integration
8	+	When MNW unified notification service is built, PoM can push alerts there instead of / in addition to email.
9	+	- [ ] Push PoM alerts to MNW notifications API (health failures, TLS expiry, DNS changes)
10	+	- [ ] Deduplicate alert delivery (email via MNW notification preferences instead of direct Postmark)
11	+
12	+	## Deferred
13	+	- [ ] Multi-location probing beyond hetzner+astra+macbook (third-party VPS for independent perspective)
14	+	- [ ] Webhook alert channel (ntfy.sh, Pushover, generic webhook)
15	+	- [ ] Prometheus/OpenTelemetry metrics export
16	+	- [ ] Peer auto-discovery (mDNS/Tailscale API — currently manual config only)
17	+
18	+	---
19	+
20	+	## Key Paths
21	+	- Config: `src/config.rs`, `pom.toml`
22	+	- Database: `src/db.rs`
23	+	- HTTP API: `src/api.rs`
24	+	- Peer mesh: `src/peer.rs`
25	+	- Health checks: `src/checks/http.rs`
26	+	- Route checks: `src/checks/routes.rs`
27	+	- TLS checks: `src/checks/tls.rs`
28	+	- DNS checks: `src/checks/dns.rs`
29	+	- WHOIS checks: `src/checks/whois.rs`
30	+	- CI output parsing: `src/checks/parse.rs`
31	+	- Test orchestration: `src/checks/ssh.rs`
32	+	- MCP server: `src/tools/mod.rs`, `src/tools/health.rs`, `src/tools/tests.rs`
33	+	- CLI: `src/main.rs`, `src/cli/` (mod.rs, serve.rs, status.rs, incident.rs, tasks/)
34	+	- Types: `src/types.rs`
35	+	- Integration tests: `tests/integration.rs`
36	+	- Deploy: `deploy/` (deploy.sh, pom-hetzner.toml, pom-astra.toml, pom.service)
37	+
38	+	Run 6 + Run 8 audit items all resolved.