Skip to main content

max / pom

10.9 KB · 150 lines History Blame Raw
1 # PoM (Peace of Mind) -- Audit Review
2
3 **Last audited:** 2026-03-28 (tenth audit, Run 12 cross-project)
4 **Previous audit:** 2026-03-18 (eighth audit, Run 9 cross-project)
5
6 ## Overall Grade: A
7
8 Run 12 cross-project audit. 359 tests (222 unit + 8 cli + 129 integration). 0 clippy warnings. v0.3.2. Grade stable at A. CORS monitoring added. DNS/route stale data fix deployed. New dep advisories: aws-lc-sys (HIGH 7.4), rustls-webpki.
9
10 ## Scorecard
11
12 | Dimension | Grade | Notes |
13 |-----------|:-----:|-------|
14 | Code Quality | A | Zero clippy warnings. Clean error handling with typed PomError enum (thiserror). No unnecessary complexity. |
15 | Architecture | A | Well-structured single crate. main.rs is thin (130 LOC), CLI handlers extracted to cli.rs. Module boundaries clean and purposeful. |
16 | Testing | A | 359 tests (222 unit + 8 cli + 129 integration) for ~10K LOC = ~36 tests/KLOC. Coverage comprehensive — DB, API, MCP tools, parsing, peer state machine, check_health (mock servers), check_tls (self-signed cert), rate limiter, CORS monitoring all tested. |
17 | Security | A | All SQL parameterized. SSH hardened with BatchMode + `--` option separator. Bearer token auth on API + peer mesh. Rate limiting (60 req/min). HTTP response size limits (10MB). Shell-escape validation on SSH filter. |
18 | Performance | A | WAL mode, proper indexes, async throughout. Appropriate for monitoring workload. |
19 | Documentation | A | Module-level `//!` docs on all modules. All struct fields documented. description.md rewritten, architecture.md created (191 lines), README created (62 lines). Tool descriptions are excellent. |
20 | Dependencies | A | 17 direct dependencies, all justified. No unused deps. rustls-tls avoids OpenSSL. Semver ranges. tokio-util for CancellationToken. |
21 | Type Safety | A | Domain enums (HealthStatus, PeerStatus, OnMissing) with proper serde. No stringly-typed fields in application logic. |
22 | Observability | A | 57 `#[instrument(skip_all)]` annotations across all modules. Debug-level logging for non-critical peer status fetch failures. Watchdog logging for silent task panics. |
23 | Concurrency | A | Clean RwLock discipline. Lock-then-release-then-DB pattern in peer module. CancellationToken for cooperative shutdown. |
24 | Resilience | A | CancellationToken + `tokio::select!` in all task loops. `with_graceful_shutdown` on API server. 5s grace period. Watchdog for task panic detection. Grace periods for peer detection. Error logging without panics. |
25 | API Consistency | A | Consistent JSON shapes. Proper HTTP status codes. Public `/api/health` endpoint. MCP tool descriptions well-written. |
26 | Migration Safety | A- | Numbered migration system (schema_version table). `IF NOT EXISTS` guards on all DDL. 5 migrations tracked. |
27 | Codebase Size | A | ~3,500 LOC for a full monitoring tool with CLI, MCP server, HTTP API, SQLite persistence, SSH test runner, peer mesh, and self-monitoring. Extremely efficient. |
28
29 ## Module Heatmap
30
31 | Module | Code | Arch | Test | Security | Perf | Docs | Type Safety | Concurrency | Resilience |
32 |--------|:----:|:----:|:----:|:--------:|:----:|:----:|:-----------:|:-----------:|:----------:|
33 | main | A | A | A | A | A | A | A | A | A |
34 | cli | A | A | A | A | A | A | A | A | A |
35 | peer | A | A | A | A | A | A | A | A | A |
36 | db | A | A | A | A+ | A | A | A- | A | A |
37 | api | A | A | A | A | A | A | A | A | A |
38 | config | A | A | A | A | A | A | A | - | - |
39 | tools | A | A | A | A | A | A | A | - | A |
40 | checks | A | A | A | A | A | A | A | - | A |
41 | display | A | A | A | - | - | A | A | - | - |
42 | types | A | A | A | - | - | A | A+ | - | - |
43 | alerts | A | A | A | A | A | A | A | - | A |
44
45 **Heatmap changes (2026-03-14 remediation):**
46 - main: Arch B+ -> A (split into main.rs 130 LOC + cli.rs), Test B -> A (display.rs has 27+ tests), Resilience A- -> A (CancellationToken + graceful shutdown)
47 - api: Security A -> A (rate limiting added), Resilience A -> A (`with_graceful_shutdown`)
48 - cli (new): All A — extracted command handlers with CancellationToken shutdown orchestration
49 - display (new): All A — formatting logic with 27+ unit tests
50 - alerts (new): All A — extracted for visibility in heatmap
51
52 ### Cold Spots
53
54 None remaining. All previous cold spots resolved:
55 - main.rs split into main.rs (130 LOC) + cli.rs (command handlers) + display.rs (formatting with 27+ tests)
56 - CLI display formatting extracted and tested via display.rs
57 - All modules documented with `//!` docs
58
59 ### Cold Spot Remediation History (First Audit)
60
61 All 8 cold spots from the first audit were remediated before the second audit:
62
63 - ~~**config.rs (Testing): C**~~ -> **A** -- 4 config parsing tests (full parse, defaults, on_missing default, hostname fallback). `on_missing` now uses `OnMissing` enum with serde.
64 - ~~**api.rs (Testing): C**~~ -> **A** -- 5 API endpoint integration tests (status, target 404, peer/info, peer disabled, mesh).
65 - ~~**api.rs (Concurrency): B**~~ -> **A** -- Lock decoupled from DB queries in `peer_status` and `mesh_view`.
66 - ~~**peer.rs (Testing): B-**~~ -> **A** -- 9 heartbeat state machine tests (grace transitions, recovery, UUID first-contact, DB recording, identity, on_missing deserialization).
67 - ~~**peer.rs (Concurrency): B**~~ -> **A** -- Lock decoupled from DB writes in both heartbeat handlers.
68 - ~~**db.rs (Performance): B-**~~ -> **A** -- 4 indexes added to init_schema.
69 - ~~**checks/http.rs (Testing): C**~~ -> **A** -- Response classification extracted into pure functions. 8 unit tests covering all status paths.
70 - ~~**main.rs (Testing): C**~~ -> **B** -- Status icon logic extracted to `HealthStatus::icon()`. 4 types.rs tests. main.rs is thin orchestration over well-tested modules.
71
72 ## Strengths
73
74 - **Lean architecture.** ~3,500 LOC delivers health checks, test orchestration, MCP server, HTTP API, peer mesh, CLI, alerting, TLS monitoring, route checks, and self-monitoring. No over-engineering.
75 - **Clean error handling.** `?` propagation with typed PomError enum (thiserror, 8 variants). No panics in production paths.
76 - **Correct security posture.** All SQL parameterized. Bearer token auth on API + peer mesh. Rate limiting (60 req/min). SSH hardened with `BatchMode=yes` + `--` separator. Shell-escape validation.
77 - **Backward-compatible design.** All config sections are `#[serde(default)]`. Existing configs work unchanged.
78 - **Good MCP integration.** 8 MCP tools with clear descriptions. Tools strip `raw_output` from test history to avoid flooding context.
79 - **Comprehensive test suite.** 238 tests (152 unit + 86 integration) at ~68/KLOC covering DB, API, MCP tools, parsing, peer state machine, config, health classification, mock server health checks, TLS cert validation, and rate limiting.
80 - **Excellent lock discipline.** Peer module collects data under lock, releases lock, then performs DB writes. No lock contention under load.
81 - **Robust shutdown.** CancellationToken with `tokio::select!` in all task loops. Graceful API server shutdown. Watchdog for silent panics. 5s grace period.
82
83 ## Weaknesses
84
85 All previously identified weaknesses have been resolved:
86 - ~~**main.rs size**~~ — Split into main.rs (130 LOC) + cli.rs + display.rs
87 - ~~**No typed error enum**~~ — PomError enum with thiserror (8 variants)
88 - ~~**Missing module docs**~~ — All modules have `//!` docs
89 - ~~**No migration versioning**~~ — schema_version table with numbered migrations (v1-v5)
90
91 ## Mandatory Surprise
92
93 **Surprise: The peer mesh fetches `/api/peer/status` as a SECOND HTTP call after `/api/peer/info` in every heartbeat loop iteration, completely independently.**
94
95 In `peer.rs:196-213`, after the heartbeat success/failure handling completes, there is a second HTTP GET to `/api/peer/status` that caches the full status data for mesh aggregation. This means every heartbeat cycle makes TWO HTTP requests to each peer.
96
97 **Verdict: Actually fine.**
98
99 This is intentional and well-designed. The `/api/peer/info` call is the heartbeat probe (lightweight, just identity data), while `/api/peer/status` fetches the full target/peer view for mesh aggregation. Separating them means:
100 1. The heartbeat latency measurement is accurate (not inflated by the larger status response).
101 2. A failure to fetch status data does not affect heartbeat state transitions (it logs at debug level and moves on).
102 3. The mesh view (`/api/mesh`) can show each peer's own view of their targets, enabling cross-instance monitoring.
103
104 The second request's failure is handled gracefully with `tracing::debug!` -- no state corruption, no retries, no panics. Good design.
105
106 ## Action Items
107
108 Filed in `docs/mnw/pom/todo.md`.
109
110 ### Third Audit (2026-03-13, pre-launch skeptical lens) -- 7 items, all resolved
111
112 1. ~~**[CRITICAL]** Remove Postmark API token from deployment configs~~ — Done — moved to POM_POSTMARK_TOKEN env var
113 2. ~~**[MEDIUM]** Add API authentication~~ — Done — bearer token middleware, 5 tests
114 3. ~~**[MEDIUM]** Add peer mesh authentication~~ — Done — per-peer token field, heartbeat sends Authorization header
115 4. ~~**[LOW]** Add integration tests for core functions~~ — Done — 9 new integration tests (check_health mock servers, check_tls self-signed cert)
116 5. ~~**[LOW]** Add self-monitoring~~ — Done — `/api/health` endpoint (public, no auth, returns status + version)
117 6. ~~**[LOW]** Shell-escape SSH test filter parameter~~ — Done (alphanumeric + `_:-` allowlist)
118 7. ~~**[LOW]** Reject peer UUID mismatch~~ — Done (tracing::error, skip status update, increment failures)
119
120 ### Security Deep Dive (2026-03-13) — Complete (2/2)
121
122 8. ~~**[MEDIUM]** SSH `--` separator~~ — Done (`checks/ssh.rs`: `.arg("--")` between hostname and command prevents option injection via filter strings)
123 9. ~~**[LOW]** HTTP response size limit~~ — Done (`checks/http.rs`: `MAX_RESPONSE_BYTES` 10MB constant, content-length header check + post-read size verification)
124
125 ### Second Audit (2026-03-11) -- 5 items
126
127 1. Extract CLI command handlers from main.rs into cli/ submodule (587 LOC, 7 commands)
128 2. Add typed PomError enum with thiserror (currently Box<dyn Error>, diverges from cross-project standard)
129 3. Add .DS_Store and IDE dirs (.idea/, .vscode/) to .gitignore
130 4. Add module-level //! docs to main.rs and config.rs
131 5. Add migration versioning (PRAGMA user_version) before next schema change
132
133 ### First Audit (2026-03-10) -- 11 items, all resolved
134
135 1. ~~Add DB indexes~~ -- Done (4 indexes added to init_schema)
136 2. ~~Fix clippy warnings~~ -- Done (4 collapsible_if warnings fixed with Rust 2024 let chains)
137 3. ~~Decouple mesh lock from DB writes in heartbeat handlers~~ -- Done
138 4. ~~Decouple mesh read lock from DB queries in API handlers~~ -- Done
139 5. ~~Log /api/peer/status fetch failures~~ -- Done (tracing::debug)
140 6. ~~Include peer heartbeat prune count~~ -- Done (3-tuple return)
141 7. ~~Add module docs~~ -- Done (//! docs on db.rs, config.rs, peer.rs, types.rs, lib.rs)
142 8. ~~Change PeerConfig.on_missing to OnMissing enum~~ -- Done
143 9. ~~Add API endpoint tests~~ -- Done (5 tests)
144 10. ~~Add heartbeat state machine tests~~ -- Done (9 tests)
145 11. ~~Add config parsing tests~~ -- Done (4 tests)
146
147 ---
148
149 See [audit_history.md]./audit_history.md for full chronological audit log.
150