| 1 |
# PoM (Peace of Mind) -- Audit Review |
| 2 |
|
| 3 |
**Last audited:** 2026-03-28 (tenth audit, Run 12 cross-project) |
| 4 |
**Previous audit:** 2026-03-18 (eighth audit, Run 9 cross-project) |
| 5 |
|
| 6 |
## Overall Grade: A |
| 7 |
|
| 8 |
Run 12 cross-project audit. 359 tests (222 unit + 8 cli + 129 integration). 0 clippy warnings. v0.3.2. Grade stable at A. CORS monitoring added. DNS/route stale data fix deployed. New dep advisories: aws-lc-sys (HIGH 7.4), rustls-webpki. |
| 9 |
|
| 10 |
## Scorecard |
| 11 |
|
| 12 |
|
| 13 |
|
| 14 |
| Code Quality | A | Zero clippy warnings. Clean error handling with typed PomError enum (thiserror). No unnecessary complexity. | |
| 15 |
| Architecture | A | Well-structured single crate. main.rs is thin (130 LOC), CLI handlers extracted to cli.rs. Module boundaries clean and purposeful. | |
| 16 |
| Testing | A | 359 tests (222 unit + 8 cli + 129 integration) for ~10K LOC = ~36 tests/KLOC. Coverage comprehensive — DB, API, MCP tools, parsing, peer state machine, check_health (mock servers), check_tls (self-signed cert), rate limiter, CORS monitoring all tested. | |
| 17 |
| Security | A | All SQL parameterized. SSH hardened with BatchMode + `--` option separator. Bearer token auth on API + peer mesh. Rate limiting (60 req/min). HTTP response size limits (10MB). Shell-escape validation on SSH filter. | |
| 18 |
| Performance | A | WAL mode, proper indexes, async throughout. Appropriate for monitoring workload. | |
| 19 |
| Documentation | A | Module-level `//!` docs on all modules. All struct fields documented. description.md rewritten, architecture.md created (191 lines), README created (62 lines). Tool descriptions are excellent. | |
| 20 |
| Dependencies | A | 17 direct dependencies, all justified. No unused deps. rustls-tls avoids OpenSSL. Semver ranges. tokio-util for CancellationToken. | |
| 21 |
| Type Safety | A | Domain enums (HealthStatus, PeerStatus, OnMissing) with proper serde. No stringly-typed fields in application logic. | |
| 22 |
| Observability | A | 57 `#[instrument(skip_all)]` annotations across all modules. Debug-level logging for non-critical peer status fetch failures. Watchdog logging for silent task panics. | |
| 23 |
| Concurrency | A | Clean RwLock discipline. Lock-then-release-then-DB pattern in peer module. CancellationToken for cooperative shutdown. | |
| 24 |
| Resilience | A | CancellationToken + `tokio::select!` in all task loops. `with_graceful_shutdown` on API server. 5s grace period. Watchdog for task panic detection. Grace periods for peer detection. Error logging without panics. | |
| 25 |
| API Consistency | A | Consistent JSON shapes. Proper HTTP status codes. Public `/api/health` endpoint. MCP tool descriptions well-written. | |
| 26 |
| Migration Safety | A- | Numbered migration system (schema_version table). `IF NOT EXISTS` guards on all DDL. 5 migrations tracked. | |
| 27 |
| Codebase Size | A | ~3,500 LOC for a full monitoring tool with CLI, MCP server, HTTP API, SQLite persistence, SSH test runner, peer mesh, and self-monitoring. Extremely efficient. | |
| 28 |
|
| 29 |
## Module Heatmap |
| 30 |
|
| 31 |
|
| 32 |
|
| 33 |
| main | A | A | A | A | A | A | A | A | A | |
| 34 |
| cli | A | A | A | A | A | A | A | A | A | |
| 35 |
| peer | A | A | A | A | A | A | A | A | A | |
| 36 |
| db | A | A | A | A+ | A | A | A- | A | A | |
| 37 |
| api | A | A | A | A | A | A | A | A | A | |
| 38 |
| config | A | A | A | A | A | A | A | - | - | |
| 39 |
| tools | A | A | A | A | A | A | A | - | A | |
| 40 |
| checks | A | A | A | A | A | A | A | - | A | |
| 41 |
| display | A | A | A | - | - | A | A | - | - | |
| 42 |
| types | A | A | A | - | - | A | A+ | - | - | |
| 43 |
| alerts | A | A | A | A | A | A | A | - | A | |
| 44 |
|
| 45 |
**Heatmap changes (2026-03-14 remediation):** |
| 46 |
- main: Arch B+ -> A (split into main.rs 130 LOC + cli.rs), Test B -> A (display.rs has 27+ tests), Resilience A- -> A (CancellationToken + graceful shutdown) |
| 47 |
- api: Security A -> A (rate limiting added), Resilience A -> A (`with_graceful_shutdown`) |
| 48 |
- cli (new): All A — extracted command handlers with CancellationToken shutdown orchestration |
| 49 |
- display (new): All A — formatting logic with 27+ unit tests |
| 50 |
- alerts (new): All A — extracted for visibility in heatmap |
| 51 |
|
| 52 |
### Cold Spots |
| 53 |
|
| 54 |
None remaining. All previous cold spots resolved: |
| 55 |
- main.rs split into main.rs (130 LOC) + cli.rs (command handlers) + display.rs (formatting with 27+ tests) |
| 56 |
- CLI display formatting extracted and tested via display.rs |
| 57 |
- All modules documented with `//!` docs |
| 58 |
|
| 59 |
### Cold Spot Remediation History (First Audit) |
| 60 |
|
| 61 |
All 8 cold spots from the first audit were remediated before the second audit: |
| 62 |
|
| 63 |
- ~~**config.rs (Testing): C**~~ -> **A** -- 4 config parsing tests (full parse, defaults, on_missing default, hostname fallback). `on_missing` now uses `OnMissing` enum with serde. |
| 64 |
- ~~**api.rs (Testing): C**~~ -> **A** -- 5 API endpoint integration tests (status, target 404, peer/info, peer disabled, mesh). |
| 65 |
- ~~**api.rs (Concurrency): B**~~ -> **A** -- Lock decoupled from DB queries in `peer_status` and `mesh_view`. |
| 66 |
- ~~**peer.rs (Testing): B-**~~ -> **A** -- 9 heartbeat state machine tests (grace transitions, recovery, UUID first-contact, DB recording, identity, on_missing deserialization). |
| 67 |
- ~~**peer.rs (Concurrency): B**~~ -> **A** -- Lock decoupled from DB writes in both heartbeat handlers. |
| 68 |
- ~~**db.rs (Performance): B-**~~ -> **A** -- 4 indexes added to init_schema. |
| 69 |
- ~~**checks/http.rs (Testing): C**~~ -> **A** -- Response classification extracted into pure functions. 8 unit tests covering all status paths. |
| 70 |
- ~~**main.rs (Testing): C**~~ -> **B** -- Status icon logic extracted to `HealthStatus::icon()`. 4 types.rs tests. main.rs is thin orchestration over well-tested modules. |
| 71 |
|
| 72 |
## Strengths |
| 73 |
|
| 74 |
- **Lean architecture.** ~3,500 LOC delivers health checks, test orchestration, MCP server, HTTP API, peer mesh, CLI, alerting, TLS monitoring, route checks, and self-monitoring. No over-engineering. |
| 75 |
- **Clean error handling.** `?` propagation with typed PomError enum (thiserror, 8 variants). No panics in production paths. |
| 76 |
- **Correct security posture.** All SQL parameterized. Bearer token auth on API + peer mesh. Rate limiting (60 req/min). SSH hardened with `BatchMode=yes` + `--` separator. Shell-escape validation. |
| 77 |
- **Backward-compatible design.** All config sections are `#[serde(default)]`. Existing configs work unchanged. |
| 78 |
- **Good MCP integration.** 8 MCP tools with clear descriptions. Tools strip `raw_output` from test history to avoid flooding context. |
| 79 |
- **Comprehensive test suite.** 238 tests (152 unit + 86 integration) at ~68/KLOC covering DB, API, MCP tools, parsing, peer state machine, config, health classification, mock server health checks, TLS cert validation, and rate limiting. |
| 80 |
- **Excellent lock discipline.** Peer module collects data under lock, releases lock, then performs DB writes. No lock contention under load. |
| 81 |
- **Robust shutdown.** CancellationToken with `tokio::select!` in all task loops. Graceful API server shutdown. Watchdog for silent panics. 5s grace period. |
| 82 |
|
| 83 |
## Weaknesses |
| 84 |
|
| 85 |
All previously identified weaknesses have been resolved: |
| 86 |
- ~~**main.rs size**~~ — Split into main.rs (130 LOC) + cli.rs + display.rs |
| 87 |
- ~~**No typed error enum**~~ — PomError enum with thiserror (8 variants) |
| 88 |
- ~~**Missing module docs**~~ — All modules have `//!` docs |
| 89 |
- ~~**No migration versioning**~~ — schema_version table with numbered migrations (v1-v5) |
| 90 |
|
| 91 |
## Mandatory Surprise |
| 92 |
|
| 93 |
**Surprise: The peer mesh fetches `/api/peer/status` as a SECOND HTTP call after `/api/peer/info` in every heartbeat loop iteration, completely independently.** |
| 94 |
|
| 95 |
In `peer.rs:196-213`, after the heartbeat success/failure handling completes, there is a second HTTP GET to `/api/peer/status` that caches the full status data for mesh aggregation. This means every heartbeat cycle makes TWO HTTP requests to each peer. |
| 96 |
|
| 97 |
**Verdict: Actually fine.** |
| 98 |
|
| 99 |
This is intentional and well-designed. The `/api/peer/info` call is the heartbeat probe (lightweight, just identity data), while `/api/peer/status` fetches the full target/peer view for mesh aggregation. Separating them means: |
| 100 |
1. The heartbeat latency measurement is accurate (not inflated by the larger status response). |
| 101 |
2. A failure to fetch status data does not affect heartbeat state transitions (it logs at debug level and moves on). |
| 102 |
3. The mesh view (`/api/mesh`) can show each peer's own view of their targets, enabling cross-instance monitoring. |
| 103 |
|
| 104 |
The second request's failure is handled gracefully with `tracing::debug!` -- no state corruption, no retries, no panics. Good design. |
| 105 |
|
| 106 |
## Action Items |
| 107 |
|
| 108 |
Filed in `docs/mnw/pom/todo.md`. |
| 109 |
|
| 110 |
### Third Audit (2026-03-13, pre-launch skeptical lens) -- 7 items, all resolved |
| 111 |
|
| 112 |
1. ~~**[CRITICAL]** Remove Postmark API token from deployment configs~~ — Done — moved to POM_POSTMARK_TOKEN env var |
| 113 |
2. ~~**[MEDIUM]** Add API authentication~~ — Done — bearer token middleware, 5 tests |
| 114 |
3. ~~**[MEDIUM]** Add peer mesh authentication~~ — Done — per-peer token field, heartbeat sends Authorization header |
| 115 |
4. ~~**[LOW]** Add integration tests for core functions~~ — Done — 9 new integration tests (check_health mock servers, check_tls self-signed cert) |
| 116 |
5. ~~**[LOW]** Add self-monitoring~~ — Done — `/api/health` endpoint (public, no auth, returns status + version) |
| 117 |
6. ~~**[LOW]** Shell-escape SSH test filter parameter~~ — Done (alphanumeric + `_:-` allowlist) |
| 118 |
7. ~~**[LOW]** Reject peer UUID mismatch~~ — Done (tracing::error, skip status update, increment failures) |
| 119 |
|
| 120 |
### Security Deep Dive (2026-03-13) — Complete (2/2) |
| 121 |
|
| 122 |
8. ~~**[MEDIUM]** SSH `--` separator~~ — Done (`checks/ssh.rs`: `.arg("--")` between hostname and command prevents option injection via filter strings) |
| 123 |
9. ~~**[LOW]** HTTP response size limit~~ — Done (`checks/http.rs`: `MAX_RESPONSE_BYTES` 10MB constant, content-length header check + post-read size verification) |
| 124 |
|
| 125 |
### Second Audit (2026-03-11) -- 5 items |
| 126 |
|
| 127 |
1. Extract CLI command handlers from main.rs into cli/ submodule (587 LOC, 7 commands) |
| 128 |
2. Add typed PomError enum with thiserror (currently Box<dyn Error>, diverges from cross-project standard) |
| 129 |
3. Add .DS_Store and IDE dirs (.idea/, .vscode/) to .gitignore |
| 130 |
4. Add module-level //! docs to main.rs and config.rs |
| 131 |
5. Add migration versioning (PRAGMA user_version) before next schema change |
| 132 |
|
| 133 |
### First Audit (2026-03-10) -- 11 items, all resolved |
| 134 |
|
| 135 |
1. ~~Add DB indexes~~ -- Done (4 indexes added to init_schema) |
| 136 |
2. ~~Fix clippy warnings~~ -- Done (4 collapsible_if warnings fixed with Rust 2024 let chains) |
| 137 |
3. ~~Decouple mesh lock from DB writes in heartbeat handlers~~ -- Done |
| 138 |
4. ~~Decouple mesh read lock from DB queries in API handlers~~ -- Done |
| 139 |
5. ~~Log /api/peer/status fetch failures~~ -- Done (tracing::debug) |
| 140 |
6. ~~Include peer heartbeat prune count~~ -- Done (3-tuple return) |
| 141 |
7. ~~Add module docs~~ -- Done (//! docs on db.rs, config.rs, peer.rs, types.rs, lib.rs) |
| 142 |
8. ~~Change PeerConfig.on_missing to OnMissing enum~~ -- Done |
| 143 |
9. ~~Add API endpoint tests~~ -- Done (5 tests) |
| 144 |
10. ~~Add heartbeat state machine tests~~ -- Done (9 tests) |
| 145 |
11. ~~Add config parsing tests~~ -- Done (4 tests) |
| 146 |
|
| 147 |
--- |
| 148 |
|
| 149 |
See [audit_history.md](./audit_history.md) for full chronological audit log. |
| 150 |
|