# PoM (Peace of Mind) -- Audit Review **Last audited:** 2026-03-28 (tenth audit, Run 12 cross-project) **Previous audit:** 2026-03-18 (eighth audit, Run 9 cross-project) ## Overall Grade: A Run 12 cross-project audit. 359 tests (222 unit + 8 cli + 129 integration). 0 clippy warnings. v0.3.2. Grade stable at A. CORS monitoring added. DNS/route stale data fix deployed. New dep advisories: aws-lc-sys (HIGH 7.4), rustls-webpki. ## Scorecard | Dimension | Grade | Notes | |-----------|:-----:|-------| | Code Quality | A | Zero clippy warnings. Clean error handling with typed PomError enum (thiserror). No unnecessary complexity. | | Architecture | A | Well-structured single crate. main.rs is thin (130 LOC), CLI handlers extracted to cli.rs. Module boundaries clean and purposeful. | | Testing | A | 359 tests (222 unit + 8 cli + 129 integration) for ~10K LOC = ~36 tests/KLOC. Coverage comprehensive — DB, API, MCP tools, parsing, peer state machine, check_health (mock servers), check_tls (self-signed cert), rate limiter, CORS monitoring all tested. | | Security | A | All SQL parameterized. SSH hardened with BatchMode + `--` option separator. Bearer token auth on API + peer mesh. Rate limiting (60 req/min). HTTP response size limits (10MB). Shell-escape validation on SSH filter. | | Performance | A | WAL mode, proper indexes, async throughout. Appropriate for monitoring workload. | | Documentation | A | Module-level `//!` docs on all modules. All struct fields documented. description.md rewritten, architecture.md created (191 lines), README created (62 lines). Tool descriptions are excellent. | | Dependencies | A | 17 direct dependencies, all justified. No unused deps. rustls-tls avoids OpenSSL. Semver ranges. tokio-util for CancellationToken. | | Type Safety | A | Domain enums (HealthStatus, PeerStatus, OnMissing) with proper serde. No stringly-typed fields in application logic. | | Observability | A | 57 `#[instrument(skip_all)]` annotations across all modules. Debug-level logging for non-critical peer status fetch failures. Watchdog logging for silent task panics. | | Concurrency | A | Clean RwLock discipline. Lock-then-release-then-DB pattern in peer module. CancellationToken for cooperative shutdown. | | Resilience | A | CancellationToken + `tokio::select!` in all task loops. `with_graceful_shutdown` on API server. 5s grace period. Watchdog for task panic detection. Grace periods for peer detection. Error logging without panics. | | API Consistency | A | Consistent JSON shapes. Proper HTTP status codes. Public `/api/health` endpoint. MCP tool descriptions well-written. | | Migration Safety | A- | Numbered migration system (schema_version table). `IF NOT EXISTS` guards on all DDL. 5 migrations tracked. | | Codebase Size | A | ~3,500 LOC for a full monitoring tool with CLI, MCP server, HTTP API, SQLite persistence, SSH test runner, peer mesh, and self-monitoring. Extremely efficient. | ## Module Heatmap | Module | Code | Arch | Test | Security | Perf | Docs | Type Safety | Concurrency | Resilience | |--------|:----:|:----:|:----:|:--------:|:----:|:----:|:-----------:|:-----------:|:----------:| | main | A | A | A | A | A | A | A | A | A | | cli | A | A | A | A | A | A | A | A | A | | peer | A | A | A | A | A | A | A | A | A | | db | A | A | A | A+ | A | A | A- | A | A | | api | A | A | A | A | A | A | A | A | A | | config | A | A | A | A | A | A | A | - | - | | tools | A | A | A | A | A | A | A | - | A | | checks | A | A | A | A | A | A | A | - | A | | display | A | A | A | - | - | A | A | - | - | | types | A | A | A | - | - | A | A+ | - | - | | alerts | A | A | A | A | A | A | A | - | A | **Heatmap changes (2026-03-14 remediation):** - main: Arch B+ -> A (split into main.rs 130 LOC + cli.rs), Test B -> A (display.rs has 27+ tests), Resilience A- -> A (CancellationToken + graceful shutdown) - api: Security A -> A (rate limiting added), Resilience A -> A (`with_graceful_shutdown`) - cli (new): All A — extracted command handlers with CancellationToken shutdown orchestration - display (new): All A — formatting logic with 27+ unit tests - alerts (new): All A — extracted for visibility in heatmap ### Cold Spots None remaining. All previous cold spots resolved: - main.rs split into main.rs (130 LOC) + cli.rs (command handlers) + display.rs (formatting with 27+ tests) - CLI display formatting extracted and tested via display.rs - All modules documented with `//!` docs ### Cold Spot Remediation History (First Audit) All 8 cold spots from the first audit were remediated before the second audit: - ~~**config.rs (Testing): C**~~ -> **A** -- 4 config parsing tests (full parse, defaults, on_missing default, hostname fallback). `on_missing` now uses `OnMissing` enum with serde. - ~~**api.rs (Testing): C**~~ -> **A** -- 5 API endpoint integration tests (status, target 404, peer/info, peer disabled, mesh). - ~~**api.rs (Concurrency): B**~~ -> **A** -- Lock decoupled from DB queries in `peer_status` and `mesh_view`. - ~~**peer.rs (Testing): B-**~~ -> **A** -- 9 heartbeat state machine tests (grace transitions, recovery, UUID first-contact, DB recording, identity, on_missing deserialization). - ~~**peer.rs (Concurrency): B**~~ -> **A** -- Lock decoupled from DB writes in both heartbeat handlers. - ~~**db.rs (Performance): B-**~~ -> **A** -- 4 indexes added to init_schema. - ~~**checks/http.rs (Testing): C**~~ -> **A** -- Response classification extracted into pure functions. 8 unit tests covering all status paths. - ~~**main.rs (Testing): C**~~ -> **B** -- Status icon logic extracted to `HealthStatus::icon()`. 4 types.rs tests. main.rs is thin orchestration over well-tested modules. ## Strengths - **Lean architecture.** ~3,500 LOC delivers health checks, test orchestration, MCP server, HTTP API, peer mesh, CLI, alerting, TLS monitoring, route checks, and self-monitoring. No over-engineering. - **Clean error handling.** `?` propagation with typed PomError enum (thiserror, 8 variants). No panics in production paths. - **Correct security posture.** All SQL parameterized. Bearer token auth on API + peer mesh. Rate limiting (60 req/min). SSH hardened with `BatchMode=yes` + `--` separator. Shell-escape validation. - **Backward-compatible design.** All config sections are `#[serde(default)]`. Existing configs work unchanged. - **Good MCP integration.** 8 MCP tools with clear descriptions. Tools strip `raw_output` from test history to avoid flooding context. - **Comprehensive test suite.** 238 tests (152 unit + 86 integration) at ~68/KLOC covering DB, API, MCP tools, parsing, peer state machine, config, health classification, mock server health checks, TLS cert validation, and rate limiting. - **Excellent lock discipline.** Peer module collects data under lock, releases lock, then performs DB writes. No lock contention under load. - **Robust shutdown.** CancellationToken with `tokio::select!` in all task loops. Graceful API server shutdown. Watchdog for silent panics. 5s grace period. ## Weaknesses All previously identified weaknesses have been resolved: - ~~**main.rs size**~~ — Split into main.rs (130 LOC) + cli.rs + display.rs - ~~**No typed error enum**~~ — PomError enum with thiserror (8 variants) - ~~**Missing module docs**~~ — All modules have `//!` docs - ~~**No migration versioning**~~ — schema_version table with numbered migrations (v1-v5) ## Mandatory Surprise **Surprise: The peer mesh fetches `/api/peer/status` as a SECOND HTTP call after `/api/peer/info` in every heartbeat loop iteration, completely independently.** In `peer.rs:196-213`, after the heartbeat success/failure handling completes, there is a second HTTP GET to `/api/peer/status` that caches the full status data for mesh aggregation. This means every heartbeat cycle makes TWO HTTP requests to each peer. **Verdict: Actually fine.** This is intentional and well-designed. The `/api/peer/info` call is the heartbeat probe (lightweight, just identity data), while `/api/peer/status` fetches the full target/peer view for mesh aggregation. Separating them means: 1. The heartbeat latency measurement is accurate (not inflated by the larger status response). 2. A failure to fetch status data does not affect heartbeat state transitions (it logs at debug level and moves on). 3. The mesh view (`/api/mesh`) can show each peer's own view of their targets, enabling cross-instance monitoring. The second request's failure is handled gracefully with `tracing::debug!` -- no state corruption, no retries, no panics. Good design. ## Action Items Filed in `docs/mnw/pom/todo.md`. ### Third Audit (2026-03-13, pre-launch skeptical lens) -- 7 items, all resolved 1. ~~**[CRITICAL]** Remove Postmark API token from deployment configs~~ — Done — moved to POM_POSTMARK_TOKEN env var 2. ~~**[MEDIUM]** Add API authentication~~ — Done — bearer token middleware, 5 tests 3. ~~**[MEDIUM]** Add peer mesh authentication~~ — Done — per-peer token field, heartbeat sends Authorization header 4. ~~**[LOW]** Add integration tests for core functions~~ — Done — 9 new integration tests (check_health mock servers, check_tls self-signed cert) 5. ~~**[LOW]** Add self-monitoring~~ — Done — `/api/health` endpoint (public, no auth, returns status + version) 6. ~~**[LOW]** Shell-escape SSH test filter parameter~~ — Done (alphanumeric + `_:-` allowlist) 7. ~~**[LOW]** Reject peer UUID mismatch~~ — Done (tracing::error, skip status update, increment failures) ### Security Deep Dive (2026-03-13) — Complete (2/2) 8. ~~**[MEDIUM]** SSH `--` separator~~ — Done (`checks/ssh.rs`: `.arg("--")` between hostname and command prevents option injection via filter strings) 9. ~~**[LOW]** HTTP response size limit~~ — Done (`checks/http.rs`: `MAX_RESPONSE_BYTES` 10MB constant, content-length header check + post-read size verification) ### Second Audit (2026-03-11) -- 5 items 1. Extract CLI command handlers from main.rs into cli/ submodule (587 LOC, 7 commands) 2. Add typed PomError enum with thiserror (currently Box, diverges from cross-project standard) 3. Add .DS_Store and IDE dirs (.idea/, .vscode/) to .gitignore 4. Add module-level //! docs to main.rs and config.rs 5. Add migration versioning (PRAGMA user_version) before next schema change ### First Audit (2026-03-10) -- 11 items, all resolved 1. ~~Add DB indexes~~ -- Done (4 indexes added to init_schema) 2. ~~Fix clippy warnings~~ -- Done (4 collapsible_if warnings fixed with Rust 2024 let chains) 3. ~~Decouple mesh lock from DB writes in heartbeat handlers~~ -- Done 4. ~~Decouple mesh read lock from DB queries in API handlers~~ -- Done 5. ~~Log /api/peer/status fetch failures~~ -- Done (tracing::debug) 6. ~~Include peer heartbeat prune count~~ -- Done (3-tuple return) 7. ~~Add module docs~~ -- Done (//! docs on db.rs, config.rs, peer.rs, types.rs, lib.rs) 8. ~~Change PeerConfig.on_missing to OnMissing enum~~ -- Done 9. ~~Add API endpoint tests~~ -- Done (5 tests) 10. ~~Add heartbeat state machine tests~~ -- Done (9 tests) 11. ~~Add config parsing tests~~ -- Done (4 tests) --- See [audit_history.md](./audit_history.md) for full chronological audit log.