max / pom

Move project docs into repo for ~/Code directory layout

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Author: Max J. <87768334+MaxJMath@users.noreply.github.com> · 2026-03-30 02:26 UTC
Commit: b10b7fd181f923ba01a5a90205b151e8381074fb
Parent: 2572bc2
3 files changed, +486 insertions, -0 deletions
@@ -0,0 +1,191 @@
1 + # PoM Architecture
2 +
3 + ## System Overview
4 +
5 + PoM runs in three modes, selected by how it is invoked:
6 +
7 + 1. **CLI mode** (`pom health`, `pom test`, `pom status`, etc.) -- runs a single command and exits. Useful for ad-hoc checks and cron jobs.
8 + 2. **Serve mode** (`pom serve`) -- long-running daemon that spawns per-target health check loops, TLS check loops, peer heartbeat tasks, a daily prune task, and an HTTP API server. This is the production deployment mode.
9 + 3. **MCP server mode** (bare `pom` with no subcommand) -- launches an MCP server over stdio for Claude integration. Exposes health checks, test execution, history queries, and mesh status as MCP tools.
10 +
11 + All three modes load the same TOML config and connect to the same SQLite database.
12 +
13 + ## Module Map
14 +
15 + | Module | File | Role |
16 + |--------|------|------|
17 + | `main` | `src/main.rs` | Entry point -- parses CLI args, dispatches to CLI handler or MCP server |
18 + | `cli` | `src/cli.rs` | CLI command handlers (health, test, status, history, prune, serve, mesh) |
19 + | `config` | `src/config.rs` | TOML config loading, types for targets/peers/alerts/serve settings |
20 + | `types` | `src/types.rs` | Shared domain types: HealthSnapshot, TestRun, TlsStatus, LatencyStats, TestStaleness |
21 + | `db` | `src/db.rs` | SQLite schema (versioned migrations), all queries for health/tests/alerts/TLS/incidents/peers |
22 + | `api` | `src/api.rs` | Axum HTTP API: status, trends, peer info, mesh view, bearer token auth middleware |
23 + | `alerts` | `src/alerts.rs` | Alerter struct -- sends emails via Postmark on status transitions, with cooldown tracking |
24 + | `peer` | `src/peer.rs` | Peer mesh: identity management, heartbeat loops, grace period state machine, mesh state |
25 + | `display` | `src/display.rs` | Pure formatting functions for CLI output (no I/O) |
26 + | `error` | `src/error.rs` | Typed error enum (PomError) wrapping IO, DB, HTTP, JSON, config errors |
27 + | `checks::http` | `src/checks/http.rs` | HTTP health checker, response classification, expectation validation, latency drift detection, test staleness computation |
28 + | `checks::tls` | `src/checks/tls.rs` | TLS certificate prober -- TCP connect, TLS handshake, x509 leaf cert parsing |
29 + | `checks::ssh` | `src/checks/ssh.rs` | Remote test runner -- executes commands over SSH, captures output |
30 + | `checks::parse` | `src/checks/parse.rs` | CI output parser -- extracts PASS/FAIL steps and cargo test counts |
31 + | `tools` | `src/tools/mod.rs` | MCP server definition (PomServer), tool registration via rmcp |
32 + | `tools::health` | `src/tools/health.rs` | MCP tool implementations for health checks, history, targets, mesh status |
33 + | `tools::tests` | `src/tools/tests.rs` | MCP tool implementations for test execution, history, raw output |
34 +
35 + ## Data Flow
36 +
37 + ```
38 + pom.toml (config)
39 + |
40 + v
41 + Config::load() --> targets, peers, alerts, serve settings
42 + |
43 + v
44 + db::connect() --> SQLite pool (WAL mode, versioned migrations)
45 + |
46 + +---> [CLI mode] single command --> check/query --> display --> exit
47 + |
48 + +---> [Serve mode]
49 + | |
50 + | +--> per-target health check loop (configurable interval)
51 + | | check_health() --> insert_health_check()
52 + | | compare with previous --> alert on transition
53 + | | detect latency drift --> alert if sustained
54 + | | open/close incidents on status changes
55 + | |
56 + | +--> per-target TLS check loop (hourly default)
57 + | | check_tls() --> insert_tls_check()
58 + | | alert on expiry warning or error
59 + | |
60 + | +--> per-peer heartbeat loop (60s default)
61 + | | GET /api/peer/info --> update mesh state
62 + | | GET /api/peer/status --> cache for mesh view
63 + | | grace period state machine on failure
64 + | |
65 + | +--> daily prune task (configurable retention)
66 + | |
67 + | +--> HTTP API server (Axum, configurable bind address)
68 + |
69 + +---> [MCP mode] stdio transport --> tool calls --> same check/query logic
70 + ```
71 +
72 + ## Peer Mesh Design
73 +
74 + Each PoM instance has a persistent UUID (stored at `~/.local/share/pom/instance_id`). Peers are configured by name with an address, an `on_missing` policy, and an optional grace count.
75 +
76 + ### Heartbeat State Machine
77 +
78 + ```
79 + Unknown --> (success) --> Online
80 + Unknown --> (failure) --> GracePeriod --> (failures >= grace_count) --> Missing
81 + Online --> (failure) --> GracePeriod --> (failures >= grace_count) --> Missing
82 + Missing --> (success) --> Online (triggers recovery alert)
83 + ```
84 +
85 + Each heartbeat cycle:
86 + 1. GET `/api/peer/info` -- verifies identity, measures latency
87 + 2. On first contact, store the peer's UUID in `peer_identities` table
88 + 3. On subsequent contacts, reject UUID mismatches (prevents impersonation)
89 + 4. GET `/api/peer/status` -- caches the peer's full status for mesh aggregation
90 + 5. Record heartbeat result in `peer_heartbeats` table
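The transitions in the diagram can be sketched as a small state machine. This is an illustrative sketch, not the code in `src/peer.rs`: `PeerTracker` and `record` are hypothetical names, and two behaviours are assumptions the diagram does not spell out (a success during the grace period returns the peer to Online, and a `grace_count` of 1 moves straight to Missing on the first failure).

```rust
/// Peer states from the heartbeat diagram.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum PeerState {
    Unknown,
    Online,
    GracePeriod,
    Missing,
}

/// Tracks one peer's state plus its run of consecutive failures.
/// (Hypothetical type for illustration.)
#[derive(Debug)]
struct PeerTracker {
    state: PeerState,
    failures: u32,
    grace_count: u32,
}

impl PeerTracker {
    fn new(grace_count: u32) -> Self {
        Self { state: PeerState::Unknown, failures: 0, grace_count }
    }

    /// Applies one heartbeat result. Returns true when the peer has just
    /// come back from Missing, the point where a recovery alert would fire.
    fn record(&mut self, success: bool) -> bool {
        if success {
            let recovered = self.state == PeerState::Missing;
            self.state = PeerState::Online;
            self.failures = 0; // assumption: any success resets the failure run
            return recovered;
        }
        self.failures += 1;
        self.state = if self.failures >= self.grace_count {
            PeerState::Missing
        } else {
            PeerState::GracePeriod
        };
        false
    }
}
```

With `grace_count = 3`, two failed heartbeats leave the peer in `GracePeriod`, the third marks it `Missing`, and the next success returns it to `Online` and reports a recovery.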
91 +
92 + ### On Missing Policy
93 +
94 + - `alert` -- send email alert when peer transitions to Missing, send recovery when it returns
95 + - `log` -- log the event, no email
96 + - `ignore` -- suppress entirely
97 +
98 + ## Database Schema
99 +
100 + SQLite with WAL journal mode. Schema is managed through numbered migrations (currently v1-v4).
101 +
102 + ### Tables
103 +
104 + | Table | Purpose | Key Columns |
105 + |-------|---------|-------------|
106 + | `schema_version` | Migration tracking | version, description, applied_at |
107 + | `health_checks` | HTTP health check results | target, status, checked_at, response_time_ms, details_json, error |
108 + | `test_runs` | SSH test execution results | target, started_at, duration_secs, exit_code, passed, summary_json, raw_output |
109 + | `peer_identities` | First-seen peer UUIDs | peer_name (PK), instance_id, first_seen |
110 + | `peer_heartbeats` | Heartbeat history | peer_name, status, latency_ms, checked_at |
111 + | `alerts` | Alert history + cooldown tracking | target, alert_type, from_status, to_status, sent_at |
112 + | `tls_checks` | TLS certificate probe results | target, host, valid, days_remaining, not_before, not_after, subject, issuer |
113 + | `incidents` | Health incidents (open/closed) | target, started_at, ended_at, duration_secs, from_status, to_status |
114 +
115 + Pre-migration databases are detected by the presence of the `health_checks` table and stamped as v1 without re-running the initial migration.
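The stamping rule above can be expressed as a pure decision function. This is a sketch under stated assumptions: `effective_version` and `pending_migrations` are hypothetical names, and version 0 standing for "no schema yet" is a convention chosen here, not something the source states.

```rust
/// Decides which schema version a database should be treated as before
/// pending migrations run.
/// `recorded` is the highest version found in `schema_version`, if that
/// table exists; `has_health_checks` says whether `health_checks` exists.
fn effective_version(recorded: Option<u32>, has_health_checks: bool) -> u32 {
    match recorded {
        Some(v) => v,                   // already tracked by schema_version
        None if has_health_checks => 1, // pre-migration DB: stamp as v1
        None => 0,                      // fresh DB: nothing applied yet
    }
}

/// Lists the migration numbers that still need to run, in order.
fn pending_migrations(current: u32, latest: u32) -> Vec<u32> {
    ((current + 1)..=latest).collect()
}
```

A pre-migration database therefore skips the initial migration and only runs v2 onward, while a fresh database runs everything from v1.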
116 +
117 + ## Alert Pipeline
118 +
119 + ```
120 + Health status change detected (previous != current)
121 + |
122 + +--> operational -> non-operational: send_health_alert(), open incident
123 + +--> non-operational -> operational: send_health_recovery(), close incidents
124 + +--> non-operational -> different non-operational: close old incident, open new, alert
125 + |
126 + TLS check detects issue
127 + |
128 + +--> was OK, now invalid/error: send_tls_error_alert()
129 + +--> was OK, now within warn_days: send_tls_expiry_alert()
130 + +--> was bad, now OK: send_tls_recovery()
131 + |
132 + Latency drift detected (all recent checks exceed baseline * threshold)
133 + |
134 + +--> entered drift: send_latency_drift_alert()
135 + +--> exited drift: send_latency_recovery()
136 + |
137 + Peer transitions to Missing
138 + |
139 + +--> send_peer_missing() (if on_missing = alert)
140 + +--> peer recovers: send_peer_recovery()
141 + ```
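The latency drift rule in the pipeline ("all recent checks exceed baseline * threshold") is small enough to sketch directly. The function names, the `f64` millisecond samples, and treating an empty window as "no drift" are assumptions of this sketch; how PoM computes the baseline is not shown in the source.

```rust
/// Returns true when every recent sample exceeds `baseline_ms * threshold`.
/// An empty window counts as "no drift" (an assumption of this sketch).
fn in_latency_drift(recent_ms: &[f64], baseline_ms: f64, threshold: f64) -> bool {
    !recent_ms.is_empty() && recent_ms.iter().all(|&ms| ms > baseline_ms * threshold)
}

/// Alert to emit when the drift flag changes between two evaluations.
#[derive(Debug, PartialEq, Eq)]
enum DriftAlert {
    Entered,
    Exited,
}

/// Maps the previous and current drift flags to an alert, if any.
fn drift_transition(was_drifting: bool, is_drifting: bool) -> Option<DriftAlert> {
    match (was_drifting, is_drifting) {
        (false, true) => Some(DriftAlert::Entered),
        (true, false) => Some(DriftAlert::Exited),
        _ => None, // no change, so no alert
    }
}
```

Requiring every sample in the window to exceed the limit means a single slow response never triggers an alert; only sustained slowness does.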
142 +
143 + All alerts except recoveries are subject to a per-target cooldown (default 300s). Recoveries always send immediately. Without a Postmark token, alerts are logged to stdout (dev mode).
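A minimal sketch of the cooldown rule, assuming an in-memory map and Unix-second timestamps. `Cooldown`, `should_send`, and the `"health:api"` style key are hypothetical; the source describes PoM's cooldown tracking as living in the `alerts` table.

```rust
use std::collections::HashMap;

/// Minimal cooldown tracker keyed by an alert key such as "health:api".
/// (Hypothetical type; an in-memory stand-in for persisted alert history.)
struct Cooldown {
    window_secs: u64,
    last_sent: HashMap<String, u64>,
}

impl Cooldown {
    fn new(window_secs: u64) -> Self {
        Self { window_secs, last_sent: HashMap::new() }
    }

    /// Decides whether an alert may be sent at `now` (Unix seconds).
    /// Recoveries bypass the cooldown; other alerts are suppressed until
    /// `window_secs` have elapsed since the last one sent under the same key.
    fn should_send(&mut self, key: &str, is_recovery: bool, now: u64) -> bool {
        if is_recovery {
            return true;
        }
        if let Some(&last) = self.last_sent.get(key) {
            if now.saturating_sub(last) < self.window_secs {
                return false;
            }
        }
        self.last_sent.insert(key.to_string(), now);
        true
    }
}
```

The record and lookup paths must use the same key, otherwise no cooldown ever matches and every check sends an alert.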
144 +
145 + ## API Endpoints
146 +
147 + All endpoints require `Authorization: Bearer <token>` when `api_token` is configured (in config or via `POM_API_TOKEN` env var). Without a token configured, all requests pass through.
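The pass-through rule can be sketched as a pure function, separate from any Axum middleware wiring. `is_authorized` is a hypothetical helper, not PoM's middleware, and the plain `==` comparison is a simplification: a production check may prefer a constant-time comparison.

```rust
/// Returns true when a request may proceed.
/// `configured` is the API token from config or `POM_API_TOKEN`, if any;
/// `header` is the raw `Authorization` header value, if present.
fn is_authorized(configured: Option<&str>, header: Option<&str>) -> bool {
    let Some(expected) = configured else {
        return true; // no token configured: all requests pass through
    };
    match header.and_then(|h| h.strip_prefix("Bearer ")) {
        Some(presented) => presented == expected,
        None => false, // missing header or wrong scheme
    }
}
```

With a token configured, a request needs exactly `Authorization: Bearer <token>`; a bare token without the `Bearer ` prefix is rejected.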
148 +
149 + | Endpoint | Method | Description |
150 + |----------|--------|-------------|
151 + | `/api/status` | GET | JSON summary of all targets (latest health, uptime, latency, TLS, staleness, incidents) |
152 + | `/api/status/{target}` | GET | Same as above for a single target |
153 + | `/api/trends/{target}` | GET | Latency trend data with configurable window and bucket size (`?hours=24&bucket_minutes=60`) |
154 + | `/api/peer/info` | GET | This instance's identity (id, name, version, targets, started_at) |
155 + | `/api/peer/status` | GET | This instance's full view: identity + target statuses + peer summaries |
156 + | `/api/mesh` | GET | Aggregated mesh view: self + each peer's cached status |
157 +
158 + ## Check Types
159 +
160 + ### HTTP Health Check
161 +
162 + Sends a GET request to the target's health URL. Classifies the response:
163 + - JSON with `"status": "operational"` --> Operational
164 + - JSON with `"status": "degraded"` --> Degraded
165 + - Non-JSON 2xx --> Degraded
166 + - Non-2xx or unknown status --> Error
167 + - Connection failure --> Unreachable
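The classification rules above map naturally onto a pure function. This is a sketch, not the code in `src/checks/http.rs`: `Probe` is a hypothetical input type, and checking the HTTP status before the JSON body is an assumption, since the list does not say which rule wins when a non-2xx response carries `"status": "operational"`.

```rust
#[derive(Debug, PartialEq, Eq)]
enum HealthStatus {
    Operational,
    Degraded,
    Error,
    Unreachable,
}

/// What came back from the probe (hypothetical type for illustration).
/// `json_status` holds the body's "status" field when the body parsed as
/// JSON and the field was a string.
enum Probe<'a> {
    ConnectionFailed,
    Response { http_status: u16, is_json: bool, json_status: Option<&'a str> },
}

fn classify(probe: &Probe<'_>) -> HealthStatus {
    match *probe {
        Probe::ConnectionFailed => HealthStatus::Unreachable,
        Probe::Response { http_status, is_json, json_status } => {
            if !(200..300).contains(&http_status) {
                // Assumption: a non-2xx code wins over whatever the body says.
                return HealthStatus::Error;
            }
            match (is_json, json_status) {
                (true, Some("operational")) => HealthStatus::Operational,
                (true, Some("degraded")) => HealthStatus::Degraded,
                (false, _) => HealthStatus::Degraded, // non-JSON 2xx
                (true, _) => HealthStatus::Error,     // unknown or missing status
            }
        }
    }
}
```

Keeping the classifier free of I/O means every status path can be unit tested without a mock server.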
168 +
169 + Extracts version, uptime, checks, and monitoring from the JSON response body. Supports expectation validation: expected status code, required body substrings, and JSON field value assertions (with dot-path traversal for nested fields).
170 +
171 + ### TLS Certificate Check
172 +
173 + Connects to host:port, completes a TLS handshake against the webpki-roots trust store (Mozilla's root set bundled into the binary, not the operating system's store), extracts the leaf certificate, and parses it with x509-parser. Records validity, days remaining, not_before/not_after, subject, and issuer. Alerts when days_remaining falls below the configured `warn_days` threshold (default 14).
174 +
175 + ### SSH Test Runner
176 +
177 + Executes a configured command on a remote host via `ssh -o BatchMode=yes`. The command string comes from config (typically a CI script like `./run-ci.sh`). Supports an optional filter argument, validated against the character allowlist `[a-zA-Z0-9_:-]` to prevent injection. Output is parsed for PASS/FAIL step lines and `test result:` cargo test summary lines.
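The filter allowlist can be checked with a few lines of standard-library Rust. `is_valid_filter` is a hypothetical name, and rejecting the empty string is an assumption of this sketch; the source only gives the permitted character set.

```rust
/// Accepts only ASCII letters, digits, `_`, `:` and `-`.
/// Rejecting the empty string is an assumption of this sketch.
fn is_valid_filter(filter: &str) -> bool {
    !filter.is_empty()
        && filter
            .chars()
            .all(|c| c.is_ascii_alphanumeric() || matches!(c, '_' | ':' | '-'))
}
```

An allowlist rejects shell metacharacters such as spaces, `;`, `$`, and backticks by construction, which is simpler to reason about than trying to escape them.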
178 +
179 + ## Key Design Decisions
180 +
181 + **SQLite over PostgreSQL.** PoM is a single-binary tool that runs on each monitoring host. SQLite keeps it self-contained with zero external dependencies. WAL mode provides concurrent reads during serve mode. Data volume is modest (a few checks per minute, pruned after 30 days).
182 +
183 + **Peer mesh over centralized monitoring.** Two independent instances cross-check each other. If the Hetzner instance goes down, Astra detects it (and vice versa). No single point of failure for the monitoring layer itself.
184 +
185 + **Bearer token auth.** Simple, stateless, sufficient for machine-to-machine API access between peers. Configured per-peer and per-instance. No user management needed.
186 +
187 + **Versioned migrations.** The migration system detects pre-migration databases and stamps them without re-running. Each migration is a numbered SQL block. This avoids external migration tools while keeping schema evolution safe.
188 +
189 + **Separate check intervals.** Health checks can have per-target interval overrides. TLS checks run on a longer interval (hourly default) since certificate state changes slowly. Peer heartbeats run on a short interval (60s default) for timely failure detection.
190 +
191 + **Cooldown on alerts.** Prevents alert storms during flapping. Recovery alerts bypass cooldown so operators always know when a service comes back.
@@ -0,0 +1,257 @@
1 + # PoM (Peace of Mind) -- Audit Review
2 +
3 + **Last audited:** 2026-03-28 (tenth audit, Run 12 cross-project)
4 + **Previous audit:** 2026-03-18 (eighth audit, Run 9 cross-project)
5 +
6 + ## Overall Grade: A
7 +
8 + Run 12 cross-project audit. 359 tests (222 unit + 8 cli + 129 integration). 0 clippy warnings. v0.3.2. Grade stable at A. CORS monitoring added. DNS/route stale data fix deployed. New dep advisories: aws-lc-sys (HIGH 7.4), rustls-webpki.
9 +
10 + ## Scorecard
11 +
12 + | Dimension | Grade | Notes |
13 + |-----------|:-----:|-------|
14 + | Code Quality | A | Zero clippy warnings. Clean error handling with typed PomError enum (thiserror). No unnecessary complexity. |
15 + | Architecture | A | Well-structured single crate. main.rs is thin (130 LOC), CLI handlers extracted to cli.rs. Module boundaries clean and purposeful. |
16 + | Testing | A | 359 tests (222 unit + 8 cli + 129 integration) for ~10K LOC = ~36 tests/KLOC. Coverage comprehensive — DB, API, MCP tools, parsing, peer state machine, check_health (mock servers), check_tls (self-signed cert), rate limiter, CORS monitoring all tested. |
17 + | Security | A | All SQL parameterized. SSH hardened with BatchMode + `--` option separator. Bearer token auth on API + peer mesh. Rate limiting (60 req/min). HTTP response size limits (10MB). Shell-escape validation on SSH filter. |
18 + | Performance | A | WAL mode, proper indexes, async throughout. Appropriate for monitoring workload. |
19 + | Documentation | A | Module-level `//!` docs on all modules. All struct fields documented. description.md rewritten, architecture.md created (191 lines), README created (62 lines). Tool descriptions are excellent. |
20 + | Dependencies | A | 17 direct dependencies, all justified. No unused deps. rustls-tls avoids OpenSSL. Semver ranges. tokio-util for CancellationToken. |
21 + | Type Safety | A | Domain enums (HealthStatus, PeerStatus, OnMissing) with proper serde. No stringly-typed fields in application logic. |
22 + | Observability | A | 57 `#[instrument(skip_all)]` annotations across all modules. Debug-level logging for non-critical peer status fetch failures. Watchdog logging for silent task panics. |
23 + | Concurrency | A | Clean RwLock discipline. Lock-then-release-then-DB pattern in peer module. CancellationToken for cooperative shutdown. |
24 + | Resilience | A | CancellationToken + `tokio::select!` in all task loops. `with_graceful_shutdown` on API server. 5s grace period. Watchdog for task panic detection. Grace periods for peer detection. Error logging without panics. |
25 + | API Consistency | A | Consistent JSON shapes. Proper HTTP status codes. Public `/api/health` endpoint. MCP tool descriptions well-written. |
26 + | Migration Safety | A- | Numbered migration system (schema_version table). `IF NOT EXISTS` guards on all DDL. 5 migrations tracked. |
27 + | Codebase Size | A | ~3,500 LOC for a full monitoring tool with CLI, MCP server, HTTP API, SQLite persistence, SSH test runner, peer mesh, and self-monitoring. Extremely efficient. |
28 +
29 + ## Module Heatmap
30 +
31 + | Module | Code | Arch | Test | Security | Perf | Docs | Type Safety | Concurrency | Resilience |
32 + |--------|:----:|:----:|:----:|:--------:|:----:|:----:|:-----------:|:-----------:|:----------:|
33 + | main | A | A | A | A | A | A | A | A | A |
34 + | cli | A | A | A | A | A | A | A | A | A |
35 + | peer | A | A | A | A | A | A | A | A | A |
36 + | db | A | A | A | A+ | A | A | A- | A | A |
37 + | api | A | A | A | A | A | A | A | A | A |
38 + | config | A | A | A | A | A | A | A | - | - |
39 + | tools | A | A | A | A | A | A | A | - | A |
40 + | checks | A | A | A | A | A | A | A | - | A |
41 + | display | A | A | A | - | - | A | A | - | - |
42 + | types | A | A | A | - | - | A | A+ | - | - |
43 + | alerts | A | A | A | A | A | A | A | - | A |
44 +
45 + **Heatmap changes (2026-03-14 remediation):**
46 + - main: Arch B+ -> A (split into main.rs 130 LOC + cli.rs), Test B -> A (display.rs has 27+ tests), Resilience A- -> A (CancellationToken + graceful shutdown)
47 + - api: Security A -> A (rate limiting added), Resilience A -> A (`with_graceful_shutdown`)
48 + - cli (new): All A — extracted command handlers with CancellationToken shutdown orchestration
49 + - display (new): All A — formatting logic with 27+ unit tests
50 + - alerts (new): All A — extracted for visibility in heatmap
51 +
52 + ### Cold Spots
53 +
54 + None remaining. All previous cold spots resolved:
55 + - main.rs split into main.rs (130 LOC) + cli.rs (command handlers) + display.rs (formatting with 27+ tests)
56 + - CLI display formatting extracted and tested via display.rs
57 + - All modules documented with `//!` docs
58 +
59 + ### Cold Spot Remediation History (First Audit)
60 +
61 + All 8 cold spots from the first audit were remediated before the second audit:
62 +
63 + - ~~**config.rs (Testing): C**~~ -> **A** -- 4 config parsing tests (full parse, defaults, on_missing default, hostname fallback). `on_missing` now uses `OnMissing` enum with serde.
64 + - ~~**api.rs (Testing): C**~~ -> **A** -- 5 API endpoint integration tests (status, target 404, peer/info, peer disabled, mesh).
65 + - ~~**api.rs (Concurrency): B**~~ -> **A** -- Lock decoupled from DB queries in `peer_status` and `mesh_view`.
66 + - ~~**peer.rs (Testing): B-**~~ -> **A** -- 9 heartbeat state machine tests (grace transitions, recovery, UUID first-contact, DB recording, identity, on_missing deserialization).
67 + - ~~**peer.rs (Concurrency): B**~~ -> **A** -- Lock decoupled from DB writes in both heartbeat handlers.
68 + - ~~**db.rs (Performance): B-**~~ -> **A** -- 4 indexes added to init_schema.
69 + - ~~**checks/http.rs (Testing): C**~~ -> **A** -- Response classification extracted into pure functions. 8 unit tests covering all status paths.
70 + - ~~**main.rs (Testing): C**~~ -> **B** -- Status icon logic extracted to `HealthStatus::icon()`. 4 types.rs tests. main.rs is thin orchestration over well-tested modules.
71 +
72 + ## Strengths
73 +
74 + - **Lean architecture.** ~3,500 LOC delivers health checks, test orchestration, MCP server, HTTP API, peer mesh, CLI, alerting, TLS monitoring, route checks, and self-monitoring. No over-engineering.
75 + - **Clean error handling.** `?` propagation with typed PomError enum (thiserror, 8 variants). No panics in production paths.
76 + - **Correct security posture.** All SQL parameterized. Bearer token auth on API + peer mesh. Rate limiting (60 req/min). SSH hardened with `BatchMode=yes` + `--` separator. Shell-escape validation.
77 + - **Backward-compatible design.** All config sections are `#[serde(default)]`. Existing configs work unchanged.
78 + - **Good MCP integration.** 8 MCP tools with clear descriptions. Tools strip `raw_output` from test history to avoid flooding context.
79 + - **Comprehensive test suite.** 238 tests (152 unit + 86 integration) at ~68/KLOC covering DB, API, MCP tools, parsing, peer state machine, config, health classification, mock server health checks, TLS cert validation, and rate limiting.
80 + - **Excellent lock discipline.** Peer module collects data under lock, releases lock, then performs DB writes. No lock contention under load.
81 + - **Robust shutdown.** CancellationToken with `tokio::select!` in all task loops. Graceful API server shutdown. Watchdog for silent panics. 5s grace period.
82 +
83 + ## Weaknesses
84 +
85 + All previously identified weaknesses have been resolved:
86 + - ~~**main.rs size**~~ — Split into main.rs (130 LOC) + cli.rs + display.rs
87 + - ~~**No typed error enum**~~ — PomError enum with thiserror (8 variants)
88 + - ~~**Missing module docs**~~ — All modules have `//!` docs
89 + - ~~**No migration versioning**~~ — schema_version table with numbered migrations (v1-v5)
90 +
91 + ## Mandatory Surprise
92 +
93 + **Surprise: The peer mesh fetches `/api/peer/status` as a SECOND HTTP call after `/api/peer/info` in every heartbeat loop iteration, completely independently.**
94 +
95 + In `peer.rs:196-213`, after the heartbeat success/failure handling completes, there is a second HTTP GET to `/api/peer/status` that caches the full status data for mesh aggregation. This means every heartbeat cycle makes TWO HTTP requests to each peer.
96 +
97 + **Verdict: Actually fine.**
98 +
99 + This is intentional and well-designed. The `/api/peer/info` call is the heartbeat probe (lightweight, just identity data), while `/api/peer/status` fetches the full target/peer view for mesh aggregation. Separating them means:
100 + 1. The heartbeat latency measurement is accurate (not inflated by the larger status response).
101 + 2. A failure to fetch status data does not affect heartbeat state transitions (it logs at debug level and moves on).
102 + 3. The mesh view (`/api/mesh`) can show each peer's own view of their targets, enabling cross-instance monitoring.
103 +
104 + The second request's failure is handled gracefully with `tracing::debug!` -- no state corruption, no retries, no panics. Good design.
105 +
106 + ## Action Items
107 +
108 + Filed in `docs/mnw/pom/todo.md`.
109 +
110 + ### Third Audit (2026-03-13, pre-launch skeptical lens) -- 7 items, all resolved
111 +
112 + 1. ~~**[CRITICAL]** Remove Postmark API token from deployment configs~~ — Done — moved to POM_POSTMARK_TOKEN env var
113 + 2. ~~**[MEDIUM]** Add API authentication~~ — Done — bearer token middleware, 5 tests
114 + 3. ~~**[MEDIUM]** Add peer mesh authentication~~ — Done — per-peer token field, heartbeat sends Authorization header
115 + 4. ~~**[LOW]** Add integration tests for core functions~~ — Done — 9 new integration tests (check_health mock servers, check_tls self-signed cert)
116 + 5. ~~**[LOW]** Add self-monitoring~~ — Done — `/api/health` endpoint (public, no auth, returns status + version)
117 + 6. ~~**[LOW]** Shell-escape SSH test filter parameter~~ — Done (alphanumeric + `_:-` allowlist)
118 + 7. ~~**[LOW]** Reject peer UUID mismatch~~ — Done (tracing::error, skip status update, increment failures)
119 +
120 + ### Security Deep Dive (2026-03-13) — Complete (2/2)
121 +
122 + 8. ~~**[MEDIUM]** SSH `--` separator~~ — Done (`checks/ssh.rs`: `.arg("--")` between hostname and command prevents option injection via filter strings)
123 + 9. ~~**[LOW]** HTTP response size limit~~ — Done (`checks/http.rs`: `MAX_RESPONSE_BYTES` 10MB constant, content-length header check + post-read size verification)
124 +
125 + ### Second Audit (2026-03-11) -- 5 items
126 +
127 + 1. Extract CLI command handlers from main.rs into cli/ submodule (587 LOC, 7 commands)
128 + 2. Add typed PomError enum with thiserror (currently Box<dyn Error>, diverges from cross-project standard)
129 + 3. Add .DS_Store and IDE dirs (.idea/, .vscode/) to .gitignore
130 + 4. Add module-level //! docs to main.rs and config.rs
131 + 5. Add migration versioning (PRAGMA user_version) before next schema change
132 +
133 + ### First Audit (2026-03-10) -- 11 items, all resolved
134 +
135 + 1. ~~Add DB indexes~~ -- Done (4 indexes added to init_schema)
136 + 2. ~~Fix clippy warnings~~ -- Done (4 collapsible_if warnings fixed with Rust 2024 let chains)
137 + 3. ~~Decouple mesh lock from DB writes in heartbeat handlers~~ -- Done
138 + 4. ~~Decouple mesh read lock from DB queries in API handlers~~ -- Done
139 + 5. ~~Log /api/peer/status fetch failures~~ -- Done (tracing::debug)
140 + 6. ~~Include peer heartbeat prune count~~ -- Done (3-tuple return)
141 + 7. ~~Add module docs~~ -- Done (//! docs on db.rs, config.rs, peer.rs, types.rs, lib.rs)
142 + 8. ~~Change PeerConfig.on_missing to OnMissing enum~~ -- Done
143 + 9. ~~Add API endpoint tests~~ -- Done (5 tests)
144 + 10. ~~Add heartbeat state machine tests~~ -- Done (9 tests)
145 + 11. ~~Add config parsing tests~~ -- Done (4 tests)
146 +
147 + ## Changes Since Last Audit
148 +
149 + ### Tenth audit (2026-03-28, Run 12 cross-project)
150 + - **Test count:** 359 (222 unit + 8 cli + 129 integration). 0 clippy warnings. 0 failures.
151 + - **Grade:** A (maintained). v0.3.2.
152 + - **CORS monitoring:** New check type added for monitoring CORS headers on targets.
153 + - **New dependency advisories (action items):**
154 + - aws-lc-sys 0.38.0 (RUSTSEC-2026-0044 + -0048, severity 7.4 HIGH) — upgrade to 0.39.0 via `cargo update -p aws-lc-sys`
155 + - rustls-webpki 0.103.9 (RUSTSEC-2026-0049) — upgrade to 0.103.10 via `cargo update -p rustls-webpki`
156 + - paste unmaintained (RUSTSEC-2024-0436) — upstream via rmcp, warning only
157 + - **Mandatory surprise:** None. Previous surprises (rate limiter relaxed ordering, write!().unwrap() infallibility) still valid.
158 + - **No new code findings.** All previous items remain resolved.
159 +
160 + ### DNS/Route stale data fix (2026-03-25)
161 + - **Test count:** 352 (unchanged). 0 clippy warnings.
162 + - **Config:** Switched all 4 Cloudflare-proxied DNS records from `expected = ["IP"]` to `expected = []` (resolution-only). DNS checks were always failing because Cloudflare returns rotating proxy IPs, not the origin IP.
163 + - **API filtering:** `route_status` and `dns_status` in `/api/status/{target}` now filtered to only entries matching current config. Stale routes (e.g. `/docs/about`, `/signup`) and stale DNS records no longer appear in API responses.
164 + - **DB pruning:** Added `prune_stale_routes()` and `prune_stale_dns()` to `db.rs`. Called once at task startup in `routes.rs` and `dns.rs` to clean up historical data when config changes. Pruned 890 stale route check rows on first deploy.
165 + - **Integration tests:** Updated `api_status_includes_route_status` and `api_status_includes_dns_status` to use configs with matching route/DNS entries.
166 + - **Deployed to hetzner** — v0.3.2 binary + updated config.
167 +
168 + ### Eighth audit (2026-03-18, Run 9 cross-project)
169 + - **Test count:** 344 (unchanged). 0 clippy warnings.
170 + - **Grade:** A (maintained). v0.3.1 (deployed 2026-03-18).
171 + - **Dashboard UI shipped.** Per-test tracking, regression detection, duration drift.
172 + - **cli/ directory module split** completed (1,035-line cli.rs -> 8 files).
173 + - **Observations (pre-existing, not regressions):**
174 + - Mutex `.unwrap()` in rate limiter (api.rs:41) — if thread panics while holding lock, subsequent calls panic. Impact: LOW (rate limiter only, not core logic). Design choice: acceptable for monitoring tool.
175 + - `serde_json::to_value(d).unwrap_or_default()` in API details field — silently becomes null on serialization failure. Impact: LOW, safe fallback.
176 + - **No new findings requiring action.** Grade maintained at A.
177 + - **Mandatory surprise:** Rate limiter uses `fetch_add` with Relaxed ordering — can allow up to max_per_window+1 requests due to check-then-increment race. Known trade-off of lock-free rate limiting, documented.
178 +
179 + ### Fifth audit (2026-03-16, Run 6 cross-project)
180 + - **Test count:** 238 -> 344 (220 unit + 124 integration, +106 tests)
181 + - **Grade:** A (maintained). No new findings above LOW.
182 + - **Source LOC:** 10,113 (up from ~3.5K)
183 + - **Clippy:** 2 warnings (collapsible_if in cli.rs — LOW)
184 + - **Production unwraps:** 76 total — 64 infallible write! on String, 12 safe-by-construction. Effectively zero risky unwraps.
185 + - **Mandatory surprise:** write!().unwrap() pattern provably infallible — Actually fine.
186 + - **Previous items verified:** All previous remediated items confirmed intact.
187 + - **Note:** cli.rs at 1,036 lines (over the 500-line branching guideline), though mostly flat match arms.
188 + - **Infrastructure check:** Blocked by Tailscale SSH re-authentication. Deferred.
189 +
190 + ### Fourth audit remediation (2026-03-14)
191 + - **Grade:** A- -> A. All remaining findings resolved.
192 + - **Test count:** 229 -> 238 (+9 integration tests)
193 + - **Graceful shutdown:** Replaced `handle.abort()` with CancellationToken + `tokio::select!` in all task loops. API server uses `with_graceful_shutdown`. 5s grace period on SIGINT/SIGTERM.
194 + - **Task panic detection:** 60s watchdog checks `JoinHandle::is_finished()` on all background tasks.
195 + - **Rate limiting:** Fixed-window 60 req/min middleware on authenticated API routes. Custom `RateLimiter` struct.
196 + - **Self-monitoring:** `GET /api/health` endpoint (public, no auth) returns `{"status":"operational","version":"..."}`.
197 + - **Integration tests:** 5 check_health tests (mock axum servers: operational, degraded, unreachable, expectations pass/fail), 1 check_tls test (self-signed cert via rcgen), 2 /api/health tests, 1 rate limiter test.
198 + - **Deploy config cleanup:** Removed redundant htpy `expected_routes` (duplicated health check URL).
199 + - **Dependency:** Added `tokio-util` for CancellationToken.
200 + - **Cold spots:** 0 remaining (was 3). All previous architectural and testing gaps closed.
201 +
202 + ### Third audit (2026-03-13, pre-launch skeptical lens)
203 + - **Grade:** A -> A-. Postmark API token in plaintext deployment configs is a real issue.
204 + - **Test count:** 56 -> 187 (+131 tests)
205 + - **New findings:** Plaintext API token, no API auth, no peer mesh auth, no integration tests for core functions, no self-monitoring.
206 + - **38 unwraps in non-test code** — all verified safe (write to String or guarded by prior checks).
207 +
208 + **Post-audit remediation (2026-03-13):**
209 + - All 3 critical/medium findings resolved: Postmark token to env var, API bearer auth (5 tests), peer mesh auth
210 + - 2 low findings resolved: SSH filter validation, peer UUID mismatch rejection
211 + - Test count: 187 -> 195 (+8 tests)
212 + - Documentation upgraded to A: All struct fields documented (HealthSnapshot, HealthStatus, HealthDetails, TestRun, TestStaleness, PeerStatus, OnMissing, all config types, all API response types). All 8 error variants documented. 11 config defaults with rationale comments. prune_old_records return tuple documented. description.md rewritten, architecture.md created (191 lines), README created (62 lines).
213 +
214 + ### Observability Upgrade (2026-03-13)
215 + - **Observability:** A- -> A
216 + - Added 57 `#[instrument(skip_all)]` annotations across 9 files: db.rs (28), alerts.rs (9), tools/mod.rs (8), tools/health.rs (5), tools/tests.rs (3), checks/http.rs (1), checks/tls.rs (1), checks/ssh.rs (1), peer.rs (1)
217 + - Added Multithreaded forum as monitoring target: `pom-astra.toml` (localhost:3400), `pom-hetzner.toml` (Tailscale IP)
218 + - Added test runner targets for GO, BB, AF, SK to `pom-astra.toml`
219 + - All 208 tests pass. `cargo check` passes clean.
220 +
221 + ### Adversarial Test Audit (2026-03-13)
222 +
223 + **Goal:** Write tests that try to break the system. Find edge cases, race conditions, boundary conditions, and logic errors.
224 +
225 + **Results:**
226 + - **Test count:** 195 -> 208 (+13 tests)
227 + - **CRITICAL fix:** Alert cooldown key mismatch — `record_alert` used `target` but lookup used `alert_key` (`"health:{target}"`), so cooldowns never matched and alerts fired every check. Fixed by using `alert_key` consistently.
228 + - **HIGH fix:** TLS expiry check inconsistent at day boundary — time-of-day comparison could cause flapping. Changed to `date_naive()` comparison for stable day-level logic.
229 + - **HIGH fix:** UUID mismatch left stale peer state — now resets state, clears failures, persists via `update_peer_identity()` to prevent showing stale data after peer identity change.
230 + - **HIGH fix:** `prune_old_records` no guard for days <= 0 — could delete all records. Added early return for `days <= 0` (no-op).
231 + - **HIGH fix:** SSH timeout ignored config value — hardcoded `ConnectTimeout=10` in SSH args. Changed to use `config.timeout_secs`.
232 + - **Added `rcgen` dev dependency** for TLS cert generation in tests.
233 +
234 + ### Second audit (2026-03-11)
235 + | Change | Detail |
236 + |--------|--------|
237 + | Tests | +39 tests (17 -> 56). 28 unit + 28 integration. Tests/KLOC: 5.8 -> 18.4. |
238 + | Lock contention | Addressed in both peer.rs (heartbeat handlers) and api.rs (status/mesh handlers). Data collected under lock, DB writes after release. |
239 + | DB indexes | 4 indexes added: health_checks(target, id DESC), health_checks(target, checked_at), test_runs(target, id DESC), peer_heartbeats(peer_name, id DESC). |
240 + | Clippy | 4 warnings -> 0. Used Rust 2024 let chains instead of nested if-let. |
241 + | Type safety | PeerConfig.on_missing changed from String to OnMissing enum with serde deserialization. |
242 + | Module docs | Added //! docs to db.rs, config.rs, peer.rs, types.rs, lib.rs. |
243 + | Error handling | /api/peer/status fetch failures now logged at debug level instead of silenced. |
244 + | Prune | prune_old_records now returns 3-tuple including peer heartbeat count. |
245 + | Code extraction | HealthStatus::icon() method eliminates 3 repeated match blocks. |
246 + | HTTP checks | Response classification extracted into pure functions for testability. |
247 +
248 + ## Metrics Over Time
249 +
250 + | Audit Date | LOC | Rust Files | Tests | Tests/KLOC | Clippy Warnings | Cold Spots | Overall |
251 + |------------|-----|-----------|-------|-----------|----------------|------------|---------|
252 + | 2026-03-10 | 2,934 | 15 | 17 | 5.8 | 4 | 8 | B+ |
253 + | 2026-03-11 | 3,039 | 14 | 56 | 18.4 | 0 | 3 | A |
254 + | 2026-03-13 | ~3K | ~14 | 208 | ~69 | 0 | 3 | A- |
255 + | 2026-03-14 | ~3.5K | ~16 | 238 | ~68 | 0 | 0 | A |
256 + | 2026-03-16 | 10.1K | 23 | 344 | ~34 | 2 | 0 | A |
257 + | 2026-03-18 | 10.1K | 23 | 344 | ~34 | 0 | 0 | A |
A docs/todo.md +38
@@ -0,0 +1,38 @@
1 + # PoM Todo
2 +
3 + Done: Phases 1-13 complete. Per-test tracking + regression detection + duration drift added. 352 tests (124 lib + 228 integration). Grade: A (Run 10). v0.3.2 (redeployed 2026-03-25). cli/ split into directory module. Dashboard UI shipped. DNS checks fixed for Cloudflare-proxied domains. Stale route/DNS data pruning added.
4 +
5 + Completed work archived in `docs/archive/pom_todo_done.md`.
6 +
7 + ## Notification Integration
8 + When MNW unified notification service is built, PoM can push alerts there instead of / in addition to email.
9 + - [ ] Push PoM alerts to MNW notifications API (health failures, TLS expiry, DNS changes)
10 + - [ ] Deduplicate alert delivery (email via MNW notification preferences instead of direct Postmark)
11 +
12 + ## Deferred
13 + - [ ] Multi-location probing beyond hetzner+astra+macbook (third-party VPS for independent perspective)
14 + - [ ] Webhook alert channel (ntfy.sh, Pushover, generic webhook)
15 + - [ ] Prometheus/OpenTelemetry metrics export
16 + - [ ] Peer auto-discovery (mDNS/Tailscale API — currently manual config only)
17 +
18 + ---
19 +
20 + ## Key Paths
21 + - Config: `src/config.rs`, `pom.toml`
22 + - Database: `src/db.rs`
23 + - HTTP API: `src/api.rs`
24 + - Peer mesh: `src/peer.rs`
25 + - Health checks: `src/checks/http.rs`
26 + - Route checks: `src/checks/routes.rs`
27 + - TLS checks: `src/checks/tls.rs`
28 + - DNS checks: `src/checks/dns.rs`
29 + - WHOIS checks: `src/checks/whois.rs`
30 + - CI output parsing: `src/checks/parse.rs`
31 + - Test orchestration: `src/checks/ssh.rs`
32 + - MCP server: `src/tools/mod.rs`, `src/tools/health.rs`, `src/tools/tests.rs`
33 + - CLI: `src/main.rs`, `src/cli/` (mod.rs, serve.rs, status.rs, incident.rs, tasks/)
34 + - Types: `src/types.rs`
35 + - Integration tests: `tests/integration.rs`
36 + - Deploy: `deploy/` (deploy.sh, pom-hetzner.toml, pom-astra.toml, pom.service)
37 +
38 + Run 6 + Run 8 audit items all resolved.