# PoM (Peace of Mind) -- Audit History

Full chronological audit log. See [audit_review.md](./audit_review.md) for current state.

## Changes Since Last Audit

### Tenth audit (2026-03-28, Run 12 cross-project)
- **Test count:** 359 (222 unit + 8 cli + 129 integration). 0 clippy warnings. 0 failures.
- **Grade:** A (maintained). v0.3.2.
- **CORS monitoring:** New check type added for monitoring CORS headers on targets.
- **New dependency advisories (action items):**
  - aws-lc-sys 0.38.0 (RUSTSEC-2026-0044 + -0048, severity 7.4 HIGH): upgrade to 0.39.0 via `cargo update -p aws-lc-sys`
  - rustls-webpki 0.103.9 (RUSTSEC-2026-0049): upgrade to 0.103.10 via `cargo update -p rustls-webpki`
  - paste unmaintained (RUSTSEC-2024-0436): transitive dependency via rmcp, warning only
- **Mandatory surprise:** None. Previous surprises (rate limiter relaxed ordering, `write!().unwrap()` infallibility) still valid.
- **No new code findings.** All previous items remain resolved.

### DNS/Route stale data fix (2026-03-25)
- **Test count:** 352 (unchanged). 0 clippy warnings.
- **Config:** Switched all 4 Cloudflare-proxied DNS records from `expected = ["IP"]` to `expected = []` (resolution-only). DNS checks were always failing because Cloudflare returns rotating proxy IPs, not the origin IP.
- **API filtering:** `route_status` and `dns_status` in `/api/status/{target}` are now filtered to only the entries matching the current config. Stale routes (e.g. `/docs/about`, `/signup`) and stale DNS records no longer appear in API responses.
- **DB pruning:** Added `prune_stale_routes()` and `prune_stale_dns()` to `db.rs`. Called once at task startup in `routes.rs` and `dns.rs` to clean up historical data when the config changes. Pruned 890 stale route check rows on first deploy.
- **Integration tests:** Updated `api_status_includes_route_status` and `api_status_includes_dns_status` to use configs with matching route/DNS entries.
- **Deployed to hetzner:** v0.3.2 binary + updated config.

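The API filtering described above amounts to keeping only stored entries whose key still appears in the current config. A minimal sketch of that idea follows; the `RouteStatus` struct and `filter_to_configured` function are hypothetical stand-ins, not the project's actual types.

```rust
use std::collections::HashSet;

/// Hypothetical stand-in for one stored route check result.
#[derive(Debug, Clone, PartialEq)]
pub struct RouteStatus {
    pub path: String,
    pub ok: bool,
}

/// Keep only the entries whose path is still listed in the current config.
/// Entries for routes that were removed from the config are dropped.
pub fn filter_to_configured(stored: Vec<RouteStatus>, configured: &[&str]) -> Vec<RouteStatus> {
    let wanted: HashSet<&str> = configured.iter().copied().collect();
    stored
        .into_iter()
        .filter(|entry| wanted.contains(entry.path.as_str()))
        .collect()
}
```

Filtering at read time hides stale rows immediately, while the separate pruning step removes them from storage.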
### Eighth audit (2026-03-18, Run 9 cross-project)
- **Test count:** 344 (unchanged). 0 clippy warnings.
- **Grade:** A (maintained). v0.3.1 (deployed 2026-03-18).
- **Dashboard UI shipped.** Per-test tracking, regression detection, duration drift.
- **cli/ directory module split** completed (1,035-line cli.rs -> 8 files).
- **Observations (pre-existing, not regressions):**
  - Mutex `.unwrap()` in rate limiter (api.rs:41): if a thread panics while holding the lock, the mutex is poisoned and subsequent calls panic. Impact: LOW (rate limiter only, not core logic). Design choice: acceptable for a monitoring tool.
  - `serde_json::to_value(d).unwrap_or_default()` in API details field: silently becomes null on serialization failure. Impact: LOW, safe fallback.
- **No new findings requiring action.** Grade maintained at A.
- **Mandatory surprise:** Rate limiter uses `fetch_add` with Relaxed ordering, which can allow up to `max_per_window + 1` requests due to a check-then-increment race. Known trade-off of lock-free rate limiting, documented.

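For comparison with the check-then-increment race noted above, here is a sketch of an increment-then-check variant. It is not the project's `RateLimiter`; the `WindowCounter` type and its methods are hypothetical, and window reset timing is left to the caller.

```rust
use std::sync::atomic::{AtomicU32, Ordering};

/// Hypothetical fixed-window counter. The caller decides when a window ends.
pub struct WindowCounter {
    count: AtomicU32,
    max_per_window: u32,
}

impl WindowCounter {
    pub fn new(max_per_window: u32) -> Self {
        Self { count: AtomicU32::new(0), max_per_window }
    }

    /// Increment first, then decide from the value seen before the increment.
    /// `fetch_add` is a single atomic read-modify-write, so each caller observes
    /// a distinct previous count. Ignoring counter wraparound and a concurrent
    /// `reset`, at most `max_per_window` calls return true per window.
    pub fn try_acquire(&self) -> bool {
        let previous = self.count.fetch_add(1, Ordering::Relaxed);
        previous < self.max_per_window
    }

    /// Start a new window by zeroing the counter.
    pub fn reset(&self) {
        self.count.store(0, Ordering::Relaxed);
    }
}
```

Relaxed ordering is enough here because the decision depends only on the counter itself, not on other memory being visible.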
### Fifth audit (2026-03-16, Run 6 cross-project)
- **Test count:** 238 -> 344 (220 unit + 124 integration, +106 tests)
- **Grade:** A (maintained). No new findings above LOW.
- **Source LOC:** 10,113 (up from ~3.5K)
- **Clippy:** 2 warnings (`collapsible_if` in cli.rs, LOW)
- **Production unwraps:** 76 total: 64 infallible `write!` on `String`, 12 safe by construction. Effectively zero risky unwraps.
- **Mandatory surprise:** The `write!().unwrap()` pattern is provably infallible when the destination is a `String`. Verdict: fine as-is.
- **Previous items verified:** All previously remediated items confirmed intact.
- **Note:** cli.rs is at 1,036 lines, well past the 500-line branching guideline, but it is mostly flat match arms.
- **Infrastructure check:** Blocked by Tailscale SSH re-authentication. Deferred.

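The `write!` on `String` pattern counted above looks roughly like the sketch below. The `summarize` function and its output format are hypothetical and only illustrate why the `unwrap()` is considered safe.

```rust
use std::fmt::Write;

/// Build a one-line status summary. Name and format are illustrative only.
pub fn summarize(target: &str, passed: u32, total: u32) -> String {
    let mut out = String::new();
    // `String`'s `fmt::Write` impl only appends to memory and never returns Err,
    // so with plain `&str` and integer arguments this unwrap cannot panic.
    write!(out, "{target}: {passed}/{total} checks passed").unwrap();
    out
}
```

For example, `summarize("api", 3, 4)` returns `"api: 3/4 checks passed"`.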
### Fourth audit remediation (2026-03-14)
- **Grade:** A- -> A. All remaining findings resolved.
- **Test count:** 229 -> 238 (+9 integration tests)
- **Graceful shutdown:** Replaced `handle.abort()` with CancellationToken + `tokio::select!` in all task loops. API server uses `with_graceful_shutdown`. 5s grace period on SIGINT/SIGTERM.
- **Task panic detection:** 60s watchdog checks `JoinHandle::is_finished()` on all background tasks.
- **Rate limiting:** Fixed-window 60 req/min middleware on authenticated API routes. Custom `RateLimiter` struct.
- **Self-monitoring:** `GET /api/health` endpoint (public, no auth) returns `{"status":"operational","version":"..."}`.
- **Integration tests:** 5 check_health tests (mock axum servers: operational, degraded, unreachable, expectations pass/fail), 1 check_tls test (self-signed cert via rcgen), 2 /api/health tests, 1 rate limiter test.
- **Deploy config cleanup:** Removed redundant htpy `expected_routes` (duplicated health check URL).
- **Dependency:** Added `tokio-util` for CancellationToken.
- **Cold spots:** 0 remaining (was 3). All previous architectural and testing gaps closed.

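The watchdog idea above can be sketched without an async runtime. This example uses `std::thread::JoinHandle`, which also exposes `is_finished()`, instead of the async task handles the project uses; the `finished_tasks` helper and the task names are hypothetical.

```rust
use std::thread::JoinHandle;

/// Return the names of background tasks that have stopped running.
/// A long-lived task appearing here means it exited or panicked unexpectedly.
pub fn finished_tasks<'a>(tasks: &'a [(&'a str, JoinHandle<()>)]) -> Vec<&'a str> {
    tasks
        .iter()
        .filter(|(_, handle)| handle.is_finished())
        .map(|(name, _)| *name)
        .collect()
}
```

A watchdog loop would call a helper like this on a fixed interval (60s in the audit entry) and log or alert on any name it returns.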
### Third audit (2026-03-13, pre-launch skeptical lens)
- **Grade:** A -> A-. Postmark API token in plaintext deployment configs is a real issue.
- **Test count:** 56 -> 187 (+131 tests)
- **New findings:** Plaintext API token, no API auth, no peer mesh auth, no integration tests for core functions, no self-monitoring.
- **38 unwraps in non-test code:** all verified safe (write to String or guarded by prior checks).

**Post-audit remediation (2026-03-13):**
- All 3 critical/medium findings resolved: Postmark token to env var, API bearer auth (5 tests), peer mesh auth
- 2 low findings resolved: SSH filter validation, peer UUID mismatch rejection
- Test count: 187 -> 195 (+8 tests)
- Documentation upgraded to A: All struct fields documented (HealthSnapshot, HealthStatus, HealthDetails, TestRun, TestStaleness, PeerStatus, OnMissing, all config types, all API response types). All 8 error variants documented. 11 config defaults with rationale comments. prune_old_records return tuple documented. description.md rewritten, architecture.md created (191 lines), README created (62 lines).

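As an illustration of the bearer auth remediation above, the following sketch parses an `Authorization` header value and compares the token. Both helper names are hypothetical, and the project's actual middleware may differ.

```rust
/// Extract the token from a header value of the form `Bearer <token>`.
/// Returns None for other schemes or an empty token.
pub fn bearer_token(header_value: &str) -> Option<&str> {
    let token = header_value.strip_prefix("Bearer ")?.trim();
    if token.is_empty() { None } else { Some(token) }
}

/// Check an optional header value against the expected token.
/// A missing header or a non-bearer scheme is treated as unauthorized.
pub fn is_authorized(header_value: Option<&str>, expected: &str) -> bool {
    match header_value.and_then(bearer_token) {
        Some(token) => token == expected,
        None => false,
    }
}
```

The plain `==` comparison is not constant-time; a production check may prefer a constant-time comparison to avoid timing side channels.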
### Observability Upgrade (2026-03-13)
- **Observability:** A- -> A
- Added 57 `#[instrument(skip_all)]` annotations across 9 files: db.rs (28), alerts.rs (9), tools/mod.rs (8), tools/health.rs (5), tools/tests.rs (3), checks/http.rs (1), checks/tls.rs (1), checks/ssh.rs (1), peer.rs (1)
- Added Multithreaded forum as monitoring target: `pom-astra.toml` (localhost:3400), `pom-hetzner.toml` (Tailscale IP)
- Added test runner targets for GO, BB, AF, SK to `pom-astra.toml`
- All 208 tests pass. `cargo check` passes clean.

### Adversarial Test Audit (2026-03-13)

**Goal:** Write tests that try to break the system. Find edge cases, race conditions, boundary conditions, and logic errors.

**Results:**
- **Test count:** 195 -> 208 (+13 tests)
- **CRITICAL fix:** Alert cooldown key mismatch. `record_alert` used `target` but the lookup used `alert_key` (`"health:{target}"`), so cooldowns never matched and alerts fired on every check. Fixed by using `alert_key` consistently.
- **HIGH fix:** TLS expiry check was inconsistent at the day boundary. The time-of-day comparison could cause flapping. Changed to a `date_naive()` comparison for stable day-level logic.
- **HIGH fix:** UUID mismatch left stale peer state. It now resets state, clears failures, and persists via `update_peer_identity()` so stale data is not shown after a peer identity change.
- **HIGH fix:** `prune_old_records` had no guard for `days <= 0` and could delete all records. Added an early return for `days <= 0` (no-op).
- **HIGH fix:** SSH timeout ignored the config value because `ConnectTimeout=10` was hardcoded in the SSH args. Changed to use `config.timeout_secs`.
- **Added `rcgen` dev dependency** for TLS cert generation in tests.

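The cooldown key fix above comes down to deriving the key in one place and using it for both the write and the lookup. A minimal in-memory sketch follows; the `Cooldowns` struct is hypothetical, and only the `"health:{target}"` key format and the `record_alert` name are taken from the entry above.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Single source of truth for the key, used by both the write and the lookup.
pub fn alert_key(target: &str) -> String {
    format!("health:{target}")
}

/// Hypothetical in-memory cooldown tracker keyed by alert key.
pub struct Cooldowns {
    last_sent: HashMap<String, Instant>,
    window: Duration,
}

impl Cooldowns {
    pub fn new(window: Duration) -> Self {
        Self { last_sent: HashMap::new(), window }
    }

    /// Record that an alert for `target` was sent at `now`.
    pub fn record_alert(&mut self, target: &str, now: Instant) {
        self.last_sent.insert(alert_key(target), now);
    }

    /// True while the most recent alert for `target` is still inside the window.
    pub fn in_cooldown(&self, target: &str, now: Instant) -> bool {
        self.last_sent
            .get(&alert_key(target))
            .is_some_and(|sent| now.duration_since(*sent) < self.window)
    }
}
```

If `record_alert` stored the bare `target` while `in_cooldown` looked up `alert_key(target)`, the lookup would never hit, which is the failure mode the fix removed.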
### Second audit (2026-03-11)

| Area | Change |
|------|--------|
| Tests | +39 tests (17 -> 56). 28 unit + 28 integration. Tests/KLOC: 5.8 -> 18.4. |
| Lock contention | Addressed in both peer.rs (heartbeat handlers) and api.rs (status/mesh handlers). Data collected under lock, DB writes after release. |
| DB indexes | 4 indexes added: health_checks(target, id DESC), health_checks(target, checked_at), test_runs(target, id DESC), peer_heartbeats(peer_name, id DESC). |
| Clippy | 4 warnings -> 0. Used Rust 2024 let chains instead of nested if-let. |
| Type safety | PeerConfig.on_missing changed from String to OnMissing enum with serde deserialization. |
| Module docs | Added //! docs to db.rs, config.rs, peer.rs, types.rs, lib.rs. |
| Error handling | /api/peer/status fetch failures now logged at debug level instead of silenced. |
| Prune | prune_old_records now returns 3-tuple including peer heartbeat count. |
| Code extraction | HealthStatus::icon() method eliminates 3 repeated match blocks. |
| HTTP checks | Response classification extracted into pure functions for testability. |

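The lock contention change in the table above follows a common pattern: copy what is needed while holding the lock, release it, then do the slow writes. The sketch below assumes a hypothetical `PeerState` map and a caller-supplied `write_row` callback standing in for the DB write.

```rust
use std::collections::HashMap;
use std::sync::Mutex;

/// Hypothetical shared peer state: peer name -> consecutive failure count.
pub type PeerState = Mutex<HashMap<String, u32>>;

/// Copy the data out while holding the lock, then release it before returning.
pub fn snapshot_failures(state: &PeerState) -> Vec<(String, u32)> {
    // Recover the data even if another thread panicked while holding the lock.
    let guard = state.lock().unwrap_or_else(|poisoned| poisoned.into_inner());
    let mut rows: Vec<(String, u32)> = guard.iter().map(|(k, v)| (k.clone(), *v)).collect();
    drop(guard); // lock released here, before sorting or any slow work
    rows.sort();
    rows
}

/// Persist each row through `write_row` without holding the lock.
pub fn persist_failures(state: &PeerState, mut write_row: impl FnMut(&str, u32)) {
    let rows = snapshot_failures(state);
    for (peer, failures) in &rows {
        write_row(peer, *failures);
    }
}
```

Handlers that need the lock are then blocked only for the duration of the copy, not for the duration of the writes.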
## Metrics Over Time

| Date | LOC |  | Tests | Tests/KLOC | Clippy warnings | Cold spots | Grade |
|------|-----|---|-------|------------|-----------------|------------|-------|
| 2026-03-10 | 2,934 | 15 | 17 | 5.8 | 4 | 8 | B+ |
| 2026-03-11 | 3,039 | 14 | 56 | 18.4 | 0 | 3 | A |
| 2026-03-13 | ~3K | ~14 | 208 | ~69 | 0 | 3 | A- |
| 2026-03-14 | ~3.5K | ~16 | 238 | ~68 | 0 | 0 | A |
| 2026-03-16 | 10.1K | 23 | 344 | ~34 | 2 | 0 | A |
| 2026-03-18 | 10.1K | 23 | 344 | ~34 | 0 | 0 | A |