# PoM (Peace of Mind) -- Audit History

Full chronological audit log. See [audit_review.md](./audit_review.md) for current state.

## Changes Since Last Audit

### Tenth audit (2026-03-28, Run 12 cross-project)
- **Test count:** 359 (222 unit + 8 cli + 129 integration). 0 clippy warnings. 0 failures.
- **Grade:** A (maintained). v0.3.2.
- **CORS monitoring:** New check type added for monitoring CORS headers on targets.
- **New dependency advisories (action items):**
  - aws-lc-sys 0.38.0 (RUSTSEC-2026-0044 and RUSTSEC-2026-0048, severity 7.4 HIGH): upgrade to 0.39.0 via `cargo update -p aws-lc-sys`
  - rustls-webpki 0.103.9 (RUSTSEC-2026-0049): upgrade to 0.103.10 via `cargo update -p rustls-webpki`
  - paste unmaintained (RUSTSEC-2024-0436): pulled in upstream via rmcp, warning only
- **Mandatory surprise:** None. Previous surprises (rate limiter relaxed ordering, `write!().unwrap()` infallibility) still valid.
- **No new code findings.** All previous items remain resolved.

### DNS/Route stale data fix (2026-03-25)
- **Test count:** 352 (unchanged). 0 clippy warnings.
- **Config:** Switched all 4 Cloudflare-proxied DNS records from `expected = ["IP"]` to `expected = []` (resolution-only). The DNS checks had been failing on every run because Cloudflare returns rotating proxy IPs, not the origin IP.
- **API filtering:** `route_status` and `dns_status` in `/api/status/{target}` are now filtered to entries that match the current config. Stale routes (e.g. `/docs/about`, `/signup`) and stale DNS records no longer appear in API responses.
- **DB pruning:** Added `prune_stale_routes()` and `prune_stale_dns()` to `db.rs`. Called once at task startup in `routes.rs` and `dns.rs` to clean up historical data when the config changes. Pruned 890 stale route check rows on first deploy.
- **Integration tests:** Updated `api_status_includes_route_status` and `api_status_includes_dns_status` to use configs with matching route/DNS entries.
- **Deployed to hetzner:** v0.3.2 binary + updated config.
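
The config change above relies on an empty `expected` list meaning "pass as long as the name resolves". A minimal sketch of that rule follows. The helper name `dns_check_passes`, its signature, and the "any resolved address matches" rule for non-empty lists are assumptions for illustration, not taken from PoM's source.

```rust
use std::net::IpAddr;

/// Hypothetical helper: decides whether a DNS check passes.
/// - No resolved addresses: always a failure.
/// - Empty `expected`: resolution-only mode, any answer is accepted.
/// - Non-empty `expected`: at least one resolved address must be listed
///   (this matching rule is an assumption).
fn dns_check_passes(expected: &[IpAddr], resolved: &[IpAddr]) -> bool {
    if resolved.is_empty() {
        return false;
    }
    if expected.is_empty() {
        return true;
    }
    resolved.iter().any(|ip| expected.contains(ip))
}
```

Under this rule, a record pinned to an origin IP fails whenever a proxy answers with a different address, while the same record with an empty list passes on any successful resolution.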

### Eighth audit (2026-03-18, Run 9 cross-project)
- **Test count:** 344 (unchanged). 0 clippy warnings.
- **Grade:** A (maintained). v0.3.1 (deployed 2026-03-18).
- **Dashboard UI shipped.** Per-test tracking, regression detection, duration drift.
- **cli/ directory module split** completed (1,035-line cli.rs -> 8 files).
- **Observations (pre-existing, not regressions):**
  - Mutex `.unwrap()` in rate limiter (api.rs:41): if a thread panics while holding the lock, the mutex is poisoned and every subsequent call panics. Impact: LOW (rate limiter only, not core logic). Design choice: acceptable for a monitoring tool.
  - `serde_json::to_value(d).unwrap_or_default()` in API details field: silently becomes null on serialization failure. Impact: LOW, safe fallback.
- **No new findings requiring action.** Grade maintained at A.
- **Mandatory surprise:** Rate limiter uses `fetch_add` with Relaxed ordering and can allow up to max_per_window+1 requests due to a check-then-increment race. Known trade-off of lock-free rate limiting, documented.
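
The check-then-increment race described above can be sketched in isolation. The `FixedWindow` type and both method names below are hypothetical and are not PoM's `RateLimiter`; window reset is omitted to keep the example short. The second method is shown only as a contrast, not as a claim about what the project should do.

```rust
use std::sync::atomic::{AtomicU32, Ordering};

/// Hypothetical fixed-window counter (window reset omitted for brevity).
struct FixedWindow {
    count: AtomicU32,
    max_per_window: u32,
}

impl FixedWindow {
    fn new(max_per_window: u32) -> Self {
        Self { count: AtomicU32::new(0), max_per_window }
    }

    /// Check-then-increment: the load and the increment are two separate
    /// atomic operations, so two threads can both observe `max - 1`, both
    /// pass the check, and both increment, admitting an extra request.
    fn allow_racy(&self) -> bool {
        if self.count.load(Ordering::Relaxed) < self.max_per_window {
            self.count.fetch_add(1, Ordering::Relaxed);
            true
        } else {
            false
        }
    }

    /// Increment-then-check: `fetch_add` returns the previous value as part
    /// of one atomic operation, so exactly `max_per_window` callers see a
    /// value below the limit. The counter keeps growing on rejected calls,
    /// so a real limiter would reset it at each window boundary.
    fn allow_strict(&self) -> bool {
        self.count.fetch_add(1, Ordering::Relaxed) < self.max_per_window
    }
}
```

In single-threaded use both variants admit the same number of requests; they differ only under concurrent callers near the limit.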

### Fifth audit (2026-03-16, Run 6 cross-project)
- **Test count:** 238 -> 344 (220 unit + 124 integration, +106 tests)
- **Grade:** A (maintained). No new findings above LOW.
- **Source LOC:** 10,113 (up from ~3.5K)
- **Clippy:** 2 warnings (collapsible_if in cli.rs, LOW)
- **Production unwraps:** 76 total: 64 infallible `write!` on String, 12 safe-by-construction. Effectively zero risky unwraps.
- **Mandatory surprise:** The `write!().unwrap()` pattern is provably infallible when the target is a String. Actually fine.
- **Previous items verified:** All previous remediated items confirmed intact.
- **Note:** cli.rs at 1,036 lines is over the 500-line branching guideline, but it is mostly flat match arms.
- **Infrastructure check:** Blocked by Tailscale SSH re-authentication. Deferred.
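
The infallibility claim for `write!` on a String can be illustrated with a short sketch. The function `render_summary` and its output format are hypothetical, chosen only to show the pattern.

```rust
use std::fmt::Write;

/// Hypothetical formatter showing the `write!(..).unwrap()` pattern.
fn render_summary(target: &str, passed: u32, total: u32) -> String {
    let mut out = String::new();
    // `String`'s `fmt::Write` implementation only appends to the buffer and
    // always returns `Ok`, so `write!` can fail here only if a formatting
    // impl of one of the arguments returns an error. `&str` and `u32` never do.
    write!(out, "{target}: ").unwrap();
    write!(out, "{passed}/{total} checks passed").unwrap();
    out
}
```

For example, `render_summary("api", 3, 4)` yields `api: 3/4 checks passed`.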

### Fourth audit remediation (2026-03-14)
- **Grade:** A- -> A. All remaining findings resolved.
- **Test count:** 229 -> 238 (+9 integration tests)
- **Graceful shutdown:** Replaced `handle.abort()` with CancellationToken + `tokio::select!` in all task loops. API server uses `with_graceful_shutdown`. 5s grace period on SIGINT/SIGTERM.
- **Task panic detection:** 60s watchdog checks `JoinHandle::is_finished()` on all background tasks.
- **Rate limiting:** Fixed-window 60 req/min middleware on authenticated API routes. Custom `RateLimiter` struct.
- **Self-monitoring:** `GET /api/health` endpoint (public, no auth) returns `{"status":"operational","version":"..."}`.
- **Integration tests:** 5 check_health tests (mock axum servers: operational, degraded, unreachable, expectations pass/fail), 1 check_tls test (self-signed cert via rcgen), 2 /api/health tests, 1 rate limiter test.
- **Deploy config cleanup:** Removed redundant htpy `expected_routes` (duplicated health check URL).
- **Dependency:** Added `tokio-util` for CancellationToken.
- **Cold spots:** 0 remaining (was 3). All previous architectural and testing gaps closed.
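
The cooperative-shutdown and watchdog ideas above can be sketched without an async runtime. This is a standard-library analog only: it uses OS threads and an `AtomicBool` as a stand-in for tokio tasks and `CancellationToken`, and the names `spawn_task` and `finished_tasks` are hypothetical. It is not the code PoM runs.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;
use std::thread::{self, JoinHandle};
use std::time::Duration;

/// Spawns a worker that loops until the shared flag is set.
/// The flag plays the role of a cancellation token.
fn spawn_task(cancel: Arc<AtomicBool>) -> JoinHandle<u32> {
    thread::spawn(move || {
        let mut iterations = 0;
        while !cancel.load(Ordering::Acquire) {
            iterations += 1; // one unit of work, e.g. a single health check
            thread::sleep(Duration::from_millis(5));
        }
        // The loop exits at a safe point instead of being aborted mid-check.
        iterations
    })
}

/// Watchdog probe: names of tasks whose threads have exited,
/// whether by returning or by panicking.
fn finished_tasks<T>(tasks: &[(&'static str, JoinHandle<T>)]) -> Vec<&'static str> {
    tasks
        .iter()
        .filter(|(_, handle)| handle.is_finished())
        .map(|(name, _)| *name)
        .collect()
}
```

A supervisor would call `finished_tasks` on a timer. Any name reported before shutdown was requested points at a task that died unexpectedly.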

### Third audit (2026-03-13, pre-launch skeptical lens)
- **Grade:** A -> A-. Postmark API token in plaintext deployment configs is a real issue.
- **Test count:** 56 -> 187 (+131 tests)
- **New findings:** Plaintext API token, no API auth, no peer mesh auth, no integration tests for core functions, no self-monitoring.
- **38 unwraps in non-test code:** all verified safe (write to String or guarded by prior checks).

**Post-audit remediation (2026-03-13):**
- All 3 critical/medium findings resolved: Postmark token to env var, API bearer auth (5 tests), peer mesh auth
- 2 low findings resolved: SSH filter validation, peer UUID mismatch rejection
- Test count: 187 -> 195 (+8 tests)
- Documentation upgraded to A: All struct fields documented (HealthSnapshot, HealthStatus, HealthDetails, TestRun, TestStaleness, PeerStatus, OnMissing, all config types, all API response types). All 8 error variants documented. 11 config defaults with rationale comments. prune_old_records return tuple documented. description.md rewritten, architecture.md created (191 lines), README created (62 lines).

### Observability Upgrade (2026-03-13)
- **Observability:** A- -> A
- Added 57 `#[instrument(skip_all)]` annotations across 9 files: db.rs (28), alerts.rs (9), tools/mod.rs (8), tools/health.rs (5), tools/tests.rs (3), checks/http.rs (1), checks/tls.rs (1), checks/ssh.rs (1), peer.rs (1)
- Added Multithreaded forum as monitoring target: `pom-astra.toml` (localhost:3400), `pom-hetzner.toml` (Tailscale IP)
- Added test runner targets for GO, BB, AF, SK to `pom-astra.toml`
- All 208 tests pass. `cargo check` passes clean.

### Adversarial Test Audit (2026-03-13)

**Goal:** Write tests that try to break the system. Find edge cases, race conditions, boundary conditions, and logic errors.

**Results:**
- **Test count:** 195 -> 208 (+13 tests)
- **CRITICAL fix:** Alert cooldown key mismatch. `record_alert` stored the cooldown under `target`, but the lookup used `alert_key` (`"health:{target}"`), so cooldowns never matched and alerts fired on every check. Fixed by using `alert_key` consistently.
- **HIGH fix:** TLS expiry check was inconsistent at the day boundary: the time-of-day comparison could cause flapping. Changed to a `date_naive()` comparison for stable day-level logic.
- **HIGH fix:** UUID mismatch left stale peer state. The handler now resets state, clears failures, and persists via `update_peer_identity()` so stale data is not shown after a peer identity change.
- **HIGH fix:** `prune_old_records` had no guard for `days <= 0` and could delete all records. Added an early return for `days <= 0` (no-op).
- **HIGH fix:** SSH timeout ignored the config value: `ConnectTimeout=10` was hardcoded in the SSH args. Changed to use `config.timeout_secs`.
- **Added `rcgen` dev dependency** for TLS cert generation in tests.
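
The cooldown key mismatch is easier to see in a reduced form. The sketch below is hypothetical: the `Cooldowns` struct, its fields, and the method signatures are illustrative assumptions, and only the `"health:{target}"` key format comes from the finding above. The point is that the write path and the read path derive the key through the same function.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Hypothetical cooldown tracker keyed by alert key.
struct Cooldowns {
    last_sent: HashMap<String, Instant>,
    window: Duration,
}

/// Single place where the key is built, shared by record and lookup.
fn alert_key(target: &str) -> String {
    format!("health:{target}")
}

impl Cooldowns {
    fn new(window: Duration) -> Self {
        Self { last_sent: HashMap::new(), window }
    }

    /// Stores the send time under the derived key. Storing it under the
    /// bare `target` would reproduce the bug, because lookups would
    /// then never find the entry.
    fn record_alert(&mut self, target: &str, now: Instant) {
        self.last_sent.insert(alert_key(target), now);
    }

    /// True while the last alert for this target is younger than the window.
    fn in_cooldown(&self, target: &str, now: Instant) -> bool {
        self.last_sent
            .get(&alert_key(target))
            .map_or(false, |sent| now.duration_since(*sent) < self.window)
    }
}
```

Passing `now` in explicitly keeps the logic testable without sleeping.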

### Second audit (2026-03-11)
| Change | Detail |
|--------|--------|
| Tests | +39 tests (17 -> 56). 28 unit + 28 integration. Tests/KLOC: 5.8 -> 18.4. |
| Lock contention | Addressed in both peer.rs (heartbeat handlers) and api.rs (status/mesh handlers). Data collected under lock, DB writes after release. |
| DB indexes | 4 indexes added: health_checks(target, id DESC), health_checks(target, checked_at), test_runs(target, id DESC), peer_heartbeats(peer_name, id DESC). |
| Clippy | 4 warnings -> 0. Used Rust 2024 let chains instead of nested if-let. |
| Type safety | PeerConfig.on_missing changed from String to OnMissing enum with serde deserialization. |
| Module docs | Added //! docs to db.rs, config.rs, peer.rs, types.rs, lib.rs. |
| Error handling | /api/peer/status fetch failures now logged at debug level instead of silenced. |
| Prune | prune_old_records now returns 3-tuple including peer heartbeat count. |
| Code extraction | HealthStatus::icon() method eliminates 3 repeated match blocks. |
| HTTP checks | Response classification extracted into pure functions for testability. |
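
The lock contention row ("data collected under lock, DB writes after release") describes a general pattern that can be sketched with the standard library. The `PeerState` struct, the `flush_heartbeats` function, and the `write_row` callback are hypothetical stand-ins. PoM's handlers and database layer are not shown in this log.

```rust
use std::sync::Mutex;

/// Hypothetical in-memory state guarded by a mutex.
struct PeerState {
    pending: Vec<String>,
}

/// Takes the pending rows while holding the lock, releases the lock,
/// then performs the slow writes. Returns the number of rows written.
fn flush_heartbeats(state: &Mutex<PeerState>, mut write_row: impl FnMut(&str)) -> usize {
    // The guard lives only inside this block, so the lock is released
    // before any I/O starts.
    let batch = {
        let mut guard = state.lock().unwrap();
        std::mem::take(&mut guard.pending)
    };
    for row in &batch {
        write_row(row); // stand-in for a DB write; the lock is not held here
    }
    batch.len()
}
```

Other handlers can acquire the lock while the writes are in flight, which is the contention this change targets.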

## Metrics Over Time

| Audit Date | LOC | Rust Files | Tests | Tests/KLOC | Clippy Warnings | Cold Spots | Overall |
|------------|-----|-----------|-------|-----------|----------------|------------|---------|
| 2026-03-10 | 2,934 | 15 | 17 | 5.8 | 4 | 8 | B+ |
| 2026-03-11 | 3,039 | 14 | 56 | 18.4 | 0 | 3 | A |
| 2026-03-13 | ~3K | ~14 | 208 | ~69 | 0 | 3 | A- |
| 2026-03-14 | ~3.5K | ~16 | 238 | ~68 | 0 | 0 | A |
| 2026-03-16 | 10.1K | 23 | 344 | ~34 | 2 | 0 | A |
| 2026-03-18 | 10.1K | 23 | 344 | ~34 | 0 | 0 | A |