# PoM (Peace of Mind) -- Audit History

Full chronological audit log. See [audit_review.md](./audit_review.md) for current state.

## Changes Since Last Audit

### Tenth audit (2026-03-28, Run 12 cross-project)
- Test count: 359 (222 unit + 8 cli + 129 integration). 0 clippy warnings. 0 failures.
- Grade: A (maintained). v0.3.2.
- CORS monitoring: New check type added for monitoring CORS headers on targets.
- New dependency advisories (action items):
  - aws-lc-sys 0.38.0 (RUSTSEC-2026-0044 + -0048, severity 7.4 HIGH): upgrade to 0.39.0 via `cargo update -p aws-lc-sys`
  - rustls-webpki 0.103.9 (RUSTSEC-2026-0049): upgrade to 0.103.10 via `cargo update -p rustls-webpki`
  - paste unmaintained (RUSTSEC-2024-0436): upstream via rmcp, warning only
- Mandatory surprise: None. Previous surprises (rate limiter relaxed ordering, `write!().unwrap()` infallibility) still valid.
- No new code findings. All previous items remain resolved.

### DNS/Route stale data fix (2026-03-25)
- Test count: 352 (unchanged). 0 clippy warnings.
- Config: Switched all 4 Cloudflare-proxied DNS records from `expected = ["IP"]` to `expected = []` (resolution-only). DNS checks were always failing because Cloudflare returns rotating proxy IPs, not the origin IP.
- API filtering: `route_status` and `dns_status` in `/api/status/{target}` now filtered to only entries matching current config. Stale routes (e.g. `/docs/about`, `/signup`) and stale DNS records no longer appear in API responses.
- DB pruning: Added `prune_stale_routes()` and `prune_stale_dns()` to `db.rs`. Called once at task startup in `routes.rs` and `dns.rs` to clean up historical data when config changes. Pruned 890 stale route check rows on first deploy.
- Integration tests: Updated `api_status_includes_route_status` and `api_status_includes_dns_status` to use configs with matching route/DNS entries.
- Deployed to hetzner: v0.3.2 binary + updated config.
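
The resolution-only behaviour described in the Config item can be sketched as below. This is a minimal illustration, not the project's code: `DnsOutcome` and `evaluate_dns` are hypothetical names, and the rule that every expected address must appear in the answer is an assumption.

```rust
use std::net::IpAddr;

/// Result of comparing a DNS answer with a record's `expected` list.
#[derive(Debug, PartialEq)]
pub enum DnsOutcome {
    Pass,
    NoAnswer,
    Mismatch,
}

/// Hypothetical evaluation rule (an assumption, not PoM's actual logic):
/// an empty `expected` list means "resolution-only", so any non-empty
/// answer passes; otherwise every expected address must be in the answer.
pub fn evaluate_dns(expected: &[IpAddr], resolved: &[IpAddr]) -> DnsOutcome {
    if resolved.is_empty() {
        return DnsOutcome::NoAnswer;
    }
    if expected.is_empty() {
        return DnsOutcome::Pass;
    }
    if expected.iter().all(|ip| resolved.contains(ip)) {
        DnsOutcome::Pass
    } else {
        DnsOutcome::Mismatch
    }
}
```

With a proxied record, the answer contains a proxy address rather than the origin, so a non-empty `expected` list yields `Mismatch` while an empty one yields `Pass`.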

### Eighth audit (2026-03-18, Run 9 cross-project)
- Test count: 344 (unchanged). 0 clippy warnings.
- Grade: A (maintained). v0.3.1 (deployed 2026-03-18).
- Dashboard UI shipped. Per-test tracking, regression detection, duration drift.
- cli/ directory module split completed (1,035-line cli.rs -> 8 files).
- Observations (pre-existing, not regressions):
  - Mutex `.unwrap()` in rate limiter (api.rs:41): if a thread panics while holding the lock, subsequent calls panic. Impact: LOW (rate limiter only, not core logic). Design choice: acceptable for monitoring tool.
  - `serde_json::to_value(d).unwrap_or_default()` in API details field: silently becomes null on serialization failure. Impact: LOW, safe fallback.
- No new findings requiring action. Grade maintained at A.
- Mandatory surprise: Rate limiter uses `fetch_add` with Relaxed ordering and can allow up to `max_per_window`+1 requests due to a check-then-increment race. Known trade-off of lock-free rate limiting, documented.
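
The check-then-increment race in the last item can be illustrated with a small counter. This is a sketch under assumptions: `WindowCounter` and its methods are hypothetical and do not reproduce the code in api.rs. The overshoot comes from the gap between the two steps, so a stronger memory ordering on its own would not remove it.

```rust
use std::sync::atomic::{AtomicU32, Ordering};

/// Hypothetical per-window request counter (not the project's `RateLimiter`).
pub struct WindowCounter {
    count: AtomicU32,
    max_per_window: u32,
}

impl WindowCounter {
    pub fn new(max_per_window: u32) -> Self {
        Self { count: AtomicU32::new(0), max_per_window }
    }

    /// Step 1 of the racy pattern: read the counter and compare it to the limit.
    pub fn check(&self) -> bool {
        self.count.load(Ordering::Relaxed) < self.max_per_window
    }

    /// Step 2 of the racy pattern: increment after the check has passed.
    /// Returns the counter value after the increment.
    pub fn increment(&self) -> u32 {
        self.count.fetch_add(1, Ordering::Relaxed) + 1
    }

    /// Single-step variant: the decision uses the value `fetch_add` returned,
    /// so two callers can never both observe the same "last free slot".
    pub fn try_admit(&self) -> bool {
        self.count.fetch_add(1, Ordering::Relaxed) < self.max_per_window
    }
}
```

If two callers both run `check()` while the counter sits at `max_per_window - 1`, both pass and both increment, which admits one request more than the limit. `try_admit()` trades that overshoot for a counter that keeps growing past the limit until the window resets.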

### Fifth audit (2026-03-16, Run 6 cross-project)
- Test count: 238 -> 344 (220 unit + 124 integration, +106 tests)
- Grade: A (maintained). No new findings above LOW.
- Source LOC: 10,113 (up from ~3.5K)
- Clippy: 2 warnings (collapsible_if in cli.rs, LOW)
- Production unwraps: 76 total (64 infallible `write!` on String, 12 safe-by-construction). Effectively zero risky unwraps.
- Mandatory surprise: The `write!().unwrap()` pattern is provably infallible, so it is actually fine.
- Previous items verified: All previous remediated items confirmed intact.
- Note: cli.rs at 1,036 lines, approaching the 500-line branching guideline but mostly flat match arms.
- Infrastructure check: Blocked by Tailscale SSH re-authentication. Deferred.
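
For readers unfamiliar with the `write!` on `String` pattern counted above, here is a minimal sketch. `summary_line` is a hypothetical helper, not a function from the codebase.

```rust
use std::fmt::Write;

/// Hypothetical helper that formats a one-line summary into a `String`.
pub fn summary_line(target: &str, passed: u32, failed: u32) -> String {
    let mut out = String::new();
    // `String`'s `fmt::Write` impl only appends to the buffer and never
    // returns `Err`, so this unwrap cannot fire as long as the arguments'
    // `Display` impls do not report an error (integers and `&str` never do).
    write!(out, "{target}: {passed} passed, {failed} failed").unwrap();
    out
}
```

The same reasoning does not carry over to `std::io::Write` targets such as files or sockets, where `write!` can fail and the result should be handled.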

### Fourth audit remediation (2026-03-14)
- Grade: A- -> A. All remaining findings resolved.
- Test count: 229 -> 238 (+9 integration tests)
- Graceful shutdown: Replaced `handle.abort()` with CancellationToken + `tokio::select!` in all task loops. API server uses `with_graceful_shutdown`. 5s grace period on SIGINT/SIGTERM.
- Task panic detection: 60s watchdog checks `JoinHandle::is_finished()` on all background tasks.
- Rate limiting: Fixed-window 60 req/min middleware on authenticated API routes. Custom `RateLimiter` struct.
- Self-monitoring: `GET /api/health` endpoint (public, no auth) returns `{"status":"operational","version":"..."}`.
- Integration tests: 5 check_health tests (mock axum servers: operational, degraded, unreachable, expectations pass/fail), 1 check_tls test (self-signed cert via rcgen), 2 /api/health tests, 1 rate limiter test.
- Deploy config cleanup: Removed redundant htpy `expected_routes` (duplicated health check URL).
- Dependency: Added `tokio-util` for CancellationToken.
- Cold spots: 0 remaining (was 3). All previous architectural and testing gaps closed.
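
As an illustration of the fixed-window idea in the Rate limiting item, here is a std-only sketch. It is not the project's `RateLimiter`: the type name, the injected `now` parameter (used so the behaviour can be tested without sleeping) and the choice to recover from a poisoned lock are all assumptions.

```rust
use std::sync::Mutex;
use std::time::{Duration, Instant};

/// Hypothetical fixed-window limiter: at most `max_per_window` requests
/// are admitted per `window`, counted from the start of the current window.
pub struct FixedWindowLimiter {
    max_per_window: u32,
    window: Duration,
    // (start of the current window, requests admitted in it)
    state: Mutex<(Instant, u32)>,
}

impl FixedWindowLimiter {
    pub fn new(max_per_window: u32, window: Duration, now: Instant) -> Self {
        Self { max_per_window, window, state: Mutex::new((now, 0)) }
    }

    /// Returns true if a request arriving at `now` is admitted.
    pub fn allow_at(&self, now: Instant) -> bool {
        // Take the inner value even if another thread panicked while
        // holding the lock, instead of propagating the panic.
        let mut state = self.state.lock().unwrap_or_else(|e| e.into_inner());
        // Start a fresh window once the current one has elapsed.
        if now.saturating_duration_since(state.0) >= self.window {
            *state = (now, 0);
        }
        if state.1 < self.max_per_window {
            state.1 += 1;
            true
        } else {
            false
        }
    }
}
```

A 60 req/min limiter would be built with `FixedWindowLimiter::new(60, Duration::from_secs(60), Instant::now())`. Fixed windows are simple but allow a burst of up to twice the limit across a window boundary, which is usually acceptable for an internal monitoring API.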

### Third audit (2026-03-13, pre-launch skeptical lens)
- Grade: A -> A-. Postmark API token in plaintext deployment configs is a real issue.
- Test count: 56 -> 187 (+131 tests)
- New findings: Plaintext API token, no API auth, no peer mesh auth, no integration tests for core functions, no self-monitoring.
- 38 unwraps in non-test code, all verified safe (write to String or guarded by prior checks).

Post-audit remediation (2026-03-13):
- All 3 critical/medium findings resolved: Postmark token to env var, API bearer auth (5 tests), peer mesh auth
- 2 low findings resolved: SSH filter validation, peer UUID mismatch rejection
- Test count: 187 -> 195 (+8 tests)
- Documentation upgraded to A: All struct fields documented (HealthSnapshot, HealthStatus, HealthDetails, TestRun, TestStaleness, PeerStatus, OnMissing, all config types, all API response types). All 8 error variants documented. 11 config defaults with rationale comments. prune_old_records return tuple documented. description.md rewritten, architecture.md created (191 lines), README created (62 lines).
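
The bearer-auth remediation can be pictured with the sketch below. It is an assumption about the general shape only: `bearer_token` and `is_authorized` are hypothetical names, and a real service would run this check inside its HTTP middleware with the expected token loaded from an environment variable.

```rust
/// Extracts the token from an `Authorization` header value of the form
/// `Bearer <token>`. Returns `None` for other schemes or an empty token.
pub fn bearer_token(header_value: &str) -> Option<&str> {
    let token = header_value.strip_prefix("Bearer ")?.trim();
    if token.is_empty() { None } else { Some(token) }
}

/// Returns true only when a header is present and carries the expected token.
pub fn is_authorized(header_value: Option<&str>, expected: &str) -> bool {
    match header_value.and_then(|h| bearer_token(h)) {
        Some(token) => bytes_equal_no_early_exit(token.as_bytes(), expected.as_bytes()),
        None => false,
    }
}

/// Compares two byte strings without stopping at the first differing byte.
fn bytes_equal_no_early_exit(a: &[u8], b: &[u8]) -> bool {
    if a.len() != b.len() {
        return false;
    }
    a.iter().zip(b).fold(0u8, |acc, (x, y)| acc | (x ^ y)) == 0
}
```

Because `bearer_token` rejects empty tokens, an unset (empty) expected token can never be matched, which fails closed rather than open.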

### Observability Upgrade (2026-03-13)
- Observability: A- -> A
- Added 57 `#[instrument(skip_all)]` annotations across 9 files: db.rs (28), alerts.rs (9), tools/mod.rs (8), tools/health.rs (5), tools/tests.rs (3), checks/http.rs (1), checks/tls.rs (1), checks/ssh.rs (1), peer.rs (1)
- Added Multithreaded forum as monitoring target: `pom-astra.toml` (localhost:3400), `pom-hetzner.toml` (Tailscale IP)
- Added test runner targets for GO, BB, AF, SK to `pom-astra.toml`
- All 208 tests pass. `cargo check` passes clean.

### Adversarial Test Audit (2026-03-13)

Goal: Write tests that try to break the system. Find edge cases, race conditions, boundary conditions, and logic errors.

Results:
- Test count: 195 -> 208 (+13 tests)
- CRITICAL fix: Alert cooldown key mismatch. `record_alert` used `target` but lookup used `alert_key` (`"health:{target}"`), so cooldowns never matched and alerts fired every check. Fixed by using `alert_key` consistently.
- HIGH fix: TLS expiry check inconsistent at day boundary. Time-of-day comparison could cause flapping. Changed to `date_naive()` comparison for stable day-level logic.
- HIGH fix: UUID mismatch left stale peer state. Now resets state, clears failures, and persists via `update_peer_identity()` to prevent showing stale data after a peer identity change.
- HIGH fix: `prune_old_records` had no guard for days <= 0 and could delete all records. Added early return for `days <= 0` (no-op).
- HIGH fix: SSH timeout ignored config value. `ConnectTimeout=10` was hardcoded in SSH args. Changed to use `config.timeout_secs`.
- Added `rcgen` dev dependency for TLS cert generation in tests.
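
The cooldown key mismatch is easy to reproduce in miniature. The sketch below is illustrative only: `AlertCooldowns`, `health_alert_key` and the in-memory map are assumptions (the source only names `record_alert`, `target` and `alert_key`), and timestamps are plain Unix seconds.

```rust
use std::collections::HashMap;

/// Hypothetical in-memory cooldown tracker keyed by alert key.
pub struct AlertCooldowns {
    cooldown_secs: u64,
    last_sent: HashMap<String, u64>, // alert key -> Unix seconds of last alert
}

/// Builds the key in one place so recording and lookup cannot drift apart.
pub fn health_alert_key(target: &str) -> String {
    format!("health:{target}")
}

impl AlertCooldowns {
    pub fn new(cooldown_secs: u64) -> Self {
        Self { cooldown_secs, last_sent: HashMap::new() }
    }

    /// Remembers that an alert for `alert_key` was sent at `now`.
    pub fn record_alert(&mut self, alert_key: &str, now: u64) {
        self.last_sent.insert(alert_key.to_string(), now);
    }

    /// True while the last alert for `alert_key` is younger than the cooldown.
    pub fn in_cooldown(&self, alert_key: &str, now: u64) -> bool {
        match self.last_sent.get(alert_key) {
            Some(&sent) => now.saturating_sub(sent) < self.cooldown_secs,
            None => false,
        }
    }
}
```

Recording under the bare target name (`"api"`) and looking up under `"health:api"` never finds an entry, so `in_cooldown` always returns false and every check alerts. Passing the same key to both calls restores the cooldown.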

### Second audit (2026-03-11)
| Change | Detail |
|--------|--------|
| Tests | +39 tests (17 -> 56). 28 unit + 28 integration. Tests/KLOC: 5.8 -> 18.4. |
| Lock contention | Addressed in both peer.rs (heartbeat handlers) and api.rs (status/mesh handlers). Data collected under lock, DB writes after release. |
| DB indexes | 4 indexes added: `health_checks(target, id DESC)`, `health_checks(target, checked_at)`, `test_runs(target, id DESC)`, `peer_heartbeats(peer_name, id DESC)`. |
| Clippy | 4 warnings -> 0. Used Rust 2024 let chains instead of nested if-let. |
| Type safety | `PeerConfig.on_missing` changed from `String` to `OnMissing` enum with serde deserialization. |
| Module docs | Added `//!` docs to db.rs, config.rs, peer.rs, types.rs, lib.rs. |
| Error handling | `/api/peer/status` fetch failures now logged at debug level instead of silenced. |
| Prune | `prune_old_records` now returns 3-tuple including peer heartbeat count. |
| Code extraction | `HealthStatus::icon()` method eliminates 3 repeated match blocks. |
| HTTP checks | Response classification extracted into pure functions for testability. |
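
The "data collected under lock, DB writes after release" pattern from the Lock contention row can be sketched as follows. Everything here is hypothetical: `PeerSnapshot`, the `HashMap` state and the `Vec<String>` standing in for the database are illustration only.

```rust
use std::collections::HashMap;
use std::sync::Mutex;

/// Hypothetical copy of the per-peer data a handler needs.
#[derive(Clone, Debug, PartialEq)]
pub struct PeerSnapshot {
    pub name: String,
    pub failures: u32,
}

/// Copies the shared state while holding the lock, then releases it
/// before any slow work happens.
pub fn snapshot_peers(state: &Mutex<HashMap<String, u32>>) -> Vec<PeerSnapshot> {
    let mut peers: Vec<PeerSnapshot> = {
        let guard = state.lock().unwrap_or_else(|e| e.into_inner());
        let copied: Vec<PeerSnapshot> = guard
            .iter()
            .map(|(name, failures)| PeerSnapshot { name: name.clone(), failures: *failures })
            .collect();
        copied
    }; // `guard` is dropped here, so the lock is free again
    peers.sort_by(|a, b| a.name.cmp(&b.name));
    peers
}

/// Stand-in for the slow database write; runs without holding the lock.
pub fn persist(snapshots: &[PeerSnapshot], sink: &mut Vec<String>) {
    for s in snapshots {
        sink.push(format!("{}={}", s.name, s.failures));
    }
}
```

The cost is one clone of the data per request. In exchange, other handlers are never blocked behind a database write.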

## Metrics Over Time

| Audit Date | LOC | Rust Files | Tests | Tests/KLOC | Clippy Warnings | Cold Spots | Overall |
|------------|-----|-----------|-------|-----------|----------------|------------|---------|
| 2026-03-10 | 2,934 | 15 | 17 | 5.8 | 4 | 8 | B+ |
| 2026-03-11 | 3,039 | 14 | 56 | 18.4 | 0 | 3 | A |
| 2026-03-13 | ~3K | ~14 | 208 | ~69 | 0 | 3 | A- |
| 2026-03-14 | ~3.5K | ~16 | 238 | ~68 | 0 | 0 | A |
| 2026-03-16 | 10.1K | 23 | 344 | ~34 | 2 | 0 | A |
| 2026-03-18 | 10.1K | 23 | 344 | ~34 | 0 | 0 | A |
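
The Tests/KLOC column appears to be the test count divided by lines of code in thousands, rounded to one decimal. A small sketch of that arithmetic follows. `tests_per_kloc` is a hypothetical helper, and the rounding rule is inferred from the first two rows.

```rust
/// Tests per thousand lines of code, rounded to one decimal place.
/// `loc` must be non-zero; a zero value would yield infinity or NaN.
pub fn tests_per_kloc(tests: u32, loc: u32) -> f64 {
    let raw = f64::from(tests) / (f64::from(loc) / 1000.0);
    (raw * 10.0).round() / 10.0
}
```

For the first two rows this gives 17 / 2.934 ≈ 5.8 and 56 / 3.039 ≈ 18.4, matching the table.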