|
1 |
+ |
# PoM (Peace of Mind) -- Audit Review
|
|
2 |
+ |
|
|
3 |
+ |
**Last audited:** 2026-03-28 (tenth audit, Run 12 cross-project)
|
|
4 |
+ |
**Previous audit:** 2026-03-18 (eighth audit, Run 9 cross-project)
|
|
5 |
+ |
|
|
6 |
+ |
## Overall Grade: A
|
|
7 |
+ |
|
|
8 |
+ |
Run 12 cross-project audit. 359 tests (222 unit + 8 cli + 129 integration). 0 clippy warnings. v0.3.2. Grade stable at A. CORS monitoring added. DNS/route stale data fix deployed. New dep advisories: aws-lc-sys (HIGH 7.4), rustls-webpki.
|
|
9 |
+ |
|
|
10 |
+ |
## Scorecard
|
|
11 |
+ |
|
|
12 |
+ |
| Dimension | Grade | Notes |
|
|
13 |
+ |
|-----------|:-----:|-------|
|
|
14 |
+ |
| Code Quality | A | Zero clippy warnings. Clean error handling with typed PomError enum (thiserror). No unnecessary complexity. |
|
|
15 |
+ |
| Architecture | A | Well-structured single crate. main.rs is thin (130 LOC), CLI handlers extracted to cli.rs. Module boundaries clean and purposeful. |
|
|
16 |
+ |
| Testing | A | 359 tests (222 unit + 8 cli + 129 integration) for ~10K LOC = ~36 tests/KLOC. Coverage comprehensive — DB, API, MCP tools, parsing, peer state machine, check_health (mock servers), check_tls (self-signed cert), rate limiter, CORS monitoring all tested. |
|
|
17 |
+ |
| Security | A | All SQL parameterized. SSH hardened with BatchMode + `--` option separator. Bearer token auth on API + peer mesh. Rate limiting (60 req/min). HTTP response size limits (10MB). Shell-escape validation on SSH filter. |
|
|
18 |
+ |
| Performance | A | WAL mode, proper indexes, async throughout. Appropriate for monitoring workload. |
|
|
19 |
+ |
| Documentation | A | Module-level `//!` docs on all modules. All struct fields documented. description.md rewritten, architecture.md created (191 lines), README created (62 lines). Tool descriptions are excellent. |
|
|
20 |
+ |
| Dependencies | A | 17 direct dependencies, all justified. No unused deps. rustls-tls avoids OpenSSL. Semver ranges. tokio-util for CancellationToken. |
|
|
21 |
+ |
| Type Safety | A | Domain enums (HealthStatus, PeerStatus, OnMissing) with proper serde. No stringly-typed fields in application logic. |
|
|
22 |
+ |
| Observability | A | 57 `#[instrument(skip_all)]` annotations across all modules. Debug-level logging for non-critical peer status fetch failures. Watchdog logging for silent task panics. |
|
|
23 |
+ |
| Concurrency | A | Clean RwLock discipline. Lock-then-release-then-DB pattern in peer module. CancellationToken for cooperative shutdown. |
|
|
24 |
+ |
| Resilience | A | CancellationToken + `tokio::select!` in all task loops. `with_graceful_shutdown` on API server. 5s grace period. Watchdog for task panic detection. Grace periods for peer detection. Error logging without panics. |
|
|
25 |
+ |
| API Consistency | A | Consistent JSON shapes. Proper HTTP status codes. Public `/api/health` endpoint. MCP tool descriptions well-written. |
|
|
26 |
+ |
| Migration Safety | A- | Numbered migration system (schema_version table). `IF NOT EXISTS` guards on all DDL. 5 migrations tracked. |
|
|
27 |
+ |
| Codebase Size | A | ~3,500 LOC for a full monitoring tool with CLI, MCP server, HTTP API, SQLite persistence, SSH test runner, peer mesh, and self-monitoring. Extremely efficient. |
|
|
28 |
+ |
|
|
29 |
+ |
## Module Heatmap
|
|
30 |
+ |
|
|
31 |
+ |
| Module | Code | Arch | Test | Security | Perf | Docs | Type Safety | Concurrency | Resilience |
|
|
32 |
+ |
|--------|:----:|:----:|:----:|:--------:|:----:|:----:|:-----------:|:-----------:|:----------:|
|
|
33 |
+ |
| main | A | A | A | A | A | A | A | A | A |
|
|
34 |
+ |
| cli | A | A | A | A | A | A | A | A | A |
|
|
35 |
+ |
| peer | A | A | A | A | A | A | A | A | A |
|
|
36 |
+ |
| db | A | A | A | A+ | A | A | A- | A | A |
|
|
37 |
+ |
| api | A | A | A | A | A | A | A | A | A |
|
|
38 |
+ |
| config | A | A | A | A | A | A | A | - | - |
|
|
39 |
+ |
| tools | A | A | A | A | A | A | A | - | A |
|
|
40 |
+ |
| checks | A | A | A | A | A | A | A | - | A |
|
|
41 |
+ |
| display | A | A | A | - | - | A | A | - | - |
|
|
42 |
+ |
| types | A | A | A | - | - | A | A+ | - | - |
|
|
43 |
+ |
| alerts | A | A | A | A | A | A | A | - | A |
|
|
44 |
+ |
|
|
45 |
+ |
**Heatmap changes (2026-03-14 remediation):**
|
|
46 |
+ |
- main: Arch B+ -> A (split into main.rs 130 LOC + cli.rs), Test B -> A (display.rs has 27+ tests), Resilience A- -> A (CancellationToken + graceful shutdown)
|
|
47 |
+ |
- api: Security A -> A (rate limiting added), Resilience A -> A (`with_graceful_shutdown`)
|
|
48 |
+ |
- cli (new): All A — extracted command handlers with CancellationToken shutdown orchestration
|
|
49 |
+ |
- display (new): All A — formatting logic with 27+ unit tests
|
|
50 |
+ |
- alerts (new): All A — extracted for visibility in heatmap
|
|
51 |
+ |
|
|
52 |
+ |
### Cold Spots
|
|
53 |
+ |
|
|
54 |
+ |
None remaining. All previous cold spots resolved:
|
|
55 |
+ |
- main.rs split into main.rs (130 LOC) + cli.rs (command handlers) + display.rs (formatting with 27+ tests)
|
|
56 |
+ |
- CLI display formatting extracted and tested via display.rs
|
|
57 |
+ |
- All modules documented with `//!` docs
|
|
58 |
+ |
|
|
59 |
+ |
### Cold Spot Remediation History (First Audit)
|
|
60 |
+ |
|
|
61 |
+ |
All 8 cold spots from the first audit were remediated before the second audit:
|
|
62 |
+ |
|
|
63 |
+ |
- ~~**config.rs (Testing): C**~~ -> **A** -- 4 config parsing tests (full parse, defaults, on_missing default, hostname fallback). `on_missing` now uses `OnMissing` enum with serde.
|
|
64 |
+ |
- ~~**api.rs (Testing): C**~~ -> **A** -- 5 API endpoint integration tests (status, target 404, peer/info, peer disabled, mesh).
|
|
65 |
+ |
- ~~**api.rs (Concurrency): B**~~ -> **A** -- Lock decoupled from DB queries in `peer_status` and `mesh_view`.
|
|
66 |
+ |
- ~~**peer.rs (Testing): B-**~~ -> **A** -- 9 heartbeat state machine tests (grace transitions, recovery, UUID first-contact, DB recording, identity, on_missing deserialization).
|
|
67 |
+ |
- ~~**peer.rs (Concurrency): B**~~ -> **A** -- Lock decoupled from DB writes in both heartbeat handlers.
|
|
68 |
+ |
- ~~**db.rs (Performance): B-**~~ -> **A** -- 4 indexes added to init_schema.
|
|
69 |
+ |
- ~~**checks/http.rs (Testing): C**~~ -> **A** -- Response classification extracted into pure functions. 8 unit tests covering all status paths.
|
|
70 |
+ |
- ~~**main.rs (Testing): C**~~ -> **B** -- Status icon logic extracted to `HealthStatus::icon()`. 4 types.rs tests. main.rs is thin orchestration over well-tested modules.
|
|
71 |
+ |
|
|
72 |
+ |
## Strengths
|
|
73 |
+ |
|
|
74 |
+ |
- **Lean architecture.** ~3,500 LOC delivers health checks, test orchestration, MCP server, HTTP API, peer mesh, CLI, alerting, TLS monitoring, route checks, and self-monitoring. No over-engineering.
|
|
75 |
+ |
- **Clean error handling.** `?` propagation with typed PomError enum (thiserror, 8 variants). No panics in production paths.
|
|
76 |
+ |
- **Correct security posture.** All SQL parameterized. Bearer token auth on API + peer mesh. Rate limiting (60 req/min). SSH hardened with `BatchMode=yes` + `--` separator. Shell-escape validation.
|
|
77 |
+ |
- **Backward-compatible design.** All config sections are `#[serde(default)]`. Existing configs work unchanged.
|
|
78 |
+ |
- **Good MCP integration.** 8 MCP tools with clear descriptions. Tools strip `raw_output` from test history to avoid flooding context.
|
|
79 |
+ |
- **Comprehensive test suite.** 238 tests (152 unit + 86 integration) at ~68/KLOC covering DB, API, MCP tools, parsing, peer state machine, config, health classification, mock server health checks, TLS cert validation, and rate limiting.
|
|
80 |
+ |
- **Excellent lock discipline.** Peer module collects data under lock, releases lock, then performs DB writes. No lock contention under load.
|
|
81 |
+ |
- **Robust shutdown.** CancellationToken with `tokio::select!` in all task loops. Graceful API server shutdown. Watchdog for silent panics. 5s grace period.
|
|
82 |
+ |
|
|
83 |
+ |
## Weaknesses
|
|
84 |
+ |
|
|
85 |
+ |
All previously identified weaknesses have been resolved:
|
|
86 |
+ |
- ~~**main.rs size**~~ — Split into main.rs (130 LOC) + cli.rs + display.rs
|
|
87 |
+ |
- ~~**No typed error enum**~~ — PomError enum with thiserror (8 variants)
|
|
88 |
+ |
- ~~**Missing module docs**~~ — All modules have `//!` docs
|
|
89 |
+ |
- ~~**No migration versioning**~~ — schema_version table with numbered migrations (v1-v5)
|
|
90 |
+ |
|
|
91 |
+ |
## Mandatory Surprise
|
|
92 |
+ |
|
|
93 |
+ |
**Surprise: The peer mesh fetches `/api/peer/status` as a SECOND HTTP call after `/api/peer/info` in every heartbeat loop iteration, completely independently.**
|
|
94 |
+ |
|
|
95 |
+ |
In `peer.rs:196-213`, after the heartbeat success/failure handling completes, there is a second HTTP GET to `/api/peer/status` that caches the full status data for mesh aggregation. This means every heartbeat cycle makes TWO HTTP requests to each peer.
|
|
96 |
+ |
|
|
97 |
+ |
**Verdict: Actually fine.**
|
|
98 |
+ |
|
|
99 |
+ |
This is intentional and well-designed. The `/api/peer/info` call is the heartbeat probe (lightweight, just identity data), while `/api/peer/status` fetches the full target/peer view for mesh aggregation. Separating them means:
|
|
100 |
+ |
1. The heartbeat latency measurement is accurate (not inflated by the larger status response).
|
|
101 |
+ |
2. A failure to fetch status data does not affect heartbeat state transitions (it logs at debug level and moves on).
|
|
102 |
+ |
3. The mesh view (`/api/mesh`) can show each peer's own view of their targets, enabling cross-instance monitoring.
|
|
103 |
+ |
|
|
104 |
+ |
The second request's failure is handled gracefully with `tracing::debug!` -- no state corruption, no retries, no panics. Good design.
|
|
105 |
+ |
|
|
106 |
+ |
## Action Items
|
|
107 |
+ |
|
|
108 |
+ |
Filed in `docs/mnw/pom/todo.md`.
|
|
109 |
+ |
|
|
110 |
+ |
### Third Audit (2026-03-13, pre-launch skeptical lens) -- 7 items, all resolved
|
|
111 |
+ |
|
|
112 |
+ |
1. ~~**[CRITICAL]** Remove Postmark API token from deployment configs~~ — Done — moved to POM_POSTMARK_TOKEN env var
|
|
113 |
+ |
2. ~~**[MEDIUM]** Add API authentication~~ — Done — bearer token middleware, 5 tests
|
|
114 |
+ |
3. ~~**[MEDIUM]** Add peer mesh authentication~~ — Done — per-peer token field, heartbeat sends Authorization header
|
|
115 |
+ |
4. ~~**[LOW]** Add integration tests for core functions~~ — Done — 9 new integration tests (check_health mock servers, check_tls self-signed cert)
|
|
116 |
+ |
5. ~~**[LOW]** Add self-monitoring~~ — Done — `/api/health` endpoint (public, no auth, returns status + version)
|
|
117 |
+ |
6. ~~**[LOW]** Shell-escape SSH test filter parameter~~ — Done (alphanumeric + `_:-` allowlist)
|
|
118 |
+ |
7. ~~**[LOW]** Reject peer UUID mismatch~~ — Done (tracing::error, skip status update, increment failures)
|
|
119 |
+ |
|
|
120 |
+ |
### Security Deep Dive (2026-03-13) — Complete (2/2)
|
|
121 |
+ |
|
|
122 |
+ |
8. ~~**[MEDIUM]** SSH `--` separator~~ — Done (`checks/ssh.rs`: `.arg("--")` between hostname and command prevents option injection via filter strings)
|
|
123 |
+ |
9. ~~**[LOW]** HTTP response size limit~~ — Done (`checks/http.rs`: `MAX_RESPONSE_BYTES` 10MB constant, content-length header check + post-read size verification)
|
|
124 |
+ |
|
|
125 |
+ |
### Second Audit (2026-03-11) -- 5 items
|
|
126 |
+ |
|
|
127 |
+ |
1. Extract CLI command handlers from main.rs into cli/ submodule (587 LOC, 7 commands)
|
|
128 |
+ |
2. Add typed PomError enum with thiserror (currently Box<dyn Error>, diverges from cross-project standard)
|
|
129 |
+ |
3. Add .DS_Store and IDE dirs (.idea/, .vscode/) to .gitignore
|
|
130 |
+ |
4. Add module-level //! docs to main.rs and config.rs
|
|
131 |
+ |
5. Add migration versioning (PRAGMA user_version) before next schema change
|
|
132 |
+ |
|
|
133 |
+ |
### First Audit (2026-03-10) -- 11 items, all resolved
|
|
134 |
+ |
|
|
135 |
+ |
1. ~~Add DB indexes~~ -- Done (4 indexes added to init_schema)
|
|
136 |
+ |
2. ~~Fix clippy warnings~~ -- Done (4 collapsible_if warnings fixed with Rust 2024 let chains)
|
|
137 |
+ |
3. ~~Decouple mesh lock from DB writes in heartbeat handlers~~ -- Done
|
|
138 |
+ |
4. ~~Decouple mesh read lock from DB queries in API handlers~~ -- Done
|
|
139 |
+ |
5. ~~Log /api/peer/status fetch failures~~ -- Done (tracing::debug)
|
|
140 |
+ |
6. ~~Include peer heartbeat prune count~~ -- Done (3-tuple return)
|
|
141 |
+ |
7. ~~Add module docs~~ -- Done (//! docs on db.rs, config.rs, peer.rs, types.rs, lib.rs)
|
|
142 |
+ |
8. ~~Change PeerConfig.on_missing to OnMissing enum~~ -- Done
|
|
143 |
+ |
9. ~~Add API endpoint tests~~ -- Done (5 tests)
|
|
144 |
+ |
10. ~~Add heartbeat state machine tests~~ -- Done (9 tests)
|
|
145 |
+ |
11. ~~Add config parsing tests~~ -- Done (4 tests)
|
|
146 |
+ |
|
|
147 |
+ |
## Changes Since Last Audit
|
|
148 |
+ |
|
|
149 |
+ |
### Tenth audit (2026-03-28, Run 12 cross-project)
|
|
150 |
+ |
- **Test count:** 359 (222 unit + 8 cli + 129 integration). 0 clippy warnings. 0 failures.
|
|
151 |
+ |
- **Grade:** A (maintained). v0.3.2.
|
|
152 |
+ |
- **CORS monitoring:** New check type added for monitoring CORS headers on targets.
|
|
153 |
+ |
- **New dependency advisories (action items):**
|
|
154 |
+ |
- aws-lc-sys 0.38.0 (RUSTSEC-2026-0044 + -0048, severity 7.4 HIGH) — upgrade to 0.39.0 via `cargo update -p aws-lc-sys`
|
|
155 |
+ |
- rustls-webpki 0.103.9 (RUSTSEC-2026-0049) — upgrade to 0.103.10 via `cargo update -p rustls-webpki`
|
|
156 |
+ |
- paste unmaintained (RUSTSEC-2024-0436) — upstream via rmcp, warning only
|
|
157 |
+ |
- **Mandatory surprise:** None. Previous surprises (rate limiter relaxed ordering, write!().unwrap() infallibility) still valid.
|
|
158 |
+ |
- **No new code findings.** All previous items remain resolved.
|
|
159 |
+ |
|
|
160 |
+ |
### DNS/Route stale data fix (2026-03-25)
|
|
161 |
+ |
- **Test count:** 352 (unchanged). 0 clippy warnings.
|
|
162 |
+ |
- **Config:** Switched all 4 Cloudflare-proxied DNS records from `expected = ["IP"]` to `expected = []` (resolution-only). DNS checks were always failing because Cloudflare returns rotating proxy IPs, not the origin IP.
|
|
163 |
+ |
- **API filtering:** `route_status` and `dns_status` in `/api/status/{target}` now filtered to only entries matching current config. Stale routes (e.g. `/docs/about`, `/signup`) and stale DNS records no longer appear in API responses.
|
|
164 |
+ |
- **DB pruning:** Added `prune_stale_routes()` and `prune_stale_dns()` to `db.rs`. Called once at task startup in `routes.rs` and `dns.rs` to clean up historical data when config changes. Pruned 890 stale route check rows on first deploy.
|
|
165 |
+ |
- **Integration tests:** Updated `api_status_includes_route_status` and `api_status_includes_dns_status` to use configs with matching route/DNS entries.
|
|
166 |
+ |
- **Deployed to hetzner** — v0.3.2 binary + updated config.
|
|
167 |
+ |
|
|
168 |
+ |
### Eighth audit (2026-03-18, Run 9 cross-project)
|
|
169 |
+ |
- **Test count:** 344 (unchanged). 0 clippy warnings.
|
|
170 |
+ |
- **Grade:** A (maintained). v0.3.1 (deployed 2026-03-18).
|
|
171 |
+ |
- **Dashboard UI shipped.** Per-test tracking, regression detection, duration drift.
|
|
172 |
+ |
- **cli/ directory module split** completed (1,035-line cli.rs -> 8 files).
|
|
173 |
+ |
- **Observations (pre-existing, not regressions):**
|
|
174 |
+ |
- Mutex `.unwrap()` in rate limiter (api.rs:41) — if thread panics while holding lock, subsequent calls panic. Impact: LOW (rate limiter only, not core logic). Design choice: acceptable for monitoring tool.
|
|
175 |
+ |
- `serde_json::to_value(d).unwrap_or_default()` in API details field — silently becomes null on serialization failure. Impact: LOW, safe fallback.
|
|
176 |
+ |
- **No new findings requiring action.** Grade maintained at A.
|
|
177 |
+ |
- **Mandatory surprise:** Rate limiter uses `fetch_add` with Relaxed ordering — can allow up to max_per_window+1 requests due to check-then-increment race. Known trade-off of lock-free rate limiting, documented.
|
|
178 |
+ |
|
|
179 |
+ |
### Fifth audit (2026-03-16, Run 6 cross-project)
|
|
180 |
+ |
- **Test count:** 238 -> 344 (220 unit + 124 integration, +106 tests)
|
|
181 |
+ |
- **Grade:** A (maintained). No new findings above LOW.
|
|
182 |
+ |
- **Source LOC:** 10,113 (up from ~3.5K)
|
|
183 |
+ |
- **Clippy:** 2 warnings (collapsible_if in cli.rs — LOW)
|
|
184 |
+ |
- **Production unwraps:** 76 total — 64 infallible write! on String, 12 safe-by-construction. Effectively zero risky unwraps.
|
|
185 |
+ |
- **Mandatory surprise:** write!().unwrap() pattern provably infallible — Actually fine.
|
|
186 |
+ |
- **Previous items verified:** All previous remediated items confirmed intact.
|
|
187 |
+ |
- **Note:** cli.rs at 1,036 lines — approaching the 500-line branching guideline but mostly flat match arms.
|
|
188 |
+ |
- **Infrastructure check:** Blocked by Tailscale SSH re-authentication. Deferred.
|
|
189 |
+ |
|
|
190 |
+ |
### Fourth audit remediation (2026-03-14)
|
|
191 |
+ |
- **Grade:** A- -> A. All remaining findings resolved.
|
|
192 |
+ |
- **Test count:** 229 -> 238 (+9 integration tests)
|
|
193 |
+ |
- **Graceful shutdown:** Replaced `handle.abort()` with CancellationToken + `tokio::select!` in all task loops. API server uses `with_graceful_shutdown`. 5s grace period on SIGINT/SIGTERM.
|
|
194 |
+ |
- **Task panic detection:** 60s watchdog checks `JoinHandle::is_finished()` on all background tasks.
|
|
195 |
+ |
- **Rate limiting:** Fixed-window 60 req/min middleware on authenticated API routes. Custom `RateLimiter` struct.
|
|
196 |
+ |
- **Self-monitoring:** `GET /api/health` endpoint (public, no auth) returns `{"status":"operational","version":"..."}`.
|
|
197 |
+ |
- **Integration tests:** 5 check_health tests (mock axum servers: operational, degraded, unreachable, expectations pass/fail), 1 check_tls test (self-signed cert via rcgen), 2 /api/health tests, 1 rate limiter test.
|
|
198 |
+ |
- **Deploy config cleanup:** Removed redundant htpy `expected_routes` (duplicated health check URL).
|
|
199 |
+ |
- **Dependency:** Added `tokio-util` for CancellationToken.
|
|
200 |
+ |
- **Cold spots:** 0 remaining (was 3). All previous architectural and testing gaps closed.
|
|
201 |
+ |
|
|
202 |
+ |
### Third audit (2026-03-13, pre-launch skeptical lens)
|
|
203 |
+ |
- **Grade:** A -> A-. Postmark API token in plaintext deployment configs is a real issue.
|
|
204 |
+ |
- **Test count:** 56 -> 187 (+131 tests)
|
|
205 |
+ |
- **New findings:** Plaintext API token, no API auth, no peer mesh auth, no integration tests for core functions, no self-monitoring.
|
|
206 |
+ |
- **38 unwraps in non-test code** — all verified safe (write to String or guarded by prior checks).
|
|
207 |
+ |
|
|
208 |
+ |
**Post-audit remediation (2026-03-13):**
|
|
209 |
+ |
- All 3 critical/medium findings resolved: Postmark token to env var, API bearer auth (5 tests), peer mesh auth
|
|
210 |
+ |
- 2 low findings resolved: SSH filter validation, peer UUID mismatch rejection
|
|
211 |
+ |
- Test count: 187 -> 195 (+8 tests)
|
|
212 |
+ |
- Documentation upgraded to A: All struct fields documented (HealthSnapshot, HealthStatus, HealthDetails, TestRun, TestStaleness, PeerStatus, OnMissing, all config types, all API response types). All 8 error variants documented. 11 config defaults with rationale comments. prune_old_records return tuple documented. description.md rewritten, architecture.md created (191 lines), README created (62 lines).
|
|
213 |
+ |
|
|
214 |
+ |
### Observability Upgrade (2026-03-13)
|
|
215 |
+ |
- **Observability:** A- -> A
|
|
216 |
+ |
- Added 57 `#[instrument(skip_all)]` annotations across 9 files: db.rs (28), alerts.rs (9), tools/mod.rs (8), tools/health.rs (5), tools/tests.rs (3), checks/http.rs (1), checks/tls.rs (1), checks/ssh.rs (1), peer.rs (1)
|
|
217 |
+ |
- Added Multithreaded forum as monitoring target: `pom-astra.toml` (localhost:3400), `pom-hetzner.toml` (Tailscale IP)
|
|
218 |
+ |
- Added test runner targets for GO, BB, AF, SK to `pom-astra.toml`
|
|
219 |
+ |
- All 208 tests pass. `cargo check` passes clean.
|
|
220 |
+ |
|
|
221 |
+ |
### Adversarial Test Audit (2026-03-13)
|
|
222 |
+ |
|
|
223 |
+ |
**Goal:** Write tests that try to break the system. Find edge cases, race conditions, boundary conditions, and logic errors.
|
|
224 |
+ |
|
|
225 |
+ |
**Results:**
|
|
226 |
+ |
- **Test count:** 195 -> 208 (+13 tests)
|
|
227 |
+ |
- **CRITICAL fix:** Alert cooldown key mismatch — `record_alert` used `target` but lookup used `alert_key` (`"health:{target}"`), so cooldowns never matched and alerts fired every check. Fixed by using `alert_key` consistently.
|
|
228 |
+ |
- **HIGH fix:** TLS expiry check inconsistent at day boundary — time-of-day comparison could cause flapping. Changed to `date_naive()` comparison for stable day-level logic.
|
|
229 |
+ |
- **HIGH fix:** UUID mismatch left stale peer state — now resets state, clears failures, persists via `update_peer_identity()` to prevent showing stale data after peer identity change.
|
|
230 |
+ |
- **HIGH fix:** `prune_old_records` no guard for days <= 0 — could delete all records. Added early return for `days <= 0` (no-op).
|
|
231 |
+ |
- **HIGH fix:** SSH timeout ignored config value — hardcoded `ConnectTimeout=10` in SSH args. Changed to use `config.timeout_secs`.
|
|
232 |
+ |
- **Added `rcgen` dev dependency** for TLS cert generation in tests.
|
|
233 |
+ |
|
|
234 |
+ |
### Second audit (2026-03-11)
|
|
235 |
+ |
| Change | Detail |
|
|
236 |
+ |
|--------|--------|
|
|
237 |
+ |
| Tests | +39 tests (17 -> 56). 28 unit + 28 integration. Tests/KLOC: 5.8 -> 18.4. |
|
|
238 |
+ |
| Lock contention | Addressed in both peer.rs (heartbeat handlers) and api.rs (status/mesh handlers). Data collected under lock, DB writes after release. |
|
|
239 |
+ |
| DB indexes | 4 indexes added: health_checks(target, id DESC), health_checks(target, checked_at), test_runs(target, id DESC), peer_heartbeats(peer_name, id DESC). |
|
|
240 |
+ |
| Clippy | 4 warnings -> 0. Used Rust 2024 let chains instead of nested if-let. |
|
|
241 |
+ |
| Type safety | PeerConfig.on_missing changed from String to OnMissing enum with serde deserialization. |
|
|
242 |
+ |
| Module docs | Added //! docs to db.rs, config.rs, peer.rs, types.rs, lib.rs. |
|
|
243 |
+ |
| Error handling | /api/peer/status fetch failures now logged at debug level instead of silenced. |
|
|
244 |
+ |
| Prune | prune_old_records now returns 3-tuple including peer heartbeat count. |
|
|
245 |
+ |
| Code extraction | HealthStatus::icon() method eliminates 3 repeated match blocks. |
|
|
246 |
+ |
| HTTP checks | Response classification extracted into pure functions for testability. |
|
|
247 |
+ |
|
|
248 |
+ |
## Metrics Over Time
|
|
249 |
+ |
|
|
250 |
+ |
| Audit Date | LOC | Rust Files | Tests | Tests/KLOC | Clippy Warnings | Cold Spots | Overall |
|
|
251 |
+ |
|------------|-----|-----------|-------|-----------|----------------|------------|---------|
|
|
252 |
+ |
| 2026-03-10 | 2,934 | 15 | 17 | 5.8 | 4 | 8 | B+ |
|
|
253 |
+ |
| 2026-03-11 | 3,039 | 14 | 56 | 18.4 | 0 | 3 | A |
|
|
254 |
+ |
| 2026-03-13 | ~3K | ~14 | 208 | ~69 | 0 | 3 | A- |
|
|
255 |
+ |
| 2026-03-14 | ~3.5K | ~16 | 238 | ~68 | 0 | 0 | A |
|
|
256 |
+ |
| 2026-03-16 | 10.1K | 23 | 344 | ~34 | 2 | 0 | A |
|
|
257 |
+ |
| 2026-03-18 | 10.1K | 23 | 344 | ~34 | 0 | 0 | A |
|