# PoM Architecture

## System Overview

PoM runs in three modes, selected by how it is invoked:

1. CLI mode (`pom health`, `pom test`, `pom status`, etc.) -- runs a single command and exits. Useful for ad-hoc checks and cron jobs.
2. Serve mode (`pom serve`) -- long-running daemon that spawns per-target health check loops, TLS check loops, peer heartbeat tasks, a daily prune task, and an HTTP API server. This is the production deployment mode.
3. MCP server mode (bare `pom` with no subcommand) -- launches an MCP server over stdio for Claude integration. Exposes health checks, test execution, history queries, and mesh status as MCP tools.

All three modes load the same TOML config and connect to the same SQLite database.
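
The mode selection described above can be sketched in a few lines of Rust. This is a minimal illustration, not PoM's actual entry point: `Mode` and `select_mode` are hypothetical names, and the real `main` may well use an argument-parsing crate rather than matching on raw strings.

```rust
/// The three invocation modes described above.
#[derive(Debug, PartialEq)]
enum Mode {
    /// `pom <subcommand>`: run one command and exit.
    Cli(String),
    /// `pom serve`: long-running daemon.
    Serve,
    /// Bare `pom`: MCP server over stdio.
    Mcp,
}

/// Picks a mode from the arguments that follow the binary name.
/// Hypothetical helper, written only to illustrate the dispatch rule.
fn select_mode(args: &[&str]) -> Mode {
    match args.first() {
        // No subcommand at all: MCP server mode.
        None => Mode::Mcp,
        Some(&"serve") => Mode::Serve,
        // Anything else is treated as a one-shot CLI command.
        Some(cmd) => Mode::Cli(cmd.to_string()),
    }
}
```

Under these assumptions, `select_mode(&["serve"])` yields `Mode::Serve`, an empty argument list yields `Mode::Mcp`, and any other first argument is handled as a CLI subcommand.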

## Module Map

| Module | File | Role |
|--------|------|------|
| `main` | `src/main.rs` | Entry point -- parses CLI args, dispatches to CLI handler or MCP server |
| `cli` | `src/cli.rs` | CLI command handlers (health, test, status, history, prune, serve, mesh) |
| `config` | `src/config.rs` | TOML config loading, types for targets/peers/alerts/serve settings |
| `types` | `src/types.rs` | Shared domain types: HealthSnapshot, TestRun, TlsStatus, LatencyStats, TestStaleness |
| `db` | `src/db.rs` | SQLite schema (versioned migrations), all queries for health/tests/alerts/TLS/incidents/peers |
| `api` | `src/api.rs` | Axum HTTP API: status, trends, peer info, mesh view, bearer token auth middleware |
| `alerts` | `src/alerts.rs` | Alerter struct -- sends emails via Postmark on status transitions, with cooldown tracking |
| `peer` | `src/peer.rs` | Peer mesh: identity management, heartbeat loops, grace period state machine, mesh state |
| `display` | `src/display.rs` | Pure formatting functions for CLI output (no I/O) |
| `error` | `src/error.rs` | Typed error enum (PomError) wrapping IO, DB, HTTP, JSON, config errors |
| `checks::http` | `src/checks/http.rs` | HTTP health checker, response classification, expectation validation, latency drift detection, test staleness computation |
| `checks::tls` | `src/checks/tls.rs` | TLS certificate prober -- TCP connect, TLS handshake, x509 leaf cert parsing |
| `checks::ssh` | `src/checks/ssh.rs` | Remote test runner -- executes commands over SSH, captures output |
| `checks::parse` | `src/checks/parse.rs` | CI output parser -- extracts PASS/FAIL steps and cargo test counts |
| `tools` | `src/tools/mod.rs` | MCP server definition (PomServer), tool registration via rmcp |
| `tools::health` | `src/tools/health.rs` | MCP tool implementations for health checks, history, targets, mesh status |
| `tools::tests` | `src/tools/tests.rs` | MCP tool implementations for test execution, history, raw output |

## Data Flow

```
pom.toml (config)
    |
    v
Config::load() --> targets, peers, alerts, serve settings
    |
    v
db::connect() --> SQLite pool (WAL mode, versioned migrations)
    |
    +---> [CLI mode] single command --> check/query --> display --> exit
    |
    +---> [Serve mode]
    |         |
    |         +--> per-target health check loop (configurable interval)
    |         |      check_health() --> insert_health_check()
    |         |      compare with previous --> alert on transition
    |         |      detect latency drift --> alert if sustained
    |         |      open/close incidents on status changes
    |         |
    |         +--> per-target TLS check loop (hourly default)
    |         |      check_tls() --> insert_tls_check()
    |         |      alert on expiry warning or error
    |         |
    |         +--> per-peer heartbeat loop (60s default)
    |         |      GET /api/peer/info --> update mesh state
    |         |      GET /api/peer/status --> cache for mesh view
    |         |      grace period state machine on failure
    |         |
    |         +--> daily prune task (configurable retention)
    |         |
    |         +--> HTTP API server (Axum, configurable bind address)
    |
    +---> [MCP mode] stdio transport --> tool calls --> same check/query logic
```

## Peer Mesh Design

Each PoM instance has a persistent UUID (stored at `~/.local/share/pom/instance_id`). Peers are configured by name with an address, an `on_missing` policy, and an optional grace count.

### Heartbeat State Machine

```
Unknown --> (success) --> Online
Unknown --> (failure) --> GracePeriod --> (failures >= grace_count) --> Missing
Online --> (failure) --> GracePeriod --> (failures >= grace_count) --> Missing
Missing --> (success) --> Online (triggers recovery alert)
```

Each heartbeat cycle:
1. GET `/api/peer/info` -- verifies identity, measures latency
2. On first contact, store the peer's UUID in `peer_identities` table
3. On subsequent contacts, reject UUID mismatches (prevents impersonation)
4. GET `/api/peer/status` -- caches the peer's full status for mesh aggregation
5. Record heartbeat result in `peer_heartbeats` table
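
The state machine above can be expressed as a pure transition function. The sketch below is an illustration under stated assumptions, not the code in `src/peer.rs`: `PeerState` and `step` are hypothetical names, a success during the grace period is assumed to return the peer to Online without an alert, a Missing peer is assumed to stay Missing on further failures, and the failure that reaches `grace_count` is assumed to move the peer straight to Missing.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum PeerState {
    Unknown,
    Online,
    /// Consecutive failures seen so far, still below grace_count.
    GracePeriod { failures: u32 },
    Missing,
}

/// Applies one heartbeat result.
/// Returns the next state and whether a recovery alert is due.
fn step(state: PeerState, success: bool, grace_count: u32) -> (PeerState, bool) {
    if success {
        // Only Missing -> Online triggers a recovery alert in the diagram.
        let recovered = state == PeerState::Missing;
        return (PeerState::Online, recovered);
    }
    let failures = match state {
        // Assumption: further failures leave a Missing peer where it is.
        PeerState::Missing => return (PeerState::Missing, false),
        PeerState::GracePeriod { failures } => failures + 1,
        PeerState::Unknown | PeerState::Online => 1,
    };
    if failures >= grace_count {
        (PeerState::Missing, false)
    } else {
        (PeerState::GracePeriod { failures }, false)
    }
}
```

Keeping the transition logic free of I/O means the heartbeat loop only has to feed it the outcome of each cycle and act on the returned state, which also makes the grace-period behaviour easy to unit test.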

### On Missing Policy

- `alert` -- send email alert when peer transitions to Missing, send recovery when it returns
- `log` -- log the event, no email
- `ignore` -- suppress entirely

## Database Schema

SQLite with WAL journal mode. Schema is managed through numbered migrations (currently v1-v4).

### Tables

| Table | Purpose | Key Columns |
|-------|---------|-------------|
| `schema_version` | Migration tracking | version, description, applied_at |
| `health_checks` | HTTP health check results | target, status, checked_at, response_time_ms, details_json, error |
| `test_runs` | SSH test execution results | target, started_at, duration_secs, exit_code, passed, summary_json, raw_output |
| `peer_identities` | First-seen peer UUIDs | peer_name (PK), instance_id, first_seen |
| `peer_heartbeats` | Heartbeat history | peer_name, status, latency_ms, checked_at |
| `alerts` | Alert history + cooldown tracking | target, alert_type, from_status, to_status, sent_at |
| `tls_checks` | TLS certificate probe results | target, host, valid, days_remaining, not_before, not_after, subject, issuer |
| `incidents` | Health incidents (open/closed) | target, started_at, ended_at, duration_secs, from_status, to_status |

Pre-migration databases are detected by the presence of the `health_checks` table and stamped as v1 without re-running the initial migration.
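
The stamping rule can be illustrated as a small planning function. This is a sketch of the decision only, with no SQL involved: `Plan` and `plan_migrations` are hypothetical names, `recorded_version` stands for the highest version found in `schema_version` (or `None` when nothing is recorded), and `latest` is the newest migration number known to the binary.

```rust
/// What a migration runner would do for a database in a given condition.
#[derive(Debug, PartialEq)]
struct Plan {
    /// Record v1 as applied without executing the v1 SQL.
    stamp_v1: bool,
    /// Migration numbers to execute, in order.
    run: Vec<u32>,
}

/// Hypothetical planner for the rule described above.
fn plan_migrations(recorded_version: Option<u32>, has_health_checks: bool, latest: u32) -> Plan {
    let (stamp_v1, current) = match recorded_version {
        // Already tracked: continue from the recorded version.
        Some(v) => (false, v),
        // Untracked but populated: a pre-migration database, stamp as v1.
        None if has_health_checks => (true, 1),
        // Fresh database: run everything from v1.
        None => (false, 0),
    };
    Plan {
        stamp_v1,
        run: (current + 1..=latest).collect(),
    }
}
```

With `latest = 4`, a fresh database runs migrations 1 through 4, a pre-migration database is stamped as v1 and then runs 2 through 4, and a database already at v4 runs nothing.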

## Alert Pipeline

```
Health status change detected (previous != current)
    |
    +--> operational -> non-operational: send_health_alert(), open incident
    +--> non-operational -> operational: send_health_recovery(), close incidents
    +--> non-operational -> different non-operational: close old incident, open new, alert
    |
TLS check detects issue
    |
    +--> was OK, now invalid/error: send_tls_error_alert()
    +--> was OK, now within warn_days: send_tls_expiry_alert()
    +--> was bad, now OK: send_tls_recovery()
    |
Latency drift detected (all recent checks exceed baseline * threshold)
    |
    +--> entered drift: send_latency_drift_alert()
    +--> exited drift: send_latency_recovery()
    |
Peer transitions to Missing
    |
    +--> send_peer_missing() (if on_missing = alert)
    +--> peer recovers: send_peer_recovery()
```
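
The three health-status branches at the top of the diagram reduce to a comparison of "operational or not" before and after. The following sketch shows that decision in isolation; `Status`, `Action`, and `on_transition` are hypothetical names chosen for illustration, and the real serve loop calls the `send_*` functions and incident queries directly.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum Status {
    Operational,
    Degraded,
    Error,
    Unreachable,
}

#[derive(Debug, PartialEq)]
enum Action {
    NoChange,
    /// operational -> non-operational
    AlertAndOpenIncident,
    /// non-operational -> operational
    RecoverAndCloseIncidents,
    /// non-operational -> different non-operational
    SwapIncidentAndAlert,
}

/// Maps a (previous, current) status pair to the branch taken in the diagram.
fn on_transition(previous: Status, current: Status) -> Action {
    if previous == current {
        return Action::NoChange;
    }
    match (previous == Status::Operational, current == Status::Operational) {
        (true, false) => Action::AlertAndOpenIncident,
        (false, true) => Action::RecoverAndCloseIncidents,
        (false, false) => Action::SwapIncidentAndAlert,
        // Not reachable: both operational implies previous == current.
        (true, true) => Action::NoChange,
    }
}
```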

All alerts except recoveries are subject to a per-target cooldown (default 300s). Recoveries always send immediately. Without a Postmark token, alerts are logged to stdout (dev mode).
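
A minimal sketch of the cooldown rule follows. It keeps the last send time per target in memory, which is an assumption made for brevity: the `alerts` table is described as carrying cooldown tracking, so the actual Alerter may consult the database instead. `Cooldown` and `should_send` are hypothetical names.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Per-target alert cooldown (in-memory sketch).
struct Cooldown {
    window: Duration,
    last_sent: HashMap<String, Instant>,
}

impl Cooldown {
    fn new(window: Duration) -> Self {
        Cooldown { window, last_sent: HashMap::new() }
    }

    /// Returns true if an alert for `target` should be sent at `now`.
    /// Recoveries bypass the cooldown and do not reset it.
    fn should_send(&mut self, target: &str, is_recovery: bool, now: Instant) -> bool {
        if is_recovery {
            return true;
        }
        if let Some(&last) = self.last_sent.get(target) {
            if now.duration_since(last) < self.window {
                // Still inside the cooldown window: suppress.
                return false;
            }
        }
        self.last_sent.insert(target.to_string(), now);
        true
    }
}
```

Passing `now` in as a parameter, rather than calling `Instant::now()` inside, keeps the rule deterministic under test.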

## API Endpoints

All endpoints require `Authorization: Bearer <token>` when `api_token` is configured (in config or via `POM_API_TOKEN` env var). Without a token configured, all requests pass through.
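
The auth rule can be sketched as a pure function over the configured token and the raw header value. This is illustrative only and independent of Axum: `is_authorized` and `constant_time_eq` are hypothetical names, the header is assumed to use the exact `Bearer ` prefix, and the constant-time comparison is an addition in this sketch rather than something the source states.

```rust
/// Compares two byte strings without returning early on the first mismatch.
fn constant_time_eq(a: &[u8], b: &[u8]) -> bool {
    if a.len() != b.len() {
        return false;
    }
    a.iter().zip(b).fold(0u8, |acc, (x, y)| acc | (x ^ y)) == 0
}

/// `configured` is the api_token, or None when no token is set.
/// `header` is the raw Authorization header value, if the request has one.
fn is_authorized(configured: Option<&str>, header: Option<&str>) -> bool {
    let expected = match configured {
        Some(token) => token,
        // No token configured: every request passes through.
        None => return true,
    };
    match header.and_then(|h| h.strip_prefix("Bearer ")) {
        Some(presented) => constant_time_eq(presented.as_bytes(), expected.as_bytes()),
        None => false,
    }
}
```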

| Endpoint | Method | Description |
|----------|--------|-------------|
| `/api/status` | GET | JSON summary of all targets (latest health, uptime, latency, TLS, staleness, incidents) |
| `/api/status/{target}` | GET | Same as above for a single target |
| `/api/trends/{target}` | GET | Latency trend data with configurable window and bucket size (`?hours=24&bucket_minutes=60`) |
| `/api/peer/info` | GET | This instance's identity (id, name, version, targets, started_at) |
| `/api/peer/status` | GET | This instance's full view: identity + target statuses + peer summaries |
| `/api/mesh` | GET | Aggregated mesh view: self + each peer's cached status |

## Check Types

### HTTP Health Check

Sends GET to the target's health URL. Classifies the response:
- JSON with `"status": "operational"` --> Operational
- JSON with `"status": "degraded"` --> Degraded
- Non-JSON 2xx --> Degraded
- Non-2xx or unknown status --> Error
- Connection failure --> Unreachable
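
The classification rules above can be written as a single match. The sketch assumes the HTTP request and JSON parsing have already happened and works on a reduced `Probe` value; `Probe`, `HealthStatus`, and `classify` are hypothetical names. The list does not say which rule wins for a non-2xx response whose JSON body reports `operational`, so this sketch assumes the status code is checked first.

```rust
#[derive(Debug, PartialEq)]
enum HealthStatus {
    Operational,
    Degraded,
    Error,
    Unreachable,
}

/// Outcome of the request, reduced to what classification needs.
enum Probe<'a> {
    /// The connection could not be established.
    ConnectFailed,
    /// A response arrived. `is_json` says whether the body parsed as JSON;
    /// `json_status` is its `"status"` string field, if present.
    Response { code: u16, is_json: bool, json_status: Option<&'a str> },
}

fn classify(probe: &Probe<'_>) -> HealthStatus {
    match probe {
        Probe::ConnectFailed => HealthStatus::Unreachable,
        // Assumption: a non-2xx code is an Error regardless of the body.
        Probe::Response { code, .. } if !(200..300).contains(code) => HealthStatus::Error,
        Probe::Response { is_json: false, .. } => HealthStatus::Degraded,
        Probe::Response { json_status, .. } => match *json_status {
            Some("operational") => HealthStatus::Operational,
            Some("degraded") => HealthStatus::Degraded,
            // Missing or unrecognised status value.
            _ => HealthStatus::Error,
        },
    }
}
```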

Extracts version, uptime, checks, and monitoring from the JSON response body. Supports expectation validation: expected status code, required body substrings, and JSON field value assertions (with dot-path traversal for nested fields).

### TLS Certificate Check

Connects to host:port, completes a TLS handshake using the root certificates bundled by webpki-roots (Mozilla's root set compiled into the binary, rather than the operating system's trust store), extracts the leaf certificate, and parses it with x509-parser. Records validity, days remaining, not_before/not_after, subject, and issuer. Alerts when days_remaining falls below the configured `warn_days` threshold (default 14).
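
The expiry threshold can be illustrated with two small functions over Unix timestamps. This is a sketch under assumptions: `days_remaining` here rounds down to whole days and goes negative after expiry, and `CertState` and `cert_state` are hypothetical names. How the real prober rounds or represents an expired certificate is not stated in this document.

```rust
const SECS_PER_DAY: i64 = 86_400;

#[derive(Debug, PartialEq)]
enum CertState {
    Valid,
    ExpiringSoon,
    Expired,
}

/// Whole days from `now` until `not_after` (both Unix timestamps),
/// rounded down, so the result is negative once the certificate has expired.
fn days_remaining(not_after: i64, now: i64) -> i64 {
    (not_after - now).div_euclid(SECS_PER_DAY)
}

/// Applies the warn_days threshold: alert when days_remaining falls below it.
fn cert_state(days_remaining: i64, warn_days: i64) -> CertState {
    if days_remaining < 0 {
        CertState::Expired
    } else if days_remaining < warn_days {
        CertState::ExpiringSoon
    } else {
        CertState::Valid
    }
}
```

With the default `warn_days` of 14, a certificate with 14 days left is still `Valid` in this sketch and one with 13 days left is `ExpiringSoon`.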

### SSH Test Runner

Executes a configured command on a remote host via `ssh -o BatchMode=yes`. The command string comes from config (typically a CI script like `./run-ci.sh`). Supports an optional filter argument (validated to `[a-zA-Z0-9_:-]` to prevent injection). Output is parsed for PASS/FAIL step lines and `test result:` cargo test summary lines.
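
The filter validation can be sketched as an allow-list check applied before the remote command line is assembled. Both function names are hypothetical, rejecting an empty filter is an assumption of this sketch, and appending the filter to the command with a single space is only one plausible way the argument might be passed.

```rust
/// True if `filter` is non-empty and uses only characters in [a-zA-Z0-9_:-].
fn is_valid_filter(filter: &str) -> bool {
    !filter.is_empty()
        && filter
            .chars()
            .all(|c| c.is_ascii_alphanumeric() || matches!(c, '_' | ':' | '-'))
}

/// Builds the argument list for `ssh`, or None if the filter is rejected.
/// `host` and `command` are taken from config in this sketch.
fn build_ssh_args(host: &str, command: &str, filter: Option<&str>) -> Option<Vec<String>> {
    let remote = match filter {
        None => command.to_string(),
        Some(f) if is_valid_filter(f) => format!("{} {}", command, f),
        // Anything outside the allow-list never reaches the remote shell.
        Some(_) => return None,
    };
    Some(vec![
        "-o".to_string(),
        "BatchMode=yes".to_string(),
        host.to_string(),
        remote,
    ])
}
```

An allow-list is used rather than escaping because the remote side interprets the command through a shell, so characters such as `;`, `$`, and whitespace are simply refused.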

## Key Design Decisions

SQLite over PostgreSQL. PoM is a single-binary tool that runs on each monitoring host. SQLite keeps it self-contained with zero external dependencies. WAL mode provides concurrent reads during serve mode. Data volume is modest (a few checks per minute, pruned after 30 days).

Peer mesh over centralized monitoring. Two independent instances cross-check each other. If the Hetzner instance goes down, Astra detects it (and vice versa). No single point of failure for the monitoring layer itself.

Bearer token auth. Simple, stateless, sufficient for machine-to-machine API access between peers. Configured per-peer and per-instance. No user management needed.

Versioned migrations. The migration system detects pre-migration databases and stamps them without re-running. Each migration is a numbered SQL block. This avoids external migration tools while keeping schema evolution safe.

Separate check intervals. Health checks can have per-target interval overrides. TLS checks run on a longer interval (hourly default) since certificate state changes slowly. Peer heartbeats run on a short interval (60s default) for timely failure detection.

Cooldown on alerts. Prevents alert storms during flapping. Recovery alerts bypass cooldown so operators always know when a service comes back.