# PoM Architecture

## System Overview

PoM runs in three modes, selected by how it is invoked:

1. **CLI mode** (`pom health`, `pom test`, `pom status`, etc.) -- runs a single command and exits. Useful for ad-hoc checks and cron jobs.
2. **Serve mode** (`pom serve`) -- long-running daemon that spawns per-target health check loops, TLS check loops, peer heartbeat tasks, a daily prune task, and an HTTP API server. This is the production deployment mode.
3. **MCP server mode** (bare `pom` with no subcommand) -- launches an MCP server over stdio for Claude integration. Exposes health checks, test execution, history queries, and mesh status as MCP tools.

All three modes load the same TOML config and connect to the same SQLite database.

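
The mode selection can be pictured as a single match on the first argument. This is a minimal sketch, not the code in `src/main.rs`: the `Mode` enum and `select_mode` function are hypothetical names, and real argument parsing (flags, help text, validation of subcommand names) is left out.

```rust
/// The three invocation modes (hypothetical type, for illustration only).
#[derive(Debug, PartialEq)]
enum Mode {
    /// One-shot command such as `health`, `test`, or `status`.
    Cli(String),
    /// Long-running daemon started by `pom serve`.
    Serve,
    /// No subcommand given: serve MCP over stdio.
    Mcp,
}

/// Pick a mode from the arguments that follow the program name.
/// Assumption: any subcommand other than `serve` is a one-shot CLI command.
fn select_mode(args: &[&str]) -> Mode {
    match args.first().copied() {
        None => Mode::Mcp,
        Some("serve") => Mode::Serve,
        Some(cmd) => Mode::Cli(cmd.to_string()),
    }
}
```

Under these assumptions, `select_mode(&[])` yields `Mode::Mcp`, and `select_mode(&["health"])` yields `Mode::Cli("health".to_string())`.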
## Module Map

| Module | File | Purpose |
|--------|------|---------|
| `main` | `src/main.rs` | Entry point -- parses CLI args, dispatches to CLI handler or MCP server |
| `cli` | `src/cli.rs` | CLI command handlers (health, test, status, history, prune, serve, mesh) |
| `config` | `src/config.rs` | TOML config loading, types for targets/peers/alerts/serve settings |
| `types` | `src/types.rs` | Shared domain types: HealthSnapshot, TestRun, TlsStatus, LatencyStats, TestStaleness |
| `db` | `src/db.rs` | SQLite schema (versioned migrations), all queries for health/tests/alerts/TLS/incidents/peers |
| `api` | `src/api.rs` | Axum HTTP API: status, trends, peer info, mesh view, bearer token auth middleware |
| `alerts` | `src/alerts.rs` | Alerter struct -- sends emails via Postmark on status transitions, with cooldown tracking |
| `peer` | `src/peer.rs` | Peer mesh: identity management, heartbeat loops, grace period state machine, mesh state |
| `display` | `src/display.rs` | Pure formatting functions for CLI output (no I/O) |
| `error` | `src/error.rs` | Typed error enum (PomError) wrapping IO, DB, HTTP, JSON, config errors |
| `checks::http` | `src/checks/http.rs` | HTTP health checker, response classification, expectation validation, latency drift detection, test staleness computation |
| `checks::tls` | `src/checks/tls.rs` | TLS certificate prober -- TCP connect, TLS handshake, x509 leaf cert parsing |
| `checks::ssh` | `src/checks/ssh.rs` | Remote test runner -- executes commands over SSH, captures output |
| `checks::parse` | `src/checks/parse.rs` | CI output parser -- extracts PASS/FAIL steps and cargo test counts |
| `tools` | `src/tools/mod.rs` | MCP server definition (PomServer), tool registration via rmcp |
| `tools::health` | `src/tools/health.rs` | MCP tool implementations for health checks, history, targets, mesh status |
| `tools::tests` | `src/tools/tests.rs` | MCP tool implementations for test execution, history, raw output |

## Data Flow

```
pom.toml (config)
    |
    v
Config::load() --> targets, peers, alerts, serve settings
    |
    v
db::connect() --> SQLite pool (WAL mode, versioned migrations)
    |
    +---> [CLI mode] single command --> check/query --> display --> exit
    |
    +---> [Serve mode]
    |       |
    |       +--> per-target health check loop (configurable interval)
    |       |       check_health() --> insert_health_check()
    |       |       compare with previous --> alert on transition
    |       |       detect latency drift --> alert if sustained
    |       |       open/close incidents on status changes
    |       |
    |       +--> per-target TLS check loop (hourly default)
    |       |       check_tls() --> insert_tls_check()
    |       |       alert on expiry warning or error
    |       |
    |       +--> per-peer heartbeat loop (60s default)
    |       |       GET /api/peer/info --> update mesh state
    |       |       GET /api/peer/status --> cache for mesh view
    |       |       grace period state machine on failure
    |       |
    |       +--> daily prune task (configurable retention)
    |       |
    |       +--> HTTP API server (Axum, configurable bind address)
    |
    +---> [MCP mode] stdio transport --> tool calls --> same check/query logic
```

## Peer Mesh Design

Each PoM instance has a persistent UUID (stored at `~/.local/share/pom/instance_id`). Peers are configured by name with an address, an `on_missing` policy, and an optional grace count.

### Heartbeat State Machine

```
Unknown --> (success) --> Online
Unknown --> (failure) --> GracePeriod --> (failures >= grace_count) --> Missing
Online  --> (failure) --> GracePeriod --> (failures >= grace_count) --> Missing
Missing --> (success) --> Online (triggers recovery alert)
```

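
The transitions above map naturally onto a pure step function. The sketch below is illustrative and not taken from `src/peer.rs`: `PeerState`, `Notify`, and `step` are hypothetical names. Two behaviors are assumptions that the diagram does not specify: a success during the grace period returns the peer to Online without an alert, and a failure while already Missing changes nothing.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum PeerState {
    Unknown,
    Online,
    /// Consecutive failures seen so far, still below the grace count.
    GracePeriod { failures: u32 },
    Missing,
}

/// What the caller should announce after a heartbeat, if anything.
#[derive(Debug, PartialEq)]
enum Notify {
    Nothing,
    PeerMissing,
    PeerRecovered,
}

/// Advance the state machine by one heartbeat result.
fn step(state: PeerState, success: bool, grace_count: u32) -> (PeerState, Notify) {
    use PeerState::*;
    match (state, success) {
        // Only a return from Missing triggers the recovery alert.
        (Missing, true) => (Online, Notify::PeerRecovered),
        // Assumption: success in any other state goes quietly to Online.
        (_, true) => (Online, Notify::Nothing),
        // Assumption: repeated failures while Missing do not re-alert.
        (Missing, false) => (Missing, Notify::Nothing),
        (prev, false) => {
            let failures = match prev {
                GracePeriod { failures } => failures + 1,
                _ => 1,
            };
            if failures >= grace_count {
                (Missing, Notify::PeerMissing)
            } else {
                (GracePeriod { failures }, Notify::Nothing)
            }
        }
    }
}
```

Keeping the function free of I/O means the heartbeat loop can feed it results and act on the returned `Notify` value, and the transitions can be unit-tested without a network.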
Each heartbeat cycle:

1. GET `/api/peer/info` -- verifies identity, measures latency
2. On first contact, store the peer's UUID in `peer_identities` table
3. On subsequent contacts, reject UUID mismatches (prevents impersonation)
4. GET `/api/peer/status` -- caches the peer's full status for mesh aggregation
5. Record heartbeat result in `peer_heartbeats` table

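
Steps 2 and 3 amount to trust-on-first-use pinning of the peer's UUID. A minimal sketch, assuming an in-memory map in place of the `peer_identities` table; `PeerIdentities`, `IdentityCheck`, and `verify` are hypothetical names, not the functions in `src/peer.rs` or `src/db.rs`.

```rust
use std::collections::HashMap;

#[derive(Debug, PartialEq)]
enum IdentityCheck {
    /// No UUID was stored for this peer; the presented one is now pinned.
    FirstSeen,
    /// The presented UUID equals the pinned one.
    Match,
    /// The presented UUID differs from the pinned one; reject the heartbeat.
    Mismatch,
}

/// In-memory stand-in for the `peer_identities` table (peer_name -> instance_id).
struct PeerIdentities {
    known: HashMap<String, String>,
}

impl PeerIdentities {
    fn new() -> Self {
        Self { known: HashMap::new() }
    }

    /// Pin the UUID on first contact, compare against the pin afterwards.
    fn verify(&mut self, peer_name: &str, instance_id: &str) -> IdentityCheck {
        if let Some(stored) = self.known.get(peer_name) {
            return if stored == instance_id {
                IdentityCheck::Match
            } else {
                IdentityCheck::Mismatch
            };
        }
        self.known.insert(peer_name.to_string(), instance_id.to_string());
        IdentityCheck::FirstSeen
    }
}
```

A mismatch leaves the stored UUID untouched, so a later heartbeat from the original instance still matches.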
### On Missing Policy

- `alert` -- send email alert when peer transitions to Missing, send recovery when it returns
- `log` -- log the event, no email
- `ignore` -- suppress entirely

## Database Schema

SQLite with WAL journal mode. Schema is managed through numbered migrations (currently v1-v4).

### Tables

| Table | Purpose | Key Columns |
|-------|---------|-------------|
| `schema_version` | Migration tracking | version, description, applied_at |
| `health_checks` | HTTP health check results | target, status, checked_at, response_time_ms, details_json, error |
| `test_runs` | SSH test execution results | target, started_at, duration_secs, exit_code, passed, summary_json, raw_output |
| `peer_identities` | First-seen peer UUIDs | peer_name (PK), instance_id, first_seen |
| `peer_heartbeats` | Heartbeat history | peer_name, status, latency_ms, checked_at |
| `alerts` | Alert history + cooldown tracking | target, alert_type, from_status, to_status, sent_at |
| `tls_checks` | TLS certificate probe results | target, host, valid, days_remaining, not_before, not_after, subject, issuer |
| `incidents` | Health incidents (open/closed) | target, started_at, ended_at, duration_secs, from_status, to_status |

Pre-migration databases are detected by the presence of the `health_checks` table and stamped as v1 without re-running the initial migration.

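
The stamping rule can be expressed as a small planning function that runs before any SQL. This is a sketch under stated assumptions, not the logic in `src/db.rs`: `MigrationPlan` and `plan_migrations` are hypothetical names, and the caller is assumed to have already queried the highest recorded version and whether `health_checks` exists.

```rust
#[derive(Debug, PartialEq)]
struct MigrationPlan {
    /// True when a pre-migration database should be recorded as v1
    /// without running the initial migration.
    stamp_v1: bool,
    /// Migration versions still to run, in order.
    to_apply: Vec<u32>,
}

/// `recorded` is the highest version in `schema_version`, or None when
/// nothing has been recorded yet.
fn plan_migrations(recorded: Option<u32>, has_health_checks: bool, latest: u32) -> MigrationPlan {
    let (stamp_v1, current) = match recorded {
        Some(version) => (false, version),
        // Tables exist but no version is recorded: pre-migration database.
        None if has_health_checks => (true, 1),
        // Fresh database: start from the initial migration.
        None => (false, 0),
    };
    MigrationPlan {
        stamp_v1,
        to_apply: (current + 1..=latest).collect(),
    }
}
```

With `latest = 4`, a fresh database plans migrations 1 through 4, while a pre-migration database is stamped as v1 and plans 2 through 4.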
## Alert Pipeline

```
Health status change detected (previous != current)
    |
    +--> operational -> non-operational: send_health_alert(), open incident
    +--> non-operational -> operational: send_health_recovery(), close incidents
    +--> non-operational -> different non-operational: close old incident, open new, alert
    |
TLS check detects issue
    |
    +--> was OK, now invalid/error: send_tls_error_alert()
    +--> was OK, now within warn_days: send_tls_expiry_alert()
    +--> was bad, now OK: send_tls_recovery()
    |
Latency drift detected (all recent checks exceed baseline * threshold)
    |
    +--> entered drift: send_latency_drift_alert()
    +--> exited drift: send_latency_recovery()
    |
Peer transitions to Missing
    |
    +--> send_peer_missing() (if on_missing = alert)
    +--> peer recovers: send_peer_recovery()
```

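
The latency drift rule in the diagram ("all recent checks exceed baseline * threshold") is simple enough to sketch directly. The function names, the use of milliseconds as `f64`, and the treatment of an empty window are assumptions for illustration. How the baseline and the window size are chosen is not covered here.

```rust
/// True when every sample in `recent_ms` exceeds `baseline_ms * threshold`.
/// Assumption: an empty window never counts as drift.
fn in_drift(recent_ms: &[f64], baseline_ms: f64, threshold: f64) -> bool {
    !recent_ms.is_empty() && recent_ms.iter().all(|&ms| ms > baseline_ms * threshold)
}

#[derive(Debug, PartialEq)]
enum DriftAlert {
    /// Corresponds to send_latency_drift_alert() in the diagram.
    Entered,
    /// Corresponds to send_latency_recovery() in the diagram.
    Exited,
}

/// Alert only on the edges, not on every check while drift persists.
fn drift_transition(was_in_drift: bool, now_in_drift: bool) -> Option<DriftAlert> {
    match (was_in_drift, now_in_drift) {
        (false, true) => Some(DriftAlert::Entered),
        (true, false) => Some(DriftAlert::Exited),
        _ => None,
    }
}
```

Requiring every recent sample to exceed the limit means a single slow response does not trigger an alert; one fast response inside the window is enough to clear the condition.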
All alerts except recoveries are subject to a per-target cooldown (default 300s). Recoveries always send immediately. Without a Postmark token, alerts are logged to stdout (dev mode).

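
A cooldown of this kind can be sketched as a map from target to the time of the last alert. This is an assumption-laden illustration rather than the `Alerter` implementation: the `Cooldown` type is hypothetical, it keeps state in memory instead of the `alerts` table, and it assumes that recoveries do not reset the cooldown clock.

```rust
use std::collections::HashMap;

struct Cooldown {
    window_secs: u64,
    /// target -> timestamp (seconds) of the last non-recovery alert sent.
    last_sent: HashMap<String, u64>,
}

impl Cooldown {
    fn new(window_secs: u64) -> Self {
        Self { window_secs, last_sent: HashMap::new() }
    }

    /// Decide whether an alert for `target` may go out at time `now`.
    fn should_send(&mut self, target: &str, is_recovery: bool, now: u64) -> bool {
        // Recoveries bypass the cooldown entirely.
        if is_recovery {
            return true;
        }
        if let Some(&last) = self.last_sent.get(target) {
            if now.saturating_sub(last) < self.window_secs {
                return false;
            }
        }
        self.last_sent.insert(target.to_string(), now);
        true
    }
}
```

With a 300-second window, a second alert for the same target 100 seconds after the first is suppressed, while a recovery at the same moment still goes out.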
## API Endpoints

All endpoints require `Authorization: Bearer <token>` when `api_token` is configured (in config or via `POM_API_TOKEN` env var). Without a token configured, all requests pass through.

| Endpoint | Method | Description |
|----------|--------|-------------|
| `/api/status` | GET | JSON summary of all targets (latest health, uptime, latency, TLS, staleness, incidents) |
| `/api/status/{target}` | GET | Same as above for a single target |
| `/api/trends/{target}` | GET | Latency trend data with configurable window and bucket size (`?hours=24&bucket_minutes=60`) |
| `/api/peer/info` | GET | This instance's identity (id, name, version, targets, started_at) |
| `/api/peer/status` | GET | This instance's full view: identity + target statuses + peer summaries |
| `/api/mesh` | GET | Aggregated mesh view: self + each peer's cached status |

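
The authorization rule stated at the top of this section reduces to one decision function. The sketch below is framework-free and is not the Axum middleware in `src/api.rs`; `is_authorized` is a hypothetical name, and the plain string comparison is a simplification (a production service may prefer a constant-time comparison).

```rust
/// Decide whether a request may proceed.
/// `configured_token` is the `api_token` setting, if any.
/// `authorization` is the raw value of the `Authorization` header, if present.
fn is_authorized(configured_token: Option<&str>, authorization: Option<&str>) -> bool {
    let expected = match configured_token {
        Some(token) => token,
        // No token configured: every request passes through.
        None => return true,
    };
    match authorization.and_then(|value| value.strip_prefix("Bearer ")) {
        Some(presented) => presented == expected,
        // Header missing or not using the Bearer scheme.
        None => false,
    }
}
```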
## Check Types

### HTTP Health Check

Sends GET to the target's health URL. Classifies the response:

- JSON with `"status": "operational"` --> Operational
- JSON with `"status": "degraded"` --> Degraded
- Non-JSON 2xx --> Degraded
- Non-2xx or unknown status --> Error
- Connection failure --> Unreachable

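
These rules can be written as one match over what the checker observed. The sketch is illustrative and does not reproduce `src/checks/http.rs`: `Outcome`, `Body`, and `classify` are hypothetical names, JSON parsing is assumed to have happened already, and the order of evaluation (HTTP status code before body) is an assumption, since the list does not say which rule wins for a non-2xx response with an `operational` body.

```rust
#[derive(Debug, PartialEq)]
enum HealthStatus {
    Operational,
    Degraded,
    Error,
    Unreachable,
}

/// Response body as seen by the classifier.
enum Body<'a> {
    NotJson,
    /// Parsed JSON; holds the `"status"` string when that field is present.
    Json(Option<&'a str>),
}

enum Outcome<'a> {
    ConnectionFailed,
    Response { code: u16, body: Body<'a> },
}

fn classify(outcome: Outcome<'_>) -> HealthStatus {
    match outcome {
        Outcome::ConnectionFailed => HealthStatus::Unreachable,
        Outcome::Response { code, body } => {
            // Assumption: a non-2xx code is an Error regardless of the body.
            if !(200u16..300).contains(&code) {
                return HealthStatus::Error;
            }
            match body {
                Body::NotJson => HealthStatus::Degraded,
                Body::Json(Some("operational")) => HealthStatus::Operational,
                Body::Json(Some("degraded")) => HealthStatus::Degraded,
                // Missing or unrecognized status value.
                Body::Json(_) => HealthStatus::Error,
            }
        }
    }
}
```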
Extracts version, uptime, checks, and monitoring from the JSON response body. Supports expectation validation: expected status code, required body substrings, and JSON field value assertions (with dot-path traversal for nested fields).

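
Dot-path traversal means splitting a path such as `checks.db` on `.` and descending one object level per segment. The sketch below uses a trimmed-down `Json` enum defined only for this example, so that it stays free of external crates. The `lookup` function is a hypothetical name, and array indexing is assumed to be out of scope.

```rust
use std::collections::BTreeMap;

/// Minimal JSON stand-in for illustration only.
#[allow(dead_code)]
#[derive(Debug, PartialEq)]
enum Json {
    Str(String),
    Num(f64),
    Object(BTreeMap<String, Json>),
}

/// Follow a dot-separated path through nested objects.
/// Returns None when a segment is missing or a non-object is reached early.
fn lookup<'a>(root: &'a Json, path: &str) -> Option<&'a Json> {
    let mut node = root;
    for key in path.split('.') {
        match node {
            Json::Object(fields) => node = fields.get(key)?,
            _ => return None,
        }
    }
    Some(node)
}
```

A field assertion then becomes a comparison between the value returned by `lookup` and the expected value from config.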
### TLS Certificate Check

Connects to host:port, completes a TLS handshake verified against the Mozilla root certificates bundled by `webpki-roots` (compiled into the binary rather than read from the operating system's trust store), extracts the leaf certificate, and parses it with x509-parser. Records validity, days remaining, not_before/not_after, subject, and issuer. Alerts when days_remaining falls below the configured `warn_days` threshold (default 14).

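
The expiry arithmetic is worth spelling out. A minimal sketch, assuming validity bounds and the current time are available as Unix timestamps in seconds; `days_remaining`, `cert_state`, and `CertState` are hypothetical names, and rounding down to whole days is an assumption.

```rust
const SECS_PER_DAY: i64 = 86_400;

#[derive(Debug, PartialEq)]
enum CertState {
    Valid,
    ExpiringSoon,
    Invalid,
}

/// Whole days until `not_after`, rounded down (negative once expired).
fn days_remaining(not_after: i64, now: i64) -> i64 {
    (not_after - now).div_euclid(SECS_PER_DAY)
}

/// Classify a certificate from its validity window and the warning threshold.
fn cert_state(not_before: i64, not_after: i64, now: i64, warn_days: i64) -> CertState {
    if now < not_before || now >= not_after {
        return CertState::Invalid;
    }
    if days_remaining(not_after, now) < warn_days {
        CertState::ExpiringSoon
    } else {
        CertState::Valid
    }
}
```

With `warn_days = 14`, a certificate with 13 whole days left is flagged as expiring soon, while one with exactly 14 days left is not.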
### SSH Test Runner

Executes a configured command on a remote host via `ssh -o BatchMode=yes`. The command string comes from config (typically a CI script like `./run-ci.sh`). Supports an optional filter argument (validated to `[a-zA-Z0-9_:-]` to prevent injection). Output is parsed for PASS/FAIL step lines and `test result:` cargo test summary lines.

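
Both the filter check and the summary parsing fit in a few lines. The sketch below is not the code in `src/checks/ssh.rs` or `src/checks/parse.rs`: `is_valid_filter` and `parse_cargo_summary` are hypothetical names, rejecting an empty filter is an assumption, and only the passed and failed counts are extracted.

```rust
/// Accept only characters from `[a-zA-Z0-9_:-]`.
/// Assumption: an empty string is rejected (an absent filter is modeled elsewhere).
fn is_valid_filter(filter: &str) -> bool {
    !filter.is_empty()
        && filter
            .chars()
            .all(|c| c.is_ascii_alphanumeric() || matches!(c, '_' | ':' | '-'))
}

/// Extract (passed, failed) from a cargo summary line such as
/// "test result: ok. 12 passed; 0 failed; 0 ignored; ...".
fn parse_cargo_summary(line: &str) -> Option<(u32, u32)> {
    let rest = line.trim().strip_prefix("test result:")?;
    let mut passed = None;
    let mut failed = None;
    // Fields are separated by ';', with a '.' after the ok/FAILED verdict.
    for part in rest.split(|c: char| c == ';' || c == '.') {
        let mut words = part.split_whitespace();
        if let (Some(count), Some(label)) = (words.next(), words.next()) {
            match (count.parse::<u32>(), label) {
                (Ok(n), "passed") => passed = Some(n),
                (Ok(n), "failed") => failed = Some(n),
                _ => {}
            }
        }
    }
    Some((passed?, failed?))
}
```

Validating against an allow-list, rather than escaping, keeps shell metacharacters such as `;` and `$` out of the remote command entirely.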
## Key Design Decisions

**SQLite over PostgreSQL.** PoM is a single-binary tool that runs on each monitoring host. SQLite keeps it self-contained with zero external dependencies. WAL mode lets readers (API and MCP queries) proceed while the check loops write during serve mode. Data volume is modest (a few checks per minute, pruned after 30 days).

**Peer mesh over centralized monitoring.** Two independent instances cross-check each other. If the Hetzner instance goes down, Astra detects it (and vice versa). No single point of failure for the monitoring layer itself.

**Bearer token auth.** Simple, stateless, sufficient for machine-to-machine API access between peers. Configured per-peer and per-instance. No user management needed.

**Versioned migrations.** The migration system detects pre-migration databases and stamps them without re-running. Each migration is a numbered SQL block. This avoids external migration tools while keeping schema evolution safe.

**Separate check intervals.** Health checks can have per-target interval overrides. TLS checks run on a longer interval (hourly default) since certificate state changes slowly. Peer heartbeats run on a short interval (60s default) for timely failure detection.

**Cooldown on alerts.** Prevents alert storms during flapping. Recovery alerts bypass cooldown so operators always know when a service comes back.