# PoM Architecture

## System Overview

PoM runs in three modes, selected by how it is invoked:

1. **CLI mode** (`pom health`, `pom test`, `pom status`, etc.) -- runs a single command and exits. Useful for ad-hoc checks and cron jobs.
2. **Serve mode** (`pom serve`) -- long-running daemon that spawns per-target health check loops, TLS check loops, peer heartbeat tasks, a daily prune task, and an HTTP API server. This is the production deployment mode.
3. **MCP server mode** (bare `pom` with no subcommand) -- launches an MCP server over stdio for Claude integration. Exposes health checks, test execution, history queries, and mesh status as MCP tools.

All three modes load the same TOML config and connect to the same SQLite database.

## Module Map

| Module | File | Role |
|--------|------|------|
| `main` | `src/main.rs` | Entry point -- parses CLI args, dispatches to CLI handler or MCP server |
| `cli` | `src/cli.rs` | CLI command handlers (health, test, status, history, prune, serve, mesh) |
| `config` | `src/config.rs` | TOML config loading, types for targets/peers/alerts/serve settings |
| `types` | `src/types.rs` | Shared domain types: HealthSnapshot, TestRun, TlsStatus, LatencyStats, TestStaleness |
| `db` | `src/db.rs` | SQLite schema (versioned migrations), all queries for health/tests/alerts/TLS/incidents/peers |
| `api` | `src/api.rs` | Axum HTTP API: status, trends, peer info, mesh view, bearer token auth middleware |
| `alerts` | `src/alerts.rs` | Alerter struct -- sends emails via Postmark on status transitions, with cooldown tracking |
| `peer` | `src/peer.rs` | Peer mesh: identity management, heartbeat loops, grace period state machine, mesh state |
| `display` | `src/display.rs` | Pure formatting functions for CLI output (no I/O) |
| `error` | `src/error.rs` | Typed error enum (PomError) wrapping IO, DB, HTTP, JSON, config errors |
| `checks::http` | `src/checks/http.rs` | HTTP health checker, response classification, expectation validation, latency drift detection, test staleness computation |
| `checks::tls` | `src/checks/tls.rs` | TLS certificate prober -- TCP connect, TLS handshake, x509 leaf cert parsing |
| `checks::ssh` | `src/checks/ssh.rs` | Remote test runner -- executes commands over SSH, captures output |
| `checks::parse` | `src/checks/parse.rs` | CI output parser -- extracts PASS/FAIL steps and cargo test counts |
| `tools` | `src/tools/mod.rs` | MCP server definition (PomServer), tool registration via rmcp |
| `tools::health` | `src/tools/health.rs` | MCP tool implementations for health checks, history, targets, mesh status |
| `tools::tests` | `src/tools/tests.rs` | MCP tool implementations for test execution, history, raw output |

## Data Flow

```
pom.toml (config)
    |
    v
Config::load() --> targets, peers, alerts, serve settings
    |
    v
db::connect() --> SQLite pool (WAL mode, versioned migrations)
    |
    +---> [CLI mode] single command --> check/query --> display --> exit
    |
    +---> [Serve mode]
    |       |
    |       +--> per-target health check loop (configurable interval)
    |       |      check_health() --> insert_health_check()
    |       |      compare with previous --> alert on transition
    |       |      detect latency drift --> alert if sustained
    |       |      open/close incidents on status changes
    |       |
    |       +--> per-target TLS check loop (hourly default)
    |       |      check_tls() --> insert_tls_check()
    |       |      alert on expiry warning or error
    |       |
    |       +--> per-peer heartbeat loop (60s default)
    |       |      GET /api/peer/info --> update mesh state
    |       |      GET /api/peer/status --> cache for mesh view
    |       |      grace period state machine on failure
    |       |
    |       +--> daily prune task (configurable retention)
    |       |
    |       +--> HTTP API server (Axum, configurable bind address)
    |
    +---> [MCP mode] stdio transport --> tool calls --> same check/query logic
```

## Peer Mesh Design

Each PoM instance has a persistent UUID (stored at `~/.local/share/pom/instance_id`). Peers are configured by name with an address, an `on_missing` policy, and an optional grace count.
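As a rough illustration of the peer settings just described, the sketch below models one configured peer in plain Rust. The type and field names (`PeerConfig`, `OnMissing`, `grace_count`, `sends_email`) and the example address are assumptions made for this example and are not taken from `src/config.rs`; the three policy strings are the ones this document lists under "On Missing Policy".

```rust
use std::str::FromStr;

/// Policy applied when a peer is declared Missing.
/// The variants mirror the `alert` / `log` / `ignore` values named in this document.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum OnMissing {
    Alert,
    Log,
    Ignore,
}

impl FromStr for OnMissing {
    type Err = String;

    /// Parses the policy string as it might appear in the TOML config.
    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s {
            "alert" => Ok(OnMissing::Alert),
            "log" => Ok(OnMissing::Log),
            "ignore" => Ok(OnMissing::Ignore),
            other => Err(format!("unknown on_missing policy: {other}")),
        }
    }
}

/// Hypothetical shape of one configured peer (field names are assumptions).
#[derive(Debug, Clone)]
struct PeerConfig {
    name: String,
    address: String,
    on_missing: OnMissing,
    /// Optional; `None` means "use whatever default the daemon applies".
    grace_count: Option<u32>,
}

impl PeerConfig {
    /// Only the `alert` policy results in an email for a Missing transition.
    fn sends_email(&self) -> bool {
        self.on_missing == OnMissing::Alert
    }
}

fn main() {
    let peer = PeerConfig {
        name: "astra".to_string(),
        address: "https://astra.example:8080".to_string(), // hypothetical address
        on_missing: "alert".parse().expect("valid policy"),
        grace_count: Some(3), // illustrative value, not a documented default
    };
    println!(
        "{} at {} (grace {:?}) emails on missing: {}",
        peer.name,
        peer.address,
        peer.grace_count,
        peer.sends_email()
    );
}
```

Parsing the policy into an enum at load time means an unknown value is rejected once, at startup, instead of being compared as a string on every heartbeat.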
### Heartbeat State Machine

```
Unknown --> (success) --> Online
Unknown --> (failure) --> GracePeriod --> (failures >= grace_count) --> Missing
Online  --> (failure) --> GracePeriod --> (failures >= grace_count) --> Missing
Missing --> (success) --> Online (triggers recovery alert)
```

Each heartbeat cycle:

1. GET `/api/peer/info` -- verifies identity, measures latency
2. On first contact, store the peer's UUID in the `peer_identities` table
3. On subsequent contacts, reject UUID mismatches (prevents impersonation)
4. GET `/api/peer/status` -- caches the peer's full status for mesh aggregation
5. Record the heartbeat result in the `peer_heartbeats` table

### On Missing Policy

- `alert` -- send an email alert when the peer transitions to Missing, and a recovery email when it returns
- `log` -- log the event, no email
- `ignore` -- suppress entirely

## Database Schema

SQLite with WAL journal mode. Schema is managed through numbered migrations (currently v1-v4).

### Tables

| Table | Purpose | Key Columns |
|-------|---------|-------------|
| `schema_version` | Migration tracking | version, description, applied_at |
| `health_checks` | HTTP health check results | target, status, checked_at, response_time_ms, details_json, error |
| `test_runs` | SSH test execution results | target, started_at, duration_secs, exit_code, passed, summary_json, raw_output |
| `peer_identities` | First-seen peer UUIDs | peer_name (PK), instance_id, first_seen |
| `peer_heartbeats` | Heartbeat history | peer_name, status, latency_ms, checked_at |
| `alerts` | Alert history + cooldown tracking | target, alert_type, from_status, to_status, sent_at |
| `tls_checks` | TLS certificate probe results | target, host, valid, days_remaining, not_before, not_after, subject, issuer |
| `incidents` | Health incidents (open/closed) | target, started_at, ended_at, duration_secs, from_status, to_status |

Pre-migration databases are detected by the presence of the `health_checks` table and stamped as v1 without re-running the initial migration.

## Alert Pipeline

```
Health status change detected (previous != current)
    |
    +--> operational -> non-operational: send_health_alert(), open incident
    +--> non-operational -> operational: send_health_recovery(), close incidents
    +--> non-operational -> different non-operational: close old incident, open new, alert
    |
TLS check detects issue
    |
    +--> was OK, now invalid/error: send_tls_error_alert()
    +--> was OK, now within warn_days: send_tls_expiry_alert()
    +--> was bad, now OK: send_tls_recovery()
    |
Latency drift detected (all recent checks exceed baseline * threshold)
    |
    +--> entered drift: send_latency_drift_alert()
    +--> exited drift: send_latency_recovery()
    |
Peer transitions to Missing
    |
    +--> send_peer_missing() (if on_missing = alert)
    +--> peer recovers: send_peer_recovery()
```

All alerts except recoveries are subject to a per-target cooldown (default 300s). Recoveries always send immediately. Without a Postmark token, alerts are logged to stdout (dev mode).

## API Endpoints

All endpoints require `Authorization: Bearer <token>` when `api_token` is configured (in config or via the `POM_API_TOKEN` env var). Without a token configured, all requests pass through.

| Endpoint | Method | Description |
|----------|--------|-------------|
| `/api/status` | GET | JSON summary of all targets (latest health, uptime, latency, TLS, staleness, incidents) |
| `/api/status/{target}` | GET | Same as above for a single target |
| `/api/trends/{target}` | GET | Latency trend data with configurable window and bucket size (`?hours=24&bucket_minutes=60`) |
| `/api/peer/info` | GET | This instance's identity (id, name, version, targets, started_at) |
| `/api/peer/status` | GET | This instance's full view: identity + target statuses + peer summaries |
| `/api/mesh` | GET | Aggregated mesh view: self + each peer's cached status |

## Check Types

### HTTP Health Check

Sends GET to the target's health URL.
Classifies the response:

- JSON with `"status": "operational"` --> Operational
- JSON with `"status": "degraded"` --> Degraded
- Non-JSON 2xx --> Degraded
- Non-2xx or unknown status --> Error
- Connection failure --> Unreachable

Extracts version, uptime, checks, and monitoring from the JSON response body. Supports expectation validation: expected status code, required body substrings, and JSON field value assertions (with dot-path traversal for nested fields).

### TLS Certificate Check

Connects to host:port, completes a TLS handshake validated against the bundled `webpki-roots` trust store, extracts the leaf certificate, and parses it with x509-parser. Records validity, days remaining, not_before/not_after, subject, and issuer. Alerts when `days_remaining` falls below the configured `warn_days` threshold (default 14).

### SSH Test Runner

Executes a configured command on a remote host via `ssh -o BatchMode=yes`. The command string comes from config (typically a CI script like `./run-ci.sh`). Supports an optional filter argument (restricted to the characters `[a-zA-Z0-9_:-]` to prevent injection). Output is parsed for PASS/FAIL step lines and `test result:` cargo test summary lines.

## Key Design Decisions

**SQLite over PostgreSQL.** PoM is a single-binary tool that runs on each monitoring host. SQLite keeps it self-contained with zero external dependencies. WAL mode provides concurrent reads during serve mode. Data volume is modest (a few checks per minute, pruned after 30 days).

**Peer mesh over centralized monitoring.** Two independent instances cross-check each other. If the Hetzner instance goes down, Astra detects it (and vice versa). No single point of failure for the monitoring layer itself.

**Bearer token auth.** Simple, stateless, sufficient for machine-to-machine API access between peers. Configured per-peer and per-instance. No user management needed.

**Versioned migrations.** The migration system detects pre-migration databases and stamps them without re-running. Each migration is a numbered SQL block. This avoids external migration tools while keeping schema evolution safe.

**Separate check intervals.** Health checks can have per-target interval overrides. TLS checks run on a longer interval (hourly default) since certificate state changes slowly. Peer heartbeats run on a short interval (60s default) for timely failure detection.

**Cooldown on alerts.** Prevents alert storms during flapping. Recovery alerts bypass cooldown so operators always know when a service comes back.
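
The cooldown rule can be sketched in a few lines of Rust. This is a minimal in-memory illustration, assuming the cooldown is keyed by target name only; the `Cooldown` type and `should_send` method are hypothetical names, and the real `Alerter` may track this differently (for example through the `alerts` table).

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Hypothetical per-target cooldown tracker (not PoM's actual Alerter).
struct Cooldown {
    window: Duration,
    last_sent: HashMap<String, Instant>,
}

impl Cooldown {
    fn new(window: Duration) -> Self {
        Cooldown {
            window,
            last_sent: HashMap::new(),
        }
    }

    /// Decides whether an alert for `target` should go out at time `now`.
    /// Recoveries always pass and do not reset the cooldown clock.
    fn should_send(&mut self, target: &str, is_recovery: bool, now: Instant) -> bool {
        if is_recovery {
            return true;
        }
        match self.last_sent.get(target) {
            // Still inside the window since the last non-recovery alert: suppress.
            Some(&prev) if now.saturating_duration_since(prev) < self.window => false,
            // First alert for this target, or the window has elapsed: send and record.
            _ => {
                self.last_sent.insert(target.to_string(), now);
                true
            }
        }
    }
}

fn main() {
    // 300s matches the default cooldown stated in this document.
    let mut cooldown = Cooldown::new(Duration::from_secs(300));
    let t0 = Instant::now();

    let first = cooldown.should_send("api", false, t0);
    let flapping = cooldown.should_send("api", false, t0 + Duration::from_secs(10));
    let recovery = cooldown.should_send("api", true, t0 + Duration::from_secs(20));
    let later = cooldown.should_send("api", false, t0 + Duration::from_secs(301));

    println!("first={first} flapping={flapping} recovery={recovery} later={later}");
}
```

Passing `now` in as a parameter, instead of calling `Instant::now()` inside the method, keeps the rule testable without sleeping.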