# PoM Operational Runbook

Procedures for responding to alerts, managing the service, and troubleshooting common issues.

## Alert Response Guide

### Health Status Change (Operational -> Error/Unreachable)

**Symptoms:** Email alert with target status change.

**Steps:**

1. Verify manually: `curl -v https://makenot.work/api/health`
2. If **Unreachable**: check network (Tailscale, firewall, DNS resolution)
3. If **Error** (5xx): SSH into the target server and check the application logs

   ```sh
   # Single quotes keep "10 minutes ago" as one argument on the remote side
   ssh root@100.120.174.96 'journalctl -u makenotwork --since "10 minutes ago"'
   ```

4. If **Degraded** (2xx but unexpected body): check application state and database connectivity
5. Restart the service if needed: `ssh root@100.120.174.96 systemctl restart makenotwork`

### TLS Certificate Expiry

**Symptoms:** Alert when certificate expires within 14 days.

**Steps:**

1. Verify: `openssl s_client -connect makenot.work:443 </dev/null 2>/dev/null | openssl x509 -noout -dates`
2. Cloudflare Origin CA certs (15-year): no renewal needed. If the alert fires, check the Caddy config.
3. If Caddy is serving the wrong cert: verify cert paths in `/etc/caddy/Caddyfile`
4. For custom domains (on-demand TLS): Caddy auto-renews via ACME. Check Caddy logs.

### TLS Check Failed

**Symptoms:** Handshake timeout, certificate parse failure, or connection refused.

**Steps:**

1. Verify: `openssl s_client -connect makenot.work:443 -servername makenot.work`
2. Check Caddy status: `ssh root@100.120.174.96 systemctl status caddy`
3. Check if port 443 is open: `ssh root@100.120.174.96 ss -tlnp | grep 443`
4. If Caddy is down, restart: `ssh root@100.120.174.96 systemctl restart caddy`

### Peer Missing

**Symptoms:** Peer (astra or hetzner) unreachable for 3+ consecutive heartbeats (3+ minutes).

**Steps:**

1. SSH into the peer: `ssh max@100.106.221.39` (astra) or `ssh root@100.120.174.96` (hetzner)
2. Check PoM service: `systemctl status pom`
3. Check Tailscale connectivity: `tailscale ping `
4. If PoM is down: `systemctl restart pom` (prefix with `sudo` on astra)
5. If Tailscale is down: `systemctl restart tailscaled` (prefix with `sudo` on astra)

### Latency Drift

**Symptoms:** Sustained response time increase (>2x the 7-day baseline).

**Steps:**

1. Check server load: `ssh root@100.120.174.96 top -bn1 | head -5`
2. Check PostgreSQL: `ssh root@100.120.174.96 "psql -c 'SELECT count(*) FROM pg_stat_activity;' makenotwork"`
3. Check for slow queries: `ssh root@100.120.174.96 "psql -c \"SELECT query, calls, mean_exec_time FROM pg_stat_statements ORDER BY mean_exec_time DESC LIMIT 5;\" makenotwork"`
4. Check disk I/O: `ssh root@100.120.174.96 iostat -x 1 3`
5. If database-related: consider `VACUUM ANALYZE` on affected tables

### Route Failure

**Symptoms:** Specific paths (e.g., `/login`, `/docs`) returning non-2xx.

**Steps:**

1. Verify: `curl -sI https://makenot.work/login`
2. If 502/503: application is down or Caddy can't reach it
3. If 404: route may have been removed in a deploy -- check recent deploys
4. If 500: application error -- check logs with `journalctl -u makenotwork`

### DNS Mismatch

**Symptoms:** DNS records don't match expected values.

**Steps:**

1. Verify: `dig makenot.work +short` and compare with the expected values
2. Check Cloudflare DNS dashboard for unexpected changes
3. If propagation issue: wait 5-10 minutes and recheck
4. If intentional change: update PoM config to match the new expected values

### WHOIS Domain Expiry

**Symptoms:** Domain registration expires within 30 days.

**Steps:**

1. Verify: `whois makenot.work | grep -i expir`
2. Renew domain with registrar (Cloudflare Registrar for makenot.work)
3. Confirm renewal: re-run WHOIS check

### Monitoring Offline (All Targets Unreachable)

**Symptoms:** All monitored targets are down simultaneously.

**Steps:**

1. This almost certainly means PoM's own network is down, not that every target is down
2. Check the PoM instance's network: `ping 1.1.1.1`, `tailscale status`
3. Check DNS resolution: `dig makenot.work`
4. If the network is fine, check whether all targets actually are down (unlikely but possible)

### Test Run Stale

**Symptoms:** No test run recorded in 7+ days.

**Steps:**

1. SSH into astra and run tests manually: `/home/max/staging/run-tests.sh`
2. If tests fail: investigate failures, fix, re-run
3. If SSH test execution fails: check SSH key, connectivity, permissions

## Service Management

### Starting/Stopping

```sh
# Hetzner
ssh root@100.120.174.96 systemctl start pom
ssh root@100.120.174.96 systemctl stop pom
ssh root@100.120.174.96 systemctl restart pom

# Astra
ssh max@100.106.221.39 sudo systemctl start pom
ssh max@100.106.221.39 sudo systemctl stop pom
ssh max@100.106.221.39 sudo systemctl restart pom
```

### Checking Status

```sh
# Service status
ssh root@100.120.174.96 systemctl status pom

# Application logs (single quotes keep "1 hour ago" as one argument remotely)
ssh root@100.120.174.96 'journalctl -u pom --since "1 hour ago"'

# API health
curl http://100.120.174.96:9100/api/health

# Full status (requires API token)
curl -H "Authorization: Bearer " http://100.120.174.96:9100/api/status

# Mesh view (self + peers)
curl -H "Authorization: Bearer " http://100.120.174.96:9100/api/mesh
```

### Deploying Updates

```sh
cd ~/Code/MNW/pom
./deploy/deploy.sh    # Deploy to both astra and hetzner
```

The deploy script cross-compiles for both architectures, uploads binaries, and restarts services.

### Configuration Changes

Config lives at `/etc/pom/pom.toml` on each instance. After editing, restart the service on the instance you changed:

```sh
ssh root@100.120.174.96 systemctl restart pom
```

Alert credentials are in `/etc/pom/env` (Postmark token, API token).

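
After a restart or config change, it can help to confirm that both instances answer on their health endpoint. The following is a minimal sketch and not part of the PoM repository: the function names `health_url`, `classify_status`, and `check_instance` are hypothetical, and it assumes a healthy instance answers `/api/health` with a 2xx status, which this runbook does not state explicitly.

```sh
#!/bin/sh
# Hypothetical helper: confirm both PoM instances respond after a restart.
# IPs and port come from this runbook; the function names are illustrative.

# Print the health URL for a named instance; fail for unknown names.
health_url() {
    case "$1" in
        hetzner) echo "http://100.120.174.96:9100/api/health" ;;
        astra)   echo "http://100.106.221.39:9100/api/health" ;;
        *)       echo "unknown instance: $1" >&2; return 1 ;;
    esac
}

# Map an HTTP status code to a short label.
# curl reports 000 when it received no HTTP response at all.
classify_status() {
    case "$1" in
        2??) echo "ok" ;;
        000) echo "unreachable" ;;
        *)   echo "error" ;;
    esac
}

# Query one instance and print "<name>: <label> (<code>)".
check_instance() {
    url=$(health_url "$1") || return 1
    # -m 10 caps the request at 10 seconds (an assumption, chosen to match
    # the health-check timeout listed under Check Intervals).
    code=$(curl -s -o /dev/null -m 10 -w '%{http_code}' "$url")
    echo "$1: $(classify_status "$code") ($code)"
}
```

With the functions loaded, `for i in hetzner astra; do check_instance "$i"; done` prints one line per instance. Anything other than `ok` points back to the Peer Missing steps above.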
## Check Intervals

| Check Type | Default Interval | Notes |
|------------|-----------------|-------|
| Health (HTTP) | 5 minutes | 10-second timeout per request |
| TLS certificate | 1 hour | Warns at 14 days before expiry |
| Route availability | 5 minutes | Checks all configured paths |
| DNS records | 1 hour | Compares against expected values |
| WHOIS expiry | 1 hour | Warns at 30 days before expiry |
| CORS preflight | 1 hour | OPTIONS request validation |
| Peer heartbeat | 60 seconds | 3 failures before alert (grace period) |
| Data pruning | Daily | Retains 30 days of history |

## Alert Cooldowns

- **Default cooldown:** 5 minutes between repeated alerts for the same target
- **Recovery alerts:** Always sent immediately (bypass cooldown)
- **Monitoring-offline:** Special meta-alert when all targets are unreachable

## Production Instances

| Instance | Address | Architecture | Config |
|----------|---------|-------------|--------|
| Hetzner | `100.120.174.96:9100` | x86_64 | `/etc/pom/pom.toml` |
| Astra | `100.106.221.39:9100` | aarch64 | `/etc/pom/pom.toml` |

Both instances monitor the same targets and cross-check each other via the peer mesh.

## Key Files

| What | Where |
|------|-------|
| Config | `/etc/pom/pom.toml` |
| Credentials | `/etc/pom/env` |
| Database | `/var/lib/pom/pom.db` (SQLite) |
| Instance ID | `/var/lib/pom/instance_id` |
| systemd unit | `/etc/systemd/system/pom.service` |
| Deploy script | `deploy/deploy.sh` |
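
When an instance misbehaves after a deploy or a rebuild, a quick first check is whether the on-instance files from the Key Files table exist at all. The sketch below is illustrative only: `missing_key_files` is a hypothetical name, and the optional root prefix is an assumption added so the function can be tried against a scratch directory. `deploy/deploy.sh` is left out because it lives in the repository rather than on the instance.

```sh
#!/bin/sh
# Hypothetical helper: list on-instance key files that are missing.
# Paths come from the Key Files table; the ROOT prefix argument is an
# assumption of this sketch (empty by default, i.e. the real filesystem).
missing_key_files() {
    root=${1:-}
    for f in \
        /etc/pom/pom.toml \
        /etc/pom/env \
        /var/lib/pom/pom.db \
        /var/lib/pom/instance_id \
        /etc/systemd/system/pom.service
    do
        # Print the path only when nothing exists at that location.
        [ -e "$root$f" ] || echo "$f"
    done
}
```

Run on an instance with no argument, empty output means every listed file is present; each printed path is a file to investigate. Some of these files may only be readable by root, but the existence test needs only search permission on the parent directories.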