# PoM Operational Runbook

Procedures for responding to alerts, managing the service, and troubleshooting common issues.

## Alert Response Guide

### Health Status Change (Operational -> Degraded/Error/Unreachable)

**Symptoms:** Email alert with target status change.

**Steps:**
1. Verify manually: `curl -v https://makenot.work/api/health`
2. If **Unreachable**: check network (Tailscale, firewall, DNS resolution)
3. If **Error** (5xx): SSH into the target server, check application logs
   ```sh
   ssh root@100.120.174.96 'journalctl -u makenotwork --since "10 minutes ago"'
   ```
4. If **Degraded** (2xx but unexpected body): check application state, database connectivity
5. Restart the service if needed: `ssh root@100.120.174.96 systemctl restart makenotwork`
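
The triage in steps 2-4 can be sketched as a small helper. This is a minimal sketch, not part of PoM: the function names `classify_health` and `probe_health` are hypothetical, and it relies on curl printing `000` for `%{http_code}` when no HTTP response was received.

```sh
# Map a curl HTTP status code onto the status names used in this runbook.
# curl reports 000 when no HTTP response was received at all.
classify_health() {
  case "$1" in
    000) echo "Unreachable" ;;
    5??) echo "Error" ;;
    2??) echo "Operational or Degraded (inspect the response body)" ;;
    *)   echo "Unexpected status $1" ;;
  esac
}

# Fetch only the status code; --max-time 10 mirrors the 10-second check timeout.
probe_health() {
  code=$(curl -s -o /dev/null --max-time 10 -w '%{http_code}' "$1")
  classify_health "$code"
}

# Example: probe_health https://makenot.work/api/health
```

A 2xx result still needs a look at the body, since the helper cannot know what the expected response is.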

### TLS Certificate Expiry

**Symptoms:** Alert when certificate expires within 14 days.

**Steps:**
1. Verify: `openssl s_client -connect makenot.work:443 -servername makenot.work </dev/null 2>/dev/null | openssl x509 -noout -dates`
2. Cloudflare Origin CA certs are valid for 15 years and need no routine renewal. If the alert fires for one of them, check the Caddy config.
3. If Caddy is serving wrong cert: verify cert paths in `/etc/caddy/Caddyfile`
4. For custom domains (on-demand TLS): Caddy auto-renews via ACME. Check Caddy logs.
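
To compare the certificate's expiry against the 14-day threshold, the `notAfter` date can be converted to days remaining. This is a rough sketch that assumes GNU `date` (for `-d`); `days_until` and `cert_days_left` are hypothetical helper names, not PoM commands.

```sh
# Print whole days between now (or an explicit epoch in $2) and a date string
# such as the value after "notAfter=" in the openssl output. Requires GNU date.
days_until() {
  expiry_epoch=$(date -u -d "$1" +%s) || return 1
  now_epoch=${2:-$(date -u +%s)}
  echo $(( (expiry_epoch - now_epoch) / 86400 ))
}

# Extract notAfter from the live certificate and convert it to days remaining.
cert_days_left() {
  not_after=$(openssl s_client -connect "$1:443" -servername "$1" </dev/null 2>/dev/null \
    | openssl x509 -noout -enddate | cut -d= -f2)
  days_until "$not_after"
}

# Example: cert_days_left makenot.work   # this runbook alerts at 14 days
```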

### TLS Check Failed

**Symptoms:** Handshake timeout, certificate parse failure, or connection refused.

**Steps:**
1. Verify: `openssl s_client -connect makenot.work:443 -servername makenot.work </dev/null`
2. Check Caddy status: `ssh root@100.120.174.96 systemctl status caddy`
3. Check if port 443 is open: `ssh root@100.120.174.96 ss -tlnp | grep 443`
4. If Caddy is down, restart: `ssh root@100.120.174.96 systemctl restart caddy`

### Peer Missing

**Symptoms:** Peer (astra or hetzner) unreachable for 3+ consecutive heartbeats (3+ minutes).

**Steps:**
1. SSH into the peer: `ssh max@100.106.221.39` (astra) or `ssh root@100.120.174.96` (hetzner)
2. Check PoM service: `systemctl status pom`
3. Check Tailscale connectivity: `tailscale ping <peer-ip>`
4. If PoM is down: `systemctl restart pom` (prefix with `sudo` on astra, where you log in as `max`)
5. If Tailscale is down: `systemctl restart tailscaled`

### Latency Drift

**Symptoms:** Sustained response time increase (>2x the 7-day baseline).

**Steps:**
1. Check server load: `ssh root@100.120.174.96 top -bn1 | head -5`
2. Check PostgreSQL: `ssh root@100.120.174.96 "psql -c 'SELECT count(*) FROM pg_stat_activity;' makenotwork"`
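3. Check for slow queries (requires the `pg_stat_statements` extension; on PostgreSQL 12 and older the column is `mean_time`, not `mean_exec_time`): `ssh root@100.120.174.96 "psql -c \"SELECT query, calls, mean_exec_time FROM pg_stat_statements ORDER BY mean_exec_time DESC LIMIT 5;\" makenotwork"`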
4. Check disk I/O: `ssh root@100.120.174.96 iostat -x 1 3`
5. If database-related: consider `VACUUM ANALYZE` on affected tables
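
Before digging into the server, it can help to confirm the drift by hand. The sketch below measures one request and applies the ">2x baseline" rule; the baseline value has to be supplied by you (for example, from the alert email), and `measure_ms` and `is_latency_drift` are hypothetical names rather than PoM tooling.

```sh
# Print the total time of one request in whole milliseconds.
measure_ms() {
  curl -s -o /dev/null --max-time 10 -w '%{time_total}' "$1" \
    | awk '{ printf "%.0f\n", $1 * 1000 }'
}

# Succeed (exit 0) when current_ms exceeds twice baseline_ms.
# awk handles fractional milliseconds, which shell arithmetic cannot.
is_latency_drift() {
  awk -v cur="$1" -v base="$2" 'BEGIN { exit !(cur > 2 * base) }'
}

# Example, assuming a 7-day baseline of 180 ms:
#   is_latency_drift "$(measure_ms https://makenot.work/api/health)" 180 && echo drift
```

A single sample is noisy; the alert itself is based on a sustained increase, so repeat the measurement a few times before drawing conclusions.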

### Route Failure

**Symptoms:** Specific paths (e.g., `/login`, `/docs`) returning non-2xx.

**Steps:**
1. Verify: `curl -sI https://makenot.work/login`
2. If 502/503: application is down or Caddy can't reach it
3. If 404: route may have been removed in a deploy -- check recent deploys
4. If 500: application error -- check logs with `journalctl -u makenotwork`
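
To check several routes in one pass, the status-to-cause mapping above can be wrapped in a loop. This is an illustrative sketch; `route_hint` and `check_routes` are hypothetical names, and the path list is whatever you pass on the command line.

```sh
# Translate an HTTP status code into the likely cause listed in this runbook.
route_hint() {
  case "$1" in
    2??)     echo "ok" ;;
    502|503) echo "application down or Caddy cannot reach it" ;;
    404)     echo "route may have been removed in a deploy" ;;
    500)     echo "application error: check journalctl -u makenotwork" ;;
    000)     echo "no HTTP response: treat as unreachable" ;;
    *)       echo "unexpected status: investigate" ;;
  esac
}

# Print "<code> <path> <hint>" for each path under the given base URL.
check_routes() {
  base=$1; shift
  for path in "$@"; do
    code=$(curl -s -o /dev/null --max-time 10 -w '%{http_code}' "$base$path")
    printf '%s %s %s\n' "$code" "$path" "$(route_hint "$code")"
  done
}

# Example: check_routes https://makenot.work /login /docs
```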

### DNS Mismatch

**Symptoms:** DNS records don't match expected values.

**Steps:**
1. Verify: `dig makenot.work +short` and compare with expected
2. Check Cloudflare DNS dashboard for unexpected changes
3. If propagation issue: wait 5-10 minutes and recheck
4. If intentional change: update PoM config to match new expected values
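
The comparison in step 1 is easy to get wrong by eye when a name has several records, because `dig` may return them in any order. A small order-insensitive comparison, sketched here with a hypothetical `records_match` helper and a hypothetical `EXPECTED_A` variable holding the values from the PoM config:

```sh
# Compare two whitespace-separated record lists, ignoring order.
# The arguments are intentionally expanded unquoted so they split into words.
records_match() {
  expected=$(printf '%s\n' $1 | sort)
  actual=$(printf '%s\n' $2 | sort)
  [ "$expected" = "$actual" ]
}

# Example:
#   records_match "$EXPECTED_A" "$(dig makenot.work A +short)" || echo "DNS mismatch"
```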

### WHOIS Domain Expiry

**Symptoms:** Domain registration expires within 30 days.

**Steps:**
1. Verify: `whois makenot.work | grep -i expir`
2. Renew domain with registrar (Cloudflare Registrar for makenot.work)
3. Confirm renewal: re-run WHOIS check

### Monitoring Offline (All Targets Unreachable)

**Symptoms:** All monitored targets are down simultaneously.

**Steps:**
1. This almost certainly means the PoM instance's own network is down, not that every target failed at once
2. Check the PoM instance's network: `ping 1.1.1.1`, `tailscale status`
3. Check DNS resolution: `dig makenot.work`
4. If network is fine, check if all targets actually are down (unlikely but possible)

### Test Run Stale

**Symptoms:** No test run recorded in 7+ days.

**Steps:**
1. SSH into astra and run tests manually: `/home/max/staging/run-tests.sh`
2. If tests fail: investigate failures, fix, re-run
3. If SSH test execution fails: check SSH key, connectivity, permissions

## Service Management

### Starting/Stopping

```sh
# Hetzner
ssh root@100.120.174.96 systemctl start pom
ssh root@100.120.174.96 systemctl stop pom
ssh root@100.120.174.96 systemctl restart pom

# Astra
ssh max@100.106.221.39 sudo systemctl start pom
ssh max@100.106.221.39 sudo systemctl stop pom
ssh max@100.106.221.39 sudo systemctl restart pom
```

### Checking Status

```sh
# Service status
ssh root@100.120.174.96 systemctl status pom

# Application logs
ssh root@100.120.174.96 'journalctl -u pom --since "1 hour ago"'

# API health
curl http://100.120.174.96:9100/api/health

# Full status (requires API token)
curl -H "Authorization: Bearer <token>" http://100.120.174.96:9100/api/status

# Mesh view (self + peers)
curl -H "Authorization: Bearer <token>" http://100.120.174.96:9100/api/mesh
```

### Deploying Updates

```sh
cd ~/Code/MNW/pom
./deploy/deploy.sh              # Deploy to both astra and hetzner
```

The deploy script cross-compiles for both architectures, uploads binaries, and restarts services.

### Configuration Changes

Config lives at `/etc/pom/pom.toml` on each instance. After editing:

```sh
ssh root@100.120.174.96 systemctl restart pom        # Hetzner
ssh max@100.106.221.39 sudo systemctl restart pom    # Astra
```

Alert credentials are in `/etc/pom/env` (Postmark token, API token).

## Check Intervals

| Check Type | Default Interval | Notes |
|------------|-----------------|-------|
| Health (HTTP) | 5 minutes | 10-second timeout per request |
| TLS certificate | 1 hour | Warns at 14 days before expiry |
| Route availability | 5 minutes | Checks all configured paths |
| DNS records | 1 hour | Compares against expected values |
| WHOIS expiry | 1 hour | Warns at 30 days before expiry |
| CORS preflight | 1 hour | OPTIONS request validation |
| Peer heartbeat | 60 seconds | 3 failures before alert (grace period) |
| Data pruning | Daily | Retains 30 days of history |

## Alert Cooldowns

- **Default cooldown:** 5 minutes between repeated alerts for the same target
- **Recovery alerts:** Always sent immediately (bypass cooldown)
- **Monitoring-offline:** Special meta-alert when all targets are unreachable
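
When reading alert history, the first two rules can be reasoned about as a simple predicate. The sketch below illustrates the rules as described here and is not PoM's actual implementation; `should_send` is a hypothetical name, and treating exactly 5 minutes as "cooldown elapsed" is an assumption.

```sh
COOLDOWN_SECONDS=300   # 5-minute default from this section

# should_send <kind> <last_sent_epoch> <now_epoch>
# kind is "recovery" or "failure"; succeeds (exit 0) when an alert would go out.
should_send() {
  # Recovery alerts bypass the cooldown entirely.
  [ "$1" = "recovery" ] && return 0
  # Repeated failure alerts wait until the cooldown has elapsed.
  [ $(( $3 - $2 )) -ge "$COOLDOWN_SECONDS" ]
}

# Example: should_send failure 1000 1120 || echo "suppressed by cooldown"
```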

## Production Instances

| Instance | Address (IP:port) | Architecture | Config |
|----------|-----|-------------|--------|
| Hetzner | `100.120.174.96:9100` | x86_64 | `/etc/pom/pom.toml` |
| Astra | `100.106.221.39:9100` | aarch64 | `/etc/pom/pom.toml` |

Both instances monitor the same targets and cross-check each other via the peer mesh.

## Key Files

| What | Where |
|------|-------|
| Config | `/etc/pom/pom.toml` |
| Credentials | `/etc/pom/env` |
| Database | `/var/lib/pom/pom.db` (SQLite) |
| Instance ID | `/var/lib/pom/instance_id` |
| systemd unit | `/etc/systemd/system/pom.service` |
| Deploy script | `deploy/deploy.sh` |