# PoM Operational Runbook

Procedures for responding to alerts, managing the service, and troubleshooting common issues.

## Alert Response Guide

### Health Status Change (Operational -> Error/Unreachable)

**Symptoms:** Email alert reporting a target status change.

**Steps:**
1. Verify manually: `curl -v https://makenot.work/api/health`
2. If **Unreachable**: check network (Tailscale, firewall, DNS resolution)
3. If **Error** (5xx): SSH into the target server, check application logs
   ```sh
   # Quote the remote command so --since receives "10 minutes ago" as one argument
   ssh root@100.120.174.96 'journalctl -u makenotwork --since "10 minutes ago"'
   ```
4. If **Degraded** (2xx but unexpected body): check application state, database connectivity
5. Restart the service if needed: `ssh root@100.120.174.96 systemctl restart makenotwork`
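
The branching in steps 2-4 can be sketched as a small helper. This is an illustrative sketch, not part of PoM: `classify_status` is a hypothetical name, and it assumes curl's `%{http_code}` write-out reports `000` when no HTTP response arrived. A status code alone cannot separate Operational from Degraded, so the 2xx case still needs a look at the response body.

```sh
# Hypothetical helper: map an HTTP status code to the runbook's states.
classify_status() {
    case "$1" in
        000) echo "Unreachable" ;;              # curl received no HTTP response
        5??) echo "Error" ;;                    # server-side failure
        2??) echo "Operational-or-Degraded" ;;  # inspect the body to decide
        *)   echo "Unexpected" ;;               # 3xx/4xx: not covered by the steps above
    esac
}

# Usage (performs a real request with a 10-second timeout):
# code=$(curl -s -o /dev/null -m 10 -w '%{http_code}' https://makenot.work/api/health)
# classify_status "$code"
```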

### TLS Certificate Expiry

**Symptoms:** Alert fires when a certificate will expire within 14 days.

**Steps:**
1. Verify: `openssl s_client -connect makenot.work:443 -servername makenot.work </dev/null 2>/dev/null | openssl x509 -noout -dates`
2. Cloudflare Origin CA certs (15-year): no renewal needed. If the alert fires, check the Caddy config.
3. If Caddy is serving the wrong cert: verify cert paths in `/etc/caddy/Caddyfile`
4. For custom domains (on-demand TLS): Caddy auto-renews via ACME. Check Caddy logs.
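
Turning the `notAfter` date from step 1 into a day count makes the 14-day threshold easy to check. A minimal sketch, assuming GNU `date` (its `-d` option is not available on BSD/macOS); `days_until` is a hypothetical helper name, not something PoM ships.

```sh
# Hypothetical helper: whole days between "now" and a certificate's notAfter date.
# Assumes GNU date for the -d option.
days_until() {
    not_after=$1                     # e.g. "Jan  2 00:00:00 2030 GMT"
    now=${2:-$(date +%s)}            # optional epoch override, handy for testing
    expiry=$(date -d "$not_after" +%s) || return 1
    echo $(( (expiry - now) / 86400 ))
}

# Usage against the live site:
# end=$(openssl s_client -connect makenot.work:443 -servername makenot.work </dev/null 2>/dev/null \
#         | openssl x509 -noout -enddate | cut -d= -f2)
# [ "$(days_until "$end")" -lt 14 ] && echo "inside the 14-day warning window"
```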

### TLS Check Failed

**Symptoms:** Handshake timeout, certificate parse failure, or connection refused.

**Steps:**
1. Verify: `openssl s_client -connect makenot.work:443 -servername makenot.work </dev/null`
2. Check Caddy status: `ssh root@100.120.174.96 systemctl status caddy`
3. Check if port 443 is open: `ssh root@100.120.174.96 ss -tlnp | grep ':443'`
4. If Caddy is down, restart: `ssh root@100.120.174.96 systemctl restart caddy`

### Peer Missing

**Symptoms:** Peer (astra or hetzner) unreachable for 3+ consecutive heartbeats (3+ minutes).

**Steps:**
1. SSH into the peer: `ssh max@100.106.221.39` (astra) or `ssh root@100.120.174.96` (hetzner)
2. Check PoM service: `systemctl status pom`
3. Check Tailscale connectivity: `tailscale ping <peer-ip>`
4. If PoM is down: `systemctl restart pom` (prefix with `sudo` on astra, where you log in as `max`)
5. If Tailscale is down: `systemctl restart tailscaled` (prefix with `sudo` on astra)

### Latency Drift

**Symptoms:** Sustained response time increase (>2x the 7-day baseline).

**Steps:**
1. Check server load: `ssh root@100.120.174.96 top -bn1 | head -5`
2. Check PostgreSQL: `ssh root@100.120.174.96 "psql -c 'SELECT count(*) FROM pg_stat_activity;' makenotwork"`
3. Check for slow queries (requires the `pg_stat_statements` extension): `ssh root@100.120.174.96 "psql -c \"SELECT query, calls, mean_exec_time FROM pg_stat_statements ORDER BY mean_exec_time DESC LIMIT 5;\" makenotwork"`
4. Check disk I/O: `ssh root@100.120.174.96 iostat -x 1 3`
5. If database-related: consider `VACUUM ANALYZE` on affected tables
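
The alert threshold itself is a simple comparison. The sketch below shows only the >2x test on integer milliseconds; how PoM computes the 7-day baseline and decides that an increase is sustained is not covered here, and `is_latency_drift` is a hypothetical name.

```sh
# Hypothetical helper: succeed when current latency exceeds twice the baseline.
# Both arguments are integer milliseconds.
is_latency_drift() {
    current_ms=$1
    baseline_ms=$2
    [ "$current_ms" -gt $(( baseline_ms * 2 )) ]
}

# Example: a 120 ms baseline with a 300 ms current response time.
if is_latency_drift 300 120; then
    echo "drift: investigate load, database, and disk I/O"
fi
```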

### Route Failure

**Symptoms:** Specific paths (e.g., `/login`, `/docs`) returning non-2xx.

**Steps:**
1. Verify: `curl -sI https://makenot.work/login`
2. If 502/503: application is down or Caddy can't reach it
3. If 404: route may have been removed in a deploy -- check recent deploys
4. If 500: application error -- check logs with `journalctl -u makenotwork`
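
When several paths fail at once, the same triage can be applied in bulk. A rough sketch, assuming the status codes have already been collected as `ROUTE CODE` pairs; `triage_routes` is a hypothetical helper and the hint texts simply restate the steps above.

```sh
# Hypothetical helper: read "ROUTE CODE" pairs on stdin and print a triage hint
# for every route that did not return 2xx.
triage_routes() {
    while read -r route code; do
        case "$code" in
            2??)     ;;  # healthy, nothing to report
            502|503) echo "$route: app down or Caddy cannot reach it" ;;
            404)     echo "$route: route missing, check recent deploys" ;;
            500)     echo "$route: application error, check journalctl -u makenotwork" ;;
            *)       echo "$route: unexpected status $code" ;;
        esac
    done
}

# Usage (performs real requests):
# for p in /login /docs; do
#     printf '%s %s\n' "$p" "$(curl -s -o /dev/null -m 10 -w '%{http_code}' "https://makenot.work$p")"
# done | triage_routes
```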

### DNS Mismatch

**Symptoms:** DNS records don't match expected values.

**Steps:**
1. Verify: `dig makenot.work +short` and compare with expected
2. Check Cloudflare DNS dashboard for unexpected changes
3. If propagation issue: wait 5-10 minutes and recheck
4. If intentional change: update PoM config to match new expected values
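
The comparison in step 1 is easier to get right if record order is ignored, since `dig` may return records in any order. A minimal sketch: `dns_matches` is a hypothetical helper, and the addresses are placeholders from the 192.0.2.0/24 documentation range, not the real expected values.

```sh
# Hypothetical helper: compare two whitespace-separated record lists, ignoring order.
# The arguments are deliberately unquoted inside so they split into one record per line.
dns_matches() {
    expected=$(printf '%s\n' $1 | sort)
    actual=$(printf '%s\n' $2 | sort)
    [ "$expected" = "$actual" ]
}

# In practice: dns_matches "$expected_records" "$(dig makenot.work +short)"
if dns_matches "192.0.2.10 192.0.2.11" "192.0.2.11 192.0.2.10"; then
    echo "DNS matches expected values"
else
    echo "DNS mismatch"
fi
```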

### WHOIS Domain Expiry

**Symptoms:** Domain registration expires within 30 days.

**Steps:**
1. Verify: `whois makenot.work | grep -i expir`
2. Renew domain with registrar (Cloudflare Registrar for makenot.work)
3. Confirm renewal: re-run WHOIS check

### Monitoring Offline (All Targets Unreachable)

**Symptoms:** All monitored targets report as unreachable simultaneously.

**Steps:**
1. This almost certainly means PoM's own network is down, not that every target failed
2. Check the PoM instance's network: `ping -c 3 1.1.1.1`, `tailscale status`
3. Check DNS resolution: `dig makenot.work`
4. If the network is fine, check whether all targets are actually down (unlikely but possible)

### Test Run Stale

**Symptoms:** No test run recorded in 7+ days.

**Steps:**
1. SSH into astra and run tests manually: `/home/max/staging/run-tests.sh`
2. If tests fail: investigate failures, fix, re-run
3. If SSH test execution fails: check SSH key, connectivity, permissions

## Service Management

### Starting/Stopping

```sh
# Hetzner
ssh root@100.120.174.96 systemctl start pom
ssh root@100.120.174.96 systemctl stop pom
ssh root@100.120.174.96 systemctl restart pom

# Astra
ssh max@100.106.221.39 sudo systemctl start pom
ssh max@100.106.221.39 sudo systemctl stop pom
ssh max@100.106.221.39 sudo systemctl restart pom
```

### Checking Status

```sh
# Service status
ssh root@100.120.174.96 systemctl status pom

# Application logs (quote the remote command so --since receives one argument)
ssh root@100.120.174.96 'journalctl -u pom --since "1 hour ago"'

# API health
curl http://100.120.174.96:9100/api/health

# Full status (requires API token)
curl -H "Authorization: Bearer <token>" http://100.120.174.96:9100/api/status

# Mesh view (self + peers)
curl -H "Authorization: Bearer <token>" http://100.120.174.96:9100/api/mesh
```

### Deploying Updates

```sh
cd ~/Code/MNW/pom
./deploy/deploy.sh   # Deploy to both astra and hetzner
```

The deploy script cross-compiles for both architectures, uploads binaries, and restarts services.

### Configuration Changes

Config lives at `/etc/pom/pom.toml` on each instance. After editing, restart PoM on the instance you changed:

```sh
# Hetzner
ssh root@100.120.174.96 systemctl restart pom

# Astra
ssh max@100.106.221.39 sudo systemctl restart pom
```

Alert credentials are in `/etc/pom/env` (Postmark token, API token).

## Check Intervals

| Check | Interval | Notes |
|-------|----------|-------|
| Health (HTTP) | 5 minutes | 10-second timeout per request |
| TLS certificate | 1 hour | Warns at 14 days before expiry |
| Route availability | 5 minutes | Checks all configured paths |
| DNS records | 1 hour | Compares against expected values |
| WHOIS expiry | 1 hour | Warns at 30 days before expiry |
| CORS preflight | 1 hour | OPTIONS request validation |
| Peer heartbeat | 60 seconds | 3 failures before alert (grace period) |
| Data pruning | Daily | Retains 30 days of history |
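
The grace period on the peer heartbeat amounts to counting trailing failures. A rough sketch of that rule with a hypothetical `peer_missing` helper; PoM's actual implementation may differ.

```sh
# Hypothetical helper: given a threshold and heartbeat results oldest-first
# ("ok" or "fail"), succeed when the trailing run of failures reaches the threshold.
peer_missing() {
    threshold=$1
    shift
    streak=0
    for result in "$@"; do
        if [ "$result" = "fail" ]; then
            streak=$(( streak + 1 ))
        else
            streak=0    # any success resets the grace period
        fi
    done
    [ "$streak" -ge "$threshold" ]
}

# With a 60-second interval, three trailing failures is about 3 minutes of silence.
if peer_missing 3 ok fail fail fail; then
    echo "peer missing: alert"
fi
```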

## Alert Cooldowns

- **Default cooldown:** 5 minutes between repeated alerts for the same target
- **Recovery alerts:** Always sent immediately (bypass cooldown)
- **Monitoring-offline:** Special meta-alert when all targets are unreachable
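
The first two rules can be expressed as a small decision function. This is a sketch under stated assumptions: `should_send` and `COOLDOWN_SECS` are hypothetical names, and treating exactly 5 minutes as "elapsed" is a guess at the boundary behaviour.

```sh
# Hypothetical helper: decide whether an alert may be sent.
#   $1 = kind ("failure" or "recovery")
#   $2 = epoch seconds of the last alert for this target (0 if none)
#   $3 = current epoch seconds
COOLDOWN_SECS=300   # the 5-minute default

should_send() {
    kind=$1
    last_sent=$2
    now=$3
    [ "$kind" = "recovery" ] && return 0    # recoveries bypass the cooldown
    [ $(( now - last_sent )) -ge "$COOLDOWN_SECS" ]
}

# Example: a repeat failure 100 seconds after the previous alert is suppressed.
if should_send failure 1000 1100; then
    echo "send"
else
    echo "suppress (cooldown)"
fi
```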

## Production Instances

| Instance | Address | Architecture | Config |
|----------|---------|--------------|--------|
| Hetzner | `100.120.174.96:9100` | x86_64 | `/etc/pom/pom.toml` |
| Astra | `100.106.221.39:9100` | aarch64 | `/etc/pom/pom.toml` |

Both instances monitor the same targets and cross-check each other via the peer mesh.

## Key Files

| File | Path |
|------|------|
| Config | `/etc/pom/pom.toml` |
| Credentials | `/etc/pom/env` |
| Database | `/var/lib/pom/pom.db` (SQLite) |
| Instance ID | `/var/lib/pom/instance_id` |
| systemd unit | `/etc/systemd/system/pom.service` |
| Deploy script | `deploy/deploy.sh` |