# PoM Operational Runbook

Procedures for responding to alerts, managing the service, and troubleshooting common issues.

## Alert Response Guide

### Health Status Change (Operational -> Error/Unreachable)

**Symptoms:** Email alert with target status change.

**Steps:**
1. Verify manually: `curl -v https://makenot.work/api/health`
2. If **Unreachable**: check network (Tailscale, firewall, DNS resolution)
3. If **Error** (5xx): SSH into the target server, check application logs
   ```sh
   ssh root@100.120.174.96 'journalctl -u makenotwork --since "10 minutes ago"'
   ```
4. If **Degraded** (2xx but unexpected body): check application state, database connectivity
5. Restart the service if needed: `ssh root@100.120.174.96 systemctl restart makenotwork`
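
The triage above keys off the HTTP status code, so a small helper can make the first step quicker. This is a minimal sketch, not PoM's own classification logic: `classify_health` is a hypothetical name, and the mapping simply mirrors the steps in this section. A 2xx response still needs a manual look at the body to tell Operational from Degraded.

```sh
# Hypothetical helper: map an HTTP status code to the status names used above.
# curl's %{http_code} reports 000 when no HTTP response was received.
classify_health() {
    case "$1" in
        000|"")      echo "Unreachable" ;;
        5[0-9][0-9]) echo "Error" ;;
        2[0-9][0-9]) echo "Operational or Degraded (inspect body)" ;;
        *)           echo "Unexpected ($1)" ;;
    esac
}

# Example (requires network access):
#   code=$(curl -s -o /dev/null --max-time 10 -w '%{http_code}' https://makenot.work/api/health)
#   classify_health "$code"
```

The `--max-time 10` in the example matches the 10-second health check timeout listed under Check Intervals.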

### TLS Certificate Expiry

**Symptoms:** Alert when certificate expires within 14 days.

**Steps:**
1. Verify: `openssl s_client -connect makenot.work:443 </dev/null 2>/dev/null | openssl x509 -noout -dates`
2. Cloudflare Origin CA certs (15-year): no renewal needed. If the alert fires, check Caddy config.
3. If Caddy is serving the wrong cert: verify cert paths in `/etc/caddy/Caddyfile`
4. For custom domains (on-demand TLS): Caddy auto-renews via ACME. Check Caddy logs.
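
To compare a certificate's expiry against the 14-day warning window by hand, the arithmetic can be sketched as below. The function names `days_left` and `expires_soon` are hypothetical, and this is not how PoM itself evaluates certificates; the example in the comments assumes GNU `date` for parsing the `notAfter` string.

```sh
# Hypothetical helper: whole days between two Unix timestamps (expiry, now).
days_left() {
    echo $(( ($1 - $2) / 86400 ))
}

# Succeeds (exit 0) when expiry falls within the warning window.
# Arguments: expiry epoch, current epoch, optional window in days (default 14).
expires_soon() {
    [ $(( $1 - $2 )) -le $(( ${3:-14} * 86400 )) ]
}

# Example (requires network access and GNU date for -d):
#   end=$(openssl s_client -connect makenot.work:443 </dev/null 2>/dev/null \
#           | openssl x509 -noout -enddate | cut -d= -f2)
#   expires_soon "$(date -d "$end" +%s)" "$(date +%s)" && echo "inside 14-day window"
```

With a 15-year Origin CA certificate, `expires_soon` returning success usually points at Caddy serving a different certificate than expected.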

### TLS Check Failed

**Symptoms:** Handshake timeout, certificate parse failure, or connection refused.

**Steps:**
1. Verify: `openssl s_client -connect makenot.work:443 -servername makenot.work`
2. Check Caddy status: `ssh root@100.120.174.96 systemctl status caddy`
3. Check that something is listening on port 443: `ssh root@100.120.174.96 ss -tlnp | grep 443`
4. If Caddy is down, restart: `ssh root@100.120.174.96 systemctl restart caddy`

### Peer Missing

**Symptoms:** Peer (astra or hetzner) unreachable for 3+ consecutive heartbeats (3+ minutes).

**Steps:**
1. SSH into the peer: `ssh max@100.106.221.39` (astra) or `ssh root@100.120.174.96` (hetzner)
2. Check PoM service: `systemctl status pom`
3. Check Tailscale connectivity: `tailscale ping <peer-ip>`
4. If PoM is down: `systemctl restart pom` (prefix with `sudo` on astra)
5. If Tailscale is down: `systemctl restart tailscaled` (prefix with `sudo` on astra)
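
The "3+ consecutive heartbeats" rule can be pictured as a counter that resets on any success. The sketch below is illustrative only: `peer_missing` and the `ok`/`fail` tokens are hypothetical, and PoM's actual implementation may differ.

```sh
# Hypothetical helper: succeed when the most recent heartbeats are all failures.
# Arguments: threshold, then heartbeat results oldest-to-newest ("ok" or "fail").
peer_missing() {
    threshold=$1
    shift
    streak=0
    for result in "$@"; do
        if [ "$result" = "fail" ]; then
            streak=$((streak + 1))
        else
            streak=0   # any successful heartbeat resets the grace counter
        fi
    done
    [ "$streak" -ge "$threshold" ]
}

# Example:
#   peer_missing 3 ok fail fail fail && echo "peer missing"
```

A single dropped heartbeat followed by a success does not trigger the alert, which is why a brief Tailscale blip normally goes unnoticed.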

### Latency Drift

**Symptoms:** Sustained response time increase (>2x the 7-day baseline).

**Steps:**
1. Check server load: `ssh root@100.120.174.96 top -bn1 | head -5`
2. Check PostgreSQL connection count: `ssh root@100.120.174.96 "psql -c 'SELECT count(*) FROM pg_stat_activity;' makenotwork"`
3. Check for slow queries (requires the `pg_stat_statements` extension): `ssh root@100.120.174.96 "psql -c \"SELECT query, calls, mean_exec_time FROM pg_stat_statements ORDER BY mean_exec_time DESC LIMIT 5;\" makenotwork"`
4. Check disk I/O: `ssh root@100.120.174.96 iostat -x 1 3`
5. If database-related: consider `VACUUM ANALYZE` on affected tables
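
To check by hand whether a measured response time crosses the 2x threshold, a comparison like the following can help. It is a sketch under assumptions: `is_latency_drift` is a hypothetical name, the baseline value has to be supplied by you (this runbook does not say how PoM exposes its 7-day baseline), and a single measurement says nothing about whether the increase is sustained.

```sh
# Hypothetical helper: succeed when current latency exceeds factor x baseline.
# Arguments: current, baseline (same unit, decimals allowed), optional factor (default 2).
is_latency_drift() {
    awk -v cur="$1" -v base="$2" -v factor="${3:-2}" \
        'BEGIN { exit !(cur > factor * base) }'
}

# Example (requires network access); curl reports time_total in seconds,
# and 0.120 is a placeholder baseline, not a measured value:
#   now=$(curl -s -o /dev/null -w '%{time_total}' https://makenot.work/api/health)
#   is_latency_drift "$now" 0.120 && echo "above 2x baseline"
```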

### Route Failure

**Symptoms:** Specific paths (e.g., `/login`, `/docs`) returning non-2xx.

**Steps:**
1. Verify: `curl -sI https://makenot.work/login`
2. If 502/503: application is down or Caddy can't reach it
3. If 404: route may have been removed in a deploy -- check recent deploys
4. If 500: application error -- check logs with `journalctl -u makenotwork`
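
When several paths fail at once, it can be quicker to print the likely cause next to each one. The helper below restates the steps above as a lookup; `route_hint` is a hypothetical name and the hints are taken from this section, not from PoM.

```sh
# Hypothetical helper: translate a route's HTTP status into the next step above.
route_hint() {
    case "$1" in
        2[0-9][0-9]) echo "ok" ;;
        502|503)     echo "application down or Caddy cannot reach it" ;;
        404)         echo "route may have been removed in a deploy" ;;
        500)         echo "application error, check journalctl -u makenotwork" ;;
        *)           echo "unexpected status $1" ;;
    esac
}

# Example (requires network access):
#   for path in /login /docs; do
#       code=$(curl -s -o /dev/null -w '%{http_code}' "https://makenot.work$path")
#       echo "$path: $(route_hint "$code")"
#   done
```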

### DNS Mismatch

**Symptoms:** DNS records don't match expected values.

**Steps:**
1. Verify: `dig makenot.work +short` and compare with expected
2. Check Cloudflare DNS dashboard for unexpected changes
3. If propagation issue: wait 5-10 minutes and recheck
4. If intentional change: update PoM config to match new expected values
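
Because `dig +short` can return records in any order, comparing by eye is error-prone with more than one record. A minimal order-insensitive comparison might look like this; `dns_matches` is a hypothetical name and the addresses in the example are documentation placeholders, not the real expected values.

```sh
# Hypothetical helper: compare two newline-separated record lists, ignoring order.
dns_matches() {
    [ "$(printf '%s\n' "$1" | sort)" = "$(printf '%s\n' "$2" | sort)" ]
}

# Example (requires network access; 192.0.2.x are placeholder addresses):
#   expected=$(printf '%s\n' 192.0.2.10 192.0.2.11)
#   dns_matches "$expected" "$(dig makenot.work +short)" || echo "DNS mismatch"
```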

### WHOIS Domain Expiry

**Symptoms:** Domain registration expires within 30 days.

**Steps:**
1. Verify: `whois makenot.work | grep -i expir`
2. Renew domain with registrar (Cloudflare Registrar for makenot.work)
3. Confirm renewal: re-run WHOIS check

### Monitoring Offline (All Targets Unreachable)

**Symptoms:** All monitored targets are down simultaneously.

**Steps:**
1. This almost certainly means PoM's own network is down, not that every target has failed
2. Check the PoM instance's network: `ping 1.1.1.1`, `tailscale status`
3. Check DNS resolution: `dig makenot.work`
4. If the network is fine, check whether all targets are actually down (unlikely but possible)

### Test Run Stale

**Symptoms:** No test run recorded in 7+ days.

**Steps:**
1. SSH into astra and run the tests manually: `/home/max/staging/run-tests.sh`
2. If tests fail: investigate failures, fix, re-run
3. If SSH test execution fails: check SSH key, connectivity, permissions

## Service Management

### Starting/Stopping

```sh
# Hetzner
ssh root@100.120.174.96 systemctl start pom
ssh root@100.120.174.96 systemctl stop pom
ssh root@100.120.174.96 systemctl restart pom

# Astra
ssh max@100.106.221.39 sudo systemctl start pom
ssh max@100.106.221.39 sudo systemctl stop pom
ssh max@100.106.221.39 sudo systemctl restart pom
```

### Checking Status

```sh
# Service status
ssh root@100.120.174.96 systemctl status pom

# Application logs (quoted so --since receives "1 hour ago" as one argument)
ssh root@100.120.174.96 'journalctl -u pom --since "1 hour ago"'

# API health
curl http://100.120.174.96:9100/api/health

# Full status (requires API token)
curl -H "Authorization: Bearer <token>" http://100.120.174.96:9100/api/status

# Mesh view (self + peers)
curl -H "Authorization: Bearer <token>" http://100.120.174.96:9100/api/mesh
```

### Deploying Updates

```sh
cd ~/Code/MNW/pom
./deploy/deploy.sh # Deploy to both astra and hetzner
```

The deploy script cross-compiles for both architectures, uploads binaries, and restarts services.

### Configuration Changes

Config lives at `/etc/pom/pom.toml` on each instance. After editing, restart PoM on the instance you changed:

```sh
ssh root@100.120.174.96 systemctl restart pom
ssh max@100.106.221.39 sudo systemctl restart pom
```

Alert credentials are in `/etc/pom/env` (Postmark token, API token).

## Check Intervals

| Check Type | Default Interval | Notes |
|------------|-----------------|-------|
| Health (HTTP) | 5 minutes | 10-second timeout per request |
| TLS certificate | 1 hour | Warns at 14 days before expiry |
| Route availability | 5 minutes | Checks all configured paths |
| DNS records | 1 hour | Compares against expected values |
| WHOIS expiry | 1 hour | Warns at 30 days before expiry |
| CORS preflight | 1 hour | OPTIONS request validation |
| Peer heartbeat | 60 seconds | 3 failures before alert (grace period) |
| Data pruning | Daily | Retains 30 days of history |

## Alert Cooldowns

- **Default cooldown:** 5 minutes between repeated alerts for the same target
- **Recovery alerts:** Always sent immediately (bypass cooldown)
- **Monitoring-offline:** Special meta-alert when all targets are unreachable
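
The first two rules combine into a simple decision, sketched below to make the expected behavior concrete when you are wondering why an alert did or did not arrive. `should_send_alert` and its arguments are hypothetical; this is not PoM's code, and the monitoring-offline meta-alert is left out.

```sh
# Hypothetical helper: decide whether an alert may be sent for a target.
# Arguments: kind ("failure" or "recovery"), last-sent epoch, current epoch,
# optional cooldown in seconds (default 300, i.e. 5 minutes).
should_send_alert() {
    kind=$1
    last=$2
    now=$3
    cooldown=${4:-300}
    # Recovery alerts bypass the cooldown entirely.
    [ "$kind" = "recovery" ] && return 0
    [ $((now - last)) -ge "$cooldown" ]
}

# Example:
#   should_send_alert failure "$last_sent" "$(date +%s)" || echo "suppressed by cooldown"
```

Under this reading, a target that flaps every minute produces at most one failure alert per 5 minutes, while each recovery is still reported.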

## Production Instances

| Instance | Address (IP:port) | Architecture | Config |
|----------|-------------------|--------------|--------|
| Hetzner | `100.120.174.96:9100` | x86_64 | `/etc/pom/pom.toml` |
| Astra | `100.106.221.39:9100` | aarch64 | `/etc/pom/pom.toml` |

Both instances monitor the same targets and cross-check each other via the peer mesh.

## Key Files

| What | Where |
|------|-------|
| Config | `/etc/pom/pom.toml` |
| Credentials | `/etc/pom/env` |
| Database | `/var/lib/pom/pom.db` (SQLite) |
| Instance ID | `/var/lib/pom/instance_id` |
| systemd unit | `/etc/systemd/system/pom.service` |
| Deploy script | `deploy/deploy.sh` |