# PoM Operational Runbook

Procedures for responding to alerts, managing the service, and troubleshooting common issues.

## Alert Response Guide

### Health Status Change (Operational -> Error/Unreachable)

Symptoms: Email alert with target status change.

Steps:
1. Verify manually: `curl -v https://makenot.work/api/health`
2. If Unreachable: check network (Tailscale, firewall, DNS resolution)
3. If Error (5xx): SSH into the target server and check the application logs. Quote the remote command so that `--since "10 minutes ago"` reaches `journalctl` as a single argument:
```sh
ssh root@100.120.174.96 'journalctl -u makenotwork --since "10 minutes ago"'
```
4. If Degraded (2xx but unexpected body): check application state and database connectivity
5. Restart the service if needed: `ssh root@100.120.174.96 systemctl restart makenotwork`

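The triage above can be sketched as a small shell helper. This is a minimal sketch and not PoM's own classification logic: the function names `classify_health` and `probe_health`, the `Unexpected` label, and the expected-body substring are assumptions made for illustration. It relies on curl reporting `000` for `%{http_code}` when no HTTP response was received.

```sh
# Hypothetical helper: map an HTTP status code to the statuses used in this runbook.
# $1 = HTTP status code ("000" means curl received no response)
# $2 = "yes" if the response body looked as expected, "no" otherwise (default: yes)
classify_health() {
    code="$1"
    body_ok="${2:-yes}"
    case "$code" in
        000) echo "Unreachable" ;;
        2??)
            if [ "$body_ok" = "yes" ]; then
                echo "Operational"
            else
                echo "Degraded"
            fi
            ;;
        5??) echo "Error" ;;
        *)   echo "Unexpected" ;;   # 3xx/4xx: this label is an assumption, not a PoM status
    esac
}

# Probe a URL with a 10-second timeout and classify the result.
# $1 = URL, $2 = substring expected in a healthy body (assumed; adjust to the real payload)
probe_health() {
    body_file=$(mktemp)
    code=$(curl -s -o "$body_file" -w '%{http_code}' --max-time 10 "$1")
    if grep -qF -- "$2" "$body_file"; then
        body_ok=yes
    else
        body_ok=no
    fi
    rm -f "$body_file"
    classify_health "$code" "$body_ok"
}
```

For example, `probe_health https://makenot.work/api/health ok` prints one of the labels above; the `ok` substring is a placeholder for whatever the health endpoint actually returns.
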
### TLS Certificate Expiry

Symptoms: Alert when certificate expires within 14 days.

Steps:
1. Verify: `openssl s_client -connect makenot.work:443 -servername makenot.work </dev/null 2>/dev/null | openssl x509 -noout -dates` (the `</dev/null` keeps `s_client` from waiting on stdin)
2. Cloudflare Origin CA certs are valid for 15 years, so no routine renewal is needed. If the alert fires anyway, check the Caddy config.
3. If Caddy is serving the wrong cert: verify the cert paths in `/etc/caddy/Caddyfile`
4. For custom domains (on-demand TLS): Caddy auto-renews via ACME. Check the Caddy logs.

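The 14-day window can also be checked by hand with `openssl x509 -checkend`, which exits non-zero when the certificate expires within the given number of seconds. The sketch below is only an illustration of that manual check, not how PoM performs its own; the function names are hypothetical.

```sh
# Convert a number of days to seconds for `openssl x509 -checkend`.
days_to_seconds() {
    echo $(( $1 * 86400 ))
}

# Hypothetical helper: succeed only if the certificate served for host $1
# stays valid for more than $2 days (default 14, the alert threshold above).
# Also fails if the certificate could not be fetched at all.
cert_valid_for_days() {
    host="$1"
    days="${2:-14}"
    openssl s_client -connect "$host:443" -servername "$host" </dev/null 2>/dev/null \
        | openssl x509 -noout -checkend "$(days_to_seconds "$days")" >/dev/null 2>&1
}
```

A possible invocation is `cert_valid_for_days makenot.work 14 || echo "expires within 14 days, or fetch failed"`.
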
### TLS Check Failed

Symptoms: Handshake timeout, certificate parse failure, or connection refused.

Steps:
1. Verify: `openssl s_client -connect makenot.work:443 -servername makenot.work`
2. Check Caddy status: `ssh root@100.120.174.96 systemctl status caddy`
3. Check if port 443 is open: `ssh root@100.120.174.96 ss -tlnp | grep 443`
4. If Caddy is down, restart: `ssh root@100.120.174.96 systemctl restart caddy`

### Peer Missing

Symptoms: Peer (astra or hetzner) unreachable for 3+ consecutive heartbeats (3+ minutes).

Steps:
1. SSH into the peer: `ssh max@100.106.221.39` (astra) or `ssh root@100.120.174.96` (hetzner)
2. Check PoM service: `systemctl status pom`
3. Check Tailscale connectivity: `tailscale ping <peer-ip>`
4. If PoM is down: `systemctl restart pom` (prefix with `sudo` on astra)
5. If Tailscale is down: `systemctl restart tailscaled` (prefix with `sudo` on astra)

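With a 60-second heartbeat and a threshold of three consecutive misses, a peer has been silent for roughly three minutes before this alert fires, and any successful heartbeat in between resets the count. The counting logic can be sketched as follows; this is an illustration only, not PoM's implementation, and the `ok`/`fail` tokens and function names are hypothetical.

```sh
# Count consecutive failures at the end of a heartbeat history.
# Arguments are results in chronological order, each "ok" or "fail".
trailing_failures() {
    count=0
    for result in "$@"; do
        if [ "$result" = "fail" ]; then
            count=$((count + 1))
        else
            count=0    # any success resets the streak
        fi
    done
    echo "$count"
}

# Succeed when the streak reaches the alert threshold (3, per this runbook).
peer_missing() {
    [ "$(trailing_failures "$@")" -ge 3 ]
}
```

Under these assumptions, `peer_missing ok fail fail fail` succeeds, while `peer_missing fail fail ok fail` does not because the streak was interrupted.
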
### Latency Drift

Symptoms: Sustained response time increase (>2x the 7-day baseline).

Steps:
1. Check server load: `ssh root@100.120.174.96 top -bn1 | head -5`
2. Check PostgreSQL: `ssh root@100.120.174.96 "psql -c 'SELECT count(*) FROM pg_stat_activity;' makenotwork"`
3. Check for slow queries (requires the `pg_stat_statements` extension): `ssh root@100.120.174.96 "psql -c \"SELECT query, calls, mean_exec_time FROM pg_stat_statements ORDER BY mean_exec_time DESC LIMIT 5;\" makenotwork"`
4. Check disk I/O: `ssh root@100.120.174.96 iostat -x 1 3`
5. If database-related: consider `VACUUM ANALYZE` on affected tables

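To compare a fresh measurement against the 2x threshold by hand, something like the sketch below could be used. It is only an illustration: the function names are hypothetical, the baseline has to be taken from PoM's own history, and a single sample says nothing about whether an increase is sustained.

```sh
# Succeed if the current latency is more than twice the baseline.
# $1 = current latency in ms, $2 = 7-day baseline in ms (both integers)
latency_drifted() {
    [ "$1" -gt $(( $2 * 2 )) ]
}

# Measure total request time for a URL in whole milliseconds.
# curl reports time_total in seconds with a fractional part; awk converts it.
measure_ms() {
    curl -s -o /dev/null -w '%{time_total}\n' --max-time 10 "$1" \
        | awk '{ printf "%d\n", $1 * 1000 }'
}
```

For example, `latency_drifted "$(measure_ms https://makenot.work/api/health)" 120 && echo drifted`, where `120` stands in for the real baseline.
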
### Route Failure

Symptoms: Specific paths (e.g., `/login`, `/docs`) returning non-2xx.

Steps:
1. Verify: `curl -sI https://makenot.work/login`
2. If 502/503: application is down or Caddy can't reach it
3. If 404: route may have been removed in a deploy -- check recent deploys
4. If 500: application error -- check logs with `journalctl -u makenotwork`

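When several paths need checking at once, the status-to-cause mapping above can be wrapped in a loop. This is a minimal sketch with hypothetical function names; the hints simply restate the steps in this section, and `000` is what curl reports when it receives no response.

```sh
# Hypothetical helper: translate an HTTP status into the likely cause listed above.
route_hint() {
    case "$1" in
        2??)     echo "ok" ;;
        404)     echo "route may have been removed in a deploy" ;;
        500)     echo "application error, check journalctl -u makenotwork" ;;
        502|503) echo "application is down or Caddy cannot reach it" ;;
        000)     echo "no response received" ;;
        *)       echo "unexpected status" ;;
    esac
}

# Check each path under a base URL and print path, status, and hint.
# $1 = base URL, remaining arguments = paths such as /login /docs
check_routes() {
    base="$1"
    shift
    for path in "$@"; do
        code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 10 "$base$path")
        echo "$path $code $(route_hint "$code")"
    done
}
```

A possible invocation is `check_routes https://makenot.work /login /docs`.
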
### DNS Mismatch

Symptoms: DNS records don't match expected values.

Steps:
1. Verify: `dig makenot.work +short` and compare with expected
2. Check Cloudflare DNS dashboard for unexpected changes
3. If propagation issue: wait 5-10 minutes and recheck
4. If intentional change: update PoM config to match new expected values

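Because `dig +short` may return records in any order, the comparison in step 1 is easier when both sides are normalized first. The sketch below shows one way to do that; the function names are hypothetical and the addresses in the usage note are documentation placeholders, not the real expected values.

```sh
# Normalize a whitespace-separated list of records: one per line, sorted, deduplicated.
normalize_records() {
    printf '%s\n' "$1" | tr ' \t' '\n\n' | sed '/^$/d' | sort -u
}

# Succeed if both record sets contain the same values, regardless of order.
# $1 = actual records, $2 = expected records
records_match() {
    [ "$(normalize_records "$1")" = "$(normalize_records "$2")" ]
}
```

For example, `records_match "$(dig makenot.work +short)" "192.0.2.10 192.0.2.11" || echo mismatch`.
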
### WHOIS Domain Expiry

Symptoms: Domain registration expires within 30 days.

Steps:
1. Verify: `whois makenot.work | grep -i expir`
2. Renew domain with registrar (Cloudflare Registrar for makenot.work)
3. Confirm renewal: re-run WHOIS check

### Monitoring Offline (All Targets Unreachable)

Symptoms: All monitored targets are down simultaneously.

Steps:
1. This almost always means the PoM instance's own network is down, not that every target has failed
2. Check the PoM instance's network: `ping 1.1.1.1`, `tailscale status`
3. Check DNS resolution: `dig makenot.work`
4. If the network is fine, check whether all targets really are down (unlikely but possible)

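The reasoning in step 1 amounts to a simple rule: when every target is unreachable, look at the monitor first; when only some are, look at those targets. A minimal sketch of that rule follows. The function names and the `none`/`monitor`/`target` labels are hypothetical and are not part of PoM.

```sh
# Hypothetical helper: decide where to look first.
# $1 = number of unreachable targets, $2 = total number of monitored targets
outage_scope() {
    if [ "$1" -eq 0 ]; then
        echo "none"
    elif [ "$1" -ge "$2" ]; then
        echo "monitor"    # everything is down: suspect PoM's own network first
    else
        echo "target"     # only some targets are down: investigate those targets
    fi
}

# Quick local connectivity probe from the PoM instance (single ICMP echo).
local_network_ok() {
    ping -c 1 1.1.1.1 >/dev/null 2>&1
}
```

For example, `outage_scope 4 4` prints `monitor`, in which case `local_network_ok || echo "local network is down"` is a reasonable next check.
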
### Test Run Stale

Symptoms: No test run recorded in 7+ days.

Steps:
1. SSH into astra and run tests manually: `/home/max/staging/run-tests.sh`
2. If tests fail: investigate failures, fix, re-run
3. If SSH test execution fails: check SSH key, connectivity, permissions

## Service Management

### Starting/Stopping

```sh
# Hetzner
ssh root@100.120.174.96 systemctl start pom
ssh root@100.120.174.96 systemctl stop pom
ssh root@100.120.174.96 systemctl restart pom

# Astra
ssh max@100.106.221.39 sudo systemctl start pom
ssh max@100.106.221.39 sudo systemctl stop pom
ssh max@100.106.221.39 sudo systemctl restart pom
```

### Checking Status

```sh
# Service status
ssh root@100.120.174.96 systemctl status pom

# Application logs (the remote command is quoted so "1 hour ago" stays one argument)
ssh root@100.120.174.96 'journalctl -u pom --since "1 hour ago"'

# API health
curl http://100.120.174.96:9100/api/health

# Full status (requires API token)
curl -H "Authorization: Bearer <token>" http://100.120.174.96:9100/api/status

# Mesh view (self + peers)
curl -H "Authorization: Bearer <token>" http://100.120.174.96:9100/api/mesh
```

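The commands above target Hetzner only. To query the health endpoint on both instances in one pass, a sketch along these lines could be used; the function names are hypothetical, and the addresses are the two production instances documented in this runbook.

```sh
# Map an instance name to its PoM API base URL (addresses from this runbook).
instance_url() {
    case "$1" in
        hetzner) echo "http://100.120.174.96:9100" ;;
        astra)   echo "http://100.106.221.39:9100" ;;
        *)       echo "unknown instance: $1" >&2; return 1 ;;
    esac
}

# Print the HTTP status of /api/health for each named instance.
# A status of 000 means curl received no response.
check_instances() {
    for name in "$@"; do
        url=$(instance_url "$name") || continue
        code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 10 "$url/api/health")
        echo "$name $code"
    done
}
```

A possible invocation is `check_instances hetzner astra`.
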
### Deploying Updates

```sh
cd ~/Code/MNW/pom
./deploy/deploy.sh # Deploy to both astra and hetzner
```

The deploy script cross-compiles for both architectures, uploads binaries, and restarts services.

### Configuration Changes

Config lives at `/etc/pom/pom.toml` on each instance. After editing, restart PoM on the instance you changed:

```sh
ssh root@100.120.174.96 systemctl restart pom        # Hetzner
ssh max@100.106.221.39 sudo systemctl restart pom    # Astra
```

Alert credentials are in `/etc/pom/env` (Postmark token, API token).

## Check Intervals

| Check Type | Default Interval | Notes |
|------------|-----------------|-------|
| Health (HTTP) | 5 minutes | 10-second timeout per request |
| TLS certificate | 1 hour | Warns at 14 days before expiry |
| Route availability | 5 minutes | Checks all configured paths |
| DNS records | 1 hour | Compares against expected values |
| WHOIS expiry | 1 hour | Warns at 30 days before expiry |
| CORS preflight | 1 hour | OPTIONS request validation |
| Peer heartbeat | 60 seconds | 3 failures before alert (grace period) |
| Data pruning | Daily | Retains 30 days of history |

## Alert Cooldowns

- Default cooldown: 5 minutes between repeated alerts for the same target
- Recovery alerts: Always sent immediately (bypass cooldown)
- Monitoring-offline: Special meta-alert when all targets are unreachable

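The first two rules can be expressed as a small decision function. This is a sketch of the behaviour described above, not PoM's actual code: the function name, the `recovery` marker, and the choice to send once exactly 300 seconds have elapsed are assumptions.

```sh
COOLDOWN_SECONDS=300   # 5-minute default from this runbook

# Succeed if an alert should be sent now.
# $1 = current time (epoch seconds)
# $2 = time the last alert for this target was sent (epoch seconds)
# $3 = "recovery" for recovery alerts, anything else for regular alerts
should_send_alert() {
    if [ "$3" = "recovery" ]; then
        return 0    # recovery alerts bypass the cooldown
    fi
    [ $(( $1 - $2 )) -ge "$COOLDOWN_SECONDS" ]
}
```

Under these assumptions, a repeat alert 100 seconds after the previous one is suppressed, while a recovery alert at the same moment is still sent.
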
## Production Instances

| Instance | Address | Architecture | Config |
|----------|---------|--------------|--------|
| Hetzner | `100.120.174.96:9100` | x86_64 | `/etc/pom/pom.toml` |
| Astra | `100.106.221.39:9100` | aarch64 | `/etc/pom/pom.toml` |

Both instances monitor the same targets and cross-check each other via the peer mesh.

## Key Files

| What | Where |
|------|-------|
| Config | `/etc/pom/pom.toml` |
| Credentials | `/etc/pom/env` |
| Database | `/var/lib/pom/pom.db` (SQLite) |
| Instance ID | `/var/lib/pom/instance_id` |
| systemd unit | `/etc/systemd/system/pom.service` |
| Deploy script | `deploy/deploy.sh` |