# Infrastructure Scaling Audit

Capacity assessment of the production stack and the upgrade path from current state through 100k creators. Companion to `architecture.md` (what exists).

## Current Topology

**Single production VM**: Hetzner `CCX13 x86` in US-West (`alpha-west-1`, 5.78.144.244 / 100.120.174.96). Runs MNW (:3000), Multithreaded (:3400), PostgreSQL (:5432, both DBs), PoM (:9100), Caddy (:80/:443), Git SSH (:22). Tailscale-only admin SSH on :2200; public :22 is mnw-cli only.

CCX13 = 2 dedicated vCPU / 8 GB RAM / 80 GB NVMe + 10 GB volume, ~20 TB included monthly egress.

**Edge**: Cloudflare proxy ON for `makenot.work`, `*.makenot.work`, `*.maxj.phd` (e.g. `dl.maxj.phd`), `htpy.app`. Full (Strict) SSL via Origin CA wildcards. Authenticated Origin Pulls (mTLS): the origin only accepts the Cloudflare client cert. `cdn.makenot.work` reverse-proxies to Hetzner Object Storage (`fsn1`), with Cloudflare caching at the edge.

**Custom domains**: creator domains bypass Cloudflare and point at the origin directly, where Caddy issues on-demand TLS (LE HTTP-01, gated by `/api/domains/caddy-ask`). Stable routing target: `connect.makenot.work` (A → origin IP, **proxy OFF**); customers CNAME to it (apex via flattening) so the origin IP can change in one place. **The apex `maxj.phd` dogfoods this exact path**: it is a verified custom domain, DNS-only (CNAME-flattened → `connect.makenot.work`), served by the on-demand-LE catch-all, *not* a CF-proxied zone. (`dl.maxj.phd` stays CF-proxied with its own Origin CA cert + mTLS.)
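
The gate on on-demand issuance can be sketched as a small decision function. Caddy's `ask` check sends a GET with the requested hostname in a `domain` query parameter and only proceeds with issuance on a 2xx response. This is a minimal illustration, not the MNW handler: `VERIFIED_DOMAINS` and `ask_status` are hypothetical names, `example-creator.com` is a placeholder, and the real endpoint presumably looks up verified domains in Postgres.

```python
from urllib.parse import parse_qs

# Hypothetical stand-in for the verified custom-domain records;
# a real endpoint would query the database instead.
VERIFIED_DOMAINS = {"maxj.phd", "example-creator.com"}


def ask_status(query_string: str, verified: set[str] = VERIFIED_DOMAINS) -> int:
    """Return the HTTP status an ask endpoint would send for ?domain=<name>.

    200 lets Caddy request a certificate; any non-2xx status cancels issuance.
    """
    values = parse_qs(query_string).get("domain", [])
    if len(values) != 1:
        return 400  # missing or repeated parameter
    # Normalize case and a trailing root dot before the lookup.
    domain = values[0].strip().rstrip(".").lower()
    return 200 if domain in verified else 404


if __name__ == "__main__":
    for qs in ("domain=maxj.phd", "domain=MAXJ.PHD.", "domain=evil.example", ""):
        print(repr(qs), "->", ask_status(qs))
```

Denying by default keeps unverified hostnames from triggering ACME orders against the origin.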

**Object storage**: Hetzner S3 (`fsn1` Frankfurt), presigned PUT/GET. Separate buckets for content and SyncKit blobs.

**Tailscale mesh** carries deploy, CI (astra → alpha), PoM peer health, build offload. Tailscale is **not** in the fan request path: fans go Browser → Cloudflare → public origin.

## Capacity Stages

| Stage | First bottleneck | Cheapest fix |
|---|---|---|
| ~100 creators | Nothing. CCX13 idles. Postgres fits in RAM. CF absorbs read spikes. | Stay put. Verify backups restore. |
| ~1,000 creators | (1) Postgres connection pool / query latency on HTMX dashboards. (2) S3 egress on downloads that bypass CF cache. (3) Caddy on-demand TLS issuance bursts on custom domains. | Resize to CCX23 (4 vCPU / 16 GB). Bump `DB_POOL_MAX_CONNECTIONS`. Ensure long `Cache-Control: immutable` on S3 objects. Consider CF Cache Reserve for cold content. |
| ~10,000 creators | (1) Single-VM SPOF. (2) Postgres write throughput (sessions, scheduler, MT, audit). (3) Hetzner 20 TB egress cap if downloads bypass CF. (4) pg_dump duration on a busy DB. | Split MNW, MT, and Postgres onto separate boxes (all on tailnet). PG to CX42/CCX33 with WAL streaming to astra (already in place). Force all downloads through `cdn.makenot.work`. Read replica for discover/feed. |
| ~100,000 creators | App horizontal scaling: sessions PG-backed (OK), rate limiter in-process (not OK). Search/discover. S3 storage cost itself (PB scale). | Multi-app behind LB (Hetzner LB or CF Load Balancing). Distributed rate limit. Lifecycle policies to migrate cold content to cheaper tier or Backblaze B2 (Bandwidth Alliance). PG → HA primary + replica + PgBouncer. |
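
Before bumping `DB_POOL_MAX_CONNECTIONS` at the ~1,000 stage, it helps to check what the instance can actually hand out. A rough sketch, assuming the PostgreSQL defaults (`max_connections = 100`, `superuser_reserved_connections = 3`): the MNW pool of 25 is from this audit, while the MT pool size and the `other` allowance are placeholders, and `connection_headroom` is a hypothetical helper name.

```python
def connection_headroom(pools: dict[str, int],
                        max_connections: int = 100,
                        reserved: int = 3,
                        other: int = 5) -> int:
    """Connections left over when every application pool is full.

    max_connections=100 and reserved=3 mirror the PostgreSQL defaults;
    `other` is an assumed allowance for WAL streaming, pg_dump, and
    ad-hoc psql sessions.
    """
    return max_connections - reserved - other - sum(pools.values())


if __name__ == "__main__":
    pools = {"mnw": 25, "mt": 10}  # the MT pool size is an assumption
    print("headroom today:", connection_headroom(pools))
    pools["mnw"] = 60              # a hypothetical pool bump
    print("headroom after bump:", connection_headroom(pools))
```

A negative result means the pools can collectively exhaust the server, so `max_connections` (or a pooler such as PgBouncer) has to move together with the pool size.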

## Risk Items to Address Before They Bite

1. **CDN coverage of paid downloads.** Presigned S3 URLs from `routes/storage/...`: verify clients fetch via `cdn.makenot.work` rather than directly from `fsn1.your-objectstorage.com`. Direct fetches skip CF caching and put egress on Hetzner. Largest hidden cost lever at scale.
2. **Cache-Control on S3 objects.** CF only caches what the origin marks cacheable. Confirm uploads set `Cache-Control: public, max-age=31536000, immutable` (content-addressed keys make this safe).
3. **Postgres connection budget.** MNW pool = 25; MT has its own pool; both share the same Postgres instance. PoM should alert on `pg_stat_activity` saturation.
4. **Caddy on-demand TLS ask endpoint.** `/api/domains/caddy-ask` becomes an issuance-abuse target at scale. Confirm rate limits and cap concurrent ACME issuance.
5. **Direct origin exposure.** AOP mTLS protects HTTPS, but `ssh.makenot.work` (CF proxy OFF) is direct. fail2ban is in place; harden further (CF Spectrum or stricter rate limit) when public git traffic grows.
6. **Offsite backups.** Daily local pg_dump plus WAL replication to astra. Both regions could share a fate. Document or add a third-location offsite (B2 / S3-compat) before crossing ~1k creators.
7. **Single region.** Hetzner US-West app + Frankfurt S3 = transatlantic per cache miss. EU creator uploads cross the Atlantic twice. Not urgent; budget for it at 10k+.
8. **Tailscale dependency for ops.** Admin SSH (:2200) is tailnet-only. If the Tailscale control plane is down, the break-glass path is public :22 → mnw-cli only. Confirm the break-glass procedure is documented; cross-reference the `feedback_tailscale_ssh.md` rule.
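
The Cache-Control check in item 2 is easy to script. A minimal sketch that takes a header value (for example, one collected from a HEAD request against a sample object) and tests it against the policy above; `is_long_lived_immutable` is a hypothetical helper, not part of the codebase.

```python
def is_long_lived_immutable(cache_control: str, min_age: int = 31_536_000) -> bool:
    """True if the header is public, immutable, and max-age >= min_age seconds."""
    directives: dict[str, str] = {}
    for part in cache_control.split(","):
        name, _, value = part.strip().partition("=")
        if name:
            directives[name.lower()] = value.strip('"')
    try:
        max_age = int(directives.get("max-age", ""))
    except ValueError:
        return False  # max-age missing or not a number
    return "public" in directives and "immutable" in directives and max_age >= min_age


if __name__ == "__main__":
    samples = [
        "public, max-age=31536000, immutable",
        "public, max-age=3600",
        "",
    ]
    for header in samples:
        print(repr(header), "->", is_long_lived_immutable(header))
```

Objects that fail the check are the ones Cloudflare may refetch from `fsn1`, so a periodic sample across both buckets gives early warning.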

## Recommended Upgrade Path

- **Now → 1k creators:** CCX13 stays. Confirm CDN/cache hygiene (items 1 & 2). Add CF cache hit ratio to weekly review.
- **1k:** Resize in place to CCX23 (one reboot, ~5 min). Bump pool sizing. Tune Postgres `shared_buffers` / `effective_cache_size` for 16 GB.
- **3–5k:** Split DB onto its own box. Add PgBouncer. Move MT to its own VM if forum traffic justifies.
- **10k+:** App tier behind a load balancer. Distributed rate limit. PG read replica for discover/feeds/RSS.
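
For the Postgres tuning step at 1k, a starting point can be derived from RAM size. This sketch uses common rules of thumb (25% of RAM for `shared_buffers`, which the PostgreSQL documentation suggests as a reasonable starting value on a dedicated server, and 75% for `effective_cache_size`); `pg_share` is an assumed fraction of the box left to Postgres while MNW, MT, and Caddy share the VM. Treat the output as a first guess to benchmark, not a tuned result.

```python
def pg_memory_starting_points(ram_gb: float, pg_share: float = 1.0) -> dict[str, str]:
    """Rule-of-thumb starting values for two Postgres memory settings.

    shared_buffers: 25% of the RAM available to Postgres.
    effective_cache_size: 75% of that RAM; a planner hint that
    allocates nothing.
    pg_share: assumed fraction of the machine Postgres may use.
    """
    usable_mb = int(ram_gb * 1024 * pg_share)
    return {
        "shared_buffers": f"{usable_mb // 4}MB",
        "effective_cache_size": f"{usable_mb * 3 // 4}MB",
    }


if __name__ == "__main__":
    # CCX23 has 16 GB; assume half is left for Postgres on the shared VM.
    for name, value in pg_memory_starting_points(16, pg_share=0.5).items():
        print(f"{name} = {value}")
```

Once the DB moves to its own box at 3–5k, `pg_share` goes to 1.0 and the same arithmetic applies.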

## Economic Analysis

All prices are list rates as of 2026-05; verify before budgeting. EUR/USD assumed ~1.08. Tier mix assumed roughly: 50% Basic ($16), 25% Small Files ($24), 20% Big Files ($36), 5% Everything ($60) → blended ARPU ~$24/mo. Tier storage caps (50/250/500/500 GB) are headroom, not actual usage; assume actual fill ~20% of cap at any given time.
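
The blended figures follow directly from the assumed mix. A short worked calculation, using only the shares, prices, and caps stated above (`TIERS`, `FILL`, and `blended` are illustrative names):

```python
# (share of creators, price in $/mo, storage cap in GB), as assumed in the text
TIERS = {
    "Basic":       (0.50, 16, 50),
    "Small Files": (0.25, 24, 250),
    "Big Files":   (0.20, 36, 500),
    "Everything":  (0.05, 60, 500),
}
FILL = 0.20  # assumed fraction of each cap actually used


def blended(index: int) -> float:
    """Mix-weighted average of one tuple field (1 = price, 2 = storage cap)."""
    return sum(tier[0] * tier[index] for tier in TIERS.values())


arpu = blended(1)
stored_gb = blended(2) * FILL

print(f"blended ARPU: ${arpu:.2f}/mo")
print(f"stored per creator: {stored_gb:.1f} GB")
```

With this mix the weighted storage comes to roughly 42 GB per creator, somewhat above the ~30 GB per creator implied by the ~30 TB figure at 1k creators in the projections, so the storage line is sensitive to both the mix and the fill assumption.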

### Compute (Hetzner)

| VM | Price/mo | Specs |
|---|---|---|
| CCX13 (current) | ~$15 | 2 dedicated vCPU / 8 GB / 80 GB NVMe + 10 GB |
| CCX23 | ~$30 | 4 vCPU / 16 GB / 160 GB |
| CCX33 | ~$60 | 8 vCPU / 32 GB / 240 GB |
| CCX43 | ~$120 | 16 vCPU / 64 GB / 360 GB |

Egress: 20 TB/mo included on each VM, $1.20/TB beyond (Hetzner). Object storage egress to Cloudflare counts against this; CF cache hits do not.

### Object storage (Hetzner)

- Storage: €5.99/mo per TB after 1 TB included with first bucket
- Egress: included up to 1 TB/mo per bucket, then €1/TB
- Per-request cost: included
- Rough USD: ~$6.50/TB-month, ~$1.10/TB egress beyond included

### Cloudflare

- Free plan covers proxy, basic DDoS, basic caching. Sufficient through ~10k creators if cache hit ratio stays high.
- Pro ($25/mo): WAF, image optimization. Optional.
- Cache Reserve: $0.015/GB-month stored, $0.36/M reads. Useful once cold-content miss rate matters.
- Bandwidth from CF edge to end users: **free** (this is the headline economic lever).
- Workers/transform/load balancing: not needed before 10k+.

### Postmark

- $15/mo for 10k emails; $1.25/k after on shared plan; volume pricing kicks in beyond 50k/mo.
- At 1k creators with weekly digest + transactional: estimate 30–60k/mo → $50–80/mo.
- At 10k creators: 300–600k/mo → $400–800/mo. Largest non-Stripe variable cost.

### Stripe

- Pass-through to creators (~3% + $0.30 processing). Platform takes 0% of GMV.
- Stripe Connect: no fixed platform fee; per-payout fees minimal for Standard accounts.
- Tax (Stripe Tax): optional, 0.5% of transaction. Not currently enabled.

### Cost Projections by Stage

Numbers are monthly recurring infrastructure cost (not including domain, banking, accounting, contractor labor). Revenue assumes a blended ARPU of $19/mo per paying creator, a more conservative figure than the ~$24/mo that the tier mix above implies.

| Stage | Creators | Compute | Storage | CDN | Email | Offsite backups | **Total infra/mo** | **Revenue/mo** | **Infra as % of rev** |
|---|---|---|---|---|---|---|---|---|---|
| Today | ~10 | $15 | $7 (1 TB) | $0 | $15 | $0 (astra) | **~$37** | ~$190 | ~19% |
| 100 | 100 | $15 | $20 (~3 TB) | $0 | $15 | $5 | **~$55** | ~$1,900 | ~3% |
| 1k | 1,000 | $30 (CCX23) | $200 (~30 TB) | $0 | $80 | $30 (B2/offsite) | **~$340** | ~$19,000 | ~1.8% |
| 10k | 10,000 | $300 (3× CCX33: app, MT, PG) | $2,000 (~300 TB) | $50 (Cache Reserve) | $800 | $300 | **~$3,450** | ~$190,000 | ~1.8% |
| 100k | 100,000 | $2,400 (~10–15 VMs + LB + replicas) | $20,000 (~3 PB) | $500 | $5,000 | $3,000 | **~$30,900** | ~$1,900,000 | ~1.6% |
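
The totals and percentages can be recomputed from the line items, which makes it easy to rerun the table when a price or assumption changes. A small sketch using the table's own figures (`STAGES` and `summarize` are illustrative names):

```python
ARPU = 19  # $/mo per paying creator, as the table assumes

# stage: (creators, compute, storage, cdn, email, offsite backups), costs in $/mo
STAGES = {
    "Today": (10, 15, 7, 0, 15, 0),
    "100":   (100, 15, 20, 0, 15, 5),
    "1k":    (1_000, 30, 200, 0, 80, 30),
    "10k":   (10_000, 300, 2_000, 50, 800, 300),
    "100k":  (100_000, 2_400, 20_000, 500, 5_000, 3_000),
}


def summarize(creators: int, *costs: int) -> tuple[int, int, float]:
    """Return (total infra $/mo, revenue $/mo, infra as % of revenue)."""
    total = sum(costs)
    revenue = creators * ARPU
    return total, revenue, 100 * total / revenue


for stage, row in STAGES.items():
    total, revenue, pct = summarize(*row)
    print(f"{stage:>5}: infra ${total:,}  revenue ${revenue:,}  {pct:.1f}%")
```

Changing `ARPU` to the ~$24 implied by the tier mix lowers every percentage proportionally without touching the cost side.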

### Things That Change the Math

- **CDN cache hit ratio.** At 30% hit ratio, object storage egress dominates beyond 10k creators (could double the storage line). At 90%+ hit ratio (achievable with immutable content-addressed keys), the storage egress line stays near-zero. The single biggest cost lever in the table.
- **Storage fill rate.** "20% of cap" is a guess. If creators actually fill tiers, storage at 10k creators is closer to $10k/mo (~1.5 PB), not $2k. Watch this once real creators are on.
- **Egress to non-CF paths.** Custom domains bypass Cloudflare; downloads via custom domains hit Hetzner's 20 TB cap quickly at scale. Force downloads through `cdn.makenot.work` regardless of profile-page hostname.
- **Backup storage offsite.** If using B2 + Bandwidth Alliance, restore egress is free to CF, which matters if astra is the offsite site of record.
- **Stripe fees are pass-through.** They do not show up in the table because creators pay them, not the platform. But at 100k creators × $19 ARPU, Stripe processes ~$23M/yr through Connect; any per-transaction platform-side fee (Tax, Radar, Identity) materially changes the math.
- **Labor is the actual cost.** Even at 10k creators, infra is ~$3.5k/mo. One contractor for ops/support is 3–5× that. The constraint is not the bill from Hetzner.
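
The hit-ratio lever can be made concrete with a few lines of arithmetic. A sketch under stated assumptions: only cache misses reach the origin, all misses leave through a single VM with the 20 TB allowance and $1.20/TB overage quoted earlier, and the 200 TB/mo served volume is a made-up figure for illustration.

```python
INCLUDED_TB = 20.0     # included monthly egress per VM
OVERAGE_PER_TB = 1.20  # $/TB beyond the included amount


def origin_egress_cost(served_tb: float, hit_ratio: float) -> tuple[float, float]:
    """Return (origin egress in TB, overage cost in $) for one month.

    served_tb is the total volume delivered to fans; only cache misses
    are fetched from the origin.
    """
    if not 0.0 <= hit_ratio <= 1.0:
        raise ValueError("hit_ratio must be between 0 and 1")
    origin_tb = served_tb * (1.0 - hit_ratio)
    overage = max(0.0, origin_tb - INCLUDED_TB) * OVERAGE_PER_TB
    return origin_tb, overage


if __name__ == "__main__":
    for ratio in (0.30, 0.90):
        tb, cost = origin_egress_cost(200.0, ratio)  # 200 TB/mo is assumed
        print(f"hit ratio {ratio:.0%}: {tb:.0f} TB from origin, ${cost:.2f} overage")
```

The same function covers the custom-domain case: traffic that never touches Cloudflare is equivalent to a hit ratio of zero.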

### Margin Read

At every stage past ~100 creators, infrastructure is well under 5% of revenue. The platform is structurally cheap to operate because Cloudflare absorbs the read fan-out for free and Postgres + Caddy on a Hetzner box scales further than people expect. The economic risks are:

1. Stripe fees being raised by Stripe (out of our control).
2. Cache hit ratio collapsing (controllable; track it).
3. Hidden labor cost of support per creator (controllable via DIY tier guardrails).

The 0% platform fee model holds at every projected stage. The bottleneck is creator acquisition and support load, not infrastructure unit economics.