# Infrastructure Scaling Audit

Capacity assessment of the production stack and the upgrade path from current state through 100k creators. Companion to `architecture.md` (what exists).

## Current Topology

**Single production VM.** Hetzner `CCX13 x86` in US-West (`alpha-west-1`, 5.78.144.244 / 100.120.174.96). Runs MNW (:3000), Multithreaded (MT, :3400), PostgreSQL (:5432, both DBs), PoM (:9100), Caddy (:80/:443), Git SSH (:22). Admin SSH is Tailscale-only on :2200; public :22 is mnw-cli only.

CCX13 = 2 dedicated vCPU / 8 GB RAM / 80 GB NVMe + 10 GB volume, ~20 TB included monthly egress.

**Edge.** Cloudflare proxy ON for `makenot.work`, `*.makenot.work`, `*.maxj.phd` (e.g. `dl.maxj.phd`), `htpy.app`. Full (Strict) SSL via Origin CA wildcards. Authenticated Origin Pulls (AOP, mTLS): the origin accepts only connections that present the Cloudflare client cert. `cdn.makenot.work` reverse-proxies to Hetzner Object Storage (`fsn1`), with Cloudflare caching at the edge.

**Custom domains.** Creator domains bypass Cloudflare and point at the origin directly, where Caddy issues on-demand TLS (LE HTTP-01, gated by `/api/domains/caddy-ask`). Stable routing target: `connect.makenot.work` (A → origin IP, **proxy OFF**). Customers CNAME to it (apex via flattening) so the origin IP can change in one place. **The apex `maxj.phd` dogfoods this exact path**: it is a verified custom domain, DNS-only (CNAME-flattened → `connect.makenot.work`), served by the on-demand-LE catch-all, *not* a CF-proxied zone. (`dl.maxj.phd` stays CF-proxied with its own Origin CA cert + mTLS.)

**Object storage.** Hetzner S3 (`fsn1` Falkenstein), presigned PUT/GET. Separate buckets for content and SyncKit blobs.

**Tailscale mesh** carries deploy, CI (astra → alpha), PoM peer health, build offload. Tailscale is **not** in the fan request path: fans go Browser → Cloudflare → public origin.

## Capacity Stages

| Scale | Bottlenecks | Action |
|---|---|---|
| ~100 creators | Nothing. CCX13 idles. Postgres fits in RAM. CF absorbs read spikes. | Stay put. Verify backups restore. |
| ~1,000 creators | (1) Postgres connection pool / query latency on HTMX dashboards. (2) S3 egress on downloads that bypass CF cache. (3) Caddy on-demand TLS issuance bursts on custom domains. | Resize to CCX23 (4 vCPU / 16 GB). Bump `DB_POOL_MAX_CONNECTIONS`. Ensure long `Cache-Control: immutable` on S3 objects. Consider CF Cache Reserve for cold content. |
| ~10,000 creators | (1) Single-VM SPOF. (2) Postgres write throughput (sessions, scheduler, MT, audit). (3) Hetzner 20 TB egress cap if downloads bypass CF. (4) pg_dump duration on a busy DB. | Split MNW, MT, and Postgres onto separate boxes (all on tailnet). PG to CX42/CCX33 with WAL streaming to astra (already in place). Force all downloads through `cdn.makenot.work`. Read replica for discover/feed. |
| ~100,000 creators | App horizontal scaling: sessions PG-backed (OK), rate limiter in-process (not OK). Search/discover. S3 storage cost itself (PB scale). | Multi-app behind LB (Hetzner LB or CF Load Balancing). Distributed rate limit. Lifecycle policies to migrate cold content to cheaper tier or Backblaze B2 (Bandwidth Alliance). PG → HA primary + replica + PgBouncer. |

## Risk Items to Address Before They Bite

1. **CDN coverage of paid downloads.** Presigned S3 URLs come from `routes/storage/...`; verify that clients fetch via `cdn.makenot.work` rather than directly from `fsn1.your-objectstorage.com`. Direct fetches skip CF caching and put egress on Hetzner. Largest hidden cost lever at scale.
2. **Cache-Control on S3 objects.** CF only caches what the origin marks cacheable. Confirm uploads set `Cache-Control: public, max-age=31536000, immutable` (content-addressed keys make this safe).
3. **Postgres connection budget.** MNW pool = 25; MT has its own pool; both share the same Postgres instance. PoM should alert on `pg_stat_activity` saturation.
4. **Caddy on-demand TLS ask endpoint.** `/api/domains/caddy-ask` becomes an issuance-abuse target at scale. Confirm rate limits and cap concurrent ACME issuance.
5. **Direct origin exposure.** AOP mTLS protects HTTPS, but `ssh.makenot.work` (CF proxy OFF) is direct. fail2ban is in place; harden further (CF Spectrum or stricter rate limit) when public git traffic grows.
6. **Offsite backups.** Backups today are a daily local pg_dump plus WAL replication to astra. Both copies could share a fate. Document or add a third-location offsite (B2 / S3-compat) before crossing ~1k creators.
7. **Single region.** Hetzner US-West app + Falkenstein S3 = a transatlantic round trip per cache miss. EU creator uploads cross the Atlantic twice. Not urgent; budget for it at 10k+.
8. **Tailscale dependency for ops.** Admin SSH (:2200) is tailnet-only. If the Tailscale control plane is down, the break-glass path is public :22 → mnw-cli only. Confirm the break-glass procedure is documented; cross-reference the `feedback_tailscale_ssh.md` rule.

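Items 1 and 2 can be spot-checked mechanically. The following is a minimal sketch, not the project's actual tooling: the function name `audit_download` and the idea of feeding it a URL plus the observed `Cache-Control` header are assumptions made for illustration. It flags downloads that point straight at object storage and responses that lack the cache directives listed in item 2.

```python
from typing import List, Optional
from urllib.parse import urlparse

CDN_HOST = "cdn.makenot.work"                # CDN hostname named in this audit
DIRECT_S3_SUFFIX = "your-objectstorage.com"  # suffix of the direct object-storage host in item 1
WANTED_DIRECTIVES = {"public", "max-age=31536000", "immutable"}


def audit_download(url: str, cache_control: Optional[str]) -> List[str]:
    """Return findings for one download URL and its Cache-Control header.

    Hypothetical helper: an empty list means the download looks CDN-friendly.
    """
    findings = []
    host = (urlparse(url).hostname or "").lower()

    # Item 1: direct object-storage fetches skip the Cloudflare cache.
    if host == DIRECT_S3_SUFFIX or host.endswith("." + DIRECT_S3_SUFFIX):
        findings.append(f"direct object-storage fetch via {host}")
    elif host != CDN_HOST:
        findings.append(f"unexpected download host {host}")

    # Item 2: the edge only caches what the origin marks cacheable.
    directives = {
        part.strip().lower()
        for part in (cache_control or "").split(",")
        if part.strip()
    }
    missing = WANTED_DIRECTIVES - directives
    if missing:
        findings.append("Cache-Control missing: " + ", ".join(sorted(missing)))
    return findings


if __name__ == "__main__":
    print(audit_download(
        "https://fsn1.your-objectstorage.com/bucket/key?X-Amz-Signature=abc",
        None,
    ))
```

Running a check like this over a sample of download links from real pages is one way to estimate how much traffic bypasses the cache before it shows up on the egress bill.
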
## Recommended Upgrade Path

- **Now → 1k creators:** CCX13 stays. Confirm CDN/cache hygiene (items 1 & 2). Add CF cache hit ratio to weekly review.
- **1k:** Resize in place to CCX23 (one reboot, ~5 min). Bump pool sizing. Tune Postgres `shared_buffers` / `effective_cache_size` for 16 GB.
- **3–5k:** Split DB onto its own box. Add PgBouncer. Move MT to its own VM if forum traffic justifies.
- **10k+:** App tier behind a load balancer. Distributed rate limit. PG read replica for discover/feeds/RSS.

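For the 1k step, a rough starting point for the two Postgres settings can be derived from the box size. This is a sketch under stated assumptions, not a tuned configuration. It uses the common rule of thumb of about 25% of database memory for `shared_buffers` and about 75% for `effective_cache_size`. Because the CCX23 still hosts MNW, MT, and Caddy at that stage, it sets aside a hypothetical `reserved_for_apps_gb` first; the 4 GB figure is a placeholder, not a measurement.

```python
def pg_memory_settings(ram_gb: int, reserved_for_apps_gb: int = 0) -> dict:
    """Suggest starting values for two Postgres memory settings.

    Assumptions (not from the audit): 25% / 75% rules of thumb, applied to
    whatever RAM is left after reserving memory for co-located services.
    """
    db_ram_gb = ram_gb - reserved_for_apps_gb
    if db_ram_gb <= 0:
        raise ValueError("no memory left for Postgres")
    return {
        "shared_buffers": f"{db_ram_gb // 4}GB",
        "effective_cache_size": f"{db_ram_gb * 3 // 4}GB",
    }


if __name__ == "__main__":
    # CCX23 (16 GB) with a placeholder 4 GB reserved for the app processes.
    for key, value in pg_memory_settings(16, reserved_for_apps_gb=4).items():
        print(f"{key} = {value}")
```

Once the database moves to its own box (the 3–5k step), the reservation drops to zero and the same function gives the dedicated-server starting values.
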
## Economic Analysis

All prices are list rates as of 2026-05; verify before budgeting. EUR/USD assumed ~1.08. Tier mix assumed roughly: 50% Basic ($16), 25% Small Files ($24), 20% Big Files ($36), 5% Everything ($60) → blended ARPU ~$24/mo. Tier storage caps (50/250/500/500 GB) are headroom, not actual usage; assume actual fill ~20% of cap at any given time.

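The blended figures follow directly from the tier mix. Here is a small sketch of that arithmetic, assuming the four storage caps are listed in the same order as the four tiers and that 1 TB = 1000 GB; the function names are illustrative only.

```python
# (tier name, share of creators in %, price in USD/mo, storage cap in GB)
# Assumption: the caps 50/250/500/500 map onto the tiers in listed order.
TIERS = [
    ("Basic", 50, 16, 50),
    ("Small Files", 25, 24, 250),
    ("Big Files", 20, 36, 500),
    ("Everything", 5, 60, 500),
]
FILL_PCT = 20  # assumed actual usage as a percentage of the cap


def blended_arpu_usd(tiers) -> float:
    """Share-weighted average monthly price per creator."""
    return sum(share * price for _, share, price, _ in tiers) / 100


def mean_cap_gb(tiers) -> float:
    """Share-weighted average storage cap per creator."""
    return sum(share * cap for _, share, _, cap in tiers) / 100


def expected_storage_tb(creators: int, tiers, fill_pct: int = FILL_PCT) -> float:
    """Expected stored data across all creators, in TB."""
    return creators * mean_cap_gb(tiers) * fill_pct / 100 / 1000


if __name__ == "__main__":
    print(f"blended ARPU: ${blended_arpu_usd(TIERS):.2f}/mo")
    print(f"mean cap: {mean_cap_gb(TIERS):.1f} GB")
    print(f"storage at 1,000 creators: {expected_storage_tb(1000, TIERS):.1f} TB")
```

Keeping the mix and fill rate as inputs makes it easy to rerun the numbers once real usage data replaces the guesses.
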
### Compute (Hetzner)

| Instance | Price/mo | Specs |
|---|---|---|
| CCX13 (current) | ~$15 | 2 dedicated vCPU / 8 GB / 80 GB NVMe + 10 GB |
| CCX23 | ~$30 | 4 vCPU / 16 GB / 160 GB |
| CCX33 | ~$60 | 8 vCPU / 32 GB / 240 GB |
| CCX43 | ~$120 | 16 vCPU / 64 GB / 360 GB |

Egress: 20 TB/mo included on each VM, $1.20/TB beyond (Hetzner). Object storage egress to Cloudflare counts against this; CF cache hits do not.

### Object storage (Hetzner)

- Storage: €5.99/mo per TB after 1 TB included with first bucket
- Egress: included up to 1 TB/mo per bucket, then €1/TB
- Per-request cost: included
- Rough USD: ~$6.50/TB-month, ~$1.10/TB egress beyond included

### Cloudflare

- Free plan covers proxy, basic DDoS, basic caching. Sufficient through ~10k creators if cache hit ratio stays high.
- Pro ($25/mo): WAF, image optimization. Optional.
- Cache Reserve: $0.015/GB-month stored, $0.36/M reads. Useful once cold-content miss rate matters.
- Bandwidth from CF edge to end users: **free** (this is the headline economic lever).
- Workers/transform/load balancing: not needed before 10k+.

### Postmark

- $15/mo for 10k emails; $1.25 per additional 1k on the shared plan; volume pricing kicks in beyond 50k/mo.
- At 1k creators with weekly digest + transactional: estimate 30–60k/mo → roughly $40–80/mo.
- At 10k creators: 300–600k/mo → $400–800/mo. Largest non-Stripe variable cost.

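The list-rate arithmetic behind these estimates can be sketched as follows. It assumes overage is billed pro rata per email and ignores the volume pricing that applies beyond 50k/mo, so figures above that point are upper bounds. The function name is illustrative.

```python
BASE_USD = 15.0            # monthly plan price
INCLUDED_EMAILS = 10_000   # emails covered by the base price
OVERAGE_USD_PER_1K = 1.25  # list rate per additional 1,000 emails


def postmark_monthly_cost_usd(emails: int) -> float:
    """List-rate monthly cost; assumes pro-rata overage, no volume discount."""
    extra = max(0, emails - INCLUDED_EMAILS)
    return BASE_USD + extra / 1000 * OVERAGE_USD_PER_1K


if __name__ == "__main__":
    for volume in (10_000, 30_000, 60_000, 300_000, 600_000):
        print(f"{volume:>7} emails/mo: ${postmark_monthly_cost_usd(volume):,.2f}")
```
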
### Stripe

- Pass-through to creators (~3% + $0.30 processing). Platform takes 0% of GMV.
- Stripe Connect: no fixed platform fee; per-payout fees minimal for Standard accounts.
- Tax (Stripe Tax): optional, 0.5% of transaction. Not currently enabled.

### Cost Projections by Stage

Numbers are monthly recurring infrastructure cost (not including domain, banking, accounting, contractor labor). Revenue assumes a blended ARPU of $19/mo per paying creator, lower than the ~$24 list-price blend above.

| Stage | Creators | Compute | Object storage | Cloudflare | Postmark | Backups | Total infra | Revenue | Infra % of revenue |
|---|---|---|---|---|---|---|---|---|---|
| Today | ~10 | $15 | $7 (1 TB) | $0 | $15 | $0 (astra) | **~$37** | ~$190 | ~19% |
| 100 | 100 | $15 | $20 (~3 TB) | $0 | $15 | $5 | **~$55** | ~$1,900 | ~3% |
| 1k | 1,000 | $30 (CCX23) | $200 (~30 TB) | $0 | $80 | $30 (B2/offsite) | **~$340** | ~$19,000 | ~1.8% |
| 10k | 10,000 | $300 (3× CCX33: app, MT, PG) | $2,000 (~300 TB) | $50 (Cache Reserve) | $800 | $300 | **~$3,450** | ~$190,000 | ~1.8% |
| 100k | 100,000 | $2,400 (~10–15 VMs + LB + replicas) | $20,000 (~3 PB) | $500 | $5,000 | $3,000 | **~$30,900** | ~$1,900,000 | ~1.6% |

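The totals and percentages in the table can be recomputed from the per-line figures. This sketch copies the dollar amounts from the rows above; the column labels in the comment (compute, object storage, Cloudflare, Postmark, backups) are a reading of what each figure represents, and the $19 ARPU is the one stated in the paragraph before the table.

```python
ARPU_USD = 19  # blended monthly revenue per paying creator used for the table

# (stage, creators, compute, object storage, Cloudflare, Postmark, backups), USD/mo
ROWS = [
    ("Today", 10, 15, 7, 0, 15, 0),
    ("100", 100, 15, 20, 0, 15, 5),
    ("1k", 1_000, 30, 200, 0, 80, 30),
    ("10k", 10_000, 300, 2_000, 50, 800, 300),
    ("100k", 100_000, 2_400, 20_000, 500, 5_000, 3_000),
]


def summarize(row):
    """Return (stage, total infra cost, revenue, infra cost as % of revenue)."""
    stage, creators, *costs = row
    total = sum(costs)
    revenue = creators * ARPU_USD
    return stage, total, revenue, 100 * total / revenue


if __name__ == "__main__":
    for stage, total, revenue, pct in map(summarize, ROWS):
        print(f"{stage:>5}: infra ${total:,}/mo, revenue ${revenue:,}/mo, {pct:.1f}%")
```

Keeping the rows as data makes it cheap to test a different ARPU or a single cost line without redoing the table by hand.
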
### Things That Change the Math

- **CDN cache hit ratio.** At 30% hit ratio, object storage egress dominates beyond 10k creators (could double the storage line). At 90%+ hit ratio (achievable with immutable content-addressed keys), the storage egress line stays near-zero. The single biggest cost lever in the table.
- **Storage fill rate.** "20% of cap" is a guess. If creators actually fill tiers, storage at 10k creators is closer to $10k/mo (~1.5 PB), not $2k. Watch this once real creators are on.
- **Egress to non-CF paths.** Custom domains bypass Cloudflare; downloads via custom domains hit Hetzner's 20 TB cap quickly at scale. Force downloads through `cdn.makenot.work` regardless of profile-page hostname.
- **Backup storage offsite.** If using B2 + Bandwidth Alliance, restore egress to CF is free, which matters if astra is the offsite copy of record.
- **Stripe fees are pass-through.** They do not show up in the table because creators pay them, not the platform. But at 100k creators × $19 ARPU, Stripe processes ~$23M/yr through Connect; any per-transaction platform-side fee (Tax, Radar, Identity) materially changes the math.
- **Labor is the actual cost.** Even at 10k creators, infra is ~$3.5k/mo. One contractor for ops/support is 3–5× that. The constraint is not the bill from Hetzner.

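The cache-hit lever in the first bullet reduces to simple arithmetic: only cache misses reach the origin. The sketch below uses the 20 TB included allowance and $1.20/TB overage from the compute section. The 200 TB monthly fan traffic in the demo is a made-up input for illustration, not a measurement or projection.

```python
INCLUDED_TB = 20           # monthly egress included per VM
OVERAGE_USD_PER_TB = 1.20  # list rate beyond the included allowance


def origin_egress_tb(fan_traffic_tb: float, hit_pct: int) -> float:
    """Traffic that misses the edge cache and is served from the origin."""
    if not 0 <= hit_pct <= 100:
        raise ValueError("hit_pct must be between 0 and 100")
    return fan_traffic_tb * (100 - hit_pct) / 100


def overage_usd(origin_tb: float) -> float:
    """Egress charge for origin traffic beyond the included allowance."""
    return max(0.0, origin_tb - INCLUDED_TB) * OVERAGE_USD_PER_TB


if __name__ == "__main__":
    traffic_tb = 200  # hypothetical monthly fan traffic
    for hit_pct in (30, 60, 90):
        origin = origin_egress_tb(traffic_tb, hit_pct)
        print(f"hit {hit_pct}%: origin {origin:.0f} TB, overage ${overage_usd(origin):.2f}")
```
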
### Margin Read

At every stage past ~100 creators, infrastructure is well under 5% of revenue. The platform is structurally cheap to operate because Cloudflare absorbs the read fan-out for free and Postgres + Caddy on a Hetzner box scales further than people expect. The economic risks are:

1. Stripe fees being raised by Stripe (out of our control).
2. Cache hit ratio collapsing (controllable; track it).
3. Hidden labor cost of support per creator (controllable via DIY tier guardrails).

The 0% platform fee model holds at every projected stage. The bottleneck is creator acquisition and support load, not infrastructure unit economics.