Skip to main content

max / makenotwork

revert "server: tier_prices canonical-assumptions test skips when toml absent" The test must stay strict. assumptions.toml needs to be present in sando's build workspace (separate followup); skipping is the wrong mitigation. Also restores sando/plans/mm-hardware-bom.md which was inadvertently deleted in the reverted commit, and rolls server back from 0.9.3 to 0.9.2. This reverts commit f7d1990e1580ba7c3c764d1ed4c232ef20482a89. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Author: Max Johnson <me@maxj.phd> · 2026-06-01 21:35 UTC
Commit: ad63f4a16f618a21abeae3b306a4ed855da94278
Parent: f7d1990
3 files changed, +92 insertions, -9 deletions
@@ -0,0 +1,91 @@
1 + # MakeMachine Hardware BOM
2 +
3 + Settled 2026-05-23. Top-of-line host platform; GPUs are fungible and live on the EveryCycle GPU thread (see `~/Code/everycycle/docs/roadmap.md`).
4 +
5 + The substrate is built once and kept stable; GPU experimentation happens above it without revisiting motherboard, CPU, or RAM.
6 +
7 + ## Component list
8 +
9 + | Component | Choice | Approx cost | Notes |
10 + |---|---|---|---|
11 + | CPU | AMD Threadripper Pro 7975WX | $3,900 | 32 cores, 5.3 GHz boost, 8-channel DDR5. Same memory controller and PCIe lanes as bigger SKUs; cores past ~32 starve on memory bandwidth for inference. |
12 + | Motherboard | ASRock Rack WRX90D8-2L/2T | $1,200 | 7× PCIe 5.0 ×16, 8 DIMM slots, **ASPEED AST2600 BMC (OpenBMC-friendly)**, dual 10 GbE, 4× M.2 + 4× SlimSAS. |
13 + | RAM | 8× 64 GB DDR5-5600 ECC RDIMM (512 GB total) | $2,500 | One DIMM per channel — 2 DIMMs per channel forces DDR5 down to ~4400 MT/s on WRX90, costing ~20% memory bandwidth (real impact on CPU-offload layers). Brand: Micron or Hynix off the board QVL. |
14 + | Storage 1 | 4 TB Gen5 NVMe (Samsung 9100 Pro or Crucial T705) | $600 | ZFS root pool: OS, models, Sando state, logs. |
15 + | Storage 2 | 4 TB Gen5 NVMe (same) | $600 | **Raw XFS kv-scratch.** Not ZFS — ZFS caps Gen5 throughput; scratch is by definition disposable. EveryCycle uses this for kv-cache overflow. |
16 + | PSU | 2000 W Titanium-class (Super Flower Leadex Titanium or equivalent) | $600 | Headroom for any GPU combination the loose-parts experimentation hits. |
17 + | Chassis | 4U rackmount with ≥4 dual-slot GPU bays (Sliger CX4712 or similar) | $400 | Matches the eventual EveryCycle reference inference box; rackable from day one. |
18 + | CPU cooler | Silverstone XE360-TR5 (air) or Noctua NH-U14S TR5-SP6 | $200 | Air, not AIO — pump failure on a 24/7 box is worse than fan failure. |
19 + | Case fans | 6× 140 mm Noctua industrial | $200 | Front-to-back airflow; GPUs in a 4U breathe through these. |
20 + | Boot GPU | Nvidia GT 1030 (low-profile) | $80 | WRX90 has no iGPU; need a tiny card for console/boot. Also used as console GPU when datacenter cards (no display output) are installed. GT 710 originally specced but effectively EOL retail in 2026; GT 1030 is the current floor. Alternative: skip entirely if WRX90 BMC serial-over-LAN proves reliable. |
21 + | **Host subtotal** | | **~$10,280** | Before any compute GPU. |
22 +
23 + ## First compute GPU (Thread A1)
24 +
25 + | Component | Choice | Approx cost | Notes |
26 + |---|---|---|---|
27 + | GPU | 1× Nvidia Tesla P40 | $200 | 24 GB GDDR5, Pascal (cc 6.1). On-thesis: Pascal is the next architecture facing CUDA-legacy transition. |
28 + | Cooling adapter | 3D-printed or commercial fan shroud for P40 | $25 | Server card; passive cooling needs chassis airflow OR a strapped-on fan. |
29 + | Power adapter | EPS 8-pin to PCIe (or proper EPS routing) | $10 | P40 uses CPU-style 8-pin, not standard PCIe 8-pin. |
30 + | **GPU subtotal at A1** | | **~$235** | |
31 +
32 + ## Total
33 +
34 + | | |
35 + |---|---|
36 + | Host platform | ~$10,280 |
37 + | First GPU (A1) | ~$235 |
38 + | **MakeMachine v0 (assembled, runnable)** | **~$10,515** |
39 +
40 + Budget originally specced at $14–16K for the previous spec. Net savings: ~$4–5K. The savings are the GPU experimentation budget for advancing the GPU thread through A4–A5 over time.
41 +
42 + ## BIOS settings worth confirming at first boot
43 +
44 + - **Above 4G Decoding: enabled.** Required for datacenter cards with large BARs (Tesla P40, MI50, etc.).
45 + - **Resizable BAR: enabled.** Same reason.
46 + - **IOMMU: enabled.** Required for podman + CDI GPU passthrough.
47 + - **SR-IOV: as needed.** Not critical at A1; revisit at A5.
48 + - **Memory speed: DDR5-5600 (JEDEC).** Not pushed past spec.
49 + - **PCIe lane bifurcation: leave default initially; revisit for multi-GPU configurations.**
50 +
51 + ## Why this spec, summarized
52 +
53 + - **Threadripper Pro WRX90 platform** chosen over consumer TRX50 for the 7× PCIe 5.0 ×16 slots and the AST2600 BMC. EveryCycle's `bmc-agent` crate needs a real BMC to talk to.
54 + - **512 GB RAM** lets a 405B Q4 model fit fully in RAM if ever wanted; the immediate need is comfortable headroom for kv-cache experiments.
55 + - **Two Gen5 NVMes with different filesystems** because ZFS root is the right answer for the OS and model store, but ZFS would cap Gen5 bandwidth on the kv-scratch where bandwidth is the whole point.
56 + - **2000 W PSU** is sized for the worst-case 4-GPU configuration on the GPU thread, not for A1 alone.
57 + - **4U rackmount chassis** because it doubles as a prototype of the eventual EveryCycle reference appliance form factor.
58 +
59 + ## Sourcing notes
60 +
61 + Most of the spine is specialist channels; Microcenter covers the commodity parts.
62 +
63 + | Part | Channel |
64 + |---|---|
65 + | TR Pro 7975WX | Newegg, ShopBLT, Provantage |
66 + | WRX90D8-2L/2T | ASRock Rack direct, ShopBLT (longest lead — order first) |
67 + | DDR5-5600 ECC RDIMM (Micron/Hynix off board QVL) | Nemix, ServerSupply, Newegg |
68 + | Sliger CX4712 | Sliger direct (~6 wk historical lead). Fallback: Rosewill RSV-L4500U from Newegg if waiting is unacceptable. |
69 + | Tesla P40 + EPS-to-PCIe adapter + fan shroud | eBay |
70 + | 2× 4 TB Gen5 NVMe (9100 Pro / T705) | Microcenter |
71 + | PSU | Microcenter if Super Flower Leadex Titanium 2000 W in stock; otherwise Seasonic PRIME TX-1600 is the MC-reliable alternative (sufficient for A1–A3, marginal at A4 4-GPU). |
72 + | Noctua NF-A14 industrialPPC-3000 fans | Microcenter |
73 + | NH-U14S TR5-SP6 cooler | Noctua direct or Newegg; Microcenter rarely stocks |
74 + | GT 1030 boot GPU, paste, cables, M.2 heatsinks | Microcenter |
75 +
76 + Suggested order of operations: board first (longest lead) → CPU + RAM kit together (QVL match) → chassis → Microcenter run for commodity parts → P40 last (cheapest, most fungible).
77 +
78 + ## Open considerations (not blocking purchase)
79 +
80 + - **PSU redundancy.** Single 2000 W Titanium is fine for v0. Dual-redundant CRPS (e.g. FSP Twins) is the server-class move for v1.
81 + - **Cooler headroom.** NH-U14S TR5-SP6 is adequate for the 7975WX's 350 W TDP, not generous. XE360-TR5 (listed alt) gives more headroom for sustained all-core + GPU host load. Both are air.
82 + - **Boot GPU skip.** If WRX90 BMC serial-over-LAN proves reliable, GT 1030 line item is removable and a slot is reclaimed.
83 + - **kv-scratch PLP.** Consumer 9100 Pro / T705 have no power-loss protection. Fine for disposable scratch; revisit if EveryCycle ever checkpoints kv-cache through it.
84 +
85 + ## What this BOM deliberately does not include
86 +
87 + - **Multiple compute GPUs at purchase time.** Each GPU thread milestone is a discrete add; advancing the thread is its own decision moment.
88 + - **Liquid cooling.** More maintenance overhead than a single-operator shop should carry.
89 + - **Optane / SLC drives.** Optane is discontinued; Gen5 TLC is the right answer now.
90 + - **128-core CPU.** Memory-bandwidth-starved for inference; ~$6K extra for no inference gain.
91 + - **A second NIC card.** The 2× 10 GbE onboard is enough for v0.
@@ -1,6 +1,6 @@
1 1 [package]
2 2 name = "makenotwork"
3 - version = "0.9.3"
3 + version = "0.9.2"
4 4 edition = "2024"
5 5 license-file = "LICENSE"
6 6
@@ -164,16 +164,8 @@ mod tests {
164 164 /// Guards every key TierPrices reads. If a future toml edit removes one of
165 165 /// these or flips its type, the panic in `from_assumptions` will fire at
166 166 /// startup; this test catches it at PR time instead.
167 - ///
168 - /// Skipped when the canonical toml isn't present — sando builds MNW from a
169 - /// bare-repo clone without _private/ alongside, and CI images may not mount
170 - /// the private store. Local + dev runs still execute it.
171 167 #[test]
172 168 fn from_canonical_assumptions_populates_every_field() {
173 - if !std::path::Path::new(ASSUMPTIONS_PATH).exists() {
174 - eprintln!("skipping: {ASSUMPTIONS_PATH} not present (sando/CI workspace)");
175 - return;
176 - }
177 169 let a = Assumptions::load(ASSUMPTIONS_PATH).expect("load canonical toml");
178 170 let p = TierPrices::from_assumptions(&a);
179 171