max / makenotwork

revert "server: tier_prices canonical-assumptions test skips when toml absent" The test must stay strict. assumptions.toml needs to be present in sando's build workspace (separate followup); skipping is the wrong mitigation. Also restores sando/plans/mm-hardware-bom.md which was inadvertently deleted in the reverted commit, and rolls server back from 0.9.3 to 0.9.2. This reverts commit f7d1990e1580ba7c3c764d1ed4c232ef20482a89. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Author: Max Johnson <me@maxj.phd> · 2026-06-01 21:35 UTC

Commit: ad63f4a16f618a21abeae3b306a4ed855da94278

Parent: f7d1990

3 files changed, +92 insertions, -9 deletions

A sando/plans/mm-hardware-bom.md +91

		@@ -0,0 +1,91 @@
1	+	# MakeMachine Hardware BOM
2	+
3	+	Settled 2026-05-23. Top-of-line host platform; GPUs are fungible and live on the EveryCycle GPU thread (see `~/Code/everycycle/docs/roadmap.md`).
4	+
5	+	The substrate is built once and kept stable; GPU experimentation happens above it without revisiting motherboard, CPU, or RAM.
6	+
7	+	## Component list
8	+
9	+	\| Component \| Choice \| Approx cost \| Notes \|
10	+	\|---\|---\|---\|---\|
11	+	\| CPU \| AMD Threadripper Pro 7975WX \| $3,900 \| 32 cores, 5.3 GHz boost, 8-channel DDR5. Same memory controller and PCIe lanes as bigger SKUs; cores past ~32 starve on memory bandwidth for inference. \|
12	+	\| Motherboard \| ASRock Rack WRX90D8-2L/2T \| $1,200 \| 7× PCIe 5.0 ×16, 8 DIMM slots, ASPEED AST2600 BMC (OpenBMC-friendly), dual 10 GbE, 4× M.2 + 4× SlimSAS. \|
13	+	\| RAM \| 8× 64 GB DDR5-5600 ECC RDIMM (512 GB total) \| $2,500 \| One DIMM per channel — 2 DIMMs per channel forces DDR5 down to ~4400 MT/s on WRX90, costing ~20% memory bandwidth (real impact on CPU-offload layers). Brand: Micron or Hynix off the board QVL. \|
14	+	\| Storage 1 \| 4 TB Gen5 NVMe (Samsung 9100 Pro or Crucial T705) \| $600 \| ZFS root pool: OS, models, Sando state, logs. \|
15	+	\| Storage 2 \| 4 TB Gen5 NVMe (same) \| $600 \| Raw XFS kv-scratch. Not ZFS — ZFS caps Gen5 throughput; scratch is by definition disposable. EveryCycle uses this for kv-cache overflow. \|
16	+	\| PSU \| 2000 W Titanium-class (Super Flower Leadex Titanium or equivalent) \| $600 \| Headroom for any GPU combination the loose-parts experimentation hits. \|
17	+	\| Chassis \| 4U rackmount with ≥4 dual-slot GPU bays (Sliger CX4712 or similar) \| $400 \| Matches the eventual EveryCycle reference inference box; rackable from day one. \|
18	+	\| CPU cooler \| Silverstone XE360-TR5 (air) or Noctua NH-U14S TR5-SP6 \| $200 \| Air, not AIO — pump failure on a 24/7 box is worse than fan failure. \|
19	+	\| Case fans \| 6× 140 mm Noctua industrial \| $200 \| Front-to-back airflow; GPUs in a 4U breathe through these. \|
20	+	\| Boot GPU \| Nvidia GT 1030 (low-profile) \| $80 \| WRX90 has no iGPU; need a tiny card for console/boot. Also used as console GPU when datacenter cards (no display output) are installed. GT 710 originally specced but effectively EOL retail in 2026; GT 1030 is the current floor. Alternative: skip entirely if WRX90 BMC serial-over-LAN proves reliable. \|
21	+	\| Host subtotal \| \| ~$10,280 \| Before any compute GPU. \|
22	+
23	+	## First compute GPU (Thread A1)
24	+
25	+	\| Component \| Choice \| Approx cost \| Notes \|
26	+	\|---\|---\|---\|---\|
27	+	\| GPU \| 1× Nvidia Tesla P40 \| $200 \| 24 GB GDDR5, Pascal (cc 6.1). On-thesis: Pascal is the next architecture facing CUDA-legacy transition. \|
28	+	\| Cooling adapter \| 3D-printed or commercial fan shroud for P40 \| $25 \| Server card; passive cooling needs chassis airflow OR a strapped-on fan. \|
29	+	\| Power adapter \| EPS 8-pin to PCIe (or proper EPS routing) \| $10 \| P40 uses CPU-style 8-pin, not standard PCIe 8-pin. \|
30	+	\| GPU subtotal at A1 \| \| ~$235 \| \|
31	+
32	+	## Total
33	+
34	+	\| \| \|
35	+	\|---\|---\|
36	+	\| Host platform \| ~$10,280 \|
37	+	\| First GPU (A1) \| ~$235 \|
38	+	\| MakeMachine v0 (assembled, runnable) \| ~$10,515 \|
39	+
40	+	Budget originally specced at $14–16K for the previous spec. Net savings: ~$4–5K. The savings are the GPU experimentation budget for advancing the GPU thread through A4–A5 over time.
41	+
42	+	## BIOS settings worth confirming at first boot
43	+
44	+	- Above 4G Decoding: enabled. Required for datacenter cards with large BARs (Tesla P40, MI50, etc.).
45	+	- Resizable BAR: enabled. Same reason.
46	+	- IOMMU: enabled. Required for podman + CDI GPU passthrough.
47	+	- SR-IOV: as needed. Not critical at A1; revisit at A5.
48	+	- Memory speed: DDR5-5600 (JEDEC). Not pushed past spec.
49	+	- PCIe lane bifurcation: leave default initially; revisit for multi-GPU configurations.
50	+
51	+	## Why this spec, summarized
52	+
53	+	- Threadripper Pro WRX90 platform chosen over consumer TRX50 for the 7× PCIe 5.0 ×16 slots and the AST2600 BMC. EveryCycle's `bmc-agent` crate needs a real BMC to talk to.
54	+	- 512 GB RAM lets a 405B Q4 model fit fully in RAM if ever wanted; the immediate need is comfortable headroom for kv-cache experiments.
55	+	- Two Gen5 NVMes with different filesystems because ZFS root is the right answer for the OS and model store, but ZFS would cap Gen5 bandwidth on the kv-scratch where bandwidth is the whole point.
56	+	- 2000 W PSU is sized for the worst-case 4-GPU configuration on the GPU thread, not for A1 alone.
57	+	- 4U rackmount chassis because it doubles as a prototype of the eventual EveryCycle reference appliance form factor.
58	+
59	+	## Sourcing notes
60	+
61	+	Most of the spine is specialist channels; Microcenter covers the commodity parts.
62	+
63	+	\| Part \| Channel \|
64	+	\|---\|---\|
65	+	\| TR Pro 7975WX \| Newegg, ShopBLT, Provantage \|
66	+	\| WRX90D8-2L/2T \| ASRock Rack direct, ShopBLT (longest lead — order first) \|
67	+	\| DDR5-5600 ECC RDIMM (Micron/Hynix off board QVL) \| Nemix, ServerSupply, Newegg \|
68	+	\| Sliger CX4712 \| Sliger direct (~6 wk historical lead). Fallback: Rosewill RSV-L4500U from Newegg if waiting is unacceptable. \|
69	+	\| Tesla P40 + EPS-to-PCIe adapter + fan shroud \| eBay \|
70	+	\| 2× 4 TB Gen5 NVMe (9100 Pro / T705) \| Microcenter \|
71	+	\| PSU \| Microcenter if Super Flower Leadex Titanium 2000 W in stock; otherwise Seasonic PRIME TX-1600 is the MC-reliable alternative (sufficient for A1–A3, marginal at A4 4-GPU). \|
72	+	\| Noctua NF-A14 industrialPPC-3000 fans \| Microcenter \|
73	+	\| NH-U14S TR5-SP6 cooler \| Noctua direct or Newegg; Microcenter rarely stocks \|
74	+	\| GT 1030 boot GPU, paste, cables, M.2 heatsinks \| Microcenter \|
75	+
76	+	Suggested order of operations: board first (longest lead) → CPU + RAM kit together (QVL match) → chassis → Microcenter run for commodity parts → P40 last (cheapest, most fungible).
77	+
78	+	## Open considerations (not blocking purchase)
79	+
80	+	- PSU redundancy. Single 2000 W Titanium is fine for v0. Dual-redundant CRPS (e.g. FSP Twins) is the server-class move for v1.
81	+	- Cooler headroom. NH-U14S TR5-SP6 is adequate for the 7975WX's 350 W TDP, not generous. XE360-TR5 (listed alt) gives more headroom for sustained all-core + GPU host load. Both are air.
82	+	- Boot GPU skip. If WRX90 BMC serial-over-LAN proves reliable, GT 1030 line item is removable and a slot is reclaimed.
83	+	- kv-scratch PLP. Consumer 9100 Pro / T705 have no power-loss protection. Fine for disposable scratch; revisit if EveryCycle ever checkpoints kv-cache through it.
84	+
85	+	## What this BOM deliberately does not include
86	+
87	+	- Multiple compute GPUs at purchase time. Each GPU thread milestone is a discrete add; advancing the thread is its own decision moment.
88	+	- Liquid cooling. More maintenance overhead than a single-operator shop should carry.
89	+	- Optane / SLC drives. Optane is discontinued; Gen5 TLC is the right answer now.
90	+	- 128-core CPU. Memory-bandwidth-starved for inference; ~$6K extra for no inference gain.
91	+	- A second NIC card. The 2× 10 GbE onboard is enough for v0.

M server/Cargo.toml +1 -1

			@@ -1,6 +1,6 @@
1	1		[package]
2	2		name = "makenotwork"
3		-	version = "0.9.3"
	3	+	version = "0.9.2"
4	4		edition = "2024"
5	5		license-file = "LICENSE"
6	6

M server/src/tier_prices.rs -8

			@@ -164,16 +164,8 @@ mod tests {
164	164		/// Guards every key TierPrices reads. If a future toml edit removes one of
165	165		/// these or flips its type, the panic in `from_assumptions` will fire at
166	166		/// startup; this test catches it at PR time instead.
167		-	///
168		-	/// Skipped when the canonical toml isn't present — sando builds MNW from a
169		-	/// bare-repo clone without _private/ alongside, and CI images may not mount
170		-	/// the private store. Local + dev runs still execute it.
171	167		#[test]
172	168		fn from_canonical_assumptions_populates_every_field() {
173		-	if !std::path::Path::new(ASSUMPTIONS_PATH).exists() {
174		-	eprintln!("skipping: {ASSUMPTIONS_PATH} not present (sando/CI workspace)");
175		-	return;
176		-	}
177	169		let a = Assumptions::load(ASSUMPTIONS_PATH).expect("load canonical toml");
178	170		let p = TierPrices::from_assumptions(&a);
179	171