Build guide
Twin 4090s for high-throughput 34B–70B inference with NVLink-ready parts.
Budget
$6,500
Profile
HOME INFERENCE
Target model
Llama 3.3 70B
68.8tok/s
58.5–79.1 tok/s decode on Llama 3.3 70B
Value: 10.58 tok/s per $1k
The dual RTX 4090 configuration is the practical ceiling for consumer-grade 70B inference. Two 4090s provide 48 GB of combined VRAM, which fits Llama 3.3 70B at Q4_K_M with enough headroom for a 32K context window — a combination no single consumer card can match. With tensor parallelism across both cards, decode throughput scales close to linearly: expect 25–40 tok/s on 70B at Q4, compared to 4–8 tok/s on a single 3090. The NVLink-ready motherboard and PCIe slot spacing matter: GPUs running without NVLink will share bandwidth over PCIe, which caps throughput. This build is suited for a single power user or a very small team (2–3 concurrent users on 34B). It isn't a rack server — cooling and noise make it a desktop workstation, not a headless inference node. Teams needing higher concurrency or larger models should look at purpose-built server hardware.