Build guide

Dual RTX 4090 Workstation

Twin 4090s for high-throughput 34B–70B inference with NVLink-ready parts.

Budget

$6,500

Profile

HOME INFERENCE

Target model

Llama 3.3 70B

View saved build Customize All guides

Performance

ESTIMATED

68.8tok/s

58.5–79.1 tok/s decode on Llama 3.3 70B

Value: 10.58 tok/s per $1k

Why this build

The dual RTX 4090 configuration is the practical ceiling for consumer-grade 70B inference. Two 4090s provide 48 GB of combined VRAM, which fits Llama 3.3 70B at Q4_K_M with enough headroom for a 32K context window — a combination no single consumer card can match. With tensor parallelism across both cards, decode throughput scales close to linearly: expect 25–40 tok/s on 70B at Q4, compared to 4–8 tok/s on a single 3090. The NVLink-ready motherboard and PCIe slot spacing matter: GPUs running without NVLink will share bandwidth over PCIe, which caps throughput. This build is suited for a single power user or a very small team (2–3 concurrent users on 34B). It isn't a rack server — cooling and noise make it a desktop workstation, not a headless inference node. Teams needing higher concurrency or larger models should look at purpose-built server hardware.

Parts list

GPU
NVIDIA GeForce RTX 4090
Amazon · $1,580
$3,160 (2×)
CPU
AMD Ryzen 9 7950X
Amazon · $535
$535
Motherboard
ASUS ROG STRIX X670E-E GAMING WIFI
Amazon · $503
$503