Multi-GPU Builds for Local AI: When 2× RTX 4090 (or 3090) Makes Sense

June 16, 2026 · 6 min read · By the ClankerBuilder editorial team

When to go multi-GPU for local LLM inference — NVLink vs PCIe bandwidth, 70B model performance, build requirements, and cost vs single A100.

A single 24GB GPU cannot comfortably run a 70B model. At Q4_K_M quantization, Llama 3.3 70B needs approximately 43GB of VRAM — more than any single consumer card offers. The options are to buy a professional 48GB+ card (expensive), rent from the cloud (ongoing cost), or pool two consumer cards. Multi-GPU consumer builds have become the dominant home approach for 70B inference, but not all multi-GPU setups are equal.

The critical distinction is how the two cards communicate. RTX 3090s can be connected with an NVLink bridge, which gives the cards a fast direct link for inter-GPU traffic. The RTX 4090 dropped NVLink, so two 4090s communicate over PCIe, which is meaningfully slower. According to the community estimates cited later in this guide, the difference changes throughput on 70B models enough to affect which card makes sense for a multi-GPU build: dual 3090s with NVLink can outperform dual 4090s on 70B, despite the 4090 being the faster single card.

Why You'd Go Multi-GPU: 70B+ Models

The math is straightforward. Llama 3.3 70B at Q4_K_M takes approximately 43GB. A single RTX 4090 or 3090 has 24GB — not enough. A professional card like the RTX A6000 (48GB) fits it on one card but costs $4,000+. Two 24GB consumer cards at $450–1,600 each reach 48GB at a fraction of the cost.
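The sizing above can be reproduced with a few lines of arithmetic. This is a rough sketch, not a measurement: the 70.6B parameter count, the ~4.85 bits per weight for Q4_K_M, and the 3GB allowance for KV cache and buffers are all assumptions chosen for illustration, and the function names are hypothetical.

```python
import math

def weights_gb(params_billion, bits_per_weight):
    """Approximate size of the quantized weights in GB (10^9 bytes)."""
    return params_billion * bits_per_weight / 8

def cards_needed(model_gb, vram_per_card_gb=24, overhead_gb=3):
    """Cards required to hold the weights plus an assumed runtime overhead."""
    return math.ceil((model_gb + overhead_gb) / vram_per_card_gb)

# Assumed figures: ~70.6B parameters, ~4.85 bits per weight for Q4_K_M.
llama_70b_gb = weights_gb(70.6, 4.85)

print(f"70B weights: about {llama_70b_gb:.1f} GB")
print(f"24GB cards needed: {cards_needed(llama_70b_gb)}")
```

Longer context windows raise the overhead term, so treat the 3GB figure as a placeholder to adjust for your own workload.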

For 7B through 34B models, a single 24GB card is the right setup. Adding a second card for models that already fit adds PCIe overhead without benefit. Multi-GPU is specifically the solution for the 70B class and above, where the model needs more than the 24GB ceiling of a single consumer card. If you only run 8B and 24B models now but anticipate wanting 70B, planning for dual-card compatibility at build time (proper motherboard, PSU headroom, case) is a worthwhile hedge.

RTX 3090: NVLink Changes Everything

Two RTX 3090s with an NVLink bridge give you 48GB of combined VRAM across two 24GB devices. The inference runtime still sees two GPUs and splits the model between them; NVLink does not merge them into a single 48GB card. What the bridge provides is a direct, high-bandwidth path between the cards, substantially faster than PCIe for inter-GPU transfers, so a Q4_K_M 70B model split across both cards spends less time waiting on communication.

Based on third-party community estimates, a dual-3090 NVLink setup delivers approximately 50–60 tok/s on Llama 3.3 70B at Q4_K_M. These are community-reported estimates, not measurements run by ClankerBuilder. NVLink bridges compatible with the RTX 3090 NVLink connector typically cost $120–200 and must match the physical card spacing (check your motherboard's PCIe slot layout before ordering). The cards must be identical or at minimum use the same physical NVLink connector — mixing AIB cooler designs can cause fitment issues.

Two 3090s + NVLink bridge = 48GB combined VRAM (still two 24GB devices)
Community-estimated ~50–60 tok/s on 70B Q4_K_M
NVLink bridge cost: ~$120–200
Cards must be matching (same connector style) — check cooler clearance
Street price for two used 3090s: ~$900–1,300

RTX 4090: PCIe-Only Multi-GPU

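The RTX 4090 (Ada Lovelace) has no NVLink connector. Two RTX 4090s can still run a 70B model together: the runtime either assigns whole layers to each card or, with tensor parallelism, shards each layer's weight matrices across both cards and computes them in parallel. Either way, inter-GPU communication crosses the PCIe bus. PCIe 4.0 x16 provides approximately 32 GB/s in each direction, and many consumer boards drop to x8 (about 16 GB/s) per card when both slots are populated. That is well below the RTX 3090's NVLink bridge, which NVIDIA rates at 112.5 GB/s total, and the gap shows up on large models where frequent inter-GPU transfers are required.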
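To see why splitting a model creates inter-GPU traffic at all, here is a toy sketch of tensor parallelism on a single linear layer. It is pure Python with no real GPUs involved: the two "devices" are simply two column shards of a small weight matrix, and the helper names are hypothetical. Real runtimes do the same thing with large tensors and a collective operation over PCIe or NVLink.

```python
def matvec(x, w):
    """Multiply row vector x (length n) by matrix w (n rows, m columns)."""
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(len(w[0]))]

def split_columns(w, parts=2):
    """Split a weight matrix column-wise into equal shards, one per device."""
    step = len(w[0]) // parts
    return [[row[p * step:(p + 1) * step] for row in w] for p in range(parts)]

x = [1.0, 2.0, 3.0]
w = [[1.0, 0.0, 2.0, 1.0],
     [0.0, 1.0, 1.0, 0.0],
     [1.0, 1.0, 0.0, 2.0]]

shards = split_columns(w)                      # one shard per "GPU"
partials = [matvec(x, s) for s in shards]      # each device computes its own slice
gathered = partials[0] + partials[1]           # the gather step that crosses the link

full = matvec(x, w)                            # single-device reference result
print(gathered == full)
```

Each device only ever holds half of the weights, but the partial outputs must be combined after every sharded layer, and that exchange is where link speed matters.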

Based on third-party community estimates, a dual-4090 PCIe setup delivers approximately 35–50 tok/s on 70B models — lower than the dual-3090 NVLink setup despite each 4090 being the faster single card. The 4090's higher per-card bandwidth helps on models that fit on one card, but for 70B inference split across two cards via PCIe, the communication overhead erases much of the per-card advantage. For single-card work (7B–34B), the 4090 remains the faster card.

Two 4090s: 48GB total, but PCIe communication between cards
Community-estimated ~35–50 tok/s on 70B at Q4_K_M
Slower than dual-3090 NVLink on 70B despite higher per-card speed
Faster than a 3090 for 34B and under, where the model fits on a single card
Street price for two new 4090s: ~$3,200+

What You Need: System Requirements

Multi-GPU builds have real system requirements beyond the cards themselves. For two RTX 4090s, you need a motherboard with two mechanical x16 PCIe slots wired to the CPU. On consumer Z790 boards the CPU's 16 graphics lanes are shared, so two cards typically run at x8/x8; check the spec sheet, because some boards wire the second slot as x4. For a full x16 link to each card, use an HEDT platform like TRX50. For two RTX 3090s with NVLink, the slot spacing is critical: NVLink bridges have a fixed pitch, so the two slots must be exactly the right distance apart. Many ATX motherboards have adequate spacing; microATX boards often do not.
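The difference between an electrical x16, x8, and x4 slot can be put in numbers. The sketch below derives approximate per-direction bandwidth from the per-lane signalling rate and the 128b/130b line encoding used by PCIe 3.0 and later. The function name is hypothetical, and real-world throughput is somewhat lower once protocol overhead is included.

```python
# Raw per-lane signalling rate in GT/s for recent PCIe generations.
GT_PER_SEC = {3: 8.0, 4: 16.0, 5: 32.0}

# PCIe 3.0 and later carry 128 payload bits in every 130 bits on the wire.
ENCODING_EFFICIENCY = 128 / 130

def pcie_gb_per_s(gen, lanes):
    """Approximate bandwidth in GB/s, per direction, before protocol overhead."""
    return GT_PER_SEC[gen] * ENCODING_EFFICIENCY * lanes / 8

for lanes in (16, 8, 4):
    print(f"PCIe 4.0 x{lanes}: about {pcie_gb_per_s(4, lanes):.1f} GB/s per direction")
```

The RTX 4090 and RTX 3090 are both PCIe 4.0 cards, so a second slot that is only x4 electrically leaves that card with roughly a quarter of the x16 figure.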

Power is a significant constraint. Two RTX 4090s draw up to 900W under load just for the GPUs. Add CPU, storage, and cooling and you need a 1,400–1,600W PSU with enough 12V capacity and GPU power connectors for both cards. Two RTX 3090s at full load draw approximately 700W from the GPUs, requiring a 1,200W+ PSU. A full-tower case with clearance for two large cards and adequate airflow is strongly recommended; mid-tower cases that physically fit two large cards often lack the cooling volume for sustained dual-GPU loads.

Dual 4090: Z790 with two CPU-attached x16-length slots (x8/x8) or HEDT platform
Dual 3090 NVLink: verify exact slot spacing for NVLink bridge pitch
PSU: 1,400W+ for dual 4090; 1,200W+ for dual 3090
64GB+ system RAM recommended for 70B model serving
Full-tower case strongly preferred for airflow and slot clearance
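The PSU figures above follow from simple addition plus headroom. A minimal sketch, assuming 450W and 350W board power for the 4090 and 3090 (consistent with the 900W and 700W totals quoted here), a hypothetical 250W allowance for the rest of the system, and 25% headroom; the function name and both allowances are illustrative choices, not a sizing standard.

```python
import math

def recommended_psu_watts(gpu_count, gpu_watts,
                          rest_of_system_watts=250, headroom=1.25, step=50):
    """Sum the sustained load, apply headroom, round up to the next PSU size step."""
    load = gpu_count * gpu_watts + rest_of_system_watts
    return math.ceil(load * headroom / step) * step

dual_4090_psu = recommended_psu_watts(2, 450)
dual_3090_psu = recommended_psu_watts(2, 350)

print(f"Dual 4090: {dual_4090_psu} W")
print(f"Dual 3090: {dual_3090_psu} W")
```

A high-wattage CPU or many drives pushes the rest-of-system allowance up, so adjust that input before trusting the result.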

Multi-GPU vs Single A100 80GB

A used A100 80GB sells for approximately $7,000–10,000 in 2026. A dual RTX 3090 setup with NVLink costs approximately $1,000–1,500 total (two used 3090s at $900–1,300 plus a $120–200 bridge). The A100 wins on single-card simplicity, higher memory bandwidth (~2TB/s HBM2e vs ~1.9TB/s combined GDDR6X), ECC memory for reliability, better multi-GPU interconnect (NVLink 3.0 at higher bandwidth than consumer NVLink), and superior fine-tuning support at scale.

For personal home use running inference only, the dual 3090 wins on cost — the roughly $6,000–8,000 savings is decisive for most budgets, and throughput for a single user is adequate. For a shared team inference server, for fine-tuning large models, or for environments where ECC and sustained-load reliability matter, the A100 can justify its price. The break-even point is essentially about use case: personal inference versus professional or team-scale workloads.

Used A100 80GB: $7,000–10,000; single card, ECC, HBM2e bandwidth
Dual RTX 3090 NVLink: ~$1,000–1,500; requires bridge, more setup
Personal inference: dual 3090 wins on cost by $6,000+
Team server or fine-tuning at scale: A100 may justify the premium
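For readers who want to plug in their own local prices, here is a minimal sketch of the comparison. It adds the low and high ends of the price ranges quoted in this guide; the helper names are hypothetical and the prices are the article's estimates, not live market data.

```python
def add_ranges(a, b):
    """Add two (low, high) price ranges."""
    return (a[0] + b[0], a[1] + b[1])

def savings_range(expensive, cheap):
    """Smallest and largest possible saving between two (low, high) ranges."""
    return (expensive[0] - cheap[1], expensive[1] - cheap[0])

# Price ranges in USD, as quoted in this guide.
used_3090_pair = (900, 1300)
nvlink_bridge = (120, 200)
used_a100_80gb = (7000, 10000)

dual_3090_total = add_ranges(used_3090_pair, nvlink_bridge)
savings = savings_range(used_a100_80gb, dual_3090_total)

print(f"Dual 3090 + bridge: ${dual_3090_total[0]:,} to ${dual_3090_total[1]:,}")
print(f"Saving vs A100: ${savings[0]:,} to ${savings[1]:,}")
```

The figures cover GPUs and bridge only; a larger PSU, case, or motherboard for the dual-card build would narrow the gap.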

Frequently asked questions

Do I need NVLink for multi-GPU LLM inference?

Not technically required, but strongly recommended for 70B models. Without NVLink (for example, two 4090s over PCIe), inter-GPU communication bandwidth limits throughput on large models. Community estimates suggest a dual-4090 PCIe setup delivers notably fewer tok/s on 70B than a dual-3090 NVLink setup. For 70B inference specifically, NVLink is a meaningful advantage. For smaller models that fit on one card, a second GPU does not help either way.

Can I mix different GPU models in a multi-GPU build?

For inference via tensor parallelism (each layer's weights sharded across cards), mixing different GPU models is technically supported by some runtimes but generally not recommended. Mismatched VRAM sizes mean the larger card's extra memory is not used, and mismatched bandwidth can create synchronization bottlenecks. For NVLink specifically, both cards must use the same NVLink connector and be from the same generation. The 4090 has no NVLink connector at all, so you cannot NVLink a 3090 with a 4090. Keep both cards identical for multi-GPU builds.

What motherboard should I use for 2× RTX 4090?

You need a motherboard with two mechanical x16 PCIe slots wired to the CPU. For consumer Intel platforms, look for Z790 boards explicitly listing dual-GPU (x8/x8) support; many mid-range Z790 boards wire the second slot as x4 electrically, which limits the second card's bandwidth. Enthusiast boards such as the ASUS ProArt Z790-Creator WiFi and MSI MEG Z790 GODLIKE are designed to split the CPU lanes x8/x8 across two slots. The CPUs for this platform expose only 16 graphics lanes, so a full x16 link to both cards is not available on Z790. For maximum inter-GPU bandwidth without NVLink, AMD TRX50 HEDT (Threadripper) or Intel W790 workstation platforms are better choices.