Best hardware for Qwen 3 30B-A3B (MoE)

Requires ~19 GB VRAM (Q4 + KV cache). Ranked by estimated decode tok/s across 90 GPUs.

NVIDIA GeForce RTX 4090

24 GB VRAM · $6,950

Fits · 74.4 tok/s

NVIDIA GeForce RTX 3090 (Used)

24 GB VRAM

Fits · 73.7 tok/s

NVIDIA GeForce RTX 5090

32 GB VRAM · $3,900

Fits · tok/s: n/a

NVIDIA GeForce RTX 5080

How GPU Rankings Work

GPUs are ranked by estimated decode tok/s for Qwen 3 30B-A3B (MoE). Higher tok/s means faster text generation during inference. Rankings draw on community benchmarks and lab measurements, and fall back to spec-based estimates when no measured data is available.
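To make "spec-based estimate" concrete: decode is usually limited by memory bandwidth, because every generated token requires reading the active weights once. The sketch below shows one way such an estimate could be formed. It is an assumption, not the method used for the rankings on this page. The function name, the efficiency factor, and the example inputs are all hypothetical; for an MoE model such as 30B-A3B, only the roughly 3B active parameters are read per token.

```python
def estimate_decode_tok_s(active_params_b, bits_per_weight, bandwidth_gb_s, efficiency=0.5):
    """Bandwidth-bound decode estimate in tokens per second.

    active_params_b : parameters read per token, in billions (about 3 for an A3B MoE)
    bits_per_weight : average bits per weight after quantization (assumed value)
    bandwidth_gb_s  : GPU memory bandwidth in GB/s, taken from the spec sheet
    efficiency      : hypothetical fudge factor for KV cache reads, expert routing
                      and kernel overhead; real values must come from measurement
    """
    bytes_per_token = active_params_b * 1e9 * bits_per_weight / 8
    return bandwidth_gb_s * 1e9 * efficiency / bytes_per_token


if __name__ == "__main__":
    # Illustrative inputs only: ~3B active params, ~4.5 bits/weight, 1000 GB/s card.
    print(round(estimate_decode_tok_s(3.0, 4.5, 1000.0), 1), "tok/s (rough ceiling)")
```

An estimate like this is a ceiling scaled by a guess, so measured throughput can differ from it substantially. That is why benchmarks take priority wherever they exist.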

"Fits" vs "VRAM tight": GPUs marked "Fits" have enough VRAM for the full model with headroom for KV cache growth. "VRAM tight" GPUs may work with shorter context lengths but could run out of memory with long conversations.
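The distinction can be sketched as a simple budget check: weights plus KV cache plus some runtime overhead must stay under the card's VRAM, and the KV cache grows linearly with context length. The thresholds, the "Does not fit" label, the 17 GB weight figure, and the layer and head counts below are assumptions for illustration; the page does not state how it draws the line.

```python
def kv_cache_gb(context_tokens, n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """KV cache size in GB: one key and one value vector per layer, per KV head, per token."""
    per_token_bytes = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
    return per_token_bytes * context_tokens / 1e9


def classify(vram_gb, weights_gb, kv_short_gb, kv_long_gb, overhead_gb=1.0):
    """Hypothetical labelling rule: 'Fits' if a long context fits, 'VRAM tight' if only a short one does."""
    if weights_gb + kv_long_gb + overhead_gb <= vram_gb:
        return "Fits"
    if weights_gb + kv_short_gb + overhead_gb <= vram_gb:
        return "VRAM tight"
    return "Does not fit"


# Assumed model figures (not taken from this page): ~17 GB of Q4 weights,
# 48 layers, 4 KV heads, head dimension 128, FP16 cache entries.
WEIGHTS_GB = 17.0
LAYERS, KV_HEADS, HEAD_DIM = 48, 4, 128

kv_short = kv_cache_gb(4096, LAYERS, KV_HEADS, HEAD_DIM)   # short chat
kv_long = kv_cache_gb(32768, LAYERS, KV_HEADS, HEAD_DIM)   # long conversation

if __name__ == "__main__":
    for vram in (16, 20, 24):
        print(f"{vram} GB: {classify(vram, WEIGHTS_GB, kv_short, kv_long)}")
```

Under these assumptions a card can hold the weights comfortably at a 4K context and still run out of memory at 32K, which is the situation "VRAM tight" describes.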

If Your Model Doesn't Fit

Multi-GPU setup: Use multiple GPUs to split the model across VRAM pools. Two RTX 4090s (48 GB total) can run larger models that don't fit on a single card.

Quantization: Quantize the model to Q8 or Q4 to reduce its VRAM footprint. The quality loss is small for most use cases, and smaller at Q8 than at Q4.

Cloud alternatives: For occasional use of models that need a lot of VRAM, consider renting cloud GPUs instead of buying hardware.
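The first two options can be compared with a little arithmetic. The sketch below estimates the weight footprint at different precisions and how many cards of a given size would be needed to hold it. The bits-per-weight figures are approximations (quantized formats store scales alongside the weights), the 90% usable-VRAM fraction is an assumption, and the function names are hypothetical. KV cache and runtime overhead are not included, so real requirements are somewhat higher.

```python
import math


def weights_gb(total_params_b, bits_per_weight):
    """Approximate weight footprint in GB for a model with total_params_b billion parameters."""
    return total_params_b * bits_per_weight / 8


def gpus_needed(required_gb, vram_per_gpu_gb, usable_fraction=0.9):
    """Cards needed to hold required_gb, assuming only a fraction of each card's VRAM is usable."""
    return math.ceil(required_gb / (vram_per_gpu_gb * usable_fraction))


# Approximate bits per weight; exact values depend on the quantization format.
PRECISIONS = {"FP16": 16.0, "Q8": 8.5, "Q4": 4.5}

if __name__ == "__main__":
    for name, bits in PRECISIONS.items():
        size = weights_gb(30, bits)  # 30B total parameters
        print(f"{name}: ~{size:.1f} GB weights, {gpus_needed(size, 24)} x 24 GB GPU(s)")
```

For a 30B-parameter model on 24 GB cards, this puts FP16 weights across three cards, Q8 across two, and Q4 on one, before any KV cache is accounted for.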
