Can I run DeepSeek-R1 on a single consumer GPU?

Not the full R1. It is a very large mixture-of-experts model (671B total parameters) that needs hundreds of gigabytes of memory because every expert must stay resident. What you can run on a single consumer GPU is a distilled DeepSeek model: at 4-bit, a 7B or 8B fits on an 8GB card and a 32B fits on a 24GB card. The distills capture much of R1's reasoning at a fraction of the size.

How much VRAM does a distilled DeepSeek 70B need?

About 40GB at a 4-bit quant with usable context, the same as any 70B model. That means a single 48GB card such as an RTX A6000, or two 24GB cards pooled together such as dual RTX 3090s. A single 24GB card can only run it with aggressive quantization and offload, which is slow.

Why can't I just count DeepSeek's active parameters for VRAM?

Because a mixture-of-experts model can route any token to any expert, so all experts must be loaded in memory simultaneously. You pay the memory cost of the full parameter count even though only a subset does compute per token. That is why the full R1 needs server-class memory while its dense distills are the practical local option.

Is it cheaper to run full DeepSeek-R1 in the cloud?

For almost everyone, yes. The full R1 needs multi-GPU server hardware that costs far more than most people will spend on a whole setup, and unless you run it many hours a day, renting by the hour is dramatically cheaper. Reserve owning hardware for the distilled models that fit consumer GPUs.

What Hardware Do You Need to Run DeepSeek Locally?

June 15, 2026 · 7 min read · By the ClankerBuilder editorial team

Hardware to run DeepSeek locally: the full 671B R1 needs a datacenter, but distilled 1.5B–70B versions run on consumer GPUs. VRAM by variant.

Running DeepSeek locally depends entirely on which DeepSeek you mean. The headline model, DeepSeek-R1, is a mixture-of-experts model with 671B total parameters; running it in full is firmly datacenter territory and needs hundreds of gigabytes of memory. But DeepSeek also ships distilled versions, dense models from roughly 1.5B up to 70B parameters, and those run on the same consumer hardware as any other model that size.

The short answer: a distilled 7B or 8B DeepSeek runs on almost any modern GPU with 8GB or more; a distilled 32B wants a 24GB card; a distilled 70B needs about 40GB, meaning a 48GB card or two 24GB cards. The full R1 is a rent-don't-buy proposition for nearly everyone. Here is how to size each one.

Full R1 versus the distilled models

DeepSeek-R1 in its full form is a mixture-of-experts model: it has a very large total parameter count but activates only a fraction of those parameters per token. That makes it efficient to serve at scale, but the catch for local use is that all the experts still have to live in memory, so the full model needs hundreds of gigabytes even quantized. That is multi-GPU server hardware, not a desktop.

The distilled DeepSeek models are different animals. They are dense models — typically distilled onto Llama or Qwen bases at sizes like 1.5B, 7B, 8B, 14B, 32B, and 70B — that capture much of R1's reasoning style at a fraction of the size. For local use, these are what you actually run, and they size exactly like any dense model of the same parameter count.

VRAM by DeepSeek variant

Use the same rule of thumb as any model: at 4-bit, plan on roughly half a gigabyte of VRAM per billion parameters for the weights, plus a few gigabytes for KV cache and overhead. The distilled tiers map cleanly onto consumer hardware. The numbers below are 4-bit estimates and shift with context length and the specific quant.

Approximate VRAM for DeepSeek variants at 4-bit (Q4).
DeepSeek variant	~VRAM (Q4)	Runs on
Distill 1.5B	~2 GB	Any modern GPU / integrated
Distill 7B / 8B	~5–6 GB	8GB+ GPU, most laptops
Distill 14B	~10 GB	12–16GB GPU
Distill 32B	~20 GB	24GB GPU (3090/4090) or 32GB (5090)
Distill 70B	~40 GB	48GB card or 2× 24GB
Full R1 (MoE)	hundreds of GB	Multi-GPU server / cloud
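
The rule of thumb behind this table can be sketched in a few lines of Python. This is a rough estimate under stated assumptions, not a measurement: the names `estimate_vram_gb` and `fits` are illustrative, and the flat 3 GB allowance for KV cache and overhead is an assumed stand-in for "a few gigabytes", so results land near the table rather than exactly on it.

```python
def estimate_vram_gb(params_billions: float,
                     bits_per_weight: float = 4.0,
                     overhead_gb: float = 3.0) -> float:
    """Rough VRAM estimate in GB: quantized weights plus a flat allowance.

    overhead_gb is an assumed stand-in for KV cache and runtime overhead;
    real usage grows with context length and varies by quant format.
    """
    # bits / 8 = bytes per parameter, so 4-bit is 0.5 GB per billion parameters.
    weights_gb = params_billions * bits_per_weight / 8.0
    return weights_gb + overhead_gb


def fits(params_billions: float, vram_gb: float) -> bool:
    """True if the estimate fits within the given total VRAM."""
    return estimate_vram_gb(params_billions) <= vram_gb


if __name__ == "__main__":
    for name, size_b in [("Distill 7B", 7), ("Distill 14B", 14),
                         ("Distill 32B", 32), ("Distill 70B", 70)]:
        print(f"{name}: ~{estimate_vram_gb(size_b):.1f} GB")
    print("70B on one 24GB card:", fits(70, 24))
    print("70B on 2x 24GB cards:", fits(70, 48))
```

Raising `overhead_gb` is the simplest way to model longer contexts; treat any result within a gigabyte or two of a card's capacity as a tight fit.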

Why mixture-of-experts changes the math

It is tempting to assume that because an MoE model activates only a small fraction of its parameters per token, you only need memory for those active parameters. That is not how it works for local inference. The router can pick any expert on any token, so every expert must be resident in memory at all times. You pay the full memory cost of the total parameter count while only getting the compute cost of the active subset.
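
To put numbers on that gap, here is a minimal sketch comparing the weights that must stay resident with the weights actually used per token. The 671B total comes from this article; the 37B active figure is the commonly cited value for R1 and should be treated as an assumption here, as should the helper name `weights_gb`.

```python
def weights_gb(params_billions: float, bits_per_weight: float = 4.0) -> float:
    """Weight memory in GB at a given quantization (weights only, no KV cache)."""
    return params_billions * bits_per_weight / 8.0


TOTAL_PARAMS_B = 671.0   # total parameters, per the article
ACTIVE_PARAMS_B = 37.0   # assumed active parameters per token (commonly cited)

resident_gb = weights_gb(TOTAL_PARAMS_B)    # what must sit in memory
per_token_gb = weights_gb(ACTIVE_PARAMS_B)  # what each token actually reads

if __name__ == "__main__":
    print(f"Resident at 4-bit:  {resident_gb:.1f} GB")
    print(f"Touched per token:  {per_token_gb:.1f} GB")
    print(f"Memory paid vs used: {resident_gb / per_token_gb:.1f}x")
```

Sizing by the active count alone would suggest a single 24GB card; sizing by the total, which is what the hardware has to hold, points to multi-GPU server memory.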

That is exactly why the full R1 is impractical to own and the distilled dense models are the sensible local choice. If you specifically need R1-level reasoning rather than a distill, renting a multi-GPU cloud instance by the hour is almost always cheaper than assembling the hardware.

Recommended setups

For a distilled 7B–14B DeepSeek, a single 16–24GB card or an Apple Silicon Mac handles it comfortably and is the right starting point for most people. For the distilled 32B, a 24GB card such as a used RTX 3090 or an RTX 4090 is the sweet spot. For the distilled 70B, you are in the same place as any 70B: about 40GB of VRAM at 4-bit, which in practice means 48GB of capacity from a single 48GB card, a pair of used 3090s, or a large-memory Mac. See The Cheapest Way to Run a 70B Model Locally.

If you want the full R1, rent it. Work the economics with the cloud-vs-buy calculator before spending anything on hardware you would only use occasionally.
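
A back-of-the-envelope version of that rent-versus-buy comparison might look like the sketch below. It is not the site's calculator: the function names are hypothetical, every dollar figure in the usage example is a placeholder rather than a quoted price, and it ignores resale value, depreciation, and setup time.

```python
def break_even_hours(hardware_cost_usd: float,
                     cloud_rate_usd_per_hour: float,
                     power_cost_usd_per_hour: float = 0.0) -> float:
    """Hours of use at which owning becomes cheaper than renting.

    Returns infinity when running your own hardware costs as much per hour
    as renting, since ownership then never pays for itself.
    """
    saving_per_hour = cloud_rate_usd_per_hour - power_cost_usd_per_hour
    if saving_per_hour <= 0:
        return float("inf")
    return hardware_cost_usd / saving_per_hour


def break_even_days(hours_needed: float, hours_per_day: float) -> float:
    """Calendar days to reach break-even at a given daily usage."""
    if hours_per_day <= 0:
        return float("inf")
    return hours_needed / hours_per_day


if __name__ == "__main__":
    # Placeholder inputs only; substitute real quotes before deciding.
    hours = break_even_hours(hardware_cost_usd=2000.0,
                             cloud_rate_usd_per_hour=1.50,
                             power_cost_usd_per_hour=0.10)
    print(f"Break-even after ~{hours:.0f} hours of use")
    print(f"At 2 h/day: ~{break_even_days(hours, 2.0):.0f} days")
```

The shape of the result matters more than the exact numbers: light, occasional use pushes break-even out by years, which is the case for renting the full R1.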

Related builds

Used GPU Budget Build

Cost-optimized build using a used RTX 3090 for 70B experimentation at Q3 quant.

Home Inference Workstation

RTX 4090 powerhouse for 8B–34B models with headroom for agent workflows.