Skip to content
ClankerBuilder
Sign in

Explainer

How Much VRAM Do You Need to Run LLMs Locally? (2026)

7 min readBy the ClankerBuilder editorial team · how we rate

VRAM requirements for running Llama 3, DeepSeek, Qwen 3, and Gemma 3 locally with Q4_K_M quantization — and what happens when you don't have enough.

VRAM is the gate for local LLM inference. A model that does not fit in GPU memory cannot run at GPU speed — it either refuses to load, spills into system RAM at a 10–50× throughput penalty, or forces you to quantize so aggressively that quality suffers. Getting the VRAM question right before buying hardware is the single highest-leverage decision in a local AI build.

The good news is that quantization makes most popular models surprisingly affordable to run. Q4_K_M, the community default for 4-bit quantization, cuts a model's memory footprint roughly in half compared to 8-bit and by roughly four times compared to full FP16 precision, while preserving most of the output quality. A 7B model that would need 14GB at FP16 needs only about 5.4GB at Q4_K_M. This guide walks through the sizing math, the per-model requirements for the models people actually run in 2026, and what to do when your VRAM falls short.

The VRAM Formula

Estimating VRAM for a model at a given quantization level is straightforward: take the parameter count in billions, multiply by the bytes per weight for your chosen quant, then add the KV cache and a small runtime overhead.

For Q4_K_M, each weight takes approximately 0.55 bytes (4 bits per weight plus some quantization metadata). A Llama 3.1 8B model works out to: 8 billion × 0.55 bytes ≈ 4.4GB for the weights. Add roughly 0.5–1GB of KV cache at a 4,096-token context and approximately 1.5GB of runtime overhead (CUDA context, activations, inference engine buffers), and you land around 6.5–7GB total. A 8GB GPU handles it; a 6GB GPU does not.

The formula: VRAM ≈ (params in billions × 0.55) + KV cache (varies with context length) + 1.5GB runtime. Context length is the wild card: longer contexts grow the KV cache, which can push a model that fits at 2K context over the edge at 16K context. Plan for the context you actually use, not the minimum.

Q4_K_M as the Standard

Q4_K_M has become the community default quantization format for a reason: it delivers strong quality at roughly half the memory of Q8 and one quarter of FP16, making large models accessible on consumer hardware.

The 'K' in Q4_K_M refers to k-quants, a higher-quality quantization scheme in llama.cpp that assigns different precision to different weight matrices depending on their sensitivity — more important layers get higher precision within the 4-bit budget. The 'M' is the middle-size variant of this scheme (there is also Q4_K_S for smaller and Q4_K_L for larger). In practice, Q4_K_M is indistinguishable from Q8 or FP16 on most casual text tasks, and the quality gap only becomes visible on demanding reasoning benchmarks or long-chain mathematical reasoning.

Q8 approximately doubles the VRAM requirement compared to Q4_K_M for roughly 1% quality improvement in most evals. The tradeoff is almost never worth it unless you are running a specialized task where that margin matters and you have the VRAM budget. If you do have headroom, Q6_K is a better intermediate step than Q8.

VRAM Requirements by Popular Model

The table below shows approximate Q4_K_M VRAM requirements for the models most commonly run locally in 2026. These are estimates based on the formula above; actual usage varies with context length, runtime, and system configuration.

Approximate Q4_K_M VRAM requirements for popular local LLMs.
ModelParamsMin VRAM (Q4_K_M)Minimum GPU
Llama 3.2 3B3B~2.3 GBAny 4GB GPU
Llama 3.1 8B8B~5.4 GB8GB GPU (RTX 3070 etc.)
Mistral Small 24B24B~15.5 GB24GB GPU (RTX 3090)
Llama 3.3 70B70B~43 GB2× RTX 3090 (NVLink)
DeepSeek V3 (MoE)~24B active~48 GBMulti-GPU (MoE routing overhead)
Llama 3.1 405B405B~240 GB8× A100 80GB

What Happens When VRAM Is Too Small

When a model exceeds your VRAM capacity, inference runtimes like llama.cpp and Ollama offer a partial solution: offload as many model layers as fit into VRAM, and run the remaining layers on CPU using system RAM. This partial offloading is a middle ground between full GPU acceleration and full CPU inference.

With partial offloading, throughput scales with how much of the model fits in VRAM. If 80% of the layers are on the GPU and 20% on the CPU, performance is somewhere between full GPU speed and full CPU speed — typically much closer to CPU speed because the CPU layers become the bottleneck. For practical purposes, any significant CPU offload usually degrades throughput from 30–90 tok/s to somewhere in the 5–15 tok/s range, which is usable for experimentation but slow for daily use.

The alternatives: quantize further to a lower bit depth (Q3 or Q2 — quality degrades more sharply below Q4), use a smaller model in the same family, or upgrade your GPU.

  • CPU offload: 10–50× slower than full VRAM fit; usable for experimentation
  • Quantize lower (Q3/Q2): saves VRAM, but noticeable quality drop below Q4
  • Smaller model: Llama 3.2 3B or 3.1 8B instead of 70B stays well within 8–24GB
  • Partial GPU offload: some layers on GPU, rest on CPU — better than full CPU, worse than full GPU

Choosing the Right GPU

VRAM tiers map cleanly to model classes. Eight gigabytes handles Llama 3.1 8B and smaller models comfortably — good for coding assistants and fast chat, but you will hit the ceiling on any 13B-and-up model. Sixteen gigabytes opens up the 13B–20B class, including many strong code and reasoning models, but 34B starts to feel cramped and 70B is not viable.

Twenty-four gigabytes is the sweet spot for most users: it fits the vast majority of 7B–34B models at Q4_K_M with a healthy context window, and the used RTX 3090 delivers this VRAM tier at the best dollar-per-gigabyte ratio available. Forty-eight gigabytes — via a dual 24GB build or a professional card like the RTX A6000 — handles 70B models comfortably and is the minimum for serious 70B use. Beyond that, you are in multi-GPU workstation or datacenter territory.

  • 8GB GPU: Llama 3.1 8B and smaller — good entry point
  • 16GB GPU: up to ~20B models; 34B is tight
  • 24GB GPU: most 7B–34B models comfortably — recommended sweet spot
  • 48GB+: 70B models with real context; dual RTX 3090 (NVLink) is the budget path

Frequently asked questions

Can I run a 70B model on a single RTX 4090?
Technically yes, but with serious caveats. A Q3_K_M or lower quant of a 70B model can fit within 24GB, but at the cost of noticeable quality degradation and a very limited context window. For most people, a single 24GB card is not a practical 70B setup — you want two 24GB cards or a 48GB professional card. If 70B inference is a priority, plan for 48GB total VRAM.
How does context length affect VRAM usage?
Significantly. The KV cache grows linearly with context length: more tokens in the context window means more keys and values to store. A Llama 3.1 8B model at 4,096 tokens might use 6GB total; at 32,768 tokens it can push to 9–10GB. For a 70B model at very long context, the KV cache alone can exceed several gigabytes. If you need long context, add extra VRAM headroom beyond the base model weight estimate.
Is Q4_K_M noticeably worse than full precision?
For the vast majority of tasks — chat, coding, summarization, Q&A — Q4_K_M is indistinguishable from Q8 or FP16 in practice. Quality gaps become detectable on demanding multi-step reasoning, precise mathematical calculation, or tasks where the model's calibration matters. For everyday use, Q4_K_M is the right default and the quality-to-memory tradeoff is strongly in its favor.

Performance figures are estimates aggregated from third-party benchmarks — we don't benchmark hardware ourselves. See our methodology and sources.

Some links in this article are affiliate links. If you buy through them we may earn a commission at no extra cost to you. See our affiliate disclosure.

How Much VRAM Do You Need to Run LLMs Locally? (2026) · ClankerBuilder