Explainer
How Much VRAM Do You Need to Run LLMs Locally? (2026)
VRAM requirements for running Llama 3, DeepSeek, Qwen 3, and Gemma 3 locally with Q4_K_M quantization — and what happens when you don't have enough.
VRAM is the gate for local LLM inference. A model that does not fit in GPU memory cannot run at GPU speed — it either refuses to load, spills into system RAM at a 10–50× throughput penalty, or forces you to quantize so aggressively that quality suffers. Getting the VRAM question right before buying hardware is the single highest-leverage decision in a local AI build.
The good news is that quantization makes most popular models surprisingly affordable to run. Q4_K_M, the community default for 4-bit quantization, cuts a model's memory footprint roughly in half compared to 8-bit and by roughly four times compared to full FP16 precision, while preserving most of the output quality. A 7B model that would need 14GB at FP16 needs only about 5.4GB at Q4_K_M. This guide walks through the sizing math, the per-model requirements for the models people actually run in 2026, and what to do when your VRAM falls short.
The VRAM Formula
Estimating VRAM for a model at a given quantization level is straightforward: take the parameter count in billions, multiply by the bytes per weight for your chosen quant, then add the KV cache and a small runtime overhead.
For Q4_K_M, each weight takes approximately 0.55 bytes (4 bits per weight plus some quantization metadata). A Llama 3.1 8B model works out to: 8 billion × 0.55 bytes ≈ 4.4GB for the weights. Add roughly 0.5–1GB of KV cache at a 4,096-token context and approximately 1.5GB of runtime overhead (CUDA context, activations, inference engine buffers), and you land around 6.5–7GB total. A 8GB GPU handles it; a 6GB GPU does not.
The formula: VRAM ≈ (params in billions × 0.55) + KV cache (varies with context length) + 1.5GB runtime. Context length is the wild card: longer contexts grow the KV cache, which can push a model that fits at 2K context over the edge at 16K context. Plan for the context you actually use, not the minimum.
Q4_K_M as the Standard
Q4_K_M has become the community default quantization format for a reason: it delivers strong quality at roughly half the memory of Q8 and one quarter of FP16, making large models accessible on consumer hardware.
The 'K' in Q4_K_M refers to k-quants, a higher-quality quantization scheme in llama.cpp that assigns different precision to different weight matrices depending on their sensitivity — more important layers get higher precision within the 4-bit budget. The 'M' is the middle-size variant of this scheme (there is also Q4_K_S for smaller and Q4_K_L for larger). In practice, Q4_K_M is indistinguishable from Q8 or FP16 on most casual text tasks, and the quality gap only becomes visible on demanding reasoning benchmarks or long-chain mathematical reasoning.
Q8 approximately doubles the VRAM requirement compared to Q4_K_M for roughly 1% quality improvement in most evals. The tradeoff is almost never worth it unless you are running a specialized task where that margin matters and you have the VRAM budget. If you do have headroom, Q6_K is a better intermediate step than Q8.
VRAM Requirements by Popular Model
The table below shows approximate Q4_K_M VRAM requirements for the models most commonly run locally in 2026. These are estimates based on the formula above; actual usage varies with context length, runtime, and system configuration.
| Model | Params | Min VRAM (Q4_K_M) | Minimum GPU |
|---|---|---|---|
| Llama 3.2 3B | 3B | ~2.3 GB | Any 4GB GPU |
| Llama 3.1 8B | 8B | ~5.4 GB | 8GB GPU (RTX 3070 etc.) |
| Mistral Small 24B | 24B | ~15.5 GB | 24GB GPU (RTX 3090) |
| Llama 3.3 70B | 70B | ~43 GB | 2× RTX 3090 (NVLink) |
| DeepSeek V3 (MoE) | ~24B active | ~48 GB | Multi-GPU (MoE routing overhead) |
| Llama 3.1 405B | 405B | ~240 GB | 8× A100 80GB |
What Happens When VRAM Is Too Small
When a model exceeds your VRAM capacity, inference runtimes like llama.cpp and Ollama offer a partial solution: offload as many model layers as fit into VRAM, and run the remaining layers on CPU using system RAM. This partial offloading is a middle ground between full GPU acceleration and full CPU inference.
With partial offloading, throughput scales with how much of the model fits in VRAM. If 80% of the layers are on the GPU and 20% on the CPU, performance is somewhere between full GPU speed and full CPU speed — typically much closer to CPU speed because the CPU layers become the bottleneck. For practical purposes, any significant CPU offload usually degrades throughput from 30–90 tok/s to somewhere in the 5–15 tok/s range, which is usable for experimentation but slow for daily use.
The alternatives: quantize further to a lower bit depth (Q3 or Q2 — quality degrades more sharply below Q4), use a smaller model in the same family, or upgrade your GPU.
- CPU offload: 10–50× slower than full VRAM fit; usable for experimentation
- Quantize lower (Q3/Q2): saves VRAM, but noticeable quality drop below Q4
- Smaller model: Llama 3.2 3B or 3.1 8B instead of 70B stays well within 8–24GB
- Partial GPU offload: some layers on GPU, rest on CPU — better than full CPU, worse than full GPU
Choosing the Right GPU
VRAM tiers map cleanly to model classes. Eight gigabytes handles Llama 3.1 8B and smaller models comfortably — good for coding assistants and fast chat, but you will hit the ceiling on any 13B-and-up model. Sixteen gigabytes opens up the 13B–20B class, including many strong code and reasoning models, but 34B starts to feel cramped and 70B is not viable.
Twenty-four gigabytes is the sweet spot for most users: it fits the vast majority of 7B–34B models at Q4_K_M with a healthy context window, and the used RTX 3090 delivers this VRAM tier at the best dollar-per-gigabyte ratio available. Forty-eight gigabytes — via a dual 24GB build or a professional card like the RTX A6000 — handles 70B models comfortably and is the minimum for serious 70B use. Beyond that, you are in multi-GPU workstation or datacenter territory.
- 8GB GPU: Llama 3.1 8B and smaller — good entry point
- 16GB GPU: up to ~20B models; 34B is tight
- 24GB GPU: most 7B–34B models comfortably — recommended sweet spot
- 48GB+: 70B models with real context; dual RTX 3090 (NVLink) is the budget path
Frequently asked questions
- Can I run a 70B model on a single RTX 4090?
- Technically yes, but with serious caveats. A Q3_K_M or lower quant of a 70B model can fit within 24GB, but at the cost of noticeable quality degradation and a very limited context window. For most people, a single 24GB card is not a practical 70B setup — you want two 24GB cards or a 48GB professional card. If 70B inference is a priority, plan for 48GB total VRAM.
- How does context length affect VRAM usage?
- Significantly. The KV cache grows linearly with context length: more tokens in the context window means more keys and values to store. A Llama 3.1 8B model at 4,096 tokens might use 6GB total; at 32,768 tokens it can push to 9–10GB. For a 70B model at very long context, the KV cache alone can exceed several gigabytes. If you need long context, add extra VRAM headroom beyond the base model weight estimate.
- Is Q4_K_M noticeably worse than full precision?
- For the vast majority of tasks — chat, coding, summarization, Q&A — Q4_K_M is indistinguishable from Q8 or FP16 in practice. Quality gaps become detectable on demanding multi-step reasoning, precise mathematical calculation, or tasks where the model's calibration matters. For everyday use, Q4_K_M is the right default and the quality-to-memory tradeoff is strongly in its favor.
Related reading
Performance figures are estimates aggregated from third-party benchmarks — we don't benchmark hardware ourselves. See our methodology and sources.
Some links in this article are affiliate links. If you buy through them we may earn a commission at no extra cost to you. See our affiliate disclosure.