Skip to content
ClankerBuilder
Sign in

Explainer

How Much VRAM Do You Need to Run Llama 70B?

6 min read

The VRAM to run Llama 70B: about 40GB for a 4-bit quant. See the math, a quant-by-quant table, and what hardware hits each tier.

The short answer: to run Llama 70B locally with usable quality, plan on about 40GB of VRAM. That is the comfortable floor for a 4-bit quantized version with a normal context window, and it is why the practical setups are a single 48GB card or a pair of 24GB cards. You can technically squeeze a 70B model onto a single 24GB GPU, but only by quantizing so aggressively and shrinking context so far that quality and usefulness suffer. Going the other direction, full-precision FP16 needs roughly 140GB, which is firmly in multi-GPU datacenter territory.

Everything else in this guide is the math behind that number: how to estimate VRAM from the parameter count, how quantization changes it, how context length grows the KV cache, and what happens to speed when you spill over onto system RAM. The numbers here are estimates and ranges, not promises. Your exact figures shift with the specific quant, the inference engine, the context window, and how much headroom the OS and other processes leave you.

The VRAM Rule of Thumb

The rule of thumb is simple. VRAM needed is approximately the number of parameters times the bytes per weight, plus the KV cache, plus some overhead. For a 70B model the weights dominate. At FP16 each weight is two bytes, so 70 billion parameters times two bytes is about 140GB just for the weights. Drop to 8-bit and each weight is one byte, halving that to roughly 70GB. Drop to a common 4-bit quant and you are near half a byte per weight, landing around 35 to 40GB.

On top of the weights you add the KV cache, which holds the attention keys and values for every token in your context, and a few gigabytes of overhead for activations, the CUDA context, and the inference runtime. For a 70B model at a typical context length the KV cache and overhead together add a handful of gigabytes. That is why the headline 4-bit number rounds up from the raw 35GB of weights to a planning figure closer to 40GB once you account for context and runtime slack.

70B VRAM by Quantization, and What Hardware Hits Each Tier

Quantization is the single biggest lever you have. It trades a small, usually acceptable, loss in output quality for a large reduction in memory. The community has largely settled on 4-bit medium quants as the sweet spot for 70B models: they cut memory by roughly four times versus FP16 while keeping the model coherent and useful. Below that you keep saving memory, but quality degradation becomes more noticeable, especially on reasoning and code.

Here is a rough breakdown of where a 70B model lands by quantization, counting weights plus a little context headroom.

Different setups map onto those tiers cleanly. A single 24GB consumer card cannot hold a 4-bit 70B model with real context, so it only works at very aggressive 3-bit or lower quants with a tiny context window, and even then it is tight. A 48GB card such as an RTX A6000, or two 24GB cards like dual RTX 4090s pooled together, comfortably runs the 4-bit tier with room for context. An 80GB H100 handles the 8-bit tier on a single card. FP16 needs two 80GB cards or more. Apple Silicon is a special case: unified memory means the GPU can address most of system RAM, so a Mac with 64GB or 128GB of unified memory can load larger quants than its raw price suggests, though memory bandwidth caps how fast it runs.

  • FP16 (full precision), ~140GB: two 80GB cards (such as dual H100) or more; datacenter only.
  • Q8 (8-bit), ~70GB: a single 80GB H100, or two 48GB cards.
  • Q4_K_M (4-bit medium), ~40GB: a single 48GB A6000, or dual RTX 4090s pooling 48GB. The practical sweet spot.
  • Q3 (3-bit), ~32GB: fits two 24GB cards, or a single 24GB card only with a very small context window.
  • Single 24GB GPU: viable only at aggressive 3-bit-or-lower quants and tiny context, with quality and length both compromised.

How Context Length Grows the KV Cache

Context length is the variable people forget, and it is the reason your VRAM use creeps up during a long session. The KV cache stores a key and value vector for every token currently in the context, for every layer. As your prompt and generated output grow, that cache grows linearly with the number of tokens. A short prompt costs almost nothing; a long document or a multi-turn conversation can add many gigabytes.

Llama 70B helps here by using grouped-query attention, which shares key and value heads across groups of query heads and cuts the KV cache by roughly eight times compared to older full multi-head designs. Even so, the growth is real. At a moderate context of around 32K tokens the 4-bit KV cache is on the order of eight to ten gigabytes, and pushing toward 128K can add tens of gigabytes on top of the weights. If you plan to use long context, budget for it explicitly rather than assuming the weight figure is the whole story. You can also quantize the KV cache itself to claw back some of that memory.

Offloading to System RAM and Why It Is Slow

When a model does not fit entirely in VRAM, most inference engines will offload the overflow layers to system RAM and run them on the CPU. This lets you load a model that would otherwise be impossible, but the speed cost is severe. GPU memory bandwidth is an order of magnitude higher than system RAM bandwidth, and the CPU is far slower at the matrix math, so every offloaded layer becomes a bottleneck the whole generation has to wait on.

In practice, a 70B model running mostly on the GPU with only a few offloaded layers might still feel interactive, but a 70B model with most layers offloaded to RAM typically crawls at one to three tokens per second. That is slow enough that it is fine for a batch job you walk away from and frustrating for anything interactive. The lesson is that offloading is a fallback for fitting the model at all, not a strategy for good performance. If speed matters, size your VRAM so the model fits.

Realistic Tokens Per Second by Setup

Speed expectations vary widely by setup. A single 48GB card running a 4-bit 70B model with the whole model resident in VRAM commonly lands in the rough range of fifteen to thirty tokens per second, depending on context length and the engine. Dual 4090s pooling 48GB run the same 4-bit tier and scale to roughly 85 to 90 percent of ideal because of the communication overhead between cards, so they perform similarly to a single big card while costing less.

An 80GB H100 running an 8-bit version will be faster still and is the comfortable choice when output quality at higher precision matters. A single 24GB card forced into heavy offloading sits at the bottom, in the low single digits. Treat all of these as ballpark figures; the only way to know your real number is to measure your own stack with your own quant, context length, and inference engine.

Related builds

Used GPU Budget Build

Cost-optimized build using a used RTX 3090 for 70B experimentation at Q3 quant.

View build

Dual RTX 4090 Workstation

Twin 4090s for high-throughput 34B–70B inference with NVLink-ready parts.

View build

Pro Dual-GPU 70B

Team-grade dual 4090 rig targeting Llama 3.3 70B at Q4.

View build

Frequently asked questions

Can I run Llama 70B on a single RTX 4090?
Not at usable quality. A single 4090 has 24GB of VRAM, and a 4-bit 70B model needs around 40GB with normal context. You can only fit it by dropping to a very aggressive 3-bit-or-lower quant and a tiny context window, which hurts output quality, or by offloading layers to system RAM, which drops speed to roughly one to three tokens per second. For a genuinely usable single-GPU 70B setup, you want a 48GB card or two 24GB cards pooled together.
Does context length change how much VRAM I need?
Yes. The KV cache stores attention data for every token in your context and grows linearly as the conversation or document gets longer. For a 4-bit 70B model, a moderate 32K context adds roughly eight to ten gigabytes on top of the weights, and pushing toward 128K can add tens of gigabytes. If you plan to use long context, budget for it explicitly instead of assuming the weight figure is the full cost.
What is the minimum VRAM to run Llama 70B with good quality?
About 40GB. That covers a 4-bit medium quant, which the community treats as the quality sweet spot, plus headroom for a normal context window and runtime overhead. In practice that means a single 48GB GPU like an RTX A6000, or a dual-GPU build that pools two 24GB cards to 48GB total.
Why does FP16 need so much more VRAM than 4-bit?
Because each weight takes more bytes. FP16 stores every one of the 70 billion parameters in two bytes, which is about 140GB just for the weights. A 4-bit quant stores each weight in roughly half a byte, cutting that to around 35 to 40GB. The quality loss from 4-bit is usually small and acceptable, which is why most local 70B users never run FP16.
Does dual GPU give me double the speed?
No. Two GPUs mainly give you more pooled VRAM, which is what lets a 70B model fit. Speed does not double, because splitting a model across cards adds communication overhead. Dual 4090s typically reach about 85 to 90 percent of ideal scaling on a 4-bit 70B model, so they perform similarly to a single large card while usually costing less to assemble.

Some links in this article are affiliate links. If you buy through them we may earn a commission at no extra cost to you. See our affiliate disclosure.