Will a quantized model give noticeably worse answers?

At Q4_K_M and above, most people cannot tell the difference from the full-precision model in everyday use; estimates put quality retention at roughly 92-95% of the original. The loss becomes noticeable at Q3 and obvious at Q2, showing up as weaker reasoning, repetition, and more factual mistakes. For precision-sensitive work like coding, stepping up to Q5 or Q6 is a reasonable hedge.

What does 'bits per weight' mean and why does it matter?

It is the average number of bits used to store each of the model's parameters. FP16 uses 16, Q8 about 8, Q4 about 4. Since memory roughly tracks bits per weight, halving the bits roughly halves the file and VRAM footprint. It is the single number that most directly predicts whether a model will fit on your card.
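
As a rough illustration of that relationship, the sketch below turns a parameter count and a bits-per-weight figure into an approximate weight size. The function name and the 7-billion-parameter example are hypothetical, and the result covers weights only, not context or runtime overhead.

```python
def weight_memory_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes), weights only."""
    total_bytes = params_billions * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9


# Hypothetical 7-billion-parameter model at the three levels named above.
for label, bits in [("FP16", 16), ("Q8", 8), ("Q4", 4)]:
    print(f"{label}: about {weight_memory_gb(7, bits):.1f} GB")
```

Halving the bits halves the result, which is why this one number is such a good first check against your card's VRAM.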

Can I run a model that is bigger than my VRAM?

Yes, in two ways. Quantizing the model down to Q4 or lower may shrink it enough to fit. Failing that, GGUF tools can offload some layers to system RAM and run the rest on the GPU, which lets you run larger models at the cost of speed. The more layers spill to RAM, the slower it gets, so quantization is usually the better lever to pull first.
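
To get a feel for how much of a model would spill to system RAM, here is a rough sketch. It assumes every layer is the same size and that a fixed amount of VRAM is reserved for context and overhead; both are simplifications, and the function name and the example numbers are hypothetical. In llama.cpp the resulting count is the kind of value you would pass to the `--n-gpu-layers` option.

```python
def gpu_layers_that_fit(model_gb: float, n_layers: int, vram_gb: float,
                        reserve_gb: float = 2.0) -> int:
    """Estimate how many layers fit on the GPU, assuming equal-sized layers."""
    per_layer_gb = model_gb / n_layers
    usable_gb = max(vram_gb - reserve_gb, 0.0)  # keep room for context and overhead
    return min(n_layers, int(usable_gb // per_layer_gb))


# Hypothetical example: a 40 GB quantized model with 80 layers on a 24 GB card.
on_gpu = gpu_layers_that_fit(model_gb=40, n_layers=80, vram_gb=24)
print(f"{on_gpu} layers on the GPU, {80 - on_gpu} offloaded to system RAM")
```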

Is perplexity the right way to judge quantization quality?

Perplexity is the common metric: it measures how surprised the model is by real text, and a small perplexity increase versus FP16 signals a small quality loss. It is a useful proxy but not the whole picture, since it does not directly capture reasoning or instruction-following. Treat the small perplexity bump at Q4_K_M as confirmation the model is largely intact, and judge the rest by trying it on your actual tasks.
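
For readers who want to see what the metric is, perplexity is the exponential of the average negative log-probability the model assigns to each token of real text. The sketch below uses made-up log-probabilities purely for illustration; in practice these values come from running each model over the same evaluation text.

```python
import math


def perplexity(token_logprobs):
    """exp(mean negative log-likelihood per token), using natural logs."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


# Made-up per-token log-probabilities for the same text under two models.
fp16_logprobs = [-1.20, -0.35, -2.10, -0.80]
q4_logprobs = [-1.25, -0.37, -2.18, -0.84]

fp16_ppl = perplexity(fp16_logprobs)
q4_ppl = perplexity(q4_logprobs)
increase_pct = (q4_ppl / fp16_ppl - 1) * 100
print(f"FP16: {fp16_ppl:.3f}  Q4: {q4_ppl:.3f}  increase: {increase_pct:.1f}%")
```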


LLM Quantization Explained: Q4 vs Q8 and What to Pick

June 15, 2026 · 5 min read · By the ClankerBuilder editorial team

LLM quantization explained: how Q4 vs Q8 GGUF levels trade VRAM for quality, and which one to pick for your hardware.

If you have ever tried to run a capable language model on your own machine and watched it refuse to load, the reason is almost always memory. A model stored in full precision is large: a rough rule of thumb is about 2 GB of memory for every billion parameters at 16-bit precision, so a 70-billion-parameter model wants roughly 140 GB just to hold its weights. Quantization is the technique that makes local models practical. It compresses each weight from 16 bits down to 8, 5, or 4 bits, shrinking the file and the memory footprint dramatically while keeping most of the model's quality intact.

This matters because memory, not raw compute, is the wall most people hit first. A quantized model that fits in your GPU's VRAM runs; one that overflows either crawls or fails to start. Understanding quantization is the difference between buying a card that can run the model you want and one that cannot. This guide explains what the common quantization levels mean, where the practical sweet spot sits, and how your choice changes the hardware you actually need.

What quantization actually does

Every weight in a neural network is a number. When a model is trained and released, those numbers are usually stored in a 16-bit floating-point format (FP16 or BF16), which gives plenty of precision but takes two bytes per weight. Quantization reduces the number of bits used to represent each weight. Instead of two bytes, you might use one byte (8-bit) or roughly half a byte (4-bit). The model gets smaller in direct proportion to how aggressively you compress it.

The trick is doing this without breaking the model. Naive rounding would lose too much information, so modern quantization methods group weights into small blocks and store a shared scaling factor for each block, preserving the relative differences between weights even at low bit counts. Some methods go further and spend extra bits on the weights that matter most while squeezing the rest harder. The result is that a model can drop to a quarter of its original size and still produce output that is, for most tasks, hard to distinguish from the full-precision original.
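
The block-and-scale idea can be sketched in a few lines. This is a simplified symmetric scheme written for illustration, not the exact layout any GGUF format uses; the function names, the block size of 32, and the synthetic weights are all assumptions. It compares one scale per block against a single scale for the whole tensor.

```python
import math


def quantize_block(block, bits=4):
    """Store one float scale plus small signed integer codes for a block."""
    qmax = 2 ** (bits - 1) - 1                  # 7 for 4-bit codes
    amax = max(abs(w) for w in block)
    scale = amax / qmax if amax > 0 else 1.0
    codes = [round(w / scale) for w in block]   # integers in [-qmax, qmax]
    return scale, codes


def quantize(weights, block_size, bits=4):
    return [quantize_block(weights[i:i + block_size], bits)
            for i in range(0, len(weights), block_size)]


def dequantize(blocks):
    return [scale * c for scale, codes in blocks for c in codes]


def max_error(original, restored):
    return max(abs(a - b) for a, b in zip(original, restored))


# Synthetic weights: one block of tiny values followed by one of large values.
small = [0.01 * math.sin(i) for i in range(32)]
large = [math.sin(i) for i in range(32)]
weights = small + large

per_block = dequantize(quantize(weights, block_size=32))  # one scale per block
one_scale = dequantize(quantize(weights, block_size=64))  # one scale for everything

err_blockwise = max_error(small, per_block[:32])
err_global = max_error(small, one_scale[:32])
print(f"error on small weights: per-block {err_blockwise:.5f}, single scale {err_global:.5f}")
```

With a single scale, the tiny weights all round to zero because the large weights dominate the range. With one scale per block, the tiny weights keep their relative differences, which is the property the paragraph above describes.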

The cost is a small, measurable loss of fidelity. As you use fewer bits, the model's predictions drift slightly from what the original would have produced. At gentle compression this drift is negligible. Push too far and it becomes obvious: repetition, weaker reasoning, and factual mistakes. The whole game is finding the point where you save the most memory before quality starts to suffer in ways you notice.

The common GGUF levels, from Q8 down to Q2

GGUF is the file format used by llama.cpp and the tools built on it (Ollama, LM Studio, and most desktop local-LLM apps). Its quantization levels are labeled with a number for the approximate bits per weight and a suffix describing the method. The K in names like Q4_K_M marks a K-quant, a family of methods that spends more bits on the parts of the model most sensitive to rounding and fewer on the rest; the trailing letter (S, M, L) is the size variant, with M (medium) being the usual default.

Here is roughly how the common levels stack up. Treat the percentages as estimates, since exact quality loss varies by model and task:

Q8_0 (~8 bits/weight): about half the size of FP16, with quality loss so small it is effectively lossless. Use it when you have memory to spare and want maximum fidelity.
Q6_K (~6 bits/weight): very close to Q8 quality at a smaller size; a fine choice when Q8 is just out of reach.
Q5_K_M (~5 bits/weight): a strong balance, noticeably smaller than Q6 with only a slight quality cost. A common step up from Q4 when you have a bit more VRAM.
Q4_K_M (~4.5 bits/weight): the popular default. Roughly 70-75% smaller than FP16 while retaining an estimated 92-95% of the original quality.
Q3_K_M (~3.5 bits/weight): visibly degraded but still usable on very tight hardware; quality loss becomes hard to ignore here.
Q2_K (~2.5-3 bits/weight): extreme compression with a sharp quality drop. Generally not recommended unless you have no other way to fit the model.
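
The size savings in the list follow directly from the approximate bits per weight. The sketch below reuses the rounded figures from the list (with 2.75 assumed as a midpoint for Q2_K), so the results are estimates and real files will differ somewhat.

```python
# Approximate bits per weight, taken from the list above (Q2_K midpoint assumed).
APPROX_BITS = {
    "Q8_0": 8.0,
    "Q6_K": 6.0,
    "Q5_K_M": 5.0,
    "Q4_K_M": 4.5,
    "Q3_K_M": 3.5,
    "Q2_K": 2.75,
}


def percent_smaller_than_fp16(bits_per_weight: float) -> float:
    """Size reduction relative to 16 bits per weight, as a percentage."""
    return (1 - bits_per_weight / 16) * 100


for level, bits in APPROX_BITS.items():
    print(f"{level:7s} ~{bits} bits/weight, about {percent_smaller_than_fp16(bits):.0f}% smaller than FP16")
```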

Why Q4_K_M is the sweet spot

If you take one practical recommendation away, it is this: start with Q4_K_M. The reason is the shape of the quality-versus-size curve. From Q8 down to Q4_K_M, quality declines very gradually, so each step down buys you a meaningful memory saving for almost no perceptible loss. Below Q4, the curve bends sharply. The drop from Q4 to Q3 is much steeper than the drop from Q8 to Q4, and the fall to Q2 is steeper still. Q4_K_M sits right at the elbow of that curve, capturing most of the memory savings before the cliff.

There are good reasons to deviate. Go higher (Q5, Q6, or Q8) when you have spare VRAM and want the best possible output, or for tasks where small errors compound, such as code generation or precise instruction-following. Go lower (Q3) only when a model you really want will not otherwise fit, and accept that the output will be rougher. A useful instinct: if the only way to fit a model is Q3 or Q2, you are often better off running the next smaller model at Q4_K_M instead. A smaller model at a healthy quantization usually beats a larger one squeezed to the breaking point.
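
That instinct can be written down as a simple selection rule. The sketch below is one possible encoding, not a definitive procedure: it never goes below Q4_K_M, prefers the largest model that fits, and then takes the highest level that fits. The model sizes, the 2 GB headroom, and the function names are hypothetical, and the bits-per-weight values are the rounded figures from the list earlier in this guide.

```python
# Levels at or above the recommended floor, highest quality first.
LEVELS = [("Q8_0", 8.0), ("Q6_K", 6.0), ("Q5_K_M", 5.0), ("Q4_K_M", 4.5)]


def weight_gb(params_billions: float, bits_per_weight: float) -> float:
    return params_billions * bits_per_weight / 8


def pick(model_sizes_billions, vram_gb: float, headroom_gb: float = 2.0):
    """Return (params_billions, level) or None if nothing fits at Q4_K_M or better."""
    budget_gb = vram_gb - headroom_gb
    for params in sorted(model_sizes_billions, reverse=True):  # biggest model first
        for level, bits in LEVELS:                              # best level first
            if weight_gb(params, bits) <= budget_gb:
                return params, level
    return None  # instead of dropping to Q3 or Q2, look for a smaller model


# Hypothetical model family on a 24 GB card.
print(pick([8, 14, 32, 70], vram_gb=24))
```

In this example the 70B model is skipped because it does not fit even at Q4_K_M, and the 32B model is chosen at Q5_K_M because Q6_K would exceed the budget.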

How quantization changes the hardware you need

This is where quantization becomes a buying decision. Because the memory footprint scales almost linearly with bits per weight, dropping from FP16 to 4-bit cuts the weight memory to roughly a quarter, or slightly more at Q4_K_M's roughly 4.5 bits per weight. A model that needs about 40 GB of memory at FP16 will fit in roughly 10 to 12 GB at Q4, which is the difference between needing a data-center card and running comfortably on a mainstream consumer GPU. At the high end, a 70B model that wants around 140 GB at full precision drops to roughly 40 GB at Q4 (estimates, before context).

Two caveats keep these numbers honest. First, weights are not the only thing in VRAM. The KV cache, which holds the running context, grows with the length of the prompt and conversation and with the number of requests running at once, so leave headroom on top of the weight size, especially for long contexts. Second, treat quantization as a memory saving rather than a guaranteed speedup: smaller weights mean less data to read per token, which often helps, but very low-bit formats can sometimes run slower because of the extra work to unpack them. When you read a model's hardware page, the listed VRAM figure usually refers to a specific quantization, so check which one before assuming it matches your card.
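
The first caveat can be made concrete with a rough KV cache estimate. The sketch below uses the standard accounting of one key and one value vector per layer, per KV head, per token. The shape values (32 layers, 8 KV heads, head dimension 128) are assumptions resembling an 8B-class model with grouped-query attention, and the result is in GiB (2^30 bytes) with the cache stored at 16 bits per value.

```python
def kv_cache_gib(n_layers: int, n_kv_heads: int, head_dim: int,
                 context_tokens: int, batch_size: int = 1,
                 bytes_per_value: int = 2) -> float:
    """Approximate KV cache size in GiB; the leading 2 covers keys and values."""
    total_bytes = (2 * n_layers * n_kv_heads * head_dim
                   * context_tokens * batch_size * bytes_per_value)
    return total_bytes / 2**30


# Assumed shape: 32 layers, 8 KV heads, head dimension 128.
for context in (8192, 32768):
    print(f"{context} tokens: about {kv_cache_gib(32, 8, 128, context):.1f} GiB")
```

Because the size grows linearly with both context length and the number of simultaneous requests, a long conversation or a multi-user server can need several times the headroom of a short single-user chat.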

GPTQ and AWQ versus GGUF

GGUF is the most common format for desktop and CPU-friendly setups, but it is not the only one. GPTQ and AWQ are two other 4-bit-oriented quantization methods, and you will see them on model pages alongside GGUF. The short version: GGUF is the most flexible and the best choice when you want CPU inference or to offload part of a model to system RAM, which is how people run models larger than their VRAM. GPTQ and AWQ are designed to run fast on GPUs and are common in server-style inference stacks like vLLM.

On quality at 4-bit, the three are broadly comparable. AWQ tends to edge out GPTQ by preserving the most important weights more carefully, and a 4-bit AWQ model occupies roughly the same disk space as a Q4 GGUF. The practical guidance is simple: if you are running locally on a desktop, a laptop, or anything that benefits from CPU offload, reach for GGUF. If you are deploying a GPU inference server and want maximum throughput for multiple users, AWQ or GPTQ are usually the better fit. For most people choosing hardware to run a model at home, GGUF at Q4_K_M is the path of least resistance.
