What is a good tokens per second for local LLM use?

For a single person reading replies, about ten tokens per second already feels real-time, since average reading is only five to eight tokens per second. Twenty to forty tokens per second gives comfortable headroom to skim and re-read. Higher numbers mainly benefit agents, tool-calling loops, and batch jobs rather than a solo reader.

Why is my prompt processing fast but generation slow?

Those are two different phases. Prompt processing is prefill, which runs in parallel and is compute-bound, so it is fast. Generation is decode, which produces one token at a time and is limited by how fast weights and the key-value cache stream out of memory. Slow generation almost always means you are bandwidth-limited, not compute-limited.

Does more VRAM make tokens per second faster?

Not directly. VRAM determines whether a model fits at all; memory bandwidth determines how fast it decodes once it fits. A card with lots of slow memory can hold a big model but still generate slowly. When comparing cards for generation speed, look at bandwidth in gigabytes per second, not just capacity.

How does quantization affect tokens per second?

Quantization reduces the bytes per weight, for example from 16 bits to 4 bits. Because decode is bandwidth-bound, fewer bytes to stream per token means faster generation, often several times faster, plus a smaller memory footprint. The cost is usually a small, often unnoticeable drop in output quality at common 4-bit settings.

Why does generation slow down in long conversations?

The key-value cache grows with every token in the context. The model reads that cache to produce each new token, so as a chat gets longer the bytes streamed per token increase and decode speed gradually falls. Trimming old context or starting a fresh conversation restores speed.

Explainer

Tokens Per Second Explained: How Fast Is Fast Enough?

June 10, 2026 · 5 min read · By the ClankerBuilder editorial team

Tokens per second LLM speed explained: prefill vs decode, the human reading benchmark, and which hardware spec actually predicts generation speed.

When people shop for hardware to run a model locally, the number they fixate on is tokens per second. It sounds precise, but it hides a lot. A token is a chunk of text the model reads and writes, usually a word fragment; in English one token averages about three-quarters of a word, so roughly four tokens cover three words. Tokens per second, or tok/s, measures how fast the model turns those chunks into output. The question that actually matters is not "how high can the number go" but "how fast is fast enough."

The honest answer is that fast enough is lower than most people assume for a single human reader, and much higher than they assume once agents and batch jobs enter the picture. This article unpacks what tok/s really measures, why the same model reports two very different speeds depending on which phase you measure, and which hardware spec predicts the number you care about. Spoiler: for the speed you feel while reading a reply, it is memory bandwidth, not raw compute.

Two numbers hide behind one metric: prefill and decode

Every generation request runs in two phases, and they have opposite performance profiles. The first is prefill, sometimes called prompt processing: the model reads your entire prompt at once and builds its internal state. Because all the prompt tokens are available up front, the GPU processes them in parallel, doing a large amount of matrix math in one pass. Prefill is compute-bound, meaning it is limited by how many math operations the chip can perform. On a modern GPU it can run at hundreds or even thousands of tokens per second, and it largely determines how long you wait before the first word appears.

The second phase is decode, the actual generation. Here the model produces one token at a time, and each new token depends on every token before it, so the work cannot be parallelized the same way. For each token the GPU must stream the model's weights and the growing key-value cache out of memory. Decode is memory-bandwidth-bound: it is limited not by math but by how fast data moves from VRAM to the compute cores, and the GPU's compute utilization during decode is often only twenty to forty percent. This is the number you watch scroll across the screen.

People conflate the two because both get reported as "tokens per second." A vendor quoting a huge throughput figure is often quoting prefill or aggregate batch throughput, not the steady single-stream decode rate a solo user experiences. When you compare hardware for interactive use, insist on the decode number.
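To keep the two figures apart in your own measurements, it helps to time the phases separately. The sketch below assumes you can record three timestamps for a request (sent, first token received, last token received); the RequestTiming type and its field names are hypothetical, and treating time to first token as pure prefill time is an approximation, since it also includes any queueing and the first decode step.

```python
from dataclasses import dataclass


@dataclass
class RequestTiming:
    """Hypothetical timing record for one generation request (seconds)."""
    prompt_tokens: int
    generated_tokens: int
    t_request: float      # when the request was sent
    t_first_token: float  # when the first generated token arrived
    t_done: float         # when the last generated token arrived


def phase_rates(t: RequestTiming) -> dict:
    """Split one request into time to first token, prefill rate and decode rate."""
    ttft = t.t_first_token - t.t_request
    decode_time = t.t_done - t.t_first_token
    # Prefill rate: prompt tokens divided by the wait before the first token.
    prefill = t.prompt_tokens / ttft if ttft > 0 else float("inf")
    # Decode rate: tokens produced after the first one, over the time they took.
    decode = (t.generated_tokens - 1) / decode_time if decode_time > 0 else float("inf")
    return {"ttft_s": ttft, "prefill_tok_s": prefill, "decode_tok_s": decode}


# Illustrative numbers only: a 2,000-token prompt, 201 generated tokens.
rates = phase_rates(RequestTiming(2000, 201, 0.0, 1.0, 11.0))
print(rates)
```

With these made-up timings the same request reports 2,000 tok/s for prefill and 20 tok/s for decode, which is exactly the gap that a single headline number hides.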

The human benchmark: how fast is fast enough

There is a natural ceiling on useful single-stream speed, and it is set by how fast you read. Average adult reading runs around 200 to 300 words per minute. At the English ratio of about four tokens per three words, 250 words per minute works out to roughly 5.6 tokens per second, and the full range spans about 4.4 to 6.7, so five to eight tokens per second is a generous band for a typical reader. In other words, a model generating in the single digits is already keeping pace with a careful reader.
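The conversion from reading speed to tokens per second is simple arithmetic, sketched here. The 4/3 tokens-per-word ratio is the English average quoted earlier; real tokenizers and real text vary, so treat the results as approximate.

```python
TOKENS_PER_WORD = 4 / 3  # English average: about four tokens per three words


def reading_speed_tok_s(words_per_minute: float,
                        tokens_per_word: float = TOKENS_PER_WORD) -> float:
    """Convert a reading speed in words per minute to tokens per second."""
    return words_per_minute / 60 * tokens_per_word


for wpm in (200, 250, 300):
    print(f"{wpm} wpm is about {reading_speed_tok_s(wpm):.1f} tok/s")
```

For 200, 250 and 300 words per minute this gives about 4.4, 5.6 and 6.7 tokens per second.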

Once generation comfortably exceeds reading speed, the bottleneck you feel stops being decode rate and becomes time to first token, the delay before anything appears at all. Past roughly ten tokens per second, output streams faster than you can absorb it, so further speed does not make a single conversation feel meaningfully snappier; a slow first token does. This is why a build that decodes at fifteen tokens per second with an instant first token can feel more responsive than one that decodes at forty but stalls for two seconds before starting.

Under ~5 tok/s: noticeably slow; you wait on the text
~5-8 tok/s: matches a typical reader's pace
~10+ tok/s: feels real-time for a single user reading along
~20-40 tok/s: comfortable headroom; skim, re-read, no waiting
Beyond ~40 tok/s: mostly helps agents, tool loops, and batch jobs, not solo reading
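The fifteen-versus-forty comparison above can be made concrete with a simple model: the time until a given number of tokens has appeared is the time to first token plus the token count divided by the decode rate. This is a minimal sketch under that assumption; "instant" is modeled as a zero-second first token, and the function names are illustrative.

```python
def time_to_tokens(n_tokens: int, ttft_s: float, decode_tok_s: float) -> float:
    """Seconds from sending the request until n_tokens have streamed out."""
    return ttft_s + n_tokens / decode_tok_s


def catch_up_tokens(ttft_a: float, rate_a: float,
                    ttft_b: float, rate_b: float) -> float:
    """Token count at which build B (later start, faster decode) draws level
    with build A (earlier start, slower decode).

    Solves ttft_a + n / rate_a == ttft_b + n / rate_b for n.
    Assumes ttft_b > ttft_a and rate_b > rate_a.
    """
    return (ttft_b - ttft_a) / (1 / rate_a - 1 / rate_b)


# Build A: 15 tok/s with an instant first token. Build B: 40 tok/s after a 2 s stall.
print(time_to_tokens(30, 0.0, 15))   # seconds for A to show 30 tokens
print(time_to_tokens(30, 2.0, 40))   # seconds for B to show 30 tokens
print(round(catch_up_tokens(0.0, 15, 2.0, 40)))  # tokens before B catches up
```

Under these assumptions the slower-decoding build stays ahead for roughly the first 48 tokens, which covers the opening sentences of most replies and explains why it can feel more responsive.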

Why bigger models are slower

Decode speed falls as model size rises, and the reason follows directly from the memory-bandwidth bottleneck. To produce each token the GPU has to read every active weight out of memory. A larger model has more weights, so each token requires moving more bytes, and more bytes per token at a fixed bandwidth means fewer tokens per second. This is why an 8-billion-parameter model decodes far faster than a 70-billion-parameter one on the same card, often by a factor that roughly tracks the size difference.

A useful back-of-the-envelope estimate: divide your hardware's usable memory bandwidth by the model's size in bytes to approximate a ceiling on decode tokens per second. A model occupying 16 GB on a card with around 1,000 GB/s of bandwidth has a theoretical ceiling near 60 tok/s; real numbers land lower because of the key-value cache, overhead, and imperfect bandwidth utilization. The estimate is rough, but it explains the pattern: doubling the model roughly halves decode speed, and doubling bandwidth roughly doubles it.
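The back-of-the-envelope estimate above can be written as a one-line function. The efficiency parameter is an assumption added here to stand in for overhead and imperfect bandwidth utilization; the default of 1.0 gives the theoretical ceiling.

```python
def decode_ceiling_tok_s(bandwidth_gb_s: float, model_size_gb: float,
                         efficiency: float = 1.0) -> float:
    """Approximate upper bound on decode speed: each token requires streaming
    the whole model once, so tokens/s <= bandwidth / model size."""
    if model_size_gb <= 0:
        raise ValueError("model_size_gb must be positive")
    return bandwidth_gb_s * efficiency / model_size_gb


print(decode_ceiling_tok_s(1000, 16))   # the 16 GB model on a ~1,000 GB/s card
print(decode_ceiling_tok_s(1000, 32))   # doubling the model halves the ceiling
print(decode_ceiling_tok_s(2000, 16))   # doubling bandwidth doubles it
```

The three calls give 62.5, 31.25 and 125 tok/s, matching the "near 60" figure and the halving and doubling pattern described above.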

What quantization, context length, and batching change

Three levers move tok/s in predictable directions once you understand the bottleneck. Quantization shrinks each weight from, say, 16 bits to 4 bits. Because decode is bandwidth-bound, fewer bytes per weight means fewer bytes to stream per token, so a 4-bit model can decode up to about four times faster than the same model at 16 bits, usually somewhat less in practice, while also fitting in less VRAM. The tradeoff is a usually small drop in output quality.

Context length works the other way. The key-value cache grows with every token in the conversation, and the model must read that cache for each new token. A long prompt or a long-running chat can inflate the cache into gigabytes, adding to the bytes streamed per token and dragging decode speed down as the conversation goes on.

Batching is the lever that helps servers but not solo users: processing many requests together raises total throughput because each pass over the weights serves every request in the batch, putting compute that would otherwise sit idle during decode to work. It does not speed up, and can slightly slow, any single stream. A benchmark of aggregate batch throughput tells you little about how fast your one reply will feel.

Quantization (16-bit to 4-bit): big decode speedup, smaller footprint, minor quality cost
Longer context / longer chats: KV cache grows, decode gradually slows
Batching: raises aggregate server throughput, not single-stream speed
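The first two levers can be put into the same bytes-per-token picture. The sketch below is idealized: it counts exactly the nominal bits per weight (real quantization formats carry some extra bytes for scales), stores the cache at 16 bits, and uses a hypothetical 8-billion-parameter model shape (32 layers, 8 key-value heads, head dimension 128) on a hypothetical 1,000 GB/s card. The cache formula itself, two vectors (key and value) per layer per token, is standard.

```python
def weight_bytes(params: float, bits_per_weight: float) -> float:
    """Bytes occupied by the weights at a given precision (idealized)."""
    return params * bits_per_weight / 8


def kv_cache_bytes(context_tokens: int, n_layers: int, n_kv_heads: int,
                   head_dim: int, bytes_per_value: int = 2) -> int:
    """Bytes of key-value cache: one key and one value vector per layer per token."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value * context_tokens


def decode_ceiling_tok_s(bandwidth_bytes_s: float, params: float,
                         bits_per_weight: float, context_tokens: int,
                         **shape) -> float:
    """Bandwidth divided by the bytes streamed per token (weights plus cache)."""
    per_token = weight_bytes(params, bits_per_weight) + kv_cache_bytes(context_tokens, **shape)
    return bandwidth_bytes_s / per_token


# Hypothetical model shape and card, for illustration only.
SHAPE = dict(n_layers=32, n_kv_heads=8, head_dim=128)
BANDWIDTH = 1e12  # 1,000 GB/s expressed in bytes per second
PARAMS = 8e9

for bits, ctx in [(16, 0), (4, 0), (4, 8192), (4, 32768)]:
    rate = decode_ceiling_tok_s(BANDWIDTH, PARAMS, bits, ctx, **SHAPE)
    print(f"{bits}-bit, {ctx:>6} context tokens: ceiling about {rate:.0f} tok/s")
```

With these assumed numbers the ceiling rises fourfold when going from 16-bit to 4-bit weights, then falls steadily as the context grows, because at 32,768 tokens the cache is about as large as the 4-bit weights themselves.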

The spec that predicts decode speed: memory bandwidth

If you remember one thing when comparing GPUs for local inference, make it this: for the speed you feel, memory bandwidth predicts decode tokens per second better than any compute figure. Two cards with similar teraflops but different bandwidth will deliver noticeably different generation speeds, with the higher-bandwidth card winning. Compute still matters for prefill and for how quickly the first token arrives, but the steady stream of generated text is gated by how fast weights and cache move out of memory.

This reframes hardware shopping. Rather than chasing the largest model your VRAM can hold, target a model whose quantized size lets your card's bandwidth deliver at least ten to twenty tokens per second for interactive reading, with more headroom only if you plan to run agents or batch workloads. On the model pages, note each model's size at its recommended quantization, then divide a card's bandwidth from the GPU comparison by that size to sanity-check the decode rate before you buy. A card that comfortably clears your reading speed on the model you actually want is a better purchase than a bigger card running a model too large to stream quickly.
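The same estimate can be turned around to ask how large a quantized model a card can stream at a target decode rate. This is a rough sketch, not a buying rule: the bandwidth figure is a placeholder, and the efficiency factor of 0.7 in the second call is an assumed allowance for overhead rather than a measured value.

```python
def max_model_size_gb(bandwidth_gb_s: float, target_tok_s: float,
                      efficiency: float = 1.0) -> float:
    """Largest model size (in GB) that can be streamed once per token
    at the target decode rate, given usable memory bandwidth."""
    if target_tok_s <= 0:
        raise ValueError("target_tok_s must be positive")
    return bandwidth_gb_s * efficiency / target_tok_s


# Placeholder card with about 1,000 GB/s of bandwidth, targeting 20 tok/s.
print(max_model_size_gb(1000, 20))        # theoretical ceiling
print(max_model_size_gb(1000, 20, 0.7))   # with an assumed 70% efficiency
```

At the theoretical ceiling the placeholder card could stream a model of up to 50 GB at 20 tok/s, and about 35 GB under the assumed efficiency. The model still has to fit in VRAM alongside its key-value cache, so this is an upper bound on what is worth considering, not a guarantee.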

Related builds

Home Inference Workstation

RTX 4090 powerhouse for 8B–34B models with headroom for agent workflows.

Budget Llama 8B Box

Minimal spend path to a solid Llama 3.1 8B daily driver.