Explainer

What Hardware Do You Need to Fine-Tune an LLM?

5 min read · By the ClankerBuilder editorial team

Hardware for fine-tuning LLMs: VRAM rules of thumb for full, LoRA, and QLoRA, why training needs far more memory than inference, and when to rent an H100.

If you have run a model locally and assume fine-tuning needs a similar machine, the hardware math is about to surprise you. Inference only has to hold the model weights in memory and push tokens through them. Training has to hold the weights and the gradients and the optimizer's bookkeeping and the intermediate activations from every layer, all at once, for every step. That stack is why the same 7B model that runs comfortably on an 8GB card once quantized can demand 60 to 70GB or more to fully fine-tune.

The good news is that you almost never need full fine-tuning. Parameter-efficient methods have collapsed the requirement so far that a single 24GB consumer card can adapt a 7B or 13B model overnight. This guide walks through why training is so much hungrier than inference, the three approaches and their real VRAM costs, why NVIDIA is effectively mandatory, and when renting an H100 beats buying anything at all.

Why training needs far more memory than inference

Inference is memory-cheap because it is a one-way trip. You load the weights, run a forward pass, and read off the output. A rough rule is about 2GB of VRAM per billion parameters at 16-bit precision, so a 7B model fits in roughly 14GB and a 13B in roughly 26GB, before you account for the key-value cache.

Training keeps almost nothing for free. On top of the weights you have to store gradients, which mirror the weights in size, and optimizer state, which for the standard AdamW optimizer keeps two running averages per parameter. Together, weights, gradients, and those two averages amount to roughly four copies of the model, about four times the footprint of the weights alone, and more when the optimizer state is kept in 32-bit, as mixed-precision training commonly does. Then there are activations: the intermediate values from the forward pass have to be held in memory until the backward pass consumes them, and they scale with batch size and sequence length.

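Add it up and a full fine-tune lands between about 8GB per billion parameters when everything is held in 16-bit and about 16GB per billion parameters in the common mixed-precision setup, where the optimizer state and a master copy of the weights are kept in 32-bit. That is roughly four to eight times what inference needs for the same model, before activations are counted. That multiplier is the single most important fact in fine-tuning hardware planning. Everything below is about shrinking it.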

  • Inference (16-bit): about 2GB per billion parameters
  • Weights: same in training and inference
  • Gradients: one full copy of the weights, in training only
  • Optimizer state (AdamW): two more copies, the usual top memory consumer
  • Activations: scale with batch size and sequence length
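The tally above can be turned into a rough calculator. The sketch below is a back-of-the-envelope estimate, not a measurement: the function name, parameters, and default byte counts are illustrative assumptions, and it leaves out activations, the key-value cache, and framework overhead entirely.

```python
# Rough per-parameter memory tally.
# 1 billion parameters * 1 byte = 1 GB (decimal), so bytes per parameter
# equals GB per billion parameters.

def static_memory_gb(params_billion, weight_bytes=2, training=False,
                     grad_bytes=2, optimizer_bytes=2, optimizer_copies=2,
                     master_weight_bytes=0):
    """Estimate memory for weights, plus gradients and AdamW state if training.

    Assumption: activations, KV cache, and overhead are deliberately ignored.
    """
    per_param = weight_bytes
    if training:
        per_param += grad_bytes                          # one gradient per weight
        per_param += optimizer_bytes * optimizer_copies  # AdamW running averages
        per_param += master_weight_bytes                 # optional 32-bit master copy
    return params_billion * per_param


if __name__ == "__main__":
    for size in (7, 13):
        infer = static_memory_gb(size)
        pure16 = static_memory_gb(size, training=True)
        mixed = static_memory_gb(size, training=True,
                                 optimizer_bytes=4, master_weight_bytes=4)
        print(f"{size}B: inference ~{infer} GB, "
              f"full fine-tune ~{pure16} GB (all 16-bit) "
              f"to ~{mixed} GB (32-bit optimizer state), before activations")
```

Setting `optimizer_bytes=4` and `master_weight_bytes=4` models the mixed-precision case; the defaults model a run where everything stays in 16-bit.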

The three approaches and what they actually cost

Full fine-tuning updates every weight in the model. It produces the most flexible result and is the heaviest by far, which is why it lives on multi-GPU servers rather than desktops. LoRA (Low-Rank Adaptation) freezes the original weights and trains small adapter matrices alongside them, so gradients and optimizer state only cover the tiny adapters instead of the whole model. QLoRA goes one step further: it loads the frozen base model in 4-bit quantization and trains LoRA adapters on top, cutting the largest cost, the static weights, to a quarter of its 16-bit size.
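To see why the adapters are "tiny", it helps to count their parameters. The sketch below assumes the usual LoRA formulation, in which a frozen weight matrix of shape d_out x d_in gets two small trainable matrices of rank r beside it; the layer size and rank in the example are hypothetical, chosen only for illustration.

```python
def lora_adapter_params(d_in, d_out, rank):
    """Trainable parameters added by one LoRA adapter.

    LoRA learns two small matrices, A (rank x d_in) and B (d_out x rank),
    whose product is added to the frozen d_out x d_in weight.
    """
    return rank * d_in + d_out * rank


if __name__ == "__main__":
    # Hypothetical layer: a 4096 x 4096 projection with adapter rank 16.
    frozen = 4096 * 4096
    adapter = lora_adapter_params(4096, 4096, rank=16)
    print(f"frozen weight: {frozen:,} params")
    print(f"LoRA adapter:  {adapter:,} params "
          f"({adapter / frozen:.2%} of the layer)")
```

Because gradients and optimizer state are only kept for the adapter parameters, their memory cost shrinks by the same ratio.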

The practical consequence is that QLoRA puts a 7B or 13B fine-tune within reach of a single 24GB card such as an RTX 4090, with headroom for reasonable batch sizes. LoRA in 16-bit is heavier but still single-card territory for those sizes. Full fine-tuning of even a 7B model needs the kind of VRAM you only get from datacenter cards. The rules of thumb below are approximate and move with sequence length, batch size, and quantization details, but they are the right mental model for sizing a build.

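  • Full fine-tune: about 8GB per billion params with everything in 16-bit, up to about 16GB with 32-bit optimizer state; 7B near 65-70GB at the low end and over 100GB at the high end, 13B over 120GB; multi-GPU
  • LoRA (16-bit): roughly 2GB per billion params for the frozen weights, plus adapters and activations; 7B around 15-20GB, 13B around 28-40GB
  • QLoRA (4-bit base): roughly 0.5GB per billion params for the quantized weights, plus adapters, activations, and overhead; 7B near 5-9GB, 13B near 9-17GB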
  • QLoRA on a single 24GB card comfortably covers 7B and 13B, and 14B with careful settings
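The rules of thumb above can be checked against a given card with a small lookup. This is a sketch only: the table copies the approximate ranges from the list, `None` marks an open-ended upper bound, and the function name and verdict strings are made up for illustration.

```python
# Approximate VRAM ranges in GB, keyed by (method, model size in billions).
# A high bound of None means "at least the low bound, possibly much more".
ESTIMATES_GB = {
    ("qlora", 7): (5, 9),
    ("qlora", 13): (9, 17),
    ("lora", 7): (15, 20),
    ("lora", 13): (28, 40),
    ("full", 7): (65, None),
    ("full", 13): (120, None),
}


def verdict(method, size_b, card_gb):
    """Classify a job against a card using the rough ranges above."""
    low, high = ESTIMATES_GB[(method, size_b)]
    if high is not None and high <= card_gb:
        return "fits"
    if low <= card_gb:
        return "may fit with small batches and short sequences"
    return "does not fit"


if __name__ == "__main__":
    for (method, size_b) in ESTIMATES_GB:
        print(f"{method:>5} {size_b:>2}B on 24GB: {verdict(method, size_b, 24)}")
```

The middle verdict exists because the ranges move with sequence length and batch size; a job whose low estimate fits but whose high estimate does not is one to try with conservative settings first.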

Why it has to be NVIDIA and CUDA

Almost every fine-tuning tool you will reach for assumes CUDA, NVIDIA's compute platform. The libraries that make low-VRAM training possible, including bitsandbytes for 4-bit and 8-bit quantization, the fused attention kernels, and the optimizer implementations, are built and tested against CUDA first. AMD's ROCm has improved and Apple Silicon can run small experiments, but the moment you hit an obscure kernel or a quantization path that was never ported, you lose hours that an NVIDIA card would have saved.

This is not a statement about which silicon is better in theory. It is a statement about where the ecosystem actually works today. For training specifically, where you are stacking quantization, custom kernels, and fast-moving libraries, a recent NVIDIA card with plenty of VRAM is the path of least resistance. A 24GB card such as a used 3090 or a 4090 is the entry point most people should target, and dual-GPU 4090 builds extend that to larger LoRA jobs and bigger batches.
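Before sizing a job, it is worth confirming what the machine actually exposes. The sketch below shells out to `nvidia-smi`, which ships with the NVIDIA driver, and reports each GPU's name and total memory; the helper names are hypothetical, and the code assumes the tool's CSV output with memory given in MiB.

```python
import shutil
import subprocess


def parse_gpu_lines(text):
    """Parse 'name, memory_mib' CSV lines into (name, memory_gib) tuples."""
    gpus = []
    for line in text.strip().splitlines():
        name, _, mem_mib = line.rpartition(",")
        try:
            gpus.append((name.strip(), int(mem_mib.strip()) / 1024))
        except ValueError:
            continue  # skip lines whose memory field is not a plain number
    return gpus


def list_nvidia_gpus():
    """Return [(name, memory_gib)], or [] if nvidia-smi is missing or fails."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=name,memory.total",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True,
    )
    if result.returncode != 0:
        return []
    return parse_gpu_lines(result.stdout)


if __name__ == "__main__":
    gpus = list_nvidia_gpus()
    if not gpus:
        print("No NVIDIA GPU detected (or nvidia-smi is not on the PATH).")
    for name, mem_gib in gpus:
        print(f"{name}: {mem_gib:.1f} GiB total")
```

Note that drivers report slightly less than the marketing figure, so a "24GB" card will typically show a little under 24 GiB.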

System RAM, gradient checkpointing, and other levers

VRAM is the binding constraint, but it is not the only one. System RAM matters because datasets are loaded and preprocessed on the CPU, the base model is often staged in RAM before moving to the GPU, and some workflows offload optimizer state to system memory to fit larger jobs. Treat 32GB as a sensible floor and 64GB as comfortable for serious work; offloading in particular trades plentiful RAM for scarce VRAM, but at a real speed cost.

Gradient checkpointing is the other major lever. Instead of keeping every layer's activations in memory for the backward pass, it stores only a few checkpoints and recomputes the rest on the fly. That can cut activation memory substantially in exchange for roughly 20-30 percent more compute time, which is often the difference between a job fitting on your card and not running at all. Smaller batch sizes, shorter sequence lengths, and 8-bit optimizers are the other dials to reach for when you are a few gigabytes short.
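The trade-off can be illustrated with a toy model. The sketch below is a simplification, not how any particular framework accounts for memory: it assumes every layer produces the same amount of activation data, that one checkpoint is kept per segment of layers, and that only one segment is recomputed at a time during the backward pass. The function name is hypothetical.

```python
import math


def activation_units(num_layers, segment=None):
    """Peak stored activations, in units of one layer's activations.

    Without checkpointing, every layer's activations are kept. With
    checkpointing, one checkpoint per segment is kept, plus the activations
    of the single segment being recomputed for the backward pass.
    """
    if segment is None:
        return num_layers
    checkpoints = math.ceil(num_layers / segment)
    return checkpoints + segment


if __name__ == "__main__":
    layers = 32  # hypothetical layer count
    print(f"no checkpointing: {activation_units(layers)} units")
    for segment in (2, 4, 8, 16):
        print(f"segment of {segment:>2} layers: "
              f"{activation_units(layers, segment)} units")
```

In this model the stored total is smallest when the segment length is near the square root of the layer count, which is why checkpointing can shrink activation memory so sharply for a modest amount of recomputation.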

When to rent an H100 instead

Buying a 24GB card makes sense if you fine-tune regularly, value local iteration, and want your data to stay on your own machine. But there is a clear line past which renting wins. Full fine-tuning of anything from 7B upward, multi-billion-parameter LoRA runs that exceed your VRAM, or a one-off job you will not repeat are all better served by an hourly cloud GPU than by hardware you have to house, power, and cool.

An H100 typically rents for roughly 2.50 to 4 dollars per hour on dedicated GPU clouds, with hyperscalers charging more and marketplaces sometimes less. At those rates an occasional large fine-tune costs tens of dollars, while buying an H100 outright costs tens of thousands. The break-even rule is simple: rent until your workload would keep an owned card busy for a large fraction of each day over its useful life. Below that threshold, renting is cheaper and spares you the depreciation, power, and maintenance.

A practical pattern is to develop and debug your pipeline on a local 24GB card with a small model and short runs, then rent an H100 only for the final full-scale job. The site's fine-tune workstation build is designed around exactly this local-iteration role, and the dual-4090 workstation build extends it to heavier LoRA work before you ever need to rent.
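The break-even reasoning can be sketched as simple arithmetic. Everything here is an assumption to be replaced with your own numbers: the purchase price, power cost, and job length are hypothetical, and the model ignores resale value, maintenance, and the fact that an owned consumer card and a rented H100 do not finish the same job in the same time.

```python
def job_cost(hours, rental_rate_per_hour):
    """Cost of a single rented job."""
    return hours * rental_rate_per_hour


def break_even_hours(purchase_price, rental_rate_per_hour,
                     power_cost_per_hour=0.0):
    """Hours of use at which owning becomes cheaper than renting."""
    saving_per_hour = rental_rate_per_hour - power_cost_per_hour
    if saving_per_hour <= 0:
        raise ValueError("owning never beats renting with these inputs")
    return purchase_price / saving_per_hour


if __name__ == "__main__":
    rate = 3.0  # dollars per hour, inside the 2.50 to 4 range quoted above
    print(f"10-hour rented job: ${job_cost(10, rate):.2f}")

    # Hypothetical owned card: $2,000 up front, $0.15 per hour in electricity.
    hours = break_even_hours(2000, rate, power_cost_per_hour=0.15)
    print(f"break-even after about {hours:.0f} GPU-hours, "
          f"or {hours / 24:.0f} days of round-the-clock use")
```

If your realistic usage over the card's life falls well short of the break-even figure, renting is the cheaper path.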

Related builds

Fine-tune Workstation

64GB RAM and RTX 4090 for LoRA fine-tuning on 8B–14B models.

Dual RTX 4090 Workstation

Twin 4090s for high-throughput 34B–70B inference with NVLink-ready parts.

Frequently asked questions

Can I fine-tune a 7B model on a 24GB GPU?
Yes. With QLoRA, which loads the base model in 4-bit and trains small adapters on top, a 7B fine-tune fits comfortably on a 24GB card such as an RTX 4090, and a 13B does too. Full fine-tuning of a 7B model is a different story and needs roughly 60 to 70GB or more, which means datacenter cards or a multi-GPU server.
Why does fine-tuning need so much more VRAM than just running the model?
Inference only holds the weights. Training also holds gradients, optimizer state, and the activations from the forward pass. With the standard AdamW optimizer, the weights, gradients, and two optimizer averages come to about four times the per-parameter memory of inference when everything is held in 16-bit, and roughly eight times when the optimizer state and a master copy of the weights are kept in 32-bit.
What is the difference between LoRA and QLoRA?
LoRA freezes the original weights and trains small low-rank adapter matrices, so only the adapters need gradients and optimizer state. QLoRA does the same but also quantizes the frozen base model to 4-bit, shrinking the largest cost, the static weights, to a quarter of its 16-bit size and making single-card fine-tuning of 7B to 13B models routine.
Do I have to use an NVIDIA GPU to fine-tune?
In practice, yes. The quantization libraries, fused kernels, and optimizer implementations that make low-VRAM training feasible are built and tested against NVIDIA's CUDA first. AMD ROCm and Apple Silicon can handle small experiments, but for serious or low-VRAM fine-tuning a recent NVIDIA card with ample VRAM is the path of least resistance.
When should I rent an H100 instead of buying a GPU?
Rent when the job is too big for a card you would realistically own, such as a full fine-tune or a large LoRA run, or when it is a one-off. An H100 rents for roughly 2.50 to 4 dollars per hour, so an occasional large job costs tens of dollars. Buy only when sustained use would keep an owned card busy enough to beat rental over its lifespan.

Some links in this article are affiliate links. If you buy through them we may earn a commission at no extra cost to you. See our affiliate disclosure.