Buying guide

Best Mac for Running LLMs Locally (M-Series Guide)

June 10, 2026 · 5 min read

The best Mac for local LLMs comes down to unified memory, not GPU cores. M4, M4 Pro, M4 Max, and Ultra tiers explained by model size.

If you want the short version: the best Mac for local LLMs is whichever one has enough unified memory to hold the model you actually plan to run, with not a dollar more spent on GPU cores you can't feed. For most people experimenting with 7B to 12B models, a base M4 with 24GB is enough. To run 30B-class models comfortably, jump to an M4 Pro with 48GB. For 70B-class models, 64GB is the tight minimum and an M4 Max with 128GB is the comfortable target. The very largest open models only fit on an Ultra chip with 192GB or more. Everything below explains why memory is the spec that decides this, and where the money is well spent versus wasted.

This guide is accurate as of mid-2026 and uses approximate figures. Treat token-per-second numbers as ranges, because they shift with quantization, context length, and the specific runtime you use.

Why unified memory is the spec that matters

On a Mac, the CPU and GPU share one pool of memory. That pool is the hard ceiling on what you can run, because a model has to fit in memory to load at all. This is the single most important thing to understand before you spend anything: the size of your unified memory caps the size of the model you can load, full stop. A faster GPU does not help if the weights don't fit.

Model size in memory depends on parameter count and quantization. A rough rule of thumb: a 4-bit quantized model needs a little over half a gigabyte of memory per billion parameters, plus headroom for context and the OS. So a 7B model at 4-bit lands around 4 to 5GB, a 30B model around 18 to 22GB, and a 70B model around 40 to 48GB. You always want spare room on top, because macOS itself, your apps, and the model's context window all consume memory while it runs.
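The rule of thumb above can be sketched as a small calculation. This is a rough estimate, not a measurement: the function name and the 1.2 overhead factor (standing in for quantization metadata and layers kept at higher precision) are assumptions chosen so the results land inside the ranges quoted above.

```python
def estimate_weights_gb(params_billions: float,
                        bits_per_weight: float = 4.0,
                        overhead: float = 1.2) -> float:
    """Rough in-memory size of a quantized model's weights, in GB.

    overhead is an assumed fudge factor, not a measured value. It does
    not include macOS, other apps, or the context window.
    """
    bytes_per_param = bits_per_weight / 8  # 4-bit -> 0.5 bytes per parameter
    return params_billions * bytes_per_param * overhead


for size in (7, 30, 70):
    print(f"{size}B at 4-bit: about {estimate_weights_gb(size):.1f} GB of weights")
```

Passing bits_per_weight=8 roughly doubles every figure, which is why the quantization level matters as much as the parameter count when you size a machine.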

This is also why you should not over-index on GPU core counts when comparing Mac configurations. Two machines with the same memory but different GPU core counts will both run the same model; the one with more cores will run it somewhat faster, but neither can run a model that doesn't fit. Memory decides what is possible. GPU cores and bandwidth decide how fast the possible runs.

The practical tiers, by model size

Apple's M-series chips fall into clean tiers, and each tier maps to a class of model you can run with comfortable headroom. The base M4 supports up to 32GB of unified memory. The M4 Pro goes up to 64GB. The M4 Max reaches 128GB. The M2 Ultra tops out at 192GB, and the newer M3 Ultra extends all the way to 512GB for the people pushing the largest frontier-scale open models.

Here is the mapping that actually matters when you are choosing. These assume 4-bit quantization and leave room for the OS and a reasonable context window:

Base M4, 16 to 24GB: 7B to 12B models. Great for chat assistants, coding helpers, and learning the tooling. 16GB is tight once macOS takes its share, so 24GB is the more comfortable floor.
M4 Pro, 48GB: up to roughly 32B-class models with headroom. This is the sweet spot for most serious hobbyists and developers who want quality without Ultra pricing.
M4 Max, 128GB: 70B-class models. This is where local LLMs start to feel genuinely capable for real work, and where most prosumers should land if budget allows.
M2 Ultra, 192GB / M3 Ultra, up to 512GB: the very largest open models, including mixture-of-experts models in the hundreds of billions of parameters. Only worth it if you specifically need those models.
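
The mapping above can be approximated with a simple fit check. The sketch below is illustrative only: the 0.6GB per billion parameters, the 1.2 context factor, and the 8GB reserved for macOS and apps are all assumptions picked to reproduce the tiers listed here, and smallest_tier is a hypothetical helper, not part of any tool.

```python
# Memory tiers taken from the list above: (label, unified memory in GB).
TIERS = [
    ("Base M4, 24GB", 24),
    ("M4 Pro, 48GB", 48),
    ("M4 Max, 128GB", 128),
    ("M2 Ultra, 192GB", 192),
    ("M3 Ultra, 512GB", 512),
]


def smallest_tier(params_billions: float,
                  gb_per_billion: float = 0.6,   # assumed 4-bit footprint
                  context_factor: float = 1.2,   # assumed room for context
                  os_reserve_gb: float = 8.0):   # assumed macOS + apps share
    """Return the first tier whose usable memory holds the model, else None."""
    needed_gb = params_billions * gb_per_billion * context_factor
    for label, memory_gb in TIERS:
        if needed_gb <= memory_gb - os_reserve_gb:
            return label
    return None  # larger than any single Mac configuration listed here


for size in (8, 32, 70):
    print(f"{size}B -> {smallest_tier(size)}")
```

Tightening or loosening the assumed reserve shifts the boundaries, so treat the output as a starting point for choosing a tier rather than a hard limit.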

Memory bandwidth is what turns memory into speed

Once a model fits, the thing that determines how fast it generates tokens is memory bandwidth. Token generation on these models is overwhelmingly memory-bound: every token requires reading the model's weights from memory, so the rate at which memory can be read sets a practical ceiling on tokens per second.

The tiers differ a lot here. The base M4 offers roughly 120GB/s. The M4 Pro more than doubles that to around 273GB/s. The M4 Max doubles again to about 546GB/s in its top configuration (the lower-binned M4 Max is rated at about 410GB/s). The Ultra chips reach roughly 800 to 820GB/s. In broad terms, more bandwidth means more tokens per second on the same model, and the jump from base to Pro to Max is felt clearly in interactive use.

A useful intuition: a small 7B model on a base M4 will feel snappy because the weights are small and the bandwidth is enough to read them quickly. A 70B model on an M4 Max fits and runs, but because the weights are far larger, it generates noticeably fewer tokens per second than that small model did. Bigger models are both harder to fit and slower to read, so you pay twice as you scale up. Do not expect a 70B model to feel as instant as a 7B one, even on the strongest Mac.
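
That intuition can be put into rough numbers. If every generated token requires reading all the weights once, memory bandwidth divided by the size of the weights gives an upper bound on tokens per second. The sketch below assumes a dense model and ignores compute, context, and runtime overhead, so real throughput will be lower; the weight sizes (4.2GB and 42GB) are assumed 4-bit estimates, and the function name is hypothetical.

```python
def decode_ceiling_tokens_per_s(bandwidth_gb_s: float, weights_gb: float) -> float:
    """Upper bound on tokens/s if each token reads all weights once."""
    if weights_gb <= 0:
        raise ValueError("weights_gb must be positive")
    return bandwidth_gb_s / weights_gb


# Bandwidth figures are the approximate ones quoted in this section.
examples = [
    ("7B on base M4", 120.0, 4.2),
    ("70B on M4 Max", 546.0, 42.0),
]
for label, bandwidth, weights in examples:
    ceiling = decode_ceiling_tokens_per_s(bandwidth, weights)
    print(f"{label}: ceiling of about {ceiling:.0f} tokens/s")
```

Even with more than four times the bandwidth, the 70B model's ceiling comes out at less than half of the 7B model's, which is the "pay twice" effect described above.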

Mac Mini vs MacBook Pro vs Mac Studio

The form factor mostly follows the chip tier, so pick the chip first and the body second. The Mac Mini is the cheapest way into Apple Silicon AI and is an excellent always-on inference box for base M4 and M4 Pro configurations. It is the best value if you do not need a screen and the model will run as a local server you talk to from other machines.

The MacBook Pro is the right answer only if you need the model on a laptop, because portability costs money relative to the same chip in a desktop. It is available up to the M4 Max with 128GB, so a MacBook Pro can run 70B-class models on the go, but you are paying a premium for the battery, display, and chassis that a desktop does not charge you for.

The Mac Studio is the machine built for this work. Apart from the far more expensive Mac Pro, it is the only way to get an Ultra chip, which makes it the practical route to 192GB and beyond. If you want 70B-class models at the best price-to-performance, or you need the very largest models, the Studio is the destination. The on-site Mac configs cover these directly, including a Mac Studio M4 Max 128GB build for the 70B tier and a Mac Studio M2 Ultra 192GB build for the largest models.

What to skip, and the honest limits

Spend your money on memory before anything else. The most common mistake is paying for more GPU cores or storage while staying on a memory tier that can't hold the models you wanted in the first place. If you have to choose between more GPU cores and more unified memory at the same price, choose memory almost every time, because memory changes what you can run and cores only change how fast. Storage is the easiest place to save: a 1TB or 2TB internal drive plus an external SSD for model files is far cheaper than maxing out internal storage.

It is also worth being honest about where Macs are not the answer. For raw inference speed on a single model that fits in a GPU's memory, a discrete NVIDIA GPU is faster, often substantially so, thanks to higher bandwidth and a mature CUDA software stack. And if your goal is fine-tuning or training rather than inference, NVIDIA is effectively required: most training tooling targets CUDA, and the Apple path for serious fine-tuning is immature by comparison. The Mac's advantage is large unified memory at a relatively accessible price, near-silent operation, and low power draw, which together make it an unusually good local inference machine. It is not a training rig.

So the buying logic is simple. Decide the largest model you genuinely want to run, find the memory tier that fits it with headroom, buy the cheapest Mac that reaches that tier, and don't overpay for cores you can't feed.

Frequently asked questions

How much RAM do I need to run a 70B model on a Mac?

Plan for at least 64GB, and 128GB is the comfortable target. A 70B model at 4-bit quantization occupies roughly 40 to 48GB on its own, and you need additional headroom for macOS, your apps, and the model's context window. That headroom is why an M4 Max with 128GB is the realistic recommendation for 70B-class models rather than trying to squeeze one into a smaller machine.

Is the base M4 Mac Mini enough for local LLMs?

Yes, for smaller models. A base M4 with 24GB of unified memory comfortably runs 7B to 12B models, which covers most chat assistants and coding helpers. The 16GB version works but is tight once macOS takes its share, so 24GB is a safer floor. If you want to run 30B-class or larger models, you will need to move up to an M4 Pro or higher.

Should I buy more GPU cores or more memory?

Memory, in almost every case. Unified memory caps the size of the model you can load at all, while GPU cores only affect how fast a model that already fits will run. If a model does not fit in memory, no amount of GPU cores will let you run it. Spend on memory first, then on cores and bandwidth once you have enough memory for your target model size.

Can I fine-tune LLMs on a Mac, or only run them?

Macs are built for running models, not training them. Some lightweight fine-tuning is technically possible, but the tooling for serious fine-tuning and training overwhelmingly targets NVIDIA's CUDA stack, so a Mac is a poor fit for that work. If fine-tuning is your main goal, plan on NVIDIA hardware. If inference is your goal, a Mac with enough unified memory is an excellent choice.

Why is my Mac slower at generating tokens than benchmarks suggest?

Token generation speed is set mainly by memory bandwidth and the size of the model's weights, and it drops as models get larger because more data has to be read from memory for every token. A long context window, a less aggressive quantization level, and other apps competing for memory all reduce throughput further. Treat published token-per-second figures as best-case ranges, not guarantees, and expect larger models to feel slower than small ones even on the same machine.

Some links in this article are affiliate links. If you buy through them we may earn a commission at no extra cost to you. See our affiliate disclosure.