Jun 21, 2026

Ollama v0.30 on Apple Silicon: What the Stable MLX Release Actually Changed From the Preview

By RunAIHome Team · 10 min read

Tags: ollama, apple-silicon, mlx, local-ai, mac, gemma-4, inference

TL;DR: Ollama v0.30 (May 13, 2026) promoted the MLX engine from a spring preview to the default Apple Silicon inference path, and the point releases through June added the parts that actually matter day-to-day: Gemma 4 QAT weights, Gemma 4 MTP speculative decoding (>2× on Macs), and better KV-cache reuse so repeated prompts skip re-prefill. If you’re still on the spring preview build, upgrading to v0.30.10 is a free speed bump you can pick up on an idle afternoon.

What you’ll be able to do after this guide:

Upgrade to Ollama v0.30.10 and confirm the MLX engine is actually running, not the old llama.cpp Metal fallback
Pull the Gemma 4 QAT tags that fit your Mac’s memory (1GB to 18GB) and run them at near-original quality
Verify KV-cache reuse and MTP speculative decoding are active using ollama ps and real timing

Honest take: This isn’t a new engine. It’s the MLX preview we covered on June 2, finally stabilized and filled out with the features it was missing. The headline tok/s numbers haven’t moved much since spring, but Gemma 4 QAT plus speculative decoding plus cache reuse together make a 32GB Mac feel meaningfully snappier on real multi-turn work. Upgrade, pull a -it-qat tag, and move on.

What the v0.30 line actually shipped

The MLX engine arrived in preview this spring as a swap of Ollama’s Mac backend from llama.cpp’s Metal path to Apple’s MLX framework, which treats unified memory as the architectural primitive instead of an edge case. That preview nearly doubled decode speed — from ~58 to ~112 tok/s on an M4 Max running Qwen3.5-35B-A3B at int4 — but it was narrow: a handful of models, a hard 32GB-memory floor, and a “preview” label.

Ollama v0.30.0, released May 13, 2026, changed the framing. The release notes describe it as “improved compatibility and performance using llama.cpp” augmenting the MLX engine on Apple Silicon and bringing support to a wider range of hardware. In plain terms: MLX is now the default fast path on capable Macs, and the llama.cpp side got broader GGUF support (Hugging Face models and your own fine-tunes) plus faster NVIDIA performance for everyone else.

The interesting work happened in the point releases:

Version	Date	What it added
v0.30.0	May 13, 2026	MLX default on Apple Silicon; broader GGUF + Hugging Face model support; faster NVIDIA
v0.30.5	early June 2026	Fixed `gemma4:12b` floating-point exception crash; Gemma 4 MTP speculative decoding on Macs (>2× speedup)
v0.30.8	June 12, 2026	Improved prompt caching for better KV-cache reuse
v0.30.9	mid June 2026	Cohere2Moe architecture support
v0.30.10	June 17, 2026	Command A and North family models on Apple Silicon MLX; llama.cpp updated to build 9672

If you installed the spring preview and never touched it, you’re missing all four of those. None is a marketing bullet — they’re the difference between “MLX is fast in a benchmark” and “MLX is fast on the thing I actually do.”

Upgrade and verify it’s really MLX

Upgrading is the easy part. On macOS, re-run the installer or use Homebrew:

$ brew upgrade ollama
$ ollama --version
ollama version is 0.30.10
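If you want to script that check, keep in mind that a plain string comparison would rank 0.30.5 above 0.30.10. A minimal sketch, assuming the version line keeps the format shown above; `version_at_least` is a hypothetical helper name, not part of Ollama:

```shell
# Succeeds (exit 0) when the installed version is at least the wanted one.
# Takes the raw `ollama --version` line as $1 so it can be tried without Ollama.
version_at_least() {
  have=$(printf '%s\n' "$1" | sed -n 's/.*version is \([0-9][0-9.]*\).*/\1/p')
  want=$2
  [ -n "$have" ] || return 2   # could not find a version number in the line
  awk -v a="$have" -v b="$want" 'BEGIN {
    na = split(a, x, "."); nb = split(b, y, ".")
    n = (na > nb) ? na : nb
    for (i = 1; i <= n; i++) {
      xi = x[i] + 0; yi = y[i] + 0   # compare each component as a number
      if (xi > yi) exit 0
      if (xi < yi) exit 1
    }
    exit 0
  }'
}

# Usage on a real install:
#   version_at_least "$(ollama --version)" 0.30.10 && echo "up to date"
```

Each dotted component is compared numerically, which is what makes 0.30.10 sort after 0.30.5.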

The part people skip — and then wonder why nothing got faster — is confirming the MLX engine is the one doing the work. The MLX path activates on Macs with 32GB or more of unified memory. Below that, Ollama silently falls back to llama.cpp Metal with no error and no speed change. That silent fallback is the single most common “I upgraded and saw nothing” complaint, and it’s not a bug — it’s the documented memory floor.
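Because the fallback is silent, it helps to check which side of the floor your Mac sits on before benchmarking anything. A small sketch, assuming the 32GB floor is compared against 32 GiB of reported memory; `expected_engine` is an illustrative helper, and the engine labels it prints are made up for this example:

```shell
# 32 GiB in bytes. The 32GB floor comes from the text above; treating it as
# exactly 32 GiB is an assumption.
MLX_FLOOR_BYTES=34359738368

# Prints which path to expect for a given amount of unified memory (in bytes).
expected_engine() {
  if [ "$1" -ge "$MLX_FLOOR_BYTES" ]; then
    echo "mlx"
  else
    echo "llama.cpp-metal"
  fi
}

# On a Mac, hw.memsize reports installed memory in bytes:
#   expected_engine "$(sysctl -n hw.memsize)"
```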

To check which engine is live, load a model and read ollama ps:

$ ollama run gemma4:26b-it-qat ""
$ ollama ps
NAME                 ID              SIZE     PROCESSOR    UNTIL
gemma4:26b-it-qat    a1b2c3d4e5f6    16 GB    100% GPU     4 minutes from now

100% GPU means the model is fully resident on the GPU through unified memory. It does not by itself name the engine: the llama.cpp Metal fallback also runs on the GPU, so a Mac below the 32GB floor can show 100% GPU without being on MLX. If you see any CPU percentage on a model that should fit, the model spilled out of GPU memory; close other apps and reload. The SIZE column also sanity-checks your quant: a 26B QAT model should report ~16GB, not ~30GB.
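The PROCESSOR check can be scripted so a spill is flagged without eyeballing the table. A rough sketch, assuming the column layout shown above (columns separated by two or more spaces, with a PROCESSOR header); `check_ps` is a hypothetical helper:

```shell
# Reads `ollama ps` output on stdin and prints one line per loaded model:
#   <name> ok        when PROCESSOR is exactly "100% GPU"
#   <name> spilled   for anything else (for example a CPU/GPU split)
check_ps() {
  awk -F'  +' '
    NR == 1 {
      # Locate the PROCESSOR column from the header instead of hard-coding it.
      for (i = 1; i <= NF; i++) if ($i == "PROCESSOR") col = i
      next
    }
    col && NF >= col {
      print $1, ($col == "100% GPU" ? "ok" : "spilled")
    }
  '
}

# Usage on a real install:
#   ollama ps | check_ps
```

Finding the column by its header keeps the sketch working if other columns are added to the left of PROCESSOR.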

Gemma 4 QAT: the upgrade that changes which Mac is enough

The most useful thing v0.30 unlocked isn’t raw speed — it’s Google’s Gemma 4 quantization-aware training (QAT) checkpoints, released June 5, 2026, now available as first-party Ollama tags. QAT simulates quantization during training instead of bolting it on afterward, which cuts memory roughly 72% versus BF16 while keeping near-original quality. We covered the full QAT memory map in the Gemma 4 QAT hardware update; here’s the short version of what to pull:

$ ollama pull gemma4:e4b-it-qat    # ~5 GB  — fits a 16GB MacBook Air
$ ollama pull gemma4:12b-it-qat    # ~7 GB  — fits 16GB comfortably
$ ollama pull gemma4:26b-it-qat    # ~15 GB — fits a 16GB Mac/GPU, barely
$ ollama pull gemma4:31b-it-qat    # ~18 GB — needs 24GB+

Gemma 4 QAT tag	Memory	What it fits
`gemma4:e2b-it-qat`	~1 GB	A phone, or any Mac
`gemma4:e4b-it-qat`	~5 GB	8–16GB MacBook Air
`gemma4:12b-it-qat`	~7 GB	16GB Mac / 8GB+ GPU
`gemma4:26b-it-qat`	~15 GB	16GB Mac/GPU (tight)
`gemma4:31b-it-qat`	~18 GB	24GB Mac/GPU

The reason this matters: the 26B-A4B model now fits in ~15GB, which means a 16GB Mac that previously couldn’t touch a 26B-class model runs one at near-full quality. Critical caveat carried over from the QAT release: don’t hand-convert the Hugging Face QAT BF16 weights to Q4_0 yourself — the F16-vs-BF16 scale mismatch reintroduces the exact accuracy loss QAT was meant to avoid. Use the official Ollama -it-qat tags above, which are already converted correctly.
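The table can be turned into a quick picker. This is only a sketch: the sizes are the approximate figures from the table, and `reserve_gb` (default 3) is a made-up allowance for macOS, other apps, and the KV cache that you should tune for your own machine. `pick_qat_tag` is a hypothetical helper name.

```shell
# Prints the largest Gemma 4 QAT tag whose approximate size fits in
# (total memory - reserve). Prints nothing and fails if none fits.
#   $1 = total memory in GB, $2 = GB to hold back (assumed default: 3)
pick_qat_tag() {
  total_gb=$1
  reserve_gb=${2:-3}
  budget=$((total_gb - reserve_gb))
  best=""
  # name:size pairs in ascending size order, sizes taken from the table above
  for entry in e2b:1 e4b:5 12b:7 26b:15 31b:18; do
    size=${entry#*:}
    if [ "$size" -le "$budget" ]; then
      best="gemma4:${entry%%:*}-it-qat"
    fi
  done
  [ -n "$best" ] && echo "$best"
}

# Usage:
#   ollama pull "$(pick_qat_tag 16)"
```

With the default reserve, a 16GB machine lands on the 12B tag. Dropping the reserve to 1 GB selects the 26B tag, which is the "tight" case from the table.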

Speculative decoding and cache reuse: where v0.30 feels faster

Two changes in the point releases don’t show up as a bigger headline tok/s number but change the lived experience.

Gemma 4 MTP speculative decoding (v0.30.5) uses multi-token-prediction draft heads to propose several tokens at once and verify them in a single pass — lossless output, but Ollama reports over a 2× speedup on Macs for Gemma 4. This is the same family of technique we broke down in why local LLMs got good in 2026: it doesn’t raise the memory-bandwidth ceiling, it just wastes fewer trips to it.
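To make "propose several tokens, verify in one pass" concrete, here is a toy version of the accept step in generic greedy draft-and-verify decoding. It is not Ollama's implementation: tokens are plain words, and the `target` argument stands in for what the full model would pick at each position. `verify_draft` is a hypothetical name.

```shell
# Toy greedy draft-and-verify step.
#   $1 = space-separated tokens proposed by the draft heads
#   $2 = space-separated tokens the full model picks for the same positions
#        (one more than the draft, as a single verification pass would yield)
verify_draft() {
  draft=$1
  target=$2
  accepted=""
  set -- $target            # positional parameters now hold the target tokens
  for tok in $draft; do
    if [ "$#" -gt 0 ] && [ "$tok" = "$1" ]; then
      accepted="$accepted $tok"   # draft agreed with the full model: keep it
      shift
    else
      break                       # first disagreement ends the accepted run
    fi
  done
  # The next token always comes from the full model itself, which is why the
  # result matches ordinary one-token-at-a-time decoding.
  if [ "$#" -gt 0 ]; then
    accepted="$accepted $1"
  fi
  echo "${accepted# }"
}

# Usage:
#   verify_draft "the cat sat on" "the cat lay down there"
```

In that usage example two drafted tokens are accepted and the third is replaced by the full model's choice, so one verification pass yields three tokens instead of one.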

KV-cache reuse (v0.30.8) is the quieter win. Before, sending a follow-up message in a long chat re-processed the entire prompt history (the prefill step) every turn. With improved prompt caching, an unchanged prefix is reused, so in a multi-turn conversation the second and later turns only prefill the new message before generating. The bigger your system prompt and the longer your chat, the more time-to-first-token you save: on a long coding session with a 4K-token system prompt, that’s the difference between a visible pause and a near-instant reply on every turn.

You won’t see a flag for this. The way to confirm it’s helping is crude but honest: time two identical follow-up prompts in the same session. The second should start streaming noticeably sooner because the shared prefix is already cached.
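That comparison can be made less subjective by reading the prompt-processing time out of Ollama's API response instead of watching the stream. A sketch, assuming a non-streaming response that carries the `prompt_eval_duration` field (nanoseconds) and that this field reflects the cache hit; `prefill_ms` is a hypothetical helper:

```shell
# Reads one non-streaming Ollama API response (JSON) on stdin and prints the
# prompt-processing time in milliseconds.
prefill_ms() {
  sed -n 's/.*"prompt_eval_duration": *\([0-9][0-9]*\).*/\1/p' |
    awk '{ printf "%.1f\n", $1 / 1000000 }'   # nanoseconds to milliseconds
}

# Hypothetical usage against a local server, sending the same request twice:
#   req='{"model":"gemma4:12b-it-qat","prompt":"<long prompt>","stream":false}'
#   curl -s http://localhost:11434/api/generate -d "$req" | prefill_ms
#   curl -s http://localhost:11434/api/generate -d "$req" | prefill_ms
```

If prefix caching is doing its job, the second number should come out well below the first.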

Real numbers, and the ceiling that didn’t move

Here’s what to actually expect, because “2× faster” is only true in specific places:

Mac / model	Backend	Decode	Notes
M4 Max, Qwen3.5-35B-A3B int4	MLX	~112 tok/s	vs ~58 tok/s on the old Metal path (~93% gain)
M4 Max, optimized 7B	MLX	~230 tok/s	small models show MLX’s biggest lead
M3 Ultra, Gemma 4 27B Q4_K_M	MLX	~30–42 tok/s	prefill ~700–900 tok/s
M3 Ultra, Qwen3.6 30B-A3B	MLX	>80 tok/s	MoE sparsity (3B active) is why it’s 2× the dense 27B

The pattern worth internalizing: MLX leads llama.cpp by roughly 10–25% on most models and by 21–87% on small ones, but that advantage collapses at 27B+ dense, where both engines saturate the same memory bandwidth and converge. So the MLX upgrade helps most if you live in the 7B–14B range or run MoE models with low active-parameter counts. If you’re running a dense 32B at Q4, the engine swap barely registers: you’re bandwidth-bound, and no software change fixes that. This is the same wall that keeps a used RTX 3090’s 936 GB/s ahead of a Mac on raw tok/s for models that fit in 24GB.
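The bandwidth wall reduces to one division. As a back-of-the-envelope sketch (a common rule of thumb, not a figure from Ollama): a dense model has to stream roughly all of its weights from memory for every generated token, so bandwidth divided by weight size gives an upper bound that no engine can beat. `decode_ceiling` is a hypothetical helper.

```shell
# Rough upper bound on dense-model decode speed in tok/s.
#   $1 = memory bandwidth in GB/s
#   $2 = weights read per token in GB (about the model size for a dense model)
decode_ceiling() {
  awk -v bw="$1" -v gb="$2" 'BEGIN { printf "%.0f\n", bw / gb }'
}

# Usage: the 936 GB/s figure from the text with an ~18 GB model
#   decode_ceiling 936 18
```

That call prints 52, and real decode speeds land below the bound. For an MoE model the second argument is closer to the size of the active parameters than the full model, which is consistent with the 3B-active row in the table outrunning the dense 27B.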

A problem you’ll probably hit, and the fix

The most common post-upgrade snag: you pull gemma4:26b-it-qat on a 16GB Mac, it loads, but ollama ps shows a CPU split and generation crawls. The model nominally fits at ~15GB, but macOS, your browser, and the KV cache all want memory too, so it spills. Fixes, in order: quit memory-hungry apps; drop to gemma4:12b-it-qat (~7GB), which leaves real headroom on 16GB; or cap the context so the KV cache stays small:

$ OLLAMA_CONTEXT_LENGTH=4096 ollama serve
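To see why capping the context frees real memory, it helps to estimate the KV cache directly. The formula below is the standard one for a plain full-attention transformer. The architecture numbers in the usage lines are hypothetical placeholders, not Gemma 4's real configuration, and models with sliding-window or shared KV layers will use less. `kv_cache_gib` is a hypothetical helper.

```shell
# KV-cache size in GiB:
#   2 (keys and values) x layers x kv_heads x head_dim x context x bytes/element
#   $1 = layers, $2 = KV heads, $3 = head dimension, $4 = context length,
#   $5 = bytes per element (assumed default: 2, i.e. 16-bit cache)
kv_cache_gib() {
  awk -v l="$1" -v h="$2" -v d="$3" -v ctx="$4" -v b="${5:-2}" \
    'BEGIN { printf "%.2f\n", 2 * l * h * d * ctx * b / (1024 * 1024 * 1024) }'
}

# Usage with made-up architecture numbers (48 layers, 8 KV heads, head dim 128):
#   kv_cache_gib 48 8 128 4096
#   kv_cache_gib 48 8 128 32768
```

With those placeholder numbers the cache is 0.75 GiB at a 4096-token context and 6.00 GiB at 32768. On a 16GB Mac holding a ~15GB model, that is the gap between fitting and spilling.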

If you genuinely need a 26B-class model at full context and your Mac can’t hold it, that’s the signal to either move up to a 32GB+ machine or rent a GPU by the hour rather than buy — RunPod runs the same Ollama stack on a cloud 24GB card for a few cents per session, which is cheaper than a memory upgrade if it’s occasional.

FAQ

Do I have to do anything to “turn on” MLX in v0.30? No. On a 32GB+ Apple Silicon Mac it’s the default. Below 32GB you get llama.cpp Metal automatically. Check ollama ps for 100% GPU to confirm the model is fully on the GPU; the 32GB floor decides which engine serves it.

Is v0.30 faster on my NVIDIA or AMD PC too? The MLX engine is Apple-only, but v0.30.0 also shipped faster NVIDIA performance and broader GGUF support via llama.cpp, so non-Mac users benefit from the same line — just not from MLX specifically.

Should I upgrade from the spring MLX preview? Yes. The preview lacked Gemma 4 QAT tags, MTP speculative decoding, and KV-cache reuse — the three things that make v0.30 feel different in daily use. The upgrade is non-breaking.

Will hand-quantizing Gemma 4 QAT myself save space? No — it reintroduces the accuracy loss QAT prevents because of a BF16/F16 scale mismatch. Use the official gemma4:*-it-qat tags.

Which model should I actually run on a 16GB Mac? gemma4:12b-it-qat (~7GB) is the comfortable pick. gemma4:26b-it-qat (~15GB) works but leaves almost no headroom — only run it if you close everything else.

Want the coding-tool side of this? If you’re wiring a local model into an editor, see our sister site’s breakdown of local-LLM coding setups on aicoderscope.com for which models hold up as agents.

Last updated June 21, 2026. Versions and benchmarks change; verify current ollama --version and your own ollama ps output before relying on these numbers.
