GMKtec EVO-X2 Review 2026: A Sub-$2,000 Mini PC That Runs 235B Models on Ryzen AI Max+ 395

TL;DR: The GMKtec EVO-X2 puts 128GB of unified LPDDR5X behind an AMD Ryzen AI Max+ 395, so it loads models that physically don’t fit on any single consumer GPU — Qwen3-235B (22B active) runs at ~11 tok/s. The catch is bandwidth: at 256 GB/s it’s ~3.6× slower than a used RTX 3090, so anything under 24GB runs faster on a real GPU. Buy it for capacity, not speed.

| | GMKtec EVO-X2 128GB | Used RTX 3090 build | Mac Studio M4 Max 64GB |
| --- | --- | --- | --- |
| Best for | 100B–235B MoE models in memory | ≤24GB models at top speed | 30–70B, silent, low power |
| Price (Jun 2026) | $1,999–$2,199 (complete) | ~$1,400 (complete build) | ~$2,200 (complete) |
| Memory bandwidth | 256 GB/s | 936 GB/s | 546 GB/s |
| The catch | Bandwidth-bound; ~5 tok/s on dense 70B | Hard 24GB VRAM ceiling | 64GB max at this price |

Honest take: If your goal is running a 100B+ MoE model at home for under $2,000, nothing in this class touches the EVO-X2. If you mostly run 7B–32B models, a used RTX 3090 build is cheaper and roughly 3× faster on tokens/sec.

What you actually get for the money

The GMKtec EVO-X2 is a complete mini PC built around the AMD Ryzen AI Max+ 395 — the “Strix Halo” APU with 16 Zen 5 cores and a Radeon 8060S integrated GPU. The chip itself is the same one we covered in our Strix Halo deep dive; this review is about the specific machine you can put on your desk and what its price-to-capacity ratio means for a home lab.

GMKtec sells three configurations as of June 2026:

| Config | Memory | SSD | Price |
| --- | --- | --- | --- |
| Base | 64GB LPDDR5X | 1TB | $1,499 |
| Mid | 96GB LPDDR5X | 2TB | $1,799 |
| Max | 128GB LPDDR5X | 2TB | $1,999–$2,199 |

The memory is soldered LPDDR5X-8000 across an 8-channel (256-bit) bus. There are no DIMM slots, so the RAM you buy is the RAM you keep. That’s the single most important purchase decision: for local AI, buy the 128GB model or don’t bother. The 64GB config is a fine gaming and general-work box, but the whole reason to choose Strix Halo over a discrete GPU is the ability to allocate up to 96GB as graphics memory and load models that don’t fit anywhere else. At roughly $21 per GB of GPU-accessible memory ($1,999 for a 96GB allocation), that is the cheapest large unified-memory pool on the x86 market right now.
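The 256 GB/s figure quoted throughout this review follows directly from the memory spec above. A minimal sketch of that arithmetic, where the function name is ours and the result is a theoretical peak, not a measured number:

```python
def peak_bandwidth_gb_s(transfer_rate_mt_s: float, bus_width_bits: int) -> float:
    """Theoretical peak memory bandwidth in GB/s (decimal gigabytes)."""
    bytes_per_transfer = bus_width_bits / 8           # 256-bit bus -> 32 bytes per transfer
    transfers_per_second = transfer_rate_mt_s * 1e6   # MT/s -> transfers per second
    return transfers_per_second * bytes_per_transfer / 1e9


# LPDDR5X-8000 on a 256-bit bus, as specified for the EVO-X2
evo_x2_peak = peak_bandwidth_gb_s(8000, 256)
print(f"EVO-X2 theoretical peak: {evo_x2_peak:.0f} GB/s")
```

Real-world bandwidth lands below this peak, so treat the result as a ceiling.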

Connectivity is generous for the form factor: dual USB4, Wi-Fi 7, an SD 4.0 card reader, and quad 8K display output. None of that matters for headless inference, but it does mean the machine doubles as a workstation when you’re not running models.

Tokens per second, by model tier

Bandwidth is the whole story with Strix Halo, and the EVO-X2 is no exception. LLM decode speed is bound by how fast the chip can stream model weights out of memory, not by raw compute. At 256 GB/s theoretical bandwidth — and real-world figures lower than that — the EVO-X2 lands well behind any discrete GPU on a per-token basis but makes up for it on sheer capacity.

Here’s what community and review benchmarks report on the 128GB EVO-X2:

| Model | Type | Tokens/sec |
| --- | --- | --- |
| 7B–13B (Q4/Q6) | Dense | 30–45 tok/s |
| 30B-class MoE | Sparse (~3B active) | 70–100 tok/s |
| Llama 3.3 70B (Q6_K) | Dense | ~5 tok/s |
| Qwen3-235B-A22B | Sparse (22B active) | ~11 tok/s |

The most instructive line in that table is the one that looks backwards: the 235-billion-parameter Qwen3 model runs more than twice as fast as the 70-billion-parameter Llama. That’s not a typo. Qwen3-235B is a Mixture-of-Experts model — 235B total parameters, but only about 22B are activated per token. Dense Llama 3.3 70B activates all 70B every single token. On a bandwidth-starved machine, the active parameter count is what sets your speed, so a sparse 235B beats a dense 70B handily.
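To see why active parameters set the pace, here is a rough back-of-the-envelope sketch. It assumes every active weight is read from memory once per token and ignores KV-cache reads, routing overhead, and the gap between theoretical and achieved bandwidth. The 4.5 bits-per-weight figure is an assumption chosen for illustration, not the quantization used in the benchmarks above, and the function name is ours.

```python
def decode_ceiling_tok_s(bandwidth_gb_s: float,
                         active_params_b: float,
                         bits_per_weight: float) -> float:
    """Upper bound on decode speed if each active weight is read once per token."""
    # billions of parameters * bits per weight / 8 = gigabytes streamed per token
    gb_read_per_token = active_params_b * bits_per_weight / 8
    return bandwidth_gb_s / gb_read_per_token


BANDWIDTH_GB_S = 256    # theoretical peak quoted in this review
BITS_PER_WEIGHT = 4.5   # ASSUMPTION: illustrative quantization, same for both models

dense_70b = decode_ceiling_tok_s(BANDWIDTH_GB_S, 70, BITS_PER_WEIGHT)
moe_22b_active = decode_ceiling_tok_s(BANDWIDTH_GB_S, 22, BITS_PER_WEIGHT)
ratio = moe_22b_active / dense_70b

print(f"dense 70B ceiling:       {dense_70b:.1f} tok/s")
print(f"22B-active MoE ceiling:  {moe_22b_active:.1f} tok/s")
print(f"MoE advantage at equal quantization: {ratio:.2f}x")
```

At equal quantization the ratio reduces to 70/22, so the bandwidth and bit-width cancel out. Measured speeds sit below these ceilings and the measured gap is narrower, so read this as an explanation of the ordering, not a prediction of exact numbers.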

This reframes how you should pick models for the EVO-X2. Dense models above ~30B feel sluggish — readable, but you’ll be waiting. MoE models are where the machine shines, because the 128GB pool holds the full expert set while only the active slice streams per token. If you came here to run 70B dense models fast, this is the wrong machine. If you came to run modern MoE models that don’t fit a GPU at all, it’s close to ideal.

For context, human reading speed is roughly 7–10 tokens/sec, so the 30B-MoE range (70–100 tok/s) feels instant, the 235B MoE (~11 tok/s) feels like a fast typist, and dense 70B (~5 tok/s) is genuinely slow for interactive chat but fine for batch jobs you walk away from.

Power draw and running cost

The EVO-X2 is frugal in a way no discrete GPU tower can match. Reviewers measured idle draw in the 8–14W range, with one putting it nearer 22W. Under a sustained Llama 3.3 70B Q6_K inference load it pulls 147–160W; gaming pushes it to 170–180W.

Put that against electricity. At the US average of about $0.12/kWh, running inference at 160W for an hour costs roughly $0.019 — call it two cents. Leave it idling 24/7 at 12W and you’re looking at about $0.035 per day, or near a dollar a month. A discrete RTX 3090 build idles higher and peaks at 285W+ under load, so the EVO-X2 is the cheaper machine to leave running as an always-on home AI server. We worked through the full 24/7 server math in our power bill breakdown — the short version is that idle power, not peak, dominates a year of always-on use, and the EVO-X2 wins on idle decisively.
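The running-cost figures above are simple enough to rerun with your own electricity rate. A small sketch, using the wattages and the $0.12/kWh rate from this section; the function name and the 30-day month are our own simplifications:

```python
def energy_cost_usd(watts: float, hours: float, usd_per_kwh: float = 0.12) -> float:
    """Electricity cost of drawing `watts` continuously for `hours`."""
    kwh = watts / 1000 * hours
    return kwh * usd_per_kwh


load_hour = energy_cost_usd(160, 1)    # one hour of sustained inference
idle_day = energy_cost_usd(12, 24)     # one day idling at 12W
idle_month = idle_day * 30             # ASSUMPTION: 30-day month

print(f"inference, 1 hour: ${load_hour:.3f}")
print(f"idle, 1 day:       ${idle_day:.3f}")
print(f"idle, 30 days:     ${idle_month:.2f}")
```

Swap in your local rate as the third argument; the idle figure scales linearly with it.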

EVO-X2 vs a discrete GPU build

This is the comparison most buyers actually face. A complete build around a used RTX 3090 — card at roughly $1,070, plus a basic host — lands near $1,400 and gives you 936 GB/s of bandwidth and ~95 tok/s on a 7B model. That’s roughly 3× the EVO-X2’s small-model speed. For anyone whose workload lives under 24GB — 7B–14B chat, code completion, image generation — the discrete card is faster and cheaper.

The 3090 hits a wall the EVO-X2 doesn’t: 24GB. The moment your model needs 30GB, 60GB, or 120GB, a single 3090 is out, and you’re into multi-GPU territory with its own cost and complexity. The EVO-X2’s 96GB allocatable pool clears that wall in a box that, at idle, draws less than a single 3090 does. An RTX 5060 Ti 16GB build is cheaper still but caps out even sooner.

So the decision is clean:

  • Mostly run models that fit in 24GB? Discrete GPU. Faster, cheaper, done.
  • Need 100B+ MoE models in memory, in one quiet low-power box? EVO-X2.
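The rule above can be written down as a tiny helper. This is only a sketch of the article's decision rule: the 24GB and 96GB thresholds come from this review, while the function name, the third "neither" branch, and the idea of passing in a single footprint number (weights plus whatever context headroom you need) are our assumptions.

```python
GPU_VRAM_GB = 24       # RTX 3090 VRAM ceiling, from this review
EVO_X2_POOL_GB = 96    # maximum graphics-memory allocation, from this review


def pick_machine(model_footprint_gb: float) -> str:
    """Map a model's memory footprint to the machine this review recommends."""
    if model_footprint_gb <= GPU_VRAM_GB:
        return "discrete GPU"      # fits in VRAM: faster and cheaper
    if model_footprint_gb <= EVO_X2_POOL_GB:
        return "EVO-X2"            # too big for 24GB, fits the unified pool
    # ASSUMPTION: beyond the 96GB pool, neither single box is enough
    return "neither (multi-GPU or cloud)"


for footprint in (8, 40, 90):
    print(f"{footprint}GB -> {pick_machine(footprint)}")
```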

For the Apple-shaped alternative, a Mac Studio M4 Max brings 546 GB/s — more than double the EVO-X2’s bandwidth — but tops out at 64GB at a similar price, and Apple has pulled its highest-RAM Studio configs. The EVO-X2 trades bandwidth for capacity; the Mac trades capacity for bandwidth. We compared the Apple side directly in our Mac Studio vs Mac Mini piece.

EVO-X2 vs DGX Spark

NVIDIA’s DGX Spark is the other “AI in a box” people cross-shop, and on paper it should win — it’s an NVIDIA platform with CUDA. In practice, for the kind of large-MoE inference the EVO-X2 targets, the Spark is bandwidth-limited too: a single unit decodes Llama 70B at around 2.7 tok/s. The EVO-X2 runs the much larger Qwen3-235B at ~11 tok/s for less money. The Spark’s advantages are real — CUDA toolchain maturity, NVLink-style clustering, and fine-tuning throughput (it tops 5,000 tok/s on a 70B QLoRA run) — but for a home lab that mostly does inference, the EVO-X2 delivers more usable tokens per dollar. We’re publishing a full head-to-head on the $3K–$4K developer-kit segment next; the short version is that the Spark earns its premium only if you fine-tune or live inside the CUDA ecosystem.

Cloud break-even: when does the box pay for itself?

If you’re currently renting GPUs for inference, the EVO-X2’s pitch is “stop paying rent.” The honest framing is a break-even calculation, not a blanket “local is cheaper.”

Cloud GPU rental on a service like RunPod runs roughly $0.30–$1.90 per hour depending on the card. If you rent for two hours a day at $1.00/hour, that’s about $60/month. A $1,999 EVO-X2 pays for itself against that habit in roughly 33 months, which is slow. But push usage to a real workload, say four hours a day on a heavier card at $1.50/hour, and you’re at $180/month, with break-even inside a year. The math flips hard once you’re running near-continuously or paying per-token API rates for a model you could host yourself.
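To plug in your own rental habit, the break-even arithmetic looks like this. It is a simplified sketch: it assumes a 30-day month and ignores the EVO-X2's electricity cost and resale value, and the function name is ours.

```python
def break_even_months(hardware_usd: float,
                      hours_per_day: float,
                      usd_per_hour: float,
                      days_per_month: int = 30) -> float:
    """Months of cloud rental needed to equal the hardware purchase price."""
    monthly_rent = hours_per_day * usd_per_hour * days_per_month
    return hardware_usd / monthly_rent


light_use = break_even_months(1999, hours_per_day=2, usd_per_hour=1.00)   # $60/month
heavy_use = break_even_months(1999, hours_per_day=4, usd_per_hour=1.50)   # $180/month

print(f"2 h/day at $1.00/h: {light_use:.1f} months")
print(f"4 h/day at $1.50/h: {heavy_use:.1f} months")
```

The result is inversely proportional to both hours and hourly rate, which is why doubling either one halves the payback period.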

The other side of the ledger: cloud gives you H100-class bandwidth and elasticity the EVO-X2 can’t. If you need 100+ tok/s on a 70B dense model, or burst capacity for an hour a week, rent it — owning a bandwidth-limited box won’t make those jobs fast. The EVO-X2 wins for steady, all-day inference of large MoE models where its capacity, low idle power, and zero per-token cost compound over months. For coding-tool subscriptions specifically, weigh it against the managed options we track over at aicoderscope.com.

The honest limitations

Three things to know before you buy:

Bandwidth is the ceiling, and it’s fixed. No BIOS setting or driver update turns 256 GB/s into 936 GB/s. Dense models above 30B will always be slow on this machine. Plan your model choices around MoE.

The software stack is improving but still AMD. Ollama and llama.cpp with the Vulkan backend work well; ROCm support on Strix Halo has come a long way but still trails NVIDIA’s CUDA for breadth of tooling. Most inference “just works,” but exotic training setups and some custom nodes will need fiddling. Our NPU vs GPU breakdown covers why the integrated NPU doesn’t yet accelerate llama.cpp-style inference — the iGPU does the work.
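If you want to check tokens/sec on your own unit, Ollama's local HTTP API reports the generated token count and decode time with each non-streamed response. A minimal sketch, assuming Ollama is running on its default port; the helper names are ours, and the model tag in the usage comment is a placeholder for whatever you have pulled.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint


def tokens_per_second(eval_count: int, eval_duration_ns: int) -> float:
    """Decode speed from Ollama's eval_count and eval_duration (nanoseconds)."""
    return eval_count / (eval_duration_ns / 1e9)


def measure_decode_speed(model: str, prompt: str) -> float:
    """Run one non-streamed generation and return the decode speed in tok/s."""
    payload = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    request = urllib.request.Request(
        OLLAMA_URL,
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        result = json.load(response)
    return tokens_per_second(result["eval_count"], result["eval_duration"])


# Usage (needs a running Ollama server; the model tag is a placeholder):
# print(measure_decode_speed("your-model-tag", "Explain memory bandwidth in one paragraph."))
```

A single short prompt gives a noisy number, so average a few longer generations before comparing against the table earlier in this review.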

Memory is soldered. You cannot upgrade later. Buy for the models you want to run in two years, not just today, because there’s no adding a stick of RAM down the line.

Who should buy it

Buy the GMKtec EVO-X2 128GB if you want a single, quiet, low-power box that runs 100B–235B MoE models locally for under $2,000, and you’ve accepted that small-model speed isn’t its job. It’s the cheapest way onto large open-weight MoE models without a multi-GPU server, and the idle power makes it a sensible always-on home AI server.

Skip it if your workload lives under 24GB — a used RTX 3090 build is cheaper and roughly 3× faster. And skip it if you fine-tune heavily or depend on CUDA-only tooling, where a discrete NVIDIA card or a DGX Spark earns its keep. For a broader look at where these mini PCs fit, see our mini PC for local LLMs guide and the Computex 2026 hardware roundup.

FAQ

Can the EVO-X2 really run a 235B-parameter model? Yes, on the 128GB configuration. Qwen3-235B is a Mixture-of-Experts model with 22B active parameters, and it runs at about 11 tokens/sec. The full model loads into the unified memory pool; only the active expert slice streams per token, which is why it’s faster than a dense 70B.

Why is a 70B model slower than a 235B model on this machine? Decode speed is bound by memory bandwidth and the number of active parameters per token. Dense Llama 3.3 70B activates all 70B every token (~5 tok/s); Qwen3-235B activates only ~22B (~11 tok/s). On bandwidth-limited hardware, sparse MoE beats dense.

Is the EVO-X2 faster than a used RTX 3090? No, not on tokens/sec for models that fit in 24GB. The 3090’s 936 GB/s gives it roughly 3× the small-model throughput. The EVO-X2 wins only when the model is too large for the 3090’s 24GB VRAM.

Which configuration should I buy for local AI? The 128GB model. The memory is soldered and non-upgradeable, and the entire point of Strix Halo is the large unified-memory pool. The 64GB version is fine for gaming but defeats the purpose for large-model inference.

How much does it cost to run 24/7? Idle draw is about 8–14W, so leaving it on continuously costs roughly a dollar a month in electricity at $0.12/kWh. Under inference load it pulls 147–160W.

Last updated June 23, 2026. Prices and specs change; verify current rates before purchasing.