Jun 24, 2026

AMD Ryzen AI Halo vs NVIDIA DGX Spark 2026: Which 128GB AI Dev Kit Actually Pays Off

By RunAIHome Team · 11 min read

amd · nvidia · ryzen-ai-max · dgx-spark · strix-halo · local-llm · hardware · benchmark

TL;DR: Both boxes carry 128GB of unified memory and both are bandwidth-bound on decode, so token generation is nearly tied: ~34 tok/s (Halo) vs ~39 tok/s (Spark) on gpt-oss 120B. DGX Spark wins prompt processing by roughly 5× and brings CUDA; AMD’s Halo gives 2TB of storage and Linux-first ROCm at $3,999, the Founders Edition’s launch price. For pure inference, a cheaper Strix Halo box wins.

Spec	AMD Ryzen AI Halo	NVIDIA DGX Spark (Founders)	ASUS Ascent GX10
Price (Jun 2026)	$3,999 (2TB)	$3,999 → $4,699* (4TB)	$2,999 (1TB)
Chip	Ryzen AI Max+ 395	GB10 Grace Blackwell	GB10 Grace Blackwell
Memory / bandwidth	128GB LPDDR5X / 256 GB/s	128GB LPDDR5x / 273 GB/s	128GB LPDDR5x / 273 GB/s
Token gen (gpt-oss 120B)	~34 tok/s	~39 tok/s	~39 tok/s
Prompt processing	~340 tok/s	~1,720 tok/s	~1,720 tok/s
Software stack	ROCm / Vulkan / Ollama	CUDA (full ecosystem)	CUDA (full ecosystem)
The catch	Weaker prefill, ROCm maturity	Memory-shortage price hike	Only 1TB SSD

*NVIDIA raised the Founders Edition from $3,999 to $4,699 in 2026 as memory prices spiked.

Honest take: If you only run inference on MoE models, both are bandwidth-bound and roughly tied on the tokens you actually feel — so the cheaper Strix Halo box (like the GMKtec EVO-X2 at ~$1,999) is the smarter buy. Pay the DGX Spark premium only if you fine-tune, write CUDA, or run prefill-heavy agentic workloads.

The matchup isn’t $3,999 vs $2,999

The headline that’s been circulating — AMD’s $3,999 dev kit “tackling” a $2,999 DGX Spark — quietly compares two different things. NVIDIA’s own DGX Spark Founders Edition launched at $3,999 with a 4TB SSD, and as of 2026 NVIDIA raised it to $4,699 because of the same memory-price spike that’s hitting DDR5 and SSD buyers. The $2,999 figure belongs to the ASUS Ascent GX10, an OEM variant of the DGX Spark with the same GB10 Grace Blackwell Superchip and 128GB of memory — but only 1TB of storage instead of 4TB.

So the real picture, dollar for dollar, is:

AMD Ryzen AI Halo — $3,999, Ryzen AI Max+ 395, 128GB, 2TB SSD
DGX Spark (ASUS GX10) — $2,999, GB10, 128GB, 1TB SSD
DGX Spark (Founders) — $3,999 (now $4,699), GB10, 128GB, 4TB SSD

AMD slots its dev kit between the two NVIDIA configs on price, gives you double the storage of the cheap Spark, and matches the Founders Edition exactly at $3,999 before NVIDIA’s hike. That’s the framing AMD wants: same money, more SSD, and — they claim — leadership tokens-per-dollar.

What’s actually in each box

The AMD Ryzen AI Halo Developer Platform is built on the Ryzen AI Max+ 395 — the same “Strix Halo” APU in the GMKtec EVO-X2 and the chip we tore down in our Strix Halo deep dive. It pairs 16 Zen 5 cores (3.0GHz base, 5.1GHz boost) with a Radeon 8060S iGPU (40 RDNA 3.5 compute units) and an XDNA 2 NPU rated at 50 TOPS, for a platform total AMD quotes at 126 TOPS. Memory is 128GB of LPDDR5X-8000 on a 256-bit bus — 256 GB/s — and the kit ships with a 2TB PCIe 4 SSD, 10GbE LAN, Wi-Fi 7, four USB-C ports, and an aluminum chassis the size of a paperback (149 × 149 × 43mm). Pre-orders run through Micro Center, with availability around July 10, 2026, and you pick Linux or Windows at no price difference — a tell that AMD is aiming this squarely at developers.
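
The 256 GB/s figure follows directly from the memory spec quoted above. A minimal sketch of that arithmetic, assuming the usual peak-bandwidth formula (transfer rate times bus width in bytes); the function name is illustrative only:

```python
def peak_bandwidth_gbs(transfer_rate_mts: float, bus_width_bits: int) -> float:
    """Theoretical peak memory bandwidth in GB/s (1 GB = 1e9 bytes)."""
    bytes_per_transfer = bus_width_bits / 8
    return transfer_rate_mts * 1e6 * bytes_per_transfer / 1e9


# LPDDR5X-8000 on a 256-bit bus, as quoted for the Ryzen AI Halo
halo_bw = peak_bandwidth_gbs(8000, 256)
print(f"{halo_bw:.0f} GB/s")  # 256 GB/s
```

This is a theoretical peak; sustained bandwidth in real workloads is lower.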

The NVIDIA DGX Spark is a different animal under the hood. Its GB10 Grace Blackwell Superchip glues a 20-core Arm CPU (10 Cortex-X925 + 10 Cortex-A725) to a Blackwell GPU that NVIDIA rates at up to 1 petaFLOP of sparse FP4 tensor performance. Memory is also 128GB of unified LPDDR5x, but at a slightly higher 273 GB/s. It runs DGX OS (Ubuntu-based) and, crucially, the full CUDA stack.

The spec sheets converge on the thing that matters most for local LLMs: 128GB of unified memory on both. That’s enough to hold models no single 24GB consumer GPU can — gpt-oss 120B, Qwen3-235B at lower quant, dense 70B with room to spare. The question is how fast each one actually moves tokens through that memory.

Token generation: nearly a tie

Here’s the result that surprises people. On token generation — the decode phase, where the model streams one token at a time and you watch words appear — the two boxes are within a few tokens per second of each other.

On gpt-oss 120B (the MoE model both vendors lean on for demos), independent testing puts the Ryzen AI Max+ 395 at 34.13 tok/s against the DGX Spark’s 38.55 tok/s. That’s a 13% NVIDIA lead, not a generational gap. The reason is the same one that governs every box in this class: decode is memory-bandwidth-bound, not compute-bound. The GB10’s 273 GB/s and Strix Halo’s 256 GB/s are within 7% of each other, so the decode rates they can sustain on the same model land close together; the rest of the measured 13% gap presumably comes from software and kernel efficiency rather than raw bandwidth. NVIDIA’s enormous FP4 compute advantage goes largely unused during decode.
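
To see how much of the decode gap bandwidth alone accounts for, the two ratios can be compared directly. This is a quick sketch using only the figures quoted in this article; `pct_lead` is a helper name made up for the example:

```python
def pct_lead(a: float, b: float) -> float:
    """How far a is ahead of b, in percent."""
    return (a / b - 1) * 100


# Figures quoted in this article
halo_bw, spark_bw = 256.0, 273.0      # memory bandwidth, GB/s
halo_tps, spark_tps = 34.13, 38.55    # gpt-oss 120B decode, tok/s

bandwidth_lead = pct_lead(spark_bw, halo_bw)
decode_lead = pct_lead(spark_tps, halo_tps)

print(f"Bandwidth lead: {bandwidth_lead:.1f}%")
print(f"Decode lead:    {decode_lead:.1f}%")
```

The bandwidth lead comes out just under 7% and the decode lead just under 13%, so bandwidth explains roughly half of the measured gap.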

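Per-model llama.cpp numbers fill in the rest of the picture on the DGX Spark side. They come from a different set of runs than the headline figures, which is why the gpt-oss 120B row reads lower: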

Model (DGX Spark)	Prompt processing	Token generation
gpt-oss 20B (MXFP4)	~2,000 tok/s	~60 tok/s
gpt-oss 120B (MXFP4)	~1,200 tok/s	~35 tok/s
Qwen3-Coder 30B (Q8_0)	~1,650 tok/s	~44 tok/s
Llama 3.3 70B (Q8_0, dense)	low	~2.6 tok/s

On the AMD side, gpt-oss 20B on the Ryzen AI Max+ 395 lands around 30–33 tok/s generation with roughly 400 tok/s prompt processing. Notice the pattern across both platforms: MoE models (gpt-oss, Qwen3-30B-A3B) fly because only a few billion parameters activate per token, while dense 70B craters to ~2.6 tok/s on the Spark — the same single-digit decode we measured in our GMKtec EVO-X2 review. Neither of these machines is a good dense-70B box. Both are MoE boxes that happen to have enough memory for big models.
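
The dense-70B result is what a bandwidth-bound model predicts. As a rough sketch, the decode ceiling is memory bandwidth divided by the bytes that must be read per token. The 70 GB weight size below is an assumption (about 1 byte per parameter at Q8_0, ignoring quantization-scale overhead and KV-cache traffic), not a measured value:

```python
def decode_ceiling_tps(bandwidth_gbs: float, gb_read_per_token: float) -> float:
    """Upper bound on decode speed when every token streams the weights once."""
    return bandwidth_gbs / gb_read_per_token


# Assumption: dense 70B at ~1 byte/param reads ~70 GB of weights per token
DENSE_70B_Q8_GB = 70.0

for name, bw in (("DGX Spark", 273.0), ("Strix Halo", 256.0)):
    ceiling = decode_ceiling_tps(bw, DENSE_70B_Q8_GB)
    print(f"{name}: at most ~{ceiling:.1f} tok/s on dense 70B")
```

Under these assumptions both ceilings sit below 4 tok/s, so the measured ~2.6 tok/s is close to what the hardware allows. MoE models avoid this limit because only the active experts' weights are read for each token.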

For reference, human reading speed is about 7–10 tok/s, so anything in the 30–60 tok/s range feels comfortably interactive, while dense 70B at ~2.6 tok/s is a “start it and walk away” experience on either platform.

Prompt processing: where NVIDIA earns its badge

The gap that actually separates these two boxes isn’t decode — it’s prefill, the prompt-processing phase that runs before the first token appears. On gpt-oss 120B, the DGX Spark processes the prompt at roughly 1,723 tok/s versus Strix Halo’s 339.87 tok/s — about 5× faster. Prefill is compute-bound, and this is exactly where Blackwell’s tensor cores and the mature CUDA kernels do their work while Strix Halo’s RDNA 3.5 iGPU falls behind.

This matters more than the spec-sheet symmetry suggests, and it maps directly to your workload:

Short prompts, chat, casual coding autocomplete → prefill is a rounding error. The two boxes feel identical.
Long context: RAG over big documents, full-repo code analysis, 32K+ token agentic loops → prefill dominates time-to-first-token. At the measured rates, a prompt of roughly 5,000 tokens takes about 3 seconds on the Spark and about 15 seconds on Strix Halo, every turn. Here the DGX Spark pulls clearly ahead.
Fine-tuning / training → not a contest. CUDA, cuDNN, and the entire PyTorch training ecosystem run first-class on the Spark; a QLoRA pass on Llama 3.3 70B has been measured at over 5,000 tok/s throughput on the GB10. ROCm fine-tuning on Strix Halo works but is rougher and slower.
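
To gauge which of these workload profiles applies to you, prefill time can be estimated from prompt length. The sketch below uses the gpt-oss 120B prefill rates quoted above and assumes they stay constant as the prompt grows; in practice throughput usually drops at long context, so treat the results as optimistic:

```python
# Prefill rates on gpt-oss 120B, as quoted in this article (tok/s)
SPARK_PREFILL_TPS = 1723.0
HALO_PREFILL_TPS = 339.87


def prefill_seconds(prompt_tokens: int, prefill_tps: float) -> float:
    """Seconds spent processing the prompt before the first token appears."""
    return prompt_tokens / prefill_tps


for prompt_tokens in (500, 5_000, 32_000):
    spark = prefill_seconds(prompt_tokens, SPARK_PREFILL_TPS)
    halo = prefill_seconds(prompt_tokens, HALO_PREFILL_TPS)
    print(f"{prompt_tokens:>6} tokens: Spark {spark:5.1f} s, Halo {halo:5.1f} s")
```

At chat-sized prompts the difference is well under two seconds. At 32K tokens the estimate is about 19 seconds on the Spark against about 94 seconds on Strix Halo.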

ROCm vs CUDA: the real tax

The benchmark you can’t put in a table is software friction. NVIDIA is selling a decade of CUDA momentum — PyTorch wheels that just work, every inference runtime tested on it first, Docker images, NIM microservices, forum answers for every error. If your workflow touches custom CUDA kernels, vLLM tensor-parallel, NeMo, or any “pip install and it runs” expectation, the Spark removes friction you’d otherwise spend evenings on.

AMD’s stack has genuinely improved. For pure inference, ROCm, Vulkan, llama.cpp, and Ollama all run gpt-oss and Qwen3 on Strix Halo today without drama — and in some single-batch llama.cpp tests the Vulkan backend even edges ahead of the Spark on raw decode. But step off the inference path into training or exotic libraries and you’ll feel the maturity gap. The honest framing: for inference, ROCm is fine in 2026; for everything else, CUDA still wins. If you’re choosing a box to run coding agents, our coverage of local coding stacks on aicoderscope.com is worth a read before you commit either way.

The “pays for itself in 6 months” claim

AMD markets the Halo as recovering its cost in roughly six months of cloud API savings. That math only holds if you’re genuinely running a heavy, sustained workload. Renting a comparable GPU on RunPod runs about $1/hr for a 24GB-class card; at 4 hours a day that’s ~$120/month, so a $3,999 box “pays back” in about 33 months, not six. To hit six months you’d need to offset about $670 of cloud spend every month: roughly 8 hours a day at about $2.80/hr, or around 22 hours a day at $1/hr. That describes a small team or a power user, not a hobbyist. The break-even is real, but it’s a workload claim, not a universal one. We worked through the full rent-vs-buy spreadsheet in RunPod vs local GPU.
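
The payback arithmetic is easy to rerun with your own numbers. This is a minimal sketch that assumes a 30-day month and ignores electricity, resale value, and cloud storage fees; both function names are illustrative:

```python
def breakeven_months(box_price: float, cloud_rate_per_hr: float,
                     hours_per_day: float, days_per_month: int = 30) -> float:
    """Months of avoided cloud rental needed to cover the hardware price."""
    monthly_cloud_cost = cloud_rate_per_hr * hours_per_day * days_per_month
    return box_price / monthly_cloud_cost


def required_hourly_rate(box_price: float, months: float,
                         hours_per_day: float, days_per_month: int = 30) -> float:
    """Cloud rate at which the box pays for itself in the given time."""
    return box_price / (months * hours_per_day * days_per_month)


print(f"{breakeven_months(3999, 1.0, 4):.1f} months at $1/hr, 4 h/day")
print(f"${required_hourly_rate(3999, 6, 8):.2f}/hr needed for 6 months at 8 h/day")
```

Plug in your actual daily usage and the rate of the instance you would otherwise rent; the six-month claim only appears once both are well above the hobbyist case.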

Who should buy which

Buy the DGX Spark (or ASUS GX10) if: you fine-tune models, write CUDA, run vLLM/NeMo, or your inference is long-context and prefill-heavy. The 5× prompt-processing lead and the CUDA ecosystem are worth real money to you. Take the $2,999 GX10 if 1TB is enough storage; take the Founders Edition only if you need the 4TB and can stomach the hiked price.

Buy the Ryzen AI Halo if: you want an official, Linux-first AMD developer platform with 2TB of storage, 10GbE, and ROCm support out of the box, and your work is inference plus light experimentation. It matches the Founders Edition on price and beats it on storage.

Buy neither — get a Strix Halo mini PC instead — if: you’re a home-lab user who mostly runs MoE inference. The GMKtec EVO-X2 has the same Ryzen AI Max+ 395 chip and 128GB for ~$1,999, roughly half the Halo dev kit’s price, and posts the same decode numbers. You’re paying $2,000 extra for the Halo’s official-dev-kit status, 10GbE, and AMD’s support commitment — worth it for a business, hard to justify for a hobbyist.

Buy neither of these, period — get a used GPU — if: every model you run fits in 24GB. A used RTX 3090 pushes 936 GB/s, about 3.5× the memory bandwidth of either 128GB box, and ~95 tok/s on a 7B model, for around $1,070. These mini-supercomputers only justify themselves once your models physically don’t fit on a real GPU.
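
Whether a model "fits in 24GB" can be estimated from its parameter count and quantization. This is a rough sketch: the 4 GB headroom for KV cache and activations is an assumed placeholder, and real quantized files carry some extra overhead on top of the nominal bits per weight:

```python
def weights_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight size in GB: parameters times bits, divided by 8."""
    return params_billion * bits_per_weight / 8


def fits_in_vram(params_billion: float, bits_per_weight: float,
                 vram_gb: float = 24.0, headroom_gb: float = 4.0) -> bool:
    """True if the weights plus an assumed runtime headroom fit in VRAM."""
    return weights_gb(params_billion, bits_per_weight) + headroom_gb <= vram_gb


# Parameter counts are nominal sizes; bit widths are nominal quantization levels
for label, params, bits in (("7B @ 4-bit", 7, 4),
                            ("70B @ 8-bit", 70, 8),
                            ("120B @ 4-bit", 120, 4)):
    verdict = "fits" if fits_in_vram(params, bits) else "does not fit"
    print(f"{label}: ~{weights_gb(params, bits):.1f} GB, {verdict} in 24 GB")
```

Anything that lands comfortably under the limit is a candidate for a single used GPU; the 128GB boxes only start to make sense for the rows that do not fit.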

FAQ

Is the DGX Spark really only $2,999? The $2,999 price is the ASUS Ascent GX10, an OEM DGX Spark with 1TB of storage. NVIDIA’s own Founders Edition launched at $3,999 (4TB) and was raised to $4,699 in 2026.

Which is faster for everyday local chat? Effectively a tie. On MoE models like gpt-oss 120B, token generation is ~34 tok/s (Halo) vs ~39 tok/s (Spark) — both bandwidth-bound and within ~13% of each other.

Can either run dense Llama 3.3 70B well? No. Dense 70B decodes at ~2.6 tok/s on the DGX Spark and similarly slowly on Strix Halo. Both shine on MoE models, not dense ones.

Does the Ryzen AI Halo run Linux? Yes — AMD ships it with Linux or Windows at no price difference, and it’s positioned as a Linux-first developer platform with ROCm support.

Should I just buy a GMKtec EVO-X2 instead? For pure inference, yes. It uses the same Ryzen AI Max+ 395 and 128GB for ~$1,999. The Halo dev kit’s premium buys official AMD developer support, 2TB storage, and 10GbE — valuable for businesses, less so for hobbyists.

Last updated June 24, 2026. Prices and specs change; verify current rates before purchasing.
