Jun 28, 2026

Microsoft Aion 1.0 on Windows 2026: The 14B On-Device Model and What Your Copilot+ PC Actually Needs

By RunAIHome Team · 14 min read

Tags: npu, windows, local-llm, copilot-plus-pc, aion

TL;DR: At Build 2026 Microsoft announced Aion 1.0, an in-box Windows model family whose headline part is Aion 1.0 Plan — a 14B reasoning and tool-calling model that runs agentic workflows fully on-device. The catch is the 40-TOPS NPU floor: a lot of “AI PCs” already in the wild don’t clear it, and even the ones that do generate tokens slower than a five-year-old discrete GPU. Aion is a background orchestrator, not a faster chatbot.

	Aion 1.0 on NPU	Ollama on used RTX 3090	Ollama on a Copilot+ iGPU
Best for	Always-on, low-power local agents in Windows apps	Fast interactive 7B–14B chat/coding	Mid-speed local LLM on the laptop you own
Speed (≈)	~3–5 tok/s on a 14B-class model	~95 tok/s on a 7B	~15–40 tok/s on 7B–14B
Power	~10–25 W (NPU)	~285–350 W (whole GPU)	~30–60 W (iGPU)
The catch	Needs 40-TOPS NPU; preview-only; AMD deferred	$1,000+ card + a desktop to put it in	Bandwidth-bound, throttles on battery

Honest take: Aion 1.0 matters because it ships free in Windows and runs offline with zero setup — not because it’s fast. If you want speed, a used RTX 3090 still buries any NPU on tokens/second. Buy a 40-TOPS Copilot+ laptop for battery-friendly background AI; keep a discrete GPU for anything interactive.

What Microsoft actually announced

At Build 2026 on June 2, Microsoft used its “Windows as the trusted platform for development” keynote to introduce Aion 1.0, a family of on-device small language models baked into Windows 11. Satya Nadella framed the goal as “deliver unmetered intelligence to every home and every desk” — which, stripped of the keynote gloss, means Microsoft wants local inference to be a default Windows capability, not a thing you bolt on with Ollama.

There are two models:

Aion 1.0 Instruct — a small, fast SLM tuned for everyday text work: summarization, rewriting, extraction. It’s in preview now and you can test it today through the Windows Copilot Runtime API if you’re on an Edge Insider build. Open weights land on Hugging Face in July 2026.
Aion 1.0 Plan — the interesting one. A 14-billion-parameter reasoning and tool-calling model with a 32K context window that ships in-box on capable devices. It’s built to reason over user intent, call tools, manage files, and orchestrate sub-agents — a fully local agent loop. Microsoft says it’s coming “in the coming months,” so it’s announced, not shipped.

Both run on top of Windows ML, the on-device inference runtime that went generally available in 2025 and now underpins Windows AI Foundry and Foundry Local. Under the hood that’s DirectML plus the ONNX Runtime spreading work across CPU, GPU, and NPU. The practical upshot: developers don’t pick a backend by hand — Windows ML routes the model to whatever silicon the machine has, and the same Aion model that runs on an NPU can also run on a discrete GPU.

That last point is easy to miss. Microsoft explicitly said on-device SLM support is expanding to capable discrete GPUs. So Aion isn’t strictly an “NPU model” — but the marketing, the hardware floor, and the whole Copilot+ PC strategy are built around the NPU. That’s where the friction is.
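To make the "routes to whatever silicon the machine has" idea concrete, here is a minimal sketch of a backend fallback. It is not Windows ML's actual selection policy, which Microsoft has not detailed here. The preference order is our assumption; the strings are ONNX Runtime execution-provider names, and `pick_provider` is a hypothetical helper.

```python
from typing import Sequence

# Assumed preference order for illustration only: NPU-capable providers first,
# then GPU providers, then the CPU fallback.
PREFERRED_PROVIDERS = (
    "QNNExecutionProvider",       # Qualcomm NPU
    "OpenVINOExecutionProvider",  # Intel NPU / GPU / CPU
    "DmlExecutionProvider",       # DirectML, any DirectX 12 GPU
    "CUDAExecutionProvider",      # NVIDIA GPU
    "CPUExecutionProvider",       # fallback present in standard builds
)


def pick_provider(available: Sequence[str]) -> str:
    """Return the first preferred provider that the machine reports."""
    for name in PREFERRED_PROVIDERS:
        if name in available:
            return name
    raise RuntimeError("no known execution provider available")


# With onnxruntime installed, the real list comes from:
#   import onnxruntime as ort
#   pick_provider(ort.get_available_providers())
```

On a machine that reports only DirectML and CPU, this returns the DirectML provider, which is the "same model, different silicon" behavior described above.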

The 40-TOPS floor is the whole story

Aion’s NPU path inherits the Copilot+ PC requirement: a minimum 40 TOPS of NPU performance. This single number decides whether the laptop you already own can run Aion’s marquee experience locally — and a surprising number of “AI PCs” fail it.

Here’s the reality of NPU ratings in mid-2026, all measured against that 40-TOPS line:

Processor	NPU TOPS	Meets 40-TOPS floor?	Aion / WSL 3 NPU support at launch
Qualcomm Snapdragon X Elite	45	✅ Yes	✅ Yes
Intel Lunar Lake (Core Ultra 200V)	45–48	✅ Yes	✅ Yes
Intel Meteor Lake (Core Ultra Series 1)	~10–11	❌ No	NPU passthrough listed, but NPU misses the floor
AMD Ryzen AI 300 “Strix Point” (XDNA 2)	50	✅ Yes	⚠️ Deferred (“coming later”)
AMD Ryzen AI Max+ 395 “Strix Halo” (XDNA 2)	50	✅ Yes	⚠️ Deferred

Two things jump out, and both contradict the breathless coverage.

Meteor Lake is not a Copilot+ chip. Intel’s first “AI PC” silicon — the Core Ultra Series 1 you’d have bought in a 2024 laptop — has an NPU rated around 10–11 TOPS. The “34 TOPS” figure you see on spec sheets is the whole package (NPU + iGPU + CPU), not the NPU. In UL Procyon’s AI benchmark, the Snapdragon X Elite’s NPU scored ~1,720 versus ~476 for that Meteor Lake NPU. So when Build’s slides listed “Intel Meteor Lake” under WSL 3 NPU passthrough, that’s about driver plumbing for the NPU that exists — it does not mean a Meteor Lake laptop clears Aion’s 40-TOPS bar. If you bought an “AI PC” in 2024, check the NPU number, not the package number, before you get excited.

AMD has the silicon but not the software — yet. This is the genuinely odd part. AMD’s XDNA 2 NPU in both Strix Point (Ryzen AI 300) and Strix Halo (Ryzen AI Max+ 395) is rated 50 TOPS, comfortably over the floor. These are some of the strongest consumer NPUs you can buy. Yet Microsoft listed AMD as “coming later” for both Aion’s NPU path and WSL 3 NPU passthrough at launch. Qualifying hardware, deferred support. If you own a Strix Halo machine (we covered it in our Ryzen AI Max+ 395 guide), you have the TOPS — you’re waiting on Microsoft’s driver story, not your chip.

So the honest hardware map at launch: Snapdragon X Elite and Intel Lunar Lake are the only consumer platforms that both clear the floor and have day-one Aion NPU support. Everyone else is either under the line (Meteor Lake) or waiting in line (AMD).
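The two-step check in this section (clear the floor first, then ask about launch support) can be written down as a small sketch. The TOPS values and support flags are taken from the table above; the function name and status strings are ours, purely for illustration.

```python
NPU_FLOOR_TOPS = 40  # Copilot+ PC minimum cited in this article

# name -> (NPU TOPS, Aion NPU support at launch), from the table above.
# Where the table gives a range, either end leads to the same verdict.
CHIPS = {
    "Snapdragon X Elite": (45, True),
    "Intel Lunar Lake": (45, True),
    "Intel Meteor Lake": (11, True),   # passthrough listed, NPU too small
    "AMD Strix Point": (50, False),    # support "coming later"
    "AMD Strix Halo": (50, False),     # support deferred
}


def aion_npu_status(npu_tops: float, supported_at_launch: bool) -> str:
    """Classify a chip: the TOPS floor is checked before software support."""
    if npu_tops < NPU_FLOOR_TOPS:
        return "under the floor"
    if not supported_at_launch:
        return "qualifies, support deferred"
    return "ready at launch"


for chip, (tops, supported) in CHIPS.items():
    print(f"{chip}: {aion_npu_status(tops, supported)}")
```

The key input is the NPU figure alone. Feeding in a package-level number such as Meteor Lake's 34 TOPS would still fail the check, but for the wrong reason.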

TOPS is not tokens per second

Here’s the part the spec sheets won’t tell you, and it’s the reason we keep pointing home-labbers back at discrete GPUs: a high TOPS number does not mean fast token generation.

LLM decode — the token-by-token generation you actually feel — is memory-bandwidth-bound, not compute-bound. TOPS measures raw matrix throughput, which matters for the prefill (reading your prompt) phase and for vision models. It barely predicts decode speed, which is gated by how fast the chip can stream the model’s weights out of memory for every single token.

The numbers make this brutally clear. On a Snapdragon X Elite’s 45-TOPS NPU, Llama 3.1 8B runs at roughly 5 tokens/second, and a smaller Llama 3.2 3B at about 10 tokens/second. Qualcomm’s marketing cites 30 tokens/second (a figure quoted for a 7B model, alongside support for 13B-plus models on-device), but that is best treated as a vendor ceiling under ideal conditions: independent testing lands far lower, and there are documented cases (the Surface Pro 11) where the NPU is actually slower than the same chip’s CPU for small-model inference.

Now put that next to a discrete GPU. A used RTX 3090 (about $1,070 in June 2026) has 936 GB/s of memory bandwidth and runs a 7B model at roughly 95 tokens/second. That’s not a small lead. It’s ~5 tok/s on the NPU with an 8B model versus ~95 tok/s on the GPU with a 7B model: close to a 20× gap in the GPU’s favor on roughly comparable model sizes.
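The bandwidth argument can be checked with a back-of-envelope sketch. It uses the common rule of thumb that each decoded token streams every weight from memory once, so tokens/second is at most bandwidth divided by weight size. The 4-bit quantization (0.5 bytes per parameter) is our assumption, and the model ignores KV-cache traffic and runtime overhead, so treat the outputs as rough bounds rather than measurements.

```python
def weight_bytes(params_billion: float, bits_per_weight: float = 4.0) -> float:
    """Approximate size of the model weights in bytes (4-bit quant assumed)."""
    return params_billion * 1e9 * bits_per_weight / 8


def decode_ceiling_tok_s(bandwidth_gb_s: float, params_billion: float,
                         bits_per_weight: float = 4.0) -> float:
    """Upper bound on decode speed if every token reads all weights once."""
    return bandwidth_gb_s * 1e9 / weight_bytes(params_billion, bits_per_weight)


def implied_bandwidth_gb_s(tok_s: float, params_billion: float,
                           bits_per_weight: float = 4.0) -> float:
    """Effective memory bandwidth implied by an observed decode speed."""
    return tok_s * weight_bytes(params_billion, bits_per_weight) / 1e9


print(round(decode_ceiling_tok_s(936, 7)))   # RTX 3090, 7B model
print(implied_bandwidth_gb_s(5, 8))          # NPU at ~5 tok/s on an 8B model
print(implied_bandwidth_gb_s(95, 7))         # RTX 3090 at ~95 tok/s on a 7B model
```

Under those assumptions the RTX 3090's ceiling on a 7B model is about 267 tok/s, so the observed ~95 tok/s sits well inside it. The NPU's ~5 tok/s on an 8B model implies only about 20 GB/s of effective weight streaming, against roughly 330 GB/s for the GPU.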

For Aion 1.0 Plan specifically, the math gets worse. It’s a 14B model, bigger than the 8B that already crawls at ~5 tok/s on an NPU. A reasoning model emits a lot of chain-of-thought tokens before it answers. At the ~3–5 tok/s we estimate for a 14B-class model on an NPU, a reasoner that “thinks” for 800 tokens before responding will keep you waiting roughly 160 to 270 seconds, or about three to four and a half minutes. Reading speed is roughly 7–10 tok/s, so an NPU-bound 14B reasoner sits below the speed at which text is comfortable to read in real time.
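The wait-time estimate is simple division, sketched below. The 800-token thinking budget is the example from the paragraph above, and 3 and 5 tok/s bracket our estimate for a 14B-class model on an NPU; `wait_seconds` is just an illustrative helper.

```python
def wait_seconds(thinking_tokens: int, tok_per_s: float) -> float:
    """Seconds spent generating reasoning tokens before the answer starts."""
    return thinking_tokens / tok_per_s


THINKING_TOKENS = 800  # example reasoning budget from the text

# 3 and 5 tok/s bracket our NPU estimate for a 14B-class model.
for rate in (3, 5):
    secs = wait_seconds(THINKING_TOKENS, rate)
    print(f"{rate} tok/s: {secs:.0f} s ({secs / 60:.1f} min)")
```

This prints 267 s (4.4 min) at 3 tok/s and 160 s (2.7 min) at 5 tok/s, all before the first word of the actual answer appears.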

This is the same conclusion we reached in our NPU vs discrete GPU breakdown: the NPU’s win was never speed. It’s tokens per watt. An NPU sips ~10–25 W doing this work; the RTX 3090 pulls ~285–350 W under load. For an always-on background agent that summarizes your files or drafts replies while you do other things, 5 tok/s at 15 W is genuinely useful and a discrete GPU would be absurd overkill. For an interactive coding assistant you’re actively waiting on, the NPU is the wrong tool.

Aion 1.0 vs running Qwen3.5 14B yourself

The obvious comparison for a Windows home-labber: why use Aion 1.0 Plan at all when you can already run Qwen3.5 14B via Ollama?

It comes down to three trade-offs.

Distribution and zero-setup. Aion ships in the box. No install, no model download, no ollama pull, no figuring out which quant fits. For the 95% of Windows users who will never touch a terminal, that’s the entire ballgame — local AI they didn’t have to assemble. Ollama wins on flexibility and model choice; Aion wins on “it’s already there and free.”

The backend you actually land on. Ollama on Windows runs through Vulkan (or CUDA on NVIDIA), so on a Copilot+ laptop it’ll typically use the integrated GPU, not the NPU. Aion through Windows ML can target the NPU, the iGPU, or a discrete GPU. The counterintuitive result: on the same Snapdragon or Lunar Lake laptop, the iGPU path can be faster than the NPU path for decode, because the iGPU often has more usable memory bandwidth. If raw speed on a laptop is your goal, the NPU isn’t automatically the answer even when you have one.

Integration vs control. Aion 1.0 Plan is wired into Windows’ agent story: file access, tool-calling, sub-agent orchestration, and the new OS-level containment (Microsoft Execution Containers) that sandboxes what an agent can touch. That’s a real platform advantage for building local agents that act on your machine. Ollama gives you none of that OS integration but total control over the model, the quant, and the data. If you’re doing coding work specifically, our sister site aicoderscope.com tracks how these local stacks compare against cloud coding agents.

For most home-lab setups today, the pragmatic answer is: try Aion 1.0 Instruct when it’s broadly available for light text tasks (it’s free and built in), but keep Ollama + a real GPU for anything where you’re waiting on output. The 14B Plan model is worth watching once the open weights drop in July — at that point the community can run it through llama.cpp on a discrete GPU and sidestep the NPU bottleneck entirely.
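If you want your own number rather than ours, Ollama reports decode statistics with every non-streaming response. The sketch below assumes Ollama is running on its default local port and that you pass the tag of a model you have already pulled; the helper names are ours. It reads `eval_count` (generated tokens) and `eval_duration` (nanoseconds) from the `/api/generate` response.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default endpoint


def decode_tok_s(response: dict) -> float:
    """Decode speed from eval_count (tokens) and eval_duration (nanoseconds)."""
    return response["eval_count"] / (response["eval_duration"] / 1e9)


def generate(model: str, prompt: str) -> dict:
    """Send one non-streaming generation request to a local Ollama server."""
    payload = json.dumps(
        {"model": model, "prompt": prompt, "stream": False}
    ).encode("utf-8")
    request = urllib.request.Request(
        OLLAMA_URL,
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as resp:
        return json.load(resp)


# Usage, with a model tag you have pulled locally:
#   result = generate("<your-model-tag>", "Explain memory bandwidth in one line.")
#   print(f"{decode_tok_s(result):.1f} tok/s")
```

Run the same prompt on the iGPU path and on a discrete GPU and you get a like-for-like decode comparison on your own hardware, which is more useful than any spec-sheet TOPS figure.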

Who should care, and what to buy

You already own a Copilot+ laptop (Snapdragon X Elite / Lunar Lake). Good news — you clear the floor and get day-one support. Aion is free local AI you can use offline, and the battery-friendly background-agent use case is exactly what this hardware is for. Just don’t expect GPU speeds.

You own a 2024 “AI PC” with Meteor Lake. Check the NPU TOPS, not the package number. You very likely don’t meet the 40-TOPS floor, which means Aion’s NPU experience isn’t for you. You can still run local LLMs on the iGPU via Ollama — that path doesn’t care about the Copilot+ badge.

You own an AMD Ryzen AI / Strix Halo machine. You have the TOPS (50), but Aion NPU support is deferred. Wait for Microsoft’s AMD enablement, or just run models on the iGPU through Ollama today; Strix Halo’s unified memory (up to 128 GB) makes it a capable local-LLM box regardless of Aion.

You’re a home-labber chasing tokens/second. None of this changes the calculus. A used RTX 3090 at ~$1,070 / 936 GB/s / ~95 tok/s on 7B is still the value king for interactive local inference — see our used RTX 3090 deep dive. An NPU is a complement for low-power background work, not a replacement.

You want to develop local agents on Windows. This is the audience Aion is genuinely built for. The combination of an in-box 14B tool-calling model, Windows ML’s hardware abstraction, and OS-level agent containment is a real platform — and one cloud-only agent frameworks can’t match for offline, privacy-bound workloads. Build for it; just architect around the NPU being slow for interactive turns.

A note on the bigger picture: Aion is one piece of a Build 2026 push that also included WSL 3 NPU/GPU passthrough (covered in our WSL 3 passthrough guide) and a wave of deskside AI hardware. We put the whole hardware wave in context in our Computex 2026 roundup. The throughline across all of it is the same: the silicon is improving fast, but memory bandwidth still decides who’s fast, and NPUs still trade speed for efficiency.

FAQ

What is Microsoft Aion 1.0? A family of on-device small language models for Windows 11, announced at Build 2026 (June 2, 2026). It has two members: Aion 1.0 Instruct (a small SLM for summarization and rewriting, in preview now) and Aion 1.0 Plan (a 14B reasoning and tool-calling model with a 32K context window, announced as arriving “in the coming months”). Open weights for the models are scheduled for Hugging Face in July 2026.

What hardware do I need to run Aion 1.0 locally? The NPU path requires a Copilot+ PC with at least a 40-TOPS NPU. At launch that means Qualcomm Snapdragon X Elite (45 TOPS) or Intel Lunar Lake (45–48 TOPS). AMD’s Ryzen AI chips (50 TOPS) qualify on hardware but Aion NPU support is deferred. Intel Meteor Lake’s NPU (~10–11 TOPS) does not meet the floor. The models can also run on capable discrete GPUs via Windows ML.

Is Aion 1.0 faster than running Ollama on my GPU? No. On an NPU, an 8B model runs around 5 tokens/second; a 14B model like Aion Plan would be slower. A used RTX 3090 runs a 7B at roughly 95 tokens/second. Aion’s advantage is power efficiency (~10–25 W vs ~285–350 W) and zero-setup integration, not speed.

Why does a 45-TOPS NPU run LLMs so slowly? Token generation is limited by memory bandwidth, not raw compute (TOPS). An NPU has high TOPS but modest memory bandwidth, so it generates tokens slowly even though it crunches matrices efficiently. That’s why discrete GPUs with 900+ GB/s of bandwidth are far faster for decode.

Can I run Aion 1.0 on Linux or via llama.cpp? Not at the OS-integration level — Aion’s runtime is Windows ML / Windows AI Foundry. But once the open weights ship on Hugging Face in July 2026, the community can convert them to GGUF and run them in llama.cpp/Ollama on any platform, including on a discrete GPU that sidesteps the NPU bottleneck.

Recommended Gear

RTX 3090 — used, ~$1,070, 936 GB/s, ~95 tok/s on a 7B; still the value pick for fast interactive local inference next to any NPU.

Last updated June 28, 2026. Prices and specs change; verify current prices before purchasing. Aion 1.0 Plan and AMD NPU support were not yet shipping as of this date, so figures for Aion on NPU are based on comparable 8B/14B inference on the same silicon.