AMD Lemonade Local LLM Server: GPU + NPU Inference on Consumer Hardware (2026 Guide)
TL;DR: Lemonade is AMD’s open-source local LLM server that uses your Ryzen AI NPU to cut time-to-first-token in half while offloading sustained generation to your GPU — all through a single OpenAI-compatible endpoint at localhost:13305. It handles text, image gen, speech-to-text, and TTS in one install. The catch: NPU acceleration requires Ryzen AI 300/400-series hardware (XDNA 2); older Ryzen AI chips and Nvidia hardware get no NPU benefit.
| Lemonade | Ollama | LM Studio | |
|---|---|---|---|
| Best for | AMD Ryzen AI 300+ with NPU | Any hardware, broadest model support | GUI-first beginners |
| NPU acceleration | ✅ XDNA 2 (Ryzen AI 300/400) | ❌ | ❌ |
| Multi-modal | LLM + Image + STT + TTS built-in | LLM only | LLM + basic vision |
| AMD GPU setup | Auto-detects ROCm/Vulkan | Manual ROCm config required | Manual ROCm config required |
| The catch | Best results on AMD-only hardware | No NPU offloading anywhere | Not a server, no API by default |
Honest take: If you own a Ryzen AI 300-series laptop or a Strix Halo system, Lemonade is the obvious choice — nothing else gives you NPU-accelerated TTFT plus unified multi-modal inference in one install. On Nvidia hardware or older AMD CPUs, use Ollama instead.
The problem Lemonade solves
Every guide about running local LLMs on AMD hardware eventually arrives at the same frustrating detour: ROCm. The ROCm driver stack is powerful but installation is brittle, GPU target strings change between releases, and getting llama.cpp to actually use your Radeon RX 7900 XTX rather than falling back to CPU can eat an afternoon.
AMD’s answer is Lemonade — an open-source local AI server (GitHub: lemonade-sdk/lemonade, 3.7k stars as of May 2026) sponsored and co-developed by AMD engineers. It auto-detects your hardware, selects the right backend, and exposes everything through a single OpenAI-compatible REST endpoint. No manual ROCm flags. No GPU target strings. One install.
The more interesting innovation, though, is the NPU. Modern Ryzen AI 300/400-series laptops ship with an XDNA 2 neural processing unit delivering 50 TOPS of AI compute — and until Lemonade, that hardware sat mostly idle for LLM inference.
How the GPU + NPU split works
LLM inference has two distinct phases with very different compute profiles:
Prefill (prompt processing): The model ingests your entire input prompt and builds the KV cache. This is compute-bound — it needs raw matrix multiply throughput, not memory bandwidth. A prompt of 1,000 tokens needs thousands of matrix operations processed in parallel. The NPU excels here.
Decode (token generation): The model generates one token at a time. Each step needs to load the entire model’s weights from memory to perform a single forward pass. This is memory-bandwidth-bound — sustained throughput depends on how fast weights can be read. The GPU wins here because it has wider memory buses.
Lemonade’s hybrid execution exploits this split. On Ryzen AI 300/400-series hardware, it routes prompt processing through the XDNA 2 NPU and token generation through the integrated GPU (or a discrete Radeon if you have one). AMD’s own benchmarks show the NPU delivers 2.3× faster time-to-first-token versus GPU-only inference, while GPU decode achieves 2.4× higher sustained throughput versus NPU-only decode. The hybrid mode combines both: fast startup from the NPU, sustained throughput from the GPU.
The backend doing the NPU work is FastFlowLM, a purpose-built runtime for AMD NPUs. Under the hood, Lemonade also orchestrates llama.cpp (for GGUF models on CPU/GPU via Vulkan or ROCm), whisper.cpp (speech-to-text), stable-diffusion.cpp (image generation), and Kokoro (text-to-speech). You don’t configure any of this — it picks the backend based on what your hardware supports and what model format you’re loading.
Hardware requirements
Full NPU + GPU hybrid: Ryzen AI 300/400-series (Strix Point)
The minimum hardware for NPU acceleration is a Ryzen AI 9 HX 370 or any other Ryzen AI 300-series chip (Strix Point). These APUs pack:
- XDNA 2 NPU: 50 TOPS AI compute
- Zen 5 CPU cores (up to 12 cores, 24 threads)
- RDNA 3.5 iGPU (up to 16 CUs)
- LPDDR5X system RAM (up to 32GB on typical laptop configs)
Windows 11 is required for NPU acceleration on these chips. Windows 10 is supported for CPU/GPU inference only.
On Linux, NPU support requires XDNA 2 specifically — the older XDNA 1 found in Ryzen AI 7000/8000/200-series chips is not supported for NPU inference via FastFlowLM on Linux. If you’re on a Ryzen AI 7040 series (Hawk Point) or similar XDNA 1 hardware, you can still run Lemonade with GPU or CPU backends.
Maximum configuration: Ryzen AI MAX+ 395 (Strix Halo)
The Ryzen AI Max+ 395 (Strix Halo) is the standout platform for Lemonade in 2026:
- XDNA 2 NPU: 50 TOPS
- RDNA 3.5 iGPU: 40 compute units, 60 FP16 TFLOPS
- Up to 128GB LPDDR5X unified memory (256-bit interface at 8,000 MT/s)
- Up to 96GB of that pool usable as VRAM
The 128GB unified memory ceiling means Strix Halo can run dense 70B+ models entirely in-memory. Community benchmarks show impressive numbers on this hardware: Qwen3-Coder-Next at 43 t/s (Q4), Qwen3.5 35B-A3B at 55 t/s (Q4), and even GPT-OSS 120B reaching ~50 t/s. Dense 27B models are slower — Qwen3.5 27B lands at 11–12 t/s at Q4, the bandwidth cost of a fully-dense architecture at that size.
For context, an RTX 4090 achieves roughly 50–80 t/s on 7B models at Q4 with Ollama — competitive with Strix Halo at smaller scales, but the RTX 4090 tops out at 24GB VRAM with no path to 70B inference without CPU offloading.
Discrete Radeon GPUs
If you have a desktop with an AMD Radeon RX 7900 XTX or similar RDNA2/RDNA3/RDNA4 card, Lemonade supports GPU inference via ROCm or Vulkan. You won’t get NPU acceleration — there’s no NPU in a discrete Radeon — but you do get Lemonade’s automatic backend selection, multi-modal stack, and unified API without manually configuring ROCm yourself.
Supported discrete GPU families: Radeon RX 6000 series (RDNA2), RX 7000 series (RDNA3), and RX 9000 series (RDNA4). The RX 9070 XT is the current value target for RDNA4 on desktop.
CPU fallback
No AMD GPU at all? Lemonade runs via llama.cpp CPU inference on any x86_64 machine. Performance is unsurprising — a Ryzen 9 7950X at Q4_K_M gets roughly 5–8 t/s on 7B models — but the setup path and API remain identical. Useful for testing or for workflows where latency doesn’t matter.
Installation
Windows
Download the one-click installer from the Lemonade releases page. The installer detects your hardware, pulls the right backends, and registers Lemonade as a Windows service. After install, the server is live at http://localhost:13305/v1.
Requirements: Windows 10 (build 1809+) for CPU/GPU, Windows 11 for NPU on Ryzen AI 300/400.
Model downloads happen through the Lemonade UI or via API — no manual GGUF hunting required. Models are stored locally; they don’t leave your machine.
Linux
With Lemonade 10.0.1, Debian packages are available via a PPA for Ubuntu-based distributions. Install ROCm drivers first if you have a Radeon GPU; the ROCm setup is still a prerequisite on Linux, but Lemonade handles everything above that layer.
For NPU on Linux: FastFlowLM requires XDNA 2 (Ryzen AI 300/400/Max series). The packages provide an improved setup process that Phoronix covered as a significant usability improvement over earlier versions.
macOS and Docker
Lemonade runs on macOS via CPU inference (no Metal/NPU backend as of v10.3). Docker images are available for containerized deployments.
Performance breakdown
Here’s what the verified numbers look like across hardware tiers:
| Hardware | Model | Backend | Result |
|---|---|---|---|
| Ryzen AI MAX+ 395 (Strix Halo) | Qwen3.5 35B-A3B | FastFlowLM | 55 t/s (Q4) |
| Ryzen AI MAX+ 395 (Strix Halo) | Qwen3-Coder-Next | FastFlowLM | 43 t/s (Q4) |
| Ryzen AI MAX+ 395 (Strix Halo) | GPT-OSS 120B | FastFlowLM | ~50 t/s |
| Ryzen AI MAX+ 395 (Strix Halo) | Qwen3.5 27B (dense) | FastFlowLM | 11–12 t/s (Q4) |
| Ryzen AI 300-series (Strix Point) | Llama 3.2-3B | NPU (FastFlowLM) | 28 t/s |
| Ryzen AI 300-series (Strix Point) | GPT-OSS-20B | NPU (FastFlowLM) | 19 t/s |
| Ryzen AI 300-series (NPU+CPU) | Any small model | FastFlowLM | 20–80 t/s at <2W |
The “<2W for NPU+CPU” figure deserves attention for laptop users. Running a 3B model on the NPU at ~28 t/s while drawing under 2 watts is a meaningful battery-life story — the iGPU at similar speeds would draw 10–15W. For always-on assistant use cases on a laptop, the NPU mode extends runtime in a way no Nvidia laptop NPU can match, because Nvidia laptops simply don’t have a user-accessible NPU for LLM inference.
The 10× faster initialization figure (Qwen3-4B cold load dropping from ~10 seconds to ~1 second on AMD Ryzen AI) is also notable. For interactive use where you’re loading and switching models, that matters.
Multi-modal capabilities
Lemonade v10.3 ships a unified multi-modal stack that most competing tools don’t attempt:
Text generation: LLMs via llama.cpp (GGUF), FastFlowLM (FLM), and OnnxRuntime GenAI (ONNX). The NPU path uses ONNX models; GGUF models run on GPU or CPU.
Image generation: Stable Diffusion via stable-diffusion.cpp. Accessible through the same OpenAI-compatible API — the OmniRouter layer handles routing image generation requests to the right backend. You point ComfyUI or any SD client at localhost:13305 and it handles the rest.
Speech-to-text: Whisper transcription via whisper.cpp. Drop-in replacement for the OpenAI Whisper API endpoint — same request format, local execution.
Text-to-speech: Kokoro TTS. Lower latency than Whisper TTS for real-time voice output in agent pipelines.
This unified stack matters for agentic workflows. If you’re building an agent that needs to transcribe audio, reason about it, and respond with speech, you previously needed three separate services (Whisper server, Ollama, a TTS server) each with their own setup and endpoints. Lemonade handles all four modalities at localhost:13305/v1 under the OpenAI API schema.
For projects that combine these capabilities — home automation, voice assistants, document pipelines — Lemonade is the only open-source option that runs the full stack under a single process without Docker Compose gymnastics.
Lemonade vs Ollama for AMD hardware
If you’re on AMD, the comparison is more nuanced than “just use Ollama.”
Where Ollama still wins:
- Vastly broader model library (any GGUF from Hugging Face in one pull command)
- Better Nvidia GPU support (ROCm and CUDA both work; Ollama’s CUDA path is more polished)
- macOS Metal acceleration
- Larger community, more third-party integrations, more Stack Overflow answers
Where Lemonade wins on AMD:
- NPU acceleration — Ollama has no NPU path at all, on any platform
- Auto-hardware detection for AMD — no manual ROCm flags or GPU target env vars
- Built-in multi-modal (Stable Diffusion, Whisper, TTS in one install)
- Context lengths up to 256K tokens on Ryzen AI NPUs via FastFlowLM 0.9.35
- 2.3× faster TTFT on Strix Point vs GPU-only inference — real for interactive use
The pragmatic setup if you’re on AMD hardware: run Lemonade as your primary inference server for NPU-eligible models and interactive use. Keep an Ollama instance available for model exploration or GGUF formats Lemonade doesn’t yet support. Both expose OpenAI-compatible APIs, so switching your app’s base_url between them takes one line.
If you have a discrete Radeon RX 7900 XTX in a desktop tower and no NPU, Lemonade’s value proposition shrinks — you’re getting automatic ROCm/Vulkan backend selection but losing the NPU differentiation. In that scenario, Ollama with a manual ROCm setup is fine, and you retain the broader model library. Our guide to running Ollama vs LM Studio vs llama.cpp covers that decision in more detail.
Connecting your existing tools
Because Lemonade speaks OpenAI’s API at http://localhost:13305/v1, the integration story is straightforward:
- Continue.dev: Set server URL to
http://localhost:13305/v1. See our Continue.dev + Ollama setup guide — the same config applies, just swap the port. - Open WebUI: Add a custom OpenAI connection pointing at
localhost:13305. - VS Code with GitHub Copilot alternative: Any extension supporting custom OpenAI endpoints works.
- n8n, LangChain, AutoGen: Update
openai.base_urltohttp://localhost:13305/v1.
The AMD developer portal has an official Lemonade getting-started playbook at developer.amd.com/playbooks/lemonade-getting-started/ with hardware-specific setup paths.
Frequently Asked Questions
Does Lemonade work on Nvidia GPUs? Yes, but without NPU acceleration. On Nvidia hardware, Lemonade falls back to llama.cpp with CPU inference. CUDA is not currently a Lemonade backend — for Nvidia GPUs, Ollama or vLLM are better choices. Lemonade’s value is specifically its AMD NPU + GPU hybrid path.
Which AMD laptops support NPU inference? Ryzen AI 300-series (Strix Point) and Ryzen AI 400-series laptops with XDNA 2 NPUs, plus Ryzen AI MAX/MAX+ 395 systems (Strix Halo). NPU acceleration on Linux requires XDNA 2 specifically — XDNA 1 chips (Ryzen AI 7000/8000/200-series) do not have Linux NPU support in Lemonade as of v10.3.
What model formats does Lemonade support? GGUF (via llama.cpp, runs on CPU/GPU), FLM (FastFlowLM format for NPU inference), and ONNX (via OnnxRuntime GenAI, for NPU+iGPU hybrid mode). Not every model is available in FLM or ONNX format — the NPU path has a narrower model library than GGUF. The Lemonade model catalog lists what’s available for each backend.
Can I run 70B models on a Strix Halo laptop? Yes. The Ryzen AI MAX+ 395 with 128GB unified memory (up to 96GB usable as VRAM) fits Qwen3.5 72B and similar models entirely in memory. Community benchmarks show GPT-OSS 120B running at ~50 t/s on this platform. A standard Strix Point laptop with 32GB is limited to ~18B models at Q4 before hitting memory limits.
Is Lemonade free? What’s the license?
The Lemonade server and SDK are open source (GitHub: lemonade-sdk/lemonade). The FastFlowLM NPU kernels are proprietary but free for “reasonable commercial use” per AMD’s terms — the source describes them as purpose-built AMD NPU optimizations not open-sourced. For most home lab and indie dev use cases, there is no cost.
Sources
- Lemonade GitHub Repository — lemonade-sdk
- Lemonade by AMD: A Unified API for Local AI Developers — AMD Developer
- Using Lemonade Across CPU, GPU, and NPU — AMD AI Playbooks
- AMD Ryzen AI 300 Series Strix Point APU Launch: 50 TOPS NPU — Tweaktown
- AMD Ryzen AI MAX+ 395: Breakthrough AI Performance — AMD Blog
- Lemonade v10.3: Run Local LLMs, Image Gen, and Speech on Your Own GPU — DEV Community
- AMD Ryzen AI NPUs Finally Useful on Linux via Lemonade 10.0 and FastFlowLM — Agent Wars
- Lemonade 10.0.1 Improves Setup for AMD Ryzen AI NPUs on Linux — Phoronix
- Running LLMs on the AMD NPU with Lemonade Server — Sleeping Robots
- FastFlowLM GitHub Repository — FastFlowLM
- Lemonade by AMD: Fast and Open Source Local LLM Server — Hacker News
- AMD Lemonade: A Unified API — Ryzen AI and Radeon ready to run LLMs Locally
Last updated May 28, 2026. Prices and specs change; verify current rates before purchasing.
Recommended Gear
- AMD Ryzen AI Max+ 395 (Strix Halo) laptops
- RTX 4090 (comparison reference)
- AMD Radeon RX 7900 XTX
- AMD Radeon RX 9070 XT
Was this article helpful?
Thanks for the feedback — it helps improve future articles.
Need hands-on help?
I offer 1-on-1 technical consulting for local AI setup, GPU selection, and AI coding tool configuration — same topics covered on this site.
Book a session — $49 / hour →