Jun 26, 2026

vLLM Won't Start? Every Fix for the Engine Init, CUDA, and OOM Errors (2026)

By RunAIHome Team · 13 min read

vllm · local-llm · cuda · gpu · troubleshooting

TL;DR: Most vLLM startup failures are one of three things: the engine reserves more KV-cache memory than your card has (No available memory for the cache blocks), the CUDA driver is older than the wheel was built for (The NVIDIA driver on your system is too old), or a multi-GPU run hangs at NCCL init. The fixes are nearly always flags, not code: pin --max-model-len, tune --gpu-memory-utilization, add --enforce-eager, or set a couple of NCCL env vars. Read the last line of the traceback first — it tells you which of the three you have.

What you’ll be able to do after this:

Read a vLLM startup traceback and know in one glance whether it’s a KV-cache/OOM problem, a driver/CUDA problem, or a multi-GPU networking hang.
Apply the exact flag or environment variable that fixes each class, with the values that actually work on 12–24 GB consumer cards.
Stop guessing from nvidia-smi — which lies about how much memory vLLM can actually use — and trust the startup log instead.

Honest take: vLLM is a server engine, not a desktop app. If you just want a model running on one consumer GPU with the least friction, Ollama or LM Studio will get you there faster. Reach for vLLM when you need throughput under concurrency (many requests at once) and you’re willing to learn three flags: --max-model-len, --gpu-memory-utilization, and --enforce-eager. Once you know those flags, most of the “it won’t start” pain disappears.

This guide assumes vLLM v0.23.0 (released June 13, 2026), which ships on PyTorch 2.11 with the default PyPI wheel now built for CUDA 13.0 and Python 3.14 added to the supported list. Older forum threads reference very different defaults, so the version tag matters when you’re copying fixes from 2024–2025 posts.

Step 0: Read the actual error, not the wall of logs

vLLM prints a lot of output on startup: model download progress, worker spawn messages, CUDA graph capture. None of that is the error. The error is the last Python traceback, and specifically its final line. Three signatures (two error lines and one hang) account for the overwhelming majority of “vLLM won’t start” reports:

The line you see	What it actually means	Jump to
`ValueError: No available memory for the cache blocks`	KV cache doesn’t fit after weights load	OOM section
`RuntimeError: The NVIDIA driver on your system is too old`	Wheel built for newer CUDA than your driver	Driver section
Hangs forever after `Started a worker` / NCCL lines	Multi-GPU collective setup stuck	NCCL section

If you can’t tell which bucket you’re in, restart with debug logging on and capture the tail:

VLLM_LOGGING_LEVEL=DEBUG vllm serve Qwen/Qwen2.5-7B-Instruct 2>&1 | tee vllm.log

The DEBUG level is documented in vLLM’s own troubleshooting guide and is the single most useful thing you can do before asking anyone for help.
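A minimal sketch of that triage, assuming the log was captured to vllm.log as above. The function name classify_vllm_log and the bucket labels are made up for illustration, and the NCCL branch is only a heuristic: a hang produces no error line, so it just checks whether the tail of the log is NCCL output.

```shell
#!/usr/bin/env bash
# Hypothetical helper: map a captured vLLM log to one of the three buckets.
# The two error strings are the ones quoted in the table above.
classify_vllm_log() {
  local log="$1"
  if grep -qF 'No available memory for the cache blocks' "$log"; then
    echo "oom"
  elif grep -qF 'The NVIDIA driver on your system is too old' "$log"; then
    echo "driver"
  elif tail -n 20 "$log" | grep -qF 'NCCL'; then
    # Heuristic only: neither error string appeared and the last 20 lines mention NCCL.
    echo "nccl-hang-suspected"
  else
    echo "unknown"
  fi
}
```

Call it as classify_vllm_log vllm.log; an "unknown" result means the final traceback line needs reading by hand.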

The #1 startup error: “No available memory for the cache blocks”

This is the error people hit first, and it’s been the top vLLM startup complaint since at least issue #2248. The full message reads:

ValueError: No available memory for the cache blocks. Try increasing
`gpu_memory_utilization` when initializing the engine.

Why it happens

vLLM loads the model weights first, then tries to carve the remaining VRAM into KV-cache blocks. The KV cache is sized from --max-model-len (the maximum sequence length) and the number of concurrent sequences. If the weights plus the requested context budget exceed your card, there’s nothing left for blocks, and the engine refuses to start rather than crash mid-request.

The trap: vLLM’s default --max-model-len is the model’s full trained context, often 32K or higher. A 7B model quantized to 4 bits might be ~4.5 GB of weights, but a 32K-context KV cache for several parallel sequences can dwarf that. On a 12 GB card the math simply doesn’t close. This is exactly the failure mode reported for 7B–13B models on the RTX 3060 12GB in issue #27934, and the same arithmetic applies to any 12 GB card, not just the 3060.
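To see why the math doesn’t close, here is a back-of-envelope sketch using the standard per-token KV-cache size for grouped-query attention: 2 (keys and values) × layers × KV heads × head dimension × bytes per element. The shape used below (28 layers, 4 KV heads, head dimension 128) is my assumption for Qwen2.5-7B; check the model’s config.json before relying on it. vLLM’s real allocation also depends on block size and runtime overhead, so treat the result as an estimate, not what the engine will report.

```shell
#!/usr/bin/env bash
# Bytes of KV cache per token: 2 (K and V) * layers * kv_heads * head_dim * bytes_per_elem
kv_bytes_per_token() {
  local layers="$1" kv_heads="$2" head_dim="$3" elem_bytes="$4"
  echo $(( 2 * layers * kv_heads * head_dim * elem_bytes ))
}

# MiB of KV cache for one sequence of the given length (integer division).
kv_mib_per_seq() {
  local per_token="$1" tokens="$2"
  echo $(( per_token * tokens / 1024 / 1024 ))
}

# Assumed Qwen2.5-7B shape (verify against config.json): 28 layers, 4 KV heads, head_dim 128.
per_token=$(kv_bytes_per_token 28 4 128 2)      # 2 bytes per element = 16-bit cache
echo "per token:        ${per_token} bytes"
echo "one 32K sequence: $(kv_mib_per_seq "$per_token" 32768) MiB"
echo "one 4K sequence:  $(kv_mib_per_seq "$per_token" 4096) MiB"
```

Under these assumptions a single full 32K sequence wants about 1.75 GiB of cache on top of the weights, and every additional concurrent sequence of that length adds the same again; at 4K the figure drops to 224 MiB. Passing 1 as the element size models an 8-bit cache and halves every number.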

The fix, in the order to try it

1. Pin --max-model-len to what you actually need. This is the highest-leverage fix and most people skip it. If your prompts are 4K, don’t pay for 32K of KV cache:

vllm serve Qwen/Qwen2.5-7B-Instruct \
  --max-model-len 4096 \
  --gpu-memory-utilization 0.85

2. Raise --gpu-memory-utilization. Default is 0.9. The error message tells you to increase it, and that’s legitimate — it lets vLLM claim a larger slice of the card for blocks. On a dedicated inference box, 0.92–0.95 is reasonable. But on a card that’s also driving a display, going too high starves the desktop and can crash X. On 12 GB cards, counterintuitively, lowering it to 0.75–0.80 sometimes fixes init OOMs because it leaves more headroom for the CUDA context and fragmentation overhead that the allocator needs up front.

3. Cap concurrency with --max-num-seqs. Fewer simultaneous sequences means a smaller KV-cache budget. Dropping from the default to --max-num-seqs 16 (or 8) frees real memory on tight cards.

4. Quantize the KV cache. --kv-cache-dtype fp8 roughly halves KV-cache memory at a small quality cost — often the difference between fitting and not on 16 GB.

5. Add --enforce-eager. CUDA graph capture pre-allocates extra memory. Disabling it with --enforce-eager typically reclaims a few hundred MiB (roughly 300–500 MiB), which makes it a useful last resort when you need context length more than peak throughput.

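A combined low-VRAM launch for 12 GB cards (the 16-bit weights of a 7B model are roughly 15 GB on their own, so on a 12 GB card swap in a 4-bit AWQ or GPTQ checkpoint of the same model):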

vllm serve Qwen/Qwen2.5-7B-Instruct \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.80 \
  --max-num-seqs 8 \
  --kv-cache-dtype fp8 \
  --enforce-eager
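The trade-off in fix 2 is easier to see with numbers. This sketch assumes --gpu-memory-utilization is applied to the card’s total memory (as the FAQ below states) and takes the utilization as an integer percentage because bash has no floating point. The function names are illustrative, and 12288 MiB is a nominal figure for a 12 GB card, not a measured one.

```shell
#!/usr/bin/env bash
# MiB vLLM will try to claim for weights + KV cache at a given utilization.
vllm_budget_mib() {
  local total_mib="$1" util_pct="$2"
  echo $(( total_mib * util_pct / 100 ))
}

# MiB left outside vLLM's budget for the CUDA context, the desktop, and other processes.
headroom_mib() {
  local total_mib="$1" util_pct="$2"
  echo $(( total_mib - $(vllm_budget_mib "$total_mib" "$util_pct") ))
}

# Nominal 12 GB card; compare a few utilization settings side by side.
for pct in 90 85 80 75; do
  echo "util 0.${pct}: budget $(vllm_budget_mib 12288 "$pct") MiB, headroom $(headroom_mib 12288 "$pct") MiB"
done
```

On this nominal card, dropping from 0.90 to 0.80 roughly doubles the memory left outside vLLM’s budget (about 1.2 GiB to about 2.4 GiB), which is the headroom fix 2 is talking about.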

Don’t trust nvidia-smi here

A subtle point that wastes hours: nvidia-smi reports driver-level reserved memory, not the segments the CUDA allocator can actually hand to vLLM. vLLM’s block allocator queries CUDA directly and can OOM even when nvidia-smi shows a couple of GB “free.” When the two disagree, trust vLLM’s startup log, not the system monitor.
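One way to act on that is to pull the memory-related lines out of the captured startup log instead of eyeballing the monitor. The exact wording of vLLM’s memory log lines changes between releases, so this sketch deliberately uses a loose, case-insensitive filter and does not assume any specific message; the function name and the vllm.log filename are illustrative.

```shell
#!/usr/bin/env bash
# Print startup-log lines that mention memory, the KV cache, or fragmentation.
# Loose filter on purpose: it does not assume any specific vLLM log message.
vllm_memory_lines() {
  grep -iE 'kv cache|memory|fragment' "$1"
}
```

For example, vllm_memory_lines vllm.log | tail -n 20 shows the last memory-related lines the engine printed before it gave up.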

If — and only if — the log explicitly mentions fragmentation, add:

PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True

Don’t set this reflexively. It’s a fix for fragmentation specifically, not a magic OOM cure, and the same PYTORCH_CUDA_ALLOC_CONF knob shows up across the broader CUDA out-of-memory fix guide for Ollama, llama.cpp, and ComfyUI too.

“The NVIDIA driver on your system is too old”

The second-most-common wall, especially right after a fresh pip install vllm:

RuntimeError: The NVIDIA driver on your system is too old (found version XXXX).

Why it happens

vLLM wheels are compiled against a specific CUDA toolkit. As of v0.23.0 the default PyPI wheel targets CUDA 13.0. If your installed driver predates that toolkit, the compiled kernels can’t run. This bites people on stable LTS distros and on cloud images that pin older drivers.
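A quick way to check which side of that line you are on is to compare the installed driver version against the minimum for the wheel’s CUDA toolkit. The sketch below assumes GNU sort (for -V version ordering) and uses nvidia-smi’s --query-gpu=driver_version query in the usage comment. The 580.65.06 minimum shown there is my assumption for CUDA 13.0 on Linux; confirm it against NVIDIA’s CUDA release notes. The function name is illustrative.

```shell
#!/usr/bin/env bash
# Succeeds (exit 0) when driver version $1 is >= required version $2.
# Uses GNU sort -V: if the required version sorts first (or ties), the driver is new enough.
driver_at_least() {
  local have="$1" need="$2"
  [ "$(printf '%s\n%s\n' "$need" "$have" | sort -V | head -n 1)" = "$need" ]
}

# Usage (needs an NVIDIA GPU; the minimum below is an assumed value, verify it):
#   have="$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n 1)"
#   driver_at_least "$have" 580.65.06 && echo "driver OK" || echo "driver too old"
```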

The fixes

Option A — enable CUDA forward compatibility (no driver upgrade). If you’re on the official vLLM Docker image, add:

docker run --gpus all -e VLLM_ENABLE_CUDA_COMPATIBILITY=1 ... vllm/vllm-openai:latest

Outside Docker, install the matching cuda-compat package and point vLLM at it:

sudo apt-get install cuda-compat-13-0
export VLLM_ENABLE_CUDA_COMPATIBILITY=1
export VLLM_CUDA_COMPATIBILITY_PATH=/usr/local/cuda/compat

Option B — upgrade the driver. The cleaner long-term fix. Match your driver to the CUDA 13.0 toolkit minimum. On WSL2, install the latest Game Ready or Studio driver on the Windows host — never install a Linux GPU driver inside the WSL distro, which is the same trap that breaks Ollama GPU detection.

Option C — install a wheel built for your CUDA. vLLM publishes wheels for older CUDA lines. If you’re stuck on an older driver and can’t touch it, pin a wheel whose CUDA target your driver already supports.

A note on Windows

vLLM does not run natively on Windows. The supported paths are WSL2 with NVIDIA’s CUDA passthrough, the official Linux Docker image (now available through Docker Model Runner on Docker Desktop for Windows with WSL2 + NVIDIA GPUs), or a community Windows build. If you’re on bare Windows wondering why pip install vllm then vllm serve does nothing useful, that’s why — move to WSL2. For a pure-Windows experience, this is one of the clearest cases where Ollama or LM Studio wins over vLLM.

The multi-GPU hang at NCCL init

You split a model across two cards with --tensor-parallel-size 2, the log reaches the NCCL setup lines, and then… nothing. No error, no progress, no prompt. This is a collective-communication hang, and it’s reported across issues #8058, #16761, and — newer, on Blackwell — #33041 (TP=2 hanging after NCCL init on CUDA 13.0 with NCCL 2.27.7).

Diagnose first

Turn on NCCL logging to see exactly where it stalls:

NCCL_DEBUG=INFO vllm serve <model> --tensor-parallel-size 2

If it gets stuck during the collective setup phase, work through these in order:

1. Fix the network interface. On boxes with multiple interfaces (Docker bridges, VPNs, multiple NICs) NCCL can pick the wrong one and wait forever for a peer that can’t answer. Pin them explicitly:

export NCCL_SOCKET_IFNAME=eth0
export GLOO_SOCKET_IFNAME=eth0
export VLLM_HOST_IP=<this_host_ip>

Replace eth0 with your real interface (ip addr to find it). VLLM_HOST_IP overrides the address vLLM auto-detects when the network config confuses it.

2. Disable NCCL’s cuMem allocator. A known NCCL bug is worked around by setting this on every vLLM process — and on any external process that opens a NCCL connection to vLLM:

export NCCL_CUMEM_ENABLE=0

3. Force the spawn start method and rule out CUDA graphs.

export VLLM_WORKER_MULTIPROC_METHOD=spawn
vllm serve <model> --tensor-parallel-size 2 --enforce-eager

--enforce-eager removes CUDA graph capture from the equation; if the hang disappears, you’ve isolated graph capture as the cause and can report it cleanly.

4. Just upgrade. Several historical hangs (the zmq-related ones in the 0.5.2–0.5.3.post1 line, for example) were fixed in later releases. If you’re not on v0.23.0, upgrade before debugging further — you may be chasing a bug that’s already patched.
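For step 1, the following sketch narrows down which interface name to pin. It parses the one-line-per-address output of ip -o -4 addr show and drops loopback plus the usual Docker and virtual-bridge names. The exclusion list (lo, docker*, br-*, veth*) and the function name are my assumptions, so adjust them for VPN or other virtual interfaces on your box.

```shell
#!/usr/bin/env bash
# Reads `ip -o -4 addr show` output on stdin and prints "<iface> <ipv4>" for
# interfaces that are plausible NCCL/Gloo targets.
candidate_ifaces() {
  awk '$2 !~ /^(lo|docker.*|br-.*|veth.*)$/ { split($4, a, "/"); print $2, a[1] }'
}

# Usage:
#   ip -o -4 addr show | candidate_ifaces
# then export the chosen name and address, for example:
#   export NCCL_SOCKET_IFNAME=<iface> GLOO_SOCKET_IFNAME=<iface> VLLM_HOST_IP=<ipv4>
```

If more than one line comes back, pick the interface that actually carries traffic between the GPUs’ hosts; on a single machine that is usually the primary LAN interface.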

The silent hang: “Waiting for output from MQLLMEngine”

A cousin of the above: the server appears to start but every request hangs, and the log repeats “Waiting for output from MQLLMEngine.” This means the engine subprocess died or stalled and the front end is waiting on a worker that will never answer. Re-launch with VLLM_LOGGING_LEVEL=DEBUG and --enforce-eager; the engine’s real crash reason (often an OOM or a model-arch mismatch) usually surfaces in the worker logs that the wait message was hiding.
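A sketch for digging that reason out of a captured debug log: drop the repeated wait message, then print only the final Python traceback and whatever follows it. It assumes the log was saved to a file (for example with tee, as in Step 0) and that the worker’s traceback made it into the same file; the function name is illustrative.

```shell
#!/usr/bin/env bash
# Print from the start of the last Python traceback to the end of the log,
# ignoring the repeated wait message.
last_traceback() {
  grep -vF 'Waiting for output from MQLLMEngine' "$1" |
    awk '
      /Traceback \(most recent call last\)/ { buf = ""; on = 1 }  # restart at each new traceback
      on { buf = buf $0 "\n" }                                    # collect lines from there on
      END { printf "%s", buf }
    '
}
```

Running last_traceback vllm.log | tail -n 5 usually puts the real exception on the last line of the output.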

CUDAGraph crashes mid-startup

If the traceback points into self.graph.replay() inside model_runner.py, that’s a CUDA error inside a captured graph — opaque by design, because the error surfaces at replay, not at the offending op. Add --enforce-eager to disable graph capture; the same operation will then fail (or succeed) eagerly with a far more readable traceback you can act on.

A clean, known-good single-GPU launch

When you just want something that starts on a 24 GB card like a used RTX 3090 or an RTX 4090, start conservative and loosen from there:

vllm serve Qwen/Qwen2.5-7B-Instruct \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.85 \
  --dtype auto

If it starts, raise --max-model-len and --gpu-memory-utilization until you hit your throughput or context target. If it doesn’t, you’re back in the OOM section above — pin context lower first.

Don’t have a GPU big enough to test against, or need to validate a multi-GPU tensor-parallel config before buying the second card? Renting a matching instance by the hour on a service like RunPod is cheaper than buying hardware to find out your flags are wrong — and the same flags transfer straight to your local box.

FAQ

Why does vLLM use so much more memory than Ollama for the same model? vLLM pre-allocates the entire KV cache up front based on --max-model-len and concurrency, optimizing for many simultaneous requests. Ollama allocates more lazily for single-user use. That’s the core trade-off: vLLM trades higher idle memory for much higher throughput under load. If you’re a single user, that pre-allocation can feel wasteful — see the vLLM vs Ollama breakdown for when each one wins.

Is --gpu-memory-utilization a percentage of total VRAM or free VRAM? Total. 0.85 means vLLM targets 85% of the card’s total memory for weights plus KV cache. That’s why it can fail when other processes are already using the card: it doesn’t subtract their usage for you. On a card also running a desktop, leave more headroom.

Can I run vLLM on Windows without WSL2? Not on the official wheel. Use WSL2 with NVIDIA CUDA passthrough, the Linux Docker image via Docker Model Runner on Docker Desktop, or a community Windows build. WSL2 is the path most people land on and it’s well-supported with current drivers.

What’s the single first thing to try for any startup OOM? Lower --max-model-len. The default is the model’s full trained context, which is almost never what you need and inflates the KV-cache budget more than any other factor.

Does --enforce-eager hurt performance? Yes, somewhat — it disables CUDA graph optimization, which costs throughput. Use it to diagnose, or to claw back a few hundred MiB on a tight card, but turn it off once you’ve solved the underlying issue if you care about tokens/sec.

Sources

vLLM Troubleshooting — official documentation — VLLM_LOGGING_LEVEL=DEBUG, --enforce-eager, NCCL env vars, VLLM_ENABLE_CUDA_COMPATIBILITY.
vLLM Releases — GitHub — v0.23.0 (Jun 13, 2026), PyTorch 2.11, CUDA 13.0 default wheel, Python 3.14 support.
Issue #2248 — No available memory for the cache blocks — the canonical KV-cache OOM and the gpu_memory_utilization guidance.
Issue #27934 — V1 Engine memory allocation failures, 7B–13B on RTX 3060 12GB — Ampere 12 GB OOM pattern.
Issue #33041 — vLLM hangs after NCCL init with TP=2 on Blackwell (CUDA 13.0, NCCL 2.27.7) — multi-GPU hang on current hardware.
Issue #8058 — vLLM hang at NCCL step on multiple GPUs — interface and NCCL_CUMEM_ENABLE=0 workarounds.
Issue #16761 — NCCL invalid usage error on multi-GPU serve — collective setup failures.
Mistral-Small discussion #72 — “Waiting for output from MQLLMEngine” — the silent engine-subprocess hang.
Docker Model Runner adds vLLM support on Windows — vLLM on Docker Desktop for Windows via WSL2 + NVIDIA.
vllm-windows community build — GitHub — Windows kernels/build for users who can’t use WSL2.

Recommended Gear

RTX 3060 12GB — the budget Ampere card most often hitting the cache-block OOM; pin --max-model-len and it runs 7B fine.
RTX 3090 24GB — the value pick for vLLM; 24 GB and high bandwidth give real KV-cache headroom.
RTX 4090 — fastest single-card option for high-throughput vLLM serving on consumer hardware.

Prices and availability move weekly; verify current retailer listings before buying. Flag values tested against vLLM v0.23.0 as of June 2026.
