Multi-GPU for Local AI in 2026: NVLink vs PCIe, and When Two Cards Actually Help
The VRAM ceiling is what forces most people toward multi-GPU territory. Llama 3.3 70B at Q4_K_M quantization needs roughly 43 GB of GPU memory. No single consumer card clears that bar—the RTX 4090 maxes at 24 GB, the RTX 5090 at 32 GB, and everything in between falls short. Two cards change the math.
But “two cards” means very different things depending on how they communicate. NVLink gives GPUs a direct high-bandwidth wire between them. PCIe routes the same traffic through your CPU’s memory controller. In 2026, the distinction matters more than it used to—because NVLink has quietly disappeared from almost every consumer GPU on the market.
Here’s what the bandwidth numbers actually mean for inference, what modern frameworks do with multiple GPUs, and when the extra complexity is worth it.
Who still has NVLink
NVLink is NVIDIA’s peer-to-peer GPU interconnect. Instead of routing inter-GPU data through the CPU memory bus—as PCIe does—NVLink provides a direct path with its own dedicated bandwidth pool.
On the RTX 3090, NVLink 3.0 delivers 112.5 GB/s of aggregate bidirectional bandwidth between two cards. You need a physical NVLink bridge—a short PCB bar that clips across both GPUs. These sell for $40–$80 used on eBay.
The problem: NVLink was removed starting with the Ada Lovelace generation (RTX 40-series) and doesn’t return on Blackwell consumer cards.
| GPU | NVLink | Notes |
|---|---|---|
| RTX 3090 | Yes — NVLink 3.0, 112.5 GB/s | Only the base 3090, not 3090 Ti |
| RTX 3090 Ti | No | Connector physically removed |
| RTX 4070 / 4080 / 4090 | No | Entire Ada consumer lineup |
| RTX 5080 / 5090 | No | Consumer Blackwell lineup |
| RTX PRO 6000 Blackwell | No | Workstation card, ~$6,000+; multi-GPU runs over PCIe 5.0 |
NVIDIA CEO Jensen Huang confirmed the RTX 4090 removal was intentional: the I/O area NVLink would have occupied was reused to fit more AI processing hardware. The RTX PRO 6000 Blackwell workstation card does not bring NVLink back either (NVLink 5 at 1,800 GB/s is a data-center Blackwell feature), and at $6,000+ it’s outside the home lab conversation anyway.
For anyone building a multi-GPU local AI setup in 2026, the RTX 3090 (specifically the non-Ti variant) is the only consumer card where NVLink is an option.
The RTX 3090 Ti is the most common trap here: it looks like the obvious upgrade, but NVIDIA removed the NVLink connector. Only the base RTX 3090 supports it.
PCIe inter-GPU bandwidth: what you actually get
Without NVLink, two GPUs communicate over PCIe—which means routing through the CPU’s memory controller. The bandwidth depends on the PCIe generation and slot width:
| Interconnect | Bandwidth (per direction) | Common hardware |
|---|---|---|
| PCIe 3.0 x16 | 16 GB/s | Pre-2021 boards |
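| PCIe 4.0 x16 | 32 GB/s | X570/B550, Z590 and newer |
| PCIe 5.0 x16 | 64 GB/s | Z890/X870 (2024+ platforms) |
| NVLink 3.0 (RTX 3090) | 56.25 GB/s | Bridged 3090 pair |
PCIe 4.0, the most common current standard, provides 32 GB/s per direction versus NVLink 3.0’s 56.25 GB/s, so NVLink is about 1.76× faster. On paper a PCIe 5.0 x16 link closes that gap, but only for cards that are themselves PCIe 5.0 devices (the RTX 50-series); an RTX 3090 or 4090 negotiates PCIe 4.0 in any slot. Check how the board wires its slots as well: many consumer platforms drop to x8/x8 when two GPUs are installed, which halves these figures.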
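The table’s round numbers follow from the PCIe signalling rate. Here is a minimal sketch of that arithmetic, assuming the standard 128b/130b line encoding used from PCIe 3.0 onward. The function name is illustrative, and the results are theoretical link maxima rather than measured throughput.

```python
# Per-lane transfer rates in GT/s for each PCIe generation.
GT_PER_S = {3: 8.0, 4: 16.0, 5: 32.0}

def pcie_gb_per_s(gen: int, lanes: int = 16) -> float:
    """Theoretical one-direction bandwidth in GB/s for a PCIe link."""
    encoding = 128 / 130          # 128b/130b line code (PCIe 3.0 and later)
    bits_per_s = GT_PER_S[gen] * 1e9 * encoding * lanes
    return bits_per_s / 8 / 1e9   # bits -> bytes -> GB

NVLINK3_GB_PER_S = 56.25          # RTX 3090 bridge, per direction

for gen in (3, 4, 5):
    for lanes in (16, 8):
        bw = pcie_gb_per_s(gen, lanes)
        print(f"PCIe {gen}.0 x{lanes}: {bw:5.1f} GB/s "
              f"(NVLink 3.0 / PCIe = {NVLINK3_GB_PER_S / bw:.2f})")
```

An x8 link at one generation matches an x16 link of the previous generation, which is why slot wiring matters as much as the platform generation.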
Whether that bandwidth difference matters for inference depends entirely on which parallelism strategy the framework uses.
Two strategies, two bandwidth profiles
There are two fundamentally different ways to distribute a model across GPUs, and they have very different interconnect demands.
Pipeline parallelism (layer split)
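GPU 0 handles the first half of the model’s transformer layers; GPU 1 handles the second half. At the layer boundary, the activation tensor transfers from one GPU to the other: about 16 KB per generated token for a 70B model with an 8,192-wide hidden state at float16, and a few megabytes when a prompt is processed in batches of several hundred tokens.
This transfer happens once per token at each layer boundary. The bandwidth demand is low: even PCIe 3.0 handles it without becoming a bottleneck. llama.cpp’s default layer split (--split-mode layer) uses this approach; --tensor-split only sets how the layers are divided between the cards.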
The cost is efficiency: GPU 0 sits idle while GPU 1 processes its half, and vice versa. You get the combined VRAM of both cards, but autoregressive token generation is essentially sequential between the two GPUs.
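To put a number on that hand-off, here is a rough sketch. It assumes Llama 3.3 70B’s hidden size of 8,192 and float16 activations, and it ignores link latency and protocol overhead. The helper name is made up for illustration.

```python
HIDDEN_SIZE = 8192       # Llama 3.3 70B hidden dimension
BYTES_PER_VALUE = 2      # float16

def handoff_bytes(tokens: int) -> int:
    """Activation bytes crossing the GPU 0 -> GPU 1 boundary for a batch of tokens."""
    return tokens * HIDDEN_SIZE * BYTES_PER_VALUE

PCIE3_X16 = 16e9         # bytes/s per direction, from the table above

for label, tokens in [("decode, 1 token", 1), ("prefill, 512 tokens", 512)]:
    size = handoff_bytes(tokens)
    micros = size / PCIE3_X16 * 1e6
    print(f"{label}: {size / 1024:.0f} KiB, ~{micros:.0f} us over PCIe 3.0 x16")
```

Even the 512-token case takes about half a millisecond on the slowest link in the table, which is small next to the time each GPU spends reading its share of the weights.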
Tensor parallelism (every-layer split)
Both GPUs process every transformer layer simultaneously. Each holds half the weight matrices, computes in parallel, then synchronizes partial results via an all-reduce after each layer.
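Llama 3.3 70B has 80 transformer layers, so every forward pass involves at least 80 all-reduce round trips. During prompt processing each one carries 8–32 MB of float16 activation data (512 to 2,048 tokens at 16 KB apiece); during single-token generation the payload shrinks to kilobytes and link latency matters more than raw bandwidth. This is where the interconnect matters.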
At NVLink 3.0 speeds (56.25 GB/s per direction), those synchronizations clear quickly. Over PCIe 4.0 (32 GB/s per direction), the link starts saturating at longer context lengths. At 4k context—typical for most interactive use—dual RTX 4090s over PCIe 4.0 with tensor parallelism achieve roughly 85–90% of equivalent NVLink-connected throughput. At 32k+ context, the penalty grows.
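A back-of-the-envelope comparison of the two links, as a sketch only. It takes the figure of one all-reduce per layer used above, assumes a two-GPU ring all-reduce in which each card sends roughly one payload’s worth of data, and ignores latency and any overlap with compute. Real NCCL behavior will differ.

```python
LAYERS = 80                     # Llama 3.3 70B transformer layers
HIDDEN_SIZE = 8192
BYTES_PER_VALUE = 2             # float16

def allreduce_seconds(tokens: int, link_bytes_per_s: float,
                      allreduces_per_layer: int = 1) -> float:
    """Bandwidth-only time spent in all-reduce for one forward pass."""
    payload = tokens * HIDDEN_SIZE * BYTES_PER_VALUE
    # With 2 GPUs, a ring all-reduce moves about one payload per direction.
    return LAYERS * allreduces_per_layer * payload / link_bytes_per_s

LINKS = {"NVLink 3.0": 56.25e9, "PCIe 4.0 x16": 32e9}

for tokens in (1, 2048):
    for name, bw in LINKS.items():
        ms = allreduce_seconds(tokens, bw) * 1e3
        print(f"{tokens:>5} tokens over {name}: {ms:.3f} ms")
```

For a single generated token the bandwidth term is tens of microseconds on either link, so per-transfer latency dominates. The gap between the links shows up in prompt processing and batched serving, where payloads are thousands of times larger.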
llama.cpp multi-GPU setup in 2026
llama.cpp has two distinct paths for multi-GPU.
Layer split (default, widely supported):
```bash
llama-server \
  --model Meta-Llama-3.3-70B-Instruct-Q4_K_M.gguf \
  --tensor-split 0.5,0.5 \
  --n-gpu-layers 999 \
  --ctx-size 4096 \
  --port 8080
```
The --tensor-split 0.5,0.5 divides layers evenly across both GPUs. Adjust the ratio if cards have different VRAM—e.g., 0.6,0.4 for a 24GB + 16GB pair. Ollama uses this same mechanism automatically when multiple CUDA GPUs are detected.
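For mismatched cards, the ratio is just each card’s share of the total VRAM. A small sketch of that calculation follows; the helper is hypothetical and not part of llama.cpp, and it uses nominal VRAM sizes rather than what is actually free on each card.

```python
def tensor_split_arg(vram_gb: list[float]) -> str:
    """Turn per-GPU VRAM sizes into a --tensor-split value such as '0.60,0.40'."""
    total = sum(vram_gb)
    if total <= 0:
        raise ValueError("VRAM sizes must sum to a positive number")
    return ",".join(f"{v / total:.2f}" for v in vram_gb)

print(tensor_split_arg([24, 16]))   # 24 GB + 16 GB pair
print(tensor_split_arg([24, 24]))   # matched pair
```

In practice you may want to give slightly less to the card that also drives a display, since its free VRAM is lower than the nominal figure.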
True tensor parallelism (merged April 2026, build b8738+):
Mainline llama.cpp gained real tensor parallelism in April 2026 via build b8738, using NCCL (NVIDIA) or RCCL (AMD) for topology-aware communication. It auto-detects NVLink vs PCIe and adjusts synchronization accordingly.
```bash
# Build with NCCL tensor parallel support (run from the llama.cpp source root)
mkdir -p build && cd build
cmake .. \
  -DGGML_CUDA=ON \
  -DGGML_CUDA_FORCE_DMMV=OFF \
  -DGGML_CUDA_PEER_MAX_BATCH_SIZE=128
make -j$(nproc)

# Launch with tensor parallelism across 2 GPUs (CMake places binaries in bin/)
./bin/llama-server --model model.gguf --tp 2 -ngl 999 --ctx-size 4096
```
Benchmarks from the PR show 3–4× gains over layer split—but that headline applies primarily to MoE architectures (Qwen3 MoE, Llama 4 MoE), where expert routing creates uneven layer utilization that hurts pipeline parallelism. For dense models like Llama 3.3 70B, the improvement over layer split is more modest: 1.3–1.6× in typical benchmarks.
vLLM: the high-concurrency path
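vLLM’s standard multi-GPU path is tensor parallelism, enabled with --tensor-parallel-size:
```bash
# --quantization awq needs an AWQ-quantized checkpoint; the original FP16
# meta-llama/Llama-3.3-70B-Instruct weights (~140 GB) will not fit in 48 GB.
# AWQ_MODEL is a placeholder: set it to the AWQ checkpoint you use.
vllm serve "$AWQ_MODEL" \
  --tensor-parallel-size 2 \
  --quantization awq \
  --dtype float16
```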
vLLM probes the NCCL topology on startup and automatically uses NVLink-optimized all-reduce on bridged cards, falling back to PCIe-routed NCCL otherwise. For single-user local inference, both paths produce comparable results. For multi-user serving (8+ concurrent requests), pipeline parallelism via --pipeline-parallel-size 2 often outperforms tensor parallel on PCIe—it avoids the all-reduce overhead entirely, with each GPU acting as an independent pipeline stage.
The full concurrency breakdown is in vLLM vs Ollama in 2026: When Each One Wins.
Real performance: what to expect
Llama 3.3 70B Q4_K_M needs ~43 GB of VRAM—here’s what the main consumer configurations look like:
| Configuration | Total VRAM | Llama 3.3 70B Q4 tok/s | Power draw |
|---|---|---|---|
| Single RTX 4090 | 24 GB | ~8–10 tok/s (Q2 only, quality loss) | 450W |
| Single RTX 5090 | 32 GB | ~15–18 tok/s (Q3 max) | 575W |
| Dual RTX 3090 NVLink | 48 GB | 15–20 tok/s | ~700W combined |
| Dual RTX 3090 PCIe 4.0 | 48 GB | ~10–14 tok/s (est.) | ~700W combined |
| Dual RTX 4090 PCIe 4.0 | 48 GB | ~28–40 tok/s (est.) | ~900W combined |
The dual RTX 3090 NVLink numbers are the best-benchmarked: 15–20 tok/s for Llama 3.3 70B Q4_K_M in llama.cpp layer-split mode, confirmed across multiple community setups. NVLink does not merge the two cards into a single memory pool (llama.cpp still assigns layers to each GPU), so its advantage over PCIe comes from faster activation hand-offs, which matter most during prompt processing and at longer contexts.
The dual RTX 4090 PCIe figure is an estimate, not a measurement. The per-card memory bandwidth advantage is small (1,008 GB/s on the 4090 vs 936 GB/s on the 3090, about 8%), so the upper part of that range assumes tensor parallelism, where both cards read their half of the weights at the same time. Published benchmarks for this exact configuration vary by software stack. If you’re evaluating this setup, budget for the lower end of that range with PCIe 4.0.
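One way to sanity-check these ranges is the memory-bandwidth ceiling: each generated token has to read every weight once, so tok/s cannot exceed bandwidth divided by model size. Below is a rough sketch under that assumption. It ignores KV-cache reads, compute, and interconnect time, so real numbers land well below it, and the function is illustrative only.

```python
MODEL_GB = 43.0   # Llama 3.3 70B at Q4_K_M, per the figure above

def ceiling_tok_per_s(mem_bw_gb_s: float, n_gpus: int, parallel: bool) -> float:
    """Upper bound on decode speed when weight reads are the only cost.

    parallel=False models layer split: GPUs take turns, so the weights are
    streamed at one card's bandwidth. parallel=True models tensor
    parallelism: every card streams its shard at the same time.
    """
    effective_bw = mem_bw_gb_s * (n_gpus if parallel else 1)
    return effective_bw / MODEL_GB

for card, bw in [("RTX 3090", 936.0), ("RTX 4090", 1008.0)]:
    layer = ceiling_tok_per_s(bw, 2, parallel=False)
    tensor = ceiling_tok_per_s(bw, 2, parallel=True)
    print(f"2x {card}: layer split <= {layer:.1f} tok/s, "
          f"tensor parallel <= {tensor:.1f} tok/s")
```

Read this way, the 15–20 tok/s measured on dual 3090s sits close to the layer-split ceiling, and anything above roughly 23 tok/s on dual 4090s requires both cards to work in parallel.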
For 30B-range models, multi-GPU provides diminishing returns:
| Config | Qwen3 32B Q4 tok/s | Notes |
|---|---|---|
| Single RTX 4090 | 35–40 tok/s | Fits in 24 GB, no inter-GPU overhead |
| Single RTX 5090 | 50–60 tok/s | 32 GB gives substantial headroom |
| Dual RTX 3090 NVLink | ~38–46 tok/s | Overkill on VRAM, minor throughput gain |
A single RTX 4090 handles Qwen3 32B Q4 cleanly within its 24 GB. Adding a second GPU for a 30B model increases power consumption by ~350W for a marginal speed bump. The full VRAM picture across quantization levels is in How Much VRAM Do You Need for Llama Models.
What multi-GPU actually costs you
The benchmark numbers are the easy part.
Thermal density. Two RTX 3090s each run at 350W TDP. In a standard ATX mid-tower, even well-ventilated cards end up sharing hot exhaust air—particularly when open-air designs stack their cooling zones. Most people with 24/7 dual-GPU inference setups end up either undervolting to reduce heat (which drops power draw to ~280W per card with minimal performance loss) or moving to a server chassis with proper linear airflow.
PCIe slot requirements. Two full-size cards in neighboring x16 slots leave at most one slot of clearance between them, so the first card’s exhaust blows directly into the second card’s intake. Boards that space the two x16 slots further apart fix this; check the spacing before buying a board for a dual-GPU build.
Power supply. Two RTX 3090s or 4090s need a 1,000W+ PSU with the right connector count. The RTX 4090 uses a 16-pin 600W connector; two of them plus CPU and storage easily pushes 1,000–1,200W system draw. Our PSU sizing guide for AI workstations has the exact calculation.
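The arithmetic behind that range can be sketched as follows. The GPU figure is the 4090’s 450 W from the table above. The CPU and other-component wattages and the 20% headroom are placeholder assumptions, not measurements, so substitute your own parts.

```python
def system_draw_watts(gpu_watts: float, n_gpus: int,
                      cpu_watts: float, other_watts: float) -> float:
    """Sum of sustained component draw at full load."""
    return gpu_watts * n_gpus + cpu_watts + other_watts

def psu_watts_with_headroom(draw_watts: float, headroom: float = 0.20) -> float:
    """PSU rating that keeps the sustained draw below (1 - headroom) of capacity."""
    return draw_watts / (1.0 - headroom)

# Assumed: 150 W CPU under inference load, 50 W for drives, fans, and RAM.
draw = system_draw_watts(gpu_watts=450, n_gpus=2, cpu_watts=150, other_watts=50)
print(f"Estimated draw: {draw:.0f} W")
print(f"Suggested PSU:  {psu_watts_with_headroom(draw):.0f} W or larger")
```

Power-limiting the cards, as described under thermal density, lowers the GPU term directly and is often cheaper than a larger PSU.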
Motherboard compatibility for NVLink. The NVLink bridge requires two full-size PCIe slots at exactly two-slot or three-slot spacing. Not all ATX boards accommodate this. X570 and B550 boards vary—verify the physical slot layout against the bridge dimensions before purchasing. The bridge itself comes in 2-slot and 3-slot variants.
Software setup time. Layer split in Ollama or llama.cpp is plug-and-play—Ollama detects both GPUs and distributes automatically. Tensor parallelism in the April 2026 llama.cpp build still has known issues with some ROCm combinations and certain GGUF quantization formats. Expect to spend an afternoon debugging if you deviate from the standard CUDA + GGUF path.
The single-device alternative worth considering
For 70B inference specifically, a Mac Studio M3 Ultra with 96 GB unified memory is worth putting in the comparison. Its 819 GB/s of memory bandwidth caps a dense ~43 GB model at roughly 19 tok/s in theory, so realistic Llama 3.3 70B throughput at 4-bit via MLX lands in the same range as dual RTX 3090 NVLink rather than the 40–50 tok/s sometimes quoted. You get that with one box, one power cable, and zero inter-GPU configuration.
The trade-off is real: no CUDA ecosystem, image generation (Flux, SDXL) runs 3–5× slower than on an NVIDIA card, and you’re locked into Apple’s hardware cadence. For text-only inference workflows, it’s a legitimate single-device competitor to dual-GPU. We covered the full hardware comparison in Mac Studio M3 Ultra vs Dual RTX 4090.
For occasional 70B jobs without the permanent hardware commitment, RunPod community pods offer dual RTX 4090 instances at ~$0.54/hr. That’s useful for testing a 70B model’s behavior before deciding whether the build is worth it.
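Whether renting beats buying comes down to hours of use. Here is a simple break-even sketch using this article’s prices. The electricity rate and the 700 W draw are illustrative assumptions, and the rented pod (dual 4090) is faster than the dual 3090 build it is compared with, so treat the result as a rough guide.

```python
def break_even_hours(hardware_cost: float, rental_per_hour: float,
                     power_kw: float = 0.0, price_per_kwh: float = 0.0) -> float:
    """Hours of use after which owning is cheaper than renting."""
    home_cost_per_hour = power_kw * price_per_kwh
    saving_per_hour = rental_per_hour - home_cost_per_hour
    if saving_per_hour <= 0:
        raise ValueError("renting is never more expensive at these rates")
    return hardware_cost / saving_per_hour

# $1,400 dual RTX 3090 build vs a $0.54/hr rented pod.
no_power = break_even_hours(1400, 0.54)
# Assumed: 0.7 kW under load at $0.15/kWh.
with_power = break_even_hours(1400, 0.54, power_kw=0.7, price_per_kwh=0.15)
print(f"Ignoring electricity: {no_power:.0f} hours")
print(f"With electricity:     {with_power:.0f} hours")
```

At a couple of hours of 70B use per week, the break-even point is years away; at several hours a day it arrives within the hardware’s useful life.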
Honest take: who should actually go multi-GPU
Build a dual-GPU system if:
- Your primary model is 70B or larger at Q4 quality (43+ GB VRAM required)
- You’re running multi-user inference—the extra VRAM dramatically expands the KV cache you can maintain per user, which matters at 8+ concurrent sessions
- You already own one RTX 3090 or 4090 and can add a second one for under $700 (used 3090) or under $2,400 (used 4090)
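How many sessions the spare VRAM can hold is easy to estimate. This sketch assumes Llama 3.3 70B’s published shape (80 layers, 8 key/value heads, head dimension 128), a float16 cache, and no allowance for activations or framework overhead, so treat the result as an upper bound.

```python
LAYERS = 80
KV_HEADS = 8            # grouped-query attention
HEAD_DIM = 128
BYTES_PER_VALUE = 2     # float16 cache

def kv_bytes_per_token() -> int:
    """Bytes of K and V stored per token across all layers."""
    return 2 * LAYERS * KV_HEADS * HEAD_DIM * BYTES_PER_VALUE

def max_sessions(total_vram_gb: float, weights_gb: float, ctx_tokens: int) -> int:
    """Full-context sessions that fit in the VRAM left after the weights."""
    free_bytes = (total_vram_gb - weights_gb) * 1e9
    per_session = kv_bytes_per_token() * ctx_tokens
    return int(free_bytes // per_session)

print(kv_bytes_per_token() // 1024, "KiB per token")
print(max_sessions(48, 43, 4096), "sessions at 4k context on 2x 24 GB")
```

With a 43 GB model on 48 GB the margin is thin. The multi-user benefit of a second card is larger with smaller models, shorter contexts, or a quantized KV cache.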
Stick with single GPU if:
- Your primary model is 34B or smaller: a single RTX 4090 runs Qwen3 32B at Q4 cleanly with no offload
- You’re primarily doing image generation (Flux, SDXL)—these don’t parallelize efficiently across consumer GPUs
- You’re on an older motherboard with PCIe 3.0 slots—the bandwidth penalty on tensor parallelism is significant enough that the second card delivers less than expected
If you go dual, prefer:
- Dual RTX 4090 (PCIe) over dual RTX 3090 (NVLink or not): higher per-card memory bandwidth wins on total throughput for inference, and the NVLink bridge (about $40–$80 used) doesn’t buy enough extra tok/s on the 3090 pair to change that
- That said, dual RTX 3090 NVLink at ~$1,400 total (two used 3090s + bridge) is still the cheapest path to a verified 15–20 tok/s on 70B Q4—the used RTX 3090 value case still holds if your budget is tight
NVLink as a consumer feature is essentially finished. The RTX 3090 pair is the last consumer configuration where it exists, and within three to four years those cards will age out of relevance as mainstream models push past what 48 GB can hold. Plan your multi-GPU build around PCIe parallelism—it performs well enough that the interconnect is not your bottleneck for inference workloads.
Sources
- Nvidia kills off NVLink on RTX 4090 — Windows Central
- NVLINK port support for RTX 3090 Ti, RTX 4080/4090 — NVIDIA Developer Forums
- Dual NVIDIA GeForce RTX 3090 NVLink Performance Review — ServeTheHome
- PCI Express 4.0 Specs: Bi-Directional Bandwidth 64 GB/s — HotHardware
- Multi-GPU LLM Setup 2026 — Run 70B-405B Locally — Compute Market
- llama.cpp Releases in April 2026: Tensor Parallelism — Fazm Blog
- RTX 5090 Blackwell: No NVLink on Consumer Cards — RunPod
- Llama 3.3 70B VRAM requirements — LocalLLM.in
- vLLM Multi-GPU Setup: NVLink vs PCIe — ServerMO
- Multi-GPU LLM Inference Guide — NVLink vs PCIe, Tensor Parallelism — Will It Run AI Blog
Last updated May 21, 2026. GPU prices and availability shift weekly; verify current listings before purchasing.