SSD for Local AI in 2026: Why Your NVMe Drive Matters More Than You Think

Tags: nvme, ssd, storage, local-ai, llm, hardware, buying-guide

When building a local AI workstation, storage gets treated as an afterthought. You spend weeks choosing the right GPU, obsess over VRAM tiers, and debate PCIe 4.0 vs 5.0 for your GPU slot — then throw in whatever 2TB drive happened to be on sale. That’s the wrong order of operations.

The SSD is where your model lives until the moment you fire up a prompt. Every cold start, every model switch, every reboot means the inference engine pulls tens of gigabytes off disk into system RAM or VRAM before the first token generates. On the wrong storage, a 40GB model makes you wait over a minute. On a good NVMe drive, it’s under 15 seconds.

Here’s the full breakdown: how storage actually affects local AI workloads, where the bottleneck is, and exactly which drives to buy in 2026.


Why Storage Is the Bottleneck Nobody Benchmarks

When you run ollama run llama3.3:70b or load a model in LM Studio, the inference engine reads the entire model file from disk before inference begins. For a 70B model quantized to Q4_K_M format, that’s approximately 40 GB of sequential read every cold start.

At SATA SSD speeds, that read takes over a minute. On a spinning hard drive, closer to five minutes — before the first token ever appears.

The core math: model file size ÷ storage sequential read speed = theoretical minimum load time. In practice, inference engines like llama.cpp don't do pure sequential reads; they use memory mapping (mmap), with the OS handling I/O in chunks alongside metadata parsing and VRAM allocation. That overhead adds latency beyond what raw throughput predicts, but the ratios between storage tiers hold. The CraftRigs benchmark confirms it directly: a 40GB model loads in 70+ seconds on SATA versus under 15 seconds on Gen 4 NVMe, a real-world gap of roughly 5×.

This matters more than most users realize. Local AI users swap models constantly: a fast 7B for quick answers, a slower 70B for multi-step reasoning, an SDXL checkpoint for image generation. Each switch is a full disk read. If every switch costs 70 seconds, you stop doing it. If it costs 10 seconds, you do it freely.


The Model Size Reality

Before comparing storage tiers, it’s worth knowing what file sizes you’re actually dealing with. These are approximate sizes for Q4_K_M GGUF quantization, which is the most common format for local inference:

| Model | Quantization | Approximate File Size |
|---|---|---|
| Llama 3.2 3B | Q4_K_M | ~2 GB |
| Llama 3.1 8B | Q4_K_M | ~5 GB |
| Phi-4 14B | Q4_K_M | ~9 GB |
| Qwen 3 30B | Q4_K_M | ~18 GB |
| Llama 3.3 70B | Q4_K_M | ~40 GB |
| Mistral Large 123B | Q4_K_M | ~70 GB |

The 70B models are where storage becomes a genuine workflow tax. At 40 GB, you're asking your drive to work hard at every session startup. Add ComfyUI alongside your LLM (SDXL checkpoints are 6–7 GB, Flux.1 Dev is 23 GB) and a single session setup can push 60 GB or more off disk.

If you also keep multiple quantization variants of the same model (Q4_K_M for speed, Q8_0 for quality), that one 70B model becomes 100+ GB across formats. Storage fills faster than people expect.


Storage Type Comparison: Load Time for a 40GB Model

The load times below are based on advertised sequential read speeds and real-world benchmarks. Actual load times run 20–40% longer than pure-throughput math predicts, due to inference engine overhead. The ratios between tiers are consistent.

| Storage Type | Sequential Read | Theoretical Load (40 GB) | Estimated Real-World Load |
|---|---|---|---|
| Spinning HDD | ~150 MB/s | ~270 sec | 5–8 minutes |
| SATA SSD | ~550 MB/s | ~75 sec | 70–90 seconds |
| NVMe Gen 3 (PCIe 3.0) | ~3,500 MB/s | ~12 sec | 18–25 seconds |
| NVMe Gen 4 (PCIe 4.0) | ~7,000–7,450 MB/s | ~6 sec | 10–15 seconds |
| NVMe Gen 5 (PCIe 5.0) | ~14,000–14,900 MB/s | ~3 sec | 8–12 seconds |

The SATA-to-Gen-4 jump is transformative. The Gen-4-to-Gen-5 jump is marginal.

On HDDs: If you’re still storing models on a spinning drive, this is your most urgent hardware upgrade — more impactful than most GPU bumps. Five to eight minutes per load destroys any workflow that involves model switching. A $120 Gen 4 NVMe fixes it permanently.

On SATA SSD: You feel this every session. The upgrade to Gen 4 NVMe recovers 60+ seconds per model load. If you switch models 5–10 times a day, that’s 5–10 minutes of dead time you get back, compounding daily.

On Gen 3 NVMe: Acceptable, not optimal. You’re at 18–25 seconds for a 70B load — workable if you’re not switching models frequently. Upgrading to Gen 4 saves another 5–10 seconds, worth doing if you’re replacing a drive for capacity anyway.


Why the Gen 4-to-Gen 5 Gap Is Smaller Than You’d Expect

Here’s where the math gets interesting. Gen 5 drives are roughly 2× faster on paper — 14,000–14,900 MB/s versus 7,000–7,450 MB/s for Gen 4. But the practical cold-start improvement for LLM loading is only 2–4 seconds on a 40GB model.

Why the mismatch? Real-world loading speed through the Python/llama.cpp API tops out at roughly 1,300–2,000 MB/s, regardless of whether your drive can do 7,000 or 14,000 MB/s. The bottleneck shifts to:

  • Memory-mapping overhead in the OS
  • Layer-by-layer allocation as the model loads into VRAM
  • Metadata parsing and weight verification in the inference engine

Both Gen 4 and Gen 5 drives saturate the software’s ability to consume data. The hardware is no longer the limit — the inference engine is. That’s why the Samsung 9100 Pro (Gen 5, 14,800 MB/s) loads a 7B model in 2.6 seconds while a Gen 4 drive doing the same task might take 3.5–4 seconds. For a 70B model, the gap grows to maybe 3–5 seconds in total.

At a $60–$90 premium over Gen 4 for a 2TB drive, that math doesn’t favor Gen 5 for LLM-only workloads.


Drive Recommendations for Local AI Workstations

All prices are as of May 2026 and will fluctuate — verify before purchasing.

Best All-Round: Samsung 990 Pro 2TB (~$150)

7,450 MB/s sequential read, 6,900 MB/s write. The 990 Pro is the most mature, well-tested Gen 4 drive in the enthusiast market, with a thermal design that holds sustained throughput without throttling. Available at Amazon and Newegg. If you want the proven option and don’t want to think about it, buy this.

Best Value: WD Black SN850X 2TB (~$156)

Within $10 of the Samsung at 7,300 MB/s read. Real-world load times are indistinguishable from the 990 Pro. WD has a strong track record in sustained workloads, and the SN850X is available at B&H Photo and Amazon. Buy whichever is cheaper on the day you’re ordering.

Budget Gen 4: Kingston KC3000 2TB (~$120)

7,000 MB/s sequential read, $30 less than the premium options. For pure sequential model loading — which is exactly what this use case demands — it matches the top-tier drives. The controller is less consistent under heavy sustained writes, but model loading is read-dominated. Solid choice if the savings go toward more drive capacity elsewhere.

High Capacity: Sabrent Rocket 4 Plus 4TB (~$280)

If you store multiple 70B variants, ComfyUI checkpoints, and a Stable Diffusion model library, 2TB fills up fast. The Rocket 4 Plus at 7,100 MB/s gives you Gen 4 speed with real capacity headroom, at a price that beats most Gen 5 2TB options. The right choice if you’re constantly juggling model files.

Skip Unless You Have Other Use Cases: Crucial T705 2TB ($220) or Samsung 9100 Pro 4TB ($549)

Both are excellent drives. Neither meaningfully speeds up LLM cold-start times compared to Gen 4. Recommended only if your build also handles video editing, large dataset processing, or other workloads where 14,000+ MB/s sequential throughput pays off across the full workflow.


Capacity Planning: How Much Storage Do You Actually Need?

| Drive Size | What It Holds |
|---|---|
| 1TB | 2–3 large 70B models, or 10–15 small models. Tight; you'll manage files constantly |
| 2TB | 4–5 large models + ComfyUI + OS + tools. Comfortable for most users |
| 4TB | Full working set of 70B models across quantization variants, Flux/SDXL checkpoints, datasets, room to grow |

The math is unforgiving. One Llama 3.3 70B kept in Q4_K_M, Q5_K_M, and Q8_0 occupies roughly 160 GB combined. Three 70B models in a single Q4_K_M variant take 120 GB. Add ComfyUI with a couple of checkpoints: another 30–40 GB. The OS and tools: 60–80 GB. Either way, you're in the 250–300 GB range with what is still a minimal setup.

2TB is the practical minimum for serious local AI work with 70B-class models. 4TB is the comfortable option.

One important point: models should live on the same NVMe drive as your OS, not on a secondary slower disk. The common setup of “SSD for OS, spinning drive for models” defeats the purpose entirely — the slow drive’s read speed still gates every model load. Dedicate your fastest drive to the model working set.


What About NAS or Network Storage?

Don’t load models from it in real time.

Gigabit Ethernet tops out at ~125 MB/s actual throughput — worse than SATA for sequential reads. Even 2.5 GbE maxes at ~300 MB/s. Loading a 40GB model over a 2.5 GbE NAS takes 2–3 minutes regardless of how fast the NAS drives are, because the network is the limit.

The one valid NAS configuration for local AI: use the NAS as a cold archive for models you don’t actively use, and keep the working set on local NVMe. When you want to work with a model you’ve archived, transfer it to the NVMe drive first, then load it. That workflow is covered in detail in the When NOT to Use a NAS for Local LLMs article.


Where Storage Fits in the Full Build

Storage affects cold-start time and model availability, not inference speed. Once a model is loaded into VRAM, the SSD doesn’t matter at all for token generation.

For a complete hardware picture:

  • GPU VRAM is the primary bottleneck for which models you can run and at what speed. See the GPU buying guide for the 2026 tier breakdown.
  • System RAM determines how much model you can offload to CPU when VRAM isn’t enough. See the system RAM guide.
  • CPU matters less than most builders expect for LLM inference specifically. The CPU guide for AI workstations explains why.
  • PSU sizing for multi-GPU setups and high-TDP cards: covered in the PSU sizing guide.

Storage sits downstream of all of these: it doesn’t accelerate inference once the model is loaded, but it controls the friction cost of every cold start.


Honest Take

If you’re on SATA SSD: The Gen 4 NVMe upgrade pays back immediately and repeatedly. Every model load is 60+ seconds you get back. At $120–$150 for a 2TB Gen 4 drive, this is the highest-value hardware upgrade available to you right now if you haven’t made it yet.

If you’re on Gen 3 NVMe: You’re in acceptable territory — 18–25 seconds for a 70B load. Upgrading to Gen 4 shaves another 5–10 seconds. Worth doing when you’re buying a new drive for capacity, but not worth pulling out a working drive for speed alone.

If you’re on Gen 4 NVMe: You’re at the sweet spot. Gen 5 won’t change your LLM workflow in any meaningful way. Spend the premium elsewhere — more drive capacity, more system RAM, or the next GPU tier.

The 2026 baseline for serious local AI work: 2TB Gen 4 NVMe. Anything slower than that is paying a daily productivity tax.


Last updated May 9, 2026. Prices and specs change; verify current rates before purchasing.