How much VRAM does Llama 3.1 70B need?

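About 32 GB, the minimum this recipe targets, and only with an IQ3 quant (~27-30 GB of weights). The default Q4_K_M build needs ~40 GB and does not fit.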

How hard is this setup?

Advanced. Follow the steps below.

Llama 3.1 70B on RTX 5090: Fitting a 70B Model on One 32 GB Card with IQ3 GGUF

What You'll Build

A single-GPU local deployment of Llama 3.1 70B Instruct on one RTX 5090 (32 GB), served by llama.cpp with full GPU offload. The honest framing up front: a 70B model does not fit a 32 GB card at the usual Q4 quality tier. The only single-card path is an aggressive sub-4-bit (IQ3) quant whose weights occupy ~27-30 GB of the 32 GB envelope, leaving a narrow margin for the KV cache.

Hardware data: RTX 5090 (32 GB VRAM) · weights ~27-30 GB at IQ3 · KV-cache-limited · See benchmark data

⚠️ Read this before you start: the fit is the whole story. The default `ollama pull llama3.1:70b` is Q4_K_M, listed at 43 GB (ollama.com/library/llama3.1:70b), and it will not load on a 32 GB card. The standalone Q4_K_M GGUF is the same size: 42,520,398,400 bytes, which is ≈ 42.5 GB in decimal units or ≈ 39.6 GiB (bartowski card, lmstudio-community card). To get 70B onto one 5090 you must drop to IQ3 (covered below). For full Q4+ quality, use two GPUs or CPU offload; both are out of scope for this single-card recipe.

Requirements

| Component | Minimum | Tested |
|---|---|---|
| GPU | 32 GB VRAM (no 24 GB card fits any usable 70B quant) | RTX 5090 (32 GB) |
| RAM | 32 GB | n/a |
| Storage | ~30 GB for the IQ3 weight file | ~30 GB |
| Software | llama.cpp with CUDA (cu128 / sm_120) or Ollama | n/a |

Why IQ3, and which file

A 70B model needs roughly params × bytes-per-weight of VRAM for the weights alone, plus KV cache and activations on top. At the common quality tier (Q4_K_M, nominally ~4.5 bits/weight and closer to 4.8 in the published file, because some tensors are kept at higher precision) that is ~40 GiB, well past the 5090's 32 GB. Dropping below 4 bits is the only way onto a single card. bartowski's GGUF card states the rule directly: "Aim for a quant with a file size 1-2GB smaller than your GPU's total VRAM" and, for sub-Q4 on NVIDIA, "if you're aiming for below Q4, and you're running cuBLAS (Nvidia) or rocBLAS (AMD), you should look towards the I-quants" (bartowski README). On a CUDA card that points at the IQ3 tier.
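As a rough check on that arithmetic, the sketch below converts a parameter count and a bits-per-weight figure into a weight footprint, and works backwards from a file size to the effective bits per weight. The function names are illustrative, and the 70.55 billion parameter count is an assumption based on the model's published size, not a figure from this recipe.

```sh
# Estimated weight size in decimal GB: params (in billions) x bits-per-weight / 8.
est_weight_gb() {
  awk -v p="$1" -v bpw="$2" 'BEGIN { printf "%.1f\n", p * bpw / 8 }'
}

# Effective bits per weight implied by an on-disk file size in bytes.
# Second argument: parameter count in billions (assumed, see above).
implied_bpw() {
  awk -v bytes="$1" -v p="$2" 'BEGIN { printf "%.2f\n", bytes * 8 / (p * 1e9) }'
}

est_weight_gb 70.55 4.5          # nominal 4.5 bits/weight   -> 39.7
implied_bpw 42520398400 70.55    # published Q4_K_M file     -> 4.82
implied_bpw 31937038912 70.55    # published IQ3_M file      -> 3.62
```

The gap between the nominal and implied figures is a reminder to size against the actual file, not the quant's headline bit width.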

The two single-card candidates, with on-disk sizes verified via the HuggingFace tree API across multiple independent quanters:

| Quant | File size (bytes) | ≈ GiB | Fits 32 GB? | bartowski quality note |
|---|---|---|---|---|
| IQ3_M (lead) | 31,937,038,912 | 29.74 | Yes, tight (~2 GB for KV) | "Medium-low quality, new method with decent performance comparable to Q3_K_M" |
| IQ3_XS (KV-friendlier) | 29,307,734,592 | 27.30 | Yes (~4-5 GB for KV) | "Lower quality, new method with decent performance, slightly better than Q3_K_S" |
| Q4_K_M (does NOT fit) | 42,520,398,400 | 39.60 | No, overflows 32 GB | reference tier (needs multi-GPU / CPU offload) |

IQ3_M and Q4_K_M sizes are corroborated by bartowski and lmstudio-community; IQ3_XS and Q3_K_S by bartowski and MaziyarPanahi. All numbers are on-disk byte counts, not measured runtime peaks — see /check/llama-3-1-70b/rtx-5090 for first-party runtime measurements once the community submits them.
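The GiB column and the KV headroom figures follow directly from the byte counts. A minimal sketch of that conversion is below; it assumes the card exposes a full 32 GiB, which is optimistic because the CUDA context and compute buffers also take space. The function names are illustrative.

```sh
# Convert a byte count to GiB (1 GiB = 1024^3 bytes).
bytes_to_gib() {
  awk -v b="$1" 'BEGIN { printf "%.1f\n", b / (1024 * 1024 * 1024) }'
}

# GiB left on a card of vram_gib GiB after loading a file of the given byte size.
# A negative result means the weights alone overflow the card.
headroom_gib() {
  awk -v b="$1" -v v="$2" 'BEGIN { printf "%.1f\n", v - b / (1024 * 1024 * 1024) }'
}

headroom_gib 31937038912 32    # IQ3_M   ->  2.3
headroom_gib 29307734592 32    # IQ3_XS  ->  4.7
headroom_gib 42520398400 32    # Q4_K_M  -> -7.6
```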

Recommendation: lead with IQ3_M (29.74 GiB) for the best quality that still respects bartowski's "1-2 GB under VRAM" rule; switch to IQ3_XS (27.30 GiB) if you need more KV-cache headroom for longer contexts (see Troubleshooting). Q3_K_S (28.79 GiB) is similar in size to IQ3_M but bartowski rates it "Low quality, not recommended" — on CUDA the I-quant is the better trade at the same footprint.

Installation

1. Build or install llama.cpp with CUDA (Blackwell sm_120)

# Build from source with CUDA enabled (recommended for sm_120 support)
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j

The RTX 5090 is Blackwell (sm_120) — build against a CUDA 12.8+ toolkit so the kernels target sm_120. (No FlashAttention-2 source build is required; llama.cpp's built-in --flash-attn uses its own kernels.)
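Before building, it can help to confirm the toolkit version. The helpers below are a sketch, not part of llama.cpp: they parse the release number that `nvcc --version` prints and compare it against 12.8. The comparison relies on `sort -V` from GNU coreutils.

```sh
# Extract e.g. "12.8" from the "release 12.8," line that nvcc --version prints.
parse_nvcc_release() {
  sed -n 's/.*release \([0-9][0-9]*\.[0-9][0-9]*\).*/\1/p'
}

# True when version $1 is at least version $2.
version_ge() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Report whether the installed toolkit is new enough to target sm_120.
check_cuda_toolkit() {
  release=$(nvcc --version | parse_nvcc_release)
  if version_ge "$release" 12.8; then
    echo "CUDA $release: OK for sm_120"
  else
    echo "CUDA $release: too old, need 12.8 or newer" >&2
    return 1
  fi
}

# Usage, on the build machine:
#   check_cuda_toolkit
#   nvidia-smi --query-gpu=name,compute_cap,memory.total --format=csv
```

The `nvidia-smi` query should list the card with its compute capability and total memory, which is a quick way to confirm the driver sees the GPU as expected.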

2. Download the IQ3 GGUF

pip install -U "huggingface_hub[cli]"

# Lead: IQ3_M (29.74 GiB) — best quality that fits with a tight KV margin
hf download bartowski/Meta-Llama-3.1-70B-Instruct-GGUF \
  Meta-Llama-3.1-70B-Instruct-IQ3_M.gguf --local-dir ./models

# OR the KV-friendlier IQ3_XS (27.30 GiB)
# hf download bartowski/Meta-Llama-3.1-70B-Instruct-GGUF \
#   Meta-Llama-3.1-70B-Instruct-IQ3_XS.gguf --local-dir ./models

Both files are single-file GGUFs (no split/merge step needed).
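A truncated or failed download is easy to miss with a file this large. The sketch below compares the on-disk size against the byte count from the table and checks for the 4-byte `GGUF` magic at the start of the file. `check_gguf` is a hypothetical helper name, not a llama.cpp or Hugging Face tool.

```sh
# Verify a downloaded GGUF: exact byte count plus the "GGUF" magic header.
check_gguf() {
  file=$1
  expected_bytes=$2
  [ -f "$file" ] || { echo "missing: $file" >&2; return 1; }
  actual_bytes=$(wc -c < "$file" | tr -d ' ')
  magic=$(head -c 4 "$file")
  if [ "$actual_bytes" != "$expected_bytes" ]; then
    echo "size mismatch: got $actual_bytes, expected $expected_bytes" >&2
    return 1
  fi
  if [ "$magic" != "GGUF" ]; then
    echo "not a GGUF file (bad magic)" >&2
    return 1
  fi
  echo "OK: $file"
}

# Usage:
#   check_gguf ./models/Meta-Llama-3.1-70B-Instruct-IQ3_M.gguf 31937038912
#   check_gguf ./models/Meta-Llama-3.1-70B-Instruct-IQ3_XS.gguf 29307734592
```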

Running

Launch with full GPU offload (-ngl 99), a modest context, and a quantized KV cache — the three settings that keep ~30 GB of weights plus KV inside the 32 GB envelope:

./build/bin/llama-server \
  -m ./models/Meta-Llama-3.1-70B-Instruct-IQ3_M.gguf \
  -ngl 99 \
  --ctx-size 8192 \
  --flash-attn on \
  --cache-type-k q8_0 \
  --cache-type-v q8_0 \
  --host 0.0.0.0 --port 8080

-ngl 99 offloads all 80 transformer layers to the GPU (anything that doesn't fit would spill to CPU and crater throughput — on a 32 GB card with IQ3 it should fit fully).
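--ctx-size 8192 keeps the KV cache small. Llama 3.1 70B uses grouped-query attention (80 layers, 8 KV heads, head_dim 128 per its config), so at fp16 each token costs 2 (K and V) × 80 × 8 × 128 × 2 bytes = 327,680 bytes, about 0.31 MiB, or roughly 2.5 GiB at 8K. The q8_0 cache flags roughly halve that; llama.cpp needs flash attention enabled to quantize the V cache, which is why --flash-attn is in the launch line.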
The model's native window is 131072 tokens, but you cannot afford full-context KV on top of ~30 GB of weights — raise --ctx-size cautiously and watch VRAM.
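The KV arithmetic above can be reproduced for any context length. The sketch below hard-codes the attention shape quoted above (80 layers, 8 KV heads, head_dim 128) and takes the bits per cached element as an argument: 16 for fp16, 8.5 for q8_0 (32 int8 values plus one fp16 scale per block). `kv_cache_gib` is an illustrative name, and the result excludes compute buffers.

```sh
# KV cache size in GiB for Llama 3.1 70B's attention shape.
# Arguments: context length in tokens, bits per cached element.
kv_cache_gib() {
  awk -v ctx="$1" -v bits="$2" 'BEGIN {
    layers = 80; kv_heads = 8; head_dim = 128
    elems_per_token = 2 * layers * kv_heads * head_dim   # K and V
    printf "%.2f\n", elems_per_token * ctx * bits / 8 / (1024 * 1024 * 1024)
  }'
}

kv_cache_gib 8192 16      # fp16 at 8K    -> 2.50
kv_cache_gib 8192 8.5     # q8_0 at 8K    -> 1.33
kv_cache_gib 131072 16    # fp16 at 128K  -> 40.00
```

The last line shows why the native 131072-token window is out of reach here: at fp16 the cache alone would exceed the card.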

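On first launch llama.cpp memory-maps the file, allocates the KV cache, and prints the layer-offload summary; then the OpenAI-compatible server is live at http://localhost:8080. Because the launch line binds to 0.0.0.0, the server is also reachable from other machines on your network; pass --host 127.0.0.1 instead to keep it local.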
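Once the server is up, a quick smoke test confirms the model answers. The sketch below assumes the default port from the launch line and uses llama-server's `/health` and `/v1/chat/completions` endpoints. The helper names and the `LLAMA_URL` variable are illustrative.

```sh
# Base URL of the llama-server instance; override via the environment if needed.
LLAMA_URL=${LLAMA_URL:-http://localhost:8080}

# Poll the health endpoint until the model has finished loading (up to ~5 minutes).
wait_for_server() {
  tries=0
  until curl -sf "$LLAMA_URL/health" > /dev/null; do
    tries=$((tries + 1))
    [ "$tries" -ge 150 ] && return 1
    sleep 2
  done
}

# Send one chat request. The prompt must not contain double quotes or backslashes,
# because it is spliced into the JSON body without escaping.
ask() {
  curl -s "$LLAMA_URL/v1/chat/completions" \
    -H 'Content-Type: application/json' \
    -d "{\"messages\":[{\"role\":\"user\",\"content\":\"$1\"}],\"max_tokens\":64}"
}

# Usage:
#   wait_for_server && ask "Say hello in five words."
```

Loading ~30 GB of weights can take a while on first launch, so polling the health endpoint is more reliable than a fixed sleep.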

Ollama alternative (must override the default quant)

# DEFAULT llama3.1:70b is Q4_K_M / 43 GB and will NOT fit a 32 GB card.
# Import the IQ3_M GGUF as a custom model instead:
cat > Modelfile <<'EOF'
FROM ./models/Meta-Llama-3.1-70B-Instruct-IQ3_M.gguf
PARAMETER num_ctx 8192
EOF
ollama create llama3.1-70b-iq3 -f Modelfile
ollama run llama3.1-70b-iq3
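The Modelfile above does not quantize the KV cache. At fp16 the 8K cache is ~2.5 GiB, more than the ~2 GB that IQ3_M leaves free. To our knowledge Ollama reads the cache type from server-side environment variables, not from the Modelfile; treat the names below as something to verify against your Ollama version. The `fully_on_gpu` helper is hypothetical and simply looks for the `100% GPU` marker in `ollama ps` output.

```sh
# Server-side settings: set these in the environment of `ollama serve`, then restart it.
export OLLAMA_FLASH_ATTENTION=1
export OLLAMA_KV_CACHE_TYPE=q8_0

# Reads `ollama ps` output on stdin; succeeds only if a model is listed as fully on the GPU.
fully_on_gpu() {
  grep -q '100% GPU'
}

# Usage, while the model is loaded:
#   ollama ps | fully_on_gpu && echo "fully offloaded" || echo "spilling to CPU"
```

If the check reports spilling, the same remedies as for llama.cpp apply: a smaller num_ctx or the IQ3_XS file.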

Results

Speed: Omitted. No first-party tokens/s measurement for Llama 3.1 70B at IQ3/Q3 on a single RTX 5090 exists in the sources surveyed for this recipe — Hardware Corner's RTX 5090 page and its cross-GPU ranking both stop at ~32-35B (no 70B-on-5090 row). Rather than quote a cross-architecture or cross-quant number, we leave Speed empty. If you run this, please submit your tok/s via /contribute — it becomes the first benchmark on /check/llama-3-1-70b/rtx-5090.
VRAM usage: Weights are ~29.74 GiB (IQ3_M) or ~27.30 GiB (IQ3_XS) on disk; runtime peak is weights + KV cache + activations. On a 32 GB card the IQ3 weights leave only ~2-5 GB for KV, so context and cache dtype are the binding constraints, not the weights alone. See /check/llama-3-1-70b/rtx-5090.
Quality notes: IQ3 is an aggressive quant. bartowski rates IQ3_M "Medium-low quality, new method with decent performance comparable to Q3_K_M" and IQ3_XS "Lower quality, new method with decent performance, slightly better than Q3_K_S" (bartowski README). Expect noticeably more degradation than a Q4+ 70B; for many tasks a Q4/Q5 32B model on this same card (which fits comfortably) is the better quality-per-VRAM trade.
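Since the on-disk sizes are not runtime peaks, it is worth watching actual VRAM while the server handles prompts. A small sketch, assuming `nvidia-smi` is on the PATH; `vram_free_mib` is an illustrative helper that turns the "used, total" pair into free MiB.

```sh
# Reads "used, total" MiB pairs (one GPU per line) on stdin and prints free MiB for each.
vram_free_mib() {
  awk -F', *' '{ print $2 - $1 }'
}

# Usage, while llama-server is running:
#   nvidia-smi --query-gpu=memory.used,memory.total --format=csv,noheader,nounits | vram_free_mib
#   watch -n 1 nvidia-smi    # live view while you send long prompts
```

A free figure that shrinks toward zero as prompts get longer is the KV cache filling, which is the signal to lower the context or switch to IQ3_XS.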

For the full benchmark data, see /check/llama-3-1-70b/rtx-5090.

Troubleshooting

Out of memory at load, or right after the first long prompt

The weights leave very little KV headroom on a 32 GB card. In order of preference: (1) drop from IQ3_M to IQ3_XS (frees ~2.4 GiB of weight footprint); (2) lower --ctx-size (4096 halves the KV cache relative to 8192); (3) keep --cache-type-k q8_0 --cache-type-v q8_0 (already in the launch line). KV cache grows linearly with context length, so an OOM that only appears after a long input is a context/KV problem, not a weights problem.
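To pick a context size instead of guessing, the KV formula can be inverted: divide the free VRAM by the per-token cache cost. The sketch below uses the same attention shape as the recipe (80 layers, 8 KV heads, head_dim 128). It is an upper bound only, since compute buffers and the CUDA context also need room, and `max_ctx_tokens` is an illustrative name.

```sh
# Largest context (tokens) whose KV cache fits in the given headroom.
# Arguments: headroom in GiB, bits per cached element (16 for fp16, 8.5 for q8_0).
max_ctx_tokens() {
  awk -v gib="$1" -v bits="$2" 'BEGIN {
    bytes_per_token = 2 * 80 * 8 * 128 * bits / 8   # K and V, 80 layers, 8 KV heads, head_dim 128
    printf "%d\n", gib * 1024 * 1024 * 1024 / bytes_per_token
  }'
}

max_ctx_tokens 2.5 16     # 2.5 GiB free, fp16 cache -> 8192
max_ctx_tokens 1 8.5      # 1 GiB free, q8_0 cache   -> 6168
```

In practice, round the result down generously before passing it to --ctx-size.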

`ollama pull llama3.1:70b` downloads 43 GB and then won't run

That tag is Q4_K_M (43 GB) by design (ollama.com) and exceeds 32 GB. Use the custom-Modelfile path above to import the IQ3_M GGUF instead — Ollama has no built-in IQ3 70B tag.

Why not just use Q4_K_M with partial offload?

You can (-ngl set below 80), but spilling 70B layers to CPU collapses throughput — the whole point of the 32 GB card is to keep everything on-GPU. If you want Q4+ 70B quality, that's a multi-GPU or heavily CPU-offloaded setup, which is outside this single-card recipe's scope.
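For readers who try partial offload anyway, a crude way to choose -ngl is to divide the file size by the layer count and see how many layers fit in the VRAM set aside for weights. The sketch below assumes all 80 layers are the same size, which ignores the embedding and output tensors. The 28 GiB budget in the example is an illustrative assumption, not a measured value.

```sh
# Rough count of layers that fit on the GPU, assuming every layer is the same size.
# Arguments: file size in GiB, total layer count, VRAM budget for weights in GiB.
layers_that_fit() {
  awk -v size="$1" -v layers="$2" -v budget="$3" 'BEGIN {
    n = int(budget / (size / layers))
    if (n > layers) n = layers
    print n
  }'
}

layers_that_fit 39.6 80 28    # Q4_K_M with ~28 GiB for weights -> 56
```

With roughly 24 of 80 layers left on the CPU, a large share of every token's work runs off-GPU, which is the throughput collapse described above.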

Want better quality at this VRAM budget?

A dense 32B model at Q4_K_M/Q5 fits the 5090 with full context and far less quality loss than a 70B squeezed to IQ3. Treat 70B-on-one-5090 as "the largest model that technically fits," not "the best model for the card." Report your experience via /contribute.