What You'll Build
A single-GPU local deployment of Llama 3.1 70B Instruct on one RTX 5090 (32 GB), served by llama.cpp with full GPU offload. The honest framing up front: a 70B model does not fit a 32 GB card at the usual Q4 quality tier — the only single-card path is an aggressive sub-4-bit (IQ3) quant that fits ~27-30 GB of the 32 GB envelope, leaving a narrow margin for the KV cache.
Hardware data: RTX 5090 (32 GB VRAM) · weights ~27-30 GB at IQ3 · KV-cache-limited · See benchmark data
⚠️ Read this before you start: the fit is the whole story. The default
ollama pull llama3.1:70bis Q4_K_M, 43 GB (ollama.com/library/llama3.1:70b) — it will not load on a 32 GB card. The standalone Q4_K_M GGUF is the same size: 42,520,398,400 bytes ≈ 39.6 GiB (bartowski card, lmstudio-community card). To get 70B onto one 5090 you must drop to IQ3 (covered below). For full Q4+ quality, use two GPUs or CPU offload — out of scope for this single-card recipe.
Requirements
| Component | Minimum | Tested |
|---|---|---|
| GPU | 32 GB VRAM (no 24 GB card fits any usable 70B quant) | RTX 5090 (32 GB) |
| RAM | 32 GB | — |
| Storage | ~30 GB for the IQ3 weight file | ~30 GB |
| Software | llama.cpp with CUDA (cu128 / sm_120) or Ollama | — |
Why IQ3, and which file
A 70B model needs roughly params × bytes-per-weight of VRAM for the weights alone, plus KV cache and activations on top. At the common quality tier (Q4_K_M ≈ 4.5 bits/weight) that is ~40 GB — past the 5090's 32 GB. Dropping below 4 bits is the only way onto a single card. bartowski's GGUF card states the rule directly: "Aim for a quant with a file size 1-2GB smaller than your GPU's total VRAM" and, for sub-Q4 on NVIDIA, "if you're aiming for below Q4, and you're running cuBLAS (Nvidia) or rocBLAS (AMD), you should look towards the I-quants" (bartowski README). On a CUDA card that points at the IQ3 tier.
The two single-card candidates, with on-disk sizes verified via the HuggingFace tree API across multiple independent quanters:
| Quant | File size (bytes) | ≈ GiB | Fits 32 GB? | bartowski quality note |
|---|---|---|---|---|
| IQ3_M (lead) | 31,937,038,912 | 29.74 | Yes, tight (~2 GB for KV) | "Medium-low quality, new method with decent performance comparable to Q3_K_M" |
| IQ3_XS (KV-friendlier) | 29,307,734,592 | 27.30 | Yes (~4-5 GB for KV) | "Lower quality, new method with decent performance, slightly better than Q3_K_S" |
| Q4_K_M (does NOT fit) | 42,520,398,400 | 39.60 | No — overflows 32 GB | reference tier (needs multi-GPU / CPU offload) |
IQ3_M and Q4_K_M sizes are corroborated by bartowski and lmstudio-community; IQ3_XS and Q3_K_S by bartowski and MaziyarPanahi. All numbers are on-disk byte counts, not measured runtime peaks — see /check/llama-3-1-70b/rtx-5090 for first-party runtime measurements once the community submits them.
Recommendation: lead with IQ3_M (29.74 GiB) for the best quality that still respects bartowski's "1-2 GB under VRAM" rule; switch to IQ3_XS (27.30 GiB) if you need more KV-cache headroom for longer contexts (see Troubleshooting). Q3_K_S (28.79 GiB) is similar in size to IQ3_M but bartowski rates it "Low quality, not recommended" — on CUDA the I-quant is the better trade at the same footprint.
Installation
1. Build or install llama.cpp with CUDA (Blackwell sm_120)
# Build from source with CUDA enabled (recommended for sm_120 support)
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j
The RTX 5090 is Blackwell (sm_120) — build against a CUDA 12.8+ toolkit so the kernels target sm_120. (No FlashAttention-2 source build is required; llama.cpp's built-in --flash-attn uses its own kernels.)
2. Download the IQ3 GGUF
pip install -U "huggingface_hub[cli]"
# Lead: IQ3_M (29.74 GiB) — best quality that fits with a tight KV margin
hf download bartowski/Meta-Llama-3.1-70B-Instruct-GGUF \
Meta-Llama-3.1-70B-Instruct-IQ3_M.gguf --local-dir ./models
# OR the KV-friendlier IQ3_XS (27.30 GiB)
# hf download bartowski/Meta-Llama-3.1-70B-Instruct-GGUF \
# Meta-Llama-3.1-70B-Instruct-IQ3_XS.gguf --local-dir ./models
Both files are single-file GGUFs (no split/merge step needed).
Running
Launch with full GPU offload (-ngl 99), a modest context, and a quantized KV cache — the three settings that keep ~30 GB of weights plus KV inside the 32 GB envelope:
./build/bin/llama-server \
-m ./models/Meta-Llama-3.1-70B-Instruct-IQ3_M.gguf \
-ngl 99 \
--ctx-size 8192 \
--flash-attn \
--cache-type-k q8_0 \
--cache-type-v q8_0 \
--host 0.0.0.0 --port 8080
-ngl 99offloads all 80 transformer layers to the GPU (anything that doesn't fit would spill to CPU and crater throughput — on a 32 GB card with IQ3 it should fit fully).--ctx-size 8192keeps the KV cache small. Llama 3.1 70B uses grouped-query attention (80 layers, 8 KV heads, head_dim 128 per its config), so KV is ~0.31 MB/token at fp16 — roughly 2.5 GB at 8K. Theq8_0cache flags roughly halve that.- The model's native window is 131072 tokens, but you cannot afford full-context KV on top of ~30 GB of weights — raise
--ctx-sizecautiously and watch VRAM.
On first launch llama.cpp memory-maps the file, allocates the KV cache, and prints the layer-offload summary; then the OpenAI-compatible server is live at http://localhost:8080.
Ollama alternative (must override the default quant)
# DEFAULT llama3.1:70b is Q4_K_M / 43 GB and will NOT fit a 32 GB card.
# Import the IQ3_M GGUF as a custom model instead:
cat > Modelfile <<'EOF'
FROM ./models/Meta-Llama-3.1-70B-Instruct-IQ3_M.gguf
PARAMETER num_ctx 8192
EOF
ollama create llama3.1-70b-iq3 -f Modelfile
ollama run llama3.1-70b-iq3
Results
- Speed: Omitted. No first-party tokens/s measurement for Llama 3.1 70B at IQ3/Q3 on a single RTX 5090 exists in the sources surveyed for this recipe — Hardware Corner's RTX 5090 page and its cross-GPU ranking both stop at ~32-35B (no 70B-on-5090 row). Rather than quote a cross-architecture or cross-quant number, we leave Speed empty. If you run this, please submit your tok/s via /contribute — it becomes the first benchmark on /check/llama-3-1-70b/rtx-5090.
- VRAM usage: Weights are ~29.74 GiB (IQ3_M) or ~27.30 GiB (IQ3_XS) on disk; runtime peak is weights + KV cache + activations. On a 32 GB card the IQ3 weights leave only ~2-5 GB for KV, so context and cache dtype are the binding constraints, not the weights alone. See /check/llama-3-1-70b/rtx-5090.
- Quality notes: IQ3 is an aggressive quant. bartowski rates IQ3_M "Medium-low quality, new method with decent performance comparable to Q3_K_M" and IQ3_XS "Lower quality, new method with decent performance, slightly better than Q3_K_S" (bartowski README). Expect noticeably more degradation than a Q4+ 70B; for many tasks a Q4/Q5 32B model on this same card (which fits comfortably) is the better quality-per-VRAM trade.
For the full benchmark data, see /check/llama-3-1-70b/rtx-5090.
Troubleshooting
Out of memory at load, or right after the first long prompt
The weights leave very little KV headroom on a 32 GB card. In order of preference: (1) drop from IQ3_M to IQ3_XS (frees ~2.4 GB of weight footprint); (2) lower --ctx-size (4096 is safe); (3) keep --cache-type-k q8_0 --cache-type-v q8_0 (already in the launch line). KV cache grows with context length, so an OOM that only appears after a long input is a context/KV problem, not a weights problem.
ollama pull llama3.1:70b downloads 43 GB and then won't run
That tag is Q4_K_M (43 GB) by design (ollama.com) and exceeds 32 GB. Use the custom-Modelfile path above to import the IQ3_M GGUF instead — Ollama has no built-in IQ3 70B tag.
Why not just use Q4_K_M with partial offload?
You can (-ngl set below 80), but spilling 70B layers to CPU collapses throughput — the whole point of the 32 GB card is to keep everything on-GPU. If you want Q4+ 70B quality, that's a multi-GPU or heavily CPU-offloaded setup, which is outside this single-card recipe's scope.
Want better quality at this VRAM budget?
A dense 32B model at Q4_K_M/Q5 fits the 5090 with full context and far less quality loss than a 70B squeezed to IQ3. Treat 70B-on-one-5090 as "the largest model that technically fits," not "the best model for the card." Report your experience via /contribute.