What You'll Build
A single-RTX-4090 local deployment of Llama 3.3 70B Instruct, served by llama.cpp. A 70B model at the usual 4-bit quality tier needs ~40 GB — it does not fit a 24 GB card on-GPU — so this recipe gives you the two honest paths that both run on one 4090:
- The benchmarked path — Q4_K_XL with CPU offload. Our /check datapoint (17.79 tok/s @ 16K context) is a hybrid run: the 39.73 GiB Q4_K_XL weights live partly in the 4090's 24 GB and partly in system RAM, with llama.cpp splitting layers across GPU and CPU. Best quality, needs 48 GB+ system RAM, slower than fully-on-GPU.
- The fully-on-GPU alternative — sub-4-bit IQ2 GGUF. An aggressive IQ2 quant (~20–22 GB on disk) fits entirely inside the 24 GB envelope with room for the KV cache, so the whole model stays on the GPU. Fastest, but a visible quality drop versus the Q4 path.
Hardware data: RTX 4090 (24 GB VRAM) · Q4_K_XL ~17.79 tok/s @ 16K (CPU-offload hybrid) · fully-on-GPU at IQ2 (~20–22 GB) · See benchmark data
⚠️ Read this before you start: the fit is the whole story. A 70B at 4-bit (Q4_K_XL) is 39.73 GiB on disk (unsloth tree API) — it cannot sit entirely in 24 GB of VRAM. The 17.79 tok/s figure on our benchmark page is a partial-offload number: per Hardware Corner's methodology the run used the latest llama.cpp
llama-benchon Ubuntu 24.04 with CUDA 12.8 and "4-bit (Q4_K_XL) quantization for all of the benchmarks", and a 40 GB model on a 24 GB card means a chunk of the layers runs on the CPU out of system RAM. If you want the model fully on the GPU, drop to an IQ2 GGUF (covered below) — you trade quality for keeping everything on the 4090.
Requirements
| Component | Minimum | Tested |
|---|---|---|
| GPU | 24 GB VRAM | RTX 4090 (24 GB) |
| RAM | 48 GB+ for the Q4 offload path; 16 GB is enough for the fully-on-GPU IQ2 path | — |
| Storage | ~20 GB (IQ2) to ~40 GB (Q4_K_XL) for the GGUF file | ~40 GB |
| Software | llama.cpp with CUDA (CUDA 12.x, sm_89) or Ollama | — |
Two paths, and which file fits
bartowski's GGUF card states the trade-off directly. For the fastest setup it advises "Aim for a quant with a file size 1-2GB smaller than your GPU's total VRAM." — on a 24 GB card that means a file under ~22 GB, i.e. the IQ2 tier. For maximum quality it instead says "add both your system RAM and your GPU's VRAM together, then similarly grab a quant with a file size 1-2GB Smaller than that total." — that is exactly the Q4_K_XL CPU-offload path our benchmark used. And for the quant type on NVIDIA below 4 bits: "if you're aiming for below Q4, and you're running cuBLAS (Nvidia) or rocBLAS (AMD), you should look towards the I-quants" (bartowski README). So on a CUDA card the fully-on-GPU choice is an IQ2 I-quant.
On-disk sizes verified via the HuggingFace tree API across two independent quanters (bartowski + unsloth):
| Quant | File size (bytes) | ≈ GiB | Fully on 24 GB GPU? | bartowski quality note |
|---|---|---|---|---|
| IQ2_S (lead, fully-on-GPU) | 22,242,348,000 | 20.71 | Yes (~3 GB for KV) | "Low quality, uses SOTA techniques to be usable." |
| IQ2_XS (KV-friendlier) | 21,142,113,248 | 19.69 | Yes (~4 GB for KV) | "Low quality, uses SOTA techniques to be usable." |
| IQ2_M (borderline-tight) | 24,119,299,040 | 22.46 | Tight (~1.5 GB for KV) | "Relatively low quality, uses SOTA techniques to be surprisingly usable." |
| IQ2_XXS (smallest) | 19,097,390,048 | 17.79 | Yes (most KV room) | "Very low quality, uses SOTA techniques to be usable." |
| Q4_K_XL (offload-hybrid; benchmarked) | 42,664,577,632 | 39.73 | No — needs CPU offload | reference 4-bit tier (best quality of the two paths) |
IQ2 sizes are corroborated by bartowski and unsloth (unsloth's UD-IQ2_M = 22.63 GiB, UD-IQ2_XXS = 18.10 GiB — same tier, same fit). The Q4_K_XL file is unsloth's Llama-3.3-70B-Instruct-UD-Q4_K_XL.gguf (39.73 GiB) — the exact quant our /check benchmark used. All numbers are on-disk byte counts, not measured runtime peaks; see /check/llama-3-3-70b/rtx-4090.
Recommendation: if you want the model entirely on the GPU and the fastest tokens/s, lead with IQ2_S (20.71 GiB) — it leaves ~3 GB for the KV cache at a modest context. If you need more context headroom, drop to IQ2_XS (19.69 GiB). If you instead want the best quality and have 48 GB+ system RAM, use the Q4_K_XL offload path — it's the configuration behind the 17.79 tok/s on the benchmark page, just slower than the on-GPU IQ2 because layers traverse the PCIe bus to system RAM.
Installation
1. Build or install llama.cpp with CUDA (Ada, sm_89)
# Build from source with CUDA enabled
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j
The RTX 4090 is Ada Lovelace (sm_89), not Blackwell — the default CUDA 12.x toolkit already ships sm_89 kernels, so no special cu128/sm_120 wheel selection is needed. llama.cpp's built-in --flash-attn uses its own kernels; there is no separate pip install flash-attn step.
2. Download a GGUF
pip install -U "huggingface_hub[cli]"
# FULLY-ON-GPU lead: IQ2_S (20.71 GiB) — whole model stays on the 4090
hf download bartowski/Llama-3.3-70B-Instruct-GGUF \
Llama-3.3-70B-Instruct-IQ2_S.gguf --local-dir ./models
# OR the best-quality CPU-offload path: Q4_K_XL (39.73 GiB) — needs 48 GB+ system RAM
# hf download unsloth/Llama-3.3-70B-Instruct-GGUF \
# Llama-3.3-70B-Instruct-UD-Q4_K_XL.gguf --local-dir ./models
Both are single-file GGUFs (no split/merge step).
Running
Path A — fully on the GPU (IQ2_S)
Offload everything (-ngl 99) and keep a modest context so weights + KV stay inside 24 GB:
./build/bin/llama-server \
-m ./models/Llama-3.3-70B-Instruct-IQ2_S.gguf \
-ngl 99 \
--ctx-size 8192 \
--flash-attn \
--cache-type-k q8_0 \
--cache-type-v q8_0 \
--host 0.0.0.0 --port 8080
-ngl 99puts all 80 transformer layers on the GPU. With IQ2_S (~20.7 GiB) they fit, leaving ~3 GB for the KV cache.- Llama 3.3 70B uses grouped-query attention (80 layers, 8 KV heads, head_dim 128 per its config.json), so KV is ~0.31 MB/token at fp16 — roughly 2.5 GB at 8K and ~5 GB at 16K. The
q8_0cache flags roughly halve that, which is what lets you push context up.
Path B — best quality via CPU offload (Q4_K_XL, the benchmarked path)
The 39.73 GiB Q4_K_XL can't fit 24 GB on its own. Put as many layers on the GPU as fit and let the rest run on the CPU out of system RAM — this is the configuration behind the 17.79 tok/s on /check:
./build/bin/llama-server \
-m ./models/Llama-3.3-70B-Instruct-UD-Q4_K_XL.gguf \
-ngl 40 \
--ctx-size 16384 \
--flash-attn \
--host 0.0.0.0 --port 8080
-ngl 40offloads ~half the 80 layers to the 4090; the remainder stay on the CPU. Tune the number up until VRAM is nearly full (nvidia-smi) without OOM — more GPU layers = faster.- This path needs 48 GB+ system RAM to hold the CPU-resident layers. Throughput is gated by the CPU-side layers and the PCIe transfer, which is why it's slower than the on-GPU IQ2 path despite higher quality.
On first launch llama.cpp memory-maps the file, allocates the KV cache, prints the layer-offload summary, and serves an OpenAI-compatible API at http://localhost:8080.
Ollama alternative (must override the default quant)
# The default `ollama pull llama3.3:70b` is Q4_K_M (~43 GB) and will NOT fit on-GPU.
# Import the IQ2_S GGUF as a custom model for the fully-on-GPU path:
cat > Modelfile <<'EOF'
FROM ./models/Llama-3.3-70B-Instruct-IQ2_S.gguf
PARAMETER num_ctx 8192
EOF
ollama create llama3.3-70b-iq2 -f Modelfile
ollama run llama3.3-70b-iq2
Results
- Speed: 17.79 tokens/s at 16K context on the RTX 4090, measured with llama.cpp's
llama-benchat Q4_K_XL with CPU offload, per Hardware Corner's GPU ranking (the source behind /check/llama-3-3-70b/rtx-4090). This is the offload-hybrid path (Path B). No first-party tokens/s exists yet for the fully-on-GPU IQ2 path (Path A) — if you run it, please submit your tok/s via /contribute so it becomes the first on-GPU benchmark on this page. - VRAM usage: Path A (IQ2_S) keeps ~20.7 GiB of weights plus the KV cache inside the 24 GB envelope — context length and cache dtype are the binding constraint, not the weights. Path B (Q4_K_XL) overflows VRAM by design and spills layers to system RAM. See /check/llama-3-3-70b/rtx-4090.
- Quality notes: IQ2 is an aggressive quant. bartowski rates IQ2_S "Low quality, uses SOTA techniques to be usable." and IQ2_M "Relatively low quality, uses SOTA techniques to be surprisingly usable." (bartowski README). The Q4_K_XL offload path keeps full 4-bit quality at the cost of system RAM and speed. For many tasks a Q4/Q5 32B model — which fits the 4090 fully on-GPU with room to spare — is a better quality-per-VRAM trade than a 70B squeezed to IQ2.
For the full benchmark data, see /check/llama-3-3-70b/rtx-4090.
Troubleshooting
Out of memory at load, or right after the first long prompt (Path A / IQ2)
The IQ2 weights leave only a few GB of KV headroom on a 24 GB card. In order of preference: (1) drop from IQ2_S to IQ2_XS (frees ~1 GB of weight footprint) or IQ2_XXS (frees ~3 GB); (2) lower --ctx-size (4096 is safe); (3) keep --cache-type-k q8_0 --cache-type-v q8_0 (already in Path A's launch line). KV grows with context, so an OOM that only appears after a long input is a context/KV problem, not a weights problem.
Path B is slow / llama-server crawls
That's expected for the Q4_K_XL CPU-offload path — layers on the CPU traverse system RAM and the PCIe bus on every token. To go faster, raise -ngl until VRAM is nearly full (watch nvidia-smi); each extra layer on the GPU helps. If it OOMs, lower -ngl by a few. If you'd rather have speed than full 4-bit quality, switch to the fully-on-GPU IQ2 path (Path A).
ollama pull llama3.3:70b downloads ~43 GB and won't run on-GPU
That tag is Q4_K_M (~43 GB) and exceeds 24 GB. Use the custom-Modelfile path above to import the IQ2_S GGUF for a fully-on-GPU run, or run the Q4_K_XL offload path under llama.cpp directly.
Why not just use Q4 fully on the GPU?
You can't — a 24 GB card has no room for a ~40 GB 4-bit 70B. Your two real options on one 4090 are exactly the two in this recipe: keep full 4-bit quality by offloading part of the model to system RAM (Path B), or stay fully on-GPU by dropping to IQ2 (Path A). For full Q4+ 70B entirely on the GPU you need a larger card (e.g. a 32 GB RTX 5090 at IQ3) or multiple GPUs — out of scope here. Report your experience via /contribute.