How much VRAM does Llama 3.3 70B need?

About 24 GB — the minimum this recipe targets.

How hard is this setup?

Advanced — follow the steps above.

Llama 3.3 70B on RTX 4090: 70B-Class Chat on One 24 GB Card (Q4 Offload or Fully-On-GPU IQ2)

What You'll Build

A single-RTX-4090 local deployment of Llama 3.3 70B Instruct, served by llama.cpp. A 70B model at the usual 4-bit quality tier needs ~40 GB — it does not fit a 24 GB card on-GPU — so this recipe gives you the two honest paths that both run on one 4090:

The benchmarked path — Q4_K_XL with CPU offload. Our /check datapoint (17.79 tok/s @ 16K context) is a hybrid run: the 39.73 GiB Q4_K_XL weights live partly in the 4090's 24 GB and partly in system RAM, with llama.cpp splitting layers across GPU and CPU. Best quality, needs 48 GB+ system RAM, slower than fully-on-GPU.
The fully-on-GPU alternative — sub-4-bit IQ2 GGUF. An aggressive IQ2 quant (~20–22 GB on disk) fits entirely inside the 24 GB envelope with room for the KV cache, so the whole model stays on the GPU. Fastest, but a visible quality drop versus the Q4 path.

Hardware data: RTX 4090 (24 GB VRAM) · Q4_K_XL ~17.79 tok/s @ 16K (CPU-offload hybrid) · fully-on-GPU at IQ2 (~20–22 GB) · See benchmark data

⚠️ Read this before you start: the fit is the whole story. A 70B at 4-bit (Q4_K_XL) is 39.73 GiB on disk (unsloth tree API) — it cannot sit entirely in 24 GB of VRAM. The 17.79 tok/s figure on our benchmark page is a partial-offload number: per Hardware Corner's methodology the run used the latest llama.cpp llama-bench on Ubuntu 24.04 with CUDA 12.8 and "4-bit (Q4_K_XL) quantization for all of the benchmarks", and a 40 GB model on a 24 GB card means a chunk of the layers runs on the CPU out of system RAM. If you want the model fully on the GPU, drop to an IQ2 GGUF (covered below) — you trade quality for keeping everything on the 4090.

Requirements

Component	Minimum	Tested
GPU	24 GB VRAM	RTX 4090 (24 GB)
RAM	48 GB+ for the Q4 offload path; 16 GB is enough for the fully-on-GPU IQ2 path	—
Storage	~20 GB (IQ2) to ~40 GB (Q4_K_XL) for the GGUF file	~40 GB
Software	llama.cpp with CUDA (CUDA 12.x, sm_89) or Ollama	—

Two paths, and which file fits

bartowski's GGUF card states the trade-off directly. For the fastest setup it advises "Aim for a quant with a file size 1-2GB smaller than your GPU's total VRAM." — on a 24 GB card that means a file under ~22 GB, i.e. the IQ2 tier. For maximum quality it instead says "add both your system RAM and your GPU's VRAM together, then similarly grab a quant with a file size 1-2GB Smaller than that total." — that is exactly the Q4_K_XL CPU-offload path our benchmark used. And for the quant type on NVIDIA below 4 bits: "if you're aiming for below Q4, and you're running cuBLAS (Nvidia) or rocBLAS (AMD), you should look towards the I-quants" (bartowski README). So on a CUDA card the fully-on-GPU choice is an IQ2 I-quant.

On-disk sizes verified via the HuggingFace tree API across two independent quanters (bartowski + unsloth):

Quant	File size (bytes)	≈ GiB	Fully on 24 GB GPU?	bartowski quality note
IQ2_S (lead, fully-on-GPU)	22,242,348,000	20.71	Yes (~3 GB for KV)	"Low quality, uses SOTA techniques to be usable."
IQ2_XS (KV-friendlier)	21,142,113,248	19.69	Yes (~4 GB for KV)	"Low quality, uses SOTA techniques to be usable."
IQ2_M (borderline-tight)	24,119,299,040	22.46	Tight (~1.5 GB for KV)	"Relatively low quality, uses SOTA techniques to be surprisingly usable."
IQ2_XXS (smallest)	19,097,390,048	17.79	Yes (most KV room)	"Very low quality, uses SOTA techniques to be usable."
Q4_K_XL (offload-hybrid; benchmarked)	42,664,577,632	39.73	No — needs CPU offload	reference 4-bit tier (best quality of the two paths)

IQ2 sizes are corroborated by bartowski and unsloth (unsloth's UD-IQ2_M = 22.63 GiB, UD-IQ2_XXS = 18.10 GiB — same tier, same fit). The Q4_K_XL file is unsloth's Llama-3.3-70B-Instruct-UD-Q4_K_XL.gguf (39.73 GiB) — the exact quant our /check benchmark used. All numbers are on-disk byte counts, not measured runtime peaks; see /check/llama-3-3-70b/rtx-4090.

Recommendation: if you want the model entirely on the GPU and the fastest tokens/s, lead with IQ2_S (20.71 GiB) — it leaves ~3 GB for the KV cache at a modest context. If you need more context headroom, drop to IQ2_XS (19.69 GiB). If you instead want the best quality and have 48 GB+ system RAM, use the Q4_K_XL offload path — it's the configuration behind the 17.79 tok/s on the benchmark page, just slower than the on-GPU IQ2 because layers traverse the PCIe bus to system RAM.

Installation

1. Build or install llama.cpp with CUDA (Ada, sm_89)

# Build from source with CUDA enabled
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j

The RTX 4090 is Ada Lovelace (sm_89), not Blackwell — the default CUDA 12.x toolkit already ships sm_89 kernels, so no special cu128/sm_120 wheel selection is needed. llama.cpp's built-in --flash-attn uses its own kernels; there is no separate pip install flash-attn step.

2. Download a GGUF

pip install -U "huggingface_hub[cli]"

# FULLY-ON-GPU lead: IQ2_S (20.71 GiB) — whole model stays on the 4090
hf download bartowski/Llama-3.3-70B-Instruct-GGUF \
  Llama-3.3-70B-Instruct-IQ2_S.gguf --local-dir ./models

# OR the best-quality CPU-offload path: Q4_K_XL (39.73 GiB) — needs 48 GB+ system RAM
# hf download unsloth/Llama-3.3-70B-Instruct-GGUF \
#   Llama-3.3-70B-Instruct-UD-Q4_K_XL.gguf --local-dir ./models

Both are single-file GGUFs (no split/merge step).

Running

Path A — fully on the GPU (IQ2_S)

Offload everything (-ngl 99) and keep a modest context so weights + KV stay inside 24 GB:

./build/bin/llama-server \
  -m ./models/Llama-3.3-70B-Instruct-IQ2_S.gguf \
  -ngl 99 \
  --ctx-size 8192 \
  --flash-attn \
  --cache-type-k q8_0 \
  --cache-type-v q8_0 \
  --host 0.0.0.0 --port 8080

-ngl 99 puts all 80 transformer layers on the GPU. With IQ2_S (~20.7 GiB) they fit, leaving ~3 GB for the KV cache.
Llama 3.3 70B uses grouped-query attention (80 layers, 8 KV heads, head_dim 128 per its config.json), so KV is ~0.31 MB/token at fp16 — roughly 2.5 GB at 8K and ~5 GB at 16K. The q8_0 cache flags roughly halve that, which is what lets you push context up.

Path B — best quality via CPU offload (Q4_K_XL, the benchmarked path)

The 39.73 GiB Q4_K_XL can't fit 24 GB on its own. Put as many layers on the GPU as fit and let the rest run on the CPU out of system RAM — this is the configuration behind the 17.79 tok/s on /check:

./build/bin/llama-server \
  -m ./models/Llama-3.3-70B-Instruct-UD-Q4_K_XL.gguf \
  -ngl 40 \
  --ctx-size 16384 \
  --flash-attn \
  --host 0.0.0.0 --port 8080

-ngl 40 offloads ~half the 80 layers to the 4090; the remainder stay on the CPU. Tune the number up until VRAM is nearly full (nvidia-smi) without OOM — more GPU layers = faster.
This path needs 48 GB+ system RAM to hold the CPU-resident layers. Throughput is gated by the CPU-side layers and the PCIe transfer, which is why it's slower than the on-GPU IQ2 path despite higher quality.

On first launch llama.cpp memory-maps the file, allocates the KV cache, prints the layer-offload summary, and serves an OpenAI-compatible API at http://localhost:8080.

Ollama alternative (must override the default quant)

# The default `ollama pull llama3.3:70b` is Q4_K_M (~43 GB) and will NOT fit on-GPU.
# Import the IQ2_S GGUF as a custom model for the fully-on-GPU path:
cat > Modelfile <<'EOF'
FROM ./models/Llama-3.3-70B-Instruct-IQ2_S.gguf
PARAMETER num_ctx 8192
EOF
ollama create llama3.3-70b-iq2 -f Modelfile
ollama run llama3.3-70b-iq2

Results

Speed: 17.79 tokens/s at 16K context on the RTX 4090, measured with llama.cpp's llama-bench at Q4_K_XL with CPU offload, per Hardware Corner's GPU ranking (the source behind /check/llama-3-3-70b/rtx-4090). This is the offload-hybrid path (Path B). No first-party tokens/s exists yet for the fully-on-GPU IQ2 path (Path A) — if you run it, please submit your tok/s via /contribute so it becomes the first on-GPU benchmark on this page.
VRAM usage: Path A (IQ2_S) keeps ~20.7 GiB of weights plus the KV cache inside the 24 GB envelope — context length and cache dtype are the binding constraint, not the weights. Path B (Q4_K_XL) overflows VRAM by design and spills layers to system RAM. See /check/llama-3-3-70b/rtx-4090.
Quality notes: IQ2 is an aggressive quant. bartowski rates IQ2_S "Low quality, uses SOTA techniques to be usable." and IQ2_M "Relatively low quality, uses SOTA techniques to be surprisingly usable." (bartowski README). The Q4_K_XL offload path keeps full 4-bit quality at the cost of system RAM and speed. For many tasks a Q4/Q5 32B model — which fits the 4090 fully on-GPU with room to spare — is a better quality-per-VRAM trade than a 70B squeezed to IQ2.

For the full benchmark data, see /check/llama-3-3-70b/rtx-4090.

Troubleshooting

Out of memory at load, or right after the first long prompt (Path A / IQ2)

The IQ2 weights leave only a few GB of KV headroom on a 24 GB card. In order of preference: (1) drop from IQ2_S to IQ2_XS (frees ~1 GB of weight footprint) or IQ2_XXS (frees ~3 GB); (2) lower --ctx-size (4096 is safe); (3) keep --cache-type-k q8_0 --cache-type-v q8_0 (already in Path A's launch line). KV grows with context, so an OOM that only appears after a long input is a context/KV problem, not a weights problem.

Path B is slow / `llama-server` crawls

That's expected for the Q4_K_XL CPU-offload path — layers on the CPU traverse system RAM and the PCIe bus on every token. To go faster, raise -ngl until VRAM is nearly full (watch nvidia-smi); each extra layer on the GPU helps. If it OOMs, lower -ngl by a few. If you'd rather have speed than full 4-bit quality, switch to the fully-on-GPU IQ2 path (Path A).

`ollama pull llama3.3:70b` downloads ~43 GB and won't run on-GPU

That tag is Q4_K_M (~43 GB) and exceeds 24 GB. Use the custom-Modelfile path above to import the IQ2_S GGUF for a fully-on-GPU run, or run the Q4_K_XL offload path under llama.cpp directly.

Why not just use Q4 fully on the GPU?

You can't — a 24 GB card has no room for a ~40 GB 4-bit 70B. Your two real options on one 4090 are exactly the two in this recipe: keep full 4-bit quality by offloading part of the model to system RAM (Path B), or stay fully on-GPU by dropping to IQ2 (Path A). For full Q4+ 70B entirely on the GPU you need a larger card (e.g. a 32 GB RTX 5090 at IQ3) or multiple GPUs — out of scope here. Report your experience via /contribute.