Qwen3-32B on RTX 3090: UD-Q4_K_XL GGUF via llama.cpp

What You'll Build

A local Qwen3-32B reasoning assistant running on an RTX 3090 (24 GB VRAM) through llama.cpp with the unsloth/Qwen3-32B-GGUF UD-Q4_K_XL weights (20 GB on disk, Unsloth's Dynamic 2.0 mixed-precision GGUF tier). At 32B dense parameters, this pair sits on the upper edge of single-RTX-3090 LLM territory — Q4_K_M-class quants are the only ones that fit with usable KV-cache headroom, and context-window discipline matters in a way it never does on smaller 8B / 14B recipes.

Hardware data: RTX 3090 (24 GB VRAM) · UD-Q4_K_XL GGUF · 35.1 tok/s @ Q4_K, 4K ctx · See benchmark data

⚠️ Quant pinned — UD-Q4_K_XL. This recipe targets the unsloth/Qwen3-32B-GGUF UD-Q4_K_XL tier (20 GB on disk) specifically. Per Unsloth's Dynamic 2.0 documentation, the UD-*-XL family applies per-layer sensitivity-aware bit-allocation — "selectively quantizes layers much more intelligently." Plain Q4_K_M (19.8 GB on disk from the same repo) also fits; Q5_K_S (22.6 GB) and Q5_K_M (23.2 GB) do NOT fit alongside any KV cache on a 24 GB card. See "Picking a quant on this card" below.

ℹ️ Variant pinned — dense 32B from the Qwen3 family. Per the Qwen/Qwen3-32B model card, Qwen3 spans 8 sizes — 0.6b, 1.7b, 4b, 8b, 14b, 30b (MoE), 32b (this recipe — dense), 235b (MoE). The siblings have wildly different VRAM profiles: Qwen3-14B in Q4_K_M is ~9 GB; Qwen3-30B-A3B is MoE so all 30B params must be resident; Qwen3-235B does not fit consumer hardware. This recipe is for the dense 32B model only. If you want 14B on the same card with much more KV headroom, swap Qwen3-32B-GGUF for unsloth/Qwen3-14B-GGUF.

Requirements

Component	Minimum	Tested
GPU	24 GB VRAM (Q4_K_M / UD-Q4_K_XL)	RTX 3090 (24 GB)
RAM	32 GB system	—
Storage	20 GB (UD-Q4_K_XL GGUF) per unsloth/Qwen3-32B-GGUF	—
Driver	CUDA 12.x runtime (Ampere sm_86)	—
Runtime	llama.cpp / Ollama / LM Studio	llama.cpp b9247+

The model is released under Apache 2.0 — commercial use permitted. Per the Qwen3-32B HF model card, architecture is 64 layers / 64 Q heads / 8 KV heads (Grouped-Query Attention — keeps the per-token KV cache modest), 32,768-token native context, extendable to 131,072 via YaRN scaling.

Installation

Option A — llama.cpp + Unsloth UD-Q4_K_XL (recommended path)

This is the canonical CUDA-accelerated llama.cpp loader for a 32B GGUF on a 24 GB Ampere card.

1. Install llama.cpp

# macOS (Homebrew)
brew install llama.cpp

# Linux — pre-built CUDA binary
# Download the latest "llama-bXXXX-bin-ubuntu-cuda-12.x-x64.zip" asset from:
#   https://github.com/ggml-org/llama.cpp/releases
# Extract and add the bin/ directory to PATH.

To build from source with CUDA support instead, follow Unsloth's run guide:

apt-get update
apt-get install pciutils build-essential cmake curl libcurl4-openssl-dev -y
git clone https://github.com/ggml-org/llama.cpp
cmake llama.cpp -B llama.cpp/build \
    -DBUILD_SHARED_LIBS=OFF -DGGML_CUDA=ON -DLLAMA_CURL=ON
cmake --build llama.cpp/build --config Release -j --clean-first --target llama-cli llama-gguf-split

2. Download the UD-Q4_K_XL GGUF

Per Unsloth's run guide, use snapshot_download to pull only the UD-Q4_K_XL file (~20 GB) instead of the full repo:

pip install huggingface_hub hf_transfer

# download_q4kxl.py
import os
os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="unsloth/Qwen3-32B-GGUF",
    local_dir="unsloth/Qwen3-32B-GGUF",
    allow_patterns=["*UD-Q4_K_XL*"],
)

python download_q4kxl.py

The resulting file is unsloth/Qwen3-32B-GGUF/Qwen3-32B-UD-Q4_K_XL.gguf (20 GB per the unsloth model card).

3. Start the server

llama-server \
  --model unsloth/Qwen3-32B-GGUF/Qwen3-32B-UD-Q4_K_XL.gguf \
  --ctx-size 8192 \
  --cache-type-k q8_0 --cache-type-v q8_0 \
  --flash-attn \
  --n-gpu-layers 99 \
  --host 0.0.0.0 --port 8080

--n-gpu-layers 99 offloads every layer to the 3090 (the 3090 has just enough VRAM to keep the whole model resident at Q4_K_M; layer-streaming is unnecessary). --ctx-size 8192 is a safe starting point that leaves room for KV cache. --cache-type-k q8_0 --cache-type-v q8_0 --flash-attn halves KV memory up-front — strongly recommended on a 24 GB card because the 32B Q4_K_M envelope is razor-thin for context (see Troubleshooting).

Option B — Ollama (one-command alternative)

If you prefer a single command and don't need the specific UD-Q4_K_XL tier:

# Pulls the same UD-Q4_K_XL file directly from the unsloth HF repo
ollama run hf.co/unsloth/Qwen3-32B-GGUF:UD-Q4_K_XL

The command above is the canonical Ollama incantation documented by Unsloth for the UD-Q4_K_XL build. Ollama's own catalog also ships ollama run qwen3:32b (the default Q4_K_M tag from the Qwen team) which is one of the standard k-quants rather than Unsloth's Dynamic 2.0 — same disk class, slightly different per-layer recipe.

Option C — LM Studio (GUI)

LM Studio's catalog search ("Qwen3 32B GGUF") will surface the unsloth UD-Q4_K_XL build alongside the bartowski standard-quant ladder. Pick Qwen3-32B-UD-Q4_K_XL.gguf from the unsloth repo and download — same file as Option A. LM Studio's loader will set --n-gpu-layers to "max" automatically for a 3090.

Running

One-shot prompt via the llama.cpp HTTP server

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3-32b",
    "messages": [{"role": "user", "content": "Explain Grouped Query Attention in three sentences."}],
    "temperature": 0.6,
    "top_p": 0.95,
    "top_k": 20
  }'

The llama.cpp llama-server binary exposes an OpenAI-compatible /v1/chat/completions endpoint on the port chosen above. Note the sampling parameters: per the Qwen/Qwen3-32B model card, thinking mode (the default) wants temperature=0.6, top_p=0.95, top_k=20, min_p=0 and explicitly warns "DO NOT use greedy decoding — it can lead to performance degradation and endless repetitions." Non-thinking mode uses temperature=0.7, top_p=0.8.

Interactive terminal

llama-cli \
  --model unsloth/Qwen3-32B-GGUF/Qwen3-32B-UD-Q4_K_XL.gguf \
  --ctx-size 8192 \
  --cache-type-k q8_0 --cache-type-v q8_0 \
  --flash-attn \
  --n-gpu-layers 99 \
  --interactive

Press Ctrl-C to interrupt generation; the CLI keeps the model warm in VRAM until exit.

Thinking vs non-thinking mode

Qwen3-32B has a built-in chain-of-thought ("thinking") mode that the Qwen/Qwen3-32B model card enables via enable_thinking=True. Output starts with a <think>...</think> block followed by the user-facing answer. To disable thinking for latency-sensitive turns, append /no_think to your prompt or set enable_thinking=False in the chat-template parameters — and remember to switch sampling parameters to the non-thinking profile (temperature=0.7, top_p=0.8). Because thinking traces routinely run 2K–4K tokens (longer on hard problems), they amplify KV-cache pressure on this 24 GB card; keep --ctx-size 8192 and KV-quantization flags enabled unless you've tuned headroom.

Results

Speed: Per Hardware Corner's RTX 3090 LLM benchmark page, Qwen3 32B at Q4_K records 35.1 tokens/s at 4K context and 30.3 tokens/s at 16K context for token generation, with prompt-processing prefill at 1,087.9 t/s @ 4K and 767.8 t/s @ 16K (CUDA, -fa 1 flash-attention). Confirmed by /check/qwen3-32b/rtx-3090 (Hardware Corner-sourced, last verified 2026-05-15). Reasoning-model workloads where most of the output is <think> content will see effective throughput 30–50% lower than these chat-class figures — that token budget is partly consumed by content the user discards.
VRAM usage: ~22 GB peak at Q4_K_M on the 24 GB card. The unsloth/Qwen3-32B-GGUF tier table lists Q4_K_M at 19.8 GB on disk and UD-Q4_K_XL at 20 GB on disk — both load close to file size with GGUF's mmap-style allocation, leaving ~2–4 GB for KV cache and activations on a 24 GB card. The backend benchmark at /check/qwen3-32b/rtx-3090 records peak VRAM at the 24 GB ceiling at Q4_K — i.e. the full envelope is in use with default context. UD-Q4_K_XL (20 GB) and Q4_K_M (19.8 GB) are both in the "fits with discipline" band; Q5_K_S (22.6 GB) and Q5_K_M (23.2 GB) are too tight; Q6_K (26.9 GB) and BF16 (65.5 GB) do not fit at all.
Quality notes: UD-Q4_K_XL is the Unsloth mixed-precision GGUF tier; the Unsloth Dynamic 2.0 docs characterize it as "selectively quantizes layers much more intelligently" with per-layer sensitivity-aware bit-allocation, calibrated on 1.5M+ real-world tokens. On a 24 GB card you cannot step up to Q5/Q6/Q8 because they don't fit alongside KV cache — UD-Q4_K_XL is the quality ceiling for this pair without a larger GPU.

For the full benchmark data and other-GPU comparisons, see /check/qwen3-32b/rtx-3090.

Troubleshooting

Generation slows or OOMs past 8K–16K context

The ~22 GB Q4_K_M footprint leaves only ~2 GB for the KV cache on a 24 GB card before pressure spills out of memory. Qwen3-32B's 64 layers with GQA (8 KV heads) make the per-token KV cache reasonable but not free — at FP16 KV the cache eats ~250 MB per 1K tokens, so 8K context costs ~2 GB and 16K costs ~4 GB, the latter pushing total VRAM past 24 GB. The recommended KV-cache discipline ladder, in order of how much it helps:

Cap --ctx-size at 8192 (the recipe default) — keeps headroom for activations.
Quantize the KV cache. Recent llama.cpp builds support --cache-type-k q8_0 --cache-type-v q8_0 (cuts KV-cache size in half with minor quality cost) or --cache-type-k q4_0 (quarter, more cost). With Q8 KV cache, 32K context is reachable. The recipe's canonical install command already enables these.
Drop to UD-Q3_K_XL (16.4 GB on disk per the unsloth tier table) — frees ~4 GB for KV cache at a measurable but tolerable quality cost.
Use a smaller variant. If you routinely need 32K+ context, Qwen3-14B at Q4_K_M is ~9 GB and leaves enough headroom for full 32K with FP16 KV — different recipe, same model family.

Greedy decoding produces endless repetitions

This is a Qwen3-32B-specific failure mode the model card explicitly warns about: per Qwen/Qwen3-32B, "DO NOT use greedy decoding, as it can lead to performance degradation and endless repetitions." Always set temperature > 0 — 0.6 for thinking mode, 0.7 for non-thinking — along with top_p=0.95/0.8 and top_k=20. If you're using a frontend that defaults to temperature=0, override it.

YaRN context extension beyond 32K

Native context is 32,768; the model card describes extending to 131,072 via YaRN rope-scaling: vLLM/SGLang launch flag --rope-scaling '{"rope_type":"yarn","factor":4.0,"original_max_position_embeddings":32768}' --max-model-len 131072 per the Qwen3-32B card. The model card also warns "Avoid YaRN for short texts — static scaling can degrade performance on <32K token inputs." On a 24 GB 3090 the KV cache for 131K context far exceeds available VRAM anyway, so YaRN here is aspirational unless you're chunking; this is one of the workflows where a 48–80 GB GPU or chunked retrieval becomes necessary.

Want a different runtime — vLLM or SGLang?

The Qwen/Qwen3-32B model card documents vllm serve Qwen/Qwen3-32B --enable-reasoning --reasoning-parser deepseek_r1 and python -m sglang.launch_server --model-path Qwen/Qwen3-32B --reasoning-parser qwen3 — both load BF16 weights (65.5 GB total from the HF file tree) which does not fit on a 24 GB card. To use vLLM or SGLang on this hardware you need either an AWQ-INT4 mirror (≤24 GB resident) or a multi-GPU setup. For single-3090 deployment, llama.cpp + GGUF remains the only realistic loader path.

Ampere vs Ada — anything special for the 3090?

The RTX 3090 is Ampere (sm_86) — fully supported by mainline CUDA, llama.cpp's CUDA backend, and FlashAttention v2 (used by --flash-attn). The default pre-built llama-bXXXX-bin-ubuntu-cuda-12.x-x64.zip releases work out of the box; no special wheel selection or attention-implementation override is required. Note that Ampere does not have FP8 tensor cores (FP8 acceleration first shipped on Hopper sm_90 / Ada sm_89), but this recipe ships only Q4_K GGUF weights — there is no FP8 path here, so the arch limitation does not bite. If you sibling-pin from an Ada/Hopper recipe that uses FP8 safetensors, FP8 weights still load on the 3090 (dequantized to BF16 on the fly, VRAM savings preserved) but you do not get the FP8-compute speed boost; choose Q4_K_M GGUF over FP8 safetensors when both are documented.