Qwen3-32B on RTX 3090 Ti: UD-Q4_K_XL GGUF via llama.cpp

What You'll Build

A local Qwen3-32B reasoning assistant running on an RTX 3090 Ti (24 GB VRAM) through llama.cpp with the unsloth/Qwen3-32B-GGUF UD-Q4_K_XL weights (20 GB on disk, Unsloth's Dynamic 2.0 mixed-precision GGUF tier). At 32B dense parameters, this pair sits at the upper edge of single-Ampere-24GB LLM territory: Q4-class quants (Q4_K_M, UD-Q4_K_XL) are the largest that fit with usable KV-cache headroom, and context-window discipline matters in a way it never does for the smaller 8B / 14B recipes on the same card.

Hardware data: RTX 3090 Ti (24 GB VRAM) · UD-Q4_K_XL GGUF · 38.0 tok/s @ Q4_K, 4K ctx

ℹ️ 3090 vs 3090 Ti: twin-pin. The RTX 3090 and 3090 Ti share the same Ampere sm_86 architecture, the same 24 GB VRAM envelope, and the same GA102 die. The Ti carries roughly 8% more memory bandwidth (1,008 GB/s vs 936 GB/s) and ~10% more compute. Per Hardware Corner's RTX 3090 Ti page, "the performance difference between this and the standard RTX 3090 is often negligible" for LLM inference. The install steps and tuning advice in this recipe are identical to the RTX 3090 sibling; only the measured throughput is re-anchored on the Ti.

⚠️ Quant pinned: UD-Q4_K_XL. This recipe targets the unsloth/Qwen3-32B-GGUF UD-Q4_K_XL tier (20 GB on disk) specifically. Per Unsloth's Dynamic 2.0 documentation, the UD-*-XL family applies per-layer sensitivity-aware bit-allocation calibrated on real-world tokens. Plain Q4_K_M (19.76 GB on disk from the same repo) also fits; Q5_K_S (22.64 GB) and Q5_K_M (23.21 GB) leave too little VRAM for a usable KV cache on a 24 GB card. See "Picking a quant on this card" below.

ℹ️ Variant pinned: dense 32B from the Qwen3 family. Per the Qwen/Qwen3-32B model card, Qwen3 spans 8 sizes: 0.6B, 1.7B, 4B, 8B, 14B, 30B (MoE), 32B (dense, this recipe), and 235B (MoE). The siblings have very different VRAM profiles: Qwen3-14B in Q4_K_M is ~9 GB; Qwen3-30B-A3B is a sparse MoE that activates only ~3B parameters per token, yet all 30B parameters must still be resident; Qwen3-235B does not fit consumer hardware. This recipe is for the dense 32B model only. If you want 14B on the same card with much more KV headroom, swap Qwen3-32B-GGUF for unsloth/Qwen3-14B-GGUF.

Requirements

Component	Minimum	Tested
GPU	24 GB VRAM (Q4_K_M / UD-Q4_K_XL)	RTX 3090 Ti (24 GB)
RAM	32 GB system	—
Storage	20 GB (UD-Q4_K_XL GGUF) per unsloth/Qwen3-32B-GGUF	—
Driver	CUDA 12.x runtime (Ampere sm_86)	—
Runtime	llama.cpp / Ollama / LM Studio	llama.cpp b9247+

The model is released under Apache 2.0 — commercial use permitted. Per the Qwen/Qwen3-32B model card, architecture is 64 layers / 64 Q heads / 8 KV heads (Grouped-Query Attention — keeps the per-token KV cache modest), 32,768-token native context, extendable to 131,072 via YaRN scaling.
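Before installing anything, it can help to confirm that the card really exposes the 24 GB the table above assumes. The sketch below is one possible check, not part of the original recipe: it shells out to nvidia-smi (which must already be on PATH) and parses its CSV output. The helper names and the 24,000 MiB threshold are hypothetical choices for illustration.

```python
import subprocess

# Hypothetical threshold: a 24 GB card reports a little under 24,576 MiB.
MIN_VRAM_MIB = 24_000


def parse_gpu_line(line: str) -> tuple[str, int]:
    """Parse one CSV row such as 'NVIDIA GeForce RTX 3090 Ti, 24564 MiB'."""
    name, memory = (field.strip() for field in line.rsplit(",", 1))
    return name, int(memory.split()[0])


def list_gpus() -> list[tuple[str, int]]:
    """Ask nvidia-smi for each GPU's name and total memory in MiB."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=name,memory.total", "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    ).stdout
    return [parse_gpu_line(line) for line in out.splitlines() if line.strip()]


def has_enough_vram(total_mib: int, minimum_mib: int = MIN_VRAM_MIB) -> bool:
    """True when the card meets the recipe's 24 GB minimum."""
    return total_mib >= minimum_mib
```

Calling list_gpus() on the target machine and passing each memory figure to has_enough_vram() gives a quick go/no-go before the 20 GB download.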

Installation

Option A — llama.cpp + Unsloth UD-Q4_K_XL (recommended path)

This is the canonical CUDA-accelerated llama.cpp loader for a 32B GGUF on a 24 GB Ampere card.

1. Install llama.cpp

# macOS (Homebrew)
brew install llama.cpp

# Linux — pre-built CUDA binary
# Download the latest "llama-bXXXX-bin-ubuntu-cuda-12.x-x64.zip" asset from:
#   https://github.com/ggml-org/llama.cpp/releases
# Extract and add the bin/ directory to PATH.

To build from source with CUDA support instead, follow Unsloth's run guide:

apt-get update
apt-get install pciutils build-essential cmake curl libcurl4-openssl-dev -y
git clone https://github.com/ggml-org/llama.cpp
cmake llama.cpp -B llama.cpp/build \
    -DBUILD_SHARED_LIBS=OFF -DGGML_CUDA=ON -DLLAMA_CURL=ON
cmake --build llama.cpp/build --config Release -j --clean-first --target llama-cli llama-server llama-gguf-split
# Binaries land in llama.cpp/build/bin/; add that directory to PATH.

2. Download the UD-Q4_K_XL GGUF

Per Unsloth's run guide, use snapshot_download to pull only the UD-Q4_K_XL file (~20 GB) instead of the full repo:

pip install huggingface_hub hf_transfer

# download_q4kxl.py
import os
os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="unsloth/Qwen3-32B-GGUF",
    local_dir="unsloth/Qwen3-32B-GGUF",
    allow_patterns=["*UD-Q4_K_XL*"],
)

python download_q4kxl.py

The resulting file is unsloth/Qwen3-32B-GGUF/Qwen3-32B-UD-Q4_K_XL.gguf (20.02 GB per the unsloth model card file tree).
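A truncated or partial download is the most common reason a 20 GB GGUF fails to load, so a quick sanity check may save a confusing server start. This is a minimal sketch, assuming only that GGUF files begin with the 4-byte magic GGUF followed by a little-endian 32-bit version number; the function names are hypothetical.

```python
import struct
from pathlib import Path


def read_gguf_header(path):
    """Return (is_gguf, version) from the first 8 bytes of the file."""
    with open(path, "rb") as f:
        head = f.read(8)
    if len(head) < 8 or head[:4] != b"GGUF":
        return False, None
    (version,) = struct.unpack("<I", head[4:8])
    return True, version


def size_gb(path):
    """File size in decimal gigabytes, the unit used by the file tree."""
    return Path(path).stat().st_size / 1e9
```

For the file above, read_gguf_header("unsloth/Qwen3-32B-GGUF/Qwen3-32B-UD-Q4_K_XL.gguf") should report a valid header, and size_gb() should land close to the 20.02 GB listed in the file tree.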

3. Start the server

llama-server \
  --model unsloth/Qwen3-32B-GGUF/Qwen3-32B-UD-Q4_K_XL.gguf \
  --ctx-size 8192 \
  --cache-type-k q8_0 --cache-type-v q8_0 \
  --flash-attn on \
  --n-gpu-layers 99 \
  --host 0.0.0.0 --port 8080

--n-gpu-layers 99 offloads every layer to the 3090 Ti (the card has just enough VRAM to keep the whole Q4-class model resident, so layer-streaming is unnecessary). --ctx-size 8192 is a safe starting point that leaves room for KV cache. --cache-type-k q8_0 --cache-type-v q8_0 together with --flash-attn roughly halves KV memory up front; this is strongly recommended on a 24 GB card because the 32B Q4 envelope is razor-thin for context (see Troubleshooting). Note that --host 0.0.0.0 exposes the server on every network interface; use 127.0.0.1 if only local clients need access.
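Loading 20 GB of weights takes a while, so scripts that talk to the server usually need to wait until it is ready. The sketch below polls the server's /health route and assumes it returns HTTP 200 once the model is loaded and an error status while loading; the function name and default timings are hypothetical.

```python
import time
import urllib.error
import urllib.request


def wait_until_ready(base_url="http://localhost:8080", timeout_s=300.0, interval_s=2.0):
    """Poll <base_url>/health until it answers 200 or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(f"{base_url}/health", timeout=5) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            # Not listening yet, or still loading (non-200 responses raise HTTPError).
            pass
        time.sleep(interval_s)
    return False
```

A caller would typically run wait_until_ready() once after launching llama-server and abort with a clear message if it returns False.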

Option B — Ollama (one-command alternative)

If you prefer a single command, Ollama can pull the same UD-Q4_K_XL file:

# Pulls the same UD-Q4_K_XL file directly from the unsloth HF repo
ollama run hf.co/unsloth/Qwen3-32B-GGUF:UD-Q4_K_XL

The command above is the Ollama invocation Unsloth documents for the UD-Q4_K_XL build. If you don't need the UD tier specifically, Ollama's own catalog also ships ollama run qwen3:32b (the default Q4_K_M tag from the Qwen team), which is one of the standard k-quants rather than Unsloth's Dynamic 2.0: same disk class, slightly different per-layer recipe.

Option C — LM Studio (GUI)

LM Studio's catalog search ("Qwen3 32B GGUF") will surface the unsloth UD-Q4_K_XL build alongside the bartowski standard-quant ladder. Pick Qwen3-32B-UD-Q4_K_XL.gguf from the unsloth repo and download — same file as Option A. LM Studio's loader will set --n-gpu-layers to "max" automatically for a 3090 Ti.

Running

One-shot prompt via the llama.cpp HTTP server

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3-32b",
    "messages": [{"role": "user", "content": "Explain Grouped Query Attention in three sentences."}],
    "temperature": 0.6,
    "top_p": 0.95,
    "top_k": 20
  }'

The llama.cpp llama-server binary exposes an OpenAI-compatible /v1/chat/completions endpoint on the port chosen above. Note the sampling parameters: per the Qwen/Qwen3-32B model card, thinking mode (the default) wants temperature=0.6, top_p=0.95, top_k=20, min_p=0, and the card explicitly warns against greedy decoding (temperature=0) due to repetition risk. Non-thinking mode uses temperature=0.7, top_p=0.8.
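The same request can be issued from Python, which makes it easier to keep the two sampling profiles straight. This is a sketch using only the standard library; the helper names, the SAMPLING table layout, and the inclusion of top_k=20 and min_p=0 in the non-thinking profile are assumptions made here for illustration, and the server is assumed to be running on localhost:8080 as configured above.

```python
import json
import urllib.request

# Sampling profiles quoted from the model card's recommendations.
SAMPLING = {
    "thinking": {"temperature": 0.6, "top_p": 0.95, "top_k": 20, "min_p": 0.0},
    # Assumption: top_k and min_p carry over unchanged to non-thinking mode.
    "non_thinking": {"temperature": 0.7, "top_p": 0.8, "top_k": 20, "min_p": 0.0},
}


def build_payload(prompt, mode="thinking", model="qwen3-32b"):
    """Build a /v1/chat/completions request body for the chosen mode."""
    if mode not in SAMPLING:
        raise ValueError(f"unknown mode: {mode!r}")
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        **SAMPLING[mode],
    }


def chat(prompt, mode="thinking", base_url="http://localhost:8080"):
    """POST the prompt to the server and return the assistant's text."""
    request = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(build_payload(prompt, mode)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

For example, chat("Explain Grouped Query Attention in three sentences.") mirrors the curl call above, with min_p=0 added explicitly.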

Interactive terminal

llama-cli \
  --model unsloth/Qwen3-32B-GGUF/Qwen3-32B-UD-Q4_K_XL.gguf \
  --ctx-size 8192 \
  --cache-type-k q8_0 --cache-type-v q8_0 \
  --flash-attn on \
  --n-gpu-layers 99 \
  --interactive

Press Ctrl-C to interrupt generation; the CLI keeps the model warm in VRAM until exit.

Thinking vs non-thinking mode

Qwen3-32B has a built-in chain-of-thought ("thinking") mode that the Qwen/Qwen3-32B model card enables via enable_thinking=True. Output starts with a <think>...</think> block followed by the user-facing answer. To disable thinking for latency-sensitive turns, append /no_think to your prompt or set enable_thinking=False in the chat-template parameters — and remember to switch sampling parameters to the non-thinking profile (temperature=0.7, top_p=0.8). Because thinking traces routinely run 2K–4K tokens (longer on hard problems), they amplify KV-cache pressure on this 24 GB card; keep --ctx-size 8192 and KV-quantization flags enabled unless you've tuned headroom.
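When the raw completion text contains the reasoning trace, client code usually wants to separate it from the answer. The sketch below assumes the output carries a literal <think>...</think> block as described above (some server configurations may return the reasoning in a separate response field instead); both helper names are hypothetical.

```python
import re

_THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_thinking(text):
    """Return (reasoning, answer); reasoning is '' when no <think> block exists."""
    match = _THINK_RE.search(text)
    if match is None:
        return "", text.strip()
    reasoning = match.group(1).strip()
    answer = (text[:match.start()] + text[match.end():]).strip()
    return reasoning, answer


def with_no_think(prompt):
    """Append the /no_think soft switch to a user prompt."""
    return f"{prompt.rstrip()} /no_think"
```

Counting the tokens in the reasoning part is also a cheap way to see how much of the 8192-token window a given prompt style spends on thinking.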

Results

Speed: Per Hardware Corner's RTX 3090 Ti LLM benchmark page, Qwen3 32B at Q4_K records 38.0 tokens/s at 4K context and 32.6 tokens/s at 16K context for token generation, with prompt-processing prefill at 1,238.8 t/s @ 4K and 862.7 t/s @ 16K (CUDA, -fa 1 flash-attention). Hardware Corner's page-level summary confirms the 16K-context figure: "in our tests it handled the Qwen3 32B (Q4_K) model with a 16k context window, achieving a generation speed of 32.6 t/s." Confirmed by /check/qwen3-32b/rtx-3090-ti (Hardware Corner-sourced, last verified 2026-05-15). Reasoning-model workloads where most of the output is <think> content will see effective throughput 30–50% lower than these chat-class figures, because part of that token budget is consumed by content the user discards. Compared with the RTX 3090 sibling (35.1 tok/s @ 4K, 30.3 tok/s @ 16K), the Ti is roughly 8% faster at both context lengths, in line with its ~8% memory-bandwidth lead (1,008 GB/s vs 936 GB/s).
VRAM usage: ~22 GB peak at Q4_K_M on the 24 GB card. The unsloth/Qwen3-32B-GGUF file tree lists Q4_K_M at 19.76 GB on disk and UD-Q4_K_XL at 20.02 GB on disk — both load close to file size with GGUF's mmap-style allocation, leaving roughly 3–4 GB for KV cache and activations at --ctx-size 8192 with --cache-type-k q8_0 --cache-type-v q8_0. The backend benchmark at /check/qwen3-32b/rtx-3090-ti records peak VRAM at the 24 GB ceiling at Q4_K — i.e. the full envelope is in use with default context. UD-Q4_K_XL (20.02 GB) and Q4_K_M (19.76 GB) are both in the "fits with discipline" band; Q5_K_S (22.64 GB) and Q5_K_M (23.21 GB) are too tight; Q6_K (26.88 GB) and Q8_0 (34.82 GB) do not fit at all. This is materially different from the smaller 8B / 14B Qwen3 recipes on the same card — there is no spare 13–15 GB to colocate a second model or stack additional context.
Quality notes: UD-Q4_K_XL is the Unsloth mixed-precision GGUF tier; per the Unsloth Dynamic 2.0 docs, it uses per-layer sensitivity-aware bit-allocation calibrated on a large real-world token sample. On a 24 GB card you cannot step up to Q5/Q6/Q8 because they don't fit alongside KV cache — UD-Q4_K_XL is the quality ceiling for this pair without a larger GPU.
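The throughput comparisons above reduce to simple ratios, sketched below. The figures are the ones quoted in this section; the helper names are hypothetical, and the visible-throughput model is a simplifying assumption (every generated token costs the same time, and <think> tokens are discarded).

```python
def percent_faster(new: float, old: float) -> float:
    """Relative gain of `new` over `old`, in percent."""
    return (new / old - 1.0) * 100.0


def visible_tps(raw_tps: float, think_share: float) -> float:
    """User-visible tokens/s when `think_share` of all output is <think> content."""
    if not 0.0 <= think_share < 1.0:
        raise ValueError("think_share must be in [0, 1)")
    return raw_tps * (1.0 - think_share)


print(f"Ti vs 3090 @ 4K ctx:  {percent_faster(38.0, 35.1):.1f}% faster")
print(f"Ti vs 3090 @ 16K ctx: {percent_faster(32.6, 30.3):.1f}% faster")
print(f"Memory bandwidth:     {percent_faster(1008, 936):.1f}% higher")
for share in (0.3, 0.5):
    print(f"{share:.0%} thinking share: {visible_tps(38.0, share):.1f} visible tok/s")
```

Under this model a thinking share of 30–50% maps directly onto a 30–50% drop in visible throughput; longer traces push the figure lower.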

For the full benchmark data and other-GPU comparisons, see /check/qwen3-32b/rtx-3090-ti.

Troubleshooting

Picking a quant on this card

Quant fit on a 24 GB Ampere card — file sizes from the unsloth/Qwen3-32B-GGUF file tree:

Tier	On-disk	Fits with KV at ctx 8K?
UD-Q3_K_XL	16.40 GB	Yes — comfortable, ~7 GB KV/activation headroom
Q4_K_M	19.76 GB	Yes — tight, ~3–4 GB headroom (recipe default class)
UD-Q4_K_XL	20.02 GB	Yes — tight (recipe canonical)
Q5_K_S	22.64 GB	Borderline — typically OOMs past 4K context
Q5_K_M	23.21 GB	Borderline — no KV headroom
Q6_K	26.88 GB	No
Q8_0	34.82 GB	No

If you want headroom for longer contexts or a colocated draft model, drop to UD-Q3_K_XL (16.4 GB on disk). If you want to step up in quality, you need a larger GPU.
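The headroom column in the table is just card capacity minus file size. The sketch below reproduces that arithmetic under two stated simplifications: it treats the card as exactly 24 GB in the same unit as the file sizes, and it ignores driver and runtime overhead. The names are hypothetical.

```python
CARD_GB = 24.0

# On-disk sizes from the table above.
QUANTS_GB = {
    "UD-Q3_K_XL": 16.40,
    "Q4_K_M": 19.76,
    "UD-Q4_K_XL": 20.02,
    "Q5_K_S": 22.64,
    "Q5_K_M": 23.21,
    "Q6_K": 26.88,
    "Q8_0": 34.82,
}


def headroom_gb(file_gb: float, card_gb: float = CARD_GB) -> float:
    """VRAM left for KV cache and activations once the weights are resident."""
    return card_gb - file_gb


for tier, size in QUANTS_GB.items():
    room = headroom_gb(size)
    status = "does not fit" if room < 0 else f"{room:.2f} GB headroom"
    print(f"{tier:<11} {size:6.2f} GB  {status}")
```

Real headroom is somewhat smaller than these numbers because the CUDA context and compute buffers also occupy VRAM.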

Generation slows or OOMs past 8K–16K context

The ~22 GB peak footprint at Q4_K_M leaves only ~2 GB of spare VRAM on a 24 GB card before allocations start to fail. Qwen3-32B's 64 layers with GQA (8 KV heads) make the per-token KV cache reasonable but not free: at FP16 the cache takes ~250 MB per 1K tokens, so 8K context costs ~2 GB and 16K costs ~4 GB, and the latter on top of ~20 GB of weights pushes total VRAM past 24 GB. The recommended KV-cache discipline ladder, in order of how much it helps:

1. Cap --ctx-size at 8192 (the recipe default) to keep headroom for activations.
2. Quantize the KV cache. Recent llama.cpp builds support --cache-type-k q8_0 --cache-type-v q8_0 (cuts KV-cache size roughly in half with minor quality cost) or --cache-type-k q4_0 (roughly a quarter, at more quality cost); a quantized V cache requires --flash-attn. With Q8 KV cache, 16K context is comfortable; Hardware Corner's published 32.6 tok/s at 16K context confirms this works in the 24 GB envelope.
3. Drop to UD-Q3_K_XL (16.40 GB on disk per the unsloth file tree), which frees ~3.6 GB for KV cache at a measurable but tolerable quality cost.
4. Use a smaller variant. If you routinely need 32K+ context, Qwen3-14B at Q4_K_M is ~9 GB and leaves enough headroom for full 32K with FP16 KV: different recipe, same model family.
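The KV-cache figures behind this ladder can be derived from the architecture. The sketch below assumes a head dimension of 128 (not stated in this recipe; taken from the model's published configuration) and the ggml block layouts for q8_0 (34 bytes per 32 values) and q4_0 (18 bytes per 32 values); the function name is hypothetical.

```python
N_LAYERS = 64
N_KV_HEADS = 8
HEAD_DIM = 128  # assumption: not given in this recipe

# Average bytes stored per cached value for each cache type.
BYTES_PER_VALUE = {"f16": 2.0, "q8_0": 34 / 32, "q4_0": 18 / 32}

GIB = 1024 ** 3


def kv_cache_bytes(ctx_tokens: int, cache_type: str = "f16") -> float:
    """Size of the K and V caches for `ctx_tokens` tokens of context."""
    per_token = 2 * N_LAYERS * N_KV_HEADS * HEAD_DIM * BYTES_PER_VALUE[cache_type]
    return ctx_tokens * per_token


for ctx in (4096, 8192, 16384, 32768):
    sizes = "  ".join(
        f"{name}: {kv_cache_bytes(ctx, name) / GIB:5.2f} GiB" for name in BYTES_PER_VALUE
    )
    print(f"ctx {ctx:>6}  {sizes}")
```

At FP16 this gives 256 KiB per token, which matches the ~250 MB per 1K tokens quoted above; q8_0 comes out slightly above half of that because each block also stores a scale.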

Greedy decoding produces endless repetitions

The Qwen/Qwen3-32B model card explicitly warns about this failure mode in its "Best Practices" section. Always set temperature > 0 (0.6 for thinking mode, 0.7 for non-thinking), along with top_p=0.95 (thinking) or 0.8 (non-thinking) and top_k=20. If you're using a frontend that defaults to temperature=0, override it.

YaRN context extension beyond 32K

Native context is 32,768; the model card describes extending to 131,072 via YaRN rope-scaling. For vLLM the card gives --rope-scaling '{"rope_type":"yarn","factor":4.0,"original_max_position_embeddings":32768}' --max-model-len 131072; for llama.cpp's llama-server the equivalent flags are --rope-scaling yarn --rope-scale 4 --yarn-orig-ctx 32768. The model card also recommends against enabling YaRN when inputs are short, because static scaling can degrade short-context performance. On a 24 GB 3090 Ti the KV cache for 131K context (~32 GB at FP16, at ~250 MB per 1K tokens) far exceeds available VRAM anyway, so YaRN here is aspirational unless you're chunking; this is one of the workflows where a 48–80 GB GPU or chunked retrieval becomes necessary.
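The YaRN factor is simply the target context divided by the native context, which is where the 4.0 above comes from. The sketch below builds the rope-scaling JSON for an arbitrary target; the function name is hypothetical, and it assumes the same three keys shown in the flag above.

```python
import json

NATIVE_CTX = 32_768


def yarn_rope_scaling(target_ctx: int, native_ctx: int = NATIVE_CTX) -> dict:
    """Build a YaRN rope-scaling config that stretches native_ctx to target_ctx."""
    if target_ctx <= native_ctx:
        raise ValueError("YaRN is only needed beyond the native context")
    return {
        "rope_type": "yarn",
        "factor": target_ctx / native_ctx,
        "original_max_position_embeddings": native_ctx,
    }


print(json.dumps(yarn_rope_scaling(131_072)))
```

Because the scaling is static, choosing the smallest factor that covers the real workload (for example 2.0 for a 65,536-token target) is a reasonable way to limit the short-context degradation mentioned above.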

Want a different runtime — vLLM or SGLang?

The Qwen/Qwen3-32B model card documents vllm serve Qwen/Qwen3-32B --enable-reasoning --reasoning-parser deepseek_r1 and python -m sglang.launch_server --model-path Qwen/Qwen3-32B --reasoning-parser qwen3. Both load the BF16 weights (~65 GB total from the HF file tree), which do not fit on a 24 GB card. To use vLLM or SGLang on this hardware you need either an AWQ-INT4 mirror (≤24 GB resident) or a multi-GPU setup. For single-3090-Ti deployment, llama.cpp + GGUF remains the most practical loader path.
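The ~65 GB figure follows from parameter count times bytes per weight. The sketch below assumes 32.8 billion parameters (the count listed on the model card, not stated in this recipe) and ignores non-weight files and quantization metadata; the function names are hypothetical.

```python
PARAMS = 32.8e9  # assumption: parameter count from the model card


def weights_gb(params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in decimal GB at a given precision."""
    return params * bits_per_weight / 8 / 1e9


def effective_bits(file_gb: float, params: float = PARAMS) -> float:
    """Average bits per weight implied by a quantized file's size."""
    return file_gb * 1e9 * 8 / params


print(f"BF16 weights:       {weights_gb(PARAMS, 16):.1f} GB")
print(f"Ideal 4-bit:        {weights_gb(PARAMS, 4):.1f} GB")
print(f"UD-Q4_K_XL average: {effective_bits(20.02):.2f} bits/weight")
```

The gap between the ideal 4-bit size and the 20.02 GB file reflects the scales and the higher-precision layers that mixed-precision quants keep.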

Ampere vs Ada — anything special for the 3090 Ti?

The RTX 3090 Ti is Ampere (sm_86), fully supported by mainline CUDA and by llama.cpp's CUDA backend, including the flash-attention kernels that --flash-attn enables. The default pre-built llama-bXXXX-bin-ubuntu-cuda-12.x-x64.zip releases work out of the box; no special wheel selection or attention-implementation override is required. Note that Ampere does not have FP8 tensor cores (FP8 acceleration first shipped on Hopper sm_90 / Ada sm_89), but this recipe ships only Q4_K GGUF weights, so there is no FP8 path here and the architecture limitation does not bite. If you sibling-pin from an Ada/Hopper recipe that uses FP8 safetensors, FP8 weights still load on the 3090 Ti (dequantized to BF16 on the fly, VRAM savings preserved) but you do not get the FP8-compute speed boost; choose Q4_K_M GGUF over FP8 safetensors when both are documented.