What You'll Build
A local Qwen3-32B reasoning assistant running on an RTX 3090 (24 GB VRAM) through llama.cpp with the unsloth/Qwen3-32B-GGUF UD-Q4_K_XL weights (20 GB on disk, Unsloth's Dynamic 2.0 mixed-precision GGUF tier). At 32B dense parameters, this pair sits on the upper edge of single-RTX-3090 LLM territory — Q4_K_M-class quants are the only ones that fit with usable KV-cache headroom, and context-window discipline matters in a way it never does on smaller 8B / 14B recipes.
Hardware data: RTX 3090 (24 GB VRAM) · UD-Q4_K_XL GGUF · 35.1 tok/s @ Q4_K, 4K ctx · See benchmark data
⚠️ Quant pinned — UD-Q4_K_XL. This recipe targets the unsloth/Qwen3-32B-GGUF
UD-Q4_K_XLtier (20 GB on disk) specifically. Per Unsloth's Dynamic 2.0 documentation, theUD-*-XLfamily applies per-layer sensitivity-aware bit-allocation — "selectively quantizes layers much more intelligently." PlainQ4_K_M(19.8 GB on disk from the same repo) also fits; Q5_K_S (22.6 GB) and Q5_K_M (23.2 GB) do NOT fit alongside any KV cache on a 24 GB card. See "Picking a quant on this card" below.
ℹ️ Variant pinned — dense 32B from the Qwen3 family. Per the Qwen/Qwen3-32B model card, Qwen3 spans 8 sizes —
0.6b,1.7b,4b,8b,14b,30b(MoE),32b(this recipe — dense),235b(MoE). The siblings have wildly different VRAM profiles: Qwen3-14B in Q4_K_M is ~9 GB; Qwen3-30B-A3B is MoE so all 30B params must be resident; Qwen3-235B does not fit consumer hardware. This recipe is for the dense 32B model only. If you want 14B on the same card with much more KV headroom, swapQwen3-32B-GGUFfor unsloth/Qwen3-14B-GGUF.
Requirements
| Component | Minimum | Tested |
|---|---|---|
| GPU | 24 GB VRAM (Q4_K_M / UD-Q4_K_XL) | RTX 3090 (24 GB) |
| RAM | 32 GB system | — |
| Storage | 20 GB (UD-Q4_K_XL GGUF) per unsloth/Qwen3-32B-GGUF | — |
| Driver | CUDA 12.x runtime (Ampere sm_86) | — |
| Runtime | llama.cpp / Ollama / LM Studio | llama.cpp b9247+ |
The model is released under Apache 2.0 — commercial use permitted. Per the Qwen3-32B HF model card, architecture is 64 layers / 64 Q heads / 8 KV heads (Grouped-Query Attention — keeps the per-token KV cache modest), 32,768-token native context, extendable to 131,072 via YaRN scaling.
Installation
Option A — llama.cpp + Unsloth UD-Q4_K_XL (recommended path)
This is the canonical CUDA-accelerated llama.cpp loader for a 32B GGUF on a 24 GB Ampere card.
1. Install llama.cpp
# macOS (Homebrew)
brew install llama.cpp
# Linux — pre-built CUDA binary
# Download the latest "llama-bXXXX-bin-ubuntu-cuda-12.x-x64.zip" asset from:
# https://github.com/ggml-org/llama.cpp/releases
# Extract and add the bin/ directory to PATH.
To build from source with CUDA support instead, follow Unsloth's run guide:
apt-get update
apt-get install pciutils build-essential cmake curl libcurl4-openssl-dev -y
git clone https://github.com/ggml-org/llama.cpp
cmake llama.cpp -B llama.cpp/build \
-DBUILD_SHARED_LIBS=OFF -DGGML_CUDA=ON -DLLAMA_CURL=ON
cmake --build llama.cpp/build --config Release -j --clean-first --target llama-cli llama-gguf-split
2. Download the UD-Q4_K_XL GGUF
Per Unsloth's run guide, use snapshot_download to pull only the UD-Q4_K_XL file (~20 GB) instead of the full repo:
pip install huggingface_hub hf_transfer
# download_q4kxl.py
import os
os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"
from huggingface_hub import snapshot_download
snapshot_download(
repo_id="unsloth/Qwen3-32B-GGUF",
local_dir="unsloth/Qwen3-32B-GGUF",
allow_patterns=["*UD-Q4_K_XL*"],
)
python download_q4kxl.py
The resulting file is unsloth/Qwen3-32B-GGUF/Qwen3-32B-UD-Q4_K_XL.gguf (20 GB per the unsloth model card).
3. Start the server
llama-server \
--model unsloth/Qwen3-32B-GGUF/Qwen3-32B-UD-Q4_K_XL.gguf \
--ctx-size 8192 \
--cache-type-k q8_0 --cache-type-v q8_0 \
--flash-attn \
--n-gpu-layers 99 \
--host 0.0.0.0 --port 8080
--n-gpu-layers 99 offloads every layer to the 3090 (the 3090 has just enough VRAM to keep the whole model resident at Q4_K_M; layer-streaming is unnecessary). --ctx-size 8192 is a safe starting point that leaves room for KV cache. --cache-type-k q8_0 --cache-type-v q8_0 --flash-attn halves KV memory up-front — strongly recommended on a 24 GB card because the 32B Q4_K_M envelope is razor-thin for context (see Troubleshooting).
Option B — Ollama (one-command alternative)
If you prefer a single command and don't need the specific UD-Q4_K_XL tier:
# Pulls the same UD-Q4_K_XL file directly from the unsloth HF repo
ollama run hf.co/unsloth/Qwen3-32B-GGUF:UD-Q4_K_XL
The command above is the canonical Ollama incantation documented by Unsloth for the UD-Q4_K_XL build. Ollama's own catalog also ships ollama run qwen3:32b (the default Q4_K_M tag from the Qwen team) which is one of the standard k-quants rather than Unsloth's Dynamic 2.0 — same disk class, slightly different per-layer recipe.
Option C — LM Studio (GUI)
LM Studio's catalog search ("Qwen3 32B GGUF") will surface the unsloth UD-Q4_K_XL build alongside the bartowski standard-quant ladder. Pick Qwen3-32B-UD-Q4_K_XL.gguf from the unsloth repo and download — same file as Option A. LM Studio's loader will set --n-gpu-layers to "max" automatically for a 3090.
Running
One-shot prompt via the llama.cpp HTTP server
curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "qwen3-32b",
"messages": [{"role": "user", "content": "Explain Grouped Query Attention in three sentences."}],
"temperature": 0.6,
"top_p": 0.95,
"top_k": 20
}'
The llama.cpp llama-server binary exposes an OpenAI-compatible /v1/chat/completions endpoint on the port chosen above. Note the sampling parameters: per the Qwen/Qwen3-32B model card, thinking mode (the default) wants temperature=0.6, top_p=0.95, top_k=20, min_p=0 and explicitly warns "DO NOT use greedy decoding — it can lead to performance degradation and endless repetitions." Non-thinking mode uses temperature=0.7, top_p=0.8.
Interactive terminal
llama-cli \
--model unsloth/Qwen3-32B-GGUF/Qwen3-32B-UD-Q4_K_XL.gguf \
--ctx-size 8192 \
--cache-type-k q8_0 --cache-type-v q8_0 \
--flash-attn \
--n-gpu-layers 99 \
--interactive
Press Ctrl-C to interrupt generation; the CLI keeps the model warm in VRAM until exit.
Thinking vs non-thinking mode
Qwen3-32B has a built-in chain-of-thought ("thinking") mode that the Qwen/Qwen3-32B model card enables via enable_thinking=True. Output starts with a <think>...</think> block followed by the user-facing answer. To disable thinking for latency-sensitive turns, append /no_think to your prompt or set enable_thinking=False in the chat-template parameters — and remember to switch sampling parameters to the non-thinking profile (temperature=0.7, top_p=0.8). Because thinking traces routinely run 2K–4K tokens (longer on hard problems), they amplify KV-cache pressure on this 24 GB card; keep --ctx-size 8192 and KV-quantization flags enabled unless you've tuned headroom.
Results
- Speed: Per Hardware Corner's RTX 3090 LLM benchmark page, Qwen3 32B at Q4_K records 35.1 tokens/s at 4K context and 30.3 tokens/s at 16K context for token generation, with prompt-processing prefill at 1,087.9 t/s @ 4K and 767.8 t/s @ 16K (CUDA,
-fa 1flash-attention). Confirmed by/check/qwen3-32b/rtx-3090(Hardware Corner-sourced, last verified 2026-05-15). Reasoning-model workloads where most of the output is<think>content will see effective throughput 30–50% lower than these chat-class figures — that token budget is partly consumed by content the user discards. - VRAM usage: ~22 GB peak at Q4_K_M on the 24 GB card. The unsloth/Qwen3-32B-GGUF tier table lists Q4_K_M at 19.8 GB on disk and UD-Q4_K_XL at 20 GB on disk — both load close to file size with GGUF's mmap-style allocation, leaving ~2–4 GB for KV cache and activations on a 24 GB card. The backend benchmark at
/check/qwen3-32b/rtx-3090records peak VRAM at the 24 GB ceiling at Q4_K — i.e. the full envelope is in use with default context. UD-Q4_K_XL (20 GB) and Q4_K_M (19.8 GB) are both in the "fits with discipline" band; Q5_K_S (22.6 GB) and Q5_K_M (23.2 GB) are too tight; Q6_K (26.9 GB) and BF16 (65.5 GB) do not fit at all. - Quality notes: UD-Q4_K_XL is the Unsloth mixed-precision GGUF tier; the Unsloth Dynamic 2.0 docs characterize it as "selectively quantizes layers much more intelligently" with per-layer sensitivity-aware bit-allocation, calibrated on 1.5M+ real-world tokens. On a 24 GB card you cannot step up to Q5/Q6/Q8 because they don't fit alongside KV cache — UD-Q4_K_XL is the quality ceiling for this pair without a larger GPU.
For the full benchmark data and other-GPU comparisons, see /check/qwen3-32b/rtx-3090.
Troubleshooting
Generation slows or OOMs past 8K–16K context
The ~22 GB Q4_K_M footprint leaves only ~2 GB for the KV cache on a 24 GB card before pressure spills out of memory. Qwen3-32B's 64 layers with GQA (8 KV heads) make the per-token KV cache reasonable but not free — at FP16 KV the cache eats ~250 MB per 1K tokens, so 8K context costs ~2 GB and 16K costs ~4 GB, the latter pushing total VRAM past 24 GB. The recommended KV-cache discipline ladder, in order of how much it helps:
- Cap
--ctx-sizeat 8192 (the recipe default) — keeps headroom for activations. - Quantize the KV cache. Recent llama.cpp builds support
--cache-type-k q8_0 --cache-type-v q8_0(cuts KV-cache size in half with minor quality cost) or--cache-type-k q4_0(quarter, more cost). With Q8 KV cache, 32K context is reachable. The recipe's canonical install command already enables these. - Drop to UD-Q3_K_XL (16.4 GB on disk per the unsloth tier table) — frees ~4 GB for KV cache at a measurable but tolerable quality cost.
- Use a smaller variant. If you routinely need 32K+ context, Qwen3-14B at Q4_K_M is ~9 GB and leaves enough headroom for full 32K with FP16 KV — different recipe, same model family.
Greedy decoding produces endless repetitions
This is a Qwen3-32B-specific failure mode the model card explicitly warns about: per Qwen/Qwen3-32B, "DO NOT use greedy decoding, as it can lead to performance degradation and endless repetitions." Always set temperature > 0 — 0.6 for thinking mode, 0.7 for non-thinking — along with top_p=0.95/0.8 and top_k=20. If you're using a frontend that defaults to temperature=0, override it.
YaRN context extension beyond 32K
Native context is 32,768; the model card describes extending to 131,072 via YaRN rope-scaling: vLLM/SGLang launch flag --rope-scaling '{"rope_type":"yarn","factor":4.0,"original_max_position_embeddings":32768}' --max-model-len 131072 per the Qwen3-32B card. The model card also warns "Avoid YaRN for short texts — static scaling can degrade performance on <32K token inputs." On a 24 GB 3090 the KV cache for 131K context far exceeds available VRAM anyway, so YaRN here is aspirational unless you're chunking; this is one of the workflows where a 48–80 GB GPU or chunked retrieval becomes necessary.
Want a different runtime — vLLM or SGLang?
The Qwen/Qwen3-32B model card documents vllm serve Qwen/Qwen3-32B --enable-reasoning --reasoning-parser deepseek_r1 and python -m sglang.launch_server --model-path Qwen/Qwen3-32B --reasoning-parser qwen3 — both load BF16 weights (65.5 GB total from the HF file tree) which does not fit on a 24 GB card. To use vLLM or SGLang on this hardware you need either an AWQ-INT4 mirror (≤24 GB resident) or a multi-GPU setup. For single-3090 deployment, llama.cpp + GGUF remains the only realistic loader path.
Ampere vs Ada — anything special for the 3090?
The RTX 3090 is Ampere (sm_86) — fully supported by mainline CUDA, llama.cpp's CUDA backend, and FlashAttention v2 (used by --flash-attn). The default pre-built llama-bXXXX-bin-ubuntu-cuda-12.x-x64.zip releases work out of the box; no special wheel selection or attention-implementation override is required. Note that Ampere does not have FP8 tensor cores (FP8 acceleration first shipped on Hopper sm_90 / Ada sm_89), but this recipe ships only Q4_K GGUF weights — there is no FP8 path here, so the arch limitation does not bite. If you sibling-pin from an Ada/Hopper recipe that uses FP8 safetensors, FP8 weights still load on the 3090 (dequantized to BF16 on the fly, VRAM savings preserved) but you do not get the FP8-compute speed boost; choose Q4_K_M GGUF over FP8 safetensors when both are documented.