What You'll Build
A local Qwen3-32B reasoning assistant running on an RTX 3090 Ti (24 GB VRAM) through llama.cpp with the unsloth/Qwen3-32B-GGUF UD-Q4_K_XL weights (20 GB on disk, Unsloth's Dynamic 2.0 mixed-precision GGUF tier). At 32B dense parameters, this pair sits at the upper edge of single-Ampere-24GB LLM territory — Q4_K_M-class quants are the only ones that fit with usable KV-cache headroom, and context-window discipline matters in a way it never does on smaller 8B / 14B recipes on the same card.
Hardware data: RTX 3090 Ti (24 GB VRAM) · UD-Q4_K_XL GGUF · 38.0 tok/s @ Q4_K, 4K ctx · See benchmark data
ℹ️ 3090 vs 3090 Ti — twin-pin. The RTX 3090 and 3090 Ti share the same Ampere
sm_86architecture, the same 24 GB VRAM envelope, and the sameGA102die. The Ti carries roughly 10% more memory bandwidth (1,008 GB/s vs 936 GB/s) and ~10% more compute. Per Hardware Corner's RTX 3090 Ti page, "the performance difference between this and the standard RTX 3090 is often negligible" for LLM inference. The install steps and tuning advice in this recipe are identical to the RTX 3090 sibling — only the measured throughput is re-anchored on the Ti.
⚠️ Quant pinned — UD-Q4_K_XL. This recipe targets the unsloth/Qwen3-32B-GGUF
UD-Q4_K_XLtier (20 GB on disk) specifically. Per Unsloth's Dynamic 2.0 documentation, theUD-*-XLfamily applies per-layer sensitivity-aware bit-allocation calibrated on real-world tokens. PlainQ4_K_M(19.76 GB on disk from the same repo) also fits; Q5_K_S (22.6 GB) and Q5_K_M (23.2 GB) do NOT fit alongside any KV cache on a 24 GB card. See "Picking a quant on this card" below.
ℹ️ Variant pinned — dense 32B from the Qwen3 family. Per the Qwen/Qwen3-32B model card, Qwen3 spans 8 sizes —
0.6b,1.7b,4b,8b,14b,30b(MoE),32b(this recipe — dense),235b(MoE). The siblings have wildly different VRAM profiles: Qwen3-14B in Q4_K_M is ~9 GB; Qwen3-30B-A3B is sparse MoE so all 30B params must be resident; Qwen3-235B does not fit consumer hardware. This recipe is for the dense 32B model only. If you want 14B on the same card with much more KV headroom, swapQwen3-32B-GGUFfor unsloth/Qwen3-14B-GGUF.
Requirements
| Component | Minimum | Tested |
|---|---|---|
| GPU | 24 GB VRAM (Q4_K_M / UD-Q4_K_XL) | RTX 3090 Ti (24 GB) |
| RAM | 32 GB system | — |
| Storage | 20 GB (UD-Q4_K_XL GGUF) per unsloth/Qwen3-32B-GGUF | — |
| Driver | CUDA 12.x runtime (Ampere sm_86) | — |
| Runtime | llama.cpp / Ollama / LM Studio | llama.cpp b9247+ |
The model is released under Apache 2.0 — commercial use permitted. Per the Qwen/Qwen3-32B model card, architecture is 64 layers / 64 Q heads / 8 KV heads (Grouped-Query Attention — keeps the per-token KV cache modest), 32,768-token native context, extendable to 131,072 via YaRN scaling.
Installation
Option A — llama.cpp + Unsloth UD-Q4_K_XL (recommended path)
This is the canonical CUDA-accelerated llama.cpp loader for a 32B GGUF on a 24 GB Ampere card.
1. Install llama.cpp
# macOS (Homebrew)
brew install llama.cpp
# Linux — pre-built CUDA binary
# Download the latest "llama-bXXXX-bin-ubuntu-cuda-12.x-x64.zip" asset from:
# https://github.com/ggml-org/llama.cpp/releases
# Extract and add the bin/ directory to PATH.
To build from source with CUDA support instead, follow Unsloth's run guide:
apt-get update
apt-get install pciutils build-essential cmake curl libcurl4-openssl-dev -y
git clone https://github.com/ggml-org/llama.cpp
cmake llama.cpp -B llama.cpp/build \
-DBUILD_SHARED_LIBS=OFF -DGGML_CUDA=ON -DLLAMA_CURL=ON
cmake --build llama.cpp/build --config Release -j --clean-first --target llama-cli llama-gguf-split
2. Download the UD-Q4_K_XL GGUF
Per Unsloth's run guide, use snapshot_download to pull only the UD-Q4_K_XL file (~20 GB) instead of the full repo:
pip install huggingface_hub hf_transfer
# download_q4kxl.py
import os
os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"
from huggingface_hub import snapshot_download
snapshot_download(
repo_id="unsloth/Qwen3-32B-GGUF",
local_dir="unsloth/Qwen3-32B-GGUF",
allow_patterns=["*UD-Q4_K_XL*"],
)
python download_q4kxl.py
The resulting file is unsloth/Qwen3-32B-GGUF/Qwen3-32B-UD-Q4_K_XL.gguf (20.02 GB per the unsloth model card file tree).
3. Start the server
llama-server \
--model unsloth/Qwen3-32B-GGUF/Qwen3-32B-UD-Q4_K_XL.gguf \
--ctx-size 8192 \
--cache-type-k q8_0 --cache-type-v q8_0 \
--flash-attn \
--n-gpu-layers 99 \
--host 0.0.0.0 --port 8080
--n-gpu-layers 99 offloads every layer to the 3090 Ti (the card has just enough VRAM to keep the whole model resident at Q4_K_M; layer-streaming is unnecessary). --ctx-size 8192 is a safe starting point that leaves room for KV cache. --cache-type-k q8_0 --cache-type-v q8_0 --flash-attn halves KV memory up-front — strongly recommended on a 24 GB card because the 32B Q4_K_M envelope is razor-thin for context (see Troubleshooting).
Option B — Ollama (one-command alternative)
If you prefer a single command and don't need the specific UD-Q4_K_XL tier:
# Pulls the same UD-Q4_K_XL file directly from the unsloth HF repo
ollama run hf.co/unsloth/Qwen3-32B-GGUF:UD-Q4_K_XL
The command above is the canonical Ollama incantation documented by Unsloth for the UD-Q4_K_XL build. Ollama's own catalog also ships ollama run qwen3:32b (the default Q4_K_M tag from the Qwen team) which is one of the standard k-quants rather than Unsloth's Dynamic 2.0 — same disk class, slightly different per-layer recipe.
Option C — LM Studio (GUI)
LM Studio's catalog search ("Qwen3 32B GGUF") will surface the unsloth UD-Q4_K_XL build alongside the bartowski standard-quant ladder. Pick Qwen3-32B-UD-Q4_K_XL.gguf from the unsloth repo and download — same file as Option A. LM Studio's loader will set --n-gpu-layers to "max" automatically for a 3090 Ti.
Running
One-shot prompt via the llama.cpp HTTP server
curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "qwen3-32b",
"messages": [{"role": "user", "content": "Explain Grouped Query Attention in three sentences."}],
"temperature": 0.6,
"top_p": 0.95,
"top_k": 20
}'
The llama.cpp llama-server binary exposes an OpenAI-compatible /v1/chat/completions endpoint on the port chosen above. Note the sampling parameters: per the Qwen/Qwen3-32B model card, thinking mode (the default) wants temperature=0.6, top_p=0.95, top_k=20, min_p=0, and the card explicitly warns against greedy decoding (temperature=0) due to repetition risk. Non-thinking mode uses temperature=0.7, top_p=0.8.
Interactive terminal
llama-cli \
--model unsloth/Qwen3-32B-GGUF/Qwen3-32B-UD-Q4_K_XL.gguf \
--ctx-size 8192 \
--cache-type-k q8_0 --cache-type-v q8_0 \
--flash-attn \
--n-gpu-layers 99 \
--interactive
Press Ctrl-C to interrupt generation; the CLI keeps the model warm in VRAM until exit.
Thinking vs non-thinking mode
Qwen3-32B has a built-in chain-of-thought ("thinking") mode that the Qwen/Qwen3-32B model card enables via enable_thinking=True. Output starts with a <think>...</think> block followed by the user-facing answer. To disable thinking for latency-sensitive turns, append /no_think to your prompt or set enable_thinking=False in the chat-template parameters — and remember to switch sampling parameters to the non-thinking profile (temperature=0.7, top_p=0.8). Because thinking traces routinely run 2K–4K tokens (longer on hard problems), they amplify KV-cache pressure on this 24 GB card; keep --ctx-size 8192 and KV-quantization flags enabled unless you've tuned headroom.
Results
- Speed: Per Hardware Corner's RTX 3090 Ti LLM benchmark page, Qwen3 32B at Q4_K records 38.0 tokens/s at 4K context and 32.6 tokens/s at 16K context for token generation, with prompt-processing prefill at 1,238.8 t/s @ 4K and 862.7 t/s @ 16K (CUDA,
-fa 1flash-attention). Hardware Corner's page-level summary confirms the 16K-context figure: "in our tests it handled the Qwen3 32B (Q4_K) model with a 16k context window, achieving a generation speed of 32.6 t/s." Confirmed by/check/qwen3-32b/rtx-3090-ti(Hardware Corner-sourced, last verified 2026-05-15). Reasoning-model workloads where most of the output is<think>content will see effective throughput 30–50% lower than these chat-class figures — that token budget is partly consumed by content the user discards. Compared with the RTX 3090 sibling (35.1 tok/s @ 4K, 30.3 tok/s @ 16K), the Ti is roughly 8% faster at both context lengths, in line with its ~10% bandwidth lead. - VRAM usage: ~22 GB peak at Q4_K_M on the 24 GB card. The unsloth/Qwen3-32B-GGUF file tree lists Q4_K_M at 19.76 GB on disk and UD-Q4_K_XL at 20.02 GB on disk — both load close to file size with GGUF's mmap-style allocation, leaving roughly 3–4 GB for KV cache and activations at
--ctx-size 8192with--cache-type-k q8_0 --cache-type-v q8_0. The backend benchmark at/check/qwen3-32b/rtx-3090-tirecords peak VRAM at the 24 GB ceiling at Q4_K — i.e. the full envelope is in use with default context. UD-Q4_K_XL (20.02 GB) and Q4_K_M (19.76 GB) are both in the "fits with discipline" band; Q5_K_S (22.64 GB) and Q5_K_M (23.21 GB) are too tight; Q6_K (26.88 GB) and Q8_0 (34.82 GB) do not fit at all. This is materially different from the smaller 8B / 14B Qwen3 recipes on the same card — there is no spare 13–15 GB to colocate a second model or stack additional context. - Quality notes: UD-Q4_K_XL is the Unsloth mixed-precision GGUF tier; per the Unsloth Dynamic 2.0 docs, it uses per-layer sensitivity-aware bit-allocation calibrated on a large real-world token sample. On a 24 GB card you cannot step up to Q5/Q6/Q8 because they don't fit alongside KV cache — UD-Q4_K_XL is the quality ceiling for this pair without a larger GPU.
For the full benchmark data and other-GPU comparisons, see /check/qwen3-32b/rtx-3090-ti.
Troubleshooting
Picking a quant on this card
Quant fit on a 24 GB Ampere card — file sizes from the unsloth/Qwen3-32B-GGUF file tree:
| Tier | On-disk | Fits with KV at ctx 8K? |
|---|---|---|
| UD-Q3_K_XL | 16.40 GB | Yes — comfortable, ~7 GB KV/activation headroom |
| Q4_K_M | 19.76 GB | Yes — tight, ~3–4 GB headroom (recipe default class) |
| UD-Q4_K_XL | 20.02 GB | Yes — tight (recipe canonical) |
| Q5_K_S | 22.64 GB | Borderline — typically OOMs past 4K context |
| Q5_K_M | 23.21 GB | Borderline — no KV headroom |
| Q6_K | 26.88 GB | No |
| Q8_0 | 34.82 GB | No |
If you want headroom for longer contexts or a colocated draft model, drop to UD-Q3_K_XL (16.4 GB on disk). If you want to step up in quality, you need a larger GPU.
Generation slows or OOMs past 8K–16K context
The ~22 GB Q4_K_M footprint leaves only ~2 GB for the KV cache on a 24 GB card before pressure spills out of memory. Qwen3-32B's 64 layers with GQA (8 KV heads) make the per-token KV cache reasonable but not free — at FP16 KV the cache eats ~250 MB per 1K tokens, so 8K context costs ~2 GB and 16K costs ~4 GB, the latter pushing total VRAM past 24 GB. The recommended KV-cache discipline ladder, in order of how much it helps:
- Cap
--ctx-sizeat 8192 (the recipe default) — keeps headroom for activations. - Quantize the KV cache. Recent llama.cpp builds support
--cache-type-k q8_0 --cache-type-v q8_0(cuts KV-cache size in half with minor quality cost) or--cache-type-k q4_0(quarter, more cost). With Q8 KV cache, 16K context is comfortable; Hardware Corner's published 32.6 tok/s at 16K context confirms this works in the 24 GB envelope. - Drop to UD-Q3_K_XL (16.40 GB on disk per the unsloth file tree) — frees ~4 GB for KV cache at a measurable but tolerable quality cost.
- Use a smaller variant. If you routinely need 32K+ context, Qwen3-14B at Q4_K_M is ~9 GB and leaves enough headroom for full 32K with FP16 KV — different recipe, same model family.
Greedy decoding produces endless repetitions
This is a Qwen3-32B-specific failure mode the model card explicitly warns about — see the Qwen/Qwen3-32B model card "Best Practices" section. Always set temperature > 0 — 0.6 for thinking mode, 0.7 for non-thinking — along with top_p=0.95/0.8 and top_k=20. If you're using a frontend that defaults to temperature=0, override it.
YaRN context extension beyond 32K
Native context is 32,768; the model card describes extending to 131,072 via YaRN rope-scaling: vLLM/SGLang launch flag --rope-scaling '{"rope_type":"yarn","factor":4.0,"original_max_position_embeddings":32768}' --max-model-len 131072 per the Qwen3-32B card. The model card also recommends against YaRN for short-text inputs because static scaling can degrade short-context performance. On a 24 GB 3090 Ti the KV cache for 131K context far exceeds available VRAM anyway, so YaRN here is aspirational unless you're chunking; this is one of the workflows where a 48–80 GB GPU or chunked retrieval becomes necessary.
Want a different runtime — vLLM or SGLang?
The Qwen/Qwen3-32B model card documents vllm serve Qwen/Qwen3-32B --enable-reasoning --reasoning-parser deepseek_r1 and python -m sglang.launch_server --model-path Qwen/Qwen3-32B --reasoning-parser qwen3 — both load BF16 weights (~65 GB total from the HF file tree) which does not fit on a 24 GB card. To use vLLM or SGLang on this hardware you need either an AWQ-INT4 mirror (≤24 GB resident) or a multi-GPU setup. For single-3090-Ti deployment, llama.cpp + GGUF remains the only realistic loader path.
Ampere vs Ada — anything special for the 3090 Ti?
The RTX 3090 Ti is Ampere (sm_86) — fully supported by mainline CUDA, llama.cpp's CUDA backend, and FlashAttention v2 (used by --flash-attn). The default pre-built llama-bXXXX-bin-ubuntu-cuda-12.x-x64.zip releases work out of the box; no special wheel selection or attention-implementation override is required. Note that Ampere does not have FP8 tensor cores (FP8 acceleration first shipped on Hopper sm_90 / Ada sm_89), but this recipe ships only Q4_K GGUF weights — there is no FP8 path here, so the arch limitation does not bite. If you sibling-pin from an Ada/Hopper recipe that uses FP8 safetensors, FP8 weights still load on the 3090 Ti (dequantized to BF16 on the fly, VRAM savings preserved) but you do not get the FP8-compute speed boost; choose Q4_K_M GGUF over FP8 safetensors when both are documented.