How much VRAM does Qwen3 32B need?

About 29 GB — the minimum this recipe targets.

How hard is this setup?

Intermediate — follow the steps above.

Qwen3-32B on RTX 5090: Q6_K_XL GGUF via llama.cpp (with AWQ-INT4 + 128K context alternative)

What You'll Build

A local Qwen3-32B reasoning assistant on an RTX 5090 (32 GB VRAM, Blackwell sm_120), with two viable paths the 24 GB-tier cards cannot run:

Option A — llama.cpp + Unsloth UD-Q6_K_XL GGUF (~29 GB on disk): the quality-ceiling GGUF tier that physically does not fit a 24 GB 3090/4090 (the 3090 sibling recipe and 4090 sibling recipe both top out at UD-Q4_K_XL ~20 GB). The 32 GB envelope swallows Q6_K_XL plus KV cache plus activations.
Option B — vLLM + Qwen's official AWQ-INT4 (~19 GB weights): 4-bit AWQ unlocks the model's native 32K (or YaRN-extended 131K) context on a single 5090 without KV-cache discipline — impossible on a 24 GB card where Q4_K_M leaves only ~2 GB of headroom.

Hardware data: RTX 5090 (32 GB VRAM) · llama.cpp Q4_K · 61.4 tok/s @ 4K · 50.9 tok/s @ 16K · 43.8 tok/s @ 32K · See benchmark data

⚠️ Q8_0 does NOT fit. The unsloth/Qwen3-32B-GGUF tier table lists Q8_0 at 34.8 GB on disk — over the 5090's 32 GB envelope before any KV cache. Q6_K (26.9 GB) and UD-Q6_K_XL (29 GB) are the quality ceiling on this card; UD-Q8_K_XL (39.5 GB) and BF16 (65.5 GB) are out of reach without multi-GPU. This recipe pins UD-Q6_K_XL.

ℹ️ Variant pinned — dense 32B from the Qwen3 family. Per the Qwen/Qwen3-32B model card, this is the dense 32B model (64 layers / 64 Q heads / 8 KV heads — Grouped-Query Attention; 32,768-token native context, extendable to 131,072 via YaRN). The Qwen3 family also ships smaller dense variants (0.6B / 1.7B / 4B / 8B / 14B), a sparse-MoE Qwen3-30B-A3B, and Qwen3-235B-MoE — all are separate recipes. This recipe is for Qwen/Qwen3-32B only.

Requirements

Component	Minimum	Tested
GPU	32 GB VRAM (UD-Q6_K_XL); 24 GB acceptable for AWQ-INT4 alternative	RTX 5090 (32 GB)
RAM	32 GB system	—
Storage	~30 GB (UD-Q6_K_XL GGUF) per unsloth/Qwen3-32B-GGUF	—
Driver	NVIDIA 570+ with CUDA 12.8 runtime (Blackwell sm_120)	—
Runtime	llama.cpp (Option A) or vLLM 0.10.2+ (Option B)	—

The model is released under Apache 2.0 — commercial use permitted. The official AWQ-INT4 variant (Qwen/Qwen3-32B-AWQ) is published by the Qwen org under the same Apache 2.0 license.

Installation

Option A — llama.cpp + Unsloth UD-Q6_K_XL (recommended primary path)

This is the canonical CUDA-accelerated llama.cpp loader for a 32B GGUF on a 32 GB Blackwell card, leading with the highest-quality quant that fits.

1. Install llama.cpp

Pre-built CUDA binaries from the llama.cpp releases page currently ship CUDA 12.4 and CUDA 13.1 builds — the CUDA 13.1 builds include Blackwell sm_120 kernels. macOS users can brew install llama.cpp.

To build from source with full Blackwell sm_120 support, follow Unsloth's Qwen3 run guide:

apt-get update
apt-get install pciutils build-essential cmake curl libcurl4-openssl-dev -y
git clone https://github.com/ggml-org/llama.cpp
cmake llama.cpp -B llama.cpp/build \
    -DBUILD_SHARED_LIBS=OFF -DGGML_CUDA=ON -DLLAMA_CURL=ON
cmake --build llama.cpp/build --config Release -j --clean-first --target llama-cli llama-gguf-split
cp llama.cpp/build/bin/llama-* llama.cpp

2. Download the UD-Q6_K_XL GGUF

Per Unsloth's run guide, use snapshot_download to pull only the UD-Q6_K_XL file (~29 GB) from the unsloth/Qwen3-32B-GGUF repo:

pip install huggingface_hub hf_transfer

# download_q6kxl.py
import os
os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="unsloth/Qwen3-32B-GGUF",
    local_dir="unsloth/Qwen3-32B-GGUF",
    allow_patterns=["*UD-Q6_K_XL*"],
)

python download_q6kxl.py

The resulting file is unsloth/Qwen3-32B-GGUF/Qwen3-32B-UD-Q6_K_XL.gguf (29 GB per the unsloth model card).

3. Start the server

llama-server \
  --model unsloth/Qwen3-32B-GGUF/Qwen3-32B-UD-Q6_K_XL.gguf \
  --ctx-size 16384 \
  --n-gpu-layers 99 \
  --host 0.0.0.0 --port 8080

--n-gpu-layers 99 offloads every layer to the 5090 (no streaming needed — 29 GB UD-Q6_K_XL + 16K-context KV cache fits comfortably in 32 GB). --ctx-size 16384 is a comfortable starting context with ~1.5 GB headroom; raise to 32768 if you need the model's native maximum (still fits without KV quantization on this card — the major win over the 3090/4090 siblings that required --cache-type-k q8_0).

If you prefer the lower-quality / smaller-footprint UD-Q4_K_XL (20 GB) that the 24 GB sibling recipes use, the canonical Ollama incantation documented by Unsloth is:

ollama run hf.co/unsloth/Qwen3-32B-GGUF:UD-Q4_K_XL

Option B — vLLM + AWQ-INT4 (the 128K-context alternative)

When you need the model's full 131,072-token YaRN-extended context, switch to the official Qwen/Qwen3-32B-AWQ build. AWQ INT4 weights total 19.3 GB on disk (per the HF file tree: 4 safetensors files), leaving ~12 GB on the 5090 for KV cache — enough for a 32K context with no quantization tricks.

1. Install vLLM with cu128 wheels

pip install --upgrade pip
# vLLM ships pre-built wheels including Blackwell sm_120 kernels in recent versions
pip install vllm --extra-index-url https://download.pytorch.org/whl/cu128

2. Serve the model at native 32K context

vllm serve Qwen/Qwen3-32B-AWQ \
  --enable-reasoning --reasoning-parser deepseek_r1 \
  --max-model-len 32768 \
  --quantization awq_marlin

The exact vllm serve invocation is per the Qwen/Qwen3-32B-AWQ model card; the --quantization awq_marlin flag picks the Marlin kernel for AWQ inference on modern NVIDIA architectures.

3. (Optional) Extend to 131K context with YaRN

For long-document workloads, append the YaRN rope-scaling flag per the Qwen3-32B card:

vllm serve Qwen/Qwen3-32B-AWQ \
  --enable-reasoning --reasoning-parser deepseek_r1 \
  --rope-scaling '{"rope_type":"yarn","factor":4.0,"original_max_position_embeddings":32768}' \
  --max-model-len 131072 \
  --quantization awq_marlin

The Qwen3-32B card explicitly warns: "If the average context length does not exceed 32,768 tokens, we do not recommend enabling YaRN in this scenario, as it may potentially degrade model performance." Use the YaRN config only when your typical input genuinely exceeds 32K.

Running

One-shot prompt via the llama.cpp HTTP server (Option A)

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3-32b",
    "messages": [{"role": "user", "content": "Explain Grouped Query Attention in three sentences."}],
    "temperature": 0.6,
    "top_p": 0.95,
    "top_k": 20
  }'

The llama.cpp llama-server binary exposes an OpenAI-compatible /v1/chat/completions endpoint. Per the Qwen/Qwen3-32B model card, thinking mode (the default) wants temperature=0.6, top_p=0.95, top_k=20, min_p=0 and the card explicitly warns: "DO NOT use greedy decoding, as it can lead to performance degradation and endless repetitions." Non-thinking mode uses temperature=0.7, top_p=0.8.

Interactive terminal (Option A)

llama-cli \
  --model unsloth/Qwen3-32B-GGUF/Qwen3-32B-UD-Q6_K_XL.gguf \
  --ctx-size 16384 \
  --n-gpu-layers 99 \
  --interactive

Thinking vs non-thinking mode

Qwen3-32B has a built-in chain-of-thought ("thinking") mode that the model card enables via enable_thinking=True. Output starts with a <think>...</think> block followed by the user-facing answer. To disable thinking for latency-sensitive turns, append /no_think to your prompt or set enable_thinking=False — and switch sampling parameters to the non-thinking profile (temperature=0.7, top_p=0.8). Thinking traces routinely run 2K–4K tokens (longer on hard problems); on this 32 GB card the KV-cache pressure that drives the 3090/4090 siblings to --ctx-size 8192 simply does not apply — you can keep --ctx-size 16384 (or 32768) on by default.

Results

Speed (Option A, llama.cpp + Q4_K GGUF, this recipe's runtime family): Per Hardware Corner's RTX 5090 LLM benchmark page, Qwen3 32B at Q4_K records 61.4 tokens/s @ 4K, 50.9 tokens/s @ 16K, 43.8 tokens/s @ 32K context for token generation; prompt-processing prefill is 2,931.3 t/s @ 4K, 2,077.2 t/s @ 16K, 1,451.1 t/s @ 32K (llama.cpp build 8189 column). The 4K and 16K figures are surfaced in /check/qwen3-32b/rtx-5090 (benchmark IDs 97/98). This is a direct first-party citation for the 5090 — no extrapolation needed. Expect UD-Q6_K_XL to run somewhat slower than the Q4_K figures above (larger weights → more memory bandwidth per token); the Awesome Agents Home GPU LLM Leaderboard (column "RTX 5090 (32GB)", row "Qwen 3 32B") corroborates the Q4_K_M Tier-S anchor at ~58 tok/s independently. The /check/qwen3-32b/rtx-5090 entry also lists Hardware Corner's Q4_K_XL 16K measurement at 50.92 tok/s (benchmark id=104, source hardware-corner.net/gpu-ranking-local-llm).
VRAM usage (Option A, this recipe's installed path): ~29 GB for the UD-Q6_K_XL weights + ~1–2 GB for a 16K KV cache + activations = ~30–31 GB peak on the 32 GB card. The unsloth/Qwen3-32B-GGUF tier table lists UD-Q6_K_XL at 29 GB on disk; GGUF mmap-style allocation makes runtime peak track on-disk size closely. For comparison, the 24 GB-tier 3090 and 4090 sibling recipes top out at UD-Q4_K_XL (20 GB) because Q5_K_S (22.6 GB) and Q6_K (26.9 GB) do not fit alongside any KV cache there — the 5090's 32 GB envelope is what unlocks the Q6 quality tier on a single consumer GPU.
VRAM usage (Option B, vLLM + AWQ-INT4, alternative runtime): ~22 GB peak per the Awesome Agents Home GPU LLM Leaderboard ("RTX 5090 (32GB)" column, "Qwen 3 32B" row, 22.2 GB at Q4); on-disk weights are 19.3 GB across 4 safetensors files per the Qwen3-32B-AWQ HF tree. vLLM's default --max-model-len reserves additional KV-cache memory beyond the weight footprint — the 32K-context config above reserves ~6–8 GB more; the 128K YaRN config reserves substantially more. If you serve at 131K context, expect peak VRAM to approach the full 32 GB envelope. This is a separate runtime from Option A — its lower envelope is not the recipe's installed-path claim.
Quality notes: UD-Q6_K_XL is the Unsloth Dynamic 2.0 mixed-precision tier above Q5_K_M; per the Unsloth Dynamic 2.0 docs, the UD-*-XL family applies per-layer sensitivity-aware bit-allocation calibrated on 1.5M+ curated tokens. Quality is materially closer to Q8_0 / BF16 than the 24 GB-tier Q4_K_M is — this is the real win of the 5090 envelope, not a faster inference speed (compute uplift over Q4 is sub-linear; quality uplift over Q4 is the primary motivation).

For the full benchmark data and other-GPU comparisons, see /check/qwen3-32b/rtx-5090.

Troubleshooting

FlashAttention-2 errors on the 5090 — "no kernel image is available for execution on the device"

The 5090 is Blackwell sm_120. As of authoring, Dao-AILab flash-attention#2168 ("[Blackwell/RTX 5090] CUDA error with flash-attention on RTX 5090 in WSL2") is open — FlashAttention-2 does not yet ship pre-built sm_120 kernels on its standard wheel distribution. Option A (llama.cpp) uses its own CUDA kernels and is unaffected — no special configuration required. Option B (vLLM) relies on scaled_dot_product_attention (SDPA) on Blackwell by default; do NOT pass --attention-backend FLASH_ATTN or set VLLM_ATTENTION_BACKEND=FLASH_ATTN, those flags will trigger the sm_120 kernel error. Track #2168 for resolution.

NVFP4 quants fail on vLLM — "RuntimeError: Error Internal" in `cutlass_scaled_fp4_mm`

NVFP4-quantized Qwen3-32B mirrors (RedHatAI/Qwen3-32B-NVFP4 and similar) currently fail on the 5090 via vLLM with a CUDA error in the NVFP4 matmul kernel — see vLLM issue #22783 (closed as stale, no maintainer fix; affects vLLM 0.10.1+). This recipe does not use NVFP4 — Option A is GGUF Q6, Option B is AWQ INT4 (different kernel path via awq_marlin, no NVFP4 dependency). If you specifically want NVFP4 on the 5090, wait for the upstream fix or watch vllm-project/vllm#24921 (related Qwen3-30B-A3B-NVFP4 sm_120 missing-kernel issue, also closed without merged fix).

Greedy decoding produces endless repetitions

This is a Qwen3-32B-specific failure mode the model card explicitly warns about: per Qwen/Qwen3-32B, "DO NOT use greedy decoding, as it can lead to performance degradation and endless repetitions." Always set temperature > 0 — 0.6 for thinking mode, 0.7 for non-thinking — along with top_p=0.95/0.8 and top_k=20. If you're using a frontend that defaults to temperature=0, override it.

Want even more headroom — context beyond 32K?

Option A (llama.cpp + Q6_K_XL) leaves ~1–2 GB for KV cache at 16K context; raising --ctx-size past 32K starts pushing peak VRAM toward the 32 GB ceiling. Two paths to extend further:

Quantize the KV cache. Recent llama.cpp builds support --cache-type-k q8_0 --cache-type-v q8_0 --flash-attn (cuts KV-cache size in half with minor quality cost). With Q8 KV cache, Q6_K_XL at 64K context is reachable.
Switch to Option B (AWQ-INT4 via vLLM) and apply YaRN. AWQ's lower weight footprint plus vLLM's PagedAttention KV cache plus YaRN rope-scaling extends comfortably to 131K — the model's documented max — without KV-quantization tricks. This is the path Option B was included for.

Want to use vLLM with AWQ — Marlin vs other AWQ kernels?

The Qwen/Qwen3-32B-AWQ card documents vllm serve Qwen/Qwen3-32B-AWQ --enable-reasoning --reasoning-parser deepseek_r1 without an explicit --quantization flag — vLLM will auto-detect AWQ from the config. Adding --quantization awq_marlin explicitly pins the Marlin kernel which is the fastest AWQ path on modern NVIDIA architectures (Ampere through Blackwell). On the 5090 specifically, no AWQ-on-sm_120 failure modes have been reported in the vLLM Issues tracker as of authoring — distinct from the NVFP4 path above.

Need a smaller variant on this card?

If your workload doesn't need 32B parameters, the 5090's 32 GB envelope is wildly over-provisioned for Qwen3-14B (BF16 ~28 GB, fits comfortably) or Qwen3-8B (BF16 ~16 GB, leaves 16 GB headroom for a second model). The dedicated Qwen3-14B × RTX 5090 and Qwen3-8B × RTX 5090 recipes lead with the BF16 path that's impossible on smaller cards.