What You'll Build
A local Qwen3-14B chat / reasoning assistant on the 32 GB RTX 5090, run three ways. First, the Qwen-official FP8 quant via vLLM (recommended): Blackwell sm_120 has native FP8 tensor cores, so FP8 is both smaller and faster than BF16 here, unlike on Ampere. Second, the BF16 full-precision weights via vLLM with context discipline; the 5090 is the first consumer NVIDIA card that fits the roughly 28 GB BF16 footprint at all. Third, the familiar Q4_K_M GGUF via Ollama / llama.cpp for one-command convenience.
Hardware data: RTX 5090 (32 GB VRAM, Blackwell sm_120) · Q4_K GGUF · 123.8 tokens/s generation at 4k context · See benchmark data
⚠️ Variant pinned: Qwen3 ships 8 sizes from the same Qwen org. Per the Ollama qwen3:14b tag list, Qwen3 spans 0.6b, 1.7b, 4b, 8b, 14b (this recipe), 30b (MoE), 32b, and 235b (MoE). The siblings have wildly different VRAM profiles. The dense 14.8B parameter count of this recipe's variant is confirmed on the Qwen3-14B HF card ("Number of Parameters: 14.8B"); per the official Qwen speed benchmark, BF16 occupies 28,402 MB at input length 1, growing to 33,336 MB at 30k context. In other words, it overflows a 24 GB card and fits the 5090 only with a context cap below ~20k or with KV-cache quantization (see the BF16 path below).
ℹ️ 5090-specific: this is the first consumer NVIDIA card that fits BF16 at all. The 24 GB RTX 3090 and 4090 are forced to Q4_K_M / AWQ-INT4 / FP8 mirrors for this model. The 32 GB Blackwell envelope unlocks BF16 (with context discipline) and lets FP8 run at full 32K context with headroom. NVFP4 hardware acceleration is also a 5090 feature, but the official nvidia/Qwen3-14B-NVFP4 mirror is "Supported Runtime Engine(s): TensorRT-LLM" with "Test Hardware: B200", which is not a consumer-runnable path today. The FP8 and BF16 paths below are the actionable 5090 unlocks.
ℹ️ Thinking mode is on by default. Per the Qwen3-14B HF card quickstart, enable_thinking=True is the default and output starts with a <think>...</think> chain-of-thought block. To disable for latency-sensitive use, send /no_think in your prompt or pass enable_thinking=False in the chat template.
Requirements
| Component | Minimum | Tested |
|---|---|---|
| GPU | 17 GB VRAM (FP8 weights + short-context KV; ~21 GB at 30k context) | RTX 5090 (32 GB, Blackwell sm_120) |
| RAM | 32 GB system | — |
| Storage | 15 GB (FP8) — or 9 GB (Q4_K_M) — or 28 GB (BF16) | per Qwen/Qwen3-14B HF tree + unsloth GGUF tree |
| Driver | CUDA 12.8+ runtime, cu128 PyTorch wheel for Blackwell sm_120 | — |
| Runtime | vLLM 0.8.5+ (FP8 / BF16), or Ollama 0.5+ / llama.cpp (GGUF) | — |
The model is released under Apache 2.0 — commercial use is permitted.
Installation
Option A — vLLM with official Qwen FP8 (recommended for the 5090)
The 5090's Blackwell sm_120 architecture has native FP8 tensor cores (E4M3 / E5M2), so FP8 weights deliver a real throughput uplift on top of the memory saving — distinct from Ampere cards where FP8 is dequantized to BF16 on the fly. The Qwen/Qwen3-14B-FP8 repo is the official Qwen team's FP8 quantization with "fine-grained fp8 quantization with block size of 128" per the model card.
1. Install PyTorch with sm_120 (Blackwell) kernels
# cu128 wheel — required for Blackwell sm_120 kernel coverage
pip install --pre torch --index-url https://download.pytorch.org/whl/nightly/cu128
2. Install vLLM
pip install vllm
3. Serve the FP8 weights
vllm serve Qwen/Qwen3-14B-FP8 --enable-reasoning --reasoning-parser deepseek_r1
This is the official serve command per the Qwen/Qwen3-14B-FP8 model card. Newer vLLM releases deprecate --enable-reasoning in favor of passing --reasoning-parser alone, so drop that flag if your version rejects it. vLLM exposes an OpenAI-compatible HTTP API on port 8000. The FP8 weights occupy ~16 GB resident: per the official Qwen speed benchmark, FP8 is "16,012 MB" at input length 1, growing to "20,813 MB" at 30k context. That leaves ~11 GB of headroom on the 32 GB 5090 for the full 32K native context window plus thinking-mode KV.
Option B — vLLM with BF16 (full precision, 5090-unique unlock)
BF16 fits the 32 GB envelope only on the 5090 (5090 = 32 GB; 4090 / 3090 = 24 GB; the BF16 weights are 28.4 GB at length 1). Cap context explicitly to stay under 30k (where BF16 hits 33.3 GB per the Qwen benchmark) — or use KV-cache quantization to push higher.
# vLLM with explicit context cap + FP8 KV-cache to fit full 32K context
vllm serve Qwen/Qwen3-14B \
--enable-reasoning --reasoning-parser deepseek_r1 \
--max-model-len 32768 \
--kv-cache-dtype fp8
The --max-model-len 32768 matches Qwen3-14B's native context window per the HF card ("Context Length: 32,768 natively"); --kv-cache-dtype fp8 halves KV memory so that BF16 weights plus a full-32K KV cache fit within 32 GB, though with little margin to spare. Without --kv-cache-dtype fp8, cap --max-model-len at 8192 to leave room for the KV cache at its default 16-bit precision.
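A rough way to sanity-check these caps is to interpolate between the two BF16 figures quoted from the Qwen speed benchmark (28,402 MB at input length 1 and 33,336 MB at 30k). The sketch below is illustrative only. It assumes memory grows linearly with context, reads "30k" as 30,000 tokens, treats the card as 32,768 MB, and models the fp8 KV cache as halving only the context-dependent part. The function names are hypothetical, and none of these simplifications come from the benchmark itself.

```python
# Rough VRAM estimate by linear interpolation between two benchmark points.
# Assumptions (not from the benchmark): linear growth with context,
# "30k" == 30_000 tokens, fp8 KV halves only the context-dependent part.

BF16_MB_AT_1 = 28_402      # Qwen speed benchmark, input length 1
BF16_MB_AT_30K = 33_336    # Qwen speed benchmark, 30k context
LONG_CTX_TOKENS = 30_000   # assumed meaning of "30k"
CARD_MB = 32 * 1024        # RTX 5090, treating 32 GB as 32,768 MB

# Estimated MB added per extra token of context at 16-bit KV precision.
MB_PER_TOKEN = (BF16_MB_AT_30K - BF16_MB_AT_1) / (LONG_CTX_TOKENS - 1)


def estimate_mb(context_tokens: int, kv_scale: float = 1.0) -> float:
    """Estimated resident MB at a given context; kv_scale=0.5 models fp8 KV."""
    return BF16_MB_AT_1 + MB_PER_TOKEN * (context_tokens - 1) * kv_scale


def max_context(budget_mb: float, kv_scale: float = 1.0) -> int:
    """Largest context (tokens) whose estimate stays within budget_mb."""
    extra_tokens = (budget_mb - BF16_MB_AT_1) / (MB_PER_TOKEN * kv_scale)
    return max(0, int(extra_tokens) + 1)


if __name__ == "__main__":
    for ctx in (8_192, 32_768):
        for scale, label in ((1.0, "16-bit KV"), (0.5, "fp8 KV")):
            mb = estimate_mb(ctx, scale)
            verdict = "fits" if mb <= CARD_MB else "exceeds"
            print(f"{ctx:>6} tokens, {label}: ~{mb:,.0f} MB ({verdict} {CARD_MB:,} MB)")
```

Under these assumptions, 32K context with 16-bit KV lands above the card's capacity, while the fp8 KV variant comes in at roughly 31,000 MB. Treat the output as a planning aid rather than a guarantee, since the runtime reserves memory of its own.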
Option C — Ollama (familiar one-command path, Q4_K_M)
If you want the simplest possible install or you'd rather spend the 5090's headroom on colocated models (see "Spending the headroom" below), Ollama remains the lowest-friction option.
1. Install Ollama
# Linux
curl -fsSL https://ollama.com/install.sh | sh
(Windows: download from ollama.com/download.) Per the Qwen3 HF card, "applications such as Ollama, LMStudio, MLX-LM, llama.cpp, and KTransformers have also supported Qwen3."
2. Pull the 14B model
ollama pull qwen3:14b
This fetches a 9.3 GB Q4_K_M checkpoint per the Ollama qwen3:14b tag (14.8B parameters, Q4_K_M quantization).
Option D — llama.cpp with the full quant ladder
For higher-quality quants (Q6_K, Q8_0, BF16) the unsloth/Qwen3-14B-GGUF repo lists Qwen/Qwen3-14B explicitly as its base_model. Per-tier file sizes from the Files tab:
| Quant | File size | Notes for 32 GB 5090 |
|---|---|---|
| Q4_K_M | 9.00 GB | budget tier — ample headroom for colocations |
| Q5_K_M | 10.5 GB | better quality, still comfortable |
| Q6_K | 12.1 GB | high-fidelity |
| Q8_0 | 15.7 GB | near-lossless — recommended quality tier |
| UD-Q4_K_XL | 9.16 GB | Unsloth Dynamic 2.0 imatrix-tuned |
| UD-Q8_K_XL | 18.8 GB | Unsloth Dynamic 2.0, near-lossless |
| BF16 | 29.5 GB | full precision — fits 5090 with --ctx-size 8192 cap |
Install llama.cpp (brew install llama.cpp on macOS, or pre-built CUDA binaries from GitHub releases), then via the Hugging Face shortcut documented on the Unsloth card:
# Q8_0 — recommended near-lossless tier for the 32 GB 5090
llama-server -hf unsloth/Qwen3-14B-GGUF:Q8_0 -fa 1
The -fa 1 flag enables Flash Attention (the same flag used in the hardware-corner.net RTX 5090 LLM benchmark row for this model).
Running
One-shot prompt via vLLM (FP8 path)
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen/Qwen3-14B-FP8",
"messages": [{"role": "user", "content": "Explain the difference between MoE and dense transformer architectures in three sentences."}]
}'
Disable thinking mode for short answers (Ollama path)
ollama run qwen3:14b "/no_think What's the capital of France?"
Per the Qwen3-14B HF card, the /no_think soft switch turns thinking off for that turn. The response may still open with an empty <think></think> block, but the chain-of-thought content is skipped.
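When raw output is consumed without a reasoning parser, the <think>...</think> block arrives inline with the answer, and it may be present but empty when thinking is switched off. A minimal sketch of separating the two follows, assuming at most one well-formed think block per response; `split_thinking` is a hypothetical helper, not part of Ollama or vLLM.

```python
import re

# Matches the first <think>...</think> block, including newlines inside it.
_THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_thinking(text: str) -> tuple[str, str]:
    """Return (reasoning, answer) from raw model output.

    If no think block is present, reasoning is the empty string and the
    whole text is treated as the answer.
    """
    match = _THINK_RE.search(text)
    if match is None:
        return "", text.strip()
    reasoning = match.group(1).strip()
    answer = (text[:match.start()] + text[match.end():]).strip()
    return reasoning, answer


if __name__ == "__main__":
    raw = "<think>\n\n</think>\n\nParis."
    reasoning, answer = split_thinking(raw)
    print(repr(reasoning), repr(answer))
```

On the vLLM path with a reasoning parser enabled this step should not be needed, since the server separates reasoning from the final answer itself.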
Results
- Speed (Q4_K llama.cpp path): 123.8 tokens/s generation at 4k context, Q4_K quantization with -fa 1, measured on RTX 5090, per the hardware-corner.net RTX 5090 LLM benchmark table row labelled "Qwen3 14B (Q4_K)", surfaced via /check/qwen3-14b/rtx-5090. Generation rate decays to 102.7 tok/s at 16k, 82.4 tok/s at 32k, 57.6 tok/s at 64k, and 37.2 tok/s at 128k as the KV cache grows. Prompt processing on the same row is much faster: 6,497.6 tok/s at 4k context, dropping to 908.4 tok/s at 128k. An independent corroboration at Q4_K_XL appears at the hardware-corner.net "GPU LLM Ranking" 16K table for the RTX 5090 ("102.68 tokens/second" at 16K context, Q4_K_XL), consistent within rounding of the Q4_K row above.
- VRAM usage (FP8 path, this recipe's primary): ~16 GB resident with FP8 weights at length 1, growing to ~21 GB at 30k context, per the official Qwen speed benchmark ("FP8: 16,012 MB" / "20,813 MB"). Leaves ~11 GB of the 32 GB envelope free for the full 32K native window, thinking-mode KV, and a small colocation. The same benchmark documents BF16 at "28,402 MB" / "33,336 MB" and AWQ-INT4 at "9,962 MB" / "15,323 MB"; together these three precision tiers form the precision/VRAM ladder for Transformers-style inference on Qwen3-14B. The Q4_K llama.cpp path is a separate runtime and uses the unsloth GGUF file size (~9 GB) plus llama.cpp's smaller activation footprint. See /check/qwen3-14b/rtx-5090 for the live benchmark feed.
- Quality notes: On the 32 GB 5090, there is no quality reason to pick anything below Q8_0 (15.7 GB) for the GGUF path or FP8 (~16 GB) for the vLLM path — both leave ample KV cache room at full 32K context. The BF16 path is the highest-fidelity tier this card can run; FP8 is within rounding for most chat / reasoning workloads and is faster thanks to native sm_120 tensor cores.
For the full benchmark data, see /check/qwen3-14b/rtx-5090.
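To put the generation rates in practical terms, the sketch below converts them into wall-clock time for a fixed number of output tokens, which matters for thinking-mode responses that can run long. The rates are the Q4_K figures quoted in the Results list; treating the rate as constant for the whole response is a simplifying assumption, and the function name is illustrative.

```python
# Generation rates (tokens/s) quoted in the Results list for the
# "Qwen3 14B (Q4_K)" row on the RTX 5090, keyed by context length.
GEN_TOK_PER_S = {"4k": 123.8, "16k": 102.7, "32k": 82.4, "64k": 57.6, "128k": 37.2}


def seconds_to_generate(n_tokens: int, context: str) -> float:
    """Estimated seconds to emit n_tokens at the given context bucket.

    Assumes the rate stays constant for the whole response, which
    ignores the further slowdown as the KV cache keeps growing.
    """
    return n_tokens / GEN_TOK_PER_S[context]


if __name__ == "__main__":
    for context in GEN_TOK_PER_S:
        secs = seconds_to_generate(1_000, context)
        print(f"{context:>4}: {secs:5.1f} s per 1,000 generated tokens")
```

By this estimate a 1,000-token response takes about 8 seconds at 4k context and about 27 seconds at 128k.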
Spending the headroom — colocating other models on the 32 GB envelope
The 5090's 32 GB envelope leaves substantial spare VRAM after Qwen3-14B loads, even at its largest practical quant. Some cited per-model floors that fit alongside this recipe's FP8 / Q8_0 / Q4_K_M paths:
- Qwen3-14B FP8 (~16 GB) + Gemma 4 E4B (~5 GB) = ~21 GB, leaving an ~11 GB margin. The E4B sibling is hardware-agnostic and pairs cleanly for multimodal pipelines.
- Qwen3-14B Q4_K_M (~9 GB) + Llama 3.1 8B Q4 (~6 GB) + Kokoro-82M TTS (~1 GB) = ~16 GB — a multi-model "chat + alternative + voice" stack with 16 GB to spare for KV cache.
- Qwen3-14B Q4_K_M (~9 GB) + Whisper-large-v3 (~3 GB) + a 7B Q4 LLM (~5 GB) = ~17 GB — ASR + reasoning + alternative-LLM production server.
These pairings are weight floors only. Each model adds its own KV cache and activation overhead; allow ~2 GB headroom per colocated model under load.
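The budgeting rule above (weight floors plus roughly 2 GB per colocated model) can be sketched as a small helper. The weight figures in the example are the approximate ones from the pairings listed in this section; `colocation_budget` is a hypothetical name, and the 2 GB default is the rule of thumb from the text rather than a measured value.

```python
def colocation_budget(weights_gb, vram_gb=32.0, overhead_per_model_gb=2.0):
    """Return (weights_total, total_with_overhead, remaining) in GB.

    weights_gb maps a model label to its approximate weight footprint.
    overhead_per_model_gb is the per-model allowance for KV cache and
    activations under load.
    """
    weights_total = sum(weights_gb.values())
    total = weights_total + overhead_per_model_gb * len(weights_gb)
    return weights_total, total, vram_gb - total


if __name__ == "__main__":
    stack = {
        "Qwen3-14B Q4_K_M": 9,
        "Llama 3.1 8B Q4": 6,
        "Kokoro-82M TTS": 1,
    }
    weights, total, remaining = colocation_budget(stack)
    print(f"weights {weights} GB, with overhead {total} GB, remaining {remaining} GB")
```

For the three-model stack shown, the 16 GB weight floor becomes 22 GB once the per-model allowance is added, leaving 10 GB for longer contexts.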
Troubleshooting
vLLM crashes at import or first inference with sm_120 / Blackwell errors
The 5090 uses Blackwell architecture (sm_120), which requires the cu128 PyTorch wheel for native kernel coverage; the default pip install torch may pull a cu126 build that lacks sm_120 kernels. Verify with python -c "import torch; print(torch.version.cuda, torch.cuda.get_device_capability())": you should see 12.8 (or higher) and (12, 0). If not, reinstall via the index URL in step 1 of Option A above. FlashAttention-2 is a separate axis: vLLM selects its attention backend itself and does not need the standalone flash-attn package, and PyTorch's built-in SDPA (scaled_dot_product_attention) works on Blackwell without FA2. If you opt into the standalone FA2 package explicitly, note that Dao-AILab/flash-attention#2168 ("[Blackwell/RTX 5090] CUDA error with flash-attention on RTX 5090 in WSL2") remains open at the time of writing, so stick with SDPA or vLLM's default attention until FA2 sm_120 coverage lands.
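The one-line check above can be expanded into a small script that states the verdict explicitly. This is a sketch: `blackwell_ready` and `report` are hypothetical helper names, the thresholds simply restate the 12.8 and (12, 0) values from this section, and `report()` needs a CUDA-enabled PyTorch install to run.

```python
def blackwell_ready(cuda_version, capability):
    """True if the torch build reports CUDA >= 12.8 and the GPU reports
    compute capability (12, 0) or newer."""
    if cuda_version is None:  # CPU-only torch builds report None
        return False
    major, minor = (int(part) for part in cuda_version.split(".")[:2])
    return (major, minor) >= (12, 8) and tuple(capability) >= (12, 0)


def report():
    """Print what the local PyTorch install sees. Requires torch."""
    import torch

    if not torch.cuda.is_available():
        print("No CUDA device visible to PyTorch")
        return
    capability = torch.cuda.get_device_capability()
    verdict = "OK" if blackwell_ready(torch.version.cuda, capability) else "reinstall the cu128 wheel"
    print(f"CUDA {torch.version.cuda}, capability {capability}: {verdict}")
    # Architectures this torch build was compiled for; sm_120 should be listed.
    print("Compiled arch list:", torch.cuda.get_arch_list())
```

Calling `report()` on a correctly configured 5090 should show a CUDA version of 12.8 or higher together with capability (12, 0).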
<think>...</think> output is bloating responses
Qwen3 enables thinking mode by default per the HF card quickstart. Send /no_think at the start of any user message to disable it for that turn, or pass enable_thinking=False if you're calling the chat-template API directly. Per the model card's best-practices note: for thinking mode use Temperature=0.6, TopP=0.95, TopK=20, MinP=0 and do not use greedy decoding — it triggers endless repetitions.
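For the vLLM path, the thinking switch and the sampling settings can be sent per request. The sketch below builds the JSON body using only the standard library. It assumes the vLLM OpenAI-compatible server accepts `top_k`, `min_p`, and `chat_template_kwargs` as extra fields in the request body; the helper names and the default URL (port 8000, as in Option A) are illustrative.

```python
import json
import urllib.request

# Thinking-mode sampling values quoted from the model card's best practices.
THINKING_SAMPLING = {"temperature": 0.6, "top_p": 0.95, "top_k": 20, "min_p": 0}


def build_chat_request(prompt, thinking=True, model="Qwen/Qwen3-14B-FP8"):
    """Build a chat-completions body that toggles thinking per request.

    Assumption: the server forwards chat_template_kwargs to the chat
    template and accepts top_k / min_p as extra sampling fields.
    """
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "chat_template_kwargs": {"enable_thinking": thinking},
    }
    if thinking:
        payload.update(THINKING_SAMPLING)
    return payload


def send(payload, url="http://localhost:8000/v1/chat/completions"):
    """POST the payload to a running server and return the parsed JSON."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


if __name__ == "__main__":
    body = build_chat_request("What's the capital of France?", thinking=False)
    print(json.dumps(body, indent=2))
```

The non-thinking branch leaves sampling at the server defaults here; set your own values if the defaults do not suit the workload.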
BF16 path OOMs at long context
BF16 weights are 28.4 GB at length 1 and grow to 33.3 GB at 30k context per the Qwen speed benchmark — the 32 GB 5090 envelope does NOT fit the full 32K native context window at BF16 with default fp16 KV cache. Three escape hatches: (a) cap --max-model-len to 8192 (well within the BF16 + fp16-KV envelope), (b) add --kv-cache-dtype fp8 to halve KV memory and reclaim 32K context, or (c) drop to the FP8 path (Option A) — FP8 is faster on Blackwell anyway and fits 32K cleanly.
NaN output / CUDA assertion errors on first inference (Blackwell SDPA path)
A Blackwell PyTorch SDPA issue was reported against a smaller Qwen3 sibling at QwenLM/Qwen3#1499 ("[Bug]: NaN in PyTorch SDPA on RTX5080"). The reporter's variant was Qwen3-0.6B, and a Qwen team collaborator (jklj077) reproduced the failure under float16 in the same thread. The underlying failure is framework-level and independent of model size (PyTorch SDPA dispatching to cuDNN on sm_120 / sm_121). The reporter (O5-7, a community user) notes the fix in their final comment: "This bug was fixed by upgrading cuDNN. Please use the preview version of PyTorch." If you see NaN output or device-side assert triggered traces, confirm a recent cuDNN (the cu128 nightly wheel in step 1 includes it) and use torch_dtype="auto" (BF16) rather than torch_dtype=torch.float16 in any custom Transformers loader. Only this model-independent SDPA/cuDNN workaround transfers from that thread; its variant-specific advice does not.
Generation slows dramatically past 32k context
32k is Qwen3-14B's native context window per the HF card ("Context Length: 32,768 natively and 131,072 tokens with YaRN"). Beyond that the model needs YaRN extension — supported in llama.cpp via --rope-scaling yarn --rope-scale 4 --yarn-orig-ctx 32768 per the Qwen3 model card — but quality degrades and the KV cache balloons. For long-doc workflows, prefer chunking + retrieval over pushing context past 32k. The hardware-corner.net benchmark shows the generation rate falling from 123.8 tok/s at 4k to 37.2 tok/s at 128k context on this card.
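The YaRN scale in the flags above is the ratio of the target context to the 32,768-token native window (131,072 / 32,768 = 4). The sketch below derives the llama.cpp arguments for an arbitrary target. The flag names are the ones quoted in this document; using a scale smaller than 4 for targets below 128K is an assumption of this sketch, and `yarn_flags` is a hypothetical helper.

```python
NATIVE_CTX = 32_768  # Qwen3-14B native context window per the HF card


def yarn_flags(target_ctx, native_ctx=NATIVE_CTX):
    """Return llama.cpp arguments for a target context length.

    Within the native window no rope scaling is needed, so the list is
    empty. Beyond it, the scale is target_ctx / native_ctx.
    """
    if target_ctx <= native_ctx:
        return []
    scale = target_ctx / native_ctx
    return [
        "--rope-scaling", "yarn",
        "--rope-scale", f"{scale:g}",
        "--yarn-orig-ctx", str(native_ctx),
        "--ctx-size", str(target_ctx),
    ]


if __name__ == "__main__":
    print(" ".join(yarn_flags(131_072)))
```

Keeping the scale no larger than the workload needs limits how far the model is pushed past its native window.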
I want the NVFP4 path (Blackwell native FP4)
NVIDIA publishes an nvidia/Qwen3-14B-NVFP4 mirror that targets Blackwell's NVFP4 hardware acceleration, but per its model card the "Supported Runtime Engine(s): TensorRT-LLM" and "Test Hardware: B200" — it is not a vLLM / SGLang / Ollama path today and consumer Blackwell (RTX 5090) is not in the tested-hardware list. For now, FP8 via vLLM (Option A) is the recommended Blackwell-accelerated path on the 5090; revisit NVFP4 once consumer-runtime support lands.
I want the larger 32B or 30B-MoE sibling
Qwen3-32B at Q4_K_M is ~19 GB on disk and fits the 5090's 32 GB envelope with headroom for its KV cache at moderate context lengths; a full 128K (YaRN-extended) context is unlikely to fit alongside the weights without KV-cache quantization. Swap qwen3:14b for qwen3:32b in any Ollama command. Qwen3-30B-A3B (MoE) routes per token (classical sparse MoE), so all expert params must be resident in VRAM per the Qwen3 model card, even though only a subset is active for each token. See /check/qwen3-32b/rtx-5090 once that recipe lands.