
Qwen3-14B on RTX 5090: FP8 via vLLM with Native Blackwell Acceleration

llm · intermediate · 17GB+ VRAM · May 24, 2026
Prerequisites
  • NVIDIA RTX 5090 (32 GB VRAM, Blackwell sm_120) or equivalent 32 GB CUDA card
  • Recent NVIDIA driver with CUDA 12.8+ runtime (`cu128` wheel selection required for sm_120 kernels)
  • ~15 GB free disk for the FP8 weights (or ~9 GB for Q4_K_M GGUF, ~28 GB for BF16)
  • Python 3.10+, vLLM 0.8.5+ (for FP8 / BF16; the minimum the Qwen3 model card lists), or Ollama / llama.cpp for the GGUF path

What You'll Build

A local Qwen3-14B chat / reasoning assistant on the 32 GB RTX 5090, run three ways:

  • The Qwen-official FP8 quant via vLLM (recommended). Blackwell sm_120 has native FP8 tensor cores, so FP8 is both smaller and faster than BF16 here, unlike on Ampere.
  • The BF16 full-precision weights via vLLM with context discipline. The 5090 is the first consumer NVIDIA card that fits the roughly 28 GB of BF16 weights at all.
  • The familiar Q4_K_M GGUF via Ollama / llama.cpp for one-command convenience.

Hardware data: RTX 5090 (32 GB VRAM, Blackwell sm_120) · Q4_K GGUF · 123.8 tokens/s generation at 4k context · See benchmark data

⚠️ Variant pinned — Qwen3 ships 8 sizes from the same Qwen org. Per the Ollama qwen3:14b tag list, Qwen3 spans 0.6b, 1.7b, 4b, 8b, 14b (this recipe), 30b (MoE), 32b, and 235b (MoE). The siblings have wildly different VRAM profiles. The dense 14.8B parameter count of this recipe's variant is confirmed on the Qwen3-14B HF card ("Number of Parameters: 14.8B"); per the official Qwen speed benchmark, BF16 occupies 28,402 MB at input length 1, growing to 33,336 MB at 30k context — i.e. it overflows a 24 GB card and fits the 5090 only with a context cap below ~20k or with KV-cache quantization (see the BF16 path below).
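The weight sizes quoted throughout this recipe follow from the parameter count. A minimal sketch of that arithmetic, assuming 2 bytes per parameter for BF16 and 1 byte for FP8 (the helper name `weight_size` is illustrative, not part of any library):

```python
GIB = 1024 ** 3  # bytes per binary gigabyte (GiB)

def weight_size(params: float, bytes_per_param: float) -> tuple[float, float]:
    """Return (decimal GB, binary GiB) for the raw weight storage."""
    total_bytes = params * bytes_per_param
    return total_bytes / 1e9, total_bytes / GIB

PARAMS = 14.8e9  # from the HF card: "Number of Parameters: 14.8B"

# Assumed storage widths: BF16 = 2 bytes/param, FP8 = 1 byte/param
for name, width in [("BF16", 2), ("FP8", 1)]:
    gb, gib = weight_size(PARAMS, width)
    print(f"{name}: {gb:.1f} GB ({gib:.1f} GiB) of raw weights")
```

This is only a lower bound. Measured footprints such as the benchmark's 28,402 MB also include runtime buffers and KV cache, and the GB-versus-GiB distinction accounts for part of the spread between the sizes quoted in this recipe.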

ℹ️ 5090-specific: this is the first consumer NVIDIA card that fits BF16 at all. The 24 GB RTX 3090 and 4090 are forced to Q4_K_M / AWQ-INT4 / FP8 mirrors for this model. The 32 GB Blackwell envelope unlocks BF16 (with context discipline) and lets FP8 run at full 32K context with headroom. NVFP4 hardware acceleration is also a 5090 feature, but the official nvidia/Qwen3-14B-NVFP4 mirror is "Supported Runtime Engine(s): TensorRT-LLM" with "Test Hardware: B200" — not a consumer-runnable path today. The FP8 and BF16 paths below are the actionable 5090 unlocks.

ℹ️ Thinking mode is on by default. Per the Qwen3-14B HF card quickstart, enable_thinking=True is the default and output starts with a <think>...</think> chain-of-thought block. To disable for latency-sensitive use, send /no_think in your prompt or pass enable_thinking=False in the chat template.

Requirements

Component | Minimum | Tested
GPU | 17 GB VRAM (FP8 weights + 32K KV) | RTX 5090 (32 GB, Blackwell sm_120)
RAM | 32 GB system |
Storage | 15 GB (FP8), 9 GB (Q4_K_M), or 28 GB (BF16) | per Qwen/Qwen3-14B HF tree + unsloth GGUF tree
Driver | CUDA 12.8+ runtime, cu128 PyTorch wheel for Blackwell sm_120 |
Runtime | vLLM 0.8.5+ (FP8 / BF16), or Ollama 0.5+ / llama.cpp (GGUF) |

The model is released under Apache 2.0 — commercial use is permitted.

Installation

Option A — vLLM with official Qwen FP8 (recommended for the 5090)

The 5090's Blackwell sm_120 architecture has native FP8 tensor cores (E4M3 / E5M2), so FP8 weights deliver a real throughput uplift on top of the memory saving — distinct from Ampere cards where FP8 is dequantized to BF16 on the fly. The Qwen/Qwen3-14B-FP8 repo is the official Qwen team's FP8 quantization with "fine-grained fp8 quantization with block size of 128" per the model card.

1. Install PyTorch with sm_120 (Blackwell) kernels

# cu128 wheel — required for Blackwell sm_120 kernel coverage
pip install --pre torch --index-url https://download.pytorch.org/whl/nightly/cu128

2. Install vLLM

pip install vllm

3. Serve the FP8 weights

vllm serve Qwen/Qwen3-14B-FP8 --enable-reasoning --reasoning-parser deepseek_r1

This is the official serve command per the Qwen/Qwen3-14B-FP8 model card. vLLM exposes an OpenAI-compatible HTTP API on port 8000. The FP8 weights occupy ~16 GB resident — per the official Qwen speed benchmark, FP8 is "16,012 MB" at input length 1, growing to "20,813 MB" at 30k context — leaving ~11 GB of headroom on the 32 GB 5090 for the full 32K native context window plus thinking-mode KV.

Option B — vLLM with BF16 (full precision, 5090-unique unlock)

Among consumer NVIDIA cards, only the 32 GB 5090 can hold BF16: the Qwen benchmark measures a 28.4 GB footprint at input length 1, which already overflows the 24 GB RTX 4090 and 3090. Cap context explicitly to stay well under 30k (where the BF16 footprint reaches 33.3 GB per the same benchmark), or quantize the KV cache to push higher.

# vLLM with explicit context cap + FP8 KV-cache to fit full 32K context
vllm serve Qwen/Qwen3-14B \
  --enable-reasoning --reasoning-parser deepseek_r1 \
  --max-model-len 32768 \
  --kv-cache-dtype fp8

The --max-model-len 32768 value matches Qwen3-14B's native context window per the HF card ("Context Length: 32,768 natively"). --kv-cache-dtype fp8 halves KV memory relative to the default 16-bit cache, so the BF16 weights plus a full 32K of KV fit within the 32 GB envelope, though with little margin. Without --kv-cache-dtype fp8, set --max-model-len 8192 to leave room for the default 16-bit KV cache.
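To see why the KV-cache dtype matters for the BF16 path above, here is a rough per-sequence estimate. It is a sketch under stated assumptions: the layer count (40), KV-head count (8), and head dimension (128) are taken to be Qwen3-14B's published configuration and should be checked against the repo's config.json. The function name `kv_cache_bytes` is illustrative, and real servers add allocator and activation overhead on top.

```python
GIB = 1024 ** 3

def kv_cache_bytes(tokens: int, bytes_per_elem: int,
                   layers: int = 40, kv_heads: int = 8, head_dim: int = 128) -> int:
    """KV cache for one sequence: a K and a V vector per layer, per KV head, per token."""
    return 2 * layers * kv_heads * head_dim * bytes_per_elem * tokens

# Compare the default 16-bit cache with an FP8 cache at two context caps
for label, width in [("16-bit KV", 2), ("FP8 KV", 1)]:
    for ctx in (8192, 32768):
        size_gib = kv_cache_bytes(ctx, width) / GIB
        print(f"{label} @ {ctx:>6} tokens: {size_gib:.3f} GiB")
```

Under these assumptions a full 32K window costs about 5 GiB at 16 bits and half that at FP8, which is roughly the growth the Qwen benchmark reports between input length 1 and 30k.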

Option C — Ollama (familiar one-command path, Q4_K_M)

If you want the simplest possible install or you'd rather spend the 5090's headroom on colocated models (see "Spending the headroom" below), Ollama remains the lowest-friction option.

1. Install Ollama

# macOS / Linux
curl -fsSL https://ollama.com/install.sh | sh

(Windows: download from ollama.com/download.) Per the Qwen3 HF card, "applications such as Ollama, LMStudio, MLX-LM, llama.cpp, and KTransformers have also supported Qwen3."

2. Pull the 14B model

ollama pull qwen3:14b

This fetches a 9.3 GB Q4_K_M checkpoint per the Ollama qwen3:14b tag (14.8B parameters, Q4_K_M quantization).

Option D — llama.cpp with the full quant ladder

For higher-quality quants (Q6_K, Q8_0, BF16), use the unsloth/Qwen3-14B-GGUF repo, which lists Qwen/Qwen3-14B explicitly as its base_model. Per-tier file sizes from the Files tab:

Quant | File size | Notes for 32 GB 5090
Q4_K_M | 9.00 GB | budget tier, ample headroom for colocations
Q5_K_M | 10.5 GB | better quality, still comfortable
Q6_K | 12.1 GB | high-fidelity
Q8_0 | 15.7 GB | near-lossless, recommended quality tier
UD-Q4_K_XL | 9.16 GB | Unsloth Dynamic 2.0 imatrix-tuned
UD-Q8_K_XL | 18.8 GB | Unsloth Dynamic 2.0, near-lossless
BF16 | 29.5 GB | full precision, fits 5090 with --ctx-size 8192 cap

Install llama.cpp (brew install llama.cpp on macOS, or pre-built CUDA binaries from GitHub releases), then launch the server via the Hugging Face shortcut documented on the Unsloth card:

# Q8_0 — recommended near-lossless tier for the 32 GB 5090
llama-server -hf unsloth/Qwen3-14B-GGUF:Q8_0 -fa 1

The -fa 1 flag enables Flash Attention (the same flag used in the hardware-corner.net RTX 5090 LLM benchmark row for this model).

Running

One-shot prompt via vLLM (FP8 path)

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3-14B-FP8",
    "messages": [{"role": "user", "content": "Explain the difference between MoE and dense transformer architectures in three sentences."}]
  }'
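The same request can be issued from Python using only the standard library. This is a minimal sketch that assumes the vLLM server from Option A is listening on localhost:8000; `build_chat_request` and `ask` are illustrative helper names, not part of vLLM.

```python
import json
import urllib.request

def build_chat_request(prompt: str, model: str = "Qwen/Qwen3-14B-FP8",
                       base_url: str = "http://localhost:8000") -> urllib.request.Request:
    """Build (but do not send) a POST to the OpenAI-compatible chat endpoint."""
    body = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def ask(prompt: str) -> str:
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(prompt), timeout=300) as resp:
        reply = json.load(resp)
    return reply["choices"][0]["message"]["content"]

# Requires a running server, so it is left commented out here:
# print(ask("Explain the difference between MoE and dense transformer architectures in three sentences."))
```

Keeping request construction separate from sending makes the payload easy to inspect or log before it reaches the server.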

Disable thinking mode for short answers (Ollama path)

ollama run qwen3:14b "/no_think What's the capital of France?"

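Per the Qwen3-14B HF card, /no_think is a soft switch that turns thinking off for that turn, so the reply skips the chain-of-thought. Because enable_thinking itself stays at its default, the response may still open with an empty <think>...</think> block.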

Results

  • Speed (Q4_K llama.cpp path): 123.8 tokens/s generation at 4k context, Q4_K quantization with -fa 1, measured on RTX 5090 — per the hardware-corner.net RTX 5090 LLM benchmark table row labelled "Qwen3 14B (Q4_K)", surfaced via /check/qwen3-14b/rtx-5090. Generation rate decays to 102.7 tok/s at 16k, 82.4 tok/s at 32k, 57.6 tok/s at 64k, and 37.2 tok/s at 128k as the KV cache grows. Prompt processing on the same row is much faster — 6,497.6 tok/s at 4k context, dropping to 908.4 tok/s at 128k. An independent corroboration at Q4_K_XL appears at the hardware-corner.net "GPU LLM Ranking" 16K table for the RTX 5090 ("102.68 tokens/second" at 16K context, Q4_K_XL), consistent within rounding of the Q4_K row above.
  • VRAM usage (FP8 path, this recipe's primary): ~16 GB resident with FP8 weights at length 1, growing to ~21 GB at 30k context — per the official Qwen speed benchmark ("FP8: 16,012 MB" / "20,813 MB"). Leaves ~11 GB of the 32 GB envelope free for the full 32K native window, thinking-mode KV, and a small colocation. The same benchmark documents BF16 at "28,402 MB" / "33,336 MB" and AWQ-INT4 at "9,962 MB" / "15,323 MB" — together these three precision tiers form the precision/VRAM ladder for Transformers-style inference on Qwen3-14B. The Q4_K llama.cpp path is a separate runtime and uses the unsloth GGUF file size (~9 GB) plus llama.cpp's smaller activation footprint. See /check/qwen3-14b/rtx-5090 for the live benchmark feed.
  • Quality notes: On the 32 GB 5090, there is no quality reason to pick anything below Q8_0 (15.7 GB) for the GGUF path or FP8 (~16 GB) for the vLLM path — both leave ample KV cache room at full 32K context. The BF16 path is the highest-fidelity tier this card can run; FP8 is within rounding for most chat / reasoning workloads and is faster thanks to native sm_120 tensor cores.

For the full benchmark data, see /check/qwen3-14b/rtx-5090.
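The precision/VRAM ladder in the Results above can be turned into headroom figures with a few lines of arithmetic. This sketch reuses only the benchmark numbers already quoted there and, as a simplification, treats the card's 32 GB as 32,000 MB, so real free memory will differ slightly.

```python
CARD_MB = 32_000  # simplification: the 5090's "32 GB" taken as 32,000 MB

# (footprint at input length 1, footprint at ~30k context) in MB,
# as quoted from the official Qwen speed benchmark in the Results above
LADDER_MB = {
    "BF16": (28_402, 33_336),
    "FP8": (16_012, 20_813),
    "AWQ-INT4": (9_962, 15_323),
}

def headroom_mb(footprint_mb: int, card_mb: int = CARD_MB) -> int:
    """Free VRAM left after the footprint; a negative value means it overflows."""
    return card_mb - footprint_mb

for tier, (short_ctx, long_ctx) in LADDER_MB.items():
    print(f"{tier:9s} length 1: {headroom_mb(short_ctx):>7,} MB free | "
          f"30k: {headroom_mb(long_ctx):>7,} MB free")
```

A negative result for BF16 at 30k is the overflow that the BF16 path works around with a context cap or an FP8 KV cache.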

Spending the headroom — colocating other models on the 32 GB envelope

The 5090's 32 GB envelope leaves substantial spare VRAM after Qwen3-14B loads, even at its largest practical quant. Some cited per-model floors that fit alongside this recipe's FP8 / Q8_0 / Q4_K_M paths:

  • Qwen3-14B FP8 (~16 GB) + Gemma 4 E4B (~5 GB) = ~21 GB — leaves a 10 GB margin. The E4B sibling is hardware-agnostic and pairs cleanly for multimodal pipelines.
  • Qwen3-14B Q4_K_M (~9 GB) + Llama 3.1 8B Q4 (~6 GB) + Kokoro-82M TTS (~1 GB) = ~16 GB — a multi-model "chat + alternative + voice" stack with 16 GB to spare for KV cache.
  • Qwen3-14B Q4_K_M (~9 GB) + Whisper-large-v3 (~3 GB) + a 7B Q4 LLM (~5 GB) = ~17 GB — ASR + reasoning + alternative-LLM production server.

These pairings are weight floors only. Each model adds its own KV cache and activation overhead; allow ~2 GB headroom per colocated model under load.
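A quick way to sanity-check a colocation plan is to add the per-model allowance to the weight floors. The sketch below uses the approximate weights from the list above and applies the ~2 GB allowance to every model in the stack, which is one reading of the guidance; `stack_budget_gb` is an illustrative helper, not a tool from any runtime.

```python
def stack_budget_gb(weights_gb, card_gb=32.0, overhead_per_model_gb=2.0):
    """Sum weight floors, add a per-model runtime allowance, report what is left."""
    weights = sum(weights_gb)
    overhead = overhead_per_model_gb * len(weights_gb)
    spare = card_gb - weights - overhead
    return {"weights_gb": weights, "overhead_gb": overhead,
            "spare_gb": spare, "fits": spare >= 0}

# Approximate weight floors (GB) copied from the pairings listed above
STACKS = {
    "FP8 + Gemma 4 E4B": [16, 5],
    "Q4_K_M + Llama 3.1 8B Q4 + Kokoro-82M": [9, 6, 1],
    "Q4_K_M + Whisper-large-v3 + 7B Q4": [9, 3, 5],
}

for name, weights in STACKS.items():
    b = stack_budget_gb(weights)
    print(f"{name}: {b['weights_gb']} GB weights + {b['overhead_gb']} GB allowance, "
          f"{b['spare_gb']} GB spare, fits={b['fits']}")
```

The spare figure is what remains for long-context KV cache, so a stack that "fits" with only a gigabyte or two left is better treated as short-context only.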

Troubleshooting

vLLM crashes at import or first inference with sm_120 / Blackwell errors

The 5090 uses Blackwell architecture (sm_120), which requires the cu128 PyTorch wheel for native kernel coverage — the default pip install torch may pull a cu126-wheel build that lacks sm_120 kernels. Verify with python -c "import torch; print(torch.version.cuda, torch.cuda.get_device_capability())" — you should see 12.8 (or higher) and (12, 0). If not, reinstall via the index URL in step 1 of Option A above. FlashAttention-2 is a separate axis: vLLM's default attention is currently SDPA (PyTorch scaled_dot_product_attention), which works on Blackwell without FA2. If you opt into FA2 explicitly, note that Dao-AILab/flash-attention#2168 ("[Blackwell/RTX 5090] CUDA error with flash-attention on RTX 5090 in WSL2") remains open at the time of writing — stick with SDPA or vLLM's default attention until FA2 sm_120 coverage lands.
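The one-liner above can be expanded into a small check that states the verdict explicitly. This is a sketch: `blackwell_ready` is an illustrative helper, and `main()` must be run on the target machine with PyTorch installed.

```python
def blackwell_ready(cuda_version, capability, min_cuda=(12, 8), want_cap=(12, 0)):
    """True when the torch build reports CUDA >= 12.8 and the GPU is sm_120."""
    if cuda_version is None:  # CPU-only torch builds report no CUDA version
        return False
    major, minor = (int(part) for part in cuda_version.split(".")[:2])
    return (major, minor) >= min_cuda and tuple(capability) == want_cap

def main():
    import torch  # imported lazily so the helper above works without torch

    if not torch.cuda.is_available():
        print("No CUDA device visible to PyTorch")
        return
    cuda = torch.version.cuda
    cap = torch.cuda.get_device_capability()
    print(f"CUDA runtime {cuda}, compute capability {cap}")
    if blackwell_ready(cuda, cap):
        print("Build looks suitable for sm_120")
    else:
        print("Reinstall PyTorch from the cu128 index (step 1 of Option A)")

# Call main() on the machine with the 5090 to run the check.
```

Comparing the version as an integer tuple avoids the string-comparison trap where "12.10" would sort below "12.8".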

<think>...</think> output is bloating responses

Qwen3 enables thinking mode by default per the HF card quickstart. Send /no_think at the start of any user message to disable it for that turn, or pass enable_thinking=False if you're calling the chat-template API directly. Per the model card's best-practices note: for thinking mode use Temperature=0.6, TopP=0.95, TopK=20, MinP=0 and do not use greedy decoding — it triggers endless repetitions.
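For API callers, the two controls described here can be folded into the request body. The sketch below is hedged: it assumes the vLLM OpenAI-compatible server accepts a `chat_template_kwargs` field and the extra sampling keys `top_k` and `min_p` in the JSON body, so verify both against the vLLM version you run. `chat_payload` is an illustrative name.

```python
# Sampling values for thinking mode, as quoted from the model card above
THINKING_SAMPLING = {"temperature": 0.6, "top_p": 0.95, "top_k": 20, "min_p": 0.0}

def chat_payload(prompt: str, thinking: bool = True,
                 model: str = "Qwen/Qwen3-14B-FP8") -> dict:
    """Build a chat-completions body with thinking switched on or off."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        # Assumption: the server forwards this dict to the chat template
        "chat_template_kwargs": {"enable_thinking": thinking},
    }
    if thinking:
        # Apply the card's thinking-mode sampling; greedy decoding is discouraged
        payload.update(THINKING_SAMPLING)
    return payload

fast = chat_payload("What's the capital of France?", thinking=False)
slow = chat_payload("Prove that the square root of 2 is irrational.")
print(fast["chat_template_kwargs"], slow["temperature"])
```

The non-thinking branch deliberately leaves sampling at the server defaults, since this recipe quotes only the thinking-mode values.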

BF16 path OOMs at long context

The BF16 footprint is 28.4 GB at length 1 and grows to 33.3 GB at 30k context per the Qwen speed benchmark, so the 32 GB 5090 envelope does NOT fit the full 32K native context window at BF16 with the default 16-bit KV cache. Three escape hatches: (a) cap --max-model-len to 8192 (well within the BF16 + 16-bit-KV envelope), (b) add --kv-cache-dtype fp8 to halve KV memory and reclaim 32K context, or (c) drop to the FP8 path (Option A), which is faster on Blackwell anyway and fits 32K cleanly.

NaN output / CUDA assertion errors on first inference (Blackwell SDPA path)

A historical Blackwell PyTorch SDPA issue was reported against a smaller Qwen3 sibling at QwenLM/Qwen3#1499 ("[Bug]: NaN in PyTorch SDPA on RTX5080"). The reporter's variant was Qwen3-0.6B, and a Qwen team collaborator (jklj077) reproduced the failure under float16 in the same thread. The underlying failure is framework-level and model-class-independent (PyTorch SDPA → cuDNN dispatch on sm_120 / sm_121), so it applies to the 14B model as well. The reporter (O5-7, a community user) notes the fix in their final comment: "This bug was fixed by upgrading cuDNN. Please use the preview version of PyTorch." If you see NaN output or "device-side assert triggered" traces, confirm a recent cuDNN (the cu128 nightly wheel in step 1 includes it) and use torch_dtype="auto" (BF16) rather than torch_dtype=torch.float16 in any custom Transformers loader. Only this SDPA / cuDNN workaround carries over from that thread; its variant-specific advice does not.

Generation slows dramatically past 32k context

32k is Qwen3-14B's native context window per the HF card ("Context Length: 32,768 natively and 131,072 tokens with YaRN"). Beyond that the model needs YaRN extension — supported in llama.cpp via --rope-scaling yarn --rope-scale 4 --yarn-orig-ctx 32768 per the Qwen3 model card — but quality degrades and the KV cache balloons. For long-doc workflows, prefer chunking + retrieval over pushing context past 32k. The hardware-corner.net benchmark shows the generation rate falling from 123.8 tok/s at 4k to 37.2 tok/s at 128k context on this card.
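The --rope-scale value is simply the ratio between the target window and the native one, and the benchmark figures show what the extra context costs in generation speed. A small sketch using only numbers quoted in this recipe (`yarn_rope_scale` is an illustrative helper):

```python
NATIVE_CTX = 32_768  # Qwen3-14B native window per the HF card

def yarn_rope_scale(target_ctx: int, native_ctx: int = NATIVE_CTX) -> float:
    """Scaling factor needed to stretch the native window to target_ctx."""
    if target_ctx <= native_ctx:
        return 1.0  # no extension needed inside the native window
    return target_ctx / native_ctx

# Generation rates (tok/s) from the hardware-corner.net row cited in Results
GEN_TOK_S = {"4k": 123.8, "16k": 102.7, "32k": 82.4, "64k": 57.6, "128k": 37.2}

print("rope scale for 131,072 tokens:", yarn_rope_scale(131_072))
for ctx, rate in GEN_TOK_S.items():
    kept = rate / GEN_TOK_S["4k"]
    print(f"{ctx:>4}: {rate:6.1f} tok/s ({kept:.0%} of the 4k rate)")
```

At 128k the card retains roughly a third of its 4k generation rate, which is the practical argument for chunking and retrieval over very long prompts.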

I want the NVFP4 path (Blackwell native FP4)

NVIDIA publishes an nvidia/Qwen3-14B-NVFP4 mirror that targets Blackwell's NVFP4 hardware acceleration, but per its model card the "Supported Runtime Engine(s): TensorRT-LLM" and "Test Hardware: B200" — it is not a vLLM / SGLang / Ollama path today and consumer Blackwell (RTX 5090) is not in the tested-hardware list. For now, FP8 via vLLM (Option A) is the recommended Blackwell-accelerated path on the 5090; revisit NVFP4 once consumer-runtime support lands.

I want the larger 32B or 30B-MoE sibling

Qwen3-32B at Q4_K_M is ~19 GB on disk and fits the 5090's 32 GB envelope with headroom for its native 32K context; swap qwen3:14b for qwen3:32b in any Ollama command. Qwen3-30B-A3B (MoE) routes per token (classical sparse MoE), so all expert params must be resident in VRAM per the Qwen3 model card. See /check/qwen3-32b/rtx-5090 once that recipe lands.