
Qwen3-8B on RTX 5090: Q4_K_M GGUF with 26 GB of Headroom for Colocation, BF16, or Full 131K Context

llm · beginner · 6GB+ VRAM · May 24, 2026
prerequisites
  • NVIDIA RTX 5090 (32 GB VRAM) or any Blackwell/Ada CUDA card with ≥ 6 GB free
  • Recent NVIDIA driver with CUDA 12.8+ support (required for Blackwell sm_120)
  • ~5 GB free disk for the Q4_K_M GGUF (or up to ~16 GB for BF16)
  • Ollama, llama.cpp, or LM Studio installed

What You'll Build

A local Qwen3-8B chat and reasoning assistant on an RTX 5090, served through Ollama or llama.cpp at Q4_K_M quantization. The weights are only ~5 GB on disk, so the model uses roughly 6 GB of the 5090's 32 GB envelope. That leaves 26 GB free to colocate a Qwen3-32B sibling, run the full 16 GB BF16 weights instead, hold the KV cache for the full 131K YaRN context, or stand up a multi-model production pipeline (Whisper + Kokoro + Qwen3-8B all on one card).

Hardware data: RTX 5090 (32 GB VRAM) · 200.4 tok/s generation @ 4K context (Q4_K, llama.cpp build 8189) · See benchmark data

ℹ️ Wildly over-provisioned by design — 5.3× headroom. Qwen3-8B Q4_K_M needs ~6 GB resident; the 5090 has 32 GB. The install path below is the same on any ≥ 6 GB CUDA card — but the "Spending the 26 GB of Headroom" section is what makes the 5090 worth using over a 5060 Ti 16GB or a 12 GB card for this model. If you only want chat, a 12 GB card is fine; the 5090 starts paying for itself when you reach for the spare 26 GB.

⚠️ Variant pinned — Qwen3 ships 8 sizes from the same Qwen org. Per the Ollama qwen3 tag list, Qwen3 spans 0.6b, 1.7b, 4b, 8b (this recipe), 14b, 30b (MoE), 32b, and 235b (MoE). The siblings have wildly different VRAM profiles — Qwen3-14B Q4_K_M is ~9 GB, Qwen3-32B Q4_K_M is ~18 GB and now fits the 5090 with ~14 GB to spare (see Qwen3 model card on the dense/MoE split). The instructions below are for the dense 8.2B model only; the headroom section covers running 14B / 32B alongside or instead.

Requirements

Component | Minimum | Tested
GPU | 6 GB VRAM (Blackwell / Ada / Ampere) | RTX 5090 (32 GB)
RAM | 16 GB system |
Storage | ~5 GB for Q4_K_M GGUF | ~16 GB more if you also pull BF16
Driver | NVIDIA driver with CUDA 12.8+ (Blackwell sm_120) |
Software | Ollama, llama.cpp, or LM Studio |

The model is released under Apache 2.0 — commercial use is permitted. The min_vram_gb: 6 floor is set by the Q4_K_M weight footprint; the 5090's 32 GB envelope gives you 26 GB of unused VRAM to do something else with (see below).

Installation

Pick one of the three runtimes below. All three are first-party-supported by the Qwen team and pull from the canonical Qwen/Qwen3-8B weights (or an officially-tracked GGUF mirror). Per the model card: "applications such as Ollama, LMStudio, MLX-LM, llama.cpp, and KTransformers have also supported Qwen3."

Option A — Ollama (simplest)

curl -fsSL https://ollama.com/install.sh | sh
ollama pull qwen3:8b

qwen3:8b on the Ollama library is the Q4_K_M build, 5.2 GB on disk. One file, no manual quant selection.

Option B — llama.cpp (most control + full quant ladder)

# macOS / Homebrew
brew install llama.cpp

# Linux: pre-built CUDA release binary (cu128 for Blackwell)
# Visit https://github.com/ggml-org/llama.cpp/releases

# Or build from source with CUDA
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp && cmake -B build -DGGML_CUDA=ON && cmake --build build -j --config Release

The 5090's 32 GB envelope makes every tier viable — per the unsloth/Qwen3-8B-GGUF per-tier table (link-back to upstream Qwen/Qwen3-8B confirmed on the page header):

Quant | File size | Notes on a 32 GB RTX 5090
Q4_K_M | 5.03 GB | recommended default; leaves ~26 GB for KV cache / colocation
UD-Q4_K_XL | 5.14 GB | Unsloth Dynamic 2.0 (accuracy-tuned 4-bit)
Q5_K_M | 5.85 GB | slight quality lift, same fit
Q6_K | 6.73 GB | "near perfect" per bartowski
Q8_0 | 8.71 GB | near-lossless; still leaves ~23 GB free
BF16 | 16.4 GB | full precision; fits with ~16 GB of KV-cache headroom

Pull and serve any tier via the llama.cpp Hugging Face shortcut:

# OpenAI-compatible local server with web UI
llama-server -hf unsloth/Qwen3-8B-GGUF:UD-Q4_K_XL --port 8080 -ngl 99 -fa

# Or for the full BF16 weights — the 5090 has room
llama-server -hf unsloth/Qwen3-8B-GGUF:BF16 --port 8080 -ngl 99 -fa

-ngl 99 offloads all layers to the GPU; -fa enables Flash Attention.

Option C — LM Studio (GUI)

Install LM Studio, search qwen3-8b, pick any tier (defaults to Q4_K_M). The 32 GB envelope means you can keep multiple tiers downloaded and hot-swap.

Running

Ollama

ollama run qwen3:8b "Explain GQA attention in three sentences."

# Disable Qwen3's reasoning trace per turn (faster, less chatty)
ollama run qwen3:8b "/no_think What's the capital of France?"

# OpenAI-compatible endpoint
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "qwen3:8b", "messages": [{"role": "user", "content": "Write a haiku about Blackwell GPUs."}]}'
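The same endpoint can also be called from Python's standard library. This is a minimal sketch, assuming the Ollama server from the curl call above is listening on localhost:11434; the helper names build_chat_request and ask are illustrative and not part of any Ollama client.

```python
import json
import urllib.request

# Same OpenAI-compatible endpoint as the curl example above.
OLLAMA_URL = "http://localhost:11434/v1/chat/completions"


def build_chat_request(prompt: str, model: str = "qwen3:8b") -> urllib.request.Request:
    """Build (but do not send) an OpenAI-style chat completion request."""
    body = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt: str, timeout: float = 120.0) -> str:
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(prompt), timeout=timeout) as resp:
        reply = json.load(resp)
    return reply["choices"][0]["message"]["content"]
```

With the server running, ask("Write a haiku about Blackwell GPUs.") should return the same text the curl call prints inside choices[0].message.content.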

llama.cpp

# Server with OpenAI-compatible API on :8080
llama-server -hf unsloth/Qwen3-8B-GGUF:UD-Q4_K_XL --port 8080 -ngl 99 -fa

# Single-shot CLI
llama-cli -hf unsloth/Qwen3-8B-GGUF:UD-Q4_K_XL -p "Summarize what GQA does in two lines."

vLLM / SGLang (production-style BF16 serving)

The 32 GB envelope makes the upstream BF16 path from the Qwen3-8B card comfortable. BF16 weights are ~16 GB, leaving ~16 GB for batched KV cache:

# vLLM (v0.8.5+)
vllm serve Qwen/Qwen3-8B --enable-reasoning --reasoning-parser deepseek_r1

# SGLang (v0.4.6.post1+)
python3 -m sglang.launch_server --model-path Qwen/Qwen3-8B \
  --host 0.0.0.0 --port 30000 --reasoning-parser qwen3

Results

  • Generation speed: 200.4 tok/s @ 4K context (Q4_K, llama.cpp build 8189), measured on Hardware Corner's RTX 5090 LLM benchmark page and surfaced via /check/qwen3-8b/rtx-5090. The same source publishes the full context ladder for this card:

    Context | Token generation | Prompt processing
    4K | 200.4 tok/s | 11,933.4 tok/s
    16K | 162.3 tok/s | 8,538.4 tok/s
    32K | 129.8 tok/s | 6,034.2 tok/s
    64K | 91.8 tok/s | 3,089.6 tok/s
    128K | 58.8 tok/s | 1,209.1 tok/s

    For reference, the same source's RTX 4090 page shows 141.3 tok/s at 4K and the RTX 3090 page shows 115.3 tok/s — the 5090 is ~1.42× faster than the 4090 and ~1.74× faster than the 3090 at the same workload, in line with its 1,792 GB/s GDDR7 memory bandwidth (vs 1,008 GB/s GDDR6X on the 4090).

  • VRAM usage: Q4_K_M weights load to ~5 GB; with a 32K KV cache (fp16, GQA-compressed with the model's 8 KV heads) you stay under 10 GB resident. At short contexts that leaves ~26 GB of the 5090's 32 GB envelope free for the use cases below. See live benchmark data.

  • Quality notes: Qwen3 ships a hybrid thinking/non-thinking mode. The model card recommends Temperature=0.6, TopP=0.95, TopK=20, MinP=0 for thinking mode and Temperature=0.7, TopP=0.8, TopK=20, MinP=0 for non-thinking — avoid greedy decoding either way per the Qwen team's guidance. Use /no_think in a prompt (or enable_thinking=False in the chat template) to suppress the <think> trace when you just want a fast answer.
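The two sampling presets in the quality notes can be kept in one place and attached to each request. The sketch below is illustrative only: chat_payload is a hypothetical helper, and it assumes the serving endpoint accepts top_k and min_p as extra fields next to the standard OpenAI ones, which should be checked against the runtime you use.

```python
# Sampling presets copied from the quality notes above (Qwen3 model card values).
PRESETS = {
    "thinking": {"temperature": 0.6, "top_p": 0.95, "top_k": 20, "min_p": 0.0},
    "non_thinking": {"temperature": 0.7, "top_p": 0.8, "top_k": 20, "min_p": 0.0},
}


def chat_payload(prompt: str, thinking: bool = True, model: str = "qwen3:8b") -> dict:
    """Build a chat-completions body with the preset that matches the mode."""
    preset = PRESETS["thinking" if thinking else "non_thinking"]
    # /no_think at the start of the user turn suppresses the <think> trace.
    content = prompt if thinking else f"/no_think {prompt}"
    return {
        "model": model,
        "messages": [{"role": "user", "content": content}],
        **preset,  # assumption: the server honours top_k / min_p in the request body
    }
```

Neither preset sets temperature to 0, which keeps requests away from the greedy decoding the Qwen team advises against.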

For the full benchmark data and other GPUs in the catalogue, see /check/qwen3-8b/rtx-5090.
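To turn the context ladder into wall-clock terms, the arithmetic can be sketched as below. It only reuses the token-generation figures quoted in the Results list; the 1,000-token answer length is an arbitrary illustration, and the helper names are hypothetical.

```python
# Token-generation rates from the Results table: context length (K tokens) -> tok/s.
GEN_TOK_S = {4: 200.4, 16: 162.3, 32: 129.8, 64: 91.8, 128: 58.8}

# 4K-context figures quoted for the other cards in the same paragraph.
RTX_4090_TOK_S = 141.3
RTX_3090_TOK_S = 115.3


def seconds_for(tokens: int, tok_per_s: float) -> float:
    """Time to generate `tokens` tokens at a steady rate."""
    return tokens / tok_per_s


def speedup(fast: float, slow: float) -> float:
    """Throughput ratio, rounded to two decimals."""
    return round(fast / slow, 2)


for ctx_k, rate in GEN_TOK_S.items():
    print(f"{ctx_k:>3}K context: {seconds_for(1000, rate):5.1f} s per 1,000 generated tokens")

print("vs RTX 4090:", speedup(GEN_TOK_S[4], RTX_4090_TOK_S))
print("vs RTX 3090:", speedup(GEN_TOK_S[4], RTX_3090_TOK_S))
```

A 1,000-token answer takes about 5 seconds at 4K context and about 17 seconds at 128K, and the two ratios reproduce the 1.42× and 1.74× figures quoted above.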

Spending the 26 GB of Headroom — the real reason to run this on a 5090

Qwen3-8B at Q4_K_M leaves the 5090 with concrete capabilities a 16 GB card (5060 Ti / 4060 Ti / 4070) or 24 GB card (3090 / 4090) cannot match:

  1. Colocate the 32B sibling for hard problems. Qwen3-32B at Q4_K_M is 18.4 GB, so Qwen3-8B Q4_K_M (~5 GB) + Qwen3-32B Q4_K_M (18.4 GB) = ~23 GB total weights, leaving ~9 GB for KV cache across both. Keep the 8B loaded for fast turns (200 tok/s) and the 32B loaded for hard ones (61 tok/s per the Hardware Corner RTX 5090 table), router-style. With Ollama, start the server with OLLAMA_NUM_PARALLEL=2 OLLAMA_MAX_LOADED_MODELS=2.

  2. Run BF16 weights for bit-exact reproduction. BF16 is 16.4 GB on disk (about 15.3 GiB), roughly half the 5090's envelope, with ~16 GB to spare for context and activations. Useful when reproducing a published paper's results or when you want the model to run at its training-time precision. The 5060 Ti 16GB cannot fit BF16 without offload; the 5090 fits it with a 32K KV cache to spare.

  3. Run a full multimodal production pipeline. Whisper-large-v3 (~3 GB) + Kokoro-82M TTS (~0.5 GB) + Qwen3-8B Q4_K_M (~5 GB) = ~8.5 GB total — ASR → reason → speak on one card with 23 GB to spare for a Stable Diffusion XL pipeline or an embedding model alongside.

  4. Push context to the full 131K window comfortably. The Qwen3-8B model card lists a native context of 32,768 tokens with explicit YaRN scaling to 131,072 tokens. Hardware Corner's table above measures 58.8 tok/s at 128K on the 5090 — about 2× the 3090's 28.1 tok/s at the same length, because GDDR7 bandwidth shines under KV-cache pressure. Enable in llama.cpp with --rope-scaling yarn --rope-scale 4 --yarn-orig-ctx 32768. The 5090 fits the 128K KV cache with room left over for other models.

  5. Use the 14B sibling for everyday work, 8B for batched throughput. Qwen3-14B at Q4_K_M is ~9 GB and runs at 123.8 tok/s on the 5090 per the same Hardware Corner table — a sweet spot between 8B's speed and 32B's quality. Combined with the 8B for routing, total footprint is ~14 GB, well under half the card.
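The colocation sums in the list above can be checked with a few lines of arithmetic. This is a rough sketch: the footprints are the approximate weight sizes quoted in this section, headroom_gb is a hypothetical helper, and KV cache plus runtime buffers still have to come out of whatever headroom it reports.

```python
CARD_GB = 32.0  # RTX 5090

# Approximate weight footprints quoted in the list above, in GB.
FOOTPRINT_GB = {
    "qwen3-8b-q4_k_m": 5.0,
    "qwen3-14b-q4_k_m": 9.0,
    "qwen3-32b-q4_k_m": 18.4,
    "whisper-large-v3": 3.0,
    "kokoro-82m": 0.5,
}


def headroom_gb(*models: str, card_gb: float = CARD_GB) -> float:
    """VRAM left after loading the named models (weights only)."""
    used = sum(FOOTPRINT_GB[name] for name in models)
    return round(card_gb - used, 1)


print(headroom_gb("qwen3-8b-q4_k_m", "qwen3-32b-q4_k_m"))                # 8B + 32B router
print(headroom_gb("whisper-large-v3", "kokoro-82m", "qwen3-8b-q4_k_m"))  # ASR -> reason -> speak
print(headroom_gb("qwen3-8b-q4_k_m", "qwen3-14b-q4_k_m"))                # 8B + 14B
```

The three calls report 8.6, 23.5 and 18.0 GB of remaining headroom, in line with the ~9 GB, 23 GB and "well under half the card" figures in the list.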

Troubleshooting

Generation feels slow or nvidia-smi shows 0% GPU utilisation

Confirm a recent NVIDIA driver with CUDA 12.8+ runtime is installed (nvidia-smi should report a 570-series or newer driver on Linux, the first driver branch with CUDA 12.8 support). The RTX 5090 uses the Blackwell architecture (sm_120), which requires the cu128 CUDA toolchain; older builds lack the kernels, and Ollama / llama.cpp silently fall back to CPU inference. For llama.cpp, grab a cu128 release binary or rebuild with -DGGML_CUDA=ON against a CUDA 12.8+ install. If the GPU is still under-utilised, also check that you passed -ngl 99 and -fa to llama-server.
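The driver and utilisation check can be scripted while a generation is running. The sketch below shells out to nvidia-smi with its CSV query mode; the helper names and the sample line in the usage note are made up for illustration, and the code assumes a single GPU that reports numeric values for every field.

```python
import subprocess

# nvidia-smi query: driver version, GPU utilisation (%), memory in use (MiB).
QUERY = [
    "nvidia-smi",
    "--query-gpu=driver_version,utilization.gpu,memory.used",
    "--format=csv,noheader,nounits",
]


def parse_gpu_line(line: str) -> dict:
    """Parse one CSV line such as '570.86.16, 97, 6120'."""
    driver, util, mem = (field.strip() for field in line.split(","))
    return {
        "driver_major": int(driver.split(".")[0]),
        "util_pct": int(util),
        "mem_used_mib": int(mem),
    }


def read_gpu() -> dict:
    """Run nvidia-smi and parse the first GPU's line."""
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
    return parse_gpu_line(out.splitlines()[0])
```

Call read_gpu() mid-generation: a util_pct near 0 together with only a few hundred MiB in mem_used_mib suggests the runtime has fallen back to CPU, and driver_major can be compared against the driver floor named above.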

FlashAttention 2 errors when running transformers directly

If you bypass Ollama / llama.cpp and run the HF card quickstart via transformers directly, note that the quickstart uses torch_dtype="auto" and device_map="auto" and does not hardcode attn_implementation="flash_attention_2". It therefore works on Blackwell sm_120 out of the box with a PyTorch release that ships CUDA 12.8 wheels (torch>=2.7). If you (or a tutorial) explicitly add attn_implementation="flash_attention_2", the model will crash on first inference: FA2 wheels don't ship sm_120 kernels yet (Dao-AILab/flash-attention#2168). Override with attn_implementation="sdpa" (or remove the argument) and PyTorch's built-in scaled-dot-product attention handles it.
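For a transformers-direct workflow, the override looks roughly like the sketch below. It assumes transformers and a Blackwell-capable PyTorch build are installed; load_kwargs and load_qwen3 are illustrative names, and only the from_pretrained arguments come from the text above.

```python
def load_kwargs(attn_implementation: str = "sdpa") -> dict:
    """Keyword arguments for from_pretrained; SDPA avoids the missing FA2 sm_120 kernels."""
    return {
        "torch_dtype": "auto",   # as in the model card quickstart
        "device_map": "auto",    # as in the model card quickstart
        "attn_implementation": attn_implementation,
    }


def load_qwen3(model_id: str = "Qwen/Qwen3-8B"):
    """Load tokenizer and model with PyTorch's built-in SDPA attention."""
    # Imported lazily so the helper above can be used without transformers installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id, **load_kwargs())
    return tokenizer, model
```

Dropping the attn_implementation key entirely has the same practical effect, since transformers then picks its default attention path instead of FlashAttention 2.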

Rare SDPA NaN on early Blackwell + transformers FP16 paths

Issue QwenLM/Qwen3#1499 reported a torch.nn.functional.scaled_dot_product_attention NaN on RTX 5080 (also Blackwell sm_120) when running Qwen3-0.6B in transformers with FP16 SDPA. The Qwen maintainer's diagnosis (issue comment by @jklj077, COLLABORATOR): "PyTorch SDPA may select different implementations based on the inputs, data types, and devices. […] It's likely that the implementation selected for the RTX 5080 (Blackwell) was causing the problem." The reporter then confirmed the fix: "This bug was fixed by upgrading cuDNN. Please use the preview version of PyTorch." This recipe's Ollama / llama.cpp paths don't hit the transformers SDPA path at all, so you only need to worry about it if you switch to a transformers-direct workflow — in which case, use a recent PyTorch wheel with bundled cuDNN.

<think>...</think> output is bloating responses

Qwen3 enables thinking mode by default per the HF card quickstart. Send /no_think at the start of any user message to disable it for that turn, or pass enable_thinking=False if you're calling the chat-template API directly.
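When thinking mode stays on but the trace should not reach the user, it can be separated from the answer after the fact. This is a small sketch that assumes the response text carries the trace inline between literal <think> and </think> tags; split_thinking is a hypothetical helper, not part of any Qwen or Ollama API.

```python
import re

# Non-greedy match so multiple <think> blocks are handled one at a time.
THINK_BLOCK = re.compile(r"<think>(.*?)</think>\s*", flags=re.DOTALL)


def split_thinking(text: str) -> tuple[str, str]:
    """Return (reasoning, answer) from a response that may contain <think> blocks."""
    traces = [trace.strip() for trace in THINK_BLOCK.findall(text)]
    answer = THINK_BLOCK.sub("", text).strip()
    return "\n".join(t for t in traces if t), answer


reasoning, answer = split_thinking("<think>\n2 + 2 is 4\n</think>\n\nThe answer is 4.")
print(answer)  # The answer is 4.
```

Responses without a trace pass through unchanged, and an empty <think></think> pair simply yields an empty reasoning string.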

Pushing past the 32K native context

Native context is 32,768 tokens. For anything longer, enable YaRN explicitly — llama.cpp:

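# -c sets the context window; without it the YaRN flags alone do not enlarge it
llama-server -hf unsloth/Qwen3-8B-GGUF:UD-Q4_K_XL \
  -c 131072 --rope-scaling yarn --rope-scale 4 --yarn-orig-ctx 32768 \
  -ngl 99 -fa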

Expect ~58.8 tok/s at the full 128K extent per Hardware Corner's table. The Qwen team specifically recommends not enabling YaRN unless you need it: per the model card, "if the average context length does not exceed 32,768 tokens, we do not recommend enabling YaRN in this scenario, as it may potentially degrade model performance."
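How much VRAM the long context costs can be estimated from the attention shape. The sketch below is a back-of-the-envelope calculation, not a measurement: the 8 KV heads come from this recipe, while the 36 layers and head dimension of 128 are assumptions taken from the published Qwen3-8B configuration and should be verified against the model's config.json. It also assumes an unquantized fp16 cache.

```python
def kv_cache_bytes(
    tokens: int,
    n_layers: int = 36,        # assumption: Qwen3-8B layer count
    n_kv_heads: int = 8,       # GQA key/value heads, as quoted in this recipe
    head_dim: int = 128,       # assumption: per-head dimension
    bytes_per_value: int = 2,  # fp16 cache
) -> int:
    """Size of the key + value cache for `tokens` positions."""
    # Factor 2 covers one key tensor and one value tensor per layer.
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value * tokens


for ctx in (32_768, 131_072):
    print(f"{ctx:>7} tokens: {kv_cache_bytes(ctx) / 1e9:.2f} GB")
```

Under these assumptions the cache is about 4.83 GB at 32,768 tokens and 19.33 GB at 131,072 tokens. Added to the ~5 GB Q4_K_M weights, the full window lands near 24 GB on the 32 GB card, whereas pairing the 16.4 GB BF16 weights with an fp16 cache of that size would exceed it.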

Running larger Qwen3 siblings standalone

ollama run qwen3:14b (~9 GB Q4_K_M, 123.8 tok/s at 4K on the 5090) and ollama run qwen3:32b (~18 GB Q4_K_M, 61.4 tok/s at 4K on the 5090) both fit the 5090 with room to spare; generation speeds are from the same Hardware Corner RTX 5090 page. The 30B and 235B MoE variants activate only a fraction of their parameters per token but still need all parameters resident (see the Qwen3 model card on the dense/MoE split). Qwen3-30B-A3B is roughly 18 to 19 GB of weights at Q4_K_M, so it fits the 5090 much like the 32B dense model; the 235B variant does not fit on a single card. For a 235B recipe, request via /contribute.