What You'll Build
A local Qwen3-8B chat / reasoning assistant running on a 16 GB RTX 5080, served through Ollama (or llama.cpp / LM Studio — same GGUF, three loaders). The recipe pins the dense 8B variant at Q4_K_M quantization (5.03 GB on disk), which leaves the RTX 5080's 16 GB envelope wildly over-provisioned — the binding question on this card isn't "does it fit" but "what to do with the ~10 GB of free VRAM."
Hardware data: RTX 5080 (16 GB GDDR7) · Q4_K GGUF · 129.1 tokens/s generation at 4k context · See benchmark data
⚠️ Variant pinned: Qwen3 ships 8 sizes from the same Qwen org. Per the Ollama qwen3 tag list, Qwen3 spans 0.6b, 1.7b, 4b, 8b (this recipe), 14b, 30b (MoE), 32b, and 235b (MoE). The siblings have wildly different VRAM profiles: Qwen3-14B in Q4_K_M is ~8.5 GB and still fits 16 GB; Qwen3-32B in Q4_K_M is ~20 GB and overflows; Qwen3-235B (MoE, ~22B active) needs >100 GB total resident weights since the router can't pre-prune (see Qwen3 model card for the dense/MoE split). The instructions below are for the dense 8.2B model only. If you want 14B on this card, swap qwen3:8b for qwen3:14b; for 32B+ go to /contribute.
ℹ️ Thinking mode is on by default. Qwen3-8B has a built-in chain-of-thought ("thinking") mode that the model card's quickstart enables via enable_thinking=True. Output starts with a <think>...</think> block followed by the user-facing answer. To disable it for latency-sensitive use, send /no_think in your prompt or pass enable_thinking=False in the chat template.
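Because the reasoning arrives inline with the answer, client code usually has to separate the two. The following is a minimal sketch of one way to do that; the helper name `split_thinking` is hypothetical, and it assumes the block appears once, at the start of the output, as described above.

```python
import re

# Matches a single leading <think>...</think> block, including an empty one.
_THINK_RE = re.compile(r"^\s*<think>(.*?)</think>\s*", re.DOTALL)


def split_thinking(text: str) -> tuple[str, str]:
    """Return (reasoning, answer) from raw model output.

    If the output has no leading <think> block, reasoning is an empty string.
    """
    match = _THINK_RE.match(text)
    if match is None:
        return "", text.strip()
    return match.group(1).strip(), text[match.end():].strip()


raw = "<think>\nFrance's capital is Paris.\n</think>\n\nParis."
reasoning, answer = split_thinking(raw)
print(answer)
```

Keeping the reasoning as a separate value, instead of discarding it, lets you log it for debugging while showing users only the answer.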
Requirements
| Component | Minimum | Tested |
|---|---|---|
| GPU | 6 GB VRAM (for Q4_K_M weights + KV at 4k ctx) | RTX 5080 (16 GB) |
| RAM | 16 GB system | — |
| Storage | 5.03 GB (Q4_K_M GGUF) or 8.71 GB (Q8_0) | per unsloth/Qwen3-8B-GGUF |
| Driver | CUDA 12.8+ runtime (Blackwell sm_120) | — |
| Runtime | Ollama 0.5+ / llama.cpp / LM Studio | — |
The model is released under Apache 2.0 — commercial use is permitted.
Installation
The fastest path is Ollama — one command pulls the canonical Q4_K_M build maintained by the Qwen team:
Option A — Ollama (recommended)
1. Install Ollama
# macOS / Linux
curl -fsSL https://ollama.com/install.sh | sh
(Windows: download from ollama.com/download.) Per the Qwen3 model card, "applications such as Ollama, LMStudio, MLX-LM, llama.cpp, and KTransformers have also supported Qwen3."
2. Pull the 8B model
ollama pull qwen3:8b
This fetches a 5.2 GB Q4_K_M checkpoint per the Ollama qwen3:8b tag. The download is one file — no manual quant-tier selection needed.
Option B — llama.cpp + community GGUF
If you want a different quant tier (Q6_K for higher fidelity, Q8_0 for near-lossless, BF16 because the 5080 finally has the headroom for it), use a community redistributor that publishes the full ladder:
1. Install llama.cpp
# macOS (Homebrew)
brew install llama.cpp
# Linux: pre-built CUDA binaries (llama.cpp ships binaries, not Python wheels)
# Visit https://github.com/ggml-org/llama.cpp/releases for cu12x binaries
2. Pull the quant you want
Per the unsloth/Qwen3-8B-GGUF per-tier file-size table (link-back to upstream Qwen/Qwen3-8B confirmed on the page header):
| Quant | File size | Notes |
|---|---|---|
| Q4_K_M | 5.03 GB | recommended for general use |
| Q5_K_M | 5.85 GB | better quality, still tiny |
| Q6_K | 6.73 GB | "near perfect" per bartowski |
| Q8_0 | 8.71 GB | near-lossless |
| BF16 | 16.4 GB | full precision — at the very edge of the 16 GB envelope; cap context tight and disable KV cache padding |
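Then via the llama.cpp Hugging Face shortcut (per the unsloth model card). The tag after the colon selects the quant: the commands below pull Unsloth's UD-Q4_K_XL build, so substitute a tier from the table (for example :Q8_0) if that is the one you want: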
# OpenAI-compatible local server with web UI
llama-server -hf unsloth/Qwen3-8B-GGUF:UD-Q4_K_XL
# Interactive terminal
llama-cli -hf unsloth/Qwen3-8B-GGUF:UD-Q4_K_XL
Option C — LM Studio (GUI)
LM Studio offers a one-click install path per the Qwen3-8B HF card. Search "Qwen3-8B GGUF" inside the app and pick the Q4_K_M tier, or use the direct-import link lmstudio://open_from_hf?model=unsloth/Qwen3-8B-GGUF.
Running
One-shot prompt via Ollama
ollama run qwen3:8b "Explain GQA attention in three sentences."
First run loads the model into VRAM (~5 GB resident at idle, growing as the KV cache fills with longer contexts). Subsequent prompts in the same session stay warm.
Disable thinking mode for short answers
ollama run qwen3:8b "/no_think What's the capital of France?"
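Per the Qwen3-8B HF card, /no_think is a soft switch: thinking stays enabled in the chat template, but the model skips chain-of-thought for that turn, so the <think>...</think> block comes back empty. The hard switch is enable_thinking=False in the chat template, which removes the block entirely.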
OpenAI-compatible HTTP API
# Ollama exposes localhost:11434 by default
curl http://localhost:11434/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "qwen3:8b",
"messages": [{"role": "user", "content": "Write a haiku about Blackwell GPUs."}]
}'
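The same request can be sent from Python using only the standard library. This is a sketch under a few assumptions: the function names are hypothetical, the endpoint is the default one from the curl example, and the response is read in the OpenAI chat-completions shape (`choices[0].message.content`).

```python
import json
import urllib.request

# Default Ollama endpoint, taken from the curl example.
OLLAMA_URL = "http://localhost:11434/v1/chat/completions"


def build_chat_request(prompt: str, model: str = "qwen3:8b", think: bool = True) -> dict:
    """Build the JSON body; prefix /no_think when thinking is not wanted."""
    content = prompt if think else f"/no_think {prompt}"
    return {"model": model, "messages": [{"role": "user", "content": content}]}


def chat(prompt: str, think: bool = True, timeout: float = 120.0) -> str:
    """POST the request to a running Ollama server and return the reply text."""
    body = json.dumps(build_chat_request(prompt, think=think)).encode("utf-8")
    request = urllib.request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        payload = json.load(response)
    return payload["choices"][0]["message"]["content"]


# Usage (requires `ollama serve` running with qwen3:8b pulled):
#   print(chat("Write a haiku about Blackwell GPUs."))
```

Separating payload construction from the network call keeps the request shape easy to inspect and test without a server.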
For higher throughput / production-style serving, the upstream Qwen3-8B card documents vllm serve Qwen/Qwen3-8B --enable-reasoning --reasoning-parser deepseek_r1 and python -m sglang.launch_server --model-path Qwen/Qwen3-8B --reasoning-parser qwen3 — both load BF16 weights (16.4 GB), which is right at this card's capacity. The 5080's 16 GB envelope can accommodate BF16 serving but only with tight context caps; for comfortable headroom prefer Ollama / llama.cpp with the Q4_K_M GGUF.
Spending the headroom — what to do with the spare ~10 GB
The Q4_K_M weights occupy ~5 GB at idle on the RTX 5080's 16 GB envelope, leaving roughly 10 GB of spare VRAM. The genuine per-GPU question on this card isn't "does it fit" (it fits trivially) — it's how to use that headroom. Three concrete options, all citable to the model card or the runtime:
- Quant up, not down. Q8_0 weights are 8.71 GB and near-lossless per bartowski; BF16 is 16.4 GB and right at the envelope edge. On a 5080 you can drop in unsloth/Qwen3-8B-GGUF:Q8_0 or run BF16 with a capped context (--ctx-size 8192) and lose little or nothing to quantization noise.
- Longer context: push toward 131K with YaRN. Qwen3's native window is 32K per the HF card ("Context Length: 32,768 natively and 131,072 tokens with YaRN"). At Q4_K_M, the headroom accommodates a 32K KV cache at fp16 (~4.5 GiB, from 36 layers × 8 KV heads × head dimension 128 × 2 bytes, for K and V each), and YaRN extension via --rope-scaling yarn --rope-scale 4 --yarn-orig-ctx 32768 per the unsloth GGUF instructions takes it further. An fp16 cache doubles to ~9 GiB at 64K, so going toward 128K on this card would need a quantized KV cache.
- Colocate a smaller helper model. A small sidecar (Qwen3-1.7B, Qwen3-4B, or a Whisper-small ASR for voice pipelines) at Q4 takes 1–3 GB. Running two Ollama models concurrently (Ollama keeps both loaded in VRAM until they age out) lets you build retrieval / RAG / multi-stage pipelines on a single card. See the Ollama qwen3 tags for the smaller-variant tags.
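To budget the spare VRAM against context length, a back-of-the-envelope KV-cache estimate helps. The sketch below assumes the published Qwen3-8B architecture values (36 layers, 8 KV heads under GQA, head dimension 128) and an unquantized fp16 cache; check the model's config.json before relying on it, and note that runtimes add compute buffers on top.

```python
GIB = 1024 ** 3


def kv_cache_bytes(
    n_ctx: int,
    n_layers: int = 36,      # assumed from the Qwen3-8B config
    n_kv_heads: int = 8,     # GQA: 8 KV heads (assumed from the config)
    head_dim: int = 128,     # assumed from the Qwen3-8B config
    bytes_per_elem: int = 2, # fp16 cache; use 1 for an 8-bit cache
) -> int:
    """Bytes needed to cache keys and values for n_ctx tokens."""
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem  # 2x: K and V
    return per_token * n_ctx


for n_ctx in (4_096, 32_768, 65_536, 131_072):
    print(f"{n_ctx:>7} tokens: {kv_cache_bytes(n_ctx) / GIB:.2f} GiB")
```

Adding the result to the weight size gives a floor on resident VRAM for a given context setting.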
Results
- Speed: 129.1 tokens/s generation at 4k context, Q4_K quantization, measured on RTX 5080, per the hardware-corner.net LLM benchmark table, surfaced via /check/qwen3-8b/rtx-5080. The full context ladder on the same page: 94.1 tok/s at 16k, 72.5 tok/s at 32k, and 49.0 tok/s at 64k as the KV cache grows; the row caps at 64K (Qwen3-8B's "Max 64k" badge on Hardware Corner). Prompt processing is much faster: 6,410.1 tok/s at 4k context, 4,024.4 at 16k, 1,943.3 at 32k, and 1,124.6 at 64k per the same source. The 5080 is roughly 1.9× faster on generation and 2.2× faster on prefill than the 5060 Ti at every context length per the same Hardware Corner page, which is consistent with its memory-bandwidth and compute uplift, since both cards run the same sm_120 kernels.
- VRAM usage: The cited benchmark run fit within the card's 16 GB at the 4k-context configuration. At idle the Q4_K_M weights occupy ~5 GB; the KV cache grows linearly with context and stays within the envelope through 64K, the longest context in the benchmark. The official Qwen speed benchmark corroborates the precision/VRAM ladder on H20 hardware: BF16 = 15947 MB, FP8 = 9323 MB, AWQ-INT4 = 6177 MB. See /check/qwen3-8b/rtx-5080 for the latest measurement.
- Quality notes: Q4_K_M is the community-default "sweet spot" — the bartowski Q-tier guide flags Q6_K as "near perfect, recommended" if you have the VRAM. On a 16 GB 5080 you can also run Q6_K (6.73 GB), Q8_0 (8.71 GB), or BF16 (16.4 GB, at the envelope edge with context capped). There's no quality reason to pick anything below Q4_K_M on this card.
For the full benchmark data and other-GPU comparisons, see /check/qwen3-8b/rtx-5080.
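To turn the throughput figures above into rough latency expectations, the arithmetic can be sketched as follows. The rates are copied from the Results list; the function names are hypothetical, and treating each rate as constant over a whole request is a simplification, so read the output as an estimate only.

```python
# Rates copied from the Results section (tokens/s on RTX 5080, Q4_K).
GENERATION_TOK_S = {"4k": 129.1, "16k": 94.1, "32k": 72.5, "64k": 49.0}
PREFILL_TOK_S = {"4k": 6410.1, "16k": 4024.4, "32k": 1943.3, "64k": 1124.6}


def generation_slowdown(context: str) -> float:
    """How many times slower generation is than at 4k context."""
    return GENERATION_TOK_S["4k"] / GENERATION_TOK_S[context]


def estimated_seconds(context: str, prompt_tokens: int, output_tokens: int) -> float:
    """Prefill time plus decode time, treating both rates as constant."""
    prefill = prompt_tokens / PREFILL_TOK_S[context]
    decode = output_tokens / GENERATION_TOK_S[context]
    return prefill + decode


for label in GENERATION_TOK_S:
    print(f"{label}: generation is {generation_slowdown(label):.2f}x slower than at 4k")
```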
Troubleshooting
Ollama returns Error: model requires more system memory or hangs on load
Confirm that a recent NVIDIA driver and a CUDA 12.8+ runtime are installed (nvidia-smi should report a Blackwell-capable driver). The RTX 5080 uses the Blackwell architecture (sm_120), which requires CUDA 12.8 or newer; older CUDA builds do not ship sm_120 kernels and will fall back to CPU or fail with a no kernel image is available for execution on the device error. Ollama's bundled CUDA runtime handles this in recent builds, but if you're compiling llama.cpp from source, configure with cmake -B build -DGGML_CUDA=ON against CUDA 12.8+ explicitly (LLAMA_CUDA=1 is the older spelling of that switch). Watch nvidia-smi -l 1 in another terminal while a prompt is generating to confirm the GPU is actually being used; if utilization stays at 0%, inference is running on the CPU, most likely because the driver or runtime is too old.
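If you would rather script that check than watch a terminal, nvidia-smi's CSV query mode is easy to parse. This is a small sketch: the function names are hypothetical, and it assumes every queried field comes back numeric (some drivers report `[N/A]` for unsupported fields, which this parser does not handle).

```python
import subprocess

# Query utilization and memory as bare numbers, one GPU per line.
QUERY = [
    "nvidia-smi",
    "--query-gpu=utilization.gpu,memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_gpu_stats(csv_text: str) -> list[dict]:
    """Parse lines like '37, 5321, 16303' into one dict per GPU."""
    stats = []
    for line in csv_text.strip().splitlines():
        util, used, total = (int(field.strip()) for field in line.split(","))
        stats.append({"util_pct": util, "used_mib": used, "total_mib": total})
    return stats


def read_gpu_stats() -> list[dict]:
    """Run nvidia-smi once and return the parsed stats."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_gpu_stats(result.stdout)


# Usage (requires an NVIDIA driver): call read_gpu_stats() while a prompt is
# generating; util_pct of 0 and used_mib far below ~5000 suggest CPU fallback.
```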
<think>...</think> output is bloating responses
Qwen3 enables thinking mode by default per the HF card quickstart. Send /no_think at the start of any user message to disable it for that turn, or pass enable_thinking=False if you're calling the chat-template API directly.
Using transformers directly — FA2 sm_120 wheel gap
If you bypass Ollama / llama.cpp and run the HF card quickstart via transformers directly, mind the FlashAttention-2 sm_120 gap: as of mid-2026, prebuilt FA2 wheels still do not ship sm_120 kernels for Blackwell consumer cards (tracked at Dao-AILab/flash-attention#2168). The Qwen3-8B quickstart uses torch_dtype="auto" and device_map="auto" without hardcoding attn_implementation="flash_attention_2", so it works out of the box with attn_implementation="eager" or "sdpa". If any third-party snippet you copy hardcodes attn_implementation="flash_attention_2", override it to "sdpa" until FA2 lands sm_120 wheels. Also: install the cu128 PyTorch wheel (pip install torch --index-url https://download.pytorch.org/whl/cu128) — the default cu126 wheel does not include sm_120 kernels.
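The override described above can be wrapped in a small guard so copied snippets degrade gracefully. This is a sketch with a hypothetical helper name; it relies on `torch.cuda.get_device_capability()` returning a `(major, minor)` tuple, which is `(12, 0)` for sm_120 cards, and the `fa2_supports_sm120` flag is an assumption you would flip once suitable FA2 wheels exist.

```python
def pick_attn_implementation(
    requested: str,
    compute_capability: tuple[int, int],
    fa2_supports_sm120: bool = False,  # assumption: flip when FA2 ships sm_120 wheels
) -> str:
    """Fall back to SDPA when FA2 is requested on an sm_120 device without FA2 kernels."""
    is_sm120 = compute_capability == (12, 0)
    if requested == "flash_attention_2" and is_sm120 and not fa2_supports_sm120:
        return "sdpa"
    return requested


# Usage with transformers (not executed here; requires torch and a GPU):
#   import torch
#   from transformers import AutoModelForCausalLM
#   attn = pick_attn_implementation("flash_attention_2", torch.cuda.get_device_capability(0))
#   model = AutoModelForCausalLM.from_pretrained(
#       "Qwen/Qwen3-8B", torch_dtype="auto", device_map="auto", attn_implementation=attn
#   )
print(pick_attn_implementation("flash_attention_2", (12, 0)))
```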
I want the larger 14B / 32B sibling
Qwen3-14B at Q4_K_M is ~8.5 GB on disk and fits a 16 GB card with plenty of room — swap qwen3:8b for qwen3:14b in any Ollama command. Qwen3-32B at Q4_K_M is ~20 GB and does not fit without aggressive offloading; same for the 30B MoE and 235B MoE variants (MoE total params must be resident — see the Qwen3 model card on the dense/MoE split). For a 32B+ recipe on this card, request via /contribute.
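The fit reasoning above reduces to a simple comparison, sketched here. The sizes are the approximate Q4_K_M figures quoted in this recipe; the 2 GB reserve for KV cache and runtime buffers is an illustrative assumption, not a measured value, and the helper name is hypothetical.

```python
# Approximate Q4_K_M weight sizes in GB, as quoted in this recipe.
Q4_K_M_SIZE_GB = {"qwen3:8b": 5.03, "qwen3:14b": 8.5, "qwen3:32b": 20.0}


def fits_in_vram(tag: str, vram_gb: float = 16.0, reserve_gb: float = 2.0) -> bool:
    """True when weights plus a reserve for KV cache and buffers fit in VRAM."""
    return Q4_K_M_SIZE_GB[tag] + reserve_gb <= vram_gb


for tag in Q4_K_M_SIZE_GB:
    print(tag, "fits" if fits_in_vram(tag) else "does not fit")
```

Raise `reserve_gb` if you plan to run long contexts, since the KV cache grows with context length.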
Generation slows dramatically past 32k context
32k is Qwen3's native context window per the HF card ("Context Length: 32,768 natively and 131,072 tokens with YaRN"). Beyond that the model needs YaRN extension — supported in llama.cpp via --rope-scaling yarn --rope-scale 4 --yarn-orig-ctx 32768 per the unsloth GGUF instructions — but quality degrades and the KV cache balloons. For long-doc workflows, prefer chunking + retrieval over pushing context past 32k. The hardware-corner.net benchmark shows the rate falling to 49.0 tok/s at 64k context on this card (vs 129.1 at 4k) — Qwen3-8B's row caps at the 64K "Max 64k" badge on Hardware Corner.
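The YaRN flags follow a simple rule: the rope scale is the target context divided by the native 32,768-token window. A sketch of that rule, using only the llama.cpp flags already quoted in this recipe (the helper name is hypothetical, and the 131,072 ceiling is the HF card's documented maximum):

```python
NATIVE_CTX = 32_768     # Qwen3 native window per the HF card
MAX_YARN_CTX = 131_072  # documented ceiling with YaRN


def llama_cpp_context_args(target_ctx: int) -> list[str]:
    """Return llama.cpp context flags, adding YaRN scaling only past the native window."""
    if target_ctx <= 0 or target_ctx > MAX_YARN_CTX:
        raise ValueError(f"target_ctx must be in 1..{MAX_YARN_CTX}")
    args = ["--ctx-size", str(target_ctx)]
    if target_ctx > NATIVE_CTX:
        scale = target_ctx / NATIVE_CTX
        args += [
            "--rope-scaling", "yarn",
            "--rope-scale", f"{scale:g}",
            "--yarn-orig-ctx", str(NATIVE_CTX),
        ]
    return args


print(" ".join(llama_cpp_context_args(65_536)))
```

Leaving the scaling flags off at or below 32K avoids paying YaRN's quality cost on short prompts.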