Qwen3-14B on RTX 5080: Q4_K_M GGUF via Ollama or llama.cpp

llm · beginner · 16GB+ VRAM · May 28, 2026
prerequisites
  • NVIDIA RTX 5080 (16 GB GDDR7) or equivalent Blackwell sm_120 card
  • Recent NVIDIA driver with CUDA 12.8+ runtime (Blackwell sm_120 — `cu128` PyTorch wheel required if you go the transformers route)
  • ~9 GB free disk for the Q4_K_M GGUF checkpoint (or ~16 GB for Q8_0)
  • Ollama, llama.cpp, or LM Studio installed

What You'll Build

A local Qwen3-14B chat / reasoning assistant running on a 16 GB RTX 5080, served through Ollama (or llama.cpp / LM Studio — same GGUF, three loaders). The recipe pins the dense 14B variant at Q4_K_M quantization (9.00 GB on disk), which fits the RTX 5080's 16 GB envelope with roughly 7 GB to spare for KV cache, longer context, or a small colocated helper model.

Hardware data: RTX 5080 (16 GB GDDR7) · Q4_K GGUF · 80.6 tokens/s generation at 4k context · See benchmark data

⚠️ Variant pinned — Qwen3 ships 8 sizes from the same Qwen org. Per the Ollama qwen3 tag list, Qwen3 spans 0.6b, 1.7b, 4b, 8b, 14b (this recipe), 30b (MoE), 32b, and 235b (MoE). The siblings have very different VRAM profiles — Qwen3-8B in Q4_K_M is ~5 GB; Qwen3-32B in Q4_K_M is ~20 GB and overflows a 16 GB card; the 30B/235B MoE variants need every expert resident in VRAM (the router can't pre-prune), far past this card. The instructions below are for the dense 14.8B model only. For 32B+ on this card, see /contribute.

ℹ️ Thinking mode is on by default. Qwen3-14B supports a built-in chain-of-thought ("thinking") mode that the model card's quickstart enables via enable_thinking=True. Output starts with a <think>...</think> block followed by the user-facing answer. To disable for latency-sensitive use, send /no_think in your prompt or pass enable_thinking=False in the chat template.

Requirements

| Component | Minimum | Tested |
|---|---|---|
| GPU | 16 GB VRAM (Q4_K_M weights ~9 GB + KV at 4k ctx, room for the full envelope) | RTX 5080 (16 GB) |
| RAM | 16 GB system | |
| Storage | 9.00 GB (Q4_K_M GGUF) or 15.70 GB (Q8_0) | per unsloth/Qwen3-14B-GGUF |
| Driver | CUDA 12.8+ runtime (Blackwell sm_120) | |
| Runtime | Ollama 0.5+ / llama.cpp / LM Studio | |

The model is released under Apache 2.0 — commercial use is permitted.

Installation

The fastest path is Ollama — one command pulls the canonical Q4_K_M build:

Option A — Ollama (recommended)

1. Install Ollama

# macOS / Linux
curl -fsSL https://ollama.com/install.sh | sh

(Windows: download from ollama.com/download.) Per the Qwen3-14B model card, "For local use, applications such as Ollama, LMStudio, MLX-LM, llama.cpp, and KTransformers have also supported Qwen3."

2. Pull the 14B model

ollama pull qwen3:14b

This fetches a 9.3 GB Q4_K_M checkpoint per the Ollama qwen3:14b tag. The download is one file — no manual quant-tier selection needed.

Option B — llama.cpp + community GGUF

If you want a different quant tier (Q5_K_M / Q6_K for higher fidelity, Q8_0 for near-lossless), use a community redistributor that publishes the full ladder:

1. Install llama.cpp

# macOS (Homebrew)
brew install llama.cpp

# Linux: pre-built CUDA binaries
# Visit https://github.com/ggml-org/llama.cpp/releases for cu12x binaries

2. Pull the quant you want

Per the unsloth/Qwen3-14B-GGUF per-tier file-size table (upstream Qwen/Qwen3-14B link-back confirmed in the model-card header):

| Quant | File size | Notes |
|---|---|---|
| Q4_K_M | 9.00 GB | recommended for general use |
| Q5_K_M | 10.51 GB | better quality, ample room on 16 GB |
| Q6_K | 12.12 GB | "near perfect" per bartowski |
| Q8_0 | 15.70 GB | near-lossless; fills the 16 GB envelope, cap context tight |
| BF16 | 29.54 GB | full precision; does NOT fit a 16 GB card |
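The table can be turned into a quick headroom check. The sketch below subtracts each file size from the card's nominal 16 GB; it is weights-only arithmetic that ignores KV cache, the CUDA context, and GB-versus-GiB rounding, so treat the results as upper bounds. The names `QUANT_SIZES_GB` and `headroom_gb` are illustrative, not part of any tool.

```python
# File sizes (GB) copied from the quant table; 16.0 is the card's nominal VRAM.
QUANT_SIZES_GB = {
    "Q4_K_M": 9.00,
    "Q5_K_M": 10.51,
    "Q6_K": 12.12,
    "Q8_0": 15.70,
    "BF16": 29.54,
}
CARD_VRAM_GB = 16.0


def headroom_gb(quant, vram_gb=CARD_VRAM_GB):
    """Nominal VRAM left after loading the weights alone (negative = does not fit)."""
    return round(vram_gb - QUANT_SIZES_GB[quant], 2)


for quant in QUANT_SIZES_GB:
    spare = headroom_gb(quant)
    verdict = "fits" if spare > 0 else "does not fit"
    print(f"{quant:<7} {spare:>7.2f} GB spare  ({verdict})")
```

On these numbers Q4_K_M leaves 7 GB while Q8_0 leaves 0.3 GB, which is why the table tells you to cap context tightly at Q8_0.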

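Then via the llama.cpp Hugging Face shortcut (per the unsloth model card). The tag used below, UD-Q4_K_XL, is Unsloth's dynamic 4-bit build rather than one of the tiers in the table; to fetch a table tier instead, swap the suffix after the colon (for example :Q4_K_M):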

# OpenAI-compatible local server with web UI
llama-server -hf unsloth/Qwen3-14B-GGUF:UD-Q4_K_XL

# Interactive terminal
llama-cli -hf unsloth/Qwen3-14B-GGUF:UD-Q4_K_XL

Option C — LM Studio (GUI)

LM Studio offers a one-click install path — the Qwen3 family is in its supported-runtime list on the Qwen3-14B HF card. Search "Qwen3-14B GGUF" inside the app and pick the Q4_K_M tier, or use the direct-import link lmstudio://open_from_hf?model=unsloth/Qwen3-14B-GGUF.

Running

One-shot prompt via Ollama

ollama run qwen3:14b "Explain GQA attention in three sentences."

First run loads the model into VRAM (~9 GB resident at idle, growing as the KV cache fills with longer contexts). Subsequent prompts in the same session stay warm.

Disable thinking mode for short answers

ollama run qwen3:14b "/no_think What's the capital of France?"

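Per the Qwen3-14B HF card, /no_think is a per-turn soft switch: it turns reasoning off for that request while enable_thinking stays on in the chat template, so the reply may still open with an empty <think></think> block. To remove the block entirely, pass enable_thinking=False in the chat template.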

OpenAI-compatible HTTP API

# Ollama exposes localhost:11434 by default
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3:14b",
    "messages": [{"role": "user", "content": "Write a haiku about Blackwell GPUs."}]
  }'
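The same request can be sent from Python using only the standard library. This is a minimal sketch, assuming Ollama is listening on its default localhost:11434 as in the curl example; `build_payload`, `extract_reply`, and `chat` are illustrative helper names, and the response is assumed to follow the usual OpenAI chat-completions shape (`choices[0].message.content`).

```python
import json
import urllib.request

# Assumption: Ollama is running locally on its default port, as in the curl example.
OLLAMA_URL = "http://localhost:11434/v1/chat/completions"


def build_payload(prompt, model="qwen3:14b"):
    """Build the same JSON body the curl example sends."""
    return {"model": model, "messages": [{"role": "user", "content": prompt}]}


def extract_reply(response):
    """Pull the assistant text out of an OpenAI-style chat completion response."""
    return response["choices"][0]["message"]["content"]


def chat(prompt, url=OLLAMA_URL, timeout=120.0):
    """POST the prompt to the local server and return the assistant's reply text."""
    body = json.dumps(build_payload(prompt)).encode("utf-8")
    request = urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return extract_reply(json.load(resp))


# Example (needs a running Ollama server with qwen3:14b pulled):
# print(chat("Write a haiku about Blackwell GPUs."))
```

The first call after a cold start includes model load time, so the generous timeout is deliberate.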

For higher-throughput, production-style serving, the upstream Qwen3-14B card documents vllm serve Qwen/Qwen3-14B --enable-reasoning --reasoning-parser deepseek_r1 and python -m sglang.launch_server --model-path Qwen/Qwen3-14B --reasoning-parser qwen3. Both load BF16 weights (29.54 GB), which do not fit a 16 GB card. For local serving on this GPU, prefer Ollama or llama.cpp with the Q4_K_M GGUF; FP8/vLLM serving needs a larger-VRAM card.

Spending the headroom — what to do with the spare ~7 GB

The Q4_K_M weights occupy ~9 GB at idle on the RTX 5080's 16 GB envelope, leaving roughly 7 GB of spare VRAM. Unlike the 32B-class siblings (where the weights alone eat the card), 14B at Q4_K_M leaves genuine room. Three concrete options, all citable to the model card or the runtime:

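  • Quant up, not down. Q5_K_M (10.51 GB) and Q6_K (12.12 GB) both fit comfortably; Q6_K is "near perfect" per bartowski. Q8_0 (15.70 GB) leaves only about 0.3 GB of the 16 GB envelope, so treat it as an edge case: cap the context (start at --ctx-size 8192 and lower it if the load fails) or accept offloading a few layers to the CPU. On a 5080 there's no quality reason to pick anything below Q4_K_M.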
  • Longer context — extend toward 131K with YaRN. Qwen3's native window is 32K per the HF card, which the card lists as extensible to 131,072 tokens with YaRN. At Q4_K_M the ~7 GB headroom accommodates a sizeable KV cache; for longer documents, enable YaRN extension via --rope-scaling yarn --rope-scale 4 --yarn-orig-ctx 32768 per the unsloth GGUF instructions.
  • Colocate a small helper model. A 0.6–1.7B sidecar (Qwen3-0.6B, Qwen3-1.7B) at Q4 takes ~0.5–1.5 GB. Running it alongside qwen3:14b (Ollama keeps both loaded until they age out) lets you build a draft-model / RAG / multi-stage pipeline on a single card. See the Ollama qwen3 tags for the smaller-variant tags.
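To judge how far the headroom stretches for longer context, a rough KV-cache estimate helps. The sketch below uses the layer and KV-head counts quoted in the Results section (40 layers, 8 KV heads); the head dimension of 128 and the 16-bit cache are assumptions not stated in this recipe, so check them against the model config before relying on the numbers.

```python
GIB = 1024 ** 3


def kv_cache_bytes(ctx_tokens, layers=40, kv_heads=8, head_dim=128, bytes_per_value=2):
    """Estimate KV-cache size for a dense GQA model.

    head_dim=128 and bytes_per_value=2 (FP16 cache) are assumptions;
    the leading factor of 2 covers keys plus values.
    """
    return 2 * layers * kv_heads * head_dim * bytes_per_value * ctx_tokens


for ctx in (4096, 32768, 131072):
    print(f"{ctx:>7} tokens: {kv_cache_bytes(ctx) / GIB:.3f} GiB")
```

Under those assumptions the cache costs 160 KiB per token: roughly 0.6 GiB at 4k, 5 GiB at the native 32k window, and 20 GiB at the full 131,072-token YaRN window. The native window fits inside the spare ~7 GB at Q4_K_M; the full YaRN window does not fit on this card with a 16-bit cache.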

Results

  • Speed: 80.6 tokens/s generation at 4k context, Q4_K quantization, measured on RTX 5080 — per the hardware-corner.net LLM benchmark table, surfaced via /check/qwen3-14b/rtx-5080 (backend benchmark id=80). Generation slows as the KV cache grows: 64.0 tok/s at 16k and 51.9 tok/s at 32k on the same page. Prompt processing (prefill) is much faster — a distinct metric: 3,820.5 tok/s at 4k context, 2,542.0 at 16k, 1,326.1 at 32k per the same Hardware Corner row (backend benchmark id=79). Prefill measures how fast the model ingests your prompt; token generation measures how fast it writes the reply — the two are reported separately because they stress different parts of the pipeline.
  • VRAM usage: The cited backend benchmark records a 16.0 GB peak at the 4k-context configuration on this card — see /check/qwen3-14b/rtx-5080 (id=79/id=80). At idle the Q4_K_M weights occupy ~9 GB; the remainder of the envelope is KV cache and activations, which grow with context. Q4_K_M leaves headroom; Q8_0 (15.70 GB) and above fill the card.
  • Quality notes: Q4_K_M is the community-default "sweet spot." The 14.8B-parameter dense model (13.2B non-embedding, 40 layers, GQA 40 query / 8 KV heads per the HF card) is a meaningful quality step up from Qwen3-8B at the cost of roughly 1.6× the generation latency on the same card. If you need full 131K context, prefer a higher-VRAM card or chunk + retrieve.
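The split between prefill and generation speed can be made concrete with a back-of-the-envelope latency estimate. This sketch plugs in the tok/s figures quoted in the bullets above and treats each rate as constant over the request, which is a simplification; `estimate_seconds` is an illustrative helper, not part of any benchmark tool.

```python
# tok/s figures quoted in the Results bullets (Q4_K on the RTX 5080), keyed by context size.
PREFILL_TPS = {4096: 3820.5, 16384: 2542.0, 32768: 1326.1}
GENERATION_TPS = {4096: 80.6, 16384: 64.0, 32768: 51.9}


def estimate_seconds(context, prompt_tokens, output_tokens):
    """Return (prefill_seconds, generation_seconds) at one benchmarked context size."""
    prefill = prompt_tokens / PREFILL_TPS[context]
    generation = output_tokens / GENERATION_TPS[context]
    return prefill, generation


for context in sorted(PREFILL_TPS):
    prefill, generation = estimate_seconds(context, prompt_tokens=context, output_tokens=500)
    print(f"{context:>6} ctx: prefill {prefill:6.1f} s, 500-token reply {generation:6.1f} s")
```

At 4k context, ingesting a full 4,096-token prompt takes about 1.1 s while a 500-token reply takes about 6.2 s, so generation dominates wall-clock time for typical chat turns.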

For the full benchmark data and other-GPU comparisons, see /check/qwen3-14b/rtx-5080.

Troubleshooting

Ollama returns Error: model requires more system memory or hangs on load

Confirm a recent NVIDIA driver and CUDA 12.8+ runtime are installed (nvidia-smi should show a recent Blackwell-capable driver). The RTX 5080 uses the Blackwell architecture (sm_120), which requires CUDA 12.8 or newer; older CUDA builds do not ship sm_120 kernels and fail with a "no kernel image is available for execution on the device" error. Ollama's bundled CUDA runtime handles this for recent builds, but if you compile llama.cpp from source, configure it with cmake -B build -DGGML_CUDA=ON against CUDA 12.8+ explicitly (the older LLAMA_CUDA=1 Makefile flag is deprecated in current llama.cpp). Watch nvidia-smi -l 1 in another terminal to confirm the GPU is actually being used; if it stays at 0% utilization during generation, the runtime has most likely fallen back to CPU, typically because the driver or CUDA runtime is too old.
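The driver check can be scripted. The sketch below asks nvidia-smi for the compute capability, which should read 12.0 on an sm_120 card, and parses the CSV line. It assumes a driver recent enough to support the compute_cap query field; `parse_gpu_line`, `is_sm_120`, and `query_gpus` are illustrative names.

```python
import subprocess

# Assumption: the installed driver supports the compute_cap query field.
QUERY = [
    "nvidia-smi",
    "--query-gpu=name,driver_version,compute_cap",
    "--format=csv,noheader",
]


def parse_gpu_line(line):
    """Parse one 'name, driver_version, compute_cap' CSV line from nvidia-smi."""
    name, driver, cap = [field.strip() for field in line.split(",")]
    major, minor = cap.split(".")
    return {"name": name, "driver": driver, "compute_cap": (int(major), int(minor))}


def is_sm_120(gpu):
    """True when the GPU reports compute capability 12.0 (sm_120)."""
    return gpu["compute_cap"] == (12, 0)


def query_gpus():
    """Run nvidia-smi and return one parsed dict per GPU."""
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
    return [parse_gpu_line(line) for line in out.splitlines() if line.strip()]


# Example (needs an NVIDIA driver installed):
# for gpu in query_gpus():
#     print(gpu["name"], gpu["driver"], "sm_120" if is_sm_120(gpu) else "not sm_120")
```

If the command itself fails, the driver is missing or too old to answer the query, which points at the same root cause.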

Using transformers directly — FA2 sm_120 wheel gap

If you bypass Ollama / llama.cpp and run the HF card quickstart via transformers directly, mind the FlashAttention-2 sm_120 gap: as of mid-2026, prebuilt FA2 wheels still do not ship sm_120 kernels for Blackwell consumer cards (tracked at Dao-AILab/flash-attention#2168). The Qwen3 quickstart uses torch_dtype="auto" and device_map="auto" without hardcoding attn_implementation="flash_attention_2", so it works out of the box with attn_implementation="eager" or "sdpa". If any third-party snippet you copy hardcodes attn_implementation="flash_attention_2", override it to "sdpa" until FA2 lands sm_120 wheels. Also install the cu128 PyTorch wheel (pip install torch --index-url https://download.pytorch.org/whl/cu128) — the default cu126 wheel does not include sm_120 kernels. Blackwell's tensor cores natively support FP8, so there is no FP8-on-Ampere dequantization penalty on this card.

<think>...</think> output is bloating responses

Qwen3 enables thinking mode by default per the HF card quickstart. Send /no_think at the start of any user message to disable it for that turn, or pass enable_thinking=False if you're calling the chat-template API directly. Reasoning traces also consume KV cache — a long <think> block at high context can push memory usage up, so disable it when you don't need step-by-step reasoning.
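When you want to keep thinking mode on but hide the trace from end users, the reasoning block can be separated from the answer after the fact. A minimal sketch, assuming the raw output contains at most one `<think>...</think>` block as described above; `split_thinking` is an illustrative helper name.

```python
import re

# Non-greedy match across newlines for a single <think>...</think> block.
_THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_thinking(text):
    """Return (reasoning, answer) from raw model output.

    If no <think> block is present, reasoning is an empty string.
    """
    match = _THINK_RE.search(text)
    if match is None:
        return "", text.strip()
    reasoning = match.group(1).strip()
    answer = (text[:match.start()] + text[match.end():]).strip()
    return reasoning, answer


reasoning, answer = split_thinking("<think>\n2 + 2 is 4\n</think>\n\nThe answer is 4.")
print(answer)
```

This only trims what is displayed; the reasoning tokens were still generated and still occupied KV cache, so it does not replace disabling thinking when latency or memory is the concern.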

I want the larger 32B sibling

Qwen3-32B at Q4_K_M is ~20 GB on disk and does not fit a 16 GB card without aggressive CPU offloading (which tanks throughput); same for the 30B and 235B MoE variants, whose total expert weights must be resident. Stay on 14B for this card, or request a 32B recipe via /contribute.

Generation slows past 32k context

32k is Qwen3's native context window per the HF card, which the card lists as extensible to 131,072 tokens with YaRN. Past 32k the model needs that YaRN extension, supported in llama.cpp via --rope-scaling yarn --rope-scale 4 --yarn-orig-ctx 32768 per the unsloth GGUF instructions, but quality degrades and the KV cache balloons. The hardware-corner.net benchmark shows generation falling to 51.9 tok/s at 32k (vs 80.6 at 4k) on this card; for long-doc workflows prefer chunking + retrieval over pushing context past 32k.