
Qwen3-14B on RTX 4080: Q4_K_M GGUF via Ollama or llama.cpp

llm · beginner · 16GB+ VRAM · May 29, 2026
prerequisites
  • NVIDIA RTX 4080 (16 GB VRAM) or equivalent 16 GB CUDA card
  • Recent NVIDIA driver with CUDA 12.x support (Ada sm_89 — no special wheel selection required)
  • ~9.3 GB free disk for the Q4_K_M GGUF (or ~12 GB for Q6_K)
  • Ollama, llama.cpp, or LM Studio installed

What You'll Build

A local Qwen3-14B chat / reasoning assistant running on a 16 GB RTX 4080, served through Ollama (or llama.cpp / LM Studio; same GGUF, three loaders). The recipe pins the dense 14.8B variant at Q4_K_M quantization (9.00 GB on disk for the unsloth build, 9.3 GB for the Ollama tag). That leaves roughly 6–7 GB of the 16 GB for the KV cache, which has to hold both the prompt context (Qwen3's native window is 32k) and the thinking-mode chain of thought.

Hardware data: RTX 4080 (16 GB VRAM) · Q4_K GGUF · 62 tokens/s generation at 4k context · See benchmark data

⚠️ Variant pinned — Qwen3 ships 8 sizes from the same Qwen org. Per the Ollama qwen3:14b tag list, Qwen3 spans 0.6b, 1.7b, 4b, 8b, 14b (this recipe), 30b (MoE), 32b, and 235b (MoE). The siblings have wildly different VRAM profiles, and on a 16 GB card the variant choice is binding: Qwen3-14B in BF16 is ~28 GB and does NOT fit 16 GB per the official Qwen speed benchmark ("28,402 MB" memory footprint at input length 1, growing to 33,336 MB at 30k context). The instructions below are for the dense 14.8B model only at Q4_K_M GGUF — the path that fits 16 GB. For the 30B/235B MoE siblings, all expert params must be resident in VRAM and will not fit this card — see the Qwen3 model card on the dense/MoE split.

ℹ️ Thinking mode is on by default — size your context for it. Qwen3-14B has a built-in chain-of-thought ("thinking") mode that the model card's quickstart enables via enable_thinking=True. Output starts with a <think>...</think> block (often 2k–4k tokens on hard problems) followed by the user-facing answer. That <think> trace grows the KV cache far faster than a plain chat turn, which matters on a 16 GB card — see Troubleshooting. To disable for latency-sensitive use, send /no_think in your prompt or pass enable_thinking=False in the chat template.
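Because the reasoning trace and the answer arrive in one string, client code usually wants to separate them. The following is a minimal sketch, assuming the raw completion contains at most one literal <think>...</think> block as described above; split_thinking is a hypothetical helper, not part of any Qwen or Ollama API.

```python
import re

# Matches the first <think>...</think> block; DOTALL lets "." span newlines.
_THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_thinking(text: str) -> tuple[str, str]:
    """Return (thinking, answer) from a raw Qwen3-style completion.

    Hypothetical helper for illustration. If no <think> block is present,
    thinking is "" and the answer is the whole (stripped) text.
    """
    match = _THINK_RE.search(text)
    if match is None:
        return "", text.strip()
    thinking = match.group(1).strip()
    # Everything outside the matched block is treated as the answer.
    answer = (text[:match.start()] + text[match.end():]).strip()
    return thinking, answer


raw = "<think>\nParis is the capital.\n</think>\n\nThe capital of France is Paris."
thinking, answer = split_thinking(raw)
print(answer)  # prints: The capital of France is Paris.
```

Keeping the trace separate also makes it easy to log how many characters each turn spends on reasoning versus the visible answer.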

Requirements

Component | Minimum | Tested
GPU | 16 GB VRAM (Q4_K_M weights + KV headroom) | RTX 4080 (16 GB)
RAM | 16 GB system |
Storage | 9.3 GB (Q4_K_M GGUF) or 12.1 GB (Q6_K) | per unsloth/Qwen3-14B-GGUF
Driver | CUDA 12.x (Ada sm_89) |
Runtime | Ollama 0.5+ / llama.cpp / LM Studio |

The model is released under Apache 2.0 — commercial use is permitted.

Installation

The fastest path is Ollama — one command pulls the canonical Q4_K_M build maintained by the Qwen team:

Option A — Ollama (recommended)

1. Install Ollama

# macOS / Linux
curl -fsSL https://ollama.com/install.sh | sh

(Windows: download from ollama.com/download.) Per the Qwen3 model card, "applications such as Ollama, LMStudio, MLX-LM, llama.cpp, and KTransformers have also supported Qwen3."

2. Pull the 14B model

ollama pull qwen3:14b

This fetches a 9.3 GB Q4_K_M checkpoint per the Ollama qwen3:14b tag (14.8B parameters, Q4_K_M quantization). The download is one file — no manual quant-tier selection needed.

Option B — llama.cpp + community GGUF

If you want a higher-quality quant (Q5_K_M, Q6_K) or the imatrix-tuned Unsloth Dynamic 2.0 ladder, use a community redistributor that publishes the full per-tier table. The unsloth/Qwen3-14B-GGUF repo lists Qwen/Qwen3-14B explicitly as its base_model with link-back to the upstream model card.

1. Install llama.cpp

# macOS (Homebrew)
brew install llama.cpp

# Linux: pre-built CUDA binaries
# Visit https://github.com/ggml-org/llama.cpp/releases for cu12x binaries

2. Pull the quant you want

Per-tier file sizes from the unsloth/Qwen3-14B-GGUF Files tab:

Quant | File size | Fits 16 GB with KV headroom?
Q4_K_M | 9.00 GB | yes (recommended for this card)
UD-Q4_K_XL | 9.16 GB | yes (Unsloth Dynamic 2.0 imatrix-tuned)
Q5_K_M | 10.51 GB | yes (better quality, comfortable)
Q6_K | 12.12 GB | yes, but watch KV at long context
Q8_0 | 15.70 GB | no; weights alone leave ~0.3 GB, OOM with any KV cache
BF16 | 29.54 GB | no; does NOT fit 16 GB

On a 16 GB 4080, Q4_K_M / Q5_K_M / Q6_K all leave room for the KV cache; Q8_0 (15.70 GB) does not — its weights nearly fill the card before any context loads. This is the key difference from a 24 GB card, where Q8_0 is comfortable.
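The headroom arithmetic behind that table can be sketched in a few lines. This is a rough illustration only: it subtracts file size from nominal VRAM, ignores runtime overhead and the GB/GiB distinction, and the 2 GB minimum headroom is an assumed threshold chosen for the example, not a measured requirement.

```python
# File sizes in GB, copied from the per-tier table above.
QUANT_SIZES_GB = {
    "Q4_K_M": 9.00,
    "UD-Q4_K_XL": 9.16,
    "Q5_K_M": 10.51,
    "Q6_K": 12.12,
    "Q8_0": 15.70,
    "BF16": 29.54,
}


def headroom_gb(quant: str, vram_gb: float = 16.0) -> float:
    """VRAM left after the weights load (negative means they do not fit)."""
    return round(vram_gb - QUANT_SIZES_GB[quant], 2)


def fits(quant: str, vram_gb: float = 16.0, min_headroom_gb: float = 2.0) -> bool:
    """True if the weights leave at least min_headroom_gb for the KV cache.

    min_headroom_gb is an assumption for illustration; tune it to your context size.
    """
    return headroom_gb(quant, vram_gb) >= min_headroom_gb


for name in QUANT_SIZES_GB:
    print(f"{name:<11} headroom {headroom_gb(name):>6.2f} GB  fits: {fits(name)}")
```

Calling fits("Q8_0", vram_gb=24.0) returns True under the same assumption, which matches the note about 24 GB cards.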

Then via the llama.cpp Hugging Face shortcut (per the unsloth model card):

# OpenAI-compatible local server with web UI
llama-server -hf unsloth/Qwen3-14B-GGUF:UD-Q4_K_XL --ctx-size 8192 --flash-attn

# Interactive terminal
llama-cli -hf unsloth/Qwen3-14B-GGUF:UD-Q4_K_XL --ctx-size 8192 --flash-attn

Option C — LM Studio (GUI)

LM Studio offers a one-click install path per the Qwen3-14B HF card. Search "Qwen3-14B GGUF" inside the app and pick the Q4_K_M tier (Q5_K_M or Q6_K if you want higher fidelity and still have room — but skip Q8_0 on this 16 GB card).

Running

One-shot prompt via Ollama

ollama run qwen3:14b "Explain the difference between MoE and dense transformer architectures in three sentences."

First run loads the model into VRAM (~9 GB resident at idle for Q4_K_M, growing as the KV cache fills with longer contexts). Subsequent prompts in the same session stay warm.

Disable thinking mode for short answers

ollama run qwen3:14b "/no_think What's the capital of France?"

Per the Qwen3-14B HF card, this flips enable_thinking=False for the request, skipping the <think>...</think> chain-of-thought prefix.

OpenAI-compatible HTTP API

# Ollama exposes localhost:11434 by default
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3:14b",
    "messages": [{"role": "user", "content": "Write a haiku about Ada Lovelace GPUs."}]
  }'
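The same request can be issued from Python with only the standard library. This is a sketch, assuming Ollama is listening on the default port shown above and that the response follows the usual OpenAI chat-completions shape (choices[0].message.content); build_chat_request and chat are hypothetical helper names.

```python
import json
import urllib.request

# Default Ollama OpenAI-compatible endpoint, as in the curl example.
OLLAMA_CHAT_URL = "http://localhost:11434/v1/chat/completions"


def build_chat_request(prompt: str, model: str = "qwen3:14b",
                       url: str = OLLAMA_CHAT_URL) -> urllib.request.Request:
    """Build (but do not send) a POST request for a single user message."""
    payload = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt: str, timeout: float = 300.0) -> str:
    """Send the prompt and return the assistant text.

    Assumes an OpenAI-style response body; needs a running Ollama server.
    """
    with urllib.request.urlopen(build_chat_request(prompt), timeout=timeout) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

With the server running, chat("Write a haiku about Ada Lovelace GPUs.") returns the reply as a string. The generous timeout is deliberate, since a thinking-mode turn can run for a while before the answer completes.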

The upstream Qwen3-14B card also documents vllm serve Qwen/Qwen3-14B --enable-reasoning --reasoning-parser deepseek_r1 and python -m sglang.launch_server --model-path Qwen/Qwen3-14B --reasoning-parser qwen3 — but both default to BF16 weights (~28 GB per the official speed benchmark), which overflows the 4080's 16 GB. Even FP8 (16,012 MB at length 1 per the same benchmark) sits right at the 16 GB ceiling and overflows once context grows. On the 4080, the Ollama / llama.cpp GGUF path is the comfortable one; see Troubleshooting for the vLLM/SGLang OOM walkthrough.

Results

  • Speed: 62 tokens/s generation at 4k context, Q4_K quantization, measured on the RTX 4080 — per the hardware-corner.net RTX 4080 LLM benchmark table row labelled "Qwen3 14B (Q4_K)", surfaced via /check/qwen3-14b/rtx-4080. Generation rate decays to 51.2 tok/s at 16k and 40.6 tok/s at 32k as the KV cache grows. Prompt processing on the same row is much faster — 3,574.7 tok/s at 4k context, dropping to 2,295.1 tok/s at 16k and 1,395.7 tok/s at 32k. Note these are chat-class throughput figures; for thinking-mode workloads where most of the output is a discarded <think> block, effective throughput is lower because the model emits far more tokens per useful answer.
  • VRAM usage: Both /check/qwen3-14b/rtx-4080 benchmarks report a 16 GB peak on this card at Q4_K. Q4_K_M weights occupy ~9 GB of the 16 GB card at idle; the rest is KV cache headroom the runtime expands with context. The official Qwen speed benchmark on H20 corroborates the precision/VRAM ladder for Qwen3-14B in Transformers: AWQ-INT4 = 9,962 MB at length 1 / 15,323 MB at 30k context, FP8 = 16,012 MB / 20,813 MB, BF16 = 28,402 MB / 33,336 MB — on a 16 GB card only the int4 / Q4_K_M / Q5_K_M / Q6_K GGUF paths fit with KV headroom; FP8 and BF16 overflow.
  • Quality notes: Q4_K_M is the community-default "sweet spot." On the 16 GB 4080 you can upgrade to Q5_K_M (10.51 GB) or Q6_K (12.12 GB) for higher fidelity and still leave room for a moderate KV cache, but Q8_0 (15.70 GB) is too large here — it nearly fills the card before any context loads. There's no quality reason to pick anything below Q4_K_M on this card.
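To make the thinking-mode caveat concrete, the sketch below computes "useful" tokens per second when part of the output is a discarded <think> block. The 62 tok/s default is the 4k-context figure quoted above; the token counts in the example (300 answer tokens, 3,000 thinking tokens) are hypothetical values picked from the 2k–4k range mentioned earlier.

```python
def effective_tokens_per_second(answer_tokens: int, think_tokens: int,
                                gen_tps: float = 62.0) -> float:
    """Answer tokens delivered per second of wall-clock generation time.

    Assumes a constant generation rate; in practice the rate also decays
    as the context grows.
    """
    total_tokens = answer_tokens + think_tokens
    seconds = total_tokens / gen_tps
    return answer_tokens / seconds


# Plain chat turn: every generated token is part of the answer.
print(round(effective_tokens_per_second(300, 0), 1))     # 62.0
# Thinking turn: 3,000 discarded tokens precede a 300-token answer.
print(round(effective_tokens_per_second(300, 3000), 1))  # 5.6
```

Under these assumed counts the user-visible rate drops by roughly a factor of eleven, which is why /no_think matters for latency-sensitive turns.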

For the full benchmark data and other-GPU comparisons, see /check/qwen3-14b/rtx-4080.

Troubleshooting

Out of memory mid-generation on a hard problem

Qwen3-14B's thinking mode emits a <think>...</think> chain-of-thought that routinely runs 2k–4k tokens (longer on hard math / coding), and the KV cache grows linearly with every token. On a 16 GB card that <think> trace is the most common OOM trigger: the weights fit fine, but the cache balloons during a long reasoning turn. Mitigations, in order:

1. Cap the context with --ctx-size 8192 on the first run (shown in the Installation commands above).
2. Quantize the KV cache with --cache-type-k q8_0 --cache-type-v q8_0 --flash-attn in llama.cpp to roughly halve its memory.
3. Drop a quant tier on the weights (Q4_K_M → Q4_K_S leaves more room for KV).
4. Send /no_think to skip the chain-of-thought entirely for turns that don't need it.

Watch nvidia-smi -l 1 during a hard problem to calibrate the actual peak.
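A back-of-the-envelope estimate shows why the cache grows linearly and why q8_0 roughly halves it. The architecture numbers used here (40 layers, 8 KV heads, head dimension 128) are assumptions taken from the published Qwen3-14B configuration rather than from this recipe, and the q8_0 cost of 34 bytes per 32 elements is the ggml block layout as I understand it; treat the results as estimates that exclude compute buffers.

```python
GIB = 1024 ** 3


def kv_cache_bytes(n_tokens: int, n_layers: int = 40, n_kv_heads: int = 8,
                   head_dim: int = 128, bytes_per_elem: float = 2.0) -> int:
    """Estimated KV cache size for n_tokens of context.

    Defaults are ASSUMED Qwen3-14B values (not stated in this recipe);
    bytes_per_elem=2.0 models an f16 cache.
    """
    # Factor 2: each layer stores one key tensor and one value tensor.
    return int(2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * n_tokens)


# q8_0 stores 32 int8 values plus one f16 scale per block (assumed layout).
Q8_0_BYTES_PER_ELEM = 34 / 32

for ctx in (8192, 32768):
    f16 = kv_cache_bytes(ctx) / GIB
    q8 = kv_cache_bytes(ctx, bytes_per_elem=Q8_0_BYTES_PER_ELEM) / GIB
    print(f"{ctx:>6} tokens: f16 ~{f16:.2f} GiB, q8_0 ~{q8:.2f} GiB")
```

Under these assumptions an f16 cache costs about 1.25 GiB at 8k tokens and 5 GiB at 32k, so a 9 GB Q4_K_M model plus a full 32k f16 cache already approaches the card's limit once runtime buffers are added.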

Ollama returns Error: model requires more system memory or hangs on load

The error text refers to system RAM rather than VRAM, so first check that the machine meets the 16 GB system-RAM minimum listed under Requirements and has enough of it free. Then confirm a recent NVIDIA driver and CUDA 12.x runtime are installed (nvidia-smi prints the driver and CUDA versions in its header; the driver should be no more than about 12 months old). The RTX 4080 uses the Ada Lovelace architecture (sm_89), which has been fully supported by mainline CUDA wheels since 2023, so no special build flags or wheel pinning are required. If Ollama still appears to hang on first load, watch nvidia-smi -l 1 in another terminal to confirm the GPU is actually being used; if it stays at 0% utilization, reinstall Ollama and re-pull the model.
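If you would rather log GPU memory and utilization than watch a terminal, nvidia-smi's CSV query mode is easy to parse. This is a sketch, assuming nvidia-smi is on PATH; parse_gpu_line and read_gpu are hypothetical helper names, and the sample values in the usage note are made up.

```python
import subprocess

# CSV query mode: one line per GPU, values only (MiB, MiB, percent).
QUERY = [
    "nvidia-smi",
    "--query-gpu=memory.used,memory.total,utilization.gpu",
    "--format=csv,noheader,nounits",
]


def parse_gpu_line(line: str) -> dict:
    """Parse one 'used, total, util' CSV line into integers."""
    used, total, util = (int(field.strip()) for field in line.split(","))
    return {"used_mib": used, "total_mib": total, "util_pct": util}


def read_gpu(index: int = 0) -> dict:
    """Run nvidia-smi once and return stats for the GPU at the given index."""
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
    return parse_gpu_line(out.strip().splitlines()[index])
```

For example, parse_gpu_line("9216, 16376, 37") yields used_mib=9216, total_mib=16376, util_pct=37. Calling read_gpu() in a loop with a one-second sleep approximates nvidia-smi -l 1 while letting you record the peak.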

<think>...</think> output is bloating responses

Qwen3 enables thinking mode by default per the HF card quickstart. Send /no_think at the start of any user message to disable it for that turn, or pass enable_thinking=False if you're calling the chat-template API directly. Per the model card best-practices note: for thinking mode use Temperature=0.6, TopP=0.95, TopK=20, MinP=0 and do not use greedy decoding — it triggers endless repetitions.
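One way to apply those sampling values is through the options field of Ollama's native /api/chat endpoint. The sketch below only builds the request body; it assumes that endpoint accepts temperature, top_p, top_k, and min_p under options, and build_thinking_payload is a hypothetical helper name.

```python
# Thinking-mode sampling values quoted from the model card best-practices note.
THINKING_SAMPLING = {"temperature": 0.6, "top_p": 0.95, "top_k": 20, "min_p": 0.0}


def build_thinking_payload(prompt: str, model: str = "qwen3:14b") -> dict:
    """Request body for Ollama's native /api/chat with thinking-mode sampling.

    Assumes the 'options' keys below are honoured by your Ollama version.
    """
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "options": dict(THINKING_SAMPLING),  # copy so callers can tweak safely
        "stream": False,
    }


payload = build_thinking_payload("Prove that the square root of 2 is irrational.")
print(payload["options"])
```

Because temperature is non-zero, this avoids the greedy-decoding repetition problem the model card warns about.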

vLLM / SGLang server crashes with CUDA OOM at startup

vLLM and SGLang default to BF16 weights for Qwen/Qwen3-14B, which require ~28 GB resident per the official speed benchmark and far exceed the 4080's 16 GB. Unlike a 24 GB card, FP8 does not rescue you here either — it starts at 16,012 MB (length 1) and climbs to 20,813 MB at 30k context per the same benchmark, overflowing 16 GB once any real context loads. The fitting options on this card are (a) AWQ-INT4 weights (~10 GB resident, growing to ~15 GB at 30k), or (b) the Ollama / llama.cpp Q4_K_M GGUF path this recipe is built around. Reserve vLLM/SGLang for a larger card.

Generation slows dramatically past 32k context

32k (32,768 tokens) is Qwen3-14B's native context window per the HF card, which lists it as extensible to 131,072 tokens with YaRN. Going past 32k requires that YaRN extension, supported in llama.cpp via --rope-scaling yarn --rope-scale 4 --yarn-orig-ctx 32768 per the Qwen3 model card, but quality degrades and the KV cache balloons (a real concern on 16 GB). For long-doc workflows, prefer chunking + retrieval over pushing context past 32k. Even within the native window, the hardware-corner.net RTX 4080 benchmark shows the generation rate falling from 62 tok/s at 4k to 40.6 tok/s at 32k on this card.
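The YaRN flags follow from simple arithmetic: the scale factor is the target context divided by the native 32,768. The helper below is a sketch that assembles the llama.cpp arguments quoted above for a chosen target; yarn_args is a hypothetical name, and using target/native as the scale is an assumption consistent with the factor of 4 for 131,072 tokens.

```python
def yarn_args(target_ctx: int, native_ctx: int = 32768,
              max_ctx: int = 131072) -> list[str]:
    """llama.cpp arguments for a target context, adding YaRN only when needed."""
    if target_ctx > max_ctx:
        raise ValueError(f"{target_ctx} exceeds the documented maximum of {max_ctx}")
    args = ["--ctx-size", str(target_ctx)]
    if target_ctx <= native_ctx:
        return args  # within the native window: no rope scaling required
    scale = target_ctx / native_ctx  # assumed rule: 131072 / 32768 = 4
    return args + [
        "--rope-scaling", "yarn",
        "--rope-scale", f"{scale:g}",
        "--yarn-orig-ctx", str(native_ctx),
    ]


print(" ".join(yarn_args(131072)))
# --ctx-size 131072 --rope-scaling yarn --rope-scale 4 --yarn-orig-ctx 32768
```

Returning plain --ctx-size for targets at or below 32k keeps the unscaled path as the default, in line with the advice to avoid YaRN on a 16 GB card unless it is really needed.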

Using transformers directly instead of Ollama

If you bypass Ollama / llama.cpp and run the HF card quickstart via transformers directly with torch_dtype="auto", device_map="auto", you will load BF16 weights and hit OOM on a 16 GB 4080 (28,402 MB at length 1 per the Qwen benchmark). The quickstart does not hardcode attn_implementation="flash_attention_2", so if you load a precision that fits (an AWQ-INT4 mirror, or the same code on a larger card), it works out of the box with a stock pip install torch. FlashAttention-2 is optional: Ada sm_89 has full FA2 kernel coverage if you opt into it separately. Unlike Blackwell-class cards, no cu128-specific wheel selection is required for the 4080.