
Qwen3-14B on RTX 4080 SUPER: Q4_K_M GGUF via Ollama or llama.cpp

llm · beginner · 16GB+ VRAM · Jun 2, 2026
prerequisites
  • NVIDIA RTX 4080 SUPER (16 GB VRAM) or equivalent 16 GB CUDA card
  • Recent NVIDIA driver with CUDA 12.x support (Ada sm_89 — no special wheel selection required)
  • ~9.3 GB free disk for the Q4_K_M GGUF (or ~12 GB for Q6_K)
  • Ollama, llama.cpp, or LM Studio installed

What You'll Build

A local Qwen3-14B chat / reasoning assistant running on a 16 GB RTX 4080 SUPER, served through Ollama (or llama.cpp / LM Studio; same GGUF format, three loaders). The recipe pins the dense 14.8B variant at Q4_K_M quantization (9.0 GB on disk as the unsloth GGUF, 9.3 GB as the Ollama build), which leaves roughly 6–7 GB of headroom on the 16 GB 4080 SUPER for the KV cache. That headroom has to cover both Qwen3's 32k-native context window and the thinking-mode chain of thought.

Hardware data: RTX 4080 SUPER (16 GB VRAM) · Q4_K GGUF · 64.2 tokens/s generation at 4k context · See benchmark data

⚠️ Variant pinned — Qwen3 ships 8 sizes from the same Qwen org. Per the Ollama qwen3:14b tag list, Qwen3 spans 0.6b, 1.7b, 4b, 8b, 14b (this recipe), 30b (MoE), 32b, and 235b (MoE). The siblings have wildly different VRAM profiles, and on a 16 GB card the variant choice is binding: Qwen3-14B in BF16 is ~28 GB and does NOT fit 16 GB per the official Qwen speed benchmark ("28,402 MB" memory footprint at input length 1, growing to 33,336 MB at ~30k context). The instructions below are for the dense 14.8B model only at Q4_K_M GGUF — the path that fits 16 GB. For the 30B/235B MoE siblings, all expert params must be resident in VRAM and will not fit this card — see the Qwen3 model card on the dense/MoE split.
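The BF16 figure follows from simple arithmetic on the parameter count. The sketch below is a rough back-of-envelope check, assuming 14.8B parameters (the count quoted on the Ollama tag) and ignoring runtime overhead; the helper names `weight_gb` and `effective_bits` are illustrative, not part of any tool.

```python
# Back-of-envelope weight size: parameters x bits per weight / 8.
# Assumes 14.8e9 parameters; real files add metadata and mixed-precision tensors.
PARAMS = 14.8e9

def weight_gb(bits_per_weight: float, params: float = PARAMS) -> float:
    """Approximate weight size in decimal GB for a given average bit width."""
    return params * bits_per_weight / 8 / 1e9

def effective_bits(file_gb: float, params: float = PARAMS) -> float:
    """Average bits per weight implied by a file size in decimal GB."""
    return file_gb * 1e9 * 8 / params

bf16 = weight_gb(16)            # BF16 stores 2 bytes per parameter
q4_bits = effective_bits(9.00)  # 9.00 GB is the Q4_K_M size cited in this recipe
print(f"BF16 weights: ~{bf16:.1f} GB")
print(f"Q4_K_M averages ~{q4_bits:.2f} bits per weight")
```

Under these assumptions BF16 weights alone come to roughly 29.6 GB, close to twice the card's VRAM, while Q4_K_M works out to a little under 5 bits per weight.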

ℹ️ Thinking mode is on by default — size your context for it. Qwen3-14B has a built-in chain-of-thought ("thinking") mode that the model card's quickstart enables via enable_thinking=True. Output starts with a <think>...</think> block (often 2k–4k tokens on hard problems) followed by the user-facing answer. That <think> trace grows the KV cache far faster than a plain chat turn, which matters on a 16 GB card — see Troubleshooting. To disable for latency-sensitive use, send /no_think in your prompt or pass enable_thinking=False in the chat template.
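If your client receives the raw text with the <think> block inline, you may want to separate the reasoning from the answer before displaying it. The following is a minimal sketch, assuming the block arrives as literal <think>...</think> tags in the response text (some runtimes may return the reasoning in a separate field instead); `split_thinking` is a hypothetical helper name.

```python
import re

# Matches the first <think>...</think> block, including newlines inside it.
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)

def split_thinking(text: str) -> tuple[str, str]:
    """Return (reasoning, answer); reasoning is "" when no block is present."""
    match = THINK_RE.search(text)
    if match is None:
        return "", text.strip()
    reasoning = match.group(1).strip()
    answer = (text[:match.start()] + text[match.end():]).strip()
    return reasoning, answer

# Made-up response text for illustration only.
raw = "<think>\nParis is the capital.\n</think>\n\nParis."
reasoning, answer = split_thinking(raw)
print(answer)
```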

Requirements

| Component | Minimum | Tested |
|---|---|---|
| GPU | 16 GB VRAM (Q4_K_M weights + KV headroom) | RTX 4080 SUPER (16 GB) |
| RAM | 16 GB system | |
| Storage | 9.0 GB (Q4_K_M GGUF) or 12.1 GB (Q6_K) | per unsloth/Qwen3-14B-GGUF |
| Driver | CUDA 12.x (Ada sm_89) | |
| Runtime | Ollama 0.5+ / llama.cpp / LM Studio | |

The model is released under Apache 2.0 — commercial use is permitted, weights are ungated (free download, no access request).

Installation

The fastest path is Ollama — one command pulls the canonical Q4_K_M build maintained by the Qwen team:

Option A — Ollama (recommended)

1. Install Ollama

# macOS / Linux
curl -fsSL https://ollama.com/install.sh | sh

(Windows: download from ollama.com/download.) Per the Qwen3 model card, "applications such as Ollama, LMStudio, MLX-LM, llama.cpp, and KTransformers have also supported Qwen3."

2. Pull the 14B model

ollama pull qwen3:14b

This fetches a 9.3 GB Q4_K_M checkpoint per the Ollama qwen3:14b tag (14.8B parameters, Q4_K_M quantization). The download is one file — no manual quant-tier selection needed.

Option B — llama.cpp + community GGUF

If you want a higher-quality quant (Q5_K_M, Q6_K) or the imatrix-tuned Unsloth Dynamic 2.0 ladder, use a community redistributor that publishes the full per-tier table. The unsloth/Qwen3-14B-GGUF repo lists Qwen/Qwen3-14B explicitly as its base_model with link-back to the upstream model card.

1. Install llama.cpp

# macOS (Homebrew)
brew install llama.cpp

# Linux: pre-built binaries
# Visit https://github.com/ggml-org/llama.cpp/releases for cu12x binaries

2. Pull the quant you want

Per-tier file sizes from the unsloth/Qwen3-14B-GGUF Files tab:

| Quant | File size | Fits 16 GB with KV headroom? |
|---|---|---|
| Q4_K_M | 9.00 GB | yes; recommended for this card |
| UD-Q4_K_XL | 9.16 GB | yes; Unsloth Dynamic 2.0 imatrix-tuned |
| Q5_K_M | 10.51 GB | yes; better quality, comfortable |
| Q6_K | 12.12 GB | yes, but watch KV at long context |
| Q8_0 | 15.70 GB | no; weights alone leave ~0.3 GB, OOM with any KV cache |
| BF16 | 29.54 GB | no; does NOT fit 16 GB |

On a 16 GB 4080 SUPER, Q4_K_M / Q5_K_M / Q6_K all leave room for the KV cache; Q8_0 (15.70 GB) does not — its weights nearly fill the card before any context loads. This is the key difference from a 24 GB card, where Q8_0 is comfortable.
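The headroom column can be reproduced with a few lines of arithmetic. This sketch uses the file sizes from the table above and treats the card's nominal 16 GB as fully available, which is an optimistic assumption: the CUDA context, compute buffers, and any desktop session take a share too. `headroom_gb` is an illustrative name.

```python
# File sizes (GB) copied from the table above; 16 GB is the card's nominal VRAM.
VRAM_GB = 16.0
QUANT_GB = {
    "Q4_K_M": 9.00,
    "UD-Q4_K_XL": 9.16,
    "Q5_K_M": 10.51,
    "Q6_K": 12.12,
    "Q8_0": 15.70,
    "BF16": 29.54,
}

def headroom_gb(quant: str, vram_gb: float = VRAM_GB) -> float:
    """VRAM left after loading weights; negative means the weights do not fit."""
    return round(vram_gb - QUANT_GB[quant], 2)

for name in QUANT_GB:
    print(f"{name:<11} headroom: {headroom_gb(name):6.2f} GB")
```

Treat the printed values as upper bounds; the usable headroom on a real system is somewhat smaller.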

Then via the llama.cpp Hugging Face shortcut (per the unsloth model card):

# OpenAI-compatible local server with web UI
llama-server -hf unsloth/Qwen3-14B-GGUF:UD-Q4_K_XL --ctx-size 8192 --flash-attn

# Interactive terminal
llama-cli -hf unsloth/Qwen3-14B-GGUF:UD-Q4_K_XL --ctx-size 8192 --flash-attn

Option C — LM Studio (GUI)

LM Studio offers a one-click install path per the Qwen3-14B HF card. Search "Qwen3-14B GGUF" inside the app and pick the Q4_K_M tier (Q5_K_M or Q6_K if you want higher fidelity and still have room — but skip Q8_0 on this 16 GB card).

Running

One-shot prompt via Ollama

ollama run qwen3:14b "Explain the difference between MoE and dense transformer architectures in three sentences."

First run loads the model into VRAM (~9 GB resident at idle for Q4_K_M, growing as the KV cache fills with longer contexts). Subsequent prompts in the same session stay warm.

Disable thinking mode for short answers

ollama run qwen3:14b "/no_think What's the capital of France?"

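Per the Qwen3-14B HF card, /no_think is a soft switch: it tells the model to skip the chain of thought for that turn without changing the enable_thinking setting in the chat template. The card notes that the response can still contain an empty <think></think> block; to remove the block entirely, set enable_thinking=False in the chat template instead.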

OpenAI-compatible HTTP API

# Ollama exposes localhost:11434 by default
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3:14b",
    "messages": [{"role": "user", "content": "Write a haiku about Ada Lovelace GPUs."}]
  }'
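The same request can be issued from Python with only the standard library. This is a minimal sketch assuming Ollama is listening on its default localhost:11434; `build_chat_request` and `chat` are hypothetical helper names, and error handling is omitted.

```python
import json
import urllib.request

# Assumed default: Ollama's OpenAI-compatible endpoint on localhost:11434.
BASE_URL = "http://localhost:11434/v1"

def build_chat_request(prompt: str, model: str = "qwen3:14b",
                       base_url: str = BASE_URL) -> urllib.request.Request:
    """Build (but do not send) a POST request for /chat/completions."""
    body = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def chat(prompt: str, **kwargs) -> str:
    """Send the request and return the assistant message text."""
    with urllib.request.urlopen(build_chat_request(prompt, **kwargs)) as resp:
        reply = json.load(resp)
    return reply["choices"][0]["message"]["content"]
```

With the server running, chat("Write a haiku about Ada Lovelace GPUs.") should return the reply text. Passing base_url="http://localhost:8080/v1" should target llama-server instead, assuming it runs on its default port.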

The upstream Qwen3-14B card also documents vllm serve Qwen/Qwen3-14B --enable-reasoning --reasoning-parser deepseek_r1 and python -m sglang.launch_server --model-path Qwen/Qwen3-14B --reasoning-parser qwen3 — but both default to BF16 weights (~28 GB per the official speed benchmark), which overflows the 4080 SUPER's 16 GB. Even FP8 (16,012 MB at length 1 per the same benchmark) sits right at the 16 GB ceiling and overflows once context grows. On the 4080 SUPER, the Ollama / llama.cpp GGUF path is the comfortable one; see Troubleshooting for the vLLM/SGLang OOM walkthrough.

Results

  • Speed: 64.2 tokens/s generation at 4k context, Q4_K quantization, measured on the RTX 4080 SUPER — per the hardware-corner.net RTX 4080 SUPER LLM benchmark table row labelled "Qwen3 14B (Q4_K)", surfaced via /check/qwen3-14b/rtx-4080-super. Generation rate decays to 52.8 tok/s at 16k and 42.6 tok/s at 32k as the KV cache grows. Prompt processing on the same row is much faster — 3,745.0 tok/s at 4k context, dropping to 2,526.7 tok/s at 16k and 1,769.3 tok/s at 32k. Note these are chat-class throughput figures; for thinking-mode workloads where most of the output is a discarded <think> block, effective throughput is lower because the model emits far more tokens per useful answer.
  • VRAM usage: Both /check/qwen3-14b/rtx-4080-super benchmarks report a 16 GB peak on this card at Q4_K. Q4_K_M weights occupy ~9 GB of the 16 GB card at idle; the rest is KV cache headroom the runtime expands with context. The official Qwen speed benchmark on H20 corroborates the precision/VRAM ladder for Qwen3-14B in Transformers: AWQ-INT4 = 9,962 MB at length 1 / 15,323 MB at ~30k context, FP8 = 16,012 MB / 20,813 MB, BF16 = 28,402 MB / 33,336 MB — on a 16 GB card only the int4 / Q4_K_M / Q5_K_M / Q6_K GGUF paths fit with KV headroom; FP8 and BF16 overflow.
  • Quality notes: Q4_K_M is the community-default "sweet spot." On the 16 GB 4080 SUPER you can upgrade to Q5_K_M (10.51 GB) or Q6_K (12.12 GB) for higher fidelity and still leave room for a moderate KV cache, but Q8_0 (15.70 GB) is too large here — it nearly fills the card before any context loads. There's no quality reason to pick anything below Q4_K_M on this card.

For the full benchmark data and other-GPU comparisons, see /check/qwen3-14b/rtx-4080-super.
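To make the thinking-mode throughput caveat concrete, here is a small worked calculation. It assumes the 64.2 tok/s generation rate holds for the whole turn and uses a hypothetical 300-token answer preceded by a 3,000-token <think> trace; both token counts are illustrative, not measured.

```python
# Generation rate from the benchmark row quoted above (4k context, Q4_K).
GEN_TOK_PER_S = 64.2

def effective_rate(answer_tokens: int, think_tokens: int,
                   tok_per_s: float = GEN_TOK_PER_S) -> float:
    """Useful answer tokens per second once the <think> trace is counted."""
    total_seconds = (answer_tokens + think_tokens) / tok_per_s
    return answer_tokens / total_seconds

# Hypothetical turn: 300-token answer preceded by a 3,000-token <think> trace.
plain = effective_rate(300, 0)
thinking = effective_rate(300, 3000)
print(f"no thinking: {plain:.1f} tok/s, with thinking: {thinking:.1f} tok/s")
```

Under those assumptions the useful rate drops by a factor of eleven, since only one token in eleven reaches the final answer.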

Troubleshooting

Out of memory mid-generation on a hard problem

Qwen3-14B's thinking mode emits a <think>...</think> chain of thought that routinely runs 2k–4k tokens (longer on hard math / coding), and the KV cache grows linearly with every token. On a 16 GB card that <think> trace is the most common OOM trigger: the weights fit fine, but the cache balloons during a long reasoning turn. Mitigations, in order:

1. Cap the context with --ctx-size 8192 on the first run (shown in the Installation commands above).
2. Quantize the KV cache with --cache-type-k q8_0 --cache-type-v q8_0 --flash-attn in llama.cpp to roughly halve its memory.
3. Drop a quant tier on the weights (Q4_K_M → Q4_K_S leaves more room for KV).
4. Send /no_think to skip the chain of thought entirely for turns that don't need it.

Watch nvidia-smi -l 1 during a hard problem to calibrate the actual peak.
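The linear KV cache growth can be estimated before you hit it. The sketch below assumes Qwen3-14B's attention shape is 40 layers, 8 KV heads, and a head dimension of 128; check these against the model's config.json, since they are not stated in this recipe. It also ignores compute buffers, so measured usage will be higher.

```python
# Assumed Qwen3-14B attention shape: 40 layers, 8 KV heads (GQA), head_dim 128.
# Verify against the model's config.json before relying on these numbers.
N_LAYERS, N_KV_HEADS, HEAD_DIM = 40, 8, 128

# Bytes per cached element: f16 is 2 bytes; q8_0 packs 32 values into 34 bytes.
BYTES_PER_ELEM = {"f16": 2.0, "q8_0": 34 / 32}

def kv_cache_gib(n_tokens: int, cache_type: str = "f16") -> float:
    """Approximate KV cache size in GiB for n_tokens of context."""
    elems_per_token = 2 * N_LAYERS * N_KV_HEADS * HEAD_DIM  # K and V
    return n_tokens * elems_per_token * BYTES_PER_ELEM[cache_type] / 2**30

for ctx in (4096, 8192, 32768):
    print(f"{ctx:>6} tokens: f16 {kv_cache_gib(ctx):.2f} GiB, "
          f"q8_0 {kv_cache_gib(ctx, 'q8_0'):.2f} GiB")
```

Under these assumptions an 8k context costs about 1.25 GiB at f16 and a full 32k context about 5 GiB, which is why the q8_0 cache option matters on a 16 GB card.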

Ollama returns Error: model requires more system memory or hangs on load

Confirm a recent NVIDIA driver and CUDA 12.x runtime are installed (nvidia-smi should show a driver from the past 12 months). The RTX 4080 SUPER uses the Ada Lovelace architecture (sm_89), which has been fully supported by mainline CUDA wheels since 2023, so no special build flags or wheel pinning are required. If Ollama still appears to hang on first load, watch nvidia-smi -l 1 in another terminal to confirm the GPU is actually being used; if it stays at 0% utilization, reinstall Ollama and re-pull the model.
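If you would rather log the numbers than watch the terminal, nvidia-smi's CSV query mode is easy to parse. This is a minimal sketch; `parse_gpu_stats` and `gpu_stats` are hypothetical helper names, and the sample line at the bottom uses made-up values.

```python
import subprocess

QUERY = ["nvidia-smi",
         "--query-gpu=utilization.gpu,memory.used,memory.total",
         "--format=csv,noheader,nounits"]

def parse_gpu_stats(csv_text: str) -> list[tuple[int, int, int]]:
    """Parse 'util %, used MiB, total MiB' rows, one line per GPU."""
    rows = []
    for line in csv_text.strip().splitlines():
        util, used, total = (int(field) for field in line.split(","))
        rows.append((util, used, total))
    return rows

def gpu_stats() -> list[tuple[int, int, int]]:
    """Run nvidia-smi once and return (util_pct, used_mib, total_mib) per GPU."""
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_gpu_stats(out.stdout)

# Sample line in the format the query above prints (made-up values).
print(parse_gpu_stats("87, 9412, 16376\n"))
```

Calling gpu_stats() in a loop while the model loads shows whether utilization and memory actually move off zero.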

<think>...</think> output is bloating responses

Qwen3 enables thinking mode by default per the HF card quickstart. Send /no_think at the start of any user message to disable it for that turn, or pass enable_thinking=False if you're calling the chat-template API directly. Per the model card best-practices note: for thinking mode use Temperature=0.6, TopP=0.95, TopK=20, MinP=0 and do not use greedy decoding, which the card warns can lead to performance degradation and endless repetitions.
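The recommended sampling values can be set per request. The sketch below builds a request body for Ollama's native /api/chat endpoint, assuming its options field accepts temperature, top_p, top_k, and min_p; `build_payload` is an illustrative name and the request is only constructed, not sent.

```python
import json

# Sampling values quoted from the model card's thinking-mode recommendation.
THINKING_OPTIONS = {"temperature": 0.6, "top_p": 0.95, "top_k": 20, "min_p": 0}

def build_payload(user_text: str, model: str = "qwen3:14b") -> dict:
    """Request body for Ollama's native /api/chat endpoint (non-streaming)."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": user_text}],
        "options": dict(THINKING_OPTIONS),  # copy so callers can edit safely
        "stream": False,
    }

print(json.dumps(build_payload("Prove that 17 is prime."), indent=2))
```

POSTing this body to http://localhost:11434/api/chat applies the settings to that request only, so nothing needs to change in the model's stored defaults.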

vLLM / SGLang server crashes with CUDA OOM at startup

vLLM and SGLang default to BF16 weights for Qwen/Qwen3-14B, which require ~28 GB resident per the official speed benchmark and far exceed the 4080 SUPER's 16 GB. Unlike a 24 GB card, FP8 does not rescue you here either — it starts at 16,012 MB (length 1) and climbs to 20,813 MB at ~30k context per the same benchmark, overflowing 16 GB once any real context loads. The fitting options on this card are (a) AWQ-INT4 weights (~10 GB resident, growing to ~15 GB at ~30k), or (b) the Ollama / llama.cpp Q4_K_M GGUF path this recipe is built around. Reserve vLLM/SGLang for a larger card.
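The fit check behind that conclusion is a direct comparison against the card's capacity. This sketch uses the benchmark footprints quoted above and assumes a nominal 16 × 1024 MB budget; usable VRAM is slightly lower in practice, which makes the FP8 case even tighter than it looks here.

```python
# Footprints in MB from the Qwen speed benchmark quoted above:
# (at input length 1, at ~30k context).
FOOTPRINT_MB = {
    "AWQ-INT4": (9_962, 15_323),
    "FP8": (16_012, 20_813),
    "BF16": (28_402, 33_336),
}
CARD_MB = 16 * 1024  # nominal 16 GB; usable VRAM is slightly lower in practice

def fits(precision: str, card_mb: int = CARD_MB) -> tuple[bool, bool]:
    """Return (fits at length 1, fits at ~30k context)."""
    short_ctx, long_ctx = FOOTPRINT_MB[precision]
    return short_ctx <= card_mb, long_ctx <= card_mb

for name in FOOTPRINT_MB:
    print(name, fits(name))
```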

Generation slows dramatically past 32k context

32k is Qwen3-14B's native context window per the HF card, which also lists it as extensible to 131,072 tokens with YaRN. Going past 32k requires that YaRN extension, supported in llama.cpp via --rope-scaling yarn --rope-scale 4 --yarn-orig-ctx 32768 per the Qwen3 model card, but quality degrades and the KV cache balloons (a real concern on 16 GB). For long-doc workflows, prefer chunking + retrieval over pushing context past 32k. The hardware-corner.net RTX 4080 SUPER benchmark shows the generation rate falling from 64.2 tok/s at 4k to 42.6 tok/s at 32k on this card.
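The --rope-scale value is the target context divided by the native window. This sketch derives the flag list from a target length, using only the llama.cpp flags quoted above; `yarn_flags` is a hypothetical helper, and choosing a scale below 4 for intermediate lengths is an assumption rather than something this recipe has tested.

```python
NATIVE_CTX = 32_768  # native window stated above

def yarn_flags(target_ctx: int, native_ctx: int = NATIVE_CTX) -> list[str]:
    """llama.cpp flags for a target context, using the YaRN options quoted above."""
    if target_ctx <= native_ctx:
        return ["--ctx-size", str(target_ctx)]  # no scaling needed
    scale = target_ctx / native_ctx
    return ["--ctx-size", str(target_ctx), "--rope-scaling", "yarn",
            "--rope-scale", f"{scale:g}", "--yarn-orig-ctx", str(native_ctx)]

print(" ".join(yarn_flags(131_072)))
```

For 131,072 tokens the scale works out to 4, matching the flags above. Fitting a KV cache of that length on a 16 GB card is a separate problem.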

Using transformers directly instead of Ollama

If you bypass Ollama / llama.cpp and run the HF card quickstart via transformers directly with torch_dtype="auto", device_map="auto", you will load BF16 weights and hit OOM on a 16 GB 4080 SUPER (28,402 MB at length 1 per the Qwen benchmark). If you switch to a precision that fits (an AWQ-INT4 mirror) or move to a larger card, the quickstart works out of the box with a stock pip install torch: it does not hardcode attn_implementation="flash_attention_2", and Ada sm_89 has full FA2 kernel coverage if you opt into FA2 separately. Unlike Blackwell-class cards, no cu128-specific wheel selection is required for the 4080 SUPER.