What You'll Build
A local Qwen3-14B chat / reasoning assistant running on a 16 GB RTX 4080 SUPER, served through Ollama (or llama.cpp / LM Studio — same GGUF, three loaders). The recipe pins the dense 14.8B variant at Q4_K_M quantization (9.0 GB on disk), which leaves roughly 6–7 GB of headroom on the 16 GB 4080 SUPER for Qwen3's 32k-native context window, the thinking-mode chain of thought, and the KV cache.
Hardware data: RTX 4080 SUPER (16 GB VRAM) · Q4_K GGUF · 64.2 tokens/s generation at 4k context · See benchmark data
⚠️ Variant pinned: Qwen3 ships 8 sizes from the same Qwen org. Per the Ollama qwen3:14b tag list, Qwen3 spans 0.6b, 1.7b, 4b, 8b, 14b (this recipe), 30b (MoE), 32b, and 235b (MoE). The siblings have wildly different VRAM profiles, and on a 16 GB card the variant choice is binding: Qwen3-14B in BF16 is ~28 GB and does NOT fit 16 GB per the official Qwen speed benchmark ("28,402 MB" memory footprint at input length 1, growing to 33,336 MB at ~30k context). The instructions below are for the dense 14.8B model only at Q4_K_M GGUF, the path that fits 16 GB. The 30B/235B MoE siblings need all expert params resident in VRAM and will not fit this card; see the Qwen3 model card on the dense/MoE split.
ℹ️ Thinking mode is on by default: size your context for it. Qwen3-14B has a built-in chain-of-thought ("thinking") mode that the model card's quickstart enables via enable_thinking=True. Output starts with a <think>...</think> block (often 2k–4k tokens on hard problems) followed by the user-facing answer. That <think> trace grows the KV cache far faster than a plain chat turn, which matters on a 16 GB card (see Troubleshooting). To disable for latency-sensitive use, send /no_think in your prompt or pass enable_thinking=False in the chat template.
Requirements
| Component | Minimum | Tested |
|---|---|---|
| GPU | 16 GB VRAM (Q4_K_M weights + KV headroom) | RTX 4080 SUPER (16 GB) |
| RAM | 16 GB system | — |
| Storage | 9.0 GB (Q4_K_M GGUF) or 12.1 GB (Q6_K) | per unsloth/Qwen3-14B-GGUF |
| Driver | CUDA 12.x (Ada sm_89) | — |
| Runtime | Ollama 0.5+ / llama.cpp / LM Studio | — |
The model is released under Apache 2.0 — commercial use is permitted, weights are ungated (free download, no access request).
Installation
The fastest path is Ollama — one command pulls the canonical Q4_K_M build maintained by the Qwen team:
Option A — Ollama (recommended)
1. Install Ollama
# macOS / Linux
curl -fsSL https://ollama.com/install.sh | sh
(Windows: download from ollama.com/download.) Per the Qwen3 model card, "applications such as Ollama, LMStudio, MLX-LM, llama.cpp, and KTransformers have also supported Qwen3."
2. Pull the 14B model
ollama pull qwen3:14b
This fetches a 9.3 GB Q4_K_M checkpoint per the Ollama qwen3:14b tag (14.8B parameters, Q4_K_M quantization), slightly larger than the 9.00 GB Unsloth Q4_K_M file listed under Option B. The download is a single file, so no manual quant-tier selection is needed.
Option B — llama.cpp + community GGUF
If you want a higher-quality quant (Q5_K_M, Q6_K) or the imatrix-tuned Unsloth Dynamic 2.0 ladder, use a community redistributor that publishes the full per-tier table. The unsloth/Qwen3-14B-GGUF repo lists Qwen/Qwen3-14B explicitly as its base_model with link-back to the upstream model card.
1. Install llama.cpp
# macOS (Homebrew)
brew install llama.cpp
# Linux: pre-built CUDA release binaries
# Visit https://github.com/ggml-org/llama.cpp/releases for cu12x binaries
2. Pull the quant you want
Per-tier file sizes from the unsloth/Qwen3-14B-GGUF Files tab:
| Quant | File size | Fits 16 GB with KV headroom? |
|---|---|---|
| Q4_K_M | 9.00 GB | yes — recommended for this card |
| UD-Q4_K_XL | 9.16 GB | yes — Unsloth Dynamic 2.0 imatrix-tuned |
| Q5_K_M | 10.51 GB | yes — better quality, comfortable |
| Q6_K | 12.12 GB | yes, but watch KV at long context |
| Q8_0 | 15.70 GB | no — weights alone leave ~0.3 GB; OOM with any KV cache |
| BF16 | 29.54 GB | no — does NOT fit 16 GB |
On a 16 GB 4080 SUPER, Q4_K_M / Q5_K_M / Q6_K all leave room for the KV cache; Q8_0 (15.70 GB) does not — its weights nearly fill the card before any context loads. This is the key difference from a 24 GB card, where Q8_0 is comfortable.
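The fit column above is plain subtraction, and a minimal sketch makes the arithmetic explicit. It assumes resident weight size is close to the GGUF file size and treats the card as a flat 16.0 GB. The 2 GB minimum headroom is a hypothetical margin chosen for illustration, not a measured requirement.

```python
# File sizes in GB, copied from the per-tier table above.
QUANT_SIZES_GB = {
    "Q4_K_M": 9.00,
    "UD-Q4_K_XL": 9.16,
    "Q5_K_M": 10.51,
    "Q6_K": 12.12,
    "Q8_0": 15.70,
    "BF16": 29.54,
}

CARD_VRAM_GB = 16.0    # RTX 4080 SUPER
MIN_HEADROOM_GB = 2.0  # hypothetical margin for KV cache and runtime buffers


def headroom_gb(quant: str, card_gb: float = CARD_VRAM_GB) -> float:
    """VRAM left after the weights load; negative means the weights alone overflow."""
    return round(card_gb - QUANT_SIZES_GB[quant], 2)


def fits(quant: str, min_headroom_gb: float = MIN_HEADROOM_GB) -> bool:
    """True when the tier leaves at least the chosen margin free."""
    return headroom_gb(quant) >= min_headroom_gb


for name in QUANT_SIZES_GB:
    print(f"{name:11s} headroom {headroom_gb(name):7.2f} GB  fits={fits(name)}")
```

Raising `MIN_HEADROOM_GB` is a quick way to see which tiers survive a longer context budget.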
Then via the llama.cpp Hugging Face shortcut (per the unsloth model card):
# OpenAI-compatible local server with web UI
llama-server -hf unsloth/Qwen3-14B-GGUF:UD-Q4_K_XL --ctx-size 8192 --flash-attn
# Interactive terminal
llama-cli -hf unsloth/Qwen3-14B-GGUF:UD-Q4_K_XL --ctx-size 8192 --flash-attn
Option C — LM Studio (GUI)
LM Studio offers a one-click install path per the Qwen3-14B HF card. Search "Qwen3-14B GGUF" inside the app and pick the Q4_K_M tier (Q5_K_M or Q6_K if you want higher fidelity and still have room — but skip Q8_0 on this 16 GB card).
Running
One-shot prompt via Ollama
ollama run qwen3:14b "Explain the difference between MoE and dense transformer architectures in three sentences."
First run loads the model into VRAM (~9 GB resident at idle for Q4_K_M, growing as the KV cache fills with longer contexts). Subsequent prompts in the same session stay warm.
Disable thinking mode for short answers
ollama run qwen3:14b "/no_think What's the capital of France?"
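Per the Qwen3-14B HF card, /no_think is a per-turn soft switch: it turns thinking off for that request, much like enable_thinking=False, so the reply skips the chain-of-thought. An empty <think></think> block may still appear in the output.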
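If you assemble chat messages in code, the same soft switch can be applied programmatically. The helper below is an illustrative sketch: `with_no_think` is a hypothetical name, and it assumes OpenAI-style message dicts with `role` and `content` keys.

```python
from copy import deepcopy


def with_no_think(messages):
    """Return a copy of `messages` with /no_think prefixed to the last user turn."""
    out = deepcopy(messages)  # leave the caller's list untouched
    for msg in reversed(out):
        if msg.get("role") == "user":
            # Avoid stacking the switch if it is already present.
            if not msg["content"].lstrip().startswith("/no_think"):
                msg["content"] = "/no_think " + msg["content"]
            break
    return out


history = [{"role": "user", "content": "What's the capital of France?"}]
print(with_no_think(history)[0]["content"])
```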
OpenAI-compatible HTTP API
# Ollama exposes localhost:11434 by default
curl http://localhost:11434/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "qwen3:14b",
"messages": [{"role": "user", "content": "Write a haiku about Ada Lovelace GPUs."}]
}'
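The same request can be sent from Python with only the standard library. This is a sketch, assuming Ollama is listening on its default port and that the reply follows the usual OpenAI chat-completions shape (`choices[0].message.content`). `build_request` and `chat` are hypothetical helper names.

```python
import json
import urllib.request

# Default endpoint, taken from the curl example above.
OLLAMA_CHAT_URL = "http://localhost:11434/v1/chat/completions"


def build_request(prompt: str, model: str = "qwen3:14b") -> urllib.request.Request:
    """Build the POST request without sending it."""
    body = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        OLLAMA_CHAT_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt: str, timeout_s: float = 300.0) -> str:
    """Send one user turn and return the assistant text (needs a running Ollama)."""
    with urllib.request.urlopen(build_request(prompt), timeout=timeout_s) as resp:
        reply = json.load(resp)
    return reply["choices"][0]["message"]["content"]

# Example call, once the server is up:
# print(chat("Write a haiku about Ada Lovelace GPUs."))
```

The generous timeout is deliberate: a thinking-mode reply can take well over a minute on a hard prompt.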
The upstream Qwen3-14B card also documents vllm serve Qwen/Qwen3-14B --enable-reasoning --reasoning-parser deepseek_r1 and python -m sglang.launch_server --model-path Qwen/Qwen3-14B --reasoning-parser qwen3 — but both default to BF16 weights (~28 GB per the official speed benchmark), which overflows the 4080 SUPER's 16 GB. Even FP8 (16,012 MB at length 1 per the same benchmark) sits right at the 16 GB ceiling and overflows once context grows. On the 4080 SUPER, the Ollama / llama.cpp GGUF path is the comfortable one; see Troubleshooting for the vLLM/SGLang OOM walkthrough.
Results
- Speed: 64.2 tokens/s generation at 4k context, Q4_K quantization, measured on the RTX 4080 SUPER, per the hardware-corner.net RTX 4080 SUPER LLM benchmark table row labelled "Qwen3 14B (Q4_K)", surfaced via /check/qwen3-14b/rtx-4080-super. Generation rate decays to 52.8 tok/s at 16k and 42.6 tok/s at 32k as the KV cache grows. Prompt processing on the same row is much faster: 3,745.0 tok/s at 4k context, dropping to 2,526.7 tok/s at 16k and 1,769.3 tok/s at 32k. Note these are chat-class throughput figures; for thinking-mode workloads where most of the output is a discarded <think> block, effective throughput is lower because the model emits far more tokens per useful answer.
- VRAM usage: Both /check/qwen3-14b/rtx-4080-super benchmarks report a 16 GB peak on this card at Q4_K. Q4_K_M weights occupy ~9 GB of the 16 GB card at idle; the rest is KV cache headroom the runtime expands with context. The official Qwen speed benchmark on H20 corroborates the precision/VRAM ladder for Qwen3-14B in Transformers: AWQ-INT4 = 9,962 MB at length 1 / 15,323 MB at ~30k context, FP8 = 16,012 MB / 20,813 MB, BF16 = 28,402 MB / 33,336 MB. On a 16 GB card only the int4 / Q4_K_M / Q5_K_M / Q6_K GGUF paths fit with KV headroom; FP8 and BF16 overflow.
- Quality notes: Q4_K_M is the community-default "sweet spot." On the 16 GB 4080 SUPER you can upgrade to Q5_K_M (10.51 GB) or Q6_K (12.12 GB) for higher fidelity and still leave room for a moderate KV cache, but Q8_0 (15.70 GB) is too large here: it nearly fills the card before any context loads. There's no quality reason to pick anything below Q4_K_M on this card.
For the full benchmark data and other-GPU comparisons, see /check/qwen3-14b/rtx-4080-super.
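The note on thinking-mode throughput can be made concrete with a little arithmetic. The sketch below uses the 64.2 tok/s figure from the Speed bullet. The 3,000-token reasoning trace and 300-token answer are hypothetical values picked from the 2k–4k range mentioned earlier, and the model ignores prompt-processing time and the slowdown at longer contexts.

```python
def seconds_to_answer(think_tokens: int, answer_tokens: int, tok_per_s: float) -> float:
    """Wall-clock generation time when every token, visible or not, is decoded."""
    return (think_tokens + answer_tokens) / tok_per_s


def effective_answer_rate(think_tokens: int, answer_tokens: int, tok_per_s: float) -> float:
    """Useful (answer-only) tokens per second after paying for the <think> trace."""
    return answer_tokens / seconds_to_answer(think_tokens, answer_tokens, tok_per_s)


RATE_4K = 64.2  # tok/s at 4k context, from the benchmark row cited above

print(f"{seconds_to_answer(3000, 300, RATE_4K):.1f} s to finish the turn")
print(f"{effective_answer_rate(3000, 300, RATE_4K):.2f} useful tok/s")
```

Under these assumptions a turn that looks like 64 tok/s on paper delivers under 6 useful tok/s, which is why /no_think pays off on simple prompts.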
Troubleshooting
Out of memory mid-generation on a hard problem
Qwen3-14B's thinking mode emits a <think>...</think> chain-of-thought that routinely runs 2k–4k tokens (longer on hard math / coding), and the KV cache grows linearly with every token. On a 16 GB card that <think> trace is the most common OOM trigger — the weights fit fine, but the cache balloons during a long reasoning turn. Mitigations, in order: (1) cap the context with --ctx-size 8192 on the first run (shown in the Installation commands above); (2) quantize the KV cache with --cache-type-k q8_0 --cache-type-v q8_0 --flash-attn in llama.cpp to roughly halve its memory; (3) drop a quant tier on the weights (Q4_K_M → Q4_K_S leaves more room for KV); (4) send /no_think to skip the chain-of-thought entirely for turns that don't need it. Watch nvidia-smi -l 1 during a hard problem to calibrate the actual peak.
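To see how much room each mitigation buys, a rough KV cache estimate helps. The sketch assumes Qwen3-14B's attention shape is 40 layers, 8 KV heads, and a head dimension of 128; verify these against the config.json of the checkpoint you actually load. It also assumes the q8_0 cache type costs 34 bytes per 32 values, and it ignores runtime overhead such as compute buffers.

```python
# Assumed Qwen3-14B attention shape; check config.json before relying on it.
N_LAYERS = 40
N_KV_HEADS = 8
HEAD_DIM = 128

# Bytes per cached element. q8_0 stores 32 int8 values plus one f16 scale per block.
BYTES_PER_ELEM = {"f16": 2.0, "q8_0": 34 / 32}


def kv_cache_gib(n_tokens: int, cache_type: str = "f16") -> float:
    """Approximate KV cache size in GiB for a given number of cached tokens."""
    per_token = 2 * N_LAYERS * N_KV_HEADS * HEAD_DIM * BYTES_PER_ELEM[cache_type]  # K and V
    return n_tokens * per_token / 2**30


for ctx in (4096, 8192, 32768):
    print(f"{ctx:6d} tokens: f16 {kv_cache_gib(ctx):.2f} GiB, "
          f"q8_0 {kv_cache_gib(ctx, 'q8_0'):.2f} GiB")
```

Under these assumptions an 8k context costs about 1.25 GiB at f16 and the full 32k window about 5 GiB, so the q8_0 cache matters most at long context.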
Ollama returns Error: model requires more system memory or hangs on load
Confirm a recent NVIDIA driver and CUDA 12.x runtime are installed (the nvidia-smi header should report the driver version and a CUDA 12.x version). The RTX 4080 SUPER uses the Ada Lovelace architecture (sm_89), which has been fully supported by mainline CUDA wheels since 2023, so no special build flags or wheel pinning are required. If Ollama still appears to hang on first load, watch nvidia-smi -l 1 in another terminal to confirm the GPU is actually being used; if it stays at 0% utilization, reinstall Ollama and re-pull the model.
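If you would rather log GPU state than watch a terminal, nvidia-smi's CSV query mode is easy to parse. This is a sketch: `parse_smi_line` and `read_gpu` are hypothetical helper names, and the sample line in the comment is made up for illustration.

```python
import subprocess

# Query utilization and memory as bare numbers (percent, MiB, MiB).
QUERY = [
    "nvidia-smi",
    "--query-gpu=utilization.gpu,memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_smi_line(line: str) -> dict:
    """Parse one CSV row such as '37, 9321, 16376' (illustrative values)."""
    util, used, total = (int(field.strip()) for field in line.split(","))
    return {"util_pct": util, "used_mib": used, "total_mib": total}


def read_gpu(index: int = 0) -> dict:
    """Run nvidia-smi once and return the stats for one GPU."""
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
    return parse_smi_line(out.strip().splitlines()[index])

# Example call on a machine with an NVIDIA driver installed:
# print(read_gpu())
```

A `util_pct` of 0 while a prompt is generating suggests the model fell back to CPU.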
<think>...</think> output is bloating responses
Qwen3 enables thinking mode by default per the HF card quickstart. Send /no_think at the start of any user message to disable it for that turn, or pass enable_thinking=False if you're calling the chat-template API directly. Per the model card best-practices note: for thinking mode use Temperature=0.6, TopP=0.95, TopK=20, MinP=0 and do not use greedy decoding — it triggers endless repetitions.
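When you want thinking left on but hidden from end users, the trace can be separated client-side. The sketch below assumes the reply arrives as one string containing at most one <think>...</think> block. Some runtimes may already return reasoning in a separate field, in which case this step is unnecessary. `split_thinking` is a hypothetical helper name.

```python
import re

# Non-greedy match across newlines for a single reasoning block.
_THINK_BLOCK = re.compile(r"<think>(.*?)</think>", flags=re.DOTALL)


def split_thinking(text: str) -> tuple:
    """Return (reasoning, answer); reasoning is '' when no block is present."""
    match = _THINK_BLOCK.search(text)
    if match is None:
        return "", text.strip()
    reasoning = match.group(1).strip()
    answer = (text[:match.start()] + text[match.end():]).strip()
    return reasoning, answer


raw = "<think>\n2 + 2 = 4\n</think>\n\nThe answer is 4."
reasoning, answer = split_thinking(raw)
print(answer)
```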
vLLM / SGLang server crashes with CUDA OOM at startup
vLLM and SGLang default to BF16 weights for Qwen/Qwen3-14B, which require ~28 GB resident per the official speed benchmark and far exceed the 4080 SUPER's 16 GB. Unlike a 24 GB card, FP8 does not rescue you here either — it starts at 16,012 MB (length 1) and climbs to 20,813 MB at ~30k context per the same benchmark, overflowing 16 GB once any real context loads. The fitting options on this card are (a) AWQ-INT4 weights (~10 GB resident, growing to ~15 GB at ~30k), or (b) the Ollama / llama.cpp Q4_K_M GGUF path this recipe is built around. Reserve vLLM/SGLang for a larger card.
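The benchmark figures quoted above can be checked against the card mechanically. The sketch treats 16 GB as 16,384 MB and reads the benchmark's "MB" in the same unit; both are simplifying assumptions. The optional `reserve_mb` argument is a hypothetical knob for memory already taken by the desktop or driver.

```python
# (MB at input length 1, MB at ~30k context), as quoted from the Qwen speed benchmark.
FOOTPRINT_MB = {
    "AWQ-INT4": (9_962, 15_323),
    "FP8": (16_012, 20_813),
    "BF16": (28_402, 33_336),
}

CARD_MB = 16 * 1024  # assumption: 16 GB card treated as 16,384 MB


def fit_report(card_mb: int = CARD_MB, reserve_mb: int = 0) -> dict:
    """Map precision -> (fits at length 1, fits at ~30k context)."""
    budget = card_mb - reserve_mb
    return {
        name: (short <= budget, long <= budget)
        for name, (short, long) in FOOTPRINT_MB.items()
    }


for name, (short_ok, long_ok) in fit_report().items():
    print(f"{name:9s} length 1: {short_ok!s:5s}  ~30k: {long_ok}")
```

With no reserve, FP8 squeaks in at length 1 with under 400 MB to spare. A few hundred megabytes of `reserve_mb` is enough to push it over, which matches the warning above.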
Generation slows dramatically past 32k context
32k is Qwen3-14B's native context window per the HF card, which lists it as extensible to 131,072 tokens with YaRN. Past 32k the model needs the YaRN extension, supported in llama.cpp via --rope-scaling yarn --rope-scale 4 --yarn-orig-ctx 32768 per the Qwen3 model card, but quality degrades and the KV cache balloons (a real concern on 16 GB). For long-doc workflows, prefer chunking + retrieval over pushing context past 32k. Even inside the native window, the hardware-corner.net RTX 4080 SUPER benchmark shows the generation rate falling from 64.2 tok/s at 4k to 42.6 tok/s at 32k on this card.
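The YaRN flags follow from the ratio of target context to native context (131,072 / 32,768 = 4). The sketch below builds the llama.cpp argument list for a chosen target. It assumes that a smaller scale factor is appropriate when you need less than the full 131,072 tokens, and `yarn_args` is a hypothetical helper name.

```python
import math

NATIVE_CTX = 32_768     # Qwen3-14B native window
MAX_YARN_CTX = 131_072  # upper limit listed with YaRN


def yarn_args(target_ctx: int) -> list:
    """Return llama.cpp CLI arguments for the requested context size."""
    if target_ctx <= NATIVE_CTX:
        return ["--ctx-size", str(target_ctx)]  # no rope scaling needed
    if target_ctx > MAX_YARN_CTX:
        raise ValueError(f"{target_ctx} exceeds the {MAX_YARN_CTX}-token YaRN limit")
    scale = math.ceil(target_ctx / NATIVE_CTX)  # smallest integer factor that covers the target
    return [
        "--ctx-size", str(target_ctx),
        "--rope-scaling", "yarn",
        "--rope-scale", str(scale),
        "--yarn-orig-ctx", str(NATIVE_CTX),
    ]


print(" ".join(yarn_args(131_072)))
```

On a 16 GB card, check the KV cache cost of the target context before launching; the flags only make the long window possible, not affordable.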
Using transformers directly instead of Ollama
If you bypass Ollama / llama.cpp and run the HF card quickstart via transformers directly with torch_dtype="auto", device_map="auto", you will load BF16 weights and hit OOM on a 16 GB 4080 SUPER (28,402 MB at length 1 per the Qwen benchmark). The quickstart does not hardcode attn_implementation="flash_attention_2", so a precision that does fit (for example an AWQ-INT4 mirror) works out of the box on the 4080 SUPER with a stock pip install torch; the alternative is a larger card. If you opt into FA2 separately, Ada sm_89 has full FA2 kernel coverage. Unlike Blackwell-class cards, the 4080 SUPER needs no cu128-specific wheel selection.