
Qwen3-14B on RTX 4090: Q4_K_M GGUF via Ollama or llama.cpp

llm · beginner · 10GB+ VRAM · May 20, 2026
prerequisites
  • NVIDIA RTX 4090 (24 GB VRAM) or equivalent 24 GB CUDA card
  • Recent NVIDIA driver with CUDA 12.x support (Ada sm_89 — no special wheel selection required)
  • ~9 GB free disk for the Q4_K_M GGUF (or ~16 GB for Q8_0)
  • Ollama, llama.cpp, or LM Studio installed

What You'll Build

A local Qwen3-14B chat / reasoning assistant running on a 24 GB RTX 4090, served through Ollama (or llama.cpp / LM Studio — same GGUF, three loaders). The recipe pins the dense 14.8B variant at Q4_K_M quantization (9.0 GB on disk), which leaves roughly 15 GB of headroom on the 4090 for Qwen3's 32k-native context window, the thinking-mode chain of thought, and a comfortable KV cache.

Hardware data: RTX 4090 (24 GB VRAM) · Q4_K GGUF · 84.4 tokens/s generation at 4k context · See benchmark data

⚠️ Variant pinned — Qwen3 ships 8 sizes from the same Qwen org. Per the Ollama qwen3:14b tag list, Qwen3 spans 0.6b, 1.7b, 4b, 8b, 14b (this recipe), 30b (MoE), 32b, and 235b (MoE). The siblings have wildly different VRAM profiles — Qwen3-32B in Q4_K_M is ~19 GB and still fits the 4090, but Qwen3-14B in BF16 is ~28 GB and does NOT fit 24 GB per the official Qwen speed benchmark ("28,402 MB" memory footprint at input length 1, growing to 33,336 MB at 30k context). The instructions below are for the dense 14.8B model only at Q4_K_M; if you want the BF16 path you'll need offloading or a 32 GB+ card. For the 30B/235B MoE siblings, all expert params must be resident in VRAM — see the Qwen3 model card on the dense/MoE split.

ℹ️ Thinking mode is on by default. Qwen3-14B has a built-in chain-of-thought ("thinking") mode that the model card's quickstart enables via enable_thinking=True. Output starts with a <think>...</think> block followed by the user-facing answer. To disable for latency-sensitive use, send /no_think in your prompt or pass enable_thinking=False in the chat template.

Requirements

| Component | Minimum | Tested |
|---|---|---|
| GPU | 10 GB VRAM (Q4_K_M weights) | RTX 4090 (24 GB) |
| RAM | 16 GB system | |
| Storage | 9.0 GB (Q4_K_M GGUF) or 15.7 GB (Q8_0) | per unsloth/Qwen3-14B-GGUF |
| Driver | CUDA 12.x (Ada sm_89) | |
| Runtime | Ollama 0.5+ / llama.cpp / LM Studio | |

The model is released under Apache 2.0 — commercial use is permitted.

Installation

The fastest path is Ollama — one command pulls the canonical Q4_K_M build maintained by the Qwen team:

Option A — Ollama (recommended)

1. Install Ollama

# macOS / Linux
curl -fsSL https://ollama.com/install.sh | sh

(Windows: download from ollama.com/download.) Per the Qwen3 model card, "applications such as Ollama, LMStudio, MLX-LM, llama.cpp, and KTransformers have also supported Qwen3."

2. Pull the 14B model

ollama pull qwen3:14b

This fetches a 9.3 GB Q4_K_M checkpoint per the Ollama qwen3:14b tag (14.8B parameters, Q4_K_M quantization). The download is one file — no manual quant-tier selection needed.

Option B — llama.cpp + community GGUF

If you want a higher-quality quant (Q6_K, Q8_0) or the imatrix-tuned Unsloth Dynamic 2.0 ladder, use a community redistributor that publishes the full per-tier table. The unsloth/Qwen3-14B-GGUF repo lists Qwen/Qwen3-14B explicitly as its base_model with link-back to the upstream model card.

1. Install llama.cpp

# macOS (Homebrew)
brew install llama.cpp

# Linux: pre-built CUDA build
# Visit https://github.com/ggml-org/llama.cpp/releases for cu12x binaries

2. Pull the quant you want

Per-tier file sizes from the unsloth/Qwen3-14B-GGUF Files tab:

| Quant | File size | Notes |
|---|---|---|
| Q4_K_M | 9.00 GB | recommended for this card |
| Q5_K_M | 10.51 GB | better quality, still comfortable |
| Q6_K | 12.12 GB | high-fidelity, plenty of headroom |
| Q8_0 | 15.70 GB | near-lossless |
| UD-Q4_K_XL | 9.16 GB | Unsloth Dynamic 2.0 imatrix-tuned |
| UD-Q8_K_XL | 18.75 GB | Unsloth Dynamic 2.0, near-lossless |
| BF16 | 29.54 GB | full precision, does NOT fit 24 GB |

Then via the llama.cpp Hugging Face shortcut (per the unsloth model card):

# OpenAI-compatible local server with web UI
llama-server -hf unsloth/Qwen3-14B-GGUF:UD-Q4_K_XL

# Interactive terminal
llama-cli -hf unsloth/Qwen3-14B-GGUF:UD-Q4_K_XL

Option C — LM Studio (GUI)

LM Studio offers a one-click install path per the Qwen3-14B HF card. Search "Qwen3-14B GGUF" inside the app and pick the Q4_K_M tier, or Q8_0 if you want near-lossless output and can spare ~17 GB of VRAM for it.

Running

One-shot prompt via Ollama

ollama run qwen3:14b "Explain the difference between MoE and dense transformer architectures in three sentences."

First run loads the model into VRAM (~9 GB resident at idle for Q4_K_M, growing as the KV cache fills with longer contexts). Subsequent prompts in the same session stay warm.

Disable thinking mode for short answers

ollama run qwen3:14b "/no_think What's the capital of France?"

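Per the Qwen3-14B HF card, /no_think is a soft switch: it asks the model to skip reasoning for that turn, much like enable_thinking=False, while the chat-template setting itself stays unchanged. The reply may still carry an empty <think>...</think> block before the answer.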

OpenAI-compatible HTTP API

# Ollama exposes localhost:11434 by default
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3:14b",
    "messages": [{"role": "user", "content": "Write a haiku about Ada Lovelace GPUs."}]
  }'
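The same request can be sent from Python. This is a minimal sketch using only the standard library. It assumes the default Ollama port shown in the curl call and the usual OpenAI-style response shape (choices[0].message.content). The helper names are hypothetical, and the temperature and top_p defaults are the thinking-mode values the model card recommends (quoted under Troubleshooting).

```python
import json
import urllib.request

# Default Ollama endpoint, same as the curl example.
OLLAMA_URL = "http://localhost:11434/v1/chat/completions"


def build_chat_request(prompt, model="qwen3:14b", temperature=0.6, top_p=0.95):
    """Build the JSON body for an OpenAI-style chat completion request."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": temperature,
        "top_p": top_p,
    }


def chat(prompt, url=OLLAMA_URL, timeout=300):
    """POST the prompt to the local server and return the reply text."""
    body = json.dumps(build_chat_request(prompt)).encode("utf-8")
    request = urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        payload = json.load(response)
    # Assumed OpenAI-style response layout.
    return payload["choices"][0]["message"]["content"]
```

With the Ollama server running, chat("Write a haiku about Ada Lovelace GPUs.") returns the reply as a string, including any thinking block the server passes through.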

For higher-throughput, production-style serving, the upstream Qwen3-14B card documents vllm serve Qwen/Qwen3-14B --enable-reasoning --reasoning-parser deepseek_r1 and python -m sglang.launch_server --model-path Qwen/Qwen3-14B --reasoning-parser qwen3. Both load BF16 weights by default (~28 GB per the official speed benchmark), which overflows the 4090's 24 GB. On the 4090, the Ollama / llama.cpp GGUF path is the comfortable one. If you want vLLM or SGLang anyway, FP8 (16,012 MB per the same benchmark) is the lightest near-native-precision option, provided you are willing to swap to an FP8 mirror.

Results

  • Speed: 84.4 tokens/s generation at 4k context, Q4_K quantization, measured on RTX 4090 — per the hardware-corner.net RTX 4090 LLM benchmark table row labelled "Qwen3 14B (Q4_K)", surfaced via /check/qwen3-14b/rtx-4090. Generation rate decays to 69.8 tok/s at 16k, 55.4 tok/s at 32k, and 38.8 tok/s at 64k as the KV cache grows. Prompt processing on the same row is much faster — 5,265.4 tok/s at 4k context, dropping to 1,398.3 tok/s at 64k.
  • VRAM usage: Q4_K_M weights occupy ~9 GB of the 24 GB card at idle; the rest is KV cache headroom the runtime expands with context. The official Qwen speed benchmark on H20 corroborates the precision/VRAM ladder for Qwen3-14B in Transformers: AWQ-INT4 = 9,962 MB at length 1 / 15,323 MB at 30k context, FP8 = 16,012 MB / 20,813 MB, BF16 = 28,402 MB / 33,336 MB — only the int4 / FP8 / Q4_K_M GGUF paths fit a 24 GB card. See /check/qwen3-14b/rtx-4090 for community-contributed measurements as they land.
  • Quality notes: Q4_K_M is the community-default "sweet spot." On a 24 GB 4090 you have plenty of room to upgrade to Q6_K (12.1 GB) or Q8_0 (15.7 GB) for near-lossless output; after the weights load they leave roughly 12 GB and 8 GB free, respectively, for KV cache and activations at 32k context. There's no quality reason to pick anything below Q4_K_M on this card.

For the full benchmark data and other-GPU comparisons, see /check/qwen3-14b/rtx-4090.
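To put a number on "KV cache headroom", the sketch below estimates cache size per context length. The architecture figures (40 layers, 8 KV heads under GQA, head dimension 128) and the 16-bit cache are assumptions, not measurements from this recipe. Check them against the model's config before relying on the result.

```python
def kv_cache_gib(context_tokens, layers=40, kv_heads=8, head_dim=128,
                 bytes_per_value=2):
    """Approximate KV cache size in GiB for a single sequence.

    Defaults are ASSUMED Qwen3-14B values with a 16-bit cache; a quantized
    KV cache or a different config changes the answer.
    """
    # The factor of 2 covers the separate key and value tensors per layer.
    per_token_bytes = 2 * layers * kv_heads * head_dim * bytes_per_value
    return context_tokens * per_token_bytes / 2**30


for ctx in (4096, 16384, 32768):
    print(f"{ctx:>6} tokens -> {kv_cache_gib(ctx):.3f} GiB")
```

Under these assumptions a full 32,768-token context needs about 5 GiB of cache, which fits alongside any of the GGUF tiers up to Q8_0 on a 24 GB card.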

Troubleshooting

Ollama returns Error: model requires more system memory or hangs on load

The "requires more system memory" message refers to system RAM rather than VRAM, so first check that enough RAM is free (the Requirements table lists 16 GB as the minimum). Then confirm a recent NVIDIA driver and CUDA 12.x runtime are installed (nvidia-smi should show a driver from the past 12 months). The RTX 4090 uses the Ada Lovelace architecture (sm_89), which mainline CUDA wheels have fully supported since 2023, so no special build flags or wheel pinning are required. If Ollama still appears to hang on first load, watch nvidia-smi -l 1 in another terminal to confirm the GPU is actually being used; if it stays at 0% utilization, reinstall Ollama and re-pull the model.

<think>...</think> output is bloating responses

Qwen3 enables thinking mode by default per the HF card quickstart. Send /no_think at the start of any user message to disable it for that turn, or pass enable_thinking=False if you're calling the chat-template API directly. Per the model card best-practices note: for thinking mode use Temperature=0.6, TopP=0.95, TopK=20, MinP=0 and do not use greedy decoding — it triggers endless repetitions.
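If you want to keep thinking mode on but hide the reasoning from end users, you can split it off after the fact. This is a minimal sketch, assuming the reply contains at most one <think>...</think> block placed before the answer. The function name is hypothetical.

```python
import re

# Non-greedy match so only the first think block is captured;
# DOTALL lets the reasoning span multiple lines.
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_thinking(text):
    """Return (reasoning, answer) from a reply that may contain a think block."""
    match = THINK_RE.search(text)
    if match is None:
        return "", text.strip()
    reasoning = match.group(1).strip()
    # Drop the matched block and keep whatever surrounds it as the answer.
    answer = (text[:match.start()] + text[match.end():]).strip()
    return reasoning, answer


reasoning, answer = split_thinking(
    "<think>\nParis is the capital.\n</think>\n\nParis."
)
print(answer)
```

Replies without a think block pass through unchanged, and an empty block simply yields an empty reasoning string.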

vLLM / SGLang server crashes with CUDA OOM at startup

vLLM and SGLang default to BF16 weights for Qwen/Qwen3-14B, which require ~28 GB resident per the official speed benchmark and exceed the 4090's 24 GB. Either (a) switch to an FP8 mirror (Qwen/Qwen3-14B-FP8 if available, or run with --quantization fp8 if your wheel supports it — drops the footprint to ~16 GB per the same benchmark), (b) use AWQ-INT4 weights (~10 GB resident), or (c) drop to Ollama/llama.cpp with the Q4_K_M GGUF this recipe is built around.

Using transformers directly instead of Ollama

If you bypass Ollama / llama.cpp and run the HF card quickstart via transformers directly with torch_dtype="auto", device_map="auto", you will load BF16 weights and hit OOM on a 24 GB 4090 (28,402 MB at length 1 per the Qwen benchmark). The quickstart does not hardcode attn_implementation="flash_attention_2", so once you do fit (FP8 mirror, AWQ-INT4 mirror, or 32 GB+ card), it works out of the box on the 4090 with a stock pip install torch — Ada sm_89 has full FA2 kernel coverage if you opt into FA2 separately. Unlike Blackwell-class cards, no cu128-specific wheel selection is required for the 4090.

Generation slows dramatically past 32k context

32k is Qwen3-14B's native context window per the HF card ("Context Length: 32,768 natively and 131,072 tokens with YaRN"). Beyond that the model needs YaRN extension — supported in llama.cpp via --rope-scaling yarn --rope-scale 4 --yarn-orig-ctx 32768 per the Qwen3 model card — but quality degrades and the KV cache balloons. For long-doc workflows, prefer chunking + retrieval over pushing context past 32k. The hardware-corner.net benchmark shows the generation rate falling from 84.4 tok/s at 4k to 38.8 tok/s at 64k context on this card.
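For rough latency planning, the generation rates quoted in Results can be interpolated between the measured context sizes. This is a sketch under stated assumptions: linear interpolation is a simplification chosen here, not something the benchmark source provides, and values outside the 4k to 64k range are clamped rather than extrapolated. The function names are hypothetical.

```python
from bisect import bisect_left

# (context in thousands of tokens, generation tokens/s) as quoted in Results.
MEASURED = [(4, 84.4), (16, 69.8), (32, 55.4), (64, 38.8)]


def estimate_tok_per_s(context_k, points=MEASURED):
    """Linearly interpolate generation speed; clamp outside the measured range."""
    xs = [x for x, _ in points]
    if context_k <= xs[0]:
        return points[0][1]
    if context_k >= xs[-1]:
        return points[-1][1]
    i = bisect_left(xs, context_k)
    (x0, y0), (x1, y1) = points[i - 1], points[i]
    return y0 + (y1 - y0) * (context_k - x0) / (x1 - x0)


def seconds_to_generate(n_tokens, context_k):
    """Estimated wall-clock seconds to emit n_tokens at a given context size."""
    return n_tokens / estimate_tok_per_s(context_k)


print(f"{estimate_tok_per_s(24):.1f} tok/s at 24k (interpolated)")
print(f"{seconds_to_generate(1000, 64):.1f} s for 1000 tokens at 64k")
```

Treat the output as an order-of-magnitude guide: long thinking-mode answers add their reasoning tokens to n_tokens, so real latency can be noticeably higher.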

I want the larger 32B or 30B-MoE sibling

Qwen3-32B at Q4_K_M is ~19 GB on disk and does fit a 24 GB card — swap qwen3:14b for qwen3:32b in any Ollama command (and expect 34.4 tok/s at 16k context per the hardware-corner.net 4090 "Biggest LLMs You Can Run" panel). Qwen3-30B-A3B (MoE) is ~22B-equivalent resident weights — all expert params must stay in VRAM per the Qwen3 model card. The Qwen3-32B recipe lives at /recipes once seeded.