Qwen3-14B on RTX 3090 Ti: Q4_K_M GGUF via Ollama or llama.cpp

What You'll Build

A local Qwen3-14B chat / reasoning assistant running on a 24 GB RTX 3090 Ti, served through Ollama (or llama.cpp / LM Studio — same GGUF, three loaders). The recipe pins the dense 14.8B variant at Q4_K_M quantization (8.38 GB on disk per the unsloth/Qwen3-14B-GGUF Files tab), which leaves roughly 15 GB of headroom on the 3090 Ti for Qwen3's 32k-native context window, the thinking-mode chain of thought, and a comfortable KV cache.

Hardware data: RTX 3090 Ti (24 GB VRAM) · Q4_K GGUF · 76.2 tokens/s generation at 4k context · See benchmark data

⚠️ Variant pinned — Qwen3 ships 8 sizes from the same Qwen org. Per the Ollama qwen3:14b tag list, Qwen3 spans 0.6b, 1.7b, 4b, 8b, 14b (this recipe), 30b (MoE), 32b, and 235b (MoE). The siblings have wildly different VRAM profiles — Qwen3-32B in Q4_K_M is ~19 GB and still fits the 3090 Ti, but Qwen3-14B in BF16 is ~28 GB and does NOT fit 24 GB per the official Qwen speed benchmark (28,402 MB memory footprint at input length 1, growing to 33,336 MB at 30k context). The instructions below are for the dense 14.8B model only at Q4_K_M; if you want the BF16 path you'll need offloading or a 32 GB+ card. For the 30B/235B MoE siblings, all expert params must be resident in VRAM — see the Qwen3-14B model card for the dense/MoE split.

ℹ️ Thinking mode is on by default. Qwen3-14B has a built-in chain-of-thought ("thinking") mode that the model card's quickstart enables via enable_thinking=True. Output starts with a <think>...</think> block followed by the user-facing answer. To disable for latency-sensitive use, send /no_think in your prompt or pass enable_thinking=False in the chat template.

Requirements

Component	Minimum	Tested
GPU	10 GB VRAM (Q4_K_M weights)	RTX 3090 Ti (24 GB)
RAM	16 GB system	—
Storage	8.38 GB (Q4_K_M GGUF) or 14.62 GB (Q8_0)	per unsloth/Qwen3-14B-GGUF
Driver	CUDA 12.x (Ampere sm_86)	—
Runtime	Ollama 0.6.6+ (earlier releases predate Qwen3 support) / llama.cpp / LM Studio	—

The model is released under Apache 2.0 — commercial use is permitted. Note that BF16 (~28 GB resident per the Qwen speed benchmark) does NOT fit a 24 GB 3090 Ti — pick one of the GGUF tiers below.

Installation

The fastest path is Ollama, where one command pulls the Q4_K_M build published under the qwen3 tag in the Ollama library:

Option A — Ollama (recommended)

1. Install Ollama

# macOS / Linux
curl -fsSL https://ollama.com/install.sh | sh

(Windows: download from ollama.com/download.) Per the Qwen3 model card, applications such as Ollama, LMStudio, MLX-LM, llama.cpp, and KTransformers have also added Qwen3 support.

2. Pull the 14B model

ollama pull qwen3:14b

This fetches a Q4_K_M checkpoint per the Ollama qwen3:14b tag (14.8B parameters). The download is one file — no manual quant-tier selection needed.

Option B — llama.cpp + community GGUF

If you want a higher-quality quant (Q6_K, Q8_0) or the imatrix-tuned Unsloth Dynamic 2.0 ladder, use a community redistributor that publishes the full per-tier table. The unsloth/Qwen3-14B-GGUF repo lists Qwen/Qwen3-14B explicitly as its base_model with link-back to the upstream model card.

1. Install llama.cpp

# macOS (Homebrew)
brew install llama.cpp

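# Linux: pre-built release binaries (llama.cpp ships binaries, not Python wheels)
# Visit https://github.com/ggml-org/llama.cpp/releases for cu12x binaries;
# if none is published for your platform, build from source with cmake -DGGML_CUDA=ON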

2. Pull the quant you want

Per-tier file sizes from the unsloth/Qwen3-14B-GGUF Files tab:

Quant	File size	Notes
Q4_K_M	8.38 GB	recommended for this card
Q5_K_M	9.79 GB	better quality, still comfortable
Q6_K	11.29 GB	high-fidelity, plenty of headroom
Q8_0	14.62 GB	near-lossless
UD-Q4_K_XL	8.53 GB	Unsloth Dynamic 2.0 imatrix-tuned
UD-Q8_K_XL	17.47 GB	Unsloth Dynamic 2.0, near-lossless
BF16	27.51 GB	full precision — does NOT fit 24 GB

Then via the llama.cpp Hugging Face shortcut (per the unsloth model card):

# OpenAI-compatible local server with web UI
llama-server -hf unsloth/Qwen3-14B-GGUF:UD-Q4_K_XL

# Interactive terminal
llama-cli -hf unsloth/Qwen3-14B-GGUF:UD-Q4_K_XL

Flash Attention 2 has full kernel coverage on Ampere (sm_86), unlike the Blackwell sm_120 gap. Add -fa 1 to your llama.cpp invocations to enable GGML Flash Attention (older builds take a bare -fa flag with no value). It reduces memory pressure at longer contexts and gives a small throughput uplift on this size class.
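To compare the tiers in the per-tier table above against the card's 24 GB, the subtraction can be scripted. This is a minimal sketch: the sizes are copied from that table, while the 1 GB runtime reserve and the helper names (headroom_gb, tiers_that_fit) are illustrative assumptions, not measurements.

```python
# Per-tier file sizes copied from the quant table above (GB as listed there).
QUANT_SIZES_GB = {
    "Q4_K_M": 8.38,
    "Q5_K_M": 9.79,
    "Q6_K": 11.29,
    "Q8_0": 14.62,
    "UD-Q4_K_XL": 8.53,
    "UD-Q8_K_XL": 17.47,
    "BF16": 27.51,
}

CARD_VRAM_GB = 24.0  # RTX 3090 Ti


def headroom_gb(quant: str, reserve_gb: float = 1.0) -> float:
    """VRAM left for the KV cache after the weights and a runtime reserve.

    reserve_gb is an assumed allowance for CUDA context and scratch buffers.
    """
    return CARD_VRAM_GB - QUANT_SIZES_GB[quant] - reserve_gb


def tiers_that_fit(min_headroom_gb: float = 0.0) -> list[str]:
    """Tiers whose weights fit with at least min_headroom_gb to spare."""
    return [q for q in QUANT_SIZES_GB if headroom_gb(q) >= min_headroom_gb]


if __name__ == "__main__":
    for quant in QUANT_SIZES_GB:
        free = headroom_gb(quant)
        verdict = "fits" if free >= 0 else "does NOT fit"
        print(f"{quant:<11} {free:6.2f} GB headroom  {verdict}")
```

On-disk size is only a proxy for resident size, so treat the output as a first-pass filter rather than a guarantee.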

Option C — LM Studio (GUI)

LM Studio offers a one-click install path per the Qwen3-14B HF card. Search "Qwen3-14B GGUF" inside the app and pick the Q4_K_M tier (or Q8_0 if you want near-lossless output; at 14.62 GB it still leaves ~9 GB free).

Running

One-shot prompt via Ollama

ollama run qwen3:14b "Explain the difference between MoE and dense transformer architectures in three sentences."

First run loads the model into VRAM (~9 GB resident at idle for Q4_K_M, growing as the KV cache fills with longer contexts). Subsequent prompts in the same session stay warm.
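How fast the KV cache grows can be estimated from the attention shape. The sketch below assumes Qwen3-14B uses 40 layers, 8 key/value heads, and a head dimension of 128 with a 16-bit cache; those values are not stated in this recipe, so check them against the model's config.json before relying on the result.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AttentionShape:
    # Assumed Qwen3-14B values; verify against config.json.
    layers: int = 40
    kv_heads: int = 8
    head_dim: int = 128


def kv_bytes_per_token(shape: AttentionShape, bytes_per_elem: int = 2) -> int:
    """Bytes of KV cache added per token (factor 2 covers keys and values)."""
    return 2 * shape.layers * shape.kv_heads * shape.head_dim * bytes_per_elem


def kv_cache_gib(context_tokens: int,
                 shape: AttentionShape = AttentionShape(),
                 bytes_per_elem: int = 2) -> float:
    """KV cache size in GiB for a fully used context of context_tokens."""
    return context_tokens * kv_bytes_per_token(shape, bytes_per_elem) / 2**30


if __name__ == "__main__":
    for ctx in (4096, 16384, 32768):
        print(f"{ctx:>6} tokens: {kv_cache_gib(ctx):.3f} GiB of KV cache")
```

Under these assumptions a full 32,768-token context adds about 5 GiB on top of the ~9 GB of weights, which is why the Q4_K_M tier stays comfortable on a 24 GB card.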

Disable thinking mode for short answers

ollama run qwen3:14b "/no_think What's the capital of France?"

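Per the Qwen3-14B HF card, /no_think is a soft switch: it tells the model to skip reasoning for that turn, with the same practical effect as enable_thinking=False. The template-level flag itself is unchanged, so the reply may still begin with an empty <think></think> block.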
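If you post-process replies yourself, the <think>...</think> prefix can be separated from the user-facing answer with a small helper. This is a sketch only; split_thinking and with_no_think are hypothetical names, and it assumes at most one think block per reply.

```python
import re

# Non-greedy match so only the first <think>...</think> block is captured.
_THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_thinking(text: str) -> tuple[str, str]:
    """Return (reasoning, answer); reasoning is '' when no block is present."""
    match = _THINK_RE.search(text)
    if match is None:
        return "", text.strip()
    reasoning = match.group(1).strip()
    answer = (text[:match.start()] + text[match.end():]).strip()
    return reasoning, answer


def with_no_think(prompt: str) -> str:
    """Prefix the /no_think soft switch unless the prompt already has it."""
    if prompt.lstrip().startswith("/no_think"):
        return prompt
    return f"/no_think {prompt}"


if __name__ == "__main__":
    reasoning, answer = split_thinking("<think>\n\n</think>\n\nParis.")
    print(repr(reasoning), repr(answer))
```

An empty block, as returned for /no_think turns, yields an empty reasoning string and the bare answer.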

OpenAI-compatible HTTP API

# Ollama exposes localhost:11434 by default
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3:14b",
    "messages": [{"role": "user", "content": "Write a haiku about Ampere GPUs."}]
  }'
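The same request can be sent from Python with only the standard library. This sketch mirrors the curl call above and assumes the response follows the usual OpenAI chat-completions shape (choices[0].message.content); build_payload, extract_reply, and chat are illustrative names.

```python
import json
import urllib.request

# Default Ollama address, as in the curl example above.
OLLAMA_URL = "http://localhost:11434/v1/chat/completions"


def build_payload(prompt: str, model: str = "qwen3:14b") -> dict:
    """Request body with a single user message."""
    return {"model": model, "messages": [{"role": "user", "content": prompt}]}


def extract_reply(response: dict) -> str:
    """Pull the assistant text out of an OpenAI-style response (assumed shape)."""
    return response["choices"][0]["message"]["content"]


def chat(prompt: str, url: str = OLLAMA_URL, timeout: float = 300.0) -> str:
    """POST the prompt to a running Ollama server and return the reply text."""
    request = urllib.request.Request(
        url,
        data=json.dumps(build_payload(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return extract_reply(json.load(resp))


# Usage (requires `ollama serve` running with qwen3:14b pulled):
#   print(chat("Write a haiku about Ampere GPUs."))
```

The generous timeout is a deliberate choice: the first request also pays the model load time, and thinking-mode replies can be long.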

For higher throughput / production-style serving, the upstream Qwen3-14B card documents vllm serve Qwen/Qwen3-14B --enable-reasoning --reasoning-parser deepseek_r1 and python -m sglang.launch_server --model-path Qwen/Qwen3-14B --reasoning-parser qwen3. Both load BF16 weights by default (~28 GB per the official speed benchmark), which overflows the 3090 Ti's 24 GB. For the 3090 Ti, the Ollama / llama.cpp GGUF path is the comfortable one; if you want vLLM/SGLang, AWQ-INT4 (9,962 MB at length 1 per the same benchmark) is the lightest of the benchmarked precisions.

Results

Speed: 76.2 tokens/s generation at 4k context, Q4_K quantization, measured on RTX 3090 Ti, per the hardware-corner.net RTX 3090 Ti LLM benchmark table row labelled "Qwen3 14B (Q4_K)", surfaced via /check/qwen3-14b/rtx-3090-ti. Generation rate decays to 56.9 tok/s at 16k, 42.1 tok/s at 32k, and 28.2 tok/s at 64k as the KV cache grows. Prompt processing on the same row is much faster: 2,817.1 tok/s at 4k context, dropping to 828.5 tok/s at 64k. The 3090 Ti is ~9% faster than the vanilla 3090 on this workload (sibling row reports 70.0 tok/s gen at 4k), consistent with the Ti's higher GDDR6X bandwidth, since token generation on this size class is memory-bandwidth-bound. Expect ~90% of the RTX 4090's 84.4 tok/s at the same context length: the 4090 has a much higher SM count but the same rated memory bandwidth as the 3090 Ti (1,008 GB/s), so its lead in generation is smaller than the Ada-vs-Ampere generation gap would suggest.
VRAM usage: Q4_K_M weights occupy ~9 GB of the 24 GB card at idle; the rest is KV cache headroom the runtime expands with context. The official Qwen speed benchmark on H20 corroborates the precision/VRAM ladder for Qwen3-14B in Transformers: AWQ-INT4 = 9,962 MB at length 1 / 15,323 MB at 30k context, FP8 = 16,012 MB / 20,813 MB, BF16 = 28,402 MB / 33,336 MB — only the int4 / FP8 / Q4_K_M GGUF paths fit a 24 GB card. See /check/qwen3-14b/rtx-3090-ti for community-contributed measurements as they land.
Quality notes: Q4_K_M is the community-default "sweet spot." On a 24 GB 3090 Ti you have plenty of room to upgrade to Q6_K (11.29 GB) or Q8_0 (14.62 GB) for near-lossless output; both leave 9–13 GB free for KV cache and activations even at 32k context. The Unsloth Dynamic 2.0 UD-Q8_K_XL tier (17.47 GB on disk) also fits, leaving roughly 6.5 GB for KV cache, and is the highest-fidelity option on this card short of dual-GPU BF16. There is no quality reason to pick anything below Q4_K_M on this card.
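The throughput figures in the Speed paragraph above translate into wall-clock time with simple division. The sketch below uses only the cited generation rates; the function names are illustrative, and real replies also pay prompt-processing time, which is ignored here.

```python
# Generation rates (tok/s) cited above for Qwen3 14B (Q4_K) on the RTX 3090 Ti.
GEN_TOK_S = {"4k": 76.2, "16k": 56.9, "32k": 42.1, "64k": 28.2}

# Sibling rows cited above, both at 4k context.
RTX_3090_TOK_S = 70.0
RTX_4090_TOK_S = 84.4


def seconds_to_generate(n_tokens: int, context: str = "4k") -> float:
    """Time to emit n_tokens at the cited rate for the given context depth."""
    return n_tokens / GEN_TOK_S[context]


def slowdown(context: str, baseline: str = "4k") -> float:
    """How many times slower generation is at `context` than at `baseline`."""
    return GEN_TOK_S[baseline] / GEN_TOK_S[context]


if __name__ == "__main__":
    for ctx in GEN_TOK_S:
        print(f"{ctx:>3}: {seconds_to_generate(1000, ctx):5.1f} s per 1,000 tokens, "
              f"{slowdown(ctx):.2f}x slower than 4k")
    print(f"vs RTX 3090: {GEN_TOK_S['4k'] / RTX_3090_TOK_S - 1:+.1%}")
    print(f"vs RTX 4090: {GEN_TOK_S['4k'] / RTX_4090_TOK_S:.0%} of its rate")
```

A 1,000-token thinking-mode reply therefore takes about 13 seconds at 4k context and about 35 seconds at 64k.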

For the full benchmark data and other-GPU comparisons, see /check/qwen3-14b/rtx-3090-ti.

Troubleshooting

Ollama returns `Error: model requires more system memory` or hangs on load

Confirm a recent NVIDIA driver and CUDA 12.x runtime are installed (nvidia-smi should show a driver from the past 12 months). The RTX 3090 Ti uses the Ampere architecture (sm_86) which has been fully supported by mainline CUDA wheels since 2020 — no special build flags or wheel pinning are required. If Ollama still appears to hang on first load, watch nvidia-smi -l 1 in another terminal to confirm the GPU is actually being used; if it stays at 0% utilization, reinstall Ollama and re-pull the model.

`<think>...</think>` output is bloating responses

Qwen3 enables thinking mode by default per the HF card quickstart. Send /no_think at the start of any user message to disable it for that turn, or pass enable_thinking=False if you're calling the chat-template API directly. Per the model card best-practices note: for thinking mode use Temperature=0.6, TopP=0.95, TopK=20, MinP=0 and do not use greedy decoding — it triggers endless repetitions.

vLLM / SGLang server crashes with CUDA OOM at startup

vLLM and SGLang default to BF16 weights for Qwen/Qwen3-14B, which require ~28 GB resident per the official speed benchmark and exceed the 3090 Ti's 24 GB. Either (a) use AWQ-INT4 weights (~10 GB resident), or (b) drop to Ollama/llama.cpp with the Q4_K_M GGUF this recipe is built around. FP8 caveat for the 3090 Ti: an FP8 safetensors file will load on Ampere because PyTorch dequantizes the FP8 weights to BF16/FP16 on the fly — but the 3090 Ti has no FP8 tensor cores (FP8 first shipped on Hopper sm_90 and consumer Ada sm_89), so you will see the VRAM saving (~16 GB resident) without the throughput boost Ada-class cards get. Prefer AWQ-INT4 or GGUF Q4_K_M for the best speed/quality trade on this card.
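A quick pre-flight check against the benchmark footprints shows which precisions can start at all. This is a rough sketch: the footprints are the H20 Transformers numbers quoted in this recipe, not 3090 Ti measurements under vLLM, and treating the card as 24 × 1024 MB of usable memory is an optimistic assumption.

```python
# (MB at input length 1, MB at ~30k context) from the official Qwen speed benchmark.
FOOTPRINT_MB = {
    "AWQ-INT4": (9_962, 15_323),
    "FP8": (16_012, 20_813),
    "BF16": (28_402, 33_336),
}

CARD_MB = 24 * 1024  # assumed usable VRAM on a 24 GB card; real headroom is lower


def fits(precision: str, long_context: bool = False, margin_mb: int = 0) -> bool:
    """True if the cited footprint plus a safety margin fits on the card."""
    short_mb, long_mb = FOOTPRINT_MB[precision]
    needed = long_mb if long_context else short_mb
    return needed + margin_mb <= CARD_MB


if __name__ == "__main__":
    for precision in FOOTPRINT_MB:
        print(f"{precision:<8} short: {fits(precision)}  "
              f"30k: {fits(precision, long_context=True)}")
```

Passing a margin_mb of a few thousand MB gives a more conservative answer; FP8 at 30k context, for instance, stops fitting once the margin exceeds about 3.7 GB.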

Using transformers directly instead of Ollama

If you bypass Ollama / llama.cpp and run the HF card quickstart via transformers directly with torch_dtype="auto", device_map="auto", you will load BF16 weights and hit OOM on a 24 GB 3090 Ti (28,402 MB at length 1 per the Qwen benchmark). The quickstart does not hardcode attn_implementation="flash_attention_2", so once you do fit (AWQ-INT4 mirror, or 32 GB+ card), it works on the 3090 Ti with a stock pip install torch — Ampere sm_86 has full FA2 kernel coverage if you opt into FA2 separately.

Generation slows dramatically past 32k context

32k is Qwen3-14B's native context window per the HF card, which lists it as extensible to 131,072 tokens with YaRN. Going past 32k requires that YaRN extension, supported in llama.cpp via --rope-scaling yarn --rope-scale 4 --yarn-orig-ctx 32768 per the Qwen3 model card, but quality degrades and the KV cache balloons. For long-doc workflows, prefer chunking + retrieval over pushing context past 32k. The slowdown also starts well before that limit: the hardware-corner.net benchmark shows the generation rate falling from 76.2 tok/s at 4k to 42.1 tok/s at 32k and 28.2 tok/s at 64k context on this card.
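The YaRN scale in the flags above is the target context divided by the native 32,768 tokens (131,072 / 32,768 = 4). A small sketch of that arithmetic follows; yarn_args is a hypothetical helper, and rounding the scale up to a whole number for intermediate targets is an assumption rather than something this recipe specifies.

```python
import math

NATIVE_CTX = 32_768      # Qwen3-14B native window
MAX_YARN_CTX = 131_072   # upper bound listed on the model card


def yarn_args(target_ctx: int) -> list[str]:
    """llama.cpp context flags for target_ctx, adding YaRN only past 32k."""
    if target_ctx <= 0 or target_ctx > MAX_YARN_CTX:
        raise ValueError(f"target_ctx must be in 1..{MAX_YARN_CTX}")
    args = ["-c", str(target_ctx)]
    if target_ctx > NATIVE_CTX:
        scale = math.ceil(target_ctx / NATIVE_CTX)
        args += ["--rope-scaling", "yarn",
                 "--rope-scale", str(scale),
                 "--yarn-orig-ctx", str(NATIVE_CTX)]
    return args


if __name__ == "__main__":
    print(" ".join(["llama-server", "-hf", "unsloth/Qwen3-14B-GGUF:UD-Q4_K_XL",
                    *yarn_args(131_072)]))
```

Leaving YaRN off at or below 32k keeps short-context behaviour unchanged, which matters because the scaling applies to every request once enabled.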

I want the larger 32B or 30B-MoE sibling

Qwen3-32B at Q4_K_M is ~19 GB on disk and does fit a 24 GB 3090 Ti: swap qwen3:14b for qwen3:32b in any Ollama command (and expect lower tok/s than the 14B; see /check/qwen3-32b/rtx-3090-ti for the cited 38.0 tok/s at 4k from the same Hardware Corner Ti page). Qwen3-30B-A3B (MoE) activates only about 3B parameters per token, but all of its roughly 30B parameters must stay in VRAM per the Qwen3 model card, so budget for it like a 30B model rather than a 3B one. Note that both Qwen3-32B at BF16 (~65 GB) and Qwen3-14B at BF16 (~28 GB) overflow the 3090 Ti; sticking to Q4_K_M / Q5_K_M / Q6_K / Q8_0 GGUFs is the right call for any model in this family on a 24 GB card.