How much VRAM does Qwen3 14B need?

About 12 GB — the minimum this recipe targets.

How hard is this setup?

Beginner: follow the steps below.

Qwen3-14B on RTX 3060 12GB: Q4_K_M GGUF via Ollama or llama.cpp

What You'll Build

A local Qwen3-14B chat / reasoning assistant running on a 12 GB RTX 3060, served through Ollama (or llama.cpp / LM Studio: same GGUF, three loaders). The recipe pins the dense 14.8B variant at Q4_K_M quantization (9.00 GB on disk), which fits the RTX 3060's 12 GB envelope but leaves it tight: about 3 GB nominally for KV cache and activations, and closer to 1–2 GB once a display is attached. Hardware Corner measured this exact card running Qwen3-14B Q4_K at up to 16k context fully in VRAM; beyond that you cap context (or drop to Q3_K_M) to stay inside the envelope.

Hardware data: RTX 3060 (12 GB GDDR6, 360 GB/s) · Q4_K GGUF · 31.2 tokens/s generation at 4k context · See benchmark data

⚠️ This is a tight fit on 12 GB. Q4_K_M (9.00 GB) loads, but on paper that leaves only about 3 GB for the KV cache and activations, and less in practice: a 12 GB desktop GPU driving a display exposes only ~10.5–11.3 GB usable, which cuts the headroom to roughly 1.5–2.3 GB. The practical context ceiling is therefore much lower than a 16 GB card's. Per the Hardware Corner RTX 3060 12GB benchmark, the 3060 runs Qwen3-14B Q4_K "up to 16k context fully in VRAM" and the 32k column is unmeasured (it does not fit). Cap --ctx-size and/or quantize the KV cache, or drop to Q3_K_M (7.32 GB) for more context headroom. See Picking a quant on 12 GB below.

⚠️ Variant pinned — Qwen3 ships 8 sizes from the same Qwen org. Per the Ollama qwen3:14b tag list, Qwen3 spans 0.6b, 1.7b, 4b, 8b, 14b (this recipe), 30b (MoE), 32b, and 235b (MoE). The siblings have very different VRAM profiles — Qwen3-8B in Q4_K_M is ~5 GB; Qwen3-32B in Q4_K_M is ~20 GB and overflows even a 16 GB card; the 30B/235B MoE variants need every expert resident in VRAM (the router can't pre-prune), far past this card. The instructions below are for the dense 14.8B model only. For 32B+ on this card, see /contribute.

ℹ️ Thinking mode is on by default — and it eats KV cache. Qwen3-14B has a built-in chain-of-thought ("thinking") mode that the model card's quickstart enables via enable_thinking=True. Output starts with a <think>...</think> block — often 2k–4k tokens on hard math / coding problems — followed by the user-facing answer. That <think> trace grows the KV cache far faster than a plain chat turn, which is the single most common way to OOM this 12 GB card. To disable for latency-sensitive use, send /no_think in your prompt or pass enable_thinking=False in the chat template.

Requirements

Component	Minimum	Tested
GPU	12 GB VRAM (Q4_K_M weights ~9 GB + KV — tight; cap context)	RTX 3060 (12 GB, 360 GB/s)
RAM	16 GB system	—
Storage	9.00 GB (Q4_K_M GGUF) or 7.32 GB (Q3_K_M)	per unsloth/Qwen3-14B-GGUF
Driver	CUDA 12.x runtime (Ampere sm_86)	—
Runtime	Ollama 0.5+ / llama.cpp / LM Studio	—

The model is released under Apache 2.0 (HF license: apache-2.0, ungated) — commercial use is permitted.

Installation

The fastest path is Ollama — one command pulls the canonical Q4_K_M build:

Option A — Ollama (recommended)

1. Install Ollama

# macOS / Linux
curl -fsSL https://ollama.com/install.sh | sh

(Windows: download from ollama.com/download.) Per the Qwen3-14B model card, "For local use, applications such as Ollama, LMStudio, MLX-LM, llama.cpp, and KTransformers have also supported Qwen3."

2. Pull the 14B model

ollama pull qwen3:14b

This fetches a 9.3 GB checkpoint (14.8B parameters, Q4_K_M) per the Ollama qwen3:14b tag, slightly larger than the 9.00 GB unsloth Q4_K_M file used elsewhere in this recipe. The download is a single file, so no manual quant-tier selection is needed. On a 12 GB card, run it with a capped context (see Running below) so the KV cache doesn't push you into OOM.

Option B — llama.cpp + community GGUF

If you want a smaller quant tier for more KV headroom (Q3_K_M), use a community redistributor that publishes the full ladder. The unsloth/Qwen3-14B-GGUF repo lists Qwen/Qwen3-14B explicitly as its base_model with link-back to the upstream model card.

1. Install llama.cpp

# macOS (Homebrew)
brew install llama.cpp

# Linux: pre-built CUDA binaries
# Visit https://github.com/ggml-org/llama.cpp/releases for cu12x binaries

2. Pull the quant you want

Per-tier file sizes from the unsloth/Qwen3-14B-GGUF Files tab (decimal GB, as HuggingFace displays them):

Quant	File size	Notes on a 12 GB card
Q3_K_M	7.32 GB	best context headroom — ~2.5–3 GB free for KV cache
Q4_K_S	8.57 GB	slightly smaller than Q4_K_M, a bit more KV room
Q4_K_M	9.00 GB	recommended quality/size — but tight; cap context
Q5_K_M	10.51 GB	leaves almost nothing for KV on 12 GB — headless only
Q6_K	12.12 GB	weights alone exceed the card; does NOT fit 12 GB
Q8_0	15.70 GB	does NOT fit a 12 GB card
BF16	29.54 GB	full precision — does NOT fit a 12 GB card

The key difference from a 16 GB card: there, Q4_K_M through Q6_K all leave KV room; on 12 GB only Q3_K_M / Q4_K_S / Q4_K_M leave any usable KV budget, and even Q4_K_M needs a capped context.
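The headroom column can be sanity-checked with simple subtraction. The sketch below assumes the resident weight size equals the on-disk file size and ignores compute buffers and activations, so real free space will be somewhat smaller than it reports. The function names and the 1.0 GB minimum-headroom threshold are illustrative assumptions, not values from the benchmark.

```python
# File sizes (decimal GB) copied from the quant table above.
QUANT_SIZES_GB = {
    "Q3_K_M": 7.32,
    "Q4_K_S": 8.57,
    "Q4_K_M": 9.00,
    "Q5_K_M": 10.51,
    "Q6_K": 12.12,
    "Q8_0": 15.70,
    "BF16": 29.54,
}


def headroom_gb(quant: str, usable_vram_gb: float) -> float:
    """VRAM left after the weights load; negative means the weights do not fit."""
    return round(usable_vram_gb - QUANT_SIZES_GB[quant], 2)


def fits(quant: str, usable_vram_gb: float, min_headroom_gb: float = 1.0) -> bool:
    """True if the quant leaves at least min_headroom_gb (an assumed threshold)."""
    return headroom_gb(quant, usable_vram_gb) >= min_headroom_gb


if __name__ == "__main__":
    # 10.5 and 11.3 GB are the usable range this recipe quotes for a card
    # driving a display; 12.0 GB is the nominal headless figure.
    for usable in (10.5, 11.3, 12.0):
        row = ", ".join(f"{q}={headroom_gb(q, usable):+.2f}" for q in QUANT_SIZES_GB)
        print(f"usable {usable:.1f} GB -> {row}")
```

With 11.3 GB usable, Q4_K_M leaves 2.3 GB and Q5_K_M under 1 GB, which lines up with the "headless only" note in the table.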

Then via the llama.cpp Hugging Face shortcut (per the unsloth model card):

# OpenAI-compatible local server with web UI — cap context for 12 GB
llama-server -hf unsloth/Qwen3-14B-GGUF:Q4_K_M --ctx-size 4096 --flash-attn

# More context headroom: drop to Q3_K_M
llama-server -hf unsloth/Qwen3-14B-GGUF:Q3_K_M --ctx-size 8192 --flash-attn

--flash-attn is safe to enable here: the RTX 3060 is Ampere (sm_86), the oldest architecture with full prebuilt FlashAttention-2 kernel coverage — no special wheel selection or override is needed. Hardware Corner ran the 3060 Qwen3-14B benchmark below on llama.cpp with CUDA, so this is the matching path.

Option C — LM Studio (GUI)

LM Studio offers a one-click install path — the Qwen3 family is in its supported-runtime list per the Qwen3-14B HF card. Search "Qwen3-14B GGUF" inside the app and pick the Q3_K_M or Q4_K_M tier (the larger Q5/Q6/Q8 tiers do not leave KV room on 12 GB), or use the direct-import link from unsloth/Qwen3-14B-GGUF.

Running

One-shot prompt via Ollama

ollama run qwen3:14b "Explain GQA attention in three sentences."

First run loads the model into VRAM (~9 GB resident at idle for Q4_K_M, growing as the KV cache fills with longer contexts). On a 12 GB card that idle footprint is already most of the card — keep an eye on nvidia-smi.
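To watch that footprint from a script rather than by eye, one option is to parse nvidia-smi's CSV query output. This is a minimal sketch: the helper names are hypothetical, and it assumes the `--query-gpu=memory.used,memory.total --format=csv,noheader,nounits` output is one `used, total` row in MiB per GPU.

```python
import subprocess

# nvidia-smi query that prints one "used, total" row (MiB) per GPU.
QUERY = [
    "nvidia-smi",
    "--query-gpu=memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_memory_line(line: str) -> tuple[int, int]:
    """Parse one 'used, total' CSV row into integers (MiB)."""
    used, total = (int(field.strip()) for field in line.split(","))
    return used, total


def free_mib(line: str) -> int:
    """MiB still free on the GPU described by one CSV row."""
    used, total = parse_memory_line(line)
    return total - used


def query_first_gpu() -> tuple[int, int]:
    """Run nvidia-smi and return (used_mib, total_mib) for the first GPU listed."""
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
    return parse_memory_line(out.splitlines()[0])
```

Calling `query_first_gpu()` before and after a long prompt shows how much of the remaining headroom the KV cache has consumed.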

Cap the context window (important on 12 GB)

# llama.cpp — keep KV cache small so you don't OOM
llama-server -hf unsloth/Qwen3-14B-GGUF:Q4_K_M --ctx-size 4096 --flash-attn

# halve KV memory with quantized cache (lets you push context further)
llama-server -hf unsloth/Qwen3-14B-GGUF:Q4_K_M --ctx-size 8192 \
  --cache-type-k q8_0 --cache-type-v q8_0 --flash-attn
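The "halve KV memory" claim can be checked with a rough estimate. The sketch below uses the Qwen3-14B attention shape quoted in this recipe (40 layers, 8 KV heads, head_dim 128) and assumes q8_0 follows the GGML block layout of 32 int8 values plus one fp16 scale (34 bytes per 32 elements). The function names are illustrative, and the estimate ignores compute buffers.

```python
# Qwen3-14B attention shape as quoted in this recipe (from the HF card).
LAYERS, KV_HEADS, HEAD_DIM = 40, 8, 128

# Bytes per cached element. f16 is 2 bytes. The q8_0 figure is an assumption
# based on the GGML block layout: 34 bytes for every 32 elements.
BYTES_PER_ELEMENT = {"f16": 2.0, "q8_0": 34 / 32}


def kv_bytes_per_token(cache_type: str) -> float:
    """KV cache bytes per token; the factor 2 covers separate K and V tensors."""
    return 2 * LAYERS * KV_HEADS * HEAD_DIM * BYTES_PER_ELEMENT[cache_type]


def max_context_tokens(kv_budget_gb: float, cache_type: str = "f16") -> int:
    """Largest token count whose KV cache fits in kv_budget_gb (decimal GB)."""
    return int(kv_budget_gb * 1e9 // kv_bytes_per_token(cache_type))
```

Under these assumptions q8_0 costs about 53% of the fp16 cache, so a 1.5 GB KV budget holds roughly 9k tokens at fp16 and roughly 17k at q8_0.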

Disable thinking mode for short answers

ollama run qwen3:14b "/no_think What's the capital of France?"

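Per the Qwen3-14B HF card, /no_think is a soft switch: it turns thinking off for that turn while enable_thinking=True stays set in the chat template, so the model skips the chain-of-thought (a <think>...</think> block may still appear, but empty). That also keeps the KV cache from ballooning on this tight card.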

OpenAI-compatible HTTP API

# Ollama exposes localhost:11434 by default
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3:14b",
    "messages": [{"role": "user", "content": "Write a haiku about Ampere GPUs."}]
  }'
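The same request can be made from Python with only the standard library. This is a sketch under assumptions: the helper names are hypothetical, the endpoint and model tag are the ones used in the curl call above, and the temperature / top_p values are the thinking-mode settings this recipe quotes from the model card, applied only when thinking is left on.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/v1/chat/completions"  # Ollama default port


def build_chat_request(prompt: str, think: bool = True) -> urllib.request.Request:
    """Build (but do not send) an OpenAI-style chat request for qwen3:14b."""
    # Prefixing /no_think is the per-turn soft switch described in this recipe.
    content = prompt if think else f"/no_think {prompt}"
    payload = {
        "model": "qwen3:14b",
        "messages": [{"role": "user", "content": content}],
    }
    if think:
        # Thinking-mode sampling values quoted from the model card.
        payload.update(temperature=0.6, top_p=0.95)
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt: str, think: bool = True) -> str:
    """Send the request to a running Ollama server and return the reply text."""
    with urllib.request.urlopen(build_chat_request(prompt, think)) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

With Ollama running, `chat("Write a haiku about Ampere GPUs.", think=False)` should return the reply as a string; building the request separately makes the payload easy to inspect without a server.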

For higher-throughput / production-style serving, the upstream Qwen3-14B card documents vllm serve Qwen/Qwen3-14B and python -m sglang.launch_server --model-path Qwen/Qwen3-14B --reasoning-parser qwen3, but both default to BF16 weights (29.54 GB), which do not fit a 12 GB card. Even FP8 starts at 16,012 MB per the official Qwen speed benchmark and overflows 12 GB outright. For local serving on this GPU prefer Ollama / llama.cpp with the Q4_K_M (or Q3_K_M) GGUF.

Picking a quant on 12 GB (the binding constraint)

On the RTX 3060, VRAM — not compute — is the binding constraint, and the KV cache is what tips you over. Qwen3-14B is 40 layers with GQA (8 KV heads, head_dim 128) per the HF card, so its KV cache costs ~160 KiB per token at fp16 — about 0.67 GB per 4k tokens, doubling at 8k, 16k, and so on. On ~11 GB of usable VRAM:

Q4_K_M (9.00 GB) is the quality sweet spot but leaves only ~1–2 GB for KV cache + activations once a display is attached. Hardware Corner's measured run reached 16k context fully in VRAM at this quant on the 3060 but did not reach 32k (that column is unmeasured), so treat ~16k as the practical Q4_K_M ceiling on a headless card and step it down toward ~4k–8k once a monitor is attached. Run it with --ctx-size 4096 (or push higher with --cache-type-k q8_0 --cache-type-v q8_0 to roughly halve cache memory) and you're fine; let the loader allocate the model's full 32K native window and you risk OOM.
Q3_K_M (7.32 GB) is the longer-context / safer alternative for this card per the unsloth/Qwen3-14B-GGUF tier table — the ~1.7 GB you save over Q4_K_M goes straight into KV-cache budget, pushing the practical window further (more with quantized KV). Quality is slightly below Q4_K_M but it's the pragmatic choice if you need real context on 12 GB.
Q5_K_M (10.51 GB) and above do not leave room for a usable KV cache on 12 GB — they're headless-only edge cases at best; Q6_K (12.12 GB) and larger don't even fit the weights.
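The per-token KV figure above follows directly from the attention shape. As a worked check (a sketch, with illustrative function names, ignoring activations and compute buffers):

```python
def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int,
                       bytes_per_value: int = 2) -> int:
    """Bytes of KV cache per token: one K and one V tensor for every layer."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value


def kv_cache_gb(context_tokens: int, per_token_bytes: int) -> float:
    """KV cache size in decimal GB for a given context length."""
    return context_tokens * per_token_bytes / 1e9


# Qwen3-14B shape as quoted above: 40 layers, 8 KV heads, head_dim 128, fp16.
QWEN3_14B = kv_bytes_per_token(layers=40, kv_heads=8, head_dim=128)

if __name__ == "__main__":
    weights_gb = 9.00  # Q4_K_M file size used throughout this recipe
    print(f"{QWEN3_14B / 1024:.0f} KiB per token")
    for ctx in (4096, 8192, 16384, 32768):
        kv = kv_cache_gb(ctx, QWEN3_14B)
        print(f"{ctx:>6} tokens: KV {kv:.2f} GB, weights + KV {weights_gb + kv:.2f} GB")
```

This gives 160 KiB per token and about 0.67 GB at 4k. Weights plus KV come to roughly 11.7 GB at 16k and 14.4 GB at 32k, which is consistent with 16k fitting only on a headless card and 32k not fitting at all.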

This is the key difference from the 16 GB siblings: a 16 GB card runs Q4_K_M well past 16k context, but a 12 GB card tops out around 16k (per Hardware Corner's 3060 measurement). You trade context for the privilege of running the bigger 14B model at all. Measure the real ceiling on your own setup with nvidia-smi and submit the measured peak via /contribute.

Results

Speed: 31.2 tokens/s generation at 4k context, Q4_K quantization, measured on the RTX 3060 12GB, per the hardware-corner.net RTX 3060 12GB LLM benchmark table row labelled "Qwen3 14B (Q4_K)", surfaced via /check/qwen3-14b/rtx-3060. Generation slows to 22.7 tok/s at 16k as the KV cache grows; the table's 32k+ columns are unmeasured for this model on this card (it does not fit). Prompt processing (prefill) is a separate, much faster metric: 972.6 tok/s at 4k and 678.2 tok/s at 16k on the same row. Prefill measures how fast the model ingests your prompt; token generation measures how fast it writes the reply. Token generation is memory-bandwidth-bound, and at 360 GB/s the 3060 has well under half the bandwidth of a 24 GB Ampere card, so these rates are correspondingly lower than on higher-tier siblings. Note that these are chat-class throughput figures; for thinking-mode workloads where most of the output is a discarded <think> block, effective throughput per useful answer is lower because the model emits far more tokens.
VRAM usage: Q4_K_M weights occupy ~9 GB at idle; the KV cache and activations grow on top, filling the 12 GB card — which is why context must be capped. The backend lists this pair as verdict: runs with a token-generation benchmark at /check/qwen3-14b/rtx-3060; the 12 GB envelope here is anchored on the 9.00 GB Q4_K_M on-disk size (unsloth tree) plus KV cache, and corroborated by the Hardware Corner row running Qwen3-14B Q4_K "up to 16k context fully in VRAM" on this exact card. The official Qwen speed benchmark gives the transformers-path precision ladder for Qwen3-14B: AWQ-INT4 = 9,962 MB at length 1 / 15,323 MB at 30k context, FP8 = 16,012 MB / 20,813 MB, BF16 = 28,402 MB / 33,336 MB — on a 12 GB card only the int4 / Q3_K_M / Q4_K_M GGUF paths fit with KV headroom; FP8 and BF16 overflow, and even AWQ-INT4 overflows once context grows toward 30k.
Quality notes: Q4_K_M is the community-default "sweet spot," but on this 12 GB card Q3_K_M is the more practical default if you need context past ~16k. The 14.8B-parameter dense model (13.2B non-embedding, 40 layers, GQA 40 query / 8 KV heads per the HF card) is a meaningful quality step up from Qwen3-8B at the cost of roughly 1.8× the generation latency on the same card (the Hardware Corner 3060 row shows Qwen3-8B Q4_K at 55.2 tok/s @ 4k vs the 14B's 31.2). For thinking mode, the card recommends Temperature=0.6, TopP=0.95, TopK=20, MinP=0 and warns against greedy decoding, which can cause endless repetitions.
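Two of the speed claims above can be put into numbers. The sketch below uses a common rule of thumb, stated here as an assumption: a bandwidth-bound decoder reads every weight once per generated token, so generation speed is capped near bandwidth divided by model size. It also estimates how a discarded <think> trace dilutes useful throughput. The function names and the 300 / 3000 token split are hypothetical.

```python
def bandwidth_ceiling_tok_s(bandwidth_gb_s: float, weights_gb: float) -> float:
    """Rough upper bound on tok/s if each token reads all weights once."""
    return bandwidth_gb_s / weights_gb


def useful_tok_s(answer_tokens: int, think_tokens: int, gen_tok_s: float) -> float:
    """Answer tokens per second once the discarded <think> trace is counted."""
    total_seconds = (answer_tokens + think_tokens) / gen_tok_s
    return answer_tokens / total_seconds


if __name__ == "__main__":
    # 360 GB/s and 9.00 GB come from this recipe; 31.2 tok/s is the 4k figure.
    print(f"ceiling: {bandwidth_ceiling_tok_s(360, 9.00):.1f} tok/s")
    print(f"useful:  {useful_tok_s(300, 3000, 31.2):.2f} tok/s")
```

Under these assumptions the ceiling is 40 tok/s, so the measured 31.2 tok/s is about 78% of it. A 300-token answer preceded by a 3,000-token reasoning trace takes about 106 seconds, or under 3 useful tokens per second.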

For the full benchmark data and other-GPU comparisons, see /check/qwen3-14b/rtx-3060.

Troubleshooting

Out of memory after the prompt grows / at long context

This is the most common failure on 12 GB. Q4_K_M weights are ~9 GB resident before any context — once the KV cache fills, you hit the ceiling. Hardware Corner's measured ceiling for this card at Q4_K is 16k context fully in VRAM, and that assumes a headless card; with a monitor attached, expect to cap lower. Fixes, in order: (1) cap context with --ctx-size 4096; (2) quantize the KV cache with --cache-type-k q8_0 --cache-type-v q8_0 --flash-attn (roughly halves KV memory); (3) drop to Q3_K_M (7.32 GB) per the unsloth GGUF table to free ~1.7 GB for KV; (4) if you still need long context, move to a 16 GB+ card. Disabling thinking mode (/no_think) also helps — <think> traces inflate the KV cache, and on a hard problem a 2k–4k token reasoning trace alone can be the difference between fitting and OOM.

Ollama returns `Error: model requires more system memory` or hangs on load

This error can mean Ollama could not place the model on the GPU and is trying to load it into system RAM instead, so start by checking the GPU stack. Confirm a recent NVIDIA driver and CUDA 12.x runtime are installed (nvidia-smi should show a driver from the past 12 months). The RTX 3060 uses the Ampere architecture (sm_86), which has been fully supported by mainline CUDA wheels since 2020: the default cu124/cu121 PyTorch wheel works and no special build flags, cu128 selection, or wheel pinning are required (that requirement is specific to Blackwell sm_120 cards, which the 3060 is not). If Ollama still appears to hang on first load, watch nvidia-smi -l 1 in another terminal to confirm the GPU is actually being used; if it stays at 0% utilization, reinstall Ollama and re-pull the model.

Using transformers directly instead of Ollama

If you bypass Ollama / llama.cpp and run the HF card quickstart via transformers directly with torch_dtype="auto", device_map="auto", you will load BF16 weights (28,402 MB at length 1 per the Qwen speed benchmark) and hit OOM on a 12 GB 3060 — and even an AWQ-INT4 mirror (9,962 MB at length 1) overflows once context grows toward 30k (15,323 MB). The quickstart does not hardcode attn_implementation="flash_attention_2", so on the off chance you fit a quantized precision it runs out of the box with a stock pip install torch; Ampere sm_86 has full prebuilt FA2 kernel coverage if you opt into FA2 separately. There is no Blackwell-style cu128/sm_120 wheel gap on this card. Note that the 3060 has no FP8 tensor cores (FP8 first shipped on Hopper sm_90 and consumer Ada sm_89), so an FP8 safetensors file would only load by dequantizing to BF16/FP16 on the fly — a VRAM behaviour, not a speed win — and at 16,012 MB it overflows 12 GB anyway, so GGUF on Ollama/llama.cpp remains the path that fits.

`<think>...</think>` output is bloating responses

Qwen3 enables thinking mode by default per the HF card quickstart. Send /no_think at the start of any user message to disable it for that turn, or pass enable_thinking=False if you're calling the chat-template API directly. Reasoning traces also consume KV cache: a long <think> block at high context pushes memory usage up, which on this 12 GB card can be the difference between fitting and OOM.
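If you want to keep thinking on but hide the trace from users, the block can be removed client-side. This is a minimal sketch with hypothetical helper names; it only cleans up the displayed text, so the reasoning tokens are still generated and still occupy KV cache.

```python
import re

# Matches one <think>...</think> block, including newlines inside it.
_THINK_BLOCK = re.compile(r"<think>.*?</think>\s*", flags=re.DOTALL)


def strip_think(text: str) -> str:
    """Return the reply with any <think>...</think> blocks removed."""
    return _THINK_BLOCK.sub("", text).strip()


def split_think(text: str) -> tuple[str, str]:
    """Return (reasoning, answer); reasoning is '' when no block is present."""
    match = re.search(r"<think>(.*?)</think>", text, flags=re.DOTALL)
    reasoning = match.group(1).strip() if match else ""
    return reasoning, strip_think(text)
```

`split_think` is useful when you want to log the reasoning separately while showing only the answer.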

Generation slows past 16k context, or you need more than 32k

32k is Qwen3-14B's native context window per the HF card, which lists it as extensible to 131,072 tokens with YaRN (--rope-scaling yarn --rope-scale 4 --yarn-orig-ctx 32768 in llama.cpp). But on 12 GB you will run out of KV cache before 32k at Q4_K_M — Hardware Corner's 3060 measurement tops out at 16k for this model (the 32k column is unmeasured). For long-document workflows on this card, prefer Q3_K_M with quantized KV plus chunking + retrieval over pushing raw context. The hardware-corner.net RTX 3060 12GB benchmark shows generation already falling from 31.2 tok/s at 4k to 22.7 tok/s at 16k on this card.
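The YaRN scale factor is the target window divided by the native one (131,072 / 32,768 = 4). The helper below is a hypothetical sketch that assembles the llama.cpp flags quoted above for a given target. It is illustrative only on this card, since the KV cache runs out well before 32k at Q4_K_M.

```python
def yarn_args(target_ctx: int, native_ctx: int = 32768) -> list[str]:
    """Build llama.cpp context flags, adding YaRN scaling past the native window."""
    if target_ctx <= native_ctx:
        return ["--ctx-size", str(target_ctx)]  # fits natively, no scaling needed
    scale = target_ctx / native_ctx
    return [
        "--ctx-size", str(target_ctx),
        "--rope-scaling", "yarn",
        "--rope-scale", f"{scale:g}",  # e.g. 4.0 is rendered as "4"
        "--yarn-orig-ctx", str(native_ctx),
    ]
```

`yarn_args(131072)` reproduces the `--rope-scaling yarn --rope-scale 4 --yarn-orig-ctx 32768` combination from the HF card, with the matching `--ctx-size` prepended.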