
Qwen3-14B on RTX 4070: Q4_K_M GGUF via Ollama or llama.cpp

llm · beginner · 12 GB+ VRAM · Jun 5, 2026
prerequisites
  • NVIDIA RTX 4070 (12 GB GDDR6X) or equivalent 12 GB CUDA card
  • Recent NVIDIA driver with CUDA 12.x runtime (Ada sm_89 — the default `cu124` wheel works; no special wheel selection required)
  • ~9 GB free disk for the Q4_K_M GGUF checkpoint (or ~7.3 GB for Q3_K_M)
  • Ollama, llama.cpp, or LM Studio installed

What You'll Build

A local Qwen3-14B chat / reasoning assistant running on a 12 GB RTX 4070, served through Ollama (or llama.cpp / LM Studio: same GGUF, three loaders). The recipe pins the dense 14.8B variant at Q4_K_M quantization (9.00 GB on disk), which fits the RTX 4070's 12 GB envelope but leaves it tight: nominally ~3 GB for KV cache and activations, and closer to ~1.5–2.3 GB once a display is attached and only ~10.5–11.3 GB is usable. The realistic context ceiling here is well below a 16 GB card's; you cap context (or drop to Q3_K_M) to stay inside the envelope.

Hardware data: RTX 4070 (12 GB GDDR6X) · Q4_K GGUF · 42.5 tokens/s generation at 4k context · See benchmark data

⚠️ This is a tight fit on 12 GB. Q4_K_M (9.00 GB) loads, but that leaves at most ~3 GB for the KV cache and activations, and less in practice: a 12 GB desktop GPU driving a display exposes only ~10.5–11.3 GB usable, so the real budget is closer to ~1.5–2.3 GB and the practical context ceiling is much lower than a 16 GB card's. Cap --ctx-size (4k–8k) and/or quantize the KV cache, or drop to Q3_K_M (7.32 GB) for genuinely more context headroom. See Picking a quant on 12 GB below.

⚠️ Variant pinned — Qwen3 ships 8 sizes from the same Qwen org. Per the Ollama qwen3 tag list, Qwen3 spans 0.6b, 1.7b, 4b, 8b, 14b (this recipe), 30b (MoE), 32b, and 235b (MoE). The siblings have very different VRAM profiles — Qwen3-8B in Q4_K_M is ~5 GB; Qwen3-32B in Q4_K_M is ~20 GB and overflows even a 16 GB card; the 30B/235B MoE variants need every expert resident in VRAM (the router can't pre-prune), far past this card. The instructions below are for the dense 14.8B model only. For 32B+ on this card, see /contribute.

ℹ️ Thinking mode is on by default — and it eats KV cache. Qwen3-14B has a built-in chain-of-thought ("thinking") mode that the model card's quickstart enables via enable_thinking=True (the card notes "True is the default value for enable_thinking"). Output starts with a <think>...</think> block — often 2k–4k tokens on hard math / coding problems — followed by the user-facing answer. That <think> trace grows the KV cache far faster than a plain chat turn, which is the single most common way to OOM this 12 GB card. To disable for latency-sensitive use, send /no_think in your prompt or pass enable_thinking=False in the chat template.

Requirements

| Component | Minimum | Tested |
|---|---|---|
| GPU | 12 GB VRAM (Q4_K_M weights ~9 GB + KV; tight, cap context) | RTX 4070 (12 GB) |
| RAM | 16 GB system | |
| Storage | 9.00 GB (Q4_K_M GGUF) or 7.32 GB (Q3_K_M) | per unsloth/Qwen3-14B-GGUF |
| Driver | CUDA 12.x runtime (Ada sm_89) | |
| Runtime | Ollama 0.5+ / llama.cpp / LM Studio | |

The model is released under Apache 2.0 (HF license: apache-2.0, ungated) — commercial use is permitted.

Installation

The fastest path is Ollama — one command pulls the canonical Q4_K_M build:

Option A — Ollama (recommended)

1. Install Ollama

# macOS / Linux
curl -fsSL https://ollama.com/install.sh | sh

(Windows: download from ollama.com/download.) Per the Qwen3-14B model card, "For local use, applications such as Ollama, LMStudio, MLX-LM, llama.cpp, and KTransformers have also supported Qwen3."

2. Pull the 14B model

ollama pull qwen3:14b

This fetches a 9.3 GB checkpoint (14.8B parameters, Q4_K_M) per the Ollama qwen3:14b tag, slightly larger than the 9.00 GB unsloth Q4_K_M file. The download is one file, so no manual quant-tier selection is needed. On a 12 GB card, run it with a capped context (see Running below) so the KV cache doesn't push you into OOM.

Option B — llama.cpp + community GGUF

If you want a smaller quant tier for more KV headroom (Q3_K_M), use a community redistributor that publishes the full ladder. The unsloth/Qwen3-14B-GGUF repo explicitly lists Qwen/Qwen3-14B as its base_model, with a link back to the upstream model card.

1. Install llama.cpp

# macOS (Homebrew)
brew install llama.cpp

# Linux: prebuilt binaries
# Visit https://github.com/ggml-org/llama.cpp/releases for CUDA 12.x builds;
# if none is listed for your platform, build from source with -DGGML_CUDA=ON

2. Pull the quant you want

Per-tier file sizes from the unsloth/Qwen3-14B-GGUF Files tab (decimal GB, as HuggingFace displays them):

| Quant | File size | Notes on a 12 GB card |
|---|---|---|
| Q3_K_M | 7.32 GB | best context headroom; ~2.5–3 GB free for KV cache |
| Q4_K_S | 8.57 GB | slightly smaller than Q4_K_M, a bit more KV room |
| Q4_K_M | 9.00 GB | recommended quality/size, but tight; cap context |
| Q5_K_M | 10.51 GB | leaves almost nothing for KV on 12 GB; headless only |
| Q6_K | 12.12 GB | weights alone exceed the card; does NOT fit 12 GB |
| Q8_0 | 15.70 GB | does NOT fit a 12 GB card |
| BF16 | 29.54 GB | full precision; does NOT fit a 12 GB card |

The key difference from a 16 GB card: there, Q4_K_M through Q6_K all leave KV room; on 12 GB only Q3_K_M / Q4_K_S / Q4_K_M leave any usable KV budget, and even Q4_K_M needs a capped context.
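The notes column above is mostly subtraction. A minimal Python sketch, assuming ~11 GB of usable VRAM (an illustrative figure inside the 10.5–11.3 GB range quoted earlier, not a measurement) and ignoring activations and compute buffers, which take a further slice in practice:

```python
# File sizes (decimal GB) copied from the quant table above.
QUANT_SIZES_GB = {
    "Q3_K_M": 7.32,
    "Q4_K_S": 8.57,
    "Q4_K_M": 9.00,
    "Q5_K_M": 10.51,
    "Q6_K": 12.12,
    "Q8_0": 15.70,
    "BF16": 29.54,
}


def kv_headroom_gb(usable_vram_gb: float) -> dict[str, float]:
    """Return the GB left for KV cache and activations per quant.

    Negative values mean the weights alone do not fit.
    """
    return {
        name: round(usable_vram_gb - size, 2)
        for name, size in QUANT_SIZES_GB.items()
    }


if __name__ == "__main__":
    # 11.0 GB is an assumed usable figure, not a measurement.
    for name, free in kv_headroom_gb(11.0).items():
        status = "fits" if free > 0 else "does not fit"
        print(f"{name:7s} {free:6.2f} GB free ({status})")
```

With 11.0 GB usable this gives about 2 GB of headroom for Q4_K_M and about 3.7 GB for Q3_K_M, while Q6_K and larger come out negative.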

Then launch it via the llama.cpp Hugging Face shortcut (per the unsloth model card):

# OpenAI-compatible local server with web UI — cap context for 12 GB
llama-server -hf unsloth/Qwen3-14B-GGUF:Q4_K_M --ctx-size 4096 --flash-attn

# More context headroom: drop to Q3_K_M
llama-server -hf unsloth/Qwen3-14B-GGUF:Q3_K_M --ctx-size 8192 --flash-attn

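--flash-attn is safe to enable here: the RTX 4070 is Ada Lovelace (sm_89), which has full prebuilt FlashAttention kernel coverage, so unlike Blackwell consumer cards it needs no special build selection or override. Newer llama.cpp builds may expect an explicit value (--flash-attn on, off, or auto); if the bare flag is rejected, pass --flash-attn on.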

Option C — LM Studio (GUI)

LM Studio offers a one-click install path — the Qwen3 family is in its supported-runtime list per the Qwen3-14B HF card. Search "Qwen3-14B GGUF" inside the app and pick the Q3_K_M or Q4_K_M tier (the larger Q5/Q6/Q8 tiers do not leave KV room on 12 GB), or use the direct-import link from unsloth/Qwen3-14B-GGUF.

Running

One-shot prompt via Ollama

ollama run qwen3:14b "Explain GQA attention in three sentences."

First run loads the model into VRAM (~9 GB resident at idle for Q4_K_M, growing as the KV cache fills with longer contexts). On a 12 GB card that idle footprint is already most of the card — keep an eye on nvidia-smi.
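To watch that footprint from a script rather than by eye, one option is to poll nvidia-smi's CSV query mode. A small Python sketch; the helper names are hypothetical and it assumes nvidia-smi is on PATH:

```python
import subprocess


def parse_memory_csv(line: str) -> tuple[int, int]:
    """Parse one 'used, total' line (values in MiB) from nvidia-smi CSV output."""
    used, total = (int(field.strip()) for field in line.split(","))
    return used, total


def gpu_memory_mib(gpu_index: int = 0) -> tuple[int, int]:
    """Query used and total VRAM in MiB for one GPU via nvidia-smi."""
    out = subprocess.run(
        [
            "nvidia-smi",
            f"--id={gpu_index}",
            "--query-gpu=memory.used,memory.total",
            "--format=csv,noheader,nounits",
        ],
        capture_output=True,
        text=True,
        check=True,
    ).stdout
    return parse_memory_csv(out.strip().splitlines()[0])


# Example (requires an NVIDIA GPU and driver):
#   used, total = gpu_memory_mib()
#   print(f"{used} / {total} MiB used")
```

Calling gpu_memory_mib() before and after a long prompt shows how much the KV cache added on top of the idle weights.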

Cap the context window (important on 12 GB)

# llama.cpp — keep KV cache small so you don't OOM
llama-server -hf unsloth/Qwen3-14B-GGUF:Q4_K_M --ctx-size 4096 --flash-attn

# halve KV memory with quantized cache (lets you push context further)
llama-server -hf unsloth/Qwen3-14B-GGUF:Q4_K_M --ctx-size 8192 \
  --cache-type-k q8_0 --cache-type-v q8_0 --flash-attn
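To see why a quantized cache lets the context roughly double, here is a back-of-the-envelope Python estimate. It assumes the Qwen3-14B geometry reported on the HF card (40 layers, 8 KV heads, head_dim 128) and ggml's q8_0 block layout (32 int8 values plus one fp16 scale, 34 bytes per block); real usage adds compute buffers on top:

```python
# Assumed geometry from the Qwen3-14B card: 40 layers, 8 KV heads, head_dim 128.
LAYERS, KV_HEADS, HEAD_DIM = 40, 8, 128

# Bytes per cached element. q8_0 stores 32 int8 values plus one fp16 scale
# in a 34-byte block, so it costs slightly more than 1 byte per element.
BYTES_PER_ELEMENT = {"f16": 2.0, "q8_0": 34 / 32}


def kv_cache_mib(ctx_tokens: int, cache_type: str = "f16") -> float:
    """Estimate KV cache size in MiB for a context length and cache type."""
    elements_per_token = LAYERS * KV_HEADS * HEAD_DIM * 2  # K and V
    total_bytes = ctx_tokens * elements_per_token * BYTES_PER_ELEMENT[cache_type]
    return total_bytes / 2**20


if __name__ == "__main__":
    for ctx, cache_type in [(4096, "f16"), (8192, "f16"), (8192, "q8_0")]:
        size = kv_cache_mib(ctx, cache_type)
        print(f"ctx={ctx:5d} {cache_type:5s} {size:7.1f} MiB")
```

Under these assumptions an 8k context with a q8_0 cache needs about 680 MiB, close to the 640 MiB that a 4k context needs at f16.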

Disable thinking mode for short answers

ollama run qwen3:14b "/no_think What's the capital of France?"

Per the Qwen3-14B HF card, /no_think is a per-turn soft switch: thinking stays enabled in the chat template (enable_thinking=True), but the model skips the reasoning for that request and emits at most an empty <think></think> block. That also keeps the KV cache from ballooning on this tight card.

OpenAI-compatible HTTP API

# Ollama exposes localhost:11434 by default
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3:14b",
    "messages": [{"role": "user", "content": "Write a haiku about Ada Lovelace GPUs."}]
  }'
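The same request can be sent from Python using only the standard library. This is a sketch, assuming Ollama is running on its default port; build_request and chat are illustrative helper names, not part of any Ollama SDK:

```python
import json
import urllib.request

# Ollama's default OpenAI-compatible endpoint, as in the curl call above.
OLLAMA_URL = "http://localhost:11434/v1/chat/completions"


def build_request(prompt: str, model: str = "qwen3:14b") -> urllib.request.Request:
    """Build the same POST request the curl example sends."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt: str, model: str = "qwen3:14b", timeout: float = 300.0) -> str:
    """Send one user message and return the assistant's reply text."""
    request = build_request(prompt, model)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]


# Example (requires a running Ollama server with qwen3:14b pulled):
#   print(chat("Write a haiku about Ada Lovelace GPUs."))
```

The generous timeout is a deliberate choice: the first call also pays for loading ~9 GB of weights into VRAM.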

For higher-throughput / production-style serving, the upstream Qwen3-14B card documents vllm serve Qwen/Qwen3-14B and python -m sglang.launch_server --model-path Qwen/Qwen3-14B --reasoning-parser qwen3, but both default to BF16 weights (29.54 GB), which do not fit a 12 GB card. Even FP8 starts at 16,012 MB per the official Qwen speed benchmark and overflows 12 GB outright. For local serving on this GPU prefer Ollama / llama.cpp with the Q4_K_M (or Q3_K_M) GGUF.

Picking a quant on 12 GB (the binding constraint)

On the RTX 4070, VRAM (not compute) is the binding constraint, and the KV cache is what tips you over. Qwen3-14B is 40 layers with GQA (8 KV heads, head_dim 128) per the HF card, so its KV cache costs 2 (K and V) × 40 × 8 × 128 × 2 bytes = 160 KiB per token at fp16. That is about 0.67 GB per 4k tokens and scales linearly with context: ~1.3 GB at 8k, ~2.7 GB at 16k, ~5.4 GB at 32k. On ~11 GB of usable VRAM:

  • Q4_K_M (9.00 GB) is the quality sweet spot but leaves only ~1–2 GB for KV cache + activations once a display is attached. That is a practical context window of roughly 4k–8k tokens at fp16 KV (derived from the per-token KV cost above), landing nearer 8k once you add --cache-type-k q8_0 --cache-type-v q8_0 to roughly halve cache memory. Run it with --ctx-size 4096 (or 8192 with quantized KV) and you're fine; request the full 32K native window and the fp16 cache alone needs ~5.4 GB, which cannot fit on the GPU next to 9 GB of weights.
  • Q3_K_M (7.32 GB) is the longer-context / safer alternative for this card per the unsloth/Qwen3-14B-GGUF tier table — the ~1.7 GB you save over Q4_K_M goes straight into KV-cache budget, pushing the practical window into the ~16k+ token range (more with quantized KV). Quality is slightly below Q4_K_M but it's the pragmatic choice if you need real context on 12 GB.
  • Q5_K_M (10.51 GB) and above do not leave room for a usable KV cache on 12 GB — they're headless-only edge cases at best; Q6_K (12.12 GB) and larger don't even fit the weights.

This is the key difference from the 16 GB siblings: a 16 GB card has room for Q4_K_M plus the ~5.4 GB fp16 KV cache of the full 32K native context (with more to spare under quantized KV), but a 12 GB card does not, so you trade context for the privilege of running the bigger 14B model at all. These context figures are derived from the model's published KV-cache geometry, not a measured benchmark; calibrate the real ceiling on your own setup with nvidia-smi and route a measured peak to /contribute.
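The arithmetic behind these ceilings can be sketched in a few lines of Python. The defaults are the Qwen3-14B geometry quoted above; the free-memory inputs are assumptions you would replace with a figure measured via nvidia-smi, and the result is an upper bound because activations and compute buffers share the same budget:

```python
def kv_bytes_per_token(layers: int = 40, kv_heads: int = 8,
                       head_dim: int = 128, bytes_per_value: int = 2) -> int:
    """Bytes of KV cache per token: K and V for every layer and KV head."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value


def max_context_tokens(free_gb: float, per_token_bytes: int) -> int:
    """Largest token count whose KV cache fits in free_gb (decimal GB)."""
    return int(free_gb * 1e9 // per_token_bytes)


if __name__ == "__main__":
    per_token = kv_bytes_per_token()
    print(per_token // 1024, "KiB per token")
    # 1 GB and 2 GB are the assumed free budgets for Q4_K_M on this card.
    for free_gb in (1.0, 2.0):
        tokens = max_context_tokens(free_gb, per_token)
        print(free_gb, "GB free ->", tokens, "tokens")
```

With 1 GB free this yields roughly 6,100 tokens and with 2 GB roughly 12,200, which is why the practical window lands in the 4k–8k range once other buffers are accounted for.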

Results

  • Speed: 42.5 tokens/s generation at 4k context, Q4_K quantization, measured on the RTX 4070, per the hardware-corner.net RTX 4070 LLM benchmark table row labelled "Qwen3 14B (Q4_K)", surfaced via /check/qwen3-14b/rtx-4070. Generation slows to 32.7 tok/s at 16k as the KV cache grows. Prompt processing (prefill) is a separate, much faster metric: 2,099.9 tok/s at 4k and 1,355.8 tok/s at 16k on the same row. Prefill measures how fast the model ingests your prompt; token generation measures how fast it writes the reply. (The 4070's generation rate is lower than the 16 GB 4070 Ti SUPER's on the same model because token generation is memory-bandwidth-bound and the 4070 has ~25% less bandwidth.) Note these are chat-class throughput figures; for thinking-mode workloads where most of the output is a discarded <think> block, effective throughput per useful answer is lower because the model emits far more tokens.
  • VRAM usage: Q4_K_M weights occupy ~9 GB at idle; the KV cache and activations grow on top, filling the 12 GB card — which is why context must be capped. The backend has no measured 4070 peak yet (verdict unknown at /check/qwen3-14b/rtx-4070), so the 12 GB envelope here is derived from the 9.00 GB Q4_K_M on-disk size (unsloth tree) plus KV cache, and corroborated by the Hardware Corner row running Qwen3-14B Q4_K on this exact 12 GB card. The official Qwen speed benchmark gives the transformers-path precision ladder for Qwen3-14B: AWQ-INT4 = 9,962 MB at length 1 / 15,323 MB at 30k context, FP8 = 16,012 MB / 20,813 MB, BF16 = 28,402 MB / 33,336 MB — on a 12 GB card only the int4 / Q3_K_M / Q4_K_M GGUF paths fit with KV headroom; FP8 and BF16 overflow, and even AWQ-INT4 overflows once context grows toward 30k.
  • Quality notes: Q4_K_M is the community-default "sweet spot," but on this 12 GB card Q3_K_M is the more practical default if you need context. The 14.8B-parameter dense model (13.2B non-embedding, 40 layers, GQA 40 query / 8 KV heads per the HF card) is a meaningful quality step up from Qwen3-8B at the cost of roughly 1.6× the generation latency on the same card. For thinking mode, the card recommends Temperature=0.6, TopP=0.95, TopK=20, MinP=0 and DO NOT use greedy decoding — it can cause endless repetitions.
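To make the thinking-mode caveat concrete, here is a rough Python estimate. The 42.5 tok/s rate is the 4k-context benchmark figure above; the 300-token answer and 3,000-token trace are illustrative assumptions, and the sketch ignores prefill time and the slowdown at longer contexts:

```python
def seconds_per_answer(answer_tokens: int, think_tokens: int,
                       tokens_per_second: float) -> float:
    """Generation time for one reply, including the <think> trace."""
    return (answer_tokens + think_tokens) / tokens_per_second


def effective_tokens_per_second(answer_tokens: int, think_tokens: int,
                                tokens_per_second: float) -> float:
    """Useful (answer-only) tokens delivered per second of generation."""
    total_seconds = seconds_per_answer(
        answer_tokens, think_tokens, tokens_per_second)
    return answer_tokens / total_seconds


if __name__ == "__main__":
    rate = 42.5  # tok/s at 4k context, from the benchmark row cited above
    # Answer and trace lengths below are assumptions for illustration.
    print(round(seconds_per_answer(300, 0, rate), 1), "s without thinking")
    print(round(seconds_per_answer(300, 3000, rate), 1), "s with thinking")
    print(round(effective_tokens_per_second(300, 3000, rate), 2), "useful tok/s")
```

Under these assumptions the same answer takes about 7 s without thinking and about 78 s with it, an effective rate of under 4 useful tokens per second.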

For the full benchmark data and other-GPU comparisons, see /check/qwen3-14b/rtx-4070.

Troubleshooting

Out of memory after the prompt grows / at long context

This is the most common failure on 12 GB. Q4_K_M weights are ~9 GB resident before any context — once the KV cache fills, you hit the ceiling. Fixes, in order: (1) cap context with --ctx-size 4096; (2) quantize the KV cache with --cache-type-k q8_0 --cache-type-v q8_0 --flash-attn (roughly halves KV memory); (3) drop to Q3_K_M (7.32 GB) per the unsloth GGUF table to free ~1.7 GB for KV; (4) if you still need long context, move to a 16 GB+ card. Disabling thinking mode (/no_think) also helps — <think> traces inflate the KV cache, and on a hard problem a 2k–4k token reasoning trace alone can be the difference between fitting and OOM.

Ollama returns Error: model requires more system memory or hangs on load

The memory in this error message is system RAM, not VRAM; it typically appears when Ollama cannot place the model on the GPU and tries to load it on the CPU instead. Confirm a recent NVIDIA driver and CUDA 12.x runtime are installed (nvidia-smi should show a driver from the past 12 months). The RTX 4070 uses the Ada Lovelace architecture (sm_89), which has been fully supported by mainline CUDA builds since 2023. If you also run PyTorch, the default cu124 wheel works and no special build flags, cu128 selection, or wheel pinning are required (that requirement is specific to Blackwell sm_120 cards, which the 4070 is not). If Ollama still appears to hang on first load, watch nvidia-smi -l 1 in another terminal to confirm the GPU is actually being used; if it stays at 0% utilization, reinstall Ollama and re-pull the model.

Using transformers directly instead of Ollama

If you bypass Ollama / llama.cpp and run the HF card quickstart via transformers directly with torch_dtype="auto", device_map="auto", you will load BF16 weights (28,402 MB at length 1 per the Qwen speed benchmark) and hit OOM on a 12 GB 4070 — and even an AWQ-INT4 mirror (9,962 MB at length 1) overflows once context grows toward 30k (15,323 MB). The quickstart does not hardcode attn_implementation="flash_attention_2", so on the off chance you fit a quantized precision it runs out of the box with a stock pip install torch; Ada sm_89 has full prebuilt FA2 kernel coverage if you opt into FA2 separately. There is no Blackwell-style cu128/sm_120 wheel gap on this card, and Ada's tensor cores natively support FP8 — but FP8 weights (16,012 MB) still overflow 12 GB, so GGUF on Ollama/llama.cpp remains the path that fits.

<think>...</think> output is bloating responses

Qwen3 enables thinking mode by default per the HF card quickstart. Send /no_think at the start of any user message to disable it for that turn, or pass enable_thinking=False if you're calling the chat-template API directly. Reasoning traces also consume KV cache — a long <think> block at high context can push memory usage up, which on this 12 GB card is the difference between fitting and OOM.
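If you keep thinking enabled but only want to show the final answer, the trace can be filtered client-side. A minimal Python sketch with hypothetical helper names; note that stripping the block after generation cleans up the display only, it does not recover the KV cache or time the trace already consumed:

```python
import re

# Matches one <think>...</think> block (possibly empty) plus trailing whitespace.
THINK_BLOCK = re.compile(r"<think>.*?</think>\s*", re.DOTALL)


def strip_thinking(text: str) -> str:
    """Remove <think>...</think> blocks and return only the user-facing answer."""
    return THINK_BLOCK.sub("", text).strip()


def no_think(prompt: str) -> str:
    """Prefix a user message with the /no_think soft switch."""
    return f"/no_think {prompt}"


if __name__ == "__main__":
    raw = "<think>\nFrance's capital is Paris.\n</think>\n\nParis."
    print(strip_thinking(raw))
    print(no_think("What's the capital of France?"))
```

Prefixing with no_think is the cheaper option on this card, since it avoids generating the trace in the first place.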

Generation slows past 16k context, or you need more than 32k

32k is Qwen3-14B's native context window per the HF card, which lists it as extensible to 131,072 tokens with YaRN (--rope-scaling yarn --rope-scale 4 --yarn-orig-ctx 32768 in llama.cpp). But on 12 GB you will run out of KV cache long before 32k at Q4_K_M — the realistic ceiling is a few thousand tokens (see Picking a quant on 12 GB). For long-document workflows on this card, prefer Q3_K_M with quantized KV plus chunking + retrieval over pushing raw context. The hardware-corner.net RTX 4070 benchmark shows generation already falling from 42.5 tok/s at 4k to 32.7 tok/s at 16k on this card.