Qwen3-14B on RTX 5070: Q4_K_M GGUF via Ollama or llama.cpp

llm · beginner · 12GB+ VRAM · Jun 4, 2026
prerequisites
  • NVIDIA RTX 5070 (12 GB GDDR7) or equivalent Blackwell sm_120 card
  • Recent NVIDIA driver with CUDA 12.8+ runtime (Blackwell sm_120 — `cu128` PyTorch wheel required if you go the transformers route)
  • ~9 GB free disk for the Q4_K_M GGUF checkpoint (or ~7.3 GB for Q3_K_M)
  • Ollama, llama.cpp, or LM Studio installed

What You'll Build

A local Qwen3-14B chat / reasoning assistant running on a 12 GB RTX 5070, served through Ollama (or llama.cpp / LM Studio — same GGUF, three loaders). The recipe pins the dense 14B variant at Q4_K_M quantization (9.00 GB on disk), which fits the RTX 5070's 12 GB envelope but leaves it tight — only ~2–3 GB for KV cache and activations once a display is attached.

Hardware data: RTX 5070 (12 GB GDDR7) · Q4_K GGUF · 54.2 tokens/s generation at 4k context · See benchmark data

⚠️ This is a tight fit on 12 GB. Q4_K_M (9.00 GB) loads, and the backend benchmark records a 12.0 GB peak at 4k context (see /check/qwen3-14b/rtx-5070) — but that's effectively the whole card. A 12 GB desktop GPU with a monitor attached exposes only ~10.5–11.3 GB usable, so the practical context ceiling here is much lower than a 16 GB card's. Cap --ctx-size (4k–8k) and/or quantize the KV cache, or drop to Q3_K_M (7.32 GB) for genuinely more context headroom. See Picking a quant on 12 GB below.

⚠️ Variant pinned — Qwen3 ships 8 sizes from the same Qwen org. Per the Ollama qwen3 tag list, Qwen3 spans 0.6b, 1.7b, 4b, 8b, 14b (this recipe), 30b (MoE), 32b, and 235b (MoE). The siblings have very different VRAM profiles — Qwen3-8B in Q4_K_M is ~5 GB; Qwen3-32B in Q4_K_M is ~20 GB and overflows even a 16 GB card; the 30B/235B MoE variants need every expert resident in VRAM (the router can't pre-prune), far past this card. The instructions below are for the dense 14.8B model only. For 32B+ on this card, see /contribute.

ℹ️ Thinking mode is on by default. Qwen3-14B supports a built-in chain-of-thought ("thinking") mode that the model card's quickstart enables via enable_thinking=True. Output starts with a <think>...</think> block followed by the user-facing answer. To disable for latency-sensitive use, send /no_think in your prompt or pass enable_thinking=False in the chat template. Reasoning traces also consume KV cache, which matters more on this 12 GB card than on a larger one.

Requirements

Component | Minimum | Tested
GPU | 12 GB VRAM (Q4_K_M weights ~9 GB + KV; tight, cap context) | RTX 5070 (12 GB)
RAM | 16 GB system |
Storage | 9.00 GB (Q4_K_M GGUF) or 7.32 GB (Q3_K_M) | per unsloth/Qwen3-14B-GGUF
Driver | CUDA 12.8+ runtime (Blackwell sm_120) |
Runtime | Ollama 0.5+ / llama.cpp / LM Studio |

The model is released under Apache 2.0 — commercial use is permitted.

Installation

The fastest path is Ollama — one command pulls the canonical Q4_K_M build:

Option A — Ollama (recommended)

1. Install Ollama

# macOS / Linux
curl -fsSL https://ollama.com/install.sh | sh

(Windows: download from ollama.com/download.) Per the Qwen3-14B model card, "For local use, applications such as Ollama, LMStudio, MLX-LM, llama.cpp, and KTransformers have also supported Qwen3."

2. Pull the 14B model

ollama pull qwen3:14b

This fetches a 9.3 GB Q4_K_M checkpoint per the Ollama qwen3:14b tag. The download is one file — no manual quant-tier selection needed. On a 12 GB card, run it with a capped context (see Running below) so the KV cache doesn't push you into OOM.

Option B — llama.cpp + community GGUF

If you want a smaller quant tier for more KV headroom (Q3_K_M), or a bigger one if you go headless, use a community redistributor that publishes the full ladder:

1. Install llama.cpp

# macOS (Homebrew)
brew install llama.cpp

# Linux: pre-built CUDA binaries
# Visit https://github.com/ggml-org/llama.cpp/releases for cu12x binaries

2. Pull the quant you want

Per the unsloth/Qwen3-14B-GGUF per-tier file-size table (upstream Qwen/Qwen3-14B link-back confirmed in the model-card header):

Quant | File size | Notes on a 12 GB card
Q3_K_M | 7.32 GB | best context headroom: ~3–4 GB free for KV cache
Q4_K_S | 8.57 GB | slightly smaller than Q4_K_M, a bit more KV room
Q4_K_M | 9.00 GB | recommended quality/size, but tight; cap context
Q5_K_M | 10.51 GB | leaves almost nothing for KV on 12 GB; headless only
Q8_0 | 15.70 GB | does NOT fit a 12 GB card
BF16 | 29.54 GB | full precision; does NOT fit a 12 GB card

Then launch it via the llama.cpp Hugging Face shortcut (per the unsloth model card):

# OpenAI-compatible local server with web UI — cap context for 12 GB
llama-server -hf unsloth/Qwen3-14B-GGUF:Q4_K_M --ctx-size 4096

# More context headroom: drop to Q3_K_M
llama-server -hf unsloth/Qwen3-14B-GGUF:Q3_K_M --ctx-size 8192

Option C — LM Studio (GUI)

LM Studio offers a one-click install path — the Qwen3 family is in its supported-runtime list on the Qwen3-14B HF card. Search "Qwen3-14B GGUF" inside the app and pick the Q3_K_M or Q4_K_M tier (the larger Q5/Q6/Q8 tiers do not leave KV room on 12 GB), or use the direct-import link lmstudio://open_from_hf?model=unsloth/Qwen3-14B-GGUF.

Running

One-shot prompt via Ollama

ollama run qwen3:14b "Explain GQA attention in three sentences."

First run loads the model into VRAM (~9 GB resident at idle for Q4_K_M, growing as the KV cache fills with longer contexts). On a 12 GB card that idle footprint is already most of the card — keep an eye on nvidia-smi.

Cap the context window (important on 12 GB)

# llama.cpp — keep KV cache small so you don't OOM
llama-server -hf unsloth/Qwen3-14B-GGUF:Q4_K_M --ctx-size 4096

# halve KV memory with quantized cache (lets you push context a bit further)
llama-server -hf unsloth/Qwen3-14B-GGUF:Q4_K_M --ctx-size 8192 \
  --cache-type-k q8_0 --cache-type-v q8_0 --flash-attn
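
To see why the cache flags matter, the KV-cache footprint can be estimated with a few lines of arithmetic. This is a rough sketch, not a measurement: the layer and KV-head counts are the ones the HF card reports for Qwen3-14B, while the head dimension of 128 and the q8_0 layout (32 values packed into 34 bytes) are assumptions that this recipe does not state. Compute buffers and other runtime overhead are ignored.

```python
# Rough KV-cache size estimate for Qwen3-14B (a sketch, not a measurement).
N_LAYERS = 40        # per the HF card
N_KV_HEADS = 8       # per the HF card (GQA: 40 query / 8 KV heads)
HEAD_DIM = 128       # assumption: not stated in this recipe

# Bytes per cached value. f16 is 2 bytes; q8_0 is assumed to pack
# 32 values into 34 bytes (32 int8 values plus one f16 scale).
BYTES_PER_VALUE = {"f16": 2.0, "q8_0": 34 / 32}


def kv_cache_bytes(ctx_tokens: int, cache_type: str = "f16") -> float:
    """Bytes needed to hold keys and values for ctx_tokens tokens."""
    values_per_token = 2 * N_LAYERS * N_KV_HEADS * HEAD_DIM  # K and V
    return ctx_tokens * values_per_token * BYTES_PER_VALUE[cache_type]


def gib(n_bytes: float) -> float:
    """Convert bytes to GiB."""
    return n_bytes / 2**30


if __name__ == "__main__":
    for ctx in (4096, 8192, 32768):
        f16 = gib(kv_cache_bytes(ctx, "f16"))
        q8 = gib(kv_cache_bytes(ctx, "q8_0"))
        print(f"ctx={ctx:>6}: f16 {f16:.2f} GiB, q8_0 {q8:.2f} GiB")
```

Under these assumptions an 8192-token cache at q8_0 costs about the same as a 4096-token cache at f16 (roughly 0.6 to 0.7 GiB each), which is consistent with the "halve KV memory" comment above. A full 32768-token f16 cache comes to about 5 GiB.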

Disable thinking mode for short answers

ollama run qwen3:14b "/no_think What's the capital of France?"

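Per the Qwen3-14B HF card, /no_think is a soft switch: it turns thinking off for that turn while the chat template's enable_thinking setting stays True, so the reply skips the chain-of-thought (the <think>...</think> block may still appear, but empty). That also keeps the KV cache from ballooning on this tight card.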

OpenAI-compatible HTTP API

# Ollama exposes localhost:11434 by default
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3:14b",
    "messages": [{"role": "user", "content": "Write a haiku about Blackwell GPUs."}]
  }'
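
The same request can be issued from Python with only the standard library. The sketch below mirrors the curl call and adds an optional /no_think prefix. The function names and the think parameter are hypothetical helpers introduced here for illustration, and the response parsing assumes the usual OpenAI-style shape (choices[0].message.content).

```python
import json
import urllib.request

# Default Ollama endpoint, as in the curl example above.
OLLAMA_URL = "http://localhost:11434/v1/chat/completions"


def build_chat_request(prompt: str, model: str = "qwen3:14b",
                       think: bool = True,
                       url: str = OLLAMA_URL) -> urllib.request.Request:
    """Build (but do not send) an OpenAI-style chat completion request."""
    # Prefixing /no_think is the soft switch this recipe describes.
    content = prompt if think else f"/no_think {prompt}"
    payload = {"model": model,
               "messages": [{"role": "user", "content": content}]}
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt: str, **kwargs) -> str:
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(prompt, **kwargs)) as resp:
        body = json.load(resp)
    # Assumes the standard OpenAI response shape.
    return body["choices"][0]["message"]["content"]
```

With the Ollama server running, chat("Write a haiku about Blackwell GPUs.", think=False) would send the request and return the reply text. Building the request separately from sending it keeps the payload easy to inspect without a live server.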

For higher-throughput / production-style serving, the upstream Qwen3-14B card documents vllm serve Qwen/Qwen3-14B --enable-reasoning --reasoning-parser deepseek_r1 and python -m sglang.launch_server --model-path Qwen/Qwen3-14B --reasoning-parser qwen3 — both load BF16 weights (29.54 GB), which does not fit a 12 GB card. For local serving on this GPU prefer Ollama / llama.cpp with the Q4_K_M (or Q3_K_M) GGUF.

Picking a quant on 12 GB (the binding constraint)

On the RTX 5070, VRAM — not compute — is the binding constraint, and the KV cache is what tips you over:

  • Q4_K_M (9.00 GB) is the quality sweet spot but leaves only ~2–3 GB for KV cache + activations once a display is attached. The backend benchmark measures the 4k-context config at a 12.0 GB peak (/check/qwen3-14b/rtx-5070), i.e. it fills the card. Run it with --ctx-size 4096 (or 8192 with quantized KV) and you're fine; let the runtime allocate the model's full 32K native context and you will OOM.
  • Q3_K_M (7.32 GB) is the longer-context / safer alternative for this card per the unsloth/Qwen3-14B-GGUF tier table — the 1.68 GB you save over Q4_K_M goes straight into KV-cache budget, so you can run a meaningfully larger context window without OOM. Quality is slightly below Q4_K_M but it's the pragmatic choice if you need more than a few thousand tokens of context on 12 GB.
  • Q5_K_M and above do not leave room for a usable KV cache on 12 GB — they're headless-only edge cases at best.

This is the key difference from the 16 GB siblings: a 16 GB card has ~7 GB left after loading the Q4_K_M weights, enough to hold the KV cache for the full 32K native context, but a 12 GB card does not. You trade context for the privilege of running the bigger 14B model at all.
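
The trade-off can be made concrete by asking how many tokens of KV cache fit beside each quant. The sketch below is a back-of-the-envelope estimate under stated assumptions: an f16 cache, a head dimension of 128 (not stated in this recipe), and a hypothetical fixed overhead_gb allowance for compute buffers. The benchmark peak cited above is higher than this arithmetic suggests, so treat the results as optimistic upper bounds rather than safe settings.

```python
# Largest context that fits next to a given quant (a sketch under assumptions).
# K and V, 40 layers, 8 KV heads, head_dim 128 (assumed), 2 bytes per f16 value.
KV_BYTES_PER_TOKEN_F16 = 2 * 40 * 8 * 128 * 2

# File sizes from the unsloth/Qwen3-14B-GGUF table, in GB.
QUANT_GB = {"Q3_K_M": 7.32, "Q4_K_S": 8.57, "Q4_K_M": 9.00, "Q5_K_M": 10.51}


def max_context_tokens(quant: str, usable_vram_gb: float,
                       overhead_gb: float = 0.5) -> int:
    """Tokens of f16 KV cache that fit after weights and a fixed overhead.

    overhead_gb is a hypothetical allowance for compute buffers.
    """
    free_bytes = (usable_vram_gb - QUANT_GB[quant] - overhead_gb) * 1e9
    return max(0, int(free_bytes // KV_BYTES_PER_TOKEN_F16))


if __name__ == "__main__":
    for usable in (10.5, 11.3, 16.0):
        for quant in QUANT_GB:
            tokens = max_context_tokens(quant, usable)
            print(f"{usable:>5.1f} GB usable, {quant}: ~{tokens} tokens")
```

With 11 GB usable, this estimate gives Q4_K_M room for roughly nine thousand tokens and Q3_K_M roughly twice that, while Q5_K_M has no room at all. That matches the ordering of the bullets above, even if the absolute numbers are generous.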

Results

  • Speed: 54.2 tokens/s generation at 4k context, Q4_K quantization, measured on RTX 5070, per the hardware-corner.net RTX 5070 LLM benchmark table, surfaced via /check/qwen3-14b/rtx-5070 (backend benchmark id=70). Generation slows as the KV cache grows: 40.6 tok/s at 16k on the same page. Prompt processing (prefill) is a separate, much faster metric: 2,144.2 tok/s at 4k context and 1,315.2 tok/s at 16k per the same Hardware Corner row (backend benchmark id=69). Prefill measures how fast the model ingests your prompt; token generation measures how fast it writes the reply. The two are reported separately because they stress different parts of the pipeline. Generation is ~25% slower than on a 16 GB RTX 5070 Ti with the same model because the 5070 has ~25% less memory bandwidth and token generation is memory-bound.
  • VRAM usage: The cited backend benchmark records a 12.0 GB peak at the 4k-context configuration on this card — see /check/qwen3-14b/rtx-5070 (id=69/id=70). That is effectively the full card, which is why context must be capped. At idle the Q4_K_M weights occupy ~9 GB; the remainder is KV cache and activations, which grow with context.
  • Quality notes: Q4_K_M is the community-default "sweet spot," but on this 12 GB card Q3_K_M is the more practical default if you need context. The 14.8B-parameter dense model (13.2B non-embedding, 40 layers, GQA 40 query / 8 KV heads per the HF card) is a meaningful quality step up from Qwen3-8B at the cost of roughly 1.6× the generation latency on the same card.
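
The memory-bound claim in the speed note can be sanity-checked with a simple ceiling: if every generated token requires reading all weights once, tokens/s cannot exceed bandwidth divided by weight size. This is a sketch, not a model of the real pipeline. The bandwidth figures (672 GB/s and 896 GB/s) are assumptions taken from NVIDIA's published specifications rather than from this recipe's benchmark source, and KV-cache reads are ignored.

```python
# Memory-bound generation ceiling (a back-of-the-envelope sketch).
# Bandwidth figures are assumptions from NVIDIA's published specs,
# not from this recipe's benchmark source.
BANDWIDTH_GBPS = {"RTX 5070": 672.0, "RTX 5070 Ti": 896.0}

WEIGHTS_GB = 9.00          # Q4_K_M file size from the quant table
MEASURED_TOK_S = 54.2      # 4k-context generation figure quoted above


def ceiling_tok_s(gpu: str, weights_gb: float = WEIGHTS_GB) -> float:
    """Upper bound if every token requires reading all weights once."""
    return BANDWIDTH_GBPS[gpu] / weights_gb


if __name__ == "__main__":
    ceiling = ceiling_tok_s("RTX 5070")
    ratio = BANDWIDTH_GBPS["RTX 5070"] / BANDWIDTH_GBPS["RTX 5070 Ti"]
    print(f"ceiling: {ceiling:.1f} tok/s")
    print(f"measured / ceiling: {MEASURED_TOK_S / ceiling:.0%}")
    print(f"5070 bandwidth relative to 5070 Ti: {ratio:.0%}")
```

Under these assumptions the ceiling is about 74.7 tok/s, so the measured 54.2 tok/s sits at roughly 73% of it, and the bandwidth ratio between the two cards is 0.75, in line with the ~25% gap quoted above.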

For the full benchmark data and other-GPU comparisons, see /check/qwen3-14b/rtx-5070.

Troubleshooting

Out of memory after the prompt grows / at long context

This is the most common failure on 12 GB. Q4_K_M weights are ~9 GB resident before any context — once the KV cache fills, you hit the ceiling. Fixes, in order: (1) cap context with --ctx-size 4096; (2) quantize the KV cache with --cache-type-k q8_0 --cache-type-v q8_0 --flash-attn (roughly halves KV memory); (3) drop to Q3_K_M (7.32 GB) per the unsloth GGUF table to free ~1.7 GB for KV; (4) if you still need long context, move to a 16 GB+ card. Disabling thinking mode (/no_think) also helps — <think> traces inflate the KV cache.

Ollama returns Error: model requires more system memory or hangs on load

The "requires more system memory" message can mean Ollama did not find a usable GPU and is trying to fit the model into system RAM instead, so check the GPU stack first. Confirm a recent NVIDIA driver and CUDA 12.8+ runtime are installed (nvidia-smi should show a recent Blackwell-capable driver). The RTX 5070 uses the Blackwell architecture (sm_120), which requires CUDA 12.8 or newer; older CUDA builds do not ship sm_120 kernels and fail with a "no kernel image is available for execution on the device" error. Ollama's bundled CUDA runtime handles this for recent builds, but if you compile llama.cpp from source, configure the CMake build with -DGGML_CUDA=ON against CUDA 12.8+ explicitly (older guides use the retired LLAMA_CUDA=1 Makefile flag). Watch nvidia-smi -l 1 in another terminal to confirm the GPU is actually being used.
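
The driver check can be scripted. The sketch below shells out to nvidia-smi with its CSV query mode and flags whether the card reports compute capability 12.0 (sm_120). It assumes the installed nvidia-smi supports the compute_cap query field; parse_gpu_line and query_gpus are hypothetical helper names.

```python
import subprocess

# Fields requested from nvidia-smi; compute_cap is assumed to be supported.
QUERY = "name,compute_cap,driver_version,memory.total,memory.used"


def parse_gpu_line(line: str) -> dict:
    """Parse one CSV row from nvidia-smi --format=csv,noheader,nounits."""
    name, cap, driver, total, used = [f.strip() for f in line.split(",")]
    return {
        "name": name,
        "compute_cap": cap,
        "driver": driver,
        "total_mib": int(total),
        "used_mib": int(used),
        "is_sm_120": cap == "12.0",  # Blackwell consumer cards
    }


def query_gpus() -> list[dict]:
    """Run nvidia-smi and return one parsed dict per GPU."""
    out = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return [parse_gpu_line(row) for row in out.splitlines() if row.strip()]
```

Calling query_gpus() on the machine in question returns the driver version and current memory use alongside the sm_120 flag. A card that does not show up at all points at the driver rather than at Ollama.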

Using transformers directly — FA2 sm_120 wheel gap

If you bypass Ollama / llama.cpp and run the HF card quickstart via transformers directly, mind the FlashAttention-2 sm_120 gap: as of mid-2026, prebuilt FA2 wheels still do not ship sm_120 kernels for Blackwell consumer cards (tracked at Dao-AILab/flash-attention#2168). The Qwen3 quickstart uses torch_dtype="auto" and device_map="auto" without hardcoding attn_implementation="flash_attention_2", so it works out of the box with attn_implementation="eager" or "sdpa". If any third-party snippet you copy hardcodes attn_implementation="flash_attention_2", override it to "sdpa" until FA2 lands sm_120 wheels. Also install the cu128 PyTorch wheel (pip install torch --index-url https://download.pytorch.org/whl/cu128) — the default cu126 wheel does not include sm_120 kernels. Blackwell's tensor cores natively support FP8, so there is no FP8-on-Ampere dequantization penalty on this card. (Note that the BF16 transformers path needs 29.54 GB and does not fit 12 GB regardless — this section is only relevant if you're debugging a quantized transformers setup, not running full-precision weights.)

<think>...</think> output is bloating responses

Qwen3 enables thinking mode by default per the HF card quickstart. Send /no_think at the start of any user message to disable it for that turn, or pass enable_thinking=False if you're calling the chat-template API directly. Reasoning traces also consume KV cache — a long <think> block at high context can push memory usage up, which on this 12 GB card is the difference between fitting and OOM.
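
If you want to keep thinking on but hide the trace from users, the block can be separated on the client side. This is a minimal sketch, assuming the reply begins with a single <think>...</think> block as described above; split_thinking is a hypothetical helper name.

```python
import re

# Matches a leading <think>...</think> block, including an empty one.
_THINK_RE = re.compile(r"^\s*<think>(.*?)</think>\s*", re.DOTALL)


def split_thinking(text: str) -> tuple[str, str]:
    """Return (reasoning, answer); reasoning is '' if no block is present."""
    match = _THINK_RE.match(text)
    if match is None:
        return "", text.strip()
    return match.group(1).strip(), text[match.end():].strip()
```

Storing only the answer part in the conversation history also keeps old reasoning traces from being sent back to the model on later turns, which saves context on this card.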