What You'll Build
A local Qwen3-14B chat / reasoning assistant running on a 12 GB RTX 5070, served through Ollama (or llama.cpp / LM Studio — same GGUF, three loaders). The recipe pins the dense 14B variant at Q4_K_M quantization (9.00 GB on disk), which fits the RTX 5070's 12 GB envelope but leaves it tight — only ~2–3 GB for KV cache and activations once a display is attached.
Hardware data: RTX 5070 (12 GB GDDR7) · Q4_K GGUF · 54.2 tokens/s generation at 4k context · See benchmark data
⚠️ This is a tight fit on 12 GB. Q4_K_M (9.00 GB) loads, and the backend benchmark records a 12.0 GB peak at 4k context (see /check/qwen3-14b/rtx-5070), which is effectively the whole card. A 12 GB desktop GPU with a monitor attached exposes only ~10.5–11.3 GB usable, so the practical context ceiling here is much lower than a 16 GB card's. Cap --ctx-size (4k–8k) and/or quantize the KV cache, or drop to Q3_K_M (7.32 GB) for genuinely more context headroom. See Picking a quant on 12 GB below.
⚠️ Variant pinned: Qwen3 ships 8 sizes from the same Qwen org. Per the Ollama qwen3 tag list, Qwen3 spans 0.6b, 1.7b, 4b, 8b, 14b (this recipe), 30b (MoE), 32b, and 235b (MoE). The siblings have very different VRAM profiles: Qwen3-8B in Q4_K_M is ~5 GB; Qwen3-32B in Q4_K_M is ~20 GB and overflows even a 16 GB card; the 30B/235B MoE variants need every expert resident in VRAM (the router can't pre-prune), far past this card. The instructions below are for the dense 14.8B model only. For 32B+ on this card, see /contribute.
ℹ️ Thinking mode is on by default. Qwen3-14B supports a built-in chain-of-thought ("thinking") mode that the model card's quickstart enables via enable_thinking=True. Output starts with a <think>...</think> block followed by the user-facing answer. To disable it for latency-sensitive use, send /no_think in your prompt or pass enable_thinking=False to the chat template. Reasoning traces also consume KV cache, which matters more on this 12 GB card than on a larger one.
Requirements
| Component | Minimum | Tested |
|---|---|---|
| GPU | 12 GB VRAM (Q4_K_M weights ~9 GB + KV — tight; cap context) | RTX 5070 (12 GB) |
| RAM | 16 GB system | — |
| Storage | 9.00 GB (Q4_K_M GGUF) or 7.32 GB (Q3_K_M) | per unsloth/Qwen3-14B-GGUF |
| Driver | CUDA 12.8+ runtime (Blackwell sm_120) | — |
| Runtime | Ollama 0.5+ / llama.cpp / LM Studio | — |
The model is released under Apache 2.0 — commercial use is permitted.
Installation
The fastest path is Ollama — one command pulls the canonical Q4_K_M build:
Option A — Ollama (recommended)
1. Install Ollama
# macOS / Linux
curl -fsSL https://ollama.com/install.sh | sh
(Windows: download from ollama.com/download.) Per the Qwen3-14B model card, "For local use, applications such as Ollama, LMStudio, MLX-LM, llama.cpp, and KTransformers have also supported Qwen3."
2. Pull the 14B model
ollama pull qwen3:14b
This fetches a 9.3 GB Q4_K_M checkpoint per the Ollama qwen3:14b tag. The download is one file — no manual quant-tier selection needed. On a 12 GB card, run it with a capped context (see Running below) so the KV cache doesn't push you into OOM.
Option B — llama.cpp + community GGUF
If you want a smaller quant tier for more KV headroom (Q3_K_M), or a bigger one if you go headless, use a community redistributor that publishes the full ladder:
1. Install llama.cpp
# macOS (Homebrew)
brew install llama.cpp
# Linux: pre-built CUDA binaries
# Visit https://github.com/ggml-org/llama.cpp/releases for cu12x binaries
2. Pull the quant you want
Per the unsloth/Qwen3-14B-GGUF per-tier file-size table (upstream Qwen/Qwen3-14B link-back confirmed in the model-card header):
| Quant | File size | Notes on a 12 GB card |
|---|---|---|
| Q3_K_M | 7.32 GB | best context headroom — ~3–4 GB free for KV cache |
| Q4_K_S | 8.57 GB | slightly smaller than Q4_K_M, a bit more KV room |
| Q4_K_M | 9.00 GB | recommended quality/size — but tight; cap context |
| Q5_K_M | 10.51 GB | leaves almost nothing for KV on 12 GB — headless only |
| Q8_0 | 15.70 GB | does NOT fit a 12 GB card |
| BF16 | 29.54 GB | full precision — does NOT fit a 12 GB card |
Then via the llama.cpp Hugging Face shortcut (per the unsloth model card):
# OpenAI-compatible local server with web UI — cap context for 12 GB
llama-server -hf unsloth/Qwen3-14B-GGUF:Q4_K_M --ctx-size 4096
# More context headroom: drop to Q3_K_M
llama-server -hf unsloth/Qwen3-14B-GGUF:Q3_K_M --ctx-size 8192
Option C — LM Studio (GUI)
LM Studio offers a one-click install path — the Qwen3 family is in its supported-runtime list on the Qwen3-14B HF card. Search "Qwen3-14B GGUF" inside the app and pick the Q3_K_M or Q4_K_M tier (the larger Q5/Q6/Q8 tiers do not leave KV room on 12 GB), or use the direct-import link lmstudio://open_from_hf?model=unsloth/Qwen3-14B-GGUF.
Running
One-shot prompt via Ollama
ollama run qwen3:14b "Explain GQA attention in three sentences."
First run loads the model into VRAM (~9 GB resident at idle for Q4_K_M, growing as the KV cache fills with longer contexts). On a 12 GB card that idle footprint is already most of the card — keep an eye on nvidia-smi.
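To check the headroom from a script rather than by eye, here is a minimal sketch. It relies on nvidia-smi's CSV query mode (`--query-gpu=memory.used,memory.total`) and prints the free VRAM in MiB. The helper name `vram_headroom_mib` is hypothetical, and the query only runs when nvidia-smi is on the PATH.

```shell
# Hypothetical helper: read "used, total" CSV lines (MiB) on stdin
# and print the free headroom in MiB for each GPU.
vram_headroom_mib() {
  awk -F', *' '{ print $2 - $1 }'
}

# Real GPU query; skipped when nvidia-smi is not installed.
if command -v nvidia-smi >/dev/null 2>&1; then
  nvidia-smi --query-gpu=memory.used,memory.total \
             --format=csv,noheader,nounits | vram_headroom_mib
fi
```

Run it once with the model idle and again mid-conversation; the difference is roughly what the KV cache and compute buffers have claimed.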
Cap the context window (important on 12 GB)
# llama.cpp — keep KV cache small so you don't OOM
llama-server -hf unsloth/Qwen3-14B-GGUF:Q4_K_M --ctx-size 4096
# halve KV memory with quantized cache (lets you push context a bit further)
llama-server -hf unsloth/Qwen3-14B-GGUF:Q4_K_M --ctx-size 8192 \
--cache-type-k q8_0 --cache-type-v q8_0 --flash-attn
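The commands above cap context for llama.cpp. For Ollama, one way to get a comparable cap is a Modelfile that sets the `num_ctx` parameter on top of the pulled model. The sketch below only writes that file; the helper name `write_capped_modelfile`, the file path, and the derived model name `qwen3-14b-4k` are all hypothetical choices for illustration.

```shell
# Hypothetical helper: write a Modelfile that derives a capped-context
# variant from an already-pulled base model.
# Usage: write_capped_modelfile <base-model> <num_ctx> <output-path>
write_capped_modelfile() {
  printf 'FROM %s\nPARAMETER num_ctx %s\n' "$1" "$2" > "$3"
}

write_capped_modelfile qwen3:14b 4096 ./Modelfile.qwen3-14b-4k

# Then register and run the variant (requires a local Ollama install):
#   ollama create qwen3-14b-4k -f ./Modelfile.qwen3-14b-4k
#   ollama run qwen3-14b-4k "Explain GQA attention in three sentences."
```

The derived model reuses the same weights, so it should cost no extra download; only the default context length changes.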
Disable thinking mode for short answers
ollama run qwen3:14b "/no_think What's the capital of France?"
Per the Qwen3-14B HF card, this flips enable_thinking=False for the request, skipping the <think>...</think> chain-of-thought prefix — which also keeps the KV cache from ballooning on this tight card.
OpenAI-compatible HTTP API
# Ollama exposes localhost:11434 by default
curl http://localhost:11434/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "qwen3:14b",
"messages": [{"role": "user", "content": "Write a haiku about Blackwell GPUs."}]
}'
For higher-throughput / production-style serving, the upstream Qwen3-14B card documents vllm serve Qwen/Qwen3-14B --enable-reasoning --reasoning-parser deepseek_r1 and python -m sglang.launch_server --model-path Qwen/Qwen3-14B --reasoning-parser qwen3 — both load BF16 weights (29.54 GB), which does not fit a 12 GB card. For local serving on this GPU prefer Ollama / llama.cpp with the Q4_K_M (or Q3_K_M) GGUF.
Picking a quant on 12 GB (the binding constraint)
On the RTX 5070, VRAM — not compute — is the binding constraint, and the KV cache is what tips you over:
- Q4_K_M (9.00 GB) is the quality sweet spot but leaves only ~2–3 GB for KV cache + activations once a display is attached. The backend benchmark measures the 4k-context config at a 12.0 GB peak (/check/qwen3-14b/rtx-5070), i.e. it fills the card. Run it with --ctx-size 4096 (or 8192 with quantized KV) and you're fine; leave the context at the runtime's 32K default and you will OOM.
- Q3_K_M (7.32 GB) is the longer-context / safer alternative for this card per the unsloth/Qwen3-14B-GGUF tier table. The 1.68 GB you save over Q4_K_M goes straight into KV-cache budget, so you can run a meaningfully larger context window without OOM. Quality is slightly below Q4_K_M, but it's the pragmatic choice if you need more than a few thousand tokens of context on 12 GB.
- Q5_K_M and above do not leave room for a usable KV cache on 12 GB — they're headless-only edge cases at best.
This is the key difference from the 16 GB siblings: a 16 GB card has ~7 GB left after the Q4_K_M weights, enough to run the full 32K native context, but a 12 GB card does not. You trade context for the privilege of running the bigger 14B model at all.
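The trade-off above can be made concrete with back-of-the-envelope KV-cache arithmetic. This is a rough sketch, not a measurement: the 40 layers and 8 KV heads come from the HF card, while `head_dim=128` is an assumption, and q8_0 is approximated as 1 byte per element. It counts the KV cache only; compute buffers and the desktop add more, so treat the results as a lower bound (the cited benchmark records a 12.0 GB peak at 4k).

```shell
# KV cache bytes = 2 (K and V) * layers * kv_heads * head_dim
#                  * bytes_per_element * context_tokens
# Usage: kv_cache_mib <layers> <kv_heads> <head_dim> <bytes_per_elem> <ctx_tokens>
kv_cache_mib() {
  echo $(( 2 * $1 * $2 * $3 * $4 * $5 / 1024 / 1024 ))
}

# Qwen3-14B: 40 layers, 8 KV heads (HF card); head_dim=128 is an assumption.
kv_cache_mib 40 8 128 2 4096    # f16 cache at 4k   -> 640
kv_cache_mib 40 8 128 2 32768   # f16 cache at 32k  -> 5120
kv_cache_mib 40 8 128 1 8192    # ~q8_0 cache at 8k -> 640
```

Under these assumptions an f16 cache at 32K needs about 5 GiB on top of the 9 GB of weights, which is why the full native context is out of reach on 12 GB, and why quantizing the cache to q8_0 lets 8k cost roughly what 4k costs at f16.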
Results
- Speed: 54.2 tokens/s generation at 4k context, Q4_K quantization, measured on RTX 5070, per the hardware-corner.net RTX 5070 LLM benchmark table, surfaced via /check/qwen3-14b/rtx-5070 (backend benchmark id=70). Generation slows as the KV cache grows: 40.6 tok/s at 16k on the same page. Prompt processing (prefill) is a distinct, much faster metric: 2,144.2 tok/s at 4k context and 1,315.2 tok/s at 16k per the same Hardware Corner row (backend benchmark id=69). Prefill measures how fast the model ingests your prompt; token generation measures how fast it writes the reply. The two are reported separately because they stress different parts of the pipeline. (Generation is ~25% slower than a 16 GB RTX 5070 Ti on the same model because the 5070 has ~25% less memory bandwidth, and token generation is memory-bound.)
- VRAM usage: The cited backend benchmark records a 12.0 GB peak at the 4k-context configuration on this card (see /check/qwen3-14b/rtx-5070, id=69/id=70). That is effectively the full card, which is why context must be capped. At idle the Q4_K_M weights occupy ~9 GB; the remainder is KV cache and activations, which grow with context.
- Quality notes: Q4_K_M is the community-default "sweet spot," but on this 12 GB card Q3_K_M is the more practical default if you need context. The 14.8B-parameter dense model (13.2B non-embedding, 40 layers, GQA 40 query / 8 KV heads per the HF card) is a meaningful quality step up from Qwen3-8B at the cost of roughly 1.6× the generation latency on the same card.
For the full benchmark data and other-GPU comparisons, see /check/qwen3-14b/rtx-5070.
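To see how the two rates combine into a response time, a simple estimate is prompt tokens divided by the prefill rate plus output tokens divided by the generation rate. The sketch below applies that to the 4k-context figures quoted above; the 500-token reply length and the helper name `est_latency_s` are hypothetical, and real latency will also include model load and sampling overhead.

```shell
# Usage: est_latency_s <prompt_tokens> <prefill_tok_s> <output_tokens> <gen_tok_s>
# Prints the estimated end-to-end time in seconds.
est_latency_s() {
  awk -v p="$1" -v pr="$2" -v o="$3" -v g="$4" \
    'BEGIN { printf "%.1f\n", p / pr + o / g }'
}

# 4k-token prompt, 500-token reply, at the 4k-context rates.
est_latency_s 4096 2144.2 500 54.2   # -> 11.1
```

Of those ~11 seconds, under 2 go to ingesting the prompt; the rest is generation, which is why thinking traces (extra output tokens) dominate perceived latency.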
Troubleshooting
Out of memory after the prompt grows / at long context
This is the most common failure on 12 GB. Q4_K_M weights are ~9 GB resident before any context — once the KV cache fills, you hit the ceiling. Fixes, in order: (1) cap context with --ctx-size 4096; (2) quantize the KV cache with --cache-type-k q8_0 --cache-type-v q8_0 --flash-attn (roughly halves KV memory); (3) drop to Q3_K_M (7.32 GB) per the unsloth GGUF table to free ~1.7 GB for KV; (4) if you still need long context, move to a 16 GB+ card. Disabling thinking mode (/no_think) also helps — <think> traces inflate the KV cache.
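The first two fixes can be folded into a small launcher helper. This is a sketch under the thresholds stated above (plain cap up to 4096 tokens, quantized KV cache beyond that); the function name `llama_ctx_flags` is hypothetical, and it only prints flags already used in this guide.

```shell
# Hypothetical helper: print llama-server flags for a target context size.
# Up to 4096 tokens: just cap the context.
# Above that: also quantize the KV cache to q8_0.
llama_ctx_flags() {
  if [ "$1" -le 4096 ]; then
    echo "--ctx-size $1"
  else
    echo "--ctx-size $1 --cache-type-k q8_0 --cache-type-v q8_0 --flash-attn"
  fi
}

llama_ctx_flags 4096
llama_ctx_flags 8192

# Example launch (unquoted on purpose so the flags split into words):
#   llama-server -hf unsloth/Qwen3-14B-GGUF:Q4_K_M $(llama_ctx_flags 8192)
```

If that still runs out of memory, the remaining steps are a change of model file (Q3_K_M) or of hardware, which no flag can substitute for.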
Ollama returns Error: model requires more system memory or hangs on load
Confirm that a recent NVIDIA driver and the CUDA 12.8+ runtime are installed (nvidia-smi should show a Blackwell-capable driver). The RTX 5070 uses the Blackwell architecture (sm_120), which requires CUDA 12.8 or newer; older CUDA builds do not ship sm_120 kernels and fail with a "no kernel image is available for execution on the device" error. Ollama's bundled CUDA runtime handles this for recent builds, but if you compile llama.cpp from source, configure with -DGGML_CUDA=ON (the current CMake option; LLAMA_CUDA=1 is the older spelling) against CUDA 12.8+ explicitly. Watch nvidia-smi -l 1 in another terminal to confirm the GPU is actually being used.
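A quick way to script the driver check is sketched below. It reads the "CUDA Version" field from the nvidia-smi banner (the highest CUDA version the installed driver supports) and compares it against 12.8. The helper name `version_ge` is hypothetical, it relies on GNU `sort -V`, and the check is skipped when nvidia-smi is not installed.

```shell
# Succeeds if version $1 >= version $2 (relies on GNU sort -V).
version_ge() {
  [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n1)" = "$2" ]
}

if command -v nvidia-smi >/dev/null 2>&1; then
  # The nvidia-smi banner reports the highest CUDA version the driver supports.
  cuda_ver=$(nvidia-smi | sed -n 's/.*CUDA Version: \([0-9.]*\).*/\1/p')
  if [ -n "$cuda_ver" ] && version_ge "$cuda_ver" 12.8; then
    echo "driver supports CUDA $cuda_ver (>= 12.8)"
  else
    echo "driver reports CUDA '${cuda_ver}'; 12.8+ is needed for sm_120" >&2
  fi
fi
```

A passing check only says the driver is new enough; the runtime or build you actually load still has to include sm_120 kernels.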
Using transformers directly — FA2 sm_120 wheel gap
If you bypass Ollama / llama.cpp and run the HF card quickstart via transformers directly, mind the FlashAttention-2 sm_120 gap: as of mid-2026, prebuilt FA2 wheels still do not ship sm_120 kernels for Blackwell consumer cards (tracked at Dao-AILab/flash-attention#2168). The Qwen3 quickstart uses torch_dtype="auto" and device_map="auto" without hardcoding attn_implementation="flash_attention_2", so it works out of the box with attn_implementation="eager" or "sdpa". If any third-party snippet you copy hardcodes attn_implementation="flash_attention_2", override it to "sdpa" until FA2 lands sm_120 wheels. Also install the cu128 PyTorch wheel (pip install torch --index-url https://download.pytorch.org/whl/cu128) — the default cu126 wheel does not include sm_120 kernels. Blackwell's tensor cores natively support FP8, so there is no FP8-on-Ampere dequantization penalty on this card. (Note that the BF16 transformers path needs 29.54 GB and does not fit 12 GB regardless — this section is only relevant if you're debugging a quantized transformers setup, not running full-precision weights.)
<think>...</think> output is bloating responses
Qwen3 enables thinking mode by default per the HF card quickstart. Send /no_think at the start of any user message to disable it for that turn, or pass enable_thinking=False if you're calling the chat-template API directly. Reasoning traces also consume KV cache — a long <think> block at high context can push memory usage up, which on this 12 GB card is the difference between fitting and OOM.
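When thinking mode stays on but a script only needs the final answer, the trace can be dropped after the fact. A minimal sketch, assuming the response text has been captured into a shell string and contains at most one think block: the helper name `strip_think` is hypothetical, and note that this only tidies the output; the tokens were still generated and still occupied KV cache.

```shell
# Hypothetical helper: drop everything up to and including the first
# </think> tag, then trim leading whitespace. Text without a think
# block passes through unchanged.
strip_think() {
  out=${1#*"</think>"}
  lead=${out%%[![:space:]]*}
  out=${out#"$lead"}
  printf '%s\n' "$out"
}

# Example with a canned response (no model call is made here).
strip_think "$(printf '<think>\nFrance -> Paris\n</think>\n\nParis.')"   # -> Paris.
```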