
Qwen3-14B on RTX 5060 Ti: Q4_K_M GGUF via Ollama or llama.cpp

llm · beginner · 12GB+ VRAM · Jun 17, 2026

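This beginner recipe sets up Qwen3-14B on the RTX 5060 Ti (16 GB). The page is tagged 12 GB+ VRAM, but the cited benchmark peaks at 16.0 GB at a 4k context window, so treat 16 GB as the practical minimum.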

prerequisites
  • NVIDIA RTX 5060 Ti (16 GB GDDR7) or equivalent Blackwell sm_120 card
  • Recent NVIDIA driver with CUDA 12.8+ runtime (Blackwell sm_120 — `cu128` PyTorch wheel required if you go the transformers route)
  • ~9 GB free disk for the Q4_K_M GGUF checkpoint (or ~16 GB for Q8_0)
  • Ollama, llama.cpp, or LM Studio installed

What You'll Build

A local Qwen3-14B chat / reasoning assistant running on a 16 GB RTX 5060 Ti, served through Ollama (or llama.cpp / LM Studio — same GGUF, three loaders). The recipe pins the dense 14B variant at Q4_K_M quantization (9.00 GB on disk), which fits the RTX 5060 Ti's 16 GB envelope — but on this card it is a tight fit, not a roomy one: the measured benchmark peak reaches the full 16 GB at a 4k context window (see below), so plan your context length and any colocated model accordingly.

Hardware data: RTX 5060 Ti (16 GB GDDR7) · Q4_K GGUF · 41.1 tokens/s generation at 4k context · 16.0 GB peak VRAM · See benchmark data

⚠️ Tight 16 GB fit. The cited backend benchmark records a 16.0 GB peak on a 16 GB card at a 4k context window — i.e. Qwen3-14B at Q4_K essentially fills this card. It runs, but there is little spare VRAM. Keep the context window modest, prefer Q4_K_M over higher quants, and avoid colocating a second model unless you cap context tightly. If you want comfortable headroom, the dense 8B sibling (qwen3:8b, ~5 GB) leaves far more room on the same card — see /check/qwen3-8b/rtx-5060-ti.

⚠️ Variant pinned — Qwen3 ships 8 sizes from the same Qwen org. Per the Ollama qwen3 tag list, Qwen3 spans 0.6b, 1.7b, 4b, 8b, 14b (this recipe), 30b (MoE), 32b, and 235b (MoE). The siblings have very different VRAM profiles — Qwen3-8B in Q4_K_M is ~5 GB; Qwen3-32B in Q4_K_M is ~20 GB and overflows a 16 GB card; the 30B/235B MoE variants need every expert resident in VRAM (the router can't pre-prune), far past this card. The instructions below are for the dense 14.8B model only. For 32B+ on this card, see /contribute.

ℹ️ Thinking mode is on by default. Qwen3-14B supports a built-in chain-of-thought ("thinking") mode that the model card's quickstart enables via enable_thinking=True. Output starts with a <think>...</think> block followed by the user-facing answer. To disable for latency-sensitive use, send /no_think in your prompt or pass enable_thinking=False in the chat template. On this tight-16 GB card, the <think> block also inflates the KV cache — another reason to disable it when you don't need step-by-step reasoning.
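
If you consume raw completions in your own code, you may want to separate the reasoning from the answer. The following is a minimal sketch, assuming the output contains at most one <think>...</think> block as described above; `split_thinking` is a hypothetical helper name, not part of any Qwen or Ollama API.

```python
import re

# Matches one <think>...</think> block (possibly empty) plus trailing whitespace.
_THINK_BLOCK = re.compile(r"<think>(.*?)</think>\s*", re.DOTALL)


def split_thinking(text: str) -> tuple[str, str]:
    """Return (reasoning, answer) from a raw Qwen3 completion string."""
    match = _THINK_BLOCK.search(text)
    if match is None:
        # No thinking block at all: everything is the answer.
        return "", text.strip()
    reasoning = match.group(1).strip()
    answer = (text[:match.start()] + text[match.end():]).strip()
    return reasoning, answer


raw = "<think>\nParis is the capital.\n</think>\n\nParis."
reasoning, answer = split_thinking(raw)
print(answer)
```

Keeping the reasoning as a separate return value lets you log it for debugging while showing only the answer to users.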

Requirements

Component | Minimum | Tested
GPU | 16 GB VRAM (Q4_K_M weights ~9 GB + KV at 4k ctx fills the card; see the 16.0 GB benchmark peak) | RTX 5060 Ti (16 GB)
RAM | 16 GB system |
Storage | 9.00 GB (Q4_K_M GGUF) or 15.70 GB (Q8_0) | per unsloth/Qwen3-14B-GGUF
Driver | CUDA 12.8+ runtime (Blackwell sm_120) |
Runtime | Ollama 0.5+ / llama.cpp / LM Studio |

The model is released under Apache 2.0 — commercial use is permitted.

Installation

The fastest path is Ollama — one command pulls the canonical Q4_K_M build:

Option A — Ollama (recommended)

1. Install Ollama

# macOS / Linux
curl -fsSL https://ollama.com/install.sh | sh

(Windows: download from ollama.com/download.) Per the Qwen3-14B model card, Qwen3 is supported by local runtimes including Ollama, LMStudio, MLX-LM, llama.cpp, and KTransformers.

2. Pull the 14B model

ollama pull qwen3:14b

This fetches a 9.3 GB Q4_K_M checkpoint per the Ollama qwen3:14b tag. The download is one file — no manual quant-tier selection needed.

Option B — llama.cpp + community GGUF

If you want a different quant tier (Q5_K_M for slightly higher fidelity, Q8_0 for near-lossless), use a community redistributor that publishes the full ladder. On this 16 GB card the higher quants tighten an already-tight fit — see the per-tier table below.

1. Install llama.cpp

# macOS (Homebrew)
brew install llama.cpp

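# Linux: pre-built binaries (llama.cpp ships binaries, not Python wheels)
# Visit https://github.com/ggml-org/llama.cpp/releases and pick a CUDA 12.x build if one is published for your platform; otherwise build from source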

2. Pull the quant you want

Per the unsloth/Qwen3-14B-GGUF per-tier file-size table (upstream Qwen/Qwen3-14B link-back confirmed in the model-card header):

Quant | File size | Fit on a 16 GB RTX 5060 Ti
Q4_K_M | 9.00 GB | recommended; Q4_K already peaks at the full 16 GB envelope at 4k ctx (benchmark below)
Q5_K_M | 10.51 GB | fits, but tightens further; cap context
Q6_K | 12.12 GB | "near perfect" per bartowski; leaves very little KV headroom
Q8_0 | 15.70 GB | near-lossless but fills the card on weights alone; only viable with a very small context cap
BF16 | 29.54 GB | full precision; does NOT fit a 16 GB card
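
To make the fit column concrete, here is a rough sketch that subtracts each file size in the table from the card's 16 GB. It assumes the on-disk size approximates the VRAM the weights occupy and ignores the KV cache and runtime buffers, so the real headroom is smaller; `weight_headroom_gb` is a hypothetical helper name.

```python
CARD_VRAM_GB = 16.0  # RTX 5060 Ti, per this recipe

# File sizes copied from the per-tier table above (GB on disk).
QUANT_SIZES_GB = {
    "Q4_K_M": 9.00,
    "Q5_K_M": 10.51,
    "Q6_K": 12.12,
    "Q8_0": 15.70,
    "BF16": 29.54,
}


def weight_headroom_gb(quant: str, card_gb: float = CARD_VRAM_GB) -> float:
    """VRAM left after weights alone; a negative value means the weights do not fit."""
    return round(card_gb - QUANT_SIZES_GB[quant], 2)


for quant in QUANT_SIZES_GB:
    print(f"{quant:7s} {weight_headroom_gb(quant):+6.2f} GB")
```

Whatever headroom remains has to cover the KV cache and compute buffers, which is why the higher tiers force a smaller context cap.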

Then via the llama.cpp Hugging Face shortcut (per the unsloth model card):

# OpenAI-compatible local server with web UI.
# UD-Q4_K_XL is Unsloth's dynamic 4-bit build; swap the tag for :Q4_K_M to match the table above.
llama-server -hf unsloth/Qwen3-14B-GGUF:UD-Q4_K_XL

# Interactive terminal
llama-cli -hf unsloth/Qwen3-14B-GGUF:UD-Q4_K_XL

Option C — LM Studio (GUI)

LM Studio offers a one-click install path — the Qwen3 family is in its supported-runtime list on the Qwen3-14B HF card. Search "Qwen3-14B GGUF" inside the app and pick the Q4_K_M tier, or use the direct-import link lmstudio://open_from_hf?model=unsloth/Qwen3-14B-GGUF.

Running

One-shot prompt via Ollama

ollama run qwen3:14b "Explain GQA attention in three sentences."

First run loads the model into VRAM (~9 GB resident at idle, climbing toward the 16 GB benchmark peak as the KV cache fills with longer contexts). Subsequent prompts in the same session stay warm.

Disable thinking mode for short answers

ollama run qwen3:14b "/no_think What's the capital of France?"

Per the Qwen3-14B HF card, /no_think is a soft switch: it turns off reasoning for that turn while enable_thinking stays at its default, so the reply carries no chain-of-thought (at most an empty <think></think> block). On this tight card, it keeps the KV cache smaller too.

OpenAI-compatible HTTP API

# Ollama exposes localhost:11434 by default
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3:14b",
    "messages": [{"role": "user", "content": "Write a haiku about Blackwell GPUs."}]
  }'
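
The same request can be built from Python with only the standard library. This is a sketch that mirrors the curl call above; `build_chat_request` and `ask` are hypothetical helper names, and the response parsing assumes the usual OpenAI-style `choices[0].message.content` shape.

```python
import json
import urllib.request

# Ollama's default address, as in the curl example.
OLLAMA_URL = "http://localhost:11434/v1/chat/completions"


def build_chat_request(prompt: str, model: str = "qwen3:14b") -> urllib.request.Request:
    """Build (but do not send) a POST request mirroring the curl call."""
    payload = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt: str, timeout: float = 300.0) -> str:
    """Send the request and return the assistant text (needs a running server)."""
    with urllib.request.urlopen(build_chat_request(prompt), timeout=timeout) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]


req = build_chat_request("Write a haiku about Blackwell GPUs.")
print(req.get_method(), req.full_url)
```

Call `ask("...")` once `ollama serve` is running. The generous timeout allows for the first-run model load.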

For higher-throughput / production-style serving, the upstream Qwen3-14B card documents vllm serve Qwen/Qwen3-14B --enable-reasoning --reasoning-parser deepseek_r1 and python -m sglang.launch_server --model-path Qwen/Qwen3-14B --reasoning-parser qwen3 — both load BF16 weights (29.54 GB), which does not fit a 16 GB card. For local serving on this GPU prefer Ollama / llama.cpp with the Q4_K_M GGUF; FP8/vLLM serving needs a larger-VRAM card.

Results

  • Speed: measured on RTX 5060 Ti 16 GB at Q4_K quantization, per the hardware-corner.net LLM benchmark table, surfaced via /check/qwen3-14b/rtx-5060-ti. The 4k figures are backend benchmark id=23 (generation) and id=11 (prompt processing); the 16k and 32k figures come from the same Hardware Corner row.

    Context | Generation (tok/s) | Prompt processing (tok/s)
    4k | 41.1 | 1,743.0
    16k | 32.9 | 942.6
    32k | 25.9 | 621.0

    Generation slows as the KV cache grows. Prompt processing (prefill) is much faster and is a distinct metric: prefill measures how fast the model ingests your prompt; token generation measures how fast it writes the reply. The two are reported separately because they stress different parts of the pipeline. The Hardware Corner row for Qwen3-14B (Q4_K) caps at 32k context; there is no published 64k rung for this model on this card. If you measure your own numbers, please contribute them so this page can carry first-party data across more context lengths.
  • VRAM usage: The cited backend benchmark records a 16.0 GB peak at the 4k-context configuration on this 16 GB card — see /check/qwen3-14b/rtx-5060-ti (id=23 / id=11). That is the full card: Qwen3-14B at Q4_K is a tight fit on the RTX 5060 Ti, not a roomy one. At idle the Q4_K_M weights occupy ~9 GB; the remainder of the envelope is KV cache and activations, which grow with context and push peak usage to the 16 GB ceiling. Any higher quant (Q5_K_M and above) tightens the fit further and forces a smaller context cap.
  • Quality notes: Q4_K_M is the community-default "sweet spot." The 14.8B-parameter dense model (13.2B non-embedding, 40 layers, GQA 40 query / 8 KV heads per the HF card) is a meaningful quality step up from Qwen3-8B, but on this 16 GB card it trades away the headroom the 8B variant enjoys. If you need long context (toward the 131K YaRN-extended ceiling) or want to colocate a second model, the tight 16 GB envelope here makes the 8B variant or a higher-VRAM card the better choice.
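
To turn the throughput figures above into a feel for latency, the sketch below adds prefill time and generation time for a single request. It assumes the published tok/s rates hold constant across the whole request, which is a simplification; `estimate_seconds` is a hypothetical helper name.

```python
# Throughput figures copied from the Results bullets (tokens/s, RTX 5060 Ti, Q4_K).
GENERATION_TPS = {4096: 41.1, 16384: 32.9, 32768: 25.9}
PREFILL_TPS = {4096: 1743.0, 16384: 942.6, 32768: 621.0}


def estimate_seconds(prompt_tokens: int, reply_tokens: int, context: int = 4096) -> float:
    """Rough wall-clock estimate: prefill time plus generation time."""
    prefill = prompt_tokens / PREFILL_TPS[context]
    generation = reply_tokens / GENERATION_TPS[context]
    return round(prefill + generation, 1)


print(estimate_seconds(2000, 500))          # 13.3
print(estimate_seconds(30000, 500, 32768))  # 67.6
```

In both cases generation dominates for short prompts, which is why suppressing a long <think> block has a visible effect on response time.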

For the full benchmark data and other-GPU comparisons, see /check/qwen3-14b/rtx-5060-ti.

Troubleshooting

Out of memory or thrashing at longer contexts

Because Qwen3-14B at Q4_K already peaks at the full 16.0 GB on this card at 4k context (per the cited benchmark), pushing context further, or enabling a long <think> reasoning block, can tip the card into OOM or CPU-offload thrashing. Mitigations, in order:

  1. Cap context explicitly, e.g. --ctx-size 4096 in llama.cpp or OLLAMA_CONTEXT_LENGTH=4096.
  2. Quantize the KV cache with --cache-type-k q8_0 --cache-type-v q8_0 --flash-attn (llama.cpp).
  3. Send /no_think to suppress the chain-of-thought block that inflates KV.
  4. Drop to the dense 8B sibling (qwen3:8b, ~5 GB) for comfortable headroom on the same card.
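
The effect of capping context and quantizing the KV cache can be estimated with a back-of-the-envelope formula. In this sketch the 40 layers and 8 KV heads are the figures this recipe cites from the HF card; a head dimension of 128 is an assumption not stated here, so check the model's config.json. It counts only the K and V tensors: compute buffers, the CUDA context, and anything else the runtime reserves are not modeled, so it will not reproduce the measured peak.

```python
# 40 layers and 8 KV heads are cited in this recipe from the HF card.
# HEAD_DIM = 128 is an assumption; verify it against the model's config.json.
LAYERS, KV_HEADS, HEAD_DIM = 40, 8, 128

# Bytes per cached element. ggml's q8_0 packs 32 values into a 34-byte block.
BYTES_PER_ELEM = {"f16": 2.0, "q8_0": 34 / 32}


def kv_cache_gib(context_tokens: int, cache_type: str = "f16") -> float:
    """Estimated size of the K and V caches together, in GiB."""
    # Factor 2 covers the two tensors (K and V) stored per layer.
    per_token = 2 * LAYERS * KV_HEADS * HEAD_DIM * BYTES_PER_ELEM[cache_type]
    return round(per_token * context_tokens / 2**30, 3)


for ctx in (4096, 16384, 32768):
    print(ctx, kv_cache_gib(ctx), kv_cache_gib(ctx, "q8_0"))
```

Under these assumptions the cache grows linearly with context, and q8_0 roughly halves it relative to f16.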

Ollama returns Error: model requires more system memory or hangs on load

Confirm a recent NVIDIA driver and CUDA 12.8+ runtime are installed (nvidia-smi should show a recent Blackwell-capable driver). The RTX 5060 Ti uses the Blackwell architecture (sm_120), which requires CUDA 12.8 or newer; older CUDA builds do not ship sm_120 kernels and fail with a "no kernel image is available for execution on the device" error, or Ollama silently falls back to CPU inference (which appears as a hang). The "requires more system memory" message itself refers to system RAM, so it can be a symptom of that CPU fallback; also check that the machine meets the 16 GB system RAM minimum. Watch nvidia-smi -l 1 in another terminal to confirm the GPU is actually being used; if it stays at 0% utilization, your driver or runtime is too old. If you compile llama.cpp from source, build against CUDA 12.8+ explicitly with the CMake option -DGGML_CUDA=ON (LLAMA_CUDA=1 is the older, Makefile-era name).
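
If you would rather script the GPU check than watch nvidia-smi by eye, the sketch below parses nvidia-smi's CSV query output. `parse_gpu_line` and `read_gpus` are hypothetical helper names, and the sample row is illustrative, not a measurement.

```python
import subprocess

# Standard nvidia-smi query mode: one CSV row per GPU, no header, no units.
QUERY = [
    "nvidia-smi",
    "--query-gpu=name,driver_version,memory.used,memory.total,utilization.gpu",
    "--format=csv,noheader,nounits",
]


def parse_gpu_line(line: str) -> dict:
    """Parse one CSV row produced by the QUERY command."""
    name, driver, used, total, util = [field.strip() for field in line.split(",")]
    return {"name": name, "driver": driver,
            "used_mib": int(used), "total_mib": int(total), "util_pct": int(util)}


def read_gpus() -> list[dict]:
    """Run nvidia-smi and parse every GPU row; raises if nvidia-smi is missing."""
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
    return [parse_gpu_line(row) for row in out.splitlines() if row.strip()]


# Illustrative row only; the driver version and memory figures are made up.
sample = "NVIDIA GeForce RTX 5060 Ti, 575.57.08, 9312, 16311, 87"
print(parse_gpu_line(sample))
```

Call `read_gpus()` while a prompt is generating: a `util_pct` stuck at 0 with low `used_mib` suggests the CPU fallback described above.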

Using transformers directly — FA2 sm_120 wheel gap

If you bypass Ollama / llama.cpp and run the HF card quickstart via transformers directly, mind the FlashAttention-2 sm_120 gap: as of mid-2026, prebuilt FA2 wheels still do not ship sm_120 kernels for Blackwell consumer cards (tracked at Dao-AILab/flash-attention#2168). The Qwen3 quickstart uses torch_dtype="auto" and device_map="auto" without hardcoding attn_implementation="flash_attention_2", so it works out of the box with attn_implementation="eager" or "sdpa". If any third-party snippet you copy hardcodes attn_implementation="flash_attention_2", override it to "sdpa" until FA2 lands sm_120 wheels. Also install the cu128 PyTorch wheel (pip install torch --index-url https://download.pytorch.org/whl/cu128) — the default cu126 wheel does not include sm_120 kernels. Blackwell's tensor cores natively support FP8, so there is no FP8-on-Ampere dequantization penalty on this card.
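
One way to apply the override defensively is to sanitize the keyword arguments before they reach from_pretrained. This is a small sketch; `blackwell_safe_kwargs` is a hypothetical helper, and it only rewrites the attn_implementation value discussed above.

```python
def blackwell_safe_kwargs(kwargs: dict) -> dict:
    """Return a copy of from_pretrained kwargs with FA2 swapped for SDPA."""
    safe = dict(kwargs)  # copy so the caller's dict is left untouched
    if safe.get("attn_implementation") == "flash_attention_2":
        safe["attn_implementation"] = "sdpa"
    return safe


# Kwargs as they might appear in a copied third-party snippet.
copied = {"torch_dtype": "auto", "device_map": "auto",
          "attn_implementation": "flash_attention_2"}
print(blackwell_safe_kwargs(copied))
```

You would then pass the result through, for example `AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-14B", **blackwell_safe_kwargs(copied))`. Keep in mind that the BF16 weights (29.54 GB) do not fit this card on their own.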

I want the larger 32B sibling

Qwen3-32B at Q4_K_M is ~20 GB on disk and does not fit a 16 GB card without aggressive CPU offloading (which tanks throughput); same for the 30B and 235B MoE variants, whose total expert weights must be resident. Stay on 14B for this card (mindful of the tight fit), or request a 32B recipe via /contribute.

common questions
How much VRAM does Qwen3 14B need?

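The page is tagged 12 GB+, but on this card the cited benchmark records a 16.0 GB peak at 4k context, and the Requirements table lists 16 GB as the minimum. Plan for a 16 GB card.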

Which GPUs is Qwen3 14B tested on?

RTX 5060 Ti (16 GB).

How hard is this setup?

Beginner — follow the steps above.