Qwen3-8B on RTX 5060 Ti: Q4_K_M GGUF via Ollama or llama.cpp

llm · beginner · 16GB+ VRAM · May 19, 2026
prerequisites
  • NVIDIA RTX 5060 Ti (16 GB VRAM) or equivalent 16 GB CUDA card
  • Recent NVIDIA driver with CUDA 12.8+ support (required for Blackwell sm_120)
  • ~6 GB free disk for the Q4_K_M GGUF checkpoint (or ~10 GB for Q8_0)
  • Ollama, llama.cpp, or LM Studio installed

What You'll Build

A local Qwen3-8B chat / reasoning assistant running on a 16 GB RTX 5060 Ti, served through Ollama (or llama.cpp / LM Studio — same GGUF, three loaders). The recipe pins the dense 8B variant at Q4_K_M quantization (5.03 GB on disk), which clears the 5060 Ti's VRAM with comfortable headroom for the 32k-native context window and the optional thinking-mode chain of thought.

Hardware data: RTX 5060 Ti (16 GB VRAM) · Q4_K_M GGUF · 69.2 tokens/s generation, 2965 tokens/s prefill at 4k context · See benchmark data

⚠️ Variant pinned — Qwen3 ships 8 sizes from the same Qwen org. Per the Ollama qwen3 tag list, Qwen3 spans 0.6b, 1.7b, 4b, 8b (this recipe), 14b, 30b (MoE), 32b, and 235b (MoE). The siblings have wildly different VRAM profiles — Qwen3-14B in Q4_K_M is ~8.5 GB and still fits 16 GB; Qwen3-32B in Q4_K_M is ~20 GB and overflows; Qwen3-235B (MoE, ~22B active) needs >100 GB total resident weights since the router can't pre-prune (see Qwen3 model card for the dense/MoE split). The instructions below are for the dense 8.2B model only. If you want 14B on this card, swap qwen3:8b for qwen3:14b and expect ~10 GB peak VRAM; for 32B+ go to /contribute.

ℹ️ Thinking mode is on by default. Qwen3-8B has a built-in chain-of-thought ("thinking") mode that the model card's quickstart enables via enable_thinking=True. Output starts with a <think>...</think> block followed by the user-facing answer. To disable for latency-sensitive use, send /no_think in your prompt or pass enable_thinking=False in the chat template.

Requirements

Component | Minimum | Tested
GPU | 16 GB VRAM (per backend benchmark peak at 4k context) | RTX 5060 Ti (16 GB)
RAM | 16 GB system | -
Storage | 5.03 GB (Q4_K_M GGUF) or 8.71 GB (Q8_0) | per unsloth/Qwen3-8B-GGUF
Driver | CUDA 12.8+ (Blackwell sm_120) | -
Runtime | Ollama 0.5+ / llama.cpp / LM Studio | -

The model is released under Apache 2.0 — commercial use is permitted.

Installation

The fastest path is Ollama — one command pulls the canonical Q4_K_M build maintained by the Qwen team:

Option A — Ollama (recommended)

1. Install Ollama

# macOS / Linux
curl -fsSL https://ollama.com/install.sh | sh

(Windows: download from ollama.com/download.) Per the Qwen3 model card, "applications such as Ollama, LMStudio, MLX-LM, llama.cpp, and KTransformers have also supported Qwen3."

2. Pull the 8B model

ollama pull qwen3:8b

This fetches a 5.2 GB Q4_K_M checkpoint per the Ollama qwen3:8b tag. The download is one file — no manual quant-tier selection needed.

Option B — llama.cpp + community GGUF

If you want a different quant tier (Q6_K for higher fidelity, Q8_0 for near-lossless), use a community redistributor that publishes the full ladder:

1. Install llama.cpp

# macOS (Homebrew)
brew install llama.cpp

# Linux: pre-built CUDA binaries
# Visit https://github.com/ggml-org/llama.cpp/releases for cu128 binaries

2. Pull the quant you want

Per the unsloth/Qwen3-8B-GGUF per-tier file-size table (link-back to upstream Qwen/Qwen3-8B confirmed on the page header):

Quant | File size | Notes
Q4_K_M | 5.03 GB | recommended for this card
Q5_K_M | 5.85 GB | better quality, still tiny
Q6_K | 6.73 GB | "near perfect" per bartowski
Q8_0 | 8.71 GB | near-lossless
BF16 | 16.4 GB | full precision; overflows 16 GB card, needs offload

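Then fetch and run it with the llama.cpp Hugging Face shortcut (the -hf form is per the unsloth model card). The suffix after the colon selects the quant tier: Q4_K_M is shown here, and Q5_K_M, Q6_K, or Q8_0 from the table above can be swapped in. The card's own example uses the UD-Q4_K_XL dynamic quant, which is not in that table.

# OpenAI-compatible local server with web UI
llama-server -hf unsloth/Qwen3-8B-GGUF:Q4_K_M

# Interactive terminal
llama-cli -hf unsloth/Qwen3-8B-GGUF:Q4_K_M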

Option C — LM Studio (GUI)

LM Studio offers a one-click install path per the Qwen3-8B HF card. Use the direct-import link lmstudio://open_from_hf?model=unsloth/Qwen3-8B-GGUF, or search "Qwen3-8B GGUF" inside the app and pick the Q4_K_M tier.

Running

One-shot prompt via Ollama

ollama run qwen3:8b "Explain GQA attention in three sentences."

First run loads the model into VRAM (~5 GB resident, climbing toward the 16 GB benchmark peak as the KV cache grows with longer contexts). Subsequent prompts in the same session stay warm.

Disable thinking mode for short answers

ollama run qwen3:8b "/no_think What's the capital of France?"

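Per the Qwen3-8B HF card, /no_think is a soft switch that is honored while enable_thinking=True: the model skips its chain of thought for that turn. The response may still begin with an empty <think></think> block; pass enable_thinking=False in the chat template to suppress the block entirely.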

OpenAI-compatible HTTP API

# Ollama exposes localhost:11434 by default
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3:8b",
    "messages": [{"role": "user", "content": "Write a haiku about Blackwell GPUs."}]
  }'
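
The same endpoint can be called from Python. The following is a minimal sketch using only the standard library, assuming Ollama is serving on its default localhost:11434 and returns the usual OpenAI-style choices[0].message.content shape. build_payload and chat are illustrative helper names, not part of any Ollama SDK. Passing think=False prepends the /no_think soft switch described earlier.

```python
import json
import urllib.request

# Ollama's default address, as in the curl example (an assumption if you changed it).
OLLAMA_URL = "http://localhost:11434/v1/chat/completions"


def build_payload(prompt: str, think: bool = True, model: str = "qwen3:8b") -> dict:
    """Build an OpenAI-style chat request body; think=False prepends /no_think."""
    content = prompt if think else f"/no_think {prompt}"
    return {"model": model, "messages": [{"role": "user", "content": content}]}


def chat(prompt: str, think: bool = True, timeout: float = 120.0) -> str:
    """POST the prompt to the local server and return the assistant's text."""
    body = json.dumps(build_payload(prompt, think)).encode("utf-8")
    request = urllib.request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        reply = json.load(response)
    return reply["choices"][0]["message"]["content"]


# Example (requires a running Ollama instance with qwen3:8b pulled):
# print(chat("Write a haiku about Blackwell GPUs.", think=False))
```

The payload builder is kept separate from the network call so the request body can be inspected or logged without a server running.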

For higher-throughput, production-style serving, the upstream Qwen3-8B card documents vllm serve Qwen/Qwen3-8B --enable-reasoning --reasoning-parser deepseek_r1 and python -m sglang.launch_server --model-path Qwen/Qwen3-8B --reasoning-parser qwen3. Both load BF16 weights (16.4 GB), which leaves no practical room for the KV cache on a 16 GB card. For the 5060 Ti, Ollama / llama.cpp with the Q4_K_M GGUF is the comfortable path.

Results

  • Speed: 69.2 tokens/s generation and 2965.1 tokens/s prefill at 4k context window, Q4_K, measured on RTX 5060 Ti 16 GB — per the hardware-corner.net LLM benchmark table, surfaced via /check/qwen3-8b/rtx-5060-ti. Generation rate drops to 25.8 tokens/s at 64k context as the KV cache grows.
  • VRAM usage: 16 GB peak at the benchmark's 4k context configuration (cited backend benchmark). At idle the Q4_K_M weights occupy ~5 GB; the rest is KV cache headroom, which the runtime expands as your sessions accumulate context. The official Qwen speed benchmark corroborates the precision/VRAM ladder on H20 hardware: BF16 = 15947 MB, FP8 = 9323 MB, AWQ-INT4 = 6177 MB.
  • Quality notes: Q4_K_M is the community-default "sweet spot" — the bartowski Q-tier guide flags Q6_K as "near perfect, recommended" if you have the VRAM. On a 16 GB 5060 Ti you can also run Q6_K (6.73 GB) or Q8_0 (8.71 GB) with plenty of room — there's no quality reason to pick anything below Q4_K_M on this card.
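
To see why the resident footprint grows with context, the KV cache can be estimated from the model's shape. The sketch below relies on values that are not stated in this recipe: 36 layers, 8 KV heads, and a head dimension of 128 (assumed from the published Qwen3-8B configuration), plus an unquantized FP16 cache at 2 bytes per element. Treat the results as rough estimates, since runtimes add their own compute buffers on top.

```python
# Assumed Qwen3-8B shape (from the published model config, not from this recipe).
N_LAYERS = 36
N_KV_HEADS = 8
HEAD_DIM = 128
BYTES_PER_ELEMENT = 2  # FP16 KV cache; a quantized cache would be smaller


def kv_cache_bytes(context_tokens: int) -> int:
    """Bytes needed to hold keys and values for every layer at this context length."""
    # The leading 2 accounts for storing both a key and a value vector per head.
    per_token = 2 * N_LAYERS * N_KV_HEADS * HEAD_DIM * BYTES_PER_ELEMENT
    return per_token * context_tokens


for ctx in (4096, 32768, 65536):
    print(f"{ctx:>6} tokens: {kv_cache_bytes(ctx) / 2**30:.2f} GiB KV cache")
```

Under these assumptions the cache comes to about 0.56 GiB at 4k, 4.5 GiB at 32k, and 9 GiB at 64k, so the Q4_K_M weights plus a full 32k cache land at roughly 10 GB before runtime buffers.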

For the full benchmark data and other-GPU comparisons, see /check/qwen3-8b/rtx-5060-ti.

Troubleshooting

Ollama returns Error: model requires more system memory or hangs on load

Confirm CUDA 12.8+ drivers are installed (nvidia-smi should report a 575+ driver on Linux). The RTX 5060 Ti uses Blackwell sm_120; builds that bundle an older CUDA runtime lack the kernels, and Ollama then silently falls back to CPU inference, which appears as a hang. Upgrade the driver first, then reinstall Ollama.
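
The driver check can also be scripted. This is a sketch, assuming nvidia-smi is on PATH. driver_major and check_driver are illustrative names, and the 575 threshold is simply the figure quoted above rather than an independently verified minimum.

```python
import subprocess

MIN_DRIVER_MAJOR = 575  # threshold quoted in this recipe for Linux


def driver_major(version: str) -> int:
    """Extract the major component from a driver string such as '575.64.03'."""
    return int(version.strip().split(".")[0])


def check_driver() -> bool:
    """Ask nvidia-smi for the driver version and compare it to the threshold."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    )
    first_gpu = result.stdout.splitlines()[0]  # nvidia-smi prints one line per GPU
    return driver_major(first_gpu) >= MIN_DRIVER_MAJOR


# Example (on a machine with an NVIDIA driver installed):
# print("driver OK" if check_driver() else "driver too old")
```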

<think>...</think> output is bloating responses

Qwen3 enables thinking mode by default per the HF card quickstart. Send /no_think at the start of any user message to disable it for that turn, or pass enable_thinking=False if you're calling the chat-template API directly.
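
If the reasoning block has already been generated, it can also be removed client-side. A minimal sketch, assuming the response text contains at most one well-formed <think>...</think> block at the very start; split_thinking is an illustrative helper, not a Qwen or Ollama API.

```python
import re

# Matches a single leading <think>...</think> block, including multi-line content.
THINK_BLOCK = re.compile(r"^\s*<think>(.*?)</think>\s*", re.DOTALL)


def split_thinking(text: str) -> tuple[str, str]:
    """Return (reasoning, answer); reasoning is '' when no leading <think> block exists."""
    match = THINK_BLOCK.match(text)
    if match is None:
        return "", text.strip()
    return match.group(1).strip(), text[match.end():].strip()


reasoning, answer = split_thinking("<think>\nParis is the capital.\n</think>\n\nParis.")
print(answer)  # Paris.
```

Keeping the reasoning as a separate return value, rather than discarding it, makes it easy to log the chain of thought while showing users only the answer.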

I want the larger 14B / 32B sibling

Qwen3-14B at Q4_K_M is ~8.5 GB on disk and fits a 16 GB card comfortably — swap qwen3:8b for qwen3:14b in any Ollama command. Qwen3-32B at Q4_K_M is ~20 GB and does not fit without aggressive offloading; same for the 30B MoE and 235B MoE variants (MoE total params must be resident — see the Qwen3 model card on the dense/MoE split). For a 32B+ recipe on this card, request via /contribute.

FlashAttention 2 errors with transformers

If you bypass Ollama / llama.cpp and run the HF card quickstart via transformers directly, the quickstart uses torch_dtype="auto" and device_map="auto". It does not hardcode attn_implementation="flash_attention_2", so it works on Blackwell sm_120 out of the box with torch>=2.7 and its CUDA 12.8 wheels (earlier PyTorch releases do not ship pre-built sm_120 support). If you (or a tutorial) add attn_implementation="flash_attention_2", the model will crash on first inference: FA2 wheels don't ship sm_120 kernels yet (Dao-AILab/flash-attention#2168). Override with attn_implementation="sdpa" (or remove the argument) and PyTorch's built-in scaled-dot-product attention handles it.
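
The override can be sketched as follows. This assumes transformers, accelerate (needed for device_map="auto"), and a CUDA 12.8 build of PyTorch are installed, and it mirrors the quickstart's torch_dtype and device_map arguments. load_qwen3 and LOAD_KWARGS are illustrative names. Because the BF16 checkpoint is about 16.4 GB, device_map="auto" will likely place some layers off-GPU on a 16 GB card.

```python
# Keyword arguments for from_pretrained; "sdpa" selects PyTorch's built-in
# scaled-dot-product attention instead of the flash-attn package.
LOAD_KWARGS = {
    "torch_dtype": "auto",
    "device_map": "auto",
    "attn_implementation": "sdpa",
}


def load_qwen3(model_id: str = "Qwen/Qwen3-8B"):
    """Load tokenizer and model without requiring FlashAttention 2 kernels."""
    # Imported lazily so this module can be inspected without transformers installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id, **LOAD_KWARGS)
    return tokenizer, model


# Example (downloads the full BF16 checkpoint on first use):
# tokenizer, model = load_qwen3()
```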

Generation slows dramatically past 32k context

32k is Qwen3's native context window per the HF card ("Context Length: 32,768 natively and 131,072 tokens with YaRN"). Beyond that the model needs YaRN extension — supported in llama.cpp via the --rope-scaling yarn flag, but quality degrades and the KV cache balloons. For long-doc workflows, prefer chunking + retrieval over pushing context past 32k.
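
For reference, the YaRN scale factor is the target context divided by the native 32,768. The sketch below builds a llama-server argument list. --rope-scale, --yarn-orig-ctx, and -c are llama.cpp options not mentioned above, and yarn_args is an illustrative helper; check llama-server --help on your build before relying on them.

```python
NATIVE_CTX = 32768  # Qwen3's native context length per the HF card


def yarn_args(target_ctx: int, model: str = "unsloth/Qwen3-8B-GGUF:Q4_K_M") -> list[str]:
    """Return llama-server arguments, enabling YaRN only when target_ctx exceeds native."""
    if target_ctx <= NATIVE_CTX:
        # Within the native window no RoPE scaling is needed.
        return ["llama-server", "-hf", model, "-c", str(target_ctx)]
    scale = target_ctx / NATIVE_CTX
    return [
        "llama-server", "-hf", model,
        "-c", str(target_ctx),
        "--rope-scaling", "yarn",
        "--rope-scale", f"{scale:g}",
        "--yarn-orig-ctx", str(NATIVE_CTX),
    ]


print(" ".join(yarn_args(131072)))
```

For the 131,072-token maximum this yields a scale factor of 4. Leaving scaling off at or below 32k is deliberate, since the quality cost described above only buys anything when the extra context is actually used.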