Qwen3-8B on RTX 5070: Q4_K_M GGUF via Ollama or llama.cpp

What You'll Build

A local Qwen3-8B chat / reasoning assistant running on a 12 GB RTX 5070, served through Ollama (or llama.cpp / LM Studio — same GGUF, three loaders). The recipe pins the dense 8B variant at Q4_K_M quantization (5.03 GB on disk), which fits the RTX 5070's 12 GB envelope comfortably with room left for the KV cache. On this card the binding question is no longer "do the weights fit" — they take less than half the VRAM — but "how long a context can I run before the KV cache eats the remaining headroom."

Hardware data: RTX 5070 (12 GB GDDR7) · Q4_K GGUF · 85.8 tokens/s generation at 4k context · See benchmark data

⚠️ Variant pinned: Qwen3 ships 8 sizes from the same Qwen org. Per the Ollama qwen3 tag list, Qwen3 spans 0.6b, 1.7b, 4b, 8b (this recipe), 14b, 30b (MoE), 32b, and 235b (MoE). The siblings have very different VRAM profiles: Qwen3-14B in Q4_K_M is ~9 GB and is tight on a 12 GB card (little left for KV); Qwen3-32B in Q4_K_M is ~20 GB and overflows; Qwen3-235B (MoE, ~22B active) needs >100 GB of resident weights, because the router picks experts per token and no expert can be dropped ahead of time (see Qwen3 model card for the dense/MoE split). The instructions below are for the dense 8.2B model only. For the 14B+ siblings on this card, go to /contribute.

ℹ️ Thinking mode is on by default. Qwen3-8B has a built-in chain-of-thought ("thinking") mode that the model card enables via enable_thinking=True (the card notes "Default is True"). Output starts with a <think>...</think> block followed by the user-facing answer. To disable for latency-sensitive use, send /no_think in your prompt or pass enable_thinking=False in the chat template.

Requirements

Component	Minimum	Tested
GPU	12 GB VRAM (Q4_K_M weights + KV up to native 32K context)	RTX 5070 (12 GB)
RAM	16 GB system	—
Storage	5.03 GB (Q4_K_M GGUF); more for higher tiers	per unsloth/Qwen3-8B-GGUF
Driver	CUDA 12.8+ runtime (Blackwell sm_120)	—
Runtime	Ollama 0.5+ / llama.cpp / LM Studio	—

The model is released under Apache 2.0 — commercial use is permitted.

Installation

The fastest path is Ollama — one command pulls the canonical Q4_K_M build maintained by the Qwen team. Per the Qwen3 model card: "For local use, applications such as Ollama, LMStudio, MLX-LM, llama.cpp, and KTransformers have also supported Qwen3."

Option A — Ollama (recommended)

1. Install Ollama

# macOS / Linux
curl -fsSL https://ollama.com/install.sh | sh

(Windows: download from ollama.com/download.)

2. Pull the 8B model

ollama pull qwen3:8b

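This fetches a 5.2 GB Q4_K_M checkpoint per the Ollama qwen3:8b tag. It is a separate build from the 5.03 GB unsloth file cited elsewhere in this recipe, so the sizes differ slightly. The download is one file, so no manual quant-tier selection is needed.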

Option B — llama.cpp + community GGUF

If you want a different quant tier (Q5_K_M / Q6_K for higher fidelity, or Q3 for more KV headroom), use a community redistributor that publishes the full ladder:

1. Install llama.cpp

# macOS (Homebrew)
brew install llama.cpp

# Linux — pre-built CUDA binaries
# Visit https://github.com/ggml-org/llama.cpp/releases for cu12x builds

2. Pull the quant you want

Per the unsloth/Qwen3-8B-GGUF per-tier file-size table (link-back to upstream Qwen/Qwen3-8B confirmed on the page header):

Quant	File size	Notes
Q3_K_M	4.12 GB	smaller weights → more KV headroom for long context
Q4_K_M	5.03 GB	recommended sweet spot
Q5_K_M	5.85 GB	better quality
Q6_K	6.73 GB	"near perfect" per bartowski
Q8_0	8.71 GB	near-lossless — fits 12 GB but leaves little KV room

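Then load it via the llama.cpp Hugging Face shortcut. The unsloth model card shows this form with its UD-Q4_K_XL build; the commands below use the Q4_K_M tag this recipe pins, and any tier from the table can be substituted after the colon:

# OpenAI-compatible local server with web UI
llama-server -hf unsloth/Qwen3-8B-GGUF:Q4_K_M

# Interactive terminal
llama-cli -hf unsloth/Qwen3-8B-GGUF:Q4_K_M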

Option C — LM Studio (GUI)

LM Studio offers a one-click install path per the Qwen3-8B HF card. Search "Qwen3-8B GGUF" inside the app and pick the Q4_K_M tier, or use the direct-import link lmstudio://open_from_hf?model=unsloth/Qwen3-8B-GGUF.

Running

One-shot prompt via Ollama

ollama run qwen3:8b "Explain GQA attention in three sentences."

First run loads the model into VRAM (~5 GB resident at idle, growing as the KV cache fills with longer contexts). Subsequent prompts in the same session stay warm.

Disable thinking mode for short answers

ollama run qwen3:8b "/no_think What's the capital of France?"

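Per the Qwen3-8B HF card, /no_think is a soft switch: it turns thinking off for that turn while enable_thinking stays True in the chat template. The reply skips the chain-of-thought text but may still open with an empty <think></think> block; pass enable_thinking=False in the chat template to switch thinking off outright.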

OpenAI-compatible HTTP API

# Ollama exposes localhost:11434 by default
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3:8b",
    "messages": [{"role": "user", "content": "Write a haiku about Blackwell GPUs."}]
  }'
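
The same request can be sent from Python with only the standard library. This is a minimal sketch that assumes Ollama is listening on its default port, as in the curl example; the helper names build_payload and chat are illustrative, not part of any Ollama client library.

```python
import json
import urllib.request

# Assumption: Ollama's default address, matching the curl example.
OLLAMA_URL = "http://localhost:11434/v1/chat/completions"


def build_payload(prompt: str, model: str = "qwen3:8b") -> dict:
    """Build the same JSON body the curl example sends."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }


def chat(prompt: str, model: str = "qwen3:8b", timeout: float = 120.0) -> str:
    """POST the payload and return the assistant's reply text."""
    request = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(build_payload(prompt, model)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    # OpenAI-style responses carry the text under choices[0].message.content.
    return body["choices"][0]["message"]["content"]
```

With the server running, chat("Write a haiku about Blackwell GPUs.") returns the reply as a string. With thinking mode on, that string may begin with a <think>...</think> block.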

Context budget on 12 GB — the real constraint

Unlike the 16 GB siblings of this card, on the 12 GB RTX 5070 the KV cache is the binding constraint, not the weights. With Q4_K_M weights resident at ~5 GB and a usable VRAM envelope of roughly 10.5–11.3 GB on a desktop card with a display attached, the math works out as follows. Qwen3-8B uses grouped-query attention with 8 KV heads, a head dimension of 128, and 36 layers (from the model config.json: num_key_value_heads: 8, head_dim: 128, num_hidden_layers: 36). Each token therefore stores 2 (K and V) × 36 layers × 8 KV heads × 128 dims × 2 bytes (fp16) = 147,456 bytes, so the fp16 KV cache costs ~0.15 GB per 1K tokens:

Context	KV cache (fp16)	Weights + KV	Fits 12 GB?
4K	~0.6 GB	~5.6 GB	yes, with wide margin
16K	~2.4 GB	~7.4 GB	yes
32K (native max)	~4.8 GB	~9.9 GB	yes — at the edge of comfortable
64K (YaRN)	~9.7 GB	~14.7 GB	no — exceeds 12 GB
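
The arithmetic behind the table can be reproduced in a few lines. This is a minimal sketch under stated assumptions: decimal gigabytes, the 5.03 GB Q4_K_M file size as a stand-in for resident weights, and no allowance for compute buffers or runtime overhead. The function names are illustrative.

```python
# Architecture values quoted from the Qwen3-8B config.json.
NUM_LAYERS = 36
NUM_KV_HEADS = 8
HEAD_DIM = 128
BYTES_FP16 = 2
WEIGHTS_GB = 5.03  # assumption: Q4_K_M file size approximates resident weights


def kv_bytes_per_token() -> int:
    """Bytes of fp16 KV cache one token adds: K and V, per layer, per KV head."""
    return 2 * NUM_LAYERS * NUM_KV_HEADS * HEAD_DIM * BYTES_FP16


def kv_cache_gb(context_tokens: int) -> float:
    """fp16 KV cache size in decimal GB for a given context length."""
    return context_tokens * kv_bytes_per_token() / 1e9


for label, tokens in [("4K", 4096), ("16K", 16384), ("32K", 32768), ("64K", 65536)]:
    kv = kv_cache_gb(tokens)
    print(f"{label:>4}: KV {kv:4.1f} GB, weights + KV {WEIGHTS_GB + kv:4.1f} GB")
```

Real runs will sit somewhat above these figures, since the runtime also allocates scratch buffers that this estimate leaves out.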

So the realistic context ceiling on a 12 GB RTX 5070 at Q4_K_M is the full native 32K window — Qwen3-8B's native context is 32,768 tokens per the HF card. The hardware-corner.net RTX 5070 benchmark table confirms this empirically: its Qwen3-8B Q4_K row caps at a "Max 32k" badge and reports no 64K data, because the 64K KV cache won't fit in 12 GB. (The 16 GB RTX 5070 Ti, by contrast, reaches 64K on the same source.) To push past 32K you need YaRN RoPE scaling and a smaller quant (Q3_K_M frees ~0.9 GB) or KV-cache quantization — see Troubleshooting.
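
Having 32K of headroom does not mean the runtime uses it: Ollama typically loads models with a much shorter default context window (the exact default depends on the version). One way to request the full window is the num_ctx option on Ollama's native /api/chat endpoint. The sketch below assumes the default Ollama address; build_request and chat are hypothetical helper names, and the 32768 guard simply mirrors the native limit discussed above.

```python
import json
import urllib.request

# Assumption: default Ollama address; /api/chat is the native (non-OpenAI) endpoint.
OLLAMA_CHAT_URL = "http://localhost:11434/api/chat"
NATIVE_CONTEXT = 32768  # Qwen3-8B native window per the HF card


def build_request(prompt: str, num_ctx: int = NATIVE_CONTEXT, model: str = "qwen3:8b") -> dict:
    """Native Ollama chat body that asks for an explicit context window."""
    if num_ctx > NATIVE_CONTEXT:
        # Going further needs YaRN plus memory savings, which this sketch does not cover.
        raise ValueError("num_ctx exceeds Qwen3-8B's native context")
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "options": {"num_ctx": num_ctx},
        "stream": False,  # return one JSON object instead of a stream of chunks
    }


def chat(prompt: str, num_ctx: int = NATIVE_CONTEXT) -> str:
    """Send the request and return the assistant's reply text."""
    body = json.dumps(build_request(prompt, num_ctx)).encode("utf-8")
    request = urllib.request.Request(
        OLLAMA_CHAT_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=300) as response:
        return json.load(response)["message"]["content"]
```

Requesting a larger window makes Ollama reserve a larger KV cache up front, so expect VRAM use to jump at load time rather than grow gradually.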

Results

Speed: 85.8 tokens/s generation at 4k context, Q4_K quantization, measured on RTX 5070 — per the hardware-corner.net LLM benchmark table, surfaced via /check/qwen3-8b/rtx-5070. Generation slows as the KV cache grows: 59.1 tok/s at 16k and 43.6 tok/s at 32k on the same page. Prompt processing is much faster — 3,487.7 tok/s at 4k, 1,600.8 at 16k, 898.8 at 32k. The row stops at 32K (the "Max 32k" badge) because 64K won't fit in 12 GB.
VRAM usage: The cited backend benchmark records a 12.0 GB envelope for the Q4_K run on this card — see /check/qwen3-8b/rtx-5070. At idle the Q4_K_M weights occupy ~5 GB; the remainder is KV cache and runtime overhead, which is why the full 12 GB is the planning figure even though the weights are small. The derived envelope above shows weights + 32K KV landing near ~9.9 GB, leaving display headroom.
Quality notes: Q4_K_M is the community-default "sweet spot" — the bartowski Q-tier guide flags Q6_K as "near perfect, recommended" if you have the VRAM. On a 12 GB 5070 you can run up to Q8_0 (8.71 GB) if you keep context short, but Q4_K_M leaves the most room for the KV cache — the better trade on this card. There's no quality reason to go below Q4_K_M unless you specifically need long-context headroom, in which case Q3_K_M is the move.

For the full benchmark data and other-GPU comparisons, see /check/qwen3-8b/rtx-5070.

Troubleshooting

Ollama returns `Error: model requires more system memory` or hangs on load

Confirm that a recent NVIDIA driver and a CUDA 12.8+ runtime are installed (nvidia-smi should report a Blackwell-capable driver). The RTX 5070 uses the Blackwell architecture (sm_120), which requires CUDA 12.8 or newer; older CUDA builds do not ship sm_120 kernels and will either fall back to CPU or fail with a "no kernel image is available for execution on the device" error. Recent Ollama builds bundle a suitable CUDA runtime, but if you compile llama.cpp from source, configure it with cmake -B build -DGGML_CUDA=ON against CUDA 12.8+ (GGML_CUDA replaced the older LLAMA_CUDA flag). Watch nvidia-smi -l 1 in another terminal while a prompt is generating; if utilization stays at 0% and VRAM use does not rise by roughly 5 GB, the model is running on CPU, which usually means the driver or runtime is too old.
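
The nvidia-smi check can also be scripted. The sketch below queries utilization and memory through nvidia-smi's CSV output and applies a rough heuristic; the 4000 MiB threshold and the function names are assumptions chosen for illustration, based on the ~5 GB the Q4_K_M weights should occupy once loaded.

```python
import subprocess

QUERY = "utilization.gpu,memory.used,memory.total"


def parse_gpu_line(line: str) -> dict:
    """Parse one 'util, used, total' CSV row (nounits: percent, MiB, MiB)."""
    util, used, total = (int(field.strip()) for field in line.split(","))
    return {"util_pct": util, "used_mib": used, "total_mib": total}


def read_gpu(index: int = 0) -> dict:
    """Query the GPU at `index` once via nvidia-smi."""
    output = subprocess.run(
        [
            "nvidia-smi",
            f"--query-gpu={QUERY}",
            "--format=csv,noheader,nounits",
            f"--id={index}",
        ],
        capture_output=True,
        text=True,
        check=True,
    ).stdout
    return parse_gpu_line(output.strip().splitlines()[0])


def looks_offloaded(stats: dict, min_used_mib: int = 4000) -> bool:
    """Heuristic (assumed threshold): loaded weights should show up as used VRAM."""
    return stats["used_mib"] >= min_used_mib
```

Call read_gpu() while a prompt is generating; if looks_offloaded returns False, the model is probably not resident on the GPU.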

Out of memory at long context

On a 12 GB card the KV cache is the constraint, not the weights. At Q4_K_M you can run the full native 32K window, but 64K (via YaRN) needs ~9.7 GB of fp16 KV cache alone and will not fit alongside the 5 GB of weights. Three fixes, cheapest first: (1) drop to Q3_K_M (4.12 GB weights per unsloth/Qwen3-8B-GGUF) to free ~0.9 GB; (2) enable KV-cache quantization in llama.cpp (--cache-type-k q8_0 --cache-type-v q8_0) to roughly halve KV cost; llama.cpp has required flash attention to be active for a quantized V cache, so check the --flash-attn setting in your build if the flag is rejected; (3) cap the context explicitly with --ctx-size 32768 so the runtime doesn't over-allocate. For workflows that genuinely need 64K+, prefer chunking + retrieval over forcing long context onto this card.
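
To see how far the first two fixes go, the estimate below combines weight size and KV-cache precision at 64K. It is a rough sketch, not a measurement: it assumes file sizes approximate resident weights, ignores compute buffers, and takes q8_0 as 8.5 bits per cached value (32 int8 values plus one fp16 scale per block). The names are illustrative.

```python
# K and V x 36 layers x 8 KV heads x 128 head-dim x 2 bytes (fp16).
KV_BYTES_PER_TOKEN_FP16 = 2 * 36 * 8 * 128 * 2
# File sizes from the unsloth table, used as a stand-in for resident weights.
WEIGHTS_GB = {"Q4_K_M": 5.03, "Q3_K_M": 4.12}
# Bits stored per cached value; q8_0 adds one fp16 scale per 32 int8 values.
KV_BITS = {"f16": 16.0, "q8_0": 8.5}


def total_gb(quant: str, kv_type: str, context_tokens: int) -> float:
    """Estimated weights + KV cache in decimal GB, excluding runtime buffers."""
    kv_gb = context_tokens * KV_BYTES_PER_TOKEN_FP16 * (KV_BITS[kv_type] / 16.0) / 1e9
    return WEIGHTS_GB[quant] + kv_gb


for quant in WEIGHTS_GB:
    for kv_type in KV_BITS:
        print(f"{quant} + {kv_type} KV @ 64K: {total_gb(quant, kv_type, 65536):.1f} GB")
```

Under these assumptions only the q8_0 rows come in below 12 GB at 64K, and they land close to the usable envelope once display and runtime overhead are counted, so treat 64K as borderline rather than guaranteed.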

`<think>...</think>` output is bloating responses

Qwen3 enables thinking mode by default per the HF card (enable_thinking=True, "Default is True"). Send /no_think at the start of any user message to disable it for that turn, or pass enable_thinking=False if you're calling the chat-template API directly.
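
If thinking has to stay on but the reasoning should not reach the user, the block can be removed client-side. A minimal sketch, assuming the reply contains literal <think>...</think> tags as described above; strip_thinking and split_thinking are hypothetical helper names.

```python
import re

# Matches one <think>...</think> block (possibly empty) plus trailing whitespace.
THINK_BLOCK = re.compile(r"<think>.*?</think>\s*", flags=re.DOTALL)


def strip_thinking(text: str) -> str:
    """Return the user-facing answer with any <think> blocks removed."""
    return THINK_BLOCK.sub("", text).strip()


def split_thinking(text: str) -> tuple[str, str]:
    """Return (reasoning, answer) so the reasoning can be logged instead of shown."""
    match = re.search(r"<think>(.*?)</think>", text, flags=re.DOTALL)
    reasoning = match.group(1).strip() if match else ""
    return reasoning, strip_thinking(text)
```

This only trims what is displayed. The model still spends tokens generating the reasoning, so it does not help latency the way /no_think does.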

Using transformers directly — FA2 sm_120 wheel gap

If you bypass Ollama / llama.cpp and run the HF card quickstart via transformers directly, mind the FlashAttention-2 sm_120 gap: as of mid-2026, prebuilt FA2 wheels still do not ship sm_120 kernels for Blackwell consumer cards (tracked at Dao-AILab/flash-attention#2168). The Qwen3-8B quickstart uses torch_dtype="auto" and device_map="auto" without hardcoding attn_implementation="flash_attention_2", so it works out of the box with attn_implementation="eager" or "sdpa". If any third-party snippet you copy hardcodes attn_implementation="flash_attention_2", override it to "sdpa" until FA2 lands sm_120 wheels. Also: install the cu128 PyTorch wheel (pip install torch --index-url https://download.pytorch.org/whl/cu128) — the default cu126 wheel does not include sm_120 kernels.
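
One way to guard against a hardcoded FlashAttention-2 setting is to patch the loader arguments before they reach transformers. The sketch below is an illustration, not the model card's code: safe_attention_kwargs and load_qwen3 are hypothetical helpers, and the model name Qwen/Qwen3-8B is the upstream repository referenced earlier.

```python
def safe_attention_kwargs(kwargs: dict) -> dict:
    """Copy loader kwargs, swapping FlashAttention-2 for PyTorch SDPA."""
    patched = dict(kwargs)
    if patched.get("attn_implementation") == "flash_attention_2":
        patched["attn_implementation"] = "sdpa"
    return patched


def load_qwen3(model_name: str = "Qwen/Qwen3-8B", **kwargs):
    """Load tokenizer and model with the quickstart's settings, minus FA2."""
    # Imported lazily so the helper above works without transformers installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    options = {"torch_dtype": "auto", "device_map": "auto"}
    options.update(safe_attention_kwargs(kwargs))
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name, **options)
    return tokenizer, model
```

Keep in mind that the unquantized checkpoint is roughly 16 GB in 16-bit precision (8.2B parameters at 2 bytes each), so on a 12 GB card device_map="auto" will place part of the model in system RAM and generation will be slow.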

I want the larger 14B / 32B sibling

Qwen3-14B at Q4_K_M is ~9 GB on disk — it loads on a 12 GB card but leaves only ~2–3 GB for the KV cache, so usable context is short; swap qwen3:8b for qwen3:14b in any Ollama command and keep context tight. Qwen3-32B at Q4_K_M is ~20 GB and does not fit without aggressive offloading; same for the 30B MoE and 235B MoE variants (MoE total params must be resident — see the Qwen3 model card on the dense/MoE split). For a 14B+ recipe on this card, request via /contribute.