Qwen3-8B on RTX 5070 Ti: Q4_K_M GGUF via Ollama or llama.cpp

What You'll Build

A local Qwen3-8B chat / reasoning assistant running on a 16 GB RTX 5070 Ti, served through Ollama (or llama.cpp / LM Studio — same GGUF, three loaders). The recipe pins the dense 8B variant at Q4_K_M quantization (5.03 GB on disk), which leaves the RTX 5070 Ti's 16 GB envelope wildly over-provisioned for the weights themselves — once running, the binding question on this card isn't "does it fit" but "what to do with the spare VRAM."

Hardware data: RTX 5070 Ti (16 GB GDDR7) · Q4_K GGUF · 120.5 tokens/s generation at 4k context · See benchmark data

⚠️ Variant pinned — Qwen3 ships 8 sizes from the same Qwen org. Per the Ollama qwen3 tag list, Qwen3 spans 0.6b, 1.7b, 4b, 8b (this recipe), 14b, 30b (MoE), 32b, and 235b (MoE). The siblings have wildly different VRAM profiles — Qwen3-14B in Q4_K_M is ~8.5 GB and still fits 16 GB; Qwen3-32B in Q4_K_M is ~20 GB and overflows; Qwen3-235B (MoE, ~22B active) needs >100 GB total resident weights since the router can't pre-prune (see Qwen3 model card for the dense/MoE split). The instructions below are for the dense 8.2B model only. If you want 14B on this card, swap qwen3:8b for qwen3:14b; for 32B+ go to /contribute.

ℹ️ Thinking mode is on by default. Qwen3-8B has a built-in chain-of-thought ("thinking") mode that the model card's quickstart enables via enable_thinking=True. Output starts with a <think>...</think> block followed by the user-facing answer. To disable for latency-sensitive use, send /no_think in your prompt or pass enable_thinking=False in the chat template.

Requirements

Component	Minimum	Tested
GPU	6 GB VRAM (for Q4_K_M weights + KV at 4k ctx)	RTX 5070 Ti (16 GB)
RAM	16 GB system	—
Storage	5.03 GB (Q4_K_M GGUF) or 8.71 GB (Q8_0)	per unsloth/Qwen3-8B-GGUF
Driver	CUDA 12.8+ runtime (Blackwell sm_120)	—
Runtime	Ollama 0.5+ / llama.cpp / LM Studio	—

The model is released under Apache 2.0 — commercial use is permitted.

Installation

The fastest path is Ollama — one command pulls the canonical Q4_K_M build maintained by the Qwen team:

Option A — Ollama (recommended)

1. Install Ollama

# macOS / Linux
curl -fsSL https://ollama.com/install.sh | sh

(Windows: download from ollama.com/download.) Per the Qwen3 model card, "applications such as Ollama, LMStudio, MLX-LM, llama.cpp, and KTransformers have also supported Qwen3."

2. Pull the 8B model

ollama pull qwen3:8b

This fetches a 5.2 GB Q4_K_M checkpoint per the Ollama qwen3:8b tag. The download is one file — no manual quant-tier selection needed.

Option B — llama.cpp + community GGUF

If you want a different quant tier (Q6_K for higher fidelity, Q8_0 for near-lossless, BF16 because the 5070 Ti has the headroom for it), use a community redistributor that publishes the full ladder:

1. Install llama.cpp

# macOS (Homebrew)
brew install llama.cpp

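# Linux: pre-built binaries (llama.cpp does not ship as a Python wheel)
# Check https://github.com/ggml-org/llama.cpp/releases for a CUDA 12.x build for your platform; if none is listed, build from source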

2. Pull the quant you want

Per the unsloth/Qwen3-8B-GGUF per-tier file-size table (link-back to upstream Qwen/Qwen3-8B confirmed on the page header):

Quant	File size	Notes
Q4_K_M	5.03 GB	recommended for general use
Q5_K_M	5.85 GB	better quality, still tiny
Q6_K	6.73 GB	"near perfect" per bartowski
Q8_0	8.71 GB	near-lossless
BF16	16.4 GB	full precision; at the very edge of the 16 GB envelope, so cap context tightly

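Then fetch and run via the llama.cpp Hugging Face shortcut (per the unsloth model card). The commands below use the UD-Q4_K_XL dynamic quant; replace the tag after the colon (for example :Q4_K_M or :Q8_0) to pull a tier from the table above: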

# OpenAI-compatible local server with web UI
llama-server -hf unsloth/Qwen3-8B-GGUF:UD-Q4_K_XL

# Interactive terminal
llama-cli -hf unsloth/Qwen3-8B-GGUF:UD-Q4_K_XL

Option C — LM Studio (GUI)

LM Studio offers a one-click install path per the Qwen3-8B HF card. Search "Qwen3-8B GGUF" inside the app and pick the Q4_K_M tier, or use the direct-import link lmstudio://open_from_hf?model=unsloth/Qwen3-8B-GGUF.

Running

One-shot prompt via Ollama

ollama run qwen3:8b "Explain GQA attention in three sentences."

First run loads the model into VRAM (~5 GB resident at idle, growing as the KV cache fills with longer contexts). Subsequent prompts in the same session stay warm.

Disable thinking mode for short answers

ollama run qwen3:8b "/no_think What's the capital of France?"

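Per the Qwen3-8B HF card, /no_think is a soft switch: it turns thinking off for that turn while enable_thinking stays True in the chat template. The model may still emit an empty <think></think> block before the answer, but it skips the chain-of-thought content.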

OpenAI-compatible HTTP API

# Ollama exposes localhost:11434 by default
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3:8b",
    "messages": [{"role": "user", "content": "Write a haiku about Blackwell GPUs."}]
  }'
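The same request can be issued from Python with only the standard library. This is a minimal sketch, assuming Ollama is listening on its default port as in the curl call above; the helper names build_payload and chat are illustrative and not part of any Ollama client library.

```python
import json
import urllib.request

# Ollama's default OpenAI-compatible endpoint, as used in the curl example.
OLLAMA_URL = "http://localhost:11434/v1/chat/completions"


def build_payload(prompt: str, model: str = "qwen3:8b", thinking: bool = True) -> dict:
    """Build the chat-completions request body.

    When thinking is False, the /no_think soft switch is prepended to the prompt.
    """
    content = prompt if thinking else f"/no_think {prompt}"
    return {"model": model, "messages": [{"role": "user", "content": content}]}


def chat(prompt: str, thinking: bool = True, timeout: float = 120.0) -> str:
    """POST the prompt to the local server and return the assistant's text."""
    body = json.dumps(build_payload(prompt, thinking=thinking)).encode("utf-8")
    request = urllib.request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        reply = json.load(response)
    return reply["choices"][0]["message"]["content"]


# Example (needs a running Ollama server with qwen3:8b pulled):
# print(chat("Write a haiku about Blackwell GPUs.", thinking=False))
```

Keeping the payload builder separate from the network call makes it easy to inspect exactly what is sent before pointing the client at a live server.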

For higher-throughput, production-style serving, the upstream Qwen3-8B card documents vllm serve Qwen/Qwen3-8B --enable-reasoning --reasoning-parser deepseek_r1 and python -m sglang.launch_server --model-path Qwen/Qwen3-8B --reasoning-parser qwen3. Both load BF16 weights (16.4 GB), which leaves almost nothing of the 5070 Ti's 16 GB for KV cache and runtime overhead, so BF16 serving on this card is marginal at best and only plausible with a very small context cap. For comfortable headroom, prefer Ollama or llama.cpp with the Q4_K_M GGUF.

Spending the headroom — what to do with the spare VRAM

The Q4_K_M weights occupy ~5 GB at idle on the RTX 5070 Ti's 16 GB envelope, leaving roughly 10 GB of spare VRAM before the KV cache grows. The genuine per-GPU question on this card isn't "does it fit" (it fits trivially) — it's how to use that headroom. Three concrete options, all citable to the model card or the runtime:

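Quant up, not down. Q8_0 weights are 8.71 GB and near-lossless per bartowski; BF16 is 16.4 GB and right at the envelope edge. On a 5070 Ti you can drop in unsloth/Qwen3-8B-GGUF:Q8_0 and still have room for context. BF16 removes quantization error entirely but leaves almost no VRAM for the KV cache, so it needs a tightly capped context (for example --ctx-size 8192) and may still spill some layers to system RAM.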
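Longer context: push toward 131K with YaRN. Qwen3-8B's native window is 32,768 tokens per the HF card, extendable to 131,072 with YaRN. Assuming the model's 36-layer, 8-KV-head, 128-dim-head attention layout, an fp16 KV cache costs about 144 KiB per token, which is roughly 4.5 GiB at 32K and 9 GiB at 64K. At Q4_K_M both fit beside the ~5 GB of weights, but a full 128K fp16 cache (about 18 GiB) does not, so going past 64K needs a quantized KV cache (for example --cache-type-k q8_0 --cache-type-v q8_0 in llama.cpp) or partial offload. YaRN itself is enabled via --rope-scaling yarn --rope-scale 4 --yarn-orig-ctx 32768 per the unsloth GGUF instructions.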
Colocate a smaller helper model. A 1.7B to 4B sidecar (Qwen3-1.7B or Qwen3-4B) at Q4 takes roughly 1 to 3 GB, and a Whisper-small ASR model for voice pipelines takes less. Running two Ollama models concurrently (Ollama keeps both loaded in VRAM until they age out) lets you build retrieval, RAG, or multi-stage pipelines on a single card. See the Ollama qwen3 tags for the smaller-variant tags.
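To make the headroom arithmetic concrete, here is a minimal sketch of the fp16 KV-cache cost at several context lengths. The layer count, KV-head count, and head dimension are assumptions based on the published Qwen3-8B configuration and should be checked against the model's config.json; the function name is illustrative.

```python
# Assumed Qwen3-8B attention shape (verify against config.json before relying on it).
N_LAYERS = 36     # transformer layers
N_KV_HEADS = 8    # grouped-query attention uses 8 key/value heads
HEAD_DIM = 128    # dimension per attention head
GIB = 1024 ** 3


def kv_cache_bytes(n_tokens: int, bytes_per_elem: float = 2.0) -> float:
    """Bytes needed for keys plus values across all layers (2.0 = fp16 cache)."""
    per_token = 2 * N_LAYERS * N_KV_HEADS * HEAD_DIM * bytes_per_elem
    return per_token * n_tokens


for ctx in (4096, 32768, 65536, 131072):
    print(f"{ctx:>7} tokens: {kv_cache_bytes(ctx) / GIB:.2f} GiB of fp16 KV cache")
```

Under these assumptions the cache comes to 4.5 GiB at 32K, 9 GiB at 64K, and 18 GiB at 128K, so the largest window only fits on a 16 GB card with a quantized cache or offloading. Passing bytes_per_elem=1.0 gives a rough lower bound for an 8-bit cache.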

Results

Speed: 120.5 tokens/s generation at 4k context, Q4_K quantization, measured on RTX 5070 Ti — per the hardware-corner.net LLM benchmark table, surfaced via /check/qwen3-8b/rtx-5070-ti. The full context ladder on the same page: 87.5 tok/s at 16k, 63.3 tok/s at 32k, 40.4 tok/s at 64k as the KV cache grows; the row caps at 64K (Qwen3-8B's "Max 64k" badge on Hardware Corner). Prompt processing is much faster — 5,557.3 tok/s at 4k context, 3,653.8 at 16k, 2,269.0 at 32k, 1,078.7 at 64k per the same source.
VRAM usage: The cited backend benchmark records a 16.0 GB peak for the 4k-context Q4_K run on this card (see /check/qwen3-8b/rtx-5070-ti). That figure equals the card's total capacity and is far above what the ~5 GB of Q4_K_M weights plus a 4k KV cache should need, so treat it as an upper bound that likely reflects KV cache and runtime allocation across the whole benchmark run rather than the steady-state footprint at 4k. The official Qwen speed benchmark corroborates the precision/VRAM ladder on H20 hardware: BF16 = 15947 MB, FP8 = 9323 MB, AWQ-INT4 = 6177 MB.
Quality notes: Q4_K_M is the community-default "sweet spot" — the bartowski Q-tier guide flags Q6_K as "near perfect, recommended" if you have the VRAM. On a 16 GB 5070 Ti you can also run Q6_K (6.73 GB), Q8_0 (8.71 GB), or BF16 (16.4 GB, at the envelope edge with context capped). There's no quality reason to pick anything below Q4_K_M on this card.
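As a rough check on which tiers leave room for context, the sketch below converts the file sizes quoted in this recipe into GiB and subtracts them from the card's memory. The 1 GiB reserve for runtime overhead and display output is an assumption, as is treating the file size as the resident weight size and the card's "16 GB" as 16 GiB.

```python
GIB = 1024 ** 3
VRAM_GIB = 16.0  # assumption: the card's "16 GB" is 16 GiB

# File sizes in decimal GB, as quoted in this recipe.
QUANT_GB = {"Q4_K_M": 5.03, "Q5_K_M": 5.85, "Q6_K": 6.73, "Q8_0": 8.71, "BF16": 16.4}


def headroom_gib(file_gb: float, reserve_gib: float = 1.0) -> float:
    """GiB left for the KV cache after the weights and an assumed runtime reserve."""
    weights_gib = file_gb * 1e9 / GIB
    return VRAM_GIB - weights_gib - reserve_gib


for name, size_gb in QUANT_GB.items():
    print(f"{name:7s} {headroom_gib(size_gb):6.2f} GiB left for KV cache")
```

A negative result, which is what BF16 produces with this reserve, means the tier cannot sit fully in VRAM alongside any meaningful context.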

For the full benchmark data and other-GPU comparisons, see /check/qwen3-8b/rtx-5070-ti.

Troubleshooting

Ollama returns `Error: model requires more system memory` or hangs on load

Confirm that a Blackwell-capable NVIDIA driver and a CUDA 12.8+ runtime are installed (nvidia-smi reports the driver version). The RTX 5070 Ti uses the Blackwell architecture (sm_120), which requires CUDA 12.8 or newer; older CUDA builds do not ship sm_120 kernels and will either fall back to CPU or fail with a "no kernel image is available for execution on the device" error. A CPU fallback can also trigger the system-memory error, because the weights then have to fit in system RAM instead of VRAM. Ollama's bundled CUDA runtime handles this in recent builds, but if you compile llama.cpp from source, configure it with CMake using -DGGML_CUDA=ON (the older LLAMA_CUDA=1 Makefile flag is deprecated) against CUDA 12.8+. Watch nvidia-smi -l 1 in another terminal to confirm the GPU is actually in use; if utilization stays at 0% during generation, the model is running on CPU, which most often means the driver or runtime is too old.
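The GPU check can also be scripted. The sketch below wraps nvidia-smi's CSV query mode and parses one row per GPU; the helper names are illustrative, and the parser assumes every queried field returns a plain value (a field reported as [N/A] would need extra handling).

```python
import subprocess

# Fields accepted by nvidia-smi's --query-gpu option.
QUERY = "name,driver_version,compute_cap,memory.used,utilization.gpu"


def parse_gpu_line(line: str) -> dict:
    """Parse one CSV row produced by the nvidia-smi query in query_gpus()."""
    name, driver, cap, mem_mib, util = [field.strip() for field in line.split(",")]
    return {
        "name": name,
        "driver": driver,
        "compute_cap": cap,            # "12.0" corresponds to sm_120
        "memory_used_mib": int(mem_mib),
        "utilization_pct": int(util),
    }


def query_gpus() -> list[dict]:
    """Run nvidia-smi and return one dict per GPU (requires the NVIDIA driver)."""
    out = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return [parse_gpu_line(line) for line in out.splitlines() if line.strip()]
```

Call query_gpus() while a prompt is generating: if memory_used_mib stays far below the ~5 GB weight footprint and utilization_pct stays at 0, the model is most likely running on CPU.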

`<think>...</think>` output is bloating responses

Qwen3 enables thinking mode by default per the HF card quickstart. Send /no_think at the start of any user message to disable it for that turn, or pass enable_thinking=False if you're calling the chat-template API directly.
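If you want to keep thinking enabled but hide the reasoning from end users, the block can be separated client-side. This is a minimal sketch that assumes the output contains at most one well-formed <think>...</think> block, as described above; split_thinking is an illustrative helper, not part of any Qwen or Ollama library.

```python
import re

# Matches a single <think>...</think> block, including newlines inside it.
_THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_thinking(text: str) -> tuple[str, str]:
    """Return (reasoning, answer) from raw model output.

    If no <think> block is present, reasoning is an empty string.
    """
    match = _THINK_RE.search(text)
    if match is None:
        return "", text.strip()
    reasoning = match.group(1).strip()
    answer = (text[:match.start()] + text[match.end():]).strip()
    return reasoning, answer


raw = "<think>\nFrance's capital is Paris.\n</think>\n\nParis."
reasoning, answer = split_thinking(raw)
print(answer)
```

The same helper also handles an empty <think></think> block, returning an empty reasoning string and the answer unchanged.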

Using transformers directly — FA2 sm_120 wheel gap

If you bypass Ollama / llama.cpp and run the HF card quickstart via transformers directly, mind the FlashAttention-2 sm_120 gap: as of mid-2026, prebuilt FA2 wheels still do not ship sm_120 kernels for Blackwell consumer cards (tracked at Dao-AILab/flash-attention#2168). The Qwen3-8B quickstart uses torch_dtype="auto" and device_map="auto" without hardcoding attn_implementation="flash_attention_2", so it works out of the box with attn_implementation="eager" or "sdpa". If any third-party snippet you copy hardcodes attn_implementation="flash_attention_2", override it to "sdpa" until FA2 lands sm_120 wheels. Also: install the cu128 PyTorch wheel (pip install torch --index-url https://download.pytorch.org/whl/cu128) — the default cu126 wheel does not include sm_120 kernels.
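A minimal sketch of the override described above, assuming transformers, accelerate, and the cu128 PyTorch wheel are installed. The helper names are illustrative; the heavy imports are deferred so the keyword-argument helper can be inspected without loading anything.

```python
MODEL_ID = "Qwen/Qwen3-8B"


def load_kwargs(attn: str = "sdpa") -> dict:
    """Keyword arguments for from_pretrained with FlashAttention-2 ruled out."""
    if attn == "flash_attention_2":
        # Prebuilt FA2 wheels may lack sm_120 kernels; fall back explicitly instead.
        raise ValueError("use 'sdpa' or 'eager' on sm_120 until FA2 wheels support it")
    return {"torch_dtype": "auto", "device_map": "auto", "attn_implementation": attn}


def load_model():
    """Load tokenizer and model; downloads the BF16 weights on first call."""
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(MODEL_ID, **load_kwargs())
    return tokenizer, model
```

Because the BF16 checkpoint is about as large as the card's VRAM, device_map="auto" may place some layers on the CPU, which will be noticeably slower than the GGUF paths.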

I want the larger 14B / 32B sibling

Qwen3-14B at Q4_K_M is ~8.5 GB on disk and fits a 16 GB card with plenty of room — swap qwen3:8b for qwen3:14b in any Ollama command. Qwen3-32B at Q4_K_M is ~20 GB and does not fit without aggressive offloading; same for the 30B MoE and 235B MoE variants (MoE total params must be resident — see the Qwen3 model card on the dense/MoE split). For a 32B+ recipe on this card, request via /contribute.

Generation slows dramatically past 32k context

Qwen3's native context window is 32,768 tokens per the HF card, extendable to 131,072 with YaRN. Beyond 32k the model needs YaRN scaling, supported in llama.cpp via --rope-scaling yarn --rope-scale 4 --yarn-orig-ctx 32768 per the unsloth GGUF instructions, but quality degrades and the KV cache balloons. For long-document workflows, prefer chunking plus retrieval over pushing context past 32k. The hardware-corner.net benchmark shows generation falling to 40.4 tok/s at 64k context on this card (vs 120.5 at 4k and 63.3 at 32k); Qwen3-8B's row stops at the "Max 64k" badge on Hardware Corner.
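To see what the slowdown means in wall-clock terms, the sketch below turns the benchmark rates quoted in this recipe into a rough end-to-end estimate. It assumes the prompt-processing rate applies uniformly to the whole prompt and the generation rate to every output token, which is a simplification; the function name is illustrative.

```python
# context size -> (prompt tok/s, generation tok/s), from the figures quoted in this recipe.
RATES = {
    4096: (5557.3, 120.5),
    16384: (3653.8, 87.5),
    32768: (2269.0, 63.3),
    65536: (1078.7, 40.4),
}


def estimate_seconds(context: int, prompt_tokens: int, output_tokens: int) -> float:
    """Prompt-processing time plus generation time at one benchmarked context size."""
    prompt_rate, gen_rate = RATES[context]
    return prompt_tokens / prompt_rate + output_tokens / gen_rate


for ctx in RATES:
    # Fill the whole window with prompt, then generate 500 tokens.
    print(f"{ctx:>6} ctx: {estimate_seconds(ctx, ctx, 500):6.1f} s")
```

With these figures, a full 64k prompt plus 500 generated tokens takes a little over a minute, against roughly five seconds at 4k, which is why chunking plus retrieval is usually the better trade.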