
Qwen3-8B on RTX 4070: Q4_K_M GGUF via Ollama or llama.cpp

llm · beginner · 12GB+ VRAM · Jun 5, 2026
Prerequisites
  • NVIDIA RTX 4070 (12 GB GDDR6X) or equivalent 12 GB CUDA card
  • Recent NVIDIA driver with CUDA 12.x runtime (Ada sm_89 — no special wheel selection required)
  • ~6 GB free disk for the Q4_K_M GGUF checkpoint (more for higher quant tiers)
  • Ollama, llama.cpp, or LM Studio installed

What You'll Build

A local Qwen3-8B chat / reasoning assistant running on a 12 GB RTX 4070, served through Ollama (or llama.cpp / LM Studio — same GGUF, three loaders). The recipe pins the dense 8B variant at Q4_K_M quantization (5.03 GB on disk), which fits the RTX 4070's 12 GB envelope comfortably with room left for the KV cache. On this card the binding question is no longer "do the weights fit" — they take less than half the VRAM — but "how long a context can I run before the KV cache eats the remaining headroom."

Hardware data: RTX 4070 (12 GB GDDR6X) · Q4_K GGUF · 71.2 tokens/s generation at 4k context · See benchmark data

⚠️ Variant pinned — Qwen3 ships 8 sizes from the same Qwen org. Per the Ollama qwen3 tag list, Qwen3 spans 0.6b, 1.7b, 4b, 8b (this recipe), 14b, 30b (MoE), 32b, and 235b (MoE). The siblings have wildly different VRAM profiles — Qwen3-14B in Q4_K_M is ~9 GB and is tight on a 12 GB card (little left for KV); Qwen3-32B in Q4_K_M is ~20 GB and overflows; Qwen3-235B (MoE, ~22B active) needs >100 GB total resident weights since the router can't pre-prune (see Qwen3 model card for the dense/MoE split). The instructions below are for the dense 8.2B model only. For the 14B+ siblings on this card, go to /contribute.

ℹ️ Thinking mode is on by default. Qwen3-8B has a built-in chain-of-thought ("thinking") mode that the model card enables via enable_thinking=True (the card notes "Default is True"). Output starts with a <think>...</think> block followed by the user-facing answer. To disable for latency-sensitive use, send /no_think in your prompt or pass enable_thinking=False in the chat template.

Requirements

Component | Minimum | Tested
GPU | 12 GB VRAM (Q4_K_M weights + KV up to native 32K context) | RTX 4070 (12 GB)
RAM | 16 GB system | -
Storage | 5.03 GB (Q4_K_M GGUF); more for higher tiers | per unsloth/Qwen3-8B-GGUF
Driver | CUDA 12.x runtime (Ada sm_89) | -
Runtime | Ollama 0.5+ / llama.cpp / LM Studio | -

The model is released under Apache 2.0 — commercial use is permitted. The weights are not gated on Hugging Face, so no access request or login is required to download them.

Installation

The fastest path is Ollama — one command pulls the canonical Q4_K_M build maintained by the Qwen team. Per the Qwen3 model card: "For local use, applications such as Ollama, LMStudio, MLX-LM, llama.cpp, and KTransformers have also supported Qwen3."

Option A — Ollama (recommended)

1. Install Ollama

# macOS / Linux
curl -fsSL https://ollama.com/install.sh | sh

(Windows: download from ollama.com/download.)

2. Pull the 8B model

ollama pull qwen3:8b

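This fetches a 5.2 GB Q4_K_M checkpoint per the Ollama qwen3:8b tag (8.19B parameters). It is a separate build from the 5.03 GB unsloth file cited elsewhere in this recipe, hence the small size difference. The download is one file, so no manual quant-tier selection is needed.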

Option B — llama.cpp + community GGUF

If you want a different quant tier (Q5_K_M / Q6_K for higher fidelity, or Q3 for more KV headroom), use a community redistributor that publishes the full ladder:

1. Install llama.cpp

# macOS (Homebrew)
brew install llama.cpp

# Linux — pre-built CUDA binaries
# Visit https://github.com/ggml-org/llama.cpp/releases for cu12x builds

2. Pull the quant you want

Per the unsloth/Qwen3-8B-GGUF per-tier file-size table (link-back to upstream Qwen/Qwen3-8B confirmed on the page header):

Quant | File size | Notes
Q3_K_M | 4.12 GB | smaller weights → more KV headroom for long context
Q4_K_M | 5.03 GB | recommended sweet spot
Q5_K_M | 5.85 GB | better quality
Q6_K | 6.73 GB | flagged "near perfect" and recommended per bartowski
Q8_0 | 8.71 GB | near-lossless; fits 12 GB but leaves little KV room

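Then fetch and run it via the llama.cpp Hugging Face shortcut (per the unsloth model card). The card's example tag, UD-Q4_K_XL, is an Unsloth dynamic quant that is not in the table above; replace the part after the colon with a tier from the table, such as Q4_K_M, to pull that file instead: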

# OpenAI-compatible local server with web UI
llama-server -hf unsloth/Qwen3-8B-GGUF:UD-Q4_K_XL

# Interactive terminal
llama-cli -hf unsloth/Qwen3-8B-GGUF:UD-Q4_K_XL

Option C — LM Studio (GUI)

LM Studio offers a one-click install path per the Qwen3-8B HF card. Search "Qwen3-8B GGUF" inside the app and pick the Q4_K_M tier, or use the direct-import link lmstudio://open_from_hf?model=unsloth/Qwen3-8B-GGUF.

Running

One-shot prompt via Ollama

ollama run qwen3:8b "Explain GQA attention in three sentences."

First run loads the model into VRAM (~5 GB resident at idle, growing as the KV cache fills with longer contexts). Subsequent prompts in the same session stay warm.

Disable thinking mode for short answers

ollama run qwen3:8b "/no_think What's the capital of France?"

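Per the Qwen3-8B HF card, /no_think is a soft switch: it leaves enable_thinking at its default but tells the model to skip reasoning for that turn. The response may still open with an empty <think></think> block; only enable_thinking=False in the chat template removes the block entirely.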

OpenAI-compatible HTTP API

# Ollama exposes localhost:11434 by default
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3:8b",
    "messages": [{"role": "user", "content": "Write a haiku about Ada Lovelace GPUs."}]
  }'
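
The same request can be sent from Python using only the standard library. This is a minimal sketch, assuming Ollama is listening on the default port shown in the curl example. The helper names build_payload and chat are illustrative and not part of any Ollama client library.

```python
import json
import urllib.request

# Default Ollama endpoint, taken from the curl example above.
OLLAMA_URL = "http://localhost:11434/v1/chat/completions"


def build_payload(prompt: str, model: str = "qwen3:8b", thinking: bool = True) -> dict:
    """Build an OpenAI-style chat request body.

    When thinking is False, the /no_think soft switch is prepended to the prompt.
    """
    content = prompt if thinking else f"/no_think {prompt}"
    return {"model": model, "messages": [{"role": "user", "content": content}]}


def chat(prompt: str, thinking: bool = True, timeout: float = 120.0) -> str:
    """POST the payload to the local server and return the assistant's reply text."""
    body = json.dumps(build_payload(prompt, thinking=thinking)).encode("utf-8")
    request = urllib.request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        reply = json.load(response)
    return reply["choices"][0]["message"]["content"]
```

With the server running, chat("Write a haiku about Ada Lovelace GPUs.") returns the reply as a string. Passing thinking=False trades reasoning depth for lower latency.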

Context budget on 12 GB — the real constraint

On a 16 GB card, the same Q4_K_M weights leave enough headroom for a 64K context (~14.7 GB total, see the table below). On the 12 GB RTX 4070 they do not: the KV cache, not the weights, is the binding constraint. The Q4_K_M weights occupy ~5 GB, and the usable VRAM envelope is roughly 10.5–11.3 GB on a desktop card with a display attached. Qwen3-8B uses grouped-query attention with 8 KV heads, a head dimension of 128, and 36 layers (from the model config.json: num_key_value_heads: 8, head_dim: 128, num_hidden_layers: 36). Each token stores one key and one value vector per layer, so the fp16 cache costs 2 × 36 × 8 × 128 × 2 bytes ≈ 147 KB per token, or ~0.15 GB per 1K tokens:

Context | KV cache (fp16) | Weights + KV | Fits 12 GB?
4K | ~0.6 GB | ~5.6 GB | yes, with wide margin
16K | ~2.4 GB | ~7.5 GB | yes
32K (native max) | ~4.8 GB | ~9.9 GB | yes, at the edge of comfortable
64K (YaRN) | ~9.7 GB | ~14.7 GB | no, exceeds 12 GB

So the realistic context ceiling on a 12 GB RTX 4070 at Q4_K_M is the full native 32K window — Qwen3-8B's native context is 32,768 tokens per the HF card. The hardware-corner.net RTX 4070 benchmark table confirms this empirically: its Qwen3-8B Q4_K row reports data only up to the 32K column and stops there (no 64K figure), because the 64K KV cache won't fit in 12 GB. To push past 32K you need YaRN RoPE scaling and a smaller quant (Q3_K_M frees ~0.9 GB) or KV-cache quantization — see Troubleshooting.
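
The KV-cache arithmetic behind the table can be reproduced in a few lines. This is a sketch under stated assumptions: it uses the config values and the 5.03 GB Q4_K_M file size quoted in this recipe, counts decimal gigabytes, and ignores runtime compute buffers, so real usage will be somewhat higher. The results should land within rounding of the table.

```python
# Values quoted in this section from the model's config.json.
NUM_KV_HEADS = 8
HEAD_DIM = 128
NUM_LAYERS = 36
BYTES_FP16 = 2
WEIGHTS_GB = 5.03  # Q4_K_M file size from the quant table


def kv_cache_gb(context_tokens: int, bytes_per_element: float = BYTES_FP16) -> float:
    """KV cache size in decimal GB: one K and one V vector per layer per token."""
    per_token_bytes = 2 * NUM_LAYERS * NUM_KV_HEADS * HEAD_DIM * bytes_per_element
    return context_tokens * per_token_bytes / 1e9


for tokens in (4096, 16384, 32768, 65536):
    kv = kv_cache_gb(tokens)
    print(f"{tokens:>6} tokens: KV {kv:4.1f} GB, weights + KV {WEIGHTS_GB + kv:4.1f} GB")
```

The leading factor of 2 accounts for storing both keys and values. Dropping it is the most common reason a hand estimate comes out at half the real figure.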

Results

  • Speed: 71.2 tokens/s generation at 4k context, Q4_K quantization, measured on the RTX 4070 — per the hardware-corner.net LLM benchmark table, surfaced via /check/qwen3-8b/rtx-4070. Generation slows as the KV cache grows: 52.07 tok/s at 16k and 38.07 tok/s at 32k on the same page. Prompt processing is much faster — 3,564.07 tok/s at 4k, 2,064.25 at 16k, 1,116.68 at 32k. The RTX 4070 row reports no 64K data because that context won't fit in 12 GB.
  • VRAM usage: Plan on the full 12 GB as the working envelope for the Q4_K run. The Q4_K_M weights occupy ~5 GB at idle; KV cache and runtime overhead take up the rest as the context grows. The derived envelope above puts weights + 32K KV near ~9.9 GB, which leaves headroom for the display. This figure is derived, not measured: see /check/qwen3-8b/rtx-4070, where a community-submitted measurement via /contribute will replace it with an on-card peak.
  • Quality notes: Q4_K_M is the community-default "sweet spot" — the bartowski Q-tier guide describes Q6_K as "near perfect" and marks it recommended if you have the VRAM. On a 12 GB 4070 you can run up to Q8_0 (8.71 GB) if you keep context short, but Q4_K_M leaves the most room for the KV cache — the better trade on this card. There's no quality reason to go below Q4_K_M unless you specifically need long-context headroom, in which case Q3_K_M is the move.
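
The quant-versus-context trade described above can be turned into a quick lookup. This is a rough sketch, not a measurement: it assumes the file sizes from the unsloth table, the ~0.15 GB per 1K tokens fp16 KV cost, and the conservative 10.5 GB end of the usable envelope, and it ignores runtime overhead. The function name largest_quant_that_fits is made up for illustration.

```python
# File sizes in GB from the unsloth/Qwen3-8B-GGUF table in this recipe.
QUANT_SIZES_GB = {
    "Q3_K_M": 4.12,
    "Q4_K_M": 5.03,
    "Q5_K_M": 5.85,
    "Q6_K": 6.73,
    "Q8_0": 8.71,
}
KV_GB_PER_1K_TOKENS = 0.15  # fp16 KV cost quoted in the context-budget section
USABLE_VRAM_GB = 10.5       # assumption: conservative end of the 10.5-11.3 GB envelope


def largest_quant_that_fits(context_tokens, usable_gb=USABLE_VRAM_GB):
    """Return the biggest quant whose weights plus fp16 KV cache fit, or None."""
    kv_gb = context_tokens / 1024 * KV_GB_PER_1K_TOKENS
    fitting = [name for name, size in QUANT_SIZES_GB.items() if size + kv_gb <= usable_gb]
    return max(fitting, key=QUANT_SIZES_GB.get) if fitting else None


for tokens in (4096, 16384, 32768, 65536):
    print(tokens, largest_quant_that_fits(tokens))
```

Under these assumptions the higher tiers only pay off at short contexts, which matches the advice to stay on Q4_K_M when you want the full 32K window.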

For the full benchmark data and other-GPU comparisons, see /check/qwen3-8b/rtx-4070.

Troubleshooting

Ollama returns Error: model requires more system memory or hangs on load

Confirm a recent NVIDIA driver and CUDA 12.x runtime are installed: the nvidia-smi header should report a driver version and "CUDA Version: 12.x". The RTX 4070 uses the Ada Lovelace architecture (sm_89), which mainline CUDA wheels have fully supported since 2023, so no special build flags or wheel pinning are required, and the default pip install torch (cu124) already includes sm_89 kernels. If Ollama still appears to hang on first load, run nvidia-smi -l 1 in another terminal to confirm the GPU is actually being used; if utilization stays at 0%, reinstall Ollama and re-pull the model.
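
The driver and memory check can also be scripted. The sketch below assumes nvidia-smi is on the PATH and uses its --query-gpu CSV output; the helper names are illustrative. If used_mib stays near its idle value after a model load, the weights are probably not on the GPU.

```python
import subprocess

QUERY = "name,driver_version,memory.total,memory.used"


def parse_gpu_line(line: str) -> dict:
    """Parse one CSV row produced by nvidia-smi with --format=csv,noheader,nounits."""
    name, driver, total_mib, used_mib = [field.strip() for field in line.split(",")]
    return {
        "name": name,
        "driver": driver,
        "total_mib": int(total_mib),
        "used_mib": int(used_mib),
    }


def query_gpus() -> list:
    """Run nvidia-smi and return one dict per GPU (raises if nvidia-smi is missing)."""
    result = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return [parse_gpu_line(line) for line in result.stdout.splitlines() if line.strip()]
```

Call query_gpus() once before and once after ollama run and compare used_mib; a jump of roughly 5 GB indicates the Q4_K_M weights were loaded into VRAM.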

Out of memory at long context

On a 12 GB card the KV cache is the constraint, not the weights. At Q4_K_M you can run the full native 32K window, but 64K (via YaRN) needs ~9.7 GB of KV cache alone and will not fit alongside the 5 GB of weights. Three fixes, cheapest first: (1) drop to Q3_K_M (4.12 GB weights per unsloth/Qwen3-8B-GGUF) to free ~0.9 GB; (2) enable KV-cache quantization in llama.cpp (--cache-type-k q8_0 --cache-type-v q8_0) to roughly halve KV cost (quantizing the V cache requires flash attention to be enabled via the --flash-attn option); (3) cap the context explicitly with --ctx-size 32768 so the runtime doesn't over-allocate. For workflows that genuinely need 64K+, prefer chunking + retrieval over forcing long context onto this card.
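
To see how much each fix buys at 64K, the estimate below combines the quant sizes with a q8_0 KV cache. It is a back-of-the-envelope sketch: the q8_0 cost assumes ggml's block layout of 32 int8 values plus one fp16 scale (34 bytes per 32 elements), the 10.5 GB limit is the conservative end of the usable envelope, and compute buffers are not counted. Treat any result within a few hundred MB of the limit as not fitting in practice.

```python
KV_FP16_GB_PER_1K = 0.15          # fp16 KV cost from the context-budget section
Q8_0_BYTES_PER_ELEMENT = 34 / 32  # q8_0 block: 32 int8 values plus one fp16 scale
USABLE_VRAM_GB = 10.5             # assumption: conservative usable envelope


def total_gb(weights_gb, context_tokens, kv_q8=False):
    """Weights plus KV cache in GB, optionally with a q8_0-quantized cache."""
    kv = context_tokens / 1024 * KV_FP16_GB_PER_1K
    if kv_q8:
        kv *= Q8_0_BYTES_PER_ELEMENT / 2  # fp16 uses 2 bytes per element
    return weights_gb + kv


options = {
    "Q4_K_M, fp16 KV": total_gb(5.03, 65536),
    "Q3_K_M, fp16 KV": total_gb(4.12, 65536),
    "Q4_K_M, q8_0 KV": total_gb(5.03, 65536, kv_q8=True),
    "Q3_K_M, q8_0 KV": total_gb(4.12, 65536, kv_q8=True),
}
for label, gb in options.items():
    verdict = "fits" if gb <= USABLE_VRAM_GB else "does not fit"
    print(f"{label}: {gb:.1f} GB ({verdict})")
```

The smaller quant alone does not rescue 64K; the KV-cache quantization does most of the work, and combining both leaves the widest margin.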

<think>...</think> output is bloating responses

Qwen3 enables thinking mode by default per the HF card (enable_thinking=True, "Default is True"). Send /no_think at the start of any user message to disable it for that turn, or pass enable_thinking=False if you're calling the chat-template API directly.
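
If you want to keep thinking mode on but hide the reasoning from end users, the block can be split off on the client side. This is a minimal sketch, assuming the <think> tags arrive inline in the response text; split_thinking is an illustrative helper, not part of any Qwen or Ollama API.

```python
import re

# Non-greedy match so only the first <think>...</think> block is captured.
THINK_BLOCK = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_thinking(text: str):
    """Return (reasoning, answer) from a raw Qwen3 response string."""
    match = THINK_BLOCK.search(text)
    if match is None:
        return "", text.strip()
    reasoning = match.group(1).strip()
    answer = (text[: match.start()] + text[match.end():]).strip()
    return reasoning, answer
```

This also handles the empty <think></think> block that can appear when /no_think is used: the reasoning comes back as an empty string and the answer is unchanged.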

Using transformers directly instead of Ollama

If you bypass Ollama / llama.cpp and run the HF card quickstart via transformers directly, the quickstart uses torch_dtype="auto" and device_map="auto". It does not hardcode attn_implementation="flash_attention_2", so it runs on the RTX 4070 with a stock pip install torch. Be aware that the unquantized bf16 checkpoint is roughly 16 GB (8.2B parameters × 2 bytes), more than the card's 12 GB, so device_map="auto" will place some layers in system RAM and generation will be much slower than on the GGUF path. Unlike Blackwell-class cards (sm_120), the Ada sm_89 architecture has full FlashAttention-2 kernel coverage in the prebuilt wheels, so if you do opt into FA2 (attn_implementation="flash_attention_2") it works without any wheel-pinning or source build. No cu128-specific selection is required; the default cu124 wheel is correct.

Generation slows dramatically past 32k context

Qwen3 natively supports a 32,768-token context, extendable to 131,072 tokens with YaRN RoPE scaling per the HF card. Beyond the native window the model needs YaRN extension — supported in llama.cpp via --rope-scaling yarn --rope-scale 4 --yarn-orig-ctx 32768 per the unsloth GGUF instructions — but on a 12 GB card the 64K KV cache won't fit alongside the weights regardless, so YaRN here is only useful in combination with a smaller quant and KV-cache quantization. For long-doc workflows on this card, prefer chunking + retrieval over pushing context past 32k.
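
The YaRN flags can be generated from the target context length. This sketch assumes the scale factor is simply the target window divided by the native 32,768 tokens, which matches the factor of 4 quoted above for 131,072; yarn_args is an illustrative helper that only builds the argument list and does not launch anything.

```python
NATIVE_CTX = 32768  # Qwen3-8B native window, per the HF card


def yarn_args(target_ctx: int) -> list:
    """Build llama.cpp flags for a context window of target_ctx tokens."""
    if target_ctx <= NATIVE_CTX:
        return ["--ctx-size", str(target_ctx)]  # no RoPE scaling needed
    scale = target_ctx / NATIVE_CTX  # assumption: factor = target / native
    return [
        "--ctx-size", str(target_ctx),
        "--rope-scaling", "yarn",
        "--rope-scale", f"{scale:g}",
        "--yarn-orig-ctx", str(NATIVE_CTX),
    ]
```

Append the result to a llama-server or llama-cli command line. On this card, pair any window above 32K with the KV-cache quantization flags from the out-of-memory entry.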

I want the larger 14B / 32B sibling

Qwen3-14B at Q4_K_M is ~9 GB on disk — it loads on a 12 GB card but leaves only ~2–3 GB for the KV cache, so usable context is short; swap qwen3:8b for qwen3:14b in any Ollama command and keep context tight. Qwen3-32B at Q4_K_M is ~20 GB and does not fit without aggressive offloading; same for the 30B MoE and 235B MoE variants (MoE total params must be resident — see the Qwen3 model card on the dense/MoE split). For a 14B+ recipe on this card, request via /contribute.