Qwen3-8B on RTX 4060 Ti 16GB: Q4_K_M GGUF via Ollama or llama.cpp

llm · beginner · 16GB+ VRAM · May 20, 2026
prerequisites
  • NVIDIA RTX 4060 Ti 16GB or equivalent 16 GB CUDA card
  • Recent NVIDIA driver with CUDA 12.x support (Ada sm_89 — no special wheel selection required)
  • ~6 GB free disk for the Q4_K_M GGUF checkpoint (or ~10 GB for Q8_0)
  • Ollama, llama.cpp, or LM Studio installed

What You'll Build

A local Qwen3-8B chat / reasoning assistant running on a 16 GB RTX 4060 Ti, served through Ollama (or llama.cpp / LM Studio — same GGUF, three loaders). The recipe pins the dense 8B variant at Q4_K_M quantization (5.03 GB on disk), which fits comfortably on the 4060 Ti with headroom for the 32k-native context window and the optional thinking-mode chain of thought.

Hardware data: RTX 4060 Ti 16GB · Q4_K GGUF · 45.8 tokens/s generation at 4k context · See benchmark data

⚠️ Variant pinned: Qwen3 ships 8 sizes from the same Qwen org. Per the Ollama qwen3 tag list, these are 0.6b, 1.7b, 4b, 8b (this recipe), 14b, 30b (MoE), 32b, and 235b (MoE). The siblings have very different VRAM profiles: Qwen3-14B in Q4_K_M is ~8.5 GB and still fits in 16 GB; Qwen3-32B in Q4_K_M is ~20 GB and overflows; Qwen3-235B (MoE, ~22B active) needs >100 GB of resident weights, because every expert must stay loaded even though only ~22B parameters are active per token (see the Qwen3 model card for the dense/MoE split). The instructions below are for the dense 8.2B model only. If you want 14B on this card, swap qwen3:8b for qwen3:14b; for 32B and larger, request a recipe via /contribute.

ℹ️ Thinking mode is on by default. Qwen3-8B has a built-in chain-of-thought ("thinking") mode that the model card's quickstart enables via enable_thinking=True. Output starts with a <think>...</think> block followed by the user-facing answer. To disable for latency-sensitive use, send /no_think in your prompt or pass enable_thinking=False in the chat template.
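If you consume the raw completion in your own code, the reasoning block usually needs to be separated from the answer. The following is a minimal sketch, assuming the runtime returns the <think>...</think> block inline in the response text (some runtimes may deliver it in a separate field instead); split_thinking is a hypothetical helper name, not part of any Qwen or Ollama API.

```python
import re

# Matches one <think>...</think> block, including newlines inside it.
_THINK_BLOCK = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_thinking(raw: str) -> tuple[str, str]:
    """Return (reasoning, answer) from a raw Qwen3-style completion.

    Assumption: the block appears inline in the text. If no block is
    present, reasoning is returned as an empty string.
    """
    match = _THINK_BLOCK.search(raw)
    if match is None:
        return "", raw.strip()
    reasoning = match.group(1).strip()
    # Everything outside the block is treated as the user-facing answer.
    answer = (raw[: match.start()] + raw[match.end():]).strip()
    return reasoning, answer
```

An empty <think></think> pair simply yields an empty reasoning string, so the same helper can be used whether thinking is on or off.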

Requirements

Component   Minimum                                     Tested
GPU         16 GB VRAM                                  RTX 4060 Ti 16GB
RAM         16 GB system
Storage     5.03 GB (Q4_K_M GGUF) or 8.71 GB (Q8_0)     per unsloth/Qwen3-8B-GGUF
Driver      CUDA 12.x (Ada sm_89)
Runtime     Ollama 0.5+ / llama.cpp / LM Studio

The model is released under Apache 2.0 — commercial use is permitted.

Installation

The fastest path is Ollama — one command pulls the canonical Q4_K_M build maintained by the Qwen team:

Option A — Ollama (recommended)

1. Install Ollama

# macOS / Linux
curl -fsSL https://ollama.com/install.sh | sh

(Windows: download from ollama.com/download.) Per the Qwen3 model card, "applications such as Ollama, LMStudio, MLX-LM, llama.cpp, and KTransformers have also supported Qwen3."

2. Pull the 8B model

ollama pull qwen3:8b

This fetches a 5.2 GB Q4_K_M checkpoint per the Ollama qwen3:8b tag (the 5.03 GB figure quoted elsewhere in this recipe is the unsloth Q4_K_M file). The download is one file, so no manual quant-tier selection is needed.

Option B — llama.cpp + community GGUF

If you want a different quant tier (Q6_K for higher fidelity, Q8_0 for near-lossless), use a community redistributor that publishes the full ladder:

1. Install llama.cpp

# macOS (Homebrew)
brew install llama.cpp

# Linux: pre-built CUDA binaries
# Visit https://github.com/ggml-org/llama.cpp/releases for cu12x binaries

2. Pull the quant you want

Per the unsloth/Qwen3-8B-GGUF per-tier file-size table (link-back to upstream Qwen/Qwen3-8B confirmed on the page header):

Quant    File size   Notes
Q4_K_M   5.03 GB     recommended for this card
Q5_K_M   5.85 GB     better quality, still tiny
Q6_K     6.73 GB     "near perfect" per bartowski
Q8_0     8.71 GB     near-lossless
BF16     16.4 GB     full precision; overflows the 16 GB card and needs offload

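Then launch it via the llama.cpp Hugging Face shortcut (per the unsloth model card). The tag after the colon selects the quant file: the card's example pulls UD-Q4_K_XL, so replace it with Q4_K_M (or another tier from the table above) to match this recipe: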

# OpenAI-compatible local server with web UI
llama-server -hf unsloth/Qwen3-8B-GGUF:UD-Q4_K_XL

# Interactive terminal
llama-cli -hf unsloth/Qwen3-8B-GGUF:UD-Q4_K_XL

Option C — LM Studio (GUI)

LM Studio offers a one-click install path per the Qwen3-8B HF card. Search "Qwen3-8B GGUF" inside the app and pick the Q4_K_M tier, or use the direct-import link lmstudio://open_from_hf?model=unsloth/Qwen3-8B-GGUF.

Running

One-shot prompt via Ollama

ollama run qwen3:8b "Explain GQA attention in three sentences."

First run loads the model into VRAM (~5 GB resident at idle, growing as the KV cache fills with longer contexts). Subsequent prompts in the same session stay warm.
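To get a feel for how much the KV cache adds on top of the ~5 GB of weights, here is a back-of-envelope sketch. The architecture values (36 layers, 8 KV heads, head dimension 128) are assumptions taken from the published Qwen3-8B configuration rather than from this recipe, and the estimate assumes an unquantized FP16 cache; real runtimes may pre-allocate, pad, or quantize the cache, so treat the result as a rough guide.

```python
def kv_cache_bytes(context_tokens: int,
                   n_layers: int = 36,        # assumed: Qwen3-8B layer count
                   n_kv_heads: int = 8,       # assumed: GQA key/value heads
                   head_dim: int = 128,       # assumed: per-head dimension
                   bytes_per_element: int = 2 # assumed: FP16 cache
                   ) -> int:
    """Estimate KV cache size for a single sequence of the given length."""
    # Factor 2 covers one key tensor and one value tensor per layer.
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_element
    return per_token * context_tokens


for ctx in (4096, 16384, 32768):
    gib = kv_cache_bytes(ctx) / 2**30
    print(f"{ctx:>6} tokens -> {gib:.2f} GiB KV cache")
```

Under these assumptions the cache costs 144 KiB per token, which works out to roughly 0.56 GiB at 4k, 2.25 GiB at 16k, and 4.5 GiB at 32k.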

Disable thinking mode for short answers

ollama run qwen3:8b "/no_think What's the capital of France?"

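Per the Qwen3-8B HF card, /no_think is a soft switch: it tells the model to skip its reasoning for that turn, so the answer arrives without a long <think>...</think> chain-of-thought prefix. It is a per-turn instruction in the prompt rather than the enable_thinking=False chat-template flag, and the response may still contain an empty <think></think> pair.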

OpenAI-compatible HTTP API

# Ollama exposes localhost:11434 by default
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3:8b",
    "messages": [{"role": "user", "content": "Write a haiku about Ada Lovelace GPUs."}]
  }'
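The same request can be sent from Python using only the standard library. This is a minimal sketch that mirrors the curl call above; build_chat_request and chat are hypothetical helper names, and the response parsing assumes the usual OpenAI-style shape with the text under choices[0].message.content.

```python
import json
import urllib.request

# Ollama's default local endpoint, as used in the curl example.
OLLAMA_URL = "http://localhost:11434/v1/chat/completions"


def build_chat_request(prompt: str, model: str = "qwen3:8b") -> urllib.request.Request:
    """Build a POST request with a single user message."""
    body = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt: str, model: str = "qwen3:8b", timeout: float = 120.0) -> str:
    """Send the prompt and return the assistant's reply text."""
    request = build_chat_request(prompt, model)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        payload = json.load(response)
    # Assumption: OpenAI-style response layout.
    return payload["choices"][0]["message"]["content"]
```

With Ollama running and the model pulled, chat("Write a haiku about Ada Lovelace GPUs.") should return the reply as a string. The generous timeout leaves room for the first-load delay and for thinking-mode output.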

For higher-throughput, production-style serving, the upstream Qwen3-8B card documents vllm serve Qwen/Qwen3-8B --enable-reasoning --reasoning-parser deepseek_r1 and python -m sglang.launch_server --model-path Qwen/Qwen3-8B --reasoning-parser qwen3. Both load the BF16 weights (16.4 GB), which overflows this card's 16 GB before any KV cache is allocated. On the 4060 Ti, Ollama or llama.cpp with the Q4_K_M GGUF is the comfortable path.

Results

  • Speed: 45.8 tokens/s generation at 4k context, Q4_K quantization, measured on RTX 4060 Ti 16GB — per the hardware-corner.net LLM benchmark table, surfaced via /check/qwen3-8b/rtx-4060-ti-16gb. Generation rate drops to 34.3 tok/s at 16k, 25.5 tok/s at 32k, and 13.0 tok/s at 64k as the KV cache grows. Prompt processing is much faster — 2,675.2 tok/s at 4k context per the same source.
  • VRAM usage: The cited benchmark records peak VRAM at the 4k-context configuration as fully utilizing the card's 16 GB; see /check/qwen3-8b/rtx-4060-ti-16gb for the latest measurement. At idle the Q4_K_M weights occupy ~5 GB; the rest is KV cache headroom that the runtime expands with context. The official Qwen speed benchmark corroborates the precision/VRAM ladder on H20 hardware: BF16 = 15947 MB, FP8 = 9323 MB, AWQ-INT4 = 6177 MB.
  • Quality notes: Q4_K_M is the community-default "sweet spot" — the bartowski Q-tier guide flags Q6_K as "near perfect, recommended" if you have the VRAM. On a 16 GB 4060 Ti you can also run Q6_K (6.73 GB) or Q8_0 (8.71 GB) with plenty of room — there's no quality reason to pick anything below Q4_K_M on this card.
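The generation rates above translate directly into wait times. The sketch below turns the quoted figures into a rough latency estimate; generation_seconds is a hypothetical helper, and it assumes the rate stays constant over the whole response, which is a simplification.

```python
# Generation rates quoted above for the RTX 4060 Ti 16GB (tokens per second).
GEN_TOK_PER_S = {"4k": 45.8, "16k": 34.3, "32k": 25.5, "64k": 13.0}


def generation_seconds(output_tokens: int, context: str = "4k") -> float:
    """Estimate seconds to generate output_tokens at a given context size."""
    # Assumption: a constant rate for the whole response.
    return output_tokens / GEN_TOK_PER_S[context]


for ctx in GEN_TOK_PER_S:
    secs = generation_seconds(500, ctx)
    print(f"{ctx:>3} context: {secs:5.1f} s for 500 output tokens")
```

A 500-token reply takes about 11 seconds at 4k context and about 38 seconds at 64k. Thinking mode adds its reasoning tokens to the output count, so the same estimate applies to the <think> block as well.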

For the full benchmark data and other-GPU comparisons, see /check/qwen3-8b/rtx-4060-ti-16gb.

Troubleshooting

Ollama returns Error: model requires more system memory or hangs on load

Confirm that a recent NVIDIA driver and CUDA 12.x runtime are installed (the nvidia-smi header should report CUDA Version 12.x alongside the driver version). The RTX 4060 Ti uses the Ada Lovelace architecture (sm_89), which has been fully supported by mainline CUDA wheels since 2023, so no special build flags or wheel pinning are required. If Ollama still appears to hang on first load, run nvidia-smi -l 1 in another terminal to confirm the GPU is actually being used; if it stays at 0% utilization, reinstall Ollama and re-pull the model.
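If you would rather check GPU usage from a script than watch the terminal, nvidia-smi can emit machine-readable CSV. The following is a small sketch; parse_gpu_line and read_gpus are hypothetical helper names, and the parser assumes every field is numeric (some GPUs report [N/A] for certain fields, which this sketch does not handle).

```python
import subprocess

# Ask nvidia-smi for utilization and memory as bare comma-separated numbers.
QUERY = [
    "nvidia-smi",
    "--query-gpu=utilization.gpu,memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_gpu_line(line: str) -> dict:
    """Parse one CSV line such as '37, 5210, 16380' (percent, MiB, MiB)."""
    util, used, total = (int(field.strip()) for field in line.split(","))
    return {"util_pct": util, "used_mib": used, "total_mib": total}


def read_gpus() -> list[dict]:
    """Run nvidia-smi once and return one dict per GPU."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return [parse_gpu_line(line) for line in result.stdout.splitlines() if line.strip()]
```

Calling read_gpus() while a prompt is generating should show non-zero utilization and several GiB of used memory if the model is actually running on the GPU.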

<think>...</think> output is bloating responses

Qwen3 enables thinking mode by default per the HF card quickstart. Send /no_think at the start of any user message to disable it for that turn, or pass enable_thinking=False if you're calling the chat-template API directly.

I want the larger 14B / 32B sibling

Qwen3-14B at Q4_K_M is ~8.5 GB on disk and fits a 16 GB card comfortably — swap qwen3:8b for qwen3:14b in any Ollama command. Qwen3-32B at Q4_K_M is ~20 GB and does not fit without aggressive offloading; same for the 30B MoE and 235B MoE variants (MoE total params must be resident — see the Qwen3 model card on the dense/MoE split). For a 32B+ recipe on this card, request via /contribute.
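The fit reasoning above can be sketched as a simple check: weights plus some reserve must stay under the card's VRAM. The weight sizes are the approximate figures quoted in this recipe; the 3 GB reserve for KV cache and runtime overhead is an illustrative assumption, not a measured value, and fits is a hypothetical helper.

```python
CARD_VRAM_GB = 16.0

# Approximate weight sizes quoted in this recipe, in GB.
WEIGHTS_GB = {
    "qwen3:8b Q4_K_M": 5.03,
    "qwen3:8b Q8_0": 8.71,
    "qwen3:8b BF16": 16.4,
    "qwen3:14b Q4_K_M": 8.5,
    "qwen3:32b Q4_K_M": 20.0,
}


def fits(weights_gb: float,
         vram_gb: float = CARD_VRAM_GB,
         reserve_gb: float = 3.0) -> bool:
    """True if weights plus an assumed reserve fit entirely in VRAM."""
    # reserve_gb is a placeholder for KV cache and runtime overhead.
    return weights_gb + reserve_gb <= vram_gb


for name, size in WEIGHTS_GB.items():
    verdict = "fits" if fits(size) else "needs offload"
    print(f"{name:<18} {size:>5.2f} GB  {verdict}")
```

Raising reserve_gb models longer contexts; the 14B Q4_K_M build still passes with a reserve of several GB, while the 32B and BF16 entries fail even with no reserve at all.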

Using transformers directly instead of Ollama

If you bypass Ollama / llama.cpp and run the HF card quickstart via transformers directly, the quickstart uses torch_dtype="auto" and device_map="auto" — it does not hardcode attn_implementation="flash_attention_2", so it works out of the box on the 4060 Ti with a stock pip install torch (Ada sm_89 has full FA2 kernel coverage if you do opt into FA2 separately). Unlike Blackwell-class cards, no cu128-specific wheel selection is required for the 4060 Ti.

Generation slows dramatically past 32k context

32k is Qwen3's native context window per the HF card ("Context Length: 32,768 natively and 131,072 tokens with YaRN"). Beyond that the model needs YaRN extension — supported in llama.cpp via --rope-scaling yarn --rope-scale 4 --yarn-orig-ctx 32768 per the unsloth GGUF instructions — but quality degrades and the KV cache balloons. For long-doc workflows, prefer chunking + retrieval over pushing context past 32k. The hardware-corner.net benchmark shows the rate falling to 13.0 tok/s at 64k context on this card.
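The scale factor in those flags is the ratio of the target context to the native window (131,072 / 32,768 = 4). Here is a small sketch that derives the flags for other targets; yarn_flags is a hypothetical helper, and using the exact ratio for intermediate targets is an assumption rather than documented guidance.

```python
NATIVE_CTX = 32768     # native window quoted from the HF card
MAX_YARN_CTX = 131072  # YaRN-extended limit quoted from the HF card


def yarn_flags(target_ctx: int) -> list[str]:
    """Return llama.cpp YaRN flags for target_ctx, or [] if none are needed."""
    if target_ctx <= NATIVE_CTX:
        return []  # within the native window, no extension required
    if target_ctx > MAX_YARN_CTX:
        raise ValueError(f"{target_ctx} exceeds the {MAX_YARN_CTX}-token YaRN limit")
    scale = target_ctx / NATIVE_CTX
    return [
        "--rope-scaling", "yarn",
        "--rope-scale", f"{scale:g}",
        "--yarn-orig-ctx", str(NATIVE_CTX),
    ]


print(" ".join(yarn_flags(131072)))
```

For the full 131,072-token target this reproduces the flags quoted above; for a 65,536-token target it yields a scale of 2.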