
Qwen3-4B on RTX 4060: Q4_K_M GGUF via Ollama or llama.cpp

llm · beginner · 4GB+ VRAM · May 19, 2026
prerequisites
  • NVIDIA RTX 4060 (8 GB VRAM) or equivalent 8 GB CUDA card
  • Recent NVIDIA driver with CUDA 12.1+ support
  • ~3 GB free disk for the Q4_K_M GGUF checkpoint (or ~5 GB for Q8_0)
  • Ollama, llama.cpp, or LM Studio installed
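
The GPU and driver prerequisites can be checked from a terminal before installing a runtime. The following is a minimal sketch, assuming nvidia-smi is on PATH; the helper name vram_ok and the 8000 MiB threshold are illustrative choices, not part of any tool.

```shell
# Hypothetical helper: succeeds when a CSV line from nvidia-smi reports at
# least MIN_MIB of total VRAM. Expected line shape (name, MiB, driver), with
# example values:
#   NVIDIA GeForce RTX 4060, 8188, 550.54.14
vram_ok() {
  line=$1
  min_mib=${2:-8000}   # assumption: 8 GB cards report slightly under 8192 MiB
  total=$(printf '%s\n' "$line" | awk -F', ' '{ print $2 }')
  [ "$total" -ge "$min_mib" ]
}

# Real query, to be run on the target machine:
#   gpu=$(nvidia-smi --query-gpu=name,memory.total,driver_version \
#           --format=csv,noheader,nounits | head -n 1)
#   vram_ok "$gpu" && echo "OK: $gpu" || echo "Under 8 GB VRAM: $gpu"
```

Parsing is kept in a separate function so it can be exercised on a sample line without a GPU present.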

What You'll Build

A local Qwen3-4B chat / reasoning assistant running on an 8 GB RTX 4060, served through Ollama (or llama.cpp / LM Studio — same GGUF, three loaders). The recipe pins the dense 4B variant at Q4_K_M quantization (2.50 GB on disk), which clears the RTX 4060's 8 GB VRAM with comfortable headroom for the 32k-native context window and the optional thinking-mode chain of thought.

Hardware data: RTX 4060 (8 GB VRAM) · Q4_K_M GGUF · ~4 GB VRAM at 8k context per the unsloth-sourced VRAM table on LocalLLM.in · See benchmark data

⚠️ Variant pinned: Qwen3 ships 8 sizes from the same Qwen org. Per the Ollama qwen3 tag list, Qwen3 spans 0.6b, 1.7b, 4b (this recipe), 8b, 14b, 30b (MoE), 32b, and 235b (MoE). The siblings have very different VRAM profiles. Qwen3-8B in Q4_K_M is ~5 GB on disk; it still fits 8 GB, but the card is already near full at 16k context, where the LocalLLM.in table measured 40.58 tok/s on this exact card. Qwen3-14B in Q4_K_M is ~8.5 GB and does not fit the 8 GB 4060. The instructions below are for the dense 4.02B model only. If you want the 8B sibling on this card, swap qwen3:4b for qwen3:8b and expect ~5 GB on disk plus a tighter KV-cache ceiling.

⚠️ Runtime pinned — vanilla BF16 transformers does NOT fit comfortably on 8 GB. Per the official Qwen speed benchmark (Transformers backend, measured on NVIDIA H20), Qwen3-4B BF16 already takes 7,973 MB at 1-token input and climbs to 10,012 MB at 14,336-token input — i.e. it overflows the RTX 4060's 8 GB before you reach even half of the native 32k context. Q4_K_M GGUF via Ollama or llama.cpp is the path that fits. If you must use HF Transformers directly, use AWQ-INT4 (2,915 MB at 1 token, 3,881 MB at 6k per the same Qwen benchmark) — not BF16.

ℹ️ Thinking mode is on by default. Qwen3-4B has a built-in chain-of-thought ("thinking") mode that the model card quickstart enables via enable_thinking=True. Output starts with a <think>...</think> block followed by the user-facing answer. To disable for latency-sensitive use, send /no_think in your prompt or pass enable_thinking=False in the chat template.
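
When only the user-facing answer is wanted in a script, the reasoning block can be filtered out after the fact. This is a minimal sketch, assuming the runtime passes the <think> and </think> tags through verbatim and that the answer starts on a line after the closing tag; strip_think is a hypothetical helper name.

```shell
# Hypothetical filter: drops every line from <think> through </think>
# (inclusive) and any blank lines that precede the answer.
strip_think() {
  awk '
    /<think>/             { skip = 1 }
    !skip && (seen || NF) { seen = 1; print }
    /<\/think>/           { skip = 0 }
  '
}

# Usage:
#   ollama run qwen3:4b "What's the capital of France?" | strip_think
```

Any text that shares a line with one of the tags is dropped too, so this suits line-oriented output rather than streamed token fragments.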

Requirements

| Component | Minimum | Tested |
| --- | --- | --- |
| GPU | 8 GB VRAM (4 GB at Q4_K_M / 8k context per LocalLLM.in) | RTX 4060 (8 GB) |
| RAM | 16 GB system | |
| Storage | 2.50 GB (Q4_K_M GGUF) or 4.28 GB (Q8_0) | per bartowski/Qwen_Qwen3-4B-GGUF |
| Driver | CUDA 12.1+ | |
| Runtime | Ollama 0.5+ / llama.cpp / LM Studio | |

The model is released under Apache 2.0 — commercial use is permitted.

Installation

The fastest path is Ollama — one command pulls the canonical Q4_K_M build:

Option A — Ollama (recommended)

1. Install Ollama

# macOS / Linux
curl -fsSL https://ollama.com/install.sh | sh

(Windows: download from ollama.com/download.) Per the Qwen3-4B model card, "applications such as Ollama, LMStudio, MLX-LM, llama.cpp, and KTransformers have also supported Qwen3."

2. Pull the 4B model

ollama pull qwen3:4b

This fetches a 2.5 GB Q4_K_M checkpoint per the Ollama qwen3:4b tag (4.02B params, original Qwen3 series). The download is one file — no manual quant-tier selection needed.

Option B — llama.cpp + community GGUF

If you want a different quant tier (Q6_K for higher fidelity, Q8_0 for near-lossless), use a community redistributor that publishes the full ladder. Both unsloth and bartowski mirror the canonical Qwen/Qwen3-4B and declare it as their base_model:

1. Install llama.cpp

# macOS (Homebrew)
brew install llama.cpp

# Linux: pre-built CUDA binaries
# Visit https://github.com/ggml-org/llama.cpp/releases for cu121+ builds

2. Pull the quant you want

Per the bartowski/Qwen_Qwen3-4B-GGUF per-tier file-size table (base_model: Qwen/Qwen3-4B confirmed in the card header):

| Quant | File size | Notes |
| --- | --- | --- |
| Q4_K_M | 2.50 GB | recommended for this card: "Good quality, default size for most use cases" per bartowski |
| Q4_K_S | 2.38 GB | slightly smaller |
| Q5_K_M | 2.89 GB | "High quality, recommended" per bartowski |
| Q6_K | 3.31 GB | "Very high quality, near perfect, recommended" per bartowski |
| Q8_0 | 4.28 GB | "Extremely high quality, generally unneeded but max available quant" |
| BF16 | 8.05 GB | full precision; does not fit the 8 GB 4060 (see admonition above) |

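Then load it via the llama.cpp Hugging Face shortcut. The commands below follow the unsloth model card and pull unsloth's UD-Q4_K_XL quant; to use a tier from the bartowski table instead, change the argument to bartowski/Qwen_Qwen3-4B-GGUF:Q4_K_M (or another tier name):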

# OpenAI-compatible local server with web UI
llama-server -hf unsloth/Qwen3-4B-GGUF:UD-Q4_K_XL

# Interactive terminal
llama-cli -hf unsloth/Qwen3-4B-GGUF:UD-Q4_K_XL
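
On the first run the -hf shortcut has to download the GGUF before the server can answer, so a script that starts llama-server may want to wait until it is ready. A minimal sketch follows, assuming llama-server's default bind address of 127.0.0.1:8080 and its /health endpoint; wait_for_server is a hypothetical helper, and the retry counts are arbitrary.

```shell
# wait_for_server URL [TRIES] [DELAY_SECONDS]
# Polls URL until it answers with a 2xx status, or gives up after TRIES attempts.
wait_for_server() {
  url=$1
  tries=${2:-60}
  delay=${3:-2}
  i=0
  while [ "$i" -lt "$tries" ]; do
    # -f makes curl fail on HTTP errors, e.g. while the model is still loading
    if curl -fsS --max-time 5 -o /dev/null "$url" 2>/dev/null; then
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  return 1
}

# Usage (adjust the port if you pass --port to llama-server):
#   llama-server -hf unsloth/Qwen3-4B-GGUF:UD-Q4_K_XL &
#   wait_for_server http://127.0.0.1:8080/health && echo "server ready"
```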

Option C — LM Studio (GUI)

LM Studio offers a one-click install path. Search "Qwen3-4B GGUF" inside the app and pick the Q4_K_M tier from either unsloth/Qwen3-4B-GGUF or bartowski/Qwen_Qwen3-4B-GGUF.

Running

One-shot prompt via Ollama

ollama run qwen3:4b "Explain GQA attention in three sentences."

First run loads the model into VRAM (~2.5 GB resident from weights, climbing to ~4 GB total at 8k context as the KV cache grows — per the LocalLLM.in VRAM table sourced from unsloth). Subsequent prompts in the same session stay warm.

Disable thinking mode for short answers

ollama run qwen3:4b "/no_think What's the capital of France?"

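Per the Qwen3-4B HF card, /no_think is a soft switch: it tells the model to skip the chain of thought for that turn while enable_thinking stays True in the chat template. The response may still carry an empty <think></think> block; to drop the block entirely, pass enable_thinking=False in the chat template instead.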

OpenAI-compatible HTTP API

# Ollama exposes localhost:11434 by default
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3:4b",
    "messages": [{"role": "user", "content": "Write a haiku about RTX 4060s running local LLMs."}]
  }'
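
For scripted use, the request body above can be built from variables instead of being typed inline. This is a minimal sketch that only escapes backslashes and double quotes, so it assumes single-line prompts without control characters; json_escape, chat_payload, and ask are hypothetical helper names.

```shell
# Escape backslashes and double quotes so the text is safe inside a JSON string.
# Assumption: the input has no newlines or other control characters.
json_escape() {
  printf '%s' "$1" | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g'
}

# chat_payload MODEL PROMPT: print a single-message chat completion request body.
chat_payload() {
  printf '{"model":"%s","messages":[{"role":"user","content":"%s"}]}' \
    "$1" "$(json_escape "$2")"
}

# ask MODEL PROMPT: POST the payload to Ollama's OpenAI-compatible endpoint.
ask() {
  curl -sS http://localhost:11434/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d "$(chat_payload "$1" "$2")"
}

# Usage:
#   ask qwen3:4b 'Summarize "grouped-query attention" in one sentence.'
```

For prompts with newlines or arbitrary user input, a real JSON encoder is the safer choice.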

For higher-throughput, production-style serving, the upstream Qwen3-4B card documents two launch commands: vllm serve Qwen/Qwen3-4B --enable-reasoning --reasoning-parser deepseek_r1 and python -m sglang.launch_server --model-path Qwen/Qwen3-4B --reasoning-parser qwen3. Both load BF16 weights by default (8.05 GB), which sits at the 4060's 8 GB ceiling before any KV cache is allocated. For the 4060, Ollama / llama.cpp with the Q4_K_M GGUF is the comfortable path.

Recommended sampling parameters

Per the Qwen3-4B card "Best Practices" section:

  • Thinking mode (default): Temperature=0.6, TopP=0.95, TopK=20, MinP=0. Do not use greedy decoding — it leads to endless repetitions.
  • Non-thinking mode: Temperature=0.7, TopP=0.8, TopK=20, MinP=0.
  • For endless-repetition issues, set presence_penalty between 0 and 2 (default 1.5).
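
With Ollama, one way to make the thinking-mode values stick is to bake them into a derived model through a Modelfile. The sketch below assumes the qwen3:4b tag is already pulled; the file path and the model name qwen3-4b-thinking are hypothetical.

```shell
# write_modelfile PATH: write a Modelfile that layers the thinking-mode
# sampling values from the list above on top of the qwen3:4b tag.
write_modelfile() {
  cat > "$1" <<'EOF'
FROM qwen3:4b
PARAMETER temperature 0.6
PARAMETER top_p 0.95
PARAMETER top_k 20
PARAMETER min_p 0
EOF
}

# Usage (model name is illustrative):
#   write_modelfile ./Modelfile.qwen3-thinking
#   ollama create qwen3-4b-thinking -f ./Modelfile.qwen3-thinking
#   ollama run qwen3-4b-thinking "Explain GQA attention in three sentences."
```

A second Modelfile with the non-thinking values (0.7 / 0.8 / 20 / 0) would follow the same pattern.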

Results

  • VRAM usage: ~4 GB at 8k context (Q4_K_M) per the unsloth-sourced LocalLLM.in table. Weights are 2.50 GB; the remaining ~1.5 GB is KV cache and runtime buffers, and the KV-cache share grows linearly with context length. The 32k-native window should land near 6 GB peak (an estimate, not a measurement); pushing to YaRN-extended 131k is not advisable on an 8 GB card.
  • Speed: Empirical Qwen3-4B benchmarks for the RTX 4060 are not yet in our catalogue. As a lower bound, the LocalLLM.in benchmark measured the larger Qwen3-8B sibling at 40.58 tokens/sec at 16k context, Q4_K_M, on the same RTX 4060 (8 GB). A 4B model on the same card should run faster than its 8B sibling, so treat 40 tok/s as the floor. The official Qwen speed benchmark reports Qwen3-4B Transformers throughput on a server-class H20 (7,973 MB BF16 → 2,915 MB AWQ-INT4 at 1-token input). Those tokens/s figures don't transfer to consumer hardware, but the memory footprint ladder corroborates the GGUF file-size table above.
  • Quality notes: Q4_K_M is the community-default "sweet spot" — bartowski flags Q6_K as "near perfect, recommended" if you have the VRAM. On an 8 GB 4060 you can also run Q6_K (3.31 GB) or Q8_0 (4.28 GB) with plenty of room — there's no quality reason to pick anything below Q4_K_M on this card.
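
The linear growth of the KV cache with context length can be sanity-checked with a little arithmetic. The sketch below rests on assumptions not stated in this recipe: 36 layers, 8 KV heads, a head dimension of 128, and 2-byte (f16) cache entries. Verify these against the model's config.json before relying on the numbers.

```shell
# kv_cache_mib TOKENS: estimated KV-cache size in MiB for one sequence.
# Architecture values are assumptions (see the note above), not measurements.
kv_cache_mib() {
  tokens=$1
  layers=36; kv_heads=8; head_dim=128; bytes=2
  # The factor 2 covers the separate key and value tensors in every layer.
  per_token=$((2 * layers * kv_heads * head_dim * bytes))
  # Divide in two steps to keep intermediate values small.
  echo $((tokens * (per_token / 1024) / 1024))
}

for ctx in 8192 16384 32768; do
  echo "$ctx tokens: $(kv_cache_mib "$ctx") MiB of KV cache"
done
```

Under these assumptions the cache alone comes to about 1.1 GiB at 8k and 4.5 GiB at 32k. The sketch ignores compute buffers and any KV-cache quantization, so measured totals can differ in either direction.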

For the full benchmark data and other-GPU comparisons, see /check/qwen3-4b/rtx-4060.

Troubleshooting

Ollama returns Error: model requires more system memory or hangs on load

Confirm CUDA 12.1+ drivers are installed (nvidia-smi should report a 535+ driver on Linux). If the driver is too old, Ollama silently falls back to CPU inference, which is slow enough on a typical desktop CPU to look like a hang. Reinstall Ollama after upgrading the driver.
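
The driver check can be scripted. A minimal sketch follows, assuming the driver string has the usual dotted form (for example 535.154.05); driver_major and driver_new_enough are hypothetical helper names, and 535 is the threshold quoted above.

```shell
# driver_major VERSION: print the part of the version before the first dot.
driver_major() {
  printf '%s\n' "${1%%.*}"
}

# driver_new_enough VERSION [MIN_MAJOR]: succeed when the major version is
# at least MIN_MAJOR (default 535).
driver_new_enough() {
  [ "$(driver_major "$1")" -ge "${2:-535}" ]
}

# Usage:
#   v=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n 1)
#   driver_new_enough "$v" || echo "Driver $v is older than 535; upgrade first"
#   ollama ps   # PROCESSOR column shows whether a loaded model runs on GPU or CPU
```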

<think>...</think> output is bloating responses

Qwen3 enables thinking mode by default per the HF card quickstart. Send /no_think at the start of any user message to disable it for that turn, or pass enable_thinking=False if you're calling the chat-template API directly.

I want full BF16 instead of Q4_K_M

The full-precision Qwen3-4B weights are 8.05 GB on disk per the bartowski file-size table — that's already at the RTX 4060's 8 GB VRAM ceiling before adding any KV cache. The official Qwen Transformers benchmark measured BF16 memory at 7,973 MB for 1-token input rising to 10,012 MB at 14,336-token input on an H20 — so BF16 overflows the 4060 before mid-context. Either step down to AWQ-INT4 (2,915 MB at 1 token, 3,881 MB at 6k per the same benchmark) or stay on the Q4_K_M / Q5_K_M / Q6_K / Q8_0 GGUF tiers via Ollama / llama.cpp.

Generation slows dramatically past 32k context

32k is Qwen3-4B's native context window per the HF card ("Context Length: 32,768 natively and 131,072 tokens with YaRN"). Beyond that the model needs YaRN extension — supported in llama.cpp via llama-server --rope-scaling yarn --rope-scale 4 --yarn-orig-ctx 32768 (per the model card). Quality degrades, the KV cache balloons, and an 8 GB card will not have enough headroom for YaRN-extended workloads at meaningful batch sizes. For long-doc workflows on this GPU, prefer chunking + retrieval over pushing context past 32k.

Endless repetitions in generation

The model card flags this explicitly: "DO NOT use greedy decoding". Use the recommended sampling parameters from the Best Practices section above (Temperature=0.6/0.7, TopP=0.95/0.8 for thinking/non-thinking respectively) and consider raising presence_penalty to 1.5 if repetitions persist.

I want the larger 8B / 14B sibling

Qwen3-8B at Q4_K_M is ~5 GB on disk and fits the 8 GB 4060 — swap qwen3:4b for qwen3:8b and expect tighter KV-cache headroom (the LocalLLM.in benchmark measured 40.58 tok/s at 16k Q4_K_M on this exact card). Qwen3-14B at Q4_K_M is ~8.5 GB and does not fit; same for 32B+ and the MoE variants (all MoE total params must be resident — see the Qwen3-4B model card on the dense/MoE split). For 14B+ recipes on this card, request via /contribute.