What You'll Build
A local Qwen3-4B chat / reasoning assistant running on an 8 GB RTX 4060, served through Ollama (or llama.cpp / LM Studio — same GGUF, three loaders). The recipe pins the dense 4B variant at Q4_K_M quantization (2.50 GB on disk), which clears the RTX 4060's 8 GB VRAM with comfortable headroom for the 32k-native context window and the optional thinking-mode chain of thought.
Hardware data: RTX 4060 (8 GB VRAM) · Q4_K_M GGUF · ~4 GB VRAM at 8k context per the unsloth-sourced VRAM table on LocalLLM.in · See benchmark data
⚠️ Variant pinned: Qwen3 ships 8 sizes from the same Qwen org. Per the Ollama qwen3 tag list, Qwen3 spans 0.6b, 1.7b, 4b (this recipe), 8b, 14b, 30b (MoE), 32b, and 235b (MoE). The siblings have wildly different VRAM profiles: Qwen3-8B in Q4_K_M is ~5 GB on disk (it still fits 8 GB, but the card is already near full at 16k context, where the LocalLLM.in table measured 40.58 tok/s on this exact card); Qwen3-14B in Q4_K_M is ~8.5 GB and does not fit the 8 GB 4060. The instructions below are for the dense 4.02B model only. If you want the 8B sibling on this card, swap qwen3:4b for qwen3:8b and expect ~5 GB on disk plus a tighter KV-cache ceiling.
⚠️ Runtime pinned — vanilla BF16 transformers does NOT fit comfortably on 8 GB. Per the official Qwen speed benchmark (Transformers backend, measured on NVIDIA H20), Qwen3-4B BF16 already takes 7,973 MB at 1-token input and climbs to 10,012 MB at 14,336-token input — i.e. it overflows the RTX 4060's 8 GB before you reach even half of the native 32k context. Q4_K_M GGUF via Ollama or llama.cpp is the path that fits. If you must use HF Transformers directly, use AWQ-INT4 (2,915 MB at 1 token, 3,881 MB at 6k per the same Qwen benchmark) — not BF16.
ℹ️ Thinking mode is on by default. Qwen3-4B has a built-in chain-of-thought ("thinking") mode that the model card quickstart enables via enable_thinking=True. Output starts with a <think>...</think> block followed by the user-facing answer. To disable for latency-sensitive use, send /no_think in your prompt or pass enable_thinking=False in the chat template.
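If you consume the raw output programmatically, you will usually want to separate the reasoning block from the answer. The following is a minimal sketch, assuming the output carries at most one leading <think>...</think> block as described above; the helper name split_thinking is hypothetical and not part of any Qwen or Ollama API.

```python
import re

# Matches one leading <think>...</think> block; DOTALL lets "." span newlines.
_THINK_RE = re.compile(r"^\s*<think>(.*?)</think>\s*", re.DOTALL)


def split_thinking(text: str) -> tuple:
    """Return (reasoning, answer) from raw Qwen3 output.

    If no leading <think> block is present, reasoning is an empty string.
    """
    match = _THINK_RE.match(text)
    if match is None:
        return "", text.strip()
    # Everything after the closing tag is the user-facing answer.
    return match.group(1).strip(), text[match.end():].strip()


raw = "<think>\nThe user asks for the capital of France.\n</think>\n\nParis."
reasoning, answer = split_thinking(raw)
print(answer)
```

An empty thinking block (as can happen when thinking is switched off) simply yields an empty reasoning string, so the same helper works in both modes.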
Requirements
| Component | Minimum | Tested |
|---|---|---|
| GPU | 8 GB VRAM (4 GB at Q4_K_M / 8k context per LocalLLM.in) | RTX 4060 (8 GB) |
| RAM | 16 GB system | — |
| Storage | 2.50 GB (Q4_K_M GGUF) or 4.28 GB (Q8_0) | per bartowski/Qwen_Qwen3-4B-GGUF |
| Driver | CUDA 12.1+ | — |
| Runtime | Ollama 0.5+ / llama.cpp / LM Studio | — |
The model is released under Apache 2.0 — commercial use is permitted.
Installation
The fastest path is Ollama — one command pulls the canonical Q4_K_M build:
Option A — Ollama (recommended)
1. Install Ollama
# macOS / Linux
curl -fsSL https://ollama.com/install.sh | sh
(Windows: download from ollama.com/download.) Per the Qwen3-4B model card, "applications such as Ollama, LMStudio, MLX-LM, llama.cpp, and KTransformers have also supported Qwen3."
2. Pull the 4B model
ollama pull qwen3:4b
This fetches a 2.5 GB Q4_K_M checkpoint per the Ollama qwen3:4b tag (4.02B params, original Qwen3 series). The download is one file — no manual quant-tier selection needed.
Option B — llama.cpp + community GGUF
If you want a different quant tier (Q6_K for higher fidelity, Q8_0 for near-lossless), use a community redistributor that publishes the full ladder. Both unsloth and bartowski mirror the canonical Qwen/Qwen3-4B and declare it as their base_model:
1. Install llama.cpp
# macOS (Homebrew)
brew install llama.cpp
# Linux: pre-built CUDA binaries
# Visit https://github.com/ggml-org/llama.cpp/releases for cu121+ binaries
2. Pull the quant you want
Per the bartowski/Qwen_Qwen3-4B-GGUF per-tier file-size table (base_model: Qwen/Qwen3-4B confirmed in the card header):
| Quant | File size | Notes |
|---|---|---|
| Q4_K_M | 2.50 GB | recommended for this card — "Good quality, default size for most use cases" per bartowski |
| Q4_K_S | 2.38 GB | slightly smaller |
| Q5_K_M | 2.89 GB | "High quality, recommended" per bartowski |
| Q6_K | 3.31 GB | "Very high quality, near perfect, recommended" per bartowski |
| Q8_0 | 4.28 GB | "Extremely high quality, generally unneeded but max available quant" |
| BF16 | 8.05 GB | full precision — does not fit the 8 GB 4060 (see admonition above) |
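Then load the model via the llama.cpp Hugging Face shortcut. The example below follows the unsloth model card and uses unsloth's UD-Q4_K_XL build; substitute the repository and quant tag of the tier you picked: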
# OpenAI-compatible local server with web UI
llama-server -hf unsloth/Qwen3-4B-GGUF:UD-Q4_K_XL
# Interactive terminal
llama-cli -hf unsloth/Qwen3-4B-GGUF:UD-Q4_K_XL
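The file-size table above can be turned into a quick tier picker. This is a rough sketch, not a sizing tool: the sizes are copied from the bartowski table, while the function name and the reserve_gb parameter (VRAM set aside for KV cache and runtime buffers) are hypothetical values you should tune for your own context length.

```python
from typing import Optional

# File sizes in GB, copied from the bartowski table above.
QUANT_SIZES_GB = {
    "Q4_K_S": 2.38,
    "Q4_K_M": 2.50,
    "Q5_K_M": 2.89,
    "Q6_K": 3.31,
    "Q8_0": 4.28,
    "BF16": 8.05,
}


def largest_quant_that_fits(vram_gb: float, reserve_gb: float) -> Optional[str]:
    """Return the largest quant whose file fits in vram_gb minus reserve_gb."""
    budget = vram_gb - reserve_gb
    fitting = {name: size for name, size in QUANT_SIZES_GB.items() if size <= budget}
    if not fitting:
        return None
    # Bigger file means a higher-fidelity quant in this ladder.
    return max(fitting, key=fitting.get)


# Assumed reserve of 3 GB for KV cache and buffers on an 8 GB card.
print(largest_quant_that_fits(vram_gb=8.0, reserve_gb=3.0))
```

File size is only a proxy for resident weight memory, so treat the result as a starting point and confirm with nvidia-smi after loading.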
Option C — LM Studio (GUI)
LM Studio offers a one-click install path. Search "Qwen3-4B GGUF" inside the app and pick the Q4_K_M tier from either unsloth/Qwen3-4B-GGUF or bartowski/Qwen_Qwen3-4B-GGUF.
Running
One-shot prompt via Ollama
ollama run qwen3:4b "Explain GQA attention in three sentences."
First run loads the model into VRAM (~2.5 GB resident from weights, climbing to ~4 GB total at 8k context as the KV cache grows — per the LocalLLM.in VRAM table sourced from unsloth). Subsequent prompts in the same session stay warm.
Disable thinking mode for short answers
ollama run qwen3:4b "/no_think What's the capital of France?"
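Per the Qwen3-4B HF card, /no_think is a soft switch that turns thinking off for that turn, with much the same effect as enable_thinking=False. The card notes that the output can still contain an empty <think>...</think> block when the soft switch is used.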
OpenAI-compatible HTTP API
# Ollama exposes localhost:11434 by default
curl http://localhost:11434/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "qwen3:4b",
"messages": [{"role": "user", "content": "Write a haiku about RTX 4060s running local LLMs."}]
}'
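The same request can be made from Python using only the standard library. This is a sketch under the assumption that Ollama is running on its default port with qwen3:4b already pulled; build_payload and chat are hypothetical helper names, and the response parsing assumes the usual OpenAI-style choices[0].message.content shape.

```python
import json
import urllib.request

# Ollama's OpenAI-compatible endpoint on the default port.
OLLAMA_URL = "http://localhost:11434/v1/chat/completions"


def build_payload(prompt: str, thinking: bool = True, model: str = "qwen3:4b") -> dict:
    """Build an OpenAI-style chat payload; prefix /no_think to skip reasoning."""
    content = prompt if thinking else f"/no_think {prompt}"
    return {"model": model, "messages": [{"role": "user", "content": content}]}


def chat(prompt: str, thinking: bool = True) -> str:
    """POST the payload to the local server and return the assistant text."""
    request = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(build_payload(prompt, thinking)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    # Supplying data makes urllib issue a POST request.
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

With the server up, chat("Write a haiku about RTX 4060s running local LLMs.", thinking=False) should return the model's reply as a plain string.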
For higher throughput / production-style serving, the upstream Qwen3-4B card documents vllm serve Qwen/Qwen3-4B --enable-reasoning --reasoning-parser deepseek_r1 and python -m sglang.launch_server --model-path Qwen/Qwen3-4B --reasoning-parser qwen3 — but both load BF16 weights by default (8.05 GB), which overflows an 8 GB 4060 at non-trivial context. For the 4060, Ollama / llama.cpp with the Q4_K_M GGUF is the comfortable path.
Recommended sampling parameters
Per the Qwen3-4B card "Best Practices" section:
- Thinking mode (default): Temperature=0.6, TopP=0.95, TopK=20, MinP=0. Do not use greedy decoding — it leads to endless repetitions.
- Non-thinking mode: Temperature=0.7, TopP=0.8, TopK=20, MinP=0.
- For endless-repetition issues, set presence_penalty between 0 and 2 (default 1.5).
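The presets above can be kept in one place and attached to each request. The sketch below assumes Ollama's native /api/chat endpoint with an options object; the option key names (temperature, top_p, top_k, min_p, presence_penalty) are an assumption about Ollama's naming and should be checked against its API documentation. The helper names are hypothetical.

```python
from typing import Optional

# Values from the Qwen3-4B "Best Practices" list above.
THINKING = {"temperature": 0.6, "top_p": 0.95, "top_k": 20, "min_p": 0.0}
NON_THINKING = {"temperature": 0.7, "top_p": 0.8, "top_k": 20, "min_p": 0.0}


def sampling_options(thinking: bool, presence_penalty: Optional[float] = None) -> dict:
    """Return a copy of the preset for the chosen mode, never greedy decoding."""
    options = dict(THINKING if thinking else NON_THINKING)
    if presence_penalty is not None:
        if not 0.0 <= presence_penalty <= 2.0:
            raise ValueError("presence_penalty should stay between 0 and 2")
        options["presence_penalty"] = presence_penalty
    return options


def native_chat_body(prompt: str, thinking: bool = True,
                     presence_penalty: Optional[float] = None) -> dict:
    """Build a request body for the assumed /api/chat endpoint."""
    return {
        "model": "qwen3:4b",
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
        "options": sampling_options(thinking, presence_penalty),
    }


print(native_chat_body("Summarise GQA in one line.", thinking=False))
```

Keeping the presets as data makes it harder to fall back to greedy decoding by accident when switching between modes.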
Results
- VRAM usage: ~4 GB at 8k context (Q4_K_M) per the unsloth-sourced LocalLLM.in table. Weights are 2.50 GB; the remaining ~1.5 GB is mostly KV cache and runtime buffers, and the KV cache grows linearly with context length. The 32k-native figure is an estimate rather than a measurement: expect roughly 6 GB or more at peak, depending on KV-cache precision. Pushing to YaRN-extended 131k is not advisable on an 8 GB card.
- Speed: Empirical Qwen3-4B benchmarks for the RTX 4060 are not yet in our catalogue. As a rough lower bound, the LocalLLM.in benchmark measured the larger Qwen3-8B sibling at 40.58 tokens/sec at 16k context, Q4_K_M, on the same RTX 4060 (8 GB). A 4B model on the same card should run faster than its 8B sibling at the same quant and context, so 40+ tok/s is a reasonable floor to expect. The official Qwen speed benchmark reports Qwen3-4B Transformers throughput on a server-class H20 (7,973 MB BF16 → 2,915 MB AWQ-INT4 at 1-token input); those tokens/s figures don't transfer to consumer hardware, but the memory footprint ladder is consistent with the GGUF file-size table above.
- Quality notes: Q4_K_M is the community-default "sweet spot" — bartowski flags Q6_K as "near perfect, recommended" if you have the VRAM. On an 8 GB 4060 you can also run Q6_K (3.31 GB) or Q8_0 (4.28 GB) with plenty of room — there's no quality reason to pick anything below Q4_K_M on this card.
For the full benchmark data and other-GPU comparisons, see /check/qwen3-4b/rtx-4060.
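The linear growth of the KV cache mentioned in the VRAM bullet can be estimated with simple arithmetic. This is a back-of-the-envelope sketch, not a measurement: the architecture values (36 layers, 8 KV heads, head dimension 128) are assumptions taken from the published Qwen3-4B configuration rather than from this page, and it assumes an unquantized FP16 cache.

```python
# Assumed Qwen3-4B architecture values (from the published model config,
# not from this page): 36 layers, 8 KV heads (GQA), head dimension 128.
NUM_LAYERS = 36
NUM_KV_HEADS = 8
HEAD_DIM = 128
BYTES_PER_VALUE = 2  # FP16 cache entries; a quantized cache would be smaller


def kv_cache_bytes(context_tokens: int) -> int:
    """Estimate KV cache size: keys and values for every layer and KV head."""
    per_token = 2 * NUM_LAYERS * NUM_KV_HEADS * HEAD_DIM * BYTES_PER_VALUE
    return per_token * context_tokens


for ctx in (8192, 16384, 32768):
    gib = kv_cache_bytes(ctx) / 2**30
    print(f"{ctx:>6} tokens: {gib:.3f} GiB of KV cache")
```

Under these assumptions the cache is roughly 1.1 GiB at 8k and 4.5 GiB at 32k, before weights and compute buffers, which is why long contexts get tight on an 8 GB card unless the cache is quantized.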
Troubleshooting
Ollama returns Error: model requires more system memory or hangs on load
Confirm CUDA 12.1+ drivers are installed (nvidia-smi should report a 535+ driver on Linux). If the driver is too old, Ollama silently falls back to CPU inference, which can look like a hang on a modest host CPU. Reinstall Ollama after upgrading the driver.
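The driver check can be scripted. The sketch below assumes nvidia-smi is on PATH and uses the 535 threshold quoted above; the function names are hypothetical, and only the parsing helpers run without a GPU.

```python
import subprocess

MIN_DRIVER_MAJOR = 535  # threshold quoted in the paragraph above


def driver_major(version: str) -> int:
    """Extract the major number from a string such as '535.154.05'."""
    return int(version.strip().split(".")[0])


def driver_is_new_enough(version: str, minimum: int = MIN_DRIVER_MAJOR) -> bool:
    """Compare the driver's major version against the minimum."""
    return driver_major(version) >= minimum


def installed_driver_version() -> str:
    """Ask nvidia-smi for the driver version of the first GPU."""
    output = subprocess.run(
        ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
        capture_output=True,
        text=True,
        check=True,
    ).stdout
    return output.splitlines()[0].strip()
```

On a machine with an NVIDIA GPU, driver_is_new_enough(installed_driver_version()) returns False when the driver needs upgrading before Ollama can use the card.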
<think>...</think> output is bloating responses
Qwen3 enables thinking mode by default per the HF card quickstart. Send /no_think at the start of any user message to disable it for that turn, or pass enable_thinking=False if you're calling the chat-template API directly.
I want full BF16 instead of Q4_K_M
The full-precision Qwen3-4B weights are 8.05 GB on disk per the bartowski file-size table — that's already at the RTX 4060's 8 GB VRAM ceiling before adding any KV cache. The official Qwen Transformers benchmark measured BF16 memory at 7,973 MB for 1-token input rising to 10,012 MB at 14,336-token input on an H20 — so BF16 overflows the 4060 before mid-context. Either step down to AWQ-INT4 (2,915 MB at 1 token, 3,881 MB at 6k per the same benchmark) or stay on the Q4_K_M / Q5_K_M / Q6_K / Q8_0 GGUF tiers via Ollama / llama.cpp.
Generation slows dramatically past 32k context
32k is Qwen3-4B's native context window per the HF card ("Context Length: 32,768 natively and 131,072 tokens with YaRN"). Beyond that the model needs YaRN extension — supported in llama.cpp via llama-server --rope-scaling yarn --rope-scale 4 --yarn-orig-ctx 32768 (per the model card). Quality degrades, the KV cache balloons, and an 8 GB card will not have enough headroom for YaRN-extended workloads at meaningful batch sizes. For long-doc workflows on this GPU, prefer chunking + retrieval over pushing context past 32k.
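The scale factor in that command is just the ratio of the target context to the native 32,768 tokens. A small sketch of that arithmetic follows; the helper name is hypothetical, the RoPE flags are the ones quoted above, and --ctx-size is assumed to be the llama.cpp flag for the requested context length.

```python
NATIVE_CTX = 32768  # Qwen3-4B native context length per the model card


def yarn_server_args(target_ctx: int, native_ctx: int = NATIVE_CTX) -> list:
    """Build llama-server context flags, adding YaRN only past the native window."""
    if target_ctx <= native_ctx:
        # Within the native window no RoPE scaling is needed.
        return ["--ctx-size", str(target_ctx)]
    scale = -(-target_ctx // native_ctx)  # ceiling division
    return [
        "--ctx-size", str(target_ctx),
        "--rope-scaling", "yarn",
        "--rope-scale", str(scale),
        "--yarn-orig-ctx", str(native_ctx),
    ]


print(" ".join(["llama-server"] + yarn_server_args(131072)))
```

Choosing the smallest factor that covers your real workload (for example 2 for 65,536 tokens) avoids applying more scaling than necessary.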
Endless repetitions in generation
The model card flags this explicitly: "DO NOT use greedy decoding". Use the recommended sampling parameters from the Best Practices section above (Temperature=0.6/0.7, TopP=0.95/0.8 for thinking/non-thinking respectively) and consider raising presence_penalty to 1.5 if repetitions persist.
I want the larger 8B / 14B sibling
Qwen3-8B at Q4_K_M is ~5 GB on disk and fits the 8 GB 4060 — swap qwen3:4b for qwen3:8b and expect tighter KV-cache headroom (the LocalLLM.in benchmark measured 40.58 tok/s at 16k Q4_K_M on this exact card). Qwen3-14B at Q4_K_M is ~8.5 GB and does not fit; same for 32B+ and the MoE variants (all MoE total params must be resident — see the Qwen3-4B model card on the dense/MoE split). For 14B+ recipes on this card, request via /contribute.