What You'll Build
A local Qwen3-8B chat / reasoning assistant running on a 16 GB RTX 4060 Ti, served through Ollama (or llama.cpp / LM Studio — same GGUF, three loaders). The recipe pins the dense 8B variant at Q4_K_M quantization (5.03 GB on disk), which fits comfortably on the 4060 Ti with headroom for the 32k-native context window and the optional thinking-mode chain of thought.
Hardware data: RTX 4060 Ti 16GB · Q4_K GGUF · 45.8 tokens/s generation at 4k context · See benchmark data
⚠️ Variant pinned: Qwen3 ships 8 sizes from the same Qwen org. Per the Ollama qwen3 tag list, Qwen3 spans 0.6b, 1.7b, 4b, 8b (this recipe), 14b, 30b (MoE), 32b, and 235b (MoE). The siblings have wildly different VRAM profiles: Qwen3-14B in Q4_K_M is ~8.5 GB and still fits 16 GB; Qwen3-32B in Q4_K_M is ~20 GB and overflows; Qwen3-235B (MoE, ~22B active) needs >100 GB of resident weights, because every expert must stay loaded even though only a subset is active per token (see the Qwen3 model card for the dense/MoE split). The instructions below are for the dense 8.2B model only. If you want 14B on this card, swap qwen3:8b for qwen3:14b; for 32B+ go to /contribute.
ℹ️ Thinking mode is on by default. Qwen3-8B has a built-in chain-of-thought ("thinking") mode that the model card's quickstart enables via enable_thinking=True. Output starts with a <think>...</think> block followed by the user-facing answer. To disable it for latency-sensitive use, send /no_think in your prompt or pass enable_thinking=False in the chat template.
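If you consume the model's output from a script, you will usually want the reasoning and the final answer separately. The following is a minimal sketch, assuming the response contains at most one leading <think>...</think> block as described above; the helper name `split_thinking` and the sample string are illustrative, not part of any Qwen or Ollama API.

```python
import re

# Matches one leading <think>...</think> block; DOTALL lets the reasoning span lines.
_THINK_RE = re.compile(r"^\s*<think>(.*?)</think>\s*", re.DOTALL)


def split_thinking(raw: str) -> tuple[str, str]:
    """Return (reasoning, answer); reasoning is "" when no think block is present."""
    match = _THINK_RE.match(raw)
    if match is None:
        return "", raw.strip()
    return match.group(1).strip(), raw[match.end():].strip()


# Illustrative response text, not captured from a real run.
sample = "<think>\nParis is the capital.\n</think>\n\nParis."
reasoning, answer = split_thinking(sample)
print(answer)
```

Responses without a think block pass through unchanged, so the same helper works whether thinking mode is on or off.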
Requirements
| Component | Minimum | Tested |
|---|---|---|
| GPU | 16 GB VRAM | RTX 4060 Ti 16GB |
| RAM | 16 GB system | — |
| Storage | 5.03 GB (Q4_K_M GGUF) or 8.71 GB (Q8_0) | per unsloth/Qwen3-8B-GGUF |
| Driver | CUDA 12.x (Ada sm_89) | — |
| Runtime | Ollama 0.5+ / llama.cpp / LM Studio | — |
The model is released under Apache 2.0 — commercial use is permitted.
Installation
The fastest path is Ollama — one command pulls the canonical Q4_K_M build maintained by the Qwen team:
Option A — Ollama (recommended)
1. Install Ollama
# Linux
curl -fsSL https://ollama.com/install.sh | sh
(macOS and Windows: download from ollama.com/download; the install script targets Linux.) Per the Qwen3 model card, "applications such as Ollama, LMStudio, MLX-LM, llama.cpp, and KTransformers have also supported Qwen3."
2. Pull the 8B model
ollama pull qwen3:8b
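This fetches a 5.2 GB Q4_K_M checkpoint per the Ollama qwen3:8b tag (Ollama's listed size differs slightly from the 5.03 GB unsloth Q4_K_M file cited elsewhere in this recipe). The download is a single pull with no manual quant-tier selection needed.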
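To confirm the pull succeeded from a script, you can ask the local Ollama server which models it has. This is a minimal sketch, assuming Ollama is listening on its default localhost:11434 address and that its /api/tags endpoint returns a JSON object with a "models" list whose entries carry a "name" field; the helper names and the sample payload are illustrative.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # Ollama's default address


def model_names(tags_payload: dict) -> list[str]:
    """Extract model tags from an /api/tags response body."""
    return [m.get("name", "") for m in tags_payload.get("models", [])]


def has_model(tags_payload: dict, tag: str) -> bool:
    """True when the given tag (for example "qwen3:8b") is installed."""
    return tag in model_names(tags_payload)


def fetch_tags(base_url: str = OLLAMA_URL) -> dict:
    """Query the running Ollama server for its installed models."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=5) as resp:
        return json.load(resp)


# Offline illustration with a hand-written payload (not real server output).
sample_payload = {"models": [{"name": "qwen3:8b"}]}
print(has_model(sample_payload, "qwen3:8b"))
```

With the server running, `has_model(fetch_tags(), "qwen3:8b")` performs the same check against the live model list.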
Option B — llama.cpp + community GGUF
If you want a different quant tier (Q6_K for higher fidelity, Q8_0 for near-lossless), use a community redistributor that publishes the full ladder:
1. Install llama.cpp
# macOS (Homebrew)
brew install llama.cpp
# Linux: pre-built CUDA build
# Visit https://github.com/ggml-org/llama.cpp/releases for cu12x binaries
2. Pull the quant you want
Per the unsloth/Qwen3-8B-GGUF per-tier file-size table (link-back to upstream Qwen/Qwen3-8B confirmed on the page header):
| Quant | File size | Notes |
|---|---|---|
| Q4_K_M | 5.03 GB | recommended for this card |
| Q5_K_M | 5.85 GB | better quality, still tiny |
| Q6_K | 6.73 GB | "near perfect" per bartowski |
| Q8_0 | 8.71 GB | near-lossless |
| BF16 | 16.4 GB | full precision — overflows 16 GB card; needs offload |
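Then via the llama.cpp Hugging Face shortcut (per the unsloth model card). The suffix after the colon selects the quant; the card's example uses UD-Q4_K_XL, an Unsloth dynamic quant that is not in the table above, so substitute a tier from the table (for example Q4_K_M) if you want those exact file sizes: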
# OpenAI-compatible local server with web UI
llama-server -hf unsloth/Qwen3-8B-GGUF:UD-Q4_K_XL
# Interactive terminal
llama-cli -hf unsloth/Qwen3-8B-GGUF:UD-Q4_K_XL
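The quant table above can be turned into a quick fit check. The sketch below uses the file sizes from that table; the `reserve_gb` headroom for KV cache and runtime overhead is a hypothetical figure you should tune, and treating file size as the resident weight size is a simplification.

```python
# File sizes in GB, copied from the unsloth/Qwen3-8B-GGUF table above.
QUANT_SIZES_GB = {
    "Q4_K_M": 5.03,
    "Q5_K_M": 5.85,
    "Q6_K": 6.73,
    "Q8_0": 8.71,
    "BF16": 16.4,
}


def quants_that_fit(vram_gb: float, reserve_gb: float) -> list[str]:
    """List quants whose weights fit after reserving headroom for KV cache etc."""
    budget = vram_gb - reserve_gb
    return [name for name, size in QUANT_SIZES_GB.items() if size <= budget]


# reserve_gb=5.0 is an assumed allowance, not a measured value.
print(quants_that_fit(vram_gb=16.0, reserve_gb=5.0))
```

Even with no reserve at all, BF16 is excluded on a 16 GB card, which matches the "overflows" note in the table.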
Option C — LM Studio (GUI)
LM Studio offers a one-click install path per the Qwen3-8B HF card. Search "Qwen3-8B GGUF" inside the app and pick the Q4_K_M tier, or use the direct-import link lmstudio://open_from_hf?model=unsloth/Qwen3-8B-GGUF.
Running
One-shot prompt via Ollama
ollama run qwen3:8b "Explain GQA attention in three sentences."
First run loads the model into VRAM (~5 GB resident at idle, growing as the KV cache fills with longer contexts). Subsequent prompts in the same session stay warm.
Disable thinking mode for short answers
ollama run qwen3:8b "/no_think What's the capital of France?"
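Per the Qwen3-8B HF card, /no_think is a soft switch: thinking stays enabled in the chat template, but the model is told to skip the chain of thought for that turn, so the <think>...</think> prefix comes back empty or is omitted. The hard switch is enable_thinking=False in the chat template.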
OpenAI-compatible HTTP API
# Ollama exposes localhost:11434 by default
curl http://localhost:11434/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "qwen3:8b",
"messages": [{"role": "user", "content": "Write a haiku about Ada Lovelace GPUs."}]
}'
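The same request can be issued from Python with only the standard library. This is a sketch under the same assumptions as the curl call (Ollama on localhost:11434, OpenAI-style response shape); the function names are illustrative, and the `thinking` flag simply prepends the /no_think soft switch described above.

```python
import json
import urllib.request


def build_chat_request(prompt: str, model: str = "qwen3:8b", thinking: bool = True,
                       base_url: str = "http://localhost:11434") -> urllib.request.Request:
    """Build (but do not send) a chat-completions request for the local server."""
    content = prompt if thinking else f"/no_think {prompt}"
    body = {"model": model, "messages": [{"role": "user", "content": content}]}
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request) -> str:
    """Send the request and return the assistant message text."""
    with urllib.request.urlopen(request, timeout=120) as resp:
        reply = json.load(resp)
    return reply["choices"][0]["message"]["content"]


req = build_chat_request("What's the capital of France?", thinking=False)
print(req.full_url)
print(req.data.decode("utf-8"))
```

Building the request separately from sending it lets you inspect the payload without a running server; call `send(req)` once Ollama is up.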
For higher-throughput, production-style serving, the upstream Qwen3-8B card documents vllm serve Qwen/Qwen3-8B --enable-reasoning --reasoning-parser deepseek_r1 and python -m sglang.launch_server --model-path Qwen/Qwen3-8B --reasoning-parser qwen3. Both load BF16 weights (16.4 GB), which already exceeds this card's 16 GB before any KV cache is allocated. For the 4060 Ti, Ollama or llama.cpp with the Q4_K_M GGUF is the comfortable path.
Results
- Speed: 45.8 tokens/s generation at 4k context, Q4_K quantization, measured on RTX 4060 Ti 16GB, per the hardware-corner.net LLM benchmark table, surfaced via /check/qwen3-8b/rtx-4060-ti-16gb. Generation rate drops to 34.3 tok/s at 16k, 25.5 tok/s at 32k, and 13.0 tok/s at 64k as the KV cache grows. Prompt processing is much faster: 2,675.2 tok/s at 4k context per the same source.
- VRAM usage: The cited benchmark records peak VRAM at the 4k-context configuration as fully utilizing the card's 16 GB; see /check/qwen3-8b/rtx-4060-ti-16gb for the latest measurement. At idle the Q4_K_M weights occupy ~5 GB; the rest is KV cache headroom that the runtime fills as context grows. The official Qwen speed benchmark corroborates the precision/VRAM ladder on H20 hardware: BF16 = 15947 MB, FP8 = 9323 MB, AWQ-INT4 = 6177 MB.
- Quality notes: Q4_K_M is the community-default "sweet spot" — the bartowski Q-tier guide flags Q6_K as "near perfect, recommended" if you have the VRAM. On a 16 GB 4060 Ti you can also run Q6_K (6.73 GB) or Q8_0 (8.71 GB) with plenty of room — there's no quality reason to pick anything below Q4_K_M on this card.
For the full benchmark data and other-GPU comparisons, see /check/qwen3-8b/rtx-4060-ti-16gb.
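To translate the benchmark rates above into wall-clock expectations, a small calculation helps. The sketch below uses only the figures quoted in the Results list; it assumes the rate is roughly constant over a response, which is a simplification since the rate falls as context grows.

```python
# Generation rates (tokens/s) at each context size, from the Results list above.
GEN_TOK_PER_S = {"4k": 45.8, "16k": 34.3, "32k": 25.5, "64k": 13.0}
PROMPT_TOK_PER_S_4K = 2675.2  # prompt-processing rate at 4k context


def generation_seconds(new_tokens: int, context: str) -> float:
    """Approximate time to generate new_tokens at the given context size."""
    return new_tokens / GEN_TOK_PER_S[context]


def slowdown_vs_4k(context: str) -> float:
    """How many times slower generation is than at 4k context."""
    return GEN_TOK_PER_S["4k"] / GEN_TOK_PER_S[context]


for ctx in GEN_TOK_PER_S:
    print(f"{ctx}: {generation_seconds(500, ctx):.1f} s for 500 tokens, "
          f"{slowdown_vs_4k(ctx):.2f}x slower than 4k")
```

The 500-token response length is an arbitrary example; thinking-mode answers are often longer, so the gap between 4k and 64k matters more in practice.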
Troubleshooting
Ollama returns Error: model requires more system memory or hangs on load
The error text refers to system RAM rather than VRAM, so first check that free memory meets the 16 GB listed under Requirements. Then confirm a recent NVIDIA driver and CUDA 12.x runtime are installed (nvidia-smi should show a driver from the past 12 months). The RTX 4060 Ti uses the Ada Lovelace architecture (sm_89), which has been fully supported by mainline CUDA wheels since 2023, so no special build flags or wheel pinning are required. If Ollama still appears to hang on first load, watch nvidia-smi -l 1 in another terminal to confirm the GPU is actually being used; if it stays at 0% utilization, reinstall Ollama and re-pull the model.
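The nvidia-smi check above can also be scripted. This is a minimal sketch, assuming nvidia-smi is on PATH and using its CSV query mode; the function names and the sample line are illustrative, and the sample values are not a real measurement.

```python
import subprocess

# Ask nvidia-smi for utilization and memory as bare comma-separated numbers.
QUERY = [
    "nvidia-smi",
    "--query-gpu=utilization.gpu,memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_gpu_line(line: str) -> dict:
    """Parse one CSV line such as '37, 5480, 16380' into named fields."""
    util, used, total = (float(field.strip()) for field in line.split(","))
    return {"util_pct": util, "used_mib": used, "total_mib": total}


def read_gpu(index: int = 0) -> dict:
    """Run nvidia-smi and return stats for the GPU at the given index."""
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
    return parse_gpu_line(out.strip().splitlines()[index])


# Offline illustration with a made-up line; call read_gpu() on a real machine.
print(parse_gpu_line("37, 5480, 16380"))
```

If `read_gpu()` keeps reporting 0% utilization and almost no memory in use while a prompt is running, the model is likely running on the CPU.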
<think>...</think> output is bloating responses
Qwen3 enables thinking mode by default per the HF card quickstart. Send /no_think at the start of any user message to disable it for that turn, or pass enable_thinking=False if you're calling the chat-template API directly.
I want the larger 14B / 32B sibling
Qwen3-14B at Q4_K_M is ~8.5 GB on disk and fits a 16 GB card comfortably — swap qwen3:8b for qwen3:14b in any Ollama command. Qwen3-32B at Q4_K_M is ~20 GB and does not fit without aggressive offloading; same for the 30B MoE and 235B MoE variants (MoE total params must be resident — see the Qwen3 model card on the dense/MoE split). For a 32B+ recipe on this card, request via /contribute.
Using transformers directly instead of Ollama
If you bypass Ollama / llama.cpp and run the HF card quickstart via transformers directly, the quickstart uses torch_dtype="auto" and device_map="auto" — it does not hardcode attn_implementation="flash_attention_2", so it works out of the box on the 4060 Ti with a stock pip install torch (Ada sm_89 has full FA2 kernel coverage if you do opt into FA2 separately). Unlike Blackwell-class cards, no cu128-specific wheel selection is required for the 4060 Ti.
Generation slows dramatically past 32k context
32k is Qwen3's native context window per the HF card ("Context Length: 32,768 natively and 131,072 tokens with YaRN"). Beyond that the model needs YaRN extension — supported in llama.cpp via --rope-scaling yarn --rope-scale 4 --yarn-orig-ctx 32768 per the unsloth GGUF instructions — but quality degrades and the KV cache balloons. For long-doc workflows, prefer chunking + retrieval over pushing context past 32k. The hardware-corner.net benchmark shows the rate falling to 13.0 tok/s at 64k context on this card.
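To see why the KV cache balloons, it helps to estimate its size. The sketch below uses the standard per-token formula (keys plus values, per layer, per KV head); the architecture constants (36 layers, 8 KV heads, head dimension 128) and the 2-byte fp16 cache are assumptions taken from outside this recipe, so verify them against the model's config, and note that runtimes may quantize the cache to something smaller.

```python
# Assumed Qwen3-8B architecture values; check the model config before relying on them.
N_LAYERS = 36
N_KV_HEADS = 8    # grouped-query attention: fewer KV heads than query heads
HEAD_DIM = 128


def kv_cache_bytes(context_tokens: int, bytes_per_value: int = 2) -> int:
    """Bytes for keys and values across all layers (2 = fp16 cache)."""
    per_token = 2 * N_LAYERS * N_KV_HEADS * HEAD_DIM * bytes_per_value
    return per_token * context_tokens


def yarn_scale(target_ctx: int, native_ctx: int = 32_768) -> float:
    """Scaling factor needed to stretch the native window to target_ctx."""
    return target_ctx / native_ctx


for ctx in (4_096, 32_768, 131_072):
    print(ctx, round(kv_cache_bytes(ctx) / 2**30, 2), "GiB")
```

Under these assumptions an fp16 cache is 4.5 GiB at 32,768 tokens and 18 GiB at 131,072 tokens, so the cache alone would exceed the card at the full YaRN window. The factor of 4 returned by `yarn_scale(131_072)` corresponds to the --rope-scale 4 value quoted above.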