What You'll Build
A local Qwen3-8B chat / reasoning assistant running on an RTX 4090, served through Ollama (or llama.cpp / LM Studio — same GGUF, three loaders). The recipe pins the dense 8B variant at Q4_K_M quantization (5.03 GB on disk), which leaves the 24 GB card with enormous KV-cache headroom — you can drive Qwen3's full 131k YaRN-extended context without offload, or comfortably run multiple concurrent sessions.
Hardware data: RTX 4090 (24 GB VRAM) · Q4_K GGUF · 141.3 tokens/s generation at 4k context · See benchmark data
⚠️ Variant pinned: Qwen3 ships 8 sizes from the same Qwen org. Per the Ollama qwen3 tag list, Qwen3 spans 0.6b, 1.7b, 4b, 8b (this recipe), 14b, 30b (MoE), 32b, and 235b (MoE). The siblings have wildly different VRAM profiles: Qwen3-14B in Q4_K_M is ~8.5 GB and fits cleanly on this card; Qwen3-32B in Q4_K_M is ~20 GB and still fits with a smaller context budget; Qwen3-235B (MoE, ~22B active) needs >100 GB total resident weights since the router can't pre-prune (see Qwen3 model card on the dense/MoE split). The instructions below are for the dense 8.2B model only. If you want 14B or 32B on this card, swap qwen3:8b for qwen3:14b or qwen3:32b; for 235B go to /contribute.
ℹ️ Thinking mode is on by default. Qwen3-8B has a built-in chain-of-thought ("thinking") mode that the model card's quickstart enables via enable_thinking=True. Output starts with a <think>...</think> block followed by the user-facing answer. To disable it for latency-sensitive use, send /no_think in your prompt or pass enable_thinking=False in the chat template.
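If you consume the raw output programmatically, the reasoning block usually needs to be separated from the answer. The following is a minimal sketch, assuming the output contains at most one leading <think>...</think> block as described above; split_thinking is a hypothetical helper name, not part of Ollama or the Qwen tooling.

```python
import re

# Matches one leading <think>...</think> block; DOTALL lets "." span newlines.
_THINK_RE = re.compile(r"^\s*<think>(.*?)</think>\s*", re.DOTALL)


def split_thinking(text: str) -> tuple[str, str]:
    """Return (reasoning, answer) from raw model output.

    Hypothetical helper: if no leading think block is found, the reasoning
    part is empty and the whole text is treated as the answer.
    """
    match = _THINK_RE.match(text)
    if match is None:
        return "", text.strip()
    return match.group(1).strip(), text[match.end():].strip()


if __name__ == "__main__":
    raw = "<think>\n2 + 2 is 4.\n</think>\n\nThe answer is 4."
    reasoning, answer = split_thinking(raw)
    print("reasoning:", reasoning)
    print("answer:", answer)
```

Keeping the reasoning as a separate return value, rather than discarding it, makes it easy to log for debugging while showing users only the answer.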
Requirements
| Component | Minimum | Tested |
|---|---|---|
| GPU | 6 GB VRAM (Q4_K_M) | RTX 4090 (24 GB) |
| RAM | 16 GB system | — |
| Storage | 5.03 GB (Q4_K_M GGUF) up to 16.4 GB (BF16) | per unsloth/Qwen3-8B-GGUF |
| Driver | CUDA 12.x (Ada sm_89) | — |
| Runtime | Ollama 0.5+ / llama.cpp / LM Studio | — |
The model is released under Apache 2.0 — commercial use is permitted. The min_vram_gb: 6 floor is set by the Q4_K_M weight footprint; on a 24 GB RTX 4090 you have 18 GB of headroom for long contexts or higher-precision quants (see "Picking a quant on this card" below).
Installation
The fastest path is Ollama — one command pulls the canonical Q4_K_M build maintained by the Qwen team:
Option A — Ollama (recommended)
1. Install Ollama
# macOS / Linux
curl -fsSL https://ollama.com/install.sh | sh
(Windows: download from ollama.com/download.) Per the Qwen3 model card, "applications such as Ollama, LMStudio, MLX-LM, llama.cpp, and KTransformers have also supported Qwen3."
2. Pull the 8B model
ollama pull qwen3:8b
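This fetches a 5.2 GB Q4_K_M checkpoint per the Ollama qwen3:8b tag (a separate build from the 5.03 GB unsloth Q4_K_M file listed under Option B, hence the slightly different size). The download is one file, so no manual quant-tier selection is needed.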
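To confirm the pull from a script rather than by eye, you can query Ollama's local model list. This is a hedged sketch: it assumes Ollama is listening on its default localhost:11434 address and that the /api/tags response carries a "models" list with "name" and "size" (bytes) fields; find_model and fetch_tags are illustrative names.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # Ollama's default address (assumption)


def find_model(tags_response: dict, name: str):
    """Return the entry for `name` from an /api/tags response, or None."""
    for entry in tags_response.get("models", []):
        if entry.get("name") == name:
            return entry
    return None


def fetch_tags(base_url: str = OLLAMA_URL) -> dict:
    """Fetch the list of locally available models from a running Ollama."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=10) as resp:
        return json.load(resp)


if __name__ == "__main__":
    entry = find_model(fetch_tags(), "qwen3:8b")
    if entry is None:
        print("qwen3:8b not found; run `ollama pull qwen3:8b`")
    else:
        print(f"qwen3:8b present, {entry['size'] / 1e9:.2f} GB on disk")
```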
Option B — llama.cpp + community GGUF (picking a quant on this card)
If you want a higher-fidelity quant (Q6_K, Q8_0, or even full BF16), use a community redistributor that publishes the full ladder. The 4090's 24 GB envelope makes all tiers viable:
1. Install llama.cpp
# macOS (Homebrew)
brew install llama.cpp
# Linux: pre-built CUDA binaries
# Visit https://github.com/ggml-org/llama.cpp/releases for cu12x binaries
2. Pull the quant you want
Per the unsloth/Qwen3-8B-GGUF per-tier file-size table (link-back to upstream Qwen/Qwen3-8B confirmed on the page header):
| Quant | File size | Notes on a 24 GB RTX 4090 |
|---|---|---|
| Q4_K_M | 5.03 GB | recommended default — leaves ~18 GB for KV cache / concurrent sessions |
| Q5_K_M | 5.85 GB | slight quality lift, same fit |
| Q6_K | 6.73 GB | "near perfect" per bartowski |
| Q8_0 | 8.71 GB | near-lossless |
| BF16 | 16.4 GB | full precision — fits the 4090 with ~7 GB to spare for context |
| UD-Q4_K_XL | 5.14 GB | Unsloth "Dynamic 2.0" tier |
Then via the llama.cpp Hugging Face shortcut (per the unsloth model card):
# OpenAI-compatible local server with web UI
llama-server -hf unsloth/Qwen3-8B-GGUF:UD-Q4_K_XL
# Interactive terminal
llama-cli -hf unsloth/Qwen3-8B-GGUF:UD-Q4_K_XL
Option C — LM Studio (GUI)
LM Studio offers a one-click install path per the Qwen3-8B HF card. Search "Qwen3-8B GGUF" inside the app and pick the Q4_K_M tier, or use the direct-import link lmstudio://open_from_hf?model=unsloth/Qwen3-8B-GGUF.
Running
One-shot prompt via Ollama
ollama run qwen3:8b "Explain GQA attention in three sentences."
First run loads the model into VRAM (~5 GB resident at idle with Q4_K_M, growing as the KV cache fills with longer contexts). Subsequent prompts in the same session stay warm.
Disable thinking mode for short answers
ollama run qwen3:8b "/no_think What's the capital of France?"
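Per the Qwen3-8B HF card, /no_think is a soft switch: it tells the model to skip reasoning for that turn while the template-level enable_thinking setting stays on, so the reply may still open with an empty <think></think> block before the answer.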
OpenAI-compatible HTTP API
# Ollama exposes localhost:11434 by default
curl http://localhost:11434/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "qwen3:8b",
"messages": [{"role": "user", "content": "Write a haiku about Ada Lovelace GPUs."}]
}'
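The same request can be made from Python with only the standard library. This is a sketch under the same assumptions as the curl call above (Ollama on localhost:11434, model tag qwen3:8b); build_chat_request and chat are hypothetical helper names, and the think=False option simply prepends the /no_think switch shown earlier.

```python
import json
import urllib.request


def build_chat_request(prompt, model="qwen3:8b",
                       base_url="http://localhost:11434", think=True):
    """Build a POST request for the OpenAI-compatible chat endpoint."""
    # Prepend the /no_think soft switch when reasoning is not wanted.
    content = prompt if think else f"/no_think {prompt}"
    payload = {"model": model,
               "messages": [{"role": "user", "content": content}]}
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt, **kwargs):
    """Send the request and return the assistant's message text."""
    request = build_chat_request(prompt, **kwargs)
    with urllib.request.urlopen(request, timeout=300) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]


if __name__ == "__main__":
    print(chat("Write a haiku about Ada Lovelace GPUs."))
```

Separating request construction from sending keeps the payload easy to inspect or log without a running server.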
vLLM / SGLang for production-style serving
The 24 GB envelope is generous enough that the upstream vllm and sglang BF16 paths from the Qwen3-8B card also fit cleanly — BF16 weights are 16.4 GB, leaving ~7 GB for KV cache (good for moderate-context batched serving). Direct from the model card:
# vLLM (v0.8.5+)
vllm serve Qwen/Qwen3-8B --enable-reasoning --reasoning-parser deepseek_r1
# SGLang (v0.4.6.post1+)
python3 -m sglang.launch_server --model-path Qwen/Qwen3-8B \
--host 0.0.0.0 --port 30000 --reasoning-parser qwen3
For single-user chat or long-context exploration, Ollama / llama.cpp with the Q4_K_M GGUF keeps far more VRAM free for the KV cache — at the same 4k context, Q4_K runs significantly faster than BF16 per the precision/memory ladder in the Qwen speed benchmark.
Results
- Speed: 141.3 tokens/s generation at 4k context, Q4_K quantization, measured on RTX 4090 — per the hardware-corner.net RTX 4090 LLM benchmark table (row label "Qwen3 8B (Q4_K)", "Token Generation" column). Generation rate drops to 108.0 tok/s at 16k, 82.3 tok/s at 32k, 56.1 tok/s at 64k, and 33.8 tok/s at 128k as the KV cache grows. Prompt processing is much faster on the same source — 9,250.5 tok/s at 4k context, 5,530.5 at 16k, 3,560.1 at 32k. For reference, the same source's RTX 4060 Ti 16GB sibling page shows 45.8 tok/s at 4k for the same row — the 4090 is ~3.1× faster, in line with its ~3.5× memory-bandwidth advantage (1,008 vs 288 GB/s).
- VRAM usage: Q4_K_M weights occupy ~5 GB at idle; the /check/qwen3-8b/rtx-4090 endpoint currently has no community-submitted benchmark for this pair (link expands as data lands). For BF16, the Qwen speed benchmark measured 15,947 MB on an H20-96GB reference card, which fits the 4090's 24 GB envelope cleanly. FP8 = 9,323 MB and AWQ-INT4 = 6,177 MB on the same source.
- Quality notes: Q4_K_M is the community-default "sweet spot"; the bartowski Q-tier guide flags Q6_K as "near perfect, recommended" if you want the highest quality without going to BF16. On a 24 GB 4090 there's no quality reason to pick anything below Q4_K_M, and you can comfortably go up to Q8_0 (8.71 GB) or even BF16 (16.4 GB) if you want full-precision behaviour with room to spare for context.
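The speed figures above translate into wait times with simple division. The sketch below uses only the generation rates and bandwidth numbers quoted in the Speed bullet; it assumes the rate stays constant while a reply is generated, which is a simplification, and the function names are illustrative.

```python
# Generation rates (tokens/s) by context length, as quoted in the Speed bullet.
GEN_TOK_S = {"4k": 141.3, "16k": 108.0, "32k": 82.3, "64k": 56.1, "128k": 33.8}


def seconds_for(tokens, tok_per_s):
    """Time to generate `tokens` at a constant rate (simplifying assumption)."""
    return tokens / tok_per_s


def speedup(fast, slow):
    """Ratio between two rates or bandwidths."""
    return fast / slow


if __name__ == "__main__":
    for ctx, rate in GEN_TOK_S.items():
        print(f"{ctx:>4}: {seconds_for(1000, rate):5.1f} s per 1,000 generated tokens")
    # RTX 4090 vs RTX 4060 Ti 16GB at 4k context, then memory bandwidth.
    print(f"generation speedup: {speedup(141.3, 45.8):.2f}x")
    print(f"bandwidth ratio:    {speedup(1008, 288):.2f}x")
```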
For the full benchmark data and other-GPU comparisons, see /check/qwen3-8b/rtx-4090.
Troubleshooting
Ollama returns Error: model requires more system memory or hangs on load
The memory error refers to system RAM rather than VRAM, so first check that free RAM meets the 16 GB minimum listed under Requirements. For a hang on load, confirm a recent NVIDIA driver and CUDA 12.x runtime are installed (nvidia-smi prints the driver version in its header; it should be a release from the past 12 months). The RTX 4090 uses the Ada Lovelace architecture (sm_89), which has been fully supported by mainline CUDA wheels since 2023, so no special build flags or wheel pinning are required. If Ollama still appears to hang on first load, watch nvidia-smi -l 1 in another terminal to confirm the GPU is actually being used; if it stays at 0% utilization, reinstall Ollama and re-pull the model.
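The GPU check described above can also be scripted. This sketch assumes nvidia-smi is on PATH and uses its --query-gpu CSV output; parse_gpu_line and read_gpus are hypothetical helpers, and the parser expects plain integer fields (it does not handle values such as [N/A]).

```python
import subprocess

# CSV query without header or units: name, used MiB, total MiB, utilization %.
QUERY = [
    "nvidia-smi",
    "--query-gpu=name,memory.used,memory.total,utilization.gpu",
    "--format=csv,noheader,nounits",
]


def parse_gpu_line(line):
    """Parse one CSV line of the query above into a dict."""
    name, used, total, util = [field.strip() for field in line.split(",")]
    return {"name": name, "used_mib": int(used),
            "total_mib": int(total), "util_pct": int(util)}


def read_gpus():
    """Run nvidia-smi once and return one dict per GPU."""
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
    return [parse_gpu_line(line) for line in out.splitlines() if line.strip()]


if __name__ == "__main__":
    for gpu in read_gpus():
        print(f"{gpu['name']}: {gpu['used_mib']} / {gpu['total_mib']} MiB, "
              f"{gpu['util_pct']}% utilization")
```

Run it while a prompt is generating: memory use near the weight footprint with non-zero utilization suggests the model is actually on the GPU.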
<think>...</think> output is bloating responses
Qwen3 enables thinking mode by default per the HF card quickstart. Send /no_think at the start of any user message to disable it for that turn, or pass enable_thinking=False if you're calling the chat-template API directly.
I want to push past 32k context
32k is Qwen3's native context window per the HF card ("Context Length: 32,768 natively and 131,072 tokens with YaRN"). Beyond that the model needs YaRN extension — supported in llama.cpp via --rope-scaling yarn --rope-scale 4 --yarn-orig-ctx 32768 per the unsloth GGUF instructions. On the 4090, the hardware-corner.net benchmark shows the generation rate falling from 141.3 tok/s at 4k to 33.8 tok/s at 128k — the model still fits comfortably, but throughput drops with the larger KV cache. Quality past 32k also degrades; for long-doc workflows, prefer chunking + retrieval over pushing context to the full 131k.
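The YaRN flags follow from one ratio: target context divided by the native 32,768 window. The sketch below builds the llama.cpp argument list from that ratio, using the flags quoted above plus -c for the context size; yarn_args is a hypothetical helper, and choosing the smallest scale that covers your target is an assumption about good practice rather than a documented requirement.

```python
NATIVE_CTX = 32768  # Qwen3 native context length, per the HF card


def yarn_args(target_ctx, native_ctx=NATIVE_CTX):
    """Return llama.cpp CLI arguments for the requested context size.

    Within the native window no RoPE scaling is added; beyond it the
    YaRN scale is the ratio of target to native context.
    """
    if target_ctx <= native_ctx:
        return ["-c", str(target_ctx)]
    scale = target_ctx / native_ctx
    return [
        "-c", str(target_ctx),
        "--rope-scaling", "yarn",
        "--rope-scale", f"{scale:g}",
        "--yarn-orig-ctx", str(native_ctx),
    ]


if __name__ == "__main__":
    # Example: arguments to append to a llama-server command line.
    print(" ".join(yarn_args(131072)))
```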
I want the larger 14B / 32B sibling
Qwen3-14B at Q4_K_M is ~8.5 GB on disk and fits a 24 GB card with massive headroom; swap qwen3:8b for qwen3:14b in any Ollama command. Qwen3-32B at Q4_K_M is ~20 GB and also fits, though with much less KV-cache budget. The 30B and 235B MoE variants need all parameters resident, not just the active experts (see the Qwen3 model card on the dense/MoE split): Qwen3-30B-A3B at Q4 is therefore sized like a ~30B dense model, roughly 18 to 19 GB, which fits only with a tight context budget, while the 235B overflows outright. For a dedicated 32B or 235B recipe on this card, request via /contribute.
Using transformers directly instead of Ollama
If you bypass Ollama / llama.cpp and run the HF card quickstart via transformers directly, the quickstart uses torch_dtype="auto" and device_map="auto" — it does not hardcode attn_implementation="flash_attention_2", so it works out of the box on the 4090 with a stock pip install torch. Ada sm_89 has full FA2 kernel coverage if you do opt into FA2 separately. Unlike Blackwell-class cards (sm_120), no cu128-specific wheel selection is required for the 4090.
Why not just run BF16 by default on this card?
BF16 weights are 16.4 GB per the unsloth/Qwen3-8B-GGUF table — fits the 4090's 24 GB envelope but leaves only ~7 GB for the KV cache, which limits how far you can push context. Q4_K_M at 5 GB leaves ~18 GB for KV cache (or for running concurrent sessions / batched serving), and quality is close enough to BF16 for most chat / reasoning use that the bartowski tier guide recommends Q6_K rather than BF16 as the quality ceiling. If you specifically need bit-exact BF16 behaviour (e.g. for reproducing a published paper's results), the vLLM / SGLang commands in the Running section run it directly from the HF repo.
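The trade-off can be put in rough numbers by estimating how many tokens of KV cache fit beside each set of weights. This is a back-of-envelope sketch, not a measurement: it assumes the Qwen3-8B configuration of 36 layers, 8 KV heads and a head dimension of 128, an FP16 cache, and a nominal 24 GB budget with no runtime overhead. All names are illustrative.

```python
# Assumed architecture values (Qwen3-8B config); adjust if your model differs.
N_LAYERS = 36
N_KV_HEADS = 8
HEAD_DIM = 128
BYTES_PER_VALUE = 2   # FP16 cache entries (assumption; quantized caches are smaller)
VRAM_GB = 24.0        # nominal card capacity, ignoring compute buffers


def kv_bytes_per_token():
    """One key and one value vector per KV head, per layer."""
    return 2 * N_LAYERS * N_KV_HEADS * HEAD_DIM * BYTES_PER_VALUE


def kv_cache_gb(context_tokens):
    """KV cache size in GB (1e9 bytes) for a given context length."""
    return kv_bytes_per_token() * context_tokens / 1e9


def max_context_tokens(weights_gb, vram_gb=VRAM_GB):
    """Upper bound on cached tokens that fit next to the weights."""
    free_bytes = (vram_gb - weights_gb) * 1e9
    return int(free_bytes // kv_bytes_per_token())


if __name__ == "__main__":
    # File sizes from the unsloth/Qwen3-8B-GGUF table.
    for quant, size_gb in {"Q4_K_M": 5.03, "Q8_0": 8.71, "BF16": 16.4}.items():
        print(f"{quant:>7}: ~{max_context_tokens(size_gb):,} tokens of FP16 KV cache")
```

Treat the result as an upper bound for an FP16 cache: compute buffers and framework overhead reduce it, while a quantized KV cache raises it.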