What You'll Build
A local Qwen3-8B chat and reasoning assistant on an RTX 3090, served through Ollama or llama.cpp at Q4_K_M quantization. The weights are only ~5 GB on disk, so the model uses roughly 6 GB of the 3090's 24 GB envelope — leaving 18 GB free to colocate a second model, extend context to the full 131K window via YaRN, or keep a larger sibling (Qwen3-14B Q4_K_M, ~9 GB) loaded alongside.
Hardware data: RTX 3090 (24 GB VRAM) · 115.3 tok/s generation @ 4K context (Q4_K, CUDA, FA on) · See benchmark data
ℹ️ Wildly over-provisioned by design. Qwen3-8B Q4_K_M needs ~6 GB; the 3090 has 24 GB. The recipe below is the same install on any ≥ 6 GB CUDA card — but the "Headroom" section lower down is what makes a 3090 worth using over a 4060 or 3060 12 GB for this model. If you only want chat, a 12 GB card is fine; the 3090 starts paying for itself when you reach for the spare 18 GB.
Requirements
| Component | Minimum | Tested |
|---|---|---|
| GPU | 6 GB VRAM (Ampere or newer recommended) | RTX 3090 (24 GB) |
| RAM | 16 GB system RAM | — |
| Storage | ~6 GB for Q4_K_M GGUF | add ~17 GB if you also pull BF16 |
| Driver | NVIDIA driver with CUDA 12.x | — |
| Software | Ollama, llama.cpp, or LM Studio | — |
Installation
Pick one of the three runtimes below. All three are first-party-supported by the Qwen team and pull from the canonical Qwen/Qwen3-8B weights (or an officially-tracked GGUF mirror).
Option A — Ollama (simplest)
curl -fsSL https://ollama.com/install.sh | sh
ollama pull qwen3:8b
qwen3:8b on the Ollama library is the Q4_K_M build, ~5.2 GB on disk.
Option B — llama.cpp (most control)
# macOS / Homebrew
brew install llama.cpp
# Linux: build from source with CUDA
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp && cmake -B build -DGGML_CUDA=ON && cmake --build build -j --config Release
# Pull the GGUF directly from Hugging Face and serve
llama-server -hf unsloth/Qwen3-8B-GGUF:UD-Q4_K_XL --port 8080
unsloth/Qwen3-8B-GGUF ships a per-tier table: Q4_K_M = 5.03 GB, UD-Q4_K_XL = 5.14 GB (Unsloth's accuracy-tuned 4-bit), Q5_K_M = 5.85 GB, Q6_K = 6.73 GB, Q8_0 = 8.71 GB, BF16 = 16.4 GB. The 3090 fits any of them; BF16 leaves ~7 GB of working VRAM (enough for chat, tight for long context).
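To make the tier table concrete, here is a minimal Python sketch that picks the largest quantization whose file fits a VRAM budget. The sizes are the ones quoted above; the function name `pick_quant` and the `reserve_gb` allowance for KV cache and buffers are illustrative assumptions, and file size is only a rough proxy for loaded VRAM.

```python
# File sizes (GB) as quoted from the unsloth/Qwen3-8B-GGUF tier table above.
QUANT_SIZES_GB = {
    "Q4_K_M": 5.03,
    "UD-Q4_K_XL": 5.14,
    "Q5_K_M": 5.85,
    "Q6_K": 6.73,
    "Q8_0": 8.71,
    "BF16": 16.4,
}


def pick_quant(vram_gb, reserve_gb=4.0):
    """Return the largest tier that fits in vram_gb after holding back reserve_gb.

    reserve_gb is a hypothetical allowance for KV cache and compute buffers;
    tune it to your context length. Returns None if nothing fits.
    """
    budget = vram_gb - reserve_gb
    fitting = [(size, name) for name, size in QUANT_SIZES_GB.items() if size <= budget]
    return max(fitting)[1] if fitting else None
```

With the default reserve, `pick_quant(24)` selects BF16 and `pick_quant(12)` selects Q6_K; raise `reserve_gb` if you plan to run long contexts.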
Option C — LM Studio (GUI)
Install LM Studio, search for qwen3-8b, and download a GGUF build (defaults to Q4_K_M). Useful if you want a chat UI without writing a server config.
Running
Ollama
ollama run qwen3:8b "Explain GQA attention in three sentences."
# Disable Qwen3's reasoning trace per turn (faster, less chatty)
ollama run qwen3:8b "/no_think What's the capital of France?"
# OpenAI-compatible endpoint
curl http://localhost:11434/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model": "qwen3:8b", "messages": [{"role": "user", "content": "Write a haiku about Ampere GPUs."}]}'
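The same request can be made from Python with only the standard library. This is a sketch, assuming the default Ollama port shown above; the helper names `build_chat_request` and `chat` are illustrative, not part of any Ollama client.

```python
import json
import urllib.request


def build_chat_request(prompt, model="qwen3:8b", base_url="http://localhost:11434"):
    """Build (but do not send) a POST request for /v1/chat/completions."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt, **kwargs):
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(prompt, **kwargs), timeout=120) as resp:
        body = json.load(resp)
    # OpenAI-style responses carry the reply under choices[0].message.content.
    return body["choices"][0]["message"]["content"]
```

With Ollama running, `print(chat("Write a haiku about Ampere GPUs."))` should print the reply. To target the llama.cpp server instead, pass `base_url="http://localhost:8080"`.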
llama.cpp
# Server with OpenAI-compatible API on :8080
llama-server -hf unsloth/Qwen3-8B-GGUF:UD-Q4_K_XL --port 8080 -ngl 99 -fa
# Single-shot CLI
llama-cli -hf unsloth/Qwen3-8B-GGUF:UD-Q4_K_XL -p "Summarize what GQA does in two lines."
-ngl 99 offloads all layers to the GPU; -fa enables Flash Attention (which the RTX 3090's Ampere sm_86 supports natively — no special wheels needed).
Results
- Generation speed: 115.3 tok/s @ 4K context (Q4_K, CUDA, -fa 1), measured on Hardware Corner's RTX 3090 LLM benchmark page. The same source publishes the full context ladder for this card:

| Context | Token generation | Prompt processing |
|---|---|---|
| 4K | 115.3 tok/s | 4,049.6 tok/s |
| 16K | 87.5 tok/s | 2,572.5 tok/s |
| 32K | 67.9 tok/s | 1,714.6 tok/s |
| 64K | 46.6 tok/s | 1,014.3 tok/s |
| 128K | 28.1 tok/s | 570.0 tok/s |

- VRAM usage: Q4_K_M weights load to roughly 6 GB. A 32K fp16 KV cache adds about 4.5 GiB (GQA keeps it compact: the model has 8 KV heads), so expect around 10 to 11 GB in total at 32K. At short contexts the 3090's 24 GB envelope leaves about 18 GB free for the use cases below. See live benchmark data.
- Quality notes: Qwen3 ships a hybrid thinking/non-thinking mode. The model card recommends Temperature=0.6, TopP=0.95, TopK=20, MinP=0 for thinking mode and Temperature=0.7, TopP=0.8, TopK=20, MinP=0 for non-thinking mode. Avoid greedy decoding in either mode, per the Qwen team's guidance. Use /no_think in a prompt (or enable_thinking=False in the chat template) to suppress the <think> trace when you just want a fast answer.
For the full benchmark data and other GPUs in the catalogue, see /check/qwen3-8b/rtx-3090.
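The sampling recommendations in the quality notes above can be captured as presets. The sketch below only builds a request payload; `PRESETS` and `build_payload` are hypothetical names. Note that `top_k` and `min_p` are not part of the OpenAI chat-completions schema, so whether your server honours them when sent this way is an assumption to verify.

```python
# Sampling values as quoted from the Qwen3 model card in the quality notes above.
PRESETS = {
    "thinking": {"temperature": 0.6, "top_p": 0.95, "top_k": 20, "min_p": 0.0},
    "non_thinking": {"temperature": 0.7, "top_p": 0.8, "top_k": 20, "min_p": 0.0},
}


def build_payload(prompt, thinking=True, model="qwen3:8b"):
    """Return a chat-completions payload with the matching sampling preset.

    When thinking is False, the /no_think soft switch is prepended to the
    user turn, mirroring the `ollama run` example earlier.
    """
    content = prompt if thinking else f"/no_think {prompt}"
    payload = {"model": model, "messages": [{"role": "user", "content": content}]}
    payload.update(PRESETS["thinking" if thinking else "non_thinking"])
    return payload
```

Keeping the two presets side by side makes it harder to pair the non-thinking switch with thinking-mode sampling values by accident.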
Using the 18 GB of headroom — the real reason to run this on a 3090
Running Qwen3-8B at Q4_K_M on the 3090 leaves room for three concrete things you can't do on a 12 GB card:
- Colocate a second model. Load a Whisper-large (~3 GB), a 7B embedding model, or a Stable Diffusion XL pipeline (~7 GB) alongside Qwen3-8B for full ASR → reason → speak or RAG → reason → image pipelines on one card. With Ollama, OLLAMA_NUM_PARALLEL=2 OLLAMA_MAX_LOADED_MODELS=2 keeps both warm in VRAM.
- Run the 14B sibling instead, or both. Qwen3-14B at Q4_K_M is ~9 GB; you can keep the 8B loaded for fast turns and the 14B loaded for hard ones, and swap by request. The combined footprint is ~17 GB, still inside the envelope.
- Push context to the full 131K window. The Qwen3-8B model card lists a native context of 32,768 tokens with explicit YaRN scaling to 131,072 tokens. Hardware Corner's table above measures 28.1 tok/s at 128K, which is usable for long-document Q&A. Enable in llama.cpp with -c 131072 --rope-scaling yarn --rope-scale 4 --yarn-orig-ctx 32768. A 12 GB card can't fit the 128K KV cache (roughly 18 GiB at fp16); the 3090 can, though with little VRAM to spare.
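The KV cache figures behind the long-context item can be estimated with simple arithmetic. The sketch below assumes Qwen3-8B has 36 layers, 8 KV heads, and a head dimension of 128, with an fp16 cache; treat those defaults as assumptions and check them against the model's config. The function name `kv_cache_gib` is illustrative.

```python
def kv_cache_gib(context_tokens, layers=36, kv_heads=8, head_dim=128, bytes_per_value=2):
    """Estimate KV cache size in GiB for a given context length.

    Per token, each layer stores one key and one value vector per KV head.
    Defaults are assumed Qwen3-8B values with an fp16 cache; verify them
    against the model's config before relying on the result.
    """
    bytes_per_token = 2 * layers * kv_heads * head_dim * bytes_per_value
    return context_tokens * bytes_per_token / 2**30


for ctx in (4096, 32768, 131072):
    print(f"{ctx:>7} tokens: {kv_cache_gib(ctx):.2f} GiB")
```

Under these assumptions the cache comes to about 4.5 GiB at 32K and 18 GiB at 128K, so the full window nearly fills the card next to the Q4_K_M weights. A quantized KV cache (llama.cpp's --cache-type-k and --cache-type-v options) would shrink it, possibly at some cost in quality; passing `bytes_per_value=1` gives a rough idea of the saving.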
Troubleshooting
Generation feels slow under load or at long context
Confirm Flash Attention is enabled in your runtime:
- Ollama: enabled by default on Ampere; check that nvidia-smi shows the GPU pinned during inference.
- llama.cpp: pass -fa or -fa 1 on the command line.
The 3090's Ampere sm_86 has full FA2 support, so this is not the wheel-availability problem seen on Blackwell cards. If you're under-utilising the GPU, the bottleneck is almost always CPU-side tokenisation or a missing -ngl offload flag.
Bloated reasoning traces eat your context budget
Qwen3's thinking mode emits a <think>...</think> block before the answer. For chat that doesn't need reasoning, suppress it with /no_think per turn (or set enable_thinking=False in the chat template). The Qwen team's model card documents both the hard switch (enable_thinking flag) and the soft per-turn switch (/think, /no_think tags).
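If you keep thinking mode on but only want to show or store the answer, the trace can be split off client-side. This is a minimal sketch that assumes the raw completion contains the literal tags described above; some servers already return the reasoning in a separate field, in which case it is unnecessary. The name `split_reasoning` is illustrative.

```python
import re

# Matches one <think>...</think> block, including newlines inside it.
THINK_BLOCK = re.compile(r"<think>(.*?)</think>\s*", re.DOTALL)


def split_reasoning(text):
    """Return (reasoning, answer) from a raw Qwen3 completion.

    reasoning is the text inside the first <think> block, or "" if none.
    """
    match = THINK_BLOCK.search(text)
    if match is None:
        return "", text.strip()
    reasoning = match.group(1).strip()
    answer = THINK_BLOCK.sub("", text).strip()
    return reasoning, answer
```

Dropping the reasoning from earlier turns before resending the conversation history keeps old traces from consuming the context budget.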
Pushing past the 32K native context
Native context is 32,768 tokens. For anything longer, enable YaRN explicitly — llama.cpp:
llama-server -hf unsloth/Qwen3-8B-GGUF:UD-Q4_K_XL \
-c 131072 --rope-scaling yarn --rope-scale 4 --yarn-orig-ctx 32768 -ngl 99 -fa
Expect ~28 tok/s at the full 128K extent per Hardware Corner's table.
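The scale factor in that command is the ratio of the target context to the native one (131,072 / 32,768 = 4). A small sketch of that calculation follows; `yarn_args` is a hypothetical helper, and rounding the factor up to a whole number is a simplifying choice of this example rather than anything llama.cpp requires.

```python
NATIVE_CTX = 32768  # native Qwen3-8B context, per the model card


def yarn_args(target_ctx, native_ctx=NATIVE_CTX):
    """Return llama-server flags for target_ctx, adding YaRN only when needed."""
    args = ["-c", str(target_ctx)]
    if target_ctx > native_ctx:
        # Scale factor is the ratio of target to native context, rounded up.
        scale = -(-target_ctx // native_ctx)
        args += [
            "--rope-scaling", "yarn",
            "--rope-scale", str(scale),
            "--yarn-orig-ctx", str(native_ctx),
        ]
    return args
```

For example, `" ".join(yarn_args(65536))` yields flags with a scale of 2, while anything at or below 32,768 tokens gets only the context-size flag, so short-context runs are left unscaled.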
Driver / CUDA errors on load
Update to a recent NVIDIA driver with CUDA 12.x. The 3090 is Ampere sm_86 and has been mainline-supported by every recent driver branch; if you hit CUDA out of memory on a 6 GB-class model, the most common cause is another process holding VRAM — nvidia-smi lists culprits.
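To find which process is holding VRAM, nvidia-smi's CSV query output can be sorted by memory use. The sketch below separates parsing from the actual call so the parser can be checked without a GPU; the helper names are illustrative, and it assumes process names do not themselves end in a comma-separated number.

```python
import subprocess

QUERY = [
    "nvidia-smi",
    "--query-compute-apps=pid,process_name,used_memory",
    "--format=csv,noheader,nounits",
]


def parse_compute_apps(csv_text):
    """Parse nvidia-smi CSV rows into (pid, name, MiB) tuples, largest first."""
    rows = []
    for line in csv_text.strip().splitlines():
        pid, rest = line.split(",", 1)
        name, used = rest.rsplit(",", 1)
        used = used.strip()
        # Some platforms report "[N/A]" instead of a number; count that as 0.
        rows.append((int(pid), name.strip(), int(used) if used.isdigit() else 0))
    return sorted(rows, key=lambda row: row[2], reverse=True)


def vram_holders():
    """Run nvidia-smi and return the parsed process list (needs an NVIDIA driver)."""
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
    return parse_compute_apps(out)
```

Calling `vram_holders()` on the machine in question lists the heaviest process first, which is usually the one to stop before reloading the model.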
Running larger Qwen3 siblings
ollama run qwen3:14b (~9 GB Q4_K_M) and ollama run qwen3:32b (~19 GB Q4_K_M) both fit the 3090. The 32B is tight — leave headroom for KV cache by reducing context or pulling a smaller quant.