How much VRAM does Qwen3-8B need?

About 6 GB — the minimum this recipe targets.

How hard is this setup?

Beginner level: follow the steps below.

Qwen3-8B on RTX 3090: Q4_K_M GGUF with 18 GB of Headroom for Colocation or Long Context

What You'll Build

A local Qwen3-8B chat and reasoning assistant on an RTX 3090, served through Ollama or llama.cpp at Q4_K_M quantization. The weights are only ~5 GB on disk, so the model uses roughly 6 GB of the 3090's 24 GB envelope — leaving 18 GB free to colocate a second model, extend context to the full 131K window via YaRN, or keep a larger sibling (Qwen3-14B Q4_K_M, ~9 GB) loaded alongside.

Hardware data: RTX 3090 (24 GB VRAM) · 115.3 tok/s generation @ 4K context (Q4_K, CUDA, FA on) · See benchmark data

ℹ️ Wildly over-provisioned by design. Qwen3-8B Q4_K_M needs ~6 GB; the 3090 has 24 GB. The recipe below is the same install on any ≥ 6 GB CUDA card — but the "Headroom" section lower down is what makes a 3090 worth using over a 4060 or 3060 12 GB for this model. If you only want chat, a 12 GB card is fine; the 3090 starts paying for itself when you reach for the spare 18 GB.

Requirements

Component	Minimum	Tested
GPU	6 GB VRAM (Ampere or newer recommended)	RTX 3090 (24 GB)
RAM	16 GB system RAM	—
Storage	~6 GB for Q4_K_M GGUF	~17 GB if you also pull BF16
Driver	NVIDIA driver with CUDA 12.x	—
Software	Ollama, llama.cpp, or LM Studio	—

Installation

Pick one of the three runtimes below. All three are first-party-supported by the Qwen team and pull from the canonical Qwen/Qwen3-8B weights (or an officially-tracked GGUF mirror).

Option A — Ollama (simplest)

curl -fsSL https://ollama.com/install.sh | sh
ollama pull qwen3:8b

qwen3:8b on the Ollama library is the Q4_K_M build, ~5.2 GB on disk.

Option B — llama.cpp (most control)

# macOS / Homebrew
brew install llama.cpp

# Linux: build from source with CUDA
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp && cmake -B build -DGGML_CUDA=ON && cmake --build build -j --config Release

# Pull the GGUF directly from Hugging Face and serve
llama-server -hf unsloth/Qwen3-8B-GGUF:UD-Q4_K_XL --port 8080

unsloth/Qwen3-8B-GGUF ships a per-tier table: Q4_K_M = 5.03 GB, UD-Q4_K_XL = 5.14 GB (Unsloth's accuracy-tuned 4-bit), Q5_K_M = 5.85 GB, Q6_K = 6.73 GB, Q8_0 = 8.71 GB, BF16 = 16.4 GB. The 3090 fits any of them; BF16 leaves ~7 GB of working VRAM (enough for chat, tight for long context).
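
The headroom arithmetic behind that table can be sketched in a few lines of Python. The file sizes are the ones quoted above; the helper name headroom_gb is hypothetical, and the result ignores KV cache and runtime buffers, so treat it as an upper bound on free VRAM rather than a measurement.

```python
# File sizes (GB) as quoted above for unsloth/Qwen3-8B-GGUF.
QUANT_SIZES_GB = {
    "Q4_K_M": 5.03,
    "UD-Q4_K_XL": 5.14,
    "Q5_K_M": 5.85,
    "Q6_K": 6.73,
    "Q8_0": 8.71,
    "BF16": 16.4,
}


def headroom_gb(quant: str, vram_gb: float = 24.0) -> float:
    """VRAM left after loading the weights alone (no KV cache, no buffers)."""
    return round(vram_gb - QUANT_SIZES_GB[quant], 2)


if __name__ == "__main__":
    for name in QUANT_SIZES_GB:
        print(f"{name:>11}: {headroom_gb(name):5.2f} GB left on a 24 GB card")
```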

Option C — LM Studio (GUI)

Install LM Studio, search for qwen3-8b, and download a GGUF build (defaults to Q4_K_M). Useful if you want a chat UI without writing a server config.

Running

Ollama

ollama run qwen3:8b "Explain GQA attention in three sentences."

# Disable Qwen3's reasoning trace per turn (faster, less chatty)
ollama run qwen3:8b "/no_think What's the capital of France?"

# OpenAI-compatible endpoint
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "qwen3:8b", "messages": [{"role": "user", "content": "Write a haiku about Ampere GPUs."}]}'
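
The same request can be made from Python with only the standard library. This is a minimal sketch, assuming the server is running locally and returns the usual OpenAI-style choices[0].message.content shape; the helper names build_payload and chat are illustrative, not part of Ollama.

```python
import json
import urllib.request

# Endpoint taken from the curl example above.
OLLAMA_URL = "http://localhost:11434/v1/chat/completions"


def build_payload(prompt: str, model: str = "qwen3:8b") -> dict:
    """Build the same JSON body the curl example sends."""
    return {"model": model, "messages": [{"role": "user", "content": prompt}]}


def chat(prompt: str, url: str = OLLAMA_URL, timeout: float = 120.0) -> str:
    """POST one user message and return the assistant's reply text."""
    body = json.dumps(build_payload(prompt)).encode("utf-8")
    req = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        data = json.load(resp)
    # Assumes an OpenAI-style response body.
    return data["choices"][0]["message"]["content"]
```

With the server up, chat("Write a haiku about Ampere GPUs.") should return the reply as a string. Pointing url at http://localhost:8080/v1/chat/completions should work the same way against llama-server.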

llama.cpp

# Server with OpenAI-compatible API on :8080
llama-server -hf unsloth/Qwen3-8B-GGUF:UD-Q4_K_XL --port 8080 -ngl 99 -fa

# Single-shot CLI
llama-cli -hf unsloth/Qwen3-8B-GGUF:UD-Q4_K_XL -p "Summarize what GQA does in two lines."

-ngl 99 offloads all layers to the GPU; -fa enables Flash Attention (which the RTX 3090's Ampere sm_86 supports natively — no special wheels needed).

Results

Generation speed: 115.3 tok/s @ 4K context (Q4_K, CUDA, -fa 1), measured on Hardware Corner's RTX 3090 LLM benchmark page. The same source publishes the full context ladder for this card:

Context	Token generation	Prompt processing
4K	115.3 tok/s	4,049.6 tok/s
16K	87.5 tok/s	2,572.5 tok/s
32K	67.9 tok/s	1,714.6 tok/s
64K	46.6 tok/s	1,014.3 tok/s
128K	28.1 tok/s	570.0 tok/s

VRAM usage: Q4_K_M weights load to roughly 6 GB at short context, which leaves about 18 GB of the 3090's 24 GB free for the use cases below. A 32K KV cache at fp16 (GQA, using the model's 8 KV heads) adds several more gigabytes, for a total of roughly 10 to 11 GB. See live benchmark data.

Quality notes: Qwen3 ships a hybrid thinking/non-thinking mode. The model card recommends Temperature=0.6, TopP=0.95, TopK=20, MinP=0 for thinking mode and Temperature=0.7, TopP=0.8, TopK=20, MinP=0 for non-thinking mode. Per the Qwen team's guidance, avoid greedy decoding in either mode. Use /no_think in a prompt (or enable_thinking=False in the chat template) to suppress the <think> trace when you just want a fast answer.
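
One way to keep the two sampling presets straight is to store them in one place and pick by mode. The sketch below builds a request body for Ollama's native /api/chat endpoint, on the assumption that its options object accepts temperature, top_p, top_k and min_p; the helper name ollama_chat_body is hypothetical.

```python
# Sampling presets quoted from the model-card recommendations above.
SAMPLING = {
    True: {"temperature": 0.6, "top_p": 0.95, "top_k": 20, "min_p": 0.0},   # thinking
    False: {"temperature": 0.7, "top_p": 0.8, "top_k": 20, "min_p": 0.0},   # non-thinking
}


def ollama_chat_body(prompt: str, thinking: bool = True, model: str = "qwen3:8b") -> dict:
    """Build an /api/chat body with the preset that matches the mode."""
    # The soft switch: prefix the turn with /no_think to skip the <think> trace.
    content = prompt if thinking else f"/no_think {prompt}"
    return {
        "model": model,
        "messages": [{"role": "user", "content": content}],
        "options": dict(SAMPLING[thinking]),  # copy so callers can tweak safely
        "stream": False,
    }
```

Keeping the preset and the /no_think prefix in one function avoids the common mistake of disabling thinking while still sampling with the thinking-mode temperature.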


For the full benchmark data and other GPUs in the catalogue, see /check/qwen3-8b/rtx-3090.
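
To compare your own card against the figures above, you can derive tok/s from the timing fields Ollama returns. This sketch assumes a non-streaming response from Ollama's native API that carries eval_count, eval_duration, prompt_eval_count and prompt_eval_duration, with durations in nanoseconds; the helper names are illustrative.

```python
def tokens_per_second(count: int, duration_ns: int) -> float:
    """Convert a token count and a nanosecond duration into tokens per second."""
    if duration_ns <= 0:
        raise ValueError("duration must be positive")
    return count / (duration_ns / 1e9)


def summarize(resp: dict) -> dict:
    """Extract generation and prompt-processing throughput from one response."""
    return {
        "generation_tok_s": tokens_per_second(resp["eval_count"], resp["eval_duration"]),
        "prompt_tok_s": tokens_per_second(
            resp["prompt_eval_count"], resp["prompt_eval_duration"]
        ),
    }
```

Single runs are noisy, so average a few requests at the same context length before comparing against a published number.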

Using the 18 GB of headroom — the real reason to run this on a 3090

Qwen3-8B at Q4_K_M leaves the 3090 with three concrete things you can't do on a 12 GB card:

Colocate a second model. Load a Whisper-large (~3 GB), a 7B embedding model, or a Stable Diffusion XL pipeline (~7 GB) alongside Qwen3-8B for full ASR → reason → speak or RAG → reason → image pipelines on one card. With Ollama, OLLAMA_MAX_LOADED_MODELS=2 lets two models stay resident in VRAM at once; OLLAMA_NUM_PARALLEL=2 additionally lets each loaded model serve two requests concurrently.
Run the 14B sibling instead, or both. Qwen3-14B at Q4_K_M is ~9 GB; you can keep the 8B loaded for fast turns and the 14B loaded for hard ones, and route each request to whichever fits. The combined footprint is ~17 GB, still inside the envelope.
Push context to the full 131K window. The Qwen3-8B model card lists a native context of 32,768 tokens with explicit YaRN scaling to 131,072 tokens. Hardware Corner's table above measures 28.1 tok/s at 128K, which is usable for long-document Q&A. Enable it in llama.cpp with -c 131072 --rope-scaling yarn --rope-scale 4 --yarn-orig-ctx 32768. A 12 GB card can't fit the 128K KV cache; the 3090 can, although an fp16 cache at that length takes most of the card, so you may need to quantize it (for example --cache-type-k q8_0 --cache-type-v q8_0) to stay inside 24 GB.
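
A rough KV cache estimate helps when planning how to spend the headroom. The sketch below assumes 36 transformer layers and a head dimension of 128, values taken from the published Qwen3-8B configuration rather than from this page, along with the 8 KV heads and fp16 cache mentioned above. It ignores compute buffers and any runtime overhead.

```python
def kv_cache_bytes(
    n_tokens: int,
    n_layers: int = 36,       # assumption: Qwen3-8B layer count
    n_kv_heads: int = 8,      # from the text above (GQA)
    head_dim: int = 128,      # assumption: Qwen3-8B head dimension
    bytes_per_elem: int = 2,  # fp16
) -> int:
    """Bytes needed to cache keys and values for n_tokens of context."""
    # The leading 2 accounts for storing both K and V per layer per token.
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * n_tokens


def gib(n_bytes: int) -> float:
    """Convert bytes to GiB."""
    return n_bytes / 2**30


if __name__ == "__main__":
    for ctx in (4096, 32768, 65536, 131072):
        print(f"{ctx:>7} tokens: {gib(kv_cache_bytes(ctx)):5.2f} GiB")
```

Under those assumptions the fp16 cache works out to 4.5 GiB at 32,768 tokens and 18 GiB at 131,072 tokens. Add the weights on top of that figure when checking whether a given context length and a second model will fit together.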

Troubleshooting

Generation feels slow under load or at long context

Confirm Flash Attention is enabled in your runtime:

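Ollama: if your version does not enable Flash Attention by default, set OLLAMA_FLASH_ATTENTION=1 before starting the server, and check that nvidia-smi shows the GPU busy during inference.
llama.cpp: pass -fa on the command line. Recent builds expect an explicit value such as -fa on, so check llama-server --help for the form your build accepts.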

The 3090's Ampere sm_86 has full FA2 support, so you will not hit the wheel-availability problems that newer Blackwell cards can run into. If you're under-utilising the GPU, the bottleneck is almost always CPU-side tokenisation or a missing -ngl offload flag.

Bloated reasoning traces eat your context budget

Qwen3's thinking mode emits a <think>...</think> block before the answer. For chat that doesn't need reasoning, suppress it with /no_think per turn (or set enable_thinking=False in the chat template). The Qwen team's model card documents both the hard switch (enable_thinking flag) and the soft per-turn switch (/think, /no_think tags).
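
If you keep thinking mode on but don't want the trace stored in chat history (where it would consume context on later turns), you can strip it client-side. This is a minimal sketch that assumes the trace arrives inline as a <think>...</think> block in the reply text; strip_think is a hypothetical helper name.

```python
import re

# Non-greedy match so several <think> blocks are each removed separately.
_THINK_RE = re.compile(r"<think>.*?</think>\s*", re.DOTALL)


def strip_think(text: str) -> str:
    """Remove <think>...</think> blocks and return only the visible answer."""
    return _THINK_RE.sub("", text).strip()
```

Apply it to the assistant message before appending it to the conversation you send back on the next turn.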

Pushing past the 32K native context

Native context is 32,768 tokens. For anything longer, enable YaRN explicitly and request the larger window with -c. In llama.cpp:

llama-server -hf unsloth/Qwen3-8B-GGUF:UD-Q4_K_XL \
  -c 131072 --rope-scaling yarn --rope-scale 4 --yarn-orig-ctx 32768 -fa

Expect ~28 tok/s at the full 128K extent per Hardware Corner's table.
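
The scale factor is just the target window divided by the native one, rounded up. The sketch below derives the flags for an arbitrary target; it uses only the llama.cpp options named above plus -c for context size, and the helper name yarn_args is hypothetical.

```python
import math


def yarn_args(target_ctx: int, native_ctx: int = 32768) -> list[str]:
    """Return llama-server arguments for a target context length."""
    if target_ctx <= native_ctx:
        # Inside the native window: no RoPE scaling needed.
        return ["-c", str(target_ctx)]
    # Smallest whole-number factor that covers the target window.
    scale = math.ceil(target_ctx / native_ctx)
    return [
        "-c", str(target_ctx),
        "--rope-scaling", "yarn",
        "--rope-scale", str(scale),
        "--yarn-orig-ctx", str(native_ctx),
    ]


if __name__ == "__main__":
    print(" ".join(yarn_args(131072)))
```

For a 64K target this yields a factor of 2 rather than 4. Using the smallest factor that covers your documents is a reasonable default, on the assumption that unnecessary scaling can cost some quality on short inputs.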

Driver / CUDA errors on load

Update to a recent NVIDIA driver with CUDA 12.x. The 3090 is Ampere sm_86 and has been mainline-supported by every recent driver branch; if you hit CUDA out of memory on a 6 GB-class model, the most common cause is another process holding VRAM — nvidia-smi lists culprits.
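
To find which process is holding VRAM without reading the full nvidia-smi table, you can query the per-process list and sort it. This sketch assumes nvidia-smi is on PATH and supports the --query-compute-apps fields pid, process_name and used_memory; the function names are illustrative.

```python
import subprocess

QUERY = [
    "nvidia-smi",
    "--query-compute-apps=pid,process_name,used_memory",
    "--format=csv,noheader,nounits",
]


def parse_compute_apps(csv_text: str) -> list[tuple[int, str, int]]:
    """Parse 'pid, name, MiB' rows and sort by memory use, largest first."""
    rows = []
    for line in csv_text.strip().splitlines():
        if line.count(",") < 2:
            continue
        # Split from both ends so a comma inside the process name is tolerated.
        pid, rest = line.split(",", 1)
        name, mib = rest.rsplit(",", 1)
        pid, name, mib = pid.strip(), name.strip(), mib.strip()
        if not (pid.isdigit() and mib.isdigit()):
            continue  # skip rows where a field is reported as unavailable
        rows.append((int(pid), name, int(mib)))
    return sorted(rows, key=lambda r: r[2], reverse=True)


def vram_holders() -> list[tuple[int, str, int]]:
    """Run nvidia-smi and return (pid, process name, MiB used) per process."""
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
    return parse_compute_apps(out)
```

Calling vram_holders() on the machine with the GPU returns the heaviest process first, which is usually the one to stop before reloading the model.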

Running larger Qwen3 siblings

ollama run qwen3:14b (~9 GB Q4_K_M) and ollama run qwen3:32b (~19 GB Q4_K_M) both fit the 3090. The 32B is tight — leave headroom for KV cache by reducing context or pulling a smaller quant.