
Qwen3-8B on RTX 3090: Q4_K_M GGUF with 18 GB of Headroom for Colocation or Long Context

llm · beginner · 6GB+ VRAM · May 22, 2026
prerequisites
  • NVIDIA RTX 3090 (24 GB VRAM) or any Ampere/Ada CUDA card with ≥ 6 GB free
  • NVIDIA driver with CUDA 12.x support
  • ~6 GB free disk for the Q4_K_M GGUF (or ~17 GB for BF16)
  • Ollama, llama.cpp, or LM Studio installed

What You'll Build

A local Qwen3-8B chat and reasoning assistant on an RTX 3090, served through Ollama or llama.cpp at Q4_K_M quantization. The weights are only ~5 GB on disk, so the model uses roughly 6 GB of the 3090's 24 GB envelope — leaving 18 GB free to colocate a second model, extend context to the full 131K window via YaRN, or keep a larger sibling (Qwen3-14B Q4_K_M, ~9 GB) loaded alongside.

Hardware data: RTX 3090 (24 GB VRAM) · 115.3 tok/s generation @ 4K context (Q4_K, CUDA, FA on) · See benchmark data

ℹ️ Wildly over-provisioned by design. Qwen3-8B Q4_K_M needs ~6 GB; the 3090 has 24 GB. The recipe below is the same install on any ≥ 6 GB CUDA card — but the "Headroom" section lower down is what makes a 3090 worth using over a 4060 or 3060 12 GB for this model. If you only want chat, a 12 GB card is fine; the 3090 starts paying for itself when you reach for the spare 18 GB.

Requirements

| Component | Minimum | Tested |
|---|---|---|
| GPU | 6 GB VRAM (Ampere or newer recommended) | RTX 3090 (24 GB) |
| RAM | 16 GB system RAM | |
| Storage | ~6 GB for Q4_K_M GGUF | ~17 GB if you also pull BF16 |
| Driver | NVIDIA driver with CUDA 12.x | |
| Software | Ollama, llama.cpp, or LM Studio | |

Installation

Pick one of the three runtimes below. The Qwen3 model card lists all three as supported for local use, and each pulls a GGUF build derived from the canonical Qwen/Qwen3-8B weights (Option B uses Unsloth's community conversion).

Option A — Ollama (simplest)

curl -fsSL https://ollama.com/install.sh | sh
ollama pull qwen3:8b

qwen3:8b on the Ollama library is the Q4_K_M build, ~5.2 GB on disk.

Option B — llama.cpp (most control)

# macOS / Homebrew
brew install llama.cpp

# Linux: build from source with CUDA
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp && cmake -B build -DGGML_CUDA=ON && cmake --build build -j --config Release

# Pull the GGUF directly from Hugging Face and serve
# (a source build puts the binary at ./build/bin/llama-server)
llama-server -hf unsloth/Qwen3-8B-GGUF:UD-Q4_K_XL --port 8080

unsloth/Qwen3-8B-GGUF ships a per-tier table: Q4_K_M = 5.03 GB, UD-Q4_K_XL = 5.14 GB (Unsloth's accuracy-tuned 4-bit), Q5_K_M = 5.85 GB, Q6_K = 6.73 GB, Q8_0 = 8.71 GB, BF16 = 16.4 GB. The 3090 fits any of them; BF16 leaves ~7 GB of working VRAM (enough for chat, tight for long context).
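
As a rough guide to how much room each tier leaves, the sketch below subtracts the file sizes quoted above from a 24 GB budget. The helper name quant_headroom is hypothetical, and the result is an upper bound: KV cache and runtime buffers are not subtracted, and file sizes are treated as if they equalled loaded VRAM.

```shell
#!/usr/bin/env bash
# Rough VRAM left on a card after loading each quant tier.
# Sizes (GB) are the ones quoted in the text above; KV cache and
# runtime buffers are NOT subtracted, so real headroom is smaller.
quant_headroom() {
  local total_gb="${1:-24}"
  awk -v total="$total_gb" \
    '{ printf "%-10s %5.2f GB weights  %5.2f GB left\n", $1, $2, total - $2 }' <<'EOF'
Q4_K_M 5.03
UD-Q4_K_XL 5.14
Q5_K_M 5.85
Q6_K 6.73
Q8_0 8.71
BF16 16.4
EOF
}

quant_headroom 24
```

Pass a different total (for example quant_headroom 12) to see which tiers remain practical on a smaller card.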

Option C — LM Studio (GUI)

Install LM Studio, search for qwen3-8b, and download a GGUF build (defaults to Q4_K_M). Useful if you want a chat UI without writing a server config.

Running

Ollama

ollama run qwen3:8b "Explain GQA attention in three sentences."

# Disable Qwen3's reasoning trace per turn (faster, less chatty)
ollama run qwen3:8b "/no_think What's the capital of France?"

# OpenAI-compatible endpoint
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "qwen3:8b", "messages": [{"role": "user", "content": "Write a haiku about Ampere GPUs."}]}'

llama.cpp

# Server with OpenAI-compatible API on :8080
llama-server -hf unsloth/Qwen3-8B-GGUF:UD-Q4_K_XL --port 8080 -ngl 99 -fa

# Single-shot CLI
llama-cli -hf unsloth/Qwen3-8B-GGUF:UD-Q4_K_XL -p "Summarize what GQA does in two lines."

-ngl 99 offloads all layers to the GPU; -fa enables Flash Attention, which the RTX 3090's Ampere sm_86 supports natively, so no special wheels are needed. If your llama.cpp build expects a value for that flag, write it as -fa on.

Results

  • Generation speed: 115.3 tok/s @ 4K context (Q4_K, CUDA, -fa 1), measured on Hardware Corner's RTX 3090 LLM benchmark page. The same source publishes the full context ladder for this card:

    | Context | Token generation | Prompt processing |
    |---|---|---|
    | 4K | 115.3 tok/s | 4,049.6 tok/s |
    | 16K | 87.5 tok/s | 2,572.5 tok/s |
    | 32K | 67.9 tok/s | 1,714.6 tok/s |
    | 64K | 46.6 tok/s | 1,014.3 tok/s |
    | 128K | 28.1 tok/s | 570.0 tok/s |
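  • VRAM usage: Q4_K_M weights load to roughly 6 GB at chat-sized context. A full 32K fp16 KV cache adds about 4.5 GB on top of the ~5 GB of weights (GQA keeps it small, since the model has only 8 KV heads; the figure assumes the 36-layer, 128-dimension-head layout in the model config), so expect roughly 10 GB in total at 32K. At chat-sized context the 3090's 24 GB envelope leaves about 18 GB free for the use cases below. See live benchmark data.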

  • Quality notes: Qwen3 ships a hybrid thinking/non-thinking mode. The model card recommends Temperature=0.6, TopP=0.95, TopK=20, MinP=0 for thinking mode and Temperature=0.7, TopP=0.8, TopK=20, MinP=0 for non-thinking — avoid greedy decoding either way per the Qwen team's guidance. Use /no_think in a prompt (or enable_thinking=False in the chat template) to suppress the <think> trace when you just want a fast answer.
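
The sampling settings in the quality notes can be sent per request. The sketch below builds a request body for Ollama's native /api/chat endpoint, which accepts temperature, top_p, top_k, and min_p under "options". The helper names qwen3_options and qwen3_chat_body are hypothetical, and the prompt is inserted verbatim, so it must not contain double quotes or backslashes.

```shell
#!/usr/bin/env bash
# Print the model-card sampling settings as an Ollama "options" object.
# qwen3_options is a hypothetical helper, not part of Ollama.
qwen3_options() {
  case "$1" in
    thinking)     printf '{"temperature":0.6,"top_p":0.95,"top_k":20,"min_p":0}\n' ;;
    non-thinking) printf '{"temperature":0.7,"top_p":0.8,"top_k":20,"min_p":0}\n' ;;
    *) echo "usage: qwen3_options thinking|non-thinking" >&2; return 1 ;;
  esac
}

# Build a JSON body for POST /api/chat.
# Assumption: the prompt has no double quotes or backslashes (no escaping is done).
qwen3_chat_body() {
  local mode="$1" prompt="$2" options
  options="$(qwen3_options "$mode")" || return 1
  printf '{"model":"qwen3:8b","stream":false,"messages":[{"role":"user","content":"%s"}],"options":%s}\n' \
    "$prompt" "$options"
}

# Print a non-thinking request body; nothing is sent to a server here.
qwen3_chat_body non-thinking "/no_think What is the capital of France?"
```

To send it, pipe the output into curl, for example qwen3_chat_body thinking "Explain GQA briefly." | curl -s http://localhost:11434/api/chat -d @- (this assumes Ollama is listening on its default port).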

For the full benchmark data and other GPUs in the catalogue, see /check/qwen3-8b/rtx-3090.

Using the 18 GB of headroom — the real reason to run this on a 3090

Qwen3-8B at Q4_K_M leaves the 3090 with three concrete things you can't do on a 12 GB card:

  1. Colocate a second model. Load a Whisper-large (~3 GB), a 7B embedding model, or a Stable Diffusion XL pipeline (~7 GB) alongside Qwen3-8B for full ASR → reason → speak or RAG → reason → image pipelines on one card. For models served by Ollama, OLLAMA_MAX_LOADED_MODELS=2 lets two models stay resident in VRAM at once; OLLAMA_NUM_PARALLEL=2 additionally allows two concurrent requests per model, at the cost of a larger context allocation.
  2. Run the 14B sibling instead, or both. Qwen3-14B at Q4_K_M is ~9 GB, so you can keep the 8B loaded for fast turns and the 14B loaded for hard ones, and route each request to whichever fits. The combined footprint is ~17 GB, still inside the envelope.
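  3. Push context to the full 131K window. The Qwen3-8B model card lists a native context of 32,768 tokens with explicit YaRN scaling to 131,072 tokens; Hardware Corner's table above measures 28.1 tok/s at 128K, which is usable for long-document Q&A. Enable it in llama.cpp with -c 131072 --rope-scaling yarn --rope-scale 4 --yarn-orig-ctx 32768 (without -c the server stays at its default context size). A 12 GB card can't fit the 128K KV cache; the 3090 can, though at fp16 that cache takes most of the spare 18 GB, so treat this option as an alternative to colocation rather than an addition.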
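
To see how the long-context option competes with the other two for the same headroom, the sketch below estimates the KV-cache size at a given context length. The architecture values (36 layers, 8 KV heads, head dimension 128) are assumptions read from the Qwen3-8B config rather than from this recipe, and kv_cache_mib is a hypothetical helper name. Compute buffers are not included.

```shell
#!/usr/bin/env bash
# Estimate KV-cache size in MiB for a given context length.
# Assumed Qwen3-8B layout: 36 layers, 8 KV heads, head dimension 128.
kv_cache_mib() {
  local ctx="$1" bytes_per_elem="${2:-2}"   # 2 bytes per element = fp16 cache
  local layers=36 kv_heads=8 head_dim=128
  # Factor 2 covers keys and values.
  local per_token=$(( 2 * layers * kv_heads * head_dim * bytes_per_elem ))
  echo $(( ctx * per_token / 1024 / 1024 ))
}

for ctx in 4096 32768 131072; do
  echo "ctx=$ctx  kv_cache=$(kv_cache_mib "$ctx") MiB"
done
```

Under these assumptions the fp16 cache is about 4.5 GiB at 32K and about 18 GiB at 128K, which is why the full window and a colocated second model do not fit together. llama.cpp's --cache-type-k q8_0 and --cache-type-v q8_0 options cut the cache to roughly half, at some cost in accuracy.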

Troubleshooting

Generation feels slow under load or at long context

Confirm Flash Attention is enabled in your runtime:

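  • Ollama: recent releases enable it on supported GPUs; if yours does not, set OLLAMA_FLASH_ATTENTION=1 before starting the server, then check that nvidia-smi shows the GPU busy during inference.
  • llama.cpp: pass -fa on the command line (-fa 1 in llama-bench, or -fa on in builds that expect a value).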

The 3090's Ampere sm_86 has full FA2 support, so this is not the kind of wheel-availability problem seen on newer Blackwell cards. If the GPU is under-utilised, the bottleneck is almost always CPU-side tokenisation or a missing -ngl offload flag.

Bloated reasoning traces eat your context budget

Qwen3's thinking mode emits a <think>...</think> block before the answer. For chat that doesn't need reasoning, suppress it with /no_think per turn (or set enable_thinking=False in the chat template). The Qwen team's model card documents both the hard switch (enable_thinking flag) and the soft per-turn switch (/think, /no_think tags).
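
When thinking mode stays on, a common complement is to drop the trace client-side before storing a turn in the conversation history, so it does not count against later context. The filter below is a minimal sketch: strip_think is a hypothetical helper, and it assumes the opening and closing tags each sit on their own line, which is how the trace usually appears.

```shell
#!/usr/bin/env bash
# Remove a <think>...</think> block from text on stdin.
# Assumption: each tag is on its own line; a tag sharing a line
# with answer text would cause that whole line to be dropped.
strip_think() {
  awk '
    /<think>/   { skipping = 1 }   # start of the trace
    !skipping   { print }          # pass through everything outside it
    /<\/think>/ { skipping = 0 }   # end of the trace
  '
}

# Sample model output with a trace followed by the answer.
printf '<think>\nThe user asks for a capital city.\n</think>\n\nParis.\n' | strip_think
```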

Pushing past the 32K native context

Native context is 32,768 tokens. For anything longer, enable YaRN explicitly and raise the context size to match. In llama.cpp:

llama-server -hf unsloth/Qwen3-8B-GGUF:UD-Q4_K_XL \
  -c 131072 -ngl 99 \
  --rope-scaling yarn --rope-scale 4 --yarn-orig-ctx 32768 -fa

Expect ~28 tok/s at the full 128K extent per Hardware Corner's table.

Driver / CUDA errors on load

Update to a recent NVIDIA driver with CUDA 12.x. The 3090 is Ampere sm_86 and has been mainline-supported by every recent driver branch; if you hit CUDA out of memory on a 6 GB-class model, the most common cause is another process holding VRAM — nvidia-smi lists culprits.
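
To check free VRAM before loading, the sketch below parses the "used, total" output of nvidia-smi --query-gpu=memory.used,memory.total --format=csv,noheader,nounits and flags any GPU with less than a chosen amount free. The helper name vram_check and the 6144 MiB default are illustrative choices, and the sample input stands in for real nvidia-smi output.

```shell
#!/usr/bin/env bash
# Read "used, total" lines (MiB) on stdin and report free VRAM per GPU.
# vram_check is a hypothetical helper; GPUs are numbered in input order.
vram_check() {
  local need_mib="${1:-6144}"   # roughly what Qwen3-8B Q4_K_M needs
  awk -F', *' -v need="$need_mib" '{
    free = $2 - $1
    printf "GPU %d: %d MiB free (%s)\n", NR - 1, free, (free >= need ? "ok" : "too little")
  }'
}

# Sample line in the shape nvidia-smi prints: 20 GiB already in use.
printf '20480, 24576\n' | vram_check
```

On a real machine, replace the sample with nvidia-smi --query-gpu=memory.used,memory.total --format=csv,noheader,nounits | vram_check, and use nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv to see which processes hold the memory.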

Running larger Qwen3 siblings

ollama run qwen3:14b (~9 GB Q4_K_M) and ollama run qwen3:32b (~19 GB Q4_K_M) both fit the 3090. The 32B is tight — leave headroom for KV cache by reducing context or pulling a smaller quant.