
Qwen3-8B on RTX 3090 Ti: Q4_K_M GGUF with 18 GB of Headroom for Colocation or Long Context

llm · beginner · 6GB+ VRAM · May 28, 2026
prerequisites
  • NVIDIA RTX 3090 Ti (24 GB VRAM) or any Ampere/Ada CUDA card with at least 6 GB free
  • NVIDIA driver with CUDA 12.x support
  • ~6 GB free disk for the Q4_K_M GGUF (or ~17 GB for BF16)
  • Ollama, llama.cpp, or LM Studio installed

What You'll Build

A local Qwen3-8B chat and reasoning assistant on an RTX 3090 Ti, served through Ollama or llama.cpp at Q4_K_M quantization. The weights are only ~5 GB on disk, so the model uses roughly 6 GB of the 3090 Ti's 24 GB envelope — leaving 18 GB free to colocate a second model, extend context to the full 131K window via YaRN, or keep a larger sibling (Qwen3-14B Q4_K_M, ~9 GB) loaded alongside.

Hardware data: RTX 3090 Ti (24 GB VRAM) · 123.7 tok/s generation @ 4K context (Q4_K, CUDA, FA on) · See benchmark data

ℹ️ Wildly over-provisioned by design. Qwen3-8B Q4_K_M needs ~6 GB; the 3090 Ti has 24 GB. The recipe below is the same install on any ≥ 6 GB CUDA card, but the "Headroom" section lower down is what makes a 3090 Ti worth using over a 4060 or 3060 12 GB for this model. If you only want chat, a 12 GB card is fine; the 3090 Ti starts paying for itself when you reach for the spare 18 GB. Hardware Corner notes that the 3090 Ti's performance delta over the base 3090 is often negligible for LLM workloads (about 7% faster on this model), so if you're shopping, the standard 3090 is the value pick. If you already own the Ti, you get all the headroom plus a small bandwidth bump.

Requirements

Component | Minimum                                 | Tested
GPU       | 6 GB VRAM (Ampere or newer recommended) | RTX 3090 Ti (24 GB)
RAM       | 16 GB system RAM                        |
Storage   | ~6 GB for Q4_K_M GGUF                   | ~17 GB if you also pull BF16
Driver    | NVIDIA driver with CUDA 12.x            |
Software  | Ollama, llama.cpp, or LM Studio         |

Installation

Pick one of the three runtimes below. All three are first-party-supported by the Qwen team and pull from the canonical Qwen/Qwen3-8B weights (or an officially-tracked GGUF mirror).

Option A — Ollama (simplest)

curl -fsSL https://ollama.com/install.sh | sh
ollama pull qwen3:8b

qwen3:8b on the Ollama library is the Q4_K_M build, ~5.2 GB on disk.

Option B — llama.cpp (most control)

# macOS / Homebrew
brew install llama.cpp

# Linux: build from source with CUDA
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp && cmake -B build -DGGML_CUDA=ON && cmake --build build -j --config Release

# Pull the GGUF directly from Hugging Face and serve
# (after a source build the binary is ./build/bin/llama-server unless you add it to PATH)
llama-server -hf unsloth/Qwen3-8B-GGUF:UD-Q4_K_XL --port 8080

unsloth/Qwen3-8B-GGUF ships a per-tier ladder (sizes from the HF tree API): Q4_K_M = 5.03 GB, UD-Q4_K_XL = 5.14 GB (Unsloth's accuracy-tuned 4-bit), Q5_K_M = 5.85 GB, Q6_K = 6.73 GB, Q8_0 = 8.71 GB, BF16 = 16.4 GB. The 3090 Ti fits any of them; BF16 still leaves ~7 GB of working VRAM (enough for chat, tight for long context).
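The headroom arithmetic behind that ladder can be sketched in a few lines of Python. This is a rough check only: it uses the file sizes quoted above, treats the card as a flat 24 GB, and ignores KV cache and runtime buffers. The 1 GB reserve in `quants_that_fit` is an assumed safety margin, not a measured figure.

```python
# File sizes (GB) as quoted above for unsloth/Qwen3-8B-GGUF.
QUANT_SIZES_GB = {
    "Q4_K_M": 5.03,
    "UD-Q4_K_XL": 5.14,
    "Q5_K_M": 5.85,
    "Q6_K": 6.73,
    "Q8_0": 8.71,
    "BF16": 16.4,
}


def remaining_vram_gb(weights_gb: float, card_gb: float = 24.0) -> float:
    """VRAM left after loading weights, ignoring KV cache and runtime buffers."""
    return card_gb - weights_gb


def quants_that_fit(card_gb: float, reserve_gb: float = 1.0) -> list[str]:
    """Quant tiers whose weights fit with reserve_gb to spare (assumed margin)."""
    return [
        name
        for name, size in QUANT_SIZES_GB.items()
        if size + reserve_gb <= card_gb
    ]


if __name__ == "__main__":
    for name, size in QUANT_SIZES_GB.items():
        left = remaining_vram_gb(size)
        print(f"{name:<11} {size:>5.2f} GB weights -> {left:>5.2f} GB left of 24 GB")
```

Calling `quants_that_fit(8.0)` gives a quick answer for a smaller card; on the 3090 Ti every tier passes, and only BF16 leaves the remaining budget in single digits.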

Option C — LM Studio (GUI)

Install LM Studio, search for qwen3-8b, and download a GGUF build (defaults to Q4_K_M). Useful if you want a chat UI without writing a server config.

Running

Ollama

ollama run qwen3:8b "Explain GQA attention in three sentences."

# Disable Qwen3's reasoning trace per turn (faster, less chatty)
ollama run qwen3:8b "/no_think What's the capital of France?"

# OpenAI-compatible endpoint
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "qwen3:8b", "messages": [{"role": "user", "content": "Write a haiku about Ampere GPUs."}]}'

llama.cpp

# Server with OpenAI-compatible API on :8080
llama-server -hf unsloth/Qwen3-8B-GGUF:UD-Q4_K_XL --port 8080 -ngl 99 -fa

# Single-shot CLI
llama-cli -hf unsloth/Qwen3-8B-GGUF:UD-Q4_K_XL -p "Summarize what GQA does in two lines."

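-ngl 99 offloads all layers to the GPU; -fa enables Flash Attention, which the RTX 3090 Ti's Ampere sm_86 supports natively, so no special build is needed. Recent llama.cpp builds expect a value after the flag (-fa on or -fa 1) and default to auto; if the bare flag is rejected, check llama-server --help for the form your build accepts.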

Results

  • Generation speed: 123.7 tok/s @ 4K context (Q4_K, CUDA, -fa 1), measured on Hardware Corner's RTX 3090 Ti LLM benchmark page. The same source publishes the full context ladder for this card, and the numbers are mirrored into the backend as benchmarks #129 (prefill) and #130 (generation):

    Context | Token generation | Prompt processing
    4K      | 123.7 tok/s      | 4,467.9 tok/s
    16K     | 93.6 tok/s       | 2,834.2 tok/s
    32K     | 71.8 tok/s       | 1,868.0 tok/s
    64K     | 49.0 tok/s       | 1,111.1 tok/s
    128K    | 29.0 tok/s       | 612.4 tok/s

    For reference, the standard RTX 3090's 4K token-gen on the same source is 115.3 tok/s — the Ti is ~7% faster on this workload, which tracks its 1,008 GB/s vs 936 GB/s memory bandwidth (LLM token generation is memory-bandwidth-bound).

  • VRAM usage: Q4_K_M weights load to roughly 6 GB. A 32K fp16 KV cache adds about 4.5 GiB (36 layers × 8 KV heads × 128-dimensional heads, keys plus values), so expect roughly 10 to 11 GB in total at 32K. At short context that leaves about 18 GB of the 3090 Ti's 24 GB free for the use cases below. See live benchmark data.

  • Quality notes: Qwen3 ships a hybrid thinking/non-thinking mode. The model card recommends Temperature=0.6, TopP=0.95, TopK=20, MinP=0 for thinking mode and Temperature=0.7, TopP=0.8, TopK=20, MinP=0 for non-thinking — and "DO NOT use greedy decoding, as it can lead to performance degradation and endless repetitions" per the Qwen team's verbatim guidance. Use /no_think in a prompt (or enable_thinking=False in the chat template) to suppress the <think> trace when you just want a fast answer.
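The two sampling presets above can be wired into a request body as follows. This is a minimal sketch using only the standard library. The helper names `build_payload` and `chat` are hypothetical, and the base URL assumes the Ollama endpoint from the Running section (swap in http://localhost:8080/v1 for llama-server). Note that `top_k` and `min_p` are not standard OpenAI fields; whether a given server honours them on this endpoint is an assumption to verify.

```python
import json
import urllib.request

# Presets quoted from the model-card guidance above.
SAMPLING = {
    True: {"temperature": 0.6, "top_p": 0.95, "top_k": 20, "min_p": 0.0},   # thinking
    False: {"temperature": 0.7, "top_p": 0.8, "top_k": 20, "min_p": 0.0},   # non-thinking
}


def build_payload(prompt: str, thinking: bool = True, model: str = "qwen3:8b") -> dict:
    """Build a chat-completions body with the preset for the chosen mode."""
    # Soft switch: a /no_think tag in the user turn suppresses the <think> trace.
    content = prompt if thinking else f"/no_think {prompt}"
    return {
        "model": model,
        "messages": [{"role": "user", "content": content}],
        **SAMPLING[thinking],
    }


def chat(prompt: str, thinking: bool = True,
         base_url: str = "http://localhost:11434/v1") -> str:
    """POST the payload to a local OpenAI-compatible server and return the reply."""
    req = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(build_payload(prompt, thinking)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

With a server running, `chat("What's the capital of France?", thinking=False)` sends the non-thinking preset. Neither preset uses temperature 0, in line with the warning against greedy decoding.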

For the full benchmark data and other GPUs in the catalogue, see /check/qwen3-8b/rtx-3090-ti.
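The bandwidth argument in the Results can be sanity-checked with simple arithmetic. The sketch below compares the measured Ti-over-3090 speedup with the bandwidth ratio, then computes a crude ceiling for token generation by assuming every generated token streams all weights from VRAM once. Using the 5.03 GB Q4_K_M file size as the benchmarked model's footprint is an assumption; the benchmark's exact Q4_K file may differ slightly.

```python
def speedup_pct(new: float, old: float) -> float:
    """Relative gain of new over old, in percent."""
    return (new / old - 1.0) * 100.0


def bandwidth_bound_tok_s(bandwidth_gb_s: float, weights_gb: float) -> float:
    """Upper bound on tok/s if each token reads all weights once from VRAM."""
    return bandwidth_gb_s / weights_gb


measured = speedup_pct(123.7, 115.3)          # tok/s at 4K: 3090 Ti vs 3090
bandwidth = speedup_pct(1008.0, 936.0)        # GB/s: 3090 Ti vs 3090
ceiling = bandwidth_bound_tok_s(1008.0, 5.03) # assumed 5.03 GB of weights

if __name__ == "__main__":
    print(f"measured speedup:  {measured:.1f}%")
    print(f"bandwidth ratio:   {bandwidth:.1f}%")
    print(f"bandwidth ceiling: {ceiling:.0f} tok/s")
    print(f"efficiency at 4K:  {123.7 / ceiling:.0%}")
```

The two percentages land within half a point of each other, which is what a memory-bandwidth-bound workload would predict. The gap between the ceiling and the measured 123.7 tok/s reflects KV-cache reads and kernel overheads that this model ignores.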

Using the 18 GB of headroom — the real reason to run this on a 3090 Ti

Qwen3-8B at Q4_K_M leaves the 3090 Ti with three concrete things you can't do on a 12 GB card:

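  1. Colocate a second model. Load a Whisper-large (~3 GB), a 7B embedding model, or a Stable Diffusion XL pipeline (~7 GB) alongside Qwen3-8B for full ASR → reason → speak or RAG → reason → image pipelines on one card. For models served by Ollama itself, setting OLLAMA_MAX_LOADED_MODELS=2 on the server lets two models stay resident in VRAM, and OLLAMA_NUM_PARALLEL=2 allows two concurrent requests per model. Whisper and Stable Diffusion run in their own runtimes next to Ollama and simply share the card.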
  2. Run the 14B sibling instead, or both. Qwen3-14B at Q4_K_M is ~9 GB, so you can keep the 8B loaded for fast turns and the 14B loaded for hard ones, and route each request to whichever fits. The combined footprint is ~17 GB, still inside the envelope. Hardware Corner measures 76.2 tok/s for Qwen3-14B Q4_K @ 4K on the 3090 Ti, which is comfortable for interactive use.
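  3. Push context to the full 131K window. The Qwen3-8B model card lists a native context of 32,768 tokens with explicit YaRN scaling to 131,072 tokens. Hardware Corner's table above measures 29.0 tok/s at 128K, which is usable for long-document Q&A. Enable it in llama.cpp with -c 131072 --rope-scaling yarn --rope-scale 4 --yarn-orig-ctx 32768. Budget for it, though: an fp16 KV cache at 131,072 tokens is about 18 GiB for this model, so the long window consumes essentially all of the headroom by itself, and quantising the cache (--cache-type-k q8_0 --cache-type-v q8_0) roughly halves that. A 12 GB card can't fit the 128K KV cache; the 3090 Ti can.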
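To budget the headroom across these three options, it helps to estimate the KV cache size for a given context length. The sketch below assumes Qwen3-8B's published configuration of 36 layers, 8 KV heads, and a head dimension of 128; the layer count and head dimension are not stated in this recipe, so verify them against the model's config.json. It also treats q8_0 as 34 bytes per block of 32 values, and it ignores compute buffers.

```python
GIB = 1024 ** 3


def kv_cache_bytes(n_tokens: int, n_layers: int = 36, n_kv_heads: int = 8,
                   head_dim: int = 128, bytes_per_elem: float = 2.0) -> float:
    """Bytes needed to cache keys and values for n_tokens.

    Defaults are the assumed Qwen3-8B shape with an fp16 cache.
    """
    # Factor 2: one tensor for keys, one for values, per layer.
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * n_tokens


Q8_0_BYTES_PER_ELEM = 34 / 32  # 32 int8 values plus one fp16 scale per block

if __name__ == "__main__":
    for ctx in (4096, 32768, 131072):
        fp16 = kv_cache_bytes(ctx) / GIB
        q8 = kv_cache_bytes(ctx, bytes_per_elem=Q8_0_BYTES_PER_ELEM) / GIB
        print(f"{ctx:>7} tokens: fp16 {fp16:6.2f} GiB, q8_0 {q8:6.2f} GiB")
```

Adding the result to the ~6 GB weight footprint, plus whatever the colocated model needs, gives a quick feasibility check before committing to a context size.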

Troubleshooting

Generation feels slow under load or at long context

Confirm Flash Attention is enabled in your runtime:

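  • Ollama: enabled by default on Ampere; if your version does not turn it on, set OLLAMA_FLASH_ATTENTION=1 on the server. Confirm with ollama ps that the model sits fully on the GPU, and with nvidia-smi that the GPU is busy during inference.
  • llama.cpp: pass -fa on older builds, or -fa 1 (equivalently -fa on) on recent builds, where the flag takes a value.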

The 3090 Ti's Ampere sm_86 has full FA2 support, so this is not a Blackwell-class kernel-availability issue. (sm_120 cards need cu128 builds and sm_120 FA2 kernels; neither applies here, and for PyTorch-based runtimes the default pip install torch already ships sm_86 kernels.) If the GPU is under-utilised, the bottleneck is almost always CPU-side tokenisation or a missing -ngl offload flag.

Bloated reasoning traces eat your context budget

Qwen3's thinking mode emits a <think>...</think> block before the answer. For chat that doesn't need reasoning, suppress it with /no_think per turn (or set enable_thinking=False in the chat template). The Qwen team's model card documents both the hard switch (enable_thinking flag) and the soft per-turn switch (/think, /no_think tags).
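When thinking mode stays on, the traces can still be kept out of the context budget by stripping them from earlier assistant turns before resending the conversation. The sketch below does this with a regular expression. The helper names are hypothetical, and it assumes each trace arrives inline in the message content as a well-formed <think>...</think> block; some servers return the trace in a separate field instead, in which case nothing needs stripping.

```python
import re

# Non-greedy match so several traces in one message are removed separately.
_THINK_BLOCK = re.compile(r"<think>.*?</think>\s*", flags=re.DOTALL)


def strip_think(text: str) -> str:
    """Remove <think>...</think> blocks and surrounding whitespace."""
    return _THINK_BLOCK.sub("", text).strip()


def compact_history(messages: list[dict]) -> list[dict]:
    """Copy a chat history with reasoning traces removed from assistant turns."""
    return [
        {**m, "content": strip_think(m["content"])} if m["role"] == "assistant" else m
        for m in messages
    ]


if __name__ == "__main__":
    history = [
        {"role": "user", "content": "What's the capital of France?"},
        {"role": "assistant", "content": "<think>\nEasy one.\n</think>\n\nParis."},
    ]
    print(compact_history(history)[1]["content"])
```

User and system turns pass through untouched, so a literal "<think>" typed by a user is not rewritten.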

Pushing past the 32K native context

Native context is 32,768 tokens. For anything longer, enable YaRN explicitly and raise the context size to match. In llama.cpp:

llama-server -hf unsloth/Qwen3-8B-GGUF:UD-Q4_K_XL \
  -c 131072 -ngl 99 \
  --rope-scaling yarn --rope-scale 4 --yarn-orig-ctx 32768 -fa

Expect ~29 tok/s at the full 128K extent per Hardware Corner's table.

Driver / CUDA errors on load

Update to a recent NVIDIA driver with CUDA 12.x. The 3090 Ti is Ampere sm_86 and has been mainline-supported by every recent driver branch; if you hit CUDA out of memory on a 6 GB-class model, the most common cause is another process holding VRAM — nvidia-smi lists culprits.
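To find which process is holding VRAM without reading the full nvidia-smi table, the compute-apps query can be parsed directly. This is a sketch under assumptions: it relies on nvidia-smi's --query-compute-apps CSV output with the pid, process_name, and used_memory fields, reported in MiB with nounits, and it skips rows where the driver reports non-numeric values. The function names are hypothetical.

```python
import subprocess

QUERY = [
    "nvidia-smi",
    "--query-compute-apps=pid,process_name,used_memory",
    "--format=csv,noheader,nounits",
]


def parse_compute_apps(csv_text: str) -> list[tuple[int, str, int]]:
    """Parse CSV rows into (pid, process name, MiB used), largest first."""
    rows = []
    for line in csv_text.strip().splitlines():
        parts = [p.strip() for p in line.split(",")]
        if len(parts) < 3:
            continue
        pid, mem = parts[0], parts[-1]
        name = ",".join(parts[1:-1])  # tolerate commas inside the name
        if not (pid.isdigit() and mem.isdigit()):
            continue  # skip rows with non-numeric fields such as "[N/A]"
        rows.append((int(pid), name, int(mem)))
    return sorted(rows, key=lambda r: r[2], reverse=True)


def vram_holders() -> list[tuple[int, str, int]]:
    """Run nvidia-smi and return the processes currently using VRAM."""
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
    return parse_compute_apps(out)


if __name__ == "__main__":
    for pid, name, mib in vram_holders():
        print(f"{mib:>6} MiB  pid {pid:<8} {name}")
```

Anything near the top of that list that is not your inference server is a candidate to stop before reloading the model.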

Running larger Qwen3 siblings

ollama run qwen3:14b (~9 GB Q4_K_M) and ollama run qwen3:32b (~19 GB Q4_K_M) both fit the 3090 Ti. The 32B is tight: leave headroom for the KV cache by reducing context or pulling a smaller quant. Hardware Corner's same page measures 38.0 tok/s @ 4K for Qwen3-32B Q4_K on the 3090 Ti. The same envelope discipline applies as on the standard 3090: in llama.cpp, cap --ctx-size to 8K-16K and enable KV-cache quantisation (--cache-type-k q8_0 --cache-type-v q8_0, with Flash Attention on).