
Qwen3-8B on RTX 4090: Q4_K_M GGUF via Ollama or llama.cpp

llm · beginner · 6GB+ VRAM · May 20, 2026
prerequisites
  • NVIDIA RTX 4090 (24 GB VRAM) or equivalent Ada-class CUDA card
  • Recent NVIDIA driver with CUDA 12.x support (Ada sm_89 — no special wheel selection required)
  • ~6 GB free disk for the Q4_K_M GGUF checkpoint (~17 GB for BF16)
  • Ollama, llama.cpp, or LM Studio installed

What You'll Build

A local Qwen3-8B chat / reasoning assistant running on an RTX 4090, served through Ollama (or llama.cpp / LM Studio — same GGUF, three loaders). The recipe pins the dense 8B variant at Q4_K_M quantization (5.03 GB on disk), which leaves the 24 GB card with enormous KV-cache headroom — you can drive Qwen3's full 131k YaRN-extended context without offload, or comfortably run multiple concurrent sessions.

Hardware data: RTX 4090 (24 GB VRAM) · Q4_K GGUF · 141.3 tokens/s generation at 4k context · See benchmark data

⚠️ Variant pinned — Qwen3 ships 8 sizes from the same Qwen org. Per the Ollama qwen3 tag list, Qwen3 spans 0.6b, 1.7b, 4b, 8b (this recipe), 14b, 30b (MoE), 32b, and 235b (MoE). The siblings have wildly different VRAM profiles — Qwen3-14B in Q4_K_M is ~8.5 GB and fits cleanly on this card; Qwen3-32B in Q4_K_M is ~20 GB and still fits with a smaller context budget; Qwen3-235B (MoE, ~22B active) needs >100 GB total resident weights since the router can't pre-prune (see Qwen3 model card on the dense/MoE split). The instructions below are for the dense 8.2B model only. If you want 14B or 32B on this card, swap qwen3:8b for qwen3:14b or qwen3:32b; for 235B go to /contribute.

ℹ️ Thinking mode is on by default. Qwen3-8B has a built-in chain-of-thought ("thinking") mode that the model card's quickstart enables via enable_thinking=True. Output starts with a <think>...</think> block followed by the user-facing answer. To disable for latency-sensitive use, send /no_think in your prompt or pass enable_thinking=False in the chat template.

Requirements

Component | Minimum | Tested
GPU | 6 GB VRAM (Q4_K_M) | RTX 4090 (24 GB)
RAM | 16 GB system |
Storage | 5.03 GB (Q4_K_M GGUF) up to 16.4 GB (BF16) | per unsloth/Qwen3-8B-GGUF
Driver | CUDA 12.x (Ada sm_89) |
Runtime | Ollama 0.5+ / llama.cpp / LM Studio |

The model is released under Apache 2.0 — commercial use is permitted. The min_vram_gb: 6 floor is set by the Q4_K_M weight footprint; on a 24 GB RTX 4090 you have 18 GB of headroom for long contexts or higher-precision quants (see "Picking a quant on this card" below).

Installation

The fastest path is Ollama — one command pulls the canonical Q4_K_M build maintained by the Qwen team:

Option A — Ollama (recommended)

1. Install Ollama

# macOS / Linux
curl -fsSL https://ollama.com/install.sh | sh

(Windows: download from ollama.com/download.) Per the Qwen3 model card, "applications such as Ollama, LMStudio, MLX-LM, llama.cpp, and KTransformers have also supported Qwen3."

2. Pull the 8B model

ollama pull qwen3:8b

This fetches a 5.2 GB Q4_K_M checkpoint per the Ollama qwen3:8b tag. The download is one file — no manual quant-tier selection needed.

Option B — llama.cpp + community GGUF (picking a quant on this card)

If you want a higher-fidelity quant (Q6_K, Q8_0, or even full BF16), use a community redistributor that publishes the full ladder. The 4090's 24 GB envelope makes all tiers viable:

1. Install llama.cpp

# macOS (Homebrew)
brew install llama.cpp

# Linux: pre-built release binaries (not a Python wheel)
# Visit https://github.com/ggml-org/llama.cpp/releases for cu12x binaries

2. Pull the quant you want

Per the unsloth/Qwen3-8B-GGUF per-tier file-size table (link-back to upstream Qwen/Qwen3-8B confirmed on the page header):

Quant | File size | Notes on a 24 GB RTX 4090
Q4_K_M | 5.03 GB | recommended default; leaves ~18 GB for KV cache / concurrent sessions
Q5_K_M | 5.85 GB | slight quality lift, same fit
Q6_K | 6.73 GB | "near perfect" per bartowski
Q8_0 | 8.71 GB | near-lossless
BF16 | 16.4 GB | full precision; fits the 4090 with ~7 GB to spare for context
UD-Q4_K_XL | 5.14 GB | Unsloth "Dynamic 2.0" tier

Then via the llama.cpp Hugging Face shortcut (per the unsloth model card):

# OpenAI-compatible local server with web UI
llama-server -hf unsloth/Qwen3-8B-GGUF:UD-Q4_K_XL

# Interactive terminal
llama-cli -hf unsloth/Qwen3-8B-GGUF:UD-Q4_K_XL
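
To compare the tiers in the table above, a small Python sketch can subtract each file size from the card's nominal 24 GB. This is only a first-order estimate: it assumes the resident weights equal the GGUF file size and ignores runtime buffers. The names VRAM_GB, QUANT_SIZES_GB and headroom_gb are illustrative, not part of any tool.

```python
# File sizes in GB, copied from the quant table above.
QUANT_SIZES_GB = {
    "Q4_K_M": 5.03,
    "Q5_K_M": 5.85,
    "Q6_K": 6.73,
    "Q8_0": 8.71,
    "BF16": 16.4,
    "UD-Q4_K_XL": 5.14,
}

# Nominal VRAM of the RTX 4090 as used throughout this recipe.
VRAM_GB = 24.0


def headroom_gb(quant: str, vram_gb: float = VRAM_GB) -> float:
    """VRAM left after loading the weights (assumes resident size == file size)."""
    return round(vram_gb - QUANT_SIZES_GB[quant], 2)


# Print one line per tier: weights size and what remains for KV cache.
for name, size in QUANT_SIZES_GB.items():
    print(f"{name:<11} {size:>5.2f} GB weights -> {headroom_gb(name):>5.2f} GB free")
```

The Q4_K_M and BF16 rows reproduce the ~18 GB and ~7 GB figures quoted in the table notes.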

Option C — LM Studio (GUI)

LM Studio offers a one-click install path per the Qwen3-8B HF card. Search "Qwen3-8B GGUF" inside the app and pick the Q4_K_M tier, or use the direct-import link lmstudio://open_from_hf?model=unsloth/Qwen3-8B-GGUF.

Running

One-shot prompt via Ollama

ollama run qwen3:8b "Explain GQA attention in three sentences."

First run loads the model into VRAM (~5 GB resident at idle with Q4_K_M, growing as the KV cache fills with longer contexts). Subsequent prompts in the same session stay warm.

Disable thinking mode for short answers

ollama run qwen3:8b "/no_think What's the capital of France?"

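Per the Qwen3-8B HF card, /no_think is a soft switch: enable_thinking stays True in the chat template, but the model skips its chain of thought for that turn. The response may still begin with an empty <think></think> block, so strip it if you display raw output. To remove the block entirely, pass enable_thinking=False in the chat template instead.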

OpenAI-compatible HTTP API

# Ollama exposes localhost:11434 by default
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3:8b",
    "messages": [{"role": "user", "content": "Write a haiku about Ada Lovelace GPUs."}]
  }'
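
The same request can be issued from Python with only the standard library. The sketch below mirrors the curl call above and assumes Ollama is listening on its default port. build_request and ask are hypothetical helper names, and the think flag simply prepends the /no_think switch described earlier.

```python
import json
import urllib.request

# Default Ollama endpoint, as in the curl example above.
OLLAMA_URL = "http://localhost:11434/v1/chat/completions"


def build_request(prompt: str, think: bool = True,
                  model: str = "qwen3:8b") -> urllib.request.Request:
    """Build (but do not send) a chat-completions POST request."""
    # Prepend the /no_think soft switch when reasoning is not wanted.
    content = prompt if think else f"/no_think {prompt}"
    payload = {"model": model,
               "messages": [{"role": "user", "content": content}]}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt: str, think: bool = True) -> str:
    """Send the request and return the assistant's message text."""
    request = build_request(prompt, think)
    with urllib.request.urlopen(request, timeout=300) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

With the server running, ask("Write a haiku about Ada Lovelace GPUs.", think=False) returns the reply as a string. Keeping build_request separate lets you inspect the payload without a live server.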

vLLM / SGLang for production-style serving

The 24 GB envelope is generous enough that the upstream vllm and sglang BF16 paths from the Qwen3-8B card also fit cleanly — BF16 weights are 16.4 GB, leaving ~7 GB for KV cache (good for moderate-context batched serving). Direct from the model card:

# vLLM (v0.8.5+)
vllm serve Qwen/Qwen3-8B --enable-reasoning --reasoning-parser deepseek_r1

# SGLang (v0.4.6.post1+)
python3 -m sglang.launch_server --model-path Qwen/Qwen3-8B \
  --host 0.0.0.0 --port 30000 --reasoning-parser qwen3

For single-user chat or long-context exploration, Ollama / llama.cpp with the Q4_K_M GGUF keeps far more VRAM free for the KV cache — at the same 4k context, Q4_K runs significantly faster than BF16 per the precision/memory ladder in the Qwen speed benchmark.

Results

  • Speed: 141.3 tokens/s generation at 4k context, Q4_K quantization, measured on RTX 4090 — per the hardware-corner.net RTX 4090 LLM benchmark table (row label "Qwen3 8B (Q4_K)", "Token Generation" column). Generation rate drops to 108.0 tok/s at 16k, 82.3 tok/s at 32k, 56.1 tok/s at 64k, and 33.8 tok/s at 128k as the KV cache grows. Prompt processing is much faster on the same source — 9,250.5 tok/s at 4k context, 5,530.5 at 16k, 3,560.1 at 32k. For reference, the same source's RTX 4060 Ti 16GB sibling page shows 45.8 tok/s at 4k for the same row — the 4090 is ~3.1× faster, in line with its ~3.5× memory-bandwidth advantage (1,008 vs 288 GB/s).
  • VRAM usage: Q4_K_M weights occupy ~5 GB at idle; the /check/qwen3-8b/rtx-4090 endpoint currently has no community-submitted benchmark for this pair (link expands as data lands). For BF16, the Qwen speed benchmark measured 15,947 MB on an H20-96GB reference card — fitting the 4090's 24 GB envelope cleanly. FP8 = 9,323 MB and AWQ-INT4 = 6,177 MB on the same source.
  • Quality notes: Q4_K_M is the community-default "sweet spot" — the bartowski Q-tier guide flags Q6_K as "near perfect, recommended" if you want the highest quality without going to BF16. On a 24 GB 4090 there's no quality reason to pick anything below Q4_K_M, and you can comfortably go up to Q8_0 (8.71 GB) or even BF16 (16.4 GB) if you want full-precision behaviour with room to spare for context.
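
The arithmetic behind the speed figures above can be checked with a short sketch. The rates are copied from the Speed bullet; the helper names and the 500-token reply length are illustrative assumptions.

```python
# Generation rates (tokens/s) by context depth, copied from the Speed bullet.
GEN_TOK_S = {"4k": 141.3, "16k": 108.0, "32k": 82.3, "64k": 56.1, "128k": 33.8}


def seconds_for(tokens: int, context: str) -> float:
    """Approximate wall time to generate `tokens` at a given context depth."""
    return tokens / GEN_TOK_S[context]


def relative_speed(context: str, baseline: str = "4k") -> float:
    """Generation rate at `context` as a fraction of the baseline rate."""
    return GEN_TOK_S[context] / GEN_TOK_S[baseline]


# RTX 4090 vs RTX 4060 Ti 16GB at 4k context, using the figures quoted above.
speedup_vs_4060ti = 141.3 / 45.8   # measured generation-rate ratio
bandwidth_ratio = 1008 / 288       # memory-bandwidth ratio in GB/s

for ctx in GEN_TOK_S:
    print(f"{ctx:>4}: {seconds_for(500, ctx):5.1f} s per 500 tokens, "
          f"{relative_speed(ctx):.0%} of the 4k rate")
print(f"4090 vs 4060 Ti: {speedup_vs_4060ti:.1f}x speed, "
      f"{bandwidth_ratio:.1f}x bandwidth")
```

A 500-token reply takes about 3.5 s at 4k context and about 14.8 s at 128k, where generation runs at roughly a quarter of the 4k rate.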

For the full benchmark data and other-GPU comparisons, see /check/qwen3-8b/rtx-4090.

Troubleshooting

Ollama returns Error: model requires more system memory or hangs on load

Confirm a recent NVIDIA driver and CUDA 12.x runtime are installed (nvidia-smi should show a driver from the past 12 months). The RTX 4090 uses the Ada Lovelace architecture (sm_89) which has been fully supported by mainline CUDA wheels since 2023 — no special build flags or wheel pinning are required. If Ollama still appears to hang on first load, watch nvidia-smi -l 1 in another terminal to confirm the GPU is actually being used; if it stays at 0% utilization, reinstall Ollama and re-pull the model.

<think>...</think> output is bloating responses

Qwen3 enables thinking mode by default per the HF card quickstart. Send /no_think at the start of any user message to disable it for that turn, or pass enable_thinking=False if you're calling the chat-template API directly.
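
If you post-process raw completions yourself, the reasoning block can be separated from the answer with a regular expression. This is a minimal sketch, assuming the output contains at most one <think>...</think> block (possibly empty). split_thinking is a hypothetical helper, not part of Ollama or Qwen tooling.

```python
import re

# One <think>...</think> block (possibly empty) plus any trailing whitespace.
THINK_BLOCK = re.compile(r"<think>.*?</think>\s*", re.DOTALL)


def split_thinking(text: str) -> tuple[str, str]:
    """Return (reasoning, answer) from a raw Qwen3 completion."""
    match = THINK_BLOCK.search(text)
    if match is None:
        # No thinking block at all: everything is the answer.
        return "", text.strip()
    # Drop the tags themselves and keep only the reasoning text.
    reasoning = re.sub(r"</?think>", "", match.group(0)).strip()
    # The answer is whatever surrounds the block.
    answer = (text[:match.start()] + text[match.end():]).strip()
    return reasoning, answer
```

For example, split_thinking("<think>\n\n</think>\n\nParis.") returns ("", "Paris."), so an empty block left behind by /no_think is removed as well.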

I want to push past 32k context

32k is Qwen3's native context window per the HF card ("Context Length: 32,768 natively and 131,072 tokens with YaRN"). Beyond that the model needs YaRN extension — supported in llama.cpp via --rope-scaling yarn --rope-scale 4 --yarn-orig-ctx 32768 per the unsloth GGUF instructions. On the 4090, the hardware-corner.net benchmark shows the generation rate falling from 141.3 tok/s at 4k to 33.8 tok/s at 128k — the model still fits comfortably, but throughput drops with the larger KV cache. Quality past 32k also degrades; for long-doc workflows, prefer chunking + retrieval over pushing context to the full 131k.
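
The YaRN scale factor is the target context divided by the native window, so 131,072 / 32,768 gives the factor 4 used above. The sketch below assembles the llama.cpp flags for a chosen context. yarn_args is a hypothetical helper, and the -c context-size flag is added here and does not come from the unsloth instructions.

```python
# Native context window quoted from the HF card.
NATIVE_CTX = 32_768


def yarn_args(target_ctx: int, native_ctx: int = NATIVE_CTX) -> list[str]:
    """llama.cpp flags for a target context; YaRN only when it is needed."""
    args = ["-c", str(target_ctx)]  # -c sets the context size in llama.cpp
    if target_ctx <= native_ctx:
        # Within the native window: no rope scaling required.
        return args
    scale = target_ctx / native_ctx
    return args + [
        "--rope-scaling", "yarn",
        "--rope-scale", f"{scale:g}",
        "--yarn-orig-ctx", str(native_ctx),
    ]


# Full command line for the maximum extended context.
command = ["llama-server", "-hf", "unsloth/Qwen3-8B-GGUF:UD-Q4_K_XL",
           *yarn_args(131_072)]
print(" ".join(command))
```

Scaling only as far as the context you actually need (for example a factor of 2 for 64k) is a reasonable default, since the paragraph above notes that quality degrades past the native window.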

I want the larger 14B / 32B sibling

Qwen3-14B at Q4_K_M is ~8.5 GB on disk and fits a 24 GB card with massive headroom: swap qwen3:8b for qwen3:14b in any Ollama command. Qwen3-32B at Q4_K_M is ~20 GB and also fits, though with much less KV-cache budget. The MoE variants need all parameters resident even though only a fraction is active per token (see the Qwen3 model card on the dense/MoE split). At the same bytes-per-parameter ratio as the dense 32B, Qwen3-30B-A3B at Q4 should come to roughly 19 GB, so it fits about as tightly as the 32B; Qwen3-235B needs >100 GB and does not fit. For a dedicated 32B or 235B recipe on this card, request via /contribute.

Using transformers directly instead of Ollama

If you bypass Ollama / llama.cpp and run the HF card quickstart via transformers directly, the quickstart uses torch_dtype="auto" and device_map="auto" — it does not hardcode attn_implementation="flash_attention_2", so it works out of the box on the 4090 with a stock pip install torch. Ada sm_89 has full FA2 kernel coverage if you do opt into FA2 separately. Unlike Blackwell-class cards (sm_120), no cu128-specific wheel selection is required for the 4090.

Why not just run BF16 by default on this card?

BF16 weights are 16.4 GB per the unsloth/Qwen3-8B-GGUF table — fits the 4090's 24 GB envelope but leaves only ~7 GB for the KV cache, which limits how far you can push context. Q4_K_M at 5 GB leaves ~18 GB for KV cache (or for running concurrent sessions / batched serving), and quality is close enough to BF16 for most chat / reasoning use that the bartowski tier guide recommends Q6_K rather than BF16 as the quality ceiling. If you specifically need bit-exact BF16 behaviour (e.g. for reproducing a published paper's results), the vLLM / SGLang commands in the Running section run it directly from the HF repo.
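
To put numbers on the KV-cache trade-off, the sketch below estimates cache size per token. The architecture values (36 layers, 8 KV heads, head dimension 128) are assumptions to verify against the model's config.json. The estimate also assumes an fp16 cache and ignores compute buffers and other runtime overhead, so treat the results as rough upper bounds.

```python
# ASSUMED Qwen3-8B architecture values; verify against config.json.
N_LAYERS = 36        # transformer layers (assumption)
N_KV_HEADS = 8       # GQA key/value heads (assumption)
HEAD_DIM = 128       # per-head dimension (assumption)
BYTES_PER_ELEM = 2   # fp16 cache entries; a quantized cache would use less


def kv_bytes_per_token() -> int:
    """Bytes of KV cache per token: keys and values for every layer."""
    return 2 * N_LAYERS * N_KV_HEADS * HEAD_DIM * BYTES_PER_ELEM


def kv_cache_gb(tokens: int) -> float:
    """Estimated KV-cache size in GB (10^9 bytes) for `tokens` of context."""
    return tokens * kv_bytes_per_token() / 1e9


def max_tokens(headroom_gb: float) -> int:
    """Largest context whose fp16 KV cache fits in the given headroom."""
    return int(headroom_gb * 1e9 // kv_bytes_per_token())


print(f"per token: {kv_bytes_per_token()} bytes")
print(f"32k context:  {kv_cache_gb(32_768):.2f} GB")
print(f"131k context: {kv_cache_gb(131_072):.2f} GB")
print(f"BF16 headroom (~7 GB):    {max_tokens(7.0)} tokens")
print(f"Q4_K_M headroom (~18 GB): {max_tokens(18.0)} tokens")
```

Under these assumptions the ~7 GB left by BF16 holds roughly 47k tokens of cache, while the ~18 GB left by Q4_K_M holds roughly 122k. The full 131k context comes to about 19 GB at fp16, which is close to the entire headroom, so expect it to be tight without a smaller cache type.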