
Qwen3-8B on RTX 4080 SUPER: Q4_K_M GGUF via Ollama or llama.cpp

llm · beginner · 6GB+ VRAM · Jun 2, 2026
prerequisites
  • NVIDIA RTX 4080 SUPER (16GB VRAM) or equivalent 16 GB CUDA card
  • Recent NVIDIA driver with CUDA 12.x support (Ada sm_89 — no special wheel selection required)
  • ~6 GB free disk for the Q4_K_M GGUF checkpoint (or ~10 GB for Q8_0)
  • Ollama, llama.cpp, or LM Studio installed

What You'll Build

A local Qwen3-8B chat / reasoning assistant running on a 16 GB RTX 4080 SUPER, served through Ollama (or llama.cpp / LM Studio — same GGUF, three loaders). The recipe pins the dense 8B variant at Q4_K_M quantization (5.03 GB on disk), which fits comfortably on the 4080 SUPER with headroom for the 32k-native context window and the optional thinking-mode chain of thought.

Hardware data: RTX 4080 SUPER (16GB VRAM) · Q4_K GGUF · 104.2 tokens/s generation at 4k context · See benchmark data

⚠️ Variant pinned. Qwen3 ships eight sizes from the same Qwen org. Per the Ollama qwen3 tag list, these are 0.6b, 1.7b, 4b, 8b (this recipe), 14b, 30b (MoE), 32b, and 235b (MoE). The siblings have very different VRAM profiles: Qwen3-14B in Q4_K_M is ~8.5 GB and still fits in 16 GB; Qwen3-32B in Q4_K_M is ~20 GB and overflows; Qwen3-235B (MoE, ~22B active) needs >100 GB of resident weights, because the router can send a token to any expert and so all of them must stay loaded (see the Qwen3 model card for the dense/MoE split). The instructions below are for the dense 8.2B model only. To run 14B on this card, swap qwen3:8b for qwen3:14b; for 32B and larger, request a recipe via /contribute.

ℹ️ Thinking mode is on by default. Qwen3-8B has a built-in chain-of-thought ("thinking") mode that the model card's quickstart enables via enable_thinking=True. Output starts with a <think>...</think> block followed by the user-facing answer. To disable for latency-sensitive use, send /no_think in your prompt or pass enable_thinking=False in the chat template.
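
Client code that only wants the user-facing answer can separate the reasoning block after the fact. The following is a minimal sketch, assuming the raw response contains at most one <think>...</think> block ahead of the answer, as described above. The helper name split_thinking is hypothetical and not part of any Qwen or Ollama API.

```python
import re

# Matches one <think>...</think> block; DOTALL lets "." span newlines.
_THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_thinking(text: str) -> tuple[str, str]:
    """Return (reasoning, answer) from a raw Qwen3 response string.

    Hypothetical helper: assumes the reasoning, if present, is wrapped in
    a single <think>...</think> block that precedes the answer.
    """
    match = _THINK_RE.search(text)
    if match is None:
        # No reasoning block at all: everything is the answer.
        return "", text.strip()
    reasoning = match.group(1).strip()
    # Keep whatever surrounds the block and drop the block itself.
    answer = (text[:match.start()] + text[match.end():]).strip()
    return reasoning, answer


raw = "<think>\nParis is the capital.\n</think>\n\nThe capital of France is Paris."
reasoning, answer = split_thinking(raw)
print(answer)
```

Keeping the reasoning as a separate return value, rather than discarding it, makes it easy to log the chain of thought for debugging while showing only the answer to users.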

Requirements

| Component | Minimum | Tested |
|---|---|---|
| GPU | 16 GB VRAM | RTX 4080 SUPER (16GB) |
| RAM | 16 GB system | |
| Storage | 5.03 GB (Q4_K_M GGUF) or 8.71 GB (Q8_0) | per unsloth/Qwen3-8B-GGUF |
| Driver | CUDA 12.x (Ada sm_89) | |
| Runtime | Ollama 0.5+ / llama.cpp / LM Studio | |

The model is released under Apache 2.0 — commercial use is permitted. The weights are not gated on Hugging Face, so no access request or login is required to download them.

Installation

The fastest path is Ollama: one command pulls the Q4_K_M build published under the qwen3 tag in the Ollama library.

Option A — Ollama (recommended)

1. Install Ollama

# macOS / Linux
curl -fsSL https://ollama.com/install.sh | sh

(Windows: download from ollama.com/download.) Per the Qwen3 model card, "For local use, applications such as Ollama, LMStudio, MLX-LM, llama.cpp, and KTransformers have also supported Qwen3."

2. Pull the 8B model

ollama pull qwen3:8b

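This fetches a 5.2 GB Q4_K_M checkpoint per the Ollama qwen3:8b tag (8.19B parameters). That is slightly larger than the 5.03 GB unsloth Q4_K_M file cited elsewhere in this recipe; the two appear to be separate builds, so budget disk for whichever one you pull. The download is one file, so no manual quant-tier selection is needed.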
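
To confirm the pull succeeded without opening a chat session, you can ask the local Ollama server which models it has. The sketch below assumes Ollama is listening on its default port (11434) and uses its /api/tags listing endpoint. The function names are hypothetical, and the parsing assumes the response has a top-level "models" list whose entries carry a "name" field.

```python
import json
import urllib.request

OLLAMA_TAGS_URL = "http://localhost:11434/api/tags"  # default Ollama port


def model_names(tags_payload: dict) -> list[str]:
    """Extract model names from a parsed /api/tags response."""
    return [m.get("name", "") for m in tags_payload.get("models", [])]


def is_pulled(tag: str, tags_payload: dict) -> bool:
    """True if the exact tag (e.g. 'qwen3:8b') is present locally."""
    return tag in model_names(tags_payload)


def fetch_tags(url: str = OLLAMA_TAGS_URL, timeout: float = 5.0) -> dict:
    """Query the local Ollama server for its installed models."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


if __name__ == "__main__":
    try:
        payload = fetch_tags()
    except (OSError, ValueError) as exc:  # server down or non-JSON reply
        print(f"Could not query Ollama: {exc}")
    else:
        print("qwen3:8b pulled:", is_pulled("qwen3:8b", payload))
```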

Option B — llama.cpp + community GGUF

If you want a different quant tier (Q6_K for higher fidelity, Q8_0 for near-lossless), use a community redistributor that publishes the full ladder:

1. Install llama.cpp

# macOS (Homebrew)
brew install llama.cpp

# Linux: pre-built binaries (llama.cpp ships executables, not Python wheels)
# See https://github.com/ggml-org/llama.cpp/releases for cu12x builds

2. Pull the quant you want

Per the unsloth/Qwen3-8B-GGUF per-tier file-size table (link-back to upstream Qwen/Qwen3-8B confirmed on the page header):

| Quant | File size | Notes |
|---|---|---|
| Q4_K_M | 5.03 GB | recommended for this card |
| Q5_K_M | 5.85 GB | better quality, still tiny |
| Q6_K | 6.73 GB | "near perfect" per bartowski |
| Q8_0 | 8.71 GB | near-lossless |
| BF16 | 16.39 GB | full precision; borderline on a 16 GB card, needs offload |
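
The table can be turned into a quick fit check. This is a rough sketch, not a precise VRAM model: it treats file size as the weight footprint, ignores the GB/GiB distinction, and reserves a fixed amount for KV cache and runtime overhead. The 5 GB reserve and the function name fitting_quants are assumptions chosen for illustration.

```python
# File sizes in GB, copied from the table above.
QUANT_SIZES_GB = {
    "Q4_K_M": 5.03,
    "Q5_K_M": 5.85,
    "Q6_K": 6.73,
    "Q8_0": 8.71,
    "BF16": 16.39,
}


def fitting_quants(vram_gb: float, reserve_gb: float) -> list[str]:
    """List tiers whose weights fit in vram_gb after reserving reserve_gb.

    reserve_gb stands in for KV cache, CUDA context, and desktop usage.
    """
    budget = vram_gb - reserve_gb
    return [name for name, size in QUANT_SIZES_GB.items() if size <= budget]


# Assumption: keep 5 GB free on a 16 GB card.
print(fitting_quants(vram_gb=16.0, reserve_gb=5.0))
```

Under these assumptions every tier up to Q8_0 fits, while BF16 does not fit even with no reserve, which matches the "needs offload" note in the table.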

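Then use the llama.cpp Hugging Face shortcut (per the unsloth model card). Note that the card's example below pulls the UD-Q4_K_XL quant rather than a tier from the table above; change the tag after the colon (for example, :Q4_K_M) to fetch the tier you want: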

# OpenAI-compatible local server with web UI
llama-server -hf unsloth/Qwen3-8B-GGUF:UD-Q4_K_XL

# Interactive terminal
llama-cli -hf unsloth/Qwen3-8B-GGUF:UD-Q4_K_XL

Option C — LM Studio (GUI)

LM Studio offers a one-click install path per the Qwen3-8B HF card. Search "Qwen3-8B GGUF" inside the app and pick the Q4_K_M tier, or use the direct-import link lmstudio://open_from_hf?model=unsloth/Qwen3-8B-GGUF.

Running

One-shot prompt via Ollama

ollama run qwen3:8b "Explain GQA attention in three sentences."

First run loads the model into VRAM (~5 GB resident at idle, growing as the KV cache fills with longer contexts). Subsequent prompts in the same session stay warm.
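
How fast the KV cache grows can be estimated with simple arithmetic. The sketch below assumes an fp16 cache and the architecture figures I believe the Qwen3-8B model card lists (36 layers, 8 KV heads under GQA, head dimension 128). Treat all four defaults as assumptions and check them against the model's config.json before relying on the numbers.

```python
def kv_cache_bytes(
    n_tokens: int,
    n_layers: int = 36,        # assumption: Qwen3-8B layer count
    n_kv_heads: int = 8,       # assumption: GQA key/value heads
    head_dim: int = 128,       # assumption: per-head dimension
    bytes_per_value: int = 2,  # assumption: fp16 cache, no KV quantization
) -> int:
    """Estimate KV cache size in bytes for n_tokens of context."""
    # Factor of 2: one key tensor and one value tensor per layer.
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value * n_tokens


for ctx in (4096, 16384, 32768):
    gib = kv_cache_bytes(ctx) / 2**30
    print(f"{ctx:>6} tokens: {gib:.2f} GiB of KV cache")
```

With these defaults the cache costs 147,456 bytes (144 KiB) per token, so a full 32,768-token window adds about 4.5 GiB on top of the weights.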

Disable thinking mode for short answers

ollama run qwen3:8b "/no_think What's the capital of France?"

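Per the Qwen3-8B HF card, /no_think is a soft switch: it turns off reasoning for that turn while enable_thinking stays True at the chat-template level. The response may still open with an empty <think></think> block, so strip it client-side if you need clean output. To remove the block entirely, pass enable_thinking=False to the chat template instead.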

OpenAI-compatible HTTP API

# Ollama exposes localhost:11434 by default
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3:8b",
    "messages": [{"role": "user", "content": "Write a haiku about Ada Lovelace GPUs."}]
  }'
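
The same request can be made from Python with only the standard library. This is a minimal sketch, assuming the Ollama server from the curl example is running locally. The helper names build_request and chat are hypothetical, and the response parsing assumes the usual OpenAI-style shape (choices[0].message.content).

```python
import json
import urllib.request

BASE_URL = "http://localhost:11434/v1/chat/completions"


def build_request(prompt: str, model: str = "qwen3:8b") -> urllib.request.Request:
    """Build the same POST request the curl example sends."""
    body = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        BASE_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt: str, timeout: float = 120.0) -> str:
    """Send one user message and return the assistant's reply text."""
    with urllib.request.urlopen(build_request(prompt), timeout=timeout) as resp:
        reply = json.load(resp)
    # Assumed OpenAI-style response: first choice, assistant message text.
    return reply["choices"][0]["message"]["content"]
```

With the server running, print(chat("Write a haiku about Ada Lovelace GPUs.")) should print the model's reply. The generous default timeout allows for the first-request model load.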

For higher-throughput, production-style serving, the upstream Qwen3-8B card documents two commands: vllm serve Qwen/Qwen3-8B --enable-reasoning --reasoning-parser deepseek_r1 and python -m sglang.launch_server --model-path Qwen/Qwen3-8B --reasoning-parser qwen3. Both load BF16 weights (16.39 GB), which sits right at this card's 16 GB capacity before any KV cache is allocated. On the 4080 SUPER, Ollama or llama.cpp with the Q4_K_M GGUF is the comfortable path.

Results

  • Speed: 104.2 tokens/s generation at 4k context, Q4_K quantization, measured on RTX 4080 SUPER — per the hardware-corner.net LLM benchmark table, surfaced via /check/qwen3-8b/rtx-4080-super. Generation rate drops as the KV cache grows with context: 79.4 tok/s at 16k, 59.5 tok/s at 32k, and 39.1 tok/s at 64k. Prompt processing is much faster — 6,137.0 tok/s at 4k context per the same source (falling to 3,858.1 at 16k, 2,537.1 at 32k, 1,501.5 at 64k).
  • VRAM usage: The cited backend benchmark records peak VRAM at the 4k-context Q4_K configuration as fully utilizing the card's 16 GB; see /check/qwen3-8b/rtx-4080-super for the latest measurement. At idle the Q4_K_M weights occupy ~5 GB, and the rest is headroom that the runtime fills with KV cache as context grows. The official Qwen speed benchmark (measured on H20 hardware) shows the same precision/VRAM ladder: BF16 = 15947 MB, FP8 = 9323 MB, AWQ-INT4 = 6177 MB.
  • Quality notes: Q4_K_M is the community-default "sweet spot" — the bartowski Q-tier guide flags Q6_K as "near perfect, recommended" if you have the VRAM. On a 16 GB 4080 SUPER you can also run Q6_K (6.73 GB) or Q8_0 (8.71 GB) with plenty of room — there's no quality reason to pick anything below Q4_K_M on this card.
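
To make the speed figures above concrete, the short sketch below converts them into relative slowdown and wall-clock time for a fixed-length answer. The throughput values are the generation rates quoted in the Speed bullet. The 500-token answer length is an arbitrary example, and the calculation ignores prompt-processing time.

```python
# Generation throughput (tokens/s) by context length, from the Speed bullet.
GEN_TOK_S = {"4k": 104.2, "16k": 79.4, "32k": 59.5, "64k": 39.1}


def seconds_for(n_tokens: int, context: str) -> float:
    """Seconds to generate n_tokens at the measured rate for a context size."""
    return n_tokens / GEN_TOK_S[context]


baseline = GEN_TOK_S["4k"]
for ctx, rate in GEN_TOK_S.items():
    print(f"{ctx:>3} ctx: {rate / baseline:5.0%} of 4k speed, "
          f"{seconds_for(500, ctx):5.1f} s per 500 tokens")
```

The takeaway is that a 64k context generates at a bit over a third of the 4k rate, so a 500-token answer that takes under 5 seconds at 4k takes roughly 13 seconds at 64k.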

For the full benchmark data and other-GPU comparisons, see /check/qwen3-8b/rtx-4080-super.

Troubleshooting

Ollama returns Error: model requires more system memory or hangs on load

Confirm a recent NVIDIA driver and CUDA 12.x runtime are installed (nvidia-smi should show a driver from the past 12 months). The RTX 4080 SUPER uses the Ada Lovelace architecture (sm_89) which has been fully supported by mainline CUDA wheels since 2023 — no special build flags or wheel pinning are required. If Ollama still appears to hang on first load, watch nvidia-smi -l 1 in another terminal to confirm the GPU is actually being used; if it stays at 0% utilization, reinstall Ollama and re-pull the model.
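
The nvidia-smi check can also be scripted. The sketch below uses nvidia-smi's CSV query mode and parses one line of output. The function names are hypothetical, and the 4000 MiB threshold is an assumption based on the ~5 GB the Q4_K_M weights occupy once loaded.

```python
import subprocess

# CSV query: GPU utilization (%) and memory in use (MiB), no header or units.
QUERY = [
    "nvidia-smi",
    "--query-gpu=utilization.gpu,memory.used",
    "--format=csv,noheader,nounits",
]


def parse_smi(line: str) -> tuple[int, int]:
    """Parse 'util, mem' (percent, MiB) from one nvidia-smi CSV line."""
    util, mem = (field.strip() for field in line.split(","))
    return int(util), int(mem)


def model_on_gpu(line: str, min_mem_mib: int = 4000) -> bool:
    """Heuristic: the model is resident if several GiB are allocated."""
    _, mem = parse_smi(line)
    return mem >= min_mem_mib


def read_first_gpu() -> str:
    """Run nvidia-smi and return the CSV line for GPU 0."""
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return out.stdout.splitlines()[0]
```

Calling model_on_gpu(read_first_gpu()) while a prompt is running gives a quick yes/no. Memory in use is a steadier signal than utilization, which drops back to 0% between prompts even when the model is loaded.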

<think>...</think> output is bloating responses

Qwen3 enables thinking mode by default per the HF card quickstart. Send /no_think at the start of any user message to disable it for that turn, or pass enable_thinking=False if you're calling the chat-template API directly.

I want the larger 14B / 32B sibling

Qwen3-14B at Q4_K_M is ~8.5 GB on disk and fits a 16 GB card comfortably — swap qwen3:8b for qwen3:14b in any Ollama command. Qwen3-32B at Q4_K_M is ~20 GB and does not fit without aggressive offloading; same for the 30B MoE and 235B MoE variants (MoE total params must be resident — see the Qwen3 model card on the dense/MoE split). For a 32B+ recipe on this card, request via /contribute.

Using transformers directly instead of Ollama

If you bypass Ollama / llama.cpp and run the HF card quickstart via transformers directly, the quickstart uses torch_dtype="auto" and device_map="auto" — it does not hardcode attn_implementation="flash_attention_2", so it works out of the box on the 4080 SUPER with a stock pip install torch (Ada sm_89 has full FA2 kernel coverage if you do opt into FA2 separately). Unlike Blackwell-class cards, no cu128-specific wheel selection is required for the 4080 SUPER.

Generation slows dramatically past 32k context

Qwen3 natively supports a 32,768-token context, extendable to 131,072 tokens with YaRN RoPE scaling per the HF card. Beyond the native window the model needs YaRN extension — supported in llama.cpp via --rope-scaling yarn --rope-scale 4 --yarn-orig-ctx 32768 per the unsloth GGUF instructions — but quality degrades on short prompts and the KV cache balloons. For long-doc workflows, prefer chunking + retrieval over pushing context past 32k. The hardware-corner.net benchmark shows the generation rate falling to 39.1 tok/s at 64k context on this card.
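
The scale factor in those flags is the target context divided by the native window (131,072 / 32,768 = 4). The sketch below builds the llama.cpp arguments for a chosen context size, using the flags quoted above plus -c for the context length. The helper name yarn_args is hypothetical, and applying a fractional scale for targets that are not a whole multiple of the native window is an assumption.

```python
NATIVE_CTX = 32768   # Qwen3 native window, per the text above
MAX_CTX = 131072     # YaRN-extended limit, per the text above


def yarn_args(target_ctx: int, native_ctx: int = NATIVE_CTX) -> list[str]:
    """Build llama.cpp context flags for a target window.

    Returns only -c when the target fits natively; otherwise adds the
    YaRN flags with scale = target / native.
    """
    if target_ctx > MAX_CTX:
        raise ValueError(f"target_ctx {target_ctx} exceeds {MAX_CTX}")
    args = ["-c", str(target_ctx)]
    if target_ctx <= native_ctx:
        return args  # no RoPE scaling needed inside the native window
    scale = target_ctx / native_ctx
    return args + [
        "--rope-scaling", "yarn",
        "--rope-scale", f"{scale:g}",
        "--yarn-orig-ctx", str(native_ctx),
    ]


print(" ".join(yarn_args(65536)))
```

Returning no scaling flags at or below 32k reflects the caveat above: enabling YaRN when it is not needed can hurt quality on short prompts.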