What You'll Build
A local chat endpoint backed by Qwen3-30B-A3B — Alibaba's 30.5B-total Mixture-of-Experts model with 3.3B active parameters per token — running on a single RTX 5090 via llama.cpp (or Ollama). On the 5090's 32 GB of VRAM the Q4_K_M GGUF is a roomy fit: the weights are 18.56 GB on disk, so the model occupies only about half the card and leaves ~13 GB of headroom for a large KV cache and long-context work.
Hardware data: RTX 5090 (32 GB VRAM) · 226.1 tok/s generation, 7,093 tok/s prefill at 4K context (Q4_K) · See benchmark data
ℹ️ MoE means a 30B model runs like a small one. Qwen3-30B-A3B has 128 experts but activates only 8 (~3.3B params) per token. All experts stay resident in VRAM — the router can pick any of them — but generation cost tracks the small active slice, which is why a 30B model reaches 226 tok/s on the 5090 instead of the ~30-50 tok/s a dense 30B would. The full expert set is what determines the ~19 GB footprint; the active slice is what determines the speed.
Requirements
| Component | Minimum | Tested |
|---|---|---|
| GPU | 19 GB VRAM (Q4_K_M) | RTX 5090 (32 GB) |
| RAM | 16 GB system RAM | — |
| Storage | ~19 GB for the Q4_K_M GGUF (per the GGUF tree) | — |
| Software | CUDA 12.8+; recent llama.cpp or Ollama | — |
The 5090 fits this Q4_K_M build with ~13 GB to spare. That spare capacity is the point of running it here — see "Spending the headroom" below.
Installation
Option A: Ollama (easiest)
# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh
# Pull and run Qwen3-30B-A3B (Q4_K_M, ~19 GB)
ollama run qwen3:30b-a3b
The qwen3:30b-a3b tag is the Q4_K_M build (~19 GB) per the Ollama library. Ollama auto-offloads all layers to the GPU on a 32 GB card.
Option B: llama.cpp (fastest, and what the benchmark used)
Blackwell (sm_120) needs CUDA 12.8+ and an explicit architecture flag at build time. Build llama.cpp from source:
# Clone and build llama.cpp with CUDA for Blackwell (sm_120)
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES="120"
cmake --build build --config Release -j$(nproc)
The -DCMAKE_CUDA_ARCHITECTURES="120" flag targets Blackwell's compute capability 12.0, and CUDA 12.8+ is required for the 5090 — see the llama.cpp build docs.
⚠️ Do not
pip install flash-attnfor this card. The standalone FlashAttention library has no sm_120 (compute-capability 12.0) kernels yet — it tops out at 8.9, so it crashes with "no kernel image is available" on the 5090. You do not need it: llama.cpp ships its own built-in flash-attention CUDA kernels, enabled with the runtime-fa/--flash-attnflag (flag defined in llama.cpparg.cpp). The RTX 5090 benchmark below was produced with native llama.cpp flash-attention, no FA2 source build.
Running
Qwen's own GGUF card publishes the recommended invocation. For the Q4_K_M build on the 5090 (offload everything, native flash-attention on):
./build/bin/llama-cli -hf Qwen/Qwen3-30B-A3B-GGUF:Q4_K_M \
--jinja --color -ngl 99 -fa -sm row \
--temp 0.6 --top-k 20 --top-p 0.95 --min-p 0 --presence-penalty 1.5 \
-c 40960 -n 32768 --no-context-shift
The -hf flag pulls the GGUF directly from the canonical Qwen mirror; -ngl 99 offloads all layers (the 5090 has the room); -fa enables llama.cpp's flash-attention. The sampling values are Qwen's thinking-mode recommendation — the model card warns: "DO NOT use greedy decoding" for this model. To run a persistent OpenAI-compatible endpoint instead, swap llama-cli for llama-server --port 8080 with the same model and flags.
These sampling settings, the -ngl 99 -fa command, and the YaRN long-context flags below are all from the official Qwen3-30B-A3B-GGUF model card.
Results
- Generation speed: 226.1 tok/s at 4K context (Q4_K), measured on the RTX 5090 with llama.cpp (build 8189) by Hardware Corner and recorded in our benchmark data. Generation stays fast as context grows: 170.4 tok/s at 16K, 143.1 at 32K, 110.8 at 64K, and 76.8 at 128K (same source).
- Prefill (prompt processing): 7,093 tok/s at 4K context, scaling down to 5,509.8 (16K), 4,210.0 (32K), 2,379.7 (64K), and 985.0 (128K) per the Hardware Corner RTX 5090 table.
- VRAM usage: the Q4_K_M weights are 18.56 GB on disk; with the KV cache and activations the resident footprint sits comfortably under the 32 GB card with large headroom. A precise measured peak is not yet in our data — if you capture one, please contribute it so the next reader gets a measured number.
For the full benchmark data, see /check/qwen3-30b-a3b/rtx-5090.
Spending the headroom — long context and larger quants
The 5090 runs Q4_K_M with ~13 GB free, so unlike a 12 GB card you are not forced to the smallest quant. Two ways to use the room:
Longer context. Qwen3-30B-A3B supports 32,768 tokens natively and 131,072 with YaRN. To enable the 128K window in llama.cpp (per the Qwen GGUF card):
./build/bin/llama-cli ... -c 131072 --rope-scaling yarn --rope-scale 4 --yarn-orig-ctx 32768
Qwen advises adding the YaRN scaling only when you actually need long context, since static YaRN can slightly degrade short-prompt quality.
A larger quant. Higher-precision GGUFs also fit the 32 GB card: Q5_K_M is 21.73 GB, Q6_K is 25.09 GB, and even Q8_0 at 32.48 GB is borderline (leaves almost no KV-cache room — Q6_K is the practical ceiling if you want context). Swap the :Q4_K_M suffix in the -hf flag for :Q6_K to trade a little speed for fidelity.
Troubleshooting
"no kernel image is available for execution on the device"
You built llama.cpp without Blackwell (sm_120) support, or tried to use the standalone FlashAttention library. Rebuild with -DCMAKE_CUDA_ARCHITECTURES="120" on CUDA 12.8+, and rely on llama.cpp's built-in -fa rather than pip install flash-attn (which has no sm_120 kernels).
Thinking vs non-thinking mode
Qwen3-30B-A3B switches modes per turn — add /think or /no_think to a prompt to toggle, per the model card. For non-thinking chat, the model card recommends Temperature=0.7, TopP=0.8 instead of the thinking-mode 0.6 / 0.95 shown in the run command above.
Repetitive output
The card recommends setting presence_penalty to 1.5 for quantized models to suppress repetition; the run command above already includes --presence-penalty 1.5.
No other widely-reported issues for this pair. Report problems via the submission form.