How much VRAM does Qwen3 30B-A3B need?

About 19 GB — the minimum this recipe targets.

How hard is this setup?

Intermediate — follow the steps above.

Qwen3-30B-A3B on RTX 5090: 226 tok/s MoE Chat with Room to Spare

What You'll Build

A local chat endpoint backed by Qwen3-30B-A3B — Alibaba's 30.5B-total Mixture-of-Experts model with 3.3B active parameters per token — running on a single RTX 5090 via llama.cpp (or Ollama). On the 5090's 32 GB of VRAM the Q4_K_M GGUF is a roomy fit: the weights are 18.56 GB on disk, so the model occupies only about half the card and leaves ~13 GB of headroom for a large KV cache and long-context work.

Hardware data: RTX 5090 (32 GB VRAM) · 226.1 tok/s generation, 7,093 tok/s prefill at 4K context (Q4_K) · See benchmark data

ℹ️ MoE means a 30B model runs like a small one. Qwen3-30B-A3B has 128 experts but activates only 8 (~3.3B params) per token. All experts stay resident in VRAM — the router can pick any of them — but generation cost tracks the small active slice, which is why a 30B model reaches 226 tok/s on the 5090 instead of the ~30-50 tok/s a dense 30B would. The full expert set is what determines the ~19 GB footprint; the active slice is what determines the speed.

Requirements

Component	Minimum	Tested
GPU	19 GB VRAM (Q4_K_M)	RTX 5090 (32 GB)
RAM	16 GB system RAM	—
Storage	~19 GB for the Q4_K_M GGUF (per the GGUF tree)	—
Software	CUDA 12.8+; recent llama.cpp or Ollama	—

The 5090 fits this Q4_K_M build with ~13 GB to spare. That spare capacity is the point of running it here — see "Spending the headroom" below.

Installation

Option A: Ollama (easiest)

# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh

# Pull and run Qwen3-30B-A3B (Q4_K_M, ~19 GB)
ollama run qwen3:30b-a3b

The qwen3:30b-a3b tag is the Q4_K_M build (~19 GB) per the Ollama library. Ollama auto-offloads all layers to the GPU on a 32 GB card.

Option B: llama.cpp (fastest, and what the benchmark used)

Blackwell (sm_120) needs CUDA 12.8+ and an explicit architecture flag at build time. Build llama.cpp from source:

# Clone and build llama.cpp with CUDA for Blackwell (sm_120)
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES="120"
cmake --build build --config Release -j$(nproc)

The -DCMAKE_CUDA_ARCHITECTURES="120" flag targets Blackwell's compute capability 12.0, and CUDA 12.8+ is required for the 5090 — see the llama.cpp build docs.

⚠️ Do not pip install flash-attn for this card. The standalone FlashAttention library has no sm_120 (compute-capability 12.0) kernels yet — it tops out at 8.9, so it crashes with "no kernel image is available" on the 5090. You do not need it: llama.cpp ships its own built-in flash-attention CUDA kernels, enabled with the runtime -fa / --flash-attn flag (flag defined in llama.cpp arg.cpp). The RTX 5090 benchmark below was produced with native llama.cpp flash-attention, no FA2 source build.

Running

Qwen's own GGUF card publishes the recommended invocation. For the Q4_K_M build on the 5090 (offload everything, native flash-attention on):

./build/bin/llama-cli -hf Qwen/Qwen3-30B-A3B-GGUF:Q4_K_M \
  --jinja --color -ngl 99 -fa -sm row \
  --temp 0.6 --top-k 20 --top-p 0.95 --min-p 0 --presence-penalty 1.5 \
  -c 40960 -n 32768 --no-context-shift

The -hf flag pulls the GGUF directly from the canonical Qwen mirror; -ngl 99 offloads all layers (the 5090 has the room); -fa enables llama.cpp's flash-attention. The sampling values are Qwen's thinking-mode recommendation — the model card warns: "DO NOT use greedy decoding" for this model. To run a persistent OpenAI-compatible endpoint instead, swap llama-cli for llama-server --port 8080 with the same model and flags.

These sampling settings, the -ngl 99 -fa command, and the YaRN long-context flags below are all from the official Qwen3-30B-A3B-GGUF model card.

Results

Generation speed: 226.1 tok/s at 4K context (Q4_K), measured on the RTX 5090 with llama.cpp (build 8189) by Hardware Corner and recorded in our benchmark data. Generation stays fast as context grows: 170.4 tok/s at 16K, 143.1 at 32K, 110.8 at 64K, and 76.8 at 128K (same source).
Prefill (prompt processing): 7,093 tok/s at 4K context, scaling down to 5,509.8 (16K), 4,210.0 (32K), 2,379.7 (64K), and 985.0 (128K) per the Hardware Corner RTX 5090 table.
VRAM usage: the Q4_K_M weights are 18.56 GB on disk; with the KV cache and activations the resident footprint sits comfortably under the 32 GB card with large headroom. A precise measured peak is not yet in our data — if you capture one, please contribute it so the next reader gets a measured number.

For the full benchmark data, see /check/qwen3-30b-a3b/rtx-5090.

Spending the headroom — long context and larger quants

The 5090 runs Q4_K_M with ~13 GB free, so unlike a 12 GB card you are not forced to the smallest quant. Two ways to use the room:

Longer context. Qwen3-30B-A3B supports 32,768 tokens natively and 131,072 with YaRN. To enable the 128K window in llama.cpp (per the Qwen GGUF card):

./build/bin/llama-cli ... -c 131072 --rope-scaling yarn --rope-scale 4 --yarn-orig-ctx 32768

Qwen advises adding the YaRN scaling only when you actually need long context, since static YaRN can slightly degrade short-prompt quality.

A larger quant. Higher-precision GGUFs also fit the 32 GB card: Q5_K_M is 21.73 GB, Q6_K is 25.09 GB, and even Q8_0 at 32.48 GB is borderline (leaves almost no KV-cache room — Q6_K is the practical ceiling if you want context). Swap the :Q4_K_M suffix in the -hf flag for :Q6_K to trade a little speed for fidelity.

Troubleshooting

"no kernel image is available for execution on the device"

You built llama.cpp without Blackwell (sm_120) support, or tried to use the standalone FlashAttention library. Rebuild with -DCMAKE_CUDA_ARCHITECTURES="120" on CUDA 12.8+, and rely on llama.cpp's built-in -fa rather than pip install flash-attn (which has no sm_120 kernels).

Thinking vs non-thinking mode

Qwen3-30B-A3B switches modes per turn — add /think or /no_think to a prompt to toggle, per the model card. For non-thinking chat, the model card recommends Temperature=0.7, TopP=0.8 instead of the thinking-mode 0.6 / 0.95 shown in the run command above.

Repetitive output

The card recommends setting presence_penalty to 1.5 for quantized models to suppress repetition; the run command above already includes --presence-penalty 1.5.

No other widely-reported issues for this pair. Report problems via the submission form.