How much VRAM does Qwen3 30B-A3B need?

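About 24 GB is what this recipe targets: the Q4_K_M weights take ~18.6 GB, and the rest goes to the KV-cache and runtime overhead. The Requirements table lists 20 GB as the floor.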

How hard is this setup?

Intermediate. Follow the steps below.

Qwen3-30B-A3B on RTX 3090: Full-GPU MoE Chat at 153 tok/s

What You'll Build

A local chat endpoint backed by Qwen3-30B-A3B — Alibaba's 30.5B-total Mixture-of-Experts model with 3.3B activated parameters per token — running on a single RTX 3090 via llama.cpp, Ollama, or LM Studio. At Q4 the weights are about 18.6 GB, so every layer lives on the 24 GB card with no CPU offload: this is the clean full-GPU path. The MoE design is what makes a 30B-class model this fast — only ~3.3B parameters fire per token, so generation stays quick even though all 128 experts stay resident.

Hardware data: RTX 3090 (24 GB VRAM) · 153.6 tok/s generation at 4K context (Q4_K) · See benchmark data

ℹ️ 24 GB lets this run fully on-GPU — no offload needed. On a 12 GB card the same Q4 weights (~18.6 GB) overflow VRAM and force a CPU/GPU split; on the 3090 the entire model is resident, which is why this recipe leads with a plain -ngl 99 "all layers on the GPU" launch and drops the offload tuning that smaller cards need.

ℹ️ All 30.5B parameters must be resident, not just the 3.3B "active". Qwen3-30B-A3B is marketed as 30.5B total / 3.3B activated per token. All 128 experts (8 routed per token) must stay in VRAM because the router picks experts at inference time — you cannot pre-prune them. The ~3.3B active figure governs speed; the 30.5B total governs fit.
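The fit-versus-speed distinction can be checked with a few lines of arithmetic. This is a minimal sketch that uses only the figures quoted in this recipe (30.5B total, 3.3B active, 18.6 GB Q4_K_M file); the helper names are illustrative, and GB is taken as 10^9 bytes.

```python
# Figures quoted in this recipe.
TOTAL_PARAMS_B = 30.5    # billions of parameters across all experts
ACTIVE_PARAMS_B = 3.3    # billions of parameters activated per token
Q4_WEIGHTS_GB = 18.6     # size of the Q4_K_M GGUF

def bits_per_weight(file_gb: float, params_b: float) -> float:
    """Average bits stored per parameter (GB taken as 1e9 bytes)."""
    return file_gb * 8 / params_b

def weights_fit(weights_gb: float, vram_gb: float) -> bool:
    """True when the weights alone fit; KV-cache and runtime overhead come on top."""
    return weights_gb <= vram_gb

bpw = bits_per_weight(Q4_WEIGHTS_GB, TOTAL_PARAMS_B)
# Size of the slice touched per token. This drives speed, not the VRAM requirement.
active_slice_gb = ACTIVE_PARAMS_B * bpw / 8

print(f"effective bits per weight: {bpw:.2f}")
print(f"resident: {Q4_WEIGHTS_GB} GB, touched per token: {active_slice_gb:.2f} GB")
for vram_gb in (12, 24):
    print(f"{vram_gb} GB card holds the weights: {weights_fit(Q4_WEIGHTS_GB, vram_gb)}")
```

The roughly 2 GB "touched per token" figure is why decoding is quick; the 18.6 GB resident figure is what the card has to hold.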

Requirements

Component	Minimum	Tested
GPU	20 GB VRAM	RTX 3090 (24 GB)
RAM	16 GB system RAM	—
Storage	~19 GB for the Q4_K_M MoE weights (per the GGUF tree)	—
Software	CUDA 12+; recent llama.cpp / Ollama / LM Studio	—

Installation

Three paths are provided. Pick one. Ollama is the fastest route to a working chat session; the llama.cpp GGUF gives you the exact Q4_K quant tier the benchmark used; LM Studio is the GUI equivalent.

Path A — Ollama (recommended for first run)

ollama pull qwen3:30b-a3b-q4_K_M
ollama run qwen3:30b-a3b-q4_K_M

The qwen3:30b-a3b-q4_K_M tag is the ~18.6 GB 4-bit MoE build; first run downloads it and drops you into an interactive REPL. (The shorter qwen3:30b-a3b and qwen3:30b tags resolve to the same family if you would rather not pin the quant explicitly.)

Path B — llama.cpp with the canonical Q4_K_M GGUF

Download the Q4_K_M GGUF from Qwen's own Qwen/Qwen3-30B-A3B-GGUF repo — an 18.6 GB file published by the model authors:

# grab a recent llama.cpp build first: https://github.com/ggml-org/llama.cpp
huggingface-cli download Qwen/Qwen3-30B-A3B-GGUF \
    Qwen3-30B-A3B-Q4_K_M.gguf --local-dir ./qwen3-30b-a3b

Then serve it with every layer on the GPU and FlashAttention enabled:

llama-server -m ./qwen3-30b-a3b/Qwen3-30B-A3B-Q4_K_M.gguf \
    -ngl 99 -fa 1 -c 4096 --host 0.0.0.0 --port 8000

-ngl 99 offloads every layer to the 3090. At ~18.6 GB the full model fits the 24 GB card, so no --n-cpu-moe / tensor-offload tuning is needed. -c 4096 matches the 4K context the benchmark used; raise it only as far as the leftover VRAM allows (see Troubleshooting).

Path C — LM Studio (GUI)

Search LM Studio's model browser for Qwen3-30B-A3B and pick a Q4_K_M GGUF (the canonical build above). Set GPU offload to "max" so all layers land on the 3090, then start a chat. LM Studio uses llama.cpp under the hood, so the runtime path is identical to Path B.

Running

Ollama (one-shot prompt; omit the quoted text for an interactive session):

ollama run qwen3:30b-a3b-q4_K_M "Explain mixture-of-experts routing in one paragraph."

llama.cpp (HTTP, OpenAI-compatible):

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3-30b-a3b",
    "messages": [{"role": "user", "content": "Explain mixture-of-experts routing in one paragraph."}]
  }'
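The same request can be sent from Python with only the standard library. This is a sketch rather than an official client: it assumes the llama-server launch from Path B (port 8000) and the usual OpenAI-style response shape (choices[0].message.content). The function names build_request and chat are illustrative.

```python
import json
import urllib.request

# Assumed to match the llama-server launch in Path B (port 8000).
ENDPOINT = "http://localhost:8000/v1/chat/completions"

def build_request(prompt: str, model: str = "qwen3-30b-a3b") -> urllib.request.Request:
    """Build the POST request without sending it."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        ENDPOINT,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def chat(prompt: str, timeout: float = 120.0) -> str:
    """Send one user message and return the assistant's reply text."""
    with urllib.request.urlopen(build_request(prompt), timeout=timeout) as resp:
        body = json.load(resp)
    # Assumes the OpenAI-style chat completion response layout.
    return body["choices"][0]["message"]["content"]
```

With the server running, chat("Explain mixture-of-experts routing in one paragraph.") returns the reply as a string. Splitting request construction from sending keeps the payload easy to inspect before anything goes over the wire.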

All 30.5B parameters stay resident in VRAM regardless of path — the 3.3B "active" figure is a compute-per-token number (which experts the router fires), not a memory figure.

Results

Generation speed: 153.6 tokens/s at 4K context (Q4_K), measured on RTX 3090 in the Hardware Corner gpu-llm-benchmarks "Qwen3 30B A3B (Q4_K)" row (CUDA, -fa 1) and recorded as the backend benchmark for this pair. See /check/qwen3-30b-a3b/rtx-3090.
Prefill speed: 2,988.6 tokens/s at 4K context on the same Hardware Corner RTX 3090 row — prompt ingestion is fast because prefill is compute-bound and the 3090 keeps the whole MoE on-GPU.
VRAM usage: 24.0 GB peak at 4K context per /check/qwen3-30b-a3b/rtx-3090. The Q4_K_M weights are ~18.6 GB on disk, so the remaining ~5.4 GB covers the KV-cache, activations, and CUDA context. That is workable at 4K and gets tighter as context grows.
Quality notes: the canonical card lists a 32,768-token native context, extensible to 131,072 with YaRN; on a single 24 GB card you are KV-cache-bound well before the YaRN ceiling, so keep context modest (4K–16K) to stay within the card.

If you have measured generation or prefill at a different context length on a 3090, please contribute it — first-party numbers replace the benchmark row above.

For the full benchmark data and side-by-side compare across cards, see /check/qwen3-30b-a3b/rtx-3090.

Troubleshooting

Out of memory at long context

The Q4 weights leave only a few GB of headroom on the 24 GB card, and that headroom is the KV-cache budget. The KV-cache grows linearly with -c (context length), so pushing it far past 16K can OOM. Stay at 4K–16K on the 3090. Enabling FlashAttention (-fa 1 on llama.cpp; on Ollama, set OLLAMA_FLASH_ATTENTION=1 if your version does not already enable it) does not shrink the KV-cache itself, but it avoids materializing the full attention-score matrix, which trims the compute buffers and buys some room. If you have a working long-context configuration on a 3090, please contribute it.
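To see how context length eats into the headroom, the KV-cache can be estimated from the attention shape. This is a rough sketch under stated assumptions: the layer count, KV-head count, and head dimension below are not given in this recipe and are taken from the published model configuration, so verify them against the GGUF metadata. It assumes an unquantized f16 cache and covers the KV-cache only, not compute buffers or the CUDA context.

```python
# ASSUMPTIONS (not stated in this recipe; check the model config or GGUF metadata):
N_LAYERS = 48      # transformer layers
N_KV_HEADS = 4     # key/value heads (grouped-query attention)
HEAD_DIM = 128     # dimension per head

def kv_cache_bytes(n_ctx: int, bytes_per_elem: float = 2.0) -> float:
    """Bytes for K and V across all layers for n_ctx cached tokens (f16 by default)."""
    return 2 * N_LAYERS * N_KV_HEADS * HEAD_DIM * bytes_per_elem * n_ctx

for n_ctx in (4096, 16384, 32768):
    gib = kv_cache_bytes(n_ctx) / 2**30
    print(f"-c {n_ctx:>6}: ~{gib:.3f} GiB of f16 KV-cache")
```

Treat the result as a lower bound on what raising -c costs. The actual peak also includes compute buffers and whatever else is using the GPU, so confirm it with nvidia-smi while the server is under load.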

"All 30.5B parameters must fit, not just 3.3B"

Qwen3-30B-A3B is 30.5B total / 3.3B activated per token with 128 experts, 8 fired per token. Every expert must be resident in VRAM because the router selects them at inference time — you cannot pre-prune. The 3.3B active figure governs speed (why generation is fast), the 30.5B total governs fit (why it needs the full Q4 footprint). Cards smaller than ~20 GB cannot hold the Q4 weights and must offload experts to CPU (a different, slower recipe).

Multi-GPU launch commands from the model card don't fit a single 3090

The official HF model card Quickstart shows transformers, vLLM, and SGLang on the full BF16 weights (~61 GB) — a server configuration, not a consumer single-card path. For one RTX 3090 use the 4-bit GGUF route (Path A/B/C above); the BF16 transformers/vLLM path does not fit 24 GB.
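The ~61 GB figure follows directly from the parameter count: BF16 stores 16 bits per weight. A small sketch of that arithmetic, with an illustrative helper name and GB taken as 10^9 bytes:

```python
import math

def weights_size_gb(params_billion: float, bits: float) -> float:
    """Weight footprint in GB (1e9 bytes) for a given storage width per parameter."""
    return params_billion * bits / 8

bf16_gb = weights_size_gb(30.5, 16)
# Number of 24 GB cards needed for the weights alone, before any KV-cache.
cards_for_weights = math.ceil(bf16_gb / 24)

print(f"BF16 weights: {bf16_gb:.1f} GB")
print(f"24 GB cards needed for weights alone: {cards_for_weights}")
```

Even before KV-cache and activations, the BF16 weights span several 24 GB cards, which is why the model card's Quickstart commands target server hardware.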

Generation slower than expected for the GPU class

Confirm every layer is on the GPU (-ngl 99 on llama.cpp; set GPU offload to "max" in LM Studio; Ollama does this automatically when the model fits) and that FlashAttention is active. LLM token generation is memory-bandwidth-bound, so the per-token rate drops as the KV-cache grows past 4K. If your numbers are still off, please report them.
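One way to judge whether a measured rate is "off" is a back-of-the-envelope bandwidth ceiling: if each token must read the active weights once, tokens/s cannot exceed memory bandwidth divided by the bytes read per token. The sketch below is deliberately crude. The 936 GB/s figure is the RTX 3090's spec-sheet memory bandwidth, which is an assumption not taken from this recipe. It also ignores KV-cache reads, kernel overhead, and the fact that not every tensor is quantized identically.

```python
# ASSUMPTION: spec-sheet memory bandwidth of the RTX 3090 (not from this recipe).
BANDWIDTH_GB_S = 936.0

# Figures quoted in this recipe.
ACTIVE_PARAMS_B = 3.3
BITS_PER_WEIGHT = 18.6 * 8 / 30.5   # average over the Q4_K_M file
MEASURED_TOK_S = 153.6

def decode_ceiling_tok_s(bandwidth_gb_s: float, active_params_b: float,
                         bits_per_weight: float) -> float:
    """Upper bound on tokens/s if every token reads the active weights once."""
    gb_per_token = active_params_b * bits_per_weight / 8
    return bandwidth_gb_s / gb_per_token

ceiling = decode_ceiling_tok_s(BANDWIDTH_GB_S, ACTIVE_PARAMS_B, BITS_PER_WEIGHT)
print(f"idealized ceiling: {ceiling:.0f} tok/s")
print(f"measured / ceiling: {MEASURED_TOK_S / ceiling:.0%}")
```

Under these assumptions the benchmark's 153.6 tok/s sits at roughly a third of the idealized ceiling. A result far below that share is a hint that layers have spilled to the CPU or that the context has grown well beyond 4K.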