How much VRAM does Qwen3 30B-A3B need?

About 24 GB — the minimum this recipe targets.

How hard is this setup?

Intermediate. Follow the installation steps below.

Qwen3-30B-A3B on RTX 3090 Ti: 167 tok/s MoE Chat That Fits the Full 24 GB Card

What You'll Build

A local chat endpoint backed by Qwen3-30B-A3B, Alibaba's 30.5B-total Mixture-of-Experts model with 3.3B activated parameters per token, running on a single RTX 3090 Ti in Q4_K quantization via Ollama, llama.cpp, or LM Studio. The MoE design is what makes a 30B model fast on a 24 GB card: only 3.3B parameters are used per token, so generation runs at interactive speed even though all 128 experts stay resident in VRAM.

Hardware data: RTX 3090 Ti (24 GB VRAM) · 166.9 tok/s generation at 4K context (Q4_K) · See benchmark data

ℹ️ The Q4_K weights fit fully — no offload needed. At Q4_K_M the GGUF is ~18.6 GB on disk (per the bartowski tree), so the entire model lives on the 3090 Ti's 24 GB with room for the KV-cache. This recipe uses the clean full-GPU path (-ngl 99); there is no CPU-offload story here, unlike on smaller cards where the routed experts must spill to system RAM.
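The sizing behind that note can be checked with a few lines of arithmetic. The sketch below is a rough estimate only: it assumes the GGUF file size approximates the VRAM taken by the weights, and it treats every "GB" figure quoted in this recipe as the same decimal unit. The helper names are made up for illustration.

```python
# Back-of-envelope weight sizing from the figures quoted in this recipe.
# Assumption: file size on disk ~= VRAM used by the weights once loaded.
TOTAL_PARAMS = 30.5e9     # total parameters, all experts included
Q4_K_M_FILE_GB = 18.63    # bartowski Q4_K_M GGUF size
CARD_VRAM_GB = 24.0       # RTX 3090 Ti


def effective_bits_per_weight(file_gb: float, params: float) -> float:
    """Average bits stored per parameter (file metadata included)."""
    return file_gb * 1e9 * 8 / params


def weights_gb(params: float, bits_per_weight: float) -> float:
    """Weight footprint in GB at a given average precision."""
    return params * bits_per_weight / 8 / 1e9


bpw = effective_bits_per_weight(Q4_K_M_FILE_GB, TOTAL_PARAMS)
bf16_gb = weights_gb(TOTAL_PARAMS, 16)
left_gb = CARD_VRAM_GB - Q4_K_M_FILE_GB

print(f"Q4_K_M average:            {bpw:.2f} bits/weight")
print(f"BF16 weights:              {bf16_gb:.0f} GB")
print(f"Left after Q4_K_M weights: {left_gb:.2f} GB")
```

The leftover figure is what the KV-cache, activations, and runtime buffers have to share, which is why context length, not the weights, decides whether a run fits.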

Requirements

Component	Minimum	Tested
GPU	24 GB VRAM	RTX 3090 Ti (24 GB)
RAM	16 GB system RAM	—
Storage	~19 GB for the Q4_K_M MoE weights (per the GGUF tree)	—
Software	CUDA 12+; recent Ollama / llama.cpp / LM Studio	—

Installation

Three paths are provided. Pick one. Ollama is the fastest route to a working chat session; the llama.cpp Q4_K GGUF gives you the exact quant tier the benchmark used; LM Studio is the GUI equivalent.

Path A — Ollama (recommended for first run)

ollama pull qwen3:30b-a3b
ollama run qwen3:30b-a3b

The qwen3:30b-a3b tag is a 19 GB Q4_K_M MoE build; first run downloads it and drops you into an interactive REPL. (Ollama also publishes an explicit qwen3:30b-a3b-q4_K_M tag at the same quant if you want to pin it by name.)

Path B — llama.cpp with the Q4_K GGUF

Download the Q4_K_M MoE GGUF — the bartowski/Qwen_Qwen3-30B-A3B-GGUF build is an 18.63 GB file that links back to the canonical Qwen/Qwen3-30B-A3B (base_model_relation: quantized):

# grab a recent llama.cpp build first: https://github.com/ggml-org/llama.cpp
huggingface-cli download bartowski/Qwen_Qwen3-30B-A3B-GGUF \
    Qwen_Qwen3-30B-A3B-Q4_K_M.gguf --local-dir ./qwen3-30b-a3b

Then serve it with all layers on the GPU and FlashAttention enabled:

llama-server -m ./qwen3-30b-a3b/Qwen_Qwen3-30B-A3B-Q4_K_M.gguf \
    -ngl 99 -fa 1 -c 4096 --host 0.0.0.0 --port 8000

-ngl 99 offloads every layer to the 3090 Ti — the full model fits, so this is the whole story, no --n-cpu-moe spill. -c 4096 matches the 4K context the benchmark used (push it higher only as far as the leftover VRAM allows — see Troubleshooting).
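One way to confirm that nothing spilled to system RAM is to read the card's memory use once llama-server has finished loading. The sketch below shells out to nvidia-smi with its CSV query flags and checks that at least the weight file's worth of VRAM is in use. The function names and the 18.63 GB default are illustrative, and the check is necessary rather than sufficient, since other processes also occupy VRAM.

```python
import subprocess

# nvidia-smi reports these values in MiB when nounits is set.
QUERY = [
    "nvidia-smi",
    "--query-gpu=memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_memory_line(line: str) -> tuple[int, int]:
    """Parse one 'used, total' CSV line into integers (MiB)."""
    used, total = (int(field.strip()) for field in line.split(","))
    return used, total


def weights_resident(used_mib: int, weights_gb: float = 18.63) -> bool:
    """True if VRAM in use is at least the size of the weight file."""
    return used_mib * 2**20 >= weights_gb * 1e9


def gpu_memory_mib(gpu_index: int = 0) -> tuple[int, int]:
    """Return (used, total) in MiB for one GPU by calling nvidia-smi."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_memory_line(result.stdout.strip().splitlines()[gpu_index])
```

With the server running, `weights_resident(gpu_memory_mib()[0])` returning False suggests that some layers did not land on the GPU.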

Path C — LM Studio (GUI)

Search LM Studio's model browser for Qwen3-30B-A3B and pick a Q4_K GGUF (the bartowski build above, or the lmstudio-community/Qwen3-30B-A3B-GGUF Q4_K_M at the same 18.63 GB). Set GPU offload to "max" so all layers land on the 3090 Ti, then start a chat. LM Studio uses llama.cpp under the hood, so the runtime path is identical to Path B.

Running

Ollama (interactive):

ollama run qwen3:30b-a3b "Explain mixture-of-experts routing in one paragraph."
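The same prompt can also go through Ollama's local HTTP API, which is convenient for scripting. A minimal sketch, assuming Ollama is listening on its default port 11434 and using only the standard library. The function names are illustrative. The rate helper relies on the `eval_count` and `eval_duration` (nanoseconds) fields that Ollama returns with a non-streaming reply.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/chat"  # Ollama's default local port


def build_chat_request(prompt: str, model: str = "qwen3:30b-a3b") -> dict:
    """Body for a single-turn, non-streaming chat request."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
    }


def generation_rate(reply: dict) -> float:
    """Tokens per second from eval_count and eval_duration (nanoseconds)."""
    return reply["eval_count"] / (reply["eval_duration"] / 1e9)


def chat(prompt: str, url: str = OLLAMA_URL) -> dict:
    """Send the prompt and return the parsed JSON reply."""
    request = urllib.request.Request(
        url,
        data=json.dumps(build_chat_request(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

Calling `reply = chat("Explain mixture-of-experts routing in one paragraph.")` returns the answer in `reply["message"]["content"]`, and `generation_rate(reply)` gives a figure to compare against the benchmark.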

llama.cpp (HTTP, OpenAI-compatible):

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3-30b-a3b",
    "messages": [{"role": "user", "content": "Explain mixture-of-experts routing in one paragraph."}]
  }'

By default Qwen3 runs in thinking mode, emitting a <think>...</think> block before the final answer (the model card's enable_thinking flag defaults to True; setting enable_thinking=False turns thinking off). All 30.5B parameters stay resident in VRAM regardless of path: the 3.3B "activated" figure is a compute-per-token number (it reflects the 8 of the 128 experts the router fires for each token), not a memory figure.
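For applications that only want the final answer, the thinking block can be dropped on the client side. The sketch below is one possible approach, assuming the llama-server from Path B is listening on port 8000 and returns the <think> block inline in the message content. Depending on the server's reasoning settings the block may already be separated out, in which case the stripping step changes nothing. The function names are illustrative.

```python
import json
import re
import urllib.request

SERVER_URL = "http://localhost:8000/v1/chat/completions"  # port from Path B
THINK_BLOCK = re.compile(r"<think>.*?</think>\s*", re.DOTALL)


def strip_thinking(text: str) -> str:
    """Remove any <think>...</think> block and surrounding whitespace."""
    return THINK_BLOCK.sub("", text).strip()


def ask(prompt: str, url: str = SERVER_URL) -> str:
    """Send one user message and return the answer without the thinking block."""
    payload = {
        "model": "qwen3-30b-a3b",
        "messages": [{"role": "user", "content": prompt}],
    }
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        reply = json.load(response)
    return strip_thinking(reply["choices"][0]["message"]["content"])
```

With the server up, `print(ask("Explain mixture-of-experts routing in one paragraph."))` prints only the final paragraph.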

Results

Generation speed: 166.9 tokens/s at 4K context (Q4_K), measured on RTX 3090 Ti by Hardware Corner's gpu-llm-benchmarks. It scales down gracefully as context grows: 121.9 tok/s at 16K and 92.0 tok/s at 32K — the slow falloff is the MoE design paying off, since only 3.3B parameters are read per token.
Prefill speed: 3,441.0 tokens/s at 4K context on the same Hardware Corner RTX 3090 Ti row (2,205.6 at 16K, 1,483.9 at 32K).
VRAM usage: 24.0 GB peak at 4K context per /check/qwen3-30b-a3b/rtx-3090-ti. The ~18.6 GB Q4_K weights plus KV-cache and activations fit inside the 24 GB card with no offload required, but that peak leaves essentially no spare VRAM (see Troubleshooting).
Quality notes: Q4_K keeps the model at 4-bit while preserving the higher-precision tensors that matter for output quality; the canonical model card lists a 32,768-token native context, extensible to 131,072 with YaRN, but on a single 24 GB card the KV-cache is the binding constraint at long context — keep it modest (4K–16K) to stay within the card.
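The prefill and generation rates above translate into wall-clock latency with simple division. The sketch below is a rough model, assuming a prompt that fills the stated context, a hypothetical 500-token reply, and rates that stay constant over the request. None of those assumptions come from the benchmark itself.

```python
# context label -> (context tokens, prefill tok/s, generation tok/s),
# rates copied from the Results above.
ROWS = {
    "4K": (4096, 3441.0, 166.9),
    "16K": (16384, 2205.6, 121.9),
    "32K": (32768, 1483.9, 92.0),
}
OUTPUT_TOKENS = 500  # hypothetical reply length


def response_seconds(prompt_tokens: int, output_tokens: int,
                     prefill_rate: float, gen_rate: float) -> float:
    """Time to read the prompt plus time to generate the reply."""
    return prompt_tokens / prefill_rate + output_tokens / gen_rate


baseline = ROWS["4K"][2]
for label, (ctx, prefill, gen) in ROWS.items():
    seconds = response_seconds(ctx, OUTPUT_TOKENS, prefill, gen)
    print(f"{label:>3}: {seconds:5.1f} s total, "
          f"generation at {gen / baseline:.0%} of the 4K rate")
```

At long context the prompt-processing term dominates, so the prefill column matters more than the headline generation rate for document-sized prompts.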

For the full benchmark data and side-by-side compare across cards, see /check/qwen3-30b-a3b/rtx-3090-ti.

Troubleshooting

`KeyError: 'qwen3_moe'` on model load

The Qwen3-MoE architecture needs a recent transformers release. The model card warns that transformers<4.51.0 raises KeyError: 'qwen3_moe'. This only affects the raw Python transformers path; the GGUF routes (Ollama / llama.cpp / LM Studio) sidestep it entirely, which is another reason to prefer them on a single consumer card. If you do run the model card's Python snippet, run pip install -U transformers first.
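A quick pre-flight check can catch the old-version case before the model load fails. This is a simplified sketch: it compares only the first three numeric components of the version string, so pre-release suffixes such as "rc1" or ".dev0" are ignored. The helper names are illustrative.

```python
from importlib import metadata

MIN_TRANSFORMERS = (4, 51, 0)  # first release the model card lists as working


def version_tuple(version: str) -> tuple:
    """First three numeric components, padded with zeros: '4.51' -> (4, 51, 0)."""
    parts = []
    for piece in version.split(".")[:3]:
        if not piece.isdigit():
            break
        parts.append(int(piece))
    parts += [0] * (3 - len(parts))
    return tuple(parts)


def supports_qwen3_moe(version: str) -> bool:
    """True if this transformers version is at least 4.51.0."""
    return version_tuple(version) >= MIN_TRANSFORMERS


def installed_transformers_ok() -> bool:
    """Check the transformers package installed in the current environment."""
    try:
        return supports_qwen3_moe(metadata.version("transformers"))
    except metadata.PackageNotFoundError:
        return False
```

If `installed_transformers_ok()` returns False, upgrading is the fix for the KeyError described above.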

Out of memory at long context

The 4K-context benchmark peaks at 24.0 GB on the RTX 3090 Ti, so the card is essentially full even though the weights are only ~18.6 GB — the remainder is KV-cache and activations. Pushing -c (context length) higher grows the KV-cache and will eventually OOM. Stay at 4K–16K on the 3090 Ti; if you need the full 131K YaRN context, that needs a bigger card. If you have measured a working long-context configuration on a 3090 Ti, please contribute it.
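How fast the KV-cache grows with -c can be estimated from the model's attention shape. The numbers in the sketch below are assumptions that do not appear in this recipe: 48 layers, 4 key/value heads, a head dimension of 128, and an unquantized 16-bit cache. Check them against the model's published config before relying on the result. The estimate counts only the K and V tensors, so compute buffers and runtime overhead come on top, and a measured peak should always take precedence.

```python
# ASSUMED architecture values (not stated in this recipe; verify against
# the model's published config before relying on them).
N_LAYERS = 48
N_KV_HEADS = 4
HEAD_DIM = 128
BYTES_PER_ELEMENT = 2  # 16-bit cache entries


def kv_cache_bytes(context_tokens: int) -> int:
    """Bytes held by the K and V tensors for a given context length."""
    per_token = 2 * N_LAYERS * N_KV_HEADS * HEAD_DIM * BYTES_PER_ELEMENT
    return per_token * context_tokens


for ctx in (4096, 16384, 32768, 131072):
    print(f"{ctx:>7} tokens: {kv_cache_bytes(ctx) / 2**30:.2f} GiB")
```

Under those assumptions the cache grows linearly, from under half a GiB at 4K to 12 GiB at 131,072 tokens. Added to the weights, that last figure exceeds the card's 24 GB, in line with the advice that the full YaRN context needs a bigger card.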

"All 30B parameters must fit, not just 3B"

Qwen3-30B-A3B is marketed as 30.5B total / 3.3B activated per token. All 128 experts must be resident in VRAM because the router picks 8 of them for each token at inference time, and the choice changes from token to token, so you cannot pre-prune them. The 3.3B active figure governs speed (why generation is fast); the 30.5B total governs fit (why it needs ~18.6 GB at Q4_K). Those weights still clear the 24 GB card; sub-24 GB cards need either a smaller quant or MoE CPU-offload, which is a different recipe.
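The reason every expert has to stay loaded is easier to see in code. The sketch below is a generic top-k gating example, not Qwen's actual router implementation: random numbers stand in for the scores a trained router would produce for one token, and the function name is made up for illustration.

```python
import math
import random

N_EXPERTS = 128  # experts per MoE layer, per this recipe
TOP_K = 8        # experts the router activates per token


def route(router_logits: list[float], k: int = TOP_K) -> dict[int, float]:
    """Pick the k highest-scoring experts and softmax-normalise their weights."""
    order = sorted(range(len(router_logits)),
                   key=lambda i: router_logits[i], reverse=True)
    top = order[:k]
    peak = max(router_logits[i] for i in top)  # subtract max for stability
    exps = {i: math.exp(router_logits[i] - peak) for i in top}
    total = sum(exps.values())
    return {i: value / total for i, value in exps.items()}


rng = random.Random(0)
# Stand-in scores for two different tokens.
logits = [rng.gauss(0.0, 1.0) for _ in range(N_EXPERTS)]
other_logits = [rng.gauss(0.0, 1.0) for _ in range(N_EXPERTS)]

chosen = route(logits)
other_chosen = route(other_logits)
print("token 1 experts:", sorted(chosen))
print("token 2 experts:", sorted(other_chosen))
print(f"active share: {len(chosen) / N_EXPERTS:.2%} of experts per token")
```

Because the selected set depends on the token's scores, any of the 128 experts may be needed on the next step, which is why pruning ahead of time is not an option.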

Multi-GPU launch commands from the model card don't fit a single 3090 Ti

The official HF model card's Quickstart shows the BF16 transformers path (~61 GB of weights) and references vLLM/SGLang server deployments — those are multi-GPU or large-card configurations, not a consumer single-card path. For one RTX 3090 Ti use the Q4_K GGUF route (Path A/B/C above); the BF16 transformers path does not fit 24 GB.

Generation slower than expected for the GPU class

Confirm FlashAttention is active (-fa 1 on llama.cpp; on Ollama, set OLLAMA_FLASH_ATTENTION=1 if your version does not already enable it) and that you are at small context. LLM token generation is memory-bandwidth-bound, so the per-token rate drops as the KV-cache grows past 4K, exactly as the Hardware Corner table shows (166.9 → 92.0 tok/s going from 4K to 32K). If your numbers are still off, please report them.
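To compare your own setup with the benchmark, it helps to measure the rate the same way each time. The sketch below times one request against the llama-server from Path B. It assumes the reply may carry a `timings` object with a `predicted_per_second` field, as llama-server builds I am aware of provide, and falls back to completion tokens divided by wall-clock time when that field is absent. The fallback includes prompt processing, so it reads slightly low. The function names are illustrative.

```python
import json
import time
import urllib.request

SERVER_URL = "http://localhost:8000/v1/chat/completions"  # port from Path B
BENCHMARK_TOK_S = 166.9  # 4K-context generation rate quoted in Results


def tokens_per_second(reply: dict, elapsed_s: float) -> float:
    """Prefer the server's own timing (assumed optional); else use wall clock."""
    timings = reply.get("timings") or {}
    if "predicted_per_second" in timings:
        return float(timings["predicted_per_second"])
    return reply["usage"]["completion_tokens"] / elapsed_s


def fraction_of_benchmark(measured: float) -> float:
    """Measured rate as a fraction of the quoted 4K benchmark figure."""
    return measured / BENCHMARK_TOK_S


def measure(prompt: str, url: str = SERVER_URL, max_tokens: int = 256) -> float:
    """Send one request and return the generation rate in tokens per second."""
    payload = {
        "model": "qwen3-30b-a3b",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    start = time.perf_counter()
    with urllib.request.urlopen(request) as response:
        reply = json.load(response)
    return tokens_per_second(reply, time.perf_counter() - start)
```

A short prompt keeps the run close to the 4K benchmark conditions; `fraction_of_benchmark(measure("Hello"))` well below 1.0 at small context is worth investigating.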