What You'll Build
A local chat endpoint backed by Qwen3-30B-A3B — Alibaba's 30.5B-total Mixture-of-Experts model with 3.3B activated parameters per token — running on a single RTX 3090 via llama.cpp, Ollama, or LM Studio. At Q4 the weights are about 18.6 GB, so every layer lives on the 24 GB card with no CPU offload: this is the clean full-GPU path. The MoE design is what makes a 30B-class model this fast — only ~3.3B parameters fire per token, so generation stays quick even though all 128 experts stay resident.
Hardware data: RTX 3090 (24 GB VRAM) · 153.6 tok/s generation at 4K context (Q4_K) · See benchmark data
ℹ️ 24 GB lets this run fully on-GPU, with no offload needed. On a 12 GB card the same Q4 weights (~18.6 GB) overflow VRAM and force a CPU/GPU split; on the 3090 the entire model is resident, which is why this recipe leads with a plain -ngl 99 ("all layers on the GPU") launch and drops the offload tuning that smaller cards need.
ℹ️ All 30.5B parameters must be resident, not just the 3.3B "active". Qwen3-30B-A3B is marketed as 30.5B total / 3.3B activated per token. All 128 experts (8 routed per token) must stay in VRAM because the router picks experts at inference time — you cannot pre-prune them. The ~3.3B active figure governs speed; the 30.5B total governs fit.
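The fit-versus-speed split in the note above can be checked with a few lines of arithmetic. This is a rough sketch, not a measurement: it uses only the figures quoted in this recipe (30.5B total, 3.3B active, 18.6 GB Q4 file, 24 GB card), treats GB as decimal, and assumes the file size is spread evenly across all parameters. The function names are illustrative.

```python
# Back-of-the-envelope fit check using the figures quoted in this recipe.
# Assumption: the 18.6 GB file is spread evenly over all 30.5B parameters.
TOTAL_PARAMS = 30.5e9    # every expert, must be resident in VRAM
ACTIVE_PARAMS = 3.3e9    # parameters actually used per token
Q4_WEIGHTS_GB = 18.6     # Q4_K_M file size (decimal GB)
VRAM_GB = 24.0           # RTX 3090


def bits_per_weight(file_gb: float, params: float) -> float:
    """Average number of bits stored per parameter in a quantized file."""
    return file_gb * 1e9 * 8 / params


def footprint_gb(params: float, bits: float) -> float:
    """Weight footprint in decimal GB for a parameter count and bit width."""
    return params * bits / 8 / 1e9


bpw = bits_per_weight(Q4_WEIGHTS_GB, TOTAL_PARAMS)
headroom_gb = VRAM_GB - Q4_WEIGHTS_GB
active_gb = footprint_gb(ACTIVE_PARAMS, bpw)
bf16_gb = footprint_gb(TOTAL_PARAMS, 16)

print(f"average bits per weight at Q4: {bpw:.2f}")
print(f"headroom left on the card:     {headroom_gb:.1f} GB")
print(f"weights touched per token:     {active_gb:.2f} GB")
print(f"same model at BF16:            {bf16_gb:.1f} GB")
```

The first two numbers describe fit (what must be resident and what is left over), the third describes speed (how much is read per token), and the last shows why the unquantized BF16 weights are out of reach for a single 24 GB card.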
Requirements
| Component | Minimum | Tested |
|---|---|---|
| GPU | 20 GB VRAM | RTX 3090 (24 GB) |
| RAM | 16 GB system RAM | — |
| Storage | ~19 GB for the Q4_K_M MoE weights (per the GGUF tree) | — |
| Software | CUDA 12+; recent llama.cpp / Ollama / LM Studio | — |
Installation
Three paths are provided. Pick one. Ollama is the fastest route to a working chat session; the llama.cpp GGUF gives you the exact Q4_K quant tier the benchmark used; LM Studio is the GUI equivalent.
Path A — Ollama (recommended for first run)
ollama pull qwen3:30b-a3b-q4_K_M
ollama run qwen3:30b-a3b-q4_K_M
The qwen3:30b-a3b-q4_K_M tag is the ~18.6 GB 4-bit MoE build; first run downloads it and drops you into an interactive REPL. (The shorter qwen3:30b-a3b and qwen3:30b tags resolve to the same family if you would rather not pin the quant explicitly.)
Path B — llama.cpp with the canonical Q4_K_M GGUF
Download the Q4_K_M GGUF from Qwen's own Qwen/Qwen3-30B-A3B-GGUF repo — an 18.6 GB file published by the model authors:
# grab a recent llama.cpp build first: https://github.com/ggml-org/llama.cpp
huggingface-cli download Qwen/Qwen3-30B-A3B-GGUF \
  Qwen3-30B-A3B-Q4_K_M.gguf --local-dir ./qwen3-30b-a3b
Then serve it with every layer on the GPU and FlashAttention enabled:
llama-server -m ./qwen3-30b-a3b/Qwen3-30B-A3B-Q4_K_M.gguf \
  -ngl 99 -fa 1 -c 4096 --host 0.0.0.0 --port 8000
-ngl 99 offloads every layer to the 3090 — at ~18.6 GB the full model fits the 24 GB card, so no --n-cpu-moe / tensor-offload tuning is needed. -c 4096 matches the 4K context the benchmark used (push it higher as far as the leftover VRAM allows — see Troubleshooting).
Path C — LM Studio (GUI)
Search LM Studio's model browser for Qwen3-30B-A3B and pick a Q4_K_M GGUF (the canonical build above). Set GPU offload to "max" so all layers land on the 3090, then start a chat. LM Studio uses llama.cpp under the hood, so the runtime path is identical to Path B.
Running
Ollama (interactive):
ollama run qwen3:30b-a3b-q4_K_M "Explain mixture-of-experts routing in one paragraph."
llama.cpp (HTTP, OpenAI-compatible):
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3-30b-a3b",
    "messages": [{"role": "user", "content": "Explain mixture-of-experts routing in one paragraph."}]
  }'
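The same request can be sent from a script. The sketch below uses only the Python standard library and assumes the llama-server instance from Path B is listening on port 8000. The helper names (build_chat_request, extract_reply, chat) and the max_tokens default are illustrative choices, not part of llama.cpp.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000/v1"  # matches --port 8000 in Path B


def build_chat_request(prompt: str, model: str = "qwen3-30b-a3b",
                       max_tokens: int = 256) -> urllib.request.Request:
    """Build (but do not send) an OpenAI-style chat completion request."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,  # illustrative cap on reply length
    }
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_reply(response_body: dict) -> str:
    """Pull the assistant text out of an OpenAI-style response body."""
    return response_body["choices"][0]["message"]["content"]


def chat(prompt: str, timeout_s: float = 120.0) -> str:
    """Send one prompt to the local server and return the reply text."""
    request = build_chat_request(prompt)
    with urllib.request.urlopen(request, timeout=timeout_s) as resp:
        return extract_reply(json.load(resp))
```

With the server running, print(chat("Explain mixture-of-experts routing in one paragraph.")) should print the model's reply. Keeping the request builder separate from the network call makes the payload easy to inspect without a live server.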
All 30.5B parameters stay resident in VRAM regardless of path — the 3.3B "active" figure is a compute-per-token number (which experts the router fires), not a memory figure.
Results
- Generation speed: 153.6 tokens/s at 4K context (Q4_K), measured on RTX 3090 in the Hardware Corner gpu-llm-benchmarks "Qwen3 30B A3B (Q4_K)" row (CUDA, -fa 1) and recorded as the backend benchmark for this pair. See /check/qwen3-30b-a3b/rtx-3090.
- Prefill speed: 2,988.6 tokens/s at 4K context on the same Hardware Corner RTX 3090 row; prompt ingestion is fast because prefill is compute-bound and the 3090 keeps the whole MoE on-GPU.
- VRAM usage: 24.0 GB peak at 4K context per /check/qwen3-30b-a3b/rtx-3090. The Q4_K_M weights are ~18.6 GB on disk, so the remaining budget covers the KV-cache, activations, and CUDA context — comfortable at 4K, tighter as context grows.
- Quality notes: the canonical card lists a 32,768-token native context, extensible to 131,072 with YaRN; on a single 24 GB card you are KV-cache-bound well before the YaRN ceiling, so keep context modest (4K–16K) to stay within the card.
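To compare your own card against the VRAM figure above, one option is to poll nvidia-smi while the model is loaded. The sketch below wraps the standard nvidia-smi CSV query in a small parser. The function names are illustrative, and it assumes nvidia-smi is on the PATH and reports memory in MiB (its behaviour with nounits).

```python
import subprocess

# Standard nvidia-smi query: one "used, total" row per GPU, values in MiB.
QUERY = [
    "nvidia-smi",
    "--query-gpu=memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_memory_csv(text: str) -> list[tuple[int, int]]:
    """Parse 'used, total' rows (one per GPU) into (used_mib, total_mib)."""
    rows = []
    for line in text.strip().splitlines():
        used, total = (int(field) for field in line.split(","))
        rows.append((used, total))
    return rows


def read_gpu_memory() -> list[tuple[int, int]]:
    """Run nvidia-smi and return (used_mib, total_mib) for each GPU."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_memory_csv(result.stdout)
```

Calling read_gpu_memory() once before loading the model and again during generation gives a simple before/after view of how much of the card the weights and KV-cache occupy.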
If you have measured generation or prefill at a different context length on a 3090, please contribute it — first-party numbers replace the benchmark row above.
For the full benchmark data and side-by-side compare across cards, see /check/qwen3-30b-a3b/rtx-3090.
Troubleshooting
Out of memory at long context
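The Q4 weights leave only a few GB of headroom on the 24 GB card, and that headroom is the KV-cache budget. Pushing -c (context length) far past 16K grows the KV-cache and can OOM. Stay at 4K–16K on the 3090. Enabling FlashAttention (-fa 1 on llama.cpp; Ollama enables it by default) does not shrink the KV-cache itself, but it avoids materializing the full attention matrix, which trims the compute buffers and buys some room. It also lets llama.cpp quantize the cache (for example -ctk q8_0 -ctv q8_0), which roughly halves the KV-cache relative to the f16 default. If you have a working long-context configuration on a 3090, please contribute it.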
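How quickly the KV-cache eats into that headroom can be estimated from the attention shape. The sketch below is an estimate under stated assumptions: the layer count, KV-head count, and head dimension are taken from the published Qwen3-30B-A3B configuration rather than from this recipe, and the cache is assumed to be stored as f16 (2 bytes per element). Verify the values against the GGUF metadata before relying on them.

```python
# ASSUMED architecture values (published Qwen3-30B-A3B config, not this recipe).
N_LAYERS = 48      # transformer layers
N_KV_HEADS = 4     # key/value heads under grouped-query attention
HEAD_DIM = 128     # dimension of each head


def kv_cache_bytes(n_ctx: int, bytes_per_element: float = 2.0) -> float:
    """Bytes needed for keys plus values across all layers at n_ctx tokens."""
    # Factor 2 covers the separate key and value tensors.
    per_token = 2 * N_LAYERS * N_KV_HEADS * HEAD_DIM * bytes_per_element
    return per_token * n_ctx


for n_ctx in (4096, 16384, 32768, 131072):
    gib = kv_cache_bytes(n_ctx) / 2**30
    print(f"{n_ctx:>7} tokens -> {gib:.2f} GiB KV-cache at f16")
```

The estimate covers the cache alone. Compute buffers, activations, and the CUDA context come out of the same headroom, so treat the result as a lower bound on what a given -c value costs.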
"All 30.5B parameters must fit, not just 3.3B"
Qwen3-30B-A3B is 30.5B total / 3.3B activated per token, with 128 experts and 8 fired per token. Every expert must be resident in VRAM because the router selects them at inference time, so you cannot pre-prune. The 3.3B active figure governs speed (why generation is fast); the 30.5B total governs fit (why it needs the full Q4 footprint). Cards smaller than ~20 GB cannot hold the Q4 weights and must offload experts to CPU (a different, slower recipe).
Multi-GPU launch commands from the model card don't fit a single 3090
The official HF model card Quickstart shows transformers, vLLM, and SGLang on the full BF16 weights (~61 GB) — a server configuration, not a consumer single-card path. For one RTX 3090 use the 4-bit GGUF route (Path A/B/C above); the BF16 transformers/vLLM path does not fit 24 GB.
Generation slower than expected for the GPU class
Confirm every layer is on the GPU (-ngl 99 on llama.cpp; set GPU offload to "max" in LM Studio; Ollama does this automatically when the model fits) and that FlashAttention is active. LLM token generation is memory-bandwidth-bound, so the per-token rate drops as the KV-cache grows past 4K. If your numbers are still off, please report them.
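A rough bandwidth ceiling helps judge whether a measured rate is plausible. The sketch below assumes the RTX 3090's spec-sheet memory bandwidth of 936 GB/s (a figure not stated in this recipe) and assumes each generated token streams its share of the Q4 weights exactly once. Real kernels never reach this bound, so it is a sanity check rather than a prediction. The kv_read_gb parameter is a hypothetical knob for modelling extra KV-cache traffic.

```python
PEAK_BANDWIDTH_GBPS = 936.0  # ASSUMPTION: RTX 3090 spec-sheet memory bandwidth
Q4_WEIGHTS_GB = 18.6         # Q4_K_M file size quoted in this recipe
TOTAL_PARAMS = 30.5e9
ACTIVE_PARAMS = 3.3e9
MEASURED_TOK_S = 153.6       # benchmark figure quoted in Results


def ceiling_tok_s(kv_read_gb: float = 0.0) -> float:
    """Upper bound on tokens/s if each token streams its bytes exactly once."""
    # Share of the weight file that the active parameters account for.
    active_weights_gb = Q4_WEIGHTS_GB * ACTIVE_PARAMS / TOTAL_PARAMS
    return PEAK_BANDWIDTH_GBPS / (active_weights_gb + kv_read_gb)


ceiling = ceiling_tok_s()
ratio = MEASURED_TOK_S / ceiling
print(f"bandwidth ceiling:      {ceiling:.0f} tok/s")
print(f"measured / ceiling:     {ratio:.0%}")
print(f"ceiling with 1 GB of KV reads per token: {ceiling_tok_s(1.0):.0f} tok/s")
```

Under these assumptions the benchmark figure lands at roughly a third of the theoretical ceiling. The last line shows the direction of the effect described above: every extra byte of KV-cache read per token lowers the achievable rate.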