What You'll Build
A local chat endpoint backed by Qwen3-30B-A3B, Alibaba's 30.5B-total Mixture-of-Experts model with 3.3B activated parameters per token, running on a single RTX 3090 Ti in Q4_K quantization via Ollama, llama.cpp, or LM Studio. The MoE design is what makes a 30B model fast on a 24 GB card: only 3.3B parameters fire per token, so generation runs at interactive speed even though all 128 experts stay resident in VRAM.
Hardware data: RTX 3090 Ti (24 GB VRAM) · 166.9 tok/s generation at 4K context (Q4_K) · See benchmark data
ℹ️ The Q4_K weights fit fully, with no offload needed. At Q4_K_M the GGUF is ~18.6 GB on disk (per the bartowski tree), so the entire model lives on the 3090 Ti's 24 GB with room for the KV-cache. This recipe uses the clean full-GPU path (-ngl 99); there is no CPU-offload story here, unlike on smaller cards where the routed experts must spill to system RAM.
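The fit claim can be checked with a few lines of arithmetic. This is a rough sketch using only the figures quoted above; it assumes decimal gigabytes, treats the whole GGUF file as weights, and uses illustrative function names.

```python
# Rough fit check for the Q4_K_M GGUF on a 24 GB card.
# All figures come from the text above; decimal GB is an assumption.
TOTAL_PARAMS = 30.5e9   # total parameters, all experts included
GGUF_BYTES = 18.63e9    # Q4_K_M file size quoted for the bartowski build
VRAM_BYTES = 24e9       # nominal card capacity


def bits_per_weight(file_bytes: float, params: float) -> float:
    """Average storage cost per parameter, in bits."""
    return file_bytes * 8 / params


def headroom_gb(vram_bytes: float, file_bytes: float) -> float:
    """VRAM left for KV-cache, activations, and runtime buffers, in GB."""
    return (vram_bytes - file_bytes) / 1e9


bpw = bits_per_weight(GGUF_BYTES, TOTAL_PARAMS)
spare = headroom_gb(VRAM_BYTES, GGUF_BYTES)
print(f"{bpw:.2f} bits/weight, {spare:.2f} GB left after weights")
```

The average comes out a little under 5 bits per weight rather than exactly 4, which is consistent with a mixed-precision quant that keeps some tensors at higher precision.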
Requirements
| Component | Minimum | Tested |
|---|---|---|
| GPU | 24 GB VRAM | RTX 3090 Ti (24 GB) |
| RAM | 16 GB system RAM | — |
| Storage | ~19 GB for the Q4_K_M MoE weights (per the GGUF tree) | — |
| Software | CUDA 12+; recent Ollama / llama.cpp / LM Studio | — |
Installation
Three paths are provided. Pick one. Ollama is the fastest route to a working chat session; the llama.cpp Q4_K GGUF gives you the exact quant tier the benchmark used; LM Studio is the GUI equivalent.
Path A — Ollama (recommended for first run)
ollama pull qwen3:30b-a3b
ollama run qwen3:30b-a3b
The qwen3:30b-a3b tag is a 19 GB Q4_K_M MoE build; first run downloads it and drops you into an interactive REPL. (Ollama also publishes an explicit qwen3:30b-a3b-q4_K_M tag at the same quant if you want to pin it by name.)
Path B — llama.cpp with the Q4_K GGUF
Download the Q4_K_M MoE GGUF — the bartowski/Qwen_Qwen3-30B-A3B-GGUF build is an 18.63 GB file that links back to the canonical Qwen/Qwen3-30B-A3B (base_model_relation: quantized):
# grab a recent llama.cpp build first: https://github.com/ggml-org/llama.cpp
huggingface-cli download bartowski/Qwen_Qwen3-30B-A3B-GGUF \
Qwen_Qwen3-30B-A3B-Q4_K_M.gguf --local-dir ./qwen3-30b-a3b
Then serve it with all layers on the GPU and FlashAttention enabled:
llama-server -m ./qwen3-30b-a3b/Qwen_Qwen3-30B-A3B-Q4_K_M.gguf \
-ngl 99 -fa 1 -c 4096 --host 0.0.0.0 --port 8000
-ngl 99 offloads every layer to the 3090 Ti — the full model fits, so this is the whole story, no --n-cpu-moe spill. -c 4096 matches the 4K context the benchmark used (push it higher only as far as the leftover VRAM allows — see Troubleshooting).
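Loading ~18.6 GB of weights takes a while, so a script that talks to the server may want to wait until it is ready. The sketch below polls llama-server's /health endpoint, which answers 200 once the model is loaded; the helper names, timeout, and polling interval are illustrative assumptions.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def http_probe(url: str) -> bool:
    """Return True if a GET on `url` answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=2) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or a non-2xx status such as 503.
        return False


def wait_until_ready(url: str, timeout_s: float = 120.0, interval_s: float = 1.0,
                     probe: Callable[[str], bool] = http_probe) -> bool:
    """Poll `url` until the probe succeeds or the timeout elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        if probe(url):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)
```

With the command above running, `wait_until_ready("http://localhost:8000/health")` should return True once the model has finished loading.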
Path C — LM Studio (GUI)
Search LM Studio's model browser for Qwen3-30B-A3B and pick a Q4_K GGUF (the bartowski build above, or the lmstudio-community/Qwen3-30B-A3B-GGUF Q4_K_M at the same 18.63 GB). Set GPU offload to "max" so all layers land on the 3090 Ti, then start a chat. LM Studio uses llama.cpp under the hood, so the runtime path is identical to Path B.
Running
Ollama (interactive):
ollama run qwen3:30b-a3b "Explain mixture-of-experts routing in one paragraph."
llama.cpp (HTTP, OpenAI-compatible):
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "qwen3-30b-a3b",
"messages": [{"role": "user", "content": "Explain mixture-of-experts routing in one paragraph."}]
}'
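The same request can be sent from Python using only the standard library. This is a minimal sketch of the curl call above; BASE_URL assumes the --port 8000 server from Path B, and build_request and chat are illustrative helper names.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # assumes the Path B server on port 8000


def build_request(prompt: str, model: str = "qwen3-30b-a3b") -> urllib.request.Request:
    """Build the POST request for the OpenAI-compatible chat endpoint."""
    body = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        f"{BASE_URL}/v1/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt: str) -> str:
    """Send one user message and return the assistant's reply text."""
    with urllib.request.urlopen(build_request(prompt)) as resp:
        reply = json.load(resp)
    return reply["choices"][0]["message"]["content"]
```

Calling `chat("Explain mixture-of-experts routing in one paragraph.")` while the server is running returns the reply as a string.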
By default Qwen3 runs in thinking mode, emitting a <think>...</think> block before the final answer. The model card's enable_thinking flag defaults to True; passing enable_thinking=False to the chat template turns thinking off. All 30.5B parameters stay resident in VRAM regardless of path: the 3.3B "activated" figure is a compute-per-token number (which 8 of the 128 experts the router fires), not a memory figure.
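If a client only wants the final answer, the reasoning block can be separated after the fact. The sketch below assumes the reply text carries the <think>...</think> block inline, as described above; some server configurations may return the reasoning in a separate field, in which case this step is unnecessary. split_thinking is an illustrative name.

```python
import re

# Non-greedy match so only the first <think> block is captured.
_THINK_RE = re.compile(r"<think>(.*?)</think>\s*", re.DOTALL)


def split_thinking(text: str) -> tuple[str, str]:
    """Return (reasoning, answer) from a reply that may contain a <think> block."""
    match = _THINK_RE.search(text)
    if match is None:
        return "", text.strip()
    reasoning = match.group(1).strip()
    answer = (text[:match.start()] + text[match.end():]).strip()
    return reasoning, answer
```

For a reply such as `"<think>\n...\n</think>\n\nFinal text"`, the second element of the returned tuple is just the final text.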
Results
- Generation speed: 166.9 tokens/s at 4K context (Q4_K), measured on RTX 3090 Ti by Hardware Corner's gpu-llm-benchmarks. It scales down gracefully as context grows: 121.9 tok/s at 16K and 92.0 tok/s at 32K — the slow falloff is the MoE design paying off, since only 3.3B parameters are read per token.
- Prefill speed: 3,441.0 tokens/s at 4K context on the same Hardware Corner RTX 3090 Ti row (2,205.6 at 16K, 1,483.9 at 32K).
- VRAM usage: 24.0 GB peak at 4K context per /check/qwen3-30b-a3b/rtx-3090-ti. The ~18.6 GB Q4_K weights plus KV-cache and activations fit inside the 24 GB card with no offload required, but that peak leaves essentially no spare VRAM (see Troubleshooting).
- Quality notes: Q4_K keeps the model at 4-bit while preserving the higher-precision tensors that matter for output quality; the canonical model card lists a 32,768-token native context, extensible to 131,072 with YaRN, but on a single 24 GB card the KV-cache is the binding constraint at long context — keep it modest (4K–16K) to stay within the card.
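The benchmark rows above translate into practical numbers with a little arithmetic. This sketch uses only the Hardware Corner figures quoted in the list; treating each row's rate as constant across a whole request is a simplifying assumption, and the function names are illustrative.

```python
# Hardware Corner figures quoted above for the RTX 3090 Ti (Q4_K), in tokens/s.
GEN_TPS = {4096: 166.9, 16384: 121.9, 32768: 92.0}
PREFILL_TPS = {4096: 3441.0, 16384: 2205.6, 32768: 1483.9}


def retained(tps: dict[int, float], ctx: int, base_ctx: int = 4096) -> float:
    """Fraction of the base-context rate that survives at `ctx`."""
    return tps[ctx] / tps[base_ctx]


def response_seconds(ctx: int, prompt_tokens: int, output_tokens: int) -> float:
    """Approximate wall time: prefill the prompt, then generate the reply."""
    return prompt_tokens / PREFILL_TPS[ctx] + output_tokens / GEN_TPS[ctx]


for ctx in GEN_TPS:
    print(f"{ctx:>6} ctx: generation keeps {retained(GEN_TPS, ctx):.0%} of the 4K rate")
print(f"{response_seconds(4096, 3000, 500):.1f} s for a 3,000-token prompt and a 500-token reply")
```

Generation time dominates for typical chat turns, because prefill runs roughly twenty times faster than generation at 4K.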
For the full benchmark data and side-by-side compare across cards, see /check/qwen3-30b-a3b/rtx-3090-ti.
Troubleshooting
KeyError: 'qwen3_moe' on model load
The Qwen3-MoE architecture needs a recent transformers. The model card warns that transformers<4.51.0 raises KeyError: 'qwen3_moe'. This only affects the raw transformers path; the GGUF routes (Ollama / llama.cpp / LM Studio) sidestep it entirely, which is another reason to prefer them on a single consumer card. If you do run the model card's Python snippet, run pip install -U transformers first.
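A script can check the installed version before loading the model. This is a minimal sketch of the 4.51.0 threshold described above; the parser is deliberately simple (it ignores pre-release ordering) and both function names are illustrative.

```python
MIN_TRANSFORMERS = (4, 51, 0)  # first release that knows the qwen3_moe architecture


def parse_version(version: str) -> tuple[int, ...]:
    """Turn '4.51.3' or '4.52.0.dev0' into a three-part tuple of integers."""
    parts: list[int] = []
    for piece in version.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break  # stop at suffixes such as 'dev0'
        parts.append(int(digits))
    parts = parts[:3]
    parts += [0] * (3 - len(parts))  # pad '4.51' to (4, 51, 0)
    return tuple(parts)


def supports_qwen3_moe(version: str) -> bool:
    """True if this transformers version should load Qwen3-MoE checkpoints."""
    return parse_version(version) >= MIN_TRANSFORMERS
```

In practice the version string would come from `importlib.metadata.version("transformers")`.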
Out of memory at long context
The 4K-context benchmark peaks at 24.0 GB on the RTX 3090 Ti, so the card is essentially full even though the weights are only ~18.6 GB; the remainder is KV-cache and activations. Raising -c (context length) grows the KV-cache and will eventually OOM. Stay at 4K–16K on the 3090 Ti as a safe default (the Hardware Corner rows in Results reach 32K on this card, and nothing beyond that is measured here); the full 131K YaRN context needs a bigger card. If you have measured a working long-context configuration on a 3090 Ti, please contribute it.
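How fast the KV-cache grows with -c can be estimated from the model's attention shape. The sketch below assumes 48 layers, 4 KV heads, a head dimension of 128, and an FP16 cache; these values are not stated in this recipe, so check them against the model's config.json and your runtime's cache type. It covers only the KV-cache term, not activations or other runtime buffers.

```python
# Assumed architecture values; verify against config.json before relying on them.
N_LAYERS = 48        # transformer layers
N_KV_HEADS = 4       # grouped-query attention KV heads
HEAD_DIM = 128       # dimension per head
BYTES_PER_ELEM = 2   # FP16 cache entries (assumed cache type)


def kv_cache_bytes(context_tokens: int) -> int:
    """Bytes needed to cache keys and values for `context_tokens` tokens."""
    # The leading 2 accounts for storing both keys and values.
    return 2 * N_LAYERS * N_KV_HEADS * HEAD_DIM * BYTES_PER_ELEM * context_tokens


for ctx in (4096, 16384, 32768, 131072):
    print(f"{ctx:>7} tokens -> {kv_cache_bytes(ctx) / 2**30:.2f} GiB of KV-cache")
```

Under these assumptions the cache grows linearly at 96 KiB per token, which is why long contexts, not the weights, are what eventually exhaust the card.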
"All 30B parameters must fit, not just 3B"
Qwen3-30B-A3B is marketed as 30.5B total / 3.3B activated per token. All 128 experts must be resident in VRAM because the router picks 8 of them at inference time — you cannot pre-prune them. The 3.3B active figure governs speed (why generation is fast), the 30.5B total governs fit (why it needs ~18.6 GB at Q4_K). That fit still clears the 24 GB card with headroom; sub-24 GB cards need either a smaller quant or MoE CPU-offload, which is a different recipe.
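A toy router makes the residency argument concrete. This is not Qwen's implementation: the scores are random rather than produced by a learned layer, and the softmax-then-top-k-then-renormalise order is an assumption. It only illustrates that each token uses 8 experts while a short run of tokens ends up touching nearly all 128.

```python
import math
import random

N_EXPERTS = 128   # experts per MoE layer, per the text above
TOP_K = 8         # experts activated per token, per the text above


def route(logits: list[float], k: int = TOP_K) -> dict[int, float]:
    """Pick the k highest-scoring experts and renormalise their softmax weights."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]  # numerically stable softmax numerators
    chosen = sorted(range(len(logits)), key=lambda i: exps[i], reverse=True)[:k]
    total = sum(exps[i] for i in chosen)
    return {i: exps[i] / total for i in chosen}


rng = random.Random(0)
used: set[int] = set()
for _ in range(200):  # 200 hypothetical tokens with random router scores
    logits = [rng.gauss(0.0, 1.0) for _ in range(N_EXPERTS)]
    used.update(route(logits))
print(f"{len(used)} of {N_EXPERTS} experts were needed across 200 tokens")
```

Because the chosen set changes from token to token, no fixed subset of experts can be left out of VRAM ahead of time.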
Multi-GPU launch commands from the model card don't fit a single 3090 Ti
The official HF model card's Quickstart shows the BF16 transformers path (~61 GB of weights) and references vLLM/SGLang server deployments — those are multi-GPU or large-card configurations, not a consumer single-card path. For one RTX 3090 Ti use the Q4_K GGUF route (Path A/B/C above); the BF16 transformers path does not fit 24 GB.
Generation slower than expected for the GPU class
Confirm FlashAttention is active (-fa 1 on llama.cpp; on Ollama, set OLLAMA_FLASH_ATTENTION=1 if your version does not enable it by default) and that you are at small context. LLM token generation is memory-bandwidth-bound, so the per-token rate drops as the KV-cache grows past 4K, exactly as the Hardware Corner table shows (166.9 → 92.0 tok/s going from 4K to 32K). If your numbers are still off, please report them.
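The bandwidth-bound argument can be turned into a rough upper bound. The sketch below assumes about 1008 GB/s of memory bandwidth for the RTX 3090 Ti, the same per-token KV-cache size as a 48-layer, 4-KV-head, 128-dimension FP16 cache, and that every generated token reads the active weights plus the whole cache exactly once. Real throughput sits well below this ceiling; only the downward trend with context is the point.

```python
BANDWIDTH_BPS = 1008e9                   # assumed RTX 3090 Ti memory bandwidth, bytes/s
ACTIVE_PARAMS = 3.3e9                    # parameters activated per token, per the text
BITS_PER_WEIGHT = 18.63e9 * 8 / 30.5e9   # average implied by the Q4_K_M file size
KV_BYTES_PER_TOKEN = 98_304              # assumed: 2 (K,V) x 48 layers x 4 heads x 128 dims x 2 bytes


def ceiling_tps(context_tokens: int) -> float:
    """Idealised tokens/s if generation were limited only by memory reads."""
    weight_bytes = ACTIVE_PARAMS * BITS_PER_WEIGHT / 8
    kv_bytes = KV_BYTES_PER_TOKEN * context_tokens
    return BANDWIDTH_BPS / (weight_bytes + kv_bytes)


for ctx in (4096, 16384, 32768):
    print(f"{ctx:>6} ctx: ceiling of about {ceiling_tps(ctx):.0f} tok/s")
```

If measured speed falls far short of the benchmark rows as well as this ceiling, the cause is more likely configuration (layers left on the CPU, FlashAttention off) than the hardware limit.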