Qwen3-30B-A3B on RTX 3090 Ti: 167 tok/s MoE Chat That Fits the Full 24 GB Card

llm · intermediate · 24GB+ VRAM · Jun 27, 2026

This intermediate recipe sets up Qwen3 30B-A3B on the RTX 3090 Ti, needing about 24 GB of VRAM.

prerequisites
  • NVIDIA RTX 3090 Ti (24 GB VRAM) — this is a 24 GB-tier recipe; the Q4_K weights fit fully with no CPU offload
  • Recent NVIDIA driver with CUDA 12+
  • ~19 GB free disk for the Q4_K_M MoE weights

What You'll Build

A local chat endpoint backed by Qwen3-30B-A3B (Alibaba's 30.5B-total Mixture-of-Experts model with 3.3B activated parameters per token), running on a single RTX 3090 Ti in Q4_K quantization via Ollama, llama.cpp, or LM Studio. The MoE design is what makes a 30B model fast on a 24 GB card: only 3.3B parameters fire per token, so generation runs at interactive speed even though all 128 experts stay resident in VRAM.

Hardware data: RTX 3090 Ti (24 GB VRAM) · 166.9 tok/s generation at 4K context (Q4_K) · See benchmark data

ℹ️ The Q4_K weights fit fully — no offload needed. At Q4_K_M the GGUF is ~18.6 GB on disk (per the bartowski tree), so the entire model lives on the 3090 Ti's 24 GB with room for the KV-cache. This recipe uses the clean full-GPU path (-ngl 99); there is no CPU-offload story here, unlike on smaller cards where the routed experts must spill to system RAM.

Requirements

Component | Minimum | Tested
GPU | 24 GB VRAM | RTX 3090 Ti (24 GB)
RAM | 16 GB system RAM |
Storage | ~19 GB for the Q4_K_M MoE weights (per the GGUF tree) |
Software | CUDA 12+; recent Ollama / llama.cpp / LM Studio |

Installation

Three paths are provided. Pick one. Ollama is the fastest route to a working chat session; the llama.cpp Q4_K GGUF gives you the exact quant tier the benchmark used; LM Studio is the GUI equivalent.

Path A — Ollama (recommended for first run)

ollama pull qwen3:30b-a3b
ollama run qwen3:30b-a3b

The qwen3:30b-a3b tag is a 19 GB Q4_K_M MoE build; first run downloads it and drops you into an interactive REPL. (Ollama also publishes an explicit qwen3:30b-a3b-q4_K_M tag at the same quant if you want to pin it by name.)

Path B — llama.cpp with the Q4_K GGUF

Download the Q4_K_M MoE GGUF — the bartowski/Qwen_Qwen3-30B-A3B-GGUF build is an 18.63 GB file that links back to the canonical Qwen/Qwen3-30B-A3B (base_model_relation: quantized):

# grab a recent llama.cpp build first: https://github.com/ggml-org/llama.cpp
huggingface-cli download bartowski/Qwen_Qwen3-30B-A3B-GGUF \
    Qwen_Qwen3-30B-A3B-Q4_K_M.gguf --local-dir ./qwen3-30b-a3b

Then serve it with all layers on the GPU and FlashAttention enabled:

llama-server -m ./qwen3-30b-a3b/Qwen_Qwen3-30B-A3B-Q4_K_M.gguf \
    -ngl 99 -fa 1 -c 4096 --host 0.0.0.0 --port 8000

-ngl 99 offloads every layer to the 3090 Ti — the full model fits, so this is the whole story, no --n-cpu-moe spill. -c 4096 matches the 4K context the benchmark used (push it higher only as far as the leftover VRAM allows — see Troubleshooting).

Path C — LM Studio (GUI)

Search LM Studio's model browser for Qwen3-30B-A3B and pick a Q4_K GGUF (the bartowski build above, or the lmstudio-community/Qwen3-30B-A3B-GGUF Q4_K_M at the same 18.63 GB). Set GPU offload to "max" so all layers land on the 3090 Ti, then start a chat. LM Studio uses llama.cpp under the hood, so the runtime path is identical to Path B.

Running

Ollama (interactive):

ollama run qwen3:30b-a3b "Explain mixture-of-experts routing in one paragraph."

llama.cpp (HTTP, OpenAI-compatible):

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3-30b-a3b",
    "messages": [{"role": "user", "content": "Explain mixture-of-experts routing in one paragraph."}]
  }'

By default Qwen3 runs in thinking mode, emitting a <think>...</think> block before the final answer (thinking is on under the model card's default enable_thinking=True; setting enable_thinking=False turns it off). All 30.5B parameters stay resident in VRAM regardless of path: the 3.3B "activated" figure is a compute-per-token number (reflecting the 8 of the 128 experts the router fires), not a memory figure.
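A minimal Python sketch of the same request, using only the standard library. It assumes the llama-server from Path B is listening on localhost:8000. The helper names (build_payload, strip_think, chat) are made up for illustration. To skip the reasoning for a single turn it relies on the /no_think soft switch that the Qwen3 model card describes; whether the <think> block appears inline in the reply or is split out by the server can depend on the build and its flags, so strip_think is written to be harmless when no block is present.

```python
import json
import re
import urllib.request

# Assumption: llama-server from Path B is listening on this address.
ENDPOINT = "http://localhost:8000/v1/chat/completions"

# Matches one <think>...</think> block plus the whitespace that follows it.
THINK_BLOCK = re.compile(r"<think>.*?</think>\s*", re.DOTALL)


def build_payload(prompt: str, thinking: bool = True) -> dict:
    """Build an OpenAI-style chat request body.

    With thinking=False, append Qwen3's /no_think soft switch to the prompt.
    """
    content = prompt if thinking else f"{prompt} /no_think"
    return {
        "model": "qwen3-30b-a3b",
        "messages": [{"role": "user", "content": content}],
    }


def strip_think(text: str) -> str:
    """Drop any <think>...</think> block and return only the final answer."""
    return THINK_BLOCK.sub("", text).strip()


def chat(prompt: str, thinking: bool = True) -> str:
    """POST the request and return the answer text without the think block."""
    body = json.dumps(build_payload(prompt, thinking)).encode("utf-8")
    req = urllib.request.Request(
        ENDPOINT, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        reply = json.load(resp)
    return strip_think(reply["choices"][0]["message"]["content"])
```

With the server running, chat("Explain mixture-of-experts routing in one paragraph.", thinking=False) should return just the answer paragraph.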

Results

  • Generation speed: 166.9 tokens/s at 4K context (Q4_K), measured on RTX 3090 Ti by Hardware Corner's gpu-llm-benchmarks. It scales down gracefully as context grows: 121.9 tok/s at 16K and 92.0 tok/s at 32K — the slow falloff is the MoE design paying off, since only 3.3B parameters are read per token.
  • Prefill speed: 3,441.0 tokens/s at 4K context on the same Hardware Corner RTX 3090 Ti row (2,205.6 at 16K, 1,483.9 at 32K).
  • VRAM usage: 24.0 GB peak at 4K context per /check/qwen3-30b-a3b/rtx-3090-ti. The ~18.6 GB Q4_K weights plus KV-cache and activations fit inside the 24 GB card with no offload required, but that peak leaves essentially no spare VRAM.
  • Quality notes: Q4_K keeps the model at 4-bit while preserving the higher-precision tensors that matter for output quality; the canonical model card lists a 32,768-token native context, extensible to 131,072 with YaRN, but on a single 24 GB card the KV-cache is the binding constraint at long context — keep it modest (4K–16K) to stay within the card.

For the full benchmark data and side-by-side compare across cards, see /check/qwen3-30b-a3b/rtx-3090-ti.
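To make the numbers above concrete, here is a rough sketch that turns the quoted rates into wait times. The rates are copied from the Results list; the function names are illustrative, and the model is deliberately simple (it treats each rate as constant over the whole prompt or reply, which real runs will not match exactly).

```python
# Rates copied from the Results list (tokens/s, RTX 3090 Ti, Q4_K).
GEN_TOK_S = {"4K": 166.9, "16K": 121.9, "32K": 92.0}
PREFILL_TOK_S = {"4K": 3441.0, "16K": 2205.6, "32K": 1483.9}


def generation_seconds(reply_tokens: int, context: str) -> float:
    """Estimated time to generate reply_tokens at the given context row."""
    return reply_tokens / GEN_TOK_S[context]


def prefill_seconds(prompt_tokens: int, context: str) -> float:
    """Estimated time to process prompt_tokens at the given context row."""
    return prompt_tokens / PREFILL_TOK_S[context]


def retained_speed(context: str, baseline: str = "4K") -> float:
    """Fraction of the baseline generation speed kept at a longer context."""
    return GEN_TOK_S[context] / GEN_TOK_S[baseline]


for context in GEN_TOK_S:
    print(
        f"{context}: 500-token reply in ~{generation_seconds(500, context):.1f} s, "
        f"{retained_speed(context):.0%} of the 4K generation speed"
    )
```

By this estimate a 500-token reply takes about 3 seconds at 4K and a little over 5 seconds at 32K, since generation at 32K keeps roughly 55% of the 4K rate.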

Troubleshooting

KeyError: 'qwen3_moe' on model load

The Qwen3-MoE architecture needs a recent transformers release. The model card warns that transformers<4.51.0 raises KeyError: 'qwen3_moe'. This only affects the raw transformers path; the GGUF routes (Ollama / llama.cpp / LM Studio) sidestep it entirely, which is another reason to prefer them on a single consumer card. If you do run the model card's Python snippet, run pip install -U transformers first.

Out of memory at long context

The 4K-context benchmark peaks at 24.0 GB on the RTX 3090 Ti, so the card is essentially full even though the weights are only ~18.6 GB — the remainder is KV-cache and activations. Pushing -c (context length) higher grows the KV-cache and will eventually OOM. Stay at 4K–16K on the 3090 Ti; if you need the full 131K YaRN context, that needs a bigger card. If you have measured a working long-context configuration on a 3090 Ti, please contribute it.

"All 30B parameters must fit, not just 3B"

Qwen3-30B-A3B is marketed as 30.5B total / 3.3B activated per token. All 128 experts must be resident in VRAM because the router picks 8 of them at inference time — you cannot pre-prune them. The 3.3B active figure governs speed (why generation is fast), the 30.5B total governs fit (why it needs ~18.6 GB at Q4_K). That fit still clears the 24 GB card with headroom; sub-24 GB cards need either a smaller quant or MoE CPU-offload, which is a different recipe.
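The fit-versus-speed split can be checked with back-of-the-envelope arithmetic. This sketch uses only figures quoted in this recipe (30.5B total, 3.3B activated, the 18.63 GB Q4_K_M file) and treats GB as 10^9 bytes. The per-token figure is a rough assumption: it applies the file's average bit width to the activated parameters and ignores that different tensors are quantized at different widths.

```python
TOTAL_PARAMS_B = 30.5   # total parameters, in billions
ACTIVE_PARAMS_B = 3.3   # activated parameters per token, in billions
Q4_K_M_FILE_GB = 18.63  # size of the Q4_K_M GGUF file


def weights_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight size in GB for a given average bit width."""
    return params_billion * bits_per_weight / 8


def bits_per_weight(params_billion: float, size_gb: float) -> float:
    """Average bits per weight implied by a file size."""
    return size_gb * 8 / params_billion


bpw = bits_per_weight(TOTAL_PARAMS_B, Q4_K_M_FILE_GB)
print(f"Q4_K_M average: ~{bpw:.2f} bits per weight")
print(f"BF16 weights:   ~{weights_gb(TOTAL_PARAMS_B, 16):.0f} GB (does not fit 24 GB)")
print(f"Resident:       ~{weights_gb(TOTAL_PARAMS_B, bpw):.1f} GB (governs fit)")
print(f"Read per token: ~{weights_gb(ACTIVE_PARAMS_B, bpw):.1f} GB (governs speed)")
```

The gap between the roughly 18.6 GB that must stay resident and the roughly 2 GB touched per token is the reason a 30B-class model can generate quickly on this card.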

Multi-GPU launch commands from the model card don't fit a single 3090 Ti

The official HF model card's Quickstart shows the BF16 transformers path (~61 GB of weights) and references vLLM/SGLang server deployments — those are multi-GPU or large-card configurations, not a consumer single-card path. For one RTX 3090 Ti use the Q4_K GGUF route (Path A/B/C above); the BF16 transformers path does not fit 24 GB.

Generation slower than expected for the GPU class

Confirm FlashAttention is active (-fa 1 on llama.cpp; on Ollama, set OLLAMA_FLASH_ATTENTION=1 if your version does not already enable it) and that you are testing at a small context. LLM token generation is memory-bandwidth-bound, so the per-token rate drops mechanically as the KV-cache grows past 4K, exactly as the Hardware Corner table shows (166.9 → 92.0 tok/s walking from 4K to 32K). If your numbers are still off, please report them.
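Before reporting, it helps to have a number of your own to compare against the table. The sketch below is one way to get it from the Path B server using only the standard library. It assumes the server is on localhost:8000, and it assumes (unverified for every build) that the reply may carry a "timings" object with a predicted_per_second field; when that is missing it falls back to completion tokens divided by wall-clock time, which includes prefill and so reads a little low. The function names are illustrative.

```python
import json
import time
import urllib.request

# Assumption: llama-server from Path B is listening on this address.
ENDPOINT = "http://localhost:8000/v1/chat/completions"


def tokens_per_second(reply: dict, elapsed_s: float) -> float:
    """Return generation speed for one chat-completions reply.

    Prefers a server-reported rate if a "timings" object is present
    (an assumption about the reply shape); otherwise divides the
    standard usage.completion_tokens count by wall-clock seconds.
    """
    timings = reply.get("timings") or {}
    if "predicted_per_second" in timings:
        return float(timings["predicted_per_second"])
    return reply["usage"]["completion_tokens"] / elapsed_s


def measure(prompt: str, max_tokens: int = 256) -> float:
    """Send one request and return the observed tokens per second."""
    body = json.dumps({
        "model": "qwen3-30b-a3b",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }).encode("utf-8")
    req = urllib.request.Request(
        ENDPOINT, data=body, headers={"Content-Type": "application/json"}
    )
    start = time.perf_counter()
    with urllib.request.urlopen(req) as resp:
        reply = json.load(resp)
    return tokens_per_second(reply, time.perf_counter() - start)
```

Run measure(...) a few times with a short prompt and take the typical value; a single request is noisy, and a short prompt keeps the comparison close to the small-context rows of the table.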

common questions
How much VRAM does Qwen3 30B-A3B need?

About 24 GB — the minimum this recipe targets.

Which GPUs is Qwen3 30B-A3B tested on?

RTX 3090 Ti (24 GB).

How hard is this setup?

Intermediate — follow the steps above.