How much VRAM does Qwen3 30B-A3B need?

About 12 GB — the minimum this recipe targets.

How hard is this setup?

Intermediate — follow the steps above.

Qwen3-30B-A3B on RTX 4070: Local MoE LLM via CPU-Offload

What You'll Build

A local install of Qwen3-30B-A3B running on a 12GB RTX 4070. This is a Mixture-of-Experts (MoE) model: 30.5B total parameters but only 3.3B activated per token (128 experts, 8 routed per token, 48 layers). It is Apache-2.0 licensed and supports both a thinking mode for step-by-step reasoning and a non-thinking mode for fast dialogue, with 32K native context extensible to 131K with YaRN.

The catch: the Q4_K_M GGUF weights are ~18.6GB on disk, so the full model does not fit in 12GB of VRAM. The MoE-on-small-VRAM approach is to keep attention and the KV cache on the GPU and offload the expert FFN layers to CPU/RAM. Because only 3.3B params are active per token, the CPU-resident experts stay fast enough to be usable. Ollama does this partial offload automatically; llama.cpp exposes explicit flags (--cpu-moe, --n-cpu-moe, -ot) for it.

No verified speed numbers yet. The backend has no benchmark for this exact pair — /check/qwen3-30b-a3b/rtx-4070 currently returns verdict: unknown with zero benchmark rows. Throughput on a 12GB card depends heavily on your RAM bandwidth and how many expert layers you can keep on the GPU, so we don't quote a tok/s figure here. If you measure it on your own 4070, please contribute your result →.

Requirements

Component	Value
GPU	RTX 4070 (12GB)
VRAM used	up to 12GB (attention + KV cache + as many layers as fit)
System RAM	32GB+ recommended (experts are offloaded to RAM)
Storage	~19GB for the Q4_K_M GGUF
License	Apache-2.0

The full Q4_K_M file is ~18.6GB (verified across the official Qwen/Qwen3-30B-A3B-GGUF, unsloth, and bartowski repos). Smaller quants (Q3, Q2) reduce both disk and the RAM/VRAM footprint at some quality cost.

Installation

Option A: Ollama (Easiest — automatic partial offload)

Ollama detects available VRAM and automatically splits the model between GPU and CPU/RAM, so no offload flags are needed.

# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh

# Pull Qwen3-30B-A3B (Q4_K_M, ~19GB download)
ollama pull qwen3:30b-a3b

# Run
ollama run qwen3:30b-a3b

The qwen3:30b-a3b and qwen3:30b-a3b-q4_K_M tags both resolve to the same ~19GB Q4_K_M build on ollama.com/library/qwen3.

Option B: llama.cpp (explicit expert offload)

# Clone and build llama.cpp with CUDA
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j$(nproc)

# Download the Q4_K_M GGUF (~18.6GB)
huggingface-cli download Qwen/Qwen3-30B-A3B-GGUF \
  Qwen3-30B-A3B-Q4_K_M.gguf --local-dir ./models/

Running

Ollama

Just ollama run qwen3:30b-a3b — it offloads what fits to the GPU and keeps the rest in RAM automatically.

llama.cpp with expert offload

Because the full model won't fit in 12GB, route the GPU layers to the device and push the MoE expert tensors to the CPU. Two equivalent approaches:

# Approach 1: keep ALL expert (FFN) weights on CPU, everything else on GPU
./build/bin/llama-server \
  -m models/Qwen3-30B-A3B-Q4_K_M.gguf \
  -ngl 999 \
  --cpu-moe \
  --ctx-size 32768 \
  --flash-attn \
  --port 8080

# Approach 2: keep experts of the first N layers on CPU; tune N to fit 12GB
./build/bin/llama-server \
  -m models/Qwen3-30B-A3B-Q4_K_M.gguf \
  -ngl 999 \
  --n-cpu-moe 48 \
  --ctx-size 32768 \
  --flash-attn \
  --port 8080

--cpu-moe keeps all MoE expert weights in CPU/RAM; --n-cpu-moe N keeps only the experts of the first N layers in CPU, letting you push more onto the GPU as VRAM allows. For finer control, --override-tensor (-ot) with a pattern like "\.ffn_.*_exps\.weight=CPU" does the same expert-to-CPU routing. Lower N until you OOM, then back off — that's the sweet spot for your card. See the llama.cpp MoE-offload guide for the full method.

Long context (YaRN)

Qwen3-30B-A3B is 32K-native and extends to 131,072 tokens with YaRN. Enable it per the model card only when you actually need >32K — the larger KV cache competes with the GPU layers for your 12GB.

Performance / Results

There is no measured throughput for Qwen3-30B-A3B on the RTX 4070 in our data: /check/qwen3-30b-a3b/rtx-4070 returns verdict: unknown with an empty benchmark list. Speed for the offload path is dominated by how much of the model you can keep on the GPU and by your system's RAM bandwidth, so any single tok/s number would be misleading.

If you run this configuration, your numbers are genuinely useful — please contribute a benchmark → so we can replace this section with verified data.

Troubleshooting

Out of memory on load: the full Q4_K_M won't fit 12GB. Use Ollama (auto-offload) or, in llama.cpp, add --cpu-moe (all experts to RAM) or lower --n-cpu-moe N. A smaller quant (Q3_K_M) reduces the footprint further at some quality cost.

Very slow generation: confirm the non-expert layers are actually on the GPU (-ngl 999) and that flash attention is on (--flash-attn). If you offloaded too many layers to CPU, raise the GPU share by lowering --n-cpu-moe.

Garbled or off-format output: Qwen3 uses a specific chat template. With llama-server, use its OpenAI-compatible /v1/chat/completions endpoint so the template is applied correctly. Pick a thinking or non-thinking tag/mode deliberately — mixing them can confuse formatting.

Wrong model expectations: this is the 30.5B-total / 3.3B-active Qwen3 MoE — not a 35B model and not a dense 30B. Reasoning and coding quality reflect the 3.3B active compute path with MoE routing, per the Qwen model card.