What You'll Build
Run Qwen3-30B-A3B (35B total params, 3B active — a Mixture of Experts model) locally on your RTX 4070 at 80 tokens/second using llama.cpp with Multi-Token Prediction (MTP).
Benchmark: 69–82 tok/s across tasks · 12GB VRAM · See full data →
What's Qwen3 MoE? 35B total parameters but only 3B active per token — you get 70B-class reasoning at ~10B compute cost. A massive efficiency win for home hardware.
Requirements
| Component | Value |
|---|---|
| GPU | RTX 4070 (12GB) tested |
| VRAM | 12GB (fits the full Q4 quant) |
| RAM | 16GB+ |
| Storage | ~20GB for Q4_K_M weights |
Installation
Option A: Ollama (Easiest)
# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh
# Pull Qwen3 30B-A3B
ollama pull qwen3:30b-a3b-q4_k_m
# Run
ollama run qwen3:30b-a3b-q4_k_m
Option B: llama.cpp (Fastest — enables MTP)
# Clone and build llama.cpp
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j$(nproc)
# Download quantized weights
huggingface-cli download Qwen/Qwen3-30B-A3B-GGUF \
qwen3-30b-a3b-q4_k_m.gguf --local-dir ./models/
Running with MTP (Recommended for Speed)
Multi-Token Prediction dramatically increases throughput on Qwen3:
./build/bin/llama-server \
-m models/qwen3-30b-a3b-q4_k_m.gguf \
-ngl 99 \
--port 8080 \
--ctx-size 32768
For maximum speed with TurboQuant (community fork):
./build/bin/llama-cli \
-m models/qwen3-30b-a3b-q4_k_m.gguf \
-ngl 99 \
-fitt 1536 \
--temp 0.7 \
-p "Your prompt here"
The -fitt 1536 flag balances GPU/CPU offloading for optimal RTX 4070 performance.
Performance by Task
All benchmarks on RTX 4070 (12GB), llama.cpp with MTP enabled:
| Task | Speed |
|---|---|
| Code generation (Python) | 80.8 tok/s |
| Code generation (C++) | 81.8 tok/s |
| Translation | 81.9 tok/s |
| Q&A (factual) | 77.8 tok/s |
| Math (step-by-step) | 76.5 tok/s |
| Summarization | 75.4 tok/s |
| Creative writing (short) | 69.2 tok/s |
| Code review (long) | 73.2 tok/s |
Source: community benchmarks with llama.cpp MTP. Full data →
vs Vanilla llama.cpp
Without MTP: ~8 tok/s (generation speed)
With TurboQuant MTP fork: ~22 tok/s
With full MTP optimization: 80+ tok/s
This ~10× speedup is specific to Qwen3's architecture and MTP support in llama.cpp.
Context Length
Qwen3-30B-A3B supports up to 32K context natively. With flash attention enabled:
./llama-server -m model.gguf -ngl 99 --ctx-size 32768 --flash-attn
At 128K context (reported by community): 21 tok/s — still very usable for long documents.
Quantization Options
| Quant | VRAM | Quality | Speed |
|---|---|---|---|
| Q8_0 | ~20GB | Best | Slightly slower |
| Q4_K_M | 12GB | Excellent | Recommended |
| Q3_K_M | ~9GB | Good | Fast |
| Q2_K | ~7GB | Fair | Fastest |
For RTX 4070 with 12GB: Q4_K_M is ideal — fits fully in VRAM.
Use Cases
Qwen3-35B excels at:
- Coding: Top-tier for code completion and debugging at this size
- Reasoning: Supports "thinking mode" for complex problems
- Multilingual: Native Chinese + English, strong in 30+ languages
- Long context: 32K+ token context for long document analysis
Troubleshooting
Model too slow (< 5 tok/s): Check if model is loading on GPU. Add -ngl 99 to offload all layers.
OOM at 12GB: Try Q3_K_M quantization (needs ~9GB), or add -ngl 80 to partially offload.
Incorrect outputs: Qwen3 uses a specific chat template. Use llama-server with OpenAI-compatible API for correct formatting.