self-hosted/ai
§01·recipe · llm

Qwen3 35B MoE on RTX 4070: 80 tok/s Local LLM Guide

llmintermediate12GB+ VRAMMay 13, 2026
tools
prerequisites
  • NVIDIA RTX 4070 (12GB VRAM) or similar
  • llama.cpp or Ollama installed
  • Python 3.10+ (optional, for Python API)

What You'll Build

Run Qwen3-30B-A3B (35B total params, 3B active — a Mixture of Experts model) locally on your RTX 4070 at 80 tokens/second using llama.cpp with Multi-Token Prediction (MTP).

Benchmark: 69–82 tok/s across tasks · 12GB VRAM · See full data →

What's Qwen3 MoE? 35B total parameters but only 3B active per token — you get 70B-class reasoning at ~10B compute cost. A massive efficiency win for home hardware.

Requirements

ComponentValue
GPURTX 4070 (12GB) tested
VRAM12GB (fits the full Q4 quant)
RAM16GB+
Storage~20GB for Q4_K_M weights

Installation

Option A: Ollama (Easiest)

# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh

# Pull Qwen3 30B-A3B
ollama pull qwen3:30b-a3b-q4_k_m

# Run
ollama run qwen3:30b-a3b-q4_k_m

Option B: llama.cpp (Fastest — enables MTP)

# Clone and build llama.cpp
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j$(nproc)

# Download quantized weights
huggingface-cli download Qwen/Qwen3-30B-A3B-GGUF \
  qwen3-30b-a3b-q4_k_m.gguf --local-dir ./models/

Running with MTP (Recommended for Speed)

Multi-Token Prediction dramatically increases throughput on Qwen3:

./build/bin/llama-server \
  -m models/qwen3-30b-a3b-q4_k_m.gguf \
  -ngl 99 \
  --port 8080 \
  --ctx-size 32768

For maximum speed with TurboQuant (community fork):

./build/bin/llama-cli \
  -m models/qwen3-30b-a3b-q4_k_m.gguf \
  -ngl 99 \
  -fitt 1536 \
  --temp 0.7 \
  -p "Your prompt here"

The -fitt 1536 flag balances GPU/CPU offloading for optimal RTX 4070 performance.

Performance by Task

All benchmarks on RTX 4070 (12GB), llama.cpp with MTP enabled:

TaskSpeed
Code generation (Python)80.8 tok/s
Code generation (C++)81.8 tok/s
Translation81.9 tok/s
Q&A (factual)77.8 tok/s
Math (step-by-step)76.5 tok/s
Summarization75.4 tok/s
Creative writing (short)69.2 tok/s
Code review (long)73.2 tok/s

Source: community benchmarks with llama.cpp MTP. Full data →

vs Vanilla llama.cpp

Without MTP: ~8 tok/s (generation speed)
With TurboQuant MTP fork: ~22 tok/s
With full MTP optimization: 80+ tok/s

This ~10× speedup is specific to Qwen3's architecture and MTP support in llama.cpp.

Context Length

Qwen3-30B-A3B supports up to 32K context natively. With flash attention enabled:

./llama-server -m model.gguf -ngl 99 --ctx-size 32768 --flash-attn

At 128K context (reported by community): 21 tok/s — still very usable for long documents.

Quantization Options

QuantVRAMQualitySpeed
Q8_0~20GBBestSlightly slower
Q4_K_M12GBExcellentRecommended
Q3_K_M~9GBGoodFast
Q2_K~7GBFairFastest

For RTX 4070 with 12GB: Q4_K_M is ideal — fits fully in VRAM.

Use Cases

Qwen3-35B excels at:

  • Coding: Top-tier for code completion and debugging at this size
  • Reasoning: Supports "thinking mode" for complex problems
  • Multilingual: Native Chinese + English, strong in 30+ languages
  • Long context: 32K+ token context for long document analysis

Troubleshooting

Model too slow (< 5 tok/s): Check if model is loading on GPU. Add -ngl 99 to offload all layers.

OOM at 12GB: Try Q3_K_M quantization (needs ~9GB), or add -ngl 80 to partially offload.

Incorrect outputs: Qwen3 uses a specific chat template. Use llama-server with OpenAI-compatible API for correct formatting.