How much VRAM does Qwen3.6 35B-A3B need?

About 12 GB — the minimum this recipe targets.

How hard is this setup?

Advanced — follow the steps above.

Qwen3.6-35B-A3B on RTX 4070: 80 tok/s + 128K context from a 12GB card via MoE CPU-offload

What You'll Build

A local, OpenAI-compatible coding assistant powered by Qwen3.6-35B-A3B — Alibaba's 35-billion-parameter Mixture-of-Experts model (only ~3B parameters active per token) — running on a single 12GB RTX 4070. You'll serve it through llama.cpp from unsloth's Dynamic UD-Q4_K_XL GGUF, reaching 80.8 tokens/s with a 128K context window by keeping attention on the GPU and streaming the rarely-used expert tensors from system RAM.

Hardware data: RTX 4070 (12GB VRAM) · 80.8 tok/s at Q4_K_XL, 128K context · See benchmark data

⚠️ This model does not fit 12GB on its own. The UD-Q4_K_XL GGUF is ~22.85GB on disk — nearly 2× the RTX 4070's VRAM. It runs anyway because it's a sparse MoE: llama.cpp keeps the attention path and the per-token "active" weights on the GPU and offloads the bulk expert FFN tensors to CPU RAM. The headline 80.8 tok/s is plausible precisely because only ~3B of the 35B parameters are touched per token (plus multi-token prediction). You need enough system RAM to hold the offloaded experts — plan on 32GB+.

ℹ️ Text/coding path only. Qwen3.6-35B-A3B is a "Causal Language Model with Vision Encoder" — it can also accept images and video — but this recipe covers the text/coding path that the benchmark measures (and that the MTP GGUF supports). The vision encoder ships as a separate mmproj-*.gguf; the MTP speculative-decoding path used here is text-only (unsloth's card notes --mmproj is not yet supported alongside MTP). It's in our llm vertical for that reason.

Requirements

Component	Minimum	Tested
GPU	12GB VRAM (NVIDIA)	RTX 4070 (12GB)
RAM	32GB system RAM (to hold the CPU-offloaded experts)	—
Storage	~23GB for the UD-Q4_K_XL GGUF	22.85GB
Software	llama.cpp build with CUDA + MTP draft support, Python 3.10+	—

Why 32GB RAM: unsloth's own guidance is that "your total available memory (VRAM + system RAM) exceeds the size of the quantized model file" (unsloth Qwen3.6 docs). With a ~22.85GB file and 12GB on the GPU, roughly 11GB of expert weights live in RAM, plus KV cache and OS headroom.

Installation

1. Build llama.cpp with CUDA

These are the exact build steps from the unsloth Qwen3.6-35B-A3B-MTP-GGUF model card:

apt-get update
apt-get install pciutils build-essential cmake curl libcurl4-openssl-dev -y
git clone https://github.com/ggml-org/llama.cpp
cmake llama.cpp -B llama.cpp/build \
    -DBUILD_SHARED_LIBS=OFF -DGGML_CUDA=ON
cmake --build llama.cpp/build --config Release -j --clean-first --target llama-cli llama-mtmd-cli llama-server llama-gguf-split
cp llama.cpp/build/bin/llama-* llama.cpp

2. Download the UD-Q4_K_XL GGUF

llama.cpp can pull the weights for you via the -hf flag (used in step 3), or you can fetch the single file directly:

# ~22.85GB — the unsloth Dynamic UD-Q4_K_XL quant
huggingface-cli download unsloth/Qwen3.6-35B-A3B-MTP-GGUF \
    Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf --local-dir ./models

Running

The RTX 4070 cannot hold all 22.85GB, so you tell llama.cpp to keep the active path on the GPU and push the Mixture-of-Experts tensors to CPU RAM. There are two equivalent ways to do this.

Option A — auto-fit (recommended; matches the benchmark)

Recent llama.cpp has a --fit feature that, per its help text, adjusts unset arguments "to fit in device memory" — it auto-decides how many expert layers to offload so the model fits your VRAM, leaving a margin you set with --fit-target (in MiB). The benchmark for this pair was produced this way: llama.cpp with MTP support, 128K context, and -fitt 1536 to balance the GPU/CPU load.

./llama.cpp/llama-server \
    -hf unsloth/Qwen3.6-35B-A3B-MTP-GGUF:UD-Q4_K_XL \
    -ngl 99 -c 131072 -fa on -np 1 \
    -fit on -fitt 1536 \
    --spec-type draft-mtp --spec-draft-n-max 2

-ngl 99 places all layers on the GPU as the starting point; -fit on then walks expert tensors back to CPU until the model fits, holding a -fitt 1536 (1536 MiB) margin per device.
-c 131072 requests the 128K context. Thanks to the hybrid architecture (only 10 of 40 layers use standard KV-cached attention, with just 2 KV heads), the KV cache at 128K is only ~2.7GB at fp16 — which is why "80 tok/s and 128K context on 12GB" is achievable.
--spec-type draft-mtp --spec-draft-n-max 2 enables the model's built-in Multi-Token Prediction draft head — unsloth report MTP enables ~1.5-2x faster inference with no accuracy loss. The slug's MTP infix refers to this.

Option B — manual expert offload (`--n-cpu-moe`)

If your build predates --fit, place experts on CPU manually. llama.cpp's --n-cpu-moe N flag is documented to "keep the Mixture of Experts (MoE) weights of the first N layers in the CPU" (llama.cpp common/arg.cpp):

./llama.cpp/llama-server \
    -hf unsloth/Qwen3.6-35B-A3B-MTP-GGUF:UD-Q4_K_XL \
    -ngl 99 -c 131072 -fa on -np 1 \
    --n-cpu-moe 36 \
    --spec-type draft-mtp --spec-draft-n-max 2

Increase --n-cpu-moe until the model fits 12GB (start high and lower it as VRAM allows — nvidia-smi shows headroom). The principle, per the llama.cpp MoE-offload guide, is to keep "Attention + Dense FFN + Shared expert FFN" on the GPU (used every forward pass) and push "Routed expert FFN" to CPU. A coarser equivalent is -ot "exps=CPU" (keep all routed experts on CPU).

Either option starts an OpenAI-compatible server on http://localhost:8080. For coding tasks, unsloth/Qwen recommend the thinking-mode-for-precise-coding sampler: temperature=0.6, top_p=0.95, top_k=20, presence_penalty=0.0.

Results

Speed: 80.8 tokens/s on the RTX 4070 at Q4_K_XL with 128K context (code-generation workload), per the benchmark data. This is achievable because only ~3B of the 35B parameters are active per token plus MTP speculative decoding — the expert tensors that live in CPU RAM are touched far less often than the GPU-resident attention path.
VRAM usage: 12GB peak (the card's full capacity) — see /check. The remaining ~11GB of expert weights live in system RAM.
Quality notes: This is the strong coding/agentic variant of Qwen3.6 (SWE-bench Verified 73.4, Terminal-Bench 2.0 51.5 per the model card). Dynamic UD-Q4_K_XL preserves more accuracy than a flat Q4 by quantizing sensitive tensors at higher precision.

For the full benchmark data, see /check/qwen3-6-35b-a3b-mtp-ud-q4-k-xl-gguf/rtx-4070. Measured a different number on your 4070? Send it via the submission form so the next reader gets your data.

Troubleshooting

Out of memory at load on the 12GB card

The full 22.85GB GGUF cannot sit in 12GB VRAM — that's expected, not a bug. If you OOM, you haven't offloaded enough experts: with Option A, lower the auto-fit margin is automatic, so confirm -fit on is actually present; with Option B, raise --n-cpu-moe (more layers' experts go to CPU, less VRAM used). You can also drop the context with -c 32768 to shrink the KV cache, or step down to the UD-Q4_K_S / UD-Q3_K_XL quant from the same unsloth repo.

Not enough system RAM

If load fails (or the system thrashes/swaps) with VRAM to spare, your system RAM is the limit — the offloaded experts need ~11GB of RAM on top of everything else. unsloth's rule is VRAM + system RAM must exceed the file size (~22.85GB), so 32GB RAM is the practical floor; 16GB is too tight once the OS and KV cache are accounted for.

MTP / `--spec-type draft-mtp` not recognized

Multi-token prediction draft support is recent — rebuild llama.cpp from a current checkout (the build commands above target llama-server explicitly). Per unsloth's card, --spec-draft-n-max 2 works best in most setups but is worth tuning. Note that MTP and --mmproj (vision) cannot be combined yet — this text/coding path is the supported one.

Throughput lower than 80 tok/s

The bottleneck for MoE CPU-offload is PCIe latency moving expert tensors, not raw CPU speed. Keeping more of the active path on the GPU helps — minimize --n-cpu-moe (Option B) to the smallest value that still fits, and ensure -fa on (flash attention) and the MTP draft flags are present. Faster system RAM and a full-width PCIe slot also help.