What You'll Build
A local, OpenAI-compatible coding assistant powered by Qwen3.6-35B-A3B — Alibaba's 35-billion-parameter Mixture-of-Experts model (only ~3B parameters active per token) — running on a single 12GB RTX 4070. You'll serve it through llama.cpp from unsloth's Dynamic UD-Q4_K_XL GGUF, reaching 80.8 tokens/s with a 128K context window by keeping attention on the GPU and streaming the rarely-used expert tensors from system RAM.
Hardware data: RTX 4070 (12GB VRAM) · 80.8 tok/s at Q4_K_XL, 128K context · See benchmark data
⚠️ This model does not fit 12GB on its own. The UD-Q4_K_XL GGUF is ~22.85GB on disk — nearly 2× the RTX 4070's VRAM. It runs anyway because it's a sparse MoE:
llama.cppkeeps the attention path and the per-token "active" weights on the GPU and offloads the bulk expert FFN tensors to CPU RAM. The headline 80.8 tok/s is plausible precisely because only ~3B of the 35B parameters are touched per token (plus multi-token prediction). You need enough system RAM to hold the offloaded experts — plan on 32GB+.
ℹ️ Text/coding path only. Qwen3.6-35B-A3B is a "Causal Language Model with Vision Encoder" — it can also accept images and video — but this recipe covers the text/coding path that the benchmark measures (and that the MTP GGUF supports). The vision encoder ships as a separate
mmproj-*.gguf; the MTP speculative-decoding path used here is text-only (unsloth's card notes--mmprojis not yet supported alongside MTP). It's in ourllmvertical for that reason.
Requirements
| Component | Minimum | Tested |
|---|---|---|
| GPU | 12GB VRAM (NVIDIA) | RTX 4070 (12GB) |
| RAM | 32GB system RAM (to hold the CPU-offloaded experts) | — |
| Storage | ~23GB for the UD-Q4_K_XL GGUF | 22.85GB |
| Software | llama.cpp build with CUDA + MTP draft support, Python 3.10+ | — |
Why 32GB RAM: unsloth's own guidance is that "your total available memory (VRAM + system RAM) exceeds the size of the quantized model file" (unsloth Qwen3.6 docs). With a ~22.85GB file and 12GB on the GPU, roughly 11GB of expert weights live in RAM, plus KV cache and OS headroom.
Installation
1. Build llama.cpp with CUDA
These are the exact build steps from the unsloth Qwen3.6-35B-A3B-MTP-GGUF model card:
apt-get update
apt-get install pciutils build-essential cmake curl libcurl4-openssl-dev -y
git clone https://github.com/ggml-org/llama.cpp
cmake llama.cpp -B llama.cpp/build \
-DBUILD_SHARED_LIBS=OFF -DGGML_CUDA=ON
cmake --build llama.cpp/build --config Release -j --clean-first --target llama-cli llama-mtmd-cli llama-server llama-gguf-split
cp llama.cpp/build/bin/llama-* llama.cpp
2. Download the UD-Q4_K_XL GGUF
llama.cpp can pull the weights for you via the -hf flag (used in step 3), or you can fetch the single file directly:
# ~22.85GB — the unsloth Dynamic UD-Q4_K_XL quant
huggingface-cli download unsloth/Qwen3.6-35B-A3B-MTP-GGUF \
Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf --local-dir ./models
Running
The RTX 4070 cannot hold all 22.85GB, so you tell llama.cpp to keep the active path on the GPU and push the Mixture-of-Experts tensors to CPU RAM. There are two equivalent ways to do this.
Option A — auto-fit (recommended; matches the benchmark)
Recent llama.cpp has a --fit feature that, per its help text, adjusts unset arguments "to fit in device memory" — it auto-decides how many expert layers to offload so the model fits your VRAM, leaving a margin you set with --fit-target (in MiB). The benchmark for this pair was produced this way: llama.cpp with MTP support, 128K context, and -fitt 1536 to balance the GPU/CPU load.
./llama.cpp/llama-server \
-hf unsloth/Qwen3.6-35B-A3B-MTP-GGUF:UD-Q4_K_XL \
-ngl 99 -c 131072 -fa on -np 1 \
-fit on -fitt 1536 \
--spec-type draft-mtp --spec-draft-n-max 2
-ngl 99places all layers on the GPU as the starting point;-fit onthen walks expert tensors back to CPU until the model fits, holding a-fitt 1536(1536 MiB) margin per device.-c 131072requests the 128K context. Thanks to the hybrid architecture (only 10 of 40 layers use standard KV-cached attention, with just 2 KV heads), the KV cache at 128K is only ~2.7GB at fp16 — which is why "80 tok/s and 128K context on 12GB" is achievable.--spec-type draft-mtp --spec-draft-n-max 2enables the model's built-in Multi-Token Prediction draft head — unsloth report MTP enables ~1.5-2x faster inference with no accuracy loss. The slug'sMTPinfix refers to this.
Option B — manual expert offload (--n-cpu-moe)
If your build predates --fit, place experts on CPU manually. llama.cpp's --n-cpu-moe N flag is documented to "keep the Mixture of Experts (MoE) weights of the first N layers in the CPU" (llama.cpp common/arg.cpp):
./llama.cpp/llama-server \
-hf unsloth/Qwen3.6-35B-A3B-MTP-GGUF:UD-Q4_K_XL \
-ngl 99 -c 131072 -fa on -np 1 \
--n-cpu-moe 36 \
--spec-type draft-mtp --spec-draft-n-max 2
Increase --n-cpu-moe until the model fits 12GB (start high and lower it as VRAM allows — nvidia-smi shows headroom). The principle, per the llama.cpp MoE-offload guide, is to keep "Attention + Dense FFN + Shared expert FFN" on the GPU (used every forward pass) and push "Routed expert FFN" to CPU. A coarser equivalent is -ot "exps=CPU" (keep all routed experts on CPU).
Either option starts an OpenAI-compatible server on http://localhost:8080. For coding tasks, unsloth/Qwen recommend the thinking-mode-for-precise-coding sampler: temperature=0.6, top_p=0.95, top_k=20, presence_penalty=0.0.
Results
- Speed: 80.8 tokens/s on the RTX 4070 at Q4_K_XL with 128K context (code-generation workload), per the benchmark data. This is achievable because only ~3B of the 35B parameters are active per token plus MTP speculative decoding — the expert tensors that live in CPU RAM are touched far less often than the GPU-resident attention path.
- VRAM usage: 12GB peak (the card's full capacity) — see /check. The remaining ~11GB of expert weights live in system RAM.
- Quality notes: This is the strong coding/agentic variant of Qwen3.6 (SWE-bench Verified 73.4, Terminal-Bench 2.0 51.5 per the model card). Dynamic UD-Q4_K_XL preserves more accuracy than a flat Q4 by quantizing sensitive tensors at higher precision.
For the full benchmark data, see /check/qwen3-6-35b-a3b-mtp-ud-q4-k-xl-gguf/rtx-4070. Measured a different number on your 4070? Send it via the submission form so the next reader gets your data.
Troubleshooting
Out of memory at load on the 12GB card
The full 22.85GB GGUF cannot sit in 12GB VRAM — that's expected, not a bug. If you OOM, you haven't offloaded enough experts: with Option A, lower the auto-fit margin is automatic, so confirm -fit on is actually present; with Option B, raise --n-cpu-moe (more layers' experts go to CPU, less VRAM used). You can also drop the context with -c 32768 to shrink the KV cache, or step down to the UD-Q4_K_S / UD-Q3_K_XL quant from the same unsloth repo.
Not enough system RAM
If load fails (or the system thrashes/swaps) with VRAM to spare, your system RAM is the limit — the offloaded experts need ~11GB of RAM on top of everything else. unsloth's rule is VRAM + system RAM must exceed the file size (~22.85GB), so 32GB RAM is the practical floor; 16GB is too tight once the OS and KV cache are accounted for.
MTP / --spec-type draft-mtp not recognized
Multi-token prediction draft support is recent — rebuild llama.cpp from a current checkout (the build commands above target llama-server explicitly). Per unsloth's card, --spec-draft-n-max 2 works best in most setups but is worth tuning. Note that MTP and --mmproj (vision) cannot be combined yet — this text/coding path is the supported one.
Throughput lower than 80 tok/s
The bottleneck for MoE CPU-offload is PCIe latency moving expert tensors, not raw CPU speed. Keeping more of the active path on the GPU helps — minimize --n-cpu-moe (Option B) to the smallest value that still fits, and ensure -fa on (flash attention) and the MTP draft flags are present. Faster system RAM and a full-width PCIe slot also help.