What You'll Build
A local chat endpoint backed by Qwen3.5-35B-A3B — Alibaba's 35B-total Mixture-of-Experts model with ~3B active parameters per token — running on a single RTX 5090 in MXFP4 quantization via llama.cpp, Ollama, or LM Studio. The MoE design is what makes a 35B model fast on a consumer card: only ~3B parameters fire per token (8 routed + 1 shared expert out of 256), so generation stays quick even though all experts remain resident in VRAM. On the RTX 5090, MXFP4 runs on Blackwell's native FP4 tensor cores — the headline path the benchmark below was measured on.
Hardware data: RTX 5090 (32 GB VRAM) · 165.2 tok/s generation at 4K context (MXFP4) · See benchmark data
ℹ️ MXFP4 here is Blackwell-native FP4. MXFP4 is a 4-bit floating-point microscaling format. On the RTX 5090 (sm_120), the MoE expert tensors run on Blackwell's hardware FP4 tensor cores, whereas older cards such as the RTX 3090 run the same MXFP4 GGUF over generic quantized-matmul kernels with no FP4 tensor-core path. FP4 acceleration needs Blackwell, so this is the fast path on the 32 GB card. The benchmark above was measured in MXFP4.
ℹ️ This is a vision-language model run in text-only mode here. Qwen3.5-35B-A3B is a "Causal Language Model with Vision Encoder" (image-text-to-text), which is why it sits in our multimodal vertical. This recipe covers the text-LLM chat path that the RTX 5090 benchmark measures (tok/s). The MXFP4 GGUF below ships a separate mmproj projector file if you later want to enable image input; the text-chat fit and speed numbers here apply to the language path.
Requirements
| Component | Minimum | Tested |
|---|---|---|
| GPU | 24 GB VRAM | RTX 5090 (32 GB) |
| RAM | 16 GB system RAM | — |
| Storage | ~22 GB for the MXFP4 MoE weights (per the GGUF tree) | — |
| Software | CUDA 12.8+ (Blackwell sm_120); recent llama.cpp / Ollama / LM Studio | — |
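Before downloading ~22 GB of weights, it can be worth confirming that the card meets the VRAM minimum in the table. The sketch below is one way to do that, assuming `nvidia-smi` is on the PATH; the function names and the 2% slack are illustrative choices, not part of any tool.

```python
import subprocess

MIN_VRAM_GB = 24  # "Minimum" column of the requirements table


def parse_gpu_rows(csv_text):
    """Parse 'name, memory.total' rows (MiB, no units) into (name, GiB) tuples."""
    gpus = []
    for line in csv_text.strip().splitlines():
        name, mem_mib = [field.strip() for field in line.rsplit(",", 1)]
        gpus.append((name, int(mem_mib) / 1024))
    return gpus


def meets_minimum(gpus, min_gb=MIN_VRAM_GB):
    """True if at least one GPU reports roughly min_gb or more of total memory."""
    # Assumption: allow ~2% slack, since the reported total can sit slightly
    # under the marketed size.
    return any(total_gb >= min_gb * 0.98 for _, total_gb in gpus)


def query_gpus():
    """Ask nvidia-smi for the name and total memory of each GPU."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=name,memory.total",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return parse_gpu_rows(result.stdout)
```

Calling `meets_minimum(query_gpus())` on the target machine returns `True` when at least one card clears the 24 GB minimum.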
Installation
Three paths are provided. Pick one. Ollama is the fastest route to a working chat session; the llama.cpp MXFP4 GGUF gives you the exact quant tier the benchmark used; LM Studio is the GUI equivalent.
1. Ollama (recommended for first run)
ollama pull qwen3.5:35b
ollama run qwen3.5:35b
The qwen3.5:35b tag is a ~24 GB 4-bit MoE build; first run downloads it and drops you into an interactive REPL. The 32 GB RTX 5090 holds it with room to spare. (Ollama also publishes an explicit qwen3.5:35b-a3b-q4_K_M tag at the same ~24 GB if you want to pin the quant.)
2. llama.cpp with the MXFP4 MoE GGUF
Download the MXFP4 MoE GGUF — the noctrex/Qwen3.5-35B-A3B-MXFP4_MOE-GGUF build keeps the MoE tensors at MXFP4 and the rest at higher precision, a 22.06 GB file that links back to the canonical Qwen/Qwen3.5-35B-A3B:
# grab a recent llama.cpp build first (CUDA 12.8 / Blackwell): https://github.com/ggml-org/llama.cpp
huggingface-cli download noctrex/Qwen3.5-35B-A3B-MXFP4_MOE-GGUF \
Qwen3.5-35B-A3B-MXFP4_MOE_F16.gguf --local-dir ./qwen3.5-35b
Then serve it with all layers on the GPU and FlashAttention enabled:
llama-server -m ./qwen3.5-35b/Qwen3.5-35B-A3B-MXFP4_MOE_F16.gguf \
-ngl 99 -fa 1 -c 8192 --host 0.0.0.0 --port 8000
-ngl 99 offloads every layer to the 5090; -fa 1 enables llama.cpp's native FlashAttention (built into the CUDA backend, so there is no pip install flash-attn step). -c 8192 sets the context length; the 32 GB card has headroom to push this well past the 3090's 4K ceiling (see Troubleshooting). --host 0.0.0.0 listens on every network interface; use --host 127.0.0.1 if only the local machine should reach the server.
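Loading a 22 GB file takes a while, so requests sent right after launch may fail. A minimal readiness poll might look like the following; it assumes the server exposes a `/health` route on the port used above, and the helper names are hypothetical.

```python
import time
import urllib.error
import urllib.request


def probe_health(base_url="http://localhost:8000"):
    """Return True if GET /health answers 200; False while loading or unreachable."""
    try:
        with urllib.request.urlopen(base_url + "/health", timeout=2) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Covers connection refused, timeouts, and non-2xx HTTP errors.
        return False


def wait_until_ready(probe=probe_health, timeout_s=300, interval_s=2,
                     sleep=time.sleep):
    """Poll probe() until it returns True or roughly timeout_s has elapsed."""
    waited = 0
    while waited <= timeout_s:
        if probe():
            return True
        sleep(interval_s)
        waited += interval_s
    return False
```

`wait_until_ready()` returns `True` once the probe succeeds and `False` on timeout; the `probe` and `sleep` parameters are injectable so the loop can be exercised without a running server.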
3. LM Studio (GUI)
Search LM Studio's model browser for Qwen3.5-35B-A3B and pick a 4-bit GGUF (the MXFP4 MoE build above, or a Q4_K_S/Q4_K_M quant). Set GPU offload to "max" so all layers land on the 5090, then start a chat. LM Studio uses llama.cpp under the hood, so the runtime path is identical to Path 2.
Running
Ollama (interactive):
ollama run qwen3.5:35b "Explain mixture-of-experts routing in one paragraph."
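The same prompt can also be sent to Ollama's local REST API, which reports token counts and timings alongside the text. The sketch below assumes Ollama is listening on its default port 11434 and that the non-streaming `/api/generate` response carries `eval_count` and `eval_duration` (in nanoseconds); the function names are illustrative.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # assumed default local port


def ollama_generate(prompt, model="qwen3.5:35b", url=OLLAMA_URL):
    """Send one non-streaming generate request and return the parsed JSON reply."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    req = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def generation_rate(response):
    """Tokens per second from eval_count and eval_duration (nanoseconds)."""
    return response["eval_count"] / response["eval_duration"] * 1e9
```

`generation_rate(ollama_generate("..."))` gives a tok/s figure for your own machine. It is measured on the Ollama 4-bit build, not the MXFP4 GGUF, so it is not directly comparable to the benchmark row.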
llama.cpp (HTTP, OpenAI-compatible):
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "qwen3.5-35b",
"messages": [{"role": "user", "content": "Explain mixture-of-experts routing in one paragraph."}]
}'
Note that Qwen3.5 models operate in thinking mode by default, emitting a <think>...</think> block before the final answer (disable it by setting enable_thinking to False in the chat-template options). All 35B parameters stay resident in VRAM regardless of path: the ~3B "active" figure is a compute-per-token number (which experts the router fires), not a memory figure.
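For scripted use, the thinking block usually needs to be either switched off or separated from the answer. The sketch below shows both under stated assumptions: that the server forwards a `chat_template_kwargs` request field to the chat template (check your llama.cpp build), and that the reasoning arrives inline as `<think>...</think>` text. Some servers return reasoning in a separate field instead, in which case the splitting step is unnecessary.

```python
import json
import re

THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def build_payload(prompt, thinking=True, model="qwen3.5-35b"):
    """Build the JSON body for POST /v1/chat/completions."""
    payload = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    if not thinking:
        # Assumption: the server passes this through to the chat template.
        payload["chat_template_kwargs"] = {"enable_thinking": False}
    return json.dumps(payload)


def split_thinking(text):
    """Return (reasoning, answer); reasoning is '' if no <think> block is present."""
    match = THINK_RE.search(text)
    if match is None:
        return "", text.strip()
    reasoning = match.group(1).strip()
    answer = (text[:match.start()] + text[match.end():]).strip()
    return reasoning, answer
```

The string from `build_payload(...)` can be passed to `curl -d` or any HTTP client, and `split_thinking` applied to the returned message content.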
Results
- Generation speed: 165.2 tokens/s at 4K context (MXFP4), measured on RTX 5090 by Hardware Corner's gpu-llm-benchmarks (row "Qwen3.5 35B (MXFP4)", Token Generation column). This is the authoritative /check/ figure for this pair: about 1.5× the 111 tok/s the same MXFP4 build hits on the 24 GB RTX 3090, with the gap coming from Blackwell's native FP4 tensor cores plus the wider GDDR7 memory bus.
- Prefill speed: 6,605.2 tokens/s at 4K context on the same Hardware Corner RTX 5090 row (Prompt Processing column), which means fast first-token latency for long prompts.
- VRAM usage: The MXFP4 MoE weights are 22.06 GB on disk (per the GGUF tree); with the KV-cache on top, plan on the low-to-mid 20s of GB at 4K context, leaving real headroom on the 32 GB card. The /check/ benchmark does not publish a measured peak for this pair; if you measure one, please contribute it. See /check/qwen3-5-35b/rtx-5090.
- Quality notes: MXFP4 keeps the MoE tensors at 4-bit microscaling and the rest higher-precision. The canonical model card lists a 262,144-token native context extensible to ~1M via RoPE scaling; the 32 GB card lets you run a much larger KV-cache than a 24 GB card, but you are still KV-cache-bound far below the 1M ceiling (see Troubleshooting for sizing).
For the full benchmark data and side-by-side compare across cards, see /check/qwen3-5-35b/rtx-5090.
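To turn the two throughput figures above into something tangible, a rough latency estimate can be derived from them. This is back-of-envelope arithmetic only: it assumes both rates hold constant, which is only approximately true near the 4K context they were measured at.

```python
GEN_TOK_S_5090 = 165.2       # generation, 4K context, MXFP4 (Results above)
PREFILL_TOK_S_5090 = 6605.2  # prompt processing, same benchmark row
GEN_TOK_S_3090 = 111.0       # same MXFP4 build on the RTX 3090


def response_time_s(prompt_tokens, output_tokens,
                    prefill_tok_s=PREFILL_TOK_S_5090, gen_tok_s=GEN_TOK_S_5090):
    """Estimated seconds to ingest the prompt and then generate the output."""
    return prompt_tokens / prefill_tok_s + output_tokens / gen_tok_s


def speedup(fast_tok_s=GEN_TOK_S_5090, slow_tok_s=GEN_TOK_S_3090):
    """Generation-speed ratio between two cards."""
    return fast_tok_s / slow_tok_s
```

For a 4,000-token prompt and a 500-token reply, `response_time_s(4000, 500)` comes to about 3.6 seconds, almost all of it generation rather than prefill.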
Troubleshooting
Pushing context length on the 32 GB card
Unlike the 24 GB RTX 3090 — which fills up around 4K context with this model — the RTX 5090's extra 8 GB lets you raise -c (context length) substantially before the KV-cache exhausts VRAM. Start at -c 8192 or -c 16384 and watch nvidia-smi; back off if you approach the 32 GB ceiling. The model card's 262K native / ~1M extended context is achievable only with offload or a bigger card — a single 32 GB card is KV-cache-bound well below it. If you measure a working long-context configuration, please contribute it.
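A rough way to pick a starting `-c` value is the standard KV-cache size formula for attention layers. The sketch below is generic: the layer count, KV-head count and head dimension are parameters you must read from the GGUF metadata or the llama-server startup log, since this recipe does not state them, and the example values in the comment are placeholders, not Qwen3.5's real configuration. If only some layers keep a KV-cache, count only those.

```python
def kv_bytes_per_token(n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """Bytes of KV-cache per token: one K and one V vector per layer.

    bytes_per_elem=2 assumes an FP16 cache.
    """
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem


def max_context(headroom_gb, per_token_bytes):
    """Largest token count whose KV-cache fits in headroom_gb (decimal GB)."""
    return int(headroom_gb * 1e9 // per_token_bytes)


# HYPOTHETICAL placeholder architecture, for illustration only:
#   per_token = kv_bytes_per_token(n_layers=40, n_kv_heads=2, head_dim=256)
#   max_context(headroom_gb=32 - 22.06 - 2.0, per_token_bytes=per_token)
# The 2.0 GB reserve for compute buffers is likewise an assumption.
```

Treat the result as an upper bound to test against `nvidia-smi`, not a guarantee; compute buffers and fragmentation take their own share.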
"All 35B parameters must fit, not just 3B"
Qwen3.5-35B-A3B is marketed as 35B total / 3B activated per token. All 256 experts must be resident in VRAM, even though only 8 routed + 1 shared run per token, because the router picks experts at inference time and you cannot pre-prune them. The ~3B active figure governs speed (why generation is fast); the 35B total governs fit (why it needs the 4-bit/MXFP4 quant rather than BF16). The full BF16 safetensors are 71.9 GB (per the canonical tree) and do not fit any single consumer card.
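A toy simulation makes the residency argument concrete. The sketch below implements plain top-k routing over 256 experts with random scores standing in for a learned router; it is an illustration of the general technique, not Qwen's actual routing code, and it omits the shared expert, which always runs. Normalizing the weights over the selected experts is an assumption.

```python
import math
import random

N_EXPERTS = 256  # routed expert pool (from the text)
TOP_K = 8        # routed experts per token (from the text)


def route(logits, k=TOP_K):
    """Pick the k highest-scoring experts and softmax-normalize their weights."""
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    peak = max(logits[i] for i in top)  # subtract the max for numerical stability
    exps = [math.exp(logits[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]


def experts_touched(n_tokens, seed=0):
    """Union of experts selected across n_tokens random router outputs."""
    rng = random.Random(seed)
    seen = set()
    for _ in range(n_tokens):
        logits = [rng.gauss(0.0, 1.0) for _ in range(N_EXPERTS)]
        seen.update(i for i, _ in route(logits))
    return seen
```

Each token touches only 8 routed experts, but `len(experts_touched(200))` is far larger than 8: across even a short reply the selections spread over much of the pool, so every expert has to be loaded in advance.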
Multi-GPU launch commands from the model card don't fit a single 5090
The official HF model card's Quickstart shows vLLM and SGLang launched at --tensor-parallel-size 8 (and --tp-size 8) on the full BF16 weights (~72 GB) — that is an 8-GPU server configuration, not a consumer single-card path. For one RTX 5090 use the 4-bit/MXFP4 GGUF route (Path 1/2/3 above); the BF16 transformers/vLLM path does not fit 32 GB.
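The fit argument is simple arithmetic, sketched here with the file sizes quoted in this recipe. The 2 GB reserve for KV-cache and compute buffers is an illustrative assumption, not a measured figure.

```python
VRAM_GB = 32.0           # RTX 5090
BF16_WEIGHTS_GB = 71.9   # full safetensors, per the canonical tree
MXFP4_WEIGHTS_GB = 22.06  # MXFP4 MoE GGUF


def fits(weights_gb, vram_gb=VRAM_GB, reserve_gb=2.0):
    """True if the weights plus an assumed runtime reserve fit in VRAM."""
    return weights_gb + reserve_gb <= vram_gb


def headroom_gb(weights_gb, vram_gb=VRAM_GB):
    """VRAM left after loading the weights, before KV-cache and buffers."""
    return vram_gb - weights_gb
```

`fits(BF16_WEIGHTS_GB)` is `False` and `fits(MXFP4_WEIGHTS_GB)` is `True`; the MXFP4 build leaves just under 10 GB for the KV-cache and everything else.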
Generation slower than expected for a Blackwell card
Confirm you are on a CUDA 12.8 (sm_120) llama.cpp build so MXFP4 hits the Blackwell FP4 tensor cores, that FlashAttention is active (-fa 1 on llama.cpp; Ollama enables it by default), and that you are at small context — LLM token generation is memory-bandwidth-bound, so the per-token rate drops as the KV-cache grows. If your numbers are still off, please report them.
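To put a number on "slower than expected", one option is to time a request against the llama.cpp server and compare it with the benchmark figure. The sketch below assumes the server from Path 2 is running on port 8000 and that the response includes the OpenAI-style `usage.completion_tokens` field. Wall-clock timing includes prefill and HTTP overhead, so it understates pure generation speed, especially for short replies.

```python
import json
import time
import urllib.request

REFERENCE_TOK_S = 165.2  # benchmark generation rate at 4K context (MXFP4)


def timed_chat(prompt, url="http://localhost:8000/v1/chat/completions",
               max_tokens=256):
    """Send one chat request; return (parsed response, elapsed seconds)."""
    body = json.dumps({
        "model": "qwen3.5-35b",
        "max_tokens": max_tokens,
        "messages": [{"role": "user", "content": prompt}],
    }).encode()
    req = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    start = time.perf_counter()
    with urllib.request.urlopen(req) as resp:
        data = json.load(resp)
    return data, time.perf_counter() - start


def observed_rate(response, elapsed_s):
    """Generated tokens per wall-clock second for one request."""
    return response["usage"]["completion_tokens"] / elapsed_s


def fraction_of_reference(rate, reference=REFERENCE_TOK_S):
    """How close a measured rate is to the benchmark figure (1.0 = equal)."""
    return rate / reference
```

A result well below 1.0 from `fraction_of_reference(observed_rate(*timed_chat("...")))` on a short prompt is a cue to recheck the build, FlashAttention and context settings listed above.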