Qwen3.5-35B-A3B on RTX 3090: MXFP4 MoE Chat at 111 tok/s

llm · intermediate · 24GB+ VRAM · Jun 27, 2026

This intermediate recipe runs Qwen3.5-35B-A3B on a single RTX 3090 and uses essentially all of the card's 24 GB of VRAM.

Prerequisites
  • NVIDIA RTX 3090 (24 GB VRAM) — this is a 24 GB-tier recipe, not a sub-24 GB one
  • Recent NVIDIA driver with CUDA 12+
  • ~22 GB free disk for the MXFP4 GGUF weights

What You'll Build

A local chat endpoint backed by Qwen3.5-35B-A3B — Alibaba's 35B-total Mixture-of-Experts model with ~3B active parameters per token — running on a single RTX 3090 in MXFP4 quantization via llama.cpp, Ollama, or LM Studio. The MoE design is what makes a 35B model usable on a 24 GB card: only ~3B parameters fire per token, so generation is fast even though all experts stay resident.

Hardware data: RTX 3090 (24 GB VRAM) · 111.2 tok/s generation at 4K context (MXFP4) · See benchmark data

ℹ️ MXFP4 is FP4-microscaling, not FP8. Unlike FP8 (which needs Ada sm_89 / Hopper sm_90 tensor cores the 3090 lacks), the MXFP4 GGUF runs over standard quantized-matmul kernels in llama.cpp / Ollama / LM Studio. The 3090 takes the same MXFP4 path as newer cards — no dequantize-to-BF16 fallback, no architectural workaround. The benchmark above was measured in MXFP4.

ℹ️ This is a vision-language model run in text-only mode here. Qwen3.5-35B-A3B is a "Causal Language Model with Vision Encoder" — it can take image input. This recipe covers the text-LLM chat path that the RTX 3090 benchmark measures (tok/s), which is why it sits in our llm vertical. The MXFP4 GGUF below ships a separate mmproj projector file if you later want to enable image input; the text-chat fit and speed numbers here apply to the language path.

Requirements

Component | Minimum | Tested
GPU | 24 GB VRAM | RTX 3090 (24 GB)
RAM | 16 GB system RAM |
Storage | ~22 GB for the MXFP4 MoE weights (per the GGUF tree) |
Software | CUDA 12+; recent llama.cpp / Ollama / LM Studio |

Installation

Three paths are provided. Pick one. Ollama is the fastest route to a working chat session; the llama.cpp MXFP4 GGUF gives you the exact quant tier the benchmark used; LM Studio is the GUI equivalent.

Path A — Ollama (recommended for first run)

ollama pull qwen3.5:35b
ollama run qwen3.5:35b

The qwen3.5:35b tag is a ~24 GB 4-bit MoE build that targets the 24 GB-card envelope; first run downloads it and drops you into an interactive REPL. (Ollama also publishes an explicit qwen3.5:35b-a3b-q4_K_M tag at the same ~24 GB if you want to pin the quant.)

Path B — llama.cpp with the MXFP4 GGUF

Download the MXFP4 MoE GGUF. The noctrex/Qwen3.5-35B-A3B-MXFP4_MOE-GGUF build quantizes the MoE expert tensors to MXFP4 and keeps the remaining tensors at higher precision; it is a 22.06 GB file whose model card links back to the canonical Qwen/Qwen3.5-35B-A3B:

# grab a recent llama.cpp build first: https://github.com/ggml-org/llama.cpp
huggingface-cli download noctrex/Qwen3.5-35B-A3B-MXFP4_MOE-GGUF \
    Qwen3.5-35B-A3B-MXFP4_MOE_F16.gguf --local-dir ./qwen3.5-35b

Then serve it with all layers on the GPU and FlashAttention enabled:

llama-server -m ./qwen3.5-35b/Qwen3.5-35B-A3B-MXFP4_MOE_F16.gguf \
    -ngl 99 -fa 1 -c 4096 --host 0.0.0.0 --port 8000

-ngl 99 offloads every layer to the 3090; -c 4096 matches the 4K context the benchmark used (push it higher only as far as the leftover VRAM allows — see Troubleshooting).

Path C — LM Studio (GUI)

Search LM Studio's model browser for Qwen3.5-35B-A3B and pick a 4-bit GGUF (the MXFP4 MoE build above, or a Q4_K_S/Q4_K_M quant). Set GPU offload to "max" so all layers land on the 3090, then start a chat. LM Studio uses llama.cpp under the hood, so the runtime path is identical to Path B.

Running

Ollama (interactive):

ollama run qwen3.5:35b "Explain mixture-of-experts routing in one paragraph."

llama.cpp (HTTP, OpenAI-compatible):

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3.5-35b",
    "messages": [{"role": "user", "content": "Explain mixture-of-experts routing in one paragraph."}]
  }'

Note that Qwen3.5 operates in thinking mode by default, emitting a <think>...</think> block before the final answer. All 35B parameters stay resident in VRAM regardless of path — the ~3B "active" figure is a compute-per-token number (which experts the router fires), not a memory figure.
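
For scripted use, the same request can be sent from Python. This is a minimal sketch using only the standard library, assuming the llama-server from Path B is listening on localhost:8000. The helper names (build_payload, strip_think, chat) are illustrative and not part of any library. strip_think removes the <think>...</think> block mentioned above so that only the final answer is returned.

```python
import json
import re
import urllib.request

# Assumption: the llama-server from Path B is listening on this address.
BASE_URL = "http://localhost:8000/v1/chat/completions"

_THINK_RE = re.compile(r"<think>.*?</think>", re.DOTALL)


def build_payload(prompt, model="qwen3.5-35b"):
    """Build the same JSON body as the curl example."""
    return {"model": model, "messages": [{"role": "user", "content": prompt}]}


def strip_think(text):
    """Drop any <think>...</think> blocks and trim surrounding whitespace."""
    return _THINK_RE.sub("", text).strip()


def chat(prompt, url=BASE_URL, timeout=300):
    """POST one user message and return the answer without the thinking block."""
    body = json.dumps(build_payload(prompt)).encode("utf-8")
    req = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        data = json.load(resp)
    return strip_think(data["choices"][0]["message"]["content"])
```

Calling chat("Explain mixture-of-experts routing in one paragraph.") returns the final answer as a string. Some server configurations return the thinking text in a separate response field instead of inline; in that case strip_think simply leaves the content unchanged.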

Results

  • Generation speed: 111.2 tokens/s at 4K context (MXFP4), measured on RTX 3090 by Hardware Corner's gpu-llm-benchmarks. It holds up well as context grows: 107.1 tok/s at 16K, 101.2 tok/s at 32K, 93.1 tok/s at 64K, 79.4 tok/s at 128K — the slow falloff is the MoE design paying off, since only ~3B parameters are read per token.
  • Prefill speed: 2,622.1 tokens/s at 4K context on the same Hardware Corner RTX 3090 row (2,381.3 at 16K, 2,121.6 at 32K, 1,749.8 at 64K, 1,288.9 at 128K).
  • VRAM usage: 24.0 GB peak at 4K context — this fills the RTX 3090, so it is a true 24 GB-tier recipe with little spare headroom. See /check/qwen3-5-35b/rtx-3090.
  • Quality notes: MXFP4 keeps the MoE tensors at 4-bit microscaling and the rest higher-precision; the canonical model card lists a 262,144-token native context extensible to ~1M, but on a single 24 GB card you are KV-cache-bound far below that — keep context modest (4K–16K) to stay within the card.
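
To see how gently the numbers above fall off, it helps to express each row as a share of the 4K baseline. The sketch below uses only the figures quoted in this list; the retention helper is an illustrative name, not part of any benchmark tooling.

```python
# Speeds in tok/s copied from the Results list, keyed by context label.
GENERATION = {"4K": 111.2, "16K": 107.1, "32K": 101.2, "64K": 93.1, "128K": 79.4}
PREFILL = {"4K": 2622.1, "16K": 2381.3, "32K": 2121.6, "64K": 1749.8, "128K": 1288.9}


def retention(table, baseline="4K"):
    """Return each entry as a percentage of the baseline entry, to one decimal."""
    base = table[baseline]
    return {ctx: round(100 * speed / base, 1) for ctx, speed in table.items()}


gen_pct = retention(GENERATION)
prefill_pct = retention(PREFILL)
for ctx in GENERATION:
    print(f"{ctx:>4}: generation {gen_pct[ctx]:5.1f}%  prefill {prefill_pct[ctx]:5.1f}%")
```

At 128K, generation keeps about 71% of its 4K speed while prefill keeps about 49%, so long prompts cost noticeably more in time-to-first-token than in streaming speed.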

For the full benchmark data and side-by-side compare across cards, see /check/qwen3-5-35b/rtx-3090.

Troubleshooting

Out of memory at long context

The 4K-context benchmark already peaks at 24.0 GB on the RTX 3090, so the card is essentially full. Raising -c (context length) grows the KV cache and can run the card out of memory. Stay at 4K–8K on the 3090; if you need long context, use a bigger card (the same model is benchmarked at 165.2 tok/s on the 32 GB RTX 5090). If you have measured a working long-context configuration on a 3090, please contribute it.
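
KV-cache growth is linear in context length. The sketch below applies the standard size formula for a conventional attention stack with a 16-bit cache. The layer count, KV-head count, and head dimension are hypothetical placeholders, not Qwen3.5-35B-A3B's real configuration; read the real values from the GGUF metadata, and note that a model with non-standard attention layers may cache less than this formula suggests.

```python
def kv_cache_bytes(n_ctx, n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """Bytes for keys plus values across all caching layers at n_ctx tokens."""
    # The leading 2 accounts for storing both a key and a value per position.
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * n_ctx


# HYPOTHETICAL shape for illustration only; not this model's real configuration.
EXAMPLE_SHAPE = dict(n_layers=32, n_kv_heads=8, head_dim=128)

for n_ctx in (4096, 8192, 16384, 32768):
    gib = kv_cache_bytes(n_ctx, **EXAMPLE_SHAPE) / 2**30
    print(f"-c {n_ctx:>6}: {gib:.2f} GiB of KV cache")
```

Whatever the real shape is, doubling -c doubles the cache, so the headroom left after loading the weights sets a hard ceiling on usable context.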

"All 35B parameters must fit, not just 3B"

Qwen3.5-35B-A3B is marketed as 35B total / 3B activated per token. All 256 experts must be resident in VRAM, even though only 8 routed experts plus 1 shared expert run for any given token, because the router picks experts at inference time and you cannot pre-prune them. The ~3B active figure governs speed (why generation is fast); the 35B total governs fit (why it needs the full 24 GB at MXFP4). Sub-24 GB cards should look at smaller siblings (Qwen3.5 9B, etc.) on the /check/ listings.
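
The fit-versus-speed split can be put into rough numbers. This back-of-the-envelope sketch uses the rounded 35B and 3B parameter counts and the 22.06 GB GGUF file size quoted earlier, and assumes that size is in decimal gigabytes. It also spreads the average bits per weight evenly across all tensors, which the MXFP4 build does not actually do, so treat the per-token figure as an order-of-magnitude estimate.

```python
TOTAL_PARAMS = 35e9    # nominal total, rounded from the model name
ACTIVE_PARAMS = 3e9    # nominal active parameters per token, rounded
GGUF_BYTES = 22.06e9   # MXFP4 MoE GGUF file size (assumed decimal GB)


def effective_bits_per_weight(file_bytes, params):
    """Average storage cost per parameter implied by the file size."""
    return file_bytes * 8 / params


def size_gb(params, bits_per_weight):
    """Storage in decimal GB for params weights at the given precision."""
    return params * bits_per_weight / 8 / 1e9


bpw = effective_bits_per_weight(GGUF_BYTES, TOTAL_PARAMS)
print(f"average bits per weight: {bpw:.2f}")
print(f"resident weights (fit):  {size_gb(TOTAL_PARAMS, bpw):.2f} GB")
print(f"read per token (speed):  {size_gb(ACTIVE_PARAMS, bpw):.2f} GB")
print(f"BF16 weights:            {size_gb(TOTAL_PARAMS, 16):.0f} GB")
```

With these rounded inputs the file averages about 5 bits per weight, roughly 1.9 GB of weights are touched per token against 22 GB held resident, and the unquantized BF16 weights come to about 70 GB.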

Multi-GPU launch commands from the model card don't fit a single 3090

The official HF model card's Quickstart shows vLLM and SGLang launched at --tensor-parallel-size 8 on the full BF16 weights (~72 GB) — that is an 8-GPU server configuration, not a consumer single-card path. For one RTX 3090 use the 4-bit GGUF route (Path A/B/C above); the BF16 transformers/vLLM path does not fit 24 GB.

Generation slower than expected for the GPU class

Confirm FlashAttention is active (-fa 1 on llama.cpp; Ollama enables it by default) and that you are at small context — LLM token generation is memory-bandwidth-bound, so the per-token rate drops mechanically as the KV-cache grows past 4K, exactly as the Hardware Corner table shows (111.2 → 79.4 tok/s walking from 4K to 128K). If your numbers are still off, please report them.
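
Before reporting, it helps to measure your own rate the same way each time. The sketch below is one rough way to do that from a non-streamed OpenAI-compatible response, assuming the server fills in the standard usage.completion_tokens field. The function names are illustrative. Wall-clock time includes prompt processing, so this slightly understates pure generation speed; use a short prompt and a long completion to keep that error small.

```python
import time

REFERENCE_TOK_S = 111.2  # RTX 3090, 4K context, MXFP4 (from Results)


def generation_rate(response, elapsed_s):
    """Completion tokens per wall-clock second for one non-streamed response."""
    return response["usage"]["completion_tokens"] / elapsed_s


def shortfall_pct(measured, reference=REFERENCE_TOK_S):
    """Percent below the reference rate (negative if faster than the reference)."""
    return 100 * (reference - measured) / reference


def measure(send_request):
    """Time one request; send_request is any callable returning the parsed JSON."""
    start = time.perf_counter()
    response = send_request()
    return generation_rate(response, time.perf_counter() - start)
```

Pass measure a callable that performs the HTTP request and returns the decoded JSON body, then feed the result to shortfall_pct to see how far your run sits from the 4K reference.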

Common Questions

How much VRAM does Qwen3.5 35B need?

About 24 GB — the minimum this recipe targets.

Which GPUs is Qwen3.5 35B tested on?

RTX 3090 (24 GB).

How hard is this setup?

Intermediate — follow the steps above.