Qwen3-Next 80B-A3B
Qwen3-Next-80B-A3B (Instruct) is Alibaba's flagship high-sparsity Mixture-of-Experts model (released 2025-09): 80B total parameters with only ~3B active per token (512 experts, 10 activated). Its defining feature is a hybrid architecture: 48 layers in a 3:1 ratio of Gated DeltaNet linear-attention blocks to full Gated-Attention blocks, each feeding an MoE, plus Multi-Token Prediction. This design keeps the long-context KV cache cheap. The model is text-only, with a 262,144-token native context window extendable toward ~1M tokens via YaRN. It is licensed under Apache-2.0 (commercial use permitted).

Alibaba ships a first-party GGUF (Qwen/Qwen3-Next-80B-A3B-Instruct-GGUF, from Q4_K_M at ~45 GiB through Q8_0). llama.cpp support for the qwen3_next hybrid architecture merged in late November 2025 (PR #16095, correctness-focused with speed tuning still pending), and Ollama lists qwen3-next:80b. Day-one runtimes were vLLM (>=0.10.2) and SGLang (>=0.5.2).

The realistic local fit is Apple unified memory: 64 GB (Apple M2 Max) runs Q4_K_M comfortably, which puts an 80B MoE on a Mac, while 48 GB needs a sub-Q4 community quant. GPUs with 24-32 GB run it only via CPU-MoE offload (with ~64 GB system RAM), which the 3B-active design makes usable.
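The figures above can be sanity-checked with a little arithmetic. The following is a minimal sketch, not a sizing tool: the function names are hypothetical, and the 8 GiB headroom for the OS, KV cache, and runtime buffers is an assumption chosen for illustration, not a measured value. It derives the layer split implied by the 3:1 ratio, the effective bits per weight implied by a ~45 GiB Q4_K_M file, and a rough fit check against 64 GB and 48 GB of memory (treated here as GiB for simplicity).

```python
GIB = 1024 ** 3

TOTAL_PARAMS = 80e9    # from the text: 80B total parameters
ACTIVE_PARAMS = 3e9    # from the text: ~3B active per token
NUM_LAYERS = 48        # from the text
Q4_K_M_GIB = 45.0      # from the text: approximate Q4_K_M file size


def layer_split(num_layers, linear_per_group=3, full_per_group=1):
    """Return (linear_attention_layers, full_attention_layers) for a fixed ratio."""
    group = linear_per_group + full_per_group
    if num_layers % group != 0:
        raise ValueError("layer count must be a multiple of the ratio group size")
    groups = num_layers // group
    return groups * linear_per_group, groups * full_per_group


def bits_per_weight(file_gib, params):
    """Average bits stored per parameter for a weights file of the given size."""
    return file_gib * GIB * 8 / params


def fits(file_gib, memory_gib, headroom_gib=8.0):
    """Rough fit check: weights plus an ASSUMED headroom must fit in memory."""
    return file_gib + headroom_gib <= memory_gib


if __name__ == "__main__":
    linear, full = layer_split(NUM_LAYERS)
    print(f"Gated DeltaNet layers: {linear}, full-attention layers: {full}")
    print(f"Active fraction per token: {ACTIVE_PARAMS / TOTAL_PARAMS:.2%}")
    print(f"Q4_K_M bits per weight: {bits_per_weight(Q4_K_M_GIB, TOTAL_PARAMS):.2f}")
    for memory in (64, 48):
        verdict = "fits" if fits(Q4_K_M_GIB, memory) else "does not fit"
        print(f"{memory} GB: Q4_K_M {verdict} under the assumed headroom")
```

Under these assumptions, only 12 of the 48 layers use full attention, and the Q4_K_M file works out to roughly 4.8 bits per weight. Actual headroom depends on context length and runtime, so the fit check is a first approximation only.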
| GPU | VRAM | Series | Best speed | Min VRAM | Works | Benchmarks | Recipe | |
|---|---|---|---|---|---|---|---|---|
| Apple M2 Max | 64GB | apple | | | ~ | 0 | recipe | check ↗ |
| Apple M3 Max | 48GB | apple | | | ~ | 0 | recipe | check ↗ |
✓ benchmarked · ~ runs via recipe (not benchmarked) · — untested · ✕ doesn't fit