§01·model

Qwen3-Next 80B-A3B

llm · active · Apache-2.0

Qwen3-Next-80B-A3B (Instruct) is Alibaba's flagship high-sparsity Mixture-of-Experts model, released in September 2025. It has 80B total parameters but only ~3B active per token (512 experts, 10 activated). Its defining feature is a hybrid architecture: 48 layers in a 3:1 ratio of Gated DeltaNet linear-attention blocks to full Gated-Attention blocks, each feeding an MoE layer, plus Multi-Token Prediction. Because only one layer in four uses full attention, and so keeps a KV cache that grows with context, long-context KV stays cheap. The model is text-only, with a 262,144-token native context window extendable toward ~1M tokens via YaRN. It is licensed under Apache-2.0 (commercial use permitted).

Alibaba ships a first-party GGUF (Qwen/Qwen3-Next-80B-A3B-Instruct-GGUF: Q4_K_M at ~45 GiB, up through Q8_0). llama.cpp support for the qwen3_next hybrid architecture merged in late November 2025 (PR #16095, correctness-focused with speed tuning still pending), and Ollama lists qwen3-next:80b. Day-one runtimes were vLLM (>=0.10.2) and SGLang (>=0.5.2).

The realistic local fit is Apple unified memory. A 64 GB machine (m2-max) runs Q4_K_M comfortably, which puts an 80B MoE on a Mac, while a 48 GB machine needs a sub-Q4 community quant. 24-32 GB GPUs run it only via CPU-MoE offload (with ~64 GB of system RAM), which the 3B-active design keeps usable.
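The arithmetic behind these figures can be sketched in a few lines of Python. This is a rough back-of-envelope check, not a measurement: the constants come from the description above, the function names are hypothetical, and the headroom figure assumes the marketed "GB" of unified memory can be treated as GiB and ignores the OS, the KV cache, and runtime buffers.

```python
GIB = 2**30

# Figures quoted in the description above.
TOTAL_PARAMS = 80e9    # 80B total parameters
ACTIVE_PARAMS = 3e9    # ~3B active per token
NUM_LAYERS = 48        # hybrid stack depth
LINEAR_PER_FULL = 3    # 3:1 Gated DeltaNet to full Gated-Attention
Q4_K_M_GIB = 45        # approximate size of the Q4_K_M GGUF


def layer_split(num_layers, linear_per_full):
    """Return (linear_attention_layers, full_attention_layers)."""
    group = linear_per_full + 1
    if num_layers % group:
        raise ValueError("layer count must be a multiple of the group size")
    full = num_layers // group
    return num_layers - full, full


def bits_per_weight(file_gib, params):
    """Average bits stored per parameter for a quantized file."""
    return file_gib * GIB * 8 / params


def headroom_gib(memory_gib, file_gib):
    """Memory left after loading the weights (assumption: nothing else counted)."""
    return memory_gib - file_gib


if __name__ == "__main__":
    print(layer_split(NUM_LAYERS, LINEAR_PER_FULL))              # (36, 12)
    print(round(bits_per_weight(Q4_K_M_GIB, TOTAL_PARAMS), 2))   # 4.83
    print(f"{ACTIVE_PARAMS / TOTAL_PARAMS:.2%} of weights active per token")
    for mem in (64, 48):
        print(mem, "GB ->", headroom_gib(mem, Q4_K_M_GIB), "GiB headroom")
```

Under these assumptions, 12 of the 48 layers carry a growing KV cache, Q4_K_M works out to roughly 4.8 bits per weight, and a 48 GB machine is left with only about 3 GiB after the weights, which is consistent with the note that it needs a smaller quant.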

Download · 4 variants

huggingface.co ↗

§02·GPUs that run this model

2 total

GPU	VRAM	Series	Best speed	Min VRAM	Works	Benchmarks	Recipe	Check
Apple M2 Max	64GB	apple			~	0	recipe	check ↗
Apple M3 Max	48GB	apple			~	0	recipe	check ↗

✓ benchmarked · ~ runs via recipe (not benchmarked) · — untested · ✕ doesn't fit