How much VRAM does Qwen3-Next 80B-A3B need?

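About 64 GB of unified memory, the minimum this recipe targets. The Q4_K_M weights alone take ~45.1 GiB (48.41 GB); the rest goes to the KV cache and OS headroom.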

How hard is this setup?

Advanced: follow the steps below.

Qwen3-Next 80B-A3B on Apple M2 Max: an 80B MoE Assistant in 64GB Unified Memory

What You'll Build

A private, local long-context assistant: Qwen3-Next-80B-A3B-Instruct, Qwen's 80B-total / ~3B-active Mixture-of-Experts model, served as an OpenAI-compatible endpoint on an Apple M2 Max (64GB unified memory) via llama.cpp on Metal (or Ollama as the turnkey alternative), and driven by a simple client like Open WebUI or a plain curl API call. This is a text-only generalist instruct model built for chat, summarization, drafting, and long-document work. It takes a 262,144-token native context (extensible toward ~1M via YaRN), per the Qwen3-Next-80B-A3B-Instruct model card. On a 64 GB Apple machine the unified memory is just large enough to hold the first-party Q4_K_M GGUF (~45.1 GiB / 48.41 GB), the recommended and top usable quant here, with room left for a bounded KV cache once you raise the Metal memory cap.

Hardware data: Apple M2 Max (64GB unified memory, Metal) · Qwen3-Next-80B-A3B Q4_K_M GGUF (48.41 GB, recommended top quant) · bounded-context, raised GPU cap

⚠️ llama.cpp support for this architecture is RECENT and not yet speed-tuned. The hybrid qwen3_next architecture was merged in ggml-org/llama.cpp#16095 "Model: Qwen3 Next" on 2025-11-28, first shipping in release b7186. It works on a recent build — but the author states the implementation is "focused on CORRECTNESS ONLY … Speed tuning … will come in future PRs." So expect modest tokens/sec on a fresh build today; it will improve as follow-up PRs land. Use a build at or after b7186 (or an Ollama version that bundles it) — an older stock brew install llama.cpp will report an unknown architecture. This is not an "un-upstreamed" model; it is upstream, just early.

ℹ️ An MoE keeps all experts resident — the file size is the memory cost, not the "3B active" count. Qwen3-Next is 80B total with 512 experts, 10 activated per token (plus 1 shared), ~3B active per token, per the model card. Only some experts fire per token (a throughput property), but all experts stay loaded in unified memory, so the footprint is the full quant file — ~45.1 GiB at Q4_K_M, not a smaller "3B active" fraction. Do not expect the low active-parameter count to shrink the memory requirement.

ℹ️ Long context is cheaper here than on a dense 80B — but the weights still dominate. Qwen3-Next is a hybrid: 48 layers in a 3:1 ratio of Gated DeltaNet (linear-attention) blocks to full Gated-Attention blocks, per the model card. Linear attention on 3/4 of the layers makes the KV cache grow much more slowly with context length than a dense full-attention model would — a real advantage for long documents. But the ~45.1 GiB of weights are the budget's floor, so on 64 GB you still bound the context deliberately rather than jumping to the full 256K window.
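To make the 3:1 ratio concrete, here is a minimal sketch of the layer arithmetic. It assumes each full-attention layer costs the same KV per token as it would in an all-attention stack, and that the Gated DeltaNet layers keep a fixed-size state that does not grow with context. The function name `full_attention_layers` is hypothetical.

```shell
#!/bin/sh
# full_attention_layers TOTAL LINEAR_PER_FULL
# With LINEAR_PER_FULL linear-attention blocks for every full-attention
# block, prints how many of TOTAL layers keep a KV cache that grows
# with context length. Hypothetical helper for illustration only.
full_attention_layers() {
  echo $(( $1 / ($2 + 1) ))
}

layers=$(full_attention_layers 48 3)
echo "full-attention layers: $layers of 48"
echo "KV growth vs attention on all 48 layers: $(( layers * 100 / 48 ))%"
```

This is only a ratio, not a memory figure: real usage also includes compute buffers and the DeltaNet state, so measure on the machine rather than relying on the estimate.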

ℹ️ Apple unified memory is shared with the OS — you must raise the GPU cap for this one. On Apple Silicon the CPU and GPU share one memory pool, and by default macOS only lets the GPU wire down roughly 70–75% of total — about 48 GB of the 64 GB. The ~45.1 GiB Q4_K_M weights sit right at that default line, leaving almost nothing for the KV cache. Raise the cap once per boot before serving:
sudo sysctl iogpu.wired_limit_mb=57344   # ~56 GB GPU-usable; leaves ~8 GB for the OS
That gives the weights room plus a bounded KV cache. Do not try to give the GPU all 64 GB — the OS needs headroom or the machine stalls.
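The 57344 figure is simply total memory minus an OS reserve, expressed in MiB. A minimal sketch of that arithmetic, assuming the machine reports 64 GiB (68,719,476,736 bytes) and that an 8 GiB reserve is acceptable; the helper name `wired_limit_mb` is hypothetical and not part of macOS:

```shell
#!/bin/sh
# wired_limit_mb TOTAL_BYTES RESERVE_GIB
# Prints a candidate iogpu.wired_limit_mb value: total memory minus an
# OS reserve, in MiB. Hypothetical helper for illustration only.
wired_limit_mb() {
  total_mib=$(( $1 / 1024 / 1024 ))
  reserve_mib=$(( $2 * 1024 ))
  echo $(( total_mib - reserve_mib ))
}

# 64 GiB machine, 8 GiB kept for the OS
wired_limit_mb 68719476736 8
```

On the Mac itself, `sysctl -n hw.memsize` reports total memory in bytes, so `wired_limit_mb "$(sysctl -n hw.memsize)" 8` should print the value to pass to `sudo sysctl iogpu.wired_limit_mb=...`.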

ℹ️ Quant reality: Q4_K_M is the ceiling on 64GB; Q5/Q6/Q8 do NOT fit. The first-party GGUF repo ships single-file Q4_K_M (48.41 GB), Q5_K_M (56.71 GB), Q6_K (65.53 GB), Q8_0 (84.81 GB), and a 4-part BF16 split (~159.5 GB). On a 64 GB machine only Q4_K_M fits with usable KV headroom. Q5_K_M (56.71 GB, ~52.8 GiB) squeezes under the raised 56 GiB cap on paper but leaves only ~3 GiB for the KV cache and activations; Q6_K (65.53 GB, ~61.0 GiB) exceeds any GPU cap that still leaves the OS room to run; Q8_0 and BF16 exceed total memory outright. This recipe recommends Q4_K_M and does not recommend stepping above it here.

Requirements

Component	Minimum	Tested
GPU	64GB unified (Q4_K_M is ~45.1 GiB and needs KV + OS headroom)	Apple M2 Max (64GB unified memory, Metal)
RAM	Unified with GPU (64GB total on this config)	64GB unified
Storage	~49GB for the Q4_K_M GGUF	48.41 GB (`Qwen3-Next-80B-A3B-Instruct-Q4_K_M.gguf`)
Software	llama.cpp ≥ b7186 (Metal), or Ollama (Metal); a client (Open WebUI / curl)	llama.cpp b7186+, Ollama
License	Apache-2.0 (commercial-OK)	—

The first-party Q4_K_M GGUF is 48.41 GB (48,410,988,384 bytes), a single file, per the Qwen3-Next-80B-A3B-Instruct-GGUF file tree. The higher first-party quants (Q5_K_M 56,710,369,120 B; Q6_K 65,528,461,152 B; Q8_0 84,812,052,320 B; BF16 4-part ~159.5 GB) do not fit this machine with usable headroom. The model is licensed under Apache-2.0 (commercial-OK) and is text-only, per the model card.
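The fit claims above can be checked with a few lines of arithmetic. This sketch converts the byte counts quoted from the file tree into GiB and subtracts them from the raised GPU cap of 57344 MiB. The helper names `to_gib` and `headroom_gib` are hypothetical, and the result ignores compute buffers, so treat it as an upper bound on free space.

```shell
#!/bin/sh
# to_gib BYTES -> size in GiB with two decimals (1 GiB = 1073741824 bytes)
to_gib() {
  awk -v b="$1" 'BEGIN { printf "%.2f\n", b / 1073741824 }'
}

# headroom_gib BYTES CAP_MIB -> GiB left under the GPU cap after the weights
# (negative means the file alone is larger than the cap)
headroom_gib() {
  awk -v b="$1" -v cap="$2" 'BEGIN { printf "%.2f\n", cap / 1024 - b / 1073741824 }'
}

CAP_MIB=57344   # the raised iogpu.wired_limit_mb used in this recipe

# name:bytes pairs, byte counts as quoted from the GGUF file tree
for entry in Q4_K_M:48410988384 Q5_K_M:56710369120 Q6_K:65528461152 Q8_0:84812052320; do
  name=${entry%%:*}
  bytes=${entry##*:}
  echo "$name: $(to_gib "$bytes") GiB, headroom $(headroom_gib "$bytes" "$CAP_MIB") GiB"
done
```

With these inputs Q4_K_M leaves about 10.9 GiB under the cap and Q5_K_M about 3.2 GiB, while Q6_K and Q8_0 come out negative.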

Installation

Pick one serving path, then a client. llama.cpp gives you the most control over the Metal memory cap and context; Ollama is the turnkey alternative and its qwen3-next:80b tag (~50 GB) also fits 64 GB.

Path A — Ollama (turnkey, Metal)

Ollama ships a first-party pack. On a 64 GB M2 Max the default qwen3-next:80b tag is a Q4_K_M-class ~50 GB pack (verify on the official Ollama tags page); it fits, but is close enough to the cap that you should still raise iogpu.wired_limit_mb (see the unified-memory box) and keep context bounded:

# ~50 GB Q4_K_M-class pack — fits 64GB with the GPU cap raised
ollama pull qwen3-next:80b

Ollama uses Metal automatically on macOS and serves an OpenAI-compatible endpoint at http://localhost:11434/v1. No CUDA flags. (Do not pull :80b-a3b-instruct-q8_0 — at ~85 GB it exceeds total unified memory.) Note that Ollama offers no sub-Q4 tag; the smallest published tag is the ~50 GB Q4_K_M-class pack, per the tags page.

Path B — llama.cpp (recent build, Metal)

Build a recent llama.cpp (Metal is the default on macOS) and download the first-party Q4_K_M GGUF. You need a build at or after b7186, the first release that contains the Qwen3-Next merge (PR #16095, merged 2025-11-28):

# Build a recent llama.cpp (Metal is default on macOS); ensure it is >= b7186
git clone https://github.com/ggml-org/llama.cpp && cd llama.cpp
cmake -B build -DGGML_METAL=ON
cmake --build build -j

# Download the first-party Q4_K_M GGUF (single file, ~45.1 GiB)
huggingface-cli download Qwen/Qwen3-Next-80B-A3B-Instruct-GGUF \
  Qwen3-Next-80B-A3B-Instruct-Q4_K_M.gguf \
  --local-dir ~/models/Qwen3-Next-80B-A3B-Instruct-GGUF

-DGGML_METAL=ON is the default on macOS (no CUDA on Apple Silicon), so Metal handles GPU offload. If your build predates b7186 it will report an unknown architecture — pull a newer tag and rebuild.
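A quick way to confirm a checkout is new enough is to compare its release tag against b7186. The sketch below only parses tag text; the helper name `build_supports_qwen3_next` is hypothetical, and it assumes tags follow the `bNNNN` pattern used in this recipe.

```shell
#!/bin/sh
# build_supports_qwen3_next TAG
# Succeeds when a llama.cpp release tag such as "b7186", or a
# `git describe` string such as "b7200-3-gabc1234", is at or after b7186.
# Hypothetical helper; it only inspects the tag text.
build_supports_qwen3_next() {
  num=${1#b}          # drop the leading "b"
  num=${num%%-*}      # drop any "-N-g<hash>" suffix
  case $num in
    ''|*[!0-9]*) return 2 ;;   # not a bNNNN-style tag
  esac
  [ "$num" -ge 7186 ]
}

for tag in b7100 b7186 b7200-3-gabc1234; do
  if build_supports_qwen3_next "$tag"; then
    echo "$tag: new enough"
  else
    echo "$tag: too old, pull a newer tag and rebuild"
  fi
done
```

Inside the clone, `build_supports_qwen3_next "$(git describe --tags)"` applies the same check to the current checkout, assuming the release tags were fetched with the clone.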

Install a client

Open WebUI gives you a browser chat UI over the OpenAI-compatible endpoint:

pip install open-webui && open-webui serve

Or skip the UI entirely and talk to the endpoint with curl (see Running).

Running

0. Raise the Metal memory cap (once per boot)

sudo sysctl iogpu.wired_limit_mb=57344   # ~56 GB GPU-usable

1. Serve Qwen3-Next with a bounded context

Ollama (Path A): load the model. ollama run opens an interactive chat session, and the Ollama server behind it exposes the OpenAI-compatible API at http://localhost:11434/v1 and drives Metal automatically:

ollama run qwen3-next:80b

Ollama's default context is small; raise it deliberately (e.g. /set parameter num_ctx 32768, or an OLLAMA_CONTEXT_LENGTH env var) rather than jumping to the full 262,144-token window. With ~50 GB of weights resident on a 64 GB machine, the full 256K context will not fit alongside the weights and the OS reservation — start bounded and raise num_ctx while watching memory in Activity Monitor. The hybrid linear-attention design (see the long-context box) means a given num_ctx costs less KV here than on a dense 80B, so you can push context reasonably far — but the weights still set the floor.
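Context size can also be set per request through Ollama's native chat API, which accepts an `options.num_ctx` field. The sketch below only builds the request body so the value is explicit; the helper name `ollama_chat_body` is hypothetical, and it does no JSON escaping, so the prompt must not contain double quotes or backslashes.

```shell
#!/bin/sh
# ollama_chat_body NUM_CTX PROMPT
# Prints a JSON body for Ollama's native /api/chat endpoint with an
# explicit context size. Hypothetical helper; PROMPT is inserted as-is,
# so it must not contain double quotes or backslashes.
ollama_chat_body() {
  printf '{"model": "qwen3-next:80b", "stream": false, "options": {"num_ctx": %d}, "messages": [{"role": "user", "content": "%s"}]}\n' "$1" "$2"
}

# Start bounded at 32K, as recommended above
ollama_chat_body 32768 "Summarize the attached notes in five bullets."
```

To send it, pipe the output into curl: `ollama_chat_body 32768 "..." | curl -s http://localhost:11434/api/chat -d @-`. Raise the first argument step by step while watching memory.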

llama.cpp (Path B) — serve the downloaded GGUF with the model's built-in chat template:

./build/bin/llama-server \
  -m ~/models/Qwen3-Next-80B-A3B-Instruct-GGUF/Qwen3-Next-80B-A3B-Instruct-Q4_K_M.gguf \
  --jinja \
  -ngl 99 \
  -c 32768 \
  --temp 0.7 --top-p 0.8 --top-k 20 --min-p 0 \
  --port 8000

--jinja applies the model's bundled chat template so the instruct format is correct.
-ngl 99 offloads all layers to the GPU via Metal (the quant file must sit in unified memory — see the MoE note above).
-c 32768 caps context at 32K, a comfortable value on 64 GB running Q4_K_M with the cap raised. The card documents up to 262,144, but a bounded value keeps the KV cache reasonable; raise -c while watching memory in Activity Monitor's Memory tab. (sudo powermetrics --samplers gpu_power reports GPU activity and power rather than memory, so use it only to confirm the GPU is doing the work.) The 3:1 DeltaNet:attention layout keeps KV growth gentle, so context scales further here than the raw number suggests.
The sampling flags match the model card's recommended settings: Temperature 0.7, TopP 0.8, TopK 20, MinP 0. (You may add a presence_penalty between 0 and 2 to reduce repetition on long generations.)

Both paths expose an OpenAI-compatible API (:11434/v1 for Ollama, :8000/v1 for llama.cpp).
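Loading ~45 GiB of weights takes a while, so a client that connects immediately may fail. A minimal polling sketch, assuming curl is installed; the helper name `wait_for_server` is hypothetical, and it treats any HTTP success response as "ready".

```shell
#!/bin/sh
# wait_for_server URL [ATTEMPTS]
# Polls URL once per second until it answers with an HTTP success status
# or the attempts run out. Hypothetical helper for illustration only.
wait_for_server() {
  url=$1
  attempts=${2:-60}
  i=0
  while [ "$i" -lt "$attempts" ]; do
    # -f makes curl fail on HTTP errors, -s keeps it quiet
    if curl -sf -o /dev/null "$url"; then
      return 0
    fi
    i=$(( i + 1 ))
    sleep 1
  done
  return 1
}
```

For llama.cpp, `wait_for_server http://localhost:8000/health 120 && echo ready` uses llama-server's health endpoint, which reports an error status while the model is still loading. For Ollama, polling `http://localhost:11434/v1/models` serves the same purpose.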

2. Talk to it

Point Open WebUI at the endpoint, or hit it directly with curl:

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3-next-80b-a3b",
    "messages": [{"role": "user", "content": "Summarize this meeting transcript in five bullets: ..."}],
    "temperature": 0.7, "top_p": 0.8
  }'

That is your private assistant — chat, summarization, drafting, and long-document Q&A, all local, with nothing leaving the machine.

Results

Memory usage: The Q4_K_M weights are ~45.1 GiB (48.41 GB) and must stay resident in unified memory. With the GPU cap raised to ~56 GB, that leaves roughly 10–11 GB of GPU-usable memory for the KV cache and activations, plus ~8 GB reserved for the OS — enough for a bounded 32K–64K context, more once you account for the cheap linear-attention KV. Per-quant sizes are verified via the Qwen3-Next-80B-A3B-Instruct-GGUF file tree and the Ollama tags page. Q5_K_M (56.71 GB) and above do not fit this machine with usable headroom.
Speed expectation: llama.cpp support is recent and not yet speed-tuned. The merge was explicitly "correctness only", so tokens/sec on a fresh build today are modest and will improve as follow-up PRs land. There is no community throughput benchmark for this model on the M2 Max yet, so we quote no tok/s figure rather than invent one or borrow one from different hardware.
Production path (non-Mac): for GPU-server deployment the day-one runtimes are vLLM (≥ 0.10.2) and SGLang (≥ 0.5.2); on Apple/Metal, llama.cpp and Ollama are the path.

There is no benchmark for Qwen3-Next-80B-A3B on the M2 Max in the catalog yet — /check/qwen3-next-80b-a3b/m2-max has no data. If you run it, report your throughput via the submission form so we can seed real benchmark data.


Troubleshooting

`llama-server` reports an unknown model architecture / won't load the GGUF

Your llama.cpp build predates Qwen3-Next support. The qwen3_next architecture first shipped in release b7186 (PR #16095, merged 2025-11-28). Rebuild from a recent checkout (or update Ollama to a version that bundles it). This model is upstream — you just need a build new enough to contain the merge.

The model won't load / OOMs on 64GB

First, raise the Metal cap: sudo sysctl iogpu.wired_limit_mb=57344. The ~45.1 GiB Q4_K_M weights sit right at the default GPU-usable line (~48 GB), so without raising the cap there is almost no room for the KV cache. If it still OOMs after raising the cap, your context is too high — lower num_ctx on Ollama (or -c on llama.cpp, e.g. -c 16384) and close other heavy apps. Do not step up to Q5_K_M or higher on 64 GB — those quants do not fit with usable headroom.
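Because the cap resets at boot, it is worth confirming it is actually in effect before blaming the context size. A sketch of that check, assuming a reported value of 0 means the cap was never raised and macOS is using its built-in default; the helper name `cap_status` is hypothetical.

```shell
#!/bin/sh
# cap_status NEEDED_MIB
# Compares the current iogpu.wired_limit_mb against NEEDED_MIB.
# Assumption: 0 means the cap has not been raised since boot.
# Hypothetical helper for illustration only.
cap_status() {
  current=$(sysctl -n iogpu.wired_limit_mb 2>/dev/null) || current=""
  if [ -z "$current" ]; then
    echo "unknown: iogpu.wired_limit_mb not available on this system"
  elif [ "$current" -eq 0 ]; then
    echo "default: cap not raised since boot"
  elif [ "$current" -lt "$1" ]; then
    echo "low: $current MiB is below $1 MiB"
  else
    echo "ok: $current MiB"
  fi
}

cap_status 57344
```

If this prints anything other than "ok", rerun the sudo sysctl command above before lowering the context.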

Tokens/sec feel slow

Expected for now: llama.cpp's Qwen3-Next implementation is recent and "correctness only" — speed tuning is coming in later PRs (see the known-issue box). Keep to a bounded context, keep the build current, and watch for follow-up optimization PRs. The hybrid linear-attention design should also make long-context prompts degrade more gracefully than a dense 80B once speed work lands.

No other widely-reported issues on the M2 Max yet. If you run Qwen3-Next-80B-A3B on this chip, report your throughput and any problems via the submission form so we can seed real benchmark data.