
Qwen3-Next 80B-A3B on Apple M3 Max: an 80B MoE Assistant in 48GB via a Sub-Q4 GGUF

llm · advanced · 48GB+ VRAM · Jul 4, 2026

This advanced recipe sets up Qwen3-Next 80B-A3B on an Apple M3 Max with 48 GB of unified memory, which the GPU shares with the OS.

prerequisites
  • Apple M3 Max with 48GB unified memory — the marginal tier: the first-party Q4_K_M (~45.1 GiB) does NOT fit, so this recipe uses a sub-Q4 community quant (Q3_K_S, ~32.2 GiB) and accepts the quality tradeoff
  • macOS with Metal (llama.cpp builds with Metal by default on macOS)
  • A RECENT llama.cpp build — at or after release b7186 (2025-11-28), the first build that contains the Qwen3-Next merge; an older stock binary will not load this architecture
  • Willingness to raise the Metal GPU memory cap (sudo sysctl iogpu.wired_limit_mb) and keep context tightly bounded
  • ~35GB free disk for the Q3_K_S GGUF weight file (~32.2 GiB)
  • A simple client to talk to the OpenAI-compatible endpoint: Open WebUI, or just curl

What You'll Build

A private, local long-context assistant: Qwen3-Next-80B-A3B-Instruct, Qwen's 80B-total / ~3B-active Mixture-of-Experts model, served as an OpenAI-compatible endpoint on an Apple M3 Max (48GB unified memory) via llama.cpp on Metal and driven by a simple client such as Open WebUI or a plain curl call. It is a text-only generalist instruct model for chat, summarization, drafting, and long-document work, with a 262,144-token native context (extensible toward ~1M via YaRN), per the Qwen3-Next-80B-A3B-Instruct model card. 48 GB is the marginal tier for this 80B MoE: the first-party Q4_K_M GGUF (~45.1 GiB) does not fit, so this recipe runs a sub-Q4 community quant (unsloth's Q3_K_S, ~32.2 GiB). That is a real quality tradeoff versus Q4, but it fits the tuned Metal budget with a bounded KV cache.

Hardware data: Apple M3 Max (48GB unified memory, Metal) · Qwen3-Next-80B-A3B Q3_K_S GGUF (34.58 GB / ~32.2 GiB, sub-Q4 community quant) · tightly bounded context, raised GPU cap

⚠️ On 48GB you MUST drop below Q4 — this is a quality tradeoff, stated plainly. The first-party Q4_K_M is ~45.1 GiB (48.41 GB), which exceeds even the tuned Metal budget on a 48 GB machine — it will not fit with room for the KV cache and the OS. The first-party repo publishes nothing below Q4_K_M, so this tier requires a community sub-Q4 GGUF from unsloth/Qwen3-Next-80B-A3B-Instruct-GGUF. This recipe uses Q3_K_S (34.58 GB / ~32.2 GiB) — a standard 3-bit-class quant that fits with the cap raised. Expect measurably lower quality than Q4_K_M (a 64 GB M2 Max can run Q4_K_M; a 48 GB machine cannot). If quality matters more than fitting on this chip, step up to a 64 GB Mac.

⚠️ llama.cpp support for this architecture is RECENT and not yet speed-tuned. The hybrid qwen3_next architecture was merged in ggml-org/llama.cpp#16095 "Model: Qwen3 Next" on 2025-11-28, first shipping in release b7186. It works on a recent build — but the author states the implementation is "focused on CORRECTNESS ONLY … Speed tuning … will come in future PRs." Expect modest tokens/sec on a fresh build today; it improves as follow-up PRs land. Use a build at or after b7186 — an older stock binary will report an unknown architecture.

ℹ️ An MoE keeps all experts resident — the file size is the memory cost, not the "3B active" count. Qwen3-Next is 80B total with 512 experts, 10 activated per token (plus 1 shared), ~3B active per token, per the model card. Only some experts fire per token (a throughput property), but all experts stay loaded in unified memory, so the footprint is the full quant file — ~32.2 GiB at Q3_K_S — not a smaller "3B active" fraction. This is precisely why 48 GB is tight: you cannot page out the unused experts.

ℹ️ Long context is cheaper here than on a dense 80B — which is what makes 48GB viable at all. Qwen3-Next is a hybrid: 48 layers in a 3:1 ratio of Gated DeltaNet (linear-attention) blocks to full Gated-Attention blocks, per the model card. Linear attention on 3/4 of the layers makes the KV cache grow much more slowly with context length than a dense model — the reason a bounded context still fits after the ~32.2 GiB of weights. But the weights dominate, so on 48 GB you keep context tightly bounded.

ℹ️ Apple unified memory is shared with the OS, so you must raise the GPU cap, and headroom is thin. On Apple Silicon the CPU and GPU share one memory pool; by default macOS lets the GPU wire down roughly 70–75% of total, about 36 GB of the 48 GB. The ~32.2 GiB Q3_K_S weights alone leave less than 4 GB under that default line, too little for the KV cache and activations, so raise the cap once per boot before serving:

sudo sysctl iogpu.wired_limit_mb=40960   # ~40 GB GPU-usable; leaves ~8 GB for the OS

That leaves roughly 8 GB above the weights for a bounded KV cache. Close other heavy apps. Do not give the GPU all 48 GB — the OS needs headroom.
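The "roughly 8 GB" figure can be checked with integer arithmetic. The sketch below is illustrative only: `headroom_mib` is a hypothetical helper name, the byte counts are the file sizes quoted in this recipe, and the result ignores any compute buffers the runtime allocates on top of the weights.

```shell
# Hypothetical helper: MiB left under the GPU wired limit once the weights are resident.
#   $1 = wired limit in MiB (the value passed to iogpu.wired_limit_mb)
#   $2 = GGUF file size in bytes
headroom_mib() {
  local cap_mib="$1" weight_bytes="$2"
  # Round the weights up to whole MiB so the estimate errs on the safe side.
  local weight_mib=$(( (weight_bytes + 1048575) / 1048576 ))
  echo $(( cap_mib - weight_mib ))
}

headroom_mib 40960 34580476704   # Q3_K_S: prints 7981 (about 7.8 GiB)
headroom_mib 40960 48410988384   # first-party Q4_K_M: prints -5209 (does not fit)
```

A negative result means the weights alone exceed the cap, before any KV cache is allocated.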

ℹ️ Quant reality: no first-party quant fits 48GB, so the fit is community-only. The first-party GGUF repo starts at Q4_K_M (48.41 GB) and goes up (Q5_K_M, Q6_K, Q8_0, BF16); none of these fits 48 GB. Sub-Q4 fits come only from the community. The unsloth repo publishes several: Q3_K_S 34.58 GB (used here), Q3_K_M 38.30 GB, UD-Q3_K_XL 35.64 GB, UD-IQ3_XXS 33.09 GB, and smaller IQ2/IQ1 packs. Q3_K_S is the quality/size sweet spot: it clears the ~40 GB tuned cap with KV headroom. Q3_K_M (38.30 GB) also fits but leaves less room for context.
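To make the size/headroom tradeoff concrete, here is a rough sketch that picks the largest listed quant that still leaves a chosen reserve under the cap. `pick_quant` and the reserve value are assumptions made for illustration; the sizes are the decimal-GB figures listed above, and the cap is converted from MiB to decimal GB before comparing.

```shell
# Hypothetical helper: largest unsloth quant (from the sizes listed above)
# that fits under the wired limit while keeping a reserve for KV cache/activations.
#   $1 = wired limit in MiB, $2 = reserve in MiB
pick_quant() {
  local cap_mib="$1" reserve_mib="$2"
  awk -v cap="$cap_mib" -v res="$reserve_mib" '
    BEGIN { budget_gb = (cap - res) * 1048576 / 1e9 }   # MiB -> decimal GB
    $2 <= budget_gb && $2 > best { best = $2; name = $1 }
    END { if (name != "") print name; else exit 1 }
  ' <<'EOF'
Q3_K_M 38.30
UD-Q3_K_XL 35.64
Q3_K_S 34.58
UD-IQ3_XXS 33.09
EOF
}

pick_quant 40960 7168   # 7 GiB reserve: prints Q3_K_S
pick_quant 40960 4096   # 4 GiB reserve: prints Q3_K_M
```

The function returns a non-zero status when nothing in the list fits, which is the signal to drop to an IQ2 pack or a larger machine.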

Requirements

| Component | Minimum | Tested |
| --- | --- | --- |
| GPU | 48GB unified (Q3_K_S is ~32.2 GiB; Q4_K_M does NOT fit here) | Apple M3 Max (48GB unified memory, Metal) |
| RAM | Unified with GPU (48GB total on this config) | 48GB unified |
| Storage | ~35GB for the Q3_K_S GGUF | 34.58 GB (Qwen3-Next-80B-A3B-Instruct-Q3_K_S.gguf) |
| Software | llama.cpp ≥ b7186 (Metal); a client (Open WebUI / curl) | llama.cpp b7186+ |
| License | Apache-2.0 (commercial-OK) | |

The community Q3_K_S GGUF is 34.58 GB (34,580,476,704 bytes), a single file, per the unsloth Qwen3-Next-80B-A3B-Instruct-GGUF file tree. The first-party Q4_K_M is 48.41 GB (48,410,988,384 bytes), per the Qwen file tree — it does not fit 48 GB. The model is licensed under Apache-2.0 (commercial-OK) and is text-only, per the model card.
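This recipe quotes sizes in both decimal GB (as file listings show them) and binary GiB (the unit that matters against the memory cap). A small sketch of the conversion, using the byte counts above; `bytes_to_units` is a hypothetical helper name.

```shell
# Hypothetical helper: print a byte count as decimal GB and binary GiB.
bytes_to_units() {
  awk -v b="$1" 'BEGIN { printf "%.2f GB = %.1f GiB\n", b / 1e9, b / 1073741824 }'
}

bytes_to_units 34580476704   # Q3_K_S: 34.58 GB = 32.2 GiB
bytes_to_units 48410988384   # Q4_K_M: 48.41 GB = 45.1 GiB
```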

⚠️ Ollama's qwen3-next:80b does NOT fit 48GB. The default Ollama tag is a ~50 GB Q4_K_M-class pack, and Ollama publishes no sub-Q4 tag — the smallest tag is that ~50 GB pack, per the Ollama tags page. So on a 48 GB M3 Max the turnkey Ollama route is not available for this model; the path is llama.cpp with the community Q3_K_S GGUF. (On a 64 GB Mac, Ollama's ~50 GB tag does fit — see the M2 Max recipe.)

Installation

There is only one serving path on this tier — llama.cpp with the sub-Q4 community GGUF (Ollama's smallest tag is too big; see the box above). Then install a client.

llama.cpp (recent build, Metal)

Build a recent llama.cpp (Metal is the default on macOS) and download the community Q3_K_S GGUF. You need a build at or after b7186, the first release that contains the Qwen3-Next merge (PR #16095, merged 2025-11-28):

# Build a recent llama.cpp (Metal is default on macOS); ensure it is >= b7186
git clone https://github.com/ggml-org/llama.cpp && cd llama.cpp
cmake -B build -DGGML_METAL=ON
cmake --build build -j

# Download the community sub-Q4 GGUF (single file, ~32.2 GiB) that this recipe uses on 48GB
huggingface-cli download unsloth/Qwen3-Next-80B-A3B-Instruct-GGUF \
  Qwen3-Next-80B-A3B-Instruct-Q3_K_S.gguf \
  --local-dir ~/models/Qwen3-Next-80B-A3B-Instruct-GGUF

-DGGML_METAL=ON is the default on macOS (no CUDA on Apple Silicon), so Metal handles GPU offload. If your build predates b7186 it will report an unknown architecture — pull a newer checkout and rebuild.
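Before building, it can help to confirm that the checkout is new enough. The sketch below compares a release tag against b7186; `build_ok` is a hypothetical helper, and it assumes tags in the `b<number>` form (optionally with a `git describe` suffix), which is how llama.cpp releases are named in this recipe.

```shell
# Hypothetical helper: succeed if a llama.cpp tag is at or after a minimum build.
#   $1 = tag such as "b7186" or "b7190-2-gabc1234"
#   $2 = minimum build number (defaults to 7186, the first Qwen3-Next release)
build_ok() {
  local tag="${1#b}"        # drop the leading "b"
  local num="${tag%%-*}"    # drop any "-N-g<sha>" suffix
  case "$num" in
    ''|*[!0-9]*) return 2 ;;   # not a recognizable build number
  esac
  [ "$num" -ge "${2:-7186}" ]
}

# Example use from inside the checkout (not run here):
#   build_ok "$(git describe --tags)" || echo "checkout predates b7186: pull and rebuild"
```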

Install a client

Open WebUI gives you a browser chat UI over the OpenAI-compatible endpoint:

pip install open-webui && open-webui serve

Or skip the UI and talk to the endpoint with curl (see Running).

Running

0. Raise the Metal memory cap (once per boot)

sudo sysctl iogpu.wired_limit_mb=40960   # ~40 GB GPU-usable

1. Serve Qwen3-Next with a tightly bounded context

./build/bin/llama-server \
  -m ~/models/Qwen3-Next-80B-A3B-Instruct-GGUF/Qwen3-Next-80B-A3B-Instruct-Q3_K_S.gguf \
  --jinja \
  -ngl 99 \
  -c 16384 \
  --temp 0.7 --top-p 0.8 --top-k 20 --min-p 0 \
  --port 8000
  • --jinja applies the model's bundled chat template so the instruct format is correct.
  • -ngl 99 offloads all layers to the GPU via Metal (the quant file must sit in unified memory — see the MoE note above).
  • -c 16384 caps context at 16K, a deliberately tight value on 48 GB, where only ~8 GB sits above the weights after raising the cap. The card documents up to 262,144 tokens, but on this tier you keep it small and raise it cautiously while watching memory pressure in Activity Monitor's Memory tab (sudo powermetrics --samplers gpu_power reports GPU load and power, not memory use). The 3:1 DeltaNet:attention layout keeps KV growth gentle, so 16K–32K is realistic; the full 256K is not.
  • The sampling flags match the card's recommended settings — Temperature 0.7, TopP 0.8, TopK 20, MinP 0 — per the model card. (You may add a presence_penalty between 0 and 2 to reduce repetition.)

This exposes an OpenAI-compatible API at http://localhost:8000/v1.
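Loading ~32 GiB of weights takes a while, so a client that connects immediately may fail. A minimal sketch of a readiness wait, assuming curl is installed and relying on llama-server's /health endpoint; `wait_for_server`, the retry count, and the 2-second interval are illustrative choices.

```shell
# Hypothetical helper: poll the server's /health endpoint until it answers 2xx.
#   $1 = base URL (default http://localhost:8000), $2 = max attempts (default 60)
wait_for_server() {
  local base="${1:-http://localhost:8000}" tries="${2:-60}" i=0
  while [ "$i" -lt "$tries" ]; do
    # -f: treat non-2xx responses as failure; -s: quiet; --max-time bounds each probe
    if curl -fs --max-time 2 "$base/health" >/dev/null 2>&1; then
      return 0
    fi
    i=$((i + 1))
    if [ "$i" -lt "$tries" ]; then sleep 2; fi
  done
  return 1
}

# Example use (not run here):
#   wait_for_server http://localhost:8000 90 && echo "ready" || echo "server did not come up"
```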

2. Talk to it

Point Open WebUI at the endpoint, or hit it directly with curl:

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3-next-80b-a3b",
    "messages": [{"role": "user", "content": "Summarize this document in five bullets: ..."}],
    "temperature": 0.7, "top_p": 0.8
  }'

That is your private assistant — chat, summarization, drafting, and long-document Q&A, all local. Bear in mind the Q3_K_S quant trades some fidelity for fitting on 48 GB.

Results

  • Memory usage: The Q3_K_S weights are ~32.2 GiB (34.58 GB) and must stay resident in unified memory. With the GPU cap raised to ~40 GB, that leaves only ~8 GB of GPU-usable memory for the KV cache and activations, plus ~8 GB reserved for the OS — enough for a tightly bounded 16K–32K context, helped by the cheap linear-attention KV. Sizes are verified via the unsloth file tree (Q3_K_S) and the first-party file tree (Q4_K_M, which does not fit here).
  • Quality note: this is the honest cost of the tier — Q3_K_S is a sub-Q4 quant and degrades quality relative to the Q4_K_M a 64 GB Mac can run. It is the only way to fit this 80B MoE on 48 GB; if you need Q4 fidelity, use a 64 GB machine (see the M2 Max recipe).
  • Speed expectation: llama.cpp support is recent and not yet speed-tuned. The merge was explicitly "correctness only", so tokens/sec on a fresh build today are modest and should improve as follow-up PRs land. There is no community throughput benchmark for this model on the M3 Max yet, so this recipe quotes no tok/s figure rather than inventing one or borrowing one from different hardware.
  • Production path (non-Mac): for GPU-server deployment the day-one runtimes are vLLM (≥ 0.10.2) and SGLang (≥ 0.5.2); on Apple/Metal, llama.cpp is the path (Ollama's smallest tag is too big for 48 GB).

There is no benchmark for Qwen3-Next-80B-A3B on the M3 Max in the catalog yet: /check/qwen3-next-80b-a3b/m3-max has no data. If you run it, report your throughput via the submission form so we can seed real benchmark data; results will appear on that page.

Troubleshooting

llama-server reports an unknown model architecture / won't load the GGUF

Your llama.cpp build predates Qwen3-Next support. The qwen3_next architecture first shipped in release b7186 (PR #16095, merged 2025-11-28). Rebuild from a recent checkout. This model is upstream — you just need a build new enough to contain the merge.

The model won't load / OOMs on 48GB

First, raise the Metal cap: sudo sysctl iogpu.wired_limit_mb=40960, and close other heavy apps. The ~32.2 GiB Q3_K_S weights sit right at the default GPU-usable line (~36 GB). If it still OOMs, your context is too high — lower -c (e.g. -c 8192). Do not try the first-party Q4_K_M (48.41 GB) or Ollama's ~50 GB tag on 48 GB — neither fits. If even Q3_K_S is too tight after bounding context, drop to a smaller unsloth pack (UD-IQ3_XXS 33.09 GB, or an IQ2 pack) at a further quality cost.

Tokens/sec feel slow

Expected for now: llama.cpp's Qwen3-Next implementation is recent and "correctness only" — speed tuning is coming in later PRs (see the known-issue box). Keep to a tightly bounded context and keep the build current. The hybrid linear-attention design should make long-context prompts degrade more gracefully once speed work lands.

Output quality feels weaker than expected

You are running a sub-Q4 quant (Q3_K_S) because Q4_K_M does not fit 48 GB — some quality loss is inherent to the tier. Try Q3_K_M (38.30 GB) if it still fits your budget after the OS reservation, or move to a 64 GB Mac to run the first-party Q4_K_M.

No other widely reported issues on the M3 Max yet. If you hit a problem running Qwen3-Next-80B-A3B on this chip, report it via the submission form along with your throughput.

common questions
How much VRAM does Qwen3-Next 80B-A3B need?

About 48 GB of unified memory, the minimum this recipe targets. The Q3_K_S weights alone take ~32.2 GiB; the rest covers the KV cache and the OS.

Which GPUs is Qwen3-Next 80B-A3B tested on?

Apple M3 Max (48 GB).

How hard is this setup?

Advanced — follow the steps above.