How much VRAM does gpt-oss 20B need?

About 13 GB — the minimum this recipe targets.

How hard is this setup?

Beginner: follow the steps below.

gpt-oss 20B on Apple M2 Max: native-MXFP4 chat in 64 GB unified memory with MLX

What You'll Build

A fully local OpenAI gpt-oss 20B chat endpoint running on an Apple M2 Max with 64 GB unified memory, using Apple's native MLX runtime: no NVIDIA GPU, no CUDA, no FlashAttention. gpt-oss-20b is a 21B-parameter Mixture-of-Experts model with ~3.6B active parameters per token, shipped by OpenAI in the MXFP4 4-bit format, so the weights fit in ~12 GB and leave most of the M2 Max's memory pool free.

Hardware data: Apple M2 Max (64 GB unified memory) · MLX MXFP4-Q8 weights ~12.1 GB on disk · See benchmark data

ℹ️ MXFP4 is the model's native format, not a CUDA trick. gpt-oss-20b is post-trained with its MoE weights quantized to MXFP4, which per OpenAI's model card lets the model "run within 16GB of memory" (openai/gpt-oss-20b card). MXFP4 is an OCP-standard 4-bit float, not an NVIDIA-only escape hatch: Apple's MLX and llama.cpp-Metal both run it directly on the M2 Max GPU. Nothing has to be swapped out or worked around on Apple Silicon, unlike the FP8/NVFP4 paths a CUDA build would use.

ℹ️ Unified memory is not VRAM. The M2 Max has 64 GB of unified memory shared by CPU and GPU — not 64 GB of dedicated VRAM. macOS only lets the GPU address roughly 75% of it (~48 GB via Metal's recommendedMaxWorkingSetSize). At ~12 GB the gpt-oss-20b weights fit that ~48 GB pool with enormous headroom — this is a small-MoE model on a large machine, so unlike a 70B model there is no fit wall to manage here.

Requirements

Component	Minimum	Tested
GPU / memory	16 GB unified memory (~10.5 GB GPU-addressable)	Apple M2 Max (64 GB unified memory, ~48 GB GPU-addressable)
RAM	Same pool — unified	64 GB unified
Storage	~13 GB (MLX MXFP4-Q8 or MXFP4 GGUF)	~12.1 GB
Software	Python 3.10+, macOS Sonoma 14 / Sequoia 15+	macOS Sequoia 15

The binding constraint on Apple Silicon is addressable unified memory, not raw capacity. The MLX weights are 12.08 GB on disk across 3 safetensors shards (HF tree API for mlx-community/gpt-oss-20b-MXFP4-Q8). Against the ~48 GB the M2 Max GPU can address by default, that leaves well over 30 GB for the KV-cache and macOS — so even a long context window is comfortable here, and no wired-limit raise is needed on a 64 GB Mac. (On a 16 GB Mac the same model is tight against the ~10.5 GB addressable share — see Troubleshooting.)
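The headroom arithmetic above can be checked with a short sketch. The sizes are the ones quoted in this recipe (12.08 GB of weights, ~48 GB and ~10.5 GB GPU-addressable by default); `headroom_gb` is a hypothetical helper name, and the result ignores KV-cache growth and runtime overhead.

```python
# Rough memory-budget check using the figures quoted in this recipe.
# headroom_gb is a hypothetical helper, not part of MLX or mlx-lm.

WEIGHTS_GB = 12.08  # MLX MXFP4-Q8 weights on disk (3 safetensors shards)


def headroom_gb(addressable_gb: float, weights_gb: float = WEIGHTS_GB) -> float:
    """GB left for the KV-cache and everything else once the weights are loaded."""
    return addressable_gb - weights_gb


m2_max_64gb = headroom_gb(48.0)  # ~48 GB GPU-addressable by default
mac_16gb = headroom_gb(10.5)     # ~10.5 GB GPU-addressable by default

print(f"64 GB M2 Max: {m2_max_64gb:.1f} GB headroom")
print(f"16 GB Mac:    {mac_16gb:.1f} GB headroom (negative: does not fit by default)")
```

A negative result for the 16 GB case is why the Troubleshooting section discusses raising the wired-memory limit on small Macs.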

Installation

1. Install MLX-LM (the Apple-native path)

pip install mlx-lm

MLX is Apple's array framework; mlx-lm is its LLM front-end. There is nothing CUDA-shaped to install — no torch build flags, no cu12x wheel, no FlashAttention. (ml-explore/mlx-lm)

2. Run the model (weights download on first use)

The canonical Apple build is mlx-community/gpt-oss-20b-MXFP4-Q8 — converted directly from openai/gpt-oss-20b with mlx-lm, keeping the MoE experts in their native MXFP4 and the rest at Q8. The repo's own README ships this exact snippet:

from mlx_lm import load, generate

model, tokenizer = load("mlx-community/gpt-oss-20b-MXFP4-Q8")

prompt = "hello"

if tokenizer.chat_template is not None:
    messages = [{"role": "user", "content": prompt}]
    prompt = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True
    )

response = generate(model, tokenizer, prompt=prompt, verbose=True)

On first run, mlx-lm pulls the weights (~12.1 GB, 3 shards) from the mlx-community Hugging Face org and caches them under ~/.cache/huggingface. These weights are ungated, and the underlying model is Apache-2.0 licensed (openai/gpt-oss-20b) — no license-acceptance step.
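To confirm the download landed, a minimal sketch like the following reports how much of the repo is cached. It assumes the default Hugging Face layout (a `hub/models--<org>--<name>` folder under ~/.cache/huggingface, with no `HF_HOME` override); the helper names are hypothetical.

```python
import os
from pathlib import Path


def repo_cache_dir(repo_id: str, cache_root: Path) -> Path:
    """Map 'org/name' to the hub cache folder 'models--org--name'."""
    return cache_root / "hub" / ("models--" + repo_id.replace("/", "--"))


def dir_size_gb(path: Path) -> float:
    """Sum regular-file sizes under path; symlinks are skipped so blobs count once."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            file_path = Path(root) / name
            if not file_path.is_symlink():
                total += file_path.stat().st_size
    return total / 1e9


# Assumption: default cache root, no HF_HOME override.
cache_root = Path.home() / ".cache" / "huggingface"
model_dir = repo_cache_dir("mlx-community/gpt-oss-20b-MXFP4-Q8", cache_root)

if model_dir.exists():
    print(f"{model_dir}: {dir_size_gb(model_dir):.2f} GB cached")
else:
    print(f"{model_dir} not found; the weights have not been downloaded yet")
```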

You can also drive it one-shot from the shell:

mlx_lm.generate --model mlx-community/gpt-oss-20b-MXFP4-Q8 --prompt "Explain Mixture-of-Experts in one paragraph."

Running

For an interactive, OpenAI-compatible local server (so you can point Open WebUI, a chat client, or your own code at it):

mlx_lm.server --model mlx-community/gpt-oss-20b-MXFP4-Q8

This starts a local server on 127.0.0.1:8080 exposing an OpenAI-style /v1/chat/completions endpoint. It is a development server — bind it to localhost only. (mlx-lm server docs)
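As a rough illustration of talking to that endpoint from your own code, the sketch below uses only the Python standard library. It assumes the server is on the default 127.0.0.1:8080 and returns an OpenAI-style response body; the helper names, the `max_tokens` value, and passing the repo id in the `model` field are choices made for this example, not something the recipe prescribes.

```python
import json
import urllib.request

BASE_URL = "http://127.0.0.1:8080"  # default address quoted above
MODEL_ID = "mlx-community/gpt-oss-20b-MXFP4-Q8"


def build_chat_request(prompt: str, max_tokens: int = 256) -> urllib.request.Request:
    """Build (but do not send) an OpenAI-style chat completion request."""
    payload = {
        "model": MODEL_ID,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        BASE_URL + "/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt: str) -> str:
    """Send the request and return the assistant text (assumes an OpenAI-style reply)."""
    with urllib.request.urlopen(build_chat_request(prompt)) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]


# With mlx_lm.server running in another terminal:
# print(chat("Explain Mixture-of-Experts in one paragraph."))
```

Any OpenAI-compatible client library should work the same way once its base URL points at the local server.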

Alternative: the GGUF path (llama.cpp / Ollama / LM Studio)

If you prefer the portable GGUF ecosystem, gpt-oss-20b is also available as a native-MXFP4 GGUF:

# Ollama (simplest) — pulls a ~14 GB MXFP4 build with a 128K context window
ollama run gpt-oss:20b

Ollama lists gpt-oss:20b as a 14 GB download with a 128K context window, and notes the MoE weights are quantized to MXFP4 (ollama.com/library/gpt-oss).

For a hand-managed llama.cpp build, Metal is enabled by default on macOS: "On MacOS, Metal is enabled by default. Using Metal makes the computation run on the GPU." (llama.cpp build docs). A standard cmake -B build && cmake --build build --config Release therefore produces a GPU-enabled binary with no extra flags. Point it at the single-file MXFP4 GGUF from ggml-org/gpt-oss-20b-GGUF (gpt-oss-20b-mxfp4.gguf, ~12.1 GB).

LM Studio runs both MLX and GGUF from a GUI if you prefer not to touch the terminal.

All three GGUF runtimes load the same native-MXFP4 weights — there is no requantization step and no CUDA-specific quant kernel involved.

Results

Speed: No first-party Apple M2 Max benchmark for this model/hardware pair has been recorded yet: /check/gpt-oss-20b/m2-max currently returns verdict: unknown with no measurements. We are deliberately not quoting a tokens/sec figure. Token generation on Apple Silicon is memory-bandwidth-bound (the M2 Max has ~400 GB/s of unified-memory bandwidth), and the per-chip throughput numbers in circulation come from single community runs on different Macs, which is not enough to publish as a measured M2 Max result. If you run this, please contribute your tok/s so we can seed a real datapoint.
Memory usage: ~12.1 GB resident for the weights, plus a KV-cache that grows with context. Fits the M2 Max's ~48 GB default-addressable pool with more than 30 GB to spare — colocate a second model or run a long context freely.
Quality notes: gpt-oss-20b is a reasoning-tuned MoE; MXFP4 is its native training-time format, so you are not trading quality for a quantization here the way a 4-bit requant of a dense model would. The MLX MXFP4-Q8 build keeps the experts in MXFP4 and the remaining tensors at Q8 for fidelity.

For the full benchmark data (and to be the first to populate it), see /check/gpt-oss-20b/m2-max.

Troubleshooting

Running on a 16 GB Mac instead — out of memory / swapping

On a 64 GB M2 Max there is nothing to tune. But if you try the same weights on a 16 GB Apple Silicon Mac, the ~12 GB model is tight against the ~10.5 GB the GPU can address by default, and you may hit memory pressure. Two fixes:

1. Keep the context modest (e.g. ≤ 8k tokens) and close other GPU-heavy apps.
2. Raise the wired-memory limit (macOS Sonoma 14 / Sequoia 15+):
```
sudo sysctl iogpu.wired_limit_mb=13312   # 13 GB; leaves ~3 GB for macOS on a 16 GB Mac
```
Always leave headroom for macOS — on small Macs that is only a couple of GB, so watch Activity Monitor's Memory-Pressure gauge. The setting is temporary and resets on reboot (persist it via /etc/sysctl.conf); sudo sysctl iogpu.wired_limit_mb=0 restores the default. (On macOS Monterey 12 / Ventura 13 the knob is sudo sysctl debug.iogpu.wired_limit=<bytes> instead.)
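The value passed to sysctl is plain unit arithmetic, sketched below. The helper names and the 3 GB reserve check are hypothetical, chosen to mirror the example above; the byte conversion for the older knob assumes the same binary units (1 GB = 1024 MB). The script only prints the command and does not change any setting.

```python
def wired_limit_mb(gb: int) -> int:
    """Value for iogpu.wired_limit_mb (macOS 14 / 15+), in MB of 1024 KB."""
    return gb * 1024


def wired_limit_bytes(gb: int) -> int:
    """Value for debug.iogpu.wired_limit (macOS 12 / 13), in bytes."""
    return gb * 1024**3


def sysctl_command(gb: int, total_gb: int, reserve_gb: int = 3) -> str:
    """Build the sysctl command, refusing limits that leave macOS under reserve_gb."""
    if gb > total_gb - reserve_gb:
        raise ValueError(
            f"{gb} GB leaves less than {reserve_gb} GB for macOS on a {total_gb} GB Mac"
        )
    return f"sudo sysctl iogpu.wired_limit_mb={wired_limit_mb(gb)}"


# 13 GB on a 16 GB Mac, matching the command shown above.
print(sysctl_command(13, total_gb=16))
```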

Tried to install FlashAttention / bitsandbytes / a `cu12x` wheel and it failed

None of those apply on Apple Silicon. There is no CUDA, no FlashAttention, and no GPU bitsandbytes kernel on macOS — MLX uses its own Metal attention, and llama.cpp uses Metal + the native MXFP4 GGUF. MXFP4 itself is not an NVIDIA-only format, so you do not need an FP8 or NVFP4 path either. If a generic tutorial tells you to pip install flash-attn, pass --load-in-4bit, or fetch a cu12x wheel, skip those steps entirely; the commands above are the complete Apple path.

The MLX build won't load on an older mlx-lm

The mlx-community/gpt-oss-20b-MXFP4-Q8 repo was converted with mlx-lm 0.27.0. If load(...) errors on the MXFP4 tensors, upgrade: pip install -U mlx-lm. gpt-oss MXFP4 support is recent, so an older mlx-lm may not recognise the format.
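A quick way to see which mlx-lm you have is sketched below, using only the standard library. It treats 0.27.0 (the version the repo was converted with) as the comparison point; that is an assumption for illustration, not a documented minimum, and the helper names are hypothetical.

```python
import re
from importlib import metadata

CONVERTED_WITH = "0.27.0"  # version the MLX repo was converted with (see above)


def version_tuple(version: str) -> tuple:
    """Leading numeric components, padded to three: '0.27.1.dev0' -> (0, 27, 1)."""
    parts = []
    for piece in version.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def is_at_least(installed: str, required: str = CONVERTED_WITH) -> bool:
    """True if the installed version is the same as or newer than required."""
    return version_tuple(installed) >= version_tuple(required)


try:
    installed = metadata.version("mlx-lm")
except metadata.PackageNotFoundError:
    installed = None

if installed is None:
    print("mlx-lm is not installed: pip install mlx-lm")
elif is_at_least(installed):
    print(f"mlx-lm {installed} is at or above {CONVERTED_WITH}")
else:
    print(f"mlx-lm {installed} is older than {CONVERTED_WITH}: pip install -U mlx-lm")
```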

No other widely-reported issues. Report problems via the submission form.