How much VRAM does gpt-oss 20B need?

About 13 GB — the minimum this recipe targets.

How hard is this setup?

Intermediate — follow the steps above.

gpt-oss 20B on Apple M2 Pro: running 20B in 16 GB unified memory with the wired-limit raise

What You'll Build

A fully-local OpenAI gpt-oss 20B chat endpoint running on an Apple M2 Pro with 16 GB unified memory, using Apple's native MLX runtime — no NVIDIA GPU, no CUDA, no FlashAttention. gpt-oss-20b is a 21B-parameter Mixture-of-Experts model with ~3.6B active parameters per token, shipped by OpenAI in the MXFP4 4-bit format, so the whole thing weighs ~12 GB on disk. That is small enough that OpenAI markets it as fitting "within 16GB of memory" — but on a 16 GB Mac it is a genuinely tight fit, and you have to raise the GPU memory limit before it will load. This recipe walks the honest squeeze.

Hardware data: Apple M2 Pro (16 GB unified memory) · MLX MXFP4-Q8 weights ~12.1 GB on disk · See benchmark data

⚠️ This is a tight fit — the wired-limit raise is mandatory, not optional. The MXFP4 weights are ~12.1 GB, but a 16 GB M2 Pro only lets the GPU address ~10.5 GB by default (Metal's recommendedMaxWorkingSetSize). The model will not load until you raise the limit with sudo sysctl iogpu.wired_limit_mb — see Installation step 2. Even raised, you are left with only ~3 GB for macOS and the KV-cache, so keep the context window modest (≤ 4k recommended), close other apps, and watch Activity Monitor's Memory-Pressure gauge. This is the honest "20B on a 16 GB Mac" story: it works, but it is on the edge.

ℹ️ MXFP4 is the model's native format, not a CUDA trick. gpt-oss-20b is post-trained in MXFP4 — per OpenAI's model card, the MoE weights are quantized to MXFP4 so the model can "run within 16GB of memory" (openai/gpt-oss-20b card). MXFP4 is an OCP-standard 4-bit float, not an NVIDIA-only escape hatch — Apple's MLX and llama.cpp-Metal both run it directly on the M2 Pro GPU. There is no smaller "official" quant to fall back to: MXFP4 is the model's native 4-bit format, so ~12.1 GB is as small as the real weights get.

ℹ️ Unified memory is not VRAM. The M2 Pro has 16 GB of unified memory shared by CPU and GPU — not 16 GB of dedicated VRAM, and not a pool the GPU gets all of. macOS only lets the GPU address roughly two-thirds of it (~10.5 GB by default). At ~12.1 GB the gpt-oss-20b weights exceed that default share, which is exactly why the wired-limit raise below is required rather than a nice-to-have.

Requirements

Component	Minimum	Tested
GPU / memory	16 GB unified memory (~10.5 GB GPU-addressable by default, raised to ~13 GB)	Apple M2 Pro (16 GB unified memory)
RAM	Same pool — unified	16 GB unified
Storage	~13 GB (MLX MXFP4-Q8 or MXFP4 GGUF)	~12.1 GB
Software	Python 3.10+, macOS Sonoma 14 / Sequoia 15+	macOS Sequoia 15

The binding constraint on Apple Silicon is addressable unified memory, not raw capacity — and here it binds hard. The MLX weights are 12.10 GB on disk across 3 safetensors shards (HF tree API for mlx-community/gpt-oss-20b-MXFP4-Q8). Against the ~10.5 GB the M2 Pro GPU addresses by default, the weights alone do not fit — you must push the limit up to ~13 GB (Installation step 2), which leaves only ~3 GB for macOS plus the KV-cache. That is why a short context window and a clean machine (other apps closed) are part of the requirements here, not just nice-to-haves.

Installation

1. Install MLX-LM (the Apple-native path)

pip install mlx-lm

MLX is Apple's array framework; mlx-lm is its LLM front-end. There is nothing CUDA-shaped to install — no torch build flags, no cu12x wheel, no FlashAttention. (ml-explore/mlx-lm)

2. Raise the GPU memory limit (required on 16 GB)

Before the weights will load, raise the share of unified memory the GPU is allowed to wire down. On a default 16 GB M2 Pro the GPU caps out around ~10.5 GB, which is below the ~12.1 GB the model needs — so without this step load(...) fails or the machine swaps hard. On macOS Sonoma 14 / Sequoia 15+:

sudo sysctl iogpu.wired_limit_mb=13312   # 13 GB; leaves only ~3 GB for macOS on a 16 GB Mac

This is the tightest part of the recipe. 13 GB wired leaves roughly 3 GB for macOS, the runtime, and the KV-cache combined, so close other GPU-heavy apps first and watch Activity Monitor's Memory-Pressure gauge — if it goes yellow/red you are swapping. The setting is temporary and resets on reboot (persist it via /etc/sysctl.conf); sudo sysctl iogpu.wired_limit_mb=0 restores the default. On macOS Monterey 12 / Ventura 13 the knob is sudo sysctl debug.iogpu.wired_limit=<bytes> instead.

3. Run the model (weights download on first use)

The canonical Apple build is mlx-community/gpt-oss-20b-MXFP4-Q8 — converted directly from openai/gpt-oss-20b with mlx-lm, keeping the MoE experts in their native MXFP4 and the rest at Q8. The repo's own README ships this exact snippet:

from mlx_lm import load, generate

model, tokenizer = load("mlx-community/gpt-oss-20b-MXFP4-Q8")

prompt = "hello"

if tokenizer.chat_template is not None:
    messages = [{"role": "user", "content": prompt}]
    prompt = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True
    )

response = generate(model, tokenizer, prompt=prompt, verbose=True)

On first run, mlx-lm pulls the weights (~12.1 GB, 3 shards) from the mlx-community Hugging Face org and caches them under ~/.cache/huggingface. These weights are ungated, and the underlying model is Apache-2.0 licensed (openai/gpt-oss-20b) — no license-acceptance step.

You can also drive it one-shot from the shell:

mlx_lm.generate --model mlx-community/gpt-oss-20b-MXFP4-Q8 --prompt "Explain Mixture-of-Experts in one paragraph."

Running

For an interactive, OpenAI-compatible local server (so you can point Open WebUI, a chat client, or your own code at it):

mlx_lm.server --model mlx-community/gpt-oss-20b-MXFP4-Q8

This starts a local server on 127.0.0.1:8080 exposing an OpenAI-style /v1/chat/completions endpoint. It is a development server — bind it to localhost only. (mlx-lm server docs)

Keep your requests short on this chip. Every token of context adds to the KV-cache, and with only ~3 GB of slack after the weights and the wired-limit raise, a long prompt is what will tip you into swapping. Treat ≤ 4k tokens as the comfortable ceiling on a 16 GB Mac, and prefer a single conversation over many parallel sessions.

Alternative: the GGUF path (llama.cpp / Ollama / LM Studio)

If you prefer the portable GGUF ecosystem, OpenAI ships gpt-oss-20b as a native-MXFP4 GGUF too:

# Ollama (simplest) — pulls a ~14 GB MXFP4 build with a 128K context window
ollama run gpt-oss:20b

Ollama lists gpt-oss:20b as a 14 GB download with a 128K context window, and notes the MoE weights are quantized to MXFP4 (ollama.com/library/gpt-oss). The advertised 128K context is the model's capability, not what a 16 GB M2 Pro can hold resident — keep your actual context modest as above. For a hand-managed llama.cpp build, Metal is enabled by default on macOS — "On MacOS, Metal is enabled by default. Using Metal makes the computation run on the GPU." (llama.cpp build docs) — so a standard cmake -B build && cmake --build build --config Release already runs on the GPU. Point it at the single-file MXFP4 GGUF from ggml-org/gpt-oss-20b-GGUF (gpt-oss-20b-mxfp4.gguf, ~12.1 GB). LM Studio runs both MLX and GGUF from a GUI if you prefer not to touch the terminal. The same wired-limit raise (Installation step 2) applies whichever runtime you pick.

All three GGUF runtimes load the same native-MXFP4 weights — there is no requantization step and no CUDA-specific quant kernel involved.

Results

Speed: No first-party Apple M2 Pro benchmark for this pair has been recorded yet — /check/gpt-oss-20b/m2-pro currently returns verdict: unknown with no measurements. We are deliberately not quoting a token/sec figure. Token generation on Apple Silicon is bandwidth-bound, and the M2 Pro's ~200 GB/s unified memory is the slowest of Apple's pro chips — so throughput numbers measured on an M2/M3/M4 Max (400–546 GB/s) would overstate what this chip does and must not be carried over. If you run this, please contribute your tok/s so we can seed a real M2 Pro datapoint.
Memory usage: ~12.1 GB resident for the weights, plus a KV-cache that grows with context. That exceeds the ~10.5 GB the M2 Pro GPU addresses by default, so the wired-limit raise to ~13 GB is required to load at all — and even then only ~3 GB remains for macOS and the KV-cache. Expect memory pressure; keep context short and other apps closed.
Quality notes: gpt-oss-20b is a reasoning-tuned MoE; MXFP4 is its native training-time format, so you are not trading quality for a quantization here the way a 4-bit requant of a dense model would. The MLX MXFP4-Q8 build keeps the experts in MXFP4 and the remaining tensors at Q8 for fidelity.

For the full benchmark data (and to be the first to populate it), see /check/gpt-oss-20b/m2-pro.

Troubleshooting

Out of memory / hard swapping when the model loads

This is the expected failure mode on a 16 GB Mac. The ~12.1 GB weights are larger than the ~10.5 GB the GPU addresses by default, so:

Make sure you ran the wired-limit raise (Installation step 2): sudo sysctl iogpu.wired_limit_mb=13312. Without it the load almost always fails or swaps.
Close other GPU-heavy apps (browsers with many tabs, Photos, video editors) before loading — they compete for the same unified pool.
Keep the context modest (≤ 4k tokens). The KV-cache lives in the same ~3 GB of slack as macOS; a long prompt is the most common thing that tips a working setup into swapping.
If Activity Monitor's Memory-Pressure gauge stays red, this model is simply at the edge of what a 16 GB Mac can do — a 14B-class model (e.g. Qwen3-14B 4-bit, ~8 GB) is a roomier fit on this chip.

Tried to install FlashAttention / bitsandbytes / a `cu12x` wheel and it failed

None of those apply on Apple Silicon. There is no CUDA, no FlashAttention, and no GPU bitsandbytes kernel on macOS — MLX uses its own Metal attention, and llama.cpp uses Metal + the native MXFP4 GGUF. MXFP4 itself is not an NVIDIA-only format, so you do not need an FP8 or NVFP4 path either. If a generic tutorial tells you to pip install flash-attn, pass --load-in-4bit, or fetch a cu12x wheel, skip those steps entirely; the commands above are the complete Apple path.

The MLX build won't load on an older mlx-lm

The mlx-community/gpt-oss-20b-MXFP4-Q8 repo was converted with mlx-lm 0.27.0. If load(...) errors on the MXFP4 tensors, upgrade: pip install -U mlx-lm. gpt-oss MXFP4 support is recent, so an older mlx-lm may not recognise the format.

No other widely-reported issues. Report problems via the submission form.