How much VRAM does gpt-oss 20B need?

About 16 GB — the minimum this recipe targets.

How hard is this setup?

Beginner: follow the steps below.

gpt-oss 20B on RTX 5090: MXFP4 Chat at 298 tok/s via Ollama or vLLM

What You'll Build

A local chat endpoint backed by OpenAI's open-weights gpt-oss-20b, a 21B-parameter mixture-of-experts LLM (3.6B active per token) shipped in native MXFP4 quantization. Two installation paths are covered: Ollama (one command, drop-in chat) or vLLM (OpenAI-compatible HTTP API). The 5090's 32 GB envelope leaves ~16–18 GB free after the model loads, enough room to colocate a second model on the same card.

Hardware data: RTX 5090 (32 GB VRAM) · 298.2 tok/s generation, 9443.8 tok/s prefill at 4K context (MXFP4) · See benchmark data

ℹ️ MXFP4 is FP4-microscaling, not FP8. The model ships pre-quantized to 4-bit floating-point and runs over standard quantized-matmul kernels in llama.cpp / Ollama / vLLM. On the 5090 (Blackwell sm_120) the same code path runs natively — there is no separate "MXFP4 acceleration mode" you need to enable. The performance uplift over the RTX 3090 sibling (~2× generation throughput at 4K context: 298.2 vs 147.5 tok/s) tracks the bandwidth-and-compute gap between Ampere and Blackwell, not a quant-format change.

Requirements

Component	Minimum	Tested
GPU	16 GB VRAM (per HF card: "the `gpt-oss-20b` model run within 16GB of memory")	RTX 5090 (32 GB)
RAM	16 GB system RAM	—
Storage	~14 GB for MXFP4 weights (Ollama listing; HF safetensors total: 4.79 + 4.80 + 4.17 = 13.76 GB per the HF tree API)	—
Software	NVIDIA driver with CUDA 12.8+ (required for Blackwell sm_120); Python 3.10+ (for vLLM path)	—
License	Apache-2.0 (HF card)	—
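A quick way to check the GPU row of this table before downloading anything is to ask nvidia-smi what is installed. The sketch below is a hypothetical preflight helper, not part of the recipe: the function names are made up for illustration, and it only checks VRAM against the 16 GB minimum. It does not verify the CUDA 12.8+ requirement, because the driver version string alone does not state the supported CUDA version.

```python
import subprocess

MIN_VRAM_GB = 16  # minimum from the requirements table


def parse_gpu_line(line: str) -> dict:
    """Parse one CSV line of `name, memory.total (MiB), driver_version`."""
    # rsplit keeps the name intact even if it happens to contain a comma
    name, mem_mib, driver = [field.strip() for field in line.rsplit(",", 2)]
    return {
        "name": name,
        "vram_gb": int(mem_mib) * 1024**2 / 1e9,  # MiB -> decimal GB
        "driver": driver,
    }


def meets_minimum(gpu: dict, min_gb: float = MIN_VRAM_GB) -> bool:
    """True if the card reports at least `min_gb` of total VRAM."""
    return gpu["vram_gb"] >= min_gb


def query_gpus() -> list[dict]:
    """Ask nvidia-smi for every GPU on the machine (needs the NVIDIA driver)."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=name,memory.total,driver_version",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return [parse_gpu_line(line) for line in out.splitlines() if line.strip()]
```

On a machine with the driver installed, `[meets_minimum(g) for g in query_gpus()]` gives one pass/fail flag per card.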

Installation

Two paths. Pick one. Ollama is the fastest route to a working chat; vLLM gives you an OpenAI-compatible HTTP server suitable for production-style usage.

Path A — Ollama (recommended for first run)

ollama pull gpt-oss:20b
ollama run gpt-oss:20b

That's it. The gpt-oss:20b tag is the native MXFP4 build — per the official Ollama listing, MXFP4 "enables the smaller model to run on systems with as little as 16GB memory." First run downloads ~14 GB and drops you into an interactive REPL.

Path B — vLLM (OpenAI-compatible API server)

Verbatim from the official HF model card:

uv pip install --pre vllm==0.10.1+gptoss \
    --extra-index-url https://wheels.vllm.ai/gpt-oss/ \
    --extra-index-url https://download.pytorch.org/whl/nightly/cu128 \
    --index-strategy unsafe-best-match

vllm serve openai/gpt-oss-20b

Both --extra-index-url flags are mandatory. The vllm==0.10.1+gptoss build is pinned against a PyTorch nightly served from the cu128 channel — drop either flag and dependency resolution fails. The cu128 PyTorch wheel is what enables Blackwell sm_120 kernels in the first place (CUDA 12.8 is the first toolkit with native Blackwell support); the 4090 sibling needs the same wheel for a different reason (gpt-oss-specific kernels), but for the 5090 the cu128 pin is also the Blackwell-support gate.

Once vllm serve reports it is listening on port 8000, the server speaks the OpenAI Chat Completions API.
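If you start the server from a script, it can be handy to wait for it programmatically rather than watching the log. The sketch below polls the OpenAI-compatible model-listing route (GET /v1/models) until it answers. It assumes the default port 8000 mentioned above; the helper names, timeout, and polling interval are illustrative choices, not part of vLLM.

```python
import time
import urllib.error
import urllib.request

BASE_URL = "http://localhost:8000"  # default port named in the text


def models_endpoint_up(base_url: str = BASE_URL) -> bool:
    """Return True if GET /v1/models answers with HTTP 200."""
    try:
        with urllib.request.urlopen(f"{base_url}/v1/models", timeout=2) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False  # connection refused, timeout, or an HTTP error status


def wait_until_ready(probe=models_endpoint_up, timeout_s: float = 300.0,
                     interval_s: float = 2.0, sleep=time.sleep) -> bool:
    """Poll `probe` until it returns True or `timeout_s` elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        if probe():
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(interval_s)
```

Calling `wait_until_ready()` after launching `vllm serve` returns True once the endpoint responds, or False if the timeout expires first. The first start can take a while because the weights have to download and load.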

Running

Ollama (one-shot prompt):

ollama run gpt-oss:20b "Explain mixture-of-experts routing in one paragraph."

vLLM (HTTP):

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "openai/gpt-oss-20b",
    "messages": [{"role": "user", "content": "Explain mixture-of-experts routing in one paragraph."}]
  }'
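The same request can be sent from Python using only the standard library. This is a minimal sketch that mirrors the curl call: same URL, same model name, same message shape. The function names are illustrative, and it assumes the server returns the usual Chat Completions layout (`choices[0].message.content`).

```python
import json
import urllib.request

VLLM_URL = "http://localhost:8000/v1/chat/completions"  # same address as the curl call


def build_chat_request(prompt: str, model: str = "openai/gpt-oss-20b") -> dict:
    """Build the same JSON body the curl example sends."""
    return {"model": model, "messages": [{"role": "user", "content": prompt}]}


def extract_reply(response: dict) -> str:
    """Pull the assistant text out of a Chat Completions response."""
    return response["choices"][0]["message"]["content"]


def ask(prompt: str, url: str = VLLM_URL) -> str:
    """POST the prompt to the running server and return the reply text."""
    body = json.dumps(build_chat_request(prompt)).encode("utf-8")
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as response:
        return extract_reply(json.load(response))
```

With the server running, `ask("Explain mixture-of-experts routing in one paragraph.")` returns the reply as a string.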

Either path keeps all 21B parameters resident in VRAM. The 3.6B "active" figure from the HF card is a compute-per-token number (which experts the router fires on each forward pass) — it does not mean only 3.6B parameters live in memory. All experts must be loadable on demand.
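The resident-versus-active distinction can be made concrete with a rough back-of-the-envelope estimate. The sketch below assumes the MXFP4 layout from the OCP Microscaling spec (4-bit elements, one shared 8-bit scale per block of 32, so about 4.25 bits per parameter) and, as a simplification, treats every parameter as MXFP4. The real checkpoint is larger (13.76 GB on disk per the requirements table), presumably because not every tensor is stored in MXFP4, so read the result as a lower bound.

```python
def mxfp4_bytes(n_params: float, block_size: int = 32,
                element_bits: int = 4, scale_bits: int = 8) -> float:
    """Bytes for n_params in MXFP4: 4-bit elements plus one shared scale per block."""
    bits_per_param = element_bits + scale_bits / block_size  # 4.25 with the defaults
    return n_params * bits_per_param / 8


TOTAL_PARAMS = 21e9    # all experts, resident in VRAM
ACTIVE_PARAMS = 3.6e9  # parameters the router touches per token

resident_gb = mxfp4_bytes(TOTAL_PARAMS) / 1e9
active_gb = mxfp4_bytes(ACTIVE_PARAMS) / 1e9
print(f"resident: ~{resident_gb:.1f} GB, read per token: ~{active_gb:.1f} GB")
```

Under these assumptions roughly 11 GB must stay loaded, while only about 2 GB of weights are read for any single token. That gap is why the model is fast for its memory footprint, but the footprint itself follows the 21B figure.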

Results

Generation speed: 298.2 tokens/s at 4K context (MXFP4 quantization), measured on RTX 5090 by Hardware Corner's gpu-llm-benchmarks in the gpt-oss 20B (Q4_K) row of the Token Generation table. The same row reports 249.2 / 215.1 / 169.0 / 112.0 tok/s walking from 16K → 32K → 64K → 128K context. (Note on labeling: Hardware Corner labels the quant Q4_K to match its cross-page schema, but for gpt-oss-20b this is the native MXFP4 release — there is no separate Q4_K community quant on the HF card. The numbers are MXFP4.)
Prefill speed: 9443.8 tokens/s at 4K context on the same row of the Hardware Corner Prompt Processing table, trailing to 7168.1 / 5183.1 / 3019.7 / 1636.4 tok/s at 16K → 128K.
VRAM usage: Plan on ~14–16 GB resident at typical context lengths. The MXFP4 weights are 13.76 GB on disk per the HF tree API; the HF card frames the deployment envelope as "the gpt-oss-20b model run within 16GB of memory." On the 32 GB 5090 that leaves ~16–18 GB of headroom — see Spending the Headroom below for what to do with it.
Quality notes: MXFP4 is the native release format (not a community after-the-fact quant), so there is no quality penalty to compare against — the BF16 weights are not publicly released. The model is post-trained for reasoning and tool-use; use the canonical harmony chat template for best behavior.
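To turn the throughput figures into something you can plan with, the sketch below stores the two Hardware Corner rows quoted in this section and derives the relative slowdown per context length plus a rough request-time estimate. The latency model (prefill time plus generation time, both at the rate for a single context bucket) is a simplifying assumption for illustration, not a measured result.

```python
# tok/s on RTX 5090 by context length, copied from the Results figures
GENERATION = {"4K": 298.2, "16K": 249.2, "32K": 215.1, "64K": 169.0, "128K": 112.0}
PREFILL = {"4K": 9443.8, "16K": 7168.1, "32K": 5183.1, "64K": 3019.7, "128K": 1636.4}


def drop_percent(table: dict, context: str, baseline: str = "4K") -> float:
    """Throughput lost at `context` relative to `baseline`, in percent."""
    return (1 - table[context] / table[baseline]) * 100


def estimate_latency_s(prompt_tokens: int, output_tokens: int,
                       context: str = "4K") -> float:
    """Rough request time: prefill the prompt, then generate the output."""
    return prompt_tokens / PREFILL[context] + output_tokens / GENERATION[context]


for ctx in GENERATION:
    print(f"{ctx:>4}: generation drop {drop_percent(GENERATION, ctx):.1f}%, "
          f"prefill drop {drop_percent(PREFILL, ctx):.1f}%")
```

For example, `estimate_latency_s(4096, 512)` comes out at a little over two seconds, most of it spent generating rather than prefilling.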

For the full benchmark data and cross-card compare, see /check/gpt-oss-20b/rtx-5090.

Spending the Headroom — the real reason to run this on a 5090

At ~14–16 GB resident, gpt-oss-20b leaves ~16–18 GB free on a 32 GB 5090. The 4090 and 3090 siblings each leave ~8 GB. That extra 8 GB on the 5090 (32 GB vs 24 GB) is not just margin: it is enough to colocate a second model on the same card without paging. A few concrete shapes:

gpt-oss-20b + Qwen3-8B Q4_K_M (~5 GB) — pair a heavy reasoning model with a fast generalist; route by intent. Total ~21 GB resident, 11 GB free for KV-cache as you push past 4K context on either model.
gpt-oss-20b + Whisper-large-v3 (~3 GB FP16) — full local voice pipeline (transcribe → reason → respond) on one card; total ~19 GB resident, 13 GB free.
gpt-oss-20b + Kokoro-82M TTS (~1 GB): add speech synthesis to the chat endpoint for full conversational use; total ~17 GB resident, 15 GB free (using the same 16 GB baseline as the pairings above).
Long-context single-model use: alternatively, spend the headroom on context instead of a second model. Hardware Corner's row shows generation throughput at 128K context is still 112.0 tok/s, about 38% of the 4K baseline (298.2 tok/s), so the 5090's bandwidth makes long-context use practical where it would be painful on a 3090.

These pairings are much tighter on the 3090 or 4090: on a 24 GB card the same model leaves only ~8 GB of headroom, so a ~5 GB second model leaves roughly 3 GB for KV-cache across both models. The 5090's 32 GB envelope is the actual differentiator for this size class, not raw tok/s.
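The pairing totals in this section are plain sums, which makes them easy to re-run for your own combination. The sketch below is an illustrative budget helper: the footprints are the approximate figures quoted in the text (with gpt-oss-20b taken at the upper end of its ~14–16 GB range), and the class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Resident:
    name: str
    gb: float  # approximate VRAM footprint taken from the text


def budget(card_gb: float, models: list[Resident]) -> tuple[float, float]:
    """Return (total resident GB, GB left for KV-cache and activations)."""
    total = sum(m.gb for m in models)
    return total, card_gb - total


GPT_OSS = Resident("gpt-oss-20b", 16)  # upper end of the ~14-16 GB range
QWEN = Resident("Qwen3-8B Q4_K_M", 5)
WHISPER = Resident("Whisper-large-v3 FP16", 3)

for extra in (QWEN, WHISPER):
    total, free = budget(32, [GPT_OSS, extra])
    print(f"{GPT_OSS.name} + {extra.name}: {total} GB resident, {free} GB free")
```

Swapping `32` for another card size, or adding entries to the list, shows how much room remains for KV-cache before you commit to a layout.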

Troubleshooting

"All 21B parameters must fit, not just 3.6B"

The HF card markets the model as "21B parameters with 3.6B active parameters." All 21B must be resident in VRAM because the MoE router decides which experts to use per-token at inference time. The 3.6B is a FLOPs-per-token number, not a VRAM number. On the RTX 5090 this is comfortably under the 32 GB ceiling — the 16 GB stated minimum accounts for all 21B in MXFP4 plus working set.

vLLM install fails with dependency resolution errors

Both --extra-index-url lines in the Path B install command are mandatory (per HF card). The vllm==0.10.1+gptoss build depends on a PyTorch nightly served from download.pytorch.org/whl/nightly/cu128, not the stable channel. Drop either flag and pip won't find a compatible torch. On Blackwell specifically, the cu128 pin is also what enables sm_120 kernels — stable cu126 wheels don't ship Blackwell support.
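After the install succeeds, you can confirm which PyTorch build actually landed in the environment. This is a hedged diagnostic sketch: `cuda_at_least` and `describe_torch_build` are hypothetical helper names, while `torch.version.cuda`, `torch.cuda.get_device_capability`, and `torch.cuda.get_arch_list` are existing PyTorch calls. torch is imported lazily so the version check itself works without it.

```python
def cuda_at_least(cuda_version, minimum=(12, 8)) -> bool:
    """True if a version string like '12.8' is at or above `minimum`."""
    if not cuda_version:  # CPU-only builds report None
        return False
    parts = tuple(int(p) for p in cuda_version.split(".")[:2])
    return parts >= minimum


def describe_torch_build() -> str:
    """Summarize the installed torch build and the first visible GPU."""
    import torch  # lazy import: only needed when this function is called

    capability = (
        torch.cuda.get_device_capability(0) if torch.cuda.is_available() else None
    )
    return (
        f"torch {torch.__version__}, CUDA {torch.version.cuda}, "
        f"CUDA >= 12.8: {cuda_at_least(torch.version.cuda)}, "
        f"device capability: {capability}, "
        f"sm_120 compiled in: {'sm_120' in torch.cuda.get_arch_list()}"
    )
```

Running `print(describe_torch_build())` inside the vLLM environment should show a CUDA version of 12.8 or newer and, on an RTX 5090, a device capability of (12, 0). If either is missing, the wrong wheel was most likely resolved.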

Flash-Attention errors on first inference call

If you swap the recipe's runtime for something that requests FlashAttention-2 explicitly (a HF Transformers attn_implementation="flash_attention_2" snippet, a custom vLLM config, a script copied from an Ampere/Ada walkthrough), Blackwell may crash at the first forward pass: FA2 sm_120 kernel coverage is still in flight at Dao-AILab/flash-attention#2168 (open as of recipe write time). The recipe's documented paths (Ollama, vLLM) do not need FA2: Ollama uses llama.cpp's CUDA backend, and vLLM defaults to its own attention kernels. If you do hit an FA2 error in a custom setup, try switching attn_implementation to "sdpa" (PyTorch's native scaled-dot-product attention), which does not depend on the flash-attn package and should run on sm_120 with cu128 wheels.

Generation slower than expected for the card

Two checks: (a) confirm you installed the cu128 PyTorch wheel (vLLM path) or a recent Ollama build — Ollama versions older than the Blackwell-support cutoff fall back to a non-optimal kernel; (b) confirm you are at small context. LLM token generation is memory-bandwidth-bound, so as the KV-cache grows past 4K the per-token rate drops mechanically — the Hardware Corner RTX 5090 table shows 298.2 → 112.0 tok/s walking from 4K to 128K context on the same hardware.

Want different hardware numbers?

If you have benchmark data on a different RTX 5090 configuration (longer context, batched serving, different runtime), submit it via /contribute so we can grow the /check/gpt-oss-20b/rtx-5090 page beyond Hardware Corner's first-party row.