gpt-oss 20B on RTX 4090: MXFP4 chat via Ollama or vLLM

llm · beginner · 16GB+ VRAM · May 20, 2026
prerequisites
  • NVIDIA RTX 4090 (24GB VRAM) or any other GPU with at least 16GB of VRAM
  • Recent NVIDIA driver with CUDA 12+ support (CUDA 12.8 for the vLLM wheel path)
  • Python 3.10+ if you take the transformers or vLLM path; Ollama bundles its own runtime
  • About 15 GB of free disk space for the weights

What You'll Build

A local chat / completions endpoint running OpenAI's gpt-oss 20B — the open-weights 21B-parameter MoE model that ships natively quantized to MXFP4 — on a single RTX 4090. The recipe covers the simplest path (Ollama, one command), the production path (vLLM, OpenAI-compatible HTTP server), and the reference path (Transformers).

Hardware data: RTX 4090 (24 GB VRAM) · 190.6 tok/s generation, 8369.3 tok/s prefill at 4k context, MXFP4 · See benchmark data

ℹ️ This is the 20B sibling, not gpt-oss-120b. OpenAI's gpt-oss release ships two models: the 20B in this recipe (fits a single consumer GPU at MXFP4) and a 120B sibling that needs an 80 GB datacenter card (H100 / MI300X) per the model card. If you landed here looking for the 120B path, this recipe does not cover it.

ℹ️ Mixture-of-Experts caveat. gpt-oss 20B is a per-token sparse MoE: 21B total parameters, 3.6B active per token (model card). The 3.6B figure is a compute / FLOPs number — it is not the VRAM number. All 21B parameters must be resident in VRAM because the router picks experts per token at runtime. The model fits 16 GB only because OpenAI pre-quantized the MoE weights to MXFP4 (4-bit floating-point); at BF16 the same parameter count would need ~42 GB and would not fit a single 24 GB card.
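
The memory arithmetic in the caveat above can be checked with a back-of-the-envelope sketch. It assumes a flat 2 bytes per parameter for BF16 and a flat 0.5 bytes per parameter for MXFP4; block scales and any tensors kept at higher precision are ignored, so the MXFP4 figure is a floor rather than the real footprint (the shipped weights are about 13.76 GB on disk).

```python
# Rough weight-memory estimate for a 21B-parameter model.
# Assumption: flat bytes-per-parameter figures. This ignores MXFP4 block
# scales, non-quantized tensors, KV cache, and runtime overhead.
TOTAL_PARAMS = 21e9  # total parameters, per the model card

BYTES_PER_PARAM = {"bf16": 2.0, "mxfp4": 0.5}


def weight_gb(params: float, fmt: str) -> float:
    """Return the approximate weight size in GB (1 GB = 1e9 bytes)."""
    return params * BYTES_PER_PARAM[fmt] / 1e9


for fmt in BYTES_PER_PARAM:
    size = weight_gb(TOTAL_PARAMS, fmt)
    print(f"{fmt}: ~{size:.1f} GB for all {TOTAL_PARAMS / 1e9:.0f}B parameters")
```

The 3.6B active-parameter figure is deliberately absent: it affects compute per token, not how many bytes must sit in VRAM.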

Requirements

Component | Minimum | Tested
GPU | 16 GB VRAM (per HF card: "the gpt-oss-20b model run within 16GB of memory") | RTX 4090 (24 GB)
RAM | 16 GB system RAM | -
Storage | ~14 GB on disk for the MXFP4 weights (ollama listing; HF safetensors total: 4.79 + 4.80 + 4.17 = 13.76 GB per the HF tree API) | -
Software | NVIDIA driver with CUDA 12+; Python 3.10+ (for vLLM / Transformers paths) | -
License | Apache-2.0 (HF card) | -
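
A quick way to confirm the storage requirement before downloading is sketched below. It re-adds the three safetensors shard sizes quoted above and checks free space on the current filesystem. The path `"."` is an assumption: Ollama and the Hugging Face cache each store weights in their own directories, so point `free_gb` at whichever location you actually use.

```python
import shutil

# Shard sizes in GB, as quoted in the storage requirement above.
SHARD_SIZES_GB = [4.79, 4.80, 4.17]
REQUIRED_GB = 15  # free-space figure from the prerequisites list


def free_gb(path: str = ".") -> float:
    """Free space on the filesystem containing `path`, in GB (1e9 bytes)."""
    return shutil.disk_usage(path).free / 1e9


# Round to two decimals to hide floating-point noise in the sum.
total = round(sum(SHARD_SIZES_GB), 2)
print(f"safetensors total: {total} GB")
print(f"free space here: {free_gb():.1f} GB (want at least {REQUIRED_GB} GB)")
```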

Installation

Pick one of the three paths below. Ollama is the fastest path to a working chat; vLLM gives you an OpenAI-compatible HTTP server; Transformers is the reference implementation.

Path A — Ollama (simplest)

ollama pull gpt-oss:20b
ollama run gpt-oss:20b

This downloads the ~14 GB MXFP4 weights into Ollama's model store and drops you into an interactive chat. Per the official Ollama listing, the 20B variant is the consumer-hardware option: "quantization to MXFP4 format enables the smaller model to run on systems with as little as 16GB memory."

Path B — vLLM (OpenAI-compatible server)

vLLM ships a gpt-oss-specific wheel pinned against PyTorch nightly cu128 — this is the install command from the OpenAI HF model card:

uv pip install --pre vllm==0.10.1+gptoss \
    --extra-index-url https://wheels.vllm.ai/gpt-oss/ \
    --extra-index-url https://download.pytorch.org/whl/nightly/cu128 \
    --index-strategy unsafe-best-match

vllm serve openai/gpt-oss-20b

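vllm serve starts an OpenAI-compatible HTTP server on http://localhost:8000/v1 exposing the chat completions and completions endpoints (gpt-oss 20B is a generative model, not an embedding model, so the embeddings endpoint does not apply here). On first launch it downloads the weights from the Hub and JIT-compiles the MXFP4 kernels, so the first start is slower than later ones.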
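
Because the first launch can take a while, a script that talks to the server may want to wait until it answers. The sketch below polls the OpenAI-compatible model-listing route (`GET /v1/models`) using only the standard library. The retry count and delay are arbitrary illustrative values, and `wait_until_ready` is a hypothetical helper, not part of vLLM.

```python
import time
import urllib.error
import urllib.request

BASE_URL = "http://localhost:8000/v1"  # default address used in this recipe


def server_ready(base_url: str = BASE_URL, timeout: float = 2.0) -> bool:
    """Return True if GET {base_url}/models answers with HTTP 200."""
    try:
        with urllib.request.urlopen(f"{base_url}/models", timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or an HTTP error: not ready yet.
        return False


def wait_until_ready(probe=server_ready, attempts: int = 60, delay: float = 5.0) -> bool:
    """Poll `probe` until it returns True or `attempts` polls have failed."""
    for _ in range(attempts):
        if probe():
            return True
        time.sleep(delay)
    return False


# Example (needs a running server):
# if wait_until_ready():
#     print("server is up")
```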

Path C — Transformers (reference)

pip install "gpt-oss[torch]"

huggingface-cli download openai/gpt-oss-20b --include "original/*" --local-dir gpt-oss-20b/
python -m gpt_oss.chat gpt-oss-20b/original/

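Despite the path name, these commands use OpenAI's openai/gpt-oss reference codebase (the gpt_oss Python package) rather than the Hugging Face transformers library. Per the model card, this loads the safetensors in their native MXFP4 layout and opens an interactive chat.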

Running

Once any of the three paths above is up, send a request. With Ollama:

ollama run gpt-oss:20b "Explain mixture-of-experts routing in one paragraph."
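
The same prompt can be sent programmatically through Ollama's local REST API. This is a minimal sketch using only the standard library; it assumes Ollama is listening on its default address `http://localhost:11434` and uses the `/api/chat` route with streaming turned off. The helper names are illustrative.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/chat"  # assumed default Ollama address


def build_chat_request(prompt: str, model: str = "gpt-oss:20b") -> urllib.request.Request:
    """Build a non-streaming chat request for the local Ollama server."""
    body = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,  # ask for one JSON object instead of a token stream
    }
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt: str) -> str:
    """Send the prompt and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(prompt), timeout=300) as resp:
        return json.load(resp)["message"]["content"]


# Example (needs `ollama serve` running and the model pulled):
# print(chat("Explain mixture-of-experts routing in one paragraph."))
```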

With vLLM (OpenAI-compatible):

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "openai/gpt-oss-20b",
    "messages": [{"role": "user", "content": "Explain mixture-of-experts routing in one paragraph."}]
  }'
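
The curl call above translates directly to Python. The sketch below uses only the standard library and reads the reply from `choices[0].message.content`, the standard OpenAI chat-completions response shape. The `max_tokens` value is an arbitrary illustrative cap, and the helper names are hypothetical.

```python
import json
import urllib.request

VLLM_URL = "http://localhost:8000/v1/chat/completions"
MODEL = "openai/gpt-oss-20b"


def build_payload(prompt: str, max_tokens: int = 256) -> dict:
    """Request body matching the curl example, plus an output-length cap."""
    return {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }


def extract_text(response: dict) -> str:
    """Pull the assistant text out of a chat-completions response."""
    # content may be null (for example, if the token cap is hit early).
    return response["choices"][0]["message"].get("content") or ""


def chat(prompt: str) -> str:
    """POST the prompt to the local vLLM server and return the reply text."""
    req = urllib.request.Request(
        VLLM_URL,
        data=json.dumps(build_payload(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=300) as resp:
        return extract_text(json.load(resp))


# Example (needs `vllm serve openai/gpt-oss-20b` running):
# print(chat("Explain mixture-of-experts routing in one paragraph."))
```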

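At the measured prefill rate (about 8,369 tok/s at 4k context), a 4k-token prompt is processed in roughly half a second, so first-token latency on a 4090 is sub-second at short context. Sustained generation throughput is about 190 tok/s at 4k context and falls as context grows; see Results.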

Results

  • Speed: 190.6 tokens/s generation and 8369.3 tokens/s prefill at 4k context length, MXFP4 quantization. Measured on a full RTX 4090 by Hardware-Corner in their cross-model gpt-oss 20B (MXFP4) row of the "Prompt processing (t/s) and token generation speed (t/s) across different open weight models and context lengths" tables. At longer context the throughput drops as expected: 163.3 / 140.7 / 104.0 / 70.3 t/s generation at 16k / 32k / 64k / 128k respectively (same source page).
  • VRAM usage: Plan on ~14–16 GB resident at typical context lengths. The MXFP4 weights are 13.76 GB on disk per the HF tree API; the HF card frames the deployment envelope as "the gpt-oss-20b model run within 16GB of memory"; Hardware-Corner's benchmark rows on 16 GB cards (e.g. RTX 4080, RTX 4070 Ti Super) all show MXFP4 fitting cleanly. A 24 GB 4090 leaves ~8–10 GB of headroom for longer contexts and KV-cache growth.
  • Quality notes: gpt-oss 20B is post-trained with reasoning support and tool use; see the model card for canonical usage of the harmony chat template. MXFP4 is the native release format (not a community after-the-fact quant), and the BF16 weights are not publicly released, so there is no official higher-precision baseline against which to measure a quantization penalty.
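
To turn the throughput figures above into wall-clock estimates, a small sketch follows. It copies the Hardware-Corner numbers from the Speed bullet and assumes the rates stay constant over a request, which is a simplification: real runs add scheduling and sampling overhead, and generation slows as the context fills.

```python
# Figures copied from the Speed bullet above (Hardware-Corner, RTX 4090, MXFP4).
PREFILL_TPS_4K = 8369.3
GEN_TPS = {"4k": 190.6, "16k": 163.3, "32k": 140.7, "64k": 104.0, "128k": 70.3}


def prefill_seconds(prompt_tokens: int, tps: float = PREFILL_TPS_4K) -> float:
    """Estimated time to process the prompt at a constant prefill rate."""
    return prompt_tokens / tps


def generation_seconds(new_tokens: int, context: str) -> float:
    """Estimated time to emit `new_tokens` at the rate measured for `context`."""
    return new_tokens / GEN_TPS[context]


print(f"4096-token prompt prefill: ~{prefill_seconds(4096):.2f} s")
for ctx, tps in GEN_TPS.items():
    secs = generation_seconds(500, ctx)
    print(f"{ctx:>4} context: 500 new tokens in ~{secs:.1f} s ({tps} tok/s)")
```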

For the full benchmark data, see /check/gpt-oss-20b/rtx-4090.

Troubleshooting

"I only have 12 GB / 8 GB VRAM — can I run this?"

Not with the native MXFP4 weights. The 16 GB floor on the model card is already the smallest official footprint: the weights ship pre-quantized to 4-bit, and there is no smaller official tier. Community llama.cpp GGUF re-quants exist but are out of scope here. If your card has less than 16 GB, run gpt-oss 20B remotely or pick a smaller model (for example, Qwen3-8B at Q4_K_M fits in 8 GB).
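
If you are unsure how much VRAM your card reports, the check can be scripted. This sketch shells out to `nvidia-smi` with its CSV query flags and compares each GPU against the 16 GB floor. Because cards marketed as 16 GB typically report slightly less than 16384 MiB, the comparison rounds to the nearest whole GiB; that rounding rule is an assumption of this sketch, not something the model card specifies.

```python
import subprocess

MIN_VRAM_GB = 16  # floor quoted from the model card


def parse_memory_mib(nvidia_smi_output: str) -> list[int]:
    """Parse one integer (MiB) per non-empty line, one line per GPU."""
    return [int(line.strip()) for line in nvidia_smi_output.splitlines() if line.strip()]


def gpu_memory_mib() -> list[int]:
    """Ask nvidia-smi for the total memory of every visible GPU, in MiB."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.total", "--format=csv,noheader,nounits"],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_memory_mib(result.stdout)


def meets_floor(total_mib: int, floor_gb: int = MIN_VRAM_GB) -> bool:
    """True if the card's nominal size (nearest GiB) reaches the floor."""
    return round(total_mib / 1024) >= floor_gb


# Example (needs an NVIDIA driver installed):
# for i, mib in enumerate(gpu_memory_mib()):
#     print(f"GPU {i}: {mib} MiB, meets 16 GB floor: {meets_floor(mib)}")
```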

vLLM install fails with PyTorch resolution errors

The vllm==0.10.1+gptoss wheel is pinned against the PyTorch nightly cu128 channel — that is why the --extra-index-url https://download.pytorch.org/whl/nightly/cu128 and --index-strategy unsafe-best-match flags are mandatory in the install command. Dropping either flag will produce dependency-resolution errors at install time. The exact command above is taken verbatim from the OpenAI HF model card.
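
When debugging a failed or partial install, it helps to see which versions the resolver actually left in the environment. The sketch below reads installed package metadata with the standard library, so it works even if importing `vllm` or `torch` itself would fail. The package list is an illustrative choice.

```python
from importlib import metadata


def installed_versions(packages=("vllm", "torch")) -> dict:
    """Map each package name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


for name, version in installed_versions().items():
    print(f"{name}: {version or 'not installed'}")
```

Run it with the same interpreter you used for the install; if vllm does not report the `0.10.1+gptoss` version from the command above, a different wheel was resolved.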

"21B / 3.6B active — why does VRAM need all 21B?"

Per-token MoE routing means the gating network picks which experts to fire on a per-token basis, so the model cannot pre-prune which experts to load. All 21B parameters must be resident in VRAM. The 3.6B active figure is a compute / FLOPs-per-token metric, not a memory metric. This is the same pattern as Mixtral 8×7B (47B resident) and DeepSeek-V3 (671B resident). The reason this 21B model fits in 16 GB anyway is the native MXFP4 format (roughly 0.5 bytes per parameter for the quantized MoE weights, which make up the bulk of the model); see the HF card's "MXFP4 quantization of the MoE weights" section.
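
A toy simulation makes the routing argument concrete. The sketch below is not the gpt-oss router: the expert count, top-k value, and random scores are stand-ins chosen for illustration. It only shows that when each token picks its own top-k experts, a short sequence already touches far more than k experts, so none can be left out of memory.

```python
import random

NUM_EXPERTS = 8   # toy value, not the real gpt-oss configuration
TOP_K = 2         # toy value
NUM_TOKENS = 64


def route(scores: list[float], k: int = TOP_K) -> list[int]:
    """Return the indices of the k highest-scoring experts for one token."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]


rng = random.Random(0)  # fixed seed so the run is repeatable
used = set()
for _ in range(NUM_TOKENS):
    # Stand-in for the learned gating network: random scores per expert.
    scores = [rng.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]
    used.update(route(scores))

print(f"experts per token: {TOP_K}")
print(f"distinct experts touched over {NUM_TOKENS} tokens: {len(used)} of {NUM_EXPERTS}")
```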

Want different hardware numbers?

If you have benchmark data on a different RTX 4090 configuration (longer context, different runtime, batch > 1), submit it via /contribute so we can grow the /check/gpt-oss-20b/rtx-4090 page beyond Hardware-Corner's single-row measurement.