What You'll Build
A local chat / completions endpoint running OpenAI's gpt-oss 20B — the open-weights 21B-parameter MoE model that ships natively quantized to MXFP4 — on a single RTX 4090. The recipe covers the simplest path (Ollama, one command), the production path (vLLM, OpenAI-compatible HTTP server), and the reference path (Transformers).
Hardware data: RTX 4090 (24 GB VRAM) · 190.6 tok/s generation, 8369.3 tok/s prefill at 4k context, MXFP4 · See benchmark data
ℹ️ This is the 20B sibling, not gpt-oss-120b. OpenAI's gpt-oss release ships two models: the 20B in this recipe (fits a single consumer GPU at MXFP4) and a 120B sibling that needs an 80 GB datacenter card (H100 / MI300X) per the model card. If you landed here looking for the 120B path, this recipe does not cover it.
ℹ️ Mixture-of-Experts caveat. gpt-oss 20B is a per-token sparse MoE: 21B total parameters, 3.6B active per token (model card). The 3.6B figure is a compute / FLOPs number — it is not the VRAM number. All 21B parameters must be resident in VRAM because the router picks experts per token at runtime. The model fits 16 GB only because OpenAI pre-quantized the MoE weights to MXFP4 (4-bit floating-point); at BF16 the same parameter count would need ~42 GB and would not fit a single 24 GB card.
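The arithmetic behind this caveat can be sketched in a few lines. This is a weights-only, back-of-the-envelope estimate: `weight_footprint_gb` is a hypothetical helper, and the 0.5 bytes-per-parameter figure assumes every parameter is stored at exactly 4 bits, which is an idealization.

```python
# Back-of-the-envelope weight footprint: parameters x bytes per parameter.
# Weights only; KV cache, activations, and runtime overhead are NOT included.

TOTAL_PARAMS_B = 21.0   # total parameters in billions (model card)
ACTIVE_PARAMS_B = 3.6   # active per token in billions (a compute figure)


def weight_footprint_gb(params_billion: float, bytes_per_param: float) -> float:
    """Hypothetical helper: decimal GB needed to hold the weights."""
    return params_billion * bytes_per_param


bf16_gb = weight_footprint_gb(TOTAL_PARAMS_B, 2.0)    # BF16 = 2 bytes per parameter
mxfp4_gb = weight_footprint_gb(TOTAL_PARAMS_B, 0.5)   # idealized 4-bit = 0.5 bytes
# Sizing by active parameters is the mistake this caveat warns about:
wrong_gb = weight_footprint_gb(ACTIVE_PARAMS_B, 0.5)

print(f"BF16, all parameters      : {bf16_gb:.1f} GB")
print(f"4-bit, all parameters     : {mxfp4_gb:.1f} GB (idealized lower bound)")
print(f"4-bit, active params only : {wrong_gb:.1f} GB (not a valid VRAM estimate)")
```

The 4-bit result is a floor rather than a prediction. The ~14 GB on-disk size quoted under Requirements is higher, which is consistent with the card describing MXFP4 quantization of the MoE weights specifically rather than of every tensor.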
Requirements
| Component | Minimum | Tested |
|---|---|---|
| GPU | 16 GB VRAM (per HF card: "the gpt-oss-20b model run within 16GB of memory") | RTX 4090 (24 GB) |
| RAM | 16 GB system RAM | — |
| Storage | ~14 GB on disk for the MXFP4 weights (ollama listing; HF safetensors total: 4.79 + 4.80 + 4.17 = 13.76 GB per the HF tree API) | — |
| Software | NVIDIA driver with CUDA 12+; Python 3.10+ (for vLLM / Transformers paths) | — |
| License | Apache-2.0 (HF card) | — |
Installation
Pick one of the three paths below. Ollama is the fastest path to a working chat; vLLM gives you an OpenAI-compatible HTTP server; Transformers is the reference implementation.
Path A — Ollama (simplest)
ollama pull gpt-oss:20b
ollama run gpt-oss:20b
This downloads the ~14 GB MXFP4 weights into Ollama's model store and drops you into an interactive chat. Per the official Ollama listing, the 20B variant is the consumer-hardware option: "quantization to MXFP4 format enables the smaller model to run on systems with as little as 16GB memory."
Path B — vLLM (OpenAI-compatible server)
vLLM ships a gpt-oss-specific wheel pinned against PyTorch nightly cu128 — this is the install command from the OpenAI HF model card:
uv pip install --pre vllm==0.10.1+gptoss \
--extra-index-url https://wheels.vllm.ai/gpt-oss/ \
--extra-index-url https://download.pytorch.org/whl/nightly/cu128 \
--index-strategy unsafe-best-match
vllm serve openai/gpt-oss-20b
vllm serve starts an OpenAI-compatible HTTP server on http://localhost:8000/v1 that exposes the chat completions and completions endpoints. On first launch it downloads the weights from the Hub and JIT-compiles the MXFP4 kernels, so expect the first startup to take longer than later ones.
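Because the first launch has to download the weights before the server answers, a script that talks to it may want to wait for readiness. The following is a minimal sketch, assuming the server lists its models at /v1/models as part of the OpenAI-compatible surface; `http_probe`, `wait_until_ready`, and the timeout values are hypothetical choices, not part of vLLM.

```python
import time
import urllib.request
from typing import Callable


def http_probe(url: str) -> bool:
    """Return True if a GET on url answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            return resp.status == 200
    except OSError:
        # Covers refused connections, timeouts, and urllib's URLError/HTTPError.
        return False


def wait_until_ready(
    url: str,
    probe: Callable[[str], bool] = http_probe,
    timeout_s: float = 1800.0,   # assumption: generous budget for a first download
    interval_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll url until the probe succeeds or timeout_s elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        if probe(url):
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(interval_s)
```

A caller would run `wait_until_ready("http://localhost:8000/v1/models")` and only send requests once it returns True. The probe and sleep functions are injectable so the loop can be exercised without a running server.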
Path C — Transformers (reference)
pip install "gpt-oss[torch]"
huggingface-cli download openai/gpt-oss-20b --include "original/*" --local-dir gpt-oss-20b/
python -m gpt_oss.chat gpt-oss-20b/
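Despite the "Transformers" label, these commands use OpenAI's openai/gpt-oss reference codebase (the gpt_oss Python package), not the Hugging Face Transformers library. Per the model card, this loads the original-format checkpoint in its native MXFP4 layout and opens an interactive chat. The download keeps the repository's original/ subdirectory, so if the chat command cannot find the checkpoint, try pointing it at gpt-oss-20b/original/ instead.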
Running
Once any of the three paths above is up, send a request. With Ollama:
ollama run gpt-oss:20b "Explain mixture-of-experts routing in one paragraph."
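The same prompt can also be sent from a script. The sketch below assumes Ollama's local HTTP API is listening at its default address, http://localhost:11434, and uses the /api/chat route; neither is stated in this recipe, so check your install. `build_ollama_request` and `ask_ollama` are hypothetical helper names.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/chat"  # assumption: default local address


def build_ollama_request(prompt: str, model: str = "gpt-oss:20b") -> urllib.request.Request:
    """Build a POST request carrying one user message."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,  # ask for a single JSON object instead of streamed chunks
    }
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask_ollama(prompt: str) -> str:
    """Send the prompt and return the assistant's reply text."""
    with urllib.request.urlopen(build_ollama_request(prompt), timeout=300) as resp:
        body = json.load(resp)
    return body["message"]["content"]
```

With the model pulled and Ollama running, `print(ask_ollama("Explain mixture-of-experts routing in one paragraph."))` mirrors the command above.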
With vLLM (OpenAI-compatible):
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "openai/gpt-oss-20b",
"messages": [{"role": "user", "content": "Explain mixture-of-experts routing in one paragraph."}]
}'
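The curl call above translates directly to Python's standard library. This is a minimal sketch rather than a recommended client: the helper names are hypothetical, and `extract_reply` assumes the response follows the usual OpenAI chat-completions shape (`choices[0].message.content`).

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000/v1"   # address used by the curl example
MODEL = "openai/gpt-oss-20b"


def build_chat_request(prompt: str, model: str = MODEL, base_url: str = BASE_URL) -> urllib.request.Request:
    """Build the same POST body as the curl example."""
    payload = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_reply(response_body: dict) -> str:
    """Pull the assistant text out of an OpenAI-style chat completion."""
    return response_body["choices"][0]["message"]["content"]


def chat(prompt: str) -> str:
    """Send one prompt to the local server and return the reply text."""
    with urllib.request.urlopen(build_chat_request(prompt), timeout=300) as resp:
        return extract_reply(json.load(resp))
```

Keeping request construction separate from the network call makes the payload easy to inspect before anything is sent.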
First-token latency on a 4090 is sub-second at short context; sustained throughput is in the ~190 tok/s range — see Results.
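As a rough sanity check on those latency figures, the benchmark rates can be turned into time estimates. This sketch assumes prefill and generation each run at the constant 4k-context rates quoted in this recipe and ignores server, tokenization, and network overhead, so treat the output as a best case. `estimate_seconds` is a hypothetical helper.

```python
PREFILL_TPS = 8369.3   # prompt tokens/s at 4k context (benchmark figure in this recipe)
DECODE_TPS = 190.6     # generated tokens/s at 4k context (same source)


def estimate_seconds(prompt_tokens: int, output_tokens: int) -> tuple[float, float]:
    """Return (time to first token, total time) under constant-rate assumptions."""
    ttft = prompt_tokens / PREFILL_TPS
    total = ttft + output_tokens / DECODE_TPS
    return ttft, total


ttft, total = estimate_seconds(prompt_tokens=4096, output_tokens=500)
print(f"first token after ~{ttft:.2f} s, full reply after ~{total:.2f} s")
```

Even a full 4k-token prompt comes out at roughly half a second of prefill under these assumptions, which is consistent with the sub-second claim for short prompts.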
Results
- Speed: 190.6 tokens/s generation and 8369.3 tokens/s prefill at 4k context length, MXFP4 quantization. Measured on a full RTX 4090 by Hardware-Corner in their cross-model gpt-oss 20B (MXFP4) row of the "Prompt processing (t/s) and token generation speed (t/s) across different open weight models and context lengths" tables. At longer context the throughput drops as expected: 163.3 / 140.7 / 104.0 / 70.3 t/s generation at 16k / 32k / 64k / 128k respectively (same source page).
- VRAM usage: Plan on ~14–16 GB resident at typical context lengths. The MXFP4 weights are 13.76 GB on disk per the HF tree API; the HF card frames the deployment envelope as "the gpt-oss-20b model run within 16GB of memory"; Hardware-Corner's benchmark rows on 16 GB cards (e.g. RTX 4080, RTX 4070 Ti Super) all show MXFP4 fitting cleanly. A 24 GB 4090 leaves ~8–10 GB of headroom for longer contexts and KV-cache growth.
- Quality notes: gpt-oss 20B is post-trained with reasoning support and tool use; see the model card for the canonical harmony chat template usage. The MXFP4 quantization is the native release format (not a community after-the-fact quant), so there is no quality penalty to compare against: the BF16 weights are not publicly released.
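To make the long-context slowdown easier to read, the generation figures from the Speed bullet can be expressed relative to the 4k baseline. This is only arithmetic on the numbers already quoted; `relative_to_baseline` is a hypothetical helper.

```python
# Generation speed (tokens/s) by context length, copied from the Speed bullet.
GEN_TPS = {"4k": 190.6, "16k": 163.3, "32k": 140.7, "64k": 104.0, "128k": 70.3}


def relative_to_baseline(tps_by_context: dict[str, float], baseline: str = "4k") -> dict[str, float]:
    """Divide every throughput by the baseline context's throughput."""
    base = tps_by_context[baseline]
    return {ctx: tps / base for ctx, tps in tps_by_context.items()}


ratios = relative_to_baseline(GEN_TPS)
for ctx, ratio in ratios.items():
    print(f"{ctx:>4}: {GEN_TPS[ctx]:6.1f} tok/s ({ratio:.0%} of 4k)")
```

On these figures, generation at 128k context runs at a bit over a third of the 4k rate, which is worth budgeting for if you plan long-document workloads.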
For the full benchmark data, see /check/gpt-oss-20b/rtx-4090.
Troubleshooting
"I only have 12 GB / 8 GB VRAM — can I run this?"
Not with the native MXFP4 weights. The 16 GB floor on the model card is already the smallest official footprint: the weights ship pre-quantized to 4-bit, and there is no smaller official tier. Community llama.cpp GGUF re-quants exist but are out of scope here. If your card has less than 16 GB, run gpt-oss 20B remotely or pick a smaller model (Qwen3-8B at Q4_K_M fits in 8 GB).
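If you are unsure how much VRAM your card reports, a quick check against the 16 GB floor can be sketched as follows. It assumes `nvidia-smi` is on the PATH; the 16,000 MiB threshold is an assumption chosen slightly below a nominal 16 GB because cards typically report a little less than their advertised size, and the helper names are hypothetical.

```python
import subprocess

MIN_VRAM_MIB = 16_000  # assumption: slightly under a nominal 16 GB card


def parse_total_mib(nvidia_smi_output: str) -> list[int]:
    """Parse one integer (MiB) per GPU from the query output."""
    return [int(line) for line in nvidia_smi_output.splitlines() if line.strip()]


def query_total_mib() -> list[int]:
    """Ask nvidia-smi for each GPU's total memory in MiB."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.total", "--format=csv,noheader,nounits"],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_total_mib(result.stdout)


def meets_floor(total_mib: int, floor_mib: int = MIN_VRAM_MIB) -> bool:
    """True if the card reports at least the assumed minimum."""
    return total_mib >= floor_mib
```

On a machine with an NVIDIA driver installed, `[meets_floor(m) for m in query_total_mib()]` gives one verdict per GPU. This only checks total memory, not what other processes are already using.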
vLLM install fails with PyTorch resolution errors
The vllm==0.10.1+gptoss wheel is pinned against the PyTorch nightly cu128 channel — that is why the --extra-index-url https://download.pytorch.org/whl/nightly/cu128 and --index-strategy unsafe-best-match flags are mandatory in the install command. Dropping either flag will produce dependency-resolution errors at install time. The exact command above is taken verbatim from the OpenAI HF model card.
"21B / 3.6B active — why does VRAM need all 21B?"
Per-token MoE routing means the gating network picks which experts to fire on a per-token basis, so the model cannot pre-prune which experts to load. All 21B parameters must be resident in VRAM. The 3.6B active figure is a compute / FLOPs-per-token metric, not a memory metric. This is the same pattern as Mixtral 8×7B (47B resident) and DeepSeek-V3 (671B resident). The reason this 21B fits 16 GB anyway is the native MXFP4 format (~0.5 bytes per parameter) — see the HF card's "MXFP4 quantization of the MoE weights" section.
Want different hardware numbers?
If you have benchmark data on a different RTX 4090 configuration (longer context, different runtime, batch > 1), submit it via /contribute so we can grow the /check/gpt-oss-20b/rtx-4090 page beyond Hardware-Corner's single-row measurement.