
gpt-oss 20B on RTX 4080 SUPER: MXFP4 chat at 139 tok/s via Ollama or vLLM

llm · beginner · 16GB+ VRAM · Jun 2, 2026
Prerequisites
  • NVIDIA RTX 4080 SUPER (16GB VRAM) or any other GPU with at least 16GB of VRAM
  • Recent NVIDIA driver with CUDA 12+ support (CUDA 12.8 for the vLLM wheel path)
  • Python 3.10+ if you take the transformers or vLLM path; Ollama bundles its own runtime
  • About 15 GB of free disk space for the weights

What You'll Build

A local chat / completions endpoint running OpenAI's gpt-oss 20B — the open-weights 21B-parameter MoE model that ships natively quantized to MXFP4 — on a single RTX 4080 SUPER. The recipe covers the simplest path (Ollama, one command), the production path (vLLM, OpenAI-compatible HTTP server), and the reference path (Transformers).

Hardware data: RTX 4080 SUPER (16 GB VRAM) · 139.1 tok/s generation, 6364.0 tok/s prefill at 4k context, MXFP4 · See benchmark data

ℹ️ This is the 20B sibling, not gpt-oss-120b. OpenAI's gpt-oss release ships two models: the 20B in this recipe (fits a single consumer GPU at MXFP4) and a 120B sibling that needs an 80 GB datacenter card. Per the model card, MXFP4 quantization of the MoE weights is what makes "gpt-oss-120b run on a single 80GB GPU (like NVIDIA H100 or AMD MI300X) and the gpt-oss-20b model run within 16GB of memory." If you landed here looking for the 120B path, this recipe does not cover it.

ℹ️ MXFP4 on the RTX 4080 SUPER (Ada) — storage win, not a tensor-core win. MXFP4 is the model's native release format, so the 4-bit weights occupy ~14 GB on any card — that is why a 16 GB 4080 SUPER fits the model at all. But native FP4 tensor-core acceleration is a Blackwell feature: per NVIDIA's gpt-oss QAT writeup, "With the arrival of NVIDIA Blackwell, NVFP4 introduces a new FP4 format purpose-built for both training and inference efficiency […]" (NVFP4 is NVIDIA's own 4-bit float; gpt-oss ships in the related MXFP4 format — both are 4-bit floating-point and both need Blackwell-class FP4 tensor cores to accelerate). The RTX 4080 SUPER is Ada Lovelace (sm_89), which has no FP4 tensor cores — runtimes fall back to higher-precision MoE kernels (vLLM, for instance, documents a Marlin MXFP4 MoE fallback for pre-Blackwell GPUs without native FP4 support, such as Ampere A100). You keep the MXFP4 memory footprint; you do not get the Blackwell-class FP4 throughput. The 139 tok/s figure below is measured on a real RTX 4080 SUPER and already reflects this.

ℹ️ Mixture-of-Experts caveat. gpt-oss 20B is a per-token sparse MoE — the model card states it has "21B parameters with 3.6B active parameters". The 3.6B figure is a compute / FLOPs number — it is not the VRAM number. All 21B parameters must be resident in VRAM because the router picks experts per token at runtime. The model fits 16 GB only because OpenAI pre-quantized the MoE weights to MXFP4 (~0.5 bytes per parameter); at BF16 the same parameter count would need ~42 GB and would not fit a 16 GB card.
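The arithmetic behind that caveat can be sketched in a few lines. This is a rough estimate only: the parameter counts and shard sizes are the ones quoted in this recipe, while the helper name `weight_gb` and the flat bytes-per-parameter figures are illustrative assumptions that ignore quantization scales and any tensors kept at higher precision.

```python
# Back-of-the-envelope weight sizes for gpt-oss 20B.
# Parameter counts come from the model card quote above; the flat
# bytes-per-parameter values are simplifying assumptions.
TOTAL_PARAMS = 21e9    # all experts, must be resident in VRAM
ACTIVE_PARAMS = 3.6e9  # used per token, a compute figure only


def weight_gb(params, bytes_per_param):
    """Approximate weight size in GB (1 GB = 1e9 bytes)."""
    return params * bytes_per_param / 1e9


bf16_gb = weight_gb(TOTAL_PARAMS, 2.0)          # 2 bytes per parameter
four_bit_gb = weight_gb(TOTAL_PARAMS, 0.5)      # ~0.5 bytes per parameter
active_only_gb = weight_gb(ACTIVE_PARAMS, 0.5)  # NOT a valid VRAM estimate

# Shard sizes quoted in the Requirements table (HF tree API).
shards_gb = [4.79, 4.80, 4.17]
download_gb = round(sum(shards_gb), 2)

print(f"BF16, all params:        {bf16_gb:.1f} GB")
print(f"4-bit, all params:       {four_bit_gb:.1f} GB")
print(f"4-bit, active params:    {active_only_gb:.1f} GB (misleading)")
print(f"Actual safetensors size: {download_gb:.2f} GB")
```

The real download is larger than the naive 4-bit estimate, which is consistent with the model card describing MXFP4 quantization of the MoE weights specifically rather than of every tensor.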

Requirements

Component | Minimum | Tested
GPU | 16 GB VRAM (per HF card: "the gpt-oss-20b model run within 16GB of memory") | RTX 4080 SUPER (16 GB)
RAM | 16 GB system RAM |
Storage | ~14 GB on disk for the MXFP4 weights (Ollama listing shows 14GB; HF safetensors total: 4.79 + 4.80 + 4.17 = 13.76 GB per the HF tree API) |
Software | NVIDIA driver with CUDA 12+; Python 3.10+ (for vLLM / Transformers paths) |
License | Apache-2.0 (HF card) |

Installation

Pick one of the three paths below. Ollama is the fastest path to a working chat; vLLM gives you an OpenAI-compatible HTTP server; Transformers is the reference implementation.

Path A — Ollama (simplest)

ollama pull gpt-oss:20b
ollama run gpt-oss:20b

This downloads the ~14 GB MXFP4 weights into Ollama's model store and drops you into an interactive chat. Per the official Ollama listing, the 20B variant is the consumer-hardware option: "quantizing these to MXFP4 enables the smaller model to run on systems with as little as 16GB memory."

Path B — vLLM (OpenAI-compatible server)

vLLM ships a gpt-oss-specific wheel pinned against PyTorch nightly cu128 — this is the install command from the OpenAI HF model card:

uv pip install --pre vllm==0.10.1+gptoss \
    --extra-index-url https://wheels.vllm.ai/gpt-oss/ \
    --extra-index-url https://download.pytorch.org/whl/nightly/cu128 \
    --index-strategy unsafe-best-match

vllm serve openai/gpt-oss-20b

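vllm serve starts an OpenAI-compatible HTTP server on http://localhost:8000/v1, exposing the chat completions and completions endpoints (gpt-oss is a generative model, not an embedding model, so the embeddings endpoint does not apply here). On first launch it downloads the weights from the Hub and builds the MoE kernels for your GPU.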

Path C — Transformers (reference)

pip install -U transformers kernels torch

Then, in Python:

from transformers import pipeline

pipe = pipeline(
    task="text-generation",
    model="openai/gpt-oss-20b",
)
pipe("Plants create energy through a process known as")

Per the Transformers model docs, this loads the safetensors in their native MXFP4 layout and runs a text-generation pipeline. Note from the same docs: "SDPA is not supported because attention sinks require direct access to the full attention logits before softmax. Use Flash Attention or Flex Attention instead." — on the Ada-Lovelace 4080 SUPER (sm_89), prebuilt Flash Attention wheels are available, so no special wheel selection is required.

Running

Once any of the three paths above is up, send a request. With Ollama:

ollama run gpt-oss:20b "Explain mixture-of-experts routing in one paragraph."

With vLLM (OpenAI-compatible):

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "openai/gpt-oss-20b",
    "messages": [{"role": "user", "content": "Explain mixture-of-experts routing in one paragraph."}]
  }'
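
The same request can be sent from Python with only the standard library. This is a minimal sketch that mirrors the curl call above; it assumes the vLLM server is on its default port, and the function names and the `max_tokens` value are illustrative choices rather than anything the recipe prescribes.

```python
import json
import urllib.request

# Assumption: vLLM is serving on the default address used in the curl example.
BASE_URL = "http://localhost:8000/v1"


def build_chat_request(prompt, model="openai/gpt-oss-20b", max_tokens=256):
    """Build (but do not send) a POST request for /chat/completions."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,  # illustrative cap on the reply length
    }
    return urllib.request.Request(
        url=f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt, timeout=120):
    """Send the request and return the assistant's reply text."""
    request = build_chat_request(prompt)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

With the server running, `print(chat("Explain mixture-of-experts routing in one paragraph."))` should print the reply. Building the request separately from sending it keeps the payload easy to inspect without a live server.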

First-token latency on a 4080 SUPER is sub-second at short context; sustained generation is in the ~139 tok/s range at 4k context — see Results.

Results

  • Speed: 139.1 tokens/s generation and 6364.0 tokens/s prefill at 4k context length, MXFP4 quantization. Measured on an RTX 4080 SUPER by Hardware-Corner in their gpt-oss 20B (MXFP4) row of the "Prompt Processing" and "Token Generation" benchmark tables. At longer context the throughput drops as expected: 123.0 / 107.2 / 81.5 / 60.0 t/s generation at 16k / 32k / 64k / 128k respectively (same source page). The 6364.0 t/s prefill at 4k is mirrored in our benchmark data.
  • VRAM usage: Fits within the 16 GB envelope, but only just: our /check/gpt-oss-20b/rtx-4080-super benchmark records a 16.0 GB peak at 4k context (verdict: runs), and the MXFP4 weights are 13.76 GB on disk per the HF tree API. The HF card frames the deployment envelope as "the gpt-oss-20b model run within 16GB of memory." On a 16 GB card there is little headroom for very long contexts, so cap the context length if you push past 4k (see Troubleshooting).
  • Quality notes: gpt-oss 20B is post-trained with reasoning support and tool-use — see the model card for the canonical harmony chat template usage. The MXFP4 quantization is the native release format (not a community after-the-fact quant), so there is no separate BF16 weight set to compare against — the BF16 weights are not publicly released.
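
To make the context-length falloff concrete, the sketch below turns the generation figures from the Speed bullet into relative throughput and an approximate decode time. The 500-token reply length is a hypothetical value chosen for illustration, and prefill time is ignored.

```python
# Generation throughput in tok/s by context length, copied from the Speed bullet.
GEN_TOK_S = {"4k": 139.1, "16k": 123.0, "32k": 107.2, "64k": 81.5, "128k": 60.0}

REPLY_TOKENS = 500  # hypothetical reply length, for illustration only


def relative_throughput(table, baseline="4k"):
    """Each context length's throughput as a fraction of the baseline."""
    return {ctx: rate / table[baseline] for ctx, rate in table.items()}


def seconds_for_reply(table, tokens=REPLY_TOKENS):
    """Decode time for a reply of `tokens` tokens, ignoring prefill."""
    return {ctx: tokens / rate for ctx, rate in table.items()}


fractions = relative_throughput(GEN_TOK_S)
durations = seconds_for_reply(GEN_TOK_S)
for ctx in GEN_TOK_S:
    print(f"{ctx:>4}: {fractions[ctx]:5.0%} of 4k speed, "
          f"~{durations[ctx]:.1f} s per {REPLY_TOKENS}-token reply")
```

By these numbers, generation at 128k context runs at roughly 43% of the 4k rate.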

For the full benchmark data, see /check/gpt-oss-20b/rtx-4080-super.

Troubleshooting

Generation is slower than a Blackwell RTX 50-series card at the same VRAM

Expected. The RTX 4080 SUPER is Ada Lovelace (sm_89) and has no native FP4 tensor cores — those arrived with Blackwell. The MXFP4 weights still load and run (you get the memory footprint), but the matmuls run through pre-Blackwell fallback kernels rather than native FP4 tensor-core paths (vLLM documents this fallback behaviour). The 139 tok/s figure in Results is already the realistic Ada number — don't expect Blackwell-class FP4 throughput on this card.

Out of memory at long context on the 16 GB card

The weights alone are ~14 GB, leaving only ~2 GB for the KV cache and activations on a 16 GB 4080 SUPER — much tighter than the 4090's 24 GB. The 16.0 GB peak in our benchmark is measured at 4k context. If you push context well past 4k you can OOM. Cap the context window (Ollama: set a smaller num_ctx; vLLM: --max-model-len 8192) and the model stays within the envelope.
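
For the Ollama side, one way to cap the window per request is the `options.num_ctx` field of Ollama's REST chat endpoint. The sketch below assumes Ollama is listening on its default local port; the function names and the 8192 default are illustrative (8192 simply mirrors the vLLM flag above).

```python
import json
import urllib.request

# Assumption: Ollama is listening on its default local address.
OLLAMA_CHAT_URL = "http://localhost:11434/api/chat"


def build_ollama_request(prompt, num_ctx=8192, model="gpt-oss:20b"):
    """Build (but do not send) a chat request with a capped context window."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,                  # return one JSON object, not a stream
        "options": {"num_ctx": num_ctx},  # context window for this request
    }
    return urllib.request.Request(
        OLLAMA_CHAT_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt, num_ctx=8192, timeout=300):
    """Send the request and return the assistant's reply text."""
    request = build_ollama_request(prompt, num_ctx)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.load(resp)["message"]["content"]
```

Lowering `num_ctx` further (for example `ask("...", num_ctx=4096)`) trades usable context for KV-cache headroom.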

"I only have 12 GB / 8 GB VRAM — can I run this?"

Not with the native MXFP4 weights. The 16 GB floor on the model card is already the smallest official footprint: the weights ship pre-quantized to 4-bit, and there is no smaller official tier. Community llama.cpp GGUF re-quants exist but are out of scope here; if your card is under 16 GB, run gpt-oss 20B remotely or pick a smaller model (Qwen3-8B Q4_K_M fits 8 GB).

vLLM install fails with PyTorch resolution errors

The vllm==0.10.1+gptoss wheel is pinned against the PyTorch nightly cu128 channel — that is why the --extra-index-url https://download.pytorch.org/whl/nightly/cu128 and --index-strategy unsafe-best-match flags are mandatory in the install command. Dropping either flag will produce dependency-resolution errors at install time. The exact command above is taken verbatim from the OpenAI HF model card.

"21B / 3.6B active — why does VRAM need all 21B?"

Per-token MoE routing means the gating network picks which experts to fire on a per-token basis, so the model cannot pre-prune which experts to load. All 21B parameters must be resident in VRAM. The 3.6B active figure is a compute / FLOPs-per-token metric, not a memory metric. This is the same pattern as Mixtral 8×7B (47B resident) and DeepSeek-V3 (671B resident). The reason this 21B fits 16 GB anyway is the native MXFP4 format — the model card describes the "MXFP4 quantization of the MoE weights".
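
A toy top-k router illustrates why no expert can be left out of VRAM. This is a sketch under stated assumptions, not gpt-oss's actual router: the expert count, the top-k value, and the random scores standing in for gating logits are all made up for illustration.

```python
import random

NUM_EXPERTS = 8  # toy value, not the real gpt-oss expert count
TOP_K = 2        # toy value, not the real number of active experts


def route(scores, k=TOP_K):
    """Return the indices of the k highest-scoring experts for one token."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]


def experts_used(num_tokens, seed=0):
    """Collect every expert selected at least once across num_tokens tokens."""
    rng = random.Random(seed)
    used = set()
    for _ in range(num_tokens):
        # Random scores stand in for the gating network's per-token logits.
        scores = [rng.random() for _ in range(NUM_EXPERTS)]
        used.update(route(scores))
    return used


print(f"Experts touched by one token:   {len(experts_used(1))} of {NUM_EXPERTS}")
print(f"Experts touched by 1000 tokens: {len(experts_used(1000))} of {NUM_EXPERTS}")
```

A single token touches only `TOP_K` experts, which is the compute saving. Over a realistic sequence every expert ends up selected, so none of them can be dropped from memory ahead of time.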

Want different hardware numbers?

If you have benchmark data on a different RTX 4080 SUPER configuration (longer context, different runtime, batch > 1), submit it via /contribute so we can grow the /check/gpt-oss-20b/rtx-4080-super page beyond Hardware-Corner's measurement.