
gpt-oss 20B on RTX 5080: MXFP4 Chat at 172 tok/s via Ollama or vLLM

llm · beginner · 16GB+ VRAM · May 28, 2026
Prerequisites
  • NVIDIA RTX 5080 (16 GB GDDR7) or any consumer card with at least 16 GB VRAM
  • Recent NVIDIA driver with CUDA 12.8 support (required for Blackwell sm_120)
  • Python 3.10+ (only for the vLLM or Transformers paths)
  • ~14 GB free disk for the MXFP4 weights

What You'll Build

A local chat endpoint backed by OpenAI's open-weights gpt-oss-20b — a 21B-parameter mixture-of-experts LLM (3.6B active per token) shipped in native MXFP4 quantization. Two installation paths are covered: Ollama (one command, drop-in chat) or vLLM (OpenAI-compatible HTTP API). On the RTX 5080's 16 GB GDDR7 envelope this model sits right at its stated floor — it fits, but unlike a 24/32 GB card there is no room to colocate a second model, so the framing here is "how to run it comfortably at the 16 GB limit," not "what to do with the spare VRAM."

Hardware data: RTX 5080 (16 GB GDDR7) · 172.4 tok/s generation, 9,146.1 tok/s prefill at 4K context (MXFP4) · Benchmark data: /check/gpt-oss-20b/rtx-5080

ℹ️ MXFP4 is FP4-microscaling, native to this model — not a separate community quant. Per the HF card, "The models were post-trained with MXFP4 quantization of the MoE weights ... and the gpt-oss-20b model run within 16GB of memory. All evals were performed with the same MXFP4 quantization." The 4-bit weights ship in the canonical release and run over standard quantized-matmul kernels in llama.cpp / Ollama / vLLM. On the RTX 5080 (Blackwell sm_120) those kernels run natively — there is no FP8-on-Ampere style dequant penalty here, because Blackwell has native MXFP4/FP8 tensor-core paths. (Hardware Corner labels the row MXFP4 directly on the RTX 5080 page — there is no separate Q4_K community quant to confuse it with.)

Requirements

Component | Minimum | Tested
--- | --- | ---
GPU | 16 GB VRAM (per HF card: "the gpt-oss-20b model run within 16GB of memory") | RTX 5080 (16 GB GDDR7)
RAM | 16 GB system RAM |
Storage | ~14 GB for MXFP4 weights (Ollama listing shows 14 GB; HF safetensors total 4.79 + 4.80 + 4.17 = 13.76 GB per the HF tree API) |
Software | NVIDIA driver with CUDA 12.8+ (required for Blackwell sm_120); Python 3.10+ (for vLLM path) |
License | Apache 2.0 (HF card), commercial use permitted |
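
The storage arithmetic above can be checked in a few lines. This is a rough sketch only: it uses the three shard sizes quoted in the Storage row and naively treats disk GB and VRAM GB as the same unit, so the "headroom" figure is an approximation, not a measured value.

```python
# Shard sizes in GB, as quoted in the Storage row (HF tree API figures).
shard_sizes_gb = [4.79, 4.80, 4.17]

# Total weight footprint on disk.
total_gb = round(sum(shard_sizes_gb), 2)

# Naive headroom on a 16 GB card: everything that is not weights
# (KV cache, activations, CUDA context) has to fit in this remainder.
vram_gb = 16.0
headroom_gb = round(vram_gb - total_gb, 2)

print(f"weights: {total_gb} GB, naive headroom on a {vram_gb:.0f} GB card: {headroom_gb} GB")
```

The remainder is small, which is consistent with the recipe's framing of the 5080 as a fits-at-the-floor card.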

Installation

Two paths. Pick one. Ollama is the fastest route to a working chat; vLLM gives you an OpenAI-compatible HTTP server suitable for production-style usage.

Path A — Ollama (recommended for first run)

Verbatim from the official HF model card:

ollama pull gpt-oss:20b
ollama run gpt-oss:20b

That's it. The gpt-oss:20b tag is the native MXFP4 build — per the official Ollama listing, "quantizing these to MXFP4 enables the smaller model to run on systems with as little as 16GB memory." First run downloads ~14 GB and drops you into an interactive REPL.

Path B — vLLM (OpenAI-compatible API server)

Verbatim from the official HF model card:

uv pip install --pre vllm==0.10.1+gptoss \
    --extra-index-url https://wheels.vllm.ai/gpt-oss/ \
    --extra-index-url https://download.pytorch.org/whl/nightly/cu128 \
    --index-strategy unsafe-best-match

vllm serve openai/gpt-oss-20b

Both --extra-index-url flags are mandatory. The vllm==0.10.1+gptoss build is pinned against a PyTorch nightly served from the cu128 channel — drop either flag and dependency resolution fails. The cu128 PyTorch wheel is also what enables Blackwell sm_120 kernels in the first place: CUDA 12.8 is the first toolkit with native Blackwell support, and the stable cu126 wheels do not ship sm_120 kernels.
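
To confirm that the installed PyTorch build actually carries Blackwell kernels, one option is to inspect the architecture list the wheel was compiled for. The sketch below is an illustration, not part of the HF card: the helper names are made up here, and it assumes the build lists its targets as strings such as "sm_120" (some builds may instead ship forward-compatible PTX under a "compute_..." entry, which this simple check does not cover).

```python
from typing import Iterable


def has_blackwell_kernels(arch_list: Iterable[str]) -> bool:
    """Return True if the compiled-architecture list includes sm_120."""
    return "sm_120" in set(arch_list)


def report() -> str:
    """Describe the local PyTorch build; safe to call without torch installed."""
    try:
        import torch  # imported lazily so this file loads without torch
    except ImportError:
        return "torch is not installed in this environment"
    if not torch.cuda.is_available():
        return "torch is installed but CUDA is not available"
    archs = torch.cuda.get_arch_list()                 # e.g. ['sm_90', 'sm_120', ...]
    major, minor = torch.cuda.get_device_capability(0)  # (12, 0) on sm_120
    return (
        f"torch {torch.__version__} (CUDA {torch.version.cuda}), "
        f"device capability sm_{major}{minor}, "
        f"sm_120 kernels present: {has_blackwell_kernels(archs)}"
    )
```

Running `print(report())` inside the environment created by the install command gives a one-line summary; a result without sm_120 suggests a non-cu128 wheel was resolved.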

Once vllm serve reports it is listening on port 8000, the server speaks the OpenAI Chat Completions API.

Running

Ollama (one-shot prompt; omit the quoted prompt for the interactive REPL):

ollama run gpt-oss:20b "Explain mixture-of-experts routing in one paragraph."
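
Ollama also serves a local HTTP API, so the same prompt can be sent from Python. The sketch below uses only the standard library and assumes Ollama's default address (localhost:11434); the helper names and the 8192-token context cap are illustrative choices, not values from the recipe.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/chat"  # assumed default Ollama address


def build_ollama_request(prompt: str, num_ctx: int = 8192) -> urllib.request.Request:
    """Build a non-streaming chat request with an explicit context cap."""
    payload = {
        "model": "gpt-oss:20b",
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,                    # return one JSON object, not a stream
        "options": {"num_ctx": num_ctx},    # illustrative cap; tune for your VRAM
    }
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask_ollama(prompt: str, num_ctx: int = 8192) -> str:
    """Send the request to a running Ollama instance and return the reply text."""
    with urllib.request.urlopen(build_ollama_request(prompt, num_ctx)) as resp:
        body = json.load(resp)
    return body["message"]["content"]
```

With Ollama running, `ask_ollama("Explain mixture-of-experts routing in one paragraph.")` returns the reply as a string.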

vLLM (HTTP):

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "openai/gpt-oss-20b",
    "messages": [{"role": "user", "content": "Explain mixture-of-experts routing in one paragraph."}]
  }'
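
The same request can be issued from Python. This is a minimal standard-library sketch that mirrors the curl call above; the function names are hypothetical, and it assumes the server is listening on localhost:8000 as described in the installation step.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000/v1"   # vLLM server address used in this recipe
MODEL = "openai/gpt-oss-20b"


def build_chat_request(prompt: str, base_url: str = BASE_URL,
                       model: str = MODEL) -> urllib.request.Request:
    """Build a Chat Completions request equivalent to the curl example."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt: str) -> str:
    """Send the prompt to the running server and return the assistant text."""
    with urllib.request.urlopen(build_chat_request(prompt)) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

Once `vllm serve` is up, `chat("Explain mixture-of-experts routing in one paragraph.")` returns the generated paragraph.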

Either path keeps all 21B parameters resident in VRAM (see the MoE note in Results). The model uses OpenAI's harmony response format — per the HF card, the models "should only be used with the harmony format as it will not work correctly otherwise." Ollama and vLLM apply it automatically via the bundled chat template; you only need to apply it by hand if you call model.generate directly through Transformers. The card also exposes a configurable reasoning effort — set "Reasoning: high" (or low/medium) in the system prompt to trade latency for depth.
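
As a small illustration of the reasoning-effort setting, the helper below builds a message list with the "Reasoning: ..." line in the system prompt. The helper name is hypothetical; the three accepted levels are the ones named above.

```python
VALID_EFFORTS = ("low", "medium", "high")


def with_reasoning_effort(user_prompt: str, effort: str = "medium") -> list[dict]:
    """Return chat messages whose system prompt sets the reasoning effort."""
    if effort not in VALID_EFFORTS:
        raise ValueError(f"effort must be one of {VALID_EFFORTS}, got {effort!r}")
    return [
        {"role": "system", "content": f"Reasoning: {effort}"},
        {"role": "user", "content": user_prompt},
    ]


messages = with_reasoning_effort("Summarize MoE routing.", effort="high")
print(messages[0]["content"])
```

The resulting list can be passed as the "messages" field of a chat request to either runtime; lower effort should reduce latency at the cost of depth.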

Results

  • Generation speed: 172.4 tokens/s at 4K context (MXFP4 quantization), measured on RTX 5080 by Hardware Corner's gpu-llm-benchmarks in the gpt-oss 20B row of the Token Generation table. The same row reports 161.3 / 149.3 / 130.9 / 105.6 tok/s walking from 16K → 32K → 64K → 128K context as the KV cache grows. This matches backend benchmark id=82 (172.4 tok/s, 4k context generation speed).
  • Prefill speed: 9,146.1 tokens/s at 4K context on the same row of the Hardware Corner Prompt Processing table, trailing to 7,560.5 / 5,906.9 / 3,041.1 / 1,726.6 tok/s at 16K → 32K → 64K → 128K. This matches backend benchmark id=81 (9146.1 prefill tokens/s, 4k context, peak 16.0 GB).
  • VRAM usage: Peak 16.0 GB per the cited backend benchmark — i.e. the model uses essentially the whole 16 GB envelope at the 4K-context configuration. The MXFP4 weights are 13.76 GB on disk per the HF tree API, and the HF card frames the deployment envelope as "the gpt-oss-20b model run within 16GB of memory." On the 16 GB 5080 that leaves very little spare — this is a fits-at-the-floor recipe, not a headroom recipe (contrast the 32 GB RTX 5090 sibling, which has ~16 GB free for colocation). Plan to cap context tightly if you run other GPU workloads alongside. See /check/gpt-oss-20b/rtx-5080.
  • Quality notes: MXFP4 is the native release format (not a community after-the-fact quant), so there is no higher-precision baseline to compare against — the BF16 weights are not publicly released, and per the card "All evals were performed with the same MXFP4 quantization." The model is post-trained for reasoning and tool-use; use the canonical harmony chat template for correct behavior, and pick a reasoning effort (low/medium/high) to suit your latency budget.
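
To make the context-length trade-off concrete, the figures quoted above can be turned into relative rates and rough wall-clock times. This sketch only rearranges the published numbers; the 1,000-token reply length is an arbitrary illustration, and it assumes the rate stays constant over the reply, which is a simplification.

```python
# Rates quoted in the Results bullets (Hardware Corner, RTX 5080, MXFP4).
GEN_TOK_S = {"4K": 172.4, "16K": 161.3, "32K": 149.3, "64K": 130.9, "128K": 105.6}
PREFILL_TOK_S = {"4K": 9146.1, "16K": 7560.5, "32K": 5906.9, "64K": 3041.1, "128K": 1726.6}


def retained(table: dict[str, float]) -> dict[str, float]:
    """Fraction of the 4K-context rate that survives at each context size."""
    base = table["4K"]
    return {ctx: rate / base for ctx, rate in table.items()}


def seconds_to_generate(n_tokens: int, ctx: str) -> float:
    """Approximate time to emit n_tokens at the quoted rate for that context."""
    return n_tokens / GEN_TOK_S[ctx]


for ctx, frac in retained(GEN_TOK_S).items():
    print(f"{ctx:>4}: generation keeps {frac:.1%} of the 4K rate, "
          f"{seconds_to_generate(1000, ctx):.2f} s per 1,000 tokens")
```

Prefill degrades much faster than generation across the same range, which `retained(PREFILL_TOK_S)` makes visible.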

For the full benchmark data and cross-card compare, see /check/gpt-oss-20b/rtx-5080.

Troubleshooting

All 21B parameters must fit, not just 3.6B

The HF card markets the model as "21B parameters with 3.6B active parameters" — but the 3.6B is a FLOPs-per-token figure (how many parameters fire on each forward pass), not a VRAM figure. gpt-oss-20b is a router-per-token mixture-of-experts model (32 experts, top-4 routed per token per its config.json: num_local_experts: 32, num_experts_per_tok: 4), so all experts must be resident in VRAM because the router can pick any of them per token at inference time — you cannot pre-prune. That is why the deployment minimum is the full ~14 GB of MXFP4 weights plus working set, landing at the stated 16 GB envelope and the 16.0 GB peak in the cited benchmark — not a smaller number derived from the active-param subset.
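
A toy simulation shows why the active-parameter count does not shrink the VRAM requirement. The sketch below is not the model's router: real routing logits come from a learned layer, whereas here they are random numbers, and normalizing weights with a softmax over the selected experts is one common formulation assumed for illustration. Only the 32-expert, top-4 shape is taken from config.json.

```python
import math
import random

NUM_EXPERTS = 32   # num_local_experts in config.json
TOP_K = 4          # num_experts_per_tok in config.json


def route(logits: list[float], k: int = TOP_K) -> list[tuple[int, float]]:
    """Pick the k highest-scoring experts and softmax-normalize their weights."""
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    peak = max(logits[i] for i in top)
    exps = [math.exp(logits[i] - peak) for i in top]   # subtract peak for stability
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]


# Stand-in for a stream of tokens: random logits instead of a learned router.
rng = random.Random(0)
used: set[int] = set()
for _token in range(1000):
    logits = [rng.gauss(0.0, 1.0) for _expert in range(NUM_EXPERTS)]
    used.update(index for index, _weight in route(logits))

print(f"experts touched after 1000 tokens: {len(used)} of {NUM_EXPERTS}")
```

Each token touches only 4 experts, but across a sequence every expert ends up selected, so all of them have to stay loaded.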

Out of memory at longer context on the 16 GB card

Unlike the 24 GB / 32 GB siblings, the 5080 runs this model at essentially its floor: the 16.0 GB peak in the benchmark already accounts for the full weights plus a modest KV cache. Pushing toward the model's 128K context window (max_position_embeddings: 131072 in config.json) grows the KV cache and can tip a 16 GB card into OOM. Mitigations, in order: first cap the context window in your runtime (Ollama: OLLAMA_CONTEXT_LENGTH / num_ctx; vLLM: --max-model-len), then avoid colocating other GPU workloads. The Hardware Corner row measures up to 128K, the model's full window, and generation falls from 172.4 tok/s at 4K to 105.6 tok/s at 128K as the cache fills.
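
A back-of-the-envelope estimate can help pick a context cap. The sketch below uses the generic formula for a dense KV cache (keys plus values, per layer, per KV head). The layer count, KV-head count, and head dimension shown are assumptions for illustration and are not taken from this recipe; read the real values from the model's config.json. The formula also ignores any sliding-window attention or KV-cache quantization a runtime may apply, so treat the result as an upper bound rather than a prediction.

```python
def kv_cache_bytes(context_len: int, num_layers: int, num_kv_heads: int,
                   head_dim: int, bytes_per_elem: int = 2) -> int:
    """Upper-bound KV cache size for one sequence with full attention everywhere.

    The leading factor of 2 covers keys and values; bytes_per_elem=2 assumes
    a 16-bit cache.
    """
    return 2 * context_len * num_layers * num_kv_heads * head_dim * bytes_per_elem


# ASSUMED shape values for illustration only; check config.json before relying on them.
ASSUMED_LAYERS = 24
ASSUMED_KV_HEADS = 8
ASSUMED_HEAD_DIM = 64

for label, tokens in [("4K", 4096), ("16K", 16384), ("32K", 32768),
                      ("64K", 65536), ("128K", 131072)]:
    size = kv_cache_bytes(tokens, ASSUMED_LAYERS, ASSUMED_KV_HEADS, ASSUMED_HEAD_DIM)
    print(f"{label:>4}: up to {size / 1024**3:.2f} GiB of KV cache")
```

Because the cache grows linearly with context length, halving the cap halves the bound, which is why capping context is the first mitigation listed.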

vLLM install fails with dependency resolution errors

Both --extra-index-url lines in the Path B install command are mandatory (per the HF card). The vllm==0.10.1+gptoss build depends on a PyTorch nightly served from download.pytorch.org/whl/nightly/cu128, not the stable channel. Drop either flag and the resolver (uv pip in the command above) won't find a compatible torch. On Blackwell specifically, the cu128 pin is also what enables sm_120 kernels: stable cu126 wheels don't ship Blackwell support.

Flash-Attention errors on first inference call

If you swap the recipe's runtime for something that requests FlashAttention-2 explicitly (a HF Transformers attn_implementation="flash_attention_2" snippet, a custom vLLM config, a script copied from an Ampere/Ada walkthrough), Blackwell may crash at the first forward pass. FA2 sm_120 kernel coverage is still in-flight at Dao-AILab/flash-attention#2168 (open as of recipe write time). The recipe's documented paths (Ollama, vLLM) do not need FA2: Ollama uses llama.cpp's CUDA backend, and vLLM defaults to its own attention kernels. If you do hit an FA2 error in a custom setup, switch attn_implementation to "sdpa" (PyTorch's native scaled-dot-product attention, which does not depend on the flash-attn package and is the usual fallback on sm_120 with cu128 wheels).

Generation slower than expected for the card

Two checks: (a) confirm you installed the cu128 PyTorch wheel (vLLM path) or a recent Ollama build, since Ollama versions older than the Blackwell-support cutoff fall back to a slower kernel; (b) confirm you are at small context. LLM token generation is memory-bandwidth-bound: every new token has to read the cached keys and values for the whole context, so the per-token rate drops as the KV cache grows past 4K. The Hardware Corner RTX 5080 table shows 172.4 → 105.6 tok/s walking from 4K to 128K context on the same hardware.
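
For a quick sanity check against the published figure, throughput can be estimated from a Chat Completions response and a wall-clock timer. This is a rough sketch with a hypothetical helper name: it reads the completion_tokens count from the response's usage field, and because the elapsed time normally includes prefill and HTTP overhead, the result will read somewhat lower than a pure generation benchmark.

```python
def tokens_per_second(response: dict, elapsed_s: float) -> float:
    """Estimate generation throughput from a parsed Chat Completions response."""
    if elapsed_s <= 0:
        raise ValueError("elapsed_s must be positive")
    completion_tokens = response["usage"]["completion_tokens"]
    return completion_tokens / elapsed_s


# Example with a stand-in response; a real one comes from the server's JSON reply.
fake_response = {"usage": {"prompt_tokens": 20, "completion_tokens": 344}}
print(f"{tokens_per_second(fake_response, 2.0):.1f} tok/s")
```

In practice, wrap the request in two `time.perf_counter()` calls and pass the difference as elapsed_s; use a short prompt and a long reply so prefill contributes little to the total.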

Want different hardware numbers?

If you have benchmark data on a different RTX 5080 configuration (longer context, batched serving, different runtime), submit it via /contribute so we can grow the /check/gpt-oss-20b/rtx-5080 page beyond Hardware Corner's first-party row.