gpt-oss 20B on RTX 3090: MXFP4 Chat at 147 tok/s via Ollama or vLLM

llm · beginner · 16GB+ VRAM · May 22, 2026

Prerequisites
  • NVIDIA RTX 3090 (24 GB VRAM) or any consumer card with at least 16 GB VRAM
  • Recent NVIDIA driver with CUDA 12+
  • Python 3.10+ (only for the vLLM path)

What You'll Build

A local chat endpoint backed by OpenAI's open-weights gpt-oss-20b model — a 21B-parameter mixture-of-experts LLM (3.6B active per token) shipped in native MXFP4 quantization. Two installation paths are covered: Ollama (one command, drop-in chat), or vLLM (OpenAI-compatible HTTP API).

Hardware data: RTX 3090 (24 GB VRAM) · 147.5 tok/s generation at 4K context (MXFP4, CUDA + FlashAttention) · See benchmark data

ℹ️ MXFP4 is FP4-microscaling, not FP8. Unlike FP8 (which needs Ada sm_89 / Hopper sm_90 tensor cores the 3090 lacks), MXFP4 runs over standard quantized-matmul kernels in llama.cpp / Ollama / vLLM. The 3090 takes the same MXFP4 path as the 4090 — no dequantize-to-BF16 fallback, no architectural workaround. The performance gap to the 4090 sibling is the usual memory-bandwidth difference (3090's 936 GB/s vs 4090's 1008 GB/s), not a quant-format penalty.
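To make "microscaling" concrete, the sketch below decodes one MXFP4 block in plain Python. It assumes the layout described in the OCP Microscaling Formats specification: 32 four-bit E2M1 elements sharing one 8-bit E8M0 power-of-two scale. The function names are illustrative only, and real kernels in llama.cpp or vLLM work on packed tensors on the GPU rather than on Python lists.

```python
# Magnitudes representable by E2M1, indexed by the low three bits of a code.
E2M1_MAGNITUDES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
BLOCK_SIZE = 32   # elements that share one scale in an MXFP4 block
E8M0_BIAS = 127   # exponent bias of the shared 8-bit scale


def decode_e2m1(code: int) -> float:
    """Decode one 4-bit element: bit 3 is the sign, bits 0-2 pick the magnitude."""
    if not 0 <= code <= 0xF:
        raise ValueError("E2M1 codes are 4-bit values (0..15)")
    sign = -1.0 if code & 0b1000 else 1.0
    return sign * E2M1_MAGNITUDES[code & 0b0111]


def dequantize_block(codes: list[int], scale_byte: int) -> list[float]:
    """Expand one block: each element is multiplied by 2**(scale_byte - 127)."""
    if len(codes) != BLOCK_SIZE:
        raise ValueError(f"expected {BLOCK_SIZE} codes, got {len(codes)}")
    if not 0 <= scale_byte < 0xFF:  # 0xFF is reserved for NaN in E8M0
        raise ValueError("scale_byte must be in 0..254")
    scale = 2.0 ** (scale_byte - E8M0_BIAS)
    return [decode_e2m1(code) * scale for code in codes]
```

The point of the shared scale is that only one extra byte is stored per 32 weights, which is why the format stays close to 4 bits per weight while still covering a wide dynamic range.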

Requirements

Component | Minimum | Tested
GPU | 16 GB VRAM (per HF model card) | RTX 3090 (24 GB)
RAM | 16 GB system RAM |
Storage | ~14 GB for MXFP4 weights (per Ollama library) |
Software | CUDA 12+; Python 3.10+ (vLLM path only) |
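A quick way to check the GPU row before downloading anything is to ask nvidia-smi for total memory. The sketch below is one possible check, not part of the official instructions: the helper names are made up for illustration, and the 16,000 MiB threshold is an assumption chosen loosely, since cards sold as "16 GB" can report slightly under 16,384 MiB.

```python
import subprocess

# Assumption: a loose stand-in for the "16 GB" minimum, in MiB.
MIN_VRAM_MIB = 16_000


def parse_vram_mib(smi_output: str) -> list[int]:
    """Parse one integer MiB value per GPU from nvidia-smi CSV output."""
    return [int(line.strip()) for line in smi_output.splitlines() if line.strip()]


def query_vram_mib() -> list[int]:
    """Ask nvidia-smi for the total memory of every visible GPU, in MiB."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.total", "--format=csv,noheader,nounits"],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_vram_mib(result.stdout)


def meets_minimum(vram_mib: int, minimum_mib: int = MIN_VRAM_MIB) -> bool:
    """True when a card has at least the assumed minimum amount of VRAM."""
    return vram_mib >= minimum_mib
```

Calling `[meets_minimum(v) for v in query_vram_mib()]` on the target machine gives one boolean per GPU.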

Installation

Two paths are provided. Pick one. Ollama is the fastest route to a working chat session; vLLM gives you an OpenAI-compatible HTTP server suitable for production-style usage.

Path A — Ollama (recommended for first run)

ollama pull gpt-oss:20b
ollama run gpt-oss:20b

That's it. The gpt-oss:20b tag is the native MXFP4 build: Ollama supports MXFP4 natively, with no re-quantization or conversion step. The pull downloads ~14 GB, and ollama run then drops you into an interactive REPL.
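Besides the REPL, a running Ollama instance also exposes a local HTTP API, by default on port 11434. The following is a minimal sketch of a non-streaming chat call using only the Python standard library. The helper names are illustrative, and the default URL is an assumption that holds only if Ollama's listen address has not been changed.

```python
import json
import urllib.request

# Assumption: Ollama is listening on its default local address.
OLLAMA_CHAT_URL = "http://localhost:11434/api/chat"


def build_chat_request(prompt: str, model: str = "gpt-oss:20b") -> dict:
    """Build the JSON body for a single-turn, non-streaming chat call."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,  # ask for one JSON object instead of a stream
    }


def chat(prompt: str, url: str = OLLAMA_CHAT_URL, timeout: float = 300.0) -> str:
    """Send the prompt to Ollama and return the assistant's reply text."""
    body = json.dumps(build_chat_request(prompt)).encode("utf-8")
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        reply = json.load(response)
    return reply["message"]["content"]
```

With the model pulled and Ollama running, `print(chat("Say hello in five words."))` should print a short reply.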

Path B — vLLM (OpenAI-compatible API server)

Verbatim from the official HF model card:

uv pip install --pre vllm==0.10.1+gptoss \
    --extra-index-url https://wheels.vllm.ai/gpt-oss/ \
    --extra-index-url https://download.pytorch.org/whl/nightly/cu128 \
    --index-strategy unsafe-best-match

vllm serve openai/gpt-oss-20b

Both --extra-index-url flags are mandatory — without them, dependency resolution fails because the vllm==0.10.1+gptoss build pulls a PyTorch nightly. Once vllm serve reports it is listening on port 8000, the server speaks the OpenAI Chat Completions API.
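Model loading can take a while, so a script that depends on the server may want to wait until it answers. The sketch below polls the OpenAI-style model listing at /v1/models and returns the served model ids. The function names, the timeout values, and the localhost:8000 address are assumptions for illustration.

```python
import json
import time
import urllib.request


def served_model_ids(models_response: dict) -> list[str]:
    """Extract model ids from an OpenAI-style /v1/models response."""
    return [entry["id"] for entry in models_response.get("data", [])]


def wait_for_server(
    base_url: str = "http://localhost:8000",
    timeout_s: float = 600.0,
    poll_s: float = 5.0,
) -> list[str]:
    """Poll /v1/models until the server responds or the deadline passes."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(f"{base_url}/v1/models", timeout=10) as resp:
                return served_model_ids(json.load(resp))
        except OSError:
            # Connection refused or timed out: the server is not ready yet.
            time.sleep(poll_s)
    raise TimeoutError(f"no response from {base_url} within {timeout_s} s")
```

If the server is up, `wait_for_server()` should return a list that includes openai/gpt-oss-20b.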

Running

Ollama (one-shot prompt from the shell):

ollama run gpt-oss:20b "Explain mixture-of-experts routing in one paragraph."

vLLM (HTTP):

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "openai/gpt-oss-20b",
    "messages": [{"role": "user", "content": "Explain mixture-of-experts routing in one paragraph."}]
  }'
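The same request can be issued from Python using only the standard library. This is a minimal sketch that mirrors the curl call above. The helper names are illustrative, and it assumes the server is on localhost:8000 and returns the usual Chat Completions shape, with the reply under choices[0].message.content.

```python
import json
import urllib.request


def build_payload(prompt: str, model: str = "openai/gpt-oss-20b") -> dict:
    """Build the same JSON body the curl example sends."""
    return {"model": model, "messages": [{"role": "user", "content": prompt}]}


def extract_reply(response: dict) -> str:
    """Pull the assistant text out of a Chat Completions response."""
    return response["choices"][0]["message"]["content"]


def chat_completion(
    prompt: str, base_url: str = "http://localhost:8000", timeout: float = 300.0
) -> str:
    """POST the prompt to /v1/chat/completions and return the reply text."""
    request = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(build_payload(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return extract_reply(json.load(resp))
```

Because the endpoint follows the OpenAI wire format, existing OpenAI-compatible client libraries pointed at this base URL are another option.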

Either path keeps all 21B parameters resident in VRAM. The 3.6B "active" figure from the HF card is a compute-per-token number (which experts the router fires on each forward pass) — it does not mean only 3.6B parameters live in memory. All experts must be loadable on demand.

Results

  • Generation speed: 147.5 tokens/s at 4K context (MXFP4, CUDA + -fa 1), measured on RTX 3090 by Hardware Corner's gpu-llm-benchmarks. Drops to 128.5 tok/s at 16K, 112.6 tok/s at 32K, 89.6 tok/s at 64K, 62.2 tok/s at 128K.
  • Prefill speed: 4400.3 tokens/s at 4K context on the same RTX 3090 row of the Hardware Corner table.
  • VRAM usage: Fits within the HF card's stated 16 GB minimum; on the 24 GB RTX 3090 you keep ~8 GB of headroom for context KV-cache as you push beyond 4K.
  • Quality notes: This is OpenAI's official MXFP4-native release, not a community post-quantization — there are no public BF16 weights to compare against. Output quality is the reference quality.
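To put the context-length drop in perspective, the snippet below turns the generation figures quoted above into fractions of the 4K rate and into wall-clock time for a reply. The numbers are copied from the Results list; the 500-token reply length is an arbitrary example, not a measured workload.

```python
# Generation speed in tok/s, keyed by context length in K tokens
# (values copied from the Results list above).
GEN_TOK_PER_S = {4: 147.5, 16: 128.5, 32: 112.6, 64: 89.6, 128: 62.2}


def relative_to_4k(rates: dict[int, float]) -> dict[int, float]:
    """Express each rate as a fraction of the 4K-context rate."""
    return {ctx: rate / rates[4] for ctx, rate in rates.items()}


def seconds_for(n_tokens: int, tok_per_s: float) -> float:
    """Time to generate n_tokens at a steady rate."""
    return n_tokens / tok_per_s


for ctx_k, fraction in relative_to_4k(GEN_TOK_PER_S).items():
    reply_s = seconds_for(500, GEN_TOK_PER_S[ctx_k])  # 500 tokens: arbitrary example
    print(f"{ctx_k:>3}K context: {fraction:.0%} of 4K rate, 500 tokens in {reply_s:.1f} s")
```

At 128K context the rate is a little over 40% of the 4K figure, so the same reply takes more than twice as long.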

For the full benchmark data and side-by-side compare across cards, see /check/gpt-oss-20b/rtx-3090.

Troubleshooting

"All 21B parameters must fit, not just 3.6B"

The HF card markets the model as "21B total parameters, 3.6B active per token". All 21B must be resident in VRAM because the MoE router decides which experts to use per-token at inference time. The 3.6B is a FLOPs-per-token number, not a VRAM number. On the RTX 3090 this is comfortably under the 24 GB ceiling — the 16 GB stated minimum accounts for all 21B in MXFP4 plus working set.
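As a rough sanity check on that claim, the arithmetic below estimates weight storage under the assumption that MXFP4 costs 4 bits per element plus one 8-bit scale per 32-element block, or 4.25 bits per weight. Treating all 21B parameters as MXFP4 is a simplification that gives a lower bound; the ~14 GB download is larger, presumably because some tensors are stored at higher precision. KV-cache and activations are not included.

```python
def mxfp4_weight_bytes(
    n_params: float,
    block_size: int = 32,   # assumption: elements per shared scale
    element_bits: int = 4,  # assumption: bits per quantized element
    scale_bits: int = 8,    # assumption: bits per shared scale
) -> float:
    """Estimate storage for n_params weights if all were stored as MXFP4."""
    bits_per_weight = element_bits + scale_bits / block_size
    return n_params * bits_per_weight / 8


TOTAL_PARAMS = 21e9    # must be resident in VRAM
ACTIVE_PARAMS = 3.6e9  # touched per generated token

resident_gb = mxfp4_weight_bytes(TOTAL_PARAMS) / 1e9
active_gb = mxfp4_weight_bytes(ACTIVE_PARAMS) / 1e9
print(f"resident weights (lower bound): {resident_gb:.1f} GB")
print(f"weights read per token (approx.): {active_gb:.1f} GB")
```

Under these assumptions the resident figure comes to roughly 11.2 GB against roughly 1.9 GB read per token, which illustrates why the VRAM requirement tracks the 21B total rather than the 3.6B active count.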

vLLM install fails with dependency resolution errors

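Both --extra-index-url lines in the Path B install command are mandatory (per HF card). The vllm==0.10.1+gptoss build is served from wheels.vllm.ai/gpt-oss/ and depends on a PyTorch nightly served from download.pytorch.org/whl/nightly/cu128, not the stable channel. Drop either flag and uv pip cannot resolve the install: without the first index the +gptoss build is missing, and without the second there is no compatible torch.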
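After an install attempt, it can help to see which versions actually landed in the environment. The sketch below reads them with importlib.metadata. The helper names are hypothetical, and the nightly check relies on the assumption that PyTorch nightly wheels carry a ".dev" date segment in their version string.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a package, or None if it is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


def looks_like_nightly(torch_version: str) -> bool:
    """Heuristic: nightly builds include a '.dev' segment in the version."""
    return ".dev" in torch_version


def report() -> dict[str, Optional[str]]:
    """Collect the versions of the two packages the install depends on."""
    return {name: installed_version(name) for name in ("vllm", "torch")}
```

Running `report()` inside the same virtual environment shows whether vllm resolved to the +gptoss build and whether torch came from the nightly channel.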

Generation slower than expected for the GPU class

Two checks: (a) confirm FlashAttention is active — Ollama enables it by default, and the Hardware Corner RTX 3090 numbers carry the -fa 1 flag explicitly; (b) confirm you are at small context. LLM token generation is memory-bandwidth-bound, so as KV-cache grows past 4K the per-token rate drops mechanically — the Hardware Corner table shows 147.5 → 62.2 tok/s walking from 4K to 128K context on the same hardware.
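To compare your own rate against the published figure, one option is to read the timing fields Ollama returns from its /api/generate endpoint. The sketch assumes Ollama is listening on its default port 11434 and that the non-streaming response carries eval_count (generated tokens) and eval_duration (nanoseconds). The helper names are illustrative.

```python
import json
import urllib.request

# Assumption: Ollama is listening on its default local address.
OLLAMA_GENERATE_URL = "http://localhost:11434/api/generate"


def tokens_per_second(response: dict) -> float:
    """Generation rate from Ollama's eval_count and eval_duration (ns)."""
    return response["eval_count"] / (response["eval_duration"] / 1e9)


def measure(
    prompt: str,
    model: str = "gpt-oss:20b",
    url: str = OLLAMA_GENERATE_URL,
    timeout: float = 600.0,
) -> float:
    """Run one non-streaming generation and return its tok/s."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode("utf-8")
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return tokens_per_second(json.load(resp))
```

A short prompt with a short reply measures the small-context rate, so it is only loosely comparable to the 4K-context benchmark figure; a different runtime or settings will also shift the number.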

Sub-16 GB cards

Not supported. MXFP4 is already the smallest official footprint OpenAI published; there is no smaller quant tier in the canonical release. Cards below 16 GB should look at smaller siblings (Qwen3-8B, Llama 3.1 8B, etc.) on the /check/ listings.