How much VRAM does gpt-oss 20B need?

About 13 GB for the MXFP4 weights, the minimum this recipe targets; the model card quotes 16 GB of memory as the practical envelope.

How hard is this setup?

Beginner: follow the steps below.

gpt-oss 20B on RX 7900 XTX: MXFP4 chat at ~119 tok/s via Ollama or llama.cpp-HIP

What You'll Build

A local chat / completions endpoint running OpenAI's gpt-oss 20B, the open-weights 21B-parameter MoE model that ships natively quantized to MXFP4, on a 24 GB Radeon RX 7900 XTX (RDNA3, Navi 31, gfx1100) through the ROCm stack. The recipe covers the one-command path (Ollama), the full-control path (llama.cpp built with HIP), and an LM Studio GUI option. With 24 GB of VRAM, capacity is not the constraint: the MXFP4 weights are ~12–14 GB resident, leaving ample headroom for the model's 128K context window.

Hardware data: RX 7900 XTX (24 GB VRAM) · ~119 tok/s generation (community-reported, Ollama) · native MXFP4 · ROCm 7 · See benchmark data

⚠️ This is a ROCm recipe, not CUDA. The RX 7900 XTX runs on AMD's ROCm/HIP stack, so there is no cu124/cu128 wheel and no FlashAttention-2 prebuilt-wheel step here. For LLM inference the reliable path is GGUF via llama.cpp-HIP (or Ollama, which bundles llama.cpp). Do not follow a guide that tells you to pip install flash-attn, pick a cu12x wheel, or use ExLlamaV2/Marlin for this card: those instructions are written for NVIDIA/CUDA setups.

ℹ️ MXFP4 on RDNA3 — a Q4_0-class GGUF, not native FP4 acceleration. gpt-oss ships its MoE weights in MXFP4, a 4-bit floating-point format. That is the model's native release format, which is why the weights occupy only ~12–14 GB and fit any 24 GB card with room to spare. But RDNA3 has no FP4 (or FP8) tensor hardware — its WMMA units accept only FP16/BF16/INT8/INT4. So on the RX 7900 XTX the MXFP4 weights are simply run as a 4-bit GGUF by llama.cpp's HIP backend: per the official llama.cpp gpt-oss guide, "The gpt-oss models are natively "quantized". I.e. they are trained in the MXFP4 format which is roughly equivalent to ggml's Q4_0." You get the MXFP4 memory footprint and full model quality; you do not get any Blackwell-class FP4 tensor-core throughput, because no consumer AMD card has FP4 tensor cores. There is no separate BF16 weight set to fall back to — MXFP4 is the only official release format.

ℹ️ Mixture-of-Experts caveat — all 21B must be resident. Per the model card, gpt-oss 20B is a per-token sparse MoE with "21B parameters with 3.6B active parameters". The 3.6B figure is a compute / FLOPs number, not the VRAM number — the router picks experts per token at runtime, so all 21B parameters must stay resident in VRAM. The model fits comfortably here because OpenAI pre-quantized the MoE weights to MXFP4 (~0.5 bytes/parameter); at BF16 the same parameter count would need ~42 GB and would not fit a 24 GB card.
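
The footprint arithmetic above can be checked with a back-of-envelope sketch. The helper name weights_gb is hypothetical, and it assumes every parameter is stored at the same width. That is a simplification: only the MoE weights are MXFP4, which is one reason the real files (12.11 GB GGUF, 13.76 GB safetensors) come out larger than the 0.5 bytes/parameter estimate.

```shell
# Rough weight-size estimate in GB: parameters (in billions) x bytes per parameter.
# weights_gb is a hypothetical helper, not part of any tool in this recipe.
weights_gb() {
  awk -v params_b="$1" -v bytes_per_param="$2" \
    'BEGIN { printf "%.1f\n", params_b * bytes_per_param }'
}

weights_gb 21 0.5   # MXFP4 at ~0.5 bytes/parameter -> prints 10.5
weights_gb 21 2     # BF16 at 2 bytes/parameter     -> prints 42.0
```

The 42 GB BF16 figure is the one that rules out a 24 GB card; the MXFP4 estimate is a lower bound rather than the on-disk size.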

Requirements

Component	Minimum	Tested
GPU	16 GB VRAM (per HF card: "the `gpt-oss-20b` model run within 16GB of memory")	RX 7900 XTX (24 GB)
RAM	16 GB system	—
Storage	12.11 GB (official GGUF) or 13.76 GB (HF safetensors)	per the HF tree API / ggml-org GGUF
Driver	AMD ROCm v7 (installed via `amdgpu-install`) on Linux	—
Runtime	Ollama / llama.cpp (HIP build) / LM Studio	—
License	Apache-2.0 (HF card)	—

The model is released under Apache 2.0 — commercial use is permitted with no copyleft restrictions. gpt-oss was trained on the harmony response format and "should only be used with the harmony format as it will not work correctly otherwise" — Ollama, llama.cpp (with --jinja), and LM Studio all apply this chat template for you, so you only need to handle harmony manually on the raw Transformers path.

Installation

Prerequisite — install the AMD ROCm v7 driver

The RX 7900 XTX (gfx1100) is an officially ROCm-supported GPU, but ROCm is not bundled with Ollama or the llama.cpp release binaries, so you install it once at the OS level. Per the Ollama AMD GPU docs, Ollama requires the AMD ROCm v7 driver on Linux, installed or upgraded with the amdgpu-install utility. On Ubuntu 24.04 (Noble), install ROCm 7.2.1 via the standard amdgpu-install flow (AMD's Radeon ROCm install docs cover the current packages; the .deb URL below was confirmed live with an HTTP HEAD request):

# 1. Add the amdgpu-install package and install ROCm
wget https://repo.radeon.com/amdgpu-install/7.2.1/ubuntu/noble/amdgpu-install_7.2.1.70201-1_all.deb
sudo apt install ./amdgpu-install_7.2.1.70201-1_all.deb
sudo apt update
sudo amdgpu-install -y --usecase=graphics,rocm

# 2. Add yourself to the render/video groups (log out/in afterward)
sudo usermod -a -G render,video $LOGNAME

The RX 7900 XTX is on Ollama's supported AMD Radeon RX list and gfx1100 is in its supported LLVM-target list — so no HSA_OVERRIDE_GFX_VERSION masquerade is needed for this card (that override is only for cards ROCm doesn't ship kernels for).
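
After logging back in, a quick sanity check along these lines can confirm the two things the prerequisite step sets up: group membership and a gfx1100 agent visible to ROCm. This is a minimal sketch, not an official verification procedure. The function names are hypothetical; rocminfo is the agent-listing tool that ships with ROCm, and the live check is skipped when it is not installed.

```shell
# has_gpu_groups "GROUP LIST": succeed only if both render and video are present.
has_gpu_groups() {
  case " $1 " in *" render "*) ;; *) return 1 ;; esac
  case " $1 " in *" video "*) ;; *) return 1 ;; esac
}

# mentions_gfx1100: succeed if rocminfo-style text on stdin names the gfx1100 target.
mentions_gfx1100() { grep -q 'gfx1100'; }

# Live check, only attempted when rocminfo is on PATH.
if command -v rocminfo >/dev/null 2>&1; then
  rocminfo | mentions_gfx1100 && echo "gfx1100 visible to ROCm"
  has_gpu_groups "$(id -nG)" && echo "user is in render and video"
fi
```

If either message is missing, revisit the amdgpu-install or usermod step before moving on to Ollama or llama.cpp.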

Option A — Ollama (recommended)

1. Install Ollama

# Linux
curl -fsSL https://ollama.com/install.sh | sh

Per the Ollama AMD preview blog, all of Ollama's features can be accelerated by AMD graphics cards on Linux and Windows, with the RX 7900 XTX named in its supported-card list. Ollama detects the ROCm runtime installed in the prerequisite step.

2. Pull and run the 20B model

ollama pull gpt-oss:20b
ollama run gpt-oss:20b

This downloads the ~14 GB MXFP4 weights into Ollama's model store (the official Ollama listing shows gpt-oss:20b at 14 GB with a 128K context window) and drops you into an interactive chat. Ollama applies the harmony chat template automatically.

Option B — llama.cpp built with HIP/ROCm

For full control over context length and runtime flags, build llama.cpp against HIP and target the gfx1100 architecture directly, then run the official MXFP4 GGUF.

1. Build llama.cpp with the HIP backend

Per the llama.cpp build docs, the Linux HIP build for an RDNA3 card like the RX 7900 XTX is:

git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
HIPCXX="$(hipconfig -l)/clang" HIP_PATH="$(hipconfig -R)" \
    cmake -S . -B build -DGGML_HIP=ON -DGPU_TARGETS=gfx1100 -DCMAKE_BUILD_TYPE=Release \
    && cmake --build build --config Release -- -j 16

-DGGML_HIP=ON selects the ROCm backend; -DGPU_TARGETS=gfx1100 pins the kernels to the 7900 XTX's architecture (the build docs use gfx1100 as the explicit example for the "Radeon RX 7900XTX").

2. Run the official MXFP4 GGUF

The llama.cpp team publishes the native MXFP4 weights as a single GGUF at ggml-org/gpt-oss-20b-GGUF (gpt-oss-20b-mxfp4.gguf, 12.11 GB). Per the official llama.cpp gpt-oss guide, the recommended server command for a card with ample VRAM is:

# OpenAI-compatible local server with web UI; --jinja applies the harmony template
./build/bin/llama-server -hf ggml-org/gpt-oss-20b-GGUF --ctx-size 0 --jinja -ub 2048 -b 2048

--ctx-size 0 uses the model's full trained context; --jinja is what applies the required harmony chat template. On the 24 GB 7900 XTX you do not need the guide's --n-cpu-moe expert-offload flag (that is for memory-constrained cards) — keep all MoE layers on the GPU.

Option C — LM Studio (GUI)

LM Studio ships a ROCm runtime backend and a one-click install path. Search "gpt-oss 20B" inside the app and pick the MXFP4 GGUF build; LM Studio applies the harmony template and offers an OpenAI-compatible local server. On the 24 GB 7900 XTX the full weights load with room for the long context window.

Running

One-shot prompt via Ollama

ollama run gpt-oss:20b "Explain mixture-of-experts routing in one paragraph."

First run loads the ~12–14 GB of MXFP4 weights into VRAM; subsequent runs skip that load for as long as Ollama keeps the model resident (five minutes after the last request by default). Watch GPU activity in another terminal with rocm-smi to confirm the card is doing the work.

OpenAI-compatible HTTP API

# Ollama exposes localhost:11434 by default
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-oss:20b",
    "messages": [{"role": "user", "content": "Write a haiku about a Radeon GPU."}]
  }'

With the llama.cpp-HIP server from Option B, the same OpenAI-compatible API is served on http://localhost:8080/v1 by default. Both runtimes apply the harmony format internally, so you send plain chat messages.
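
Since the two runtimes differ only in their default port, one small wrapper can target either. The sketch below is illustrative: chat_url and ask are hypothetical helper names, and reusing the "gpt-oss:20b" model string against llama-server is an assumption (how that field is handled depends on how the server was started).

```shell
# chat_url RUNTIME: print the default chat-completions URL for a runtime.
chat_url() {
  case "$1" in
    ollama)    echo "http://localhost:11434/v1/chat/completions" ;;
    llama.cpp) echo "http://localhost:8080/v1/chat/completions" ;;
    *)         echo "unknown runtime: $1" >&2; return 1 ;;
  esac
}

# ask RUNTIME PROMPT: send one user message. PROMPT is pasted into the JSON
# unescaped, so keep it free of double quotes and backslashes in this sketch.
ask() {
  url="$(chat_url "$1")" || return 1
  curl -s "$url" \
    -H "Content-Type: application/json" \
    -d "{\"model\": \"gpt-oss:20b\", \"messages\": [{\"role\": \"user\", \"content\": \"$2\"}]}"
}

# Usage (needs the matching server running):
#   ask ollama "Write a haiku about a Radeon GPU."
#   ask llama.cpp "Write a haiku about a Radeon GPU."
```

For prompts with arbitrary text, build the request body with a JSON-aware tool instead of string interpolation.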

Results

Speed: A community user running Ollama 0.21.0 on an RX 7900 XTX 24 GB (Ryzen 9 7950X3D, ROCm backend) reported 119.30 tokens/s generation and 3322.43 tok/s prompt-eval for gpt-oss:20b in ollama issue #15771. The figure is a side-comparison datapoint in a report about a different (Qwen3.6 MoE) model being slow, posted by the issue author rather than an Ollama maintainer, so treat it as a single community-reported measurement and not a first-party benchmark. The official llama.cpp gpt-oss guide separately records ~102 tok/s on the smaller RX 7900 XT (20 GB), a different GPU listed here only to show that the figures are in the same ballpark, not as an XTX number. No backend benchmark exists for this pair yet (/check/gpt-oss-20b/rx-7900-xtx returns verdict: unknown). If you've measured gpt-oss 20B tok/s on a 7900 XTX, please contribute it so it lands on the check page.
VRAM usage: The MXFP4 weights are 12.11 GB on disk for the official ggml-org/gpt-oss-20b-GGUF and 13.76 GB for the HF safetensors; the model card frames the deployment envelope as "the gpt-oss-20b model run within 16GB of memory". On the 24 GB 7900 XTX that leaves roughly 10 GB free for a large KV cache — the inverse of the 16 GB-class squeeze, so you can run the full 128K context without expert-offload tricks. See /check/gpt-oss-20b/rx-7900-xtx for any community-submitted peak.
Quality notes: MXFP4 is the native release format (not an after-the-fact community quant), so there is no separate full-precision weight set to compare against — per the llama.cpp guide the MXFP4 weights "get to keep their full quality" versus a re-quantized Q4_0. gpt-oss 20B is post-trained for reasoning and tool use; it must be used with the harmony chat template, which all three runtimes above apply for you.
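
The "roughly 10 GB free" figure in the VRAM usage note follows from simple subtraction, sketched below with a hypothetical headroom_gb helper. It deliberately ignores runtime overhead such as compute buffers, so treat the results as upper bounds on what is left for the KV cache.

```shell
# headroom_gb VRAM_GB WEIGHTS_GB: VRAM left after the weights, in GB.
# Hypothetical helper; it does not account for compute buffers or driver overhead.
headroom_gb() {
  awk -v vram="$1" -v weights="$2" 'BEGIN { printf "%.2f\n", vram - weights }'
}

headroom_gb 24 13.76   # HF safetensors size   -> prints 10.24
headroom_gb 24 12.11   # official GGUF size    -> prints 11.89
```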

For the full benchmark data, see /check/gpt-oss-20b/rx-7900-xtx.

Troubleshooting

Ollama runs on the CPU instead of the GPU

Confirm the ROCm v7 driver is installed (rocm-smi should list the 7900 XTX) and that your user is in the render and video groups (groups should show both — log out and back in after the usermod step). Per the Ollama AMD GPU docs, ROCm is a separate install from Ollama; if it's missing, Ollama silently falls back to CPU. The RX 7900 XTX (gfx1100) is natively supported, so you should not need HSA_OVERRIDE_GFX_VERSION — only unsupported cards need that masquerade.
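
One way to tell whether a loaded model actually landed on the GPU is the PROCESSOR column of ollama ps, which reads "100% GPU" when the model is fully GPU-resident. The sketch below wraps that check; fully_on_gpu is a hypothetical helper name, and the exact column layout is assumed to match current Ollama releases.

```shell
# fully_on_gpu: succeed if `ollama ps`-style output on stdin reports a model
# as entirely GPU-resident ("100% GPU" in the PROCESSOR column).
fully_on_gpu() { grep -q '100% GPU'; }

# Usage while a model is loaded (for example right after an `ollama run`):
#   ollama ps | fully_on_gpu && echo "running on the GPU" || echo "CPU or split"
```

A CPU or split result on this card points back at the driver and group checks described above.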

Token generation feels slower than expected — try the Vulkan backend

On RDNA3 the ROCm/HIP backend can be slower at token generation than llama.cpp's Vulkan backend. Per llama.cpp issue #20934, on the RX 7900 XTX (gfx1100) Vulkan (RADV) reached ~167–177 tok/s on Llama 7B Q4_0 while ROCm landed at ~129–144 tok/s across ROCm 6.4.4–7.x. If your generation rate disappoints under ROCm, build llama.cpp with -DGGML_VULKAN=ON instead of -DGGML_HIP=ON and re-benchmark with llama-bench — Vulkan often wins for pure generation on this card.
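
A side-by-side comparison could be scripted roughly as follows. This is a sketch under assumptions: build/ is the HIP tree from Option B, build-vulkan/ is a second tree you configured with -DGGML_VULKAN=ON, and the model path is whatever local GGUF you downloaded. bench_backend is a hypothetical wrapper; -m, -p and -n are llama-bench's model, prompt-length and generation-length options.

```shell
# bench_backend BUILD_DIR MODEL: run llama-bench from one build tree.
bench_backend() {
  bin="$1/bin/llama-bench"
  if [ ! -x "$bin" ]; then
    echo "llama-bench not found in $1 (build that backend first)" >&2
    return 1
  fi
  # -p 512 measures prompt processing, -n 128 measures token generation.
  "$bin" -m "$2" -p 512 -n 128
}

# Usage (paths are assumptions, adjust to your checkout):
#   MODEL=/path/to/gpt-oss-20b-mxfp4.gguf
#   bench_backend build "$MODEL"
#   bench_backend build-vulkan "$MODEL"
```

Compare the token-generation rows of the two runs; that is the number the issue above reports as favoring Vulkan.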

Model output is garbled or ignores the system prompt

gpt-oss only works correctly with the harmony response format. Per the model card, the models "should only be used with the harmony format as it will not work correctly otherwise." Ollama and LM Studio apply it automatically; with llama.cpp make sure you pass --jinja (as in the Option B command) so the embedded chat template is used. On the raw Transformers path you must apply the harmony template yourself via the openai-harmony package.

"Can I fit this on a 12 GB / 8 GB AMD card?"

The MXFP4 weights are ~12–14 GB, so a 24 GB 7900 XTX is comfortable, but neither a 12 GB nor an 8 GB card can hold the weights plus a KV cache fully resident. The official llama.cpp guide documents an --n-cpu-moe N flag that offloads N MoE layers to system RAM for memory-constrained GPUs. It is useful on smaller cards at a speed cost, but unnecessary on the 24 GB 7900 XTX, where all layers stay on the GPU.
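
For a smaller card, the Option B command can be extended with the offload flag. The sketch below assembles the argument list; server_args is a hypothetical helper, and the value 8 in the usage line is only a placeholder starting point, not a figure from the guide.

```shell
# server_args N: print llama-server flags for the Option B command with
# N MoE layers offloaded to system RAM (N=0 keeps everything on the GPU).
server_args() {
  n="${1:-0}"
  case "$n" in
    ''|*[!0-9]*) echo "N must be a non-negative integer" >&2; return 1 ;;
  esac
  args="-hf ggml-org/gpt-oss-20b-GGUF --ctx-size 0 --jinja -ub 2048 -b 2048"
  if [ "$n" -gt 0 ]; then
    args="$args --n-cpu-moe $n"
  fi
  echo "$args"
}

# Usage (left unquoted on purpose so the flags split into separate words):
#   ./build/bin/llama-server $(server_args 8)
```

Raise N until the model loads without running out of VRAM, and expect generation to slow as more experts move to system RAM. On a memory-constrained card the full context implied by --ctx-size 0 may also need lowering.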