How much VRAM does gpt-oss 20B need?

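About 13 GB for the weights alone (12.11 GB for the official GGUF, 13.76 GB for the HF safetensors). The recipe targets a 16 GB card so the KV cache has room on top of that.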

How hard is this setup?

Beginner. Follow the steps below.

gpt-oss 20B on RX 7800 XT: MXFP4 chat in 16 GB via Ollama or llama.cpp-HIP

What You'll Build

A local chat / completions endpoint running OpenAI's gpt-oss 20B — the open-weights 21B-parameter MoE model that ships natively quantized to MXFP4 — on a 16 GB Radeon RX 7800 XT (RDNA3, Navi 32, gfx1101) through the ROCm stack. The recipe covers the one-command path (Ollama), the full-control path (llama.cpp built with HIP), and an LM Studio GUI option. The MXFP4 weights are ~12–14 GB resident, so on a 16 GB card the model fits — but it is a tight fit: budget your context length, and reach for llama.cpp's expert-offload flag if a long KV cache pushes you over the edge.

Hardware data: RX 7800 XT (16 GB VRAM) · native MXFP4 · ROCm 7 · See benchmark data

⚠️ This is a ROCm recipe, not CUDA. The RX 7800 XT runs on AMD's ROCm/HIP stack — there is no cu124/cu128 wheel and no FlashAttention-2 prebuilt-wheel step here. For LLM inference the reliable path is GGUF via llama.cpp-HIP (or Ollama, which bundles llama.cpp). Do not follow a guide that tells you to pip install flash-attn, pick a cu12x wheel, or use ExLlamaV2/Marlin for this card — those are NVIDIA-only.

ℹ️ MXFP4 on RDNA3 — a Q4_0-class GGUF, not native FP4 acceleration. gpt-oss ships its MoE weights in MXFP4, a 4-bit floating-point format. That is the model's native release format, which is why the weights occupy only ~12–14 GB and fit a 16 GB card at all. But RDNA3 has no FP4 (or FP8) tensor hardware — its WMMA units accept only FP16/BF16/INT8/INT4. So on the RX 7800 XT the MXFP4 weights are simply run as a 4-bit GGUF by llama.cpp's HIP backend: per the official llama.cpp gpt-oss guide, "The gpt-oss models are natively "quantized". I.e. they are trained in the MXFP4 format which is roughly equivalent to ggml's Q4_0." You get the MXFP4 memory footprint and full model quality; you do not get any Blackwell-class FP4 tensor-core throughput, because no consumer AMD card has FP4 tensor cores. There is no separate BF16 weight set to fall back to — MXFP4 is the only official release format.

ℹ️ Mixture-of-Experts caveat — all 21B must be resident. Per the model card, gpt-oss 20B is a per-token sparse MoE with "21B parameters with 3.6B active parameters". The 3.6B figure is a compute / FLOPs number, not the VRAM number — the router picks experts per token at runtime, so all 21B parameters must stay resident in VRAM. The model fits on a 16 GB card because OpenAI pre-quantized the MoE weights to MXFP4 (~0.5 bytes/parameter); at BF16 the same parameter count would need ~42 GB and would not fit any consumer card.
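The footprint arithmetic in this caveat can be checked with a quick calculation. This is a rough sketch, not a measurement: `estimate_gb` is a helper name made up for this example, it works in decimal gigabytes, and it treats every parameter as having the same width. The 4.25 bits/parameter line assumes MXFP4's 32-element blocks with one shared 8-bit scale. Real files come out larger (12.11 GB for the official GGUF), presumably because only the MoE weights are stored in MXFP4.

```shell
# Hypothetical helper: estimate weight size in decimal GB.
# $1 = parameter count in billions, $2 = bits per parameter.
estimate_gb() {
    LC_ALL=C awk -v params="$1" -v bits="$2" \
        'BEGIN { printf "%.1f\n", params * bits / 8 }'
}

echo "BF16  (16 bits/param):   $(estimate_gb 21 16) GB"
echo "4-bit (0.5 bytes/param): $(estimate_gb 21 4) GB"
echo "MXFP4 (4.25 bits/param): $(estimate_gb 21 4.25) GB"
```

The first line reproduces the ~42 GB BF16 figure; the other two give a lower bound on what a 4-bit release can occupy.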

Requirements

Component	Minimum	Tested
GPU	16 GB VRAM (per HF card: "the `gpt-oss-20b` model run within 16GB of memory")	RX 7800 XT (16 GB)
RAM	16 GB system	—
Storage	12.11 GB (official GGUF) or 13.76 GB (HF safetensors)	per the HF tree API / ggml-org GGUF
Driver	AMD ROCm v7 (installed via `amdgpu-install`) on Linux	—
Runtime	Ollama / llama.cpp (HIP build) / LM Studio	—
License	Apache-2.0 (HF card)	—

The model is released under Apache 2.0 — commercial use is permitted with no copyleft restrictions. gpt-oss was trained on the harmony response format and "should only be used with the harmony format as it will not work correctly otherwise" — Ollama, llama.cpp (with --jinja), and LM Studio all apply this chat template for you, so you only need to handle harmony manually on the raw Transformers path.

ℹ️ 16 GB is a tight fit — prefer the 12.11 GB GGUF. With only 16 GB of VRAM, lead with the official 12.11 GB MXFP4 GGUF (ggml-org/gpt-oss-20b-GGUF) rather than the 13.76 GB HF safetensors set: the smaller on-disk footprint leaves a little more room for the KV cache. Once the ~12–13 GB of weights are resident, the remaining headroom is consumed by the context window's KV cache, so on this card you trim the context (or offload a few MoE layers — see Option B) rather than running the full 128K context the way a 24 GB card could.

Installation

Prerequisite — install the AMD ROCm v7 driver

The RX 7800 XT is an officially ROCm-supported GPU: AMD's ROCm Linux system-requirements matrix lists it with LLVM target gfx1101. ROCm is not bundled with Ollama or the llama.cpp release binaries, however, so you install it once at the OS level. Per the Ollama AMD GPU docs, Ollama requires the AMD ROCm v7 driver on Linux, installed or upgraded with the amdgpu-install utility. On Ubuntu 24.04 (Noble), install ROCm 7.2.1 via the standard amdgpu-install flow (AMD's Radeon ROCm install docs cover the current packages):

# 1. Add the amdgpu-install package and install ROCm
wget https://repo.radeon.com/amdgpu-install/7.2.1/ubuntu/noble/amdgpu-install_7.2.1.70201-1_all.deb
sudo apt install ./amdgpu-install_7.2.1.70201-1_all.deb
sudo apt update
sudo amdgpu-install -y --usecase=graphics,rocm

# 2. Add yourself to the render/video groups (log out/in afterward)
sudo usermod -a -G render,video $LOGNAME

Because the RX 7800 XT is on the supported-GPU matrix as gfx1101, you should not normally need an HSA_OVERRIDE_GFX_VERSION masquerade. If a tool ships only gfx1100 kernels and refuses to start on your card, the documented Linux fallback is to export HSA_OVERRIDE_GFX_VERSION=11.0.0 so the gfx1101 card presents as gfx1100 — treat that as a fallback, not a default.
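Before reaching for the override, it may help to confirm which target the installed ROCm stack actually reports. The sketch below assumes `rocminfo` (shipped with ROCm) is on your PATH; `gfx_targets` is a helper name invented for this example. On a system with an integrated GPU you may see a second target listed alongside gfx1101.

```shell
# Extract the unique gfx targets from rocminfo-style text on stdin.
gfx_targets() {
    grep -oE 'gfx[0-9a-f]+' | sort -u
}

# Only query the hardware when rocminfo is actually installed.
if command -v rocminfo >/dev/null 2>&1; then
    rocminfo | gfx_targets
fi
```

If the output includes gfx1101, the card is being detected natively and no masquerade should be needed.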

Option A — Ollama (recommended)

1. Install Ollama

# Linux
curl -fsSL https://ollama.com/install.sh | sh

Per the Ollama AMD preview blog, all of Ollama's features can be accelerated by AMD graphics cards on Linux and Windows. Ollama detects the ROCm runtime installed in the prerequisite step and runs the gfx1101 card without any manual architecture flag.

2. Pull and run the 20B model

ollama pull gpt-oss:20b
ollama run gpt-oss:20b

This downloads the ~14 GB MXFP4 weights into Ollama's model store (the official Ollama listing shows gpt-oss:20b at 14 GB with a 128K context window) and drops you into an interactive chat. On a 16 GB card the weights load into VRAM with limited headroom — if Ollama spills layers to the CPU, shorten the context with /set parameter num_ctx 8192 (or lower) inside the chat to claw back KV-cache memory. Ollama applies the harmony chat template automatically.
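The /set command only lasts for the current chat session. One way to make a smaller context stick is an Ollama Modelfile that derives a variant from the pulled model. This is a minimal sketch: the variant name gpt-oss-20b-8k and the file name are arbitrary choices for this example, and 8192 is simply the value suggested above, not a tuned figure.

```shell
# Write a Modelfile that derives a variant with a smaller context window.
# The file name and the model name "gpt-oss-20b-8k" are arbitrary examples.
cat > Modelfile.gpt-oss-8k <<'EOF'
FROM gpt-oss:20b
PARAMETER num_ctx 8192
EOF

# Register the variant only if Ollama is installed.
if command -v ollama >/dev/null 2>&1; then
    ollama create gpt-oss-20b-8k -f Modelfile.gpt-oss-8k \
        || echo "ollama create failed; is the Ollama server running?"
fi
```

After that, `ollama run gpt-oss-20b-8k` starts with the capped context, and `ollama ps` should show how the loaded model is split between CPU and GPU.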

Option B — llama.cpp built with HIP/ROCm

For full control over context length and expert-offload, build llama.cpp against HIP and target the gfx1101 architecture directly, then run the official MXFP4 GGUF.

1. Build llama.cpp with the HIP backend

Per the llama.cpp build docs, the Linux HIP build pattern is the same as for any RDNA3 card — only the GPU_TARGETS value changes. For the RX 7800 XT, pin it to gfx1101 (the card's LLVM target per the AMD ROCm system-requirements matrix, which lists gfx1101 among its supported llama.cpp build targets):

git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
HIPCXX="$(hipconfig -l)/clang" HIP_PATH="$(hipconfig -R)" \
    cmake -S . -B build -DGGML_HIP=ON -DGPU_TARGETS=gfx1101 -DCMAKE_BUILD_TYPE=Release \
    && cmake --build build --config Release -- -j 16

-DGGML_HIP=ON selects the ROCm backend; -DGPU_TARGETS=gfx1101 pins the kernels to the RX 7800 XT's architecture (Navi 32). Do not copy a gfx1100 value from a 7900-series guide — that is the wrong target for this card.

2. Run the official MXFP4 GGUF

The llama.cpp team publishes the native MXFP4 weights as a single GGUF at ggml-org/gpt-oss-20b-GGUF (gpt-oss-20b-mxfp4.gguf, 12.11 GB). Per the official llama.cpp gpt-oss guide, the baseline server command is:

# OpenAI-compatible local server with web UI; --jinja applies the harmony template
./build/bin/llama-server -hf ggml-org/gpt-oss-20b-GGUF --ctx-size 0 --jinja -ub 2048 -b 2048

--jinja is what applies the required harmony chat template. On a 16 GB card, --ctx-size 0 (full trained context) may not fit alongside the resident weights — start with a smaller window, e.g. --ctx-size 16384. If the full weight set plus your context still overruns VRAM, the same guide documents an --n-cpu-moe N flag that offloads N MoE layers to system RAM:

# Tight-VRAM variant: cap context and offload a few MoE layers to system RAM
./build/bin/llama-server -hf ggml-org/gpt-oss-20b-GGUF --ctx-size 16384 --jinja -ub 2048 -b 2048 --n-cpu-moe 8

Increase --n-cpu-moe until the model loads without spilling; each offloaded layer trades a little speed for VRAM headroom. This expert-offload knob is the 16 GB-tier safety valve that a 24 GB card never needs.
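The "raise N until it loads" loop can be automated. The sketch below is generic and makes no claim about llama.cpp's own tooling: `smallest_fitting_n` is a hypothetical helper that calls a probe command of your choosing with N appended, and the probe must exit non-zero when the model fails to load. Here a stand-in probe is used so the logic can be run without a GPU, and the upper bound of 24 is an assumption; use your model's layer count.

```shell
# Hypothetical helper: find the smallest N in 0..MAX for which the probe
# command succeeds when N is appended as its final argument.
# $1 = MAX, remaining arguments = probe command.
smallest_fitting_n() {
    max="$1"; shift
    n=0
    while [ "$n" -le "$max" ]; do
        if "$@" "$n" >/dev/null 2>&1; then
            echo "$n"
            return 0
        fi
        n=$((n + 1))
    done
    return 1
}

# Stand-in probe for demonstration only: pretends the model fits once at
# least 6 layers are offloaded. Replace it with a real load test.
fake_probe() { [ "$1" -ge 6 ]; }

smallest_fitting_n 24 fake_probe   # prints 6
```

A linear scan from 0 is used on purpose: it returns the lowest offload count that works, which is the one that costs the least speed.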

Option C — LM Studio (GUI)

LM Studio ships a ROCm runtime backend and a one-click install path. Search "gpt-oss 20B" inside the app and pick the MXFP4 GGUF build; LM Studio applies the harmony template and offers an OpenAI-compatible local server. On a 16 GB card, set a conservative context length in the model-load dialog and enable partial GPU offload if the app reports the weights don't fully fit — the same tight-fit guidance as the llama.cpp path applies.

Running

One-shot prompt via Ollama

ollama run gpt-oss:20b "Explain mixture-of-experts routing in one paragraph."

The first run loads the ~12–14 GB of MXFP4 weights into VRAM. Later runs skip that load for as long as Ollama keeps the model resident (by default it unloads an idle model after a few minutes). Watch GPU activity in another terminal with rocm-smi to confirm the card is doing the work and to keep an eye on how close VRAM usage sits to the 16 GB ceiling.
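To turn the rocm-smi readout into a single used/total figure, something like the following can work. It is a sketch that assumes the line format printed by `rocm-smi --showmeminfo vram` in recent ROCm releases (shown in the comment); `vram_summary` is a helper name made up here, and with several GPUs it reports only the last one listed.

```shell
# Parse `rocm-smi --showmeminfo vram`-style text from stdin and print
# used/total VRAM in GiB. Assumes lines of the form:
#   GPU[0] : VRAM Total Memory (B): <bytes>
#   GPU[0] : VRAM Total Used Memory (B): <bytes>
vram_summary() {
    LC_ALL=C awk -F': ' '
        /VRAM Total Memory \(B\)/      { total = $NF }
        /VRAM Total Used Memory \(B\)/ { used  = $NF }
        END {
            if (total > 0)
                printf "%.1f / %.1f GiB used\n", used / 2^30, total / 2^30
        }'
}

# Only query the hardware when rocm-smi is actually installed.
if command -v rocm-smi >/dev/null 2>&1; then
    rocm-smi --showmeminfo vram | vram_summary
fi
```

Run it while a prompt is generating; the gap between the two numbers is what remains for the KV cache.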

OpenAI-compatible HTTP API

# Ollama exposes localhost:11434 by default
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-oss:20b",
    "messages": [{"role": "user", "content": "Write a haiku about a Radeon GPU."}]
  }'

With the llama.cpp-HIP server from Option B, the same OpenAI-compatible API is served on http://localhost:8080/v1 by default. Both runtimes apply the harmony format internally, so you send plain chat messages.
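For the llama.cpp server, the equivalent request might look like the sketch below. The helper names `chat_payload` and `ask_llama` are invented for this example, the URL assumes the default port mentioned above, and the prompt is pasted into the JSON verbatim, so it must not contain double quotes or backslashes.

```shell
# Assumed default llama-server address from Option B; override if needed.
LLAMA_URL="${LLAMA_URL:-http://localhost:8080}"

# Build a minimal chat request body. The prompt is inserted verbatim, so this
# sketch assumes it contains no double quotes or backslashes.
chat_payload() {
    printf '{"messages": [{"role": "user", "content": "%s"}]}\n' "$1"
}

# Send one prompt to the llama.cpp server (defined only; call it yourself).
ask_llama() {
    curl -s "$LLAMA_URL/v1/chat/completions" \
        -H "Content-Type: application/json" \
        -d "$(chat_payload "$1")"
}
```

With the server running, `ask_llama "Write a haiku about a Radeon GPU."` returns the JSON response, and `curl -s "$LLAMA_URL/health"` is a quick way to check that the model has finished loading.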

Results

Speed: No first-party generation-rate measurement exists yet for gpt-oss 20B on the RX 7800 XT specifically, so we deliberately do not quote a tok/s figure here. Numbers measured on the larger RX 7900 XTX (24 GB, 960 GB/s memory bandwidth) do not transfer to the 7800 XT — this card has roughly 65% of the XTX's memory bandwidth (624 GB/s) and fewer WMMA units, and LLM token generation is memory-bandwidth-bound, so the 7800 XT will be materially slower. The backend has no benchmark for this pair yet (/check/gpt-oss-20b/rx-7800-xt returns verdict: unknown). If you've measured gpt-oss 20B tok/s on a 7800 XT, please contribute it so it lands on the check page.
VRAM usage: The MXFP4 weights are 12.11 GB on disk for the official ggml-org/gpt-oss-20b-GGUF and 13.76 GB for the HF safetensors; the model card frames the deployment envelope as "the gpt-oss-20b model run within 16GB of memory". On a 16 GB 7800 XT that envelope leaves only a few GB for the KV cache, so this is a workable-but-tight fit — cap the context length, prefer the 12.11 GB GGUF, and use --n-cpu-moe if a longer context overruns VRAM. See /check/gpt-oss-20b/rx-7800-xt for any community-submitted peak.
Quality notes: MXFP4 is the native release format (not an after-the-fact community quant), so there is no separate full-precision weight set to compare against — per the llama.cpp guide the MXFP4 weights "get to keep their full quality" versus a re-quantized Q4_0. gpt-oss 20B is post-trained for reasoning and tool use; it must be used with the harmony chat template, which all three runtimes above apply for you.

For the full benchmark data, see /check/gpt-oss-20b/rx-7800-xt.

Troubleshooting

Ollama runs on the CPU instead of the GPU

Confirm the ROCm v7 driver is installed (rocm-smi should list the 7800 XT) and that your user is in the render and video groups (groups should show both — log out and back in after the usermod step). Per the Ollama AMD GPU docs, ROCm is a separate install from Ollama; if it's missing, Ollama silently falls back to CPU. The RX 7800 XT (gfx1101) is on the supported-GPU matrix, so you should not normally need HSA_OVERRIDE_GFX_VERSION — reach for the HSA_OVERRIDE_GFX_VERSION=11.0.0 masquerade only if a specific tool ships gfx1100-only kernels and refuses to start.
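The group check can be scripted so a missing membership is obvious. A minimal sketch; `has_gpu_groups` is a helper name chosen for this example, and it only inspects the groups of the current login session.

```shell
# Succeeds only if the space-separated group list in $1 contains both
# "render" and "video" as whole words.
has_gpu_groups() {
    case " $1 " in *" render "*) ;; *) return 1 ;; esac
    case " $1 " in *" video "*)  ;; *) return 1 ;; esac
}

if has_gpu_groups "$(id -nG)"; then
    echo "ok: user is in render and video"
else
    echo "missing group: re-run the usermod step, then log out and back in"
fi
```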

Out of memory or layers spilling to CPU

16 GB is a tight envelope for a ~12–13 GB resident model plus a KV cache. If the model OOMs or you see generation crawl because layers fell back to system RAM: (1) prefer the 12.11 GB GGUF over the 13.76 GB safetensors; (2) cap the context — --ctx-size 16384 (llama.cpp) or /set parameter num_ctx 8192 (Ollama) — rather than the full 128K; (3) on llama.cpp, add --n-cpu-moe N to offload N MoE layers to system RAM, raising N until it loads. The official llama.cpp gpt-oss guide documents --n-cpu-moe as the expert-offload knob for memory-constrained GPUs.

Token generation feels slower than expected — try the Vulkan backend

On RDNA3 the ROCm/HIP backend can be slower at token generation than llama.cpp's Vulkan backend. Per llama.cpp issue #20934, measured on a 7900-series RDNA3 card, the Vulkan (RADV) backend outpaced ROCm for pure generation on a small Q4_0 model across ROCm 6.4.4–7.x. That is one report on a different card and model, so treat it as a lead rather than a guarantee: if your generation rate disappoints under ROCm on the 7800 XT, build llama.cpp with -DGGML_VULKAN=ON instead of -DGGML_HIP=ON and re-benchmark with llama-bench.
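A side-by-side comparison could be scripted roughly as follows. This is a sketch under assumptions: the Vulkan build lives in a separate build-vulkan directory (a name chosen here), the GGUF has been downloaded to the path in MODEL (also an assumption), and the prompt/generation lengths are just llama-bench's usual 512/128 test sizes.

```shell
# Vulkan build, run once from the llama.cpp checkout (needs the Vulkan SDK):
#   cmake -S . -B build-vulkan -DGGML_VULKAN=ON -DCMAKE_BUILD_TYPE=Release
#   cmake --build build-vulkan --config Release -- -j 16

# Path to a local copy of the MXFP4 GGUF; this location is an assumption.
MODEL="${MODEL:-$HOME/models/gpt-oss-20b-mxfp4.gguf}"

# Run llama-bench from one build directory, or report why it was skipped.
bench_backend() {
    bench="$1/bin/llama-bench"
    if [ -x "$bench" ] && [ -f "$MODEL" ]; then
        "$bench" -m "$MODEL" -p 512 -n 128
    else
        echo "skip: $bench or $MODEL not found"
    fi
}

bench_backend build          # HIP build from Option B
bench_backend build-vulkan   # Vulkan build from the comment above
```

Compare the generation (tg) rows of the two tables; the prompt-processing (pp) rows can favor a different backend than generation does.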

Model output is garbled or ignores the system prompt

gpt-oss only works correctly with the harmony response format. Per the model card, the models "should only be used with the harmony format as it will not work correctly otherwise." Ollama and LM Studio apply it automatically; with llama.cpp make sure you pass --jinja (as in the Option B command) so the embedded chat template is used. On the raw Transformers path you must apply the harmony template yourself via the openai-harmony package.