
gpt-oss 20B on RTX 5070: MXFP4 Chat in 12 GB via llama.cpp Expert Offload

llm · intermediate · 12GB+ VRAM · Jun 4, 2026
prerequisites
  • NVIDIA RTX 5070 (12 GB GDDR7) or any consumer card with 12 GB VRAM
  • Recent NVIDIA driver with CUDA 12.8 support (required for Blackwell sm_120)
  • At least 16 GB system RAM (the CPU-offloaded MoE experts live in host RAM)
  • ~13 GB free disk for the MXFP4 GGUF weights

What You'll Build

A local chat endpoint backed by OpenAI's open-weights gpt-oss-20b, a 21B-parameter mixture-of-experts LLM (3.6B active per token) shipped in native MXFP4 quantization, running on a 12 GB RTX 5070. Unlike on the 16 GB cards in our catalogue, the model does not fit on a 12 GB card with everything resident on the GPU: the MXFP4 GGUF is 11.27 GiB (12.11 GB) on disk, and the official llama.cpp memory table puts the full deployment at 14.9 GB at 8K context. The trick is llama.cpp's --n-cpu-moe flag, which keeps the expert weights of a few MoE layers in system RAM and leaves everything else (attention, KV cache, the remaining experts) on the GPU. On the 5070 only ~2 expert layers need to move off-GPU, so the throughput hit is small.
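The gap that expert offload has to close can be estimated with simple arithmetic. The sketch below is a rough illustration only: it uses the memory-table figures quoted in this recipe, ignores the VRAM a desktop session reserves, and `shortfall_gb` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Rough VRAM shortfall: fully-resident footprint minus card capacity.
# Inputs are the figures quoted in this recipe; treat the result as an estimate.
shortfall_gb() {
  local footprint_gb="$1" vram_gb="$2"
  awk -v f="$footprint_gb" -v v="$vram_gb" 'BEGIN { printf "%.1f\n", f - v }'
}

echo "8K context:  $(shortfall_gb 14.9 12) GB over a 12 GB card"
echo "32K context: $(shortfall_gb 15.5 12) GB over a 12 GB card"
```

That shortfall is the amount of expert weight that has to live in host RAM before the rest fits.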

Hardware data: RTX 5070 (12 GB GDDR7) · runs via llama.cpp expert offload (--n-cpu-moe) · See benchmark data

⚠️ This is a 12 GB recipe, not a 16 GB recipe. Per the HF card, the MXFP4 release is designed so gpt-oss-20b will "run within 16GB of memory". That is the 16 GB floor for fully-resident operation, and it is why a naive ollama run gpt-oss:20b (which tries to load everything on the GPU) spills to CPU or stutters on a 12 GB card. The path documented here deliberately offloads MoE experts so the model fits within the 5070's 12 GB envelope.

Requirements

Component | Minimum | Tested
GPU | 12 GB VRAM | RTX 5070 (12 GB GDDR7)
RAM | 16 GB system RAM (holds the CPU-offloaded experts) | 64 GB DDR5 in the cited 5070 run
Storage | ~13 GB for the MXFP4 GGUF (gpt-oss-20b-mxfp4.gguf is 12.11 GB per the HF tree API) |
Software | NVIDIA driver with CUDA 12.8+ (required for Blackwell sm_120); a recent llama.cpp release |
License | Apache 2.0 (HF card); commercial use permitted |

Installation

The documented sub-12-GB path is llama.cpp with expert CPU-offload. Ollama (which wraps llama.cpp) does not expose the --n-cpu-moe knob as cleanly, so for a 12 GB card use llama.cpp directly.

1. Install a recent llama.cpp build

Grab the latest release from the official releases page (prebuilt CUDA binaries are published per release), or build from source with CUDA enabled. You need a build new enough to support the --n-cpu-moe flag and Blackwell sm_120 — any current 2026 release qualifies. The CUDA backend requires the cu128 toolkit because CUDA 12.8 is the first version with native Blackwell (sm_120) kernels; older cu126 builds do not ship sm_120 support.
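Before downloading or building, it can help to check that the CUDA toolkit version you have clears the 12.8 floor. The sketch below is a minimal illustration: `cuda_ok_for_blackwell` is a hypothetical helper, and it assumes you pass in the release number yourself (for example, the one reported by nvcc --version).

```shell
#!/usr/bin/env bash
# Succeeds if a CUDA toolkit version string (e.g. "12.8") is at least 12.8,
# the minimum this recipe cites for Blackwell sm_120 kernels.
cuda_ok_for_blackwell() {
  local major minor
  IFS=. read -r major minor _ <<< "$1"
  minor="${minor:-0}"
  # 10# forces base-10 so a component like "08" is not read as octal.
  (( 10#$major > 12 || (10#$major == 12 && 10#$minor >= 8) ))
}

for v in 12.6 12.8 13.0; do
  if cuda_ok_for_blackwell "$v"; then
    echo "CUDA $v: ok"
  else
    echo "CUDA $v: too old for sm_120"
  fi
done
```

For a source build, llama.cpp's build documentation uses cmake -B build -DGGML_CUDA=ON followed by cmake --build build --config Release; check those flags against the checkout you are building.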

2. Run the server with MoE experts offloaded to CPU

The model auto-downloads on first launch via the -hf flag. The command below is the official llama.cpp guide's example for cards below 16 GB; the 5070-specific invocation is in Running:

# Full 128k context, with MoE layers kept on the CPU as needed.
llama-server -hf ggml-org/gpt-oss-20b-GGUF --ctx-size 0 --jinja -ub 2048 -b 2048 --n-cpu-moe 22

The --n-cpu-moe N flag sets how many MoE layers stay in system RAM; the rest are offloaded to the GPU. The value 22 is the official guide's example for an RTX 2060 8 GB, verbatim from the llama.cpp guide. A 12 GB card holds far more, so you lower this number substantially — see Running. First launch downloads gpt-oss-20b-mxfp4.gguf (12.11 GB) and starts a WebUI + an OpenAI-compatible API on localhost.
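To make the split concrete, the sketch below shows how a given N divides the model's expert layers between host RAM and the GPU. It assumes the 24-layer count this recipe quotes for gpt-oss-20b; `moe_split` is a hypothetical helper, not part of llama.cpp.

```shell
#!/usr/bin/env bash
# gpt-oss-20b has 24 expert (MoE) layers per this recipe; --n-cpu-moe N keeps
# N of them in system RAM and leaves the rest on the GPU.
TOTAL_MOE_LAYERS=24

moe_split() {
  local n="$1"
  if ! [[ "$n" =~ ^[0-9]+$ ]] || (( n > TOTAL_MOE_LAYERS )); then
    echo "n must be an integer between 0 and $TOTAL_MOE_LAYERS" >&2
    return 1
  fi
  echo "cpu=$n gpu=$(( TOTAL_MOE_LAYERS - n ))"
}

moe_split 22   # the 8 GB example value from the guide
moe_split 2    # the 12 GB value used in Running
```

The lower N is, the more expert weight stays in VRAM, which is why a 12 GB card can run with a much smaller value than an 8 GB one.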

Running

On a 12 GB RTX 5070 almost the entire model fits, so only ~2 expert layers need to leave the GPU. The configuration below is a community-reported RTX 5070 12 GB run from the llama.cpp guide thread:

llama-server -hf ggml-org/gpt-oss-20b-GGUF \
  --ctx-size 32768 --jinja -ub 2048 -b 2048 -ngl 99 -fa \
  --n-cpu-moe 2 --temp 1.0 --top-p 1.0 --top-k 0 --min-p 0.0

-ngl 99 offloads all layers to the GPU; --n-cpu-moe 2 then keeps the expert weights of two layers in host RAM; -fa enables Flash-Attention kernels. Once llama-server reports it is listening (default port 8080), it serves a WebUI at http://localhost:8080 and an OpenAI-compatible Chat Completions API. The model uses OpenAI's harmony response format; the --jinja flag tells llama.cpp to apply the chat template embedded in the GGUF, so you don't have to format it by hand. Query the API with:

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Explain mixture-of-experts routing in one paragraph."}]}'
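On first launch the server spends a while downloading and loading weights before it accepts requests, so a script may want to wait before sending the query above. The sketch below is one way to do that, assuming llama-server's /health endpoint is available in your build; the function name, default URL, and retry count are illustrative choices.

```shell
#!/usr/bin/env bash
# Poll llama-server's /health endpoint until it answers, or give up.
# The default URL and retry count are illustrative, not values from the guide.
wait_for_server() {
  local base_url="${1:-http://localhost:8080}" tries="${2:-60}" i
  for (( i = 1; i <= tries; i++ )); do
    if curl -sf --max-time 2 "$base_url/health" > /dev/null 2>&1; then
      echo "server ready after $i attempt(s)"
      return 0
    fi
    # Pause between attempts, but not after the last one.
    (( i < tries )) && sleep 1
  done
  echo "server not ready after $tries attempt(s)" >&2
  return 1
}

# Example (not run here): wait up to 60 s, then send the chat request.
# wait_for_server http://localhost:8080 60 && curl http://localhost:8080/v1/chat/completions ...
```

The function only reports readiness; it returns non-zero on timeout so a calling script can stop instead of sending requests to a server that never came up.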

ℹ️ Tight on a display card? A 12 GB desktop card with a monitor attached exposes only ~10.5–11.3 GB usable. A community RTX 3060 12 GB run reported it could not fit -ub 2048 at 32K context even with --n-cpu-moe 2, and had to drop -ub back to the default of 512, which left only ~300 MB of headroom. If you OOM at launch on the 5070, lower -ub/-b, raise --n-cpu-moe, or reduce --ctx-size.

Results

  • Speed: A community RTX 5070 12 GB run with --n-cpu-moe 2 at 32K context reports 62.21 tokens/s generation (llama.cpp guide thread, Ryzen 5 9600X + 64 GB DDR5). This is a single community measurement, not a first-party benchmark. Hardware Corner's RTX 5070 page does not yet carry a gpt-oss-20b row to corroborate it, so submit your own numbers via /contribute.
  • VRAM usage: The MXFP4 GGUF is 12.11 GB on disk (HF tree API), and the official llama.cpp memory table puts the fully-resident deployment at 14.9 GB at 8K / 15.5 GB at 32K — over the 12 GB envelope, which is exactly why expert offload is required. With --n-cpu-moe 2 the resident footprint drops under 12 GB; a community RTX 3060 12 GB run noted it left only ~300 MB spare at the default -ub. See /check/gpt-oss-20b/rtx-5070.
  • Throughput penalty: Offloading is not free, but on the 5070 it is small because only 2 of 24 expert layers stream over PCIe Gen5. For comparison, a community RTX 3060 12 GB run measured 30.6 tok/s steady-state (tg128) with experts on the CPU via llama-bench, rising to ~75 tok/s only when a small 2K context lets everything fit on the GPU with no offload (guide thread).

For the full benchmark data and cross-card compare, see /check/gpt-oss-20b/rtx-5070.

Troubleshooting

Out of memory at launch on a 12 GB card

The MXFP4 weights (12.11 GB on disk) plus compute buffers and KV cache exceed 12 GB if everything is resident; the official memory table lists 14.9 GB at 8K context. You must offload. If the Running command still OOMs (a display card exposes only ~10.5–11.3 GB usable), apply these in order: (a) raise --n-cpu-moe from 2 toward 24, the model's total number of expert layers; (b) drop -ub/-b from 2048 to 512, the -ub default, which is exactly what a community RTX 3060 12 GB run needed to fit 32K context; (c) reduce --ctx-size. Each step trades a little speed or context for headroom.
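The order of those fallbacks can be sketched as a short ladder of candidate commands. This is an illustration rather than a tested sequence: the step values 4 and 16384 are arbitrary picks, `oom_ladder` is a hypothetical helper, and the sampling flags from Running are omitted for brevity.

```shell
#!/usr/bin/env bash
# Print candidate invocations in the order the troubleshooting text suggests.
# Step values (4, 16384) are illustrative picks, not numbers from the guide.
BASE="llama-server -hf ggml-org/gpt-oss-20b-GGUF --jinja -ngl 99 -fa"

oom_ladder() {
  echo "$BASE --ctx-size 32768 -ub 2048 -b 2048 --n-cpu-moe 2"  # starting point
  echo "$BASE --ctx-size 32768 -ub 2048 -b 2048 --n-cpu-moe 4"  # (a) more experts on CPU
  echo "$BASE --ctx-size 32768 -ub 512 -b 512 --n-cpu-moe 4"    # (b) smaller batches
  echo "$BASE --ctx-size 16384 -ub 512 -b 512 --n-cpu-moe 4"    # (c) shorter context
}

oom_ladder
```

Try each line in turn and stop at the first one that launches; if step (a) alone is not enough, keep raising --n-cpu-moe before moving on to (b).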

ollama run gpt-oss:20b stutters or spills to CPU on the 5070

Ollama tries to load the whole model on the GPU and falls back to mixed CPU/GPU when it doesn't fit, which on a 12 GB card means uncontrolled spillover rather than a deliberate expert split. For predictable 12 GB behaviour, use llama.cpp's --n-cpu-moe directly as documented above. (The native MXFP4 build is also the one to use — there is no separate higher-precision baseline; per the HF card, all evals were done with MXFP4.)

Flash-Attention errors on first inference call

If you point llama.cpp at a custom backend or copy a Transformers snippet that imports flash_attention_2, Blackwell can crash at the first forward pass — FA2 sm_120 kernel coverage is still in flight at Dao-AILab/flash-attention#2168. The -fa flag in the Running command uses llama.cpp's own CUDA Flash-Attention path, which works on sm_120 with a cu128 build — you do not need the external FA2 package.

Generation slower than expected

Two checks: (a) confirm your build has Blackwell (sm_120) support, which means a cu128 (CUDA 12.8) build, since cu126 builds lack sm_120 kernels; (b) confirm --n-cpu-moe is as low as your VRAM allows, because every extra layer left on the CPU adds a PCIe round-trip per token. Token generation is memory-bandwidth-bound, so the per-token rate also falls as the KV cache grows with context.

Want different hardware numbers?

If you have benchmark data on the RTX 5070 (different context, --n-cpu-moe value, or RAM speed), submit it via /contribute so we can grow the /check/gpt-oss-20b/rtx-5070 page beyond the single community measurement.