What You'll Build
A local chat endpoint backed by OpenAI's open-weights gpt-oss-20b, a 21B-parameter mixture-of-experts LLM (3.6B active per token) shipped in native MXFP4 quantization, running on a 12 GB RTX 5070. Unlike on the 16 GB cards in our catalogue, the model does not fit on a 12 GB card with everything resident on the GPU: the MXFP4 GGUF is 11.27 GiB (12.11 GB) on disk, and the official llama.cpp memory table puts the full deployment at 14.9 GB at 8K context. The trick is llama.cpp's --n-cpu-moe flag, which keeps the expert weights of a few MoE layers in system RAM and leaves the rest (attention, KV cache, most experts) on the GPU. On the 5070 only ~2 expert layers need to move off-GPU, so the throughput hit is small.
Hardware data: RTX 5070 (12 GB GDDR7) · runs via llama.cpp expert offload (--n-cpu-moe) · See benchmark data
⚠️ This is a 12 GB recipe, not a 16 GB recipe. Per the HF card, the MXFP4 release is designed so gpt-oss-20b will "run within 16GB of memory". That is the 16 GB floor for fully-resident operation, and it is why a naive ollama run gpt-oss:20b (which tries to load everything on the GPU) spills to CPU or stutters on a 12 GB card. The path documented here deliberately offloads MoE experts so the model fits the 5070's 12 GB envelope on purpose.
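The gap between the fully-resident footprint and the card can be worked out with a little arithmetic. The sketch below is only a rough budget comparison, not a prediction of real usage: it takes the memory-table figures this guide cites (14.9 GB at 8K context, 15.5 GB at 32K) and subtracts a nominal 12 GB budget. The helper name overshoot_gb is ours, not part of llama.cpp.

```shell
# Sketch: how far the fully-resident footprint overshoots a VRAM budget.
# Inputs are the figures quoted in this guide; overshoot_gb is a hypothetical helper.

# overshoot_gb REQUIRED_GB BUDGET_GB -> prints REQUIRED - BUDGET with one decimal
overshoot_gb() {
  awk -v need="$1" -v have="$2" 'BEGIN { printf "%.1f\n", need - have }'
}

echo "8K context:  $(overshoot_gb 14.9 12) GB over a 12 GB card"
echo "32K context: $(overshoot_gb 15.5 12) GB over a 12 GB card"
```

Whatever the exact numbers on your machine, that overshoot is what has to be reclaimed through expert offload, smaller batch buffers, or a shorter context.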
Requirements
| Component | Minimum | Tested |
|---|---|---|
| GPU | 12 GB VRAM | RTX 5070 (12 GB GDDR7) |
| RAM | 16 GB system RAM (holds the CPU-offloaded experts) | 64 GB DDR5 in the cited 5070 run |
| Storage | ~13 GB for the MXFP4 GGUF (gpt-oss-20b-mxfp4.gguf is 12.11 GB per the HF tree API) | — |
| Software | NVIDIA driver with CUDA 12.8+ (required for Blackwell sm_120); a recent llama.cpp release | — |
| License | Apache 2.0 (HF card) — commercial use permitted | — |
Installation
The documented sub-12-GB path is llama.cpp with expert CPU-offload. Ollama (which wraps llama.cpp) does not expose the --n-cpu-moe knob as cleanly, so for a 12 GB card use llama.cpp directly.
1. Install a recent llama.cpp build
Grab the latest release from the official releases page (prebuilt CUDA binaries are published per release), or build from source with CUDA enabled. You need a build new enough to support the --n-cpu-moe flag and Blackwell sm_120 — any current 2026 release qualifies. The CUDA backend requires the cu128 toolkit because CUDA 12.8 is the first version with native Blackwell (sm_120) kernels; older cu126 builds do not ship sm_120 support.
2. Run the server with MoE experts offloaded to CPU
The model auto-downloads on first launch via the -hf flag. The command below is the official llama.cpp guide's example for cards below 16 GB; the 5070-specific invocation, with a much lower --n-cpu-moe value, is in Running:
# Full 128k context, with MoE layers kept on the CPU as needed.
llama-server -hf ggml-org/gpt-oss-20b-GGUF --ctx-size 0 --jinja -ub 2048 -b 2048 --n-cpu-moe 22
The --n-cpu-moe N flag sets how many MoE layers keep their expert weights in system RAM; the rest are offloaded to the GPU. --ctx-size 0 tells llama.cpp to take the context length from the model, which is where the full 128k comes from. The value 22 is taken verbatim from the official llama.cpp guide, where it is the example for an RTX 2060 8 GB. A 12 GB card holds far more, so you lower this number substantially (see Running). First launch downloads gpt-oss-20b-mxfp4.gguf (12.11 GB) and starts a WebUI plus an OpenAI-compatible API on localhost.
Running
On a 12 GB RTX 5070 almost the entire model fits, so only ~2 expert layers need to leave the GPU. The configuration below is a community-reported RTX 5070 12 GB run from the llama.cpp guide thread:
llama-server -hf ggml-org/gpt-oss-20b-GGUF \
--ctx-size 32768 --jinja -ub 2048 -b 2048 -ngl 99 -fa \
--n-cpu-moe 2 --temp 1.0 --top-p 1.0 --top-k 0 --min-p 0.0
-ngl 99 offloads every layer to the GPU; --n-cpu-moe 2 then keeps the expert weights of two layers in host RAM; -fa enables Flash-Attention kernels. Once llama-server reports it is listening (default port 8080), it serves a WebUI at http://localhost:8080 and an OpenAI-compatible Chat Completions API. The model uses OpenAI's harmony response format. The --jinja flag tells llama.cpp to apply the chat template embedded in the GGUF, so you don't have to format it by hand. Query the API with:
curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"messages": [{"role": "user", "content": "Explain mixture-of-experts routing in one paragraph."}]}'
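Because the first launch has to download and load about 12 GB of weights, a script that fires the request above too early simply gets a connection error. The sketch below polls the server until it answers. It assumes llama-server exposes a /health endpoint that returns a non-2xx status while the model is still loading; the function name, retry count, and delay are illustrative choices of ours.

```shell
# wait_for_server BASE_URL [TRIES] [DELAY_SECONDS]
# Polls BASE_URL/health until it answers with HTTP 2xx.
# Returns 0 when the server is ready, 1 after TRIES failed attempts.
wait_for_server() {
  local base_url="$1" tries="${2:-60}" delay="${3:-2}" i=1
  while [ "$i" -le "$tries" ]; do
    # -f makes curl exit non-zero on HTTP errors, -s keeps it quiet
    if curl -sf --max-time 5 -o /dev/null "$base_url/health"; then
      return 0
    fi
    # no point sleeping after the last attempt
    if [ "$i" -lt "$tries" ]; then
      sleep "$delay"
    fi
    i=$((i + 1))
  done
  return 1
}

# Typical use, once llama-server has been started in another terminal:
#   wait_for_server http://localhost:8080 && echo "ready for requests"
```

With the defaults this waits up to roughly two minutes; raise TRIES for a slow first download.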
ℹ️ Tight on a display card? A 12 GB desktop card with a monitor attached exposes only ~10.5–11.3 GB usable. A community RTX 3060 12 GB run reported it could not fit -ub 2048 at 32K context even with --n-cpu-moe 2, and had to drop -ub back to the default of 512 to leave ~300 MB of headroom. If you OOM at launch on the 5070, lower -ub/-b, raise --n-cpu-moe, or reduce --ctx-size.
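To see how much of the card the desktop has already claimed before you launch, you can read the totals from nvidia-smi. The sketch below assumes the CSV output of its memory.total and memory.used query fields, which are reported in MiB; free_gib is a hypothetical helper name, and the sample reading in the demo is made up.

```shell
# free_gib: reads "total_mib, used_mib" lines (nvidia-smi CSV with nounits)
# and prints the free memory per GPU in GiB.
free_gib() {
  awk -F', *' '{ printf "%.2f\n", ($1 - $2) / 1024 }'
}

# Real use (needs the NVIDIA driver installed):
#   nvidia-smi --query-gpu=memory.total,memory.used --format=csv,noheader,nounits | free_gib

# Offline demo with an invented reading: 12227 MiB total, 963 MiB held by the desktop
printf '12227, 963\n' | free_gib   # prints 11.00
```

If the figure is near the low end of the range quoted above, start with a smaller -ub rather than waiting for the first OOM.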
Results
- Speed: A community RTX 5070 12 GB run with --n-cpu-moe 2 at 32K context reports 62.21 tokens/s generation (llama.cpp guide thread, Ryzen 5 9600X + 64 GB DDR5). This is a single community measurement, not a first-party benchmark. Hardware Corner's RTX 5070 page does not yet carry a gpt-oss-20b row to corroborate it, so submit yours via /contribute.
- VRAM usage: The MXFP4 GGUF is 12.11 GB on disk (HF tree API), and the official llama.cpp memory table puts the fully-resident deployment at 14.9 GB at 8K / 15.5 GB at 32K. That is over the 12 GB envelope, which is exactly why expert offload is required. With --n-cpu-moe 2 the resident footprint drops under 12 GB; a community RTX 3060 12 GB run noted it left only ~300 MB spare at the default -ub. See /check/gpt-oss-20b/rtx-5070.
- Throughput penalty: Offloading is not free, but on the 5070 it is small because only 2 of 24 expert layers stream over PCIe Gen5. For comparison, a community RTX 3060 12 GB run measured 30.6 tok/s steady-state (tg128) with experts on the CPU via llama-bench, rising to ~75 tok/s only when a small 2K context lets everything fit on the GPU with no offload (guide thread).
For the full benchmark data and cross-card compare, see /check/gpt-oss-20b/rtx-5070.
Troubleshooting
Out of memory at launch on a 12 GB card
The MXFP4 weights (12.11 GB on disk) plus compute buffers and KV cache exceed 12 GB if everything is resident; the official memory table lists 14.9 GB at 8K context. You must offload. If the Running command still OOMs (a display card exposes only ~10.5–11.3 GB usable), apply these in order: (a) raise --n-cpu-moe from 2 toward 3–4; (b) drop -ub/-b from 2048 to 512, the -ub default (a community RTX 3060 12 GB run needed exactly this to fit 32K context); (c) reduce --ctx-size. Each step trades a little speed or context for headroom.
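The order of those steps can be written down as a list of candidate flag sets to try one after another. This is a minimal sketch rather than a tested tuning procedure: fallback_ladder is a hypothetical helper, and the 16384 context in the last rung is an arbitrary illustrative value, not a figure from any benchmark.

```shell
# fallback_ladder: prints candidate llama-server flag sets, least invasive first.
fallback_ladder() {
  local moe
  # step (a): raise --n-cpu-moe while keeping the large batch and 32K context
  for moe in 2 3 4; do
    echo "--ctx-size 32768 -ub 2048 -b 2048 --n-cpu-moe $moe"
  done
  # step (b): shrink the batch buffers to 512
  echo "--ctx-size 32768 -ub 512 -b 512 --n-cpu-moe 4"
  # step (c): give up context last (16384 is an illustrative value only)
  echo "--ctx-size 16384 -ub 512 -b 512 --n-cpu-moe 4"
}

fallback_ladder
```

Each printed line replaces the matching flags in the Running command; stop at the first rung that launches without an out-of-memory error.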
ollama run gpt-oss:20b stutters or spills to CPU on the 5070
Ollama tries to load the whole model on the GPU and falls back to mixed CPU/GPU when it doesn't fit, which on a 12 GB card means uncontrolled spillover rather than a deliberate expert split. For predictable 12 GB behaviour, use llama.cpp's --n-cpu-moe directly as documented above. (The native MXFP4 build is also the one to use — there is no separate higher-precision baseline; per the HF card, all evals were done with MXFP4.)
Flash-Attention errors on first inference call
If you point llama.cpp at a custom backend or copy a Transformers snippet that imports flash_attention_2, Blackwell can crash at the first forward pass — FA2 sm_120 kernel coverage is still in flight at Dao-AILab/flash-attention#2168. The -fa flag in the Running command uses llama.cpp's own CUDA Flash-Attention path, which works on sm_120 with a cu128 build — you do not need the external FA2 package.
Generation slower than expected
Two checks: (a) confirm your build has Blackwell (sm_120) support — it needs the cu128 CUDA 12.8 toolkit, since cu126 builds lack sm_120 kernels; (b) confirm --n-cpu-moe is as low as your VRAM allows — every extra layer left on the CPU adds a PCIe round-trip per token. Token generation is memory-bandwidth-bound, so the per-token rate also falls as the KV cache grows with context.
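For check (a), it helps to confirm what the driver reports for the card before inspecting the build. The sketch below covers only the GPU side: it assumes a driver recent enough for nvidia-smi to support the compute_cap query field, and that Blackwell consumer cards report 12.0 there. The helper name is_sm120 is ours.

```shell
# is_sm120 CAP: succeeds when CAP (for example "12.0") is compute capability 12.0.
# Surrounding whitespace from CSV output is ignored.
is_sm120() {
  [ "$(printf '%s' "$1" | tr -d '[:space:]')" = "12.0" ]
}

# Real use (needs the NVIDIA driver installed):
#   is_sm120 "$(nvidia-smi --query-gpu=compute_cap --format=csv,noheader | head -n 1)"

# Offline demo with a literal value
if is_sm120 "12.0"; then
  echo "sm_120 GPU detected: use a CUDA 12.8 (cu128) llama.cpp build"
fi
```

If the check passes but generation is still slow, the build itself is the next suspect, followed by the --n-cpu-moe value.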
Want different hardware numbers?
If you have benchmark data on the RTX 5070 (different context, --n-cpu-moe value, or RAM speed), submit it via /contribute so we can grow the /check/gpt-oss-20b/rtx-5070 page beyond the single community measurement.