gpt-oss 20B on RTX 4070: MXFP4 Chat in 12 GB via llama.cpp Expert Offload

What You'll Build

A local chat endpoint backed by OpenAI's open-weights gpt-oss-20b, a mixture-of-experts LLM that the HF card describes as having "21B parameters with 3.6B active parameters", shipped in native MXFP4 quantization and running on a 12 GB RTX 4070 (Ada Lovelace, AD104, sm_89).

The catch: this model does not fit a 12 GB card with everything resident on the GPU. The MXFP4 GGUF is 12.11 GB (11.27 GiB) on disk, and the official llama.cpp guide puts the full deployment at 14.9 GB at 8K context (15.5 GB at 32K), comfortably over 12 GB. The fix is llama.cpp's --n-cpu-moe flag, which keeps a handful of MoE expert layers in system RAM and the rest of the model (attention, KV cache, most experts) on the GPU. On a 12 GB card only ~2–3 expert layers need to move off the GPU.
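The size figures above mix decimal gigabytes and binary gibibytes, so a quick conversion helps when comparing them against a 12 GB card. The sketch below is illustrative only: the function names are made up for this example, and the shortfall calculation treats the card as exactly 12 GB, ignoring whatever VRAM the driver and desktop reserve.

```shell
#!/bin/sh
# Hypothetical helpers for illustration; only awk is required.

# Convert decimal gigabytes (10^9 bytes) to binary gibibytes (2^30 bytes).
gb_to_gib() {
  awk -v gb="$1" 'BEGIN { printf "%.2f\n", gb * 1e9 / (1024 * 1024 * 1024) }'
}

# How many GB a deployment overshoots the card by (0.0 if it fits).
vram_shortfall_gb() {
  awk -v need="$1" -v have="$2" \
    'BEGIN { d = need - have; if (d < 0) d = 0; printf "%.1f\n", d }'
}

echo "GGUF on disk: $(gb_to_gib 12.11) GiB"
echo "Shortfall at 8K context:  $(vram_shortfall_gb 14.9 12) GB"
echo "Shortfall at 32K context: $(vram_shortfall_gb 15.5 12) GB"
```

The converted file size lands within rounding of the GiB figure quoted above, and the two shortfall lines show how far the fully-resident deployment overshoots a nominal 12 GB.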

Hardware data: RTX 4070 (12 GB GDDR6X, PCIe Gen4) · runs via llama.cpp expert offload (--n-cpu-moe) · See benchmark data

⚠️ This is a 12 GB recipe, not a 16 GB recipe. Per the HF card, the MXFP4 release is designed so gpt-oss-20b will "run within 16GB of memory"; that 16 GB figure is the floor for fully-resident operation. A naive ollama run gpt-oss:20b (which tries to load everything on the GPU) spills uncontrolled to the CPU on a 12 GB card: an Ollama bug report on an RTX 4070 12 GB desktop shows the weights at 11.7 GiB plus a 2.0 GiB compute buffer overflowing the 12 GB envelope, only ~9 GB resident, and throughput collapsing to ~10 tok/s. The path documented here offloads MoE experts deliberately, so the model fits the 4070's 12 GB envelope by design rather than by accidental spillover.

ℹ️ MXFP4 on the RTX 4070 (Ada) — a storage win, not a tensor-core win. MXFP4 is the model's native release format, so the 4-bit weights occupy ~12 GB on any card — that is what brings a 21B model anywhere near a 12 GB envelope. But native FP4 tensor-core acceleration is a Blackwell feature; the RTX 4070 is Ada Lovelace (sm_89) and has no FP4 tensor cores, so llama.cpp runs the MXFP4 weights through its standard CUDA MoE kernels (a dequantize path) rather than native FP4 tensor-core matmuls. You keep the MXFP4 memory footprint; you do not get Blackwell-class FP4 throughput.

Requirements

Component	Minimum	Tested
GPU	12 GB VRAM	RTX 4070 (12 GB GDDR6X)
RAM	16 GB system RAM (holds the CPU-offloaded experts)	—
Storage	~13 GB for the MXFP4 GGUF (`gpt-oss-20b-mxfp4.gguf` is 12.11 GB per the HF tree API)	—
Software	NVIDIA driver with CUDA 12+; a recent `llama.cpp` release with `--n-cpu-moe` support	—
License	Apache-2.0 (HF card) — commercial use permitted	—

Installation

The documented sub-12-GB path is llama.cpp with expert CPU-offload. Ollama (which wraps llama.cpp) does not expose the --n-cpu-moe knob cleanly and tries to load the whole model on the GPU — on a 12 GB card that means uncontrolled spillover (see the Troubleshooting note below). For a 12 GB card, use llama.cpp directly.

1. Install a recent llama.cpp build

Grab the latest release from the official releases page (prebuilt CUDA binaries are published per release), or build from source with CUDA enabled. You need a build new enough to support the --n-cpu-moe flag — any current 2026 release qualifies. On the Ada-Lovelace RTX 4070 (sm_89) the standard cu12x CUDA binaries work; unlike Blackwell cards, no special CUDA 12.8 / sm_120 toolkit is required.

2. Run the server with MoE experts offloaded to CPU

The model auto-downloads on first launch via the -hf flag. The --n-cpu-moe N flag sets how many MoE layers stay in system RAM; the rest run on the GPU — the official llama.cpp guide describes it as the knob for "hardware configs that cannot fit the models fully on the GPU." The guide's own example value is --n-cpu-moe 22, which is tuned for an RTX 2060 8 GB (guide thread). A 12 GB card holds far more of the model, so you lower this number substantially — see Running for the 12 GB value.

# First-launch download; the value 22 below is the guide's 8 GB example — lower it on 12 GB (see Running).
llama-server -hf ggml-org/gpt-oss-20b-GGUF --ctx-size 0 --jinja -ub 2048 -b 2048 --n-cpu-moe 22

First launch downloads gpt-oss-20b-mxfp4.gguf (12.11 GB) and starts a WebUI plus an OpenAI-compatible API on localhost.
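Because the first launch has to pull a roughly 12 GB file before the API answers, scripts that depend on the server usually need to wait for it. The following is a minimal sketch, assuming the server exposes its health check at /health on the default port 8080; wait_until and server_ready are hypothetical helper names, and the retry counts in the usage line are arbitrary.

```shell
#!/bin/sh
# Hypothetical helper: retry a probe command until it succeeds.
#   $1 = maximum attempts, $2 = seconds between attempts, rest = probe command
wait_until() {
  attempts=$1
  delay=$2
  shift 2
  i=0
  while [ "$i" -lt "$attempts" ]; do
    if "$@" >/dev/null 2>&1; then
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  return 1
}

# Probe the llama-server health endpoint (assumed default port 8080).
# curl -f turns HTTP error statuses into a non-zero exit code.
server_ready() {
  curl -sf "http://localhost:${LLAMA_PORT:-8080}/health"
}

# Usage (not run automatically):
#   wait_until 120 5 server_ready && echo "server is up"
```

Keeping the probe separate from the retry loop means the same loop can wait on any other command.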

Running

On a 12 GB card almost the entire model fits, so only a couple of expert layers need to leave the GPU. The closest documented same-tier configuration is a community RTX 3060 12 GB run from the guide thread (QuantiusBenignus): --n-cpu-moe 2 at 16K context (left ~600 MB VRAM spare), stepping up to --n-cpu-moe 3 for 32K context. Use the same starting point on the 4070's identical 12 GB envelope:

llama-server -hf ggml-org/gpt-oss-20b-GGUF \
  --ctx-size 16384 --jinja -ub 2048 -b 2048 -ngl 99 -fa \
  --n-cpu-moe 2 --temp 1.0 --top-p 1.0 --top-k 0 --min-p 0.0

-ngl 99 puts all non-expert layers on the GPU; --n-cpu-moe 2 parks two expert layers in host RAM; -fa enables llama.cpp's Flash-Attention kernels, which ship as part of its CUDA backend, so no special build is needed on Ada (sm_89). Newer builds may expect an explicit value (-fa on); check llama-server --help if the bare flag is rejected. The remaining flags set the sampling parameters. Once llama-server reports it is listening (default port 8080), it serves a WebUI at http://localhost:8080 and an OpenAI-compatible Chat Completions API. The model uses OpenAI's harmony response format, and the --jinja flag tells llama.cpp to apply the chat template embedded in the GGUF. Query the API with:

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Explain mixture-of-experts routing in one paragraph."}]}'

ℹ️ Gen4 link, not Gen5. The RTX 4070 is PCIe Gen4 x16, roughly half the host-to-device bandwidth of the Blackwell RTX 50-series (Gen5). Experts parked by --n-cpu-moe live in system RAM, so every offloaded layer adds a host round-trip per token, and whatever has to cross the bus does so at Gen4 speed. The throughput penalty on the offloaded portion is therefore larger on a Gen4 4070 than on a Gen5 card at the same VRAM. Keep --n-cpu-moe as low as your VRAM allows.
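The two data points quoted in this section (2 offloaded layers at 16K context, 3 at 32K, both from the RTX 3060 12 GB run) can be wrapped in a small launcher so the flag follows the context size. This is only a sketch: pick_n_cpu_moe and print_launch_cmd are hypothetical helpers, contexts above 32K are refused because nothing in this guide documents them, and the values remain starting points to check against actual VRAM use.

```shell
#!/bin/sh
# Hypothetical helper: map a context size to the starting --n-cpu-moe value
# reported for a 12 GB card (2 up to 16K, 3 up to 32K).
pick_n_cpu_moe() {
  ctx=$1
  if [ "$ctx" -le 16384 ]; then
    echo 2
  elif [ "$ctx" -le 32768 ]; then
    echo 3
  else
    echo "no documented 12 GB value for context $ctx" >&2
    return 1
  fi
}

# Print the launch command instead of running it, so it can be reviewed first.
print_launch_cmd() {
  ctx=$1
  n=$(pick_n_cpu_moe "$ctx") || return 1
  cmd="llama-server -hf ggml-org/gpt-oss-20b-GGUF --ctx-size $ctx --jinja"
  cmd="$cmd -ub 2048 -b 2048 -ngl 99 -fa --n-cpu-moe $n"
  cmd="$cmd --temp 1.0 --top-p 1.0 --top-k 0 --min-p 0.0"
  echo "$cmd"
}

print_launch_cmd 32768
```

Printing the command rather than executing it keeps the helper safe to run anywhere; pipe the output to sh once the values look right.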

Results

Speed: No first-party RTX 4070 benchmark exists yet. Hardware Corner's RTX 4070 page carries Qwen3 8B and Qwen3 14B rows but no gpt-oss-20b row, and our backend has no benchmark for this pair (/check/gpt-oss-20b/rtx-4070 returns verdict unknown). The 12 GB Blackwell RTX 5070 and the 16 GB Ada RTX 4070 Ti SUPER both have measured numbers on our site, but neither transfers: the 5070 is PCIe Gen5 (a faster host link for the offloaded experts), and the 4070 Ti SUPER has about 43% more CUDA cores (8,448 vs 5,888) and does not offload at all. If you measure gpt-oss-20b on an RTX 4070, please submit it via /contribute.
VRAM usage: The MXFP4 GGUF is 12.11 GB on disk (HF tree API), and the official llama.cpp memory table puts the fully-resident deployment at 14.9 GB at 8K / 15.5 GB at 32K — over the 12 GB envelope, which is exactly why expert offload is required. With --n-cpu-moe 2 the resident footprint drops under 12 GB; on the equivalent RTX 3060 12 GB the guide thread reports ~600 MB VRAM spare at 16K context. See /check/gpt-oss-20b/rtx-4070.
Quality notes: gpt-oss 20B is post-trained with reasoning support and tool-use. The MXFP4 quantization is the native release format — per the HF card, the models ship with "MXFP4 quantization of the MoE weights" and "All evals were performed with the same MXFP4 quantization" — so there is no separate higher-precision baseline to compare against on consumer hardware.

For the full benchmark data and cross-card compare, see /check/gpt-oss-20b/rtx-4070.

Troubleshooting

`ollama run gpt-oss:20b` is very slow (~10 tok/s) or spills to CPU

Ollama tries to load the whole model on the GPU and falls back to mixed CPU/GPU when it doesn't fit — on a 12 GB card that means uncontrolled spillover rather than a deliberate expert split. An Ollama report on an RTX 4070 12 GB desktop shows exactly this: weights at 11.7 GiB plus a 2.0 GiB compute buffer exceed 12 GB, only ~9 GB ends up resident, and generation falls to ~10 tok/s. For predictable 12 GB behaviour, use llama.cpp's --n-cpu-moe directly as documented above so you control which layers leave the GPU.
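To see whether a given setup is actually spilling, compare used and total VRAM while the model is loaded. The sketch below assumes nvidia-smi is on the PATH and uses its CSV query mode; vram_headroom_mib and query_vram are hypothetical helper names, and only the first GPU reported is considered.

```shell
#!/bin/sh
# Hypothetical helper: read a "used, total" line (MiB) on stdin and print
# the remaining headroom in MiB. Only the first line (first GPU) is used.
vram_headroom_mib() {
  awk -F', *' 'NR == 1 { print $2 - $1 }'
}

# Query used and total VRAM in MiB; not called automatically.
query_vram() {
  nvidia-smi --query-gpu=memory.used,memory.total --format=csv,noheader,nounits
}

# Usage on a machine with an NVIDIA driver:
#   query_vram | vram_headroom_mib
```

A few hundred MiB of headroom with the model loaded suggests the split is under control. Usage far below the card's capacity combined with slow generation points to the uncontrolled spillover described above.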

Out of memory at launch on the 12 GB card

The MXFP4 weights (12.11 GB on disk) plus compute buffers and KV cache exceed 12 GB if everything is resident; the official memory table lists 14.9 GB at 8K context. A 12 GB desktop card with a monitor attached exposes only ~10.5–11.3 GB usable, so you must offload. If the Running command still OOMs, apply these in order: (a) raise --n-cpu-moe from 2 toward 3–4; (b) drop -ub/-b from 2048 to 512 (the -ub default); (c) reduce --ctx-size. Each step trades a little speed or context for headroom.
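The ordered fallback above can be written down as a small table of flag sets, stepping to the next one each time a launch runs out of memory. This is a sketch under assumptions: oom_fallback_flags is a hypothetical helper, the starting point is the 16K command from Running, and the last step's halving of the context to 8192 is an arbitrary choice rather than a documented value.

```shell
#!/bin/sh
# Hypothetical helper: print the memory-related flags for fallback step N.
# Step 0 is the Running command; later steps follow the order (a), (b), (c).
oom_fallback_flags() {
  case "$1" in
    0) echo "--n-cpu-moe 2 -ub 2048 -b 2048 --ctx-size 16384" ;;
    1) echo "--n-cpu-moe 3 -ub 2048 -b 2048 --ctx-size 16384" ;;  # (a)
    2) echo "--n-cpu-moe 4 -ub 2048 -b 2048 --ctx-size 16384" ;;  # (a)
    3) echo "--n-cpu-moe 4 -ub 512 -b 512 --ctx-size 16384" ;;    # (b)
    4) echo "--n-cpu-moe 4 -ub 512 -b 512 --ctx-size 8192" ;;     # (c), assumed halving
    *) return 1 ;;
  esac
}

# List every step so the progression is visible.
step=0
while flags=$(oom_fallback_flags "$step"); do
  echo "step $step: $flags"
  step=$((step + 1))
done
```

Changing one knob per step makes it easier to tell which change bought the headroom.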

Transformers / `device_map="auto"` OOMs on a 12 GB 4070

Running the raw safetensors through Transformers on a 12 GB card is fragile: an HF discussion on an RTX 4070 Ti 12 GB reports out-of-memory errors until the user manually pinned only ~5 of 24 transformer layers to the GPU and pushed the rest to the CPU, at which point inference became extremely slow (seconds per token). The llama.cpp --n-cpu-moe GGUF path above is the supported way to fit 12 GB — prefer it over hand-rolled Transformers device maps.

Generation slower than a Blackwell RTX 50-series card

Expected, for two reasons. First, the RTX 4070 is Ada Lovelace (sm_89) with no native FP4 tensor cores, so the MXFP4 matmuls run through standard CUDA kernels rather than Blackwell's native FP4 path. Second, the experts parked in system RAM are reached over the 4070's PCIe Gen4 link, which has roughly half the bandwidth of a Gen5 card, so the per-token offload cost is higher. Keep --n-cpu-moe as low as your VRAM allows.

Want hardware numbers for this card?

There is no published RTX 4070 gpt-oss-20b benchmark yet. If you have measurements (context length, --n-cpu-moe value, RAM speed), submit them via /contribute so we can populate the /check/gpt-oss-20b/rtx-4070 page.
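For anyone collecting numbers to submit, the following sketch shows one way to turn a single API response into a tokens-per-second figure. It assumes the response carries the OpenAI-style usage.completion_tokens field and that wall-clock time around the request is an acceptable proxy for generation time (it also includes prompt processing, so it understates pure generation speed). completion_tokens, tok_per_sec, and measure_once are hypothetical helper names, and the prompt and max_tokens value are arbitrary.

```shell
#!/bin/sh
# Hypothetical helper: pull usage.completion_tokens out of a JSON response
# on stdin without needing jq.
completion_tokens() {
  sed -n 's/.*"completion_tokens"[[:space:]]*:[[:space:]]*\([0-9][0-9]*\).*/\1/p' | head -n 1
}

# Tokens per second from a token count and elapsed seconds.
tok_per_sec() {
  awk -v t="$1" -v s="$2" \
    'BEGIN { if (s > 0) printf "%.1f\n", t / s; else print "n/a" }'
}

# Time one request against the local server; not called automatically.
measure_once() {
  start=$(date +%s)
  resp=$(curl -s http://localhost:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"messages": [{"role": "user", "content": "Count from 1 to 200."}], "max_tokens": 512}')
  end=$(date +%s)
  tok_per_sec "$(printf '%s\n' "$resp" | completion_tokens)" "$((end - start))"
}

# Usage with the server running:
#   measure_once
```

Whole-second timing is coarse, so longer generations give steadier figures. Record the context length, --n-cpu-moe value, and RAM speed alongside each result.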