
gpt-oss 20B on RTX 3060: MXFP4 Chat at 64 tok/s in 12 GB via llama.cpp Expert Offload

llm · intermediate · 12GB+ VRAM · Jun 13, 2026

This intermediate recipe sets up gpt-oss 20B on the RTX 3060, needing about 12 GB of VRAM.

prerequisites
  • NVIDIA RTX 3060 (12 GB GDDR6) or any consumer card with 12 GB VRAM
  • Recent NVIDIA driver with CUDA 12+ support; a current llama.cpp release
  • At least 16 GB system RAM (32 GB recommended — the CPU-offloaded MoE experts live in host RAM)
  • ~13 GB free disk for the MXFP4 GGUF weights

What You'll Build

A local chat endpoint backed by OpenAI's open-weights gpt-oss-20b — a mixture-of-experts LLM the model card describes as having "21B parameters with 3.6B active parameters" (HF card), shipped in native MXFP4 quantization — running on a 12 GB RTX 3060 (Ampere, GA106, sm_86, 360 GB/s GDDR6 per TechPowerUp).

The catch: this model does not fit a 12 GB card with everything resident on the GPU. The MXFP4 GGUF is 11.27 GiB (12.11 GB) on disk, and the official llama.cpp gpt-oss guide puts the full deployment at 14.9 GB at 8K context (15.5 GB at 32K, 17.9 GB at 128K), well over 12 GB. The fix is llama.cpp's --n-cpu-moe flag, which keeps the expert weights of a few MoE layers in system RAM and the rest of the model (attention, KV cache, most experts) on the GPU. Within the RTX 3060's 12 GB envelope, only ~2 expert layers need to move off the GPU.
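To make the gap concrete, here is a minimal arithmetic sketch that uses only the footprints quoted above. The helper name vram_overshoot is made up for illustration, and treating the card as exactly 12 GB is a simplification that ignores driver and display overhead.

```shell
#!/bin/sh
# vram_overshoot CARD_GB DEPLOY_GB
# Prints by how many GB a fully resident deployment exceeds the card
# (a negative value would mean it fits). Illustrative arithmetic only.
vram_overshoot() {
    awk -v card="$1" -v need="$2" 'BEGIN { printf "%.1f\n", need - card }'
}

# Fully resident footprints as quoted from the llama.cpp guide's memory table.
for entry in "8K:14.9" "32K:15.5" "128K:17.9"; do
    ctx=${entry%%:*}     # text before the colon: context label
    need=${entry##*:}    # text after the colon: footprint in GB
    echo "$ctx context: $(vram_overshoot 12 "$need") GB over a 12 GB card"
done
```

Even the smallest listed context overshoots by roughly 3 GB, which is why some expert weights have to live in host RAM rather than being squeezed in by trimming context alone.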

Hardware data: RTX 3060 (12 GB GDDR6, PCIe Gen4) · ~64 tok/s via llama.cpp expert offload (--n-cpu-moe 2, 16K context) · See benchmark data

⚠️ This is a 12 GB recipe, not a 16 GB recipe. Per the HF card, the MXFP4 release is designed so gpt-oss-20b will "run within 16GB of memory" — that 16 GB figure is the floor for fully-resident operation. A naive ollama run gpt-oss:20b (which tries to load everything on the GPU) spills uncontrolled to the CPU on a 12 GB card. The path documented here deliberately offloads two MoE expert layers so the model fits the 3060's 12 GB envelope on purpose, with ~600 MB VRAM to spare.

ℹ️ MXFP4 is FP4-microscaling, not FP8 — and it runs on Ampere. MXFP4 is handled by llama.cpp's standard quantized matmul kernels, completely independently of FP8 tensor cores. The RTX 3060 is Ampere (sm_86) and has no FP8 or FP4 tensor cores, but that does not block this model: the official llama.cpp guide thread shows gpt-oss 20B MXFP4 running on an RTX 3060 (compute capability 8.6) directly. MXFP4 is the model's native release format, so the 4-bit weights occupy ~12 GB on any card — that is what brings a 21B model near a 12 GB envelope. What you do not get on Ampere is Blackwell-class native FP4 tensor-core acceleration; the MXFP4 matmuls run through standard CUDA MoE kernels. MXFP4 here is a storage win, not a tensor-core win — but it is a fully supported path on the 3060.

Requirements

| Component | Minimum | Tested |
|-----------|---------|--------|
| GPU | 12 GB VRAM | RTX 3060 (12 GB GDDR6, GA106, sm_86) |
| RAM | 16 GB system RAM (32 GB recommended; holds the CPU-offloaded experts) | 32 GB DDR4 |
| Storage | ~13 GB for the MXFP4 GGUF (gpt-oss-20b-mxfp4.gguf is 12.11 GB per the HF tree API) | |
| Software | NVIDIA driver with CUDA 12+; a recent llama.cpp release with --n-cpu-moe support | CUDA 13.0, llama.cpp build 6139 |
| License | Apache-2.0 (HF card); commercial use permitted | |

Installation

The documented sub-12-GB path is llama.cpp with expert CPU-offload. Ollama (which wraps llama.cpp) does not expose the --n-cpu-moe knob cleanly and tries to load the whole model on the GPU — on a 12 GB card that means uncontrolled spillover (see Troubleshooting). For a 12 GB RTX 3060, use llama.cpp directly.

1. Install a recent llama.cpp build

Grab the latest release from the official releases page (prebuilt CUDA binaries are published per release), or build from source with CUDA enabled. You need a build new enough to support the --n-cpu-moe flag — any current 2026 release qualifies. On the Ampere RTX 3060 (sm_86) the standard cu12x CUDA binaries work; unlike Blackwell cards, no special CUDA 12.8 / sm_120 toolkit is required, and the prebuilt Flash-Attention kernels already cover sm_86.
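If you go the source-build route, something like the following may help confirm the result is new enough. This is a sketch, not an official procedure: the cmake lines reflect llama.cpp's CUDA build switch as commonly documented, supports_flag is a hypothetical helper, and the check assumes that --help lists the flag by name.

```shell
#!/bin/sh
# Source build with CUDA (run inside a llama.cpp checkout). Left as comments
# because it needs the CUDA toolkit and takes several minutes:
#   cmake -B build -DGGML_CUDA=ON
#   cmake --build build --config Release

# supports_flag HELP_TEXT FLAG
# Succeeds if FLAG appears literally in HELP_TEXT (hypothetical helper).
supports_flag() {
    printf '%s\n' "$1" | grep -qF -- "$2"
}

# Only probe the binary if it is actually on PATH.
if command -v llama-server >/dev/null 2>&1; then
    if supports_flag "$(llama-server --help 2>&1)" "--n-cpu-moe"; then
        echo "llama-server supports --n-cpu-moe"
    else
        echo "llama-server looks too old: --n-cpu-moe not found in --help"
    fi
else
    echo "llama-server not found on PATH"
fi
```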

2. Run the server with MoE experts offloaded to CPU

The model auto-downloads on first launch via the -hf flag. The --n-cpu-moe N flag is described in the official guide as the "Number of MoE layers N to keep on the CPU. This is used in hardware configs that cannot fit the models fully on the GPU." The guide's headline example is --n-cpu-moe 16 at 32K context, tuned for an RTX 2060 8 GB. The 3060 holds far more of the model, so you lower this number substantially — see Running for the 12 GB value.

# First-launch download; the value 16 below is the guide's 8 GB example — lower it on 12 GB (see Running).
llama-server -hf ggml-org/gpt-oss-20b-GGUF --ctx-size 32768 --jinja -ub 2048 -b 2048 --n-cpu-moe 16

First launch downloads gpt-oss-20b-mxfp4.gguf (12.11 GB) and starts a WebUI plus an OpenAI-compatible API on localhost.

Running

On a 12 GB card almost the entire model fits, so only a couple of expert layers need to leave the GPU. The official guide thread carries a first-party RTX 3060 12 GB configuration (QuantiusBenignus, "Ryzen 7 5700X with 32GB RAM (PCIe 4), NVIDIA RTX3060, 12GB VRAM, with CUDA 13.0"): --n-cpu-moe 2 at 16K context, which the author reports "Leaves about 600 MB VRAM budget, with 64 tok/sec initial generation rate". Use that as your starting point:

llama-server -hf ggml-org/gpt-oss-20b-GGUF \
  --ctx-size 16384 --jinja -ngl 99 -fa \
  --n-cpu-moe 2 --temp 1.0 --top-p 1.0 --top-k 0 --min-p 0.0

-ngl 99 offloads every layer to the GPU; --n-cpu-moe 2 then overrides that for the MoE expert weights of the first two layers, keeping them in host RAM; -fa enables llama.cpp's Flash-Attention kernels, which the prebuilt CUDA binaries already include for Ampere sm_86, so no special build is needed. On builds newer than the tested build 6139, -fa may expect an explicit value (for example -fa on); check llama-server --help if the bare flag is rejected. Once llama-server reports it is listening (default port 8080), it serves a WebUI at http://localhost:8080 and an OpenAI-compatible Chat Completions API. The model uses OpenAI's harmony response format; the --jinja flag tells llama.cpp to apply the chat template embedded in the GGUF. Query the API with:

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Explain mixture-of-experts routing in one paragraph."}]}'
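On first launch the server spends a while downloading and loading weights before it answers requests. A minimal sketch for scripting around that is shown below; it assumes llama-server exposes a /health endpoint that returns a success status only once the model is loaded, and wait_for_server is a hypothetical helper name.

```shell
#!/bin/sh
# wait_for_server BASE_URL [TRIES] [DELAY_SECONDS]
# Polls BASE_URL/health until curl reports success or TRIES runs out.
# Returns 0 when the server is ready, 1 otherwise.
wait_for_server() {
    base_url=$1
    tries=${2:-60}
    delay=${3:-2}
    i=0
    while [ "$i" -lt "$tries" ]; do
        # -f makes curl exit non-zero on HTTP errors such as 503 (still loading).
        if curl -sf --max-time 2 "$base_url/health" >/dev/null 2>&1; then
            return 0
        fi
        i=$((i + 1))
        sleep "$delay"
    done
    return 1
}

# Usage (left commented so the snippet has no side effects when sourced):
#   wait_for_server http://localhost:8080 && echo "server is ready"
```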

ℹ️ Gen4 link: keep --n-cpu-moe as low as VRAM allows. The RTX 3060 is PCIe Gen4 x16. Every expert layer kept in host RAM adds CPU-side work and a trip across the PCIe link for each generated token, so each extra offloaded layer costs speed. The same guide thread also shows how thin the VRAM margin is on identical 3060 hardware: at a full 32K context with --n-cpu-moe 3 there is still ~600 MB spare at launch, but the author flags the setting as "too aggresive, will likely OOM before reaching context limit." Lower --n-cpu-moe is faster, so use the minimum your context size allows, but not less than it needs.
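One way to see how much margin a given --n-cpu-moe value leaves is to read VRAM usage while the server is running. The sketch below assumes nvidia-smi is installed and that its CSV query output is a single "used, total" line in MiB per GPU; vram_headroom_mib is a hypothetical helper.

```shell
#!/bin/sh
# vram_headroom_mib "USED, TOTAL"
# Takes one CSV line in MiB and prints TOTAL - USED.
vram_headroom_mib() {
    printf '%s\n' "$1" | awk -F',' '{ print $2 - $1 }'
}

if command -v nvidia-smi >/dev/null 2>&1; then
    # First GPU only; noheader/nounits leaves bare numbers such as "11688, 12288".
    line=$(nvidia-smi --query-gpu=memory.used,memory.total \
                      --format=csv,noheader,nounits | head -n 1)
    echo "VRAM headroom: $(vram_headroom_mib "$line") MiB"
else
    echo "nvidia-smi not found on PATH"
fi
```

Checking this again after a long conversation is worthwhile, since the headroom seen at launch can shrink as the context fills.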

Results

  • Speed: ~64 tok/s initial generation at 16K context with --n-cpu-moe 2, measured on an RTX 3060 12 GB (Ryzen 7 5700X, 32 GB RAM, CUDA 13.0) by QuantiusBenignus on the official llama.cpp gpt-oss guide — the report reads "Leaves about 600 MB VRAM budget, with 64 tok/sec initial generation rate." With a small 2048-token context that fits entirely on the GPU (no offload), the same author measured the inference speed reaching "75 tok/sec". Our backend has no benchmark for this pair yet (/check/gpt-oss-20b/rtx-3060 returns verdict unknown); if you measure gpt-oss-20b on an RTX 3060, please submit it via /contribute.
  • VRAM usage: The MXFP4 GGUF is 12.11 GB on disk (HF tree API), and the official llama.cpp memory table puts the fully-resident deployment at 14.9 GB at 8K / 15.5 GB at 32K — over the 12 GB envelope, which is exactly why expert offload is required. With --n-cpu-moe 2 the resident footprint drops under 12 GB; on the same RTX 3060 12 GB the guide reports ~600 MB VRAM spare at 16K context. See /check/gpt-oss-20b/rtx-3060.
  • Quality notes: gpt-oss 20B is post-trained with reasoning support and tool-use. The MXFP4 quantization is the native release format — per the HF card, the models were post-trained with "MXFP4 quantization of the MoE weights" and "All evals were performed with the same MXFP4 quantization" — so there is no separate higher-precision baseline to compare against on consumer hardware.
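To take your own speed reading, one option is to pull the generation rate out of the server's response. The sketch below assumes the JSON returned by llama-server carries a timings object with a predicted_per_second field; treat that field name as an assumption to verify against your build. extract_tps is a hypothetical helper, and the sample body is made up, not a measurement.

```shell
#!/bin/sh
# extract_tps JSON
# Prints the number that follows "predicted_per_second" in a response body.
# Assumes the key appears once; prints nothing if it is absent.
extract_tps() {
    printf '%s\n' "$1" |
        sed -n 's/.*"predicted_per_second":[ ]*\([0-9.][0-9.]*\).*/\1/p'
}

# Trimmed, made-up response body used purely to exercise the parser.
sample='{"timings":{"predicted_n":128,"predicted_per_second":12.5}}'
echo "generation speed: $(extract_tps "$sample") tok/s"
```

In practice you would pass the body returned by the curl call from the Running section instead of the sample, and note the context length and --n-cpu-moe value alongside the number.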

For the full benchmark data and cross-card compare, see /check/gpt-oss-20b/rtx-3060.

Troubleshooting

ollama run gpt-oss:20b is very slow or spills to CPU

Ollama tries to load the whole model on the GPU and falls back to mixed CPU/GPU when it doesn't fit — on a 12 GB card that means uncontrolled spillover rather than a deliberate expert split. The Ollama download is ~14 GB, larger than the 3060's 12 GB VRAM, so part of it always lands on the CPU with no control over which part. For predictable 12 GB behaviour, use llama.cpp's --n-cpu-moe directly as documented above so you control which layers leave the GPU.

Out of memory at launch on the 12 GB card

The MXFP4 weights (12.11 GB on disk) plus compute buffers and KV cache exceed 12 GB if everything is resident; the official memory table lists 14.9 GB at 8K context. A 12 GB desktop card with a monitor attached exposes only ~10.5–11.3 GB usable, so you must offload. If the Running command still OOMs, apply these in order: (a) make sure you have not carried over -ub 2048 -b 2048 from the Installation example, so the micro-batch falls back to its default of 512; on the same RTX 3060, QuantiusBenignus reports that with default -ub they could fit 32K context and still get "60 tokens/sec"; (b) raise --n-cpu-moe from 2 to 3 or higher; (c) reduce --ctx-size. Each step trades a little speed or context for headroom.

Transformers / device_map="auto" OOMs on a 12 GB card

Running the raw safetensors through Transformers on a 12 GB card is fragile. An HF discussion on an RTX 4070 Ti 12 GB — same 12 GB envelope as the 3060 — reports that device_map="auto" OOMs, and the user had to manually pin only 15 of the 24 transformer layers to the GPU and push the rest to the CPU; a follow-up RTX 4070 12 GB reporter could pin only 5. The llama.cpp --n-cpu-moe GGUF path above is the supported way to fit 12 GB — prefer it over hand-rolled Transformers device maps.

Generation slower than a higher-end card

Expected. The RTX 3060 has 360 GB/s of memory bandwidth (TechPowerUp), and token generation is memory-bandwidth-bound, so a card with more bandwidth will be faster at the same offload setting. On top of that, every expert layer kept in host RAM adds CPU-side work and PCIe Gen4 traffic on each token; keep --n-cpu-moe as low as your VRAM allows to minimise that cost. Note also that MXFP4 runs through standard CUDA MoE kernels on Ampere (no native FP4 tensor cores), so you do not get the FP4-accelerated path a Blackwell card would use, but the MXFP4 weights still load and run correctly.
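The bandwidth argument can be turned into a rough ceiling. The sketch below is a back-of-envelope estimate, not a benchmark: it assumes every active weight is read from VRAM once per token and that all of them are stored at about 4.25 bits (4-bit values plus one 8-bit scale per 32-element block), which overstates how much of the model is MXFP4. bandwidth_bound_tps is a hypothetical helper.

```shell
#!/bin/sh
# bandwidth_bound_tps BW_GBPS ACTIVE_PARAMS_BILLIONS BITS_PER_WEIGHT
# Upper bound on tokens/s if each active weight is read once per token:
#   bandwidth / (active parameters * bytes per weight)
bandwidth_bound_tps() {
    awk -v bw="$1" -v params="$2" -v bits="$3" \
        'BEGIN { printf "%.0f\n", bw / (params * bits / 8) }'
}

# 360 GB/s and 3.6B active parameters come from the text above;
# 4.25 bits per weight is the assumption described in the lead-in.
echo "ceiling: ~$(bandwidth_bound_tps 360 3.6 4.25) tok/s"
```

The reported 64 to 75 tok/s sits well under this ceiling, which is unsurprising: non-expert weights and the KV cache are stored at higher precision, kernels never reach peak bandwidth, and offloaded experts run from host RAM. The same function gives a quick way to compare cards by swapping in their bandwidth figure.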

Want hardware numbers for this card?

There is no published backend RTX 3060 gpt-oss-20b benchmark yet. If you have measurements (context length, --n-cpu-moe value, RAM speed), submit them via /contribute so we can populate the /check/gpt-oss-20b/rtx-3060 page.

common questions
How much VRAM does gpt-oss 20B need?

About 12 GB — the minimum this recipe targets.

Which GPUs is gpt-oss 20B tested on?

RTX 3060 (12 GB).

How hard is this setup?

Intermediate — follow the steps above.