How much VRAM does Gemma 4 26B MoE need?

About 29 GB — the minimum this recipe targets.

How hard is this setup?

Intermediate — follow the steps above.

Gemma 4 26B A4B-it on RTX 5090: Q8_0 Quality Tier via ggml-org GGUF + llama.cpp

What You'll Build

A local Gemma 4 26B A4B-it chat assistant — text + image input, text output — running on an RTX 5090 (32 GB GDDR7) through llama.cpp with the ggml-org/gemma-4-26B-A4B-it-GGUF mirror. The 5090's 32 GB envelope unlocks the Q8_0 quality tier (26.86 GB on disk per the HF tree) — a quant tier that does NOT fit a 24 GB card and was forced to Q4_K_M on both the RTX 3090 sibling recipe and RTX 4090 sibling recipe. The optional mmproj-Q8_0 vision encoder (806 MB) layers image input — OCR, chart reading, document parsing — on top.

Hardware data: RTX 5090 (32 GB GDDR7) · Q4_K reference · 180.3 tok/s generation + 8,799.2 tok/s prompt processing @ 4K context · See benchmark data

ℹ️ Variant pinned — 26B A4B-it (the MoE). Gemma 4 ships in four sizes — E2B, E4B, 26B A4B, and 31B Dense. This recipe targets the 26B A4B-it (google/gemma-4-26B-A4B-it) — the Mixture-of-Experts variant with 25.2B total parameters / 3.8B active per token. The 31B Dense variant (google/gemma-4-31B-it) is a different model and runs ~4× slower at the same quant tier — see Troubleshooting.

⚠️ All 25.2B parameters must be resident in VRAM — even though only 3.8B are "active" per token. Gemma 4 26B A4B is a router-per-token sparse MoE: per the HF model card, each token's forward pass picks 8 of 128 experts (plus 1 shared) at runtime, so you cannot pre-prune. "Active parameters" tells you the per-token compute cost (which is why a 26B-total model runs almost as fast as a 4B model on the same hardware) but not the VRAM cost. At Q8_0, weights alone are 26.86 GB on disk and stay roughly the same in VRAM — plan on ~27 GB for weights + ~1.5 GB for the KV cache at 8K context + ~1 GB for the optional vision encoder = ~29 GB peak on the 5090's 32 GB envelope.

Requirements

Component	Minimum	Tested
GPU	32 GB VRAM for Q8_0 path (Q8_0 weights alone are 26.86 GB; KV cache + vision encoder need headroom). 24 GB cards must use Q4_K_M — see sibling recipes.	RTX 5090 (32 GB GDDR7)
RAM	16 GB system	—
Storage	27 GB (Q8_0) or 17 GB (Q4_K_M) or 28 GB with mmproj vision encoder	—
Software	llama.cpp (recent build, post Apr 2026 chat-template fix, built against CUDA 12.8+/13 for sm_120 kernels), Ollama, or LM Studio	—
License	Apache 2.0 (Gemma 4 License)	commercial use permitted

Installation

1. Build (or update) llama.cpp with CUDA 12.8+/13

The ggml-org GGUF mirror is the llama.cpp team's own packaging — llama-cli's -hf flag fetches and caches it for you. You need a recent llama.cpp build that includes Google's April 2026 chat-template fix (noted in the unsloth canonical guide) and is compiled against CUDA 12.8 or newer so the resulting binary includes sm_120 (Blackwell) kernels for the 5090:

git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES="120"
cmake --build build --config Release -j --target llama-cli llama-server

Setting -DCMAKE_CUDA_ARCHITECTURES="120" produces a Blackwell-native binary without dragging in unused older arches. If you also use the same binary on Ada/Ampere cards, broaden to "86;89;120". The resulting llama-cli and llama-server binaries land in ./build/bin/. llama.cpp's native CUDA kernels do not depend on FlashAttention-2, so the open Dao-AILab/flash-attention Issue #2168 (no FA2 sm_120 kernels on Blackwell as of mid-2026) does NOT affect this recipe — see Troubleshooting if you wire up FA2 manually anyway.

2. Run with the official ggml-org GGUF — Q8_0 quality tier (5090 unlock)

The simplest path — -hf downloads the Q8_0 shard (26.86 GB) on first call:

./build/bin/llama-cli \
    -hf ggml-org/gemma-4-26B-A4B-it-GGUF:Q8_0 \
    --n-gpu-layers 99 \
    --ctx-size 8192 \
    --temp 1.0 --top-p 0.95 --top-k 64

The sampling parameters (temp=1.0, top_p=0.95, top_k=64) are Google's recommended Gemma 4 defaults from the HF card. --n-gpu-layers 99 offloads all layers to the 5090. Q8_0 is the recommended tier for 32 GB cards — it preserves weight quality much closer to BF16 than Q4_K_M does, and 24 GB cards (3090 / 4090) can't fit it: the 26.86 GB file alone exceeds 24 GB and leaves no room for KV cache or activations.

3. (Alternative) Q4_K_M for maximum context headroom

If you want to maximize context length or run other models alongside, switch to the Q4_K_M tier — same shard the RTX 3090 sibling recipe and RTX 4090 sibling recipe use:

./build/bin/llama-cli \
    -hf ggml-org/gemma-4-26B-A4B-it-GGUF:Q4_K_M \
    --n-gpu-layers 99 \
    --ctx-size 32768 \
    --temp 1.0 --top-p 0.95 --top-k 64

Q4_K_M weights are 16.80 GB on disk per the HF tree, leaving roughly 14–15 GB free for KV cache and a longer context window (32K shown above, up to the model's 256K maximum if you have the patience for the KV growth).

4. (Optional) Add the vision encoder for image input

If you want the multimodal path — OCR, chart parsing, screen understanding — also download the mmproj adapter (806 MB at Q8_0):

./build/bin/llama-server \
    --model ~/.cache/llama.cpp/ggml-org_gemma-4-26B-A4B-it-GGUF_gemma-4-26B-A4B-it-Q8_0.gguf \
    --mmproj-url https://huggingface.co/ggml-org/gemma-4-26B-A4B-it-GGUF/resolve/main/mmproj-gemma-4-26B-A4B-it-Q8_0.gguf \
    --n-gpu-layers 99 \
    --ctx-size 8192 \
    --temp 1.0 --top-p 0.95 --top-k 64 \
    --host 0.0.0.0 --port 8080

Then POST images as base64 to http://localhost:8080/v1/chat/completions using the OpenAI-compatible API. Per the HF card best practices, place the image content before the text in the prompt.

5. (Alternative) Pull via Ollama

If you prefer Ollama's container model:

ollama pull gemma4:26b
ollama run gemma4:26b

Ollama's gemma4:26b tag defaults to Q4_K_M and bundles its own vision encoder. To exercise the 5090's 32 GB envelope at Q8_0, request the higher tier explicitly with ollama pull gemma4:26b-q8_0.

Running

A typical interactive session at Q8_0:

./build/bin/llama-cli \
    -hf ggml-org/gemma-4-26B-A4B-it-GGUF:Q8_0 \
    --n-gpu-layers 99 \
    --ctx-size 8192 \
    --temp 1.0 --top-p 0.95 --top-k 64 \
    --color -i

First launch downloads ~27 GB; subsequent launches start in seconds from the local cache. To enable Gemma 4's built-in reasoning mode, prepend <|think|> to the system prompt (the HF card describes the full control-token grammar).

Results

Speed (Q4_K reference): 180.3 tok/s token generation @ 4K context at Q4_K, per the Hardware Corner RTX 5090 LLM benchmark table — row labelled Gemma4 26B (Q4_K), column 4k Ctx under the Token Generation sub-table. Prompt processing reaches 8,799.2 tok/s at the same row/column under the Prompt Processing sub-table. The full context ladder (Token Generation): 180.3 → 167.2 → 159.4 → 149.4 → 130.2 → 106.0 tok/s at 4K / 16K / 32K / 64K / 128K / 256K — gentle degradation as KV cache grows. That's ~1.5× faster than the RTX 4090 reference and ~1.5× faster than the RTX 3090's 119.4 tok/s at the same row in the Hardware Corner RTX 3090 table. Q8_0 generation throughput on the 5090 is not directly measured in the cited Hardware Corner page; expect a ~30-40% slowdown vs Q4_K because Q8_0 doubles the per-weight memory bandwidth requirement — the 5090's ~1,792 GB/s bandwidth is among the highest on any consumer card, but Q8_0 still moves more bytes per token than Q4_K. A community benchmark via /contribute would pin Q8_0 throughput.
VRAM usage: ~27 GB resident for Q8_0 weights alone (26.86 GB on disk per the ggml-org HF tree), plus ~1.5 GB KV cache at 8K context, plus an optional ~1 GB for the mmproj Q8_0 vision encoder — ~29 GB derived runtime envelope on a 32 GB card. The 5090's 32 GB envelope provides ~3 GB headroom for context expansion (32K context adds ~3 GB more KV at Q8_0 weights). For comparison: the unsloth canonical guide lists "16–18 GB" for 26B A4B at 4-bit — that's the Q4_K_M number the sibling 24 GB cards use; the 5090's headroom flips the recommendation from "fit it in" to "max the quality tier".
Quality notes: Per Google's HF model card benchmarks, the 26B A4B variant scores 82.6% MMLU Pro, 88.3% AIME 2026, 77.1% LiveCodeBench v6, and 82.4% MATH-Vision — all close to the larger 31B Dense and substantially ahead of Gemma 3 27B. The Q8_0 quality uplift over Q4_K_M is small but measurable on benchmarks (typically 0.5–1.5 percentage points across reasoning suites) and subjectively noticeable on long-form output and rare-token domains (code, multilingual, technical writing) — which is the practical case for spending the 5090's extra envelope on weight quality rather than context length.

For the full benchmark data, see /check/gemma4-26b/rtx-5090.

Troubleshooting

NVFP4 path exists but is enterprise-only today

NVIDIA published nvidia/Gemma-4-26B-A4B-NVFP4 — an FP4 microscaling quant that targets Blackwell's 5th-generation tensor cores directly and would, in principle, be the fastest path on this card. Two caveats keep this off the recipe's primary path:

The NVIDIA card's Supported Runtime Engine(s) lists only vLLM, and the test hardware is B200 (datacenter Blackwell), not consumer 5090. The vLLM path is runnable on a 5090 in principle but is enterprise-serving-stack-oriented rather than the chat-on-a-desktop case this recipe targets.
Community GGUF conversions exist (catlilface/Gemma-4-26B-A4B-NVFP4-GGUF, 17.7 GB NVFP4) but the README explicitly states "Official llama.cpp doesn't currently support inference; use custom Docker image" — i.e. the NVFP4 path requires a forked, non-mainline llama.cpp build today. Watch the catlilface mirror's README and the city96/ComfyUI-GGUF repo for mainline NVFP4 loader support before treating NVFP4 as a viable consumer chat path on this card.

Once mainline llama.cpp / ComfyUI ships NVFP4 loaders, expect roughly Q4_K_M file size (~17 GB) with quality closer to Q6_K and Blackwell-native speed — a meaningful upgrade over Q4_K_M for the 5090 specifically.

Garbled `<|channel>thought` tokens in output

Symptom: the model emits raw <|channel>thought\n control tokens in conversational output, or reasoning mode produces malformed thinking blocks. Cause: an older llama.cpp / Ollama / LM Studio build predates Google's April 2026 chat-template revision. Fix: per the unsloth Gemma 4 guide ("Apr 11 update: Gemma 4 is now updated with Google's updated chat template + llama.cpp fixes") — rebuild llama.cpp from the latest main, or ollama pull gemma4:26b again to refresh the chat template metadata.

FlashAttention-2 build errors on RTX 5090

Symptom: if you wire up a non-llama.cpp FA2 path (vLLM, transformers, or any framework calling flash_attn_func), you may hit no kernel image is available for execution on the device on the 5090 — the FA2 sm_120 kernel set is incomplete as of mid-2026 (Dao-AILab/flash-attention Issue #2168, open). Fix: this recipe's llama.cpp path is unaffected because llama.cpp uses its own CUDA kernels, not FA2. For non-llama.cpp paths on the 5090 today, set attn_implementation="sdpa" (PyTorch's scaled-dot-product attention — Blackwell-supported) or "eager" until FA2 sm_120 ships.

"Out of memory" when increasing `--ctx-size` past 32K at Q8_0

Each additional 1K of KV cache adds roughly 100–130 MB at Q8_0 (the heavier weight tier carries a slightly larger KV cache per token). The recipe's default --ctx-size 8192 leaves ~3 GB headroom on the 32 GB card; pushing to --ctx-size 32768 consumes ~3 GB additional and tightens the budget — --ctx-size 65536 will OOM at Q8_0. To go longer-context on the 5090, drop to Q5_K_M (21.15 GB per the unsloth GGUF tree) which fits ~9 GB more KV cache, or Q4_K_M (16.80 GB) which fits the full 256K context with room to spare.

Token repetition collapse on long generation

A Gemma 4 open issue (community-reported, not yet maintainer-acknowledged) describes token-repetition collapse during long generation affecting both 31B Dense and 26B A4B. The workaround pattern is the standard sampling fix: increase --repeat-penalty to 1.15-1.2 in llama.cpp, or set frequency_penalty=0.3 in the OpenAI-compatible API. If the issue persists, drop generation length below 4K and chunk longer outputs across multiple calls.

Considering Q5_K_M, Q6_K, or BF16 instead?

The unsloth GGUF mirror ships the full per-quant-tier ladder via the HF tree API: Q5_K_M (UD-Q5_K_M) is 21.15 GB, Q6_K is 23.17 GB (newly fits on 5090, was 24 GB-cliff on 3090/4090), and BF16 is 50.5 GB (ggml-org tree) — BF16 does NOT fit a 32 GB card (would need a 48-GB-class card like the RTX 6000 Ada or two 5090s with tensor-parallel split). Q8_0 (26.86 GB) is the recommended sweet spot for a single 5090: it leaves enough room for 8-16K context plus the mmproj vision encoder, and the gap to BF16 in benchmark scores is small enough that Q8_0 captures most of the quality.

Wanted the 31B Dense variant instead?

The google/gemma-4-31B-it variant is a different model — 30.7B dense parameters. At Q4_K_M (~20 GB) it fits the 5090 with comfortable headroom, and at Q8_0 (~33 GB) it would be tight-to-not-fit on this card. Per the Hardware Corner RTX 5090 table, Gemma4 31B (Q4_K) runs roughly 4× slower than the 26B A4B's 180 tok/s — that's the sparse-MoE compute advantage: 31B Dense activates all parameters per token; 26B A4B activates only 3.8B per token. For most local-chat use on a single 5090, the 26B A4B variant in this recipe is the recommended choice — it scores within ~3 percentage points of 31B Dense on MMLU Pro, AIME, and LiveCodeBench per the HF benchmark table.