
Gemma 4 31B on RTX 5090: dense 31B local chat at 61 tok/s, with Q5/Q6 quality headroom in 32 GB

llm · intermediate · 20 GB+ VRAM · Jun 27, 2026

This intermediate recipe sets up Gemma 4 31B on the RTX 5090; the Q4_K_M baseline needs about 20 GB of VRAM.

prerequisites
  • NVIDIA RTX 5090 (32 GB VRAM) — Blackwell sm_120; the extra 8 GB over a 24 GB card is what unlocks the Q5/Q6 quality step-up
  • Recent NVIDIA driver with CUDA 12.8+ support (Blackwell sm_120 — use current pre-built llama.cpp / Ollama CUDA binaries)
  • ~19 GB free disk for the Q4_K_M GGUF (~22 GB for Q5_K_M, ~26 GB for Q6_K)
  • Ollama, llama.cpp, or LM Studio installed
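
One way to confirm the GPU and driver prerequisites is to ask the driver directly. The sketch below assumes `nvidia-smi` is on PATH. The helper names (`parse_gpu_line`, `meets_prerequisites`, `query_gpus`) are illustrative and not part of any tool in this recipe.

```python
import subprocess

# Fields requested from nvidia-smi; memory.total is reported in MiB.
QUERY = "name,memory.total,driver_version,compute_cap"


def parse_gpu_line(line: str) -> dict:
    """Parse one CSV row produced with --format=csv,noheader,nounits."""
    name, mem_mib, driver, cap = [field.strip() for field in line.split(",")]
    return {
        "name": name,
        "vram_gib": int(mem_mib) / 1024,
        "driver": driver,
        "compute_cap": cap,
    }


def meets_prerequisites(gpu: dict, min_vram_gib: float = 20.0) -> bool:
    """True if the card has at least the recipe's 20 GB minimum."""
    return gpu["vram_gib"] >= min_vram_gib


def query_gpus() -> list:
    """Run nvidia-smi and return one dict per GPU (needs an NVIDIA driver)."""
    out = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return [parse_gpu_line(line) for line in out.splitlines() if line.strip()]
```

On the target machine, `query_gpus()` returns one entry per card. A Blackwell sm_120 card should report a compute capability of 12.0, and the driver version can then be checked against NVIDIA's CUDA 12.8 requirement.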

What You'll Build

A local Gemma 4 31B chat / reasoning assistant running on an RTX 5090 (32 GB VRAM) through Ollama or llama.cpp, using the unsloth/gemma-4-31B-it-GGUF weights — a community quant whose base_model links straight back to the canonical google/gemma-4-31B-it. The 5090's 32 GB changes the calculus versus a 24 GB card: Q4_K_M (18.3 GB) is the fast baseline at a measured 61.1 tok/s, but you now have room to step up to Q5_K_M (21.7 GB) or Q6_K (25.2 GB) for higher quality with KV-cache headroom to spare — quants that are tight-to-impossible on a 24 GB 3090.

Hardware data: RTX 5090 (32 GB VRAM) · Q4_K GGUF · 61.1 tok/s generation @ 4K ctx · See benchmark data

ℹ️ This recipe covers text use of a multimodal model. Per the google/gemma-4-31B model card, "Gemma 4 models are multimodal, handling text and image input (with audio supported on small models) and generating text output." The 31B's supported modalities are Text + Image (no audio), with a ~550M vision encoder. This recipe targets the text LLM path — local chat, reasoning, coding — which is how the catalogue files it (llm vertical) and how the backend benchmark was measured. To feed images you would additionally load the multimodal projector (mmproj-*.gguf in the same GGUF repo); that workflow is out of scope here.

⚠️ Variant pinned — 31B Dense, not the 26B A4B MoE sibling. Per the google/gemma-4-31B card, Gemma 4 ships in four sizes — E2B, E4B, 26B A4B (Mixture-of-Experts), and 31B (Dense, this recipe). The 31B has 30.7B total parameters and a 256K-token context window. The 26B A4B is a different architecture (25.2B total / 3.8B active MoE) that runs much faster but is a separate model + slug; this recipe is for the dense 31B only.

Requirements

| Component | Minimum | Tested |
|---|---|---|
| GPU | 20 GB VRAM (Q4_K_M) | RTX 5090 (32 GB) |
| RAM | 32 GB system | |
| Storage | ~19 GB (Q4_K_M GGUF) per the unsloth GGUF tree | ~26 GB if you keep the Q6_K build |
| Driver | CUDA 12.8+ runtime (Blackwell sm_120) | |
| Runtime | Ollama / llama.cpp / LM Studio | Ollama 0.x, llama.cpp recent build |

The model is released under Apache 2.0 (confirmed on the google/gemma-4-31B model card) — the repo is public and ungated. Per the same card, Gemma 4 uses a hybrid attention mechanism that "interleaves local sliding window attention with full global attention, ensuring the final layer is always global", with unified Keys and Values on the global layers and Proportional RoPE (p-RoPE) to keep the long-context KV cache modest. That SWA design is relevant to one llama.cpp runtime caveat — see Troubleshooting.

Installation

Option A — Ollama (recommended one-command path)

Ollama manages the CUDA runtime, the GGUF download, and the attention flags for you, which sidesteps the manual llama.cpp SWA/FlashAttention tuning (see Troubleshooting). The default gemma4:31b tag is Q4_K_M — the fast baseline that delivers the measured 61 tok/s:

# Default tag = Q4_K_M, the 61 tok/s baseline (~20 GB download)
ollama run gemma4:31b

The Ollama gemma4 library page publishes these 31B tags: gemma4:31b (an alias for 31b-it-q4_K_M), gemma4:31b-it-q8_0, gemma4:31b-it-qat, and gemma4:31b-it-bf16. On a 32 GB 5090 only the Q4_K_M tag fits comfortably — q8_0 (~33 GB) and bf16 (~62 GB) do not. Ollama does not publish a Q5_K_M or Q6_K gemma4 tag; to step up to those higher-quality quants (which the 5090's 32 GB fits with KV headroom), use the llama.cpp + GGUF path in Option B below.

Option B — llama.cpp + Unsloth GGUF

This is the canonical CUDA-accelerated llama.cpp loader for a 31B GGUF on a 32 GB Blackwell card.

1. Install llama.cpp

# macOS (Homebrew): Metal build for Apple hardware; it cannot use the RTX 5090
brew install llama.cpp

# Linux — pre-built CUDA binary
# Download the latest "llama-bXXXX-bin-ubuntu-cuda-12.x-x64.zip" asset from:
#   https://github.com/ggml-org/llama.cpp/releases
# Extract and add the bin/ directory to PATH.

2. Download a GGUF

Pull a single quant file from the unsloth/gemma-4-31B-it-GGUF repo instead of the whole quant ladder. Q4_K_M is the speed baseline; on a 32 GB card, Q5_K_M or Q6_K are the better quality choices:

# Shell: install the Hugging Face download helpers
pip install huggingface_hub hf_transfer

# download_gguf.py
import os

# Set before importing huggingface_hub so the hf_transfer backend is picked up
os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="unsloth/gemma-4-31B-it-GGUF",
    local_dir="unsloth/gemma-4-31B-it-GGUF",
    # Q4_K_M = baseline; swap to "*Q5_K_M*" or "*Q6_K*" for the 32 GB quality step-up
    allow_patterns=["*Q4_K_M*"],
)

# Shell: run the script
python download_gguf.py

The resulting file is unsloth/gemma-4-31B-it-GGUF/gemma-4-31B-it-Q4_K_M.gguf (18.32 GB per the unsloth GGUF tree; Q5_K_M is 21.66 GB and Q6_K is 25.20 GB in the same tree).
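
Before starting the server, it can help to confirm that the download is a complete GGUF file. The sketch below relies only on the GGUF container layout (a 4-byte `GGUF` magic followed by a little-endian 32-bit version). The function name `inspect_gguf` is hypothetical.

```python
import os
import struct

GGUF_MAGIC = b"GGUF"  # first four bytes of every GGUF file


def inspect_gguf(path: str) -> dict:
    """Return the GGUF format version and the on-disk size in decimal GB."""
    with open(path, "rb") as f:
        header = f.read(8)
    if len(header) < 8 or header[:4] != GGUF_MAGIC:
        raise ValueError(f"{path} does not start with the GGUF magic bytes")
    (version,) = struct.unpack("<I", header[4:8])  # little-endian uint32
    return {"version": version, "size_gb": os.path.getsize(path) / 1e9}
```

Calling `inspect_gguf("unsloth/gemma-4-31B-it-GGUF/gemma-4-31B-it-Q4_K_M.gguf")` and comparing `size_gb` with the size listed in the repo tree is a cheap way to catch a truncated download.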

3. Start the server

llama-server \
  --model unsloth/gemma-4-31B-it-GGUF/gemma-4-31B-it-Q4_K_M.gguf \
  --ctx-size 8192 \
  --n-gpu-layers 99 \
  --host 0.0.0.0 --port 8080

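--n-gpu-layers 99 offloads every layer to the 5090; the 32 GB envelope leaves ample room for the KV cache even at Q6_K, so layer-streaming is unnecessary. --ctx-size 8192 is a safe starting point, and with this much spare VRAM you can push it considerably higher (see Results). --host 0.0.0.0 exposes the server on every network interface; use 127.0.0.1 if only local clients need it. Note: this command intentionally omits --flash-attn, because Gemma 4's mixed SWA/full-attention layers currently interact badly with the CUDA FlashAttention path in some builds (see Troubleshooting). Check llama-server --help on your build: where --flash-attn accepts on|off|auto and defaults to auto, omitting the flag does not disable FlashAttention, so pass --flash-attn off explicitly. Enable FlashAttention only after confirming your build is stable with this model.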
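
If you switch between quants often, the start command can be generated instead of retyped. This is a minimal sketch that uses only the flags shown in the command above. The function name `llama_server_argv` and its defaults are assumptions, and the file-name pattern follows the quant files named in this recipe.

```python
import shlex

MODEL_DIR = "unsloth/gemma-4-31B-it-GGUF"  # local_dir used in step 2


def llama_server_argv(quant: str = "Q4_K_M", ctx_size: int = 8192,
                      host: str = "127.0.0.1", port: int = 8080) -> list:
    """Build the step 3 llama-server command line for a given quant."""
    model = f"{MODEL_DIR}/gemma-4-31B-it-{quant}.gguf"
    return [
        "llama-server",
        "--model", model,
        "--ctx-size", str(ctx_size),
        "--n-gpu-layers", "99",  # offload every layer to the GPU
        "--host", host,
        "--port", str(port),
    ]


# Print a copy-pasteable command for the Q6_K build with a 16K context.
print(shlex.join(llama_server_argv("Q6_K", ctx_size=16384)))
```

The sketch defaults to the loopback address rather than 0.0.0.0; pass `host="0.0.0.0"` to reproduce the recipe's command exactly. The returned list can also be handed to `subprocess.Popen` directly.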

Option C — LM Studio (GUI)

LM Studio's catalog search ("gemma 4 31B GGUF") surfaces the unsloth builds alongside the bartowski standard-quant ladder and the official google/gemma-4-31B-it-qat-q4_0-gguf QAT build. On a 32 GB card pick gemma-4-31B-it-Q5_K_M.gguf or gemma-4-31B-it-Q6_K.gguf for quality, or stay on Q4_K_M for speed; LM Studio sets --n-gpu-layers to "max" automatically for a 5090.

Running

One-shot prompt via the llama.cpp HTTP server

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gemma-4-31b",
    "messages": [{"role": "user", "content": "Explain sliding-window attention in three sentences."}]
  }'

llama-server exposes an OpenAI-compatible /v1/chat/completions endpoint on the chosen port. With Ollama (Option A), the same prompt is just an interactive turn at the ollama run gemma4:31b REPL, or a POST to Ollama's own /api/chat on port 11434.
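
The same request can be sent from a script. The sketch below uses only the Python standard library and targets the two endpoints named above. The function names are hypothetical, and `"stream": False` is added to the Ollama body so that the reply arrives as a single JSON object.

```python
import json
import urllib.request


def chat_request(prompt: str, backend: str = "llama.cpp") -> urllib.request.Request:
    """Build (but do not send) a chat request for either backend."""
    messages = [{"role": "user", "content": prompt}]
    if backend == "llama.cpp":
        url = "http://localhost:8080/v1/chat/completions"
        body = {"model": "gemma-4-31b", "messages": messages}
    elif backend == "ollama":
        url = "http://localhost:11434/api/chat"
        body = {"model": "gemma4:31b", "stream": False, "messages": messages}
    else:
        raise ValueError(f"unknown backend: {backend}")
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_reply(backend: str, payload: dict) -> str:
    """Pull the assistant text out of a decoded JSON response."""
    if backend == "llama.cpp":
        return payload["choices"][0]["message"]["content"]
    return payload["message"]["content"]


def ask(prompt: str, backend: str = "llama.cpp") -> str:
    """Send the prompt to a running server and return the reply text."""
    with urllib.request.urlopen(chat_request(prompt, backend)) as resp:
        return extract_reply(backend, json.load(resp))
```

With a server running, `ask("Explain sliding-window attention in three sentences.")` returns the reply as a string. Pass `backend="ollama"` to target Ollama instead.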

Interactive terminal (llama.cpp)

llama-cli \
  --model unsloth/gemma-4-31B-it-GGUF/gemma-4-31B-it-Q4_K_M.gguf \
  --ctx-size 8192 \
  --n-gpu-layers 99 \
  --interactive

Press Ctrl-C to interrupt generation; the CLI keeps the model warm in VRAM until exit. First run pays the one-time GGUF load + VRAM warm-up; subsequent prompts are fast.

Results

  • Speed: Per Hardware Corner's RTX 5090 LLM benchmark page, Gemma 4 31B at Q4_K records 61.1 tokens/s generation at 4K context (59.2 t/s @ 16K, 55.4 t/s @ 32K, 51.6 t/s @ 64K, 43.4 t/s @ 128K), with prompt-processing prefill at 3,395 tokens/s @ 4K. Confirmed by /check/gemma4-31b/rtx-5090 (Hardware Corner-sourced, last verified 2026-05-15). These are two distinct numbers: ~61 t/s is how fast the model writes the answer, and ~3,395 t/s is how fast it reads your prompt. Prefill is normally 30–100× faster than generation; here the ratio is about 56×. The figures are for Q4_K; a Q5_K_M or Q6_K build trades some of this throughput for higher quality. If you measure the higher quants, please contribute the numbers via the submission form so the next reader gets a first-party datapoint.
  • VRAM usage: The Q4_K_M weights are 18.32 GB on disk per the unsloth GGUF tree; with the KV cache + activations at 4K–8K context that lands around a ~20 GB runtime envelope, leaving ~12 GB free on a 32 GB card. That spare headroom is the 5090's advantage: it absorbs a much longer context window (the 256K-native KV cache scales with --ctx-size) or the larger Q5/Q6 weight files, where a 24 GB card has to choose. See /check/gemma4-31b/rtx-5090 for the benchmark detail; if your peak differs, contribute it via /contribute.
  • Quality notes: On a 32 GB card you are no longer pinned to Q4. From the unsloth tree: Q5_K_M is 21.66 GB and Q6_K is 25.20 GB on disk — both fit comfortably with KV-cache room, unlike on a 24 GB card where Q5 is very tight and Q6 doesn't fit. Q8_0 is 32.64 GB, which exceeds the 32 GB envelope once KV cache + activations are added — so Q6_K is the practical quality ceiling for this pair, with Q4_K_M as the speed baseline.
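
To see what the two speed figures mean for a single request, the sketch below turns them into a rough latency estimate. It assumes the 4K-context Q4_K rates quoted above stay constant for the whole request, which is optimistic at longer contexts where generation slows. The function name is illustrative.

```python
PREFILL_TPS = 3395.0   # prompt tokens read per second at 4K context (Q4_K)
GENERATE_TPS = 61.1    # answer tokens written per second at 4K context (Q4_K)


def estimate_seconds(prompt_tokens: int, answer_tokens: int) -> float:
    """Rough end-to-end time: read the prompt, then write the answer."""
    return prompt_tokens / PREFILL_TPS + answer_tokens / GENERATE_TPS


ratio = PREFILL_TPS / GENERATE_TPS      # how much faster prefill is
total = estimate_seconds(4000, 500)     # 4000-token prompt, 500-token answer
print(f"prefill is {ratio:.1f}x faster; 4000 in / 500 out takes about {total:.1f} s")
```

Under these assumptions generation dominates: the 500-token answer accounts for most of the wall-clock time even though the prompt is eight times longer.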

For the full benchmark data and other-GPU comparisons, see /check/gemma4-31b/rtx-5090.

Troubleshooting

llama.cpp crashes with "illegal memory access" when FlashAttention is enabled

Gemma 4's hybrid architecture mixes sliding-window-attention (SWA) layers (head dim 256) with full-attention layers (head dim 512). llama.cpp Issue #22527 (open, community-reported, no maintainer fix at time of writing) describes two effects on Gemma 4 31B. With --flash-attn on, llama-server crashes with a CUDA illegal memory access, consistently after the second SWA KV-cache context checkpoint. With FlashAttention disabled, the SWA/full head-dim mismatch makes llama.cpp pad the V cache to the larger size, which inflates V-cache memory. The reporter's card was a 4060 Ti, but the reported root cause is the Gemma 4 SWA path in the CUDA backend, so the same behavior should be expected on the 5090's Blackwell CUDA backend.

Workaround: prefer Ollama (Option A), which manages these flags for you, or run llama-server without --flash-attn (the recipe's default command). Track the upstream issue for a build-level fix before re-enabling FlashAttention with this model. On a 32 GB card the V-cache padding cost is easily absorbed, so running without FlashAttention is rarely a memory problem here.
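
The size of the V-cache padding effect can be estimated from the two head dimensions alone. The sketch below is a per-layer, per-token calculation. It assumes that only V is padded and that the padding target is the 512 full-attention head dim, which is a reading of the issue report rather than a verified description of llama.cpp internals.

```python
SWA_HEAD_DIM = 256    # sliding-window layers (from the issue report)
FULL_HEAD_DIM = 512   # full-attention layers (from the issue report)


def kv_elements_per_token(k_dim: int, v_dim: int, n_kv_heads: int = 1) -> int:
    """Cached elements per token for one layer: K plus V across KV heads."""
    return n_kv_heads * (k_dim + v_dim)


# ASSUMPTION: K keeps its native width and only V is padded up to 512.
unpadded = kv_elements_per_token(SWA_HEAD_DIM, SWA_HEAD_DIM)
padded = kv_elements_per_token(SWA_HEAD_DIM, FULL_HEAD_DIM)
print(f"SWA-layer KV cache growth from V padding: {padded / unpadded:.2f}x")
```

Under that assumption each SWA layer's cache grows by half, while full-attention layers are unchanged, so the overall increase is smaller than 1.5x.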

Want even more context, or a higher quant?

The 5090's spare ~12 GB at Q4_K_M is real working room. Two ways to spend it:

  1. Longer context. Raise --ctx-size (e.g. 16384 or 32768). The Gemma 4 card notes the global layers use unified Keys and Values + p-RoPE to keep the long-context KV cache modest, and the model's native window is 256K tokens — so on 32 GB you can push context far past the 8K default before pressure builds. If you do hit a ceiling at very long context, recent llama.cpp builds support --cache-type-k q8_0 --cache-type-v q8_0 to halve KV-cache size — but note that flag path currently requires FlashAttention on llama.cpp, which conflicts with the Gemma-4 SWA caveat above; test stability first or stay on Ollama.
  2. Higher quant. Swap the Q4_K_M file for Q5_K_M (21.66 GB) or Q6_K (25.20 GB) from the unsloth tree for better output quality. Both fit a 32 GB card; only Q8_0 (32.64 GB) is out of reach once KV cache + activations are added.
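
Both options draw on the same spare VRAM, so it helps to see the headroom per quant side by side. The sketch below uses the file sizes quoted in this recipe and treats all figures in the same GB units the recipe uses. The overhead constant is an assumption derived from the ~20 GB short-context envelope quoted for Q4_K_M; a longer context will consume more.

```python
VRAM_GB = 32.0
# On-disk GGUF sizes in GB, from the unsloth tree as quoted in this recipe.
QUANT_SIZES_GB = {"Q4_K_M": 18.32, "Q5_K_M": 21.66, "Q6_K": 25.20, "Q8_0": 32.64}
# ASSUMPTION: KV cache + activations at 4K-8K context, taken from the
# ~20 GB runtime envelope quoted for the 18.32 GB Q4_K_M file.
OVERHEAD_GB = 20.0 - 18.32


def headroom_gb(quant: str, vram_gb: float = VRAM_GB) -> float:
    """VRAM left after weights plus the assumed short-context overhead."""
    return vram_gb - QUANT_SIZES_GB[quant] - OVERHEAD_GB


for quant in QUANT_SIZES_GB:
    spare = headroom_gb(quant)
    verdict = "fits" if spare > 0 else "does not fit"
    print(f"{quant:7s} {verdict:12s} headroom {spare:+.2f} GB")
```

Calling `headroom_gb("Q5_K_M", vram_gb=24.0)` shows the same arithmetic for a 24 GB card, where Q5_K_M is left with well under 1 GB and Q6_K goes negative.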

Want full BF16 quality or a different runtime (vLLM / Transformers)?

The canonical transformers path loads BF16 weights via pip install -U transformers torch accelerate and AutoModelForCausalLM — at 30.7B parameters that is ~62 GB in BF16, which does not fit even a 32 GB card. For single-5090 deployment, a Q4–Q6 GGUF in Ollama or llama.cpp is the realistic loader path. To run the full-precision model or a vLLM server you need a larger GPU (or multi-GPU). The GGUF here is the consumer-card path the model card itself targets: "consumer GPUs and workstations (26B A4B, 31B)".
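
The BF16 figure follows directly from the parameter count. The sketch below reproduces it and, as a derived estimate only, backs out the average bits per weight implied by the GGUF file sizes quoted in this recipe. It assumes the whole file is weights, which slightly overstates the result.

```python
PARAMS = 30.7e9  # total parameters, per the model card


def weights_gb(bytes_per_param: float, params: float = PARAMS) -> float:
    """Weight size in decimal GB for a given storage width."""
    return params * bytes_per_param / 1e9


def bits_per_weight(file_gb: float, params: float = PARAMS) -> float:
    """Average bits per parameter implied by a file size (estimate only)."""
    return file_gb * 1e9 * 8 / params


print(f"BF16 weights: {weights_gb(2):.1f} GB")               # 2 bytes per parameter
print(f"Q4_K_M: ~{bits_per_weight(18.32):.2f} bits/weight")  # from 18.32 GB
print(f"Q6_K:   ~{bits_per_weight(25.20):.2f} bits/weight")  # from 25.20 GB
```

At 2 bytes per parameter the weights alone come to about 61.4 GB, nearly twice the 5090's 32 GB, before any KV cache or activations are counted.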

Blackwell sm_120 — anything special for the 5090?

The RTX 5090 is Blackwell (sm_120). Use current pre-built CUDA binaries: the llama-bXXXX-bin-ubuntu-cuda-12.x-x64.zip releases built against CUDA 12.8+ include sm_120 kernels, and Ollama's bundled runtime already targets Blackwell — no manual wheel-building is needed for GGUF inference. (This is a GGUF/llama.cpp recipe, so the FlashAttention-2 sm_120 wheel gap that bites transformers/vllm quick-start snippets on Blackwell does not apply here — llama.cpp's own attention kernels are used, and this recipe runs them without --flash-attn per the SWA caveat above.)

common questions
How much VRAM does Gemma 4 31B need?

About 20 GB — the minimum this recipe targets.

Which GPUs has Gemma 4 31B been tested on?

RTX 5090 (32 GB).

How hard is this setup?

Intermediate — follow the steps above.