How much VRAM does Gemma 4 31B need?

About 24 GB — the minimum this recipe targets.

How hard is this setup?

Intermediate: follow the steps below.

Gemma 4 31B on RTX 3090: dense 31B local chat via Q4_K_M GGUF in Ollama / llama.cpp

What You'll Build

A local Gemma 4 31B chat / reasoning assistant running on an RTX 3090 (24 GB VRAM) through Ollama or llama.cpp, using the unsloth/gemma-4-31B-it-GGUF Q4_K_M weights (18.3 GB on disk) — a Lesson-grade community quant whose base_model links straight back to the canonical google/gemma-4-31B-it. At 31B dense parameters this pair sits on the upper edge of single-RTX-3090 LLM territory: Q4-class quants are the only ones that fit with usable KV-cache headroom on a 24 GB card, and context-window discipline matters here in a way it never does on smaller 8B / 14B recipes.

Hardware data: RTX 3090 (24 GB VRAM) · Q4_K GGUF · 34.7 tok/s generation @ 4K ctx

ℹ️ This recipe covers text use of a multimodal model. Per the google/gemma-4-31B model card, "Gemma 4 models are multimodal, handling text and image input (with audio supported on small models) and generating text output." The 31B's supported modalities are Text + Image (no audio), with a ~550M vision encoder. This recipe targets the text LLM path — local chat, reasoning, coding — which is how the catalogue files it (llm vertical) and how the backend benchmark was measured. To feed images you would additionally load the multimodal projector (mmproj-*.gguf in the same GGUF repo); that workflow is out of scope here.

⚠️ Variant pinned — 31B Dense, not the 26B A4B MoE sibling. Per the google/gemma-4-31B card, Gemma 4 ships in four sizes — E2B, E4B, 26B A4B (Mixture-of-Experts), and 31B (Dense, this recipe). The 31B has 30.7B total parameters and a 256K-token context window. The 26B A4B is a different architecture (25.2B total / 3.8B active MoE) that runs much faster but is a separate model + slug; this recipe is for the dense 31B only.

Requirements

Component	Minimum	Tested
GPU	24 GB VRAM (Q4_K_M)	RTX 3090 (24 GB)
RAM	32 GB system	—
Storage	~19 GB (Q4_K_M GGUF) per the unsloth GGUF tree	—
Driver	CUDA 12.x runtime (Ampere sm_86)	—
Runtime	Ollama / llama.cpp / LM Studio	Ollama 0.x, llama.cpp recent build

The model is released under Apache 2.0 (confirmed on the google/gemma-4-31B model card) — the repo is public and ungated. Per the same card, Gemma 4 uses a hybrid attention mechanism that "interleaves local sliding window attention with full global attention, ensuring the final layer is always global", with unified Keys and Values on the global layers and Proportional RoPE (p-RoPE) to keep the long-context KV cache modest. That SWA design is relevant to one llama.cpp runtime caveat — see Troubleshooting.

Installation

Option A — Ollama (recommended one-command path)

Ollama manages the CUDA runtime, the GGUF download, and the attention flags for you, which sidesteps the manual llama.cpp SWA/FlashAttention tuning (see Troubleshooting). The default gemma4:31b tag is Q4_K_M (20 GB download) — the right quant for a 24 GB card.

# Default tag = Q4_K_M, the fits-24GB quant
ollama run gemma4:31b

# Equivalent explicit Q4_K_M tag (identical quant, pinned name)
ollama run gemma4:31b-it-q4_K_M

The tag list (gemma4:31b, gemma4:31b-it-q4_K_M, gemma4:31b-it-q8_0, gemma4:31b-it-bf16, …) is published on the Ollama gemma4 library page. On a 24 GB 3090, stay on q4_K_M; q8_0 (~33 GB) and bf16 do not fit.

Option B — llama.cpp + Unsloth Q4_K_M GGUF

This is the canonical CUDA-accelerated llama.cpp loader for a 31B GGUF on a 24 GB Ampere card.

1. Install llama.cpp

# macOS (Homebrew)
brew install llama.cpp

# Linux — pre-built CUDA binary
# Download the latest "llama-bXXXX-bin-ubuntu-cuda-12.x-x64.zip" asset from:
#   https://github.com/ggml-org/llama.cpp/releases
# Extract and add the bin/ directory to PATH.

2. Download the Q4_K_M GGUF

Pull only the single Q4_K_M file (18.3 GB) from the unsloth/gemma-4-31B-it-GGUF repo instead of the whole quant ladder:

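# Shell: install the Hugging Face client and the optional fast-transfer backend
pip install huggingface_hub hf_transfer

# download_q4km.py
import os

# Set this before importing huggingface_hub; the library reads it at import time
os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"

from huggingface_hub import snapshot_download

# Fetch only files whose names match *Q4_K_M*, not every quant in the repo
snapshot_download(
    repo_id="unsloth/gemma-4-31B-it-GGUF",
    local_dir="unsloth/gemma-4-31B-it-GGUF",
    allow_patterns=["*Q4_K_M*"],
)

# Shell: run the script
python download_q4km.py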

The resulting file is unsloth/gemma-4-31B-it-GGUF/gemma-4-31B-it-Q4_K_M.gguf (18.3 GB per the unsloth GGUF tree).
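Before starting the server it can be worth confirming that the download finished. The sketch below checks two things: that the file starts with the four-byte `GGUF` magic, and that its size is close to the 18.3 GB quoted above. The function name `check_gguf` and the 0.5 GB tolerance are illustrative assumptions, and sizes are compared in decimal GB on the assumption that the repo listing uses that unit.

```python
import os

GGUF_MAGIC = b"GGUF"  # GGUF files begin with these four bytes


def check_gguf(path, expected_gb=18.3, tolerance_gb=0.5):
    """Return (ok, message) after checking the magic bytes and approximate size.

    expected_gb / tolerance_gb are illustrative defaults, not official values.
    """
    if not os.path.isfile(path):
        return False, f"missing: {path}"
    with open(path, "rb") as fh:
        if fh.read(4) != GGUF_MAGIC:
            return False, "not a GGUF file (bad magic bytes)"
    size_gb = os.path.getsize(path) / 1e9  # decimal GB
    if abs(size_gb - expected_gb) > tolerance_gb:
        return False, f"unexpected size: {size_gb:.2f} GB (partial download?)"
    return True, f"ok: {size_gb:.2f} GB"
```

Calling `check_gguf("unsloth/gemma-4-31B-it-GGUF/gemma-4-31B-it-Q4_K_M.gguf")` returns a success flag and a short message. A size mismatch usually points to an interrupted download, not a corrupt quant.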

3. Start the server

llama-server \
  --model unsloth/gemma-4-31B-it-GGUF/gemma-4-31B-it-Q4_K_M.gguf \
  --ctx-size 8192 \
  --n-gpu-layers 99 \
  --host 0.0.0.0 --port 8080

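--n-gpu-layers 99 offloads every layer to the 3090; the 18.3 GB of weights leave room for the KV cache, so layer-streaming is unnecessary. --ctx-size 8192 is a safe starting point. --host 0.0.0.0 listens on every network interface; use 127.0.0.1 if the server should only be reachable from this machine. Note: unlike a typical 32B recipe, this command intentionally omits --flash-attn, because Gemma 4's mixed SWA/full-attention layers currently interact badly with the CUDA FlashAttention path in some builds (see Troubleshooting). Newer llama.cpp builds accept --flash-attn on|off|auto and default to auto; on those builds, omitting the flag does not disable FlashAttention, so pass --flash-attn off explicitly. Re-enable it only after confirming your build is stable with this model.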
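Loading 18.3 GB of weights takes a while, so scripts that talk to the server benefit from waiting until it is ready. The sketch below polls llama-server's `/health` endpoint, which returns an error status while the model is loading and 200 once it can serve requests. The helper names, retry count, and delay are illustrative assumptions; the base URL matches the command above.

```python
import time
import urllib.error
import urllib.request


def server_ready(base_url="http://localhost:8080", timeout=2.0):
    """True once GET /health answers 200; False while loading or unreachable."""
    try:
        with urllib.request.urlopen(f"{base_url}/health", timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Covers connection refused, timeouts, and non-2xx responses (HTTPError)
        return False


def wait_until_ready(probe=server_ready, attempts=60, delay=2.0):
    """Poll `probe` until it returns True; give up after `attempts` tries."""
    for _ in range(attempts):
        if probe():
            return True
        time.sleep(delay)
    return False
```

`wait_until_ready()` returns `True` as soon as the server answers, or `False` after about two minutes with the defaults shown. The `probe` parameter exists only so the loop can be exercised without a running server.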

Option C — LM Studio (GUI)

LM Studio's catalog search ("gemma 4 31B GGUF") surfaces the unsloth Q4_K_M build alongside the bartowski standard-quant ladder and the official google/gemma-4-31B-it-qat-q4_0-gguf QAT build. Pick gemma-4-31B-it-Q4_K_M.gguf and download; LM Studio sets --n-gpu-layers to "max" automatically for a 3090.

Running

One-shot prompt via the llama.cpp HTTP server

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gemma-4-31b",
    "messages": [{"role": "user", "content": "Explain sliding-window attention in three sentences."}]
  }'

llama-server exposes an OpenAI-compatible /v1/chat/completions endpoint on the chosen port. With Ollama (Option A), the same prompt is just an interactive turn at the ollama run gemma4:31b REPL, or a POST to Ollama's own /api/chat on port 11434.
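The two endpoints described above take almost the same request body but return differently shaped responses. The sketch below uses only the Python standard library to show both side by side. The function names are illustrative, the model names are the ones used in this recipe, and the ports are the ones given above (8080 for llama-server, 11434 for Ollama).

```python
import json
import urllib.request


def _post(url, payload, timeout=300):
    """POST a JSON payload and return the decoded JSON response."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.load(resp)


def llama_server_payload(prompt, model="gemma-4-31b"):
    """Request body for llama-server's OpenAI-compatible endpoint."""
    return {"model": model, "messages": [{"role": "user", "content": prompt}]}


def ollama_payload(prompt, model="gemma4:31b"):
    """Request body for Ollama's /api/chat; stream=False asks for a single JSON reply."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
    }


def llama_server_reply(body):
    """OpenAI-style responses nest the text under choices[0].message.content."""
    return body["choices"][0]["message"]["content"]


def ollama_reply(body):
    """Ollama's /api/chat puts the text under message.content."""
    return body["message"]["content"]


def ask_llama_server(prompt, base_url="http://localhost:8080"):
    body = _post(f"{base_url}/v1/chat/completions", llama_server_payload(prompt))
    return llama_server_reply(body)


def ask_ollama(prompt, base_url="http://localhost:11434"):
    body = _post(f"{base_url}/api/chat", ollama_payload(prompt))
    return ollama_reply(body)
```

With either backend running, `ask_llama_server("Explain sliding-window attention in three sentences.")` or the `ask_ollama` equivalent returns the reply as a plain string.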

Interactive terminal (llama.cpp)

llama-cli \
  --model unsloth/gemma-4-31B-it-GGUF/gemma-4-31B-it-Q4_K_M.gguf \
  --ctx-size 8192 \
  --n-gpu-layers 99 \
  --interactive

Press Ctrl-C to interrupt generation; the CLI keeps the model warm in VRAM until exit. First run pays the one-time GGUF load + VRAM warm-up; subsequent prompts are fast.

Results

Speed: Per Hardware Corner's RTX 3090 LLM benchmark page, Gemma 4 31B at Q4_K records 34.7 tokens/s generation at 4K context (33.5 t/s @ 16K, 31.4 t/s @ 32K), with prompt-processing prefill at 1,155.8 tokens/s @ 4K. Confirmed by /check/gemma4-31b/rtx-3090 (Hardware Corner-sourced, last verified 2026-05-15). Note these are two distinct numbers: ~34.7 t/s is how fast the model writes the answer; ~1,156 t/s is how fast it reads your prompt — prefill is normally 30–100× faster than generation.
VRAM usage: 24 GB peak at Q4_K on the 24 GB card per /check/gemma4-31b/rtx-3090 — i.e. the full envelope is in use at 4K context. The Q4_K_M weights are 18.32 GB on disk per the unsloth GGUF tree; the remainder of the 24 GB is KV cache + activations. Longer context will need KV quantization or partial CPU offload (see Troubleshooting).
Quality notes: On a 24 GB card you cannot step up to Q5/Q6/Q8 — from the same unsloth tree, Q5_K_M is 21.66 GB on disk (very tight with any KV cache), Q6_K is 25.20 GB and Q8_0 is 32.64 GB (neither fits). Q4_K_M (18.32 GB) is the quality ceiling for this pair without a larger GPU. If your measurement differs, please contribute it via the submission form so the next reader gets a first-party number.
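The fit reasoning above is simple subtraction, sketched here with the on-disk sizes this recipe quotes from the unsloth GGUF tree. The 3 GB minimum headroom is an illustrative assumption standing in for KV cache, activations, and CUDA overhead, not a measured requirement. The card's capacity and the file sizes are treated as the same unit for simplicity.

```python
VRAM_GB = 24.0  # RTX 3090

# On-disk sizes (GB) quoted in this recipe from the unsloth GGUF tree
QUANT_SIZES_GB = {
    "Q3_K_M": 14.74,
    "UD-Q3_K_XL": 15.38,
    "Q4_K_M": 18.32,
    "Q5_K_M": 21.66,
    "Q6_K": 25.20,
    "Q8_0": 32.64,
}


def headroom_gb(quant, vram_gb=VRAM_GB):
    """VRAM left after loading the weights; a negative value means they do not fit."""
    return round(vram_gb - QUANT_SIZES_GB[quant], 2)


def fits(quant, min_headroom_gb=3.0, vram_gb=VRAM_GB):
    """True if the quant leaves at least `min_headroom_gb` (assumed threshold)."""
    return headroom_gb(quant, vram_gb) >= min_headroom_gb


for name in QUANT_SIZES_GB:
    print(f"{name:>11}: {headroom_gb(name):6.2f} GB headroom, fits={fits(name)}")
```

Under these assumptions Q4_K_M leaves 5.68 GB, Q5_K_M leaves 2.34 GB and misses the threshold, and Q6_K and Q8_0 come out negative.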

For the full benchmark data and other-GPU comparisons, see /check/gemma4-31b/rtx-3090.
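The two throughput figures in the Results combine into a rough end-to-end latency estimate: prompt tokens divided by prefill speed, plus output tokens divided by generation speed. The sketch below uses the 4K-context numbers quoted above and assumes both rates stay constant over the request, which is a simplification.

```python
PREFILL_TPS = 1155.8  # prompt processing @ 4K ctx, as quoted in Results
GENERATE_TPS = 34.7   # generation @ 4K ctx, as quoted in Results


def estimate_seconds(prompt_tokens, output_tokens,
                     prefill_tps=PREFILL_TPS, generate_tps=GENERATE_TPS):
    """Return (prefill_s, generate_s, total_s) assuming constant throughput."""
    prefill_s = prompt_tokens / prefill_tps
    generate_s = output_tokens / generate_tps
    return prefill_s, generate_s, prefill_s + generate_s


prefill_s, generate_s, total_s = estimate_seconds(2000, 500)
print(f"prefill {prefill_s:.1f} s + generate {generate_s:.1f} s = {total_s:.1f} s")
print(f"prefill is about {PREFILL_TPS / GENERATE_TPS:.0f}x faster per token")
```

For a 2,000-token prompt and a 500-token answer this gives roughly 1.7 s of prefill and 14.4 s of generation, so response time on this card is dominated by how long the answer is, not how long the prompt is.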

Troubleshooting

llama.cpp crashes with "illegal memory access" when FlashAttention is enabled

Gemma 4's hybrid architecture mixes sliding-window-attention (SWA) layers (head dim 256) with full-attention layers (head dim 512). A community user reports on llama.cpp Issue #22527 (open, community-reported, no maintainer fix at time of writing) that llama-server on Gemma 4 31B crashes with a CUDA illegal memory access consistently after the second SWA KV-cache context checkpoint when --flash-attn is on, and that disabling FlashAttention instead inflates V-cache padding (the SWA/full head-dim mismatch causes llama.cpp to pad the V cache to the larger size). The reporter's card was a 4060 Ti, but the root cause is the Gemma-4 SWA path in the CUDA backend, which applies equally to the 3090's Ampere CUDA backend. Workaround: prefer Ollama (Option A) which manages these flags for you, or run llama-server without --flash-attn (the recipe's default command) and keep --ctx-size modest. Track the upstream issue for a build-level fix before re-enabling FlashAttention with this model.

Generation slows or OOMs past 8K–16K context

The Q4_K_M weights leave only a few GB for the KV cache on a 24 GB card; once that is used up, generation either slows sharply or fails with an out-of-memory error. The Gemma 4 card notes the global layers use unified Keys and Values + p-RoPE to keep the long-context cache modest, but 256K native context is still far beyond what 24 GB holds. The KV-cache discipline ladder, in order of how much it helps:

1. Cap --ctx-size at 8192 (the recipe default). This keeps headroom for activations.
2. Quantize the KV cache. Recent llama.cpp builds support --cache-type-k q8_0 --cache-type-v q8_0, which roughly halves KV-cache size relative to the default f16. Note that quantizing the V cache currently requires FlashAttention in llama.cpp, which conflicts with the Gemma-4 SWA caveat above; test stability first or stay on Ollama.
3. Drop to a smaller quant. From the unsloth tree, UD-Q3_K_XL is 15.38 GB and Q3_K_M is 14.74 GB on disk; each frees several GB for KV cache at a measurable quality cost.
4. Use a smaller model. If you routinely need very long context, the dense E4B or the 26B A4B MoE sibling are far lighter (different recipes, same Gemma 4 family).
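To see why context length and cache type matter, here is the textbook KV-cache size formula as a minimal sketch. The layer count, KV-head count, and head dimension below are hypothetical placeholders, not Gemma 4 31B's real configuration. The formula also treats every layer as full attention with separate K and V, so it ignores the sliding-window layers and unified K/V that the model card describes; read it as a rough upper-bound shape, not a prediction for this model.

```python
# Bytes per cached element: f16 stores 2 bytes; q8_0 packs 32 int8 values
# plus one f16 scale into a 34-byte block.
BYTES_PER_ELEMENT = {"f16": 2.0, "q8_0": 34 / 32}


def kv_cache_gib(n_ctx, n_layers, n_kv_heads, head_dim, cache_type="f16"):
    """KV-cache size in GiB for a plain full-attention transformer."""
    elements = 2 * n_layers * n_kv_heads * head_dim * n_ctx  # 2 = keys + values
    return elements * BYTES_PER_ELEMENT[cache_type] / 2**30


# HYPOTHETICAL shape for illustration only; not Gemma 4 31B's real config.
shape = dict(n_layers=48, n_kv_heads=8, head_dim=128)

for ctx in (8192, 32768):
    f16 = kv_cache_gib(ctx, cache_type="f16", **shape)
    q8 = kv_cache_gib(ctx, cache_type="q8_0", **shape)
    print(f"ctx={ctx:>6}: f16={f16:.2f} GiB, q8_0={q8:.2f} GiB")
```

For this placeholder shape the cache is 1.5 GiB at 8192 tokens in f16 and about 0.8 GiB in q8_0. The ratio is 34/64 (about 0.53) whatever the shape, which is where the "roughly halves" figure comes from, and the size grows linearly with context length.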

Want full BF16 quality or a different runtime (vLLM / Transformers)?

The canonical transformers path loads BF16 weights via pip install -U transformers torch accelerate and AutoModelForCausalLM. At 30.7B parameters and 2 bytes per parameter, that is ~61 GB in BF16 for the weights alone, which does not fit a 24 GB card. For single-3090 deployment, a Q4-class GGUF in Ollama or llama.cpp is the only realistic loader path. To run the full-precision model or a vLLM server you need a larger GPU (or multi-GPU). The Q4_K_M GGUF here is the consumer-card path the model card itself targets: "consumer GPUs and workstations (26B A4B, 31B)".
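The BF16 figure follows from parameter count times bytes per parameter, and the same arithmetic run backwards gives the effective bits per weight of the Q4_K_M file. The sketch below uses only numbers already quoted in this recipe (30.7B parameters, 18.32 GB on disk). It counts weights only and ignores KV cache and activations.

```python
PARAMS = 30.7e9  # total parameters, per the model card figure quoted above


def weights_gb(params, bytes_per_param):
    """Weight footprint in decimal GB."""
    return params * bytes_per_param / 1e9


def bits_per_weight(file_gb, params):
    """Effective bits per parameter implied by a file size."""
    return file_gb * 1e9 * 8 / params


bf16_gb = weights_gb(PARAMS, 2)          # BF16 stores 2 bytes per parameter
q4_bpw = bits_per_weight(18.32, PARAMS)  # Q4_K_M size from the unsloth tree

print(f"BF16 weights: {bf16_gb:.1f} GB")
print(f"Q4_K_M: {q4_bpw:.2f} bits per weight")
```

This gives about 61.4 GB for BF16, close to the figure quoted above, and roughly 4.8 bits per weight for Q4_K_M, which is why the quantized file is less than a third of the full-precision size.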

Ampere vs Ada — anything special for the 3090?

The RTX 3090 is Ampere (sm_86) — fully supported by mainline CUDA, llama.cpp's CUDA backend, and Ollama's bundled runtime. The default pre-built llama-bXXXX-bin-ubuntu-cuda-12.x-x64.zip releases work out of the box; no special wheel selection is required. Ampere has no FP8 tensor cores (FP8 first shipped on Ada/Hopper), but this recipe ships only Q4_K GGUF weights — there is no FP8 path here, so the arch limitation does not bite.
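To confirm what the driver actually reports, the card's name, VRAM, and compute capability can be read from nvidia-smi. The sketch below assumes a driver recent enough for nvidia-smi to expose the `compute_cap` query field (older drivers may reject it). The function names are illustrative.

```python
import subprocess

# Assumes nvidia-smi is on PATH and supports the compute_cap query field
QUERY = [
    "nvidia-smi",
    "--query-gpu=name,memory.total,compute_cap",
    "--format=csv,noheader,nounits",
]


def parse_gpu_line(line):
    """Parse one CSV line such as 'NVIDIA GeForce RTX 3090, 24576, 8.6'."""
    name, mem_mib, cap = [field.strip() for field in line.split(",")]
    return {"name": name, "vram_mib": int(mem_mib), "compute_cap": cap}


def query_gpus():
    """Run nvidia-smi and return one dict per GPU."""
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
    return [parse_gpu_line(line) for line in out.splitlines() if line.strip()]


def is_sm86(gpu):
    """True for Ampere consumer cards reporting compute capability 8.6."""
    return gpu["compute_cap"] == "8.6"
```

On a machine with a 3090, `query_gpus()` should return an entry with `compute_cap` of `"8.6"` and around 24,576 MiB of VRAM. Anything else suggests the wrong device is visible to the runtime.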