Gemma 4 26B A4B-it on RTX 3090: Local Multimodal Chat via Q4_K_M GGUF + llama.cpp

What You'll Build

A local Gemma 4 26B A4B-it chat assistant — text + image input, text output — running on an RTX 3090 (24 GB VRAM) through llama.cpp with the ggml-org/gemma-4-26B-A4B-it-GGUF Q4_K_M weights (16.80 GB on disk per the HF tree). The optional mmproj-Q8_0 vision encoder (806 MB) unlocks image input — OCR, chart reading, document parsing, screen understanding — all on a single Ampere consumer card.

Hardware data: RTX 3090 (24 GB VRAM) · Q4_K · ~119.4 tok/s generation @ 4K context · See benchmark data

ℹ️ Variant pinned — 26B A4B-it (the MoE). Gemma 4 ships in four sizes — E2B, E4B, 26B A4B, and 31B Dense. This recipe targets the 26B A4B-it (google/gemma-4-26B-A4B-it) — the Mixture-of-Experts variant with 25.2B total parameters / 3.8B active per token. The 31B Dense variant (google/gemma-4-31B-it) is a different model and runs ~4× slower on the same Q4_K_M quant tier — see Troubleshooting.

⚠️ All 25.2B parameters must be resident in VRAM — even though only 3.8B are "active" per token. Gemma 4 26B A4B is a router-per-token sparse MoE: per the HF model card, each token's forward pass picks 8 of 128 experts (plus 1 shared) at runtime, so you cannot pre-prune. "Active parameters" tells you the per-token compute cost (which is why a 26B-total model runs almost as fast as a 4B model on the same hardware) but not the VRAM cost. At Q4_K_M, weights alone are 16.80 GB on disk and stay roughly the same in VRAM — plan on ~17 GB for weights + ~1 GB for the KV cache at modest context + ~1 GB for the optional vision encoder. The 3090's 24 GB envelope leaves ~6 GB headroom — workable for 4K–8K contexts, tight beyond.

Requirements

Component	Minimum	Tested
GPU	24 GB VRAM (Q4_K_M weights alone are 16.80 GB; KV cache + vision encoder need headroom)	RTX 3090 (24 GB)
RAM	16 GB system	—
Storage	17 GB (Q4_K_M GGUF) or 18 GB with mmproj vision encoder	—
Software	llama.cpp (recent build, post Apr 2026 chat-template fix), Ollama, or LM Studio	—
License	Apache 2.0 (Gemma 4 License)	commercial use permitted

Installation

1. Build (or update) llama.cpp

The ggml-org GGUF mirror is the llama.cpp team's own packaging — llama-cli's -hf flag fetches and caches it for you. You need a recent llama.cpp build that includes Google's April 2026 chat-template fix (noted in the unsloth canonical guide) — older binaries will run the model but emit <|channel>thought control tokens in the wrong order.

git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j --target llama-cli llama-server

The GGML_CUDA=ON flag enables CUDA acceleration; on Ampere (sm_86) no special wheel selection is required — the default kernels cover the 3090. The resulting llama-cli and llama-server binaries land in ./build/bin/.

2. Run with the official ggml-org GGUF

The simplest path — -hf downloads the Q4_K_M shard (16.80 GB) on first call:

./build/bin/llama-cli \
    -hf ggml-org/gemma-4-26B-A4B-it-GGUF:Q4_K_M \
    --n-gpu-layers 99 \
    --ctx-size 8192 \
    --temp 1.0 --top-p 0.95 --top-k 64

The sampling parameters (temp=1.0, top_p=0.95, top_k=64) are Google's recommended Gemma 4 defaults from the HF card. --n-gpu-layers 99 offloads all 30 layers to the 3090.

3. (Optional) Add the vision encoder for image input

If you want the multimodal path — OCR, chart parsing, screen understanding — also download the mmproj adapter (806 MB at Q8_0):

# In a separate terminal, run llama-server with both files:
./build/bin/llama-server \
    --model ~/.cache/llama.cpp/ggml-org_gemma-4-26B-A4B-it-GGUF_gemma-4-26B-A4B-it-Q4_K_M.gguf \
    --mmproj-url https://huggingface.co/ggml-org/gemma-4-26B-A4B-it-GGUF/resolve/main/mmproj-gemma-4-26B-A4B-it-Q8_0.gguf \
    --n-gpu-layers 99 \
    --ctx-size 8192 \
    --temp 1.0 --top-p 0.95 --top-k 64 \
    --host 0.0.0.0 --port 8080

Then POST images as base64 to http://localhost:8080/v1/chat/completions using the OpenAI-compatible API. Per the HF card best practices, place the image content before the text in the prompt.

4. (Alternative) Pull via Ollama

If you prefer Ollama's container model:

ollama pull gemma4:26b
ollama run gemma4:26b

Ollama's gemma4:26b tag defaults to Q4_K_M and bundles its own vision encoder.

Running

A typical interactive session:

./build/bin/llama-cli \
    -hf ggml-org/gemma-4-26B-A4B-it-GGUF:Q4_K_M \
    --n-gpu-layers 99 \
    --ctx-size 8192 \
    --temp 1.0 --top-p 0.95 --top-k 64 \
    --color -i

First launch downloads ~17 GB; subsequent launches start in seconds from the local cache. To enable Gemma 4's built-in reasoning mode, prepend <|think|> to the system prompt (the HF card describes the full control-token grammar).

Results

Speed: 119.4 tok/s token generation @ 4K context at Q4_K, per the Hardware Corner RTX 3090 LLM benchmark table — row labelled Gemma4 26B (Q4_K), column 4k Ctx under the Token Generation sub-table. Prompt processing reaches 3,625.6 t/s at the same row/column under the Prompt Processing sub-table. Throughput is closer to a 4B-class model than a 26B-class model because only 3.8B params activate per forward pass. At 16K context the same row reports 115.0 tok/s generation / 3,068.9 t/s prefill — gentle degradation as KV cache grows.
VRAM usage: ~17 GB resident for Q4_K_M weights alone (16.80 GB on disk per the ggml-org HF tree), plus ~1 GB KV cache at 8K context, plus an optional ~1 GB for the mmproj Q8_0 vision encoder — ~18 GB derived runtime envelope on a 24 GB card. The 3090's 24 GB envelope provides ~6 GB headroom for longer contexts (the model supports up to 256K tokens per the HF card) and the vision encoder. Cross-confirmed by the unsloth canonical "How to Run Gemma 4" guide which lists "16–18 GB" for 26B A4B at 4-bit.
Quality notes: Per Google's HF model card benchmarks table, the 26B A4B variant scores 82.6% MMLU Pro, 88.3% AIME 2026, 77.1% LiveCodeBench v6, and 82.4% MATH-Vision — all close to the larger 31B Dense and substantially ahead of Gemma 3 27B. Because only 3.8B parameters are active per token, throughput is closer to a 4B-class model than a 26B-class model, giving the "fast inference at frontier quality" profile that a 24 GB Ampere card like the 3090 is well-suited for.

For the full benchmark data, see /check/gemma4-26b/rtx-3090.

Troubleshooting

Garbled `<|channel>thought` tokens in output

Symptom: the model emits raw <|channel>thought\n control tokens in conversational output, or reasoning mode produces malformed thinking blocks. Cause: an older llama.cpp / Ollama / LM Studio build predates Google's April 2026 chat-template revision. Fix: per the unsloth Gemma 4 guide ("Apr 11 update: Gemma 4 is now updated with Google's updated chat template + llama.cpp fixes") — rebuild llama.cpp from the latest main, or ollama pull gemma4:26b again to refresh the chat template metadata.

"Out of memory" when increasing `--ctx-size` past 16K

Each additional 1K of KV cache adds roughly 80–100 MB at Q4_K_M (30 layers × 1024 sliding window × unified KV). The default --ctx-size 8192 leaves comfortable headroom on a 24 GB card; pushing to --ctx-size 32768 consumes ~3 GB additional VRAM and starts crowding the optional vision encoder. If you need long-context (up to 256K tokens per the HF card) on a single 3090, drop to Q4_K_M weights only (skip the mmproj vision encoder), or move to Q5_K_S (18.85 GB per the unsloth GGUF tree) only with reduced context.

Considering Q5_K_M or Q8_0 instead of Q4_K_M?

The unsloth GGUF mirror ships per-quant-tier files via the HF tree API: Q5_K_M (UD-Q5_K_M) is 21.15 GB, Q6_K is 23.17 GB, and Q8_0 is 26.86 GB — Q8_0 does NOT fit a 24 GB card (no headroom for KV cache or activations). Q5_K_M fits but leaves only ~3 GB for context + vision encoder; Q4_K_M is the recommended sweet spot for 24 GB cards per the unsloth canonical guide ("Dynamic 4-bit for the 26B-A4B"). The Q6_K file fits on disk but in practice exhausts VRAM during inference once activations and KV cache load on top.

Wanted the 31B Dense variant instead?

The google/gemma-4-31B-it variant is a different model — 30.7B dense parameters. On the same RTX 3090, the Hardware Corner table shows Gemma4 31B (Q4_K) running at roughly 31.4 tok/s generation @ 32K context — about 4× slower than the 26B A4B's 119 tok/s. That's the sparse-MoE compute advantage: 31B Dense activates all parameters per token; 26B A4B activates only 3.8B per token. For most local-chat use on a 24 GB consumer card, the 26B A4B variant in this recipe is the recommended choice — it scores within ~3 percentage points of 31B Dense on MMLU Pro, AIME, and LiveCodeBench per the HF benchmark table.