Gemma 4 26B A4B-it on RTX 3090 Ti: Local Multimodal Chat via Q4_K_M GGUF + llama.cpp

What You'll Build

A local Gemma 4 26B A4B-it chat assistant — text + image input, text output — running on an RTX 3090 Ti (24 GB VRAM) through llama.cpp with the ggml-org/gemma-4-26B-A4B-it-GGUF Q4_K_M weights (16.80 GB on disk per the HF tree). The optional mmproj-Q8_0 vision encoder (806 MB) unlocks image input — OCR, chart reading, document parsing, screen understanding — all on a single Ampere consumer card.

Hardware data: RTX 3090 Ti (24 GB VRAM) · Q4_K · ~129 tok/s estimated generation @ 4K context (close-sibling forward-statement from the RTX 3090) · See benchmark data

ℹ️ Variant pinned — 26B A4B-it (the MoE). Gemma 4 ships in four sizes — E2B, E4B, 26B A4B, and 31B Dense. This recipe targets the 26B A4B-it (google/gemma-4-26B-A4B-it) — the Mixture-of-Experts variant with 25.2B total parameters / 3.8B active per token. The 31B Dense variant (google/gemma-4-31B-it) is a different model and runs ~4× slower on the same Q4_K_M quant tier — see Troubleshooting.

⚠️ All 25.2B parameters must be resident in VRAM — even though only 3.8B are "active" per token. Gemma 4 26B A4B is a router-per-token sparse MoE: per the HF model card, each token's forward pass picks 8 of 128 experts (plus 1 shared) at runtime, so you cannot pre-prune. "Active parameters" tells you the per-token compute cost (which is why a 26B-total model runs almost as fast as a 4B model on the same hardware) but not the VRAM cost. At Q4_K_M, weights alone are 16.80 GB on disk and stay roughly the same in VRAM — plan on ~17 GB for weights + ~1 GB for the KV cache at modest context + ~1 GB for the optional vision encoder. The 3090 Ti's 24 GB envelope leaves ~6 GB headroom — workable for 4K–8K contexts, tight beyond.

Requirements

Component	Minimum	Tested
GPU	24 GB VRAM (Q4_K_M weights alone are 16.80 GB; KV cache + vision encoder need headroom)	RTX 3090 Ti (24 GB)
RAM	16 GB system	—
Storage	17 GB (Q4_K_M GGUF) or 18 GB with mmproj vision encoder	—
Software	llama.cpp (recent build, post Apr 2026 chat-template fix), Ollama, or LM Studio	—
License	Apache 2.0 (Gemma 4 License)	commercial use permitted

Installation

1. Build (or update) llama.cpp

The ggml-org GGUF mirror is the llama.cpp team's own packaging — llama-cli's -hf flag fetches and caches it for you. You need a recent llama.cpp build that includes Google's April 2026 chat-template fix (noted in the unsloth canonical guide) — older binaries will run the model but emit <|channel>thought control tokens in the wrong order.

git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j --target llama-cli llama-server

The GGML_CUDA=ON flag enables CUDA acceleration; on Ampere (sm_86) no special wheel selection is required — the default kernels cover the 3090 Ti. The resulting llama-cli and llama-server binaries land in ./build/bin/.

2. Run with the official ggml-org GGUF

The simplest path — -hf downloads the Q4_K_M shard (16.80 GB) on first call:

./build/bin/llama-cli \
    -hf ggml-org/gemma-4-26B-A4B-it-GGUF:Q4_K_M \
    --n-gpu-layers 99 \
    --ctx-size 8192 \
    --temp 1.0 --top-p 0.95 --top-k 64

The sampling parameters (temp=1.0, top_p=0.95, top_k=64) are Google's recommended Gemma 4 defaults from the HF card. --n-gpu-layers 99 offloads all 30 layers to the 3090 Ti.

3. (Optional) Add the vision encoder for image input

If you want the multimodal path — OCR, chart parsing, screen understanding — also download the mmproj adapter (806 MB at Q8_0):

# In a separate terminal, run llama-server with both files:
./build/bin/llama-server \
    --model ~/.cache/llama.cpp/ggml-org_gemma-4-26B-A4B-it-GGUF_gemma-4-26B-A4B-it-Q4_K_M.gguf \
    --mmproj-url https://huggingface.co/ggml-org/gemma-4-26B-A4B-it-GGUF/resolve/main/mmproj-gemma-4-26B-A4B-it-Q8_0.gguf \
    --n-gpu-layers 99 \
    --ctx-size 8192 \
    --temp 1.0 --top-p 0.95 --top-k 64 \
    --host 0.0.0.0 --port 8080

Then POST images as base64 to http://localhost:8080/v1/chat/completions using the OpenAI-compatible API. Per the HF card best practices, place the image content before the text in the prompt.

4. (Alternative) Pull via Ollama

If you prefer Ollama's container model:

ollama pull gemma4:26b
ollama run gemma4:26b

Ollama's gemma4:26b tag defaults to Q4_K_M and bundles its own vision encoder.

Running

A typical interactive session:

./build/bin/llama-cli \
    -hf ggml-org/gemma-4-26B-A4B-it-GGUF:Q4_K_M \
    --n-gpu-layers 99 \
    --ctx-size 8192 \
    --temp 1.0 --top-p 0.95 --top-k 64 \
    --color -i

First launch downloads ~17 GB; subsequent launches start in seconds from the local cache. To enable Gemma 4's built-in reasoning mode, prepend <|think|> to the system prompt (the HF card describes the full control-token grammar).

Results

Speed: ~129 tok/s estimated generation @ 4K context at Q4_K — close-sibling forward-statement from the RTX 3090, NOT a measurement on the 3090 Ti. The Hardware Corner RTX 3090 LLM benchmark table directly measures Gemma4 26B (Q4_K) at 119.4 tok/s generation @ 4K (3,625.6 t/s prompt processing). The same source's RTX 3090 Ti page does not carry a Gemma4 26B row, so no direct measurement exists. Across the five model rows present on both Hardware Corner pages (Qwen3 8B / 14B / 30B A3B / 32B and gpt-oss 20B), token generation on the 3090 Ti runs ~8% faster than the 3090 (mean ratio 1.083; MoE-only mean 1.087 — gpt-oss 20B 160.31 vs 147.53 t/s and Qwen3 30B A3B 166.91 vs 153.56 t/s). Applying that ratio to Gemma 4 26B A4B yields the ~129 tok/s estimate. Treat this as a planning figure, not a verified throughput — first-party measurement is welcome at /contribute.
VRAM usage: ~17 GB resident for Q4_K_M weights alone (16.80 GB on disk per the ggml-org HF tree), plus ~1 GB KV cache at 8K context, plus an optional ~1 GB for the mmproj Q8_0 vision encoder — ~18 GB derived runtime envelope on a 24 GB card. The 3090 Ti's 24 GB envelope provides ~6 GB headroom for longer contexts (the model supports up to 256K tokens per the HF card) and the vision encoder. Cross-confirmed by the unsloth canonical "How to Run Gemma 4" guide which lists "16–18 GB" for 26B A4B at 4-bit. Same VRAM tier as the 3090; the 3090 Ti's faster bandwidth (1,008 GB/s vs the 3090's 936 GB/s) is what shifts the speed dial, not the memory envelope.
Quality notes: Per Google's HF model card benchmarks table, the 26B A4B variant scores 82.6% MMLU Pro, 88.3% AIME 2026, 77.1% LiveCodeBench v6, and 82.4% MATH-Vision — all close to the larger 31B Dense and substantially ahead of Gemma 3 27B. Because only 3.8B parameters are active per token, throughput is closer to a 4B-class model than a 26B-class model, giving the "fast inference at frontier quality" profile that a 24 GB Ampere card like the 3090 Ti is well-suited for.

For the full benchmark data, see /check/gemma4-26b/rtx-3090-ti.

Troubleshooting

Garbled `<|channel>thought` tokens in output

Symptom: the model emits raw <|channel>thought\n control tokens in conversational output, or reasoning mode produces malformed thinking blocks. Cause: an older llama.cpp / Ollama / LM Studio build predates Google's April 2026 chat-template revision. Fix: per the unsloth Gemma 4 guide — rebuild llama.cpp from the latest main, or ollama pull gemma4:26b again to refresh the chat template metadata.

"Out of memory" when increasing `--ctx-size` past 16K

Each additional 1K of KV cache adds roughly 80–100 MB at Q4_K_M (30 layers × 1024 sliding window × unified KV). The default --ctx-size 8192 leaves comfortable headroom on a 24 GB card; pushing to --ctx-size 32768 consumes ~3 GB additional VRAM and starts crowding the optional vision encoder. If you need long-context (up to 256K tokens per the HF card) on a single 3090 Ti, drop to Q4_K_M weights only (skip the mmproj vision encoder), or move to Q5_K_S (18.85 GB per the unsloth GGUF tree) only with reduced context.

Considering Q5_K_M or Q8_0 instead of Q4_K_M?

The unsloth GGUF mirror ships per-quant-tier files via the HF tree API: Q5_K_M (UD-Q5_K_M) is 21.15 GB, Q6_K is 23.17 GB, and Q8_0 is 26.86 GB — Q8_0 does NOT fit a 24 GB card (no headroom for KV cache or activations). Q5_K_M fits but leaves only ~3 GB for context + vision encoder; Q4_K_M is the recommended sweet spot for 24 GB cards per the unsloth canonical guide. The Q6_K file fits on disk but in practice exhausts VRAM during inference once activations and KV cache load on top.

Wanted the 31B Dense variant instead?

The google/gemma-4-31B-it variant is a different model — 30.7B dense parameters. On the same RTX 3090, the Hardware Corner table shows Gemma4 31B (Q4_K) running at roughly 34.7 tok/s generation @ 4K context — about 3.4× slower than the 26B A4B's 119.4 tok/s. That's the sparse-MoE compute advantage: 31B Dense activates all parameters per token; 26B A4B activates only 3.8B per token. For most local-chat use on a 24 GB consumer card, the 26B A4B variant in this recipe is the recommended choice — it scores within ~3 percentage points of 31B Dense on MMLU Pro, AIME, and LiveCodeBench per the HF benchmark table.

Why no direct RTX 3090 Ti measurement?

The Hardware Corner RTX 3090 Ti LLM benchmarks page lists Qwen3 8B/14B/30B-A3B/32B and gpt-oss 20B but does not (as of recipe-publish time) include a Gemma4 26B row, while the RTX 3090 page does. The ~129 tok/s figure in Results is a close-sibling forward-statement (RTX 3090 measurement scaled by the 1.08–1.09× mean Ti/non-Ti ratio observed across the five overlapping model rows from the same source), not a direct measurement. If you run llama-bench on a 3090 Ti yourself, please contribute it at /contribute so the next reader gets a verified number.