How much VRAM does Gemma 4 E4B-IT need?

About 6 GB for the Q4_K_M GGUF, the minimum this recipe targets. The Q8_0 GGUF needs about 9 GB and the BF16 transformers path about 16 GB.

How hard is this setup?

Beginner level. Follow the steps below.

Gemma 4 E4B on RTX 5060 Ti: Multimodal Inference with transformers or llama.cpp

What You'll Build

A local Gemma 4 E4B instance on an RTX 5060 Ti — Google's 4.5 B-effective-parameter (8 B with embeddings) instruction-tuned multimodal model that accepts text, images, audio, and video as input and produces text. The recipe walks through three runtimes: HuggingFace transformers for the full BF16 path, Unsloth's GGUF quants via llama.cpp for the lean Q4/Q8 path, and Ollama for the one-command path.

Hardware data: RTX 5060 Ti (16 GB VRAM) · multimodal capable at Q4_K_M (~5 GB weights) up to BF16 (~15 GB weights) · See benchmark data

ℹ️ Multimodal input, text-only output. Gemma 4 E4B reads images, audio (≤ 30 s clips), video (≤ 60 s at 1 fps), and text — and replies in text. It does not generate images, speech, or video. For TTS on this card, see Kokoro; for image generation, see Z-Image or Flux.2 Klein.

Requirements

Component	Minimum	Tested
GPU	6 GB (Q4_K_M GGUF) / 9 GB (Q8_0 GGUF) / 16 GB (BF16 transformers)	RTX 5060 Ti (16 GB)
RAM	16 GB system RAM	—
Storage	5 – 16 GB depending on quant (file-size table)	15.1 GB for BF16, 4.98 GB for Q4_K_M
Software	Python 3.10+ and CUDA 12.8 wheel (transformers path) OR llama.cpp / Ollama (GGUF paths)	—

Per the official Google AI for Developers docs, the inference-memory footprint is 17.9 GB at BF16, 8.9 GB at SFP8, and 4.5 GB at Q4_0. Google's figures already include ~20 % load overhead on top of the raw weights but not the KV cache, so plan on at least 25 % additional headroom for non-trivial contexts.
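The relationship between Google's figures and the raw weights can be checked with a few lines of arithmetic. This is a minimal sketch that assumes the ~20 % overhead acts as a simple multiplier on the weight size (an interpretation of the docs' wording, not a published formula). The function name is illustrative.

```python
# Google's published inference-memory figures (GB), as quoted in this recipe.
GOOGLE_FOOTPRINT_GB = {"BF16": 17.9, "SFP8": 8.9, "Q4_0": 4.5}

LOAD_OVERHEAD = 0.20  # assumption: footprint = weights * (1 + 0.20)


def raw_weights_gb(footprint_gb: float, overhead: float = LOAD_OVERHEAD) -> float:
    """Back out the approximate raw weight size from a quoted footprint."""
    return footprint_gb / (1.0 + overhead)


for name, footprint in GOOGLE_FOOTPRINT_GB.items():
    print(f"{name}: ~{raw_weights_gb(footprint):.1f} GB weights "
          f"out of {footprint} GB quoted")
```

Under that assumption the BF16 figure works out to roughly 14.9 GB of weights, which is in line with the ~15 GB BF16 checkpoint size listed in this recipe.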

Installation

Path A — `transformers` (BF16, full multimodal)

This path runs the canonical google/gemma-4-E4B-it checkpoint at BF16. Weights are ~15 GB; the runtime peak sits around 16 GB depending on context length, so this fits the 5060 Ti 16 GB but leaves little headroom.

# CUDA 12.8 PyTorch wheel — required for Blackwell sm_120 (RTX 50-series)
pip install -U torch --index-url https://download.pytorch.org/whl/cu128
pip install -U transformers accelerate torchvision librosa
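
After installing, it can help to confirm that PyTorch actually sees the card and reports compute capability 12.0 (sm_120). The following is a minimal sketch, not part of the official setup. The helper names are illustrative, and it only uses `torch.cuda.is_available`, `torch.cuda.get_device_capability`, `torch.cuda.get_device_name`, and `torch.version.cuda`.

```python
def is_sm_120(capability: tuple[int, int]) -> bool:
    """True for compute capability 12.0 (sm_120)."""
    return capability == (12, 0)


def describe_gpu() -> str:
    """Summarize what PyTorch sees; degrades gracefully without torch or a GPU."""
    try:
        import torch
    except ImportError:
        return "torch is not installed"
    if not torch.cuda.is_available():
        return "torch is installed but CUDA is not available"
    capability = torch.cuda.get_device_capability(0)
    name = torch.cuda.get_device_name(0)
    note = "sm_120 detected" if is_sm_120(capability) else "not sm_120"
    return f"{name}: capability {capability}, CUDA build {torch.version.cuda} ({note})"


print(describe_gpu())
```

If the CUDA build printed here is older than 12.8, the wheel most likely came from the default index rather than the cu128 index used above.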

Path B — Unsloth GGUF via `llama.cpp` (Q4 / Q8, leanest)

Unsloth ships per-quant GGUFs of the canonical Google checkpoint at unsloth/gemma-4-E4B-it-GGUF. The model tree on that page explicitly links the upstream google/gemma-4-E4B-it.

# macOS / Linux via brew
brew install llama.cpp

# Or: build from source — https://github.com/ggml-org/llama.cpp

The GGUF file sizes from the Unsloth repo (pick a tier that fits your VRAM budget):

Quant	File size	Notes
UD-Q4_K_XL	5.13 GB	Recommended floor — Unsloth dynamic 4-bit
Q4_K_M	4.98 GB	Standard 4-bit
Q5_K_M	5.48 GB	Light step up
Q8_0	8.19 GB	Near-lossless; comfortable on 5060 Ti 16 GB
BF16	15.1 GB	Same as transformers BF16

Path C — Ollama (one-liner)

ollama run hf.co/unsloth/gemma-4-E4B-it-GGUF:UD-Q4_K_XL

⚠️ Known issue: Gemma 4's hybrid attention architecture currently exposes bugs in Ollama's tool-call parser and streaming layer (reported, 2026-04-07). If you need function calling or tool use, use llama.cpp directly (Path B) or vLLM rather than Ollama until the parser is patched.
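
For scripted use, Ollama also exposes a local REST API. The sketch below posts a non-streaming request to the `/api/chat` endpoint on Ollama's default port 11434, using only the standard library. Whether `"stream": False` sidesteps the streaming issue noted above has not been verified here, and the helper names are illustrative.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/chat"  # Ollama's default local endpoint
MODEL = "hf.co/unsloth/gemma-4-E4B-it-GGUF:UD-Q4_K_XL"


def build_chat_payload(prompt: str, model: str = MODEL) -> dict:
    """Build a non-streaming chat request body."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,  # ask for one JSON object instead of a chunk stream
    }


def chat(prompt: str, url: str = OLLAMA_URL, timeout: float = 120.0) -> str:
    """Send the prompt to a running Ollama instance and return the reply text."""
    body = json.dumps(build_chat_payload(prompt)).encode("utf-8")
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read())["message"]["content"]
```

With Ollama running and the model pulled, `print(chat("Write a short joke about saving RAM."))` should return the reply as plain text.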

Running

Path A — text + image inference (transformers)

The Hugging Face card's canonical text-only snippet, with one comment added for the 5060 Ti:

from transformers import AutoProcessor, AutoModelForCausalLM

MODEL_ID = "google/gemma-4-E4B-it"

processor = AutoProcessor.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    dtype="auto",        # picks BF16 on the 5060 Ti
    device_map="auto",
)

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Write a short joke about saving RAM."},
]

text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True, enable_thinking=False,
)
inputs = processor(text=text, return_tensors="pt").to(model.device)
input_len = inputs["input_ids"].shape[-1]

outputs = model.generate(**inputs, max_new_tokens=1024)
response = processor.decode(outputs[0][input_len:], skip_special_tokens=False)
print(processor.parse_response(response))

For image input, swap in AutoModelForMultimodalLM and pass {"type": "image", "url": "..."} blocks — full snippet on the model card. Recommended sampling: temperature=1.0, top_p=0.95, top_k=64.
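
As a rough illustration of the message shape described above, the sketch below builds a single user turn with one image block and one text block, plus the recommended sampling settings as `generate()` keyword arguments. The image block follows the form quoted in this recipe. The text block shape and the helper name are assumptions based on the common Hugging Face chat-template convention, so check the model card before relying on them.

```python
def build_image_messages(image_url: str, prompt: str) -> list[dict]:
    """Build a single-turn chat with one image block followed by a text block."""
    return [
        {
            "role": "user",
            "content": [
                {"type": "image", "url": image_url},
                {"type": "text", "text": prompt},  # assumed text-block shape
            ],
        }
    ]


# Recommended sampling settings from this recipe, as generate() keyword arguments.
SAMPLING_KWARGS = {
    "do_sample": True,  # sampling must be on for temperature/top_p/top_k to apply
    "temperature": 1.0,
    "top_p": 0.95,
    "top_k": 64,
}

messages = build_image_messages(
    "https://example.com/photo.jpg",  # placeholder URL, not a real asset
    "Describe this image in one sentence.",
)
print(messages)
# Later, with a loaded model:
#   model.generate(**inputs, max_new_tokens=256, **SAMPLING_KWARGS)
```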

Path B — llama.cpp server

# OpenAI-compatible local server with web UI
llama-server -hf unsloth/gemma-4-E4B-it-GGUF:UD-Q4_K_XL

# Or one-shot CLI
llama-cli -hf unsloth/gemma-4-E4B-it-GGUF:UD-Q4_K_XL

The -hf flag downloads the GGUF from Hugging Face on first run and caches it locally, so later runs start without re-downloading.
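
Once `llama-server` is running, any OpenAI-style client can talk to it. The sketch below uses only the standard library and assumes the server listens on its default address, `http://localhost:8080`, with the `/v1/chat/completions` route. Note that `top_k` is not part of the OpenAI schema. It is included on the assumption that llama-server accepts it as an extra sampling field. The helper names are illustrative.

```python
import json
import urllib.request

SERVER_URL = "http://localhost:8080/v1/chat/completions"  # assumed default address


def build_request_body(prompt: str, max_tokens: int = 256) -> dict:
    """Build a chat-completions body using this recipe's sampling settings."""
    return {
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": 1.0,
        "top_p": 0.95,
        "top_k": 64,  # llama.cpp extension, assumed to be accepted by the server
    }


def complete(prompt: str, url: str = SERVER_URL, timeout: float = 120.0) -> str:
    """POST the prompt to a running llama-server and return the reply text."""
    body = json.dumps(build_request_body(prompt)).encode("utf-8")
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        data = json.loads(response.read())
    return data["choices"][0]["message"]["content"]
```

With the server from the command above running, `print(complete("Write a short joke about saving RAM."))` returns the generated text.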

Results

Speed: A community walkthrough by danilchenko.dev (2026-04-07) reports ~45 tok/s at Q8_0 on an RTX 3060 (12 GB). The 5060 Ti's higher memory bandwidth should beat that, but no published 5060 Ti benchmark exists yet. Submit yours via /contribute.
VRAM usage: Cited inference-memory footprint per Google AI for Developers — BF16 = 17.9 GB, SFP8 = 8.9 GB, Q4_0 = 4.5 GB (incl. ~20% load overhead). Unsloth GGUF Q4_K_M file is 4.98 GB, Q8_0 is 8.19 GB, BF16 is 15.1 GB (file table). Add ~25 % runtime headroom for the KV cache at long contexts.
Quality notes: E4B is the "daily-driver" tier of the Gemma 4 family — 4.5 B effective / 8 B with embeddings, 42 layers, 262 K vocab, ~150 M vision encoder + ~300 M audio encoder. Image inputs support variable aspect/resolution with configurable visual-token budgets (70 / 140 / 280 / 560 / 1120). Audio inputs cap at 30 seconds per clip.
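
Because audio inputs cap at 30 seconds per clip, longer recordings have to be split before they are passed to the model. The sketch below only computes the window boundaries. Slicing the actual samples (for example with librosa) and deciding whether fixed boundaries are good enough for a given recording are left out, and the function name is illustrative.

```python
def clip_spans(duration_s: float, max_clip_s: float = 30.0) -> list[tuple[float, float]]:
    """Split [0, duration_s) into consecutive (start, end) windows of at most max_clip_s."""
    if duration_s < 0 or max_clip_s <= 0:
        raise ValueError("duration_s must be >= 0 and max_clip_s must be > 0")
    spans = []
    start = 0.0
    while start < duration_s:
        end = min(start + max_clip_s, duration_s)
        spans.append((start, end))
        start = end
    return spans


print(clip_spans(75.0))  # [(0.0, 30.0), (30.0, 60.0), (60.0, 75.0)]
```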

For the full benchmark data, see /check/gemma-4-e4b/rtx-5060-ti.

Troubleshooting

`flash_attention_2` errors on the 5060 Ti

The canonical HF snippet uses dtype="auto" and does not request attn_implementation="flash_attention_2", so the default SDPA path runs cleanly on Blackwell. If you copy a third-party snippet that hardcodes flash_attention_2, the call will fail at first forward pass — FA2 wheels do not yet ship sm_120 kernels for RTX 50-series (see the tracking issue at Dao-AILab/flash-attention#2168). Fix by removing the override or setting attn_implementation="sdpa".
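
One way to apply that fix to a copied snippet is to normalize the loading arguments before calling `from_pretrained`. This is a minimal sketch with an illustrative helper name. It only rewrites the `attn_implementation` value and leaves every other argument untouched.

```python
def blackwell_safe_kwargs(load_kwargs: dict) -> dict:
    """Return a copy of from_pretrained kwargs with flash_attention_2 replaced by sdpa."""
    fixed = dict(load_kwargs)
    if fixed.get("attn_implementation") == "flash_attention_2":
        fixed["attn_implementation"] = "sdpa"
    return fixed


third_party = {
    "dtype": "auto",
    "device_map": "auto",
    "attn_implementation": "flash_attention_2",  # the override that fails on sm_120
}
print(blackwell_safe_kwargs(third_party))
# Then, with transformers imported and MODEL_ID defined:
#   AutoModelForCausalLM.from_pretrained(MODEL_ID, **blackwell_safe_kwargs(third_party))
```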

Ollama function-calling / streaming glitches

Per the danilchenko.dev walkthrough (2026-04-07), Gemma 4's hybrid sliding-window/global attention triggers parser bugs in Ollama's tool-call and streaming layers. Workaround: run llama.cpp directly (Path B) or vLLM for tool-use workflows.

`enable_thinking` not triggering on E4B

Discussion #26 on the HF card notes that enable_thinking=True in the chat template does not currently activate Thinking Mode on the E4B / E2B sizes. The recipe snippet sets enable_thinking=False to match the documented working behaviour; if you need extended reasoning, the 26 B MoE or 31 B dense variants are the supported path.

Tight on VRAM at long contexts

BF16 (Path A) sits at ~15 GB weights on a 16 GB card, leaving ~1 GB for the KV cache. For contexts beyond 8 K tokens, switch to Path B with the Q8_0 GGUF (8.19 GB on disk), which keeps near-lossless quality at roughly half the weight memory and leaves room for 32 K+ context. The Google docs figures are inference-memory baselines (weights + ~20% load overhead); the KV cache grows on top of that as context length increases.
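
The linear growth of the KV cache can be sketched with the standard estimate of two tensors (keys and values) per layer. Only the layer count of 42 comes from this recipe. The KV-head count and head dimension below are hypothetical placeholders that should be replaced with the values from the model's config. The estimate also treats every layer as full attention, so for a hybrid sliding-window model it is an upper bound rather than a prediction.

```python
def kv_cache_gb(
    tokens: int,
    layers: int,
    kv_heads: int,
    head_dim: int,
    bytes_per_value: int = 2,  # 2 bytes per entry for a BF16 / FP16 cache
) -> float:
    """Upper-bound KV-cache size in GB if every layer attends over all tokens."""
    # The factor of 2 covers the separate key and value tensors in each layer.
    total_bytes = 2 * layers * kv_heads * head_dim * bytes_per_value * tokens
    return total_bytes / 1e9


LAYERS = 42                   # from this recipe's quality notes
HYPOTHETICAL_KV_HEADS = 2     # placeholder, NOT a published Gemma 4 value
HYPOTHETICAL_HEAD_DIM = 256   # placeholder, NOT a published Gemma 4 value

for context in (8_192, 32_768):
    size = kv_cache_gb(context, LAYERS, HYPOTHETICAL_KV_HEADS, HYPOTHETICAL_HEAD_DIM)
    print(f"{context} tokens: up to ~{size:.2f} GB with placeholder dimensions")
```

Whatever the real dimensions are, the cache scales in proportion to the token count, so quadrupling the context from 8 K to 32 K quadruples this bound.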