
Gemma 4 E4B on RTX 5060 Ti: Multimodal Inference with transformers or llama.cpp

multimodal · beginner · 6GB+ VRAM · May 19, 2026
prerequisites
  • NVIDIA RTX 5060 Ti (16GB VRAM) or any GPU with ≥ 6 GB for Q4_K_M, ≥ 9 GB for Q8_0, or ≥ 16 GB for BF16
  • Python 3.10+ (for the transformers path)
  • CUDA 12.8+ wheel of PyTorch (cu128) for Blackwell sm_120 — required only for the transformers path

What You'll Build

A local Gemma 4 E4B instance on an RTX 5060 Ti — Google's 4.5 B-effective-parameter (8 B with embeddings) instruction-tuned multimodal model that accepts text, images, audio, and video as input and produces text. The recipe walks through three runtimes: HuggingFace transformers for the full BF16 path, Unsloth's GGUF quants via llama.cpp for the lean Q4/Q8 path, and Ollama for the one-command path.

Hardware data: RTX 5060 Ti (16 GB VRAM) · multimodal capable at Q4_K_M (~5 GB weights) up to BF16 (~15 GB weights) · See benchmark data

ℹ️ Multimodal input, text-only output. Gemma 4 E4B reads images, audio (≤ 30 s clips), video (≤ 60 s at 1 fps), and text — and replies in text. It does not generate images, speech, or video. For TTS on this card, see Kokoro; for image generation, see Z-Image or Flux.2 Klein.

Requirements

Component | Minimum | Tested
GPU | 6 GB (Q4_K_M GGUF) / 9 GB (Q8_0 GGUF) / 16 GB (BF16 transformers) | RTX 5060 Ti (16 GB)
RAM | 16 GB system RAM |
Storage | 5 – 16 GB depending on quant (file-size table) | 15.1 GB for BF16, 4.98 GB for Q4_K_M
Software | Python 3.10+ and CUDA 12.8 wheel (transformers path) OR llama.cpp / Ollama (GGUF paths) |

Per the official Google AI for Developers docs, the static-weight memory footprint is 15 GB at BF16, 7.5 GB at FP8, and 5 GB at Q4_0. These numbers cover the weights only, not the KV cache or runtime overhead, so plan on at least 25 % headroom for non-trivial contexts.
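The headroom rule can be turned into quick arithmetic. This is a minimal sketch, assuming the 25 % figure is applied as a simple multiplier on the weight footprint. The function name `vram_budget_gb` is hypothetical and not part of any library.

```python
# Weight-only footprints quoted from the Google docs (GB).
WEIGHTS_GB = {"BF16": 15.0, "FP8": 7.5, "Q4_0": 5.0}

def vram_budget_gb(weights_gb: float, headroom: float = 0.25) -> float:
    """Weights plus a fractional allowance for KV cache and runtime overhead."""
    return weights_gb * (1.0 + headroom)

for precision, weights in WEIGHTS_GB.items():
    budget = vram_budget_gb(weights)
    verdict = "fits" if budget <= 16.0 else "exceeds"
    print(f"{precision}: plan for {budget:.2f} GB, which {verdict} a 16 GB card")
```

With the full 25 % allowance, BF16 lands above 16 GB, which is why this recipe treats the BF16 path as workable only at short contexts on this card.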

Installation

Path A — transformers (BF16, full multimodal)

This path runs the canonical google/gemma-4-E4B-it checkpoint at BF16. Weights are ~15 GB; the runtime peak sits around 16 GB depending on context length, so this fits the 5060 Ti 16 GB but leaves little headroom.

# CUDA 12.8 PyTorch wheel — required for Blackwell sm_120 (RTX 50-series)
pip install -U torch --index-url https://download.pytorch.org/whl/cu128
pip install -U transformers accelerate torchvision librosa
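After installing, it can help to confirm that the wheel really is a CUDA 12.8+ build. The sketch below assumes `torch.version.cuda` reports a string such as "12.8" (or None on CPU-only wheels). The helper names `cuda_build_ok` and `report_torch_build` are hypothetical.

```python
def cuda_build_ok(cuda_version, minimum=(12, 8)):
    """Return True if a CUDA build string such as '12.8' meets the minimum version."""
    if not cuda_version:  # CPU-only wheels report None
        return False
    parts = [int(part) for part in cuda_version.split(".")[:2]]
    while len(parts) < 2:  # tolerate a bare major version like "13"
        parts.append(0)
    return tuple(parts) >= tuple(minimum)

def report_torch_build():
    """Print the CUDA build of the installed PyTorch and the GPU's compute capability."""
    import torch  # imported lazily so the version check above works without PyTorch
    print("CUDA build:", torch.version.cuda, "ok:", cuda_build_ok(torch.version.cuda))
    if torch.cuda.is_available():
        # Consumer Blackwell cards (sm_120) are expected to report (12, 0).
        print("Device capability:", torch.cuda.get_device_capability(0))

# Call report_torch_build() in the environment where you installed the wheel.
```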

Path B — Unsloth GGUF via llama.cpp (Q4 / Q8, leanest)

Unsloth ships per-quant GGUFs of the canonical Google checkpoint at unsloth/gemma-4-E4B-it-GGUF. The model tree on that page explicitly links the upstream google/gemma-4-E4B-it.

# macOS / Linux via brew
brew install llama.cpp

# Or: build from source — https://github.com/ggml-org/llama.cpp

The GGUF file sizes from the Unsloth repo (pick a tier that fits your VRAM budget):

Quant | File size | Notes
UD-Q4_K_XL | 5.13 GB | Recommended floor; Unsloth dynamic 4-bit
Q4_K_M | 4.98 GB | Standard 4-bit
Q5_K_M | 5.48 GB | Light step up
Q8_0 | 8.19 GB | Near-lossless; comfortable on 5060 Ti 16 GB
BF16 | 15.1 GB | Same as transformers BF16
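Picking a tier can be scripted. The following is a rough sketch, assuming file size is a fair proxy for weight VRAM and that headroom is a flat fraction on top. The helper `largest_quant_that_fits` is hypothetical.

```python
# File sizes (GB) copied from the Unsloth table.
QUANT_SIZES_GB = {
    "Q4_K_M": 4.98,
    "UD-Q4_K_XL": 5.13,
    "Q5_K_M": 5.48,
    "Q8_0": 8.19,
    "BF16": 15.1,
}

def largest_quant_that_fits(vram_gb, headroom=0.25):
    """Return the largest quant whose size plus headroom fits in vram_gb, or None."""
    fitting = {
        name: size
        for name, size in QUANT_SIZES_GB.items()
        if size * (1.0 + headroom) <= vram_gb
    }
    if not fitting:
        return None
    return max(fitting, key=fitting.get)

print(largest_quant_that_fits(16.0))                # 25 % headroom
print(largest_quant_that_fits(16.0, headroom=0.0))  # weights only, short contexts
```

The `headroom` argument is the knob to adjust: a smaller value models short prompts, a larger one models long contexts where the KV cache grows.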

Path C — Ollama (one-liner)

ollama run hf.co/unsloth/gemma-4-E4B-it-GGUF:UD-Q4_K_XL

⚠️ Known issue: Gemma 4's hybrid attention architecture currently exposes bugs in Ollama's tool-call parser and streaming layer (reported, 2026-04-07). If you need function calling or tool use, use llama.cpp directly (Path B) or vLLM rather than Ollama until the parser is patched.

Running

Path A — text + image inference (transformers)

The HuggingFace card's canonical snippet, verbatim. It covers the text-only case; the changes needed for image input are noted after the code:

from transformers import AutoProcessor, AutoModelForCausalLM

MODEL_ID = "google/gemma-4-E4B-it"

processor = AutoProcessor.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    dtype="auto",        # picks BF16 on the 5060 Ti
    device_map="auto",
)

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Write a short joke about saving RAM."},
]

text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True, enable_thinking=False,
)
inputs = processor(text=text, return_tensors="pt").to(model.device)
input_len = inputs["input_ids"].shape[-1]

outputs = model.generate(**inputs, max_new_tokens=1024)
response = processor.decode(outputs[0][input_len:], skip_special_tokens=False)
print(processor.parse_response(response))

For image input, swap in AutoModelForMultimodalLM and pass {"type": "image", "url": "..."} blocks — full snippet on the model card. Recommended sampling: temperature=1.0, top_p=0.95, top_k=64.
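To illustrate the image-block shape, here is a minimal sketch that only builds the message list and the sampling settings. It does not load the model. The helper name `build_image_messages` and the example URL are placeholders, and the model card remains the authoritative source for the full snippet.

```python
def build_image_messages(image_url, question):
    """Build a single user turn holding one image block followed by one text block."""
    return [
        {
            "role": "user",
            "content": [
                {"type": "image", "url": image_url},
                {"type": "text", "text": question},
            ],
        }
    ]

# Recommended sampling settings from this recipe.
# do_sample=True is what makes generate() honour them instead of decoding greedily.
SAMPLING_KWARGS = {"do_sample": True, "temperature": 1.0, "top_p": 0.95, "top_k": 64}

# Placeholder URL: substitute an image you can actually reach.
messages = build_image_messages("https://example.com/cat.png", "What is in this picture?")
print(messages)
```

These messages would then go through the processor's chat template, and `SAMPLING_KWARGS` would be unpacked into `model.generate(...)` alongside the processed inputs.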

Path B — llama.cpp server

# OpenAI-compatible local server with web UI
llama-server -hf unsloth/gemma-4-E4B-it-GGUF:UD-Q4_K_XL

# Or one-shot CLI
llama-cli -hf unsloth/gemma-4-E4B-it-GGUF:UD-Q4_K_XL

The -hf flag streams the GGUF directly from HuggingFace on first run and caches it locally.
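Once the server is up, any OpenAI-style client can talk to it. The sketch below uses only the standard library and assumes the server listens on 127.0.0.1:8080 with a /v1/chat/completions route. Adjust `SERVER_URL` if you started it with a different host or port. The helper names are hypothetical.

```python
import json
import urllib.request

# Assumption: llama-server is running locally on its default address.
SERVER_URL = "http://127.0.0.1:8080/v1/chat/completions"

def build_chat_request(prompt, max_tokens=256):
    """Return an OpenAI-style chat-completions payload for a single user turn."""
    return {
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": 1.0,
        "top_p": 0.95,
    }

def ask(prompt, url=SERVER_URL, timeout=120):
    """POST the payload to the server and return the assistant's reply text."""
    body = json.dumps(build_chat_request(prompt)).encode("utf-8")
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        reply = json.load(response)
    return reply["choices"][0]["message"]["content"]

# With the server running: print(ask("Write a short joke about saving RAM."))
```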

Results

  • Speed: A community walkthrough by danilchenko.dev (2026-04-07) reports ~45 tok/s at Q8_0 on an RTX 3060 (12 GB). The 5060 Ti has higher memory bandwidth, so it should exceed that figure, but no published 5060 Ti benchmark exists yet. Submit yours via /contribute.
  • VRAM usage: Cited weight footprint per Google AI for Developers — BF16 = 15 GB, FP8 = 7.5 GB, Q4_0 = 5 GB (weights only). Unsloth GGUF Q4_K_M file is 4.98 GB, Q8_0 is 8.19 GB, BF16 is 15.1 GB (file table). Add ~25 % runtime headroom for the KV cache at long contexts.
  • Quality notes: E4B is the "daily-driver" tier of the Gemma 4 family — 4.5 B effective / 8 B with embeddings, 42 layers, 262 K vocab, ~150 M vision encoder + ~300 M audio encoder. Image inputs support variable aspect/resolution with configurable visual-token budgets (70 / 140 / 280 / 560 / 1120). Audio inputs cap at 30 seconds per clip.
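Because audio inputs cap at 30 seconds per clip, longer recordings need to be split before they reach the model. A minimal sketch, assuming fixed back-to-back windows are acceptable (they may cut through a word). The helper `clip_windows` is hypothetical.

```python
def clip_windows(duration_s, max_clip_s=30.0):
    """Split [0, duration_s) into consecutive (start, end) windows of at most max_clip_s."""
    if duration_s <= 0:
        return []
    windows = []
    start = 0.0
    while start < duration_s:
        end = min(start + max_clip_s, duration_s)
        windows.append((start, end))
        start = end
    return windows

# A 75-second recording becomes three clips, the last one shorter than the cap.
for start, end in clip_windows(75):
    print(f"clip from {start:.0f}s to {end:.0f}s")
```

Each window can then be mapped to sample indices (start and end multiplied by the sample rate) when slicing the waveform you loaded.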

For the full benchmark data, see /check/gemma-4-e4b/rtx-5060-ti.

Troubleshooting

flash_attention_2 errors on the 5060 Ti

The canonical HF snippet uses dtype="auto" and does not request attn_implementation="flash_attention_2", so the default SDPA path runs cleanly on Blackwell. If you copy a third-party snippet that hardcodes flash_attention_2, the call will fail at first forward pass — FA2 wheels do not yet ship sm_120 kernels for RTX 50-series (see the tracking issue at Dao-AILab/flash-attention#2168). Fix by removing the override or setting attn_implementation="sdpa".
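The fix can be applied defensively when adapting third-party snippets. This sketch assumes RTX 50-series cards report compute capability (12, 0) and that you pass the capability in yourself, for example from `torch.cuda.get_device_capability()`. The helper `safe_attn_kwargs` is hypothetical.

```python
def safe_attn_kwargs(load_kwargs, capability):
    """Return a copy of from_pretrained kwargs with FA2 swapped for SDPA on sm_120."""
    fixed = dict(load_kwargs)  # copy so the caller's dict is left untouched
    on_sm_120 = tuple(capability) == (12, 0)
    if on_sm_120 and fixed.get("attn_implementation") == "flash_attention_2":
        fixed["attn_implementation"] = "sdpa"
    return fixed

# Example: kwargs copied from a snippet that hardcodes flash_attention_2.
third_party = {
    "dtype": "auto",
    "device_map": "auto",
    "attn_implementation": "flash_attention_2",
}
print(safe_attn_kwargs(third_party, capability=(12, 0)))
```

The returned dict can be unpacked into the `from_pretrained` call in place of the original keyword arguments.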

Ollama function-calling / streaming glitches

Per the danilchenko.dev walkthrough (2026-04-07), Gemma 4's hybrid sliding-window/global attention triggers parser bugs in Ollama's tool-call and streaming layers. Workaround: run llama.cpp directly (Path B) or vLLM for tool-use workflows.

enable_thinking not triggering on E4B

Discussion #26 on the HF card notes that enable_thinking=True in the chat template does not currently activate Thinking Mode on the E4B / E2B sizes. The recipe snippet sets enable_thinking=False to match the documented working behaviour; if you need extended reasoning, the 26 B MoE or 31 B dense variants are the supported path.

Tight on VRAM at long contexts

BF16 (Path A) sits at ~15 GB of weights on a 16 GB card, leaving ~1 GB for the KV cache. For contexts beyond 8 K tokens, switch to Path B with the Q8_0 GGUF (8.19 GB on disk): near-lossless quality at roughly half the VRAM, with room for 32 K+ context. The Google docs explicitly call out that the static-weight numbers exclude KV-cache overhead.