Gemma 4 E4B on RTX 4060 Ti 16GB: Multimodal Inference via Q4_K_M GGUF (with optional Q8_0 / BF16)

What You'll Build

A local Gemma 4 E4B instance on an RTX 4060 Ti 16GB. E4B is Google's instruction-tuned multimodal model with 4.5 B effective parameters (8 B with embeddings); it accepts text, images, audio, and video as input and produces text. On a 16 GB Ada Lovelace card the recipe pins Q4_K_M GGUF as the recommended variant (clean fit, lots of headroom) and documents Q8_0 GGUF as a comfortable upgrade and BF16 as a tight one.

Hardware data: RTX 4060 Ti 16GB (16 GB VRAM) · multimodal capable from Q4_K_M (~5 GB weights, ~6 GB peak with KV cache) up to BF16 (~15 GB weights, fits but ~1 GB headroom) · See benchmark data

ℹ️ Multimodal input, text-only output. Gemma 4 E4B reads images, audio (≤ 30 s clips), video (≤ 60 s at 1 fps), and text — and replies in text. It does not generate images, speech, or video. For TTS on this card, see Kokoro; for image generation, see Z-Image or Flux.2 Klein.

Requirements

Component	Minimum	Tested
GPU	6 GB (Q4_K_M GGUF) / 9 GB (Q8_0 GGUF) / 16 GB (BF16)	RTX 4060 Ti 16GB
RAM	16 GB system RAM	—
Storage	5 – 16 GB depending on quant (file-size table)	4.98 GB for Q4_K_M, 8.19 GB for Q8_0, 15.1 GB for BF16
Software	`llama.cpp` or Ollama (GGUF paths) OR Python 3.10+ with `transformers` (BF16 path)	—

Per the official Google AI for Developers docs, the static-weight memory footprint is 15 GB at BF16, 7.5 GB at FP8, 5 GB at Q4_0 — these numbers cover the weights only, not the KV cache or runtime overhead, so plan on at least 25 % headroom for non-trivial contexts.

Installation

This recipe defaults to Q4_K_M GGUF — it is the smallest quant that retains near-full multimodal instruction-following quality, and on the RTX 4060 Ti 16GB it loads in ~5 GB and leaves ample headroom for long contexts. Q8_0 and BF16 are documented as quality-step-up paths.

Path A — Ollama (one-liner, auto layer placement)

Ollama auto-detects the right number of GPU layers for your card; no manual -ngl flag is needed.

# macOS / Linux — install
curl -fsSL https://ollama.com/install.sh | sh

# Pull and run the Unsloth Q4_K_M GGUF (≈ 5 GB download on first run)
ollama run hf.co/unsloth/gemma-4-E4B-it-GGUF:Q4_K_M

The Unsloth GGUF repo at unsloth/gemma-4-E4B-it-GGUF hosts the file; its model tree explicitly links upstream google/gemma-4-E4B-it (the canonical Google release). The community-maintained bartowski/google_gemma-4-E4B-it-GGUF is an equivalent Q4_K_M mirror if you prefer Bartowski's quants.

Path B — `llama.cpp` (explicit `-ngl`)

# macOS / Linux
brew install llama.cpp

# Or build from source — https://github.com/ggml-org/llama.cpp

Then launch the OpenAI-compatible server. On a 16 GB card, offloading all layers to GPU is safe at every quant tier documented here; explicitly pin with -ngl 99 (llama.cpp clamps to the model's real layer count):

# OpenAI-compatible local server with web UI
llama-server -hf unsloth/gemma-4-E4B-it-GGUF:Q4_K_M -ngl 99

# Or one-shot CLI
llama-cli -hf unsloth/gemma-4-E4B-it-GGUF:Q4_K_M -ngl 99

The -hf flag downloads the GGUF directly from Hugging Face on first run and caches it locally. To switch tiers, replace the tag: :Q8_0 for the near-lossless quant, :BF16 for the full-precision weights (tight on 16 GB, see Troubleshooting), or :UD-Q4_K_XL for Unsloth's dynamic 4-bit variant.

Variant file-size table (from Unsloth's GGUF repo)

The figures below are read directly from unsloth/gemma-4-E4B-it-GGUF. They are the on-disk file sizes; runtime VRAM is roughly the on-disk size plus the KV cache and runtime buffers.

Quant	File size	Fits 16 GB?
Q4_K_M	4.98 GB	✅ Comfortable — pinned by this recipe
UD-Q4_K_XL	5.13 GB	✅ Comfortable — Unsloth dynamic 4-bit
Q5_K_M	5.48 GB	✅ Comfortable
Q8_0	8.19 GB	✅ Comfortable — near-lossless quality
BF16	15.1 GB	⚠️ Fits but tight — ~1 GB headroom for KV cache
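The fit column above can be reproduced with a few lines of arithmetic. The sketch below is illustrative only: it takes the file sizes from the table, assumes the card exposes a nominal 16 GB (ignoring driver reservations and the GB/GiB distinction), and applies the "at least 25 %" headroom rule of thumb from Requirements. The function and constant names are made up for this example.

```python
# File sizes (GB) copied from the table above.
FILE_SIZES_GB = {
    "Q4_K_M": 4.98,
    "UD-Q4_K_XL": 5.13,
    "Q5_K_M": 5.48,
    "Q8_0": 8.19,
    "BF16": 15.1,
}

CARD_VRAM_GB = 16.0       # assumption: nominal capacity, no driver overhead
HEADROOM_FRACTION = 0.25  # rule of thumb quoted in Requirements


def headroom_report(vram_gb=CARD_VRAM_GB, sizes=FILE_SIZES_GB, frac=HEADROOM_FRACTION):
    """Return {quant: (free_gb, meets_rule)}.

    meets_rule is True when the VRAM left after loading the weights is at
    least `frac` of the weight size.
    """
    report = {}
    for quant, size in sizes.items():
        free = vram_gb - size
        report[quant] = (round(free, 2), free >= frac * size)
    return report


if __name__ == "__main__":
    for quant, (free, ok) in headroom_report().items():
        status = "ok" if ok else "tight"
        print(f"{quant:11s} free ~{free:5.2f} GB  25% rule: {status}")
```

Under these assumptions only BF16 fails the 25 % rule, which matches the "fits but tight" label in the table.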

Optional Path C — `transformers` BF16 (Python, tight)

If you specifically need the canonical transformers Python API (e.g. for fine-tuning hooks), the full BF16 checkpoint sits at ~15 GB per Google's docs — it fits the 4060 Ti 16GB, but the runtime peak with non-trivial context can press the 16 GB ceiling. Prefer Path B / Q8_0 for headroom on long contexts.

# Default CUDA 12.x PyTorch wheel — Ada sm_89 kernels ship in the standard release
pip install -U torch
pip install -U transformers accelerate torchvision librosa

Unlike Blackwell (RTX 50-series, sm_120), the RTX 4060 Ti 16GB uses Ada Lovelace (sm_89), and the default pip install torch already includes the right kernels — no cu128-only wheel is required.

Running

Path A / B — chat via Ollama or llama.cpp

After ollama run … or llama-server … is up, each serves an HTTP API: Ollama on localhost:11434 (its native /api/chat endpoint, used below, alongside an OpenAI-compatible /v1 route) and llama-server on localhost:8080 (OpenAI-compatible). For an interactive chat, just type at the prompt; for programmatic use:

# Ollama (native API; the reply streams as newline-delimited JSON by default)
curl http://localhost:11434/api/chat -d '{
  "model": "hf.co/unsloth/gemma-4-E4B-it-GGUF:Q4_K_M",
  "messages": [{"role": "user", "content": "Write a short joke about saving VRAM."}]
}'

# llama-server (OpenAI-compatible)
curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
  "model": "gemma-4-e4b",
  "messages": [{"role": "user", "content": "Write a short joke about saving VRAM."}]
}'
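For scripted use, the two endpoints return differently shaped replies. The following is a minimal stdlib-only sketch, not an official client: it assumes Ollama's /api/chat streams newline-delimited JSON chunks that each carry a message.content fragment and a done flag, and that llama-server returns an OpenAI-style body with choices[0].message.content. The helper names are hypothetical.

```python
import json
import urllib.request


def join_ollama_stream(lines):
    """Concatenate assistant text from NDJSON lines (str or bytes)."""
    parts = []
    for raw in lines:
        raw = raw.strip()
        if not raw:
            continue
        chunk = json.loads(raw)
        parts.append(chunk.get("message", {}).get("content", ""))
        if chunk.get("done"):
            break
    return "".join(parts)


def extract_openai_reply(body):
    """Pull the assistant text out of an OpenAI-style chat completion body."""
    return json.loads(body)["choices"][0]["message"]["content"]


def post_json(url, payload):
    """POST a JSON payload and return the open HTTP response object."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    return urllib.request.urlopen(req)


def demo():
    """Requires both servers from Path A and Path B to be running locally."""
    msg = [{"role": "user", "content": "Write a short joke about saving VRAM."}]
    ollama_req = {"model": "hf.co/unsloth/gemma-4-E4B-it-GGUF:Q4_K_M", "messages": msg}
    with post_json("http://localhost:11434/api/chat", ollama_req) as resp:
        print(join_ollama_stream(resp))  # the response iterates line by line
    llama_req = {"model": "gemma-4-e4b", "messages": msg}
    with post_json("http://localhost:8080/v1/chat/completions", llama_req) as resp:
        print(extract_openai_reply(resp.read()))
```

Call demo() once the servers are up; the two parsing helpers can be exercised offline with canned responses.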

For multimodal input (images), add an images: [<base64>] array to the user message (Ollama) or attach an OpenAI-style image_url content block (llama.cpp, which needs the model's mmproj file loaded). Full multimodal usage is documented on the model card. Recommended sampling: temperature=1.0, top_p=0.95, top_k=64.
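The two image-attachment shapes described above can be sketched as request builders. This is an illustration under assumptions rather than a reference client: the function names are hypothetical, the Ollama builder places the sampling values under options and turns streaming off, and the OpenAI-style builder passes top_k as a top-level field, which is a llama-server extension rather than part of the OpenAI schema. Check the model card before relying on either shape.

```python
import base64

# Sampling values recommended in this recipe.
SAMPLING = {"temperature": 1.0, "top_p": 0.95, "top_k": 64}


def ollama_image_payload(model, prompt, image_bytes):
    """Build an Ollama /api/chat body with one base64 image on the user message."""
    b64 = base64.b64encode(image_bytes).decode("ascii")
    return {
        "model": model,
        "stream": False,  # ask for a single JSON reply instead of NDJSON chunks
        "messages": [{"role": "user", "content": prompt, "images": [b64]}],
        "options": dict(SAMPLING),
    }


def openai_image_payload(model, prompt, image_bytes, mime="image/png"):
    """Build an OpenAI-style chat body with the image as a data URL."""
    b64 = base64.b64encode(image_bytes).decode("ascii")
    return {
        "model": model,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    {"type": "image_url",
                     "image_url": {"url": f"data:{mime};base64,{b64}"}},
                ],
            }
        ],
        **SAMPLING,  # assumption: the server accepts top_k alongside the standard fields
    }
```

Either dictionary can be serialized with json.dumps and posted to the matching endpoint (localhost:11434/api/chat or localhost:8080/v1/chat/completions).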

Path C — transformers BF16

The Hugging Face card's canonical snippet, verbatim:

from transformers import AutoProcessor, AutoModelForCausalLM

MODEL_ID = "google/gemma-4-E4B-it"

processor = AutoProcessor.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    dtype="auto",        # picks BF16 on the 4060 Ti 16GB
    device_map="auto",
)

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Write a short joke about saving RAM."},
]

text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True, enable_thinking=False,
)
inputs = processor(text=text, return_tensors="pt").to(model.device)
input_len = inputs["input_ids"].shape[-1]

outputs = model.generate(**inputs, max_new_tokens=1024)
response = processor.decode(outputs[0][input_len:], skip_special_tokens=False)
print(processor.parse_response(response))

For image input, swap in AutoModelForMultimodalLM and pass {"type": "image", "url": "..."} blocks — full snippet on the model card.
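As a rough illustration of the message shape for image input, the helper below builds a single-turn conversation that pairs an image block with a text question. The {"type": "image", "url": ...} block is the one named above; the accompanying {"type": "text", "text": ...} block follows the common Hugging Face chat-template convention and is an assumption here, as is the helper name. The model card remains the authoritative source.

```python
def image_chat_messages(image_url, question):
    """Return a one-turn chat with an image block followed by a text block."""
    return [
        {
            "role": "user",
            "content": [
                {"type": "image", "url": image_url},  # block shape named in this recipe
                {"type": "text", "text": question},   # assumed companion text block
            ],
        }
    ]


if __name__ == "__main__":
    # Placeholder URL for illustration only.
    print(image_chat_messages("https://example.com/cat.png", "What is in this picture?"))
```

The returned list takes the place of the text-only messages list passed to processor.apply_chat_template in the snippet above.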

Results

Speed: A community walkthrough by danilchenko.dev (2026-04-07) reports ~45 tok/s at Q8_0 on an RTX 3060 (12 GB). Treat that as a same-class consumer reference point only: the RTX 3060 is Ampere (sm_86), not Ada, so it is not a same-architecture comparison. The RTX 4060 Ti 16GB has about 20 % less memory bandwidth (~288 GB/s vs the 3060's ~360 GB/s), but this recipe's Q4_K_M file is also smaller than Q8_0, so throughput should land in the same ballpark. No published RTX 4060 Ti 16GB benchmark exists yet. Submit yours via /contribute.
VRAM usage: Cited weight footprint per Google AI for Developers — BF16 = 15 GB, FP8 = 7.5 GB, Q4_0 = 5 GB (weights only). Unsloth GGUF Q4_K_M file is 4.98 GB, Q8_0 is 8.19 GB, BF16 is 15.1 GB (file table). On a 16 GB card, Q4_K_M leaves ~10 GB free after load (room for huge contexts); Q8_0 leaves ~7 GB; BF16 leaves ~1 GB — feasible but tight.
Quality notes: E4B is the "daily-driver" tier of the Gemma 4 family — 4.5 B effective / 8 B with embeddings, 42 layers, 262 K vocab, ~150 M vision encoder + ~300 M audio encoder. Image inputs support variable aspect/resolution with configurable visual-token budgets (70 / 140 / 280 / 560 / 1120). Audio inputs cap at 30 seconds per clip. Q4_K_M is the smallest quant that retains near-full instruction-following quality; below it (Q3, Q2) the multimodal alignment degrades visibly. Step up to Q8_0 if you have a quality-sensitive workload — the file is 8.19 GB and runs comfortably with full GPU offload on this card.

For the full benchmark data, see /check/gemma-4-e4b/rtx-4060-ti-16gb.

Troubleshooting

BF16 fits but presses the 16 GB ceiling at long contexts

The BF16 weights are 15.1 GB per the Unsloth file table, which leaves only ~1 GB of VRAM headroom on a 16 GB card after load. That is enough for short contexts, but the KV cache will hit the ceiling well before the model's full context window. If you want near-full-precision quality and long contexts, step down to Q8_0 (8.19 GB on disk per the same file table): near-lossless quality at roughly half the VRAM, with room for 32 K+ context. The Google docs explicitly call out that the static-weight numbers exclude KV-cache overhead.
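To get a feel for how quickly a KV cache eats the leftover VRAM, the textbook per-token estimate (keys plus values, per layer, per KV head) can be sketched as below. Only the 42-layer count comes from this recipe. The KV-head count and head dimension are hypothetical placeholders, not values from the model card, and the formula treats every layer as full attention, so it ignores both the savings from Gemma 4's sliding-window layers and the extra VRAM taken by compute buffers. Treat the output as an order-of-magnitude guide only.

```python
# ASSUMPTIONS: N_KV_HEADS and HEAD_DIM are hypothetical placeholders.
# Only N_LAYERS (42) is taken from this recipe.
N_LAYERS = 42
N_KV_HEADS = 2      # hypothetical
HEAD_DIM = 256      # hypothetical
BYTES_PER_ELEM = 2  # 16-bit cache entries


def kv_bytes_per_token(n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """Bytes of KV cache added per token; the factor 2 covers keys and values."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem


def max_context_tokens(budget_gb, per_token_bytes):
    """Whole tokens whose KV cache fits in `budget_gb` (decimal GB)."""
    return int(round(budget_gb * 1e9)) // per_token_bytes


if __name__ == "__main__":
    per_tok = kv_bytes_per_token(N_LAYERS, N_KV_HEADS, HEAD_DIM, BYTES_PER_ELEM)
    # Free-VRAM budgets: 16 GB minus the file sizes quoted in this recipe.
    for label, budget_gb in [("BF16", 0.9), ("Q8_0", 7.81)]:
        tokens = max_context_tokens(budget_gb, per_tok)
        print(f"{label}: roughly {tokens} tokens of KV cache (rough sketch)")
```

Swapping in the real attention dimensions from the model's config would make the estimate more meaningful; the structure of the calculation stays the same.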

Ollama function-calling / streaming glitches

Per the danilchenko.dev walkthrough (2026-04-07), Gemma 4's hybrid sliding-window/global attention triggers parser bugs in Ollama's tool-call and streaming layers. Workaround: run llama.cpp directly (Path B) or vLLM for tool-use workflows until the parser is patched.

`flash_attention_2` errors (if you take the optional transformers path)

If you copy a third-party transformers snippet that hardcodes attn_implementation="flash_attention_2", it may fail at first forward pass on some driver / wheel combinations — see the tracking issue at Dao-AILab/flash-attention#2168. Fix by removing the override or setting attn_implementation="sdpa". (The Ada Lovelace sm_89 kernels in FA2 are shipped — this is a wheel-mismatch issue on the 4060 Ti 16GB rather than the missing-kernel issue 50-series users hit.)
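One way to apply this fix without editing every copied snippet is to sanitize the loading arguments before they reach from_pretrained. The helper below is a hypothetical convenience wrapper, not part of transformers; it only rewrites the attn_implementation key and leaves everything else untouched.

```python
def with_safe_attention(load_kwargs, fallback="sdpa"):
    """Return a copy of from_pretrained kwargs with a hard-coded
    flash_attention_2 override replaced by `fallback`."""
    fixed = dict(load_kwargs)  # copy so the caller's dict is not mutated
    if fixed.get("attn_implementation") == "flash_attention_2":
        fixed["attn_implementation"] = fallback
    return fixed


if __name__ == "__main__":
    copied = {
        "dtype": "auto",
        "device_map": "auto",
        "attn_implementation": "flash_attention_2",
    }
    print(with_safe_attention(copied))
```

The result can be splatted into the loader from Path C, for example AutoModelForCausalLM.from_pretrained(MODEL_ID, **with_safe_attention(copied)).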

`enable_thinking` not triggering on E4B

Discussions on the HF card note that enable_thinking=True in the chat template does not currently activate Thinking Mode on the E4B / E2B sizes. The recipe snippet sets enable_thinking=False to match the documented working behaviour; if you need extended reasoning, the 26 B MoE or 31 B dense variants are the supported path — but neither fits this card, so this recipe leaves Thinking Mode out of scope.