Gemma 4 E4B on RTX 4060: Multimodal Inference via Q4_K_M GGUF (llama.cpp or Ollama)

multimodal · beginner · 6GB+ VRAM · May 19, 2026
prerequisites
  • NVIDIA RTX 4060 (8 GB VRAM) or any GPU with ≥ 6 GB VRAM
  • llama.cpp or Ollama installed (GGUF paths) — see Installation
  • Python 3.10+ only if you take the optional FP8 transformers path

What You'll Build

A local Gemma 4 E4B instance on an RTX 4060. Gemma 4 E4B is Google's 4.5 B-effective-parameter (8 B with embeddings) instruction-tuned multimodal model: it accepts text, images, audio, and video as input and produces text. On an 8 GB card the recipe pins Q4_K_M GGUF as the only comfortably fitting variant and walks through two runtimes: llama.cpp (manual control) and Ollama (one-liner).

Hardware data: RTX 4060 (8 GB VRAM) · multimodal capable at Q4_K_M (~5 GB weights, ~6 GB peak with KV cache) · See benchmark data

⚠️ BF16 will not fit 8 GB — use Q4_K_M instead. Google's docs list the static-weight footprint as 15 GB at BF16, 7.5 GB at FP8, 5 GB at Q4_0 (Gemma core docs). BF16 needs roughly twice the card's capacity, and Q8_0 (8.19 GB on disk per Unsloth's file table) does not leave room for the KV cache. Q4_K_M (4.98 GB) is the variant pinned by this recipe.

ℹ️ Multimodal input, text-only output. Gemma 4 E4B reads images, audio (≤ 30 s clips), video (≤ 60 s at 1 fps), and text — and replies in text. It does not generate images, speech, or video. For TTS on this card, see Kokoro; for image generation, see siblings of the Z-Image / Flux line.

Requirements

Component | Minimum | Tested
GPU | 6 GB (Q4_K_M GGUF); 8 GB card recommended for headroom | RTX 4060 (8 GB)
RAM | 16 GB system RAM |
Storage | ~5 GB for the Q4_K_M GGUF (file table) | 4.98 GB
Software | llama.cpp or Ollama (GGUF path) |

Per the official Google AI for Developers docs, the static-weight memory footprint is 5 GB at Q4_0 — this covers weights only, not the KV cache or runtime overhead, so plan on at least 25 % headroom for non-trivial contexts (≈ 6 GB peak on the 4060).

Installation

This recipe pins Q4_K_M GGUF because it is the only Gemma 4 E4B variant with comfortable headroom on an 8 GB card. Two runtimes — Ollama (zero config) and llama.cpp (full control) — are both documented; pick whichever fits your workflow.

Path A — Ollama (one-liner, auto layer placement)

Ollama auto-detects the right number of GPU layers for your card; no manual -ngl flag is needed.

# macOS / Linux — install
curl -fsSL https://ollama.com/install.sh | sh

# Pull and run the Unsloth Q4_K_M GGUF (≈ 5 GB download on first run)
ollama run hf.co/unsloth/gemma-4-E4B-it-GGUF:Q4_K_M

The Unsloth GGUF repo at unsloth/gemma-4-E4B-it-GGUF hosts the file; its model tree explicitly links upstream google/gemma-4-E4B-it (the canonical Google release). The community-maintained bartowski/google_gemma-4-E4B-it-GGUF is an equivalent Q4_K_M mirror if you prefer Bartowski's quants.

Path B — llama.cpp (explicit -ngl)

# macOS / Linux
brew install llama.cpp

# Or build from source — https://github.com/ggml-org/llama.cpp

Then launch the OpenAI-compatible server. On an 8 GB card, offloading every layer to the GPU is safe at Q4_K_M; pin this explicitly with -ngl 99 (llama.cpp clamps the value to the model's real layer count):

# OpenAI-compatible local server with web UI
llama-server -hf unsloth/gemma-4-E4B-it-GGUF:Q4_K_M -ngl 99

# Or one-shot CLI
llama-cli -hf unsloth/gemma-4-E4B-it-GGUF:Q4_K_M -ngl 99

The -hf flag streams the GGUF directly from HuggingFace on first run and caches it locally.
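Because the first launch has to download about 5 GB before the server can answer, a script that starts llama-server usually needs to wait for it. The sketch below polls a health URL until it returns HTTP 200. It assumes the /health endpoint that current llama-server builds expose on the default port 8080; the function names, the 10-minute deadline, and the 2-second interval are illustrative choices, not part of the recipe.

```python
import time
import urllib.error
import urllib.request


def http_probe(url: str, timeout_s: float = 2.0) -> bool:
    """Return True if a GET to `url` answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or an error status while the model loads.
        return False


def wait_until_ready(url="http://localhost:8080/health", deadline_s=600.0,
                     interval_s=2.0, probe=http_probe, sleep=time.sleep) -> bool:
    """Poll `probe(url)` until it succeeds or roughly `deadline_s` of sleeping has passed."""
    waited = 0.0
    while waited <= deadline_s:
        if probe(url):
            return True
        sleep(interval_s)
        waited += interval_s
    return False
```

Calling wait_until_ready() right after spawning llama-server returns True once the model is loaded, or False if the deadline passes first. The probe and sleep parameters are injectable so the loop can be exercised without a running server.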

Variant file-size table (from Unsloth's GGUF repo)

The figures below are read directly from unsloth/gemma-4-E4B-it-GGUF. They are on-disk file sizes; runtime VRAM is roughly the file size plus the KV cache and compute buffers.

Quant | File size | Fits 8 GB?
Q4_K_M | 4.98 GB | ✅ Comfortable, pinned by this recipe
UD-Q4_K_XL | 5.13 GB | ✅ Tight but OK (Unsloth dynamic 4-bit)
Q8_0 | 8.19 GB | ❌ Will OOM on 8 GB, leaves no room for KV cache
BF16 | 15.1 GB | ❌ Roughly double the card's capacity
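The fit verdicts in the table follow from simple arithmetic: on-disk size plus runtime headroom, compared with the card's 8 GB. The sketch below applies the recipe's ~25 % headroom rule of thumb to each file size. The headroom figure is the recipe's estimate, not a measurement, and real usage depends on context length and on what else is using the GPU.

```python
CARD_VRAM_GB = 8.0   # RTX 4060
HEADROOM = 0.25      # the recipe's ~25 % rule of thumb, not a measured value

# On-disk sizes in GB, copied from the variant table above.
VARIANTS = {"Q4_K_M": 4.98, "UD-Q4_K_XL": 5.13, "Q8_0": 8.19, "BF16": 15.1}


def estimated_peak_gb(file_gb: float, headroom: float = HEADROOM) -> float:
    """File size scaled up by the headroom fraction."""
    return file_gb * (1.0 + headroom)


def fits(file_gb: float, card_gb: float = CARD_VRAM_GB,
         headroom: float = HEADROOM) -> bool:
    """True if the estimated peak stays within the card's VRAM."""
    return estimated_peak_gb(file_gb, headroom) <= card_gb


for name, size in VARIANTS.items():
    verdict = "fits" if fits(size) else "does not fit"
    print(f"{name:<11} {size:>5.2f} GB on disk, "
          f"~{estimated_peak_gb(size):.1f} GB estimated peak: {verdict}")
```

With these inputs Q4_K_M and UD-Q4_K_XL stay under 8 GB while Q8_0 and BF16 do not, matching the table.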

Optional Path C — FP8 transformers (advanced, tight)

If you specifically need the canonical transformers Python API (e.g. for fine-tuning hooks), an FP8 cast keeps the weights at ~7.5 GB per Google's docs — borderline on an 8 GB card and likely to OOM with any non-trivial context. The GGUF path above is strongly preferred. If you must try FP8:

# CUDA 12.x PyTorch wheel
pip install -U torch
pip install -U transformers accelerate torchvision librosa

Then load with an 8-bit / FP8 quantization config (see Hugging Face quantization docs).

Running

Path A / B — chat via Ollama or llama.cpp

After ollama run … or llama-server … is up, each exposes an HTTP API: Ollama on localhost:11434 (its native /api/chat, used below, alongside an OpenAI-compatible /v1/chat/completions) and llama-server on localhost:8080 (OpenAI-compatible). For an interactive chat, just type at the prompt; for programmatic use:

# Ollama
curl http://localhost:11434/api/chat -d '{
  "model": "hf.co/unsloth/gemma-4-E4B-it-GGUF:Q4_K_M",
  "messages": [{"role": "user", "content": "Write a short joke about saving VRAM."}]
}'

# llama-server (OpenAI-compatible)
curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
  "model": "gemma-4-e4b",
  "messages": [{"role": "user", "content": "Write a short joke about saving VRAM."}]
}'
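The same two requests can be sent from Python with only the standard library. This is a minimal sketch that assumes the default ports and the model names used in the curl examples. The helper names are illustrative. For Ollama it sets "stream": false so the reply arrives as one JSON object instead of a stream of chunks.

```python
import json
import urllib.request

LLAMA_URL = "http://localhost:8080/v1/chat/completions"
OLLAMA_URL = "http://localhost:11434/api/chat"


def build_request(url: str, payload: dict) -> urllib.request.Request:
    """Wrap a JSON payload in a POST request."""
    data = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        url, data=data, headers={"Content-Type": "application/json"}, method="POST"
    )


def llama_payload(prompt: str, model: str = "gemma-4-e4b") -> dict:
    return {"model": model, "messages": [{"role": "user", "content": prompt}]}


def ollama_payload(prompt: str,
                   model: str = "hf.co/unsloth/gemma-4-E4B-it-GGUF:Q4_K_M") -> dict:
    # stream=False requests a single JSON reply rather than chunked output.
    return {"model": model, "stream": False,
            "messages": [{"role": "user", "content": prompt}]}


def extract_text(response: dict, backend: str) -> str:
    """Pull the assistant text out of either response shape."""
    if backend == "ollama":
        return response["message"]["content"]
    return response["choices"][0]["message"]["content"]


def chat(prompt: str, backend: str = "llama", timeout_s: float = 120.0) -> str:
    """Send one prompt; requires the matching server to be running."""
    if backend == "ollama":
        req = build_request(OLLAMA_URL, ollama_payload(prompt))
    else:
        req = build_request(LLAMA_URL, llama_payload(prompt))
    with urllib.request.urlopen(req, timeout=timeout_s) as resp:
        return extract_text(json.load(resp), backend)
```

With a server running, print(chat("Write a short joke about saving VRAM.")) prints the reply from llama-server; pass backend="ollama" to target Ollama instead.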

For multimodal input (images), pass images: [<base64>] (Ollama) or attach with the OpenAI image_url content block (llama.cpp mmproj build) — full multimodal usage is documented on the model card. Recommended sampling: temperature=1.0, top_p=0.95, top_k=64.
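The two image formats mentioned above differ only in where the base64 data goes. The sketch below builds both request bodies and applies the recommended sampling values. It assumes Ollama's images field on a chat message and the OpenAI-style image_url block with a data URI. Passing top_k at the top level of a llama-server request relies on a llama.cpp extension to the OpenAI schema, so treat that part as an assumption to verify against your build. The function names are illustrative.

```python
import base64

# Recommended sampling values from the recipe.
SAMPLING = {"temperature": 1.0, "top_p": 0.95, "top_k": 64}


def ollama_image_message(prompt: str, image_bytes: bytes) -> dict:
    """Ollama style: raw base64 strings in an `images` list on the message."""
    b64 = base64.b64encode(image_bytes).decode("ascii")
    return {"role": "user", "content": prompt, "images": [b64]}


def openai_image_message(prompt: str, image_bytes: bytes,
                         mime: str = "image/png") -> dict:
    """OpenAI style: a content list with a text block and an image_url block."""
    b64 = base64.b64encode(image_bytes).decode("ascii")
    return {"role": "user", "content": [
        {"type": "text", "text": prompt},
        {"type": "image_url", "image_url": {"url": f"data:{mime};base64,{b64}"}},
    ]}


def ollama_body(model: str, prompt: str, image_bytes: bytes) -> dict:
    # Ollama takes sampling parameters under "options".
    return {"model": model, "stream": False, "options": dict(SAMPLING),
            "messages": [ollama_image_message(prompt, image_bytes)]}


def llama_server_body(model: str, prompt: str, image_bytes: bytes,
                      mime: str = "image/png") -> dict:
    # Assumption: llama-server reads these sampling keys from the top level.
    return {"model": model, **SAMPLING,
            "messages": [openai_image_message(prompt, image_bytes, mime)]}
```

Read the image with open(path, "rb").read() and POST the resulting body as JSON to /api/chat (Ollama) or /v1/chat/completions (llama-server).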

Results

  • Speed: A community walkthrough by danilchenko.dev (2026-04-07) reports ~45 tok/s at Q8_0 on an RTX 3060 (12 GB). The RTX 4060 has roughly 25 % less memory bandwidth (~272 GB/s vs 360 GB/s), but the Q4_K_M file is about 40 % smaller than Q8_0 (4.98 GB vs 8.19 GB), so bandwidth-bound decoding should land in the same range or higher. No published RTX 4060 benchmark exists yet; submit yours via /contribute.
  • VRAM usage: Cited weight footprint per Google AI for Developers — Q4_0 = 5 GB (weights only). The Unsloth Q4_K_M GGUF file is 4.98 GB (file table). Add ~25 % runtime headroom for the KV cache at long contexts → ~6 GB peak on the 4060, leaving ~2 GB free.
  • Quality notes: E4B is the "daily-driver" tier of the Gemma 4 family — 4.5 B effective / 8 B with embeddings, 42 layers, 262 K vocab, ~150 M vision encoder + ~300 M audio encoder. Image inputs support variable aspect/resolution with configurable visual-token budgets (70 / 140 / 280 / 560 / 1120). Audio inputs cap at 30 seconds per clip. Q4_K_M is the smallest quant that retains near-full instruction-following quality; below it (Q3, Q2) the multimodal alignment degrades visibly.
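The speed comparison in the first bullet can be made concrete with a crude bandwidth-bound model: if every generated token has to read the full weight file once, tokens per second cannot exceed memory bandwidth divided by file size. This is a rough ceiling under that stated assumption, not a benchmark. It ignores KV-cache reads, compute limits, and the fact that not every stored byte is necessarily touched per token.

```python
def bandwidth_ceiling_tok_s(bandwidth_gb_s: float, weights_gb: float) -> float:
    """Upper bound on decode speed if each token reads all weights once."""
    return bandwidth_gb_s / weights_gb


# Bandwidth and file-size figures are the ones quoted in this recipe.
rtx3060_q8 = bandwidth_ceiling_tok_s(360.0, 8.19)   # RTX 3060, Q8_0
rtx4060_q4 = bandwidth_ceiling_tok_s(272.0, 4.98)   # RTX 4060, Q4_K_M

print(f"RTX 3060 @ Q8_0   ceiling: ~{rtx3060_q8:.0f} tok/s")
print(f"RTX 4060 @ Q4_K_M ceiling: ~{rtx4060_q4:.0f} tok/s")
```

Under this model the 4060's smaller quant more than offsets its lower bandwidth, which is the reasoning behind expecting comparable or better throughput. A measured number is still needed to confirm it.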

For the full benchmark data, see /check/gemma-4-e4b/rtx-4060.

Troubleshooting

Q8_0 or BF16 OOMs on the 4060

This is expected: Q8_0 (8.19 GB on disk per the Unsloth file table) leaves no room for the KV cache on an 8 GB card, and BF16 is ~15 GB. Switch to Q4_K_M (4.98 GB) or UD-Q4_K_XL (5.13 GB). The 16 GB-card sibling recipe documents the Q8_0 and BF16 paths if you later upgrade hardware.

Ollama function-calling / streaming glitches

Per the danilchenko.dev walkthrough (2026-04-07), Gemma 4's hybrid sliding-window/global attention triggers parser bugs in Ollama's tool-call and streaming layers. Workaround: run llama.cpp directly (Path B) or vLLM for tool-use workflows until the parser is patched.

Long contexts hit VRAM ceiling

Loading the Q4_K_M weights (~5 GB) plus runtime buffers leaves roughly 2 GB free on the 4060. That is enough for about 8 K of context, but larger contexts will spill into system RAM or OOM. Either reduce the context size (llama.cpp --ctx-size 4096) or accept partial CPU offload via Ollama, which handles layer placement automatically. The Google docs explicitly call out that the static-weight numbers exclude KV-cache overhead.
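How fast the KV cache grows with context can be estimated with the standard formula: two tensors (keys and values) per layer, each n_kv_heads × head_dim × n_ctx elements. The sketch below uses the 42-layer figure from the Quality notes. The KV-head count and head dimension are hypothetical placeholders, not Gemma 4 E4B's real values, so read the actual numbers from the model's config before trusting the output. The formula also treats every layer as full attention, so it overestimates for a hybrid sliding-window model.

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   n_ctx: int, bytes_per_elem: int = 2) -> int:
    """Full-attention KV cache size; the leading 2 covers keys and values."""
    return 2 * n_layers * n_kv_heads * head_dim * n_ctx * bytes_per_elem


N_LAYERS = 42     # from the Quality notes in this recipe
N_KV_HEADS = 2    # HYPOTHETICAL placeholder, check the model config
HEAD_DIM = 256    # HYPOTHETICAL placeholder, check the model config

for n_ctx in (4096, 8192, 32768):
    gib = kv_cache_bytes(N_LAYERS, N_KV_HEADS, HEAD_DIM, n_ctx) / 2**30
    print(f"n_ctx={n_ctx:>6}: ~{gib:.2f} GiB at f16 (upper bound)")
```

The cache scales linearly with n_ctx, so halving --ctx-size halves this part of the VRAM budget.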

flash_attention_2 errors (if you take the optional FP8 transformers path)

If you copy a third-party transformers snippet that hardcodes attn_implementation="flash_attention_2", it will fail on first forward pass on older driver / wheel combinations — see the tracking issue at Dao-AILab/flash-attention#2168. Fix by removing the override or setting attn_implementation="sdpa". (The Ada Lovelace sm_89 kernels in FA2 are shipped, so this is mostly a wheel-mismatch issue on the 4060 rather than the missing-kernel issue 50-series users hit.)
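One low-risk way to apply this fix to a copied snippet is to rewrite its loading arguments before they reach from_pretrained. The helper below is a hypothetical convenience function, not part of transformers. It only assumes the attn_implementation keyword and the "flash_attention_2" / "sdpa" values named above.

```python
def safe_attention_kwargs(load_kwargs: dict) -> dict:
    """Return a copy of from_pretrained kwargs with FlashAttention-2 swapped for SDPA."""
    fixed = dict(load_kwargs)  # copy so the caller's dict is left untouched
    if fixed.get("attn_implementation") == "flash_attention_2":
        fixed["attn_implementation"] = "sdpa"
    return fixed


# Example: kwargs as a third-party snippet might hardcode them.
third_party = {"attn_implementation": "flash_attention_2", "device_map": "auto"}
print(safe_attention_kwargs(third_party))
```

Pass the result as **safe_attention_kwargs(third_party) to whichever from_pretrained call the snippet uses. Other keys pass through unchanged.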

enable_thinking not triggering on E4B

Discussions on the HF card note that enable_thinking=True in the chat template does not currently activate Thinking Mode on the E4B / E2B sizes. If you need extended reasoning, the 26 B MoE or 31 B dense variants are the supported path — but neither fits an 8 GB card, so this recipe leaves Thinking Mode out of scope.