What You'll Build
A local Gemma 4 E4B instance on an RTX 4060. Gemma 4 E4B is Google's 4.5 B-effective-parameter (8 B with embeddings) instruction-tuned multimodal model: it accepts text, images, audio, and video as input and produces text. On an 8 GB card the recipe pins Q4_K_M GGUF as the only comfortably fitting variant and walks through two runtimes: llama.cpp (manual control) and Ollama (one-liner).
Hardware data: RTX 4060 (8 GB VRAM) · multimodal capable at Q4_K_M (~5 GB weights, ~6 GB peak with KV cache) · See benchmark data
⚠️ BF16 will not fit 8 GB — use Q4_K_M instead. Google's docs list the static-weight footprint as 15 GB at BF16, 7.5 GB at FP8, 5 GB at Q4_0 (Gemma core docs). BF16 needs roughly twice the card's capacity, and Q8_0 (8.19 GB on disk per Unsloth's file table) does not leave room for the KV cache. Q4_K_M (4.98 GB) is the variant pinned by this recipe.
ℹ️ Multimodal input, text-only output. Gemma 4 E4B reads images, audio (≤ 30 s clips), video (≤ 60 s at 1 fps), and text — and replies in text. It does not generate images, speech, or video. For TTS on this card, see Kokoro; for image generation, see siblings of the Z-Image / Flux line.
Requirements
| Component | Minimum | Tested |
|---|---|---|
| GPU | 6 GB (Q4_K_M GGUF) — 8 GB card recommended for headroom | RTX 4060 (8 GB) |
| RAM | 16 GB system RAM | — |
| Storage | ~5 GB for the Q4_K_M GGUF (file table) | 4.98 GB |
| Software | llama.cpp or Ollama (GGUF path) | — |
Per the official Google AI for Developers docs, the static-weight memory footprint is 5 GB at Q4_0 — this covers weights only, not the KV cache or runtime overhead, so plan on at least 25 % headroom for non-trivial contexts (≈ 6 GB peak on the 4060).
Installation
This recipe pins Q4_K_M GGUF because it is the only Gemma 4 E4B variant with comfortable headroom on an 8 GB card. Two runtimes — Ollama (zero config) and llama.cpp (full control) — are both documented; pick whichever fits your workflow.
Path A — Ollama (one-liner, auto layer placement)
Ollama auto-detects the right number of GPU layers for your card; no manual -ngl flag is needed.
# Linux: install via the official script (on macOS, install the Ollama app from ollama.com or use Homebrew)
curl -fsSL https://ollama.com/install.sh | sh
# Pull and run the Unsloth Q4_K_M GGUF (≈ 5 GB download on first run)
ollama run hf.co/unsloth/gemma-4-E4B-it-GGUF:Q4_K_M
The Unsloth GGUF repo at unsloth/gemma-4-E4B-it-GGUF hosts the file; its model tree explicitly links upstream google/gemma-4-E4B-it (the canonical Google release). The community-maintained bartowski/google_gemma-4-E4B-it-GGUF is an equivalent Q4_K_M mirror if you prefer Bartowski's quants.
Path B — llama.cpp (explicit -ngl)
# macOS / Linux
brew install llama.cpp
# Or build from source — https://github.com/ggml-org/llama.cpp
Then launch the OpenAI-compatible server. On an 8 GB card, offloading every layer to the GPU is safe at Q4_K_M; pin this explicitly with -ngl 99, which llama.cpp clamps to the model's real layer count:
# OpenAI-compatible local server with web UI
llama-server -hf unsloth/gemma-4-E4B-it-GGUF:Q4_K_M -ngl 99
# Or one-shot CLI
llama-cli -hf unsloth/gemma-4-E4B-it-GGUF:Q4_K_M -ngl 99
The -hf flag streams the GGUF directly from HuggingFace on first run and caches it locally.
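If you would rather start the server from a script than from a shell, the launch above can be wrapped as follows. This is a minimal sketch, not part of the original recipe: the helper names `build_server_cmd`, `wait_until_ready`, and `main` are hypothetical, and it assumes `llama-server` is on your PATH and serves its `/health` endpoint on the chosen port. The first run can take several minutes because the GGUF has to download, hence the generous timeout.

```python
import subprocess
import time
import urllib.error
import urllib.request


def build_server_cmd(repo="unsloth/gemma-4-E4B-it-GGUF", quant="Q4_K_M",
                     n_gpu_layers=99, port=8080):
    """Return the llama-server argv from Path B, plus an explicit port."""
    return [
        "llama-server",
        "-hf", f"{repo}:{quant}",
        "-ngl", str(n_gpu_layers),
        "--port", str(port),
    ]


def wait_until_ready(port=8080, timeout_s=600.0, poll_s=2.0):
    """Poll /health until it answers 200 or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    url = f"http://localhost:{port}/health"
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # not listening yet, or still loading the model
        time.sleep(poll_s)
    return False


def main():
    """Start the server, report readiness, and stop it again."""
    server = subprocess.Popen(build_server_cmd())
    try:
        print("ready" if wait_until_ready() else "timed out")
    finally:
        server.terminate()
```

Call `main()` to try it. Keeping the argv in a list (rather than a shell string) avoids quoting problems if you later add flags such as `--ctx-size`.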
Variant file-size table (from Unsloth's GGUF repo)
The figures below are read directly from unsloth/gemma-4-E4B-it-GGUF. They are the on-disk file sizes; runtime VRAM is on-disk plus the KV cache.
| Quant | File size | Fits 8 GB? |
|---|---|---|
| Q4_K_M | 4.98 GB | ✅ Comfortable — pinned by this recipe |
| UD-Q4_K_XL | 5.13 GB | ✅ Tight but OK — Unsloth dynamic 4-bit |
| Q8_0 | 8.19 GB | ❌ Will OOM on 8 GB — leaves no room for KV cache |
| BF16 | 15.1 GB | ❌ Roughly double the card's capacity |
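The fit column can be reproduced with a few lines of arithmetic. The sketch below is illustrative only: it applies the recipe's ~25 % headroom rule of thumb to the on-disk sizes in the table, and the names `FILE_SIZES_GB`, `estimated_peak_gb`, and `fits` are made up for this example. Real peak usage depends on context length and runtime, so treat the result as a first filter rather than a guarantee.

```python
CARD_VRAM_GB = 8.0   # RTX 4060
HEADROOM = 0.25      # rule-of-thumb allowance for KV cache and runtime overhead

# On-disk sizes in GB, copied from the table above.
FILE_SIZES_GB = {
    "Q4_K_M": 4.98,
    "UD-Q4_K_XL": 5.13,
    "Q8_0": 8.19,
    "BF16": 15.1,
}


def estimated_peak_gb(file_size_gb, headroom=HEADROOM):
    """Weights plus a proportional allowance for cache and overhead."""
    return file_size_gb * (1.0 + headroom)


def fits(file_size_gb, vram_gb=CARD_VRAM_GB, headroom=HEADROOM):
    """True if the estimated peak stays within the card's VRAM."""
    return estimated_peak_gb(file_size_gb, headroom) <= vram_gb


def report():
    """Print one line per quant with its estimated peak and verdict."""
    for quant, size in FILE_SIZES_GB.items():
        verdict = "fits" if fits(size) else "does not fit"
        print(f"{quant:<11} {estimated_peak_gb(size):5.2f} GB peak: {verdict}")
```

Under these assumptions both 4-bit files land a little above 6 GB, while Q8_0 already exceeds the card before any cache is allocated.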
Optional Path C — FP8 transformers (advanced, tight)
If you specifically need the canonical transformers Python API (e.g. for fine-tuning hooks), an FP8 cast keeps the weights at ~7.5 GB per Google's docs — borderline on an 8 GB card and likely to OOM with any non-trivial context. The GGUF path above is strongly preferred. If you must try FP8:
# CUDA 12.x PyTorch wheel
pip install -U torch
pip install -U transformers accelerate torchvision librosa
Then load with an 8-bit / FP8 quantization config (see Hugging Face quantization docs).
Running
Path A / B — chat via Ollama or llama.cpp
After ollama run … or llama-server … is up, both expose an HTTP API: Ollama on localhost:11434 (its native /api/chat, shown below, plus an OpenAI-compatible /v1 endpoint) and llama-server on localhost:8080 (OpenAI-compatible). For an interactive chat, just type at the prompt; for programmatic use:
# Ollama (native API; "stream": false returns one JSON object instead of a stream)
curl http://localhost:11434/api/chat -d '{
  "model": "hf.co/unsloth/gemma-4-E4B-it-GGUF:Q4_K_M",
  "stream": false,
  "messages": [{"role": "user", "content": "Write a short joke about saving VRAM."}]
}'
# llama-server (OpenAI-compatible)
curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
  "model": "gemma-4-e4b",
  "messages": [{"role": "user", "content": "Write a short joke about saving VRAM."}]
}'
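The same two calls can be made from Python with only the standard library. This is a sketch under a few assumptions: the helper names are hypothetical, streaming is disabled so each server returns a single JSON object, and the reply is read from `message.content` (Ollama's native API) or `choices[0].message.content` (OpenAI-style), which is how those response formats are laid out at the time of writing.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/chat"
LLAMA_SERVER_URL = "http://localhost:8080/v1/chat/completions"


def build_request(url, model, prompt):
    """Build a non-streaming, single-turn chat request for either server."""
    body = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
    }
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_reply(url, payload):
    """Pull the assistant text out of either response shape."""
    if url.endswith("/api/chat"):  # Ollama native API
        return payload["message"]["content"]
    return payload["choices"][0]["message"]["content"]  # OpenAI-style


def chat(url, model, prompt, timeout_s=300):
    """Send the prompt and return the assistant's reply as a string."""
    request = build_request(url, model, prompt)
    with urllib.request.urlopen(request, timeout=timeout_s) as resp:
        return extract_reply(url, json.load(resp))
```

For example, `chat(OLLAMA_URL, "hf.co/unsloth/gemma-4-E4B-it-GGUF:Q4_K_M", "Write a short joke about saving VRAM.")` mirrors the first curl call, and swapping in `LLAMA_SERVER_URL` with model `"gemma-4-e4b"` mirrors the second.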
For multimodal input (images), pass images: [<base64>] (Ollama) or attach with the OpenAI image_url content block (llama.cpp mmproj build) — full multimodal usage is documented on the model card. Recommended sampling: temperature=1.0, top_p=0.95, top_k=64.
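The two image formats and the recommended sampling values can be sketched as request bodies. Treat this as an illustration rather than a reference: the function names are hypothetical, the Ollama body assumes sampling parameters go under `options`, and the llama-server body assumes the server accepts `top_k` as a top-level extension to the OpenAI schema and was started with a multimodal projector loaded.

```python
import base64

# Sampling values recommended above.
SAMPLING = {"temperature": 1.0, "top_p": 0.95, "top_k": 64}


def _b64(image_bytes):
    """Base64-encode raw image bytes as an ASCII string."""
    return base64.b64encode(image_bytes).decode("ascii")


def ollama_image_body(model, prompt, image_bytes):
    """Ollama /api/chat body: base64 images ride alongside the message text."""
    return {
        "model": model,
        "stream": False,
        "options": dict(SAMPLING),
        "messages": [
            {"role": "user", "content": prompt, "images": [_b64(image_bytes)]}
        ],
    }


def openai_image_body(model, prompt, image_bytes, mime="image/png"):
    """OpenAI-style body: the image is an image_url block holding a data URI."""
    data_uri = f"data:{mime};base64,{_b64(image_bytes)}"
    return {
        "model": model,
        **SAMPLING,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    {"type": "image_url", "image_url": {"url": data_uri}},
                ],
            }
        ],
    }
```

Read the image with `open(path, "rb").read()` and POST the resulting dictionary as JSON to the matching endpoint.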
Results
- Speed: A community walkthrough by danilchenko.dev (2026-04-07) reports ~45 tok/s at Q8_0 on an RTX 3060 (12 GB). The RTX 4060 has lower memory bandwidth (~272 GB/s vs ~360 GB/s), but the Q4_K_M file is roughly 40 % smaller than Q8_0 (4.98 GB vs 8.19 GB), so token generation on this recipe is expected to be at least as fast. No published RTX 4060 benchmark exists yet; submit yours via /contribute.
- VRAM usage: Cited weight footprint per Google AI for Developers — Q4_0 = 5 GB (weights only). The Unsloth Q4_K_M GGUF file is 4.98 GB (file table). Add ~25 % runtime headroom for the KV cache at long contexts → ~6 GB peak on the 4060, leaving ~2 GB free.
- Quality notes: E4B is the "daily-driver" tier of the Gemma 4 family — 4.5 B effective / 8 B with embeddings, 42 layers, 262 K vocab, ~150 M vision encoder + ~300 M audio encoder. Image inputs support variable aspect/resolution with configurable visual-token budgets (70 / 140 / 280 / 560 / 1120). Audio inputs cap at 30 seconds per clip. Q4_K_M is the smallest quant that retains near-full instruction-following quality; below it (Q3, Q2) the multimodal alignment degrades visibly.
For the full benchmark data, see /check/gemma-4-e4b/rtx-4060.
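The speed expectation above can be made explicit with a back-of-envelope scaling. This sketch assumes token generation is purely memory-bandwidth-bound, so speed scales with bandwidth and inversely with the bytes of weights read per token. That is a simplification (it ignores compute, KV-cache reads, and how embedding tables are accessed), and the result is an estimate, not a measurement. The function name is hypothetical.

```python
def scale_decode_speed(ref_tok_s, ref_bw_gbs, ref_weights_gb,
                       new_bw_gbs, new_weights_gb):
    """Scale a reference tok/s figure by bandwidth and weight-size ratios."""
    bandwidth_ratio = new_bw_gbs / ref_bw_gbs
    size_ratio = ref_weights_gb / new_weights_gb
    return ref_tok_s * bandwidth_ratio * size_ratio


# Reference: ~45 tok/s, RTX 3060 (~360 GB/s), Q8_0 (8.19 GB).
# Target:    RTX 4060 (~272 GB/s), Q4_K_M (4.98 GB).
estimate = scale_decode_speed(45.0, 360.0, 8.19, 272.0, 4.98)
```

Under these assumptions the smaller file more than offsets the narrower memory bus, which is the reasoning behind "at least as fast". A real RTX 4060 run is still needed to confirm it.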
Troubleshooting
Q8_0 or BF16 OOMs on the 4060
This is expected: Q8_0 (8.19 GB on disk per the Unsloth file table) leaves no room for the KV cache on an 8 GB card, and BF16 is ~15 GB. Switch to Q4_K_M (4.98 GB) or UD-Q4_K_XL (5.13 GB). The 16 GB-card sibling recipe documents the Q8_0 and BF16 paths if you later upgrade hardware.
Ollama function-calling / streaming glitches
Per the danilchenko.dev walkthrough (2026-04-07), Gemma 4's hybrid sliding-window/global attention triggers parser bugs in Ollama's tool-call and streaming layers. Workaround: run llama.cpp directly (Path B) or vLLM for tool-use workflows until the parser is patched.
Long contexts hit VRAM ceiling
The Q4_K_M weights (4.98 GB) leave roughly 3 GB free on the 4060 after load, or about 2 GB once the ~25 % runtime headroom is counted. That is enough for ~8 K context with a small KV cache, but contexts beyond that will start swapping or OOM. Either reduce the context size (llama.cpp --ctx-size 4096) or accept partial CPU offload via Ollama, which handles layer placement automatically. The Google docs explicitly call out that the static-weight numbers exclude KV-cache overhead.
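To see how context length turns into VRAM, the usual KV-cache formula is two tensors (keys and values) per layer, each of size context × KV heads × head dimension. The sketch below is a generic upper bound, not Gemma-specific: the function names are hypothetical, the architecture values you pass in must come from the model's own config, and sliding-window layers cache fewer positions than this formula assumes.

```python
def kv_cache_bytes(n_ctx, n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """Upper-bound KV-cache size: keys and values for every layer and position."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * n_ctx


def max_context(free_bytes, n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """Largest context whose full KV cache fits in free_bytes."""
    per_token = kv_cache_bytes(1, n_layers, n_kv_heads, head_dim, bytes_per_elem)
    return free_bytes // per_token
```

For example, with placeholder values of 42 layers, 2 KV heads, and a head dimension of 256 at 16-bit precision (illustrative numbers, not confirmed for this model), `kv_cache_bytes(8192, 42, 2, 256)` comes to about 0.7 GB. Halving `--ctx-size` halves that figure, which is why it is the first knob to turn.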
flash_attention_2 errors (if you take the optional FP8 transformers path)
If you copy a third-party transformers snippet that hardcodes attn_implementation="flash_attention_2", it will fail on first forward pass on older driver / wheel combinations — see the tracking issue at Dao-AILab/flash-attention#2168. Fix by removing the override or setting attn_implementation="sdpa". (The Ada Lovelace sm_89 kernels in FA2 are shipped, so this is mostly a wheel-mismatch issue on the 4060 rather than the missing-kernel issue 50-series users hit.)
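One defensive pattern is to choose the attention backend at runtime instead of hardcoding it. The helper below is a hypothetical sketch: it tries to import the `flash_attn` package and falls back to `"sdpa"` if the import fails for any reason, which covers a missing package as well as a wheel built against a different PyTorch or CUDA version. A successful import does not prove the kernels will run, so this narrows the failure mode rather than eliminating it.

```python
import importlib


def pick_attn_implementation():
    """Return "flash_attention_2" only if flash_attn imports cleanly."""
    try:
        importlib.import_module("flash_attn")
    except Exception:
        # Missing package or ABI mismatch: use PyTorch's built-in SDPA.
        return "sdpa"
    return "flash_attention_2"
```

Pass the result as `attn_implementation=pick_attn_implementation()` when calling `from_pretrained` in place of the hardcoded string.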
enable_thinking not triggering on E4B
Discussions on the HF card note that enable_thinking=True in the chat template does not currently activate Thinking Mode on the E4B / E2B sizes. If you need extended reasoning, the 26 B MoE or 31 B dense variants are the supported path — but neither fits an 8 GB card, so this recipe leaves Thinking Mode out of scope.