Gemma 4 E4B on RTX 5080: Multimodal Inference via Q4_K_M GGUF (llama.cpp or Ollama — BF16 will not fit comfortably)

What You'll Build

A local Gemma 4 E4B instance on an RTX 5080 — Google's 4.5 B-effective-parameter (8 B with embeddings) instruction-tuned multimodal model that accepts text, images, and audio as input and produces text. On a 16 GB Blackwell card the recipe pins Q4_K_M GGUF as the recommended variant (clean fit with ample headroom) and walks through two runtimes: Ollama (one-liner) and llama.cpp (full control). An optional BF16 transformers path is documented as a tight, expert-only upgrade.

Hardware data: RTX 5080 (16 GB VRAM) · multimodal-capable at Q4_K_M (~5 GB weights, ~6 GB peak with KV cache) · See benchmark data

⚠️ BF16 will not fit comfortably on 16 GB; use Q4_K_M instead. The canonical google/gemma-4-E4B-it checkpoint is a single 15.99 GB model.safetensors (read via the HF tree API), and Google's docs list the inference-memory footprint (raw weights + ~20% load overhead) as 17.9 GB at BF16, 8.9 GB at SFP8, and 4.5 GB at Q4_0 (Gemma core docs). On a 16 GB card the BF16 weights alone nearly fill the envelope, and Google's 17.9 GB inference figure overshoots it, so a BF16 run has almost no headroom and OOMs on all but the most trivial contexts. Q4_K_M GGUF (4.977 GB on disk per Unsloth's file table) is the variant pinned by this recipe; Q8_0 (8.193 GB) is a comfortable near-lossless step up.

ℹ️ Multimodal input, text-only output. Gemma 4 E4B reads images, audio (≤ 30 s clips), video (≤ 60 s at 1 fps), and text — and replies in text. It does not generate images, speech, or video. For TTS on this card, see Kokoro; for image generation, see Z-Image or Flux.2 Klein.

Requirements

Component	Minimum	Tested
GPU	6 GB (Q4_K_M GGUF) / 9 GB (Q8_0 GGUF) — BF16 needs the full 16 GB and is tight	RTX 5080 (16 GB)
RAM	16 GB system RAM	—
Storage	~5 GB for the Q4_K_M GGUF (+ ~1 GB mmproj for image/audio input) (file table)	4.977 GB Q4_K_M, 8.193 GB Q8_0
Software	`llama.cpp` or Ollama (GGUF paths) OR Python 3.10+ with `transformers` + cu128 wheel (optional BF16 path)	—

Per the official Google AI for Developers docs, the inference-memory footprint is 17.9 GB at BF16, 8.9 GB at SFP8, and 4.5 GB at Q4_0. Google's figures already fold in ~20 % load overhead on top of the raw weights; the KV cache then grows further with context, so a Q4_K_M run peaks at roughly 6 GB at default contexts on the 5080.

Installation

This recipe defaults to Q4_K_M GGUF — it is the smallest quant that retains near-full multimodal instruction-following quality, and on the RTX 5080 it loads in ~5 GB and leaves ample headroom for long contexts. Two runtimes — Ollama (zero config) and llama.cpp (full control) — are both documented; pick whichever fits your workflow. Q8_0 is documented as a near-lossless step up; BF16 is an optional, tight, expert-only path (see Troubleshooting).

Path A — Ollama (one-liner, auto layer placement)

Ollama auto-detects the right number of GPU layers for your card; no manual -ngl flag is needed.

# macOS / Linux — install
curl -fsSL https://ollama.com/install.sh | sh

# Pull and run the Unsloth Q4_K_M GGUF (≈ 5 GB download on first run)
ollama run hf.co/unsloth/gemma-4-E4B-it-GGUF:Q4_K_M

The Unsloth GGUF repo at unsloth/gemma-4-E4B-it-GGUF hosts the file; its model tree explicitly links upstream google/gemma-4-E4B-it (the canonical Google release). The community-maintained bartowski/google_gemma-4-E4B-it-GGUF is an equivalent Q4_K_M mirror if you prefer Bartowski's quants.

Path B — `llama.cpp` (explicit `-ngl`)

# macOS / Linux
brew install llama.cpp

# Or build from source — https://github.com/ggml-org/llama.cpp

Then launch the OpenAI-compatible server. On a 16 GB card, offloading all layers to GPU is safe at Q4_K_M and Q8_0; explicitly pin with -ngl 99 (llama.cpp clamps to the model's real layer count):

# OpenAI-compatible local server with web UI
llama-server -hf unsloth/gemma-4-E4B-it-GGUF:Q4_K_M -ngl 99

# Or one-shot CLI
llama-cli -hf unsloth/gemma-4-E4B-it-GGUF:Q4_K_M -ngl 99

The -hf flag streams the GGUF directly from HuggingFace on first run and caches it locally.

For image / audio input, llama.cpp also needs the multimodal projector (mmproj) file. The Unsloth repo ships mmproj-F16.gguf (0.990 GB on disk per the file table); pass a local copy with --mmproj, or point at the hosted file with --mmproj-url as in the command here:

# Multimodal-capable server (text + image + audio input)
llama-server -hf unsloth/gemma-4-E4B-it-GGUF:Q4_K_M --mmproj-url https://huggingface.co/unsloth/gemma-4-E4B-it-GGUF/resolve/main/mmproj-F16.gguf -ngl 99

Variant file-size table (from Unsloth's GGUF repo)

The figures below are read directly from the unsloth/gemma-4-E4B-it-GGUF tree via the HF tree API. They are on-disk file sizes; runtime VRAM is on-disk plus the KV cache + activations (and the ~1 GB mmproj if you enable image/audio input).

Quant	File size	Fits 16 GB?
Q4_K_M	4.977 GB	✅ Recommended — clean fit, lots of headroom
UD-Q4_K_XL	5.126 GB	✅ Comfortable — Unsloth dynamic 4-bit
Q5_K_M	5.482 GB	✅ Comfortable
Q8_0	8.193 GB	✅ Comfortable — near-lossless quality
UD-Q8_K_XL	8.712 GB	✅ Comfortable
BF16	15.053 GB	⚠️ Tight — ~1 GB free for KV cache, OOMs past trivial contexts
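The fit column can be sanity-checked with a little arithmetic. The sketch below applies the ~20 % load-overhead rule of thumb that Google quotes for its own figures to the on-disk sizes in the table. Reusing that factor for GGUF files is an assumption made for illustration, the function names are hypothetical, and the estimate ignores KV-cache growth at long contexts.

```python
# Rough VRAM-fit check for the quant table above.
# Assumption: the ~20 % load overhead Google quotes for its own figures
# is reused here as a rule of thumb for GGUF files. Real usage also grows
# with context length (KV cache), which this sketch does not model.

VRAM_GB = 16.0          # RTX 5080
LOAD_OVERHEAD = 0.20    # ~20 % on top of raw weights
MMPROJ_GB = 0.990       # mmproj-F16.gguf, only needed for image/audio input

FILE_SIZES_GB = {
    "Q4_K_M": 4.977,
    "UD-Q4_K_XL": 5.126,
    "Q5_K_M": 5.482,
    "Q8_0": 8.193,
    "UD-Q8_K_XL": 8.712,
    "BF16": 15.053,
}


def estimate_load_gb(file_gb: float, with_mmproj: bool = False) -> float:
    """Estimated footprint at load: weights (+ projector) plus overhead."""
    total = file_gb + (MMPROJ_GB if with_mmproj else 0.0)
    return total * (1.0 + LOAD_OVERHEAD)


def headroom_gb(file_gb: float, with_mmproj: bool = False) -> float:
    """VRAM left for KV cache and activations; negative means no fit."""
    return VRAM_GB - estimate_load_gb(file_gb, with_mmproj)


for name, size in FILE_SIZES_GB.items():
    print(f"{name:<11} load ~{estimate_load_gb(size):5.2f} GB, "
          f"headroom {headroom_gb(size):+6.2f} GB")
```

Under this estimate Q4_K_M lands near 6 GB, in line with the peak quoted for this recipe, while BF16 comes out above the card's 16 GB, consistent with Google's 17.9 GB figure overshooting it.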

Optional Path C — BF16 `transformers` (Python, tight on 16 GB)

If you specifically need the canonical transformers Python API (e.g. for fine-tuning hooks or the documented multimodal AutoModelForMultimodalLM loader), plan for ~16 GB of BF16 weights (the 15.99 GB model.safetensors). The RTX 5080 uses Blackwell (sm_120); the default pip install torch index does not yet ship sm_120 kernels, so the cu128 wheel is required:

# CUDA 12.8 PyTorch wheel — required for Blackwell sm_120 (RTX 50-series)
pip install -U torch --index-url https://download.pytorch.org/whl/cu128
pip install -U transformers accelerate torchvision librosa

On a 16 GB card this path leaves almost no headroom for the KV cache — it is documented for completeness, but the GGUF paths above are strongly preferred. See the "BF16 is tight" Troubleshooting entry below.
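Before loading a ~16 GB checkpoint, it may be worth confirming that the installed wheel actually contains sm_120 kernels. The following is a minimal sketch using torch.cuda.get_arch_list() and torch.cuda.get_device_capability(); the helper names are hypothetical, and the check is deliberately strict (it looks only for an exact sm_XY entry and does not consider PTX forward compatibility).

```python
def arch_supported(arch_list, capability):
    """True if the wheel's compiled arch list contains this (major, minor)."""
    major, minor = capability
    return f"sm_{major}{minor}" in arch_list


def check_torch_build():
    """Describe whether the installed torch wheel ships kernels for GPU 0."""
    try:
        import torch
    except ImportError:
        return "torch is not installed"
    if not torch.cuda.is_available():
        return "CUDA is not available to this torch build"
    cap = torch.cuda.get_device_capability(0)  # (12, 0) expected on an RTX 5080
    archs = torch.cuda.get_arch_list()         # architectures compiled into the wheel
    if arch_supported(archs, cap):
        return f"sm_{cap[0]}{cap[1]} is supported by this wheel"
    return f"sm_{cap[0]}{cap[1]} is NOT in {archs}; reinstall from the cu128 index"
```

Running print(check_torch_build()) after the pip install step gives a quick yes/no before any weights are downloaded.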

Running

Path A / B — chat via Ollama or llama.cpp

After ollama run … or llama-server … is up, both serve an HTTP API: Ollama on localhost:11434 (its native /api/chat, plus an OpenAI-compatible /v1 endpoint) and llama-server on localhost:8080 (OpenAI-compatible). For an interactive chat, just type at the prompt; for programmatic use:

# Ollama
curl http://localhost:11434/api/chat -d '{
  "model": "hf.co/unsloth/gemma-4-E4B-it-GGUF:Q4_K_M",
  "messages": [{"role": "user", "content": "Write a short joke about saving VRAM."}]
}'

# llama-server (OpenAI-compatible); send the body as JSON explicitly
curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
  "model": "gemma-4-e4b",
  "messages": [{"role": "user", "content": "Write a short joke about saving VRAM."}]
}'
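The same two requests can be issued from Python with only the standard library. This is a sketch rather than an official client: the helper names are hypothetical, the sampling values are the ones this recipe recommends, and passing top_k at the top level assumes llama-server's extension to the OpenAI request schema.

```python
import json
import urllib.request

# Sampling values recommended in this recipe.
SAMPLING = {"temperature": 1.0, "top_p": 0.95, "top_k": 64}


def ollama_payload(prompt: str) -> dict:
    """Body for Ollama's native POST /api/chat (non-streaming)."""
    return {
        "model": "hf.co/unsloth/gemma-4-E4B-it-GGUF:Q4_K_M",
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,            # one JSON object instead of a stream
        "options": dict(SAMPLING),  # Ollama takes sampling under "options"
    }


def llama_server_payload(prompt: str) -> dict:
    """Body for llama-server's POST /v1/chat/completions."""
    return {
        "model": "gemma-4-e4b",
        "messages": [{"role": "user", "content": prompt}],
        **SAMPLING,  # assumption: top_k accepted as a llama.cpp extension
    }


def post_json(url: str, payload: dict, timeout: float = 120.0) -> dict:
    """POST a JSON body and decode the JSON reply (needs a running server)."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

With a server running, post_json("http://localhost:8080/v1/chat/completions", llama_server_payload("hello")) should return the reply under choices[0].message.content; Ollama's native endpoint returns it under message.content.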

For multimodal input (images), add an images: [<base64>] field to the user message (Ollama) or attach an OpenAI image_url content block (llama-server launched with the mmproj projector from Installation Path B). Full multimodal usage is documented on the model card. Recommended sampling: temperature=1.0, top_p=0.95, top_k=64.
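The two image formats differ only in where the base64 data goes. The sketch below builds one user message in each shape from raw image bytes; the function names are hypothetical, and the data: URI form for image_url is assumed to be accepted by your llama-server build.

```python
import base64


def to_base64(image_bytes: bytes) -> str:
    """Encode raw image bytes as an ASCII base64 string."""
    return base64.b64encode(image_bytes).decode("ascii")


def ollama_image_message(image_bytes: bytes, prompt: str) -> dict:
    """Ollama /api/chat: base64 images ride alongside the text content."""
    return {
        "role": "user",
        "content": prompt,
        "images": [to_base64(image_bytes)],
    }


def openai_image_message(image_bytes: bytes, prompt: str,
                         mime: str = "image/png") -> dict:
    """OpenAI-style message: the image is an image_url block with a data URI."""
    data_uri = f"data:{mime};base64,{to_base64(image_bytes)}"
    return {
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": data_uri}},
            {"type": "text", "text": prompt},
        ],
    }
```

In practice image_bytes would come from open("photo.png", "rb").read(), and the resulting message goes into the messages list of the chat request.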

Optional Path C — text + image inference (transformers)

The HuggingFace card's canonical multimodal snippet, verbatim from google/gemma-4-E4B-it:

from transformers import AutoProcessor, AutoModelForMultimodalLM

MODEL_ID = "google/gemma-4-E4B-it"

processor = AutoProcessor.from_pretrained(MODEL_ID)
model = AutoModelForMultimodalLM.from_pretrained(
    MODEL_ID,
    dtype="auto",        # picks BF16 on the 5080
    device_map="auto",
)

# Multimodal example — image + text
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://raw.githubusercontent.com/google-gemma/cookbook/refs/heads/main/apps/sample-data/GoldenGate.png"},
            {"type": "text", "text": "What is shown in this image?"},
        ],
    }
]

inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
    add_generation_prompt=True,
).to(model.device)
input_len = inputs["input_ids"].shape[-1]

outputs = model.generate(**inputs, max_new_tokens=512)
response = processor.decode(outputs[0][input_len:], skip_special_tokens=False)
print(processor.parse_response(response))

For text-only chat, swap AutoModelForMultimodalLM for AutoModelForCausalLM and drop the image content block.

Results

Speed: No first-party RTX 5080 tok/s benchmark exists for Gemma 4 E4B yet; the backend benchmark endpoint currently returns no measurements for this pair. Decode for a 4.5 B-effective model is memory-bandwidth-bound, and the RTX 5080's ~960 GB/s GDDR7 bandwidth is roughly 2.1× the RTX 5060 Ti's ~448 GB/s, so throughput on the 5080 should be substantially higher than on the smaller 16 GB Blackwell card. That is an expectation, not a measurement: no published 5080-named figure exists to cite, and extrapolating from a lower-bandwidth card would be a guess. Submit yours via /contribute.
VRAM usage: Cited inference-memory footprint per Google AI for Developers — BF16 = 17.9 GB, SFP8 = 8.9 GB, Q4_0 = 4.5 GB (incl. ~20% load overhead). The Unsloth tree confirms the on-disk weights (read via the HF tree API): Q4_K_M = 4.977 GB, Q8_0 = 8.193 GB, BF16 = 15.053 GB, plus a ~0.990 GB mmproj-F16.gguf for image/audio input. At Q4_K_M the runtime peak is ~6 GB after the KV cache at default contexts, leaving ~10 GB of the 5080's envelope free. Add ~25 % runtime headroom for the KV cache at long contexts.
Quality notes: E4B is the "daily-driver" tier of the Gemma 4 family — 4.5 B effective / 8 B with embeddings, 42 layers, a 128 K-token context window, 262 K vocab, ~150 M vision encoder + ~300 M audio encoder per the HF card. Image inputs support variable aspect/resolution with configurable visual-token budgets (70 / 140 / 280 / 560 / 1120). Audio inputs cap at 30 seconds per clip. Q4_K_M is the smallest quant that retains near-full instruction-following quality; below it (Q3, Q2) the multimodal alignment degrades visibly. License is Apache 2.0 per the canonical model card (Gemma 4 license).

For the full benchmark data, see /check/gemma-4-e4b/rtx-5080.

Troubleshooting

BF16 is tight on 16 GB — OOMs past trivial contexts

This is expected. The canonical BF16 checkpoint is a single 15.99 GB model.safetensors (per the HF tree) and the GGUF BF16 file is 15.053 GB (Unsloth table). Either way, BF16 weights consume nearly the whole 16 GB envelope, leaving roughly 1 GB for the KV cache and activations — fine for a few hundred tokens, but it OOMs as soon as the context grows. Switch to Q8_0 (8.193 GB, near-lossless) or Q4_K_M (4.977 GB) GGUF, which both leave comfortable room for long contexts on the 5080.

`flash_attention_2` errors on the 5080 (Blackwell sm_120)

The canonical HF snippet uses dtype="auto" and does not request attn_implementation="flash_attention_2", so the default SDPA path runs cleanly on Blackwell. If you copy a third-party transformers snippet that hardcodes flash_attention_2, the call will fail at first forward pass — FA2 wheels do not yet ship sm_120 kernels for RTX 50-series, tracked at Dao-AILab/flash-attention#2168. Fix by removing the override or setting attn_implementation="sdpa".
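One way to apply the fix without editing every copied snippet is to normalise the loader arguments first. This is a small sketch; blackwell_safe_kwargs is a hypothetical helper, not part of transformers, and third_party stands in for whatever arguments the copied snippet used.

```python
def blackwell_safe_kwargs(kwargs: dict) -> dict:
    """Return a copy of from_pretrained kwargs with FA2 swapped for SDPA."""
    fixed = dict(kwargs)  # leave the caller's dict untouched
    if fixed.get("attn_implementation") == "flash_attention_2":
        fixed["attn_implementation"] = "sdpa"
    return fixed


# Arguments as a third-party snippet might hardcode them (illustrative).
third_party = {
    "dtype": "auto",
    "device_map": "auto",
    "attn_implementation": "flash_attention_2",
}

print(blackwell_safe_kwargs(third_party)["attn_implementation"])
```

The cleaned dictionary can then be passed as AutoModelForMultimodalLM.from_pretrained(MODEL_ID, **blackwell_safe_kwargs(third_party)).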

Don't try the 26 B / 31 B Gemma 4 variants on a 16 GB card

A community user reports on llama.cpp Issue #21323 that unsloth/gemma-4-26B-A4B-it-GGUF:UD-Q4_K_XL (the 26 B MoE sibling, not E4B) OOMs at load with CPU offloading on an RTX 5080 (16 GB). A separate report (Issue #21371) describes the 31 B variant producing gibberish on Blackwell under a CUDA build. Both reports are on larger Gemma 4 variants, not the E4B this recipe pins — they are cited here only as confirmation that the E4B tier is the right choice for a 16 GB card. The bigger 26 B/31 B siblings are out of scope here; if you need them, a 24 GB+ card is the supported path.

Ollama function-calling / streaming glitches

Per the danilchenko.dev walkthrough (2026-04-07), Gemma 4's hybrid sliding-window/global attention triggers parser bugs in Ollama's tool-call and streaming layers. Workaround: run llama.cpp directly (Path B) or vLLM for tool-use workflows until the parser is patched.

Thinking Mode (`enable_thinking`) on E4B

Gemma 4 E4B does support Thinking Mode — extended reasoning is listed under Core Capabilities on the canonical model card, and enable_thinking=True in the chat template activates it. The only E2B/E4B-specific quirk is cosmetic: with thinking disabled, the smaller sizes simply don't emit empty thought-block tags. Set enable_thinking=True when you want step-by-step reasoning before the answer; leave it off for terse, direct replies. If the toggle appears inert in a GGUF runtime, update to a recent llama.cpp / Ollama build — chat-template handling of the thinking flag has been iterated on across versions.
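For llama-server, the toggle can be sent per request. The sketch below assumes a build that forwards a chat_template_kwargs field from the request body to the chat template, and that the Gemma 4 template reads enable_thinking as described above; both are assumptions to verify against your build, and the helper name is hypothetical.

```python
def thinking_payload(prompt: str, thinking: bool) -> dict:
    """Chat request body that asks the template to toggle Thinking Mode."""
    return {
        "model": "gemma-4-e4b",
        "messages": [{"role": "user", "content": prompt}],
        # Assumption: the server passes these through to the chat template.
        "chat_template_kwargs": {"enable_thinking": thinking},
    }


print(thinking_payload("Is 221 prime?", thinking=True)["chat_template_kwargs"])
```

If the reply looks identical with the flag on and off, that points to the template-handling issue mentioned above rather than to the request shape.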