Gemma 4 E4B on RTX 4080: Multimodal Inference via Q4_K_M GGUF (with optional Q8_0 / BF16)

What You'll Build

This recipe sets up a local Gemma 4 E4B instance on an RTX 4080. Gemma 4 E4B is Google's instruction-tuned multimodal model with 4.5 B effective parameters (8 B with embeddings); it accepts text, image, and audio as input and produces text. On a 16 GB Ada Lovelace card the recipe pins Q4_K_M GGUF as the recommended variant (clean fit, ample headroom) and documents Q8_0 GGUF as a comfortable upgrade and BF16 as a tight one. Image input is handled by a small separate vision projector (mmproj) that loads alongside the text GGUF.

Hardware data: RTX 4080 (16 GB VRAM) · multimodal capable from Q4_K_M (~5 GB text weights + ~1 GB projector, ~6 GB peak with KV cache) up to BF16 (~15 GB weights, fits but ~1 GB headroom) · See benchmark data

ℹ️ Multimodal input, text-only output. Gemma 4 E4B reads text, images, and audio and replies in text — it does not generate images, speech, or video. It lives in our multimodal vertical because it spans more than one input modality. For text-to-speech on this card see Kokoro or VoxCPM; for image generation see Z-Image or Flux.2 Klein.

ℹ️ Same 16 GB tier as the RTX 4060 Ti 16GB. The RTX 4080 is Ada Lovelace (sm_89) with the same 16 GB VRAM envelope as the RTX 4060 Ti 16GB, so the same quant-fits / BF16-tight tradeoff applies. The 4080 has more compute and ~2.5× the memory bandwidth (~717 GB/s vs ~288 GB/s), so expect faster generation than the 4060 Ti — but the VRAM constraints are identical.

Requirements

Component	Minimum	Tested
GPU	6 GB (Q4_K_M GGUF) / 9 GB (Q8_0 GGUF) / 16 GB (BF16)	RTX 4080 (16 GB)
RAM	16 GB system RAM	—
Storage	5 – 16 GB depending on quant (file-size table) + ~1 GB for the vision projector	4.98 GB for Q4_K_M, 8.19 GB for Q8_0, 15.05 GB for BF16
Software	`llama.cpp` or Ollama (GGUF paths) OR Python 3.10+ with `transformers` (BF16 path)	—

Gemma 4 E4B is licensed Apache-2.0 (model card; terms at the Gemma 4 license link). The weights are not gated — the HF repo is publicly downloadable without an access request — so no hf auth login or license click-through is required to pull the GGUFs.

Per the official Google AI for Developers docs, the static-weight memory footprint for Gemma 4 E4B is 15 GB at BF16, 7.5 GB at SFP8, 5 GB at Q4_0 — these numbers cover the weights only, not the KV cache or runtime overhead, so plan on at least 25 % headroom for non-trivial contexts.
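The headroom rule above is simple arithmetic, and a minimal sketch makes the consequence for a 16 GB card explicit. The weight figures are the ones quoted from Google's docs; the function name `budget_gb` and the flat 25 % allowance are illustrative assumptions, not a measured memory model.

```python
# Weights-only footprints quoted above from Google's docs (GB).
WEIGHTS_GB = {"BF16": 15.0, "SFP8": 7.5, "Q4_0": 5.0}

CARD_VRAM_GB = 16.0   # RTX 4080
HEADROOM = 0.25       # the "at least 25 %" rule of thumb from the text


def budget_gb(weights_gb: float, headroom: float = HEADROOM) -> float:
    """Weights plus a proportional allowance for KV cache and runtime overhead."""
    return weights_gb * (1.0 + headroom)


for name, gb in WEIGHTS_GB.items():
    need = budget_gb(gb)
    verdict = "fits" if need <= CARD_VRAM_GB else "over budget"
    print(f"{name}: {need:.2f} GB needed -> {verdict}")
```

Under this rule of thumb the BF16 weights alone fit in 16 GB, but BF16 plus the 25 % allowance does not, which is why this recipe treats BF16 as a tight, short-context option.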

Installation

This recipe defaults to Q4_K_M GGUF — it is the smallest quant that retains near-full multimodal instruction-following quality, and on the RTX 4080 it loads in ~5 GB and leaves ample headroom for long contexts. Q8_0 and BF16 are documented as quality-step-up paths.

Path A — Ollama (one-liner, auto layer placement)

Ollama auto-detects how many layers to place on the GPU; on a 16 GB 4080 the whole model fits, so no manual layer-offload setting (the equivalent of llama.cpp's -ngl) is needed.

# macOS / Linux — install
curl -fsSL https://ollama.com/install.sh | sh

# Pull and run the Unsloth Q4_K_M GGUF (≈ 5 GB download on first run)
ollama run hf.co/unsloth/gemma-4-E4B-it-GGUF:Q4_K_M

The Unsloth GGUF repo at unsloth/gemma-4-E4B-it-GGUF hosts the file; its model tree explicitly links upstream google/gemma-4-E4B-it (the canonical Google release). The community-maintained bartowski/google_gemma-4-E4B-it-GGUF is an equivalent Q4_K_M mirror if you prefer Bartowski's quants.

Path B — `llama.cpp` (explicit `-ngl`, plus `--mmproj` for image input)

# macOS / Linux
brew install llama.cpp

# Or build from source — https://github.com/ggml-org/llama.cpp

Then launch the OpenAI-compatible server. On a 16 GB card, offloading all layers to GPU is safe at every quant tier documented here; explicitly pin with -ngl 99 (llama.cpp clamps to the model's real layer count). For image input, also load the vision projector with --mmproj:

# OpenAI-compatible local server with web UI (text only)
llama-server -hf unsloth/gemma-4-E4B-it-GGUF:Q4_K_M -ngl 99

# Text + image: add the F16 vision projector (≈ 1 GB, shared across quant tiers)
llama-server \
  -hf unsloth/gemma-4-E4B-it-GGUF:Q4_K_M \
  --mmproj-url https://huggingface.co/unsloth/gemma-4-E4B-it-GGUF/resolve/main/mmproj-F16.gguf \
  -ngl 99

The -hf flag downloads the GGUF from Hugging Face on first run and caches it locally. To switch tiers, replace the tag: :Q8_0 for the near-lossless quant, :BF16 for the full-precision GGUF (tight on 16 GB; see Troubleshooting), or :UD-Q4_K_XL for Unsloth's dynamic 4-bit variant. The mmproj-F16.gguf projector is the same file at every text-quant tier.
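If you switch tiers often, a small launcher helper keeps the tag and projector URL in one place. The sketch below only assembles the command line shown in Path B; `llama_server_cmd` is a hypothetical helper name, and it uses no flags beyond -hf, -ngl, and --mmproj-url.

```python
import shlex

REPO = "unsloth/gemma-4-E4B-it-GGUF"
MMPROJ_URL = f"https://huggingface.co/{REPO}/resolve/main/mmproj-F16.gguf"


def llama_server_cmd(tag: str = "Q4_K_M", vision: bool = False, ngl: int = 99) -> list[str]:
    """Build the llama-server argv for one quant tag (hypothetical helper)."""
    cmd = ["llama-server", "-hf", f"{REPO}:{tag}", "-ngl", str(ngl)]
    if vision:
        # The projector file is the same at every text-quant tier.
        cmd += ["--mmproj-url", MMPROJ_URL]
    return cmd


# Print a copy-pasteable command; pass the list to subprocess.run to launch it.
print(shlex.join(llama_server_cmd("Q8_0", vision=True)))
```

Returning a list rather than a string avoids shell-quoting issues if you later pass it to `subprocess.run`.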

Variant file-size table (from Unsloth's GGUF repo)

The figures below are read directly from unsloth/gemma-4-E4B-it-GGUF. They are the on-disk text-model file sizes; runtime VRAM is on-disk plus the KV cache, plus ~1 GB for the vision projector if you use image input.

Quant	File size	Fits 16 GB?
Q3_K_M	4.06 GB	✅ Comfortable
Q4_K_M	4.98 GB	✅ Comfortable — pinned by this recipe
UD-Q4_K_XL	5.13 GB	✅ Comfortable — Unsloth dynamic 4-bit
Q5_K_M	5.48 GB	✅ Comfortable
Q8_0	8.19 GB	✅ Comfortable — near-lossless quality
BF16	15.05 GB	⚠️ Fits but tight — ~1 GB headroom for KV cache
mmproj-F16 (vision)	0.99 GB	Loaded once for image input, shared across tiers

On a 16 GB RTX 4080 every quantized tier above fits with room to spare; only BF16 is tight. For the quantized tiers the binding constraint is quality, not VRAM. Q4_K_M is the recommended starting point; step up to Q8_0 for quality-sensitive work, since the card still has ~7 GB of headroom at that tier even with the vision projector loaded, and treat the BF16 GGUF as a short-context option.
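To see how much room each tier leaves before the KV cache, the file sizes from the table can be subtracted from the card's capacity. This is a rough sketch: it assumes the card's 16 GB is counted in the same units as the file sizes and ignores runtime overhead, and `headroom_gb` is an illustrative name.

```python
CARD_VRAM_GB = 16.0   # assumption: same units as the file sizes in the table
MMPROJ_GB = 0.99      # mmproj-F16 vision projector, from the table

FILE_SIZES_GB = {
    "Q3_K_M": 4.06, "Q4_K_M": 4.98, "UD-Q4_K_XL": 5.13,
    "Q5_K_M": 5.48, "Q8_0": 8.19, "BF16": 15.05,
}


def headroom_gb(quant: str, with_vision: bool = False) -> float:
    """VRAM left after loading the text weights (and optionally the projector)."""
    used = FILE_SIZES_GB[quant] + (MMPROJ_GB if with_vision else 0.0)
    return round(CARD_VRAM_GB - used, 2)


for quant in FILE_SIZES_GB:
    print(f"{quant:<11} text-only {headroom_gb(quant):6.2f} GB   "
          f"with vision {headroom_gb(quant, True):6.2f} GB")
```

Under these assumptions the BF16 row goes slightly negative once the projector is added, which suggests Q8_0 is the safer choice for image input on this card.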

Optional Path C — `transformers` BF16 (Python, tight)

If you specifically need the canonical transformers Python API (e.g. for fine-tuning hooks), the full BF16 checkpoint sits at ~15 GB per Google's docs — it fits the 4080's 16 GB, but the runtime peak with non-trivial context can press the ceiling. Prefer Path B / Q8_0 for headroom on long contexts.

# Default CUDA 12.x PyTorch wheel — Ada sm_89 kernels ship in the standard release
pip install -U torch
pip install -U transformers accelerate torchvision librosa

Unlike Blackwell (RTX 50-series, sm_120), the RTX 4080 uses Ada Lovelace (sm_89), and the default pip install torch already includes the right kernels — no cu128-only wheel is required. The full multimodal transformers snippet (including the image and audio modalities and the BF16 checkpoint) is documented on the Gemma 4 E4B model card.

Running

Path A / B — chat via Ollama or llama.cpp

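Once ollama run … or llama-server … is up, each exposes an HTTP API: Ollama on localhost:11434 (its native /api/chat route, plus an OpenAI-compatible /v1 route) and llama-server on localhost:8080 (OpenAI-compatible, with a built-in web UI). For interactive chat, type at the ollama run prompt or open the llama-server web UI in a browser; for programmatic use: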

# Ollama (native /api/chat endpoint; the reply streams as newline-delimited JSON by default)
curl http://localhost:11434/api/chat -d '{
  "model": "hf.co/unsloth/gemma-4-E4B-it-GGUF:Q4_K_M",
  "messages": [{"role": "user", "content": "Write a short joke about saving VRAM."}]
}'

# llama-server (OpenAI-compatible)
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
  "model": "gemma-4-e4b",
  "messages": [{"role": "user", "content": "Write a short joke about saving VRAM."}]
}'
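By default, Ollama's /api/chat endpoint streams its reply as newline-delimited JSON chunks rather than one JSON document, so a client has to stitch the pieces together. The sketch below shows one way to do that; `collect_ollama_reply` is a hypothetical helper, and the sample chunks are hand-written to mimic the stream shape, not real model output.

```python
import json
from typing import Iterable


def collect_ollama_reply(lines: Iterable[str]) -> str:
    """Concatenate message.content from newline-delimited /api/chat chunks."""
    parts = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank keep-alive lines
        chunk = json.loads(line)
        parts.append(chunk.get("message", {}).get("content", ""))
        if chunk.get("done"):
            break  # the final chunk carries "done": true
    return "".join(parts)


# Hand-written sample chunks in the streamed shape (not real model output).
sample = [
    '{"message": {"role": "assistant", "content": "Hello"}, "done": false}',
    '{"message": {"role": "assistant", "content": ", world"}, "done": false}',
    '{"message": {"role": "assistant", "content": ""}, "done": true}',
]
print(collect_ollama_reply(sample))
```

Against a live server you would decode each response line to text and feed the lines to this function. Alternatively, adding "stream": false to the request body makes Ollama return a single JSON object instead.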

For image input through llama.cpp, attach the image with the OpenAI image_url content block once --mmproj is loaded (Path B), and the server returns a text description. Recommended sampling per the model card: temperature=1.0, top_p=0.95, top_k=64.
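As a rough illustration of the image_url content block, the sketch below builds a request body with one inline image encoded as a base64 data URI. The helper name `image_chat_payload` and the placeholder image bytes are assumptions for illustration. Passing top_k alongside the standard OpenAI fields is also an assumption about what llama-server accepts, so check your build's server documentation.

```python
import base64
import json


def image_chat_payload(image_bytes: bytes, prompt: str, mime: str = "image/png") -> dict:
    """Build an OpenAI-style chat request with one inline image (hypothetical helper)."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    data_uri = f"data:{mime};base64,{encoded}"
    return {
        "model": "gemma-4-e4b",
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url", "image_url": {"url": data_uri}},
            ],
        }],
        # Sampling values recommended by the model card, per the text above.
        "temperature": 1.0,
        "top_p": 0.95,
        "top_k": 64,  # assumption: not a standard OpenAI field
    }


# Placeholder bytes stand in for a real image file read with open(path, "rb").
payload = image_chat_payload(b"placeholder image bytes", "Describe this image.")
body = json.dumps(payload)
print(len(body), "bytes of JSON ready to POST")
```

POST `body` to http://localhost:8080/v1/chat/completions with a Content-Type: application/json header, after starting llama-server with the projector loaded.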

Results

Speed: No RTX 4080-specific Gemma 4 E4B benchmark has been published yet, so a measured 4080 tok/s figure is omitted rather than estimated. As a comparable-tier consumer reference point only, a community walkthrough by danilchenko.dev (2026-04-07) reports ~45 tok/s for E4B at Q8_0 on an RTX 3060 (12 GB) — note the RTX 3060 is Ampere (sm_86), not Ada, and has lower memory bandwidth (~360 GB/s vs the 4080's ~717 GB/s), so the 4080 should be meaningfully faster, but no published 4080 number exists. Submit yours via /contribute and it will appear at /check/gemma-4-e4b/rtx-4080.
VRAM usage: Cited weight footprint per Google AI for Developers: BF16 = 15 GB, SFP8 = 7.5 GB, Q4_0 = 5 GB (weights only). Unsloth GGUF Q4_K_M file is 4.98 GB, Q8_0 is 8.19 GB, BF16 is 15.05 GB, and the F16 vision projector is 0.99 GB (file table). On the 4080's 16 GB, Q4_K_M leaves ~11 GB free after the text weights load (~10 GB with the vision projector, still room for very long contexts); Q8_0 leaves ~7.8 GB (~6.8 GB with the projector); BF16 leaves ~1 GB, which is feasible but tight.
Quality notes: E4B is the "daily-driver" tier of the Gemma 4 family — 4.5 B effective / 8 B with embeddings via the MatFormer (Matryoshka-Transformer) architecture, multimodal across text, image, and audio input with text output (model card). Q4_K_M is the smallest quant that retains near-full instruction-following quality; below it (Q3, Q2) the multimodal alignment degrades visibly. Because the 4080 has plenty of headroom even at Q8_0, there is no reason to drop below Q4_K_M on this card — step up to Q8_0 (8.19 GB) for a quality-sensitive workload.

For the full benchmark data, see /check/gemma-4-e4b/rtx-4080.

Troubleshooting

BF16 fits but presses the 16 GB ceiling at long contexts

The BF16 weights are 15.05 GB per the Unsloth file table, which leaves only ~1 GB of VRAM headroom on the 4080's 16 GB after load. That is enough for short contexts, but the KV cache will hit the ceiling well before the model's full context window. If you want near-full-precision quality and long contexts, step down to Q8_0 (8.19 GB on disk per the same file table): near-lossless quality at roughly half the VRAM, with room for 32 K+ context. The Google docs explicitly note that the static-weight numbers exclude KV-cache overhead.
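To get a feel for why ~1 GB runs out quickly, the textbook KV-cache formula for plain full attention can be evaluated at a few context lengths. This is a sketch only: the layer count, KV-head count, and head dimension below are placeholders, not Gemma 4 E4B's real configuration. Architectures that use sliding-window or shared KV layers need less than this formula suggests.

```python
def kv_cache_gib(n_ctx: int, n_layers: int, n_kv_heads: int, head_dim: int,
                 bytes_per_elem: int = 2) -> float:
    """Full-attention KV cache size: one K and one V tensor per layer per token."""
    total_bytes = 2 * n_layers * n_kv_heads * head_dim * n_ctx * bytes_per_elem
    return total_bytes / 1024**3


# PLACEHOLDER architecture values for illustration only; these are NOT
# Gemma 4 E4B's real numbers. Read the real ones from the model's config
# or GGUF metadata before trusting any estimate.
HYPOTHETICAL = dict(n_layers=32, n_kv_heads=4, head_dim=256)

for n_ctx in (4096, 32768):
    size = kv_cache_gib(n_ctx, **HYPOTHETICAL)
    print(f"{n_ctx:>6} tokens -> {size:.2f} GiB of 16-bit KV cache")
```

With these placeholder values the cache grows linearly with context length and passes 1 GiB well before 32 K tokens, which matches the qualitative advice above. The real figure for this model may differ substantially.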

Image input does nothing / model only sees text

The vision capability lives in a separate mmproj-F16.gguf projector that must be loaded alongside the text GGUF. With llama.cpp, pass --mmproj (or --mmproj-url) pointing at the mmproj-F16.gguf file (Path B); without it, llama-server runs text-only. With Ollama, the projector is normally resolved automatically from the HF repo — if image input is ignored, switch to the explicit llama.cpp Path B.
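If you downloaded the projector by hand, one quick sanity check is that the file is a GGUF at all and not, say, a saved HTML error page. GGUF files begin with the four ASCII bytes `GGUF`. The sketch below checks only that magic, and the path in the usage line is a hypothetical local filename.

```python
from pathlib import Path


def looks_like_gguf(path: str) -> bool:
    """True if the file exists and starts with the 4-byte GGUF magic."""
    p = Path(path)
    if not p.is_file():
        return False
    with p.open("rb") as fh:
        return fh.read(4) == b"GGUF"


# Hypothetical local path; point this at wherever you saved the projector.
print(looks_like_gguf("mmproj-F16.gguf"))
```

A passing check does not prove the file is the right projector for this model, only that it is not obviously the wrong kind of file.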

Ollama function-calling / streaming glitches

Per the danilchenko.dev walkthrough (2026-04-07), Gemma 4's attention design exposed bugs in Ollama's tool-call parser and streaming implementation, and the Ollama team is working on a fix. Workaround: run llama.cpp directly (Path B) or vLLM for tool-use workflows until the parser is patched.

`flash_attention_2` errors (if you take the optional transformers path)

If you copy a third-party transformers snippet that hardcodes attn_implementation="flash_attention_2", it may fail at first forward pass on some driver / wheel combinations — see the tracking issue at Dao-AILab/flash-attention#2168. Fix by removing the override or setting attn_implementation="sdpa". (The Ada Lovelace sm_89 kernels in FA2 are shipped — on the 4080 this is a wheel-mismatch issue rather than the missing-kernel issue 50-series users hit.)
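One defensive pattern is to default to "sdpa" and opt in to FlashAttention-2 only when you ask for it and the package is importable. The helper below is a hypothetical sketch that uses only the standard library; note that an importable flash_attn package still does not guarantee the wheel matches your driver and PyTorch build.

```python
import importlib.util


def pick_attn_implementation(prefer_flash: bool = False) -> str:
    """Return "flash_attention_2" only on explicit opt-in with the package present."""
    if prefer_flash and importlib.util.find_spec("flash_attn") is not None:
        return "flash_attention_2"
    return "sdpa"  # safe default suggested in the text above


attn = pick_attn_implementation()
print(attn)
# Pass the result as attn_implementation=attn when loading the model.
```

If the opted-in path still fails at the first forward pass, fall back to calling the helper with its default argument.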