Gemma 4 E4B on RTX 4090: Multimodal Inference via BF16 (with optional Q8_0 / Q4_K_M GGUF)

multimodal · beginner · 20GB+ VRAM · May 20, 2026
prerequisites
  • NVIDIA RTX 4090 (24 GB VRAM) — or any GPU with ≥ 20 GB for the BF16 path, ≥ 12 GB for Q8_0, ≥ 8 GB for Q4_K_M
  • Python 3.10+ with recent `transformers` for the BF16 path, OR llama.cpp / Ollama for the GGUF paths
  • CUDA 12.x driver (Ada Lovelace sm_89 — the default `pip install torch` ships the right kernels)

What You'll Build

A local Gemma 4 E4B instance on an RTX 4090. Gemma 4 E4B is Google's instruction-tuned multimodal model with 4.5 B effective parameters (8 B with embeddings); it accepts text, images, and audio as input and produces text. On a 24 GB Ada Lovelace card the recipe pins BF16 as the recommended variant (full BF16 precision, ~15 GB of weights, ~5–8 GB free for KV cache at long contexts) and documents Q8_0 GGUF and Q4_K_M GGUF as smaller-footprint alternatives.

Hardware data: RTX 4090 (24 GB VRAM) · multimodal-capable at full BF16 precision (~15 GB weights per the official Google docs) with ~5–8 GB of headroom for KV cache and activations · See benchmark data

ℹ️ Multimodal input, text-only output. Gemma 4 E4B reads images, audio (≤ 30 s clips), and text — and replies in text. It does not generate images, speech, or video. For TTS on this card, see Kokoro; for image generation, see Z-Image Turbo or Qwen-Image.

⚠️ vLLM v0.19.0 regression on RTX 4090. A documented vLLM bug (Issue #38887) forces TRITON_ATTN fallback on E4B and yields only ~9.2 tok/s on RTX 4090 (vs the 50–100+ tok/s the reporter and vLLM maintainers expect for a 4.5 B model on this card). If you want vLLM specifically, wait for the per-layer-backend fix to land. This recipe defaults to transformers (BF16) and llama.cpp / Ollama (GGUF) — none of which exhibit the regression.

Requirements

| Component | Minimum | Tested |
|---|---|---|
| GPU | 8 GB (Q4_K_M GGUF) / 12 GB (Q8_0 GGUF) / 20 GB (BF16) | RTX 4090 (24 GB) |
| RAM | 16 GB system RAM | |
| Storage | 5 – 16 GB depending on quant (Unsloth file-size table) | 4.98 GB for Q4_K_M, 8.19 GB for Q8_0, 15.05 GB for BF16 |
| Software | Python 3.10+ with transformers (BF16 path) OR llama.cpp / Ollama (GGUF paths) | |

Per the official Google AI for Developers docs, the static-weight memory footprint is 15 GB at BF16, 7.5 GB at FP8, 5 GB at Q4_0 — these numbers cover the weights only, not the KV cache or runtime overhead. On a 24 GB card the BF16 path leaves ~9 GB of nominal headroom; in practice plan for 5–8 GB free after the runtime loads activations and KV cache. The KnightLi VRAM ladder (2026-05-01) lists BF16 as "20 GB min / 24 GB safer" — the 24 GB tier sits comfortably in the "safer" column.
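
The headroom arithmetic above can be sketched in a few lines. This is a minimal illustration that uses only the weights-only figures quoted from the Google docs; the function name and the 24 GB default are illustrative choices, and the result is nominal headroom before KV cache, activations, and runtime overhead are counted.

```python
# Weights-only footprints in GB, as quoted above from the Google AI for Developers docs.
WEIGHTS_GB = {"BF16": 15.0, "FP8": 7.5, "Q4_0": 5.0}


def nominal_headroom_gb(precision: str, vram_gb: float = 24.0) -> float:
    """VRAM left after loading the weights only (KV cache and activations not included)."""
    return vram_gb - WEIGHTS_GB[precision]


for precision in WEIGHTS_GB:
    free = nominal_headroom_gb(precision)
    print(f"{precision}: {free:.1f} GB nominal headroom on a 24 GB card")
```

The practical figure is lower than what this prints, which is why the recipe plans for 5–8 GB free on the BF16 path rather than the nominal 9 GB.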

Installation

This recipe defaults to BF16 via transformers — the 24 GB envelope on the RTX 4090 makes full precision the natural primary path. Q8_0 and Q4_K_M are documented as GGUF alternatives via Ollama or llama.cpp, useful when you want a smaller runtime footprint (more headroom for very long contexts) or a one-line install.

Path A — transformers BF16 (Python, recommended on 24 GB)

Install PyTorch + transformers. Unlike Blackwell GPUs (RTX 50-series, sm_120), the RTX 4090 uses Ada Lovelace (sm_89), and the default pip install torch already includes the right CUDA kernels — no cu128-only wheel selection is required.

# Default CUDA 12.x PyTorch wheel — Ada sm_89 kernels ship in the standard release
pip install -U torch
pip install -U transformers accelerate torchvision librosa
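
After installing, a quick sanity check can confirm that PyTorch sees the card. The sketch below is an illustration rather than part of the official instructions; `describe_gpu` is a hypothetical helper name, and it only uses standard `torch.cuda` queries. An RTX 4090 should report compute capability sm_89.

```python
def describe_gpu() -> str:
    """Return a one-line summary of the first CUDA device, or the reason none is usable."""
    try:
        import torch
    except ImportError:
        return "PyTorch is not installed"
    if not torch.cuda.is_available():
        return "PyTorch is installed, but no CUDA device is visible"
    major, minor = torch.cuda.get_device_capability(0)
    total_gib = torch.cuda.get_device_properties(0).total_memory / 1024**3
    name = torch.cuda.get_device_name(0)
    return f"{name}: sm_{major}{minor}, {total_gib:.1f} GiB VRAM"


print(describe_gpu())
```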

Path B — Ollama (one-liner, auto layer placement)

Ollama picks the number of GPU layers automatically, so there is no layer count to set by hand (the equivalent of llama.cpp's -ngl flag). On the 4090 the model fits fully in VRAM at every quant tier documented here.

# macOS / Linux — install
curl -fsSL https://ollama.com/install.sh | sh

# Pull and run the Unsloth Q8_0 GGUF (≈ 8 GB download on first run, near-lossless quality)
ollama run hf.co/unsloth/gemma-4-E4B-it-GGUF:Q8_0

The unsloth/gemma-4-E4B-it-GGUF repo hosts the file; its model tree explicitly links upstream google/gemma-4-E4B-it (the canonical Google release). The community-maintained bartowski/google_gemma-4-E4B-it-GGUF is an equivalent Q8_0 mirror if you prefer Bartowski's quants.

Path C — llama.cpp (explicit -ngl, full control)

# macOS / Linux
brew install llama.cpp

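# Or build from source: https://github.com/ggml-org/llama.cpp
# If your packaged build lacks GPU support, enable CUDA when building:
#   cmake -B build -DGGML_CUDA=ON && cmake --build build --config Release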

On the 4090, offloading all layers to GPU is safe at every quant tier documented here; pin with -ngl 99 (llama.cpp clamps to the model's real layer count):

# OpenAI-compatible local server with web UI — BF16 fits the 24 GB card cleanly
llama-server -hf unsloth/gemma-4-E4B-it-GGUF:BF16 -ngl 99

# Or pick a smaller quant for more KV-cache headroom at long contexts
llama-server -hf unsloth/gemma-4-E4B-it-GGUF:Q8_0 -ngl 99

# Or one-shot CLI
llama-cli -hf unsloth/gemma-4-E4B-it-GGUF:Q4_K_M -ngl 99

The -hf flag streams the GGUF directly from HuggingFace on first run and caches it locally.

Variant file-size table (from Unsloth's GGUF repo)

The figures below are read directly from the unsloth/gemma-4-E4B-it-GGUF tree. They are on-disk file sizes; runtime VRAM is roughly the file size plus KV cache and activations.

| Quant | File size | Fits 24 GB? |
|---|---|---|
| Q4_K_M | 4.98 GB | ✅ Trivial: ~19 GB nominally free for huge contexts |
| UD-Q4_K_XL | 5.13 GB | ✅ Trivial: Unsloth dynamic 4-bit |
| Q5_K_M | 5.48 GB | ✅ Trivial |
| Q8_0 | 8.19 GB | ✅ Comfortable: near-lossless quality |
| Q8_K_XL (Unsloth) | 8.71 GB | ✅ Comfortable |
| BF16 | 15.05 GB | ✅ Recommended: full precision, ~9 GB nominal headroom |
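
One way to use the table on a different card is to pick the largest file that still leaves a chosen reserve for KV cache and activations. The sketch below is an assumption-laden illustration: the file sizes are copied from the table, but `largest_quant_that_fits` and the `reserve_gb` values are hypothetical and should be tuned to your context length.

```python
from typing import Optional

# On-disk sizes in GB, copied from the table above.
FILE_SIZE_GB = {
    "Q4_K_M": 4.98,
    "UD-Q4_K_XL": 5.13,
    "Q5_K_M": 5.48,
    "Q8_0": 8.19,
    "Q8_K_XL": 8.71,
    "BF16": 15.05,
}


def largest_quant_that_fits(vram_gb: float, reserve_gb: float) -> Optional[str]:
    """Pick the largest file that still leaves reserve_gb free for KV cache and activations."""
    budget = vram_gb - reserve_gb
    fitting = [quant for quant, size in FILE_SIZE_GB.items() if size <= budget]
    return max(fitting, key=FILE_SIZE_GB.get) if fitting else None


# Reserve values here are illustrative, not measured.
print(largest_quant_that_fits(vram_gb=24, reserve_gb=6))
print(largest_quant_that_fits(vram_gb=12, reserve_gb=4))
```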

Running

Path A — transformers BF16

The HuggingFace card's canonical snippet, verbatim from google/gemma-4-E4B-it:

from transformers import AutoProcessor, AutoModelForMultimodalLM

MODEL_ID = "google/gemma-4-E4B-it"

processor = AutoProcessor.from_pretrained(MODEL_ID)
model = AutoModelForMultimodalLM.from_pretrained(
    MODEL_ID,
    dtype="auto",        # picks BF16 on the 4090
    device_map="auto",
)

# Multimodal example — image + text
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://raw.githubusercontent.com/google-gemma/cookbook/refs/heads/main/Demos/sample-data/GoldenGate.png"},
            {"type": "text", "text": "What is shown in this image?"},
        ],
    }
]

inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
    add_generation_prompt=True,
).to(model.device)
input_len = inputs["input_ids"].shape[-1]

outputs = model.generate(**inputs, max_new_tokens=512)
response = processor.decode(outputs[0][input_len:], skip_special_tokens=False)
print(processor.parse_response(response))

For text-only chat, swap AutoModelForMultimodalLM for AutoModelForCausalLM and drop the image content block. Recommended sampling per the model card: temperature=1.0, top_p=0.95, top_k=64.
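
The text-only message shape and the recommended sampling values can be sketched as plain data, without loading the model. The helper names below are hypothetical. Passing `do_sample=True` explicitly is an assumption on top of the model card's values: transformers applies temperature, top_p, and top_k only when sampling is enabled, and the model's own generation config may already enable it.

```python
# Sampling values quoted above from the model card; do_sample=True is added here
# (an assumption) so transformers samples instead of decoding greedily.
SAMPLING = {"do_sample": True, "temperature": 1.0, "top_p": 0.95, "top_k": 64}


def text_only_messages(prompt: str) -> list:
    """Chat messages with a single text block and no image entry."""
    return [{"role": "user", "content": [{"type": "text", "text": prompt}]}]


def generation_kwargs(max_new_tokens: int = 512) -> dict:
    """Keyword arguments intended for model.generate(**inputs, **generation_kwargs())."""
    return {"max_new_tokens": max_new_tokens, **SAMPLING}


print(text_only_messages("Write a short joke about saving VRAM."))
print(generation_kwargs())
```

In the snippet above, the messages list would replace the image example and the keyword arguments would be unpacked into the `model.generate` call.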

Path B / C — chat via Ollama or llama.cpp

Once ollama run … or llama-server … is up, each serves an HTTP API: Ollama on localhost:11434 (its native /api/chat endpoint, plus an OpenAI-compatible /v1 endpoint) and llama-server on localhost:8080 (OpenAI-compatible). For an interactive chat, just type at the prompt; for programmatic use:

# Ollama native chat API (streams newline-delimited JSON unless "stream" is false)
curl http://localhost:11434/api/chat -d '{
  "model": "hf.co/unsloth/gemma-4-E4B-it-GGUF:Q8_0",
  "messages": [{"role": "user", "content": "Write a short joke about saving VRAM."}],
  "stream": false
}'

# llama-server (OpenAI-compatible)
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
  "model": "gemma-4-e4b",
  "messages": [{"role": "user", "content": "Write a short joke about saving VRAM."}]
}'
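
The same OpenAI-style request can be made from Python with only the standard library. This is a minimal sketch, not an official client: `build_chat_request` and `chat` are hypothetical helpers, the defaults mirror the llama-server example above, and `chat` needs a running server to return anything.

```python
import json
import urllib.request


def build_chat_request(prompt: str,
                       base_url: str = "http://localhost:8080",
                       model: str = "gemma-4-e4b") -> urllib.request.Request:
    """Build a POST request for an OpenAI-compatible /v1/chat/completions endpoint."""
    payload = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt: str, **kwargs) -> str:
    """Send the request and return the assistant text (requires a running server)."""
    request = build_chat_request(prompt, **kwargs)
    with urllib.request.urlopen(request, timeout=120) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


# Example, with llama-server running locally:
# print(chat("Write a short joke about saving VRAM."))
```

Pointing `base_url` at http://localhost:11434 and `model` at the Ollama model name should work the same way through Ollama's OpenAI-compatible endpoint.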

For multimodal input (images), pass images: [<base64>] (Ollama) or attach with the OpenAI image_url content block (llama.cpp mmproj build) — full multimodal usage is documented on the model card.
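
The two image-message shapes mentioned above can be sketched as follows. The helper names are hypothetical, and the payloads follow the publicly documented Ollama `images` field and the OpenAI `image_url` content block; whether a given llama-server build accepts images depends on it being started with a multimodal projector.

```python
import base64


def ollama_image_message(prompt: str, image_bytes: bytes) -> dict:
    """Message for Ollama's /api/chat: base64 strings go in the 'images' list."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return {"role": "user", "content": prompt, "images": [encoded]}


def openai_image_message(prompt: str, image_bytes: bytes,
                         mime: str = "image/png") -> dict:
    """Message for an OpenAI-style endpoint: the image travels as a data URI."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {"type": "image_url", "image_url": {"url": f"data:{mime};base64,{encoded}"}},
        ],
    }


# Example, assuming a local file named photo.png exists:
# with open("photo.png", "rb") as handle:
#     message = ollama_image_message("What is shown in this image?", handle.read())
```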

Results

  • Speed: No published RTX 4090 tok/s benchmark for the recommended transformers BF16 or llama.cpp GGUF paths exists yet — the only RTX-4090-named E4B measurement in the wild is the vLLM Issue #38887 regression report, which is unrepresentative of normal performance (9.2 tok/s under TRITON_ATTN fallback in vLLM 0.19.0; the reporter and vLLM maintainers both call the expected baseline 50–100+ tok/s for a 4.5 B model on this card). Once the per-layer-backend fix lands or a community benchmark surfaces, this section will be re-anchored. Submit yours via /contribute.
  • VRAM usage: Google AI for Developers cites weight footprints of 15 GB at BF16, 7.5 GB at FP8, and 5 GB at Q4_0 (weights only). The Unsloth tree agrees: the BF16 file is 15.05 GB, Q8_0 is 8.19 GB, and Q4_K_M is 4.98 GB. On a 24 GB card, expect roughly 5–8 GB free in practice on the BF16 path once weights, KV cache, and activations are loaded; Q8_0 leaves ~14 GB and Q4_K_M ~18 GB. The vLLM regression report is consistent with this: its RTX 4090 + BF16 run logs GPU KV cache usage: 1.9% during inference, so the BF16 weights loaded with KV-cache space to spare even before tuning.
  • Quality notes: E4B is the "daily-driver" tier of the Gemma 4 family — 4.5 B effective / 8 B with embeddings, 42 layers, 262 K vocab, ~150 M vision encoder + ~300 M audio encoder. Image inputs support variable aspect/resolution with configurable visual-token budgets (70 / 140 / 280 / 560 / 1120). Audio inputs cap at 30 seconds per clip. On 24 GB the BF16 path is the natural primary because there is no quality / VRAM trade-off to make — full precision fits with room for long contexts; Q8_0 and Q4_K_M exist for users who want more KV-cache headroom or a smaller download.

For the full benchmark data, see /check/gemma-4-e4b/rtx-4090.

Troubleshooting

vLLM v0.19.0 falls back to TRITON_ATTN and drops to ~9 tok/s

Per vLLM Issue #38887 (filed 2026-04-03, still open), Gemma 4's heterogeneous attention head dimensions (head_dim=256 for sliding-window layers, 512 for global) force vLLM to disable FlashAttention and fall back to TRITON_ATTN for the entire model. The observed throughput on RTX 4090 + vLLM 0.19.0 + BF16 is 9.2 tok/s, roughly 5–11× below the 50–100+ tok/s expected for a 4.5 B model on this card. Workaround: use transformers (Path A above) or llama.cpp / Ollama (Paths B/C) until the per-layer-backend fix lands. None of these paths exhibits the regression.

enable_thinking not triggering on E4B

Discussion #26 on the HF card notes that enable_thinking=True in the chat template does not currently activate Thinking Mode on the E4B / E2B sizes — only the larger 26 B MoE and 31 B dense variants support it. If you need extended reasoning, switch to one of those (the 26 B MoE recipe targets the same 24 GB envelope: /check/gemma4-26b/rtx-4090). Leave enable_thinking=False for E4B.

flash_attention_2 errors on the transformers path

If you copy a third-party transformers snippet that hardcodes attn_implementation="flash_attention_2", it may fail at the first forward pass on some driver / wheel combinations; see Dao-AILab/flash-attention#2168. Fix by removing the override or setting attn_implementation="sdpa". (FA2 ships Ada Lovelace sm_89 kernels, so on the 4090 this is a wheel-mismatch issue rather than the missing-kernel issue that 50-series users hit.)
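
A minimal sketch of the fix, assuming the Path A loading code: collect the loader arguments in one place with the attention backend set explicitly. `load_kwargs` is a hypothetical helper; `attn_implementation` is the standard transformers `from_pretrained` argument.

```python
def load_kwargs(attn_implementation: str = "sdpa") -> dict:
    """Keyword arguments for from_pretrained, with the attention backend set explicitly."""
    return {
        "dtype": "auto",
        "device_map": "auto",
        "attn_implementation": attn_implementation,
    }


# Intended use with the Path A snippet:
# model = AutoModelForMultimodalLM.from_pretrained(MODEL_ID, **load_kwargs())
print(load_kwargs())
```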

Tokenizer AttributeError: 'list' object has no attribute 'keys'

Per Discussion #17 on the HF card, an extra_special_tokens format incompatibility in some recent transformers versions causes the tokenizer to fail at load. Workaround: upgrade transformers to the latest release (pip install -U transformers) — Google's chat-template fix landed in the model repo upstream.
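
When reporting or debugging this error, it helps to record which versions are installed before and after the upgrade. The sketch below only reports what is present; this recipe does not state which transformers releases are affected, so no version threshold is checked. `installed_versions` is a hypothetical helper built on the standard library.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages=("transformers", "torch", "accelerate")) -> dict:
    """Map each package name to its installed version, or None if it is missing."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


print(installed_versions())
```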

Ollama function-calling / streaming glitches

Gemma 4's hybrid sliding-window/global attention can trigger parser bugs in Ollama's tool-call and streaming layers; if you hit malformed tool calls or chunked-streaming gaps, run llama.cpp directly (Path C) until the parser is patched.