Gemma 4 E4B on RTX 5090: Multimodal Inference via BF16 (with ~17 GB of Headroom to Spend)

multimodal · beginner · 20GB+ VRAM · May 25, 2026

Prerequisites
  • NVIDIA RTX 5090 (32 GB VRAM) — or any Blackwell/Ada/Ampere GPU with ≥ 20 GB for the BF16 path, ≥ 12 GB for Q8_0, ≥ 8 GB for Q4_K_M
  • Python 3.10+ with recent `transformers` for the BF16 path, OR llama.cpp / Ollama for the GGUF paths
  • CUDA 12.8+ wheel of PyTorch (cu128) — required for Blackwell sm_120; the default `pip install torch` index does NOT yet ship sm_120 kernels

What You'll Build

A local Gemma 4 E4B instance on an RTX 5090. Gemma 4 E4B is Google's instruction-tuned multimodal model with 4.5 B effective parameters (8 B with embeddings); it accepts text, images, and audio as input and produces text. On a 32 GB Blackwell card the recipe pins BF16 as the recommended variant (full precision, ~15 GB of weights per the Unsloth file table; Google's official docs list a 17.9 GB inference footprint that includes load overhead). The card is heavily over-provisioned for this model: 32 GB is roughly 2× the BF16 weight footprint, which leaves ~17 GB of VRAM for KV cache, long contexts, or one or two additional resident models.

Hardware data: RTX 5090 (32 GB VRAM) · multimodal-capable at full BF16 precision (~15 GB weights, ~17 GB headroom) · See benchmark data

ℹ️ Multimodal input, text-only output. Gemma 4 E4B reads images, audio (≤ 30 s clips), video (≤ 60 s at 1 fps per the HF model card), and text — and replies in text. It does not generate images, speech, or video. For TTS on this card, see Kokoro; for image generation, see Flux.2 Klein or Qwen-Image.

⚠️ vLLM v0.19.0 regression persists on Blackwell. A documented vLLM bug (Issue #38887) forces TRITON_ATTN fallback on Gemma 4 because of the model's heterogeneous attention head dimensions (head_dim=256 for sliding-window layers, 512 for global) — FlashInfer / FlashAttention can't handle the mismatch. The fallback still fires on RTX 5090 even with the vendor-published vllm/vllm-openai:gemma4-cu130 docker build: a community report on the issue thread (comment #4230036896, 2026-04-11, user fikrikarim, author_association: NONE) notes "The triton fallback is still being selected" on RTX 5090 even after retesting on the cu130 build. No Blackwell-specific patch had landed at issue last-update (2026-04-20). Skip vLLM until the per-layer-backend fix lands. This recipe defaults to transformers (BF16) and llama.cpp / Ollama (GGUF) — none of which exhibit the regression.

Spending the Headroom — the Real Use Case on 32 GB

A 32 GB card is heavily over-provisioned for a model with 4.5 B effective parameters: it holds roughly 2× the BF16 weights, 4× the Q8_0 weights, and 6× the Q4_K_M weights. BF16 weights alone are ~15 GB; after KV cache and activations the runtime peak is ~17–19 GB, leaving 13–15 GB of genuinely free VRAM. At Q8_0 (~8 GB weights) or Q4_K_M (~5 GB weights) the headroom grows to roughly 22–26 GB. The interesting question on the 5090 is not "does Gemma 4 E4B fit" (trivially yes) but "what to do with the leftover VRAM." Three concrete options:

Option 1 — Colocate one (or two) second models

The most actionable use of the headroom is to load a second model on the same GPU. The Gemma 4 E4B Q4_K_M GGUF is 4.977 GB on disk per the Unsloth file table (read directly via the HF tree API) — load it on llama.cpp with -ngl 99 and use the remaining ~26 GB for one or more resident models. Concrete pairings that fit cleanly on the 5090's 32 GB envelope:

  • Gemma 4 E4B (BF16, ~15 GB) + Qwen3-8B (Q4_K_M, ~5 GB) + Kokoro-82M TTS (~0.5 GB) + Whisper-large-v3 (~3 GB) — a complete production-style "multi-model server" with ASR-in / multimodal reasoning / text-only reasoning / TTS-out on a single GPU, totaling ~23 GB and leaving 8+ GB for KV cache across all four. See the kokoro-tts recipe at /check/kokoro-tts/rtx-5090 and Qwen3-8B at /check/qwen3-8b/rtx-5090.
  • Gemma 4 E4B (Q8_0, ~8 GB) + HunyuanVideo-1.5 8.3B step-distilled (~14 GB minimum) — combines a multimodal chat front-end with on-demand video generation on the same card (~22 GB combined, ~10 GB free). See /check/hunyuan-video/rtx-5090.
  • Gemma 4 E4B (Q4_K_M, ~5 GB) + Qwen3-32B (UD-Q4_K_XL ~20 GB) — a "smart-router" pattern: Gemma for multimodal (image/audio in) queries, a much larger reasoning model for hard text-only work. Sum is ~25 GB with ~7 GB free for shared KV cache and activations. See /check/qwen3-32b/rtx-5090.
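The arithmetic behind the three pairings above can be checked with a few lines of Python. This is a minimal sketch that uses only the approximate weight sizes quoted in the list; the dictionary keys and the `vram_budget` helper are illustrative names, and the result covers weights only, so KV cache and activations still have to fit in what is left.

```python
# Approximate weight footprints in GB, copied from the pairings listed above.
CARD_VRAM_GB = 32.0

PAIRINGS = {
    "multi-model server": {
        "Gemma 4 E4B (BF16)": 15.0,
        "Qwen3-8B (Q4_K_M)": 5.0,
        "Kokoro-82M TTS": 0.5,
        "Whisper-large-v3": 3.0,
    },
    "chat + video": {
        "Gemma 4 E4B (Q8_0)": 8.0,
        "HunyuanVideo-1.5 (step-distilled)": 14.0,
    },
    "smart router": {
        "Gemma 4 E4B (Q4_K_M)": 5.0,
        "Qwen3-32B (UD-Q4_K_XL)": 20.0,
    },
}


def vram_budget(models, card_gb=CARD_VRAM_GB):
    """Return (total weights, remaining VRAM) in GB for a set of resident models."""
    total = sum(models.values())
    return total, card_gb - total


for name, models in PAIRINGS.items():
    total, free = vram_budget(models)
    print(f"{name}: {total:.1f} GB of weights, {free:.1f} GB left for KV cache and activations")
```

Adding a candidate model to one of the dictionaries shows immediately whether the combination still leaves a usable margin.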

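Each model loads in its own process with CUDA_VISIBLE_DEVICES=0. Separate processes can share one GPU through the driver's default time-slicing; NVIDIA's MPS is optional and lets their kernels overlap. If you prefer a single daemon, Ollama can keep several models resident at once: OLLAMA_MAX_LOADED_MODELS sets how many models stay loaded, and OLLAMA_NUM_PARALLEL sets how many requests each model serves concurrently (reference: Ollama docs tree).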
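The process-per-model approach can be sketched as a small launcher. This is an illustration, not a supervised deployment: the `SERVERS` list and port numbers are hypothetical, the only model spec filled in is the one used in this recipe, and the sketch assumes `llama-server` is on your PATH.

```python
import os
import subprocess


def build_server_cmd(hf_spec, port):
    """Command line for one llama-server instance with all layers on the GPU."""
    return ["llama-server", "-hf", hf_spec, "-ngl", "99", "--port", str(port)]


def launch(hf_spec, port, gpu_index=0):
    """Start one llama-server process pinned to a single GPU."""
    env = dict(os.environ, CUDA_VISIBLE_DEVICES=str(gpu_index))
    return subprocess.Popen(build_server_cmd(hf_spec, port), env=env)


# Hypothetical layout: one port per resident model.
SERVERS = [
    ("unsloth/gemma-4-E4B-it-GGUF:Q4_K_M", 8080),
    # Add further (hf_spec, port) pairs here for colocated models.
]
```

Calling `[launch(spec, port) for spec, port in SERVERS]` starts the servers; each one then answers on its own port and the GPU is shared between them.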

Option 2 — Expand the context window

Gemma 4 E4B supports a 128 K-token context window per the HF card's dense-models table (E2B/E4B both list "Context Length: 128K tokens"). At BF16 on 32 GB the KV cache can grow to fill ~15 GB before pressing the VRAM ceiling — comfortably reaching the model's full 128 K advertised context window without offload, even for document-Q&A or RAG workloads with no KV-cache quantization tricks. At Q8_0 or Q4_K_M the runtime would never reach the ceiling at the model's full context.
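To sanity-check the long-context claim against your own setup, it helps to turn the ~15 GB KV budget into a per-token allowance and compare it with the model's per-token KV cost. The sketch below uses the standard keys-plus-values formula; the layer count, KV-head count, and head size you pass in should come from the model's config.json, since the KV-head count is not stated in this recipe. Applying the formula uniformly to every layer overestimates the cost, because sliding-window layers only cache a fixed window.

```python
def kv_bytes_per_token(n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """Keys plus values stored for one token across all layers (BF16 = 2 bytes)."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem


def max_context_tokens(budget_gb, per_token_bytes):
    """How many cached tokens fit in a KV budget given in GB (10^9 bytes)."""
    return int(budget_gb * 1e9 // per_token_bytes)


KV_BUDGET_GB = 15            # free VRAM at BF16, as estimated in this recipe
FULL_CONTEXT = 128 * 1024    # advertised context window in tokens

allowance = KV_BUDGET_GB * 1e9 / FULL_CONTEXT
print(f"Per-token KV allowance at full context: {allowance / 1024:.0f} KiB")
```

If `kv_bytes_per_token(...)` for the real configuration comes out below the printed allowance, the full window fits in the budget without KV-cache quantization.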

Option 3 — Multi-stream batching

For server-style workloads, llama.cpp's --parallel N and vLLM's continuous batching let you serve multiple concurrent requests off the same resident model. At Q4_K_M (~5 GB weights), the 5090's 32 GB envelope supports batch sizes that materially exceed what 24 GB Ada cards can — turning a 4.5 B model into an effective concurrent-throughput multiplier for ASR-front-end / chat-backend deployments. (Caveat: vLLM specifically is currently affected by the #38887 regression above — until the fix lands, use llama.cpp's llama-server --parallel for the same effect.)
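On the client side, concurrent requests against a server started with --parallel N might look like the sketch below. It assumes the llama-server address and the "gemma-4-e4b" model name used elsewhere in this recipe; `chat_payload`, `send`, and `fan_out` are illustrative helpers, not part of any library.

```python
import json
import urllib.request
from concurrent.futures import ThreadPoolExecutor

SERVER_URL = "http://localhost:8080/v1/chat/completions"  # llama-server address from this recipe


def chat_payload(prompt, model="gemma-4-e4b"):
    """Minimal OpenAI-style chat request body."""
    return {"model": model, "messages": [{"role": "user", "content": prompt}]}


def send(prompt, url=SERVER_URL):
    """POST one chat request and return the assistant's reply text."""
    body = json.dumps(chat_payload(prompt)).encode("utf-8")
    req = urllib.request.Request(url, data=body, headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]


def fan_out(prompts, sender=send, max_workers=4):
    """Send prompts concurrently; replies come back in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(sender, prompts))
```

Setting `max_workers` to the same value as the server's --parallel keeps every slot busy without queueing extra requests on the client.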

Requirements

| Component | Minimum | Tested |
| --- | --- | --- |
| GPU | 8 GB (Q4_K_M GGUF) / 12 GB (Q8_0 GGUF) / 20 GB (BF16) | RTX 5090 (32 GB) |
| RAM | 16 GB system RAM | |
| Storage | 5–15 GB depending on quant (Unsloth file-size table) | 4.977 GB for Q4_K_M, 8.193 GB for Q8_0, 15.053 GB for BF16 |
| Software | Python 3.10+ with transformers + CUDA 12.8 wheel (BF16 path) OR llama.cpp / Ollama (GGUF paths) | |

Per the official Google AI for Developers docs, the inference-memory footprint is 17.9 GB at BF16, 8.9 GB at SFP8, 4.5 GB at Q4_0 — Google's figures already include ~20 % load overhead on top of the raw weights. On a 32 GB card the BF16 path leaves ~17 GB of nominal headroom; in practice plan for 13–15 GB free after the runtime loads activations and KV cache at default contexts. Headroom remains substantial even with a 128 K-token KV cache resident.
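The relationship between Google's footprint figures and the raw weights can be reproduced with a short calculation. This sketch assumes the ~20 % overhead is applied multiplicatively (footprint = weights × 1.2), which is one reading of the wording above; the helper names are illustrative.

```python
CARD_VRAM_GB = 32.0
LOAD_OVERHEAD = 0.20  # Google's stated allowance on top of the raw weights

# Inference-memory footprints quoted from the Google AI for Developers docs.
GOOGLE_FOOTPRINT_GB = {"BF16": 17.9, "SFP8": 8.9, "Q4_0": 4.5}


def implied_weights_gb(footprint_gb, overhead=LOAD_OVERHEAD):
    """Raw weight size implied by a footprint that already includes overhead."""
    return footprint_gb / (1.0 + overhead)


def headroom_gb(footprint_gb, card_gb=CARD_VRAM_GB):
    """VRAM left on the card once the quoted footprint is resident."""
    return card_gb - footprint_gb


for precision, footprint in GOOGLE_FOOTPRINT_GB.items():
    print(
        f"{precision}: ~{implied_weights_gb(footprint):.1f} GB weights, "
        f"~{headroom_gb(footprint):.1f} GB free on a 32 GB card"
    )
```

For BF16 this gives roughly 14.9 GB of implied weights, in line with the 15.053 GB Unsloth file, and about 14 GB free, which matches the 13–15 GB planning range.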

Installation

This recipe defaults to BF16 via transformers — the 32 GB envelope on the RTX 5090 makes full precision the natural primary path with massive headroom for long contexts or a colocated second model. Q8_0 and Q4_K_M are documented as GGUF alternatives via Ollama or llama.cpp, useful when you want to colocate more aggressively (see "Spending the Headroom" above) or run a leaner server.

Path A — transformers BF16 (Python, recommended on 32 GB)

Install PyTorch + transformers. The RTX 5090 uses Blackwell (sm_120) — unlike older Ada (sm_89, 4090) or Ampere (sm_86, 3090) GPUs where the default pip install torch ships the right CUDA kernels, Blackwell requires the cu128 wheel to pick up sm_120 kernels:

# CUDA 12.8 PyTorch wheels: required for Blackwell sm_120 (RTX 50-series)
# Install torch and torchvision from the same index so their builds match
pip install -U torch torchvision --index-url https://download.pytorch.org/whl/cu128
pip install -U transformers accelerate librosa
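After installing, it is worth confirming that the PyTorch build actually targets the card. The sketch below compares the GPU's compute capability with the architectures the installed wheel was compiled for; `has_native_kernels` and `check_torch_build` are illustrative helper names, and a build that ships only PTX for the architecture would not be detected by this simple string match.

```python
def has_native_kernels(capability, arch_list):
    """True if the build lists an sm_XY target matching the GPU's capability."""
    major, minor = capability
    return f"sm_{major}{minor}" in arch_list


def check_torch_build():
    """Describe whether the installed PyTorch wheel targets GPU 0."""
    import torch  # imported lazily so the helper above works without PyTorch

    if not torch.cuda.is_available():
        return "CUDA is not available to this PyTorch build"
    capability = torch.cuda.get_device_capability(0)
    archs = torch.cuda.get_arch_list()
    status = "OK" if has_native_kernels(capability, archs) else "missing kernels"
    name = torch.cuda.get_device_name(0)
    return f"{name}: sm_{capability[0]}{capability[1]} {status} (build targets: {archs})"
```

Running `print(check_torch_build())` on the 5090 should report sm_120 among the build targets; if it does not, reinstall from the cu128 index.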

Path B — Ollama (one-liner, auto layer placement)

Ollama auto-detects the right number of GPU layers; no manual -ngl flag is needed on the 5090 — the model fits fully in VRAM at every quant tier documented here, with most of the card free.

# Linux: install (on macOS, use the app download from ollama.com instead)
curl -fsSL https://ollama.com/install.sh | sh

# Pull and run the Unsloth Q8_0 GGUF (≈ 8 GB download on first run, near-lossless quality)
ollama run hf.co/unsloth/gemma-4-E4B-it-GGUF:Q8_0

The unsloth/gemma-4-E4B-it-GGUF repo hosts the file; its model tree explicitly links upstream google/gemma-4-E4B-it (the canonical Google release). The community-maintained bartowski/google_gemma-4-E4B-it-GGUF is an equivalent Q8_0 mirror if you prefer Bartowski's quants.

Path C — llama.cpp (explicit -ngl, full control)

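# macOS / Linux (Homebrew). On Linux, confirm the packaged build has CUDA enabled
brew install llama.cpp

# Or build from source with CUDA: https://github.com/ggml-org/llama.cpp
# cmake -B build -DGGML_CUDA=ON && cmake --build build --config Release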

On the 5090, offloading all layers to GPU is safe at every quant tier documented here; pin with -ngl 99 (llama.cpp clamps to the model's real layer count):

# OpenAI-compatible local server with web UI — BF16 fits the 32 GB card with ~17 GB to spare
llama-server -hf unsloth/gemma-4-E4B-it-GGUF:BF16 -ngl 99

# Or pick a smaller quant for more headroom for colocated second models
llama-server -hf unsloth/gemma-4-E4B-it-GGUF:Q8_0 -ngl 99

# Or one-shot CLI
llama-cli -hf unsloth/gemma-4-E4B-it-GGUF:Q4_K_M -ngl 99

The -hf flag streams the GGUF directly from HuggingFace on first run and caches it locally.

Variant file-size table (from Unsloth's GGUF repo)

The figures below are read directly from the unsloth/gemma-4-E4B-it-GGUF tree via the HF tree API. They are on-disk file sizes; runtime VRAM is on-disk plus the KV cache + activations.

| Quant | File size | Fits 32 GB? |
| --- | --- | --- |
| Q4_K_M | 4.977 GB | ✅ Trivial: ~27 GB free for colocation / long context |
| UD-Q4_K_XL | 5.126 GB | ✅ Trivial: Unsloth dynamic 4-bit |
| Q5_K_M | 5.482 GB | ✅ Trivial |
| Q8_0 | 8.193 GB | ✅ Comfortable: near-lossless quality, ~24 GB free |
| UD-Q8_K_XL (Unsloth) | 8.712 GB | ✅ Comfortable |
| BF16 | 15.053 GB | ✅ Recommended: full precision, ~17 GB headroom |

Running

Path A — transformers BF16

The HuggingFace card's canonical snippet, verbatim from google/gemma-4-E4B-it:

from transformers import AutoProcessor, AutoModelForMultimodalLM

MODEL_ID = "google/gemma-4-E4B-it"

processor = AutoProcessor.from_pretrained(MODEL_ID)
model = AutoModelForMultimodalLM.from_pretrained(
    MODEL_ID,
    dtype="auto",        # picks BF16 on the 5090
    device_map="auto",
)

# Multimodal example — image + text
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://raw.githubusercontent.com/google-gemma/cookbook/refs/heads/main/apps/sample-data/GoldenGate.png"},
            {"type": "text", "text": "What is shown in this image?"},
        ],
    }
]

inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
    add_generation_prompt=True,
).to(model.device)
input_len = inputs["input_ids"].shape[-1]

outputs = model.generate(**inputs, max_new_tokens=512)
response = processor.decode(outputs[0][input_len:], skip_special_tokens=False)
print(processor.parse_response(response))

For text-only chat, swap AutoModelForMultimodalLM for AutoModelForCausalLM and drop the image content block. Recommended sampling per the model card: temperature=1.0, top_p=0.95, top_k=64.
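One way to apply the recommended sampling settings is to pass them straight to `generate`. This is a sketch rather than the model card's own code: the `generate_reply` helper is illustrative, and `do_sample=True` is added on the assumption that sampling is not already enabled by the model's default generation config.

```python
# Sampling settings quoted from the model card; do_sample is an added assumption.
SAMPLING = {"do_sample": True, "temperature": 1.0, "top_p": 0.95, "top_k": 64}


def generate_reply(model, processor, inputs, max_new_tokens=512):
    """Sample a reply and return only the newly generated text."""
    input_len = inputs["input_ids"].shape[-1]
    outputs = model.generate(**inputs, max_new_tokens=max_new_tokens, **SAMPLING)
    return processor.decode(outputs[0][input_len:], skip_special_tokens=True)
```

With the `model`, `processor`, and `inputs` objects from the snippet above, `generate_reply(model, processor, inputs)` returns the sampled answer as plain text.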

Path B / C — chat via Ollama or llama.cpp

After ollama run … or llama-server … is up, both expose an HTTP API: Ollama on localhost:11434 (its native /api/chat endpoint is used below; an OpenAI-compatible API is also served under /v1) and llama-server on localhost:8080 (OpenAI-compatible). For an interactive chat, just type at the prompt; for programmatic use:

# Ollama
curl http://localhost:11434/api/chat -d '{
  "model": "hf.co/unsloth/gemma-4-E4B-it-GGUF:Q8_0",
  "messages": [{"role": "user", "content": "Write a short joke about saving VRAM."}]
}'

# llama-server (OpenAI-compatible)
curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
  "model": "gemma-4-e4b",
  "messages": [{"role": "user", "content": "Write a short joke about saving VRAM."}]
}'

For multimodal input (images), pass images: [<base64>] (Ollama) or attach with the OpenAI image_url content block (llama.cpp mmproj build) — full multimodal usage is documented on the model card.
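The two image-passing conventions differ only in where the base64 data goes. The sketch below builds one user message in each shape; the helper names are illustrative, and the PNG MIME type is an assumption you should change to match your file.

```python
import base64


def ollama_image_message(prompt, image_bytes):
    """User message for Ollama's /api/chat: base64 images sit beside the text."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return {"role": "user", "content": prompt, "images": [encoded]}


def openai_image_message(prompt, image_bytes, mime="image/png"):
    """User message for an OpenAI-style endpoint: image sent as a data URL."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {"type": "image_url", "image_url": {"url": f"data:{mime};base64,{encoded}"}},
        ],
    }
```

Read the file with `open(path, "rb").read()`, place the resulting message in the `messages` array of the request body, and send it to the matching endpoint.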

Results

  • Speed: No first-party RTX 5090 tok/s benchmark exists for the recommended transformers BF16 or llama.cpp GGUF paths. The only RTX-5090-named E4B measurement in the wild is a community comment on the vLLM regression report (#38887 comment #4230036896, by fikrikarim, author_association: NONE) measuring ~130 tok/s for google/gemma-4-E4B-it on RTX 5090 under the vllm/vllm-openai:gemma4-cu130 docker build with the TRITON_ATTN fallback still firing. That number comes from a known-degraded, vLLM-specific path (the reporter explicitly calls it "still presumably well below if FlashAttention is enabled"), and it was measured on a different runtime from this recipe's primary transformers BF16 path, so it is not a benchmark for any path documented here. Decode for a 4.5 B-effective model is bandwidth-bound, and the 5090's memory bandwidth is ~1792 GB/s (over 1.7× the 4090's 1008 GB/s), so the transformers / llama.cpp / Ollama paths have room to do better, but that is unmeasured. Once a community benchmark surfaces, this section will be re-anchored; submit yours via /contribute.
  • VRAM usage: Cited inference-memory footprint per Google AI for Developers — BF16 = 17.9 GB, SFP8 = 8.9 GB, Q4_0 = 4.5 GB (incl. ~20% load overhead). The Unsloth tree confirms BF16 file = 15.053 GB, Q8_0 = 8.193 GB, Q4_K_M = 4.977 GB (read via the HF tree API). On a 32 GB card the BF16 path leaves ~13–15 GB free in practice (weights + KV cache + activations at default contexts); Q8_0 leaves ~22 GB; Q4_K_M leaves ~26 GB — see "Spending the Headroom" above for what to do with the free space.
  • Quality notes: E4B is the "daily-driver" tier of the Gemma 4 family — 4.5 B effective / 8 B with embeddings, 42 layers, 128 K context window, 262 K vocab, ~150 M vision encoder + ~300 M audio encoder per the HF card's dense-models table. Image inputs support variable aspect/resolution with configurable visual-token budgets (70 / 140 / 280 / 560 / 1120). Audio inputs cap at 30 seconds per clip (use Whisper as a front-end for longer audio — see "Option 1: Colocate one (or two) second models" above). On 32 GB the BF16 path is the natural primary because there is no quality / VRAM trade-off to make — full precision fits with vast room for long contexts and multiple colocated models; Q8_0 and Q4_K_M exist for users who want even more headroom or to consolidate multiple models on the same card.
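The bandwidth-bound argument in the speed note can be turned into a rough ceiling. This is a back-of-the-envelope sketch, not a benchmark: it assumes that only the ~4.5 B effective parameters are streamed from VRAM for each generated token, at 2 bytes each in BF16, and it ignores KV-cache reads and kernel overhead.

```python
RTX_5090_GBPS = 1792.0  # memory bandwidth quoted in this recipe
RTX_4090_GBPS = 1008.0


def decode_ceiling_tok_s(bandwidth_gbps, streamed_gb_per_token):
    """Upper bound on decode speed if each token must stream this many GB."""
    return bandwidth_gbps / streamed_gb_per_token


# Assumption: 4.5 B effective parameters at 2 bytes each (BF16).
EFFECTIVE_BF16_GB = 4.5 * 2

print(f"Bandwidth ratio 5090 / 4090: {RTX_5090_GBPS / RTX_4090_GBPS:.2f}x")
print(f"BF16 decode ceiling on the 5090: {decode_ceiling_tok_s(RTX_5090_GBPS, EFFECTIVE_BF16_GB):.0f} tok/s")
```

Real throughput will land below this ceiling; the value is mainly useful for judging whether a measured number is plausible for the hardware.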

For the full benchmark data, see /check/gemma-4-e4b/rtx-5090.

Troubleshooting

vLLM v0.19.0 falls back to TRITON_ATTN — Blackwell-specific patch has NOT landed

Per vLLM Issue #38887 (filed 2026-04-03, still open at issue last-update 2026-04-20), Gemma 4's heterogeneous attention head dimensions (head_dim=256 for sliding-window layers, 512 for global) force vLLM to disable FlashAttention/FlashInfer and fall back to TRITON_ATTN for the entire Gemma 4 family. The original report measured 9.2 tok/s on RTX 4090 (Ada sm_89) for google/gemma-4-E4B-it. A community report from a 5090 user (comment #4230036896, fikrikarim, author_association: NONE) measures ~130 tok/s for E4B on RTX 5090 with the vendor's vllm/vllm-openai:gemma4-cu130 docker build. That is roughly 14× the 4090 figure, far more than the ~1.8× memory-bandwidth gap between the cards can account for, so the two measurements probably differ in configuration as well as hardware and should not be compared directly. The same TRITON_ATTN fallback is still being selected on the 5090: the regression mechanism (heterogeneous head-dim routing) is architecture-independent. No Blackwell-specific patch had landed at issue last-update. Workaround: use transformers (Path A above) or llama.cpp / Ollama (Paths B/C) until the per-layer-backend fix lands; all three alternative paths sidestep the issue entirely.

flash_attention_2 errors on the transformers path (Blackwell sm_120)

The canonical HF snippet uses dtype="auto" and does not request attn_implementation="flash_attention_2", so the default SDPA path runs cleanly on Blackwell. If you copy a third-party snippet that hardcodes flash_attention_2, the call will fail at first forward pass — FA2 wheels do not yet ship sm_120 kernels for RTX 50-series, tracked at Dao-AILab/flash-attention#2168 (open, last update 2026-05-10). Fix by removing the override or setting attn_implementation="sdpa". (This is the missing-kernel issue specific to the 50-series — on older Ada (sm_89) or Ampere (sm_86) cards the FA2 kernels are shipped and only wheel-mismatch causes failures.)
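If you reuse loading code across several GPUs, a small guard can apply the fix described above automatically. This is a sketch: `safe_attn_implementation` is an illustrative helper, and it only covers the sm_120 case discussed here.

```python
def safe_attn_implementation(requested, capability):
    """Swap flash_attention_2 for sdpa on sm_120 (RTX 50-series); pass others through."""
    if requested == "flash_attention_2" and tuple(capability) == (12, 0):
        return "sdpa"
    return requested
```

Pass the result as `attn_implementation=...` to `from_pretrained`, with `capability` taken from `torch.cuda.get_device_capability(0)`.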

enable_thinking not triggering on E4B

Discussion #26 on the HF card notes that enable_thinking=True in the chat template does not currently activate Thinking Mode on the E4B / E2B sizes — only the larger 26 B MoE and 31 B dense variants support it. If you need extended reasoning, switch to one of those (the 26 B MoE recipe is documented on the 5090 as well: /check/gemma4-26b/rtx-5090 — fits Q8_0 comfortably at ~29 GB). Leave enable_thinking=False for E4B.

Tokenizer AttributeError: 'list' object has no attribute 'keys'

Per Discussion #17 on the HF card, an extra_special_tokens format incompatibility in some recent transformers versions causes the tokenizer to fail at load. Workaround: upgrade transformers to the latest release (pip install -U transformers). Google's chat-template fix landed in the model repo upstream, so if the error persists after upgrading, re-download the model files to replace a stale cached copy.

Ollama function-calling / streaming glitches

Gemma 4's hybrid sliding-window/global attention can trigger parser bugs in Ollama's tool-call and streaming layers; if you hit malformed tool calls or chunked-streaming gaps, run llama.cpp directly (Path C) until the parser is patched.