What You'll Build
A local Gemma 4 E4B instance on an RTX 3090. E4B is Google's 4.5 B-effective-parameter (8 B with embeddings) instruction-tuned multimodal model; it accepts text, images, and audio as input and produces text. On a 24 GB Ampere card the recipe pins BF16 as the recommended variant (full precision, ~15 GB weights per the official Google docs). The card is heavily over-provisioned for a model this size: BF16 weights leave ~9 GB of nominal VRAM headroom for KV cache, long contexts, or a second resident model.
Hardware data: RTX 3090 (24 GB VRAM) · multimodal-capable at full BF16 precision (~15 GB weights, ~9 GB headroom) · See benchmark data
ℹ️ Multimodal input, text-only output. Gemma 4 E4B reads images, audio (≤ 30 s clips), and text — and replies in text. It does not generate images, speech, or video. For TTS on this card, see Kokoro; for image generation, see Z-Image Turbo or Chroma V48.
⚠️ vLLM v0.19.0 regression (affects all GPU architectures). A documented vLLM bug (Issue #38887) forces a TRITON_ATTN fallback on Gemma 4 because of the model's heterogeneous attention head dimensions (head_dim=256 for sliding-window layers, 512 for global); FlashInfer / FlashAttention can't handle the mismatch. The TRITON_ATTN fallback is a family-wide architectural property (the heterogeneous head_dim is a Gemma 4 design choice, not per-variant). The original report measured 9.2 tok/s on RTX 4090 (Ada) for E4B; a follow-up confirmed the same fallback on RTX 5090 (Blackwell) for E4B specifically. A separate community report on DRIVE AGX Thor measured the sibling 31B NVFP4 variant under the same TRITON_ATTN routing: different variant, same regression mechanism. No first-party RTX 3090 (Ampere) measurement exists in the thread, but the routing-level bottleneck is independent of CUDA arch. Skip vLLM until the per-layer-backend fix lands. This recipe defaults to transformers (BF16) and llama.cpp / Ollama (GGUF), none of which exhibit the regression.
Spending the Headroom — the Real Use Case on 24 GB
A 4.5 B-effective-parameter model leaves a 24 GB card heavily over-provisioned: the card holds about 1.6× the BF16 weights and about 3× the Q8_0 weights. BF16 weights alone are ~15 GB (Google AI docs); after KV cache and activations the runtime peak is ~17–19 GB, leaving 5–7 GB of genuinely free VRAM. At Q8_0 (~8 GB weights) or Q4_K_M (~5 GB weights) the headroom grows to 15–18 GB. The interesting question on the 3090 is not "does Gemma 4 E4B fit" (trivially yes) but "what to do with the leftover VRAM." Three concrete options:
Option 1 — Colocate a second model
The most actionable use of the headroom is to load a second model on the same GPU. The Gemma 4 E4B Q4_K_M GGUF is 4.98 GB on disk per the Unsloth file table — load it on llama.cpp with -ngl 99 and use the remaining ~18 GB for a second resident model. Concrete pairings that fit cleanly:
- Gemma 4 E4B (Q8_0, ~8 GB) + Kokoro-82M TTS (~0.5 GB) for a text+audio assistant that does ASR-in / text-out with Gemma and synthesizes the response with Kokoro on the same card. The Kokoro recipe lives at /check/kokoro-tts/rtx-3090.
- Gemma 4 E4B (Q4_K_M, ~5 GB) + Qwen3-8B (Q4_K_M, ~5 GB) for routed multi-model inference — Gemma for multimodal queries, a heavier reasoning model for text-only. Both at Q4_K_M leaves ~14 GB free for KV cache on whichever model is active.
- Gemma 4 E4B (BF16, ~15 GB) + Whisper-large-v3 (~3 GB) for an audio-pipeline: Whisper handles arbitrary-length transcription (Gemma 4's 30s audio clip ceiling is a hard limit), then hands its text to Gemma for reasoning over the transcript.
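Each model loads in its own process with CUDA_VISIBLE_DEVICES=0. Separate processes can share one GPU out of the box through time-slicing; NVIDIA's MPS is an optional layer that lets their kernels overlap instead. If you prefer a single server, Ollama can keep several models resident at once: OLLAMA_MAX_LOADED_MODELS caps how many models stay loaded, and OLLAMA_NUM_PARALLEL sets how many requests each loaded model serves concurrently (reference: Ollama docs tree).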
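Before launching the second model it helps to check how much VRAM the first one actually left. The sketch below reads used and total memory from nvidia-smi and applies a simple fit rule. The function names and the 2 GiB reserve for KV cache and activations are illustrative assumptions, not values from any of the cited docs.

```python
import subprocess


def parse_gpu_memory(csv_line: str) -> tuple[int, int]:
    """Parse one 'used, total' line (values in MiB) from nvidia-smi CSV output."""
    used, total = (int(field.strip()) for field in csv_line.split(","))
    return used, total


def free_mib(gpu_index: int = 0) -> int:
    """Ask nvidia-smi how much VRAM is currently unused on one GPU."""
    out = subprocess.run(
        [
            "nvidia-smi",
            "-i", str(gpu_index),
            "--query-gpu=memory.used,memory.total",
            "--format=csv,noheader,nounits",
        ],
        capture_output=True,
        text=True,
        check=True,
    ).stdout
    used, total = parse_gpu_memory(out.strip().splitlines()[0])
    return total - used


def second_model_fits(free: int, weights_mib: int, reserve_mib: int = 2048) -> bool:
    """Assumed rule of thumb: weights plus a fixed reserve must fit in free VRAM."""
    return free >= weights_mib + reserve_mib
```

With the first model already resident, `second_model_fits(free_mib(), 3072)` would answer whether a ~3 GB companion such as the Whisper pairing above still fits under this rule.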
Option 2 — Expand the context window
Gemma 4 E4B supports a long context window (per the HF card, the architecture has hybrid sliding-window/global attention). At BF16 on 24 GB the KV cache can grow to fill ~7 GB before pressing the VRAM ceiling — enough headroom for documents 50K+ tokens long without offload. The Q8_0 or Q4_K_M paths leave even more, comfortably reaching the model's full advertised context window for document-Q&A or RAG workloads.
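To see why hybrid attention keeps long contexts cheap, the KV cache can be estimated layer by layer: sliding-window layers stop growing once the window is full, while global layers grow with the whole sequence. The sketch below uses the head dimensions quoted from the vLLM issue (256 sliding, 512 global) and a 42-layer total; the sliding/global split, the KV-head count, and the window length are placeholders chosen for illustration, so read the real values from the model's config before trusting any number it prints.

```python
def kv_cache_bytes(
    seq_len: int,
    n_sliding: int,
    n_global: int,
    kv_heads: int,
    head_dim_sliding: int = 256,   # from the vLLM issue cited in this recipe
    head_dim_global: int = 512,    # from the vLLM issue cited in this recipe
    window: int = 1024,            # placeholder, not a confirmed Gemma 4 value
    bytes_per_elem: int = 2,       # BF16 / FP16 cache entries
) -> int:
    """Estimate KV-cache size for a hybrid sliding-window / global model."""
    # Sliding-window layers only keep the most recent `window` tokens.
    sliding_tokens = min(seq_len, window)
    # Factor 2 covers the separate key and value tensors.
    per_sliding = 2 * kv_heads * head_dim_sliding * sliding_tokens * bytes_per_elem
    per_global = 2 * kv_heads * head_dim_global * seq_len * bytes_per_elem
    return n_sliding * per_sliding + n_global * per_global


# Placeholder split of the 42 layers and a placeholder KV-head count.
estimate = kv_cache_bytes(seq_len=50_000, n_sliding=35, n_global=7, kv_heads=2)
print(f"Estimated KV cache at 50K tokens: {estimate / 1e9:.2f} GB")
```

The estimate ignores activations and any runtime-specific cache layout, so treat it as a lower bound when sizing a context window against the free VRAM.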
Option 3 — Multi-stream batching
For server-style workloads, llama.cpp's --parallel N and vLLM's continuous batching let you serve multiple concurrent requests off the same resident model. At Q4_K_M (~5 GB weights), the 3090 has room for batch sizes that meaningfully exceed what an 8 GB card can do — turning a 4.5 B model into an effective concurrent-throughput multiplier for ASR-front-end / chat-backend deployments. (Caveat: vLLM specifically is currently affected by the #38887 regression above — until the fix lands, use llama.cpp's llama-server --parallel for the same effect.)
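A minimal client-side sketch of that setup: several prompts sent concurrently to one llama-server so its parallel slots stay busy. It assumes the server from Path C is listening on localhost:8080; the model name, the `fan_out` helper, and the worker count are illustrative and should be matched to whatever value you pass to --parallel.

```python
import json
import urllib.request
from concurrent.futures import ThreadPoolExecutor

# Assumed address of a running llama-server (Path C default port).
URL = "http://localhost:8080/v1/chat/completions"


def ask(prompt: str, url: str = URL) -> str:
    """Send one chat request and return the assistant's text."""
    body = json.dumps(
        {
            "model": "gemma-4-e4b",
            "messages": [{"role": "user", "content": prompt}],
        }
    ).encode("utf-8")
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=300) as response:
        reply = json.load(response)
    return reply["choices"][0]["message"]["content"]


def fan_out(prompts, send=ask, max_workers: int = 4):
    """Run `send` over all prompts concurrently, preserving input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(send, prompts))
```

Calling `fan_out(["Summarize A", "Summarize B"], max_workers=2)` against a server started with `--parallel 2` would keep both slots occupied; the `send` parameter exists so the fan-out logic can be exercised without a live server.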
Requirements
| Component | Minimum | Tested |
|---|---|---|
| GPU | 8 GB (Q4_K_M GGUF) / 12 GB (Q8_0 GGUF) / 20 GB (BF16) | RTX 3090 (24 GB) |
| RAM | 16 GB system RAM | — |
| Storage | 5 – 15 GB depending on quant (Unsloth file-size table) | 4.98 GB for Q4_K_M, 8.19 GB for Q8_0, 15.1 GB for BF16 |
| Software | Python 3.10+ with transformers (BF16 path) OR llama.cpp / Ollama (GGUF paths) | — |
Per the official Google AI for Developers docs, the static-weight memory footprint is 15 GB at BF16, 7.5 GB at FP8, 5 GB at Q4_0 — these numbers cover the weights only, not the KV cache or runtime overhead. On a 24 GB card the BF16 path leaves ~9 GB of nominal headroom; in practice plan for 5–7 GB free after the runtime loads activations and KV cache.
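The headroom figures above are simple subtraction, sketched here so the numbers can be re-derived for other cards. The weight sizes are the Google figures quoted above. The 2 to 4 GB runtime overhead is inferred from this recipe's ~17–19 GB BF16 peak, and applying the same overhead to the other precisions is an assumption.

```python
CARD_GB = 24.0
# Static weight footprints quoted from the Google AI for Developers docs.
WEIGHTS_GB = {"BF16": 15.0, "FP8": 7.5, "Q4_0": 5.0}


def nominal_headroom(weights_gb: float, card_gb: float = CARD_GB) -> float:
    """VRAM left after loading weights only."""
    return card_gb - weights_gb


def practical_headroom(
    weights_gb: float,
    overhead_gb: tuple[float, float] = (2.0, 4.0),  # assumed KV cache + activations
    card_gb: float = CARD_GB,
) -> tuple[float, float]:
    """(worst case, best case) free VRAM once runtime overhead is counted."""
    low_overhead, high_overhead = overhead_gb
    nominal = card_gb - weights_gb
    return nominal - high_overhead, nominal - low_overhead


for name, weights in WEIGHTS_GB.items():
    worst, best = practical_headroom(weights)
    print(
        f"{name}: nominal {nominal_headroom(weights):.1f} GB, "
        f"practical {worst:.1f} to {best:.1f} GB"
    )
```

For BF16 this reproduces the 9 GB nominal and 5 to 7 GB practical figures; long contexts push the overhead higher than the assumed range.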
Installation
This recipe defaults to BF16 via transformers — the 24 GB envelope on the RTX 3090 makes full precision the natural primary path. Q8_0 and Q4_K_M are documented as GGUF alternatives via Ollama or llama.cpp, useful when you want to colocate a second model (see "Spending the Headroom" above) or run very long contexts.
Path A — transformers BF16 (Python, recommended on 24 GB)
Install PyTorch + transformers. The RTX 3090 uses Ampere (sm_86); unlike Blackwell GPUs (RTX 50-series, sm_120), the default pip install torch already includes the right CUDA kernels — no cu128-only wheel selection is required.
# Default CUDA 12.x PyTorch wheel — Ampere sm_86 kernels ship in the standard release
pip install -U torch
pip install -U transformers accelerate torchvision librosa
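A quick sanity check after installing, as a sketch: confirm that PyTorch sees the card, that it reports compute capability 8.6 (sm_86), and that BF16 is supported. The helper names are illustrative; the torch.cuda calls are standard PyTorch APIs.

```python
def capability_label(capability: tuple[int, int]) -> str:
    """Turn a (major, minor) compute capability into an sm_XY label."""
    major, minor = capability
    return f"sm_{major}{minor}"


def describe_gpu() -> str:
    """Summarize what PyTorch can see, without raising if it is missing."""
    try:
        import torch
    except ImportError:
        return "PyTorch is not installed in this environment"
    if not torch.cuda.is_available():
        return "PyTorch is installed but cannot see a CUDA device"
    name = torch.cuda.get_device_name(0)
    label = capability_label(torch.cuda.get_device_capability(0))
    bf16 = torch.cuda.is_bf16_supported()
    return f"{name} ({label}), BF16 supported: {bf16}"


print(describe_gpu())
```

On an RTX 3090 the label should read sm_86; anything else suggests the wrong device index or a CPU-only wheel.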
Path B — Ollama (one-liner, auto layer placement)
Ollama auto-detects the right number of GPU layers; no manual -ngl flag is needed on the 3090 — the model fits fully in VRAM at every quant tier documented here.
# macOS / Linux — install
curl -fsSL https://ollama.com/install.sh | sh
# Pull and run the Unsloth Q8_0 GGUF (≈ 8 GB download on first run, near-lossless quality)
ollama run hf.co/unsloth/gemma-4-E4B-it-GGUF:Q8_0
The unsloth/gemma-4-E4B-it-GGUF repo hosts the file; its model tree explicitly links upstream google/gemma-4-E4B-it (the canonical Google release). The community-maintained bartowski/google_gemma-4-E4B-it-GGUF is an equivalent Q8_0 mirror if you prefer Bartowski's quants.
Path C — llama.cpp (explicit -ngl, full control)
# macOS / Linux
brew install llama.cpp
# Or build from source — https://github.com/ggml-org/llama.cpp
On the 3090, offloading all layers to GPU is safe at every quant tier documented here; pin with -ngl 99 (llama.cpp clamps to the model's real layer count):
# OpenAI-compatible local server with web UI — BF16 fits the 24 GB card cleanly
llama-server -hf unsloth/gemma-4-E4B-it-GGUF:BF16 -ngl 99
# Or pick a smaller quant for headroom for a colocated second model
llama-server -hf unsloth/gemma-4-E4B-it-GGUF:Q8_0 -ngl 99
# Or one-shot CLI
llama-cli -hf unsloth/gemma-4-E4B-it-GGUF:Q4_K_M -ngl 99
The -hf flag streams the GGUF directly from HuggingFace on first run and caches it locally.
Variant file-size table (from Unsloth's GGUF repo)
The figures below are read directly from the unsloth/gemma-4-E4B-it-GGUF tree via the HF tree API. They are on-disk file sizes; runtime VRAM is on-disk plus the KV cache + activations.
| Quant | File size | Fits 24 GB? |
|---|---|---|
| Q4_K_M | 4.98 GB | ✅ Trivial — ~19 GB free for colocate / long context |
| UD-Q4_K_XL | 5.13 GB | ✅ Trivial — Unsloth dynamic 4-bit |
| Q5_K_M | 5.48 GB | ✅ Trivial |
| Q8_0 | 8.19 GB | ✅ Comfortable — near-lossless quality, ~16 GB free |
| UD-Q8_K_XL (Unsloth) | 8.71 GB | ✅ Comfortable |
| BF16 | 15.1 GB | ✅ Recommended — full precision, ~9 GB headroom |
Running
Path A — transformers BF16
The HuggingFace card's canonical snippet, verbatim from google/gemma-4-E4B-it:
from transformers import AutoProcessor, AutoModelForMultimodalLM

MODEL_ID = "google/gemma-4-E4B-it"
processor = AutoProcessor.from_pretrained(MODEL_ID)
model = AutoModelForMultimodalLM.from_pretrained(
    MODEL_ID,
    dtype="auto",  # picks BF16 on the 3090
    device_map="auto",
)

# Multimodal example — image + text
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://raw.githubusercontent.com/google-gemma/cookbook/refs/heads/main/Demos/sample-data/GoldenGate.png"},
            {"type": "text", "text": "What is shown in this image?"},
        ],
    }
]
inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
    add_generation_prompt=True,
).to(model.device)
input_len = inputs["input_ids"].shape[-1]

outputs = model.generate(**inputs, max_new_tokens=512)
response = processor.decode(outputs[0][input_len:], skip_special_tokens=False)
print(processor.parse_response(response))
For text-only chat, swap AutoModelForMultimodalLM for AutoModelForCausalLM and drop the image content block. Recommended sampling per the model card: temperature=1.0, top_p=0.95, top_k=64.
Path B / C — chat via Ollama or llama.cpp
After ollama run … or llama-server … is up, both serve an HTTP API: Ollama on localhost:11434 (its native /api/chat, plus an OpenAI-compatible /v1 endpoint) and llama-server on localhost:8080 (OpenAI-compatible). For an interactive chat, just type at the prompt; for programmatic use:
# Ollama (native API)
curl http://localhost:11434/api/chat -d '{
  "model": "hf.co/unsloth/gemma-4-E4B-it-GGUF:Q8_0",
  "messages": [{"role": "user", "content": "Write a short joke about saving VRAM."}]
}'
# llama-server (OpenAI-compatible)
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
  "model": "gemma-4-e4b",
  "messages": [{"role": "user", "content": "Write a short joke about saving VRAM."}]
}'
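The model card's recommended sampling values (temperature=1.0, top_p=0.95, top_k=64) can be carried in the request body on both servers. The sketch below only builds the JSON payloads. It assumes Ollama reads sampling settings from its "options" object and that llama-server accepts top_k as a top-level extension to the OpenAI schema; check both against the server version you run.

```python
import json

# Recommended sampling per the model card, as quoted in this recipe.
SAMPLING = {"temperature": 1.0, "top_p": 0.95, "top_k": 64}


def ollama_body(prompt: str, model: str = "hf.co/unsloth/gemma-4-E4B-it-GGUF:Q8_0") -> dict:
    """Payload for Ollama's native /api/chat endpoint."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,  # return one JSON object instead of a chunk stream
        "options": dict(SAMPLING),
    }


def llama_server_body(prompt: str, model: str = "gemma-4-e4b") -> dict:
    """Payload for llama-server's /v1/chat/completions endpoint."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        **SAMPLING,  # top_k here is assumed to be a llama.cpp extension
    }


print(json.dumps(ollama_body("Write a short joke about saving VRAM."), indent=2))
```

Either dictionary can be serialized with `json.dumps` and passed to curl's -d flag or to any HTTP client.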
For multimodal input (images), pass images: [<base64>] (Ollama) or attach with the OpenAI image_url content block (llama.cpp mmproj build) — full multimodal usage is documented on the model card.
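The two image formats mentioned above differ only in where the base64 data goes. This sketch builds one user message in each shape; the helper names are illustrative, and the llama-server variant assumes a build with the multimodal projector loaded.

```python
import base64


def b64(image_bytes: bytes) -> str:
    """Base64-encode raw image bytes as ASCII text."""
    return base64.b64encode(image_bytes).decode("ascii")


def ollama_image_message(prompt: str, image_bytes: bytes) -> dict:
    """Ollama style: plain base64 strings in an 'images' list on the message."""
    return {"role": "user", "content": prompt, "images": [b64(image_bytes)]}


def openai_image_message(prompt: str, image_bytes: bytes, mime: str = "image/png") -> dict:
    """OpenAI style: an image_url content block carrying a data URI."""
    data_uri = f"data:{mime};base64,{b64(image_bytes)}"
    return {
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": data_uri}},
            {"type": "text", "text": prompt},
        ],
    }
```

In practice `image_bytes` would come from `open(path, "rb").read()`, and the resulting message replaces the text-only entry in the "messages" array of the requests shown earlier.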
Results
- Speed: No published RTX 3090 tok/s benchmark exists yet for Gemma 4 E4B on the recommended transformers BF16 or llama.cpp GGUF paths. The closest same-architecture reference point is a danilchenko.dev walkthrough (2026-04-07) measuring ~45 tok/s at Q8_0 on an RTX 3060 (12 GB), which is also Ampere (sm_86) but has materially lower memory bandwidth (~360 GB/s on the 3060 vs the 3090's 936 GB/s). On the 3090 you should expect substantially higher throughput on Q8_0 (autoregressive decode is memory-bandwidth-bound); the BF16 path will be somewhat slower due to the larger working set. Once a community benchmark surfaces, this section will be re-anchored; submit yours via /contribute.
- VRAM usage: Cited weight footprint per Google AI for Developers: BF16 = 15 GB, FP8 = 7.5 GB, Q4_0 = 5 GB (weights only). The Unsloth tree confirms BF16 file = 15.1 GB, Q8_0 = 8.19 GB, Q4_K_M = 4.98 GB. On a 24 GB card the BF16 path leaves ~5–7 GB free in practice (weights + KV cache + activations); Q8_0 leaves ~14 GB; Q4_K_M leaves ~18 GB. See "Spending the Headroom" above for what to do with the free space.
- Quality notes: E4B is the "daily-driver" tier of the Gemma 4 family — 4.5 B effective / 8 B with embeddings, 42 layers, 262 K vocab, ~150 M vision encoder + ~300 M audio encoder. Image inputs support variable aspect/resolution with configurable visual-token budgets (70 / 140 / 280 / 560 / 1120). Audio inputs cap at 30 seconds per clip (use Whisper as a front-end for longer audio — see "Option 1: Colocate a second model" above). On 24 GB the BF16 path is the natural primary because there is no quality / VRAM trade-off to make — full precision fits with room for long contexts or a colocated second model; Q8_0 and Q4_K_M exist for users who want even more headroom.
For the full benchmark data, see /check/gemma-4-e4b/rtx-3090.
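Until a measured 3090 number exists, the bandwidth argument in the Speed note can be turned into a first-order estimate. The sketch below scales the RTX 3060 reference by the bandwidth ratio and then by the weight-size ratio. These are estimates under the assumption that decode is purely memory-bandwidth-bound, not measurements; real throughput will be lower once compute, kernel overhead, and KV-cache reads are included.

```python
def scale_by_bandwidth(measured_tok_s: float, source_bw_gbs: float, target_bw_gbs: float) -> float:
    """Scale a measured decode rate by the ratio of memory bandwidths."""
    return measured_tok_s * target_bw_gbs / source_bw_gbs


def scale_by_working_set(tok_s: float, source_weights_gb: float, target_weights_gb: float) -> float:
    """Scale a decode rate by how many more bytes each token must read."""
    return tok_s * source_weights_gb / target_weights_gb


# Inputs quoted in this recipe: ~45 tok/s on the 3060 (~360 GB/s), 936 GB/s on the 3090,
# and Unsloth file sizes of 8.19 GB (Q8_0) and 15.1 GB (BF16).
q8_estimate = scale_by_bandwidth(45.0, 360.0, 936.0)
bf16_estimate = scale_by_working_set(q8_estimate, 8.19, 15.1)

print(f"Q8_0 first-order estimate on the 3090: {q8_estimate:.0f} tok/s")
print(f"BF16 first-order estimate on the 3090: {bf16_estimate:.0f} tok/s")
```

Treat both figures as ceilings to compare a real benchmark against, not as expected results.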
Troubleshooting
vLLM v0.19.0 falls back to TRITON_ATTN — affects all GPU architectures
Per vLLM Issue #38887 (filed 2026-04-03, still open), Gemma 4's heterogeneous attention head dimensions (head_dim=256 for sliding-window layers, 512 for global) force vLLM to disable FlashAttention/FlashInfer and fall back to TRITON_ATTN for the entire Gemma 4 family. The routing-level regression is family-wide and architecture-independent — the original report measured 9.2 tok/s on RTX 4090 (Ada sm_89) for google/gemma-4-E4B-it; a follow-up confirmed the same fallback on RTX 5090 (Blackwell, ~130 tok/s for E4B specifically, comment #4230036896). A separate community report measured the sibling 31B NVFP4 variant on DRIVE AGX Thor (~7.6 tok/s on Blackwell sm_110, comment #4185970219) under the same TRITON_ATTN routing — different variant, same regression mechanism, so its absolute number is not directly comparable to E4B on Ampere but its existence confirms the fallback fires across the Gemma 4 family. No first-party RTX 3090 (Ampere sm_86) measurement appears in the thread, but the bottleneck is the routing logic, not any per-arch kernel. Workaround: use transformers (Path A above) or llama.cpp / Ollama (Paths B/C) until the per-layer-backend fix lands. Both alternative paths sidestep the issue entirely.
enable_thinking not triggering on E4B
Discussion #26 on the HF card notes that enable_thinking=True in the chat template does not currently activate Thinking Mode on the E4B / E2B sizes — only the larger 26 B MoE and 31 B dense variants support it. If you need extended reasoning, switch to one of those (the 26 B MoE recipe targets the same 24 GB envelope: /check/gemma4-26b/rtx-3090). Leave enable_thinking=False for E4B.
flash_attention_2 errors on the transformers path
If you copy a third-party transformers snippet that hardcodes attn_implementation="flash_attention_2", it may fail at first forward pass on some driver / wheel combinations — see Dao-AILab/flash-attention#2168. Fix by removing the override or setting attn_implementation="sdpa". (The Ampere sm_86 kernels in FA2 are shipped — this is a wheel-mismatch issue on the 3090 rather than the missing-kernel issue 50-series users hit.)
Tokenizer AttributeError: 'list' object has no attribute 'keys'
Per Discussion #17 on the HF card, an extra_special_tokens format incompatibility in some recent transformers versions causes the tokenizer to fail at load. Workaround: upgrade transformers to the latest release (pip install -U transformers) — Google's chat-template fix landed in the model repo upstream.
Ollama function-calling / streaming glitches
Gemma 4's hybrid sliding-window/global attention can trigger parser bugs in Ollama's tool-call and streaming layers; if you hit malformed tool calls or chunked-streaming gaps, run llama.cpp directly (Path C) until the parser is patched.