Gemma 4 E4B on RTX 5070: Multimodal Inference via Q4_K_M GGUF (llama.cpp or Ollama — BF16 will not fit)

multimodal · beginner · 6GB+ VRAM · Jun 9, 2026
prerequisites
  • NVIDIA RTX 5070 (12 GB VRAM) or any GPU with ≥ 6 GB for Q4_K_M, ≥ 9 GB for Q8_0
  • llama.cpp or Ollama installed (GGUF paths) — see Installation
  • Python 3.10+ with a CUDA 12.8 (cu128) PyTorch wheel only if you take the optional BF16 transformers path
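
A quick way to check the VRAM prerequisite is to read the total memory that nvidia-smi reports and compare it against the 6 GB / 9 GB minimums listed above. The sketch below is illustrative only: the helper names are hypothetical, and it assumes nvidia-smi is on the PATH and that the NVIDIA driver is installed.

```python
import subprocess

# Minimum VRAM per quant in GB, taken from the prerequisites list above.
MIN_VRAM_GB = {"Q4_K_M": 6, "Q8_0": 9}


def parse_total_vram_mib(smi_output: str) -> list[int]:
    """Parse one integer (MiB) per GPU from nvidia-smi's CSV output."""
    return [int(line.strip()) for line in smi_output.splitlines() if line.strip()]


def usable_quants(total_mib: int) -> list[str]:
    """Return the quants whose minimum VRAM this card meets."""
    total_gb = total_mib / 1024
    return [quant for quant, need in MIN_VRAM_GB.items() if total_gb >= need]


def query_gpus() -> list[int]:
    """Ask nvidia-smi for each GPU's total memory in MiB (requires the NVIDIA driver)."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.total", "--format=csv,noheader,nounits"],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_total_vram_mib(result.stdout)
```

Calling usable_quants(mib) for each value returned by query_gpus() shows which of the two pinned quants a given card can hold; the parsing helpers can be exercised without a GPU by passing a sample string.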

What You'll Build

A local Gemma 4 E4B instance on an RTX 5070 — Google's 4.5 B-effective-parameter (8 B with embeddings) instruction-tuned multimodal model that accepts text, images, and audio as input and produces text. On a 12 GB Blackwell card the recipe pins Q4_K_M GGUF as the recommended variant (clean fit with ample headroom) and walks through two runtimes: Ollama (one-liner) and llama.cpp (full control). An optional BF16 transformers path is documented only to be explicit that it does not fit a 12 GB card.

Hardware data: RTX 5070 (12 GB VRAM) · multimodal-capable at Q4_K_M (~5 GB weights, ~6 GB peak with KV cache) · See benchmark data

⚠️ BF16 will not fit on 12 GB — use Q4_K_M instead. The canonical google/gemma-4-E4B-it checkpoint is a single 15.99 GB model.safetensors (read via the HF tree API); Google's docs list the inference-memory footprint as 17.9 GB at BF16, 8.9 GB at SFP8, 4.5 GB at Q4_0 (Gemma core docs). On a 12 GB card the 17.9 GB BF16 figure overruns the whole card by ~6 GB — it cannot load at all. Even the 8.9 GB SFP8 figure leaves little room once the KV cache and activations are added. Q4_K_M GGUF (4.977 GB on disk per Unsloth's file table) is the variant pinned by this recipe; Q8_0 (8.193 GB) is a comfortable near-lossless step up that still fits 12 GB.

ℹ️ Multimodal input, text-only output. Gemma 4 E4B reads images, audio (≤ 30 s clips), and text — and replies in text. It does not generate images, speech, or video. For TTS on this card, see Kokoro; for image generation, see Z-Image or Flux.2 Klein.

Requirements

| Component | Minimum | Tested |
|---|---|---|
| GPU | 6 GB (Q4_K_M GGUF) / 9 GB (Q8_0 GGUF); BF16 needs ~18 GB and does not fit the 12 GB card | RTX 5070 (12 GB) |
| RAM | 16 GB system RAM | |
| Storage | ~5 GB for the Q4_K_M GGUF (+ ~1 GB mmproj for image/audio input) (file table) | 4.977 GB Q4_K_M, 8.193 GB Q8_0 |
| Software | llama.cpp or Ollama (GGUF paths) OR Python 3.10+ with transformers + cu128 wheel (optional BF16 path) | |

Google's AI for Developers docs list the GPU/TPU inference-memory footprint as 17.9 GB at BF16, 8.9 GB at SFP8, and 4.5 GB at Q4_0. The BF16 path therefore overruns the 12 GB envelope outright, while Q4-class quants fit with room to spare (≈ 6 GB peak at Q4_K_M on the 5070). The RTX 5070 is a Blackwell GB205 (sm_120) card with 6144 CUDA cores, 12 GB of GDDR7 on a 192-bit bus (~672 GB/s), and a 250 W TGP. With a display attached, usable VRAM on a 12 GB desktop card is ~10.5–11.3 GB, comfortably above the ~6 GB Q4_K_M peak, so this recipe is a clean fit.

Installation

This recipe defaults to Q4_K_M GGUF — it is the smallest quant that retains near-full multimodal instruction-following quality, and on the RTX 5070 it loads in ~5 GB and leaves ample headroom for long contexts. Two runtimes — Ollama (zero config) and llama.cpp (full control) — are both documented; pick whichever fits your workflow. Q8_0 is documented as a near-lossless step up that still fits 12 GB; BF16 does not fit and is covered only in Troubleshooting.

Path A — Ollama (one-liner, auto layer placement)

Ollama auto-detects the right number of GPU layers for your card; no manual -ngl flag is needed.

# Linux: install via the official script (on macOS, use the app from ollama.com or `brew install ollama`)
curl -fsSL https://ollama.com/install.sh | sh

# Pull and run the Unsloth Q4_K_M GGUF (≈ 5 GB download on first run)
ollama run hf.co/unsloth/gemma-4-E4B-it-GGUF:Q4_K_M

The Unsloth GGUF repo at unsloth/gemma-4-E4B-it-GGUF hosts the file; its model tree explicitly links upstream google/gemma-4-E4B-it (the canonical Google release). The community-maintained bartowski/google_gemma-4-E4B-it-GGUF is an equivalent Q4_K_M mirror if you prefer Bartowski's quants.

Path B — llama.cpp (explicit -ngl)

# macOS / Linux
brew install llama.cpp

# Or build from source — https://github.com/ggml-org/llama.cpp

Then launch the OpenAI-compatible server. On a 12 GB card, offloading all layers to GPU is safe at Q4_K_M and Q8_0; explicitly pin with -ngl 99 (llama.cpp clamps to the model's real layer count):

# OpenAI-compatible local server with web UI
llama-server -hf unsloth/gemma-4-E4B-it-GGUF:Q4_K_M -ngl 99

# Or one-shot CLI
llama-cli -hf unsloth/gemma-4-E4B-it-GGUF:Q4_K_M -ngl 99

The -hf flag streams the GGUF directly from HuggingFace on first run and caches it locally.

For image / audio input, llama.cpp also needs the multimodal projector (mmproj) file. The Unsloth repo ships mmproj-F16.gguf (0.990 GB on disk per the file table); pass it with --mmproj-url:

# Multimodal-capable server (text + image + audio input)
llama-server -hf unsloth/gemma-4-E4B-it-GGUF:Q4_K_M --mmproj-url https://huggingface.co/unsloth/gemma-4-E4B-it-GGUF/resolve/main/mmproj-F16.gguf -ngl 99

Variant file-size table (from Unsloth's GGUF repo)

The figures below are read directly from the unsloth/gemma-4-E4B-it-GGUF tree via the HF tree API. They are on-disk file sizes; runtime VRAM is on-disk plus the KV cache + activations (and the ~1 GB mmproj if you enable image/audio input).

| Quant | File size | Fits 12 GB? |
|---|---|---|
| Q4_K_M | 4.977 GB | ✅ Recommended: clean fit, lots of headroom |
| UD-Q4_K_XL | 5.126 GB | ✅ Comfortable (Unsloth dynamic 4-bit) |
| Q5_K_M | 5.482 GB | ✅ Comfortable |
| Q8_0 | 8.193 GB | ✅ Comfortable, near-lossless quality |
| UD-Q8_K_XL | 8.712 GB | ✅ Fits, but leaves ~3 GB free; watch long contexts |
| BF16 | 15.053 GB | ❌ Does not fit 12 GB; use a GGUF quant above |
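
The "on-disk plus KV cache plus activations (plus mmproj)" rule can be turned into a rough headroom estimate. The sketch below uses the file sizes from the table and the ~10.5 GB lower bound for usable VRAM quoted in Requirements. The 1.0 GB allowance for KV cache and activations is an assumption inferred from the recipe's "~5 GB weights, ~6 GB peak" figure, not a measurement, and it grows with context length.

```python
# On-disk sizes in GB, copied from the Unsloth file table above.
FILE_SIZE_GB = {
    "Q4_K_M": 4.977,
    "UD-Q4_K_XL": 5.126,
    "Q5_K_M": 5.482,
    "Q8_0": 8.193,
    "UD-Q8_K_XL": 8.712,
    "BF16": 15.053,
}
MMPROJ_GB = 0.990            # mmproj-F16.gguf, only loaded for image/audio input
KV_AND_ACTIVATIONS_GB = 1.0  # ASSUMPTION: rough allowance at default context
USABLE_VRAM_GB = 10.5        # conservative usable VRAM on a 12 GB card with a display


def headroom_gb(quant: str, multimodal: bool = False,
                usable_gb: float = USABLE_VRAM_GB) -> float:
    """Estimated free VRAM in GB after loading; negative means it will not fit."""
    need = FILE_SIZE_GB[quant] + KV_AND_ACTIVATIONS_GB
    if multimodal:
        need += MMPROJ_GB
    return round(usable_gb - need, 3)


for name in FILE_SIZE_GB:
    print(f"{name:12s} text-only: {headroom_gb(name):+.2f} GB   "
          f"with mmproj: {headroom_gb(name, multimodal=True):+.2f} GB")
```

Treat the output as a sanity check rather than a guarantee: long contexts and a busy desktop both eat into the headroom, which is why the larger 8-bit variants deserve more caution than Q4_K_M.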

Optional Path C — BF16 transformers (Python, does NOT fit 12 GB)

If you specifically need the canonical transformers Python API (e.g. for fine-tuning hooks or the documented multimodal AutoModelForMultimodalLM loader), note that BF16 weights are ~15–16 GB on disk and 17.9 GB of inference memory per Google's docs. This path does not fit a 12 GB card and is included only to document the dependency setup; run the GGUF paths above on the 5070. The RTX 5070 uses Blackwell (sm_120), and the default pip install torch index does not yet ship sm_120 kernels, so the cu128 wheel is required:

# CUDA 12.8 PyTorch wheel — required for Blackwell sm_120 (RTX 50-series)
pip install -U torch --index-url https://download.pytorch.org/whl/cu128
pip install -U transformers torchvision accelerate

To actually run the transformers BF16 path you need a ≥ 18 GB card (e.g. an RTX 4090 / 5090). The GGUF paths above are the supported route on the 5070.
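
To confirm that an installed PyTorch wheel actually carries sm_120 kernels, one option is to compare the GPU's compute capability with the architectures the wheel was compiled for. The sketch below is a simplified check with hypothetical helper names. It uses torch.cuda.get_device_capability and torch.cuda.get_arch_list, ignores PTX forward compatibility, and degrades gracefully when torch or a GPU is missing.

```python
def arch_supported(capability: tuple[int, int], arch_list: list[str]) -> bool:
    """True if the wheel's compiled arch list includes native kernels for this GPU."""
    major, minor = capability
    return f"sm_{major}{minor}" in arch_list


def check_local_gpu() -> str:
    """Describe whether the installed torch build matches the first visible GPU."""
    try:
        import torch  # imported lazily so the helper above works without torch
    except ImportError:
        return "torch is not installed"
    if not torch.cuda.is_available():
        return "no CUDA device visible to torch"
    capability = torch.cuda.get_device_capability(0)  # (12, 0) on Blackwell sm_120
    archs = torch.cuda.get_arch_list()                # e.g. ['sm_80', ..., 'sm_120']
    if arch_supported(capability, archs):
        return f"OK: wheel includes sm_{capability[0]}{capability[1]}"
    return f"wheel lacks sm_{capability[0]}{capability[1]}; reinstall from the cu128 index"
```

Running print(check_local_gpu()) after the pip install above gives a one-line verdict; a "wheel lacks" message suggests the default index wheel was picked up instead of cu128.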

Running

Path A / B — chat via Ollama or llama.cpp

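Once ollama run … or llama-server … is up, each exposes a local HTTP API: Ollama on localhost:11434 (the native /api/chat shown below, plus an OpenAI-compatible /v1/chat/completions) and llama-server on localhost:8080 (OpenAI-compatible). For interactive chat, just type at the prompt. For programmatic use (note that Ollama's /api/chat streams newline-delimited JSON unless the request sets "stream": false):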

# Ollama
curl http://localhost:11434/api/chat -d '{
  "model": "hf.co/unsloth/gemma-4-E4B-it-GGUF:Q4_K_M",
  "messages": [{"role": "user", "content": "Write a short joke about saving VRAM."}]
}'

# llama-server (OpenAI-compatible)
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
  "model": "gemma-4-e4b",
  "messages": [{"role": "user", "content": "Write a short joke about saving VRAM."}]
}'
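
The same two requests can be issued from Python with only the standard library. This is a minimal sketch, assuming the default ports used above. The function names are hypothetical, and "stream": False is set so that each server returns a single JSON object instead of a stream.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/chat"                   # Ollama native API
LLAMA_SERVER_URL = "http://localhost:8080/v1/chat/completions"   # OpenAI-compatible


def build_request(url: str, model: str, prompt: str) -> urllib.request.Request:
    """Build a non-streaming chat request; the body shape works for both endpoints."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
    }
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_reply(body: dict) -> str:
    """Pull the assistant text out of either response shape."""
    if "choices" in body:                       # OpenAI-style (llama-server)
        return body["choices"][0]["message"]["content"]
    return body["message"]["content"]           # Ollama /api/chat


def chat(url: str, model: str, prompt: str, timeout: float = 120.0) -> str:
    """Send the prompt and return the reply text (requires a running server)."""
    with urllib.request.urlopen(build_request(url, model, prompt), timeout=timeout) as resp:
        return extract_reply(json.load(resp))
```

For example, chat(OLLAMA_URL, "hf.co/unsloth/gemma-4-E4B-it-GGUF:Q4_K_M", "Write a short joke about saving VRAM.") mirrors the first curl call, and passing LLAMA_SERVER_URL with "gemma-4-e4b" mirrors the second.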

For image input, pass the base64-encoded image in the message's images field (images: [<base64>]) with Ollama, or attach it as an OpenAI-style image_url content block with llama-server (launched with the mmproj projector from Installation Path B). Full multimodal usage is documented on the model card. Recommended sampling: temperature=1.0, top_p=0.95, top_k=64.
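
The two image-attachment shapes and the recommended sampling values can be sketched as request bodies. The builder names are hypothetical. Placing the sampling values under Ollama's options key follows Ollama's API, while sending top_k as a top-level field to llama-server relies on that server accepting it as an extension to the OpenAI schema, which is an assumption worth verifying against your build.

```python
import base64

# Recommended sampling values from the recipe.
SAMPLING = {"temperature": 1.0, "top_p": 0.95, "top_k": 64}


def ollama_image_payload(model: str, prompt: str, image_bytes: bytes) -> dict:
    """Body for Ollama /api/chat: base64 image in the message's `images` list."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt, "images": [encoded]}],
        "options": dict(SAMPLING),
        "stream": False,
    }


def openai_image_payload(model: str, prompt: str, image_bytes: bytes,
                         mime: str = "image/png") -> dict:
    """Body for llama-server /v1/chat/completions: image as a data-URI image_url block."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return {
        "model": model,
        "messages": [{
            "role": "user",
            "content": [
                # Image first, then text, matching the order used in the HF snippet.
                {"type": "image_url",
                 "image_url": {"url": f"data:{mime};base64,{encoded}"}},
                {"type": "text", "text": prompt},
            ],
        }],
        **SAMPLING,  # ASSUMPTION: llama-server accepts top_k alongside OpenAI fields
    }
```

Either dictionary can be serialized with json.dumps and posted to the matching endpoint; reading the image with open(path, "rb").read() supplies image_bytes.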

Optional Path C — text + image inference (transformers, needs ≥ 18 GB)

The HuggingFace card's canonical multimodal image snippet, from google/gemma-4-E4B-it — this requires a larger card than the 5070 (see Path C above):

from transformers import AutoProcessor, AutoModelForMultimodalLM

MODEL_ID = "google/gemma-4-E4B-it"

# Load model
processor = AutoProcessor.from_pretrained(MODEL_ID)
model = AutoModelForMultimodalLM.from_pretrained(
    MODEL_ID,
    dtype="auto",
    device_map="auto",
)

# Prompt — add image before text
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://raw.githubusercontent.com/google-gemma/cookbook/refs/heads/main/apps/sample-data/GoldenGate.png"},
            {"type": "text", "text": "What is shown in this image?"},
        ],
    }
]

# Process input
inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
    add_generation_prompt=True,
).to(model.device)
input_len = inputs["input_ids"].shape[-1]

# Generate output
outputs = model.generate(**inputs, max_new_tokens=512)
response = processor.decode(outputs[0][input_len:], skip_special_tokens=False)

# Parse output
print(processor.parse_response(response))

For text-only chat, swap AutoModelForMultimodalLM for AutoModelForCausalLM and drop the image content block. On the 5070, run this through a GGUF quant via Ollama / llama.cpp instead of the BF16 transformers loader.

Results

  • Speed: No first-party RTX 5070 tok/s benchmark exists for Gemma 4 E4B yet; the backend benchmark endpoint currently returns no measurements for this pair (verdict unknown, zero benchmarks). Decode for a 4.5 B-effective model is memory-bandwidth-bound, and the RTX 5070 offers ~672 GB/s of GDDR7 bandwidth, but extrapolating from a different card (e.g. the higher-bandwidth 5070 Ti or 5080) would be a guess rather than a measurement, so no number is quoted here. Submit yours via /contribute.
  • VRAM usage: Google AI for Developers lists the inference-memory footprint (GPU/TPU memory for inference) as BF16 = 17.9 GB, SFP8 = 8.9 GB, Q4_0 = 4.5 GB. The Unsloth tree (read via the HF tree API) gives on-disk sizes of Q4_K_M = 4.977 GB, Q8_0 = 8.193 GB, BF16 = 15.053 GB, plus a ~0.990 GB mmproj-F16.gguf for image/audio input. At Q4_K_M the runtime peak is ~6 GB including the KV cache at default contexts, leaving roughly 5 GB of the 5070's usable VRAM (~10.5–11.3 GB of the 12 GB) free. The 17.9 GB BF16 inference figure is well past the card, which is why this recipe pins a GGUF quant.
  • Quality notes: E4B is the "daily-driver" tier of the Gemma 4 family: 4.5 B effective / 8 B with embeddings, a 128 K-token context window, and a ~150 M vision encoder plus a ~300 M audio encoder per the HF card. Native audio input is available on the E2B and E4B sizes. Q4_K_M is the smallest quant that retains near-full instruction-following quality; below it (Q3, Q2) the multimodal alignment degrades visibly. License is Apache 2.0 per the canonical model card (Gemma 4 license).

For the full benchmark data, see /check/gemma-4-e4b/rtx-5070.

Troubleshooting

BF16 does not fit 12 GB

This is expected. The canonical BF16 checkpoint is a single 15.99 GB model.safetensors (per the HF tree) and the GGUF BF16 file is 15.053 GB (Unsloth table); Google's docs put BF16 inference-memory at 17.9 GB (core docs). All three figures exceed the RTX 5070's 12 GB envelope outright — BF16 cannot load on this card. Use Q8_0 (8.193 GB, near-lossless) or Q4_K_M (4.977 GB) GGUF, which both leave comfortable room for long contexts on the 5070.

flash_attention_2 errors on the 5070 (Blackwell sm_120)

The canonical HF snippet uses dtype="auto" and does not request attn_implementation="flash_attention_2", so the default SDPA path runs cleanly on Blackwell. If you copy a third-party transformers snippet that hardcodes flash_attention_2, the call will fail at first forward pass — FA2 wheels do not yet ship sm_120 kernels for RTX 50-series, tracked at Dao-AILab/flash-attention#2168. Fix by removing the override or setting attn_implementation="sdpa". (This only matters for the optional transformers path, which itself needs a larger card — the GGUF runtimes are unaffected.)

Don't try the 26 B / 31 B Gemma 4 variants on a 12 GB card

A community user reports on llama.cpp Issue #21323 that unsloth/gemma-4-26B-A4B-it-GGUF:UD-Q4_K_XL (the 26 B MoE sibling, not E4B) runs out of memory at load with CPU offloading on an RTX 5080 (16 GB) — a larger card than the 5070. A separate report (Issue #21371) describes the 31 B variant producing gibberish on Blackwell under a CUDA build. Both reports are on larger Gemma 4 variants, not the E4B this recipe pins — they are cited here only as confirmation that the E4B tier is the right choice for a 12 GB card. Google's own inference-memory table puts the 26 B A4B at 14.4 GB even at Q4_0 and the 31 B at 17.5 GB at Q4_0 (core docs), both past 12 GB; if you need those, a larger card is the supported path.

Ollama function-calling / streaming glitches

Per the danilchenko.dev walkthrough (2026-04-07), Gemma 4's hybrid sliding-window/global attention triggers parser bugs in Ollama's tool-call and streaming layers. Workaround: run llama.cpp directly (Path B) for tool-use workflows until the parser is patched.

Thinking Mode (enable_thinking) on E4B

Gemma 4 E4B does support Thinking Mode — a built-in step-by-step reasoning mode is listed under the model's Core Capabilities on the canonical model card, and enable_thinking=True in the chat template activates it (the <|think|> control token at the start of the system prompt triggers it; remove it to disable). Set it when you want step-by-step reasoning before the answer; leave it off for terse, direct replies. If the toggle appears inert in a GGUF runtime, update to a recent llama.cpp / Ollama build — chat-template handling of the thinking flag has been iterated on across versions.
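
When messages are assembled by hand rather than through enable_thinking in the chat template, the toggle described above amounts to adding or removing the <|think|> token at the start of the system prompt. The helper below is a hypothetical sketch of that string manipulation only. It assumes string message contents, and whether a given runtime honours a token injected this way is an assumption to verify on your build.

```python
THINK_TOKEN = "<|think|>"


def set_thinking(messages: list[dict], enabled: bool) -> list[dict]:
    """Return a copy of `messages` with the think token added to or removed from
    the start of the system prompt. Assumes message contents are plain strings."""
    msgs = [dict(m) for m in messages]  # shallow copies so the input is not mutated
    if msgs and msgs[0].get("role") == "system":
        system = msgs[0]
    elif enabled:
        system = {"role": "system", "content": ""}
        msgs.insert(0, system)
    else:
        return msgs  # nothing to remove when there is no system prompt

    content = system["content"]
    if content.startswith(THINK_TOKEN):
        content = content[len(THINK_TOKEN):]  # normalise so the call is idempotent
    system["content"] = THINK_TOKEN + content if enabled else content
    return msgs
```

With the transformers path, passing enable_thinking=True to the chat template remains the documented route; this helper is only a convenience for raw message lists sent to an HTTP endpoint.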