
Gemma 4 E4B on RX 7900 XTX: Multimodal Inference via ROCm (Ollama or llama.cpp-HIP GGUF, with BF16)

multimodal · beginner · 6 GB+ VRAM · Jun 17, 2026

This beginner recipe sets up Gemma 4 E4B-IT on the RX 7900 XTX; the model needs about 6 GB of VRAM.

prerequisites
  • AMD Radeon RX 7900 XTX (24 GB VRAM, RDNA3 / Navi 31 / gfx1100) or equivalent ROCm-supported card
  • Linux (Ubuntu 24.04 / 22.04 or RHEL) with the AMD ROCm v7 driver installed via `amdgpu-install` — ROCm is NOT bundled with Ollama
  • Ollama, llama.cpp (HIP build), or LM Studio installed — see Installation
  • ~5 GB free disk for the Q4_K_M GGUF (or ~15 GB for the BF16 GGUF), plus ~1 GB for the vision projector

What You'll Build

A local Gemma 4 E4B instance on a 24 GB Radeon RX 7900 XTX (RDNA3, Navi 31, gfx1100), running through the ROCm stack. Gemma 4 E4B is Google's instruction-tuned multimodal model with 4.5 B effective parameters (8 B with embeddings); it accepts text, image, and audio as input and produces text. You serve it via Ollama for the one-command path, or via llama.cpp compiled with HIP for full control over the quant tier. With 24 GB of VRAM, capacity is never the limiting factor for this model: the Q4_K_M GGUF (~5 GB) is the recommended starting point, and there is ample headroom all the way up to the full BF16 weights (~15 GB). Image input is handled by a small, separate vision projector (mmproj) that loads alongside the text GGUF.

Hardware data: RX 7900 XTX (24 GB VRAM) · multimodal · GGUF or BF16 via ROCm 7 · See benchmark data

⚠️ This is a ROCm recipe, not CUDA. The RX 7900 XTX runs on AMD's ROCm/HIP stack — there is no cu124/cu128 wheel, no FlashAttention-2 prebuilt wheel, and no FP8/FP4 path here (RDNA3 has no FP8/FP4 hardware — an FP8 checkpoint would just upcast to BF16 with no memory saving). The quant path is GGUF (via llama.cpp-HIP) or BF16 — not ExLlamaV2, not Marlin. If a guide tells you to pip install flash-attn or pick a cu12x wheel for this card, it's written for the wrong vendor.

ℹ️ Multimodal input, text-only output. Gemma 4 E4B reads text, images, and audio and replies in text — it does not generate images, speech, or video. It lives in our multimodal vertical because it spans more than one input modality. For text-to-speech on this card see Kokoro or VoxCPM; for image generation see Z-Image or SDXL.

ℹ️ Day-0 AMD support. AMD shipped Gemma 4 (including the E2B and E4B variants) across Radeon graphics cards on launch day, with documented paths through LM Studio, llama.cpp, Ollama, vLLM, SGLang, and Lemonade Server (AMD day-0 Gemma 4 announcement; coverage at wccftech). At 16 GB+ of VRAM the E4B model at common quants like Q4_K_M "fit comfortably on mid-range Radeon cards" — and the 7900 XTX's 24 GB clears that bar with room to spare.

Requirements

| Component | Minimum | Tested |
| --- | --- | --- |
| GPU | 6 GB (Q4_K_M GGUF) / 9 GB (Q8_0 GGUF) / 15 GB (BF16), ROCm-supported AMD card | RX 7900 XTX (24 GB) |
| RAM | 16 GB system RAM | |
| Storage | 5 – 15 GB depending on quant (file-size table) + ~1 GB for the vision projector | 4.98 GB for Q4_K_M, 8.19 GB for Q8_0, 15.05 GB for BF16 |
| Driver | AMD ROCm v7 (installed via amdgpu-install) on Linux | |
| Runtime | Ollama / llama.cpp (HIP build) / LM Studio | |

Gemma 4 E4B is licensed Apache-2.0 (model card; the Gemma 4 license terms are also linked at ai.google.dev). The weights are not gated — the HF repo is publicly downloadable without an access request — so no hf auth login or license click-through is required to pull the GGUFs.

Per the official Google AI for Developers docs, the static-weight memory footprint for Gemma 4 E4B is 17.9 GB at BF16, 8.9 GB at SFP8, and 4.5 GB at Q4_0. These numbers cover the weights only, not the KV cache or runtime overhead, so plan on at least 25 % headroom for non-trivial contexts. On the 24 GB 7900 XTX even the full 17.9 GB BF16 weight footprint leaves ~6 GB free for the KV cache, and the practical GGUF paths below (the BF16 GGUF file is only ~15 GB) are lighter still.
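The headroom arithmetic above can be sketched in a few lines of shell. This is a minimal illustration, not a sizing tool: headroom_gb and budget_gb are hypothetical helper names, and the only inputs are the weight figures and the 25 % margin quoted in the paragraph above.

```shell
# Hypothetical helper: VRAM (GB) left once weights of a given size are loaded.
# $1 = total VRAM in GB, $2 = weight footprint in GB
headroom_gb() {
  awk -v total="$1" -v weights="$2" 'BEGIN { printf "%.1f\n", total - weights }'
}

# Hypothetical helper: weight footprint plus the 25 % planning margin.
# $1 = weight footprint in GB
budget_gb() {
  awk -v weights="$1" 'BEGIN { printf "%.1f\n", weights * 1.25 }'
}

headroom_gb 24 17.9   # BF16 weights on the 24 GB card
budget_gb 4.5         # Q4_0 weights plus margin
budget_gb 17.9        # BF16 weights plus margin
```

Any budget that stays below 24 fits on this card with the suggested margin; the real KV-cache size depends on context length and is not modelled here.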

Installation

Prerequisite — install the AMD ROCm v7 driver

The RX 7900 XTX (gfx1100) is an officially ROCm-supported GPU, but ROCm is not bundled with Ollama or the llama.cpp release binaries — you install it once at the OS level. Per the Ollama AMD GPU docs, Ollama requires the AMD ROCm v7 driver on Linux, installed or upgraded via the amdgpu-install utility. On Ubuntu 24.04 (Noble), install ROCm 7.2.1 via the standard amdgpu-install flow (AMD's Radeon ROCm install docs cover the current packages):

# 1. Add the amdgpu-install package and install ROCm
wget https://repo.radeon.com/amdgpu-install/7.2.1/ubuntu/noble/amdgpu-install_7.2.1.70201-1_all.deb
sudo apt install ./amdgpu-install_7.2.1.70201-1_all.deb
sudo apt update
sudo amdgpu-install -y --usecase=graphics,rocm

# 2. Add yourself to the render/video groups (log out/in afterward)
sudo usermod -a -G render,video $LOGNAME

The RX 7900 XTX is on Ollama's supported AMD Radeon RX list, and gfx1100 is in its supported LLVM-target list — so no HSA_OVERRIDE_GFX_VERSION masquerade is needed for this card (that override is only for cards ROCm doesn't ship kernels for).
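A quick post-install check can be sketched as follows. It assumes rocminfo (shipped with ROCm) is on PATH after the install; in_gpu_groups is a hypothetical helper name introduced here, not part of ROCm.

```shell
# Succeeds only if both "render" and "video" appear in a space-separated group list.
in_gpu_groups() {
  case " $1 " in *" render "*) ;; *) return 1 ;; esac
  case " $1 " in *" video "*) ;; *) return 1 ;; esac
}

# Check the current user's groups (takes effect only after logging out and back in).
if in_gpu_groups "$(id -nG)"; then
  echo "groups: ok"
else
  echo "groups: render and/or video missing (log out and back in after usermod)"
fi

# gfx1100 should appear among the agents that rocminfo reports.
if command -v rocminfo >/dev/null 2>&1; then
  if rocminfo | grep -q gfx1100; then
    echo "rocminfo: gfx1100 detected"
  else
    echo "rocminfo: gfx1100 not reported"
  fi
else
  echo "rocminfo not found: ROCm is not installed or not on PATH"
fi
```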

This recipe defaults to Q4_K_M GGUF — it is the smallest quant that retains near-full multimodal instruction-following quality, and on the 7900 XTX it loads in ~5 GB and leaves enormous headroom. Q8_0 and BF16 are documented as quality-step-up paths that also fit trivially on 24 GB.

Option A — Ollama (recommended)

Ollama detects the ROCm runtime installed in the prerequisite step and auto-places GPU layers; on the 24 GB 7900 XTX the whole model fits at every tier, so no manual GPU-layer setting is needed (Ollama's num_gpu option, the counterpart of llama.cpp's -ngl flag).

# Linux — install
curl -fsSL https://ollama.com/install.sh | sh

# Pull and run the Unsloth Q4_K_M GGUF (≈ 5 GB download on first run)
ollama run hf.co/unsloth/gemma-4-E4B-it-GGUF:Q4_K_M

The Unsloth GGUF repo at unsloth/gemma-4-E4B-it-GGUF hosts the file; its model tree explicitly links upstream google/gemma-4-E4B-it (the canonical Google release). The community-maintained bartowski/google_gemma-4-E4B-it-GGUF is an alternative source if you prefer Bartowski's quants.

Option B — llama.cpp built with HIP/ROCm

For full control over the quant tier (Q8_0 for higher fidelity, BF16 for full precision) and for explicit image-projector loading, build llama.cpp against HIP and target the gfx1100 architecture directly.

1. Build llama.cpp with the HIP backend

Per the llama.cpp build docs, the Linux HIP build for an RDNA3 card like the RX 7900 XTX is:

git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
HIPCXX="$(hipconfig -l)/clang" HIP_PATH="$(hipconfig -R)" \
    cmake -S . -B build -DGGML_HIP=ON -DGPU_TARGETS=gfx1100 -DCMAKE_BUILD_TYPE=Release \
    && cmake --build build --config Release -- -j 16

-DGGML_HIP=ON selects the ROCm backend; -DGPU_TARGETS=gfx1100 pins the kernels to the 7900 XTX's architecture (the build docs use gfx1100 as the explicit example for the "Radeon RX 7900XTX").
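Before moving on, it can help to confirm that the build produced a working binary that sees the GPU. The sketch below assumes the default build/bin output directory from the command above; check_bin is a hypothetical helper, and the exact device label printed by --list-devices depends on your llama.cpp version.

```shell
# Hypothetical helper: report whether a freshly built binary exists and is executable.
check_bin() {
  if [ -x "$1" ]; then
    echo "ok: $1"
  else
    echo "missing: $1 (did the cmake build finish?)"
    return 1
  fi
}

if check_bin ./build/bin/llama-server; then
  # Print build information, then the compute devices the backend can see.
  ./build/bin/llama-server --version
  ./build/bin/llama-server --list-devices
fi
```

If the 7900 XTX does not appear in the device list, the binary was most likely built without the HIP backend or ROCm is not visible to your user.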

2. Run the model — text, and text + image

On a 24 GB card, offloading all layers to GPU is safe at every quant tier here; explicitly pin with -ngl 99 (llama.cpp clamps to the model's real layer count). For image input, also load the vision projector, either from a local file with --mmproj or from a URL with --mmproj-url as shown here:

# OpenAI-compatible local server with web UI (text only)
./build/bin/llama-server -hf unsloth/gemma-4-E4B-it-GGUF:Q4_K_M -ngl 99

# Text + image: add the F16 vision projector (≈ 1 GB, shared across quant tiers)
./build/bin/llama-server \
  -hf unsloth/gemma-4-E4B-it-GGUF:Q4_K_M \
  --mmproj-url https://huggingface.co/unsloth/gemma-4-E4B-it-GGUF/resolve/main/mmproj-F16.gguf \
  -ngl 99

The -hf flag streams the GGUF directly from Hugging Face on first run and caches it locally. To switch tiers, replace the tag — :Q8_0 for the near-lossless quant, :BF16 for the full-precision GGUF (fits comfortably on 24 GB), :UD-Q4_K_XL for Unsloth's dynamic 4-bit variant. The mmproj-F16.gguf projector is the same file at every text-quant tier.
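To make the tier switch concrete, here is a small sketch that prints the llama-server command for a given tag, with or without the projector. server_cmd is a hypothetical helper written for this recipe; it only prints the command and reuses the repo name, flags, and projector URL from the commands above.

```shell
REPO=unsloth/gemma-4-E4B-it-GGUF

# Hypothetical helper: print the llama-server command for a quant tag.
# $1 = tag (Q4_K_M, Q8_0, BF16, UD-Q4_K_XL), $2 = "vision" to add the projector
server_cmd() {
  cmd="./build/bin/llama-server -hf $REPO:$1 -ngl 99"
  if [ "${2:-}" = "vision" ]; then
    # The projector file is the same at every text-quant tier.
    cmd="$cmd --mmproj-url https://huggingface.co/$REPO/resolve/main/mmproj-F16.gguf"
  fi
  printf '%s\n' "$cmd"
}

server_cmd Q8_0
server_cmd BF16 vision
```

Only the tag after the colon changes between tiers; the projector argument stays identical.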

3. Variant file-size table (from Unsloth's GGUF repo)

The figures below are read directly from unsloth/gemma-4-E4B-it-GGUF (cross-checked against Hugging Face's file-size headers). They are on-disk file sizes for the text model; runtime VRAM is roughly the file size plus the KV cache, plus ~1 GB for the vision projector if you use image input.

| Quant | File size | Fits 24 GB? |
| --- | --- | --- |
| Q3_K_M | 4.06 GB | ✅ Trivial |
| Q4_K_M | 4.98 GB | ✅ Trivial (pinned by this recipe) |
| UD-Q4_K_XL | 5.13 GB | ✅ Trivial (Unsloth dynamic 4-bit) |
| Q5_K_M | 5.48 GB | ✅ Trivial |
| Q8_0 | 8.19 GB | ✅ Comfortable (near-lossless quality) |
| BF16 | 15.05 GB | ✅ Comfortable (~9 GB headroom for KV cache) |
| mmproj-F16 (vision) | 0.99 GB | Loaded once for image input, shared across tiers |

On a 24 GB 7900 XTX every text tier above fits with room to spare — the binding constraint is quality, not VRAM. Q4_K_M is the recommended starting point; step up to Q8_0 or the BF16 GGUF for quality-sensitive work, since even the full BF16 weights leave ~9 GB free.
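As a rough illustration of the table, the loop below subtracts each text-model file size and the 0.99 GB projector from 24 GB. It is a sketch under a stated simplification: it treats file size as VRAM use and ignores the KV cache and runtime overhead. remaining_gb is a hypothetical helper name.

```shell
# Hypothetical helper: VRAM (GB) left after the text GGUF and projector are loaded.
# $1 = card VRAM, $2 = text GGUF size, $3 = projector size (0 for text-only); all in GB
remaining_gb() {
  awk -v v="$1" -v w="$2" -v p="$3" 'BEGIN { printf "%.2f\n", v - w - p }'
}

# name:size pairs copied from the file-size table above
for tier in Q3_K_M:4.06 Q4_K_M:4.98 UD-Q4_K_XL:5.13 Q5_K_M:5.48 Q8_0:8.19 BF16:15.05; do
  name=${tier%%:*}
  size=${tier##*:}
  printf '%-11s %6s GB left with vision loaded\n' "$name" "$(remaining_gb 24 "$size" 0.99)"
done
```

Whatever remains is what the KV cache and runtime buffers can draw on, which is why context length rather than model size is the practical variable on this card.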

Option C — LM Studio (GUI)

LM Studio ships a ROCm runtime backend and offers a one-click path — AMD's day-0 Gemma 4 rollout names LM Studio as a primary Radeon surface. Search "gemma-4-E4B-it GGUF" inside the app and pick the Q4_K_M (or a higher) tier, or use the direct-import link lmstudio://open_from_hf?model=unsloth/gemma-4-E4B-it-GGUF. On the 24 GB 7900 XTX you have room for any tier through BF16.

Running

Options A and B: chat via Ollama or llama.cpp

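Once ollama run … or llama-server … is up, each exposes an HTTP API: Ollama on localhost:11434 (its native /api/chat endpoint, plus an OpenAI-compatible /v1 endpoint) and llama-server on localhost:8080 (OpenAI-compatible). For interactive chat, type at the ollama run prompt or open llama-server's web UI in a browser; for programmatic use: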

# Ollama (native /api/chat endpoint; "stream": false returns one JSON object instead of a stream)
curl http://localhost:11434/api/chat -d '{
  "model": "hf.co/unsloth/gemma-4-E4B-it-GGUF:Q4_K_M",
  "stream": false,
  "messages": [{"role": "user", "content": "Write a short joke about saving VRAM."}]
}'

# llama-server (OpenAI-compatible)
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
  "model": "gemma-4-e4b",
  "messages": [{"role": "user", "content": "Write a short joke about saving VRAM."}]
}'

For image input through llama.cpp, attach the image with the OpenAI image_url content block once the projector is loaded with --mmproj or --mmproj-url (Option B); the server then returns a text response. Recommended sampling per the model card: temperature=1.0, top_p=0.95, top_k=64. Watch GPU activity in another terminal with rocm-smi to confirm that the card, not the CPU, is doing the work.
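An image request of the kind described above might look like the sketch below. It makes several assumptions: llama-server is listening on localhost:8080 with the projector loaded, the image is a PNG, the prompt contains no quotes or backslashes (no JSON escaping is done), and the server accepts top_k alongside the standard OpenAI fields. image_payload and describe_image are hypothetical helper names.

```shell
# Prints an OpenAI-style chat payload with one text part and one inline image.
# $1 = path to a PNG file, $2 = prompt text (assumed free of quotes and backslashes)
image_payload() {
  b64=$(base64 < "$1" | tr -d '\n')   # strip line wraps so the data URI is one line
  printf '{"model":"gemma-4-e4b","temperature":1.0,"top_p":0.95,"top_k":64,'
  printf '"messages":[{"role":"user","content":['
  printf '{"type":"text","text":"%s"},' "$2"
  printf '{"type":"image_url","image_url":{"url":"data:image/png;base64,%s"}}]}]}\n' "$b64"
}

# Sends the payload to a llama-server assumed to be listening on localhost:8080.
describe_image() {
  image_payload "$1" "$2" |
    curl -s http://localhost:8080/v1/chat/completions \
      -H "Content-Type: application/json" -d @-
}
```

With the server running, a call such as describe_image photo.png "Describe this image." should return a JSON response whose message content is the model's text reply; photo.png is a placeholder path.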

Results

  • Speed: No measured Gemma 4 E4B token-generation benchmark for the RX 7900 XTX was found at the time of writing, so a tok/s figure is omitted rather than estimated. One third-party page (ownrig.com) lists "89 tok/s at Q8_0" but explicitly labels it an estimate "from model architecture, quantization size, and device bandwidth," not a measurement, and the LocalScore RX 7900 XTX results page has no Gemma 4 E4B entry. Per project policy, neither an estimate nor a number transferred from another card is cited as a result. If you've measured Gemma 4 E4B tok/s on a 7900 XTX, please contribute it so it lands on /check/gemma-4-e4b/rx-7900-xtx. As a general ROCm caveat, on RDNA3 the ROCm/HIP llama.cpp backend often trails the Vulkan backend at token generation; see Troubleshooting.
  • VRAM usage: Google AI for Developers lists weight footprints of 17.9 GB (BF16), 8.9 GB (SFP8), and 4.5 GB (Q4_0); these cover the weights only and exclude the KV cache and runtime overhead. The Unsloth GGUF files are 4.98 GB (Q4_K_M), 8.19 GB (Q8_0), and 15.05 GB (BF16), and the F16 vision projector is 0.99 GB (file table). On the 7900 XTX's 24 GB, Q4_K_M leaves ~19 GB free after the text weights load (room for very long contexts and the vision projector); Q8_0 leaves ~16 GB; BF16 leaves ~9 GB. VRAM capacity is never the constraint on this card.
  • Quality notes: E4B is the "daily-driver" tier of the Gemma 4 family — 4.5 B effective / 8 B with embeddings via the MatFormer (Matryoshka-Transformer) architecture, multimodal across text, image, and audio input with text output (model card). Q4_K_M is the smallest quant that retains near-full instruction-following quality; below it (Q3, Q2) the multimodal alignment degrades visibly. Because the 7900 XTX has so much headroom, there is no reason to drop below Q4_K_M — step up to Q8_0 (8.19 GB) or the BF16 GGUF (15.05 GB) for quality-sensitive work.

For the full benchmark data, see /check/gemma-4-e4b/rx-7900-xtx.

Troubleshooting

Ollama runs on the CPU instead of the GPU

Confirm the ROCm v7 driver is installed (rocm-smi should list the 7900 XTX) and that your user is in the render and video groups (groups should show both — log out and back in after the usermod step). Per the Ollama AMD GPU docs, ROCm is a separate install from Ollama; if it's missing, Ollama silently falls back to CPU. The RX 7900 XTX (gfx1100) is natively supported, so you should not need HSA_OVERRIDE_GFX_VERSION — only unsupported cards need that masquerade.
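One way to see where Ollama actually placed a loaded model is the PROCESSOR column of ollama ps. The sketch below assumes that column reports "100% GPU" for a fully offloaded model, which is how current Ollama releases print it; fully_on_gpu is a hypothetical helper name.

```shell
# Reads `ollama ps` output on stdin; succeeds if any loaded model reports "100% GPU".
fully_on_gpu() {
  grep -q '100% GPU'
}

if command -v ollama >/dev/null 2>&1; then
  # A model must be loaded (for example via `ollama run ...`) to appear in the list.
  if ollama ps 2>/dev/null | fully_on_gpu; then
    echo "a loaded model is fully offloaded to the GPU"
  else
    echo "no loaded model reports 100% GPU (nothing loaded, or CPU fallback)"
  fi
else
  echo "ollama not found on PATH"
fi
```

A CPU or split CPU/GPU entry in that column on this card points back to the driver and group checks described above.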

Token generation feels slower than expected — try the Vulkan backend

On RDNA3 the ROCm/HIP backend can be slower at token generation than the Vulkan backend in llama.cpp. Per llama.cpp issue #20934, on the RX 7900 XTX (gfx1100) Vulkan (RADV) reached ~167–177 tok/s on Llama 7B Q4_0 while ROCm landed at ~129–144 tok/s across ROCm 6.4.4–7.x. If your generation rate disappoints under ROCm, build llama.cpp with -DGGML_VULKAN=ON instead of -DGGML_HIP=ON and re-benchmark with llama-bench — Vulkan often wins for pure generation on this card. (This is an LLM-served-via-llama.cpp path, so there is no FlashAttention build step either way — ignore any pip install flash-attn instruction.)
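The comparison suggested above can be sketched as two small helpers. The assumptions: a second build directory named build-vulkan, a locally downloaded GGUF at a path you supply, and llama-bench's -p/-n prompt and generation sizes chosen arbitrarily. bench_backend and pct_faster are hypothetical names, and the final line only does arithmetic on the issue's quoted figures rather than measuring anything.

```shell
# Build a separate Vulkan binary next to the HIP one (run inside the llama.cpp checkout):
#   cmake -S . -B build-vulkan -DGGML_VULKAN=ON -DCMAKE_BUILD_TYPE=Release
#   cmake --build build-vulkan --config Release

# Runs llama-bench from a given build directory against a local GGUF file.
# $1 = build dir (build or build-vulkan), $2 = path to a GGUF file (placeholder)
bench_backend() {
  "$1/bin/llama-bench" -m "$2" -ngl 99 -p 512 -n 128
}

# Percentage by which rate $1 exceeds rate $2 (both in tok/s).
pct_faster() {
  awk -v a="$1" -v b="$2" 'BEGIN { printf "%.1f\n", (a / b - 1) * 100 }'
}

pct_faster 167 129   # low ends of the Vulkan and ROCm ranges quoted above
```

Run bench_backend once per build directory with the same model file, then feed the two generation rates to pct_faster to see the gap on your own system.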

Image input does nothing / model only sees text

The vision capability lives in a separate mmproj-F16.gguf projector that must be loaded alongside the text GGUF. With llama.cpp, pass --mmproj (local file) or --mmproj-url (remote file) pointing at mmproj-F16.gguf, as in Option B; without it, llama-server runs text-only. With Ollama, the projector is normally resolved automatically from the HF repo; if image input is ignored, switch to the explicit llama.cpp route in Option B.

A third-party snippet tells you to use FlashAttention-2, FP8, or a cu12x wheel

That snippet is written for an NVIDIA card. RDNA3 has no FP8/FP4 hardware (an FP8 checkpoint upcasts to BF16 with no memory saving), there is no prebuilt flash-attn wheel for gfx1100, and there is no cu124/cu128 build for ROCm. For Gemma 4 E4B on the 7900 XTX, stay on the GGUF-via-llama.cpp-HIP or Ollama paths above (or BF16 if you want full precision) — none of them need a CUDA wheel or a FlashAttention build. If you specifically want the transformers Python API on ROCm, install PyTorch from the ROCm index (pip install torch --index-url https://download.pytorch.org/whl/rocm6.3 — read the live tag at pytorch.org/get-started/locally, it moves between ROCm releases) and let attention fall back to PyTorch SDPA rather than FlashAttention.

common questions
How much VRAM does Gemma 4 E4B-IT need?

About 6 GB — the minimum this recipe targets.

Which GPUs is Gemma 4 E4B-IT tested on?

RX 7900 XTX (24 GB).

How hard is this setup?

Beginner — follow the steps above.