How much VRAM does Gemma 4 E4B-IT need?

About 6 GB — the minimum this recipe targets.

How hard is this setup?

Beginner: follow the steps below.

Gemma 4 E4B on RX 7800 XT: Multimodal Inference via ROCm (Ollama or llama.cpp-HIP GGUF, with Q8_0)

What You'll Build

A local Gemma 4 E4B instance on a 16 GB Radeon RX 7800 XT (RDNA3, Navi 32, gfx1101) running through the ROCm stack. Gemma 4 E4B is Google's instruction-tuned multimodal model with 4.5 B effective parameters (8 B with embeddings); it accepts text, image, and audio as input and produces text. You serve it via Ollama for the one-command path, or via llama.cpp compiled with HIP for full control over the quant tier. With 16 GB of VRAM this model is not memory-bound at sensible quants: Q4_K_M GGUF (~5 GB) is the recommended starting point, and you have comfortable headroom up to Q8_0 (~8 GB). The BF16 GGUF (~15 GB) loads in VRAM but leaves little room for the KV cache. Image input is handled by a small separate vision projector (mmproj) that loads alongside the text GGUF.

Hardware data: RX 7800 XT (16 GB VRAM) · multimodal · GGUF via ROCm 7 · See benchmark data

⚠️ This is a ROCm recipe, not CUDA. The RX 7800 XT runs on AMD's ROCm/HIP stack — there is no cu124/cu128 wheel, no FlashAttention-2 prebuilt wheel, and no FP8/FP4 path here (RDNA3 has no FP8/FP4 hardware — an FP8 checkpoint would just upcast to BF16 with no memory saving). The quant path is GGUF (via llama.cpp-HIP) or BF16 — not ExLlamaV2, not Marlin. If a guide tells you to pip install flash-attn or pick a cu12x wheel for this card, it's written for the wrong vendor.

ℹ️ Multimodal input, text-only output. Gemma 4 E4B reads text, images, and audio and replies in text — it does not generate images, speech, or video. It lives in our multimodal vertical because it spans more than one input modality. For text-to-speech on this card see Kokoro or VoxCPM; for image generation see Z-Image or SDXL.

ℹ️ Day-0 AMD support. AMD rolled out official support for Gemma 4 across its full range of GPUs and CPUs on launch day, naming LM Studio, vLLM, SGLang, llama.cpp, Ollama, and Lemonade as supported surfaces (AMD day-0 Gemma 4 announcement; coverage at wccftech). The 7800 XT is one of the Radeon cards in that rollout, and the E4B model at Q4_K_M (~5 GB) leaves the 16 GB card with ample room.

Requirements

Component	Minimum	Tested
GPU	6 GB (Q4_K_M GGUF) / 9 GB (Q8_0 GGUF) — ROCm-supported AMD card	RX 7800 XT (16 GB)
RAM	16 GB system RAM	—
Storage	5 – 8 GB depending on quant (file-size table) + ~1 GB for the vision projector	4.98 GB for Q4_K_M, 8.19 GB for Q8_0
Driver	AMD ROCm v7 (installed via `amdgpu-install`) on Linux	—
Runtime	Ollama / llama.cpp (HIP build) / LM Studio	—

Gemma 4 E4B is licensed Apache-2.0 (model card; the Gemma 4 license is linked at ai.google.dev). The weights are not gated — the HF repo is publicly downloadable without an access request — so no hf auth login or license click-through is required to pull the GGUFs.

Per the official Google AI for Developers docs, the static-weight memory footprint for Gemma 4 E4B is 17.9 GB at BF16, 8.9 GB at SFP8, and 4.5 GB at Q4_0. Google notes these are approximate figures based on parameter count, quantization level, and a 20 % overhead for loading. The implication for a 16 GB card is concrete: Google's BF16 estimate (17.9 GB) exceeds the 7800 XT's VRAM, while the SFP8/8-bit and 4-bit footprints leave generous headroom. Stay at Q8_0 or below and the 16 GB card is never the constraint. The BF16 GGUF file itself is ~15 GB, so it will load, but it leaves almost nothing for the KV cache.
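The gap between Google's 17.9 GB BF16 estimate and the ~15 GB BF16 GGUF file is easier to see with the loading overhead backed out. The sketch below assumes the published footprint is simply weights × 1.2, which is one reading of Google's "20 % overhead" note rather than a documented formula; strip_overhead_gb is a hypothetical helper name.

```shell
#!/bin/sh
# Back out the ~20 % loading overhead from a published footprint.
# ASSUMPTION: footprint = weights * 1.2 (our reading of Google's note,
# not a formula Google documents).
strip_overhead_gb() {
    awk -v gb="$1" 'BEGIN { printf "%.2f\n", gb / 1.2 }'
}

# Published footprints for BF16, SFP8 and Q4_0, in GB.
for footprint in 17.9 8.9 4.5; do
    printf '%s GB published -> ~%s GB of weights\n' \
        "$footprint" "$(strip_overhead_gb "$footprint")"
done
```

Under that assumption the BF16 weights come to roughly 14.9 GB, which is close to the 15.05 GB BF16 GGUF listed in the file-size table: the file fits in 16 GB, the overhead on top of it does not.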

Installation

Prerequisite — install the AMD ROCm v7 driver

The RX 7800 XT (gfx1101) is an officially ROCm-supported GPU, but ROCm is not bundled with Ollama or the llama.cpp release binaries — you install it once at the OS level. Per the Ollama AMD GPU docs, Ollama requires the AMD ROCm v7 driver on Linux, installed or upgraded via the amdgpu-install utility. On Ubuntu 24.04 (Noble), install ROCm 7.2.1 via the standard amdgpu-install flow (AMD's Radeon ROCm install docs cover the current packages):

# 1. Add the amdgpu-install package and install ROCm
wget https://repo.radeon.com/amdgpu-install/7.2.1/ubuntu/noble/amdgpu-install_7.2.1.70201-1_all.deb
sudo apt install ./amdgpu-install_7.2.1.70201-1_all.deb
sudo apt update
sudo amdgpu-install -y --usecase=graphics,rocm

# 2. Add yourself to the render/video groups (log out/in afterward)
sudo usermod -a -G render,video $LOGNAME

The RX 7800 XT is on the ROCm-supported Radeon list — the ROCm install-on-Linux system-requirements matrix names the RX 7800 XT / 7700 XT / 7700 as gfx1101 (officially supported) — so no HSA_OVERRIDE_GFX_VERSION masquerade is needed for this card (that override is only for cards ROCm doesn't ship kernels for).
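After logging back in, it is worth confirming that the group change from step 2 took effect before moving on. The following is a minimal sketch; has_groups is a hypothetical helper, and it only checks group names, not whether the driver itself is healthy.

```shell
#!/bin/sh
# has_groups "<space-separated group list>" group1 [group2 ...]
# Returns 0 only if every named group appears in the list.
has_groups() {
    list=" $1 "
    shift
    for g in "$@"; do
        case "$list" in
            *" $g "*) ;;                                  # present, keep going
            *) echo "missing group: $g" >&2; return 1 ;;  # absent, stop here
        esac
    done
    return 0
}

# id -nG prints the current user's group names on one line.
if has_groups "$(id -nG)" render video; then
    echo "group membership OK"
else
    echo "re-run the usermod step, then log out and back in"
fi
```

Passing the group list in as an argument, rather than calling id inside the function, keeps the check easy to try against a pasted group list from another machine.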

This recipe defaults to Q4_K_M GGUF — it is the smallest quant that retains near-full multimodal instruction-following quality, and on the 7800 XT it loads in ~5 GB and leaves enormous headroom. Q8_0 is documented as a quality-step-up path that also fits comfortably on 16 GB.

Option A — Ollama (recommended)

Ollama detects the ROCm runtime installed in the prerequisite step and places GPU layers automatically. On the 16 GB 7800 XT the whole model fits at Q4_K_M and Q8_0, so there is no need to set the layer count by hand (llama.cpp's -ngl flag; Ollama's counterpart is the num_gpu option).

# Linux — install
curl -fsSL https://ollama.com/install.sh | sh

# Pull and run the Unsloth Q4_K_M GGUF (≈ 5 GB download on first run)
ollama run hf.co/unsloth/gemma-4-E4B-it-GGUF:Q4_K_M

The Unsloth GGUF repo at unsloth/gemma-4-E4B-it-GGUF hosts the file; its model tree explicitly links upstream google/gemma-4-E4B-it (the canonical Google release). The community-maintained bartowski/google_gemma-4-E4B-it-GGUF is an equivalent mirror if you prefer Bartowski's quants.

Option B — llama.cpp built with HIP/ROCm

For full control over the quant tier (Q8_0 for higher fidelity) and for explicit image-projector loading, build llama.cpp against HIP and target the gfx1101 architecture directly.

1. Build llama.cpp with the HIP backend

Per the llama.cpp build docs, the Linux HIP build for an RDNA3 card like the RX 7800 XT is:

git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
HIPCXX="$(hipconfig -l)/clang" HIP_PATH="$(hipconfig -R)" \
    cmake -S . -B build -DGGML_HIP=ON -DGPU_TARGETS=gfx1101 -DCMAKE_BUILD_TYPE=Release \
    && cmake --build build --config Release -- -j 16

-DGGML_HIP=ON selects the ROCm backend; -DGPU_TARGETS=gfx1101 pins the kernels to the 7800 XT's architecture (the build docs use gfx1100 for the 7900 XTX example — for the Navi 32 RX 7800 XT the correct target is gfx1101).
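If you are unsure which target string your card reports, you can read it from rocminfo instead of typing it from memory. This is a sketch that assumes rocminfo prints the architecture as a gfx-prefixed name (for example in its agent Name fields); list_gfx_targets is a hypothetical helper.

```shell
#!/bin/sh
# Read rocminfo output on stdin and print each distinct gfx target once.
# ASSUMPTION: the architecture appears as "gfx" followed by hex digits.
list_gfx_targets() {
    grep -o 'gfx[0-9a-f][0-9a-f]*' | sort -u
}

# Only query the tool if ROCm is actually installed.
if command -v rocminfo >/dev/null 2>&1; then
    rocminfo | list_gfx_targets
fi
```

On a machine that also has an integrated GPU you may see more than one target; the value to pass to -DGPU_TARGETS for the RX 7800 XT is gfx1101.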

2. Run the model — text, and text + image

On a 16 GB card, offloading all layers to GPU is safe at Q4_K_M and Q8_0; explicitly pin with -ngl 99 (llama.cpp clamps to the model's real layer count). For image input, also load the vision projector, either with --mmproj for a local file or with --mmproj-url for a remote one, as in the second command:

# OpenAI-compatible local server with web UI (text only)
./build/bin/llama-server -hf unsloth/gemma-4-E4B-it-GGUF:Q4_K_M -ngl 99

# Text + image: add the F16 vision projector (≈ 1 GB, shared across quant tiers)
./build/bin/llama-server \
  -hf unsloth/gemma-4-E4B-it-GGUF:Q4_K_M \
  --mmproj-url https://huggingface.co/unsloth/gemma-4-E4B-it-GGUF/resolve/main/mmproj-F16.gguf \
  -ngl 99

The -hf flag streams the GGUF directly from Hugging Face on first run and caches it locally. To switch tiers, replace the tag — :Q8_0 for the near-lossless quant (fits comfortably on 16 GB), :UD-Q4_K_XL for Unsloth's dynamic 4-bit variant. The mmproj-F16.gguf projector is the same file at every text-quant tier. The BF16 GGUF (:BF16, ~15 GB) loads on 16 GB but leaves almost no room for the KV cache — prefer Q8_0 for high-fidelity work on this card.
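Since only the tag and the optional projector change between tiers, the command line can be composed from two arguments. The wrapper below is a sketch: gemma_server_cmd is a hypothetical name, and it only uses the flags already shown in this section (-hf, -ngl, --mmproj-url).

```shell
#!/bin/sh
REPO="unsloth/gemma-4-E4B-it-GGUF"
MMPROJ_URL="https://huggingface.co/$REPO/resolve/main/mmproj-F16.gguf"

# gemma_server_cmd [QUANT] [vision]
# Prints the llama-server command line for the given quant tag
# (default Q4_K_M); pass "vision" to add the F16 projector.
gemma_server_cmd() {
    quant="${1:-Q4_K_M}"
    cmd="./build/bin/llama-server -hf $REPO:$quant -ngl 99"
    if [ "${2:-}" = "vision" ]; then
        cmd="$cmd --mmproj-url $MMPROJ_URL"
    fi
    printf '%s\n' "$cmd"
}

# Example: the Q8_0 tier with image input enabled.
gemma_server_cmd Q8_0 vision
```

The function prints the command rather than running it, so you can inspect the line first and then paste it or pass it to sh -c.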

3. Variant file-size table (from Unsloth's GGUF repo)

The figures below are read directly from unsloth/gemma-4-E4B-it-GGUF (cross-checked against the Hugging Face tree API byte counts). They are the on-disk text-model file sizes; runtime VRAM is the on-disk size plus the KV cache, plus ~1 GB for the vision projector if you use image input.

Quant	File size	Fits 16 GB?
Q3_K_M	4.06 GB	✅ Trivial
Q4_K_M	4.98 GB	✅ Trivial — pinned by this recipe
UD-Q4_K_XL	5.13 GB	✅ Trivial — Unsloth dynamic 4-bit
Q5_K_M	5.48 GB	✅ Trivial
Q8_0	8.19 GB	✅ Comfortable — near-lossless quality, ~8 GB headroom
BF16	15.05 GB	⚠️ Tight — file fits but ~1 GB left for KV cache
mmproj-F16 (vision)	0.99 GB	Loaded once for image input, shared across tiers

On the 16 GB 7800 XT, every tier up to Q8_0 fits with room to spare — the binding constraint at those tiers is quality, not VRAM. Q4_K_M is the recommended starting point; step up to Q8_0 for quality-sensitive work, which still leaves ~8 GB free for the KV cache and the vision projector. The BF16 GGUF is the one tier where 16 GB becomes the limit — it loads but leaves almost nothing for context, so it is not recommended on this card.
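The headroom figures can be reproduced with simple subtraction. The sketch below treats the card's 16 GB and the file sizes as the same unit and ignores runtime buffers, so read the results as rough upper bounds on what is left for the KV cache; headroom_gb is a hypothetical helper.

```shell
#!/bin/sh
# headroom_gb VRAM_GB FILE_GB [MMPROJ_GB]
# Prints VRAM left after the text weights (and optional projector) load.
# ASSUMPTION: on-disk size approximates resident size; no allowance is
# made for compute buffers or the KV cache itself.
headroom_gb() {
    awk -v vram="$1" -v file="$2" -v proj="${3:-0}" \
        'BEGIN { printf "%.2f\n", vram - file - proj }'
}

# File sizes from the table above; 0.99 GB is the mmproj-F16 projector.
for entry in Q4_K_M:4.98 Q8_0:8.19 BF16:15.05; do
    name=${entry%%:*}
    size=${entry#*:}
    printf '%-7s text only: %6s GB   with vision: %6s GB\n' \
        "$name" "$(headroom_gb 16 "$size")" "$(headroom_gb 16 "$size" 0.99)"
done
```

The BF16 row goes slightly negative once the projector is added, which is consistent with the advice not to use BF16 on this card.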

Option C — LM Studio (GUI)

LM Studio ships a ROCm runtime backend and offers a one-click path — AMD's day-0 Gemma 4 rollout names LM Studio as a primary Radeon surface. Search "gemma-4-E4B-it GGUF" inside the app and pick the Q4_K_M (or Q8_0) tier, or use the direct-import link lmstudio://open_from_hf?model=unsloth/gemma-4-E4B-it-GGUF. On the 16 GB 7800 XT, stay at Q8_0 or below for comfortable headroom.

Running

Options A / B: chat via Ollama or llama.cpp

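Once ollama run … or llama-server … is up, each listens on a local HTTP port: Ollama on localhost:11434 (its native /api/chat endpoint is used below; an OpenAI-compatible /v1/chat/completions endpoint is also available) and llama-server on localhost:8080 (OpenAI-compatible). For an interactive chat, just type at the prompt; for programmatic use: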

# Ollama
curl http://localhost:11434/api/chat -d '{
  "model": "hf.co/unsloth/gemma-4-E4B-it-GGUF:Q4_K_M",
  "messages": [{"role": "user", "content": "Write a short joke about saving VRAM."}]
}'

# llama-server (OpenAI-compatible)
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
  "model": "gemma-4-e4b",
  "messages": [{"role": "user", "content": "Write a short joke about saving VRAM."}]
}'

For image input through llama.cpp, load the projector with --mmproj or --mmproj-url (Option B), then attach the image as an OpenAI-style image_url content block; the server replies with a text description. The model card recommends sampling with temperature=1.0, top_p=0.95, top_k=64. To confirm that the GPU, not the CPU, is doing the work, watch rocm-smi in another terminal while a request runs.
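An image request differs from the text request only in its content array. The helper below is a sketch that inlines a local file as a base64 data URL; image_payload is a hypothetical name, the model string is the one used in the text example, and the prompt is assumed to contain no double quotes or backslashes because no JSON escaping is done.

```shell
#!/bin/sh
# image_payload IMAGE_PATH PROMPT [MIME]
# Prints an OpenAI-style chat request with the image inlined as a data URL.
# ASSUMPTION: PROMPT needs no JSON escaping (no quotes or backslashes).
image_payload() {
    img_b64=$(base64 < "$1" | tr -d '\n')   # strip line wraps from base64
    mime="${3:-image/png}"
    cat <<EOF
{
  "model": "gemma-4-e4b",
  "messages": [{
    "role": "user",
    "content": [
      {"type": "text", "text": "$2"},
      {"type": "image_url", "image_url": {"url": "data:$mime;base64,$img_b64"}}
    ]
  }]
}
EOF
}

# Usage, with llama-server running and the projector loaded:
#   image_payload photo.png "Describe this image." > /tmp/req.json
#   curl http://localhost:8080/v1/chat/completions \
#        -H "Content-Type: application/json" -d @/tmp/req.json
```

Writing the payload to a file and sending it with -d @file avoids shell argument-length limits, which a base64-encoded photo can easily exceed.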

Results

Speed: No Gemma 4 E4B token-generation benchmark measured on an RX 7800 XT was found at the time of writing, and the backend has no benchmark for this pair yet (/check/gemma-4-e4b/rx-7800-xt returns verdict: unknown). A measured tok/s figure is therefore omitted rather than estimated. The 7900 XTX figures are not transferable either, since the 7800 XT has lower memory bandwidth (624 GB/s vs the XTX's 960 GB/s) and fewer compute units. If you've measured Gemma 4 E4B tok/s on a 7800 XT, please contribute it so it lands on /check/gemma-4-e4b/rx-7800-xt. As a general ROCm caveat, on RDNA3 the ROCm/HIP llama.cpp backend often trails the Vulkan backend at token generation; see Troubleshooting.
VRAM usage: Cited inference footprint per Google AI for Developers — BF16 = 17.9 GB, SFP8 = 8.9 GB, Q4_0 = 4.5 GB (weights + ~20 % loading overhead, excludes the KV cache). The Unsloth GGUF Q4_K_M file is 4.98 GB and Q8_0 is 8.19 GB, with the F16 vision projector at 0.99 GB (file table). On the 7800 XT's 16 GB, Q4_K_M leaves ~11 GB free after the text weights load (room for large contexts and the vision projector) and Q8_0 leaves ~8 GB. The one footprint that does not fit is the full BF16 inference path at 17.9 GB — stay at Q8_0 or below and memory is never the constraint on this card.
Quality notes: E4B is the "daily-driver" tier of the Gemma 4 family — 4.5 B effective / 8 B with embeddings via the MatFormer (Matryoshka-Transformer) architecture, multimodal across text, image, and audio input with text output (model card). Q4_K_M is the smallest quant that retains near-full instruction-following quality; below it (Q3, Q2) the multimodal alignment degrades visibly. With 16 GB there is no reason to drop below Q4_K_M — step up to Q8_0 (8.19 GB) for quality-sensitive work, which fits comfortably.

For the full benchmark data, see /check/gemma-4-e4b/rx-7800-xt.

Troubleshooting

Ollama runs on the CPU instead of the GPU

Confirm the ROCm v7 driver is installed (rocm-smi should list the 7800 XT) and that your user is in the render and video groups (groups should show both — log out and back in after the usermod step). Per the Ollama AMD GPU docs, ROCm is a separate install from Ollama; if it's missing, Ollama silently falls back to CPU. The RX 7800 XT (gfx1101) is natively supported, so you should not need HSA_OVERRIDE_GFX_VERSION — only unsupported cards need that masquerade.
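A quick way to see where Ollama placed a loaded model is the PROCESSOR column of ollama ps. The check below is a sketch that assumes the column reads "100% GPU" for a fully offloaded model; fully_on_gpu is a hypothetical helper.

```shell
#!/bin/sh
# Read `ollama ps` output on stdin; succeed only if a loaded model is
# reported as fully resident on the GPU.
# ASSUMPTION: the PROCESSOR column shows "100% GPU" in that case, and a
# CPU share (for example "100% CPU") otherwise.
fully_on_gpu() {
    grep -q '100% GPU'
}

# Run while a model is loaded, e.g. right after a chat request.
if command -v ollama >/dev/null 2>&1; then
    if ollama ps | fully_on_gpu; then
        echo "model is fully offloaded to the GPU"
    else
        echo "model is not fully on the GPU (or nothing is loaded)"
    fi
fi
```

If the second message appears while a model is loaded, work through the driver and group checks described above before changing anything in Ollama.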

Token generation feels slower than expected — try the Vulkan backend

On RDNA3 the ROCm/HIP backend can be slower at token generation than the Vulkan backend in llama.cpp. Per llama.cpp issue #20934, on RDNA3 (gfx1100) Vulkan (RADV) reached ~167–177 tok/s on Llama 7B Q4_0 while ROCm landed at ~129–144 tok/s across ROCm 6.4.4–7.x. The 7800 XT has lower memory bandwidth than that 7900 XTX comparison so absolute numbers will differ, but the ROCm-vs-Vulkan gap is an architecture-level pattern: if your generation rate disappoints under ROCm, build llama.cpp with -DGGML_VULKAN=ON instead of -DGGML_HIP=ON and re-benchmark with llama-bench — Vulkan often wins for pure generation on RDNA3. (This is an LLM-served-via-llama.cpp path, so there is no FlashAttention build step either way — ignore any pip install flash-attn instruction.)
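To compare the two backends on your own card, it helps to pull the generation rate out of each llama-bench run and express the gap as a percentage. The helpers below are a sketch: tg_rate and gap_pct are hypothetical names, the parser assumes llama-bench's default markdown table with t/s as the last column, and build-vulkan/ is an assumed name for a second build directory.

```shell
#!/bin/sh
# tg_rate: read llama-bench output on stdin and print the t/s value of
# the first token-generation row (test name "tg<N>").
# ASSUMPTION: default markdown table, last column "value ± stddev".
tg_rate() {
    awk -F'|' '/[|] *tg[0-9]+ *[|]/ {
        split($(NF-1), parts, " ")   # "129.00 ± 0.10" -> parts[1] = 129.00
        print parts[1]
        exit
    }'
}

# gap_pct A B: how much faster A is than B, in percent.
gap_pct() {
    awk -v a="$1" -v b="$2" 'BEGIN { printf "%.1f\n", (a / b - 1) * 100 }'
}

# Usage (MODEL is the path to a downloaded GGUF file):
#   vk=$(./build-vulkan/bin/llama-bench -m "$MODEL" -ngl 99 | tg_rate)
#   hip=$(./build/bin/llama-bench -m "$MODEL" -ngl 99 | tg_rate)
#   echo "Vulkan vs ROCm at generation: $(gap_pct "$vk" "$hip") %"
```

For scale, the low ends of the ranges cited above (167 vs 129 tok/s on the 7900 XTX) work out to a gap of about 29 %; your 7800 XT numbers may differ.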

Image input does nothing / model only sees text

The vision capability lives in a separate mmproj-F16.gguf projector that must be loaded alongside the text GGUF. With llama.cpp, pass --mmproj (or --mmproj-url) pointing at the mmproj-F16.gguf file (Option B); without it, llama-server runs text-only. With Ollama, the projector is normally resolved automatically from the HF repo; if image input is ignored, switch to the explicit llama.cpp route in Option B.

A third-party snippet tells you to use FlashAttention-2, FP8, or a `cu12x` wheel

That snippet is written for an NVIDIA card. RDNA3 has no FP8/FP4 hardware (an FP8 checkpoint upcasts to BF16 with no memory saving), there is no prebuilt flash-attn wheel for gfx1101, and there is no cu124/cu128 build for ROCm. For Gemma 4 E4B on the 7800 XT, stay on the GGUF paths above (Ollama or llama.cpp-HIP, at Q4_K_M, or Q8_0 if you want higher fidelity); none of them need a CUDA wheel or a FlashAttention build. If you specifically want the transformers Python API on ROCm, install PyTorch from the ROCm index (pip install torch --index-url https://download.pytorch.org/whl/rocm6.3; the tag moves between ROCm releases, so read the live one at pytorch.org/get-started/locally) and let attention fall back to PyTorch SDPA rather than FlashAttention.