What You'll Build
A fully local, private general assistant on a 16GB Radeon RX 7800 XT (RDNA3, Navi 32, gfx1101): Gemma 4 12B, Google DeepMind's open-weight multimodal generalist (Instruct, released 2026), served as an OpenAI-compatible endpoint by llama.cpp (built against AMD's HIP/ROCm backend) or Ollama, then used from a chat UI (Open WebUI is a good local front-end) or directly via the API. This is a general assistant: Q&A, drafting and editing, multi-step reasoning, and, optionally, understanding images and audio you feed it. It is a reasoning-strong 12B that runs on modest hardware: on a 16GB RX 7800 XT it fits comfortably at a high-quality quant (Q8_0), and the smaller quants reach all the way down to 8GB cards. Everything runs on your own hardware, so prompts, documents, images and audio never leave the machine.
Hardware data: RX 7800 XT (16GB VRAM) · Gemma 4 12B, GGUF Q8_0 (12.67GB, recommended) — or Q6_K (9.79GB) / Q5_K_M (8.41GB) / Google's own QAT Q4_0 (6.98GB) for more KV-cache / context headroom · ROCm 7 · See benchmark data
⚠️ This is a ROCm recipe, not CUDA. The RX 7800 XT runs on AMD's ROCm/HIP stack, so there is no cu124/cu128 wheel and no FlashAttention prebuilt-wheel step here. For this model the reliable path is GGUF via llama.cpp-HIP (or Ollama, which bundles llama.cpp). Do not follow a guide that tells you to pip install flash-attn, pick a cu12x wheel, or use ExLlamaV2/Marlin for this card: those steps are written for NVIDIA's CUDA stack.
ℹ️ This is a dense ~12B multimodal generalist, not an MoE. Gemma 4 12B is a Gemma4UnifiedForConditionalGeneration (model_type: gemma4_unified): ~11.95B dense parameters, 48 layers, hidden size 3840, GQA with 16 query / 8 KV heads, head_dim 256. Because it is dense, its footprint is simply the quant file you load plus the KV cache; there is no "active-parameters" shortcut. It uses a unified, encoder-free design: images (raw patches) and audio (waveforms) are projected directly into the decoder rather than through a separate vision/audio encoder. It is positioned and used as a general assistant, so we file it under llm.
ℹ️ Multimodal input is optional and needs a separate projector. Gemma 4 accepts text, image, and audio in, text out. The LLM GGUF you load for chat is text-only on its own; to feed it images or audio you also pass a separate mmproj projector GGUF with --mmproj (and use llama-mtmd-cli / the multimodal server path). The mmproj-* file is not the LLM and is excluded from the weight/VRAM math below. If you only need text chat, you don't need it at all.
ℹ️ Very long 256K context, made affordable by sliding-window attention. Gemma 4 advertises a 256K context window (max_position_embeddings 262,144). It uses hybrid attention: interleaved local sliding-window (window 1024) layers plus periodic full global attention (the final layer is always global). Sliding-window attention keeps the KV cache far smaller than a full-attention model at the same length, so long context is comparatively cheap here. Even so, the full 256K won't fit on a 16GB card, so bound the context with -c. On the RX 7800 XT's 16GB you still have room for a long context at Q8_0.
ℹ️ Runs on current llama.cpp out of the box. Gemma 4 support landed at the model's launch (~April 2026) and ggml-org ships official GGUFs, so there is no special patch or PR gate. Just use a recent llama.cpp (built with the HIP backend, below) or Ollama. Pass --jinja so the embedded chat template applies (it's a complex template that includes a reasoning/thought channel).
Requirements
| Component | Minimum | Tested target |
|---|---|---|
| GPU | 8GB VRAM (QAT Q4_0 / Q4_K_M floor — the matrix reaches down this far) | RX 7800 XT (16GB, RDNA3 Navi 32, gfx1101) |
| RAM | 16GB system RAM | 32GB comfortable |
| Storage | ~7GB (QAT Q4_0) up to ~13GB (Q8_0); +~1GB for the optional mmproj | ~13GB for Q8_0 |
| Driver | AMD ROCm v7 (installed via amdgpu-install) on Linux | — |
| Software | llama.cpp (HIP build) or Ollama; optional Open WebUI chat client | llama-server, Open WebUI |
Model weights (first-party GGUF available). Unlike many open models, Gemma 4 ships official GGUFs. There are three good sources:
- Google's own QAT Q4_0: google/gemma-4-12b-it-qat-q4_0-gguf is a quantization-aware-trained Q4_0 (6.98GB). Because the model was fine-tuned for this quantization, it delivers noticeably better quality-per-byte than a naive Q4_0; this is the low-VRAM hero (fits an 8GB card). (The mmproj-* file in that repo is the vision/audio projector, not the LLM.)
- ggml-org first-party GGUF: ggml-org/gemma-4-12B-it-GGUF ships Q4_K_M (7.38GB, marginally larger than unsloth's 7.12GB in the table), Q8_0 (12.67GB) and bf16 (23.83GB), plus the mmproj.
- Community K_M ladder: unsloth/gemma-4-12b-it-GGUF provides the conventional ladder used in the fit table below.
Byte-verified on-disk sizes (unsloth K_M ladder, plus Google's QAT):
| Quant | On-disk size | Fit on RX 7800 XT (16GB) |
|---|---|---|
| QAT Q4_0 (Google) | 6.98GB | Quality-per-byte low-VRAM option — quantization-aware-trained; also fits an 8GB card |
| Q4_K_M | 7.12GB | Tiny footprint — huge KV-cache / context headroom; small enough for an 8GB card |
| Q5_K_M | 8.41GB | Small footprint with a quality bump over Q4 — lots of room for a large KV cache |
| Q6_K | 9.79GB | Comfortable — near-lossless-feeling with lots of room for a large KV cache |
| Q8_0 | 12.67GB | Recommended: near-lossless weights with ~3GB left for the KV cache, which Gemma's sliding-window attention stretches into a long context. The practical best-quality choice on 16GB |
| bf16 | 23.83GB | Full precision — does not fit a 16GB card (needs 32GB+); not an option here |
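The fit column comes down to subtraction: the card's VRAM minus the quant file is what remains for the KV cache and compute buffers. A minimal sketch of that arithmetic, using the sizes from the table above. The helper name headroom_gb is illustrative (it is not part of llama.cpp or Ollama), and it assumes the card's 16GB and the file sizes are in the same unit and that nothing else is using VRAM.

```shell
# headroom_gb VRAM_GB QUANT_GB -> GB left once the weights are loaded.
# Illustrative helper; ignores VRAM already taken by the desktop and by
# the runtime's own compute buffers, so treat the result as an upper bound.
headroom_gb() {
  awk -v vram="$1" -v quant="$2" 'BEGIN { printf "%.2f\n", vram - quant }'
}

# Quant sizes from the table above, against the RX 7800 XT's 16GB.
for entry in QAT_Q4_0:6.98 Q4_K_M:7.12 Q5_K_M:8.41 Q6_K:9.79 Q8_0:12.67 bf16:23.83; do
  name=${entry%%:*}   # text before the first colon
  size=${entry#*:}    # text after the first colon
  printf '%-9s %6s GB free\n' "$name" "$(headroom_gb 16 "$size")"
done
```

A negative result (as for bf16) means the weights alone exceed the card, which matches the "does not fit" row.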
Not model weights — don't count these in the VRAM math:
- The mmproj-* file is the multimodal (image/audio) projector, loaded separately with --mmproj only if you want image/audio input. It is not part of the text-chat weights.
- Any *-MTP* / mtp-* file is a multi-token-prediction / speculative-decode draft head, not the model weights either.
Licensing. Gemma 4 is Apache-2.0 — free for commercial and non-commercial use, no revenue caps (model card). This is a notable change: earlier Gemma generations (1–3) shipped under the custom "Gemma Terms of Use", and Gemma 4 moved to standard Apache-2.0. Google layers a separate Prohibited Use Policy on top (disallowed use cases apply regardless of the license), but the weights themselves are Apache-2.0.
Installation
Prerequisite — install the AMD ROCm v7 driver
The RX 7800 XT (gfx1101) is an officially ROCm-supported GPU — it is listed with LLVM target gfx1101 in AMD's ROCm Linux system-requirements matrix — but ROCm is not bundled with Ollama or the llama.cpp release binaries; you install it once at the OS level. Per the Ollama AMD GPU docs, Ollama requires the AMD ROCm v7 driver on Linux, installed or upgraded with the amdgpu-install utility. On Ubuntu 24.04 (Noble), install ROCm 7.2.1 via the standard amdgpu-install flow (AMD's Radeon ROCm install docs cover the current packages):
# 1. Add the amdgpu-install package and install ROCm
wget https://repo.radeon.com/amdgpu-install/7.2.1/ubuntu/noble/amdgpu-install_7.2.1.70201-1_all.deb
sudo apt install ./amdgpu-install_7.2.1.70201-1_all.deb
sudo apt update
sudo amdgpu-install -y --usecase=graphics,rocm
# 2. Add yourself to the render/video groups (log out/in afterward)
sudo usermod -a -G render,video $LOGNAME
After logging back in, confirm the driver sees the card — rocm-smi should list the RX 7800 XT (this is the ROCm equivalent of nvidia-smi; there is no nvidia-smi on an AMD box). Because the RX 7800 XT is on the supported-GPU matrix as gfx1101, you should not normally need an HSA_OVERRIDE_GFX_VERSION masquerade. If a tool ships only gfx1100 kernels and refuses to start on your card, the documented Linux fallback is to export HSA_OVERRIDE_GFX_VERSION=11.0.0 so the gfx1101 card presents as gfx1100 — treat that as a fallback, not a default.
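The two post-install checks (the card's gfx target and the render/video group membership) can be scripted. The sketch below is one possible way to do it, not an official AMD procedure: the function names are illustrative, and it assumes rocminfo (a separate utility shipped with ROCm, not mentioned elsewhere in this recipe) prints the LLVM target name in its output.

```shell
# gfx_targets: read rocminfo-style text on stdin and print each distinct
# gfx target it mentions (one per line).
gfx_targets() {
  grep -oE 'gfx[0-9a-f]+' | sort -u
}

# has_gpu_groups "GROUP LIST": succeed only if the space-separated list
# contains both the render and the video group.
has_gpu_groups() {
  case " $1 " in *" render "*) ;; *) return 1 ;; esac
  case " $1 " in *" video "*) ;; *) return 1 ;; esac
  return 0
}

# On the target machine (needs ROCm installed, so not executed here):
#   rocminfo | gfx_targets        # an RX 7800 XT should report gfx1101
#   has_gpu_groups "$(id -nG)" || echo "add yourself to render,video and log in again"
```

If gfx_targets prints more than one line, the machine has more than one AMD GPU (for example an integrated one); make sure gfx1101 is among them.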
You have two GGUF runtimes; pick one. Both are fine for this model — there is no patch requirement — so choose Ollama for the fastest start, or llama.cpp for the most control over context, KV-cache quantization, and multimodal input. Either way, use a recent build so Gemma 4's chat template is fully supported.
Option A — llama.cpp built with HIP/ROCm (recommended for full control)
1. Build llama.cpp with the HIP backend. Per the llama.cpp build docs, the Linux HIP build pattern is the same as for any RDNA3 card — only the GPU_TARGETS value changes. For the RX 7800 XT, pin it to gfx1101 (the card's LLVM target per the AMD ROCm system-requirements matrix):
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
# RX 7800 XT is RDNA3 Navi 32 = gfx1101
HIPCXX="$(hipconfig -l)/clang" HIP_PATH="$(hipconfig -R)" \
cmake -S . -B build -DGGML_HIP=ON -DGPU_TARGETS=gfx1101 -DCMAKE_BUILD_TYPE=Release \
&& cmake --build build --config Release -- -j 16
-DGGML_HIP=ON selects the ROCm backend; -DGPU_TARGETS=gfx1101 pins the kernels to the RX 7800 XT's architecture (Navi 32). Do not copy a gfx1100 value from a 7900-series guide — that is the wrong target for this card. A recent release is what matters here: Gemma 4 has been mainline in llama.cpp since its launch, so there is no PR branch to check out — just build a current tree. The GGUF quants are integer formats, so RDNA3's lack of FP8 tensor hardware is irrelevant here.
2. That's it for install — llama.cpp pulls the GGUF straight from Hugging Face at launch (next section). No separate download step.
Option B — Ollama
Ollama is built on llama.cpp and is the fastest way to stand this model up. Install it with the Linux one-liner; Ollama detects the ROCm runtime you installed above and runs the gfx1101 card without any manual architecture flag:
curl -fsSL https://ollama.com/install.sh | sh
Per the Ollama AMD preview blog, all of Ollama's features can be accelerated by AMD graphics cards on Linux. Then either use the curated tag (ollama run gemma4:12b, if listed) or pull the community GGUF straight from Hugging Face (HF × Ollama docs):
ollama run hf.co/unsloth/gemma-4-12b-it-GGUF:Q8_0
Swap the :Q8_0 tag for :Q6_K, :Q5_K_M or :Q4_K_M if you want a smaller footprint. Ollama serves an OpenAI-compatible API at http://localhost:11434/v1 for chat clients and uses the ROCm runtime on the gfx1101 card automatically.
Running
With llama.cpp
Serve an OpenAI-compatible API on port 8000. The -hf flag pulls the GGUF from Hugging Face; append :Q8_0 (case-insensitive) to pick the quant (llama-server docs):
# Q8_0 (recommended), offload all layers to the 7800 XT
./build/bin/llama-server -hf ggml-org/gemma-4-12B-it-GGUF:Q8_0 \
--port 8000 \
-ngl 99 \
-c 16384 \
--jinja
- -ngl 99 (--n-gpu-layers) offloads every layer to the GPU: the dense 12B quant file (12.67GB at Q8_0) sits entirely in the 16GB VRAM, leaving room for the KV cache.
- -c 16384 sets a 16K context. At Q8_0 you have ~3GB free after the weights, and Gemma's sliding-window attention keeps the KV cache modest, so you can raise this.
- --jinja applies the GGUF's built-in chat template so the assistant format parses correctly. Gemma 4's template is complex (it includes a reasoning/thought channel), so this flag matters.
Push toward a longer context. Gemma 4 advertises a 256K context (max_position_embeddings 262,144), and its interleaved sliding-window attention (window 1024) + periodic global attention makes long context far cheaper in KV cache than a full-attention model of the same size. At Q8_0 on 16GB you have ~3GB free — enough for a healthy context, and you can stretch it further by quantizing the KV cache: add -fa on (Flash Attention, required for a quantized cache) and -ctk q8_0 -ctv q8_0, which roughly halves KV-cache VRAM versus f16 with minimal quality impact:
# Longer context by 8-bit-quantizing the KV cache
./build/bin/llama-server -hf ggml-org/gemma-4-12B-it-GGUF:Q8_0 \
--port 8000 -ngl 99 -c 65536 --jinja \
-fa on -ctk q8_0 -ctv q8_0
A note on Flash Attention on RDNA3. Flash Attention on the ROCm/HIP backend is less mature than on CUDA. If -fa on misbehaves or a quantized KV cache errors on your ROCm version, fall back to an un-quantized cache with a smaller -c, or drop to Q6_K (9.79GB) / Q5_K_M (8.41GB), which leave more room for a plain f16 cache on 16GB without needing -fa at all. Even with a quantized cache, the full 256K won't fit a 16GB card, so bound -c and treat the quantized-cache path as the way to reach a long (not maximal) context.
Because the full 256K is out of reach on 16GB even with sliding-window attention, keep -c bounded. At ~12B parameters the model still leaves comfortable headroom for a long context on this card, and it also fits far smaller GPUs (QAT Q4_0 at 6.98GB or Q4_K_M at 7.12GB run on an 8GB card), so the matrix reaches well below this tier.
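To see why sliding-window attention matters for the context budget, it helps to estimate the KV cache by hand. The sketch below is a rough estimator, not llama.cpp's actual allocator: it uses the architecture figures quoted earlier (48 layers, 8 KV heads, head_dim 256, window 1024), while the number of global-attention layers is an assumption you must supply, because this recipe does not state the local/global split. The function name kv_cache_mib is illustrative.

```shell
# kv_cache_mib CTX GLOBAL_LAYERS [BYTES_PER_ELEMENT]
# Rough KV-cache size in MiB for a hybrid sliding-window model.
# GLOBAL_LAYERS (how many of the 48 layers use full attention) is NOT given
# in this recipe: pass your own value, or 48 for a full-attention upper bound.
kv_cache_mib() {
  local ctx="$1" global_layers="$2" bytes="${3:-2}"   # 2 bytes/element = f16
  local layers=48 kv_heads=8 head_dim=256 window=1024
  local local_layers=$(( layers - global_layers ))
  # Sliding-window layers keep at most the last 1024 tokens.
  local local_tokens="$ctx"
  if [ "$ctx" -gt "$window" ]; then local_tokens="$window"; fi
  # K and V each hold kv_heads * head_dim elements per token per layer.
  local per_token_layer=$(( 2 * kv_heads * head_dim * bytes ))
  local total=$(( per_token_layer * (global_layers * ctx + local_layers * local_tokens) ))
  echo $(( total / 1024 / 1024 ))
}

kv_cache_mib 16384 48   # every layer global: full-attention upper bound
kv_cache_mib 16384 8    # hypothetical split: 8 global + 40 sliding-window layers
```

At 16K context the two calls print 6144 and 1344 (MiB): under the hypothetical 8-global split, the cache is less than a quarter of the full-attention figure. Passing 1 as the third argument approximates an 8-bit cache; a real q8_0 cache is slightly larger because of per-block scales, and llama.cpp adds padding and compute buffers on top, so read these numbers as estimates only.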
Optional: image and audio input. To use Gemma 4's multimodal side, add the projector with --mmproj (download the mmproj-* file from the same GGUF repo) and serve via the multimodal path — for the CLI, llama-mtmd-cli is the multimodal front-end:
# Multimodal: LLM weights + the separate projector (mmproj)
./build/bin/llama-mtmd-cli -hf ggml-org/gemma-4-12B-it-GGUF:Q8_0 \
--mmproj <path-to-mmproj-gguf> \
-ngl 99 --jinja
The mmproj is a small extra file (~1GB) on top of the quant sizes above — only load it if you actually want to pass images or audio; text chat doesn't need it.
With Ollama
Pull and run the community GGUF directly from Hugging Face; append a :quant tag to choose the quant (HF × Ollama docs):
ollama run hf.co/unsloth/gemma-4-12b-it-GGUF:Q8_0
Ollama serves an OpenAI-compatible API at http://localhost:11434/v1 for chat clients and uses the ROCm runtime on the gfx1101 card automatically.
Use it as a chat assistant
Point any OpenAI-compatible chat client at your local endpoint by setting its base URL and a dummy API key — no cloud, no per-token cost.
Open WebUI (optional local chat front-end). A self-hosted, ChatGPT-style UI that talks to any OpenAI-compatible server. Run it and point it at your local endpoint:
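# Point Open WebUI at your local llama-server (or Ollama on :11434).
# On Linux, --add-host makes host.docker.internal resolve to the host.
# llama-server binds to 127.0.0.1 by default; start it with --host 0.0.0.0
# (or your Docker bridge address) so the container can reach port 8000.
docker run -d -p 3000:8080 \
  --add-host=host.docker.internal:host-gateway \
  -e OPENAI_API_BASE_URL=http://host.docker.internal:8000/v1 \
  -e OPENAI_API_KEY=EMPTY \
  ghcr.io/open-webui/open-webui:main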
Then open http://localhost:3000 and chat. (Open WebUI also autodetects a local Ollama install, so with the Ollama path you can skip the base-URL wiring entirely.)
Directly via the API. Any OpenAI SDK or curl works against the same endpoint — use it for scripts, writing tools, or your own app:
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "gemma-4-12b",
"messages": [{"role": "user", "content": "Summarize this in three bullet points: ..."}]
}'
Local servers don't check the key, so any non-empty string (e.g. EMPTY) works where a client requires one.
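Scripts that start the server and immediately send a request can fail while the model is still loading. One way to handle that, sketched below, is to poll llama-server's /health endpoint until it answers successfully. The function name wait_for_server and the one-second polling interval are illustrative choices; Ollama exposes different endpoints, so this sketch targets llama-server only.

```shell
# wait_for_server BASE_URL [TRIES]: poll BASE_URL/health about once per
# second until it returns a success status, or give up after TRIES attempts
# (default 60). Returns 0 when the server is ready, 1 otherwise.
wait_for_server() {
  local base_url="$1" tries="${2:-60}" i=1
  while [ "$i" -le "$tries" ]; do
    # -f makes curl fail on HTTP errors, such as the 503 sent while loading.
    if curl -fsS -o /dev/null "$base_url/health" 2>/dev/null; then
      return 0
    fi
    if [ "$i" -lt "$tries" ]; then sleep 1; fi
    i=$(( i + 1 ))
  done
  return 1
}

# Usage (assumes the llama-server from the Running section on port 8000):
#   wait_for_server http://localhost:8000 && curl http://localhost:8000/v1/models
```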
Results
- VRAM usage: The dense ~12B loads entirely as its GGUF file: Q8_0 is 12.67GB on disk (byte-verified from the ggml-org GGUF tree). On the RX 7800 XT's 16GB that leaves ~3GB for the KV cache, which Gemma's sliding-window attention stretches into a long context (longer still with an 8-bit-quantized cache; see Running). Q6_K (9.79GB), Q5_K_M (8.41GB), Q4_K_M (7.12GB) and Google's QAT Q4_0 (6.98GB) shrink the footprint for even larger context or smaller cards; QAT Q4_0 / Q4_K_M reach down to an 8GB card. The full-precision bf16 GGUF (23.83GB) does not fit a 16GB card at all (needs 32GB+).
- Model capability (vendor evals — Google's own, NOT hardware throughput): Google reports MMLU Pro 77.2%, GPQA Diamond 78.8%, AIME 2026 77.5%, and LiveCodeBench v6 72.0% — a reasoning-strong card for its size. These are the vendor's benchmarks, not measurements on this GPU.
- Speed: No community throughput benchmark for Gemma 4 12B on the RX 7800 XT exists yet. We would rather omit a tok/s figure than invent one or borrow it from different hardware. Live measurements will appear at /check/gemma-4-12b/rx-7800-xt once contributed.
For the full benchmark data, see /check/gemma-4-12b/rx-7800-xt.
Troubleshooting
The chat template looks wrong / responses are malformed
Pass --jinja to llama-server so the GGUF's built-in chat template is applied — without it the assistant format won't parse. Gemma 4's chat template is complex (it includes a reasoning/thought channel), so applying it correctly matters more than for a plain instruct model. Use a recent llama.cpp build so the template is fully supported.
Images or audio aren't recognized
The plain LLM GGUF is text-only. To pass images or audio you must also load the separate mmproj projector with --mmproj and use the multimodal path (llama-mtmd-cli, or the multimodal server). Download the mmproj-* file from the same GGUF repo — it is a distinct file from the quant, and text chat works fine without it.
Ollama or llama.cpp runs on the CPU instead of the GPU
Confirm the ROCm v7 driver is installed (rocm-smi should list the 7800 XT) and that your user is in the render and video groups (groups should show both — log out and back in after the usermod step). Per the Ollama AMD GPU docs, ROCm is a separate install from Ollama; if it's missing, Ollama silently falls back to CPU. For a source llama.cpp build, confirm you compiled with -DGGML_HIP=ON -DGPU_TARGETS=gfx1101. The RX 7800 XT (gfx1101) is on the supported-GPU matrix, so you should not normally need HSA_OVERRIDE_GFX_VERSION — reach for the HSA_OVERRIDE_GFX_VERSION=11.0.0 masquerade only if a specific tool ships gfx1100-only kernels and refuses to start.
Out of memory at load, or when raising the context
Q8_0 weights (12.67GB) leave ~3GB on a 16GB RX 7800 XT for the KV cache, and Gemma's sliding-window attention keeps that cache smaller than a full-attention model would, so OOM is unlikely at moderate context sizes. A context approaching the full 256K can still exceed that headroom. Options, in order: quantize the KV cache with -fa on -ctk q8_0 -ctv q8_0 (roughly halves cache VRAM; note Flash Attention is less mature on ROCm/RDNA3, so verify it behaves on your version); lower -c; or drop to Q6_K (9.79GB), Q5_K_M (8.41GB), Q4_K_M (7.12GB) or Google's QAT Q4_0 (6.98GB) for even more headroom. Don't try the bf16 GGUF (23.83GB) on 16GB: it doesn't fit and needs a 32GB+ card.
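The last option (dropping to a smaller quant) can be framed as a simple selection: choose the largest quant whose file, plus whatever you want to reserve for the KV cache and buffers, still fits in VRAM. The sketch below assumes the on-disk sizes quoted in this recipe; the function name pick_quant and the reserve values are illustrative, and the reserve is your own estimate rather than a measured figure.

```shell
# pick_quant VRAM_GB KV_RESERVE_GB -> name of the largest quant whose file
# size plus the reserve fits in VRAM; prints "none" and returns 1 if
# nothing fits. Sizes are the on-disk figures from this recipe.
pick_quant() {
  local vram="$1" reserve="$2" entry
  for entry in Q8_0:12.67 Q6_K:9.79 Q5_K_M:8.41 Q4_K_M:7.12 QAT_Q4_0:6.98; do
    # awk does the floating-point comparison; exit status 0 means "fits".
    if awk -v v="$vram" -v r="$reserve" -v s="${entry#*:}" \
         'BEGIN { exit !(s + r <= v) }'; then
      echo "${entry%%:*}"
      return 0
    fi
  done
  echo "none"
  return 1
}

pick_quant 16 3   # prints Q8_0: 12.67 + 3 fits in 16
pick_quant 16 5   # prints Q6_K: Q8_0 no longer fits with a 5GB reserve
```

Raising the reserve is the knob to turn when you want a longer context: the more cache you plan for, the further down the ladder the selection moves.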
Token generation feels slower than expected — try the Vulkan backend
On RDNA3 the ROCm/HIP backend can be slower at token generation than llama.cpp's Vulkan backend. Per llama.cpp issue #20934, measured on a 7900-series RDNA3 card, the Vulkan (RADV) backend outpaced ROCm for pure generation on a small model across ROCm 6.4.4–7.x. If your generation rate disappoints under ROCm on the 7800 XT, build llama.cpp with -DGGML_VULKAN=ON instead of -DGGML_HIP=ON and re-benchmark with llama-bench — Vulkan often wins for pure generation on RDNA3.
torch / CUDA errors — this is llama.cpp on ROCm, not a Python ML stack
Serving Gemma 4 via llama.cpp or Ollama does not require PyTorch, flash-attn wheels, or a Python ML stack — and it does not use CUDA at all on this card. If you hit a CUDA error, you almost certainly grabbed a CUDA/CPU binary instead of the ROCm build; rebuild with -DGGML_HIP=ON -DGPU_TARGETS=gfx1101 (Option A) or reinstall Ollama with the ROCm driver present. At 12B, Q8_0 is already near-lossless, so there is no reason to reach for the full-precision weights on this card.
Model or GPU 404 on /check
Gemma 4 12B is a new addition; if the /check/gemma-4-12b/rx-7800-xt link 404s, the catalogue row is still being registered. The recipe's install and run steps are independent of the benchmark endpoint.