What You'll Build
A fully local, private general assistant: Gemma 4 12B, Google DeepMind's open-weight multimodal generalist (Instruct, released 2026), served as an OpenAI-compatible endpoint by llama.cpp or Ollama on a single 8GB RTX 4060, then used from a chat UI (Open WebUI is a good local front-end) or directly via the API. It handles Q&A, drafting and editing, multi-step reasoning, and, optionally, images and audio you feed it. This is the entry tier: a reasoning-strong 12B multimodal-capable model on an 8GB card, made possible by Google's own quantization-aware-trained (QAT) Q4_0 GGUF (6.98GB), which is built for exactly this low-VRAM case. Everything runs on your own hardware, so prompts, documents, images, and audio never leave the machine.
Hardware data: RTX 4060 (8GB VRAM) · Gemma 4 12B, GGUF QAT Q4_0 (6.98GB, recommended) — Google's quantization-aware-trained quant — or the conventional Q4_K_M (7.12GB); both fit 8GB with a bounded context (Gemma's sliding-window attention keeps the KV cache small) · See benchmark data
ℹ️ This is a dense ~12B multimodal generalist, not an MoE. Gemma 4 12B is a Gemma4UnifiedForConditionalGeneration (model_type: gemma4_unified): ~11.95B dense parameters, 48 layers, hidden size 3840, GQA with 16 query / 8 KV heads, head_dim 256. Because it is dense, its footprint is simply the quant file you load plus the KV cache; there is no "active-parameters" shortcut. It uses a unified, encoder-free design: images (raw patches) and audio (waveforms) are projected directly into the decoder rather than through a separate vision/audio encoder. It is positioned and used as a general assistant, so we file it under llm.
ℹ️ Multimodal input is optional and needs a separate projector. Gemma 4 accepts text, image, and audio in, text out. The LLM GGUF you load for chat is text-only on its own; to feed it images or audio you also pass a separate mmproj projector GGUF with --mmproj (and use llama-mtmd-cli / the multimodal server path). The mmproj-* file is not the LLM and is excluded from the weight/VRAM math below; if you only need text chat, you don't need it at all. On an 8GB card the projector's extra ~1GB is a tight squeeze on top of the quant, so treat image/audio as an occasional mode rather than always-on.
ℹ️ Very long 256K context, made affordable by sliding-window attention. Gemma 4 advertises a 256K context window (max_position_embeddings 262,144). It uses hybrid attention: interleaved local sliding-window (window 1024) layers plus periodic full global attention (the final layer is always global). Sliding-window attention keeps the KV cache far smaller than a full-attention model at the same length, so long context is genuinely cheap here. That matters most on an 8GB card: the full 256K won't fit, but because SWA keeps the cache small you can still run a useful, bounded context. Start with a modest -c and lean on SWA rather than trying to reserve for 256K.
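To make the sliding-window saving concrete, here is a rough back-of-envelope estimate in Python. The 48 layers, 8 KV heads, head_dim 256, and window of 1024 come from the notes above. The share of global layers (`global_every=6`, one global layer in every six) is an assumption for illustration only, not a figure from the model card, so check the model's config before trusting the numbers. The sketch also ignores compute buffers and any padding the runtime adds.

```python
def kv_cache_bytes(ctx, n_layers=48, n_kv_heads=8, head_dim=256,
                   window=1024, global_every=6, bytes_per_elem=2.0):
    """Rough KV-cache size for a hybrid sliding-window / global model.

    global_every is an ASSUMPTION (one global layer per N layers).
    bytes_per_elem is 2.0 for an f16 cache.
    """
    n_global = n_layers // global_every
    n_local = n_layers - n_global
    # K and V each hold n_kv_heads * head_dim elements per token per layer.
    per_token = 2 * n_kv_heads * head_dim * bytes_per_elem
    global_bytes = n_global * ctx * per_token                # grows with context
    local_bytes = n_local * min(ctx, window) * per_token     # capped at the window
    return global_bytes + local_bytes


def full_attention_bytes(ctx, **kw):
    """Same model shape, but with every layer attending over the full context."""
    return kv_cache_bytes(ctx, window=ctx, **kw)


if __name__ == "__main__":
    for ctx in (4096, 16384):
        hybrid = kv_cache_bytes(ctx) / 2**20
        full = full_attention_bytes(ctx) / 2**20
        print(f"ctx={ctx}: hybrid ~{hybrid:.0f} MiB, full attention ~{full:.0f} MiB")
```

Under these assumptions a 4K f16 cache comes to about 576 MiB, against about 1536 MiB if every layer used full attention. An 8-bit cache can be approximated with `bytes_per_elem=1.0625` (a q8_0 block stores 32 values in 34 bytes).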
ℹ️ Runs on current llama.cpp out of the box. Gemma 4 support landed at the model's launch (~April 2026) and ggml-org ships official GGUFs; there is no special patch or PR gate. Just use a recent llama.cpp (or Ollama) build. Pass --jinja so the embedded chat template applies (it's a complex template that includes a reasoning/thought channel).
Requirements
| Component | Minimum | Tested target |
|---|---|---|
| GPU | 8GB VRAM (QAT Q4_0 / Q4_K_M floor — this is that floor tier) | RTX 4060 (8GB, Ada Lovelace AD107, sm_89) |
| RAM | 16GB system RAM | 32GB comfortable |
| Storage | ~7GB (QAT Q4_0 or Q4_K_M); +~1GB for the optional mmproj | ~7GB for QAT Q4_0 |
| Software | Recent llama.cpp (CUDA) or Ollama; optional Open WebUI chat client | llama-server, Open WebUI |
Model weights (first-party GGUF available). Unlike many open models, Gemma 4 ships official GGUFs. There are three good sources:
- Google's own QAT Q4_0: google/gemma-4-12b-it-qat-q4_0-gguf is a quantization-aware-trained Q4_0 (6.98GB). Because the model was fine-tuned for this quantization, it delivers noticeably better quality-per-byte than a naive Q4_0; this is the low-VRAM hero and the recommended quant for this 8GB tier. (The mmproj-* file in that repo is the vision/audio projector, not the LLM.)
- ggml-org first-party GGUF: ggml-org/gemma-4-12B-it-GGUF ships Q4_K_M (7.38GB, marginally larger than unsloth's 7.12GB in the table), Q8_0 (12.67GB) and bf16 (23.83GB), plus the mmproj.
- Community K_M ladder: unsloth/gemma-4-12b-it-GGUF provides the conventional ladder used in the fit table below.
Byte-verified on-disk sizes (unsloth K_M ladder, plus Google's QAT):
| Quant | On-disk size | Fit on RTX 4060 (8GB) |
|---|---|---|
| QAT Q4_0 (Google) | 6.98GB | Recommended — quantization-aware-trained; best quality-per-byte and the safest fit on 8GB (~1GB headroom for a bounded KV cache) |
| Q4_K_M | 7.12GB | Also fits — conventional Q4; ~1GB left, so keep the context modest (SWA helps) |
| Q5_K_M | 8.41GB | Does not fit 8GB — larger than the card's VRAM even before the KV cache; use a bigger card |
| Q6_K | 9.79GB | Does not fit 8GB — needs a 12GB+ card |
| Q8_0 | 12.67GB | Does not fit 8GB — needs a 16GB+ card |
| bf16 | 23.83GB | Does not fit 8GB — 24GB-class only |
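The fit column is simple subtraction, and a short Python sketch makes it easy to redo for another card. The sizes are copied from the table above. Treating the card's 8GB and the file sizes as the same unit follows this article's convention and is an approximation: usable VRAM is lower once the driver and desktop take their share.

```python
VRAM_GB = 8.0  # RTX 4060, as in the table above

# On-disk sizes in GB, copied from the table above.
QUANTS_GB = {
    "QAT Q4_0": 6.98,
    "Q4_K_M": 7.12,
    "Q5_K_M": 8.41,
    "Q6_K": 9.79,
    "Q8_0": 12.67,
    "bf16": 23.83,
}


def headroom_gb(quant, vram_gb=VRAM_GB):
    """VRAM left for the KV cache after the weights load (negative = no fit)."""
    return round(vram_gb - QUANTS_GB[quant], 2)


def fitting_quants(vram_gb=VRAM_GB):
    """Quants whose weights alone are smaller than the card's VRAM."""
    return [q for q in QUANTS_GB if headroom_gb(q, vram_gb) > 0]


if __name__ == "__main__":
    for quant in QUANTS_GB:
        print(f"{quant:9s} headroom: {headroom_gb(quant):+.2f} GB")
```

Calling `fitting_quants(12.0)` shows which rows open up on a 12GB card. A positive headroom only means the weights load; the KV cache still has to fit in what is left.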
Not model weights — don't count these in the VRAM math:
- The mmproj-* file is the multimodal (image/audio) projector, loaded separately with --mmproj only if you want image/audio input. It is not part of the text-chat weights.
- Any *-MTP* / mtp-* file is a multi-token-prediction / speculative-decode draft head, not the model weights either.
Licensing. Gemma 4 is Apache-2.0 — free for commercial and non-commercial use, no revenue caps (model card). This is a notable change: earlier Gemma generations (1–3) shipped under the custom "Gemma Terms of Use", and Gemma 4 moved to standard Apache-2.0. Google layers a separate Prohibited Use Policy on top (disallowed use cases apply regardless of the license), but the weights themselves are Apache-2.0.
Installation
You have two GGUF runtimes; pick one. Both are fine for this model — there is no patch requirement — so choose Ollama for the fastest start, or llama.cpp for the most control over context, KV-cache quantization, and multimodal input.
Option A — llama.cpp with CUDA
The RTX 4060 is Ada Lovelace (AD107, sm_89). Build a recent llama.cpp and compile for sm_89, per the official build guide:
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
# RTX 4060 is Ada Lovelace = compute capability 8.9 (sm_89)
cmake -B build -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=89
cmake --build build --config Release -j 8
A recent release is all you need — Gemma 4 has been mainline in llama.cpp since its launch. If you prefer a prebuilt binary, grab a current one from the releases page. The CUDA backend flag is -DGGML_CUDA=ON on current llama.cpp (the old LLAMA_CUDA name was retired in late 2024); install the NVIDIA CUDA toolkit first.
Option B — Ollama
Ollama is built on llama.cpp and is the fastest way to stand this model up. Either use the curated tag (ollama run gemma4:12b, if listed) or pull the community GGUF straight from Hugging Face (HF × Ollama docs):
ollama run hf.co/unsloth/gemma-4-12b-it-GGUF:Q4_K_M
The :Q4_K_M tag is the smallest conventional ladder step and the safe choice on 8GB. Ollama serves an OpenAI-compatible API at http://localhost:11434/v1 for chat clients.
Running
With llama.cpp
Serve an OpenAI-compatible API on port 8000. For the best quality-per-byte on 8GB, load Google's QAT Q4_0 GGUF. To select a quant explicitly, append it to the repo name as a tag such as :Q4_0 (case-insensitive) (llama-server docs):
# QAT Q4_0 (recommended on 8GB), offload all layers to the 4060
llama-server -hf google/gemma-4-12b-it-qat-q4_0-gguf \
--port 8000 \
-ngl 99 \
-c 4096 \
--jinja
- -ngl 99 (--n-gpu-layers) offloads every layer to the GPU; the dense 12B QAT quant (6.98GB) sits in VRAM with ~1GB to spare on an 8GB card.
- -c 4096 sets a modest 4K context to start. There's only ~1GB free after the weights, so keep the context bounded; Gemma's sliding-window attention keeps the KV cache small, so you can nudge this up and watch VRAM.
- --jinja applies the GGUF's built-in chat template so the assistant format parses correctly. Gemma 4's template is complex (it includes a reasoning/thought channel), so this flag matters.
Stretching context on 8GB. The full 256K won't fit here, but you don't have to leave it at 4K. Gemma 4's interleaved sliding-window attention (window 1024) + periodic global attention keeps the KV cache far cheaper than a full-attention model of the same size, so a longer bounded context is realistic. To go further, quantize the KV cache — add -fa on (Flash Attention, required for a quantized cache) and -ctk q8_0 -ctv q8_0, which roughly halves KV-cache VRAM versus f16 with minimal quality impact:
# Longer bounded context by 8-bit-quantizing the KV cache
llama-server -hf google/gemma-4-12b-it-qat-q4_0-gguf \
--port 8000 -ngl 99 -c 16384 --jinja \
-fa on -ctk q8_0 -ctv q8_0
Raise -c gradually and stop when VRAM runs tight — on 8GB the weights already take ~7GB, so the KV cache is what you're budgeting. This is why the entry tier leans on QAT Q4_0: it leaves the most room for context on the smallest card.
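One way to "watch VRAM" while stepping -c upward is to read the card's memory counters after each restart. The sketch below is a minimal Python helper around nvidia-smi's CSV query output. The 300 MiB reserve is an arbitrary margin chosen for illustration, not a measured threshold, and the helper names are hypothetical.

```python
import subprocess

# Asks nvidia-smi for used and total memory in MiB, one CSV line per GPU.
QUERY = ["nvidia-smi", "--query-gpu=memory.used,memory.total",
         "--format=csv,noheader,nounits"]


def parse_vram(csv_line):
    """Parse one 'used, total' line (MiB) from nvidia-smi into integers."""
    used, total = (int(field.strip()) for field in csv_line.split(","))
    return used, total


def vram_is_tight(used_mib, total_mib, reserve_mib=300):
    """True when less than reserve_mib is free. reserve_mib is an arbitrary margin."""
    return total_mib - used_mib < reserve_mib


def read_vram():
    """Query the first GPU; requires nvidia-smi from the NVIDIA driver on PATH."""
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
    return parse_vram(out.splitlines()[0])
```

After starting llama-server with a new -c value and sending one long prompt, `vram_is_tight(*read_vram())` gives a quick yes/no on whether to stop raising the context.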
Optional: image and audio input. To use Gemma 4's multimodal side, add the projector with --mmproj (download the mmproj-* file from the same GGUF repo) and serve via the multimodal path — for the CLI, llama-mtmd-cli is the multimodal front-end:
# Multimodal: LLM weights + the separate projector (mmproj)
llama-mtmd-cli -hf google/gemma-4-12b-it-qat-q4_0-gguf \
--mmproj <path-to-mmproj-gguf> \
-ngl 99 --jinja
The mmproj is a small extra file (~1GB) on top of the quant sizes above. On 8GB that's a tight fit alongside the weights, so treat image/audio as an occasional mode (shorter context while it's loaded); text chat doesn't need it at all.
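If you serve the model over HTTP with the projector loaded, images are typically sent as OpenAI-style content parts. The Python sketch below builds such a message. It assumes the server was started with --mmproj and accepts `image_url` parts carrying a base64 data URI, which is the OpenAI convention; confirm this against your llama.cpp build before relying on it. `image_message` is a hypothetical helper name.

```python
import base64


def image_message(prompt, image_bytes, mime="image/png"):
    """Build one user message carrying text plus an inline base64 image."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    data_uri = f"data:{mime};base64,{encoded}"
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {"type": "image_url", "image_url": {"url": data_uri}},
        ],
    }
```

The returned dict goes into the `messages` array of a normal chat-completions request. Keeping images small helps on 8GB, since image tokens draw on the same KV-cache budget as text.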
With Ollama
Pull and run the community GGUF directly from Hugging Face; append a :quant tag to choose the quant (HF × Ollama docs):
ollama run hf.co/unsloth/gemma-4-12b-it-GGUF:Q4_K_M
:Q4_K_M (7.12GB) is the largest quant that fits 8GB via the unsloth ladder — larger tags (:Q5_K_M and up) exceed the card's VRAM. Ollama serves an OpenAI-compatible API at http://localhost:11434/v1 for chat clients.
Use it as a chat assistant
Point any OpenAI-compatible chat client at your local endpoint by setting its base URL and a dummy API key — no cloud, no per-token cost.
Open WebUI (optional local chat front-end). A self-hosted, ChatGPT-style UI that talks to any OpenAI-compatible server. Run it and point it at your local endpoint:
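# Point Open WebUI at your local llama-server (or Ollama on :11434)
# On Linux, --add-host makes host.docker.internal resolve to the host (Docker 20.10+);
# llama-server binds to 127.0.0.1 by default, so it may also need --host 0.0.0.0 to be reachable.
docker run -d -p 3000:8080 \
--add-host=host.docker.internal:host-gateway \
-e OPENAI_API_BASE_URL=http://host.docker.internal:8000/v1 \
-e OPENAI_API_KEY=EMPTY \
ghcr.io/open-webui/open-webui:main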
Then open http://localhost:3000 and chat. (Open WebUI also autodetects a local Ollama install, so with the Ollama path you can skip the base-URL wiring entirely.)
Directly via the API. Any OpenAI SDK or curl works against the same endpoint — use it for scripts, writing tools, or your own app:
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "gemma-4-12b",
"messages": [{"role": "user", "content": "Summarize this in three bullet points: ..."}]
}'
Local servers don't check the key, so any non-empty string (e.g. EMPTY) works where a client requires one.
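For scripts, the same request can be made from Python with only the standard library. This is a minimal sketch: the base URL and model name mirror the curl example above, the response is assumed to follow the usual OpenAI shape (`choices[0].message.content`), and `build_request` / `chat` are hypothetical helper names.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000/v1"  # use http://localhost:11434/v1 for Ollama


def build_request(prompt, model="gemma-4-12b", base_url=BASE_URL, api_key="EMPTY"):
    """Build a chat-completions request without sending it."""
    body = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json",
                 "Authorization": f"Bearer {api_key}"},
        method="POST",
    )


def chat(prompt, **kwargs):
    """Send the prompt and return the assistant's reply text."""
    request = build_request(prompt, **kwargs)
    with urllib.request.urlopen(request, timeout=300) as resp:
        reply = json.load(resp)
    return reply["choices"][0]["message"]["content"]
```

With the server running, `print(chat("Summarize this in three bullet points: ..."))` prints the reply. With Ollama, pass the model name Ollama registered, for example the hf.co/... tag you pulled.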
Results
- VRAM usage: The dense ~12B loads entirely as its GGUF file. Google's QAT Q4_0 is 6.98GB on disk (byte-verified from the QAT GGUF tree), and the conventional Q4_K_M is 7.12GB (from the unsloth GGUF tree). On the RTX 4060's 8GB that leaves roughly 1GB for the KV cache: enough for a bounded context, stretched further by Gemma's sliding-window attention (and further still with an 8-bit-quantized cache; see Running). The larger quants, Q5_K_M (8.41GB), Q6_K (9.79GB), Q8_0 (12.67GB), and bf16 (23.83GB), do not fit 8GB; step up to a 12GB+ card for those.
- Model capability (vendor evals — Google's own, NOT hardware throughput): Google reports MMLU Pro 77.2%, MMMLU 83.4%, GPQA Diamond 78.8%, AIME 2026 77.5%, LiveCodeBench v6 72.0%, and MMMU Pro (vision) 69.1% — a reasoning-strong card for its size. These are the vendor's benchmarks, not measurements on this GPU.
- Speed: No community throughput benchmark for Gemma 4 12B on the RTX 4060 exists yet; we would rather omit a tok/s figure than invent one or borrow it from different hardware. Live measurements will appear at /check/gemma-4-12b/rtx-4060 once contributed.
For the full benchmark data, see /check/gemma-4-12b/rtx-4060.
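Until a measured figure is published, you can take a rough reading on your own card. The Python sketch below derives tokens per second from one non-streaming reply. It assumes the response carries the OpenAI-style `usage.completion_tokens` field, and `send` stands for any function of your own that posts one chat request and returns the parsed JSON; both helper names are hypothetical.

```python
import time


def tokens_per_second(response, elapsed_s):
    """Generated tokens per wall-clock second for one non-streaming reply.

    The elapsed time includes prompt processing, so this understates
    pure decode speed, most visibly on long prompts.
    """
    if elapsed_s <= 0:
        raise ValueError("elapsed_s must be positive")
    return response["usage"]["completion_tokens"] / elapsed_s


def timed(send, *args, **kwargs):
    """Call send(...) and return (result, seconds elapsed)."""
    start = time.perf_counter()
    result = send(*args, **kwargs)
    return result, time.perf_counter() - start
```

Run it a few times with a short prompt and a long reply, discard the first (warm-up) run, and note the quant, context size, and KV-cache settings alongside the number so the result is comparable.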
Troubleshooting
The chat template looks wrong / responses are malformed
Pass --jinja to llama-server so the GGUF's built-in chat template is applied — without it the assistant format won't parse. Gemma 4's chat template is complex (it includes a reasoning/thought channel), so applying it correctly matters more than for a plain instruct model. Use a recent llama.cpp build so the template is fully supported.
Images or audio aren't recognized
The plain LLM GGUF is text-only. To pass images or audio you must also load the separate mmproj projector with --mmproj and use the multimodal path (llama-mtmd-cli, or the multimodal server). Download the mmproj-* file from the same GGUF repo — it is a distinct file from the quant, and text chat works fine without it. On 8GB the projector's ~1GB is a tight fit alongside the weights, so shorten the context while running multimodal.
Out of memory at load, or when raising the context
On 8GB the weights already take ~7GB (QAT Q4_0 6.98GB / Q4_K_M 7.12GB), so the KV cache is your whole budget — OOM here almost always means the context is too high. Options, in order: lower -c; quantize the KV cache with -fa on -ctk q8_0 -ctv q8_0 (roughly halves cache VRAM); and prefer QAT Q4_0 over Q4_K_M for the extra headroom. Don't reach for Q5_K_M (8.41GB) or larger — they exceed 8GB before any KV cache and won't load; those need a bigger card.
torch / CUDA errors — this is llama.cpp, not a Python ML stack
Serving Gemma 4 via llama.cpp or Ollama does not require PyTorch, flash-attn wheels, or a Python ML stack. If you hit a CUDA error, confirm you built (or downloaded) the CUDA-enabled llama.cpp (Option A, -DGGML_CUDA=ON) rather than a CPU-only binary. On an 8GB card the GGUF + llama.cpp path is the right one — a 12B in a naive full-precision Python stack won't fit here, which is exactly why the QAT Q4_0 quant matters.
Model or GPU 404 on /check
Gemma 4 12B is a new addition; if the /check/gemma-4-12b/rtx-4060 link 404s, the catalogue row is still being registered. The recipe's install and run steps are independent of the benchmark endpoint.