Gemma 4 12B on RTX 4070: Local Private Assistant via llama.cpp / Ollama (12GB)

llm · intermediate · 12GB+ VRAM · Jul 4, 2026

This intermediate recipe sets up Gemma 4 12B on the RTX 4070, needing about 12 GB of VRAM.

prerequisites
  • NVIDIA RTX 4070 (12GB VRAM, Ada Lovelace AD104, sm_89)
  • 16GB+ system RAM (32GB comfortable)
  • ~7-10GB free disk for the GGUF (QAT Q4_0 ~7GB up to Q6_K ~10GB; add ~1GB more for the optional mmproj)
  • A recent llama.cpp build (CUDA) or Ollama — Gemma 4 is supported out of the box, no special patch needed
  • Optional: Open WebUI (or any OpenAI-compatible chat client) for a local chat front-end

What You'll Build

A fully local, private general assistant: Gemma 4 12B, Google DeepMind's open-weight multimodal generalist (instruction-tuned, released 2026), served as an OpenAI-compatible endpoint by llama.cpp or Ollama on a single 12GB RTX 4070. You use it from a chat UI (Open WebUI is a good local front-end) or directly via the API for Q&A, drafting and editing, multi-step reasoning, and, optionally, understanding images and audio you feed it. It is a reasoning-strong 12B that runs on modest hardware: on a 12GB RTX 4070 a near-lossless-feeling Q6_K fits with room for a long context, and the smaller quants reach all the way down to 8GB cards. Everything runs on your own hardware, so prompts, documents, images and audio never leave the machine.

Hardware data: RTX 4070 (12GB VRAM) · Gemma 4 12B, GGUF Q6_K (9.79GB, recommended) — ~2GB free, and Gemma's sliding-window attention keeps a long context affordable — or Q5_K_M (8.41GB) / Q4_K_M (7.12GB) / Google's own QAT Q4_0 (6.98GB) for more KV-cache / context headroom · See benchmark data

ℹ️ This is a dense ~12B multimodal generalist — no MoE. Gemma 4 12B is a Gemma4UnifiedForConditionalGeneration (model_type: gemma4_unified) — ~11.95B dense parameters, 48 layers, hidden size 3840, GQA with 16 query / 8 KV heads, head_dim 256. Because it is dense, its footprint is simply the quant file you load plus the KV cache; there is no "active-parameters" shortcut. It uses a unified, encoder-free design: images (raw patches) and audio (waveforms) are projected directly into the decoder rather than through a separate vision/audio encoder. Positioned and used as a general assistant, so we file it under llm.

ℹ️ Multimodal input is optional and needs a separate projector. Gemma 4 accepts text, image, and audio in, text out. The LLM GGUF you load for chat is text-only on its own — to feed it images or audio you also pass a separate mmproj projector GGUF with --mmproj (and use llama-mtmd-cli / the multimodal server path). The mmproj-* file is not the LLM and is excluded from the weight/VRAM math below — if you only need text chat, you don't need it at all.

ℹ️ Very long 256K context, made affordable by sliding-window attention. Gemma 4 advertises a 256K context window (max_position_embeddings 262,144). It uses hybrid attention: interleaved local sliding-window (window 1024) layers plus periodic full global attention (the final layer is always global). Sliding-window attention keeps the KV cache far smaller than a full-attention model at the same length — long context is genuinely cheap here. Even so, the full 256K won't fit on small cards; bound the context (-c) on modest VRAM. On a 12GB 4070 with Q6_K you have ~2GB free for the KV cache, and SWA stretches that into a genuinely long context.
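
To make the sliding-window saving concrete, here is a rough back-of-the-envelope estimate in Python. The layer count, KV heads, head_dim and window size come from this recipe; the split between local and global layers (one global layer in every six) is an assumption for illustration only, so check the model's config before relying on the numbers. It also ignores compute buffers and any padding the runtime adds.

```python
# Rough KV-cache estimate for a hybrid sliding-window / global-attention model.
# From the recipe: 48 layers, 8 KV heads, head_dim 256, window 1024.
# ASSUMPTION (not from the recipe): 1 global layer per 6 layers (8 global, 40 local).

N_LAYERS = 48
N_KV_HEADS = 8
HEAD_DIM = 256
WINDOW = 1024
GLOBAL_EVERY = 6  # hypothetical interleave ratio; verify against the model config


def kv_cache_bytes(ctx: int, bytes_per_elem: float = 2.0, swa: bool = True) -> int:
    """Bytes of K+V cache for `ctx` tokens (f16 elements by default)."""
    per_token_per_layer = 2 * N_KV_HEADS * HEAD_DIM * bytes_per_elem  # K and V
    n_global = N_LAYERS // GLOBAL_EVERY
    n_local = N_LAYERS - n_global
    if not swa:
        # Treat every layer as full attention for comparison.
        n_global, n_local = N_LAYERS, 0
    # Global layers cache every token; local layers cache at most one window.
    cached = n_global * ctx + n_local * min(ctx, WINDOW)
    return int(cached * per_token_per_layer)


GIB = 1024 ** 3
for ctx in (8192, 32768, 131072):
    hybrid = kv_cache_bytes(ctx) / GIB
    full = kv_cache_bytes(ctx, swa=False) / GIB
    print(f"ctx={ctx:>6}: hybrid ~{hybrid:.2f} GiB, full attention ~{full:.2f} GiB")
```

Under these assumptions the global layers dominate at long context, which is why the context still has to be bounded on a 12GB card.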

ℹ️ Runs on current llama.cpp out of the box. Gemma 4 support landed at the model's launch (~April 2026) and ggml-org ships official GGUFs — there is no special patch or PR gate. Just use a recent llama.cpp (or Ollama) build. Pass --jinja so the embedded chat template applies (it's a complex template that includes a reasoning/thought channel).

Requirements

Component | Minimum | Tested target
GPU | 8GB VRAM (QAT Q4_0 / Q4_K_M floor; the matrix reaches down this far) | RTX 4070 (12GB, Ada Lovelace AD104, sm_89)
RAM | 16GB system RAM | 32GB comfortable
Storage | ~7GB (QAT Q4_0) up to ~10GB (Q6_K); +~1GB for the optional mmproj | ~10GB for Q6_K
Software | Recent llama.cpp (CUDA) or Ollama; optional Open WebUI chat client | llama-server, Open WebUI

Model weights (first-party GGUF available). Unlike many open models, Gemma 4 ships official GGUFs. There are three good sources:

  • Google's own QAT Q4_0: google/gemma-4-12b-it-qat-q4_0-gguf is a quantization-aware-trained Q4_0 (6.98GB). Because the model was fine-tuned for this quantization, it delivers noticeably better quality-per-byte than a naive Q4_0; this is the low-VRAM hero (fits an 8GB card). (The mmproj-* file in that repo is the vision/audio projector, not the LLM.)
  • ggml-org first-party GGUF: ggml-org/gemma-4-12B-it-GGUF ships Q4_K_M (7.38GB, marginally larger than unsloth's 7.12GB in the table), Q8_0 (12.67GB) and bf16 (23.83GB), plus the mmproj.
  • Community K_M ladder: unsloth/gemma-4-12b-it-GGUF provides the conventional ladder used in the fit table below.

Byte-verified on-disk sizes (unsloth K_M ladder, plus Google's QAT):

Quant | On-disk size | Fit on RTX 4070 (12GB)
QAT Q4_0 (Google) | 6.98GB | Quality-per-byte low-VRAM option: quantization-aware-trained; leaves the most KV-cache room; also fits an 8GB card
Q4_K_M | 7.12GB | Tiny footprint with huge KV-cache / context headroom; small enough for an 8GB card
Q5_K_M | 8.41GB | Small footprint with a quality bump over Q4; comfortable with a large KV cache
Q6_K | 9.79GB | Recommended: near-lossless-feeling with ~2GB free on 12GB; SWA turns that into a long context; the best-quality choice that still leaves KV-cache room here
Q8_0 | 12.67GB | Does not fit 12GB: larger than the card's VRAM before any KV cache; needs a 16GB+ card
bf16 | 23.83GB | Does not fit 12GB: 24GB-class only
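
The fit column is just subtraction, and a short Python sketch makes the headroom per quant explicit. The sizes are the on-disk figures from the table; the helper name is illustrative, and the calculation treats the card's 12GB and the file sizes as the same unit and ignores runtime buffers, so read the results as upper bounds.

```python
# Headroom left for the KV cache after loading each quant on a 12GB card.
# Sizes are the on-disk figures from the table; runtime overhead is ignored.

VRAM_GB = 12.0
QUANT_SIZES_GB = {
    "QAT Q4_0": 6.98,
    "Q4_K_M": 7.12,
    "Q5_K_M": 8.41,
    "Q6_K": 9.79,
    "Q8_0": 12.67,
    "bf16": 23.83,
}


def headroom_gb(quant: str, vram_gb: float = VRAM_GB) -> float:
    """VRAM left after the weights; negative means the weights do not fit."""
    return round(vram_gb - QUANT_SIZES_GB[quant], 2)


for name in QUANT_SIZES_GB:
    free = headroom_gb(name)
    verdict = "fits" if free > 0 else "does not fit"
    print(f"{name:<9} {free:>7.2f} GB free  ({verdict})")
```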

Not model weights — don't count these in the VRAM math:

  • The mmproj-* file is the multimodal (image/audio) projector, loaded separately with --mmproj only if you want image/audio input. It is not part of the text-chat weights.
  • Any *-MTP* / mtp-* file is a multi-token-prediction / speculative-decode draft head — not the model weights either.

Licensing. Gemma 4 is Apache-2.0 — free for commercial and non-commercial use, no revenue caps (model card). This is a notable change: earlier Gemma generations (1–3) shipped under the custom "Gemma Terms of Use", and Gemma 4 moved to standard Apache-2.0. Google layers a separate Prohibited Use Policy on top (disallowed use cases apply regardless of the license), but the weights themselves are Apache-2.0.

Installation

You have two GGUF runtimes; pick one. Both are fine for this model — there is no patch requirement — so choose Ollama for the fastest start, or llama.cpp for the most control over context, KV-cache quantization, and multimodal input.

Option A — llama.cpp with CUDA

The RTX 4070 is Ada Lovelace (AD104, sm_89). Build a recent llama.cpp and compile for sm_89, per the official build guide:

git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
# RTX 4070 is Ada Lovelace = compute capability 8.9 (sm_89)
cmake -B build -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=89
cmake --build build --config Release -j 8

A recent release is all you need: Gemma 4 has been mainline in llama.cpp since its launch. If you prefer a prebuilt binary, grab a current one from the releases page. Install the NVIDIA CUDA toolkit before building. The CUDA backend flag on current llama.cpp is -DGGML_CUDA=ON; it replaced the older LLAMA_CUDA name in 2024.

Option B — Ollama

Ollama is built on llama.cpp and is the fastest way to stand this model up. Either use the curated tag (ollama run gemma4:12b, if listed) or pull the community GGUF straight from Hugging Face (HF × Ollama docs):

ollama run hf.co/unsloth/gemma-4-12b-it-GGUF:Q6_K

Swap the :Q6_K tag for :Q5_K_M or :Q4_K_M if you want an even smaller footprint. (Skip :Q8_0 on 12GB — at 12.67GB it exceeds the card's VRAM.) Ollama serves an OpenAI-compatible API at http://localhost:11434/v1 for chat clients.

Running

With llama.cpp

Serve an OpenAI-compatible API on port 8000. The -hf flag pulls the GGUF from Hugging Face; append :Q6_K (case-insensitive) to pick the quant (llama-server docs):

# Q6_K (recommended on 12GB), offload all layers to the 4070
llama-server -hf unsloth/gemma-4-12b-it-GGUF:Q6_K \
    --port 8000 \
    -ngl 99 \
    -c 8192 \
    --jinja
  • -ngl 99 (--n-gpu-layers) offloads every layer to the GPU — the dense 12B quant file (9.79GB at Q6_K) sits entirely in VRAM.
  • -c 8192 sets an 8K context. At Q6_K you have ~2GB free after the weights, and Gemma's sliding-window attention keeps the KV cache modest, so you can raise this.
  • --jinja applies the GGUF's built-in chat template so the assistant format parses correctly — Gemma 4's template is complex (it includes a reasoning/thought channel), so this flag matters.

Push toward the 256K context window. Gemma 4 advertises a 256K context (max_position_embeddings 262,144), and its interleaved sliding-window attention (window 1024) + periodic global attention makes long context far cheaper in KV cache than a full-attention model of the same size. On 12GB the ~2GB left after Q6_K is your KV budget; you can go a lot further by quantizing the KV cache — add -fa on (Flash Attention, required for a quantized cache) and -ctk q8_0 -ctv q8_0, which roughly halves KV-cache VRAM versus f16 with minimal quality impact:

# Longer context by 8-bit-quantizing the KV cache
llama-server -hf unsloth/gemma-4-12b-it-GGUF:Q6_K \
    --port 8000 -ngl 99 -c 32768 --jinja \
    -fa on -ctk q8_0 -ctv q8_0

The full 256K won't fit on a 12GB card even with SWA — bound -c accordingly — but a long, useful context is very much in reach. If you need still more room, drop to Q5_K_M (8.41GB) or Q4_K_M (7.12GB); this same model also fits far smaller GPUs (QAT Q4_0 at 6.98GB runs on an 8GB card), so the matrix reaches well below this tier.
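
The "roughly halves" figure follows from how q8_0 stores values: each block holds 32 values as 32 one-byte integers plus one 2-byte fp16 scale, so 34 bytes instead of the 64 bytes f16 needs. A minimal Python sketch of that arithmetic (the function name is illustrative):

```python
# Bytes per cached value for f16 versus q8_0.
# q8_0 block layout: 32 int8 values + one fp16 scale = 34 bytes per 32 values.

Q8_0_BLOCK_VALUES = 32
Q8_0_BLOCK_BYTES = 32 * 1 + 2
F16_BYTES = 2.0


def bytes_per_value(cache_type: str) -> float:
    """Average storage cost of one K or V element for the given cache type."""
    if cache_type == "f16":
        return F16_BYTES
    if cache_type == "q8_0":
        return Q8_0_BLOCK_BYTES / Q8_0_BLOCK_VALUES
    raise ValueError(f"unsupported cache type: {cache_type}")


ratio = bytes_per_value("q8_0") / bytes_per_value("f16")
print(f"q8_0 KV cache is about {ratio:.0%} of the f16 size")
```

So the same KV budget holds a little under twice as many cached tokens with -ctk q8_0 -ctv q8_0.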

Optional: image and audio input. To use Gemma 4's multimodal side, add the projector with --mmproj (download the mmproj-* file from the same GGUF repo) and serve via the multimodal path — for the CLI, llama-mtmd-cli is the multimodal front-end:

# Multimodal: LLM weights + the separate projector (mmproj)
llama-mtmd-cli -hf unsloth/gemma-4-12b-it-GGUF:Q6_K \
    --mmproj <path-to-mmproj-gguf> \
    -ngl 99 --jinja

The mmproj is a small extra file (~1GB) on top of the quant sizes above — only load it if you actually want to pass images or audio; text chat doesn't need it.
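
If you serve over HTTP rather than using the CLI, images are typically sent as OpenAI-style image_url content parts carrying a base64 data URI. The sketch below only builds such a message in Python. It assumes your llama-server build was started with --mmproj and accepts this content format, which you should confirm against the server documentation for your version; the function name and the default MIME type are illustrative.

```python
import base64


def image_message(prompt: str, image_bytes: bytes, mime: str = "image/png") -> dict:
    """Build one OpenAI-style user message with a text part and an inline image.

    ASSUMPTION: the local server accepts `image_url` parts with data URIs.
    """
    data_uri = f"data:{mime};base64," + base64.b64encode(image_bytes).decode("ascii")
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {"type": "image_url", "image_url": {"url": data_uri}},
        ],
    }
```

Pass the result as an entry in the "messages" list of a normal chat-completions request, reading the image with open(path, "rb").read().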

With Ollama

Pull and run the community GGUF directly from Hugging Face; append a :quant tag to choose the quant (HF × Ollama docs):

ollama run hf.co/unsloth/gemma-4-12b-it-GGUF:Q6_K

Ollama serves an OpenAI-compatible API at http://localhost:11434/v1 for chat clients.

Use it as a chat assistant

Point any OpenAI-compatible chat client at your local endpoint by setting its base URL and a dummy API key — no cloud, no per-token cost.

Open WebUI (optional local chat front-end). A self-hosted, ChatGPT-style UI that talks to any OpenAI-compatible server. Run it and point it at your local endpoint:

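# Point Open WebUI at your local llama-server (or Ollama on :11434).
# --add-host lets host.docker.internal resolve on Linux (Docker Desktop already provides it).
# llama-server binds to 127.0.0.1 by default; start it with --host 0.0.0.0 if the container cannot reach it.
docker run -d -p 3000:8080 \
    --add-host=host.docker.internal:host-gateway \
    -e OPENAI_API_BASE_URL=http://host.docker.internal:8000/v1 \
    -e OPENAI_API_KEY=EMPTY \
    ghcr.io/open-webui/open-webui:main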

Then open http://localhost:3000 and chat. (Open WebUI also autodetects a local Ollama install, so with the Ollama path you can skip the base-URL wiring entirely.)

Directly via the API. Any OpenAI SDK or curl works against the same endpoint — use it for scripts, writing tools, or your own app:

curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
      "model": "gemma-4-12b",
      "messages": [{"role": "user", "content": "Summarize this in three bullet points: ..."}]
    }'

Local servers don't check the key by default, so any non-empty string (e.g. EMPTY) works where a client requires one.
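
For scripts, the same request can be made from Python with only the standard library. This is a minimal sketch: the function names are illustrative, the base URL and model name mirror the curl example, and for Ollama you would pass base_url="http://localhost:11434/v1" and the model name Ollama reports.

```python
import json
import urllib.request


def build_chat_request(prompt, base_url="http://localhost:8000/v1",
                       model="gemma-4-12b", api_key="EMPTY"):
    """Build (but do not send) an OpenAI-style chat-completions request."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=body,
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",  # dummy key for local servers
        },
        method="POST",
    )


def chat(prompt, **kwargs):
    """Send the request to the local server and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(prompt, **kwargs), timeout=120) as resp:
        reply = json.load(resp)
    return reply["choices"][0]["message"]["content"]
```

With the server running, chat("Summarize this in three bullet points: ...") returns the reply as a string.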

Results

  • VRAM usage: The dense ~12B loads entirely as its GGUF file — Q6_K is 9.79GB on disk (byte-verified from the unsloth GGUF tree). On the RTX 4070's 12GB that leaves ~2GB for the KV cache — enough for a long context, and stretched further by Gemma's sliding-window attention (and further still with an 8-bit-quantized cache; see Running). Q5_K_M (8.41GB), Q4_K_M (7.12GB) and Google's QAT Q4_0 (6.98GB) shrink the footprint for even larger context or smaller cards. Q8_0 (12.67GB) does not fit 12GB — it exceeds the card's VRAM before any KV cache; step up to a 16GB+ card for near-lossless Q8_0, and the bf16 GGUF (23.83GB) is 24GB-class only.
  • Model capability (vendor evals, Google's own, NOT hardware throughput): Google reports MMLU Pro 77.2%, MMMLU 83.4%, GPQA Diamond 78.8%, AIME 2026 77.5%, LiveCodeBench v6 72.0%, and MMMU Pro (vision) 69.1%, a strong reasoning scorecard for a model of this size. These are the vendor's benchmarks, not measurements on this GPU.
  • Speed: No community throughput benchmark for Gemma 4 12B on the RTX 4070 exists yet — we would rather omit a tok/s figure than invent one or borrow it from different hardware. Live measurements will appear at /check/gemma-4-12b/rtx-4070 once contributed.

For the full benchmark data, see /check/gemma-4-12b/rtx-4070.

Troubleshooting

The chat template looks wrong / responses are malformed

Pass --jinja to llama-server so the GGUF's built-in chat template is applied — without it the assistant format won't parse. Gemma 4's chat template is complex (it includes a reasoning/thought channel), so applying it correctly matters more than for a plain instruct model. Use a recent llama.cpp build so the template is fully supported.

Images or audio aren't recognized

The plain LLM GGUF is text-only. To pass images or audio you must also load the separate mmproj projector with --mmproj and use the multimodal path (llama-mtmd-cli, or the multimodal server). Download the mmproj-* file from the same GGUF repo — it is a distinct file from the quant, and text chat works fine without it.

Out of memory at load, or when raising the context

Q6_K weights (9.79GB) leave ~2GB on a 12GB 4070 for the KV cache, and Gemma's sliding-window attention keeps that cache smaller than a full-attention model would — so a bounded context is fine, but the full 256K is not. Options, in order: quantize the KV cache with -fa on -ctk q8_0 -ctv q8_0 (roughly halves cache VRAM); lower -c; or drop to Q5_K_M (8.41GB), Q4_K_M (7.12GB) or Google's QAT Q4_0 (6.98GB) for even more headroom. Don't try Q8_0 (12.67GB) on 12GB — it exceeds the card's VRAM before any KV cache and won't load; that quant needs a 16GB+ card.
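
The order of those options can be written as a small decision helper. This Python sketch is illustrative only: the function name is hypothetical, the f16 KV-cache size is something you estimate or read from the server's startup log, and the 34/64 factor is the q8_0-to-f16 size ratio (34 bytes versus 64 bytes per 32 values). It ignores compute buffers and anything else using the GPU.

```python
# Which out-of-memory remedy to try first, following the order in the text.
# Inputs are estimates in GB; runtime buffers and desktop VRAM use are ignored.

Q8_0_VS_F16 = 34 / 64  # q8_0 stores 32 values in 34 bytes, f16 in 64 bytes


def suggest_fix(kv_f16_gb: float, weights_gb: float = 9.79, vram_gb: float = 12.0) -> str:
    """Return the first remedy that makes weights plus KV cache fit in VRAM."""
    free_gb = vram_gb - weights_gb
    if free_gb <= 0:
        return "weights do not fit: pick a smaller quant"
    if kv_f16_gb <= free_gb:
        return "fits as-is"
    if kv_f16_gb * Q8_0_VS_F16 <= free_gb:
        return "quantize the KV cache: -fa on -ctk q8_0 -ctv q8_0"
    return "lower -c or drop to a smaller quant"


for kv in (2.0, 3.0, 6.0):
    print(f"KV cache ~{kv} GB at f16 -> {suggest_fix(kv)}")
```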

torch / CUDA errors — this is llama.cpp, not a Python ML stack

Serving Gemma 4 via llama.cpp or Ollama does not require PyTorch, flash-attn wheels, or a Python ML stack. If you hit a CUDA error, confirm you built (or downloaded) the CUDA-enabled llama.cpp (Option A, -DGGML_CUDA=ON) rather than a CPU-only binary. For large-VRAM or multi-GPU production serving you could instead run the full-precision weights under a server like vLLM, but on a single 12GB 4070 the GGUF + llama.cpp path is the right one — and at 12B, Q6_K is already near-lossless-feeling.

Model or GPU 404 on /check

Gemma 4 12B is a new addition; if the /check/gemma-4-12b/rtx-4070 link 404s, the catalogue row is still being registered. The recipe's install and run steps are independent of the benchmark endpoint.

common questions
How much VRAM does Gemma 4 12B need?

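This recipe targets a 12 GB card. The weights alone take 6.98 GB (QAT Q4_0) to 9.79 GB (Q6_K), plus the KV cache, so the Q4 quants also fit an 8 GB card.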

Which GPUs is Gemma 4 12B tested on?

RTX 4070 (12 GB).

How hard is this setup?

Intermediate — follow the steps above.