
Gemma 4 12B on Apple M2 Pro: Local Private Assistant via llama.cpp / Ollama (16GB)

llm · intermediate · 16GB+ VRAM · Jul 4, 2026

This intermediate recipe sets up Gemma 4 12B on an Apple M2 Pro with 16GB of unified memory, of which about 11-12GB is practically usable for the model.

Prerequisites
  • Apple M2 Pro (16GB unified memory, Metal GPU)
  • macOS with the unified memory shared between CPU and GPU (~11-12GB practically usable for the model)
  • ~7-10GB free disk for the GGUF (QAT Q4_0 ~7GB up to Q6_K ~10GB; add ~1GB more for the optional mmproj)
  • A recent llama.cpp build (Metal) or Ollama — Gemma 4 is supported out of the box, no special patch needed
  • Optional: Open WebUI (or any OpenAI-compatible chat client) for a local chat front-end

What You'll Build

A fully local, private general assistant: Gemma 4 12B, Google DeepMind's open-weight multimodal generalist (Instruct, released 2026), served as an OpenAI-compatible endpoint by llama.cpp (Metal) or Ollama on a single 16GB Apple M2 Pro. You use it from a chat UI (Open WebUI is a good local front-end) or directly via the API for Q&A, drafting and editing, multi-step reasoning, and, optionally, understanding images and audio you feed it. This is the entry Apple tier: a reasoning-strong 12B multimodal-capable model on a base 16GB unified-memory Mac, where Q6_K (9.79GB) fits under the ~11-12GB practical ceiling with roughly 1-2GB to spare. Everything runs on your own hardware, so prompts, documents, images and audio never leave the machine.

Hardware data: Apple M2 Pro (16GB unified memory, Metal) · Gemma 4 12B, GGUF Q6_K (9.79GB, recommended) — or Q5_K_M (8.41GB) / Q4_K_M (7.12GB) / Google's own QAT Q4_0 (6.98GB) for more KV-cache / context headroom · See benchmark data

ℹ️ This is a dense ~12B multimodal generalist — no MoE. Gemma 4 12B is a Gemma4UnifiedForConditionalGeneration (model_type: gemma4_unified) — ~11.95B dense parameters, 48 layers, hidden size 3840, GQA with 16 query / 8 KV heads, head_dim 256. Because it is dense, its footprint is simply the quant file you load plus the KV cache; there is no "active-parameters" shortcut. It uses a unified, encoder-free design: images (raw patches) and audio (waveforms) are projected directly into the decoder rather than through a separate vision/audio encoder. Positioned and used as a general assistant, so we file it under llm.

ℹ️ Multimodal input is optional and needs a separate projector. Gemma 4 accepts text, image, and audio in, text out. The LLM GGUF you load for chat is text-only on its own — to feed it images or audio you also pass a separate mmproj projector GGUF with --mmproj (and use llama-mtmd-cli / the multimodal server path). The mmproj-* file is not the LLM and is excluded from the weight/VRAM math below — if you only need text chat, you don't need it at all.

ℹ️ Very long 256K context, made affordable by sliding-window attention. Gemma 4 advertises a 256K context window (max_position_embeddings 262,144). It uses hybrid attention: interleaved local sliding-window layers (window 1024) plus periodic full global-attention layers (the final layer is always global). Only the global layers cache the whole sequence; each local layer caches at most the last 1024 tokens, so the KV cache grows far more slowly than in a full-attention model at the same length. On 16GB of unified memory the full 256K still won't fit, but a long, bounded context runs alongside a Q6_K quant.

ℹ️ Runs on current llama.cpp out of the box. Gemma 4 support landed at the model's launch (~April 2026) and ggml-org ships official GGUFs — there is no special patch or PR gate. Just use a recent llama.cpp (or Ollama) build. Pass --jinja so the embedded chat template applies (it's a complex template that includes a reasoning/thought channel).

Requirements

Component | Minimum | Tested target
GPU | Apple Silicon with Metal, ~11-12GB usable unified memory | Apple M2 Pro (16GB unified, Metal)
RAM | Unified memory is shared with the OS (16GB total here) | 16GB unified (M2 Pro)
Storage | ~7GB (QAT Q4_0) up to ~10GB (Q6_K); +~1GB for the optional mmproj | ~10GB for Q6_K
Software | Recent llama.cpp (Metal) or Ollama; optional Open WebUI chat client | llama-server, Open WebUI

Model weights (first-party GGUF available). Unlike many open models, Gemma 4 ships official GGUFs. There are three good sources:

  • Google's own QAT Q4_0: google/gemma-4-12b-it-qat-q4_0-gguf is a quantization-aware-trained Q4_0 (6.98GB). Because the model was fine-tuned for this quantization, it delivers noticeably better quality-per-byte than a naive Q4_0, which makes it the pick when you want maximum context headroom. (The mmproj-* file in that repo is the vision/audio projector, not the LLM.)
  • ggml-org first-party GGUF: ggml-org/gemma-4-12B-it-GGUF ships Q4_K_M (7.38GB, marginally larger than unsloth's 7.12GB in the table), Q8_0 (12.67GB) and bf16 (23.83GB), plus the mmproj.
  • Community K_M ladder: unsloth/gemma-4-12b-it-GGUF provides the conventional ladder used in the fit table below.

Byte-verified on-disk sizes (unsloth K_M ladder, plus Google's QAT):

Quant | On-disk size | Fit on M2 Pro (16GB unified, ~11-12GB usable)
QAT Q4_0 (Google) | 6.98GB | Quantization-aware-trained; the roomiest fit here, with the most context headroom
Q4_K_M | 7.12GB | Tiny footprint; lots of room for a long KV cache under the unified-memory ceiling
Q5_K_M | 8.41GB | Small footprint with a quality bump over Q4; comfortable
Q6_K | 9.79GB | Recommended: near-lossless-feeling and still under the ~11-12GB usable ceiling, with roughly 1-2GB left for the KV cache
Q8_0 | 12.67GB | Tight / borderline: the file alone is slightly over the ~11-12GB usable ceiling; it may load but leaves almost nothing for the KV cache. Use only with a small bounded context, and prefer Q6_K
bf16 | 23.83GB | Does not fit in 16GB; needs a larger unified-memory Mac (see the M3 Max tier)
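
To make the fit column concrete, the sketch below subtracts each on-disk size from the ~11-12GB practically usable range quoted in this recipe. It is plain arithmetic on the table's figures. The function name is illustrative, and it assumes a GGUF's in-memory footprint is roughly its on-disk size; real headroom also depends on what else macOS is holding.

```python
# Headroom left for the KV cache per quant, using the table's on-disk sizes.
# ASSUMPTION: a GGUF's in-memory footprint is roughly its on-disk size.
QUANT_SIZES_GB = {
    "QAT Q4_0": 6.98,
    "Q4_K_M": 7.12,
    "Q5_K_M": 8.41,
    "Q6_K": 9.79,
    "Q8_0": 12.67,
    "bf16": 23.83,
}

# "~11-12GB usable" on a 16GB M2 Pro, taken from the recipe.
USABLE_LOW_GB, USABLE_HIGH_GB = 11.0, 12.0


def headroom_gb(size_gb: float) -> tuple[float, float]:
    """Return (pessimistic, optimistic) GB left after loading the weights.

    A negative value means the weights alone exceed that ceiling.
    """
    return (round(USABLE_LOW_GB - size_gb, 2), round(USABLE_HIGH_GB - size_gb, 2))


if __name__ == "__main__":
    for name, size in QUANT_SIZES_GB.items():
        low, high = headroom_gb(size)
        print(f"{name:9s} {size:6.2f}GB  headroom {low:+.2f} to {high:+.2f}GB")
```

Whatever headroom remains is the budget for the KV cache and any optional projector, which is why the smaller quants buy more context.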

Not model weights — don't count these in the VRAM math:

  • The mmproj-* file is the multimodal (image/audio) projector, loaded separately with --mmproj only if you want image/audio input. It is not part of the text-chat weights.
  • Any *-MTP* / mtp-* file is a multi-token-prediction / speculative-decode draft head — not the model weights either.

Licensing. Gemma 4 is Apache-2.0 — free for commercial and non-commercial use, no revenue caps (model card). This is a notable change: earlier Gemma generations (1–3) shipped under the custom "Gemma Terms of Use", and Gemma 4 moved to standard Apache-2.0. Google layers a separate Prohibited Use Policy on top (disallowed use cases apply regardless of the license), but the weights themselves are Apache-2.0.

Installation

You have two GGUF runtimes; pick one. Both are fine for this model — there is no patch requirement — so choose Ollama for the fastest start, or llama.cpp for the most control over context, KV-cache quantization, and multimodal input. On Apple Silicon both use the Metal GPU backend automatically; there is no CUDA and no nvidia-smi on a Mac.

Option A — llama.cpp with Metal

Build a recent llama.cpp with the Metal backend enabled, per the official build guide:

git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
# Apple Silicon: build with the Metal GPU backend
cmake -B build -DGGML_METAL=ON
cmake --build build --config Release -j 8

A recent release is all you need — Gemma 4 has been mainline in llama.cpp since its launch. If you prefer a prebuilt binary, grab a current macOS build from the releases page. Metal is the default GPU backend on Apple Silicon, so -DGGML_METAL=ON (on by default in recent trees) is all that's required — no CUDA toolkit, no driver setup.

Option B — Ollama

Ollama is built on llama.cpp and is the fastest way to stand this model up; on Apple Silicon it uses Metal automatically. Either use the curated tag (ollama run gemma4:12b, if listed) or pull the community GGUF straight from Hugging Face (HF × Ollama docs):

ollama run hf.co/unsloth/gemma-4-12b-it-GGUF:Q6_K

Swap the :Q6_K tag for :Q5_K_M or :Q4_K_M if you want an even smaller footprint. Ollama serves an OpenAI-compatible API at http://localhost:11434/v1 for chat clients.

Running

With llama.cpp

Serve an OpenAI-compatible API on port 8000. The -hf flag pulls the GGUF from Hugging Face; append :Q6_K (case-insensitive) to pick the quant (llama-server docs):

# Q6_K (recommended on a 16GB M2 Pro), offload all layers to the Metal GPU
llama-server -hf unsloth/gemma-4-12b-it-GGUF:Q6_K \
    --port 8000 \
    -ngl 99 \
    -c 8192 \
    --jinja
  • -ngl 99 (--n-gpu-layers) offloads every layer to the Metal GPU — the dense 12B quant file (9.79GB at Q6_K) sits in unified memory with room for a KV cache under the ~11-12GB practical ceiling.
  • -c 8192 sets an 8K context to start. At Q6_K you have roughly 1-2GB free before the ceiling, and Gemma's sliding-window attention keeps the KV cache modest, so you can raise this.
  • --jinja applies the GGUF's built-in chat template so the assistant format parses correctly — Gemma 4's template is complex (it includes a reasoning/thought channel), so this flag matters.
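
After launch the server may need a while to load the weights before it answers. Here is a minimal readiness check, assuming the server exposes the OpenAI-style GET /v1/models listing. The helper names are illustrative and are not part of llama.cpp or Ollama.

```python
import json
import time
import urllib.error
import urllib.request


def parse_model_ids(body: bytes) -> list[str]:
    """Extract ids from an OpenAI-style {"data": [{"id": ...}, ...]} payload."""
    payload = json.loads(body)
    return [entry["id"] for entry in payload.get("data", [])]


def list_models(base_url: str, timeout: float = 5.0) -> list[str]:
    """GET {base_url}/models and return the model ids the server reports."""
    url = f"{base_url.rstrip('/')}/models"
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return parse_model_ids(resp.read())


def wait_until_ready(base_url: str, attempts: int = 30, delay_s: float = 2.0) -> list[str]:
    """Poll until the server answers, then return its model ids."""
    for _ in range(attempts):
        try:
            return list_models(base_url)
        except (urllib.error.URLError, ConnectionError, TimeoutError):
            # Not started yet, or still loading weights (HTTP errors land here too).
            time.sleep(delay_s)
    raise RuntimeError(f"no response from {base_url} after {attempts} attempts")


# Example (requires a running server):
#   wait_until_ready("http://localhost:8000/v1")
```

Use the http://localhost:11434/v1 base URL instead if you went the Ollama route.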

Push the context further on 16GB unified memory. Gemma 4 advertises a 256K context (max_position_embeddings 262,144), and its interleaved sliding-window attention (window 1024) + periodic global attention makes long context far cheaper in KV cache than a full-attention model of the same size. The full 256K won't fit in 16GB, but you can go well past 8K by quantizing the KV cache — add -fa on (Flash Attention, required for a quantized cache) and -ctk q8_0 -ctv q8_0, which roughly halves KV-cache memory versus f16 with minimal quality impact:

# Longer bounded context by 8-bit-quantizing the KV cache
llama-server -hf unsloth/gemma-4-12b-it-GGUF:Q6_K \
    --port 8000 -ngl 99 -c 32768 --jinja \
    -fa on -ctk q8_0 -ctv q8_0

Raise -c gradually and watch memory in Activity Monitor (or sudo powermetrics --samplers gpu_power) — on 16GB the Q6_K weights already take ~10GB, so the KV cache is your budget. Unified memory is shared with the OS: only about 70-75% (~11-12GB here) is practically usable for the model, and macOS caps GPU-wired memory via iogpu.wired_limit_mb. If you need more room for context, drop to Q5_K_M (8.41GB), Q4_K_M (7.12GB) or Google's QAT Q4_0 (6.98GB) — this same model also runs on larger Apple tiers (see the M3 Max card) where bf16 fits.
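
To get a feel for how the KV-cache budget scales before you raise -c, here is a rough estimator built from the architecture numbers in this recipe (48 layers, 8 KV heads, head_dim 256, window 1024). The ratio of global to sliding-window layers is not stated here, so GLOBAL_EVERY below is a hypothetical placeholder. The sketch also assumes every layer shares the same KV shape and ignores any padding the runtime adds, so treat the output as an order-of-magnitude guide, not a measurement.

```python
# Rough KV-cache size estimate for a hybrid sliding-window / global model.
# From the recipe: 48 layers, 8 KV heads, head_dim 256, window 1024.
N_LAYERS, N_KV_HEADS, HEAD_DIM, WINDOW = 48, 8, 256, 1024

# ASSUMPTION (not from the recipe): one global layer per GLOBAL_EVERY layers.
GLOBAL_EVERY = 6  # hypothetical; check the model config for the real ratio

# Bytes per cached element: f16 is 2 bytes; q8_0 packs 32 values into 34 bytes.
BYTES_PER_ELEM = {"f16": 2.0, "q8_0": 34 / 32}


def kv_cache_gb(ctx: int, cache_type: str = "f16", global_every: int = GLOBAL_EVERY) -> float:
    """Estimate KV-cache size in decimal GB at a given context length."""
    n_global = N_LAYERS // global_every
    n_local = N_LAYERS - n_global
    # K and V each store N_KV_HEADS * HEAD_DIM elements per token per layer.
    per_token_layer = 2 * N_KV_HEADS * HEAD_DIM * BYTES_PER_ELEM[cache_type]
    # Global layers cache every token; local layers cache at most WINDOW tokens.
    cached_tokens = n_global * ctx + n_local * min(ctx, WINDOW)
    return per_token_layer * cached_tokens / 1e9


if __name__ == "__main__":
    for ctx in (8192, 32768):
        for cache_type in ("f16", "q8_0"):
            print(f"ctx={ctx:6d} {cache_type:5s} ~{kv_cache_gb(ctx, cache_type):.2f}GB")
    # global_every=1 models a full-attention stack for comparison.
    print(f"full attention, 32K f16: ~{kv_cache_gb(32768, global_every=1):.2f}GB")
```

With the hypothetical ratio above, a 32K f16 cache comes to about 2.5GB and a q8_0 cache to about 1.3GB, against about 12.9GB for a full-attention stack of the same shape. That gap is why the quantized cache matters at 32K on this machine.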

Optional: image and audio input. To use Gemma 4's multimodal side, add the projector with --mmproj (download the mmproj-* file from the same GGUF repo) and serve via the multimodal path — for the CLI, llama-mtmd-cli is the multimodal front-end:

# Multimodal: LLM weights + the separate projector (mmproj)
llama-mtmd-cli -hf unsloth/gemma-4-12b-it-GGUF:Q6_K \
    --mmproj <path-to-mmproj-gguf> \
    -ngl 99 --jinja

The mmproj is a small extra file (~1GB) on top of the quant sizes above — only load it if you actually want to pass images or audio. On 16GB that ~1GB is worth budgeting for: shorten the context while it's loaded. Text chat doesn't need it at all.
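
If you serve the model over HTTP with the projector loaded, OpenAI-compatible clients conventionally send an image as a base64 data URI inside an image_url content part. The sketch below only builds that request body. Whether your server build accepts it is an assumption to verify, and the function name and default model string are illustrative.

```python
import base64


def image_chat_payload(image_bytes: bytes, prompt: str,
                       mime: str = "image/png",
                       model: str = "gemma-4-12b") -> dict:
    """Build an OpenAI-style chat body carrying one inline image.

    ASSUMPTION: the server accepts `image_url` content parts with data URIs.
    """
    encoded = base64.b64encode(image_bytes).decode("ascii")
    data_uri = f"data:{mime};base64,{encoded}"
    return {
        "model": model,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url", "image_url": {"url": data_uri}},
            ],
        }],
    }


# Example: read a local file and build the body to POST to /v1/chat/completions.
#   with open("photo.png", "rb") as fh:
#       body = image_chat_payload(fh.read(), "Describe this image.")
```

Base64 inflates the image by about a third, so large images make for large requests. Downscaling before encoding keeps both the request and the prompt smaller.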

With Ollama

Pull and run the community GGUF directly from Hugging Face; append a :quant tag to choose the quant (HF × Ollama docs):

ollama run hf.co/unsloth/gemma-4-12b-it-GGUF:Q6_K

Ollama serves an OpenAI-compatible API at http://localhost:11434/v1 for chat clients.

Use it as a chat assistant

Point any OpenAI-compatible chat client at your local endpoint by setting its base URL and a dummy API key — no cloud, no per-token cost.

Open WebUI (optional local chat front-end). A self-hosted, ChatGPT-style UI that talks to any OpenAI-compatible server. Run it and point it at your local endpoint:

# Point Open WebUI at your local llama-server (or Ollama on :11434)
docker run -d -p 3000:8080 \
    -e OPENAI_API_BASE_URL=http://host.docker.internal:8000/v1 \
    -e OPENAI_API_KEY=EMPTY \
    ghcr.io/open-webui/open-webui:main

Then open http://localhost:3000 and chat. (Open WebUI also autodetects a local Ollama install, so with the Ollama path you can skip the base-URL wiring entirely.)

Directly via the API. Any OpenAI SDK or curl works against the same endpoint — use it for scripts, writing tools, or your own app:

curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
      "model": "gemma-4-12b",
      "messages": [{"role": "user", "content": "Summarize this in three bullet points: ..."}]
    }'

Local servers don't check the key, so any non-empty string (e.g. EMPTY) works where a client requires one.
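
For scripts, the same call can be made from Python with only the standard library. This is a minimal sketch mirroring the curl request above. The function names are illustrative, and it assumes the standard OpenAI-style response shape (choices[0].message.content).

```python
import json
import urllib.request


def build_chat_request(base_url: str, prompt: str,
                       model: str = "gemma-4-12b",
                       api_key: str = "EMPTY") -> urllib.request.Request:
    """Build a POST to {base_url}/chat/completions with one user message."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }).encode("utf-8")
    return urllib.request.Request(
        f"{base_url.rstrip('/')}/chat/completions",
        data=body,
        headers={
            "Content-Type": "application/json",
            # Local servers don't check the key; any non-empty string works.
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


def chat(base_url: str, prompt: str, timeout: float = 300.0) -> str:
    """Send the prompt and return the assistant's reply text."""
    request = build_chat_request(base_url, prompt)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        reply = json.load(resp)
    return reply["choices"][0]["message"]["content"]


# Example (requires a running server):
#   print(chat("http://localhost:8000/v1", "Summarize this in three bullet points: ..."))
```

The generous timeout is deliberate: a local 12B model can take a while on long prompts, and a short client timeout would fail while the server is still generating.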

Results

  • VRAM usage: The dense ~12B model's footprint is its GGUF file plus the KV cache. Q6_K is 9.79GB on disk (byte-verified from the unsloth GGUF tree). Of the M2 Pro's 16GB unified memory, only ~11-12GB is practically usable for the model (the rest is shared with macOS), which leaves roughly 1.2-2.2GB for the KV cache. Gemma's sliding-window attention stretches that budget, and an 8-bit-quantized cache stretches it further (see Running). Q5_K_M (8.41GB), Q4_K_M (7.12GB) and Google's QAT Q4_0 (6.98GB) shrink the footprint for more context room. Q8_0 (12.67GB) is already slightly over the ~11-12GB usable ceiling: it may load, but it leaves almost nothing for the KV cache, so prefer Q6_K. The full-precision bf16 GGUF (23.83GB) does not fit in 16GB; step up to a larger unified-memory Mac (see the M3 Max tier).
  • Model capability (vendor evals — Google's own, NOT hardware throughput): Google reports MMLU Pro 77.2%, MMMLU 83.4%, GPQA Diamond 78.8%, AIME 2026 77.5%, LiveCodeBench v6 72.0%, and MMMU Pro (vision) 69.1% — a reasoning-strong card for its size. These are the vendor's benchmarks, not measurements on this GPU.
  • Speed: No community throughput benchmark for Gemma 4 12B on the Apple M2 Pro exists yet — we would rather omit a tok/s figure than invent one or borrow it from different hardware. Live measurements will appear at /check/gemma-4-12b/m2-pro once contributed.

For the full benchmark data, see /check/gemma-4-12b/m2-pro.

Troubleshooting

The chat template looks wrong / responses are malformed

Pass --jinja to llama-server so the GGUF's built-in chat template is applied — without it the assistant format won't parse. Gemma 4's chat template is complex (it includes a reasoning/thought channel), so applying it correctly matters more than for a plain instruct model. Use a recent llama.cpp build so the template is fully supported.

Images or audio aren't recognized

The plain LLM GGUF is text-only. To pass images or audio you must also load the separate mmproj projector with --mmproj and use the multimodal path (llama-mtmd-cli, or the multimodal server). Download the mmproj-* file from the same GGUF repo — it is a distinct file from the quant, and text chat works fine without it. On 16GB unified memory the projector's ~1GB is worth budgeting for, so shorten the context while running multimodal.

Out of memory at load, or when raising the context

Q6_K weights (9.79GB) leave only roughly 1-2GB under the ~11-12GB usable ceiling for the KV cache. Gemma's sliding-window attention keeps that cache smaller than a full-attention model's would be, so OOM usually means the context is set too high. Options, in order: quantize the KV cache with -fa on -ctk q8_0 -ctv q8_0 (roughly halves cache memory); lower -c; or drop to Q5_K_M (8.41GB), Q4_K_M (7.12GB) or Google's QAT Q4_0 (6.98GB) for more headroom. Avoid Q8_0 (12.67GB) here unless you keep the context tiny, because the file alone is already slightly over the usable ceiling. The bf16 GGUF (23.83GB) won't fit in 16GB at all.

torch / CUDA errors — this is llama.cpp, not a Python ML stack

Serving Gemma 4 via llama.cpp or Ollama does not require PyTorch, flash-attn wheels, or a Python ML stack. There is no CUDA and no nvidia-smi on a Mac — the GPU backend here is Metal. If GPU offload isn't happening, confirm you built (or downloaded) a Metal-enabled llama.cpp (Option A, -DGGML_METAL=ON) and that -ngl 99 is set. Monitor GPU/memory with Activity Monitor or sudo powermetrics --samplers gpu_power, not nvidia-smi.

Model or GPU 404 on /check

Gemma 4 12B is a new addition; if the /check/gemma-4-12b/m2-pro link 404s, the catalogue row is still being registered. The recipe's install and run steps are independent of the benchmark endpoint.

Common questions
How much VRAM does Gemma 4 12B need?

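The recommended Q6_K weights take 9.79GB, plus the KV cache. This recipe targets a 16GB unified-memory Mac, of which ~11-12GB is practically usable for the model.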

Which GPUs is Gemma 4 12B tested on?

This recipe targets the Apple M2 Pro (16GB unified memory). No community throughput measurements for it exist yet.

How hard is this setup?

Intermediate — follow the steps above.