How much VRAM does Gemma 4 12B need?

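About 24GB for the full-precision bf16 weights (23.83GB), plus the KV cache. This recipe targets a 48GB M3 Max, where roughly 34-36GB is usable for the model.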

How hard is this setup?

Intermediate: follow the steps below.

Gemma 4 12B on Apple M3 Max: Local Private Assistant via llama.cpp / Ollama (48GB)

What You'll Build

A fully local, private general assistant: Gemma 4 12B — Google DeepMind's open-weight multimodal generalist (Instruct, release 2026) — served as an OpenAI-compatible endpoint by llama.cpp (Metal) or Ollama on a single 48GB Apple M3 Max, then used from a chat UI (Open WebUI is a good local front-end) or directly via the API. This is a general assistant: Q&A, drafting and editing, multi-step reasoning, and — optionally — understanding images and audio you feed it. On an M3 Max there's so much headroom that the differentiator is maximum fidelity: you can run the full-precision bf16 GGUF (23.83GB) — no quantization at all — and still have room for a long context. Everything runs on your own hardware, so prompts, documents, images and audio never leave the machine.

Hardware data: Apple M3 Max (48GB unified memory, Metal) · Gemma 4 12B, GGUF bf16 (23.83GB, recommended — full precision) — or the near-lossless Q8_0 (12.67GB) as the lighter option, down to Q6_K (9.79GB) / Q4_K_M (7.12GB) / QAT Q4_0 (6.98GB) for the smallest footprint · See benchmark data

ℹ️ This is a dense ~12B multimodal generalist — no MoE. Gemma 4 12B is a Gemma4UnifiedForConditionalGeneration (model_type: gemma4_unified) — ~11.95B dense parameters, 48 layers, hidden size 3840, GQA with 16 query / 8 KV heads, head_dim 256. Because it is dense, its footprint is simply the quant file you load plus the KV cache; there is no "active-parameters" shortcut. It uses a unified, encoder-free design: images (raw patches) and audio (waveforms) are projected directly into the decoder rather than through a separate vision/audio encoder. Positioned and used as a general assistant, so we file it under llm.

ℹ️ Multimodal input is optional and needs a separate projector. Gemma 4 accepts text, image, and audio in, text out. The LLM GGUF you load for chat is text-only on its own — to feed it images or audio you also pass a separate mmproj projector GGUF with --mmproj (and use llama-mtmd-cli / the multimodal server path). The mmproj-* file is not the LLM and is excluded from the weight/VRAM math below — if you only need text chat, you don't need it at all.

ℹ️ Very long 256K context, made affordable by sliding-window attention. Gemma 4 advertises a 256K context window (max_position_embeddings 262,144). It uses hybrid attention: interleaved local sliding-window (window 1024) layers plus periodic full global attention (the final layer is always global). Sliding-window attention keeps the KV cache far smaller than a full-attention model at the same length — long context is genuinely cheap here. On 48GB of unified memory you have ample room: even the bf16 weights (23.83GB) leave a large budget for the KV cache, so a very long context is comfortable, and SWA makes it cheaper still.
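As a rough illustration of why sliding-window attention matters, the sketch below estimates KV-cache size from the figures quoted in this recipe (48 layers, 8 KV heads, head_dim 256, window 1024). The number of global-attention layers is not stated here, so the 8 used below (1 layer in 6) is a hypothetical value chosen only for illustration, and kv_cache_gib is an illustrative helper name. Real allocations in llama.cpp will differ somewhat (batch padding, scratch buffers), so treat the output as an order-of-magnitude estimate.

```shell
# Rough KV-cache estimate for a hybrid sliding-window / global-attention model.
# Architecture constants are the ones quoted in this recipe; the number of
# global layers is an ASSUMPTION and is passed in as the second argument.

kv_cache_gib() {
  # usage: kv_cache_gib <context_tokens> <global_layers> [bytes_per_element]
  awk -v ctx="$1" -v global="$2" -v bpe="${3:-2}" 'BEGIN {
    layers = 48; kv_heads = 8; head_dim = 256; window = 1024
    per_token_layer = 2 * kv_heads * head_dim * bpe   # K and V, one layer, one token
    local_layers = layers - global
    local_tokens = (ctx < window) ? ctx : window      # SWA layers keep at most one window
    bytes = per_token_layer * (global * ctx + local_layers * local_tokens)
    printf "%.1f\n", bytes / (1024 * 1024 * 1024)
  }'
}

echo "32K, all 48 layers global (full attention): $(kv_cache_gib 32768 48) GiB"
echo "32K, 8 global layers (hypothetical):        $(kv_cache_gib 32768 8) GiB"
echo "256K, 8 global layers (hypothetical):       $(kv_cache_gib 262144 8) GiB"
```

The first line is what a full-attention model with the same shape would need at f16; the other two show how much of that the sliding-window layers save under the assumed layer split.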

ℹ️ Runs on current llama.cpp out of the box. Gemma 4 support landed at the model's launch (~April 2026) and ggml-org ships official GGUFs — there is no special patch or PR gate. Just use a recent llama.cpp (or Ollama) build. Pass --jinja so the embedded chat template applies (it's a complex template that includes a reasoning/thought channel).

Requirements

Component	Minimum	Tested target
GPU	Apple Silicon with Metal, ~34-36GB usable unified memory for bf16	Apple M3 Max (48GB unified, Metal)
RAM	Unified memory is shared with the OS — 48GB total here	48GB unified (M3 Max)
Storage	~13GB (Q8_0) up to ~24GB (bf16); +~1GB for the optional mmproj	~24GB for bf16
Software	Recent llama.cpp (Metal) or Ollama; optional Open WebUI chat client	`llama-server`, Open WebUI

Model weights (first-party GGUF available). Unlike many open models, Gemma 4 ships official GGUFs. There are three good sources:

ggml-org first-party GGUF — ggml-org/gemma-4-12B-it-GGUF ships Q4_K_M (7.38GB, marginally larger than unsloth's 7.12GB in the table), Q8_0 (12.67GB) and the full-precision bf16 (23.83GB), plus the mmproj. On an M3 Max, the bf16 file here is the fidelity ceiling.
Google's own QAT Q4_0 — google/gemma-4-12b-it-qat-q4_0-gguf is a quantization-aware-trained Q4_0 (6.98GB). Because the model was fine-tuned for this quantization, it delivers noticeably better quality-per-byte than a naive Q4_0 — the low-footprint option if you'd rather leave most memory for other work. (The mmproj-* file in that repo is the vision/audio projector, not the LLM.)
Community K_M ladder — unsloth/gemma-4-12b-it-GGUF provides the conventional ladder used in the fit table below.

Byte-verified on-disk sizes (unsloth K_M ladder, plus ggml-org bf16 and Google's QAT):

Quant	On-disk size	Fit on M3 Max (48GB unified, ~34-36GB usable)
QAT Q4_0 (Google)	6.98GB	Quantization-aware-trained — smallest footprint; leaves the most memory for other apps
Q4_K_M	7.12GB	Tiny footprint — enormous KV-cache / context headroom on this Mac
Q5_K_M	8.41GB	Small footprint with a quality bump over Q4
Q6_K	9.79GB	Comfortable — near-lossless-feeling with lots of room
Q8_0	12.67GB	Near-lossless — the lighter recommended option; leaves ~20GB+ under the usable ceiling for a very large KV cache
bf16	23.83GB	Recommended — full precision, maximum fidelity. Fits comfortably under the ~34-36GB usable ceiling with room to spare for a long KV cache; the differentiator of this tier
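The fit column is plain subtraction against the usable ceiling. A small sketch that reproduces it from the table's sizes and the ~34-36GB usable range quoted in this recipe (headroom_gb is an illustrative helper name):

```shell
# Headroom left for the KV cache = usable unified memory - weight file size.
# Sizes (GB) are copied from the table; 34 and 36 are this recipe's usable range.

headroom_gb() {
  # usage: headroom_gb <weights_gb> <usable_gb>
  awk -v w="$1" -v u="$2" 'BEGIN { printf "%.2f\n", u - w }'
}

while read -r name size; do
  printf "%-9s %6sGB weights -> %s to %s GB free\n" \
    "$name" "$size" "$(headroom_gb "$size" 34)" "$(headroom_gb "$size" 36)"
done <<'EOF'
QAT_Q4_0 6.98
Q4_K_M 7.12
Q5_K_M 8.41
Q6_K 9.79
Q8_0 12.67
bf16 23.83
EOF
```

The free figure has to cover the KV cache and, if you load it, the optional mmproj.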

Not model weights — don't count these in the VRAM math:

The mmproj-* file is the multimodal (image/audio) projector, loaded separately with --mmproj only if you want image/audio input. It is not part of the text-chat weights.
Any *-MTP* / mtp-* file is a multi-token-prediction / speculative-decode draft head — not the model weights either.

Licensing. Gemma 4 is Apache-2.0 — free for commercial and non-commercial use, no revenue caps (model card). This is a notable change: earlier Gemma generations (1–3) shipped under the custom "Gemma Terms of Use", and Gemma 4 moved to standard Apache-2.0. Google layers a separate Prohibited Use Policy on top (disallowed use cases apply regardless of the license), but the weights themselves are Apache-2.0.

Installation

You have two GGUF runtimes; pick one. Both are fine for this model — there is no patch requirement — so choose Ollama for the fastest start, or llama.cpp for the most control over context, KV-cache quantization, and multimodal input. On Apple Silicon both use the Metal GPU backend automatically; there is no CUDA and no nvidia-smi on a Mac.

Option A — llama.cpp with Metal

Build a recent llama.cpp with the Metal backend enabled, per the official build guide:

git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
# Apple Silicon: build with the Metal GPU backend
cmake -B build -DGGML_METAL=ON
cmake --build build --config Release -j 8

A recent release is all you need — Gemma 4 has been mainline in llama.cpp since its launch. If you prefer a prebuilt binary, grab a current macOS build from the releases page. Metal is the default GPU backend on Apple Silicon, so -DGGML_METAL=ON (on by default in recent trees) is all that's required — no CUDA toolkit, no driver setup.

Option B — Ollama

Ollama is built on llama.cpp and is the fastest way to stand this model up; on Apple Silicon it uses Metal automatically. Either use the curated tag (ollama run gemma4:12b, if listed) or pull the community GGUF straight from Hugging Face (HF × Ollama docs):

ollama run hf.co/unsloth/gemma-4-12b-it-GGUF:Q8_0

For the full-precision bf16 weights, pull the ggml-org GGUF instead (hf.co/ggml-org/gemma-4-12B-it-GGUF:BF16). Ollama serves an OpenAI-compatible API at http://localhost:11434/v1 for chat clients.

Running

With llama.cpp

Serve an OpenAI-compatible API on port 8000. The -hf flag pulls the GGUF from Hugging Face; on this tier, load the full-precision bf16 for maximum fidelity (llama-server docs):

# bf16 — full precision, maximum fidelity (recommended on a 48GB M3 Max)
llama-server -hf ggml-org/gemma-4-12B-it-GGUF:BF16 \
    --port 8000 \
    -ngl 99 \
    -c 32768 \
    --jinja

-ngl 99 (--n-gpu-layers) offloads every layer to the Metal GPU — the full-precision bf16 file (23.83GB) sits in unified memory well under the ~34-36GB practical ceiling, leaving room for a large KV cache.
-c 32768 sets a 32K context. With ~10GB+ free after the bf16 weights and Gemma's sliding-window attention keeping the KV cache modest, you can push this much higher.
--jinja applies the GGUF's built-in chat template so the assistant format parses correctly — Gemma 4's template is complex (it includes a reasoning/thought channel), so this flag matters.
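Loading a ~24GB file takes a moment, so a script that starts the server and immediately sends a request can fail. A minimal sketch of a readiness loop follows. It assumes llama-server's /health endpoint, which returns an error status until the model has finished loading; the helper name wait_until_ready and the retry counts are illustrative.

```shell
# Retry a probe command until it succeeds or the attempts run out.

wait_until_ready() {
  # usage: wait_until_ready <max_tries> <sleep_seconds> <probe command...>
  wur_tries="$1"
  wur_pause="$2"
  shift 2
  wur_i=0
  while [ "$wur_i" -lt "$wur_tries" ]; do
    if "$@" >/dev/null 2>&1; then
      return 0                      # probe succeeded: server is ready
    fi
    sleep "$wur_pause"
    wur_i=$((wur_i + 1))
  done
  return 1                          # gave up after <max_tries> attempts
}

# Example (assumes the llama-server command above is running on port 8000):
#   wait_until_ready 120 1 curl -sf http://localhost:8000/health && echo "ready"
```

Passing the probe as arguments keeps the loop reusable: the same helper works against Ollama by swapping in a request to its port.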

Prefer the lighter, near-lossless option? Swap the -hf tag for ggml-org/gemma-4-12B-it-GGUF:Q8_0 (12.67GB) — it leaves ~20GB+ free for the KV cache and is very close to bf16 in quality.

Push toward the 256K context window. Gemma 4 advertises a 256K context (max_position_embeddings 262,144), and its interleaved sliding-window attention (window 1024) + periodic global attention makes long context far cheaper in KV cache than a full-attention model of the same size. Even at bf16 on 48GB you have generous room; you can go further by quantizing the KV cache — add -fa on (Flash Attention, required for a quantized cache) and -ctk q8_0 -ctv q8_0, which roughly halves KV-cache memory versus f16 with minimal quality impact:

# Very long context by 8-bit-quantizing the KV cache
llama-server -hf ggml-org/gemma-4-12B-it-GGUF:BF16 \
    --port 8000 -ngl 99 -c 131072 --jinja \
    -fa on -ctk q8_0 -ctv q8_0

The full 256K is within reach here in a way it isn't on machines with less memory. Watch memory in Activity Monitor (or sudo powermetrics --samplers gpu_power). Unified memory is shared with the OS, so only about 70-75% (~34-36GB here) is practically usable for the model, and macOS caps GPU-wired memory via iogpu.wired_limit_mb. Even so, bf16 weights plus a long KV cache fit comfortably in that budget.
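To see the cap macOS is applying, you can read the sysctl named above. The sketch below also computes the megabyte value for a given budget. The helper name gb_to_wired_mb is illustrative, and the command that actually changes the limit is left commented out: it needs root, it resets on reboot, and setting it too high can starve macOS itself.

```shell
# Read the current GPU wired-memory cap. On many systems this reports 0,
# which is understood to mean the macOS default is in effect (an assumption
# worth checking on your own machine). Guarded so it only runs on macOS.
if [ "$(uname -s)" = "Darwin" ]; then
  sysctl iogpu.wired_limit_mb || true
fi

gb_to_wired_mb() {
  # usage: gb_to_wired_mb <whole_gigabytes>  -> value in MB for the sysctl
  echo $(( $1 * 1024 ))
}

# Example: 36GB, the top of this recipe's usable range.
echo "iogpu.wired_limit_mb for 36GB: $(gb_to_wired_mb 36)"

# To apply it (at your own risk; not persistent across reboots):
#   sudo sysctl iogpu.wired_limit_mb="$(gb_to_wired_mb 36)"
```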

Optional: image and audio input. To use Gemma 4's multimodal side, add the projector with --mmproj (download the mmproj-* file from the same GGUF repo) and serve via the multimodal path — for the CLI, llama-mtmd-cli is the multimodal front-end:

# Multimodal: LLM weights + the separate projector (mmproj)
llama-mtmd-cli -hf ggml-org/gemma-4-12B-it-GGUF:BF16 \
    --mmproj <path-to-mmproj-gguf> \
    -ngl 99 --jinja

The mmproj is a small extra file (~1GB) on top of the quant sizes above — only load it if you actually want to pass images or audio. On 48GB there's ample room for it alongside even the bf16 weights; text chat doesn't need it at all.

With Ollama

Pull and run the GGUF directly from Hugging Face; append a :quant tag to choose the quant (HF × Ollama docs). The example below uses the first-party ggml-org repo for bf16:

ollama run hf.co/ggml-org/gemma-4-12B-it-GGUF:BF16

Swap :BF16 for :Q8_0 (near-lossless, lighter) if you'd rather leave more memory free. Ollama serves an OpenAI-compatible API at http://localhost:11434/v1 for chat clients.

Use it as a chat assistant

Point any OpenAI-compatible chat client at your local endpoint by setting its base URL and a dummy API key — no cloud, no per-token cost.

Open WebUI (optional local chat front-end). A self-hosted, ChatGPT-style UI that talks to any OpenAI-compatible server. Run it and point it at your local endpoint:

# Point Open WebUI at your local llama-server (or Ollama on :11434)
docker run -d -p 3000:8080 \
    -e OPENAI_API_BASE_URL=http://host.docker.internal:8000/v1 \
    -e OPENAI_API_KEY=EMPTY \
    ghcr.io/open-webui/open-webui:main

Then open http://localhost:3000 and chat. (Open WebUI also autodetects a local Ollama install, so with the Ollama path you can skip the base-URL wiring entirely.)

Directly via the API. Any OpenAI SDK or curl works against the same endpoint — use it for scripts, writing tools, or your own app:

curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
      "model": "gemma-4-12b",
      "messages": [{"role": "user", "content": "Summarize this in three bullet points: ..."}]
    }'

By default neither llama-server nor Ollama checks the key (llama-server only enforces one if you start it with --api-key), so any non-empty string (e.g. EMPTY) works where a client requires one.

Results

VRAM usage: The dense ~12B loads entirely as its GGUF file — the full-precision bf16 is 23.83GB on disk (byte-verified from the ggml-org GGUF tree). On the M3 Max's 48GB unified memory — of which only ~34-36GB is practically usable for the model (the rest is shared with macOS) — that leaves ~10GB+ for the KV cache, stretched further by Gemma's sliding-window attention (and further still with an 8-bit-quantized cache; see Running). The near-lossless Q8_0 (12.67GB) is the lighter option, leaving ~20GB+ free; Q6_K (9.79GB), Q5_K_M (8.41GB), Q4_K_M (7.12GB) and Google's QAT Q4_0 (6.98GB) shrink the footprint further if you want to leave memory for other apps. The full-precision GGUF fitting comfortably is the point of this tier — you get maximum fidelity with no quantization.
Model capability (vendor evals — Google's own, NOT hardware throughput): Google reports MMLU Pro 77.2%, MMMLU 83.4%, GPQA Diamond 78.8%, AIME 2026 77.5%, LiveCodeBench v6 72.0%, and MMMU Pro (vision) 69.1% — a reasoning-strong card for its size. These are the vendor's benchmarks, not measurements on this GPU.
Speed: No community throughput benchmark for Gemma 4 12B on the Apple M3 Max exists yet — we would rather omit a tok/s figure than invent one or borrow it from different hardware. Live measurements will appear at /check/gemma-4-12b/m3-max once contributed.
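Until a benchmark is contributed, you can take a rough reading yourself. The sketch below times one chat request against the llama-server endpoint from Running and divides the completion_tokens value from the OpenAI-style usage block by wall-clock seconds. The helper names are illustrative, the prompt is arbitrary, and the result includes prompt processing and HTTP overhead at one-second timer resolution, so treat it as a loose lower bound rather than a benchmark.

```shell
completion_tokens() {
  # usage: completion_tokens '<response json>'  -> integer token count
  printf '%s' "$1" \
    | grep -o '"completion_tokens"[[:space:]]*:[[:space:]]*[0-9][0-9]*' \
    | grep -o '[0-9][0-9]*$'
}

tokens_per_second() {
  # usage: tokens_per_second <tokens> <seconds>
  awk -v t="$1" -v s="$2" 'BEGIN {
    if (s <= 0) { print "n/a"; exit }   # request finished within the timer resolution
    printf "%.1f\n", t / s
  }'
}

measure_once() {
  # Sends one request to the local server and prints tok/s (needs curl).
  mo_start=$(date +%s)
  mo_body=$(curl -s http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "gemma-4-12b", "max_tokens": 256,
         "messages": [{"role": "user", "content": "Write a 200-word story."}]}')
  mo_end=$(date +%s)
  tokens_per_second "$(completion_tokens "$mo_body")" "$((mo_end - mo_start))"
}

# With the server running:  measure_once
```

Run it a few times and discard the first result, which also pays for warm-up.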

For the full benchmark data, see /check/gemma-4-12b/m3-max.

Troubleshooting

The chat template looks wrong / responses are malformed

Pass --jinja to llama-server so the GGUF's built-in chat template is applied — without it the assistant format won't parse. Gemma 4's chat template is complex (it includes a reasoning/thought channel), so applying it correctly matters more than for a plain instruct model. Use a recent llama.cpp build so the template is fully supported.

Images or audio aren't recognized

The plain LLM GGUF is text-only. To pass images or audio you must also load the separate mmproj projector with --mmproj and use the multimodal path (llama-mtmd-cli, or the multimodal server). Download the mmproj-* file from the same GGUF repo — it is a distinct file from the quant, and text chat works fine without it. On 48GB there's plenty of room to keep it loaded alongside even the bf16 weights.

Out of memory, especially when raising the context

At bf16 (23.83GB) on 48GB you have ~10GB+ under the ~34-36GB usable ceiling for the KV cache, and Gemma's sliding-window attention keeps that cache smaller than a full-attention model would — so OOM is unlikely at sane context sizes. If you do hit it while chasing the full 256K: quantize the KV cache with -fa on -ctk q8_0 -ctv q8_0 (roughly halves cache memory); lower -c; or drop to Q8_0 (12.67GB), which frees ~11GB more for context. Remember unified memory is shared with macOS, so the practical budget is ~34-36GB, not the full 48GB.
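To get a feel for how those levers trade off, the sketch below inverts the usual KV-cache formula: given a cache budget, it returns the largest context that fits. It uses this recipe's architecture figures. The count of global-attention layers (8) is a hypothetical value because it is not stated here, max_context is an illustrative helper name, and q8_0 is approximated as 1.0625 bytes per element (34 bytes per block of 32 values). Treat the outputs as order-of-magnitude only.

```shell
# Largest context that fits a KV-cache budget, for a hybrid SWA/global model.
# ASSUMPTION: the global layer count is supplied by the caller; it is not a
# figure from this recipe. Assumes the context is at least one window long.

max_context() {
  # usage: max_context <budget_gib> <global_layers> <bytes_per_element>
  awk -v budget="$1" -v global="$2" -v bpe="$3" 'BEGIN {
    layers = 48; kv_heads = 8; head_dim = 256; window = 1024; limit = 262144
    if (global < 1) { print "global_layers must be >= 1"; exit 1 }
    per_token_layer = 2 * kv_heads * head_dim * bpe   # K and V, one layer, one token
    token_layers = budget * 1024 * 1024 * 1024 / per_token_layer
    ctx = int((token_layers - (layers - global) * window) / global)
    if (ctx < 0) ctx = 0
    if (ctx > limit) ctx = limit                      # cap at the advertised 256K window
    print ctx
  }'
}

echo "10 GiB budget, f16 cache:  $(max_context 10 8 2) tokens"
echo "10 GiB budget, q8_0 cache: $(max_context 10 8 1.0625) tokens"
```

Swapping Q8_0 weights in for bf16 shows up here as a larger first argument; quantizing the cache shows up as a smaller third one.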

`torch` / CUDA errors — this is llama.cpp, not a Python ML stack

Serving Gemma 4 via llama.cpp or Ollama does not require PyTorch, flash-attn wheels, or a Python ML stack. There is no CUDA and no nvidia-smi on a Mac — the GPU backend here is Metal. If GPU offload isn't happening, confirm you built (or downloaded) a Metal-enabled llama.cpp (Option A, -DGGML_METAL=ON) and that -ngl 99 is set. Monitor GPU/memory with Activity Monitor or sudo powermetrics --samplers gpu_power, not nvidia-smi.

Model or GPU 404 on /check

Gemma 4 12B is a new addition; if the /check/gemma-4-12b/m3-max link 404s, the catalogue row is still being registered. The recipe's install and run steps are independent of the benchmark endpoint.