How much VRAM does Gemma 4 E4B-IT need?

About 5 GB — the minimum this recipe targets.

How hard is this setup?

Beginner. Follow the steps below.

Gemma 4 E4B on Apple M4 Max: local vision-language inference in unified memory with MLX-VLM

What You'll Build

A fully-local vision-language Gemma 4 E4B instance on an Apple M4 Max with 48 GB unified memory, running on Apple's native MLX-VLM runtime with 4-bit weights — no NVIDIA GPU, no CUDA, no FlashAttention. Gemma 4 E4B is Google's 4.5B-effective-parameter (8B with embeddings) instruction-tuned multimodal model that reads text and images and replies in text; you point it at a photo or screenshot and ask questions about it from the command line. At ~4.9 GB the 4-bit weights are small enough that this is an easy on-ramp to local multimodal inference on a Mac — it fits with room to spare and needs no memory tuning.

Hardware data: Apple M4 Max (48 GB unified memory) · MLX-VLM 4-bit weights ~4.9 GB on disk · See benchmark data

ℹ️ Unified memory is not VRAM. The M4 Max has 48 GB of unified memory shared by CPU and GPU — not 48 GB of dedicated VRAM. By default macOS only lets the GPU address roughly two-thirds of it on a sub-64 GB machine (~32 GB safe via Metal's recommendedMaxWorkingSetSize). At ~4.9 GB the 4-bit Gemma 4 E4B sits so far below that ceiling that the addressable-share caveat is moot here — it runs comfortably on any Apple Silicon Mac, down to a 16 GB MacBook Air (~10.5 GB GPU-addressable), with no wired-limit tuning.

ℹ️ Vision-language input, text-only output. Gemma 4 E4B reads images and text and replies in text (Google model card: input "Text, Image, Audio", output text). It does not generate images, speech, or video. The model also accepts audio input as a capability — and MLX-VLM does expose audio understanding for the 2B/4B Gemma 4 sizes (mlx-vlm Gemma 4 docs) — but image + text is the best-trodden Apple path and the focus of this recipe. For text-to-speech on this Mac see Kokoro; for image generation, the FLUX / Z-Image line.

Requirements

Component	Minimum	Tested
GPU / memory	16 GB unified memory (~10.5 GB GPU-addressable)	Apple M4 Max (48 GB unified memory, ~32 GB addressable)
RAM	Same pool — unified	48 GB unified
Storage	~5 GB (MLX 4-bit) / ~6 GB (GGUF Q4_K_M + mmproj)	~5 GB
Software	Python 3.10+, macOS Sonoma 14 / Sequoia 15+	macOS Sequoia 15

The binding constraint on Apple Silicon is addressable unified memory, not raw capacity, but for a 4.5B-effective model at 4-bit it barely binds at all. The MLX 4-bit weights are a single 5.22 GB model.safetensors shard (HF tree API for mlx-community/gemma-4-e4b-it-4bit); that is about 4.9 GiB, which matches the ~4.9 GB figure used elsewhere in this recipe. Against the ~32 GB the M4 Max's GPU can address by default (roughly two-thirds of its 48 GB unified pool), that leaves enormous headroom for the vision encoder's activations, the KV-cache, and macOS itself. Even the smallest current Apple Silicon config (16 GB → ~10.5 GB addressable) clears the weights with room for a generous context window, so this recipe is not memory-bound on any of these machines, least of all the M4 Max.
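The headroom arithmetic above can be sketched in a few lines of Python. This is a rough estimate, not a measurement: the two-thirds addressable share is an assumption taken from the "roughly two-thirds" figure in the text (the real ceiling is whatever Metal reports as recommendedMaxWorkingSetSize), and the function names are hypothetical.

```python
# Rough memory-headroom estimate for the MLX 4-bit weights.
# ASSUMPTION: the GPU-addressable share is modelled as two-thirds of
# unified memory; the actual limit is reported by Metal at runtime.
ADDRESSABLE_FRACTION = 2 / 3

# Shard size quoted in the text for mlx-community/gemma-4-e4b-it-4bit.
WEIGHTS_GB = 5.22


def gb_to_gib(gb: float) -> float:
    """Convert decimal gigabytes (10^9 bytes) to gibibytes (2^30 bytes)."""
    return gb * 1e9 / 2**30


def addressable_gb(unified_gb: float) -> float:
    """Estimated GPU-addressable memory under the two-thirds assumption."""
    return unified_gb * ADDRESSABLE_FRACTION


def headroom_gb(unified_gb: float, weights_gb: float = WEIGHTS_GB) -> float:
    """Memory left for activations, KV-cache and macOS after the weights."""
    return addressable_gb(unified_gb) - weights_gb


if __name__ == "__main__":
    print(f"weights: {WEIGHTS_GB} GB = {gb_to_gib(WEIGHTS_GB):.2f} GiB")
    for unified in (16, 48):
        print(
            f"{unified} GB unified: ~{addressable_gb(unified):.1f} GB addressable, "
            f"~{headroom_gb(unified):.1f} GB headroom after weights"
        )
```

Under this assumption the 16 GB case comes out slightly above the ~10.5 GB quoted in the table, which is expected for a round-number rule of thumb.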

Installation

1. Install MLX-VLM (the Apple-native vision path)

pip install -U mlx-vlm

MLX is Apple's array framework; mlx-vlm is its vision-language front-end, and it lists Gemma 4 as a supported model family (Blaizzy/mlx-vlm). There is nothing CUDA-shaped to install — no torch build flags, no cu12x wheel, no FlashAttention. (For a text-only LLM front-end you would reach for mlx-lm; for the multimodal image path you want mlx-vlm.)

2. Run the model on an image (weights download on first use)

mlx_vlm.generate \
  --model mlx-community/gemma-4-e4b-it-4bit \
  --image http://images.cocodataset.org/val2017/000000039769.jpg \
  --prompt "Describe this image." \
  --max-tokens 500

On first run, mlx-vlm pulls the 4-bit weights (~4.9 GB, single shard) from the mlx-community Hugging Face org and caches them under ~/.cache/huggingface. These mirror weights are an ungated re-distribution of Google's release, so no license-acceptance step is needed to download them; the model is governed by the Apache 2.0 license. Swap the --image URL for a local path (e.g. --image ~/Pictures/receipt.jpg) to ask questions about your own files.

For a text-only prompt, simply omit --image:

mlx_vlm.generate \
  --model mlx-community/gemma-4-e4b-it-4bit \
  --prompt "What is the capital of France?" \
  --max-tokens 500
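If you want to call the CLI from a script, a thin wrapper is enough. The sketch below only assembles the same flags shown in the two commands above (--model, --image, --prompt, --max-tokens); the function names are hypothetical, and it assumes mlx_vlm.generate is on your PATH after the pip install.

```python
import os
import shlex
import subprocess
from typing import Optional

MODEL = "mlx-community/gemma-4-e4b-it-4bit"


def build_generate_cmd(
    prompt: str,
    image: Optional[str] = None,
    max_tokens: int = 500,
    model: str = MODEL,
) -> list[str]:
    """Build the argv list for mlx_vlm.generate; omit --image for text-only."""
    cmd = ["mlx_vlm.generate", "--model", model]
    if image is not None:
        # No shell is involved, so expand a leading "~" ourselves.
        # URLs are returned unchanged by expanduser.
        cmd += ["--image", os.path.expanduser(image)]
    cmd += ["--prompt", prompt, "--max-tokens", str(max_tokens)]
    return cmd


def run_generate(prompt: str, image: Optional[str] = None) -> str:
    """Run the CLI and return whatever it wrote to stdout."""
    result = subprocess.run(
        build_generate_cmd(prompt, image=image),
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout


if __name__ == "__main__":
    # Print the command instead of running it, so this is safe to try anywhere.
    print(shlex.join(build_generate_cmd("Describe this image.", image="photo.jpg")))
```

Passing the arguments as a list avoids shell quoting problems when a prompt contains quotes or spaces.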

Running

Once installed, the same mlx_vlm.generate command is your day-to-day entry point: point it at any image plus a question and it streams a text answer to the terminal. Recommended sampling for Gemma 4 is temperature=1.0, top_p=0.95, top_k=64 (from the model card); pass --temperature 1.0 --top-p 0.95 to match the first two. Those two flags leave top_k at the runtime's default.
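To make the three sampling settings concrete, here is a generic, pure-Python sketch of temperature scaling followed by top-k and top-p (nucleus) filtering on a toy distribution. It illustrates the textbook technique only; it is not MLX-VLM's internal implementation, the order in which real runtimes apply the filters can differ, and the token names and logits are made up.

```python
import math
import random


def filter_distribution(
    logits: dict[str, float],
    temperature: float = 1.0,
    top_k: int = 64,
    top_p: float = 0.95,
) -> dict[str, float]:
    """Return the renormalised distribution left after top-k and top-p."""
    # Temperature-scaled softmax (subtract the max for numerical stability).
    scaled = {tok: value / temperature for tok, value in logits.items()}
    peak = max(scaled.values())
    weights = {tok: math.exp(value - peak) for tok, value in scaled.items()}
    total = sum(weights.values())
    ranked = sorted(
        ((tok, w / total) for tok, w in weights.items()),
        key=lambda pair: pair[1],
        reverse=True,
    )

    # top-k: keep only the k most likely tokens.
    ranked = ranked[:top_k]

    # top-p: keep the smallest prefix whose cumulative mass reaches top_p.
    kept = []
    cumulative = 0.0
    for tok, prob in ranked:
        kept.append((tok, prob))
        cumulative += prob
        if cumulative >= top_p:
            break

    norm = sum(prob for _, prob in kept)
    return {tok: prob / norm for tok, prob in kept}


def sample(dist: dict[str, float], rng: random.Random) -> str:
    """Draw one token from the filtered distribution."""
    return rng.choices(list(dist), weights=list(dist.values()), k=1)[0]


if __name__ == "__main__":
    toy_logits = {"Paris": 5.0, "Lyon": 2.0, "Nice": 1.0, "Rome": -3.0}
    dist = filter_distribution(toy_logits)  # defaults mirror the model card
    print(dist)
    print(sample(dist, random.Random(0)))
```

With a vocabulary this small top_k=64 never bites; the top_p cut is what removes the unlikely tail.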

Alternative: the GGUF path (llama.cpp-Metal / Ollama / LM Studio)

If you prefer the portable GGUF ecosystem, the same model ships as GGUF with a separate multimodal projector (mmproj) file for image input. On Apple Silicon, Metal is enabled by default in llama.cpp ("On MacOS, Metal is enabled by default. Using Metal makes the computation run on the GPU.", llama.cpp build docs), so a standard cmake -B build && cmake --build build --config Release already runs on the GPU; no CUDA, no ROCm.

# Ollama (simplest) — pulls the Unsloth Q4_K_M GGUF + projector
ollama run hf.co/unsloth/gemma-4-E4B-it-GGUF:Q4_K_M

The unsloth/gemma-4-E4B-it-GGUF repo hosts gemma-4-E4B-it-Q4_K_M.gguf (4.98 GB) alongside mmproj-F16.gguf (990 MB) — the projector is what enables image input through llama.cpp/Ollama. For the highest-quality 4-bit option, Google publishes an official QAT (Quantization-Aware Training) build: google/gemma-4-E4B-it-qat-q4_0-gguf ships gemma-4-E4B_q4_0-it.gguf (5.15 GB) + gemma-4-E4B-it-mmproj.gguf (992 MB). Google's card describes QAT as a technique "which allows preserving similar quality to bfloat16 while dramatically reducing the memory requirements to load the model" (QAT GGUF card) — so QAT 4-bit gives you near-BF16 quality at the 4-bit footprint, a good reason to prefer it over a plain post-training quant. LM Studio runs both MLX and GGUF from a GUI if you prefer not to touch the terminal.

Both the ~4.9 GB MLX 4-bit and the ~5–6 GB GGUF + projector sit so far inside the M4 Max's ~32 GB default-addressable pool that neither needs the wired-limit raise a 70B-class model would.
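The "~5–6 GB GGUF + projector" figure is just the sum of the two files in each repo. A minimal sketch using the sizes quoted above (the dictionary layout and function name are illustrative):

```python
# File sizes in GB as quoted in this recipe for each GGUF repo.
GGUF_BUILDS_GB = {
    "unsloth/gemma-4-E4B-it-GGUF": {
        "gemma-4-E4B-it-Q4_K_M.gguf": 4.98,
        "mmproj-F16.gguf": 0.990,
    },
    "google/gemma-4-E4B-it-qat-q4_0-gguf": {
        "gemma-4-E4B_q4_0-it.gguf": 5.15,
        "gemma-4-E4B-it-mmproj.gguf": 0.992,
    },
}


def total_download_gb(files: dict[str, float]) -> float:
    """Total size of the main GGUF plus its multimodal projector."""
    return sum(files.values())


if __name__ == "__main__":
    for repo, files in GGUF_BUILDS_GB.items():
        print(f"{repo}: {total_download_gb(files):.2f} GB")
```

Both totals land at roughly 6 GB, in line with the storage row in the Requirements table.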

Results

Speed: No first-party Apple M4 Max benchmark for this pair has been recorded yet — /check/gemma-4-e4b/m4-max currently returns verdict: unknown with no measurements. We are deliberately not quoting a token/sec figure: token generation on Apple Silicon is bandwidth-bound (the M4 Max runs ~546 GB/s unified memory) and no chip-named first-party throughput figure exists for Gemma 4 E4B on this Mac. If you run this, please contribute your tok/s so we can seed a real datapoint.
Memory usage: ~4.9 GB resident for the 4-bit weights, plus the vision encoder's activations during image processing and a KV-cache that grows with context. Fits the ~32 GB default-addressable pool many times over — memory is not the limiting factor on this hardware.
Quality notes: E4B is the daily-driver tier of the Gemma 4 family — 4.5B effective / 8B with embeddings, 42 layers, 128K context, instruction-tuned (model card). The 4-bit quantization trades a small amount of quality for the small footprint; with ~32 GB addressable on the M4 Max you have ample headroom for the higher-fidelity 8-bit (mlx-community/gemma-4-e4b-it-8bit) or even bf16 (mlx-community/gemma-4-e4b-it-bf16) MLX builds if you want maximum fidelity — the E4B is small enough that even 8-bit leaves the memory budget untouched. For GGUF, prefer the official QAT q4_0 build for near-BF16 quality at the 4-bit size.

For the full benchmark data (and to be the first to populate it), see /check/gemma-4-e4b/m4-max.

Troubleshooting

Tried to install FlashAttention / bitsandbytes / a cu12x wheel and it failed

None of those apply on Apple Silicon. There is no CUDA, no FlashAttention, and no GPU bitsandbytes kernel on macOS — MLX uses its own Metal attention and its own 4-bit quantization, and llama.cpp uses Metal + GGUF K-quants. If a generic Gemma 4 tutorial tells you to pip install flash-attn, pass --load-in-4bit, or load a GPTQ/AWQ build, skip those steps entirely; the mlx_vlm.generate and GGUF commands above are the complete Apple path.

Images aren't being read on the GGUF path

Image input through llama.cpp/Ollama requires the multimodal projector (mmproj) file in addition to the main GGUF. The ollama run hf.co/unsloth/gemma-4-E4B-it-GGUF:Q4_K_M command above pulls both automatically. If you build a llama-server invocation by hand, pass the projector explicitly with --mmproj (e.g. mmproj-F16.gguf from the Unsloth repo, or gemma-4-E4B-it-mmproj.gguf from the official QAT repo). Without it, the model loads as text-only. The MLX-VLM path bundles vision support in the single model repo, so it has no separate projector step.
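For a hand-built invocation, the difference between a text-only and an image-capable server is just the --mmproj argument. The sketch below assembles a llama-server argv list with llama.cpp's -m and --mmproj flags; the helper names are hypothetical and the file paths are placeholders for wherever you downloaded the two files.

```python
from typing import Optional


def build_llama_server_cmd(model_path: str, mmproj_path: Optional[str] = None) -> list[str]:
    """Build a llama-server argv list; pass mmproj_path to enable image input."""
    cmd = ["llama-server", "-m", model_path]
    if mmproj_path is not None:
        cmd += ["--mmproj", mmproj_path]
    return cmd


def image_input_enabled(cmd: list[str]) -> bool:
    """True if the command passes a projector; without one the model loads text-only."""
    return "--mmproj" in cmd


if __name__ == "__main__":
    # Placeholder paths: adjust to your local download location.
    with_vision = build_llama_server_cmd(
        "gemma-4-E4B-it-Q4_K_M.gguf", mmproj_path="mmproj-F16.gguf"
    )
    text_only = build_llama_server_cmd("gemma-4-E4B-it-Q4_K_M.gguf")
    print(" ".join(with_vision), "-> images:", image_input_enabled(with_vision))
    print(" ".join(text_only), "-> images:", image_input_enabled(text_only))
```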

Audio or video input doesn't behave like the image path

Gemma 4 E4B accepts audio as a model capability, and MLX-VLM exposes audio understanding for the 2B/4B sizes (mlx-vlm Gemma 4 docs), but the GGUF/Ollama path's mmproj covers image only — audio is not wired through llama.cpp's vision projector. If you need audio understanding on Apple Silicon, stay on the MLX-VLM runtime. Video is a documented model input but is not surfaced by these single-clip Apple CLIs; treat image + text as the production-ready path here.

enable_thinking not triggering on E4B

The Gemma 4 chat template exposes an enable_thinking flag, but discussions on the HF card note it does not activate an extended Thinking Mode on the E4B / E2B sizes. Treat E4B as a fast direct-answer multimodal model; the larger 26B-MoE / 31B-dense Gemma 4 variants are the path for explicit reasoning (and the M4 Max's 48 GB has room for the 31B at 4-bit if you want it).

Do I need to raise the unified-memory wired limit?

No. The sudo sysctl iogpu.wired_limit_mb raise matters only when a model's weights plus KV-cache exceed the default-addressable share (~32 GB on a 48 GB Mac) — a 70B-class problem. Gemma 4 E4B at ~4.9 GB sits far below the default ceiling on every Apple Silicon Mac the site covers, so the default limit is more than sufficient. Leave it alone.
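The rule in that answer reduces to one comparison. A minimal sketch, assuming the same two-thirds addressable share as the rest of this recipe; the KV-cache sizes and the 70B weight estimate (70 billion parameters at half a byte each) are illustrative assumptions, not measurements.

```python
def needs_wired_limit_raise(
    weights_gb: float,
    kv_cache_gb: float,
    unified_gb: float,
    addressable_fraction: float = 2 / 3,  # ASSUMPTION: default share on sub-64 GB Macs
) -> bool:
    """True if weights plus KV-cache exceed the default GPU-addressable share."""
    return weights_gb + kv_cache_gb > unified_gb * addressable_fraction


if __name__ == "__main__":
    # Gemma 4 E4B at 4-bit with a hypothetical 2 GB KV-cache.
    print(needs_wired_limit_raise(4.9, 2.0, unified_gb=48))
    print(needs_wired_limit_raise(4.9, 2.0, unified_gb=16))
    # Hypothetical 70B model at 4-bit: 70e9 params * 0.5 bytes = 35 GB of weights.
    print(needs_wired_limit_raise(35.0, 2.0, unified_gb=48))
```

Only the 70B-class case crosses the threshold on a 48 GB machine, which is why the recipe leaves the wired limit alone.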

No other widely-reported issues. Report problems via the submission form.