How much VRAM does Phi-4 need?

About 16GB of unified memory, the minimum this recipe targets. On a Mac the GPU shares system memory, so there is no separate VRAM pool.

How hard is this setup?

Intermediate. Follow the steps below.

Phi-4 (14B) on Apple M2 Pro: Local Private Assistant via llama.cpp / Ollama (16GB)

What You'll Build

A fully local, private general assistant: Phi-4 — Microsoft's flagship 14B of the Phi-4 family (released December 2024) — served as an OpenAI-compatible endpoint by llama.cpp or Ollama on an Apple M2 Pro with 16GB of unified memory, then used from a chat UI (Open WebUI is a good local front-end) or directly via the API. This is a text-only chat/reasoning/writing model with a reputation for strong STEM, math, and multi-step reasoning for its size — the result of heavy training on curated and synthetic data. The M2 Pro is the entry Apple tier for a 14B: 16GB unified is genuinely tight here, so quant choice matters. Everything runs on your own hardware, so prompts and documents never leave the machine.

Hardware data: Apple M2 Pro (16GB unified) · Phi-4, GGUF Q5_K_M (10.41GB, recommended) — or Q4_K_M (8.89GB) for more room; Q6_K (12.03GB) is borderline · See benchmark data

ℹ️ This is a dense, text-only 14B generalist: no MoE, no vision tower, no image input. Phi-4 is a Phi3ForCausalLM (model_type: phi3) with 40 layers, hidden size 5120, and GQA with 40 query / 10 KV heads. Because it is dense, its footprint is simply the quant file you load plus the KV cache; there is no "active-parameters" shortcut. This is the generalist Phi-4, not Phi-4-reasoning / -reasoning-plus (separate reasoning-RL variants) and not Phi-4-mini; don't conflate them.

⚠️ Context window is only 16K tokens (16,384). Phi-4's max_position_embeddings is 16,384 — notably shorter than most current peers, which often reach 128K. This caps long-document and long-conversation use: keep inputs within ~16K tokens. The upside is that the KV cache stays modest even at full context — a real help on a 16GB Mac, where every gigabyte counts.
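To put a number on "modest", the KV cache can be estimated from the architecture figures quoted above (40 layers, 10 KV heads). The sketch below is a back-of-envelope calculation, not llama.cpp's exact allocator. It assumes a head dimension of 128 (hidden size 5120 divided by 40 query heads), 2 bytes per element for an f16 cache, and 34 bytes per 32 elements for a q8_0 cache (32 int8 values plus one 16-bit scale). The helper name kv_cache_bytes is made up for illustration.

```shell
#!/usr/bin/env bash
# Rough KV-cache size estimate for Phi-4 (an approximation, not llama.cpp's allocator).
LAYERS=40      # from the model config quoted above
KV_HEADS=10    # from the model config quoted above
HEAD_DIM=128   # assumption: hidden size 5120 / 40 query heads

# kv_cache_bytes <context_tokens> <f16|q8_0>  (hypothetical helper name)
kv_cache_bytes() {
    local ctx=$1 kind=$2
    # K and V each hold LAYERS * KV_HEADS * HEAD_DIM elements per token.
    local elems=$(( 2 * LAYERS * KV_HEADS * HEAD_DIM * ctx ))
    case "$kind" in
        f16)  echo $(( elems * 2 )) ;;        # 2 bytes per element
        q8_0) echo $(( elems * 34 / 32 )) ;;  # 34-byte block per 32 elements
        *)    echo "unknown cache type: $kind" >&2; return 1 ;;
    esac
}

for kind in f16 q8_0; do
    bytes=$(kv_cache_bytes 16384 "$kind")
    echo "$kind @ 16384 tokens: $(( bytes / 1000000 )) MB"
done
```

Under these assumptions the full 16K cache comes to about 3.4GB at f16 and about 1.8GB at q8_0, before any compute buffers the runtime allocates on top.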

ℹ️ Runs on current llama.cpp out of the box. Phi-4 uses the long-supported phi3 architecture — there is no special patch or PR gate. Just use a recent llama.cpp (Metal) or Ollama build together with a recent GGUF (see the template/EOS note below), and pass --jinja so the embedded chat template applies. Phi-4 uses the <|im_start|>role<|im_sep|>…<|im_end|> chat template (not Tekken).

⚠️ Early-2025 Phi-4 GGUFs had a chat-template / EOS bug. The first GGUFs (January 2025) shipped with the wrong end-of-turn token — EOS was set to <|endoftext|> instead of <|im_end|>, which causes runaway / garbled generation that won't stop cleanly. It is fixed in current llama.cpp and current GGUFs — unsloth documented the fix and re-uploaded corrected files. If you grabbed an early build, re-download a current GGUF and update llama.cpp. This is the single most common Phi-4 gotcha; see Troubleshooting.

Requirements

Component	Minimum	Tested target
GPU	Apple M2 Pro (16GB unified memory, Metal)	Apple M2 Pro (16GB unified)
Memory	16GB unified (shared with macOS — see note)	16GB unified
Storage	~9GB (Q4_K_M) up to ~10.4GB (Q5_K_M)	~10.4GB for Q5_K_M
Software	Recent llama.cpp (Metal) or Ollama + a recent GGUF; optional Open WebUI chat client	`llama-server`, Open WebUI

Model weights (first-party GGUF exists). Microsoft publishes both the full-precision weights (microsoft/phi-4) and an official GGUF (microsoft/phi-4-gguf, MIT). Note a naming quirk in Microsoft's repo: it names its K-quants Q4_K / Q5_K (not Q4_K_M / Q5_K_M) — sizes there are Q4_K 9.05GB, Q4_K_S 8.44GB, Q5_K 10.60GB, Q6_K 12.03GB, Q8_0 15.58GB, BF16 29.32GB. For most users we recommend the community unsloth/phi-4-GGUF (MIT), which ships the conventional K_M ladder with the documented chat-template fix already applied; bartowski/phi-4-GGUF offers imatrix quants too.

Unified-memory reality on a 16GB Mac. The GPU shares the same 16GB pool as macOS and everything else running. In practice only about 70–75% of unified memory is usable by the GPU — roughly 11–12GB here — the rest is reserved for the OS. (macOS caps GPU allocations via iogpu.wired_limit_mb; you can raise it with care, but don't starve the OS.) That ceiling is what makes quant choice tight on this tier. Byte-verified on-disk sizes (unsloth K_M ladder), read against the ~11–12GB GPU-usable ceiling:

Quant	On-disk size	Fit on M2 Pro (16GB unified, ~11–12GB GPU-usable)
Q4_K_M	8.89GB	Comfortable — leaves the most room for the KV cache and the OS; the safest choice if you want headroom
Q5_K_M	10.41GB	Recommended — best quality that still fits with a KV-quantized 16K cache under the ~11–12GB ceiling
Q6_K	12.03GB	Borderline — sits right on the ~11–12GB usable ceiling; may fit with a tight, KV-quantized context but leaves almost nothing for the OS. Prefer Q5_K_M unless you have measured room
Q8_0	15.58GB	Does not fit — exceeds the ~11–12GB GPU-usable pool on 16GB
F16 / BF16	29.32GB	Full precision — nowhere near fits 16GB unified
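The fit column can be sanity-checked with simple subtraction. The sketch below takes the on-disk sizes from the table and the upper end of the GPU-usable range from the note above. The KV-cache figures are assumptions: back-of-envelope estimates for a q8_0 cache derived from 40 layers, 10 KV heads and an assumed head dimension of 128. It ignores compute buffers and anything else macOS has resident, so treat the output as a list of what to measure first, not as a guarantee. The helper name headroom_mb is hypothetical.

```shell
#!/usr/bin/env bash
# Headroom estimate in decimal MB: usable GPU pool minus weights minus KV cache.
# Ignores compute buffers and other resident apps, so real headroom is smaller.

# headroom_mb <usable_mb> <quant_mb> <kv_mb>  (hypothetical helper name)
headroom_mb() {
    echo $(( $1 - $2 - $3 ))
}

USABLE_MB=12000   # upper end of the ~11-12GB range quoted above
KV_16K_MB=1783    # assumption: estimated q8_0 cache at 16,384 tokens
KV_8K_MB=891      # assumption: estimated q8_0 cache at 8,192 tokens

# Quant sizes taken from the table above.
for entry in Q4_K_M:8890 Q5_K_M:10410 Q6_K:12030 Q8_0:15580; do
    name=${entry%%:*}
    size=${entry##*:}
    echo "$name: $(headroom_mb "$USABLE_MB" "$size" "$KV_16K_MB") MB spare at 16K," \
         "$(headroom_mb "$USABLE_MB" "$size" "$KV_8K_MB") MB spare at 8K"
done
```

Results near or below zero mark the combinations where shortening -c or stepping down a quant is the first thing to try. The ceiling itself is approximate and adjustable through iogpu.wired_limit_mb, so a small negative number is a prompt to measure rather than proof that the model cannot load.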

Licensing. Phi-4 is MIT — free for commercial and non-commercial use, no revenue caps (model card).

Installation

You have two GGUF runtimes; pick one. Both are fine for this model — there is no patch requirement — so choose Ollama for the fastest start, or llama.cpp for the most control. Either way, use a recent build and a recent GGUF so the chat-template / EOS fix is in place.

Option A — llama.cpp with Metal

On Apple Silicon, llama.cpp uses the Metal GPU backend (there is no CUDA on a Mac). Install the Xcode Command Line Tools first (xcode-select --install), then build a recent llama.cpp with Metal enabled, per the official build guide:

git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
# Apple Silicon: build the Metal GPU backend
cmake -B build -DGGML_METAL=ON
cmake --build build --config Release -j 8

A recent release matters here: it carries the corrected Phi-4 chat template / EOS handling. If you prefer a prebuilt binary, grab a current macOS/arm64 build from the releases page. Metal is on by default on Apple Silicon builds; the flag above just makes it explicit.

Option B — Ollama

Ollama is built on llama.cpp and is the fastest way to stand this model up. On Apple Silicon it uses Metal automatically — no GPU flags. Either use the curated tag (ollama run phi4) or pull the community GGUF straight from Hugging Face (HF × Ollama docs):

ollama run hf.co/unsloth/phi-4-GGUF:Q5_K_M

Swap the :Q5_K_M tag for :Q4_K_M if you want more headroom on this 16GB tier. Prefer a recent Ollama version so the current, fixed template is used.

Running

With llama.cpp

Serve an OpenAI-compatible API on port 8000. The -hf flag pulls the GGUF from Hugging Face; append :Q5_K_M (case-insensitive) to pick the quant (llama-server docs). On a 16GB Mac, quantize the KV cache so the full 16K context fits alongside Q5_K_M weights:

# Q5_K_M (recommended for 16GB), offload all layers to Metal, full 16K context with a quantized KV cache
llama-server -hf unsloth/phi-4-GGUF:Q5_K_M \
    --port 8000 \
    -ngl 99 \
    -c 16384 \
    -fa on -ctk q8_0 -ctv q8_0 \
    --jinja

-ngl 99 (--n-gpu-layers) offloads every layer to the Metal GPU — the dense 14B quant file (10.41GB at Q5_K_M) sits in unified memory under the ~11–12GB GPU-usable ceiling.
-c 16384 sets the full 16K context — that's Phi-4's ceiling (max_position_embeddings 16384), so there's no benefit to requesting more.
-fa on -ctk q8_0 -ctv q8_0 turns on flash attention and quantizes the KV cache to 8-bit, roughly halving its memory footprint compared with the default f16 cache. On this 16GB tier that's what keeps Q5_K_M + a full-16K cache inside the usable pool; drop it only if you shorten the context and have measured room.
--jinja applies the GGUF's built-in <|im_start|>…<|im_sep|>…<|im_end|> chat template so the assistant format parses correctly — required for clean turn boundaries.
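Loading roughly 10GB of weights takes a moment, so a script that talks to the server should wait until it is ready. A minimal sketch, assuming llama-server's /health endpoint (which returns an error status while the model is still loading) and the port used above. The helper name wait_for_server, the retry count, and the 2-second interval are illustrative choices.

```shell
#!/usr/bin/env bash
# Wait for llama-server to finish loading before sending requests.
BASE_URL=${BASE_URL:-http://localhost:8000}   # matches --port 8000 above

# wait_for_server <base_url> [max_tries]  (hypothetical helper name)
wait_for_server() {
    local url=$1 tries=${2:-60} i=1
    while [ "$i" -le "$tries" ]; do
        # -f makes curl exit non-zero on HTTP errors, e.g. while the model loads
        if curl -sf "$url/health" > /dev/null; then
            echo "ready after $i check(s)"
            return 0
        fi
        sleep 2
        i=$(( i + 1 ))
    done
    echo "server not ready after $tries checks" >&2
    return 1
}

# Usage, once llama-server has been started in another terminal:
#   wait_for_server "$BASE_URL" 60 && curl -s "$BASE_URL/v1/models"
```

The follow-up call to /v1/models lists the model id the server reports, which is a quick way to confirm the right GGUF was loaded.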

If memory is tight, drop to Q4_K_M (8.89GB) for more room, or reduce -c below 16K. Note there is no nvidia-smi on a Mac — watch GPU/memory pressure in Activity Monitor (Window → GPU History) or with sudo powermetrics --samplers gpu_power.

With Ollama

Pull and run the community GGUF directly from Hugging Face; append a :quant tag to choose the quant (HF × Ollama docs):

ollama run hf.co/unsloth/phi-4-GGUF:Q5_K_M

Ollama serves an OpenAI-compatible API at http://localhost:11434/v1 for chat clients. Use a recent Ollama build so the corrected Phi-4 template is applied.
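Ollama applies its own default context length, which may be shorter than Phi-4's 16K depending on the version you run. One way to pin it is a Modelfile with a num_ctx parameter. The sketch below writes such a file; the model name phi4-16k, the file path, and the helper name write_modelfile are placeholders, and it assumes your Ollama build accepts an hf.co reference in FROM.

```shell
#!/usr/bin/env bash
# Write a Modelfile that pins the context window to Phi-4's 16K ceiling.
MODELFILE=${MODELFILE:-./Modelfile}   # placeholder path

# write_modelfile <path> <num_ctx>  (hypothetical helper name)
write_modelfile() {
    local path=$1 num_ctx=$2
    cat > "$path" <<EOF
FROM hf.co/unsloth/phi-4-GGUF:Q5_K_M
PARAMETER num_ctx $num_ctx
EOF
}

write_modelfile "$MODELFILE" 16384
echo "wrote $MODELFILE"

# Then, with Ollama installed (phi4-16k is a placeholder name):
#   ollama create phi4-16k -f ./Modelfile
#   ollama run phi4-16k
```

A larger context means a larger KV cache, so on this 16GB tier check memory pressure after raising it, and fall back to :Q4_K_M or a smaller num_ctx if macOS starts swapping.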

Use it as a chat assistant

Point any OpenAI-compatible chat client at your local endpoint by setting its base URL and a dummy API key — no cloud, no per-token cost.

Open WebUI (optional local chat front-end). A self-hosted, ChatGPT-style UI that talks to any OpenAI-compatible server. Run it and point it at your local endpoint:

# Point Open WebUI at your local llama-server (or Ollama on :11434)
docker run -d -p 3000:8080 \
    -e OPENAI_API_BASE_URL=http://host.docker.internal:8000/v1 \
    -e OPENAI_API_KEY=EMPTY \
    ghcr.io/open-webui/open-webui:main

Then open http://localhost:3000 and chat. (Open WebUI also autodetects a local Ollama install, so with the Ollama path you can skip the base-URL wiring entirely.)

Directly via the API. Any OpenAI SDK or curl works against the same endpoint — use it for scripts, writing tools, or your own app:

curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
      "model": "phi-4",
      "messages": [{"role": "user", "content": "Summarize this in three bullet points: ..."}]
    }'

By default, local servers don't check the key, so any non-empty string (e.g. EMPTY) works where a client requires one.

Results

Memory usage: The dense 14B loads entirely as its GGUF file in unified memory — Q5_K_M is 10.41GB on disk (byte-verified from the unsloth GGUF tree). On the M2 Pro's 16GB, only ~11–12GB is GPU-usable (the rest is reserved for macOS), so Q5_K_M fits with a KV-quantized 16K cache and Q4_K_M (8.89GB) fits with more room. Q6_K (12.03GB) is borderline — it sits right on the usable ceiling — and Q8_0 (15.58GB) does not fit the 16GB pool. The full-precision F16/BF16 GGUF (29.32GB) is out of reach here entirely.
Model capability (vendor evals — Microsoft's own, NOT hardware throughput): Microsoft reports MMLU 84.8, GPQA 56.1, MATH 80.4, HumanEval 82.6, MGSM 80.6, DROP 75.5 — strong STEM/math/reasoning for a 14B. (Its SimpleQA score is low by design — that's a factuality / hallucination-resistance probe, not a capability score.) These are the vendor's benchmarks, not measurements on this GPU.
Speed: No community throughput benchmark for Phi-4 on the Apple M2 Pro exists yet — we would rather omit a tok/s figure than invent one or borrow it from different hardware. Live measurements will appear at /check/phi-4/m2-pro once contributed.

For benchmark data as it is contributed, see /check/phi-4/m2-pro.

Troubleshooting

Runaway / garbled generation that won't stop — old GGUF template bug

This is the headline Phi-4 gotcha. Early-2025 (January) Phi-4 GGUFs shipped with the wrong end-of-turn token — EOS was <|endoftext|> instead of <|im_end|> — so the model never stops cleanly and produces runaway or garbled output. Fix it by using a current GGUF (the unsloth GGUF ships the corrected template — unsloth's write-up) and a recent llama.cpp / Ollama build. If you downloaded weights in early 2025, re-download them. Also make sure you pass --jinja so the embedded template is applied.
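One hedged way to tell whether you are affected: send a trivially short request with a token cap and look at finish_reason in the response. In OpenAI-compatible responses, "stop" means the model ended its turn and "length" means it ran into the cap. The sketch assumes the llama-server endpoint on port 8000 from above. The helper names and the prompt are illustrative, and the sed extraction is a simplification that expects the field to appear once per line of JSON.

```shell
#!/usr/bin/env bash
# Runaway check: a healthy Phi-4 GGUF should stop on its own well before the cap.
BASE_URL=${BASE_URL:-http://localhost:8000}

# finish_reason_of <json>: extract the finish_reason value from a response body.
finish_reason_of() {
    printf '%s\n' "$1" \
        | sed -n 's/.*"finish_reason"[[:space:]]*:[[:space:]]*"\([a-z_]*\)".*/\1/p' \
        | head -n 1
}

# check_stops_cleanly: ask for a one-word reply with a 64-token cap.
check_stops_cleanly() {
    local body reason
    body=$(curl -s "$BASE_URL/v1/chat/completions" \
        -H "Content-Type: application/json" \
        -d '{"model": "phi-4", "max_tokens": 64,
             "messages": [{"role": "user", "content": "Reply with the single word: ok"}]}')
    reason=$(finish_reason_of "$body")
    if [ "$reason" = "stop" ]; then
        echo "ok: model ended its turn"
    else
        echo "suspect: finish_reason=${reason:-missing}"
        return 1
    fi
}

# Usage, with llama-server running:
#   check_stops_cleanly
```

A "length" result on a one-word prompt is a hint, not proof, of the old EOS problem; it is the cue to re-download the GGUF and update the runtime as described above.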

The chat template looks wrong / responses are malformed

Pass --jinja to llama-server so the GGUF's built-in chat template is applied; without it the assistant format won't parse. Phi-4 uses the <|im_start|>role<|im_sep|>…<|im_end|> template, a ChatML-style format that differs both from earlier Phi-3 models' <|user|> / <|end|> tags and from Mistral's Tekken, even though the model loads through the phi3 architecture. The GGUF / llama.cpp path uses the embedded tokenizer and template, so there's no extra Python install required.

Out of memory on the 16GB unified pool

This is the tight tier, so OOM is the likeliest snag. Remember only ~11–12GB of the 16GB is GPU-usable — the rest is macOS. In order: keep the KV cache quantized (-fa on -ctk q8_0 -ctv q8_0); if still tight, reduce -c below 16K; then drop the quant to Q4_K_M (8.89GB) for the most headroom. Q6_K (12.03GB) is borderline on this ceiling — prefer Q5_K_M unless you've measured room. Q8_0 (15.58GB) does not fit, and the F16/BF16 GGUF (29.32GB) is far out of reach. You can raise the GPU allocation ceiling with sudo sysctl iogpu.wired_limit_mb=<value>, but leave several GB for the OS.
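If you do raise the ceiling, it helps to derive the value from a fixed OS reserve instead of guessing. A small sketch: the 3072 MiB reserve is an illustrative figure (pick your own), the helper name wired_limit_mb is hypothetical, and the script only prints the sysctl command so nothing changes until you run it yourself.

```shell
#!/usr/bin/env bash
# Pick a GPU wired-memory limit that leaves a fixed reserve for macOS.

# wired_limit_mb <total_mb> <reserve_mb>  (hypothetical helper name)
wired_limit_mb() {
    local total_mb=$1 reserve_mb=$2
    if [ "$reserve_mb" -ge "$total_mb" ]; then
        echo "reserve must be smaller than total memory" >&2
        return 1
    fi
    echo $(( total_mb - reserve_mb ))
}

# Example for a 16GB machine (16384 MiB), keeping an illustrative 3072 MiB for the OS.
limit=$(wired_limit_mb 16384 3072)
echo "sudo sysctl iogpu.wired_limit_mb=$limit"

# On the Mac itself, total memory in bytes is available from: sysctl -n hw.memsize
```

Values set with sysctl this way do not persist across a reboot, which makes it easy to back out if the system becomes sluggish.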

No `nvidia-smi` — how to watch memory on a Mac

There is no nvidia-smi on Apple Silicon; that's an NVIDIA-only tool. Watch GPU and memory pressure in Activity Monitor (the Memory tab, and Window → GPU History) or on the command line with sudo powermetrics --samplers gpu_power. If macOS starts swapping, back off the quant or the context length.
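For a scriptable signal that macOS has started swapping, the swap counter can be read from sysctl. The sketch below assumes the usual vm.swapusage line format ("total = …M  used = …M  free = …M"); if your macOS version prints something different, the helper returns nothing. The helper name swap_used_mb is made up for illustration.

```shell
#!/usr/bin/env bash
# Report swap in use on macOS by parsing the vm.swapusage sysctl line.

# swap_used_mb <vm.swapusage line>  (hypothetical helper name)
swap_used_mb() {
    printf '%s\n' "$1" | sed -n 's/.*used = \([0-9.]*\)M.*/\1/p'
}

# Only query sysctl on macOS; the key does not exist elsewhere.
if [ "$(uname)" = "Darwin" ]; then
    used=$(swap_used_mb "$(sysctl -n vm.swapusage)")
    echo "swap in use: ${used:-unknown} MB"
fi
```

A figure that keeps climbing while the model is generating is the cue to back off the quant or the context length.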

`torch` / Metal errors — this is llama.cpp, not a Python ML stack

Serving Phi-4 via llama.cpp or Ollama does not require PyTorch, flash-attn wheels, or a Python ML stack. If you hit a Metal error, confirm you built (or downloaded) a Metal-enabled llama.cpp (Option A, -DGGML_METAL=ON) rather than a CPU-only binary — and that your macOS is current. At 14B, Q5_K_M is already near the quality of full precision, so there's no need to chase the larger quants on this tier.

Model or GPU 404 on /check

Phi-4 is a new addition; if the /check/phi-4/m2-pro link 404s, the catalogue row is still being registered. The recipe's install and run steps are independent of the benchmark endpoint.