How much VRAM does Phi-4 need?

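This recipe targets 48GB of unified memory (Apple Silicon has no separate VRAM). The weights themselves take 29.32GB at F16 or 15.58GB at Q8_0, plus the KV cache.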

How hard is this setup?

Intermediate: follow the steps below.

Phi-4 (14B) on Apple M3 Max: Full-Precision Local Assistant via llama.cpp / Ollama (48GB)

What You'll Build

A fully local, private general assistant: Phi-4 — Microsoft's flagship 14B of the Phi-4 family (released December 2024) — served as an OpenAI-compatible endpoint by llama.cpp or Ollama on an Apple M3 Max with 48GB of unified memory, then used from a chat UI (Open WebUI is a good local front-end) or directly via the API. This is a text-only chat/reasoning/writing model with a reputation for strong STEM, math, and multi-step reasoning for its size — the result of heavy training on curated and synthetic data. With 48GB unified, the M3 Max has enough headroom to run the full-precision F16 GGUF — the differentiator on this tier: no quantization loss at all. Everything runs on your own hardware, so prompts and documents never leave the machine.

Hardware data: Apple M3 Max (48GB unified) · Phi-4, GGUF F16 (29.32GB, recommended for max fidelity) — or Q8_0 (15.58GB) as the lighter near-lossless option · See benchmark data

ℹ️ This is a dense, text-only 14B generalist — no MoE, no vision. Phi-4 is a Phi3ForCausalLM (model_type: phi3) — 40 layers, hidden size 5120, GQA with 40 query / 10 KV heads. Because it is dense, its footprint is simply the quant file you load plus the KV cache; there is no "active-parameters" shortcut. It is a pure text model — there is no vision tower and no image input. This is the generalist Phi-4 — not Phi-4-reasoning / -reasoning-plus (separate reasoning-RL variants) and not Phi-4-mini; don't conflate them.

⚠️ Context window is only 16K tokens (16,384). Phi-4's max_position_embeddings is 16,384 — notably shorter than most current peers, which often reach 128K. This caps long-document and long-conversation use: keep inputs within ~16K tokens. The upside is that the KV cache stays small even at full context — on 48GB you can run the full-precision weights and the whole 16K cache with room to spare.
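As a rough check on the claim that the cache stays small, the sketch below estimates the F16 KV-cache size from the architecture numbers in the note above (40 layers, 10 KV heads, hidden size 5120 across 40 query heads). The head dimension of 128 is derived as 5120 / 40 rather than quoted, and the calculation assumes 2 bytes per element and ignores llama.cpp's compute buffers, so treat the result as a lower bound, not a measurement.

```shell
#!/bin/sh
# Rough KV-cache estimate for a dense GQA transformer.
# Formula: 2 (K and V) * layers * kv_heads * head_dim * tokens * bytes_per_element
kv_cache_bytes() {
    layers=$1; kv_heads=$2; head_dim=$3; tokens=$4; bytes_per_elem=$5
    echo $(( 2 * layers * kv_heads * head_dim * tokens * bytes_per_elem ))
}

# Phi-4 figures from this recipe; head_dim = 5120 / 40 is derived, not quoted.
head_dim=$(( 5120 / 40 ))
bytes=$(kv_cache_bytes 40 10 "$head_dim" 16384 2)

echo "head_dim:        $head_dim"
echo "KV cache bytes:  $bytes"
echo "KV cache (MiB):  $(( bytes / 1024 / 1024 ))"
```

With these inputs the estimate comes to 3200 MiB (about 3.4GB) at the full 16K context, which is the amount to budget on top of the weights file.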

ℹ️ Runs on current llama.cpp out of the box. Phi-4 uses the long-supported phi3 architecture — there is no special patch or PR gate. Just use a recent llama.cpp (Metal) or Ollama build together with a recent GGUF (see the template/EOS note below), and pass --jinja so the embedded chat template applies. Phi-4 uses the <|im_start|>role<|im_sep|>…<|im_end|> chat template (not Tekken).

⚠️ Early-2025 Phi-4 GGUFs had a chat-template / EOS bug. The first GGUFs (January 2025) shipped with the wrong end-of-turn token — EOS was set to <|endoftext|> instead of <|im_end|>, which causes runaway / garbled generation that won't stop cleanly. It is fixed in current llama.cpp and current GGUFs — unsloth documented the fix and re-uploaded corrected files. If you grabbed an early build, re-download a current GGUF and update llama.cpp. This is the single most common Phi-4 gotcha; see Troubleshooting.

Requirements

Component	Minimum	Tested target
GPU	Apple M3 Max (48GB unified memory, Metal)	Apple M3 Max (48GB unified)
Memory	48GB unified (shared with macOS — see note)	48GB unified
Storage	~16GB (Q8_0) up to ~30GB (F16)	~30GB for F16
Software	Recent llama.cpp (Metal) or Ollama + a recent GGUF; optional Open WebUI chat client	`llama-server`, Open WebUI

Model weights (first-party GGUF exists). Microsoft publishes both the full-precision weights (microsoft/phi-4) and an official GGUF (microsoft/phi-4-gguf, MIT). Note a naming quirk in Microsoft's repo: it names its K-quants Q4_K / Q5_K (not Q4_K_M / Q5_K_M). Sizes there are Q4_K 9.05GB, Q4_K_S 8.44GB, Q5_K 10.60GB, Q6_K 12.03GB, Q8_0 15.58GB, BF16 29.32GB. For most users we recommend the community unsloth/phi-4-GGUF (MIT), which ships the conventional K_M ladder with the documented chat-template fix already applied; bartowski/phi-4-GGUF offers imatrix quants too. Byte-verified on-disk sizes for the unsloth ladder are listed in the quant table below.

Unified-memory reality on a 48GB Mac. The GPU shares the same 48GB pool as macOS and everything else running. In practice only about 70–75% of unified memory is usable by the GPU — roughly 34–36GB here — the rest is reserved for the OS. (macOS caps GPU allocations via iogpu.wired_limit_mb; you can raise it with care, but don't starve the OS.) Even so, that ceiling is comfortably above the full-precision file, so fidelity is a free choice on this tier. Sizes read against the ~34–36GB GPU-usable ceiling:

Quant	On-disk size	Fit on M3 Max (48GB unified, ~34–36GB GPU-usable)
Q4_K_M	8.89GB	Tiny footprint, enormous headroom — only if you want to load several models at once
Q5_K_M	10.41GB	Comfortable, lots of room
Q6_K	12.03GB	Comfortable — near-lossless-feeling with lots of room
Q8_0	15.58GB	Near-lossless weights with roughly 18–20GB of the usable pool to spare; the lighter high-quality option
F16 / BF16	29.32GB	Recommended — full precision, no quantization loss at all; fits comfortably under the ~34–36GB usable ceiling with the full-16K KV cache. The best-fidelity choice on 48GB
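The fit column can be reproduced with a little arithmetic. The sketch below is illustrative only: it takes the on-disk sizes from the table, applies the 70% end of the 70–75% GPU-usable range to a 48GB pool, and prints the headroom left for each quant. The function name `headroom_gb` is made up for this example.

```shell
#!/bin/sh
# Headroom = (total unified memory * GPU-usable fraction) - model file size.
# All sizes in GB; prints two decimals (a negative value means no fit).
headroom_gb() {
    awk -v total="$1" -v frac="$2" -v size="$3" \
        'BEGIN { printf "%.2f\n", total * frac - size }'
}

# On-disk sizes from the table above; 0.70 is the conservative end of the
# 70-75% range, so these are worst-case headroom figures.
for entry in Q4_K_M:8.89 Q5_K_M:10.41 Q6_K:12.03 Q8_0:15.58 F16:29.32; do
    name=${entry%%:*}
    size=${entry##*:}
    echo "$name  headroom at 70%: $(headroom_gb 48 0.70 "$size") GB"
done
```

Whatever headroom remains has to cover the KV cache and llama.cpp's compute buffers, so the F16 row is the one worth checking against your own workload.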

Licensing. Phi-4 is MIT — free for commercial and non-commercial use, no revenue caps (model card).

Installation

You have two GGUF runtimes; pick one. Both are fine for this model — there is no patch requirement — so choose Ollama for the fastest start, or llama.cpp for the most control. Either way, use a recent build and a recent GGUF so the chat-template / EOS fix is in place.

Option A — llama.cpp with Metal

On Apple Silicon, llama.cpp uses the Metal GPU backend (there is no CUDA on a Mac). Install the Xcode Command Line Tools first (xcode-select --install), then build a recent llama.cpp with Metal enabled, per the official build guide:

git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
# Apple Silicon: build the Metal GPU backend
cmake -B build -DGGML_METAL=ON
cmake --build build --config Release -j 8

A recent release matters here: it carries the corrected Phi-4 chat template / EOS handling. If you prefer a prebuilt binary, grab a current macOS/arm64 build from the releases page. Metal is on by default on Apple Silicon builds; the flag above just makes it explicit.

Option B — Ollama

Ollama is built on llama.cpp and is the fastest way to stand this model up. On Apple Silicon it uses Metal automatically — no GPU flags. Either use the curated tag (ollama run phi4) or pull the community GGUF straight from Hugging Face (HF × Ollama docs):

ollama run hf.co/unsloth/phi-4-GGUF:F16

Swap the :F16 tag for :Q8_0 if you want the lighter near-lossless option. Ollama serves an OpenAI-compatible API at http://localhost:11434/v1 for chat clients. Prefer a recent Ollama version so the current, fixed template is used.

Running

With llama.cpp

Serve an OpenAI-compatible API on port 8000. The -hf flag pulls the GGUF from Hugging Face; append :F16 (case-insensitive) to pick the quant (llama-server docs):

# F16 (recommended for max fidelity on 48GB), offload all layers to Metal, full 16K context
llama-server -hf unsloth/phi-4-GGUF:F16 \
    --port 8000 \
    -ngl 99 \
    -c 16384 \
    --jinja

-ngl 99 (--n-gpu-layers) offloads every layer to the Metal GPU — the full-precision 14B file (29.32GB at F16) sits in unified memory under the ~34–36GB GPU-usable ceiling.
-c 16384 sets the full 16K context — that's Phi-4's ceiling (max_position_embeddings 16384), so there's no benefit to requesting more. Even at full context the KV cache is small, so F16 + the whole cache fits comfortably on 48GB.
--jinja applies the GGUF's built-in <|im_start|>…<|im_sep|>…<|im_end|> chat template so the assistant format parses correctly — required for clean turn boundaries.

Because the context ceiling is 16K and you have ~34–36GB usable, KV-cache pressure is a non-issue here: you do not need to quantize the KV cache. If you'd rather keep more of the pool free (say, to run other apps), drop to Q8_0 (15.58GB) — near-lossless at 14B — instead of F16. Note there is no nvidia-smi on a Mac — watch GPU/memory pressure in Activity Monitor (Window → GPU History) or with sudo powermetrics --samplers gpu_power.
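Loading a 29GB file takes a moment, so scripts that talk to the server may want to wait until it is ready. The sketch below polls llama-server's /health endpoint, which answers with an error status while the model is loading and 200 once it is ready. The helper name `wait_for_health` and the timeout values are assumptions for illustration.

```shell
#!/bin/sh
# Poll llama-server's /health endpoint until it answers 200, or give up.
# $1 = base URL (e.g. http://localhost:8000), $2 = timeout in seconds.
wait_for_health() {
    base_url=$1
    remaining=$2
    while [ "$remaining" -gt 0 ]; do
        # -f makes curl return non-zero on HTTP errors such as 503 (still loading).
        if curl -sf -o /dev/null --max-time 2 "$base_url/health"; then
            return 0
        fi
        sleep 1
        remaining=$(( remaining - 1 ))
    done
    return 1
}

# Example (not run here): wait up to two minutes for the F16 model to load.
# wait_for_health http://localhost:8000 120 && echo "server ready"
```

The loop counts attempts rather than wall-clock time, so the real wait can run somewhat longer than the stated timeout when curl itself blocks.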

With Ollama

Pull and run the community GGUF directly from Hugging Face; append a :quant tag to choose the quant (HF × Ollama docs):

ollama run hf.co/unsloth/phi-4-GGUF:F16

Ollama serves an OpenAI-compatible API at http://localhost:11434/v1 for chat clients. Use a recent Ollama build so the corrected Phi-4 template is applied.

Use it as a chat assistant

Point any OpenAI-compatible chat client at your local endpoint by setting its base URL and a dummy API key — no cloud, no per-token cost.

Open WebUI (optional local chat front-end). A self-hosted, ChatGPT-style UI that talks to any OpenAI-compatible server. Run it and point it at your local endpoint:

# Point Open WebUI at your local llama-server (or Ollama on :11434)
docker run -d -p 3000:8080 \
    -e OPENAI_API_BASE_URL=http://host.docker.internal:8000/v1 \
    -e OPENAI_API_KEY=EMPTY \
    ghcr.io/open-webui/open-webui:main

Then open http://localhost:3000 and chat. (Open WebUI also autodetects a local Ollama install, so with the Ollama path you can skip the base-URL wiring entirely.)

Directly via the API. Any OpenAI SDK or curl works against the same endpoint — use it for scripts, writing tools, or your own app:

curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
      "model": "phi-4",
      "messages": [{"role": "user", "content": "Summarize this in three bullet points: ..."}]
    }'

Local servers don't check the key, so any non-empty string (e.g. EMPTY) works where a client requires one.
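One detail that differs between the two runtimes is the model field. In the single-model setup used here, llama-server typically answers with whatever model it loaded at startup, while Ollama looks the name up, so it should match the tag you pulled. The sketch below builds the request body for either case; `json_escape` and `chat_payload` are hypothetical helper names, and the `max_tokens` value of 256 is an arbitrary choice for the example.

```shell
#!/bin/sh
# Escape backslashes and double quotes so a plain string can sit inside JSON.
# Limitation: newlines and control characters are not handled.
json_escape() {
    printf '%s' "$1" | sed 's/\\/\\\\/g; s/"/\\"/g'
}

# Build a chat-completions request body. $1 = model name, $2 = user prompt.
chat_payload() {
    printf '{"model": "%s", "max_tokens": 256, "messages": [{"role": "user", "content": "%s"}]}' \
        "$(json_escape "$1")" "$(json_escape "$2")"
}

# llama-server: the name is informational in this setup.
chat_payload "phi-4" 'Explain "GQA" in one sentence.'
echo

# Ollama: the model field should match the tag that was pulled.
chat_payload "hf.co/unsloth/phi-4-GGUF:F16" 'Explain "GQA" in one sentence.'
echo

# To send it (not run here):
# curl http://localhost:11434/v1/chat/completions \
#     -H "Content-Type: application/json" \
#     -d "$(chat_payload "hf.co/unsloth/phi-4-GGUF:F16" "Hello")"
```

For prompts that contain newlines or other control characters, a proper JSON encoder is the safer route; the escaping here only covers the simple cases.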

Results

Memory usage: The dense 14B loads entirely as its GGUF file in unified memory. F16 is 29.32GB on disk (byte-verified from the unsloth GGUF tree). On the M3 Max's 48GB, ~34–36GB is GPU-usable (the rest is reserved for macOS), so the full-precision weights fit with the full-16K KV cache and still leave a few GB free. Q8_0 (15.58GB) is the lighter near-lossless alternative with roughly 18–20GB of the usable pool to spare, and Q6_K/Q5_K_M/Q4_K_M shrink the footprint further if you want to run several models at once.
Model capability (vendor evals — Microsoft's own, NOT hardware throughput): Microsoft reports MMLU 84.8, GPQA 56.1, MATH 80.4, HumanEval 82.6, MGSM 80.6, DROP 75.5 — strong STEM/math/reasoning for a 14B. (Its SimpleQA score is low by design — that's a factuality / hallucination-resistance probe, not a capability score.) These are the vendor's benchmarks, not measurements on this GPU.
Speed: No community throughput benchmark for Phi-4 on the Apple M3 Max exists yet — we would rather omit a tok/s figure than invent one or borrow it from different hardware. Live measurements will appear at /check/phi-4/m3-max once contributed.

For the full benchmark data, see /check/phi-4/m3-max.

Troubleshooting

Runaway / garbled generation that won't stop — old GGUF template bug

This is the headline Phi-4 gotcha. Early-2025 (January) Phi-4 GGUFs shipped with the wrong end-of-turn token — EOS was <|endoftext|> instead of <|im_end|> — so the model never stops cleanly and produces runaway or garbled output. Fix it by using a current GGUF (the unsloth GGUF ships the corrected template — unsloth's write-up) and a recent llama.cpp / Ollama build. If you downloaded weights in early 2025, re-download them. Also make sure you pass --jinja so the embedded template is applied.
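A quick way to tell whether you are hitting this bug is to send a prompt that should produce a very short answer and look at the finish_reason field of the response. The sketch below assumes the standard OpenAI-style response shape: "stop" means the model ended its own turn, while "length" means it ran into the token cap. The helper names and the sample response are made up for illustration; the sample is not real server output.

```shell
#!/bin/sh
# Extract the first finish_reason from a chat-completions JSON response on stdin.
finish_reason_of() {
    sed -n 's/.*"finish_reason"[[:space:]]*:[[:space:]]*"\([a-z_]*\)".*/\1/p' | head -n 1
}

# Interpret it: "stop" means the model ended its turn; "length" means it ran
# into max_tokens, which on a one-word prompt suggests the EOS token is wrong.
eos_verdict() {
    case "$1" in
        stop)   echo "ok: generation stopped on its own" ;;
        length) echo "suspect: hit the token cap, check GGUF and llama.cpp versions" ;;
        *)      echo "unknown finish_reason: $1" ;;
    esac
}

# Offline demonstration with a hand-written sample response (not real output).
sample='{"choices":[{"index":0,"finish_reason":"length","message":{"role":"assistant","content":"..."}}]}'
eos_verdict "$(printf '%s\n' "$sample" | finish_reason_of)"

# Against a live server (not run here):
# curl -s http://localhost:8000/v1/chat/completions \
#     -H "Content-Type: application/json" \
#     -d '{"model":"phi-4","max_tokens":64,"messages":[{"role":"user","content":"Reply with the single word: ready"}]}' \
#     | finish_reason_of
```

A "length" result on its own is not proof of the bug, since a chatty answer can also hit a low cap, but repeated "length" results on trivial prompts are a strong hint to re-download the GGUF.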

The chat template looks wrong / responses are malformed

Pass --jinja to llama-server so the GGUF's built-in chat template is applied; without it the assistant format may not parse. Phi-4 uses the <|im_start|>role<|im_sep|>…<|im_end|> template. This is a ChatML-style format specific to Phi-4: although the model loads under the phi3 architecture, it does not use Phi-3's <|user|> / <|end|> markers, and it is not Mistral's Tekken either. The GGUF / llama.cpp path uses the embedded tokenizer and template, so there's no extra Python install required.

Out of memory when raising the context

Unlikely on a 48GB M3 Max: even the full-precision F16 GGUF (29.32GB) leaves headroom under the ~34–36GB usable ceiling, and because Phi-4's context ceiling is only 16K the KV cache stays small — there's no benefit to requesting more than -c 16384 (that's the model's limit). If you also run other memory-hungry apps and get tight, drop to Q8_0 (15.58GB) — near-lossless at 14B — or quantize the KV cache with -fa on -ctk q8_0 -ctv q8_0. You can raise the GPU allocation ceiling with sudo sysctl iogpu.wired_limit_mb=<value>, but leave several GB for the OS.
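If you do decide to raise the ceiling, the value is given in megabytes. The sketch below shows one way to derive it: take total memory in bytes and subtract a reserve for macOS. The 8 GiB reserve is a hypothetical figure chosen for the example, not a recommendation from any vendor, and `wired_limit_mb` is a made-up helper name.

```shell
#!/bin/sh
# Compute a value for iogpu.wired_limit_mb: total RAM minus a reserve for macOS.
# $1 = total memory in bytes, $2 = GiB to leave for the OS (a judgment call).
wired_limit_mb() {
    total_mb=$(( $1 / 1024 / 1024 ))
    reserve_mb=$(( $2 * 1024 ))
    echo $(( total_mb - reserve_mb ))
}

# 48 GiB machine, keeping a hypothetical 8 GiB back for macOS.
wired_limit_mb 51539607552 8

# On the Mac itself (not run here):
# limit=$(wired_limit_mb "$(sysctl -n hw.memsize)" 8)
# sudo sysctl iogpu.wired_limit_mb="$limit"
```

A value set this way with sysctl is generally lost on reboot, so a bad choice is easy to undo by restarting.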

No `nvidia-smi` — how to watch memory on a Mac

There is no nvidia-smi on Apple Silicon; that's an NVIDIA-only tool. Watch GPU and memory pressure in Activity Monitor (the Memory tab, and Window → GPU History) or on the command line with sudo powermetrics --samplers gpu_power. If macOS starts swapping, back off from F16 to Q8_0.
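For a scriptable swap check, macOS exposes swap figures through sysctl vm.swapusage. The sketch below pulls out the "used" value; the sample line imitates the usual output format and is not a real reading, and `swap_used` is a hypothetical helper name.

```shell
#!/bin/sh
# Pull the "used" figure out of a `sysctl vm.swapusage` line read from stdin.
swap_used() {
    awk '{ for (i = 1; i <= NF; i++) if ($i == "used") { print $(i + 2); exit } }'
}

# Offline demonstration with a hand-written sample line (not real output).
printf 'vm.swapusage: total = 2048.00M  used = 512.25M  free = 1535.75M  (encrypted)\n' | swap_used

# On the Mac itself (not run here):
# sysctl vm.swapusage | swap_used
```

A "used" figure that keeps climbing while the model is generating is the cue to step down from F16 to Q8_0.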

`torch` / Metal errors — this is llama.cpp, not a Python ML stack

Serving Phi-4 via llama.cpp or Ollama does not require PyTorch, flash-attn wheels, or a Python ML stack. If you hit a Metal error, confirm that you built (or downloaded) a Metal-enabled llama.cpp (Option A, -DGGML_METAL=ON) rather than a CPU-only binary, and that your macOS is current.

Model or GPU 404 on /check

Phi-4 is a new addition; if the /check/phi-4/m3-max link 404s, the catalogue row is still being registered. The recipe's install and run steps are independent of the benchmark endpoint.