What You'll Build
A fully local, private general assistant: Phi-4 — Microsoft's flagship 14B of the Phi-4 family (released December 2024) — served as an OpenAI-compatible endpoint by llama.cpp or Ollama on a single 32GB RTX 5090, then used from a chat UI (Open WebUI is a good local front-end) or directly via the API. This is a text-only chat/reasoning/writing model with a reputation for strong STEM, math, and multi-step reasoning for its size — the result of heavy training on curated and synthetic data. On a 32GB RTX 5090 it runs with enormous headroom at Q8_0 — and it's one of the few consumer cards where you can run the full-precision F16/BF16 weights (with the KV cache quantized). Everything runs on your own hardware, so prompts and documents never leave the machine.
Hardware data: RTX 5090 (32GB VRAM) · Phi-4, GGUF Q8_0 (15.58GB, recommended) — or F16/BF16 (29.32GB, full precision with a quantized KV cache) · See benchmark data
ℹ️ This is a dense, text-only 14B generalist: no MoE, no vision. Phi-4 is a Phi3ForCausalLM (model_type: phi3) with 40 layers, hidden size 5120, and GQA with 40 query / 10 KV heads. Because it is dense, its footprint is simply the quant file you load plus the KV cache; there is no "active-parameters" shortcut. It is a pure text model, with no vision tower and no image input. This is the generalist Phi-4, not Phi-4-reasoning / Phi-4-reasoning-plus (separate reasoning-RL variants) and not Phi-4-mini; don't conflate them.
⚠️ Context window is only 16K tokens (16,384). Phi-4's max_position_embeddings is 16,384, notably shorter than most current peers, which often reach 128K. This caps long-document and long-conversation use: keep the prompt plus the generated reply within ~16K tokens. On 32GB the KV cache is a non-issue for the quantized options; it only matters at the very top when you run full-precision F16 weights (see the fit note below).
ℹ️ Runs on current llama.cpp out of the box, but Blackwell needs a recent CUDA. Phi-4 uses the long-supported phi3 architecture, so there is no special patch or PR gate. The one Blackwell-specific requirement is the toolchain: sm_120 support needs CUDA 12.8+ and a recent llama.cpp. Use a recent build together with a recent GGUF (see the template/EOS note below), and pass --jinja so the embedded chat template applies. Phi-4 uses the <|im_start|>role<|im_sep|>…<|im_end|> chat template (not Tekken).
⚠️ Early-2025 Phi-4 GGUFs had a chat-template / EOS bug. The first GGUFs (January 2025) shipped with the wrong end-of-turn token: EOS was set to <|endoftext|> instead of <|im_end|>, which causes runaway / garbled generation that won't stop cleanly. It is fixed in current llama.cpp and current GGUFs; unsloth documented the fix and re-uploaded corrected files. If you grabbed an early build, re-download a current GGUF and update llama.cpp. This is the single most common Phi-4 gotcha; see Troubleshooting.
Requirements
| Component | Minimum | Tested target |
|---|---|---|
| GPU | 12GB VRAM (Q4_K_M is 8.89GB — the 14B does not fit an 8GB card) | RTX 5090 (32GB, Blackwell GB202, sm_120) |
| RAM | 16GB system RAM | 32GB comfortable |
| Storage | ~16GB (Q8_0) up to ~29.3GB (F16/BF16) | ~15.6GB for Q8_0 |
| Software | Recent llama.cpp (CUDA 12.8+ for sm_120) or Ollama + a recent GGUF; optional Open WebUI chat client | llama-server, Open WebUI |
Model weights (first-party GGUF exists). Microsoft publishes both the full-precision weights (microsoft/phi-4) and an official GGUF (microsoft/phi-4-gguf, MIT). Note a naming quirk in Microsoft's repo: it names its K-quants Q4_K / Q5_K (not Q4_K_M / Q5_K_M) — sizes there are Q4_K 9.05GB, Q4_K_S 8.44GB, Q5_K 10.60GB, Q6_K 12.03GB, Q8_0 15.58GB, BF16 29.32GB. For most users we recommend the community unsloth/phi-4-GGUF (MIT), which ships the conventional K_M ladder with the documented chat-template fix already applied; bartowski/phi-4-GGUF offers imatrix quants too. Byte-verified on-disk sizes (unsloth K_M ladder):
| Quant | On-disk size | Fit on RTX 5090 (32GB) |
|---|---|---|
| Q4_K_M | 8.89GB | Tiny footprint, vast headroom — also the practical floor for a 12GB card (too big for 8GB) |
| Q5_K_M | 10.41GB | Vast headroom |
| Q6_K | 12.03GB | Vast headroom |
| Q8_0 | 15.58GB | Recommended — near-lossless weights; 15.58GB + ~3.3GB full-16K f16 KV ≈ 18.9GB leaves ~13GB free on 32GB. Runs with huge headroom |
| F16 / BF16 | 29.32GB | Fits with a quantized KV cache — 29.32GB + ~3.3GB full-16K f16 KV ≈ 32.6GB is slightly over 32GB, so use -fa on -ctk q8_0 -ctv q8_0 (~1.7GB → ~31GB) or a modest -c. The "full-precision on a consumer card" option |
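The fit figures in the table can be reproduced with back-of-the-envelope arithmetic. The sketch below is an estimate under stated assumptions, not a measurement: it uses the architecture numbers quoted earlier (40 layers, 10 KV heads, hidden size 5120 over 40 query heads, so a head dimension of 128), treats 32GB as 32 × 10^9 bytes, assumes the q8_0 layout of 34 bytes per 32 values, and ignores compute buffers and other runtime overhead. The function names are illustrative and not part of any tool.

```python
# Rough VRAM estimate for Phi-4: weights file + KV cache (overhead ignored).
LAYERS = 40
KV_HEADS = 10
HEAD_DIM = 5120 // 40   # hidden size / query heads = 128
CONTEXT = 16_384        # Phi-4's maximum context

# Bytes per cached value: f16 is 2 bytes; q8_0 stores 32 int8 values plus a 2-byte scale.
BYTES_PER_VALUE = {"f16": 2.0, "q8_0": 34 / 32}

# On-disk sizes in GB, copied from the table above.
QUANTS_GB = {"Q4_K_M": 8.89, "Q5_K_M": 10.41, "Q6_K": 12.03, "Q8_0": 15.58, "BF16": 29.32}


def kv_cache_gb(cache_type: str, context: int = CONTEXT) -> float:
    """Size of the K and V caches in decimal GB for a given context length."""
    values_per_token = 2 * LAYERS * KV_HEADS * HEAD_DIM  # factor 2 = K and V
    return values_per_token * BYTES_PER_VALUE[cache_type] * context / 1e9


def total_gb(quant: str, cache_type: str, context: int = CONTEXT) -> float:
    """Weights plus KV cache, in decimal GB."""
    return QUANTS_GB[quant] + kv_cache_gb(cache_type, context)


def fits(quant: str, cache_type: str, vram_gb: float = 32.0) -> bool:
    """True if the estimate is within the VRAM budget (no safety margin applied)."""
    return total_gb(quant, cache_type) <= vram_gb


for quant in QUANTS_GB:
    for cache_type in BYTES_PER_VALUE:
        verdict = "fits" if fits(quant, cache_type) else "over"
        print(f"{quant:7s} + {cache_type:4s} KV: {total_gb(quant, cache_type):5.1f} GB ({verdict})")
```

Under these assumptions the full-16K cache comes out at roughly 3.4GB in f16 and 1.8GB in q8_0, in line with the rounded ~3.3GB and ~1.7GB figures used in this guide, and the only combination that exceeds 32GB is BF16 weights with an f16 cache.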
Licensing. Phi-4 is MIT — free for commercial and non-commercial use, no revenue caps (model card).
Installation
You have two GGUF runtimes; pick one. Both are fine for this model — there is no patch requirement — so choose Ollama for the fastest start, or llama.cpp for the most control. Either way, use a recent build (with CUDA 12.8+ for Blackwell) and a recent GGUF so the chat-template / EOS fix is in place.
Option A — llama.cpp with CUDA
The RTX 5090 is Blackwell (GB202, sm_120) and needs CUDA 12.8 or newer. Build a recent llama.cpp and compile for sm_120, per the official build guide:
```bash
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
# RTX 5090 is Blackwell = compute capability 12.0 (sm_120), needs CUDA 12.8+
cmake -B build -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=120
cmake --build build --config Release -j 8
```
A recent release matters twice here: it carries the corrected Phi-4 chat template / EOS handling and the sm_120 Blackwell target. Make sure your CUDA toolkit is 12.8+ first — older toolkits don't know sm_120 and the build will fail or fall back. If you prefer a prebuilt binary, grab a current CUDA build from the releases page. The CUDA backend flag is -DGGML_CUDA=ON on current llama.cpp (the old LLAMA_CUDA name was retired in late 2024).
Option B — Ollama
Ollama is built on llama.cpp and is the fastest way to stand this model up. Use a recent Ollama build (its bundled runtime includes Blackwell/sm_120 support). Either use the curated tag (ollama run phi4) or pull the community GGUF straight from Hugging Face (HF × Ollama docs):
```bash
ollama run hf.co/unsloth/phi-4-GGUF:Q8_0
```
Swap the :Q8_0 tag for :Q6_K, :Q5_K_M, or :Q4_K_M for a smaller footprint. Ollama serves an OpenAI-compatible API at http://localhost:11434/v1 for chat clients. Prefer a recent Ollama version so the current, fixed template is used.
Running
With llama.cpp
Serve an OpenAI-compatible API on port 8000. The -hf flag pulls the GGUF from Hugging Face; append :Q8_0 (case-insensitive) to pick the quant (llama-server docs):
```bash
# Q8_0 (recommended), offload all layers to the 5090, full 16K context
llama-server -hf unsloth/phi-4-GGUF:Q8_0 \
  --port 8000 \
  -ngl 99 \
  -c 16384 \
  --jinja
```
- -ngl 99 (--n-gpu-layers) offloads every layer to the GPU. The dense 14B quant file (15.58GB at Q8_0) sits entirely in VRAM with ~16GB to spare before the KV cache.
- -c 16384 sets the full 16K context. That is Phi-4's ceiling (max_position_embeddings 16384), so there's no benefit to requesting more. On 32GB the ~3.3GB f16 KV cache disappears into the headroom.
- --jinja applies the GGUF's built-in <|im_start|>…<|im_sep|>…<|im_end|> chat template so the assistant format parses correctly; it is required for clean turn boundaries.
For the full-precision option (the reason to have 32GB), load the F16/BF16 GGUF. At 29.32GB the weights plus a full-16K f16 KV cache (~3.3GB) total ~32.6GB — slightly over 32GB — so quantize the KV cache (or trim context):
```bash
# F16/BF16 full precision on 32GB: quantize the KV cache so it fits at full 16K
llama-server -hf unsloth/phi-4-GGUF:BF16 \
  --port 8000 \
  -ngl 99 \
  -c 16384 \
  -fa on -ctk q8_0 -ctv q8_0 \
  --jinja
```
That drops the KV cache from ~3.3GB to ~1.7GB, putting the total near ~31GB — inside 32GB. If you'd rather keep an f16 KV cache, reduce context (e.g. -c 8192) instead. Note that at 14B the quality gap between Q8_0 and full precision is small; F16 is here for completeness / exactness, not because Q8_0 is lacking.
With Ollama
Pull and run the community GGUF directly from Hugging Face; append a :quant tag to choose the quant (HF × Ollama docs):
```bash
ollama run hf.co/unsloth/phi-4-GGUF:Q8_0
```
Ollama serves an OpenAI-compatible API at http://localhost:11434/v1 for chat clients. Use a recent Ollama build so the corrected Phi-4 template (and Blackwell support) are in place.
Use it as a chat assistant
Point any OpenAI-compatible chat client at your local endpoint by setting its base URL and a dummy API key — no cloud, no per-token cost.
Open WebUI (optional local chat front-end). A self-hosted, ChatGPT-style UI that talks to any OpenAI-compatible server. Run it and point it at your local endpoint:
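```bash
# Point Open WebUI at your local llama-server (or Ollama on :11434).
# On Linux, --add-host makes host.docker.internal resolve to the host, and
# llama-server must listen on a reachable address (e.g. start it with --host 0.0.0.0).
docker run -d -p 3000:8080 \
  --add-host=host.docker.internal:host-gateway \
  -e OPENAI_API_BASE_URL=http://host.docker.internal:8000/v1 \
  -e OPENAI_API_KEY=EMPTY \
  ghcr.io/open-webui/open-webui:main
```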
Then open http://localhost:3000 and chat. (Open WebUI also autodetects a local Ollama install, so with the Ollama path you can skip the base-URL wiring entirely.)
Directly via the API. Any OpenAI SDK or curl works against the same endpoint — use it for scripts, writing tools, or your own app:
```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "phi-4",
    "messages": [{"role": "user", "content": "Summarize this in three bullet points: ..."}]
  }'
```
Local servers don't check the key, so any non-empty string (e.g. EMPTY) works where a client requires one.
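For scripts, the same request can be made from Python with only the standard library. This is a minimal sketch, assuming the llama-server endpoint on port 8000 from this guide; the helper names build_chat_request and chat, the max_tokens default, and the timeout are illustrative choices, not part of any SDK. With Ollama, change BASE_URL to http://localhost:11434/v1 and pass the model name you actually pulled.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000/v1"  # llama-server from this guide; Ollama uses :11434


def build_chat_request(prompt: str, model: str = "phi-4", max_tokens: int = 512,
                       api_key: str = "EMPTY") -> urllib.request.Request:
    """Build (but do not send) an OpenAI-style chat completion request."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,  # caps the reply length
    }
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",  # dummy key; local servers ignore it
        },
        method="POST",
    )


def chat(prompt: str, **kwargs) -> str:
    """Send the request to the local server and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(prompt, **kwargs), timeout=120) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

With the server running, chat("Summarize this in three bullet points: ...") returns the reply as a string. Building the request separately from sending it keeps the payload easy to inspect before anything goes over the wire.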
Results
- VRAM usage: The dense 14B loads entirely as its GGUF file — Q8_0 is 15.58GB on disk (byte-verified from the unsloth GGUF tree). On the RTX 5090's 32GB that leaves ~16GB before the KV cache; at full 16K context the f16 KV cache is only ~3.3GB, so Q8_0 + full context lands near ~18.9GB with ~13GB free. The full-precision F16/BF16 (29.32GB) also fits: 29.32GB + ~3.3GB f16 KV ≈ 32.6GB is slightly over 32GB, so quantize the KV cache to q8_0 (~1.7GB → ~31GB total) or trim context. Q6_K/Q5_K_M/Q4_K_M leave even more room. This 32GB card is one of the few consumer GPUs where full precision is on the table.
- Model capability (vendor evals — Microsoft's own, NOT hardware throughput): Microsoft reports MMLU 84.8, GPQA 56.1, MATH 80.4, HumanEval 82.6, MGSM 80.6, DROP 75.5 — strong STEM/math/reasoning for a 14B. (Its SimpleQA score is low by design — that's a factuality / hallucination-resistance probe, not a capability score.) These are the vendor's benchmarks, not measurements on this GPU.
- Speed: No community throughput benchmark for Phi-4 on the RTX 5090 exists yet; we would rather omit a tok/s figure than invent one or borrow it from different hardware. Live measurements will appear at /check/phi-4/rtx-5090 once contributed.
For the full benchmark data, see /check/phi-4/rtx-5090.
Troubleshooting
Runaway / garbled generation that won't stop — old GGUF template bug
This is the headline Phi-4 gotcha. Early-2025 (January) Phi-4 GGUFs shipped with the wrong end-of-turn token — EOS was <|endoftext|> instead of <|im_end|> — so the model never stops cleanly and produces runaway or garbled output. Fix it by using a current GGUF (the unsloth GGUF ships the corrected template — unsloth's write-up) and a recent llama.cpp / Ollama build. If you downloaded weights in early 2025, re-download them. Also make sure you pass --jinja so the embedded template is applied.
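One way to spot the symptom from a script is to look at finish_reason in the chat completion response: a reply that ends its turn normally reports "stop", while one cut off by the max_tokens cap reports "length". The helper below is a heuristic sketch with a hypothetical name; a legitimately long answer also ends with "length", so treat repeated hits on short prompts as a hint to re-download the GGUF rather than as proof of the bug.

```python
def hit_token_limit(response: dict) -> bool:
    """True if any choice ended because it reached max_tokens instead of ending its turn."""
    return any(
        choice.get("finish_reason") == "length"
        for choice in response.get("choices", [])
    )


# Hypothetical response bodies, trimmed to the fields the check reads.
clean = {"choices": [{"finish_reason": "stop", "message": {"content": "Paris."}}]}
cut_off = {"choices": [{"finish_reason": "length", "message": {"content": "Paris. Paris. Par"}}]}

print(hit_token_limit(clean))    # False
print(hit_token_limit(cut_off))  # True
```

Setting a modest max_tokens on every request keeps a misbehaving model from generating indefinitely while you diagnose.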
The chat template looks wrong / responses are malformed
Pass --jinja to llama-server so the GGUF's built-in chat template is applied; without it the assistant format won't parse. Phi-4 uses the <|im_start|>role<|im_sep|>…<|im_end|> template. That is Phi-4's own format: although the model loads under the phi3 architecture, it does not use the older Phi-3 <|user|>…<|end|> markers, and it is not Mistral's Tekken. The GGUF / llama.cpp path uses the embedded tokenizer and template, so there's no extra Python install required.
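To make the format concrete, the sketch below renders a message list into the <|im_start|>role<|im_sep|>…<|im_end|> layout described above. It is an illustration only, with a hypothetical function name: the template embedded in the GGUF is authoritative, its exact whitespace and special-token handling may differ, and the server applies it for you when you use the chat endpoint.

```python
def render_phi4_prompt(messages: list[dict]) -> str:
    """Approximate the Phi-4 chat layout for a list of {"role", "content"} messages."""
    parts = [
        f"<|im_start|>{m['role']}<|im_sep|>{m['content']}<|im_end|>"
        for m in messages
    ]
    # Open an assistant turn so the model writes the reply and ends it with <|im_end|>.
    parts.append("<|im_start|>assistant<|im_sep|>")
    return "".join(parts)


prompt = render_phi4_prompt([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hi"},
])
print(prompt)
```

Seeing the layout spelled out also shows why the end-of-turn token matters: each turn closes with <|im_end|>, so a GGUF that treats a different token as EOS never sees a clean stopping point.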
Blackwell / sm_120 build fails or the GPU isn't used
The RTX 5090 (GB202, sm_120) needs CUDA 12.8+. If the build errors on the architecture, or llama.cpp falls back to CPU, your CUDA toolkit is probably too old to know sm_120 — upgrade to 12.8 or newer and rebuild with -DCMAKE_CUDA_ARCHITECTURES=120, or use a current prebuilt CUDA release / recent Ollama that already bundles Blackwell support.
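A quick pre-build check is to read the toolkit version that nvcc reports and compare it with 12.8. The sketch below assumes nvcc is on PATH and prints its usual "release X.Y" line; the function names are illustrative, and the sample string in the usage line is a hypothetical example of that output.

```python
import re
import subprocess


def parse_nvcc_release(text: str):
    """Extract (major, minor) from nvcc --version output, or None if not found."""
    match = re.search(r"release (\d+)\.(\d+)", text)
    return (int(match.group(1)), int(match.group(2))) if match else None


def cuda_supports_sm120(release) -> bool:
    """sm_120 (Blackwell) needs CUDA 12.8 or newer, per the requirement above."""
    return release is not None and release >= (12, 8)


def installed_nvcc_release():
    """Ask the local nvcc for its version; returns None if nvcc is missing or fails."""
    try:
        result = subprocess.run(["nvcc", "--version"], capture_output=True,
                                text=True, check=True)
    except (OSError, subprocess.CalledProcessError):
        return None
    return parse_nvcc_release(result.stdout)


# Usage on a hypothetical nvcc output line:
sample = "Cuda compilation tools, release 12.8, V12.8.61"
print(cuda_supports_sm120(parse_nvcc_release(sample)))  # True
```

Calling cuda_supports_sm120(installed_nvcc_release()) on the build machine gives the same answer for the real toolkit; a False result means upgrading CUDA before rebuilding.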
Out of memory — only the F16 path can hit it here
Q8_0 (15.58GB) on 32GB has ~13GB free at full context, so OOM is essentially impossible there. The one case that can overflow is full-precision F16/BF16 (29.32GB): weights plus a full-16K f16 KV cache (~3.3GB) total ~32.6GB, just over 32GB. Fix it by quantizing the KV cache (-fa on -ctk q8_0 -ctv q8_0 → ~1.7GB, ~31GB total) or reducing context (-c 8192). If you don't need exact full precision, Q8_0 sidesteps the whole issue with huge headroom.
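If you prefer to keep an f16 KV cache on the full-precision weights, the arithmetic above can be turned around to estimate the largest context that fits. This is a rough sketch under the same assumptions as the rest of this guide (40 layers, 10 KV heads, head dimension 128, 32GB taken as 32 × 10^9 bytes); reserve_gb is a hypothetical allowance for compute buffers and other overhead, not a measured value.

```python
# f16 KV cache cost per token: K and V x layers x KV heads x head dim x 2 bytes.
BYTES_PER_TOKEN_F16 = 2 * 40 * 10 * 128 * 2  # 204,800 bytes
MAX_CONTEXT = 16_384                         # Phi-4's ceiling


def max_context_tokens(vram_gb: float, weights_gb: float, reserve_gb: float = 0.0) -> int:
    """Largest context (tokens) whose f16 KV cache fits in the remaining VRAM."""
    free_bytes = (vram_gb - weights_gb - reserve_gb) * 1e9
    if free_bytes <= 0:
        return 0
    return min(MAX_CONTEXT, int(free_bytes // BYTES_PER_TOKEN_F16))


print(max_context_tokens(32, 29.32))                  # BF16 weights, no reserve
print(max_context_tokens(32, 29.32, reserve_gb=1.0))  # BF16 weights, 1GB held back
print(max_context_tokens(32, 15.58))                  # Q8_0 weights
```

With no reserve the estimate for BF16 lands at about 13K tokens, and with 1GB held back it is still just above 8K, which is why -c 8192 is a comfortable setting for that path. Q8_0 is limited only by the 16,384-token ceiling.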
torch / CUDA errors — this is llama.cpp, not a Python ML stack
Serving Phi-4 via llama.cpp or Ollama does not require PyTorch, flash-attn wheels, or a Python ML stack. If you hit a CUDA error, confirm you built (or downloaded) the CUDA-enabled llama.cpp (Option A, -DGGML_CUDA=ON) with CUDA 12.8+ for Blackwell, rather than a CPU-only or older-toolkit binary. For large-VRAM or multi-GPU production serving you could instead run the full-precision weights under a server like vLLM, but on a single 5090 the GGUF + llama.cpp path is the right one — and at 14B, Q8_0 is already near-lossless.
Model or GPU 404 on /check
Phi-4 is a new addition; if the /check/phi-4/rtx-5090 link 404s, the catalogue row is still being registered. The recipe's install and run steps are independent of the benchmark endpoint.