What You'll Build
A fully local, private general assistant: Phi-4 — Microsoft's flagship 14B of the Phi-4 family (released December 2024) — served as an OpenAI-compatible endpoint by llama.cpp or Ollama on a single 24GB RTX 3090, then used from a chat UI (Open WebUI is a good local front-end) or directly via the API. This is a text-only chat/reasoning/writing model with a reputation for strong STEM, math, and multi-step reasoning for its size — the result of heavy training on curated and synthetic data. On a 24GB RTX 3090 it runs very comfortably; the practical floor is a 12GB card. Everything runs on your own hardware, so prompts and documents never leave the machine.
Hardware data: RTX 3090 (24GB VRAM) · Phi-4, GGUF Q8_0 (15.58GB, recommended) — or Q6_K (12.03GB) / Q5_K_M (10.41GB) / Q4_K_M (8.89GB) for a smaller footprint · See benchmark data
ℹ️ This is a dense, text-only 14B generalist: no MoE, no vision. Phi-4 is a Phi3ForCausalLM (model_type: phi3) with 40 layers, hidden size 5120, and GQA with 40 query / 10 KV heads. Because it is dense, its footprint is simply the quant file you load plus the KV cache; there is no "active-parameters" shortcut. It is a pure text model with no vision tower and no image input. This is the generalist Phi-4, not Phi-4-reasoning / Phi-4-reasoning-plus (separate reasoning-RL variants) and not Phi-4-mini; don't conflate them.
⚠️ Context window is only 16K tokens (16,384). Phi-4's max_position_embeddings is 16,384, notably shorter than most current peers, which often reach 128K. This caps long-document and long-conversation use: keep the prompt plus the generated output within 16,384 tokens. The upside is that the KV cache stays small even at full context (~3.3GB at f16), so KV-cache pressure on a 24GB card is low and you rarely need KV-cache quantization here.
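The ~3.3GB figure can be reproduced from the architecture numbers quoted in this guide. The sketch below is a back-of-the-envelope estimate, not a measurement: it assumes the per-head dimension equals hidden size divided by query heads (128 here) and ignores any runtime padding or compute buffers. The function and constant names are illustrative.

```python
# Architecture numbers quoted in this guide for Phi-4.
N_LAYERS = 40
HIDDEN_SIZE = 5120
N_QUERY_HEADS = 40
N_KV_HEADS = 10
MAX_CONTEXT = 16_384

# Assumption: head_dim = hidden_size / query heads (128 for these numbers).
HEAD_DIM = HIDDEN_SIZE // N_QUERY_HEADS


def kv_cache_bytes(n_tokens: int, bytes_per_value: float = 2.0) -> float:
    """Bytes for keys + values across all layers for n_tokens cached tokens."""
    # The leading 2 counts one K vector and one V vector per KV head per layer.
    values_per_token = 2 * N_LAYERS * N_KV_HEADS * HEAD_DIM
    return n_tokens * values_per_token * bytes_per_value


if __name__ == "__main__":
    full = kv_cache_bytes(MAX_CONTEXT)  # f16 = 2 bytes per value
    print(f"per token: {kv_cache_bytes(1) / 1024:.0f} KiB")
    print(f"full 16K context at f16: {full / 1e9:.2f} GB")
```

Under these assumptions the full-context f16 cache comes to roughly 3.36 GB (decimal), in line with the ~3.3GB quoted above. Because only the 10 KV heads are cached, the cache is a quarter of what 40 full attention heads would need.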
ℹ️ Runs on current llama.cpp out of the box. Phi-4 uses the long-supported phi3 architecture, so there is no special patch or PR gate. Just use a recent llama.cpp (or Ollama) build together with a recent GGUF (see the template/EOS note below), and pass --jinja so the embedded chat template applies. Phi-4 uses the <|im_start|>role<|im_sep|>…<|im_end|> chat template (not Tekken).
⚠️ Early-2025 Phi-4 GGUFs had a chat-template / EOS bug. The first GGUFs (January 2025) shipped with the wrong end-of-turn token: EOS was set to <|endoftext|> instead of <|im_end|>, which causes runaway / garbled generation that won't stop cleanly. It is fixed in current llama.cpp and current GGUFs; unsloth documented the fix and re-uploaded corrected files. If you grabbed an early build, re-download a current GGUF and update llama.cpp. This is the single most common Phi-4 gotcha; see Troubleshooting.
ℹ️ Ampere has no FP8 tensor cores — and this recipe doesn't need them. The RTX 3090 (GA102, sm_86) predates the FP8 hardware path that Ada/Hopper/Blackwell added. That's irrelevant here: GGUF quants are integer (Q4_K_M, Q6_K, Q8_0 are 4/6/8-bit integer formats), not FP8, and run natively on Ampere. Don't reach for an "FP8" build for this card — there isn't one, and you don't want it.
Requirements
| Component | Minimum | Tested target |
|---|---|---|
| GPU | 12GB VRAM (Q4_K_M is 8.89GB — the 14B does not fit an 8GB card) | RTX 3090 (24GB, Ampere GA102, sm_86) |
| RAM | 16GB system RAM | 32GB comfortable |
| Storage | ~9GB (Q4_K_M) up to ~15.6GB (Q8_0) | ~15.6GB for Q8_0 |
| Software | Recent llama.cpp (CUDA) or Ollama + a recent GGUF; optional Open WebUI chat client | llama-server, Open WebUI |
Model weights (first-party GGUF exists). Microsoft publishes both the full-precision weights (microsoft/phi-4) and an official GGUF (microsoft/phi-4-gguf, MIT). Note a naming quirk in Microsoft's repo: it names its K-quants Q4_K / Q5_K (not Q4_K_M / Q5_K_M) — sizes there are Q4_K 9.05GB, Q4_K_S 8.44GB, Q5_K 10.60GB, Q6_K 12.03GB, Q8_0 15.58GB, BF16 29.32GB. For most users we recommend the community unsloth/phi-4-GGUF (MIT), which ships the conventional K_M ladder with the documented chat-template fix already applied; bartowski/phi-4-GGUF offers imatrix quants too. Byte-verified on-disk sizes (unsloth K_M ladder):
| Quant | On-disk size | Fit on RTX 3090 (24GB) |
|---|---|---|
| Q4_K_M | 8.89GB | Tiny footprint, huge headroom — also the practical floor for a 12GB card (too big for 8GB) |
| Q5_K_M | 10.41GB | Comfortable |
| Q6_K | 12.03GB | Comfortable — near-lossless-feeling with lots of room |
| Q8_0 | 15.58GB | Recommended — near-lossless weights; 15.58GB + ~3.3GB full-16K f16 KV ≈ 18.9GB leaves ~5GB free on 24GB. Very comfortable and the practical best-quality choice here |
| F16 / BF16 | 29.32GB | Full precision — does not fit a 24GB card (needs 32GB+); not an option on the 3090 |
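The fit column above is simple subtraction, and it can be recomputed for any quant. The sketch below uses only the sizes from the table and the ~3.3GB full-context KV figure from this guide. It treats all GB figures as the same unit, as the text does, and ignores compute buffers and CUDA context overhead, so real headroom will be somewhat smaller. The names are illustrative.

```python
# On-disk sizes in GB, copied from the table above (unsloth K_M ladder).
QUANT_SIZES_GB = {
    "Q4_K_M": 8.89,
    "Q5_K_M": 10.41,
    "Q6_K": 12.03,
    "Q8_0": 15.58,
    "BF16": 29.32,
}

KV_CACHE_GB = 3.3  # full 16K context at f16, as quoted in this guide


def headroom_gb(quant: str, vram_gb: float, kv_gb: float = KV_CACHE_GB) -> float:
    """VRAM left after weights and KV cache; negative means it does not fit."""
    return vram_gb - QUANT_SIZES_GB[quant] - kv_gb


if __name__ == "__main__":
    for quant in QUANT_SIZES_GB:
        free = headroom_gb(quant, vram_gb=24.0)
        verdict = "fits" if free > 0 else "does not fit"
        print(f"{quant:7s} {verdict:12s} headroom {free:+.2f} GB")
```

For Q8_0 on 24GB this gives about 5.1GB of headroom, matching the "~5GB free" in the table; BF16 comes out negative, which is why it is not an option on the 3090.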
Licensing. Phi-4 is MIT — free for commercial and non-commercial use, no revenue caps (model card).
Installation
You have two GGUF runtimes; pick one. Both are fine for this model — there is no patch requirement — so choose Ollama for the fastest start, or llama.cpp for the most control. Either way, use a recent build and a recent GGUF so the chat-template / EOS fix is in place.
Option A — llama.cpp with CUDA
The RTX 3090 is Ampere (GA102, sm_86). Build a recent llama.cpp and compile for sm_86, per the official build guide:
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
# RTX 3090 is Ampere = compute capability 8.6 (sm_86)
cmake -B build -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=86
cmake --build build --config Release -j 8
A recent release matters here: it carries the corrected Phi-4 chat template / EOS handling. If you prefer a prebuilt binary, grab a current one from the releases page. The CUDA backend flag is -DGGML_CUDA=ON on current llama.cpp (the old LLAMA_CUDA name was retired in late 2024); install the NVIDIA CUDA toolkit first.
Option B — Ollama
Ollama is built on llama.cpp and is the fastest way to stand this model up. Either use the curated tag (ollama run phi4) or pull the community GGUF straight from Hugging Face (HF × Ollama docs):
ollama run hf.co/unsloth/phi-4-GGUF:Q8_0
Swap the :Q8_0 tag for :Q6_K, :Q5_K_M, or :Q4_K_M if you want a smaller footprint. Ollama serves an OpenAI-compatible API at http://localhost:11434/v1 for chat clients. Prefer a recent Ollama version so the current, fixed template is used.
Running
With llama.cpp
Serve an OpenAI-compatible API on port 8000. The -hf flag pulls the GGUF from Hugging Face; append :Q8_0 (case-insensitive) to pick the quant (llama-server docs):
# Q8_0 (recommended), offload all layers to the 3090, full 16K context
llama-server -hf unsloth/phi-4-GGUF:Q8_0 \
--port 8000 \
-ngl 99 \
-c 16384 \
--jinja
- -ngl 99 (--n-gpu-layers) offloads every layer to the GPU: the dense 14B quant file (15.58GB at Q8_0) sits entirely in VRAM with ~8GB to spare before the KV cache.
- -c 16384 sets the full 16K context. That is Phi-4's ceiling (max_position_embeddings 16384), so there's no benefit to requesting more. Even at full context the KV cache (~3.3GB at f16) fits easily on 24GB.
- --jinja applies the GGUF's built-in <|im_start|>…<|im_sep|>…<|im_end|> chat template so the assistant format parses correctly; it is required for clean turn boundaries.
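Loading a 15.58GB file into VRAM takes a moment, so scripts that talk to the server may want to wait until it is ready. The sketch below polls llama-server's /health endpoint, which answers 200 once the model is loaded and 503 while it is still loading. The port matches the command above; the helper names, timeout, and poll interval are illustrative choices, not part of llama.cpp.

```python
import time
import urllib.error
import urllib.request


def http_status(url: str) -> int:
    """Return the HTTP status for a GET, or 0 if the server is unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=2) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code  # e.g. 503 while the model is still loading
    except (urllib.error.URLError, OSError):
        return 0  # nothing listening yet, or the connection failed


def wait_until_ready(url, timeout_s=120.0, poll_s=1.0, get_status=http_status):
    """Poll url until it answers 200 or timeout_s elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        if get_status(url) == 200:
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_s)


# Example (requires a running llama-server on port 8000):
#   wait_until_ready("http://localhost:8000/health")
```

The get_status parameter exists only so the polling logic can be exercised without a live server.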
Because the context ceiling is 16K, KV-cache pressure is low: you generally do not need to quantize the KV cache on a 24GB card. If you were tight on VRAM (e.g. running Q6_K on a 16GB card, where 12.03GB of weights plus ~3.3GB of f16 KV leaves under 1GB free), you could add -fa on -ctk q8_0 -ctv q8_0 to roughly halve KV-cache VRAM; on the 3090 there's no need.
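The "roughly halve" estimate follows from how ggml stores q8_0: each block of 32 values holds 32 int8 entries plus one f16 scale (34 bytes), against 64 bytes for the same 32 values in f16. A minimal sketch of that arithmetic, using the ~3.3GB f16 figure from this guide as the starting point (names are illustrative):

```python
# Bytes needed to store 32 cached values in each KV-cache type.
BYTES_PER_32_VALUES = {
    "f16": 2 * 32,   # 2 bytes per value
    "q8_0": 32 + 2,  # 32 int8 values + one f16 scale per block
}

F16_KV_GB = 3.3  # full 16K context at f16, as quoted in this guide


def kv_scale(cache_type: str) -> float:
    """KV-cache size relative to f16 for the given cache type."""
    return BYTES_PER_32_VALUES[cache_type] / BYTES_PER_32_VALUES["f16"]


if __name__ == "__main__":
    for cache_type in BYTES_PER_32_VALUES:
        size = F16_KV_GB * kv_scale(cache_type)
        print(f"{cache_type:5s} ~{size:.2f} GB at full 16K context")
```

The ratio is 34/64, about 0.53, so a q8_0 cache at full context lands near 1.75GB under these assumptions.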
With Ollama
Pull and run the community GGUF directly from Hugging Face; append a :quant tag to choose the quant (HF × Ollama docs):
ollama run hf.co/unsloth/phi-4-GGUF:Q8_0
Ollama serves an OpenAI-compatible API at http://localhost:11434/v1 for chat clients. Use a recent Ollama build so the corrected Phi-4 template is applied.
Use it as a chat assistant
Point any OpenAI-compatible chat client at your local endpoint by setting its base URL and a dummy API key — no cloud, no per-token cost.
Open WebUI (optional local chat front-end). A self-hosted, ChatGPT-style UI that talks to any OpenAI-compatible server. Run it and point it at your local endpoint:
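# Point Open WebUI at your local llama-server (or Ollama on :11434).
# On Linux, --add-host makes host.docker.internal resolve to the host, and
# llama-server must listen beyond loopback (start it with --host 0.0.0.0).
docker run -d -p 3000:8080 \
--add-host=host.docker.internal:host-gateway \
-e OPENAI_API_BASE_URL=http://host.docker.internal:8000/v1 \
-e OPENAI_API_KEY=EMPTY \
ghcr.io/open-webui/open-webui:main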
Then open http://localhost:3000 and chat. (Open WebUI also autodetects a local Ollama install, so with the Ollama path you can skip the base-URL wiring entirely.)
Directly via the API. Any OpenAI SDK or curl works against the same endpoint — use it for scripts, writing tools, or your own app:
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "phi-4",
"messages": [{"role": "user", "content": "Summarize this in three bullet points: ..."}]
}'
Local servers don't check the key, so any non-empty string (e.g. EMPTY) works where a client requires one.
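For scripts, the same request can be made from Python without installing an SDK. This is a minimal sketch using only the standard library; it assumes the llama-server endpoint on port 8000 from this guide (swap the base URL for http://localhost:11434/v1 on the Ollama path, where the model name must match the Ollama tag). The function names are illustrative.

```python
import json
import urllib.request


def build_chat_request(prompt, base_url="http://localhost:8000/v1",
                       model="phi-4", api_key="EMPTY"):
    """Build a POST request for the OpenAI-compatible chat completions route."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            # Local servers ignore the key; any non-empty string works.
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


def chat(prompt, **kwargs):
    """Send the prompt and return the assistant's reply text."""
    request = build_chat_request(prompt, **kwargs)
    with urllib.request.urlopen(request, timeout=300) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]


# Example (requires a running server):
#   print(chat("Summarize this in three bullet points: ..."))
```

Splitting request construction from sending keeps the payload easy to inspect or log before anything goes over the wire.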
Results
- VRAM usage: The dense 14B loads entirely as its GGUF file — Q8_0 is 15.58GB on disk (byte-verified from the unsloth GGUF tree). On the RTX 3090's 24GB that leaves ~8GB before the KV cache; because Phi-4's context tops out at 16K, the f16 KV cache is only ~3.3GB, so Q8_0 + full context lands near ~18.9GB with ~5GB free. Q6_K (12.03GB), Q5_K_M (10.41GB), and Q4_K_M (8.89GB) shrink the footprint further — Q4_K_M is the practical floor for a 12GB card. The full-precision F16/BF16 GGUF (29.32GB) does not fit a 24GB card at all. (These are integer GGUF quants — Ampere runs them natively; there is no FP8 path on this card.)
- Model capability (vendor evals — Microsoft's own, NOT hardware throughput): Microsoft reports MMLU 84.8, GPQA 56.1, MATH 80.4, HumanEval 82.6, MGSM 80.6, DROP 75.5 — strong STEM/math/reasoning for a 14B. (Its SimpleQA score is low by design — that's a factuality / hallucination-resistance probe, not a capability score.) These are the vendor's benchmarks, not measurements on this GPU.
- Speed: No community throughput benchmark for Phi-4 on the RTX 3090 exists yet. We would rather omit a tok/s figure than invent one or borrow it from different hardware. Live measurements will appear at /check/phi-4/rtx-3090 once contributed.
For the full benchmark data, see /check/phi-4/rtx-3090.
Troubleshooting
Runaway / garbled generation that won't stop — old GGUF template bug
This is the headline Phi-4 gotcha. Early-2025 (January) Phi-4 GGUFs shipped with the wrong end-of-turn token — EOS was <|endoftext|> instead of <|im_end|> — so the model never stops cleanly and produces runaway or garbled output. Fix it by using a current GGUF (the unsloth GGUF ships the corrected template — unsloth's write-up) and a recent llama.cpp / Ollama build. If you downloaded weights in early 2025, re-download them. Also make sure you pass --jinja so the embedded template is applied.
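One rough way to check whether a setup is affected is to send a trivially short prompt with a small token cap and look at finish_reason in the OpenAI-compatible response: "stop" means generation ended on an end-of-turn token, while "length" means it ran until the cap. The sketch below is a heuristic only (a healthy model can also hit the cap on a long answer), and the probe prompt, cap, and function names are arbitrary choices for illustration.

```python
def probe_payload(model: str = "phi-4", max_tokens: int = 64) -> dict:
    """A short request that a healthy model should finish well under max_tokens."""
    return {
        "model": model,
        "max_tokens": max_tokens,
        "messages": [
            {"role": "user", "content": "Reply with the single word: ready"}
        ],
    }


def stopped_cleanly(response: dict) -> bool:
    """True if the first choice ended on a stop token rather than the token cap."""
    return response["choices"][0].get("finish_reason") == "stop"


# Usage: POST probe_payload() as JSON to /v1/chat/completions, parse the
# JSON reply into a dict, and pass it to stopped_cleanly().
```

If the probe repeatedly comes back with "length" and the text trails off into unrelated output, that is consistent with the old EOS bug, and re-downloading a current GGUF is the first thing to try.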
The chat template looks wrong / responses are malformed
Pass --jinja to llama-server so the GGUF's built-in chat template is applied — without it the assistant format won't parse. Phi-4 uses the <|im_start|>role<|im_sep|>…<|im_end|> template (the phi3 architecture's format, not Mistral's Tekken). The GGUF / llama.cpp path uses the embedded tokenizer and template, so there's no extra Python install required.
Out of memory when raising the context
Unlikely on a 24GB 3090: Q8_0 weights (15.58GB) leave ~8GB free, and because Phi-4's context ceiling is only 16K the KV cache stays small (~3.3GB at f16) — there's no benefit to requesting more than -c 16384 (that's the model's limit). If you're on a smaller card and still tight, options in order: quantize the KV cache with -fa on -ctk q8_0 -ctv q8_0; drop to Q6_K (12.03GB), Q5_K_M (10.41GB), or Q4_K_M (8.89GB). Don't try the F16/BF16 GGUF (29.32GB) on 24GB — it doesn't fit.
Do I need an "FP8" build for the 3090?
No. Ampere (GA102, sm_86) has no FP8 tensor cores, and this recipe doesn't use FP8 at all — the GGUF quants (Q4_K_M, Q6_K, Q8_0) are integer formats that run natively on Ampere. FP8 weight-only serving is an Ada/Hopper/Blackwell path used by servers like vLLM; on the 3090, stick with the GGUF + llama.cpp path described here.
torch / CUDA errors — this is llama.cpp, not a Python ML stack
Serving Phi-4 via llama.cpp or Ollama does not require PyTorch, flash-attn wheels, or a Python ML stack. If you hit a CUDA error, confirm you built (or downloaded) the CUDA-enabled llama.cpp (Option A, -DGGML_CUDA=ON) rather than a CPU-only binary. For large-VRAM or multi-GPU production serving you could instead run the full-precision weights under a server like vLLM, but on a single 3090 the GGUF + llama.cpp path is the right one — and at 14B, Q8_0 is already near-lossless.
Model or GPU 404 on /check
Phi-4 is a new addition; if the /check/phi-4/rtx-3090 link 404s, the catalogue row is still being registered. The recipe's install and run steps are independent of the benchmark endpoint.