Phi-4 (14B) on RX 7800 XT: Local Private Assistant via llama.cpp-HIP / Ollama (ROCm, 16GB)

What You'll Build

A fully local, private general assistant on a 16GB Radeon RX 7800 XT (RDNA3, Navi 32, gfx1101): Phi-4 — Microsoft's flagship 14B of the Phi-4 family (released December 2024) — served as an OpenAI-compatible endpoint by llama.cpp (built against AMD's HIP/ROCm backend) or Ollama, then used from a chat UI (Open WebUI is a good local front-end) or directly via the API. This is a text-only chat/reasoning/writing model with a reputation for strong STEM, math, and multi-step reasoning for its size — the result of heavy training on curated and synthetic data. On a 16GB RX 7800 XT it fits comfortably at a high-quality quant; the practical floor is a 12GB card. Everything runs on your own hardware, so prompts and documents never leave the machine.

Hardware data: RX 7800 XT (16GB VRAM) · Phi-4, GGUF Q6_K (12.03GB, recommended) — or Q5_K_M (10.41GB) / Q4_K_M (8.89GB) for more KV-cache / context headroom · ROCm 7 · See benchmark data

⚠️ This is a ROCm recipe, not CUDA. The RX 7800 XT runs on AMD's ROCm/HIP stack, so there is no cu124/cu128 wheel and no FlashAttention prebuilt-wheel step here. For this model the reliable path is GGUF via llama.cpp-HIP (or Ollama, which bundles llama.cpp). Do not follow a guide that tells you to pip install flash-attn, pick a cu12x wheel, or use ExLlamaV2/Marlin for this card: those instructions are written for NVIDIA/CUDA setups and do not apply to this recipe.

ℹ️ This is a dense, text-only 14B generalist — no MoE, no vision. Phi-4 is a Phi3ForCausalLM (model_type: phi3) — 40 layers, hidden size 5120, GQA with 40 query / 10 KV heads. Because it is dense, its footprint is simply the quant file you load plus the KV cache; there is no "active-parameters" shortcut. It is a pure text model — there is no vision tower and no image input. This is the generalist Phi-4 — not Phi-4-reasoning / -reasoning-plus (separate reasoning-RL variants) and not Phi-4-mini; don't conflate them.

⚠️ Context window is only 16K tokens (16,384). Phi-4's max_position_embeddings is 16,384 — notably shorter than most current peers, which often reach 128K. This caps long-document and long-conversation use: keep inputs within ~16K tokens. The upside is that the KV cache stays small even at full context (~3.3GB f16 at the full 16K across 40 layers / 10 KV heads), so KV-cache pressure on a 16GB card is manageable.
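The ~3.3GB figure follows from the architecture numbers quoted above. A minimal sketch of the arithmetic in shell, assuming a head dimension of 128 (hidden size 5120 divided by 40 query heads; that derivation is an assumption of this sketch, not a value read from the GGUF) and 2 bytes per f16 element. The helper name kv_cache_bytes is made up for illustration:

```shell
# Estimate Phi-4's KV-cache size: K and V tensors, per layer, per KV head, per token.
# Assumption: head_dim = hidden_size / query_heads = 5120 / 40 = 128.
kv_cache_bytes() {
    ctx=$1             # context length in tokens
    bytes_per_elem=$2  # 2 for an f16 cache
    layers=40; kv_heads=10; head_dim=128
    echo $(( 2 * layers * kv_heads * head_dim * bytes_per_elem * ctx ))
}

bytes=$(kv_cache_bytes 16384 2)
echo "f16 KV cache at 16K: $bytes bytes (~$(( bytes / 1000000 )) MB)"
```

That works out to roughly 3.36GB in decimal units (3.125 GiB), in line with the ~3.3GB quoted above. The cache scales linearly with context, so halving -c halves it.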

ℹ️ Runs on current llama.cpp out of the box. Phi-4 uses the long-supported phi3 architecture — there is no special patch or PR gate. Just use a recent llama.cpp (built with the HIP backend, below) or Ollama together with a recent GGUF (see the template/EOS note below), and pass --jinja so the embedded chat template applies. Phi-4 uses the <|im_start|>role<|im_sep|>…<|im_end|> chat template (not Tekken).

⚠️ Early-2025 Phi-4 GGUFs had a chat-template / EOS bug. The first GGUFs (January 2025) shipped with the wrong end-of-turn token — EOS was set to <|endoftext|> instead of <|im_end|>, which causes runaway / garbled generation that won't stop cleanly. It is fixed in current llama.cpp and current GGUFs — unsloth documented the fix and re-uploaded corrected files. If you grabbed an early build, re-download a current GGUF and update llama.cpp. This is the single most common Phi-4 gotcha; see Troubleshooting.

Requirements

Component	Minimum	Tested target
GPU	12GB VRAM (Q4_K_M is 8.89GB — it does not fit 8GB)	RX 7800 XT (16GB, RDNA3 Navi 32, gfx1101)
RAM	16GB system RAM	32GB comfortable
Storage	~9GB (Q4_K_M) up to ~15.6GB (Q8_0)	~12GB for Q6_K
Driver	AMD ROCm v7 (installed via `amdgpu-install`) on Linux	—
Software	llama.cpp (HIP build) or Ollama + a recent GGUF; optional Open WebUI chat client	`llama-server`, Open WebUI

Model weights (first-party GGUF exists). Microsoft publishes both the full-precision weights (microsoft/phi-4) and an official GGUF (microsoft/phi-4-gguf, MIT). Note a naming quirk in Microsoft's repo: it names its K-quants Q4_K / Q5_K (not Q4_K_M / Q5_K_M) — sizes there are Q4_K 9.05GB, Q4_K_S 8.44GB, Q5_K 10.60GB, Q6_K 12.03GB, Q8_0 15.58GB, BF16 29.32GB. For most users we recommend the community unsloth/phi-4-GGUF (MIT), which ships the conventional K_M ladder with the documented chat-template fix already applied; bartowski/phi-4-GGUF offers imatrix quants too. Byte-verified on-disk sizes (unsloth K_M ladder):

Quant	On-disk size	Fit on RX 7800 XT (16GB)
Q4_K_M	8.89GB	Smallest footprint — largest KV cache / context room; also the practical floor for a 12GB card (too big for 8GB)
Q5_K_M	10.41GB	Comfortable — room for a large KV cache
Q6_K	12.03GB	Recommended — near-lossless-feeling weights, and 12.03GB + ~3.3GB f16 KV at full 16K ≈ 15.4GB fits the 16GB card with a little to spare
Q8_0	15.58GB	Tight on 16GB — the weights alone leave only ~0.4GB, so a full-16K f16 KV cache won't fit. Usable only with a quantized KV cache (`-fa on -ctk q8_0 -ctv q8_0`) and/or a reduced `-c`; otherwise prefer Q6_K
F16 / BF16	29.32GB	Full precision — does not fit a 16GB card (needs 32GB+); not an option here
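
The fit column can be re-derived from the on-disk sizes. A rough sketch, assuming the card offers 16,000MB (decimal, matching the ~0.4GB remainder the table quotes for Q8_0), a full-16K f16 KV cache of about 3,355MB, and no allowance for compute buffers or a desktop session, so real headroom will be somewhat smaller. The helper name headroom_mb is hypothetical:

```shell
# Headroom left after weights plus a full-16K f16 KV cache, in decimal MB.
# Assumptions: 16000 MB of usable VRAM; 3355 MB KV cache
# (2 * 40 layers * 10 KV heads * 128 head_dim * 2 bytes * 16384 tokens).
headroom_mb() {
    weights_mb=$1
    echo $(( 16000 - weights_mb - 3355 ))
}

# Sizes are the on-disk figures from the table above.
for entry in Q4_K_M:8890 Q5_K_M:10410 Q6_K:12030 Q8_0:15580; do
    quant=${entry%%:*}
    size=${entry##*:}
    echo "$quant: $(headroom_mb "$size") MB headroom"
done
```

A negative result (as for Q8_0) means the f16 cache does not fit at the full context; a small positive one (Q6_K) means it fits with little margin.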

Licensing. Phi-4 is MIT — free for commercial and non-commercial use, no revenue caps (model card).

Installation

Prerequisite — install the AMD ROCm v7 driver

The RX 7800 XT (gfx1101) is an officially ROCm-supported GPU — it is listed with LLVM target gfx1101 in AMD's ROCm Linux system-requirements matrix — but ROCm is not bundled with Ollama or the llama.cpp release binaries; you install it once at the OS level. Per the Ollama AMD GPU docs, Ollama requires the AMD ROCm v7 driver on Linux, installed or upgraded with the amdgpu-install utility. On Ubuntu 24.04 (Noble), install ROCm 7.2.1 via the standard amdgpu-install flow (AMD's Radeon ROCm install docs cover the current packages):

# 1. Add the amdgpu-install package and install ROCm
wget https://repo.radeon.com/amdgpu-install/7.2.1/ubuntu/noble/amdgpu-install_7.2.1.70201-1_all.deb
sudo apt install ./amdgpu-install_7.2.1.70201-1_all.deb
sudo apt update
sudo amdgpu-install -y --usecase=graphics,rocm

# 2. Add yourself to the render/video groups (log out/in afterward)
sudo usermod -a -G render,video $LOGNAME

After logging back in, confirm the driver sees the card — rocm-smi should list the RX 7800 XT (this is the ROCm equivalent of nvidia-smi; there is no nvidia-smi on an AMD box). Because the RX 7800 XT is on the supported-GPU matrix as gfx1101, you should not normally need an HSA_OVERRIDE_GFX_VERSION masquerade. If a tool ships only gfx1100 kernels and refuses to start on your card, the documented Linux fallback is to export HSA_OVERRIDE_GFX_VERSION=11.0.0 so the gfx1101 card presents as gfx1100 — treat that as a fallback, not a default.
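
A quick post-install check can confirm both points from this step: group membership and the GPU target ROCm reports. This is a sketch; it assumes rocminfo (installed with ROCm) prints the gfx name of each GPU agent, and the helper name in_gpu_groups is made up for illustration:

```shell
# Succeeds if the given space-separated group list contains both render and video.
in_gpu_groups() {
    case " $1 " in *" render "*) ;; *) return 1 ;; esac
    case " $1 " in *" video "*) ;; *) return 1 ;; esac
}

if in_gpu_groups "$(id -nG)"; then
    echo "groups OK"
else
    echo "missing render/video group: re-run usermod, then log out and back in"
fi

# List the GPU targets ROCm reports; the RX 7800 XT should show gfx1101.
if command -v rocminfo >/dev/null 2>&1; then
    rocminfo | grep -o 'gfx[0-9a-f]*' | sort -u
else
    echo "rocminfo not found: ROCm does not appear to be installed"
fi
```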

You have two GGUF runtimes; pick one. Both are fine for this model — there is no patch requirement — so choose Ollama for the fastest start, or llama.cpp for the most control over context and KV-cache quantization. Either way, use a recent build and a recent GGUF so the chat-template / EOS fix is in place.

Option A — llama.cpp built with HIP/ROCm (recommended for full control)

1. Build llama.cpp with the HIP backend. Per the llama.cpp build docs, the Linux HIP build pattern is the same as for any RDNA3 card — only the GPU_TARGETS value changes. For the RX 7800 XT, pin it to gfx1101 (the card's LLVM target per the AMD ROCm system-requirements matrix):

git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
# RX 7800 XT is RDNA3 Navi 32 = gfx1101
HIPCXX="$(hipconfig -l)/clang" HIP_PATH="$(hipconfig -R)" \
    cmake -S . -B build -DGGML_HIP=ON -DGPU_TARGETS=gfx1101 -DCMAKE_BUILD_TYPE=Release \
    && cmake --build build --config Release -- -j 16

-DGGML_HIP=ON selects the ROCm backend; -DGPU_TARGETS=gfx1101 pins the kernels to the RX 7800 XT's architecture (Navi 32). Do not copy a gfx1100 value from a 7900-series guide — that is the wrong target for this card. A recent release is what matters here: Phi-4 uses the long-supported phi3 architecture, so there is no PR branch to check out — but a current build carries the corrected Phi-4 chat template / EOS handling. The GGUF quants are integer formats, so RDNA3's lack of FP8 tensor hardware is irrelevant here.

2. That's it for install — llama.cpp pulls the GGUF straight from Hugging Face at launch (next section). No separate download step.

Option B — Ollama

Ollama is built on llama.cpp and is the fastest way to stand this model up. Install it with the Linux one-liner; Ollama detects the ROCm runtime you installed above and runs the gfx1101 card without any manual architecture flag:

curl -fsSL https://ollama.com/install.sh | sh

Per the Ollama AMD preview blog, all of Ollama's features can be accelerated by AMD graphics cards on Linux. Then either use the curated tag (ollama run phi4) or pull the community GGUF straight from Hugging Face (HF × Ollama docs):

ollama run hf.co/unsloth/phi-4-GGUF:Q6_K

Swap the :Q6_K tag for :Q5_K_M or :Q4_K_M if you want a smaller footprint. Prefer a recent Ollama version so the current, fixed template is used.

Running

With llama.cpp

Serve an OpenAI-compatible API on port 8000. The -hf flag pulls the GGUF from Hugging Face; append :Q6_K (case-insensitive) to pick the quant (llama-server docs):

# Q6_K (recommended), offload all layers to the 7800 XT, full 16K context
./build/bin/llama-server -hf unsloth/phi-4-GGUF:Q6_K \
    --port 8000 \
    -ngl 99 \
    -c 16384 \
    --jinja

-ngl 99 (--n-gpu-layers) offloads every layer to the GPU — the dense 14B quant file (12.03GB at Q6_K) sits entirely in the 16GB VRAM, leaving room for the KV cache.
-c 16384 sets the full 16K context — that's Phi-4's ceiling (max_position_embeddings 16384), so there's no benefit to requesting more. At Q6_K the ~3.3GB f16 KV cache at full context fits alongside the weights.
--jinja applies the GGUF's built-in <|im_start|>…<|im_sep|>…<|im_end|> chat template so the assistant format parses correctly — required for clean turn boundaries.
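
Loading 12GB of weights takes a moment, so a script that talks to the server right after launch may hit a refused connection. A minimal sketch of a readiness wait, assuming curl is installed and relying on llama-server's /health endpoint; the wait_for_server helper and its retry values are illustrative:

```shell
# Poll a health URL until it answers with HTTP 2xx, or give up after N tries.
wait_for_server() {
    url=$1; tries=$2; delay=$3
    while [ "$tries" -gt 0 ]; do
        # -f: treat HTTP errors (e.g. 503 while the model loads) as failure
        if curl -fsS -o /dev/null --max-time 2 "$url" 2>/dev/null; then
            return 0
        fi
        tries=$(( tries - 1 ))
        sleep "$delay"
    done
    return 1
}

# Usage (run after starting llama-server in another terminal):
#   wait_for_server http://localhost:8000/health 60 2 && echo "server ready"
```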

At Q6_K on 16GB you generally do not need to quantize the KV cache: the full-16K f16 cache fits. If you want to run the higher-quality Q8_0 (15.58GB) on this 16GB card, the weights alone leave only ~0.4GB, so a full-16K f16 KV cache will not fit. Quantizing the cache to q8_0 roughly halves it, but half of ~3.3GB is still more than that headroom, so expect to lower -c as well:

# Q8_0 on 16GB: quantize the KV cache; if the cache still fails to allocate, lower -c until it loads
./build/bin/llama-server -hf unsloth/phi-4-GGUF:Q8_0 \
    --port 8000 -ngl 99 -c 16384 --jinja \
    -fa on -ctk q8_0 -ctv q8_0

A note on Flash Attention on RDNA3. Flash Attention on the ROCm/HIP backend is less mature than on CUDA — if -fa on misbehaves or a quantized KV cache errors on your ROCm version, fall back to an un-quantized cache with a smaller -c, or just use Q6_K (12.03GB), which fits a full-16K f16 cache on 16GB without needing -fa at all. Q6_K at 14B is already near-lossless-feeling, so it is the recommended path here.
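
The choice described above (a plain f16 cache for Q6_K and smaller, a quantized cache for Q8_0) can be captured in a small launcher. This is a sketch, not a tuned configuration; the kv_flags helper and the QUANT variable are hypothetical, and the flags it emits are the ones already used in this section:

```shell
# Pick extra llama-server flags for a given quant on a 16GB card.
kv_flags() {
    case "$1" in
        Q8_0) echo "-fa on -ctk q8_0 -ctv q8_0" ;;  # weights leave ~0.4GB: quantize the cache
        *)    echo "" ;;                            # Q6_K and smaller: the f16 cache fits
    esac
}

QUANT=${QUANT:-Q6_K}
# Print the command instead of running it, so it can be reviewed first.
# $(kv_flags ...) is intentionally unquoted so an empty result adds no argument.
echo ./build/bin/llama-server -hf "unsloth/phi-4-GGUF:$QUANT" \
    --port 8000 -ngl 99 -c 16384 --jinja $(kv_flags "$QUANT")
```

Drop the echo once the printed command looks right. If a Q8_0 launch still runs out of memory, lower -c or fall back to Q6_K.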

With Ollama

Pull and run the community GGUF directly from Hugging Face; append a :quant tag to choose the quant (HF × Ollama docs):

ollama run hf.co/unsloth/phi-4-GGUF:Q6_K

Ollama serves an OpenAI-compatible API at http://localhost:11434/v1 for chat clients and uses the ROCm runtime on the gfx1101 card automatically. Use a recent Ollama build so the corrected Phi-4 template is applied.

Use it as a chat assistant

Point any OpenAI-compatible chat client at your local endpoint by setting its base URL and a dummy API key — no cloud, no per-token cost.

Open WebUI (optional local chat front-end). A self-hosted, ChatGPT-style UI that talks to any OpenAI-compatible server. Run it and point it at your local endpoint:

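# Point Open WebUI at your local llama-server (or Ollama on :11434).
# On Linux, --add-host makes host.docker.internal resolve to the host.
# llama-server listens on 127.0.0.1 by default; start it with --host 0.0.0.0
# if the container cannot reach it.
docker run -d -p 3000:8080 \
    --add-host=host.docker.internal:host-gateway \
    -e OPENAI_API_BASE_URL=http://host.docker.internal:8000/v1 \
    -e OPENAI_API_KEY=EMPTY \
    ghcr.io/open-webui/open-webui:main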

Then open http://localhost:3000 and chat. (Open WebUI also autodetects a local Ollama install, so with the Ollama path you can skip the base-URL wiring entirely.)

Directly via the API. Any OpenAI SDK or curl works against the same endpoint — use it for scripts, writing tools, or your own app:

curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
      "model": "phi-4",
      "messages": [{"role": "user", "content": "Summarize this in three bullet points: ..."}]
    }'

Local servers don't check the key, so any non-empty string (e.g. EMPTY) works where a client requires one.
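
For scripting, the same request can be wrapped so that switching between llama-server and Ollama only means changing the base URL and model name. A minimal sketch, assuming curl is available and that the prompt contains no double quotes or backslashes (the JSON is assembled with printf, without escaping). BASE_URL, MODEL, and the helper names are illustrative; with Ollama, the model name is the one it was pulled under:

```shell
BASE_URL=${BASE_URL:-http://localhost:8000/v1}   # Ollama: http://localhost:11434/v1
MODEL=${MODEL:-phi-4}                            # Ollama: hf.co/unsloth/phi-4-GGUF:Q6_K

# Build a chat-completions request body. No JSON escaping is done here.
chat_payload() {
    printf '{"model": "%s", "messages": [{"role": "user", "content": "%s"}]}' "$1" "$2"
}

# Send one user prompt and print the raw JSON response.
ask() {
    curl -sS "$BASE_URL/chat/completions" \
        -H "Content-Type: application/json" \
        -H "Authorization: Bearer EMPTY" \
        -d "$(chat_payload "$MODEL" "$1")"
}

# Usage:
#   ask "Explain grouped-query attention in two sentences."
```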

Results

VRAM usage: The dense 14B loads entirely as its GGUF file — Q6_K is 12.03GB on disk (byte-verified from the unsloth GGUF tree). On the RX 7800 XT's 16GB that leaves room for the ~3.3GB f16 KV cache at Phi-4's full 16K context — a comfortable fit. Q5_K_M (10.41GB) and Q4_K_M (8.89GB) shrink the footprint further for even more headroom — Q4_K_M is the practical floor for a 12GB card. Q8_0 (15.58GB) is tight on 16GB — the weights leave only ~0.4GB, so it needs a quantized KV cache or a reduced context (see Running); prefer Q6_K unless you specifically need Q8_0. The full-precision F16/BF16 GGUF (29.32GB) does not fit a 16GB card at all.
Model capability (vendor evals — Microsoft's own, NOT hardware throughput): Microsoft reports MMLU 84.8, GPQA 56.1, MATH 80.4, HumanEval 82.6, MGSM 80.6, DROP 75.5 — strong STEM/math/reasoning for a 14B. (Its SimpleQA score is low by design — that's a factuality / hallucination-resistance probe, not a capability score.) These are the vendor's benchmarks, not measurements on this GPU.
Speed: No community throughput benchmark for Phi-4 on the RX 7800 XT exists yet — we would rather omit a tok/s figure than invent one or borrow it from different hardware. Live measurements will appear at /check/phi-4/rx-7800-xt once contributed.

For the full benchmark data, see /check/phi-4/rx-7800-xt.

Troubleshooting

Runaway / garbled generation that won't stop — old GGUF template bug

This is the headline Phi-4 gotcha. Early-2025 (January) Phi-4 GGUFs shipped with the wrong end-of-turn token — EOS was <|endoftext|> instead of <|im_end|> — so the model never stops cleanly and produces runaway or garbled output. Fix it by using a current GGUF (the unsloth GGUF ships the corrected template — unsloth's write-up) and a recent llama.cpp / Ollama build. If you downloaded weights in early 2025, re-download them. Also make sure you pass --jinja so the embedded template is applied.

The chat template looks wrong / responses are malformed

Pass --jinja to llama-server so the GGUF's built-in chat template is applied; without it the turn markers may not be rendered the way the model expects. Phi-4 uses the <|im_start|>role<|im_sep|>…<|im_end|> template, a ChatML-style layout with an <|im_sep|> separator. Despite the phi3 architecture label, this is not Phi-3's <|user|> / <|end|> format, and it is not Mistral's Tekken. The GGUF / llama.cpp path uses the embedded tokenizer and template, so there's no extra Python install required.

Ollama or llama.cpp runs on the CPU instead of the GPU

Confirm the ROCm v7 driver is installed (rocm-smi should list the 7800 XT) and that your user is in the render and video groups (groups should show both — log out and back in after the usermod step). Per the Ollama AMD GPU docs, ROCm is a separate install from Ollama; if it's missing, Ollama silently falls back to CPU. For a source llama.cpp build, confirm you compiled with -DGGML_HIP=ON -DGPU_TARGETS=gfx1101. The RX 7800 XT (gfx1101) is on the supported-GPU matrix, so you should not normally need HSA_OVERRIDE_GFX_VERSION — reach for the HSA_OVERRIDE_GFX_VERSION=11.0.0 masquerade only if a specific tool ships gfx1100-only kernels and refuses to start.
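
With Ollama, one way to tell whether a loaded model landed on the GPU is the PROCESSOR column of ollama ps. A sketch, assuming that column reads "100% GPU" for a fully offloaded model (the exact wording may differ between Ollama versions); the fully_on_gpu helper is made up for illustration:

```shell
# Reads `ollama ps` output on stdin; succeeds if a model is fully on the GPU.
fully_on_gpu() {
    grep -q '100% GPU'
}

# Run this while a model is loaded (e.g. right after a chat turn).
if command -v ollama >/dev/null 2>&1; then
    if ollama ps | fully_on_gpu; then
        echo "model is fully offloaded to the GPU"
    else
        echo "no fully offloaded model found: check rocm-smi and group membership"
    fi
fi
```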

Out of memory when raising the context — or fitting Q8_0 on 16GB

Phi-4's context ceiling is only 16K, so the f16 KV cache stays modest (~3.3GB at full 16K). At Q6_K (12.03GB) that full-16K f16 cache fits the 16GB card, so plain -c 16384 works without KV quantization. The pressure point on this card is Q8_0 (15.58GB): the weights leave only ~0.4GB, so a full-16K f16 cache OOMs. Options, in order: use Q6_K (12.03GB, the recommended path: near-lossless-feeling and fits cleanly); or, to stay on Q8_0, quantize the KV cache with -fa on -ctk q8_0 -ctv q8_0 and lower -c. Quantizing to q8_0 roughly halves cache VRAM, but half of ~3.3GB is still well above the ~0.4GB left, so at Q8_0 a reduced context is likely needed as well. Flash Attention is also less mature on ROCm/RDNA3, so verify it behaves on your version. Don't try the F16/BF16 GGUF (29.32GB) on 16GB: it doesn't fit and needs a 32GB+ card.

Token generation feels slower than expected — try the Vulkan backend

On RDNA3 the ROCm/HIP backend can be slower at token generation than llama.cpp's Vulkan backend. Per llama.cpp issue #20934, measured on a 7900-series RDNA3 card, the Vulkan (RADV) backend outpaced ROCm for pure generation on a small model across ROCm 6.4.4–7.x. If your generation rate disappoints under ROCm on the 7800 XT, build llama.cpp with -DGGML_VULKAN=ON instead of -DGGML_HIP=ON and re-benchmark with llama-bench — Vulkan often wins for pure generation on RDNA3.
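
To compare the two backends on your own card, the Vulkan build can live in its own directory next to the HIP one. A sketch under stated assumptions: the Vulkan SDK and drivers are already installed, build-vulkan is an arbitrary directory name, MODEL_PATH is a placeholder for wherever your .gguf file is stored, and the bench helper is made up for illustration:

```shell
# Run llama-bench from a given build directory against a local GGUF file.
bench() {
    build_dir=$1; model=$2
    "$build_dir/bin/llama-bench" -m "$model" -ngl 99
}

# Build the Vulkan backend into its own directory so the HIP build stays intact:
#   cmake -S . -B build-vulkan -DGGML_VULKAN=ON -DCMAKE_BUILD_TYPE=Release
#   cmake --build build-vulkan --config Release -- -j 16
#
# Then compare (MODEL_PATH is a placeholder for your downloaded .gguf file):
#   bench build        "$MODEL_PATH"   # ROCm/HIP
#   bench build-vulkan "$MODEL_PATH"   # Vulkan
```

Running both against the same file and the same -ngl keeps the comparison like for like.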

`torch` / CUDA errors — this is llama.cpp on ROCm, not a Python ML stack

Serving Phi-4 via llama.cpp or Ollama does not require PyTorch, flash-attn wheels, or a Python ML stack, and it does not use CUDA at all on this card. If you hit a CUDA error, you almost certainly grabbed a CUDA build instead of the ROCm one; rebuild with -DGGML_HIP=ON -DGPU_TARGETS=gfx1101 (Option A) or reinstall Ollama with the ROCm driver present. At 14B, Q6_K/Q8_0 are already near-lossless, so there is no reason to reach for the full-precision weights on this card.

Model or GPU 404 on /check

Phi-4 is a new addition; if the /check/phi-4/rx-7800-xt link 404s, the catalogue row is still being registered. The recipe's install and run steps are independent of the benchmark endpoint.