self-hosted/ai
§01·recipe · llm

Qwen2.5-14B-Instruct on RX 7900 XTX: a fast local chat LLM via Ollama (ROCm)

llmbeginner9GB+ VRAMJun 27, 2026

This beginner recipe sets up Qwen2.5 14B on the RX 7900 XTX, needing about 9 GB of VRAM.

models
tools
prerequisites
  • AMD Radeon RX 7900 XTX (24GB) or equivalent gfx1100 RDNA3 card
  • Linux with AMD ROCm v7 driver installed (Ollama needs system ROCm, not bundled)
  • ~9GB free disk for the Q4_K_M GGUF weights

What You'll Build

A fully local, Apache-2.0-licensed 14B instruction-tuned chat assistant running on your AMD Radeon RX 7900 XTX. You pull one quantized GGUF, run one command, and get an OpenAI-compatible chat endpoint — no CUDA, no cloud, no token bills.

Hardware data: RX 7900 XTX (24GB VRAM) · ~32 tokens/s generation at Q4_K_M · See benchmark data

ℹ️ This is the AMD/ROCm path. Qwen2.5-14B-Instruct is hardware-agnostic at the model level, but the install differs sharply from NVIDIA: there is no CUDA, no flash-attn wheel, and no FP8 weight trick (RDNA3 has no FP8 tensor cores — an FP8 file would just upcast to BF16 with no memory win). The clean, supported path on this card is GGUF via Ollama (which uses llama.cpp's HIP backend under the hood).

Requirements

ComponentMinimumTested
GPU12GB VRAM (for Q4_K_M + context)RX 7900 XTX (24GB)
RAM16GB
Storage~9GB (Q4_K_M GGUF)8.99GB per the Qwen GGUF card
SoftwareLinux + AMD ROCm v7 driver, Ollama

The model itself is small relative to this card: Q4_K_M weights are 8.99 GB (Qwen2.5-14B-Instruct-GGUF card, confirmed by the Ollama qwen2.5:14b tag at 9.0 GB). That leaves well over 12 GB of the 24 GB free for KV cache, so you can run the model's full 32,768-token native context comfortably.

⚠️ About the 24 GB benchmark figure. The /check peak-VRAM datapoint reads 24 GB — that is the LocalScore harness reporting the card's total memory on the RX 7900 XTX accelerator page, not the model's footprint. Qwen2.5-14B at Q4_K_M actually resides in ~9 GB; this recipe's min_vram_gb reflects the real GGUF footprint.

Installation

1. Install the AMD ROCm v7 driver (Linux)

Ollama uses your system's ROCm stack — it does not bundle it. Per the Ollama GPU docs, Ollama requires the AMD ROCm v7 driver on Linux, and the RX 7900 XTX is on the officially supported AMD Radeon list (alongside the 7900 XT/GRE, 7800 XT, and the 9000/6000 series). Install ROCm via AMD's amdgpu-install for your distro, then reboot.

The 7900 XTX is gfx1100, an officially-supported target, so no HSA_OVERRIDE_GFX_VERSION masquerade is needed — that override is only for cards ROCm doesn't list natively.

2. Install Ollama

curl -fsSL https://ollama.com/install.sh | sh

3. Pull the model

ollama pull qwen2.5:14b

This fetches the Q4_K_M build (9.0 GB) by default. To use Qwen's own official GGUF instead, you can pull directly from the Hub:

ollama run hf.co/Qwen/Qwen2.5-14B-Instruct-GGUF:Q4_K_M

Both commands are documented verbatim on the Qwen2.5-14B-Instruct-GGUF card.

Running

Start an interactive chat:

ollama run qwen2.5:14b

Or serve an OpenAI-compatible API on http://localhost:11434:

ollama serve
# then, from another shell:
curl http://localhost:11434/api/generate -d '{
  "model": "qwen2.5:14b",
  "prompt": "Explain RDNA3 WMMA in two sentences."
}'

On first run Ollama loads the weights onto the GPU; rocm-smi (or radeontop) should show ~9 GB resident plus your KV cache. Subsequent prompts stream tokens immediately.

Results

  • Speed: ~32 tokens/s generation at Q4_K_M on the RX 7900 XTX, per the /check benchmark (prompt processing 487 tok/s, TTFT 2.64s). The live LocalScore accelerator page — a community-submission aggregator that drifts as more runs land — currently reads ~24.2 tok/s for the same Qwen2.5 14B Instruct Q4_K_M row; treat ~24-32 tok/s as the realistic band and contribute your own measurement via /contribute.
  • VRAM usage: ~9 GB resident for Q4_K_M weights, leaving ample headroom for full 32K context on the 24 GB card. See /check/qwen2-5-14b/rx-7900-xtx.
  • Quality notes: 14.7B params, 48 layers, Apache 2.0, instruction-tuned causal LM with 131,072-token training context (32K default, YaRN-extendable) per the Qwen2.5-14B-Instruct card. It is a text-only chat/instruct model — not a reasoning model with <think> traces, so KV pressure stays modest.

For the full benchmark data, see /check/qwen2-5-14b/rx-7900-xtx.

Troubleshooting

Ollama doesn't use the GPU / falls back to CPU

Ollama needs system ROCm v7 installed and visible (it isn't bundled) — see the Ollama GPU docs. Confirm the card is detected with rocm-smi, ensure your user is in the render and video groups, and re-run after a reboot. The 7900 XTX (gfx1100) is officially supported, so you should not need HSA_OVERRIDE_GFX_VERSION.

Token generation feels slower than expected

On RDNA3, the llama.cpp Vulkan backend can sometimes outrun the ROCm/HIP backend that Ollama uses. A community report, llama.cpp issue #20934 ("Misc. bug: [ROCm] Significantly lower token generation performance vs Vulkan on RX 7900 XTX (gfx1100)", filed by a community user), measured ROCm ~15-25% behind Vulkan on a small model. If throughput matters, build llama.cpp with the Vulkan backend and benchmark both on your own workload — results vary by model size and ROCm version.

Want more context or a higher-quality quant

You have headroom: at 24 GB you can run a larger quant. The Qwen GGUF card lists Q5_K_M (10.5 GB), Q6_K (12.1 GB), and Q8_0 (15.7 GB) — all fit with room for full 32K context. Note the bigger tiers ship as sharded files; merge them first with ./llama-gguf-split --merge qwen2.5-14b-instruct-q5_k_m-00001-of-00003.gguf qwen2.5-14b-instruct-q5_k_m.gguf (command per the same card).

No other widely-reported issues. Report problems via the submission form.

common questions
How much VRAM does Qwen2.5 14B need?

About 9 GB — the minimum this recipe targets.

Which GPUs is Qwen2.5 14B tested on?

RX 7900 XTX (24 GB).

How hard is this setup?

Beginner — follow the steps above.