How much VRAM does Qwen3 14B need?

About 12 GB — the minimum this recipe targets.

How hard is this setup?

Beginner: follow the steps below.

Qwen3-14B on RX 7900 XTX: ROCm via Ollama or llama.cpp-HIP

What You'll Build

A local Qwen3-14B chat / reasoning assistant running on a 24 GB Radeon RX 7900 XTX (RDNA3, Navi 31, gfx1100) through the ROCm stack — served via Ollama for the one-command path, or llama.cpp compiled with HIP for full control over the quant tier. The recipe pins the dense 14.8B variant at a GGUF Q4_K_M / Q5_K_M quant (9.0–10.5 GB on disk), which leaves generous headroom on the 24 GB card for Qwen3's 32k-native context window, the thinking-mode chain of thought, and the KV cache.

Hardware data: RX 7900 XTX (24GB VRAM) · Q4_K_M / Q5_K_M GGUF or AWQ-INT4 · ROCm 7 · See benchmark data

⚠️ This is a ROCm recipe, not CUDA. The RX 7900 XTX runs on AMD's ROCm/HIP stack — there is no cu124/cu128 wheel, no FlashAttention-2 prebuilt wheel, and no FP8/FP4 path here (RDNA3 has no FP8/FP4 hardware — an FP8 checkpoint would just upcast to BF16 with no memory saving). The attention path is PyTorch SDPA. Quantization is GGUF (via llama.cpp-HIP), AWQ-INT4, or BF16 — not ExLlamaV2, not Marlin. If a guide tells you to pip install flash-attn or pick a cu12x wheel for this card, it's written for the wrong vendor.

⚠️ At 24 GB the variant AND the precision are binding for the 14B. Unlike the 8B sibling, Qwen3-14B in BF16 does NOT fit 24 GB: the official Qwen speed benchmark reports 28,402 MB for BF16 at input length 1, growing to 33,336 MB at 30k context — over the card's 24 GB ceiling before any real context loads. The BF16 GGUF is even larger on disk (29.54 GB per unsloth/Qwen3-14B-GGUF). So lead with a GGUF Q4_K_M / Q5_K_M quant or AWQ-INT4 (9,962 MB per the same benchmark); treat BF16 as out of reach on this card. Qwen3 ships eight sizes (0.6b, 1.7b, 4b, 8b, 14b, 30b/MoE, 32b, 235b/MoE) per the Ollama qwen3:14b tag list — these instructions are for the dense 14.8B model only.

ℹ️ Thinking mode is on by default. Per the Qwen3-14B model card, Qwen3 has a built-in chain-of-thought ("thinking") mode toggled by enable_thinking, with soft switches /think and /no_think you can add to a prompt. Output starts with a <think>...</think> block (often 2k–4k tokens on hard problems) followed by the user-facing answer. Send /no_think to skip it for latency-sensitive turns.

Requirements

Component	Minimum	Tested
GPU	12 GB VRAM (Q4_K_M / Q5_K_M weights + KV headroom)	RX 7900 XTX (24 GB)
RAM	16 GB system	—
Storage	9.0 GB (Q4_K_M GGUF) or 10.5 GB (Q5_K_M)	per unsloth/Qwen3-14B-GGUF
Driver	AMD ROCm v7 (installed via `amdgpu-install`) on Linux	—
Runtime	Ollama / llama.cpp (HIP build) / LM Studio	—

The model is released under Apache 2.0 (14.8B parameters) — commercial use is permitted. The weights are not gated on Hugging Face, so no access request or login is required.

Installation

Prerequisite — install the AMD ROCm v7 driver

The RX 7900 XTX (gfx1100) is an officially ROCm-supported GPU, but ROCm is not bundled with Ollama or the llama.cpp release binaries — you install it once at the OS level. Per the Ollama AMD GPU docs: "Ollama requires the AMD ROCm v7 driver on Linux. You can install or upgrade using the amdgpu-install utility." On Ubuntu 24.04 (Noble), install ROCm 7.2.1 via the standard amdgpu-install flow (AMD's Radeon ROCm install docs cover the current packages):

# 1. Add the amdgpu-install package and install ROCm
wget https://repo.radeon.com/amdgpu-install/7.2.1/ubuntu/noble/amdgpu-install_7.2.1.70201-1_all.deb
sudo apt install ./amdgpu-install_7.2.1.70201-1_all.deb
sudo apt update
sudo amdgpu-install -y --usecase=graphics,rocm

# 2. Add yourself to the render/video groups (log out/in afterward)
sudo usermod -a -G render,video $LOGNAME

The RX 7900 XTX is on Ollama's supported AMD Radeon RX list, and gfx1100 is in its supported LLVM-target list — so no HSA_OVERRIDE_GFX_VERSION masquerade is needed for this card (that override is only for cards ROCm doesn't ship kernels for).
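
A quick post-install check can confirm both conditions before moving on. The sketch below is illustrative rather than taken from the upstream docs: the helper names check_groups and has_gfx1100 are made up for this example, and it assumes rocminfo (shipped with ROCm) prints the gfx1100 target name for the card.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for a post-install sanity check (names are illustrative).

# Succeeds only if the group list contains both render and video.
# $1 = space-separated group list; defaults to the current user's groups.
check_groups() {
  local have="${1:-$(id -nG)}" g
  for g in render video; do
    case " $have " in
      *" $g "*) ;;                              # group present
      *) echo "missing group: $g"; return 1 ;;
    esac
  done
  echo "groups ok"
}

# Reads rocminfo output on stdin and succeeds if a gfx1100 agent is listed.
has_gfx1100() {
  grep -q 'gfx1100'
}

# Usage after logging out and back in (requires ROCm to be installed):
#   check_groups && rocminfo | has_gfx1100 && echo "ROCm sees the 7900 XTX"
```

If check_groups reports a missing group, repeat the usermod step and start a fresh login session before retrying.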

Option A — Ollama (recommended)

1. Install Ollama

# Linux
curl -fsSL https://ollama.com/install.sh | sh

Per the Ollama AMD preview blog: "All the features of Ollama can now be accelerated by AMD graphics cards on Ollama for Linux and Windows," with the RX 7900 XTX named in its supported-card list. Ollama detects the ROCm runtime installed in the prerequisite step.

2. Pull the 14B model

ollama pull qwen3:14b

This fetches a 9.3 GB Q4_K_M checkpoint per the Ollama qwen3:14b tag (14.8B parameters, Q4_K_M quantization). The download is one file — no manual quant-tier selection needed, and it fits the 24 GB card with room to spare.

Option B — llama.cpp built with HIP/ROCm

For full control over the quant tier (Q5_K_M / Q6_K for higher fidelity), build llama.cpp against HIP and target the gfx1100 architecture directly.

1. Build llama.cpp with the HIP backend

Per the llama.cpp build docs, the Linux HIP build for an RDNA3 card like the RX 7900 XTX is:

git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
HIPCXX="$(hipconfig -l)/clang" HIP_PATH="$(hipconfig -R)" \
    cmake -S . -B build -DGGML_HIP=ON -DGPU_TARGETS=gfx1100 -DCMAKE_BUILD_TYPE=Release \
    && cmake --build build --config Release -- -j 16

-DGGML_HIP=ON selects the ROCm backend; -DGPU_TARGETS=gfx1100 pins the kernels to the 7900 XTX's architecture (the build docs use gfx1100 as the explicit example for the "Radeon RX 7900XTX").

2. Pull the quant you want

File sizes per tier, taken from the unsloth/Qwen3-14B-GGUF Files tab (the page links back to upstream Qwen/Qwen3-14B) and cross-checked against bartowski/Qwen_Qwen3-14B-GGUF:

Quant	File size	Fits 24 GB with KV headroom?
Q4_K_M	9.00 GB	yes — community default, lots of room
UD-Q4_K_XL	9.16 GB	yes — Unsloth Dynamic 2.0 imatrix-tuned
Q5_K_M	10.51 GB	yes — better quality, recommended on this card
Q6_K	12.12 GB	yes — comfortable, "near perfect" fidelity
Q8_0	15.70 GB	yes — near-lossless, still leaves ~8 GB for KV
BF16	29.54 GB	no — overflows the 24 GB 7900 XTX
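
The headroom implied by the file sizes above can be sketched with simple subtraction. This is a rough estimate only: it assumes a nominal 24 GB card, ignores runtime buffers and the GB-versus-GiB distinction, and headroom_gb is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Rough headroom per quant tier: nominal card size minus GGUF file size.
# $1 = GGUF file size in GB, $2 = card VRAM in GB (default 24)
headroom_gb() {
  awk -v size="$1" -v vram="${2:-24}" 'BEGIN { printf "%.2f\n", vram - size }'
}

# File sizes from the table above; a negative result means the tier does not fit.
while read -r tier size; do
  printf '%-8s %6s GB left\n' "$tier" "$(headroom_gb "$size")"
done <<'EOF'
Q4_K_M 9.00
Q5_K_M 10.51
Q6_K 12.12
Q8_0 15.70
BF16 29.54
EOF
```

Whatever is left over has to cover the KV cache and the runtime's compute buffers, so treat the result as an upper bound on usable headroom.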

On the 24 GB 7900 XTX every quant tier from Q4_K_M up through Q8_0 (15.70 GB) fits with generous KV-cache headroom — Q8_0 is comfortable here (it was not on a 16 GB card). Only the BF16 GGUF (29.54 GB) overflows. Then run via the llama.cpp Hugging Face shortcut (per the unsloth model card):

# OpenAI-compatible local server with web UI
./build/bin/llama-server -hf unsloth/Qwen3-14B-GGUF:Q5_K_M --ctx-size 16384

# Interactive terminal
./build/bin/llama-cli -hf unsloth/Qwen3-14B-GGUF:Q5_K_M --ctx-size 16384

Option C — LM Studio (GUI)

LM Studio ships a ROCm runtime backend and offers a one-click install path. Search "Qwen3-14B GGUF" inside the app and pick the Q5_K_M (or Q6_K / Q8_0) tier, or use the direct-import link lmstudio://open_from_hf?model=unsloth/Qwen3-14B-GGUF. On the 24 GB 7900 XTX you have room for any GGUF tier through Q8_0 — but not the BF16 GGUF.

Running

One-shot prompt via Ollama

ollama run qwen3:14b "Explain the difference between MoE and dense transformer architectures in three sentences."

First run loads the model into VRAM (~9 GB resident at idle for the Q4_K_M weights, growing as the KV cache fills with longer contexts). Watch GPU activity in another terminal with rocm-smi to confirm the card is doing the work.

Disable thinking mode for short answers

ollama run qwen3:14b "/no_think What's the capital of France?"

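Per the Qwen3-14B model card, /no_think is a soft switch: the model skips its chain of thought for that turn, although an empty <think>...</think> block may still appear in the raw output. It is separate from the enable_thinking=False hard switch, which is set in the chat template rather than in the prompt.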

OpenAI-compatible HTTP API

# Ollama exposes localhost:11434 by default
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3:14b",
    "messages": [{"role": "user", "content": "Write a haiku about a Radeon GPU."}]
  }'

The upstream Qwen3-14B card also documents vllm serve Qwen/Qwen3-14B --enable-reasoning --reasoning-parser deepseek_r1 and python -m sglang.launch_server --model-path Qwen/Qwen3-14B --reasoning-parser qwen3. Both default to BF16 weights (~28 GB), which overflow the 7900 XTX's 24 GB. The fitting route for those servers is the AWQ-INT4 weights (Qwen/Qwen3-14B-AWQ, ~9,962 MB resident at length 1 and 15,323 MB at 30k per the Qwen benchmark). On ROCm, vLLM must be launched with VLLM_USE_TRITON_FLASH_ATTN=0, because the Triton FlashAttention path overflows the stack frame on gfx1100. For a single-GPU local setup, Ollama or llama.cpp-HIP with a GGUF quant is the simpler path.

Results

Speed: No Qwen3-14B token-generation benchmark measured on an RX 7900 XTX was found at the time of writing. Published 7900 XTX figures cover other models, for example Qwen2.5-14B Q4_K at ~27.9 tok/s on localscore.ai and the Qwen3.6-35B-A3B MoE at ~130 tok/s via llama.cpp on ROCm, but neither is Qwen3-14B and neither result transfers. Rather than borrow a number from a different model or a different vendor's card, the speed figure is omitted here. If you've measured Qwen3-14B tok/s on a 7900 XTX, please contribute it so it lands on /check/qwen3-14b/rx-7900-xtx. As a general caveat, token-generation throughput under ROCm on RDNA3 tends to trail a comparable NVIDIA card, and the ROCm backend itself often trails the Vulkan llama.cpp backend on this GPU (see Troubleshooting).
VRAM usage: At idle the Q4_K_M weights occupy ~9 GB (file size 9.00 GB); the runtime grows the KV cache from there with context length. The official Qwen speed benchmark gives the Transformers precision/VRAM ladder for Qwen3-14B: AWQ-INT4 = 9,962 MB at length 1 / 15,323 MB at 30k context, FP8 = 16,012 MB / 20,813 MB, BF16 = 28,402 MB / 33,336 MB. On the 24 GB 7900 XTX the AWQ-INT4 / Q4_K_M / Q5_K_M / Q6_K / Q8_0 paths all fit with KV headroom; BF16 overflows (and FP8 buys you nothing on RDNA3 — no FP8 hardware, so it upcasts to BF16). See /check/qwen3-14b/rx-7900-xtx for any community-submitted measurement.
Quality notes: Q4_K_M is the community-default "sweet spot", but with 24 GB there's no memory pressure to stop there: run Q5_K_M (10.51 GB), Q6_K (12.12 GB), or even Q8_0 (15.70 GB) for higher fidelity, all of which leave room for a large KV cache. Per the model card best-practices note, for thinking mode use Temperature=0.6, TopP=0.95, TopK=20, MinP=0, and for non-thinking mode Temperature=0.7, TopP=0.8, TopK=20, MinP=0. Do not use greedy decoding in thinking mode, since it can degrade output and cause endless repetitions.
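
As a sketch of how the sampling settings above could be passed to the OpenAI-compatible endpoint from the Running section, the helper below builds a request body for either mode. chat_payload is a hypothetical name. Only temperature and top_p are set because those are standard OpenAI-style fields; whether TopK and MinP are honored over this endpoint is not confirmed here. The escaping handles quotes and backslashes only, not newlines.

```shell
#!/usr/bin/env bash
# Hypothetical helper: build a chat-completions body with the model card's
# suggested temperature/top_p for thinking or non-thinking mode.
# $1 = think | no_think, $2 = prompt (single line)
chat_payload() {
  local mode="$1" temp top_p prefix prompt
  if [ "$mode" = "think" ]; then
    temp=0.6; top_p=0.95; prefix=""
  else
    temp=0.7; top_p=0.8; prefix="/no_think "
  fi
  # Escape backslashes and double quotes so the prompt stays valid JSON.
  prompt=$(printf '%s' "$2" | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g')
  printf '{"model":"qwen3:14b","temperature":%s,"top_p":%s,"messages":[{"role":"user","content":"%s%s"}]}\n' \
    "$temp" "$top_p" "$prefix" "$prompt"
}

# Usage (assumes Ollama is serving on its default port):
#   curl http://localhost:11434/v1/chat/completions \
#     -H "Content-Type: application/json" \
#     -d "$(chat_payload no_think 'What is the capital of France?')"
```

Keeping the two presets in one place makes it harder to pair a thinking-mode prompt with non-thinking sampling values by accident.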

For the full benchmark data and other-GPU comparisons, see /check/qwen3-14b/rx-7900-xtx.

Troubleshooting

BF16 weights OOM at load

Qwen3-14B in BF16 needs ~28 GB resident (28,402 MB at input length 1 per the Qwen speed benchmark). That overflows the 7900 XTX's 24 GB before any context loads, and the BF16 GGUF (29.54 GB) is larger still. The 14B is the smallest Qwen3 size whose BF16 weights no longer fit this card. The fix is to run a quant: any GGUF tier through Q8_0 (15.70 GB) fits comfortably, or use the AWQ-INT4 weights (~9,962 MB) for a vLLM/SGLang server. Don't reach for FP8 to "save memory": RDNA3 has no FP8 hardware, so an FP8 checkpoint upcasts to BF16 at load and gives no memory win.

Ollama runs on the CPU instead of the GPU

Confirm the ROCm v7 driver is installed (rocm-smi should list the 7900 XTX) and that your user is in the render and video groups (groups should show both — log out and back in after the usermod step). Per the Ollama AMD GPU docs, ROCm is a separate install from Ollama; if it's missing, Ollama silently falls back to CPU. The RX 7900 XTX (gfx1100) is natively supported, so you should not need HSA_OVERRIDE_GFX_VERSION — only unsupported cards need that masquerade.
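
One way to confirm where the model actually landed is ollama ps, which lists loaded models along with a PROCESSOR column. The sketch below assumes that column reads "100% GPU" for a fully offloaded model; loaded_on_gpu is an illustrative helper name, not an Ollama command.

```shell
#!/usr/bin/env bash
# Reads `ollama ps` output on stdin and reports where qwen3:14b is running.
loaded_on_gpu() {
  awk '
    /^qwen3:14b/ { found = 1; print (/100% GPU/ ? "gpu" : "cpu or split"); exit }
    END { if (!found) print "not loaded" }
  '
}

# Usage (run while the model is still loaded):
#   ollama ps | loaded_on_gpu
```

A "cpu or split" result after a fresh load is the cue to recheck the driver and group membership described above.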

Token generation feels slower than expected — try the Vulkan backend

On RDNA3 the ROCm/HIP backend can be 13–23% slower at token generation than the Vulkan backend in llama.cpp. Per llama.cpp issue #20934, on the RX 7900 XTX (gfx1100) Vulkan (RADV) reached ~167–177 tok/s on Llama 7B Q4_0 while ROCm landed at ~129–144 tok/s across ROCm 6.4.4–7.x. (That figure is for Llama 7B, not Qwen3-14B — it characterizes the backend gap on this card, not this model's speed.) If your generation rate disappoints under ROCm, build llama.cpp with -DGGML_VULKAN=ON instead of -DGGML_HIP=ON and re-benchmark with llama-bench — Vulkan often wins for pure generation on this card.
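
To put a number on the gap for your own build, the sketch below computes how far ROCm trails Vulkan from two llama-bench results. It assumes a second build directory named build-vulkan, an installed Vulkan SDK, and a local GGUF path; those names and the helper pct_slower are illustrative.

```shell
#!/usr/bin/env bash
# Percentage by which ROCm token generation trails Vulkan.
# $1 = ROCm tok/s, $2 = Vulkan tok/s
pct_slower() {
  awk -v rocm="$1" -v vk="$2" 'BEGIN { printf "%.1f\n", (vk - rocm) / vk * 100 }'
}

# Build and benchmark steps (not executed here; paths are assumptions):
#   cmake -S . -B build-vulkan -DGGML_VULKAN=ON -DCMAKE_BUILD_TYPE=Release
#   cmake --build build-vulkan --config Release -- -j 16
#   ./build/bin/llama-bench        -m /path/to/Qwen3-14B-Q5_K_M.gguf -p 512 -n 128
#   ./build-vulkan/bin/llama-bench -m /path/to/Qwen3-14B-Q5_K_M.gguf -p 512 -n 128

# Worked example with the Llama 7B Q4_0 figures quoted above:
pct_slower 129 167   # prints 22.8
pct_slower 144 167   # prints 13.8
```

Substitute the tg (token generation) rates from your own two llama-bench runs; the worked values simply show how the quoted 13–23% range falls out of the issue's numbers.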

Out of memory mid-generation on a hard problem

Qwen3-14B's thinking mode emits a <think>...</think> chain-of-thought that routinely runs 2k–4k tokens (longer on hard math / coding), and the KV cache grows linearly with every token. At 24 GB with a Q4_K_M/Q5_K_M quant this is rarely fatal, but a very long reasoning turn at extended context can still bite. Mitigations, in order: (1) cap the context with --ctx-size 16384 (shown in the Installation commands); (2) quantize the KV cache with --cache-type-k q8_0 --cache-type-v q8_0 --flash-attn in llama.cpp to roughly halve its memory; (3) send /no_think to skip the chain-of-thought for turns that don't need it. Watch rocm-smi during a hard problem to calibrate the actual peak.
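
The linear growth can be estimated up front. This is a back-of-the-envelope sketch, assuming Qwen3-14B's published architecture (40 layers, 8 KV heads under GQA, head dimension 128; check the model's config.json if in doubt) and llama.cpp block layouts of 64 bytes per 32 values for f16 and 34 bytes per 32 values for q8_0. The helper name kv_cache_mib is made up for this example, and compute buffers are ignored.

```shell
#!/usr/bin/env bash
# Estimate KV-cache size in MiB for a given context length.
# $1 = context length in tokens, $2 = cache type (f16 | q8_0, default f16)
kv_cache_mib() {
  local ctx="$1" type="${2:-f16}" block_bytes
  case "$type" in
    f16)  block_bytes=64 ;;   # 32 values x 2 bytes
    q8_0) block_bytes=34 ;;   # 32 int8 values + one 2-byte scale
    *) echo "unknown cache type: $type" >&2; return 1 ;;
  esac
  # K and V (x2), 40 layers, 8 KV heads, head_dim 128, grouped in 32-value blocks.
  local blocks_per_token=$(( 2 * 40 * 8 * 128 / 32 ))
  echo $(( blocks_per_token * block_bytes * ctx / 1024 / 1024 ))
}

kv_cache_mib 16384 f16    # prints 2560
kv_cache_mib 16384 q8_0   # prints 1360
kv_cache_mib 32768 f16    # prints 5120
```

Under these assumptions a 16k context costs about 2.5 GiB at f16 and about 1.3 GiB at q8_0, consistent with the "roughly halve" estimate for a quantized cache.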

`<think>...</think>` output is bloating responses

Qwen3 enables thinking mode by default per the HF card. Send /no_think at the start of any user message to disable it for that turn, or pass enable_thinking=False if you're calling the chat-template API directly.
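
If a downstream consumer only wants the final answer, the reasoning block can also be dropped after the fact. A minimal sketch, assuming the <think> and </think> tags each sit on their own line in the raw output (a common layout, but not guaranteed by every runtime); strip_think is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Remove a <think>...</think> block from text on stdin.
strip_think() {
  # Delete the lines from <think> through </think>, then drop leading blank lines.
  sed -e '/<think>/,/<\/think>/d' | sed -e '/./,$!d'
}

# Usage (raw_response.txt is a placeholder for saved model output):
#   strip_think < raw_response.txt
```

This only hides the reasoning from the reader. The tokens are still generated, so it does not reduce latency or KV-cache use the way /no_think does.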

Generation slows dramatically past 32k context

32k is Qwen3-14B's native context window per the HF card ("Context Length: 32,768 natively and 131,072 tokens with YaRN"). Beyond that the model needs YaRN extension — supported in llama.cpp, vLLM, and SGLang per the unsloth GGUF instructions — but quality degrades on short prompts and the KV cache balloons. For long-doc workflows, prefer chunking + retrieval over pushing context past 32k.
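
For cases where extended context is unavoidable, a sketch of the llama.cpp YaRN flags follows. It assumes the --rope-scaling, --rope-scale, and --yarn-orig-ctx options as documented for llama.cpp, and derives the scale factor as the target context divided by the native 32,768, rounded up. yarn_flags is an illustrative helper name.

```shell
#!/usr/bin/env bash
# Print llama.cpp context flags for a target context length.
# $1 = desired context length in tokens
yarn_flags() {
  local target="$1" native=32768
  local scale=$(( (target + native - 1) / native ))   # ceiling division
  if [ "$scale" -le 1 ]; then
    echo "--ctx-size $target"                          # fits natively, no YaRN
  else
    echo "--rope-scaling yarn --rope-scale $scale --yarn-orig-ctx $native --ctx-size $target"
  fi
}

# Usage (left unquoted on purpose so the flags split into separate words):
#   ./build/bin/llama-server -hf unsloth/Qwen3-14B-GGUF:Q5_K_M $(yarn_flags 65536)
```

Picking the smallest scale that covers the workload limits the short-prompt quality loss mentioned above, and the KV cache still grows in proportion to the context you request.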