How much VRAM does Qwen3 14B need?

About 9 GB for the Q4_K_M weights alone, the minimum this recipe targets; the KV cache needs room on top of that.

How hard is this setup?

Beginner: follow the steps below.

Qwen3-14B on RX 7800 XT: ROCm via Ollama or llama.cpp-HIP

What You'll Build

A local Qwen3-14B chat / reasoning assistant running on a 16 GB Radeon RX 7800 XT (RDNA3, Navi 32, gfx1101) through the ROCm stack — served via Ollama for the one-command path, or llama.cpp compiled with HIP for full control over the quant tier. The recipe pins the dense 14.8B variant at a GGUF Q4_K_M / Q5_K_M quant (9.0–10.5 GB on disk), which leaves comfortable headroom on the 16 GB card for Qwen3's 32k-native context window, the thinking-mode chain of thought, and the KV cache.

Hardware data: RX 7800 XT (16GB VRAM) · Q4_K_M / Q5_K_M GGUF · ROCm 7 · See benchmark data

⚠️ This is a ROCm recipe, not CUDA. The RX 7800 XT runs on AMD's ROCm/HIP stack — there is no cu124/cu128 wheel, no FlashAttention-2 prebuilt wheel, and no FP8/FP4 path here (RDNA3 has no FP8/FP4 hardware — an FP8 checkpoint would just upcast to BF16 with no memory saving). The attention path is PyTorch SDPA. Quantization is GGUF (via llama.cpp-HIP), AWQ-INT4, or BF16 — not ExLlamaV2, not Marlin. If a guide tells you to pip install flash-attn or pick a cu12x wheel for this card, it's written for the wrong vendor.

⚠️ At 16 GB you must quantize the 14B — BF16 is far out of reach. Qwen3-14B in BF16 does NOT fit 16 GB: the official Qwen speed benchmark reports 28,402 MB for BF16 at input length 1, growing to 33,336 MB at 30k context — almost twice the card's 16 GB ceiling. The BF16 GGUF is even larger on disk (29.54 GB per unsloth/Qwen3-14B-GGUF). So lead with a GGUF Q4_K_M / Q5_K_M quant or AWQ-INT4 (9,962 MB per the same benchmark) — these are the comfortable fits on this card. Qwen3 ships eight sizes (0.6b, 1.7b, 4b, 8b, 14b, 30b/MoE, 32b, 235b/MoE) per the Ollama qwen3:14b tag list — these instructions are for the dense 14.8B model only.

ℹ️ Thinking mode is on by default. Per the Qwen3-14B model card, Qwen3 has a built-in chain-of-thought ("thinking") mode toggled by enable_thinking, with soft switches /think and /no_think you can add to a prompt. Output starts with a <think>...</think> block (often 2k–4k tokens on hard problems) followed by the user-facing answer. Send /no_think to skip it for latency-sensitive turns.

Requirements

Component	Minimum	Tested
GPU	9 GB VRAM (Q4_K_M weights alone; the KV cache needs room on top)	RX 7800 XT (16 GB)
RAM	16 GB system	—
Storage	9.0 GB (Q4_K_M GGUF) or 10.5 GB (Q5_K_M)	per unsloth/Qwen3-14B-GGUF
Driver	AMD ROCm v7 (installed via `amdgpu-install`) on Linux	—
Runtime	Ollama / llama.cpp (HIP build) / LM Studio	—

The model is released under Apache 2.0 (14.8B parameters) — commercial use is permitted. The weights are not gated on Hugging Face, so no access request or login is required.

Installation

Prerequisite — install the AMD ROCm v7 driver

The RX 7800 XT (gfx1101) is an officially ROCm-supported GPU, but ROCm is not bundled with Ollama or the llama.cpp release binaries — you install it once at the OS level. Per the Ollama AMD GPU docs: "Ollama requires the AMD ROCm v7 driver on Linux. You can install or upgrade using the amdgpu-install utility." On Ubuntu 24.04 (Noble), install ROCm 7.2.1 via the standard amdgpu-install flow (AMD's Radeon ROCm install docs cover the current packages):

# 1. Add the amdgpu-install package and install ROCm
wget https://repo.radeon.com/amdgpu-install/7.2.1/ubuntu/noble/amdgpu-install_7.2.1.70201-1_all.deb
sudo apt install ./amdgpu-install_7.2.1.70201-1_all.deb
sudo apt update
sudo amdgpu-install -y --usecase=graphics,rocm

# 2. Add yourself to the render/video groups (log out/in afterward)
sudo usermod -a -G render,video $LOGNAME

The RX 7800 XT is on Ollama's supported AMD Radeon RX list, and gfx1101 is in its supported LLVM-target list — so no HSA_OVERRIDE_GFX_VERSION masquerade is needed for this card (that override is only for cards ROCm doesn't ship kernels for).
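A quick way to confirm the group step took effect is sketched below. It is a minimal check, assuming a POSIX shell; the helper name missing_groups is hypothetical and only inspects the list that `id -nG` prints.

```shell
# Print which of the required groups (render, video) are missing from a
# space-separated group list such as the output of `id -nG`.
missing_groups() {
    for required in render video; do
        case " $1 " in
            *" $required "*) ;;               # already a member
            *) printf '%s\n' "$required" ;;   # not found: report it
        esac
    done
}

# Check the current login session; prints nothing when both groups are present.
missing="$(missing_groups "$(id -nG)")"
if [ -n "$missing" ]; then
    echo "Not yet in group(s): $missing (re-run usermod, then log out and back in)"
fi
```

Group changes only apply to new login sessions, so a message here right after the usermod step usually just means you have not logged out yet.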

Option A — Ollama (recommended)

1. Install Ollama

# Linux
curl -fsSL https://ollama.com/install.sh | sh

Per the Ollama AMD preview blog: "All the features of Ollama can now be accelerated by AMD graphics cards on Ollama for Linux and Windows," with the RX 7800 XT named in its supported-card list. Ollama detects the ROCm runtime installed in the prerequisite step.

2. Pull the 14B model

ollama pull qwen3:14b

This fetches a 9.3 GB Q4_K_M checkpoint per the Ollama qwen3:14b tag (14.8B parameters, Q4_K_M quantization). The download is one file — no manual quant-tier selection needed, and it fits the 16 GB card with KV-cache room to spare.

Option B — llama.cpp built with HIP/ROCm

For full control over the quant tier (Q5_K_M / Q6_K for higher fidelity), build llama.cpp against HIP and target the gfx1101 architecture directly.

1. Build llama.cpp with the HIP backend

Per the llama.cpp build docs, the Linux HIP build for an RDNA3 card like the RX 7800 XT is:

git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
HIPCXX="$(hipconfig -l)/clang" HIP_PATH="$(hipconfig -R)" \
    cmake -S . -B build -DGGML_HIP=ON -DGPU_TARGETS=gfx1101 -DCMAKE_BUILD_TYPE=Release \
    && cmake --build build --config Release -- -j 16

-DGGML_HIP=ON selects the ROCm backend; -DGPU_TARGETS=gfx1101 pins the kernels to the 7800 XT's architecture (Navi 32). If a library ships only gfx1100 kernels and refuses to load on this card, the legacy fallback is to mask the GPU as gfx1100 at runtime with HSA_OVERRIDE_GFX_VERSION=11.0.0 — but for a gfx1101 build target this should not be needed.
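If you are unsure which target string your card reports, the sketch below pulls it out of rocminfo so you can compare it with the value passed to -DGPU_TARGETS. The helper name first_gfx_target is hypothetical, and the approach assumes rocminfo prints the target as a gfxNNNN token; on systems with an integrated AMD GPU a different target may be listed first, so read the full rocminfo output if the result looks wrong.

```shell
# Read `rocminfo` output on stdin and print the first gfx target it mentions.
first_gfx_target() {
    grep -oE 'gfx[0-9a-f]+' | head -n 1
}

# Only query the hardware when rocminfo is actually installed.
if command -v rocminfo >/dev/null 2>&1; then
    target="$(rocminfo | first_gfx_target)"
    echo "Detected target: ${target:-none}"
fi
```

On an RX 7800 XT this is expected to report gfx1101, matching the build flag above.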

2. Pull the quant you want

File sizes below come from the per-tier table on the unsloth/Qwen3-14B-GGUF Files tab (the page links back to upstream Qwen/Qwen3-14B), cross-checked against bartowski/Qwen_Qwen3-14B-GGUF:

Quant	File size	Fits 16 GB with KV headroom?
Q4_K_M	9.00 GB	yes — community default, ~7 GB free for KV
UD-Q4_K_XL	9.16 GB	yes — Unsloth Dynamic 2.0 imatrix-tuned
Q5_K_M	10.51 GB	yes — better quality, recommended on this card
Q6_K	12.12 GB	yes — comfortable ceiling, "near perfect" fidelity
Q8_0	15.70 GB	tight: the weights load, but almost no KV-cache room remains on 16 GB; use Q6_K instead
BF16	29.54 GB	no — overflows the 16 GB 7800 XT
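The fit column in the table is essentially subtraction: card size minus file size. A minimal sketch of that arithmetic follows, using the file sizes from the table; the helper name headroom_gb is hypothetical, and the result is only a rough upper bound because runtime compute buffers and the GB/GiB difference are ignored.

```shell
# Headroom in GB left on the card after loading the weights.
# Arguments: card size in GB, GGUF file size in GB (both from the table above).
headroom_gb() {
    awk -v card="$1" -v weights="$2" 'BEGIN { printf "%.2f\n", card - weights }'
}

# Walk the tiers from the table on a 16 GB card.
for entry in Q4_K_M:9.00 Q5_K_M:10.51 Q6_K:12.12 Q8_0:15.70 BF16:29.54; do
    quant="${entry%%:*}"
    size="${entry##*:}"
    echo "$quant: $(headroom_gb 16 "$size") GB left for KV cache and runtime buffers"
done
```

A negative result, as for BF16, means the weights alone do not fit.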

On the 16 GB 7800 XT the Q4_K_M / UD-Q4_K_XL / Q5_K_M / Q6_K tiers all fit with usable KV-cache headroom — Q6_K (12.12 GB) is the comfortable upper bound here, leaving ~3–4 GB for the cache. Q8_0 (15.70 GB) effectively fills the card and leaves no room for a meaningful context, so it is not recommended on this GPU (on a 24 GB card it is fine; here it is not). The BF16 GGUF (29.54 GB) overflows outright. Then run via the llama.cpp Hugging Face shortcut (per the unsloth model card):

# OpenAI-compatible local server with web UI
./build/bin/llama-server -hf unsloth/Qwen3-14B-GGUF:Q5_K_M --ctx-size 16384

# Interactive terminal
./build/bin/llama-cli -hf unsloth/Qwen3-14B-GGUF:Q5_K_M --ctx-size 16384

Option C — LM Studio (GUI)

LM Studio ships a ROCm runtime backend and offers a one-click install path. Search "Qwen3-14B GGUF" inside the app and pick the Q4_K_M or Q5_K_M (up to Q6_K) tier, or use the direct-import link lmstudio://open_from_hf?model=unsloth/Qwen3-14B-GGUF. On the 16 GB 7800 XT, Q6_K is the comfortable upper tier — skip Q8_0 and the BF16 GGUF, which leave no KV-cache room (Q8_0) or overflow the card (BF16).

Running

One-shot prompt via Ollama

ollama run qwen3:14b "Explain the difference between MoE and dense transformer architectures in three sentences."

First run loads the model into VRAM (~9 GB resident at idle for the Q4_K_M weights, growing as the KV cache fills with longer contexts). Watch GPU activity in another terminal with rocm-smi to confirm the card is doing the work.

Disable thinking mode for short answers

ollama run qwen3:14b "/no_think What's the capital of France?"

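Per the Qwen3-14B model card, /no_think is a soft switch: it tells the model to skip the chain of thought for that turn while enable_thinking itself stays at its default. The response may still carry an empty <think></think> block ahead of the answer.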

OpenAI-compatible HTTP API

# Ollama exposes localhost:11434 by default
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3:14b",
    "messages": [{"role": "user", "content": "Write a haiku about a Radeon GPU."}]
  }'

The upstream Qwen3-14B card also documents vllm serve Qwen/Qwen3-14B --enable-reasoning --reasoning-parser deepseek_r1 and python -m sglang.launch_server --model-path Qwen/Qwen3-14B --reasoning-parser qwen3, but both default to BF16 weights (~28 GB), which overflow the 7800 XT's 16 GB by a wide margin. The fitting route for those servers is the AWQ-INT4 weights (Qwen/Qwen3-14B-AWQ, ~9,962 MB resident at length 1 / 15,323 MB at 30k per the Qwen benchmark); note that on 16 GB even AWQ-INT4 approaches the ceiling at full 30k context. On ROCm, vLLM must be launched with VLLM_USE_TRITON_FLASH_ATTN=0, because the Triton FlashAttention path overflows the stack frame on gfx1101. For a single-GPU local setup, Ollama or llama.cpp-HIP with a GGUF quant is the simpler and more memory-efficient path.

Results

Speed: No Qwen3-14B token-generation benchmark measured on an RX 7800 XT was found at the time of writing, and the backend reports verdict: unknown with no benchmarks for this pair (see /check/qwen3-14b/rx-7800-xt). Token generation is memory-bandwidth-bound, and the 7800 XT has materially less memory bandwidth than the 7900 XTX (624 GB/s vs 960 GB/s) and fewer WMMA units, so it should run slower than the XTX; borrowing the XTX's (or any NVIDIA card's) number would mislead. The Speed figure is therefore omitted rather than carried over from a different card or model. If you've measured Qwen3-14B tok/s on an RX 7800 XT, please contribute it so it lands on /check/qwen3-14b/rx-7800-xt. As a general caveat, ROCm token-generation throughput on RDNA3 tends to run softer than a comparable NVIDIA card, and ROCm itself often trails the Vulkan llama.cpp backend on this GPU (see Troubleshooting).
VRAM usage: At idle the Q4_K_M weights occupy ~9 GB (file size 9.00 GB); the runtime grows the KV cache from there with context length. The official Qwen speed benchmark gives the Transformers precision/VRAM ladder for Qwen3-14B: AWQ-INT4 = 9,962 MB at length 1 / 15,323 MB at 30k context, FP8 = 16,012 MB / 20,813 MB, BF16 = 28,402 MB / 33,336 MB. On the 16 GB 7800 XT the AWQ-INT4 / Q4_K_M / Q5_K_M / Q6_K paths all fit with KV headroom; Q8_0 fills the card and BF16 overflows (and FP8 buys you nothing on RDNA3 — no FP8 hardware, so it upcasts to BF16). See /check/qwen3-14b/rx-7800-xt for any community-submitted measurement.
Quality notes: Q4_K_M is the community-default "sweet spot"; on this 16 GB card you can step up to Q5_K_M (10.51 GB) or Q6_K (12.12 GB) for higher fidelity while still keeping room for a healthy KV cache. Q6_K is the practical ceiling here (Q8_0 and BF16 leave no context room). Per the model card best-practices note, use Temperature=0.6, TopP=0.95, TopK=20, MinP=0 for thinking mode and Temperature=0.7, TopP=0.8, TopK=20, MinP=0 for non-thinking mode. Do not use greedy decoding: the card warns it can lead to degraded output and endless repetitions.
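One way to apply the recommended sampling settings is through the options object of Ollama's native /api/chat endpoint. The sketch below is illustrative only: the helper names sampling_options and ask are hypothetical, the non-thinking top_k and min_p values are assumed to match the thinking-mode ones (confirm against the model card), and the prompt is inserted into the JSON verbatim, so it must not contain double quotes.

```shell
# Print an Ollama "options" JSON object for the given mode.
sampling_options() {
    case "$1" in
        thinking)     echo '{"temperature":0.6,"top_p":0.95,"top_k":20,"min_p":0}' ;;
        non-thinking) echo '{"temperature":0.7,"top_p":0.8,"top_k":20,"min_p":0}' ;;
        *)            echo "unknown mode: $1" >&2; return 1 ;;
    esac
}

# Send one non-streaming chat request to a local Ollama server.
# Arguments: mode (thinking | non-thinking), prompt without double quotes.
ask() {
    curl -s http://localhost:11434/api/chat -d "{
      \"model\": \"qwen3:14b\",
      \"stream\": false,
      \"options\": $(sampling_options "$1"),
      \"messages\": [{\"role\": \"user\", \"content\": \"$2\"}]
    }"
}
```

With Ollama running, a call such as ask thinking "Why is the sky blue?" would send the request with the thinking-mode settings.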

For the full benchmark data and other-GPU comparisons, see /check/qwen3-14b/rx-7800-xt.

Troubleshooting

BF16 weights OOM at load

Qwen3-14B in BF16 needs ~28 GB resident (28,402 MB at input length 1 per the Qwen speed benchmark) — that is nearly double the 7800 XT's 16 GB before any context loads, and the BF16 GGUF (29.54 GB) is larger still. On a 16 GB card you must run a quant: GGUF tiers from Q4_K_M (9.00 GB) up through Q6_K (12.12 GB) fit with KV headroom, or use the AWQ-INT4 weights (~9,962 MB) for a vLLM/SGLang server. Don't reach for FP8 to "save memory" — RDNA3 has no FP8 hardware, so an FP8 checkpoint upcasts to BF16 at load and gives no memory win.

Ollama runs on the CPU instead of the GPU

Confirm the ROCm v7 driver is installed (rocm-smi should list the 7800 XT) and that your user is in the render and video groups (groups should show both — log out and back in after the usermod step). Per the Ollama AMD GPU docs, ROCm is a separate install from Ollama; if it's missing, Ollama silently falls back to CPU. The RX 7800 XT (gfx1101) is natively supported, so you should not need HSA_OVERRIDE_GFX_VERSION — only unsupported cards need that masquerade.
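Ollama itself can also tell you where a loaded model is running: ollama ps prints a PROCESSOR column for each loaded model. The sketch below checks that column for a fully GPU-resident model. The helper name fully_on_gpu is hypothetical, and the exact column layout is an assumption that may differ between Ollama versions.

```shell
# Succeed only if an `ollama ps` row reports the model as fully on the GPU.
fully_on_gpu() {
    case "$1" in
        *"100% GPU"*) return 0 ;;
        *)            return 1 ;;
    esac
}

# Live check; skipped when the ollama CLI is not installed.
if command -v ollama >/dev/null 2>&1; then
    row="$(ollama ps 2>/dev/null | grep '^qwen3:14b' || true)"
    if [ -z "$row" ]; then
        echo "qwen3:14b is not loaded; run a prompt first"
    elif fully_on_gpu "$row"; then
        echo "OK: model is fully resident on the GPU"
    else
        echo "Model is at least partly on the CPU: $row"
    fi
fi
```

A row showing a CPU share means the ROCm runtime was not picked up or the model did not fit, so revisit the driver and group checks above.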

Token generation feels slower than expected — try the Vulkan backend

On RDNA3 the ROCm/HIP backend can be 13–23% slower at token generation than the Vulkan backend in llama.cpp. Per llama.cpp issue #20934, on the RX 7900 XTX (gfx1100) Vulkan (RADV) reached ~167–177 tok/s on Llama 7B Q4_0 while ROCm landed at ~129–144 tok/s across ROCm 6.4.4–7.x. (That figure is for Llama 7B on a 7900 XTX, not Qwen3-14B on the 7800 XT — it characterizes the ROCm-vs-Vulkan backend gap on RDNA3, not this pair's speed.) If your generation rate disappoints under ROCm, build llama.cpp with -DGGML_VULKAN=ON instead of -DGGML_HIP=ON and re-benchmark with llama-bench — Vulkan often wins for pure generation on RDNA3 cards.
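The percentage gap is easy to recompute for your own measurements. The sketch below reproduces the quoted range from the issue's numbers; the helper name pct_slower is hypothetical, and the build-vulkan directory and model.gguf path in the comments are placeholders, not names from the llama.cpp docs.

```shell
# Percentage by which `slow` trails `fast` (both in tokens per second).
pct_slower() {
    awk -v slow="$1" -v fast="$2" 'BEGIN { printf "%.1f\n", (fast - slow) / fast * 100 }'
}

# Ends of the quoted ROCm range against the low end of the Vulkan range
# (Llama 7B Q4_0 on a 7900 XTX, not Qwen3-14B on a 7800 XT):
pct_slower 144 167
pct_slower 129 167

# To compare your own builds, run llama-bench from each one on the same file
# and feed the two token-generation rates into pct_slower:
#   ./build/bin/llama-bench        -m model.gguf -ngl 99
#   ./build-vulkan/bin/llama-bench -m model.gguf -ngl 99
```

Benchmark both backends on the same GGUF file and the same settings, otherwise the comparison says little about the backend itself.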

Out of memory mid-generation on a hard problem

Qwen3-14B's thinking mode emits a <think>...</think> chain-of-thought that routinely runs 2k–4k tokens (longer on hard math / coding), and the KV cache grows linearly with every token. At 16 GB with a Q4_K_M/Q5_K_M quant this is usually fine, but a very long reasoning turn at extended context can push the cache into the remaining headroom — the margin is tighter here than on a 24 GB card. Mitigations, in order: (1) cap the context with --ctx-size 16384 (shown in the Installation commands); (2) quantize the KV cache with --cache-type-k q8_0 --cache-type-v q8_0 --flash-attn in llama.cpp to roughly halve its memory; (3) send /no_think to skip the chain-of-thought for turns that don't need it; (4) drop from Q6_K to Q5_K_M or Q4_K_M to free another 1–3 GB for the cache. Watch rocm-smi during a hard problem to calibrate the actual peak.
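To put rough numbers on the linear growth, the KV cache can be estimated from the model's attention layout. The sketch below assumes 40 layers, 8 KV heads (GQA) and a head dimension of 128 for Qwen3-14B; those architecture values and the helper name kv_cache_mib are assumptions, not figures from this recipe, and treating q8_0 as 1 byte per element slightly understates its real size.

```shell
# Approximate KV-cache size in MiB for a given context length.
# Assumed Qwen3-14B layout: 40 layers, 8 KV heads (GQA), head dimension 128.
kv_cache_mib() {
    ctx="$1"
    bytes_per_elem="${2:-2}"   # 2 = f16 cache; 1 roughly models q8_0
    layers=40; kv_heads=8; head_dim=128
    # Factor 2 covers the separate K and V tensors.
    bytes_per_token=$((2 * layers * kv_heads * head_dim * bytes_per_elem))
    echo $((ctx * bytes_per_token / 1024 / 1024))
}

kv_cache_mib 16384      # f16 cache at the recipe's --ctx-size
kv_cache_mib 32768      # f16 cache at the full native window
kv_cache_mib 16384 1    # approximate q8_0 cache at 16k
```

Under these assumptions the f16 cache costs about 2.5 GiB at 16k and 5 GiB at 32k, which is why Q6_K weights plus a full 32k context get tight on 16 GB.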

`<think>...</think>` output is bloating responses

Qwen3 enables thinking mode by default per the HF card. Send /no_think at the start of any user message to disable it for that turn, or pass enable_thinking=False if you're calling the chat-template API directly.

Generation slows dramatically past 32k context

32k is Qwen3-14B's native context window per the HF card ("Context Length: 32,768 natively and 131,072 tokens with YaRN"). Beyond that the model needs YaRN extension — supported in llama.cpp, vLLM, and SGLang per the unsloth GGUF instructions — but quality degrades on short prompts and the KV cache balloons (a real concern on a 16 GB card). For long-doc workflows, prefer chunking + retrieval over pushing context past 32k.