
Qwen3-8B on RX 7900 XTX: ROCm via Ollama or llama.cpp-HIP

llm · beginner · 6GB+ VRAM · Jun 16, 2026

This beginner recipe sets up Qwen3-8B on the RX 7900 XTX, needing about 6 GB of VRAM.

prerequisites
  • AMD Radeon RX 7900 XTX (24 GB VRAM, RDNA3 / Navi 31 / gfx1100) or equivalent ROCm-supported card
  • Linux (Ubuntu 24.04 / 22.04 or RHEL) with the AMD ROCm v7 driver installed via `amdgpu-install` — ROCm is NOT bundled with Ollama
  • Python 3.10+
  • ~5 GB free disk for the Q4_K_M GGUF checkpoint (or ~16.4 GB for the BF16 weights)
  • Ollama, llama.cpp (HIP build), or LM Studio installed

What You'll Build

A local Qwen3-8B chat / reasoning assistant running on a 24 GB Radeon RX 7900 XTX (RDNA3, Navi 31, gfx1100) through the ROCm stack — served via Ollama for the one-command path, or llama.cpp compiled with HIP for full control over the quant tier. With 24 GB of VRAM the 8B model is never memory-bound: you can run the full BF16 weights (16.39 GB) or any GGUF quant with generous KV-cache headroom for the 32k-native context window and the optional thinking-mode chain of thought.

Hardware data: RX 7900 XTX (24GB VRAM) · BF16 or GGUF · ROCm 7 · See benchmark data

⚠️ This is a ROCm recipe, not CUDA. The RX 7900 XTX runs on AMD's ROCm/HIP stack — there is no cu124/cu128 wheel, no FlashAttention-2 prebuilt wheel, and no FP8/FP4 path here (RDNA3 has no FP8/FP4 hardware — an FP8 checkpoint would just upcast to BF16 with no memory saving). The attention path is PyTorch SDPA. Quantization is GGUF (via llama.cpp-HIP) or BF16 — not ExLlamaV2, not Marlin. If a guide tells you to pip install flash-attn or pick a cu12x wheel for this card, it's written for the wrong vendor.

ℹ️ Thinking mode is on by default. Per the Qwen3-8B model card, Qwen3 has a built-in chain-of-thought ("thinking") mode toggled by enable_thinking, with soft switches /think and /no_think you can add to a prompt. Output starts with a <think>...</think> block followed by the user-facing answer. Send /no_think to skip it for latency-sensitive turns.

Requirements

Component | Minimum | Tested
----------|---------|-------
GPU | 8 GB VRAM (ROCm-supported AMD card) | RX 7900 XTX (24 GB)
RAM | 16 GB system |
Storage | 5.03 GB (Q4_K_M GGUF) or 16.39 GB (BF16) | per unsloth/Qwen3-8B-GGUF
Driver | AMD ROCm v7 (installed via amdgpu-install) on Linux |
Runtime | Ollama / llama.cpp (HIP build) / LM Studio |

The model is released under Apache 2.0 (8.2B parameters) — commercial use is permitted. The weights are not gated on Hugging Face, so no access request or login is required.

Installation

Prerequisite — install the AMD ROCm v7 driver

The RX 7900 XTX (gfx1100) is an officially ROCm-supported GPU, but ROCm is not bundled with Ollama or the llama.cpp release binaries; you install it once at the OS level. Per the Ollama AMD GPU docs: "Ollama requires the AMD ROCm v7 driver on Linux. You can install or upgrade using the amdgpu-install utility." On Ubuntu 24.04 (Noble), install ROCm 7.2.1 via the standard amdgpu-install flow (AMD's Radeon ROCm install docs list the current packages):

# 1. Add the amdgpu-install package and install ROCm
wget https://repo.radeon.com/amdgpu-install/7.2.1/ubuntu/noble/amdgpu-install_7.2.1.70201-1_all.deb
sudo apt install ./amdgpu-install_7.2.1.70201-1_all.deb
sudo apt update
sudo amdgpu-install -y --usecase=graphics,rocm

# 2. Add yourself to the render/video groups (log out/in afterward)
sudo usermod -a -G render,video $LOGNAME

The RX 7900 XTX is on Ollama's supported AMD Radeon RX list, and gfx1100 is in its supported LLVM-target list — so no HSA_OVERRIDE_GFX_VERSION masquerade is needed for this card (that override is only for cards ROCm doesn't ship kernels for).
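To confirm that the installed ROCm runtime actually reports the card as gfx1100, something like the following sketch can help. It assumes the `rocminfo` utility that ships with ROCm is on your PATH; the helper names `gfx_targets` and `detect_gfx_targets` are illustrative, not part of any ROCm tooling.

```python
import re
import shutil
import subprocess


def gfx_targets(rocminfo_output: str) -> list[str]:
    """Return the unique gfx architecture names found in rocminfo-style text."""
    seen: list[str] = []
    for match in re.findall(r"\bgfx[0-9a-f]+\b", rocminfo_output):
        if match not in seen:
            seen.append(match)
    return seen


def detect_gfx_targets() -> list[str]:
    """Run rocminfo if it is installed and return the gfx targets it prints."""
    if shutil.which("rocminfo") is None:
        return []  # ROCm userland not installed, or not on PATH
    result = subprocess.run(
        ["rocminfo"], capture_output=True, text=True, check=False
    )
    return gfx_targets(result.stdout)


if __name__ == "__main__":
    targets = detect_gfx_targets()
    print("gfx targets:", targets or "none found (is ROCm installed?)")
    print("gfx1100 present:", "gfx1100" in targets)
```

If `gfx1100` is missing from the list, revisit the driver install and group membership before moving on to Ollama or llama.cpp.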

Option A — Ollama (recommended)

1. Install Ollama

# Linux
curl -fsSL https://ollama.com/install.sh | sh

Per the Ollama AMD preview blog: "All the features of Ollama can now be accelerated by AMD graphics cards on Ollama for Linux and Windows," with the RX 7900 XTX named in its supported-card list. Ollama detects the ROCm runtime installed in the prerequisite step.

2. Pull the 8B model

ollama pull qwen3:8b

This fetches the canonical Q4_K_M build maintained by the Qwen team (8.2B parameters). The download is one file — no manual quant-tier selection needed.
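As a quick check that the pull succeeded, the sketch below asks the local Ollama server which models it has. It assumes Ollama is listening on its default port 11434 and uses its `/api/tags` listing endpoint; `has_model` and `fetch_tags` are hypothetical helper names.

```python
import json
import urllib.request


def has_model(tags_response: dict, name: str) -> bool:
    """True if an /api/tags style response lists a model with this exact name."""
    return any(m.get("name") == name for m in tags_response.get("models", []))


def fetch_tags(base_url: str = "http://localhost:11434") -> dict:
    """Fetch the list of locally available models from a running Ollama server."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=5) as resp:
        return json.load(resp)


if __name__ == "__main__":
    try:
        print("qwen3:8b available:", has_model(fetch_tags(), "qwen3:8b"))
    except (OSError, ValueError) as err:
        # OSError covers connection failures; ValueError covers malformed JSON.
        print("Could not query Ollama:", err)
```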

Option B — llama.cpp built with HIP/ROCm

For full control over the quant tier (Q6_K for higher fidelity, BF16 for full precision), build llama.cpp against HIP and target the gfx1100 architecture directly.

1. Build llama.cpp with the HIP backend

Per the llama.cpp build docs, the Linux HIP build for an RDNA3 card like the RX 7900 XTX is:

git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
HIPCXX="$(hipconfig -l)/clang" HIP_PATH="$(hipconfig -R)" \
    cmake -S . -B build -DGGML_HIP=ON -DGPU_TARGETS=gfx1100 -DCMAKE_BUILD_TYPE=Release \
    && cmake --build build --config Release -- -j 16

-DGGML_HIP=ON selects the ROCm backend; -DGPU_TARGETS=gfx1100 pins the kernels to the 7900 XTX's architecture (the build docs use gfx1100 as the explicit example for the "Radeon RX 7900XTX").

2. Pull the quant you want

File sizes per quant tier, from the unsloth/Qwen3-8B-GGUF repository on Hugging Face (which links back to upstream Qwen/Qwen3-8B):

Quant | File size | Notes
------|-----------|------
Q4_K_M | 5.03 GB | community default; trivially fits 24 GB
Q5_K_M | 5.85 GB | better quality, still tiny
Q6_K | 6.73 GB | "near perfect" per bartowski
Q8_0 | 8.71 GB | near-lossless
BF16 | 16.39 GB | full precision; fits comfortably on the 24 GB 7900 XTX

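Then run via the llama.cpp Hugging Face shortcut. The unsloth model card's example uses its UD-Q4_K_XL tier; to load a tier from the table instead, change the tag after the colon (for example :Q6_K or :BF16):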

# OpenAI-compatible local server with web UI
./build/bin/llama-server -hf unsloth/Qwen3-8B-GGUF:UD-Q4_K_XL

# Interactive terminal
./build/bin/llama-cli -hf unsloth/Qwen3-8B-GGUF:UD-Q4_K_XL

Option C — LM Studio (GUI)

LM Studio ships a ROCm runtime backend and offers a one-click install path. Search "Qwen3-8B GGUF" inside the app and pick the Q4_K_M (or a higher) tier, or use the direct-import link lmstudio://open_from_hf?model=unsloth/Qwen3-8B-GGUF. On the 24 GB 7900 XTX you have room for any tier through BF16.

Running

One-shot prompt via Ollama

ollama run qwen3:8b "Explain GQA attention in three sentences."

First run loads the model into VRAM (~5 GB resident for the Q4_K_M weights at idle, growing as the KV cache fills with longer contexts). Watch GPU activity in another terminal with rocm-smi to confirm the card is doing the work.

Disable thinking mode for short answers

ollama run qwen3:8b "/no_think What's the capital of France?"

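Per the Qwen3-8B model card, /no_think is a soft switch: it turns thinking off for that turn while the template-level enable_thinking setting stays at its default of True. The response may still begin with an empty <think></think> block; set enable_thinking=False in the chat template if you want the tags gone entirely.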

OpenAI-compatible HTTP API

# Ollama exposes localhost:11434 by default
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3:8b",
    "messages": [{"role": "user", "content": "Write a haiku about a Radeon GPU."}]
  }'
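The same request can be made from Python using only the standard library. The sketch below assumes the Ollama endpoint and model tag shown in the curl example; the sampling values follow the unsloth recommendations quoted in the Results section, and `build_payload` and `chat` are illustrative helper names.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/v1/chat/completions"


def build_payload(prompt: str, thinking: bool = True) -> dict:
    """Build a chat-completions body, prefixing /no_think when thinking is off."""
    content = prompt if thinking else f"/no_think {prompt}"
    return {
        "model": "qwen3:8b",
        "messages": [{"role": "user", "content": content}],
        # Sampling values recommended by the unsloth card for each mode.
        "temperature": 0.6 if thinking else 0.7,
        "top_p": 0.95 if thinking else 0.8,
    }


def chat(prompt: str, thinking: bool = True) -> str:
    """POST the prompt to the local server and return the assistant's text."""
    request = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(build_payload(prompt, thinking)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=300) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]


if __name__ == "__main__":
    try:
        print(chat("Write a haiku about a Radeon GPU.", thinking=False))
    except (OSError, ValueError, KeyError) as err:
        print("Request failed:", err)
```

Keeping the payload builder separate from the network call makes it easy to inspect exactly what is sent before pointing it at a live server.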

The 24 GB of VRAM lets you run the full BF16 weights (16.39 GB) if you want maximum fidelity rather than a quant — load unsloth/Qwen3-8B-GGUF:BF16 in llama.cpp, or run the upstream BF16 safetensors through a serving stack. The official Qwen3-8B card documents vllm and sglang serving; on ROCm, vLLM must be launched with VLLM_USE_TRITON_FLASH_ATTN=0 (the Triton FlashAttention path overflows the stack frame on gfx1100) — for a single-GPU local setup, Ollama or llama.cpp-HIP is the simpler path.

Results

  • Speed: No published Qwen3-8B token-generation benchmark on the RX 7900 XTX was found at the time of writing. Published 7900 XTX figures cover other models (Llama 2 7B, Llama 3.1 8B, Qwen 2.5 7B/14B) but not Qwen3-8B, so no speed figure is given here rather than borrowing one from a different model or a different vendor's card. If you've measured Qwen3-8B tok/s on a 7900 XTX, please contribute it so it lands on /check/qwen3-8b/rx-7900-xtx. As a general caveat, ROCm token-generation throughput on RDNA3 tends to trail a comparable NVIDIA card, and the ROCm backend often trails llama.cpp's Vulkan backend on this GPU (see Troubleshooting).
  • VRAM usage: At idle the Q4_K_M weights occupy ~5 GB (file size 5.03 GB); the runtime grows the KV cache from there with context length. On the 24 GB 7900 XTX even the full BF16 weights (16.39 GB) leave room for a large KV cache — see /check/qwen3-8b/rx-7900-xtx for any community-submitted measurement.
  • Quality notes: Q4_K_M is the community-default "sweet spot"; the bartowski Q-tier guide flags Q6_K as "near perfect, recommended." On a 24 GB card there is no memory pressure to go below Q4_K_M — run Q6_K, Q8_0, or BF16 if you want higher fidelity. The unsloth card recommends Temperature 0.6 / TopP 0.95 for thinking mode and Temperature 0.7 / TopP 0.8 for non-thinking mode; avoid greedy decoding.
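To put a rough number on how the KV cache grows with context, a back-of-the-envelope estimate is sketched below. The defaults (36 layers, 8 KV heads, head dimension 128, 16-bit cache values) are assumptions based on my reading of the Qwen3-8B configuration; check them against the model's config.json. The result also ignores runtime compute buffers and any KV-cache quantization.

```python
def kv_cache_gib(
    context_tokens: int,
    layers: int = 36,          # assumed Qwen3-8B layer count
    kv_heads: int = 8,         # assumed GQA key/value head count
    head_dim: int = 128,       # assumed per-head dimension
    bytes_per_value: int = 2,  # 16-bit cache entries
) -> float:
    """Estimate KV-cache size in GiB for a given number of cached tokens."""
    # Keys and values are both stored, hence the leading factor of 2.
    bytes_per_token = 2 * layers * kv_heads * head_dim * bytes_per_value
    return context_tokens * bytes_per_token / 1024**3


if __name__ == "__main__":
    for ctx in (4096, 32768, 131072):
        print(f"{ctx:>7} tokens -> {kv_cache_gib(ctx):.2f} GiB")
```

Under these assumptions a full 32,768-token context needs about 4.5 GiB of cache, which fits alongside the BF16 weights on a 24 GB card, while a full 131,072-token cache at 16-bit would not.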

For the full benchmark data and other-GPU comparisons, see /check/qwen3-8b/rx-7900-xtx.

Troubleshooting

Ollama runs on the CPU instead of the GPU

Confirm the ROCm v7 driver is installed (rocm-smi should list the 7900 XTX) and that your user is in the render and video groups (groups should show both — log out and back in after the usermod step). Per the Ollama AMD GPU docs, ROCm is a separate install from Ollama; if it's missing, Ollama silently falls back to CPU. The RX 7900 XTX (gfx1100) is natively supported, so you should not need HSA_OVERRIDE_GFX_VERSION — only unsupported cards need that masquerade.
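The group check can be scripted. The sketch below compares the current process's groups against render and video, and also looks for the `/dev/kfd` device node that the ROCm compute driver exposes. Function names are illustrative. Group changes only apply to sessions started after the `usermod` step, so run it from a fresh login.

```python
import grp
import os

REQUIRED_GROUPS = ("render", "video")


def missing_groups(current: set[str], required=REQUIRED_GROUPS) -> list[str]:
    """Return the required groups that are absent from the current set."""
    return [name for name in required if name not in current]


def current_group_names() -> set[str]:
    """Resolve the calling process's group IDs to group names."""
    names: set[str] = set()
    for gid in os.getgroups():
        try:
            names.add(grp.getgrgid(gid).gr_name)
        except KeyError:
            pass  # gid with no entry in the group database
    return names


if __name__ == "__main__":
    missing = missing_groups(current_group_names())
    print("missing groups:", missing or "none")
    print("/dev/kfd present:", os.path.exists("/dev/kfd"))
```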

Token generation feels slower than expected — try the Vulkan backend

On RDNA3 the ROCm/HIP backend can be 20–30% slower at token generation than the Vulkan backend in llama.cpp. Per llama.cpp issue #20934, on the RX 7900 XTX (gfx1100) Vulkan (RADV) reached ~167–177 tok/s on Llama 7B Q4_0 while ROCm landed at ~129–144 tok/s across ROCm 6.4.4–7.x. If your generation rate disappoints under ROCm, build llama.cpp with -DGGML_VULKAN=ON instead of -DGGML_HIP=ON and re-benchmark with llama-bench — Vulkan often wins for pure generation on this card.

<think>...</think> output is bloating responses

Qwen3 enables thinking mode by default per the HF card. Send /no_think at the start of any user message to disable it for that turn, or pass enable_thinking=False if you're calling the chat-template API directly.
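If you want to keep thinking mode on but hide the reasoning from end users, you can split it off client-side. This sketch assumes the server returns the `<think>...</think>` block inline in the message text (some servers may return reasoning in a separate field instead); `split_thinking` is a hypothetical helper.

```python
import re

THINK_BLOCK = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_thinking(text: str) -> tuple[str, str]:
    """Split a response into (reasoning, answer).

    Only the first <think> block is treated as reasoning; if there is
    none, reasoning is an empty string and the text is returned as-is.
    """
    match = THINK_BLOCK.search(text)
    if match is None:
        return "", text.strip()
    reasoning = match.group(1).strip()
    answer = (text[: match.start()] + text[match.end():]).strip()
    return reasoning, answer


if __name__ == "__main__":
    raw = "<think>\nFrance's capital is Paris.\n</think>\n\nParis."
    reasoning, answer = split_thinking(raw)
    print("reasoning:", reasoning)
    print("answer:", answer)
```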

Generation slows past 32k context

Qwen3 natively supports a 32,768-token context, extendable to 131,072 tokens with YaRN RoPE scaling per the HF card (supported in llama.cpp, vLLM, and SGLang per the unsloth GGUF instructions). The KV cache grows linearly with context length, and the HF card warns that static YaRN scaling can reduce quality on shorter texts, so enable it only when you need it. Prefer chunking + retrieval over pushing context past 32k.

common questions
How much VRAM does Qwen3-8B need?

About 6 GB — the minimum this recipe targets.

Which GPUs is Qwen3-8B tested on?

RX 7900 XTX (24 GB).

How hard is this setup?

Beginner — follow the steps above.