What You'll Build
A local Qwen3-8B chat / reasoning assistant running on a 24 GB Radeon RX 7900 XTX (RDNA3, Navi 31, gfx1100) through the ROCm stack, served via Ollama for the one-command path or via llama.cpp compiled with HIP for full control over the quant tier. With 24 GB of VRAM the 8B model is never limited by memory capacity: you can run the full BF16 weights (16.39 GB) or any GGUF quant with generous KV-cache headroom for the 32k-native context window and the optional thinking-mode chain of thought.
Hardware data: RX 7900 XTX (24GB VRAM) · BF16 or GGUF · ROCm 7 · See benchmark data
⚠️ This is a ROCm recipe, not CUDA. The RX 7900 XTX runs on AMD's ROCm/HIP stack: there is no cu124/cu128 wheel, no FlashAttention-2 prebuilt wheel, and no FP8/FP4 path here (RDNA3 has no FP8/FP4 hardware, so an FP8 checkpoint would just upcast to BF16 with no memory saving). The attention path is PyTorch SDPA. Quantization is GGUF (via llama.cpp-HIP) or BF16, not ExLlamaV2 or Marlin. If a guide tells you to pip install flash-attn or pick a cu12x wheel for this card, it's written for the wrong vendor.
ℹ️ Thinking mode is on by default. Per the Qwen3-8B model card, Qwen3 has a built-in chain-of-thought ("thinking") mode toggled by enable_thinking, with soft switches /think and /no_think you can add to a prompt. Output starts with a <think>...</think> block followed by the user-facing answer. Send /no_think to skip it for latency-sensitive turns.
Requirements
| Component | Minimum | Tested |
|---|---|---|
| GPU | 8 GB VRAM (ROCm-supported AMD card) | RX 7900 XTX (24 GB) |
| RAM | 16 GB system | — |
| Storage | 5.03 GB (Q4_K_M GGUF) or 16.39 GB (BF16) | per unsloth/Qwen3-8B-GGUF |
| Driver | AMD ROCm v7 (installed via amdgpu-install) on Linux | — |
| Runtime | Ollama / llama.cpp (HIP build) / LM Studio | — |
The model is released under Apache 2.0 (8.2B parameters) — commercial use is permitted. The weights are not gated on Hugging Face, so no access request or login is required.
Installation
Prerequisite — install the AMD ROCm v7 driver
The RX 7900 XTX (gfx1100) is an officially ROCm-supported GPU, but ROCm is not bundled with Ollama or the llama.cpp release binaries; you install it once at the OS level. Per the Ollama AMD GPU docs: "Ollama requires the AMD ROCm v7 driver on Linux. You can install or upgrade using the amdgpu-install utility." On Ubuntu 24.04 (Noble), install ROCm 7.2.1 via the standard amdgpu-install flow (AMD's Radeon ROCm install docs list the current packages):
# 1. Add the amdgpu-install package and install ROCm
wget https://repo.radeon.com/amdgpu-install/7.2.1/ubuntu/noble/amdgpu-install_7.2.1.70201-1_all.deb
sudo apt install ./amdgpu-install_7.2.1.70201-1_all.deb
sudo apt update
sudo amdgpu-install -y --usecase=graphics,rocm
# 2. Add yourself to the render/video groups (log out/in afterward)
sudo usermod -a -G render,video $LOGNAME
The RX 7900 XTX is on Ollama's supported AMD Radeon RX list, and gfx1100 is in its supported LLVM-target list — so no HSA_OVERRIDE_GFX_VERSION masquerade is needed for this card (that override is only for cards ROCm doesn't ship kernels for).
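Before moving on, it can help to confirm that the driver install and group change took effect. The following is a minimal sketch, not part of the official install flow: has_gfx_target and missing_groups are hypothetical helper names, and the check assumes that rocminfo (shipped with ROCm) prints the gfx1100 target name for this card.

```shell
# Sketch: post-install sanity checks. Helper names are hypothetical.

# Succeeds if the text on stdin mentions the given LLVM target (e.g. gfx1100).
has_gfx_target() {
  grep -qw -- "$1"
}

# Prints each required group (render, video) that is absent from the
# space-separated group list passed as the first argument.
missing_groups() {
  have=" $1 "
  for g in render video; do
    case "$have" in
      *" $g "*) ;;
      *) printf '%s\n' "$g" ;;
    esac
  done
}

# Assumption: rocminfo is on PATH once ROCm is installed.
if command -v rocminfo >/dev/null 2>&1; then
  if rocminfo 2>/dev/null | has_gfx_target gfx1100; then
    echo "ROCm sees a gfx1100 agent"
  else
    echo "gfx1100 not listed by rocminfo"
  fi
else
  echo "rocminfo not on PATH; is ROCm installed?"
fi

# Group membership only shows up in a fresh login session.
missing=$(missing_groups "$(id -nG)")
if [ -n "$missing" ]; then
  echo "not yet in group(s):" $missing
fi
```

If the group check still reports render or video after the usermod step, the usual cause is that the session predates the change.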
Option A — Ollama (recommended)
1. Install Ollama
# Linux
curl -fsSL https://ollama.com/install.sh | sh
Per the Ollama AMD preview blog: "All the features of Ollama can now be accelerated by AMD graphics cards on Ollama for Linux and Windows," with the RX 7900 XTX named in its supported-card list. Ollama detects the ROCm runtime installed in the prerequisite step.
2. Pull the 8B model
ollama pull qwen3:8b
This fetches the default Q4_K_M build of the 8.2B-parameter model published in the Ollama library. A single pull gets everything you need; no manual quant-tier selection is required.
Option B — llama.cpp built with HIP/ROCm
For full control over the quant tier (Q6_K for higher fidelity, BF16 for full precision), build llama.cpp against HIP and target the gfx1100 architecture directly.
1. Build llama.cpp with the HIP backend
Per the llama.cpp build docs, the Linux HIP build for an RDNA3 card like the RX 7900 XTX is:
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
HIPCXX="$(hipconfig -l)/clang" HIP_PATH="$(hipconfig -R)" \
cmake -S . -B build -DGGML_HIP=ON -DGPU_TARGETS=gfx1100 -DCMAKE_BUILD_TYPE=Release \
&& cmake --build build --config Release -- -j 16
-DGGML_HIP=ON selects the ROCm backend; -DGPU_TARGETS=gfx1100 pins the kernels to the 7900 XTX's architecture (the build docs use gfx1100 as the explicit example for the "Radeon RX 7900XTX").
2. Pull the quant you want
File sizes per the unsloth/Qwen3-8B-GGUF repository on Hugging Face, which links back to upstream Qwen/Qwen3-8B:
| Quant | File size | Notes |
|---|---|---|
| Q4_K_M | 5.03 GB | community default — trivially fits 24 GB |
| Q5_K_M | 5.85 GB | better quality, still tiny |
| Q6_K | 6.73 GB | "near perfect" per bartowski |
| Q8_0 | 8.71 GB | near-lossless |
| BF16 | 16.39 GB | full precision — fits comfortably on the 24 GB 7900 XTX |
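Then run via the llama.cpp Hugging Face shortcut. The tag after the colon selects the quant: the unsloth model card's example uses its UD-Q4_K_XL build, and you can substitute a tier from the table above (for example Q6_K or BF16):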
# OpenAI-compatible local server with web UI
./build/bin/llama-server -hf unsloth/Qwen3-8B-GGUF:UD-Q4_K_XL
# Interactive terminal
./build/bin/llama-cli -hf unsloth/Qwen3-8B-GGUF:UD-Q4_K_XL
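Once llama-server is running, it can be queried over HTTP. The sketch below assumes the server's default listen address of 127.0.0.1:8080 and its OpenAI-style /v1/chat/completions route; chat_payload is a hypothetical helper that only escapes backslashes and double quotes, so it is suitable for simple single-line prompts rather than arbitrary text.

```shell
# Sketch: send one chat turn to a local llama-server.
# Assumption: default address 127.0.0.1:8080; override with LLAMA_URL.
LLAMA_URL="${LLAMA_URL:-http://127.0.0.1:8080}"

# chat_payload PROMPT
# Builds a minimal JSON body. Escapes only backslashes and double quotes,
# so keep prompts to a single line without control characters.
chat_payload() {
  escaped=$(printf '%s' "$1" | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g')
  printf '{"messages":[{"role":"user","content":"%s"}]}\n' "$escaped"
}

# Only attempt the request if the server answers its health check.
if command -v curl >/dev/null 2>&1 &&
   curl -fsS --max-time 2 "$LLAMA_URL/health" >/dev/null 2>&1; then
  curl -fsS "$LLAMA_URL/v1/chat/completions" \
    -H "Content-Type: application/json" \
    -d "$(chat_payload "Explain GQA attention in three sentences.")"
else
  echo "llama-server not reachable at $LLAMA_URL"
fi
```

For prompts containing newlines or other special characters, a real JSON encoder is the safer choice than this hand-rolled escaping.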
Option C — LM Studio (GUI)
LM Studio ships a ROCm runtime backend and offers a one-click install path. Search "Qwen3-8B GGUF" inside the app and pick the Q4_K_M (or a higher) tier, or use the direct-import link lmstudio://open_from_hf?model=unsloth/Qwen3-8B-GGUF. On the 24 GB 7900 XTX you have room for any tier through BF16.
Running
One-shot prompt via Ollama
ollama run qwen3:8b "Explain GQA attention in three sentences."
First run loads the model into VRAM (~5 GB resident for the Q4_K_M weights at idle, growing as the KV cache fills with longer contexts). Watch GPU activity in another terminal with rocm-smi to confirm the card is doing the work.
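To put a rough number on that KV-cache growth, here is a back-of-the-envelope estimate. It assumes an unquantized 16-bit KV cache and Qwen3-8B's architecture values as listed upstream (36 layers, 8 KV heads under GQA, head dimension 128); those values are not stated in this guide, so check them against the model's config.json. The function names are hypothetical.

```shell
# Sketch: estimate KV-cache size. Architecture numbers are assumptions
# taken from the upstream Qwen3-8B config, not measured on the card.

# kv_bytes_per_token LAYERS KV_HEADS HEAD_DIM BYTES_PER_ELEMENT
# The leading 2 accounts for storing both keys and values.
kv_bytes_per_token() {
  echo $(( 2 * $1 * $2 * $3 * $4 ))
}

# kv_mib BYTES_PER_TOKEN CONTEXT_TOKENS
kv_mib() {
  echo $(( $1 * $2 / 1048576 ))
}

per_token=$(kv_bytes_per_token 36 8 128 2)
echo "KV cache per token: $per_token bytes"
echo "KV cache at 32768 tokens: $(kv_mib "$per_token" 32768) MiB"
```

Under these assumptions a full 32k context costs about 4.5 GiB, which still fits next to either the Q4_K_M weights (5.03 GB) or the BF16 weights (16.39 GB) on a 24 GB card. Runtime compute buffers add to this, so treat the figure as a lower bound.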
Disable thinking mode for short answers
ollama run qwen3:8b "/no_think What's the capital of France?"
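Per the Qwen3-8B model card, /no_think is a soft switch: enable_thinking stays on, but the model skips its chain of thought for that turn. The response may still begin with an empty <think></think> block; to remove the block entirely, set enable_thinking=False in the chat template.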
OpenAI-compatible HTTP API
# Ollama exposes localhost:11434 by default
curl http://localhost:11434/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "qwen3:8b",
"messages": [{"role": "user", "content": "Write a haiku about a Radeon GPU."}]
}'
The 24 GB of VRAM lets you run the full BF16 weights (16.39 GB) if you want maximum fidelity rather than a quant: load unsloth/Qwen3-8B-GGUF:BF16 in llama.cpp, or run the upstream BF16 safetensors through a serving stack. The official Qwen3-8B card documents vllm and sglang serving. On ROCm, vLLM must be launched with VLLM_USE_TRITON_FLASH_ATTN=0, because the Triton FlashAttention path overflows the stack frame on gfx1100. For a single-GPU local setup, Ollama or llama.cpp-HIP is the simpler path.
Results
- Speed: No published Qwen3-8B token-generation benchmark on the RX 7900 XTX was found at the time of writing. Published 7900 XTX figures cover other models (Llama 2 7B, Llama 3.1 8B, Qwen 2.5 7B/14B) but not Qwen3-8B specifically, so no speed figure is given here rather than borrowing one from a different model or a different vendor's card. If you've measured Qwen3-8B tok/s on a 7900 XTX, please contribute it so it lands on /check/qwen3-8b/rx-7900-xtx. As a general caveat, ROCm token-generation throughput on RDNA3 tends to be lower than on a comparable NVIDIA card, and ROCm itself often trails the Vulkan llama.cpp backend on this GPU (see Troubleshooting).
- VRAM usage: At idle the Q4_K_M weights occupy ~5 GB (file size 5.03 GB); the runtime grows the KV cache from there with context length. On the 24 GB 7900 XTX even the full BF16 weights (16.39 GB) leave room for a large KV cache — see /check/qwen3-8b/rx-7900-xtx for any community-submitted measurement.
- Quality notes: Q4_K_M is the community-default "sweet spot"; the bartowski Q-tier guide flags Q6_K as "near perfect, recommended." On a 24 GB card there is no memory pressure to go below Q4_K_M — run Q6_K, Q8_0, or BF16 if you want higher fidelity. The unsloth card recommends Temperature 0.6 / TopP 0.95 for thinking mode and Temperature 0.7 / TopP 0.8 for non-thinking mode; avoid greedy decoding.
For the full benchmark data and other-GPU comparisons, see /check/qwen3-8b/rx-7900-xtx.
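The sampling settings from the quality notes can be passed per request. This is a minimal sketch assuming the OpenAI-style temperature and top_p fields are accepted by the Ollama endpoint used elsewhere in this guide; sampling_params is a hypothetical helper, and the prompt is only an example.

```shell
# Sketch: attach the recommended sampling settings to a chat request.

# sampling_params MODE
# Prints JSON fields for MODE = thinking | non-thinking
# (values as recommended on the unsloth card).
sampling_params() {
  case "$1" in
    thinking)     printf '"temperature": 0.6, "top_p": 0.95\n' ;;
    non-thinking) printf '"temperature": 0.7, "top_p": 0.8\n' ;;
    *) echo "usage: sampling_params thinking|non-thinking" >&2; return 1 ;;
  esac
}

body=$(printf '{"model": "qwen3:8b", %s, "messages": [{"role": "user", "content": "Summarise RoPE scaling in two sentences."}]}' \
  "$(sampling_params thinking)")
echo "$body"

# To send it (requires a running Ollama instance):
# curl http://localhost:11434/v1/chat/completions \
#   -H "Content-Type: application/json" -d "$body"
```

Switching the argument to non-thinking pairs naturally with a /no_think prompt.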
Troubleshooting
Ollama runs on the CPU instead of the GPU
Confirm the ROCm v7 driver is installed (rocm-smi should list the 7900 XTX) and that your user is in the render and video groups (groups should show both — log out and back in after the usermod step). Per the Ollama AMD GPU docs, ROCm is a separate install from Ollama; if it's missing, Ollama silently falls back to CPU. The RX 7900 XTX (gfx1100) is natively supported, so you should not need HSA_OVERRIDE_GFX_VERSION — only unsupported cards need that masquerade.
Token generation feels slower than expected — try the Vulkan backend
On RDNA3 the ROCm/HIP backend can be 20–30% slower at token generation than the Vulkan backend in llama.cpp. Per llama.cpp issue #20934, on the RX 7900 XTX (gfx1100) Vulkan (RADV) reached ~167–177 tok/s on Llama 7B Q4_0 while ROCm landed at ~129–144 tok/s across ROCm 6.4.4–7.x. If your generation rate disappoints under ROCm, build llama.cpp with -DGGML_VULKAN=ON instead of -DGGML_HIP=ON and re-benchmark with llama-bench — Vulkan often wins for pure generation on this card.
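To compare your own numbers, a small helper can turn two tok/s readings into a percentage gap. The sketch below applies it to the figures quoted from the issue, pairing the low ends and the high ends of the two ranges (that pairing is an assumption; the issue may match runs differently). The build-vulkan directory and the MODEL path are hypothetical placeholders, and the benchmark loop only runs if those files exist.

```shell
# Sketch: quantify the ROCm vs Vulkan gap and re-benchmark locally.

# pct_slower FASTER SLOWER
# Prints how much slower SLOWER is, as a percentage of FASTER.
pct_slower() {
  awk -v fast="$1" -v slow="$2" \
    'BEGIN { printf "%.1f\n", (fast - slow) / fast * 100 }'
}

pct_slower 167 129   # low ends of the quoted Vulkan and ROCm ranges
pct_slower 177 144   # high ends of the quoted ranges

# Hypothetical layout: HIP build in ./build, Vulkan build in ./build-vulkan.
MODEL="${MODEL:-model.gguf}"
for bench in ./build/bin/llama-bench ./build-vulkan/bin/llama-bench; do
  if [ -x "$bench" ] && [ -f "$MODEL" ]; then
    # 512-token prompt processing and 128-token generation tests.
    "$bench" -m "$MODEL" -p 512 -n 128
  fi
done
```

With the quoted figures this pairing gives gaps of roughly 23% and 19%; compare the generation rows of your own llama-bench output the same way.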
<think>...</think> output is bloating responses
Qwen3 enables thinking mode by default per the HF card. Send /no_think at the start of any user message to disable it for that turn, or pass enable_thinking=False if you're calling the chat-template API directly.
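If you want the reasoning generated but hidden from downstream consumers, the block can also be filtered out of the raw text. This is a minimal sketch with a hypothetical strip_think helper; it assumes the <think> and </think> tags each sit on their own line, and it will drop an answer that shares a line with the closing tag.

```shell
# Sketch: remove a <think>...</think> block from model output on stdin.
# Assumption: the opening and closing tags are on lines of their own.
strip_think() {
  awk '
    /<think>/   { skipping = 1 }              # start of the reasoning block
    !skipping && (seen || NF) { print; seen = 1 }  # skip leading blank lines
    /<\/think>/ { skipping = 0 }              # end of the reasoning block
  '
}

printf '<think>\nThe user wants a capital city.\n</think>\n\nParis\n' | strip_think
```

Text without a think block passes through unchanged apart from leading blank lines, so the filter is safe to leave in a pipeline for both modes.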
Generation slows past 32k context
Qwen3 natively supports a 32,768-token context, extendable to 131,072 tokens with YaRN RoPE scaling per the HF card (supported in llama.cpp, vLLM, and SGLang per the unsloth GGUF instructions). Beyond the native window the KV cache keeps growing linearly with context, and the card cautions that static YaRN scaling can reduce quality on shorter inputs. Enable it only when you need long context; otherwise prefer chunking + retrieval over pushing context past 32k.
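If you do need the extended window, the YaRN scale factor is the target context divided by the native one. The sketch below computes it and prints a llama-server command line using the rope-scaling flags the upstream model card gives for llama.cpp; rope_scale is a hypothetical helper, the command is echoed rather than executed, and the Q4_K_M tag is just one choice from the quant table.

```shell
# Sketch: derive the YaRN scale factor and show the matching server flags.

# rope_scale TARGET_CTX NATIVE_CTX
# Integer scale factor, rounded up so the target context is covered.
rope_scale() {
  echo $(( ($1 + $2 - 1) / $2 ))
}

NATIVE_CTX=32768
TARGET_CTX=131072
scale=$(rope_scale "$TARGET_CTX" "$NATIVE_CTX")

# Printed, not run: adjust the model tag and context size to your setup.
echo "./build/bin/llama-server -hf unsloth/Qwen3-8B-GGUF:Q4_K_M" \
     "-c $TARGET_CTX --rope-scaling yarn --rope-scale $scale" \
     "--yarn-orig-ctx $NATIVE_CTX"
```

A smaller target such as 65,536 tokens yields a factor of 2, which keeps the scaling closer to the native behaviour.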