How much VRAM does Qwen3-8B need?

About 9 GB — the minimum this recipe targets.

How hard is this setup?

Beginner. Follow the steps below.

Qwen3-8B on RX 7800 XT: 16 GB ROCm via Ollama or llama.cpp-HIP GGUF

What You'll Build

A local Qwen3-8B chat / reasoning assistant running on a 16 GB Radeon RX 7800 XT (RDNA3, Navi 32, gfx1101) through the ROCm stack, served via Ollama for the one-command path or via llama.cpp compiled with HIP for full control over the quant tier. Unlike the 24 GB Radeon cards, the 16 GB 7800 XT cannot hold the full BF16 weights (16.39 GB) and a working KV cache at the same time, so this recipe leads with a GGUF quant (Q5_K_M / Q6_K at ~6–7 GB, or Q8_0 at ~8.7 GB). That leaves plenty of room for the 32k-native context window and the optional thinking-mode chain of thought.

Hardware data: RX 7800 XT (16GB VRAM) · GGUF Q5_K_M / Q6_K / Q8_0 · ROCm 7 · See benchmark data

⚠️ This is a ROCm recipe, not CUDA. The RX 7800 XT runs on AMD's ROCm/HIP stack — there is no cu124/cu128 wheel, no FlashAttention-2 prebuilt wheel, and no FP8/FP4 path here (RDNA3 has no FP8/FP4 hardware — an FP8 checkpoint would just upcast to BF16 with no memory saving, which is exactly the wrong move on a 16 GB card). The attention path is PyTorch SDPA. Quantization is GGUF (via llama.cpp-HIP) or BF16 — not ExLlamaV2, not Marlin. If a guide tells you to pip install flash-attn or pick a cu12x wheel for this card, it's written for the wrong vendor.

ℹ️ Thinking mode is on by default. Per the Qwen3-8B model card, Qwen3 has a built-in chain-of-thought ("thinking") mode toggled by enable_thinking, with soft switches /think and /no_think you can add to a prompt. Output starts with a <think>...</think> block followed by the user-facing answer. Send /no_think to skip it for latency-sensitive turns.

Requirements

Component	Minimum	Tested
GPU	8 GB VRAM (ROCm-supported AMD card)	RX 7800 XT (16 GB)
RAM	16 GB system	—
Storage	5.85 GB (Q5_K_M GGUF) or 8.71 GB (Q8_0)	per unsloth/Qwen3-8B-GGUF
Driver	AMD ROCm v7 (installed via `amdgpu-install`) on Linux	—
Runtime	Ollama / llama.cpp (HIP build) / LM Studio	—

The model is released under Apache 2.0 (8.2B parameters) — commercial use is permitted. The weights are not gated on Hugging Face, so no access request or login is required.

Installation

Prerequisite — install the AMD ROCm v7 driver

The RX 7800 XT (gfx1101) is an officially ROCm-supported GPU: it is named in Ollama's supported AMD Radeon GPU list, and gfx1101 appears in that page's supported LLVM-target list. ROCm is not bundled with Ollama or the llama.cpp release binaries, however, so you install it once at the OS level. The Ollama AMD GPU docs note that Ollama on Linux requires the AMD ROCm v7 driver, which you install or upgrade with the amdgpu-install utility. On Ubuntu 24.04 (Noble), install ROCm 7.2.1 via the standard amdgpu-install flow (AMD's Radeon ROCm install docs cover the current packages):

# 1. Add the amdgpu-install package and install ROCm
wget https://repo.radeon.com/amdgpu-install/7.2.1/ubuntu/noble/amdgpu-install_7.2.1.70201-1_all.deb
sudo apt install ./amdgpu-install_7.2.1.70201-1_all.deb
sudo apt update
sudo amdgpu-install -y --usecase=graphics,rocm

# 2. Add yourself to the render/video groups (log out/in afterward)
sudo usermod -a -G render,video $LOGNAME

Because Ollama and ROCm support gfx1101 natively, no HSA_OVERRIDE_GFX_VERSION masquerade is needed for this card. (That override, which sets HSA_OVERRIDE_GFX_VERSION=11.0.0 to present gfx1101 as gfx1100, is only a legacy fallback for the rare library that ships gfx1100 kernels but not gfx1101.)
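
Before moving on, it can help to confirm that ROCm actually sees the card as gfx1101. The helper below is a minimal sketch, not part of the official install flow: has_gfx_target is a hypothetical name, and it assumes rocminfo (shipped with ROCm) prints the gfx target name somewhere in its output.

```shell
# Succeeds if the rocminfo-style text on stdin mentions the given gfx target.
# The target defaults to gfx1101 (RX 7800 XT). has_gfx_target is a
# hypothetical helper name, not a ROCm tool.
has_gfx_target() {
    target="${1:-gfx1101}"
    grep -qw "$target"
}

# Usage on a machine with ROCm installed (not run here):
#   rocminfo | has_gfx_target gfx1101 && echo "gfx1101 visible to ROCm"
```

If the check fails right after installation, log out and back in first, since the group change from the usermod step only takes effect in a new session.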

Option A — Ollama (recommended)

1. Install Ollama

# Linux
curl -fsSL https://ollama.com/install.sh | sh

Per the Ollama AMD preview blog, Ollama acceleration on AMD graphics cards is supported on Linux and Windows, with the RX 7800 XT named in its supported-card list. Ollama detects the ROCm runtime installed in the prerequisite step.

2. Pull the 8B model

ollama pull qwen3:8b

This fetches the canonical Q4_K_M build maintained by the Qwen team (8.2B parameters, ~5 GB). The download is one file — no manual quant-tier selection needed — and it fits the 16 GB card with plenty of KV-cache headroom. To trade a little disk and VRAM for higher fidelity, pull a larger tier instead, e.g. ollama pull qwen3:8b-q8_0.
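
If you want the full 32k window from Ollama, one option is a derived model with a larger context setting. This is a sketch under assumptions: it presumes your Ollama version defaults to a context shorter than 32,768 tokens, the variant name qwen3-8b-32k and the helper write_modelfile are illustrative, and the sampling values are the thinking-mode settings quoted under Results.

```shell
# Write a Modelfile that derives a 32k-context variant from qwen3:8b.
# write_modelfile and the variant name qwen3-8b-32k are illustrative only.
write_modelfile() {
    cat > "$1" <<'EOF'
FROM qwen3:8b
PARAMETER num_ctx 32768
PARAMETER temperature 0.6
PARAMETER top_p 0.95
EOF
}

# Usage on a machine with Ollama installed (not run here):
#   write_modelfile ./Modelfile
#   ollama create qwen3-8b-32k -f ./Modelfile
#   ollama run qwen3-8b-32k
```

A larger num_ctx reserves more VRAM for the KV cache, so watch rocm-smi the first time you load the variant.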

Option B — llama.cpp built with HIP/ROCm

For full control over the quant tier (Q5_K_M / Q6_K for the recommended 16 GB sweet spot, Q8_0 for near-lossless, or BF16 if you accept tight headroom), build llama.cpp against HIP and target the gfx1101 architecture directly.

1. Build llama.cpp with the HIP backend

Per the llama.cpp build docs, the Linux HIP build for an RDNA3 card targets the GPU's gfx architecture. The build docs' worked example uses gfx1100 for the RX 7900XTX/XT/GRE and instructs: "If necessary, adapt GPU_TARGETS to the GPU arch you want to compile for." The RX 7800 XT (Navi 32) is gfx1101 (per the LLVM AMDGPU processor list, which maps "Radeon RX 7800 XT" to gfx1101), so set -DGPU_TARGETS=gfx1101:

git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
HIPCXX="$(hipconfig -l)/clang" HIP_PATH="$(hipconfig -R)" \
    cmake -S . -B build -DGGML_HIP=ON -DGPU_TARGETS=gfx1101 -DCMAKE_BUILD_TYPE=Release \
    && cmake --build build --config Release -- -j 16

-DGGML_HIP=ON selects the ROCm backend; -DGPU_TARGETS=gfx1101 pins the kernels to the 7800 XT's Navi 32 architecture.

2. Pull the quant you want

File sizes per tier, from the unsloth/Qwen3-8B-GGUF repository (sizes match the official Qwen/Qwen3-8B-GGUF repo to the megabyte):

Quant	File size	Notes
Q4_K_M	5.03 GB	community default — comfortable on 16 GB
Q5_K_M	5.85 GB	recommended lead — better quality, ample KV headroom
Q6_K	6.73 GB	"near perfect" per bartowski
Q8_0	8.71 GB	near-lossless — still leaves ~7 GB for KV cache on 16 GB
BF16	16.39 GB	full precision — does NOT fit comfortably: the weights alone nearly fill the card, leaving no room for the KV cache (see below)

Then run via the llama.cpp Hugging Face shortcut (per the unsloth model card):

# OpenAI-compatible local server with web UI — Q5_K_M is the 16 GB sweet spot
./build/bin/llama-server -hf unsloth/Qwen3-8B-GGUF:Q5_K_M

# Interactive terminal
./build/bin/llama-cli -hf unsloth/Qwen3-8B-GGUF:Q5_K_M

Option C — LM Studio (GUI)

LM Studio ships a ROCm runtime backend and offers a one-click install path. Search "Qwen3-8B GGUF" inside the app and pick the Q5_K_M (or a higher) tier, or use the direct-import link lmstudio://open_from_hf?model=unsloth/Qwen3-8B-GGUF. On the 16 GB 7800 XT, Q5_K_M through Q8_0 all run with KV-cache headroom; BF16 is tight.

Running

One-shot prompt via Ollama

ollama run qwen3:8b "Explain GQA attention in three sentences."

First run loads the model into VRAM (~5 GB resident for the default Q4_K_M weights at idle, growing as the KV cache fills with longer contexts). Watch GPU activity in another terminal with rocm-smi to confirm the card is doing the work.

Disable thinking mode for short answers

ollama run qwen3:8b "/no_think What's the capital of France?"

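Per the Qwen3-8B model card, /no_think is a soft switch: enable_thinking stays on, but the model skips the chain of thought for that turn. The response may still open with an empty <think>...</think> block; pass enable_thinking=False in the chat template if you need the block gone entirely.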

OpenAI-compatible HTTP API

# Ollama exposes localhost:11434 by default
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3:8b",
    "messages": [{"role": "user", "content": "Write a haiku about a Radeon GPU."}]
  }'
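
The same request can carry the sampling settings recommended for each mode (Temperature 0.6 / TopP 0.95 for thinking, 0.7 / 0.8 for non-thinking). The helper below is a rough sketch: chat_payload is a hypothetical name, it assumes the endpoint accepts the standard temperature and top_p fields, and it does no JSON escaping, so keep quotes and backslashes out of the prompt.

```shell
# Print a chat-completions JSON body for qwen3:8b.
#   $1 = "think" or "no_think"
#   $2 = prompt as plain text (no quotes or backslashes: this sketch does
#        not JSON-escape its input)
# chat_payload is a hypothetical helper, not part of Ollama.
chat_payload() {
    case "$1" in
        think)    temp=0.6; top_p=0.95; prompt="$2" ;;
        no_think) temp=0.7; top_p=0.8;  prompt="/no_think $2" ;;
        *) echo "usage: chat_payload think|no_think PROMPT" >&2; return 1 ;;
    esac
    printf '{"model": "qwen3:8b", "temperature": %s, "top_p": %s, "messages": [{"role": "user", "content": "%s"}]}\n' \
        "$temp" "$top_p" "$prompt"
}

# Usage against a running Ollama server (not run here):
#   curl http://localhost:11434/v1/chat/completions \
#     -H "Content-Type: application/json" \
#     -d "$(chat_payload no_think 'What is the capital of France?')"
```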

On a 16 GB card, prefer a GGUF quant over the full BF16 weights: the BF16 GGUF is 16.39 GB (file size per unsloth), which on its own nearly fills the 7800 XT and leaves no room for the KV cache — a longer context will OOM. Run Q5_K_M, Q6_K, or Q8_0 instead. The official Qwen3-8B card also documents vllm and sglang serving; on ROCm, vLLM must be launched with VLLM_USE_TRITON_FLASH_ATTN=0 (the Triton FlashAttention path overflows the stack frame on RDNA3) — for a single-GPU local setup, Ollama or llama.cpp-HIP is the simpler path.

Results

Speed: No published Qwen3-8B token-generation benchmark measured on an RX 7800 XT was found at the time of writing, so the Speed figure is omitted rather than borrowed from a different card or model. Token generation is memory-bound, and the 7800 XT has materially lower memory bandwidth than the 24 GB 7900 XTX (624 GB/s vs 960 GB/s), so quoting the XTX's number here would overstate it. If you've measured Qwen3-8B tok/s on a 7800 XT, please contribute it so it lands on /check/qwen3-8b/rx-7800-xt. As a general ROCm caveat: token-generation throughput on RDNA3 tends to run below that of a comparable NVIDIA card, and ROCm itself often trails the Vulkan llama.cpp backend on this GPU (see Troubleshooting).
VRAM usage: At idle the Q5_K_M weights occupy ~6 GB (file size 5.85 GB); the runtime grows the KV cache from there with context length. The recommended Q5_K_M/Q6_K (~6–7 GB) and Q8_0 (8.71 GB) tiers all leave several GB free for a large KV cache on the 16 GB 7800 XT. The full BF16 weights (16.39 GB) do not leave room for the KV cache on this card — see /check/qwen3-8b/rx-7800-xt for any community-submitted measurement.
Quality notes: Q5_K_M / Q6_K is the recommended 16 GB sweet spot; the bartowski Q-tier guide flags Q6_K as "near perfect, recommended." On a 16 GB card there's headroom to run Q8_0 for near-lossless quality, but not to leave the BF16 weights resident alongside a working KV cache. The unsloth card recommends Temperature 0.6 / TopP 0.95 for thinking mode and Temperature 0.7 / TopP 0.8 for non-thinking mode; avoid greedy decoding.
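
To make the headroom claims concrete, the sketch below subtracts each tier's file size from the card's VRAM. It rests on two assumptions that are not stated in this recipe: the listed file sizes are decimal gigabytes (10^9 bytes), and the card's 16 GB is 16 GiB with all of it usable. Real headroom is somewhat lower once the runtime's compute buffers and the desktop are counted. headroom_gib is a hypothetical helper name.

```shell
# Headroom in GiB after loading a GGUF of the given size.
#   $1 = file size in decimal GB (assumed to be how the table lists sizes)
#   $2 = card VRAM in GiB (defaults to 16; assumes all of it is usable)
headroom_gib() {
    LC_ALL=C awk -v size_gb="$1" -v vram_gib="${2:-16}" \
        'BEGIN { printf "%.2f\n", vram_gib - size_gb * 1e9 / (1024 ^ 3) }'
}

# File sizes copied from the quant table in this recipe.
for tier in Q4_K_M:5.03 Q5_K_M:5.85 Q6_K:6.73 Q8_0:8.71 BF16:16.39; do
    printf '%-7s %s GiB free\n' "${tier%%:*}" "$(headroom_gib "${tier#*:}")"
done
```

Under those assumptions Q8_0 leaves a little under 8 GiB free, while BF16 leaves less than 1 GiB, which is why the BF16 weights cannot share the card with a useful KV cache.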

For the full benchmark data and other-GPU comparisons, see /check/qwen3-8b/rx-7800-xt.

Troubleshooting

Ollama runs on the CPU instead of the GPU

Confirm the ROCm v7 driver is installed (rocm-smi should list the 7800 XT) and that your user is in the render and video groups (groups should show both — log out and back in after the usermod step). Per the Ollama AMD GPU docs, ROCm is a separate install from Ollama; if it's missing, Ollama silently falls back to CPU. The RX 7800 XT (gfx1101) is natively supported, so you should not need HSA_OVERRIDE_GFX_VERSION — that masquerade is only a legacy fallback for cards (or libraries) that lack gfx1101 kernels.
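
The group check is easy to script. A minimal sketch, with missing_groups as a hypothetical helper name; it only inspects the group list you pass in and does not touch the driver:

```shell
# Print any of the required groups (render, video) missing from a group list.
#   $1 = space-separated group names, e.g. the output of `id -nG`
# missing_groups is a hypothetical helper name.
missing_groups() {
    for needed in render video; do
        case " $1 " in
            *" $needed "*) ;;                    # group present, print nothing
            *) printf '%s\n' "$needed" ;;        # group missing, report it
        esac
    done
}

# Usage (not run here):
#   missing_groups "$(id -nG)"    # no output means both groups are present
```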

Token generation feels slower than expected — try the Vulkan backend

On RDNA3 the ROCm/HIP backend can be 20–30% slower at token generation than the Vulkan backend in llama.cpp. Per llama.cpp issue #20934, on a gfx1100 RDNA3 card Vulkan (RADV) reached ~167–177 tok/s on Llama 7B Q4_0 while ROCm landed at ~129–144 tok/s across ROCm 6.4.4–7.x. The 7800 XT (gfx1101) is the same RDNA3 generation and the same trade-off applies. If your generation rate disappoints under ROCm, build llama.cpp with -DGGML_VULKAN=ON instead of -DGGML_HIP=ON and re-benchmark with llama-bench — Vulkan often wins for pure generation on these cards.
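
To put a number on the gap for your own card, compare the tok/s figures from the two builds. The function below is a small sketch (slowdown_pct is a hypothetical name); the two sample calls use the low and high ends of the ranges quoted above, and the commented llama-bench lines assume a second build directory named build-vulkan and a placeholder model path.

```shell
# Percentage by which a slower tok/s figure trails a faster one.
#   $1 = faster backend tok/s, $2 = slower backend tok/s
# slowdown_pct is a hypothetical helper name.
slowdown_pct() {
    LC_ALL=C awk -v fast="$1" -v slow="$2" \
        'BEGIN { printf "%.1f\n", (fast - slow) / fast * 100 }'
}

slowdown_pct 167 129   # low ends of the quoted Vulkan and ROCm ranges
slowdown_pct 177 144   # high ends of the quoted ranges

# On real hardware, take the figures from llama-bench in each build
# (not run here; build-vulkan and model.gguf are placeholders):
#   ./build/bin/llama-bench        -m model.gguf -p 512 -n 128
#   ./build-vulkan/bin/llama-bench -m model.gguf -p 512 -n 128
```

The two sample calls print 22.8 and 18.6, so the quoted figures put ROCm roughly a fifth behind Vulkan on that card.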

`<think>...</think>` output is bloating responses

Qwen3 enables thinking mode by default per the HF card. Send /no_think at the start of any user message to disable it for that turn, or pass enable_thinking=False if you're calling the chat-template API directly.
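
If you keep thinking mode on but only want the final answer downstream, you can strip the block after the fact. The filter below is a sketch, assuming the response text contains literal <think> and </think> tags as described above; strip_think is a hypothetical name, and it is not needed if your client already separates the reasoning for you.

```shell
# Remove every <think>...</think> block from the text on stdin and trim
# the blank lines left at the start. strip_think is a hypothetical helper.
strip_think() {
    awk '{ buf = buf $0 "\n" }
         END {
             while ((s = index(buf, "<think>")) > 0) {
                 rest = substr(buf, s)
                 e = index(rest, "</think>")
                 if (e == 0) break            # unterminated block: leave as-is
                 buf = substr(buf, 1, s - 1) substr(rest, e + length("</think>"))
             }
             sub(/^[ \t\n]+/, "", buf)
             printf "%s", buf
         }'
}

# Usage (not run here):
#   ollama run qwen3:8b "Explain GQA attention in three sentences." | strip_think
```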

Out of memory at longer contexts

If you loaded BF16 (or a high quant) and hit an OOM as the conversation grows, the weights and the growing KV cache no longer fit in VRAM together. On the 16 GB 7800 XT, drop to a smaller GGUF tier: Q5_K_M (5.85 GB) or Q6_K (6.73 GB) frees several GB for the cache. Qwen3 natively supports a 32,768-token context, extendable to 131,072 tokens with YaRN RoPE scaling per the HF card (supported in llama.cpp, vLLM, and SGLang per the unsloth GGUF instructions). The KV cache grows in proportion to context length, so on a 16 GB card prefer chunking + retrieval over pushing context past 32k.
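
A back-of-the-envelope estimate shows how much the cache needs. This is a sketch under stated assumptions, not a measurement: the model shape (36 layers, 8 KV heads, head dimension 128) comes from the Qwen3-8B model card and config rather than from this recipe, the cache is assumed to be stored in 16-bit precision, and runtime buffers are ignored. kv_cache_gib is a hypothetical helper name.

```shell
# Approximate KV-cache size in GiB for Qwen3-8B.
#   $1 = context length in tokens
#   $2 = bytes per cached element (defaults to 2, i.e. a 16-bit cache)
# Assumed model shape (from the Qwen3-8B model card/config, not this recipe):
#   36 layers, 8 KV heads, head dimension 128.
kv_cache_gib() {
    LC_ALL=C awk -v tokens="$1" -v bytes="${2:-2}" 'BEGIN {
        layers = 36; kv_heads = 8; head_dim = 128
        per_token = 2 * layers * kv_heads * head_dim * bytes   # K and V
        printf "%.2f\n", per_token * tokens / (1024 ^ 3)
    }'
}

kv_cache_gib 32768     # native 32k window
kv_cache_gib 131072    # YaRN-extended window
```

Under these assumptions the native window costs 4.50 GiB and the YaRN-extended window 18.00 GiB. The first fits beside any of the quantized tiers; the second exceeds the whole card before any weights are loaded.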