What You'll Build
A fully local, OpenAI-style chat endpoint running Meta's Llama 3.2 1B Instruct — a small, text-only instruction-tuned LLM — on an AMD Radeon RX 7900 XTX via Ollama on the ROCm stack. At Q4_K_M the model is tiny (under 1 GB of weights), so it loads instantly and streams responses faster than you can read.
Hardware data: RX 7900 XTX (24GB VRAM) · ~106 tokens/s generation at Q4_K_M · See benchmark data
ℹ️ This is a 24 GB card running a sub-1 GB model. The benchmark below was recorded on a 24 GB RX 7900 XTX, but Llama 3.2 1B at Q4_K_M needs almost none of that VRAM — the weights are ~0.8 GB. The card's full 24 GB is wildly over-provisioned for this model; see "Spending the spare VRAM" in Troubleshooting for what that headroom is actually good for.
Requirements
| Component | Minimum | Tested |
|---|---|---|
| GPU | Any RDNA3 Radeon with ≥2 GB free VRAM | RX 7900 XTX (24GB) |
| RAM | 8 GB | — |
| Storage | ~0.8 GB for Q4_K_M weights | ~1 GB |
| Software | Linux + ROCm v7, Ollama | — |
The Q4_K_M GGUF weight file is 807 MB (0.81 GB) per the unsloth and bartowski GGUF mirrors, which agree to the byte. With KV cache and a typical context window on top, runtime resident memory stays comfortably under 2 GB — hence min_vram_gb: 2. (For reference, the unquantized BF16 safetensors weigh 2.3 GiB on the canonical HF tree.)
Installation
1. Install ROCm and Ollama
Ollama is the cleanest "just works" LLM surface on RDNA3. On Linux, Ollama requires the AMD ROCm v7 driver, and the RX 7900 XTX (gfx1100) is on its officially-supported list. Install the ROCm driver stack first, then Ollama:
# Install the AMD ROCm v7 driver stack (Ubuntu 22.04/24.04)
sudo amdgpu-install --usecase=rocm,hiplibsdk
# Install Ollama (ships a bundled ROCm runtime that auto-detects gfx1100)
curl -fsSL https://ollama.com/install.sh | sh
Because the 7900 XTX is officially supported, no HSA_OVERRIDE_GFX_VERSION masquerade is needed — Ollama auto-detects the gfx1100 GPU once ROCm is installed. (That override is a legacy fallback for cards ROCm doesn't natively target; it does not apply here.)
2. Pull the Q4_K_M weights
Ollama's default llama3.2:1b tag ships Q8_0 (1.3 GB). To match the benchmarked configuration below, pull the Q4_K_M tag explicitly:
ollama pull llama3.2:1b-instruct-q4_K_M
Running
Start an interactive chat session:
ollama run llama3.2:1b-instruct-q4_K_M
You'll get a >>> prompt; type a message and responses stream back token-by-token. To verify the GPU is actually being used (not a CPU fallback — a common AMD footgun), run a query and check that the GPU shows activity:
# In another terminal, watch GPU utilization while a generation runs
rocm-smi
Ollama also exposes an OpenAI-compatible HTTP API on http://localhost:11434, so you can point any client library at it once the model is loaded.
Results
- Speed: ~106 tokens/s generation at Q4_K_M on the RX 7900 XTX, per the backend benchmark drawn from a LocalScore RX 7900 XTX run (prompt processing ~4241 tokens/s, time-to-first-token ~326 ms). See /check/llama-3-2-1b/rx-7900-xtx for the live figure. LocalScore aggregates community-submitted runs, so the per-row value drifts slightly over time — if your numbers differ, contribute them via /contribute.
- VRAM usage: weights are ~0.8 GB; runtime resident memory stays under 2 GB at typical context. The 24 GB on the card is almost entirely free during inference.
- Quality notes: Llama 3.2 1B is a small model — excellent for fast summarization, classification, routing, and lightweight chat, but not a substitute for a 7B+ model on hard reasoning tasks.
For the full benchmark data, see /check/llama-3-2-1b/rx-7900-xtx.
Troubleshooting
Ollama falls back to CPU (GPU not detected)
On AMD, the most common issue is Ollama running on the CPU because ROCm isn't found. Confirm the ROCm v7 driver is installed (sudo amdgpu-install --usecase=rocm,hiplibsdk) and that you're on a ROCm-certified OS (Ubuntu 22.04/24.04 or RHEL 9/10). Restart the Ollama service after installing ROCm. Use rocm-smi to confirm the GPU is visible to the driver before launching Ollama.
Want a faster token-generation backend
For LLM inference on RDNA3, the GGUF/llama.cpp HIP backend is solid, but on gfx1100 the Vulkan llama.cpp backend can sometimes match or beat the ROCm/HIP backend for token generation. If you build llama.cpp directly instead of using Ollama, it's worth benchmarking both backends on your machine and keeping whichever is faster for your workload. (At 1B the difference is small in absolute terms, but it scales as you move to larger models on the same card.)
Spending the spare VRAM
This 1B model leaves ~23 GB of the card's 24 GB unused. The RX 7900 XTX comfortably runs much larger models — an 8B–14B model at Q4_K_M fits easily — or you can colocate Llama 3.2 1B alongside a bigger model (e.g. as a fast router or draft model for speculative decoding) without contention. The headroom, not the fit, is the real story on this card for a 1B model.
No other widely-reported issues. Report problems via the submission form.