How much VRAM does Llama 3.1 8B need?

About 8 GB — the minimum this recipe targets.

How hard is this setup?

Beginner: follow the steps below.

Llama 3.1 8B on Radeon RX 7900 XTX: Local Chat via Ollama or llama.cpp HIP (ROCm) GGUF

What You'll Build

A local Llama 3.1 8B Instruct chat assistant running on an AMD Radeon RX 7900 XTX (24 GB, RDNA3 / gfx1100) through Ollama or llama.cpp built with HIP/ROCm, using a GGUF quant. The simplest starting point is Ollama's llama3.1:8b tag (4.9 GB, Q4_K_M), the same tier a community LocalScore RX 7900 XTX run measured at 51.3 tok/s generation. On a 24 GB RDNA3 card the Q4 weights occupy only ~5 GB, which leaves ample headroom to step up to a near-lossless Q8 quant, run the full BF16 GGUF (16.07 GB, which fits 24 GB but not a 16 GB card), stretch Llama 3.1's long context, or colocate a second model.

Hardware data: Radeon RX 7900 XTX (24 GB VRAM) · Q4_K_M GGUF · 51.3 tok/s generation, 870 tok/s prompt, 1.44 s time-to-first-token (LocalScore, see Results) · See benchmark data

⚠️ AMD / ROCm path — this is not the CUDA recipe. On RDNA3 there is no FlashAttention-2 prebuilt wheel, no cu124/cu128 PyTorch, no FP8/FP4 hardware acceleration, and no ExLlamaV2/Marlin. The reliable local-LLM surfaces on a 7900 XTX are Ollama and llama.cpp built with -DGGML_HIP=ON -DGPU_TARGETS=gfx1100; both run GGUF and do not depend on FlashAttention at all. The optional vLLM path (below) uses PyTorch SDPA / naive attention and must disable Triton FlashAttention. See Troubleshooting.

ℹ️ Access: the two recommended paths are public; only the canonical Meta repo is gated. This recipe's install paths need no Meta approval: Ollama's llama3.1:8b and the unsloth/Llama-3.1-8B-Instruct-GGUF mirror are both ungated (gated: false on the Hugging Face API) and download anonymously with no token. Only the canonical meta-llama/Llama-3.1-8B-Instruct repo is gated (gated: manual), and you need it solely for the optional BF16 vLLM path (see Troubleshooting): submit Meta's "Access Llama 3.1" form on the model page, wait for approval, then run huggingface-cli login with a read token. Gating and license are separate. The weights are released under the Llama 3.1 Community License Agreement, which permits commercial use unless the licensee's products or services had more than 700 million monthly active users in the calendar month preceding the Llama 3.1 release date (per the Llama 3.1 license); a licensee above that threshold must request a separate license from Meta.

Requirements

Component	Minimum	Tested
GPU	8 GB VRAM (Q4_K_M fits)	Radeon RX 7900 XTX (24 GB, gfx1100)
RAM	16 GB system	—
Storage	4.9 GB (Q4_K_M GGUF) per Ollama llama3.1:8b / unsloth file table	—
Driver	AMD ROCm v7 (Linux, via `amdgpu-install`) per Ollama GPU docs	—
Runtime	Ollama / llama.cpp (HIP build)	Ollama (ROCm), llama.cpp `-DGGML_HIP=ON`

The 7900 XTX's 24 GB is far more than an 8B model needs at Q4_K_M: the weights are ~5 GB resident on GPU, and a 32K-context KV cache at FP16 adds about 4 GiB, so weights plus cache stay near 9 GB before compute buffers. That leaves well over 10 GB unused at Q4. You can spend the card's capacity on a heavier tier instead, such as the BF16 GGUF (16.07 GB) or UD-Q8_K_XL (10.58 GB) for near-lossless quality, or keep Q4 and colocate a small companion model. See Running and Results.

Per the Ollama GPU documentation, the RX 7900 XTX is on the supported RDNA3 list and Ollama "requires the AMD ROCm v7 driver on Linux" — install it with the amdgpu-install utility (the ROCm runtime is not bundled with Ollama on Linux). The HSA_OVERRIDE_GFX_VERSION masquerade is only needed when ROCm doesn't support a target; gfx1100 is natively supported, so you do not set it for the 7900 XTX.

Installation

Option A — Ollama (recommended one-line path)

Ollama ships a ROCm build that auto-detects the GPU and maintains its own pre-quantized Llama 3.1 8B. Per the Ollama llama3.1:8b tag the default tag is 4.9 GB at Q4_K_M — the exact tier the LocalScore RX 7900 XTX run measured.

1. Install the AMD ROCm v7 driver stack (Linux)

Follow the AMD ROCm Quick Start (Linux) to install ROCm via amdgpu-install, then add your user to the render and video groups so the runtime can reach the device:

sudo usermod -aG render,video $USER
# log out / back in for the group change to take effect

2. Install Ollama

# Linux
curl -fsSL https://ollama.com/install.sh | sh

Ollama detects the gfx1100 device through the system ROCm install. (If the server logs show a CPU fallback, see Troubleshooting — it is almost always a render/video group or iGPU-priority issue, not a model problem.)

3. Pull and run the 8B model

ollama pull llama3.1:8b
ollama run llama3.1:8b "Explain GQA attention in three sentences."

The first run downloads ~4.9 GB and loads the model onto the 7900 XTX (resident ~5 GB; KV cache grows with conversation length). Subsequent prompts in the same session stay warm.
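Beyond the CLI, a running Ollama server also answers HTTP requests. The following is a minimal sketch of calling it from Python with only the standard library, assuming the server listens on its default address http://localhost:11434. The helper names build_chat_request and ask are illustrative and not part of Ollama.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/chat"  # assumed default Ollama address


def build_chat_request(prompt, model="llama3.1:8b"):
    """Build a non-streaming /api/chat request for a single user message."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,  # ask for one JSON object instead of a token stream
    }
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt, model="llama3.1:8b", timeout=120):
    """Send the prompt to a running Ollama server and return the reply text."""
    request = build_chat_request(prompt, model)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        body = json.load(resp)
    return body["message"]["content"]
```

With the server running, print(ask("Explain GQA attention in three sentences.")) should return the same kind of answer as the ollama run command above.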

Option B — llama.cpp built with HIP (gfx1100) + GGUF

If you want explicit control over context size, --n-gpu-layers, and the exact quant tier, build llama.cpp against ROCm/HIP and drive it directly.

1. Build llama.cpp with HIP for gfx1100

With ROCm installed (Step A.1), build the HIP backend targeting the 7900 XTX's gfx1100 architecture, verbatim from the llama.cpp build docs:

git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
HIPCXX="$(hipconfig -l)/clang" HIP_PATH="$(hipconfig -R)" \
    cmake -S . -B build -DGGML_HIP=ON -DGPU_TARGETS=gfx1100 -DCMAKE_BUILD_TYPE=Release \
    && cmake --build build --config Release -- -j 16

-DGGML_HIP=ON selects the ROCm/HIP backend and -DGPU_TARGETS=gfx1100 builds kernels directly for the 7900 XTX (Navi 31). The binaries land in build/bin/.

2. Pull a GGUF

The fastest path is the llama.cpp Hugging Face shortcut — no login is needed because the mirror is public:

# Standard Q4_K_M (the tier LocalScore measured at 51.3 tok/s)
./build/bin/llama-server -hf unsloth/Llama-3.1-8B-Instruct-GGUF:Q4_K_M

For a specific local directory and pinned filename, pull just the file you want (~4.9 GB) instead of the whole repo:

# download_q4km.py
import os

# Set before importing huggingface_hub so the hf_transfer backend is enabled
os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="unsloth/Llama-3.1-8B-Instruct-GGUF",
    local_dir="unsloth/Llama-3.1-8B-Instruct-GGUF",
    allow_patterns=["*Q4_K_M*"],  # download only the Q4_K_M file, not the whole repo
)

pip install huggingface_hub hf_transfer
python download_q4km.py

The resulting file is unsloth/Llama-3.1-8B-Instruct-GGUF/Llama-3.1-8B-Instruct-Q4_K_M.gguf (4.92 GB per the unsloth file table).
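Before starting the server, it can be worth confirming that the download is a real GGUF file and not a truncated transfer or an HTML error page. The sketch below reads only the fixed-size file header, assuming the GGUF version 2 or later layout (4-byte magic, a 32-bit version, then two 64-bit counts). The function name read_gguf_header is illustrative.

```python
import struct

GGUF_MAGIC = b"GGUF"


def read_gguf_header(path):
    """Return (version, tensor_count, metadata_kv_count) from a GGUF file header."""
    with open(path, "rb") as f:
        header = f.read(24)  # 4-byte magic + uint32 version + two uint64 counts
    if len(header) < 24 or header[:4] != GGUF_MAGIC:
        raise ValueError(f"{path} does not look like a GGUF file")
    # Little-endian, no padding: uint32 followed by two uint64 values
    version, tensor_count, kv_count = struct.unpack("<IQQ", header[4:])
    return version, tensor_count, kv_count
```

Calling read_gguf_header on the Q4_K_M path above should return a small version number and non-zero counts; a ValueError suggests the file should be downloaded again.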

3. Start the server

./build/bin/llama-server \
  --model unsloth/Llama-3.1-8B-Instruct-GGUF/Llama-3.1-8B-Instruct-Q4_K_M.gguf \
  --ctx-size 32768 \
  --n-gpu-layers 99 \
  --host 0.0.0.0 --port 8080

--n-gpu-layers 99 offloads every layer to the 7900 XTX. At Q4 the whole model and a 32K KV cache fit the 24 GB envelope with room to spare, so no layers need to stay on the CPU. The HIP build does not depend on the external FlashAttention-2 package; llama.cpp's own attention kernels handle it.

Running

One-shot prompt via the llama.cpp HTTP server

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama-3.1-8b-instruct",
    "messages": [{"role": "user", "content": "Write a haiku about local LLMs."}]
  }'

llama-server exposes an OpenAI-compatible /v1/chat/completions endpoint on the chosen port.
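The same request can be made from Python. This is a minimal standard-library sketch, assuming the server from the previous step is listening on http://localhost:8080 and returns the usual OpenAI-style response shape (a choices list and an optional usage object). The helper names are illustrative.

```python
import json
import urllib.request


def build_request(prompt, base_url="http://localhost:8080"):
    """Build a POST to /v1/chat/completions for a single user message."""
    payload = {
        "model": "llama-3.1-8b-instruct",
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_reply(completion):
    """Return (assistant_text, completion_tokens) from an OpenAI-style response dict."""
    text = completion["choices"][0]["message"]["content"]
    tokens = completion.get("usage", {}).get("completion_tokens")  # None if absent
    return text, tokens


def chat(prompt, base_url="http://localhost:8080", timeout=120):
    """Send one prompt to a running llama-server and return the parsed reply."""
    request = build_request(prompt, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return extract_reply(json.load(resp))
```

For example, text, tokens = chat("Write a haiku about local LLMs.") mirrors the curl call above and also reports how many tokens were generated when the server includes a usage object.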

Interactive terminal

./build/bin/llama-cli \
  --model unsloth/Llama-3.1-8B-Instruct-GGUF/Llama-3.1-8B-Instruct-Q4_K_M.gguf \
  --ctx-size 32768 \
  --n-gpu-layers 99 \
  --interactive

Press Ctrl-C to interrupt generation; the CLI keeps the model warm on the GPU until exit.

Use the 24 GB headroom — run BF16 or a near-lossless quant

Unlike a 16 GB card, the 7900 XTX has the VRAM to run heavier tiers. The BF16 GGUF is 16.07 GB on disk per the unsloth tier table and loads on the 24 GB card with room for a useful KV cache, which a 16 GB card cannot do without offload. For a high-quality but lighter option, UD-Q8_K_XL is 10.58 GB. RDNA3 has native BF16 in its WMMA units (per AMD's WMMA-on-RDNA3 writeup), so BF16 is a first-class path here. RDNA3 has no FP8/FP4 hardware, so an FP8 weight would only be upcast before compute and gain no hardware speedup; prefer BF16 or GGUF k-quants. To fetch a heavier tier, swap allow_patterns=["*BF16*"] or ["*UD-Q8_K_XL*"] into the download script. Expect token-generation throughput to drop at higher precision, because decoding is bound by memory bandwidth rather than compute and every generated token has to read all of the weights.
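The bandwidth argument can be turned into a rough planning estimate. The sketch below is not a benchmark. It assumes that each generated token streams every weight byte once, that throughput scales inversely with file size from the measured Q4_K_M figure, and that the card's peak memory bandwidth is 960 GB/s (AMD's published figure for the RX 7900 XTX, which this recipe does not otherwise quote).

```python
MEASURED_Q4_TOK_S = 51.3     # LocalScore Q4_K_M generation figure quoted in this recipe
PEAK_BANDWIDTH_GB_S = 960.0  # assumption: AMD's published peak for the RX 7900 XTX

# On-disk sizes in GB from the unsloth file table quoted in this recipe
QUANT_SIZES_GB = {"Q4_K_M": 4.92, "Q6_K": 6.60, "UD-Q8_K_XL": 10.58, "BF16": 16.07}


def bandwidth_ceiling(size_gb, bandwidth_gb_s=PEAK_BANDWIDTH_GB_S):
    """Upper bound on tok/s if each token streams every weight byte once."""
    return bandwidth_gb_s / size_gb


def scaled_estimate(size_gb, ref_size_gb=4.92, ref_tok_s=MEASURED_Q4_TOK_S):
    """Scale the measured Q4_K_M rate by the ratio of bytes read per token."""
    return ref_tok_s * ref_size_gb / size_gb


for name, size in QUANT_SIZES_GB.items():
    print(f"{name:>11}: ceiling {bandwidth_ceiling(size):6.1f} tok/s, "
          f"scaled estimate {scaled_estimate(size):5.1f} tok/s")
```

Under these assumptions the BF16 tier lands at roughly a third of the Q4_K_M rate. Real numbers depend on kernel efficiency and context length, so measure on your own card before relying on them.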

Results

Speed: The backend /check/ page reports verdict: runs for this pair, anchored on a community LocalScore run on an AMD Radeon RX 7900 XTX for Meta Llama 3.1 8B Instruct at Q4_K - Medium: 51.3 tok/s generation, 870 tok/s prompt processing, and 1.44 s time-to-first-token (LocalScore composite 314). That is the standard Q4_K_M tier, the same 4.9 GB quant Ollama ships by default. LocalScore aggregates community-submitted runs, so the figure may drift by about 1% as more runs land; treat it as representative, not a guarantee. If you run llama.cpp HIP or Ollama on your own 7900 XTX, please submit your numbers so that a first-party measurement ingested by the backend can keep this page current.
VRAM usage: The backend benchmark records peak_vram_gb: 24.0 for the LocalScore run. That value is the card's full capacity (the run had the whole 24 GB available), not the amount the 8B model consumed. As a derived envelope (labelled derived, not measured): Q4_K_M weights are 4.92 GB resident per unsloth's file table, and a 32K-context KV cache on an 8B / 32-layer / 8-GQA-head model adds about 4 GiB at FP16 cache precision (2 × 32 layers × 8 KV heads × 128 head dimension × 2 bytes ≈ 128 KiB per token). Weights plus a full 32K cache therefore come to roughly 9 GB before compute buffers, comfortably inside 24 GB. The min_vram_gb: 8 figure for the entry-level Q4 path assumes a shorter context than the 32K the commands above set: lower --ctx-size to fit an 8 GB card, and with a context of a few thousand tokens Q4_K_M will run on a 6 GB card. Community measurement of the actual resident peak will replace the derived envelope when it lands via /contribute.
Quality notes: Q4_K_M is the standard k-quant. On a 24 GB 7900 XTX you can step up to Q6_K (6.60 GB), UD-Q8_K_XL (10.58 GB), or the full BF16 GGUF (16.07 GB) per the unsloth file table — there's no quality-floor reason to run below Q4_K_M on this hardware. RDNA3 has no FP8/FP4 hardware, so there is no FP8 "memory-saving" tier here; BF16 and GGUF k-quants are the real ladder.
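The derived VRAM envelope can be reproduced with a few lines of arithmetic. This sketch assumes the Llama 3.1 8B shape of 32 layers and 8 KV heads with a head dimension of 128, an FP16 KV cache (2 bytes per element), and it ignores compute buffers and runtime overhead, so actual peaks will be somewhat higher.

```python
def kv_cache_bytes(n_ctx, n_layers=32, n_kv_heads=8, head_dim=128, bytes_per_elem=2):
    """Bytes for keys plus values across all layers at a given context length."""
    # The leading 2 accounts for storing both a key and a value vector per token
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * n_ctx


WEIGHTS_GB = 4.92  # Q4_K_M file size from the unsloth file table

for n_ctx in (4096, 8192, 32768):
    cache_gb = kv_cache_bytes(n_ctx) / 1e9
    print(f"ctx {n_ctx:>6}: KV cache {cache_gb:.2f} GB, "
          f"weights + cache {WEIGHTS_GB + cache_gb:.2f} GB")
```

To model a heavier tier, replace WEIGHTS_GB with its file size; the cache term does not change with the weight quant.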

For the full benchmark data and cross-GPU comparisons, see /check/llama-3-1-8b/rx-7900-xtx.

Troubleshooting

Ollama falls back to CPU on the 7900 XTX

The most common cause is missing render/video group membership or the system ROCm stack not being installed (Ollama needs system ROCm v7 via amdgpu-install on Linux — it is not bundled; per the Ollama GPU docs). Confirm you ran sudo usermod -aG render,video $USER and logged back in. On boards with an integrated GPU, disable the iGPU in BIOS or make the discrete card the priority device so the gfx1100 is selected. Because gfx1100 is natively ROCm-supported, you should not need HSA_OVERRIDE_GFX_VERSION — that override is only for targets ROCm doesn't yet support.
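The group-membership check can be scripted. This is a small Unix-only sketch using Python's standard grp and os modules; the helper names are illustrative. It reports which of the render and video groups the current process lacks. A freshly added group only shows up after logging out and back in.

```python
import grp
import os

REQUIRED_GROUPS = ("render", "video")


def missing_groups(current, required=REQUIRED_GROUPS):
    """Return the required groups that are absent from the current group names."""
    have = set(current)
    return [g for g in required if g not in have]


def current_group_names():
    """Names of the groups the running process belongs to (Unix only)."""
    names = []
    for gid in os.getgroups():
        try:
            names.append(grp.getgrgid(gid).gr_name)
        except KeyError:
            pass  # gid with no entry in the group database
    return names
```

Running print(missing_groups(current_group_names())) as the user that launches Ollama should print an empty list when both groups are in effect.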

Want vLLM instead — the RDNA3 FlashAttention flag

vLLM officially lists Radeon RX 7900 series (gfx1100/1101) as supported on ROCm 6.3 or above, with prebuilt Docker images vllm/vllm-openai-rocm:latest (per the vLLM GPU installation docs). But on RDNA3 you must set VLLM_USE_TRITON_FLASH_ATTN=0: Triton FlashAttention is unsupported on gfx1100 and triggers a stack-frame overflow otherwise. This is tracked in vLLM issue #4514, "[Bug]: For RDNA3 (navi31; gfx1100) VLLM_USE_TRITON_FLASH_ATTN=0 currently must be forced", and gfx1100 support was added by vLLM PR #2768 ("support Radeon 7900 series (gfx1100) without using flash-attention").

vLLM loads the BF16 canonical weights from the gated meta-llama/Llama-3.1-8B-Instruct repo (16.07 GB on disk), so this path needs the Meta access request. AMD itself publishes a first-party Deploying Llama-3.1 8B using vLLM notebook (image vllm/vllm-openai-rocm:v0.15.0). Note that AMD's notebook targets the Instinct MI300X datacenter GPU, so on a 7900 XTX you keep the same vLLM workflow but set the VLLM_USE_TRITON_FLASH_ATTN=0 environment variable for the consumer RDNA3 card.

For most users the Ollama / llama.cpp GGUF path above is simpler and faster to stand up.

401 / 403 when pulling weights from `meta-llama`

The two recommended paths never hit a gate: Ollama's llama3.1:8b and the unsloth/Llama-3.1-8B-Instruct-GGUF mirror are both public (gated: false) and download with no token. A 401/403 only appears on the optional BF16 vLLM path, which pulls from the canonical meta-llama/Llama-3.1-8B-Instruct repo (gated: manual). For that path: submit Meta's "Access Llama 3.1" form on the model page, wait for approval, then huggingface-cli login with a read token. The license terms (Llama 3.1 Community License, 700M-MAU commercial threshold) at llama.com/llama3_1/license apply regardless of how you obtain the weights.
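To see in advance whether a repo will demand a token, you can read the gated field from the public Hugging Face model metadata. The sketch below assumes the https://huggingface.co/api/models/<repo_id> endpoint is reachable without authentication and returns JSON with a gated key, as the values quoted above suggest. The helper names are illustrative.

```python
import json
import urllib.request


def model_api_url(repo_id):
    """URL of the public metadata endpoint for a Hugging Face model repo."""
    return f"https://huggingface.co/api/models/{repo_id}"


def needs_access_request(info):
    """True when the repo metadata reports any form of gating (e.g. 'manual')."""
    return bool(info.get("gated", False))


def fetch_model_info(repo_id, timeout=30):
    """Download and parse the metadata JSON for one repo."""
    with urllib.request.urlopen(model_api_url(repo_id), timeout=timeout) as resp:
        return json.load(resp)
```

For example, needs_access_request(fetch_model_info("unsloth/Llama-3.1-8B-Instruct-GGUF")) is expected to be False, while the canonical meta-llama repo is expected to report gating.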

llama.cpp built but only sees the CPU

Confirm the build actually enabled HIP — the cmake output should report the HIP backend, and -DGPU_TARGETS=gfx1100 must match the 7900 XTX. If you have multiple GPUs, HIP_VISIBLE_DEVICES selects the device at runtime (per the llama.cpp build docs). A llama.cpp binary built without -DGGML_HIP=ON is CPU-only no matter what flags you pass at runtime — rebuild from Step B.1.
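It also helps to confirm which architectures ROCm itself reports before rebuilding. The sketch below shells out to rocminfo, which ships with ROCm, and extracts any gfx names from its output. The parsing is a simple pattern match on the text and the helper names are illustrative, so treat it as a convenience check and not a definitive probe.

```python
import re
import subprocess


def gfx_targets(rocminfo_text):
    """Return the distinct gfx architecture names found in rocminfo output."""
    return sorted(set(re.findall(r"\bgfx[0-9a-f]+\b", rocminfo_text)))


def detect_targets():
    """Run rocminfo and list the architectures it reports."""
    result = subprocess.run(["rocminfo"], capture_output=True, text=True, check=True)
    return gfx_targets(result.stdout)
```

If detect_targets() does not include gfx1100, the problem lies in the driver or ROCm install and not in the llama.cpp build. If it lists a second architecture as well, an integrated GPU is probably visible, and HIP_VISIBLE_DEVICES can pin the discrete card.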