How much VRAM does Llama 3.1 8B need?

About 6 GB — the minimum this recipe targets.

How hard is this setup?

Beginner: follow the steps below.

Llama 3.1 8B on Radeon RX 7800 XT: Local Chat via Ollama or llama.cpp HIP (ROCm) GGUF

What You'll Build

A local Llama 3.1 8B Instruct chat assistant running on an AMD Radeon RX 7800 XT (16 GB, RDNA3 / Navi 32 / gfx1101) through Ollama or llama.cpp built with HIP/ROCm, using a GGUF quant. On a 16 GB card the clean lead is a mid-tier GGUF k-quant — Q5_K_M (5.73 GB) or Q6_K (6.60 GB) — both near-lossless against the original BF16, leaving generous room for a long KV cache. Ollama's default llama3.1:8b tag (4.9 GB, Q4_K_M) is the one-line "just works" path; step up to Q5_K_M / Q6_K via llama.cpp-HIP when you want more quality headroom. The full BF16 GGUF is 16.07 GB — it technically equals the card's capacity but leaves no room for the KV cache or runtime overhead, so on 16 GB it is a tight mention only, not the lead (see Running).

Hardware data: Radeon RX 7800 XT (16 GB VRAM, gfx1101) · GGUF Q5_K_M / Q6_K · See benchmark data

⚠️ AMD / ROCm path — this is not the CUDA recipe. On RDNA3 there is no FlashAttention-2 prebuilt wheel, no cu124/cu128 PyTorch, no FP8/FP4 hardware acceleration, and no ExLlamaV2/Marlin. The reliable local-LLM surfaces on a 7800 XT are Ollama and llama.cpp built with -DGGML_HIP=ON -DGPU_TARGETS=gfx1101; both run GGUF and do not depend on FlashAttention at all. The optional vLLM path (below) uses PyTorch SDPA / naive attention and must disable Triton FlashAttention. See Troubleshooting.

ℹ️ No FP8 escape hatch on RDNA3 — quantize via GGUF, not FP8. A 16 GB NVIDIA card would lean on an FP8 weight to squeeze a tight model in. RDNA3 has no FP8 or FP4 hardware (its WMMA units accept FP16, BF16, INT8, INT4 only, per AMD's WMMA-on-RDNA3 writeup), so an FP8 safetensors would just upcast to BF16 at load — no memory saving, no speedup. The right way to fit 16 GB here is a GGUF k-quant via llama.cpp-HIP, which is exactly the lead path below.

ℹ️ Access: the two recommended paths are public; only the canonical Meta repo is gated. This recipe's install paths need no Meta approval: Ollama's llama3.1:8b and the unsloth/Llama-3.1-8B-Instruct-GGUF mirror are both ungated (gated: false on the Hugging Face API) and download anonymously with no token. Only the canonical meta-llama/Llama-3.1-8B-Instruct repo is gated (gated: manual). You need it solely for the optional BF16 vLLM path (see Troubleshooting): submit Meta's "Access Llama 3.1" form on the model page, wait for approval, then run huggingface-cli login with a read token. Gating and license are separate: the weights are released under the Llama 3.1 Community License Agreement, which permits commercial use unless the licensee's products and services had more than 700 million monthly active users in the calendar month preceding the Llama 3.1 release date (per the Llama 3.1 license); licensees above that threshold must request a separate license from Meta.

Requirements

Component	Minimum	Tested
GPU	6 GB VRAM (Q5_K_M fits)	Radeon RX 7800 XT (16 GB, gfx1101)
RAM	16 GB system	—
Storage	5.73 GB (Q5_K_M GGUF) per unsloth file table — or 4.9 GB for the Q4_K_M Ollama llama3.1:8b default	—
Driver	AMD ROCm v7 (Linux, via `amdgpu-install`) per Ollama GPU docs	—
Runtime	Ollama / llama.cpp (HIP build)	Ollama (ROCm), llama.cpp `-DGGML_HIP=ON`

The 7800 XT's 16 GB comfortably fits a mid-tier GGUF k-quant: Q5_K_M weights are ~5.7 GB resident on GPU, and a 32K-context KV cache on an 8B / 32-layer / 8-GQA-head model adds roughly 4.3 GB at a 16-bit cache precision (2 × 32 layers × 8 KV heads × 128 head dim × 2 bytes × 32,768 tokens), for a derived peak of about 10 GB before compute buffers. That still leaves real headroom on a 16 GB card. The BF16 GGUF (16.07 GB) is the one tier that does not comfortably fit here: it matches the card's total capacity with nothing left for the KV cache, so prefer Q5_K_M / Q6_K and treat BF16 as a 24 GB-card path (see Running).
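
The KV-cache part of that estimate can be sketched with a little arithmetic. The snippet below is a rough estimate, not a measurement: it assumes Llama 3.1 8B's published shape (32 layers, 8 KV heads, head dimension 128) and a 16-bit cache, and it ignores compute buffers and runtime overhead. The function name is illustrative.

```python
def kv_cache_bytes(n_ctx, n_layers=32, n_kv_heads=8, head_dim=128, bytes_per_elem=2):
    """Estimate KV-cache size: one K and one V vector per layer per cached token."""
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
    return per_token * n_ctx


# Assumed shape: Llama 3.1 8B with a 16-bit KV cache (see the lead-in).
for ctx in (8192, 32768):
    print(f"{ctx:>6} tokens -> {kv_cache_bytes(ctx) / 1e9:.2f} GB")
```

Because the cache grows linearly with context, halving --ctx-size halves this term while the weight footprint stays fixed.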

Per the Ollama GPU documentation, the RX 7800 XT is an RDNA3 card and Ollama "requires the AMD ROCm v7 driver on Linux" — install it with the amdgpu-install utility (the ROCm runtime is not bundled with Ollama on Linux). gfx1101 is natively ROCm-supported, so the HSA_OVERRIDE_GFX_VERSION masquerade is not needed for normal Ollama / llama.cpp use; it survives only as a legacy fallback for libraries that ship gfx1100-only kernels (see Troubleshooting).

Installation

Option A — Ollama (recommended one-line path)

Ollama ships a ROCm build that auto-detects the GPU and maintains its own pre-quantized Llama 3.1 8B. Per the Ollama llama3.1:8b tag the default tag is 4.9 GB at Q4_K_M — the quickest path to a running model; step up to Q5_K_M / Q6_K via llama.cpp (Option B) for more quality headroom on the 16 GB card.

1. Install the AMD ROCm v7 driver stack (Linux)

Follow the AMD ROCm Quick Start (Linux) to install ROCm via amdgpu-install, then add your user to the render and video groups so the runtime can reach the device:

sudo usermod -aG render,video $USER
# log out / back in for the group change to take effect

2. Install Ollama

# Linux
curl -fsSL https://ollama.com/install.sh | sh

Ollama detects the gfx1101 device through the system ROCm install. (If the server logs show a CPU fallback, see Troubleshooting — it is almost always a render/video group or iGPU-priority issue, not a model problem.)

3. Pull and run the 8B model

ollama pull llama3.1:8b
ollama run llama3.1:8b "Explain GQA attention in three sentences."

The first run downloads ~4.9 GB (Q4_K_M) and loads the model onto the 7800 XT; the KV cache grows with conversation length. Subsequent prompts in the same session stay warm.
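
Ollama also serves a local REST API (port 11434 by default), so the same model can be called from a script. The following is a minimal sketch using only the Python standard library. The helper names are illustrative, and it assumes the Ollama server from step 2 is running with llama3.1:8b already pulled.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/chat"  # Ollama's default local endpoint


def build_chat_payload(prompt, model="llama3.1:8b"):
    """Build a non-streaming /api/chat request body."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,  # ask for one JSON object instead of a stream of chunks
    }


def ask_ollama(prompt, url=OLLAMA_URL):
    """Send the prompt to a running Ollama server and return the reply text."""
    body = json.dumps(build_chat_payload(prompt)).encode("utf-8")
    req = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["message"]["content"]
```

Calling ask_ollama("Explain GQA attention in three sentences.") should return the same kind of answer as the ollama run command above.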

Option B — llama.cpp built with HIP (gfx1101) + GGUF Q5_K_M

If you want explicit control over context size, --n-gpu-layers, and the exact quant tier — and to run the higher-quality Q5_K_M / Q6_K the 16 GB card has room for — build llama.cpp against ROCm/HIP and drive it directly.

1. Build llama.cpp with HIP for gfx1101

With ROCm installed (Step A.1), build the HIP backend targeting the 7800 XT's gfx1101 architecture, verbatim from the llama.cpp build docs (only the -DGPU_TARGETS value differs from the docs' example):

git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
HIPCXX="$(hipconfig -l)/clang" HIP_PATH="$(hipconfig -R)" \
    cmake -S . -B build -DGGML_HIP=ON -DGPU_TARGETS=gfx1101 -DCMAKE_BUILD_TYPE=Release \
    && cmake --build build --config Release -- -j 16

-DGGML_HIP=ON selects the ROCm/HIP backend and -DGPU_TARGETS=gfx1101 builds kernels directly for the 7800 XT (Navi 32). The binaries land in build/bin/.

2. Pull a GGUF

The fastest path is the llama.cpp Hugging Face shortcut — no login is needed because the mirror is public:

# Q5_K_M — the recommended near-lossless tier for 16 GB
./build/bin/llama-server -hf unsloth/Llama-3.1-8B-Instruct-GGUF:Q5_K_M

For a specific local directory and pinned filename, pull just the file you want (~5.7 GB) instead of the whole repo:

# download_q5km.py
import os
os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="unsloth/Llama-3.1-8B-Instruct-GGUF",
    local_dir="unsloth/Llama-3.1-8B-Instruct-GGUF",
    allow_patterns=["*Q5_K_M*"],
)

pip install huggingface_hub hf_transfer
python download_q5km.py

The resulting file is unsloth/Llama-3.1-8B-Instruct-GGUF/Llama-3.1-8B-Instruct-Q5_K_M.gguf (5.73 GB per the unsloth file table). To fetch Q6_K (6.60 GB) instead, change the pattern to allow_patterns=["*Q6_K*"].
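
Before pointing the server at the file, it can be worth confirming that the download is a complete GGUF rather than a truncated file or a saved error page. A minimal sketch: GGUF files begin with the four magic bytes GGUF, and the size should be close to the 5.73 GB listed above. The function name and the 5% tolerance are assumptions for illustration.

```python
import os


def looks_like_gguf(path, expected_gb=None, tolerance=0.05):
    """Return True if the file starts with the GGUF magic and, optionally,
    its size is within `tolerance` of `expected_gb` (decimal gigabytes)."""
    with open(path, "rb") as f:
        if f.read(4) != b"GGUF":
            return False
    if expected_gb is not None:
        size_gb = os.path.getsize(path) / 1e9
        return abs(size_gb - expected_gb) <= tolerance * expected_gb
    return True
```

For the file above, looks_like_gguf("unsloth/Llama-3.1-8B-Instruct-GGUF/Llama-3.1-8B-Instruct-Q5_K_M.gguf", expected_gb=5.73) should be True after a clean download.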

3. Start the server

./build/bin/llama-server \
  --model unsloth/Llama-3.1-8B-Instruct-GGUF/Llama-3.1-8B-Instruct-Q5_K_M.gguf \
  --ctx-size 32768 \
  --n-gpu-layers 99 \
  --host 0.0.0.0 --port 8080

--n-gpu-layers 99 offloads every layer to the 7800 XT. At Q5_K_M the whole model and a 32K KV cache fit the 16 GB envelope with room to spare, so no layer streaming is needed. The HIP build does not depend on the external FlashAttention package; llama.cpp's own ROCm attention kernels handle it.

Running

One-shot prompt via the llama.cpp HTTP server

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama-3.1-8b-instruct",
    "messages": [{"role": "user", "content": "Write a haiku about local LLMs."}]
  }'

llama-server exposes an OpenAI-compatible /v1/chat/completions endpoint on the chosen port.
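
The same request can be made from Python with only the standard library. This is a minimal sketch, assuming the server from step 3 is listening on port 8080. The helper names are illustrative, and the model field simply mirrors the curl example.

```python
import json
import urllib.request


def extract_reply(response):
    """Pull the assistant text out of an OpenAI-style chat completion response."""
    return response["choices"][0]["message"]["content"]


def chat(prompt, base_url="http://localhost:8080"):
    """POST one user message to llama-server and return the reply text."""
    payload = {
        "model": "llama-3.1-8b-instruct",
        "messages": [{"role": "user", "content": prompt}],
    }
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return extract_reply(json.load(resp))
```

Calling chat("Write a haiku about local LLMs.") is the scripted equivalent of the curl command above.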

Interactive terminal

./build/bin/llama-cli \
  --model unsloth/Llama-3.1-8B-Instruct-GGUF/Llama-3.1-8B-Instruct-Q5_K_M.gguf \
  --ctx-size 32768 \
  --n-gpu-layers 99 \
  --interactive

Press Ctrl-C to interrupt generation; the CLI keeps the model warm on the GPU until exit.

Quant ladder on a 16 GB card

The 7800 XT's 16 GB sets a clear ceiling. The sensible ladder, with on-disk sizes from the unsloth tier table:

Q4_K_M (4.92 GB) — Ollama's default; smallest sensible tier, most KV-cache headroom.
Q5_K_M (5.73 GB) — the recommended lead; near-lossless, still leaves ~10 GB free for context.
Q6_K (6.60 GB) — a notch higher quality, still very comfortable on 16 GB.
UD-Q8_K_XL (10.58 GB) — effectively lossless; fits 16 GB but starts eating into KV-cache room at long context.
BF16 (16.07 GB) — equals the card's capacity with nothing left for the KV cache or runtime overhead; not a comfortable 16 GB path — it's a 24 GB-card tier. RDNA3 has native BF16 in its WMMA units (per AMD's WMMA-on-RDNA3 writeup), but VRAM, not format support, is the constraint here.

Swap the allow_patterns value in the download script (e.g. ["*Q6_K*"] or ["*UD-Q8_K_XL*"]) to fetch a different tier. Token-generation throughput drops somewhat at higher precision because decoding is bound by memory bandwidth rather than compute: each generated token has to read the model weights from VRAM, so a larger file means more bytes moved per token. The 7800 XT's 624 GB/s of memory bandwidth (256-bit GDDR6, per the Radeon RX 7000 series specs) is its main throughput limiter.
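
That bandwidth argument can be turned into a rough ceiling: if every generated token has to stream the full set of weights from VRAM once, tokens per second cannot exceed bandwidth divided by model size. This is a back-of-the-envelope upper bound under that assumption. It ignores KV-cache reads and kernel overhead, so real throughput will be well below it. The function name is illustrative, and the sizes are the on-disk figures from the ladder above.

```python
BANDWIDTH_GBPS = 624  # RX 7800 XT peak memory bandwidth in GB/s


def decode_ceiling_tok_s(model_gb, bandwidth_gbps=BANDWIDTH_GBPS):
    """Upper bound on tokens/s if each token reads all weights exactly once."""
    return bandwidth_gbps / model_gb


tiers = [("Q4_K_M", 4.92), ("Q5_K_M", 5.73), ("Q6_K", 6.60), ("UD-Q8_K_XL", 10.58)]
for name, size_gb in tiers:
    print(f"{name:<11} {decode_ceiling_tok_s(size_gb):6.1f} tok/s ceiling")
```

The ordering matters more than the absolute numbers: the ceiling falls in proportion to file size as you climb the ladder.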

Results

Speed: The backend /check/ page currently returns verdict: unknown with no ingested benchmark for this pair — there is no first-party measurement in our database yet. For an indication, community LocalScore runs that name the AMD Radeon RX 7800 XT running Meta Llama 3.1 8B Instruct at Q4_K - Medium report generation throughput in the ~34–39 tok/s range across configurations: e.g. accelerator/607 at 34.1 tok/s gen / 276 tok/s prompt / 6.9 s TTFT, and a faster entry on the model leaderboard at 38.9 tok/s gen / 524 tok/s prompt / 2.36 s TTFT. The wide spread reflects system configuration and llama.cpp backend choice, not a single canonical number — LocalScore aggregates community submissions. Notably these RX 7800 XT figures sit well below the RX 7900 XTX's measured Q4_K_M throughput, consistent with the 7800 XT's narrower memory bus (624 GB/s vs the XTX's 960 GB/s) binding memory-bound token generation. Treat these as representative, not a guarantee — if you run llama.cpp HIP or Ollama on your own 7800 XT, please submit your numbers so a backend-ingested first-party measurement can anchor this recipe.
VRAM usage: No measured resident peak is in the backend yet. As a derived envelope (labelled derived, not measured): Q5_K_M weights are 5.73 GB resident per unsloth's file table; a 32K-context KV cache on an 8B / 32-layer / 8-GQA-head model adds about 4.3 GB at 16-bit cache precision, putting the realistic runtime peak around 10 to 11 GB, comfortably inside 16 GB. The min_vram_gb: 6 figure for the Q5_K_M path assumes a much smaller context: reduce --ctx-size to a few thousand tokens and Q5_K_M will run on a 6 GB card, though with little margin (the Q4_K_M Ollama default is lighter still). Community measurement of the actual resident peak will replace the derived envelope when it lands via /contribute.
Quality notes: Q5_K_M and Q6_K are near-lossless k-quants — there's no quality-floor reason to drop below Q4_K_M on this hardware, and no upper-tier reason to push past Q6_K unless you want UD-Q8_K_XL's effectively-lossless output (10.58 GB, still fits 16 GB). RDNA3 has no FP8/FP4 hardware, so there is no FP8 "memory-saving" tier here; BF16 and GGUF k-quants are the real ladder, and BF16's 16.07 GB makes it a 24 GB-card tier rather than a 16 GB one.

For the full benchmark data and cross-GPU comparisons, see /check/llama-3-1-8b/rx-7800-xt.

Troubleshooting

Ollama falls back to CPU on the 7800 XT

The most common cause is missing render/video group membership or the system ROCm stack not being installed (Ollama needs system ROCm v7 via amdgpu-install on Linux — it is not bundled; per the Ollama GPU docs). Confirm you ran sudo usermod -aG render,video $USER and logged back in. On boards with an integrated GPU, disable the iGPU in BIOS or make the discrete card the priority device so the gfx1101 is selected. Because gfx1101 is natively ROCm-supported, you should not need HSA_OVERRIDE_GFX_VERSION for Ollama / llama.cpp. The one exception: if a different ROCm library in your stack ships only gfx1100 kernels and refuses to load on gfx1101, the legacy HSA_OVERRIDE_GFX_VERSION=11.0.0 masquerade tells it to treat the 7800 XT as gfx1100 — use it only as a fallback, not by default.
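
The group-membership check is easy to script. The sketch below is Linux-only and uses the standard grp and os modules; the helper names are illustrative. It reports which of render and video the current process lacks, which is what matters after you log back in.

```python
import grp
import os


def missing_groups(required, current):
    """Return the required group names that are absent from the current set."""
    return sorted(set(required) - set(current))


def current_group_names():
    """Names of the groups the running process belongs to."""
    names = set()
    for gid in set(os.getgroups()) | {os.getegid()}:
        try:
            names.add(grp.getgrgid(gid).gr_name)
        except KeyError:
            pass  # gid with no entry in the group database
    return names


if __name__ == "__main__":
    missing = missing_groups({"render", "video"}, current_group_names())
    print("missing groups:", ", ".join(missing) if missing else "none")
```

If either group is reported missing in a fresh login session, rerun the usermod command and log in again before restarting Ollama.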

ROCm token generation slower than expected — try the Vulkan backend

On RDNA3, llama.cpp's ROCm/HIP backend is sometimes slower at token generation than llama.cpp's Vulkan backend for the same GGUF — this is tracked upstream in llama.cpp issue #20934. If your ROCm llama-server tok/s feels low, build llama.cpp with the Vulkan backend (-DGGML_VULKAN=ON instead of -DGGML_HIP=ON) and compare on the same Q5_K_M file. Ollama also has a Vulkan path. Which backend wins varies by ROCm version and driver, so benchmark both on your own box rather than assuming ROCm is fastest.
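
One simple way to compare the two backends is to send the same prompt to each build's llama-server and divide generated tokens by wall-clock time. The sketch below assumes the server's OpenAI-compatible response carries a usage.completion_tokens count, and that the HIP and Vulkan builds are started one at a time or on different ports (hypothetically 8080 and 8081 here). Wall-clock timing includes prompt processing, so treat the result as a relative comparison only.

```python
import json
import time
import urllib.request


def tokens_per_second(completion_tokens, elapsed_s):
    """Generated tokens divided by wall-clock seconds."""
    if elapsed_s <= 0:
        raise ValueError("elapsed_s must be positive")
    return completion_tokens / elapsed_s


def measure(base_url, prompt, max_tokens=256):
    """Time one chat completion against a running llama-server and return tok/s."""
    payload = {
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    start = time.perf_counter()
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    elapsed = time.perf_counter() - start
    # Assumption: the response includes an OpenAI-style usage block.
    return tokens_per_second(body["usage"]["completion_tokens"], elapsed)
```

Run measure("http://localhost:8080", prompt) against the HIP build and measure("http://localhost:8081", prompt) against the Vulkan build, repeating each a few times. For separate prompt-processing and generation figures, llama.cpp's bundled llama-bench tool is the more rigorous option.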

Want vLLM instead — the RDNA3 FlashAttention flag

vLLM officially lists Radeon RX 7900 series (gfx1100/1101) as supported on ROCm 6.3 or above, with prebuilt Docker images vllm/vllm-openai-rocm:latest (per the vLLM GPU installation docs). On RDNA3 you must set VLLM_USE_TRITON_FLASH_ATTN=0 — Triton FlashAttention is unsupported on gfx1100/1101 and triggers a stack-frame overflow otherwise. This is tracked in vLLM issue #4514, "[Bug]: For RDNA3 (navi31; gfx1100) VLLM_USE_TRITON_FLASH_ATTN=0 currently must be forced", and gfx1100 support was added by vLLM PR #2768 ("support Radeon 7900 series (gfx1100) without using flash-attention"). vLLM loads the BF16 canonical weights from the gated meta-llama/Llama-3.1-8B-Instruct repo (16.07 GB on disk) — on a 16 GB 7800 XT that is right at the card's capacity with no room for the vLLM KV-cache allocator, so on this card the GGUF Ollama / llama.cpp path above is the practical lead, and vLLM-BF16 is realistically a 24 GB-card option. For most users the GGUF path is simpler and fits comfortably regardless.

401 / 403 when pulling weights from `meta-llama`

The two recommended paths never hit a gate: Ollama's llama3.1:8b and the unsloth/Llama-3.1-8B-Instruct-GGUF mirror are both public (gated: false) and download with no token. A 401/403 only appears on the optional BF16 vLLM path, which pulls from the canonical meta-llama/Llama-3.1-8B-Instruct repo (gated: manual). For that path: submit Meta's "Access Llama 3.1" form on the model page, wait for approval, then huggingface-cli login with a read token. The license terms (Llama 3.1 Community License, 700M-MAU commercial threshold) at llama.com/llama3_1/license apply regardless of how you obtain the weights.

llama.cpp built but only sees the CPU

Confirm the build actually enabled HIP — the cmake output should report the HIP backend, and -DGPU_TARGETS=gfx1101 must match the 7800 XT. If you have multiple GPUs, HIP_VISIBLE_DEVICES selects the device at runtime (per the llama.cpp build docs). A llama.cpp binary built without -DGGML_HIP=ON is CPU-only no matter what flags you pass at runtime — rebuild from Step B.1, and double-check you set gfx1101 (not the XTX's gfx1100) for this card.
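
To confirm which architecture string the ROCm runtime actually reports, you can scan the output of rocminfo (installed with ROCm) for gfx names. This is a minimal sketch with illustrative helper names, and it assumes rocminfo is on your PATH.

```python
import re
import subprocess


def find_gfx_targets(rocminfo_text):
    """Return the distinct gfx architecture names mentioned in rocminfo output."""
    return sorted(set(re.findall(r"\bgfx[0-9a-f]+\b", rocminfo_text)))


def detected_targets():
    """Run rocminfo and return the gfx targets it reports."""
    out = subprocess.run(
        ["rocminfo"], capture_output=True, text=True, check=True
    ).stdout
    return find_gfx_targets(out)
```

On this card the list returned by detected_targets() should include gfx1101. If it does not, or if rocminfo itself fails, the problem is in the ROCm install rather than in the llama.cpp build.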