
Llama 3.1 8B on RTX 5080: Local Chat via Ollama or llama.cpp + Unsloth UD-Q4_K_XL GGUF

llm · beginner · 10GB+ VRAM · May 29, 2026
prerequisites
  • NVIDIA RTX 5080 (16 GB VRAM) or equivalent Blackwell-class card
  • Recent NVIDIA driver with CUDA 12.8+ support (required for Blackwell sm_120 kernels)
  • ~5 GB free disk for the UD-Q4_K_XL GGUF (or ~16 GB for BF16, ~10.6 GB for UD-Q8_K_XL)
  • llama.cpp, Ollama, or LM Studio installed
  • Hugging Face account with access to the gated meta-llama/Llama-3.1-8B-Instruct repo

What You'll Build

A local Llama 3.1 8B Instruct chat assistant running on an RTX 5080 (16 GB VRAM) through llama.cpp (or Ollama / LM Studio) with the unsloth/Llama-3.1-8B-Instruct-GGUF UD-Q4_K_XL weights (4.99 GB on disk, Unsloth's mixed-precision Dynamic 2.0 GGUF tier). On a 16 GB card the Q4_K_XL build leaves ~6 GB of runtime headroom over the estimated ~10 GB peak. That is enough room to step up to UD-Q6_K_XL / UD-Q8_K_XL for higher fidelity, extend the context window beyond 16K (Llama 3.1 supports up to 128K natively), or colocate a small companion model (a TTS encoder or a 1B-class assistant).

Hardware data: RTX 5080 (16 GB VRAM) · UD-Q4_K_XL GGUF · no first-party Llama 3.1 8B measurement on this card yet — see Results for a same-arch-class proxy · See benchmark data

⚠️ Quant pinned — Unsloth UD-Q4_K_XL. This recipe targets UD-Q4_K_XL from the unsloth/Llama-3.1-8B-Instruct-GGUF repo specifically — Unsloth's mixed-precision GGUF tier featured in their Dynamic 2.0 benchmark comparisons, with per-layer sensitivity-aware bit-allocation. Standard Q4_K_M from other publishers (bartowski/Meta-Llama-3.1-8B-Instruct-GGUF, TheBloke) loads with the same llama.cpp binary, but the per-layer recipe and resulting quality/speed profile are different — see Troubleshooting if you prefer the conventional flavor.

ℹ️ Gated model: Meta access form required. The canonical meta-llama/Llama-3.1-8B-Instruct repo and the derived unsloth/Llama-3.1-8B-Instruct-GGUF both require accepting Meta's Llama 3.1 Community License before download. Click "Agree and access" on the model page while logged into HF, then run huggingface-cli login locally with a read token before the steps below. The license permits commercial use, but organizations whose products exceeded 700 million monthly active users at the Llama 3.1 release date must request a separate license from Meta.

Requirements

Component | Minimum | Tested
GPU | 8 GB VRAM (UD-Q4_K_XL fits) | RTX 5080 (16 GB)
RAM | 16 GB system |
Storage | 4.99 GB (UD-Q4_K_XL GGUF) per unsloth/Llama-3.1-8B-Instruct-GGUF |
Driver | CUDA 12.8+ runtime (Blackwell sm_120) |
Runtime | llama.cpp / Ollama / LM Studio | llama.cpp b9247+

The 5080's 16 GB is comfortable for Q4_K_XL: weights resident on GPU are ~5 GB, and the KV cache for a 16K context (about 2 GiB at llama.cpp's default FP16 cache type) plus compute buffers is budgeted at up to ~4 GB, putting the runtime peak at or below ~10 GB. You have ~6 GB of headroom to either jump to a heavier quant tier (UD-Q6_K_XL at 7.33 GB on disk, UD-Q8_K_XL at 10.58 GB) or stretch to longer context windows; see Results for the throughput-vs-context tradeoff.
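The KV-cache part of this budget can be estimated from the model's attention shape. The sketch below is a back-of-the-envelope calculation, not a measurement: it assumes Llama 3.1 8B's published configuration (32 layers, 8 KV heads, head dimension 128) and a 2-byte FP16 cache, and the helper name kv_cache_bytes is made up for illustration. Under these assumptions the cache itself comes to about 2 GiB at 16K, so the ~4 GB allowance above looks conservative.

```python
GIB = 1024 ** 3


def kv_cache_bytes(n_ctx, n_layers=32, n_kv_heads=8, head_dim=128, bytes_per_elem=2):
    """Estimate KV-cache size in bytes for a given context length.

    Defaults are assumptions: Llama 3.1 8B's attention shape and an FP16 cache.
    """
    # One key vector and one value vector per layer, per KV head, per token.
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
    return n_ctx * per_token


for n_ctx in (4096, 16384, 32768, 131072):
    print(f"{n_ctx:>7} tokens -> {kv_cache_bytes(n_ctx) / GIB:.1f} GiB")
```

A quantized cache type would shrink these figures proportionally; pass a smaller bytes_per_elem to model that.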

Installation

Option A — Ollama (recommended one-line path)

Ollama maintains its own pre-quantized build of Llama 3.1 8B Instruct and handles model download + serving with a single command. Per the Ollama library page, the default llama3.1:8b tag is a 4.9 GB Q4_K_M build: essentially the same size and quality tier as Unsloth's UD-Q4_K_XL, but using the standard k-quant recipe.

1. Install Ollama

# macOS / Linux
curl -fsSL https://ollama.com/install.sh | sh

(Windows: download from ollama.com/download.) Ollama bundles its own CUDA runtime, so the only host-side requirement is a recent NVIDIA driver with Blackwell sm_120 support (the GeForce 575+ series on Linux, or any current Windows driver).

2. Pull and run the 8B model

ollama pull llama3.1:8b
ollama run llama3.1:8b "Explain GQA attention in three sentences."

The first run downloads ~4.9 GB and loads the model into VRAM (resident ~5 GB; KV cache grows with conversation length). Subsequent prompts in the same session stay warm.
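Once the model is pulled, you can also reach it programmatically through Ollama's local REST API. The following is a minimal sketch using only the Python standard library. It assumes Ollama is listening on its default port 11434 and that the /api/chat endpoint returns the reply under message.content when streaming is disabled; the helper names build_chat_request and ask are illustrative, not part of any Ollama client.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/chat"  # assumption: default local port


def build_chat_request(prompt, model="llama3.1:8b"):
    """Build a non-streaming chat request for a locally running Ollama."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,  # ask for a single JSON object instead of chunks
    }
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt, model="llama3.1:8b", timeout=120):
    """Send the prompt and return the assistant's reply text."""
    request = build_chat_request(prompt, model)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.loads(resp.read())["message"]["content"]
```

With the Ollama service running, print(ask("Explain GQA attention in three sentences.")) should mirror the ollama run command above.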

Option B — llama.cpp + Unsloth UD-Q4_K_XL GGUF

If you want the specific Unsloth Dynamic 2.0 tier (UD-Q4_K_XL) and explicit control over context size and --n-gpu-layers, drive llama.cpp directly.

1. Install llama.cpp (CUDA 12.8 build)

The RTX 5080 uses Blackwell sm_120 — mainline llama.cpp ships sm_120 kernels, but you need a CUDA 12.8+ build. Pre-built CUDA 12.8 binaries are published on the llama.cpp releases page — pick a *-bin-ubuntu-cuda-12.x-x64.zip asset (Linux) or the matching Windows CUDA build.

# Linux — pre-built CUDA binary
# Download the latest "llama-bXXXX-bin-ubuntu-cuda-12.x-x64.zip" asset from:
#   https://github.com/ggml-org/llama.cpp/releases
# Extract and add the bin/ directory to PATH.

# macOS (Homebrew) — CPU/Metal only, no CUDA, kept here for symmetry with the sibling 3090/4090 recipes
brew install llama.cpp

To build from source with CUDA 12.8 support, follow the llama.cpp CUDA build docs and pin the toolkit and arch explicitly:

# Make sure CUDA 12.8 is the active toolkit before the cmake configure step
export PATH=/usr/local/cuda-12.8/bin:$PATH
export CUDAToolkit_ROOT=/usr/local/cuda-12.8

git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build \
  -DGGML_CUDA=ON \
  -DCMAKE_CUDA_ARCHITECTURES=120 \
  -DCUDAToolkit_ROOT=/usr/local/cuda-12.8
cmake --build build --config Release -j $(nproc)

CMAKE_CUDA_ARCHITECTURES=120 builds sm_120 kernels directly, avoiding PTX JIT compilation at first run.

2. Pull the UD-Q4_K_XL GGUF

The fastest path is the llama.cpp Hugging Face shortcut from the Unsloth model card quickstart — llama.cpp will fetch the tagged file directly:

pip install huggingface_hub hf_transfer
huggingface-cli login   # paste a read token; required for the gated upstream
llama-server -hf unsloth/Llama-3.1-8B-Instruct-GGUF:UD-Q4_K_XL

For more control (specific local directory, pinned filename), pull only the Q4_K_XL file (~5 GB) via snapshot_download instead of the full repo:

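# download_q4kxl.py
import os

# Enable the hf_transfer backend; must be set before huggingface_hub is imported
os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"

from huggingface_hub import snapshot_download

# Fetch only the files matching the UD-Q4_K_XL pattern, not the full quant ladder
snapshot_download(
    repo_id="unsloth/Llama-3.1-8B-Instruct-GGUF",
    local_dir="unsloth/Llama-3.1-8B-Instruct-GGUF",
    allow_patterns=["*UD-Q4_K_XL*"],
)

# Run the script from a shell
python download_q4kxl.py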

The resulting file is unsloth/Llama-3.1-8B-Instruct-GGUF/Llama-3.1-8B-Instruct-UD-Q4_K_XL.gguf (4.99 GB per the unsloth model card).
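Before pointing llama.cpp at the download, it can be worth a quick sanity check that the file is complete and really is a GGUF. The sketch below relies on the GGUF container starting with the 4-byte magic "GGUF" followed by a little-endian uint32 format version; the function name read_gguf_header is hypothetical, and the check only reads the header, so it does not validate the tensors.

```python
import struct
from pathlib import Path


def read_gguf_header(path):
    """Return (format_version, size_in_bytes) or raise if the magic is missing."""
    p = Path(path)
    with p.open("rb") as f:
        header = f.read(8)  # 4-byte magic + 4-byte version
    if len(header) < 8 or header[:4] != b"GGUF":
        raise ValueError(f"{p} does not look like a GGUF file")
    (version,) = struct.unpack("<I", header[4:8])
    return version, p.stat().st_size
```

For the file above, the returned size should be close to the 4.99 GB listed on the model card; a much smaller value usually points to an interrupted download.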

3. Start the server

llama-server \
  --model unsloth/Llama-3.1-8B-Instruct-GGUF/Llama-3.1-8B-Instruct-UD-Q4_K_XL.gguf \
  --ctx-size 16384 \
  --n-gpu-layers 99 \
  --host 0.0.0.0 --port 8080

--n-gpu-layers 99 offloads every layer to the 5080 (the 16 GB envelope is enough to keep the whole model resident at Q4_K_XL; layer streaming is unnecessary). --ctx-size 16384 sets a 16K context window — see Troubleshooting for guidance on pushing context higher.

Option C — LM Studio (GUI)

LM Studio's built-in catalog search ("Llama 3.1 8B Instruct GGUF") will surface both the Unsloth UD-Q4_K_XL build and the bartowski standard-quant ladder. Pick Llama-3.1-8B-Instruct-UD-Q4_K_XL from the Unsloth repo and download it; this is the same file as in Option B. Once LM Studio recognizes the Blackwell card, its loader should set GPU offload (the equivalent of --n-gpu-layers) to the maximum automatically on a 5080.

Running

One-shot prompt via the llama.cpp HTTP server

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama-3.1-8b-instruct",
    "messages": [{"role": "user", "content": "Write a haiku about Blackwell GPUs."}]
  }'

The llama.cpp llama-server binary exposes an OpenAI-compatible /v1/chat/completions endpoint on the port chosen above.
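For a chat UI you will usually want tokens as they are generated rather than one final response. The sketch below assumes that adding "stream": true to the request body makes the server reply with OpenAI-style server-sent events (lines of the form data: {...}, terminated by data: [DONE]); the parser name iter_stream_text is hypothetical and the chunk layout is taken from the OpenAI streaming format, so treat it as a starting point rather than a reference client.

```python
import json


def iter_stream_text(lines):
    """Yield text fragments from OpenAI-style SSE lines ("data: {...}")."""
    for raw in lines:
        line = raw.decode("utf-8") if isinstance(raw, bytes) else raw
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and SSE comments
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(data)
        for choice in chunk.get("choices", []):
            text = choice.get("delta", {}).get("content")
            if text:
                yield text
```

The response object returned by urllib.request.urlopen can be iterated line by line, so it can be passed straight in: for piece in iter_stream_text(resp): print(piece, end="", flush=True).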

Interactive terminal

llama-cli \
  --model unsloth/Llama-3.1-8B-Instruct-GGUF/Llama-3.1-8B-Instruct-UD-Q4_K_XL.gguf \
  --ctx-size 16384 \
  --n-gpu-layers 99 \
  --interactive

Press Ctrl-C to interrupt generation; the CLI keeps the model warm in VRAM until exit.

Step up to UD-Q8_K_XL (near-lossless) on this card

The UD-Q8_K_XL build is 10.58 GB on disk per the unsloth tier table; on the 5080's 16 GB envelope you still have ~5 GB of headroom for a 16K-token KV cache, which fits the typical chat / coding workload comfortably at near-lossless quality. Use allow_patterns=["*UD-Q8_K_XL*"] in the snapshot_download script above to fetch the Q8 file instead. Expect throughput to drop relative to Q4 because memory bandwidth, not compute, is the binding constraint on transformer token generation.
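The bandwidth argument can be turned into a rough ceiling: if every generated token has to stream all weights from VRAM once, tokens per second cannot exceed bandwidth divided by model size. The sketch below assumes the RTX 5080's 960 GB/s spec-sheet memory bandwidth and ignores KV-cache reads and kernel overhead, so real throughput will be lower; the function name decode_ceiling_tok_s is made up for illustration.

```python
def decode_ceiling_tok_s(weights_gb, bandwidth_gb_s=960.0):
    """Upper bound on decode speed if each token reads all weights once.

    bandwidth_gb_s defaults to an assumed 960 GB/s (RTX 5080 spec sheet).
    """
    return bandwidth_gb_s / weights_gb


# File sizes as quoted from the unsloth tier table in this recipe.
for name, size_gb in (("UD-Q4_K_XL", 4.99), ("UD-Q6_K_XL", 7.33), ("UD-Q8_K_XL", 10.58)):
    print(f"{name}: <= {decode_ceiling_tok_s(size_gb):.0f} tok/s")
```

The absolute numbers are optimistic, but the ratio is the useful part: under this model the Q8 ceiling is a little under half of the Q4 ceiling, simply because the file is a little over twice as large.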

Results

  • Speed: No first-party Llama 3.1 8B measurement on the RTX 5080 exists yet — the backend /check/ page currently reports verdict: unknown with no benchmark rows for this pair. As a same-arch-class proxy (not a Llama measurement), Hardware Corner's RTX 5080 LLM benchmark page measures a comparable dense 8B Q4 model — its Qwen3 8B (Q4_K) row reads 129.1 tok/s generation / 6,410.1 tok/s prompt-processing at 4K context on the 5080. Llama 3.1 8B should land in a similar band (same dense-transformer 8B class, same Q4 tier) modulo per-architecture variance, but this is an extrapolation from a different model — not a Llama 3.1 8B number. If you run llama.cpp + UD-Q4_K_XL on your own 5080, please submit your numbers so a Llama-3.1-8B-specific first-party measurement replaces this proxy.
  • VRAM usage: No first-party measured peak VRAM is in the backend yet. As a derived envelope (labelled as derived, not measured): UD-Q4_K_XL weights resident on GPU are 4.99 GB per unsloth's file table; the FP16 KV cache for a 16K context on Llama 3.1 8B (32 layers, 8 KV heads under GQA, head dimension 128) is about 2 GiB, and compute buffers and runtime overhead come on top, so ~9–10 GB is a conservative ceiling for the runtime peak, well inside the 5080's 16 GB envelope. Community measurement of the actual resident peak will replace the derived envelope when it lands via /contribute.
  • Quality notes: UD-Q4_K_XL is the Unsloth mixed-precision GGUF tier; the Unsloth Dynamic 2.0 docs discuss per-layer sensitivity-aware bit-allocation across the family. On a 16 GB 5080 you can comfortably step up to UD-Q6_K_XL (7.33 GB), UD-Q8_K_XL (10.58 GB), or even Q6_K standard (6.60 GB per the unsloth file table) — there's no quality-floor reason to run anything below Q4_K_M on this hardware. BF16 full precision (16.07 GB on disk) overflows the 16 GB card without offload and isn't recommended.

For the full benchmark data and cross-GPU comparisons (3090 / 4090 / 5090 siblings), see /check/llama-3-1-8b/rtx-5080.

Troubleshooting

huggingface-cli 401 / 403 on the Unsloth GGUF repo

The Unsloth quantization inherits gating from the upstream meta-llama/Llama-3.1-8B-Instruct repo. You need to (a) be logged in via huggingface-cli login with a token that has read access, and (b) have clicked "Agree and access" on the upstream Meta repo while logged in — the access carries through to the derived Unsloth mirror. The full license terms are at github.com/meta-llama/llama-models.

Driver too old — Ollama silently falls back to CPU

The RTX 5080 uses Blackwell sm_120; older CUDA builds lack the kernels, and Ollama silently falls back to CPU inference, which appears as a hang or single-digit tok/s. Confirm that a driver with CUDA 12.8+ support is installed (nvidia-smi should report driver 575+ on Linux), then reinstall Ollama. The same advice applies to llama.cpp: use a cuda-12.8 release binary, not an older one.

Generation slows down at longer context

Llama 3.1 ships with a 128K-token native context window per the HF model card metadata (base_model:meta-llama/Llama-3.1-8B, arxiv:2204.05149), but throughput drops as the KV cache fills. The same-class proxy on Hardware Corner's RTX 5080 LLM benchmark page (the Qwen3 8B Q4_K row) degrades from 129.1 tok/s at 4K to 94.1 tok/s at 16K to 72.5 tok/s at 32K; expect Llama 3.1 8B to follow a similar curve. At the full 128K, the KV cache alone needs about 16 GiB at the default FP16 cache type, which overflows the 5080's 16 GB envelope before the ~5 GB of weights are even counted. For long-doc workflows on this card, keep --ctx-size at 32K or below; for longer documents, use chunking + retrieval.
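The chunking half of that advice can be sketched in a few lines. The example below splits a document into overlapping windows so that each piece fits comfortably inside the chosen --ctx-size; it counts whitespace-separated words as a rough stand-in for tokens (an assumption, since real token counts depend on the tokenizer), and the function name chunk_words and its default sizes are illustrative only. Retrieval, meaning choosing which chunks to send with a given question, is left out.

```python
def chunk_words(text, chunk_size=2000, overlap=200):
    """Split text into overlapping word windows (words as a rough token proxy)."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reaches the end of the document
    return chunks
```

The overlap keeps a sentence that straddles a boundary visible in both neighbouring chunks, at the cost of sending some text twice.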

Want a different runtime — vLLM or SGLang?

The Meta canonical HF model card documents vllm serve "meta-llama/Llama-3.1-8B-Instruct" and python3 -m sglang.launch_server --model-path "meta-llama/Llama-3.1-8B-Instruct"; both load BF16 weights (16.07 GB on disk per the unsloth tier table) rather than the GGUF quantization. The 5080's 16 GB of VRAM is only just enough for the BF16 weights, and vLLM's KV-cache pre-allocation will tip it into an out-of-memory error unless --max-model-len is capped aggressively. For 16 GB consumer cards, the llama.cpp / Ollama GGUF path is the comfortable choice; reserve the BF16 vLLM/SGLang path for 24 GB+ cards (see the 4090 and 5090 siblings).

Standard Q4_K_M instead of Unsloth's UD-Q4_K_XL?

Both load with the same llama.cpp binary; only the quantization recipe differs. bartowski/Meta-Llama-3.1-8B-Instruct-GGUF ships the standard k-quant ladder if you prefer the conventional flavor — file sizes are nearly identical (Q4_K_M = 4.92 GB per the bartowski tree). Throughput will be close to but not identical to Unsloth's UD-Q4_K_XL because the per-layer bit-allocation differs. Ollama's llama3.1:8b default tag is also standard Q4_K_M (4.9 GB per the Ollama library page).

FlashAttention 2 errors with transformers

If you bypass Ollama / llama.cpp and run the HF model card's transformers quickstart directly, do not add attn_implementation="flash_attention_2" — FA2 wheels don't ship sm_120 kernels as of mid-2026 (Dao-AILab/flash-attention#2168). Either omit the argument (PyTorch picks SDPA automatically) or set attn_implementation="sdpa" explicitly. This caveat is moot for the recommended GGUF path above — llama.cpp and Ollama don't depend on FlashAttention.