What You'll Build
A local Llama 3.1 8B Instruct chat assistant running on an RTX 5080 (16 GB VRAM) through llama.cpp (or Ollama / LM Studio) with the unsloth/Llama-3.1-8B-Instruct-GGUF UD-Q4_K_XL weights (4.99 GB on disk, Unsloth's mixed-precision Dynamic 2.0 GGUF tier). On a 16 GB card the Q4_K_XL build peaks at roughly 10 GB with a 16K context, leaving about 6 GB of headroom. That is enough to step up to UD-Q6_K_XL / UD-Q8_K_XL for higher fidelity, raise the context window beyond 16K (Llama 3.1 supports up to 128K natively), or colocate a small companion model (a TTS encoder or a 1B-class assistant).
Hardware data: RTX 5080 (16 GB VRAM) · UD-Q4_K_XL GGUF · no first-party Llama 3.1 8B measurement on this card yet — see Results for a same-arch-class proxy · See benchmark data
⚠️ Quant pinned: Unsloth UD-Q4_K_XL. This recipe targets UD-Q4_K_XL from the unsloth/Llama-3.1-8B-Instruct-GGUF repo specifically. This is Unsloth's mixed-precision GGUF tier featured in their Dynamic 2.0 benchmark comparisons, with per-layer sensitivity-aware bit allocation. Standard Q4_K_M from other publishers (bartowski/Meta-Llama-3.1-8B-Instruct-GGUF, TheBloke) loads with the same llama.cpp binary, but the per-layer recipe and resulting quality/speed profile are different. See Troubleshooting if you prefer the conventional flavor.
ℹ️ Gated model: Meta access form required. The canonical meta-llama/Llama-3.1-8B-Instruct repo and the derived unsloth/Llama-3.1-8B-Instruct-GGUF both require accepting Meta's Llama 3.1 Community License before download. Click "Agree and access" on the model page while logged into HF, then run huggingface-cli login locally with a read token before the steps below. The license permits commercial use until you exceed 700 million monthly active users.
Requirements
| Component | Minimum | Tested |
|---|---|---|
| GPU | 8 GB VRAM (UD-Q4_K_XL fits) | RTX 5080 (16 GB) |
| RAM | 16 GB system | — |
| Storage | 4.99 GB (UD-Q4_K_XL GGUF) per unsloth/Llama-3.1-8B-Instruct-GGUF | — |
| Driver | CUDA 12.8+ runtime (Blackwell sm_120) | — |
| Runtime | llama.cpp / Ollama / LM Studio | llama.cpp b9247+ |
The 5080's 16 GB is comfortable for Q4_K_XL — weights resident on GPU are ~5 GB and the KV cache for a 16K context adds another ~4 GB, putting runtime peak around ~10 GB. You have ~6 GB of headroom to either jump to a heavier quant tier (UD-Q6_K_XL at 7.33 GB on disk, UD-Q8_K_XL at 10.58 GB) or stretch to longer context windows — see Results for the throughput-vs-context tradeoff.
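The envelope above is simple arithmetic, and a short sketch makes it easy to re-run for other context sizes. The snippet below is a rough estimate under stated assumptions, not a measurement: it uses the 32 layers and 8 KV heads cited in this recipe, assumes a head dimension of 128 and an f16 KV cache (neither is stated in the recipe), treats the quoted file sizes as decimal gigabytes, and ignores compute buffers and CUDA context overhead. The function names are illustrative.

```python
# Rough VRAM estimate: quantized weights + f16 KV cache.
# Assumptions (not from the recipe): head_dim=128, f16 cache, sizes in decimal GB.
GIB = 1024 ** 3

def kv_cache_bytes(n_tokens, n_layers=32, n_kv_heads=8, head_dim=128, bytes_per_elem=2):
    """Bytes for the K and V tensors of every layer across n_tokens positions."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * n_tokens

def estimate_peak_gib(weights_gb, n_tokens):
    """Weights plus KV cache in GiB; excludes compute buffers and CUDA context."""
    weights_gib = weights_gb * 1e9 / GIB
    return weights_gib + kv_cache_bytes(n_tokens) / GIB

for ctx in (4096, 16384, 32768, 131072):
    kv_gib = kv_cache_bytes(ctx) / GIB
    total_gib = estimate_peak_gib(4.99, ctx)  # 4.99 GB = UD-Q4_K_XL file size
    print(f"{ctx:>7} tokens: KV {kv_gib:5.2f} GiB, weights + KV ~{total_gib:5.2f} GiB")
```

Under these assumptions the f16 cache term alone is about 2 GiB at 16K and 16 GiB at 128K, so the ~4 GB quoted above may be best read as a conservative allowance that also covers runtime buffers. Measured peaks on your own card may differ.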
Installation
Option A — Ollama (recommended one-line path)
Ollama maintains its own pre-quantized build of Llama 3.1 8B Instruct and handles model download + serving with a single command. Per the Ollama llama3.1:8b tag, the default tag is 4.9 GB at Q4_K_M — essentially the same size and quality tier as Unsloth's UD-Q4_K_XL but using the standard k-quant recipe.
1. Install Ollama
# macOS / Linux
curl -fsSL https://ollama.com/install.sh | sh
(Windows: download from ollama.com/download.) Ollama bundles its own CUDA runtime, so the only host-side requirement is a recent NVIDIA driver with Blackwell sm_120 support (driver 575 or newer on Linux, or any current Windows driver).
2. Pull and run the 8B model
ollama pull llama3.1:8b
ollama run llama3.1:8b "Explain GQA attention in three sentences."
The first run downloads ~4.9 GB and loads the model into VRAM (resident ~5 GB; KV cache grows with conversation length). Subsequent prompts in the same session stay warm.
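Once the model is pulled, you can also reach it from a script rather than the CLI. The sketch below assumes Ollama is serving on its default local port (11434) and uses its /api/chat endpoint with streaming turned off; the helper names are illustrative, and only the Python standard library is used.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/chat"  # assumption: default Ollama port

def build_chat_request(prompt, model="llama3.1:8b"):
    """Build a non-streaming chat request for a single user prompt."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,  # return one JSON object instead of a token stream
    }
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def chat(prompt, model="llama3.1:8b", timeout_s=120):
    """Send the prompt to a running Ollama instance and return the reply text."""
    request = build_chat_request(prompt, model)
    with urllib.request.urlopen(request, timeout=timeout_s) as resp:
        body = json.loads(resp.read().decode("utf-8"))
    return body["message"]["content"]
```

With Ollama running, chat("Explain GQA attention in three sentences.") should return the assistant's reply as a string. The first call may be slow while the model loads into VRAM.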
Option B — llama.cpp + Unsloth UD-Q4_K_XL GGUF
If you want the specific Unsloth Dynamic 2.0 tier (UD-Q4_K_XL) and explicit control over context size and --n-gpu-layers, drive llama.cpp directly.
1. Install llama.cpp (CUDA 12.8 build)
The RTX 5080 uses Blackwell sm_120 — mainline llama.cpp ships sm_120 kernels, but you need a CUDA 12.8+ build. Pre-built CUDA 12.8 binaries are published on the llama.cpp releases page — pick a *-bin-ubuntu-cuda-12.x-x64.zip asset (Linux) or the matching Windows CUDA build.
# Linux — pre-built CUDA binary
# Download the latest "llama-bXXXX-bin-ubuntu-cuda-12.x-x64.zip" asset from:
# https://github.com/ggml-org/llama.cpp/releases
# Extract and add the bin/ directory to PATH.
# macOS (Homebrew) — CPU/Metal only, no CUDA, kept here for symmetry with the sibling 3090/4090 recipes
brew install llama.cpp
To build from source with CUDA 12.8 support, follow the llama.cpp CUDA build docs and pin the toolkit and arch explicitly:
# Make sure CUDA 12.8 is the active toolkit before the cmake configure step
export PATH=/usr/local/cuda-12.8/bin:$PATH
export CUDAToolkit_ROOT=/usr/local/cuda-12.8
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build \
  -DGGML_CUDA=ON \
  -DCMAKE_CUDA_ARCHITECTURES=120 \
  -DCUDAToolkit_ROOT=/usr/local/cuda-12.8
cmake --build build --config Release -j $(nproc)
CMAKE_CUDA_ARCHITECTURES=120 builds sm_120 kernels directly, avoiding PTX JIT compilation at first run.
2. Pull the UD-Q4_K_XL GGUF
The fastest path is the llama.cpp Hugging Face shortcut from the Unsloth model card quickstart — llama.cpp will fetch the tagged file directly:
pip install huggingface_hub hf_transfer
huggingface-cli login # paste a read token; required for the gated upstream
llama-server -hf unsloth/Llama-3.1-8B-Instruct-GGUF:UD-Q4_K_XL
For more control (specific local directory, pinned filename), pull only the Q4_K_XL file (~5 GB) via snapshot_download instead of the full repo:
# download_q4kxl.py
import os
# Set this before importing huggingface_hub so the hf_transfer backend is picked up.
os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"
from huggingface_hub import snapshot_download
snapshot_download(
    repo_id="unsloth/Llama-3.1-8B-Instruct-GGUF",
    local_dir="unsloth/Llama-3.1-8B-Instruct-GGUF",
    allow_patterns=["*UD-Q4_K_XL*"],  # fetch only the Q4_K_XL file, not the full repo
)
python download_q4kxl.py
The resulting file is unsloth/Llama-3.1-8B-Instruct-GGUF/Llama-3.1-8B-Instruct-UD-Q4_K_XL.gguf (4.99 GB per the unsloth model card).
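Before pointing llama.cpp at the file, a quick sanity check can catch a truncated download or an HTML error page saved under the .gguf name. The sketch below relies on the fact that GGUF files begin with the four ASCII bytes GGUF; the function name and the 10% size tolerance are illustrative choices, and the size comparison assumes the quoted 4.99 GB is in decimal gigabytes.

```python
import os

def looks_like_gguf(path, expected_gb=None, tolerance=0.10):
    """Return True if the file starts with the GGUF magic and, if an expected
    size is given, its size is within the tolerance of that figure."""
    with open(path, "rb") as f:
        if f.read(4) != b"GGUF":
            return False
    if expected_gb is None:
        return True
    size_gb = os.path.getsize(path) / 1e9  # decimal GB (assumption)
    return abs(size_gb - expected_gb) <= tolerance * expected_gb
```

For the file above, looks_like_gguf("unsloth/Llama-3.1-8B-Instruct-GGUF/Llama-3.1-8B-Instruct-UD-Q4_K_XL.gguf", expected_gb=4.99) is the intended call. This is only a smoke test and does not validate the tensor data.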
3. Start the server
llama-server \
  --model unsloth/Llama-3.1-8B-Instruct-GGUF/Llama-3.1-8B-Instruct-UD-Q4_K_XL.gguf \
  --ctx-size 16384 \
  --n-gpu-layers 99 \
  --host 0.0.0.0 --port 8080
--n-gpu-layers 99 offloads every layer to the 5080 (the 16 GB envelope is enough to keep the whole model resident at Q4_K_XL; layer streaming is unnecessary). --ctx-size 16384 sets a 16K context window — see Troubleshooting for guidance on pushing context higher.
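Loading about 5 GB of weights takes a few seconds, so scripts that start the server and then send requests straight away can fail. The sketch below polls llama-server's /health endpoint until it answers with HTTP 200. It assumes the host and port from the command above; the function names and timing defaults are illustrative.

```python
import time
import urllib.error
import urllib.request

def server_is_healthy(base_url="http://localhost:8080", timeout_s=2.0):
    """Probe /health once; True only when the server answers with HTTP 200."""
    try:
        with urllib.request.urlopen(f"{base_url}/health", timeout=timeout_s) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, or a non-200 status while the model is still loading.
        return False

def wait_until_ready(probe=server_is_healthy, max_wait_s=120.0, interval_s=1.0):
    """Call probe() repeatedly until it returns True or max_wait_s elapses."""
    deadline = time.monotonic() + max_wait_s
    while time.monotonic() < deadline:
        if probe():
            return True
        time.sleep(interval_s)
    return False
```

Calling wait_until_ready() right after launching llama-server returns True once the model is loaded, or False if the deadline passes first. The probe is injectable so the loop can be tested without a running server.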
Option C — LM Studio (GUI)
LM Studio's built-in catalog search ("Llama 3.1 8B Instruct GGUF") will surface both the Unsloth UD-Q4_K_XL build and the bartowski standard-quant ladder. Pick Llama-3.1-8B-Instruct-UD-Q4_K_XL from the Unsloth repo and download — same file as Option B. LM Studio's loader will set --n-gpu-layers to "max" automatically for a 5080 once it recognizes the Blackwell card.
Running
One-shot prompt via the llama.cpp HTTP server
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama-3.1-8b-instruct",
    "messages": [{"role": "user", "content": "Write a haiku about Blackwell GPUs."}]
  }'
The llama.cpp llama-server binary exposes an OpenAI-compatible /v1/chat/completions endpoint on the port chosen above.
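The same request can be made from Python without extra dependencies. The sketch below mirrors the curl call and adds an optional conversation history for multi-turn use. It assumes the server from the previous step is listening on localhost:8080 and that the response follows the OpenAI chat-completions shape; the helper names and the max_tokens default are illustrative.

```python
import json
import urllib.request

def build_payload(history, user_text, model="llama-3.1-8b-instruct", max_tokens=256):
    """Append the new user turn to the prior messages and build the request body."""
    messages = list(history) + [{"role": "user", "content": user_text}]
    return {"model": model, "messages": messages, "max_tokens": max_tokens}

def extract_reply(response_body):
    """Pull the assistant text out of an OpenAI-style chat completion response."""
    return response_body["choices"][0]["message"]["content"]

def post_chat(payload, base_url="http://localhost:8080", timeout_s=120):
    """POST the payload to the server's /v1/chat/completions endpoint."""
    request = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout_s) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

A single turn is extract_reply(post_chat(build_payload([], "Write a haiku about Blackwell GPUs."))). For a multi-turn chat, append each user message and assistant reply to the history list before the next call.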
Interactive terminal
llama-cli \
  --model unsloth/Llama-3.1-8B-Instruct-GGUF/Llama-3.1-8B-Instruct-UD-Q4_K_XL.gguf \
  --ctx-size 16384 \
  --n-gpu-layers 99 \
  --interactive
Press Ctrl-C to interrupt generation; the CLI keeps the model warm in VRAM until exit.
Step up to UD-Q8_K_XL (near-lossless) on this card
The UD-Q8_K_XL build is 10.58 GB on disk per the unsloth tier table; on the 5080's 16 GB envelope you still have ~5 GB of headroom for a 16K-token KV cache, which fits the typical chat / coding workload comfortably at near-lossless quality. Use allow_patterns=["*UD-Q8_K_XL*"] in the snapshot_download script above to fetch the Q8 file instead. Expect throughput to drop relative to Q4 because memory bandwidth, not compute, is the binding constraint on transformer token generation.
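The bandwidth argument can be turned into a back-of-envelope ceiling: if each generated token has to stream every weight from VRAM once, tokens per second cannot exceed bandwidth divided by model size. The sketch below applies that to the three Unsloth tiers quoted in this recipe. The 960 GB/s figure is an assumption taken from the RTX 5080 spec sheet, not from this recipe, and the model ignores KV-cache reads and kernel overhead, so real throughput will be lower.

```python
def decode_ceiling_tok_s(weights_gb, bandwidth_gb_s):
    """Upper bound on tokens/s if each token streams all weights from VRAM once."""
    return bandwidth_gb_s / weights_gb

BANDWIDTH_GB_S = 960.0  # assumption: RTX 5080 spec-sheet memory bandwidth

# File sizes (GB) as quoted in this recipe.
TIERS = (("UD-Q4_K_XL", 4.99), ("UD-Q6_K_XL", 7.33), ("UD-Q8_K_XL", 10.58))

for name, size_gb in TIERS:
    ceiling = decode_ceiling_tok_s(size_gb, BANDWIDTH_GB_S)
    print(f"{name}: at most ~{ceiling:.0f} tok/s (bandwidth ceiling, not a measurement)")
```

The useful takeaway is the ratio rather than the absolute numbers: under this model the Q8 ceiling is roughly 4.99 / 10.58, or a little under half, of the Q4 ceiling.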
Results
- Speed: No first-party Llama 3.1 8B measurement on the RTX 5080 exists yet; the backend /check/ page currently reports verdict: unknown with no benchmark rows for this pair. As a same-arch-class proxy (not a Llama measurement), Hardware Corner's RTX 5080 LLM benchmark page measures a comparable dense 8B Q4 model: its Qwen3 8B (Q4_K) row reads 129.1 tok/s generation / 6,410.1 tok/s prompt-processing at 4K context on the 5080. Llama 3.1 8B should land in a similar band (same dense-transformer 8B class, same Q4 tier) modulo per-architecture variance, but this is an extrapolation from a different model, not a Llama 3.1 8B number. If you run llama.cpp + UD-Q4_K_XL on your own 5080, please submit your numbers so a Llama-3.1-8B-specific first-party measurement replaces this proxy.
- VRAM usage: No first-party measured peak VRAM is in the backend yet. As a derived envelope (labelled as derived, not measured): UD-Q4_K_XL weights resident on GPU are 4.99 GB per unsloth's file table; the KV cache for a 16K context on an 8B model with 32 layers and 8 GQA heads adds ~4 GB, putting the runtime peak around ~9–10 GB, well inside the 5080's 16 GB envelope. Community measurement of the actual resident peak will replace the derived envelope when it lands via /contribute.
- Quality notes: UD-Q4_K_XL is the Unsloth mixed-precision GGUF tier; the Unsloth Dynamic 2.0 docs discuss per-layer sensitivity-aware bit-allocation across the family. On a 16 GB 5080 you can comfortably step up to UD-Q6_K_XL (7.33 GB), UD-Q8_K_XL (10.58 GB), or even Q6_K standard (6.60 GB per the unsloth file table) — there's no quality-floor reason to run anything below Q4_K_M on this hardware. BF16 full precision (16.07 GB on disk) overflows the 16 GB card without offload and isn't recommended.
For the full benchmark data and cross-GPU comparisons (3090 / 4090 / 5090 siblings), see /check/llama-3-1-8b/rtx-5080.
Troubleshooting
huggingface-cli 401 / 403 on the Unsloth GGUF repo
The Unsloth quantization inherits gating from the upstream meta-llama/Llama-3.1-8B-Instruct repo. You need to (a) be logged in via huggingface-cli login with a token that has read access, and (b) have clicked "Agree and access" on the upstream Meta repo while logged in — the access carries through to the derived Unsloth mirror. The full license terms are at github.com/meta-llama/llama-models.
Driver too old — Ollama silently falls back to CPU
The RTX 5080 uses Blackwell sm_120; older CUDA builds lack the kernels, and Ollama silently falls back to CPU inference, which appears as a hang or single-digit tok/s. Confirm that a driver with CUDA 12.8+ support is installed (nvidia-smi should report driver 575+ on Linux), then reinstall Ollama. The same advice applies to llama.cpp: use a cuda-12.8 release binary, not an older one.
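The driver check can be scripted. The sketch below shells out to nvidia-smi and parses the driver version and compute capability. It assumes your nvidia-smi supports the compute_cap query field and takes 575 as the minimum driver major version, as stated in this recipe; the function names are illustrative.

```python
import subprocess

def parse_gpu_line(line):
    """Parse one 'name, driver_version, compute_cap' CSV row from nvidia-smi."""
    name, driver, cap = [field.strip() for field in line.split(",")]
    return {
        "name": name,
        "driver_major": int(driver.split(".")[0]),
        "compute_cap": tuple(int(part) for part in cap.split(".")),
    }

def query_gpus():
    """Run nvidia-smi and return one parsed dict per visible GPU."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=name,driver_version,compute_cap",
         "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    )
    return [parse_gpu_line(line) for line in result.stdout.splitlines() if line.strip()]

def ready_for_blackwell(gpu, min_driver_major=575):
    """True for an sm_120-class (or newer) card with a new enough driver."""
    return gpu["compute_cap"] >= (12, 0) and gpu["driver_major"] >= min_driver_major
```

Running [ready_for_blackwell(g) for g in query_gpus()] on the host gives a quick yes/no per card. A False for the 5080 suggests updating the driver before debugging Ollama or llama.cpp further.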
Generation slows down at longer context
Llama 3.1 ships with a 128K-token native context window per the HF model card metadata (base_model:meta-llama/Llama-3.1-8B, arxiv:2204.05149), but throughput drops as the KV cache fills. The same-class proxy on Hardware Corner's RTX 5080 LLM benchmark page — the Qwen3 8B Q4_K row — degrades from 129.1 tok/s at 4K to 94.1 tok/s at 16K to 72.5 tok/s at 32K; expect Llama 3.1 8B to follow a similar curve. At full 128K the KV cache alone consumes >12 GB and overflows the 5080's 16 GB envelope. For long-doc workflows on this card, keep --ctx-size at 32K or below; for longer documents, use chunking + retrieval.
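For the chunking half of that advice, a minimal sketch is shown below. It splits a document into overlapping windows of whitespace-separated words, which is a crude stand-in for token-based chunking: word counts only approximate token counts, and the default sizes are illustrative rather than tuned. The retrieval step (embedding and ranking the chunks) is not shown.

```python
def chunk_words(text, chunk_size=1500, overlap=200):
    """Split text into overlapping word windows.

    Consecutive chunks share `overlap` words so that sentences cut at a
    boundary still appear whole in at least one chunk.
    """
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last window already reaches the end of the text
    return chunks
```

Each chunk can then be sent as its own prompt, or only the most relevant chunks can be selected, so that every request stays well under the --ctx-size you configured.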
Want a different runtime — vLLM or SGLang?
The Meta canonical HF model card documents vllm serve "meta-llama/Llama-3.1-8B-Instruct" and python3 -m sglang.launch_server --model-path "meta-llama/Llama-3.1-8B-Instruct". Both load BF16 weights (16.07 GB on disk per the unsloth tier table) rather than the GGUF quantization. Those weights alone sit right at the 5080's 16 GB capacity, and vLLM's KV-cache pre-allocation will push the card into OOM without aggressive --max-model-len capping. For 16 GB consumer cards, the llama.cpp / Ollama GGUF path is the comfortable choice; reserve the BF16 vLLM/SGLang path for 24 GB+ cards (see the 4090 and 5090 siblings).
Standard Q4_K_M instead of Unsloth's UD-Q4_K_XL?
Both load with the same llama.cpp binary; only the quantization recipe differs. bartowski/Meta-Llama-3.1-8B-Instruct-GGUF ships the standard k-quant ladder if you prefer the conventional flavor — file sizes are nearly identical (Q4_K_M = 4.92 GB per the bartowski tree). Throughput will be close to but not identical to Unsloth's UD-Q4_K_XL because the per-layer bit-allocation differs. Ollama's llama3.1:8b default tag is also standard Q4_K_M (4.9 GB per the Ollama library page).
FlashAttention 2 errors with transformers
If you bypass Ollama / llama.cpp and run the HF model card's transformers quickstart directly, do not add attn_implementation="flash_attention_2" — FA2 wheels don't ship sm_120 kernels as of mid-2026 (Dao-AILab/flash-attention#2168). Either omit the argument (PyTorch picks SDPA automatically) or set attn_implementation="sdpa" explicitly. This caveat is moot for the recommended GGUF path above — llama.cpp and Ollama don't depend on FlashAttention.