Llama 3.1 8B on RTX 5060 Ti: Local Chat via Ollama or llama.cpp + Unsloth UD-Q4_K_XL GGUF

llm · beginner · 10GB+ VRAM · May 28, 2026
prerequisites
  • NVIDIA RTX 5060 Ti (16 GB VRAM) or equivalent Blackwell-class card
  • Recent NVIDIA driver with CUDA 12.8+ support (required for Blackwell sm_120 kernels)
  • ~5 GB free disk for the UD-Q4_K_XL GGUF (or ~16 GB for BF16, ~10.6 GB for UD-Q8_K_XL)
  • llama.cpp, Ollama, or LM Studio installed
  • Hugging Face account with access to the gated meta-llama/Llama-3.1-8B-Instruct repo

What You'll Build

A local Llama 3.1 8B Instruct chat assistant running on an RTX 5060 Ti (16 GB VRAM) through llama.cpp (or Ollama / LM Studio) with the unsloth/Llama-3.1-8B-Instruct-GGUF UD-Q4_K_XL weights (4.99 GB on disk, Unsloth's mixed-precision Dynamic 2.0 GGUF tier). On a 16 GB envelope the Q4_K_XL build leaves ~6 GB of runtime headroom over the typical ~10 GB peak — enough room to step up to UD-Q6_K_XL / UD-Q8_K_XL for higher fidelity, stretch Llama 3.1's native context beyond 16K, or colocate a small companion model (a TTS encoder or a 1B-class assistant).

Hardware data: RTX 5060 Ti (16 GB VRAM) · UD-Q4_K_XL GGUF · no first-party Llama 3.1 8B measurement on this card yet — see Results for a same-arch-class proxy · See benchmark data

⚠️ Quant pinned — Unsloth UD-Q4_K_XL. This recipe targets UD-Q4_K_XL from the unsloth/Llama-3.1-8B-Instruct-GGUF repo specifically — Unsloth's mixed-precision GGUF tier featured in their Dynamic 2.0 benchmark comparisons, with per-layer sensitivity-aware bit-allocation. Standard Q4_K_M from other publishers (bartowski/Meta-Llama-3.1-8B-Instruct-GGUF, TheBloke) loads with the same llama.cpp binary, but the per-layer recipe and resulting quality/speed profile are different — see Troubleshooting if you prefer the conventional flavor.

ℹ️ Gated model: Meta access form required. The canonical meta-llama/Llama-3.1-8B-Instruct repo and the derived unsloth/Llama-3.1-8B-Instruct-GGUF both require accepting Meta's Llama 3.1 Community License before download. Click "Agree and access" on the model page while logged into HF, then run huggingface-cli login locally with a read token before the steps below. The license permits commercial use; the exception is that organizations whose products exceeded 700 million monthly active users on the Llama 3.1 release date must request a separate license from Meta.

Requirements

Component | Minimum | Tested
GPU | 8 GB VRAM (UD-Q4_K_XL fits) | RTX 5060 Ti (16 GB)
RAM | 16 GB system |
Storage | 4.99 GB (UD-Q4_K_XL GGUF) per unsloth/Llama-3.1-8B-Instruct-GGUF |
Driver | CUDA 12.8+ runtime (Blackwell sm_120) |
Runtime | llama.cpp / Ollama / LM Studio | llama.cpp b9247+

The 5060 Ti's 16 GB is comfortable for Q4_K_XL — weights resident on GPU are ~5 GB and the KV cache for a 16K context adds another ~4 GB, putting runtime peak around ~10 GB. You have ~6 GB of headroom to either jump to a heavier quant tier (UD-Q6_K_XL at 7.33 GB on disk, UD-Q8_K_XL at 10.58 GB) or stretch to longer context windows — see Results for the throughput-vs-context tradeoff.
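
The KV-cache part of that budget can be estimated from the model shape. The sketch below assumes Llama 3.1 8B's published configuration (32 layers, 8 KV heads, head dimension 128) and an f16 cache at 2 bytes per element, which is llama.cpp's default cache type; the function name is illustrative. Under those assumptions the cache itself is about 2 GiB at 16K, so the ~4 GB figure above reads as a conservative allowance that also leaves room for compute buffers and allocator overhead.

```python
def kv_cache_bytes(n_ctx, n_layers=32, n_kv_heads=8, head_dim=128, bytes_per_elem=2):
    """Estimate KV-cache size: one K and one V vector per layer, per KV head, per token."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * n_ctx


GIB = 1024 ** 3

# Print the estimate for a few context sizes, up to the 128K native window.
for n_ctx in (4096, 16384, 32768, 131072):
    print(f"{n_ctx:>7} tokens: {kv_cache_bytes(n_ctx) / GIB:.1f} GiB")
```

The estimate scales linearly with context length, which is why the 128K case (16 GiB under these assumptions) cannot fit next to the weights on a 16 GB card.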

Installation

Option A — Ollama (recommended one-line path)

Ollama maintains its own pre-quantized build of Llama 3.1 8B Instruct and handles model download + serving with a single command. Per the Ollama llama3.1:8b tag, the default tag is 4.9 GB at Q4_K_M — essentially the same size and quality tier as Unsloth's UD-Q4_K_XL but using the standard k-quant recipe.

1. Install Ollama

# macOS / Linux
curl -fsSL https://ollama.com/install.sh | sh

(Windows: download from ollama.com/download.) Ollama bundles its own CUDA runtime, so the only host-side requirement is a recent NVIDIA driver with Blackwell sm_120 support (the GeForce 575+ series on Linux, or any current Windows driver).

2. Pull and run the 8B model

ollama pull llama3.1:8b
ollama run llama3.1:8b "Explain GQA attention in three sentences."

The first run downloads ~4.9 GB and loads the model into VRAM (resident ~5 GB; KV cache grows with conversation length). Subsequent prompts in the same session stay warm.
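
Once the model is pulled, Ollama also serves it over a local REST API, which is convenient for scripting. A minimal sketch, assuming Ollama's default address http://localhost:11434 and its /api/chat endpoint; the helper names build_chat_payload and chat are illustrative, not part of Ollama.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/chat"  # assumed default Ollama address


def build_chat_payload(prompt, model="llama3.1:8b"):
    """Build a non-streaming /api/chat request body."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,  # ask for one JSON object instead of a stream of chunks
    }


def chat(prompt, url=OLLAMA_URL):
    """POST the prompt to a running Ollama server and return the reply text."""
    data = json.dumps(build_chat_payload(prompt)).encode("utf-8")
    req = urllib.request.Request(
        url, data=data, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["message"]["content"]
```

With the Ollama service running, chat("Explain GQA attention in three sentences.") returns the same kind of answer as the ollama run command above.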

Option B — llama.cpp + Unsloth UD-Q4_K_XL GGUF

If you want the specific Unsloth Dynamic 2.0 tier (UD-Q4_K_XL) and explicit control over context size and --n-gpu-layers, drive llama.cpp directly.

1. Install llama.cpp (CUDA 12.8 build)

The RTX 5060 Ti uses Blackwell sm_120 — mainline llama.cpp ships sm_120 kernels, but you need a CUDA 12.8+ build. Pre-built CUDA 12.8 binaries are published on the llama.cpp releases page — pick a *-bin-ubuntu-cuda-12.x-x64.zip asset (Linux) or the matching Windows CUDA build.

# Linux — pre-built CUDA binary
# Download the latest "llama-bXXXX-bin-ubuntu-cuda-12.x-x64.zip" asset from:
#   https://github.com/ggml-org/llama.cpp/releases
# Extract and add the bin/ directory to PATH.

# macOS (Homebrew) — CPU/Metal only, no CUDA, kept here for symmetry with the sibling 3090/4090 recipes
brew install llama.cpp

To build from source with CUDA 12.8 support, follow the llama.cpp CUDA build docs and pin the toolkit and arch explicitly:

# Make sure CUDA 12.8 is the active toolkit before the cmake configure step
export PATH=/usr/local/cuda-12.8/bin:$PATH
export CUDAToolkit_ROOT=/usr/local/cuda-12.8

git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build \
  -DGGML_CUDA=ON \
  -DCMAKE_CUDA_ARCHITECTURES=120 \
  -DCUDAToolkit_ROOT=/usr/local/cuda-12.8
cmake --build build --config Release -j $(nproc)

CMAKE_CUDA_ARCHITECTURES=120 builds sm_120 kernels directly, avoiding PTX JIT compilation at first run.

2. Pull the UD-Q4_K_XL GGUF

The fastest path is the llama.cpp Hugging Face shortcut from the Unsloth model card quickstart — llama.cpp will fetch the tagged file directly:

pip install huggingface_hub hf_transfer
huggingface-cli login   # paste a read token; required for the gated upstream
llama-server -hf unsloth/Llama-3.1-8B-Instruct-GGUF:UD-Q4_K_XL

For more control (specific local directory, pinned filename), pull only the Q4_K_XL file (~5 GB) via snapshot_download instead of the full 96-file repo:

# download_q4kxl.py
import os
os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="unsloth/Llama-3.1-8B-Instruct-GGUF",
    local_dir="unsloth/Llama-3.1-8B-Instruct-GGUF",
    allow_patterns=["*UD-Q4_K_XL*"],
)

Save the script as download_q4kxl.py, then run it:

python download_q4kxl.py

The resulting file is unsloth/Llama-3.1-8B-Instruct-GGUF/Llama-3.1-8B-Instruct-UD-Q4_K_XL.gguf (4.99 GB per the unsloth model card).

3. Start the server

llama-server \
  --model unsloth/Llama-3.1-8B-Instruct-GGUF/Llama-3.1-8B-Instruct-UD-Q4_K_XL.gguf \
  --ctx-size 16384 \
  --n-gpu-layers 99 \
  --host 0.0.0.0 --port 8080

--n-gpu-layers 99 offloads every layer to the 5060 Ti (the 16 GB envelope is enough to keep the whole model resident at Q4_K_XL; layer streaming is unnecessary). --ctx-size 16384 matches the 16K row of the public 5060 Ti benchmark; see Troubleshooting for guidance on pushing context higher. --host 0.0.0.0 listens on every network interface; use --host 127.0.0.1 if the server should only be reachable from this machine.

Option C — LM Studio (GUI)

LM Studio's built-in catalog search ("Llama 3.1 8B Instruct GGUF") will surface both the Unsloth UD-Q4_K_XL build and the bartowski standard-quant ladder. Pick Llama-3.1-8B-Instruct-UD-Q4_K_XL from the Unsloth repo and download it; this is the same file as in Option B. LM Studio's loader will set GPU offload (its equivalent of --n-gpu-layers) to the maximum automatically for a 5060 Ti once it recognizes the Blackwell card.

Running

One-shot prompt via the llama.cpp HTTP server

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama-3.1-8b-instruct",
    "messages": [{"role": "user", "content": "Write a haiku about Blackwell GPUs."}]
  }'

The llama.cpp llama-server binary exposes an OpenAI-compatible /v1/chat/completions endpoint on the port chosen above.
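
The same request can be issued from Python with only the standard library. A minimal sketch, assuming the server started above is listening on http://localhost:8080; the helper names extract_reply and ask are illustrative, and max_tokens is an optional cap added here for convenience.

```python
import json
import urllib.request


def extract_reply(response):
    """Pull the assistant text out of an OpenAI-style chat completion response."""
    return response["choices"][0]["message"]["content"]


def ask(prompt, base_url="http://localhost:8080", max_tokens=128):
    """Send one user message to a running llama-server and return the reply."""
    body = {
        "model": "llama-3.1-8b-instruct",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return extract_reply(json.load(resp))


# Offline check of the parsing step with a hand-written response of the same shape.
sample = {"choices": [{"message": {"role": "assistant", "content": "ok"}}]}
print(extract_reply(sample))
```

Because the endpoint follows the OpenAI chat-completions shape, existing OpenAI-style client code can usually be pointed at the same base URL instead of this hand-rolled helper.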

Interactive terminal

llama-cli \
  --model unsloth/Llama-3.1-8B-Instruct-GGUF/Llama-3.1-8B-Instruct-UD-Q4_K_XL.gguf \
  --ctx-size 16384 \
  --n-gpu-layers 99 \
  --interactive

Press Ctrl-C to interrupt generation; the CLI keeps the model warm in VRAM until exit.

Step up to UD-Q8_K_XL (near-lossless) on this card

The UD-Q8_K_XL build is 10.58 GB on disk per the unsloth tier table; on the 5060 Ti's 16 GB envelope you still have ~5 GB of headroom for a 16K-token KV cache, which fits the typical chat / coding workload comfortably at near-lossless quality. Use allow_patterns=["*UD-Q8_K_XL*"] in the snapshot_download script above to fetch the Q8 file instead. Expect generation throughput to drop relative to Q4: token-by-token decoding is bound by memory bandwidth rather than compute, and the Q8 file is roughly twice as many bytes to read per token.
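
The bandwidth argument can be made concrete with a rough upper bound: each generated token has to read approximately the full set of weights once, so decode speed cannot exceed memory bandwidth divided by model size. The sketch assumes 448 GB/s of memory bandwidth for the RTX 5060 Ti (a spec-sheet figure that this recipe does not state; substitute your card's value) and ignores KV-cache reads and kernel overhead, so real throughput lands below these ceilings. The function name is illustrative.

```python
def decode_ceiling_tok_s(model_gb, bandwidth_gb_s=448.0):
    """Upper bound on single-stream decode speed when weight reads dominate."""
    return bandwidth_gb_s / model_gb


# File sizes are the on-disk figures quoted in this recipe.
for name, size_gb in (("UD-Q4_K_XL", 4.99), ("UD-Q6_K_XL", 7.33), ("UD-Q8_K_XL", 10.58)):
    print(f"{name}: <= {decode_ceiling_tok_s(size_gb):.0f} tok/s")
```

Under these assumptions the ceilings come out near 90, 61 and 42 tok/s, which is why moving from Q4 to Q8 should be expected to roughly halve generation speed.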

Results

  • Speed: No first-party Llama 3.1 8B measurement on the RTX 5060 Ti is available at the time of writing. The backend /check/ page currently surfaces an aggregator citation whose underlying numbers appear to be relabelled from the Qwen3-8B-Q4 row of a different benchmark page — disregard those figures until a first-party measurement lands. For a same-card same-arch-class proxy use Hardware Corner's RTX 5060 Ti 16GB LLM benchmark page: the Qwen3 8B Q4_K row measures 69.2 tok/s generation / 2,965 tok/s prefill at 4K context — Llama 3.1 8B should land in a similar envelope (same dense-transformer 8B class, same Q4 tier) modulo per-architecture variance. If you run llama.cpp + UD-Q4_K_XL on your own 5060 Ti, please submit your numbers so a Llama-3.1-8B-specific first-party measurement replaces the placeholder.
  • VRAM usage: No first-party measured peak VRAM is in the backend yet. As a derived envelope (derived, not measured): UD-Q4_K_XL weights resident on GPU are 4.99 GB per unsloth's file table; the f16 KV cache for a 16K context (32 layers, 8 KV heads under GQA, head dimension 128) is about 2 GiB, and the ~4 GB allowance used in this recipe adds margin for compute buffers and overhead, putting the runtime peak at or below ~9–10 GB, well inside the 5060 Ti's 16 GB envelope. Community measurement of the actual resident peak will replace the derived envelope when it lands via /contribute.
  • Quality notes: UD-Q4_K_XL is the Unsloth mixed-precision GGUF tier; the Unsloth Dynamic 2.0 docs discuss per-layer sensitivity-aware bit-allocation across the family. On a 16 GB 5060 Ti you can comfortably step up to UD-Q6_K_XL (7.33 GB), UD-Q8_K_XL (10.58 GB), or even Q6_K standard (6.60 GB per the unsloth file table) — there's no quality-floor reason to run anything below Q4_K_M on this hardware. BF16 full precision (16.07 GB on disk) overflows the 16 GB card without offload and isn't recommended.

For the full benchmark data and cross-GPU comparisons (3090 / 4090 / 5090 siblings), see /check/llama-3-1-8b/rtx-5060-ti.

Troubleshooting

huggingface-cli 401 / 403 on the Unsloth GGUF repo

The Unsloth quantization inherits gating from the upstream meta-llama/Llama-3.1-8B-Instruct repo. You need to (a) be logged in via huggingface-cli login with a token that has read access, and (b) have clicked "Agree and access" on the upstream Meta repo while logged in — the access carries through to the derived Unsloth mirror. The full license terms are at github.com/meta-llama/llama-models.

Driver too old — Ollama silently falls back to CPU

The RTX 5060 Ti uses Blackwell sm_120; CUDA runtimes older than 12.8 lack the kernels, and Ollama then silently falls back to CPU inference, which appears as a hang or single-digit tok/s. Confirm that a driver with CUDA 12.8+ support is installed (nvidia-smi should report driver 575+ on Linux), then reinstall Ollama. The same advice applies to llama.cpp: use a CUDA 12.8 release binary, not an older one.
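
A quick way to check the driver side is to query nvidia-smi directly. The sketch below parses the output of nvidia-smi --query-gpu=name,driver_version,compute_cap --format=csv,noheader; the helper names are illustrative, and the 575 threshold is simply the figure quoted above, not an independently verified minimum.

```python
import subprocess

QUERY = [
    "nvidia-smi",
    "--query-gpu=name,driver_version,compute_cap",
    "--format=csv,noheader",
]


def parse_gpu_line(line):
    """Split one CSV line into (name, driver major version, compute capability)."""
    name, driver, cap = [field.strip() for field in line.rsplit(",", 2)]
    return name, int(driver.split(".")[0]), cap


def looks_ready(line, min_driver_major=575, wanted_cap="12.0"):
    """True if the driver is new enough and the card reports sm_120 (compute 12.0)."""
    _, driver_major, cap = parse_gpu_line(line)
    return driver_major >= min_driver_major and cap == wanted_cap


def check_local_gpu():
    """Run nvidia-smi and evaluate the first GPU it lists."""
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
    return looks_ready(out.splitlines()[0])
```

If check_local_gpu() returns False, update the driver before reinstalling Ollama or swapping llama.cpp binaries.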

Generation slows dramatically past 16K context

Llama 3.1 ships with a 128K-token native context window per the HF model card metadata (base_model:meta-llama/Llama-3.1-8B, arxiv:2204.05149), but throughput drops as the KV cache fills — on Hardware Corner's RTX 5060 Ti 16GB LLM benchmark page the adjacent 8B-class qwen3-8b Q4_K row degrades 4K = 69.2 tok/s → 16K = 51.4 tok/s → 32K = 38.9 tok/s; expect Llama 3.1 8B to follow a similar curve. At full 128K the cache alone consumes >12 GB and overflows the 5060 Ti's 16 GB envelope. For long-doc workflows on this card, keep --ctx-size at 32K or below; for longer documents, use chunking + retrieval.
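
For the chunking half of that advice, a minimal sketch: split the document into overlapping windows that each fit the chosen context budget. It counts whitespace-separated words as a rough stand-in for tokens (an assumption; a real pipeline would count with the model's tokenizer), and the function name and default sizes are illustrative.

```python
def chunk_words(text, chunk_size=3000, overlap=200):
    """Split text into word windows of chunk_size, each sharing `overlap` words with the previous one."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last window already reaches the end of the text
    return chunks


demo = chunk_words("a b c d e f g h i j", chunk_size=4, overlap=1)
print(demo)
```

Each chunk can then be summarized or embedded on its own, with only the relevant chunks passed back to the model so the prompt stays within the --ctx-size set at server start.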

Want a different runtime — vLLM or SGLang?

The Meta canonical HF model card documents vllm serve "meta-llama/Llama-3.1-8B-Instruct" and python3 -m sglang.launch_server --model-path "meta-llama/Llama-3.1-8B-Instruct"; both load BF16 weights (16.07 GB on disk per the unsloth tier table) rather than the GGUF quantization. Those weights alone already sit at the 5060 Ti's 16 GB capacity, so vLLM's KV-cache pre-allocation will run the card out of memory unless --max-model-len is capped aggressively. For 16 GB consumer cards, the llama.cpp / Ollama GGUF path is the comfortable choice; reserve the BF16 vLLM/SGLang path for 24 GB+ cards (see the 4090 and 5090 siblings).

Standard Q4_K_M instead of Unsloth's UD-Q4_K_XL?

Both load with the same llama.cpp binary; only the quantization recipe differs. bartowski/Meta-Llama-3.1-8B-Instruct-GGUF ships the standard k-quant ladder if you prefer the conventional flavor — file sizes are nearly identical (Q4_K_M = 4.92 GB per the bartowski tree). Throughput will be close to but not identical to Unsloth's UD-Q4_K_XL because the per-layer bit-allocation differs. Ollama's llama3.1:8b default tag is also standard Q4_K_M (4.9 GB per the Ollama library page).

FlashAttention 2 errors with transformers

If you bypass Ollama / llama.cpp and run the HF model card's transformers quickstart directly, do not add attn_implementation="flash_attention_2" — FA2 wheels don't ship sm_120 kernels as of mid-2026 (Dao-AILab/flash-attention#2168). Either omit the argument (PyTorch picks SDPA automatically) or set attn_implementation="sdpa" explicitly. This caveat is moot for the recommended GGUF path above — llama.cpp and Ollama don't depend on FlashAttention.