Llama 3.1 8B on RTX 5070: Local Chat via Ollama or llama.cpp + Unsloth UD-Q4_K_XL GGUF

What You'll Build

A local Llama 3.1 8B Instruct chat assistant running on an RTX 5070 (12 GB VRAM) through llama.cpp (or Ollama / LM Studio) with the unsloth/Llama-3.1-8B-Instruct-GGUF UD-Q4_K_XL weights (4.99 GB on disk, Unsloth's mixed-precision Dynamic 2.0 GGUF tier). With 4.99 GB of weights resident, the UD-Q4_K_XL build leaves comfortable room on the 12 GB card for a 16K-token context window and a display, with an estimated runtime peak of ~9–10 GB at most (derived, not measured; see Results).

Hardware data: RTX 5070 (12 GB VRAM) · UD-Q4_K_XL GGUF · no first-party Llama 3.1 8B measurement on this card yet — see Results for a same-arch-class proxy · See benchmark data

⚠️ 12 GB is not 16 GB — mind the usable headroom. A desktop RTX 5070 with a monitor attached exposes roughly 10.5–11.3 GB of usable VRAM (the display compositor and driver reserve the rest); a headless Linux box gets closer to ~11.6 GB. The UD-Q4_K_XL path's ~9–10 GB peak fits this with margin, but the heavier near-lossless quants that fit a 16 GB card (UD-Q8_K_XL at 10.58 GB on disk) do not leave display headroom here — see Results for which tiers are realistic on 12 GB.
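To see where your own card falls in that usable range, you can read the totals that nvidia-smi reports before loading any model. The following is a minimal sketch, assuming nvidia-smi is on PATH; the function names parse_gpu_memory and query_gpu_memory are hypothetical helpers, not part of any tool mentioned here.

```python
import subprocess

# nvidia-smi reports memory in MiB when "nounits" is requested.
QUERY = [
    "nvidia-smi",
    "--query-gpu=name,memory.total,memory.used",
    "--format=csv,noheader,nounits",
]


def parse_gpu_memory(csv_text):
    """Turn nvidia-smi CSV rows into dicts with MiB figures."""
    gpus = []
    for line in csv_text.strip().splitlines():
        # Split from the right so a comma in the GPU name cannot break parsing.
        name, total, used = [field.strip() for field in line.rsplit(",", 2)]
        total_mib, used_mib = int(total), int(used)
        gpus.append({
            "name": name,
            "total_mib": total_mib,
            "used_mib": used_mib,
            "free_mib": total_mib - used_mib,
        })
    return gpus


def query_gpu_memory():
    """Run nvidia-smi and return one dict per GPU (hypothetical helper)."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_gpu_memory(result.stdout)
```

Calling query_gpu_memory() with the desktop idle shows how much VRAM the compositor and driver already hold; free_mib is the figure to compare against the model's expected peak.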

⚠️ Quant pinned — Unsloth UD-Q4_K_XL. This recipe targets UD-Q4_K_XL from the unsloth/Llama-3.1-8B-Instruct-GGUF repo specifically — Unsloth's mixed-precision GGUF tier featured in their Dynamic 2.0 GGUF docs, with per-layer sensitivity-aware bit-allocation. Standard Q4_K_M from other publishers (bartowski/Meta-Llama-3.1-8B-Instruct-GGUF, TheBloke) loads with the same llama.cpp binary, but the per-layer recipe and resulting quality/speed profile are different — see Troubleshooting if you prefer the conventional flavor.

ℹ️ Gated upstream: Meta access form required only for the original weights. The canonical meta-llama/Llama-3.1-8B-Instruct repo requires accepting Meta's Llama 3.1 Community License before download: click "Agree and access" on the model page while logged into HF, then run huggingface-cli login locally with a read token. The derived unsloth/Llama-3.1-8B-Instruct-GGUF mirror used in the steps below is public and ungated (see Troubleshooting), though its weights remain subject to the same license. The license permits commercial use, but organizations with more than 700 million monthly active users must request a separate license from Meta.

Requirements

Component	Minimum	Tested
GPU	8 GB VRAM (UD-Q4_K_XL fits)	RTX 5070 (12 GB)
RAM	16 GB system	—
Storage	4.99 GB (UD-Q4_K_XL GGUF) per unsloth/Llama-3.1-8B-Instruct-GGUF	—
Driver	CUDA 12.8+ runtime (Blackwell sm_120)	—
Runtime	llama.cpp / Ollama / LM Studio	llama.cpp b9247+

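The 5070's 12 GB is comfortable for Q4_K_XL. Weights resident on GPU are ~5 GB, and at llama.cpp's default f16 cache precision the KV cache for a 16K context adds ~2 GB (2 tensors × 32 layers × 8 KV heads × 128 head dim × 2 bytes = 128 KiB per token). Weights plus cache therefore come to ~7 GB, and ~9–10 GB is a conservative ceiling once compute buffers and the CUDA context are counted. That leaves at least ~1–2 GB of display headroom on a 12 GB desktop card. Unlike a 16 GB sibling, you do not have room to jump to UD-Q8_K_XL (10.58 GB on disk) with a monitor attached; see Results for the tiers that fit 12 GB.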

Installation

Option A — Ollama (recommended one-line path)

Ollama maintains its own pre-quantized build of Llama 3.1 8B Instruct and handles model download + serving with a single command. Per the Ollama llama3.1:8b tag, the default tag is 4.9 GB at Q4_K_M — essentially the same size and quality tier as Unsloth's UD-Q4_K_XL but using the standard k-quant recipe.

1. Install Ollama

# macOS / Linux
curl -fsSL https://ollama.com/install.sh | sh

(Windows: download from ollama.com/download.) Ollama bundles its own CUDA runtime, so the only host-side requirement is a recent NVIDIA driver with Blackwell sm_120 support (driver 575+ on Linux, or any current Windows driver).

2. Pull and run the 8B model

ollama pull llama3.1:8b
ollama run llama3.1:8b "Explain GQA attention in three sentences."

The first run downloads ~4.9 GB and loads the model into VRAM (resident ~5 GB; KV cache grows with conversation length). Subsequent prompts in the same session stay warm.

Option B — llama.cpp + Unsloth UD-Q4_K_XL GGUF

If you want the specific Unsloth Dynamic 2.0 tier (UD-Q4_K_XL) and explicit control over context size and --n-gpu-layers, drive llama.cpp directly.

1. Install llama.cpp (CUDA 12.8 build)

The RTX 5070 uses Blackwell sm_120 — mainline llama.cpp ships sm_120 kernels, but you need a CUDA 12.8+ build. Pre-built CUDA 12.8 binaries are published on the llama.cpp releases page — pick a *-bin-ubuntu-cuda-12.x-x64.zip asset (Linux) or the matching Windows CUDA build.

# Linux — pre-built CUDA binary
# Download the latest "llama-bXXXX-bin-ubuntu-cuda-12.x-x64.zip" asset from:
#   https://github.com/ggml-org/llama.cpp/releases
# Extract and add the bin/ directory to PATH.

# macOS (Homebrew) — CPU/Metal only, no CUDA, kept here for symmetry with the sibling recipes
brew install llama.cpp

To build from source with CUDA 12.8 support, follow the llama.cpp CUDA build docs and pin the toolkit and arch explicitly:

# Make sure CUDA 12.8 is the active toolkit BEFORE cmake configure step
export PATH=/usr/local/cuda-12.8/bin:$PATH
export CUDAToolkit_ROOT=/usr/local/cuda-12.8

git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build \
  -DGGML_CUDA=ON \
  -DCMAKE_CUDA_ARCHITECTURES=120 \
  -DCUDAToolkit_ROOT=/usr/local/cuda-12.8
cmake --build build --config Release -j $(nproc)

CMAKE_CUDA_ARCHITECTURES=120 builds sm_120 kernels directly, avoiding PTX JIT compilation at first run.

2. Pull the UD-Q4_K_XL GGUF

The fastest path is the llama.cpp Hugging Face shortcut from the Unsloth model card quickstart — llama.cpp will fetch the tagged file directly:

pip install huggingface_hub hf_transfer
huggingface-cli login   # paste a read token; optional for the public Unsloth mirror, required only for the gated meta-llama upstream
llama-server -hf unsloth/Llama-3.1-8B-Instruct-GGUF:UD-Q4_K_XL

For more control (specific local directory, pinned filename), pull only the UD-Q4_K_XL file (~5 GB) via snapshot_download instead of the full repo:

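# download_q4kxl.py
import os

# Set before importing huggingface_hub so the hf_transfer backend is picked up.
os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"

from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="unsloth/Llama-3.1-8B-Instruct-GGUF",
    local_dir="unsloth/Llama-3.1-8B-Instruct-GGUF",
    # Fetch only the UD-Q4_K_XL file instead of every quant tier in the repo.
    allow_patterns=["*UD-Q4_K_XL*"],
)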

python download_q4kxl.py

The resulting file is unsloth/Llama-3.1-8B-Instruct-GGUF/Llama-3.1-8B-Instruct-UD-Q4_K_XL.gguf (4.99 GB per the unsloth model card).
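Before pointing a server at the download, it can help to confirm the file is a complete GGUF and not a truncated or HTML error payload. The sketch below relies only on the GGUF header layout (the 4-byte magic "GGUF" followed by a little-endian uint32 version); inspect_gguf is a hypothetical helper name, and the size comparison against 4.99 GB is left to the caller.

```python
import os
import struct


def inspect_gguf(path):
    """Return (gguf_version, size_in_gb) or raise ValueError if the magic is wrong."""
    with open(path, "rb") as f:
        header = f.read(8)
    # A valid GGUF file starts with the ASCII bytes "GGUF".
    if len(header) < 8 or header[:4] != b"GGUF":
        raise ValueError(f"{path} does not start with the GGUF magic bytes")
    # The next four bytes are the format version as a little-endian uint32.
    (version,) = struct.unpack("<I", header[4:8])
    size_gb = os.path.getsize(path) / 1e9  # decimal GB, as used in this recipe
    return version, size_gb
```

For example, inspect_gguf("unsloth/Llama-3.1-8B-Instruct-GGUF/Llama-3.1-8B-Instruct-UD-Q4_K_XL.gguf") should report a size close to the 4.99 GB listed on the model card; a much smaller number suggests an interrupted download.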

3. Start the server

llama-server \
  --model unsloth/Llama-3.1-8B-Instruct-GGUF/Llama-3.1-8B-Instruct-UD-Q4_K_XL.gguf \
  --ctx-size 16384 \
  --n-gpu-layers 99 \
  --host 0.0.0.0 --port 8080

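--n-gpu-layers 99 offloads every layer to the 5070; the 12 GB envelope is enough to keep the whole model resident at Q4_K_XL, so no layers need to stay on the CPU. --ctx-size 16384 sets a 16K context window (see Troubleshooting for guidance on context limits on a 12 GB card). --host 0.0.0.0 binds the server to every network interface, so other machines on your network can reach it; use --host 127.0.0.1 if only local clients need access.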

Option C — LM Studio (GUI)

LM Studio's built-in catalog search ("Llama 3.1 8B Instruct GGUF") will surface both the Unsloth UD-Q4_K_XL build and the bartowski standard-quant ladder. Pick Llama-3.1-8B-Instruct-UD-Q4_K_XL from the Unsloth repo and download — same file as Option B. LM Studio's loader will set --n-gpu-layers to "max" automatically for a 5070 once it recognizes the Blackwell card.

Running

One-shot prompt via the llama.cpp HTTP server

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama-3.1-8b-instruct",
    "messages": [{"role": "user", "content": "Write a haiku about Blackwell GPUs."}]
  }'

The llama.cpp llama-server binary exposes an OpenAI-compatible /v1/chat/completions endpoint on the port chosen above.
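The same request can be sent from Python with only the standard library. This is a minimal sketch that assumes the server from the previous step is listening on localhost:8080; build_chat_request and ask are hypothetical helper names, and max_tokens=256 is an arbitrary illustrative cap.

```python
import json
import urllib.request


def build_chat_request(prompt, base_url="http://localhost:8080", max_tokens=256):
    """Build a POST request for the OpenAI-compatible chat endpoint."""
    payload = {
        "model": "llama-3.1-8b-instruct",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,  # illustrative cap, adjust as needed
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt, **kwargs):
    """Send the prompt and return the assistant's reply text."""
    request = build_chat_request(prompt, **kwargs)
    with urllib.request.urlopen(request, timeout=120) as response:
        body = json.load(response)
    # OpenAI-style responses carry the text in choices[0].message.content.
    return body["choices"][0]["message"]["content"]
```

With the server running, print(ask("Write a haiku about Blackwell GPUs.")) mirrors the curl call above. Because the endpoint follows the OpenAI schema, an OpenAI-compatible client library pointed at the same base URL should work as well.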

Interactive terminal

llama-cli \
  --model unsloth/Llama-3.1-8B-Instruct-GGUF/Llama-3.1-8B-Instruct-UD-Q4_K_XL.gguf \
  --ctx-size 16384 \
  --n-gpu-layers 99 \
  --interactive

Press Ctrl-C to interrupt generation; the CLI keeps the model warm in VRAM until exit.

Stepping up a quant tier on 12 GB

UD-Q4_K_XL (4.99 GB) is the comfortable default. If you want a little more fidelity and your context needs are modest, UD-Q6_K_XL (7.33 GB on disk per the unsloth tier table) is the realistic ceiling on a 12 GB card: at ~7.3 GB resident plus a smaller KV budget it can stay under the usable envelope if you cap context at 8K–16K and run headless or close the desktop compositor. Fetch it with allow_patterns=["*UD-Q6_K_XL*"] in the snapshot_download script above. UD-Q8_K_XL (10.58 GB) does not fit a 12 GB card with a display: the weights alone take essentially all of the ~10.5–11.3 GB usable envelope, leaving no room for the KV cache and compute buffers. Reserve it for 16 GB+ siblings.
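The tier trade-off can be sketched as a rough budget: usable VRAM minus weights minus runtime overhead, divided by the KV cache cost per token. The tier sizes below are the on-disk figures quoted in this recipe. The per-token cost assumes an f16 cache with 32 layers, 8 KV heads, and a head dimension of 128, and overhead_gb=1.0 is an unmeasured placeholder for compute buffers and the CUDA context, so treat the output as an estimate and not a measurement.

```python
# On-disk sizes in decimal GB, as quoted in this recipe.
TIER_SIZES_GB = {
    "UD-Q4_K_XL": 4.99,
    "UD-Q6_K_XL": 7.33,
    "UD-Q8_K_XL": 10.58,
}

# K and V tensors x 32 layers x 8 KV heads x 128 head dim x 2 bytes (f16, assumed).
KV_BYTES_PER_TOKEN = 2 * 32 * 8 * 128 * 2


def max_context_tokens(tier, usable_gb, overhead_gb=1.0):
    """Estimate how many context tokens fit after weights and assumed overhead."""
    free_bytes = (usable_gb - TIER_SIZES_GB[tier] - overhead_gb) * 1e9
    # A negative budget means the tier does not fit at all.
    return max(0, int(free_bytes // KV_BYTES_PER_TOKEN))


def report(usable_gb):
    """Print the estimated context ceiling for each tier (hypothetical helper)."""
    for tier in TIER_SIZES_GB:
        print(f"{tier}: ~{max_context_tokens(tier, usable_gb)} tokens")
```

Under these assumptions, report(10.5) puts UD-Q6_K_XL at roughly a 16K ceiling on a desktop with a monitor attached, while UD-Q8_K_XL returns 0 even at 11.3 GB usable.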

Results

Speed: No first-party Llama 3.1 8B measurement on the RTX 5070 exists yet; the backend /check/ page currently reports verdict: unknown with no benchmark rows for this pair. As a same-arch-class proxy, Hardware Corner's RTX 5070 LLM benchmark page measures a comparable dense 8B Q4 model: its Qwen3 8B (Q4_K) row reads 85.8 tok/s token generation and 3,487.7 tok/s prompt processing at 4K context on the RTX 5070. Llama 3.1 8B should land in a roughly similar band (same dense-transformer 8B class, same Q4 tier) modulo per-architecture variance, but this is an extrapolation from a Qwen3 8B measurement, NOT a Llama 3.1 8B number. If you run llama.cpp + UD-Q4_K_XL on your own 5070, please submit your numbers so a Llama-3.1-8B-specific first-party measurement replaces this proxy.
VRAM usage: No first-party measured peak VRAM is in the backend yet. As a derived envelope (derived, not measured): UD-Q4_K_XL weights resident on GPU are 4.99 GB per unsloth's file table; the f16 KV cache for a 16K context on an 8B model with 32 layers, 8 GQA KV heads, and a head dimension of 128 adds ~2 GB (128 KiB per token × 16,384 tokens). Weights plus cache total ~7 GB, so ~9–10 GB is a conservative ceiling after compute buffers and the CUDA context, inside the 12 GB card's ~10.5–11.3 GB usable envelope. Community measurement of the actual resident peak will replace the derived envelope when it lands via /contribute.
Quality notes: UD-Q4_K_XL is the Unsloth mixed-precision GGUF tier; the Unsloth Dynamic 2.0 docs discuss per-layer sensitivity-aware bit-allocation across the family. On a 12 GB 5070 the practical quality ceiling is UD-Q6_K_XL (7.33 GB) with a capped context, or Q6_K standard (6.60 GB per the unsloth file table) — there's no quality-floor reason to run anything below Q4_K_M on this hardware. BF16 full precision (16.07 GB on disk) overflows the 12 GB card by a wide margin and isn't an option here without heavy offload.

For the full benchmark data and cross-GPU comparisons (5070 Ti / 5080 / 4080 siblings), see /check/llama-3-1-8b/rtx-5070.

Troubleshooting

`huggingface-cli` 401 / 403 on the Unsloth GGUF repo

A 401 / 403 almost always means your huggingface-cli download (or loader) is pointing at the gated upstream meta-llama/Llama-3.1-8B-Instruct repo rather than the public mirror. The Unsloth GGUF unsloth/Llama-3.1-8B-Instruct-GGUF is public and ungated — it downloads with no "Agree and access" step. Confirm the repo id in your command targets the Unsloth mirror (a plain huggingface-cli login is enough); you only need to click "Agree and access" on Meta’s terms if you specifically pull the upstream meta-llama/… weights. The full license terms are at github.com/meta-llama/llama-models.

Driver too old — Ollama silently falls back to CPU

The RTX 5070 uses Blackwell sm_120. A driver or bundled CUDA runtime that predates CUDA 12.8 lacks the kernels, so Ollama silently falls back to CPU inference, which appears as a hang or single-digit tok/s. Confirm a CUDA 12.8+ capable driver is installed (nvidia-smi should report driver 575+ on Linux), then update or reinstall Ollama. The same advice applies to llama.cpp: use a cuda-12.8 release binary, not an older one.
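One way to tell a CPU fallback from a healthy GPU run is to measure generation speed directly. The sketch below assumes Ollama's REST API on its default port 11434 and uses the eval_count and eval_duration fields (the latter in nanoseconds) that a non-streaming /api/generate response includes; tokens_per_second and measure_ollama are hypothetical helper names.

```python
import json
import urllib.request


def tokens_per_second(response):
    """Compute generation speed from an Ollama /api/generate response dict."""
    # eval_duration is reported in nanoseconds.
    return response["eval_count"] / (response["eval_duration"] / 1e9)


def measure_ollama(prompt, model="llama3.1:8b", base_url="http://localhost:11434"):
    """Run one non-streaming generation and return its tok/s (hypothetical helper)."""
    payload = json.dumps({"model": model, "prompt": prompt, "stream": False})
    request = urllib.request.Request(
        f"{base_url}/api/generate",
        data=payload.encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=300) as response:
        return tokens_per_second(json.load(response))
```

A single-digit result from measure_ollama("Explain GQA attention in three sentences.") points to the CPU fallback described above. Running ollama ps while the model is loaded is a complementary check, since it lists whether the model sits on the GPU or the CPU.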

Generation slows down at longer context — and the 12 GB ceiling

Llama 3.1 ships with a 128K-token native context window per the HF model card metadata, but throughput drops as the KV cache fills, and on a 12 GB card the KV cache is the binding constraint long before you reach 128K. The same-class proxy on Hardware Corner's RTX 5070 LLM benchmark page — the Qwen3 8B Q4_K row — degrades from 85.8 tok/s at 4K to 59.1 tok/s at 16K to 43.6 tok/s at 32K; expect Llama 3.1 8B to follow a similar curve. On a 12 GB 5070, keep --ctx-size at 16K or below for the Q4_K_XL build (32K is feasible but eats into display headroom); a full 128K KV cache alone consumes well over 12 GB and is not possible on this card. For long-doc workflows, use chunking + retrieval rather than a giant context window.
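The scaling behind that ceiling is linear in context length and can be sketched in a few lines. The defaults assume Llama 3.1 8B's attention shape (32 layers, 8 KV heads, head dimension 128) and an f16 cache at 2 bytes per element; these are assumptions about the model config and runtime defaults, and compute buffers are not included.

```python
def kv_cache_bytes(n_ctx, n_layers=32, n_kv_heads=8, head_dim=128, bytes_per_element=2):
    """Estimate KV cache size in bytes for a given context length."""
    # The leading 2 accounts for the separate K and V tensors in every layer.
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_element * n_ctx


def to_gib(n_bytes):
    """Convert bytes to GiB (2**30 bytes)."""
    return n_bytes / 2**30


def summarize(context_lengths):
    """Return {context_length: size_in_GiB} for a list of context lengths."""
    return {n_ctx: to_gib(kv_cache_bytes(n_ctx)) for n_ctx in context_lengths}
```

Under these assumptions, summarize([32768, 131072]) gives 4 GiB at 32K and 16 GiB at 128K, which is why the full native window cannot fit on a 12 GB card regardless of the weight quantization.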

Want a different runtime — vLLM or SGLang?

The Meta canonical HF model card documents vllm serve "meta-llama/Llama-3.1-8B-Instruct" and python3 -m sglang.launch_server --model-path "meta-llama/Llama-3.1-8B-Instruct" — both load BF16 weights (16.07 GB on disk per the unsloth tier table) rather than the GGUF quantization. BF16 alone exceeds the 12 GB 5070's VRAM, so the vLLM/SGLang BF16 path is not viable on this card — stick to the llama.cpp / Ollama GGUF path. Reserve the BF16 serving path for 24 GB+ cards (see the 4090 and 5090 siblings).

Standard Q4_K_M instead of Unsloth's UD-Q4_K_XL?

Both load with the same llama.cpp binary; only the quantization recipe differs. bartowski/Meta-Llama-3.1-8B-Instruct-GGUF ships the standard k-quant ladder if you prefer the conventional flavor — file sizes are nearly identical (Q4_K_M = 4.92 GB per the unsloth tier table). Throughput will be close to but not identical to Unsloth's UD-Q4_K_XL because the per-layer bit-allocation differs. Ollama's llama3.1:8b default tag is also standard Q4_K_M (4.9 GB per the Ollama library page).

FlashAttention 2 errors with `transformers`

If you bypass Ollama / llama.cpp and run the HF model card's transformers quickstart directly, do not add attn_implementation="flash_attention_2" — FA2 wheels don't ship sm_120 kernels as of mid-2026 (Dao-AILab/flash-attention#2168). Either omit the argument (PyTorch picks SDPA automatically) or set attn_implementation="sdpa" explicitly. This caveat is moot for the recommended GGUF path above — llama.cpp and Ollama don't depend on FlashAttention.