
Llama 3.1 8B on RTX 4080 SUPER: Local Chat via Ollama or llama.cpp + Unsloth UD-Q4_K_XL GGUF

llm · beginner · 10GB+ VRAM · Jun 2, 2026
Prerequisites
  • NVIDIA RTX 4080 SUPER (16 GB VRAM) or equivalent Ada Lovelace-class card
  • Recent NVIDIA driver with CUDA 12.x runtime (the default stable wheels already include Ada sm_89 kernels)
  • ~5 GB free disk for the UD-Q4_K_XL GGUF (or ~16 GB for BF16, ~10.6 GB for UD-Q8_K_XL)
  • llama.cpp, Ollama, or LM Studio installed
  • (Optional — only for the BF16 vLLM/SGLang path) Hugging Face account with approved access to the gated meta-llama/Llama-3.1-8B-Instruct repo. The recommended Ollama and Unsloth GGUF paths are public and need no approval.

What You'll Build

A local Llama 3.1 8B Instruct chat assistant running on an RTX 4080 SUPER (16 GB VRAM) through llama.cpp (or Ollama / LM Studio) with the unsloth/Llama-3.1-8B-Instruct-GGUF UD-Q4_K_XL weights (4.99 GB on disk, Unsloth's mixed-precision Dynamic 2.0 GGUF tier). With a typical runtime peak of ~10 GB, the Q4_K_XL build leaves ~6 GB of the card's 16 GB free: enough to step up to UD-Q6_K_XL / UD-Q8_K_XL for higher fidelity, extend the context window beyond 16K (Llama 3.1 natively supports up to 128K), or colocate a small companion model (a TTS encoder or a 1B-class assistant).

Hardware data: RTX 4080 SUPER (16 GB VRAM) · UD-Q4_K_XL GGUF · ~88 tok/s or better generation (a lower bound taken from a LocalScore run on the closely related RTX 4080; the SUPER should be slightly faster, see Results) · See benchmark data

⚠️ Quant pinned — Unsloth UD-Q4_K_XL. This recipe targets UD-Q4_K_XL from the unsloth/Llama-3.1-8B-Instruct-GGUF repo specifically — Unsloth's mixed-precision GGUF tier featured in their Dynamic 2.0 documentation, with per-layer sensitivity-aware bit-allocation. Standard Q4_K_M from other publishers (bartowski/Meta-Llama-3.1-8B-Instruct-GGUF, the Ollama default) loads with the same llama.cpp binary, but the per-layer recipe and resulting quality/speed profile are different — see Troubleshooting if you prefer the conventional flavor.

ℹ️ Access: the two recommended paths are public; only the canonical Meta repo is gated. This recipe's install paths need no Meta approval: Ollama's llama3.1:8b and the unsloth/Llama-3.1-8B-Instruct-GGUF mirror are both ungated (gated: false on the Hugging Face API) and download anonymously with no token. Only the canonical meta-llama/Llama-3.1-8B-Instruct repo is gated (gated: manual), and you need it solely for the optional BF16 vLLM/SGLang path (see Troubleshooting): submit Meta's "Access Llama 3.1" form on the model page, wait for approval (usually fast), then run huggingface-cli login with a read token. Gating and license are separate. The weights are released under the Llama 3.1 Community License, which permits commercial use unless your products or services had more than 700 million monthly active users in the calendar month preceding the Llama 3.1 release date (per the Llama 3.1 license); above that threshold you must request a separate license from Meta.
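The gated: false / gated: manual values mentioned above can be read programmatically. The sketch below is a minimal illustration, assuming the public model-info endpoint at https://huggingface.co/api/models/<repo_id> answers anonymous requests and exposes a "gated" field; the helper names gating_status and needs_approval are hypothetical and not part of any library.

```python
import json
import urllib.request


def gating_status(repo_id, timeout=10):
    """Fetch the model-info JSON for repo_id and return its "gated" field.

    ASSUMPTION: the endpoint is reachable without a token and the field is
    False for open repos or a string such as "auto" / "manual" for gated ones.
    """
    url = f"https://huggingface.co/api/models/{repo_id}"
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.load(response).get("gated", False)


def needs_approval(gated_value):
    """Return True when the repo requires an access request before download."""
    return gated_value in ("auto", "manual")
```

With network access, needs_approval(gating_status("unsloth/Llama-3.1-8B-Instruct-GGUF")) would be expected to return False if the repo is still ungated when you run it.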

Requirements

Component | Minimum | Tested
GPU | 8 GB VRAM (UD-Q4_K_XL fits) | RTX 4080 SUPER (16 GB)
RAM | 16 GB system | -
Storage | 4.99 GB (UD-Q4_K_XL GGUF) per unsloth/Llama-3.1-8B-Instruct-GGUF | -
Driver | CUDA 12.x runtime (Ada sm_89, default stable wheels) | -
Runtime | llama.cpp / Ollama / LM Studio | llama.cpp b9247+

The 4080 SUPER's 16 GB is comfortable for Q4_K_XL — weights resident on GPU are ~5 GB and the KV cache for a 16K context adds another ~4 GB, putting runtime peak around ~10 GB. You have ~6 GB of headroom to either jump to a heavier quant tier (UD-Q6_K_XL at 7.33 GB on disk, UD-Q8_K_XL at 10.58 GB) or stretch to longer context windows — see Results for the throughput-vs-context tradeoff.
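The headroom arithmetic above can be written out as a rough budget check. This is a sketch under stated assumptions, not a measurement: the weight sizes and the ~4 GB KV allowance are the figures quoted in this recipe, while the 1 GB overhead term (CUDA context plus compute buffers) is an illustrative guess.

```python
# On-disk sizes in GB, as quoted in this recipe.
WEIGHTS_GB = {
    "UD-Q4_K_XL": 4.99,
    "UD-Q6_K_XL": 7.33,
    "UD-Q8_K_XL": 10.58,
    "BF16": 16.07,
}
VRAM_GB = 16.0       # RTX 4080 SUPER
KV_BUDGET_GB = 4.0   # the recipe's allowance for a 16K context
OVERHEAD_GB = 1.0    # ASSUMPTION: CUDA context + compute buffers


def headroom_gb(weights_gb, kv_gb=KV_BUDGET_GB, overhead_gb=OVERHEAD_GB, vram_gb=VRAM_GB):
    """Return the VRAM left after weights, KV cache, and overhead.

    A negative result means the configuration does not fit on the card.
    """
    return vram_gb - (weights_gb + kv_gb + overhead_gb)


if __name__ == "__main__":
    for tier, size in WEIGHTS_GB.items():
        left = headroom_gb(size)
        verdict = "fits" if left >= 0 else "does not fit"
        print(f"{tier:<11} weights {size:5.2f} GB -> headroom {left:+6.2f} GB ({verdict})")
```

Under these assumptions UD-Q4_K_XL leaves about 6 GB free, which matches the headroom quoted above. UD-Q8_K_XL still fits, but only if the KV allowance and overhead stay within this budget.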

Unlike Blackwell GPUs (sm_120), the RTX 4080 SUPER is Ada Lovelace (sm_89) and needs no special wheel selection — the default CUDA 12.x stable builds of llama.cpp, Ollama, and PyTorch already ship sm_89 kernels.

Installation

Option A — Ollama (recommended one-line path)

Ollama maintains its own pre-quantized build of Llama 3.1 8B Instruct and handles model download + serving with a single command. Per the Ollama llama3.1:8b tag, the default tag is 4.9 GB at Q4_K_M — essentially the same size and quality tier as Unsloth's UD-Q4_K_XL but using the standard k-quant recipe.

1. Install Ollama

# macOS / Linux
curl -fsSL https://ollama.com/install.sh | sh

(Windows: download from ollama.com/download.) Ollama bundles its own CUDA runtime, so the only host-side requirement is a recent NVIDIA driver — no special version is needed for the Ada-class RTX 4080 SUPER.

2. Pull and run the 8B model

ollama pull llama3.1:8b
ollama run llama3.1:8b "Explain GQA attention in three sentences."

The first run downloads ~4.9 GB and loads the model into VRAM (resident ~5 GB; KV cache grows with conversation length). Subsequent prompts in the same session stay warm.
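Once the model is pulled, the same Ollama instance can also be driven from code. The sketch below assumes Ollama is listening on its default local port 11434 and uses its /api/chat endpoint with streaming disabled; build_chat_payload and chat are illustrative helper names, not part of Ollama.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/chat"  # Ollama's default local address


def build_chat_payload(prompt, model="llama3.1:8b"):
    """Build a non-streaming /api/chat request body for a single user turn."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,  # ask for one JSON object instead of a token stream
    }


def chat(prompt, model="llama3.1:8b", url=OLLAMA_URL, timeout=120):
    """Send one prompt to a running Ollama server and return the reply text."""
    body = json.dumps(build_chat_payload(prompt, model)).encode("utf-8")
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        reply = json.load(response)
    return reply["message"]["content"]
```

With the server running, print(chat("Explain GQA attention in three sentences.")) should produce the same kind of answer as the ollama run command above.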

Option B — llama.cpp + Unsloth UD-Q4_K_XL GGUF

If you want the specific Unsloth Dynamic 2.0 tier (UD-Q4_K_XL) and explicit control over context size and --n-gpu-layers, drive llama.cpp directly.

1. Install llama.cpp (CUDA build)

The RTX 4080 SUPER is Ada Lovelace sm_89 — mainline llama.cpp ships sm_89 kernels in its default CUDA releases, so any current CUDA build works. Pre-built CUDA binaries are published on the llama.cpp releases page — pick a *-bin-ubuntu-cuda-x64.zip asset (Linux) or the matching Windows CUDA build.

# Linux — pre-built CUDA binary
# Download the latest "llama-bXXXX-bin-ubuntu-cuda-x64.zip" asset from:
#   https://github.com/ggml-org/llama.cpp/releases
# Extract and add the bin/ directory to PATH.

# macOS (Homebrew) — CPU/Metal only, no CUDA, kept here for symmetry with the sibling recipes
brew install llama.cpp

To build from source with CUDA support, follow the llama.cpp CUDA build docs and target the Ada sm_89 architecture explicitly:

git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build \
  -DGGML_CUDA=ON \
  -DCMAKE_CUDA_ARCHITECTURES=89
cmake --build build --config Release -j $(nproc)

CMAKE_CUDA_ARCHITECTURES=89 builds sm_89 kernels directly for the RTX 4080 SUPER, avoiding PTX JIT compilation at first run.
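If you build on a machine with a different GPU, the architecture value can be derived rather than hard-coded. The sketch below assumes your driver's nvidia-smi supports the compute_cap query field (older drivers may not); the function names are hypothetical.

```python
import subprocess


def cuda_arch_from_compute_cap(compute_cap):
    """Convert a compute capability string such as '8.9' into the CMake form '89'."""
    major, minor = compute_cap.strip().split(".")
    return f"{int(major)}{int(minor)}"


def detect_cuda_archs():
    """Return the sorted, de-duplicated CMake arch values for all visible GPUs.

    ASSUMPTION: nvidia-smi is on PATH and accepts --query-gpu=compute_cap.
    """
    output = subprocess.run(
        ["nvidia-smi", "--query-gpu=compute_cap", "--format=csv,noheader"],
        check=True,
        capture_output=True,
        text=True,
    ).stdout
    return sorted(
        {cuda_arch_from_compute_cap(line) for line in output.splitlines() if line.strip()}
    )
```

On an RTX 4080 SUPER, detect_cuda_archs() would be expected to return ["89"], which you can pass to -DCMAKE_CUDA_ARCHITECTURES.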

2. Pull the UD-Q4_K_XL GGUF

The fastest path is the llama.cpp Hugging Face shortcut from the Unsloth model card — llama.cpp will fetch the tagged file directly:

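# huggingface_hub and hf_transfer are only needed for the Python download script further down;
# llama-server -hf fetches the file on its own.
pip install huggingface_hub hf_transfer
# No login needed: the Unsloth GGUF mirror is public (gated: false).
llama-server -hf unsloth/Llama-3.1-8B-Instruct-GGUF:UD-Q4_K_XL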

For more control (specific local directory, pinned filename), pull only the Q4_K_XL file (~5 GB) via snapshot_download instead of the full repo:

# download_q4kxl.py
import os
os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="unsloth/Llama-3.1-8B-Instruct-GGUF",
    local_dir="unsloth/Llama-3.1-8B-Instruct-GGUF",
    allow_patterns=["*UD-Q4_K_XL*"],
)

Run the script from your shell:

python download_q4kxl.py

The resulting file is unsloth/Llama-3.1-8B-Instruct-GGUF/Llama-3.1-8B-Instruct-UD-Q4_K_XL.gguf (4.99 GB per the unsloth model card).
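Before starting the server, it can be worth confirming that the download completed. The sketch below checks two things: that the file begins with the four-byte GGUF magic, and that its size is near the expected 4.99 GB. The check_gguf helper and the 4.9 GB threshold are illustrative choices, not part of llama.cpp.

```python
from pathlib import Path

GGUF_MAGIC = b"GGUF"  # the first four bytes of a GGUF file


def check_gguf(path, min_size_gb=0.0):
    """Return (ok, message) after checking the magic bytes and the on-disk size."""
    file = Path(path)
    if not file.is_file():
        return False, f"{file} not found"
    with file.open("rb") as handle:
        magic = handle.read(4)
    if magic != GGUF_MAGIC:
        return False, f"unexpected magic bytes {magic!r}; incomplete download or not a GGUF file"
    size_gb = file.stat().st_size / 1e9
    if size_gb < min_size_gb:
        return False, f"file is only {size_gb:.2f} GB, expected at least {min_size_gb:.2f} GB"
    return True, f"looks like a GGUF file, {size_gb:.2f} GB"


if __name__ == "__main__":
    ok, message = check_gguf(
        "unsloth/Llama-3.1-8B-Instruct-GGUF/Llama-3.1-8B-Instruct-UD-Q4_K_XL.gguf",
        min_size_gb=4.9,  # ASSUMPTION: a little under the 4.99 GB quoted above
    )
    print("OK:" if ok else "PROBLEM:", message)
```

This only catches truncated or mislabeled files. It does not verify checksums or tensor contents.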

3. Start the server

llama-server \
  --model unsloth/Llama-3.1-8B-Instruct-GGUF/Llama-3.1-8B-Instruct-UD-Q4_K_XL.gguf \
  --ctx-size 16384 \
  --n-gpu-layers 99 \
  --host 0.0.0.0 --port 8080

--n-gpu-layers 99 offloads every layer to the 4080 SUPER (16 GB is enough to keep the whole model resident at Q4_K_XL, so layer streaming is unnecessary). --ctx-size 16384 sets a 16K context window; see Troubleshooting for guidance on pushing context higher. --host 0.0.0.0 makes the server listen on all network interfaces, so use --host 127.0.0.1 instead if only the local machine should reach it.
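Loading the weights takes a few seconds, so scripts that start the server and then send requests usually need to wait for it. A minimal sketch, assuming llama-server's /health endpoint returns HTTP 200 once the model is loaded and an error status while it is still loading; the helper names and timings are illustrative.

```python
import time
import urllib.error
import urllib.request


def server_ready(base_url="http://localhost:8080", timeout=2.0):
    """Return True if GET /health answers with HTTP 200, False otherwise."""
    try:
        with urllib.request.urlopen(f"{base_url}/health", timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or a non-200 status raised as HTTPError.
        return False


def wait_until_ready(base_url="http://localhost:8080", max_wait_s=120.0, poll_s=1.0):
    """Poll /health until the server is ready or max_wait_s has elapsed."""
    deadline = time.monotonic() + max_wait_s
    while time.monotonic() < deadline:
        if server_ready(base_url):
            return True
        time.sleep(poll_s)
    return False
```

Calling wait_until_ready() right after launching llama-server returns True once requests can be served, or False if the deadline passes first.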

Option C — LM Studio (GUI)

LM Studio's built-in catalog search ("Llama 3.1 8B Instruct GGUF") will surface both the Unsloth UD-Q4_K_XL build and the bartowski standard-quant ladder. Pick Llama-3.1-8B-Instruct-UD-Q4_K_XL from the Unsloth repo and download — same file as Option B. LM Studio's loader will set --n-gpu-layers to "max" automatically once it recognizes the 4080 SUPER.

Running

One-shot prompt via the llama.cpp HTTP server

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama-3.1-8b-instruct",
    "messages": [{"role": "user", "content": "Write a haiku about local LLMs."}]
  }'

The llama.cpp llama-server binary exposes an OpenAI-compatible /v1/chat/completions endpoint on the port chosen above.
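For a chat UI you will usually want tokens as they are generated rather than one final response. The sketch below assumes the endpoint follows the OpenAI streaming convention when "stream" is true: server-sent "data:" lines carrying JSON chunks, terminated by "data: [DONE]". parse_sse_line and stream_chat are hypothetical helper names.

```python
import json
import urllib.request


def parse_sse_line(raw_line):
    """Return the text fragment carried by one SSE line, or None if it has none."""
    line = raw_line.decode("utf-8").strip()
    if not line.startswith("data:"):
        return None  # blank separator lines, comments, and similar
    payload = line[len("data:"):].strip()
    if payload == "[DONE]":
        return None  # end-of-stream sentinel
    chunk = json.loads(payload)
    choices = chunk.get("choices") or []
    if not choices:
        return None
    return choices[0].get("delta", {}).get("content")


def stream_chat(prompt, base_url="http://localhost:8080", timeout=120):
    """Yield reply fragments from /v1/chat/completions as they arrive."""
    body = json.dumps({
        "model": "llama-3.1-8b-instruct",
        "messages": [{"role": "user", "content": prompt}],
        "stream": True,
    }).encode("utf-8")
    request = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        for raw_line in response:
            fragment = parse_sse_line(raw_line)
            if fragment:
                yield fragment
```

With the server running, a loop such as for piece in stream_chat("Write a haiku about local LLMs."): print(piece, end="", flush=True) prints the reply incrementally.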

Interactive terminal

llama-cli \
  --model unsloth/Llama-3.1-8B-Instruct-GGUF/Llama-3.1-8B-Instruct-UD-Q4_K_XL.gguf \
  --ctx-size 16384 \
  --n-gpu-layers 99 \
  --interactive

Press Ctrl-C to interrupt generation; the CLI keeps the model warm in VRAM until exit.

Step up to UD-Q8_K_XL (near-lossless) on this card

The UD-Q8_K_XL build is 10.58 GB on disk per the unsloth tier table; on the 4080 SUPER's 16 GB envelope you still have ~5 GB of headroom for a 16K-token KV cache, which fits the typical chat / coding workload comfortably at near-lossless quality. Use allow_patterns=["*UD-Q8_K_XL*"] in the snapshot_download script above to fetch the Q8 file instead. Expect throughput to drop relative to Q4 because memory bandwidth, not compute, is the binding constraint on transformer token generation.
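The bandwidth argument can be turned into a back-of-envelope estimate. The sketch below uses a simplified model in which every generated token reads all weights once, so throughput scales inversely with weight size. It ignores KV-cache reads and kernel efficiency, and its outputs are estimates, not measurements. The 88.3 tok/s baseline is the RTX 4080 LocalScore figure cited in Results.

```python
def bandwidth_ceiling_tok_s(bandwidth_gb_s, weights_gb):
    """Upper bound on tokens/s if each token streams all weights once from VRAM."""
    return bandwidth_gb_s / weights_gb


def scale_by_weight_size(measured_tok_s, measured_weights_gb, target_weights_gb):
    """Rescale a measured speed to another quant, assuming speed is proportional to 1 / size."""
    return measured_tok_s * measured_weights_gb / target_weights_gb


if __name__ == "__main__":
    # 736 GB/s is the RTX 4080 SUPER bandwidth quoted in Results.
    print(f"Q8 ceiling:  {bandwidth_ceiling_tok_s(736.0, 10.58):.1f} tok/s")
    # Baseline: 88.3 tok/s at Q4_K_M (4.92 GB) on the plain RTX 4080.
    print(f"Q8 estimate: {scale_by_weight_size(88.3, 4.92, 10.58):.1f} tok/s")
```

Under these assumptions the Q8 ceiling is about 70 tok/s and the rescaled estimate is about 41 tok/s, roughly half the Q4 figure. Treat both as order-of-magnitude guidance until real measurements exist.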

Results

  • Speed: No first-party RTX 4080 SUPER measurement exists yet for this pair; the backend /check/ page reports verdict: unknown with no benchmark rows. The nearest proxy is a community LocalScore run on the plain NVIDIA GeForce RTX 4080 (16 GB), which measured 88.3 tok/s generation, 4792 tok/s prompt processing, 279 ms time-to-first-token, and a LocalScore of 1149 for Meta Llama 3.1 8B Instruct at "Q4_K - Medium" (the standard Q4_K_M tier: the same 4.92 GB quant Ollama ships by default, and close in size to this recipe's 4.99 GB UD-Q4_K_XL). The RTX 4080 SUPER shares the Ada Lovelace (sm_89) architecture and the 16 GB GDDR6X configuration, with ~5% more CUDA cores (10,240 vs 9,728) and ~3% more memory bandwidth (736 vs 716.8 GB/s) per NVIDIA's published specs. Token generation is memory-bandwidth-bound, so expect the SUPER to land ~3–5% above this figure, probably nearer the ~3% bandwidth gain than the ~5% core gain. Treat ~88 tok/s as a conservative floor for your card, not a guarantee: it is a single community-submitted figure and may drift by ~1% as more submissions land. (LocalScore also lists two rows labelled "RTX 4080 SUPER" directly, but both are anomalous and not cited here: one reports an impossible "32GB" capacity, whereas the SUPER ships only as a 16 GB card per Hardware Corner, and the other is a single 54.4 tok/s run that is implausibly slower than the plain 4080 on identical work.) If you run llama.cpp + UD-Q4_K_XL on your own 4080 SUPER, please submit your numbers so a backend-ingested first-party measurement can replace this proxy estimate.
  • VRAM usage: No measured figure exists yet (the backend /check/ page reports verdict: unknown with no benchmark rows for this pair), so the following envelope is derived, not measured. UD-Q4_K_XL weights resident on GPU are 4.99 GB per Unsloth's file table; the KV cache for a 16K context on an 8B model with 32 layers and 8 KV heads (GQA) is budgeted at ~4 GB, putting the runtime peak around ~9–10 GB, well inside the 4080 SUPER's 16 GB envelope. The LocalScore run above independently corroborates that the Q4_K tier runs on the closely matched RTX 4080. A community measurement of the actual resident peak will replace the derived envelope when one arrives via /contribute.
  • Quality notes: UD-Q4_K_XL is Unsloth's mixed-precision GGUF tier; the Unsloth Dynamic 2.0 docs describe the per-layer, sensitivity-aware bit allocation used across the family. On a 16 GB 4080 SUPER you can comfortably step up to UD-Q6_K_XL (7.33 GB), UD-Q8_K_XL (10.58 GB), or standard Q6_K (6.60 GB per the Unsloth file table), so there is no reason to accept the quality loss of anything below Q4_K_M on this hardware. BF16 full precision (16.07 GB on disk) overflows the 16 GB card without offload and isn't recommended.
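The ~3–5% range in the speed note follows directly from the spec ratios. The sketch below scales the 88.3 tok/s RTX 4080 figure by each ratio in turn, assuming throughput tracks either memory bandwidth or CUDA core count linearly. Real results will depend on clocks, drivers, and runtime version.

```python
RTX_4080 = {"bandwidth_gb_s": 716.8, "cuda_cores": 9728}
RTX_4080_SUPER = {"bandwidth_gb_s": 736.0, "cuda_cores": 10240}
MEASURED_4080_TOK_S = 88.3  # LocalScore generation speed cited above


def scaled_estimate(measured, baseline, target, key):
    """Scale a measured value by the ratio of one spec between two cards."""
    return measured * target[key] / baseline[key]


if __name__ == "__main__":
    for key in ("bandwidth_gb_s", "cuda_cores"):
        estimate = scaled_estimate(MEASURED_4080_TOK_S, RTX_4080, RTX_4080_SUPER, key)
        gain = 100.0 * (estimate / MEASURED_4080_TOK_S - 1.0)
        print(f"scaled by {key:<15} -> {estimate:.1f} tok/s (+{gain:.1f}%)")
```

Bandwidth scaling gives about 90.7 tok/s and core-count scaling about 92.9 tok/s, which brackets the expected range for the SUPER.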

For the full benchmark data and cross-GPU comparisons (4080 / 3090 Ti / 5080 / 5060 Ti siblings), see /check/llama-3-1-8b/rtx-4080-super.

Troubleshooting

401 / 403 when pulling the BF16 weights from meta-llama

The two recommended paths never hit a gate: Ollama's llama3.1:8b and the unsloth/Llama-3.1-8B-Instruct-GGUF mirror are both public (gated: false) and download with no token. A 401/403 only appears on the optional BF16 vLLM/SGLang path, which pulls from the canonical meta-llama/Llama-3.1-8B-Instruct repo (gated: manual). For that path: (a) submit Meta's "Access Llama 3.1" form on the model page and wait for approval, then (b) huggingface-cli login with a read token. The license terms (Llama 3.1 Community License, 700M-MAU commercial threshold) are at llama.com/llama3_1/license and apply regardless of how you obtain the weights.

Generation slows down at longer context

Llama 3.1 ships with a 128K-token native context window (per Meta's Llama 3.1 announcement), but throughput drops as the KV cache fills. At full 128K the KV cache alone consumes >12 GB and pressures the 4080 SUPER's 16 GB envelope. For long-doc workflows on this card, keep --ctx-size at 32K or below; for longer documents, use chunking + retrieval.
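How the KV cache grows with context can be worked out from the model shape. The sketch below uses the 32 layers and 8 KV heads quoted in Results and makes two assumptions: a head dimension of 128 (4096 hidden size over 32 attention heads, per the published model config) and an FP16 cache at 2 bytes per value. It counts only the raw K and V tensors, not compute buffers or runtime overhead.

```python
N_LAYERS = 32        # transformer layers
N_KV_HEADS = 8       # key/value heads under GQA
HEAD_DIM = 128       # ASSUMPTION: 4096 hidden size / 32 attention heads
BYTES_PER_VALUE = 2  # ASSUMPTION: FP16 KV cache


def kv_cache_bytes(n_tokens, n_layers=N_LAYERS, n_kv_heads=N_KV_HEADS,
                   head_dim=HEAD_DIM, bytes_per_value=BYTES_PER_VALUE):
    """Raw size of the K and V tensors for n_tokens of context, in bytes."""
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value  # 2 = K and V
    return per_token * n_tokens


if __name__ == "__main__":
    for ctx in (16_384, 32_768, 131_072):
        gib = kv_cache_bytes(ctx) / 2**30
        print(f"--ctx-size {ctx:>7}: {gib:5.1f} GiB of KV cache")
```

This gives 2 GiB at 16K, 4 GiB at 32K, and 16 GiB at 128K, which is why the full native window does not fit next to the weights on a 16 GB card. The raw 16K figure is below the ~4 GB allowance used elsewhere in this recipe; the difference is presumably margin for buffers and overhead that this sketch does not model.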

Want a different runtime — vLLM or SGLang?

The Meta canonical HF model card documents vLLM and SGLang serving, but both load BF16 weights (16.07 GB on disk per the unsloth tier table) rather than a GGUF quantization. That footprint sits right at the 4080 SUPER's 16 GB VRAM capacity, so vLLM's KV-cache pre-allocation will trigger an out-of-memory error unless you cap --max-model-len aggressively. For 16 GB consumer cards, the llama.cpp / Ollama GGUF path is the comfortable choice; reserve the BF16 vLLM/SGLang path for 24 GB+ cards (see the 5090 sibling).

Standard Q4_K_M instead of Unsloth's UD-Q4_K_XL?

Both load with the same llama.cpp binary; only the quantization recipe differs. bartowski/Meta-Llama-3.1-8B-Instruct-GGUF ships the standard k-quant ladder if you prefer the conventional flavor — file sizes are nearly identical (Q4_K_M = 4.92 GB per the bartowski tree). Throughput will be close to but not identical to Unsloth's UD-Q4_K_XL because the per-layer bit-allocation differs. Ollama's llama3.1:8b default tag is also standard Q4_K_M (4.9 GB per the Ollama library page).

FlashAttention 2 with transformers

If you bypass Ollama / llama.cpp and run a transformers quickstart directly, the RTX 4080 SUPER (Ada sm_89) has full prebuilt FlashAttention-2 wheel coverage. Unlike on Blackwell (sm_120) cards, you can set attn_implementation="flash_attention_2" and it will work, or simply omit the argument and let transformers fall back to PyTorch SDPA. None of this matters for the recommended GGUF path above: llama.cpp and Ollama do not use the flash-attn Python package.
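One way to make that choice robust is to select the attention backend at runtime. The sketch below returns "flash_attention_2" only when the flash_attn package is importable and otherwise falls back to "sdpa". The helper name is hypothetical, and the usage shown in the trailing comment assumes the standard transformers from_pretrained keyword.

```python
import importlib.util


def pick_attn_implementation(prefer_flash=True):
    """Return "flash_attention_2" if flash_attn is installed, else "sdpa"."""
    if prefer_flash and importlib.util.find_spec("flash_attn") is not None:
        return "flash_attention_2"
    return "sdpa"


if __name__ == "__main__":
    print("attention backend:", pick_attn_implementation())
    # Illustrative usage (not executed here; needs transformers and model access):
    #   model = AutoModelForCausalLM.from_pretrained(
    #       model_id, attn_implementation=pick_attn_implementation())
```

The check only confirms that the package is present; it does not verify that the installed wheel matches your CUDA and PyTorch versions.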