Llama 3.1 8B on RTX 5090: Local Chat via llama.cpp + Unsloth UD-Q4_K_XL GGUF

What You'll Build

A local Llama 3.1 8B Instruct chat assistant running on an RTX 5090 (32 GB VRAM) through llama.cpp with the unsloth/Llama-3.1-8B-Instruct-GGUF UD-Q4_K_XL weights (4.99 GB on disk, Unsloth's mixed-precision Dynamic 2.0 GGUF tier). With ~5 GB of weights resident on the GPU plus roughly 4 GB budgeted for the KV cache and compute buffers at a 16K-token context, the runtime footprint is around 10 GB, leaving ~22 GB of headroom on the 32 GB envelope. The useful question on the 5090 isn't "does it fit" (the runtime footprint fits about three times over); it's "what to do with the unused VRAM": colocating a second model, stepping up to BF16 full precision, or stretching to Llama 3.1's full 128K-token native context.

Hardware data: RTX 5090 (32 GB VRAM) · UD-Q4_K_XL GGUF · ~10 GB derived runtime envelope · See benchmark data

⚠️ Driver / CUDA caveat (Blackwell-specific). Early Blackwell launch drivers (notably 591.86 and 590.48.01) ship with an NVIDIA-acknowledged sharedMemPerBlockOptin overflow bug that causes llama.cpp's MMQ matmul kernel to abort with mmq_x_best=0 on RTX 5090 / RTX 5080 — community-tracked at ggml-org/llama.cpp#23385 (open, no maintainer fix merged at time of writing) and #22499. Symptom: throughput collapses from the expected ~190-210 tok/s range to ~65 tok/s, or the server crashes mid-prompt. Mitigation: update past the early launch drivers, build llama.cpp with CUDA Toolkit 12.8 (not 13.x — see Troubleshooting), and avoid the -DGGML_CUDA_MMQ=OFF workaround if you can.

⚠️ Quant pinned — Unsloth UD-Q4_K_XL. This recipe targets UD-Q4_K_XL from the unsloth/Llama-3.1-8B-Instruct-GGUF repo specifically — Unsloth's mixed-precision GGUF tier featured in their Dynamic 2.0 benchmark comparisons, with per-layer sensitivity-aware bit-allocation. Standard Q4_K_M from other publishers (bartowski, TheBloke) will load with the same llama.cpp binary, but the per-layer recipe and resulting quality/speed profile are different — see Troubleshooting if you prefer the conventional flavor.

ℹ️ Gated model — Meta access form required. The canonical meta-llama/Llama-3.1-8B-Instruct repo and the derived unsloth/Llama-3.1-8B-Instruct-GGUF both require accepting Meta's Llama 3.1 Community License before download. Click "Agree and access" on the model page while logged into HF, then run huggingface-cli login locally with a read token before the steps below. Per the license's Section 2 (Additional Commercial Terms), commercial use is permitted unless your products or services exceed 700 million monthly active users at the release date.

Requirements

Component	Minimum	Tested
GPU	8 GB VRAM (UD-Q4_K_XL fits)	RTX 5090 (32 GB)
RAM	16 GB system	—
Storage	4.99 GB (UD-Q4_K_XL GGUF) per unsloth/Llama-3.1-8B-Instruct-GGUF	—
Driver	NVIDIA driver newer than 591.86 (Windows) / 590.48.01 (Linux); those early Blackwell drivers misreport sharedMemPerBlockOptin	—
CUDA	CUDA Toolkit 12.8 for build (not 13.x)	—
Runtime	llama.cpp / Ollama / LM Studio	llama.cpp b9247+

The 5090 is heavily over-provisioned for the UD-Q4_K_XL build: weights resident on GPU are 4.99 GB, and the KV cache plus compute buffers for a 16K-token context add up to ~4 GB, so the runtime peak is around ~10 GB on a 32 GB envelope. That's roughly 3× over-provisioned at a 16K context, and the card still has on the order of 10 GB free at Llama 3.1's full 128K native context (see the derived envelope in Results).

Spending the headroom — what 22 GB of spare VRAM enables on the 5090

A 32 GB card is over six times the size of the resident Q4 weights, so the useful question is not "does it fit" (it fits trivially) but "what to do with the unused VRAM." The 5090 is the only consumer NVIDIA card with more than 24 GB, so the colocation angle is materially richer than on the 4090 / 3090 siblings. Concrete options, grounded in our GPU check pages:

Colocate a second 8B-class LLM for a multi-model server. A 5090 fits two independent ~10 GB Q4 stacks side-by-side. Pair Llama 3.1 8B Q4_K_XL with Qwen3-8B (~6 GB weights at Q4_K_M + KV) for a quality + speed comparison server, or with DeepSeek-R1-Distill-Qwen-14B Q4_K_M (~9 GB weights + reasoning KV margin) for a chat + reasoning split. Total runtime ~20-22 GB; KV-cache headroom still comfortable.
Multi-vertical pipeline on one card. Llama 3.1 8B Q4 + Qwen-Image FP8 (~21 GB peak) — text-chat + image-gen on the same card, tight but workable for batch-1 inference. Or Llama 3.1 8B Q4 + Kokoro 82M TTS (~1 GB) for a chat + voice agent with ~21 GB free for a third model.
Step up to BF16 full precision with comfortable margins. The BF16 build is 16.1 GB on disk per the unsloth tier table; on the 5090's 32 GB envelope you have ~16 GB of KV-cache + activation headroom even at 32K context — useful when downstream quality matters more than peak throughput (see Results for the memory-bandwidth caveat).
Run UD-Q8_K_XL at full context. The UD-Q8_K_XL build is 10.6 GB on disk per the unsloth tier table; the 5090's 32 GB lets you keep 20+ GB free for KV cache, which means Llama 3.1's full 128K-token context fits without offload at Q8 quality. On 3090/4090 the 24 GB ceiling forced you to choose between 8-bit quality OR long context; on the 5090 you get both.
Colocations to be aware of. Llama 3.1 8B Q4 + Qwen3-32B UD-Q6_K_XL (~29 GB) does not fit (~39 GB total against 32 GB); Llama 3.1 8B Q4 + Qwen3-32B AWQ-INT4 (~22 GB) is tight (32 GB total, so it fits only with discipline on context length and will OOM at the default 32K). Use /check/<other-model>/rtx-5090 to validate any candidate companion model's envelope before colocating.
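The fit checks in the list above are simple additions against the 32 GB envelope, and they can be sketched as a small budget helper. The envelope figures are the approximate ones quoted in this section, not measurements; the 2 GB safety margin and the function name are assumptions made for illustration only.

```python
# Rough VRAM budget check for colocating a second model next to Llama 3.1 8B Q4.
# All sizes are the approximate envelopes quoted in this recipe, in GB.
VRAM_GB = 32.0
LLAMA_Q4_RUNTIME_GB = 10.0  # ~5 GB weights plus KV cache at a 16K context
MARGIN_GB = 2.0             # hypothetical safety margin, not from the recipe

COMPANIONS_GB = {
    "Kokoro 82M TTS": 1.0,
    "Qwen-Image FP8": 21.0,
    "Qwen3-32B AWQ-INT4": 22.0,
    "Qwen3-32B UD-Q6_K_XL": 29.0,
}


def colocation_report(companion_gb, base_gb=LLAMA_Q4_RUNTIME_GB,
                      vram_gb=VRAM_GB, margin_gb=MARGIN_GB):
    """Return (total_gb, free_gb, verdict) for one companion model."""
    total = base_gb + companion_gb
    free = vram_gb - total
    if free < 0:
        verdict = "does not fit"
    elif free < margin_gb:
        verdict = "tight"
    else:
        verdict = "fits"
    return total, free, verdict


for name, gb in COMPANIONS_GB.items():
    total, free, verdict = colocation_report(gb)
    print(f"{name:22s} total {total:4.1f} GB, free {free:+5.1f} GB -> {verdict}")
```

Swap in the envelope reported by the relevant /check page for any other companion model; the verdict is only as good as the input figures.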

Installation

Option A — llama.cpp + Unsloth GGUF (recommended path)

This is the canonical CUDA-accelerated llama.cpp loader for an 8B GGUF on the 5090. The 5090's Blackwell sm_120 compute capability requires CUDA Toolkit 12.8 at build time (not 13.x, which currently has known issues — see Troubleshooting). Mainline llama.cpp does ship sm_120 kernels, but the early-2026 Blackwell driver bug (Issue #23385) means you need a recent driver AND a 12.8-toolkit build to hit the expected ~190-210 tok/s throughput band.

1. Install llama.cpp (CUDA 12.8 build)

Pre-built CUDA 12.8 binaries published on the llama.cpp releases page work directly on the 5090 — pick a *-bin-ubuntu-cuda-12.x-x64.zip asset (Linux) or the matching Windows CUDA build.

# Linux — pre-built CUDA binary
# Download the latest "llama-bXXXX-bin-ubuntu-cuda-12.x-x64.zip" asset from:
#   https://github.com/ggml-org/llama.cpp/releases
# Extract and add the bin/ directory to PATH.

# macOS (Homebrew) — CPU/Metal only, no CUDA, kept here for symmetry with the sibling 3090/4090 recipes
brew install llama.cpp

To build from source with CUDA 12.8 support, follow the llama.cpp CUDA build docs and pin the toolkit explicitly:

# Make sure CUDA 12.8 is the active toolkit before the cmake configure step
export PATH=/usr/local/cuda-12.8/bin:$PATH
export CUDAToolkit_ROOT=/usr/local/cuda-12.8

git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build \
  -DGGML_CUDA=ON \
  -DCMAKE_CUDA_ARCHITECTURES=120 \
  -DCUDAToolkit_ROOT=/usr/local/cuda-12.8
cmake --build build --config Release -j $(nproc)

CMAKE_CUDA_ARCHITECTURES=120 builds sm_120 kernels directly, avoiding PTX JIT compilation at first run (per the Zenn benchmark write-up by toki_mwc which traced a 5-6× prompt-processing regression to the wrong toolkit choice).

2. Download the UD-Q4_K_XL GGUF

The fastest path is the one-liner from the Unsloth model card quickstart — llama.cpp will fetch the tagged file directly:

huggingface-cli login   # paste a read token; required for the gated upstream
llama-server -hf unsloth/Llama-3.1-8B-Instruct-GGUF:UD-Q4_K_XL

For more control (specific local directory, pinned filename), pull only the UD-Q4_K_XL file (~5 GB) via snapshot_download:

pip install huggingface_hub hf_transfer

# download_q4kxl.py
import os
os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="unsloth/Llama-3.1-8B-Instruct-GGUF",
    local_dir="unsloth/Llama-3.1-8B-Instruct-GGUF",
    allow_patterns=["*UD-Q4_K_XL*"],
)

python download_q4kxl.py

The resulting file is unsloth/Llama-3.1-8B-Instruct-GGUF/Llama-3.1-8B-Instruct-UD-Q4_K_XL.gguf (4.99 GB per the unsloth model card).
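Before pointing llama-server at the download, it can be worth confirming that the file is a complete GGUF rather than a truncated transfer or an HTML error page. The sketch below relies on the fact that GGUF files begin with the four magic bytes "GGUF" followed by a little-endian 32-bit version number; the helper names are hypothetical, and the size comparison against the 4.99 GB figure is left to the reader.

```python
import os
import struct
import sys

GGUF_MAGIC = b"GGUF"


def parse_gguf_header(header: bytes) -> int:
    """Return the GGUF format version from the first 8 bytes of a file."""
    if len(header) < 8 or header[:4] != GGUF_MAGIC:
        raise ValueError("file does not start with the GGUF magic bytes")
    (version,) = struct.unpack("<I", header[4:8])
    return version


def inspect_gguf(path: str):
    """Return (size_in_gb, gguf_version) for the file at `path`."""
    size_gb = os.path.getsize(path) / 1e9
    with open(path, "rb") as f:
        version = parse_gguf_header(f.read(8))
    return size_gb, version


if __name__ == "__main__" and len(sys.argv) > 1:
    size_gb, version = inspect_gguf(sys.argv[1])
    print(f"{sys.argv[1]}: {size_gb:.2f} GB, GGUF v{version}")
```

Run it with the path from the step above as the only argument; a size far below ~5 GB usually means the download was interrupted.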

3. Start the server

llama-server \
  --model unsloth/Llama-3.1-8B-Instruct-GGUF/Llama-3.1-8B-Instruct-UD-Q4_K_XL.gguf \
  --ctx-size 16384 \
  --n-gpu-layers 99 \
  --host 0.0.0.0 --port 8080

--n-gpu-layers 99 offloads every layer to the 5090 (32 GB is far more than enough to keep the whole 5 GB model resident; layer-streaming is unnecessary). --ctx-size 16384 is a common benchmark setting. On the 5090 you can bump it to 131072 for Llama 3.1's full native context per the HF model card without OOM: the FP16 KV cache at 128K is about 16 GiB for this model, which fits in the ~27 GB left after the weights. Throughput still drops as the cache grows (see Troubleshooting).

Option B — Ollama (one-command alternative)

If you don't care about the precise UD-Q4_K_XL tier, Ollama maintains its own quantized build:

curl -fsSL https://ollama.com/install.sh | sh
ollama pull llama3.1:8b
ollama run llama3.1:8b "Explain GQA attention in three sentences."

Ollama's default llama3.1:8b tag is a standard 4-bit quant (Q4_K_M at roughly 4.9 GB at the time of writing; confirm with ollama show llama3.1:8b), not UD-Q4_K_XL, so the speed will be in the same ballpark but won't match Unsloth's tier exactly. The Ollama llama3.1 library lists alternate quant tags. Ollama bundles its own llama.cpp build and abstracts the CUDA toolkit choice, so the driver caveat above applies but the manual -DCMAKE_CUDA_ARCHITECTURES=120 step does not.

Option C — LM Studio (GUI)

LM Studio's built-in catalog search ("Llama 3.1 8B Instruct GGUF") will surface the unsloth UD-Q4_K_XL build alongside the bartowski standard-quant ladder. Pick Llama-3.1-8B-Instruct-UD-Q4_K_XL from the unsloth repo and download — same file as Option A. LM Studio's loader will set --n-gpu-layers to "max" automatically for a 5090. Same driver caveat: LM Studio bundles its own llama.cpp; if you see ~65 tok/s instead of ~190+, update your NVIDIA driver before troubleshooting further.

Running

One-shot prompt via the llama.cpp HTTP server

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama-3.1-8b-instruct",
    "messages": [{"role": "user", "content": "Write a haiku about Blackwell GPUs."}]
  }'

The llama.cpp llama-server binary exposes an OpenAI-compatible /v1/chat/completions endpoint on the port chosen above.
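Because the endpoint follows the OpenAI chat-completions shape, the same request can be made from Python with only the standard library. This is a minimal sketch that assumes the server from step 3 is listening on localhost:8080; the function names and the max_tokens default are illustrative choices, not part of llama.cpp.

```python
import json
import urllib.request


def build_chat_request(prompt, base_url="http://localhost:8080", max_tokens=128):
    """Build a POST request for llama-server's /v1/chat/completions endpoint."""
    payload = {
        "model": "llama-3.1-8b-instruct",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_reply(response_body):
    """Pull the assistant text out of an OpenAI-style response dict."""
    return response_body["choices"][0]["message"]["content"]


def chat(prompt, **kwargs):
    """Send one prompt to the running server and return the reply text."""
    request = build_chat_request(prompt, **kwargs)
    with urllib.request.urlopen(request, timeout=120) as response:
        return extract_reply(json.load(response))
```

With the server running, chat("Write a haiku about Blackwell GPUs.") returns the generated text as a string.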

Interactive terminal

llama-cli \
  --model unsloth/Llama-3.1-8B-Instruct-GGUF/Llama-3.1-8B-Instruct-UD-Q4_K_XL.gguf \
  --ctx-size 16384 \
  --n-gpu-layers 99 \
  --interactive

Press Ctrl-C to interrupt generation; the CLI keeps the model warm in VRAM until exit.

UD-Q8_K_XL or BF16 — using the 32 GB envelope for quality

The 5090's 32 GB lets you skip the Q4 / Q8 / BF16 trade-off entirely. The UD-Q8_K_XL build is 10.6 GB on disk and BF16 is 16.1 GB on disk per the unsloth tier table, which leaves ~16 GB of KV-cache headroom on the BF16 path at long context (a 24 GB card has only ~8 GB left after the BF16 weights). Swap allow_patterns=["*BF16*"] or allow_patterns=["*UD-Q8_K_XL*"] into the snapshot_download script to fetch the higher-precision file. Throughput will be lower at Q8 and BF16 than at Q4 because transformer inference is memory-bandwidth-bound at small batch sizes and the weights are roughly 2× (Q8) and 3× (BF16) larger, but the 5090's 1,792 GB/s bandwidth keeps both tiers comfortably interactive. The Results section below covers the Q4 measurements.
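To put a rough number on the bandwidth-bound argument, the sketch below divides the card's memory bandwidth by each tier's weight size. It assumes every generated token reads all weights exactly once and ignores KV-cache reads and kernel overhead, so the results are upper bounds for comparing tiers, not predictions of measured tok/s.

```python
# Bandwidth-only decode ceilings per quant tier on the RTX 5090.
# Sizes and bandwidth are the figures quoted in this recipe.
BANDWIDTH_GB_S = 1792.0

TIER_WEIGHTS_GB = {
    "UD-Q4_K_XL": 4.99,
    "UD-Q8_K_XL": 10.6,
    "BF16": 16.1,
}


def decode_ceiling_tok_s(weights_gb, bandwidth_gb_s=BANDWIDTH_GB_S):
    """Upper bound on batch-1 tok/s if each token reads all weights once."""
    return bandwidth_gb_s / weights_gb


q4_gb = TIER_WEIGHTS_GB["UD-Q4_K_XL"]
for tier, gb in TIER_WEIGHTS_GB.items():
    ceiling = decode_ceiling_tok_s(gb)
    relative = q4_gb / gb
    print(f"{tier:11s} ceiling ~{ceiling:4.0f} tok/s ({relative:.2f}x the Q4 ceiling)")
```

Real throughput lands well below these ceilings, but the ratios between tiers are the part that carries over.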

Results

Speed: Three independent third-party measurements converge on a healthy throughput of ~185–213 tok/s for 8B-class Q4 inference on a correctly-built RTX 5090: awesomeagents.ai's Home GPU LLM Leaderboard (updated April 2026) reports Llama 3.1 8B Q4_K_M at ~190 tok/s on the 5090; Runyard's RTX 5090 local-LLM guide (May 13 2026) reports Llama 3.1 8B Q4_K_M at 213 tok/s; and Hardware Corner's RTX 5090 LLM benchmarks measures the close-architecture sibling Qwen3 8B (Q4_K) at 200.4 tok/s @ 4K context (162.3 @ 16K, 129.8 @ 32K, 91.8 @ 64K, 58.8 @ 128K) — Qwen3 8B and Llama 3.1 8B are both 8B-class dense GQA architectures with similar memory-bandwidth profiles. Caveat — LocalScore divergence. LocalScore's accelerator page for the RTX 5090 currently shows Meta Llama 3.1 8B Instruct Q4_K Medium at only 65.1 tok/s generation (6,477 tok/s prompt processing, TTFT 217 ms, LocalScore composite 1260) — this is materially below the other three sources and almost certainly reflects the Blackwell MMQ driver bug tracked at ggml-org/llama.cpp#23385; see the Troubleshooting section for the diagnostic and fix. LocalScore is community-submission-aggregated and the page may shift up as more correctly-built 5090 submissions land. Surfaced via /check/llama-3-1-8b/rtx-5090; if you run the build above on your own correctly-driven 5090, please submit your numbers so the empirical /check page anchors on healthy measurements.
VRAM usage: No measured peak VRAM is in the backend yet for this pair. As a derived envelope (labelled as derived, not measured): UD-Q4_K_XL weights resident on GPU are 4.99 GB per unsloth's file table. With 32 layers, 8 KV heads (GQA), and a head dimension of 128, the FP16 KV cache costs 128 KiB per token, so a 16K context needs ~2 GiB; budgeting ~4 GB for the cache plus compute buffers and CUDA context puts the runtime peak around ~9–10 GB, i.e. less than a third of the 5090's 32 GB envelope. At full 128K context the FP16 cache grows to ~16 GiB (~17 GB), for a total of ~22-23 GB: still ~9-10 GB of headroom on the 5090, whereas the 24 GB cards are essentially full at that point. A measured Llama-3.1-8B-on-5090 number will replace this once community data lands; see /check/llama-3-1-8b/rtx-5090 for the canonical figure.
Quality notes: UD-Q4_K_XL is the Unsloth mixed-precision GGUF tier; the Unsloth Dynamic 2.0 docs discuss per-layer sensitivity-aware bit-allocation across the family. On a 32 GB card the constraint that forces Q4 on 24 GB cards (KV-cache pressure at high context) is gone — step up to UD-Q8_K_XL (10.6 GB) or BF16 (16.1 GB) freely. There's no quality-floor reason to stay at Q4 on this hardware unless you want maximum tok/s headroom for colocation.
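The KV-cache part of the derived envelope can be cross-checked from the architecture. The sketch below takes the 32 layers and 8 KV heads mentioned above and additionally assumes a head dimension of 128 and an unquantized FP16 cache (llama.cpp's default cache type). It counts only the raw cache, so compute buffers and CUDA context overhead come on top; treat the output as a sanity check on the rounded figures, not as a measurement.

```python
# Raw FP16 KV-cache size for Llama 3.1 8B at a given context length.
N_LAYERS = 32
N_KV_HEADS = 8      # grouped-query attention: 8 KV heads
HEAD_DIM = 128      # assumption: 4096 hidden size / 32 attention heads
WEIGHTS_GB = 4.99   # UD-Q4_K_XL, per the unsloth file table


def kv_cache_bytes(n_tokens, bytes_per_element=2):
    """K and V tensors for every layer, KV head, and head dimension."""
    per_token = 2 * N_LAYERS * N_KV_HEADS * HEAD_DIM * bytes_per_element
    return n_tokens * per_token


for ctx in (16_384, 131_072):
    cache = kv_cache_bytes(ctx)
    total_gb = WEIGHTS_GB + cache / 1e9
    print(f"ctx {ctx:>7}: KV cache {cache / 2**30:5.1f} GiB, "
          f"weights + cache ~{total_gb:4.1f} GB")
```

The per-token cost works out to 128 KiB, which is why the cache scales linearly from about 2 GiB at 16K to about 16 GiB at 128K.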

For the full benchmark data and other-GPU comparisons, see /check/llama-3-1-8b/rtx-5090.

Troubleshooting

Throughput stuck at ~60-100 tok/s instead of ~190+ tok/s (Blackwell driver / MMQ bug)

The 5090's expected Llama 3.1 8B Q4 throughput is in the ~185-213 tok/s range (see Results). If you measure ~65 tok/s, or if llama-server aborts mid-prompt with mmq_x_best=0 in the logs, you've hit the early-Blackwell driver bug tracked at ggml-org/llama.cpp#23385. The root cause is documented on NVIDIA's developer forum: early Blackwell launch drivers (591.86 on Windows, 590.48.01 on Linux) return a corrupted sharedMemPerBlockOptin value (either 0 or 0x100000001), which makes llama.cpp's MMQ matmul kernel skip every valid math configuration and either crash or fall back to cuBLAS (much slower). The three-step fix:

Update your NVIDIA driver past the early launch revisions. Anything newer than the launch driver typically reports sharedMemPerBlockOptin correctly. Verify via the NVIDIA control panel or nvidia-smi --query-gpu=driver_version --format=csv.
Build llama.cpp against CUDA Toolkit 12.8, not 13.x. The Zenn benchmark by toki_mwc traces a 5-6× prompt-processing regression to the wrong toolkit; CUDA 13.1 also causes segmentation faults inside the MMQ kernel.
If you can't update the driver right away, fall back temporarily with -DGGML_CUDA_MMQ=OFF at build time (per Issue #22499's community workaround) — runs via cuBLAS, slower than MMQ but doesn't touch the broken driver value. Remove the flag and rebuild once the driver is updated.

This is the most common single cause of "5090 is slower than my 4090" reports on llama.cpp.
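The first step of the fix can be automated with a small check against the two driver versions named above. This is a sketch: it only flags those exact versions and makes no claim about any other release, and the helper names are hypothetical. The nvidia-smi query is the same one quoted in the fix list, with noheader added so the output is just the version string.

```python
import subprocess

# Driver versions this recipe names as affected by the sharedMemPerBlockOptin bug.
KNOWN_BAD_DRIVERS = {"591.86", "590.48.01"}


def is_known_bad(driver_version: str) -> bool:
    """True if the version string is one of the drivers named in this recipe."""
    return driver_version.strip() in KNOWN_BAD_DRIVERS


def query_driver_versions():
    """Ask nvidia-smi for the driver version of each visible GPU."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    )
    return [line.strip() for line in result.stdout.splitlines() if line.strip()]


if __name__ == "__main__":
    try:
        for version in query_driver_versions():
            status = "known-bad, update it" if is_known_bad(version) else "not on the known-bad list"
            print(f"driver {version}: {status}")
    except (FileNotFoundError, subprocess.CalledProcessError):
        print("nvidia-smi is not available on this machine")
```

A clean result here does not rule out the toolkit mismatch from step 2, so check the build as well if throughput is still low.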

`huggingface-cli` 401 / 403 on the Unsloth GGUF repo

The Unsloth quantization inherits gating from the upstream meta-llama/Llama-3.1-8B-Instruct repo. You need to (a) be logged in via huggingface-cli login with a token that has read access, and (b) have clicked "Agree and access" on the upstream Meta repo while logged in — the access carries through to the derived Unsloth mirror. The full license terms are at github.com/meta-llama/llama-models.

Generation slows as context grows past 64K

Llama 3.1 ships with a 128K-token native context window per the HF model card, but throughput drops as the KV cache fills. Hardware Corner's RTX 5090 Qwen3 8B (Q4_K) measurements (a comparable 8B-class dense GQA model) show the pattern across the context ladder: 200.4 tok/s @ 4K → 162.3 @ 16K → 129.8 @ 32K → 91.8 @ 64K → 58.8 @ 128K. The 5090's 32 GB envelope fits the full 128K KV cache for an 8B model without offload (unlike the 24 GB cards, where 128K forces partial-offload behavior), so the slowdown comes from attention itself, not from offload or memory thrashing: each new token has to read and attend over the whole cache, so per-token cost grows predictably with context length. KV-cache quantization (--cache-type-k q8_0 --cache-type-v q8_0 --flash-attn) roughly halves KV memory if you want even more headroom for colocation at long context.
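The "roughly halves" figure for q8_0 follows from the block layout: ggml's q8_0 format stores 32 int8 values plus one FP16 scale, i.e. 34 bytes per 32 elements. The sketch below applies that to the Llama 3.1 8B cache shape, assuming 32 layers, 8 KV heads, and a head dimension of 128; the function name is illustrative.

```python
# KV-cache memory at FP16 vs q8_0 for Llama 3.1 8B.
ELEMENTS_PER_TOKEN = 2 * 32 * 8 * 128   # K and V x layers x KV heads x head dim
F16_BYTES_PER_ELEMENT = 2.0
Q8_0_BYTES_PER_ELEMENT = 34 / 32        # 32 int8 values + one fp16 scale per block


def kv_gib(n_tokens, bytes_per_element):
    """KV-cache size in GiB for a given context length and element width."""
    return n_tokens * ELEMENTS_PER_TOKEN * bytes_per_element / 2**30


for ctx in (16_384, 65_536, 131_072):
    f16 = kv_gib(ctx, F16_BYTES_PER_ELEMENT)
    q8 = kv_gib(ctx, Q8_0_BYTES_PER_ELEMENT)
    print(f"ctx {ctx:>7}: f16 {f16:5.2f} GiB, q8_0 {q8:5.2f} GiB ({q8 / f16:.0%} of f16)")
```

The saving is slightly less than half because of the per-block scale, which is why the text says "roughly".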

Want a different runtime — vLLM or SGLang?

The Meta canonical HF model card documents vllm serve "meta-llama/Llama-3.1-8B-Instruct" and python3 -m sglang.launch_server --model-path "meta-llama/Llama-3.1-8B-Instruct"; both load BF16 weights (16.1 GB) rather than the GGUF quantization. The 5090's 32 GB accommodates BF16 plus generous batch sizes for production-style serving. vLLM preallocates KV cache up to its --gpu-memory-utilization fraction (0.9 by default) and takes --max-model-len from the model config unless told otherwise, so the extra 8 GB over the 24 GB 3090/4090 cards makes startup failures much less likely; if vLLM still reports that the KV cache is too small for the full 128K window, pass a smaller --max-model-len. Expect better throughput under concurrent requests than llama.cpp, at the cost of GPU memory headroom. The llama.cpp GGUF path and the vLLM/SGLang BF16 path are different quant tiers, so their throughput numbers are not directly comparable.

Standard Q4_K_M instead of Unsloth's UD-Q4_K_XL?

Both load with the same llama.cpp binary; only the quantization recipe differs. bartowski/Meta-Llama-3.1-8B-Instruct-GGUF ships the standard k-quant ladder if you prefer the conventional flavor — file sizes are nearly identical (Q4_K_M = 4.92 GB per the unsloth tier table). The Q4_K_M throughput will be close to but not identical to Unsloth's UD-Q4_K_XL because the per-layer bit-allocation differs.

Blackwell vs Ada vs Ampere — what's different for the 5090?

Three architectural differences matter for this recipe:

Driver maturity. Blackwell (sm_120) is the newest of the three architectures, and its early drivers had the sharedMemPerBlockOptin bug discussed above. Ada (sm_89, RTX 40-series) and Ampere (sm_86, RTX 30-series) have been mainline for years and have no equivalent driver pitfall on llama.cpp. If you're cross-shopping, the 5090 will outperform the 4090 here once correctly driven (~213 vs ~96-145 tok/s in published Q4 figures), but the install experience is materially less forgiving in early 2026.
FP8 / FP4 hardware support. Blackwell adds native FP8 (E4M3/E5M2) AND FP4 (microscaling) tensor cores; Ada has FP8; Ampere has neither. For this GGUF recipe — Q4_K integer quant via MMQ — the FP8/FP4 hardware doesn't directly apply (MMQ does integer math via dp4a), but it matters for other models on the same card. See qwen3-14b-on-rtx-5090 (recipe id=127) for the FP8-vLLM path that benefits.
VRAM envelope. 32 GB vs 24 GB on the previous flagship is what unlocks the colocation and full-precision angles in "Spending the headroom" above. For Q4 inference alone of an 8B model, the 5090's extra 8 GB is unused; the meaningful upgrade is what you do with the spare.

Why is throughput memory-bandwidth-bound — what does that mean for the 5090?

Transformer inference at small batch sizes is dominated by reading the weights from VRAM for every generated token, so peak tokens/sec scales with memory bandwidth, not raw FLOPs. The 5090 has 1,792 GB/s of GDDR7 bandwidth vs the 4090's 1,008 GB/s, a 78% increase, and the published Llama 3.1 8B Q4 throughput numbers (~213 tok/s 5090 per Runyard vs ~96-145 tok/s 4090 per multiple sources) are broadly in line with that ratio. Headline tensor-TFLOPS figures do not translate into batch-1 tokens/sec, because compute isn't the bottleneck there. For batched / multi-request serving, each weight read is shared across the whole batch while compute grows linearly with batch size, so the 5090's compute advantage shows up more clearly; that is relevant for the colocation / production-server angle above.
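The comparison above is plain arithmetic, and it can be reproduced in a few lines. The sketch uses only the bandwidth and throughput figures quoted in this recipe; the ceiling assumes each token reads the 4.99 GB of Q4 weights once and ignores KV-cache traffic, so it is an upper bound under that simplification.

```python
# Bandwidth ratio vs published speedup, RTX 5090 vs RTX 4090, Llama 3.1 8B Q4.
BANDWIDTH_GB_S = {"RTX 5090": 1792.0, "RTX 4090": 1008.0}
WEIGHTS_GB = 4.99                    # UD-Q4_K_XL weights read per token
MEASURED_5090 = 213.0                # tok/s, Runyard figure quoted above
MEASURED_4090_RANGE = (96.0, 145.0)  # tok/s, range quoted above

bandwidth_ratio = BANDWIDTH_GB_S["RTX 5090"] / BANDWIDTH_GB_S["RTX 4090"]
speedup_low = MEASURED_5090 / MEASURED_4090_RANGE[1]
speedup_high = MEASURED_5090 / MEASURED_4090_RANGE[0]
ceiling_5090 = BANDWIDTH_GB_S["RTX 5090"] / WEIGHTS_GB

print(f"bandwidth ratio:       {bandwidth_ratio:.2f}x")
print(f"published speedup:     {speedup_low:.2f}x to {speedup_high:.2f}x")
print(f"5090 bandwidth ceiling: ~{ceiling_5090:.0f} tok/s "
      f"(measured is {MEASURED_5090 / ceiling_5090:.0%} of it)")
```

The bandwidth ratio falls inside the published speedup range, which is the sense in which the numbers line up.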