
Llama 3.1 8B on RTX 3090 Ti: Local Chat via llama.cpp + Unsloth UD-Q4_K_XL GGUF

llm · beginner · 10GB+ VRAM · May 28, 2026
Prerequisites
  • NVIDIA RTX 3090 Ti (24 GB VRAM) or equivalent Ampere-class card
  • Recent NVIDIA driver with CUDA 12.x support (Ampere sm_86 — full mainline kernel coverage, no special wheel selection required)
  • ~5 GB free disk for the UD-Q4_K_XL GGUF (or ~16 GB for BF16)
  • llama.cpp, Ollama, or LM Studio installed
  • Hugging Face account with access to the gated meta-llama/Llama-3.1-8B-Instruct repo

What You'll Build

A local Llama 3.1 8B Instruct chat assistant running on an RTX 3090 Ti (24 GB VRAM) through llama.cpp with the unsloth/Llama-3.1-8B-Instruct-GGUF UD-Q4_K_XL weights (4.99 GB on disk, Unsloth's mixed-precision Dynamic 2.0 GGUF tier). The 3090 Ti's 24 GB envelope is wildly over-provisioned for this quant — weights resident on GPU are ~5 GB and a 16K-token KV cache adds another ~4 GB, leaving ~13 GB of headroom for Llama 3.1's full 128K-token context, a heavier quant tier (Q8_0, BF16), or running a second model concurrently on the same card.

Hardware data: RTX 3090 Ti (24 GB VRAM) · UD-Q4_K_XL GGUF · ~110 tok/s generation at Q4_K · See benchmark data

⚠️ Quant pinned — Unsloth UD-Q4_K_XL. This recipe targets UD-Q4_K_XL from the unsloth/Llama-3.1-8B-Instruct-GGUF repo specifically — Unsloth's mixed-precision GGUF tier featured in their Dynamic 2.0 benchmark comparisons, with per-layer sensitivity-aware bit-allocation. Standard Q4_K_M from other publishers (bartowski, TheBloke) will load with the same llama.cpp binary, but the per-layer recipe and resulting quality/speed profile are different — see Troubleshooting if you prefer the conventional flavor.

ℹ️ Gated model: Meta access form required. The canonical meta-llama/Llama-3.1-8B-Instruct repo and the derived unsloth/Llama-3.1-8B-Instruct-GGUF both require accepting Meta's Llama 3.1 Community License before download. Click "Agree and access" on the model page while logged into HF, then run huggingface-cli login locally with a read token before the steps below. The license permits commercial use; the exception is that a licensee whose products and services exceeded 700 million monthly active users in the calendar month before the Llama 3.1 release date must request a separate license from Meta.

Requirements

Component | Minimum | Tested
GPU | 8 GB VRAM (Q4_K_XL fits) | RTX 3090 Ti (24 GB)
RAM | 16 GB system | -
Storage | 4.99 GB (UD-Q4_K_XL GGUF) per unsloth/Llama-3.1-8B-Instruct-GGUF | -
Driver | CUDA 12.x runtime (Ampere sm_86) | -
Runtime | llama.cpp / Ollama / LM Studio | llama.cpp b9247+

The 3090 Ti is wildly over-provisioned for the UD-Q4_K_XL build (weights resident on GPU are 4.99 GB; KV cache for a 16K-token context adds another ~4 GB). For an upper-bound reference on the same 24 GB envelope, Hardware Corner's RTX 3090 24 GB VRAM guide tested Qwen3 8B (an adjacent-8B-class model) at Q4_K_M and reached ~23.57 GB at a ~90K-token context window — i.e. a 24 GB Ampere card only fills VRAM when you push to near-maximum context. At 16K context most users will have 13+ GB to spare for colocation or a heavier quant tier.
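The weights-plus-KV arithmetic above can be reproduced with a short sketch. It assumes the Llama 3.1 8B architecture constants (32 layers, 8 KV heads, head dimension 128), an FP16 KV cache, and a guessed 1 GiB allowance for compute buffers; the function names are illustrative and not part of llama.cpp. Under these assumptions the raw FP16 cache at 16K tokens comes out near 2 GiB, which is lower than the ~4 GB quoted above, so that figure may include compute buffers or extra margin.

```python
# Back-of-the-envelope VRAM budget for a GGUF model on a single GPU.
# Architecture defaults are assumed from the Llama 3.1 8B config
# (32 layers, 8 KV heads, head dimension 128); change them for other models.

GIB = 1024 ** 3

def kv_cache_bytes(n_ctx, n_layers=32, n_kv_heads=8, head_dim=128, bytes_per_elem=2.0):
    """Bytes needed to hold K and V for n_ctx tokens (FP16 by default)."""
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem  # K + V
    return int(n_ctx * per_token)

def vram_budget_gib(weights_gb, n_ctx, overhead_gib=1.0):
    """Weights + KV cache + a guessed allowance for compute buffers, in GiB."""
    weights_gib = weights_gb * 1e9 / GIB  # file sizes are quoted in decimal GB
    return weights_gib + kv_cache_bytes(n_ctx) / GIB + overhead_gib

for ctx in (16_384, 32_768, 131_072):
    kv_gib = kv_cache_bytes(ctx) / GIB
    total = vram_budget_gib(4.99, ctx)
    print(f"ctx={ctx:>7}: KV={kv_gib:5.1f} GiB, total~{total:5.1f} GiB")
```

The overhead term is a placeholder: real compute buffers depend on batch size, FlashAttention settings, and the llama.cpp build, so compare the output against nvidia-smi before relying on it.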

Spending the headroom — what 13+ GB of spare VRAM enables on the 3090 Ti

A 24 GB card is nearly five times the size of the resident Q4_K_XL weights, so the real question is not "does it fit" (it fits trivially) but "what to do with the unused VRAM." Three concrete options on the 3090 Ti:

  • Colocate a second model. A Whisper-Large-V3 ASR model (~3 GB) plus a CPU-bound embedding model (e.g. bge-small) leaves room for a full transcription → Llama-3.1 chat → response pipeline on a single card. Or pair Llama-3.1-8B Q4_K_XL with a TTS model like Kokoro for a voice agent — Kokoro fits in ~1 GB.
  • Step up to BF16 full precision. The BF16 build is 16.07 GB on disk per the unsloth/Llama-3.1-8B-Instruct-GGUF Files tab; the Ti's 24 GB still leaves ~6–7 GB of KV-cache headroom for a 16K context. Useful when downstream quality matters more than peak throughput — though see Results below for the memory-bandwidth caveat.
  • Stretch to long context. Llama 3.1 ships with a 128K-token native context window per the HF model card. Per the Hardware Corner 3090 guide above, an 8B-class model at Q4_K_M (Qwen3 8B in their test) runs up to ~90K tokens on a 24 GB Ampere card, at which point VRAM is nearly full (~23.57 GB); throughput drops as KV fills (see Troubleshooting).

Installation

Option A — llama.cpp + Unsloth GGUF (recommended path)

This is the canonical CUDA-accelerated llama.cpp loader for an 8B GGUF on a 24 GB Ampere card. Both the binary install and the GGUF format work without modification across Ampere / Ada / Blackwell — the 3090 Ti's sm_86 compute capability has full mainline CUDA / FlashAttention / cuBLAS kernel coverage and the default pre-built llama.cpp CUDA binaries Just Work.

1. Install llama.cpp

# macOS (Homebrew)
brew install llama.cpp

# Linux — pre-built CUDA binary
# Download the latest "llama-bXXXX-bin-ubuntu-cuda-12.x-x64.zip" asset from:
#   https://github.com/ggml-org/llama.cpp/releases
# Extract and add the bin/ directory to PATH.

To build from source with CUDA support instead, follow the llama.cpp CUDA build docs:

git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j 8

2. Download the UD-Q4_K_XL GGUF

The fastest path is the one-liner from the Unsloth model card quickstart — llama.cpp will fetch the tagged file directly:

huggingface-cli login   # paste a read token; required for the gated upstream
llama-server -hf unsloth/Llama-3.1-8B-Instruct-GGUF:UD-Q4_K_XL

For more control (specific local directory, pinned filename), pull only the Q4_K_XL file (~5 GB) via snapshot_download:

pip install huggingface_hub hf_transfer

# download_q4kxl.py
import os
# Enable the hf_transfer backend; this must be set before huggingface_hub is imported.
os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"
from huggingface_hub import snapshot_download

# Fetch only the files matching the UD-Q4_K_XL pattern into a local directory.
snapshot_download(
    repo_id="unsloth/Llama-3.1-8B-Instruct-GGUF",
    local_dir="unsloth/Llama-3.1-8B-Instruct-GGUF",
    allow_patterns=["*UD-Q4_K_XL*"],
)

python download_q4kxl.py

The resulting file is unsloth/Llama-3.1-8B-Instruct-GGUF/Llama-3.1-8B-Instruct-UD-Q4_K_XL.gguf (4.99 GB per the unsloth/Llama-3.1-8B-Instruct-GGUF Files tab).
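Before starting the server it can be worth confirming that the download is a complete GGUF file and not an HTML error page or a truncated transfer. The sketch below reads the fixed-size GGUF header; it assumes GGUF version 2 or later (little-endian, 64-bit counts), and read_gguf_header is an illustrative helper name, not a llama.cpp or huggingface_hub API.

```python
import struct

def read_gguf_header(path):
    """Return (version, tensor_count, metadata_kv_count) from a GGUF file.

    Raises ValueError if the file does not start with the GGUF magic bytes.
    Assumes GGUF v2+ layout: 4-byte magic, uint32 version, two uint64 counts.
    """
    with open(path, "rb") as f:
        header = f.read(24)
    if len(header) < 24 or header[:4] != b"GGUF":
        raise ValueError(f"{path} does not look like a GGUF file")
    version, tensor_count, kv_count = struct.unpack("<IQQ", header[4:24])
    return version, tensor_count, kv_count
```

Calling read_gguf_header on the path from the paragraph above should return a small version number and non-zero counts; a ValueError usually means the download failed or an access error page was saved in place of the model.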

3. Start the server

llama-server \
  --model unsloth/Llama-3.1-8B-Instruct-GGUF/Llama-3.1-8B-Instruct-UD-Q4_K_XL.gguf \
  --ctx-size 16384 \
  --n-gpu-layers 99 \
  --host 0.0.0.0 --port 8080

--n-gpu-layers 99 offloads every layer to the 3090 Ti (24 GB is plenty to keep the whole 5 GB model resident; layer-streaming is unnecessary). --ctx-size 16384 is the most common benchmark setting — bump to 131072 for Llama 3.1's full native context, but throughput will fall as the KV cache grows (see Troubleshooting).
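Loading the model takes a few seconds, so scripts that call the server right after launching it can fail on the first request. A minimal readiness check, assuming the server exposes GET /health on the host and port chosen above (llama-server is understood to return 200 once the model is loaded and an error status while loading), might look like this. The helper names and timeouts are illustrative.

```python
import time
import urllib.request

def health_ok(base_url="http://localhost:8080"):
    """True if GET /health answers HTTP 200; False on connection or HTTP errors."""
    try:
        with urllib.request.urlopen(f"{base_url}/health", timeout=5) as resp:
            return resp.status == 200
    except OSError:  # URLError/HTTPError (e.g. 503 while loading) and timeouts
        return False

def wait_until_ready(probe=health_ok, timeout_s=120.0, poll_s=1.0):
    """Call probe() until it returns True or timeout_s elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        if probe():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_s)
```

Passing the probe as a parameter keeps the polling loop testable without a running server; in normal use, wait_until_ready() with the defaults is enough.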

Option B — Ollama (one-command alternative)

If you don't care about the precise UD-Q4_K_XL tier, Ollama maintains its own quantized build:

curl -fsSL https://ollama.com/install.sh | sh
ollama pull llama3.1:8b
ollama run llama3.1:8b "Explain GQA attention in three sentences."

Ollama's default llama3.1:8b tag is Q4_0 at ~4.7 GB, not Q4_K_XL — the speed will be in the same ballpark but won't match Unsloth's tier exactly. The Ollama llama3.1 library lists alternate quant tags.

Option C — LM Studio (GUI)

LM Studio's built-in catalog search ("Llama 3.1 8B Instruct GGUF") will surface the unsloth UD-Q4_K_XL build alongside the bartowski standard-quant ladder. Pick Llama-3.1-8B-Instruct-UD-Q4_K_XL from the unsloth repo and download; it is the same file as in Option A. LM Studio's loader will set GPU offload to the maximum automatically for a 3090 Ti (the equivalent of --n-gpu-layers 99).

Running

One-shot prompt via the llama.cpp HTTP server

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama-3.1-8b-instruct",
    "messages": [{"role": "user", "content": "Write a haiku about Ampere GPUs."}]
  }'

The llama.cpp llama-server binary exposes an OpenAI-compatible /v1/chat/completions endpoint on the port chosen above.
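The same request can be sent from Python using only the standard library. This is a sketch that mirrors the curl call above: the endpoint path and payload shape follow the OpenAI chat-completions convention, while max_tokens and temperature are illustrative defaults added here, not values from the recipe.

```python
import json
import urllib.request

def build_chat_request(prompt, model="llama-3.1-8b-instruct",
                       max_tokens=256, temperature=0.7):
    """Build an OpenAI-style chat payload; sampling defaults are illustrative."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": temperature,
    }

def chat(prompt, base_url="http://localhost:8080", **kwargs):
    """POST the prompt to /v1/chat/completions and return the reply text."""
    payload = json.dumps(build_chat_request(prompt, **kwargs)).encode("utf-8")
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=120) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

With the server from step 3 running, chat("Write a haiku about Ampere GPUs.") returns the assistant message as a string.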

Interactive terminal

llama-cli \
  --model unsloth/Llama-3.1-8B-Instruct-GGUF/Llama-3.1-8B-Instruct-UD-Q4_K_XL.gguf \
  --ctx-size 16384 \
  --n-gpu-layers 99 \
  --interactive

Press Ctrl-C to interrupt generation; the CLI keeps the model warm in VRAM until exit.

BF16 native precision

The 3090 Ti's 24 GB VRAM can also fit the BF16 build (16.07 GB on disk per the unsloth/Llama-3.1-8B-Instruct-GGUF Files tab) with ~6–7 GB of KV-cache headroom at a 16K context. Use allow_patterns=["*BF16*"] in the snapshot_download script above to fetch the full-precision file instead — useful when downstream quality is the priority over peak throughput. Speed will be lower than Q4_K_XL because memory bandwidth, not compute, is the binding constraint on transformer inference, and the BF16 weights are roughly 3× larger.

Results

  • Speed: LocalScore (a Mozilla Builders project benchmarking platform built on llamafile / llama.cpp) records Meta Llama 3.1 8B Instruct at "Q4_K - Medium" at ~110 tokens/s generation and ~4,024 tokens/s prompt processing on the RTX 3090 Ti, with a time-to-first-token of ~320 ms and a performance score of 1,113 (rank 9 of 351 ranked accelerators on this model at the time of writing). That row is Q4_K_M, so it is a close proxy for, not a measurement of, the UD-Q4_K_XL file pinned here. LocalScore's methodology runs a standardized eight-scenario suite (input 64–4,096 tokens, output 16–3,072 tokens). These are the cleanest Llama-3.1-8B-specific RTX 3090 Ti figures available: the source row names this exact model and is not relabeled from an adjacent one. Hardware Corner's RTX 3090 Ti page, by contrast, carries Qwen3-8B but no Llama-3.1-8B row, so it is not a usable speed citation for this pair. LocalScore aggregates community submissions, so the row may drift by ~1% as more submissions land; slightly different numbers when you click through are expected. Surfaced via /check/llama-3-1-8b/rtx-3090-ti; if you run the build above on your own 3090 Ti, please submit your numbers so additional measurements can corroborate the LocalScore figure.
  • VRAM usage: No measured peak VRAM is in the backend yet for this pair. As a derived envelope (labelled as derived, not measured): UD-Q4_K_XL weights resident on GPU are 4.99 GB per the unsloth/Llama-3.1-8B-Instruct-GGUF Files tab. For Llama 3.1 8B (32 layers, 8 KV heads under GQA, head dimension 128) the default FP16 KV cache at a 16K context is about 2 GiB; budgeting ~4 GB for KV cache plus compute buffers puts the runtime peak around ~9–10 GB, well inside the 3090 Ti's 24 GB envelope. For an upper-bound reference on the same 24 GB Ampere envelope, Hardware Corner's RTX 3090 24 GB VRAM guide measured Qwen3 8B (adjacent-8B-class) at Q4_K_M / ~90K context at ~23.57 GB, i.e. a 24 GB Ampere card only fills its VRAM if you push to near-maximum context. A measured Llama-3.1-8B Ti number will replace this once community data lands; see /check/llama-3-1-8b/rtx-3090-ti for the canonical figure.
  • Quality notes: UD-Q4_K_XL is the Unsloth mixed-precision GGUF tier; the Unsloth Dynamic 2.0 docs discuss per-layer sensitivity-aware bit-allocation across the family. On a 24 GB card you can step up to Q8_0 (~8.54 GB per the bartowski mirror per-tier table) or BF16 (16.07 GB) freely if quality matters more than throughput; there's no quality-floor reason to run anything below Q4_K_M on this hardware.
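To compare your own card against the LocalScore figure, a rough generation rate can be derived from a chat-completions response. The sketch below prefers a server-reported rate when the response carries a timings.predicted_per_second field (an assumption about llama-server's response shape, so the code treats it as optional) and otherwise divides completion tokens by wall-clock time. The fallback includes prompt processing and HTTP overhead, so it will read lower than a pure decode benchmark.

```python
def generation_tok_per_s(response, elapsed_s):
    """Rough tokens/s from an OpenAI-style response dict and wall-clock seconds.

    Uses the server-reported decode rate if present (assumed field name:
    timings.predicted_per_second); otherwise falls back to
    usage.completion_tokens / elapsed_s.
    """
    timings = response.get("timings") or {}
    if "predicted_per_second" in timings:
        return float(timings["predicted_per_second"])
    if elapsed_s <= 0:
        raise ValueError("elapsed_s must be positive")
    return response["usage"]["completion_tokens"] / elapsed_s
```

A typical use is to record time.perf_counter() before and after the HTTP call, parse the JSON body into a dict, and pass both to the function. Averaging several runs with long outputs gives a steadier number than a single short reply.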

For the full benchmark data and other-GPU comparisons, see /check/llama-3-1-8b/rtx-3090-ti.

Troubleshooting

huggingface-cli 401 / 403 on the Unsloth GGUF repo

The Unsloth quantization inherits gating from the upstream meta-llama/Llama-3.1-8B-Instruct repo. You need to (a) be logged in via huggingface-cli login with a token that has read access, and (b) have clicked "Agree and access" on the upstream Meta repo while logged in — the access carries through to the derived Unsloth mirror. The full license terms are at github.com/meta-llama/llama-models.

Generation slows dramatically past 32K context

Llama 3.1 ships with a 128K-token native context window per the HF model card, but throughput drops as the KV cache fills. Hardware Corner's RTX 3090 24 GB VRAM guide reports that 8B-class models at Q4_K_M can run up to a ~90K-token context on a 24 GB Ampere card before approaching the VRAM ceiling; at that scale the VRAM is fully consumed by KV cache and partial-offload behavior emerges. For long-doc workflows on a single 3090 Ti, keep --ctx-size at 32K or below for sustained throughput, and use chunking + retrieval beyond that. KV-cache quantization (--cache-type-k q8_0 --cache-type-v q8_0 --flash-attn) roughly halves KV memory if you need to push context higher at Q4_K weights.
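The effect of KV-cache quantization on the reachable context can be estimated with a small sketch. It assumes the Llama 3.1 8B shape (32 layers, 8 KV heads, head dimension 128), a q8_0 layout of 32 int8 values plus one FP16 scale per block, and a guessed 1.5 GiB reserve for compute buffers and the CUDA context. The result is an upper bound on paper, not a measured ceiling, and it says nothing about throughput at that length.

```python
GIB = 1024 ** 3

# Approximate bytes per cached element. q8_0 is assumed to store 32 int8
# values plus one FP16 scale per block (34 bytes per 32 elements).
BYTES_PER_ELEM = {"f16": 2.0, "q8_0": 34 / 32}

def max_context_tokens(vram_gib, weights_gb, cache_type="f16",
                       n_layers=32, n_kv_heads=8, head_dim=128, reserve_gib=1.5):
    """Largest context whose KV cache fits after weights and a guessed reserve."""
    free_bytes = (vram_gib - reserve_gib) * GIB - weights_gb * 1e9
    per_token = 2 * n_layers * n_kv_heads * head_dim * BYTES_PER_ELEM[cache_type]
    return max(0, int(free_bytes // per_token))

for cache_type in ("f16", "q8_0"):
    tokens = max_context_tokens(24, 4.99, cache_type)
    print(f"{cache_type}: up to ~{tokens:,} tokens of KV cache")
```

Under these assumptions q8_0 fits a little under twice as many tokens as FP16, which matches the "roughly halves KV memory" rule of thumb. Real limits are lower because compute buffers also grow with context length.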

Want a different runtime — vLLM or SGLang?

The Meta canonical HF model card documents vllm serve "meta-llama/Llama-3.1-8B-Instruct" and python3 -m sglang.launch_server --model-path "meta-llama/Llama-3.1-8B-Instruct"; both load BF16 weights (16.07 GB) rather than the GGUF quantization. The 3090 Ti's 24 GB accommodates BF16 plus modest batch sizes for production-style serving, but vLLM preallocates KV cache up to its --gpu-memory-utilization fraction of VRAM and defaults --max-model-len to the model's full 128K context, which does not fit next to 16 GB of weights on a 24 GB card. Use --max-model-len 8192 and, if other processes share the GPU, --gpu-memory-utilization 0.85 to keep peak VRAM under control on a single 3090 Ti. Expect better aggregate throughput under concurrent requests than llama.cpp, at the cost of GPU memory headroom. The llama.cpp GGUF path and the vLLM / SGLang BF16 path are different quant tiers, so their throughput numbers are not directly comparable.

Standard Q4_K_M instead of Unsloth's Q4_K_XL?

Both load with the same llama.cpp binary; only the quantization recipe differs. bartowski/Meta-Llama-3.1-8B-Instruct-GGUF ships the standard k-quant ladder if you prefer the conventional flavor — file sizes are nearly identical (Q4_K_M = 4.92 GB per the unsloth/Llama-3.1-8B-Instruct-GGUF Files tab). The Q4_K_M throughput will be close to but not identical to Unsloth's UD-Q4_K_XL because the per-layer bit-allocation differs.

Ampere vs Ada / Blackwell — anything special for the 3090 Ti?

No. The RTX 3090 Ti is Ampere (sm_86) with full mainline CUDA / FlashAttention 2 / cuBLAS kernel coverage. It is older than the Ada (sm_89) and Blackwell (sm_120) cards but no less supported in the LLM-inference stack. The default pip install torch and pre-built llama.cpp CUDA binaries work out of the box. The 3090 Ti lacks FP8 tensor cores (introduced on Hopper sm_90 / Ada sm_89), but the recommended UD-Q4_K_XL build uses INT4 / k-quant math which runs natively on Ampere, so there is no FP8 fallback path to worry about for the canonical GGUF recipe here.

LLM token generation is memory-bandwidth-bound — what does that mean for the 3090 Ti?

Transformer inference at small batch sizes is dominated by reading weights from VRAM each token, so peak tokens/sec scales with memory bandwidth, not raw FLOPs. The 3090 Ti has 1008 GB/s of GDDR6X bandwidth (vs. the non-Ti 3090's 936 GB/s, ~7.7% more); LocalScore's measured ~110 tok/s on the 3090 Ti for Llama 3.1 8B Q4_K is ~15% above the same platform's RTX 3090 figure of ~95 tok/s for the same row, which is consistent with the Ti's combined memory-bandwidth + sustained-clock uplift over the non-Ti at small-batch Q4 inference. For Q4-class inference on an 8B model, the 3090 Ti sits at the top of the Ampere 24 GB tier without crossing into the Ada / Blackwell price band.
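The bandwidth argument can be turned into a quick ceiling estimate: if every generated token requires reading all weights once, single-stream decode speed cannot exceed bandwidth divided by weight size. The sketch below applies that simplification to the figures quoted in this recipe; it ignores KV-cache reads, kernel launch overhead, and cache effects, so it is an upper bound and not a prediction.

```python
def bandwidth_ceiling_tok_per_s(bandwidth_gb_s, weights_gb):
    """Upper bound on single-stream decode speed if each token reads all weights once."""
    return bandwidth_gb_s / weights_gb

# 1008 GB/s is the 3090 Ti bandwidth quoted above; sizes are the GGUF file sizes.
for label, weights_gb in (("UD-Q4_K_XL", 4.99), ("BF16", 16.07)):
    ceiling = bandwidth_ceiling_tok_per_s(1008, weights_gb)
    print(f"{label}: at most ~{ceiling:.0f} tok/s")
```

With these inputs the ceiling is about 202 tok/s for the Q4 file and about 63 tok/s for BF16, so the measured ~110 tok/s sits at roughly half of the theoretical Q4 limit, and the BF16 build is bounded well below the Q4 figure regardless of compute.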