How much VRAM does Llama 3.1 8B need?

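About 10 GB at a 16K-token context: ~5 GB of weights plus KV cache and runtime buffers. This is the derived envelope the recipe targets, not a measured peak.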

How hard is this setup?

Beginner: follow the steps below.

Llama 3.1 8B on RTX 4090: Local Chat via llama.cpp + Unsloth UD-Q4_K_XL GGUF

What You'll Build

A local Llama 3.1 8B Instruct chat assistant running on an RTX 4090 (24 GB VRAM) through llama.cpp with the unsloth/Llama-3.1-8B-Instruct-GGUF UD-Q4_K_XL weights (4.99 GB on disk, Unsloth's mixed-precision Dynamic 2.0 GGUF tier). The 4090's 24 GB is far more than this quant needs: weights resident on GPU are ~5 GB, and the 16K-token KV cache plus compute buffers add roughly 2–4 GB more. That leaves headroom for longer contexts, larger batch sizes, or a second model running concurrently; the full 128K-token context is a much tighter fit (see Troubleshooting).

Hardware data: RTX 4090 (24 GB VRAM) · UD-Q4_K_XL GGUF · ~10 GB derived runtime envelope · See benchmark data

⚠️ Quant pinned — Unsloth UD-Q4_K_XL. This recipe targets Q4_K_XL from the unsloth/Llama-3.1-8B-Instruct-GGUF repo specifically — Unsloth's mixed-precision GGUF tier featured in their Dynamic 2.0 benchmark comparisons, with per-layer sensitivity-aware bit-allocation. Standard Q4_K_M from other publishers (bartowski, TheBloke) will load with the same llama.cpp binary, but the per-layer recipe and resulting quality/speed profile are different — see Troubleshooting if you prefer the conventional flavor.

ℹ️ Gated model: Meta access form required. The canonical meta-llama/Llama-3.1-8B-Instruct repo and the derived unsloth/Llama-3.1-8B-Instruct-GGUF both require accepting Meta's Llama 3.1 Community License before download. Click "Agree and access" on the model page while logged into HF, then run huggingface-cli login locally with a read token before the steps below. The license permits commercial use; licensees whose products exceeded 700 million monthly active users at the Llama 3.1 release date must request a separate license from Meta.

Requirements

Component	Minimum	Tested
GPU	8 GB VRAM (Q4_K_XL fits with a reduced context)	RTX 4090 (24 GB)
RAM	16 GB system	—
Storage	4.99 GB (UD-Q4_K_XL GGUF) per unsloth/Llama-3.1-8B-Instruct-GGUF	—
Driver	CUDA 12.x runtime (Ada sm_89)	—
Runtime	llama.cpp / Ollama / LM Studio	llama.cpp b9247+

The 4090 is heavily over-provisioned for the Q4_K_XL build: weights resident on GPU are only ~5 GB, and the KV cache and compute buffers for a 16K-token context add roughly 2–4 GB. That leaves plenty of headroom to either jump to a heavier quant tier (Q8_0, BF16) or run a second model concurrently.
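The KV-cache part of that envelope can be sanity-checked with a few lines of arithmetic. The sketch below is an estimate under stated assumptions: the architecture constants (32 layers, 8 KV heads, head dimension 128) are taken from the published Llama 3.1 8B configuration, and the cache is assumed to be stored as 16-bit values. It ignores compute buffers and allocator overhead.

```python
# Rough KV-cache size estimate for Llama 3.1 8B.
# Assumptions: 32 layers, 8 KV heads (GQA), head dimension 128, f16 cache.
N_LAYERS = 32
N_KV_HEADS = 8
HEAD_DIM = 128
BYTES_PER_ELEMENT = 2  # 16-bit cache entries


def kv_cache_bytes(n_tokens: int) -> int:
    """Bytes needed to hold keys and values for n_tokens of context."""
    # Leading factor 2: one tensor for keys, one for values.
    per_token = 2 * N_LAYERS * N_KV_HEADS * HEAD_DIM * BYTES_PER_ELEMENT
    return per_token * n_tokens


def gib(n_bytes: int) -> float:
    """Convert bytes to binary gigabytes (GiB)."""
    return n_bytes / 2**30


if __name__ == "__main__":
    for ctx in (4096, 16384, 32768, 131072):
        print(f"{ctx:>7} tokens: {gib(kv_cache_bytes(ctx)):.2f} GiB")
```

By this arithmetic the cache tensors themselves come to about 2 GiB at 16K, so the ~10 GB runtime envelope leaves margin for compute buffers and other overhead.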

Installation

Option A — llama.cpp + Unsloth GGUF (recommended path)

This is the canonical CUDA-accelerated llama.cpp loader for an 8B GGUF on a 24 GB Ada card. Hardware Corner's RTX 4090 LLM benchmark suite measures generic "8B Q4_K_XL" at ~96 tok/s @ 16K context with llama.cpp + CUDA 12.8 — adjacent evidence for what this loader path delivers on the 4090, though the source table does not explicitly identify the 8B model under test (see Results below for the caveat).

1. Install llama.cpp

# macOS (Homebrew)
brew install llama.cpp

# Linux — pre-built CUDA binary
# Download the latest "llama-bXXXX-bin-ubuntu-cuda-12.x-x64.zip" asset from:
#   https://github.com/ggml-org/llama.cpp/releases
# Extract and add the bin/ directory to PATH.

To build from source with CUDA support instead, follow the llama.cpp CUDA build docs:

git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j 8

2. Download the UD-Q4_K_XL GGUF

Per unsloth's run guide, use snapshot_download to pull only the Q4_K_XL file (~5 GB) instead of the full 96-file repo:

pip install huggingface_hub hf_transfer
huggingface-cli login   # paste a read token; required for the gated upstream

# download_q4kxl.py
import os
os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="unsloth/Llama-3.1-8B-Instruct-GGUF",
    local_dir="unsloth/Llama-3.1-8B-Instruct-GGUF",
    allow_patterns=["*UD-Q4_K_XL*"],
)

python download_q4kxl.py

The resulting file is unsloth/Llama-3.1-8B-Instruct-GGUF/Llama-3.1-8B-Instruct-UD-Q4_K_XL.gguf (4.99 GB per the unsloth model card).
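Before starting the server, it can be worth confirming that the download completed. A minimal sketch: GGUF files begin with the four magic bytes "GGUF", so checking those plus the file size catches most truncated or mis-saved downloads. The helper names here are illustrative, not part of any library.

```python
from pathlib import Path

GGUF_MAGIC = b"GGUF"  # first four bytes of a GGUF file


def looks_like_gguf(path: str) -> bool:
    """Return True if the file exists and starts with the GGUF magic bytes."""
    p = Path(path)
    if not p.is_file():
        return False
    with p.open("rb") as fh:
        return fh.read(4) == GGUF_MAGIC


def size_gb(path: str) -> float:
    """File size in decimal gigabytes (10**9 bytes)."""
    return Path(path).stat().st_size / 1e9


if __name__ == "__main__":
    model = "unsloth/Llama-3.1-8B-Instruct-GGUF/Llama-3.1-8B-Instruct-UD-Q4_K_XL.gguf"
    if looks_like_gguf(model):
        print(f"OK: {size_gb(model):.2f} GB")
    else:
        print("Missing or not a GGUF file; re-run the download step.")
```

A size well below the 4.99 GB listed on the model card suggests an interrupted download.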

3. Start the server

llama-server \
  --model unsloth/Llama-3.1-8B-Instruct-GGUF/Llama-3.1-8B-Instruct-UD-Q4_K_XL.gguf \
  --ctx-size 16384 \
  --n-gpu-layers 99 \
  --host 0.0.0.0 --port 8080

--n-gpu-layers 99 offloads every layer to the 4090, which has enough VRAM to keep the whole model resident, so layer-streaming is unnecessary. --ctx-size 16384 matches the 16K context the benchmark was measured at; raise it to 131072 for Llama 3.1's full native context, though throughput will fall as the KV cache grows (see Troubleshooting). --host 0.0.0.0 listens on all network interfaces; use 127.0.0.1 if the server should only be reachable from the local machine.
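Loading the model takes a few seconds, so scripts that talk to the server may want to wait until it is ready. The sketch below polls llama-server's /health endpoint, which answers 200 once the model is loaded. The function name, default timeouts, and the injectable opener parameter are illustrative choices, not part of llama.cpp.

```python
import time
import urllib.error
import urllib.request


def wait_until_ready(base_url: str = "http://localhost:8080",
                     timeout_s: float = 120.0,
                     poll_s: float = 1.0,
                     opener=urllib.request.urlopen) -> bool:
    """Poll GET /health until it answers 200 or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with opener(f"{base_url}/health", timeout=5) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, ConnectionError, TimeoutError):
            # Server not listening yet, or still loading (non-200 responses
            # surface as HTTPError, a subclass of URLError).
            pass
        time.sleep(poll_s)
    return False


if __name__ == "__main__":
    print("ready" if wait_until_ready() else "server did not come up in time")
```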

Option B — Ollama (one-command alternative)

If you don't care about the precise UD-Q4_K_XL tier, Ollama maintains its own quantized build:

curl -fsSL https://ollama.com/install.sh | sh
ollama pull llama3.1:8b
ollama run llama3.1:8b "Explain GQA attention in three sentences."

Ollama's default llama3.1:8b tag is Q4_0 at ~4.7 GB, not Q4_K_XL — the speed will be in the same ballpark but won't match Unsloth's tier exactly. The Ollama llama3.1 library lists alternate quant tags.

Option C — LM Studio (GUI)

LM Studio's built-in catalog search ("Llama 3.1 8B Instruct GGUF") will surface the unsloth UD-Q4_K_XL build alongside the bartowski standard-quant ladder. Pick Llama-3.1-8B-Instruct-UD-Q4_K_XL from the unsloth repo and download — same file as Option A. LM Studio's loader will set --n-gpu-layers to "max" automatically for a 4090.

Running

One-shot prompt via the llama.cpp HTTP server

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama-3.1-8b-instruct",
    "messages": [{"role": "user", "content": "Write a haiku about Ada Lovelace GPUs."}]
  }'

The llama.cpp llama-server binary exposes an OpenAI-compatible /v1/chat/completions endpoint on the port chosen above.
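The same request can be sent from Python using only the standard library. This is a minimal sketch assuming the server from step 3 is listening on localhost:8080; the helper names are illustrative, and the response parsing assumes the usual OpenAI-style choices[0].message.content layout.

```python
import json
import urllib.request


def build_chat_request(prompt: str,
                       base_url: str = "http://localhost:8080",
                       model: str = "llama-3.1-8b-instruct") -> urllib.request.Request:
    """Build a POST request for the OpenAI-compatible chat endpoint."""
    payload = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_reply(response_body: bytes) -> str:
    """Pull the assistant text out of a chat-completions response body."""
    data = json.loads(response_body)
    return data["choices"][0]["message"]["content"]


def chat(prompt: str) -> str:
    """Send one user message and return the model's reply."""
    with urllib.request.urlopen(build_chat_request(prompt), timeout=120) as resp:
        return extract_reply(resp.read())


if __name__ == "__main__":
    print(chat("Write a haiku about Ada Lovelace GPUs."))
```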

Interactive terminal

llama-cli \
  --model unsloth/Llama-3.1-8B-Instruct-GGUF/Llama-3.1-8B-Instruct-UD-Q4_K_XL.gguf \
  --ctx-size 16384 \
  --n-gpu-layers 99 \
  --interactive

Press Ctrl-C to interrupt generation; the CLI keeps the model warm in VRAM until exit.

BF16 native precision

The 4090's 24 GB VRAM can also fit the BF16 build (16.1 GB on disk per the unsloth tier table) with ~6–7 GB of KV-cache headroom for a 16K context. Use allow_patterns=["*BF16*"] in the snapshot_download script above to fetch the full-precision file instead, which is useful when downstream quality is the priority over peak throughput. Speed will be lower than Q4_K_XL: single-stream token generation is bound by memory bandwidth rather than compute, and each generated token reads the full set of weights, so the 16.1 GB BF16 file costs roughly three times the memory traffic of the 4.99 GB quant.
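The bandwidth argument can be turned into a rough ceiling on decode speed. The sketch below assumes the RTX 4090's spec-sheet memory bandwidth of 1008 GB/s and assumes every generated token reads the full weight file once. It ignores KV-cache reads and kernel overhead, so real throughput lands well below these numbers.

```python
# Back-of-envelope decode ceiling: tokens/s <= bandwidth / bytes read per token.
PEAK_BANDWIDTH_GB_S = 1008.0  # assumed RTX 4090 spec-sheet memory bandwidth


def decode_ceiling_tok_s(weights_gb: float,
                         bandwidth_gb_s: float = PEAK_BANDWIDTH_GB_S) -> float:
    """Upper bound on single-stream tokens/s if weights are read once per token."""
    return bandwidth_gb_s / weights_gb


if __name__ == "__main__":
    for name, size_gb in (("UD-Q4_K_XL", 4.99), ("BF16", 16.1)):
        print(f"{name}: at most ~{decode_ceiling_tok_s(size_gb):.0f} tok/s")
```

The useful output is the ratio between the two tiers rather than either absolute figure.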

Results

Speed: No community benchmark explicitly identifies Llama 3.1 8B (vs. some other 8B-class model) at this quant tier on RTX 4090. The closest cited measurement is Hardware Corner's RTX 4090 LLM benchmark suite, which records 8B Q4_K_XL at 96.12 tokens/s @ 16K context with llama.cpp + CUDA 12.8 (full ladder: 4K=131.00, 8K=119.36, 16K=96.12, 32K=77.42 tok/s). Hardware Corner's table labels the model generically as "8B" without naming the specific architecture under test, so treat this figure as adjacent 8B-class evidence, not a Llama-3.1-8B-specific measurement. Llama 3.1 8B is a widely used 8B GGUF in the llama.cpp ecosystem and likely sits in this range, but the source does not confirm it. Surfaced via /check/llama-3-1-8b/rtx-4090; if you run the build above on your own 4090, please submit your numbers so a Llama-3.1-8B-specific measurement replaces the adjacent-evidence anchor.
VRAM usage: No measured peak VRAM is in the backend yet. As a derived envelope (labelled as derived, not measured): UD-Q4_K_XL weights resident on GPU are 4.99 GB per unsloth's file table; the f16 KV cache for a 16K context on an 8B model with 32 layers and 8 KV heads (GQA, head dimension 128) is about 2 GB, and compute buffers and allocator overhead add more, putting the runtime peak at roughly 8–10 GB, well inside the 4090's 24 GB envelope. A measured number will replace this once community data lands; see /check/llama-3-1-8b/rtx-4090 for the canonical figure.
Quality notes: UD-Q4_K_XL is the Unsloth mixed-precision GGUF tier; the Unsloth Dynamic 2.0 docs discuss per-layer sensitivity-aware bit-allocation across the family. On a 24 GB card you can step up to Q8_0 or BF16 freely if quality matters more than throughput; there's no quality-floor reason to run anything below Q4_K_M on this hardware.
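To make the context-length ladder above easier to read, the sketch below expresses each figure as a fraction of the 4K throughput. The numbers are the Hardware Corner values quoted in the Speed note; mapping "4K" to 4096 tokens (and so on) is an assumption about how the source labels its columns.

```python
# Hardware Corner 8B Q4_K_XL ladder as quoted above (tokens/s by context size).
# Assumption: "4K" means 4096 tokens, "8K" means 8192, and so on.
LADDER_TOK_S = {4096: 131.00, 8192: 119.36, 16384: 96.12, 32768: 77.42}


def retention(ladder: dict[int, float]) -> dict[int, float]:
    """Throughput at each context size relative to the smallest context."""
    base = ladder[min(ladder)]
    return {ctx: tok_s / base for ctx, tok_s in sorted(ladder.items())}


if __name__ == "__main__":
    for ctx, frac in retention(LADDER_TOK_S).items():
        print(f"{ctx:>6} tokens: {frac:.0%} of 4K throughput")
```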

For the full benchmark data and other-GPU comparisons, see /check/llama-3-1-8b/rtx-4090.

Troubleshooting

`huggingface-cli` 401 / 403 on the Unsloth GGUF repo

The Unsloth quantization inherits gating from the upstream meta-llama/Llama-3.1-8B-Instruct repo. You need to (a) be logged in via huggingface-cli login with a token that has read access, and (b) have clicked "Agree and access" on the upstream Meta repo while logged in — the access carries through to the derived Unsloth mirror. The full license terms are at github.com/meta-llama/llama-models.

Generation slows dramatically past 32K context

Llama 3.1 ships with a 128K-token native context window per the HF model card, but throughput drops as the KV cache fills. Hardware Corner's 8B-class table shows the pattern: ~131 tok/s @ 4K → ~96 tok/s @ 16K → ~77 tok/s @ 32K. At full 128K an f16 KV cache alone takes about 16 GiB (~17 GB: 32 layers × 8 KV heads × 128 dimensions × 2 bytes, for both keys and values, × 131,072 tokens). With ~5 GB of weights and compute buffers on top, that sits at the edge of the 4090's 24 GB and may force partial offload. For long-doc workflows on a single 4090, keep --ctx-size at 32K or below and use chunking + retrieval beyond that.
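The chunking half of that advice can be sketched as follows. This is a minimal illustration, not a full retrieval pipeline: it splits on whitespace and uses word counts as a crude stand-in for tokens, which is an assumption (a real tokenizer will produce more tokens than words). The function name and defaults are hypothetical.

```python
def chunk_words(text: str, chunk_size: int = 2000, overlap: int = 200) -> list[str]:
    """Split text into overlapping chunks of at most chunk_size words."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    words = text.split()
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last window already reached the end of the text
    return chunks


if __name__ == "__main__":
    sample = " ".join(f"w{i}" for i in range(5000))
    pieces = chunk_words(sample)
    print(len(pieces), "chunks; first chunk has", len(pieces[0].split()), "words")
```

Each chunk can then be sent as its own request, with a retrieval step choosing which chunks are relevant to a given question.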

Want a different runtime — vLLM or SGLang?

The Meta canonical HF model card documents vllm serve "meta-llama/Llama-3.1-8B-Instruct" and python3 -m sglang.launch_server --model-path "meta-llama/Llama-3.1-8B-Instruct"; both load BF16 weights (16.1 GB) rather than the GGUF quantization. The 4090's 24 GB VRAM accommodates BF16 plus reasonable batch sizes for production-style serving. These servers batch concurrent requests, so expect better aggregate throughput under load at the cost of GPU memory headroom. The llama.cpp GGUF path and the vLLM/SGLang BF16 path run at different precisions, so their throughput numbers are not directly comparable.

Standard Q4_K_M instead of Unsloth's Q4_K_XL?

Both load with the same llama.cpp binary; only the quantization recipe differs. bartowski/Meta-Llama-3.1-8B-Instruct-GGUF ships the standard k-quant ladder if you prefer the conventional flavor — file sizes are nearly identical (Q4_K_M = 4.92 GB per the unsloth tier table). The Q4_K_M throughput will be close to but not identical to Unsloth's Q4_K_XL because the per-layer bit-allocation differs.

Ada vs Blackwell — anything special for the 4090?

No. The RTX 4090 is Ada Lovelace (sm_89) with full mainline CUDA / FlashAttention / cuBLAS kernel coverage since 2023. Unlike Blackwell-class cards (RTX 50-series, sm_120), no cu128-specific wheel selection or attention-implementation overrides are required — the default pip install torch and pre-built llama.cpp CUDA binaries work out of the box.