How much VRAM does Llama 3.1 8B need?

About 10 GB — the minimum this recipe targets.

How hard is this setup?

Beginner — follow the steps above.

Llama 3.1 8B on RTX 3090: Local Chat via llama.cpp + Unsloth UD-Q4_K_XL GGUF

What You'll Build

A local Llama 3.1 8B Instruct chat assistant running on an RTX 3090 (24 GB VRAM) through llama.cpp with the unsloth/Llama-3.1-8B-Instruct-GGUF UD-Q4_K_XL weights (4.99 GB on disk, Unsloth's mixed-precision Dynamic 2.0 GGUF tier). The 3090's 24 GB envelope is wildly over-provisioned for this quant — weights resident on GPU are ~5 GB and a 16K-token KV cache adds another ~4 GB, leaving ~13 GB of headroom for Llama 3.1's full 128K-token context, a heavier quant tier (Q8_0, BF16), or running a second model concurrently on the same card.

Hardware data: RTX 3090 (24 GB VRAM) · UD-Q4_K_XL GGUF · ~92 tok/s generation at Q4_K · See benchmark data

⚠️ Quant pinned — Unsloth UD-Q4_K_XL. This recipe targets UD-Q4_K_XL from the unsloth/Llama-3.1-8B-Instruct-GGUF repo specifically — Unsloth's mixed-precision GGUF tier featured in their Dynamic 2.0 benchmark comparisons, with per-layer sensitivity-aware bit-allocation. Standard Q4_K_M from other publishers (bartowski, TheBloke) will load with the same llama.cpp binary, but the per-layer recipe and resulting quality/speed profile are different — see Troubleshooting if you prefer the conventional flavor.

ℹ️ Gated model — Meta access form required. The canonical meta-llama/Llama-3.1-8B-Instruct repo and the derived unsloth/Llama-3.1-8B-Instruct-GGUF both require accepting Meta's Llama 3.1 Community License before download. Click "Agree and access" on the model page while logged into HF, then run huggingface-cli login locally with a read token before the steps below. The license permits commercial use until you exceed 700 million monthly active users.

Requirements

Component	Minimum	Tested
GPU	8 GB VRAM (Q4_K_XL fits)	RTX 3090 (24 GB)
RAM	16 GB system	—
Storage	4.99 GB (UD-Q4_K_XL GGUF) per unsloth/Llama-3.1-8B-Instruct-GGUF	—
Driver	CUDA 12.x runtime (Ampere sm_86)	—
Runtime	llama.cpp / Ollama / LM Studio	llama.cpp b9247+

The 3090 is wildly over-provisioned for the UD-Q4_K_XL build (weights resident on GPU are 4.99 GB; KV cache for a 16K-token context adds another ~4 GB). For context, Hardware Corner's RTX 3090 24 GB VRAM guide tested Qwen3 8B at Q4_K_M and reached ~23.57 GB at a 90K-token context window — i.e. the 3090 can run an 8B Q4_K_M model at near-maximum native context before hitting the VRAM ceiling. At 16K context most users will have 13+ GB to spare for colocation or a heavier quant tier.

Spending the headroom — what 13+ GB of spare VRAM enables on the 3090

A 24 GB card is more than five times the size of the resident Q4_K_XL weights, so the legitimate recipe pivot is not "does it fit" (it fits trivially) but "what to do with the unused VRAM." Three concrete options on the 3090:

Colocate a second model. A small Whisper-Large-V3 ASR model (~3 GB) plus a CPU-bound embedding model (e.g. bge-small) leaves room for a full transcription → Llama-3.1 chat → response pipeline on a single card. Or pair Llama-3.1-8B Q4_K_XL with a TTS model like Kokoro for a voice agent — Kokoro fits in ~1 GB.
Step up to BF16 full precision. The BF16 build is 16.07 GB on disk per the unsloth tier table; the 3090's 24 GB still leaves ~6–7 GB of KV-cache headroom for a 16K context. Useful when downstream quality matters more than peak throughput — though see Results below for the memory-bandwidth caveat.
Stretch to long context. Llama 3.1 ships with a 128K-token native context window per the HF model card. Per the Hardware Corner guide above, an 8B model at Q4_K_M comfortably runs up to ~90K tokens on a 24 GB card; throughput drops as KV fills (see Troubleshooting).

Installation

Option A — llama.cpp + Unsloth GGUF (recommended path)

This is the canonical CUDA-accelerated llama.cpp loader for an 8B GGUF on a 24 GB Ampere card. Both the binary install and the GGUF format work without modification across Ampere / Ada / Blackwell — the 3090's sm_86 compute capability has full mainline CUDA / FlashAttention / cuBLAS kernel coverage and the default pre-built llama.cpp CUDA binaries Just Work.

1. Install llama.cpp

# macOS (Homebrew)
brew install llama.cpp

# Linux — pre-built CUDA binary
# Download the latest "llama-bXXXX-bin-ubuntu-cuda-12.x-x64.zip" asset from:
#   https://github.com/ggml-org/llama.cpp/releases
# Extract and add the bin/ directory to PATH.

To build from source with CUDA support instead, follow the llama.cpp CUDA build docs:

git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j 8

2. Download the UD-Q4_K_XL GGUF

The fastest path is the one-liner from the Unsloth model card quickstart — llama.cpp will fetch the tagged file directly:

huggingface-cli login   # paste a read token; required for the gated upstream
llama-server -hf unsloth/Llama-3.1-8B-Instruct-GGUF:UD-Q4_K_XL

For more control (specific local directory, pinned filename), pull only the Q4_K_XL file (~5 GB) via snapshot_download:

pip install huggingface_hub hf_transfer

# download_q4kxl.py
import os
os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="unsloth/Llama-3.1-8B-Instruct-GGUF",
    local_dir="unsloth/Llama-3.1-8B-Instruct-GGUF",
    allow_patterns=["*UD-Q4_K_XL*"],
)

python download_q4kxl.py

The resulting file is unsloth/Llama-3.1-8B-Instruct-GGUF/Llama-3.1-8B-Instruct-UD-Q4_K_XL.gguf (4.99 GB per the unsloth model card).

3. Start the server

llama-server \
  --model unsloth/Llama-3.1-8B-Instruct-GGUF/Llama-3.1-8B-Instruct-UD-Q4_K_XL.gguf \
  --ctx-size 16384 \
  --n-gpu-layers 99 \
  --host 0.0.0.0 --port 8080

--n-gpu-layers 99 offloads every layer to the 3090 (24 GB is plenty to keep the whole 5 GB model resident; layer-streaming is unnecessary). --ctx-size 16384 is the most common benchmark setting — bump to 131072 for Llama 3.1's full native context, but throughput will fall as the KV cache grows (see Troubleshooting).

Option B — Ollama (one-command alternative)

If you don't care about the precise UD-Q4_K_XL tier, Ollama maintains its own quantized build:

curl -fsSL https://ollama.com/install.sh | sh
ollama pull llama3.1:8b
ollama run llama3.1:8b "Explain GQA attention in three sentences."

Ollama's default llama3.1:8b tag is Q4_0 at ~4.7 GB, not Q4_K_XL — the speed will be in the same ballpark but won't match Unsloth's tier exactly. The Ollama llama3.1 library lists alternate quant tags.

Option C — LM Studio (GUI)

LM Studio's built-in catalog search ("Llama 3.1 8B Instruct GGUF") will surface the unsloth UD-Q4_K_XL build alongside the bartowski standard-quant ladder. Pick Llama-3.1-8B-Instruct-UD-Q4_K_XL from the unsloth repo and download — same file as Option A. LM Studio's loader will set --n-gpu-layers to "max" automatically for a 3090.

Running

One-shot prompt via the llama.cpp HTTP server

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama-3.1-8b-instruct",
    "messages": [{"role": "user", "content": "Write a haiku about Ampere GPUs."}]
  }'

The llama.cpp llama-server binary exposes an OpenAI-compatible /v1/chat/completions endpoint on the port chosen above.

Interactive terminal

llama-cli \
  --model unsloth/Llama-3.1-8B-Instruct-GGUF/Llama-3.1-8B-Instruct-UD-Q4_K_XL.gguf \
  --ctx-size 16384 \
  --n-gpu-layers 99 \
  --interactive

Press Ctrl-C to interrupt generation; the CLI keeps the model warm in VRAM until exit.

BF16 native precision

The 3090's 24 GB VRAM can also fit the BF16 build (16.07 GB on disk per the unsloth tier table) with ~6–7 GB of KV-cache headroom at a 16K context. Use allow_patterns=["*BF16*"] in the snapshot_download script above to fetch the full-precision file instead — useful when downstream quality is the priority over peak throughput. Speed will be lower than Q4_K_XL because memory bandwidth, not compute, is the binding constraint on transformer inference, and the BF16 weights are roughly 3× larger.

Results

Speed: LocalScore (a Mozilla Builders Project benchmarking platform built on llamafile / llama.cpp) records Meta Llama 3.1 8B Instruct at Q4_K - Medium at 92.2 tokens/s generation and 3,484 tokens/s prompt processing on the RTX 3090 (Time-to-First-Token 505 ms; LocalScore composite 934 — "excellent" on their scale). LocalScore's methodology runs a standardized eight-scenario suite (input 64–4,096 tokens, output 16–3,072 tokens). The 92.2 figure is the cleanest Llama-3.1-8B-specific RTX 3090 measurement available — variant-named in the source row, on a transparent benchmark page. LocalScore is community-submission-aggregated, so the row may drift by ~1% as more submissions land; if you see slightly different numbers when you click through, that's expected. Surfaced via /check/llama-3-1-8b/rtx-3090; if you run the build above on your own 3090, please submit your numbers so additional measurements corroborate the LocalScore figure.
VRAM usage: No measured peak VRAM is in the backend yet for this pair. As a derived envelope (labelled as derived, not measured): UD-Q4_K_XL weights resident on GPU are 4.99 GB per unsloth's file table; the KV cache for a 16K context on an 8B model with 32 layers and 8 GQA heads adds ~4 GB, putting the runtime peak around ~9–10 GB — well inside the 3090's 24 GB envelope. For an upper-bound reference, Hardware Corner's RTX 3090 24 GB VRAM guide measured Qwen3 8B (adjacent-8B-class) at Q4_K_M / 90K context at ~23.57 GB — i.e. the 3090 only fills its VRAM if you push to near-maximum context. A measured Llama-3.1-8B number will replace this once community data lands; see /check/llama-3-1-8b/rtx-3090 for the canonical figure.
Quality notes: UD-Q4_K_XL is the Unsloth mixed-precision GGUF tier; the Unsloth Dynamic 2.0 docs discuss per-layer sensitivity-aware bit-allocation across the family. On a 24 GB card you can step up to Q8_0 (~10.6 GB) or BF16 (16.07 GB) freely if quality matters more than throughput; there's no quality-floor reason to run anything below Q4_K_M on this hardware.

For the full benchmark data and other-GPU comparisons, see /check/llama-3-1-8b/rtx-3090.

Troubleshooting

`huggingface-cli` 401 / 403 on the Unsloth GGUF repo

The Unsloth quantization inherits gating from the upstream meta-llama/Llama-3.1-8B-Instruct repo. You need to (a) be logged in via huggingface-cli login with a token that has read access, and (b) have clicked "Agree and access" on the upstream Meta repo while logged in — the access carries through to the derived Unsloth mirror. The full license terms are at github.com/meta-llama/llama-models.

Generation slows dramatically past 32K context

Llama 3.1 ships with a 128K-token native context window per the HF model card, but throughput drops as the KV cache fills. Hardware Corner's RTX 3090 24 GB VRAM guide reports that 8B-class models at Q4_K_M can run up to a ~90K-token context on the 3090 before approaching the 24 GB VRAM ceiling; at that scale the VRAM is fully consumed by KV cache and partial-offload behavior emerges. For long-doc workflows on a single 3090, keep --ctx-size at 32K or below for sustained throughput, and use chunking + retrieval beyond that. KV-cache quantization (--cache-type-k q8_0 --cache-type-v q8_0 --flash-attn) roughly halves KV memory if you need to push context higher at Q4_K weights.

Want a different runtime — vLLM or SGLang?

The Meta canonical HF model card documents vllm serve "meta-llama/Llama-3.1-8B-Instruct" and python3 -m sglang.launch_server --model-path "meta-llama/Llama-3.1-8B-Instruct" — both load BF16 weights (16.07 GB) rather than the GGUF quantization. The 3090's 24 GB accommodates BF16 + reasonable batch sizes for production-style serving, but vLLM's default --max-model-len will reserve KV cache aggressively (typically 3–5× the weights-resident figure); use --max-model-len 8192 or --gpu-memory-utilization 0.85 to keep peak VRAM under control on a single 3090. Expect lower per-request latency than llama.cpp at the cost of GPU memory headroom. The llama.cpp GGUF path and the vLLM / SGLang BF16 path are different quant tiers — their throughput numbers are not directly comparable.

Standard Q4_K_M instead of Unsloth's Q4_K_XL?

Both load with the same llama.cpp binary; only the quantization recipe differs. bartowski/Meta-Llama-3.1-8B-Instruct-GGUF ships the standard k-quant ladder if you prefer the conventional flavor — file sizes are nearly identical (Q4_K_M = 4.92 GB per the unsloth tier table). The Q4_K_M throughput will be close to but not identical to Unsloth's UD-Q4_K_XL because the per-layer bit-allocation differs.

Ampere vs Ada / Blackwell — anything special for the 3090?

No. The RTX 3090 is Ampere (sm_86) with full mainline CUDA / FlashAttention 2 / cuBLAS kernel coverage since 2021 — older than the Ada (sm_89) and Blackwell (sm_120) cards but no less supported in the LLM-inference stack. The default pip install torch and pre-built llama.cpp CUDA binaries work out of the box. The 3090 lacks FP8 tensor cores (introduced on Hopper sm_90 / Ada sm_89), but the recommended UD-Q4_K_XL build uses INT4 / k-quant math which runs natively on Ampere — there is no FP8 fallback path to worry about for the canonical GGUF recipe here.

LLM token generation is memory-bandwidth-bound — what does that mean for the 3090?

Transformer inference at small batch sizes is dominated by reading weights from VRAM each token, so peak tokens/sec scales with memory bandwidth, not raw FLOPs. The 3090 has 936 GB/s of GDDR6X bandwidth (vs. the 4090's 1008 GB/s, ~93% ratio); LocalScore's measured 92.2 tok/s on the 3090 sits at ~96% of the 4090's adjacent-8B Hardware Corner figure (96.12 tok/s at 16K context per the 4090 LLM benchmarks page), so the bandwidth-ratio prediction lines up well with real-world measurement. For Q4-class inference on an 8B model, the 3090 is the price-performance sweet spot of the 24 GB tier.