How much VRAM does Llama 3.1 8B need?

About 8 GB — the minimum this recipe targets.

How hard is this setup?

Beginner: follow the steps below.

Llama 3.1 8B on RTX 3060 Ti: Local Chat via Ollama or llama.cpp + Unsloth UD-Q4_K_XL GGUF

What You'll Build

A local Llama 3.1 8B Instruct chat assistant running on an RTX 3060 Ti (8 GB VRAM) through Ollama (or llama.cpp / LM Studio) with the unsloth/Llama-3.1-8B-Instruct-GGUF UD-Q4_K_XL weights (4.99 GB on disk, Unsloth's mixed-precision Dynamic 2.0 GGUF tier). The 8 GB RTX 3060 Ti is the natural home for a Q4 8B model: at 4.99 GB resident the weights leave just enough room for a working context window on this card.

Hardware data: RTX 3060 Ti (8 GB VRAM) · Q4 GGUF · 57.34 tok/s generation measured at Q4 with Ollama 0.5.4 · ~8 GB peak · See benchmark data

⚠️ 8 GB is the binding constraint: pin a Q4 quant and keep context modest. A Q4 8B GGUF (UD-Q4_K_XL is 4.99 GB on disk per the unsloth file table) fits the 8 GB RTX 3060 Ti with room for a few-thousand-token context, but this card has no headroom for the heavier quant tiers that a 12 GB card can run. UD-Q6_K_XL (7.33 GB) leaves no room for the KV cache on 8 GB, and UD-Q8_K_XL (10.58 GB) is larger than the card's VRAM outright, so stay at Q4 (or Q5_K at most, headless). A desktop 3060 Ti with a monitor attached also exposes only roughly 6.5–7.3 GB of usable VRAM (the display compositor and driver reserve the rest), so keep --ctx-size conservative; see Results and Troubleshooting.

⚠️ Quant pinned — Unsloth UD-Q4_K_XL (or the equivalent Ollama Q4_K_M default). This recipe targets UD-Q4_K_XL from the unsloth/Llama-3.1-8B-Instruct-GGUF repo — Unsloth's mixed-precision GGUF tier featured in their Dynamic 2.0 documentation, with per-layer sensitivity-aware bit-allocation. Standard Q4_K_M from other publishers (bartowski/Meta-Llama-3.1-8B-Instruct-GGUF, the Ollama default) loads with the same llama.cpp binary at nearly the same size (4.92 GB) — the measured 57.34 tok/s figure below is from the Ollama Q4 default tier. See Troubleshooting if you prefer the conventional flavor.

ℹ️ Access — the two recommended paths are public; only the canonical Meta repo is gated. This recipe's install paths need no Meta approval: Ollama's llama3.1:8b and the unsloth/Llama-3.1-8B-Instruct-GGUF mirror are both ungated (gated: false on the Hugging Face API, verified) and download anonymously with no token. Only the canonical meta-llama/Llama-3.1-8B-Instruct repo is gated (gated: manual) — and its BF16 weights do not fit 8 GB anyway, so you only need it for an off-card serving path. Gating and license are separate: the weights are released under the Llama 3.1 Community License, which permits commercial use unless your products exceed 700 million monthly active users in the preceding calendar month (per the Llama 3.1 license) — at which point you must request a separate license from Meta.

Requirements

Component	Minimum	Tested
GPU	8 GB VRAM (Q4 GGUF fits)	RTX 3060 Ti (8 GB)
RAM	16 GB system	—
Storage	4.99 GB (UD-Q4_K_XL GGUF) per unsloth/Llama-3.1-8B-Instruct-GGUF	—
Driver	CUDA 12.x runtime (Ampere sm_86 — default stable wheels)	—
Runtime	llama.cpp / Ollama / LM Studio	Ollama 0.5.4+

The RTX 3060 Ti's 8 GB handles a Q4 8B GGUF: weights resident on GPU are ~5 GB and a modest KV cache fits in the remaining envelope, putting the runtime peak around the full 8 GB the backend recorded (/check/, peak_vram 8.0 GB). On an 8 GB card (~6.5–7.3 GB usable with a display) the practical move is to cap context rather than reach for a heavier quant: UD-Q6_K_XL (7.33 GB on disk) leaves no room for a KV cache here, and UD-Q8_K_XL (10.58 GB) does not fit at all.
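The fit reasoning above can be sketched as a small budget check. This is a rough illustration, not a measurement: the tier sizes are the on-disk figures quoted in this recipe, while the 1.0 GB KV-cache reserve, the 0.5 GB runtime overhead, and the helper names headroom_gb and fits are assumptions made for the example.

```python
# Coarse VRAM budget check for the quant tiers quoted in this recipe.
# ASSUMPTIONS: kv_reserve_gb and overhead_gb are illustrative guesses,
# and on-disk size is used as a stand-in for resident weight size.
TIERS_GB = {
    "UD-Q4_K_XL": 4.99,
    "UD-Q5_K_XL": 5.74,
    "UD-Q6_K_XL": 7.33,
    "UD-Q8_K_XL": 10.58,
    "BF16": 16.07,
}


def headroom_gb(weights_gb, usable_vram_gb, kv_reserve_gb=1.0, overhead_gb=0.5):
    """VRAM left over after weights, a KV-cache reserve, and runtime overhead."""
    return usable_vram_gb - weights_gb - kv_reserve_gb - overhead_gb


def fits(weights_gb, usable_vram_gb, **kwargs):
    """True when the tier leaves non-negative headroom under the assumptions."""
    return headroom_gb(weights_gb, usable_vram_gb, **kwargs) >= 0


# Compare a headless card (8.0 GB) with the low end of the
# with-display range quoted above (6.5 GB usable).
for usable in (8.0, 6.5):
    for tier, size in TIERS_GB.items():
        verdict = "fits" if fits(size, usable) else "does not fit"
        print(f"{usable:.1f} GB usable | {tier:<11} {size:>5.2f} GB | {verdict}")
```

Under these assumed reserves the sketch reproduces the guidance in this section: Q4 fits in both cases, Q5 only on a headless card, and Q6 and above do not fit.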

The RTX 3060 Ti is Ampere (sm_86), which has long-standing mainline CUDA, FlashAttention 2, and cuBLAS kernel coverage, so no special wheel selection is required. The default CUDA 12.x stable builds of llama.cpp, Ollama, and PyTorch already ship sm_86 kernels, and the recommended GGUF path doesn't depend on FlashAttention at all.

Installation

Option A — Ollama (recommended one-line path)

Ollama maintains its own pre-quantized build of Llama 3.1 8B Instruct and handles model download + serving with a single command. Per the Ollama llama3.1:8b tag, the default tag is 4.9 GB at Q4_K_M — the same size and quality tier the 57.34 tok/s benchmark below measured.

1. Install Ollama

# macOS / Linux
curl -fsSL https://ollama.com/install.sh | sh

(Windows: download from ollama.com/download.) Ollama bundles its own CUDA runtime, so the only host-side requirement is a recent NVIDIA driver — no special version is needed for the Ampere-class RTX 3060 Ti.

2. Pull and run the 8B model

ollama pull llama3.1:8b
ollama run llama3.1:8b "Explain GQA attention in three sentences."

The first run downloads ~4.9 GB and loads the model into VRAM (resident ~5 GB; KV cache grows with conversation length). Subsequent prompts in the same session stay warm.
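Once the model is pulled, the same Ollama instance can also be driven from a script. The following is a minimal sketch that assumes Ollama is listening on its default local port (11434) and uses the non-streaming /api/chat endpoint; the helper names build_chat_request, tokens_per_second, and chat are hypothetical. The throughput helper relies on the eval_count and eval_duration (nanoseconds) fields that Ollama returns with a completed response.

```python
import json
import urllib.request

# ASSUMPTION: Ollama is running locally on its default port.
OLLAMA_CHAT_URL = "http://localhost:11434/api/chat"


def build_chat_request(prompt, model="llama3.1:8b"):
    """Build a non-streaming chat payload for a single user prompt."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
    }


def tokens_per_second(response):
    """Generation speed from Ollama's timing fields (eval_duration is in ns)."""
    return response["eval_count"] / response["eval_duration"] * 1e9


def chat(prompt, url=OLLAMA_CHAT_URL):
    """POST the prompt to Ollama and return the decoded JSON response."""
    body = json.dumps(build_chat_request(prompt)).encode("utf-8")
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as resp:
        return json.loads(resp.read())
```

With the server running, reply = chat("Explain GQA attention in three sentences.") returns a dict whose reply["message"]["content"] holds the answer, and tokens_per_second(reply) gives a generation rate you can compare against the benchmark figure quoted in this recipe.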

Option B — llama.cpp + Unsloth UD-Q4_K_XL GGUF

If you want the specific Unsloth Dynamic 2.0 tier (UD-Q4_K_XL) and explicit control over context size and --n-gpu-layers, drive llama.cpp directly.

1. Install llama.cpp (CUDA build)

The RTX 3060 Ti is Ampere sm_86 — mainline llama.cpp ships sm_86 kernels in its default CUDA releases, so any current CUDA build works. Pre-built CUDA binaries are published on the llama.cpp releases page — pick a *-bin-ubuntu-cuda-x64.zip asset (Linux) or the matching Windows CUDA build.

# Linux — pre-built CUDA binary
# Download the latest "llama-bXXXX-bin-ubuntu-cuda-x64.zip" asset from:
#   https://github.com/ggml-org/llama.cpp/releases
# Extract and add the bin/ directory to PATH.

# macOS (Homebrew) — CPU/Metal only, no CUDA, kept here for symmetry with the sibling recipes
brew install llama.cpp

To build from source with CUDA support, follow the llama.cpp CUDA build docs and target the Ampere sm_86 architecture explicitly:

git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build \
  -DGGML_CUDA=ON \
  -DCMAKE_CUDA_ARCHITECTURES=86
cmake --build build --config Release -j $(nproc)

CMAKE_CUDA_ARCHITECTURES=86 builds sm_86 kernels directly for the RTX 3060 Ti, avoiding PTX JIT compilation at first run.

2. Pull the UD-Q4_K_XL GGUF

The fastest path is the llama.cpp Hugging Face shortcut from the Unsloth model card — llama.cpp will fetch the tagged file directly:

# No login needed: the Unsloth GGUF mirror is public (gated: false).
llama-server -hf unsloth/Llama-3.1-8B-Instruct-GGUF:UD-Q4_K_XL

# Only needed for the Python download script below:
pip install huggingface_hub hf_transfer

For more control (specific local directory, pinned filename), pull only the Q4_K_XL file (~5 GB) via snapshot_download instead of the full repo:

# download_q4kxl.py
import os
os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="unsloth/Llama-3.1-8B-Instruct-GGUF",
    local_dir="unsloth/Llama-3.1-8B-Instruct-GGUF",
    allow_patterns=["*UD-Q4_K_XL*"],
)

python download_q4kxl.py

The resulting file is unsloth/Llama-3.1-8B-Instruct-GGUF/Llama-3.1-8B-Instruct-UD-Q4_K_XL.gguf (4.99 GB per the unsloth model card).
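Before starting the server it can be worth confirming that the download is a complete GGUF file rather than a truncated or HTML error payload. The sketch below checks the 4-byte GGUF magic and reads the little-endian format version that follows it, then reports the file size; the function name inspect_gguf is hypothetical, and the expected size is simply the figure quoted above.

```python
import os
import struct


def inspect_gguf(path):
    """Return (format_version, size_gb) for a GGUF file.

    Raises ValueError if the file does not start with the GGUF magic bytes.
    """
    with open(path, "rb") as f:
        header = f.read(8)
    if len(header) < 8 or header[:4] != b"GGUF":
        raise ValueError(f"{path} does not look like a GGUF file")
    # The magic is followed by a little-endian uint32 format version.
    (version,) = struct.unpack("<I", header[4:8])
    size_gb = os.path.getsize(path) / 1e9
    return version, size_gb
```

Calling inspect_gguf on the downloaded Llama-3.1-8B-Instruct-UD-Q4_K_XL.gguf should report a size close to the 4.99 GB listed on the model card; a much smaller number suggests an interrupted download.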

3. Start the server

llama-server \
  --model unsloth/Llama-3.1-8B-Instruct-GGUF/Llama-3.1-8B-Instruct-UD-Q4_K_XL.gguf \
  --ctx-size 8192 \
  --n-gpu-layers 99 \
  --host 0.0.0.0 --port 8080

--n-gpu-layers 99 offloads every layer to the 3060 Ti (at Q4_K_XL the whole 4.99 GB model stays resident; layer streaming is unnecessary). --ctx-size 8192 sets an 8K context window — a safe default on 8 GB. See Troubleshooting for guidance on pushing context higher with KV-cache quantization, and why 16K+ starts to overflow an 8 GB card with a display attached.

Option C — LM Studio (GUI)

LM Studio's built-in catalog search ("Llama 3.1 8B Instruct GGUF") will surface both the Unsloth UD-Q4_K_XL build and the bartowski standard-quant ladder. Pick Llama-3.1-8B-Instruct-UD-Q4_K_XL from the Unsloth repo and download — same file as Option B. LM Studio's loader will set --n-gpu-layers to "max" automatically once it recognizes the 3060 Ti; if you see GPU offload fall short on 8 GB, lower the context length in the loader panel.

Running

One-shot prompt via the llama.cpp HTTP server

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama-3.1-8b-instruct",
    "messages": [{"role": "user", "content": "Write a haiku about local LLMs."}]
  }'

The llama.cpp llama-server binary exposes an OpenAI-compatible /v1/chat/completions endpoint on the port chosen above.
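The same request can be issued from Python using only the standard library. This is a minimal sketch assuming the server from the Installation section is listening on localhost:8080; the helper names chat_payload, extract_reply, and ask are hypothetical, and the response parsing follows the usual OpenAI-style chat-completion shape (choices[0].message.content).

```python
import json
import urllib.request

# ASSUMPTION: llama-server is running locally on the port chosen earlier.
BASE_URL = "http://localhost:8080"


def chat_payload(prompt, model="llama-3.1-8b-instruct"):
    """Build an OpenAI-style chat-completion request body."""
    return {"model": model, "messages": [{"role": "user", "content": prompt}]}


def extract_reply(response):
    """Pull the assistant text out of an OpenAI-style chat-completion response."""
    return response["choices"][0]["message"]["content"]


def ask(prompt, base_url=BASE_URL):
    """Send one prompt to /v1/chat/completions and return the reply text."""
    body = json.dumps(chat_payload(prompt)).encode("utf-8")
    request = urllib.request.Request(
        base_url + "/v1/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as resp:
        return extract_reply(json.loads(resp.read()))
```

With the server up, ask("Write a haiku about local LLMs.") returns the generated text as a string.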

Interactive terminal

llama-cli \
  --model unsloth/Llama-3.1-8B-Instruct-GGUF/Llama-3.1-8B-Instruct-UD-Q4_K_XL.gguf \
  --ctx-size 8192 \
  --n-gpu-layers 99 \
  --interactive

Press Ctrl-C to interrupt generation; the CLI keeps the model warm in VRAM until exit.

Results

Speed: A benchmark on a plain NVIDIA GeForce RTX 3060 Ti measured 57.34 tok/s generation for llama3.1:8b at 4-bit (Q4) under Ollama 0.5.4, per DatabaseMart's Ollama RTX 3060 Ti benchmark; this is the figure the backend ingested as /check/ benchmark id=184 (verdict: runs, 57.34 tok/s, Q4). It is a single commercial-host benchmark, so treat it as representative, not a guarantee. The 3060 Ti's 57.34 tok/s edges out the 52.2 tok/s of a LocalScore run on a 12 GB RTX 3060 at the same Q4 tier, despite the 3060 Ti's smaller VRAM, because token generation is bandwidth-bound and the 3060 Ti's 448 GB/s GDDR6 (NVIDIA RTX 3060 family specs) beats the 3060's 360 GB/s. If you run Ollama or llama.cpp + UD-Q4_K_XL on your own 3060 Ti, please submit your numbers so a community first-party measurement corroborates this single-source figure.
VRAM usage: The backend /check/ page reports verdict: runs with a measured peak of 8.0 GB (benchmark id=184), effectively the full 8 GB card at Q4 with Ollama's default context. Note that the DatabaseMart source page's own "GPU vRAM" column shows 80% for this row, which is a utilization percentage, not a GB figure; the 8.0 GB peak is the backend-recorded measurement, consistent with the Q4 weights (4.99 GB resident per unsloth's file table) plus the KV cache and runtime overhead filling the card. Keep --ctx-size modest (8K is comfortable) so the KV cache doesn't push you into an out-of-memory error. If you measure the resident peak at a specific context length on your own 3060 Ti, please submit your numbers so a second first-party measurement corroborates the backend figure.
Quality notes: UD-Q4_K_XL is the Unsloth mixed-precision GGUF tier; the Unsloth Dynamic 2.0 docs discuss per-layer sensitivity-aware bit-allocation across the family. On an 8 GB 3060 Ti, Q4 (UD-Q4_K_XL 4.99 GB, or Q4_K_M 4.92 GB) is the right tier: there's no need to go below Q4_K_M, and the heavier tiers (UD-Q5_K_XL 5.74 GB headless-only, UD-Q6_K_XL 7.33 GB, UD-Q8_K_XL 10.58 GB) don't leave room for a usable KV cache on this card. BF16 full precision (16.07 GB on disk) is roughly twice the card's 8 GB and is not an option here.
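If you want to record your own VRAM figure while the model is loaded, one option is to poll nvidia-smi. The sketch below assumes nvidia-smi is on PATH and uses its CSV query mode, which reports memory in MiB; the helper names parse_memory_csv and read_gpu_memory are hypothetical.

```python
import subprocess

# nvidia-smi CSV query: one "used, total" row per GPU, values in MiB.
QUERY = [
    "nvidia-smi",
    "--query-gpu=memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_memory_csv(text):
    """Parse 'used, total' rows into a list of (used_mib, total_mib) tuples."""
    rows = []
    for line in text.strip().splitlines():
        used, total = (int(field.strip()) for field in line.split(","))
        rows.append((used, total))
    return rows


def read_gpu_memory():
    """Run nvidia-smi once and return the parsed per-GPU memory rows."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_memory_csv(result.stdout)
```

Calling read_gpu_memory() repeatedly during a long generation and keeping the maximum used value gives an approximate peak for a given --ctx-size. Note that the reading covers everything on the GPU, including the desktop compositor.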

For the full benchmark data and cross-GPU comparisons (3060 / 3090 / 4070 siblings), see /check/llama-3-1-8b/rtx-3060-ti.

Troubleshooting

Out of memory at longer context — the 8 GB ceiling

Llama 3.1 ships with a 128K-token native context window (per Meta's Llama 3.1 announcement), but on an 8 GB card the KV cache is the binding constraint long before you reach it. Keep --ctx-size at 8K or below for the Q4 build on this card with a display attached; 16K is feasible headless but eats into the margin, and a full 128K KV cache alone consumes well over 8 GB and is not possible on the 3060 Ti. KV-cache quantization (--cache-type-k q8_0 --cache-type-v q8_0 --flash-attn) roughly halves KV memory if you need to push context higher. For long-doc workflows, prefer chunking + retrieval over a giant context window.
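The context limits above follow from simple arithmetic on the KV cache. The sketch below uses the Llama 3.1 8B architecture (32 layers, 8 KV heads under GQA, head dimension 128) and counts only the K and V tensors; it ignores compute buffers and other runtime overhead, so real usage is somewhat higher. The q8_0 figure assumes llama.cpp's block layout of 34 bytes per 32 values, and the function name kv_cache_bytes is hypothetical.

```python
GIB = 2**30

# Bytes per cached element. q8_0 stores 32 int8 values plus one fp16 scale.
F16_BYTES = 2.0
Q8_0_BYTES = 34 / 32


def kv_cache_bytes(n_ctx, n_layers=32, n_kv_heads=8, head_dim=128,
                   bytes_per_elem=F16_BYTES):
    """Size of the K and V caches for n_ctx tokens (defaults: Llama 3.1 8B)."""
    # Factor 2 covers the separate K and V tensors in every layer.
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * n_ctx


for ctx in (8192, 16384, 131072):
    f16 = kv_cache_bytes(ctx) / GIB
    q8 = kv_cache_bytes(ctx, bytes_per_elem=Q8_0_BYTES) / GIB
    print(f"ctx={ctx:>6}: f16 KV = {f16:5.2f} GiB, q8_0 KV = {q8:5.2f} GiB")
```

At f16 this works out to 1 GiB at 8K, 2 GiB at 16K, and 16 GiB at 128K, which is why the full native window cannot fit next to roughly 5 GB of weights on this card. The q8_0 cache is about 53% of the f16 size.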

Token generation speed on the 3060 Ti

Transformer inference at small batch sizes is dominated by reading the weights from VRAM each token, so peak tokens/sec scales with memory bandwidth, not raw FLOPs. The RTX 3060 Ti has 448 GB/s of GDDR6 bandwidth (256-bit, NVIDIA RTX 3060 family specs) — higher than the 12 GB RTX 3060's 360 GB/s, which is why the 3060 Ti's measured 57.34 tok/s actually exceeds the 3060's 52.2 tok/s on the same Q4 tier even with less VRAM. (Note a later 3060 Ti revision shipped faster GDDR6X; the common original SKU is GDDR6 at 448 GB/s.) That's a comfortable interactive speed for an 8B model — well above reading pace. Compute (4,864 CUDA cores) is rarely the bottleneck at batch size 1.
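The bandwidth argument can be turned into a back-of-the-envelope ceiling: if every generated token requires reading all weights once, tokens per second cannot exceed bandwidth divided by model size. The sketch below applies that simplification (it ignores KV-cache reads and kernel overhead) to the figures quoted in this recipe; the function names are hypothetical, and 4.92 GB is used as the Q4_K_M size for both cards.

```python
def decode_ceiling_tok_s(bandwidth_gb_s, weights_gb):
    """Upper bound on batch-1 decode speed if each token reads all weights once."""
    return bandwidth_gb_s / weights_gb


def efficiency(measured_tok_s, bandwidth_gb_s, weights_gb):
    """Fraction of the bandwidth ceiling that a measured run achieved."""
    return measured_tok_s / decode_ceiling_tok_s(bandwidth_gb_s, weights_gb)


# (bandwidth in GB/s, measured tok/s) as quoted in this recipe.
CARDS = {
    "RTX 3060 Ti": (448, 57.34),
    "RTX 3060 12 GB": (360, 52.2),
}
WEIGHTS_GB = 4.92  # Q4_K_M size quoted above

for name, (bandwidth, measured) in CARDS.items():
    ceiling = decode_ceiling_tok_s(bandwidth, WEIGHTS_GB)
    share = efficiency(measured, bandwidth, WEIGHTS_GB)
    print(f"{name}: ceiling ~{ceiling:.0f} tok/s, measured {measured} ({share:.0%})")
```

Both measured figures sit below their respective ceilings (roughly 91 and 73 tok/s), which is consistent with decode being limited by memory bandwidth, not compute.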

vLLM or SGLang — not viable on 8 GB

The Meta canonical HF model card documents vllm serve "meta-llama/Llama-3.1-8B-Instruct" and python3 -m sglang.launch_server --model-path "meta-llama/Llama-3.1-8B-Instruct" — both load BF16 weights (16.07 GB on disk per the unsloth tier table) rather than GGUF quantization. BF16 is 2× the 3060 Ti's 8 GB VRAM, so the vLLM/SGLang BF16 path is not viable on this card — stick to the llama.cpp / Ollama GGUF path. Reserve the BF16 serving path for 24 GB+ cards.

Standard Q4_K_M instead of Unsloth's UD-Q4_K_XL?

Both load with the same llama.cpp binary; only the quantization recipe differs. bartowski/Meta-Llama-3.1-8B-Instruct-GGUF ships the standard k-quant ladder if you prefer the conventional flavor — file sizes are nearly identical (Q4_K_M = 4.92 GB per the unsloth tier table). Throughput will be close to but not identical to Unsloth's UD-Q4_K_XL because the per-layer bit-allocation differs. Ollama's llama3.1:8b default tag is standard Q4_K_M (4.9 GB per the Ollama library page) — and is the exact tier the 57.34 tok/s DatabaseMart benchmark above measured.

FlashAttention 2 with `transformers`

If you bypass Ollama / llama.cpp and run a transformers quickstart directly, the RTX 3060 Ti (Ampere sm_86) has full prebuilt FlashAttention-2 wheel coverage — sm_86 is among the oldest architectures with stock FA2 kernels, so you can set attn_implementation="flash_attention_2" and it will work, or simply omit the argument and let PyTorch pick SDPA. This is moot for the recommended GGUF path above — llama.cpp and Ollama don't depend on FlashAttention.