Llama 3.1 8B on RTX 4070: Local Chat via Ollama or llama.cpp + Unsloth UD-Q4_K_XL GGUF

What You'll Build

A local Llama 3.1 8B Instruct chat assistant running on an RTX 4070 (12 GB VRAM) through llama.cpp (or Ollama / LM Studio) with the unsloth/Llama-3.1-8B-Instruct-GGUF UD-Q4_K_XL weights (4.99 GB on disk, Unsloth's mixed-precision Dynamic 2.0 GGUF tier). At 4.99 GB resident, the Q4_K_XL build leaves room on the 12 GB card for a 16K-token context window and a display, with a typical runtime peak of ~9–10 GB.

Hardware data: RTX 4070 (12 GB VRAM) · UD-Q4_K_XL GGUF · 76.3 tok/s generation on a LocalScore RTX 4070 run at the Q4_K_M tier (see Results) · Benchmark data: /check/llama-3-1-8b/rtx-4070

⚠️ 12 GB is not 16 GB — mind the usable headroom. A desktop RTX 4070 with a monitor attached exposes roughly 10.5–11.3 GB of usable VRAM (the display compositor and driver reserve the rest); a headless Linux box gets closer to ~11.6 GB. The UD-Q4_K_XL path's ~9–10 GB peak fits this with margin, but the heaviest near-lossless quant that fits a 16 GB card (UD-Q8_K_XL at 10.58 GB on disk) does not leave display headroom here — see Results for which tiers are realistic on 12 GB.

⚠️ Quant pinned — Unsloth UD-Q4_K_XL. This recipe targets UD-Q4_K_XL from the unsloth/Llama-3.1-8B-Instruct-GGUF repo specifically — Unsloth's mixed-precision GGUF tier featured in their Dynamic 2.0 documentation, with per-layer sensitivity-aware bit-allocation. Standard Q4_K_M from other publishers (bartowski/Meta-Llama-3.1-8B-Instruct-GGUF, the Ollama default) loads with the same llama.cpp binary, but the per-layer recipe and resulting quality/speed profile are different — see Troubleshooting if you prefer the conventional flavor.

ℹ️ Access: the two recommended paths are public; only the canonical Meta repo is gated. This recipe's install paths need no Meta approval: Ollama's llama3.1:8b and the unsloth/Llama-3.1-8B-Instruct-GGUF mirror are both ungated (gated: false on the Hugging Face API, verified) and download anonymously with no token. Only the canonical meta-llama/Llama-3.1-8B-Instruct repo is gated (gated: manual), and you need it solely for the optional BF16 vLLM/SGLang path (see Troubleshooting): submit Meta's "Access Llama 3.1" form on the model page, wait for approval (usually fast), then run huggingface-cli login with a read token. Gating and license are separate: the weights are released under the Llama 3.1 Community License, which permits commercial use. The exception is a licensee whose products or services had more than 700 million monthly active users in the calendar month preceding the Llama 3.1 release date (per the Llama 3.1 license); such a licensee must request a separate license from Meta.
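The gating status mentioned above can be checked without a token. The following is a minimal sketch, assuming the public Hub metadata endpoint at https://huggingface.co/api/models/<repo_id> returns a JSON object with a "gated" field (false for open repos, a string such as "manual" for gated ones). The function names are illustrative, not part of any library.

```python
import json
import urllib.request

HF_API = "https://huggingface.co/api/models/"  # public Hub metadata endpoint


def parse_gated(model_info: dict):
    """Return the Hub's 'gated' value: False for open repos, or a string such as 'manual'."""
    return model_info.get("gated", False)


def needs_token(gated) -> bool:
    """Any truthy 'gated' value means access must be approved and a token supplied."""
    return bool(gated)


def fetch_gated(repo_id: str, timeout_s: float = 10.0):
    """Fetch repo metadata anonymously and report its gating status (needs network access)."""
    with urllib.request.urlopen(HF_API + repo_id, timeout=timeout_s) as resp:
        return parse_gated(json.load(resp))


# Example calls (network required, so they are left commented out):
#   fetch_gated("unsloth/Llama-3.1-8B-Instruct-GGUF")
#   fetch_gated("meta-llama/Llama-3.1-8B-Instruct")
```

If the first call reports a falsy value and the second a string, that matches the gated: false / gated: manual split described in this recipe.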

Requirements

Component	Minimum	Tested
GPU	8 GB VRAM (UD-Q4_K_XL fits)	RTX 4070 (12 GB)
RAM	16 GB system	—
Storage	4.99 GB (UD-Q4_K_XL GGUF) per unsloth/Llama-3.1-8B-Instruct-GGUF	—
Driver	CUDA 12.x runtime (Ada sm_89 — default stable wheels)	—
Runtime	llama.cpp / Ollama / LM Studio	llama.cpp b9247+

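The RTX 4070's 12 GB handles Q4_K_XL with margin. Weights resident on GPU are ~5 GB, and at llama.cpp's default f16 cache the KV cache for a 16K context adds ~2 GB (2 for K and V × 32 layers × 8 KV heads × 128-dim heads × 2 bytes = 128 KiB per token); ~4 GB is the figure for 32K. The ~9–10 GB runtime peak quoted in this recipe is therefore a conservative ceiling that also covers compute buffers and a somewhat longer context. On a 12 GB card (~10.5–11.3 GB usable with a display) that leaves thin but real headroom for a longer context or a heavier quant tier (UD-Q6_K_XL at 7.33 GB on disk fits comfortably; UD-Q8_K_XL at 10.58 GB is headless-only). See Results for the throughput-vs-context tradeoff.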
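The KV-cache part of that envelope can be reproduced with a few lines of arithmetic. This is a sketch under stated assumptions: the layer and KV-head counts come from this recipe, while the head dimension of 128 (hidden size 4096 divided by 32 query heads) and the 2-byte f16 cache element are assumptions about the model config and the llama.cpp default.

```python
# Llama 3.1 8B attention shape. N_LAYERS and N_KV_HEADS are quoted in this recipe;
# HEAD_DIM and BYTES_PER_ELEM are assumptions (model config and an f16 KV cache).
N_LAYERS = 32
N_KV_HEADS = 8
HEAD_DIM = 128
BYTES_PER_ELEM = 2


def kv_bytes_per_token() -> int:
    """One K vector and one V vector per layer per KV head."""
    return 2 * N_LAYERS * N_KV_HEADS * HEAD_DIM * BYTES_PER_ELEM


def kv_cache_gib(ctx_tokens: int) -> float:
    """KV cache size in GiB for a context of ctx_tokens tokens."""
    return ctx_tokens * kv_bytes_per_token() / 2**30


for ctx in (16_384, 32_768, 131_072):
    print(f"{ctx:>7} tokens -> {kv_cache_gib(ctx):.1f} GiB")
```

Under these assumptions a 16K context costs about 2 GiB, 32K about 4 GiB, and the full 128K window about 16 GiB. Compute buffers and allocator overhead come on top and are not modelled here.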

Unlike Blackwell GPUs (sm_120), the RTX 4070 is Ada Lovelace (sm_89) and needs no special wheel selection — the default CUDA 12.x stable builds of llama.cpp, Ollama, and PyTorch already ship sm_89 kernels.

Installation

Option A — Ollama (recommended one-line path)

Ollama maintains its own pre-quantized build of Llama 3.1 8B Instruct and handles model download and serving with a single command. Per the Ollama library page, the llama3.1:8b tag is 4.9 GB at Q4_K_M, essentially the same size and quality tier as Unsloth's UD-Q4_K_XL but built with the standard k-quant recipe.

1. Install Ollama

# macOS / Linux
curl -fsSL https://ollama.com/install.sh | sh

(Windows: download from ollama.com/download.) Ollama bundles its own CUDA runtime, so the only host-side requirement is a recent NVIDIA driver — no special version is needed for the Ada-class RTX 4070.

2. Pull and run the 8B model

ollama pull llama3.1:8b
ollama run llama3.1:8b "Explain GQA attention in three sentences."

The first run downloads ~4.9 GB and loads the model into VRAM (~5 GB of resident weights, plus a KV cache sized by the configured context window). Subsequent prompts in the same session reuse the already-loaded model, so they skip the load time.
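Once the model is pulled, the same Ollama instance can be driven from code. The sketch below assumes Ollama's local REST API on its default port 11434 and its /api/chat endpoint with a non-streaming request; the helper names are hypothetical. It also passes num_ctx explicitly, because Ollama's default context window is typically smaller than the 16K used elsewhere in this recipe.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/chat"  # Ollama's default local endpoint


def build_chat_payload(prompt: str, model: str = "llama3.1:8b", num_ctx: int = 16384) -> dict:
    """Build a non-streaming chat request body for Ollama."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,                  # return one JSON object instead of a token stream
        "options": {"num_ctx": num_ctx},  # request a 16K context window
    }


def chat(prompt: str, timeout_s: float = 120.0) -> str:
    """Send one prompt to a running Ollama server and return the reply text."""
    body = json.dumps(build_chat_payload(prompt)).encode("utf-8")
    req = urllib.request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req, timeout=timeout_s) as resp:
        return json.load(resp)["message"]["content"]


# Example (requires a running Ollama server):
#   print(chat("Explain GQA attention in three sentences."))
```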

Option B — llama.cpp + Unsloth UD-Q4_K_XL GGUF

If you want the specific Unsloth Dynamic 2.0 tier (UD-Q4_K_XL) and explicit control over context size and --n-gpu-layers, drive llama.cpp directly.

1. Install llama.cpp (CUDA build)

The RTX 4070 is Ada Lovelace sm_89 — mainline llama.cpp ships sm_89 kernels in its default CUDA releases, so any current CUDA build works. Pre-built CUDA binaries are published on the llama.cpp releases page — pick a *-bin-ubuntu-cuda-x64.zip asset (Linux) or the matching Windows CUDA build.

# Linux — pre-built CUDA binary
# Download the latest "llama-bXXXX-bin-ubuntu-cuda-x64.zip" asset from:
#   https://github.com/ggml-org/llama.cpp/releases
# Extract and add the bin/ directory to PATH.

# macOS (Homebrew) — CPU/Metal only, no CUDA, kept here for symmetry with the sibling recipes
brew install llama.cpp

To build from source with CUDA support, follow the llama.cpp CUDA build docs and target the Ada sm_89 architecture explicitly:

git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build \
  -DGGML_CUDA=ON \
  -DCMAKE_CUDA_ARCHITECTURES=89
cmake --build build --config Release -j $(nproc)

CMAKE_CUDA_ARCHITECTURES=89 builds sm_89 kernels directly for the RTX 4070, avoiding PTX JIT compilation at first run.

2. Pull the UD-Q4_K_XL GGUF

The fastest path is the llama.cpp Hugging Face shortcut from the Unsloth model card — llama.cpp will fetch the tagged file directly:

# No login needed: the Unsloth GGUF mirror is public (gated: false).
# llama-server downloads and caches the file itself; no Python packages are needed for this step.
llama-server -hf unsloth/Llama-3.1-8B-Instruct-GGUF:UD-Q4_K_XL

For more control (specific local directory, pinned filename), install the Hugging Face client and pull only the Q4_K_XL file (~5 GB) via snapshot_download instead of the full repo:

pip install huggingface_hub hf_transfer

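# download_q4kxl.py
import os

# Set this before importing huggingface_hub: the library reads the flag at import time.
os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"

from huggingface_hub import snapshot_download

# Fetch only the UD-Q4_K_XL file instead of every quant tier in the repo.
snapshot_download(
    repo_id="unsloth/Llama-3.1-8B-Instruct-GGUF",
    local_dir="unsloth/Llama-3.1-8B-Instruct-GGUF",
    allow_patterns=["*UD-Q4_K_XL*"],
)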

python download_q4kxl.py

The resulting file is unsloth/Llama-3.1-8B-Instruct-GGUF/Llama-3.1-8B-Instruct-UD-Q4_K_XL.gguf (4.99 GB per the unsloth model card).
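Before pointing llama.cpp at the download, it can be worth confirming that the file is a complete GGUF rather than a truncated transfer or an HTML error page. The following is a minimal sketch, assuming the GGUF layout of a 4-byte "GGUF" magic followed by a little-endian uint32 format version; inspect_gguf is an illustrative helper, not a llama.cpp or huggingface_hub function.

```python
import struct
from pathlib import Path

GGUF_MAGIC = b"GGUF"


def inspect_gguf(path) -> dict:
    """Read the magic and format version at the start of a GGUF file, plus its size."""
    p = Path(path)
    with p.open("rb") as f:
        header = f.read(8)
    if len(header) < 8 or header[:4] != GGUF_MAGIC:
        raise ValueError(f"{p} does not start with the GGUF magic bytes")
    (version,) = struct.unpack("<I", header[4:8])  # little-endian uint32
    return {"version": version, "size_gb": p.stat().st_size / 1e9}


# Example, using the path from this recipe:
#   inspect_gguf("unsloth/Llama-3.1-8B-Instruct-GGUF/Llama-3.1-8B-Instruct-UD-Q4_K_XL.gguf")
```

A reported size far below the ~5 GB quoted above suggests an interrupted download.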

3. Start the server

llama-server \
  --model unsloth/Llama-3.1-8B-Instruct-GGUF/Llama-3.1-8B-Instruct-UD-Q4_K_XL.gguf \
  --ctx-size 16384 \
  --n-gpu-layers 99 \
  --host 0.0.0.0 --port 8080

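--n-gpu-layers 99 offloads every layer to the 4070 (at Q4_K_XL the whole model stays resident in the 12 GB envelope; layer streaming is unnecessary). --ctx-size 16384 sets a 16K context window; see Troubleshooting for guidance on pushing context higher, and why 32K starts to eat into display headroom on a 12 GB card. --host 0.0.0.0 binds every network interface, so the server is reachable from other machines on your network; use --host 127.0.0.1 if only local clients need access.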
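Loading ~5 GB of weights takes a few seconds, so scripts that start the server and then send requests should wait for it to become ready. The sketch below assumes llama-server's GET /health endpoint, which answers 200 once the model is loaded; the function names and timeouts are illustrative.

```python
import time
import urllib.error
import urllib.request


def is_ready(base_url: str = "http://127.0.0.1:8080", timeout_s: float = 2.0) -> bool:
    """True once GET /health answers 200; False while loading or unreachable."""
    try:
        with urllib.request.urlopen(base_url + "/health", timeout=timeout_s) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Covers connection refused, timeouts, and non-2xx answers such as 503.
        return False


def wait_until_ready(base_url: str = "http://127.0.0.1:8080",
                     max_wait_s: float = 60.0, poll_s: float = 1.0) -> bool:
    """Poll /health until the server is ready or max_wait_s has elapsed."""
    deadline = time.monotonic() + max_wait_s
    while time.monotonic() < deadline:
        if is_ready(base_url):
            return True
        time.sleep(poll_s)
    return False
```

Calling wait_until_ready() right after launching llama-server gives a simple gate before the first chat request.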

Option C — LM Studio (GUI)

LM Studio's built-in catalog search ("Llama 3.1 8B Instruct GGUF") will surface both the Unsloth UD-Q4_K_XL build and the bartowski standard-quant ladder. Pick Llama-3.1-8B-Instruct-UD-Q4_K_XL from the Unsloth repo and download it; this is the same file as in Option B. LM Studio exposes GPU offload as a loader setting rather than a command-line flag, and it will set that to the maximum (the equivalent of --n-gpu-layers 99) automatically once it recognizes the 4070.

Running

One-shot prompt via the llama.cpp HTTP server

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama-3.1-8b-instruct",
    "messages": [{"role": "user", "content": "Write a haiku about local LLMs."}]
  }'

The llama.cpp llama-server binary exposes an OpenAI-compatible /v1/chat/completions endpoint on the port chosen above.
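The same endpoint can be called from Python with only the standard library. This is a sketch, assuming the server from step 3 is listening on localhost:8080 and returns the usual OpenAI-style choices and usage fields; build_request, extract_reply, and ask are hypothetical helper names. The tokens-per-second figure it returns is end-to-end (prompt processing included), so it will read lower than a pure generation benchmark.

```python
import json
import time
import urllib.request


def build_request(prompt: str, base_url: str = "http://localhost:8080",
                  max_tokens: int = 128) -> urllib.request.Request:
    """Build a POST request for the OpenAI-compatible chat endpoint."""
    payload = {
        "model": "llama-3.1-8b-instruct",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        base_url + "/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )


def extract_reply(response: dict) -> str:
    """Pull the assistant text out of an OpenAI-style response body."""
    return response["choices"][0]["message"]["content"]


def ask(prompt: str, timeout_s: float = 120.0):
    """Return (reply, end-to-end tokens per second) for one prompt."""
    start = time.perf_counter()
    with urllib.request.urlopen(build_request(prompt), timeout=timeout_s) as resp:
        data = json.load(resp)
    elapsed = time.perf_counter() - start
    tokens = data.get("usage", {}).get("completion_tokens", 0)
    return extract_reply(data), tokens / elapsed


# Example (requires the running server):
#   reply, tok_s = ask("Write a haiku about local LLMs.")
```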

Interactive terminal

llama-cli \
  --model unsloth/Llama-3.1-8B-Instruct-GGUF/Llama-3.1-8B-Instruct-UD-Q4_K_XL.gguf \
  --ctx-size 16384 \
  --n-gpu-layers 99 \
  --interactive

Press Ctrl-C to interrupt generation; the CLI keeps the model warm in VRAM until exit.

Heavier quant on 12 GB — UD-Q6_K_XL, not UD-Q8_K_XL

On a 12 GB card the realistic quality ceiling is UD-Q6_K_XL (7.33 GB on disk per the unsloth tier table), which still leaves room for a 16K KV cache inside the ~10.5–11.3 GB usable envelope. The near-lossless UD-Q8_K_XL (10.58 GB on disk) is headless-only on this card: it does not leave display headroom for the KV cache alongside a monitor. Use allow_patterns=["*UD-Q6_K_XL*"] in the snapshot_download script above to fetch the Q6 file. Expect throughput to drop relative to Q4. Token generation is bound by memory bandwidth rather than compute, because every generated token reads the full set of weights, so a file that is ~47% larger (7.33 GB vs 4.99 GB) yields roughly proportionally fewer tokens per second.
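The bandwidth argument can be turned into a rough estimate. This is a back-of-envelope sketch, not a benchmark: it assumes a 504 GB/s peak memory bandwidth for the RTX 4070 (taken from NVIDIA's published spec, not from this recipe), treats each generated token as one full pass over the weights, and scales the LocalScore Q4_K_M figure quoted in Results by file size. It ignores KV-cache reads and per-quant kernel differences.

```python
# Assumption: peak memory bandwidth from NVIDIA's RTX 4070 spec sheet.
PEAK_BANDWIDTH_GBPS = 504.0
# Figures quoted in this recipe: LocalScore generation speed and Q4_K_M file size.
MEASURED_TOK_S = 76.3
MEASURED_SIZE_GB = 4.92


def bandwidth_ceiling_tok_s(weights_gb: float) -> float:
    """Upper bound if every generated token streams all weights exactly once."""
    return PEAK_BANDWIDTH_GBPS / weights_gb


def scaled_estimate_tok_s(weights_gb: float) -> float:
    """Scale the measured Q4_K_M rate by file size (same bandwidth efficiency assumed)."""
    return MEASURED_TOK_S * MEASURED_SIZE_GB / weights_gb


for name, size_gb in (("UD-Q4_K_XL", 4.99), ("UD-Q6_K_XL", 7.33), ("UD-Q8_K_XL", 10.58)):
    print(f"{name}: ceiling {bandwidth_ceiling_tok_s(size_gb):.0f} tok/s, "
          f"estimate {scaled_estimate_tok_s(size_gb):.0f} tok/s")
```

Under these assumptions UD-Q6_K_XL lands at about 51 tok/s against a theoretical ceiling of about 69 tok/s. Treat both as order-of-magnitude guidance until a measured run is available.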

Results

Speed: A community LocalScore run on a plain NVIDIA GeForce RTX 4070 (12 GB) measured 76.3 tok/s generation, 3192 tok/s prompt processing, 415 ms time-to-first-token, and a LocalScore of 838 for Meta Llama 3.1 8B Instruct at "Q4_K - Medium" (the standard Q4_K_M tier: the same 4.92 GB quant Ollama ships by default, near-identical in size to this recipe's 4.99 GB UD-Q4_K_XL). This is a community-submitted figure on the LocalScore aggregator and may drift by ~1% as more submissions land; treat it as representative, not a guarantee. The figure is for the plain RTX 4070 specifically; LocalScore lists the RTX 4070 Ti (60.0 tok/s), 4070 SUPER (45.3 tok/s), and 4070 Ti SUPER (53.9 tok/s) as separate accelerators. If you run llama.cpp + UD-Q4_K_XL on your own 4070, please submit your numbers so a backend-ingested first-party measurement replaces this aggregator figure.

VRAM usage: The backend /check/ page reports verdict: runs. As a derived envelope (derived, not measured): UD-Q4_K_XL weights resident on GPU are 4.99 GB per unsloth's file table, and the KV cache for a 16K context on an 8B model with 32 layers and 8 GQA KV heads adds ~2 GB at llama.cpp's default f16 precision (128 KiB per token, assuming 128-dim heads); ~4 GB is the figure for 32K. The ~9–10 GB runtime peak used in this recipe is therefore a conservative ceiling, inside the 12 GB card's ~10.5–11.3 GB usable envelope. That the Q4_K tier runs on this exact card is independently corroborated by the LocalScore RTX 4070 run above. Community measurement of the actual resident peak will replace the derived envelope when it lands via /contribute.

Quality notes: UD-Q4_K_XL is the Unsloth mixed-precision GGUF tier; the Unsloth Dynamic 2.0 docs discuss per-layer sensitivity-aware bit-allocation across the family. On a 12 GB 4070 the practical quality ceiling is UD-Q6_K_XL (7.33 GB) with a capped context, or standard Q6_K (6.60 GB per the unsloth file table). The card has enough VRAM that there is no reason to drop below Q4_K_M. BF16 full precision (16.07 GB on disk) overflows the 12 GB card by a wide margin and isn't an option here without heavy offload.
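Since the VRAM envelope above is derived rather than measured, it is easy to check on your own card. The sketch below shells out to nvidia-smi with its CSV query flags and parses the used and total memory in MiB; parse_memory_line and gpu_memory_mib are illustrative helper names.

```python
import subprocess

QUERY = [
    "nvidia-smi",
    "--query-gpu=memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_memory_line(line: str) -> tuple[int, int]:
    """Parse one 'used, total' CSV row (values in MiB) from nvidia-smi."""
    used, total = (int(field.strip()) for field in line.split(","))
    return used, total


def gpu_memory_mib(gpu_index: int = 0) -> tuple[int, int]:
    """Return (used, total) VRAM in MiB for the selected GPU."""
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
    return parse_memory_line(out.strip().splitlines()[gpu_index])


# Example (requires an NVIDIA driver):
#   used, total = gpu_memory_mib()
```

Sampling once with the desktop idle and once during generation near the full context gives the display overhead and the resident peak respectively.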

For the full benchmark data and cross-GPU comparisons (4070 Ti SUPER / 5070 / 4080 siblings), see /check/llama-3-1-8b/rtx-4070.

Troubleshooting

401 / 403 when pulling the BF16 weights from meta-llama

The two recommended paths never hit a gate: Ollama's llama3.1:8b and the unsloth/Llama-3.1-8B-Instruct-GGUF mirror are both public (gated: false, verified on the HF API) and download with no token. A 401/403 only appears on the optional BF16 vLLM/SGLang path, which pulls from the canonical meta-llama/Llama-3.1-8B-Instruct repo (gated: manual). For that path: (a) submit Meta's "Access Llama 3.1" form on the model page and wait for approval, then (b) huggingface-cli login with a read token. The license terms (Llama 3.1 Community License, 700M-MAU commercial threshold) are at llama.com/llama3_1/license and apply regardless of how you obtain the weights.

Generation slows down at longer context — and the 12 GB ceiling

Llama 3.1 ships with a 128K-token native context window (per Meta's Llama 3.1 announcement), but throughput drops as the KV cache fills, and on a 12 GB card the KV cache is the binding constraint long before you reach 128K. Keep --ctx-size at 16K or below for the Q4_K_XL build on this card (32K is feasible but eats into display headroom). A full 128K KV cache at the default f16 precision is about 16 GB on its own (128 KiB per token), which exceeds the card's total VRAM, so it is not possible on the 4070. For long-document workflows, use chunking + retrieval rather than a giant context window.
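To pick a --ctx-size for a given card state, the KV arithmetic can be inverted. This is a sketch under stated assumptions: an f16 KV cache with 128-dim heads (128 KiB per token), the 4.99 GB weight size from this recipe, and a 1 GB reserve for compute buffers that is a guess rather than a measured value. max_context_tokens is an illustrative helper.

```python
# K and V × 32 layers × 8 KV heads × 128-dim heads × 2 bytes (f16). Head dim is an assumption.
KV_BYTES_PER_TOKEN = 2 * 32 * 8 * 128 * 2


def max_context_tokens(usable_gb: float, weights_gb: float = 4.99,
                       reserve_gb: float = 1.0, step: int = 1024) -> int:
    """Largest context, rounded down to a multiple of step, whose KV cache fits the leftover VRAM."""
    free_bytes = (usable_gb - weights_gb - reserve_gb) * 1e9
    if free_bytes <= 0:
        return 0
    tokens = int(free_bytes // KV_BYTES_PER_TOKEN)
    return tokens - tokens % step


for label, usable_gb in (("desktop, low end", 10.5), ("headless", 11.6)):
    print(f"{label}: up to {max_context_tokens(usable_gb)} tokens")
```

With 10.5 GB usable this comes out at 33792 tokens, which matches the observation that 32K is feasible but leaves almost nothing for the display.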

Want a different runtime — vLLM or SGLang?

The Meta canonical HF model card documents vllm serve "meta-llama/Llama-3.1-8B-Instruct" and python3 -m sglang.launch_server --model-path "meta-llama/Llama-3.1-8B-Instruct" — both load BF16 weights (16.07 GB on disk per the unsloth tier table) rather than the GGUF quantization. BF16 alone exceeds the 12 GB 4070's VRAM, so the vLLM/SGLang BF16 path is not viable on this card — stick to the llama.cpp / Ollama GGUF path. Reserve the BF16 serving path for 24 GB+ cards (see the 4090 and 5090 siblings).

Standard Q4_K_M instead of Unsloth's UD-Q4_K_XL?

Both load with the same llama.cpp binary; only the quantization recipe differs. bartowski/Meta-Llama-3.1-8B-Instruct-GGUF ships the standard k-quant ladder if you prefer the conventional flavor — file sizes are nearly identical (Q4_K_M = 4.92 GB per the unsloth tier table). Throughput will be close to but not identical to Unsloth's UD-Q4_K_XL because the per-layer bit-allocation differs. Ollama's llama3.1:8b default tag is also standard Q4_K_M (4.9 GB per the Ollama library page) — and is the exact tier the LocalScore RTX 4070 run above measured.

FlashAttention 2 with transformers

If you bypass Ollama / llama.cpp and run a transformers quickstart directly, the RTX 4070 (Ada sm_89) has full prebuilt FlashAttention-2 wheel coverage. Unlike on Blackwell (sm_120) cards, you can install the flash-attn package and set attn_implementation="flash_attention_2", or simply omit the argument and let PyTorch pick SDPA. This is moot for the recommended GGUF path above: llama.cpp and Ollama don't depend on the flash-attn package.