Llama 2 13B on RTX 3060 Ti: The Slow Ceiling of What 8 GB Holds

llm · beginner · 8GB+ VRAM · Jun 27, 2026

This beginner recipe sets up Llama 2 13B on the RTX 3060 Ti through Ollama. The model needs about 8 GB of VRAM, which is everything this card has.

prerequisites
  • NVIDIA RTX 3060 Ti (8GB VRAM) or equivalent 8GB card
  • Ollama installed (https://ollama.com/download)
  • ~8 GB free disk for the Q4 model weights

What You'll Build

A local chat assistant running Meta's Llama 2 13B entirely on your own RTX 3060 Ti through Ollama — no cloud, no API key. The catch, and the whole point of this recipe: at the Q4 quant the 13B just barely fits an 8 GB card, and it runs slowly. Expect about 9 tokens/s, not the brisk speed a 7B model gives you on the same card. This page is the honest "yes it runs, but here's the tradeoff" guide.

Hardware data: RTX 3060 Ti (8GB VRAM) · ~9.25 tokens/s generation (Q4, Ollama 0.5.4) · sits hard against the 8 GB ceiling · See benchmark data

⚠️ It runs, but it's the slow ceiling of what 8 GB holds. Llama 2 13B at Q4 is a ~7.4 GB download (ollama.com/library/llama2), which leaves almost nothing for the KV cache on an 8 GB card. Ollama spills part of the work to CPU/system RAM to fit, and throughput collapses to single digits — the benchmark records 9.25 tokens/s, versus roughly 73 tokens/s for Llama 2 7B on the same card. For interactive use you almost certainly want a 7–8B model (Llama 3.1 8B, Qwen2.5 7B) at the same footprint and far higher speed. Reach for the 13B here only if you specifically need it and can tolerate ~9 tokens/s.

ℹ️ Llama 2 is a 2023 model — a legacy baseline, not a current pick. Meta's Llama 2 13B chat was a strong open-weights release for its time, under the Llama 2 Community License. Newer models are meaningfully stronger at a fraction of this VRAM and speed cost. Run this when you specifically want the original Llama 2 13B — to reproduce a 2023 result, compare against a known baseline, or because a downstream tool pins it.

Requirements

Component | Minimum                      | Tested
GPU       | 8GB VRAM                     | RTX 3060 Ti (8GB)
RAM       | 16GB                         |
Storage   | ~8 GB (Q4 weights)           | 7.4 GB model pull
Software  | Ollama, NVIDIA driver + CUDA | Ollama 0.5.4

Llama 2 13B is Meta's mid-size foundation chat model, released under the Llama 2 Community License. Ollama describes the family as "a collection of foundation language models ranging from 7B to 70B parameters" and notes that the model supports a context length of 4096 tokens by default (ollama.com/library/llama2).

⚠️ No headroom on an 8 GB card. The cited benchmark peaks at 8.0 GB on this 8 GB card — i.e. effectively full. Close other GPU consumers before you run: browsers with hardware acceleration, other models, even a second monitor's compositor can tip you into an out-of-memory error or force an even slower fallback. This recipe documents the Q4 quant specifically because anything heavier does not fit 8 GB at all.

Installation

1. Install Ollama

Download and install Ollama for your OS from ollama.com/download. On Linux:

curl -fsSL https://ollama.com/install.sh | sh

Confirm that the install succeeded and that the NVIDIA driver sees your GPU:

ollama --version
nvidia-smi

2. Pull the Llama 2 13B weights

The llama2:13b tag is a 7.4 GB Q4_0 download (ollama.com/library/llama2). Note that the bare llama2 tag resolves to the 7B build, so you must name the size explicitly to get the 13B:

ollama pull llama2:13b
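To double-check that the 13B tag (and not the 7B default) is what landed on disk, you can ask the local Ollama server which models it holds. The sketch below is a minimal example, assuming Ollama is listening on its default address http://localhost:11434 and that the GET /api/tags response carries the name, size and details fields. The helper names list_local_models and describe_model are made up for this illustration.

```python
import json
import urllib.request

# Assumption: Ollama is running locally on its default port.
OLLAMA_URL = "http://localhost:11434"


def list_local_models(base_url=OLLAMA_URL):
    """Return the 'models' list from Ollama's GET /api/tags endpoint."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=10) as resp:
        return json.load(resp).get("models", [])


def describe_model(models, name):
    """Summarise one pulled model, or return None if the tag is not present."""
    for entry in models:
        if entry.get("name") == name:
            details = entry.get("details", {})
            return {
                "name": name,
                # Size on disk in decimal GB, rounded to one place.
                "size_gb": round(entry.get("size", 0) / 1e9, 1),
                "parameter_size": details.get("parameter_size"),
                "quantization": details.get("quantization_level"),
            }
    return None
```

Calling describe_model(list_local_models(), "llama2:13b") after the pull should return a summary rather than None. If it returns None, the pull either failed or fetched a different tag.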

Running

Start an interactive chat session — always pin the :13b tag, or you'll get the 7B model instead:

ollama run llama2:13b

You'll drop into a >>> prompt — type a question and the model streams its answer:

>>> Explain the difference between TCP and UDP in two sentences.

Type /bye to exit. For a one-shot answer without the interactive prompt, pass the text directly:

ollama run llama2:13b "Write a haiku about garbage collection."

The first run after a fresh pull spends a moment loading weights into VRAM. Then the answer streams out token by token — and on this card, it streams slowly. That is expected: see the Results section for why.
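If you would rather measure the speed than eyeball it, Ollama's HTTP API reports the numbers you need. This is a rough sketch, assuming the server is on its default address http://localhost:11434 and that a non-streaming POST /api/generate response includes eval_count (generated tokens) and eval_duration (nanoseconds). The function names are illustrative, not part of any library.

```python
import json
import urllib.request

# Assumption: Ollama is running locally on its default port.
OLLAMA_URL = "http://localhost:11434"


def tokens_per_second(response):
    """Generation speed from a non-streaming /api/generate response.

    eval_count is the number of generated tokens and eval_duration is
    the time spent generating them, in nanoseconds.
    """
    return response["eval_count"] / response["eval_duration"] * 1e9


def generate_once(prompt, model="llama2:13b", base_url=OLLAMA_URL):
    """Send one non-streaming prompt and return the parsed JSON response."""
    payload = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    request = urllib.request.Request(
        f"{base_url}/api/generate",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    # Generous timeout: at single-digit tokens/s a long reply takes a while.
    with urllib.request.urlopen(request, timeout=600) as resp:
        return json.load(resp)
```

Running tokens_per_second(generate_once("Explain TCP vs UDP in two sentences.")) gives one datapoint. A single run is noisy, so repeat it a few times before comparing against the figure quoted on this page.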

Results

  • Speed: ~9.25 tokens/s generation at Q4 on the RTX 3060 Ti, measured by DatabaseMart under their "Eval Rate(tokens/s)" column (Ollama 0.5.4). For a plain text LLM this is the generation speed — the rate at which the model writes its answer. At ~9 tokens/s, a long reply takes a noticeable wait; this is the single-digit ceiling you hit when a 13B is squeezed onto 8 GB. The 7B sibling on the same card runs roughly 8× faster (~73 tokens/s).
  • VRAM usage: The benchmark record behind this page lists an 8.0 GB peak on this 8 GB card, which is effectively full. DatabaseMart's own table reports its GPU vRAM column as a utilization percentage rather than a GB figure: 84% for the 13B versus 63% for the 7B on the same card, so the 13B leans much harder on the card. Either way, plan for essentially zero spare VRAM and treat the 8.0 GB peak as the number to plan against. See /check
  • Quality notes: This is a single commercial benchmark source. Numbers will vary with your Ollama version, driver, and context length. If you measure your own throughput or peak VRAM on a 3060 Ti, please contribute it via /contribute so the next reader gets a corroborating datapoint.

For the full benchmark data, see /check/llama2-13b/rtx-3060-ti.
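To turn the throughput figures above into wait times, a little arithmetic helps. The sketch below uses the two rates quoted on this page (9.25 and roughly 73 tokens/s). The 500-token reply length is a made-up example, and the estimate ignores prompt processing and model load time.

```python
# Rates quoted in the Results section above.
LLAMA2_13B_TPS = 9.25
LLAMA2_7B_TPS = 73.0

# Hypothetical reply length, chosen only for illustration.
EXAMPLE_REPLY_TOKENS = 500


def reply_seconds(reply_tokens, tokens_per_second):
    """Seconds spent generating a reply of the given length."""
    return reply_tokens / tokens_per_second


def speedup(fast_tps, slow_tps):
    """How many times faster the first rate is than the second."""
    return fast_tps / slow_tps


if __name__ == "__main__":
    slow = reply_seconds(EXAMPLE_REPLY_TOKENS, LLAMA2_13B_TPS)
    fast = reply_seconds(EXAMPLE_REPLY_TOKENS, LLAMA2_7B_TPS)
    print(f"13B: {slow:.1f} s, 7B: {fast:.1f} s, "
          f"ratio: {speedup(LLAMA2_7B_TPS, LLAMA2_13B_TPS):.1f}x")
```

Under those assumptions a 500-token answer takes about 54 seconds on the 13B versus about 7 seconds on the 7B, which is where the "roughly 8×" figure comes from.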

Troubleshooting

Generation is painfully slow (~9 tokens/s)

This is not a misconfiguration — it is the expected result for this pair. A 7.4 GB Q4 weight set on an 8 GB card leaves almost no room for the KV cache, so Ollama keeps part of the model out of GPU memory and the work spills to CPU/system RAM, dragging throughput down to single digits. If ~9 tokens/s is too slow for your use, the fix is not a setting — it's a smaller model. Drop to llama2:7b (~73 tokens/s on this card) or a newer 7–8B model; both fit the 8 GB card with room to spare.
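You can check how much of the loaded model actually sits in GPU memory by asking Ollama what it has running. This is a hedged sketch, assuming the default server address and that GET /api/ps returns a size and a size_vram field (both in bytes) for each loaded model. The helper names are illustrative.

```python
import json
import urllib.request

# Assumption: Ollama is running locally on its default port.
OLLAMA_URL = "http://localhost:11434"


def running_models(base_url=OLLAMA_URL):
    """Return the list of currently loaded models from GET /api/ps."""
    with urllib.request.urlopen(f"{base_url}/api/ps", timeout=10) as resp:
        return json.load(resp).get("models", [])


def gpu_fraction(entry):
    """Share of a loaded model's memory that sits in VRAM (0.0 to 1.0)."""
    total = entry.get("size", 0)
    if total <= 0:
        return 0.0
    return entry.get("size_vram", 0) / total
```

Run a prompt first so the model is loaded, then call gpu_fraction on each entry from running_models(). A value below 1.0 means part of the model is being served from system RAM, which is consistent with the slowdown described above.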

Out of memory / model falls back further to CPU

At 8.0 GB peak on an 8 GB card there is no margin. If you see an OOM error, or generation crawls below even ~9 tokens/s, something else is holding VRAM. Run nvidia-smi to see what is resident, close it, and retry. Don't try the larger tag on this card — llama2:70b is a 39 GB download (ollama.com/library/llama2), far past what an 8 GB card can hold.
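For a quick number on how much VRAM is already taken before you load the model, nvidia-smi can print used and total memory in a machine-readable form. The sketch below wraps that query. The function names are made up for this example, and the values are in MiB as nvidia-smi reports them.

```python
import subprocess


def parse_memory(csv_text):
    """Parse 'memory.used, memory.total' rows (MiB), one row per GPU."""
    gpus = []
    for line in csv_text.strip().splitlines():
        used, total = (int(field) for field in line.split(","))
        gpus.append({"used_mib": used, "total_mib": total, "free_mib": total - used})
    return gpus


def gpu_memory():
    """Query nvidia-smi for used and total memory on every visible GPU."""
    result = subprocess.run(
        [
            "nvidia-smi",
            "--query-gpu=memory.used,memory.total",
            "--format=csv,noheader,nounits",
        ],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_memory(result.stdout)
```

Call gpu_memory() with Ollama stopped or idle. Anything more than a few hundred MiB in used_mib at that point is memory the 13B will not get, so it is worth finding and closing whatever holds it in the plain nvidia-smi process table.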

Short answers or context runs out

Llama 2 supports a context length of 4096 tokens (ollama.com/library/llama2). How much of that window Ollama actually allocates is set by the num_ctx parameter, whose default depends on your Ollama version. Llama 2 13B can take the full 4096-token window, but a larger KV cache eats into the already-exhausted 8 GB budget on this card. If you raise the context length and hit an OOM, lower it again: the 13B weights already fill the VRAM, leaving little room for a large KV cache.
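A back-of-envelope estimate shows why the KV cache is the problem here. The sketch below uses the published Llama 2 13B architecture (40 layers, hidden size 5120, full multi-head attention), which comes from Meta's model config rather than from this page, and it assumes a 16-bit cache. What Ollama really allocates depends on its version and settings, so treat the result as an order-of-magnitude guide.

```python
# Llama 2 13B architecture, from Meta's published model config (not from this page).
N_LAYERS = 40
HIDDEN_SIZE = 5120

# Assumption: keys and values are cached as 16-bit numbers.
BYTES_PER_VALUE = 2


def kv_cache_bytes(context_tokens, n_layers=N_LAYERS, hidden_size=HIDDEN_SIZE,
                   bytes_per_value=BYTES_PER_VALUE):
    """Estimated KV cache size in bytes for a given context length."""
    # Factor of 2: every layer stores one key vector and one value vector per token.
    return 2 * n_layers * hidden_size * bytes_per_value * context_tokens


def kv_cache_gb(context_tokens):
    """Same estimate in decimal gigabytes."""
    return kv_cache_bytes(context_tokens) / 1e9
```

Under these assumptions a full 4096-token window needs about 3.4 GB of cache and a 2048-token window about 1.7 GB. If you treat the 7.4 GB download as the resident weight size, only about 0.6 GB is left on an 8 GB card, so neither fits entirely in VRAM.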

Should I use Llama 2 13B on this card at all?

For most new local-chat use cases, no — a newer 7–8B model (Llama 3.1 8B, Qwen2.5 7B) is the better default at this footprint: same 8 GB-class card, stronger answers, and far higher speed than this 13B's ~9 tokens/s. Use this recipe when you specifically need Llama 2 13B's behaviour. If you measured a newer model on this card, share it via /contribute.

No other widely-reported issues for this pair. Report problems via the submission form.

common questions
How much VRAM does llama2 13b need?

About 8 GB — the minimum this recipe targets.

Which GPUs is llama2 13b tested on?

RTX 3060 Ti (8 GB).

How hard is this setup?

Beginner — follow the steps above.