What You'll Build
A local chat assistant running Meta's Llama 2 13B entirely on your own RTX 3060 Ti through Ollama — no cloud, no API key. The catch, and the whole point of this recipe: at the Q4 quant the 13B just barely fits an 8 GB card, and it runs slowly. Expect about 9 tokens/s, not the brisk speed a 7B model gives you on the same card. This page is the honest "yes it runs, but here's the tradeoff" guide.
Hardware data: RTX 3060 Ti (8GB VRAM) · ~9.25 tokens/s generation (Q4, Ollama 0.5.4) · sits hard against the 8 GB ceiling · See benchmark data
⚠️ It runs, but it's the slow ceiling of what 8 GB holds. Llama 2 13B at Q4 is a ~7.4 GB download (ollama.com/library/llama2), which leaves almost nothing for the KV cache on an 8 GB card. Ollama spills part of the work to CPU/system RAM to fit, and throughput collapses to single digits: the benchmark records 9.25 tokens/s, versus roughly 73 tokens/s for Llama 2 7B on the same card. For interactive use you almost certainly want a 7–8B model (Llama 3.1 8B, Qwen2.5 7B), which fits the same card with room to spare and runs far faster. Reach for the 13B here only if you specifically need it and can tolerate ~9 tokens/s.
ℹ️ Llama 2 is a 2023 model — a legacy baseline, not a current pick. Meta's Llama 2 13B chat was a strong open-weights release for its time, under the Llama 2 Community License. Newer models are meaningfully stronger at a fraction of this VRAM and speed cost. Run this when you specifically want the original Llama 2 13B — to reproduce a 2023 result, compare against a known baseline, or because a downstream tool pins it.
Requirements
| Component | Minimum | Tested |
|---|---|---|
| GPU | 8GB VRAM | RTX 3060 Ti (8GB) |
| RAM | 16GB | — |
| Storage | ~8 GB (Q4 weights) | 7.4 GB model pull |
| Software | Ollama, NVIDIA driver + CUDA | Ollama 0.5.4 |
Llama 2 13B is Meta's mid-size foundation chat model, released under the Llama 2 Community License. Ollama describes the family as "a collection of foundation language models ranging from 7B to 70B parameters" and notes that the model supports a context length of 4096 tokens by default (ollama.com/library/llama2).
⚠️ No headroom on an 8 GB card. The cited benchmark peaks at 8.0 GB on this 8 GB card — i.e. effectively full. Close other GPU consumers before you run: browsers with hardware acceleration, other models, even a second monitor's compositor can tip you into an out-of-memory error or force an even slower fallback. This recipe documents the Q4 quant specifically because anything heavier does not fit 8 GB at all.
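To see how much VRAM is actually free before loading the model, a small helper along these lines may be useful. It is a minimal sketch: `free_vram_mib` is a hypothetical name, and it assumes `nvidia-smi` prints one `used, total` pair in MiB per GPU when queried as shown.

```shell
# free_vram_mib (hypothetical helper): reads "used, total" MiB pairs on stdin,
# one GPU per line, and prints the free MiB for each GPU.
free_vram_mib() {
  awk -F', *' '{ print $2 - $1 }'
}

# Only query the GPU if the NVIDIA tooling is installed.
if command -v nvidia-smi >/dev/null 2>&1; then
  nvidia-smi --query-gpu=memory.used,memory.total \
             --format=csv,noheader,nounits | free_vram_mib
fi
```

If the figure is much below the card's total with nothing loaded, something else is already holding VRAM, and the 13B is more likely to hit an out-of-memory error or a slower fallback.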
Installation
1. Install Ollama
Download and install Ollama for your OS from ollama.com/download. On Linux:
curl -fsSL https://ollama.com/install.sh | sh
Confirm that Ollama is installed and that the NVIDIA driver sees your GPU:
ollama --version
nvidia-smi
2. Pull the Llama 2 13B weights
The llama2:13b tag is a 7.4 GB Q4_0 download (ollama.com/library/llama2). Note the bare llama2 tag resolves to the 7B build, so you must name the size explicitly to get the 13B:
ollama pull llama2:13b
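Because the bare tag and the 13B tag are easy to confuse, it can help to check that the 13B tag is the one actually on disk. The sketch below assumes `ollama list` prints a header row followed by one model per line with the tag in the first column; `has_model` is a hypothetical helper name.

```shell
# has_model TAG (hypothetical helper): succeeds if TAG appears in the first
# column of `ollama list`-style output on stdin (the header row is skipped).
has_model() {
  awk -v tag="$1" 'NR > 1 && $1 == tag { found = 1 } END { exit !found }'
}

# Only run the check if the ollama CLI is installed.
if command -v ollama >/dev/null 2>&1; then
  if ollama list | has_model llama2:13b; then
    echo "llama2:13b is present"
  else
    echo "llama2:13b has not been pulled yet"
  fi
fi
```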
Running
Start an interactive chat session — always pin the :13b tag, or you'll get the 7B model instead:
ollama run llama2:13b
You'll drop into a >>> prompt — type a question and the model streams its answer:
>>> Explain the difference between TCP and UDP in two sentences.
Type /bye to exit. For a one-shot answer without the interactive prompt, pass the text directly:
ollama run llama2:13b "Write a haiku about garbage collection."
The first run after a fresh pull spends a moment loading weights into VRAM. Then the answer streams out token by token — and on this card, it streams slowly. That is expected: see the Results section for why.
Results
- Speed: ~9.25 tokens/s generation at Q4 on the RTX 3060 Ti, measured by DatabaseMart under their "Eval Rate(tokens/s)" column (Ollama 0.5.4). For a plain text LLM this is the generation speed — the rate at which the model writes its answer. At ~9 tokens/s, a long reply takes a noticeable wait; this is the single-digit ceiling you hit when a 13B is squeezed onto 8 GB. The 7B sibling on the same card runs roughly 8× faster (~73 tokens/s).
- VRAM usage: The backend records an 8.0 GB peak on this 8 GB card, i.e. effectively full. DatabaseMart's own table reports its GPU vRAM column as a utilization percentage, not a GB figure: it lists 84% for the 13B (versus 63% for the 7B on the same card), so the 13B is leaning much harder on the card. Either way, plan for essentially zero spare VRAM; the backend's 8.0 GB peak is the number to plan against. See /check
- Quality notes: This is a single commercial benchmark source. Numbers will vary with your Ollama version, driver, and context length. If you measure your own throughput or peak VRAM on a 3060 Ti, please contribute it via /contribute so the next reader gets a corroborating datapoint.
For the full benchmark data, see /check/llama2-13b/rtx-3060-ti.
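If you want to measure your own generation speed, one approach is to compute it from the counters Ollama returns. As far as the Ollama API documentation describes it, a non-streaming `/api/generate` response includes `eval_count` (tokens generated) and `eval_duration` (nanoseconds). The sketch below only does the arithmetic; `tokens_per_second` is a hypothetical helper, and the sample numbers are made up to illustrate the calculation, not measured.

```shell
# tokens_per_second EVAL_COUNT EVAL_DURATION_NS (hypothetical helper):
# converts Ollama's eval counters into a tokens/s figure.
tokens_per_second() {
  awk -v n="$1" -v ns="$2" 'BEGIN { printf "%.2f\n", n / (ns / 1e9) }'
}

# The two inputs would come from a request such as:
#   curl -s http://localhost:11434/api/generate \
#     -d '{"model": "llama2:13b", "prompt": "Why is the sky blue?", "stream": false}'
# Illustrative values only: 185 tokens generated in 20 seconds.
tokens_per_second 185 20000000000
```

Running `ollama run llama2:13b --verbose` should print a comparable eval rate after each reply, which is a quicker check when you do not need the raw counters.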
Troubleshooting
Generation is painfully slow (~9 tokens/s)
This is not a misconfiguration — it is the expected result for this pair. A 7.4 GB Q4 weight set on an 8 GB card leaves almost no room for the KV cache, so Ollama keeps part of the model out of GPU memory and the work spills to CPU/system RAM, dragging throughput down to single digits. If ~9 tokens/s is too slow for your use, the fix is not a setting — it's a smaller model. Drop to llama2:7b (~73 tokens/s on this card) or a newer 7–8B model; both fit the 8 GB card with room to spare.
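To judge whether ~9 tokens/s is tolerable, it helps to turn the rates into waiting time. The sketch below assumes a 500-token reply, an illustrative length that is not part of the benchmark; `seconds_for_reply` is a hypothetical helper.

```shell
# seconds_for_reply TOKENS RATE (hypothetical helper): seconds needed to
# generate TOKENS tokens at RATE tokens/s.
seconds_for_reply() {
  awk -v n="$1" -v r="$2" 'BEGIN { printf "%.1f\n", n / r }'
}

seconds_for_reply 500 9.25   # Llama 2 13B rate from the benchmark
seconds_for_reply 500 73     # Llama 2 7B rate on the same card
```

Under that assumption the 13B needs close to a minute for a reply that the 7B finishes in well under ten seconds.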
Out of memory / model falls back further to CPU
At 8.0 GB peak on an 8 GB card there is no margin. If you see an OOM error, or generation crawls below even ~9 tokens/s, something else is holding VRAM. Run nvidia-smi to see what is resident, close it, and retry. Don't try the larger tag on this card — llama2:70b is a 39 GB download (ollama.com/library/llama2), far past what an 8 GB card can hold.
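To check how much of the loaded model has fallen back to CPU, `ollama ps` lists running models with a PROCESSOR column. The sketch below assumes that column reads `100% GPU`, `100% CPU`, or a split such as `27%/73% CPU/GPU`, as described in the Ollama FAQ; `spill_status` is a hypothetical helper that matches on those strings.

```shell
# spill_status (hypothetical helper): reads `ollama ps`-style output on stdin,
# skips the header row, and summarizes where each model is loaded.
spill_status() {
  awk 'NR > 1 {
    if ($0 ~ /CPU\/GPU/)      s = "split between CPU and GPU"
    else if ($0 ~ /100% GPU/) s = "fully on GPU"
    else if ($0 ~ /100% CPU/) s = "fully on CPU"
    else                      s = "unknown"
    print $1 ": " s
  }'
}

# Only query Ollama if the CLI is installed; run this while a model is loaded.
if command -v ollama >/dev/null 2>&1; then
  ollama ps | spill_status
fi
```

A larger CPU share than you saw on an earlier run suggests another process has taken VRAM since then.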
Short answers or context runs out
Llama 2 supports a context length of 4096 tokens by default (ollama.com/library/llama2). The 13B can use that full window, but the KV cache grows with context length, and on this card the 7.4 GB of weights have already used up almost all of the 8 GB budget. If you raise the context length and hit an OOM, lower it again.
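A rough estimate shows why context length matters so much here. The sketch below assumes the published Llama 2 13B shape (40 layers, hidden size 5120, full multi-head attention) and a 16-bit KV cache. None of those figures come from this page, and Ollama's real allocation may differ, so treat the result as an order-of-magnitude guide. `kv_cache_mib` is a hypothetical helper.

```shell
# kv_cache_mib CTX (hypothetical helper): approximate KV-cache size in MiB
# for a CTX-token context.
# Assumed, not from this page: 40 layers, hidden size 5120, 2 bytes per value.
kv_cache_mib() {
  awk -v ctx="$1" 'BEGIN {
    layers = 40; hidden = 5120; bytes = 2
    # 2x covers keys plus values, stored per layer and per token.
    printf "%d\n", 2 * layers * hidden * bytes * ctx / (1024 * 1024)
  }'
}

kv_cache_mib 4096   # full default window
kv_cache_mib 2048   # half the window
```

Under these assumptions the full 4096-token window needs about 3200 MiB on top of the weights, and halving the window halves that figure. In an interactive session, `/set parameter num_ctx 2048` is one way to try a smaller window.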
Should I use Llama 2 13B on this card at all?
For most new local-chat use cases, no — a newer 7–8B model (Llama 3.1 8B, Qwen2.5 7B) is the better default at this footprint: same 8 GB-class card, stronger answers, and far higher speed than this 13B's ~9 tokens/s. Use this recipe when you specifically need Llama 2 13B's behaviour. If you measured a newer model on this card, share it via /contribute.
No other widely-reported issues for this pair. Report problems via the submission form.