How much VRAM does Llama 2 7B need?

About 8 GB — the minimum this recipe targets.

How hard is this setup?

Beginner: follow the steps below.

Llama 2 7B on RTX 3060 Ti: Run a Local Chatbot at the 8 GB Floor

What You'll Build

A fully local chat assistant: you type a prompt and Llama 2 7B answers in your terminal — no cloud, no API key, no data leaving your machine. The whole thing runs on an 8 GB RTX 3060 Ti through Ollama at the Q4 quant.

Hardware data: RTX 3060 Ti (8GB VRAM) · ~73 tokens/s generation (Q4, Ollama 0.5.4) · sits right at the 8 GB ceiling · See benchmark data

ℹ️ Llama 2 is a 2023 model — a legacy baseline, not a current pick. Meta's Llama 2 7B chat was a milestone open-weights model, but newer 7–8B models (Llama 3.1 8B, Qwen2.5 7B) are meaningfully stronger at the same VRAM footprint. Run this when you specifically want the original Llama 2 — for reproducing a 2023 result, comparing against a known baseline, or because a downstream tool pins it. If you just want the best local chatbot that fits 8 GB, prefer a newer 7–8B model.

Requirements

Component	Minimum	Tested
GPU	8GB VRAM	RTX 3060 Ti (8GB)
RAM	16GB	—
Storage	~4 GB (Q4 weights)	3.8 GB model pull
Software	Ollama, NVIDIA driver + CUDA	Ollama 0.5.4

Llama 2 7B is Meta's foundation chat model, released under the Llama 2 Community License. Ollama describes the family as "Llama 2 is a collection of foundation language models ranging from 7B to 70B parameters." and notes the model "is trained on 2 trillion tokens, and by default supports a context length of 4096." (ollama.com/library/llama2)

⚠️ Right at the 8 GB wall. The cited benchmark peaks at 8.0 GB on this 8 GB card, so there is no headroom. Close other GPU consumers before you run: browsers with hardware acceleration, other models, even a second monitor's compositor can tip you into an out-of-memory error or force a slow CPU fallback. This recipe covers only the 7B model at the Q4 quant because the 13B and 70B tags do not fit in 8 GB.
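As a pre-flight check for the warning above, a small shell sketch can report how much VRAM is free before you load the model. It assumes the `nvidia-smi` query flags `--query-gpu=memory.used,memory.total` with `--format=csv,noheader,nounits`, which print one `used, total` line per GPU in MiB. The function name `vram_free_mib` is illustrative and only the first GPU is read.

```shell
# Print free VRAM in MiB from a "used, total" CSV line on stdin
# (the shape nvidia-smi emits with the query flags used below).
# Only the first line (first GPU) is considered.
vram_free_mib() {
  awk -F', *' 'NR == 1 { print $2 - $1 }'
}

# Only query the driver when nvidia-smi is actually installed.
if command -v nvidia-smi >/dev/null 2>&1; then
  nvidia-smi --query-gpu=memory.used,memory.total \
             --format=csv,noheader,nounits | vram_free_mib
else
  echo "nvidia-smi not found; install the NVIDIA driver first" >&2
fi
```

If the number printed is well below the card's total, something else is already holding VRAM and is worth closing before you continue.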

Installation

1. Install Ollama

Download and install Ollama for your OS from ollama.com/download. On Linux:

curl -fsSL https://ollama.com/install.sh | sh

Confirm that Ollama is installed and that the NVIDIA driver sees your GPU:

ollama --version
nvidia-smi

2. Pull the Llama 2 7B weights

The llama2:7b tag is a ~3.8 GB Q4_0 download (ollama.com/library/llama2):

ollama pull llama2:7b
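To confirm the pull succeeded before moving on, one option is to look for the tag in the output of `ollama list`. The sketch below assumes the usual table layout with the model name in the first column under a header row; `has_model` is a hypothetical helper name.

```shell
# Succeed (exit 0) if the tag given as $1 appears in the first column
# of an `ollama list` table read from stdin. The header row is skipped.
has_model() {
  awk -v tag="$1" 'NR > 1 && $1 == tag { found = 1 } END { exit !found }'
}

# Only run the check when the ollama CLI is on PATH.
if command -v ollama >/dev/null 2>&1; then
  if ollama list 2>/dev/null | has_model "llama2:7b"; then
    echo "llama2:7b is ready"
  else
    echo "llama2:7b not found; run: ollama pull llama2:7b"
  fi
fi
```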

Running

Start an interactive chat session:

ollama run llama2

The llama2 tag resolves to the 7B Q4_0 build by default (ollama.com/library/llama2); use ollama run llama2:7b if you want to pin the size explicitly. You'll drop into a >>> prompt — type a question and the model streams its answer:

>>> Explain the difference between TCP and UDP in two sentences.

Type /bye to exit. For a one-shot answer without the interactive prompt, pass the text directly:

ollama run llama2 "Write a haiku about garbage collection."
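If you want the same one-shot answer from a script, Ollama also serves a local HTTP API. The sketch below assumes the default address `http://localhost:11434` and the `/api/generate` endpoint with `"stream": false`, which returns a single JSON object. `generate_payload` is a hypothetical helper that does no JSON escaping, so the prompt must not contain double quotes or backslashes.

```shell
# Build a minimal /api/generate request body.
# $1 = model tag, $2 = prompt (no double quotes or backslashes: no escaping is done).
generate_payload() {
  printf '{"model": "%s", "prompt": "%s", "stream": false}' "$1" "$2"
}

# Send the request only if curl exists and a local Ollama server answers.
if command -v curl >/dev/null 2>&1 &&
   curl -fsS --max-time 2 http://localhost:11434/ >/dev/null 2>&1; then
  curl -fsS http://localhost:11434/api/generate \
       -d "$(generate_payload llama2:7b 'Write a haiku about garbage collection.')"
fi
```

The reply is raw JSON; the generated text is in its `response` field.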

Results

Speed: ~73 tokens/s generation at Q4 on the RTX 3060 Ti, measured by DatabaseMart under their "Eval Rate(tokens/s)" column (Ollama 0.5.4). This is the rate at which the model writes its answer.
VRAM usage: The backend records an 8.0 GB peak on this 8 GB card, i.e. effectively full. (DatabaseMart's own table reports the GPU vRAM column as a utilization percentage, not a GB figure, so the backend's 8.0 GB peak is the number to plan against.)
Quality notes: This is a single commercial benchmark source. Numbers will vary with your Ollama version, driver, and context length. If you measure your own throughput or peak VRAM on a 3060 Ti, please contribute it via /contribute so the next reader gets a corroborating datapoint.
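To measure your own generation speed for comparison, `ollama run` accepts a `--verbose` flag that prints timing statistics after the answer. The sketch below assumes those statistics include a line of the form `eval rate: <number> tokens/s`, as in recent Ollama releases; `eval_rate` is an illustrative helper name, and the exact layout may differ in your version.

```shell
# Extract the generation speed (tokens/s) from `ollama run --verbose` output
# read on stdin. The "prompt eval rate" line is skipped because it does not
# start with "eval".
eval_rate() {
  awk '/^eval rate:/ { print $3 }'
}

# Usage (runs one real generation, so it is left commented out):
#   ollama run llama2:7b --verbose "Explain TCP vs UDP in two sentences." 2>&1 | eval_rate
```

Run it a few times and with the same prompt, since a single short generation is a noisy sample.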

For the full benchmark data, see /check/llama2-7b/rtx-3060-ti.

Troubleshooting

Out of memory / model falls back to CPU

At 8.0 GB peak on an 8 GB card there is no margin. If you see an OOM error or generation suddenly crawls, something else is holding VRAM. Run nvidia-smi to see what is resident, close it, and retry. Don't try the larger tags on this card — llama2:13b is a 7.4 GB download and llama2:70b is 39 GB (ollama.com/library/llama2), neither of which leaves room to run on 8 GB.
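One way to tell a CPU fallback from a merely slow run is `ollama ps`, which lists loaded models with a PROCESSOR column. The sketch below assumes that column reads `100% GPU` when the model is fully offloaded and shows a CPU/GPU split otherwise; `on_gpu` is a hypothetical helper. Note that a model started with `ollama run llama2` is typically listed as `llama2:latest`.

```shell
# Succeed (exit 0) if the model tag in $1 is listed as fully on the GPU
# in an `ollama ps` table read from stdin.
on_gpu() {
  awk -v tag="$1" '$1 == tag && /100% GPU/ { ok = 1 } END { exit !ok }'
}

# Check the currently loaded model, if the ollama CLI is available.
if command -v ollama >/dev/null 2>&1; then
  if ollama ps 2>/dev/null | on_gpu "llama2:latest"; then
    echo "llama2 is fully on the GPU"
  else
    echo "llama2 is not loaded, or is partly or wholly on the CPU"
  fi
fi
```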

Generation is slower than ~73 tokens/s

The cited figure is for the Q4_0 quant with the default 4096-token context under Ollama 0.5.4. A longer context window, a heavier system prompt, or an older driver will all pull throughput down. Keep the context short for the headline speed, and confirm the GPU is actually being used with nvidia-smi during generation (if it idles, Ollama fell back to CPU).
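Context length also matters for memory, not only speed, because the key/value cache grows linearly with the number of tokens kept. As a rough estimate only, the sketch below assumes Llama 2 7B's published architecture (32 layers, hidden size 4096, no grouped-query attention) and a 16-bit cache at 2 bytes per value. It ignores the weights and compute buffers, and `kv_cache_mib` is an illustrative name.

```shell
# Estimate KV-cache size in MiB:
#   2 (keys and values) * layers * hidden_size * 2 bytes * context_tokens
# $1 = layers, $2 = hidden size, $3 = context length in tokens
kv_cache_mib() {
  echo $(( 2 * $1 * $2 * 2 * $3 / 1024 / 1024 ))
}

kv_cache_mib 32 4096 4096   # -> 2048 (about 2 GiB at a 4096-token context)
kv_cache_mib 32 4096 2048   # -> 1024 (half the context, half the cache)
```

On a card with no spare VRAM, this is why a shorter context is the first thing to try when a run spills onto the CPU.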

Should I use Llama 2 at all?

Honestly, for most new local-chat use cases a newer 7–8B model (Llama 3.1 8B, Qwen2.5 7B) is the better default — same 8 GB-class footprint, stronger answers. Use this recipe when you specifically need Llama 2's behaviour. If you measured a newer model on this card, share it via /contribute.

No other widely-reported issues for this pair. Report problems via the submission form.