How much VRAM does Gemma 2 9B need?

About 8 GB — the minimum this recipe targets.

How hard is this setup?

Beginner: follow the steps below.

Gemma 2 9B on RTX 3060 Ti: A Capable Local LLM at the 8 GB Floor

What You'll Build

A local, private chat LLM running entirely on your own 8 GB RTX 3060 Ti: you pull Google's Gemma 2 9B instruction-tuned model through Ollama, then chat with it in your terminal, with no cloud, no API key, and no data leaving your machine. At 9B parameters it is the largest, most capable chat model in this size class that still fits an 8 GB card; the trade-off is a moderate generation speed.

Hardware data: RTX 3060 Ti (8GB VRAM) · ~23.8 tokens/s text generation (Q4, Ollama 0.5.4) · sits right at the 8 GB ceiling · See benchmark data

⚠️ Right at the 8 GB wall. The cited benchmark peaks at 8.0 GB on this 8 GB card — there is no headroom. A 9B model is the biggest chat model that fits this tier, and only at the Q4 quant; heavier quants (Q5/Q6/Q8) and the 27B sibling do not fit 8 GB. Close other GPU consumers before you run: browsers with hardware acceleration, other models, even a second monitor's compositor can tip you into an out-of-memory error or a slow CPU fallback.

Requirements

Component	Minimum	Tested
GPU	8GB VRAM	RTX 3060 Ti (8GB)
RAM	16GB	—
Storage	~6 GB (Q4 weights)	5.4 GB model pull
Software	Ollama, NVIDIA driver + CUDA	Ollama 0.5.4

Gemma 2 9B is Google's instruction-tuned, decoder-only text LLM — a plain chat/text-generation model (no vision, no audio). Ollama describes the family as "Google Gemma 2 is a high-performing and efficient model available in three sizes: 2B, 9B, and 27B." (ollama.com/library/gemma2). The canonical weights live on Hugging Face at google/gemma-2-9b-it under the Gemma license — that repo is gated, so downloading directly from Hugging Face requires logging in and accepting Google's usage terms; Ollama redistributes the model so you can pull it without the HF gate.

Installation

1. Install Ollama

Download and install Ollama for your OS from ollama.com/download. On Linux:

curl -fsSL https://ollama.com/install.sh | sh

Confirm that Ollama is installed and that the NVIDIA driver sees your GPU:

ollama --version
nvidia-smi
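
Because this recipe leaves no VRAM headroom, it can help to check how much memory is already in use before loading the model. The following is a minimal sketch, assuming nvidia-smi's CSV query output (total and used memory in MiB, comma-separated); free_vram_mib is a hypothetical helper name, not part of Ollama or the NVIDIA tools.

```shell
#!/bin/sh
# Sketch: report free VRAM per GPU before loading the model.
# free_vram_mib is a hypothetical helper, not an Ollama or NVIDIA command.

# Reads "total, used" pairs in MiB (one GPU per line) on stdin
# and prints the free MiB for each GPU.
free_vram_mib() {
    awk -F', *' 'NF >= 2 { print $1 - $2 }'
}

# Only query the driver when nvidia-smi is actually installed.
if command -v nvidia-smi >/dev/null 2>&1; then
    nvidia-smi --query-gpu=memory.total,memory.used \
               --format=csv,noheader,nounits | free_vram_mib
fi
```

If the number printed is well below the card's total, something else is already holding VRAM and is worth closing first.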

2. Pull the Gemma 2 9B weights

The gemma2:9b tag is a 5.4 GB Q4 download (ollama.com/library/gemma2):

ollama pull gemma2:9b

Running

Start an interactive chat session:

ollama run gemma2:9b

Then just type your prompt at the >>> cursor and Gemma streams its answer back token by token:

ollama run gemma2:9b
>>> Explain the difference between a process and a thread in two sentences.

For a one-shot, non-interactive answer (handy in scripts), pass the prompt inline:

ollama run gemma2:9b "Summarize the plot of Hamlet in three bullet points."

On first run Ollama loads the weights into VRAM (a few seconds), then generation begins. The reported speed below is the rate at which Gemma writes its answer.
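
For scripted use, the one-shot form above can be looped over a file of prompts. This is a sketch under stated assumptions: ask_each_line, OLLAMA_BIN, and MODEL are hypothetical names introduced here, and the prompts file holds one prompt per line. Standard input is redirected from /dev/null for each call so that the command cannot consume the remaining lines of the prompts file.

```shell
#!/bin/sh
# Sketch: run one one-shot query per line of a prompts file.
# ask_each_line, OLLAMA_BIN and MODEL are hypothetical names for this example.
OLLAMA_BIN="${OLLAMA_BIN:-ollama}"
MODEL="${MODEL:-gemma2:9b}"

# $1: path to a text file with one prompt per line (blank lines are skipped).
ask_each_line() {
    while IFS= read -r prompt || [ -n "$prompt" ]; do
        [ -n "$prompt" ] || continue
        printf '### %s\n' "$prompt"
        # Detach stdin so the command does not read the rest of the file.
        "$OLLAMA_BIN" run "$MODEL" "$prompt" < /dev/null
    done < "$1"
}

# Usage (needs a working Ollama install and a pulled model):
#   ask_each_line prompts.txt > answers.txt
```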

Results

Speed: ~23.8 tokens/s text generation at Q4 on the RTX 3060 Ti, measured by DatabaseMart under their "Eval Rate(tokens/s)" column (Ollama 0.5.4). That is the model's generation throughput — comfortable for interactive chat, but not the fastest in this tier: a 9B is the heaviest model that fits 8 GB, so smaller 7B/8B models on the same card run noticeably quicker. If you measure your own throughput on a 3060 Ti, please contribute it via /contribute so the next reader gets a corroborating datapoint.
VRAM usage: The backend records an 8.0 GB peak on this 8 GB card — i.e. effectively full. DatabaseMart's own table lists Gemma 2 9B's "GPU vRAM" utilization at 83% on this card (note: that column is a percentage, not a GB figure). Either way, plan for no spare VRAM. See /check
Quality notes: These numbers come from a single commercial benchmark source and will vary with your Ollama version, driver, and context length. Gemma 2 9B is a strong general-purpose chat model for its size, but at the 8 GB floor you are limited to the Q4 quant: higher-precision quants would improve quality marginally, but they do not fit on this card.
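
To measure your own generation speed for comparison, Ollama's --verbose flag prints timing statistics after each answer. The sketch below extracts the generation rate from that output; eval_rate is a hypothetical helper name, and the "eval rate:" line layout is an assumption based on recent Ollama releases, so it may need adjusting for your version.

```shell
#!/bin/sh
# Sketch: pull the generation speed out of `ollama run --verbose` statistics.
# eval_rate is a hypothetical helper; the "eval rate:" line layout is an
# assumption and may differ between Ollama versions.

# Prints the number from the "eval rate:" line (generation tokens/s),
# skipping the separate "prompt eval rate:" line.
eval_rate() {
    awk '/^[[:space:]]*eval rate:/ { print $3 }'
}

# Usage (needs a working Ollama install; 2>&1 merges the statistics into
# the pipe regardless of which stream Ollama writes them to):
#   ollama run gemma2:9b --verbose "Write a haiku about GPUs." 2>&1 | eval_rate
```

A single short prompt gives a noisy figure; averaging a few runs with longer answers is likely to be more representative.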

For the full benchmark data, see /check/gemma-2-9b/rtx-3060-ti.

Troubleshooting

Out of memory / model falls back to CPU

At 8.0 GB peak on an 8 GB card there is no margin. If you see an OOM error or generation suddenly crawls (a sign Ollama spilled layers to the CPU), something else is holding VRAM. Run nvidia-smi to see what is resident, close it, and retry. Don't try a larger quant or the 27B tag on this card — gemma2:27b is a 16 GB download (ollama.com/library/gemma2), far past what an 8 GB card can hold. Stick to the default Q4 gemma2:9b.
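
One way to confirm a CPU spill is to look at the PROCESSOR column of ollama ps while the model is loaded. The following is a sketch, assuming that column reads "100% GPU" for a fully resident model and shows a CPU/GPU split otherwise; model_fully_on_gpu is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch: detect whether a loaded model is split between CPU and GPU.
# model_fully_on_gpu is a hypothetical helper, not an Ollama command.

# Reads `ollama ps` output on stdin; succeeds only when the row for the
# given model name reports "100% GPU" in its PROCESSOR column.
model_fully_on_gpu() {
    grep -F "$1" | grep -q '100% GPU'
}

# Usage, while the model is loaded (e.g. during a chat in another terminal):
#   ollama ps | model_fully_on_gpu gemma2:9b || echo "layers spilled to CPU"
# To list the processes currently holding VRAM:
#   nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv
```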

Generation feels slow

~23.8 tokens/s is the expected throughput for a 9B model at Q4 on this card — it is the price of running the largest chat model that fits 8 GB. If you need faster responses and can accept a smaller model, a 7B or 8B model at Q4 will run meaningfully quicker on the same 3060 Ti with VRAM to spare. Gemma 2 also ships a 2B variant (gemma2:2b, 1.6 GB) that is far faster if quality at small scale is acceptable for your task.

Which Gemma version am I running?

This recipe targets the canonical instruction-tuned weights at google/gemma-2-9b-it. The Ollama gemma2:9b tag tracks Google's Gemma 2 9B release; pin the exact tag if you need reproducibility across machines.
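
To check that two machines hold the same build of a tag, one option is to compare the ID column that ollama list prints for it. This is a sketch, assuming the usual NAME / ID / SIZE / MODIFIED column layout; model_id is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch: print the ID that `ollama list` shows for one exact tag.
# model_id is a hypothetical helper, not an Ollama command.

# Reads `ollama list` output on stdin and prints the ID column of the
# row whose NAME column matches the given tag exactly.
model_id() {
    awk -v tag="$1" '$1 == tag { print $2 }'
}

# Usage (needs a working Ollama install):
#   ollama list | model_id gemma2:9b
```

If the IDs differ between machines, re-pulling the tag on the older one should bring them back in line.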

No other widely reported issues for this pair. Report problems via the submission form.