How much VRAM does gemma 7b need?

About 8 GB — the minimum this recipe targets.

How hard is this setup?

Beginner — follow the steps above.

Gemma 7B on RTX 3060 Ti: Run a Local Chat LLM at the 8 GB Floor

What You'll Build

A local chat assistant: Google's Gemma 7B running entirely on your own RTX 3060 Ti, answering questions, drafting text, writing and explaining code, and doing reasoning over text — no cloud and no API key. The whole thing runs on an 8 GB RTX 3060 Ti through Ollama at the Q4 quant.

Hardware data: RTX 3060 Ti (8GB VRAM) · ~31.95 tokens/s generation (Q4, Ollama 0.5.4) · sits right at the 8 GB ceiling · See benchmark data

ℹ️ This is first-generation Gemma — not Gemma 2. The catalogue lists this model against the canonical google/gemma-7b-it repo and the gemma:7b Ollama tag. Gemma 2 9B is a separate, newer model with its own page and its own Ollama tag (gemma2:9b). If you want Gemma 2, pull gemma2:9b instead — the numbers below are for the original Gemma 7B specifically.

ℹ️ A legacy baseline by 2026. First-generation Gemma shipped in early 2024 and is now an older 7B model. Newer 7–9B-class open models — Gemma 2 9B, Qwen2.5 7B, Llama 3.1 8B — generally score higher on reasoning and coding benchmarks at a similar memory footprint. Run Gemma 7B here for its small size, permissive Google distribution, and tight 8 GB fit; reach for a newer model if you want the strongest quality per gigabyte.

Requirements

Component	Minimum	Tested
GPU	8GB VRAM	RTX 3060 Ti (8GB)
RAM	16GB	—
Storage	~5 GB (Q4 weights)	5.0 GB model pull
Software	Ollama, NVIDIA driver + CUDA	Ollama 0.5.4

Gemma 7B is an open large language model from Google. Ollama describes the family as "Gemma is a family of lightweight, state-of-the-art open models built by Google DeepMind" and as "Gemma is a new open model developed by Google and its DeepMind team" (ollama.com/library/gemma). The weights are released under the Gemma license; the canonical Hugging Face repo google/gemma-7b-it is gated, so you must accept Google's terms there to download directly. Ollama redistributes the model ungated, which is why the pull below works without a token.

⚠️ Right at the 8 GB wall. The cited benchmark peaks at 8.0 GB on this 8 GB card — there is no headroom. Close other GPU consumers before you run: browsers with hardware acceleration, other models, even a second monitor's compositor can tip you into an out-of-memory error or force a slow CPU fallback. This recipe documents the Q4 quant specifically because heavier quants do not fit 8 GB.

Installation

1. Install Ollama

Download and install Ollama for your OS from ollama.com/download. On Linux:

curl -fsSL https://ollama.com/install.sh | sh

Confirm it sees your GPU:

ollama --version
nvidia-smi

2. Pull the Gemma 7B weights

The gemma:7b tag is a 5.0 GB Q4_0 download (ollama.com/library/gemma):

ollama pull gemma:7b

Running

Start a chat session:

ollama run gemma:7b
>>> Write a Python function that returns the nth Fibonacci number.

Or run a one-shot prompt straight from your shell:

ollama run gemma:7b "Explain what a B-tree is in two sentences."

The model loads onto the GPU and streams its answer token by token. The first run after a fresh pull spends a moment loading weights into VRAM; subsequent prompts in the same session reply immediately.

Results

Speed: ~31.95 tokens/s generation at Q4 on the RTX 3060 Ti, measured by DatabaseMart under their "Eval Rate(tokens/s)" column (Ollama 0.5.4). For a plain text LLM like Gemma, this is the generation speed — the rate at which the model writes its answer.
VRAM usage: The backend records an 8.0 GB peak on this 8 GB card — i.e. effectively full. DatabaseMart's own table lists Gemma's GPU vRAM utilization at 81% on the card it benchmarked (a percentage, not a gigabyte figure). Either way, plan for very little spare VRAM. See /check
Quality notes: This is a single commercial benchmark source. Numbers will vary with your Ollama version, driver, and context length. If you measure your own throughput or peak VRAM on a 3060 Ti, please contribute it via /contribute so the next reader gets a corroborating datapoint.

For the full benchmark data, see /check/gemma-7b/rtx-3060-ti.

Troubleshooting

Out of memory / model falls back to CPU

At 8.0 GB peak on an 8 GB card there is no margin. If you see an OOM error or generation suddenly crawls, something else is holding VRAM. Run nvidia-smi to see what is resident, close it, and retry. Don't try a larger quant on this card — the weights at Q4 already fill most of the VRAM. If you need more headroom, drop to a smaller member of the family such as gemma:2b, which is a much lighter download (ollama.com/library/gemma).

Short answers or context runs out

The default Ollama context window is modest. Gemma 7B supports a longer context, but loading more KV cache eats into the already-tight 8 GB budget on this card. If you raise the context length and hit an OOM, lower it again — on an 8 GB card the weights alone fill most of the VRAM, leaving little room for a large KV cache.

Am I running Gemma or Gemma 2?

These are different models. ollama pull gemma:7b gives you the original Gemma 7B (this recipe's model, cited against google/gemma-7b-it); ollama pull gemma2:9b gives you the newer Gemma 2. The 31.95 tokens/s figure here is for first-gen Gemma 7B specifically — on the same DatabaseMart table, the separate Gemma 2 9B row runs slower at 23.80 tokens/s. Pin the exact tag if you need reproducibility.

No other widely-reported issues for this pair. Report problems via the submission form.