What You'll Build
A local chat assistant: Qwen2 7B Instruct running entirely on your own 8 GB RTX 3060 Ti through Ollama at the Q4 quant. It answers questions, drafts text, writes and explains code, and reasons over text, with no cloud service and no API key.
Hardware data: RTX 3060 Ti (8GB VRAM) · ~63.7 tokens/s generation (Q4, Ollama 0.5.4) · sits right at the 8 GB ceiling · See benchmark data
ℹ️ This is Qwen2, the prior generation, not Qwen2.5. The catalogue lists this model against the canonical Qwen/Qwen2-7B-Instruct repo and the qwen2:7b Ollama tag. Qwen2.5 7B is a separate, newer model with its own page. If you want Qwen2.5, pull qwen2.5:7b instead; the numbers below are for Qwen2 specifically.
Requirements
| Component | Minimum | Tested |
|---|---|---|
| GPU | 8GB VRAM | RTX 3060 Ti (8GB) |
| RAM | 16GB | — |
| Storage | ~5 GB (Q4 weights) | 4.4 GB model pull |
| Software | Ollama, NVIDIA driver + CUDA | Ollama 0.5.4 |
Qwen2 7B Instruct is an instruction-tuned text large language model from Alibaba's Qwen team, released under the Apache-2.0 license. The model card states plainly: "This repo contains the instruction-tuned 7B Qwen2 model." (huggingface.co/Qwen/Qwen2-7B-Instruct) Ollama describes the family as "Qwen2 is a new series of large language models from Alibaba group" (ollama.com/library/qwen2).
⚠️ Right at the 8 GB wall. The cited benchmark peaks at 8.0 GB on this 8 GB card — there is no headroom. Close other GPU consumers before you run: browsers with hardware acceleration, other models, even a second monitor's compositor can tip you into an out-of-memory error or force a slow CPU fallback. This recipe documents the Q4 quant specifically because heavier quants do not fit 8 GB.
Installation
1. Install Ollama
Download and install Ollama for your OS from ollama.com/download. On Linux:
curl -fsSL https://ollama.com/install.sh | sh
Confirm Ollama is installed and that the NVIDIA driver sees your GPU:
ollama --version
nvidia-smi
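Because this recipe leaves no VRAM headroom, it may help to check how much memory is free before loading anything. The following is a minimal sketch, assuming `nvidia-smi` is on your PATH; `free_vram_mib` is a hypothetical helper name, not part of Ollama or the NVIDIA tooling.

```shell
# Sketch: report free VRAM per GPU before loading the model.
# free_vram_mib is a hypothetical helper, not part of Ollama or nvidia-smi.

# Takes one CSV line of the form "name, total_mib, used_mib" and prints free MiB.
free_vram_mib() {
  printf '%s\n' "$1" | awk -F', *' '{ print $2 - $3 }'
}

# Only query the driver when nvidia-smi is actually available.
if command -v nvidia-smi >/dev/null 2>&1; then
  nvidia-smi --query-gpu=name,memory.total,memory.used --format=csv,noheader,nounits |
    while IFS= read -r gpu; do
      echo "${gpu%%,*}: $(free_vram_mib "$gpu") MiB free"
    done
fi
```

If the free figure is well below the card's total before Ollama has loaded anything, something else is already holding VRAM.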
2. Pull the Qwen2 7B weights
The qwen2:7b tag is a 4.4 GB Q4 download, and latest resolves to qwen2:7b-instruct (ollama.com/library/qwen2):
ollama pull qwen2:7b
Running
Start a chat session:
ollama run qwen2:7b
>>> Write a Python function that returns the nth Fibonacci number.
Or run a one-shot prompt straight from your shell:
ollama run qwen2:7b "Explain what a B-tree is in two sentences."
The model loads onto the GPU and streams its answer token by token. The first run after a fresh pull spends a moment loading weights into VRAM; subsequent prompts in the same session skip that load and start streaming right away.
Results
- Speed: ~63.7 tokens/s generation at Q4 on the RTX 3060 Ti, measured by DatabaseMart under their "Eval Rate(tokens/s)" column (Ollama 0.5.4). For a plain text LLM like Qwen2, this is the generation speed — the rate at which the model writes its answer.
- VRAM usage: The backend records an 8.0 GB peak on this 8 GB card, i.e. effectively full. DatabaseMart's own table lists Qwen2's GPU vRAM utilization at 65% on the card it benchmarked. Either way, plan for very little spare VRAM. See /check
- Quality notes: This is a single commercial benchmark source. Numbers will vary with your Ollama version, driver, and context length. If you measure your own throughput or peak VRAM on a 3060 Ti, please contribute it via /contribute so the next reader gets a corroborating datapoint.
For the full benchmark data, see /check/qwen2-7b/rtx-3060-ti.
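To produce your own generation-speed figure, one option is to ask Ollama's local HTTP API for a non-streamed response and divide its `eval_count` by its `eval_duration` (reported in nanoseconds). This is a sketch under assumptions: the server is on the default `localhost:11434`, `curl` and `jq` are installed, and `tokens_per_sec` is a hypothetical helper name. Running `ollama run qwen2:7b --verbose` should print a comparable eval rate without any scripting.

```shell
# Sketch: compute generation speed from Ollama's eval_count and eval_duration.
# tokens_per_sec is a hypothetical helper; eval_duration is in nanoseconds.
tokens_per_sec() {
  awk -v n="$1" -v ns="$2" 'BEGIN { printf "%.1f\n", n / (ns / 1e9) }'
}

# Only query the server if the tools exist and it answers on the assumed default port.
if command -v curl >/dev/null 2>&1 && command -v jq >/dev/null 2>&1 &&
   curl -fsS --max-time 2 http://localhost:11434/api/version >/dev/null 2>&1; then
  resp=$(curl -fsS http://localhost:11434/api/generate \
    -d '{"model": "qwen2:7b", "prompt": "Explain what a B-tree is in two sentences.", "stream": false}')
  tokens_per_sec "$(printf '%s' "$resp" | jq '.eval_count')" \
                 "$(printf '%s' "$resp" | jq '.eval_duration')"
fi
```

A single short prompt gives a noisy number; averaging several runs with the same prompt and context length makes the result easier to compare against the cited benchmark.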
Troubleshooting
Out of memory / model falls back to CPU
At 8.0 GB peak on an 8 GB card there is no margin. If you see an OOM error or generation suddenly crawls, something else is holding VRAM. Run nvidia-smi to see what is resident, close it, and retry. Don't try a larger quant or the 72B tag on this card — qwen2:72b is a 41 GB download (ollama.com/library/qwen2), far past what an 8 GB card can hold. If you need more headroom, drop to a smaller member of the family such as qwen2:1.5b.
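The check described above can be scripted roughly as follows. This is only a sketch: `sum_used_mib` is a hypothetical helper, and `--query-compute-apps` lists compute (CUDA) processes only, so graphics consumers such as a browser or desktop compositor still need the plain `nvidia-smi` table.

```shell
# Sketch: total the VRAM held by compute processes, from nvidia-smi CSV rows.
# sum_used_mib is a hypothetical helper; it reads "pid, process_name, used_mib" lines on stdin.
sum_used_mib() {
  awk -F', *' '{ total += $3 } END { print total + 0 }'
}

if command -v nvidia-smi >/dev/null 2>&1; then
  apps=$(nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv,noheader,nounits)
  printf '%s\n' "$apps"
  echo "Total held by compute processes: $(printf '%s\n' "$apps" | sum_used_mib) MiB"
fi

# If Ollama is installed, the PROCESSOR column of `ollama ps` shows whether
# the loaded model is fully on the GPU or partly on the CPU.
if command -v ollama >/dev/null 2>&1; then
  ollama ps
fi
```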
Short answers or context runs out
The default Ollama context window is modest. Qwen2 7B supports a long context, but loading more KV cache eats into the already-tight 8 GB budget on this card. If you raise the context length and hit an OOM, lower it again — on an 8 GB card the weights alone fill most of the VRAM, leaving little room for a large KV cache.
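To get a feel for how quickly the KV cache grows, the arithmetic can be sketched as below. The architecture numbers are assumptions taken from the published Qwen2-7B configuration (28 layers, 4 key/value heads, head dimension 128) with an fp16 cache; verify them against the model card before relying on the result. `kv_cache_mib` is a hypothetical helper name.

```shell
# Sketch: rough fp16 KV-cache size for a given context length.
# Assumed architecture (check against the model's config): 28 layers,
# 4 key/value heads, head dimension 128, 2 bytes per cached value.
kv_cache_mib() {
  layers=28; kv_heads=4; head_dim=128; bytes_per_value=2
  # Factor 2 covers keys and values; the product is bytes per token, times tokens.
  echo $(( 2 * layers * kv_heads * head_dim * bytes_per_value * $1 / 1048576 ))
}

for n_ctx in 2048 8192 32768; do
  echo "num_ctx=$n_ctx -> ~$(kv_cache_mib "$n_ctx") MiB of KV cache"
done
```

Under these assumptions the cache costs about 56 KiB per token, so a 32768-token window would need roughly 1.75 GiB on top of the weights, before any compute buffers the runtime allocates. Inside an `ollama run` session the window can be changed with `/set parameter num_ctx 8192`; step it up gradually rather than jumping to the maximum.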
Am I running Qwen2 or Qwen2.5?
These are different models. ollama pull qwen2:7b gives you Qwen2 (this recipe's model, cited against Qwen/Qwen2-7B-Instruct); ollama pull qwen2.5:7b gives you the newer Qwen2.5. The 63.7 tokens/s figure here is for Qwen2 specifically — on the same DatabaseMart table, the separate Qwen2.5 7B row runs slightly slower. Pin the exact tag if you need reproducibility.
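If several Qwen tags are installed, a quick classification by tag prefix can show which family each one belongs to. This is a minimal sketch; `qwen_family` is a hypothetical helper, and it only recognises the plain `qwen2` and `qwen2.5` library names, so other variants fall through to "other".

```shell
# Sketch: tell Qwen2 tags apart from Qwen2.5 tags by their name prefix.
# qwen_family is a hypothetical helper, not an Ollama command.
qwen_family() {
  case "$1" in
    qwen2.5:*|qwen2.5) echo "Qwen2.5" ;;
    qwen2:*|qwen2)     echo "Qwen2" ;;
    *)                 echo "other" ;;
  esac
}

if command -v ollama >/dev/null 2>&1; then
  # Skip the header row of `ollama list`, then classify each installed tag.
  ollama list 2>/dev/null | awk 'NR > 1 { print $1 }' | while IFS= read -r tag; do
    echo "$tag -> $(qwen_family "$tag")"
  done
fi
```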
No other widely-reported issues for this pair. Report problems via the submission form.