How much VRAM does LLaVA 7B need?

About 8 GB — the minimum this recipe targets.

How hard is this setup?

Beginner: follow the steps below.

LLaVA 7B on RTX 3060 Ti: Ask Questions About Images Locally at the 8 GB Floor

What You'll Build

A local visual-question-answering assistant: you point LLaVA 7B at an image file on your own machine and ask it questions in plain language — what is in this picture, read the text on this sign, describe the mood — and it answers as a chatbot, no cloud and no API key. The whole thing runs on an 8 GB RTX 3060 Ti through Ollama at the Q4 quant.

Hardware data: RTX 3060 Ti (8GB VRAM) · ~72 tokens/s text generation (Q4, Ollama 0.5.4) · sits right at the 8 GB ceiling · See benchmark data

ℹ️ The speed number is text generation, not image speed. LLaVA is a vision-language model — it reads an image once, then generates a text answer about it. The ~72 tokens/s figure below measures how fast it writes that text answer (the standard token-generation metric Ollama applies to every model), not how fast it encodes or "analyzes" the image. There is no separate published image-processing speed for this pair; the one-time image encode happens before generation begins.

Requirements

Component	Minimum	Tested
GPU	8GB VRAM	RTX 3060 Ti (8GB)
RAM	16GB	—
Storage	~5 GB (Q4 weights)	4.7 GB model pull
Software	Ollama, NVIDIA driver + CUDA	Ollama 0.5.4

LLaVA 7B is a Vicuna/Llama-2 language model paired with a CLIP visual encoder, released under the Llama 2 Community License. Ollama describes it as a multimodal model that "combines a vision encoder and Vicuna for general-purpose visual and language understanding." (ollama.com/library/llava)

⚠️ Right at the 8 GB wall. The cited benchmark peaks at 8.0 GB on this 8 GB card — there is no headroom. Close other GPU consumers before you run: browsers with hardware acceleration, other models, even a second monitor's compositor can tip you into an out-of-memory error or force a slow CPU fallback. This recipe documents the Q4 quant specifically because heavier quants do not fit 8 GB.

Installation

1. Install Ollama

Download and install Ollama for your OS from ollama.com/download. On Linux:

curl -fsSL https://ollama.com/install.sh | sh

Confirm the install, then check that the NVIDIA driver sees your GPU:

ollama --version
nvidia-smi
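Because this card has no headroom, it can help to see the free VRAM as a single number before loading the model. The following is a minimal sketch, not part of the original recipe: the helper name vram_headroom is hypothetical, and it assumes nvidia-smi's CSV query mode with the name, memory.total and memory.used fields.

```shell
# Sketch: report free VRAM from nvidia-smi's CSV output.
# Input (stdin): lines of "name, total_MiB, used_MiB" as printed by
#   nvidia-smi --query-gpu=name,memory.total,memory.used --format=csv,noheader,nounits
# vram_headroom is a hypothetical helper name used only for this example.
vram_headroom() {
  awk -F', *' '{ printf "%s: %d MiB free of %d MiB\n", $1, $2 - $3, $2 }'
}

# Only query the driver when nvidia-smi is actually installed.
if command -v nvidia-smi >/dev/null 2>&1; then
  nvidia-smi --query-gpu=name,memory.total,memory.used \
             --format=csv,noheader,nounits | vram_headroom
fi
```

If the free figure is well below the card's total before Ollama has loaded anything, another program is already holding VRAM.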

2. Pull the LLaVA 7B weights

The llava:7b tag is a ~4.7 GB Q4 download (ollama.com/library/llava):

ollama pull llava:7b
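To confirm the pull finished, one option is to look for the tag in the output of ollama list. This is a rough sketch: has_model is a hypothetical helper, and it assumes the list output is a table with one header row and the tag name in the first column.

```shell
# Sketch: check whether a tag appears in `ollama list` output.
# Input (stdin): the table printed by `ollama list`; $1: the tag to look for.
# has_model is a hypothetical helper; the column layout is an assumption.
has_model() {
  awk -v tag="$1" 'NR > 1 && $1 == tag { found = 1 } END { exit !found }'
}

# Only ask Ollama when the CLI is installed.
if command -v ollama >/dev/null 2>&1; then
  if ollama list | has_model "llava:7b"; then
    echo "llava:7b is present"
  else
    echo "llava:7b not found; run: ollama pull llava:7b"
  fi
fi
```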

Running

Ask a question about an image by giving its file path inside the prompt. The canonical one-shot form from the Ollama vision-models guide is:

ollama run llava "describe this image: ./art.jpg"

Ollama publishes this command as its example for passing an image to LLaVA (ollama.com/blog/vision-models). Swap in your own path and prompt. The bare llava name resolves to Ollama's latest tag; to run exactly the weights pulled in step 2, use llava:7b instead.

For an interactive session, start the model and include the image path in your question:

ollama run llava
>>> what's in this image? /home/you/photos/receipt.png

LLaVA encodes the image once, then streams a text answer. You can ask follow-up questions about the same image in the same session.
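The same question can also be sent from a script through Ollama's local HTTP API instead of the interactive prompt. The sketch below is an assumption-laden illustration rather than part of this recipe: it assumes the server listens on the default port 11434, that /api/generate accepts base64 images in an images array, and that the prompt contains no double quotes or backslashes. build_payload is a hypothetical helper name.

```shell
# Sketch: build the JSON body for Ollama's /api/generate endpoint.
# $1: image path, $2: prompt (assumed free of double quotes and backslashes).
# build_payload is a hypothetical helper name used only for this example.
build_payload() {
  local img_b64
  # Base64-encode the image and strip line wraps so it fits one JSON string.
  img_b64=$(base64 < "$1" | tr -d '\n')
  printf '{"model":"llava:7b","prompt":"%s","images":["%s"],"stream":false}\n' \
    "$2" "$img_b64"
}

# Usage, with the Ollama server running on its default port:
#   build_payload ./art.jpg "describe this image" |
#     curl -s http://localhost:11434/api/generate -d @-
```

Piping the body to curl with -d @- keeps a large base64 string off the command line, where it could exceed the shell's argument-length limit.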

Results

Speed: ~72 tokens/s text generation at Q4 on the RTX 3060 Ti, measured by DatabaseMart under their "Eval Rate(tokens/s)" column (Ollama 0.5.4). This is the rate at which the model writes its answer, not an image-analysis rate.
VRAM usage: The benchmark entry behind this recipe records an 8.0 GB peak on this 8 GB card, i.e. effectively full. DatabaseMart's own table lists LLaVA's GPU vRAM utilization at 80% on the 8 GB card, which works out to roughly 6.4 GB. The two figures disagree, so plan for the worse case: no spare VRAM. See /check
Quality notes: This is a single commercial benchmark source. Numbers will vary with your Ollama version, driver, and context length. If you measure your own throughput or peak VRAM on a 3060 Ti, please contribute it via /contribute so the next reader gets a corroborating datapoint.
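If you want to measure your own throughput, running ollama run llava:7b --verbose should print an eval rate after each answer. The sketch below shows the arithmetic behind that kind of figure, assuming the eval_count and eval_duration (nanoseconds) fields that Ollama's API response reports. tokens_per_sec is a hypothetical helper and the sample numbers are made up, not a measurement.

```shell
# Sketch: tokens/s from eval_count and eval_duration (nanoseconds).
# $1: eval_count (tokens generated), $2: eval_duration in nanoseconds.
# tokens_per_sec is a hypothetical helper name used only for this example.
tokens_per_sec() {
  awk -v n="$1" -v ns="$2" 'BEGIN { printf "%.1f\n", n / ns * 1e9 }'
}

# Illustrative input only: 360 tokens generated in 5 seconds.
tokens_per_sec 360 5000000000
```

This covers text generation only, matching the "Eval Rate(tokens/s)" metric cited above; it says nothing about image-encode time.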

For the full benchmark data, see /check/llava-7b/rtx-3060-ti.

Troubleshooting

Out of memory / model falls back to CPU

At 8.0 GB peak on an 8 GB card there is no margin. If you see an OOM error or generation suddenly crawls, something else is holding VRAM. Run nvidia-smi to see what is resident, close it, and retry. Don't try a larger quant or the 13B/34B tags on this card — llava:13b is an 8.0 GB download and llava:34b is 20 GB (ollama.com/library/llava), neither of which leaves room to run on 8 GB.
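To see who is holding VRAM, a sorted process list is often easier to read than the full nvidia-smi screen. This is a sketch under assumptions: top_vram_users is a hypothetical helper, and the compute-apps query lists CUDA compute processes only, so graphics clients such as a browser or desktop compositor may show up only in the plain nvidia-smi process table.

```shell
# Sketch: sort GPU compute processes by VRAM held, largest first.
# Input (stdin): "pid, process_name, used_MiB" lines as printed by
#   nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv,noheader,nounits
# top_vram_users is a hypothetical helper name used only for this example.
top_vram_users() {
  sort -t',' -k3,3 -n -r
}

if command -v nvidia-smi >/dev/null 2>&1; then
  nvidia-smi --query-compute-apps=pid,process_name,used_memory \
             --format=csv,noheader,nounits | top_vram_users
fi

# `ollama ps` lists loaded models and whether they sit on GPU or CPU.
if command -v ollama >/dev/null 2>&1; then
  ollama ps || echo "Ollama server not reachable"
fi
```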

"How fast does it look at the image?"

There is no separate published image-encode speed for this pair. The ~72 tokens/s figure is text-generation throughput only. The image is encoded once before the answer streams; on a 7B model at Q4 that encode is fast relative to a multi-hundred-token answer, but it is not what the benchmark measures. Treat the tokens/s number as "how fast the answer is written."

Which LLaVA version am I running?

The site catalogues this model against the canonical LLaVA 1.5 7B repo (Vicuna/Llama-2 + CLIP, Llama 2 Community License). The Ollama llava library tag tracks the LLaVA line and has been updated over time, so the exact point release you pull from Ollama may differ from the 1.5 weights on Hugging Face. For this recipe's purpose — local image Q&A at the 8 GB floor — either behaves the same way; pin a specific tag if you need reproducibility.
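One way to make a run reproducible is to record the ID that ollama list shows next to the tag and compare it later or on another machine. The sketch below assumes the ID sits in the second column of that table and that it changes when the weights behind a tag change; model_id is a hypothetical helper name.

```shell
# Sketch: print the ID column for one tag from `ollama list` output.
# Input (stdin): the table printed by `ollama list`; $1: the tag to look up.
# model_id is a hypothetical helper; the column layout is an assumption.
model_id() {
  awk -v tag="$1" 'NR > 1 && $1 == tag { print $2 }'
}

if command -v ollama >/dev/null 2>&1; then
  ollama list | model_id "llava:7b"
fi
```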

No other widely reported issues for this pair. Report problems via the submission form.