How much VRAM does SD1.5 need?

About 4 GB — the minimum this recipe targets.

How hard is this setup?

Beginner. Follow the steps below.

Stable Diffusion 1.5 on RTX 3060: 512x512 Image Generation with ~8 GB to Spare

What You'll Build

A complete local text-to-image setup running Stable Diffusion 1.5 on an RTX 3060, generating 512x512 images in a few seconds each through AUTOMATIC1111 / Forge (the most-documented path), ComfyUI, or the diffusers Python library. SD 1.5's native training resolution is 512x512, and the model is small — the inference checkpoint is 4.27 GB on disk — so on a 12 GB card you have enormous headroom left over.

Hardware data: RTX 3060 (12GB VRAM) · ~6-8 it/s at 512x512 (Euler a, 20 steps) · See benchmark data

ℹ️ This card is over-provisioned for SD 1.5. The base model needs only a few GB, so on 12 GB the interesting question is not "does it fit" but "what to do with the other ~8+ GB." Good uses are hires-fix, large batch counts, ControlNet stacks, and higher-resolution upscaling; see the "Spending the spare VRAM" section below.

Requirements

Component	Minimum	Tested
GPU	4GB VRAM (with `--medvram`/`--lowvram`)	RTX 3060 (12GB)
RAM	16GB	32GB (per community reports in discussion #9798)
Storage	~5GB (4.27 GB checkpoint)	—
Software	Python 3.10.6, git	AUTOMATIC1111 / Forge / ComfyUI

The canonical inference checkpoint, v1-5-pruned-emaonly.safetensors, is 4.27 GB (HF Files tab). The model card describes it as "ema-only weight. uses less VRAM - suitable for inference"; the larger v1-5-pruned.safetensors (ema+non-ema, 7.7 GB) is intended for fine-tuning. For generation, always pick the emaonly file.

ℹ️ About the repo. The original runwayml/stable-diffusion-v1-5 repo was removed. The canonical community re-host is stable-diffusion-v1-5/stable-diffusion-v1-5 (the HF card states it is a mirror of the deprecated RunwayML repo). License is CreativeML OpenRAIL-M.

Installation

Option A — AUTOMATIC1111 WebUI (most-documented path)

On Windows, the official README's release-package route is the simplest:

1. Download `sd.webui.zip` from the v1.0.0-pre release and extract it.
2. Run `update.bat`.
3. Run `run.bat`.

For a git-based install (Windows or Linux), follow the official README. On Linux:

# Debian/Ubuntu dependencies
sudo apt install wget git python3 python3-venv libgl1 libglib2.0-0

# Clone and launch
git clone https://github.com/AUTOMATIC1111/stable-diffusion-webui
cd stable-diffusion-webui
./webui.sh

On Windows with the automatic installer, per the AUTOMATIC1111 README:

1. Install Python 3.10.6 (newer Python versions do not support the pinned torch).
2. Install git.
3. Run `git clone https://github.com/AUTOMATIC1111/stable-diffusion-webui.git`.
4. Run `webui-user.bat`.

Option B — Forge (optimized A1111 fork)

Stable Diffusion WebUI Forge is a drop-in fork of A1111 (based on SD-WebUI 1.10.1) that, per its README, aims to optimize resource management, speed up inference, and study experimental features. Use the one-click package:

1. Download the one-click package (e.g. webui_forge_cu121_torch231.7z) from the Forge releases.
2. Uncompress it.
3. Run `update.bat`, then `run.bat`.

Forge reuses the same SD-format checkpoints and the same UI as A1111 — the download step below is identical.

Download the checkpoint (A1111 & Forge)

Put the checkpoint in stable-diffusion-webui/models/Stable-diffusion/:

cd stable-diffusion-webui/models/Stable-diffusion
wget https://huggingface.co/stable-diffusion-v1-5/stable-diffusion-v1-5/resolve/main/v1-5-pruned-emaonly.safetensors

Restart the WebUI (or hit the refresh icon next to the checkpoint dropdown) and select v1-5-pruned-emaonly.safetensors.

Option C — ComfyUI

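# Clone ComfyUI and install its Python dependencies
# (if pip does not pull a CUDA-enabled PyTorch build, install one first; see the ComfyUI README)
git clone https://github.com/comfyanonymous/ComfyUI
cd ComfyUI
pip install -r requirements.txt
# Place the checkpoint where ComfyUI expects it
wget -P models/checkpoints https://huggingface.co/stable-diffusion-v1-5/stable-diffusion-v1-5/resolve/main/v1-5-pruned-emaonly.safetensors
# Start the server
python main.py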

ComfyUI's README notes that for smaller models you only need to drop the checkpoint into ComfyUI\models\checkpoints. On Windows the portable NVIDIA build (download, extract, run) is the no-Python-setup route.
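The two download steps above differ only in the target folder. As a minimal sketch using only the Python standard library, the helper below resolves the checkpoint location for each UI and downloads the file if it is missing. The names `MODEL_DIRS`, `checkpoint_path`, and `download_checkpoint` are hypothetical and belong to none of these tools; the folder names and URL are the ones quoted in this recipe.

```python
from pathlib import Path
import urllib.request

# Values taken from the download steps in this recipe.
CHECKPOINT = "v1-5-pruned-emaonly.safetensors"
URL = (
    "https://huggingface.co/stable-diffusion-v1-5/stable-diffusion-v1-5"
    f"/resolve/main/{CHECKPOINT}"
)

# Checkpoint folder relative to each UI's install directory.
MODEL_DIRS = {
    "a1111": Path("models") / "Stable-diffusion",
    "comfyui": Path("models") / "checkpoints",
}


def checkpoint_path(ui: str, install_dir: str) -> Path:
    """Return where the checkpoint should live for the given UI."""
    return Path(install_dir) / MODEL_DIRS[ui] / CHECKPOINT


def download_checkpoint(ui: str, install_dir: str) -> Path:
    """Download the checkpoint unless it is already present (hypothetical helper)."""
    target = checkpoint_path(ui, install_dir)
    if not target.exists():
        target.parent.mkdir(parents=True, exist_ok=True)
        urllib.request.urlretrieve(URL, target)  # roughly a 4.27 GB download
    return target


# Only resolve the path here; call download_checkpoint(...) to fetch the file.
print(checkpoint_path("comfyui", "ComfyUI"))
```

Calling `download_checkpoint("a1111", "stable-diffusion-webui")` would fetch the file into the A1111 folder; since the recipe says the Forge download step is identical, the same key should apply there once `install_dir` points at the Forge WebUI folder.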

Running

Via the WebUI (A1111 / Forge)

The default launch is sufficient on a 12 GB card — no memory flags are needed. Browse to http://127.0.0.1:7860, enter a prompt, keep width/height at 512x512 (SD 1.5's native resolution), pick a sampler (Euler a is the fast default), set steps to ~20, and generate.

If you ever run this on a 4 GB card instead, add a flag to COMMANDLINE_ARGS in webui-user.bat (webui-user.sh on Linux) per the official Optimizations wiki: --medvram ("Makes the Stable Diffusion model consume less VRAM by splitting it into three parts […] Lowers performance, but only by a bit") or, in severe cases, --lowvram. On the RTX 3060 you need neither.
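To make the flag choice concrete, here is a small sketch that formats the COMMANDLINE_ARGS line for a given amount of VRAM. The GB thresholds are illustrative assumptions, not values from the wiki; the only points taken from this recipe are that a 4 GB card wants --medvram, severe cases want --lowvram, and a 12 GB card wants neither. Both function names are hypothetical.

```python
def memory_flag(vram_gb: float) -> str:
    """Pick a memory flag; the GB thresholds are illustrative assumptions."""
    if vram_gb < 4:
        return "--lowvram"
    if vram_gb < 8:
        return "--medvram"
    return ""  # e.g. the 12 GB RTX 3060: no flag needed


def webui_user_line(vram_gb: float, windows: bool = True) -> str:
    """Format the COMMANDLINE_ARGS line for webui-user.bat or webui-user.sh."""
    flag = memory_flag(vram_gb)
    if windows:
        return f"set COMMANDLINE_ARGS={flag}"
    return f'export COMMANDLINE_ARGS="{flag}"'


print(webui_user_line(4))                  # set COMMANDLINE_ARGS=--medvram
print(webui_user_line(12))                 # set COMMANDLINE_ARGS=
print(webui_user_line(4, windows=False))   # export COMMANDLINE_ARGS="--medvram"
```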

Via diffusers (Python)

from diffusers import StableDiffusionPipeline
import torch

pipe = StableDiffusionPipeline.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-v1-5",
    torch_dtype=torch.float16,
)
pipe = pipe.to("cuda")

prompt = "a photo of an astronaut riding a horse on mars"
image = pipe(prompt).images[0]
image.save("astronaut_rides_horse.png")

This is the model card's own snippet (adapted to the canonical org slug). torch_dtype=torch.float16 keeps the resident footprint to a few GB.
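The model card snippet relies on pipeline defaults. As a sketch, the variant below pins the same settings the WebUI walkthrough uses (512x512, 20 steps) and adds a seed and a batch size. `SETTINGS` and `generate` are hypothetical names; the sampler is left at the pipeline default, so this is not an exact match for Euler a in the WebUI. The heavy imports sit inside the function so the settings can be inspected on a machine without a GPU.

```python
# Settings from the WebUI walkthrough: native resolution, about 20 steps.
SETTINGS = {"height": 512, "width": 512, "num_inference_steps": 20}


def generate(prompt: str, seed: int = 0, batch: int = 1):
    """Return a list of PIL images; needs a CUDA GPU plus torch and diffusers."""
    # Imported lazily so SETTINGS can be read without torch installed.
    import torch
    from diffusers import StableDiffusionPipeline

    pipe = StableDiffusionPipeline.from_pretrained(
        "stable-diffusion-v1-5/stable-diffusion-v1-5",
        torch_dtype=torch.float16,
    ).to("cuda")
    # A seeded generator makes the run repeatable on the same setup.
    generator = torch.Generator("cuda").manual_seed(seed)
    result = pipe(
        prompt,
        num_images_per_prompt=batch,
        generator=generator,
        **SETTINGS,
    )
    return result.images


# Example use on a machine with a GPU:
#   images = generate("a photo of an astronaut riding a horse on mars", batch=4)
#   for i, image in enumerate(images):
#       image.save(f"astronaut_{i}.png")
```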

Results

Speed: ~6-8 it/s at 512x512, 20 steps, Euler a, on the RTX 3060 12GB. In AUTOMATIC1111 discussion #9798, multiple 3060 12GB owners report Euler a in this band (one measured ~7 it/s, another ~6.8 it/s, another 5.8 it/s at batch size 1). Independently, the vladmandic sd-extension-system-info benchmark database has 100+ desktop RTX 3060 12GB runs on the exact canonical v1-5-pruned-emaonly.safetensors checkpoint clustering at the same ~6-8 it/s steady state. DPM++ SDE Karras is roughly half that (~3.5 it/s), so sampler choice matters more than anything else.
VRAM usage: SD 1.5 is tiny relative to the card — the inference checkpoint is 4.27 GB on disk and runs well under the 12 GB ceiling at fp16. No memory flags required. See /check/sd1-5/rtx-3060 (no community-submitted peak yet — measured your own? Contribute it).
Quality notes: SD 1.5 is trained at 512x512; generating directly at much larger sizes produces duplicated/distorted subjects. Use hires-fix (generate at 512x512, then upscale) rather than raising the base resolution.

For the full benchmark data, see /check/sd1-5/rtx-3060.
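The it/s figures translate directly into time per image: steps divided by iterations per second. The short calculation below uses the rates quoted in this section; it covers sampling only and ignores model load and VAE decode time, which is an assumption of this sketch.

```python
def seconds_per_image(steps: int, its_per_second: float) -> float:
    """Sampling time only; excludes model load and VAE decode."""
    return steps / its_per_second


# Rates quoted in the Results section, all at 20 steps.
for label, rate in [
    ("Euler a, low end", 6.0),
    ("Euler a, high end", 8.0),
    ("DPM++ SDE Karras", 3.5),
]:
    print(f"{label}: {seconds_per_image(20, rate):.1f} s")
# Euler a, low end: 3.3 s
# Euler a, high end: 2.5 s
# DPM++ SDE Karras: 5.7 s
```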

Spending the spare VRAM

The whole point of a 12 GB card on a 4 GB-class model is headroom. Concrete uses for the unused ~8 GB:

Hires-fix / upscaling: generate at 512x512, then 2x upscale in the same pass. The 3060 has room for the larger latent + an upscaler model without offloading.
Large batch counts: generate 4-8 images per click instead of 1 — the model weights are resident once; extra images mostly cost activation memory.
ControlNet / LoRA stacks: load one or more ControlNet models (pose, depth, canny) alongside the base checkpoint. These add 1-2 GB each and still fit comfortably.
--opt-sdp-attention for speed: per the Optimizations wiki, this "May results in faster speeds than using xFormers on some systems but requires more VRAM", a fair trade on a 12 GB card.
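A rough budget shows how far the headroom stretches. The sketch below uses only numbers from this recipe: a 12 GB card, a 4 GB budget for the base model, and 1-2 GB per ControlNet. It deliberately ignores activation memory for batches and hires-fix, so treat the results as an upper bound rather than a measurement. `headroom_gb` is a hypothetical helper.

```python
CARD_GB = 12.0              # RTX 3060
BASE_GB = 4.0               # "4 GB-class" base model budget used in this recipe
CONTROLNET_GB = (1.0, 2.0)  # per-ControlNet range quoted above


def headroom_gb(n_controlnets: int) -> tuple[float, float]:
    """Return (worst case, best case) GB left after loading n ControlNets."""
    low, high = CONTROLNET_GB
    spare = CARD_GB - BASE_GB
    return spare - n_controlnets * high, spare - n_controlnets * low


for n in range(4):
    worst, best = headroom_gb(n)
    print(f"{n} ControlNet(s): {worst:.0f}-{best:.0f} GB left")
```

Even with three ControlNets at the high end of the range, this estimate leaves about 2 GB for activations.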

Troubleshooting

Generation is unexpectedly slow (~1 it/s instead of 6-8)

Per discussion #9798, the usual culprits on a 3060 are: an unnecessary --medvram flag (you don't need it at 12 GB), a large batch size masquerading as low it/s, or browser/desktop GPU acceleration competing for VRAM. One reporter's generations jumped from ~1.47 it/s to ~3.6 it/s (DPM++ SDE) and ~3.3 to ~6.8 it/s (Euler a) after removing --medvram and minimizing the browser tab.
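A quick way to rule out the first culprit is to scan the COMMANDLINE_ARGS line in webui-user.bat or webui-user.sh for low-VRAM flags. The helper below is a hypothetical sketch, not part of the WebUI; it only checks for the two flags named in this recipe.

```python
LOW_VRAM_FLAGS = {"--medvram", "--lowvram"}


def unneeded_flags(line: str) -> list[str]:
    """Return low-VRAM flags found in a COMMANDLINE_ARGS line."""
    # Keep everything after the first "=", then drop any surrounding quotes.
    _, _, args = line.partition("=")
    tokens = args.replace('"', " ").split()
    return [token for token in tokens if token in LOW_VRAM_FLAGS]


print(unneeded_flags("set COMMANDLINE_ARGS=--medvram --xformers"))  # ['--medvram']
print(unneeded_flags('export COMMANDLINE_ARGS="--lowvram"'))        # ['--lowvram']
print(unneeded_flags("set COMMANDLINE_ARGS="))                      # []
```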

Distorted or duplicated subjects at high resolution

SD 1.5's native resolution is 512x512. Generating directly at 1024x1024+ causes the model to repeat features. Generate at 512x512 and use hires-fix / an upscaler instead.
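One rough way to see how far 1024x1024 sits from the training regime is to compare latent sizes. The sketch assumes the usual SD 1.x layout, which this recipe does not state: the VAE shrinks each side by a factor of 8 and latents have 4 channels.

```python
VAE_DOWNSCALE = 8    # assumption: the SD 1.x VAE shrinks each side by a factor of 8
LATENT_CHANNELS = 4  # assumption: SD 1.x latents have 4 channels


def latent_shape(width: int, height: int) -> tuple[int, int, int]:
    """Return (channels, latent_height, latent_width) for a pixel size."""
    return (LATENT_CHANNELS, height // VAE_DOWNSCALE, width // VAE_DOWNSCALE)


def latent_positions(width: int, height: int) -> int:
    """Number of spatial positions the U-Net processes per step."""
    _, h, w = latent_shape(width, height)
    return h * w


native = latent_positions(512, 512)
large = latent_positions(1024, 1024)
print(latent_shape(512, 512), native)    # (4, 64, 64) 4096
print(latent_shape(1024, 1024), large)   # (4, 128, 128) 16384
print(large / native)                    # 4.0
```

Under these assumptions a direct 1024x1024 generation asks the model to fill four times the latent area it was trained on, which is consistent with the repeated features described above. Hires-fix avoids this by composing the image at 512x512 first.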

Choosing a sampler

Euler a and DPM++ 2M Karras are the fast options (~7 it/s on a 3060); DPM++ SDE Karras is roughly half the speed for similar quality. If throughput matters, stick to Euler a.
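If you use the diffusers route instead of a WebUI, the sampler names map onto scheduler classes. The table below reflects my reading of the diffusers scheduler overview and should be treated as an assumption to verify against the current documentation; `SAMPLERS` and `set_sampler` are hypothetical names.

```python
# WebUI sampler name -> (diffusers scheduler class name, extra config).
# Assumed mapping; check the diffusers scheduler docs before relying on it.
SAMPLERS = {
    "Euler a": ("EulerAncestralDiscreteScheduler", {}),
    "DPM++ 2M Karras": ("DPMSolverMultistepScheduler", {"use_karras_sigmas": True}),
    "DPM++ SDE Karras": ("DPMSolverSinglestepScheduler", {"use_karras_sigmas": True}),
}


def set_sampler(pipe, name: str):
    """Swap the pipeline's scheduler in place, reusing its existing config."""
    import diffusers  # imported lazily; requires the diffusers package

    class_name, extra = SAMPLERS[name]
    scheduler_cls = getattr(diffusers, class_name)
    pipe.scheduler = scheduler_cls.from_config(pipe.scheduler.config, **extra)
    return pipe


# Example, given a loaded StableDiffusionPipeline called pipe:
#   set_sampler(pipe, "Euler a")
#   image = pipe("a photo of an astronaut riding a horse on mars",
#                num_inference_steps=20).images[0]
```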

No other widely-reported issues for this pairing. Report problems via the submission form.