How much VRAM does CogVideoX 1.5 need?

About 10 GB in BF16, the minimum this recipe targets; INT8 quantization (PytorchAO) lowers the floor to about 7 GB.

How hard is this setup?

Intermediate: follow the steps below.

CogVideoX 1.5 on RTX 4070: High-Quality Local Video Guide

What You'll Build

Generate cinematic-quality videos locally using CogVideoX 1.5 — one of the highest-quality open-source video models. On a 12 GB RTX 4070, this requires INT8 quantization (PytorchAO) or --lowvram + CPU offload to handle the VAE decode peak; the predecessor CogVideoX-5B fits more comfortably and matches the community times below.

Model-card reference numbers (zai-org/CogVideoX1.5-5B):

BF16 VRAM floor: from 10 GB (single GPU, diffusers)
INT8 VRAM floor (PytorchAO): from 7 GB
Single A100, 5-sec/50-step clip: ~1000 sec (~16–17 min)
Single H100: ~550 sec (~9 min)
Native resolution: 1360×768, frame formula 16N+1, default 81 frames
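The frame formula above can be checked with a small helper. This is a minimal sketch: the function names are hypothetical, and the 16 fps playback rate is an assumption inferred from 81 frames filling a roughly 5-second clip, not a value stated on this page.

```python
def is_valid_frame_count(frames: int) -> bool:
    """True if frames has the form 16N + 1 for some integer N >= 1."""
    return frames > 1 and (frames - 1) % 16 == 0


def snap_frame_count(frames: int) -> int:
    """Round down to the nearest valid 16N + 1 count (minimum 17)."""
    n = max(1, (frames - 1) // 16)
    return 16 * n + 1


def clip_seconds(frames: int, fps: int = 16) -> float:
    """Clip length in seconds; fps=16 is an assumption, not a documented value."""
    return frames / fps


print(is_valid_frame_count(81))   # True
print(snap_frame_count(100))      # 97
print(clip_seconds(81))           # 5.0625
```

Snapping a requested length before queueing a run avoids finding out about an invalid frame count only after the model has loaded.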

Variant attribution — important. The widely-cited community numbers of ~10 min on RTX 4070 and ~15 min on RTX 4070 Super (HF discussion) are for CogVideoX-5B — the 1.0 release, not CogVideoX-1.5-5B. The A100 reference times above show 1.5 is roughly 5–6× slower than 5B at the same resolution/step count (A100: ~1000 sec for 1.5-5B vs ~180 sec for 5B). Expect a 4070 to need 30+ min per 5-sec 1.5 clip at 1360×768 native, or to use the 5B predecessor for the 10–15 min times below. For honest 1.5-at-native-resolution timing on 24 GB hardware, see the RTX 4090 recipe →.
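The arithmetic behind the slowdown figure can be reproduced directly. The sketch below uses only the numbers quoted above; the assumption that the A100 ratio carries over unchanged to an RTX 4070 is a simplification (it ignores INT8 and offload overhead), so treat the result as a rough order of magnitude rather than a measurement.

```python
A100_SEC_1_5 = 1000   # CogVideoX 1.5-5B on A100, from the model-card numbers above
A100_SEC_5B = 180     # CogVideoX-5B on A100, as quoted above
RTX4070_MIN_5B = 10   # community report for CogVideoX-5B on an RTX 4070


def slowdown_ratio(sec_new: float, sec_old: float) -> float:
    """How many times slower the newer variant is on the same GPU."""
    return sec_new / sec_old


def estimate_minutes(baseline_min: float, ratio: float) -> float:
    """Naive estimate: assumes the ratio measured on A100 carries over unchanged."""
    return baseline_min * ratio


ratio = slowdown_ratio(A100_SEC_1_5, A100_SEC_5B)
print(f"slowdown: {ratio:.1f}x")                                       # slowdown: 5.6x
print(f"estimate: {estimate_minutes(RTX4070_MIN_5B, ratio):.0f} min")  # estimate: 56 min
```

The naive estimate lands well above the 30-minute mark, which is why this guide quotes "30+ min" as a lower bound instead of a specific time.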

CogVideoX vs LTX vs Wan: CogVideoX prioritizes quality over speed. Use it when you want the best visual output and can wait longer. Compare →

Requirements

Component	Minimum (1.5-5B, INT8)	Tested (CogVideoX-5B BF16)
GPU	RTX 4060 Ti / RTX 4070 12GB	RTX 4070 / 4070 Super 12GB
VRAM	7 GB (INT8) / 10 GB (BF16)	5 GB (BF16)
RAM	32GB	32GB
Storage	25GB	12GB

Installation

1. Install ComfyUI

git clone https://github.com/comfyanonymous/ComfyUI
cd ComfyUI
pip install -r requirements.txt

2. Install CogVideoX ComfyUI Nodes

cd custom_nodes  # run from inside the ComfyUI directory entered in step 1
git clone https://github.com/kijai/ComfyUI-CogVideoXWrapper
cd ComfyUI-CogVideoXWrapper
pip install -r requirements.txt

3. Download weights — pick a variant

The official THUDM org has migrated to zai-org. The old THUDM URLs still redirect, but pin the canonical zai-org name going forward.

For CogVideoX 1.5-5B (native 1360×768; INT8 fits 12 GB):

huggingface-cli download zai-org/CogVideoX1.5-5B \
  --local-dir ./models/cogvideox-1.5/

For CogVideoX-5B (the 1.0 predecessor, native 720×480, comfortable fit on 12 GB — matches the community times above):

huggingface-cli download zai-org/CogVideoX-5b \
  --local-dir ./models/cogvideox-5b/

After downloading, place the model files in ComfyUI/models/cogvideo/ (the commands above write to ./models/ relative to your current directory).
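If you would rather script the download from Python, huggingface_hub's snapshot_download can fetch the same repositories. This is a minimal sketch: the VARIANTS table and the download_variant helper are hypothetical names, and the local directories simply mirror the CLI commands above.

```python
# Repo IDs and target folders copied from the CLI commands above.
VARIANTS = {
    "1.5": {"repo_id": "zai-org/CogVideoX1.5-5B", "local_dir": "./models/cogvideox-1.5/"},
    "5b": {"repo_id": "zai-org/CogVideoX-5b", "local_dir": "./models/cogvideox-5b/"},
}


def download_variant(name: str) -> str:
    """Download one variant and return the local folder path."""
    # Imported lazily; requires `pip install huggingface_hub`.
    from huggingface_hub import snapshot_download

    spec = VARIANTS[name]
    return snapshot_download(repo_id=spec["repo_id"], local_dir=spec["local_dir"])


# Example (downloads many GB, so it is left commented out):
# download_variant("5b")
```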

4. Download Text Encoder

CogVideoX uses T5 for text encoding:

huggingface-cli download google/t5-v1_1-xxl \
  --local-dir ./models/t5/

Running

Start ComfyUI:

python main.py --listen --lowvram

--lowvram is recommended for either variant on a 12 GB card to absorb the VAE decode stage (kijai's wrapper documents VAE-decode peaks of roughly 13–14 GB on a 4090 without offload — source).

Load the workflow from ComfyUI-CogVideoXWrapper/examples/cogvideox1.5_t2v.json (1.5) or cogvideox_t2v.json (5B).

If you prefer diffusers over ComfyUI, the model card recommends pipe.enable_model_cpu_offload() over pipe.enable_sequential_cpu_offload(): it is noticeably faster, though it keeps one whole component on the GPU at a time, so its VRAM peak is higher than with sequential offload.
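For the diffusers route, a run might look roughly like the sketch below. It assumes a recent diffusers release that ships CogVideoXPipeline; the generate_clip helper, its seed, and the output path are hypothetical, and the defaults (49 frames, 8 fps, 50 steps, CFG 6.0) are the CogVideoX-5B values from this guide.

```python
def generate_clip(prompt: str, model_id: str = "zai-org/CogVideoX-5b",
                  num_frames: int = 49, fps: int = 8, seed: int = 42,
                  out_path: str = "output.mp4") -> str:
    """Generate one clip with model-level CPU offload and return the video path."""
    # Heavy imports stay inside the function so the module loads without a GPU stack.
    import torch
    from diffusers import CogVideoXPipeline
    from diffusers.utils import export_to_video

    pipe = CogVideoXPipeline.from_pretrained(model_id, torch_dtype=torch.bfloat16)
    pipe.enable_model_cpu_offload()   # moves whole components to the GPU as needed
    pipe.vae.enable_tiling()          # decode in tiles to lower the VAE peak
    pipe.vae.enable_slicing()         # decode one batch item at a time

    generator = torch.Generator(device="cpu").manual_seed(seed)
    frames = pipe(prompt=prompt, num_frames=num_frames, num_inference_steps=50,
                  guidance_scale=6.0, generator=generator).frames[0]
    return export_to_video(frames, out_path, fps=fps)
```

Calling generate_clip("A slow push-in shot of a forest path in autumn") would download the weights on first use, so expect a long first run.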

Recommended Settings (12 GB-friendly)

Parameter	CogVideoX 1.5-5B	CogVideoX-5B
Resolution	1360×768 (native)	720×480 (native)
Frames	81 (default; formula `16N+1`)	49 (default; 6 sec @ 8 fps)
Steps	50	50
Precision	INT8 (PytorchAO)	BF16
CFG	6.0	6.0
Sampler	DPM++ 2M Karras	DPM++ 2M Karras

Performance

GPU	VRAM	Variant	Time per clip
RTX 4070	12GB	CogVideoX-5B @ 720×480, 50 steps, BF16	~10 min (community report¹)
RTX 4070 Super	12GB	CogVideoX-5B @ 720×480	~15 min (community report¹)
RTX 4070	12GB	CogVideoX 1.5-5B @ 1360×768, 50 steps, INT8	Not measured here — extrapolating from A100 (~1000 sec) suggests 30+ min
A100 (reference)	80GB	CogVideoX 1.5-5B @ 1360×768	~1000 sec (~16-17 min)
RTX 4090	24GB	CogVideoX 1.5-5B @ 1360×768	See 4090 recipe

¹ Community reports from zai-org/CogVideoX-5b/discussions/7 — RTX 4070/4070 Super users running the 5B predecessor with enable_model_cpu_offload(). Not a controlled benchmark, and not measured for the 1.5 variant. See full data →.

Optimizing Generation Speed

Use `enable_model_cpu_offload`

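The community discussion thread and the model card both recommend pipe.enable_model_cpu_offload() (in diffusers) over pipe.enable_sequential_cpu_offload(). Model offload is noticeably faster because it moves whole components to the GPU instead of individual submodules; the trade-off is a higher VRAM peak than sequential offload.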

Use the I2V (Image-to-Video) Mode

Image-to-video is slightly faster than pure T2V and gives better control:

Generate a starting frame with Flux.1 Dev or similar
Load in CogVideoX I2V workflow (zai-org/CogVideoX1.5-5B-I2V)
Describe the motion you want

INT8 quantization (recommended for 1.5 on 12 GB)

The CogVideoX 1.5-5B model card documents a 7 GB VRAM floor via PytorchAO INT8. In diffusers, apply it at load time:

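from torchao.quantization import quantize_, int8_weight_only
# ... load the diffusers pipeline as `pipe` first ...
# Quantize the transformer weights to INT8 in place (weight-only).
quantize_(pipe.transformer, int8_weight_only())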

For a ready-to-load SAT-format INT8 variant, see zai-org/CogVideoX1.5-5b-SAT (note the lowercase b).
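Put together, the load-then-quantize flow in diffusers could look like the sketch below. The load_int8_pipeline helper is a hypothetical name, and the combination of torchao quantization with model offload is assumed to work on your installed versions; older releases of either library may behave differently.

```python
def load_int8_pipeline(model_id: str = "zai-org/CogVideoX1.5-5B"):
    """Load the pipeline in BF16, then quantize the transformer weights to INT8."""
    # Lazy imports: torch, diffusers, and torchao must be installed separately.
    import torch
    from diffusers import CogVideoXPipeline
    from torchao.quantization import quantize_, int8_weight_only

    pipe = CogVideoXPipeline.from_pretrained(model_id, torch_dtype=torch.bfloat16)
    quantize_(pipe.transformer, int8_weight_only())  # in-place, weight-only INT8
    pipe.enable_model_cpu_offload()                  # keep idle components in system RAM
    pipe.vae.enable_tiling()                         # reduce the VAE decode peak
    return pipe
```

Only the transformer is quantized here, matching the snippet above; the text encoder and VAE stay in BF16.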

Quality Tips

Prompt structure: CogVideoX responds well to detailed scene descriptions.

Good: "A slow push-in shot of a forest path in autumn, golden leaves falling gently, cinematic lighting, 4K"

Motion control: Add motion keywords to your prompt:

Camera: "push in", "pull back", "pan left/right", "tracking shot"
Subject: "walking slowly", "turning around", "hovering"
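One way to keep prompts consistent across runs is to assemble them from the same few parts. The build_prompt helper below is a hypothetical sketch of that idea, not part of any CogVideoX tooling, and the default style string is just an example.

```python
def build_prompt(camera: str, scene: str, subject_motion: str = "",
                 style: str = "cinematic lighting") -> str:
    """Join camera move, scene, subject motion, and style into one prompt string."""
    parts = [f"A {camera} shot of {scene}", subject_motion, style]
    # Skip empty parts so optional fields do not leave stray commas.
    return ", ".join(p for p in parts if p)


print(build_prompt("slow push-in", "a forest path in autumn",
                   "golden leaves falling gently"))
# A slow push-in shot of a forest path in autumn, golden leaves falling gently, cinematic lighting
```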

Consistency: Use seed control for reproducible results. Test prompts with 30 steps first, then full 50 for finals.
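The draft-then-final workflow is easier to follow when both runs share one settings object. The sketch below uses a hypothetical RunSettings dataclass; the seed value is arbitrary, and the step counts and CFG are the ones recommended in this guide.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class RunSettings:
    seed: int
    steps: int
    cfg: float = 6.0


# Draft pass: fewer steps for quick prompt testing.
draft = RunSettings(seed=1234, steps=30)
# Final pass: same seed and CFG, full step count.
final = replace(draft, steps=50)

print(draft)  # RunSettings(seed=1234, steps=30, cfg=6.0)
print(final)  # RunSettings(seed=1234, steps=50, cfg=6.0)
```

Keeping the seed fixed means the only variable between the two runs is the step count, so differences in the output are easier to attribute.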

Troubleshooting

OOM at 12GB on 1.5 native (1360×768) BF16: Expected. kijai's wrapper documents a 13-14 GB peak at the VAE decode stage. Either switch to INT8 (7 GB floor) or enable offloading: --lowvram in ComfyUI, or enable_model_cpu_offload() in diffusers.
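The VRAM thresholds quoted in this guide can be folded into a quick lookup for CogVideoX 1.5-5B. This is a simplified sketch: the recommend function is hypothetical, and treating the 14 GB decode peak as a hard cut-off is an assumption, since real headroom depends on drivers and on what else is using the GPU.

```python
INT8_FLOOR_GB = 7    # model-card floor with PytorchAO INT8
BF16_FLOOR_GB = 10   # model-card floor in BF16
VAE_PEAK_GB = 14     # upper end of the 13-14 GB decode peak reported for kijai's wrapper


def recommend(vram_gb: float) -> str:
    """Suggest a precision/offload setup for CogVideoX 1.5-5B at native resolution."""
    if vram_gb < INT8_FLOOR_GB:
        return "below the documented floor"
    if vram_gb < BF16_FLOOR_GB:
        return "INT8"
    if vram_gb <= VAE_PEAK_GB:
        return "INT8, or BF16 with offload"
    return "BF16"


print(recommend(12))  # INT8, or BF16 with offload
print(recommend(24))  # BF16
```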

OOM at 720×480 on 5B: Enable --lowvram or reduce the resolution.

Very slow (> 30 min on 1.5): Expected per the A100 reference (~1000 sec) — 1.5 is ~5-6× slower than 5B at the same step count. If you need faster iteration, fall back to CogVideoX-5B.

Blurry output: Ensure steps ≥ 40 and use DPM++ 2M Karras sampler.

Black frames at start/end: Normal for some configurations — trim in post with any video editor.

When to Choose CogVideoX vs Alternatives

Model	Best For	Speed
CogVideoX 1.5-5B (24 GB+)	Best quality at 1360×768	Slow (~5-7 min on 4090; 30+ min on 4070)
CogVideoX-5B (12 GB)	Best 12 GB-friendly quality	~10-15 min on RTX 4070 (community)
Wan 2.1	Quality + reasonable speed	Medium (4 min)
LTX Video 2.3	Rapid iteration	Fast (45s-5min)