What You'll Build
Generate cinematic-quality videos locally using CogVideoX 1.5 — one of the highest-quality open-source video models. On a 12 GB RTX 4070, this requires INT8 quantization (PytorchAO) or --lowvram + CPU offload to handle the VAE decode peak; the predecessor CogVideoX-5B fits more comfortably and matches the community times below.
Model-card reference numbers (zai-org/CogVideoX1.5-5B):
- BF16 VRAM floor: from 10 GB (single GPU, diffusers)
- INT8 VRAM floor (PytorchAO): from 7 GB
- Single A100, 5-sec/50-step clip: ~1000 sec (~16–17 min)
- Single H100: ~550 sec (~9 min)
- Native resolution: 1360×768; frame count follows 16N+1, default 81 frames
Variant attribution (important). The widely cited community numbers of ~10 min on RTX 4070 and ~15 min on RTX 4070 Super (HF discussion) are for CogVideoX-5B, the 1.0 release, not CogVideoX 1.5-5B. The A100 reference times show 1.5-5B running roughly 5–6× slower than CogVideoX-5B at the same step count, each at its native resolution (A100: ~1000 sec for 1.5-5B vs ~180 sec for 5B). Expect a 4070 to need 30+ min per 5-sec 1.5 clip at 1360×768 native, or use the 5B predecessor to get the 10–15 min times below. For 1.5 timing at native resolution on 24 GB hardware, see the RTX 4090 recipe →.
CogVideoX vs LTX vs Wan: CogVideoX prioritizes quality over speed. Use it when you want the best visual output and can wait longer. Compare →
Requirements
| Component | Minimum (1.5-5B, INT8) | Tested (CogVideoX-5B BF16) |
|---|---|---|
| GPU | RTX 4060 Ti / RTX 4070 12GB | RTX 4070 / 4070 Super 12GB |
| VRAM | 7 GB (INT8) / 10 GB (BF16) | 5 GB (BF16) |
| RAM | 32GB | 32GB |
| Storage | 25GB | 12GB |
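A quick preflight check can confirm the minimums in the table before any large download starts. The following is a minimal sketch, not part of the original guide: the thresholds are copied from the 1.5-5B column, the helper names are made up for illustration, and it assumes an NVIDIA driver with `nvidia-smi` on the PATH.

```python
import shutil
import subprocess

MIN_FREE_DISK_GB = 25  # storage figure from the table above (1.5-5B column)
MIN_VRAM_GB = 7        # INT8 VRAM floor from the table above


def free_disk_gb(path="."):
    """Free space on the filesystem containing `path`, in GB (10^9 bytes)."""
    return shutil.disk_usage(path).free / 1e9


def gpu_vram_gb():
    """Total VRAM of the first GPU in GiB, or None if nvidia-smi is unavailable."""
    try:
        out = subprocess.run(
            ["nvidia-smi", "--query-gpu=memory.total",
             "--format=csv,noheader,nounits"],
            capture_output=True, text=True, check=True,
        ).stdout
        return int(out.splitlines()[0]) / 1024  # nvidia-smi reports MiB
    except (OSError, subprocess.CalledProcessError, ValueError, IndexError):
        return None


def preflight(path="."):
    """Return a list of problems; an empty list means the minimums are met."""
    problems = []
    disk = free_disk_gb(path)
    if disk < MIN_FREE_DISK_GB:
        problems.append(f"only {disk:.0f} GB free disk, want {MIN_FREE_DISK_GB} GB")
    vram = gpu_vram_gb()
    if vram is None:
        problems.append("could not query the GPU via nvidia-smi")
    elif vram < MIN_VRAM_GB:
        problems.append(f"only {vram:.1f} GB VRAM, want {MIN_VRAM_GB} GB")
    return problems
```

Calling `preflight("ComfyUI")` and printing the returned list is enough for a manual check. Meeting the 7 GB floor does not guarantee a comfortable run, since the VAE decode peak is higher than the floor.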
Installation
1. Install ComfyUI
git clone https://github.com/comfyanonymous/ComfyUI
cd ComfyUI
pip install -r requirements.txt
2. Install CogVideoX ComfyUI Nodes
cd custom_nodes  # run from the ComfyUI directory entered in step 1
git clone https://github.com/kijai/ComfyUI-CogVideoXWrapper
cd ComfyUI-CogVideoXWrapper
pip install -r requirements.txt
3. Download weights — pick a variant
The official THUDM org migrated to zai-org; both names redirect, but pin the canonical org going forward.
For CogVideoX 1.5-5B (native 1360×768; INT8 fits 12 GB):
huggingface-cli download zai-org/CogVideoX1.5-5B \
--local-dir ./models/cogvideox-1.5/
For CogVideoX-5B (the 1.0 predecessor, native 720×480, comfortable fit on 12 GB — matches the community times above):
huggingface-cli download zai-org/CogVideoX-5b \
--local-dir ./models/cogvideox-5b/
Place model files in ComfyUI/models/cogvideo/.
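The same download can be scripted from Python with `huggingface_hub.snapshot_download`, which is convenient when the target folder should be computed rather than typed. This is a sketch under assumptions: the per-variant subfolder names are illustrative, and whether the wrapper expects exactly this layout under `ComfyUI/models/cogvideo/` is not confirmed here.

```python
from pathlib import Path

# Repo ids are the ones used in the commands above.
VARIANTS = {
    "1.5": "zai-org/CogVideoX1.5-5B",
    "5b": "zai-org/CogVideoX-5b",
}


def target_dir(variant, comfy_root="ComfyUI"):
    """Folder under ComfyUI/models/cogvideo/ for a variant (naming is an assumption)."""
    repo_name = VARIANTS[variant].split("/")[-1]
    return Path(comfy_root) / "models" / "cogvideo" / repo_name


def download(variant, comfy_root="ComfyUI"):
    """Download the chosen variant into its target folder and return the path."""
    # Imported lazily so the path helper works without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    dest = target_dir(variant, comfy_root)
    dest.mkdir(parents=True, exist_ok=True)
    snapshot_download(repo_id=VARIANTS[variant], local_dir=str(dest))
    return dest
```

For example, `download("5b")` would fetch the 1.0 predecessor into `ComfyUI/models/cogvideo/CogVideoX-5b`.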
4. Download Text Encoder
CogVideoX uses T5 for text encoding:
huggingface-cli download google/t5-v1_1-xxl \
--local-dir ./models/t5/
Running
Start ComfyUI:
python main.py --listen --lowvram
--lowvram is recommended for either variant on a 12 GB card to absorb the VAE decode stage (kijai's wrapper reports peaks of "13–14 GB momentarily" on a 4090 without offload — source).
Load the workflow from ComfyUI-CogVideoXWrapper/examples/cogvideox1.5_t2v.json (1.5) or cogvideox_t2v.json (5B).
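For unattended runs, a workflow can also be queued over ComfyUI's HTTP API instead of through the browser. The sketch below assumes the server is on the default port 8188 and that the workflow was re-exported from the UI in API format; the example JSON files shipped with the wrapper are UI-format graphs and may not be accepted as-is. The function names are illustrative.

```python
import json
import urllib.request


def build_prompt_request(workflow, host="127.0.0.1", port=8188):
    """Build (but do not send) a POST /prompt request for a running ComfyUI server."""
    body = json.dumps({"prompt": workflow}).encode("utf-8")
    return urllib.request.Request(
        f"http://{host}:{port}/prompt",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def queue_workflow(path, **kwargs):
    """Load an API-format workflow JSON from `path`, queue it, return the server reply."""
    with open(path, "r", encoding="utf-8") as f:
        workflow = json.load(f)
    with urllib.request.urlopen(build_prompt_request(workflow, **kwargs)) as resp:
        return json.loads(resp.read())
```

Splitting request construction from sending keeps the payload easy to inspect before anything reaches the GPU.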
If you prefer diffusers over ComfyUI, the model card recommends pipe.enable_model_cpu_offload() over pipe.enable_sequential_cpu_offload(): it is noticeably faster, though its peak VRAM is somewhat higher than with sequential offload.
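A minimal diffusers sketch of that route follows. The sampling values mirror the 1.5-5B settings used in this guide; the VAE tiling and slicing calls and the 16 fps export rate (81 frames for a roughly 5-second clip) are additions assumed here rather than taken from the guide, and BF16 with model offload may still exceed 12 GB at the decode stage.

```python
MODEL_ID = "zai-org/CogVideoX1.5-5B"

# Sampling settings for the 1.5-5B variant as used in this guide.
GEN_KWARGS = {
    "num_inference_steps": 50,
    "guidance_scale": 6.0,
    "num_frames": 81,
    "height": 768,
    "width": 1360,
}


def generate(prompt, out_path="output.mp4", seed=42):
    """Generate one clip with model-level CPU offload and write it to `out_path`."""
    # Heavy imports stay inside the function so the settings can be inspected
    # on a machine without torch or diffusers installed.
    import torch
    from diffusers import CogVideoXPipeline
    from diffusers.utils import export_to_video

    pipe = CogVideoXPipeline.from_pretrained(MODEL_ID, torch_dtype=torch.bfloat16)
    pipe.enable_model_cpu_offload()  # moves whole sub-models between CPU and GPU
    pipe.vae.enable_tiling()         # assumption: lowers the VAE decode peak
    pipe.vae.enable_slicing()

    generator = torch.Generator(device="cpu").manual_seed(seed)
    frames = pipe(prompt=prompt, generator=generator, **GEN_KWARGS).frames[0]
    export_to_video(frames, out_path, fps=16)  # assumption: 81 frames ~ 5 s
    return out_path
```

A call such as `generate("A slow push-in shot of a forest path in autumn")` writes `output.mp4` to the working directory.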
Recommended Settings (12 GB-friendly)
| Parameter | CogVideoX 1.5-5B | CogVideoX-5B |
|---|---|---|
| Resolution | 1360×768 (native) | 720×480 (native) |
| Frames | 81 (default; formula 16N+1) | 49 (default; 6 sec @ 8 fps) |
| Steps | 50 | 50 |
| Precision | INT8 (PytorchAO) | BF16 |
| CFG | 6.0 | 6.0 |
| Sampler | DPM++ 2M Karras | DPM++ 2M Karras |
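The 16N+1 rule for 1.5-5B frame counts is easy to get wrong when changing clip length, so a small helper can validate or snap a requested value. This is an illustrative sketch; the function names are not from any library, and the upper bound on N is not stated in this guide, so none is enforced.

```python
def frame_count(n):
    """Frame count for a given N under the 16N+1 rule."""
    return 16 * n + 1


def is_valid_frame_count(frames):
    """True if `frames` has the form 16N+1 for some integer N >= 0."""
    return frames >= 1 and (frames - 1) % 16 == 0


def nearest_valid_frame_count(frames):
    """Snap a requested frame count to the nearest 16N+1 value (ties round up)."""
    n = max(0, (frames - 1 + 8) // 16)
    return frame_count(n)
```

With N = 5 this gives the default of 81 frames; a request for 100 frames would snap to 97.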
Performance
| GPU | VRAM | Variant | Time per clip |
|---|---|---|---|
| RTX 4070 | 12GB | CogVideoX-5B @ 720×480, 50 steps, BF16 | ~10 min (community report¹) |
| RTX 4070 Super | 12GB | CogVideoX-5B @ 720×480 | ~15 min (community report¹) |
| RTX 4070 | 12GB | CogVideoX 1.5-5B @ 1360×768, 50 steps, INT8 | Not measured here — extrapolating from A100 (~1000 sec) suggests 30+ min |
| A100 (reference) | 80GB | CogVideoX 1.5-5B @ 1360×768 | ~1000 sec (~16-17 min) |
| RTX 4090 | 24GB | CogVideoX 1.5-5B @ 1360×768 | See 4090 recipe |
¹ Community reports from zai-org/CogVideoX-5b/discussions/7 — RTX 4070/4070 Super users running the 5B predecessor with enable_model_cpu_offload(). Not a controlled benchmark, and not measured for the 1.5 variant. See full data →.
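The "30+ min" figure for 1.5-5B on an RTX 4070 can be sanity-checked with simple proportional scaling. The sketch below assumes the 4070 slows down by the same factor as the A100 when moving from CogVideoX-5B to 1.5-5B, which is a rough assumption and not a measurement: it ignores INT8 quantization and CPU offload overhead.

```python
A100_SEC_1_5 = 1000      # CogVideoX 1.5-5B on A100, from the model card figures above
A100_SEC_5B = 180        # CogVideoX-5B on A100
RTX4070_MIN_5B = 10      # community report for CogVideoX-5B on RTX 4070

# How much slower 1.5-5B is than 5B on the same reference GPU.
slowdown = A100_SEC_1_5 / A100_SEC_5B

# Naive estimate: apply the same slowdown to the 4070 community time.
estimate_min = RTX4070_MIN_5B * slowdown

print(f"1.5-5B vs 5B slowdown on A100: {slowdown:.1f}x")
print(f"naive RTX 4070 estimate for 1.5-5B: {estimate_min:.0f} min")
```

The scaled estimate comes out at about 56 minutes, consistent with the table's "30+ min" entry while leaving a wide margin of uncertainty.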
Optimizing Generation Speed
Use enable_model_cpu_offload
The community discussion thread and the model card both recommend pipe.enable_model_cpu_offload() (in diffusers) over pipe.enable_sequential_cpu_offload(). Model-level offload is noticeably faster; sequential offload trades that speed for a lower VRAM peak.
Use the I2V (Image-to-Video) Mode
Image-to-video is slightly faster than pure T2V and gives better control:
- Generate a starting frame with Flux.1 Dev or similar
- Load it in the CogVideoX I2V workflow (zai-org/CogVideoX1.5-5B-I2V)
- Describe the motion you want
INT8 quantization (recommended for 1.5 on 12 GB)
The CogVideoX 1.5-5B model card documents a 7 GB VRAM floor via PytorchAO INT8 (model card). In diffusers, apply it at load time:
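import torch
from diffusers import CogVideoXPipeline
from torchao.quantization import quantize_, int8_weight_only
# Load in BF16, then quantize the transformer weights to INT8 in place
pipe = CogVideoXPipeline.from_pretrained("zai-org/CogVideoX1.5-5B", torch_dtype=torch.bfloat16)
quantize_(pipe.transformer, int8_weight_only())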
For a ready-to-load SAT-format INT8 variant, see zai-org/CogVideoX1.5-5b-SAT (note the lowercase b).
Quality Tips
Prompt structure: CogVideoX responds well to detailed scene descriptions.
Good: "A slow push-in shot of a forest path in autumn, golden leaves falling gently, cinematic lighting, 4K"
Motion control: Add motion keywords to your prompt:
- Camera: "push in", "pull back", "pan left/right", "tracking shot"
- Subject: "walking slowly", "turning around", "hovering"
Consistency: Use seed control for reproducible results. Test prompts with 30 steps first, then full 50 for finals.
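The prompt and consistency tips above can be combined in a couple of small helpers: one assembles a prompt from scene, camera, and motion parts, the other returns draft or final sampling settings with a fixed seed. This is an illustrative sketch; the function names and the dictionary keys are hypothetical and would need mapping onto whichever frontend is in use.

```python
def build_prompt(scene, camera=None, subject_motion=None, style=()):
    """Join scene, camera move, subject motion, and style tags into one prompt."""
    lead = f"A {camera} shot of {scene}" if camera else scene
    parts = [lead]
    if subject_motion:
        parts.append(subject_motion)
    parts.extend(style)
    return ", ".join(parts)


def sampling_settings(stage, seed):
    """Settings for a 'draft' (30 steps) or 'final' (50 steps) run with the same seed."""
    steps = {"draft": 30, "final": 50}[stage]
    return {"num_inference_steps": steps, "guidance_scale": 6.0, "seed": seed}


prompt = build_prompt(
    scene="a forest path in autumn",
    camera="slow push-in",
    subject_motion="golden leaves falling gently",
    style=("cinematic lighting", "4K"),
)
draft = sampling_settings("draft", seed=1234)
final = sampling_settings("final", seed=1234)
```

Keeping the seed fixed between the draft and final runs starts both from the same noise, so the composition usually stays comparable even though the step count differs.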
Troubleshooting
OOM at 12GB on 1.5 native (1360×768) BF16: Expected — kijai's wrapper documents 13-14 GB peak at the VAE decode stage. Either switch to INT8 (7 GB floor) or enable --lowvram + enable_model_cpu_offload.
OOM at 720×480 on 5B: Enable --lowvram or lower the resolution.
Very slow (> 30 min on 1.5): Expected per the A100 reference (~1000 sec) — 1.5 is ~5-6× slower than 5B at the same step count. If you need faster iteration, fall back to CogVideoX-5B.
Blurry output: Ensure steps ≥ 40 and use DPM++ 2M Karras sampler.
Black frames at start/end: Normal for some configurations — trim in post with any video editor.
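As an alternative to trimming in a video editor, dark frames can be dropped from the frame list before it is written to disk. The sketch below is one hypothetical way to do that: `brightness` is a caller-supplied function returning a value in the 0 to 1 range for a frame, and the 0.02 threshold is an arbitrary starting point, not a tested value.

```python
def trim_dark_frames(frames, brightness, threshold=0.02):
    """Drop leading and trailing frames whose brightness is below `threshold`.

    Dark frames in the middle of the clip are kept, since they may be intentional.
    """
    start, end = 0, len(frames)
    while start < end and brightness(frames[start]) < threshold:
        start += 1
    while end > start and brightness(frames[end - 1]) < threshold:
        end -= 1
    return frames[start:end]
```

With PIL frames, `brightness` could be the mean of the grayscale image divided by 255; any measure works as long as black frames score near zero.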
When to Choose CogVideoX vs Alternatives
| Model | Best For | Speed |
|---|---|---|
| CogVideoX 1.5-5B (24 GB+) | Best quality at 1360×768 | Slow (~5-7 min on 4090; 30+ min on 4070) |
| CogVideoX-5B (12 GB) | Best 12 GB-friendly quality | ~10-15 min on RTX 4070 (community) |
| Wan 2.1 | Quality + reasonable speed | Medium (4 min) |
| LTX Video 2.3 | Rapid iteration | Fast (45s-5min) |