
CogVideoX 1.5 on RTX 4070: High-Quality Local Video Guide

video · intermediate · 10GB+ VRAM · May 13, 2026
prerequisites
  • NVIDIA GPU with ≥ 12GB VRAM (RTX 4070 or 4070 Super recommended)
  • ComfyUI installed (or diffusers ≥ 0.32)
  • Python 3.10+
  • ~25GB free storage
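
The list above can be checked from a script before downloading anything. This is a minimal sketch, not part of ComfyUI or diffusers: the function name and thresholds are illustrative and simply mirror the prerequisites, and the GPU check only runs if torch happens to be installed.

```python
import shutil
import sys

MIN_PYTHON = (3, 10)   # "Python 3.10+"
MIN_FREE_GB = 25       # "~25GB free storage"
MIN_VRAM_GB = 12       # "NVIDIA GPU with >= 12GB VRAM"


def check_prerequisites(path="."):
    """Return pass/fail flags; vram_ok stays None when no GPU info is available."""
    free_gb = shutil.disk_usage(path).free / 1e9
    result = {
        "python_ok": sys.version_info[:2] >= MIN_PYTHON,
        "storage_ok": free_gb >= MIN_FREE_GB,
        "vram_ok": None,
    }
    try:
        import torch  # optional; only used to read total GPU memory
    except ImportError:
        return result
    if torch.cuda.is_available():
        # Decimal GB on purpose: a "12 GB" card reports a little under 12 GiB.
        total_gb = torch.cuda.get_device_properties(0).total_memory / 1e9
        result["vram_ok"] = total_gb >= MIN_VRAM_GB
    return result


if __name__ == "__main__":
    for name, ok in check_prerequisites().items():
        print(f"{name}: {ok}")
```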

What You'll Build

Generate cinematic-quality videos locally using CogVideoX 1.5 — one of the highest-quality open-source video models. On a 12 GB RTX 4070, this requires INT8 quantization (PytorchAO) or --lowvram + CPU offload to handle the VAE decode peak; the predecessor CogVideoX-5B fits more comfortably and matches the community times below.

Model-card reference numbers (zai-org/CogVideoX1.5-5B):

  • BF16 VRAM floor: from 10 GB (single GPU, diffusers)
  • INT8 VRAM floor (PytorchAO): from 7 GB
  • Single A100, 5-sec/50-step clip: ~1000 sec (~16–17 min)
  • Single H100: ~550 sec (~9 min)
  • Native resolution: 1360×768, frame formula 16N+1, default 81 frames
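
The 16N+1 frame formula is easy to get wrong by one. A small helper, sketched here with hypothetical function names, validates a frame count and picks the smallest valid count for a target duration. The 16 fps default is an assumption inferred from 81 frames covering a roughly 5-second clip.

```python
import math


def is_valid_frame_count(frames: int) -> bool:
    """True if frames == 16N + 1 for some integer N >= 1."""
    return frames > 1 and (frames - 1) % 16 == 0


def frames_for_duration(seconds: float, fps: int = 16) -> int:
    """Smallest 16N+1 frame count that covers `seconds` at `fps` (fps is assumed)."""
    needed = seconds * fps
    n = max(1, math.ceil((needed - 1) / 16))
    return 16 * n + 1


print(is_valid_frame_count(81))   # the default frame count
print(frames_for_duration(5))     # 5-second clip at the assumed 16 fps
```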

Variant attribution (important). The widely cited community numbers of ~10 min on RTX 4070 and ~15 min on RTX 4070 Super (HF discussion) are for CogVideoX-5B, the 1.0 release, not CogVideoX 1.5-5B. The A100 reference times show 1.5 is roughly 5–6× slower than 5B at the same step count, each at its native resolution (A100: ~1000 sec for 1.5-5B at 1360×768 vs ~180 sec for 5B at 720×480). Expect a 4070 to need 30+ min per 5-sec 1.5 clip at native 1360×768, or use the 5B predecessor to get the 10–15 min times. For measured 1.5 timing at native resolution on 24 GB hardware, see the RTX 4090 recipe →.
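
The arithmetic behind the "5–6× slower" and "30+ min" figures can be written out. This is only a naive linear scaling of the numbers quoted above, not a measurement: it assumes the A100 ratio carries over to a 4070 and ignores INT8 and offload overhead.

```python
A100_SEC_1_5 = 1000          # CogVideoX 1.5-5B, A100 reference
A100_SEC_5B = 180            # CogVideoX-5B, A100 reference
COMMUNITY_4070_MIN_5B = 10   # community report for CogVideoX-5B on RTX 4070


def slowdown_ratio() -> float:
    """How much slower 1.5-5B is than 5B on the A100 reference numbers."""
    return A100_SEC_1_5 / A100_SEC_5B


def naive_4070_estimate_min() -> float:
    """Scale the 4070 community time for 5B by the A100 ratio (rough guess only)."""
    return COMMUNITY_4070_MIN_5B * slowdown_ratio()


print(f"slowdown: {slowdown_ratio():.1f}x")
print(f"naive 4070 estimate for 1.5: {naive_4070_estimate_min():.0f} min")
```

The scaled figure lands well above 30 minutes, which is why the text treats 30 min as a floor rather than a target.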

CogVideoX vs LTX vs Wan: CogVideoX prioritizes quality over speed. Use it when you want the best visual output and can wait longer. Compare →

Requirements

| Component | Minimum (1.5-5B, INT8) | Tested (CogVideoX-5B BF16) |
|---|---|---|
| GPU | RTX 4060 Ti / RTX 4070 12GB | RTX 4070 / 4070 Super 12GB |
| VRAM | 7 GB (INT8) / 10 GB (BF16) | 5 GB (BF16) |
| RAM | 32GB | 32GB |
| Storage | 25GB | 12GB |

Installation

1. Install ComfyUI

git clone https://github.com/comfyanonymous/ComfyUI
cd ComfyUI
pip install -r requirements.txt

2. Install CogVideoX ComfyUI Nodes

cd custom_nodes   # run from inside the ComfyUI directory entered in step 1
git clone https://github.com/kijai/ComfyUI-CogVideoXWrapper
cd ComfyUI-CogVideoXWrapper
pip install -r requirements.txt

3. Download weights — pick a variant

The official THUDM org migrated to zai-org. The old THUDM repo names still redirect, but pin the canonical zai-org names going forward.

For CogVideoX 1.5-5B (native 1360×768; INT8 fits 12 GB):

huggingface-cli download zai-org/CogVideoX1.5-5B \
  --local-dir ./models/cogvideox-1.5/

For CogVideoX-5B (the 1.0 predecessor, native 720×480, comfortable fit on 12 GB — matches the community times above):

huggingface-cli download zai-org/CogVideoX-5b \
  --local-dir ./models/cogvideox-5b/

Place model files in ComfyUI/models/cogvideo/.
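
If you would rather script the download than use the CLI, huggingface_hub exposes snapshot_download, which does the same job. A minimal sketch, assuming the huggingface_hub package is installed; the REPOS mapping and function name are illustrative and just restate the two commands above.

```python
# Variant key -> (repo id, local directory), mirroring the CLI commands above.
REPOS = {
    "1.5": ("zai-org/CogVideoX1.5-5B", "./models/cogvideox-1.5/"),
    "5b": ("zai-org/CogVideoX-5b", "./models/cogvideox-5b/"),
}


def download_variant(variant: str) -> str:
    """Download one variant and return the local path (needs network access)."""
    from huggingface_hub import snapshot_download  # imported lazily

    repo_id, local_dir = REPOS[variant]
    return snapshot_download(repo_id=repo_id, local_dir=local_dir)


if __name__ == "__main__":
    print(download_variant("5b"))
```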

4. Download Text Encoder

CogVideoX uses T5 for text encoding:

huggingface-cli download google/t5-v1_1-xxl \
  --local-dir ./models/t5/

Running

Start ComfyUI:

python main.py --listen --lowvram

--lowvram is recommended for either variant on a 12 GB card to absorb the VAE decode stage (kijai's wrapper reports peaks of "13–14 GB momentarily" on a 4090 without offload — source).

Load the workflow from ComfyUI-CogVideoXWrapper/examples/cogvideox1.5_t2v.json (1.5) or cogvideox_t2v.json (5B).

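If you prefer diffusers over ComfyUI, the model card recommends pipe.enable_model_cpu_offload() over pipe.enable_sequential_cpu_offload(). Model offload moves whole components between CPU and GPU instead of individual submodules, so it is noticeably faster at the cost of a higher VRAM peak.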

Recommended Settings (12 GB-friendly)

| Parameter | CogVideoX 1.5-5B | CogVideoX-5B |
|---|---|---|
| Resolution | 1360×768 (native) | 720×480 (native) |
| Frames | 81 (default; formula 16N+1) | 49 (default; 6 sec @ 8 fps) |
| Steps | 50 | 50 |
| Precision | INT8 (PytorchAO) | BF16 |
| CFG | 6.0 | 6.0 |
| Sampler | DPM++ 2M Karras | DPM++ 2M Karras |
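
For the diffusers route, the table maps onto pipeline call arguments roughly as follows. Treat this as a sketch under assumptions: ClipSettings, SETTINGS, call_kwargs and generate are illustrative names, the sampler and precision rows are not applied (the pipeline's default scheduler and BF16 are used), and the fps value passed to the exporter is your choice (the table gives 8 fps for 5B).

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class ClipSettings:
    width: int
    height: int
    num_frames: int
    num_inference_steps: int = 50
    guidance_scale: float = 6.0   # the CFG row


SETTINGS = {
    "1.5": ClipSettings(1360, 768, 81),
    "5b": ClipSettings(720, 480, 49),
}


def call_kwargs(variant: str, prompt: str) -> dict:
    """Keyword arguments for the pipeline call, taken from the table."""
    kwargs = asdict(SETTINGS[variant])
    kwargs["prompt"] = prompt
    return kwargs


def generate(variant: str, prompt: str, repo_id: str, out_path="clip.mp4", fps=8):
    """Run one clip with model CPU offload; needs torch, diffusers and a GPU."""
    import torch
    from diffusers import CogVideoXPipeline
    from diffusers.utils import export_to_video

    pipe = CogVideoXPipeline.from_pretrained(repo_id, torch_dtype=torch.bfloat16)
    pipe.enable_model_cpu_offload()
    pipe.vae.enable_tiling()   # lowers the VAE decode peak
    frames = pipe(**call_kwargs(variant, prompt)).frames[0]
    return export_to_video(frames, out_path, fps=fps)
```

For example, generate("5b", "a forest path in autumn", "zai-org/CogVideoX-5b") would write clip.mp4 using the 5B column.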

Performance

| GPU | VRAM | Variant | Time per clip |
|---|---|---|---|
| RTX 4070 | 12GB | CogVideoX-5B @ 720×480, 50 steps, BF16 | ~10 min (community report¹) |
| RTX 4070 Super | 12GB | CogVideoX-5B @ 720×480 | ~15 min (community report¹) |
| RTX 4070 | 12GB | CogVideoX 1.5-5B @ 1360×768, 50 steps, INT8 | Not measured here; extrapolating from A100 (~1000 sec) suggests 30+ min |
| A100 (reference) | 80GB | CogVideoX 1.5-5B @ 1360×768 | ~1000 sec (~16–17 min) |
| RTX 4090 | 24GB | CogVideoX 1.5-5B @ 1360×768 | See 4090 recipe |

¹ Community reports from zai-org/CogVideoX-5b/discussions/7 — RTX 4070/4070 Super users running the 5B predecessor with enable_model_cpu_offload(). Not a controlled benchmark, and not measured for the 1.5 variant. See full data →.

Optimizing Generation Speed

Use enable_model_cpu_offload

The community discussion thread and the model card both recommend pipe.enable_model_cpu_offload() (in diffusers) over pipe.enable_sequential_cpu_offload(). It is noticeably faster, though its VRAM peak is higher than with sequential offload.

Use the I2V (Image-to-Video) Mode

Image-to-video is slightly faster than pure T2V and gives better control:

  1. Generate a starting frame with Flux.1 Dev or similar
  2. Load in CogVideoX I2V workflow (zai-org/CogVideoX1.5-5B-I2V)
  3. Describe the motion you want

INT8 quantization (recommended for 1.5 on 12 GB)

The CogVideoX 1.5-5B model card documents a 7 GB VRAM floor via PytorchAO INT8 (model card). In diffusers, apply it at load time:

from torchao.quantization import quantize_, int8_weight_only
# ...
quantize_(pipe.transformer, int8_weight_only())

For a ready-to-load SAT-format INT8 variant, see zai-org/CogVideoX1.5-5b-SAT (note the lowercase b).
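
A fuller version of the quantization snippet, as a sketch rather than the model card's exact recipe. It assumes diffusers and torchao are installed and reuses the int8_weight_only import shown above, whose name may differ across torchao versions. The weight_gb helper is a back-of-envelope estimate that takes the nominal 5B parameter count from the model name and counts weights only.

```python
def weight_gb(params_billion: float, bytes_per_param: int) -> float:
    """Weights-only size in decimal GB (activations and VAE decode excluded)."""
    return params_billion * bytes_per_param


def load_int8_pipeline(repo_id: str = "zai-org/CogVideoX1.5-5B"):
    """Load in BF16, then quantize the transformer weights to INT8 in place."""
    import torch
    from diffusers import CogVideoXPipeline
    from torchao.quantization import int8_weight_only, quantize_

    pipe = CogVideoXPipeline.from_pretrained(repo_id, torch_dtype=torch.bfloat16)
    quantize_(pipe.transformer, int8_weight_only())
    pipe.enable_model_cpu_offload()
    return pipe


print("BF16 transformer weights:", weight_gb(5, 2), "GB")
print("INT8 transformer weights:", weight_gb(5, 1), "GB")
```

Halving the bytes per parameter is where most of the saving comes from; the text encoder and VAE are left in BF16 in this sketch.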

Quality Tips

Prompt structure: CogVideoX responds well to detailed scene descriptions.

Good: "A slow push-in shot of a forest path in autumn, golden leaves falling gently, cinematic lighting, 4K"

Motion control: Add motion keywords to your prompt:

  • Camera: "push in", "pull back", "pan left/right", "tracking shot"
  • Subject: "walking slowly", "turning around", "hovering"
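
One way to keep prompts consistent is to assemble them from the pieces above. The helper below is purely illustrative (build_prompt is not part of any CogVideoX tooling); it reproduces the "Good" example from its parts.

```python
def build_prompt(scene, camera=None, subject_motion=None, style=()):
    """Join scene, camera move, subject motion and style tags into one prompt."""
    lead = f"A {camera} shot of {scene}" if camera else scene
    parts = [lead]
    if subject_motion:
        parts.append(subject_motion)
    parts.extend(style)
    return ", ".join(parts)


print(build_prompt(
    "a forest path in autumn",
    camera="slow push-in",
    subject_motion="golden leaves falling gently",
    style=("cinematic lighting", "4K"),
))
```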

Consistency: Use seed control for reproducible results. Test prompts with 30 steps first, then full 50 for finals.
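
In diffusers, the draft-then-final loop with a fixed seed might look like the sketch below. The function names are hypothetical, and `pipe` is assumed to be an already loaded CogVideoX pipeline. Note that the same seed gives the same starting noise, but a 30-step draft will not be frame-identical to the 50-step final.

```python
def draft_then_final(base_kwargs: dict, seed: int, draft_steps=30, final_steps=50):
    """Return (draft, final) call plans that share a seed and differ only in steps."""
    draft = {**base_kwargs, "num_inference_steps": draft_steps, "seed": seed}
    final = {**base_kwargs, "num_inference_steps": final_steps, "seed": seed}
    return draft, final


def run(pipe, plan: dict):
    """Execute one plan on a loaded pipeline with a seeded generator."""
    import torch

    kwargs = dict(plan)
    seed = kwargs.pop("seed")
    generator = torch.Generator(device="cpu").manual_seed(seed)
    return pipe(**kwargs, generator=generator).frames[0]


draft, final = draft_then_final({"prompt": "a forest path in autumn"}, seed=42)
print(draft["num_inference_steps"], final["num_inference_steps"])
```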

Troubleshooting

OOM at 12GB on 1.5 native (1360×768) BF16: Expected — kijai's wrapper documents 13-14 GB peak at the VAE decode stage. Either switch to INT8 (7 GB floor) or enable --lowvram + enable_model_cpu_offload.

OOM at 720×480 on 5B: Enable --lowvram or reduce the resolution.

Very slow (> 30 min on 1.5): Expected per the A100 reference (~1000 sec) — 1.5 is ~5-6× slower than 5B at the same step count. If you need faster iteration, fall back to CogVideoX-5B.

Blurry output: Ensure steps ≥ 40 and use DPM++ 2M Karras sampler.

Black frames at start/end: Normal for some configurations — trim in post with any video editor.

When to Choose CogVideoX vs Alternatives

| Model | Best For | Speed |
|---|---|---|
| CogVideoX 1.5-5B (24 GB+) | Best quality at 1360×768 | Slow (~5-7 min on 4090; 30+ min on 4070) |
| CogVideoX-5B (12 GB) | Best 12 GB-friendly quality | ~10-15 min on RTX 4070 (community) |
| Wan 2.1 | Quality + reasonable speed | Medium (4 min) |
| LTX Video 2.3 | Rapid iteration | Fast (45s-5min) |