CogVideoX 1.5 on RTX 4070 Super: 1360x768 Text-to-Video with Diffusers

What You'll Build

Generate high-quality 1360×768 text-to-video clips locally with CogVideoX 1.5-5B on a 12 GB RTX 4070 Super, using diffusers with sequential CPU offload and VAE tiling/slicing to keep peak VRAM use within the card's 12 GB. CogVideoX 1.5-5B is a 5B-parameter DiT video model that prioritizes visual quality over speed.

Hardware data: RTX 4070 Super (12GB VRAM) · 10 GB BF16 / 7 GB INT8 floor with optimizations enabled (per model card) · See benchmark data

No first-party RTX 4070 Super speed datapoint exists yet. The only /check benchmark currently attached to this pair is a misattributed community number for CogVideoX-5B (the 1.0 predecessor), not a measured CogVideoX 1.5-5B run, so this recipe deliberately omits a hard per-clip time for the 4070 Super. See the Performance section for a cross-card extrapolation, and submit your own measured run via /contribute to seed the first real datapoint.

CogVideoX prioritizes quality over speed. Use it when you want the best visual output and can wait longer per clip. If you need rapid-iteration video instead, the faster LTX-2.3 video model is a lighter-weight alternative.

Requirements

Component	Minimum (1.5-5B)	Tested
GPU	10GB VRAM (BF16) / 7GB (INT8)	RTX 4070 Super (12GB)
RAM	32GB (offload spills weights to system RAM)	—
Storage	25GB	—
Software	Python 3.10+, diffusers ≥ 0.32	—

The 12 GB RTX 4070 Super sits in the same VRAM tier and Ada architecture as the plain RTX 4070. The Super has more CUDA cores (7,168 vs 5,888) and therefore more compute, but the same 12 GB of VRAM, so the VRAM-fit story is identical: the offload path below is what makes CogVideoX 1.5-5B fit either card.
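
The requirements above can be turned into a quick preflight check before downloading anything. This is a minimal sketch using only the Python standard library; the function names pick_precision and has_disk_space are hypothetical helpers, and the thresholds are simply the floors quoted in the table, not values probed from your hardware.

```python
import shutil

# Floors quoted in the requirements table (with offload + VAE tiling/slicing).
BF16_FLOOR_GB = 10
INT8_FLOOR_GB = 7
MIN_DISK_GB = 25


def pick_precision(vram_gb: float) -> str:
    """Map a card's VRAM size to the path this recipe would use."""
    if vram_gb >= BF16_FLOOR_GB:
        return "bf16"
    if vram_gb >= INT8_FLOOR_GB:
        return "int8"
    return "insufficient"


def has_disk_space(path: str = ".", needed_gb: float = MIN_DISK_GB) -> bool:
    """Check free space on the filesystem that contains `path`."""
    free_gb = shutil.disk_usage(path).free / 1e9
    return free_gb >= needed_gb


if __name__ == "__main__":
    print("12 GB card:", pick_precision(12))
    print("Enough disk for the weights:", has_disk_space("."))
```

A 12 GB card lands on the BF16 path, which matches the offload setup used in the rest of this recipe.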

Installation

1. Install diffusers and dependencies

pip install "diffusers>=0.32" transformers accelerate torch torchvision

For the INT8 path (7 GB floor), also install TorchAO:

pip install torchao

2. Download CogVideoX 1.5-5B weights

The official THUDM org has migrated to zai-org. The old THUDM paths still redirect, but pin the canonical zai-org name going forward.

huggingface-cli download zai-org/CogVideoX1.5-5B \
  --local-dir ./models/cogvideox-1.5/

Native resolution is 1360×768. Frame count must follow the formula 16N + 1 with N ≤ 10 (default 81, i.e. N = 5), per the CogVideoX1.5-5B model card.
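
The frame-count rule is easy to get wrong by one, so a small helper can validate a value before a long run. The sketch below is illustrative only: the helper names are hypothetical, the lower bound N ≥ 1 is an assumption (the rule quoted above only states N ≤ 10), and the 16 fps playback rate is assumed from the model card rather than stated in this recipe.

```python
import math

FPS = 16    # assumption: CogVideoX 1.5 frame rate as listed on the model card
MAX_N = 10  # upper bound from the 16N + 1 rule


def is_valid_frame_count(num_frames: int) -> bool:
    """True if num_frames has the form 16N + 1 with 1 <= N <= MAX_N."""
    n, remainder = divmod(num_frames - 1, 16)
    return remainder == 0 and 1 <= n <= MAX_N


def frames_for_seconds(seconds: float) -> int:
    """Smallest valid frame count that covers `seconds` of video at FPS."""
    n = max(1, math.ceil(seconds * FPS / 16))
    if n > MAX_N:
        raise ValueError(f"{seconds} s needs N={n}, above the limit of {MAX_N}")
    return 16 * n + 1


if __name__ == "__main__":
    print(frames_for_seconds(5))      # the default clip length
    print(is_valid_frame_count(80))   # off by one, so rejected
```

Under these assumptions the default of 81 frames corresponds to N = 5, or about five seconds of video.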

Running

The model card's 10 GB BF16 floor is a number reached with optimizations enabled — specifically sequential CPU offload plus VAE tiling and VAE slicing. On a 12 GB card these are not optional. A minimal diffusers invocation:

import torch
from diffusers import CogVideoXPipeline
from diffusers.utils import export_to_video

# Load the locally downloaded weights in BF16.
pipe = CogVideoXPipeline.from_pretrained(
    "./models/cogvideox-1.5/",
    torch_dtype=torch.bfloat16,
)
# Required on a 12 GB card to stay within the BF16 floor.
# Do not also call pipe.to("cuda"); offload manages device placement.
pipe.enable_sequential_cpu_offload()
pipe.vae.enable_tiling()
pipe.vae.enable_slicing()

prompt = "A slow push-in shot of a forest path in autumn, golden leaves falling gently, cinematic lighting"
video = pipe(
    prompt=prompt,
    height=768,
    width=1360,
    num_frames=81,  # 16N + 1 with N = 5
    num_inference_steps=50,
    guidance_scale=6.0,
).frames[0]
# 81 frames at 16 fps is roughly a 5-second clip (16 fps per the model card).
export_to_video(video, "output.mp4", fps=16)

The VAE decode stage is the runtime VRAM peak. The kijai ComfyUI wrapper README notes that "VAE decoding seems to be the only big that takes a lot of VRAM when everything is offloaded, peaks at around 13-14GB momentarily at that stage". That peak exceeds a 12 GB card, which is why VAE tiling/slicing (above) matters. The script writes its result to output.mp4.
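
To see where your own run peaks, PyTorch's allocator statistics can be read around the pipeline call. This is a rough sketch, not part of the original recipe: peak_vram_gib is a hypothetical helper, it reports only memory allocated through PyTorch (so it will read lower than nvidia-smi), and it returns None when PyTorch or a CUDA device is unavailable.

```python
try:
    import torch
except ImportError:  # allows importing this helper on machines without PyTorch
    torch = None


def bytes_to_gib(num_bytes: int) -> float:
    """Convert a byte count to GiB."""
    return num_bytes / 2**30


def peak_vram_gib(run):
    """Call run() and return the peak CUDA memory PyTorch allocated, in GiB.

    Returns None (after still calling run()) if there is no CUDA device.
    """
    if torch is None or not torch.cuda.is_available():
        run()
        return None
    torch.cuda.reset_peak_memory_stats()  # start counting from this point
    run()
    torch.cuda.synchronize()              # wait for queued GPU work to finish
    return bytes_to_gib(torch.cuda.max_memory_allocated())


if __name__ == "__main__":
    # Stand-in workload; in practice wrap the pipe(...) call from the script above.
    print(peak_vram_gib(lambda: None))
```

Wrapping the generation call this way gives a single peak figure for the whole run, which on this setup is expected to be set by the VAE decode stage.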

INT8 (7 GB floor) for more headroom

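The CogVideoX 1.5-5B model card documents a 7 GB VRAM floor via TorchAO INT8 with the same offload optimizations enabled. Apply it right after loading the pipeline, before any offload call:

from torchao.quantization import quantize_, int8_weight_only

# `pipe` is the pipeline loaded in the Running section.
# Quantize the transformer weights in place before calling
# pipe.enable_sequential_cpu_offload().
quantize_(pipe.transformer, int8_weight_only())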

Performance

Speed (RTX 4070 Super): Omitted, because no first-party measurement exists for this pair. The only /check benchmark currently attached to it is a misattributed community number for CogVideoX-5B (the 1.0 predecessor), not a measured CogVideoX 1.5-5B run, so quoting it here would mislead. Submit a measured run via /contribute to seed the first real RTX 4070 Super datapoint.
Cross-card extrapolation: On the near-identical RTX 4070 (same 12 GB Ada tier, fewer CUDA cores), the model fits via the same offload path. The 4070 Super should be modestly faster, so treat the 4070's time as a conservative upper bound on the 4070 Super's per-clip runtime, not as a measured 4070 Super number. See the RTX 4070 recipe for the sibling's fuller framing.
Reference (model card): the CogVideoX1.5-5B card reports a 5-second clip at ~1000 seconds on a single A100 and ~550 seconds on a single H100. A 12 GB consumer card running under sequential CPU offload will be materially slower than the A100 reference — budget tens of minutes per native-resolution 1.5 clip, not minutes.
VRAM usage: 10 GB BF16 / 7 GB INT8 floor with offload + VAE tiling/slicing enabled, per the model card. The VAE decode stage can transiently peak higher (kijai wrapper: ~13-14 GB momentarily, even with everything else offloaded), which is why VAE tiling and slicing are mandatory alongside offload on a 12 GB card.
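
The model-card reference times can be turned into rough per-step figures for planning draft versus final runs. The sketch below assumes the reference clips used 50 inference steps (the step count used in this recipe) and that runtime scales linearly with steps, ignoring fixed costs such as text encoding and VAE decode. The slowdown parameter is a hypothetical multiplier you would have to measure yourself; no value for a 12 GB card is implied here.

```python
REFERENCE_STEPS = 50  # assumption: step count behind the model-card times
REFERENCE_SECONDS = {"A100": 1000, "H100": 550}  # per 5-second clip


def seconds_per_step(gpu: str) -> float:
    """Average seconds per denoising step on the reference GPU."""
    return REFERENCE_SECONDS[gpu] / REFERENCE_STEPS


def estimate_seconds(gpu: str, steps: int, slowdown: float = 1.0) -> float:
    """Linear estimate of clip time; `slowdown` is a user-measured multiplier."""
    return seconds_per_step(gpu) * steps * slowdown


if __name__ == "__main__":
    for steps in (30, 50):
        minutes = estimate_seconds("A100", steps) / 60
        print(f"A100, {steps} steps: about {minutes:.1f} min")
```

Even on the A100 reference a 50-step clip takes well over a quarter of an hour under these assumptions, which is why a 12 GB card running under offload should be budgeted in tens of minutes.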

For the full benchmark data, see /check/cogvideox-1-5/rtx-4070-super.

⚠️ Variant attribution. The widely cited community times of ~10 min (RTX 4070) and ~15 min (RTX 4070 Super) on the HF discussion are for CogVideoX-5B, the 1.0 release, not CogVideoX 1.5-5B. The model card's A100 reference above (~1000 seconds per 5-second clip) indicates that 1.5 is several times slower per clip than the 1.0 release, so do not transfer the 5B community times to a 1.5 recipe. For 1.5 timing on higher-VRAM hardware, see the RTX 4090 recipe.

Troubleshooting

OOM at 12 GB on native 1360×768 BF16

Expected if offload and VAE tiling/slicing are not all enabled: the VAE decode stage can transiently peak at ~13-14 GB (kijai wrapper). Ensure enable_sequential_cpu_offload(), vae.enable_tiling(), and vae.enable_slicing() are all active, or switch to the INT8 path (7 GB floor).

Generation is very slow

Expected — CogVideoX 1.5 prioritizes quality over speed, and sequential CPU offload trades VRAM for time by streaming weights from system RAM. Budget tens of minutes per native-resolution clip on a 12 GB card. For faster iteration, reduce num_inference_steps to ~30 for drafts (then 50 for finals), or fall back to a lighter video model.

Blurry output

Use num_inference_steps >= 40 and keep guidance_scale around 6.0. Drafts at 30 steps are useful for checking prompts and composition but will look softer, so run the full 50 steps for finals.

No other widely-reported issues for this pair. Report problems or submit a measured benchmark via /contribute.