CogVideoX 1.5 5B on RTX 4090: 1360x768 Text-to-Video with Diffusers

What You'll Build

A local text-to-video pipeline that generates 5- or 10-second clips at 1360×768 from a text prompt using THUDM/CogVideoX1.5-5B on a single RTX 4090. The 4090's 24 GB of VRAM leaves room to relax the offload optimizations that smaller cards depend on, trading VRAM headroom for speed and simpler runtime behavior.

Hardware data: RTX 4090 (24 GB VRAM) · CogVideoX 1.5 5B BF16 fits comfortably with diffusers optimizations enabled · See benchmark data

ℹ️ Pick this variant, not its siblings. CogVideoX is a family: CogVideoX-2B (8N+1 frames at 720×480, ~5 GB BF16), CogVideoX-5B (8N+1 frames at 720×480, 15 GB BF16), CogVideoX1.5-5B (this recipe: 16N+1 frames at 1360×768, from 10 GB BF16 with optimizations), and CogVideoX1.5-5B-I2V (image-to-video with sides between 768 and 1360 px, from 4 GB FP16). The official model card lists a distinct VRAM and resolution profile for each of the four. This recipe pins the 1.5-5B text-to-video variant; for image-to-video on the same GPU, swap in the I2V pipeline and consult the I2V model card separately.

⚠️ 4090 is over-provisioned for CogVideoX 1.5. With diffusers optimizations enabled, the model fits in 10–14 GB. The 4090's extra headroom is useful if you disable optimizations for a 3–4× speed boost (per the model card note: "Disabling optimizations can triple VRAM usage but increase speed by 3-4 times"), but a 4070 Ti Super or 4080 will run the optimized path at the same quality. Pick the 4090 for this model only if you also need it for larger workloads.

Requirements

Component	Minimum	Tested
GPU	16 GB VRAM (RTX 4080, 4070 Ti Super) for optimized path; 24 GB for the no-offload speedup	RTX 4090 (24 GB)
RAM	16 GB system RAM	32 GB recommended for CPU-offload swap
Storage	~40 GB for weights	~40 GB (transformer + T5 + VAE)
Software	Python 3.10–3.12, diffusers from source, transformers ≥ 4.46.2, accelerate ≥ 1.1.1	—
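To check which tier of the table above a given machine falls into, the GPU name and total memory can be read from nvidia-smi. The following is a minimal sketch, not part of the original recipe: the function names, tier labels, and the 1 GiB tolerance are assumptions made for illustration (cards sold as 16 GB or 24 GB report slightly less than the nominal figure).

```python
import subprocess

# Thresholds taken from the requirements table.
MIN_VRAM_GB = 16      # optimized (offloaded) path
NO_OFFLOAD_GB = 24    # no-offload speedup path
TOLERANCE_GB = 1      # assumption: nominal sizes are reported slightly low


def parse_gpu_line(line: str) -> tuple[str, float]:
    """Parse one 'name, memory_MiB' CSV line into (name, memory in GiB)."""
    name, mem_mib = line.rsplit(",", 1)
    return name.strip(), int(mem_mib.strip()) / 1024


def classify(vram_gb: float) -> str:
    """Map a VRAM size to a tier label (labels are hypothetical)."""
    if vram_gb >= NO_OFFLOAD_GB - TOLERANCE_GB:
        return "no-offload"
    if vram_gb >= MIN_VRAM_GB - TOLERANCE_GB:
        return "optimized"
    return "below-minimum"


def query_gpus() -> list[tuple[str, float]]:
    """Ask nvidia-smi for each GPU's name and total memory."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=name,memory.total",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return [parse_gpu_line(line) for line in out.splitlines() if line.strip()]
```

On a machine with an NVIDIA driver installed, `for name, gb in query_gpus(): print(name, classify(gb))` prints one line per GPU.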

Installation

1. Install diffusers from source

The CogVideoX 1.5 pipeline requires diffusers built from the development branch per the official model card:

pip install git+https://github.com/huggingface/diffusers
pip install --upgrade "transformers>=4.46.2" "accelerate>=1.1.1" imageio-ffmpeg
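After installing, it can help to confirm that the versions actually importable in the environment meet the minimums above. This is a small sketch using only the standard library; the helper names are hypothetical, and the version comparison is deliberately simplified (it compares leading numeric components only).

```python
from importlib.metadata import PackageNotFoundError, version

# Minimums from the requirements table. diffusers is omitted because it is
# installed from source and has no pinned minimum in this recipe.
MINIMUMS = {"transformers": "4.46.2", "accelerate": "1.1.1"}


def release_tuple(v: str) -> tuple[int, ...]:
    """Turn '4.46.2' or '0.32.0.dev0' into a comparable tuple of ints."""
    parts = []
    for piece in v.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break  # stop at the first non-numeric component such as 'dev0'
        parts.append(int(digits))
    return tuple(parts)


def meets_minimum(installed: str, minimum: str) -> bool:
    """True if the installed release is at least the required minimum."""
    return release_tuple(installed) >= release_tuple(minimum)


def report() -> dict[str, str]:
    """Return a status string per package: 'ok', 'too old (...)', or 'missing'."""
    status = {}
    for name, minimum in MINIMUMS.items():
        try:
            installed = version(name)
        except PackageNotFoundError:
            status[name] = "missing"
            continue
        ok = meets_minimum(installed, minimum)
        status[name] = "ok" if ok else f"too old ({installed})"
    return status
```

Calling `print(report())` in the target environment shows the result. For stricter handling of pre-release tags, `packaging.version.Version` is the more robust comparison.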

2. Download the model weights

huggingface-cli download THUDM/CogVideoX1.5-5B --local-dir ./CogVideoX1.5-5B

If THUDM/CogVideoX1.5-5B is unavailable, use the zai-org/CogVideoX1.5-5B mirror. It is the same model: the upstream org was renamed, but the weights are identical.
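An interrupted download can leave the local directory incomplete. The sketch below checks the download against the standard diffusers layout, in which model_index.json names each pipeline component and each component lives in a subfolder of the same name. The helper name is hypothetical, and the check only confirms that folders exist, not that the weight files inside them are intact.

```python
import json
from pathlib import Path


def missing_components(model_dir: str) -> list[str]:
    """List components named in model_index.json that have no subfolder."""
    root = Path(model_dir)
    index_path = root / "model_index.json"
    if not index_path.is_file():
        return ["model_index.json"]
    index = json.loads(index_path.read_text())
    missing = []
    for name, spec in index.items():
        # Keys starting with "_" are metadata (e.g. _class_name), not components.
        if name.startswith("_"):
            continue
        # Components are stored as [library, class_name]; skip anything else,
        # including optional components recorded as [null, null].
        if not (isinstance(spec, list) and spec and spec[0]):
            continue
        if not (root / name).is_dir():
            missing.append(name)
    return missing
```

An empty list from `missing_components("./CogVideoX1.5-5B")` suggests the layout is complete. To load from this directory rather than the Hugging Face cache, pass the same path to `CogVideoXPipeline.from_pretrained` in place of the repo ID.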

3. (Optional) Install ComfyUI + the CogVideoX wrapper

If you prefer a node-based workflow, the kijai/ComfyUI-CogVideoXWrapper repo ships example workflows including cogvideox1.5_t2v.json. Skip if you're using diffusers directly.

cd ComfyUI/custom_nodes
git clone https://github.com/kijai/ComfyUI-CogVideoXWrapper
cd ComfyUI-CogVideoXWrapper
pip install -r requirements.txt

Running

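Save the following as run_cogvideox.py. It follows the canonical inference snippet from the official model card with the model_id pinned; the only other change is fps=16 in the export call, so that playback matches the model's 16 fps frame rate (81 frames exported at fps=8 would play back as a ~10-second slow-motion clip):

import torch
from diffusers import CogVideoXPipeline
from diffusers.utils import export_to_video

prompt = (
    "A panda, dressed in a small, red jacket and a tiny hat, sits on a wooden stool "
    "in a serene bamboo forest. The panda's fluffy paws strum a miniature acoustic "
    "guitar, producing soft, melodic tunes."
)

# Load the text-to-video pipeline in BF16.
pipe = CogVideoXPipeline.from_pretrained(
    "THUDM/CogVideoX1.5-5B",
    torch_dtype=torch.bfloat16,
)

# Memory optimizations: stream weights from CPU as needed and
# decode the VAE in tiles and slices to cap the decode-time peak.
pipe.enable_sequential_cpu_offload()
pipe.vae.enable_tiling()
pipe.vae.enable_slicing()

video = pipe(
    prompt=prompt,
    num_videos_per_prompt=1,
    num_inference_steps=50,
    num_frames=81,  # 16N + 1 with N = 5
    guidance_scale=6,
    generator=torch.Generator(device="cuda").manual_seed(42),  # fixed seed for repeatable output
).frames[0]

# Export at the native 16 fps so 81 frames play back in ~5 seconds.
export_to_video(video, "output.mp4", fps=16)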

python run_cogvideox.py

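num_frames=81 produces a ~5-second clip at 16 fps (the formula is 16N + 1 with N ≤ 10; 81 is N=5). For a 10-second clip set num_frames=161 (N=10). Because the script loads the model by repo ID, the first run downloads the weights to your Hugging Face cache, even if you already downloaded a copy to ./CogVideoX1.5-5B in step 2; pass that directory to from_pretrained to use the local copy instead. Later runs load from the cache and start inference immediately.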
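The relationship between clip length, num_frames, and playback rate can be written out as a short sketch. The function names are hypothetical; the 16N + 1 rule, the N ≤ 10 bound, and the 16 fps rate come from the text above.

```python
NATIVE_FPS = 16  # frame rate stated for CogVideoX1.5-5B
MAX_N = 10       # upper bound on N in the 16N + 1 rule


def frames_for_seconds(seconds: int) -> int:
    """Return num_frames = 16N + 1 for an N-second clip."""
    if not 1 <= seconds <= MAX_N:
        raise ValueError(f"seconds must be between 1 and {MAX_N}")
    return NATIVE_FPS * seconds + 1


def clip_seconds(num_frames: int, fps: int = NATIVE_FPS) -> float:
    """Playback duration of num_frames when exported at the given fps."""
    return num_frames / fps
```

Here `frames_for_seconds(5)` gives 81 and `frames_for_seconds(10)` gives 161. The export rate matters as much as the frame count: `clip_seconds(81)` is about 5.06 seconds at 16 fps, while `clip_seconds(81, fps=8)` is about 10.1 seconds for the same frames.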

Results

Speed: The model card publishes timings only for data-center GPUs: single A100 ~1000 s and single H100 ~550 s for a 5-second clip (50 steps, FP/BF16). It also notes that "This scheme has not been tested for actual memory usage on devices outside of NVIDIA A100 / H100 architectures." On consumer Ada hardware, kijai's ComfyUI-CogVideoXWrapper README reports 4.23 s/it on a 4090 at 49 frames with onediff (Linux-only). At 50 steps that works out to roughly 3.5 minutes of sampling for the 49-frame configuration, which is the right order of magnitude but not a measurement of the 81-frame default. For a measured end-to-end RTX 4090 number on the 81-frame default, see /check/cogvideox-1-5/rtx-4090, and contribute one if you have it.
VRAM usage: The model card's memory table cites diffusers BF16: from 10 GB with enable_sequential_cpu_offload() + vae.enable_tiling() + vae.enable_slicing() active. kijai's wrapper README observes that "VAE decoding...peaks at around 13-14 GB momentarily" and "Sampling itself takes only maybe 5-6 GB" for the 5B family on a 4090. The 4090's 24 GB leaves ~10 GB of unused headroom on the optimized path.
Quality notes: Native resolution is 1360×768, not 720×480 (that is the resolution of the original CogVideoX-2B and CogVideoX-5B). Don't reduce below 768 on the short axis; the model is trained for the higher resolution. Keep steps at 50 for quality; lower step counts (< 40) noticeably degrade output per the model card.
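The speed figures above can be put on a common footing with a little arithmetic. This sketch assumes, for simplicity, that the whole run is spent in the sampling loop (text encoding and VAE decode are ignored), so the results are rough estimates and not benchmarks. The function names are hypothetical.

```python
def sampling_minutes(seconds_per_step: float, steps: int = 50) -> float:
    """Estimated sampling time in minutes for a given per-step cost."""
    return seconds_per_step * steps / 60


def seconds_per_step(total_seconds: float, steps: int = 50) -> float:
    """Average per-step cost implied by an end-to-end timing."""
    return total_seconds / steps
```

With the numbers quoted above, `sampling_minutes(4.23)` is about 3.5 minutes for the 49-frame 4090 report, while `seconds_per_step(1000)` and `seconds_per_step(550)` give 20 and 11 seconds per step for the A100 and H100 reference timings.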

For the full benchmark data, see /check/cogvideox-1-5/rtx-4090.

Troubleshooting

Want a 3–4× speed boost and willing to use more VRAM?

The 4090's 24 GB is large enough to drop some of the diffusers offload optimizations. Per the model card: "Disabling optimizations can triple VRAM usage but increase speed by 3-4 times. You can selectively disable certain optimizations." The highest-impact toggle is pipe.enable_sequential_cpu_offload(): replace it with pipe.to("cuda") (simply deleting the line leaves the pipeline on the CPU) while keeping vae.enable_tiling() and vae.enable_slicing(). Watch nvidia-smi during a run; if peak usage stays under ~22 GB, the change is safe.
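One way to make this toggle explicit is a small helper that selects the memory mode in one place. This is a sketch under stated assumptions, not the model card's code: the helper name and mode labels are hypothetical, and the "model" mode uses enable_model_cpu_offload(), a standard diffusers method that the recipe above does not mention, as a middle ground between full offload and fully resident weights. Whether the "none" mode fits in 24 GB at 1360×768 is not confirmed here; verify it with nvidia-smi.

```python
from typing import Any

MODES = ("sequential", "model", "none")


def apply_memory_mode(pipe: Any, mode: str = "sequential") -> Any:
    """Configure offloading on a diffusers pipeline and return it.

    sequential: lowest VRAM, slowest (the recipe's default)
    model:      whole sub-models are moved to the GPU one at a time
    none:       everything resident on the GPU, fastest, highest VRAM
    """
    if mode not in MODES:
        raise ValueError(f"mode must be one of {MODES}")
    if mode == "sequential":
        pipe.enable_sequential_cpu_offload()
    elif mode == "model":
        pipe.enable_model_cpu_offload()
    else:
        # Without an offload hook the pipeline must be moved explicitly.
        pipe.to("cuda")
    # VAE tiling and slicing stay on in every mode to cap the decode peak.
    pipe.vae.enable_tiling()
    pipe.vae.enable_slicing()
    return pipe
```

In run_cogvideox.py this would replace the three optimization lines with a single call such as `apply_memory_mode(pipe, "none")`, which makes it easy to step back to "model" or "sequential" if the run exceeds available VRAM.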

OOM during VAE decoding even on 24 GB

This is rare on a 4090 but possible at the 161-frame (10-second) configuration. Per kijai's wrapper README, the VAE decode is the runtime peak (~13–14 GB on the 5B family). Keep pipe.vae.enable_tiling() and pipe.vae.enable_slicing() on; if it still runs out of memory, drop num_frames to 81 (5-second clip) and generate any additional length as a separate clip.
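The fallback described above can be automated by trying the longer configuration first and retrying with fewer frames on an out-of-memory error. The sketch below is generic and hypothetical: `generate` stands for any function that runs the pipeline for a given num_frames, and the exception type is a parameter so the helper does not depend on PyTorch. With PyTorch, the type to pass would be torch.cuda.OutOfMemoryError, and torch.cuda.empty_cache is a reasonable cleanup callback.

```python
from typing import Any, Callable, Optional, Sequence


def generate_with_fallback(
    generate: Callable[[int], Any],
    frame_counts: Sequence[int] = (161, 81),
    oom_error: type = MemoryError,
    cleanup: Optional[Callable[[], None]] = None,
) -> tuple[int, Any]:
    """Try each frame count in order; return (num_frames, result) on success."""
    last_error = None
    for num_frames in frame_counts:
        try:
            return num_frames, generate(num_frames)
        except oom_error as exc:
            last_error = exc
            if cleanup is not None:
                cleanup()  # e.g. release cached GPU memory before retrying
    raise RuntimeError("all frame counts ran out of memory") from last_error
```

A caller would wrap the pipeline invocation, for example `generate_with_fallback(lambda n: pipe(prompt=prompt, num_frames=n).frames[0], oom_error=torch.cuda.OutOfMemoryError, cleanup=torch.cuda.empty_cache)`, and check the returned frame count to see which configuration succeeded.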

Multi-GPU note

The model card explicitly warns: "In multi-GPU inference, enable_sequential_cpu_offload() optimization needs to be disabled." Single 4090 setups are unaffected — this is only relevant if you split across two 4090s.

flash_attention_2 errors

Unlike Blackwell GPUs (sm_120), the RTX 4090 (Ada, sm_89) has full FlashAttention-2 kernel coverage. No special wheel selection is required — the default pip install torch already includes sm_89 kernels. If you hit an FA2 error, it's almost certainly a transformers/diffusers version mismatch, not a kernel-availability issue.

Output looks low-res

Make sure num_frames follows the 16N + 1 formula (e.g. 17, 33, 49, 65, 81, 97, 113, 129, 145, 161). Off-by-one frame counts trigger the wrong code path. Resolution is fixed at 1360×768 for the T2V variant; don't try to force lower — the model is trained for this size.
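A small guard can catch invalid frame counts before a long run starts. This sketch checks a requested value against the 16N + 1 rule with 1 ≤ N ≤ 10 and, if it does not fit, snaps it to the nearest valid value (ties round up). The function names are hypothetical.

```python
def is_valid_num_frames(num_frames: int, max_n: int = 10) -> bool:
    """True if num_frames equals 16N + 1 for some integer 1 <= N <= max_n."""
    n, remainder = divmod(num_frames - 1, 16)
    return remainder == 0 and 1 <= n <= max_n


def nearest_valid_num_frames(requested: int, max_n: int = 10) -> int:
    """Snap a requested frame count to the closest value of the form 16N + 1."""
    n = (requested - 1 + 8) // 16   # integer rounding to the nearest N
    n = max(1, min(max_n, n))       # clamp N to the supported range
    return 16 * n + 1
```

For example, a request for 80 or 82 frames is snapped to 81, and anything above 161 is clamped to 161.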

If your problem isn't covered above, report it via the submission form so we can extend this section.