What You'll Build
A local text-to-video pipeline that generates 5- or 10-second clips at 1360×768 from a text prompt using THUDM/CogVideoX1.5-5B on a single RTX 4090. The default script uses the memory-optimized diffusers path; the 4090's 24 GB of VRAM also gives you the option of dropping CPU offload and keeping every stage on the GPU, trading VRAM headroom for speed.
Hardware data: RTX 4090 (24 GB VRAM) · CogVideoX 1.5 5B BF16 fits comfortably with diffusers optimizations enabled · See benchmark data
ℹ️ Pick this variant, not its siblings. CogVideoX is a family of four models: CogVideoX-2B (8N+1 frames at 720×480, ~5 GB BF16), CogVideoX-5B (8N+1 frames at 720×480, 15 GB BF16), CogVideoX1.5-5B (this recipe: 16N+1 frames at 1360×768, from 10 GB BF16 with optimizations), and CogVideoX1.5-5B-I2V (image-to-video at 768–1360, from 4 GB FP16). All four have distinct VRAM and resolution profiles on their official model cards. This recipe pins the 1.5-5B text-to-video variant; for image-to-video on the same GPU, swap in the I2V pipeline and consult the I2V model card separately.
⚠️ 4090 is over-provisioned for CogVideoX 1.5. With diffusers optimizations enabled, the model fits in 10–14 GB. The 4090's extra headroom is useful if you disable optimizations for a 3–4× speed boost (per the model card note: "Disabling optimizations can triple VRAM usage but increase speed by 3-4 times"), but a 4070 Ti Super or 4080 will run the optimized path at the same quality. Pick the 4090 for this model only if you also need it for larger workloads.
Requirements
| Component | Minimum | Tested |
|---|---|---|
| GPU | 16 GB VRAM (RTX 4080, 4070 Ti Super) for optimized path; 24 GB for the no-offload speedup | RTX 4090 (24 GB) |
| RAM | 16 GB system RAM | 32 GB recommended for CPU-offload swap |
| Storage | ~40 GB for weights | ~40 GB (transformer + T5 + VAE) |
| Software | Python 3.10–3.12, diffusers from source, transformers ≥ 4.46.2, accelerate ≥ 1.1.1 | — |
Installation
1. Install diffusers from source
The CogVideoX 1.5 pipeline requires diffusers built from the development branch per the official model card:
pip install git+https://github.com/huggingface/diffusers
pip install --upgrade "transformers>=4.46.2" "accelerate>=1.1.1" imageio-ffmpeg
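A quick way to confirm the install meets the minimums in the requirements table is sketched below. This is a minimal check using only the standard library; the helper names `version_tuple` and `check_environment` are illustrative and not part of any package, and the version parser is deliberately simple (it only compares the leading numeric components).

```python
import sys
from importlib import metadata

# Minimums taken from the requirements table. diffusers is not listed
# because the recipe installs it from source.
MINIMUMS = {"transformers": "4.46.2", "accelerate": "1.1.1"}


def version_tuple(version: str) -> tuple:
    """Turn '4.46.2' or '0.32.0.dev0' into a tuple of its leading integers."""
    parts = []
    for piece in version.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break  # stop at the first non-numeric component such as 'dev0'
        parts.append(int(digits))
    return tuple(parts)


def check_environment(minimums=MINIMUMS) -> list:
    """Return human-readable problems; an empty list means everything passed."""
    problems = []
    major, minor = sys.version_info[:2]
    if not ((3, 10) <= (major, minor) <= (3, 12)):
        problems.append(f"Python {major}.{minor} is outside 3.10-3.12")
    for name, minimum in minimums.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            problems.append(f"{name} is not installed")
            continue
        if version_tuple(installed) < version_tuple(minimum):
            problems.append(f"{name} {installed} is older than {minimum}")
    return problems


for problem in check_environment():
    print("WARNING:", problem)
```

If the script prints nothing, the Python version and both pinned packages satisfy the table.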
2. Download the model weights
huggingface-cli download THUDM/CogVideoX1.5-5B --local-dir ./CogVideoX1.5-5B
If THUDM/CogVideoX1.5-5B is unavailable, use the zai-org/CogVideoX1.5-5B mirror instead. It is the same model: the upstream org was renamed, but the weights are identical.
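The same download with a fallback to the mirror can be sketched in Python. This assumes `huggingface_hub` is installed (it comes in with diffusers) and uses its `snapshot_download` function; the wrapper `download_weights` and its `fetch` parameter are hypothetical names added here so the logic can be exercised without network access. Catching every exception is a simplification for the sketch.

```python
def download_weights(
    local_dir,
    repo_ids=("THUDM/CogVideoX1.5-5B", "zai-org/CogVideoX1.5-5B"),
    fetch=None,
):
    """Try each repo ID in order and return the one that downloaded."""
    if fetch is None:
        # Imported lazily so this module loads even without huggingface_hub.
        from huggingface_hub import snapshot_download
        fetch = snapshot_download

    errors = []
    for repo_id in repo_ids:
        try:
            fetch(repo_id=repo_id, local_dir=local_dir)
            return repo_id
        except Exception as exc:  # broad on purpose: this is only a sketch
            errors.append(f"{repo_id}: {exc}")
    raise RuntimeError("no repo could be downloaded:\n" + "\n".join(errors))
```

Calling `download_weights("./CogVideoX1.5-5B")` mirrors the CLI command above and only falls through to the mirror when the first repo ID fails.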
3. (Optional) Install ComfyUI + the CogVideoX wrapper
If you prefer a node-based workflow, the kijai/ComfyUI-CogVideoXWrapper repo ships example workflows including cogvideox1.5_t2v.json. Skip if you're using diffusers directly.
cd ComfyUI/custom_nodes
git clone https://github.com/kijai/ComfyUI-CogVideoXWrapper
cd ComfyUI-CogVideoXWrapper
pip install -r requirements.txt
Running
Save the following as run_cogvideox.py. It follows the inference snippet from the official model card, with the model ID pinned to THUDM/CogVideoX1.5-5B and the export frame rate set to 16 fps so that playback length matches the clip durations quoted in this recipe:
import torch
from diffusers import CogVideoXPipeline
from diffusers.utils import export_to_video

prompt = (
    "A panda, dressed in a small, red jacket and a tiny hat, sits on a wooden stool "
    "in a serene bamboo forest. The panda's fluffy paws strum a miniature acoustic "
    "guitar, producing soft, melodic tunes."
)

# Load the text-to-video pipeline in BF16.
pipe = CogVideoXPipeline.from_pretrained(
    "THUDM/CogVideoX1.5-5B",
    torch_dtype=torch.bfloat16,
)

# Memory optimizations: stream weights from system RAM and decode the video
# in tiles and slices so the VAE peak stays small.
pipe.enable_sequential_cpu_offload()
pipe.vae.enable_tiling()
pipe.vae.enable_slicing()

video = pipe(
    prompt=prompt,
    num_videos_per_prompt=1,
    num_inference_steps=50,
    num_frames=81,  # 16N + 1 with N = 5
    guidance_scale=6,
    generator=torch.Generator(device="cuda").manual_seed(42),  # fixed seed
).frames[0]

# Write at 16 fps so that 81 frames play back in about 5 seconds.
export_to_video(video, "output.mp4", fps=16)
Then run it:
python run_cogvideox.py
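num_frames=81 produces a ~5-second clip at 16 fps (the formula is 16N + 1 with N ≤ 10; 81 is N=5). For a 10-second clip, set num_frames=161 (N=10). Because the script loads the model by its Hub ID, the first run downloads the weights to your Hugging Face cache and later runs start inference immediately. To reuse the copy downloaded in step 2 instead, pass "./CogVideoX1.5-5B" to from_pretrained.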
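The frame-count rule can be written out as a small helper. This is a sketch of the arithmetic only; the function names are illustrative, and the bounds (N from 1 to 10, 16 fps) are taken from the figures quoted in this recipe rather than read from the pipeline itself.

```python
FPS = 16    # playback rate this recipe quotes for CogVideoX 1.5
MAX_N = 10  # upper bound on N stated in this recipe


def frames_for(n: int) -> int:
    """Return the frame count 16N + 1 for a given N."""
    if not 1 <= n <= MAX_N:
        raise ValueError(f"N must be between 1 and {MAX_N}, got {n}")
    return 16 * n + 1


def is_valid_num_frames(num_frames: int) -> bool:
    """True if num_frames has the form 16N + 1 with 1 <= N <= MAX_N."""
    n, remainder = divmod(num_frames - 1, 16)
    return remainder == 0 and 1 <= n <= MAX_N


def clip_seconds(num_frames: int, fps: int = FPS) -> float:
    """Playback length in seconds when exported at the given frame rate."""
    return num_frames / fps


print(frames_for(5), clip_seconds(frames_for(5)))  # prints: 81 5.0625
```

Checking `is_valid_num_frames` before calling the pipeline catches values such as 80 or 160 before a long run starts.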
Results
- Speed: The official model card gives single-GPU reference points of ~1000 s on an A100 and ~550 s on an H100 for a 5-second clip (50 steps, FP/BF16), and notes: "This scheme has not been tested for actual memory usage on devices outside of NVIDIA A100 / H100 architectures." On consumer Ada hardware, kijai's ComfyUI-CogVideoXWrapper README reports 4.23 s/it on a 4090 with 49 frames with onediff (Linux-only). At 50 steps that is ≈ 3–4 min for the 49-frame configuration, which is consistent with the model card figures to an order of magnitude. For a measured end-to-end RTX 4090 number on the 81-frame default, see /check/cogvideox-1-5/rtx-4090, and contribute one if you have it.
- VRAM usage: The model card's memory table cites diffusers BF16: from 10 GB with enable_sequential_cpu_offload() + vae.enable_tiling() + vae.enable_slicing() active. kijai's wrapper README observes that "VAE decoding...peaks at around 13-14 GB momentarily" and "Sampling itself takes only maybe 5-6 GB" for the 5B family on a 4090. The 4090's 24 GB leaves ~10 GB of unused headroom on the optimized path.
- Quality notes: Native resolution is 1360×768, not 720×480 (that is the resolution of the original CogVideoX-2B and CogVideoX-5B). Don't reduce below 768 on the short axis; the model is trained for the higher resolution. Keep steps at 50 for quality; lower step counts (< 40) noticeably degrade output per the model card.
For the full benchmark data, see /check/cogvideox-1-5/rtx-4090.
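The 3–4 minute estimate in the speed note is simple arithmetic on the reported per-step time. The sketch below reproduces it; `denoising_minutes` is an illustrative name, and the result covers the denoising loop only, so text encoding, VAE decoding, and model loading would come on top.

```python
def denoising_minutes(seconds_per_step: float, steps: int = 50) -> float:
    """Estimate the denoising-loop time in minutes from a per-step timing."""
    return seconds_per_step * steps / 60.0


# 4.23 s/it is the figure the wrapper README reports for 49 frames on a 4090.
print(f"{denoising_minutes(4.23):.1f} min")  # prints: 3.5 min
```

The same function lets you plug in your own measured s/it for the 81-frame default.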
Troubleshooting
Want a 3–4× speed boost and willing to use more VRAM?
The 4090's 24 GB is large enough to drop some of the diffusers offload optimizations. Per the model card: "Disabling optimizations can triple VRAM usage but increase speed by 3-4 times. You can selectively disable certain optimizations." Start with the highest-impact toggle: replace pipe.enable_sequential_cpu_offload() with pipe.to("cuda") so the pipeline still ends up on the GPU, and keep pipe.vae.enable_tiling() and pipe.vae.enable_slicing(). Watch nvidia-smi during a run; if the peak stays under ~22 GB, the change is safe.
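One way to make the toggle easy to switch is a small helper that applies a chosen memory mode. This is a sketch, not the model card's method: `apply_memory_mode` and its mode names are made up for illustration, while the pipeline methods it calls (`enable_sequential_cpu_offload`, `enable_model_cpu_offload`, `to`, and the VAE's `enable_tiling` and `enable_slicing`) are standard diffusers calls. The "model" mode is an assumed middle ground that this recipe has not measured.

```python
def apply_memory_mode(pipe, mode: str = "sequential"):
    """Configure offload on a diffusers pipeline.

    mode:
      "sequential" - lowest VRAM, slowest (the recipe's default)
      "model"      - whole-component offload, an untested middle ground
      "none"       - everything resident on the GPU, highest VRAM
    """
    if mode == "sequential":
        pipe.enable_sequential_cpu_offload()
    elif mode == "model":
        pipe.enable_model_cpu_offload()
    elif mode == "none":
        pipe.to("cuda")
    else:
        raise ValueError(f"unknown memory mode: {mode!r}")

    # Keep the VAE optimizations in every mode; decoding is the VRAM peak.
    pipe.vae.enable_tiling()
    pipe.vae.enable_slicing()
    return pipe
```

In run_cogvideox.py, the three optimization lines could then be replaced with `apply_memory_mode(pipe, "none")` while you watch peak usage in nvidia-smi.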
OOM during VAE decoding even on 24 GB
This is rare on a 4090 but possible at the 161-frame (10-second) configuration. Per kijai's wrapper, the VAE decode is the runtime peak (~13–14 GB on the 5B family). Keep pipe.vae.enable_tiling() and pipe.vae.enable_slicing() on; if it still OOMs, drop num_frames to 81 (5-second clip) and use a second pass for length.
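The fallback from 161 to 81 frames can be automated with a retry wrapper. This is a sketch under stated assumptions: `generate_with_fallback` is a hypothetical helper, `render` is any callable you supply that runs the pipeline for a given frame count, and the default `oom_error=MemoryError` is a stand-in. With PyTorch you would pass `torch.cuda.OutOfMemoryError` as `oom_error` and `torch.cuda.empty_cache` as `cleanup`.

```python
def generate_with_fallback(
    render,
    frame_counts=(161, 81),
    oom_error=MemoryError,
    cleanup=None,
):
    """Try each frame count in order; return (num_frames, result) on success."""
    for num_frames in frame_counts:
        try:
            return num_frames, render(num_frames)
        except oom_error:
            # Release cached memory, if a cleanup hook was given, then
            # move on to the next (smaller) frame count.
            if cleanup is not None:
                cleanup()
    raise RuntimeError(
        f"out of memory at every frame count tried: {list(frame_counts)}"
    )
```

Here `render` would wrap the `pipe(...)` call from run_cogvideox.py and return `.frames[0]`; the returned frame count tells you which configuration actually succeeded.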
Multi-GPU note
The model card explicitly warns: "In multi-GPU inference, enable_sequential_cpu_offload() optimization needs to be disabled." Single 4090 setups are unaffected — this is only relevant if you split across two 4090s.
flash_attention_2 errors
Unlike Blackwell GPUs (sm_120), the RTX 4090 (Ada, sm_89) has full FlashAttention-2 kernel coverage. No special wheel selection is required — the default pip install torch already includes sm_89 kernels. If you hit an FA2 error, it's almost certainly a transformers/diffusers version mismatch, not a kernel-availability issue.
Output looks low-res
Make sure num_frames follows the 16N + 1 formula (e.g. 17, 33, 49, 65, 81, 97, 113, 129, 145, 161). Off-by-one frame counts trigger the wrong code path. Resolution is fixed at 1360×768 for the T2V variant; don't try to force lower — the model is trained for this size.
If your problem isn't covered above, report it via the submission form so we can extend this section.