What You'll Build
A local text-to-video pipeline that generates 5-second clips at 1360×768 from a text prompt using THUDM/CogVideoX1.5-5B on a single RTX 5090. Where the 24 GB 3090 and 4090 siblings are pinned to the model card's default full enable_sequential_cpu_offload() path, the 5090's 32 GB envelope lets you switch to the lighter enable_model_cpu_offload() middle-ground — the official cli_demo.py documents this swap as a commented-out alternative for users with enough GPU memory.
Hardware data: RTX 5090 (32 GB VRAM) · CogVideoX 1.5 5B BF16 with enable_model_cpu_offload() + VAE tiling/slicing fits the 32 GB envelope with comfortable headroom · See benchmark data
ℹ️ Pick this variant, not its siblings. CogVideoX is a family: CogVideoX-2B (8N+1 frames at 720×480, ~5 GB BF16), CogVideoX-5B (8N+1 at 720×480, 15 GB BF16), CogVideoX1.5-5B (this recipe: 16N+1 frames at 1360×768, from 10 GB BF16 with optimizations), and CogVideoX1.5-5B-I2V (image-to-video at 768–1360). The official model card lists a distinct VRAM and resolution profile for each of the four. This recipe pins the 1.5-5B text-to-video variant.
⚠️ Full no-offload (pipe.to("cuda")) is NOT safe on 32 GB. Don't be tempted to skip CPU offload entirely. The model card's SAT BF16 "without optimizations" peak is 76 GB; the diffusers CogVideoX docs document ~33 GB peak with all optimizations disabled for the smaller CogVideoX-5B (720×480, 8N+1 frames). CogVideoX 1.5 runs at the larger 1360×768, 16N+1 frame format, so its no-offload peak is materially higher than the CogVideoX-5B figure. The recipe's enable_model_cpu_offload() path moves one whole component onto the GPU at a time, so the T5 text encoder (19 GB checkpoint) sits in system RAM while the transformer (11 GB) denoises and the VAE decodes. That is the right middle ground for this card.
Requirements
| Component | Minimum | Tested |
|---|---|---|
| GPU | ~20 GB VRAM (derived, not measured: diffusers documents a 19 GB peak with enable_model_cpu_offload() for the smaller CogVideoX-5B at 720×480, and 1.5-5B runs at 1360×768, so budget extra headroom); 32 GB recommended for VAE-decode comfort at 81 frames | RTX 5090 (32 GB) |
| RAM | 32 GB system RAM minimum; 64 GB recommended for the T5 encoder's CPU-side residency under model offload | — |
| Storage | ~40 GB for weights (per the README) | ~31 GB measured (11.14 GB transformer + 19.05 GB T5 + 0.86 GB VAE per HF Files tab) |
| Software | Python 3.10–3.12, diffusers from source, transformers ≥ 4.46.2, accelerate ≥ 1.1.1 | — |
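The storage row can be sanity-checked before downloading. The following is a minimal sketch, not part of the official tooling: it sums the per-component sizes quoted in the table and compares free disk space against the README's ~40 GB figure. The names `WEIGHTS_GB`, `total_weights_gb`, and `has_room` are hypothetical helpers introduced for illustration.

```python
import shutil

# Component sizes in GB, copied from the Requirements table above.
WEIGHTS_GB = {"transformer": 11.14, "t5_text_encoder": 19.05, "vae": 0.86}


def total_weights_gb(sizes=WEIGHTS_GB):
    """Sum the per-component checkpoint sizes, rounded to two decimals."""
    return round(sum(sizes.values()), 2)


def has_room(path=".", needed_gb=40.0):
    """True if the filesystem holding `path` has at least `needed_gb` free."""
    free_gb = shutil.disk_usage(path).free / 1e9
    return free_gb >= needed_gb


print(f"weights: ~{total_weights_gb()} GB; room for 40 GB here: {has_room()}")
```

The 40 GB default follows the README's recommendation rather than the measured total, which leaves room for the Hugging Face cache and temporary files.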
Installation
1. Install diffusers from source
The CogVideoX 1.5 pipeline requires diffusers built from the development branch per the official model card:
pip install git+https://github.com/huggingface/diffusers
pip install --upgrade "transformers>=4.46.2" "accelerate>=1.1.1" imageio-ffmpeg
On the 5090, install a CUDA 12.8 (cu128) PyTorch wheel so sm_120 Blackwell kernels are present:
pip install --upgrade torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu128
By mid-2026, mainline FlashAttention-2 wheels include sm_120 kernels (Dao-AILab/flash-attention#1542 closed 2026-04), so no FA2 workaround is needed on the 5090 — diffusers' default attention backend selects sm_120 paths automatically.
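To confirm that the installed PyTorch wheel actually carries Blackwell kernels, a quick check along these lines may help. It is a sketch under the assumption that the RTX 5090 is CUDA device 0; `supports_sm120` and `check_torch_build` are hypothetical helper names, while `torch.cuda.get_arch_list()` and `torch.cuda.get_device_capability()` are standard PyTorch calls.

```python
def supports_sm120(arch_list):
    """True if a PyTorch build's compiled arch list includes sm_120."""
    return "sm_120" in arch_list


def check_torch_build():
    """Describe whether the installed torch wheel can target the RTX 5090."""
    import torch  # imported lazily so supports_sm120 works without torch

    if not torch.cuda.is_available():
        return "CUDA not available: check the driver and the cu128 wheel"
    # Assumption: the 5090 is device 0.
    major, minor = torch.cuda.get_device_capability(0)
    ok = supports_sm120(torch.cuda.get_arch_list())
    return f"device sm_{major}{minor}, wheel has sm_120 kernels: {ok}"
```

Calling `print(check_torch_build())` after the install step should report `sm_120` for the device on a 5090; a `False` for the wheel suggests a non-cu128 build was picked up.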
2. Download the model weights
huggingface-cli download THUDM/CogVideoX1.5-5B --local-dir ./CogVideoX1.5-5B
If THUDM/CogVideoX1.5-5B is unavailable, use the zai-org/CogVideoX1.5-5B mirror. It is the same model: the upstream org was renamed, and the weights are identical.
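The same download, with the mirror as a fallback, can be scripted from Python. This is a sketch rather than an official script: `download_with_fallback` and `fetch_cogvideox` are hypothetical helpers, and the only library call assumed is `huggingface_hub.snapshot_download` with its `repo_id` and `local_dir` parameters.

```python
def download_with_fallback(repo_ids, local_dir, downloader):
    """Try each repo id in order; return (repo_id, path) for the first success."""
    errors = []
    for repo_id in repo_ids:
        try:
            return repo_id, downloader(repo_id=repo_id, local_dir=local_dir)
        except Exception as exc:  # broad on purpose: network, auth, 404
            errors.append(f"{repo_id}: {exc}")
    raise RuntimeError("all sources failed: " + "; ".join(errors))


def fetch_cogvideox(local_dir="./CogVideoX1.5-5B"):
    """Download from THUDM first, then from the zai-org mirror."""
    from huggingface_hub import snapshot_download  # lazy import

    return download_with_fallback(
        ["THUDM/CogVideoX1.5-5B", "zai-org/CogVideoX1.5-5B"],
        local_dir,
        snapshot_download,
    )
```

Passing the downloader in as an argument keeps the fallback logic testable without network access.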
3. (Optional) Install ComfyUI + the CogVideoX wrapper
If you prefer a node-based workflow, the kijai/ComfyUI-CogVideoXWrapper repo ships example workflows including cogvideox1.5_t2v.json. Skip if you're using diffusers directly.
cd ComfyUI/custom_nodes
git clone https://github.com/kijai/ComfyUI-CogVideoXWrapper
cd ComfyUI-CogVideoXWrapper
pip install -r requirements.txt
Running
Save the following as run_cogvideox.py. This is the official cli_demo.py snippet modified per the script's own commented-out alternative — swapping enable_sequential_cpu_offload() for enable_model_cpu_offload() because the 5090 has the 32 GB to spare:
import torch
from diffusers import CogVideoXPipeline
from diffusers.utils import export_to_video

prompt = (
    "A panda, dressed in a small, red jacket and a tiny hat, sits on a wooden stool "
    "in a serene bamboo forest. The panda's fluffy paws strum a miniature acoustic "
    "guitar, producing soft, melodic tunes."
)

pipe = CogVideoXPipeline.from_pretrained(
    "THUDM/CogVideoX1.5-5B",
    torch_dtype=torch.bfloat16,
)

# The 5090's 32 GB lets us use model_cpu_offload (lighter than the default
# sequential_cpu_offload that the model card script ships) for ~10% faster
# generation per the canonical README.
pipe.enable_model_cpu_offload()
pipe.vae.enable_tiling()
pipe.vae.enable_slicing()

video = pipe(
    prompt=prompt,
    num_videos_per_prompt=1,
    num_inference_steps=50,
    num_frames=81,
    guidance_scale=6,
    generator=torch.Generator(device="cuda").manual_seed(42),
).frames[0]
# CogVideoX 1.5 plays back at 16 fps, so 81 frames is ~5 s (fps=8 would stretch it to ~10 s).
export_to_video(video, "output.mp4", fps=16)
python run_cogvideox.py
num_frames=81 produces a ~5-second clip at 16 fps (the formula is 16N + 1 with N ≤ 10; 81 is N=5). For a 10-second clip set num_frames=161, but read the Troubleshooting section first: Issue #493 in the canonical repo documents a VAE-decode OOM on an 80 GB GPU at the 161-frame configuration, so the 10-second path is not a free upgrade even on a 5090.
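The frame-count arithmetic above can be worked through in a few lines. This is an illustrative sketch: `num_frames_for` and `clip_seconds` are hypothetical helpers, and the 16 fps constant is the playback rate stated in this recipe.

```python
FPS = 16  # playback rate stated in this recipe


def num_frames_for(n):
    """Frame count for a given N under the 16N + 1 rule (1 <= N <= 10)."""
    if not 1 <= n <= 10:
        raise ValueError("N must be between 1 and 10")
    return 16 * n + 1


def clip_seconds(num_frames, fps=FPS):
    """Playback duration in seconds for a given frame count."""
    return num_frames / fps


for n in (5, 10):
    frames = num_frames_for(n)
    print(f"N={n}: num_frames={frames}, ~{clip_seconds(frames):.2f} s at {FPS} fps")
```

N=5 gives 81 frames (about 5.06 s) and N=10 gives 161 frames (about 10.06 s), matching the two configurations discussed in this recipe.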
Results
- VRAM usage (this recipe, enable_model_cpu_offload() path): Derived ~20 GB peak on the 5090. The diffusers CogVideoX documentation documents a 19 GB peak with enable_model_cpu_offload() for the smaller CogVideoX-5B at 720×480; CogVideoX 1.5-5B's higher 1360×768 resolution and 16N+1 frame format push activations modestly higher. The kijai/ComfyUI-CogVideoXWrapper README confirms "VAE decoding...peaks at around 13-14GB momentarily" for the 5B family. This peak transfers cleanly from Ada/Ampere to Blackwell because VAE decode is memory-bound rather than compute-bound. The 32 GB envelope leaves ~12 GB of headroom for activations.
- VRAM usage (alternative enable_sequential_cpu_offload() path, NOT the installed path): The official model card cites diffusers BF16: from 10 GB with the full sequential offload + VAE tiling/slicing path. That's the path the 3090 sibling recipe and 4090 sibling recipe install. If you switch back to it (replace enable_model_cpu_offload() with enable_sequential_cpu_offload() in the snippet above) you'll see this lower peak, at the cost of "about 10%" slower generation per the canonical zai-org/CogVideo README.
- Speed: No first-party RTX 5090 measurement for CogVideoX 1.5 5B is published. The official model card reports reference points only for datacenter hardware (single A100: ~1000 s, single H100: ~550 s for a 5-second / 81-frame / 50-step clip) and explicitly notes "This scheme has not been tested for actual memory usage on devices outside of NVIDIA A100 / H100 architectures." CogVideoX 1.5 is a video DiT, compute-bound at the transformer stage, and the per-step throughput on a 5090 (Blackwell sm_120) has not been measured in any source we found. If you have a 5090 measurement, please contribute it so this section can replace the omission with a real number.
- Quality notes: Native resolution is 1360×768 (not 720×480, which is the resolution of CogVideoX-2B and CogVideoX-5B). Don't reduce below 768 on the short axis; the model is trained for the higher resolution. The model card's reference benchmarks use num_inference_steps=50; lower step counts trade quality for speed (no per-step-count quality comparison is published on the card, so adjust empirically).
For the full benchmark data, see /check/cogvideox-1-5/rtx-5090.
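Since no 5090 numbers are published, one way to collect your own is to wrap the pipeline call in a small timing and peak-memory harness. The sketch below is an assumption-laden starting point, not a standardized benchmark: `measure` and `bytes_to_gib` are hypothetical helpers, and the `torch` module is passed in as an argument so the harness itself has no hard dependency. The `torch.cuda` calls used (`reset_peak_memory_stats`, `synchronize`, `max_memory_allocated`) are standard PyTorch APIs.

```python
import time


def bytes_to_gib(n_bytes):
    """Convert a byte count to GiB (1024**3 bytes)."""
    return n_bytes / 1024**3


def measure(run, torch_module):
    """Run `run()` once and return (result, seconds, peak_gib).

    `peak_gib` is the peak of tensors allocated by PyTorch's caching
    allocator; nvidia-smi will usually show a somewhat higher figure.
    """
    torch_module.cuda.reset_peak_memory_stats()
    torch_module.cuda.synchronize()
    start = time.perf_counter()
    result = run()
    torch_module.cuda.synchronize()  # wait for queued GPU work to finish
    seconds = time.perf_counter() - start
    peak_gib = bytes_to_gib(torch_module.cuda.max_memory_allocated())
    return result, seconds, peak_gib
```

In run_cogvideox.py this could be used as `video, secs, peak = measure(lambda: pipe(prompt=prompt, num_frames=81, num_inference_steps=50, guidance_scale=6).frames[0], torch)`. Note that the result is in GiB, which differs slightly from the GB figures quoted above.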
Troubleshooting
Want the lowest possible VRAM? Switch to full sequential offload
If you're co-locating another workload on the 5090 and need to free as much VRAM as possible, swap the snippet's pipe.enable_model_cpu_offload() for pipe.enable_sequential_cpu_offload(). The official model card cites "from 10 GB" on this path; the zai-org/CogVideo README notes "Without memory optimization, inference speed increases by about 10%" — i.e. the reverse holds, full sequential offload trades ~10% speed for ~50% lower peak VRAM. This is exactly the path the 3090 and 4090 sibling recipes install.
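If you expect to flip between the two offload paths, the swap can be captured in one place. The following is a sketch only: `apply_offload` and `OFFLOAD_METHODS` are hypothetical names, and the only pipeline methods it calls are the ones already used in this recipe.

```python
# Maps a short mode name to the diffusers pipeline method this recipe uses.
OFFLOAD_METHODS = {
    "model": "enable_model_cpu_offload",            # this recipe's default
    "sequential": "enable_sequential_cpu_offload",  # lowest VRAM, slower
}


def apply_offload(pipe, mode="model"):
    """Enable one CPU-offload mode plus VAE tiling and slicing on `pipe`."""
    if mode not in OFFLOAD_METHODS:
        raise ValueError(f"unknown offload mode: {mode!r}")
    getattr(pipe, OFFLOAD_METHODS[mode])()
    pipe.vae.enable_tiling()
    pipe.vae.enable_slicing()
    return pipe
```

Calling `apply_offload(pipe, "sequential")` in place of the three `enable_*` lines in run_cogvideox.py would select the low-VRAM path; only one offload mode should be enabled per pipeline instance.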
10-second clips (num_frames=161) OOM during VAE decode
Tracked in the canonical repo: Issue #493 (reported by community user DZY-irene) describes a VAE-decode OOM on an 80 GB GPU at the 10-second / 161-frame configuration, with 77 GB already consumed before the failure. The error stack traces to torch.cat(output_chunks, dim=2) inside the VAE decoder. This is a model-class issue, not a card-specific one, and the 5090's 32 GB does not escape it. If you need 10-second clips, stay on the enable_sequential_cpu_offload() path (where the kijai wrapper's 13-14 GB VAE peak is more reliable) or split your generation into two 5-second segments and concatenate post-hoc.
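The two-segment workaround might look like the sketch below. It is an assumption-heavy illustration: `generate_in_segments` and its `make_generator` callback are hypothetical, and because each segment is sampled independently, the join will show a visible cut rather than continuous motion.

```python
def generate_in_segments(pipe, prompt, make_generator, segments=2, num_frames=81):
    """Generate `segments` short clips back to back and return one frame list.

    `make_generator(index)` is a hypothetical callback that returns the
    random generator for segment `index`, so each segment gets its own seed.
    """
    frames = []
    for index in range(segments):
        result = pipe(
            prompt=prompt,
            num_frames=num_frames,
            num_inference_steps=50,
            guidance_scale=6,
            generator=make_generator(index),
        )
        frames.extend(result.frames[0])  # append this segment's frames
    return frames
```

With the pipeline from run_cogvideox.py, a call such as `generate_in_segments(pipe, prompt, lambda i: torch.Generator(device="cuda").manual_seed(42 + i))` yields 162 frames, which can then be passed to `export_to_video` at 16 fps.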
Full no-offload (pipe.to("cuda")) is the wrong escape hatch on 32 GB
You'll see the model card mention "Disabling optimizations can triple VRAM usage but increase speed by 3-4 times." On the 5090 you might be tempted to try pipe.to("cuda") without any offload — don't. The model card's SAT BF16 "without optimizations" number is 76 GB, and the diffusers CogVideoX page documents ~33 GB peak with all optimizations disabled for the smaller CogVideoX-5B (720×480, 8N+1 frames). CogVideoX 1.5 runs at 1360×768 with 16N+1 frame format, so its no-offload peak is even higher. The enable_model_cpu_offload() middle-ground is the right ceiling on 32 GB; full no-offload requires multi-GPU or H100-class hardware. The canonical README's "about 10%" speed difference between offload and no-offload (not 3-4×) is the more conservative figure to plan against.
Multi-GPU note
The model card explicitly warns: "In multi-GPU inference, enable_sequential_cpu_offload() optimization needs to be disabled." Single 5090 setups are unaffected — this is only relevant if you split across two 5090s (in which case use enable_model_cpu_offload() or no offload, both fit when split).
Output looks low-res
Make sure num_frames follows the 16N + 1 formula (e.g. 17, 33, 49, 65, 81, 97, 113, 129, 145, 161). Off-by-one frame counts trigger the wrong code path. Resolution is fixed at 1360×768 for the T2V variant; don't try to force lower — the model is trained for this size.
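A guard like the following can catch an invalid frame count before a long generation starts. It is a minimal sketch of the 16N + 1 rule described above; `is_valid_num_frames` and `nearest_valid_num_frames` are hypothetical helpers, not diffusers APIs.

```python
def is_valid_num_frames(num_frames, max_n=10):
    """True if num_frames equals 16N + 1 for some integer 1 <= N <= max_n."""
    if (num_frames - 1) % 16 != 0:
        return False
    return 1 <= (num_frames - 1) // 16 <= max_n


def nearest_valid_num_frames(num_frames, max_n=10):
    """Snap an arbitrary frame count to the closest valid 16N + 1 value."""
    n = round((num_frames - 1) / 16)
    n = min(max(n, 1), max_n)  # clamp N into the supported range
    return 16 * n + 1


requested = 80
if not is_valid_num_frames(requested):
    print(f"{requested} is not 16N + 1; try {nearest_valid_num_frames(requested)}")
```

Here a request for 80 frames is flagged and 81 is suggested instead.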
If your problem isn't covered above, report it via the submission form so we can extend this section.