How much VRAM does CogVideoX 1.5 need?

About 20 GB — the minimum this recipe targets.

How hard is this setup?

Intermediate. Follow the steps below.

CogVideoX 1.5 5B on RTX 5090: 1360x768 Text-to-Video, No-Sequential-Offload Path

What You'll Build

A local text-to-video pipeline that generates 5-second clips at 1360×768 from a text prompt using THUDM/CogVideoX1.5-5B on a single RTX 5090. Where the 24 GB 3090 and 4090 siblings are pinned to the model card's default full enable_sequential_cpu_offload() path, the 5090's 32 GB envelope lets you switch to the lighter enable_model_cpu_offload() middle-ground — the official cli_demo.py documents this swap as a commented-out alternative for users with enough GPU memory.

Hardware data: RTX 5090 (32 GB VRAM) · CogVideoX 1.5 5B BF16 with enable_model_cpu_offload() + VAE tiling/slicing fits the 32 GB envelope with comfortable headroom · See benchmark data

ℹ️ Pick this variant, not its siblings. CogVideoX is a family: CogVideoX-2B (8N+1 frames at 720×480, ~5 GB BF16), CogVideoX-5B (8N+1 at 720×480, 15 GB BF16), CogVideoX1.5-5B (this recipe: 16N+1 frames at 1360×768, from 10 GB BF16 with optimizations), and CogVideoX1.5-5B-I2V (image-to-video at 768–1360). The official model card lists distinct VRAM and resolution profiles for all four. This recipe pins the 1.5-5B text-to-video variant.

⚠️ Full no-offload (pipe.to("cuda")) is NOT safe on 32 GB. Don't skip CPU offload entirely. The model card's SAT BF16 "without optimizations" peak is 76 GB; the diffusers CogVideoX docs document ~33 GB peak with all optimizations disabled for the smaller CogVideoX-5B (720×480, 8N+1 frames). CogVideoX 1.5 runs at the larger 1360×768, 16N+1 frame format, so its no-offload peak is materially higher than CogVideoX-5B's. The recipe's enable_model_cpu_offload() path moves one whole component onto the GPU at a time: the T5 text encoder (19 GB) returns to CPU once the prompt is encoded, and the transformer (11 GB) and VAE each occupy the GPU only while they run. That's the right middle ground for this card.

Requirements

Component	Minimum	Tested
GPU	20 GB VRAM (derived: diffusers documents a 19 GB peak with `enable_model_cpu_offload()` for the smaller CogVideoX-5B at 720×480, and 1.5-5B runs at 1360×768, so budget extra headroom); 32 GB recommended for VAE-decode comfort at 81 frames	RTX 5090 (32 GB)
RAM	32 GB system RAM minimum; 64 GB recommended for the T5 encoder's CPU-side residency under model offload	—
Storage	~40 GB for weights (per the README)	~31 GB measured (11.14 GB transformer + 19.05 GB T5 + 0.86 GB VAE per HF Files tab)
Software	Python 3.10–3.12, diffusers from source, transformers ≥ 4.46.2, accelerate ≥ 1.1.1	—
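Before downloading, it can help to confirm the disk has room for the weights. The following is a minimal sketch, not part of the official tooling: `has_room` and `README_WEIGHTS_GB` are hypothetical names, and the 40 GB threshold is simply the README figure from the table above.

```python
import shutil

# Storage figure quoted in the Requirements table (per the README).
README_WEIGHTS_GB = 40


def has_room(free_bytes: int, needed_gb: float = README_WEIGHTS_GB) -> bool:
    """Return True when free_bytes covers needed_gb (decimal gigabytes)."""
    return free_bytes >= needed_gb * 1e9


def report(path: str = ".") -> None:
    """Print free space at path and whether it covers the README figure."""
    free = shutil.disk_usage(path).free
    print(f"free: {free / 1e9:.1f} GB, enough for weights: {has_room(free)}")
```

Call `report("/path/to/models")` with the directory you plan to download into; pass `needed_gb=31` to check against the measured size instead.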

Installation

1. Install diffusers from source

The CogVideoX 1.5 pipeline requires diffusers built from the development branch per the official model card:

pip install git+https://github.com/huggingface/diffusers
pip install --upgrade "transformers>=4.46.2" "accelerate>=1.1.1" imageio-ffmpeg

On the 5090, install a CUDA 12.8 (cu128) PyTorch wheel so sm_120 Blackwell kernels are present:

pip install --upgrade torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu128

By mid-2026, mainline FlashAttention-2 wheels include sm_120 kernels (Dao-AILab/flash-attention#1542 closed 2026-04), so no FA2 workaround is needed on the 5090 — diffusers' default attention backend selects sm_120 paths automatically.
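To confirm the installed wheel actually carries Blackwell kernels, a quick check along these lines may help. This is a sketch under the assumption that the build reports its architectures through `torch.cuda.get_arch_list()`; the helper names `has_blackwell_kernels` and `check_install` are illustrative only.

```python
def has_blackwell_kernels(arch_list):
    """True if the PyTorch build lists sm_120 (consumer Blackwell) kernels."""
    return "sm_120" in arch_list


def check_install():
    """Print what the installed PyTorch build reports; needs torch and a GPU."""
    import torch  # imported lazily so the helper above works without torch

    print("torch", torch.__version__, "built for CUDA", torch.version.cuda)
    # On a 5090 this is expected to report compute capability 12.0.
    print("device capability:", torch.cuda.get_device_capability(0))
    print("sm_120 kernels present:",
          has_blackwell_kernels(torch.cuda.get_arch_list()))
```

Run `check_install()` in the same environment you will generate in. If the last line prints False, the wheel likely came from a different index URL than the cu128 one above.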

2. Download the model weights

huggingface-cli download THUDM/CogVideoX1.5-5B --local-dir ./CogVideoX1.5-5B

If THUDM/CogVideoX1.5-5B is unavailable, the zai-org/CogVideoX1.5-5B mirror is the same model — the upstream org name was renamed but the weights are identical.
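The same download, with the mirror as a fallback, can be sketched in Python. This assumes `huggingface_hub` is installed (it is pulled in by diffusers); `download_with_fallback` and `fetch_weights` are hypothetical helper names, not part of any library.

```python
def download_with_fallback(repo_ids, download_fn, local_dir="./CogVideoX1.5-5B"):
    """Try each repo id in order; return (repo_id, path) for the first success."""
    errors = []
    for repo_id in repo_ids:
        try:
            return repo_id, download_fn(repo_id=repo_id, local_dir=local_dir)
        except Exception as exc:  # broad on purpose: any failure tries the next repo
            errors.append(f"{repo_id}: {exc}")
    raise RuntimeError("all downloads failed: " + "; ".join(errors))


def fetch_weights():
    """Download from THUDM first, then from the zai-org mirror."""
    from huggingface_hub import snapshot_download  # lazy import

    return download_with_fallback(
        ["THUDM/CogVideoX1.5-5B", "zai-org/CogVideoX1.5-5B"],
        snapshot_download,
    )
```

`fetch_weights()` returns the repo id that worked and the local path, which you can pass to `from_pretrained` in place of the hub name.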

3. (Optional) Install ComfyUI + the CogVideoX wrapper

If you prefer a node-based workflow, the kijai/ComfyUI-CogVideoXWrapper repo ships example workflows including cogvideox1.5_t2v.json. Skip if you're using diffusers directly.

cd ComfyUI/custom_nodes
git clone https://github.com/kijai/ComfyUI-CogVideoXWrapper
cd ComfyUI-CogVideoXWrapper
pip install -r requirements.txt

Running

Save the following as run_cogvideox.py. This is the official cli_demo.py snippet modified per the script's own commented-out alternative — swapping enable_sequential_cpu_offload() for enable_model_cpu_offload() because the 5090 has the 32 GB to spare:

import torch
from diffusers import CogVideoXPipeline
from diffusers.utils import export_to_video

prompt = (
    "A panda, dressed in a small, red jacket and a tiny hat, sits on a wooden stool "
    "in a serene bamboo forest. The panda's fluffy paws strum a miniature acoustic "
    "guitar, producing soft, melodic tunes."
)

pipe = CogVideoXPipeline.from_pretrained(
    "THUDM/CogVideoX1.5-5B",
    torch_dtype=torch.bfloat16,
)

# The 5090's 32 GB lets us use model_cpu_offload (lighter than the default
# sequential_cpu_offload that the model card script ships) for ~10% faster
# generation per the canonical README.
pipe.enable_model_cpu_offload()
pipe.vae.enable_tiling()
pipe.vae.enable_slicing()

video = pipe(
    prompt=prompt,
    num_videos_per_prompt=1,
    num_inference_steps=50,
    num_frames=81,
    guidance_scale=6,
    generator=torch.Generator(device="cuda").manual_seed(42),
).frames[0]

# 81 frames at 16 fps is roughly 5 seconds; fps=8 would stretch the clip to ~10 s.
export_to_video(video, "output.mp4", fps=16)

python run_cogvideox.py

num_frames=81 produces a ~5-second clip at 16 fps (the formula is 16N + 1 with N ≤ 10; 81 is N=5). For a 10-second clip set num_frames=161, but read the Troubleshooting section first: Issue #493 in the canonical repo documents a VAE-decode OOM on an 80 GB GPU at the 161-frame configuration, so the 10-second path is not a free upgrade even on a 5090.
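The frame-count arithmetic above can be written out as a small sketch. `frames_for` and `clip_seconds` are illustrative names; the 16 fps default is the playback rate stated in this section.

```python
def frames_for(n: int) -> int:
    """Frame count for the 16N + 1 formula, with N limited to 1..10."""
    if not 1 <= n <= 10:
        raise ValueError("N must be between 1 and 10")
    return 16 * n + 1


def clip_seconds(num_frames: int, fps: int = 16) -> float:
    """Playback length in seconds for a given frame count and frame rate."""
    return num_frames / fps


print(frames_for(5), clip_seconds(frames_for(5)))    # 81 5.0625
print(frames_for(10), clip_seconds(frames_for(10)))  # 161 10.0625
```

The extra frame is why the clips run slightly over 5 and 10 seconds.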

Results

VRAM usage (this recipe, enable_model_cpu_offload() path): Derived (not measured) ~20 GB peak on the 5090. The diffusers CogVideoX documentation reports a 19 GB peak with enable_model_cpu_offload() for the smaller CogVideoX-5B at 720×480; CogVideoX 1.5-5B's higher 1360×768 resolution and 16N+1 frame format push activations modestly higher. The kijai/ComfyUI-CogVideoXWrapper README confirms "VAE decoding...peaks at around 13-14GB momentarily" for the 5B family; this peak should carry over from Ada/Ampere to Blackwell because peak memory is set by tensor shapes and dtypes rather than by GPU architecture. At ~20 GB, the 32 GB envelope leaves ~12 GB of headroom for activations.
VRAM usage (alternative enable_sequential_cpu_offload() path, NOT the installed path): The official model card cites diffusers BF16: from 10 GB with the full sequential offload + VAE tiling/slicing path. That's the path the 3090 sibling recipe and 4090 sibling recipe install. If you switch back to it (replace enable_model_cpu_offload() with enable_sequential_cpu_offload() in the snippet above) you'll see this lower peak, at the cost of "about 10%" slower generation per the canonical zai-org/CogVideo README.
Speed: No first-party RTX 5090 measurement for CogVideoX 1.5 5B is published. The official model card reports reference points only for datacenter hardware (single A100: ~1000 s, single H100: ~550 s for a 5-second / 81-frame / 50-step clip) and explicitly notes "This scheme has not been tested for actual memory usage on devices outside of NVIDIA A100 / H100 architectures." CogVideoX 1.5 is a video DiT — compute-bound at the transformer stage — and the per-step throughput on a 5090 (Blackwell sm_120) has not been measured in any source we found. If you have a 5090 measurement, please contribute it so this section can replace the omission with a real number.
Quality notes: Native resolution is 1360×768 (not 720×480, which is the resolution of the original CogVideoX-2B and CogVideoX-5B). Don't reduce below 768 on the short axis; the model is trained for the higher resolution. The model card's reference benchmarks use num_inference_steps=50; lower step counts trade quality for speed. The card publishes no quality comparison across step counts, so adjust empirically.
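Because the VRAM figures above are derived rather than measured, you may want to record the peak on your own card. The sketch below uses PyTorch's allocator statistics; `measure_peak_gb` and `headroom_gb` are hypothetical helpers, and the number counts memory allocated by PyTorch tensors, so it will usually read lower than what nvidia-smi shows.

```python
def measure_peak_gb(run):
    """Call run() and return (result, peak GB allocated by PyTorch on the current GPU)."""
    import torch  # lazy import so headroom_gb below works without torch

    torch.cuda.reset_peak_memory_stats()
    result = run()
    torch.cuda.synchronize()
    return result, torch.cuda.max_memory_allocated() / 1e9


def headroom_gb(peak_gb: float, card_gb: float = 32.0) -> float:
    """VRAM left on the card after the measured or derived peak."""
    return card_gb - peak_gb


print(headroom_gb(20.0))  # 12.0, the derived figure quoted above
```

With the pipeline from the Running section loaded, something like `video, peak = measure_peak_gb(lambda: pipe(prompt=prompt, num_frames=81).frames[0])` would wrap a full generation.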

For the full benchmark data, see /check/cogvideox-1-5/rtx-5090.

Troubleshooting

Want the lowest possible VRAM? Switch to full sequential offload

If you're co-locating another workload on the 5090 and need to free as much VRAM as possible, swap the snippet's pipe.enable_model_cpu_offload() for pipe.enable_sequential_cpu_offload(). The official model card cites "from 10 GB" on this path. The zai-org/CogVideo README notes "Without memory optimization, inference speed increases by about 10%"; read in reverse, full sequential offload trades ~10% speed for roughly half the peak VRAM (~10 GB versus the ~20 GB derived for model offload). This is exactly the path the 3090 and 4090 sibling recipes install.
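If you expect to flip between the two offload paths, one option is to make the choice a parameter. This is a sketch only: `apply_offload` and `OFFLOAD_METHODS` are hypothetical names, while the methods they call are the same diffusers pipeline methods used in the Running snippet.

```python
# Maps a short mode name to the pipeline method that enables it.
OFFLOAD_METHODS = {
    "model": "enable_model_cpu_offload",            # this recipe's default
    "sequential": "enable_sequential_cpu_offload",  # lowest VRAM, slower
}


def apply_offload(pipe, mode: str = "model"):
    """Enable the chosen CPU offload mode plus VAE tiling and slicing."""
    if mode not in OFFLOAD_METHODS:
        raise ValueError(f"unknown offload mode: {mode!r}")
    getattr(pipe, OFFLOAD_METHODS[mode])()
    pipe.vae.enable_tiling()
    pipe.vae.enable_slicing()
    return pipe
```

In run_cogvideox.py this would replace the three `enable_*` lines with a single `apply_offload(pipe, "sequential")` call. Enable only one offload mode on a given pipeline object.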

10-second clips (`num_frames=161`) OOM during VAE decode

Tracked in the canonical repo: Issue #493 (reported by community user DZY-irene) describes a VAE-decode OOM on an 80 GB GPU at the 10-second / 161-frame configuration, with 77 GB already in use when the allocation failed. The stack trace ends at torch.cat(output_chunks, dim=2) inside the VAE decoder. This is a model-class issue, not a card-specific one, and the 5090's 32 GB does not escape it. If you need 10-second clips, either stay on the enable_sequential_cpu_offload() path (where the 13-14 GB VAE peak cited by the kijai wrapper is more likely to hold) or split your generation into two 5-second segments and concatenate them afterwards.
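The split-and-concatenate route could look roughly like this. `join_segments` is a hypothetical helper; it only joins frame lists, so two independently sampled clips will show a visible cut where they meet unless you handle continuity yourself.

```python
def join_segments(*segments):
    """Concatenate frame sequences, in order, into one flat list of frames."""
    frames = []
    for segment in segments:
        frames.extend(segment)
    return frames


# Two 81-frame segments give 162 frames, about 10.1 s at 16 fps.
print(len(join_segments(range(81), range(81))))  # 162
```

With two results from the pipeline, `first` and `second`, the joined clip can then be written with `export_to_video(join_segments(first, second), "output_10s.mp4", fps=16)`.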

Full no-offload (`pipe.to("cuda")`) is the wrong escape hatch on 32 GB

You'll see the model card mention "Disabling optimizations can triple VRAM usage but increase speed by 3-4 times." On the 5090 you might be tempted to try pipe.to("cuda") without any offload. Don't. The model card's SAT BF16 "without optimizations" number is 76 GB, and the diffusers CogVideoX page documents ~33 GB peak with all optimizations disabled for the smaller CogVideoX-5B (720×480, 8N+1 frames). CogVideoX 1.5 runs at 1360×768 with the 16N+1 frame format, so its no-offload peak is even higher. The enable_model_cpu_offload() middle ground is as far as a 32 GB card should go; full no-offload requires multi-GPU or H100-class hardware. The canonical README's "about 10%" speed difference between offload and no-offload (not 3-4×) is the more conservative figure to plan against.

Multi-GPU note

The model card explicitly warns: "In multi-GPU inference, enable_sequential_cpu_offload() optimization needs to be disabled." Single 5090 setups are unaffected — this is only relevant if you split across two 5090s (in which case use enable_model_cpu_offload() or no offload, both fit when split).

Output looks low-res

Make sure num_frames follows the 16N + 1 formula with N from 1 to 10 (17, 33, 49, 65, 81, 97, 113, 129, 145, 161). Counts that miss the formula by even one frame, such as 80 or 82, trigger the wrong code path. Resolution is fixed at 1360×768 for the T2V variant; don't force a lower size, because the model is trained for this one.
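A guard like the following can catch a bad frame count before a long generation starts. It is a minimal sketch of the 16N + 1 rule described above; `is_valid_num_frames` and `nearest_valid_num_frames` are illustrative names, not diffusers functions.

```python
def is_valid_num_frames(num_frames: int, max_n: int = 10) -> bool:
    """True when num_frames equals 16N + 1 for some N in 1..max_n."""
    n, remainder = divmod(num_frames - 1, 16)
    return remainder == 0 and 1 <= n <= max_n


def nearest_valid_num_frames(requested: int, max_n: int = 10) -> int:
    """Snap a requested frame count to the closest valid 16N + 1 value."""
    n = round((requested - 1) / 16)
    n = min(max(n, 1), max_n)  # clamp N into the supported range
    return 16 * n + 1


print(is_valid_num_frames(80), nearest_valid_num_frames(80))  # False 81
```

Checking `is_valid_num_frames(num_frames)` at the top of run_cogvideox.py, and failing early when it returns False, is one way to use it.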

If your problem isn't covered above, report it via the submission form so we can extend this section.