CogVideoX 1.5 5B on RTX 3090: 1360x768 Text-to-Video with Diffusers

What You'll Build

A local text-to-video pipeline that generates 5- or 10-second clips at 1360×768 from a text prompt using THUDM/CogVideoX1.5-5B on a single RTX 3090. The 3090's 24 GB Ampere envelope handles the model's BF16 path with diffusers' standard offload + VAE tiling/slicing optimizations — no FP8, no exotic quantization, no architecture-specific wheel selection.

Hardware data: RTX 3090 (24 GB VRAM) · CogVideoX 1.5 5B BF16 fits within the 24 GB envelope with the official optimized path · See benchmark data

ℹ️ Pick this variant, not its siblings. CogVideoX is a family — CogVideoX-2B (8N+1 frames at 720×480, ~5 GB BF16), CogVideoX-5B (8N+1 at 720×480, 15 GB BF16), CogVideoX1.5-5B (this recipe — 16N+1 frames at 1360×768, from 10 GB BF16 with optimizations), and CogVideoX1.5-5B-I2V (image-to-video at 768–1360). All four cite distinct VRAM and resolution profiles on the official model card. This recipe pins the 1.5-5B text-to-video variant.

⚠️ The 3090 is the right card, but expect slower steps than on newer GPUs. CogVideoX 1.5 is a compute-heavy video DiT, and the 3090 (Ampere sm_86) has no FP8 tensor-core acceleration (FP8 first shipped on Hopper sm_90 and consumer Ada sm_89), so any "FP8 path" advice from Ada/Hopper recipes does NOT transfer. The model card's recommended path is BF16, which Ampere supports natively. The model fits within the 24 GB envelope; the compute gap versus an Ada card shows up as slower per-step time, not as a memory failure.

Requirements

Component	Minimum	Tested
GPU	14 GB VRAM derived envelope (HF card's BF16-with-optimizations floor of 10 GB + kijai VAE peak headroom of 3-4 GB); 24 GB recommended for the no-offload speedup	RTX 3090 (24 GB)
RAM	16 GB system RAM	32 GB recommended for CPU-offload swap during `enable_sequential_cpu_offload()`
Storage	~40 GB for weights	~40 GB (transformer + T5 + VAE)
Software	Python 3.10–3.12, diffusers from source, transformers ≥ 4.46.2, accelerate ≥ 1.1.1	—

Installation

1. Install diffusers from source

The CogVideoX 1.5 pipeline requires diffusers built from the development branch per the official model card:

pip install git+https://github.com/huggingface/diffusers
pip install --upgrade "transformers>=4.46.2" "accelerate>=1.1.1" imageio-ffmpeg

The default pip install torch already includes sm_86 (Ampere) kernels — no special wheel selection is needed on the 3090. FlashAttention-2 also has full sm_86 coverage if you want it later.
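Before downloading the weights, it can help to confirm that the installed packages meet the minimums from the Requirements table. The following is a minimal sketch using only the standard library; the helper names (`version_tuple`, `find_outdated`) are illustrative and not part of diffusers, and the version parsing is deliberately simple (it ignores pre-release suffixes).

```python
from importlib import metadata

# Minimums copied from the Requirements table in this recipe.
MINIMUMS = {"transformers": "4.46.2", "accelerate": "1.1.1"}


def version_tuple(version: str) -> tuple:
    """Turn '4.46.2' or '4.47.0.dev0' into a comparable tuple of ints."""
    parts = []
    for piece in version.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break  # stop at the first non-numeric component such as 'dev0'
        parts.append(int(digits))
    return tuple(parts)


def find_outdated(minimums=MINIMUMS, lookup=metadata.version) -> dict:
    """Return {package: (installed_or_None, required)} for every unmet minimum."""
    problems = {}
    for name, required in minimums.items():
        try:
            installed = lookup(name)
        except metadata.PackageNotFoundError:
            installed = None  # package is not installed at all
        if installed is None or version_tuple(installed) < version_tuple(required):
            problems[name] = (installed, required)
    return problems


if __name__ == "__main__":
    # An empty dict means both minimums are satisfied.
    print(find_outdated())
```

An empty result means the environment matches the table; any entry shows the installed version (or None) next to the required one.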

2. Download the model weights

huggingface-cli download THUDM/CogVideoX1.5-5B --local-dir ./CogVideoX1.5-5B

If THUDM/CogVideoX1.5-5B is unavailable, the zai-org/CogVideoX1.5-5B mirror is the same model: the upstream org was renamed, but the weights are identical.
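The fallback between the two repo names can also be scripted. This is a rough sketch, assuming `huggingface_hub` is installed (it is pulled in by diffusers); `download_first_available` and `hub_fetch` are hypothetical helper names, and the local directory simply mirrors the CLI command above.

```python
from typing import Callable, Iterable

# Primary repo first, then the renamed-org mirror mentioned above.
REPO_CANDIDATES = ("THUDM/CogVideoX1.5-5B", "zai-org/CogVideoX1.5-5B")


def download_first_available(
    fetch: Callable[[str], str],
    repo_ids: Iterable[str] = REPO_CANDIDATES,
) -> str:
    """Try each repo in order and return the local path of the first success."""
    errors = []
    for repo_id in repo_ids:
        try:
            return fetch(repo_id)
        except Exception as exc:  # network error, missing repo, auth failure, ...
            errors.append(f"{repo_id}: {exc}")
    raise RuntimeError("no repo could be downloaded: " + "; ".join(errors))


def hub_fetch(repo_id: str) -> str:
    """Download one repo snapshot with huggingface_hub (imported lazily)."""
    from huggingface_hub import snapshot_download

    return snapshot_download(repo_id=repo_id, local_dir="./CogVideoX1.5-5B")


# Usage (downloads the full weights, so it is left commented out):
# local_path = download_first_available(hub_fetch)
```

Passing the fetch function in as an argument keeps the retry logic separate from the network call, which makes it easy to dry-run.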

3. (Optional) Install ComfyUI + the CogVideoX wrapper

If you prefer a node-based workflow, the kijai/ComfyUI-CogVideoXWrapper repo ships example workflows including cogvideox1.5_t2v.json. Skip if you're using diffusers directly.

cd ComfyUI/custom_nodes
git clone https://github.com/kijai/ComfyUI-CogVideoXWrapper
cd ComfyUI-CogVideoXWrapper
pip install -r requirements.txt

Running

Save the following as run_cogvideox.py. This is the canonical inference snippet from the official model card, with the model ID pinned to THUDM/CogVideoX1.5-5B:

import torch
from diffusers import CogVideoXPipeline
from diffusers.utils import export_to_video

prompt = (
    "A panda, dressed in a small, red jacket and a tiny hat, sits on a wooden stool "
    "in a serene bamboo forest. The panda's fluffy paws strum a miniature acoustic "
    "guitar, producing soft, melodic tunes."
)

pipe = CogVideoXPipeline.from_pretrained(
    "THUDM/CogVideoX1.5-5B",
    torch_dtype=torch.bfloat16,
)

pipe.enable_sequential_cpu_offload()
pipe.vae.enable_tiling()
pipe.vae.enable_slicing()

video = pipe(
    prompt=prompt,
    num_videos_per_prompt=1,
    num_inference_steps=50,
    num_frames=81,
    guidance_scale=6,
    generator=torch.Generator(device="cuda").manual_seed(42),
).frames[0]

export_to_video(video, "output.mp4", fps=8)

python run_cogvideox.py

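num_frames=81 is intended as a ~5-second clip at 16 fps (the formula is 16N + 1 with N ≤ 10; 81 is N=5). For a 10-second clip set num_frames=161 (N=10). Note that the snippet calls export_to_video with fps=8, so the 81 frames play back over about 10 seconds; pass fps=16 if you want playback at the 16 fps rate. The first run downloads weights to your HuggingFace cache; later runs skip the download and go straight to loading and inference.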
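The frame-count arithmetic can be sketched as a small helper. The function names are illustrative only; the formula (16N + 1, N ≤ 10) and the 16 fps rate come from the paragraph above. Playback length depends on the fps value passed to export_to_video, so the second helper makes that explicit.

```python
MAX_N = 10  # upper bound on N stated in this recipe


def frames_for_seconds(seconds: int, model_fps: int = 16) -> int:
    """Return num_frames = 16N + 1 for a clip of the requested length."""
    n = round(seconds * model_fps / 16)  # at 16 fps, one block of 16 frames is one second
    if not 1 <= n <= MAX_N:
        raise ValueError(f"N={n} is outside the supported range 1..{MAX_N}")
    return 16 * n + 1


def playback_seconds(num_frames: int, export_fps: int) -> float:
    """Duration of the exported file for a given export_to_video fps value."""
    return num_frames / export_fps


print(frames_for_seconds(5))      # 81
print(frames_for_seconds(10))     # 161
print(playback_seconds(81, 16))   # 5.0625
print(playback_seconds(81, 8))    # 10.125
```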

Results

Speed: No first-party RTX 3090 measurement for CogVideoX 1.5 5B is published. The official model card reports reference points only for datacenter hardware (single A100: ~1000 s, single H100: ~550 s for a 5-second / 81-frame / 50-step clip) and explicitly notes "This scheme has not been tested for actual memory usage on devices outside of NVIDIA A100 / H100 architectures." CogVideoX 1.5 is a video DiT — compute-bound at the transformer stage — so expect substantially longer wall-clock time on a 3090 than on an Ada or Hopper card of comparable VRAM. The site's sibling 4090 recipe is the closest published reference point for the same workload, but the 4090's per-step throughput on Ada sm_89 does NOT transfer to the 3090. If you have a 3090 measurement, please contribute it so this section can replace the omission with a real number.
VRAM usage: The model card's memory table cites diffusers BF16: from 10 GB with enable_sequential_cpu_offload() + vae.enable_tiling() + vae.enable_slicing() active. kijai's wrapper README observes that "VAE decoding...peaks at around 13-14 GB momentarily" and "Sampling itself takes only maybe 5-6 GB" for the 5B family. That peak is set by tensor sizes rather than by GPU architecture, so it should carry over from Ada to Ampere with only marginal change. The 3090's 24 GB leaves ~10 GB of headroom on the optimized path.
Quality notes: Native resolution is 1360×768, not the 720×480 used by the original CogVideoX-2B and CogVideoX-5B. Don't reduce below 768 on the short axis; the model is trained for the higher resolution. The model card's reference benchmarks use num_inference_steps=50; lower step counts trade quality for speed. No per-step-count quality comparison is published on the card, so adjust empirically.
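The headroom figure above is simple arithmetic over the cited numbers, written out here as a worked example. The constants are the values quoted in this section (model card floor, kijai's VAE peak range); nothing here is a new measurement.

```python
CARD_VRAM_GB = 24                 # RTX 3090
BF16_OPTIMIZED_FLOOR_GB = 10      # model card: "from 10 GB" with all optimizations
VAE_PEAK_RANGE_GB = (13, 14)      # kijai README: momentary VAE decode peak


def headroom_gb(total_gb: int, peak_gb: int) -> int:
    """VRAM left over once the expected peak is reached."""
    return total_gb - peak_gb


# Worst case uses the upper end of the VAE peak range.
worst_case = headroom_gb(CARD_VRAM_GB, max(VAE_PEAK_RANGE_GB))
best_case = headroom_gb(CARD_VRAM_GB, min(VAE_PEAK_RANGE_GB))

print(worst_case, best_case)  # 10 11
```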

For the full benchmark data, see /check/cogvideox-1-5/rtx-3090.

Troubleshooting

Want a 3–4× speed boost and willing to use more VRAM?

The 3090's 24 GB is large enough to drop some of the diffusers offload optimizations. Per the model card: "Disabling optimizations can triple VRAM usage but increase speed by 3-4 times. You can selectively disable certain optimizations." The HF card's reported "without optimizations" peak is 76 GB BF16, so do NOT disable everything at once on a 24 GB card. Try removing pipe.enable_sequential_cpu_offload() first while keeping vae.enable_tiling() and vae.enable_slicing(). That is the highest-impact toggle, and it moves memory use toward the 33 GB / 19 GB / 11 GB ladder documented in the diffusers CogVideoX page (presented there in the CogVideoX-5B context). Watch nvidia-smi during a run; if peak stays under ~22 GB, the change is safe.
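One way to experiment with the toggles is to keep them in a single function. This is a sketch under assumptions, not a tested 3090 configuration: the "model" mode uses diffusers' enable_model_cpu_offload(), which is not mentioned in the text above and is added here as an intermediate step; the "none" mode may exceed 24 GB at 1360×768, which has not been verified. `configure_memory`, `peak_is_safe`, and `measure_peak_gb` are illustrative names.

```python
def configure_memory(pipe, offload: str = "sequential"):
    """Apply one offload mode while always keeping the VAE optimizations on."""
    if offload == "sequential":
        pipe.enable_sequential_cpu_offload()   # model card default, lowest VRAM
    elif offload == "model":
        pipe.enable_model_cpu_offload()        # assumption: intermediate option
    elif offload == "none":
        pipe.to("cuda")                        # unverified on 24 GB, may OOM
    else:
        raise ValueError(f"unknown offload mode: {offload!r}")
    pipe.vae.enable_tiling()
    pipe.vae.enable_slicing()
    return pipe


def peak_is_safe(peak_gb: float, limit_gb: float = 22.0) -> bool:
    """Apply the ~22 GB rule of thumb from the paragraph above."""
    return peak_gb < limit_gb


def measure_peak_gb(run) -> float:
    """Run a zero-argument callable and report PyTorch's peak allocation in GB.

    This counts only memory held by PyTorch's allocator, so nvidia-smi will
    usually show a somewhat higher number.
    """
    import torch  # imported lazily so the helpers above work without a GPU

    torch.cuda.reset_peak_memory_stats()
    run()
    return torch.cuda.max_memory_allocated() / 1024**3
```

A typical session would call configure_memory(pipe, "model"), wrap the pipe(...) call in measure_peak_gb, and only move to a faster mode if peak_is_safe returns True.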

OOM during VAE decoding even on 24 GB

This is possible at the 161-frame (10-second) configuration. Per kijai's wrapper, the VAE decode is the runtime peak (~13–14 GB on the 5B family). Keep pipe.vae.enable_tiling() and pipe.vae.enable_slicing() on; if it still OOMs, drop num_frames to 81 (5-second clip) and generate a second clip separately if you need more length.
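The fallback from 161 to 81 frames can be automated with a small retry wrapper. This is a minimal sketch: `generate_with_fallback` is a hypothetical helper, and `generate` stands for any function you write that runs the pipeline for a given num_frames. With PyTorch you would pass `oom_errors=(torch.cuda.OutOfMemoryError,)` and `cleanup=torch.cuda.empty_cache`.

```python
def generate_with_fallback(
    generate,
    frame_counts=(161, 81),
    oom_errors=(MemoryError,),
    cleanup=None,
):
    """Try each frame count in order; return (num_frames, result) for the first that fits."""
    last_error = None
    for num_frames in frame_counts:
        try:
            return num_frames, generate(num_frames)
        except oom_errors as exc:
            last_error = exc
            if cleanup is not None:
                cleanup()  # e.g. release cached GPU memory before retrying
    raise RuntimeError("all frame counts ran out of memory") from last_error


# Dry run with a stand-in generator that only "fits" 81 frames.
def fake_generate(num_frames):
    if num_frames > 81:
        raise MemoryError("simulated OOM")
    return f"clip with {num_frames} frames"


print(generate_with_fallback(fake_generate))  # (81, 'clip with 81 frames')
```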

FP8 isn't faster on the 3090 — don't switch quants chasing speed

If you've read CogVideoX recipes targeting Ada or Hopper cards, you may see FP8 paths advertised as a speed/VRAM trade. This does NOT apply to the 3090. Ampere sm_86 has no FP8 tensor cores; FP8 first shipped on Hopper sm_90 and consumer Ada sm_89. Loading an FP8 weight file on the 3090 works (the runtime dequantizes to BF16 on the fly), but it gives no speedup, only the VRAM savings from the smaller stored weights. The model card's recommended BF16 + offload path is the right default for this card. If you need to free VRAM for a co-located workload, the lower-memory alternative on Ampere is INT8 via torchao (HF card cites "INT8 with optimizations: 7 GB minimum"); choose it for memory, not for raw speed.
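The architecture rule above reduces to a comparison on the CUDA compute capability. The sketch below encodes it; the function names are illustrative, and the thresholds are the ones stated in this recipe (FP8 tensor cores from sm_89, BF16 from Ampere sm_80 onward). The last helper queries the local GPU through PyTorch and is only meaningful on a machine with CUDA available.

```python
def has_fp8_tensor_cores(major: int, minor: int) -> bool:
    """FP8 tensor cores first appear on Ada (sm_89) and Hopper (sm_90)."""
    return (major, minor) >= (8, 9)


def supports_native_bf16(major: int, minor: int) -> bool:
    """BF16 is supported natively from Ampere (sm_80) onward."""
    return major >= 8


def local_capability():
    """Return the (major, minor) compute capability of the current CUDA device."""
    import torch  # lazy import so the pure helpers work without PyTorch

    return torch.cuda.get_device_capability()


# RTX 3090 is sm_86: BF16 yes, FP8 tensor cores no.
print(supports_native_bf16(8, 6), has_fp8_tensor_cores(8, 6))  # True False
```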

Multi-GPU note

The model card explicitly warns: "In multi-GPU inference, enable_sequential_cpu_offload() optimization needs to be disabled." Single 3090 setups are unaffected — this is only relevant if you split across two 3090s.

Output looks low-res

Make sure num_frames follows the 16N + 1 formula (e.g. 17, 33, 49, 65, 81, 97, 113, 129, 145, 161). Off-by-one frame counts trigger the wrong code path. Resolution is fixed at 1360×768 for the T2V variant; don't try to force lower — the model is trained for this size.
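A quick guard before calling the pipeline can catch an invalid frame count early. This is an illustrative sketch (the helper names are not part of diffusers); the list of valid values is generated from the 16N + 1 formula with N from 1 to 10, matching the examples above.

```python
# Every frame count of the form 16N + 1 for N = 1..10.
VALID_NUM_FRAMES = [16 * n + 1 for n in range(1, 11)]


def is_valid_num_frames(num_frames: int) -> bool:
    """True if num_frames matches the 16N + 1 formula within the supported range."""
    return num_frames in VALID_NUM_FRAMES


def nearest_valid_num_frames(num_frames: int) -> int:
    """Snap an arbitrary frame count to the closest supported value."""
    return min(VALID_NUM_FRAMES, key=lambda valid: abs(valid - num_frames))


print(VALID_NUM_FRAMES)              # [17, 33, 49, 65, 81, 97, 113, 129, 145, 161]
print(is_valid_num_frames(80))       # False
print(nearest_valid_num_frames(80))  # 81
```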

If your problem isn't covered above, report it via the submission form so we can extend this section.