CogVideoX 1.5 5B on RTX 3090 Ti: 1360x768 Text-to-Video with Diffusers

video · intermediate · 14 GB+ VRAM · May 28, 2026
Prerequisites
  • NVIDIA RTX 3090 Ti (24 GB VRAM) — the official model card explicitly notes the model should generally work with all NVIDIA Ampere architecture or higher devices
  • Python 3.10–3.12 (community-recommended range for diffusers main + transformers ≥ 4.46)
  • ~40 GB free storage for the 5B transformer, T5 text encoder, and VAE weights

What You'll Build

A local text-to-video pipeline that generates 5- or 10-second clips at 1360×768 from a text prompt using THUDM/CogVideoX1.5-5B on a single RTX 3090 Ti. The 3090 Ti's 24 GB Ampere envelope handles the model's BF16 path with diffusers' standard offload + VAE tiling/slicing optimizations — no FP8, no exotic quantization, no architecture-specific wheel selection. The 3090 Ti is the higher-end Ampere sibling of the RTX 3090 (same sm_86, same 24 GB), so the install path is identical; per-step compute on the Ti is slightly higher than the 3090 (10752 vs 10496 CUDA cores at higher boost clocks) but no first-party measurement is published for either card.

Hardware data: RTX 3090 Ti (24 GB VRAM) · CogVideoX 1.5 5B BF16 fits within the 24 GB envelope with the official optimized path · See benchmark data

ℹ️ Pick this variant, not its siblings. CogVideoX is a family: CogVideoX-2B (8N+1 frames at 720×480, from 4 GB FP16), CogVideoX-5B (8N+1 frames at 720×480, from 5 GB BF16), CogVideoX1.5-5B (this recipe: 16N+1 frames at 1360×768, from 10 GB BF16 with optimizations), and CogVideoX1.5-5B-I2V (image-to-video at 768–1360). The official model card lists VRAM and resolution figures for each variant. This recipe pins the 1.5-5B text-to-video variant.

⚠️ The 3090 Ti is the right card, but expect slower per-step times than on Ada. The 3090 Ti (Ampere sm_86) has no FP8 tensor-core acceleration (FP8 first shipped on Hopper sm_90 and consumer Ada sm_89), so any "FP8 path" advice from Ada/Hopper recipes does NOT transfer. The model card's recommended path is BF16, which Ampere supports natively. The model fits in 24 GB; the compute gap versus an Ada card shows up as slower per-step time, not as a memory failure.

Requirements

Component | Minimum | Tested
GPU | 14 GB VRAM derived envelope (HF card's BF16-with-optimizations floor of 10 GB + VAE peak headroom of 3-4 GB per the kijai wrapper); 24 GB recommended for the no-offload speedup | RTX 3090 Ti (24 GB)
RAM | 16 GB system RAM | 32 GB recommended for CPU-offload swap during enable_sequential_cpu_offload()
Storage | ~40 GB for weights | ~40 GB (transformer + T5 + VAE)
Software | Python 3.10–3.12, diffusers from source, transformers ≥ 4.46.2, accelerate ≥ 1.1.1 | -

Installation

1. Install diffusers from source

The CogVideoX 1.5 pipeline requires diffusers built from the development branch per the official model card:

pip install git+https://github.com/huggingface/diffusers
pip install --upgrade "transformers>=4.46.2" "accelerate>=1.1.1" imageio-ffmpeg

The default pip install torch already includes sm_86 (Ampere) kernels — no special wheel selection is needed on the 3090 Ti. FlashAttention-2 also has full sm_86 coverage if you want it later.
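
Before moving on, it can help to confirm that the installed packages meet the minimums this recipe lists. The following is a minimal sketch, not an official check: the helper names (version_tuple, meets_minimum, check_environment) are made up for illustration, and the version parser is a deliberate simplification that looks only at the leading numeric components, so a pre-release such as 4.46.2rc1 is treated the same as 4.46.2.

```python
import re
from importlib import metadata

# Minimum versions quoted in this recipe's Requirements section.
MINIMUMS = {"transformers": "4.46.2", "accelerate": "1.1.1"}


def version_tuple(version: str) -> tuple:
    """Keep only the leading numeric components: '0.32.0.dev0' -> (0, 32, 0)."""
    match = re.match(r"\d+(?:\.\d+)*", version)
    return tuple(int(part) for part in match.group().split(".")) if match else ()


def meets_minimum(installed: str, minimum: str) -> bool:
    """Compare two version strings component by component."""
    return version_tuple(installed) >= version_tuple(minimum)


def check_environment(minimums=MINIMUMS) -> dict:
    """Map each package to True (ok), False (too old), or None (not installed)."""
    report = {}
    for name, minimum in minimums.items():
        try:
            report[name] = meets_minimum(metadata.version(name), minimum)
        except metadata.PackageNotFoundError:
            report[name] = None
    return report


if __name__ == "__main__":
    for name, ok in check_environment().items():
        status = "ok" if ok else "missing" if ok is None else "too old"
        print(f"{name}: {status}")
```

A None entry means the package is not installed at all; False means it is installed but older than the listed minimum.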

2. Download the model weights

huggingface-cli download THUDM/CogVideoX1.5-5B --local-dir ./CogVideoX1.5-5B

If THUDM/CogVideoX1.5-5B redirects (the upstream org was renamed), the zai-org/CogVideoX1.5-5B mirror is the same model — same weights, just under the renamed org slug. The model card you see at the THUDM URL today is served from zai-org via redirect.

3. (Optional) Install ComfyUI + the CogVideoX wrapper

If you prefer a node-based workflow, the kijai/ComfyUI-CogVideoXWrapper repo ships example workflows including support for the 1.5-5B variant. Skip if you're using diffusers directly.

cd ComfyUI/custom_nodes
git clone https://github.com/kijai/ComfyUI-CogVideoXWrapper
cd ComfyUI-CogVideoXWrapper
pip install -r requirements.txt

Running

Save the following as run_cogvideox.py. This is the inference snippet from the official model card, with the model ID set to THUDM/CogVideoX1.5-5B:

import torch
from diffusers import CogVideoXPipeline
from diffusers.utils import export_to_video

prompt = (
    "A panda, dressed in a small, red jacket and a tiny hat, sits on a wooden stool "
    "in a serene bamboo forest. The panda's fluffy paws strum a miniature acoustic "
    "guitar, producing soft, melodic tunes."
)

pipe = CogVideoXPipeline.from_pretrained(
    "THUDM/CogVideoX1.5-5B",
    torch_dtype=torch.bfloat16,
)

pipe.enable_sequential_cpu_offload()
pipe.vae.enable_tiling()
pipe.vae.enable_slicing()

video = pipe(
    prompt=prompt,
    num_videos_per_prompt=1,
    num_inference_steps=50,
    num_frames=81,
    guidance_scale=6,
    generator=torch.Generator(device="cuda").manual_seed(42),
).frames[0]

export_to_video(video, "output.mp4", fps=8)

Run the script:

python run_cogvideox.py

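num_frames follows the formula 16N + 1 with N ≤ 10, so num_frames=81 is N=5 and num_frames=161 is N=10. At 16 fps these are clips of about 5 and 10 seconds. Note that the snippet exports with fps=8, so the 81-frame output.mp4 plays for about 10 seconds at half speed; pass fps=16 to export_to_video for real-time playback. Because the script loads the model by hub ID, the first run downloads weights to your Hugging Face cache even if you completed step 2; to reuse that download, pass "./CogVideoX1.5-5B" to from_pretrained instead. Later runs start inference immediately.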
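
The frame-count arithmetic can be sketched in a few lines. This is an illustrative helper, not part of diffusers: the function names are hypothetical, and the constants simply restate the 16N + 1 rule (N ≤ 10) and the 16 fps figure quoted in this recipe.

```python
NATIVE_FPS = 16  # frame rate this recipe quotes for CogVideoX1.5-5B
MAX_N = 10       # upper bound on N in the 16N + 1 rule


def frames_for(n: int) -> int:
    """Return the frame count 16N + 1 for a given N in 1..MAX_N."""
    if not 1 <= n <= MAX_N:
        raise ValueError(f"N must be between 1 and {MAX_N}, got {n}")
    return 16 * n + 1


def clip_seconds(num_frames: int, fps: int = NATIVE_FPS) -> float:
    """Playback duration of a clip written at the given frame rate."""
    return num_frames / fps


def frames_for_seconds(seconds: float, fps: int = NATIVE_FPS) -> int:
    """Pick the valid frame count closest to a target duration."""
    n = round(seconds * fps / 16)
    return frames_for(max(1, min(MAX_N, n)))


print(frames_for_seconds(5), clip_seconds(81))          # 81 frames, 5.0625 s at 16 fps
print(frames_for_seconds(10), clip_seconds(81, fps=8))  # 161 frames; 81 frames last 10.125 s at 8 fps
```

The last line shows why the fps argument passed to export_to_video matters: the same 81 frames last twice as long when written at 8 fps.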

Results

  • Speed: No first-party RTX 3090 Ti measurement for CogVideoX 1.5 5B is published. The official model card reports reference points only for datacenter hardware (single A100: ~1000 seconds; single H100: ~550 seconds, both for a 5-second / 81-frame / 50-step clip) and carries the caveat "has not been tested on non-NVIDIA A100/H100 devices." CogVideoX 1.5 is a video DiT and is compute-bound at the transformer stage, so expect substantially longer wall-clock time on a 3090 Ti than on an Ada or Hopper card of comparable VRAM. The site's sibling 4090 recipe is the closest published Ada reference point for the same workload, but per-step throughput on Ada sm_89 does NOT transfer cleanly to Ampere sm_86. If you have a 3090 Ti measurement, please contribute it so this section can report a real number.
  • VRAM usage: The model card's memory table cites diffusers BF16: from 10 GB with enable_sequential_cpu_offload() + vae.enable_tiling() + vae.enable_slicing() active. The kijai wrapper README observes for the 5B family that "VAE decoding seems to be the only big that takes a lot of VRAM when everything is offloaded, peaks at around 13-14GB momentarily at that stage." and "Sampling itself takes only maybe 5-6GB." Peak memory is driven by tensor sizes rather than GPU architecture, so this figure should carry over from Ada to Ampere with only marginal change. The 3090 Ti's 24 GB leaves ~10 GB of headroom on the optimized path.
  • Quality notes: Native resolution is 1360×768 (not 720×480 — that's the CogVideoX-2B / CogVideoX-5B resolution). Don't reduce below 768 on the short axis; the model is trained for the higher resolution. The model card's reference benchmarks use num_inference_steps=50; the card does not publish a per-step-count quality comparison — tune empirically if you want to trade quality for speed.
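
If you want to produce the missing 3090 Ti timing yourself, a small wrapper around the pipeline call is enough. This is a rough sketch under stated assumptions: time_run is a hypothetical helper, and the total it reports includes text encoding and VAE decoding, so the per-step figure is an average over the whole run rather than a pure denoising-step time.

```python
import time


def time_run(run, num_inference_steps: int) -> dict:
    """Time a zero-argument callable and derive an average seconds-per-step."""
    start = time.perf_counter()
    run()
    total = time.perf_counter() - start
    return {"total_s": total, "s_per_step": total / num_inference_steps}


# Usage with the script above (assumes `pipe` and `prompt` are already defined):
#   stats = time_run(lambda: pipe(prompt=prompt, num_frames=81,
#                                 num_inference_steps=50, guidance_scale=6), 50)
#   print(stats)
```

When reporting a number, it is worth stating the frame count, step count, and which offload options were active, since each changes the result.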

For the full benchmark data, see /check/cogvideox-1-5/rtx-3090-ti.

Troubleshooting

Want a 3–4× speed boost and willing to use more VRAM?

The 3090 Ti's 24 GB is large enough to drop some of the diffusers offload optimizations. The model card states verbatim: "Disabling optimizations can triple VRAM usage but increase speed by 3-4 times. You can selectively disable certain optimizations" (the three optimizations called out by name on the card are pipe.enable_sequential_cpu_offload(), pipe.vae.enable_slicing(), and pipe.vae.enable_tiling()). The card's documented SAT-path peak without optimizations is 76 GB in BF16 for this variant, so do NOT remove all of them on a 24 GB card.

The pragmatic order on a 24 GB Ampere card is to remove pipe.enable_sequential_cpu_offload() first while keeping vae.enable_tiling() and vae.enable_slicing(), since that is the highest-impact toggle for compute throughput. Deleting the call alone leaves the pipeline on the CPU, so replace it with pipe.to("cuda"); if the full pipeline does not fit, diffusers' coarser pipe.enable_model_cpu_offload() is a middle ground worth trying (neither substitute is measured here). Watch nvidia-smi during a run; if the peak stays under ~22 GB, the change is safe.
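
As a complement to watching nvidia-smi, PyTorch can report the peak itself. The sketch below is illustrative only: the helper names and the 22 GiB default are assumptions taken from the threshold mentioned above. Note that torch.cuda.max_memory_reserved() covers PyTorch's caching allocator only, so nvidia-smi will typically read somewhat higher because it also counts the CUDA context.

```python
GIB = 1024 ** 3


def to_gib(num_bytes: int) -> float:
    """Convert a byte count to GiB."""
    return num_bytes / GIB


def within_budget(peak_bytes: int, budget_gib: float = 22.0) -> bool:
    """True if the measured peak stays at or below the chosen budget."""
    return to_gib(peak_bytes) <= budget_gib


def measure_peak_bytes(run) -> int:
    """Run a zero-argument callable and return the allocator's peak reserved bytes."""
    import torch  # imported lazily so the helpers above work without a GPU

    torch.cuda.reset_peak_memory_stats()
    run()
    torch.cuda.synchronize()
    return torch.cuda.max_memory_reserved()


# Usage (assumes `pipe` and `prompt` from the script above):
#   peak = measure_peak_bytes(lambda: pipe(prompt=prompt, num_frames=81))
#   print(f"{to_gib(peak):.1f} GiB, within budget: {within_budget(peak)}")
```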

CUDA OOM with allocator fragmentation

On older PyTorch + driver combos the diffusers CLI demo can OOM near the 24 GB limit because of allocator fragmentation rather than absolute peak usage. The Z-AI team (zRzRzRzRzRzRzR, GitHub MEMBER) recommends in Issue #92 setting the environment variable PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True before launching. The issue thread was opened against the older CogVideoX-2B variant, but the fragmentation workaround is allocator-level and applies to any CogVideoX variant on a 24 GB Ampere card running close to the limit.

PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True python run_cogvideox.py
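
If you would rather keep the setting inside the script than on the command line, the same variable can be set from Python. This is a minimal sketch with a hypothetical helper name; the conservative assumption is that the variable has to be in place before PyTorch initializes CUDA, so the call goes at the very top of run_cogvideox.py, ahead of import torch.

```python
import os


def configure_allocator(env=os.environ) -> str:
    """Set expandable_segments unless an allocator config is already present."""
    return env.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")


# Call this first, and only then `import torch` and build the pipeline.
configure_allocator()
```

Using setdefault means a value exported in the shell still wins, which keeps the command-line form above usable for one-off experiments.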

OOM during VAE decoding even on 24 GB

This can happen at the 161-frame (10-second) configuration. The kijai wrapper README identifies the VAE decode as the runtime peak (~13–14 GB on the 5B family). Keep pipe.vae.enable_tiling() and pipe.vae.enable_slicing() on; if it still OOMs, drop num_frames to 81 (5-second clip) and generate any additional length as a separate clip.
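
The fallback from 161 to 81 frames can be automated. The sketch below is one possible shape for it, with a hypothetical helper (generate_with_fallback) that is not part of diffusers. The exception types are injectable so the logic can be exercised without a GPU; for real runs you would pass torch.cuda.OutOfMemoryError and free cached memory between attempts.

```python
def generate_with_fallback(generate, frame_counts=(161, 81),
                           oom_errors=(MemoryError,), on_retry=None):
    """Try each frame count in order; return (num_frames, result) for the first success."""
    if not frame_counts:
        raise ValueError("frame_counts must not be empty")
    last_error = None
    for num_frames in frame_counts:
        try:
            return num_frames, generate(num_frames)
        except oom_errors as err:
            last_error = err
            if on_retry is not None:
                on_retry()  # e.g. torch.cuda.empty_cache
    raise last_error


# Usage with the script above (assumes `pipe` and `prompt` are defined):
#   import torch
#   used, frames = generate_with_fallback(
#       lambda n: pipe(prompt=prompt, num_frames=n).frames[0],
#       oom_errors=(torch.cuda.OutOfMemoryError,),
#       on_retry=torch.cuda.empty_cache,
#   )
```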

FP8 isn't faster on the 3090 Ti — don't switch quants chasing speed

If you've read CogVideoX recipes targeting Ada or Hopper cards, you may see FP8 paths advertised as a speed/VRAM trade. This does NOT apply to the 3090 Ti. Ampere sm_86 has no FP8 tensor cores; FP8 first shipped on Hopper sm_90 and consumer Ada sm_89. Loading an FP8 weight file on the 3090 Ti works (the runtime dequantizes to BF16 on the fly), but it yields no speedup, only the VRAM savings from storing smaller weights. The model card's recommended BF16 + offload path is the right default for this card. If you need to free VRAM for a co-located workload (not for raw speed), the model card's documented INT8-via-torchao path lists "diffusers INT8(torchao): from 7GB" for the 1.5-5B variant, but that path costs inference speed per the card's note that "Using an INT8 model reduces inference speed."
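
The architecture rule above can be expressed as a check on the GPU's compute capability. This is a simplified sketch: the function names are hypothetical, and the thresholds just encode the statements in this recipe (BF16 from Ampere sm_80 upward, FP8 tensor cores from Ada sm_89 and Hopper sm_90 upward). The (major, minor) pair is what torch.cuda.get_device_capability() returns.

```python
def supports_native_bf16(capability) -> bool:
    """Ampere (sm_80) and newer handle BF16 natively."""
    return tuple(capability) >= (8, 0)


def has_fp8_tensor_cores(capability) -> bool:
    """FP8 tensor cores start at Ada (sm_89) and Hopper (sm_90)."""
    return tuple(capability) >= (8, 9)


def describe(capability) -> dict:
    """Summarize what a (major, minor) compute capability implies for this recipe."""
    major, minor = capability
    return {
        "arch": f"sm_{major}{minor}",
        "bf16": supports_native_bf16(capability),
        "fp8": has_fp8_tensor_cores(capability),
    }


print(describe((8, 6)))  # RTX 3090 Ti: BF16 yes, FP8 no

# On a real machine:
#   import torch
#   print(describe(torch.cuda.get_device_capability(0)))
```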

Multi-GPU note

The model card explicitly warns: "In multi-GPU inference, enable_sequential_cpu_offload() optimization needs to be disabled." Single 3090 Ti setups are unaffected — this is only relevant if you split across two 3090 Tis (the card's documented multi-GPU diffusers peak for 1.5-5B is "BF16: 24GB using diffusers" with enable_sequential_cpu_offload() removed).

Output looks low-res

Make sure num_frames follows the 16N + 1 formula (e.g. 17, 33, 49, 65, 81, 97, 113, 129, 145, 161). Off-by-one frame counts trigger the wrong code path. Resolution is fixed at 1360×768 for the T2V variant; don't try to force lower — the model is trained for this size.
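
A quick guard against off-by-one frame counts is to validate the value before calling the pipeline. The helpers below are a hypothetical sketch of the 16N + 1 rule with N from 1 to 10, as listed above; they are not part of diffusers.

```python
def is_valid_num_frames(num_frames: int, max_n: int = 10) -> bool:
    """True if num_frames equals 16N + 1 for some N in 1..max_n."""
    n, remainder = divmod(num_frames - 1, 16)
    return remainder == 0 and 1 <= n <= max_n


def nearest_valid_num_frames(num_frames: int, max_n: int = 10) -> int:
    """Snap an arbitrary frame count to the closest valid 16N + 1 value."""
    n = round((num_frames - 1) / 16)
    return 16 * max(1, min(max_n, n)) + 1


print([n for n in range(1, 200) if is_valid_num_frames(n)])
print(nearest_valid_num_frames(80))  # 80 is off by one; the closest valid count is 81
```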

If your problem isn't covered above, report it via the submission form so we can extend this section.