What You'll Build
A local text-to-video pipeline that generates 5- or 10-second clips at 1360×768 from a text prompt using THUDM/CogVideoX1.5-5B on a single RTX 3090. The 3090's 24 GB Ampere envelope handles the model's BF16 path with diffusers' standard offload + VAE tiling/slicing optimizations — no FP8, no exotic quantization, no architecture-specific wheel selection.
Hardware data: RTX 3090 (24 GB VRAM) · CogVideoX 1.5 5B BF16 fits within the 24 GB envelope with the official optimized path · See benchmark data
ℹ️ Pick this variant, not its siblings. CogVideoX is a family: CogVideoX-2B (8N+1 frames at 720×480, ~5 GB BF16), CogVideoX-5B (8N+1 frames at 720×480, 15 GB BF16), CogVideoX1.5-5B (this recipe: 16N+1 frames at 1360×768, from 10 GB BF16 with optimizations), and CogVideoX1.5-5B-I2V (image-to-video at 768–1360). The official model card lists distinct VRAM and resolution profiles for all four. This recipe pins the 1.5-5B text-to-video variant.
⚠️ The 3090 is the right card, but expect a compute-bound video-DiT workload. The 3090 (Ampere sm_86) has no FP8 tensor-core acceleration (FP8 first shipped on Hopper sm_90 and consumer Ada sm_89), so any "FP8 path" advice from Ada/Hopper recipes does NOT transfer. The model card's recommended path is BF16, which Ampere supports natively. The model fits cleanly in 24 GB; the compute gap versus an Ada card shows up as slower per-step time, not as a memory failure.
Requirements
| Component | Minimum | Tested |
|---|---|---|
| GPU | 14 GB VRAM derived envelope (HF card's BF16-with-optimizations floor of 10 GB + kijai VAE peak headroom of 3-4 GB); 24 GB recommended for the no-offload speedup | RTX 3090 (24 GB) |
| RAM | 16 GB system RAM | 32 GB recommended for CPU-offload swap during enable_sequential_cpu_offload() |
| Storage | ~40 GB for weights | ~40 GB (transformer + T5 + VAE) |
| Software | Python 3.10–3.12, diffusers from source, transformers ≥ 4.46.2, accelerate ≥ 1.1.1 | — |
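The Software row can be checked from Python before you download any weights. The following is a minimal sketch, assuming the packages are installed under their PyPI names. The helper names (`parse_version`, `meets_minimum`, `python_supported`, `check_packages`) are hypothetical, and the parser compares only the leading numeric components, so a source build such as `0.32.0.dev0` is read as `0.32.0`.

```python
import re
import sys
from importlib import metadata

# Minimums taken from the Software row above. diffusers is not listed here
# because the recipe installs it from source rather than pinning a release.
MINIMUMS = {"transformers": "4.46.2", "accelerate": "1.1.1"}


def parse_version(text):
    """Return the leading numeric components, e.g. '1.1.1.dev0' -> (1, 1, 1)."""
    parts = []
    for piece in text.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def meets_minimum(installed, minimum):
    """True when the installed version is at least the minimum."""
    return parse_version(installed) >= parse_version(minimum)


def python_supported(version_info=sys.version_info):
    """Python 3.10 through 3.12, per the table."""
    return (3, 10) <= tuple(version_info[:2]) <= (3, 12)


def check_packages(minimums=MINIMUMS):
    """Map each package name to 'ok', 'too old (<version>)', or 'missing'."""
    report = {}
    for name, minimum in minimums.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = "missing"
            continue
        if meets_minimum(installed, minimum):
            report[name] = "ok"
        else:
            report[name] = f"too old ({installed})"
    return report


print("python ok:", python_supported())
print(check_packages())
```

Anything reported as `missing` or `too old` is addressed by the install commands in the next section.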
Installation
1. Install diffusers from source
The CogVideoX 1.5 pipeline requires diffusers built from the development branch per the official model card:
pip install git+https://github.com/huggingface/diffusers
pip install --upgrade "transformers>=4.46.2" "accelerate>=1.1.1" imageio-ffmpeg
The default pip install torch already includes sm_86 (Ampere) kernels — no special wheel selection is needed on the 3090. FlashAttention-2 also has full sm_86 coverage if you want it later.
2. Download the model weights
huggingface-cli download THUDM/CogVideoX1.5-5B --local-dir ./CogVideoX1.5-5B
If THUDM/CogVideoX1.5-5B is unavailable, use the zai-org/CogVideoX1.5-5B mirror. The upstream org was renamed, and the weights are identical.
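The same download, with the fallback to the renamed org, can be scripted. This is a sketch rather than an official procedure: it assumes the `huggingface_hub` package is installed and uses its `snapshot_download` function, while `hub_fetch` and `download_with_fallback` are hypothetical helper names. Errors are caught broadly so that network, authentication, and not-found failures all fall through to the next repo ID.

```python
REPO_IDS = ("THUDM/CogVideoX1.5-5B", "zai-org/CogVideoX1.5-5B")


def hub_fetch(repo_id, local_dir):
    """Download one repo snapshot and return its local path."""
    # Imported lazily so this module loads even without the hub client.
    from huggingface_hub import snapshot_download

    return snapshot_download(repo_id=repo_id, local_dir=local_dir)


def download_with_fallback(local_dir, repo_ids=REPO_IDS, fetch=hub_fetch):
    """Try each repo ID in order; return (repo_id, path) for the first success."""
    errors = []
    for repo_id in repo_ids:
        try:
            return repo_id, fetch(repo_id, local_dir)
        except Exception as exc:  # broad on purpose, see the note above
            errors.append(f"{repo_id}: {exc}")
    raise RuntimeError("all sources failed:\n" + "\n".join(errors))
```

Calling `download_with_fallback("./CogVideoX1.5-5B")` mirrors the CLI command above and reports which org actually served the files.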
3. (Optional) Install ComfyUI + the CogVideoX wrapper
If you prefer a node-based workflow, the kijai/ComfyUI-CogVideoXWrapper repo ships example workflows including cogvideox1.5_t2v.json. Skip if you're using diffusers directly.
cd ComfyUI/custom_nodes
git clone https://github.com/kijai/ComfyUI-CogVideoXWrapper
cd ComfyUI-CogVideoXWrapper
pip install -r requirements.txt
Running
Save the following as run_cogvideox.py. It is the canonical inference snippet from the official model card, unmodified apart from the model ID being pinned to this variant:
import torch
from diffusers import CogVideoXPipeline
from diffusers.utils import export_to_video

prompt = (
    "A panda, dressed in a small, red jacket and a tiny hat, sits on a wooden stool "
    "in a serene bamboo forest. The panda's fluffy paws strum a miniature acoustic "
    "guitar, producing soft, melodic tunes."
)

# Load the pipeline in BF16, which Ampere supports natively.
pipe = CogVideoXPipeline.from_pretrained(
    "THUDM/CogVideoX1.5-5B",
    torch_dtype=torch.bfloat16,
)

# Memory optimizations: CPU offload plus tiled and sliced VAE decoding.
pipe.enable_sequential_cpu_offload()
pipe.vae.enable_tiling()
pipe.vae.enable_slicing()

video = pipe(
    prompt=prompt,
    num_videos_per_prompt=1,
    num_inference_steps=50,
    num_frames=81,
    guidance_scale=6,
    generator=torch.Generator(device="cuda").manual_seed(42),
).frames[0]

export_to_video(video, "output.mp4", fps=8)
python run_cogvideox.py
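num_frames=81 yields a ~5-second clip at the model's 16 fps (the formula is 16N + 1 with N ≤ 10; 81 is N=5). Note that the snippet exports with fps=8, so the saved file plays those 81 frames over ~10 seconds; pass fps=16 to export_to_video for real-time playback. For a 10-second clip set num_frames=161 (N=10). The first run downloads weights to your Hugging Face cache; later runs start inference immediately.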
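The frame arithmetic is easy to get wrong by one, so here is a small worked sketch. It assumes N equals the clip length in whole seconds at 16 fps, which matches the two cases given above (81 and 161). The function names are hypothetical.

```python
NATIVE_FPS = 16  # frame rate the text above uses for CogVideoX 1.5
MAX_N = 10       # the 16N + 1 rule is stated with N <= 10


def frames_for_seconds(seconds):
    """num_frames for a whole-second clip: 16N + 1 with N = seconds."""
    if not isinstance(seconds, int) or not 1 <= seconds <= MAX_N:
        raise ValueError(f"seconds must be an integer from 1 to {MAX_N}")
    return NATIVE_FPS * seconds + 1


def playback_seconds(num_frames, fps):
    """Length of the exported file when num_frames are written at fps."""
    return num_frames / fps


print(frames_for_seconds(5))     # 81
print(frames_for_seconds(10))    # 161
print(playback_seconds(81, 16))  # 5.0625
print(playback_seconds(81, 8))   # 10.125
```

The last two lines show why the `fps` argument to `export_to_video` matters: the same 81 frames last about 5 seconds at 16 fps and about 10 seconds at 8 fps.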
Results
- Speed: No first-party RTX 3090 measurement for CogVideoX 1.5 5B is published. The official model card reports reference points only for datacenter hardware (single A100: ~1000 s, single H100: ~550 s for a 5-second / 81-frame / 50-step clip) and explicitly notes "This scheme has not been tested for actual memory usage on devices outside of NVIDIA A100 / H100 architectures." CogVideoX 1.5 is a video DiT — compute-bound at the transformer stage — so expect substantially longer wall-clock time on a 3090 than on an Ada or Hopper card of comparable VRAM. The site's sibling 4090 recipe is the closest published reference point for the same workload, but the 4090's per-step throughput on Ada sm_89 does NOT transfer to the 3090. If you have a 3090 measurement, please contribute it so this section can replace the omission with a real number.
- VRAM usage: The model card's memory table cites diffusers BF16 from 10 GB with enable_sequential_cpu_offload() + vae.enable_tiling() + vae.enable_slicing() active. kijai's wrapper README observes that "VAE decoding...peaks at around 13-14 GB momentarily" and "Sampling itself takes only maybe 5-6 GB" for the 5B family. The VAE is memory-bound, so this peak should carry over from Ada to Ampere with only marginal change. The 3090's 24 GB leaves ~10 GB of headroom on the optimized path.
- Quality notes: Native resolution is 1360×768 (not 720×480, which is the resolution of the original CogVideoX-2B and CogVideoX-5B). Don't reduce below 768 on the short axis; the model is trained for the higher resolution. The model card's reference benchmarks use num_inference_steps=50; lower step counts trade quality for speed (the card publishes no per-step-count quality comparison, so adjust empirically).
For the full benchmark data, see /check/cogvideox-1-5/rtx-3090.
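Since no 3090 number is published, you may want to record your own run. The sketch below is one way to do it, not a standard benchmark harness: `summarize_run` and `measure` are hypothetical names, and `measure` relies on PyTorch's `torch.cuda.reset_peak_memory_stats`, `torch.cuda.synchronize`, and `torch.cuda.max_memory_allocated`. Two caveats: the per-step figure is a whole-run average that also includes text encoding and VAE decoding, and PyTorch's allocator peak is typically lower than what nvidia-smi shows for the process.

```python
import time


def summarize_run(wall_seconds, num_inference_steps, peak_bytes):
    """Turn raw measurements into the numbers a benchmark report needs."""
    return {
        "wall_seconds": round(wall_seconds, 1),
        "seconds_per_step": round(wall_seconds / num_inference_steps, 2),
        "peak_vram_gib": round(peak_bytes / 2**30, 2),
    }


def measure(run, num_inference_steps):
    """Time run() and record PyTorch's peak allocated VRAM on the current GPU."""
    import torch  # imported lazily so summarize_run works without a GPU

    torch.cuda.reset_peak_memory_stats()
    torch.cuda.synchronize()
    start = time.perf_counter()
    result = run()
    torch.cuda.synchronize()
    elapsed = time.perf_counter() - start
    peak = torch.cuda.max_memory_allocated()
    return result, summarize_run(elapsed, num_inference_steps, peak)
```

In run_cogvideox.py you could wrap the pipeline call as `video, stats = measure(lambda: pipe(prompt=prompt, num_inference_steps=50, num_frames=81, guidance_scale=6).frames[0], 50)` and print `stats`.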
Troubleshooting
Want a 3–4× speed boost and willing to use more VRAM?
The 3090's 24 GB is large enough to drop some of the diffusers offload optimizations. Per the model card: "Disabling optimizations can triple VRAM usage but increase speed by 3-4 times. You can selectively disable certain optimizations." The HF card's reported "without optimizations" peak is 76 GB BF16, so do NOT remove all optimizations on a 24 GB card. Try removing pipe.enable_sequential_cpu_offload() first while keeping vae.enable_tiling() and vae.enable_slicing(). That is the highest-impact toggle, and it moves memory use toward the 33 GB / 19 GB / 11 GB ladder documented on the diffusers CogVideoX page (presented there in the CogVideoX-5B context). Watch nvidia-smi during a run; if peak stays under ~22 GB, the change is safe.
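Watching nvidia-smi by eye can miss the peak, so a small poller helps. This sketch assumes `nvidia-smi` is on your PATH and uses its `--query-gpu=memory.used --format=csv,noheader,nounits` output, which prints one MiB value per GPU. The function names are hypothetical, and a one-second polling interval can still miss very short spikes.

```python
import subprocess
import time

QUERY = ["nvidia-smi", "--query-gpu=memory.used", "--format=csv,noheader,nounits"]
SAFE_PEAK_MIB = 22 * 1024  # the ~22 GB ceiling suggested above


def parse_used_mib(output, gpu_index=0):
    """Pick one GPU's reading from output with one MiB integer per line."""
    lines = [line.strip() for line in output.splitlines() if line.strip()]
    return int(lines[gpu_index])


def is_safe(peak_mib, limit_mib=SAFE_PEAK_MIB):
    """True when the observed peak stays under the ceiling."""
    return peak_mib < limit_mib


def watch_peak(duration_s, interval_s=1.0, gpu_index=0):
    """Poll nvidia-smi for duration_s seconds; return the highest reading in MiB."""
    peak = 0
    deadline = time.monotonic() + duration_s
    while time.monotonic() < deadline:
        result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
        peak = max(peak, parse_used_mib(result.stdout, gpu_index))
        time.sleep(interval_s)
    return peak
```

Run `watch_peak(...)` in a second terminal for roughly the length of one generation, then pass the result to `is_safe`.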
OOM during VAE decoding even on 24 GB
This is possible at the 161-frame (10-second) configuration. Per kijai's wrapper, the VAE decode is the runtime peak (~13–14 GB on the 5B family). Keep pipe.vae.enable_tiling() and pipe.vae.enable_slicing() on; if it still OOMs, drop num_frames to 81 (5-second clip) and use a second pass for length.
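The fallback from 161 to 81 frames can be automated. The sketch below is hypothetical glue code, not part of diffusers: `generate_with_fallback` takes any callable that accepts `num_frames`, plus the exception type to treat as out-of-memory. On a real run you would pass `oom_error=torch.cuda.OutOfMemoryError` and `cleanup=torch.cuda.empty_cache`; the defaults here use the built-in `MemoryError` so the function can be exercised without a GPU.

```python
def generate_with_fallback(run, frame_counts=(161, 81), oom_error=MemoryError, cleanup=None):
    """Call run(num_frames) for each count in turn until one completes.

    Returns (num_frames, result). Raises RuntimeError if every count fails.
    """
    for num_frames in frame_counts:
        try:
            return num_frames, run(num_frames)
        except oom_error:
            # Do not keep the exception object around: its traceback can
            # hold references to GPU tensors and delay their release.
            pass
        if cleanup is not None:
            cleanup()  # e.g. free cached allocator blocks before retrying
    raise RuntimeError(f"out of memory at every frame count tried: {list(frame_counts)}")
```

With the script above, `run` could be `lambda n: pipe(prompt=prompt, num_inference_steps=50, num_frames=n, guidance_scale=6).frames[0]`.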
FP8 isn't faster on the 3090 — don't switch quants chasing speed
If you've read CogVideoX recipes targeting Ada or Hopper cards, you may see FP8 paths advertised as a speed/VRAM trade. This does NOT apply to the 3090. Ampere sm_86 has no FP8 tensor cores; FP8 first shipped on Hopper sm_90 and consumer Ada sm_89. Loading an FP8 weight file on the 3090 works (the runtime dequantizes to BF16 on the fly), but it gives no speedup, only the VRAM savings from smaller stored weights. The model card's recommended BF16 + offload path is the right default for this card. If you need to free VRAM for a co-located workload rather than gain speed, the lower-memory alternative on Ampere is INT8 via torchao (HF card cites "INT8 with optimizations: 7 GB minimum").
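The compute-capability rule above reduces to a tuple comparison. In this sketch the function names are hypothetical; on a real machine the `(major, minor)` tuple would come from `torch.cuda.get_device_capability(0)`, which is not called here so the example runs without a GPU.

```python
FP8_MIN_CAPABILITY = (8, 9)  # Ada is sm_89, Hopper is sm_90


def has_fp8_tensor_cores(capability):
    """capability is a (major, minor) tuple such as (8, 6) for the RTX 3090."""
    return tuple(capability) >= FP8_MIN_CAPABILITY


def describe(capability):
    """One-line verdict for a given compute capability."""
    major, minor = capability
    if has_fp8_tensor_cores(capability):
        verdict = "FP8 tensor cores available"
    else:
        verdict = "no FP8 tensor cores, stay on BF16"
    return f"sm_{major}{minor}: {verdict}"


print(describe((8, 6)))  # sm_86: no FP8 tensor cores, stay on BF16
print(describe((8, 9)))  # sm_89: FP8 tensor cores available
```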
Multi-GPU note
The model card explicitly warns: "In multi-GPU inference, enable_sequential_cpu_offload() optimization needs to be disabled." Single 3090 setups are unaffected — this is only relevant if you split across two 3090s.
Output looks low-res
Make sure num_frames follows the 16N + 1 formula (e.g. 17, 33, 49, 65, 81, 97, 113, 129, 145, 161). Off-by-one frame counts trigger the wrong code path. Resolution is fixed at 1360×768 for the T2V variant; don't try to force lower — the model is trained for this size.
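A quick guard before calling the pipeline catches invalid frame counts early. This is a minimal sketch with hypothetical helper names; it only encodes the 16N + 1 rule with N from 1 to 10 as listed above.

```python
VALID_NUM_FRAMES = [16 * n + 1 for n in range(1, 11)]  # 17, 33, ..., 161


def is_valid_num_frames(num_frames):
    """True when num_frames is one of the 16N + 1 values."""
    return num_frames in VALID_NUM_FRAMES


def nearest_valid_num_frames(num_frames):
    """Snap to the closest 16N + 1 value; ties resolve to the smaller one."""
    return min(VALID_NUM_FRAMES, key=lambda valid: (abs(valid - num_frames), valid))


print(VALID_NUM_FRAMES)
print(is_valid_num_frames(80), nearest_valid_num_frames(80))  # False 81
```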
If your problem isn't covered above, report it via the submission form so we can extend this section.