Mochi 1 on RTX 5090: 85-frame 480p Text-to-Video with Diffusers

video · intermediate · 22GB+ VRAM · May 24, 2026
prerequisites
  • NVIDIA RTX 5090 (32GB VRAM) — the documented 22 GB diffusers bf16 peak leaves ~10 GB of headroom
  • Python 3.10+ with PyTorch 2.7+ built against CUDA 12.8 (Blackwell sm_120 wheels)
  • ~50 GB free disk space for the bf16 variant + T5-XXL text encoder weights
  • FFmpeg installed (required for video export)
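
The non-GPU prerequisites above can be checked before downloading anything. The following is a minimal preflight sketch, assuming the weights will be cached somewhere under your home directory and reusing the ~50 GB figure from the list. The helper name `preflight` and its return shape are illustrative and not part of any library.

```python
import shutil
import sys
from pathlib import Path

REQUIRED_FREE_GB = 50  # disk figure from the prerequisites list


def preflight(cache_root=None, required_gb=REQUIRED_FREE_GB):
    """Return a dict of pass/fail checks for the non-GPU prerequisites."""
    # Assumption: the download lands on the same filesystem as cache_root.
    root = Path(cache_root) if cache_root else Path.home()
    free_gb = shutil.disk_usage(root).free / 1e9
    return {
        "python_3_10_plus": sys.version_info >= (3, 10),
        "ffmpeg_on_path": shutil.which("ffmpeg") is not None,
        "enough_disk": free_gb >= required_gb,
    }


for name, ok in preflight().items():
    print(f"{name}: {'ok' if ok else 'MISSING'}")
```

If your Hugging Face cache lives on another drive, pass that directory as `cache_root` so the free-space check looks at the right filesystem.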

What You'll Build

A local text-to-video pipeline that turns a prompt into an 85-frame 848×480 clip at 30 fps using Genmo's 10B-parameter Mochi 1 model on a single RTX 5090. The recipe uses the bf16 variant with enable_model_cpu_offload() and enable_vae_tiling(). The official diffusers Mochi docs list this exact configuration at num_frames=85 as fitting in 22 GB of VRAM, which leaves roughly 10 GB of headroom within the 5090's 32 GB.

Hardware data: RTX 5090 (32 GB VRAM) · ~22 GB documented peak (diffusers bf16 + offload + tiling, num_frames=85) · See benchmark data

ℹ️ Frame-count framing. The diffusers docs publish two anchor points: num_frames=85 at 22 GB (bf16 + offload + tiling, what this recipe walks through) and the full 163-frame run requiring 70 GB in the "Reproducing the results from the Genmo Mochi repo" section. The 5090's 32 GB envelope sits comfortably above the 85-frame anchor but well below the 163-frame anchor, so this recipe pins 85 — same as the official diffusers example. Pushing past 85 frames on a single 5090 is undocumented territory; if you want the full 163 frames, the diffusers docs show a multi-GPU device_map="auto" split with max_memory={0:"24GB",1:"24GB"} that gets you there.
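
The arithmetic behind the two anchor points can be written out directly. This is a small sketch using only the figures quoted above (30 fps, 22 GB at 85 frames, 70 GB at 163 frames, 32 GB on the card). The function names are illustrative, and the two anchors come from different configurations, so the numbers are a comparison against the card's capacity rather than a memory curve.

```python
FPS = 30  # export frame rate used in this recipe

# Anchor points quoted above: num_frames -> documented VRAM requirement in GB
ANCHORS_GB = {85: 22, 163: 70}
CARD_VRAM_GB = 32  # RTX 5090


def clip_seconds(num_frames, fps=FPS):
    """Clip duration in seconds for a given frame count."""
    return num_frames / fps


def headroom_gb(num_frames, card_gb=CARD_VRAM_GB):
    """Card VRAM minus the documented requirement; negative means it does not fit."""
    return card_gb - ANCHORS_GB[num_frames]


for frames in ANCHORS_GB:
    print(f"{frames} frames = {clip_seconds(frames):.2f} s, headroom {headroom_gb(frames):+d} GB")
# 85 frames = 2.83 s, headroom +10 GB
# 163 frames = 5.43 s, headroom -38 GB
```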

ℹ️ No first-party RTX 5090 measurement yet. The backend /check/mochi-1/rtx-5090 endpoint reports verdict: unknown because no community benchmark has been submitted for this model and card pair. The fit claim rests on the diffusers docs' published 22 GB figure for the bf16 + offload + tiling path, which should apply to any card with at least 22 GB of VRAM. Please report your run via /contribute once you've generated a clip: a single community submission is enough to turn a verdict: unknown row into a measured verdict: runs row.

Requirements

| Component | Minimum | Tested |
|---|---|---|
| GPU | 22 GB VRAM (diffusers bf16 variant) | RTX 5090 (32 GB) |
| RAM | 32 GB (T5-XXL + cpu-offloaded transformer share host memory) | |
| Storage | ~50 GB (model weights + T5-XXL encoder) | |
| Software | Python 3.10+, PyTorch 2.7+ (cu128), FFmpeg | |

Installation

1. Install diffusers from main

Mochi 1's MochiPipeline lives in the diffusers main branch:

pip install git+https://github.com/huggingface/diffusers.git
pip install transformers accelerate sentencepiece imageio imageio-ffmpeg

2. Authenticate with Hugging Face

The genmo/mochi-1-preview weights are open under Apache 2.0 but the download is gated through HF. Log in once:

huggingface-cli login

3. Install PyTorch with Blackwell (sm_120) kernels

The RTX 5090 is Blackwell (sm_120). Standard PyTorch wheels built against CUDA 12.8 include sm_120 kernels — install (or upgrade to) a cu128 build:

pip install --upgrade --index-url https://download.pytorch.org/whl/cu128 torch
python -c "import torch; print(torch.__version__, torch.cuda.is_available(), torch.cuda.get_device_name(0))"
# Expected: a 2.7+ version string ending in +cu128, True, NVIDIA GeForce RTX 5090

If torch.cuda.is_available() is False, get_device_name errors out, or PyTorch warns that sm_120 is not compatible with the current installation, your wheel was most likely built against an older CUDA toolkit without sm_120. Reinstall from the cu128 index above.
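
A more direct check is to compare the GPU's compute capability with the architectures the installed wheel was compiled for. The sketch below assumes `torch.cuda.get_arch_list()` reports entries in the `sm_XY` form. The helper names are illustrative, and the torch imports are kept inside a function so the pure comparison can be used without PyTorch installed.

```python
def has_kernels_for(capability, arch_list):
    """True if arch_list (e.g. ['sm_90', 'sm_120']) has an entry for capability (major, minor)."""
    major, minor = capability
    return f"sm_{major}{minor}" in arch_list


def check_installed_torch(device_index=0):
    """Return (capability, arch_list, match) for the installed PyTorch build."""
    # Imported lazily so has_kernels_for stays usable on machines without PyTorch.
    import torch

    capability = torch.cuda.get_device_capability(device_index)  # (12, 0) is expected on an RTX 5090
    arch_list = torch.cuda.get_arch_list()  # architectures this wheel was compiled for
    return capability, arch_list, has_kernels_for(capability, arch_list)
```

Calling `check_installed_torch()` on the 5090 should return a tuple whose last element is True when the wheel includes sm_120 kernels. A False result points to the wrong wheel rather than a driver problem.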

Running

Save as run_mochi.py. This is the canonical bf16 recipe from the official diffusers Mochi docs, which state that it "requires 22GB VRAM to run" with enable_model_cpu_offload() + enable_vae_tiling() active at num_frames=85:

import torch
from diffusers import MochiPipeline
from diffusers.utils import export_to_video

pipe = MochiPipeline.from_pretrained(
    "genmo/mochi-1-preview",
    variant="bf16",
    torch_dtype=torch.bfloat16,
)

# Memory savings — both are required to stay at the documented 22 GB envelope
pipe.enable_model_cpu_offload()
pipe.enable_vae_tiling()

prompt = (
    "Close-up of a chameleon's eye, with its scaly skin changing color. "
    "Ultra high resolution 4k."
)
frames = pipe(prompt, num_frames=85).frames[0]
export_to_video(frames, "mochi.mp4", fps=30)

Run it:

python run_mochi.py

The first call downloads ~50 GB of weights into ~/.cache/huggingface/hub/. Subsequent calls reuse the cache. Default output is 848×480 at 30 fps; the script writes mochi.mp4 in the working directory.

Results

  • Speed: Omitted. No citable first-party RTX 5090 timing has been published for Mochi 1 yet: the Genmo HF card, the diffusers docs, and the Tier-A community walkthroughs we found all lack an RTX 5090 entry. Please report your wallclock time via /contribute once you've run a clip; it will populate /check/mochi-1/rtx-5090.
  • VRAM usage (diffusers path, this recipe): ~22 GB peak per the official diffusers docs for the bf16 + enable_model_cpu_offload() + enable_vae_tiling() configuration at num_frames=85, which is exactly what the Running section uses. That leaves ~10 GB of headroom within the 5090's 32 GB. The on-disk bf16 weight file in the Comfy-Org repackage is 20.1 GB, consistent with the published 22 GB runtime peak (weights plus a small overhead under offload + tiling). For comparison, a community reply by PsiPi on HF discussion #8 reports "takes about 17-18 Gb IIRC" on an RTX 3090, but that figure refers to the kijai/ComfyUI-MochiWrapper runtime with FP8/GGUF quants, not the diffusers path. If you want that lower envelope (and the FP8 speed-up that comes with it on the 5090), switch to the kijai wrapper, which is a separate install (see Troubleshooting below). The min_vram_gb: 22 frontmatter matches the diffusers path installed here; the kijai-wrapper figure is given for comparison only.
  • Quality notes: 480p output only — Mochi 1 is explicitly trained at 848×480 and is "optimized for photorealistic styles" per the Genmo model card; animated / stylized content underperforms. Minor warping can occur under extreme motion.
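
To collect the wallclock and peak-VRAM numbers requested above, a minimal measurement sketch follows. It assumes PyTorch's allocator statistics are an acceptable proxy for peak VRAM; they count only memory PyTorch itself allocated, so `nvidia-smi` will usually read somewhat higher. The helper names `timed` and `peak_vram_gb` are illustrative.

```python
import time


def timed(fn, *args, **kwargs):
    """Run fn and return (result, wallclock_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


def peak_vram_gb():
    """Peak memory held by PyTorch's CUDA allocator since the last reset, in GB (None without CUDA)."""
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    return torch.cuda.max_memory_allocated() / 1e9
```

In run_mochi.py this would mean calling `torch.cuda.reset_peak_memory_stats()` before generation, wrapping the pipeline call as `timed(pipe, prompt, num_frames=85)`, and printing `peak_vram_gb()` afterwards.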

For the full benchmark data, see /check/mochi-1/rtx-5090.

Troubleshooting

Spending the 5090's headroom — what to actually do with 10 GB of slack

The 22 GB diffusers envelope leaves ~10 GB free on a 32 GB card. The honest next steps:

  • Colocate the T5-XXL encoder on-card without offload. Skipping enable_model_cpu_offload() and keeping the full pipeline GPU-resident pushes peak past 24 GB but, in our reading, plausibly stays inside 32 GB at num_frames=85. This is undocumented, so measure with nvidia-smi --query-gpu=memory.used --format=csv -l 1 before relying on it.
  • Try larger frame counts in small steps. The diffusers docs publish 22 GB at 85 frames and 70 GB at 163 frames (full precision) — the curve between is undocumented. If you experiment, bump num_frames by 12-frame increments (97, 109, 121) and watch nvidia-smi; bail at the first OOM. Report your highest stable frame count via /contribute so this recipe can pin a measured cap.
  • Reach for the kijai wrapper for the lower envelope + Blackwell FP8 speed-up. The kijai/ComfyUI-MochiWrapper reports fitting under 20 GB with FP8-scaled weights. Unlike Ampere cards such as the RTX 3090, the 5090's Blackwell sm_120 tensor cores support FP8 natively (as Ada's do), so FP8 is a real compute speed-up on this card, not just a memory-saving escape hatch. Worth trying if the diffusers path's wallclock annoys you.
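
For the experiments above, it helps to log `nvidia-smi` samples to a file and read off the peak afterwards instead of watching the terminal. The sketch below assumes the log was produced by redirecting the `nvidia-smi --query-gpu=memory.used --format=csv -l 1` command to a file, giving a header line followed by one `<number> MiB` line per sample. The sample values are synthetic placeholders, not measurements.

```python
def peak_mib(csv_text):
    """Return the largest memory.used sample (MiB) in nvidia-smi CSV output, or None if there are none."""
    samples = []
    for line in csv_text.splitlines():
        value = line.strip().split(" ")[0]
        if value.isdigit():  # skips the 'memory.used [MiB]' header and blank lines
            samples.append(int(value))
    return max(samples) if samples else None


# Synthetic example log (placeholder values only)
sample_log = """memory.used [MiB]
1000 MiB
3000 MiB
2000 MiB
"""
print(peak_mib(sample_log))  # 3000
```

Recording one log per configuration or frame count makes it easy to report the highest stable setting alongside its measured peak.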

Out-of-memory partway through generation

The 22 GB envelope assumes both enable_model_cpu_offload() and enable_vae_tiling() are active. Skipping either pushes peak VRAM past 24 GB — enable_vae_tiling() in particular is what keeps the VAE decode step from spiking. If you still OOM at 85 frames, drop to num_frames=49 or num_frames=37 (output is shorter; per-frame quality is unchanged).
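
The fallback described above can be automated as a simple ladder over frame counts. This is a sketch under stated assumptions: `generate` is any callable you supply that takes a frame count, and `oom_error` is the exception type that signals out-of-memory in your runtime. Neither name comes from diffusers.

```python
FALLBACK_FRAMES = (85, 49, 37)  # frame counts named in the paragraph above


def generate_with_fallback(generate, frame_counts=FALLBACK_FRAMES, oom_error=MemoryError):
    """Call generate(num_frames) for each count in order; return (num_frames, result) for the first success."""
    last_error = None
    for num_frames in frame_counts:
        try:
            return num_frames, generate(num_frames)
        except oom_error as err:
            last_error = err  # too large for this card; try the next, shorter clip
    raise RuntimeError(f"all frame counts {frame_counts} ran out of memory") from last_error
```

With the pipeline from the Running section, one plausible wiring is `generate=lambda n: pipe(prompt, num_frames=n).frames[0]` with `oom_error=torch.cuda.OutOfMemoryError`. Calling `torch.cuda.empty_cache()` between attempts may help release memory left over from the failed run.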

Want lower VRAM still — bitsandbytes 8-bit

The diffusers docs include a bitsandbytes 8-bit quantization recipe that drops both the transformer and the T5-XXL text encoder to 8-bit. On a 32 GB 5090 the bf16 + offload path is already comfortable; 8-bit becomes interesting only if you want to colocate Mochi with another large model on the same card.
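
As a rough illustration of that 8-bit path, the sketch below follows the general bitsandbytes pattern used with diffusers and transformers. It assumes `bitsandbytes` is installed alongside both libraries; the specific argument values (fp16 compute dtype, `device_map="balanced"`) are assumptions to check against the current docs rather than a verified copy of the official recipe. The function name `load_mochi_8bit` is illustrative.

```python
def load_mochi_8bit(model_id="genmo/mochi-1-preview"):
    """Build a MochiPipeline with an 8-bit transformer and T5 text encoder (needs bitsandbytes and a CUDA GPU)."""
    # Imported lazily: these packages are heavy and need the GPU stack installed.
    import torch
    from diffusers import BitsAndBytesConfig as DiffusersBnbConfig
    from diffusers import MochiPipeline, MochiTransformer3DModel
    from transformers import BitsAndBytesConfig as TransformersBnbConfig
    from transformers import T5EncoderModel

    # Quantize the T5-XXL text encoder through the transformers integration.
    text_encoder = T5EncoderModel.from_pretrained(
        model_id,
        subfolder="text_encoder",
        quantization_config=TransformersBnbConfig(load_in_8bit=True),
        torch_dtype=torch.float16,
    )
    # Quantize the diffusion transformer through the diffusers integration.
    transformer = MochiTransformer3DModel.from_pretrained(
        model_id,
        subfolder="transformer",
        quantization_config=DiffusersBnbConfig(load_in_8bit=True),
        torch_dtype=torch.float16,
    )
    return MochiPipeline.from_pretrained(
        model_id,
        text_encoder=text_encoder,
        transformer=transformer,
        torch_dtype=torch.float16,
        device_map="balanced",  # assumption: let accelerate place the components
    )
```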

"163 frames" attempt OOMs even with tiling

Decoding 163 frames at full precision needs ~70 GB per the official diffusers docs. That is H100 territory and cannot fit on a single 32 GB 5090. The same docs show a multi-GPU split with device_map="auto" + max_memory={0:"24GB",1:"24GB"}, which you can adapt to {0:"32GB",1:"32GB"} for a dual-5090 rig. On a single card, stay at 85 frames.
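
The max_memory mapping is just a dictionary from GPU index to a budget string, so it can be generated for whatever rig you have. A small sketch with an illustrative helper name:

```python
def max_memory_map(num_gpus, gb_per_gpu):
    """Build a max_memory mapping: GPU index -> budget string such as '32GB'."""
    return {index: f"{gb_per_gpu}GB" for index in range(num_gpus)}


print(max_memory_map(2, 24))  # {0: '24GB', 1: '24GB'}  (the split quoted from the docs)
print(max_memory_map(2, 32))  # {0: '32GB', 1: '32GB'}  (the dual-5090 adaptation)
```

The resulting dictionary would be passed as `max_memory=` next to `device_map="auto"`. Budgeting the full 32 GB leaves no margin for the CUDA context and other processes, so a slightly lower per-card value may be safer; that margin is an assumption to tune, not a documented figure.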

ComfyUI path instead of diffusers

If you'd rather stay in ComfyUI, native Mochi support landed upstream in late 2024. The ComfyUI blog post on Mochi consumer-GPU support states that "Mochi can now fit on consumer GPUs like a 4090. The Mochi node in ComfyUI supports multiple attention backends, letting it fit in <24GB of VRAM." Both ComfyUI native and the kijai wrapper target the same genmo/mochi-1-preview weights. The 5090 inherits ComfyUI's Mochi support without any extra plumbing.

FlashAttention-2 on Blackwell sm_120

FlashAttention-2 sm_120 wheel coverage is tracked at Dao-AILab/flash-attention#2168. If the runtime you use (kijai wrapper, ComfyUI Mochi node) lets you pick an attention backend and FA2 errors out, fall back to PyTorch's built-in scaled_dot_product_attention (SDPA) — it works out of the box on cu128 wheels with no extra build step. The diffusers path this recipe installs uses SDPA by default and does not require FA2.

Video has audio? No — Mochi 1 is silent

Mochi 1 generates video frames only; there's no audio track. Add audio in post with FFmpeg or a separate TTS / music model.
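
Adding a track in post can be scripted around FFmpeg. The sketch below builds a standard mux command that copies the video stream unchanged and encodes the audio as AAC. The file names in the usage note are hypothetical, and the helper names are illustrative.

```python
import subprocess


def mux_audio_command(video_path, audio_path, output_path):
    """Build an ffmpeg argv that copies the video stream and adds the audio as AAC."""
    return [
        "ffmpeg", "-y",
        "-i", video_path,
        "-i", audio_path,
        "-map", "0:v:0",   # video from the first input
        "-map", "1:a:0",   # audio from the second input
        "-c:v", "copy",    # keep the generated frames as encoded, no re-encode
        "-c:a", "aac",
        "-shortest",       # stop at the shorter of the two streams
        output_path,
    ]


def mux_audio(video_path, audio_path, output_path):
    """Run the mux; raises CalledProcessError if ffmpeg exits with an error."""
    subprocess.run(mux_audio_command(video_path, audio_path, output_path), check=True)
```

For example, `mux_audio("mochi.mp4", "music.wav", "mochi_with_audio.mp4")` would pair the clip from this recipe with a hypothetical music.wav. Because an 85-frame clip lasts under three seconds, `-shortest` trims a longer audio file to the clip length.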