Mochi 1 on RTX 4090: 49-frame 480p Text-to-Video with Diffusers

What You'll Build

A local text-to-video pipeline that turns a prompt into a 49-frame 848×480 clip at 30 fps using Genmo's 10B-parameter Mochi 1 model on a single RTX 4090. The recipe uses the bf16 variant with CPU offload + tiled VAE decoding so the runtime peak stays at ~20GB — comfortably inside the 4090's 24GB envelope, with ~4GB of headroom.

Hardware data: RTX 4090 (24GB VRAM) · ~5 min per 49-frame clip · 20GB peak VRAM · See benchmark data

ℹ️ About the 49-frame default. Mochi 1's diffusers pipeline accepts arbitrary num_frames up to ~163, but VRAM and generation time both grow with frame count. The 49-frame configuration documented here is the largest profile that comfortably fits a 24GB card with cpu-offload + VAE tiling. Generating the full 163 frames in one pass pushes the decode step to 70GB+ of VRAM per the official diffusers docs, which on 24GB cards means splitting the model across multiple GPUs or accepting a quality hit.
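
To make the frame-count trade-off concrete, here is a minimal sketch of the arithmetic. The duration formula is simply frames divided by fps. The snapping helper rests on an assumption: every frame count named in this guide (25, 37, 49, 163) has the form 6k + 1, which would be consistent with a VAE that compresses time by a factor of 6. That factor is not stated in this guide, so treat TEMPORAL_FACTOR and both function names as hypothetical and verify against your diffusers version.

```python
# Frame-count arithmetic for short Mochi clips.
# TEMPORAL_FACTOR = 6 is an ASSUMPTION inferred from the frame counts
# used in this guide (25, 37, 49, 163 are all of the form 6k + 1).
TEMPORAL_FACTOR = 6


def clip_seconds(num_frames: int, fps: int = 30) -> float:
    """Playback length of a clip in seconds."""
    return num_frames / fps


def snap_num_frames(requested: int, factor: int = TEMPORAL_FACTOR) -> int:
    """Round down to the nearest frame count of the form factor * k + 1."""
    if requested < 1:
        raise ValueError("requested must be at least 1")
    return ((requested - 1) // factor) * factor + 1


if __name__ == "__main__":
    for n in (25, 37, 49, 163):
        print(n, "frames ->", round(clip_seconds(n), 2), "s at 30 fps")
```

At 30 fps, 49 frames is roughly 1.6 seconds of video and 163 frames roughly 5.4 seconds, which is why the longer setting costs so much more memory and time.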

Requirements

Component	Minimum	Tested
GPU	22GB VRAM (bf16 variant)	RTX 4090 (24GB)
RAM	32GB (T5-XXL + cpu-offloaded transformer share host memory)	—
Storage	~50GB (model weights + T5-XXL encoder)	—
Software	Python 3.10+, PyTorch 2.4+, FFmpeg	—
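
Before installing anything, it can help to confirm the two requirements that are easy to check from Python: the interpreter version and free disk space. The sketch below uses only the standard library. The function name and the 50GB default are taken from the table above for illustration, and the check does not cover RAM or VRAM.

```python
import shutil
import sys


def preflight(path: str = ".", min_free_gb: float = 50.0) -> dict:
    """Check Python version and free disk space against the requirements table."""
    free_gb = shutil.disk_usage(path).free / 1e9
    return {
        "python_ok": sys.version_info >= (3, 10),
        "free_gb": round(free_gb, 1),
        "disk_ok": free_gb >= min_free_gb,
    }


if __name__ == "__main__":
    # Point `path` at the drive that will hold the Hugging Face cache.
    print(preflight("."))
```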

Installation

1. Install diffusers from main

Mochi 1's MochiPipeline lives in the diffusers main branch. The model card recommends installing from git directly:

pip install git+https://github.com/huggingface/diffusers.git
pip install transformers accelerate sentencepiece imageio imageio-ffmpeg

2. Authenticate with Hugging Face

The genmo/mochi-1-preview weights are open under Apache 2.0 but the download is gated through HF. Log in once:

huggingface-cli login

3. Verify CUDA + PyTorch

The RTX 4090 is Ada Lovelace (compute capability 8.9, sm_89). Standard PyTorch CUDA wheels support it out of the box, so no special wheel selection is needed:

python -c "import torch; print(torch.__version__, torch.cuda.is_available(), torch.cuda.get_device_name(0))"
# Expected: a 2.4+ version string, True, NVIDIA GeForce RTX 4090
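
The one-liner above confirms that PyTorch sees the card but not how much memory it has. A slightly longer sketch can compare total VRAM with the 22GB minimum from the Requirements table. The helper names are hypothetical; torch is imported lazily so the pure arithmetic part can be reused without a GPU.

```python
def vram_report(total_bytes: int, needed_gb: float = 22.0) -> dict:
    """Compare a card's total VRAM with the minimum from Requirements."""
    total_gb = total_bytes / 1024**3
    return {"total_gb": round(total_gb, 1), "enough": total_gb >= needed_gb}


def check_gpu(needed_gb: float = 22.0) -> dict:
    """Query the first CUDA device; torch is imported lazily."""
    import torch

    if not torch.cuda.is_available():
        raise RuntimeError("CUDA is not available to PyTorch")
    props = torch.cuda.get_device_properties(0)
    report = vram_report(props.total_memory, needed_gb)
    report["name"] = props.name
    # Compute capability as a (major, minor) tuple, e.g. (8, 9) for sm_89.
    report["capability"] = torch.cuda.get_device_capability(0)
    return report


if __name__ == "__main__":
    print(check_gpu())
```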

Running

Save the following as run_mochi.py. It follows the bf16 recipe from the official diffusers Mochi docs, which list that path as needing 22GB of VRAM, so a 24GB 4090 has headroom:

import torch
from diffusers import MochiPipeline
from diffusers.utils import export_to_video

pipe = MochiPipeline.from_pretrained(
    "genmo/mochi-1-preview",
    variant="bf16",
    torch_dtype=torch.bfloat16,
)

# Memory savings — both are required to stay under 24GB on the 4090
pipe.enable_model_cpu_offload()
pipe.enable_vae_tiling()

prompt = (
    "Close-up of a chameleon's eye, with its scaly skin changing color. "
    "Ultra high resolution 4k."
)
frames = pipe(prompt, num_frames=49).frames[0]
export_to_video(frames, "mochi.mp4", fps=30)

Run it:

python run_mochi.py

The first call downloads ~50GB of weights into ~/.cache/huggingface/hub/. Subsequent calls reuse the cache. Default output is 848×480 at 30 fps; the script writes mochi.mp4 in the working directory.
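
If the drive holding your home directory is short on space, the cache can be moved with the HF_HOME or HF_HUB_CACHE environment variables. The sketch below shows where downloads would land for a given environment. The function name is hypothetical and the lookup order is a simplification of what huggingface_hub does (for example, it ignores XDG_CACHE_HOME), so treat it as an approximation.

```python
import os
from pathlib import Path


def hub_cache_dir(env=None) -> Path:
    """Resolve where Hugging Face Hub downloads are cached (simplified).

    HF_HUB_CACHE wins, then HF_HOME/hub, then ~/.cache/huggingface/hub.
    """
    env = os.environ if env is None else env
    if env.get("HF_HUB_CACHE"):
        return Path(env["HF_HUB_CACHE"])
    if env.get("HF_HOME"):
        return Path(env["HF_HOME"]) / "hub"
    return Path.home() / ".cache" / "huggingface" / "hub"


if __name__ == "__main__":
    print(hub_cache_dir())
```

Set the variable before launching Python so the download goes to the larger drive from the start.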

Results

Speed: ~5 minutes per 49-frame clip on RTX 4090 via the ComfyUI-optimized path, measured by InsiderLLM's local AI video guide. Diffusers with cpu-offload runs in a comparable range on the same hardware; pushing to 163 frames stretches into 20–30 minutes per the same source.
VRAM usage: ~20GB peak at this guide's 49-frame, 848×480, 64-step configuration, which leaves ~4GB of headroom on a 24GB card. The official diffusers docs list 22GB for the same bf16 path with cpu-offload + VAE tiling.
Quality notes: 480p output only — Mochi 1 is explicitly trained at 848×480 and is "optimized for photorealistic styles" per the Genmo model card; animated / stylized content underperforms. Minor warping can occur under extreme motion.

For the full benchmark data, see /check/mochi-1/rtx-4090.
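
To compare your own run against these figures, a small wrapper can record wall time and PyTorch's peak allocation. This is a sketch, not the method used for the numbers above: the function name is hypothetical, and torch.cuda.max_memory_allocated() counts only memory held by PyTorch tensors, so it will typically read lower than what nvidia-smi shows.

```python
import time


def profile_call(fn):
    """Run fn() and return (result, seconds, peak_vram_gib or None)."""
    try:
        import torch
        use_cuda = torch.cuda.is_available()
    except ImportError:
        torch, use_cuda = None, False

    if use_cuda:
        torch.cuda.reset_peak_memory_stats()
    start = time.perf_counter()
    result = fn()
    if use_cuda:
        # Wait for queued GPU work so the timing covers the whole call.
        torch.cuda.synchronize()
    seconds = time.perf_counter() - start
    peak = torch.cuda.max_memory_allocated() / 1024**3 if use_cuda else None
    return result, seconds, peak
```

Inside run_mochi.py this could wrap the generation call, for example profile_call(lambda: pipe(prompt, num_frames=49).frames[0]), and print the returned seconds and peak.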

Troubleshooting

Out-of-memory partway through generation

The 20GB peak assumes both enable_model_cpu_offload() and enable_vae_tiling() are active. Skipping either pushes peak VRAM well past 24GB — enable_vae_tiling() in particular is what keeps the VAE decode step from spiking. If you still OOM at 49 frames, drop to num_frames=37 or num_frames=25 (output is shorter but quality is unchanged).
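
The step-down advice above can be automated. The following is a minimal sketch of a retry loop that walks through 49, 37, and 25 frames until one fits. The helper name and its parameters are hypothetical; the out-of-memory exception type is passed in so the logic stays independent of PyTorch.

```python
def generate_with_fallback(run, frame_counts=(49, 37, 25),
                           oom_error=MemoryError, on_retry=None):
    """Try run(num_frames) for each count until one succeeds.

    run: callable taking num_frames and returning frames.
    oom_error: exception type that signals out-of-memory.
    on_retry: optional callable invoked after each failure.
    Returns (num_frames_used, frames).
    """
    last_error = None
    for n in frame_counts:
        try:
            return n, run(n)
        except oom_error as err:
            last_error = err
            if on_retry is not None:
                on_retry()
    raise RuntimeError(f"all frame counts failed: {frame_counts}") from last_error


# With the pipeline from run_mochi.py (names as in that script):
#   n, frames = generate_with_fallback(
#       lambda k: pipe(prompt, num_frames=k).frames[0],
#       oom_error=torch.cuda.OutOfMemoryError,
#       on_retry=torch.cuda.empty_cache,
#   )
```

Calling torch.cuda.empty_cache() between attempts releases cached blocks so the next, smaller attempt starts from a cleaner state.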

"163 frames" attempt OOMs even with tiling

Decoding 163 frames at full precision needs ~70GB per the official diffusers docs — that's H100 territory. On a 4090, stay at 49 frames or split across two 24GB cards using the device_map="auto" + max_memory pattern shown in the same docs.
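
As a rough illustration of the device_map="auto" + max_memory pattern mentioned above, the sketch below loads the text encoder and transformer with per-GPU memory caps. The helper names and the "16GB" cap are assumptions chosen for illustration; how the loaded components are then wired into generation and decoding is left to the diffusers docs. This only shows the loading pattern.

```python
def build_max_memory(num_gpus: int = 2, per_gpu: str = "16GB") -> dict:
    """Per-device memory caps in the {device_index: "NGB"} form."""
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")
    return {i: per_gpu for i in range(num_gpus)}


def load_sharded_components(model_id: str = "genmo/mochi-1-preview"):
    """Load the text encoder and transformer spread across visible GPUs."""
    # Heavy imports are kept local so the helper above works without them.
    import torch
    from diffusers import MochiTransformer3DModel
    from transformers import T5EncoderModel

    caps = build_max_memory()  # illustrative cap, tune for your cards
    text_encoder = T5EncoderModel.from_pretrained(
        model_id, subfolder="text_encoder", device_map="auto",
        max_memory=caps, torch_dtype=torch.bfloat16,
    )
    transformer = MochiTransformer3DModel.from_pretrained(
        model_id, subfolder="transformer", device_map="auto",
        max_memory=caps, torch_dtype=torch.bfloat16,
    )
    return text_encoder, transformer
```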

Want lower VRAM still — bitsandbytes 8-bit

The diffusers docs include a bitsandbytes 8-bit quantization recipe that drops both the transformer and the T5-XXL text encoder to 8-bit. Useful on cards smaller than 22GB; on a 24GB 4090 the headroom is comfortable enough that you usually don't need it.
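
For orientation, an 8-bit load along those lines might look like the sketch below. It assumes bitsandbytes is installed (pip install bitsandbytes) and a diffusers version that exposes its own BitsAndBytesConfig; the function name is hypothetical, and the exact arguments should be checked against the diffusers docs rather than taken from here.

```python
def load_8bit_pipeline(model_id: str = "genmo/mochi-1-preview"):
    """Build a MochiPipeline with an 8-bit transformer and T5 text encoder."""
    # Imports are local: they need diffusers, transformers, and bitsandbytes.
    import torch
    from diffusers import BitsAndBytesConfig as DiffusersBnbConfig
    from diffusers import MochiPipeline, MochiTransformer3DModel
    from transformers import BitsAndBytesConfig as TransformersBnbConfig
    from transformers import T5EncoderModel

    text_encoder = T5EncoderModel.from_pretrained(
        model_id, subfolder="text_encoder",
        quantization_config=TransformersBnbConfig(load_in_8bit=True),
        torch_dtype=torch.float16,
    )
    transformer = MochiTransformer3DModel.from_pretrained(
        model_id, subfolder="transformer",
        quantization_config=DiffusersBnbConfig(load_in_8bit=True),
        torch_dtype=torch.float16,
    )
    return MochiPipeline.from_pretrained(
        model_id, text_encoder=text_encoder, transformer=transformer,
        torch_dtype=torch.float16, device_map="balanced",
    )
```

The returned pipeline is called the same way as in run_mochi.py.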

ComfyUI path instead of diffusers

If you'd rather stay in ComfyUI, the community wrapper kijai/ComfyUI-MochiWrapper reports fitting "under 20GB" depending on frame count, with sage / flash / sdpa attention selectable. ComfyUI native also gained Mochi support upstream in late 2024 — both paths target the same genmo/mochi-1-preview weights.

Video has audio? No — Mochi 1 is silent

Mochi 1 generates video frames only; there's no audio track. Add audio in post with FFmpeg or a separate TTS / music model.
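
One way to add a soundtrack is to let FFmpeg copy the video stream untouched and encode the audio alongside it. The sketch below builds that command from Python. The function names and the file names in the usage line are hypothetical; the flags are standard FFmpeg options.

```python
import subprocess


def build_mux_command(video: str, audio: str, output: str) -> list:
    """FFmpeg arguments that copy the video stream and add an AAC audio track."""
    return [
        "ffmpeg", "-y",                      # overwrite output if it exists
        "-i", video,
        "-i", audio,
        "-map", "0:v:0", "-map", "1:a:0",    # video from input 0, audio from input 1
        "-c:v", "copy",                      # no re-encode of the video
        "-c:a", "aac",
        "-shortest",                         # stop at the shorter of the two streams
        output,
    ]


def mux(video: str, audio: str, output: str) -> None:
    """Run FFmpeg and raise if it exits with an error."""
    subprocess.run(build_mux_command(video, audio, output), check=True)
```

For example, mux("mochi.mp4", "soundtrack.wav", "mochi_with_audio.mp4") would write a new file and leave the original clip unchanged.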