What You'll Build
A local text-to-video pipeline that turns a prompt into a 49-frame 848×480 clip at 30 fps using Genmo's 10B-parameter Mochi 1 model on a single RTX 4090. The recipe uses the bf16 variant with CPU offload + tiled VAE decoding so the runtime peak stays at ~20GB — comfortably inside the 4090's 24GB envelope, with ~4GB of headroom.
Hardware data: RTX 4090 (24GB VRAM) · ~5 min per 49-frame clip · 20GB peak VRAM · See benchmark data
ℹ️ About the 49-frame default. Mochi 1's diffusers pipeline accepts arbitrary num_frames up to ~163, but VRAM and time scale with frame count. The 49-frame configuration documented here is the largest profile that comfortably fits a 24GB card with cpu-offload + VAE tiling. Pushing to the full 163 frames in one shot crosses into 70GB+ decode territory per the official diffusers docs, which requires multi-GPU splitting or accepting a quality hit.
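To make the frame-count trade-off concrete, the sketch below converts a frame count into clip duration at 30 fps and rounds a requested count down to the nearest value of the form 6k + 1. The 6x temporal compression factor is an assumption drawn from Genmo's description of the Mochi VAE, not something stated on this page, and the helper names are hypothetical.

```python
FPS = 30               # export frame rate used in this guide
TEMPORAL_FACTOR = 6    # assumption: the Mochi VAE compresses time 6x
MAX_FRAMES = 163       # upper bound quoted in the note above


def clip_seconds(num_frames: int, fps: int = FPS) -> float:
    """Duration of a clip in seconds at the given frame rate."""
    return num_frames / fps


def snap_num_frames(requested: int) -> int:
    """Round down to the nearest 6k + 1 frame count within [1, MAX_FRAMES]."""
    clamped = max(1, min(requested, MAX_FRAMES))
    return ((clamped - 1) // TEMPORAL_FACTOR) * TEMPORAL_FACTOR + 1


if __name__ == "__main__":
    for n in (25, 37, 49, 163):
        print(f"{n} frames -> {clip_seconds(n):.2f} s")
```

Under that assumption, the 49-frame default is roughly 1.6 seconds of video and the full 163 frames is roughly 5.4 seconds, which is why the longer clip costs so much more memory and time.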
Requirements
| Component | Minimum | Tested |
|---|---|---|
| GPU | 22GB VRAM (bf16 variant) | RTX 4090 (24GB) |
| RAM | 32GB (T5-XXL + cpu-offloaded transformer share host memory) | — |
| Storage | ~50GB (model weights + T5-XXL encoder) | — |
| Software | Python 3.10+, PyTorch 2.4+, FFmpeg | — |
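A quick preflight check can catch a missing prerequisite before the 50GB download starts. This is a minimal sketch using only the standard library; the thresholds mirror the table above, and the function name and the choice to check free space in the current directory are illustrative assumptions.

```python
import shutil
import sys

MIN_PYTHON = (3, 10)   # from the requirements table
MIN_FREE_GB = 50       # approximate storage figure from the table


def preflight(path: str = ".") -> dict:
    """Return named checks mapped to True/False for the filesystem holding `path`."""
    free_gb = shutil.disk_usage(path).free / 1e9
    return {
        "python_ok": sys.version_info >= MIN_PYTHON,
        "ffmpeg_on_path": shutil.which("ffmpeg") is not None,
        "disk_ok": free_gb >= MIN_FREE_GB,
    }


if __name__ == "__main__":
    for name, ok in preflight().items():
        print(f"{name}: {'ok' if ok else 'CHECK'}")
```

Point `path` at the drive that holds your Hugging Face cache if it is not the same drive as the working directory.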
Installation
1. Install diffusers from main
Mochi 1's MochiPipeline lives in the diffusers main branch. The model card recommends installing from git directly:
pip install git+https://github.com/huggingface/diffusers.git
pip install transformers accelerate sentencepiece imageio imageio-ffmpeg
2. Authenticate with Hugging Face
The genmo/mochi-1-preview weights are open under Apache 2.0 but the download is gated through HF. Log in once:
huggingface-cli login
3. Verify CUDA + PyTorch
The RTX 4090 is Ada Lovelace (sm_89) — standard PyTorch wheels include sm_89 kernels by default, no special wheel selection needed:
python -c "import torch; print(torch.__version__, torch.cuda.is_available(), torch.cuda.get_device_name(0))"
# Expected: 2.4+, True, NVIDIA GeForce RTX 4090
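For a slightly more thorough check, the sketch below also compares the card's compute capability and total VRAM against the figures used in this guide. The helper names are hypothetical; `torch.cuda.get_device_properties` is the only PyTorch call involved, and it is imported lazily so the comparison logic can be read (and run) without a GPU.

```python
MIN_VRAM_GB = 22               # minimum from the requirements table
EXPECTED_CAPABILITY = (8, 9)   # sm_89, Ada Lovelace


def gpu_report(name: str, capability: tuple, total_bytes: int) -> dict:
    """Summarize whether a GPU matches the profile this guide was written for."""
    total_gb = total_bytes / 1024**3
    return {
        "name": name,
        "is_sm_89": tuple(capability) == EXPECTED_CAPABILITY,
        "vram_gb": round(total_gb, 1),
        "enough_vram": total_gb >= MIN_VRAM_GB,
    }


def report_current_gpu(index: int = 0) -> dict:
    """Query the local GPU through PyTorch and build the same report."""
    import torch  # imported here so gpu_report stays torch-free

    props = torch.cuda.get_device_properties(index)
    return gpu_report(props.name, (props.major, props.minor), props.total_memory)
```

Calling `print(report_current_gpu())` on an RTX 4090 should show `is_sm_89` and `enough_vram` both set to True; the reported `vram_gb` is usually a little under 24 because the driver reserves some memory.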
Running
Save as run_mochi.py. This is the canonical bf16 recipe from the official diffusers Mochi docs, which report it as fitting in 22GB of VRAM, inside the 4090's 24GB:
import torch
from diffusers import MochiPipeline
from diffusers.utils import export_to_video

pipe = MochiPipeline.from_pretrained(
    "genmo/mochi-1-preview",
    variant="bf16",
    torch_dtype=torch.bfloat16,
)

# Memory savings: both are required to stay under 24GB on the 4090
pipe.enable_model_cpu_offload()
pipe.enable_vae_tiling()

prompt = (
    "Close-up of a chameleon's eye, with its scaly skin changing color. "
    "Ultra high resolution 4k."
)

# 49 frames at the pipeline's default 848×480 resolution
frames = pipe(prompt, num_frames=49).frames[0]
export_to_video(frames, "mochi.mp4", fps=30)
Run it:
python run_mochi.py
The first call downloads ~50GB of weights into ~/.cache/huggingface/hub/. Subsequent calls reuse the cache. Default output is 848×480 at 30 fps; the script writes mochi.mp4 in the working directory.
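To confirm how much of the download has landed in the cache, a small standard-library sketch like the one below can total the files on disk. It skips symlinks because the Hugging Face cache links snapshot files to shared blobs, and counting both would inflate the figure. The function name is illustrative.

```python
import os
from pathlib import Path


def dir_size_gb(root: Path) -> float:
    """Sum file sizes under root in GB, skipping symlinks to avoid double counting."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for fname in filenames:
            fpath = Path(dirpath) / fname
            if not fpath.is_symlink():
                total += fpath.stat().st_size
    return total / 1e9


if __name__ == "__main__":
    hub = Path.home() / ".cache" / "huggingface" / "hub"
    if hub.exists():
        print(f"{hub}: {dir_size_gb(hub):.1f} GB")
    else:
        print(f"{hub} does not exist yet")
```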
Results
- Speed: ~5 minutes per 49-frame clip on RTX 4090 via the ComfyUI-optimized path, measured by InsiderLLM's local AI video guide. Diffusers with cpu-offload runs in a comparable range on the same hardware; pushing to 163 frames stretches into 20–30 minutes per the same source.
- VRAM usage: ~20GB peak at the default 49-frame, 848×480, 64-step configuration — leaves ~4GB headroom on a 24GB card. The official diffusers docs report 22GB for the same bf16 path with cpu-offload + VAE tiling. See /check/mochi-1/rtx-4090.
- Quality notes: 480p output only — Mochi 1 is explicitly trained at 848×480 and is "optimized for photorealistic styles" per the Genmo model card; animated / stylized content underperforms. Minor warping can occur under extreme motion.
For the full benchmark data, see /check/mochi-1/rtx-4090.
Troubleshooting
Out-of-memory partway through generation
The 20GB peak assumes both enable_model_cpu_offload() and enable_vae_tiling() are active. Skipping either pushes peak VRAM well past 24GB; enable_vae_tiling() in particular is what keeps the VAE decode step from spiking. If you still run out of memory at 49 frames, drop to num_frames=37 or num_frames=25. The clip is shorter, but per-frame quality is unchanged.
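The frame-count fallback can be automated. The sketch below is one possible approach, not part of the diffusers API: it tries each frame count in order and moves to the next smaller one when the supplied callable raises an out-of-memory error. The function name, the default ladder, and the `oom_errors` parameter are all illustrative.

```python
from typing import Callable, Sequence, Tuple, Type, TypeVar

T = TypeVar("T")


def generate_with_fallback(
    generate: Callable[[int], T],
    frame_ladder: Sequence[int] = (49, 37, 25),
    oom_errors: Tuple[Type[BaseException], ...] = (MemoryError,),
) -> Tuple[int, T]:
    """Return (num_frames, result) for the first frame count that succeeds."""
    last_error = None
    for num_frames in frame_ladder:
        try:
            return num_frames, generate(num_frames)
        except oom_errors as err:
            last_error = err  # out of memory: try the next, smaller count
    raise RuntimeError(
        f"out of memory at every frame count in {tuple(frame_ladder)}"
    ) from last_error
```

With the pipeline from run_mochi.py, you would pass something like `lambda n: pipe(prompt, num_frames=n).frames[0]` as `generate` and `oom_errors=(torch.cuda.OutOfMemoryError,)`. Calling `torch.cuda.empty_cache()` between attempts may help release memory left over from the failed run.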
"163 frames" attempt OOMs even with tiling
Decoding 163 frames at full precision needs ~70GB per the official diffusers docs — that's H100 territory. On a 4090, stay at 49 frames or split across two 24GB cards using the device_map="auto" + max_memory pattern shown in the same docs.
Want lower VRAM still — bitsandbytes 8-bit
The diffusers docs include a bitsandbytes 8-bit quantization recipe that drops both the transformer and the T5-XXL text encoder to 8-bit. Useful on cards smaller than 22GB; on a 24GB 4090 the headroom is comfortable enough that you usually don't need it.
ComfyUI path instead of diffusers
If you'd rather stay in ComfyUI, the community wrapper kijai/ComfyUI-MochiWrapper reports fitting "under 20GB" depending on frame count, with sage / flash / sdpa attention selectable. ComfyUI native also gained Mochi support upstream in late 2024 — both paths target the same genmo/mochi-1-preview weights.
Video has audio? No — Mochi 1 is silent
Mochi 1 generates video frames only; there's no audio track. Add audio in post with FFmpeg or a separate TTS / music model.
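As a sketch of the FFmpeg route, the helper below builds a command that copies the video stream unchanged and encodes a separate audio file as AAC. The file names soundtrack.wav and mochi_with_audio.mp4 are placeholders, and the function name is hypothetical; the flags themselves are standard FFmpeg options.

```python
def mux_command(video: str, audio: str, output: str) -> list:
    """Build an ffmpeg argv that adds an audio track without re-encoding video."""
    return [
        "ffmpeg", "-y",      # -y: overwrite the output file if it exists
        "-i", video,
        "-i", audio,
        "-c:v", "copy",      # keep the generated frames untouched
        "-c:a", "aac",
        "-shortest",         # stop at the shorter of the two inputs
        output,
    ]


if __name__ == "__main__":
    cmd = mux_command("mochi.mp4", "soundtrack.wav", "mochi_with_audio.mp4")
    print(" ".join(cmd))
```

Pass the returned list to `subprocess.run(cmd, check=True)` to execute it. Because a 49-frame clip is under two seconds long, `-shortest` will trim most audio tracks to the length of the video.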