Mochi 1 on RTX 3090: 49-frame 480p Text-to-Video with Diffusers

What You'll Build

A local text-to-video pipeline that turns a prompt into a 49-frame 848×480 clip at 30 fps using Genmo's 10B-parameter Mochi 1 model on a single RTX 3090. The recipe uses the bf16 variant with CPU offload and tiled VAE decoding, so the runtime peak stays at ~22 GB, which leaves about 2 GB of headroom inside the 3090's 24 GB envelope. The Ampere card runs the same diffusers path as the 4090 sibling recipe. Expect the DiT compute step to take at least 40-50% longer per InsiderLLM's 3090 scaling rule (the Results section budgets closer to 2×), while the VAE decode runs at near-parity.

Hardware data: RTX 3090 (24 GB VRAM) · ~22 GB peak VRAM (diffusers bf16 + offload + tiling) · ~10–12 min per 49-frame clip (extrapolated from RTX 4090; see Results) · See benchmark data

ℹ️ About the 49-frame default. The diffusers docs themselves use num_frames=85 for the bf16+offload+tiling example at the same ~22 GB envelope, so the pipeline can fit larger clips — but we pin 49 to match the InsiderLLM RTX 4090 benchmark anchor (~5 min per 49-frame clip), which is the closest published timing for cross-card scaling math. Pushing to the full 163 frames in one shot crosses into 70 GB+ decode territory per the official diffusers docs — that requires multi-GPU splitting or accepting a quality hit.
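To put the frame counts above in terms of clip length, here is a quick sketch of the arithmetic (frames divided by fps), assuming the 30 fps export rate this recipe uses. The helper name clip_seconds is illustrative and not part of diffusers.

```python
# Illustrative helper (not part of diffusers): playback length for a frame count.
FPS = 30  # export rate used by this recipe


def clip_seconds(num_frames: int, fps: int = FPS) -> float:
    """Return the playback duration in seconds."""
    return num_frames / fps


for n in (49, 85, 163):
    print(f"{n} frames -> {clip_seconds(n):.2f} s at {FPS} fps")
```

At 30 fps, 49 frames is about 1.6 s of video, 85 frames about 2.8 s, and 163 frames about 5.4 s.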

Requirements

Component	Minimum	Tested
GPU	22 GB VRAM (bf16 variant)	RTX 3090 (24 GB)
RAM	32 GB (T5-XXL + cpu-offloaded transformer share host memory)	—
Storage	~50 GB (model weights + T5-XXL encoder)	—
Software	Python 3.10+, PyTorch 2.4+, FFmpeg	—
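Before starting the download, it can help to confirm the storage row of the table. The sketch below uses only the standard library and assumes the default Hugging Face cache location on Linux (~/.cache/huggingface/hub); if you have relocated the cache, point cache_dir at your own path. The helper name free_gb is illustrative.

```python
import shutil
from pathlib import Path

REQUIRED_GB = 50  # approximate download size quoted in the table above


def free_gb(path: Path) -> float:
    """Free space in GB on the filesystem holding `path` (or its nearest existing parent)."""
    probe = path.expanduser()
    while not probe.exists():
        probe = probe.parent
    return shutil.disk_usage(probe).free / 1e9


cache_dir = Path("~/.cache/huggingface/hub")  # assumed default cache location
available = free_gb(cache_dir)
status = "OK" if available >= REQUIRED_GB else "not enough space"
print(f"{available:.0f} GB free, need ~{REQUIRED_GB} GB: {status}")
```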

Installation

1. Install diffusers from main

Mochi 1's MochiPipeline lives in the diffusers main branch. The model card recommends installing from git directly:

pip install git+https://github.com/huggingface/diffusers.git
pip install transformers accelerate sentencepiece imageio imageio-ffmpeg
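To confirm that both pip commands landed in the environment you plan to run from, a small standard-library check like the following can list what is installed. The package list mirrors the commands above plus torch; the helper name installed_versions is illustrative.

```python
from importlib.metadata import PackageNotFoundError, version

# Distributions installed by the two pip commands above, plus torch.
PACKAGES = ["torch", "diffusers", "transformers", "accelerate",
            "sentencepiece", "imageio", "imageio-ffmpeg"]


def installed_versions(names):
    """Map each distribution name to its installed version, or None if missing."""
    found = {}
    for name in names:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


for name, ver in installed_versions(PACKAGES).items():
    print(f"{name:16s} {ver or 'MISSING'}")
```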

2. Authenticate with Hugging Face

The genmo/mochi-1-preview weights are open under Apache 2.0 but the download is gated through HF. Log in once:

huggingface-cli login

3. Verify CUDA + PyTorch

The RTX 3090 is Ampere (sm_86) — standard PyTorch wheels include sm_86 kernels by default, no special wheel selection needed:

python -c "import torch; print(torch.__version__, torch.cuda.is_available(), torch.cuda.get_device_name(0))"
# Expected: 2.4+, True, NVIDIA GeForce RTX 3090
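If you want a slightly stricter preflight than the one-liner, the sketch below also checks total VRAM and compute capability. The thresholds are assumptions taken from this recipe (22 GB peak, Ampere or newer for native bf16), and the helper names gpu_ok and probe_gpu are illustrative.

```python
MIN_VRAM_GB = 22         # peak quoted for this recipe
MIN_CAPABILITY = (8, 0)  # Ampere or newer, assumed here as the bar for native bf16


def gpu_ok(total_vram_gb: float, capability: tuple) -> bool:
    """True if the card has enough memory and a new enough architecture."""
    return total_vram_gb >= MIN_VRAM_GB and tuple(capability) >= MIN_CAPABILITY


def probe_gpu(index: int = 0):
    """Read total VRAM (GB) and compute capability from PyTorch."""
    import torch  # imported lazily so gpu_ok() works without torch installed
    props = torch.cuda.get_device_properties(index)
    return props.total_memory / 1e9, torch.cuda.get_device_capability(index)


if __name__ == "__main__":
    try:
        import torch
        if torch.cuda.is_available():
            vram, cap = probe_gpu()
            verdict = "OK" if gpu_ok(vram, cap) else "below requirements"
            print(f"{vram:.1f} GB, sm_{cap[0]}{cap[1]}: {verdict}")
        else:
            print("CUDA not available")
    except ImportError:
        print("PyTorch is not installed")
```

On an RTX 3090 the capability tuple should read (8, 6), matching the sm_86 note above.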

Running

Save as run_mochi.py. This is the canonical bf16 recipe from the official diffusers Mochi docs, which list it as fitting in 22 GB of VRAM with CPU offload and VAE tiling enabled, inside the 3090's 24 GB envelope:

import torch
from diffusers import MochiPipeline
from diffusers.utils import export_to_video

pipe = MochiPipeline.from_pretrained(
    "genmo/mochi-1-preview",
    variant="bf16",
    torch_dtype=torch.bfloat16,
)

# Memory savings — both are required to stay under 24GB on the 3090
pipe.enable_model_cpu_offload()
pipe.enable_vae_tiling()

prompt = (
    "Close-up of a chameleon's eye, with its scaly skin changing color. "
    "Ultra high resolution 4k."
)
frames = pipe(prompt, num_frames=49).frames[0]
export_to_video(frames, "mochi.mp4", fps=30)

Run it:

python run_mochi.py

The first call downloads ~50 GB of weights into ~/.cache/huggingface/hub/. Subsequent calls reuse the cache. Default output is 848×480 at 30 fps; the script writes mochi.mp4 in the working directory.

Results

Speed: ~10–12 minutes per 49-frame 848×480 clip (extrapolated, not measured). The RTX 4090 sibling recipe measures ~5 min for the same configuration per InsiderLLM's local AI video guide. The same page publishes a generic RTX 3090 scaling rule, "Add 40-50% to all RTX 4090 times", which would give ~7–7.5 min. Mochi 1's DiT denoising loop is more strongly compute-bound than that generic rule assumes: the public 3090-to-4090 BF16 TFLOPS ratio is ~0.43-0.58 (depending on whether sparsity is included), which implies roughly 9–12 min. This recipe quotes 10–12 min, the conservative end of that range. The VAE decode is memory-bandwidth-bound (936 GB/s on the 3090 vs 1008 GB/s on the 4090, ~93%) and scales close to 1:1. No first-party RTX 3090 timing exists in citable form yet, so please report yours via /contribute once you've run a clip.
VRAM usage (diffusers path, this recipe): ~22 GB peak at the 49-frame, 848×480, default-step configuration, per the official diffusers docs for the bf16 + enable_model_cpu_offload() + enable_vae_tiling() setup the Running section walks through. The min_vram_gb: 22 frontmatter matches this path. For comparison, a community reply by PsiPi on HF discussion #8 reports "takes about 17-18 Gb IIRC" on an RTX 3090, but that figure refers to the kijai/ComfyUI-MochiWrapper runtime with FP8/GGUF quants, not the diffusers path. If you want the lower envelope, switch to the kijai wrapper (separate install). The ComfyUI blog post on Mochi consumer-GPU support notes that with the right attention backend "Mochi can now fit on consumer GPUs like a 4090. The Mochi node in ComfyUI supports multiple attention backends, letting it fit in <24GB of VRAM." See /check/mochi-1/rtx-3090 for live benchmark data.
Quality notes: 480p output only — Mochi 1 is explicitly trained at 848×480 and is "optimized for photorealistic styles" per the Genmo model card; animated / stylized content underperforms. Minor warping can occur under extreme motion.
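The speed extrapolation in this section comes down to two small calculations, sketched below. The inputs (5 min on the 4090, the 40-50% rule, and the 0.43-0.58 TFLOPS ratio) are the figures quoted above; the function names are illustrative.

```python
BASE_4090_MIN = 5.0  # InsiderLLM's RTX 4090 figure for a 49-frame clip


def scale_by_added_percent(base, low_pct, high_pct):
    """Apply an 'add X% to the time' rule; returns (fast, slow) minutes."""
    return base * (1 + low_pct / 100), base * (1 + high_pct / 100)


def scale_by_throughput_ratio(base, low_ratio, high_ratio):
    """Divide by a 3090/4090 throughput ratio; returns (fast, slow) minutes."""
    return base / high_ratio, base / low_ratio


rule = scale_by_added_percent(BASE_4090_MIN, 40, 50)
tflops = scale_by_throughput_ratio(BASE_4090_MIN, 0.43, 0.58)
print(f"InsiderLLM rule: {rule[0]:.1f}-{rule[1]:.1f} min")
print(f"TFLOPS ratio:    {tflops[0]:.1f}-{tflops[1]:.1f} min")
```

The first line works out to 7.0-7.5 min and the second to 8.6-11.6 min, so the 10–12 min quoted here sits at the slow end of the TFLOPS-based range.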

For the full benchmark data, see /check/mochi-1/rtx-3090.

Troubleshooting

DiT loop is slow on Ampere — what to expect

Mochi 1's transformer denoising is compute-bound, not memory-bound. The RTX 3090 sm_86 architecture predates Ada / Hopper FP8 tensor cores, so the recipe runs at BF16 throughput throughout (FP8 weight files exist via kijai/ComfyUI-MochiWrapper but are dequantized on the fly on Ampere — they save VRAM, not time). Plan on roughly 2× the wallclock per clip vs. an RTX 4090; the VAE decode step is the only stage where the gap closes (memory bandwidth on the 3090 is ~93% of the 4090's). If you want to validate your run: time the DiT denoising loop and the VAE decode separately — the loop is the long pole.
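One way to split the timing is to hook the per-step callback and record a timestamp after every denoising step. This is a sketch that assumes MochiPipeline accepts the callback_on_step_end argument used by other recent diffusers pipelines; the StepTimer class is illustrative and not part of diffusers.

```python
import time


class StepTimer:
    """Records a timestamp each time the pipeline finishes a denoising step."""

    def __init__(self):
        self.stamps = []

    def __call__(self, pipe, step, timestep, callback_kwargs):
        self.stamps.append(time.perf_counter())
        return callback_kwargs  # the pipeline expects the dict back

    def split(self, start, end):
        """Return (seconds up to the last denoising step, seconds after it)."""
        last = self.stamps[-1] if self.stamps else start
        return last - start, end - last


# Usage with the `pipe` and `prompt` from run_mochi.py (needs the GPU):
# timer = StepTimer()
# start = time.perf_counter()
# frames = pipe(prompt, num_frames=49, callback_on_step_end=timer).frames[0]
# denoise_s, decode_s = timer.split(start, time.perf_counter())
# print(f"encode + denoise: {denoise_s:.0f} s, decode: {decode_s:.0f} s")
```

The first figure includes prompt encoding, and the second includes any offload transfers after the loop, so treat both as approximate. Because CUDA work is queued asynchronously, calling torch.cuda.synchronize() inside the callback should give a tighter split at a small cost in speed.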

Out-of-memory partway through generation

The ~22 GB peak (diffusers path) assumes both enable_model_cpu_offload() and enable_vae_tiling() are active. Skipping either pushes peak VRAM past 24 GB — enable_vae_tiling() in particular is what keeps the VAE decode step from spiking. If you still OOM at 49 frames, drop to num_frames=37 or num_frames=25 (output is shorter; per-frame quality is unchanged).
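The step-down described above can be automated with a small retry loop. This is a sketch, not part of diffusers: generate_with_fallback is a hypothetical helper that takes any callable, so it can be tried without a GPU, and the commented usage shows how it would wrap the pipeline from run_mochi.py.

```python
def generate_with_fallback(run, frame_counts=(49, 37, 25),
                           oom_error=MemoryError, cleanup=None):
    """Try each frame count in order; return (frames, count) for the first that fits."""
    for count in frame_counts:
        try:
            return run(count), count
        except oom_error:
            print(f"OOM at num_frames={count}, trying a shorter clip")
        # Runs after the except block, so the failed call's tensors can be freed first.
        if cleanup is not None:
            cleanup()
    raise RuntimeError(f"out of memory at every frame count in {frame_counts}")


# Usage with the `pipe` and `prompt` from run_mochi.py:
# import torch
# frames, used = generate_with_fallback(
#     lambda n: pipe(prompt, num_frames=n).frames[0],
#     oom_error=torch.cuda.OutOfMemoryError,
#     cleanup=torch.cuda.empty_cache,
# )
```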

"163 frames" attempt OOMs even with tiling

Decoding 163 frames at full precision needs ~70 GB per the official diffusers docs — H100 territory. On a 3090, stay at 49 frames or split across two 24 GB cards using the device_map="auto" + max_memory={0: "24GB", 1: "24GB"} pattern shown in the same docs.
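A minimal sketch of the two-card pattern mentioned above, assuming MochiTransformer3DModel.from_pretrained accepts device_map and max_memory as in the diffusers docs example. The wrapper function build_two_gpu_pipeline is hypothetical, and this has not been timed on a pair of 3090s.

```python
MODEL_ID = "genmo/mochi-1-preview"
MAX_MEMORY = {0: "24GB", 1: "24GB"}  # one entry per 24 GB card


def build_two_gpu_pipeline():
    """Load the transformer sharded across two GPUs, then build the pipeline around it."""
    # Imported lazily: loading requires diffusers, accelerate, and two CUDA devices.
    from diffusers import MochiPipeline, MochiTransformer3DModel

    transformer = MochiTransformer3DModel.from_pretrained(
        MODEL_ID,
        subfolder="transformer",
        device_map="auto",
        max_memory=MAX_MEMORY,
    )
    pipe = MochiPipeline.from_pretrained(MODEL_ID, transformer=transformer)
    pipe.enable_model_cpu_offload()
    pipe.enable_vae_tiling()
    return pipe
```

This loads the weights without a torch_dtype, so you would likely want to run generation under torch.autocast(device_type="cuda", dtype=torch.bfloat16); check the current docs page for the exact call before relying on it.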

ComfyUI path instead of diffusers

If you'd rather stay in ComfyUI, the community wrapper kijai/ComfyUI-MochiWrapper reports fitting "under 20 GB" with frame counts up to ~97 (experimental tiled decoder), and lets you select among SDPA, FlashAttention-2, and Sage Attention backends. The 3090 is in FlashAttention-2's officially-supported GPU list ("Ampere, Ada, or Hopper GPUs (e.g., A100, RTX 3090, RTX 4090, H100)"), so FA2 is a clean install. Sage attention is typically the fastest of the three.

Want lower VRAM still — bitsandbytes 8-bit

The diffusers docs include a bitsandbytes 8-bit quantization recipe that drops both the transformer and the T5-XXL text encoder to 8-bit. On a 24 GB 3090 the bf16 + offload path is already comfortable; 8-bit becomes interesting if you want to colocate Mochi with another model on the same card.
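The 8-bit path could look roughly like the sketch below, which follows the general bitsandbytes pattern in the diffusers quantization docs. It assumes bitsandbytes is installed (pip install bitsandbytes); the wrapper function build_8bit_pipeline is hypothetical, and the float16 dtype and device_map="balanced" settings are assumptions to verify against the current docs.

```python
def build_8bit_pipeline(model_id: str = "genmo/mochi-1-preview"):
    """Load Mochi with the transformer and T5-XXL text encoder quantized to 8-bit."""
    # Imported lazily: requires torch, diffusers, transformers, and bitsandbytes.
    import torch
    from diffusers import BitsAndBytesConfig as DiffusersBnbConfig
    from diffusers import MochiPipeline, MochiTransformer3DModel
    from transformers import BitsAndBytesConfig as TransformersBnbConfig
    from transformers import T5EncoderModel

    text_encoder = T5EncoderModel.from_pretrained(
        model_id,
        subfolder="text_encoder",
        quantization_config=TransformersBnbConfig(load_in_8bit=True),
        torch_dtype=torch.float16,
    )
    transformer = MochiTransformer3DModel.from_pretrained(
        model_id,
        subfolder="transformer",
        quantization_config=DiffusersBnbConfig(load_in_8bit=True),
        torch_dtype=torch.float16,
    )
    return MochiPipeline.from_pretrained(
        model_id,
        text_encoder=text_encoder,
        transformer=transformer,
        torch_dtype=torch.float16,
        device_map="balanced",
    )
```

Note that the two BitsAndBytesConfig classes come from different libraries: the transformers one configures the text encoder and the diffusers one configures the video transformer.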

Video has audio? No — Mochi 1 is silent

Mochi 1 generates video frames only; there's no audio track. Add audio in post with FFmpeg or a separate TTS / music model.
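A minimal sketch of the FFmpeg route, wrapped in Python so it can sit next to run_mochi.py. The file names soundtrack.wav and mochi_with_audio.mp4 are placeholders, and the helper names are illustrative.

```python
import shutil
import subprocess


def mux_command(video, audio, output):
    """Build an ffmpeg command that copies the video stream and adds an AAC audio track."""
    return [
        "ffmpeg", "-y",   # overwrite the output file if it exists
        "-i", video,      # silent Mochi clip
        "-i", audio,      # soundtrack from any source
        "-c:v", "copy",   # keep the video stream as-is (no re-encode)
        "-c:a", "aac",    # encode the audio to AAC for MP4
        "-shortest",      # stop at the end of the shorter input
        output,
    ]


def add_audio(video, audio, output):
    """Run the mux; raises if ffmpeg is missing or exits with an error."""
    if shutil.which("ffmpeg") is None:
        raise RuntimeError("ffmpeg not found on PATH")
    subprocess.run(mux_command(video, audio, output), check=True)


print(" ".join(mux_command("mochi.mp4", "soundtrack.wav", "mochi_with_audio.mp4")))
```

Because a 49-frame clip is under two seconds long, -shortest trims a longer soundtrack to the video's length.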