Mochi 1 on RTX 3090 Ti: 85-frame 480p Text-to-Video with Diffusers

What You'll Build

A local text-to-video pipeline that turns a prompt into an 85-frame 848×480 clip at 30 fps using Genmo's 10B-parameter Mochi 1 model on a single RTX 3090 Ti. The recipe uses the bf16 variant with CPU offload and tiled VAE decoding, so peak VRAM stays at ~22 GB, within the 3090 Ti's 24 GB. The Ti shares the Ampere (sm_86) architecture with the standard RTX 3090 and runs the identical diffusers code path; expect slightly shorter wall-clock times on the Ti thanks to its higher boost clocks and ~8% higher memory bandwidth (1008 GB/s vs 936 GB/s).

Hardware data: RTX 3090 Ti (24 GB VRAM) · ~22 GB peak VRAM (diffusers bf16 + offload + tiling) · See benchmark data

ℹ️ About the 85-frame default. The official diffusers Mochi docs use num_frames=85 in the canonical bf16 + offload + tiling example, documented at 22 GB of VRAM, so we follow it verbatim. The same docs note that generating the full 163 frames with full-precision decoding needs at least 70 GB of VRAM; on this class of card that means splitting across multiple GPUs or accepting a quality hit.
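The frame counts and bandwidth figures above translate into clip length and relative uplift with simple arithmetic. The helper names below are illustrative, not part of diffusers:

```python
def clip_seconds(num_frames: int, fps: int = 30) -> float:
    """Playback length of a clip with the given frame count and frame rate."""
    return num_frames / fps


def uplift_percent(new: float, old: float) -> float:
    """Relative increase of `new` over `old`, in percent."""
    return (new / old - 1) * 100


for n in (85, 163):
    print(f"{n} frames at 30 fps = {clip_seconds(n):.2f} s")

# Memory bandwidth: RTX 3090 Ti (1008 GB/s) vs RTX 3090 (936 GB/s)
print(f"bandwidth uplift = {uplift_percent(1008, 936):.1f}%")
```

So the 85-frame default is a clip of just under three seconds, and the full 163-frame setting is a little under five and a half.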

Requirements

Component	Minimum	Tested
GPU	22 GB VRAM (bf16 variant)	RTX 3090 Ti (24 GB)
RAM	32 GB (T5-XXL + cpu-offloaded transformer share host memory)	—
Storage	~50 GB (model weights + T5-XXL encoder)	—
Software	Python 3.10+, PyTorch 2.4+, FFmpeg	—

Installation

1. Install diffusers from main

Mochi 1's MochiPipeline lives in the diffusers main branch. The model card recommends installing from git directly:

pip install git+https://github.com/huggingface/diffusers.git
pip install transformers accelerate sentencepiece imageio imageio-ffmpeg
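To confirm that both pip commands took effect, a small stdlib-only check can list what is installed. This is a sketch: the package list simply mirrors the commands above, and a git install of diffusers will usually report a development version string.

```python
from importlib import metadata

# Distributions installed by the two pip commands above.
PACKAGES = [
    "diffusers", "transformers", "accelerate",
    "sentencepiece", "imageio", "imageio-ffmpeg",
]


def installed_versions(names):
    """Map each distribution name to its installed version, or None if missing."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in installed_versions(PACKAGES).items():
        print(f"{name:16s} {version or 'NOT INSTALLED'}")
```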

2. Authenticate with Hugging Face

The genmo/mochi-1-preview weights are open under Apache 2.0 but the download is gated through HF. Log in once:

huggingface-cli login

3. Verify CUDA + PyTorch

The RTX 3090 Ti is Ampere (sm_86) — standard PyTorch wheels include sm_86 kernels by default, no special wheel selection needed:

python -c "import torch; print(torch.__version__, torch.cuda.is_available(), torch.cuda.get_device_name(0))"
# Expected: 2.4+, True, NVIDIA GeForce RTX 3090 Ti
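A slightly fuller preflight can also check compute capability and total VRAM. The sketch below makes two assumptions: that bf16 needs Ampere (sm_80) or newer, and that the 22 GB figure from the docs can be compared directly against the memory PyTorch reports. The function names are hypothetical.

```python
def check_gpu(name, capability, total_bytes, min_gb=22.0):
    """Return a list of problems; an empty list means the card looks suitable."""
    problems = []
    if tuple(capability) < (8, 0):
        problems.append(f"{name}: compute capability {capability} predates Ampere")
    total_gb = total_bytes / 1024**3
    if total_gb < min_gb:
        problems.append(f"{name}: {total_gb:.1f} GiB VRAM is below the {min_gb:.0f} GB target")
    return problems


def probe():
    """Query device 0 through PyTorch (needs a CUDA build of torch)."""
    import torch
    props = torch.cuda.get_device_properties(0)
    return check_gpu(props.name, (props.major, props.minor), props.total_memory)


if __name__ == "__main__":
    try:
        issues = probe()
    except Exception as exc:  # torch missing, or no CUDA device visible
        issues = [f"could not query GPU: {exc}"]
    print("OK" if not issues else "\n".join(issues))
```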

Running

Save the following as run_mochi.py. This is the canonical bf16 recipe from the official diffusers Mochi docs, which list it as requiring 22 GB of VRAM with CPU offload and VAE tiling, within the 3090 Ti's 24 GB:

import torch
from diffusers import MochiPipeline
from diffusers.utils import export_to_video

pipe = MochiPipeline.from_pretrained(
    "genmo/mochi-1-preview",
    variant="bf16",
    torch_dtype=torch.bfloat16,
)

# Memory savings — both are required to stay under 24GB on the 3090 Ti
pipe.enable_model_cpu_offload()
pipe.enable_vae_tiling()

prompt = (
    "Close-up of a chameleon's eye, with its scaly skin changing color. "
    "Ultra high resolution 4k."
)
frames = pipe(prompt, num_frames=85).frames[0]
export_to_video(frames, "mochi.mp4", fps=30)

Run it:

python run_mochi.py

The first call downloads ~50 GB of weights into ~/.cache/huggingface/hub/. Subsequent calls reuse the cache. Default output is 848×480 at 30 fps; the script writes mochi.mp4 in the working directory.
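Because the first run pulls roughly 50 GB, it can be worth checking free disk space before starting. A minimal sketch, assuming the default cache location mentioned above (the cache can be relocated, for example with the HF_HOME environment variable, in which case pass that path instead):

```python
import os
import shutil


def free_gb(path):
    """Free space in GB on the filesystem holding `path` (or its nearest existing parent)."""
    path = os.path.abspath(os.path.expanduser(path))
    while not os.path.exists(path):
        path = os.path.dirname(path)
    return shutil.disk_usage(path).free / 1e9


def has_room(path="~/.cache/huggingface/hub", needed_gb=50):
    """True if the filesystem behind `path` has at least `needed_gb` free."""
    return free_gb(path) >= needed_gb


if __name__ == "__main__":
    print(f"free: {free_gb('~/.cache/huggingface/hub'):.0f} GB, enough: {has_room()}")
```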

Results

Speed: No first-party wall-clock measurement is available yet for the RTX 3090 Ti: no published benchmark names this card running this diffusers configuration at 85 frames, 848×480. The closest comparable card is the standard RTX 3090 (same Ampere sm_86 architecture, same 24 GB of VRAM, identical diffusers path). The 3090 Ti has ~8% higher memory bandwidth (1008 GB/s vs 936 GB/s) and modestly higher boost clocks, so it should run slightly faster than whatever the 3090 measures. Please report your timing via /contribute once you've run a clip; first-party 3090 Ti timings will replace this paragraph.
VRAM usage (diffusers path, this recipe): ~22 GB peak at 85 frames, 848×480, and the default step count, per the official diffusers docs for the bf16 + enable_model_cpu_offload() + enable_vae_tiling() configuration used in the Running section. The min_vram_gb: 22 frontmatter value matches this path. For comparison, a community reply by PsiPi on HF discussion #8 (a community user, with no Genmo team or org-member badge on the byline) reports "it can work on a 3090 - takes about 17-18 Gb IIRC". That figure refers to the kijai/ComfyUI-MochiWrapper runtime with FP8/GGUF quants, not the diffusers path; if you want the lower envelope, switch to the kijai wrapper (a separate install). The ComfyUI blog post on Mochi consumer-GPU support notes that "Mochi can now fit on consumer GPUs like a 4090. The Mochi node in ComfyUI supports multiple attention backends, letting it fit in <24GB of VRAM."
Quality notes: 480p output only — per the Genmo model card, "The initial release generates videos at 480p today" and "Mochi 1 is also optimized for photorealistic styles so does not perform well with animated content." The card also notes minor warping can occur under extreme motion.
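To collect your own speed and VRAM numbers, one option is to wrap the pipeline call in a small timing helper. This is a sketch with a hypothetical measure() function. Note that torch.cuda.max_memory_allocated() counts memory held by PyTorch tensors only, so tools such as nvidia-smi will typically show a somewhat higher figure.

```python
import time


def measure(fn):
    """Run fn() once; return (result, seconds, peak_gb). peak_gb is None without CUDA."""
    try:
        import torch
        cuda = torch.cuda.is_available()
    except ImportError:
        torch, cuda = None, False

    if cuda:
        torch.cuda.reset_peak_memory_stats()
        torch.cuda.synchronize()
    start = time.perf_counter()
    result = fn()
    if cuda:
        torch.cuda.synchronize()  # wait for queued GPU work before stopping the clock
    seconds = time.perf_counter() - start
    peak_gb = torch.cuda.max_memory_allocated() / 1e9 if cuda else None
    return result, seconds, peak_gb


# Usage inside run_mochi.py (replaces the direct pipe(...) call):
# output, seconds, peak_gb = measure(lambda: pipe(prompt, num_frames=85))
# frames = output.frames[0]
# print(f"{seconds:.0f} s, peak {peak_gb:.1f} GB")
```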

For the full benchmark data, see /check/mochi-1/rtx-3090-ti.

Troubleshooting

FP8 weights don't accelerate Mochi on Ampere

Mochi 1's transformer denoising is compute-bound. The RTX 3090 Ti's sm_86 architecture predates the FP8 tensor cores introduced with Ada and Hopper, so the recipe runs at BF16 throughput throughout. FP8 weight files exist via kijai/ComfyUI-MochiWrapper, but on Ampere PyTorch dequantizes them on the fly: they save VRAM, not time. The architectural premise (no FP8 tensor cores on sm_86) is documented by the drbaph HiDream-O1 FP8 model card, which explicitly buckets pre-Ada cards as the "FP8 weights load but dequantize to BF16 at compute time" path. If you want raw speed on this card, stay with the BF16 diffusers path documented here; an FP8 mirror will not deliver a tensor-core speedup.
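The architectural cut-off can be expressed as a compute-capability comparison. A minimal sketch, using the facts that Ada is sm_89 and Hopper is sm_90; the function name is illustrative:

```python
def has_fp8_tensor_cores(capability):
    """True for Ada (sm_89), Hopper (sm_90) or newer; False for Ampere and older."""
    return tuple(capability) >= (8, 9)


for name, cap in [("RTX 3090 Ti", (8, 6)), ("RTX 4090", (8, 9)), ("H100", (9, 0))]:
    print(f"{name}: FP8 tensor cores = {has_fp8_tensor_cores(cap)}")
```

On a live system the tuple can be read with torch.cuda.get_device_capability(0).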

Out-of-memory partway through generation

The ~22 GB peak (diffusers path) assumes both enable_model_cpu_offload() and enable_vae_tiling() are active. Skipping either pushes peak VRAM past 24 GB — enable_vae_tiling() in particular is what keeps the VAE decode step from spiking. If you still OOM at 85 frames, drop to num_frames=49 or num_frames=37 (output is shorter; per-frame quality is unchanged).
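The suggested frame counts (85, 49, 37) all have the form 6k + 1, which lines up with the 6x temporal compression of Mochi's VAE described on the Genmo model card. The sketch below assumes that such counts map cleanly onto latent frames, and shows one way to step down automatically on an out-of-memory error. Both helper names are hypothetical.

```python
TEMPORAL_FACTOR = 6  # temporal compression of Mochi's VAE per the Genmo model card


def snap_frames(n, factor=TEMPORAL_FACTOR):
    """Largest frame count of the form factor*k + 1 that does not exceed n (minimum 1)."""
    return max(1, (n - 1) // factor * factor + 1)


def generate_with_fallback(generate, counts=(85, 49, 37)):
    """Call generate(n) for each count in turn until one does not run out of memory."""
    for n in counts:
        try:
            return n, generate(n)
        except RuntimeError as exc:  # CUDA OOM errors are RuntimeError subclasses
            if "out of memory" not in str(exc).lower():
                raise
    raise RuntimeError(f"out of memory at every frame count in {counts}")


# Usage with the pipeline from run_mochi.py:
# n, frames = generate_with_fallback(lambda k: pipe(prompt, num_frames=k).frames[0])
```

Calling torch.cuda.empty_cache() between attempts may help release memory left over from the failed run.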

"163 frames" attempt OOMs even with tiling

Decoding 163 frames at full precision needs at least 70 GB of VRAM per the official diffusers docs, which is H100 territory. On a 24 GB consumer card, stay at 85 frames or split across two cards using the device_map="auto" + max_memory={...} pattern shown in the same docs.
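The two-card pattern looks roughly like the following. Treat it as a sketch to verify against the current diffusers docs: the "24GB" per-GPU cap, the helper names, and the choice to keep offload and tiling enabled are assumptions, and the function is defined but not called here because it needs two GPUs and the full weights.

```python
def build_max_memory(num_gpus=2, per_gpu="24GB"):
    """Per-device memory caps in the form expected by max_memory."""
    return {i: per_gpu for i in range(num_gpus)}


def load_sharded_pipeline(model_id="genmo/mochi-1-preview", num_gpus=2):
    """Load the transformer sharded across GPUs, then build the pipeline around it."""
    from diffusers import MochiPipeline, MochiTransformer3DModel

    transformer = MochiTransformer3DModel.from_pretrained(
        model_id,
        subfolder="transformer",
        device_map="auto",                      # let accelerate place the layers
        max_memory=build_max_memory(num_gpus),  # assumed cap per card
    )
    pipe = MochiPipeline.from_pretrained(model_id, transformer=transformer)
    pipe.enable_model_cpu_offload()
    pipe.enable_vae_tiling()
    return pipe


print(build_max_memory())
```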

ComfyUI path instead of diffusers

If you'd rather stay in ComfyUI, the community wrapper kijai/ComfyUI-MochiWrapper describes its memory envelope as "Depending on frame count can fit under 20GB, VAE decoding is heavy and there is experimental tiled decoder (taken from CogVideoX -diffusers code) which allows higher frame counts, so far highest I've done is 97 with the default tile size 2x2 grid." The wrapper lets you select among SDPA, FlashAttention-2, and Sage Attention backends. The 3090 Ti falls under FlashAttention-2's officially supported GPU list ("Ampere, Ada, or Hopper GPUs (e.g., A100, RTX 3090, RTX 4090, H100)"), so FA2 is a clean install. Sage Attention is typically the fastest of the three per the wrapper's own ordering.
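Before picking a backend in the wrapper, you can check which ones are importable in the active Python environment. This sketch only tests for installed modules. The import names flash_attn and sageattention are assumptions about how those packages are published, and the check says nothing about which backend the wrapper will actually select.

```python
from importlib import util

# Assumed import names for the optional attention packages.
OPTIONAL_BACKENDS = {
    "flash_attn": "FlashAttention-2",
    "sageattention": "Sage Attention",
}


def available_backends():
    """List attention backends whose Python modules are installed."""
    found = []
    if util.find_spec("torch") is not None:
        found.append("SDPA")  # scaled_dot_product_attention ships with PyTorch 2.x
    for module, label in OPTIONAL_BACKENDS.items():
        if util.find_spec(module) is not None:
            found.append(label)
    return found


print(available_backends())
```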

Want lower VRAM still — bitsandbytes 8-bit

The diffusers docs include a bitsandbytes 8-bit quantization recipe that drops both the transformer and the T5-XXL text encoder to 8-bit. On a 24 GB 3090 Ti the bf16 + offload path is already comfortable; 8-bit becomes interesting if you want to colocate Mochi with another model on the same card.
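A sketch of that 8-bit path follows. The loader mirrors the general shape of the diffusers quantization recipe, but the exact arguments (float16 dtype, device_map="balanced") are assumptions to check against the docs, and bitsandbytes must be installed separately. The weight estimate uses only the 10B parameter count quoted earlier and ignores the text encoder, VAE, and activations.

```python
def weight_gb(params_billion, bits):
    """Approximate weight size in GB for a model with the given parameter count."""
    return params_billion * bits / 8


def load_8bit_pipeline(model_id="genmo/mochi-1-preview"):
    """Load Mochi with an 8-bit transformer and an 8-bit T5 text encoder."""
    import torch
    from diffusers import BitsAndBytesConfig as DiffusersBnbConfig
    from diffusers import MochiPipeline, MochiTransformer3DModel
    from transformers import BitsAndBytesConfig as TransformersBnbConfig
    from transformers import T5EncoderModel

    text_encoder = T5EncoderModel.from_pretrained(
        model_id, subfolder="text_encoder",
        quantization_config=TransformersBnbConfig(load_in_8bit=True),
        torch_dtype=torch.float16,
    )
    transformer = MochiTransformer3DModel.from_pretrained(
        model_id, subfolder="transformer",
        quantization_config=DiffusersBnbConfig(load_in_8bit=True),
        torch_dtype=torch.float16,
    )
    return MochiPipeline.from_pretrained(
        model_id, text_encoder=text_encoder, transformer=transformer,
        torch_dtype=torch.float16, device_map="balanced",
    )


print(f"transformer weights: bf16 ~{weight_gb(10, 16):.0f} GB, 8-bit ~{weight_gb(10, 8):.0f} GB")
```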

Video has audio? No — Mochi 1 is silent

Mochi 1 generates video frames only; there's no audio track. Add audio in post with FFmpeg or a separate TTS / music model.
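Muxing an audio track onto the finished clip is a single FFmpeg call. A minimal sketch driven from Python, where soundtrack.wav and the output name are hypothetical placeholders; the video stream is copied without re-encoding and the result is trimmed to the shorter of the two inputs:

```python
import subprocess


def mux_command(video, audio, out):
    """Build an ffmpeg command that copies the video stream and encodes audio as AAC."""
    return [
        "ffmpeg", "-y",
        "-i", video,
        "-i", audio,
        "-c:v", "copy",   # keep the generated frames untouched
        "-c:a", "aac",
        "-shortest",      # stop at the end of the shorter input
        out,
    ]


def add_audio(video="mochi.mp4", audio="soundtrack.wav", out="mochi_with_audio.mp4"):
    """Run ffmpeg; raises CalledProcessError if the mux fails."""
    subprocess.run(mux_command(video, audio, out), check=True)


print(" ".join(mux_command("mochi.mp4", "soundtrack.wav", "mochi_with_audio.mp4")))
```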