HunyuanVideo-1.5 on RTX 4090: 480p Step-Distilled Text-to-Video in ~75 Seconds

What You'll Build

A single-GPU text-to-video pipeline that turns a prompt into a 480p clip in about 75 seconds on an RTX 4090, using Tencent's 8.3B-parameter HunyuanVideo-1.5 model with the step-distilled checkpoint. The same install also supports the standard (non-distilled) variant if you want higher quality at the cost of 3-12 minutes per clip.

Hardware data: RTX 4090 (24GB VRAM) - 75s per 480p clip (step-distilled) - see benchmark data at /check/hunyuan-video/rtx-4090

ℹ️ This recipe covers HunyuanVideo-1.5, not the original 13B HunyuanVideo. Tencent ships two distinct video models under the "HunyuanVideo" umbrella. HunyuanVideo (1.0) is a 13B model whose original FP16 weights need 40GB+ of VRAM; per an independent walkthrough, it was "impressive but impractical" on consumer GPUs without aggressive Q8 quantization via Kijai's ComfyUI wrappers. HunyuanVideo-1.5 is the late-2025 8.3B successor explicitly designed to fit a single RTX 4090. We anchor the recipe on 1.5 because that is what the empirical benchmarks on /check/hunyuan-video/rtx-4090 measure, and because it is the only path that fits the card within a sensible runtime budget.

Requirements

Component	Minimum	Tested
GPU	14GB VRAM (with model offloading)	RTX 4090 (24GB)
RAM	32GB	-
Storage	~60GB for full checkpoint set (DiT + VAE + text encoders)	-
Software	Python 3.10+, CUDA 12.x, PyTorch 2.x, Linux	-
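Before installing anything, it can help to compare a machine against the minimums in the table. The sketch below is one way to do that with the standard library only; the function names are hypothetical, the thresholds are copied from the table, and the VRAM and RAM figures are passed in by hand rather than detected.

```python
import shutil

# Minimums copied from the requirements table (assumed to be in GB).
MIN_VRAM_GB = 14
MIN_RAM_GB = 32
MIN_FREE_DISK_GB = 60


def free_disk_gb(path="."):
    """Free space on the filesystem that holds `path`, in GB (10**9 bytes)."""
    return shutil.disk_usage(path).free / 1e9


def check_requirements(vram_gb, ram_gb, disk_gb):
    """Return a list of problems; an empty list means every minimum is met."""
    problems = []
    if vram_gb < MIN_VRAM_GB:
        problems.append(f"VRAM: {vram_gb:.0f}GB < {MIN_VRAM_GB}GB minimum")
    if ram_gb < MIN_RAM_GB:
        problems.append(f"RAM: {ram_gb:.0f}GB < {MIN_RAM_GB}GB minimum")
    if disk_gb < MIN_FREE_DISK_GB:
        problems.append(f"Disk: {disk_gb:.0f}GB free < {MIN_FREE_DISK_GB}GB needed")
    return problems


# Example: an RTX 4090 box with 64GB of RAM, checking the current directory's disk.
for problem in check_requirements(vram_gb=24, ram_gb=64, disk_gb=free_disk_gb(".")):
    print(problem)
```

An empty result only says the minimums are met; the 14GB VRAM floor still assumes model offloading is enabled.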

Installation

1. Clone the official Tencent repository

git clone https://github.com/Tencent-Hunyuan/HunyuanVideo-1.5.git
cd HunyuanVideo-1.5

These steps come verbatim from the official README.

2. Install Python dependencies

pip install -r requirements.txt
pip install -i https://mirrors.tencent.com/pypi/simple/ --upgrade tencentcloud-sdk-python

3. Install an attention backend

HunyuanVideo-1.5 uses variable-length attention masks. Per the HuggingFace diffusers integration notes, for an RTX 4090 the recommended backend is flash_hub or flash_varlen_hub (i.e. FlashAttention). Install FlashAttention from the Dao-AILab repository. The Ada Lovelace architecture (sm_89) has full FlashAttention-2 kernel coverage, so the default wheel works without special flags.

Optional but recommended for FP8 GEMM acceleration (added in the December 23, 2025 Tencent release):

pip install sgl-kernel==0.3.18
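To see which of these optional packages are visible to the current Python environment, a small check like the following may be useful. It is a sketch: the import names `flash_attn` and `sgl_kernel` are assumptions about how the two packages are exposed, and the helper name is hypothetical.

```python
import importlib.util

# Assumed import names: flash-attn -> flash_attn, sgl-kernel -> sgl_kernel.
OPTIONAL_MODULES = {
    "FlashAttention": "flash_attn",
    "sgl-kernel": "sgl_kernel",
}


def installed_backends(modules=OPTIONAL_MODULES):
    """Map each label to True if its module can be found, without importing it."""
    return {
        label: importlib.util.find_spec(name) is not None
        for label, name in modules.items()
    }


for label, present in installed_backends().items():
    print(f"{label}: {'found' if present else 'not found'}")
```

`find_spec` only locates the package; it does not confirm that the compiled kernels match the installed CUDA and PyTorch versions.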

4. Download the checkpoints

From the official checkpoints-download.md:

hf download tencent/HunyuanVideo-1.5 --local-dir ./ckpts
hf download Qwen/Qwen2.5-VL-7B-Instruct --local-dir ./ckpts/text_encoder/llm
hf download google/byt5-small --local-dir ./ckpts/text_encoder/byt5-small

The repository ships the main DiT (including the 480p_i2v_step_distilled weights), the 3D causal VAE, and the glyph-aware text-encoder config. Qwen2.5-VL-7B-Instruct is the primary text encoder; byt5-small handles glyph-aware text rendering inside generated videos.
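After the downloads finish, a quick layout check can catch a missing or empty text-encoder directory before the first run. The sketch below only looks at the two subdirectories named in the download commands; the helper name is hypothetical, and it does not validate the files inside.

```python
from pathlib import Path

# Subdirectories created by the two text-encoder download commands.
EXPECTED_SUBDIRS = ("text_encoder/llm", "text_encoder/byt5-small")


def missing_checkpoint_dirs(root="./ckpts"):
    """Return the expected subdirectories that are absent or empty under `root`."""
    root = Path(root)
    missing = []
    for sub in EXPECTED_SUBDIRS:
        target = root / sub
        if not target.is_dir() or not any(target.iterdir()):
            missing.append(sub)
    return missing


missing = missing_checkpoint_dirs("./ckpts")
print("Checkpoint layout looks complete" if not missing else f"Missing: {missing}")
```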

Running

Option A - Official Tencent script (T2V, step-distilled)

The official launcher is generate.py invoked via torchrun. For a single RTX 4090, set N_INFERENCE_GPU=1:

export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True,max_split_size_mb:128

PROMPT='A fluffy teddy bear sits on a bed of soft pillows surrounded by children toys.'
SEED=1
ASPECT_RATIO=16:9
RESOLUTION=480p
OUTPUT_PATH=./outputs/output.mp4
MODEL_PATH=./ckpts

torchrun --nproc_per_node=1 generate.py \
  --prompt "$PROMPT" \
  --resolution $RESOLUTION \
  --aspect_ratio $ASPECT_RATIO \
  --seed $SEED \
  --rewrite false \
  --enable_step_distill true \
  --use_sageattn false \
  --overlap_group_offloading true \
  --output_path $OUTPUT_PATH \
  --model_path $MODEL_PATH

Key flags:

--enable_step_distill true selects the step-distilled checkpoint, which is meant to run at 8 or 12 inference steps (up to 6x speedup per the official README).
--overlap_group_offloading true keeps peak VRAM under 24GB by streaming layers between CPU RAM and the GPU. Set to false if you have a 40GB+ card and want maximum throughput.
--rewrite false skips the LLM-based prompt rewriter; flip to true and set T2V_REWRITE_BASE_URL plus T2V_REWRITE_MODEL_NAME if you have a vLLM server hosting Qwen2.5-VL-7B-Instruct.
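If you launch generations from Python (for example, to loop over prompts), the same command can be assembled as an argument list. This is a sketch that reuses only the flags shown above; the function name and its defaults are hypothetical, and it prints the command rather than running it.

```python
import shlex


def build_generate_command(
    prompt,
    output_path="./outputs/output.mp4",
    model_path="./ckpts",
    resolution="480p",
    aspect_ratio="16:9",
    seed=1,
    step_distill=True,
    offloading=True,
):
    """Build the single-GPU torchrun argv for generate.py as a list of strings."""

    def flag(value):
        # The script takes literal "true"/"false" strings for boolean flags.
        return "true" if value else "false"

    return [
        "torchrun", "--nproc_per_node=1", "generate.py",
        "--prompt", prompt,
        "--resolution", resolution,
        "--aspect_ratio", aspect_ratio,
        "--seed", str(seed),
        "--rewrite", "false",
        "--enable_step_distill", flag(step_distill),
        "--use_sageattn", "false",
        "--overlap_group_offloading", flag(offloading),
        "--output_path", output_path,
        "--model_path", model_path,
    ]


cmd = build_generate_command("A fluffy teddy bear sits on a bed of soft pillows.")
print(shlex.join(cmd))
# To execute from the repository root: subprocess.run(cmd, check=True)
```

Passing the command as a list avoids shell-quoting problems when a prompt contains quotes or dollar signs.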

Option B - HuggingFace diffusers (T2V, single Python process)

If you'd rather use the diffusers integration, the upstream API handles offloading for you:

import torch
from diffusers import HunyuanVideo15Pipeline
from diffusers.utils import export_to_video

pipe = HunyuanVideo15Pipeline.from_pretrained(
    "hunyuanvideo-community/HunyuanVideo-1.5-480p_t2v",
    torch_dtype=torch.bfloat16,
)
pipe.transformer.set_attention_backend("flash_hub")  # recommended for RTX 4090
pipe.enable_model_cpu_offload()
pipe.vae.enable_tiling()

prompt = "A fluffy teddy bear sits on a bed of soft pillows surrounded by children toys."
video = pipe(prompt=prompt, num_frames=61, num_inference_steps=50).frames[0]
export_to_video(video, "output.mp4", fps=15)

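enable_model_cpu_offload() keeps inactive submodules on CPU and matches the official 14GB-minimum claim; enable_tiling() chunks the VAE decode so it fits alongside the DiT in 24GB. Note that this example loads the standard 480p T2V checkpoint and runs 50 inference steps, so expect the multi-minute runtime of the non-distilled path rather than the 75-second figure.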
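For reference, the clip length produced by the diffusers example follows directly from its num_frames and fps arguments. A minimal sketch of that arithmetic, with a hypothetical helper name:

```python
def clip_seconds(num_frames, fps):
    """Playback duration of a clip with `num_frames` frames exported at `fps`."""
    if fps <= 0:
        raise ValueError("fps must be positive")
    return num_frames / fps


# Values from the diffusers example: 61 frames exported at 15 fps.
print(f"{clip_seconds(61, 15):.2f} s")  # prints "4.07 s"
```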

Results

Speed: Tencent's HF card states "On RTX 4090, end-to-end generation time is reduced by 75%, and a single RTX 4090 can generate videos within 75 seconds" when the step-distilled checkpoint is used. The insiderllm walkthrough clocks the standard (non-distilled) variant at 3-12 minutes per clip on an RTX 4090.
VRAM usage: 14GB minimum with --overlap_group_offloading true / pipe.enable_model_cpu_offload(), ~24GB without offloading - both cited by the official HF card and confirmed in the insiderllm walkthrough ("VRAM (standard) ~24GB", "VRAM down to 14GB with offloading").
Quality notes: The step-distilled checkpoint "maintains comparable quality to the original model" per Tencent's release note, but is currently only released for 480p I2V and the 8-step T2V path (status as of the December 5, 2025 release). The standard 50-step T2V remains the higher-quality option when you can afford the 3-12 minute generation budget.
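The two speed figures can be cross-checked against each other. The sketch below assumes the quoted 75% reduction is measured against the same clip settings that yield the 75-second result, which the sources do not state explicitly; the helper name is hypothetical.

```python
def implied_baseline_seconds(distilled_seconds, reduction_fraction):
    """Baseline time implied by distilled = baseline * (1 - reduction_fraction)."""
    if not 0 <= reduction_fraction < 1:
        raise ValueError("reduction_fraction must be in [0, 1)")
    return distilled_seconds / (1 - reduction_fraction)


baseline = implied_baseline_seconds(75, 0.75)
print(f"implied baseline: {baseline / 60:.1f} min")  # 5.0 min
print(f"implied speedup:  {baseline / 75:.1f}x")     # 4.0x
```

Under that assumption the implied baseline of 5 minutes falls inside the 3-12 minute range reported for the standard variant.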

For full benchmark data, see /check/hunyuan-video/rtx-4090.

Troubleshooting

OOM at 24GB even with offloading enabled

Set the PyTorch allocator hints before launching, per the official README:

export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True,max_split_size_mb:128

Confirm --overlap_group_offloading true is set (Tencent script) or pipe.enable_model_cpu_offload() is called in place of pipe.to("cuda") (diffusers path); moving the whole pipeline to the GPU defeats the offloading. Disabling either pushes peak VRAM up against the 24GB ceiling.
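On the diffusers path there is no shell wrapper, so the allocator hint can be set from Python instead. The sketch below assumes it runs at the very top of the script, before torch is imported; the helper name is hypothetical, and the value is the one quoted above.

```python
import os

# Same value as the export line above.
ALLOC_CONF = "expandable_segments:True,max_split_size_mb:128"


def configure_allocator(value=ALLOC_CONF):
    """Set PYTORCH_CUDA_ALLOC_CONF unless the shell already exported one.

    Returns the value that is in effect afterwards.
    """
    return os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", value)


print(configure_allocator())
# Import torch and diffusers only after this point, so the CUDA allocator
# reads the setting when it initializes.
```

Using setdefault keeps any value exported in the shell, so the script and the launcher cannot silently disagree.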

Slow first run / loading the Qwen2.5-VL text encoder

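The first generation loads the 7B Qwen text encoder and the DiT, so expect 60-120 seconds of one-time disk reads and CPU-to-GPU transfer before the actual 75-second inference begins. Subsequent generations in the same Python process reuse the loaded weights; a fresh torchrun invocation loads them again, although the OS file cache usually shortens the disk read.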
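To tell loading time apart from inference time on your own machine, each phase can be wrapped in a timer. This is a minimal standard-library sketch; the `timed` helper is hypothetical, and the two `pass` bodies stand in for the real load and generate calls.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, results):
    """Record the wall-clock seconds spent inside the with-block under `label`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        results[label] = time.perf_counter() - start


timings = {}
with timed("load", timings):
    pass  # placeholder, e.g. pipe = HunyuanVideo15Pipeline.from_pretrained(...)
with timed("generate", timings):
    pass  # placeholder, e.g. video = pipe(prompt=prompt).frames[0]

for label, seconds in timings.items():
    print(f"{label}: {seconds:.1f} s")
```

Comparing the "generate" figure across two consecutive calls in one process shows how much of the first run was one-time warm-up.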

"I want the original HunyuanVideo 13B, not 1.5"

The original tencent/HunyuanVideo model has very different runtime characteristics on the 4090 (community Q8 quantization via Kijai's ComfyUI wrappers gets it into ~24GB but at ~5-10 minutes per 5-second clip per the insiderllm walkthrough). The Q8 community pipeline is a separate recipe path - submit a contribution if you want it documented.