What You'll Build
A single-GPU text-to-video pipeline that turns a prompt into a 480p clip in about 75 seconds on an RTX 4090, using Tencent's 8.3B-parameter HunyuanVideo-1.5 model with the step-distilled checkpoint. The same install also supports the standard (non-distilled) variant if you want higher quality at the cost of 3-12 minutes per clip.
Hardware data: RTX 4090 (24GB VRAM) - 75s per 480p clip (step-distilled) - See benchmark data
ℹ️ This recipe is HunyuanVideo-1.5, not the original 13B HunyuanVideo. Tencent ships two distinct video models under the "HunyuanVideo" umbrella. HunyuanVideo (1.0) is a 13B model whose original FP16 weights need 40GB+ of VRAM; per an independent walkthrough, it was "impressive but impractical" on consumer GPUs without aggressive Q8 quantization via Kijai's ComfyUI wrappers. HunyuanVideo-1.5 is the late-2025 8.3B successor explicitly designed to fit a single RTX 4090. We anchor the recipe on 1.5 because that's what the empirical benchmarks on /check/hunyuan-video/rtx-4090 measure - and it's the only path that actually fits the card in a sensible runtime budget.
Requirements
| Component | Minimum | Tested |
|---|---|---|
| GPU | 14GB VRAM (with model offloading) | RTX 4090 (24GB) |
| RAM | 32GB | - |
| Storage | ~60GB for full checkpoint set (DiT + VAE + text encoders) | - |
| Software | Python 3.10+, CUDA 12.x, PyTorch 2.x, Linux | - |
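A quick preflight check can catch a shortfall before the ~60GB download starts. The sketch below compares a machine against the minimums in the table. The name check_requirements and its parameters are hypothetical, and the VRAM and RAM figures are passed in by hand (for example, read off nvidia-smi) so the script needs only the standard library.

```python
import shutil
import sys

# Minimums copied from the requirements table above.
MIN_VRAM_GB = 14   # with model offloading
MIN_RAM_GB = 32
MIN_DISK_GB = 60   # approximate size of the full checkpoint set
MIN_PYTHON = (3, 10)


def free_disk_gb(path="."):
    """Free space on the filesystem holding `path`, in GB (10**9 bytes)."""
    return shutil.disk_usage(path).free / 1e9


def check_requirements(vram_gb, ram_gb, disk_gb, python_version=None):
    """Return a list of human-readable problems; an empty list means OK."""
    if python_version is None:
        python_version = tuple(sys.version_info[:2])
    problems = []
    if vram_gb < MIN_VRAM_GB:
        problems.append(f"VRAM: {vram_gb}GB available, {MIN_VRAM_GB}GB needed")
    if ram_gb < MIN_RAM_GB:
        problems.append(f"RAM: {ram_gb}GB available, {MIN_RAM_GB}GB needed")
    if disk_gb < MIN_DISK_GB:
        problems.append(f"Disk: {disk_gb:.0f}GB free, ~{MIN_DISK_GB}GB needed")
    if tuple(python_version) < MIN_PYTHON:
        problems.append(f"Python {python_version} is older than {MIN_PYTHON}")
    return problems
```

Calling check_requirements(24, 64, free_disk_gb()) on the tested RTX 4090 setup should return an empty list as long as the disk has room for the checkpoints.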
Installation
1. Clone the official Tencent repository
git clone https://github.com/Tencent-Hunyuan/HunyuanVideo-1.5.git
cd HunyuanVideo-1.5
These steps come verbatim from the official README.
2. Install Python dependencies
pip install -r requirements.txt
pip install -i https://mirrors.tencent.com/pypi/simple/ --upgrade tencentcloud-sdk-python
3. Install an attention backend
HunyuanVideo-1.5 uses variable-length attention masks. Per the HuggingFace diffusers integration notes, for an RTX 4090 the recommended backend is flash_hub or flash_varlen_hub (i.e. FlashAttention). Install FlashAttention from the Dao-AILab repository. The Ada Lovelace architecture (sm_89) has full FlashAttention-2 kernel coverage, so the default wheel works without special flags.
Optional but recommended for FP8 GEMM acceleration (added in the December 23, 2025 Tencent release):
pip install sgl-kernel==0.3.18
4. Download the checkpoints
From the official checkpoints-download.md:
hf download tencent/HunyuanVideo-1.5 --local-dir ./ckpts
hf download Qwen/Qwen2.5-VL-7B-Instruct --local-dir ./ckpts/text_encoder/llm
hf download google/byt5-small --local-dir ./ckpts/text_encoder/byt5-small
The repository ships the main DiT (including the 480p_i2v_step_distilled weights), the 3D causal VAE, and the glyph-aware text-encoder config. Qwen2.5-VL-7B-Instruct is the primary text encoder; byt5-small handles glyph-aware text rendering inside generated videos.
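Interrupted downloads are a common source of confusing load errors, so it can help to confirm the layout before the first run. This is a minimal sketch that checks only the three target directories named in the download commands above; the helper name missing_checkpoint_dirs is hypothetical, and the check does not validate individual weight files.

```python
from pathlib import Path

# Target directories of the three `hf download` commands, relative to ./ckpts.
EXPECTED_DIRS = (
    "",                         # tencent/HunyuanVideo-1.5 -> ./ckpts
    "text_encoder/llm",         # Qwen/Qwen2.5-VL-7B-Instruct
    "text_encoder/byt5-small",  # google/byt5-small
)


def missing_checkpoint_dirs(root="./ckpts"):
    """Return the expected directories that are absent or empty."""
    root = Path(root)
    missing = []
    for rel in EXPECTED_DIRS:
        path = root / rel
        if not path.is_dir() or not any(path.iterdir()):
            missing.append(str(path))
    return missing
```

An empty return value means every download at least created its directory and wrote something into it.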
Running
Option A - Official Tencent script (T2V, step-distilled)
The official launcher is generate.py invoked via torchrun. For a single RTX 4090, run one process; the README's N_INFERENCE_GPU=1 is passed here directly as --nproc_per_node=1:
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True,max_split_size_mb:128
PROMPT='A fluffy teddy bear sits on a bed of soft pillows surrounded by children toys.'
SEED=1
ASPECT_RATIO=16:9
RESOLUTION=480p
OUTPUT_PATH=./outputs/output.mp4
MODEL_PATH=./ckpts
torchrun --nproc_per_node=1 generate.py \
--prompt "$PROMPT" \
--resolution $RESOLUTION \
--aspect_ratio $ASPECT_RATIO \
--seed $SEED \
--rewrite false \
--enable_step_distill true \
--use_sageattn false \
--overlap_group_offloading true \
--output_path $OUTPUT_PATH \
--model_path $MODEL_PATH
Key flags:
- --enable_step_distill true selects the step-distilled checkpoint (8 or 12 inference steps recommended; up to 6x speedup per the official README).
- --overlap_group_offloading true keeps peak VRAM under 24GB by streaming layers between CPU RAM and the GPU. Set to false if you have a 40GB+ card and want maximum throughput.
- --rewrite false skips the LLM-based prompt rewriter; flip to true and set T2V_REWRITE_BASE_URL plus T2V_REWRITE_MODEL_NAME if you have a vLLM server hosting Qwen2.5-VL-7B-Instruct.
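If you launch many prompts, it may be easier to assemble the command in Python than to edit shell variables. The sketch below builds the same argument list as the Option A script, using only the flags shown there; the function name build_generate_command and its keyword names are hypothetical conveniences, not part of the Tencent repository.

```python
def build_generate_command(
    prompt,
    output_path="./outputs/output.mp4",
    model_path="./ckpts",
    resolution="480p",
    aspect_ratio="16:9",
    seed=1,
    step_distill=True,
    offloading=True,
    rewrite=False,
):
    """Return the torchrun argument list for a single-GPU T2V run."""

    def flag(value):
        # generate.py takes lowercase true/false strings in the script above.
        return "true" if value else "false"

    return [
        "torchrun", "--nproc_per_node=1", "generate.py",
        "--prompt", prompt,
        "--resolution", resolution,
        "--aspect_ratio", aspect_ratio,
        "--seed", str(seed),
        "--rewrite", flag(rewrite),
        "--enable_step_distill", flag(step_distill),
        "--use_sageattn", "false",
        "--overlap_group_offloading", flag(offloading),
        "--output_path", output_path,
        "--model_path", model_path,
    ]
```

Passing the list to subprocess.run(cmd, check=True) from inside the cloned repository avoids shell quoting problems with prompts that contain quotes or dollar signs.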
Option B - HuggingFace diffusers (T2V, single Python process)
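If you'd rather use the diffusers integration, the upstream API handles offloading for you. Note that this example loads the standard 480p T2V checkpoint and runs 50 inference steps, so expect the multi-minute runtime of the non-distilled variant rather than the 75-second figure: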
import torch
from diffusers import HunyuanVideo15Pipeline
from diffusers.utils import export_to_video
pipe = HunyuanVideo15Pipeline.from_pretrained(
"hunyuanvideo-community/HunyuanVideo-1.5-480p_t2v",
torch_dtype=torch.bfloat16,
)
pipe.transformer.set_attention_backend("flash_hub") # recommended for RTX 4090
pipe.enable_model_cpu_offload()
pipe.vae.enable_tiling()
prompt = "A fluffy teddy bear sits on a bed of soft pillows surrounded by children toys."
video = pipe(prompt=prompt, num_frames=61, num_inference_steps=50).frames[0]
export_to_video(video, "output.mp4", fps=15)
enable_model_cpu_offload() keeps inactive submodules on CPU and matches the official 14GB-minimum claim; enable_tiling() chunks the VAE decode so it fits alongside the DiT in 24GB.
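The clip length follows from the two numbers in the example above: num_frames frames written out at fps frames per second. A small sketch of that arithmetic, with a hypothetical helper name:

```python
def clip_seconds(num_frames, fps):
    """Playback length in seconds of a clip exported at a fixed frame rate."""
    if fps <= 0:
        raise ValueError("fps must be positive")
    return num_frames / fps


# The Option B call uses num_frames=61 and export fps=15.
print(f"{clip_seconds(61, 15):.2f} seconds")
```

With the values from Option B this comes to a little over four seconds; raising num_frames lengthens the clip but also increases generation time and memory use.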
Results
- Speed: Tencent's HF card states "On RTX 4090, end-to-end generation time is reduced by 75%, and a single RTX 4090 can generate videos within 75 seconds" when the step-distilled checkpoint is used. The insiderllm walkthrough clocks the standard (non-distilled) variant at 3-12 minutes per clip on RTX 4090.
- VRAM usage: 14GB minimum with --overlap_group_offloading true / pipe.enable_model_cpu_offload(), ~24GB without offloading - both cited by the official HF card and confirmed in the insiderllm walkthrough ("VRAM (standard) ~24GB", "VRAM down to 14GB with offloading").
- Quality notes: The step-distilled checkpoint "maintains comparable quality to the original model" per Tencent's release note, but is currently only released for 480p I2V and the 8-step T2V path (status as of the December 5, 2025 release). The standard 50-step T2V remains the higher-quality option when you can afford the 3-12 minute generation budget.
For full benchmark data, see /check/hunyuan-video/rtx-4090.
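As a rough sanity check, the two speed figures in the HF card quote can be combined to estimate the non-distilled baseline. This assumes the "75%" reduction and the "75 seconds" describe the same configuration, which the quote does not state explicitly; the helper name is hypothetical.

```python
def implied_baseline_seconds(distilled_seconds, reduction):
    """Baseline time t such that t * (1 - reduction) == distilled_seconds."""
    if not 0 <= reduction < 1:
        raise ValueError("reduction must be in [0, 1)")
    return distilled_seconds / (1 - reduction)


baseline = implied_baseline_seconds(75, 0.75)
print(f"{baseline:.0f} s, or {baseline / 60:.0f} min")
```

Under that assumption the baseline works out to 300 seconds (5 minutes), which falls inside the 3-12 minute range reported for the standard variant.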
Troubleshooting
OOM at 24GB even with offloading enabled
Set the PyTorch allocator hints before launching, per the official README:
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True,max_split_size_mb:128
Confirm that --overlap_group_offloading true is set (Tencent script) or that pipe.enable_model_cpu_offload() is called instead of pipe.to("cuda") (diffusers path); moving the whole pipeline to the GPU defeats the offloading. Without offloading, peak VRAM sits at the card's 24GB ceiling with no headroom.
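On the diffusers path there is no shell script to carry the export, so the allocator hint can be set from Python instead. The sketch below is one way to do that, assuming it runs before torch is imported (the allocator reads the variable when CUDA memory management initializes). The helper names are hypothetical; the value is the one from the official README quoted above.

```python
import os

# Allocator hints quoted from the official README.
ALLOC_CONF = "expandable_segments:True,max_split_size_mb:128"


def parse_alloc_conf(value):
    """Split 'key:value,key:value' into a dict for inspection or logging."""
    return dict(item.split(":", 1) for item in value.split(",") if item)


def ensure_alloc_conf(env=os.environ):
    """Set the hints unless the user has already exported their own."""
    env.setdefault("PYTORCH_CUDA_ALLOC_CONF", ALLOC_CONF)
    return env["PYTORCH_CUDA_ALLOC_CONF"]


# Call this at the very top of the script, before `import torch`.
active = ensure_alloc_conf()
print(parse_alloc_conf(active))
```

Using setdefault rather than plain assignment keeps any value you exported in the shell, which makes it easier to experiment with other settings without editing the script.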
Slow first run / loading the Qwen2.5-VL text encoder
First generation loads the 7B Qwen text encoder and the DiT - expect 60-120 seconds of one-time disk and CPU-to-GPU transfer before the actual 75-second inference begins. Subsequent generations in the same Python process reuse the loaded weights; each fresh launch of the script loads them again.
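To see how much of a run is loading and how much is inference, the two phases can be timed separately. This is a minimal standard-library sketch; the timed helper is hypothetical, and the commented lines only indicate where the Option B calls would go.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, results):
    """Record the wall-clock duration of the enclosed block under `label`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        results[label] = time.perf_counter() - start


timings = {}
with timed("load", timings):
    pass  # e.g. pipe = HunyuanVideo15Pipeline.from_pretrained(...)
with timed("generate", timings):
    pass  # e.g. video = pipe(prompt=prompt, ...).frames[0]

for label, seconds in timings.items():
    print(f"{label}: {seconds:.1f}s")
```

If the "load" figure dominates, batching several prompts into one process is likely to help more than tuning inference flags.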
"I want the original HunyuanVideo 13B, not 1.5"
The original tencent/HunyuanVideo model has very different runtime characteristics on the 4090 (community Q8 quantization via Kijai's ComfyUI wrappers gets it into ~24GB but at ~5-10 minutes per 5-second clip per the insiderllm walkthrough). The Q8 community pipeline is a separate recipe path - submit a contribution if you want it documented.