How much VRAM does HunyuanVideo 1.5 need?

About 14 GB — the minimum this recipe targets.

How hard is this setup?

Intermediate. Follow the installation and run steps below.

HunyuanVideo-1.5 on RTX 5090: 480p Step-Distilled I2V with an 8 GB Headroom Unlock

What You'll Build

A single-GPU image-to-video pipeline that turns a still image into a 480p clip on an RTX 5090, using Tencent's 8.3B-parameter HunyuanVideo-1.5 with the step-distilled checkpoint. The same install also runs the standard (non-distilled) 480p T2V/I2V and 720p variants. With 32 GB, the RTX 5090 is the first consumer card where running these without --overlap_group_offloading is comfortable rather than precarious.

Hardware data: RTX 5090 (32 GB VRAM) · ~75 seconds per 480p step-distilled I2V clip (vendor-published RTX 4090 anchor; community 5090 reports below) · See benchmark data

ℹ️ Why this card matters for HunyuanVideo-1.5. The model is engineered around a 14 GB offload-mode floor and a roughly 24 GB no-offload peak — both quoted by the official HF card. On the RTX 3090 and RTX 4090 the no-offload path just barely fits (the 3090 sibling recipe describes it as "razor-thin"). On a 32 GB RTX 5090 you have ~8 GB of comfortable headroom over that 24 GB ceiling, which means you can drop --overlap_group_offloading, enable FP8 GEMM via sgl-kernel, and still leave room for longer frame counts than the 3090/4090 siblings document.

ℹ️ This recipe is HunyuanVideo-1.5, not the original 13B HunyuanVideo. Tencent ships two distinct video models under the "HunyuanVideo" umbrella. HunyuanVideo (1.0) is a 13B model whose FP16 weights need 40 GB+ of VRAM — even the 32 GB RTX 5090 does not fit FP16 v1.0; community Q8 quantization via Kijai's ComfyUI-HunyuanVideoWrapper gets it into ~24 GB but is a separate recipe path. HunyuanVideo-1.5 is the late-2025 8.3B successor and is the only path covered here — /check/hunyuan-video/rtx-5090 measures this variant.

ℹ️ Step-distilled = 480P I2V only. The released step-distilled checkpoint covers 480p image-to-video (8 or 12 inference steps). For text-to-video on this card, use the 480P-T2V or 480P-T2V-cfg-distill variant — same install, slower runtime (no 8-step speedup), and the 5090's headroom lets you push to 720p T2V/I2V at full step counts where 24 GB cards must offload.

Requirements

Component	Minimum	Tested / Reference
GPU	14 GB VRAM with `--overlap_group_offloading true`	RTX 5090 (32 GB) — no offload required
RAM	16 GB (only consumed when CPU offload is enabled)	—
Storage	~60 GB for the full checkpoint set (DiT + VAE + text encoders)	—
Software	Python 3.10+, CUDA 12.8+, PyTorch 2.x with sm_120 wheels, Linux	—

Hardware notes — what 32 GB buys you

The official HF card is unambiguous: "Minimum GPU Memory: 14 GB (with model offloading enabled)" with the follow-up "If your GPU has sufficient memory, you may disable offloading for improved inference speed." On the RTX 5090 "sufficient memory" is genuinely true — the no-offload path peaks near 24 GB (the same envelope the 3090 sibling describes as "right at the boundary" of its 24 GB card), and the 5090 has ~8 GB of headroom on top of that ceiling. This unlocks three concrete improvements over the 24 GB siblings:

No-offload inference for the 480p step-distilled path, removing the CPU-staging overhead and shortening end-to-end wall time.
Native FP8 GEMM via sgl-kernel==0.3.18, which gives a real speedup on Blackwell sm_120 (see the FP8 note in installation step 5).
Longer clips: community reports on the canonical HF discussion thread document 5090 users running up to 365 frames (15-sec at 720p 848×480) successfully; 1600 frames OOMs even on 32 GB (HF discussion #3 — see Results).
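
The headroom argument above is simple arithmetic, and a minimal sketch makes it explicit. The two thresholds are the figures this recipe quotes (14 GB offload floor, roughly 24 GB no-offload peak); the function name plan_memory is hypothetical, and real usable VRAM is somewhat lower than the advertised capacity because the driver and desktop also hold memory.

```python
# Figures quoted in this recipe; adjust them if the model card changes.
OFFLOAD_FLOOR_GB = 14    # minimum with offloading enabled
NO_OFFLOAD_PEAK_GB = 24  # approximate peak without offloading


def plan_memory(total_vram_gb):
    """Decide whether offloading is needed and estimate the headroom left."""
    if total_vram_gb < OFFLOAD_FLOOR_GB:
        # Below the floor the model is not expected to fit at all.
        return {"fits": False, "offload": None, "headroom_gb": 0}
    offload = total_vram_gb < NO_OFFLOAD_PEAK_GB
    peak = OFFLOAD_FLOOR_GB if offload else NO_OFFLOAD_PEAK_GB
    return {"fits": True, "offload": offload, "headroom_gb": total_vram_gb - peak}


for name, vram in [("16 GB card", 16), ("RTX 3090 / 4090", 24), ("RTX 5090", 32)]:
    print(name, plan_memory(vram))
```

On these nominal numbers a 24 GB card lands at zero headroom without offloading, which is the "razor-thin" case, while the 32 GB card keeps about 8 GB free.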

Installation

1. Install a CUDA 12.8+ PyTorch wheel with sm_120 kernels

Blackwell sm_120 kernels ship in PyTorch wheels built for CUDA 12.8 and later. The standard pip install torch from a CUDA 12.6 wheel index will not include sm_120 kernels and will refuse to run on the RTX 5090. Pin the CUDA-12.8 wheel explicitly:

pip install torch --index-url https://download.pytorch.org/whl/cu128

This is the only Blackwell-specific change to the PyTorch install itself. Steps 2, 3, and 6 are identical to the 3090 / 4090 siblings; steps 4 and 5 choose the attention backend and FP8 kernels with sm_120 in mind.
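
Before continuing, it can help to confirm that the installed wheel actually carries kernels for the card. The sketch below is one way to check, assuming the usual torch.cuda introspection calls; the helper names has_arch and check_torch_build are hypothetical, and the script degrades to a message when torch or a GPU is absent.

```python
def has_arch(arch_list, capability):
    """True if an arch string such as 'sm_120' matches the (major, minor) capability."""
    major, minor = capability
    return f"sm_{major}{minor}" in arch_list


def check_torch_build():
    """Describe whether the installed PyTorch wheel ships kernels for GPU 0."""
    try:
        import torch
    except ImportError:
        return "torch is not installed"
    if not torch.cuda.is_available():
        return f"torch {torch.__version__} (CUDA build {torch.version.cuda}) sees no GPU"
    cap = torch.cuda.get_device_capability(0)
    archs = torch.cuda.get_arch_list()  # architectures compiled into this wheel
    status = "kernels present" if has_arch(archs, cap) else "kernels MISSING"
    return f"{torch.cuda.get_device_name(0)}: sm_{cap[0]}{cap[1]} {status}; wheel archs: {archs}"


print(check_torch_build())
```

On an RTX 5090 the reported capability should be (12, 0), so the check looks for sm_120 in the wheel's architecture list.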

2. Clone the official Tencent repository

git clone https://github.com/Tencent-Hunyuan/HunyuanVideo-1.5.git
cd HunyuanVideo-1.5

These steps come verbatim from the official README.

3. Install Python dependencies

pip install -r requirements.txt
pip install -i https://mirrors.tencent.com/pypi/simple/ --upgrade tencentcloud-sdk-python

4. Install SageAttention (the working Blackwell attention backend)

FlashAttention-2 wheel coverage for sm_120 is still an open issue (Dao-AILab/flash-attention#2168). pip install flash-attn against a CUDA 12.8 toolchain installs the package, but the kernels typically fail at runtime on the RTX 5090 with "no kernel image is available for execution on the device". The diffusers HunyuanVideo15 docs likewise route the RTX 5090 (and every GPU outside the H100/H800 and A100/A800/RTX 4090 rows) to sage_hub. Install SageAttention 2++, which has full sm_120 kernel coverage:

pip install sageattention

If you need FlashAttention specifically (for a downstream tool that hard-requires it), see Troubleshooting.

5. Install `sgl-kernel` for FP8 GEMM (real speed win on Blackwell)

pip install sgl-kernel==0.3.18

The Tencent release added FP8 GEMM inference support on December 23, 2025 (HF card News section). On Blackwell sm_120 this is genuinely useful: the sgl-kernel CMake configuration compiles sm_120a kernels when built against CUDA 12.8+, and Blackwell sm_120 has native FP8 tensor cores (unlike Ampere, where the 3090 sibling deliberately skips this step because FP8 is dequantized to BF16 on the fly). On the 5090 the package's CUDA 12.8 wheel ships the sm_120 path; install before your first run.
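
A quick preflight can confirm that both optional acceleration packages are visible to the interpreter before the first run. This is a minimal sketch using only the standard library; the module names sageattention and sgl_kernel are assumptions about how the pip packages expose themselves, and find_installed is a hypothetical helper. It only checks that the modules can be located, not that their kernels run on the GPU.

```python
import importlib.util

# pip package name -> import name (the import names are assumptions).
OPTIONAL_MODULES = {
    "sageattention": "sageattention",
    "sgl-kernel": "sgl_kernel",
}


def find_installed(modules=OPTIONAL_MODULES):
    """Map each pip package name to True if its module can be located."""
    return {
        pip_name: importlib.util.find_spec(module_name) is not None
        for pip_name, module_name in modules.items()
    }


for pip_name, present in find_installed().items():
    print(f"{pip_name}: {'found' if present else 'not found'}")
```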

6. Download the checkpoints

From the official checkpoints-download.md:

hf download tencent/HunyuanVideo-1.5 --local-dir ./ckpts
hf download Qwen/Qwen2.5-VL-7B-Instruct --local-dir ./ckpts/text_encoder/llm
hf download google/byt5-small --local-dir ./ckpts/text_encoder/byt5-small

The repository ships the main DiT (including the 480P-I2V-step-distill weights), the 3D causal VAE, and the glyph-aware text-encoder config. Qwen2.5-VL-7B-Instruct is the primary text encoder; byt5-small handles glyph-aware text rendering inside generated videos.
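
The same three downloads can be scripted from Python, with a free-space check against the ~60 GB storage requirement first. This is a sketch under assumptions: it uses huggingface_hub.snapshot_download, which is imported lazily so the planning helpers work without the package, and the names CHECKPOINTS, free_gb, and download_all are hypothetical. The space check is deliberately naive and will be too strict if part of the checkpoint set is already on disk.

```python
import shutil
from pathlib import Path

# (repo id, target directory) pairs mirroring the three hf download commands.
CHECKPOINTS = [
    ("tencent/HunyuanVideo-1.5", "ckpts"),
    ("Qwen/Qwen2.5-VL-7B-Instruct", "ckpts/text_encoder/llm"),
    ("google/byt5-small", "ckpts/text_encoder/byt5-small"),
]
REQUIRED_GB = 60  # approximate size of the full checkpoint set per this recipe


def free_gb(path="."):
    """Free disk space at path, in decimal gigabytes."""
    return shutil.disk_usage(path).free / 1e9


def download_all(root="."):
    """Fetch every checkpoint under root; raises if disk space looks too low."""
    from huggingface_hub import snapshot_download  # lazy: only needed here

    if free_gb(root) < REQUIRED_GB:
        raise RuntimeError(f"need ~{REQUIRED_GB} GB free, have {free_gb(root):.0f} GB")
    for repo_id, rel_dir in CHECKPOINTS:
        snapshot_download(repo_id=repo_id, local_dir=str(Path(root) / rel_dir))


print(f"free space: {free_gb():.0f} GB; planned downloads: {len(CHECKPOINTS)}")
# Call download_all() from the repository root to start the transfer.
```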

Running

Option A — Official Tencent script (480p I2V, step-distilled, no offload)

The 32 GB envelope lets you drop --overlap_group_offloading for the 480p step-distilled path. Keep the PyTorch allocator hints regardless:

export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True,max_split_size_mb:128

PROMPT='A fluffy teddy bear sits on a bed of soft pillows surrounded by children toys.'
SEED=1
ASPECT_RATIO=16:9
RESOLUTION=480p
OUTPUT_PATH=./outputs/output.mp4
MODEL_PATH=./ckpts
INPUT_IMAGE=./inputs/teddy.png

torchrun --nproc_per_node=1 generate.py \
  --task i2v \
  --image_path "$INPUT_IMAGE" \
  --prompt "$PROMPT" \
  --resolution $RESOLUTION \
  --aspect_ratio $ASPECT_RATIO \
  --seed $SEED \
  --rewrite false \
  --enable_step_distill true \
  --use_sageattn true \
  --overlap_group_offloading false \
  --output_path $OUTPUT_PATH \
  --model_path $MODEL_PATH

Key flags for the 5090:

--enable_step_distill true selects the 8 / 12-step distilled I2V checkpoint (recommended 8 or 12 steps, up to 75% speedup vs. the 50-step path per the official README).
--overlap_group_offloading false is the 5090's advantage. The 3090 sibling treats offloading as mandatory; with 32 GB you can run the no-offload path comfortably, which removes the CPU-staging round trip on every layer.
--rewrite false skips the LLM-based prompt rewriter (a separate vLLM-served Qwen2.5-VL-7B-Instruct). Enable only if you want the rewritten-prompt pipeline.
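
If you script runs from Python, the command above can be assembled as an argument list so that offloading becomes a single switch. This is a minimal sketch that reuses only the flags and allocator setting shown in Option A; build_command and build_env are hypothetical helpers, and nothing is launched unless you call subprocess.run yourself.

```python
import os
import shlex

ALLOC_CONF = "expandable_segments:True,max_split_size_mb:128"


def build_command(image_path, prompt, output_path="./outputs/output.mp4",
                  model_path="./ckpts", offload=False, seed=1):
    """Assemble the torchrun argv for 480p step-distilled I2V."""
    return [
        "torchrun", "--nproc_per_node=1", "generate.py",
        "--task", "i2v",
        "--image_path", image_path,
        "--prompt", prompt,
        "--resolution", "480p",
        "--aspect_ratio", "16:9",
        "--seed", str(seed),
        "--rewrite", "false",
        "--enable_step_distill", "true",
        "--use_sageattn", "true",
        "--overlap_group_offloading", "true" if offload else "false",
        "--output_path", output_path,
        "--model_path", model_path,
    ]


def build_env(base=None):
    """Copy the environment and add the allocator hints used above."""
    env = dict(os.environ if base is None else base)
    env["PYTORCH_CUDA_ALLOC_CONF"] = ALLOC_CONF
    return env


cmd = build_command("./inputs/teddy.png", "A fluffy teddy bear sits on a bed of soft pillows.")
print(shlex.join(cmd))
# To launch from the repository root (not executed here):
#   import subprocess
#   subprocess.run(cmd, env=build_env(), check=True)
```

Passing offload=True flips the one flag that 24 GB cards need, which keeps a single launcher usable across the sibling recipes.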

Option B — HuggingFace diffusers (480p I2V, single Python process)

If you'd rather use the diffusers integration, the whole pipeline runs in one Python process, and offloading becomes a single optional call that this card can skip:

import torch
from diffusers import HunyuanVideo15ImageToVideoPipeline
from diffusers.utils import export_to_video, load_image

pipe = HunyuanVideo15ImageToVideoPipeline.from_pretrained(
    "hunyuanvideo-community/HunyuanVideo-1.5-480p_i2v",
    torch_dtype=torch.bfloat16,
)
# Per the diffusers HunyuanVideo15 docs attention-backend table:
#   H100/H800            -> _flash_3_hub
#   A100/A800/RTX 4090   -> flash_hub
#   Other GPUs (5090,..) -> sage_hub
# RTX 5090 is NOT in the flash_hub row; the documented mapping is sage_hub.
pipe.transformer.set_attention_backend("sage_hub")
pipe.vae.enable_tiling()  # keep the VAE decode chunk-streamed even when DiT fits
pipe.to("cuda")            # no enable_model_cpu_offload() — 32 GB has headroom

image = load_image("./inputs/teddy.png")
prompt = "A fluffy teddy bear sits on a bed of soft pillows surrounded by children toys."
video = pipe(image=image, prompt=prompt, num_frames=61, num_inference_steps=8).frames[0]
export_to_video(video, "output.mp4", fps=15)

Two 5090-specific deviations from the 4090 sibling worth flagging:

Attention backend. The diffusers HunyuanVideo15 docs publish an explicit table: "H100/H800: _flash_3_hub or _flash_3_varlen_hub; A100/A800/RTX 4090: flash_hub or flash_varlen_hub; Other GPUs: sage_hub". The RTX 5090 (Blackwell sm_120) appears in neither named row, so the documented mapping is sage_hub. Do not inherit flash_hub from the 4090 sibling without checking: the table routes everything outside the H100/H800 and A100/A800/RTX 4090 rows to sage_hub. SageAttention has full sm_120 kernel coverage; FlashAttention's flash_hub path is not validated for Blackwell in this table.
No enable_model_cpu_offload(). With 32 GB the offload call is optional rather than required. Keep it in reserve for the longer clips covered in Troubleshooting.

Results

Speed: The HunyuanVideo-1.5 HF card publishes the vendor anchor for the 480p step-distilled I2V path on RTX 4090: "On RTX 4090, end-to-end generation time is reduced by 75%, and a single RTX 4090 can generate videos within 75 seconds". Tencent has not published an RTX 5090 figure for the step-distilled path. The closest available 5090 measurements are community reports on HF discussion #3, which test the non-distilled 720p path rather than this recipe's step-distilled 480p. Community user tlennon-ie (not a Tencent team member) measured 284 seconds for 121 frames at 720p 848×480 on an RTX 5090; community user Whitsu measured 297 seconds at the same settings on an undervolted RTX 4090. That makes the 5090 roughly 4 to 5% faster on that path. Note that 848×480 is a 480p-class frame size, so treat the 720p label on these runs with caution. These numbers corroborate that the 5090 runs the model cleanly, but they are not the same variant as the recipe's 480p step-distilled path; expect the step-distilled path to land at or below the vendor's 75-second RTX 4090 anchor. Once a step-distilled 5090 measurement lands, /check/hunyuan-video/rtx-5090 will replace this attribution. Submit a contribution if you have a clean run.
VRAM usage: 14 GB minimum with --overlap_group_offloading true (or pipe.enable_model_cpu_offload()), and ~24 GB peak without offloading — both cited verbatim by the official HF card. On a 32 GB RTX 5090 the no-offload path leaves ~8 GB of comfortable headroom, which is the recipe's framing: drop offloading on 480p step-distilled, enable FP8 GEMM via sgl-kernel, and reserve the headroom for longer-clip experimentation on the 720p variants (Troubleshooting covers the 365-frame and 1600-frame community reports).
Quality notes: The step-distilled checkpoint "maintains comparable quality to the original model" per Tencent's release note, but it is currently released only for the 480p I2V path. The 5090's headroom is most visible on the standard (non-distilled) 720p T2V/I2V variants: where 24 GB cards must offload, the 5090 can stay GPU-resident, at the cost of multi-minute generation budgets. Per HF discussion #3, community user tlennon-ie reports ~5 minutes per 5-second clip at 720p 848×480 on RTX 5090, with longer clips scaling roughly linearly to ~15 minutes for 365 frames.
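
The community figures above can be sanity-checked with a few lines of arithmetic. This sketch assumes a 24 fps output rate, which is inferred from 121 frames being described as a 5-second clip and 365 frames as 15 seconds rather than stated in the reports; the function names are hypothetical.

```python
FPS = 24  # assumption inferred from the frame counts and clip lengths quoted above


def clip_seconds(frames, fps=FPS):
    """Playback length of a clip in seconds."""
    return frames / fps


def time_saved_percent(slower_s, faster_s):
    """How much less wall time the faster run needs, as a percentage."""
    return 100 * (slower_s - faster_s) / slower_s


seconds_per_frame = 284 / 121  # RTX 5090 community run

print(round(clip_seconds(121), 2))                  # 5.04
print(round(clip_seconds(365), 2))                  # 15.21
print(round(time_saved_percent(297, 284), 1))       # 4.4
print(round(297 / 284, 3))                          # 1.046
print(round(365 * seconds_per_frame / 60, 1))       # 14.3
```

The last line extrapolates the 121-frame run linearly to 365 frames and lands at about 14.3 minutes, consistent with the reported ~15 minutes.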

For full benchmark data, see /check/hunyuan-video/rtx-5090. The pair currently shows verdict: unknown (no benchmark) — please submit a contribution with your own measurement once you have a clean run.
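
If you plan to contribute a measurement, a small harness that records wall time and peak allocated VRAM keeps runs comparable. The sketch below is one possible approach, assuming PyTorch's standard peak-memory counters; measure is a hypothetical helper, and it falls back to timing only when torch or a GPU is unavailable. Note that max_memory_allocated counts tensors held by PyTorch's allocator, so it reads lower than the total shown by nvidia-smi.

```python
import time


def measure(fn, *args, **kwargs):
    """Run fn once; return (result, seconds, peak_vram_gib or None)."""
    try:
        import torch
        cuda = torch.cuda.is_available()
    except ImportError:
        torch, cuda = None, False

    if cuda:
        torch.cuda.reset_peak_memory_stats()
        torch.cuda.synchronize()  # do not time work queued earlier
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    if cuda:
        torch.cuda.synchronize()  # wait for queued GPU work to finish
    seconds = time.perf_counter() - start
    peak_gib = torch.cuda.max_memory_allocated() / 1024**3 if cuda else None
    return result, seconds, peak_gib


# Stand-in workload; with Option B you would wrap the pipeline call instead, e.g.
#   measure(lambda: pipe(image=image, prompt=prompt,
#                        num_frames=61, num_inference_steps=8).frames[0])
total, seconds, peak = measure(sum, range(1_000_000))
print(f"{seconds:.3f} s, peak VRAM: {peak}")
```

Report the first run separately from later ones, since the first call usually includes one-time warm-up costs.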

Troubleshooting

`flash_hub` attention backend errors or silently mis-dispatches on RTX 5090

The 4090 sibling recipe uses pipe.transformer.set_attention_backend("flash_hub"). On Blackwell sm_120 this is not the documented mapping — the diffusers HunyuanVideo15 docs explicitly route "Other GPUs" (which includes the RTX 5090) to sage_hub. If you copy the 4090 code verbatim and hit a backend error, switch to sage_hub:

pipe.transformer.set_attention_backend("sage_hub")

SageAttention ships full sm_120 kernels; FlashAttention's flash_hub path is validated by the diffusers team for the A100/A800/RTX 4090 row only.
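
To avoid copying the wrong backend between sibling recipes, the documented table can be encoded as a lookup keyed on the device name. This is a minimal sketch: the backend strings come from the table quoted in this recipe, while the substring matching on the device name and the helper pick_attention_backend are assumptions made for illustration.

```python
# Rows of the attention-backend table quoted in this recipe.
BACKEND_TABLE = [
    (("H100", "H800"), "_flash_3_hub"),
    (("A100", "A800", "RTX 4090"), "flash_hub"),
]
DEFAULT_BACKEND = "sage_hub"  # the "Other GPUs" row, which covers the RTX 5090


def pick_attention_backend(device_name):
    """Return the documented backend for a CUDA device name string."""
    for needles, backend in BACKEND_TABLE:
        if any(needle in device_name for needle in needles):
            return backend
    return DEFAULT_BACKEND


for name in ["NVIDIA H100 80GB HBM3", "NVIDIA GeForce RTX 4090", "NVIDIA GeForce RTX 5090"]:
    print(name, "->", pick_attention_backend(name))

# With a loaded pipeline (not executed here):
#   import torch
#   pipe.transformer.set_attention_backend(
#       pick_attention_backend(torch.cuda.get_device_name(0)))
```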

`pip install torch` from the default index refuses to run on the RTX 5090

The default PyTorch wheel from PyPI may be built against CUDA 12.6, which predates sm_120 kernel availability. Install the CUDA 12.8 wheel explicitly:

pip install torch --index-url https://download.pytorch.org/whl/cu128

The cu128 wheel includes Blackwell kernels for sm_120. The same applies to sgl-kernel==0.3.18 (the published wheel is built with SGL_KERNEL_ENABLE_SM100A on CUDA 12.8+, which compiles sm_120a per the sgl-kernel CMakeLists.txt).

`flash-attn` errors with "no kernel image is available" on the RTX 5090

FlashAttention-2 sm_120 wheel coverage is still tracked as an open issue at Dao-AILab/flash-attention#2168. pip install flash-attn against CUDA 12.8 installs the package, but the kernels typically fail at runtime on Blackwell. This recipe routes through SageAttention 2++ instead (full sm_120 coverage; pip install sageattention is the entire install) — both Option A (--use_sageattn true) and Option B (pipe.transformer.set_attention_backend("sage_hub")) wire it in. If a downstream tool you add later hard-requires FlashAttention, watch the issue for the eventual sm_120 wheel release.

Pushing past 365 frames OOMs even on 32 GB

Community user tlennon-ie reports on HF discussion #3 that 1600-frame generation OOMs on the RTX 5090, while 365 frames (about 15 seconds at 720p 848×480) completes in ~15 minutes. The officially supported clip length is 5–10 seconds per the model card, so even 365 frames is already past the design envelope. Mitigations: drop back to 480p, re-enable --overlap_group_offloading true to push the resident peak back toward the 14 GB floor, or split the clip across multiple runs.
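
The mitigations above amount to a fallback ladder: try the fastest configuration first and step down when memory runs out. The sketch below shows that control flow only. The helper names are hypothetical, the attempts are plain callables that you would supply (for example, one that runs the pipeline as is and one that first calls pipe.enable_model_cpu_offload()), and it assumes a failed attempt releases its tensors before the next one starts.

```python
def oom_types():
    """Exception types treated as out-of-memory."""
    types = [MemoryError]
    try:
        import torch
        types.append(torch.cuda.OutOfMemoryError)
    except (ImportError, AttributeError):
        pass  # torch missing or too old: fall back to MemoryError only
    return tuple(types)


def first_that_fits(attempts):
    """attempts: list of (label, zero-argument callable). Returns (label, result)."""
    errors = oom_types()
    for label, run in attempts:
        try:
            return label, run()
        except errors:
            print(f"{label}: out of memory, trying the next option")
    raise RuntimeError("every configuration ran out of memory")


def simulated_oom():
    """Stand-in for a run that does not fit."""
    raise MemoryError("simulated OOM")


label, result = first_that_fits([
    ("720p, no offload", simulated_oom),
    ("480p, offload enabled", lambda: "clip.mp4"),
])
print(label, result)
```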

"Can I run the original HunyuanVideo 13B FP16 on a 5090?"

No. The original tencent/HunyuanVideo at FP16 needs 40 GB+ of VRAM and does not fit even the 32 GB RTX 5090 in the official runtime. Per Kijai's ComfyUI-HunyuanVideoWrapper the community Q8 quantization gets it into ~24 GB, which the 5090 fits comfortably — but no first-party 5090 timing for the Q8 wrapper path is currently published. Submit a contribution if you have measured numbers, or open the wrapper repo's Issues for community reports. The Q8 wrapper is a separate recipe path from this 1.5 8.3B walkthrough.