What You'll Build
A single-GPU image-to-video pipeline that turns a still image into a 480p clip on an RTX 3090, using Tencent's 8.3B-parameter HunyuanVideo-1.5 with the step-distilled checkpoint. The same install also runs the standard (non-distilled) 480p T2V/I2V and 720p variants if you want higher quality at much longer generation times.
Hardware data: RTX 3090 (24 GB VRAM) · ~105–112 s per 480p step-distilled I2V clip (extrapolated from RTX 4090's ~75 s, see Results) · See benchmark data
⚠️ Razor-thin VRAM envelope. Tencent's official HF card cites peak memory near 24 GB without offloading and 14 GB with offloading. On a 24 GB RTX 3090 you are on the edge of the no-offload path: close the browser, kill spurious CUDA processes, and run with --overlap_group_offloading true for a safe margin. See Troubleshooting before your first run.
ℹ️ This recipe is HunyuanVideo-1.5, not the original 13B HunyuanVideo. Tencent ships two distinct video models under the "HunyuanVideo" umbrella. HunyuanVideo (1.0) is a 13B model whose FP16 weights need 40 GB+ of VRAM and was, per an independent walkthrough, only usable on consumer GPUs through aggressive Q8 community quantization. HunyuanVideo-1.5 is the late-2025 8.3B successor explicitly designed to fit a single 24 GB card with BF16 weights and step distillation. We anchor the recipe on 1.5 because it is the only path that fits the RTX 3090 in a sensible runtime budget.
ℹ️ Step-distilled = 480P I2V only. The released step-distilled checkpoint covers 480p image-to-video (8 or 12 inference steps). If you want text-to-video on this card, use the 480P-T2V or 480P-T2V-cfg-distill variant — same install, slower runtime (no 8-step speedup), and the same razor-thin VRAM ceiling.
Requirements
| Component | Minimum | Tested / Reference |
|---|---|---|
| GPU | 14 GB VRAM with --overlap_group_offloading true | RTX 3090 (24 GB) — see Hardware notes |
| RAM | 32 GB (used by CPU offload during inference) | — |
| Storage | ~60 GB for the full checkpoint set (DiT + VAE + text encoders) | — |
| Software | Python 3.10+, CUDA 12.x, PyTorch 2.x, Linux | — |
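The software, storage, and RAM rows above can be checked before anything is installed. The following is a minimal pre-flight sketch, not part of the official repository: the names check_requirements and probe_host are hypothetical, the thresholds come from the table, and the RAM default is set slightly below 32 on the assumption that a nominal 32 GB machine reports a little less than 32 GiB to the OS.

```python
import os
import shutil
import sys

GIB = 1024 ** 3


def check_requirements(python_version, free_disk_gib, ram_gib,
                       min_python=(3, 10), min_disk_gib=60, min_ram_gib=30):
    """Return a list of human-readable problems; an empty list means OK.

    min_ram_gib=30 is an assumption: a nominal 32 GB machine usually
    reports slightly less than 32 GiB of usable memory.
    """
    problems = []
    if tuple(python_version[:2]) < min_python:
        problems.append(
            f"Python {python_version[0]}.{python_version[1]} is older than "
            f"{min_python[0]}.{min_python[1]}"
        )
    if free_disk_gib < min_disk_gib:
        problems.append(f"only {free_disk_gib:.0f} GiB free, need ~{min_disk_gib}")
    if ram_gib < min_ram_gib:
        problems.append(f"only {ram_gib:.0f} GiB RAM, need ~{min_ram_gib}")
    return problems


def probe_host(path="."):
    """Collect the three inputs from this machine (RAM probe is Linux-only)."""
    free_disk_gib = shutil.disk_usage(path).free / GIB
    ram_gib = os.sysconf("SC_PHYS_PAGES") * os.sysconf("SC_PAGE_SIZE") / GIB
    return sys.version_info, free_disk_gib, ram_gib
```

Calling check_requirements(*probe_host(".")) from the directory where the checkpoints will live returns an empty list when all three rows are satisfied. The GPU row is deliberately left out because it needs a CUDA-aware tool to verify.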
Hardware notes — the 24 GB on 24 GB squeeze
The HF card is unambiguous: "Minimum GPU Memory: 14 GB (with model offloading enabled)" with the follow-up that "If your GPU has sufficient memory, you may disable offloading for improved inference speed." For the RTX 3090, "sufficient memory" is right at the boundary — the no-offload path peaks near 24 GB per the sibling RTX 4090 recipe which cited the same envelope. On a 4090 in a desktop with a second display GPU you have slack; on a 3090 that is also driving the desktop, the OS, browser, and any background CUDA process compete for the last few hundred megabytes. Default to the offload path on this card.
Installation
1. Clone the official Tencent repository
git clone https://github.com/Tencent-Hunyuan/HunyuanVideo-1.5.git
cd HunyuanVideo-1.5
These steps come verbatim from the official README.
2. Install Python dependencies
pip install -r requirements.txt
pip install -i https://mirrors.tencent.com/pypi/simple/ --upgrade tencentcloud-sdk-python
3. Install FlashAttention
HunyuanVideo-1.5 uses variable-length attention masks. Install FlashAttention from the Dao-AILab repository — the Ampere architecture (RTX 3090 = sm_86) has full FlashAttention-2 kernel coverage, so the default wheel works without special flags.
ℹ️ No FP8 path on Ampere. Unlike the Ada Lovelace (RTX 4090) and Hopper (H100) recipes that may use FP8 GEMM kernels (e.g. sgl-kernel), Ampere sm_86 has no FP8 tensor cores: it supports FP16 / BF16 / INT8 / TF32 only. HunyuanVideo-1.5 is BF16-native and step-distilled out of the box, so this is not a problem: the recipe runs cleanly in BF16 with FlashAttention-2 on the 3090, and you do not need to install sgl-kernel==0.3.18. Skip that step from any 4090-based walkthrough.
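If you are unsure which side of the FP8 line a given card falls on, the compute capability settles it. The sketch below assumes the usual mapping (FP8 tensor cores start at compute capability 8.9, i.e. Ada, and continue with 9.0 Hopper); the helper names are hypothetical, and the PyTorch probe is optional and returns None when PyTorch or CUDA is unavailable.

```python
def has_fp8_tensor_cores(major, minor):
    """True for compute capability 8.9 (Ada), 9.0 (Hopper), and newer."""
    return (major, minor) >= (8, 9)


def describe_gpu(index=0):
    """Return ("sm_XY", fp8_supported) for the local GPU, or None.

    Uses torch.cuda.get_device_capability; returns None when PyTorch is
    not installed or no CUDA device is visible.
    """
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    major, minor = torch.cuda.get_device_capability(index)
    return f"sm_{major}{minor}", has_fp8_tensor_cores(major, minor)
```

On an RTX 3090 the capability is (8, 6), so has_fp8_tensor_cores returns False and the sgl-kernel step can be skipped.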
4. Download the checkpoints
From the official checkpoints-download.md:
hf download tencent/HunyuanVideo-1.5 --local-dir ./ckpts
hf download Qwen/Qwen2.5-VL-7B-Instruct --local-dir ./ckpts/text_encoder/llm
hf download google/byt5-small --local-dir ./ckpts/text_encoder/byt5-small
The repository ships the main DiT (including the 480P-I2V-step-distill weights), the 3D causal VAE, and the glyph-aware text-encoder config. Qwen2.5-VL-7B-Instruct is the primary text encoder; byt5-small handles glyph-aware text rendering inside generated videos.
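A partial or interrupted download is easy to miss until the launcher fails. As a rough sanity check, the sketch below only confirms that the three --local-dir targets from the commands above exist and are non-empty. It does not validate individual weight files, and the function name is hypothetical.

```python
from pathlib import Path

# Sub-directories taken from the --local-dir arguments in the download commands.
EXPECTED_SUBDIRS = ("text_encoder/llm", "text_encoder/byt5-small")


def missing_checkpoint_dirs(root="./ckpts"):
    """Return the expected directories that are absent or empty."""
    root = Path(root)
    candidates = [root] + [root / rel for rel in EXPECTED_SUBDIRS]
    return [
        str(path)
        for path in candidates
        if not path.is_dir() or not any(path.iterdir())
    ]
```

An empty return value means all three locations contain something; any listed path is a candidate for re-running the matching hf download command.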
Running
Option A — Official Tencent script (480p I2V, step-distilled)
The official launcher is generate.py invoked via torchrun. For a single RTX 3090, set --nproc_per_node=1 and enable group offloading explicitly:
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True,max_split_size_mb:128
PROMPT='A fluffy teddy bear sits on a bed of soft pillows surrounded by children toys.'
SEED=1
ASPECT_RATIO=16:9
RESOLUTION=480p
OUTPUT_PATH=./outputs/output.mp4
MODEL_PATH=./ckpts
INPUT_IMAGE=./inputs/teddy.png
torchrun --nproc_per_node=1 generate.py \
--task i2v \
--image_path "$INPUT_IMAGE" \
--prompt "$PROMPT" \
--resolution $RESOLUTION \
--aspect_ratio $ASPECT_RATIO \
--seed $SEED \
--rewrite false \
--enable_step_distill true \
--use_sageattn false \
--overlap_group_offloading true \
--output_path $OUTPUT_PATH \
--model_path $MODEL_PATH
Key flags for the 3090:
- --enable_step_distill true selects the 8 / 12-step distilled I2V checkpoint (recommended 8 or 12 steps; up to 75% speedup vs. the 50-step path per the official README).
- --overlap_group_offloading true is non-optional on a 24 GB 3090: it streams layers between CPU RAM and the GPU and keeps peak resident VRAM near the 14 GB floor.
- --rewrite false skips the LLM-based prompt rewriter (a separate vLLM-served Qwen2.5-VL-7B-Instruct). Enable only if you have the extra VRAM elsewhere.
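For batch runs over several images it can be convenient to drive the same launcher from Python. The sketch below is one possible wrapper, not an official API: build_i2v_command and run_i2v are hypothetical names, and every flag is copied from the shell invocation above. Passing the command as an argument list avoids shell-quoting problems in prompts.

```python
import os
import subprocess

# Allocator hints copied from the shell example above.
ALLOC_CONF = "expandable_segments:True,max_split_size_mb:128"


def build_i2v_command(image_path, prompt, output_path, model_path="./ckpts",
                      seed=1, resolution="480p", aspect_ratio="16:9"):
    """Build the torchrun argv for a step-distilled 480p I2V run."""
    return [
        "torchrun", "--nproc_per_node=1", "generate.py",
        "--task", "i2v",
        "--image_path", image_path,
        "--prompt", prompt,
        "--resolution", resolution,
        "--aspect_ratio", aspect_ratio,
        "--seed", str(seed),
        "--rewrite", "false",
        "--enable_step_distill", "true",
        "--use_sageattn", "false",
        "--overlap_group_offloading", "true",
        "--output_path", output_path,
        "--model_path", model_path,
    ]


def run_i2v(image_path, prompt, output_path, **kwargs):
    """Launch one generation from the repository root; raises on failure."""
    env = dict(os.environ, PYTORCH_CUDA_ALLOC_CONF=ALLOC_CONF)
    command = build_i2v_command(image_path, prompt, output_path, **kwargs)
    return subprocess.run(command, env=env, check=True)
```

Each run_i2v call starts a fresh process, so the model is reloaded every time; that is the price of reusing the official script unchanged.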
Option B — HuggingFace diffusers (480p I2V, single Python process)
If you'd rather use the diffusers integration, the upstream API handles offloading for you:
import torch
from diffusers import HunyuanVideo15ImageToVideoPipeline
from diffusers.utils import export_to_video
from diffusers.utils import load_image
pipe = HunyuanVideo15ImageToVideoPipeline.from_pretrained(
"hunyuanvideo-community/HunyuanVideo-1.5-480p_i2v",
torch_dtype=torch.bfloat16,
)
pipe.transformer.set_attention_backend("sage_hub") # per diffusers docs: Ampere/Other GPUs → sage_hub (flash_hub is mapped to A100/A800/RTX 4090)
pipe.enable_model_cpu_offload() # required on 24 GB 3090
pipe.vae.enable_tiling() # chunks VAE decode to fit alongside DiT
image = load_image("./inputs/teddy.png")
prompt = "A fluffy teddy bear sits on a bed of soft pillows surrounded by children toys."
video = pipe(image=image, prompt=prompt, num_frames=61, num_inference_steps=8).frames[0]
export_to_video(video, "output.mp4", fps=15)
enable_model_cpu_offload() matches the official 14 GB floor; enable_tiling() is what keeps the VAE decode from blowing the 24 GB envelope at the end of the run.
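Because no measured RTX 3090 number is published yet, it is worth timing your own run. The context manager below is a small measurement sketch (the name measure and the result keys are hypothetical): it records wall-clock seconds and, when PyTorch can see a CUDA device, the peak memory reported by torch.cuda.max_memory_allocated. That counter covers PyTorch tensor allocations only, so nvidia-smi will typically show a higher figure.

```python
import time
from contextlib import contextmanager


@contextmanager
def measure(label, results):
    """Store elapsed seconds (and peak CUDA GiB, if available) in results[label]."""
    try:
        import torch
        cuda = torch.cuda.is_available()
    except ImportError:
        torch, cuda = None, False
    if cuda:
        torch.cuda.reset_peak_memory_stats()
    start = time.perf_counter()
    try:
        yield results
    finally:
        if cuda:
            # Wait for queued GPU work so the timing covers the whole run.
            torch.cuda.synchronize()
        entry = {"seconds": time.perf_counter() - start}
        if cuda:
            entry["peak_gib"] = torch.cuda.max_memory_allocated() / 1024 ** 3
        results[label] = entry
```

To use it, create results = {} and wrap the pipe(...) call from the example above in a with measure("i2v_480p_8step", results): block, then print results after export_to_video.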
Results
- Speed: No vendor-published RTX 3090 number exists for the step-distilled path. The Tencent HunyuanVideo-1.5 model card News section publishes the RTX 4090 anchor: "On RTX 4090, end-to-end generation time is reduced by 75%, and a single RTX 4090 can generate videos within 75 seconds", tied directly to the 480p I2V step-distilled variant this recipe targets. The InsiderLLM 2026 video-generation guide publishes a generic RTX 3090 scaling rule: "Add 40-50% to all RTX 4090 times" (their worked example applies it to Wan 14B, not HunyuanVideo specifically). Applying that rule to the HF card's 75-second 4090 anchor yields roughly 105–112 seconds per 480p I2V step-distilled clip on the 3090. Caveat: InsiderLLM's own HunyuanVideo 1.5 table row is labeled 720p (not 480p), so this extrapolation transfers an architecture-scaling rule but not a same-resolution measurement. Treat it as an estimate, not a measurement; once community benchmark data lands, /check/hunyuan-video/rtx-3090 will replace it.
- VRAM usage: 14 GB minimum with --overlap_group_offloading true (or pipe.enable_model_cpu_offload()), ~24 GB without offloading, both cited verbatim by the official HF card. On a 24 GB RTX 3090 the no-offload path is at the ceiling, not under it. See Troubleshooting if you decide to try it.
- Quality notes: The step-distilled checkpoint "maintains comparable quality to the original model" per Tencent's release note, but is currently only released for the 480p I2V path (status per the late-2025 release). The standard 50-step T2V / I2V remains the higher-quality option when you can afford the multi-minute generation budget. On the 3090, with offloading mandatory, expect the standard path to run noticeably slower than on a 4090 (apply the same ~40–50% scale-up).
For full benchmark data, see /check/hunyuan-video/rtx-3090. The pair currently shows verdict: unknown (no benchmark) — please submit a contribution with your own measurement once you have a clean run.
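The speed estimate above is plain multiplication, and writing it out makes the assumptions explicit. In this sketch the 75-second anchor and the 40–50% factors are the quoted figures; the function name is hypothetical, and the result is an extrapolation rather than a measurement.

```python
def scale_from_4090(anchor_seconds, low_factor=1.40, high_factor=1.50):
    """Apply the generic "add 40-50%" rule to an RTX 4090 timing."""
    return anchor_seconds * low_factor, anchor_seconds * high_factor


low_s, high_s = scale_from_4090(75.0)
print(f"estimated RTX 3090 time: {low_s:.1f} to {high_s:.1f} s")
```

The upper bound works out to 112.5 s, which the text truncates to 112. The same helper can be pointed at any other 4090 timing, with the same caveat that the scaling rule was not derived from HunyuanVideo runs.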
Troubleshooting
Out of memory on a 24 GB 3090 even at default settings
The 24 GB envelope leaves zero headroom for the desktop, browser, or background CUDA processes. Mitigations in order:
- Enable group offloading (this is the default; only retry if you turned it off): set --overlap_group_offloading true (Tencent script) or call pipe.enable_model_cpu_offload() instead of pipe.to("cuda") (diffusers path). Moving the pipeline to the GPU manually defeats the offload.
- Set the PyTorch allocator hints before launching (verbatim from the official README): export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True,max_split_size_mb:128
- Free competing VRAM: close the browser, switch to a TTY (Ctrl+Alt+F3 on most Linux desktops), and run nvidia-smi to confirm <500 MB is in use before launch. On a single-GPU workstation that also drives the display, even an idle desktop can claim 600–800 MB.
- Drop to 480p I2V if you were experimenting with 720p variants: the 720p T2V/I2V paths push peak VRAM well over 14 GB even with offloading, and on a 24 GB 3090 the safe path is 480p step-distilled.
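The nvidia-smi check in the list above can be scripted so it runs before every launch. The sketch below assumes nvidia-smi is on PATH and uses its CSV query mode; the function names and the 500 MiB default (taken from the guidance above) are illustrative choices.

```python
import subprocess

# CSV query: one "used, total" line per GPU, values in MiB, no header or units.
QUERY = ["nvidia-smi", "--query-gpu=memory.used,memory.total",
         "--format=csv,noheader,nounits"]


def parse_memory(csv_text):
    """Parse nvidia-smi CSV output into a list of (used_mib, total_mib)."""
    rows = []
    for line in csv_text.strip().splitlines():
        used, total = (int(field) for field in line.split(","))
        rows.append((used, total))
    return rows


def gpus_over_budget(rows, max_used_mib=500):
    """Return the indices of GPUs already using more than the budget."""
    return [i for i, (used, _total) in enumerate(rows) if used > max_used_mib]


def check_local_gpus(max_used_mib=500):
    """Run nvidia-smi and report GPUs that are too busy to launch on."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return gpus_over_budget(parse_memory(result.stdout), max_used_mib)
```

An empty list from check_local_gpus() means every visible GPU is under the budget; a non-empty list is a cue to close applications or switch to a TTY before starting generation.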
Slow first run / loading the Qwen2.5-VL text encoder
The first generation loads the 7B Qwen text encoder and the DiT; expect 60–120 seconds of one-time disk-to-CPU-to-GPU transfer before inference begins. Subsequent generations in the same process reuse the loaded weights. On the 3090 with offloading, this initial load can run slightly longer than on a 4090. PCIe bandwidth is the same Gen4 x16 on both cards, so the bus is not the cause; CPU-side staging dominates the load time.
"Can I run the original HunyuanVideo 13B on a 3090?"
The original tencent/HunyuanVideo at FP16 needs 40 GB+ of VRAM and does not fit a single 3090 in the official runtime. Per Kijai's ComfyUI-HunyuanVideoWrapper the community Q8 quantization gets it into ~24 GB, but no first-party 3090 timing for that path is currently published — submit a contribution if you have measured numbers, or open the wrapper repo's Issues for community reports. It is a separate recipe path from this 1.5 8.3B walkthrough.