HunyuanVideo-1.5 on RTX 3090 Ti: 480p Step-Distilled Image-to-Video on the Same Razor-Thin 24 GB Envelope

What You'll Build

A single-GPU image-to-video pipeline that turns a still image into a 480p clip on an RTX 3090 Ti, using Tencent's 8.3B-parameter HunyuanVideo-1.5 with the step-distilled checkpoint. The same install also runs the standard (non-distilled) 480p T2V/I2V and 720p variants if you want higher quality at much longer generation times.

Hardware data: RTX 3090 Ti (24 GB VRAM) · ~100–105 s per 480p step-distilled I2V clip (estimated from the RTX 3090 recipe, not measured; see Results) · See benchmark data

⚠️ Razor-thin VRAM envelope — same as the non-Ti 3090. The 3090 Ti has the same 24 GB VRAM as the 3090; the Ti's extra memory bandwidth and shader count do not relax the envelope. Tencent's official HF card lists a 14 GB minimum with offloading enabled and notes that the no-offload path is for cards with sufficient memory headroom. On the 3090 Ti, "sufficient" is right at the 24 GB ceiling — close the browser, kill spurious CUDA processes, and run with --overlap_group_offloading true for a safe margin. See Troubleshooting before your first run.

ℹ️ This recipe is HunyuanVideo-1.5, not the original 13B HunyuanVideo. Tencent ships two distinct video models under the "HunyuanVideo" umbrella. HunyuanVideo (1.0) is a 13B model whose FP16 weights need 40 GB+ of VRAM and does not fit a single 3090 Ti in the official runtime — the community Q8 path via Kijai's wrapper brings it into the 24 GB envelope but has no first-party RTX 3090 Ti timing published. HunyuanVideo-1.5 is the late-2025 8.3B successor explicitly designed to fit a single 24 GB card with BF16 weights and step distillation. We anchor the recipe on 1.5 because it is the only path with documented consumer-GPU support that fits the 3090 Ti in a sensible runtime budget.

ℹ️ Step-distilled = 480P I2V only. The released step-distilled checkpoint covers 480p image-to-video (8 or 12 inference steps). If you want text-to-video on this card, use the 480P-T2V or 480P-T2V-cfg-distill variant — same install, slower runtime (no 8-step speedup), and the same razor-thin VRAM ceiling.

Requirements

Component	Minimum	Tested / Reference
GPU	14 GB VRAM with `--overlap_group_offloading true`	RTX 3090 Ti (24 GB) — see Hardware notes
RAM	32 GB (used by CPU offload during inference)	—
Storage	~60 GB for the full checkpoint set (DiT + VAE + text encoders)	—
Software	Python 3.10+, CUDA 12.x, PyTorch 2.x, Linux	—
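Because the checkpoint set is around 60 GB, it can be worth confirming free disk space before starting the download. The snippet below is a minimal sketch using only the Python standard library; the function name `has_free_space` and the 60 GB threshold constant are illustrative choices, not part of the official tooling.

```python
import shutil

REQUIRED_GB = 60  # approximate size of the full checkpoint set, per the table above


def has_free_space(path: str = ".", required_gb: float = REQUIRED_GB) -> bool:
    """Return True if the filesystem holding `path` has at least `required_gb` GB free."""
    free_gb = shutil.disk_usage(path).free / 1e9
    return free_gb >= required_gb


print("enough space for checkpoints:", has_free_space("."))
```

Run it from the directory where you plan to clone the repository, since that is where `./ckpts` will be created.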

Hardware notes — 24 GB is 24 GB, even on a Ti

The HF card is unambiguous: it lists Minimum GPU Memory: 14 GB (with model offloading enabled) under System Requirements, with the follow-up note that "If your GPU has sufficient memory, you may disable offloading for improved inference speed." For the 3090 Ti, "sufficient memory" sits at the 24 GB physical envelope — the no-offload path consumes the full card without margin for OS/desktop/browser. The Ti is a same-envelope sibling of the standard 3090 (24 GB, Ampere sm_86) — extra bandwidth and shader headroom on the Ti give a small speed lift but zero VRAM relaxation. On a 3090 Ti that is also driving the desktop, the OS, browser, and any background CUDA process compete for the last few hundred megabytes. Default to the offload path on this card.

Installation

1. Clone the official Tencent repository

git clone https://github.com/Tencent-Hunyuan/HunyuanVideo-1.5.git
cd HunyuanVideo-1.5

The commands in steps 1 and 2 come verbatim from the official README.

2. Install Python dependencies

pip install -r requirements.txt
pip install -i https://mirrors.tencent.com/pypi/simple/ --upgrade tencentcloud-sdk-python

3. Install FlashAttention

HunyuanVideo-1.5 uses variable-length attention masks. Install FlashAttention from the Dao-AILab repository — the Ampere architecture (RTX 3090 Ti = sm_86) has full FlashAttention-2 kernel coverage, so the default wheel works without special flags.
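The FlashAttention README gives `pip install flash-attn --no-build-isolation` as its standard install line. After installing, a quick check like the following can confirm the package is importable and that the GPU reports the compute capability this recipe assumes. This is a sketch: the helper names are hypothetical, and the capability tuple has to come from PyTorch (`torch.cuda.get_device_capability(0)`), which is not imported here so the snippet runs anywhere.

```python
import importlib.util


def flash_attn_installed() -> bool:
    """True if the `flash_attn` package can be found in this environment."""
    return importlib.util.find_spec("flash_attn") is not None


def is_ampere_sm86(capability: tuple[int, int]) -> bool:
    """True for compute capability 8.6, which the RTX 3090 Ti reports."""
    return capability == (8, 6)


print("flash_attn importable:", flash_attn_installed())
# With PyTorch installed, pass torch.cuda.get_device_capability(0) to is_ampere_sm86().
```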

ℹ️ No FP8 path on Ampere. Unlike the Ada Lovelace (RTX 4090) and Hopper (H100) recipes that may use FP8 GEMM kernels (e.g. sgl-kernel), Ampere sm_86 has no FP8 tensor cores — FP16 / BF16 / INT8 / TF32 only. HunyuanVideo-1.5 is BF16-native and step-distilled out of the box, so this is not a problem: the recipe runs cleanly in BF16 with FlashAttention-2 on the 3090 Ti, and you do not need to install sgl-kernel==0.3.18. Skip that step from any 4090-based walkthrough.

4. Download the checkpoints

From the official checkpoints-download.md:

hf download tencent/HunyuanVideo-1.5 --local-dir ./ckpts
hf download Qwen/Qwen2.5-VL-7B-Instruct --local-dir ./ckpts/text_encoder/llm
hf download google/byt5-small --local-dir ./ckpts/text_encoder/byt5-small

The repository ships the main DiT (including the 480P-I2V-step-distill weights), the 3D causal VAE, and the glyph-aware text-encoder config. Qwen2.5-VL-7B-Instruct is the primary text encoder; byt5-small handles glyph-aware text rendering inside generated videos.
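A partial download is an easy way to lose time on the first run, so a quick layout check can help. The sketch below only verifies that the three target directories from the commands above exist and are non-empty; it does not validate individual weight files, and the helper name `missing_checkpoint_dirs` is illustrative.

```python
from pathlib import Path

# Target directories of the three `hf download` commands above.
EXPECTED_DIRS = (
    "ckpts",
    "ckpts/text_encoder/llm",
    "ckpts/text_encoder/byt5-small",
)


def missing_checkpoint_dirs(root: str = ".") -> list[str]:
    """Return the expected directories that are absent or empty under `root`."""
    missing = []
    for rel in EXPECTED_DIRS:
        d = Path(root) / rel
        if not d.is_dir() or not any(d.iterdir()):
            missing.append(rel)
    return missing


print("missing:", missing_checkpoint_dirs("."))
```

An empty list means all three locations have content; anything listed should be re-downloaded before launching.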

Running

Option A — Official Tencent script (480p I2V, step-distilled)

The official launcher is generate.py invoked via torchrun. For a single RTX 3090 Ti, set --nproc_per_node=1 and enable group offloading explicitly:

export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True,max_split_size_mb:128

PROMPT='A fluffy teddy bear sits on a bed of soft pillows surrounded by children toys.'
SEED=1
ASPECT_RATIO=16:9
RESOLUTION=480p
OUTPUT_PATH=./outputs/output.mp4
MODEL_PATH=./ckpts
INPUT_IMAGE=./inputs/teddy.png

torchrun --nproc_per_node=1 generate.py \
  --image_path "$INPUT_IMAGE" \
  --prompt "$PROMPT" \
  --resolution $RESOLUTION \
  --aspect_ratio $ASPECT_RATIO \
  --seed $SEED \
  --rewrite false \
  --enable_step_distill true \
  --use_sageattn false \
  --overlap_group_offloading true \
  --output_path $OUTPUT_PATH \
  --model_path $MODEL_PATH

Key flags for the 3090 Ti:

--enable_step_distill true selects the step-distilled I2V checkpoint (8 or 12 inference steps recommended; the official README cites up to a 75% reduction in generation time versus the 50-step path).
--overlap_group_offloading true should be treated as mandatory on a 24 GB 3090 Ti: it streams layers between CPU RAM and the GPU and keeps peak resident VRAM near the 14 GB floor.
--rewrite false skips the LLM-based prompt rewriter (a separate vLLM-served Qwen2.5-VL-7B-Instruct). Enable it only if you can serve the rewriter from VRAM outside this card.
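If you script several runs, the same command can be assembled in Python and handed to `subprocess.run`. The following is a sketch that mirrors the flags shown above and nothing more; `build_i2v_command` is a hypothetical helper, not part of the Tencent repository.

```python
import shlex


def build_i2v_command(image_path: str, prompt: str, output_path: str,
                      model_path: str = "./ckpts", seed: int = 1) -> list[str]:
    """Assemble the single-GPU 480p step-distilled I2V command as an argv list."""
    return [
        "torchrun", "--nproc_per_node=1", "generate.py",
        "--image_path", image_path,
        "--prompt", prompt,
        "--resolution", "480p",
        "--aspect_ratio", "16:9",
        "--seed", str(seed),
        "--rewrite", "false",
        "--enable_step_distill", "true",
        "--use_sageattn", "false",
        "--overlap_group_offloading", "true",  # keep offloading on for 24 GB cards
        "--output_path", output_path,
        "--model_path", model_path,
    ]


cmd = build_i2v_command("./inputs/teddy.png",
                        "A fluffy teddy bear sits on a bed.",
                        "./outputs/output.mp4")
print(shlex.join(cmd))  # to launch: subprocess.run(cmd, check=True)
```

Passing an argv list rather than a shell string avoids quoting problems when prompts contain apostrophes or other shell metacharacters.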

Option B — HuggingFace diffusers (480p I2V, single Python process)

If you'd rather use the diffusers integration, the upstream API handles offloading for you:

import torch
from diffusers import HunyuanVideo15ImageToVideoPipeline
from diffusers.utils import export_to_video, load_image

pipe = HunyuanVideo15ImageToVideoPipeline.from_pretrained(
    "hunyuanvideo-community/HunyuanVideo-1.5-480p_i2v",
    torch_dtype=torch.bfloat16,
)
pipe.transformer.set_attention_backend("sage_hub")  # per diffusers docs: Ampere/Other GPUs → sage_hub (flash_hub is mapped to A100/A800/RTX 4090)
pipe.enable_model_cpu_offload()  # required on 24 GB 3090 Ti
pipe.vae.enable_tiling()          # chunks VAE decode to fit alongside DiT

image = load_image("./inputs/teddy.png")
prompt = "A fluffy teddy bear sits on a bed of soft pillows surrounded by children toys."
video = pipe(image=image, prompt=prompt, num_frames=61, num_inference_steps=8).frames[0]
export_to_video(video, "output.mp4", fps=15)

Note the I2V pipeline class is HunyuanVideo15ImageToVideoPipeline — it accepts image=image as a kwarg. The T2V class HunyuanVideo15Pipeline is a separate entry point and does not accept an input image.

enable_model_cpu_offload() matches the official 14 GB floor; enable_tiling() is what keeps the VAE decode from blowing the 24 GB envelope at the end of the run.
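The allocator hint exported for Option A can also be applied to the diffusers path. A minimal sketch, assuming you want the same value as the shell export: set the variable at the very top of the script, before the first `import torch`, which is the safest way to make sure the CUDA allocator sees it.

```python
import os

# Place this before the first `import torch` in the process.
# setdefault() leaves any value already exported in the shell untouched.
os.environ.setdefault(
    "PYTORCH_CUDA_ALLOC_CONF",
    "expandable_segments:True,max_split_size_mb:128",
)

print(os.environ["PYTORCH_CUDA_ALLOC_CONF"])
```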

Results

Speed: No vendor-published RTX 3090 Ti number exists for the step-distilled path. The HunyuanVideo-1.5 model card publishes only an RTX 4090 anchor: its News entry for the 480p I2V step-distilled checkpoint, the variant this recipe targets, reports end-to-end generation time reduced by 75% and a single RTX 4090 generating a video within 75 seconds. The RTX 3090 sibling recipe reports ~105–112 seconds per clip on the standard 3090, itself an architecture-scaled extrapolation. The 3090 Ti shares the 3090's Ampere sm_86 architecture and 24 GB VRAM envelope, with roughly +8% memory bandwidth (1008 GB/s vs 936 GB/s), which helps the memory-bound VAE decode, and roughly +12% compute, which helps the DiT denoising steps. Applying that lift to the 3090 figure gives roughly 100–105 seconds per 480p I2V step-distilled clip on the 3090 Ti. Caveat: this is an estimate stacked on an already-extrapolated 3090 figure, not a measurement. Once community benchmark data lands, /check/hunyuan-video/rtx-3090-ti will replace it.
VRAM usage: 14 GB minimum with --overlap_group_offloading true (or pipe.enable_model_cpu_offload()), per the official HF card's System Requirements line. The no-offload path consumes the full 24 GB envelope on the 3090 Ti without spare margin for OS/desktop/CUDA-background processes — treat it as at the physical ceiling, not under it. See Troubleshooting if you decide to try it.
Quality notes: The step-distilled checkpoint "maintains comparable quality to the original model" per Tencent's release note on the HF card, but is currently only released for the 480p I2V path. The standard 50-step T2V / I2V remains the higher-quality option when you can afford the multi-minute generation budget — on the 3090 Ti, with offloading mandatory, expect the standard path to run noticeably slower than on a 4090.
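As a rough sanity check on the speed estimate, the arithmetic can be written out. The sketch below divides the 3090 range by each ratio quoted above, treating the run as either fully bandwidth-bound or fully compute-bound; both are idealized assumptions, and real runs with offloading will mix the two and add transfer time that the Ti does not accelerate.

```python
BASE_3090_S = (105.0, 112.0)       # sibling-recipe estimate for the RTX 3090, seconds
BANDWIDTH_RATIO = 1008.0 / 936.0   # 3090 Ti vs 3090 memory bandwidth (~1.08)
COMPUTE_RATIO = 1.12               # 3090 Ti vs 3090 compute, as quoted above


def scaled_range(base: tuple[float, float], speedup: float) -> tuple[float, float]:
    """Divide both ends of a time range by an ideal speedup factor."""
    lo, hi = base
    return (lo / speedup, hi / speedup)


bandwidth_bound = scaled_range(BASE_3090_S, BANDWIDTH_RATIO)
compute_bound = scaled_range(BASE_3090_S, COMPUTE_RATIO)
print(f"if fully bandwidth-bound: {bandwidth_bound[0]:.1f}-{bandwidth_bound[1]:.1f} s")
print(f"if fully compute-bound:   {compute_bound[0]:.1f}-{compute_bound[1]:.1f} s")
```

The two idealized cases span roughly 94 to 104 seconds, so the 100–105 second figure sits toward the conservative end of that range.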

For full benchmark data, see /check/hunyuan-video/rtx-3090-ti. The pair currently shows verdict: unknown (no benchmark) — please submit a contribution with your own measurement once you have a clean run.
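If you plan to contribute a measurement, a small wrapper that records wall-clock time and peak PyTorch-allocated memory keeps runs comparable. This is a sketch, not an official benchmark harness: `timed_run` is a hypothetical helper, and `torch.cuda.max_memory_allocated()` counts only PyTorch's own allocations, so `nvidia-smi` will typically show a higher total.

```python
import time
from typing import Any, Callable, Optional, Tuple


def timed_run(fn: Callable[[], Any]) -> Tuple[Any, float, Optional[float]]:
    """Run `fn` once; return (result, seconds, peak PyTorch-allocated GiB or None)."""
    try:
        import torch
        cuda = torch.cuda.is_available()
    except ImportError:
        torch, cuda = None, False

    if cuda:
        torch.cuda.reset_peak_memory_stats()
        torch.cuda.synchronize()
    start = time.perf_counter()
    result = fn()
    if cuda:
        torch.cuda.synchronize()  # wait for queued GPU work before stopping the clock
    seconds = time.perf_counter() - start
    peak_gib = torch.cuda.max_memory_allocated() / 2**30 if cuda else None
    return result, seconds, peak_gib


# Hypothetical usage with the diffusers pipeline from Option B:
#   video, secs, peak = timed_run(lambda: pipe(image=image, prompt=prompt,
#       num_frames=61, num_inference_steps=8).frames[0])
result, secs, peak = timed_run(lambda: sum(range(1000)))
print(f"{secs:.3f} s, peak GiB: {peak}")
```

Time a second generation in the same process as well, since the first one includes the one-time weight loading described under Troubleshooting.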

Troubleshooting

Out of memory on a 24 GB 3090 Ti even at default settings

The 24 GB envelope leaves zero headroom for the desktop, browser, or background CUDA processes. Mitigations in order:

Enable group offloading (the commands in this recipe already do; revisit this only if you turned it off): set --overlap_group_offloading true (Tencent script) or call pipe.enable_model_cpu_offload() instead of pipe.to("cuda") (diffusers path).
Set the PyTorch allocator hints before launching (verbatim from the official README):
```
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True,max_split_size_mb:128
```
Free competing VRAM: close the browser, switch to a TTY (Ctrl+Alt+F3 on most Linux desktops), and run nvidia-smi to confirm <500 MB is in use before launch. On a single-GPU workstation that also drives the display, even an idle desktop can claim 600–800 MB.
Drop to 480p I2V if you were experimenting with 720p variants — the 720p T2V/I2V paths push peak VRAM well over 14 GB even with offloading, and on a 24 GB 3090 Ti the safe path is 480p step-distilled.
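The "confirm under 500 MB" step above can be automated. The sketch below shells out to `nvidia-smi` with its CSV query flags and compares the result with the 500 MB target; the helper names are illustrative, `nvidia-smi` reports the values in MiB, and the parsing is separated from the query so it can be exercised without a GPU.

```python
import subprocess


def parse_used_mib(csv_output: str) -> list[int]:
    """Parse `memory.used` values (MiB), one line per GPU, from nvidia-smi CSV output."""
    return [int(line.strip()) for line in csv_output.splitlines() if line.strip()]


def vram_is_clear(used_mib: int, limit_mib: int = 500) -> bool:
    """True if used VRAM is below the recipe's pre-launch target."""
    return used_mib < limit_mib


def query_used_mib() -> list[int]:
    """Ask nvidia-smi for used memory per GPU; requires the NVIDIA driver."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.used", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_used_mib(out)


# On a machine with an NVIDIA GPU:
#   print([vram_is_clear(m) for m in query_used_mib()])
print(parse_used_mib("412\n"), vram_is_clear(412), vram_is_clear(730))
```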

Slow first run / loading the Qwen2.5-VL text encoder

First generation loads the 7B Qwen text encoder and the DiT: expect 60–120 seconds of one-time disk-to-CPU-to-GPU transfer before inference begins. Later generations in the same Python process reuse the loaded weights; a fresh torchrun launch loads them again. On the 3090 Ti with offloading, this initial load is governed by CPU-side staging and PCIe Gen4 x16 bandwidth. The Ti's extra memory bandwidth helps inference loops, not the first-load disk-to-CPU staging.

Wrong pipeline class: "TypeError: __call__() got an unexpected keyword argument 'image'"

If you see this error, you instantiated HunyuanVideo15Pipeline (the text-to-video entry point) and then tried to pass image=image to its __call__. The image-to-video path uses a different class: HunyuanVideo15ImageToVideoPipeline. Swap the import and re-run — see the diffusers code block above.

"Can I run the original HunyuanVideo 13B on a 3090 Ti?"

The original tencent/HunyuanVideo at FP16 needs 40 GB+ of VRAM and does not fit a single 3090 Ti in the official runtime. Per Kijai's ComfyUI-HunyuanVideoWrapper the community Q8 quantization gets it into ~24 GB, but no first-party 3090 Ti timing for that path is currently published — submit a contribution if you have measured numbers, or open the wrapper repo's Issues for community reports. It is a separate recipe path from this 1.5 8.3B walkthrough.