What You'll Build
Generate short text-to-video clips locally using LightX2V — an inference framework that ships 4-step, CFG-free distilled checkpoints of Wan2.1-T2V-14B — on a 24 GB RTX 4090. The distilled checkpoint cuts inference from 40–50 steps down to 4 with no classifier-free guidance, and on a 4090 you have two viable paths: the FP8 distill (distill_fp8/, ~21.7 GB on disk) that fits without aggressive offload, and the BF16 distill (distill_models/, ~40 GB on disk including T5) that needs enable_offload(cpu_offload=True, ...) to stay under 24 GB (file sizes via HF tree API).
Hardware data: RTX 4090 (24GB VRAM) · 4-step distilled Wan2.1-T2V-14B · See benchmark data
ℹ️ This is distilled Wan 2.1, not Wan 2.2. The lightx2v org publishes a wider family: Wan2.1-T2V-14B distilled, Wan2.2-A14B distilled (timestep-MoE), HunyuanVideo-1.5 distilled, and Qwen-Image distilled, each with different VRAM and inference characteristics. This recipe is specifically the Wan2.1-T2V-14B-StepDistill-CfgDistill-Lightx2v repo (4-step CFG-free distillation of the dense Wan2.1-T2V-14B base). The same install steps don't transfer cleanly to the A14B timestep-MoE; see the LightX2V Latest News for the Wan 2.2 / HunyuanVideo distilled releases.
⚠️ OOM without optimization is real. A user reported OOM on a 48 GB A6000 trying to run the unquantized distilled T2V-14B without the framework's optimizations enabled (HF predecessor-repo discussion #9 "OOMs"). On a 24 GB 4090, the BF16 path requires enable_offload(cpu_offload=True, ...); the FP8 path is what fits natively.
Requirements
| Component | Minimum | Tested |
|---|---|---|
| GPU | 8GB VRAM with full offload; 24GB for FP8 native; ≥48GB ideal for BF16 native | RTX 4090 (24GB) |
| RAM | 16GB (per LightX2V Quickstart); 32GB recommended when offloading BF16 | — |
| Storage | ~22GB (FP8) or ~50GB (BF16 + T5 + VAE) | weights + framework |
| Software | Python 3.10+, PyTorch 2.6+, CUDA 12.4 or 12.8 | per Quickstart |
The lightx2v/Wan2.1-T2V-14B-StepDistill-CfgDistill-Lightx2v HF repo ships three sub-directories: distill_models/ (BF16 dense — distill_model.safetensors is 28.58 GB, larger than a single 4090's VRAM), distill_fp8/ (per-block FP8 quant — 40 transformer blocks at ~352 MB each plus non_block.safetensors 0.93 GB and an FP8 T5 encoder at 6.73 GB), and distill_int8/ (per-block INT8 with similar layout) — all per HF tree API.
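The sizes quoted in this guide can be tallied to see why FP8 is the path that fits. The following is a minimal sketch that uses only the per-file figures listed here (the 11.36 GB BF16 T5 figure comes from the Results section). It compares on-disk weight size against the card's 24 GB and does not model activations, the VAE, or runtime overhead, so treat it as a lower bound and not a peak-VRAM prediction.

```python
# On-disk sizes in GB, copied from the figures quoted in this guide.
FP8_BLOCKS = 40
FP8_BLOCK_GB = 0.352
FP8_NON_BLOCK_GB = 0.93
FP8_T5_GB = 6.73

BF16_DIT_GB = 28.58
BF16_T5_GB = 11.36

VRAM_GB = 24.0  # RTX 4090


def fp8_total_gb() -> float:
    """DiT blocks + non-block weights + FP8 T5 encoder."""
    return FP8_BLOCKS * FP8_BLOCK_GB + FP8_NON_BLOCK_GB + FP8_T5_GB


def bf16_total_gb() -> float:
    """BF16 DiT + BF16 T5 encoder."""
    return BF16_DIT_GB + BF16_T5_GB


def fits_on_card(weights_gb: float, vram_gb: float = VRAM_GB) -> bool:
    """True if the weights alone are smaller than the card's VRAM."""
    return weights_gb < vram_gb


for name, total in (("FP8", fp8_total_gb()), ("BF16", bf16_total_gb())):
    verdict = "fits" if fits_on_card(total) else "needs offload"
    print(f"{name}: {total:.2f} GB on disk -> {verdict} on {VRAM_GB:.0f} GB")
```

Even the BF16 DiT on its own (28.58 GB) exceeds 24 GB, which is why offload is unavoidable on that path.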
Installation
The canonical install is Docker (recommended) or conda from source — both documented in the LightX2V Quickstart.
1. Install the framework (conda path)
# Clone and create the environment
git clone https://github.com/ModelTC/LightX2V.git
cd LightX2V
conda create -n lightx2v python=3.11 -y
conda activate lightx2v
pip install -v -e .
Verbatim from the LightX2V Quickstart. The PyTorch wheel pulled in by pip install -v -e . already supports the RTX 4090's Ada (sm_89) architecture, so no cu128-vs-cu126 toggling is required (unlike Blackwell sm_120 cards, which do need a CUDA 12.8 PyTorch wheel).
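To confirm that the installed PyTorch build can actually drive the card, you can compare the GPU's compute capability with the targets the wheel was compiled for. This is a hedged sketch, not part of the Quickstart: it relies on CUDA's rule that a binary built for sm_XY also runs on later minor revisions of the same major architecture (so an sm_86 target covers an sm_89 card), and the helper names are illustrative.

```python
import re

# Matches native targets such as "sm_86" or "sm_120"; PTX entries like
# "compute_90" are ignored by this sketch.
_SM_PATTERN = re.compile(r"^sm_(\d+)(\d)[a-z]?$")


def arch_supported(capability: tuple[int, int], arch_list: list[str]) -> bool:
    """True if some compiled target shares the GPU's major version and has
    a minor version no newer than the GPU's."""
    dev_major, dev_minor = capability
    for arch in arch_list:
        match = _SM_PATTERN.match(arch)
        if match is None:
            continue
        major, minor = int(match.group(1)), int(match.group(2))
        if major == dev_major and minor <= dev_minor:
            return True
    return False


def check_local_gpu() -> None:
    # Imported lazily so arch_supported() works without PyTorch installed.
    import torch

    if not torch.cuda.is_available():
        print("CUDA is not available in this PyTorch build")
        return
    cap = torch.cuda.get_device_capability(0)
    archs = torch.cuda.get_arch_list()
    print(f"{torch.cuda.get_device_name(0)}: sm_{cap[0]}{cap[1]}")
    print(f"Compiled targets: {archs}")
    print("compatible target found" if arch_supported(cap, archs)
          else "no compatible native target listed")
```

Call check_local_gpu() inside the lightx2v conda environment after the install finishes.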
2. (Recommended) build SageAttention 2 for ~2× attention speedup
git clone https://github.com/thu-ml/SageAttention.git
cd SageAttention && CUDA_ARCHITECTURES="8.0,8.6,8.9,9.0" \
EXT_PARALLEL=4 NVCC_APPEND_FLAGS="--threads 8" MAX_JOBS=32 \
pip install -v -e .
CUDA_ARCHITECTURES="8.0,8.6,8.9,9.0" covers Ampere through Hopper; the 8.9 entry is the RTX 4090's Ada target. (Blackwell sm_120 is not needed for this card.)
3. (Optional, Ada-specific) install Q8 Kernels
For the INT8 distilled path, the LightX2V Quickstart calls out Q8 Kernels as the "appropriate quantization operator … for Ada architecture GPUs (such as RTX 4090, L40S, etc.)":
git clone https://github.com/KONAKONA666/q8_kernels.git
cd q8_kernels && git submodule init && git submodule update
python setup.py install
Skip this step if you only plan to run the FP8 or BF16 paths.
4. Pull the 4-step distilled T2V-14B checkpoint
Pick one of the three sub-directories — FP8 is the sweet spot for a 24 GB 4090:
# FP8 distill — fits natively on 24 GB (~21.7 GB on disk for DiT + T5-fp8)
huggingface-cli download \
lightx2v/Wan2.1-T2V-14B-StepDistill-CfgDistill-Lightx2v \
--include "distill_fp8/*" \
--local-dir ./weights/Wan2.1-T2V-14B-StepDistill
# OR: BF16 distill — needs CPU offload on 24 GB (~40 GB on disk; 28.58 GB DiT alone)
# huggingface-cli download \
# lightx2v/Wan2.1-T2V-14B-StepDistill-CfgDistill-Lightx2v \
# --include "distill_models/*" \
# --local-dir ./weights/Wan2.1-T2V-14B-StepDistill
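If you would rather script the download, the same filter can be expressed with the huggingface_hub Python API. A minimal sketch, assuming huggingface_hub is installed in the environment; the download_kwargs and download helper names are illustrative, while the repo ID, sub-directory names, and target directory are the ones used in this guide.

```python
REPO_ID = "lightx2v/Wan2.1-T2V-14B-StepDistill-CfgDistill-Lightx2v"

# Sub-directories of the repo, as described in the Requirements section.
VARIANT_DIRS = {
    "fp8": "distill_fp8",
    "bf16": "distill_models",
    "int8": "distill_int8",
}


def download_kwargs(variant: str,
                    local_dir: str = "./weights/Wan2.1-T2V-14B-StepDistill") -> dict:
    """Build the arguments for snapshot_download for one variant."""
    if variant not in VARIANT_DIRS:
        raise ValueError(f"unknown variant {variant!r}; pick one of {sorted(VARIANT_DIRS)}")
    return {
        "repo_id": REPO_ID,
        "allow_patterns": [f"{VARIANT_DIRS[variant]}/*"],
        "local_dir": local_dir,
    }


def download(variant: str = "fp8") -> str:
    """Fetch only the chosen sub-directory and return the local path."""
    # Imported lazily so download_kwargs() can be inspected without the package.
    from huggingface_hub import snapshot_download

    return snapshot_download(**download_kwargs(variant))
```

Calling download("fp8") should be equivalent to the huggingface-cli command above; swap in "bf16" only if you plan to use offload.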
Alternative: Docker (simplest)
docker pull lightx2v/lightx2v:26011201-cu128
cu128 is fine on the RTX 4090 even though sm_120 kernels aren't required — the wheel still includes Ada sm_89 support. Use the -cu124 variant if you need to match an older driver. Both per the Quickstart.
Running
The framework ships ready-to-run shell scripts under scripts/wan/ — the relevant ones for this recipe are run_wan_t2v_distill_fp8_4step_cfg.sh (FP8 path) and run_wan_t2v_distill_model_4step_cfg.sh (BF16 path). Fill in lightx2v_path (the cloned repo root) and model_path (the directory you downloaded weights to in step 4 above), then:
# FP8 path — primary recommendation for 24 GB 4090
bash scripts/wan/run_wan_t2v_distill_fp8_4step_cfg.sh
Under the hood, the script invokes (verbatim from the run script):
python -m lightx2v.infer \
--model_cls wan2.1_distill \
--task t2v \
--model_path $model_path \
--config_json ${lightx2v_path}/configs/distill/wan21/wan_t2v_distill_fp8_4step_cfg.json \
--prompt "Two anthropomorphic cats in comfy boxing gear and bright gloves fight intensely on a spotlighted stage." \
--use_prompt_enhancer \
--negative_prompt "..." \
--save_result_path ${lightx2v_path}/save_results/wan_t2v_distill_fp8_4step.mp4
Output lands at ${lightx2v_path}/save_results/wan_t2v_distill_fp8_4step.mp4. The recommended sampler settings — LCM scheduler with shift=5.0 and guidance_scale=1.0 — are baked into configs/distill/wan21/wan_t2v_distill_fp8_4step_cfg.json and explicitly documented on the HF model card.
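For batch runs it can be convenient to assemble that command from Python instead of editing the shell script per prompt. The sketch below only rearranges the flags shown in the command above. The build_infer_cmd helper is hypothetical, the negative prompt must be copied from the run script (it is elided here), and the shell script may also set environment variables that this sketch does not replicate.

```python
from pathlib import Path


def build_infer_cmd(lightx2v_path: str, model_path: str, prompt: str,
                    negative_prompt: str, out_name: str,
                    use_prompt_enhancer: bool = True) -> list[str]:
    """Return the argv list for one FP8 4-step T2V generation."""
    root = Path(lightx2v_path)
    config = root / "configs/distill/wan21/wan_t2v_distill_fp8_4step_cfg.json"
    cmd = [
        "python", "-m", "lightx2v.infer",
        "--model_cls", "wan2.1_distill",
        "--task", "t2v",
        "--model_path", str(model_path),
        "--config_json", str(config),
        "--prompt", prompt,
        "--negative_prompt", negative_prompt,
        "--save_result_path", str(root / "save_results" / out_name),
    ]
    if use_prompt_enhancer:
        cmd.append("--use_prompt_enhancer")
    return cmd
```

Pass the result to subprocess.run(cmd, check=True, cwd=lightx2v_path) from inside the lightx2v environment. Passing arguments as a list avoids shell-quoting problems with long prompts.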
For the Python API directly (BF16 path with offload, what a 24 GB card needs):
from lightx2v import LightX2VPipeline

pipe = LightX2VPipeline(
    model_path="./weights/Wan2.1-T2V-14B-StepDistill",
    model_cls="wan2.1_distill",
    task="t2v",
)

# Required on a 24 GB card if you load the BF16 distill (28.58 GB DiT)
pipe.enable_offload(
    cpu_offload=True,
    offload_granularity="block",
    text_encoder_offload=True,
    image_encoder_offload=False,
    vae_offload=False,
)

pipe.create_generator(
    attn_mode="sage_attn2",
    infer_steps=4,  # the whole point of the distilled checkpoint
    height=480, width=832,
    num_frames=81,
    guidance_scale=1.0,
    sample_shift=5.0,
)

pipe.generate(
    seed=42,
    prompt="A man with short gray hair plays a red electric guitar.",
    save_result_path="./output.mp4",
)
The enable_offload(...) call mirrors the official examples/wan/wan_i2v.py snippet — copied here for the T2V task. Start at 480×832, 81 frames, 4 steps and only push to 720×1280 once you've confirmed peak VRAM stays comfortably under 24 GB.
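One way to confirm the peak before raising the resolution is to poll nvidia-smi from a second terminal while a generation runs. This is a minimal sketch with illustrative helper names. It reads the device-wide memory.used figure, which includes any other processes on the GPU and memory held by PyTorch's caching allocator, so it tends to overstate what the model strictly needs.

```python
import subprocess
import time

QUERY = ["nvidia-smi", "--query-gpu=memory.used", "--format=csv,noheader,nounits"]


def parse_memory_used(output: str) -> list[int]:
    """Parse the query output: one integer (MiB) per GPU, one per line."""
    return [int(line.strip()) for line in output.splitlines() if line.strip()]


def poll_peak_mib(seconds: float = 120.0, interval: float = 1.0,
                  gpu_index: int = 0) -> int:
    """Sample memory.used every `interval` seconds and return the maximum."""
    peak = 0
    deadline = time.monotonic() + seconds
    while time.monotonic() < deadline:
        out = subprocess.run(QUERY, capture_output=True, text=True,
                             check=True).stdout
        peak = max(peak, parse_memory_used(out)[gpu_index])
        time.sleep(interval)
    return peak
```

Start poll_peak_mib() just before launching the generation and compare the result with the card's 24 GB (24576 MiB) total.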
Results
- Speed: The official LightX2V README's "Cross-Framework Performance Comparison (RTX 4090D)" table reports 20.26 s/it for LightX2V single-GPU on an RTX 4090D (a China-only sibling of the RTX 4090: same 24 GB VRAM, same Ada sm_89 arch, ~95% compute of the full 4090; expect very slightly better numbers on a full RTX 4090). The measurement is for the base Wan2.1-I2V-14B at 480P, 40 steps, 81 frames, not the 4-step distilled variant. The distilled checkpoint reuses the same DiT architecture, so per-step compute is essentially the same, but you only need 4 steps instead of 40, so denoising time per clip should be roughly 10× lower. An RTX-4090-specific 4-step distilled-T2V benchmark hasn't been published yet; contribute one at /check/lightx2v/rtx-4090 and it will appear here.
- VRAM usage: No first-party RTX-4090 runtime measurement has been published for this distilled variant. Per the HF tree API, the on-disk envelope is ~21.7 GB for the FP8 path (40 transformer blocks at ~352 MB each + 0.93 GB non-block weights + 6.73 GB FP8 T5), plus about 0.5 GB for the Wan2.1 VAE, and ~40 GB for the BF16 path (28.58 GB DiT + 11.36 GB BF16 T5 + 0.5 GB VAE). The 4090's 24 GB envelope fits the FP8 path natively; the BF16 path requires enable_offload(cpu_offload=True, text_encoder_offload=True, ...) to keep peak GPU residency under 24 GB. See /check/lightx2v/rtx-4090 for empirical numbers as they land.
- Quality notes: The distilled checkpoint trades fine motion detail and prompt fidelity for the 4-step / no-CFG speedup. Use the recommended LCM scheduler, shift=5.0, guidance_scale=1.0 (HF model card) and stay close to the model's training resolutions (480×832, 720×1280) for best results.
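The speed estimate in this section is simple arithmetic on the README figure. The sketch below reproduces it under the same assumption the text makes, namely that per-step cost on the distilled checkpoint matches the 20.26 s/it measured for the base model on an RTX 4090D. It is an extrapolation and not a benchmark, and it ignores text-encoder and VAE time, which do not shrink with the step count.

```python
SECONDS_PER_STEP = 20.26  # README figure: RTX 4090D, base Wan2.1-I2V-14B, 480P


def denoise_seconds(steps: int, sec_per_step: float = SECONDS_PER_STEP) -> float:
    """Estimated time spent in the denoising loop for one clip."""
    return steps * sec_per_step


base_s = denoise_seconds(40)       # base model schedule
distilled_s = denoise_seconds(4)   # 4-step distilled schedule

print(f"base (40 steps):     {base_s:.1f} s")
print(f"distilled (4 steps): {distilled_s:.1f} s")
print(f"ratio: {base_s / distilled_s:.1f}x")
```

That works out to roughly 810 seconds versus roughly 81 seconds of denoising per clip, which is where the 10× figure comes from.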
For the full benchmark data, see /check/lightx2v/rtx-4090.
Troubleshooting
Out of memory loading the BF16 distill on a 24 GB card
The BF16 distill_models/distill_model.safetensors is 28.58 GB on disk (HF tree API) — larger than the 4090's 24 GB VRAM, so it cannot be loaded fully resident. Two options:
- Use the FP8 path (distill_fp8/, ~21.7 GB on-disk total), which fits natively. This is the recommended path for a 24 GB 4090.
- Stay on BF16 with offload: pipe.enable_offload(cpu_offload=True, offload_granularity="block", text_encoder_offload=True, ...) (per the examples/wan/wan_i2v.py snippet). Expect a wall-time penalty from the block-level swap between CPU RAM and GPU VRAM.
Out of memory even with the framework optimizations
A user reported OOM with the unquantized distilled T2V-14B even on a 48 GB A6000 (HF predecessor-repo discussion #9 "OOMs"). The maintainers' guidance there points at the framework's quantization docs — combine SageAttention 2, FP8 scaled quantization, block-level offload, and (if your resolution allows) VAE tiling. The same techniques that let community users run the model on 16 GB cards are what keep peak VRAM under control on 24–48 GB ones.
Slow inference despite the 4-step distillation
The 4-step path only delivers the advertised speedup if the LCM scheduler is actually loaded and guidance_scale=1.0. If you're calling create_generator(...) directly, make sure you pass guidance_scale=1.0 and sample_shift=5.0 — and that infer_steps=4, not the base model's 40. The provided shell scripts and config JSONs already encode the right defaults; the trap is bespoke Python scripts that copy partial parameters and silently fall back to the un-distilled inference path. Both settings are explicit on the HF model card.
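A small guard in a custom script can catch the partial-copy mistake before a long run starts. This is a sketch with a hypothetical helper: it checks a dict of keyword arguments against the three values this guide recommends and uses the same parameter names as the create_generator(...) call shown earlier. It does not inspect the scheduler itself.

```python
# Values recommended in this guide for the 4-step distilled checkpoint.
EXPECTED = {"infer_steps": 4, "guidance_scale": 1.0, "sample_shift": 5.0}


def distill_setting_problems(kwargs: dict) -> list[str]:
    """Return a human-readable list of settings that are missing or off."""
    problems = []
    for key, want in EXPECTED.items():
        if key not in kwargs:
            problems.append(f"{key} is missing (expected {want})")
        elif kwargs[key] != want:
            problems.append(f"{key}={kwargs[key]!r} (expected {want})")
    return problems


# Example: settings copied from a base-model script by mistake.
generator_kwargs = {"infer_steps": 40, "guidance_scale": 5.0}
for problem in distill_setting_problems(generator_kwargs):
    print("warning:", problem)
```

Build the kwargs dict once, run it through the check, then pass it on as pipe.create_generator(**generator_kwargs) together with the remaining arguments.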
Resolution / frame-count crashes
The Wan2.1 base requires resolutions divisible by 16 and a frame count that follows the model's grouping. Stick to the example configs (480×832 / 81 frames; 720×1280 / 81 frames) until you've measured a comfortable VRAM margin via nvidia-smi -l 1 during a generation run.
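Checking a requested shape before starting a run avoids a crash several minutes in. The sketch below encodes the divisible-by-16 rule stated above. The frame-count rule is an assumption on my part: the guide only says the count follows the model's grouping, and the sketch assumes counts of the form 4n + 1, which the 81-frame examples satisfy. Verify it against the framework before relying on it.

```python
def shape_problems(height: int, width: int, num_frames: int) -> list[str]:
    """Return a list of reasons the requested shape is likely to be rejected."""
    problems = []
    for name, value in (("height", height), ("width", width)):
        if value % 16 != 0:
            problems.append(f"{name}={value} is not divisible by 16")
    # ASSUMPTION (not stated in this guide): frame counts must be 4n + 1,
    # e.g. 81 = 4 * 20 + 1.
    if num_frames % 4 != 1:
        problems.append(f"num_frames={num_frames} is not of the form 4n + 1")
    return problems


for shape in ((480, 832, 81), (720, 1280, 81), (500, 832, 80)):
    issues = shape_problems(*shape)
    print(shape, "ok" if not issues else issues)
```

Both example configs pass, while the third shape is flagged for its height and its frame count.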
Report new issues via the submission form; community RTX 4090 benchmarks would directly improve the /check/lightx2v/rtx-4090 data.