
LightX2V on RTX 5090: 4-Step Text-to-Video with Distilled Wan2.1-14B (Blackwell-Native FP8 + Future NVFP4 Path)

video · intermediate · 22GB+ VRAM · May 24, 2026
prerequisites
  • NVIDIA RTX 5090 (32 GB VRAM, Blackwell sm_120) or another Blackwell-class GPU
  • 32 GB+ system RAM
  • Python 3.10+ (3.11 recommended) and PyTorch built against CUDA 12.8+ (sm_120 kernels require cu128)
  • ~25 GB free disk space for the FP8 distilled weights (~65 GB if you also pull the BF16 weights, which add ~40.5 GB)

What You'll Build

Generate short text-to-video clips locally on a 32 GB RTX 5090 using LightX2V, an inference framework that ships 4-step, CFG-free distilled checkpoints of Wan2.1-T2V-14B. The distilled checkpoint cuts inference from 40–50 steps down to 4 with no classifier-free guidance. On Blackwell sm_120 the recommended path is FP8, which runs on native FP8 tensor cores with the whole model resident in VRAM. The BF16 path also fits inside 32 GB once the text encoder is offloaded, a first for a consumer card.

Hardware data: RTX 5090 (32 GB VRAM, Blackwell sm_120) · 4-step distilled Wan2.1-T2V-14B · See benchmark data

ℹ️ This is distilled Wan 2.1, not Wan 2.2. The lightx2v org publishes a wider family — Wan2.1-T2V-14B distilled, Wan2.2-A14B distilled (timestep-MoE), HunyuanVideo-1.5 distilled, Qwen-Image distilled — each with different VRAM and inference characteristics. This recipe is specifically the Wan2.1-T2V-14B-StepDistill-CfgDistill-Lightx2v repo (4-step CFG-free distillation of the dense Wan2.1-T2V-14B base). The same install steps don't transfer cleanly to the A14B timestep-MoE — see the LightX2V Latest News for the Wan 2.2 / HunyuanVideo distilled releases.

FP8 is the fast path on Blackwell. Unlike Ampere sm_86 (RTX 30-series) — where loading FP8 weights forces a dequantize-to-BF16 at compute time because the architecture has no FP8 tensor cores — Blackwell sm_120 (RTX 50-series) has native FP8 tensor-core acceleration (E4M3 / E5M2). FP8 here gives you both the VRAM savings and the throughput win. The LightX2V team makes this explicit in the Wan-NVFP4 release notes: "NVFP4 Quantization-Aware Step Distillation for Blackwell Architecture" — and the still-merging upstream PR #1090 "feat: add MXFP8 fused operators for Wan transformer inference on SM120" reports a measured 1.20× end-to-end FFN speedup on RTX 5090 with auto-fallback on non-SM120 GPUs. This recipe routes through FP8 as the primary path.

🧪 NVFP4 is shipping for sibling variants, not T2V-14B (yet). The lightx2v/Wan-NVFP4 repo currently publishes Blackwell-native NVFP4 distilled weights for Wan2.1-I2V-14B-480P and Wan2.1-T2V-1.3B, not the 14B T2V variant this recipe targets. NVFP4 (w4a4-nvfp4) is listed as a first-class quantization scheme in the framework's feature list alongside w8a8-int8 and w8a8-fp8, and the examples/wan/wan_t2v_nvfp4.py script ships ready-to-run but is currently hard-coded to the 1.3B base. If you specifically want NVFP4 acceleration on a 5090 today, use the I2V-14B-480P NVFP4 weights (see the Wan-NVFP4 card's RTX 5090 single-GPU benchmark: 17.65 s end-to-end vs. 498.9 s on the original I2V-14B, a 28× speedup, but on the I2V model, not T2V-14B). Track the LightX2V Latest News for the T2V-14B NVFP4 drop.

Requirements

Component | Minimum | Tested
GPU | Blackwell-class with FP8 tensor cores; the framework's Quickstart sets the floor at 8 GB VRAM with full offload | RTX 5090 (Blackwell sm_120, 32 GB)
RAM | 16 GB minimum per Quickstart; 32 GB headroom helps if you fall back to BF16 + text-encoder offload |
Storage | ~22 GB (FP8 sub-directory) | ~63 GB if you also pull BF16 (~22 GB + ~40.5 GB)
Software | Python 3.10+, PyTorch built against CUDA 12.8+ (sm_120 kernels require cu128) | per Quickstart

The lightx2v/Wan2.1-T2V-14B-StepDistill-CfgDistill-Lightx2v HF repo ships three sub-directories (sizes via the HF tree API):

  • distill_fp8/ — per-block FP8 quant, ~22 GB total (40 transformer blocks at ~352 MB each + non_block.safetensors 0.93 GB + FP8 UMT5-XXL encoder 6.73 GB + Wan2.1 VAE 0.51 GB). Recommended primary path on the 5090 — FP8 is hardware-accelerated on Blackwell, and the full DiT + encoder + VAE fits comfortably inside 32 GB with ample headroom for activations at 720×1280.
  • distill_int8/ — per-block INT8 quant, ~22 GB total with the same layout. Fully supported on Blackwell as well; FP8 is preferred because Blackwell's FP8 throughput is the headline tensor-core path for diffusion workloads.
  • distill_models/ — BF16 dense, ~40.5 GB total (28.58 GB DiT + 11.36 GB BF16 UMT5 + 0.51 GB VAE). The 28.58 GB DiT alone now fits resident on the 5090's 32 GB (it did not on the 4090's 24 GB) — but the additional 11.36 GB BF16 UMT5 still benefits from text_encoder_offload=True so DiT activations have room.
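The totals above can be re-derived from the per-file sizes. The following is a minimal sketch of that arithmetic using only the figures quoted in the list; the dictionary names are illustrative, and on-disk size is only a lower bound on VRAM use because activations are not counted.

```python
# Component sizes in GB, copied from the sub-directory list above.
FP8_PARTS = {
    "dit_blocks": 40 * 0.352,  # 40 transformer blocks at ~352 MB each
    "non_block": 0.93,
    "umt5_fp8": 6.73,
    "vae": 0.51,
}
BF16_PARTS = {"dit": 28.58, "umt5_bf16": 11.36, "vae": 0.51}
VRAM_GB = 32.0  # RTX 5090

fp8_total = sum(FP8_PARTS.values())
bf16_total = sum(BF16_PARTS.values())
# With the text encoder offloaded to CPU, only the DiT and VAE stay on the GPU.
bf16_gpu_with_offload = BF16_PARTS["dit"] + BF16_PARTS["vae"]

print(f"FP8 total:               {fp8_total:.2f} GB")
print(f"BF16 total:              {bf16_total:.2f} GB")
print(f"BF16, encoder offloaded: {bf16_gpu_with_offload:.2f} GB")
print(f"Headroom, FP8 resident:  {VRAM_GB - fp8_total:.2f} GB")
```

The BF16 directory as a whole exceeds 32 GB, which is why the encoder offload matters on that path while the FP8 directory fits with room to spare.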

Installation

The canonical install is Docker (simplest) or conda from source — both documented in the LightX2V Quickstart.

1. Install the framework (conda path)

# Clone and create the environment
git clone https://github.com/ModelTC/LightX2V.git
cd LightX2V
conda create -n lightx2v python=3.11 -y
conda activate lightx2v
pip install -v -e .

The commands above are verbatim from the LightX2V Quickstart. Confirm that your torch was built against CUDA 12.8 or newer, because Blackwell sm_120 kernels require cu128. If pip install -v -e . pulled a wheel built for an older CUDA, reinstall PyTorch explicitly:

pip install --index-url https://download.pytorch.org/whl/cu128 torch torchvision

(SageAttention's README is explicit on the same constraint: "CUDA: >=12.8 for Blackwell or SageAttention2++".)
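One way to confirm which CUDA toolkit the installed wheel targets is to ask PyTorch directly. This is a small sketch; the helper names `cuda_build_ok` and `report` are illustrative, and the only PyTorch attributes used are `torch.version.cuda` and the standard `torch.cuda` device queries.

```python
def cuda_build_ok(cuda_version, minimum=(12, 8)):
    """Return True if a CUDA version string such as '12.8' meets the minimum."""
    if not cuda_version:  # CPU-only wheels report None
        return False
    parts = cuda_version.split(".")
    major = int(parts[0])
    minor = int(parts[1]) if len(parts) > 1 else 0
    return (major, minor) >= minimum


def report():
    """Summarise the installed torch build and the first visible GPU."""
    try:
        import torch
    except ImportError:
        return "torch is not installed"
    lines = [f"torch {torch.__version__}, built for CUDA {torch.version.cuda}"]
    lines.append(f"CUDA build new enough for sm_120: {cuda_build_ok(torch.version.cuda)}")
    if torch.cuda.is_available():
        major, minor = torch.cuda.get_device_capability(0)
        lines.append(f"device 0: {torch.cuda.get_device_name(0)} (sm_{major}{minor})")
    return "\n".join(lines)


print(report())
```

On a correctly configured 5090 the device line should report sm_120; a CUDA build older than 12.8 is the cue to reinstall from the cu128 index.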

2. (Strongly recommended) build SageAttention 2++ — 2.7× faster than FlashAttention 2 on the 5090

git clone https://github.com/thu-ml/SageAttention.git
cd SageAttention && CUDA_ARCHITECTURES="8.9,9.0,12.0" \
  EXT_PARALLEL=4 NVCC_APPEND_FLAGS="--threads 8" MAX_JOBS=32 \
  pip install -v -e .

CUDA_ARCHITECTURES="8.9,9.0,12.0" covers Ada (8.9), Hopper (9.0), and Blackwell (12.0); the 12.0 entry is the RTX 5090's sm_120 target. The SageAttention README explicitly announced 5090 support on 2025-02-15: "The compilation code is updated to support RTX5090! On RTX5090, SageAttention reaches 560T, 2.7x faster than FlashAttention2!" By that figure, the attention kernel is the single biggest kernel-level lever for this recipe on Blackwell.
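After the build, it is worth checking that the package is importable from the same environment that will run LightX2V. A minimal sketch follows, assuming the build installs a module named `sageattention`; the fallback string is a placeholder, not a confirmed LightX2V option, so check it against the attention modes your checkout accepts.

```python
import importlib.util


def sageattention_installed():
    """True if the `sageattention` package can be found in this environment."""
    return importlib.util.find_spec("sageattention") is not None


def pick_attn_mode(fallback="flash_attn2"):
    # "sage_attn2" is the value this recipe passes to create_generator(...).
    # The fallback name is an assumption made for illustration only.
    return "sage_attn2" if sageattention_installed() else fallback


print("attention mode:", pick_attn_mode())
```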

3. Pull the 4-step distilled T2V-14B checkpoint (FP8 — recommended primary path)

FP8 is hardware-accelerated on Blackwell sm_120; this is the recommended path for the 5090:

huggingface-cli download \
  lightx2v/Wan2.1-T2V-14B-StepDistill-CfgDistill-Lightx2v \
  --include "distill_fp8/*" \
  --local-dir ./weights/Wan2.1-T2V-14B-StepDistill

Alternative: the BF16 path also fits a 32 GB 5090 with text_encoder_offload=True (the 28.58 GB DiT alone fits VRAM, but the 11.36 GB BF16 UMT5 plus activations needs the headroom — offloading the encoder leaves room for full-quality DiT compute):

huggingface-cli download \
  lightx2v/Wan2.1-T2V-14B-StepDistill-CfgDistill-Lightx2v \
  --include "distill_models/*" \
  --local-dir ./weights/Wan2.1-T2V-14B-StepDistill
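The same downloads can be scripted from Python with `huggingface_hub.snapshot_download`, which accepts `allow_patterns` as the counterpart of the CLI's `--include`. The sketch below is one way to wrap it; `download_args` and `download` are illustrative helper names, and the sub-directory names come from the repo layout described earlier.

```python
REPO_ID = "lightx2v/Wan2.1-T2V-14B-StepDistill-CfgDistill-Lightx2v"
SUBDIRS = {"fp8": "distill_fp8", "int8": "distill_int8", "bf16": "distill_models"}


def download_args(variant, local_dir="./weights/Wan2.1-T2V-14B-StepDistill"):
    """Build keyword arguments for huggingface_hub.snapshot_download."""
    if variant not in SUBDIRS:
        raise ValueError(f"unknown variant {variant!r}; expected one of {sorted(SUBDIRS)}")
    return {
        "repo_id": REPO_ID,
        "allow_patterns": [f"{SUBDIRS[variant]}/*"],
        "local_dir": local_dir,
    }


def download(variant):
    """Fetch one weight variant and return the local path."""
    # Imported lazily so download_args() works without huggingface_hub installed.
    from huggingface_hub import snapshot_download
    return snapshot_download(**download_args(variant))
```

Calling `download("fp8")` pulls only the FP8 sub-directory; `download("bf16")` adds the BF16 weights alongside it in the same local directory.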

Alternative: Docker (simplest)

docker pull lightx2v/lightx2v:26011201-cu128

The cu128 tag is the right one for the 5090 — sm_120 Blackwell kernels require CUDA 12.8. The older 25101501-cu124 tag will not contain sm_120 kernels and will fall back to JIT compile or fail outright; stay on cu128 (Quickstart).

Alternative path: ComfyUI + GGUF (community-quantized)

If you'd rather work in ComfyUI than the LightX2V Python framework, QuantStack ships GGUF conversions that explicitly identify as "a GGUF conversion of an addon of lightx2v/Wan2.1-T2V-14B-StepDistill-CfgDistill" (link-back to canonical), produced with city96's standard conversion scripts. On a 32 GB 5090 the full Q8_0 (18.7 GB DiT) is the highest-quality GGUF tier — lower tiers are available if you also want to colocate a larger UMT5 quant or run multiple workflows side-by-side. Loading needs the ComfyUI-GGUF custom node plus the UMT5-XXL text encoder (Comfy-Org safetensors) and the Wan2.1 VAE (Kijai mirror). Note that QuantStack's release pre-merges the VACE control-conditioning addon onto the distilled base.

Running

The LightX2V repo ships ready-to-run shell scripts under scripts/wan/. The relevant ones for this recipe are:

# FP8 path on Blackwell — the recommended default for the 5090
bash scripts/wan/run_wan_t2v_distill_fp8_4step_cfg.sh

# BF16 path (fits resident on 32 GB; pair with text_encoder_offload for headroom)
bash scripts/wan/run_wan_t2v_distill_model_4step_cfg.sh

Under the hood, both invoke python -m lightx2v.infer with the matching config JSON from configs/distill/wan21/. Fill in lightx2v_path (the cloned repo root) and model_path (the directory you downloaded weights to in step 3) at the top of the script before running.
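A mistyped lightx2v_path or model_path is an easy way to lose a run, so a quick pre-flight check can help. This is a hypothetical helper, not part of LightX2V: it only verifies that the repo root contains scripts/wan and that the weights directory holds at least one .safetensors file.

```python
from pathlib import Path


def check_paths(lightx2v_path, model_path):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    repo = Path(lightx2v_path)
    if not (repo / "scripts" / "wan").is_dir():
        problems.append(f"{repo} does not contain scripts/wan (is it the repo root?)")
    model = Path(model_path)
    if not model.is_dir():
        problems.append(f"{model} is not a directory")
    elif not any(model.rglob("*.safetensors")):
        problems.append(f"{model} contains no .safetensors files")
    return problems


# Example: run from the cloned repo root with the step 3 download location.
for problem in check_paths(".", "./weights/Wan2.1-T2V-14B-StepDistill"):
    print("warning:", problem)
```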

Output lands at ${lightx2v_path}/save_results/wan_t2v_distill_*_4step.mp4. The recommended sampler settings — LCM scheduler with shift=5.0 and guidance_scale=1.0 — are baked into the distill config JSONs and explicitly documented on the HF model card:

"We recommend using the LCM scheduler with the following settings: shift=5.0, guidance_scale=1.0 (i.e., without CFG)."

For the Python API directly, the FP8 path on a 5090 looks like:

from lightx2v import LightX2VPipeline

pipe = LightX2VPipeline(
    model_path="./weights/Wan2.1-T2V-14B-StepDistill/distill_fp8",
    model_cls="wan2.1_distill",
    task="t2v",
)

# Optional on a 32 GB 5090 for the FP8 path — text_encoder_offload helps if you
# push to 720×1280 / longer clips. Required if you switch to the BF16 path so
# the 11.36 GB BF16 UMT5 doesn't squeeze DiT activations.
pipe.enable_offload(
    cpu_offload=False,                # set True only if you go BF16 with tight headroom
    offload_granularity="block",
    text_encoder_offload=True,        # leaves the 6.73 GB FP8 UMT5 on CPU
    image_encoder_offload=False,
    vae_offload=False,
)

pipe.create_generator(
    attn_mode="sage_attn2",           # SageAttention 2++, 5090-optimised
    infer_steps=4,                    # the whole point of the distilled checkpoint
    height=480, width=832,
    num_frames=81,
    guidance_scale=1.0,
    sample_shift=5.0,
)
pipe.generate(
    seed=42,
    prompt="A man with short gray hair plays a red electric guitar.",
    save_result_path="./output.mp4",
)
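The offload comments in the example can be condensed into a small selector. The sketch below is a hypothetical convenience wrapper: it uses only the keyword names shown in the enable_offload(...) call above, and the choice of when to flip cpu_offload follows the inline comments rather than any documented LightX2V rule.

```python
def offload_kwargs(variant, tight_headroom=False):
    """Keyword arguments for pipe.enable_offload(...), chosen per weight variant."""
    if variant not in ("fp8", "bf16"):
        raise ValueError(f"unknown variant: {variant!r}")
    return {
        # Block-level DiT offload: only for BF16 when VRAM headroom is tight.
        "cpu_offload": variant == "bf16" and tight_headroom,
        "offload_granularity": "block",
        # Optional on FP8; needed on BF16 so the 11.36 GB UMT5 stays off the GPU.
        "text_encoder_offload": True,
        "image_encoder_offload": False,
        "vae_offload": False,
    }


print(offload_kwargs("bf16", tight_headroom=True))
```

With a pipeline object in hand, this would be applied as `pipe.enable_offload(**offload_kwargs("fp8"))`.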

ℹ️ Don't copy the HF card's diffusers snippet as-is for T2V. The repo's auto-generated diffusers Quick Start example loads an input image and passes image=image — that's the Image-to-Video pipeline signature on a Text-to-Video repo (the snippet is templated from pipeline_tag and isn't variant-aware). For T2V, call the pipeline with prompt= only, or use the official LightX2V scripts above which already encode the right call shape.

Start at 480×832, 81 frames, 4 steps and only push to 720×1280 once you've confirmed peak VRAM via nvidia-smi -l 1. The 5090's 32 GB envelope gives you substantial room to push resolution before hitting the tile-cliff that 24 GB cards experience.
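Watching nvidia-smi -l 1 by eye works, but a short script can record the peak for you. This is a sketch that polls nvidia-smi's CSV query mode from a second terminal while a generation runs; the function names are illustrative, and it reads GPU 0 only.

```python
import subprocess
import time

QUERY = ["nvidia-smi", "--query-gpu=memory.used,memory.total",
         "--format=csv,noheader,nounits"]


def parse_smi(output):
    """Parse 'used, total' MiB pairs (one line per GPU) into integer tuples."""
    rows = []
    for line in output.strip().splitlines():
        used, total = (int(field) for field in line.split(","))
        rows.append((used, total))
    return rows


def sample_gpu_memory():
    """Run nvidia-smi once and return [(used_mib, total_mib), ...]."""
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
    return parse_smi(out)


def watch_peak(duration_s=60.0, interval_s=1.0):
    """Poll GPU 0 for duration_s seconds and return the peak used memory in MiB."""
    peak = 0
    deadline = time.monotonic() + duration_s
    while time.monotonic() < deadline:
        peak = max(peak, sample_gpu_memory()[0][0])
        time.sleep(interval_s)
    return peak
```

Start `watch_peak(...)` just before launching a generation and compare the result against the card's total before raising resolution or frame count.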

Results

  • Speed: No first-party RTX 5090–specific benchmark has been published for the Wan2.1-T2V-14B 4-step distilled variant at the time of writing. For directional context — and not as a direct prediction — the official LightX2V README "Cross-Framework Performance Comparison (RTX 4090D)" table reports 20.26 s/it for LightX2V single-GPU on an RTX 4090D (Ada sm_89, 24 GB) running the base Wan2.1-I2V-14B at 480P, 40 steps. Cross-architecture transfer (Ada → Blackwell) doesn't follow a single multiplier — compute, bandwidth, and attention-kernel selection all shift — so we deliberately do not relabel this number as a 5090 prediction. The closest first-party RTX 5090 measurement on the Wan family is on the I2V-14B-480P NVFP4 sibling: per the Wan-NVFP4 model card, end-to-end inference dropped from 498.9 s on the original I2V-14B to 17.65 s with the NVFP4 4-step distill on RTX 5090 single GPU (a 28× speedup). That number is for a different variant (I2V vs. T2V) and a different quant (NVFP4 vs. FP8), but it bounds how fast Blackwell can drive a Wan2.1-class workload once both the distillation and the quant path are arch-native. Empirical RTX 5090 numbers for this recipe will land at /check/lightx2v/rtx-5090 once a community benchmark is submitted via /contribute.
  • VRAM usage: The framework's Quickstart sets the floor at "at least 8GB VRAM" with offload + quant; the 5090's 32 GB envelope gives ample headroom on the FP8 path (~22 GB on-disk, comfortably under VRAM even fully resident). The on-disk envelope per the HF tree API is ~22 GB for the FP8 sub-directory and ~40.5 GB for the BF16 sub-directory; on the 5090 the 28.58 GB BF16 DiT alone now fits resident (it did not on the 4090's 24 GB), so the BF16 path becomes viable for the first time on a consumer card — pair with text_encoder_offload=True to give activations comfortable room. The LightX2V quantization matrix documents FP8 modes (fp8-vllm, fp8-sgl, fp8-q8f, fp8-b128-deepgemm) on "H100/H200/H800, RTX 40 series, etc." — the framework's actively-merging Blackwell support (PR #1090 SM120 MXFP8 fused operators, Wan-NVFP4 release) extends that list to sm_120 even where the docs page hasn't been refreshed yet. See /check/lightx2v/rtx-5090 for empirical numbers as they land.
  • Quality notes: The distilled checkpoint trades fine motion detail and prompt fidelity for the 4-step / no-CFG speed-up. Use the recommended LCM scheduler, shift=5.0, guidance_scale=1.0 (HF model card) and stay close to the model's training resolutions (480×832, 720×1280) for best results.
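The 28× figure quoted for the I2V NVFP4 sibling can be checked from the two timings, and a rough decomposition shows how much of it plausibly comes from distillation alone. The split below is a back-of-envelope assumption, not a measurement: it supposes the baseline ran 40 steps with CFG (two DiT passes per step) and ignores text-encoder and VAE time.

```python
baseline_s = 498.9  # original I2V-14B, end-to-end, per the Wan-NVFP4 card
nvfp4_s = 17.65     # NVFP4 4-step distill on the same RTX 5090
speedup = baseline_s / nvfp4_s

# Assumption: 40 steps x 2 passes (CFG) for the baseline, 4 steps x 1 pass distilled.
baseline_passes = 40 * 2
distilled_passes = 4 * 1
pass_ratio = baseline_passes / distilled_passes
residual = speedup / pass_ratio  # what is left for quantization and kernels

print(f"measured speedup:         {speedup:.1f}x")
print(f"from fewer DiT passes:    {pass_ratio:.0f}x")
print(f"left for quant + kernels: {residual:.2f}x")
```

Under those assumptions most of the gain comes from the step and CFG distillation, which is also what the FP8 T2V-14B recipe here inherits.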

For the full benchmark data, see /check/lightx2v/rtx-5090.

Troubleshooting

"I installed PyTorch but it doesn't see the 5090 / I get no kernel image is available"

Your PyTorch wheel was almost certainly built for an older CUDA toolkit (cu121, cu124, cu126) that doesn't include sm_120 Blackwell kernels. Reinstall against the cu128 channel:

pip install --index-url https://download.pytorch.org/whl/cu128 torch torchvision

SageAttention's README calls out the same constraint verbatim: "CUDA: >=12.8 for Blackwell or SageAttention2++". The Docker cu128 tag (lightx2v/lightx2v:26011201-cu128) sidesteps the issue entirely — strongly recommended on a fresh 5090 setup.

Out of memory pushing to 720×1280 / longer clips

The 5090's 32 GB envelope is generous on the FP8 path, but activations grow with resolution × frame-count. If you OOM at higher settings, in order of effectiveness:

  1. Make sure SageAttention 2++ is actually loaded. Pass attn_mode="sage_attn2" to create_generator(...). Per the SageAttention README, SageAttention 2++ runs at 560T on the 5090, 2.7× faster than FlashAttention 2 — without it, attention activations on a 720×1280 / 81-frame clip eat substantially more VRAM.
  2. Offload the UMT5 text encoder. Set text_encoder_offload=True in enable_offload(...); the FP8 UMT5 is 6.73 GB on its own, and pushing it to CPU frees that for DiT activations.
  3. Enable torch.compile. Cited as "+20% speed and -20% VRAM" in HF discussion #9 on the predecessor repo.
  4. Drop frame count before resolution. From 720×1280 / 81 frames, step down to 720×1280 / 49 frames first, and fall back to 480×832 / 81 frames only as a last resort.
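To see why activations grow with resolution × frame count, it helps to count the tokens the transformer actually processes. The sketch below assumes the Wan2.1 layout of 8× spatial and 4× temporal VAE compression followed by 2×2 patchification; those strides are assumptions stated here, not values taken from this recipe, and token count is only a rough proxy for activation memory.

```python
def wan_latent_tokens(height, width, num_frames,
                      spatial_stride=8, temporal_stride=4, patch=2):
    """Transformer sequence length for one clip under the assumed strides."""
    latent_frames = (num_frames - 1) // temporal_stride + 1
    cell = spatial_stride * patch  # pixels covered by one token per axis
    return latent_frames * (height // cell) * (width // cell)


# Settings in fallback order, from most to least demanding.
LADDER = [(720, 1280, 81), (720, 1280, 49), (480, 832, 81)]
for h, w, f in LADDER:
    print(f"{h}x{w} / {f} frames: {wan_latent_tokens(h, w, f)} tokens")
```

Each step down the ladder reduces the token count, which is the quantity the attention and FFN activations scale with.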

"I want NVFP4 acceleration on my 5090 — where's the T2V-14B NVFP4 file?"

It doesn't exist yet. The lightx2v/Wan-NVFP4 repo currently ships wan2.1_i2v_480p_nvfp4_lightx2v_4step.safetensors and wan2.1_t2v_1_3b_nvfp4_lightx2v_4step.safetensors — I2V-14B and T2V-1.3B, not T2V-14B. The framework supports nvfp4 as a first-class scheme and the wan_t2v_nvfp4.py example exists but is hard-coded to the 1.3B base. A maintainer response on Issue #605 (FP4-support request) linked the Wan-NVFP4 release on 2025-12-23, covering the I2V-14B and T2V-1.3B variants but not yet the T2V-14B distilled. The T2V-14B NVFP4 is a plausible future drop — watch the LightX2V Latest News for the announcement. For now, FP8 on Blackwell is the fastest officially-supported path for this exact recipe.

Reports of FP8 + offload crashes on the 5090 (different model — Qwen-Image, not Wan2.1)

LightX2V's open Issue #657 "Qwen Image FP8 Offload Crash on RTX 5090" documents an API mismatch (infer_block() got an unexpected keyword argument 'temb') when enabling offload with examples/qwen_image/qwen_2511_fp8.py. This is not the Wan2.1-T2V-14B path this recipe uses, but it's a sign that some Blackwell + FP8-offload edge cases are still being shaken out in the framework. If you hit a similar infer_block(...) traceback on the Wan path, pin the LightX2V commit to a tagged release rather than main, and file an issue with the full traceback.

Windows + RTX 50-series + one-click installer: KeyError: 'None-triton'

The Windows one-click bundle + 50-series environment combo currently fails to initialize the T5 offloaded attention path with KeyError: 'None-triton' (Issue #943). The likely cause per the reporter is a missing triton-windows install in the 50-series env package. Until upstream patches the bundle, the workaround is either (a) install on WSL2 / Linux directly, or (b) install triton-windows manually into the bundled env.

Slow inference despite the 4-step distillation

The 4-step path only delivers the advertised speedup if the LCM scheduler is actually loaded and guidance_scale=1.0. If you're calling create_generator(...) directly, make sure you pass guidance_scale=1.0 and sample_shift=5.0 — and that infer_steps=4, not the base model's 40. The shipped shell scripts and config JSONs already encode the right defaults; the trap is bespoke Python scripts that copy partial parameters and silently fall back to the un-distilled inference path. Both settings are explicit on the HF model card.
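A bespoke script can guard against this by comparing its generator arguments with the recommended values before calling create_generator(...). This is a hypothetical check, not a LightX2V feature; the three expected values are the ones quoted from the HF model card above.

```python
DISTILL_SETTINGS = {"infer_steps": 4, "guidance_scale": 1.0, "sample_shift": 5.0}


def distill_setting_mismatches(kwargs):
    """Return {name: (expected, actual)} for each setting that deviates or is missing."""
    return {
        name: (expected, kwargs.get(name))
        for name, expected in DISTILL_SETTINGS.items()
        if kwargs.get(name) != expected
    }


# Example: a partially copied parameter set that would run 40 steps.
generator_kwargs = {"infer_steps": 40, "guidance_scale": 1.0}
for name, (expected, actual) in distill_setting_mismatches(generator_kwargs).items():
    print(f"{name}: expected {expected}, got {actual}")
```

An empty result means the three settings match; anything else is worth fixing before blaming the hardware for slow runs.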

Resolution / frame-count crashes

The Wan2.1 base requires resolutions divisible by 16 and a frame count of the form 4n+1 (49 and 81 both qualify). Stick to the example configs (480×832 / 81 frames; 720×1280 / 81 frames) until you've measured a comfortable VRAM margin via nvidia-smi -l 1 during a generation run.
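Those constraints are cheap to check before a run. The sketch below is a hypothetical validator; the divisible-by-16 rule is from the text above, while the 4n+1 frame rule is an assumption that matches the 49- and 81-frame settings used in this recipe.

```python
def shape_problems(height, width, num_frames):
    """Return a list of constraint violations; an empty list means the shape is OK."""
    problems = []
    for name, value in (("height", height), ("width", width)):
        if value % 16 != 0:
            problems.append(f"{name}={value} is not divisible by 16")
    if num_frames % 4 != 1:
        problems.append(f"num_frames={num_frames} is not of the form 4n+1")
    return problems


for shape in [(480, 832, 81), (720, 1280, 49), (500, 832, 80)]:
    print(shape, shape_problems(*shape) or "ok")
```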

Report new issues via the submission form. Community RTX 5090 benchmarks would directly improve the /check/lightx2v/rtx-5090 data.