
LightX2V on RTX 5070 Ti: 4-Step Text-to-Video with Distilled Wan2.1-14B via Blackwell-Native FP8

video · intermediate · 8GB+ VRAM · Jun 3, 2026
prerequisites
  • NVIDIA RTX 5070 Ti (16 GB VRAM, Blackwell sm_120) or any CUDA GPU with ≥8 GB VRAM
  • 16 GB+ system RAM (per the LightX2V Quickstart; 32 GB helps if you fall back to BF16 + offload)
  • Python 3.10+ (3.11 recommended) and PyTorch built against CUDA 12.8+ (sm_120 kernels require cu128)
  • ~25 GB free disk space for the FP8 distilled weights (~65 GB if you also pull BF16, since the BF16 sub-directory adds ~40.5 GB)

What You'll Build

Generate short text-to-video clips locally on a 16 GB RTX 5070 Ti using LightX2V, an inference framework that ships 4-step, CFG-free distilled checkpoints of Wan2.1-T2V-14B. The distilled checkpoint cuts inference from 40–50 steps down to 4 with no classifier-free guidance, and the HF model card explicitly calls out: "New fp8 and int8 quantized distillation models have been added, which enable fast inference using lightx2v on RTX 4060." The 5070 Ti sits well above the RTX 4060 (8 GB) the maintainers tested: it is in the same FP8-capable hardware class and has double the VRAM, so the FP8 path fits the 16 GB envelope once the text encoder is offloaded.

Hardware data: RTX 5070 Ti (16 GB VRAM, Blackwell sm_120) · 4-step distilled Wan2.1-T2V-14B · See benchmark data

ℹ️ This is distilled Wan 2.1, not Wan 2.2. The lightx2v org publishes a wider family — Wan2.1-T2V-14B distilled, Wan2.2-A14B distilled (timestep-MoE), HunyuanVideo-1.5 distilled, Qwen-Image distilled — each with different VRAM and inference characteristics. This recipe is specifically the Wan2.1-T2V-14B-StepDistill-CfgDistill-Lightx2v repo (4-step CFG-free distillation of the dense Wan2.1-T2V-14B base). The same install steps don't transfer cleanly to the A14B timestep-MoE — see the LightX2V repository for the Wan 2.2 / HunyuanVideo distilled releases.

FP8 is the fast path on Blackwell. On Ampere sm_86 (RTX 30-series), loading FP8 weights forces a dequantize-to-BF16 at compute time because the architecture has no FP8 tensor cores. Blackwell sm_120 (RTX 50-series) has native FP8 tensor-core acceleration (E4M3 / E5M2). The RTX 5070 Ti (8960 CUDA cores, sm_120) is built on the same Blackwell GB203 die as the RTX 5080, so FP8 gives you both the VRAM savings and the throughput win. Blackwell support in the framework is still under active development, as shown by the open upstream PR #1090 "feat: add MXFP8 fused operators for Wan transformer inference on SM120", which reports an "End-to-end FFN: 1.20× faster (608μs → 505μs, -103μs per block)" result on the SM120 path. This recipe routes through FP8 as the primary path.

Requirements

Component | Minimum | Tested
GPU | 8 GB VRAM (CUDA) per the LightX2V Quickstart | RTX 5070 Ti (Blackwell sm_120, 16 GB GDDR7)
RAM | 16 GB or more recommended per Quickstart; 32 GB if you fall back to BF16 + text-encoder offload |
Storage | At least 50 GB available space per Quickstart; ~22 GB for the FP8 sub-directory alone |
Software | Python 3.10+, PyTorch built against CUDA 12.8+ (sm_120 kernels require cu128) | per Quickstart

The lightx2v/Wan2.1-T2V-14B-StepDistill-CfgDistill-Lightx2v HF repo ships three sub-directories (sizes verified via the HF tree API):

  • distill_fp8/ — per-block FP8 quant, ~22.3 GB total (40 transformer blocks at 0.352 GB each = 14.08 GB DiT + non_block.safetensors 0.931 GB + FP8 UMT5-XXL encoder models_t5_umt5-xxl-enc-fp8.pth 6.733 GB + Wan2.1 VAE 0.508 GB). Recommended primary path on the 5070 Ti — FP8 is hardware-accelerated on Blackwell, and the ~15 GB DiT + VAE fits the 16 GB envelope once the 6.733 GB UMT5 encoder is offloaded to CPU.
  • distill_int8/ — per-block INT8 quant, ~22.3 GB total with the same layout (INT8 UMT5 6.733 GB). Fully supported as well; FP8 is preferred on Blackwell because the FP8 tensor-core path is the architecture's headline throughput route for diffusion workloads.
  • distill_models/ — BF16 dense, ~40.5 GB total (28.577 GB DiT + 11.362 GB BF16 UMT5 + 0.508 GB VAE). The 28.577 GB DiT alone exceeds the 5070 Ti's 16 GB VRAM, so the BF16 path is not viable on this card without heavy CPU offload (and even then is impractical) — stay on FP8 / INT8.
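The size arithmetic in the list above can be checked with a few lines of Python. This is only a sketch: the component names and the resident_gb helper are made up for illustration, the figures are the on-disk sizes quoted above, and it assumes on-disk size approximates resident VRAM (it ignores activations, the CUDA context, and any GB vs GiB difference).

```python
# Component sizes in GB, as quoted for distill_fp8/ in the list above.
# The dictionary keys are illustrative labels, not LightX2V identifiers.
FP8_COMPONENTS_GB = {
    "dit_blocks": 40 * 0.352,  # 40 transformer blocks at 0.352 GB each
    "non_block": 0.931,
    "umt5_encoder": 6.733,
    "vae": 0.508,
}
VRAM_GB = 16.0  # RTX 5070 Ti


def resident_gb(components, offloaded=()):
    """Sum the sizes of the components that are not offloaded to CPU."""
    return sum(size for name, size in components.items() if name not in offloaded)


on_disk = resident_gb(FP8_COMPONENTS_GB)
on_gpu = resident_gb(FP8_COMPONENTS_GB, offloaded=("umt5_encoder",))

print(f"FP8 sub-directory total:     {on_disk:.2f} GB")
print(f"on GPU with UMT5 offloaded:  {on_gpu:.2f} GB")
print(f"left over before activations: {VRAM_GB - on_gpu:.2f} GB")
```

The point of the exercise is that the full FP8 set exceeds 16 GB while the DiT + VAE subset sits just under it, which is why the text encoder has to live on the CPU.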

Installation

The canonical install is Docker (simplest) or conda from source — both documented in the LightX2V Quickstart.

1. Install the framework (conda path)

# Clone and create the environment
git clone https://github.com/ModelTC/LightX2V.git
cd LightX2V
conda create -n lightx2v python=3.11 -y
conda activate lightx2v
pip install -v -e .

Verbatim from the LightX2V Quickstart. Confirm your torch was built against CUDA 12.8 or newer — Blackwell sm_120 kernels require cu128. If pip install -v -e . pulled a wheel built for an older CUDA, reinstall PyTorch explicitly:

pip install --index-url https://download.pytorch.org/whl/cu128 torch torchvision

The SageAttention README calls out the same constraint in its install notes: CUDA >=12.8 is required for Blackwell.
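To confirm which CUDA version your installed wheel was built against, a small check along these lines may help. It is a sketch, not part of LightX2V: the helper names are hypothetical, and it relies only on torch.version.cuda, torch.cuda.get_device_capability and torch.cuda.get_arch_list from PyTorch.

```python
def parse_cuda_version(version):
    """Turn a string such as '12.8' into a comparable (major, minor) tuple."""
    major, minor = version.split(".")[:2]
    return int(major), int(minor)


def build_supports_blackwell(cuda_version, min_version=(12, 8)):
    """True if the CUDA version the wheel was built against meets the minimum."""
    if cuda_version is None:  # CPU-only wheels report torch.version.cuda as None
        return False
    return parse_cuda_version(cuda_version) >= min_version


def main():
    try:
        import torch
    except ImportError:
        print("PyTorch is not installed in this environment")
        return
    print("built against CUDA:", torch.version.cuda)
    print("cu128 or newer:", build_supports_blackwell(torch.version.cuda))
    if torch.cuda.is_available():
        # A capability of (12, 0) corresponds to sm_120.
        print("compute capability:", torch.cuda.get_device_capability(0))
        print("compiled arch list:", torch.cuda.get_arch_list())


if __name__ == "__main__":
    main()
```

If the reported CUDA version is below 12.8, or sm_120 is missing from the compiled arch list, reinstall from the cu128 index as shown above.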

2. (Strongly recommended) build SageAttention 2 — the biggest attention-kernel lever on Blackwell

git clone https://github.com/thu-ml/SageAttention.git
cd SageAttention && CUDA_ARCHITECTURES="8.9,9.0,12.0" \
  EXT_PARALLEL=4 NVCC_APPEND_FLAGS="--threads 8" MAX_JOBS=32 \
  pip install -v -e .

CUDA_ARCHITECTURES="8.9,9.0,12.0" covers Ada through Blackwell; the 12.0 entry is the RTX 5070 Ti's sm_120 target. The SageAttention README announced Blackwell support on 2025-02-15: "The compilation code is updated to support RTX5090! On RTX5090, SageAttention reaches 560T, 2.7x faster than FlashAttention2!" The RTX 5090, 5080 and 5070 Ti all share the same Blackwell sm_120 target, so the same kernel path applies; the 5070 Ti's lower core count means a smaller absolute throughput, but SageAttention 2 remains the single biggest attention-kernel win on the card.

3. Pull the 4-step distilled T2V-14B checkpoint (FP8 — recommended primary path)

FP8 is hardware-accelerated on Blackwell sm_120; this is the recommended path for the 5070 Ti:

huggingface-cli download \
  lightx2v/Wan2.1-T2V-14B-StepDistill-CfgDistill-Lightx2v \
  --include "distill_fp8/*" \
  --local-dir ./weights/Wan2.1-T2V-14B-StepDistill

Alternative INT8 weights (also Blackwell-supported, slightly different numeric profile). There is no shipped INT8 shell script, so pull the INT8 weights the same way and then point your run at the INT8 config:

huggingface-cli download \
  lightx2v/Wan2.1-T2V-14B-StepDistill-CfgDistill-Lightx2v \
  --include "distill_int8/*" \
  --local-dir ./weights/Wan2.1-T2V-14B-StepDistill

Alternative: Docker (simplest)

docker pull lightx2v/lightx2v:26011201-cu128

The cu128 tag is the right one for the 5070 Ti — sm_120 Blackwell kernels require CUDA 12.8. The Quickstart recommends "using the cuda128 environment for faster inference"; the older 25101501-cu124 tag will not contain sm_120 kernels, so stay on cu128 (Quickstart).

Alternative path: ComfyUI + GGUF (community-quantized)

If you'd rather work in ComfyUI than the LightX2V Python framework, QuantStack ships GGUF conversions that explicitly identify as "a GGUF conversion of an addon of lightx2v/Wan2.1-T2V-14B-StepDistill-CfgDistill" (linking back to the canonical repo), produced with city96's standard conversion scripts. On a 16 GB 5070 Ti, stick to the Q5_K_M (~13 GB DiT) or smaller tiers so the separate UMT5 text encoder and VAE still fit alongside; the Q8_0 (18.7 GB DiT) does not fit in 16 GB. Loading needs the ComfyUI-GGUF custom node plus the UMT5-XXL text encoder (Comfy-Org safetensors or city96's GGUF UMT5 mirror) and the Wan2.1 VAE (Kijai mirror). Note that QuantStack's release pre-merges the VACE control-conditioning addon onto the distilled base, which is useful if you want pose/depth control; for plain T2V the inference shape is otherwise identical.

Running

The LightX2V repo ships ready-to-run shell scripts under scripts/wan/. The relevant one for the FP8 text-to-video path is:

# FP8 path on Blackwell — the recommended default for the 5070 Ti
bash scripts/wan/run_wan_t2v_distill_fp8_4step_cfg.sh

Under the hood this invokes python -m lightx2v.infer with --model_cls wan2.1_distill --task t2v and the matching config JSON from configs/distill/wan21/ (wan_t2v_distill_fp8_4step_cfg.json). Fill in lightx2v_path (the cloned repo root) and model_path (the directory you downloaded weights to in step 3) at the top of the script before running. Note the shipped script passes a text prompt only (--prompt "...") and a --save_result_path — there is no image input, which is the correct shape for this text-to-video repo.

For the Python API directly, the FP8 path on a 5070 Ti looks like:

from lightx2v import LightX2VPipeline

pipe = LightX2VPipeline(
    model_path="./weights/Wan2.1-T2V-14B-StepDistill/distill_fp8",
    model_cls="wan2.1_distill",
    task="t2v",
)

# REQUIRED on the 16 GB 5070 Ti: offload the 6.733 GB UMT5 text encoder to CPU so
# the ~15 GB FP8 DiT + VAE + activations fit the 16 GB envelope.
pipe.enable_offload(
    cpu_offload=False,                # leave the FP8 DiT resident on the 5070 Ti
    offload_granularity="block",
    text_encoder_offload=True,        # leaves the 6.733 GB UMT5 on CPU — needed on 16 GB
    image_encoder_offload=False,
    vae_offload=False,
)

pipe.create_generator(
    attn_mode="sage_attn2",           # SageAttention 2, sm_120-optimised
    infer_steps=4,                    # the whole point of the distilled checkpoint
    height=480, width=832,
    num_frames=81,
    guidance_scale=1.0,
    sample_shift=5.0,
)
pipe.generate(
    seed=42,
    prompt="A man with short gray hair plays a red electric guitar.",
    save_result_path="./output.mp4",
)

⚠️ Don't copy the rendered HF card's diffusers Quick Start as-is for T2V. The repo's pipeline_tag is set to image-to-video, so HuggingFace auto-generates a diffusers snippet that loads an input image and passes image=image — that's the Image-to-Video pipeline signature on what is a Text-to-Video repo (the snippet is templated from pipeline_tag and isn't variant-aware). For T2V, call the pipeline with prompt= only (as above), or use the official LightX2V shell script, which already encodes the prompt-only call shape (--prompt "...", no image input).

The recommended sampler settings are the LCM scheduler with shift=5.0 and guidance_scale=1.0 (no CFG) — documented on the HF model card and baked into the distill config JSONs. Start at 480×832, 81 frames, 4 steps and only push toward 720×1280 once you've confirmed peak VRAM via nvidia-smi -l 1 — the 16 GB envelope is tighter than a 5090's, so verify headroom before scaling resolution.
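If you would rather record the peak than watch nvidia-smi -l 1 by eye, a polling loop along these lines is one option. It is a minimal sketch: the function names are hypothetical, it shells out to nvidia-smi with its --query-gpu=memory.used CSV output, and a one-second poll can miss short allocation spikes.

```python
import subprocess
import time


def parse_memory_used(csv_output):
    """Parse memory.used rows (MiB, one line per GPU) into a list of ints."""
    return [int(line.strip()) for line in csv_output.splitlines() if line.strip()]


def query_memory_used_mib():
    """Ask nvidia-smi for the current memory use of every GPU, in MiB."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.used", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_memory_used(out)


def watch_peak(duration_s=120, interval_s=1.0, gpu_index=0):
    """Poll once per interval and return the highest reading seen, in MiB."""
    peak = 0
    deadline = time.monotonic() + duration_s
    while time.monotonic() < deadline:
        peak = max(peak, query_memory_used_mib()[gpu_index])
        time.sleep(interval_s)
    return peak
```

Run watch_peak() in a second terminal while a generation is in progress, and compare the result against the card's 16 GB before raising resolution.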

Results

  • Speed: No first-party RTX 5070 Ti benchmark has been published for the Wan2.1-T2V-14B 4-step distilled variant at the time of writing, so we omit a speed figure rather than extrapolate. The RTX 4060 the maintainers tested is a lower-tier card from a different generation, so its numbers do not transfer. The 5070 Ti shares the Blackwell GB203 die with the RTX 5080, but its narrower memory bandwidth (~896 GB/s vs the 5080's ~960 GB/s) and ~17% lower core count (8960 vs 10752 CUDA cores) put it below the 5080, so even a 5080 figure would overstate it, and a 5090 number more so. The upstream SM120 FFN micro-benchmark cited above (PR #1090) was measured on an RTX 5090, not this card, and is a single-operator kernel result rather than an end-to-end generation time. Empirical RTX 5070 Ti numbers for this recipe will land at /check/lightx2v/rtx-5070-ti once a community benchmark is submitted via /contribute.
  • VRAM usage: The framework's Quickstart sets the floor at "at least 8GB VRAM" with offload + quant, and the HF model card confirms the fp8 / int8 distillation weights "enable fast inference using lightx2v on RTX 4060" (an 8 GB card). On the 16 GB 5070 Ti the FP8 DiT (14.08 GB of per-block weights + 0.931 GB non_block ≈ 15 GB) plus the 0.508 GB VAE comes to roughly 15.5 GB, right at the envelope, so text_encoder_offload=True is required to keep the 6.733 GB UMT5 encoder off the GPU. That leaves only about 0.5 GB for activations at 480×832 / 81 frames, so watch peak VRAM with nvidia-smi -l 1 and see Troubleshooting if the run goes out of memory. The on-disk envelope per the HF tree API is ~22.3 GB for the FP8 sub-directory; the BF16 sub-directory is ~40.5 GB (its 28.577 GB DiT alone exceeds 16 GB and is impractical on this card). See /check/lightx2v/rtx-5070-ti for empirical numbers as they land.
  • Quality notes: The distilled checkpoint trades fine motion detail and prompt fidelity for the 4-step / no-CFG speed-up. Use the recommended LCM scheduler, shift=5.0, guidance_scale=1.0 (HF model card) and stay close to the model's training resolutions (480×832, 720×1280) for best results.
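The gap between the 5070 Ti and the 5080 mentioned in the speed note can be worked out from the quoted specifications. The snippet below is plain arithmetic on those figures; the percent_lower helper is illustrative, and spec ratios are not a prediction of generation time.

```python
# Figures quoted in the speed note above.
CORES_5070_TI, CORES_5080 = 8960, 10752
BANDWIDTH_5070_TI, BANDWIDTH_5080 = 896, 960  # GB/s, approximate


def percent_lower(value, reference):
    """How far `value` falls below `reference`, as a percentage."""
    return (1 - value / reference) * 100


print(f"CUDA cores: {percent_lower(CORES_5070_TI, CORES_5080):.1f}% lower")
print(f"Bandwidth:  {percent_lower(BANDWIDTH_5070_TI, BANDWIDTH_5080):.1f}% lower")
```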

For the full benchmark data, see /check/lightx2v/rtx-5070-ti.

Troubleshooting

"I installed PyTorch but it doesn't see the 5070 Ti / I get no kernel image is available"

Your PyTorch wheel was almost certainly built for an older CUDA toolkit (cu121, cu124, cu126) that doesn't include sm_120 Blackwell kernels. Reinstall against the cu128 channel:

pip install --index-url https://download.pytorch.org/whl/cu128 torch torchvision

The SageAttention README calls out the same CUDA >=12.8 constraint for Blackwell. The Docker cu128 tag (lightx2v/lightx2v:26011201-cu128) sidesteps the issue entirely — strongly recommended on a fresh 5070 Ti setup.

Out of memory on the FP8 path

The 16 GB envelope is tight once the ~15 GB FP8 DiT is resident. In order of effectiveness:

  1. Offload the UMT5 text encoder. Set text_encoder_offload=True in enable_offload(...) — this is mandatory on the 5070 Ti, not optional. The FP8 UMT5 is 6.733 GB on its own, and there is no room for it on the GPU alongside the DiT.
  2. Make sure SageAttention 2 is actually loaded. Pass attn_mode="sage_attn2" to create_generator(...). Without it, attention activations on a 480×832 / 81-frame clip eat substantially more VRAM (and run slower).
  3. Enable torch.compile. A community commenter (Blyss, no team/org badge) on HF discussion #9 on the predecessor repo describes torch.compile as netting "+20% speed and -20% VRAM" on top of the other optimizations.
  4. Drop frame count before resolution. As a final fallback, go from 480×832 / 81 frames to 480×832 / 49 frames; only push to 720×1280 if peak VRAM stays comfortably below 16 GB.
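The last step in the list, dropping frames while holding resolution, can be written down as an ordered set of settings to retry. This is a sketch with hypothetical helper names. The intermediate 65-frame rung is an assumption added for illustration: only 81 and 49 appear in this recipe, and the step of 16 is chosen so every rung keeps the same 4n+1 shape as those two values.

```python
def frame_ladder(start=81, floor=49, step=16):
    """Frame counts to try in order, from `start` down to `floor` inclusive."""
    return list(range(start, floor - 1, -step))


def fallback_settings(height=480, width=832):
    """Settings to retry after an OOM: lower the frame count, keep resolution."""
    return [
        {"height": height, "width": width, "num_frames": frames}
        for frames in frame_ladder()
    ]


for settings in fallback_settings():
    print(settings)
```

Each dictionary uses the same height, width and num_frames keyword names as the create_generator(...) call shown earlier, so a rung can be passed through with **settings.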

On HF discussion #9, a user (jzli) hit OOM with the unquantized distilled T2V-14B even on a 48 GB A6000; a community commenter (Blyss, no team/org badge) recommended the toolbox of SageAttention, scaled-fp8, torch.compile, and transformer block-swap as the way to bring it down. The same toolbox applies on the 16 GB 5070 Ti — where the quantized FP8 weights are mandatory rather than optional.

Reports of FP8 + offload crashes on Blackwell (different model — Qwen-Image, not Wan2.1)

LightX2V's open Issue #657 "Qwen Image FP8 Offload Crash on RTX 5090" documents an API mismatch (infer_block() got an unexpected keyword argument 'temb') when enabling offload with the Qwen-Image FP8 example. This is not the Wan2.1-T2V-14B path this recipe uses, but it shows some Blackwell + FP8-offload edge cases are still being shaken out in the framework. If you hit a similar infer_block(...) traceback on the Wan path, pin the LightX2V commit to a tagged release rather than main, and file an issue with the full traceback.

Windows + RTX 50-series + one-click installer: KeyError: 'None-triton'

The Windows one-click bundle + 50-series environment combo currently fails to initialize the T5 offloaded attention path with KeyError: 'None-triton' (Issue #943). The likely cause per the reporter is a missing triton-windows install in the 50-series env package. Until upstream patches the bundle, the workaround is either (a) install on WSL2 / Linux directly, or (b) install triton-windows manually into the bundled env.

Slow inference despite the 4-step distillation

The 4-step path only delivers the advertised speedup if the LCM scheduler is loaded and guidance_scale=1.0. If you're calling create_generator(...) directly, make sure you pass guidance_scale=1.0 and sample_shift=5.0 — and that infer_steps=4, not the base model's 40. The shipped shell script and config JSON already encode the right defaults; the trap is bespoke Python scripts that copy partial parameters and silently fall back to the un-distilled inference path. Both settings are documented on the HF model card.
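One way to guard a bespoke script against that trap is to check the keyword arguments before they reach create_generator(...). The sketch below is not part of LightX2V; check_distill_kwargs and DISTILL_DEFAULTS are hypothetical names, and the expected values are the ones given in this recipe.

```python
# Expected values for the 4-step distilled checkpoint, per this recipe.
DISTILL_DEFAULTS = {"infer_steps": 4, "guidance_scale": 1.0, "sample_shift": 5.0}


def check_distill_kwargs(kwargs):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    for key, expected in DISTILL_DEFAULTS.items():
        if key not in kwargs:
            problems.append(f"{key} is missing (expected {expected})")
        elif kwargs[key] != expected:
            problems.append(f"{key}={kwargs[key]} (expected {expected})")
    return problems


generator_kwargs = {"infer_steps": 40, "guidance_scale": 1.0}
for problem in check_distill_kwargs(generator_kwargs):
    print("warning:", problem)
```

With the example dictionary above, the check flags both the 40-step setting and the missing sample_shift.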

Resolution / frame-count crashes

The Wan2.1 base requires resolutions divisible by 16 and a frame count that follows the model's temporal grouping (the values used in this recipe, 49 and 81, are both of the form 4n+1). Stick to the example configs (480×832 / 81 frames; 720×1280 / 81 frames) until you've measured a comfortable VRAM margin via nvidia-smi -l 1 during a generation run.
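A pre-flight check can catch a bad shape before the model is loaded. The sketch below uses a hypothetical validate_shape helper. The divisible-by-16 rule comes from the paragraph above; the 4n+1 frame rule is an assumption, chosen because it matches the 49- and 81-frame examples in this recipe.

```python
def validate_shape(height, width, num_frames):
    """Return a list of problems with the requested clip shape (empty if OK)."""
    problems = []
    for name, value in (("height", height), ("width", width)):
        if value % 16 != 0:
            problems.append(f"{name}={value} is not divisible by 16")
    # Assumption: valid frame counts have the form 4n + 1 (e.g. 49, 81).
    if num_frames % 4 != 1:
        problems.append(f"num_frames={num_frames} is not of the form 4n + 1")
    return problems


print(validate_shape(480, 832, 81))
print(validate_shape(480, 830, 80))
```

The first call returns an empty list; the second reports both the width and the frame count.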

Report new issues via submission form — community RTX 5070 Ti benchmarks would directly improve the /check/lightx2v/rtx-5070-ti data.