LightX2V on RTX 5070: 4-Step Text-to-Video with Distilled Wan2.1-14B via Blackwell-Native FP8 + Offload

video · intermediate · 8GB+ VRAM · Jun 5, 2026
prerequisites
  • NVIDIA RTX 5070 (12 GB VRAM, Blackwell sm_120) or any CUDA GPU with ≥8 GB VRAM
  • 16 GB+ system RAM (per the LightX2V Quickstart — the CPU tier of the offload path holds the encoder + streamed blocks)
  • Python 3.10+ (3.11 recommended) and PyTorch built against CUDA 12.8+ (sm_120 kernels require cu128)
  • ~25 GB free disk space for the FP8 distilled weights (~50 GB if you also pull BF16)

What You'll Build

Generate short text-to-video clips locally on a 12 GB RTX 5070 using LightX2V, an inference framework that ships 4-step, CFG-free distilled checkpoints of Wan2.1-T2V-14B. The distilled checkpoint cuts inference from 40–50 steps down to 4 with no classifier-free guidance, and the HF model card explicitly calls out: "New fp8 and int8 quantized distillation models have been added, which enable fast inference using lightx2v on RTX 4060." The RTX 4060 the maintainers name is an 8 GB Ada Lovelace (sm_89) card; the framework reaches it via a disk-CPU-GPU three-tier offload. The 12 GB RTX 5070 is also FP8-capable, is one architecture generation newer (Blackwell), and has 50 % more VRAM, so it runs the same FP8 path with more headroom.

Hardware data: RTX 5070 (12 GB VRAM, Blackwell sm_120) · 4-step distilled Wan2.1-T2V-14B · See benchmark data

ℹ️ This is distilled Wan 2.1, not Wan 2.2. The lightx2v org publishes a wider family — Wan2.1-T2V-14B distilled, Wan2.2-A14B distilled (timestep-MoE), HunyuanVideo-1.5 distilled, Qwen-Image distilled — each with different VRAM and inference characteristics. This recipe is specifically the Wan2.1-T2V-14B-StepDistill-CfgDistill-Lightx2v repo (4-step CFG-free distillation of the dense Wan2.1-T2V-14B base). The same install steps don't transfer cleanly to the A14B timestep-MoE — see the LightX2V repository for the Wan 2.2 / HunyuanVideo distilled releases.

⚠️ This is Text-to-Video: there is no image input. The repo's pipeline_tag is set to image-to-video, so HuggingFace auto-generates a diffusers Quick Start snippet that loads an input image and passes image=image. That is the wrong call shape for this repo. The repo name is Wan2.1-T2V-14B (note the T2V), the README's run command is bash scripts/wan/run_wan_t2v_distill_4step_cfg.sh, and the shipped script passes a text prompt only (--prompt "...", no image). Drive it with prompt= only, never image=.

FP8 is the fast path on Blackwell. Unlike Ampere sm_86 (RTX 30-series), where loading FP8 weights forces a dequantize-to-BF16 at compute time because the architecture has no FP8 tensor cores, Blackwell sm_120 (RTX 50-series) has native FP8 tensor-core acceleration (E4M3 / E5M2). The RTX 5070 (6144 CUDA cores, sm_120) is built on the Blackwell GB205 die. FP8 gives you both the VRAM savings and the throughput win, which is why this recipe uses FP8 as the primary path on a 12 GB card (INT8 also fits; BF16 does not). The framework's Blackwell support is under active development, as the open upstream PR #1090 "feat: add MXFP8 fused operators for Wan transformer inference on SM120" shows.

Requirements

Component | Minimum | Tested
GPU | 8 GB VRAM (CUDA) per the LightX2V Quickstart | RTX 5070 (Blackwell sm_120, 12 GB GDDR7)
RAM | 16 GB or more recommended per Quickstart; the offload CPU tier holds the 6.7 GB UMT5 encoder + streamed blocks |
Storage | At least 50 GB available space per Quickstart; ~22 GB for the FP8 sub-directory alone |
Software | Python 3.10+, PyTorch built against CUDA 12.8+ (sm_120 kernels require cu128) | per Quickstart

The lightx2v/Wan2.1-T2V-14B-StepDistill-CfgDistill-Lightx2v HF repo ships three sub-directories (sizes verified via the HF tree API):

  • distill_fp8/ — per-block FP8 quant, ~22.3 GB total (40 transformer blocks at 0.352 GB each = 14.08 GB DiT + non_block.safetensors 0.931 GB + FP8 UMT5-XXL encoder models_t5_umt5-xxl-enc-fp8.pth 6.733 GB + Wan2.1 VAE 0.508 GB). Recommended primary path on the 5070 — FP8 is hardware-accelerated on Blackwell, and the per-block file layout is exactly what the framework's block-granularity offload streams from CPU so the full ~15 GB DiT never has to be resident at once on the 12 GB card.
  • distill_int8/ — per-block INT8 quant, ~22.3 GB total with the same layout (INT8 UMT5 6.733 GB). Fully supported as well; FP8 is preferred on Blackwell because the FP8 tensor-core path is the architecture's headline throughput route for diffusion workloads.
  • distill_models/ — BF16 dense, ~40.5 GB total (28.577 GB DiT + 11.362 GB BF16 UMT5 + 0.508 GB VAE). The 28.577 GB DiT alone is more than double the 5070's 12 GB VRAM, so the BF16 path is not viable on this card — stay on FP8 / INT8.
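The size arithmetic in the list above can be checked with a short sketch. The component sizes are the ones quoted in this section; the dictionary names and helper functions are illustrative only and are not part of LightX2V.

```python
# Component sizes in GB, copied from the sub-directory listing above.
FP8_PARTS = {
    "dit_blocks": 40 * 0.352,  # 40 per-block transformer files
    "dit_non_block": 0.931,    # non_block.safetensors
    "umt5": 6.733,             # FP8 UMT5-XXL text encoder
    "vae": 0.508,              # Wan2.1 VAE
}
BF16_PARTS = {"dit": 28.577, "umt5": 11.362, "vae": 0.508}

VRAM_GB = 12.0  # RTX 5070


def total_gb(parts):
    """On-disk size of one sub-directory."""
    return sum(parts.values())


def dit_gb(parts):
    """Size of the DiT alone (every key that starts with 'dit')."""
    return sum(size for name, size in parts.items() if name.startswith("dit"))


if __name__ == "__main__":
    for label, parts in (("FP8", FP8_PARTS), ("BF16", BF16_PARTS)):
        fits = "fits" if dit_gb(parts) <= VRAM_GB else "does not fit"
        print(f"{label}: total {total_gb(parts):.1f} GB, "
              f"DiT {dit_gb(parts):.1f} GB ({fits} fully resident in {VRAM_GB:.0f} GB)")
```

Neither DiT fits fully resident in 12 GB, which is why the FP8 path relies on block-granularity offload instead of loading the whole transformer at once.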

Installation

The canonical install is Docker (simplest) or conda from source — both documented in the LightX2V Quickstart.

1. Install the framework (conda path)

# Clone and create the environment
git clone https://github.com/ModelTC/LightX2V.git
cd LightX2V
conda create -n lightx2v python=3.11 -y
conda activate lightx2v
pip install -v -e .

Verbatim from the LightX2V Quickstart. Confirm your torch was built against CUDA 12.8 or newer — Blackwell sm_120 kernels require cu128. If pip install -v -e . pulled a wheel built for an older CUDA, reinstall PyTorch explicitly:

pip install --index-url https://download.pytorch.org/whl/cu128 torch torchvision

The SageAttention README calls out the same constraint in its install notes: CUDA >=12.8 is required for Blackwell.
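A quick way to confirm the constraint before going further is to inspect the CUDA version your PyTorch wheel was built against. The sketch below is a minimal check; the helper name cuda_build_supports_blackwell is made up for this example, while torch.version.cuda and torch.cuda.get_device_capability are standard PyTorch attributes.

```python
def cuda_build_supports_blackwell(cuda_version):
    """Return True if a PyTorch CUDA build string such as '12.8' is >= 12.8."""
    if not cuda_version:
        return False  # CPU-only wheels report torch.version.cuda as None
    parts = cuda_version.split(".")
    major = int(parts[0])
    minor = int(parts[1]) if len(parts) > 1 else 0
    return (major, minor) >= (12, 8)


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        print("PyTorch is not installed in this environment")
    else:
        ok = cuda_build_supports_blackwell(torch.version.cuda)
        print("CUDA build:", torch.version.cuda, "-> new enough for sm_120:", ok)
        if torch.cuda.is_available():
            # An RTX 5070 (sm_120) reports compute capability (12, 0).
            print("compute capability:", torch.cuda.get_device_capability(0))
```

If the check reports an older build, reinstall from the cu128 index as shown above.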

2. (Strongly recommended) build SageAttention 2 — the biggest attention-kernel lever on Blackwell

git clone https://github.com/thu-ml/SageAttention.git
cd SageAttention && CUDA_ARCHITECTURES="8.9,9.0,12.0" \
  EXT_PARALLEL=4 NVCC_APPEND_FLAGS="--threads 8" MAX_JOBS=32 \
  pip install -v -e .

CUDA_ARCHITECTURES="8.9,9.0,12.0" covers Ada through Blackwell; the 12.0 entry is the RTX 5070's sm_120 target. The SageAttention README announced Blackwell support on 2025-02-15: "The compilation code is updated to support RTX5090! On RTX5090, SageAttention reaches 560T, 2.7x faster than FlashAttention2!" The whole RTX 50-series shares the same Blackwell sm_120 target, so the same kernel path applies; the 5070's lower core count means a smaller absolute throughput, but SageAttention 2 remains the single biggest attention-kernel win on the card.
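To make the architecture list concrete, the sketch below maps each entry to its sm_ target and checks whether a given device is covered. Both helper names are illustrative; they are not part of SageAttention or its build system.

```python
def arch_list_to_sm(arch_list):
    """Turn '8.9,9.0,12.0' into ['sm_89', 'sm_90', 'sm_120']."""
    targets = []
    for entry in arch_list.split(","):
        major, minor = entry.strip().split(".")
        targets.append(f"sm_{major}{minor}")
    return targets


def covers_device(arch_list, capability):
    """capability is a (major, minor) tuple, e.g. (12, 0) for the RTX 5070."""
    major, minor = capability
    return f"sm_{major}{minor}" in arch_list_to_sm(arch_list)


if __name__ == "__main__":
    archs = "8.9,9.0,12.0"
    print(arch_list_to_sm(archs))
    print("covers RTX 5070:", covers_device(archs, (12, 0)))
```

Dropping the 12.0 entry from the list would leave the build without a kernel for the 5070.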

3. Pull the 4-step distilled T2V-14B checkpoint (FP8 — recommended primary path)

FP8 is hardware-accelerated on Blackwell sm_120; this is the recommended path for the 5070:

huggingface-cli download \
  lightx2v/Wan2.1-T2V-14B-StepDistill-CfgDistill-Lightx2v \
  --include "distill_fp8/*" \
  --local-dir ./weights/Wan2.1-T2V-14B-StepDistill

Alternative INT8 weights (also Blackwell-supported, slightly different numeric profile) — there is no separate shipped INT8 shell script, so the INT8 weights are pulled the same way and then pointed at the INT8 config:

huggingface-cli download \
  lightx2v/Wan2.1-T2V-14B-StepDistill-CfgDistill-Lightx2v \
  --include "distill_int8/*" \
  --local-dir ./weights/Wan2.1-T2V-14B-StepDistill
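If you prefer to script the download, the same filtered pull can be sketched with the huggingface_hub Python library, whose snapshot_download function accepts allow_patterns and local_dir. The wrapper functions and constant names below are illustrative, and the sketch assumes huggingface_hub is installed.

```python
REPO_ID = "lightx2v/Wan2.1-T2V-14B-StepDistill-CfgDistill-Lightx2v"
LOCAL_DIR = "./weights/Wan2.1-T2V-14B-StepDistill"


def allow_patterns_for(precision):
    """Glob pattern selecting one quantized sub-directory of the repo."""
    if precision not in ("fp8", "int8"):
        raise ValueError("this recipe only uses the fp8 or int8 sub-directories")
    return [f"distill_{precision}/*"]


def download_weights(precision="fp8"):
    """Download only the chosen sub-directory; returns the local folder path."""
    # Imported lazily so the helper above works without the library installed.
    from huggingface_hub import snapshot_download

    return snapshot_download(
        repo_id=REPO_ID,
        allow_patterns=allow_patterns_for(precision),
        local_dir=LOCAL_DIR,
    )


if __name__ == "__main__":
    print("would fetch:", allow_patterns_for("fp8"), "into", LOCAL_DIR)
    # Uncomment to start the roughly 22 GB download:
    # download_weights("fp8")
```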

Alternative: Docker (simplest)

docker pull lightx2v/lightx2v:26011201-cu128

The cu128 tag is the right one for the 5070 — sm_120 Blackwell kernels require CUDA 12.8. The Quickstart recommends the cuda128 environment for faster inference; the older 25101501-cu124 tag will not contain sm_120 kernels, so stay on cu128 (Quickstart).

Alternative path: ComfyUI + GGUF (community-quantized)

If you'd rather work in ComfyUI than the LightX2V Python framework, QuantStack ships GGUF conversions that explicitly identify as "a GGUF conversion of an addon of lightx2v/Wan2.1-T2V-14B-StepDistill-CfgDistill", produced with city96's standard conversion scripts. On a 12 GB 5070 stay on the Q4_K_M / Q5_K_M (~11–13 GB DiT) or smaller tiers so the separate UMT5 text encoder and VAE still fit alongside; the Q8_0 (18.7 GB DiT) does not fit. Loading needs the ComfyUI-GGUF custom node plus the UMT5-XXL text encoder (Comfy-Org safetensors or city96's GGUF UMT5 mirror) and the Wan2.1 VAE (Kijai mirror). QuantStack's release pre-merges the VACE control-conditioning addon onto the distilled base, which is useful if you want pose/depth control; for plain T2V the inference shape is otherwise identical.

Running

The LightX2V repo ships ready-to-run shell scripts under scripts/wan/. The relevant one for the FP8 text-to-video path is:

# FP8 path on Blackwell
bash scripts/wan/run_wan_t2v_distill_fp8_4step_cfg.sh

Under the hood this invokes python -m lightx2v.infer with --model_cls wan2.1_distill --task t2v and the matching config JSON from configs/distill/wan21/ (wan_t2v_distill_fp8_4step_cfg.json). Fill in lightx2v_path (the cloned repo root) and model_path (the directory you downloaded weights to in step 3) at the top of the script before running. The shipped script passes a text prompt only (--prompt "...") and a --save_result_path — there is no image input, which is the correct shape for this text-to-video repo.

The one change that matters for 12 GB. The shipped wan_t2v_distill_fp8_4step_cfg.json defaults to "cpu_offload": false, which keeps the full ~15 GB FP8 DiT resident — that fits a 16 GB card but OOMs a 12 GB 5070. Per the Parameter Offload guide, edit the config (or pass overrides) to turn on the disk-CPU-GPU offload so blocks stream from CPU instead of all being resident:

{
  "cpu_offload": true,
  "offload_granularity": "block",
  "offload_ratio": 1.0,
  "t5_cpu_offload": true,
  "vae_cpu_offload": false
}
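Rather than editing the JSON by hand, the same overrides can be applied with a few lines of Python. This is a minimal sketch: the function name apply_offload_overrides is made up for this example, and it assumes the shipped config is a flat JSON object whose top-level keys match the ones shown above.

```python
import json
from pathlib import Path

# The five keys from the config fragment above.
OFFLOAD_OVERRIDES = {
    "cpu_offload": True,
    "offload_granularity": "block",
    "offload_ratio": 1.0,
    "t5_cpu_offload": True,
    "vae_cpu_offload": False,
}


def apply_offload_overrides(config_path):
    """Merge the offload keys into an existing config file, in place."""
    path = Path(config_path)
    config = json.loads(path.read_text())
    config.update(OFFLOAD_OVERRIDES)  # all other keys are left untouched
    path.write_text(json.dumps(config, indent=4) + "\n")
    return config
```

Point it at configs/distill/wan21/wan_t2v_distill_fp8_4step_cfg.json inside the cloned repo, and keep a copy of the original file if you also run on a larger card.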

The official guide's progressive strategy for memory-constrained devices is: first enable cpu_offload, then gradually enable CPU offload for the T5 / CLIP / VAE components, then consider quantization + offload or lazy_load (Parameter Offload guide). On the 5070, cpu_offload: true + offload_granularity: "block" + t5_cpu_offload: true is the combination that keeps the FP8 DiT streaming and the 6.733 GB UMT5 encoder off the GPU. Start at 480×832, 81 frames, 4 steps (the config defaults) and only push toward 720×1280 once you've confirmed peak VRAM via nvidia-smi -l 1.
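For the peak-VRAM check, a small poller around nvidia-smi's CSV query mode can record the maximum instead of relying on watching the terminal. The helper names are illustrative; the --query-gpu=memory.used and --format=csv,noheader,nounits flags are standard nvidia-smi options, and the sketch reads the first GPU only.

```python
import subprocess
import time

QUERY = ["nvidia-smi", "--query-gpu=memory.used", "--format=csv,noheader,nounits"]


def parse_used_mib(output):
    """Memory in use on the first GPU, in MiB, from nvidia-smi CSV output."""
    return int(output.strip().splitlines()[0])


def peak_gib(samples_mib):
    """Highest sample converted from MiB to GiB."""
    return max(samples_mib) / 1024


def sample_once():
    """Run nvidia-smi once and return the current usage in MiB."""
    return parse_used_mib(subprocess.check_output(QUERY, text=True))


def watch(seconds=60, interval=1.0):
    """Poll for a fixed duration while a generation runs in another shell."""
    samples = []
    deadline = time.monotonic() + seconds
    while time.monotonic() < deadline:
        samples.append(sample_once())
        time.sleep(interval)
    return peak_gib(samples)
```

Start watch() in one terminal, launch the generation in another, and compare the returned peak against the card's 12 GB before raising resolution.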

For the Python API directly, the FP8 path on a 5070 looks like:

from lightx2v import LightX2VPipeline

pipe = LightX2VPipeline(
    model_path="./weights/Wan2.1-T2V-14B-StepDistill/distill_fp8",
    model_cls="wan2.1_distill",
    task="t2v",
)

# REQUIRED on the 12 GB 5070: stream the FP8 DiT in block granularity and keep the
# 6.733 GB UMT5 text encoder on CPU so the model fits the 12 GB envelope.
pipe.enable_offload(
    cpu_offload=True,                 # stream the FP8 DiT — do NOT keep it all resident on 12 GB
    offload_granularity="block",
    text_encoder_offload=True,        # leaves the 6.733 GB UMT5 on CPU
    image_encoder_offload=False,
    vae_offload=False,
)

pipe.create_generator(
    attn_mode="sage_attn2",           # SageAttention 2, sm_120-optimised
    infer_steps=4,                    # the whole point of the distilled checkpoint
    height=480, width=832,
    num_frames=81,
    guidance_scale=1.0,
    sample_shift=5.0,
)
pipe.generate(
    seed=42,
    prompt="A man with short gray hair plays a red electric guitar.",
    save_result_path="./output.mp4",
)

The recommended sampler settings are the LCM scheduler with shift=5.0 and guidance_scale=1.0 (no CFG) — documented on the HF model card and baked into the distill config JSONs.
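To give a feel for what shift=5.0 does with only 4 steps, the sketch below applies the timestep-shift formula commonly used by flow-matching schedulers, sigma' = shift * sigma / (1 + (shift - 1) * sigma), to a uniform 4-step schedule. This is an assumption for illustration: LightX2V's own scheduler may discretise its timesteps differently, and shifted_sigmas is not a LightX2V function.

```python
def shifted_sigmas(num_steps, shift):
    """Uniform noise levels in (0, 1], warped toward the high-noise end."""
    # Assumed starting point: evenly spaced levels 1.0, 0.75, 0.5, 0.25 for 4 steps.
    sigmas = [1.0 - i / num_steps for i in range(num_steps)]
    return [shift * s / (1.0 + (shift - 1.0) * s) for s in sigmas]


if __name__ == "__main__":
    print([round(s, 4) for s in shifted_sigmas(4, 1.0)])
    print([round(s, 4) for s in shifted_sigmas(4, 5.0)])
```

Under this assumption, a larger shift spends more of the 4 steps at high noise levels, where the overall layout and motion of the clip are decided.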

Results

  • Speed: No first-party RTX 5070 benchmark has been published for the Wan2.1-T2V-14B 4-step distilled variant at the time of writing, so we omit a speed figure rather than extrapolate. The 5070 is not a close sibling of any card with a published number: its ~672 GB/s memory bandwidth and 6144 CUDA cores sit well below the 5070 Ti / 5080, so even those cards' figures would overstate it. The upstream SM120 micro-benchmark referenced in PR #1090 was a single-operator kernel result measured on an RTX 5090, not an end-to-end generation time on this card. Empirical RTX 5070 numbers for this recipe will land at /check/lightx2v/rtx-5070 once a community benchmark is submitted via /contribute.
  • VRAM usage: The framework's Quickstart sets the floor at "at least 8GB VRAM" with offload + quant, and the HF model card confirms the fp8 / int8 distillation weights "enable fast inference using lightx2v on RTX 4060." (an 8 GB card). The 12 GB RTX 5070 sits above that 8 GB floor: with cpu_offload: true + offload_granularity: "block" the ~15 GB FP8 DiT (14.08 GB per-block weights + 0.931 GB non_block, per the HF tree API) streams from CPU rather than being fully resident, and t5_cpu_offload: true keeps the 6.733 GB UMT5 encoder off the GPU. The on-disk envelope is ~22.3 GB for the FP8 sub-directory; the BF16 sub-directory is ~40.5 GB (its 28.577 GB DiT alone is more than double 12 GB and not viable). See /check/lightx2v/rtx-5070 for empirical numbers as they land.
  • Quality notes: The distilled checkpoint trades fine motion detail and prompt fidelity for the 4-step / no-CFG speed-up. Use the recommended LCM scheduler, shift=5.0, guidance_scale=1.0 (HF model card) and stay close to the model's training resolutions (480×832, 720×1280) for best results.

For the full benchmark data, see /check/lightx2v/rtx-5070.

Troubleshooting

"I installed PyTorch but it doesn't see the 5070 / I get no kernel image is available"

Your PyTorch wheel was almost certainly built for an older CUDA toolkit (cu121, cu124, cu126) that doesn't include sm_120 Blackwell kernels. Reinstall against the cu128 channel:

pip install --index-url https://download.pytorch.org/whl/cu128 torch torchvision

The SageAttention README calls out the same CUDA >=12.8 constraint for Blackwell. The Docker cu128 tag (lightx2v/lightx2v:26011201-cu128) sidesteps the issue entirely — recommended on a fresh 5070 setup.

Out of memory on the FP8 path

The 12 GB envelope is too small to keep the full ~15 GB FP8 DiT resident, so offload is mandatory, not optional. In order of effectiveness:

  1. Turn on block-granularity CPU offload. Set cpu_offload: true and offload_granularity: "block" in the config (or enable_offload(cpu_offload=True, offload_granularity="block") in the Python API). This streams the DiT blocks from CPU instead of holding all ~15 GB on the GPU — the shipped config's cpu_offload: false default is a 16 GB setting and will OOM the 5070.
  2. Offload the UMT5 text encoder. Set t5_cpu_offload: true (config) / text_encoder_offload=True (Python). The FP8 UMT5 is 6.733 GB on its own — there is no room for it on the GPU alongside the streamed DiT.
  3. Make sure SageAttention 2 is actually loaded. Pass attn_mode="sage_attn2". Without it, attention activations on a 480×832 / 81-frame clip eat substantially more VRAM (and run slower).
  4. Drop frame count before resolution. As a final fallback, go from 480×832 / 81 frames down to 480×832 / 49 frames; only push to 720×1280 if peak VRAM stays comfortably below 12 GB.

The Parameter Offload guide documents the full progressive strategy for memory-constrained devices.
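To see why frame count and resolution matter so much for activation memory, the sketch below estimates how many tokens the DiT attends over for a given clip. It assumes the Wan2.1 layout of a VAE with 4x temporal and 8x spatial compression followed by 2x2 spatial patching; those factors and the function name are assumptions of this example, so treat the results as relative sizes, not measured VRAM.

```python
def dit_token_count(height, width, num_frames,
                    temporal_stride=4, spatial_stride=8, patch=2):
    """Approximate sequence length the transformer sees for one clip."""
    latent_frames = (num_frames - 1) // temporal_stride + 1
    tokens_h = height // (spatial_stride * patch)
    tokens_w = width // (spatial_stride * patch)
    return latent_frames * tokens_h * tokens_w


if __name__ == "__main__":
    baseline = dit_token_count(480, 832, 81)
    for h, w, f in ((480, 832, 81), (480, 832, 49), (720, 1280, 81)):
        n = dit_token_count(h, w, f)
        print(f"{h}x{w}, {f} frames: {n} tokens ({n / baseline:.2f}x baseline)")
```

Under these assumptions, dropping to 49 frames cuts the sequence to about 0.62x of the 81-frame default, while moving to 720×1280 raises it to about 2.3x.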

Windows RTX 50-series: KeyError: 'None-triton' during T5 offload init

If you use the Windows one-click package with the "50 series environment package" and hit KeyError: 'None-triton' during the T5 offloaded-attention init (the exact stage this recipe leans on for the 12 GB path), it is a missing Triton dependency in that bundle. A community contributor on LightX2V Issue #943 recommends running pip install triton-windows and replacing the bundled LightX2V directory with the latest upstream code. The issue is still open at the time of writing; on Linux the conda-from-source path above avoids the one-click bundle entirely.

Wrong output / errors from the HF card's diffusers snippet

The repo's pipeline_tag is image-to-video, so the auto-generated diffusers Quick Start loads an input image and passes image=image. That is the Image-to-Video signature on what is a Text-to-Video repo. For T2V, call the pipeline with prompt= only (as in Running above), or use the official LightX2V shell script, which already encodes the prompt-only call shape.