LightX2V on RTX 5070: 4-Step Text-to-Video with Distilled Wan2.1-14B via Blackwell-Native FP8 + Offload

What You'll Build

Generate short text-to-video clips locally on a 12 GB RTX 5070 using LightX2V, an inference framework that ships 4-step, CFG-free distilled checkpoints of Wan2.1-T2V-14B. The distilled checkpoint cuts inference from 40–50 steps to 4 with no classifier-free guidance, and the HF model card states: "New fp8 and int8 quantized distillation models have been added, which enable fast inference using lightx2v on RTX 4060." The RTX 4060 the maintainers name is an 8 GB Ada Lovelace (sm_89) card, which the framework reaches via a disk-CPU-GPU three-tier offload. The 12 GB RTX 5070 is a newer Blackwell (sm_120) part that also has FP8 tensor cores and carries 50 % more VRAM, so it runs the same FP8 path with more headroom.

Hardware data: RTX 5070 (12 GB VRAM, Blackwell sm_120) · 4-step distilled Wan2.1-T2V-14B · See benchmark data

ℹ️ This is distilled Wan 2.1, not Wan 2.2. The lightx2v org publishes a wider family — Wan2.1-T2V-14B distilled, Wan2.2-A14B distilled (timestep-MoE), HunyuanVideo-1.5 distilled, Qwen-Image distilled — each with different VRAM and inference characteristics. This recipe is specifically the Wan2.1-T2V-14B-StepDistill-CfgDistill-Lightx2v repo (4-step CFG-free distillation of the dense Wan2.1-T2V-14B base). The same install steps don't transfer cleanly to the A14B timestep-MoE — see the LightX2V repository for the Wan 2.2 / HunyuanVideo distilled releases.

⚠️ This is Text-to-Video — there is no image input. The repo's pipeline_tag is set to image-to-video, so HuggingFace auto-generates a diffusers Quick Start snippet that loads an input image and passes image=image. That is the wrong call shape for this repo. The repo name is Wan2.1-**T2V**-14B, the README's run command is bash scripts/wan/run_wan_t2v_distill_4step_cfg.sh, and the shipped script passes a text prompt only (--prompt "...", no image). Drive it with prompt= only — never image=.

⚡ FP8 is the fast path on Blackwell. Ampere sm_86 (RTX 30-series) has no FP8 tensor cores, so FP8 weights loaded there must be dequantized to BF16 at compute time. Blackwell sm_120 (RTX 50-series) has native FP8 tensor-core acceleration (E4M3 / E5M2), and the RTX 5070 (6144 CUDA cores, sm_120) is built on the Blackwell GB205 die. FP8 therefore gives you both the VRAM savings and the throughput win, which is why this recipe uses FP8 rather than BF16 as the primary path on a 12 GB card (INT8 remains a supported alternative). Ongoing Blackwell work in the framework is visible in the open upstream PR #1090, "feat: add MXFP8 fused operators for Wan transformer inference on SM120".

Requirements

Component	Minimum	Tested
GPU	8 GB VRAM (CUDA) per the LightX2V Quickstart	RTX 5070 (Blackwell `sm_120`, 12 GB GDDR7)
RAM	16 GB or more recommended per Quickstart; the offload CPU tier holds the 6.7 GB UMT5 encoder + streamed blocks	—
Storage	At least 50 GB available space per Quickstart; ~22 GB for the FP8 sub-directory alone	—
Software	Python 3.10+, PyTorch built against CUDA 12.8+ (sm_120 kernels require cu128)	per Quickstart

The lightx2v/Wan2.1-T2V-14B-StepDistill-CfgDistill-Lightx2v HF repo ships three sub-directories (sizes verified via the HF tree API):

distill_fp8/ — per-block FP8 quant, ~22.3 GB total (40 transformer blocks at 0.352 GB each = 14.08 GB DiT + non_block.safetensors 0.931 GB + FP8 UMT5-XXL encoder models_t5_umt5-xxl-enc-fp8.pth 6.733 GB + Wan2.1 VAE 0.508 GB). Recommended primary path on the 5070 — FP8 is hardware-accelerated on Blackwell, and the per-block file layout is exactly what the framework's block-granularity offload streams from CPU so the full ~15 GB DiT never has to be resident at once on the 12 GB card.
distill_int8/ — per-block INT8 quant, ~22.3 GB total with the same layout (INT8 UMT5 6.733 GB). Fully supported as well; FP8 is preferred on Blackwell because the FP8 tensor-core path is the architecture's headline throughput route for diffusion workloads.
distill_models/ — BF16 dense, ~40.5 GB total (28.577 GB DiT + 11.362 GB BF16 UMT5 + 0.508 GB VAE). The 28.577 GB DiT alone is more than double the 5070's 12 GB VRAM, so the BF16 path is not viable on this card — stay on FP8 / INT8.
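The arithmetic behind those totals can be checked with a short sketch. The component sizes are copied from the list above; the `VRAM_GB` constant and the helper name `total_gb` are illustrative, and the comparison deliberately ignores activations and framework overhead, so treat it as a lower bound rather than a measured footprint.

```python
# Component sizes in GB, copied from the sub-directory breakdown above.
FP8_GB = {
    "dit_blocks": 40 * 0.352,  # 40 per-block transformer files
    "non_block": 0.931,
    "umt5_fp8": 6.733,
    "vae": 0.508,
}
BF16_GB = {"dit": 28.577, "umt5_bf16": 11.362, "vae": 0.508}
VRAM_GB = 12.0  # RTX 5070


def total_gb(parts):
    """Sum a {component: size_in_GB} mapping."""
    return sum(parts.values())


fp8_total = total_gb(FP8_GB)
fp8_dit = FP8_GB["dit_blocks"] + FP8_GB["non_block"]
bf16_total = total_gb(BF16_GB)

print(f"FP8 sub-directory on disk : {fp8_total:.3f} GB")
print(f"FP8 DiT alone             : {fp8_dit:.3f} GB")
print(f"BF16 sub-directory on disk: {bf16_total:.3f} GB")
print(f"FP8 DiT fits fully resident in {VRAM_GB:.0f} GB: {fp8_dit <= VRAM_GB}")
print(f"BF16 DiT fits fully resident in {VRAM_GB:.0f} GB: {BF16_GB['dit'] <= VRAM_GB}")
```

Both residency checks come out false, which is the reason the rest of this recipe streams the DiT block by block instead of loading it whole.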

Installation

The canonical install is Docker (simplest) or conda from source — both documented in the LightX2V Quickstart.

1. Install the framework (conda path)

# Clone and create the environment
git clone https://github.com/ModelTC/LightX2V.git
cd LightX2V
conda create -n lightx2v python=3.11 -y
conda activate lightx2v
pip install -v -e .

The commands above are taken verbatim from the LightX2V Quickstart. Confirm your torch was built against CUDA 12.8 or newer, because Blackwell sm_120 kernels require cu128. If pip install -v -e . pulled a wheel built for an older CUDA, reinstall PyTorch explicitly:

pip install --index-url https://download.pytorch.org/whl/cu128 torch torchvision

The SageAttention README calls out the same constraint in its install notes: CUDA >=12.8 is required for Blackwell.
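One way to confirm the installed wheel meets that constraint is to inspect what PyTorch reports about its own build. The sketch below is a minimal check, not part of LightX2V: the helper names `parse_cuda_version` and `build_supports_blackwell` are hypothetical, and it assumes a Blackwell-ready wheel lists `sm_120` in `torch.cuda.get_arch_list()`.

```python
from typing import Iterable, Optional, Tuple


def parse_cuda_version(version: Optional[str]) -> Optional[Tuple[int, int]]:
    """Turn a string such as '12.8' into (12, 8); None for CPU-only builds."""
    if not version:
        return None
    major, minor = version.split(".")[:2]
    return int(major), int(minor)


def build_supports_blackwell(cuda_version: Optional[str], arch_list: Iterable[str]) -> bool:
    """True when the build targets CUDA >= 12.8 and ships sm_120 kernels."""
    parsed = parse_cuda_version(cuda_version)
    return parsed is not None and parsed >= (12, 8) and "sm_120" in set(arch_list)


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        print("PyTorch is not installed in this environment")
    else:
        ok = build_supports_blackwell(torch.version.cuda, torch.cuda.get_arch_list())
        print(f"torch {torch.__version__}, CUDA build {torch.version.cuda}")
        print("sm_120 kernels present" if ok else "no sm_120 kernels: reinstall from the cu128 index")
```

If the second line reports missing kernels, the reinstall command above is the fix.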

2. (Strongly recommended) build SageAttention 2 — the biggest attention-kernel lever on Blackwell

git clone https://github.com/thu-ml/SageAttention.git
cd SageAttention && CUDA_ARCHITECTURES="8.9,9.0,12.0" \
  EXT_PARALLEL=4 NVCC_APPEND_FLAGS="--threads 8" MAX_JOBS=32 \
  pip install -v -e .

CUDA_ARCHITECTURES="8.9,9.0,12.0" covers Ada through Blackwell; the 12.0 entry is the RTX 5070's sm_120 target. The SageAttention README announced Blackwell support on 2025-02-15: "The compilation code is updated to support RTX5090! On RTX5090, SageAttention reaches 560T, 2.7x faster than FlashAttention2!" The whole RTX 50-series shares the same Blackwell sm_120 target, so the same kernel path applies; the 5070's lower core count means a smaller absolute throughput, but SageAttention 2 remains the single biggest attention-kernel win on the card.
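After the build finishes, a quick import probe shows whether the package is visible to the same Python environment LightX2V runs in. This is a minimal sketch: the helper name `module_available` is hypothetical, and it only checks that the `sageattention` package can be found, not that its kernels were compiled for sm_120.

```python
import importlib.util


def module_available(name: str) -> bool:
    """True if a top-level module called `name` can be found for import."""
    return importlib.util.find_spec(name) is not None


if __name__ == "__main__":
    if module_available("sageattention"):
        print("sageattention found in this environment")
    else:
        print("sageattention missing: build it before relying on SageAttention 2")
```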

3. Pull the 4-step distilled T2V-14B checkpoint (FP8 — recommended primary path)

FP8 is hardware-accelerated on Blackwell sm_120; this is the recommended path for the 5070:

huggingface-cli download \
  lightx2v/Wan2.1-T2V-14B-StepDistill-CfgDistill-Lightx2v \
  --include "distill_fp8/*" \
  --local-dir ./weights/Wan2.1-T2V-14B-StepDistill

Alternative INT8 weights (also Blackwell-supported, slightly different numeric profile). There is no separate shipped INT8 shell script, so pull the INT8 weights the same way and then point the run at the INT8 config:

huggingface-cli download \
  lightx2v/Wan2.1-T2V-14B-StepDistill-CfgDistill-Lightx2v \
  --include "distill_int8/*" \
  --local-dir ./weights/Wan2.1-T2V-14B-StepDistill
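Interrupted downloads are a common source of confusing load errors, so it can help to compare the on-disk size of the sub-directory against the ~22.3 GB figure quoted earlier. The sketch below is an illustrative check with hypothetical helper names (`dir_size_gb`, `looks_complete`); it assumes the quoted sizes are decimal gigabytes and uses a loose 10 % tolerance to absorb rounding and small cache metadata files.

```python
from pathlib import Path


def dir_size_gb(root) -> float:
    """Total size of all regular files under `root`, in decimal GB (1e9 bytes)."""
    return sum(p.stat().st_size for p in Path(root).rglob("*") if p.is_file()) / 1e9


def looks_complete(root, expected_gb: float, tolerance: float = 0.10) -> bool:
    """True when the directory size is within `tolerance` of `expected_gb`."""
    return abs(dir_size_gb(root) - expected_gb) <= expected_gb * tolerance


if __name__ == "__main__":
    fp8_dir = Path("./weights/Wan2.1-T2V-14B-StepDistill/distill_fp8")
    if fp8_dir.is_dir():
        print(f"{fp8_dir}: {dir_size_gb(fp8_dir):.2f} GB on disk")
        print("size looks plausible" if looks_complete(fp8_dir, 22.3) else "size is off: re-run the download")
    else:
        print(f"{fp8_dir} not found")
```

Re-running the same huggingface-cli command is the simplest remedy when the size is short.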

Alternative: Docker (simplest)

docker pull lightx2v/lightx2v:26011201-cu128

The cu128 tag is the right one for the 5070 — sm_120 Blackwell kernels require CUDA 12.8. The Quickstart recommends the cuda128 environment for faster inference; the older 25101501-cu124 tag will not contain sm_120 kernels, so stay on cu128 (Quickstart).

Alternative path: ComfyUI + GGUF (community-quantized)

If you'd rather work in ComfyUI than the LightX2V Python framework, QuantStack ships GGUF conversions that explicitly identify as "a GGUF conversion of an addon of lightx2v/Wan2.1-T2V-14B-StepDistill-CfgDistill" (linking back to the canonical repo), produced with city96's standard conversion scripts. On a 12 GB 5070 stay on the Q4_K_M / Q5_K_M (~11–13 GB DiT) or smaller tiers so the separate UMT5 text encoder and VAE still fit alongside; the Q8_0 (18.7 GB DiT) does not fit. Loading needs the ComfyUI-GGUF custom node plus the UMT5-XXL text encoder (Comfy-Org safetensors or city96's GGUF UMT5 mirror) and the Wan2.1 VAE (Kijai mirror). QuantStack's release pre-merges the VACE control-conditioning addon onto the distilled base, which is useful if you want pose/depth control; for plain T2V the inference shape is otherwise identical.

Running

The LightX2V repo ships ready-to-run shell scripts under scripts/wan/. The relevant one for the FP8 text-to-video path is:

# FP8 path on Blackwell
bash scripts/wan/run_wan_t2v_distill_fp8_4step_cfg.sh

Under the hood this invokes python -m lightx2v.infer with --model_cls wan2.1_distill --task t2v and the matching config JSON from configs/distill/wan21/ (wan_t2v_distill_fp8_4step_cfg.json). Fill in lightx2v_path (the cloned repo root) and model_path (the directory you downloaded weights to in step 3) at the top of the script before running. The shipped script passes a text prompt only (--prompt "...") and a --save_result_path — there is no image input, which is the correct shape for this text-to-video repo.

The one change that matters for 12 GB. The shipped wan_t2v_distill_fp8_4step_cfg.json defaults to "cpu_offload": false, which keeps the full ~15 GB FP8 DiT resident — that fits a 16 GB card but OOMs a 12 GB 5070. Per the Parameter Offload guide, edit the config (or pass overrides) to turn on the disk-CPU-GPU offload so blocks stream from CPU instead of all being resident:

{
  "cpu_offload": true,
  "offload_granularity": "block",
  "offload_ratio": 1.0,
  "t5_cpu_offload": true,
  "vae_cpu_offload": false
}

The official guide's progressive strategy for memory-constrained devices is: first enable cpu_offload, then gradually enable CPU offload for the T5 / CLIP / VAE components, then consider quantization + offload or lazy_load (Parameter Offload guide). On the 5070, cpu_offload: true + offload_granularity: "block" + t5_cpu_offload: true is the combination that keeps the FP8 DiT streaming and the 6.733 GB UMT5 encoder off the GPU. Start at 480×832, 81 frames, 4 steps (the config defaults) and only push toward 720×1280 once you've confirmed peak VRAM via nvidia-smi -l 1.
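If you prefer not to hand-edit the shipped file, the overrides can be applied to a copy. The sketch below is one way to do that with the standard library: the keys and values are the ones listed above, while the function name `write_offload_config` and the `_12gb` output filename are hypothetical choices made for this example.

```python
import json
from pathlib import Path

# Keys and values copied from the config fragment above.
OFFLOAD_OVERRIDES = {
    "cpu_offload": True,
    "offload_granularity": "block",
    "offload_ratio": 1.0,
    "t5_cpu_offload": True,
    "vae_cpu_offload": False,
}


def write_offload_config(src, dst, overrides=OFFLOAD_OVERRIDES):
    """Copy the JSON config at `src` to `dst` with the offload keys overridden."""
    config = json.loads(Path(src).read_text(encoding="utf-8"))
    config.update(overrides)
    Path(dst).write_text(json.dumps(config, indent=4) + "\n", encoding="utf-8")
    return config


if __name__ == "__main__":
    src = Path("configs/distill/wan21/wan_t2v_distill_fp8_4step_cfg.json")
    dst = src.with_name("wan_t2v_distill_fp8_4step_cfg_12gb.json")  # hypothetical name
    if src.is_file():
        write_offload_config(src, dst)
        print(f"wrote {dst}")
    else:
        print(f"{src} not found; run this from the LightX2V repo root")
```

Writing to a separate file leaves the shipped default untouched; the run script would then need to reference the new config path instead of the original.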

For the Python API directly, the FP8 path on a 5070 looks like:

from lightx2v import LightX2VPipeline

pipe = LightX2VPipeline(
    model_path="./weights/Wan2.1-T2V-14B-StepDistill/distill_fp8",
    model_cls="wan2.1_distill",
    task="t2v",
)

# REQUIRED on the 12 GB 5070: stream the FP8 DiT in block granularity and keep the
# 6.733 GB UMT5 text encoder on CPU so the model fits the 12 GB envelope.
pipe.enable_offload(
    cpu_offload=True,                 # stream the FP8 DiT — do NOT keep it all resident on 12 GB
    offload_granularity="block",
    text_encoder_offload=True,        # leaves the 6.733 GB UMT5 on CPU
    image_encoder_offload=False,
    vae_offload=False,
)

pipe.create_generator(
    attn_mode="sage_attn2",           # SageAttention 2, sm_120-optimised
    infer_steps=4,                    # the whole point of the distilled checkpoint
    height=480, width=832,
    num_frames=81,
    guidance_scale=1.0,
    sample_shift=5.0,
)
pipe.generate(
    seed=42,
    prompt="A man with short gray hair plays a red electric guitar.",
    save_result_path="./output.mp4",
)

The recommended sampler settings are the LCM scheduler with shift=5.0 and guidance_scale=1.0 (no CFG) — documented on the HF model card and baked into the distill config JSONs.

Results

Speed: No first-party RTX 5070 benchmark has been published for the Wan2.1-T2V-14B 4-step distilled variant at the time of writing, so we omit a speed figure rather than extrapolate. The 5070 is not a close sibling of any card with a published number: its ~672 GB/s memory bandwidth and 6144 CUDA cores sit well below the 5070 Ti / 5080, so even those cards' figures would overstate it. The upstream SM120 micro-benchmark referenced in PR #1090 was a single-operator kernel result measured on an RTX 5090, not an end-to-end generation time on this card. Empirical RTX 5070 numbers for this recipe will land at /check/lightx2v/rtx-5070 once a community benchmark is submitted via /contribute.
VRAM usage: The framework's Quickstart sets the floor at "at least 8GB VRAM" with offload + quant, and the HF model card confirms the fp8 / int8 distillation weights "enable fast inference using lightx2v on RTX 4060." (an 8 GB card). The 12 GB RTX 5070 sits above that 8 GB floor: with cpu_offload: true + offload_granularity: "block" the ~15 GB FP8 DiT (14.08 GB per-block weights + 0.931 GB non_block, per the HF tree API) streams from CPU rather than being fully resident, and t5_cpu_offload: true keeps the 6.733 GB UMT5 encoder off the GPU. The on-disk envelope is ~22.3 GB for the FP8 sub-directory; the BF16 sub-directory is ~40.5 GB (its 28.577 GB DiT alone is more than double 12 GB and not viable). See /check/lightx2v/rtx-5070 for empirical numbers as they land.
Quality notes: The distilled checkpoint trades fine motion detail and prompt fidelity for the 4-step / no-CFG speed-up. Use the recommended LCM scheduler, shift=5.0, guidance_scale=1.0 (HF model card) and stay close to the model's training resolutions (480×832, 720×1280) for best results.

For the full benchmark data, see /check/lightx2v/rtx-5070.

Troubleshooting

"I installed PyTorch but it doesn't see the 5070 / I get `no kernel image is available`"

Your PyTorch wheel was almost certainly built for an older CUDA toolkit (cu121, cu124, cu126) that doesn't include sm_120 Blackwell kernels. Reinstall against the cu128 channel:

pip install --index-url https://download.pytorch.org/whl/cu128 torch torchvision

The SageAttention README calls out the same CUDA >=12.8 constraint for Blackwell. The Docker cu128 tag (lightx2v/lightx2v:26011201-cu128) sidesteps the issue entirely — recommended on a fresh 5070 setup.

Out of memory on the FP8 path

The 12 GB envelope is too small to keep the full ~15 GB FP8 DiT resident, so offload is mandatory, not optional. In order of effectiveness:

Turn on block-granularity CPU offload. Set cpu_offload: true and offload_granularity: "block" in the config (or enable_offload(cpu_offload=True, offload_granularity="block") in the Python API). This streams the DiT blocks from CPU instead of holding all ~15 GB on the GPU — the shipped config's cpu_offload: false default is a 16 GB setting and will OOM the 5070.
Offload the UMT5 text encoder. Set t5_cpu_offload: true (config) / text_encoder_offload=True (Python). The FP8 UMT5 is 6.733 GB on its own — there is no room for it on the GPU alongside the streamed DiT.
Make sure SageAttention 2 is actually loaded. Pass attn_mode="sage_attn2". Without it, attention activations on a 480×832 / 81-frame clip eat substantially more VRAM (and run slower).
Drop frame count before resolution. As a final fallback, reduce 480×832 / 81 frames to 480×832 / 49 frames; only push to 720×1280 if peak VRAM stays comfortably below 12 GB.

The Parameter Offload guide documents the full progressive strategy for memory-constrained devices.
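To see which of these steps actually moved the needle, it helps to record peak VRAM rather than eyeball a scrolling nvidia-smi window. The sketch below polls nvidia-smi's CSV query output and keeps the maximum; the function names `parse_used_mib` and `watch_peak` are hypothetical, and the reading covers everything on the GPU, not just the LightX2V process.

```python
import subprocess
import time

# Standard nvidia-smi query: one line per GPU, memory in MiB, no header or units.
QUERY = ["nvidia-smi", "--query-gpu=memory.used", "--format=csv,noheader,nounits"]


def parse_used_mib(output: str) -> list[int]:
    """Parse nvidia-smi CSV output into a list of per-GPU used-memory values (MiB)."""
    return [int(line) for line in output.splitlines() if line.strip()]


def watch_peak(seconds: float = 60.0, interval: float = 1.0) -> int:
    """Poll nvidia-smi for `seconds` and return the highest used-memory reading in MiB."""
    peak = 0
    deadline = time.monotonic() + seconds
    while time.monotonic() < deadline:
        out = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
        peak = max(peak, max(parse_used_mib(out), default=0))
        time.sleep(interval)
    return peak
```

Calling `watch_peak(seconds=300)` from a second terminal while a generation runs gives a single number to compare before and after each change.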

Windows RTX 50-series: `KeyError: 'None-triton'` during T5 offload init

If you use the Windows one-click package with the "50 series environment package" and hit KeyError: 'None-triton' during the T5 offloaded-attention init (the exact stage this recipe leans on for the 12 GB path), it is a missing Triton dependency in that bundle. A community contributor on LightX2V Issue #943 recommends running pip install triton-windows and replacing the bundled LightX2V directory with the latest upstream code. The issue is still open at the time of writing; on Linux the conda-from-source path above avoids the one-click bundle entirely.

Wrong output / errors from the HF card's diffusers snippet

The repo's pipeline_tag is image-to-video, so the auto-generated diffusers Quick Start loads an input image and passes image=image. That is the Image-to-Video signature on what is a Text-to-Video repo. For T2V, call the pipeline with prompt= only (as in Running above), or use the official LightX2V shell script, which already encodes the prompt-only call shape.