
LightX2V on RTX 4080 SUPER: 4-Step Text-to-Video with Distilled Wan2.1-T2V-14B via FP8/INT8 + Offload

video · intermediate · 8GB+ VRAM · Jun 2, 2026
tools
  • LightX2V
prerequisites
  • NVIDIA RTX 4080 SUPER 16GB or any Ada Lovelace CUDA GPU with ≥8GB VRAM
  • 16GB+ system RAM (32GB recommended if you plan to offload BF16)
  • Python 3.10+ (3.12 per the official Quickstart) and PyTorch 2.6+ with CUDA 12.4 or 12.8
  • ~25GB free disk space for the FP8 distilled weights (~50GB if you also pull BF16)

What You'll Build

Generate short text-to-video clips locally using LightX2V — an inference framework that ships 4-step, CFG-free distilled checkpoints of Wan2.1-T2V-14B — on the 16 GB RTX 4080 SUPER. Per the HF model card, the distilled checkpoint generates videos "with significantly fewer inference steps (4 steps) and without classifier-free guidance," and the same card explicitly calls out: "New fp8 and int8 quantized distillation models have been added, which enable fast inference using lightx2v on RTX 4060." The RTX 4080 SUPER sits well above the 4060 the maintainers tested — same Ada Lovelace sm_89 arch, double the VRAM — so the FP8 / INT8 path runs with comfortable headroom once offload is engaged.

Hardware data: RTX 4080 SUPER (16GB VRAM) · 4-step distilled Wan2.1-T2V-14B · See benchmark data

ℹ️ This is distilled Wan 2.1, not Wan 2.2. The lightx2v org publishes a wider family — Wan2.1-T2V-14B distilled, Wan2.2-A14B distilled, HunyuanVideo-1.5 distilled, Qwen-Image distilled — each with different VRAM and inference characteristics. This recipe is specifically the Wan2.1-T2V-14B-StepDistill-CfgDistill-Lightx2v repo — a 4-step CFG-free distillation of the dense Wan2.1-T2V-14B base. The same install steps do not transfer cleanly to the A14B variant; see the LightX2V GitHub repo for the other distilled releases.

⚠️ OOM without optimization is real. In discussion #9 ("OOMs") on the predecessor HF repo, a user reports: "I tried to run this on A6000 (48GB VRAM), but it always OOMs". The unquantized path is heavy. On a 16 GB 4080 SUPER the BF16 weights cannot sit in VRAM: the BF16 DiT alone is 28.58 GB on disk per the HF tree API. Stick to the FP8 (or INT8) distilled weights and keep the text encoder offloaded to CPU.

Requirements

Component | Minimum | Tested
GPU | "at least 8GB VRAM" per the LightX2V Quickstart | RTX 4080 SUPER 16GB (Ada sm_89)
RAM | 16 GB ("16GB or more recommended" per the Quickstart); 32 GB recommended when offloading the BF16 T5 |
Storage | ~25 GB (FP8 or INT8 sub-directory) | ~50 GB if you also pull BF16
Software | Python 3.10+ (3.12 recommended per Quickstart), PyTorch 2.6+, CUDA 12.4 or 12.8 |

The lightx2v/Wan2.1-T2V-14B-StepDistill-CfgDistill-Lightx2v HF repo ships three sub-directories (sizes from the HF tree API, verified live):

  • distill_fp8/: per-block FP8 quant, ~22.3 GB total (40 transformer blocks + non_block.safetensors 0.93 GB + FP8 umT5 encoder 6.73 GB + Wan2.1 VAE 0.51 GB). With the 6.73 GB T5 offloaded to CPU, the remaining DiT + VAE weights come to roughly 15.6 GB on disk. That is close to the 4080 SUPER's 16 GB, which is why the recipe also enables block-level CPU offload for the DiT.
  • distill_int8/: per-block INT8 with the same layout, ~22.3 GB total (INT8 umT5 6.73 GB + non_block 0.93 GB + VAE 0.51 GB). Use this path if you build Q8 Kernels for Ada (see step 3).
  • distill_models/: BF16 dense, ~40.5 GB total (28.58 GB DiT + 11.36 GB BF16 T5 + 0.51 GB VAE). The DiT alone is far larger than 16 GB of VRAM, so this recipe does not use it as a primary path even with offload; keep to FP8/INT8.
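
The size breakdown above can be turned into a quick budget check. The following is a minimal sketch that only redoes the arithmetic from the listed figures; the function name is hypothetical, and on-disk size is only a proxy for VRAM use (activations and framework overhead are not counted).

```python
# Sizes in GB as listed for distill_fp8/ above (HF tree API figures).
FP8_TOTAL_GB = 22.3   # approximate total for the sub-directory
NON_BLOCK_GB = 0.93   # non_block.safetensors
T5_FP8_GB = 6.73      # FP8 umT5 text encoder
VAE_GB = 0.51         # Wan2.1 VAE
VRAM_GB = 16.0        # RTX 4080 SUPER


def fp8_weight_budget():
    """Split the FP8 directory into the parts that matter for VRAM planning."""
    # What is left after subtracting the named files is the 40 transformer blocks.
    blocks = FP8_TOTAL_GB - NON_BLOCK_GB - T5_FP8_GB - VAE_GB
    # Weights that would stay on the GPU if only the text encoder were offloaded.
    resident = FP8_TOTAL_GB - T5_FP8_GB
    return {
        "transformer_blocks_gb": round(blocks, 2),
        "resident_without_t5_gb": round(resident, 2),
        "nominal_headroom_gb": round(VRAM_GB - resident, 2),
    }


if __name__ == "__main__":
    for name, value in fp8_weight_budget().items():
        print(f"{name}: {value:.2f}")
```

The nominal headroom comes out well under 1 GB, which suggests that text-encoder offload by itself leaves little room for activations and that block-level offload of the DiT is doing real work on a 16 GB card.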

The RTX 4080 SUPER is Ada Lovelace sm_89 (AD103, 10240 CUDA cores, ~736 GB/s memory bandwidth, 320 W) with native 4th-gen FP8 (E4M3/E5M2) tensor cores, so the FP8 path is both a memory win and a compute win on this card. Unlike Blackwell GPUs (sm_120), no special wheel selection is required for the 4080 SUPER — the default pip install torch already includes sm_89 kernels.
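
To confirm that the installed PyTorch build actually sees an sm_89 device, a short check like the one below may help. It is a sketch, not part of LightX2V: the helper names are hypothetical, and the rule "compute capability 8.9 or newer has FP8 tensor cores" is an assumption that holds for Ada, Hopper and Blackwell parts.

```python
def has_native_fp8(capability):
    """Return True for compute capability 8.9 (Ada) or newer.

    Assumption: every architecture from sm_89 upward ships FP8 tensor cores.
    """
    return tuple(capability) >= (8, 9)


def describe_gpu():
    """Report the first CUDA device and whether it clears the FP8 threshold."""
    try:
        import torch  # imported lazily so the helper above works without PyTorch
    except ImportError:
        return "PyTorch is not installed"
    if not torch.cuda.is_available():
        return "No CUDA device visible to PyTorch"
    name = torch.cuda.get_device_name(0)
    cap = torch.cuda.get_device_capability(0)  # e.g. (8, 9) on Ada
    verdict = "native FP8" if has_native_fp8(cap) else "no native FP8"
    return f"{name}: sm_{cap[0]}{cap[1]}, {verdict}"


if __name__ == "__main__":
    print(describe_gpu())
```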

Installation

The canonical install is Docker (simplest) or conda from source — both documented in the LightX2V Quickstart.

1. Install the framework (conda path)

# Clone and create the environment
git clone https://github.com/ModelTC/LightX2V.git
cd LightX2V
conda create -n lightx2v python=3.12 -y
conda activate lightx2v
pip install -v -e .

Commands taken verbatim from the LightX2V Quickstart. On Ada Lovelace cards (4060, 4070, 4080, 4080 SUPER, 4090) the default pip install torch already includes sm_89 kernels — no special CUDA index-url toggling required.

2. (Recommended) build SageAttention — the biggest VRAM / speed lever

git clone https://github.com/thu-ml/SageAttention.git
cd SageAttention && CUDA_ARCHITECTURES="8.0,8.6,8.9,9.0" \
  EXT_PARALLEL=4 NVCC_APPEND_FLAGS="--threads 8" MAX_JOBS=32 \
  pip install -v -e .

CUDA_ARCHITECTURES="8.0,8.6,8.9,9.0" covers Ampere through Hopper; the 8.9 entry is the RTX 4080 SUPER's Ada target. SageAttention is the single biggest VRAM / speed lever — community testimony in HF discussion #9 describes it as "The number one boon is SageAttention - this is a highly optimized, quantized attention kernel that nearly doubles inference speed for Wan on the right hardware (Ada) in exchange for a minor difference in output compared to SDPA/Flash."

3. (Optional, Ada-specific) install Q8 Kernels for the INT8 path

For the INT8 distilled path, the LightX2V Quickstart lists Q8 Kernels as one of its quantization-operator options, describing them as "Suitable for Ada architecture GPUs (such as RTX 4090, L40S, etc.)." — the same sm_89 family the 4080 SUPER belongs to:

git clone https://github.com/KONAKONA666/q8_kernels.git
cd q8_kernels && git submodule init && git submodule update
python setup.py install

Skip this step if you only plan to run the FP8 path.

4. Pull the 4-step distilled T2V-14B checkpoint

The FP8 sub-directory is the recommended path for a 16 GB card:

# FP8 distill — ~22.3 GB on disk; T5 offloads to CPU at runtime
huggingface-cli download \
  lightx2v/Wan2.1-T2V-14B-StepDistill-CfgDistill-Lightx2v \
  --include "distill_fp8/*" \
  --local-dir ./weights/Wan2.1-T2V-14B-StepDistill

Or for the INT8 path (similar size, accelerated on cards with Q8 Kernels built):

huggingface-cli download \
  lightx2v/Wan2.1-T2V-14B-StepDistill-CfgDistill-Lightx2v \
  --include "distill_int8/*" \
  --local-dir ./weights/Wan2.1-T2V-14B-StepDistill
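
If you prefer to script the download, the same filter can be expressed with the huggingface_hub Python API. This is a sketch under the assumption that huggingface_hub is installed; the helper names are hypothetical, while the repo ID, sub-directory names and target path are the ones used in the commands above.

```python
REPO_ID = "lightx2v/Wan2.1-T2V-14B-StepDistill-CfgDistill-Lightx2v"
LOCAL_DIR = "./weights/Wan2.1-T2V-14B-StepDistill"


def include_patterns(precision):
    """Build the glob that selects one quantized sub-directory of the repo."""
    if precision not in ("fp8", "int8"):
        raise ValueError("precision must be 'fp8' or 'int8'")
    return [f"distill_{precision}/*"]


def download_distill(precision="fp8", local_dir=LOCAL_DIR):
    """Download only the chosen sub-directory; returns the local folder path."""
    # Imported here so the pattern helper can be used without the dependency.
    from huggingface_hub import snapshot_download

    return snapshot_download(
        repo_id=REPO_ID,
        allow_patterns=include_patterns(precision),
        local_dir=local_dir,
    )
```

Calling download_distill("fp8") should fetch the same files as the first huggingface-cli command; nothing is downloaded until the function is called.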

Alternative: Docker (simplest)

docker pull lightx2v/lightx2v:25101501-cu124

The 25101501-cu124 image is one of the two tags the Quickstart lists for pulling LightX2V; it includes Ada sm_89 support and runs cleanly on the 4080 SUPER. The Quickstart also documents a 26011201-cu128 tag and recommends the cuda128 environment for faster inference; either works on Ada.

Running

The HF model card documents running the 4-step distilled checkpoint through the framework's shell script:

bash scripts/wan/run_wan_t2v_distill_4step_cfg.sh

Note that the live scripts/wan/ directory has since been reorganized into task- and precision-specific variants. For the FP8 weights you downloaded in step 4, run the dedicated FP8 script, scripts/wan/run_wan_t2v_distill_fp8_4step_cfg.sh (verified present in the repo). Open the script and set the repo root and the model_path to your downloaded weights directory before running it.

For the Python API directly, the official LightX2V README documents the offload knobs that keep peak VRAM under 16 GB. Adapted from that README's examples/wan/wan_i2v.py snippet for the T2V task:

from lightx2v import LightX2VPipeline

pipe = LightX2VPipeline(
    model_path="./weights/Wan2.1-T2V-14B-StepDistill/distill_fp8",
    model_cls="wan2.1",
    task="t2v",
)

# Required on a 16 GB card — keep the 6.73 GB FP8 umT5 text encoder on CPU
# so the resident DiT + VAE fits comfortably in VRAM.
pipe.enable_offload(
    cpu_offload=True,
    offload_granularity="block",
    text_encoder_offload=True,
    image_encoder_offload=False,
    vae_offload=False,
)

pipe.create_generator(
    attn_mode="sage_attn2",
    infer_steps=4,            # the whole point of the distilled checkpoint
    height=480, width=832,
    num_frames=81,
    guidance_scale=1.0,       # CFG disabled — the distilled checkpoint runs CFG-free
    sample_shift=5.0,
)
pipe.generate(
    seed=42,
    prompt="A man with short gray hair plays a red electric guitar.",
    save_result_path="./output.mp4",
)

The enable_offload(...) and create_generator(...) parameter names above are reproduced from the README's documented snippet. The HF model card recommends the LCM scheduler with shift=5.0 and guidance_scale=1.0 (i.e., without CFG) — and the distilled checkpoint must run at infer_steps=4, not the base model's 40. Start at 480×832, 81 frames, 4 steps and only push to 720×1280 once you've confirmed peak VRAM stays comfortably under 16 GB via nvidia-smi -l 1. Output lands at the path you pass to save_result_path.
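
Watching nvidia-smi -l 1 by eye works, but recording the peak is easier with a small poller run in a second terminal during generation. The sketch below assumes nvidia-smi is on PATH and uses its --query-gpu CSV output; the function names are hypothetical. Values are in MiB, as nvidia-smi reports them.

```python
import subprocess
import time

QUERY = [
    "nvidia-smi",
    "--query-gpu=memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_memory(csv_text):
    """Parse 'used, total' rows (MiB) into a list of (used, total) tuples."""
    rows = []
    for line in csv_text.strip().splitlines():
        used, total = (int(field.strip()) for field in line.split(","))
        rows.append((used, total))
    return rows


def watch_peak(seconds=120, interval=1.0, gpu_index=0):
    """Poll nvidia-smi for `seconds` and return the peak memory.used in MiB."""
    peak = 0
    deadline = time.monotonic() + seconds
    while time.monotonic() < deadline:
        out = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
        peak = max(peak, parse_memory(out)[gpu_index][0])
        time.sleep(interval)
    return peak
```

Start watch_peak(...) just before pipe.generate(...) and compare the result with the card's total before raising the resolution.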

Results

  • Speed: No RTX 4080 SUPER-specific benchmark for the 4-step distilled checkpoint has been published yet, and /check/lightx2v/rtx-4080-super currently returns verdict: unknown with no benchmark rows. The framework's own README performance table reports 20.26 s/it for LightX2V single-GPU on an RTX 4090D running base Wan2.1-I2V-14B at 40 steps, but that figure does not transfer to a 4080 SUPER for three reasons: the 4090D is a 24 GB card with roughly 40% more CUDA cores (14,592 vs 10,240), the measurement is the un-distilled base (40 steps, not this recipe's 4), and it is I2V rather than T2V. Rather than quote a misleading number, this recipe omits wall-clock speed. What is known from the model itself: the distilled checkpoint runs in 4 steps instead of the base model's 40, a 10× cut in denoising steps per clip. Empirical RTX 4080 SUPER numbers will land at /check/lightx2v/rtx-4080-super once a community benchmark is submitted via /contribute.
  • VRAM usage: The HF model card states that "New fp8 and int8 quantized distillation models have been added, which enable fast inference using lightx2v on RTX 4060." The RTX 4060 is the 8 GB Ada sibling of the 16 GB 4080 SUPER, so the same FP8/INT8 path has more headroom on this card. Community testimony from HF discussion #9 covers the 16 GB Ada tier directly: "I run it in 16GB VRAM on my 4070 Ti Super myself using my own https://github.com/Sarania/blissful-tuner/ using features like SageAttention ( https://github.com/thu-ml/SageAttention ), fp8 scaled quantization, torch.compile optimization, transformer block swap, and more." The RTX 4070 Ti Super shares the Ada sm_89 architecture and the 16 GB VRAM envelope of the recipe target, so the same FP8 + SageAttention + offload combination applies here, although that report used a different tool (blissful-tuner) rather than LightX2V. The FP8 sub-directory is ~22.3 GB on disk per the HF tree API (DiT + 6.73 GB FP8 umT5 + 0.51 GB VAE), so fitting in 16 GB of VRAM is feasible only with the text encoder offloaded to CPU. See /check/lightx2v/rtx-4080-super for empirical numbers as they land.
  • Quality notes: The distilled checkpoint trades fine motion detail and prompt fidelity for the 4-step / no-CFG speedup. Use the recommended settings from the HF model card — LCM scheduler, shift=5.0, guidance_scale=1.0 — and stay close to the model's training resolutions (480×832, 720×1280) for best results.
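
The step-count comparison in the Speed bullet can be made slightly more concrete by counting denoiser forward passes. This is a back-of-the-envelope sketch, and it rests on one assumption that the recipe does not state: that the 40-step base run uses classifier-free guidance with a separate conditional and unconditional pass at every step.

```python
def denoiser_passes(steps, uses_cfg):
    """Forward passes through the DiT for one clip.

    Assumption: CFG costs two passes per step (conditional + unconditional).
    """
    return steps * (2 if uses_cfg else 1)


base = denoiser_passes(steps=40, uses_cfg=True)        # un-distilled base
distilled = denoiser_passes(steps=4, uses_cfg=False)   # this recipe's checkpoint

if __name__ == "__main__":
    print(f"base: {base} passes, distilled: {distilled} passes")
    print(f"reduction: {base // distilled}x")
```

Under that assumption the reduction in forward passes is larger than the 10× reduction in steps, but it still says nothing about wall-clock time on a 4080 SUPER, which depends on offload traffic and attention kernels.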

For the full benchmark data, see /check/lightx2v/rtx-4080-super.

Troubleshooting

Out of memory loading the BF16 distill

The BF16 distill_models/distill_model.safetensors is 28.58 GB on disk (HF tree API), about 1.8× the 4080 SUPER's 16 GB of VRAM. CPU offload alone is not enough to close that gap; stick to the FP8 or INT8 sub-directory on this card.

Out of memory even with the FP8 path

Most 16 GB failures trace to one of three causes:

  • Text encoder not offloaded. The FP8 umT5 is 6.73 GB on its own (HF tree API); together with the DiT it crowds the 16 GB envelope. Keep the text encoder offloaded to CPU (text_encoder_offload=True).
  • SageAttention not active. Community testimony in HF discussion #9 calls out SageAttention as the primary VRAM/speed lever — without it, attention activations on a 480×832 / 81-frame clip can OOM. Build it per step 2 and pass attn_mode="sage_attn2" to create_generator(...).
  • Resolution / frame count too aggressive. 720×1280 / 81 frames eats substantially more activation memory than 480×832 / 81 frames. Stay at 480×832 until you've measured peak VRAM during a successful run, then step up gradually.

If you're still hitting OOM after these three, HF discussion #9 recommends also enabling torch.compile — described there as netting "another +20% speed and -20% VRAM" — plus transformer block swap as the final levers before dropping resolution.
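
To get a feel for why the third cause matters, it helps to estimate how many latent tokens the transformer processes at each setting. The sketch below assumes the commonly described Wan2.1 layout (8× spatial and 4× temporal VAE compression, followed by 2×2 spatial patches); those factors are not taken from this recipe, so treat the output as an estimate.

```python
def latent_tokens(height, width, num_frames,
                  vae_spatial=8, vae_temporal=4, patch=2):
    """Estimate the DiT sequence length for one clip.

    Assumed layout: the VAE keeps the first frame and compresses the rest
    4x in time, shrinks each side 8x, and the DiT patchifies 2x2.
    """
    latent_frames = (num_frames - 1) // vae_temporal + 1
    stride = vae_spatial * patch  # 16 pixels per token along each side
    tokens_per_frame = (height // stride) * (width // stride)
    return latent_frames * tokens_per_frame


if __name__ == "__main__":
    low = latent_tokens(480, 832, 81)
    high = latent_tokens(720, 1280, 81)
    print(f"480x832x81:  {low} tokens")
    print(f"720x1280x81: {high} tokens ({high / low:.2f}x)")
```

Under these assumptions the 720×1280 clip carries a bit more than twice the tokens of the 480×832 clip, and activation memory grows at least in proportion.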

Slow inference despite the 4-step distillation

The 4-step path only delivers the advertised speedup if CFG is actually disabled and the distilled settings are loaded. Per the HF model card, use the LCM scheduler with shift=5.0 and guidance_scale=1.0 (i.e., without CFG), and infer_steps=4 — not the base model's 40. The trap is bespoke scripts that copy partial parameters and silently fall back to the un-distilled inference path; the provided shell scripts and config JSONs already encode the right defaults.
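
A bespoke script can guard against this trap by checking its own settings before launching a run. The following is a minimal sketch: the checker is hypothetical and not part of LightX2V, and the key names simply mirror the create_generator(...) parameters used earlier in this recipe.

```python
# Values recommended for the distilled checkpoint in this recipe.
EXPECTED = {"infer_steps": 4, "guidance_scale": 1.0, "sample_shift": 5.0}


def check_distill_settings(settings):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    for key, want in EXPECTED.items():
        got = settings.get(key)
        if got is None:
            problems.append(f"{key} is missing (expected {want})")
        elif float(got) != float(want):
            problems.append(f"{key}={got} (expected {want})")
    return problems


if __name__ == "__main__":
    # A partially copied config that would fall back to base-model behaviour.
    suspicious = {"infer_steps": 40, "guidance_scale": 5.0}
    for problem in check_distill_settings(suspicious):
        print("warning:", problem)
```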

Resolution / frame-count crashes

The Wan2.1 base requires resolutions divisible by 16 and a frame count that follows the model's grouping. Stick to the example resolutions (480×832 / 81 frames; 720×1280 / 81 frames) until you've measured a comfortable VRAM margin via nvidia-smi -l 1 during a generation run.
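
A pre-flight check can catch bad shapes before any weights are loaded. In the sketch below, the divisible-by-16 rule comes from the paragraph above, while the frame rule (a count of the form 4n + 1, which 81 satisfies) is an assumption based on Wan2.1's 4× temporal compression and is not confirmed by this recipe. The function name is hypothetical.

```python
def check_shape(height, width, num_frames):
    """Return a list of problems with the requested clip shape (empty if OK)."""
    problems = []
    if height % 16:
        problems.append(f"height {height} is not divisible by 16")
    if width % 16:
        problems.append(f"width {width} is not divisible by 16")
    # Assumed grouping: one leading frame plus groups of four.
    if num_frames < 1 or (num_frames - 1) % 4:
        problems.append(f"num_frames {num_frames} is not of the form 4n + 1")
    return problems


if __name__ == "__main__":
    for shape in [(480, 832, 81), (720, 1280, 81), (500, 832, 80)]:
        print(shape, check_shape(*shape) or "ok")
```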

Report new issues via the submission form. Community RTX 4080 SUPER benchmarks would directly improve the /check/lightx2v/rtx-4080-super data.