
LightX2V on RTX 4060 Ti 16GB: 4-Step Text-to-Video via the LightX2V Framework with Distilled Wan2.1-14B

video · intermediate · 8GB+ VRAM · May 21, 2026
models
tools
  • LightX2V
prerequisites
  • NVIDIA RTX 4060 Ti 16GB or any Ada Lovelace CUDA GPU with ≥8GB VRAM
  • 16GB+ system RAM (32GB recommended if you plan to offload BF16)
  • Python 3.10+ (3.11 recommended) and PyTorch 2.6+ with CUDA 12.4 or 12.8
  • ~25GB free disk space for the FP8 distilled weights (~65GB if you also pull the ~40.5GB BF16 set)

What You'll Build

Generate short text-to-video clips locally using LightX2V — an inference framework that ships 4-step, CFG-free distilled checkpoints of Wan2.1-T2V-14B — on the 16 GB RTX 4060 Ti. The distilled checkpoint cuts inference from 40–50 steps down to 4 with no classifier-free guidance, and the HF model card explicitly calls out: "New fp8 and int8 quantized distillation models have been added, which enable fast inference using lightx2v on RTX 4060." The 4060 Ti 16GB sits a tier above the 4060 the maintainers tested — same Ada Lovelace sm_89 arch, double the VRAM — so the FP8 / INT8 path runs with comfortable headroom once offload is engaged.

Hardware data: RTX 4060 Ti 16GB · 4-step distilled Wan2.1-T2V-14B · See benchmark data

ℹ️ This is distilled Wan 2.1, not Wan 2.2. The lightx2v org publishes a wider family — Wan2.1-T2V-14B distilled, Wan2.2-A14B distilled (timestep-MoE), HunyuanVideo-1.5 distilled, Qwen-Image distilled — each with different VRAM and inference characteristics. This recipe is specifically the Wan2.1-T2V-14B-StepDistill-CfgDistill-Lightx2v repo (4-step CFG-free distillation of the dense Wan2.1-T2V-14B base). The same install steps don't transfer cleanly to the A14B timestep-MoE — see the LightX2V Latest News for the Wan 2.2 / HunyuanVideo distilled releases.

⚠️ OOM without optimization is real. A user reported OOM with the unquantized distilled T2V-14B even on a 48 GB A6000 (HF predecessor-repo discussion #9 "OOMs"). On a 16 GB 4060 Ti you cannot run the BF16 path natively — the BF16 DiT alone is 28.58 GB on disk per the HF tree API. Stick to the FP8 (or INT8) distilled weights and enable cpu_offload=True plus text_encoder_offload=True.

Requirements

Component | Minimum | Tested
GPU | 8 GB VRAM (CUDA) per LightX2V Quickstart | RTX 4060 Ti 16GB (Ada sm_89)
RAM | 16 GB (per Quickstart); 32 GB recommended when offloading the BF16 path |
Storage | ~25 GB (FP8 or INT8 sub-directory) | ~65 GB if you also pull BF16 (~40.5 GB on its own)
Software | Python 3.10+, PyTorch 2.6+, CUDA 12.4 or 12.8 | per Quickstart

The lightx2v/Wan2.1-T2V-14B-StepDistill-CfgDistill-Lightx2v HF repo ships three sub-directories (HF tree API):

  • distill_fp8/ — per-block FP8 quant, ~22.3 GB total (40 transformer blocks at ~0.35 GB each + non_block.safetensors 0.93 GB + FP8 T5 encoder 6.73 GB + Wan2.1 VAE 0.51 GB). With text_encoder_offload=True the 6.73 GB T5 lives on CPU, leaving the ~15 GB DiT + VAE resident in 16 GB VRAM.
  • distill_int8/ — per-block INT8 with the same layout, ~22.3 GB total. Use this path if you build Q8 Kernels for Ada (see step 3 below).
  • distill_models/ — BF16 dense, ~40.5 GB total (28.58 GB DiT + 11.36 GB BF16 T5 + 0.51 GB VAE). The 28.58 GB DiT alone exceeds 16 GB of VRAM, so this recipe does not treat it as a workable primary path even with offload; keep to FP8/INT8.
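
The sizes in the list above can be added up to see why the text encoder has to leave the GPU. The following is a minimal sketch that uses only the figures quoted in this section; it counts weights only, and activations, CUDA context, and allocator overhead are deliberately left out, so real headroom will be smaller.

```python
# Component sizes in GB, copied from the sub-directory listing above.
FP8_BLOCKS = 40 * 0.35   # 40 transformer blocks at ~0.35 GB each
FP8_NON_BLOCK = 0.93     # non_block.safetensors
FP8_T5 = 6.73            # FP8 T5 text encoder
VAE = 0.51               # Wan2.1 VAE
VRAM_GB = 16.0           # RTX 4060 Ti 16GB


def fp8_budget(t5_on_gpu: bool) -> dict:
    """Return weight totals in GB for the FP8 path (activations NOT included)."""
    dit = FP8_BLOCKS + FP8_NON_BLOCK
    resident = dit + VAE + (FP8_T5 if t5_on_gpu else 0.0)
    return {
        "dit_gb": round(dit, 2),
        "resident_gb": round(resident, 2),
        "headroom_gb": round(VRAM_GB - resident, 2),
    }


if __name__ == "__main__":
    print("T5 on GPU:", fp8_budget(t5_on_gpu=True))
    print("T5 on CPU:", fp8_budget(t5_on_gpu=False))
```

With the T5 on the GPU the weights alone overshoot 16 GB by about 6 GB. With the T5 on the CPU, roughly half a gigabyte is left before any activations are allocated, which is consistent with the recipe also enabling block-level cpu_offload rather than relying on text_encoder_offload alone.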

Installation

The canonical install is Docker (simplest) or conda from source — both documented in the LightX2V Quickstart.

1. Install the framework (conda path)

# Clone and create the environment
git clone https://github.com/ModelTC/LightX2V.git
cd LightX2V
conda create -n lightx2v python=3.11 -y
conda activate lightx2v
pip install -v -e .

Verbatim from the LightX2V Quickstart. On Ada Lovelace cards (4060 Ti, 4070, 4080, 4090) the default pip install torch already includes sm_89 kernels — no cu128 toggling required.
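
Before moving on, it may be worth confirming that the PyTorch build in the new environment actually sees the card and reports compute capability 8.9. This is a small sketch using standard torch.cuda calls; the function name describe_gpu is only illustrative and is not part of LightX2V.

```python
def describe_gpu():
    """Return basic facts about CUDA device 0, or None if torch/CUDA is unavailable."""
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    props = torch.cuda.get_device_properties(0)
    major, minor = torch.cuda.get_device_capability(0)
    return {
        "name": props.name,
        "compute_capability": f"{major}.{minor}",   # expect "8.9" on Ada
        "vram_gb": round(props.total_memory / 1024**3, 1),
        "torch": torch.__version__,
        "cuda_runtime": torch.version.cuda,
    }


if __name__ == "__main__":
    info = describe_gpu()
    if info is None:
        print("No usable CUDA device found; check the driver and the torch build.")
    else:
        for key, value in info.items():
            print(f"{key}: {value}")
```

If compute_capability is not 8.9 on a 4060 Ti, or the function returns None, the later SageAttention and Q8 Kernels builds are unlikely to target the right architecture.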

2. (Recommended) build SageAttention 2 for ~2× attention speedup

git clone https://github.com/thu-ml/SageAttention.git
cd SageAttention && CUDA_ARCHITECTURES="8.0,8.6,8.9,9.0" \
  EXT_PARALLEL=4 NVCC_APPEND_FLAGS="--threads 8" MAX_JOBS=32 \
  pip install -v -e .

CUDA_ARCHITECTURES="8.0,8.6,8.9,9.0" covers Ampere through Hopper; the 8.9 entry is the RTX 4060 Ti's Ada target. SageAttention is the single biggest VRAM / speed lever — community testimony in HF discussion #9 ("The number one boon is SageAttention - this is a highly optimized, quantized attention kernel that nearly doubles inference speed") is consistent with the framework's own positioning.

3. (Optional, Ada-specific) install Q8 Kernels for the INT8 path

For the INT8 distilled path, the LightX2V Quickstart calls out Q8 Kernels as the "appropriate quantization operator … for Ada architecture GPUs (such as RTX 4090, L40S, etc.)" — the same sm_89 family the 4060 Ti belongs to:

git clone https://github.com/KONAKONA666/q8_kernels.git
cd q8_kernels && git submodule init && git submodule update
python setup.py install

Skip this step if you only plan to run the FP8 path.
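
After steps 2 and 3, a quick import check can tell you which optional kernels ended up in the environment. The sketch below assumes the packages install under the module names sageattention and q8_kernels; those names are an assumption based on the repository names, so adjust them if your build differs.

```python
import importlib.util


def optional_backends(modules=("sageattention", "q8_kernels")):
    """Map each module name to True if it can be found in the current environment."""
    return {name: importlib.util.find_spec(name) is not None for name in modules}


if __name__ == "__main__":
    for name, present in optional_backends().items():
        status = "available" if present else "not installed"
        print(f"{name}: {status}")
```

A missing sageattention entry means attn_mode="sage_attn2" will not be usable later, and a missing q8_kernels entry only matters if you plan to run the INT8 weights.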

4. Pull the 4-step distilled T2V-14B checkpoint

The FP8 sub-directory is the recommended path for a 16 GB card:

# FP8 distill — ~22.3 GB on disk; T5 offloads to CPU at runtime
huggingface-cli download \
  lightx2v/Wan2.1-T2V-14B-StepDistill-CfgDistill-Lightx2v \
  --include "distill_fp8/*" \
  --local-dir ./weights/Wan2.1-T2V-14B-StepDistill

Or for the INT8 path (similar size, faster on cards with Q8 Kernels built):

huggingface-cli download \
  lightx2v/Wan2.1-T2V-14B-StepDistill-CfgDistill-Lightx2v \
  --include "distill_int8/*" \
  --local-dir ./weights/Wan2.1-T2V-14B-StepDistill
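
If you would rather script the download than call the CLI, the same filter can be expressed with huggingface_hub.snapshot_download and its allow_patterns argument. This is a sketch, not part of the LightX2V docs; the helper names patterns_for and download_variant are illustrative.

```python
REPO_ID = "lightx2v/Wan2.1-T2V-14B-StepDistill-CfgDistill-Lightx2v"
VARIANTS = ("distill_fp8", "distill_int8")


def patterns_for(variant: str) -> list[str]:
    """Build the include pattern for one quantized sub-directory."""
    if variant not in VARIANTS:
        raise ValueError(f"unknown variant {variant!r}; expected one of {VARIANTS}")
    return [f"{variant}/*"]


def download_variant(
    variant: str = "distill_fp8",
    local_dir: str = "./weights/Wan2.1-T2V-14B-StepDistill",
) -> str:
    """Download one sub-directory and return the local path of the snapshot."""
    # Imported lazily so patterns_for() stays usable without the package installed.
    from huggingface_hub import snapshot_download

    return snapshot_download(
        repo_id=REPO_ID,
        allow_patterns=patterns_for(variant),
        local_dir=local_dir,
    )

# Usage (downloads ~22.3 GB):
#   path = download_variant("distill_fp8")
```

Restricting the variant to the two quantized sub-directories is a deliberate guard against pulling the ~40.5 GB BF16 set by accident.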

Alternative: Docker (simplest)

docker pull lightx2v/lightx2v:26011201-cu128

The cu128 tag is fine on the 4060 Ti — the wheel still includes Ada sm_89 support. The older 25101501-cu124 tag works equally well if you'd rather match a CUDA 12.4 driver (Quickstart).

Running

The framework ships ready-to-run shell scripts under scripts/wan/ — the relevant one for this recipe is run_wan_t2v_distill_fp8_4step_cfg.sh. Fill in lightx2v_path (the cloned repo root) and model_path (the directory you downloaded weights to in step 4 above), then:

bash scripts/wan/run_wan_t2v_distill_fp8_4step_cfg.sh

Under the hood, the script invokes (verbatim from the run script):

python -m lightx2v.infer \
  --model_cls wan2.1_distill \
  --task t2v \
  --model_path $model_path \
  --config_json ${lightx2v_path}/configs/distill/wan21/wan_t2v_distill_fp8_4step_cfg.json \
  --prompt "Two anthropomorphic cats in comfy boxing gear and bright gloves fight intensely on a spotlighted stage." \
  --use_prompt_enhancer \
  --negative_prompt "..." \
  --save_result_path ${lightx2v_path}/save_results/wan_t2v_distill_fp8_4step.mp4

Output lands at ${lightx2v_path}/save_results/wan_t2v_distill_fp8_4step.mp4. The recommended sampler settings — LCM scheduler with shift=5.0 and guidance_scale=1.0 — are baked into configs/distill/wan21/wan_t2v_distill_fp8_4step_cfg.json and explicitly documented on the HF model card.

For the Python API directly — the offload knobs that keep peak VRAM under 16 GB on a 4060 Ti:

from lightx2v import LightX2VPipeline

pipe = LightX2VPipeline(
    model_path="./weights/Wan2.1-T2V-14B-StepDistill/distill_fp8",
    model_cls="wan2.1_distill",
    task="t2v",
)

# Required on a 16 GB card: keep the 6.73 GB FP8 T5 on CPU. The ~15 GB DiT + VAE
# nearly fills VRAM on its own, so block-level cpu_offload is enabled as well.
pipe.enable_offload(
    cpu_offload=True,
    offload_granularity="block",
    text_encoder_offload=True,
    image_encoder_offload=False,
    vae_offload=False,
)

pipe.create_generator(
    attn_mode="sage_attn2",
    infer_steps=4,            # the whole point of the distilled checkpoint
    height=480, width=832,
    num_frames=81,
    guidance_scale=1.0,
    sample_shift=5.0,
)
pipe.generate(
    seed=42,
    prompt="A man with short gray hair plays a red electric guitar.",
    save_result_path="./output.mp4",
)

The enable_offload(...) call mirrors the official LightX2V examples/wan/wan_i2v.py snippet — copied here for the T2V task. Start at 480×832, 81 frames, 4 steps and only push to 720×1280 once you've confirmed peak VRAM stays comfortably under 16 GB via nvidia-smi -l 1.
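
Watching nvidia-smi -l 1 by eye makes it easy to miss a short spike. As an alternative, the sketch below polls nvidia-smi's CSV query interface from a second terminal and keeps the highest memory.used value it sees. The function names are illustrative, and it reads GPU 0 only.

```python
import subprocess
import time

QUERY = [
    "nvidia-smi",
    "--query-gpu=memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_memory(csv_text: str) -> list[tuple[int, int]]:
    """Parse 'used, total' rows (MiB, one row per GPU) into integer tuples."""
    rows = []
    for line in csv_text.strip().splitlines():
        used, total = (int(field.strip()) for field in line.split(","))
        rows.append((used, total))
    return rows


def peak_used_mib(samples: int = 120, interval_s: float = 1.0) -> int:
    """Poll nvidia-smi and return the highest memory.used seen on GPU 0, in MiB."""
    peak = 0
    for _ in range(samples):
        out = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
        peak = max(peak, parse_memory(out)[0][0])
        time.sleep(interval_s)
    return peak

# Usage, in a second terminal while a generation is running:
#   print(peak_used_mib(samples=300), "MiB peak")
```

Sampling once per second can still miss very brief allocations, so treat the result as a lower bound on the true peak.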

Results

  • Speed: No RTX 4060 Ti–specific benchmark has been published for the 4-step distilled checkpoint yet. For reference, the official LightX2V README's "Cross-Framework Performance Comparison" reports 20.26 s/it for LightX2V single-GPU on an RTX 4090D (a 24 GB Ada sm_89 card with substantially more compute than the 4060 Ti) running the base Wan2.1-I2V-14B at 480P, 40 steps. That number does not transfer to a 4060 Ti: the card sits in a different VRAM tier, has roughly a third of the CUDA cores (4,352 vs. 14,592) and narrower memory bandwidth, and the inference variant differs (base vs. distilled, I2V vs. T2V, 40 steps vs. 4). It does establish that the framework's recommended path is iteration-time-dominated, so the 4-step distillation cuts denoising time per clip by roughly 10× via the step-count drop alone. Empirical 4060 Ti 16GB numbers will land at /check/lightx2v/rtx-4060-ti-16gb once a community benchmark is submitted via /contribute.
  • VRAM usage: The HF model card states that "New fp8 and int8 quantized distillation models have been added, which enable fast inference using lightx2v on RTX 4060". The RTX 4060 is the 8 GB Ada sibling below the 4060 Ti 16GB, so the same path has more headroom on this card. The framework's Quickstart sets the floor at "at least 8GB VRAM" with offload + quant. Community testimony from HF discussion #9 covers the 16 GB Ada tier: "I run it in 16GB VRAM on my 4070 Ti Super myself … using features like SageAttention, fp8 scaled quantization, torch.compile optimization, transformer block swap, and more." The RTX 4070 Ti Super shares the Ada sm_89 arch and the 16 GB VRAM envelope of the recipe target, so the same FP8 + SageAttention + offload combination applies here. On disk, the FP8 sub-directory is ~22.3 GB per the HF tree API (~15 GB DiT + 6.73 GB FP8 T5 + 0.51 GB VAE), so fitting in 16 GB of VRAM is feasible only with text_encoder_offload=True keeping the T5 on CPU.
  • Quality notes: The distilled checkpoint trades fine motion detail and prompt fidelity for the 4-step / no-CFG speedup. Use the recommended LCM scheduler, shift=5.0, guidance_scale=1.0 (HF model card) and stay close to the model's training resolutions (480×832, 720×1280) for best results.
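
The step-count argument in the Speed bullet can be written out as plain arithmetic. The sketch below holds the per-iteration cost fixed at the README's 4090D figure, which is a simplifying assumption: the distilled model also skips classifier-free guidance, and text encoding and VAE decoding are not counted. It is not a prediction for the 4060 Ti.

```python
SEC_PER_IT_4090D = 20.26  # README figure: base Wan2.1-I2V-14B, 480P, RTX 4090D


def denoise_seconds(steps: int, sec_per_it: float = SEC_PER_IT_4090D) -> float:
    """Denoising-loop time if every step costs the same (a simplification)."""
    return steps * sec_per_it


if __name__ == "__main__":
    base = denoise_seconds(40)
    distilled = denoise_seconds(4)
    print(f"40 steps: {base:.1f} s")
    print(f" 4 steps: {distilled:.1f} s")
    print(f"ratio   : {base / distilled:.1f}x")
```

The 10× ratio comes purely from the step count; any real measurement on this card will also reflect offload transfers and the fixed encode and decode costs.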

For the full benchmark data, see /check/lightx2v/rtx-4060-ti-16gb.

Troubleshooting

Out of memory loading the BF16 distill

The BF16 distill_models/distill_model.safetensors is 28.58 GB on disk (HF tree API), about 1.8× the 4060 Ti's 16 GB of VRAM before the text encoder, VAE, or any activations are counted. Block-level CPU offload alone is not enough to recover; stick to the FP8 or INT8 sub-directory on this card.

Out of memory even with the FP8 path

Most 16 GB failures trace to one of three causes:

  • Text encoder not offloaded. The FP8 T5 is 6.73 GB on its own; together with the ~15 GB DiT it exceeds the 16 GB envelope. Set text_encoder_offload=True in enable_offload(...).
  • SageAttention not active. The framework's quantization docs and the community testimony in HF discussion #9 both call out SageAttention 2 as the primary VRAM lever — without it, the attention activations on a 480×832 / 81-frame clip can OOM. Pass attn_mode="sage_attn2" to create_generator(...).
  • Resolution / frame count too aggressive. 720×1280 / 81 frames eats substantially more activation memory than 480×832 / 81 frames. Stay at 480×832 until you've measured peak VRAM during a successful run, then step up gradually.

If you're still hitting OOM after these three, HF discussion #9 recommends also enabling torch.compile (cited as "+20% speed and -20% VRAM") and falling back to fp16_accumulation and VAE tiling as the final levers before dropping resolution.
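
To put a rough number on the resolution cause listed above, you can compare how many latent tokens the transformer sees at each setting. The sketch assumes the usual Wan2.1 layout of 8× spatial VAE compression followed by 2×2 patching (so 16 pixels per token side) and 4× temporal compression; these strides are assumptions and are not stated in this recipe's sources.

```python
def latent_tokens(
    height: int,
    width: int,
    num_frames: int,
    spatial_stride: int = 16,   # assumed: 8x VAE downsampling * 2x2 patches
    temporal_stride: int = 4,   # assumed: 4x temporal VAE compression
) -> int:
    """Approximate transformer sequence length for one clip."""
    latent_frames = (num_frames - 1) // temporal_stride + 1
    return latent_frames * (height // spatial_stride) * (width // spatial_stride)


if __name__ == "__main__":
    low = latent_tokens(480, 832, 81)
    high = latent_tokens(720, 1280, 81)
    print(f"480x832x81 : {low} tokens")
    print(f"720x1280x81: {high} tokens")
    print(f"ratio      : {high / low:.2f}x")
```

Under these assumptions the 720×1280 clip carries about 2.3× as many tokens as the 480×832 clip, so activation memory should grow by at least that factor, which is why stepping up gradually is the safer approach.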

Slow inference despite the 4-step distillation

The 4-step path only delivers the advertised speedup if the LCM scheduler is actually loaded and guidance_scale=1.0. If you're calling create_generator(...) directly, make sure you pass guidance_scale=1.0 and sample_shift=5.0 — and that infer_steps=4, not the base model's 40. The provided shell scripts and config JSONs already encode the right defaults; the trap is bespoke Python scripts that copy partial parameters and silently fall back to the un-distilled inference path. Both settings are explicit on the HF model card.
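
One way to catch the partial-parameter trap in a bespoke script is to check the keyword arguments before they reach create_generator(...). The helper below is a hypothetical guard, not a LightX2V API; it only compares a plain dict against the three values named in this section.

```python
DISTILL_DEFAULTS = {
    "infer_steps": 4,
    "guidance_scale": 1.0,
    "sample_shift": 5.0,
}


def distill_mismatches(kwargs: dict) -> list[str]:
    """List every distilled-path setting that is missing or differs from the default."""
    problems = []
    for key, expected in DISTILL_DEFAULTS.items():
        if key not in kwargs:
            problems.append(f"{key} missing (expected {expected})")
        elif kwargs[key] != expected:
            problems.append(f"{key}={kwargs[key]} (expected {expected})")
    return problems


if __name__ == "__main__":
    partial = {"infer_steps": 40, "guidance_scale": 1.0}
    for problem in distill_mismatches(partial):
        print("warning:", problem)
```

Calling distill_mismatches(generator_kwargs) and refusing to continue on a non-empty result makes the misconfiguration loud instead of silent.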

Resolution / frame-count crashes

The Wan2.1 base requires height and width divisible by 16 and a frame count that follows the model's temporal grouping (the example value 81 fits the 4n+1 pattern, 81 = 4·20 + 1). Stick to the example configs (480×832 / 81 frames; 720×1280 / 81 frames) until you've measured a comfortable VRAM margin via nvidia-smi -l 1 during a generation run.
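
A pre-flight check can reject bad shapes before the model is loaded. The sketch below encodes the divisible-by-16 rule from the paragraph above and assumes the frame grouping is 4n+1, which matches the 81-frame examples but is an assumption rather than something the cited sources spell out. check_shape is an illustrative name, not a LightX2V function.

```python
def check_shape(height: int, width: int, num_frames: int) -> list[str]:
    """Return a list of problems with the requested clip shape (empty if none)."""
    errors = []
    if height % 16 != 0 or width % 16 != 0:
        errors.append(f"{height}x{width}: height and width must be divisible by 16")
    if num_frames < 1 or (num_frames - 1) % 4 != 0:
        # Assumed grouping: frame counts of the form 4n + 1 (e.g. 81 = 4*20 + 1).
        errors.append(f"{num_frames} frames: expected a count of the form 4n + 1")
    return errors


if __name__ == "__main__":
    for shape in [(480, 832, 81), (720, 1280, 81), (500, 832, 80)]:
        problems = check_shape(*shape)
        print(shape, "ok" if not problems else problems)
```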

Report new issues via the submission form; community RTX 4060 Ti 16GB benchmarks would directly improve the /check/lightx2v/rtx-4060-ti-16gb data.