What You'll Build
Generate short text-to-video clips locally using LightX2V — an inference framework that ships 4-step, CFG-free distilled checkpoints of Wan2.1-T2V-14B — on a 12 GB RTX 4070. Per the HF model card, the distilled checkpoint generates videos "with significantly fewer inference steps (4 steps) and without classifier-free guidance, substantially reducing video generation time while maintaining high quality outputs." The same card explicitly calls out: "New fp8 and int8 quantized distillation models have been added, which enable fast inference using lightx2v on RTX 4060." The RTX 4060 the maintainers name is an 8 GB Ada card — below the 12 GB RTX 4070 — and the framework reaches it through a disk-CPU-GPU offload path, so the same FP8 route runs on the 4070 with more headroom.
Hardware data: RTX 4070 (12 GB VRAM, Ada Lovelace sm_89) · 4-step distilled Wan2.1-T2V-14B · See benchmark data
ℹ️ This is distilled Wan 2.1, not Wan 2.2. The lightx2v org publishes a wider family (Wan2.1-T2V-14B distilled, Wan2.2-A14B distilled (timestep-MoE), HunyuanVideo-1.5 distilled, Qwen-Image distilled), each with different VRAM and inference characteristics. This recipe is specifically the Wan2.1-T2V-14B-StepDistill-CfgDistill-Lightx2v repo (4-step CFG-free distillation of the dense Wan2.1-T2V-14B base). The same install steps don't transfer cleanly to the A14B timestep-MoE; see the LightX2V repository for the Wan 2.2 / HunyuanVideo distilled releases.
⚠️ This is Text-to-Video: there is no image input. The repo's pipeline_tag is set to image-to-video, so HuggingFace auto-generates a diffusers Quick Start snippet that loads an input image and passes image=image. That is the wrong call shape for this repo. The repo name is Wan2.1-**T2V**-14B, the README's run command is bash scripts/wan/run_wan_t2v_distill_4step_cfg.sh, and the shipped script passes a text prompt only (--prompt "...", no image). Drive it with prompt= only, never image=.
⚡ FP8 is hardware-accelerated on Ada. The RTX 4070 (Ada Lovelace sm_89, AD104 die) has 4th-generation tensor cores with native FP8 support (E4M3 / E5M2), so loading the FP8 distilled weights gives you both the VRAM saving and the compute win, which is exactly why FP8 (not BF16) is the practical path on a 12 GB card. Unlike Blackwell sm_120 (RTX 50-series), the 4070 needs no special wheel: the default pip install torch already ships sm_89 kernels, the cu124 Docker image is the right one, and FlashAttention-2 / SageAttention have prebuilt sm_89 kernels (there is no sm_120 kernel-gap to work around). This recipe routes through FP8 as the primary path. NVFP4 / MXFP8 are Blackwell-only and do not apply on Ada; fall back to FP8 or GGUF.
Requirements
| Component | Minimum | Tested |
|---|---|---|
| GPU | "at least 8GB VRAM" per the LightX2V Quickstart | RTX 4070 (Ada Lovelace sm_89, 12 GB GDDR6X) |
| RAM | "16GB or more recommended" per the Quickstart; the offload CPU tier holds the 6.7 GB UMT5 encoder + streamed blocks | — |
| Storage | "At least 50GB available space" per the Quickstart; ~22 GB for the FP8 sub-directory alone | — |
| Software | Python 3.10+, PyTorch 2.6+ with CUDA 12.4 (default Ada wheel; cu128 is Blackwell-only) | per Quickstart |
The lightx2v/Wan2.1-T2V-14B-StepDistill-CfgDistill-Lightx2v HF repo ships three sub-directories (sizes verified live via the HF tree API):
- distill_fp8/: per-block FP8 quant, ~22.3 GB total (40 transformer blocks at 0.352 GB each = 14.08 GB DiT + non_block.safetensors 0.931 GB + FP8 UMT5-XXL encoder models_t5_umt5-xxl-enc-fp8.pth 6.733 GB + Wan2.1 VAE 0.508 GB). Recommended primary path on the 4070: FP8 is hardware-accelerated on Ada sm_89, and the per-block file layout is exactly what the framework's block-granularity offload streams from CPU, so the full ~15 GB DiT never has to be fully resident on the 12 GB card.
- distill_int8/: per-block INT8 quant, ~22.3 GB total with the same layout (INT8 UMT5 6.733 GB). Fully supported as well; on Ada the INT8 path runs through Q8 Kernels (see step 3 of Installation), which the Quickstart describes as "Suitable for Ada architecture GPUs (such as RTX 4090, L40S, etc.)", the same sm_89 family the 4070 belongs to.
- distill_models/: BF16 dense, ~40.5 GB total (28.577 GB DiT + 11.362 GB BF16 UMT5 + 0.508 GB VAE). The 28.577 GB DiT alone is more than double the 4070's 12 GB VRAM, so the BF16 path is not viable on this card; stay on FP8 / INT8.
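The size arithmetic in the listing above can be re-derived in a few lines. This is a minimal sketch that only restates the per-component figures quoted in this recipe; the dictionary keys and helper names are illustrative, not part of LightX2V.

```python
# Component sizes in GB, copied from the sub-directory listing in this recipe.
# Key names are illustrative labels, not file names.
FP8_PARTS = {
    "dit_blocks": 40 * 0.352,  # 40 per-block transformer files
    "non_block": 0.931,        # non_block.safetensors
    "umt5_fp8": 6.733,         # FP8 UMT5-XXL text encoder
    "vae": 0.508,              # Wan2.1 VAE
}
BF16_PARTS = {"dit": 28.577, "umt5_bf16": 11.362, "vae": 0.508}
VRAM_GB = 12.0  # RTX 4070


def total_gb(parts):
    """On-disk total for one sub-directory."""
    return round(sum(parts.values()), 3)


def dit_gb(parts):
    """Size of the diffusion transformer alone (excludes text encoder and VAE)."""
    dit_keys = ("dit", "non_block")
    return round(sum(v for k, v in parts.items() if k.startswith(dit_keys)), 3)


for name, parts in (("distill_fp8", FP8_PARTS), ("distill_models", BF16_PARTS)):
    resident_ok = dit_gb(parts) <= VRAM_GB
    print(f"{name}: total {total_gb(parts)} GB, DiT {dit_gb(parts)} GB, "
          f"DiT fully resident on {VRAM_GB:.0f} GB: {resident_ok}")
```

Neither DiT fits fully resident in 12 GB, which is why the FP8 path relies on block-granularity offload rather than on quantization alone.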
Installation
The canonical install is Docker (simplest) or conda from source — both documented in the LightX2V Quickstart.
1. Install the framework (conda path)
# Clone and create the environment
git clone https://github.com/ModelTC/LightX2V.git
cd LightX2V
conda create -n lightx2v python=3.10 -y
conda activate lightx2v
pip install -v -e .
Verbatim from the LightX2V Quickstart. On Ada Lovelace cards (4060, 4070, 4080, 4090) the default pip install torch already includes sm_89 kernels — no special CUDA index-url toggling required. (The cu128 channel is only needed for Blackwell sm_120 cards; ignore any cu128 instruction you see written for the RTX 50-series.)
2. (Strongly recommended) build SageAttention — the biggest attention-kernel lever
git clone https://github.com/thu-ml/SageAttention.git
cd SageAttention && CUDA_ARCHITECTURES="8.0,8.6,8.9,9.0" \
EXT_PARALLEL=4 NVCC_APPEND_FLAGS="--threads 8" MAX_JOBS=32 \
pip install -v -e .
CUDA_ARCHITECTURES="8.0,8.6,8.9,9.0" covers Ampere through Hopper; the 8.9 entry is the RTX 4070's Ada sm_89 target. SageAttention is the single biggest VRAM / speed lever for this workload — a community user with a same-family Ada card describes it in HF discussion #9 as one of the optimizations (alongside FP8 scaled quantization, torch.compile, and transformer-block swap) that let them run a Wan-architecture model in a tight VRAM envelope. Pass attn_mode="sage_attn2" when you create the generator (see Running).
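Before starting a long SageAttention build, it can help to confirm that the architecture list actually names your GPU. The sketch below is an illustration under simple assumptions: it checks for an exact compute-capability match only, the helper names are hypothetical, and the PyTorch probe is skipped if PyTorch is not installed.

```python
def parse_arch_list(value: str) -> set[tuple[int, int]]:
    """Turn a string like "8.0,8.6,8.9,9.0" into {(8, 0), (8, 6), (8, 9), (9, 0)}."""
    archs = set()
    for item in value.split(","):
        item = item.strip()
        if not item:
            continue
        major, minor = item.split(".")
        archs.add((int(major), int(minor)))
    return archs


def covers(arch_list: str, capability: tuple[int, int]) -> bool:
    """True if the build list names this exact compute capability."""
    return tuple(capability) in parse_arch_list(arch_list)


BUILD_ARCHS = "8.0,8.6,8.9,9.0"  # the value used in the build command above

try:
    import torch  # optional: only used to read the local GPU's capability

    if torch.cuda.is_available():
        cap = torch.cuda.get_device_capability(0)  # (8, 9) on an RTX 4070
        print(f"GPU capability sm_{cap[0]}{cap[1]} covered: {covers(BUILD_ARCHS, cap)}")
    else:
        print("No CUDA device visible to PyTorch.")
except ImportError:
    print("PyTorch not installed; skipping the device probe.")
```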
3. (Optional, Ada-specific) install Q8 Kernels for the INT8 path
For the INT8 distilled path, the LightX2V Quickstart lists Q8 Kernels as one of its quantization-operator options, describing them as "Suitable for Ada architecture GPUs (such as RTX 4090, L40S, etc.)." — the same sm_89 family the RTX 4070 belongs to:
git clone https://github.com/KONAKONA666/q8_kernels.git
cd q8_kernels && git submodule init && git submodule update
python setup.py install
Skip this step if you only plan to run the FP8 path.
4. Pull the 4-step distilled T2V-14B checkpoint (FP8 — recommended primary path)
FP8 is hardware-accelerated on Ada sm_89; this is the recommended path for the 4070:
huggingface-cli download \
lightx2v/Wan2.1-T2V-14B-StepDistill-CfgDistill-Lightx2v \
--include "distill_fp8/*" \
--local-dir ./weights/Wan2.1-T2V-14B-StepDistill
Or the INT8 weights (also Ada-supported via Q8 Kernels — there is no separate shipped INT8 shell script, so the INT8 weights are pulled the same way and then pointed at the INT8 config):
huggingface-cli download \
lightx2v/Wan2.1-T2V-14B-StepDistill-CfgDistill-Lightx2v \
--include "distill_int8/*" \
--local-dir ./weights/Wan2.1-T2V-14B-StepDistill
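After the download finishes, a quick size check catches truncated or partial pulls. This is a rough sketch, assuming the ~22.3 GB figure quoted earlier is in decimal GB; the helper names are hypothetical, and the two file names are the ones this recipe lists for the FP8 sub-directory (they are searched recursively because the exact nesting is not assumed).

```python
from pathlib import Path


def dir_size_gb(root) -> float:
    """Sum the size of every regular file under root, in decimal GB."""
    return sum(p.stat().st_size for p in Path(root).rglob("*") if p.is_file()) / 1e9


def missing_files(root, expected_names):
    """Return the expected file names that appear nowhere under root."""
    root = Path(root)
    return [name for name in expected_names if not any(root.rglob(name))]


# File names quoted in this recipe for the FP8 sub-directory.
EXPECTED_FP8 = ["non_block.safetensors", "models_t5_umt5-xxl-enc-fp8.pth"]

weights = Path("./weights/Wan2.1-T2V-14B-StepDistill/distill_fp8")
if weights.is_dir():
    print(f"on-disk size: {dir_size_gb(weights):.1f} GB (expected about 22.3 GB)")
    print(f"missing files: {missing_files(weights, EXPECTED_FP8)}")
else:
    print(f"{weights} not found; run the download step first.")
```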
Alternative: Docker (simplest)
docker pull lightx2v/lightx2v:25101501-cu124
The 25101501-cu124 tag is the right one for the 4070: Ada sm_89 kernels are included in the CUDA 12.4 image, and the default pip install torch Ada path matches it. The Quickstart also documents a 26011201-cu128 tag, but that cu128 image targets Blackwell sm_120 and is unnecessary on Ada; stay on cu124.
Alternative path: ComfyUI + GGUF (community-quantized)
If you'd rather work in ComfyUI than the LightX2V Python framework, QuantStack ships GGUF conversions that explicitly identify as "a GGUF conversion of an addon of lightx2v/Wan2.1-T2V-14B-StepDistill-CfgDistill", produced with city96's standard conversion scripts. On a 12 GB 4070 stay on the Q4_K_M / Q5_K_M (~11–13 GB DiT) or smaller tiers so the separate UMT5 text encoder and VAE still fit alongside; the Q8_0 (18.7 GB DiT) does not fit. Loading needs the ComfyUI-GGUF custom node plus the UMT5-XXL text encoder (Comfy-Org safetensors or city96's GGUF UMT5 mirror) and the Wan2.1 VAE (Kijai mirror). GGUF inference is arch-independent; the 4070's PCIe Gen4 link makes any CPU-offloaded encoder slightly slower than on a Gen5 card, but the path still fits.
Running
The LightX2V repo ships ready-to-run shell scripts under scripts/wan/. The relevant one for the FP8 text-to-video path is:
# FP8 path on Ada
bash scripts/wan/run_wan_t2v_distill_fp8_4step_cfg.sh
Under the hood this invokes the FP8 text-to-video pipeline with the matching config JSON from configs/distill/wan21/ (wan_t2v_distill_fp8_4step_cfg.json). Fill in the cloned repo root and the model_path (the directory you downloaded weights to in step 4) at the top of the script before running. The shipped script passes a text prompt only (--prompt "...") and a --save_result_path — there is no image input, which is the correct shape for this text-to-video repo.
The one change that matters for 12 GB. The shipped wan_t2v_distill_fp8_4step_cfg.json defaults to "cpu_offload": false and "offload_granularity": "model", which keeps the full ~15 GB FP8 DiT resident — that fits a 16 GB card but OOMs a 12 GB 4070 (verified against the shipped config). Per the Parameter Offload guide, edit the config (or pass overrides) to turn on the disk-CPU-GPU offload so blocks stream from CPU instead of all being resident:
{
    "cpu_offload": true,
    "offload_granularity": "block",
    "t5_cpu_offload": true,
    "vae_cpu_offload": false
}
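If you prefer not to hand-edit the shipped file, the four overrides above can be merged into a copy of the config. This is a minimal sketch using only the standard library; the function names and the output file name are hypothetical, and you would still need to point the run script at the patched copy yourself.

```python
import json
from pathlib import Path

# The four keys from the override snippet above.
OFFLOAD_OVERRIDES = {
    "cpu_offload": True,
    "offload_granularity": "block",
    "t5_cpu_offload": True,
    "vae_cpu_offload": False,
}


def apply_offload(config: dict) -> dict:
    """Return a copy of config with the 12 GB offload overrides applied."""
    patched = dict(config)
    patched.update(OFFLOAD_OVERRIDES)
    return patched


def patch_file(src, dst) -> None:
    """Read the shipped JSON config and write a patched copy to dst."""
    config = json.loads(Path(src).read_text())
    Path(dst).write_text(json.dumps(apply_offload(config), indent=4) + "\n")


# Example call (the output file name is a hypothetical choice):
# patch_file(
#     "configs/distill/wan21/wan_t2v_distill_fp8_4step_cfg.json",
#     "configs/distill/wan21/wan_t2v_distill_fp8_4step_cfg_12gb.json",
# )
```

Writing a copy instead of editing in place keeps the shipped defaults intact, so a later `git pull` does not conflict with local changes.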
The official guide's progressive strategy for memory-constrained devices is to first enable cpu_offload, then gradually enable CPU offload for the T5 / CLIP / VAE components, then consider quantization + offload or lazy_load (Parameter Offload guide). On the 4070, cpu_offload: true + offload_granularity: "block" + t5_cpu_offload: true is the combination that keeps the FP8 DiT streaming and the 6.733 GB UMT5 encoder off the GPU. Start at 480×832, 81 frames, 4 steps (the config defaults) and only push toward 720×1280 once you've confirmed peak VRAM via nvidia-smi -l 1.
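The peak-VRAM check can also be scripted rather than eyeballed. The sketch below polls `nvidia-smi` once per second and keeps the highest reading; the helper names are hypothetical, it reads only the first GPU, and it samples at one-second granularity, so very short spikes may be missed.

```python
import subprocess
import time

# nvidia-smi query that prints one "used, total" line per GPU, in MiB.
QUERY = [
    "nvidia-smi",
    "--query-gpu=memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_memory(line: str) -> tuple[int, int]:
    """Parse one 'used, total' CSV line into integers (MiB)."""
    used, total = (int(field.strip()) for field in line.split(","))
    return used, total


def watch_peak(seconds: float = 60.0, interval: float = 1.0) -> int:
    """Poll nvidia-smi for `seconds` and return the highest memory.used seen (MiB)."""
    peak = 0
    deadline = time.monotonic() + seconds
    while time.monotonic() < deadline:
        out = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
        used, _total = parse_memory(out.splitlines()[0])  # first GPU only
        peak = max(peak, used)
        time.sleep(interval)
    return peak


# Run in a second terminal while generation is in progress, for example:
# print(f"peak VRAM: {watch_peak(seconds=300)} MiB")
```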
For the Python API directly, the FP8 text-to-video path on a 4070 looks like this. The enable_offload(...) / create_generator(...) / generate(...) parameter names are reproduced from the LightX2V README (the README's worked example is I2V; for this T2V repo, drop image_path= and pass prompt= only):
from lightx2v import LightX2VPipeline

pipe = LightX2VPipeline(
    model_path="./weights/Wan2.1-T2V-14B-StepDistill/distill_fp8",
    model_cls="wan2.1_distill",
    task="t2v",
)

# REQUIRED on the 12 GB 4070: stream the FP8 DiT in block granularity and keep the
# 6.733 GB UMT5 text encoder on CPU so the model fits the 12 GB envelope.
pipe.enable_offload(
    cpu_offload=True,  # stream the FP8 DiT; do NOT keep it all resident on 12 GB
    offload_granularity="block",
    text_encoder_offload=True,  # leaves the 6.733 GB UMT5 on CPU
    image_encoder_offload=False,
    vae_offload=False,
)

pipe.create_generator(
    attn_mode="sage_attn2",  # SageAttention 2, sm_89-optimised
    infer_steps=4,  # the whole point of the distilled checkpoint
    height=480, width=832,
    num_frames=81,
    guidance_scale=1.0,  # CFG disabled: the distilled checkpoint runs CFG-free
    sample_shift=5.0,
)

pipe.generate(
    seed=42,
    prompt="A man with short gray hair plays a red electric guitar.",
    save_result_path="./output.mp4",
)
The recommended sampler settings are the LCM scheduler with shift=5.0 and guidance_scale=1.0 (no CFG) — documented verbatim on the HF model card and baked into the distill config JSONs. Output lands at the path you pass to save_result_path.
Results
- Speed: No first-party RTX 4070 benchmark has been published for the Wan2.1-T2V-14B 4-step distilled variant, and /check/lightx2v/rtx-4070 currently returns verdict: unknown with no benchmark rows. The framework's own README performance table reports a single consumer-GPU row (4090D · 8 GPUs + cfg · 4.75s/it), but that is an 8-GPU, CFG-enabled, un-distilled base measurement on a 24 GB card and does not transfer to a single 12 GB 4070 running this 4-step distilled checkpoint. Rather than quote a misleading number, this recipe omits wall-clock speed. What is a model fact: the distilled checkpoint runs in 4 steps instead of the base model's 40, cutting per-clip iteration count by roughly 10×. Empirical RTX 4070 numbers will land at /check/lightx2v/rtx-4070 once a community benchmark is submitted via /contribute.
- VRAM usage: The framework's Quickstart sets the floor at "at least 8GB VRAM" with offload + quant, and the HF model card confirms the fp8 / int8 distillation weights "enable fast inference using lightx2v on RTX 4060." (an 8 GB card). The 12 GB RTX 4070 sits comfortably above that 8 GB floor: with cpu_offload: true + offload_granularity: "block" the ~15 GB FP8 DiT (14.08 GB per-block weights + 0.931 GB non_block, per the HF tree API) streams from CPU rather than being fully resident, and t5_cpu_offload: true keeps the 6.733 GB UMT5 encoder off the GPU. The on-disk envelope is ~22.3 GB for the FP8 sub-directory; the BF16 sub-directory is ~40.5 GB (its 28.577 GB DiT alone is more than double 12 GB and not viable). As corroboration from a same-family Ada card, a community user reports in HF discussion #9: "I run it in 16GB VRAM on my 4070 Ti Super myself […] using features like SageAttention […] fp8 scaled quantization, torch.compile optimization, transformer block swap, and more." That is a 16 GB Ada sibling, not this 12 GB card, but it confirms that the same FP8 + SageAttention + offload combination this recipe walks through runs on Ada hardware. See /check/lightx2v/rtx-4070 for empirical numbers as they land.
- Quality notes: The distilled checkpoint trades fine motion detail and prompt fidelity for the 4-step / no-CFG speed-up. Use the recommended LCM scheduler, shift=5.0, guidance_scale=1.0 (HF model card) and stay close to the model's training resolutions (480×832, 720×1280) for best results.
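The "roughly 10×" step reduction can be made concrete with a little arithmetic. The sketch below additionally assumes that the base model's classifier-free guidance costs two DiT forward passes per step (one conditional, one unconditional), which is the standard CFG formulation but is an assumption about how the base pipeline is run. It counts forward passes only and says nothing about wall-clock time on a 4070.

```python
def dit_forward_passes(steps: int, uses_cfg: bool) -> int:
    """One DiT pass per sampling step, or two when CFG adds an unconditional pass."""
    return steps * (2 if uses_cfg else 1)


base_steps, distilled_steps = 40, 4  # step counts quoted in this recipe

step_ratio = base_steps / distilled_steps
pass_ratio = dit_forward_passes(base_steps, uses_cfg=True) / dit_forward_passes(
    distilled_steps, uses_cfg=False
)

print(f"sampling steps: {step_ratio:.0f}x fewer")
print(f"DiT forward passes (assuming 2 per CFG step): {pass_ratio:.0f}x fewer")
```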
For the full benchmark data, see /check/lightx2v/rtx-4070.
Troubleshooting
Out of memory loading the BF16 distill
The BF16 distill_models/distill_model.safetensors is 28.577 GB on disk (HF tree API) — larger than the 4070's 12 GB VRAM by more than 2×. CPU offload alone cannot recover this; stick to the FP8 or INT8 sub-directory on this card.
Out of memory on the FP8 path
The 12 GB envelope is too small to keep the full ~15 GB FP8 DiT resident, so offload is mandatory, not optional. The shipped config's cpu_offload: false / offload_granularity: "model" defaults are a 16 GB setting and will OOM the 4070. In order of effectiveness:
- Turn on block-granularity CPU offload. Set cpu_offload: true and offload_granularity: "block" in the config (or enable_offload(cpu_offload=True, offload_granularity="block") in the Python API). This streams the DiT blocks from CPU instead of holding all ~15 GB on the GPU.
- Offload the UMT5 text encoder. Set t5_cpu_offload: true (config) / text_encoder_offload=True (Python). The FP8 UMT5 is 6.733 GB on its own (HF tree API), so there is no room for it on the GPU alongside the streamed DiT. On the 4070's PCIe Gen4 link this CPU↔GPU streaming is slower than on a Gen5 card, but it keeps the model inside 12 GB.
- Make sure SageAttention 2 is actually loaded. Pass attn_mode="sage_attn2". Without it, attention activations on a 480×832 / 81-frame clip eat substantially more VRAM (and run slower). The quantization tutorial and HF discussion #9 both call SageAttention a primary lever for fitting Wan into a tight envelope.
- Drop frame count before resolution. As a final fallback, go from 480×832 / 81 frames down to 480×832 / 49 frames; only push to 720×1280 if peak VRAM stays comfortably below 12 GB.
The Parameter Offload guide documents the full progressive strategy for memory-constrained devices.
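To see why frame count and resolution matter for activation memory, you can estimate how many tokens the DiT attends over for a given clip. This is a rough sketch that assumes the commonly cited Wan2.1 geometry (a VAE that compresses 8× spatially and 4× temporally, and a DiT patch size of 1×2×2); none of those factors come from this recipe, so treat them as assumptions.

```python
def latent_tokens(height: int, width: int, frames: int,
                  spatial: int = 8, temporal: int = 4, patch: int = 2) -> int:
    """Estimate the DiT sequence length for one clip.

    Assumed geometry (not taken from this recipe): the VAE downsamples by
    `spatial` in height/width and by `temporal` in time (keeping the first
    frame), and the DiT groups each `patch` x `patch` latent area into one token.
    """
    latent_frames = (frames - 1) // temporal + 1
    return latent_frames * (height // spatial // patch) * (width // spatial // patch)


for h, w, f in ((480, 832, 81), (480, 832, 49), (720, 1280, 81)):
    print(f"{h}x{w}, {f} frames -> {latent_tokens(h, w, f)} tokens")
```

Under these assumptions, dropping from 81 to 49 frames at 480×832 cuts the sequence by about 38%, while moving to 720×1280 at 81 frames more than doubles it, which is consistent with lowering frame count first and raising resolution last.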
Wrong output / errors from the HF card's diffusers snippet
The repo's pipeline_tag is image-to-video, so the auto-generated diffusers Quick Start loads an input image and passes image=image. That is the Image-to-Video signature on what is a Text-to-Video repo. For T2V, call the pipeline with prompt= only (as in Running above), or use the official LightX2V shell script scripts/wan/run_wan_t2v_distill_fp8_4step_cfg.sh, which already encodes the prompt-only call shape.
Slow inference despite the 4-step distillation
The 4-step path only delivers the advertised speed-up if CFG is actually disabled and the distilled settings are loaded. Per the HF model card, use the LCM scheduler with shift=5.0 and guidance_scale=1.0 (i.e., without CFG), and infer_steps=4 — not the base model's 40. The trap is bespoke scripts that copy partial parameters and silently fall back to the un-distilled inference path; the shipped shell scripts and config JSONs already encode the right defaults.
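If you maintain your own launcher, a small guard can catch the partial-parameter trap before a slow run starts. This is a minimal sketch: the helper name is hypothetical, and the keys are the `create_generator(...)` parameter names used in the Running section.

```python
# Distilled settings recommended in this recipe (Python API parameter names).
DISTILLED_SETTINGS = {"infer_steps": 4, "guidance_scale": 1.0, "sample_shift": 5.0}


def setting_mismatches(kwargs: dict) -> dict:
    """Return {name: (expected, actual)} for every setting that deviates or is missing."""
    return {
        name: (expected, kwargs.get(name))
        for name, expected in DISTILLED_SETTINGS.items()
        if kwargs.get(name) != expected
    }


generator_kwargs = {"infer_steps": 40, "guidance_scale": 5.0}  # a deliberately wrong example
for name, (expected, actual) in setting_mismatches(generator_kwargs).items():
    print(f"{name}: expected {expected}, got {actual}")
```

A missing key is reported with an actual value of `None`, so settings that silently fall back to a library default are flagged as well.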
Report new issues via submission form — community RTX 4070 benchmarks would directly improve the /check/lightx2v/rtx-4070 data.