LightX2V on RTX 3090: 4-Step Text-to-Video with Distilled Wan2.1-14B via INT8 / BF16 Offload

What You'll Build

Generate short text-to-video clips locally using LightX2V — an inference framework that ships 4-step, CFG-free distilled checkpoints of Wan2.1-T2V-14B — on a 24 GB RTX 3090. The distilled checkpoint cuts inference from 40–50 steps down to 4 with no classifier-free guidance, and on Ampere the right path is INT8 (explicitly supported per the framework's quantization matrix), BF16 + offload, or a community GGUF — not FP8 (see the architecture note below).

Hardware data: RTX 3090 (24 GB VRAM) · 4-step distilled Wan2.1-T2V-14B · See benchmark data

ℹ️ This is distilled Wan 2.1, not Wan 2.2. The lightx2v org publishes a wider family — Wan2.1-T2V-14B distilled, Wan2.2-A14B distilled (timestep-MoE), HunyuanVideo-1.5 distilled, Qwen-Image distilled — each with different VRAM and inference characteristics. This recipe is specifically the Wan2.1-T2V-14B-StepDistill-CfgDistill-Lightx2v repo (4-step CFG-free distillation of the dense Wan2.1-T2V-14B base). The same install steps don't transfer cleanly to the A14B timestep-MoE — see the LightX2V Latest News for the Wan 2.2 / HunyuanVideo distilled releases.

⚠️ FP8 is the wrong path on Ampere. The RTX 3090 is Ampere sm_86 and has no FP8 tensor cores — FP8 only ships hardware acceleration on Ada sm_89 (RTX 40-series), Hopper sm_90 (H100), and Blackwell sm_120 (RTX 50-series). The official LightX2V quantization docs confirm this scope directly: FP8 modes (fp8-vllm, fp8-sgl, fp8-q8f) are listed as supported on "H100/H200/H800, RTX 40 series, etc." — RTX 30 is absent. INT8 modes (int8-vllm, int8-sgl) are explicitly supported on "A100/A800, RTX 30/40 series, etc." You can load FP8 weights on a 3090 (they're valid .pth / .safetensors tensors), but the runtime dequantizes to BF16 at compute time — you keep the VRAM savings, you lose the FP8 tensor-core speed-up. The recipe below routes through INT8, BF16 + offload, or GGUF instead.
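The routing rule in the note above can be written down as a small lookup. This is a minimal sketch, not LightX2V API: the function name and return labels are made up for illustration, and the thresholds simply restate the support lists quoted above (sm_89 and newer for FP8, Ampere sm_80 / sm_86 for INT8).

```python
# Sketch: pick a weight format from the GPU's CUDA compute capability.
# recommend_quant_path and its return labels are hypothetical helpers,
# not part of LightX2V.

def recommend_quant_path(major: int, minor: int) -> str:
    """Return 'fp8', 'int8', or 'unknown' for a compute capability."""
    capability = (major, minor)
    if capability >= (8, 9):   # Ada sm_89, Hopper sm_90, Blackwell sm_120
        return "fp8"
    if capability >= (8, 0):   # Ampere: A100 sm_80, RTX 30-series sm_86
        return "int8"
    return "unknown"           # older parts are not covered by the quoted lists


if __name__ == "__main__":
    # With PyTorch installed, torch.cuda.get_device_capability() returns
    # the (major, minor) tuple for the current GPU.
    print(recommend_quant_path(8, 6))  # RTX 3090 -> int8
```

On RTX 40-series cards INT8 is also supported per the quoted list; the sketch prefers FP8 there only because that is where the hardware acceleration exists.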

Requirements

Component	Minimum	Tested
GPU	8 GB VRAM (CUDA) per LightX2V Quickstart	RTX 3090 (Ampere `sm_86`, 24 GB)
RAM	16 GB minimum per Quickstart; 32 GB recommended when running the BF16 path with text-encoder offload	—
Storage	~25 GB (INT8 sub-directory)	~50 GB if you also pull BF16
Software	Python 3.10+, PyTorch 2.6+, CUDA 12.4 or 12.8	per Quickstart

The lightx2v/Wan2.1-T2V-14B-StepDistill-CfgDistill-Lightx2v HF repo ships three sub-directories (sizes from the HF tree API):

distill_int8/: per-block INT8 quant, ~22.3 GB total (40 transformer blocks at ~0.35 GB each + non_block.safetensors 0.93 GB + INT8 UMT5-XXL encoder 6.73 GB + Wan2.1 VAE 0.51 GB). Recommended primary path on the 3090: INT8 has full Ampere tensor-core support, and the ~15 GB DiT + VAE fits the 24 GB envelope with comfortable headroom for activations once the 6.73 GB UMT5 is offloaded to CPU (with everything resident, only ~2 GB remains).
distill_fp8/: per-block FP8 quant, ~22.3 GB total with the same layout. Weights load fine on a 3090, but compute dequantizes to BF16 on Ampere, so you get VRAM savings and no speed win. Use INT8 instead.
distill_models/: BF16 dense, ~40.5 GB total (28.58 GB DiT + 11.36 GB BF16 UMT5 + 0.51 GB VAE). The 28.58 GB DiT alone exceeds 24 GB VRAM, so the BF16 path on a 3090 requires cpu_offload=True plus text_encoder_offload=True, and 32 GB of system RAM is recommended to host the UMT5 and the offloaded blocks.
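The size arithmetic in the list above is easy to re-check. The sketch below only adds up the figures already quoted; the dictionary keys and helper names are illustrative, and the per-block size is the rounded ~0.35 GB, so the INT8 sum lands slightly under the ~22.3 GB the tree API reports.

```python
# Component sizes in GB, copied from the sub-directory list above.
INT8_PARTS = {
    "dit_blocks": 40 * 0.35,   # 40 blocks at ~0.35 GB each (rounded)
    "non_block": 0.93,
    "umt5_int8": 6.73,
    "vae": 0.51,
}
BF16_PARTS = {"dit": 28.58, "umt5_bf16": 11.36, "vae": 0.51}

VRAM_GB = 24.0  # RTX 3090


def total_gb(parts: dict) -> float:
    """Sum of all component sizes (the on-disk footprint)."""
    return sum(parts.values())


def resident_gb(parts: dict, offloaded: set) -> float:
    """Weights left on the GPU after moving the named components to CPU."""
    return sum(size for name, size in parts.items() if name not in offloaded)


if __name__ == "__main__":
    print(f"INT8 on disk: {total_gb(INT8_PARTS):.2f} GB")
    print(f"BF16 on disk: {total_gb(BF16_PARTS):.2f} GB")
    on_gpu = resident_gb(INT8_PARTS, offloaded={"umt5_int8"})
    print(f"INT8 resident, UMT5 offloaded: {on_gpu:.2f} of {VRAM_GB:.0f} GB")
```

These are weight sizes only; activations, CUDA context, and attention workspace come on top, which is why measured peak VRAM will be higher.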

Installation

The canonical install is Docker (simplest) or conda from source; both are documented in the LightX2V Quickstart. The 3090 (sm_86) belongs to the same Ampere generation as the A100 (sm_80), and both the A100 and the RTX 30 series are explicitly named in the framework's INT8 support list.

1. Install the framework (conda path)

# Clone and create the environment
git clone https://github.com/ModelTC/LightX2V.git
cd LightX2V
conda create -n lightx2v python=3.11 -y
conda activate lightx2v
pip install -v -e .

Verbatim from the LightX2V Quickstart. On Ampere cards the default pip install torch already includes sm_86 kernels — no special wheel selection is required (unlike Blackwell, which needs cu128).

2. (Recommended) build SageAttention 2 — gives a meaningful boost on Ampere too

git clone https://github.com/thu-ml/SageAttention.git
cd SageAttention && CUDA_ARCHITECTURES="8.0,8.6,8.9,9.0" \
  EXT_PARALLEL=4 NVCC_APPEND_FLAGS="--threads 8" MAX_JOBS=32 \
  pip install -v -e .

CUDA_ARCHITECTURES="8.0,8.6,8.9,9.0" covers Ampere through Hopper; the 8.6 entry is the RTX 3090's Ampere target. The author of the blissful-tuner toolkit cited in HF discussion #9 explicitly addressed an Ampere-card poster: "You have an Ampere card so the boon won't be quite as much, but Ampere IS supported so you definitely will want that." SageAttention 2 is still the single biggest attention-kernel lever on the 3090 — just don't expect the near-2× Ada speed-up.

3. Pull the 4-step distilled T2V-14B checkpoint (INT8 — recommended primary path)

INT8 has hardware-accelerated tensor-core support on Ampere; this is the recommended path for the 3090:

huggingface-cli download \
  lightx2v/Wan2.1-T2V-14B-StepDistill-CfgDistill-Lightx2v \
  --include "distill_int8/*" \
  --local-dir ./weights/Wan2.1-T2V-14B-StepDistill

Alternative: the BF16 path also fits a 24 GB 3090 with offload (the 28.58 GB DiT exceeds 24 GB raw, but block-level cpu_offload=True plus text_encoder_offload=True brings peak VRAM back under the envelope at the cost of more system RAM and slower step time):

huggingface-cli download \
  lightx2v/Wan2.1-T2V-14B-StepDistill-CfgDistill-Lightx2v \
  --include "distill_models/*" \
  --local-dir ./weights/Wan2.1-T2V-14B-StepDistill
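If you'd rather script the download, the same filter can be expressed through the huggingface_hub Python package (the library behind huggingface-cli). A minimal sketch: snapshot_download and its repo_id / allow_patterns / local_dir parameters are real, while download_kwargs, download, and the variant labels are hypothetical helpers introduced here.

```python
# Sketch: Python equivalent of the huggingface-cli commands above.
REPO_ID = "lightx2v/Wan2.1-T2V-14B-StepDistill-CfgDistill-Lightx2v"
SUBDIRS = {"int8": "distill_int8", "fp8": "distill_fp8", "bf16": "distill_models"}


def download_kwargs(variant: str,
                    local_dir: str = "./weights/Wan2.1-T2V-14B-StepDistill") -> dict:
    """Build snapshot_download arguments for one sub-directory of the repo."""
    if variant not in SUBDIRS:
        raise ValueError(f"unknown variant {variant!r}; pick one of {sorted(SUBDIRS)}")
    return {
        "repo_id": REPO_ID,
        "allow_patterns": [f"{SUBDIRS[variant]}/*"],  # same filter as --include
        "local_dir": local_dir,
    }


def download(variant: str) -> str:
    """Fetch the chosen sub-directory and return the local folder path."""
    # Imported lazily so the helper above works without huggingface_hub installed.
    from huggingface_hub import snapshot_download
    return snapshot_download(**download_kwargs(variant))
```

Calling download("int8") pulls the same files as the recommended command; download("bf16") matches the BF16 alternative.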

Alternative: Docker (simplest)

docker pull lightx2v/lightx2v:26011201-cu128

The cu128 tag works on the 3090 — the wheel still includes Ampere sm_86 support. The older 25101501-cu124 tag is equivalent if you'd rather match a CUDA 12.4 driver (Quickstart).

Alternative path: ComfyUI + GGUF (community-quantized)

If you'd rather work in ComfyUI than the LightX2V Python framework, QuantStack ships GGUF conversions that explicitly identify as "a GGUF conversion of an addon of lightx2v/Wan2.1-T2V-14B-StepDistill-CfgDistill" (linking back to the canonical repo), produced with city96's standard conversion scripts. The quant ladder per the QuantStack README (DiT only; the text encoder and VAE load separately):

Tier	DiT size	Fits 24 GB w/ UMT5 + VAE
Q2_K	6.36 GB	yes — very ample headroom
Q3_K_S / Q3_K_M / Q3_K_L	7.84 / 8.64 / 9.37 GB	yes
Q4_0 / Q4_K_S / Q4_1 / Q4_K_M	10.3 / 10.6 / 11.2 / 11.6 GB	yes
Q5_K_S / Q5_0 / Q5_K_M / Q5_1	12.3 / 12.5 / 13 / 13.3 GB	yes
Q6_K	14.5 GB	yes
Q8_0	18.7 GB	yes (tight w/ activations — favour Q5/Q6)
F16	34.7 GB	no (the DiT alone exceeds 24 GB)

Loading needs the ComfyUI-GGUF custom node plus the separate UMT5-XXL text encoder (Comfy-Org safetensors or city96's GGUF UMT5 mirror) and the Wan2.1 VAE (Kijai mirror). Note that QuantStack's release pre-merges the VACE control-conditioning addon onto the distilled base — useful if you want pose/depth control, otherwise identical inference shape for plain T2V.
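Picking a tier from the quant ladder above comes down to subtracting whatever you want to keep free from 24 GB. The sketch below uses the DiT sizes from the table; the helper name and the reserve figure you pass in (text encoder, VAE, activations) are assumptions to tune against your own nvidia-smi readings.

```python
from typing import Optional

# DiT sizes in GB, copied from the QuantStack quant ladder above.
GGUF_DIT_GB = {
    "Q2_K": 6.36, "Q3_K_S": 7.84, "Q3_K_M": 8.64, "Q3_K_L": 9.37,
    "Q4_0": 10.3, "Q4_K_S": 10.6, "Q4_1": 11.2, "Q4_K_M": 11.6,
    "Q5_K_S": 12.3, "Q5_0": 12.5, "Q5_K_M": 13.0, "Q5_1": 13.3,
    "Q6_K": 14.5, "Q8_0": 18.7, "F16": 34.7,
}


def largest_tier_that_fits(vram_gb: float, reserve_gb: float) -> Optional[str]:
    """Largest DiT tier whose file size fits in vram_gb minus reserve_gb.

    reserve_gb is a guess at everything that is not DiT weights
    (text encoder, VAE, activations); it is not a measured value.
    """
    budget = vram_gb - reserve_gb
    fitting = {tier: size for tier, size in GGUF_DIT_GB.items() if size <= budget}
    return max(fitting, key=fitting.get) if fitting else None


if __name__ == "__main__":
    # Hypothetical reserve of 8 GB on a 24 GB card.
    print(largest_tier_that_fits(24.0, reserve_gb=8.0))
```

With an 8 GB reserve the helper lands on Q6_K, which matches the table's advice to favour Q5/Q6 over Q8_0 when activations are tight.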

Running

The LightX2V repo ships ready-to-run shell scripts under scripts/wan/. The relevant ones for this recipe are:

# INT8 path on Ampere — the recommended default for the 3090
bash scripts/wan/run_wan_t2v_distill_int8_4step_cfg.sh

# BF16 path (needs cpu_offload + text_encoder_offload on 24 GB)
bash scripts/wan/run_wan_t2v_distill_4step_cfg.sh

Under the hood, both invoke python -m lightx2v.infer with the matching config JSON from configs/distill/wan21/. Fill in lightx2v_path (the cloned repo root) and model_path (the directory you downloaded weights to in step 3) at the top of the script before running.

Output lands at ${lightx2v_path}/save_results/wan_t2v_distill_*_4step.mp4. The recommended sampler settings — LCM scheduler with shift=5.0 and guidance_scale=1.0 — are baked into the distill config JSONs and explicitly documented on the HF model card:

"We recommend using the LCM scheduler with the following settings: shift=5.0, guidance_scale=1.0 (i.e., without CFG)."

For the Python API directly, the INT8 path on a 3090 looks like:

from lightx2v import LightX2VPipeline

pipe = LightX2VPipeline(
    model_path="./weights/Wan2.1-T2V-14B-StepDistill/distill_int8",
    model_cls="wan2.1_distill",
    task="t2v",
)

# Optional on a 24 GB 3090 for the INT8 path — leaves more headroom for
# activations at 720p. Required if you switch to the BF16 path.
pipe.enable_offload(
    cpu_offload=False,                # set True if you go BF16 on 24 GB
    offload_granularity="block",
    text_encoder_offload=True,        # leaves the 6.73 GB UMT5 on CPU
    image_encoder_offload=False,
    vae_offload=False,
)

pipe.create_generator(
    attn_mode="sage_attn2",           # SageAttention 2, supported on Ampere
    infer_steps=4,                    # the whole point of the distilled checkpoint
    height=480, width=832,
    num_frames=81,
    guidance_scale=1.0,
    sample_shift=5.0,
)
pipe.generate(
    seed=42,
    prompt="A man with short gray hair plays a red electric guitar.",
    save_result_path="./output.mp4",
)

ℹ️ Don't copy the HF card's diffusers snippet as-is for T2V. The repo's auto-generated diffusers Quick Start example loads an input image and passes image=image — that's the Image-to-Video pipeline signature on a Text-to-Video repo (the snippet is templated from pipeline_tag and isn't variant-aware). For T2V, call the pipeline with prompt= only, or use the official LightX2V scripts above which already encode the right call shape.

Start at 480×832, 81 frames, 4 steps and only push to 720×1280 once you've confirmed peak VRAM via nvidia-smi -l 1.
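Watching nvidia-smi -l 1 by eye works, but a few lines of Python can record the peak for you. This is a sketch under the assumption that nvidia-smi is on PATH; the --query-gpu=memory.used and --format=csv,noheader,nounits options are standard nvidia-smi flags, while parse_memory_used and poll_peak_mib are hypothetical helper names.

```python
import subprocess
import time

# Prints one line per GPU with the used memory in MiB, no header, no units.
QUERY = ["nvidia-smi", "--query-gpu=memory.used", "--format=csv,noheader,nounits"]


def parse_memory_used(output: str) -> list:
    """Turn nvidia-smi CSV output into a list of MiB values, one per GPU."""
    return [int(line.strip()) for line in output.splitlines() if line.strip()]


def poll_peak_mib(duration_s: float, interval_s: float = 1.0, gpu_index: int = 0) -> int:
    """Sample used VRAM on one GPU for duration_s seconds and return the peak."""
    peak = 0
    deadline = time.monotonic() + duration_s
    while time.monotonic() < deadline:
        result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
        peak = max(peak, parse_memory_used(result.stdout)[gpu_index])
        time.sleep(interval_s)
    return peak
```

Run poll_peak_mib(...) from a second terminal while a generation is in flight, with a duration long enough to cover the whole run. One-second sampling can miss short spikes, so treat the result as a lower bound.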

Results

Speed: No RTX 3090–specific benchmark has been published for the 4-step distilled checkpoint at the time of writing. For directional reference only, the official LightX2V README "Cross-Framework Performance Comparison" reports 20.26 s/it for LightX2V single-GPU on an RTX 4090D (a 24 GB Ada sm_89 card with substantially more compute and FP8 tensor-core support that the 3090 lacks) on the base Wan2.1-I2V-14B at 480P / 40 steps. That number does NOT transfer to a 3090 — different arch, different quant path (no FP8 acceleration), and different inference variant (base vs. distilled, I2V vs. T2V, 40 steps vs. 4). Empirical RTX 3090 numbers will land at /check/lightx2v/rtx-3090 once a community benchmark is submitted via /contribute.
VRAM usage: The framework's Quickstart sets the floor at "at least 8GB VRAM" with offload + quant; the 3090's 24 GB envelope gives ample headroom on the INT8 path (~15 GB DiT + VAE resident with the 6.73 GB UMT5 offloaded). Per the framework's quantization matrix, INT8 modes are supported on "A100/A800, RTX 30/40 series, etc." — direct Ampere mention. Community testimony from HF discussion #9 is also informative: an OOM-on-A6000-48GB report was resolved by enabling SageAttention + scaled-fp8 + torch.compile + transformer block-swap — the same toolbox applies on the 3090 if you push to 720×1280 / longer clips. The on-disk envelope per the HF tree API is ~22.3 GB for the INT8 sub-directory; the BF16 path is 40.5 GB on disk (28.58 GB DiT alone exceeds 24 GB VRAM and requires offload). See /check/lightx2v/rtx-3090 for empirical numbers as they land.
Quality notes: The distilled checkpoint trades fine motion detail and prompt fidelity for the 4-step / no-CFG speed-up. Use the recommended LCM scheduler, shift=5.0, guidance_scale=1.0 (HF model card) and stay close to the model's training resolutions (480×832, 720×1280) for best results.
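To make the 4-step / no-CFG saving concrete, you can count DiT forward passes. The sketch assumes classifier-free guidance is implemented the usual way, as one conditional and one unconditional pass per step; that detail is an assumption here, not something taken from the LightX2V code, and the function name is illustrative.

```python
def dit_forward_passes(steps: int, uses_cfg: bool) -> int:
    """Number of DiT forward passes for one clip.

    Assumes CFG costs two passes per step (conditional + unconditional).
    """
    return steps * (2 if uses_cfg else 1)


if __name__ == "__main__":
    base = dit_forward_passes(40, uses_cfg=True)        # base model, 40 steps with CFG
    distilled = dit_forward_passes(4, uses_cfg=False)   # distilled, guidance_scale=1.0
    print(f"base: {base} passes, distilled: {distilled} passes, "
          f"ratio: {base // distilled}x")
```

This counts passes, not seconds: text encoding, VAE decode, and offload traffic are unchanged, so the wall-clock gain is smaller than the pass-count ratio.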

For the full benchmark data, see /check/lightx2v/rtx-3090.

Troubleshooting

"I loaded the FP8 weights and they're slower than I expected"

That's the expected behavior on Ampere: the 3090's sm_86 has no FP8 tensor cores, so the runtime dequantizes FP8 weights to BF16 on the fly at compute time. You keep the on-disk and VRAM savings but lose the speed-up that the RTX 4090 / H100 path gets from native FP8 tensor-core throughput. The fix is to use the INT8 sub-directory instead: INT8 is hardware-accelerated on Ampere per the framework's quantization matrix ("A100/A800, RTX 30/40 series, etc." for int8-vllm and int8-sgl). Re-download distill_int8/* and point model_path at it.

Out of memory loading the BF16 distill on 24 GB

The BF16 distill_models/distill_model.safetensors is 28.58 GB on disk per the HF tree API — larger than the 3090's 24 GB VRAM. Two paths forward:

Stay on INT8: the recommended Ampere path, ~22.3 GB on disk with the ~15 GB DiT fitting VRAM cleanly. This is the lowest-friction option.
Run BF16 with full offload: enable cpu_offload=True and text_encoder_offload=True in enable_offload(...), use offload_granularity="block", and plan for 32 GB+ system RAM to host the offloaded transformer blocks plus the 11.36 GB BF16 UMT5. Expect noticeably slower steps than the INT8 path because blocks stream over PCIe; no measured 3090 figure is available yet.
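The two options differ only in the arguments passed to enable_offload(...). A small sketch that builds them from the weight path you chose: the keyword names are the ones used in the Python API example in the Running section, while offload_kwargs and the "int8" / "bf16" labels are hypothetical.

```python
def offload_kwargs(weight_path: str) -> dict:
    """Arguments for pipe.enable_offload(...) on a 24 GB card.

    weight_path is "int8" or "bf16" (labels made up for this sketch).
    """
    if weight_path not in ("int8", "bf16"):
        raise ValueError(f"expected 'int8' or 'bf16', got {weight_path!r}")
    return {
        "cpu_offload": weight_path == "bf16",  # BF16 DiT (28.58 GB) cannot stay resident
        "offload_granularity": "block",
        "text_encoder_offload": True,          # keep UMT5 on CPU in both cases
        "image_encoder_offload": False,
        "vae_offload": False,
    }


if __name__ == "__main__":
    print(offload_kwargs("bf16"))
```

Usage would be pipe.enable_offload(**offload_kwargs("bf16")) before create_generator(...).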

Out of memory even on the INT8 path

Most failures trace to one of three causes:

Text encoder not offloaded. The INT8 UMT5 is 6.73 GB on its own; together with the ~15 GB DiT it leaves only ~2 GB for activations at 720p. Set text_encoder_offload=True in enable_offload(...).
SageAttention not active. The framework's quantization docs and the community testimony in HF discussion #9 both call out SageAttention 2 as the primary attention lever — without it, the attention activations on a 480×832 / 81-frame clip can OOM. Pass attn_mode="sage_attn2" to create_generator(...).
Resolution / frame count too aggressive. 720×1280 / 81 frames eats substantially more activation memory than 480×832 / 81 frames. Stay at 480×832 until you've measured peak VRAM during a successful run, then step up gradually.

If you're still hitting OOM after these three, HF discussion #9 recommends enabling torch.compile (cited as "+20% speed and -20% VRAM") and falling back to fp16_accumulation and VAE tiling as the final levers before dropping resolution.
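To put a rough number on the resolution cause above, you can compare how many latent tokens the DiT attends over at each setting. The strides below are assumptions about the Wan2.1 architecture that this recipe does not state: 16× spatial reduction (8× VAE times 2× patchify) and 4× temporal compression. The function name is illustrative.

```python
def latent_tokens(height: int, width: int, num_frames: int,
                  spatial_stride: int = 16, temporal_stride: int = 4) -> int:
    """Approximate DiT sequence length for one clip.

    Stride defaults are assumed values for Wan2.1, not taken from the recipe.
    """
    latent_frames = (num_frames - 1) // temporal_stride + 1
    return latent_frames * (height // spatial_stride) * (width // spatial_stride)


if __name__ == "__main__":
    low = latent_tokens(480, 832, 81)
    high = latent_tokens(720, 1280, 81)
    print(f"480x832: {low} tokens, 720x1280: {high} tokens, ratio {high / low:.2f}x")
```

Under these assumptions 480×832 gives 32,760 tokens and 720×1280 gives 75,600, roughly 2.3× more. Activation memory does not scale exactly with token count, so use the ratio as a guide and confirm with a measured run.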

SageAttention build on the 3090

SageAttention 2 is fully supported on Ampere — make sure CUDA_ARCHITECTURES in the build command includes 8.6 (the 3090's target). The HF discussion #9 author addresses an Ampere poster directly: "You have an Ampere card so the boon won't be quite as much, but Ampere IS supported so you definitely will want that." Expect a smaller speed-up than the near-2× number quoted on Ada cards.

Slow inference despite the 4-step distillation

The 4-step path only delivers the advertised speedup if the LCM scheduler is loaded and guidance_scale=1.0. If you're calling create_generator(...) directly, make sure you pass guidance_scale=1.0 and sample_shift=5.0 — and that infer_steps=4, not the base model's 40. The shipped shell scripts and config JSONs already encode the right defaults; the trap is bespoke Python scripts that copy partial parameters and silently fall back to the un-distilled inference path. Both settings are explicit on the HF model card.
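A bespoke script can guard against that trap by checking its own arguments before calling create_generator(...). A minimal sketch: the expected values are the ones from the HF model card quoted in this recipe, and check_distill_settings is a hypothetical helper, not part of LightX2V.

```python
# Values recommended for the distilled checkpoint in this recipe.
DISTILL_EXPECTED = {"infer_steps": 4, "guidance_scale": 1.0, "sample_shift": 5.0}


def check_distill_settings(kwargs: dict) -> list:
    """Return a list of human-readable problems; empty means the settings match."""
    problems = []
    for key, expected in DISTILL_EXPECTED.items():
        if key not in kwargs:
            problems.append(f"{key} is missing (expected {expected})")
        elif kwargs[key] != expected:
            problems.append(f"{key}={kwargs[key]!r} (expected {expected})")
    return problems


if __name__ == "__main__":
    generator_kwargs = {"infer_steps": 40, "guidance_scale": 1.0}
    for problem in check_distill_settings(generator_kwargs):
        print("warning:", problem)
```

Run the check on the same dictionary you then unpack into create_generator(**generator_kwargs), so the values you validated are the values you use.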

Resolution / frame-count crashes

The Wan2.1 base requires resolutions divisible by 16 and a frame count that follows the model's temporal grouping (4k + 1 frames, which is why the examples use 81). Stick to the example configs (480×832 / 81 frames; 720×1280 / 81 frames) until you've measured a comfortable VRAM margin via nvidia-smi -l 1 during a generation run.
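A pre-flight check catches bad shapes before any weights are loaded. The divisible-by-16 rule comes from the paragraph above; treating the frame grouping as 4k + 1 is an assumption based on Wan2.1's 4× temporal compression (81 = 4 × 20 + 1), and validate_shape is a hypothetical helper.

```python
def validate_shape(height: int, width: int, num_frames: int) -> list:
    """Return a list of shape problems; empty means the shape looks valid."""
    errors = []
    if height % 16 != 0:
        errors.append(f"height {height} is not divisible by 16")
    if width % 16 != 0:
        errors.append(f"width {width} is not divisible by 16")
    # Assumed grouping: one leading frame plus groups of four.
    if num_frames < 1 or (num_frames - 1) % 4 != 0:
        errors.append(f"num_frames {num_frames} is not of the form 4k + 1")
    return errors


if __name__ == "__main__":
    for shape in [(480, 832, 81), (500, 832, 80)]:
        print(shape, validate_shape(*shape) or "ok")
```

Both example configs from this recipe pass the check; an arbitrary shape such as 500×832 with 80 frames fails on height and frame count.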

Report new issues via the submission form. Community RTX 3090 benchmarks would directly improve the /check/lightx2v/rtx-3090 data.