LightX2V on RTX 3080 Ti: 4-Step Text-to-Video with Distilled Wan2.1-T2V-14B via INT8 + Offload

video · intermediate · 12GB+ VRAM · Jun 15, 2026

This intermediate recipe sets up LightX2V on the RTX 3080 Ti, needing about 12 GB of VRAM.

prerequisites
  • NVIDIA RTX 3080 Ti (12 GB VRAM, Ampere GA102-225 sm_86) or any Ampere CUDA GPU with ≥8 GB VRAM
  • 32 GB+ system RAM (the offload path streams the DiT blocks + holds the UMT5 encoder in CPU RAM)
  • Python 3.10+ and PyTorch 2.6+ with CUDA 12.4 (the default Ampere wheel; cu128 is only needed for Blackwell)
  • ~25 GB free disk space for the INT8 distilled weights (~65 GB if you also pull the ~40.5 GB BF16 sub-directory)

What You'll Build

Generate short text-to-video clips locally using LightX2V — an inference framework that ships 4-step, CFG-free distilled checkpoints of Wan2.1-T2V-14B — on a 12 GB RTX 3080 Ti. Per the HF model card, the distilled checkpoint generates videos "with significantly fewer inference steps (4 steps) and without classifier-free guidance", and the same card calls out: "New fp8 and int8 quantized distillation models have been added, which enable fast inference using lightx2v on RTX 4060" — an 8 GB card that sits below the 12 GB RTX 3080 Ti. On Ampere the path that fits this card is INT8 (explicitly in the framework's quantization matrix for "RTX 30/40 series") plus mandatory CPU/block offload — not FP8 (see the architecture note below).

Hardware data: RTX 3080 Ti (12 GB VRAM, Ampere GA102-225 sm_86) · 4-step distilled Wan2.1-T2V-14B · See benchmark data

ℹ️ This is distilled Wan 2.1, not Wan 2.2. The lightx2v org publishes a wider family — Wan2.1-T2V-14B distilled, Wan2.2-A14B distilled (timestep-MoE), HunyuanVideo-1.5 distilled, Qwen-Image distilled — each with different VRAM and inference characteristics. This recipe is specifically the Wan2.1-T2V-14B-StepDistill-CfgDistill-Lightx2v repo (4-step CFG-free distillation of the dense Wan2.1-T2V-14B base). The same install steps don't transfer cleanly to the A14B timestep-MoE — see the LightX2V Latest News for the Wan 2.2 / HunyuanVideo distilled releases.

⚠️ FP8 is the wrong path on Ampere. The RTX 3080 Ti is Ampere sm_86 and has no FP8 tensor cores — FP8 only ships hardware acceleration on Ada sm_89 (RTX 40-series), Hopper sm_90 (H100), and Blackwell sm_120 (RTX 50-series). The official LightX2V quantization docs confirm this scope directly: FP8 modes (fp8-vllm, fp8-sgl) are listed as supported on "H100/H200/H800, RTX 40 series, etc." — RTX 30 is absent. INT8 modes (int8-vllm, int8-sgl) are explicitly supported on "A100/A800, RTX 30/40 series, etc." You can load FP8 weights on a 3080 Ti (they're valid .pth / .safetensors tensors) and you keep the on-disk size savings, but the runtime dequantizes them to BF16 at compute time — a memory escape hatch, not a speed win. The recipe below routes through INT8 + offload, or a community GGUF, instead.
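
The architecture rule above can be sketched as a small lookup. This is a minimal illustration, not part of LightX2V: the function name and the returned labels are hypothetical, and the thresholds simply encode the sm_86 / sm_89 / sm_90 / sm_120 split described in the note. On a live machine, `torch.cuda.get_device_capability()` returns the `(major, minor)` tuple this function expects.

```python
def pick_quant_family(capability):
    """Map a CUDA compute capability (major, minor) to a quant family.

    Labels are illustrative only; they are not LightX2V config values.
    """
    if capability >= (8, 9):
        # Ada sm_89, Hopper sm_90, Blackwell sm_120: native FP8 tensor cores.
        return "fp8"
    if capability >= (8, 0):
        # Ampere sm_80 / sm_86: INT8 tensor cores, no FP8 hardware.
        return "int8"
    # Older architectures are outside what this recipe covers.
    return "outside-recipe-scope"


if __name__ == "__main__":
    for name, cap in [("RTX 3080 Ti", (8, 6)), ("RTX 4090", (8, 9)), ("H100", (9, 0))]:
        print(name, cap, pick_quant_family(cap))
```

Tuples compare element by element, so `(12, 0) >= (8, 9)` holds and Blackwell lands in the FP8 branch without a special case.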

⚠️ This is Text-to-Video: there is no image input. The repo's pipeline_tag is set to image-to-video, so HuggingFace auto-generates a diffusers Quick Start snippet that loads an input image and passes image=image. That is the wrong call shape for this repo. The repo name is Wan2.1-T2V-14B, the README's run command is bash scripts/wan/distill/run_wan_t2v_distill_4step_cfg.sh, and the shipped script passes a text prompt only (--prompt "...", no image). Drive it with prompt= only, never image=.

Requirements

Component | Minimum | Tested
GPU | "at least 8GB VRAM" per the LightX2V Quickstart | RTX 3080 Ti (Ampere GA102-225 sm_86, 12 GB GDDR6X, 384-bit / 912 GB/s, per TechPowerUp)
RAM | "16GB or more recommended" per the Quickstart | 32 GB recommended because the offload CPU tier holds the 6.733 GB UMT5 encoder + streamed DiT blocks
Storage | "At least 50GB available space" per the Quickstart | ~22 GB for the INT8 sub-directory alone
Software | Python 3.10+, PyTorch 2.6+ with CUDA 12.4 (default Ampere wheel; cu128 is Blackwell-only) | per Quickstart

The lightx2v/Wan2.1-T2V-14B-StepDistill-CfgDistill-Lightx2v HF repo ships three sub-directories (sizes verified live via the HF tree API):

  • distill_int8/ — per-block INT8 quant, ~22.3 GB total (40 transformer blocks at 0.352 GB each = 14.08 GB DiT + non_block.safetensors 0.931 GB + INT8 UMT5-XXL encoder models_t5_umt5-xxl-enc-int8.pth 6.733 GB + Wan2.1 VAE 0.508 GB). Recommended primary path on the 3080 Ti — INT8 has hardware-accelerated tensor-core support on Ampere sm_86. The ~15 GB DiT is larger than the 12 GB card, so it must be streamed via block offload (mandatory here — see Running) rather than held fully resident.
  • distill_fp8/ — per-block FP8 quant, ~22.3 GB total with the same layout (FP8 UMT5 models_t5_umt5-xxl-enc-fp8.pth 6.733 GB). Weights load fine on a 3080 Ti but compute dequantizes to BF16 on Ampere — VRAM savings, no speed win. Use INT8 instead.
  • distill_models/ — BF16 dense, ~40.5 GB total (28.577 GB DiT + 11.362 GB BF16 UMT5 + 0.508 GB VAE). The 28.577 GB DiT alone is more than double the 3080 Ti's 12 GB VRAM, so the BF16 path is not viable on this card even with offload-heavy gymnastics — stay on INT8 or GGUF.
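
The size arithmetic in the list above can be re-checked with a few lines of Python. This is a sketch that only re-adds the per-file figures quoted in the list; the dictionary keys are made-up labels, not file names from the repo.

```python
# Component sizes in GB, copied from the sub-directory listing above.
INT8_PARTS = {
    "dit_blocks": 40 * 0.352,  # 40 per-block transformer files
    "non_block": 0.931,
    "umt5_int8": 6.733,
    "vae": 0.508,
}
BF16_PARTS = {"dit": 28.577, "umt5_bf16": 11.362, "vae": 0.508}
VRAM_GB = 12.0  # RTX 3080 Ti


def total_gb(parts):
    """Sum every component of a sub-directory."""
    return sum(parts.values())


def int8_dit_gb():
    """INT8 DiT alone: the per-block files plus the non-block weights."""
    return INT8_PARTS["dit_blocks"] + INT8_PARTS["non_block"]


if __name__ == "__main__":
    print(f"INT8 sub-directory: {total_gb(INT8_PARTS):.2f} GB")
    print(f"INT8 DiT alone:     {int8_dit_gb():.2f} GB (card has {VRAM_GB:.0f} GB)")
    print(f"BF16 sub-directory: {total_gb(BF16_PARTS):.2f} GB")
```

Since the INT8 DiT alone is larger than the card's VRAM, the weights cannot all be resident at once, which is why block offload is treated as mandatory rather than optional.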

Installation

The canonical install is Docker (simplest) or conda from source, both documented in the LightX2V Quickstart. The 3080 Ti (sm_86) belongs to the same Ampere generation as the A100 (sm_80); both are covered by the framework's INT8 support list ("A100/A800, RTX 30/40 series, etc.").

1. Install the framework (conda path)

# Clone and create the environment
git clone https://github.com/ModelTC/LightX2V.git
cd LightX2V
conda create -n lightx2v python=3.10 -y
conda activate lightx2v
pip install -v -e .

Verbatim from the LightX2V Quickstart. On Ampere cards the default pip install torch already includes sm_86 kernels — no special CUDA index-url toggling required. (The cu128 channel is only needed for Blackwell sm_120 cards; ignore any cu128 instruction you see written for the RTX 50-series.)

2. (Recommended) build SageAttention 2 — the biggest attention-kernel lever on Ampere

git clone https://github.com/thu-ml/SageAttention.git
cd SageAttention && CUDA_ARCHITECTURES="8.0,8.6,8.9,9.0" \
  EXT_PARALLEL=4 NVCC_APPEND_FLAGS="--threads 8" MAX_JOBS=32 \
  pip install -v -e .

CUDA_ARCHITECTURES="8.0,8.6,8.9,9.0" covers Ampere through Hopper; the 8.6 entry is the RTX 3080 Ti's Ampere target. The blissful-tuner author cited in HF discussion #9 addressed an Ampere-card poster directly: "You have an Ampere card so the boon won't be quite as much, but Ampere IS supported so you definitely will want that." — that applies to the 3080 Ti's sm_86 exactly. SageAttention 2 is still the single biggest attention-kernel lever on this card; just don't expect the near-2× Ada speed-up. Pass attn_mode="sage_attn2" when you create the generator (see Running).
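
Before passing attn_mode="sage_attn2", it can help to confirm that the build actually produced an importable package. The check below is a minimal sketch using only the standard library; it assumes the SageAttention build installs a top-level module named `sageattention`, and the helper names are hypothetical.

```python
import importlib.util


def is_importable(module_name):
    """Return True if a top-level module can be found without importing it."""
    return importlib.util.find_spec(module_name) is not None


def sage_attention_ready():
    # Assumption: the SageAttention build installs a module named `sageattention`.
    return is_importable("sageattention")


if __name__ == "__main__":
    if sage_attention_ready():
        print("sageattention found: attn_mode='sage_attn2' should be usable")
    else:
        print("sageattention not found: rebuild it before selecting sage_attn2")
```

This only proves the package is installed in the active environment; it does not verify that the compiled kernels include the 8.6 target.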

3. Pull the 4-step distilled T2V-14B checkpoint (INT8 — recommended primary path)

INT8 has hardware-accelerated tensor-core support on Ampere; this is the recommended path for the 3080 Ti:

huggingface-cli download \
  lightx2v/Wan2.1-T2V-14B-StepDistill-CfgDistill-Lightx2v \
  --include "distill_int8/*" \
  --local-dir ./weights/Wan2.1-T2V-14B-StepDistill

There is no separate shipped INT8 shell script; the INT8 weights are pulled the same way as any other sub-directory and then pointed at the INT8 config (see Running). The FP8 sub-directory is downloaded identically (--include "distill_fp8/*") but, per the architecture note above, it buys you only VRAM on Ampere, not speed — prefer INT8.
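
The same download can be scripted from Python. The sketch below assumes the `huggingface_hub` package is installed and uses its `snapshot_download` function with `allow_patterns`, mirroring the `--include` filter of the CLI command above; the helper names are hypothetical.

```python
REPO_ID = "lightx2v/Wan2.1-T2V-14B-StepDistill-CfgDistill-Lightx2v"
LOCAL_DIR = "./weights/Wan2.1-T2V-14B-StepDistill"


def download_kwargs(subdir="distill_int8"):
    """Build the arguments for one sub-directory of the repo."""
    return {
        "repo_id": REPO_ID,
        "allow_patterns": [f"{subdir}/*"],  # same filter as --include
        "local_dir": LOCAL_DIR,
    }


def download(subdir="distill_int8"):
    # Imported lazily so the helper above works without the package installed.
    from huggingface_hub import snapshot_download

    return snapshot_download(**download_kwargs(subdir))


if __name__ == "__main__":
    print(download())  # pulls roughly 22 GB for distill_int8
```

Passing `subdir="distill_fp8"` would fetch the FP8 weights instead, with the Ampere caveat already described.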

Alternative: Docker (simplest)

docker pull lightx2v/lightx2v:25101501-cu124

The 25101501-cu124 tag is the right one for the 3080 Ti — Ampere sm_86 kernels are included in the CUDA 12.4 image, and the default pip install torch Ampere path matches it. The Quickstart also documents a 26011201-cu128 tag; that cuda128 image targets Blackwell sm_120 and is unnecessary on Ampere — stay on cu124.

Alternative path: ComfyUI + GGUF (community-quantized)

If you'd rather work in ComfyUI than the LightX2V Python framework, QuantStack ships GGUF conversions that explicitly identify as "a GGUF conversion of an addon of lightx2v/Wan2.1-T2V-14B-StepDistill-CfgDistill", produced with city96's standard conversion scripts. Note that QuantStack's release pre-merges the Wan2.1-VACE-14B control-conditioning addon onto the distilled base: useful if you want pose/depth-driven control, otherwise the inference shape is identical for plain text-to-video. The DiT quant ladder per the QuantStack tree (DiT only; the UMT5 text encoder + VAE load separately):

Tier | DiT size | Fits 12 GB w/ UMT5 + VAE
Q2_K | 6.36 GB | yes, but offload the UMT5 encoder to leave room
Q3_K_S / Q3_K_M / Q3_K_L | 7.84 / 8.64 / 9.37 GB | yes, with the encoder offloaded to CPU/GGUF
Q4_0 / Q4_K_S | 10.33 / 10.55 GB | tight; encoder must be offloaded
Q4_1 / Q4_K_M | 11.18 / 11.64 GB | very tight; favour Q3/Q4_K_S on 12 GB
Q5_K_S and larger | 12.25 GB+ | no; DiT alone leaves no room on a 12 GB card
Q8_0 | 18.66 GB | no
F16 | 34.69 GB | no

On a 12 GB 3080 Ti stay on the Q3_K_M / Q4_K_S tier and load the UMT5 encoder as the city96 GGUF UMT5 mirror (or the Comfy-Org safetensors encoder) so it stays off the GPU. Loading needs the ComfyUI-GGUF custom node plus that UMT5 encoder and the Wan2.1 VAE (Kijai mirror). GGUF inference is arch-independent (no FP8 hardware needed); the 3080 Ti's PCIe Gen4 link keeps any CPU-offloaded encoder traffic reasonable.
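
The tier choice can be expressed as a simple budget filter. This is an illustrative sketch: the DiT sizes are the ones in the table above, but the 1 GB reserve for activations and runtime overhead is a hypothetical figure chosen for the example, not a measured value.

```python
# DiT sizes in GB from the QuantStack table above (encoder and VAE excluded).
DIT_GB = {
    "Q2_K": 6.36, "Q3_K_S": 7.84, "Q3_K_M": 8.64, "Q3_K_L": 9.37,
    "Q4_0": 10.33, "Q4_K_S": 10.55, "Q4_1": 11.18, "Q4_K_M": 11.64,
    "Q5_K_S": 12.25, "Q8_0": 18.66, "F16": 34.69,
}


def tiers_that_fit(vram_gb, reserve_gb):
    """List tiers whose DiT fits in VRAM after setting aside a reserve.

    reserve_gb is an assumed allowance for activations and overhead.
    """
    budget = vram_gb - reserve_gb
    return [tier for tier, size in DIT_GB.items() if size <= budget]


if __name__ == "__main__":
    print(tiers_that_fit(vram_gb=12.0, reserve_gb=1.0))
```

With a 1 GB reserve the filter stops at Q4_K_S, which lines up with the table's advice to favour the Q3 and Q4_K_S tiers; the real headroom needed depends on resolution and frame count.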

Running

The LightX2V repo ships ready-to-run shell scripts under scripts/wan/distill/. The 4-step CFG distill entry points there are run_wan_t2v_distill_4step_cfg.sh (standard / BF16 distill) and run_wan_t2v_distill_fp8_4step_cfg.sh (FP8 distill). Each invokes python -m lightx2v.infer with a matching config JSON from configs/distill/wan21/; you fill in lightx2v_path (the cloned repo root) and model_path (the weights directory from step 3) at the top of the script, and the shipped script passes a text prompt only (--prompt "...") plus --save_result_path — no image input, which is the correct shape for this text-to-video repo.

There is no INT8 turnkey shell script for Wan2.1-T2V distill — the shipped distill scripts/configs are BF16 and FP8 only, and (as noted in the install step) FP8 buys VRAM but no speed on Ampere sm_86. The INT8 path that actually fits the 3080 Ti is driven through the LightX2V Python API below: point model_path at the distill_int8/ weights you downloaded and turn on block offload.

The one change that matters for 12 GB. By default the pipeline keeps the full ~15 GB INT8 DiT resident, which fits a 24 GB card but OOMs a 12 GB 3080 Ti. Per the Parameter Offload guide, the official progressive strategy for memory-constrained devices is: first enable cpu_offload (keep T5/CLIP/VAE on GPU), then "If memory is still insufficient, gradually enable CPU offload for T5, CLIP, VAE", then consider quantization + offload or lazy_load. On the 3080 Ti, turn on block-granularity CPU offload for the DiT and offload the UMT5 text encoder so blocks stream from CPU instead of all being resident:

{
  "cpu_offload": true,
  "offload_granularity": "block",
  "t5_cpu_offload": true,
  "vae_cpu_offload": false
}
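
One way to apply those four keys without hand-editing is to merge them into a copy of an existing config. The sketch below uses only the standard library; the key names come from the JSON fragment above, while the function names and file paths are hypothetical placeholders.

```python
import json

# Offload keys exactly as listed in the JSON fragment above.
OFFLOAD_KEYS = {
    "cpu_offload": True,
    "offload_granularity": "block",
    "t5_cpu_offload": True,
    "vae_cpu_offload": False,
}


def with_offload(config):
    """Return a copy of config with the 12 GB offload keys applied."""
    merged = dict(config)
    merged.update(OFFLOAD_KEYS)
    return merged


def patch_config_file(src_path, dst_path):
    """Read a config JSON, apply the offload keys, write it to a new file."""
    with open(src_path) as f:
        config = json.load(f)
    with open(dst_path, "w") as f:
        json.dump(with_offload(config), f, indent=2)


if __name__ == "__main__":
    print(json.dumps(with_offload({"infer_steps": 4}), indent=2))
```

Writing to a separate destination file keeps the shipped config untouched, so the defaults remain available for comparison.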

For the Python API directly, the INT8 text-to-video path on a 3080 Ti looks like this. The enable_offload(...) / create_generator(...) / generate(...) parameter names are reproduced from the LightX2V README (the README's worked example is I2V; for this T2V repo, drop image_path= and pass prompt= only):

from lightx2v import LightX2VPipeline

pipe = LightX2VPipeline(
    model_path="./weights/Wan2.1-T2V-14B-StepDistill/distill_int8",
    model_cls="wan2.1_distill",
    task="t2v",
)

# REQUIRED on the 12 GB 3080 Ti: stream the INT8 DiT in block granularity and keep the
# 6.733 GB UMT5 text encoder on CPU so the model fits the 12 GB envelope.
pipe.enable_offload(
    cpu_offload=True,                 # stream the INT8 DiT — do NOT keep it all resident on 12 GB
    offload_granularity="block",
    text_encoder_offload=True,        # leaves the 6.733 GB UMT5 on CPU
    image_encoder_offload=False,
    vae_offload=False,
)

pipe.create_generator(
    attn_mode="sage_attn2",           # SageAttention 2, supported on Ampere sm_86
    infer_steps=4,                    # the whole point of the distilled checkpoint
    height=480, width=832,
    num_frames=81,
    guidance_scale=1.0,               # CFG disabled — the distilled checkpoint runs CFG-free
    sample_shift=5.0,
)
pipe.generate(
    seed=42,
    prompt="A man with short gray hair plays a red electric guitar.",
    save_result_path="./output.mp4",
)

The recommended sampler settings are the LCM scheduler with shift=5.0 and guidance_scale=1.0 (no CFG). The HF model card states verbatim: "We recommend using the LCM scheduler with the following settings:" followed by shift=5.0 and guidance_scale=1.0 (i.e., without CFG). These are baked into the distill config JSONs. Output lands at the path you pass to save_result_path. Start at 480×832, 81 frames, 4 steps and only push toward 720×1280 once you've confirmed peak VRAM via nvidia-smi -l 1.
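
To turn the nvidia-smi check into a single peak figure, the memory readings can be collected and reduced in Python. This is a minimal sketch: it assumes nvidia-smi is on the PATH and relies on its `--query-gpu=memory.used --format=csv,noheader,nounits` output (one integer MiB value per GPU per call); the function names are hypothetical.

```python
import subprocess


def parse_used_mib(csv_text):
    """Parse memory.used values (one integer MiB per line) from nvidia-smi."""
    return [int(line.strip()) for line in csv_text.splitlines() if line.strip()]


def sample_used_mib():
    """Take one reading per GPU; requires nvidia-smi to be installed."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.used", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return parse_used_mib(result.stdout)


def peak_gib(samples_mib):
    """Largest reading in a list of MiB samples, converted to GiB."""
    return max(samples_mib) / 1024


if __name__ == "__main__":
    import time

    readings = []
    for _ in range(10):  # poll once per second while a clip is generating
        readings.extend(sample_used_mib())
        time.sleep(1)
    print(f"peak: {peak_gib(readings):.2f} GiB")
```

Run it in a second terminal while a clip is generating; polling once per second can miss short spikes, so treat the result as a lower bound on peak usage.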

Results

  • Speed: No RTX 3080 Ti–specific benchmark has been published for the Wan2.1-T2V-14B 4-step distilled variant, and /check/lightx2v/rtx-3080-ti currently returns verdict: unknown with no benchmark rows. The framework's own README performance table reports consumer-GPU rows only for Ada sm_89 hardware (e.g. an RTX 4090D measurement). That card has FP8 tensor-core acceleration the 3080 Ti lacks and a larger compute configuration than the 3080 Ti's 10240 CUDA cores and 912 GB/s of memory bandwidth, and the measurement uses a different inference variant, so those numbers do not transfer to a single 12 GB 3080 Ti running this INT8-with-offload 4-step path. Rather than quote a misleading number, this recipe omits wall-clock speed. What is a model fact: the distilled checkpoint runs in 4 steps instead of the base model's 40, a 10× cut in per-clip sampling steps. Empirical RTX 3080 Ti numbers will land at /check/lightx2v/rtx-3080-ti once a community benchmark is submitted via /contribute.
  • VRAM usage: The framework's Quickstart sets the floor at "at least 8GB VRAM" with offload + quant, and the HF model card confirms the fp8 / int8 distillation weights "enable fast inference using lightx2v on RTX 4060." (an 8 GB card). The 12 GB RTX 3080 Ti sits above that 8 GB floor, but the ~15 GB INT8 DiT (14.08 GB per-block weights + 0.931 GB non_block, per the HF tree API) still exceeds 12 GB, so cpu_offload: true + offload_granularity: "block" is mandatory here (unlike a 24 GB card, where INT8 fits resident) — the DiT streams from CPU, and t5_cpu_offload: true keeps the 6.733 GB UMT5 encoder off the GPU. The on-disk envelope is ~22.3 GB for the INT8 sub-directory; the BF16 sub-directory is ~40.5 GB (its 28.577 GB DiT alone is more than double 12 GB and not viable). As corroboration that the FP8/INT8 + SageAttention + offload toolbox runs on this model, a community user reports in HF discussion #9: "I run it in 16GB VRAM on my 4070 Ti Super myself using my own […] features like SageAttention […] fp8 scaled quantization, torch.compile optimization, transformer block swap, and more." — that is a 16 GB Ada sibling, not this 12 GB Ampere card, so it does not transfer as a VRAM figure, but it confirms the same SageAttention + offload combination this recipe walks through. See /check/lightx2v/rtx-3080-ti for empirical numbers as they land.
  • Quality notes: The distilled checkpoint trades fine motion detail and prompt fidelity for the 4-step / no-CFG speed-up. Use the recommended LCM scheduler, shift=5.0, guidance_scale=1.0 (HF model card) and stay close to the model's training resolutions (480×832, 720×1280) for best results.
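
The step-count claim in the Speed bullet can be made concrete by counting DiT forward passes. This is a back-of-the-envelope sketch under one stated assumption: a run with classifier-free guidance evaluates the model twice per step (conditional and unconditional), while the CFG-free distilled checkpoint evaluates it once. It counts passes only and says nothing about wall-clock time, which also depends on offload traffic.

```python
def dit_forward_passes(steps, uses_cfg):
    """Count DiT evaluations for one clip.

    Assumption: CFG costs two evaluations per sampling step.
    """
    return steps * (2 if uses_cfg else 1)


if __name__ == "__main__":
    base = dit_forward_passes(steps=40, uses_cfg=True)
    distilled = dit_forward_passes(steps=4, uses_cfg=False)
    print(f"base: {base} passes, distilled: {distilled} passes")
    print(f"step ratio: {40 // 4}x, pass ratio: {base // distilled}x")
```

If the base model is run without CFG, the pass ratio falls back to the same 10× as the step ratio.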

For the full benchmark data, see /check/lightx2v/rtx-3080-ti.

Troubleshooting

"I loaded the FP8 weights and they're slower than I expected"

That's the expected behavior on Ampere — the 3080 Ti's sm_86 has no FP8 tensor cores, so PyTorch dequantizes FP8 weights to BF16 on the fly at compute time. You keep the on-disk size savings but lose the speed-up that the RTX 4090 / H100 path gets from native FP8 tensor-core throughput. The fix is to use the INT8 sub-directory instead — INT8 is hardware-accelerated on Ampere per the framework's quantization matrix ("A100/A800, RTX 30/40 series, etc." for int8-vllm and int8-sgl; FP8 modes list only "H100/H200/H800, RTX 40 series, etc."). Re-download distill_int8/* and point model_path at it.

Out of memory loading the BF16 distill

The BF16 distill_models/distill_model.safetensors is 28.577 GB on disk (HF tree API) — larger than the 3080 Ti's 12 GB VRAM by more than 2×. CPU offload alone cannot recover this on a 12 GB card; stick to the INT8 sub-directory or a small GGUF tier.

Out of memory on the INT8 path

The 12 GB envelope is too small to keep the full ~15 GB INT8 DiT resident, so offload is mandatory, not optional. In order of effectiveness:

  1. Turn on block-granularity CPU offload. Set cpu_offload: true and offload_granularity: "block" in the config (or enable_offload(cpu_offload=True, offload_granularity="block") in the Python API). This streams the DiT blocks from CPU instead of holding all ~15 GB on the GPU.
  2. Offload the UMT5 text encoder. Set t5_cpu_offload: true (config) / text_encoder_offload=True (Python). The INT8 UMT5 is 6.733 GB on its own (HF tree API) — there is no room for it on the GPU alongside the streamed DiT. The 3080 Ti's PCIe Gen4 link keeps this CPU↔GPU streaming reasonable, but it is slower than an all-GPU run.
  3. Make sure SageAttention 2 is actually loaded. Pass attn_mode="sage_attn2". Without it, attention activations on a 480×832 / 81-frame clip eat substantially more VRAM (and run slower). The quantization tutorial and HF discussion #9 both call SageAttention a primary lever for fitting Wan into a tight envelope.
  4. Drop frame count before resolution. As a final fallback, go from 480×832 / 81 frames down to 480×832 / 49 frames; only push to 720×1280 if peak VRAM stays comfortably below 12 GB. If you are still hitting OOM, HF discussion #9 notes that enabling torch.compile nets "another +20% speed and -20% VRAM".
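
To see why frame count and resolution affect memory so differently, it helps to estimate how many tokens the DiT attends over. The sketch below rests on assumptions that are not stated in this recipe: a video VAE that compresses 4× in time and 8× in space, followed by 2×2 spatial patches. Treat the strides as illustrative defaults and the result as a rough scaling guide, not a VRAM measurement.

```python
def dit_tokens(num_frames, height, width, t_stride=4, s_stride=8, patch=2):
    """Estimate the attention sequence length for one clip.

    Assumed (not from this recipe): 4x temporal and 8x spatial VAE
    compression, then 2x2 spatial patchification.
    """
    latent_frames = (num_frames - 1) // t_stride + 1
    tokens_per_frame = (height // (s_stride * patch)) * (width // (s_stride * patch))
    return latent_frames * tokens_per_frame


if __name__ == "__main__":
    for frames, h, w in [(81, 480, 832), (49, 480, 832), (81, 720, 1280)]:
        print(f"{h}x{w}, {frames} frames: {dit_tokens(frames, h, w)} tokens")
```

Under these assumptions, going from 81 to 49 frames cuts the sequence by a bit more than a third, while moving to 720×1280 more than doubles it, which is why the list recommends holding resolution at 480×832 until peak VRAM is confirmed.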

Wrong output / errors from the HF card's diffusers snippet

The repo's pipeline_tag is image-to-video, so the auto-generated diffusers Quick Start loads an input image and passes image=image. That is the Image-to-Video signature on what is a Text-to-Video repo. For T2V, call the pipeline with prompt= only (as in Running above), or run the official LightX2V shell script scripts/wan/distill/run_wan_t2v_distill_4step_cfg.sh, which already encodes the prompt-only call shape.

Slow inference despite the 4-step distillation

The 4-step path only delivers the advertised speed-up if CFG is actually disabled and the distilled settings are loaded. Per the HF model card, use the LCM scheduler with shift=5.0 and guidance_scale=1.0 (i.e., without CFG), and infer_steps=4 — not the base model's 40. The trap is bespoke scripts that copy partial parameters and silently fall back to the un-distilled inference path; the shipped shell scripts and config JSONs already encode the right defaults.
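
A bespoke script can guard against that silent fallback by checking its settings before generating. The sketch below compares a settings dictionary against the three values named above; the key names follow the create_generator parameters used in Running, and the function itself is a hypothetical helper, not part of LightX2V.

```python
# Values recommended for the distilled checkpoint, as stated above.
EXPECTED = {"infer_steps": 4, "guidance_scale": 1.0, "sample_shift": 5.0}


def distill_setting_problems(settings):
    """Return a human-readable line for every setting that is off."""
    problems = []
    for key, want in EXPECTED.items():
        got = settings.get(key)
        if got != want:
            problems.append(f"{key}: expected {want!r}, got {got!r}")
    return problems


if __name__ == "__main__":
    copied_from_base = {"infer_steps": 40, "guidance_scale": 5.0, "sample_shift": 5.0}
    for line in distill_setting_problems(copied_from_base):
        print(line)
```

An empty list means the three distilled settings are in place; a missing key is reported with a value of None, which catches partially copied parameter sets.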

SageAttention build on the 3080 Ti

SageAttention 2 is fully supported on Ampere — make sure CUDA_ARCHITECTURES in the build command includes 8.6 (the 3080 Ti's target). The HF discussion #9 author addresses an Ampere poster directly: "You have an Ampere card so the boon won't be quite as much, but Ampere IS supported so you definitely will want that." Expect a smaller speed-up than the near-2× number quoted on Ada cards.

Report new issues via the submission form; community RTX 3080 Ti benchmarks would directly improve the /check/lightx2v/rtx-3080-ti data.

common questions
How much VRAM does LightX2V need?

The LightX2V Quickstart sets the floor at "at least 8GB VRAM" with offload and quantization; this recipe targets the 12 GB RTX 3080 Ti.

Which GPUs is LightX2V tested on?

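This recipe targets the RTX 3080 Ti (12 GB), though no 3080 Ti benchmark has been published yet. The HF model card names the RTX 4060, and the framework's README performance table includes an RTX 4090D measurement.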

How hard is this setup?

Intermediate — follow the steps above.