LightX2V on RTX 3060: 4-Step Text-to-Video with Distilled Wan2.1-T2V-14B via INT8 + Offload

What You'll Build

Generate short text-to-video clips locally using LightX2V — an inference framework that ships 4-step, CFG-free distilled checkpoints of Wan2.1-T2V-14B — on a 12 GB RTX 3060. Per the HF model card, the distilled checkpoint generates videos "with significantly fewer inference steps (4 steps) and without classifier-free guidance", and the same card calls out: "New fp8 and int8 quantized distillation models have been added, which enable fast inference using lightx2v on RTX 4060" — an 8 GB card that sits below the 12 GB RTX 3060. On Ampere the path that fits this card is INT8 (explicitly in the framework's quantization matrix for "RTX 30/40 series") plus mandatory CPU/block offload — not FP8 (see the architecture note below).

Hardware data: RTX 3060 (12 GB VRAM, Ampere GA106 sm_86) · 4-step distilled Wan2.1-T2V-14B · See benchmark data

ℹ️ This is distilled Wan 2.1, not Wan 2.2. The lightx2v org publishes a wider family — Wan2.1-T2V-14B distilled, Wan2.2-A14B distilled (timestep-MoE), HunyuanVideo-1.5 distilled, Qwen-Image distilled — each with different VRAM and inference characteristics. This recipe is specifically the Wan2.1-T2V-14B-StepDistill-CfgDistill-Lightx2v repo (4-step CFG-free distillation of the dense Wan2.1-T2V-14B base). The same install steps don't transfer cleanly to the A14B timestep-MoE — see the LightX2V Latest News for the Wan 2.2 / HunyuanVideo distilled releases.

⚠️ FP8 is the wrong path on Ampere. The RTX 3060 is Ampere sm_86 and has no FP8 tensor cores — FP8 only ships hardware acceleration on Ada sm_89 (RTX 40-series), Hopper sm_90 (H100), and Blackwell sm_120 (RTX 50-series). The official LightX2V quantization docs confirm this scope directly: FP8 modes (fp8-vllm, fp8-sgl) are listed as supported on "H100/H200/H800, RTX 40 series, etc." — RTX 30 is absent. INT8 modes (int8-vllm, int8-sgl) are explicitly supported on "A100/A800, RTX 30/40 series, etc." You can load FP8 weights on a 3060 (they're valid .pth / .safetensors tensors) and you keep the on-disk size savings, but the runtime dequantizes them to BF16 at compute time — a memory escape hatch, not a speed win. The recipe below routes through INT8 + offload, or a community GGUF, instead.
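The architecture rule above can be sketched as a small lookup. This is an illustrative helper, not part of LightX2V: the function names recommended_quant and local_capability are hypothetical, and the thresholds simply encode the compute capabilities named in the note (sm_80/sm_86 for Ampere, sm_89 and newer for FP8). The PyTorch query is optional and only runs if you call it.

```python
# Sketch: map a CUDA compute capability to the quantization path this recipe recommends.
# Hypothetical helper names; the thresholds mirror the architectures named in the text.

def recommended_quant(capability: tuple) -> str:
    """Return 'fp8' where the text says FP8 tensor cores exist, else 'int8'."""
    # Ada sm_89, Hopper sm_90, Blackwell sm_120: FP8 is hardware-accelerated.
    if capability >= (8, 9):
        return "fp8"
    # Ampere sm_80 / sm_86 (A100, RTX 30-series): INT8 is the accelerated path.
    if capability >= (8, 0):
        return "int8"
    # Older architectures are outside the scope of this recipe.
    return "out_of_scope"


def local_capability():
    """Query GPU 0 via PyTorch; returns None when torch or CUDA is unavailable."""
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    return torch.cuda.get_device_capability(0)  # e.g. (8, 6) on an RTX 3060


# Example: the RTX 3060 reports (8, 6), so the INT8 path is selected.
print(recommended_quant((8, 6)))
```

On a machine with PyTorch installed, recommended_quant(local_capability()) gives the same answer for the local GPU, provided local_capability() did not return None.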

⚠️ This is Text-to-Video: there is no image input. The repo's pipeline_tag is set to image-to-video, so HuggingFace auto-generates a diffusers Quick Start snippet that loads an input image and passes image=image. That is the wrong call shape for this repo. The repo name is Wan2.1-T2V-14B, the README's run command is bash scripts/wan/distill/run_wan_t2v_distill_4step_cfg.sh, and the shipped script passes a text prompt only (--prompt "...", no image). Drive it with prompt= only, never image=.

Requirements

Component	Minimum	Tested
GPU	"at least 8GB VRAM" per the LightX2V Quickstart	RTX 3060 (Ampere GA106 `sm_86`, 12 GB GDDR6, 192-bit / 360 GB/s — per TechPowerUp)
RAM	"16GB or more recommended" per the Quickstart; 32 GB recommended because the offload CPU tier holds the 6.733 GB UMT5 encoder + streamed DiT blocks	—
Storage	"At least 50GB available space" per the Quickstart; ~22 GB for the INT8 sub-directory alone	—
Software	Python 3.10+, PyTorch 2.6+ with CUDA 12.4 (default Ampere wheel; cu128 is only required for Blackwell)	per Quickstart

The lightx2v/Wan2.1-T2V-14B-StepDistill-CfgDistill-Lightx2v HF repo ships three sub-directories (sizes verified live via the HF tree API):

distill_int8/ — per-block INT8 quant, ~22.3 GB total (40 transformer blocks at 0.352 GB each = 14.08 GB DiT + non_block.safetensors 0.931 GB + INT8 UMT5-XXL encoder models_t5_umt5-xxl-enc-int8.pth 6.733 GB + Wan2.1 VAE 0.508 GB). Recommended primary path on the 3060 — INT8 has hardware-accelerated tensor-core support on Ampere sm_86. The ~15 GB DiT is larger than the 12 GB card, so it must be streamed via block offload (mandatory here — see Running) rather than held fully resident.
distill_fp8/ — per-block FP8 quant, ~22.3 GB total with the same layout (FP8 UMT5 models_t5_umt5-xxl-enc-fp8.pth 6.733 GB). Weights load fine on a 3060 but compute dequantizes to BF16 on Ampere — VRAM savings, no speed win. Use INT8 instead.
distill_models/ — BF16 dense, ~40.5 GB total (28.577 GB DiT + 11.362 GB BF16 UMT5 + 0.508 GB VAE). The 28.577 GB DiT alone is more than double the 3060's 12 GB VRAM, so the BF16 path is not viable on this card even with offload-heavy gymnastics — stay on INT8 or GGUF.
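The size arithmetic behind these three bullets can be reproduced in a few lines. This is a minimal sketch that only re-adds the component sizes quoted above; the dictionary keys are illustrative labels, not file names.

```python
# Component sizes in GB, copied from the sub-directory breakdown above.
INT8 = {
    "dit_blocks": 40 * 0.352,  # 40 transformer blocks at 0.352 GB each
    "non_block": 0.931,
    "umt5_int8": 6.733,
    "vae": 0.508,
}
BF16 = {"dit": 28.577, "umt5_bf16": 11.362, "vae": 0.508}
VRAM_GB = 12.0  # RTX 3060


def total_gb(parts: dict) -> float:
    """Sum the component sizes, rounded to the precision used in the text."""
    return round(sum(parts.values()), 3)


int8_dit_gb = round(INT8["dit_blocks"] + INT8["non_block"], 3)

print("INT8 sub-directory total:", total_gb(INT8), "GB")
print("INT8 DiT (blocks + non_block):", int8_dit_gb, "GB")
print("INT8 DiT fits resident in 12 GB:", int8_dit_gb <= VRAM_GB)
print("BF16 sub-directory total:", total_gb(BF16), "GB")
print("BF16 DiT / VRAM ratio:", round(BF16["dit"] / VRAM_GB, 2))
```

The INT8 DiT comes out at about 15 GB, which is why block offload is required rather than optional on this card.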

Installation

The canonical install is Docker (simplest) or conda from source; both are documented in the LightX2V Quickstart. The 3060 (sm_86) belongs to the same Ampere generation as the A100 (sm_80), and both are covered by the framework's INT8 support list ("A100/A800, RTX 30/40 series, etc.").

1. Install the framework (conda path)

# Clone and create the environment
git clone https://github.com/ModelTC/LightX2V.git
cd LightX2V
conda create -n lightx2v python=3.10 -y
conda activate lightx2v
pip install -v -e .

Verbatim from the LightX2V Quickstart. On Ampere cards the default pip install torch already includes sm_86 kernels — no special CUDA index-url toggling required. (The cu128 channel is only needed for Blackwell sm_120 cards; ignore any cu128 instruction you see written for the RTX 50-series.)

2. (Recommended) build SageAttention 2 — the biggest attention-kernel lever on Ampere

git clone https://github.com/thu-ml/SageAttention.git
cd SageAttention && CUDA_ARCHITECTURES="8.0,8.6,8.9,9.0" \
  EXT_PARALLEL=4 NVCC_APPEND_FLAGS="--threads 8" MAX_JOBS=32 \
  pip install -v -e .

CUDA_ARCHITECTURES="8.0,8.6,8.9,9.0" covers Ampere through Hopper; the 8.6 entry is the RTX 3060's Ampere target. The blissful-tuner author cited in HF discussion #9 addressed an Ampere-card poster directly: "You have an Ampere card so the boon won't be quite as much, but Ampere IS supported so you definitely will want that." — that applies to the 3060's sm_86 exactly. SageAttention 2 is still the single biggest attention-kernel lever on this card; just don't expect the near-2× Ada speed-up. Pass attn_mode="sage_attn2" when you create the generator (see Running).
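Before passing attn_mode="sage_attn2", it can help to confirm that the build actually produced an importable package. The sketch below assumes the package installs under the module name sageattention; pick_attn_mode is a hypothetical helper, and it returns None rather than guessing a fallback mode name.

```python
import importlib.util
from typing import Optional


def pick_attn_mode(module_name: str = "sageattention") -> Optional[str]:
    """Return 'sage_attn2' if the module is importable, else None.

    Assumption: the SageAttention build installs a top-level module
    called 'sageattention'. No fallback attention mode is guessed here.
    """
    if importlib.util.find_spec(module_name) is not None:
        return "sage_attn2"
    return None


mode = pick_attn_mode()
if mode is None:
    print("SageAttention not importable; rebuild it or choose another attention mode.")
else:
    print("Using attn_mode =", mode)
```

This only checks that the module can be found, not that its CUDA kernels were compiled for sm_86, so a first short generation run is still the real test.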

3. Pull the 4-step distilled T2V-14B checkpoint (INT8 — recommended primary path)

INT8 has hardware-accelerated tensor-core support on Ampere; this is the recommended path for the 3060:

huggingface-cli download \
  lightx2v/Wan2.1-T2V-14B-StepDistill-CfgDistill-Lightx2v \
  --include "distill_int8/*" \
  --local-dir ./weights/Wan2.1-T2V-14B-StepDistill

There is no separate shipped INT8 shell script; the INT8 weights are pulled the same way as any other sub-directory and then pointed at the INT8 config (see Running). The FP8 sub-directory is downloaded identically (--include "distill_fp8/*") but, per the architecture note above, it buys you only VRAM on Ampere, not speed — prefer INT8.
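A quick way to catch a partial download is to compare the on-disk size of the sub-directory with the ~22.3 GB figure quoted earlier. This is a rough sketch under two assumptions: sizes are counted in decimal GB (1e9 bytes), and the tolerance of 0.5 GB is an arbitrary illustrative value. The helper names are hypothetical.

```python
from pathlib import Path


def dir_size_gb(root: str) -> float:
    """Sum the sizes of all regular files under root, in decimal GB (1e9 bytes)."""
    return sum(p.stat().st_size for p in Path(root).rglob("*") if p.is_file()) / 1e9


def looks_complete(root: str, expected_gb: float = 22.3, tolerance_gb: float = 0.5) -> bool:
    """True when the directory size is within tolerance of the expected total."""
    return abs(dir_size_gb(root) - expected_gb) <= tolerance_gb


# Usage (path taken from the download command above):
# looks_complete("./weights/Wan2.1-T2V-14B-StepDistill/distill_int8")
```

A size check does not verify file integrity; it only flags an obviously truncated or interrupted download.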

Alternative: Docker (simplest)

docker pull lightx2v/lightx2v:25101501-cu124

The 25101501-cu124 tag is the right one for the 3060: the CUDA 12.4 image includes Ampere sm_86 kernels and matches the default pip install torch path used in the conda install. The Quickstart also documents a 26011201-cu128 tag; that cu128 image targets Blackwell sm_120 and is unnecessary on Ampere, so stay on cu124.

Alternative path: ComfyUI + GGUF (community-quantized)

If you'd rather work in ComfyUI than the LightX2V Python framework, QuantStack ships GGUF conversions that explicitly identify as "a GGUF conversion of an addon of lightx2v/Wan2.1-T2V-14B-StepDistill-CfgDistill" and link back to the canonical repo, produced with city96's standard conversion scripts. Note that QuantStack's release pre-merges the Wan2.1-VACE-14B control-conditioning addon onto the distilled base. That is useful if you want pose/depth-driven control; otherwise the inference shape for plain text-to-video is identical. The DiT quant ladder per the QuantStack tree (DiT only; the UMT5 text encoder and VAE load separately):

Tier	DiT size	Fits 12 GB w/ UMT5 + VAE
Q2_K	6.36 GB	yes — but offload the UMT5 encoder to leave room
Q3_K_S / Q3_K_M / Q3_K_L	7.84 / 8.64 / 9.37 GB	yes, with the encoder offloaded to CPU/GGUF
Q4_0 / Q4_K_S	10.33 / 10.55 GB	tight — encoder must be offloaded
Q4_1 / Q4_K_M	11.18 / 11.64 GB	very tight; favour Q3/Q4_K_S on 12 GB
Q5_K_S and larger	12.25 GB+	no — DiT alone leaves no room on a 12 GB card
Q8_0	18.66 GB	no
F16	34.69 GB	no

On a 12 GB 3060 stay on the Q3_K_M / Q4_K_S tier and load the UMT5 encoder as the city96 GGUF UMT5 mirror (or the Comfy-Org safetensors encoder) so it stays off the GPU. Loading needs the ComfyUI-GGUF custom node plus that UMT5 encoder and the Wan2.1 VAE (Kijai mirror). GGUF inference is arch-independent (no FP8 hardware needed); the 3060's PCIe Gen4 link keeps any CPU-offloaded encoder traffic reasonable.
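The table can be turned into a small filter that lists which tiers leave a chosen amount of VRAM free for activations and the VAE. This is a sketch only: the DiT sizes are copied from the table, while the headroom value is an illustrative assumption, not a measured figure, and tiers_that_fit is a hypothetical helper.

```python
# DiT-only sizes in GB, copied from the QuantStack table above.
GGUF_DIT_GB = {
    "Q2_K": 6.36, "Q3_K_S": 7.84, "Q3_K_M": 8.64, "Q3_K_L": 9.37,
    "Q4_0": 10.33, "Q4_K_S": 10.55, "Q4_1": 11.18, "Q4_K_M": 11.64,
    "Q5_K_S": 12.25, "Q8_0": 18.66, "F16": 34.69,
}


def tiers_that_fit(vram_gb: float, headroom_gb: float) -> list:
    """Return tier names whose DiT size fits in vram_gb minus headroom_gb.

    headroom_gb is an assumed reserve for activations and the VAE; the UMT5
    encoder is expected to be offloaded, as the text recommends.
    """
    budget = vram_gb - headroom_gb
    return [name for name, size in GGUF_DIT_GB.items() if size <= budget]


# With an assumed 1 GB reserve on a 12 GB card:
print(tiers_that_fit(12.0, 1.0))
```

With that assumed 1 GB reserve the list stops at Q4_K_S, which lines up with the recommendation to stay on the Q3_K_M / Q4_K_S tier.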

Running

The LightX2V repo ships ready-to-run shell scripts under scripts/wan/distill/. The 4-step CFG distill entry points there are run_wan_t2v_distill_4step_cfg.sh (standard / BF16 distill) and run_wan_t2v_distill_fp8_4step_cfg.sh (FP8 distill). Each invokes python -m lightx2v.infer with a matching config JSON from configs/distill/wan21/; you fill in lightx2v_path (the cloned repo root) and model_path (the weights directory from step 3) at the top of the script, and the shipped script passes a text prompt only (--prompt "...") plus --save_result_path — no image input, which is the correct shape for this text-to-video repo.

There is no INT8 turnkey shell script for Wan2.1-T2V distill — the shipped distill scripts/configs are BF16 and FP8 only, and (as noted in the install step) FP8 buys VRAM but no speed on Ampere sm_86. The INT8 path that actually fits the 3060 is driven through the LightX2V Python API below: point model_path at the distill_int8/ weights you downloaded and turn on block offload.

The one change that matters for 12 GB. By default the pipeline keeps the full ~15 GB INT8 DiT resident, which fits a 24 GB card but OOMs a 12 GB 3060. Per the Parameter Offload guide, the official progressive strategy for memory-constrained devices is: first enable cpu_offload (keep T5/CLIP/VAE on GPU), then "If memory is still insufficient, gradually enable CPU offload for T5, CLIP, VAE", then consider quantization + offload or lazy_load. On the 3060, turn on block-granularity CPU offload for the DiT and offload the UMT5 text encoder so blocks stream from CPU instead of all being resident:

{
  "cpu_offload": true,
  "offload_granularity": "block",
  "t5_cpu_offload": true,
  "vae_cpu_offload": false
}
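One way to apply these four keys without hand-editing is to merge them into a copy of an existing config JSON. The sketch below uses only the standard library; apply_offload and the file paths you pass to it are hypothetical, and it assumes the config is a flat JSON object that accepts these keys at the top level.

```python
import json
from pathlib import Path

# The four offload keys shown above.
OFFLOAD_OVERRIDES = {
    "cpu_offload": True,
    "offload_granularity": "block",
    "t5_cpu_offload": True,
    "vae_cpu_offload": False,
}


def apply_offload(src: str, dst: str) -> dict:
    """Read the config at src, overlay the offload keys, write the result to dst."""
    config = json.loads(Path(src).read_text())
    config.update(OFFLOAD_OVERRIDES)  # existing keys such as infer_steps are kept
    Path(dst).write_text(json.dumps(config, indent=2) + "\n")
    return config
```

Writing to a separate destination file keeps the shipped config untouched, so the original can still be used on a larger card.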

For the Python API directly, the INT8 text-to-video path on a 3060 looks like this. The enable_offload(...) / create_generator(...) / generate(...) parameter names are reproduced from the LightX2V README (the README's worked example is I2V; for this T2V repo, drop image_path= and pass prompt= only):

from lightx2v import LightX2VPipeline

pipe = LightX2VPipeline(
    model_path="./weights/Wan2.1-T2V-14B-StepDistill/distill_int8",
    model_cls="wan2.1_distill",
    task="t2v",
)

# REQUIRED on the 12 GB 3060: stream the INT8 DiT in block granularity and keep the
# 6.733 GB UMT5 text encoder on CPU so the model fits the 12 GB envelope.
pipe.enable_offload(
    cpu_offload=True,                 # stream the INT8 DiT — do NOT keep it all resident on 12 GB
    offload_granularity="block",
    text_encoder_offload=True,        # leaves the 6.733 GB UMT5 on CPU
    image_encoder_offload=False,
    vae_offload=False,
)

pipe.create_generator(
    attn_mode="sage_attn2",           # SageAttention 2, supported on Ampere sm_86
    infer_steps=4,                    # the whole point of the distilled checkpoint
    height=480, width=832,
    num_frames=81,
    guidance_scale=1.0,               # CFG disabled — the distilled checkpoint runs CFG-free
    sample_shift=5.0,
)
pipe.generate(
    seed=42,
    prompt="A man with short gray hair plays a red electric guitar.",
    save_result_path="./output.mp4",
)

The HF model card states verbatim: "We recommend using the LCM scheduler with the following settings:" followed by shift=5.0 and guidance_scale=1.0 (i.e., without CFG). These are baked into the distill config JSONs. Output lands at the path you pass to save_result_path. Start at 480×832, 81 frames, 4 steps and only push toward 720×1280 once you've confirmed peak VRAM via nvidia-smi -l 1.
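To track peak VRAM without watching the terminal, the nvidia-smi CSV query output can be parsed in Python. This is a sketch: the 15% safety margin is an arbitrary illustrative threshold, the helper names are hypothetical, and sample_gpu_memory only works where nvidia-smi is on the PATH.

```python
import subprocess

# CSV query for GPU memory; with "nounits" the values are reported in MiB.
QUERY = [
    "nvidia-smi",
    "--query-gpu=memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_memory(csv_line: str) -> tuple:
    """Parse one 'used, total' CSV line into integers (MiB)."""
    used, total = (int(field.strip()) for field in csv_line.split(","))
    return used, total


def safe_to_scale_up(used_mib: int, total_mib: int, margin: float = 0.15) -> bool:
    """True when usage leaves at least `margin` of total VRAM free (assumed threshold)."""
    return used_mib <= total_mib * (1.0 - margin)


def sample_gpu_memory() -> tuple:
    """Take one sample from GPU 0; call this repeatedly during a run and keep the max."""
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
    return parse_memory(out.splitlines()[0])
```

A single sample is not a peak, so poll sample_gpu_memory() about once per second while a clip renders and keep the largest used value before deciding whether 720×1280 is worth trying.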

Results

Speed: No RTX 3060-specific benchmark has been published for the Wan2.1-T2V-14B 4-step distilled variant, and /check/lightx2v/rtx-3060 currently returns verdict: unknown with no benchmark rows. The framework's own README performance table reports consumer-GPU rows only for Ada sm_89 hardware (e.g. an RTX 4090D measurement). That card has FP8 tensor-core acceleration the 3060 lacks and far more compute than the 3060's 3584 CUDA cores at 360 GB/s, and the table measures a different inference variant, so those numbers do not transfer to a single 12 GB 3060 running this INT8-with-offload 4-step path. Rather than quote a misleading number, this recipe omits wall-clock speed. What the checkpoint itself fixes is the step count: it runs in 4 steps instead of the base model's 40, cutting the number of denoising steps per clip by roughly 10×. Empirical RTX 3060 numbers will land at /check/lightx2v/rtx-3060 once a community benchmark is submitted via /contribute.
VRAM usage: The framework's Quickstart sets the floor at "at least 8GB VRAM" with offload + quant, and the HF model card confirms the fp8 / int8 distillation weights "enable fast inference using lightx2v on RTX 4060." (an 8 GB card). The 12 GB RTX 3060 sits above that 8 GB floor, but the ~15 GB INT8 DiT (14.08 GB per-block weights + 0.931 GB non_block, per the HF tree API) still exceeds 12 GB, so cpu_offload: true + offload_granularity: "block" is mandatory here (unlike a 24 GB card, where INT8 fits resident) — the DiT streams from CPU, and t5_cpu_offload: true keeps the 6.733 GB UMT5 encoder off the GPU. The on-disk envelope is ~22.3 GB for the INT8 sub-directory; the BF16 sub-directory is ~40.5 GB (its 28.577 GB DiT alone is more than double 12 GB and not viable). As corroboration that the FP8/INT8 + SageAttention + offload toolbox runs on this model, a community user reports in HF discussion #9: "I run it in 16GB VRAM on my 4070 Ti Super myself using my own […] features like SageAttention […] fp8 scaled quantization, torch.compile optimization, transformer block swap, and more." — that is a 16 GB Ada sibling, not this 12 GB Ampere card, so it does not transfer as a VRAM figure, but it confirms the same SageAttention + offload combination this recipe walks through. See /check/lightx2v/rtx-3060 for empirical numbers as they land.
Quality notes: The distilled checkpoint trades fine motion detail and prompt fidelity for the 4-step / no-CFG speed-up. Use the recommended LCM scheduler, shift=5.0, guidance_scale=1.0 (HF model card) and stay close to the model's training resolutions (480×832, 720×1280) for best results.

For the full benchmark data, see /check/lightx2v/rtx-3060.

Troubleshooting

"I loaded the FP8 weights and they're slower than I expected"

That's the expected behavior on Ampere — the 3060's sm_86 has no FP8 tensor cores, so PyTorch dequantizes FP8 weights to BF16 on the fly at compute time. You keep the on-disk size savings but lose the speed-up that the RTX 4090 / H100 path gets from native FP8 tensor-core throughput. The fix is to use the INT8 sub-directory instead — INT8 is hardware-accelerated on Ampere per the framework's quantization matrix ("A100/A800, RTX 30/40 series, etc." for int8-vllm and int8-sgl; FP8 modes list only "H100/H200/H800, RTX 40 series, etc."). Re-download distill_int8/* and point model_path at it.

Out of memory loading the BF16 distill

The BF16 distill_models/distill_model.safetensors is 28.577 GB on disk (HF tree API) — larger than the 3060's 12 GB VRAM by more than 2×. CPU offload alone cannot recover this on a 12 GB card; stick to the INT8 sub-directory or a small GGUF tier.

Out of memory on the INT8 path

The 12 GB envelope is too small to keep the full ~15 GB INT8 DiT resident, so offload is mandatory, not optional. In order of effectiveness:

1. Turn on block-granularity CPU offload. Set cpu_offload: true and offload_granularity: "block" in the config (or enable_offload(cpu_offload=True, offload_granularity="block") in the Python API). This streams the DiT blocks from CPU instead of holding all ~15 GB on the GPU.
2. Offload the UMT5 text encoder. Set t5_cpu_offload: true (config) / text_encoder_offload=True (Python). The INT8 UMT5 is 6.733 GB on its own (HF tree API), so there is no room for it on the GPU alongside the streamed DiT. The 3060's PCIe Gen4 link keeps this CPU↔GPU streaming reasonable, but it is slower than an all-GPU run.
3. Make sure SageAttention 2 is actually loaded. Pass attn_mode="sage_attn2". Without it, attention activations on a 480×832 / 81-frame clip eat substantially more VRAM (and run slower). The quantization tutorial and HF discussion #9 both call SageAttention a primary lever for fitting Wan into a tight envelope.
4. Drop frame count before resolution. As a final fallback, go from 480×832 / 81 frames down to 480×832 / 49 frames; only push to 720×1280 if peak VRAM stays comfortably below 12 GB. If you are still hitting OOM, HF discussion #9 notes that enabling torch.compile nets "another +20% speed and -20% VRAM".
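The frame-count fallback can be automated with a small retry loop. This is a sketch, not LightX2V code: first_that_fits is a hypothetical helper, render stands in for whatever function wraps your create_generator / generate calls, and the out-of-memory check assumes the error surfaces as a RuntimeError whose message contains "out of memory", which is how PyTorch reports CUDA OOM.

```python
from typing import Callable, Optional

# (height, width, num_frames), most demanding first; values from the text above.
LADDER = [(480, 832, 81), (480, 832, 49)]


def first_that_fits(render: Callable[[int, int, int], None],
                    ladder: list = LADDER) -> Optional[tuple]:
    """Try each setting in order and return the first that renders without OOM."""
    for height, width, frames in ladder:
        try:
            render(height, width, frames)
            return (height, width, frames)
        except RuntimeError as err:
            # Only swallow out-of-memory errors; anything else is a real bug.
            if "out of memory" not in str(err).lower():
                raise
    return None  # every rung ran out of memory
```

In practice the render callable would also need to release GPU memory after a failed attempt (for example by rebuilding the pipeline) before the next rung is tried.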

Wrong output / errors from the HF card's diffusers snippet

The repo's pipeline_tag is image-to-video, so the auto-generated diffusers Quick Start loads an input image and passes image=image. That is the Image-to-Video signature on what is a Text-to-Video repo. For T2V, call the pipeline with prompt= only (as in Running above), or run the official LightX2V shell script scripts/wan/distill/run_wan_t2v_distill_4step_cfg.sh, which already encodes the prompt-only call shape.

Slow inference despite the 4-step distillation

The 4-step path only delivers the advertised speed-up if CFG is actually disabled and the distilled settings are loaded. Per the HF model card, use the LCM scheduler with shift=5.0 and guidance_scale=1.0 (i.e., without CFG), and infer_steps=4 — not the base model's 40. The trap is bespoke scripts that copy partial parameters and silently fall back to the un-distilled inference path; the shipped shell scripts and config JSONs already encode the right defaults.
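A bespoke script can guard against this trap with a small sanity check before generation. The sketch below uses the parameter names from the Python API example in Running (infer_steps, guidance_scale, sample_shift); the key names inside the shipped config JSONs may differ, and distill_mismatches is a hypothetical helper.

```python
# Expected distilled settings, per the HF model card values quoted above.
EXPECTED = {"infer_steps": 4, "guidance_scale": 1.0, "sample_shift": 5.0}


def distill_mismatches(settings: dict) -> dict:
    """Return {name: (actual, expected)} for every setting that is off or missing."""
    return {
        name: (settings.get(name), want)
        for name, want in EXPECTED.items()
        if settings.get(name) != want
    }


# Example: base-model style settings copied into a custom script by mistake.
problems = distill_mismatches({"infer_steps": 40, "guidance_scale": 5.0, "sample_shift": 5.0})
for name, (actual, want) in problems.items():
    print(f"{name}: got {actual}, expected {want}")
```

An empty result means the three values match the distilled defaults; it does not prove that the distilled weights themselves were loaded.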

SageAttention build on the 3060

SageAttention 2 is fully supported on Ampere — make sure CUDA_ARCHITECTURES in the build command includes 8.6 (the 3060's target). The HF discussion #9 author addresses an Ampere poster directly: "You have an Ampere card so the boon won't be quite as much, but Ampere IS supported so you definitely will want that." Expect a smaller speed-up than the near-2× number quoted on Ada cards.

Report new issues via the submission form. Community RTX 3060 benchmarks would directly improve the /check/lightx2v/rtx-3060 data.