How much VRAM does LightX2V need?

About 24 GB — the minimum this recipe targets.

How hard is this setup?

Advanced. Follow the steps below.

LightX2V 4-Step on RX 7900 XTX: Distilled Wan2.1-T2V-14B in ComfyUI on ROCm (BF16 + 4-Step LoRA)

What You'll Build

A local 4-step text-to-video setup on a 24 GB Radeon RX 7900 XTX (RDNA3, Navi 31, gfx1100) that runs the LightX2V step/CFG-distillation of Wan2.1-T2V-14B inside ComfyUI on the ROCm stack. Instead of the LightX2V framework's NVIDIA-only FP8/INT8 kernel path, this recipe takes the ComfyUI-native route: load the base Wan2.1-T2V-14B diffusion model and apply LightX2V's published rank-64 4-step distill LoRA, then sample at 4 steps with CFG disabled. The LoRA is what cuts the base model's ~25-40 step run down to 4.

Hardware data: RX 7900 XTX (24GB VRAM) · base Wan2.1-T2V-14B BF16 + LightX2V 4-step LoRA · ComfyUI on ROCm · See benchmark data

⚠️ This is a ROCm recipe, not CUDA — and not the LightX2V framework. The RX 7900 XTX runs on AMD's ROCm/HIP stack: there is no cu124/cu128 wheel, no SageAttention build, no Q8-Kernels, no pip install xformers, and no FP8/FP4 path here. RDNA3 has no FP8/FP4 hardware (its WMMA units accept FP16, BF16, INT8, INT4 only — source: AMD GPUOpen "WMMA on RDNA3"), so the lightx2v repo's distill_fp8/ weights would just upcast to BF16 — no memory win, no compute win. The attention path is PyTorch SDPA (ComfyUI's default; explicit flag --use-pytorch-cross-attention), not FlashAttention-2 and not xformers. If a guide tells you to build SageAttention, install Q8 Kernels, or pick a cu12x wheel for this card, it was written for the wrong vendor.

⚠️ Honest fragility: this inherits Wan2.1's instability on AMD. LightX2V is a distillation of Wan2.1, so it rides the same ComfyUI Wan backend, and that backend is AMD's most finicky surface, not its smoothest. The 4-step LoRA lowers the step count and therefore the wall-clock time, but it does not fix the underlying ROCm video-stack rough edges: large-model loads can stall (--disable-pinned-memory), VAE decode needs tile-size tuning, and throughput trails NVIDIA badly. A real RX 7900 XTX run of the base Wan2.1 14B in ComfyUI is documented in HF Wan2.1 discussion #14, where a user on ROCm 6.3 reports "I have it running on my RX 7900 XTX but the 25 steps is running for ~24 mins at 832*480". The same user later reports, for an image-to-video run of the 480P 14B FP8 model with TeaCache and torch.compile, "a 81 frame video from an input image in just under 20 minutes, with 30 steps". (That second datapoint is image-to-video on the FP8 base, a different task and precision from this text-to-video 4-step recipe.) The 4-step distill path is what makes this practical on the card, but treat it as supported-but-finicky, not turnkey.

Requirements

Component	Minimum	Tested
GPU	24 GB VRAM (ROCm-supported AMD card) for the 14B DiT	RX 7900 XTX (24 GB, gfx1100)
RAM	32 GB system (the 28.58 GB DiT is paged through host RAM)	—
Storage	~41 GB (14B DiT + FP16 umT5 + VAE + LoRA)	—
Driver	AMD ROCm 6.3+ on Linux (7.2.x current)	ROCm 6.3 per HF discussion #14
Software	ComfyUI + PyTorch (ROCm build), Python 3.10+	—

Both halves of this stack are Apache 2.0 and not gated on Hugging Face — no access request or login is required. The LightX2V distill repo (lightx2v/Wan2.1-T2V-14B-StepDistill-CfgDistill-Lightx2v) is Apache 2.0, and the base Wan-AI/Wan2.1-T2V-14B states "The models in this repository are licensed under the Apache 2.0 License."

Why the ComfyUI-native LoRA route rather than the lightx2v Python framework: the framework's documented acceleration (SageAttention, Q8 Kernels, FP8/INT8 distill weights) is built for NVIDIA Ada sm_89 and has no RDNA3 equivalent. But LightX2V also publishes its distillation as a plain ComfyUI LoRA, loras/Wan21_T2V_14B_lightx2v_cfg_step_distill_lora_rank64.safetensors (631,344,264 bytes ≈ 0.63 GB, verified via the HF tree API), which applies on top of the base Wan2.1 model that already runs on this card. That LoRA is the part that survives the vendor switch.

Installation

1. Install ComfyUI

Per the ComfyUI README, clone the repo:

git clone https://github.com/comfyanonymous/ComfyUI.git
cd ComfyUI

2. Install PyTorch for ROCm

The RX 7900 XTX (gfx1100) is an officially ROCm-supported GPU, so it uses the stable ROCm PyTorch wheel. Per the ComfyUI README "AMD GPUs (Linux)" section:

pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm7.2

ℹ️ Verify the ROCm tag before you copy it. The rocmX.Y tag moves over time (6.3 → 6.4 → 7.x). Read the current line in the live ComfyUI README before running — the documented RX 7900 XTX run above was on ROCm 6.3; the current ComfyUI README pins rocm7.2. Confirm the installed build is the ROCm one: python -c "import torch; print(torch.__version__)" should print a +rocm-style suffix, and torch.cuda.is_available() returns True (ROCm masquerades as the cuda device namespace under HIP).
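The check described above can be scripted. This is a minimal sketch, not an official tool: describe_torch_build is a hypothetical helper name, and it assumes the usual local-version suffixes (+rocm, +cu, +cpu) plus the torch.version.hip attribute, which is None on non-ROCm builds.

```python
from typing import Optional


def describe_torch_build(version: str, hip_version: Optional[str]) -> str:
    """Classify a PyTorch build from torch.__version__ and torch.version.hip."""
    # A ROCm build reports a HIP version and normally carries a +rocm suffix.
    if hip_version is not None or "+rocm" in version:
        return "rocm"
    if "+cu" in version:
        return "cuda"
    if "+cpu" in version:
        return "cpu"
    return "unknown"


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        print("PyTorch is not installed in this environment")
    else:
        hip = getattr(torch.version, "hip", None)
        kind = describe_torch_build(torch.__version__, hip)
        # Under ROCm the GPU is exposed through the torch.cuda namespace.
        print(f"torch {torch.__version__}: {kind} build, "
              f"GPU visible: {torch.cuda.is_available()}")
```

If this prints anything other than a rocm build, reinstall from the ROCm wheel index before going further.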

3. Install ComfyUI dependencies

Per the ComfyUI README "Dependencies" section:

pip install -r requirements.txt

4. Download the base Wan2.1-T2V-14B ComfyUI files

ComfyUI's native Wan video workflow sources its files from the Comfy-Org/Wan_2.1_ComfyUI_repackaged repo. On AMD, take the BF16/FP16 files (not the FP8 variant — it has no hardware benefit on RDNA3). Sizes verified via the HF tree API:

# Diffusion model — wan2.1_t2v_14B_bf16.safetensors, 28,577,096,680 bytes ≈ 28.58 GB
wget -P models/diffusion_models/ \
  "https://huggingface.co/Comfy-Org/Wan_2.1_ComfyUI_repackaged/resolve/main/split_files/diffusion_models/wan2.1_t2v_14B_bf16.safetensors"

# Text encoder — FP16 umT5 (11,366,399,385 bytes ≈ 11.37 GB), NOT the fp8 variant
wget -P models/text_encoders/ \
  "https://huggingface.co/Comfy-Org/Wan_2.1_ComfyUI_repackaged/resolve/main/split_files/text_encoders/umt5_xxl_fp16.safetensors"

# VAE
wget -P models/vae/ \
  "https://huggingface.co/Comfy-Org/Wan_2.1_ComfyUI_repackaged/resolve/main/split_files/vae/wan_2.1_vae.safetensors"

ℹ️ Why FP16 umT5, not FP8. The Comfy-Org repo ships both umt5_xxl_fp16.safetensors (11.37 GB) and umt5_xxl_fp8_e4m3fn_scaled.safetensors (6.74 GB). On NVIDIA the FP8 encoder is a memory win; on RDNA3 it isn't — there is no FP8 tensor hardware, so the weights upcast to BF16/FP16 at load with no saving. At 24 GB you have the room, and the text encoder runs on CPU during sampling anyway. Take the FP16 file.

5. Download the LightX2V 4-step distill LoRA

This is the LightX2V part — the rank-64 step/CFG distillation, as a ComfyUI LoRA (HF tree API):

# Wan21_T2V_14B_lightx2v_cfg_step_distill_lora_rank64.safetensors, 631,344,264 bytes ≈ 0.63 GB
wget -P models/loras/ \
  "https://huggingface.co/lightx2v/Wan2.1-T2V-14B-StepDistill-CfgDistill-Lightx2v/resolve/main/loras/Wan21_T2V_14B_lightx2v_cfg_step_distill_lora_rank64.safetensors"

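Multi-gigabyte downloads can be cut short without an obvious error, so it may be worth comparing the files on disk with the byte counts quoted above. This is a sketch under stated assumptions: check_downloads is a hypothetical helper, the paths assume the wget -P targets used in this guide and a run from the ComfyUI repo root, and the VAE is left out because its size is not stated here.

```python
import os

# Byte counts quoted in this guide (HF tree API). The VAE is omitted because
# the guide does not state its size.
EXPECTED_BYTES = {
    "models/diffusion_models/wan2.1_t2v_14B_bf16.safetensors": 28_577_096_680,
    "models/text_encoders/umt5_xxl_fp16.safetensors": 11_366_399_385,
    "models/loras/Wan21_T2V_14B_lightx2v_cfg_step_distill_lora_rank64.safetensors": 631_344_264,
}


def check_downloads(root: str, expected: dict) -> dict:
    """Map each relative path to 'ok', 'missing', or a size-mismatch message."""
    report = {}
    for rel_path, want in expected.items():
        path = os.path.join(root, rel_path)
        if not os.path.isfile(path):
            report[rel_path] = "missing"
            continue
        got = os.path.getsize(path)
        report[rel_path] = "ok" if got == want else f"size mismatch ({got} != {want})"
    return report


if __name__ == "__main__":
    for rel_path, status in check_downloads(".", EXPECTED_BYTES).items():
        print(f"{status}  {rel_path}")
    total_gb = sum(EXPECTED_BYTES.values()) / 1e9
    print(f"expected total without the VAE: {total_gb:.2f} GB")
```

A size match is only a truncation check, not an integrity check; it does not replace a checksum comparison.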
Running

Launch ComfyUI from the repo root. Per the ComfyUI README "Running" section:

python main.py

This starts the server (default http://127.0.0.1:8188). Open it, load the built-in Wan 2.1 text-to-video template (the ComfyUI Wan video tutorial describes the graph), and make two changes for the LightX2V 4-step path:

Insert a LoraLoaderModelOnly node between Load Diffusion Model and the sampler, and select Wan21_T2V_14B_lightx2v_cfg_step_distill_lora_rank64.safetensors at strength 1.0. This is what converts the base model into the 4-step distilled model.
Set the sampler to the distilled schedule. Per the LightX2V model card, the distilled checkpoint runs with "significantly fewer inference steps (4 steps) and without classifier-free guidance", recommending the LCM scheduler with shift=5.0 and guidance_scale=1.0. In ComfyUI terms: KSampler steps = 4, CFG = 1.0, sampler lcm (or euler), scheduler matched to the template's Wan default, and the ModelSamplingSD3/shift node set to 5.0.
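The two graph changes above can also be applied to a workflow saved in ComfyUI's API format (JSON mapping node ids to class_type and inputs). This is a minimal sketch, not the template's own code: it assumes the exported graph uses the UNETLoader, ModelSamplingSD3, and KSampler node classes with the input names shown, and apply_four_step_settings and the node id lightx2v_lora are hypothetical names. Check them against your own export.

```python
import copy

LORA_NAME = "Wan21_T2V_14B_lightx2v_cfg_step_distill_lora_rank64.safetensors"
LORA_NODE_ID = "lightx2v_lora"  # hypothetical id, must not clash with existing ids


def apply_four_step_settings(workflow: dict) -> dict:
    """Return a patched copy of an API-format workflow for the 4-step path."""
    wf = copy.deepcopy(workflow)
    loaders = [nid for nid, n in wf.items() if n["class_type"] == "UNETLoader"]
    if len(loaders) != 1:
        raise ValueError("expected exactly one diffusion-model loader node")
    loader_id = loaders[0]

    # Re-point every consumer of the loader's MODEL output at the LoRA node.
    for node in wf.values():
        for key, value in node["inputs"].items():
            if value == [loader_id, 0]:
                node["inputs"][key] = [LORA_NODE_ID, 0]

    # Insert the LoRA loader between the diffusion model and its consumers.
    wf[LORA_NODE_ID] = {
        "class_type": "LoraLoaderModelOnly",
        "inputs": {"model": [loader_id, 0],
                   "lora_name": LORA_NAME,
                   "strength_model": 1.0},
    }

    # Distilled schedule: 4 steps, CFG off, lcm sampler, shift 5.0.
    # The scheduler input is left at the template's default.
    for node in wf.values():
        if node["class_type"] == "KSampler":
            node["inputs"].update(steps=4, cfg=1.0, sampler_name="lcm")
        elif node["class_type"] == "ModelSamplingSD3":
            node["inputs"]["shift"] = 5.0
    return wf
```

The function works on a copy, so the original export stays available for comparison with the base 25-40 step settings.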

Start at 480×832, 81 frames, which is one of the model's training resolutions and the resolution of the documented AMD run, and only push to 720×1280 after confirming a clean run. Generated videos land in ComfyUI/output/.

The 28.58 GB BF16 DiT is larger than the 7900 XTX's 24 GB, so ComfyUI's weight-streaming / smart-memory offload pages part of it through host RAM (hence the 32 GB system-RAM prerequisite). This is normal for the 14B on a 24 GB card, and the documented AMD runs were made on the same 24 GB card.
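The size gap behind that offload can be worked out directly. This is a rough lower bound only, assuming the card's "24 GB" means 24 GiB and ignoring activations, the VAE, and allocator overhead, all of which push the real offloaded share higher; min_offload_bytes is a hypothetical helper name.

```python
DIT_BYTES = 28_577_096_680  # wan2.1_t2v_14B_bf16.safetensors, per this guide
VRAM_BYTES = 24 * 2**30     # assumption: "24 GB" of VRAM means 24 GiB


def min_offload_bytes(weight_bytes: int, vram_bytes: int) -> int:
    """Weights that cannot be resident even if VRAM held nothing else."""
    return max(0, weight_bytes - vram_bytes)


overflow = min_offload_bytes(DIT_BYTES, VRAM_BYTES)
print(f"at least {overflow / 1e9:.2f} GB "
      f"({overflow / DIT_BYTES:.1%} of the DiT) must sit in host RAM")
```

Under those assumptions roughly a tenth of the weights cannot fit on the GPU at all, which is why offload is unavoidable here rather than a tuning choice.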

Results

Speed: No RX 7900 XTX benchmark for the 4-step LightX2V path has been published, and /check/lightx2v/rx-7900-xtx currently returns verdict: unknown with no benchmark rows. The only verifiable AMD datapoint is for the un-distilled base Wan2.1 14B: in HF discussion #14, a 7900 XTX user reports "a 81 frame video from an input image in just under 20 minutes, with 30 steps" (an image-to-video run on the FP8 480P 14B base). That is a 30-step image-to-video base run, not this recipe's 4-step text-to-video distilled run. The model fact is that the distill runs in 4 steps instead of the base model's 25-40, a roughly 6-10× cut in step count, so wall-clock time should drop substantially, but no one has published the 4-step AMD number. Rather than quote a misleading figure, this recipe omits wall-clock speed. Empirical RX 7900 XTX numbers will land at /check/lightx2v/rx-7900-xtx once a community benchmark is submitted via /contribute.
VRAM usage: At 24 GB, VRAM is not the binding constraint; ROCm video-stack stability is. The 14B BF16 DiT is 28.58 GB on disk (HF tree API), so ComfyUI streams/offloads it; the documented 7900 XTX runs above confirm the 14B is workable on this card with offload. The 0.63 GB LoRA adds negligible memory. The FP16 umT5 (11.37 GB) runs on CPU during sampling. See /check/lightx2v/rx-7900-xtx for community measurements as they land.
Quality notes: The distill trades fine motion detail and prompt fidelity for the 4-step / no-CFG speedup. Use the model card's recommended settings (LCM scheduler, shift=5.0, guidance_scale=1.0, 4 steps) and stay at the model's training resolutions (480×832, 720×1280). A LoRA strength below 1.0 weakens the distillation and can reintroduce the need for more steps.
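The step and CFG savings can be put in terms of model forward passes. This is illustrative arithmetic, not a benchmark: it assumes classifier-free guidance costs two passes per step (conditional plus unconditional) and that the sampler skips the unconditional pass when CFG is exactly 1.0. The CFG value 6.0 is only an example of a base-model setting, and model_evaluations is a hypothetical helper name.

```python
def model_evaluations(steps: int, cfg: float) -> int:
    """Count DiT forward passes for a run.

    Assumption: CFG above 1.0 needs a conditional and an unconditional pass
    per step, while cfg == 1.0 needs only the conditional pass.
    """
    passes_per_step = 1 if cfg == 1.0 else 2
    return steps * passes_per_step


distilled = model_evaluations(steps=4, cfg=1.0)
for base_steps in (25, 30, 40):
    base = model_evaluations(steps=base_steps, cfg=6.0)
    print(f"{base_steps} steps with CFG: {base} passes, "
          f"{base / distilled:.1f}x the distilled run's {distilled}")
```

This counts passes, not seconds. Model loading, offload traffic, text encoding, and VAE decode are unaffected by the LoRA, so measured wall-clock gains will be smaller than the pass ratio.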

For the full benchmark data and other-GPU comparisons, see /check/lightx2v/rx-7900-xtx.

Troubleshooting

Model load stalls / hangs when loading the 14B DiT

ROCm's pinned-memory and smart-memory paths can stall large video-model loads on RDNA3. Per the AMD-ROCm video-stack guidance and ComfyUI's launch flags, the go-to fix is to disable them:

python main.py --disable-pinned-memory --disable-smart-memory

Add --use-pytorch-cross-attention as well if attention auto-selection misbehaves — it forces the PyTorch SDPA path, which is the correct attention backend on RDNA3 (there is no FlashAttention-2 or xformers path on this card).

VAE decode is slow or OOMs at decode time

The Wan VAE decode is a known pressure point. Per the documented 7900 XTX runs in HF discussion #14, reducing the VAE tile size to 256 makes decode tractable (the user reports a 720×480 video decoding in ~16 s after the change). Use ComfyUI's tiled VAE decode node and set the tile size to 256 if a full-frame decode stalls or runs out of memory.
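For workflows driven through ComfyUI's API-format JSON, the tile size can be set programmatically. This is a sketch under assumptions: it presumes the tiled decode node is exported with class_type VAEDecodeTiled and a tile_size input, and set_vae_tile_size is a hypothetical helper. It only adjusts tiled nodes that already exist and reports plain VAEDecode nodes so they can be swapped in the UI.

```python
import copy


def set_vae_tile_size(workflow: dict, tile_size: int = 256):
    """Return (patched copy, ids of decode nodes that are still untiled)."""
    wf = copy.deepcopy(workflow)
    untiled = []
    for node_id, node in wf.items():
        if node["class_type"] == "VAEDecodeTiled":
            # Smaller tiles lower peak decode memory at some cost in speed.
            node["inputs"]["tile_size"] = tile_size
        elif node["class_type"] == "VAEDecode":
            # Full-frame decode: left untouched, reported to the caller.
            untiled.append(node_id)
    return wf, untiled
```

A non-empty second return value means the graph still decodes full frames somewhere, which is the case the tile-size fix is meant to avoid.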

"Torch not compiled with CUDA enabled"

A CUDA build of PyTorch got installed instead of the ROCm build. Per the ComfyUI README troubleshooting note, reinstall against the ROCm wheel index:

pip uninstall torch
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm7.2

The output looks like the un-distilled model (too few details, or needs more steps)

The 4-step speedup only applies if the LoRA is actually loaded and CFG is disabled. Check that: (1) the LoraLoaderModelOnly node is in the graph at strength 1.0 feeding the sampler; (2) KSampler CFG = 1.0 (no classifier-free guidance); (3) steps = 4 with the lcm sampler (the model card's "LCM scheduler") and shift 5.0 per the model card. If CFG is left at the base model's default (e.g. 6.0), the distill behaves incorrectly and looks broken.
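The three checks above can be run against a workflow exported in ComfyUI's API format. This is a hedged sketch: find_distill_problems is a hypothetical helper, and it assumes the node classes LoraLoaderModelOnly, KSampler, and ModelSamplingSD3 with the input names strength_model, cfg, steps, and shift. It checks values only, not whether the LoRA node is actually wired into the sampler's model input.

```python
def find_distill_problems(workflow: dict) -> list:
    """Return human-readable problems with the 4-step settings (empty if none)."""
    problems = []
    nodes = list(workflow.values())

    loras = [n for n in nodes if n["class_type"] == "LoraLoaderModelOnly"]
    if not loras:
        problems.append("no LoraLoaderModelOnly node in the graph")
    for node in loras:
        strength = node["inputs"].get("strength_model")
        if strength != 1.0:
            problems.append(f"LoRA strength is {strength}, expected 1.0")

    for node in nodes:
        inputs = node["inputs"]
        if node["class_type"] == "KSampler":
            if inputs.get("cfg") != 1.0:
                problems.append(f"KSampler cfg is {inputs.get('cfg')}, expected 1.0")
            if inputs.get("steps") != 4:
                problems.append(f"KSampler steps is {inputs.get('steps')}, expected 4")
        elif node["class_type"] == "ModelSamplingSD3":
            if inputs.get("shift") != 5.0:
                problems.append(f"shift is {inputs.get('shift')}, expected 5.0")
    return problems
```

An empty list means the settings match the model card's recommendations; it does not guarantee good output, only that the common misconfigurations are ruled out.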

Do not install SageAttention, Q8 Kernels, xformers, or FP8 weights

Guides written for the NVIDIA LightX2V framework recommend building SageAttention, installing Q8 Kernels, or downloading the distill_fp8/ or distill_int8/ weights. None of those apply to RDNA3: there is no FP8 hardware (an INT8 path would map to RDNA3's WMMA IU8 units in principle, but no reproducible ComfyUI/ROCm INT8 Wan-DiT inference path exists today), the ROCm xformers fork is limited, and ComfyUI already routes attention through PyTorch SDPA. Run the base Wan2.1 BF16 weights plus the rank-64 LoRA, and force SDPA with --use-pytorch-cross-attention if needed.

Report new issues via the submission form. Community RX 7900 XTX benchmarks would directly improve the /check/lightx2v/rx-7900-xtx data.