
LightX2V 4-Step on RX 7800 XT: Distilled Wan2.1-T2V-14B in ComfyUI on ROCm via GGUF + 4-Step LoRA

video · advanced · 16GB+ VRAM · Jun 19, 2026

This advanced recipe sets up LightX2V on the RX 7800 XT, needing about 16 GB of VRAM.

prerequisites
  • AMD Radeon RX 7800 XT (16 GB VRAM, RDNA3 / Navi 32 / gfx1101) or equivalent ROCm-supported card
  • Linux (Ubuntu 24.04 / 22.04 or RHEL) with the AMD ROCm stack installed (ROCm 6.3+ / 7.2.x)
  • Python 3.10+
  • 32 GB+ system RAM (the GGUF UMT5 text encoder runs on CPU, and ComfyUI streams DiT weights through host RAM)
  • ~16 GB free disk (Q5_K_M GGUF DiT 11.26 GB + GGUF umT5 4.15 GB + VAE 0.25 GB + the ~0.60 GB distill LoRA)
  • ComfyUI installed (git clone) with PyTorch built for ROCm, plus the ComfyUI-GGUF custom node
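
The disk figure in the list is the sum of the four downloads. A minimal sketch of that arithmetic, using the sizes quoted above and treating them all as the same unit; the `fits_on_disk` helper and its 1 GB margin are illustrative assumptions, not part of ComfyUI:

```python
import shutil

# Download sizes in GB, as quoted in the prerequisites list.
DOWNLOADS_GB = {
    "Q5_K_M GGUF DiT": 11.26,
    "GGUF umT5 encoder": 4.15,
    "Wan 2.1 VAE": 0.25,
    "rank-64 distill LoRA": 0.60,
}

def total_download_gb(sizes=DOWNLOADS_GB):
    """Sum the download sizes, rounded to two decimals."""
    return round(sum(sizes.values()), 2)

def fits_on_disk(path=".", margin_gb=1.0):
    """Hypothetical helper: True if `path` has room for all downloads plus a margin."""
    free_gb = shutil.disk_usage(path).free / 1e9
    return free_gb >= total_download_gb() + margin_gb

if __name__ == "__main__":
    print(f"total download: {total_download_gb()} GB")
    print("enough free disk here:", fits_on_disk("."))
```

With these figures the total comes to 16.26 GB, so the "~16 GB" above is a floor rather than a comfortable allowance.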

What You'll Build

A local 4-step text-to-video setup on a 16 GB Radeon RX 7800 XT (RDNA3, Navi 32, gfx1101) that runs the LightX2V step/CFG-distillation of Wan2.1-T2V-14B inside ComfyUI on the ROCm stack. Instead of the LightX2V framework's NVIDIA-only FP8/INT8 kernel path, this recipe takes the ComfyUI-GGUF route: load a GGUF-quantized base Wan2.1-T2V-14B diffusion model and apply LightX2V's published rank-64 4-step distill LoRA, then sample at 4 steps with CFG disabled. The LoRA is what cuts the base model's ~25-40 step run down to 4; the GGUF quant is what fits the 14B DiT into 16 GB.

Hardware data: RX 7800 XT (16GB VRAM) · GGUF base Wan2.1-T2V-14B + LightX2V 4-step LoRA · ComfyUI on ROCm · See benchmark data

⚠️ This is a ROCm recipe, not CUDA — and not the LightX2V framework. The RX 7800 XT runs on AMD's ROCm/HIP stack: there is no cu124/cu128 wheel, no SageAttention build, no Q8-Kernels, no pip install xformers, and no FP8/FP4 path here. RDNA3 has no FP8/FP4 hardware (its WMMA units accept FP16, BF16, INT8, INT4 only — source: AMD GPUOpen "WMMA on RDNA3"), so the lightx2v repo's distill_fp8/ weights would just upcast to BF16 — no memory win, no compute win. GGUF, by contrast, is arch-independent: the same Q5_K_M file loads identically on AMD and NVIDIA. The attention path is PyTorch SDPA (ComfyUI's default; explicit flag --use-pytorch-cross-attention), not FlashAttention-2 and not xformers. If a guide tells you to build SageAttention, install Q8 Kernels, or pick a cu12x wheel for this card, it was written for the wrong vendor.

⚠️ Why GGUF here and not BF16 like the 24 GB sibling. The RX 7900 XTX recipe runs the full BF16 Wan2.1-14B DiT (28.58 GB on disk) and lets ComfyUI offload it through 24 GB of VRAM. That offload gets much heavier at 16 GB, and RDNA3 has no FP8 escape hatch to shrink the DiT in hardware. The path that fits 16 GB cleanly is a GGUF quant of the base model: city96's Q5_K_M DiT is 11.26 GB (Q4_K_M is 10.12 GB), the rank-64 LoRA adds ~0.60 GB, the VAE is ~0.25 GB, and the GGUF UMT5 text encoder runs on CPU — leaving comfortable headroom for activations. GGUF inference needs no FP8 hardware, which is exactly why it is the right route on Radeon.

⚠️ Honest fragility: this inherits Wan2.1's instability on AMD. LightX2V is a distillation over Wan2.1 — it rides the exact same ComfyUI Wan backend, and that backend is AMD's most finicky surface, not its smoothest. The 4-step LoRA lowers the step count and therefore the wall-clock, but it does not fix the underlying ROCm video-stack rough edges: large-model loads can stall (--disable-pinned-memory), VAE decode needs tile-size tuning, throughput trails NVIDIA badly, and the GGUF loader itself has an open AMD-specific bug (ComfyUI-GGUF #300 — see Troubleshooting). A real RX 7900 XTX run of the base Wan2.1 14B in ComfyUI is documented in HF Wan2.1 discussion #14: a user reports running it on a 7900 XTX under ROCm and eventually generating an 81-frame clip in just under 20 minutes at 30 steps on the 14B model with TeaCache + torch.compile (that was an image-to-video run on a 24 GB card, not this recipe's text-to-video on 16 GB — context, not a transferable number). The 4-step distill path is what makes this practical on a smaller card — but treat it as supported-but-finicky, not turnkey.

Requirements

Component | Minimum | Tested
GPU | 16 GB VRAM (ROCm-supported AMD card) for the GGUF 14B DiT | RX 7800 XT (16 GB, Navi 32, gfx1101)
RAM | 32 GB system (the GGUF umT5 encoder runs on CPU; DiT weights stream through host RAM) |
Storage | ~16 GB (Q5_K_M DiT + GGUF umT5 + VAE + LoRA) |
Driver | AMD ROCm 6.3+ on Linux (7.2.x current) | ROCm 6.3 per HF discussion #14
Software | ComfyUI + PyTorch (ROCm build) + ComfyUI-GGUF, Python 3.10+ |

Every layer of this stack is Apache 2.0 and not gated on Hugging Face, so no access request or login is required. The LightX2V distill repo (lightx2v/Wan2.1-T2V-14B-StepDistill-CfgDistill-Lightx2v) is Apache 2.0; the base Wan-AI/Wan2.1-T2V-14B states "The models in this repository are licensed under the Apache 2.0 License."; and city96's GGUF conversion carries the same apache-2.0 license tag and identifies itself as "a direct GGUF conversion of Wan-AI/Wan2.1-T2V-14B".

Why the ComfyUI-GGUF LoRA route rather than the lightx2v Python framework: the framework's documented acceleration (SageAttention, Q8 Kernels, FP8/INT8 distill weights) is built for NVIDIA sm_89/sm_90 and has no RDNA3 equivalent. But LightX2V also publishes its distillation as a plain ComfyUI LoRA — loras/Wan21_T2V_14B_lightx2v_cfg_step_distill_lora_rank64.safetensors (631,344,264 bytes ≈ 0.60 GB, verified via the HF tree API) — which applies on top of the base Wan2.1 model. Combined with a GGUF quant of that base, both halves run on this card.

Installation

1. Install ComfyUI

Per the ComfyUI README, clone the repo:

git clone https://github.com/comfyanonymous/ComfyUI.git
cd ComfyUI

2. Install PyTorch for ROCm

The RX 7800 XT (gfx1101) is an officially ROCm-supported GPU, so it uses the stable ROCm PyTorch wheel. Per the ComfyUI README "AMD GPUs (Linux)" section:

pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm7.2

ℹ️ Verify the ROCm tag before you copy it. The rocmX.Y tag moves over time (6.3 → 6.4 → 7.x). Read the current line in the live ComfyUI README before running — the documented RX 7900 XTX run referenced above was on ROCm 6.3; the current ComfyUI README pins rocm7.2. Confirm the installed build is the ROCm one: python -c "import torch; print(torch.__version__)" should print a +rocm-style suffix, and torch.cuda.is_available() returns True (ROCm masquerades as the cuda device namespace under HIP). As an officially-supported gfx1101 card, the 7800 XT does not need HSA_OVERRIDE_GFX_VERSION.
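
The same check can be scripted. The sketch below relies on `torch.version.hip`, which is a version string on ROCm builds and `None` on CUDA or CPU builds; the `classify_torch_build` helper is a hypothetical name introduced here for illustration:

```python
def classify_torch_build(version, hip_version):
    """Classify a PyTorch build from torch.__version__ and torch.version.hip."""
    if hip_version:  # set on ROCm builds, None otherwise
        return "rocm"
    if "+cu" in version:
        return "cuda"
    if "+cpu" in version:
        return "cpu"
    return "unknown"

def main():
    try:
        import torch
    except ImportError:
        print("PyTorch is not installed in this environment")
        return
    kind = classify_torch_build(torch.__version__, torch.version.hip)
    print(f"torch {torch.__version__} -> {kind} build")
    if kind == "rocm" and torch.cuda.is_available():
        # ROCm devices are exposed through the torch.cuda namespace.
        print("device 0:", torch.cuda.get_device_name(0))

if __name__ == "__main__":
    main()
```

If this reports a cuda or cpu build, the wrong wheel index was used and the ROCm install command needs to be rerun.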

3. Install ComfyUI dependencies and the ComfyUI-GGUF node

Per the ComfyUI README "Dependencies" section:

pip install -r requirements.txt

Then add the ComfyUI-GGUF custom node, which provides the GGUF model and CLIP loaders:

cd custom_nodes
git clone https://github.com/city96/ComfyUI-GGUF.git
pip install -r ComfyUI-GGUF/requirements.txt
cd ..

4. Download the GGUF base Wan2.1-T2V-14B

city96's GGUF conversion of the base Wan2.1-T2V-14B is arch-independent — the same file loads on AMD and NVIDIA. On a 16 GB card, the Q5_K_M tier (11.26 GB DiT) is the recommended balance of quality and headroom; Q4_K_M (10.12 GB) gives more room if you hit memory pressure. Per the city96 README, place GGUF model files in models/unet. Sizes verified via the HF tree API:

# Recommended: Q5_K_M DiT (11.26 GB). Drop to wan2.1-t2v-14b-Q4_K_M.gguf (10.12 GB) if memory is tight.
wget -P models/unet/ \
  "https://huggingface.co/city96/Wan2.1-T2V-14B-gguf/resolve/main/wan2.1-t2v-14b-Q5_K_M.gguf"

5. Download the GGUF UMT5 text encoder and the VAE

The UMT5-XXL text encoder is also available as GGUF from city96 — loaded with the ComfyUI-GGUF CLIPLoader (GGUF) node and kept on CPU during sampling so it costs no VRAM. The Wan 2.1 VAE comes from the Comfy-Org repackaged repo (254 MB, verified via the HF tree API):

# GGUF UMT5 encoder (Q5_K_M, 4.15 GB) — loaded via CLIPLoader (GGUF), runs on CPU
wget -P models/text_encoders/ \
  "https://huggingface.co/city96/umt5-xxl-encoder-gguf/resolve/main/umt5-xxl-encoder-Q5_K_M.gguf"

# Wan 2.1 VAE
wget -P models/vae/ \
  "https://huggingface.co/Comfy-Org/Wan_2.1_ComfyUI_repackaged/resolve/main/split_files/vae/wan_2.1_vae.safetensors"

6. Download the LightX2V 4-step distill LoRA

This is the LightX2V part: the rank-64 step/CFG distillation, published as a ComfyUI LoRA (size per the HF tree API):

# Wan21_T2V_14B_lightx2v_cfg_step_distill_lora_rank64.safetensors — 631,344,264 bytes ≈ 0.60 GB
wget -P models/loras/ \
  "https://huggingface.co/lightx2v/Wan2.1-T2V-14B-StepDistill-CfgDistill-Lightx2v/resolve/main/loras/Wan21_T2V_14B_lightx2v_cfg_step_distill_lora_rank64.safetensors"

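Before launching, it can help to confirm that all four files landed where the loaders look for them. A minimal sketch, assuming it is run from the ComfyUI repo root and that the paths match the wget targets in steps 4 to 6; only the LoRA has an exact byte count quoted in this recipe, so only that size is checked. `check_downloads` is a hypothetical helper, not a ComfyUI function:

```python
from pathlib import Path

# Relative paths follow the wget -P targets in steps 4-6.
# A size of None means "only check that the file exists".
EXPECTED_FILES = {
    "models/unet/wan2.1-t2v-14b-Q5_K_M.gguf": None,
    "models/text_encoders/umt5-xxl-encoder-Q5_K_M.gguf": None,
    "models/vae/wan_2.1_vae.safetensors": None,
    "models/loras/Wan21_T2V_14B_lightx2v_cfg_step_distill_lora_rank64.safetensors": 631_344_264,
}

def check_downloads(root=".", expected=EXPECTED_FILES):
    """Return a list of problems; an empty list means every file looks right."""
    problems = []
    for rel_path, size in expected.items():
        path = Path(root) / rel_path
        if not path.is_file():
            problems.append(f"missing: {rel_path}")
        elif size is not None and path.stat().st_size != size:
            actual = path.stat().st_size
            problems.append(f"size mismatch: {rel_path} is {actual} bytes, expected {size}")
    return problems

if __name__ == "__main__":
    for line in check_downloads(".") or ["all four files present"]:
        print(line)
```

A size mismatch on the LoRA usually points to an interrupted download; rerunning wget with `-c` resumes it.
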
Running

Launch ComfyUI from the repo root. Per the ComfyUI README "Running" section:

python main.py

This starts the server (default http://127.0.0.1:8188). Open it, start from the built-in Wan 2.1 text-to-video template (the ComfyUI Wan video tutorial describes the graph), and make these changes for the GGUF + LightX2V 4-step path:

  1. Replace Load Diffusion Model with the Unet Loader (GGUF) node from ComfyUI-GGUF, and select wan2.1-t2v-14b-Q5_K_M.gguf. This loads the quantized base DiT.
  2. Replace the text-encoder loader with CLIPLoader (GGUF), type wan, and select umt5-xxl-encoder-Q5_K_M.gguf. The GGUF encoder runs on CPU, so it costs no VRAM during sampling.
  3. Insert a LoraLoaderModelOnly node between the Unet loader and the sampler, and select Wan21_T2V_14B_lightx2v_cfg_step_distill_lora_rank64.safetensors at strength 1.0. This is what converts the base model into the 4-step distilled model. (LoRA loading on GGUF models is supported — the ComfyUI-GGUF README notes "LoRA loading is experimental but it should work with just the built-in LoRA loader node(s).")
  4. Set the sampler to the distilled schedule. Per the LightX2V model card, the distilled checkpoint generates videos "with significantly fewer inference steps (4 steps) and without classifier-free guidance", recommending the LCM scheduler with shift=5.0 and guidance_scale=1.0. In ComfyUI terms: KSampler steps = 4, CFG = 1.0, sampler lcm (or euler), and the ModelSamplingSD3/shift node set to 5.0.
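
The model-side half of these changes can be sketched in ComfyUI's API-format workflow layout, where each node has a `class_type` and `inputs`, and a link is written as `[source_node_id, output_index]`. This is a fragment, not a loadable workflow: the text encoder, conditioning, latent, scheduler and VAE decode wiring are left out. The class and input names are assumptions written from memory of the ComfyUI and ComfyUI-GGUF node definitions, so check them against a workflow exported from your own install:

```python
import json

def build_distill_fragment(unet_name, lora_name, steps=4, cfg=1.0, shift=5.0, sampler="lcm"):
    """Partial API-format graph: GGUF DiT -> distill LoRA -> shift -> sampler."""
    return {
        # Assumed class name for the ComfyUI-GGUF "Unet Loader (GGUF)" node.
        "1": {"class_type": "UnetLoaderGGUF", "inputs": {"unet_name": unet_name}},
        # The LoRA sits between the loader and the sampler, at strength 1.0.
        "2": {"class_type": "LoraLoaderModelOnly",
              "inputs": {"model": ["1", 0], "lora_name": lora_name, "strength_model": 1.0}},
        "3": {"class_type": "ModelSamplingSD3",
              "inputs": {"model": ["2", 0], "shift": shift}},
        # Partial KSampler: conditioning, latent and scheduler inputs are omitted.
        "4": {"class_type": "KSampler",
              "inputs": {"model": ["3", 0], "steps": steps, "cfg": cfg, "sampler_name": sampler}},
    }

if __name__ == "__main__":
    fragment = build_distill_fragment(
        "wan2.1-t2v-14b-Q5_K_M.gguf",
        "Wan21_T2V_14B_lightx2v_cfg_step_distill_lora_rank64.safetensors",
    )
    print(json.dumps(fragment, indent=2))
```

The point of the sketch is the ordering: the sampler's model input should trace back through the shift node and the LoRA to the GGUF loader, never to the loader directly.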

Start at 480×832 (one of the model's training resolutions) with 81 frames, and only push to 720×1280 after confirming a clean run. Generated videos land in ComfyUI/output/.

Results

  • Speed: No RX 7800 XT benchmark for the 4-step LightX2V path has been published, and /check/lightx2v/rx-7800-xt currently returns verdict: unknown with no benchmark rows. The closest verifiable AMD datapoint is HF discussion #14, which reports an 81-frame clip generated in just under 20 minutes at 30 steps on an RX 7900 XTX (24 GB). That run used the un-distilled base Wan2.1 14B in image-to-video mode on a larger card, so it does not transfer to this recipe's 4-step text-to-video path on the 16 GB 7800 XT. Going from 30 steps to 4 cuts the step count by about 7.5×, but no one has published a 4-step timing on this card, so this recipe omits wall-clock speed rather than quote a misleading figure. What is established is that the distill runs in 4 steps instead of the base model's 25-40. Empirical RX 7800 XT numbers will land at /check/lightx2v/rx-7800-xt once a community benchmark is submitted via /contribute.
  • VRAM usage: The GGUF path is sized to fit 16 GB. The Q5_K_M base DiT is 11.26 GB on disk, the rank-64 LoRA adds ~0.60 GB, and the Wan VAE is ~0.25 GB (all per the city96 tree API and HF tree API) — a ~12 GB resident envelope that leaves headroom for activations and latents, with the 4.15 GB GGUF UMT5 encoder kept on CPU. This is a derived on-disk envelope, not a measured runtime peak; drop to Q4_K_M (10.12 GB DiT) if a generation OOMs. See /check/lightx2v/rx-7800-xt for community measurements as they land.
  • Quality notes: The distill trades fine motion detail and prompt fidelity for the 4-step / no-CFG speedup. Use the model card's recommended settings (LCM scheduler, shift=5.0, guidance_scale=1.0, 4 steps) and stay at the model's training resolutions (480×832, 720×1280). A LoRA strength below 1.0 weakens the distillation and can reintroduce the need for more steps. The Q5_K_M quant gives slightly higher fidelity than Q4_K_M at the cost of ~1 GB more VRAM.
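
The VRAM envelope quoted above is plain addition over on-disk sizes. A minimal sketch for comparing the two quant tiers, assuming all figures share one unit and that the card exposes a full 16 GB; it says nothing about the runtime peak, which also includes activations and latents:

```python
VRAM_GB = 16.0   # nominal card capacity (assumption: all of it is usable)
LORA_GB = 0.60   # rank-64 distill LoRA
VAE_GB = 0.25    # Wan 2.1 VAE
DIT_GB = {"Q5_K_M": 11.26, "Q4_K_M": 10.12}

def resident_envelope_gb(quant):
    """On-disk size of everything kept on the GPU: DiT + LoRA + VAE."""
    return round(DIT_GB[quant] + LORA_GB + VAE_GB, 2)

def headroom_gb(quant, vram_gb=VRAM_GB):
    """VRAM left over for activations and latents, by the same on-disk estimate."""
    return round(vram_gb - resident_envelope_gb(quant), 2)

if __name__ == "__main__":
    for quant in DIT_GB:
        print(f"{quant}: {resident_envelope_gb(quant)} GB resident, "
              f"{headroom_gb(quant)} GB headroom")
```

By this estimate Q5_K_M leaves 3.89 GB of headroom and Q4_K_M leaves 5.03 GB, which is the margin gained by dropping a tier after an OOM.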

For the full benchmark data and other-GPU comparisons, see /check/lightx2v/rx-7800-xt.

Troubleshooting

GGUF output is pure noise on the GPU (but correct on CPU)

A community user reports in ComfyUI-GGUF #300 that GGUF workflows produce a "noise-filled image when running on my AMD GPU" on an RX 7900 XT under ROCm, while "the same workflow generates the correct image" in CPU-only mode; a second user confirms the same on a 7900 XTX. The issue is open with no maintainer response yet, and was reported against Flux GGUF rather than Wan, so it may or may not bite this workflow — but it is an RDNA3/ROCm GGUF-loader failure mode worth knowing. If your output is all noise, first confirm correctness with python main.py --cpu (slow but a clean diagnostic), then try a different ROCm version or quant tier and report your result on the issue. Track ComfyUI-GGUF #300 for a fix.

Model load stalls / hangs when loading the 14B DiT

ROCm's pinned-memory and smart-memory paths can stall large video-model loads on RDNA3. Per the AMD-ROCm video-stack guidance and ComfyUI's launch flags, the go-to fix is to disable them:

python main.py --disable-pinned-memory --disable-smart-memory

Add --use-pytorch-cross-attention as well if attention auto-selection misbehaves — it forces the PyTorch SDPA path, which is the correct attention backend on RDNA3 (there is no FlashAttention-2 or xformers path on this card).
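
The flag combinations used in this section can be kept in one small launcher so they are not retyped each run. A minimal sketch, assuming it is run from the ComfyUI repo root; `build_launch_args` and its keyword names are hypothetical, and only flags already named in this recipe are emitted:

```python
import shlex
import sys

def build_launch_args(stall_fix=False, force_sdpa=False, cpu_only=False):
    """Assemble a ComfyUI launch command from the flags named in this recipe."""
    args = [sys.executable, "main.py"]
    if cpu_only:
        args.append("--cpu")  # CPU-only diagnostic run
    if stall_fix:
        args += ["--disable-pinned-memory", "--disable-smart-memory"]
    if force_sdpa:
        args.append("--use-pytorch-cross-attention")
    return args

if __name__ == "__main__":
    # Print the command rather than running it, so it can be inspected first.
    print(shlex.join(build_launch_args(stall_fix=True, force_sdpa=True)))
    # To launch instead of printing:
    # import subprocess; subprocess.run(build_launch_args(stall_fix=True), check=True)
```

Adding one flag at a time makes it easier to tell which one resolved a stall.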

VAE decode is slow or OOMs at decode time

The Wan VAE decode is a known pressure point and is tighter on 16 GB than on 24 GB. Per the documented 7900 XTX runs in HF discussion #14, reducing the VAE tile size to 256 makes decode tractable (the user reports a 720×480 video decoding in ~16 s after the change). Use ComfyUI's tiled VAE decode node and set the tile size to 256 if a full-frame decode stalls or runs out of memory.

"Torch not compiled with CUDA enabled"

A CUDA build of PyTorch got installed instead of the ROCm build. Per the ComfyUI README troubleshooting note, reinstall against the ROCm wheel index:

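# Remove all three packages so no CUDA-built torchvision/torchaudio is left behind
pip uninstall torch torchvision torchaudio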
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm7.2

The output looks like the un-distilled model (too few details, or needs more steps)

The 4-step speedup only applies if the LoRA is actually loaded and CFG is disabled. Check that: (1) the LoraLoaderModelOnly node is in the graph at strength 1.0 feeding the sampler; (2) KSampler CFG = 1.0 (no classifier-free guidance); (3) steps = 4 with the LCM scheduler and shift 5.0 per the model card. If CFG is left at the base model's default (e.g. 6.0), the distill behaves incorrectly and looks broken.
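
The checklist above is mechanical enough to express as a function. A minimal sketch: `check_distill_settings` is a hypothetical helper whose arguments are the values read off the graph by hand, and whose expected values are the ones this recipe quotes from the model card:

```python
def check_distill_settings(steps, cfg, shift, lora_strength, lora_loaded=True):
    """Return a list of deviations from the 4-step distill settings."""
    problems = []
    if not lora_loaded:
        problems.append("distill LoRA is not in the graph")
    elif lora_strength != 1.0:
        problems.append(f"LoRA strength is {lora_strength}, expected 1.0")
    if steps != 4:
        problems.append(f"steps is {steps}, expected 4")
    if cfg != 1.0:
        problems.append(f"CFG is {cfg}, expected 1.0 (no classifier-free guidance)")
    if shift != 5.0:
        problems.append(f"shift is {shift}, expected 5.0")
    return problems

if __name__ == "__main__":
    # Example: a graph left at base-model-style settings.
    for problem in check_distill_settings(steps=30, cfg=6.0, shift=5.0, lora_strength=1.0):
        print(problem)
```

An empty list means the graph matches the distilled schedule; anything else is worth fixing before suspecting the GGUF loader or ROCm.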

Do not install SageAttention, Q8 Kernels, xformers, or FP8 weights

Guides written for the NVIDIA LightX2V framework recommend building SageAttention, installing Q8 Kernels, or downloading the distill_fp8/ or distill_int8/ weights. None of those apply to RDNA3: there is no FP8 hardware (an INT8 path would map to RDNA3's WMMA IU8 units in principle, but no reproducible ComfyUI/ROCm INT8 Wan-DiT inference path exists today), the ROCm xformers fork is limited, and ComfyUI already routes attention through PyTorch SDPA. Run the GGUF base weights plus the rank-64 LoRA, and force SDPA with --use-pytorch-cross-attention if needed.

Report new issues via the submission form; community RX 7800 XT benchmarks would directly improve the /check/lightx2v/rx-7800-xt data.

common questions
How much VRAM does LightX2V need?

About 16 GB in this recipe's configuration: the GGUF path is sized to fit a 16 GB card, which is the minimum this recipe targets.

Which GPUs is LightX2V tested on?

RX 7800 XT (16 GB).

How hard is this setup?

Advanced — follow the steps above.