What You'll Build
A local 4-step text-to-video setup on a 16 GB Radeon RX 7800 XT (RDNA3, Navi 32, gfx1101) that runs the LightX2V step/CFG-distillation of Wan2.1-T2V-14B inside ComfyUI on the ROCm stack. Instead of the LightX2V framework's NVIDIA-only FP8/INT8 kernel path, this recipe takes the ComfyUI-GGUF route: load a GGUF-quantized base Wan2.1-T2V-14B diffusion model and apply LightX2V's published rank-64 4-step distill LoRA, then sample at 4 steps with CFG disabled. The LoRA is what cuts the base model's ~25-40 step run down to 4; the GGUF quant is what fits the 14B DiT into 16 GB.
Hardware data: RX 7800 XT (16GB VRAM) · GGUF base Wan2.1-T2V-14B + LightX2V 4-step LoRA · ComfyUI on ROCm · See benchmark data
⚠️ This is a ROCm recipe, not CUDA, and not the LightX2V framework. The RX 7800 XT runs on AMD's ROCm/HIP stack: there is no cu124/cu128 wheel, no SageAttention build, no Q8 Kernels, no pip install xformers, and no FP8/FP4 path here. RDNA3 has no FP8/FP4 hardware (its WMMA units accept FP16, BF16, INT8, INT4 only; source: AMD GPUOpen "WMMA on RDNA3"), so the lightx2v repo's distill_fp8/ weights would just upcast to BF16: no memory win, no compute win. GGUF, by contrast, is arch-independent: the same Q5_K_M file loads identically on AMD and NVIDIA. The attention path is PyTorch SDPA (ComfyUI's default; explicit flag --use-pytorch-cross-attention), not FlashAttention-2 and not xformers. If a guide tells you to build SageAttention, install Q8 Kernels, or pick a cu12x wheel for this card, it was written for the wrong vendor.
⚠️ Why GGUF here and not BF16 like the 24 GB sibling. The RX 7900 XTX recipe runs the full BF16 Wan2.1-14B DiT (28.58 GB on disk) and lets ComfyUI offload it through 24 GB of VRAM. That offload gets much heavier at 16 GB, and RDNA3 has no FP8 escape hatch to shrink the DiT in hardware. The path that fits 16 GB cleanly is a GGUF quant of the base model: city96's Q5_K_M DiT is 11.26 GB (Q4_K_M is 10.12 GB), the rank-64 LoRA adds ~0.60 GB, the VAE is ~0.25 GB, and the GGUF UMT5 text encoder runs on CPU — leaving comfortable headroom for activations. GGUF inference needs no FP8 hardware, which is exactly why it is the right route on Radeon.
⚠️ Honest fragility: this inherits Wan2.1's instability on AMD. LightX2V is a distillation over Wan2.1: it rides the exact same ComfyUI Wan backend, and that backend is AMD's most finicky surface, not its smoothest. The 4-step LoRA lowers the step count and therefore the wall-clock time, but it does not fix the underlying ROCm video-stack rough edges: large-model loads can stall (see --disable-pinned-memory under Troubleshooting), VAE decode needs tile-size tuning, throughput trails NVIDIA badly, and the GGUF loader itself has an open AMD-specific bug (ComfyUI-GGUF #300, see Troubleshooting). A real RX 7900 XTX run of the base Wan2.1 14B in ComfyUI is documented in HF Wan2.1 discussion #14: a user reports running it on a 7900 XTX under ROCm and eventually generating an 81-frame clip in just under 20 minutes at 30 steps on the 14B model with TeaCache + torch.compile. That was an image-to-video run on a 24 GB card, not this recipe's text-to-video on 16 GB, so treat it as context, not a transferable number. The 4-step distill path is what makes this practical on a smaller card, but treat it as supported-but-finicky, not turnkey.
Requirements
| Component | Minimum | Tested |
|---|---|---|
| GPU | 16 GB VRAM (ROCm-supported AMD card) for the GGUF 14B DiT | RX 7800 XT (16 GB, Navi 32, gfx1101) |
| RAM | 32 GB system (the GGUF umT5 encoder runs on CPU; DiT weights stream through host RAM) | — |
| Storage | ~16 GB (Q5_K_M DiT + GGUF umT5 + VAE + LoRA) | — |
| Driver | AMD ROCm 6.3+ on Linux (7.2.x current) | ROCm 6.3 per HF discussion #14 |
| Software | ComfyUI + PyTorch (ROCm build) + ComfyUI-GGUF, Python 3.10+ | — |
Every model in this stack is Apache 2.0 licensed and not gated on Hugging Face, so no access request or login is required. The LightX2V distill repo (lightx2v/Wan2.1-T2V-14B-StepDistill-CfgDistill-Lightx2v) is Apache 2.0; the base Wan-AI/Wan2.1-T2V-14B states "The models in this repository are licensed under the Apache 2.0 License."; and city96's GGUF conversion carries the same apache-2.0 license tag and identifies itself as "a direct GGUF conversion of Wan-AI/Wan2.1-T2V-14B".
Why the ComfyUI-GGUF LoRA route rather than the lightx2v Python framework: the framework's documented acceleration (SageAttention, Q8 Kernels, FP8/INT8 distill weights) is built for NVIDIA sm_89/sm_90 and has no RDNA3 equivalent. But LightX2V also publishes its distillation as a plain ComfyUI LoRA — loras/Wan21_T2V_14B_lightx2v_cfg_step_distill_lora_rank64.safetensors (631,344,264 bytes ≈ 0.60 GB, verified via the HF tree API) — which applies on top of the base Wan2.1 model. Combined with a GGUF quant of that base, both halves run on this card.
Installation
1. Install ComfyUI
Per the ComfyUI README, clone the repo:
git clone https://github.com/comfyanonymous/ComfyUI.git
cd ComfyUI
2. Install PyTorch for ROCm
The RX 7800 XT (gfx1101) is an officially ROCm-supported GPU, so it uses the stable ROCm PyTorch wheel. Per the ComfyUI README "AMD GPUs (Linux)" section:
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm7.2
ℹ️ Verify the ROCm tag before you copy it. The rocmX.Y tag moves over time (6.3 → 6.4 → 7.x). Read the current line in the live ComfyUI README before running: the documented RX 7900 XTX run referenced above was on ROCm 6.3; the current ComfyUI README pins rocm7.2. Confirm the installed build is the ROCm one: python -c "import torch; print(torch.__version__)" should print a version with a +rocm-style suffix, and torch.cuda.is_available() should return True (ROCm masquerades as the cuda device namespace under HIP). As an officially-supported gfx1101 card, the 7800 XT does not need HSA_OVERRIDE_GFX_VERSION.
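The build check described above can also be scripted. The following is a minimal sketch, not part of ComfyUI: the helper name classify_torch_build is hypothetical, and it assumes only that PyTorch exposes torch.version.hip (set on ROCm builds) and torch.version.cuda (set on CUDA builds), falling back to the version-string suffix.

```python
from typing import Optional


def classify_torch_build(version: str, hip: Optional[str], cuda: Optional[str]) -> str:
    """Classify a PyTorch build as 'rocm', 'cuda', or 'cpu' from its metadata.

    Hypothetical helper for this recipe; it only inspects version strings.
    """
    if hip or "+rocm" in version:
        return "rocm"
    if cuda or "+cu" in version:
        return "cuda"
    return "cpu"


def main() -> None:
    try:
        import torch
    except ImportError:
        print("PyTorch is not installed in this environment.")
        return
    kind = classify_torch_build(
        torch.__version__,
        getattr(torch.version, "hip", None),
        getattr(torch.version, "cuda", None),
    )
    print(f"torch {torch.__version__} -> {kind} build")
    # On a working ROCm install this reports True, because HIP devices are
    # exposed through the torch.cuda namespace.
    print(f"torch.cuda.is_available(): {torch.cuda.is_available()}")
    if kind != "rocm":
        print("Not a ROCm build; reinstall from the ROCm wheel index.")


if __name__ == "__main__":
    main()
```

Run it inside the same Python environment that will launch ComfyUI, since a separate virtualenv can hold a different PyTorch build.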
3. Install ComfyUI dependencies and the ComfyUI-GGUF node
Per the ComfyUI README "Dependencies" section:
pip install -r requirements.txt
Then add the ComfyUI-GGUF custom node, which provides the GGUF model and CLIP loaders:
cd custom_nodes
git clone https://github.com/city96/ComfyUI-GGUF.git
pip install -r ComfyUI-GGUF/requirements.txt
cd ..
4. Download the GGUF base Wan2.1-T2V-14B
city96's GGUF conversion of the base Wan2.1-T2V-14B is arch-independent — the same file loads on AMD and NVIDIA. On a 16 GB card, the Q5_K_M tier (11.26 GB DiT) is the recommended balance of quality and headroom; Q4_K_M (10.12 GB) gives more room if you hit memory pressure. Per the city96 README, place GGUF model files in models/unet. Sizes verified via the HF tree API:
# Recommended: Q5_K_M DiT (11.26 GB). Drop to wan2.1-t2v-14b-Q4_K_M.gguf (10.12 GB) if memory is tight.
wget -P models/unet/ \
"https://huggingface.co/city96/Wan2.1-T2V-14B-gguf/resolve/main/wan2.1-t2v-14b-Q5_K_M.gguf"
5. Download the GGUF UMT5 text encoder and the VAE
The UMT5-XXL text encoder is also available as GGUF from city96 — loaded with the ComfyUI-GGUF CLIPLoader (GGUF) node and kept on CPU during sampling so it costs no VRAM. The Wan 2.1 VAE comes from the Comfy-Org repackaged repo (254 MB, verified via the HF tree API):
# GGUF UMT5 encoder (Q5_K_M, 4.15 GB) — loaded via CLIPLoader (GGUF), runs on CPU
wget -P models/text_encoders/ \
"https://huggingface.co/city96/umt5-xxl-encoder-gguf/resolve/main/umt5-xxl-encoder-Q5_K_M.gguf"
# Wan 2.1 VAE
wget -P models/vae/ \
"https://huggingface.co/Comfy-Org/Wan_2.1_ComfyUI_repackaged/resolve/main/split_files/vae/wan_2.1_vae.safetensors"
6. Download the LightX2V 4-step distill LoRA
This is the LightX2V part — the rank-64 step/CFG distillation, as a ComfyUI LoRA (HF tree API):
# Wan21_T2V_14B_lightx2v_cfg_step_distill_lora_rank64.safetensors — 631,344,264 bytes ≈ 0.60 GB
wget -P models/loras/ \
"https://huggingface.co/lightx2v/Wan2.1-T2V-14B-StepDistill-CfgDistill-Lightx2v/resolve/main/loras/Wan21_T2V_14B_lightx2v_cfg_step_distill_lora_rank64.safetensors"
Running
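Before launching, it can help to confirm that the four downloads from the Installation steps landed where ComfyUI looks for them. The sketch below is an illustrative pre-flight check, not a ComfyUI feature: check_model_files and EXPECTED_FILES are hypothetical names, the paths mirror the wget commands in this recipe, and only the LoRA has an exact byte count quoted in the text, so it is the only file whose size is compared.

```python
from pathlib import Path
from typing import Dict, List, Optional

# Relative paths mirror the wget targets in this recipe. The value is the
# exact expected size in bytes, or None when the recipe quotes no byte count.
EXPECTED_FILES: Dict[str, Optional[int]] = {
    "models/unet/wan2.1-t2v-14b-Q5_K_M.gguf": None,
    "models/text_encoders/umt5-xxl-encoder-Q5_K_M.gguf": None,
    "models/vae/wan_2.1_vae.safetensors": None,
    "models/loras/Wan21_T2V_14B_lightx2v_cfg_step_distill_lora_rank64.safetensors": 631_344_264,
}


def check_model_files(
    comfy_root: str, expected: Optional[Dict[str, Optional[int]]] = None
) -> List[str]:
    """Return a list of human-readable problems; an empty list means all good."""
    expected = EXPECTED_FILES if expected is None else expected
    root = Path(comfy_root)
    problems: List[str] = []
    for rel_path, size in expected.items():
        path = root / rel_path
        if not path.is_file():
            problems.append(f"missing: {rel_path}")
            continue
        actual = path.stat().st_size
        if size is not None and actual != size:
            # A size mismatch usually points at an interrupted download.
            problems.append(f"size mismatch: {rel_path} is {actual} bytes, expected {size}")
    return problems


if __name__ == "__main__":
    issues = check_model_files(".")  # run from the ComfyUI repo root
    print("\n".join(issues) if issues else "All model files found.")
```

If you downloaded the Q4_K_M DiT instead, pass your own mapping as the second argument with the Q4_K_M filename.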
Launch ComfyUI from the repo root. Per the ComfyUI README "Running" section:
python main.py
This starts the server (default http://127.0.0.1:8188). Open it, start from the built-in Wan 2.1 text-to-video template (the ComfyUI Wan video tutorial describes the graph), and make these changes for the GGUF + LightX2V 4-step path:
- Replace Load Diffusion Model with the Unet Loader (GGUF) node from ComfyUI-GGUF, and select wan2.1-t2v-14b-Q5_K_M.gguf. This loads the quantized base DiT.
- Replace the text-encoder loader with CLIPLoader (GGUF), type wan, and select umt5-xxl-encoder-Q5_K_M.gguf. The GGUF encoder runs on CPU, so it costs no VRAM during sampling.
- Insert a LoraLoaderModelOnly node between the Unet loader and the sampler, and select Wan21_T2V_14B_lightx2v_cfg_step_distill_lora_rank64.safetensors at strength 1.0. This is what converts the base model into the 4-step distilled model. (LoRA loading on GGUF models is supported: the ComfyUI-GGUF README notes "LoRA loading is experimental but it should work with just the built-in LoRA loader node(s).")
- Set the sampler to the distilled schedule. Per the LightX2V model card, the distilled checkpoint generates videos "with significantly fewer inference steps (4 steps) and without classifier-free guidance", recommending the LCM scheduler with shift=5.0 and guidance_scale=1.0. In ComfyUI terms: KSampler steps = 4, CFG = 1.0, sampler lcm (or euler), and the ModelSamplingSD3 (shift) node set to 5.0.
Start at 480×832, 81 frames — the model's training resolution — and only push to 720×1280 after confirming a clean run. Generated videos land in ComfyUI/output/.
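If you drive ComfyUI from an exported API-format workflow rather than the canvas, the sampler settings listed above can be applied programmatically. This is a hedged sketch: apply_distill_settings is a hypothetical helper, the node IDs in the example are made up, and the input names (steps, cfg, sampler_name, shift, strength_model) are assumed to match what your own exported JSON shows for these nodes, so check them against that file before relying on it.

```python
import copy
import json
from typing import Any, Dict

Workflow = Dict[str, Dict[str, Any]]


def apply_distill_settings(workflow: Workflow) -> Workflow:
    """Return a copy of an API-format workflow with the 4-step distill settings.

    Assumes the usual API-format shape: {node_id: {"class_type": ..., "inputs": {...}}}.
    """
    patched = copy.deepcopy(workflow)
    for node in patched.values():
        inputs = node.setdefault("inputs", {})
        class_type = node.get("class_type")
        if class_type == "KSampler":
            inputs["steps"] = 4            # distilled schedule
            inputs["cfg"] = 1.0            # CFG disabled
            inputs["sampler_name"] = "lcm"
        elif class_type == "ModelSamplingSD3":
            inputs["shift"] = 5.0
        elif class_type == "LoraLoaderModelOnly":
            inputs["strength_model"] = 1.0
    return patched


if __name__ == "__main__":
    # Tiny illustrative fragment; a real export contains many more nodes.
    example: Workflow = {
        "3": {"class_type": "KSampler", "inputs": {"steps": 30, "cfg": 6.0, "sampler_name": "uni_pc"}},
        "48": {"class_type": "ModelSamplingSD3", "inputs": {"shift": 8.0}},
        "50": {"class_type": "LoraLoaderModelOnly", "inputs": {"strength_model": 0.8}},
    }
    print(json.dumps(apply_distill_settings(example), indent=2))
```

The function copies the workflow instead of mutating it, so the original export stays available for comparison against the base model's settings.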
Results
- Speed: No RX 7800 XT benchmark for the 4-step LightX2V path has been published, and /check/lightx2v/rx-7800-xt currently returns verdict: unknown with no benchmark rows. The closest verifiable AMD datapoint is for the un-distilled base Wan2.1 14B on a different, larger card: HF discussion #14 reports an 81-frame clip generated in just under 20 minutes at 30 steps on an RX 7900 XTX (24 GB), and that was an image-to-video run. That is the 30-step base run on a bigger GPU, not this recipe's 4-step distilled text-to-video on the 16 GB 7800 XT. The whole point of the LoRA is to cut the step count by ~7×, but no one has published a 4-step number on this card. Rather than quote a misleading figure, this recipe omits wall-clock speed. What is a model fact: the distill runs in 4 steps instead of the base model's 25-40. Empirical RX 7800 XT numbers will land at /check/lightx2v/rx-7800-xt once a community benchmark is submitted via /contribute.
- VRAM usage: The GGUF path is sized to fit 16 GB. The Q5_K_M base DiT is 11.26 GB on disk, the rank-64 LoRA adds ~0.60 GB, and the Wan VAE is ~0.25 GB (all per the city96 tree API and HF tree API): a ~12 GB resident envelope that leaves headroom for activations and latents, with the 4.15 GB GGUF UMT5 encoder kept on CPU. This is a derived on-disk envelope, not a measured runtime peak; drop to Q4_K_M (10.12 GB DiT) if a generation OOMs. See /check/lightx2v/rx-7800-xt for community measurements as they land.
- Quality notes: The distill trades fine motion detail and prompt fidelity for the 4-step / no-CFG speedup. Use the card's recommended settings (LCM scheduler, shift=5.0, guidance_scale=1.0, 4 steps) and stay at the model's training resolutions (480×832, 720×1280). A LoRA strength below 1.0 weakens the distillation and can reintroduce the need for more steps. The Q5_K_M quant gives slightly higher fidelity than Q4_K_M at the cost of ~1 GB more VRAM.
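The ~12 GB envelope in the VRAM bullet above is plain addition over the on-disk sizes quoted in this recipe. The sketch below reproduces that arithmetic for both quant tiers; the function names are hypothetical, and the 16.0 figure is the card's nominal capacity, which is an assumption that overstates what the driver actually leaves free at runtime.

```python
# On-disk sizes in GB, as quoted in this recipe (not measured runtime peaks).
DIT_GB = {"Q5_K_M": 11.26, "Q4_K_M": 10.12}
LORA_GB = 0.60
VAE_GB = 0.25
NOMINAL_VRAM_GB = 16.0  # assumption: nominal card capacity, not free VRAM


def resident_envelope_gb(quant: str) -> float:
    """Sum of the weights expected to sit in VRAM: DiT + LoRA + VAE."""
    return DIT_GB[quant] + LORA_GB + VAE_GB


def headroom_gb(quant: str, vram_gb: float = NOMINAL_VRAM_GB) -> float:
    """Nominal VRAM left for activations and latents after the weights."""
    return vram_gb - resident_envelope_gb(quant)


if __name__ == "__main__":
    for quant in DIT_GB:
        print(
            f"{quant}: {resident_envelope_gb(quant):.2f} GB resident, "
            f"{headroom_gb(quant):.2f} GB nominal headroom"
        )
    # Q5_K_M: 12.11 GB resident, 3.89 GB nominal headroom
    # Q4_K_M: 10.97 GB resident, 5.03 GB nominal headroom
```

The UMT5 encoder is left out of the sum because this recipe keeps it on CPU; add its 4.15 GB if your workflow loads it onto the GPU.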
For the full benchmark data and other-GPU comparisons, see /check/lightx2v/rx-7800-xt.
Troubleshooting
GGUF output is pure noise on the GPU (but correct on CPU)
A community user reports in ComfyUI-GGUF #300 that GGUF workflows produce a "noise-filled image when running on my AMD GPU" on an RX 7900 XT under ROCm, while "the same workflow generates the correct image" in CPU-only mode; a second user confirms the same on a 7900 XTX. The issue is open with no maintainer response yet, and was reported against Flux GGUF rather than Wan, so it may or may not bite this workflow — but it is an RDNA3/ROCm GGUF-loader failure mode worth knowing. If your output is all noise, first confirm correctness with python main.py --cpu (slow but a clean diagnostic), then try a different ROCm version or quant tier and report your result on the issue. Track ComfyUI-GGUF #300 for a fix.
Model load stalls / hangs when loading the 14B DiT
ROCm's pinned-memory and smart-memory paths can stall large video-model loads on RDNA3. Per the AMD-ROCm video-stack guidance and ComfyUI's launch flags, the go-to fix is to disable them:
python main.py --disable-pinned-memory --disable-smart-memory
Add --use-pytorch-cross-attention as well if attention auto-selection misbehaves — it forces the PyTorch SDPA path, which is the correct attention backend on RDNA3 (there is no FlashAttention-2 or xformers path on this card).
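If you switch between these workarounds often, a small wrapper can assemble the launch command for you. This is only a convenience sketch: build_launch_command is a hypothetical helper, and the flags it emits are the ones named in this document, so confirm them against python main.py --help on your ComfyUI version.

```python
import sys
from typing import List


def build_launch_command(
    disable_pinned_memory: bool = False,
    disable_smart_memory: bool = False,
    force_sdpa: bool = False,
    cpu_only: bool = False,
    python: str = sys.executable,
) -> List[str]:
    """Build the argv list for launching ComfyUI with selected workaround flags."""
    cmd = [python, "main.py"]
    if disable_pinned_memory:
        cmd.append("--disable-pinned-memory")
    if disable_smart_memory:
        cmd.append("--disable-smart-memory")
    if force_sdpa:
        cmd.append("--use-pytorch-cross-attention")
    if cpu_only:
        cmd.append("--cpu")  # slow, but useful as a correctness diagnostic
    return cmd


if __name__ == "__main__":
    # Load-stall workaround from this section, plus forced SDPA attention.
    cmd = build_launch_command(
        disable_pinned_memory=True, disable_smart_memory=True, force_sdpa=True
    )
    print(" ".join(cmd))
    # To actually launch, run from the ComfyUI repo root, for example:
    #   import subprocess; subprocess.run(cmd, check=True)
```

Building the command as a list rather than a shell string avoids quoting problems if the interpreter path contains spaces.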
VAE decode is slow or OOMs at decode time
The Wan VAE decode is a known pressure point and is tighter on 16 GB than on 24 GB. Per the documented 7900 XTX runs in HF discussion #14, reducing the VAE tile size to 256 makes decode tractable (the user reports a 720×480 video decoding in ~16 s after the change). Use ComfyUI's tiled VAE decode node and set the tile size to 256 if a full-frame decode stalls or runs out of memory.
"Torch not compiled with CUDA enabled"
A CUDA build of PyTorch got installed instead of the ROCm build. Per the ComfyUI README troubleshooting note, reinstall against the ROCm wheel index:
pip uninstall torch
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm7.2
The output looks like the un-distilled model (too few details, or needs more steps)
The 4-step speedup only applies if the LoRA is actually loaded and CFG is disabled. Check that: (1) the LoraLoaderModelOnly node is in the graph at strength 1.0 feeding the sampler; (2) KSampler CFG = 1.0 (no classifier-free guidance); (3) steps = 4 with the LCM scheduler and shift 5.0 per the model card. If CFG is left at the base model's default (e.g. 6.0), the distill behaves incorrectly and looks broken.
Do not install SageAttention, Q8 Kernels, xformers, or FP8 weights
Guides written for the NVIDIA LightX2V framework recommend building SageAttention, installing Q8 Kernels, or downloading the distill_fp8/ or distill_int8/ weights. None of those apply to RDNA3: there is no FP8 hardware (an INT8 path would map to RDNA3's WMMA IU8 units in principle, but no reproducible ComfyUI/ROCm INT8 Wan-DiT inference path exists today), the ROCm xformers fork is limited, and ComfyUI already routes attention through PyTorch SDPA. Run the GGUF base weights plus the rank-64 LoRA, and force SDPA with --use-pytorch-cross-attention if needed.
Report new issues via submission form — community RX 7800 XT benchmarks would directly improve the /check/lightx2v/rx-7800-xt data.