What You'll Build
A local ComfyUI pipeline on a 16 GB Radeon RX 7800 XT (RDNA3, Navi 32, gfx1101) that turns a text prompt (or a starting image) into a 5-second 720p video using the Wan 2.2 TI2V-5B model — the only Wan 2.2 variant the official repo documents as runnable on a single consumer-grade GPU. This runs through AMD's ROCm stack with PyTorch SDPA as the attention path (not FlashAttention), and the ComfyUI native workflow as the canonical lead. The 5B model's FP16 diffusion weights are only ~9.31 GB, so the workload fits the 7800 XT's 16 GB comfortably — VRAM is not the bottleneck here. ROCm's memory-management fragility on large video-model loads is, and this recipe is honest about that.
Hardware data: RX 7800 XT (16GB VRAM, RDNA3 / gfx1101) · 720p (1280×704 / 704×1280) at 24 fps · ComfyUI on ROCm · See benchmark data
⚠️ This is a ROCm recipe, not CUDA. The RX 7800 XT runs on AMD's ROCm/HIP stack: there is no cu124/cu128 wheel, no pip install flash-attn, and no FP8/FP4 path here. RDNA3 has no FP8/FP4 hardware (its WMMA units accept FP16, BF16, INT8, INT4 only), so an FP8 checkpoint just upcasts to FP16 with no memory saving. The 5B model is small enough that you take the native FP16 weights directly and never need a quant. The attention path is PyTorch SDPA (ComfyUI's default), not FlashAttention-2. On the closely-related RX 7900 XTX (RDNA3 / gfx1100, same WMMA family, one gfx step up), an owner running Wan via ComfyUI+ROCm reported that forcing FlashAttention-2 actually increased sampling time by ~50% versus the native SDP path (Wan2.1 AMD support discussion #14). That report is on the XTX, not this card, but it is the same RDNA3 SDP path the 7800 XT uses, so treat it as a same-family signal rather than a 7800 XT measurement. If a guide tells you to install a FlashAttention wheel or pick a cu12x wheel for this card, it's written for the wrong vendor.
⚠️ Honest status: this is an AMD video pairing, and video is the finicky tier on RDNA3. Wan 2.2 TI2V-5B runs on RDNA3 through PyTorch's SDP attention, but with two caveats this recipe does not hide: (1) there is no first-party RX 7800 XT benchmark for this model — the working path is inherited from the same-family RX 7900 XTX and the official ComfyUI tutorial, and you should verify it on your own run; (2) video generation on ROCm is markedly slower and less stable than the NVIDIA reference (the only AMD timing data point, a Wan 2.1 run on a 7900 XTX, was "~24 mins" for 25 steps — a different model version and workflow, see Results). If you hit a load-stall, the Troubleshooting section has the ROCm-specific flags that fix it.
Why TI2V-5B and not the 14B variants? The Wan 2.2 family ships five variants: TI2V-5B (this recipe), T2V-A14B, I2V-A14B, S2V-14B, and Animate-14B. The four 14B-class variants are MoE models the official Wan-Video/Wan2.2 README documents with an 80 GB single-GPU floor, far past a 16 GB card at native precision. Only TI2V-5B is positioned as a single-consumer-GPU target. The Wan-AI HF card describes it as a 5B dense model (one fused checkpoint, no high-noise / low-noise expert split) with a unified text-and-image-to-video architecture and a high-compression 16×16×4 VAE, released alongside the larger 27B-total MoE models. The 14B variants need a different recipe entirely.
Requirements
| Component | Minimum | Tested |
|---|---|---|
| GPU | 8 GB VRAM (ROCm-supported AMD card) — the official ComfyUI tutorial documents the 5B model fitting on 8 GB with ComfyUI native offloading | RX 7800 XT (16 GB, RDNA3 / gfx1101) |
| RAM | 16 GB | 32 GB+ recommended (offloading is RAM-heavy) |
| Storage | ~22 GB (TI2V-5B FP16 9.31 GB + UMT5-XXL FP16 text encoder 11.37 GB + Wan2.2-VAE 1.41 GB) | per HF Files tree |
| Driver | AMD ROCm 7.2.x on Linux | — |
| Software | ComfyUI + PyTorch (ROCm 7.2 build), Python 3.10+ | — |
The model is released under the Apache 2.0 License (per the Wan-AI HF card frontmatter) and the weights are not gated on Hugging Face — no access request or login is required. The official README's single-GPU CLI command for TI2V-5B states it "can run on a GPU with at least 24GB VRAM (e.g, RTX 4090 GPU)" — that is above the 7800 XT's 16 GB, which is exactly why this recipe leads with the ComfyUI native workflow instead: ComfyUI's runtime offloader brings the working floor down to 8 GB (per the tutorial), well within the 7800 XT's envelope.
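Before downloading, it can help to confirm the disk has room for all three files. The sketch below simply adds up the sizes from the Requirements table and compares the total with the free space on the target drive; the 2 GB safety margin and the function names are assumptions made for illustration, not part of any official tooling.

```python
import shutil

# Sizes in GB as listed in the Requirements table (Hugging Face Files tree).
MODEL_FILES_GB = {
    "wan2.2_ti2v_5B_fp16.safetensors": 9.31,
    "umt5_xxl_fp16.safetensors": 11.37,
    "wan2.2_vae.safetensors": 1.41,
}


def total_download_gb(files=MODEL_FILES_GB):
    """Sum of the listed file sizes, rounded to two decimals."""
    return round(sum(files.values()), 2)


def has_room(path=".", margin_gb=2.0):
    """True if `path` has space for every file plus an assumed safety margin."""
    free_gb = shutil.disk_usage(path).free / 1e9
    return free_gb >= total_download_gb() + margin_gb


if __name__ == "__main__":
    print(f"need ~{total_download_gb()} GB; enough free space here: {has_room()}")
```

Run it from the drive that will hold ComfyUI/models/ so the free-space figure refers to the right filesystem.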
Installation
1. Install ComfyUI
Per the ComfyUI README, clone the repo:
git clone https://github.com/comfyanonymous/ComfyUI.git
cd ComfyUI
2. Install PyTorch for ROCm
The RX 7800 XT (gfx1101) is an officially ROCm-supported GPU (named explicitly in the ROCm install-on-linux system-requirements matrix as a gfx1101 card), so it uses the stable ROCm PyTorch wheel — no HSA_OVERRIDE_GFX_VERSION masquerade is required. Per the ComfyUI README "AMD GPUs (Linux)" section, the stable install command is:
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm7.2
ℹ️ Verify the ROCm tag before you copy it. As of this writing the ComfyUI README pins rocm7.2 as the stable wheel, but the rocmX.Y tag moves over time (6.3 → 6.4 → 7.x). Read the current line in the live ComfyUI README before running. A nightly variant (https://download.pytorch.org/whl/nightly/rocm7.2) "might have some performance improvements" per the README. There is also a separate experimental RDNA-3 wheel index (https://rocm.nightlies.amd.com/v2/gfx110X-all/) the README lists for Windows+Linux RDNA3; on officially-supported Linux you do not need it, and the stable whl/rocm7.2 wheel above is the canonical path.
3. Install ComfyUI dependencies
Per the ComfyUI README "Dependencies" section:
pip install -r requirements.txt
4. Download model files for the native workflow
Per the ComfyUI native workflow docs, download these three files from the Comfy-Org Wan 2.2 repackaged repo and place them in ComfyUI/models/. File sizes are verified from the Hugging Face Files tree:
# diffusion model (FP16, ~9.31 GB) → ComfyUI/models/diffusion_models/
wget -P models/diffusion_models/ \
https://huggingface.co/Comfy-Org/Wan_2.2_ComfyUI_Repackaged/resolve/main/split_files/diffusion_models/wan2.2_ti2v_5B_fp16.safetensors
# text encoder (FP16, ~11.37 GB) → ComfyUI/models/text_encoders/
wget -P models/text_encoders/ \
https://huggingface.co/Comfy-Org/Wan_2.2_ComfyUI_Repackaged/resolve/main/split_files/text_encoders/umt5_xxl_fp16.safetensors
# VAE (~1.41 GB) → ComfyUI/models/vae/
wget -P models/vae/ \
https://huggingface.co/Comfy-Org/Wan_2.2_ComfyUI_Repackaged/resolve/main/split_files/vae/wan2.2_vae.safetensors
The resulting layout matches what the official template expects:
ComfyUI/models/diffusion_models/wan2.2_ti2v_5B_fp16.safetensors
ComfyUI/models/text_encoders/umt5_xxl_fp16.safetensors
ComfyUI/models/vae/wan2.2_vae.safetensors
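A quick way to catch a misplaced or interrupted download is to check the layout above programmatically. The following is a minimal sketch, assuming it is run from (or pointed at) the ComfyUI repo root; the helper name missing_model_files is hypothetical and it only checks that each file exists and is non-empty, not that its checksum is correct.

```python
from pathlib import Path

# Relative paths taken from the layout listed above.
EXPECTED = [
    "models/diffusion_models/wan2.2_ti2v_5B_fp16.safetensors",
    "models/text_encoders/umt5_xxl_fp16.safetensors",
    "models/vae/wan2.2_vae.safetensors",
]


def missing_model_files(comfy_root):
    """Return the expected files that are absent or empty under comfy_root."""
    root = Path(comfy_root)
    missing = []
    for rel in EXPECTED:
        candidate = root / rel
        if not candidate.is_file() or candidate.stat().st_size == 0:
            missing.append(rel)
    return missing


if __name__ == "__main__":
    gaps = missing_model_files(".")
    print("all model files present" if not gaps else f"missing: {gaps}")
```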
ℹ️ Why the FP16 text encoder, not the FP8 one. The official ComfyUI tutorial points you at umt5_xxl_fp8_e4m3fn_scaled.safetensors (6.74 GB). That file is an NVIDIA optimization: on RDNA3 there is no FP8 hardware, so an FP8 text encoder upcasts to FP16 at load, and you pay the upcast and get no memory win or speedup. The text encoder is offloaded to system RAM by ComfyUI's native offloader during sampling (it isn't resident alongside the diffusion model), so on 16 GB you can take the native umt5_xxl_fp16.safetensors (11.37 GB) directly without crowding the diffusion weights on the GPU.
Running
Launch ComfyUI from the repo root. Per the ComfyUI README "Running" section:
python main.py
This starts the server (default http://127.0.0.1:8188). Open the Wan2.2 5B video generation template under Workflow → Browse Templates → Video (you need a ComfyUI build new enough to ship it), set the positive prompt, choose resolution 1280×704 (landscape) or 704×1280 (portrait), set the frame count for clip length (at 24 fps a 5-second clip is about 120 frames; Wan's generate.py documents frame counts of the form 4n+1, so use 121), and queue. The first render is slower due to model load; subsequent renders reuse the cached weights.
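The frame-count arithmetic is easy to get off by one. Multiplying 24 fps by 5 seconds gives 120, while Wan's generate.py help text asks for frame counts of the form 4n+1. The sketch below rounds a requested duration up to the next such count; the function name is hypothetical, and whether a given ComfyUI template enforces the 4n+1 rule itself is something to confirm in your own build.

```python
def wan_frame_count(seconds, fps=24):
    """Smallest frame count of the form 4n+1 that is >= fps * seconds."""
    raw = round(seconds * fps)
    n = -(-(raw - 1) // 4)  # integer ceil of (raw - 1) / 4
    return 4 * max(n, 0) + 1


if __name__ == "__main__":
    for secs in (2, 3, 5):
        print(f"{secs} s at 24 fps -> {wan_frame_count(secs)} frames")
```

For a 5-second clip this yields 121, one frame more than the plain 24 × 5 product.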
For image-to-video, drop a starting image into the LoadImage node wired into the template's Wan22ImageToVideoLatent input — TI2V is a unified text-and-image-to-video model, so the same workflow file handles both modes.
Attention path: ComfyUI's default attention backend on this stack is PyTorch's scaled-dot-product attention (SDPA), which is the path RDNA3 cards run Wan through (Wan2.1 AMD discussion #14, reported on a 7900 XTX: "using PyTorch's native Flash attention (via SDP) on PyTorch 2.6+rocm6.2.4"). Do not install or force a FlashAttention-2 wheel on this card: the same 7900 XTX user found it ~50% slower than SDP, and the upstream Composable-Kernel FlashAttention build targets CDNA/MI accelerators, not consumer RDNA3. Stick with the default.
ℹ️ Why ComfyUI native and not the CLI. The official generate.py --task ti2v-5B command is tuned for a 24 GB card (the README quotes a 24 GB floor, above the 7800 XT's 16 GB), and per Wan2.2 issue #90 it OOMs at the decode stage even on a 24 GB GPU (reported on an RTX 3090 24 GB, with --offload_model True --convert_model_dtype --t5_cpu all set). ComfyUI's runtime offloader is more aggressive than the CLI's static offload flags and is the path the documented 8 GB working floor refers to, so it is the reliable lead on the 16 GB 7800 XT, where the CLI's 24 GB path would not fit at all.
Results
- Speed: No first-party RX 7800 XT measurement for Wan 2.2 TI2V-5B exists in the Wan-AI HF card, the official README, or the backend benchmark data (/check/wan-2-2/rx-7800-xt returns no benchmark rows, verified at write time). The README's only published timing is the model-wide claim that TI2V-5B generates a 5-second 720P video in under 9 minutes on a single consumer-grade GPU — and elsewhere it names that consumer GPU as an RTX 4090 (NVIDIA), not this card. The closest AMD data point is a Wan 2.1 (not 2.2) run on a 7900 XTX reported in discussion #14: "25 steps is running for ~24 mins at 832×480" via a different workflow. That is a different model version, resolution, GPU, and workflow, so it is not quoted as this pairing's speed — treat it only as a rough signal that ROCm video generation on RDNA3 is functional but markedly slower than the NVIDIA reference. We do not publish an invented number. If you've measured Wan 2.2 TI2V-5B wall-clock on a 7800 XT, please contribute it so it lands on /check/wan-2-2/rx-7800-xt.
- VRAM usage: ~8 GB working floor on the ComfyUI native path with the runtime offloader engaged (per the ComfyUI tutorial: the 5B version "should fit well on 8GB vram with the ComfyUI native offloading"), leaving roughly half the 16 GB card free. The FP16 diffusion file (9.31 GB) is the resident component on the GPU; the UMT5-XXL FP16 text encoder (11.37 GB) and the Wan2.2-VAE (1.41 GB) stream from system RAM rather than loading resident. At 16 GB, VRAM is not the constraint on this pairing — ROCm memory-management stability is (see Troubleshooting). Live data: /check/wan-2-2/rx-7800-xt.
- Quality notes: TI2V-5B output is 720p (1280×704 or 704×1280) at 24 fps; the README documents 720P generation at 24 FPS for this model. Clip length is configurable via frame count. The dense single-checkpoint architecture means quality is consistent on the canonical FP16 path; there is no per-expert quality-vs-speed dial for this variant.
For the full benchmark data and other-GPU comparisons, see /check/wan-2-2/rx-7800-xt.
Troubleshooting
Model load stalls or hangs ("Requested to load …") on ROCm
The single most common video-model failure on an RDNA3 Radeon is not OOM — it's a load-time stall caused by interactions between ROCm's memory management and ComfyUI's pinned / async / dynamic-VRAM offloading. A 7900 XTX owner hit exactly this on a video model and resolved it with launch flags (ComfyUI issue #13730, reported for LTX on a 7900 XTX + ROCm 7.2; the same plumbing affects Wan loads on RDNA3 cards including the 7800 XT):
python main.py --disable-pinned-memory --disable-async-offload --disable-dynamic-vram
The reporter notes --disable-pinned-memory and --disable-async-offload each "seem important." This is an AMD/ROCm-specific issue — the identical model loads fine on NVIDIA without these flags. If a Wan generation hangs at the "load model" step on this card, try these first.
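If you switch between the plain launch and the workaround launch often, a small wrapper keeps the flag list in one place. This is only a sketch: the three flags are the ones quoted from the issue above, while the wrapper itself, its function names, and the stall_workaround switch are hypothetical conveniences, not part of ComfyUI.

```python
import subprocess
import sys

# Flags quoted from ComfyUI issue #13730 as cited above.
ROCM_STALL_FLAGS = [
    "--disable-pinned-memory",
    "--disable-async-offload",
    "--disable-dynamic-vram",
]


def build_launch_cmd(stall_workaround=False, python=sys.executable):
    """Build the ComfyUI launch command, optionally adding the stall flags."""
    cmd = [python, "main.py"]
    if stall_workaround:
        cmd += ROCM_STALL_FLAGS
    return cmd


def launch(stall_workaround=False):
    """Start ComfyUI; call this from the ComfyUI repo root."""
    return subprocess.run(build_launch_cmd(stall_workaround), check=True)


if __name__ == "__main__":
    # Only print the command here; call launch(True) yourself to start the server.
    print(" ".join(build_launch_cmd(stall_workaround=True)))
```

Starting without the flags and adding them only after a stall keeps the default offloading behaviour in place when it works.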
"Torch not compiled with CUDA enabled"
This means a CUDA build of PyTorch got installed instead of the ROCm build. Per the ComfyUI README troubleshooting note, uninstall and reinstall against the ROCm wheel index:
pip uninstall torch
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm7.2
Confirm the installed build is the ROCm one: python -c "import torch; print(torch.__version__)" should print a version with a +rocm7.2-style suffix, and torch.cuda.is_available() should return True (ROCm masquerades as the cuda device namespace under HIP).
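The same check can be scripted so it reports which kind of wheel is installed instead of leaving you to read the suffix by eye. A minimal sketch follows, assuming the usual local-version suffixes (+rocmX.Y, +cuNNN, +cpu) on PyTorch wheels; classify_torch_build is a hypothetical helper, and torch.version.hip is expected to be None on non-ROCm builds.

```python
def classify_torch_build(version, hip_version=None):
    """Classify a PyTorch build from torch.__version__ and torch.version.hip."""
    local = version.partition("+")[2]  # text after '+', e.g. 'rocm7.2' or 'cu128'
    if hip_version or local.startswith("rocm"):
        return "rocm"
    if local.startswith("cu") and local[2:].isdigit():
        return "cuda"
    if local == "cpu":
        return "cpu"
    return "unknown"  # e.g. a source build with no suffix


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        print("PyTorch is not installed in this environment")
    else:
        kind = classify_torch_build(torch.__version__, torch.version.hip)
        print(torch.__version__, "->", kind, "| GPU visible:", torch.cuda.is_available())
```

A result of "cuda" or "cpu" on this card means the wrong wheel index was used and the reinstall step above applies.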
Slow VAE decode at the end of a render
A 7900 XTX user found VAE decode was the slow stage and fixed it by lowering the VAE tile size: per discussion #14, they reported "changing the tile size to 256x256 - a 720*480 video will decode in 16s now." If your workflow exposes a VAE-tiling node, drop the tile to 256×256 when decode dominates wall-clock. (Reported on a 7900 XTX; the same RDNA3 VAE-tiling behaviour applies to the 7800 XT.)
Don't install FlashAttention or a cu12x wheel
HF and ComfyUI guides written for NVIDIA frequently suggest a FlashAttention wheel or a CUDA cu124/cu128 index. On RDNA3 both are the wrong path: the upstream Composable-Kernel FlashAttention build targets CDNA/MI accelerators rather than consumer gfx110x, and a 7900 XTX owner measured FA2 ~50% slower than the default SDP route (discussion #14). ComfyUI already routes attention through PyTorch SDPA on this stack, so stick with the default.
Want the 14B variants?
Per the official README, the 14B / A14B single-GPU commands need at least 80 GB VRAM — out of scope for a 16 GB card at native precision. Community GGUF quants of the 14B Wan variants exist but need a separate workflow and a GGUF loader; file a request on /contribute if you want a 14B-quantized AMD recipe added once a stable 16 GB gfx1101 workflow lands.