What You'll Build
A working ComfyUI text-to-video pipeline that runs the Wan 2.2 T2V-A14B variant (Alibaba's two-expert video model, with a high-noise and a low-noise expert and 14B parameters active per step) on a single RTX 3090, producing roughly 5-second clips at 1280×720 (81 frames at 16 fps). The native upstream code path requires 80 GB VRAM per the official model card; the path that actually fits 24 GB is the Comfy-Org repackaged FP8 scaled workflow, which swaps the two experts in and out of GPU memory sequentially during denoising. Expect roughly 1.65× the runtime of an RTX 4090 at the identical workload: Ampere has no FP8 tensor cores, so the FP8 weight files load but are dequantized to BF16 on the fly at compute time (memory savings preserved, no speed boost from the FP8 format itself).
Hardware data: RTX 3090 (24 GB VRAM) · 7m 10s per 81-frame 720p clip at FP8, 30 steps · peak 24 GB VRAM · See benchmark data
⚠️ Variant pin — this recipe is specifically for T2V-A14B. The Wan 2.2 family ships several variants under one brand: T2V-A14B (text-to-video, this recipe), I2V-A14B (image-to-video), Animate-14B (motion-from-image), S2V-14B (speech-to-video), and TI2V-5B (a dense fused 5B variant — different architecture, fits 16 GB cards). They share family branding but ship different weights and different ComfyUI workflows. If you want any of the others, the install steps below do not apply verbatim — start from the official Wan 2.2 GitHub instead.
ℹ️ Timestep-MoE, not classical sparse MoE. The "A14B from 27B total" framing in the model card describes a two-expert timestep MoE: per the HF card, "a high-noise expert for the early stages, focusing on overall layout; and a low-noise expert for the later stages, refining video details." The switch happens once per generation at a fixed SNR threshold — not per token via a learned router. The practical consequence: only one 14B expert is resident in VRAM at any moment (the other is on disk/CPU), so peak VRAM = one-expert × bytes-per-param + text encoder + VAE + activations. Do NOT size for 27B resident.
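The timestep switch and the per-expert weight arithmetic can be sketched in a few lines of Python. This is an illustration of the idea, not ComfyUI's or the upstream repo's code: the function names, the normalized timestep convention (1.0 = pure noise, 0.0 = clean), the linear schedule, and the 0.875 boundary are all assumptions chosen for the example.

```python
def select_expert(t: float, boundary: float = 0.875) -> str:
    """Pick the expert for a normalized timestep t (1.0 = pure noise, 0.0 = clean).

    `boundary` is an illustrative threshold, not a value taken from this guide.
    """
    return "high_noise" if t >= boundary else "low_noise"


def schedule(num_steps: int, boundary: float = 0.875) -> list[str]:
    """Expert used at each step of a linear 1.0 -> 0.0 schedule (a simplification)."""
    timesteps = [1.0 - i / num_steps for i in range(num_steps)]
    return [select_expert(t, boundary) for t in timesteps]


def count_switches(experts: list[str]) -> int:
    """Number of times the resident expert changes during one generation."""
    return sum(1 for a, b in zip(experts, experts[1:]) if a != b)


def expert_weight_gb(params: float, bytes_per_param: float) -> float:
    """Nominal weight size of one expert in GB (weights only, no activations)."""
    return params * bytes_per_param / 1e9


plan = schedule(30)
print("expert switches in one generation:", count_switches(plan))
print("one expert at FP8 :", expert_weight_gb(14e9, 1), "GB")
print("one expert at BF16:", expert_weight_gb(14e9, 2), "GB")
```

Because the timestep only decreases, the expert changes exactly once, which is why a single swap is enough. The nominal 14.0 GB FP8 figure is slightly below the 14.3 GB file size quoted in this guide; real schedulers also space timesteps non-linearly, so the step at which the swap happens will differ from this sketch.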
ℹ️ FP8 weight ≠ FP8 compute on Ampere sm_86. FP8 tensor cores first shipped on Hopper (sm_90) / Ada Lovelace (sm_89); the RTX 3090's Ampere sm_86 has FP16 / BF16 / INT8 / TF32 only. The FP8 e4m3fn safetensors files load fine — PyTorch dequantizes them to BF16 on the fly at compute time. You get the VRAM savings (one expert at 14.3 GB instead of ~28 GB at BF16, which would not fit) but no speed boost from the FP8 format itself. This is why the 3090 takes ~1.65× the 4090's wall-clock time for the same workload at the same precision tier (see Results).
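To make "dequantized on the fly" concrete, here is a minimal pure-Python sketch of what decoding an FP8 e4m3fn value involves. It is not how PyTorch or ComfyUI implement the cast (they use optimized kernels); the per-tensor `scale` argument is an assumption about how "scaled" FP8 checkpoints are laid out, and the function names are hypothetical.

```python
def decode_e4m3fn(byte: int) -> float:
    """Decode one FP8 e4m3fn value: 1 sign, 4 exponent, 3 mantissa bits, bias 7."""
    sign = -1.0 if byte & 0x80 else 1.0
    exponent = (byte >> 3) & 0x0F
    mantissa = byte & 0x07
    if exponent == 0x0F and mantissa == 0x07:
        return float("nan")  # the only NaN pattern; e4m3fn has no infinities
    if exponent == 0:
        return sign * (mantissa / 8.0) * 2.0 ** -6  # subnormal range
    return sign * (1.0 + mantissa / 8.0) * 2.0 ** (exponent - 7)


def dequantize(raw: bytes, scale: float) -> list[float]:
    """Decoded FP8 values multiplied by a per-tensor scale (assumed layout)."""
    return [decode_e4m3fn(b) * scale for b in raw]


print(decode_e4m3fn(0x38))                   # 1.0
print(decode_e4m3fn(0x7E))                   # 448.0, the largest finite e4m3fn value
print(dequantize(bytes([0x38, 0x40]), 0.5))  # [0.5, 1.0]
```

Each weight costs one byte in VRAM and is widened only when a layer is about to be computed, so the storage saving survives while the matrix multiply itself still runs at BF16 on Ampere.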
Requirements
| Component | Minimum | Tested |
|---|---|---|
| GPU | 24 GB VRAM (Ampere sm_80/86 or newer) | RTX 3090 (24 GB) |
| RAM | 32 GB | — |
| Storage | ~36 GB (two 14.3 GB FP8 experts = 28.6 GB, plus a ~6.7 GB text encoder and the VAE) | — |
| Software | ComfyUI (recent build with Wan2.2 templates) | — |
| Python | 3.10+ | — |
| PyTorch | 2.4+ with CUDA | — |
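A quick preflight check for the two requirements that can be verified without touching the GPU (Python version and free disk space) might look like the sketch below. The function name and messages are hypothetical; the VRAM and PyTorch rows are deliberately left out because checking them requires importing torch.

```python
import shutil
import sys


def preflight(path: str, min_free_gb: float,
              min_python: tuple[int, int] = (3, 10)) -> list[str]:
    """Return a list of human-readable problems; an empty list means all checks passed."""
    problems = []
    if sys.version_info[:2] < min_python:
        found = sys.version.split()[0]
        problems.append(f"Python {min_python[0]}.{min_python[1]}+ required, found {found}")
    free_gb = shutil.disk_usage(path).free / 1e9
    if free_gb < min_free_gb:
        problems.append(f"only {free_gb:.1f} GB free at {path}, need about {min_free_gb} GB")
    return problems
```

Call it with the drive that will hold ComfyUI/models and the storage figure from the table, and fix anything it reports before starting the downloads.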
Installation
1. Install or update ComfyUI
If you don't have ComfyUI yet, follow the official install. If you already have it, update to the latest version — the Wan 2.2 templates and FP8-scaled loader support landed in mid-2025 and you'll want a recent ComfyUI for both. Per the official ComfyUI Wan 2.2 tutorial, the Wan 2.2 14B T2V template is available via Workflow → Browse Templates → Video.
2. Download the two FP8 scaled expert weights
The T2V-A14B model is structured as two 14B experts that activate at different denoising timesteps — the high-noise expert handles early layout and motion; the low-noise expert handles late-stage detail. ComfyUI's workflow loads each into VRAM sequentially (not simultaneously), which is what keeps peak memory at the 24 GB ceiling of the 3090.
Download both files from Comfy-Org/Wan_2.2_ComfyUI_Repackaged:
cd ComfyUI/models/diffusion_models
# High-noise expert (14.3 GB)
huggingface-cli download Comfy-Org/Wan_2.2_ComfyUI_Repackaged \
split_files/diffusion_models/wan2.2_t2v_high_noise_14B_fp8_scaled.safetensors \
--local-dir . --local-dir-use-symlinks False
# Low-noise expert (14.3 GB)
huggingface-cli download Comfy-Org/Wan_2.2_ComfyUI_Repackaged \
split_files/diffusion_models/wan2.2_t2v_low_noise_14B_fp8_scaled.safetensors \
--local-dir . --local-dir-use-symlinks False
After download, move both .safetensors files out of the nested split_files/diffusion_models/ subfolder into ComfyUI/models/diffusion_models/ directly (ComfyUI looks at the top level of that folder).
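If you prefer to script the move, the following is a small sketch using only the standard library. The helper name is hypothetical, and it assumes the download left the files under a nested split_files/... folder inside the model directory, as described above.

```python
import shutil
from pathlib import Path


def flatten_download(model_dir: Path, nested: str) -> list[Path]:
    """Move *.safetensors from model_dir/nested up into model_dir; return the new paths."""
    src = model_dir / nested
    moved = []
    for f in sorted(src.glob("*.safetensors")):
        dest = model_dir / f.name
        if dest.exists():
            continue  # already flattened on an earlier run; leave both files untouched
        shutil.move(str(f), str(dest))
        moved.append(dest)
    return moved
```

For the expert weights this would be called as `flatten_download(Path("ComfyUI/models/diffusion_models"), "split_files/diffusion_models")`, run from the directory that contains ComfyUI/.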
3. Download the text encoder and VAE
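# Run these from the directory that contains ComfyUI/. If your shell is still in
# ComfyUI/models/diffusion_models from step 2, run `cd ../../..` first.
# UMT5-XXL text encoder (FP8 e4m3fn scaled)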
huggingface-cli download Comfy-Org/Wan_2.2_ComfyUI_Repackaged \
split_files/text_encoders/umt5_xxl_fp8_e4m3fn_scaled.safetensors \
--local-dir ComfyUI/models/text_encoders --local-dir-use-symlinks False
# VAE (shared with Wan 2.1)
huggingface-cli download Comfy-Org/Wan_2.2_ComfyUI_Repackaged \
split_files/vae/wan_2.1_vae.safetensors \
--local-dir ComfyUI/models/vae --local-dir-use-symlinks False
Same note on the nested folder paths — flatten so the encoder file sits directly under text_encoders/ and the VAE sits directly under vae/.
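Before opening ComfyUI, it can help to confirm that all four files ended up at the top level of their folders. The sketch below does that with the filenames from the download commands above; the function name and the `EXPECTED` table are illustrative, not part of ComfyUI.

```python
from pathlib import Path

# Filenames as used in the download commands in steps 2 and 3.
EXPECTED = {
    "diffusion_models": [
        "wan2.2_t2v_high_noise_14B_fp8_scaled.safetensors",
        "wan2.2_t2v_low_noise_14B_fp8_scaled.safetensors",
    ],
    "text_encoders": ["umt5_xxl_fp8_e4m3fn_scaled.safetensors"],
    "vae": ["wan_2.1_vae.safetensors"],
}


def missing_files(models_root: Path) -> list[str]:
    """Return 'folder/name' for every expected file not directly under its folder."""
    return [
        f"{folder}/{name}"
        for folder, names in EXPECTED.items()
        for name in names
        if not (models_root / folder / name).is_file()
    ]
```

`missing_files(Path("ComfyUI/models"))` returning an empty list means the layout matches what this recipe expects; anything it lists is either not downloaded yet or still sitting in a nested split_files/ folder.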
4. Load the official Wan 2.2 14B T2V workflow
Inside ComfyUI:
- Click Workflow → Browse Templates → Video
- Select Wan2.2 14B T2V
- The template instantiates two LoadDiffusionModel nodes (one per expert) wired into the dual-sampler chain
Running
With the workflow loaded:
- Enter your prompt in the CLIPTextEncode (Positive) node
- Confirm resolution is 1280×720 and frame count is 81 in the latent video node (this is the configuration the benchmark cited below was measured at)
- Confirm steps = 30, CFG = 5.0, sampler = Euler
- Click Queue Prompt
The first run will spend extra time loading the high-noise expert into VRAM. Once denoising switches to the low-noise stage at the SNR threshold, ComfyUI evicts the high-noise expert and loads the low-noise expert — expect a visible pause at the timestep switch. Output lands in ComfyUI/output/ as an MP4 (or as a sequence of frames depending on your video-saver node).
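If you would rather queue runs from a script than click through the UI, ComfyUI also accepts workflows over HTTP. The sketch below assumes a local server on the default port 8188 and a workflow exported from the UI in API format; the helper names and the file path you pass in are hypothetical.

```python
import json
import urllib.request


def build_queue_request(workflow: dict,
                        server: str = "http://127.0.0.1:8188") -> urllib.request.Request:
    """Build (but do not send) a POST to ComfyUI's /prompt endpoint."""
    body = json.dumps({"prompt": workflow}).encode("utf-8")
    return urllib.request.Request(
        f"{server}/prompt",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def queue_workflow(path: str, server: str = "http://127.0.0.1:8188") -> dict:
    """Load an API-format workflow JSON from disk and submit it to a running ComfyUI."""
    with open(path, encoding="utf-8") as f:
        workflow = json.load(f)
    with urllib.request.urlopen(build_queue_request(workflow, server)) as resp:
        return json.load(resp)
```

Splitting request construction from sending keeps the payload easy to inspect before anything reaches the server. The regular UI-format workflow file is not accepted by this endpoint, so export the API variant first.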
Results
- Speed: 7 minutes 10 seconds for an 81-frame 1280×720 clip at 30 steps, FP8 e4m3fn precision — measured on RTX 3090 by LocalAIMaster (April 2026). The same head-to-head benchmark reports the RTX 4090 at 4m 20s for the identical workload, so the 3090 runs at roughly 1.65× the wall-clock — the expected Ampere-vs-Ada gap given that FP8 weights are dequantized at runtime on Ampere rather than computed natively in FP8 tensor cores.
- VRAM usage: 24 GB peak at the configuration above, recorded in our /check/wan-2-2-14b/rtx-3090 row (id=253, FP8, 1280×720, 30 steps, generation time 7m 10s). The model "wants every byte of a 24GB card" per LocalAIMaster's writeup; there is essentially no headroom. See Troubleshooting if you hit OOM.
- Quality notes: LocalAIMaster's head-to-head review identifies Wan 2.2-T2V-A14B as the "highest-quality open video model" in their four-model comparison across prompt adherence, motion stability, and aesthetic. The dual-expert architecture (high-noise = layout/motion, low-noise = texture/detail) is the cited reason for the motion-stability lift over single-expert 14B competitors.
For the full benchmark data, see /check/wan-2-2-14b/rtx-3090.
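The 1.65× figure follows directly from the two cited timings. As a worked check in Python (the per-step number is simply total wall-clock divided by step count, so it also absorbs model loading, the expert swap, and VAE decode):

```python
def to_seconds(minutes: int, seconds: int) -> int:
    """Convert a minutes/seconds timing to whole seconds."""
    return minutes * 60 + seconds


rtx3090_s = to_seconds(7, 10)  # 430 s, cited RTX 3090 timing
rtx4090_s = to_seconds(4, 20)  # 260 s, cited RTX 4090 timing
steps = 30

ratio = rtx3090_s / rtx4090_s
avg_per_step_3090 = rtx3090_s / steps

print(f"3090 / 4090 wall-clock ratio: {ratio:.2f}x")       # 1.65x
print(f"3090 average per step: {avg_per_step_3090:.1f} s")  # 14.3 s
```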
Troubleshooting
Out-of-memory at the VAE decode stage
The cited peak is 24 GB on the nose — any background GPU consumer (browser hardware acceleration, video conferencing, a second model loaded in another process) can push you over on a 3090. First step: close everything else using the GPU. Second step: drop resolution to 1280×704 or 960×544. Third step: switch to a GGUF quant via the city96/ComfyUI-GGUF custom node — Q5_K_M (10.8 GB per expert) or Q6_K (12 GB per expert) from QuantStack/Wan2.2-T2V-A14B-GGUF. The QuantStack repo ships both HighNoise/ and LowNoise/ subfolders at every quant tier (Q2_K through Q8_0), so the dual-expert swap pattern is preserved at GGUF. Load via Unet Loader (GGUF) instead of LoadDiffusionModel. No published RTX 3090 timing for the GGUF path yet — report yours via the submission form and we'll add it to /check/.
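To see whether something else is already holding VRAM before you queue a run, you can query nvidia-smi. The sketch below wraps that query in Python; the function names are hypothetical, and it assumes nvidia-smi is on your PATH.

```python
import subprocess


def parse_memory_csv(line: str) -> tuple[int, int]:
    """Parse one 'used, total' line (values in MiB) from nvidia-smi's CSV output."""
    used, total = (int(part.strip()) for part in line.split(","))
    return used, total


def gpu_memory_mib(index: int = 0) -> tuple[int, int]:
    """Query used and total memory (MiB) for one GPU via nvidia-smi."""
    out = subprocess.run(
        ["nvidia-smi", f"--id={index}",
         "--query-gpu=memory.used,memory.total",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_memory_csv(out.strip().splitlines()[0])
```

Run `gpu_memory_mib()` with ComfyUI closed: with essentially no headroom at the cited peak, any sizeable `used` value at idle is memory the workflow will not get.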
FP8 path is no faster on Ampere — why bother
The FP8 e4m3fn safetensors are not chosen here for tensor-core speed (the 3090 has no FP8 tensor cores; that's a Hopper/Ada/Blackwell feature). They are chosen because BF16 weights at 14B per expert are ~28 GB on disk and do not fit 24 GB; FP8 cuts that to 14.3 GB per expert, which fits in 24 GB because the weights stay in FP8 in VRAM and are dequantized to BF16 only on the fly at compute time. If you have a Hopper or Ada card you get both the memory savings and the speed boost; on Ampere you get only the memory savings. The 7m 10s figure already reflects this: there is no FP8-accelerated faster mode hiding on the 3090.
Native Wan 2.2 install (generate.py) wants 80 GB
This is expected: the upstream Wan-Video/Wan2.2 repo's single-GPU code path holds both experts resident and requires ~80 GB VRAM per the model card ("This command can run on a GPU with at least 80GB VRAM"). Memory flags such as --offload_model True, --convert_model_dtype, and --t5_cpu exist, but the ComfyUI FP8 scaled path is the cleaner consumer-GPU route. Don't try to run python generate.py --task t2v-A14B directly on a 3090.
"Where do I get the I2V or Animate variant?"
This recipe is T2V (text-to-video) only. For image-to-video (I2V-A14B), Animate-14B, S2V-14B, or the dense 5B sibling (TI2V-5B), the workflows and weight files differ — start from the Wan-AI HF org page and pick the matching ComfyUI repackaged variant. Same arch family, same install pattern, but different files. Report your results via the submission form and we'll add sibling recipes.