Wan 2.2 T2V-A14B on RTX 3090 Ti: 720p text-to-video in ComfyUI with FP8 weights (Ampere)

What You'll Build

A working ComfyUI text-to-video pipeline that runs the Wan 2.2 T2V-A14B variant, Alibaba's two-expert (high-noise + low-noise) 14B-active-per-step video model, on a single RTX 3090 Ti, producing roughly 5-second clips at 1280×720, 81 frames (81 frames at the model's default 16 fps is about 5.1 s). The native upstream code path requires 80 GB VRAM per the official model card; the path that actually fits 24 GB is the Comfy-Org repackaged FP8 scaled workflow, which swaps the two experts in and out of GPU memory sequentially during denoising. The RTX 3090 Ti is the same Ampere sm_86 architecture as the RTX 3090, with the same 24 GB GDDR6X envelope, roughly +8% memory bandwidth (1008 GB/s vs 936 GB/s) and +12% compute (~40 TFLOPS FP16 dense vs ~35.6 TFLOPS) per TechPowerUp's RTX 3090 Ti datasheet and RTX 3090 datasheet. Neither card has FP8 tensor cores (those first shipped on Hopper / Ada Lovelace), so the FP8 weight files load but are dequantized to BF16 on the fly at compute time: the memory savings are preserved, but the FP8 format itself brings no speed boost.

Hardware data: RTX 3090 Ti (24GB VRAM) · ~6:30–6:45 estimated per 81-frame 720p clip at FP8, 30 steps (close-sibling forward-statement from the RTX 3090's measured 7m 10s — see Results) · peak 24 GB VRAM · See benchmark data

⚠️ Variant pin — this recipe is specifically for T2V-A14B. The Wan 2.2 family ships several variants under one brand: T2V-A14B (text-to-video, this recipe), I2V-A14B (image-to-video), Animate-14B (motion-from-image), S2V-14B (speech-to-video), and TI2V-5B (a dense fused 5B variant — different architecture, fits 16 GB cards). They share family branding but ship different weights and different ComfyUI workflows. If you want any of the others, the install steps below do not apply verbatim — start from the official Wan 2.2 GitHub instead.

ℹ️ Timestep-MoE, not classical sparse MoE. The "A14B from 27B total" framing in the model card describes a two-expert timestep MoE: per the HF card, "a high-noise expert for the early stages, focusing on overall layout; and a low-noise expert for the later stages, refining video details." The switch happens once per generation at a fixed SNR threshold — not per token via a learned router. The practical consequence: only one 14B expert is resident in VRAM at any moment (the other is on disk/CPU), so peak VRAM = one-expert × bytes-per-param + text encoder + VAE + activations. Do NOT size for 27B resident. This is structurally different from classical sparse router-per-token MoE models (Mixtral 8×7B, DeepSeek-V3, GPT-OSS-20B) where every expert must be resident because the router picks per token.
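The sizing rule above can be checked with simple arithmetic. The sketch below is a back-of-envelope estimate only: it uses the round parameter counts from the text (14B per expert, 27B total) and covers weights alone, not the text encoder, VAE, or activations. The function name is illustrative.

```python
# Back-of-envelope weight footprint, using the round figures from the text.
# Real checkpoint files differ slightly (the FP8 expert files are 14.3 GB).
BYTES_PER_PARAM = {"bf16": 2, "fp8": 1}


def weights_gb(params_billion: float, dtype: str) -> float:
    """Weight storage in decimal GB for a parameter count and storage dtype."""
    return params_billion * 1e9 * BYTES_PER_PARAM[dtype] / 1e9


one_expert_fp8 = weights_gb(14, "fp8")    # 14.0 -> what is resident at any moment
one_expert_bf16 = weights_gb(14, "bf16")  # 28.0 -> already over a 24 GB card
both_fp8 = weights_gb(27, "fp8")          # 27.0 -> the "27B resident" sizing mistake

print(f"one expert, FP8 : {one_expert_fp8:.1f} GB")
print(f"one expert, BF16: {one_expert_bf16:.1f} GB")
print(f"both experts, FP8 (not how it runs): {both_fp8:.1f} GB")
```

Only the first figure leaves room on a 24 GB card for the text encoder, VAE, and activations, which is why the sequential expert swap matters.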

ℹ️ FP8 weight ≠ FP8 compute on Ampere sm_86. FP8 tensor cores first shipped on Hopper (sm_90) / Ada Lovelace (sm_89); the RTX 3090 Ti's Ampere sm_86 has FP16 / BF16 / INT8 / TF32 only. The FP8 e4m3fn safetensors files load fine — PyTorch dequantizes them to BF16 on the fly at compute time. You get the VRAM savings (one expert at 14.3 GB instead of ~28 GB at BF16, which would not fit) but no speed boost from the FP8 format itself. This is the same on the 3090 Ti as it is on the 3090 — Ampere sm_86 covers both cards.
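To make "FP8 storage, BF16 compute" concrete, here is a minimal pure-Python sketch of what dequantization means for the e4m3fn format: each weight is one byte on disk and in VRAM, and is expanded to a wider float only when needed. This is an illustration of the number format, not ComfyUI's or PyTorch's actual code path, and the single per-tensor `scale` is an assumption about what "scaled" means in the file names.

```python
import math


def decode_e4m3fn(byte: int) -> float:
    """Decode one FP8 e4m3fn value: 1 sign bit, 4 exponent bits (bias 7), 3 mantissa bits."""
    sign = -1.0 if byte & 0x80 else 1.0
    exp = (byte >> 3) & 0x0F
    man = byte & 0x07
    if exp == 0x0F and man == 0x07:
        return math.nan  # the only NaN pattern; e4m3fn has no infinities
    if exp == 0:
        return sign * (man / 8.0) * 2.0 ** -6  # subnormal range
    return sign * (1.0 + man / 8.0) * 2.0 ** (exp - 7)


def dequantize(raw: bytes, scale: float) -> list[float]:
    """Expand stored FP8 bytes to wide floats, applying an assumed per-tensor scale."""
    return [decode_e4m3fn(b) * scale for b in raw]


# Three stored bytes: 1.0, -1.0, and 448.0 (the largest finite e4m3fn value).
print(dequantize(bytes([0x38, 0xB8, 0x7E]), scale=0.5))  # [0.5, -0.5, 224.0]
```

The expansion costs a little compute per use and saves nothing in math throughput, which is why the format helps memory but not speed on hardware without FP8 tensor cores.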

Requirements

Component	Minimum	Tested
GPU	24 GB VRAM (Ampere sm_80/86 or newer)	RTX 3090 Ti (24 GB)
RAM	32 GB	—
Storage	~36 GB (two 14.3 GB FP8 experts = 28.6 GB, plus the ~7 GB text encoder and the VAE)	—
Software	ComfyUI (recent build with Wan2.2 templates)	—
Python	3.10+	—
PyTorch	2.4+ with CUDA	—
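Before downloading anything, a quick preflight against the table above can save time. The following is a hedged sketch: the `preflight` helper and its report keys are hypothetical, the disk threshold is whatever you pass in, and the GPU section only runs if PyTorch is already installed with CUDA available.

```python
import shutil
import sys


def preflight(models_dir: str, need_gb: float) -> dict:
    """Collect a few facts relevant to the Requirements table (illustrative helper)."""
    report = {
        "python_ok": sys.version_info >= (3, 10),
        "free_disk_gb": shutil.disk_usage(models_dir).free / 1e9,
    }
    report["disk_ok"] = report["free_disk_gb"] >= need_gb
    try:
        import torch  # optional: skipped cleanly if PyTorch is not installed yet

        if torch.cuda.is_available():
            major, minor = torch.cuda.get_device_capability(0)
            props = torch.cuda.get_device_properties(0)
            report["gpu"] = torch.cuda.get_device_name(0)
            report["compute_capability"] = f"sm_{major}{minor}"
            report["vram_gb"] = props.total_memory / 1e9
    except ImportError:
        report["gpu"] = None
    return report
```

For example, `preflight(".", need_gb=40.0)` checks the current drive against a deliberately padded 40 GB placeholder; on a 3090 Ti the compute capability should read `sm_86`.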

Installation

1. Install or update ComfyUI

If you don't have ComfyUI yet, follow the official install guide. If you already have it, update to the latest version: the Wan 2.2 templates and FP8-scaled loader support landed in mid-2025, and both need a recent build. Per the official ComfyUI Wan 2.2 tutorial, the Wan 2.2 14B T2V template is available via Workflow → Browse Templates → Video.

2. Download the two FP8 scaled expert weights

The T2V-A14B model is structured as two 14B experts that activate at different denoising timesteps — the high-noise expert handles early layout and motion; the low-noise expert handles late-stage detail. ComfyUI's workflow loads each into VRAM sequentially (not simultaneously), which is what keeps peak memory at the 24 GB ceiling of the 3090 Ti.

Download both files from Comfy-Org/Wan_2.2_ComfyUI_Repackaged:

cd ComfyUI/models/diffusion_models

# High-noise expert (14.3 GB)
huggingface-cli download Comfy-Org/Wan_2.2_ComfyUI_Repackaged \
  split_files/diffusion_models/wan2.2_t2v_high_noise_14B_fp8_scaled.safetensors \
  --local-dir . --local-dir-use-symlinks False

# Low-noise expert (14.3 GB)
huggingface-cli download Comfy-Org/Wan_2.2_ComfyUI_Repackaged \
  split_files/diffusion_models/wan2.2_t2v_low_noise_14B_fp8_scaled.safetensors \
  --local-dir . --local-dir-use-symlinks False

After download, move both .safetensors files out of the nested split_files/diffusion_models/ subfolder into ComfyUI/models/diffusion_models/ directly (ComfyUI looks at the top level of that folder).
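The move can be done by hand, or scripted. A minimal sketch follows, assuming the nested layout described above (`split_files/<subfolder>/` under the model directory); the helper name is made up for illustration, and it deliberately refuses to overwrite a file that already exists at the top level.

```python
import shutil
from pathlib import Path


def flatten_split_files(model_dir: str, subfolder: str) -> list[str]:
    """Move *.safetensors from <model_dir>/split_files/<subfolder>/ up into <model_dir>/."""
    root = Path(model_dir)
    nested = root / "split_files" / subfolder
    moved: list[str] = []
    if not nested.is_dir():
        return moved  # nothing downloaded here (or already flattened)
    for src in sorted(nested.glob("*.safetensors")):
        dst = root / src.name
        if dst.exists():
            continue  # never overwrite an existing top-level file
        shutil.move(str(src), str(dst))
        moved.append(src.name)
    return moved
```

Called as `flatten_split_files("ComfyUI/models/diffusion_models", "diffusion_models")` from the directory that contains `ComfyUI/`, it returns the names of the files it moved. The now-empty `split_files/` folder can be deleted afterwards.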

3. Download the text encoder and VAE

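# Run these from the directory that contains ComfyUI/. If your shell is still in
# ComfyUI/models/diffusion_models from step 2, run `cd ../../..` first.
# UMT5-XXL text encoder (FP8 e4m3fn scaled)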
huggingface-cli download Comfy-Org/Wan_2.2_ComfyUI_Repackaged \
  split_files/text_encoders/umt5_xxl_fp8_e4m3fn_scaled.safetensors \
  --local-dir ComfyUI/models/text_encoders --local-dir-use-symlinks False

# VAE (shared with Wan 2.1)
huggingface-cli download Comfy-Org/Wan_2.2_ComfyUI_Repackaged \
  split_files/vae/wan_2.1_vae.safetensors \
  --local-dir ComfyUI/models/vae --local-dir-use-symlinks False

Same note on the nested folder paths — flatten so the encoder file sits directly under text_encoders/ and the VAE sits directly under vae/.
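Once everything is flattened, it is worth confirming that all four files sit where ComfyUI looks for them. The sketch below checks the file names used in the download commands above against a `models/` root you supply; the `missing_files` helper is illustrative, not part of ComfyUI.

```python
from pathlib import Path

# File names taken from the download commands in steps 2 and 3.
EXPECTED = {
    "diffusion_models": [
        "wan2.2_t2v_high_noise_14B_fp8_scaled.safetensors",
        "wan2.2_t2v_low_noise_14B_fp8_scaled.safetensors",
    ],
    "text_encoders": ["umt5_xxl_fp8_e4m3fn_scaled.safetensors"],
    "vae": ["wan_2.1_vae.safetensors"],
}


def missing_files(models_root: str) -> list[str]:
    """Return 'folder/name' for every expected file not found directly under its folder."""
    root = Path(models_root)
    return [
        f"{folder}/{name}"
        for folder, names in EXPECTED.items()
        for name in names
        if not (root / folder / name).is_file()
    ]
```

An empty list from `missing_files("ComfyUI/models")` means the layout matches; anything listed is either not downloaded yet or still sitting in a nested `split_files/` folder.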

4. Load the official Wan 2.2 14B T2V workflow

Inside ComfyUI:

Click Workflow → Browse Templates → Video
Select Wan2.2 14B T2V
The template instantiates two LoadDiffusionModel nodes (one per expert) wired into the dual-sampler chain

Running

With the workflow loaded:

Enter your prompt in the CLIPTextEncode (Positive) node
Confirm resolution is 1280×720 and frame count is 81 in the latent video node (this matches the configuration the sibling RTX 3090 benchmark cited below was measured at)
Confirm steps = 30, CFG = 5.0, sampler = Euler
Click Queue Prompt

The first run will spend extra time loading the high-noise expert into VRAM. Once denoising switches to the low-noise stage at the SNR threshold, ComfyUI evicts the high-noise expert and loads the low-noise expert — expect a visible pause at the timestep switch. Output lands in ComfyUI/output/ as an MP4 (or as a sequence of frames depending on your video-saver node).
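The single expert switch described above can be pictured as a threshold on the denoising timestep. The sketch below is a simplified model of that idea, not the workflow's actual logic: the linear integer schedule, the 1000-timestep range, and the `boundary=0.5` value are all placeholders chosen for illustration. The real threshold and schedule come from the model and the template's sampler settings.

```python
def expert_for_timestep(t: int, boundary: float, num_train_timesteps: int = 1000) -> str:
    """Pick the expert for a timestep: high-noise at or above the threshold, else low-noise."""
    return "high_noise" if t >= boundary * num_train_timesteps else "low_noise"


def plan(timesteps: list[int], boundary: float) -> tuple[int, list[str]]:
    """Return (index of the first low-noise step, per-step expert labels)."""
    labels = [expert_for_timestep(t, boundary) for t in timesteps]
    switch = labels.index("low_noise") if "low_noise" in labels else len(labels)
    return switch, labels


# Placeholder schedule: 30 steps, timesteps falling linearly from 1000 toward 0.
steps = 30
timesteps = [1000 - (i * 1000) // steps for i in range(steps)]
switch, labels = plan(timesteps, boundary=0.5)  # boundary is illustrative only
print(f"high-noise expert: steps 0-{switch - 1}, low-noise expert: steps {switch}-{steps - 1}")
```

Because the timesteps only decrease, the label changes exactly once, which is why there is a single eviction-and-load pause per generation rather than repeated swapping.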

Results

Speed (close-sibling forward-statement): No first-party RTX 3090 Ti measurement for Wan 2.2-T2V-A14B has surfaced yet. The most directly comparable cited measurement is on the RTX 3090 — a same-arch (Ampere sm_86), same-VRAM-envelope (24 GB GDDR6X) close sibling: 7 minutes 10 seconds for an 81-frame 1280×720 clip at 30 steps, FP8 e4m3fn, measured by LocalAIMaster (April 2026) and corroborated by our /check/wan-2-2-14b/rtx-3090 benchmark id=253. Applying the documented memory-bandwidth and compute uplifts from the RTX 3090 Ti's TechPowerUp datasheet (1008 GB/s vs 936 GB/s = +7.7% memory bandwidth; ~40 vs ~35.6 TFLOPS FP16 dense = +12% compute), the Ti should land in the ~6:30–6:45 per clip range at the same configuration. This is a forward-statement, not a measurement — if you run it, please submit your timing via /contribute so we can replace the estimate with a first-party number.
VRAM usage: 24 GB peak on the 3090 Ti at the configuration above, inherited from the same-arch same-envelope RTX 3090 measurement at /check/wan-2-2-14b/rtx-3090 (id=253). The model "wants every byte of a 24GB card" per LocalAIMaster's writeup; there is essentially no headroom. The 3090 Ti's identical 24 GB GDDR6X envelope hits the same wall — see Troubleshooting if you OOM.
Quality notes: LocalAIMaster's head-to-head review identifies Wan 2.2-T2V-A14B as the "highest-quality open video model" in their four-model comparison across prompt adherence, motion stability, and aesthetic. The dual-expert architecture (high-noise = layout/motion, low-noise = texture/detail) is the cited reason for the motion-stability lift over single-expert 14B competitors. Quality is variant-driven, not card-driven — the 3090 Ti renders the same model the 3090 does.
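The speed estimate above is plain scaling arithmetic, which the sketch below reproduces from the figures quoted in this section (7m 10s baseline, 1008 vs 936 GB/s, ~40 vs ~35.6 TFLOPS). It assumes the whole run scales with one resource at a time, which is an idealization; the helper names are illustrative.

```python
BASELINE_S = 7 * 60 + 10  # cited RTX 3090 time for the same configuration: 7m 10s


def scaled_time(baseline_s: float, uplift: float) -> float:
    """Time if the whole run sped up by `uplift` (e.g. 1.077 for +7.7%)."""
    return baseline_s / uplift


def mmss(seconds: float) -> str:
    """Format seconds as m:ss, rounded to the nearest second."""
    total = round(seconds)
    return f"{total // 60}:{total % 60:02d}"


bandwidth_bound = scaled_time(BASELINE_S, 1008 / 936)  # purely memory-bandwidth-limited
compute_bound = scaled_time(BASELINE_S, 40 / 35.6)     # purely compute-limited

print("bandwidth-scaled:", mmss(bandwidth_bound))  # 6:39
print("compute-scaled:  ", mmss(compute_bound))    # 6:23
```

Pure scaling gives roughly 6:23 to 6:39. The ~6:30–6:45 range quoted above is a somewhat more conservative reading of the same uplifts, which leaves room for parts of the run that may not scale with either figure, such as the expert swap and VAE decode.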

For the full benchmark data, see /check/wan-2-2-14b/rtx-3090-ti.

Troubleshooting

Out-of-memory at the VAE decode stage

The cited peak is 24 GB on the nose — any background GPU consumer (browser hardware acceleration, video conferencing, a second model loaded in another process) can push you over on a 3090 Ti just as easily as on a 3090. First step: close everything else using the GPU. Second step: drop resolution to 1280×704 or 960×544. Third step: switch to a GGUF quant via the city96/ComfyUI-GGUF custom node — Q5_K_M (~10.8 GB per expert) or Q6_K (~12 GB per expert) from QuantStack/Wan2.2-T2V-A14B-GGUF. The QuantStack repo ships both HighNoise/ and LowNoise/ subfolders at every quant tier (Q2_K through Q8_0), so the dual-expert timestep-swap pattern is preserved at GGUF. Load via Unet Loader (GGUF) instead of LoadDiffusionModel. No published RTX 3090 Ti timing for the GGUF path yet — report yours via the submission form.
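For the first step, it helps to see how much VRAM is already taken before ComfyUI starts. The sketch below wraps `nvidia-smi --query-gpu=memory.used,memory.total --format=csv,noheader,nounits`, which reports MiB per GPU; the Python helpers around it are illustrative, and the sample string stands in for real output so the parsing can be seen without a GPU.

```python
import subprocess

QUERY = [
    "nvidia-smi",
    "--query-gpu=memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_memory(csv_text: str) -> list[tuple[int, int]]:
    """Parse 'used, total' MiB pairs, one line per GPU."""
    rows = []
    for line in csv_text.strip().splitlines():
        used, total = (int(field.strip()) for field in line.split(","))
        rows.append((used, total))
    return rows


def gpu_memory() -> list[tuple[int, int]]:
    """Run nvidia-smi and return (used_mib, total_mib) per GPU. Requires the NVIDIA driver."""
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
    return parse_memory(out)


# Sample line in the format nvidia-smi emits for one GPU (values are made up).
used, total = parse_memory("1843, 24564\n")[0]
print(f"{used} MiB already in use, {total - used} MiB free before ComfyUI loads anything")
```

If the "already in use" figure is more than a few hundred MiB on an idle desktop, something else is holding VRAM; `nvidia-smi` with no arguments lists the processes responsible.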

FP8 path is no faster on Ampere — why bother

The FP8 e4m3fn safetensors are not chosen here for tensor-core speed (the 3090 Ti has no FP8 tensor cores; that's a Hopper/Ada/Blackwell feature). They are chosen because BF16 weights at 14B per expert are ~28 GB and do not fit in 24 GB; FP8 cuts that to 14.3 GB per expert (verified via the Comfy-Org repackager's file listing). That fits in 24 GB because the weights stay in FP8 in VRAM and the runtime dequantizes them to BF16 only as each layer is computed, rather than expanding the whole expert at once. If you have a Hopper or Ada card you get both the storage savings AND tensor-core acceleration; on Ampere (3090 / 3090 Ti / A100) you get only the storage savings. The ~6:30–6:45 estimate above already reflects this: there is no FP8-accelerated faster mode hiding on the 3090 Ti.

Native Wan 2.2 install (`generate.py`) wants 80 GB

This is expected — the upstream Wan-Video/Wan2.2 repo's single-GPU code path holds both experts resident and requires ~80 GB VRAM per the model card ("This command can run on a GPU with at least 80GB VRAM."). Memory flags like --offload_model True --convert_model_dtype --t5_cpu exist but the ComfyUI FP8 scaled path is the cleaner consumer-GPU route. Don't try to run python generate.py --task t2v-A14B directly on a 3090 Ti.

"Where do I get the I2V or Animate variant?"

This recipe is T2V (text-to-video) only. For image-to-video (I2V-A14B), Animate-14B, S2V-14B, or the dense 5B sibling (TI2V-5B), the workflows and weight files differ — start from the Wan-AI HF org page and pick the matching ComfyUI repackaged variant. Same arch family, same install pattern, but different files. Report your results via the submission form and we'll add sibling recipes.