What You'll Build
Generate short text-to-video and image-to-video clips locally with LTX Video 2.3 — a 22B-parameter DiT video model from Lightricks — on a 16GB consumer GPU. The full-precision model demands 32GB+ VRAM, so this recipe runs the Q4_K_S GGUF quantization of the distilled checkpoint together with a 4-bit QAT Gemma 3 text encoder.
Hardware data: RTX 5060 Ti (16GB VRAM) · Q4_K_S GGUF distilled + Gemma 3 12B QAT-Q4 · See benchmark data
⚠️ Known issue: The full LTX-2.3-22B with the unquantized Gemma 3 12B text encoder does NOT fit in 16GB VRAM. A user with an RTX 5080 16GB reported OOM at "Peak Usage: 29068 MiB" even with FP8 quantization (Lightricks/ComfyUI-LTXVideo#303). Stick to GGUF + quantized Gemma.
Requirements
| Component | Minimum | Tested |
|---|---|---|
| GPU | 16GB VRAM (Ampere or newer) | RTX 5060 Ti (16GB) |
| RAM | 32GB | 32GB |
| Storage | ~30GB | ~30GB (model + encoder + VAE) |
| Software | ComfyUI + ComfyUI-LTXVideo + ComfyUI-GGUF + KJNodes | Python 3.10+, CUDA 12.7+ |
The full unquantized LTX-2.3 needs 32GB+ VRAM per the official ComfyUI-LTXVideo README — running on 16GB requires the GGUF path below.
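As a rough cross-check of why the 16GB path needs quantization, the weights-only footprint can be estimated as parameters × bits per weight ÷ 8. The sketch below is a back-of-envelope calculation, not a measurement. The bit widths are nominal assumptions (16 for bf16, 8 for FP8, 4.5 for a Q4_K block); real GGUF files keep some tensors at higher precision, and activations, the VAE, and the text encoder come on top.

```python
# Back-of-envelope, weights-only estimate. All bit widths are nominal
# assumptions for illustration, not values taken from the model card.
ASSUMED_BITS_PER_WEIGHT = {"bf16": 16.0, "fp8": 8.0, "q4_k": 4.5}


def weight_gb(params_billions: float, bits_per_weight: float) -> float:
    """Return the weights-only size in decimal GB (1 GB = 1e9 bytes)."""
    # billions of params * bits / 8 bits-per-byte gives billions of bytes.
    return params_billions * bits_per_weight / 8


if __name__ == "__main__":
    for name, bits in ASSUMED_BITS_PER_WEIGHT.items():
        print(f"{name:>5}: {weight_gb(22, bits):6.2f} GB for 22B parameters")
```

Under these assumptions, 8-bit weights alone come to 22 GB, which already exceeds a 16GB card and is consistent with the FP8 OOM report above. About 4.5 bits per weight brings the weights down to roughly 12 GB.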
Installation
1. Install ComfyUI
git clone https://github.com/comfyanonymous/ComfyUI
cd ComfyUI
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
2. Install the LTX-Video, GGUF, and KJNodes custom nodes
cd custom_nodes  # step 1 left you inside the ComfyUI checkout
# Official Lightricks ComfyUI nodes for LTX-2.3
git clone https://github.com/Lightricks/ComfyUI-LTXVideo.git
pip install -r ComfyUI-LTXVideo/requirements.txt
# city96's GGUF loader — required for the quantized UNet
git clone https://github.com/city96/ComfyUI-GGUF.git
pip install -r ComfyUI-GGUF/requirements.txt
# Kijai's KJNodes — used by recommended workflows
git clone https://github.com/kijai/ComfyUI-KJNodes.git
pip install -r ComfyUI-KJNodes/requirements.txt
Source: unsloth/LTX-2.3-GGUF model card lists this exact custom-node trio for the GGUF workflow.
3. Download the Q4_K_S distilled GGUF weights
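# Run the download commands from the directory that contains the ComfyUI checkout
# UNet: Q4_K_S distilled (16.7 GB from QuantStack per the table below; the Unsloth variant is ~13 GB)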
huggingface-cli download QuantStack/LTX-2.3-GGUF \
LTX-2.3-distilled/LTX-2.3-distilled-Q4_K_S.gguf \
--local-dir ComfyUI/models/unet/
# Or the Unsloth dynamic-quant variant
huggingface-cli download unsloth/LTX-2.3-GGUF \
LTX-2.3-distilled-UD-Q4_K_S.gguf \
--local-dir ComfyUI/models/unet/
Quant-tier file-size reference (from QuantStack/LTX-2.3-GGUF):
| Quant | File size |
|---|---|
| Q3_K_S | 14 GB |
| Q4_K_S | 16.7 GB |
| Q4_K_M | 17.8 GB |
| Q5_K_S | 18.5 GB |
| Q8_0 | 25.5 GB |
The Unsloth repo ships a more aggressively packed Q4_K_S at 13 GB (unsloth/LTX-2.3-GGUF) — try Unsloth first on 16GB.
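To make the "try Unsloth first" advice concrete, the sketch below compares the file sizes quoted above against a nominal memory budget. It is only a rough guide: file size is used as a proxy for resident weight size, the 16 GB budget is nominal, and ComfyUI can offload part of a model, so a file larger than the budget may still run, just more slowly. The labels in the dictionary are illustrative.

```python
# File sizes in GB, copied from the quant table and the Unsloth note above.
FILE_SIZES_GB = {
    "QuantStack Q3_K_S": 14.0,
    "QuantStack Q4_K_S": 16.7,
    "QuantStack Q4_K_M": 17.8,
    "QuantStack Q5_K_S": 18.5,
    "QuantStack Q8_0": 25.5,
    "Unsloth UD-Q4_K_S": 13.0,
}


def files_within(budget_gb: float, sizes: dict = FILE_SIZES_GB) -> list:
    """Return (name, headroom_gb) for files no larger than budget_gb.

    Results are ordered by headroom, largest first.
    """
    fitting = [
        (name, round(budget_gb - size, 1))
        for name, size in sizes.items()
        if size <= budget_gb
    ]
    return sorted(fitting, key=lambda item: -item[1])


if __name__ == "__main__":
    for name, headroom in files_within(16.0):
        print(f"{name}: {headroom} GB of nominal headroom")
```

With a 16 GB budget, only the Unsloth Q4_K_S and the QuantStack Q3_K_S files come in under the limit, and the Unsloth file leaves the most room for the encoder, VAE, and activations.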
4. Download the quantized Gemma 3 12B text encoder
The standard gemma_3_12B_it.safetensors 12B-parameter encoder is too large to coexist with the 22B LTX-2.3 weights on a 16GB card — Lightricks/ComfyUI-LTXVideo#303 documents a 29068 MiB OOM peak on RTX 5080 16GB with the unquantized pipeline. Use the QAT-Q4 GGUF instead:
huggingface-cli download unsloth/gemma-3-12b-it-qat-GGUF \
gemma-3-12b-it-qat-UD-Q4_K_XL.gguf \
--local-dir ComfyUI/models/text_encoders/
huggingface-cli download unsloth/gemma-3-12b-it-qat-GGUF \
mmproj-BF16.gguf \
--local-dir ComfyUI/models/text_encoders/
Both files are loaded by ComfyUI-GGUF's Gemma encoder node. The QAT-Q4 quantization significantly reduces the encoder's footprint compared with the unquantized 12B file. The exact peak depends on the workflow; the closest cited datapoint, the 14926 MiB total peak in Results, comes from a 16GB card running a comparable LTX-2 distilled setup.
5. Download the VAE and the spatial upscaler
The Lightricks/LTX-2.3 repo bundles the VAE inside the 22B .safetensors checkpoints (verified via HF Files tab — no *vae*.safetensors file exists standalone). For the GGUF-only flow we use here, pull the VAE from Kijai's community mirror, which exposes it as a standalone bf16 file (architecture: ltxv shared across the LTX family):
# VAE — community mirror (Lightricks does not ship a standalone VAE file)
huggingface-cli download Kijai/LTXV2_comfy \
VAE/LTX2_video_vae_bf16.safetensors \
--local-dir ComfyUI/models/vae/
# Latent spatial upscaler — pick whichever upscale factor the canonical workflow uses
huggingface-cli download Lightricks/LTX-2.3 \
ltx-2.3-spatial-upscaler-x2-1.1.safetensors \
--local-dir ComfyUI/models/latent_upscale_models/
File listing confirmed at Lightricks/LTX-2.3 Files and Kijai/LTXV2_comfy/VAE.
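Before launching, it can help to confirm that every file from steps 3 to 5 landed where ComfyUI will look for it. The following is a minimal sketch, assuming the folder layout and file names used in the download commands above. It searches each folder recursively because `--local-dir` preserves repository subfolders such as `VAE/`. The function name and the `REQUIRED` structure are illustrative, not part of any ComfyUI API.

```python
from pathlib import Path

# Each entry: (subfolder under ComfyUI/models, acceptable file names).
# Either UNet file satisfies the first entry.
REQUIRED = [
    ("unet", ("LTX-2.3-distilled-Q4_K_S.gguf", "LTX-2.3-distilled-UD-Q4_K_S.gguf")),
    ("text_encoders", ("gemma-3-12b-it-qat-UD-Q4_K_XL.gguf",)),
    ("text_encoders", ("mmproj-BF16.gguf",)),
    ("vae", ("LTX2_video_vae_bf16.safetensors",)),
    ("latent_upscale_models", ("ltx-2.3-spatial-upscaler-x2-1.1.safetensors",)),
]


def missing_models(models_dir: Path, required=REQUIRED) -> list:
    """Return a description of every required entry with no matching file."""
    missing = []
    for subfolder, names in required:
        root = models_dir / subfolder
        # Search recursively so files nested in repo subfolders still count.
        found = root.is_dir() and any(any(root.rglob(name)) for name in names)
        if not found:
            missing.append(f"{subfolder}: {' or '.join(names)}")
    return missing


if __name__ == "__main__":
    for item in missing_models(Path("ComfyUI/models")):
        print("missing ->", item)
```

Run it from the directory that contains the ComfyUI checkout; no output means all five entries were found.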
Running
Launch ComfyUI from inside the ComfyUI directory, with the virtual environment from step 1 active:
python main.py --listen
Open the browser UI, then load one of the example workflows shipped by the Lightricks node:
ComfyUI/custom_nodes/ComfyUI-LTXVideo/example_workflows/2.3/
Swap the default UNet loader for the Unet Loader (GGUF) node from ComfyUI-GGUF, and point the text encoder at the GGUF Gemma 3 loader.
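To find which loader nodes a workflow file actually uses before swapping them, one option is to list the node types it contains. This is a rough sketch that assumes one of two JSON layouts: the UI export (a top-level `nodes` list whose items carry a `type` field) or the API export (a mapping of node ids to objects with a `class_type` field). The function names are illustrative, and the exact loader node names depend on the workflow you load.

```python
import json
from collections import Counter


def node_types(workflow: dict) -> Counter:
    """Count node types in a ComfyUI workflow dict.

    Assumes either the UI export ({"nodes": [{"type": ...}, ...]})
    or the API export ({"<id>": {"class_type": ...}, ...}).
    """
    if isinstance(workflow.get("nodes"), list):
        return Counter(node.get("type", "?") for node in workflow["nodes"])
    return Counter(
        node["class_type"]
        for node in workflow.values()
        if isinstance(node, dict) and "class_type" in node
    )


def print_node_types(path: str) -> None:
    """Load a workflow JSON file and print each node type with its count."""
    with open(path, encoding="utf-8") as handle:
        counts = node_types(json.load(handle))
    for name, count in sorted(counts.items()):
        print(f"{count:3d}  {name}")
```

Calling `print_node_types` on a workflow file from the example folder prints one line per node type, which makes the model and text-encoder loaders easy to spot.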
Recommended distilled settings
| Parameter | Value | Source |
|---|---|---|
| Sampler steps | 8 | Distilled checkpoint default per Lightricks/LTX-2.3 card |
| CFG | 1.0 | Same |
| Resolution | Width and height must each be divisible by 32 | Lightricks/LTX-2.3 card |
| Frame count | (multiple of 8) + 1 (e.g. 65, 97, 121) | Same |
Start small (e.g. 480×832, 65 frames) and scale up only if peak VRAM stays well below 16GB.
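The two constraints in the table are easy to get wrong when scaling up, so a small helper can snap arbitrary values to the nearest valid ones below them. This is a minimal sketch, assuming the rule applies to width and height individually and that frame counts take the form 8k + 1; the function names are illustrative.

```python
def snap_dimension(pixels: int, multiple: int = 32) -> int:
    """Round down to the nearest multiple of 32, with a floor of 32."""
    return max(multiple, (pixels // multiple) * multiple)


def snap_frames(frames: int) -> int:
    """Round down to the nearest count of the form 8k + 1, with a floor of 1."""
    return max(1, ((frames - 1) // 8) * 8 + 1)


def is_valid(width: int, height: int, frames: int) -> bool:
    """Check a (width, height, frames) triple against both rules."""
    return (
        width > 0 and height > 0 and frames > 0
        and width % 32 == 0
        and height % 32 == 0
        and frames % 8 == 1
    )


if __name__ == "__main__":
    print(snap_dimension(1280), snap_dimension(720), snap_frames(120))
    print(is_valid(480, 832, 65))
```

For example, 720 snaps down to 704 and 120 frames snaps down to 113, while the suggested 480×832 at 65 frames already satisfies both rules.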
Results
- Speed: Omitted; no published benchmark exists for the RTX 5060 Ti 16GB at time of writing. For order-of-magnitude context, a community user reported "485 sec for 5 sec 1280×720" on an RTX 3050 6GB with Q4 GGUF and CPU offload (Kijai/LTXV2_comfy discussion #7). Empirical 5060 Ti data will appear at /check/ltx-video-2-3/rtx-5060-ti once a benchmark report lands.
- VRAM usage: A 16GB ComfyUI user running LTX-2 distilled reported a peak of 14926 MiB during sampling (Comfy-Org/ComfyUI#11726). Q4_K_S of LTX-2.3 distilled has a similar weight footprint; the 14926 MiB datapoint is the closest cited consumer-GPU peak.
- Quality notes: The distilled checkpoint runs at 8 steps with CFG=1 and trades fine motion detail for speed. The LTX-2.3-22b-dev (full) checkpoint produces higher quality but only fits in 16GB at Q3_K_S (14 GB in the file-size table above), and quality regressions are noticeable at Q3 and below.
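The cited figures are easier to compare once converted to headroom and to a slowdown factor. The helper below is a small sketch of that arithmetic. It treats a 16GB card as a nominal 16 × 1024 MiB, which is an assumption: the usable amount reported by the driver is usually slightly lower. The function names are illustrative.

```python
def headroom_mib(peak_mib: int, card_gib: int = 16) -> int:
    """MiB left on a card of nominal size card_gib; negative means over budget."""
    return card_gib * 1024 - peak_mib


def seconds_per_clip_second(wall_seconds: float, clip_seconds: float) -> float:
    """Wall-clock seconds spent per second of generated video."""
    return wall_seconds / clip_seconds


if __name__ == "__main__":
    print(headroom_mib(14926))               # cited 16GB LTX-2 distilled peak
    print(headroom_mib(29068))               # cited unquantized-pipeline peak
    print(seconds_per_clip_second(485, 5))   # cited RTX 3050 6GB run
```

On these numbers, the 14926 MiB peak leaves about 1.4 GiB of nominal headroom, the 29068 MiB peak overshoots the card by more than 12 GiB, and the RTX 3050 run took 97 seconds per second of video.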
For up-to-date benchmark data on this pair, see /check/ltx-video-2-3/rtx-5060-ti.
Troubleshooting
OOM when loading the text encoder
The default gemma_3_12B_it.safetensors 12B-parameter encoder will OOM on 16GB cards when loaded alongside the 22B LTX-2.3 weights — the unquantized pipeline peaks at 29068 MiB per Lightricks/ComfyUI-LTXVideo#303. This is the most common 16GB failure mode. Replace with gemma-3-12b-it-qat-UD-Q4_K_XL.gguf from Unsloth (Step 4 above).
OOM during sampling on the full LTX-2.3-22b-dev checkpoint
The full (non-distilled) 22B model + 12B encoder hits ~29 GB peak per the same issue. On 16GB:
- Use the distilled checkpoint (LTX-2.3-distilled-Q4_K_S.gguf)
- Drop resolution to 480×832 or 512×768
- Limit frame count to 65 or 97
Gemma GGUF loader fails or outputs gibberish
When the cited discussion was written (Kijai/LTXV2_comfy discussion #7), the Gemma 3 GGUF loader depended on ComfyUI-GGUF PRs #399 and #402, which had not yet been merged. Both PRs are now merged, so pull the latest city96/ComfyUI-GGUF main.
Slow generation
If the encoder is loaded fully on the GPU each generation, VRAM thrashing hurts wall time — keep Gemma offloaded with the KJNodes model-offload nodes. The only cited consumer-GPU datapoint at time of writing is the RTX 3050 6GB run linked above (485s for a 5-second 1280×720 clip with Q4 GGUF + CPU encoder) — wall time on the 5060 Ti will land at /check/ when benchmarks arrive.
Audio-video output not synchronized
LTX-2.3 generates synchronized video + audio in a single model per the Lightricks/LTX-2.3 card ("a DiT-based audio-video foundation model designed to generate synchronized video and audio within a single model"). The dedicated audio workflow lives at example_workflows/2.3/ in ComfyUI-LTXVideo. The non-audio workflows produce silent video — make sure you loaded the audio-enabled workflow file if audio is needed.