
LTX Video 2.3 on RTX 5080: 22B Audio-Video at the 16 GB Floor via GGUF + CPU-Offloaded Gemma

video · advanced · 16GB+ VRAM · May 29, 2026
prerequisites
  • NVIDIA RTX 5080 (16GB VRAM) or any 16GB consumer GPU
  • 64GB system RAM strongly recommended (the Gemma 3 12B text encoder is streamed/offloaded to RAM)
  • Python 3.10+ and CUDA 12.7+ (Blackwell sm_120 needs a cu128-class PyTorch wheel — default `pip install torch` on PyTorch 2.7+ ships it)
  • ComfyUI installed (latest version) + ComfyUI-LTXVideo + ComfyUI-GGUF + KJNodes
  • ~24GB free disk space for the Q4_K_S transformer + connectors + VAEs + Gemma encoder (about 21GB with the Q3_K_S transformer)

What You'll Build

Generate short, synchronized audio + video clips locally with LTX Video 2.3 — Lightricks' 22B-parameter DiT audio-video foundation model — on a 16 GB RTX 5080. Per the Lightricks/LTX-2.3 model card, "LTX-2.3 is a DiT-based audio-video foundation model designed to generate synchronized video and audio within a single model." The canonical ComfyUI install wants a "CUDA-compatible GPU with 32GB+ VRAM" (ComfyUI-LTXVideo README) — well above the 5080's 16 GB — so this recipe runs the distilled GGUF transformer with the heavy Gemma 3 12B text encoder offloaded to system RAM, the path RTX 5080 owners have proven works in ComfyUI-LTXVideo Issue #303.

Hardware data: RTX 5080 (16GB VRAM) · Q4_K_S distilled GGUF + CPU-offloaded Gemma 3 12B · Benchmark data: /check/ltx-video-2-3/rtx-5080

⚠️ The 16 GB envelope is tight and the text encoder is the binding constraint, not the transformer. Gemma 3 12B (LTX-2.3's text encoder) needs roughly 24-27 GB resident on its own per the Issue #303 report; on a 16 GB card it MUST run on the CPU (or be streamed from RAM). The transformer fits as a GGUF quant; the encoder is what OOMs you if you leave it on the GPU.

Variant pin. This recipe targets LTX-2.3 (22B, canonical repo Lightricks/LTX-2.3). It is NOT for the older LTX-2 19B line (repo Lightricks/LTX-2) or the LTX-Video 0.9.x family. Several Issue #303 reports below were filed against LTX-2 19B; they are cited here only for the Gemma 3 12B encoder failure mode, which carries over because both model lines use the same encoder.

Requirements

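| Component | Minimum | Tested |
|---|---|---|
| GPU | 16GB VRAM (Blackwell, Ada, or Ampere) | RTX 5080 (16GB) |
| RAM | 32GB | 64GB (Issue #303 RTX 5080 reports) |
| Storage | ~21GB (Q3_K_S transformer + connectors + VAEs + Gemma encoder) | ~24GB (Q4_K_S transformer + connectors + VAEs + Gemma encoder) |
| Software | ComfyUI + ComfyUI-LTXVideo + ComfyUI-GGUF + KJNodes | Python 3.10+, CUDA 12.7+ |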

The full unquantized LTX-2.3 requires a "CUDA-compatible GPU with 32GB+ VRAM" per the ComfyUI-LTXVideo README. On 16 GB you replace the BF16 transformer with a distilled GGUF quant and keep the Gemma encoder off the GPU — the steps below.

Installation

1. Install ComfyUI

git clone https://github.com/comfyanonymous/ComfyUI
cd ComfyUI
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

On Blackwell (sm_120) the default pip install torch ships sm_120 kernels via the cu128 wheel on PyTorch 2.7+. The LTX-2.3 codebase "was tested with Python >=3.12, CUDA version >12.7, and supports PyTorch ~= 2.7" per the Lightricks/LTX-2.3 model card.
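Before installing the custom nodes, it can be worth confirming that the PyTorch wheel in the venv actually carries sm_120 kernels. The following is a minimal sketch, not part of the official install; the helper name `wheel_covers_gpu` is hypothetical, and the exact-match check is a deliberate simplification (it can report a false negative on GPUs that run through same-major binary compatibility, which does not apply to sm_120).

```python
"""Check whether the installed PyTorch wheel carries kernels for this GPU."""


def wheel_covers_gpu(arch_list, capability):
    """Return True if the wheel's compiled arch list names the GPU's arch.

    arch_list:  e.g. ["sm_90", "sm_120"], as returned by torch.cuda.get_arch_list()
    capability: e.g. (12, 0), as returned by torch.cuda.get_device_capability()
    """
    major, minor = capability
    return f"sm_{major}{minor}" in arch_list


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        print("PyTorch is not installed in this environment")
    else:
        if not torch.cuda.is_available():
            print("CUDA is not available to PyTorch")
        else:
            cap = torch.cuda.get_device_capability(0)
            archs = torch.cuda.get_arch_list()
            print(f"torch {torch.__version__}, CUDA {torch.version.cuda}, "
                  f"GPU sm_{cap[0]}{cap[1]}")
            print("wheel covers this GPU:", wheel_covers_gpu(archs, cap))
```

On an RTX 5080 the capability should come back as (12, 0); if the last line prints False, the wheel was likely built without sm_120 and a cu128-class build is needed.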

2. Install the LTX-Video, GGUF, and KJNodes custom nodes

# Run from the directory that contains your ComfyUI checkout (after Step 1, `cd ..` first)
cd ComfyUI/custom_nodes

# Official Lightricks ComfyUI nodes for LTX-2.3
git clone https://github.com/Lightricks/ComfyUI-LTXVideo.git
pip install -r ComfyUI-LTXVideo/requirements.txt

# city96's GGUF loader — required for the quantized transformer
git clone https://github.com/city96/ComfyUI-GGUF.git
pip install -r ComfyUI-GGUF/requirements.txt

# Kijai's KJNodes — used by the recommended offload workflows
git clone https://github.com/kijai/ComfyUI-KJNodes.git
pip install -r ComfyUI-KJNodes/requirements.txt

These three custom node packs are the ones listed by the unsloth/LTX-2.3-GGUF model card for the GGUF workflow.

3. Download a distilled GGUF transformer that fits 16 GB

The distilled checkpoint runs at 8 steps with CFG 1.0 and is the right choice for a tight card. On 16 GB you want a transformer small enough to leave room for VAE-decode activations — Q4_K_S (12.22 GB) or, for more comfortable headroom, Q3_K_S (9.26 GB):

# Q4_K_S distilled (12.22 GB on disk) — tightest tier with full Q4 quality
huggingface-cli download unsloth/LTX-2.3-GGUF \
  distilled/ltx-2.3-22b-distilled-Q4_K_S.gguf \
  --local-dir ComfyUI/models/unet/

# OR Q3_K_S distilled (9.26 GB) — more headroom, small quality trade
huggingface-cli download unsloth/LTX-2.3-GGUF \
  distilled/ltx-2.3-22b-distilled-Q3_K_S.gguf \
  --local-dir ComfyUI/models/unet/

Distilled-transformer GGUF file sizes (verified live via the unsloth/LTX-2.3-GGUF tree API, distilled/ folder):

| Quant | unsloth distilled file size |
|---|---|
| Q2_K | 7.71 GB |
| Q3_K_S | 9.26 GB |
| Q3_K_M | 10.03 GB |
| Q4_K_S | 12.22 GB |
| Q4_K_M | 13.34 GB |
| Q5_K_S | 14.20 GB |
| Q6_K | 16.55 GB |
| Q8_0 | 21.19 GB |

QuantStack/LTX-2.3-GGUF ships an equivalent ladder but its Q4_K_S is larger (15.56 GB) — too tight once the VAE and activations load, so prefer the unsloth distilled/ files on a 16 GB card. There is also a newer distilled-1.1 revision in both repos (its Q4_K_S is 12.07 GB) — either revision works; this recipe pins the original distilled/.
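The choice of quant comes down to simple arithmetic against the card's capacity. As a rough sketch, the snippet below filters the unsloth distilled sizes from the table above by a VRAM budget. The 3 GB headroom figure is an assumption chosen for illustration (it is not a measured number for the VAE and activations), and the function name is hypothetical.

```python
# File sizes in GB, copied from the unsloth distilled/ table above.
DISTILLED_SIZES_GB = {
    "Q2_K": 7.71,
    "Q3_K_S": 9.26,
    "Q3_K_M": 10.03,
    "Q4_K_S": 12.22,
    "Q4_K_M": 13.34,
    "Q5_K_S": 14.20,
    "Q6_K": 16.55,
    "Q8_0": 21.19,
}


def quants_that_fit(vram_gb, headroom_gb):
    """List quants whose file size fits in vram_gb minus the reserved headroom.

    headroom_gb is an ASSUMED allowance for VAE decode and activations,
    not a measured value.
    """
    budget = vram_gb - headroom_gb
    return [name for name, size in DISTILLED_SIZES_GB.items() if size <= budget]


if __name__ == "__main__":
    # 16 GB card with an assumed 3 GB of headroom.
    print(quants_that_fit(16, 3.0))
```

Under that assumed headroom, Q4_K_S is the largest quant that still fits, which matches the recommendation in this step; a larger headroom assumption pushes the pick down toward Q3_K_S.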

4. Download the embeddings connectors and VAE

The distilled GGUF transformer needs its matching text-projection connectors and the audio + video VAE (LTX-2.3 ships them separately for the GGUF flow):

huggingface-cli download unsloth/LTX-2.3-GGUF \
  text_encoders/ltx-2.3-22b-distilled_embeddings_connectors.safetensors \
  vae/ltx-2.3-22b-distilled_video_vae.safetensors \
  vae/ltx-2.3-22b-distilled_audio_vae.safetensors \
  --local-dir ComfyUI/models/

(Connectors 2.15 GB, video VAE 1.35 GB, audio VAE 0.34 GB — verified live via the unsloth/LTX-2.3-GGUF tree API.)

5. Download a quantized Gemma 3 12B text encoder

LTX-2.3 uses Gemma 3 12B as its text encoder. The default gemma-3-12b-it-qat-q4_0-unquantized weights need ~24-27 GB to operate per Issue #303 — far over the card — so download a quantized encoder and keep it off the GPU (Step 6). The GGUF QAT-Q4 encoder is the smallest broadly-supported option:

huggingface-cli download unsloth/gemma-3-12b-it-qat-GGUF \
  gemma-3-12b-it-qat-UD-Q4_K_XL.gguf \
  mmproj-BF16.gguf \
  --local-dir ComfyUI/models/text_encoders/

The encoder file is 6.92 GB and the mmproj 0.80 GB (verified live via the unsloth/gemma-3-12b-it-qat-GGUF tree API). An FP8 single-file alternative (gemma_3_12B_it_fp8_e4m3fn.safetensors, 12.30 GB) is documented in Issue #303's workaround comment — see Troubleshooting for when to prefer it.
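Adding up the file sizes quoted in Steps 3 to 5 gives the disk footprint of the Q4_K_S path. The sketch below does that sum and compares it against free space in the current directory; the dictionary labels and function names are illustrative only.

```python
import shutil

# Sizes in GB as quoted in Steps 3-5 of this recipe.
DOWNLOADS_GB = {
    "transformer Q4_K_S": 12.22,
    "embeddings connectors": 2.15,
    "video VAE": 1.35,
    "audio VAE": 0.34,
    "Gemma 3 12B UD-Q4_K_XL": 6.92,
    "Gemma mmproj BF16": 0.80,
}


def total_gb(sizes):
    """Sum the planned downloads, rounded to two decimals."""
    return round(sum(sizes.values()), 2)


def has_room(path, needed_gb):
    """Return True if the filesystem holding `path` has at least needed_gb free."""
    free_gb = shutil.disk_usage(path).free / 1e9
    return free_gb >= needed_gb


if __name__ == "__main__":
    needed = total_gb(DOWNLOADS_GB)
    print(f"Planned downloads: {needed} GB")
    print("Enough free space here:", has_room(".", needed))
```

The Q4_K_S path totals about 23.78 GB; swapping in the 9.26 GB Q3_K_S transformer brings it to about 20.82 GB.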

Running

Launch ComfyUI with a VRAM mode that forces the heavy encoder off the GPU. The two RTX-5080-proven options from Issue #303 are --novram (stream all weights from RAM) and --reserve-vram 10 (reserve headroom so the encoder spills to CPU):

# Option A — stream weights from RAM (RTX 5080 owner: "only costs my gpu 3 GB VRAM to make a 720p video")
python main.py --listen --novram

# Option B — reserve VRAM so the encoder offloads (RTX 5080 owner: "these settings work for RTX 5080")
python main.py --listen --reserve-vram 10

Open the browser UI and load one of the example workflows shipped by the Lightricks node:

ComfyUI/custom_nodes/ComfyUI-LTXVideo/example_workflows/2.3/

Swap the default UNet loader for the Unet Loader (GGUF) node from ComfyUI-GGUF and point it at the distilled GGUF from Step 3. Wire the Gemma 3 12B encoder through the GGUF text-encoder loader. The output (silent video, or synchronized audio+video if you load the audio-enabled workflow) lands in ComfyUI/output/.
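If a loader node cannot find a model, a quick file check can rule out a misplaced download. This sketch assumes `huggingface-cli download --local-dir` preserved each file's repo-relative path (so the transformer sits under `unet/distilled/`); adjust `EXPECTED_FILES` if your layout differs. The function name is hypothetical.

```python
from pathlib import Path

# Paths relative to ComfyUI/models/, ASSUMING the repo-relative folder
# structure from Steps 3-5 was preserved by the download commands.
EXPECTED_FILES = [
    "unet/distilled/ltx-2.3-22b-distilled-Q4_K_S.gguf",
    "text_encoders/ltx-2.3-22b-distilled_embeddings_connectors.safetensors",
    "vae/ltx-2.3-22b-distilled_video_vae.safetensors",
    "vae/ltx-2.3-22b-distilled_audio_vae.safetensors",
    "text_encoders/gemma-3-12b-it-qat-UD-Q4_K_XL.gguf",
    "text_encoders/mmproj-BF16.gguf",
]


def missing_models(models_dir, expected=EXPECTED_FILES):
    """Return the expected files that are not present under models_dir."""
    root = Path(models_dir)
    return [rel for rel in expected if not (root / rel).is_file()]


if __name__ == "__main__":
    missing = missing_models("ComfyUI/models")
    if missing:
        print("Missing files:")
        for rel in missing:
            print("  ", rel)
    else:
        print("All expected model files are in place")
```

Run it from the directory that contains the ComfyUI checkout. If you downloaded the Q3_K_S transformer instead, change the first entry accordingly.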

Recommended distilled settings

| Parameter | Value | Source |
|---|---|---|
| Sampler steps | 8 | Distilled checkpoint default per Lightricks/LTX-2.3 model card |
| CFG | 1.0 | Same |
| Resolution | width × height divisible by 32 | Lightricks/LTX-2.3 card: "Width & height settings must be divisible by 32." |
| Frame count | (multiple of 8) + 1 (e.g. 65, 97, 121) | Lightricks/LTX-2.3 card: "Frame count must be divisible by 8 + 1." |

Start small (e.g. 512×512, 65 frames) to confirm the install fits before scaling resolution.
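The two shape rules from the table are easy to check before queueing a job. The helpers below are a small sketch with hypothetical names; they only encode the divisibility rules quoted from the model card and say nothing about what fits in memory.

```python
def is_valid_ltx_shape(width, height, frames):
    """True if width/height are multiples of 32 and frames is (multiple of 8) + 1."""
    dims_ok = width > 0 and height > 0 and width % 32 == 0 and height % 32 == 0
    frames_ok = frames >= 1 and (frames - 1) % 8 == 0
    return dims_ok and frames_ok


def snap_dim(value):
    """Round a width or height down to the nearest multiple of 32."""
    return (value // 32) * 32


def snap_frames(frames):
    """Round a frame count down to the nearest (multiple of 8) + 1, minimum 1."""
    return max(1, ((frames - 1) // 8) * 8 + 1)


if __name__ == "__main__":
    print(is_valid_ltx_shape(512, 512, 65))
    print(is_valid_ltx_shape(1280, 720, 121))
    print(snap_dim(720), snap_frames(100))
```

Note that 1280×720 fails this check because 720 is not a multiple of 32; 1280×704 is the nearest size below it that satisfies the rule.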

Results

  • Speed: Omitted. No clean RTX 5080 benchmark for LTX-2.3 22B at a fixed configuration has been published — the RTX 5080 owners in Issue #303 describe the CPU-offloaded encoder qualitatively ("slow when processing on CPU but the inference part works incredibly fast") without a comparable timing. Re-anchoring from the 32 GB RTX 5090 sibling is not valid either: the 5080's memory bandwidth (~960 GB/s) and compute differ enough that forward-extrapolating a 5090 number would mislead. Empirical 5080 timings will appear at /check/ltx-video-2-3/rtx-5080 once a community benchmark lands via /contribute.
  • VRAM usage: With the Gemma encoder forced to CPU via --novram, an RTX 5080 16GB owner reports the GPU footprint drops to roughly 3 GB during 720p generation (Issue #303 comment) — the distilled GGUF transformer streams in as needed. Leave the encoder on the GPU and the same hardware OOMs at "Peak Usage: 29068 MiB" (Issue #303 body, filed on RTX 5080 16GB). The recipe's min_vram_gb: 16 reflects the standard ComfyUI offload path (transformer resident + encoder on CPU), which fills the 16 GB card; see /check/ltx-video-2-3/rtx-5080.
  • Quality notes: The distilled checkpoint (8 steps, CFG 1.0) trades fine motion detail for speed. Q4_K_S keeps full Q4 quality; Q3_K_S frees ~3 GB of headroom at a small quality cost. The full ltx-2.3-22b-dev BF16 transformer is 42 GB on disk (unsloth/LTX-2.3-GGUF tree) and is out of scope for 16 GB.

For the full benchmark data, see /check/ltx-video-2-3/rtx-5080.

Troubleshooting

Out of memory loading the Gemma 3 12B text encoder

This is the dominant 16 GB failure mode. The unquantized Gemma 3 12B encoder needs ~24-27 GB and OOMs at "Peak Usage: 29068 MiB" on RTX 5080 16GB per Issue #303 (filed on LTX-2 19B, but the encoder is the same Gemma 3 12B used by LTX-2.3, so the failure transfers). Fixes, in order of preference:

  1. Launch with --novram so the encoder runs on CPU. An RTX 5080 owner reports this drops GPU use to ~3 GB (Issue #303 comment).
  2. Or launch with --reserve-vram 10, which another RTX 5080 owner confirms works (Issue #303 comment).
  3. Use the GGUF QAT-Q4 Gemma from Step 5, or the FP8 single-file gemma_3_12B_it_fp8_e4m3fn.safetensors documented in Jackson3195's workaround comment.
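To confirm that a fix actually moved the encoder off the GPU, you can watch VRAM use while a generation runs. The sketch below shells out to `nvidia-smi` with its CSV query flags and parses the result; the function names are hypothetical, and it assumes `nvidia-smi` is on PATH and reports numeric values.

```python
import subprocess


def parse_memory_csv(text):
    """Parse 'used, total' rows (MiB) from nvidia-smi CSV output into tuples."""
    rows = []
    for line in text.strip().splitlines():
        used, total = (int(field.strip()) for field in line.split(","))
        rows.append((used, total))
    return rows


def query_gpu_memory():
    """Return [(used_mib, total_mib), ...], one entry per visible GPU."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.used,memory.total",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return parse_memory_csv(result.stdout)


if __name__ == "__main__":
    try:
        for index, (used, total) in enumerate(query_gpu_memory()):
            print(f"GPU {index}: {used} MiB used of {total} MiB")
    except (FileNotFoundError, subprocess.CalledProcessError, ValueError) as exc:
        print("Could not read GPU memory:", exc)
```

Run it in a second terminal during sampling. A reading far below the card's total suggests the encoder is staying in system RAM; a reading pinned near the total just before an OOM suggests it is still being loaded onto the GPU.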

Encoder is painfully slow on CPU

The --novram path moves the Gemma encoder to CPU, which is slow for the text-encode pass. A community user reports the "LTXV Audio Text Encoder Loader" node loads Gemma "8x times faster then normal loader" (sic) (Issue #303 comment). Replace the default Gemma loader with it and load the single safetensors encoder file.

mat1 and mat2 shapes cannot be multiplied after enabling --novram

A user hit this when running --novram together with --use-sage-attention --fast fp16_accumulation and a --preview-method latent2rgb flag (Issue #303 comment). Remove the sage-attention and custom preview flags; the error traces to preview generation during sampling, not the model itself.
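If you launch ComfyUI from a wrapper script, the flag cleanup can be done programmatically. The helper below is an illustrative sketch with a hypothetical name; it drops only the two flags this section says to remove (`--use-sage-attention` and `--preview-method` with its value) and leaves everything else untouched.

```python
def strip_conflicting_flags(argv):
    """Return argv without --use-sage-attention and --preview-method <value>."""
    cleaned = []
    skip_next = False
    for arg in argv:
        if skip_next:
            # This is the value that followed --preview-method; drop it too.
            skip_next = False
            continue
        if arg == "--use-sage-attention":
            continue
        if arg == "--preview-method":
            skip_next = True
            continue
        cleaned.append(arg)
    return cleaned


if __name__ == "__main__":
    launch = [
        "python", "main.py", "--listen", "--novram",
        "--use-sage-attention", "--fast", "fp16_accumulation",
        "--preview-method", "latent2rgb",
    ]
    print(" ".join(strip_conflicting_flags(launch)))
```

The sketch assumes `--preview-method` is passed as two separate arguments; a `--preview-method=latent2rgb` form would need an extra prefix check.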

FlashAttention-2 on Blackwell (sm_120)

LTX-2.3's codebase defaults to PyTorch SDPA rather than raw FlashAttention-2, so the FA2 sm_120 wheel gap rarely blocks here. If you do force FA2, sm_120 kernel coverage is tracked at Dao-AILab/flash-attention#2168 — fall back to SDPA on Blackwell until it lands.

Audio-video output not synchronized

LTX-2.3 produces synchronized video + audio in a single model per the Lightricks/LTX-2.3 card. The non-audio workflows produce silent video — load the audio-enabled workflow from example_workflows/2.3/ in ComfyUI-LTXVideo if you need sound.