LTX Video 2.3 on RTX 5080: 22B Audio-Video at the 16 GB Floor via GGUF + CPU-Offloaded Gemma

What You'll Build

Generate short, synchronized audio + video clips locally with LTX Video 2.3, Lightricks' 22B-parameter DiT audio-video foundation model, on a 16 GB RTX 5080. Per the Lightricks/LTX-2.3 model card, "LTX-2.3 is a DiT-based audio-video foundation model designed to generate synchronized video and audio within a single model." The canonical ComfyUI install wants a "CUDA-compatible GPU with 32GB+ VRAM" (ComfyUI-LTXVideo README), at least double the 5080's 16 GB. This recipe therefore runs the distilled GGUF transformer and offloads the heavy Gemma 3 12B text encoder to system RAM, the path RTX 5080 owners report working in ComfyUI-LTXVideo Issue #303.

Hardware data: RTX 5080 (16GB VRAM) · Q4_K_S distilled GGUF + CPU-offloaded Gemma 3 12B · Benchmark data: /check/ltx-video-2-3/rtx-5080

⚠️ The 16 GB envelope is tight and the text encoder is the binding constraint, not the transformer. Gemma 3 12B (LTX-2.3's text encoder) needs roughly 24-27 GB resident on its own per the Issue #303 report; on a 16 GB card it MUST run on the CPU (or be streamed from RAM). The transformer fits as a GGUF quant; the encoder is what OOMs you if you leave it on the GPU.

Variant pin. This recipe targets LTX-2.3 (22B, canonical repo Lightricks/LTX-2.3). It is NOT for the older LTX-2 19B line (repo Lightricks/LTX-2) nor the LTX-Video 0.9.x family. Several Issue #303 reports below were filed against LTX-2 19B; they are cited here only for the Gemma 3 12B encoder failure mode, which carries over because both model lines use the same encoder.

Requirements

Component	Minimum	Tested
GPU	16GB VRAM (Blackwell, Ada, or Ampere)	RTX 5080 (16GB)
RAM	32GB	64GB (Issue #303 RTX 5080 reports)
Storage	~24GB	~24GB (Q4_K_S transformer 12.22 GB + connectors and VAEs 3.84 GB + Gemma encoder and mmproj 7.72 GB)
Software	ComfyUI + ComfyUI-LTXVideo + ComfyUI-GGUF + KJNodes	Python 3.12+, CUDA newer than 12.7, PyTorch 2.7+ (the versions the Lightricks/LTX-2.3 model card reports testing)

The full unquantized LTX-2.3 requires a "CUDA-compatible GPU with 32GB+ VRAM" per the ComfyUI-LTXVideo README. On 16 GB you replace the BF16 transformer with a distilled GGUF quant and keep the Gemma encoder off the GPU — the steps below.

Installation

1. Install ComfyUI

git clone https://github.com/comfyanonymous/ComfyUI
cd ComfyUI
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

On Blackwell (sm_120), PyTorch 2.7+ ships sm_120 kernels in its cu128 wheels. Confirm that the torch build installed by requirements.txt reports CUDA 12.8 or newer; if it does not, reinstall torch from the PyTorch cu128 wheel index. The LTX-2.3 codebase "was tested with Python >=3.12, CUDA version >12.7, and supports PyTorch ~= 2.7" per the Lightricks/LTX-2.3 model card.
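A quick way to confirm the installed PyTorch build can target the 5080 is to inspect its CUDA version and compiled architecture list. The sketch below is illustrative only: the helper name blackwell_ready is hypothetical, and the CUDA 12.8 threshold is an assumption taken from the cu128 wheel mentioned above.

```python
from typing import Iterable, Optional


def blackwell_ready(cuda_version: Optional[str], arch_list: Iterable[str]) -> bool:
    """Return True if a PyTorch build looks usable on an sm_120 (Blackwell) GPU.

    cuda_version: the value of torch.version.cuda (None on CPU-only builds).
    arch_list: the value of torch.cuda.get_arch_list(), e.g. ["sm_90", "sm_120"].
    """
    if cuda_version is None:
        return False
    parts = cuda_version.split(".")
    major = int(parts[0])
    minor = int(parts[1]) if len(parts) > 1 else 0
    # Assumption: sm_120 kernels require a CUDA 12.8+ (cu128) build.
    return (major, minor) >= (12, 8) and "sm_120" in set(arch_list)


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        print("PyTorch is not installed in this environment")
    else:
        ok = blackwell_ready(torch.version.cuda, torch.cuda.get_arch_list())
        print(f"torch {torch.__version__}, CUDA {torch.version.cuda}, sm_120 ready: {ok}")
```

Run it inside the ComfyUI virtual environment; a False result suggests the wheel was built for an older CUDA toolkit.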

2. Install the LTX-Video, GGUF, and KJNodes custom nodes

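# Run from the directory that contains ComfyUI/ (if you are still inside ComfyUI/ from Step 1, run `cd ..` first)
cd ComfyUI/custom_nodes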

# Official Lightricks ComfyUI nodes for LTX-2.3
git clone https://github.com/Lightricks/ComfyUI-LTXVideo.git
pip install -r ComfyUI-LTXVideo/requirements.txt

# city96's GGUF loader — required for the quantized transformer
git clone https://github.com/city96/ComfyUI-GGUF.git
pip install -r ComfyUI-GGUF/requirements.txt

# Kijai's KJNodes — used by the recommended offload workflows
git clone https://github.com/kijai/ComfyUI-KJNodes.git
pip install -r ComfyUI-KJNodes/requirements.txt

This trio of node packs is the one listed by the unsloth/LTX-2.3-GGUF model card for the GGUF workflow.

3. Download a distilled GGUF transformer that fits 16 GB

The distilled checkpoint runs at 8 steps with CFG 1.0 and is the right choice for a tight card. On 16 GB you want a transformer small enough to leave room for VAE-decode activations — Q4_K_S (12.22 GB) or, for more comfortable headroom, Q3_K_S (9.26 GB):

# Run the download commands in Steps 3 to 5 from the directory that contains ComfyUI/
# Q4_K_S distilled (12.22 GB on disk): tightest tier with full Q4 quality
huggingface-cli download unsloth/LTX-2.3-GGUF \
  distilled/ltx-2.3-22b-distilled-Q4_K_S.gguf \
  --local-dir ComfyUI/models/unet/

# OR Q3_K_S distilled (9.26 GB) — more headroom, small quality trade
huggingface-cli download unsloth/LTX-2.3-GGUF \
  distilled/ltx-2.3-22b-distilled-Q3_K_S.gguf \
  --local-dir ComfyUI/models/unet/

Distilled-transformer GGUF file sizes (verified live via the unsloth/LTX-2.3-GGUF tree API, distilled/ folder):

Quant	unsloth distilled file size
Q2_K	7.71 GB
Q3_K_S	9.26 GB
Q3_K_M	10.03 GB
Q4_K_S	12.22 GB
Q4_K_M	13.34 GB
Q5_K_S	14.20 GB
Q6_K	16.55 GB
Q8_0	21.19 GB

QuantStack/LTX-2.3-GGUF ships an equivalent ladder but its Q4_K_S is larger (15.56 GB) — too tight once the VAE and activations load, so prefer the unsloth distilled/ files on a 16 GB card. There is also a newer distilled-1.1 revision in both repos (its Q4_K_S is 12.07 GB) — either revision works; this recipe pins the original distilled/.
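The choice between quants comes down to simple arithmetic against a VRAM budget. The sketch below uses the unsloth distilled/ file sizes from the table above and treats on-disk size as a rough proxy for resident size, which is an assumption. The function name and the headroom values are hypothetical; how much headroom the VAE and activations really need depends on resolution and frame count.

```python
from typing import Optional

# File sizes in GB, copied from the unsloth distilled/ table above.
DISTILLED_QUANT_GB = {
    "Q2_K": 7.71,
    "Q3_K_S": 9.26,
    "Q3_K_M": 10.03,
    "Q4_K_S": 12.22,
    "Q4_K_M": 13.34,
    "Q5_K_S": 14.20,
    "Q6_K": 16.55,
    "Q8_0": 21.19,
}


def largest_quant_that_fits(vram_gb: float, headroom_gb: float) -> Optional[str]:
    """Pick the largest quant whose file size fits in vram_gb minus headroom_gb.

    headroom_gb is a guess at what the VAE decode and activations need;
    it is an illustrative parameter, not a measured value.
    """
    budget = vram_gb - headroom_gb
    fitting = {name: size for name, size in DISTILLED_QUANT_GB.items() if size <= budget}
    if not fitting:
        return None
    return max(fitting, key=fitting.get)


if __name__ == "__main__":
    for headroom in (3.5, 6.0):
        print(headroom, largest_quant_that_fits(16.0, headroom))
```

With 3.5 GB of assumed headroom the helper lands on Q4_K_S, and with 6 GB it drops to Q3_K_S, matching the two tiers recommended in this step.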

4. Download the embeddings connectors and VAE

The distilled GGUF transformer needs its matching text-projection connectors and the audio + video VAE (LTX-2.3 ships them separately for the GGUF flow):

huggingface-cli download unsloth/LTX-2.3-GGUF \
  text_encoders/ltx-2.3-22b-distilled_embeddings_connectors.safetensors \
  vae/ltx-2.3-22b-distilled_video_vae.safetensors \
  vae/ltx-2.3-22b-distilled_audio_vae.safetensors \
  --local-dir ComfyUI/models/

(Connectors 2.15 GB, video VAE 1.35 GB, audio VAE 0.34 GB — verified live via the unsloth/LTX-2.3-GGUF tree API.)

5. Download a quantized Gemma 3 12B text encoder

LTX-2.3 uses Gemma 3 12B as its text encoder. The default gemma-3-12b-it-qat-q4_0-unquantized weights need ~24-27 GB to operate per Issue #303, far beyond the card's 16 GB, so download a quantized encoder and keep it off the GPU (see Running). The GGUF QAT-Q4 encoder is the smallest broadly-supported option:

huggingface-cli download unsloth/gemma-3-12b-it-qat-GGUF \
  gemma-3-12b-it-qat-UD-Q4_K_XL.gguf \
  mmproj-BF16.gguf \
  --local-dir ComfyUI/models/text_encoders/

The encoder file is 6.92 GB and the mmproj 0.80 GB (verified live via the unsloth/gemma-3-12b-it-qat-GGUF tree API). An FP8 single-file alternative (gemma_3_12B_it_fp8_e4m3fn.safetensors, 12.30 GB) is documented in Issue #303's workaround comment — see Troubleshooting for when to prefer it.
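Before downloading, it can help to total the files from Steps 3 to 5 and compare the result against free disk space. The sketch below sums the sizes quoted in this recipe (about 23.8 GB with the Q4_K_S transformer). The dictionary keys and helper names are illustrative, and treating the quoted sizes as decimal gigabytes is an assumption.

```python
import shutil
from typing import Dict

# Sizes in GB as quoted in Steps 3 to 5 of this recipe.
DOWNLOADS_GB: Dict[str, float] = {
    "transformer Q4_K_S": 12.22,
    "embeddings connectors": 2.15,
    "video VAE": 1.35,
    "audio VAE": 0.34,
    "Gemma 3 12B QAT GGUF": 6.92,
    "Gemma mmproj": 0.80,
}


def total_download_gb(sizes: Dict[str, float] = DOWNLOADS_GB) -> float:
    """Sum the listed file sizes, rounded to two decimals."""
    return round(sum(sizes.values()), 2)


def has_room(path: str, needed_gb: float) -> bool:
    """Check whether the filesystem holding `path` has at least needed_gb free.

    Assumption: sizes are decimal gigabytes (1 GB = 1e9 bytes).
    """
    free_gb = shutil.disk_usage(path).free / 1e9
    return free_gb >= needed_gb


if __name__ == "__main__":
    needed = total_download_gb()
    print(f"Total download: {needed} GB")
    print(f"Enough free space here: {has_room('.', needed)}")
```

Swapping in the Q3_K_S size (9.26 GB) for the transformer entry shows the saving on disk as well as in VRAM.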

Running

Launch ComfyUI with a VRAM mode that forces the heavy encoder off the GPU. The two RTX-5080-proven options from Issue #303 are --novram (stream all weights from RAM) and --reserve-vram 10 (reserve headroom so the encoder spills to CPU):

# Run either option from inside ComfyUI/ with the Step 1 virtual environment active
# Option A: stream weights from RAM (RTX 5080 owner: "only costs my gpu 3 GB VRAM to make a 720p video")
python main.py --listen --novram

# Option B — reserve VRAM so the encoder offloads (RTX 5080 owner: "these settings work for RTX 5080")
python main.py --listen --reserve-vram 10
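If you launch ComfyUI from a script, the two options above can be captured in a small helper. This is a minimal sketch: the function name and the mode labels "novram" and "reserve" are hypothetical, and only the flags shown in the commands above are used.

```python
import sys
from typing import List


def comfyui_launch_args(mode: str, reserve_gb: int = 10) -> List[str]:
    """Build the argv for one of the two launch options described above.

    mode: "novram" (Option A) or "reserve" (Option B); these labels are
    illustrative and are not ComfyUI terminology.
    """
    args = [sys.executable, "main.py", "--listen"]
    if mode == "novram":
        args.append("--novram")
    elif mode == "reserve":
        args += ["--reserve-vram", str(reserve_gb)]
    else:
        raise ValueError(f"unknown mode: {mode!r}")
    return args


if __name__ == "__main__":
    print(" ".join(comfyui_launch_args("novram")))
    print(" ".join(comfyui_launch_args("reserve")))
```

The returned list can be passed to subprocess.run with cwd set to the ComfyUI directory, so the same script works regardless of where it is started.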

Open the browser UI and load one of the example workflows shipped by the Lightricks node:

ComfyUI/custom_nodes/ComfyUI-LTXVideo/example_workflows/2.3/

Swap the default UNet loader for the Unet Loader (GGUF) node from ComfyUI-GGUF and point it at the distilled GGUF from Step 3. Wire the Gemma 3 12B encoder through the GGUF text-encoder loader. The output (silent video, or synchronized audio+video if you load the audio-enabled workflow) lands in ComfyUI/output/.

Recommended distilled settings

Parameter	Value	Source
Sampler steps	8	Distilled checkpoint default per Lightricks/LTX-2.3 model card
CFG	1.0	Same
Resolution	width × height divisible by 32	Lightricks/LTX-2.3 card: "Width & height settings must be divisible by 32."
Frame count	(multiple of 8) + 1 (e.g. 65, 97, 121)	Lightricks/LTX-2.3 card: "Frame count must be divisible by 8 + 1."

Start small (e.g. 512×512, 65 frames) to confirm the install fits before scaling resolution.
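The two constraints in the table are easy to check before queueing a job. The sketch below validates a width, height, and frame count against the rules quoted from the model card; the function names are hypothetical, and rounding an invalid frame count down (rather than up) is an arbitrary choice made here.

```python
from typing import List


def check_ltx_dimensions(width: int, height: int, frames: int) -> List[str]:
    """Return a list of rule violations; an empty list means the settings are valid."""
    problems = []
    if width % 32 != 0:
        problems.append(f"width {width} is not divisible by 32")
    if height % 32 != 0:
        problems.append(f"height {height} is not divisible by 32")
    if frames < 1 or (frames - 1) % 8 != 0:
        problems.append(f"frame count {frames} is not a multiple of 8 plus 1")
    return problems


def snap_frames_down(frames: int) -> int:
    """Round a frame count down to the nearest valid value of the form 8n + 1."""
    return max(1, ((frames - 1) // 8) * 8 + 1)


if __name__ == "__main__":
    print(check_ltx_dimensions(512, 512, 65))   # valid starting point from this recipe
    print(check_ltx_dimensions(500, 512, 64))   # two violations
    print(snap_frames_down(100))
```

For example, a requested 100 frames snaps down to 97, one of the valid counts listed in the table.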

Results

Speed: Omitted. No clean RTX 5080 benchmark for LTX-2.3 22B at a fixed configuration has been published. The RTX 5080 owners in Issue #303 describe the CPU-offloaded encoder only qualitatively ("slow when processing on CPU but the inference part works incredibly fast"), with no comparable timing. Scaling a number down from the 32 GB RTX 5090 is not valid either: the 5080's memory bandwidth (~960 GB/s) and compute differ enough that an extrapolated figure would mislead. Empirical 5080 timings will appear at /check/ltx-video-2-3/rtx-5080 once a community benchmark lands via /contribute.
VRAM usage: With the Gemma encoder forced to CPU via --novram, an RTX 5080 16GB owner reports the GPU footprint drops to roughly 3 GB during 720p generation (Issue #303 comment) — the distilled GGUF transformer streams in as needed. Leave the encoder on the GPU and the same hardware OOMs at "Peak Usage: 29068 MiB" (Issue #303 body, filed on RTX 5080 16GB). The recipe's min_vram_gb: 16 reflects the standard ComfyUI offload path (transformer resident + encoder on CPU), which fills the 16 GB card; see /check/ltx-video-2-3/rtx-5080.
Quality notes: The distilled checkpoint (8 steps, CFG 1.0) trades fine motion detail for speed. Q4_K_S keeps full Q4 quality; Q3_K_S frees ~3 GB of headroom at a small quality cost. The full ltx-2.3-22b-dev BF16 transformer is 42 GB on disk (unsloth/LTX-2.3-GGUF tree) and is out of scope for 16 GB.

For the full benchmark data, see /check/ltx-video-2-3/rtx-5080.

Troubleshooting

Out of memory loading the Gemma 3 12B text encoder

This is the dominant 16 GB failure mode. The unquantized Gemma 3 12B encoder needs ~24-27 GB and OOMs at "Peak Usage: 29068 MiB" on RTX 5080 16GB per Issue #303 (filed on LTX-2 19B, but the encoder is the same Gemma 3 12B used by LTX-2.3, so the failure transfers). Fixes, in order of preference:

Launch with --novram so the encoder runs on CPU. An RTX 5080 owner reports this drops GPU use to ~3 GB (Issue #303 comment).
Or launch with --reserve-vram 10, which another RTX 5080 owner confirms works (Issue #303 comment).
Use the GGUF QAT-Q4 Gemma from Step 5, or the FP8 single-file gemma_3_12B_it_fp8_e4m3fn.safetensors documented in Jackson3195's workaround comment on Issue #303.
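To see how far over budget a failed run was, you can pull the peak figure out of the OOM message. The sketch below matches only the "Peak Usage: 29068 MiB" fragment quoted from Issue #303; the rest of the log format is unknown here, so the regular expression and helper names are assumptions.

```python
import re
from typing import Optional

# Matches the fragment quoted in this recipe, e.g. "Peak Usage: 29068 MiB".
PEAK_RE = re.compile(r"Peak Usage:\s*(\d+)\s*MiB")


def peak_usage_gib(log_text: str) -> Optional[float]:
    """Return the reported peak usage in GiB, or None if the fragment is absent."""
    match = PEAK_RE.search(log_text)
    if match is None:
        return None
    return int(match.group(1)) / 1024


def overshoot_gib(log_text: str, card_gib: float = 16.0) -> Optional[float]:
    """Return how many GiB the peak exceeded the card by (negative if it fit)."""
    peak = peak_usage_gib(log_text)
    if peak is None:
        return None
    return peak - card_gib


if __name__ == "__main__":
    sample = "Peak Usage: 29068 MiB"
    print(f"peak: {peak_usage_gib(sample):.2f} GiB")
    print(f"over a 16 GiB card by: {overshoot_gib(sample):.2f} GiB")
```

On the quoted figure this reports a peak of roughly 28.4 GiB, about 12.4 GiB more than the card holds, which is why quantizing the transformer alone does not rescue the run.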

Encoder is painfully slow on CPU

The --novram path moves the Gemma encoder to CPU, which is slow for the text-encode pass. A community user reports the "LTXV Audio Text Encoder Loader" node loads Gemma "8x times faster then normal loader" (sic) (Issue #303 comment). Replace the default Gemma loader with it and load the single-file safetensors encoder (the FP8 gemma_3_12B_it_fp8_e4m3fn.safetensors from Step 5) rather than the GGUF.

`mat1 and mat2 shapes cannot be multiplied` after enabling `--novram`

A user hit this when running --novram together with --use-sage-attention --fast fp16_accumulation and a --preview-method latent2rgb flag (Issue #303 comment). Remove the sage-attention and custom preview flags; the error traces to preview generation during sampling, not the model itself.
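If your launch command is assembled by a script, the fix above amounts to filtering two flags out of the argument list. The sketch below is illustrative: the function name is hypothetical, and it removes only the sage-attention switch and the preview-method option named in the report, leaving everything else (including --fast) untouched.

```python
from typing import List

# Flags the Issue #303 report suggests dropping when combined with --novram.
SWITCHES_TO_DROP = {"--use-sage-attention"}
OPTIONS_WITH_VALUE_TO_DROP = {"--preview-method"}


def strip_risky_flags(argv: List[str]) -> List[str]:
    """Return argv without the sage-attention switch and the preview-method option."""
    cleaned: List[str] = []
    skip_next = False
    for arg in argv:
        if skip_next:
            # This token is the value of a dropped option, e.g. "latent2rgb".
            skip_next = False
            continue
        name = arg.split("=", 1)[0]
        if name in SWITCHES_TO_DROP:
            continue
        if name in OPTIONS_WITH_VALUE_TO_DROP:
            # "--preview-method latent2rgb" carries its value in the next token;
            # "--preview-method=latent2rgb" carries it inline.
            skip_next = "=" not in arg
            continue
        cleaned.append(arg)
    return cleaned


if __name__ == "__main__":
    failing = ["main.py", "--listen", "--novram", "--use-sage-attention",
               "--fast", "fp16_accumulation", "--preview-method", "latent2rgb"]
    print(strip_risky_flags(failing))
```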

FlashAttention-2 on Blackwell (sm_120)

LTX-2.3's codebase defaults to PyTorch SDPA rather than raw FlashAttention-2, so the FA2 sm_120 wheel gap rarely blocks here. If you do force FA2, sm_120 kernel coverage is tracked at Dao-AILab/flash-attention#2168 — fall back to SDPA on Blackwell until it lands.

Audio-video output not synchronized

LTX-2.3 produces synchronized video + audio in a single model per the Lightricks/LTX-2.3 card, so a clip with no sound usually points to the workflow rather than the model. The non-audio workflows produce silent video; load the audio-enabled workflow from example_workflows/2.3/ in ComfyUI-LTXVideo if you need sound.