
LTX Video 2.3 on RTX 4080: 22B Audio-Video at the 16 GB Floor via GGUF + CPU-Offloaded Gemma

video · advanced · 16GB+ VRAM · May 30, 2026
Prerequisites
  • NVIDIA RTX 4080 (16GB VRAM) or any 16GB consumer GPU
  • 64GB system RAM strongly recommended (the Gemma 3 12B text encoder is streamed/offloaded to RAM)
  • Python 3.10+ and CUDA 12.7+ (the model card's tested configuration is Python >=3.12 with CUDA >12.7). On the RTX 4080 (Ada, sm_89) the default `pip install torch` stable cu124-class wheel already runs on sm_89; no special index-url is needed
  • ComfyUI installed (latest version) + ComfyUI-LTXVideo + ComfyUI-GGUF + KJNodes
  • ~26GB free disk space for the Q4_K_S transformer + connectors + encoder + VAE (~22GB if you pick the Q3_K_S transformer instead)

What You'll Build

Generate short, synchronized audio + video clips locally with LTX Video 2.3 — Lightricks' 22B-parameter DiT audio-video foundation model — on a 16 GB RTX 4080. Per the Lightricks/LTX-2.3 model card, "LTX-2.3 is a DiT-based audio-video foundation model designed to generate synchronized video and audio within a single model." The canonical ComfyUI install wants a "CUDA-compatible GPU with 32GB+ VRAM" (ComfyUI-LTXVideo README) — twice the 4080's 16 GB — so this recipe runs the distilled GGUF transformer with the heavy Gemma 3 12B text encoder offloaded to system RAM. The RTX 4080 is called out by name as a target of this exact 16 GB constraint in ComfyUI-LTXVideo Issue #303, whose reporter writes that the Gemma encoder requirement "makes LTX-2 unusable on consumer GPUs with 16GB VRAM (RTX 5080, RTX 4080, etc.), even with FP8 models and all optimizations applied."

Hardware data: RTX 4080 (16GB VRAM) · Q4_K_S distilled GGUF + CPU-offloaded Gemma 3 12B · See benchmark data

⚠️ The 16 GB envelope is tight and the text encoder is the binding constraint, not the transformer. Gemma 3 12B (LTX-2.3's text encoder) "needs ~24-27GB VRAM to operate" on its own per the Issue #303 report; on a 16 GB card it MUST run on the CPU (or be streamed from RAM). The transformer fits as a GGUF quant; the encoder is what OOMs you if you leave it on the GPU.

Variant pin. This recipe targets LTX-2.3 (22B, canonical repo Lightricks/LTX-2.3). It is NOT for the older LTX-2 19B line (repo Lightricks/LTX-2) nor the LTX-Video 0.9.x family. The Issue #303 reports below were filed against LTX-2 19B; they are cited here only for the Gemma 3 12B encoder failure mode, which carries over because both lines use the same encoder.

ℹ️ License. The LTX-2.3 weights are released under the ltx-2-community-license-agreement (the model card sets license: other with license_name: ltx-2-community-license-agreement), not Apache-2.0. Read the license before any commercial use.

Requirements

| Component | Minimum | Tested |
| --- | --- | --- |
| GPU | 16GB VRAM (Ada, Blackwell, or Ampere) | RTX 4080 (16GB, Ada sm_89) |
| RAM | 32GB | 64GB recommended (Issue #303: 16GB-card reporters used 32–64GB) |
| Storage | ~22GB (with the Q3_K_S transformer) | ~26GB (Q4_K_S transformer + connectors + Gemma encoder + VAE) |
| Software | ComfyUI + ComfyUI-LTXVideo + ComfyUI-GGUF + KJNodes | Python 3.10+, CUDA 12.7+ |

The full unquantized LTX-2.3 requires a "CUDA-compatible GPU with 32GB+ VRAM" per the ComfyUI-LTXVideo README. On 16 GB you replace the BF16 transformer with a distilled GGUF quant and keep the Gemma encoder off the GPU — the steps below.
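Before downloading anything, it can help to check the machine against the requirements above. The following is a minimal sketch using only the standard library; the function names are hypothetical, the 32 GB RAM and Python 3.10 thresholds are taken from the table, and the disk requirement is passed in by the caller because it depends on which quant you choose. It does not probe the GPU.

```python
import os
import shutil
import sys

GB = 1000 ** 3  # decimal GB, the unit used for the file sizes in this recipe


def system_ram_gb():
    """Total physical RAM in GB, or None where os.sysconf is unavailable (e.g. Windows)."""
    if not hasattr(os, "sysconf"):
        return None
    try:
        return os.sysconf("SC_PHYS_PAGES") * os.sysconf("SC_PAGE_SIZE") / GB
    except (ValueError, OSError):
        return None


def preflight(python_version, ram_gb, free_disk_gb, need_disk_gb,
              min_python=(3, 10), min_ram_gb=32):
    """Return a list of problems; an empty list means every check passed."""
    problems = []
    if tuple(python_version[:2]) < tuple(min_python):
        problems.append(f"Python {min_python[0]}.{min_python[1]}+ required")
    # ram_gb is None when it could not be measured; that check is then skipped.
    if ram_gb is not None and ram_gb < min_ram_gb:
        problems.append(f"{ram_gb:.0f} GB RAM found, {min_ram_gb} GB minimum")
    if free_disk_gb < need_disk_gb:
        problems.append(f"{free_disk_gb:.0f} GB free disk, {need_disk_gb} GB needed")
    return problems


def preflight_here(need_disk_gb, path="."):
    """Run the checks against this interpreter, this machine, and the disk holding `path`."""
    free_gb = shutil.disk_usage(path).free / GB
    return preflight(sys.version_info, system_ram_gb(), free_gb, need_disk_gb)
```

Call `preflight_here(need_disk_gb=...)` with the storage figure for your chosen quant and print whatever it returns.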

Installation

1. Install ComfyUI

git clone https://github.com/comfyanonymous/ComfyUI
cd ComfyUI
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

On the RTX 4080 (Ada Lovelace, sm_89) the default pip install torch stable cu124-class wheel already runs on sm_89, so unlike Blackwell (sm_120) no special --index-url / cu128 wheel selection is needed. The LTX-2.3 codebase "was tested with Python >=3.12, CUDA version >12.7, and supports PyTorch ~= 2.7" per the Lightricks/LTX-2.3 model card.
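To confirm that the installed wheel can actually drive the card, you can compare the GPU's compute capability with the architectures the wheel was built for. This is a sketch, not part of the official install: `arch_supported` and `report` are hypothetical helper names, and the check assumes the usual CUDA rule that a binary built for one 8.x target also runs on later 8.x devices, so a wheel that lists only `sm_86` still counts as usable on an `sm_89` card.

```python
import re

# Matches plain targets such as sm_86 or sm_120; PTX entries (compute_XX) and
# suffixed targets (sm_90a) deliberately do not match and are skipped.
_ARCH = re.compile(r"^sm_(\d+)(\d)$")


def arch_supported(capability, arch_list):
    """True if any listed target has the same major version and a minor <= the device's."""
    major, minor = capability
    for arch in arch_list:
        m = _ARCH.match(arch)
        if m is None:
            continue
        if int(m.group(1)) == major and int(m.group(2)) <= minor:
            return True
    return False


def report():
    """Describe the local PyTorch install; needs torch and a visible CUDA GPU."""
    import torch  # imported lazily so arch_supported() works without torch

    if not torch.cuda.is_available():
        return "CUDA is not available to this PyTorch build"
    cap = torch.cuda.get_device_capability(0)
    status = "ok" if arch_supported(cap, torch.cuda.get_arch_list()) else "missing"
    return (f"{torch.cuda.get_device_name(0)}: sm_{cap[0]}{cap[1]} kernels {status} "
            f"(torch {torch.__version__}, CUDA {torch.version.cuda})")
```

Run `print(report())` inside the ComfyUI virtual environment after `pip install -r requirements.txt`.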

2. Install the LTX-Video, GGUF, and KJNodes custom nodes

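# Run from the directory that contains the ComfyUI checkout
# (if you are still inside ComfyUI/ from Step 1, use `cd custom_nodes` instead)
cd ComfyUI/custom_nodes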

# Official Lightricks ComfyUI nodes for LTX-2.3
git clone https://github.com/Lightricks/ComfyUI-LTXVideo.git
pip install -r ComfyUI-LTXVideo/requirements.txt

# city96's GGUF loader — required for the quantized transformer
git clone https://github.com/city96/ComfyUI-GGUF.git
pip install -r ComfyUI-GGUF/requirements.txt

# Kijai's KJNodes — used by the recommended offload workflows
git clone https://github.com/kijai/ComfyUI-KJNodes.git
pip install -r ComfyUI-KJNodes/requirements.txt

The unsloth/LTX-2.3-GGUF model card lists city96/ComfyUI-GGUF and kijai/ComfyUI-KJNodes for the GGUF workflow; the Lightricks LTXVideo nodes provide the LTX-2.3 sampler and example workflows.

3. Download a distilled GGUF transformer that fits 16 GB

The distilled checkpoint runs at 8 steps with CFG 1.0 and is the right choice for a tight card. On 16 GB you want a transformer small enough to leave room for the connectors and VAE-decode activations — Q4_K_S (13.12 GB) or, for more comfortable headroom, Q3_K_S (9.95 GB):

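# Run from the directory that contains ComfyUI/. Files keep their repo subfolder,
# so this one lands in ComfyUI/models/unet/distilled/
# Q4_K_S distilled (13.12 GB on disk): the smaller of the two Q4 files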
huggingface-cli download unsloth/LTX-2.3-GGUF \
  distilled/ltx-2.3-22b-distilled-Q4_K_S.gguf \
  --local-dir ComfyUI/models/unet/

# OR Q3_K_S distilled (9.95 GB) — more headroom, small quality trade
huggingface-cli download unsloth/LTX-2.3-GGUF \
  distilled/ltx-2.3-22b-distilled-Q3_K_S.gguf \
  --local-dir ComfyUI/models/unet/

Distilled-transformer GGUF file sizes (verified live via the unsloth/LTX-2.3-GGUF tree API, distilled/ folder):

| Quant | unsloth distilled file size |
| --- | --- |
| Q2_K | 8.28 GB |
| Q3_K_S | 9.95 GB |
| Q3_K_M | 10.77 GB |
| Q4_K_S | 13.12 GB |
| Q4_K_M | 14.33 GB |
| Q5_K_S | 15.25 GB |
| Q6_K | 17.77 GB |
| Q8_0 | 22.76 GB |
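The choice of quant can be framed as a simple budget: card capacity minus the headroom you want to keep for connectors, VAE decode, and activations. The sketch below uses the file sizes from the table; the headroom values are illustrative assumptions rather than measurements, and on-disk size is only a proxy for resident VRAM.

```python
# Distilled GGUF sizes in GB, copied from the table above.
DISTILLED_GGUF_GB = {
    "Q2_K": 8.28, "Q3_K_S": 9.95, "Q3_K_M": 10.77, "Q4_K_S": 13.12,
    "Q4_K_M": 14.33, "Q5_K_S": 15.25, "Q6_K": 17.77, "Q8_0": 22.76,
}


def quants_that_fit(vram_gb, headroom_gb):
    """Quants whose file size fits in vram_gb after reserving headroom_gb."""
    budget = vram_gb - headroom_gb
    return [name for name, size in DISTILLED_GGUF_GB.items() if size <= budget]


def largest_fit(vram_gb, headroom_gb):
    """Largest quant within budget, or None if nothing fits."""
    fits = quants_that_fit(vram_gb, headroom_gb)
    return max(fits, key=DISTILLED_GGUF_GB.get) if fits else None


# Hypothetical headroom figures: 2.5 GB is tight, 5 GB is comfortable.
print(largest_fit(16, 2.5))  # Q4_K_S
print(largest_fit(16, 5.0))  # Q3_K_M
```

Under these assumed headrooms the result lines up with the two quants this step recommends: Q4_K_S when running tight, a Q3 tier when you want more room.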

There is also a newer distilled-1.1 revision in the same repo (a "different aesthetic experience and improved audio compared to v1.0" per the Lightricks/LTX-2.3 model card); either revision works and this recipe pins the original distilled/. The full ltx-2.3-22b-dev BF16 transformer is 42.04 GB on disk (unsloth/LTX-2.3-GGUF tree) and is out of scope for 16 GB.

4. Download the embeddings connectors and VAE

The distilled GGUF transformer needs its matching text-projection connectors and the audio + video VAE (LTX-2.3 ships them separately for the GGUF flow):

huggingface-cli download unsloth/LTX-2.3-GGUF \
  text_encoders/ltx-2.3-22b-distilled_embeddings_connectors.safetensors \
  vae/ltx-2.3-22b-distilled_video_vae.safetensors \
  vae/ltx-2.3-22b-distilled_audio_vae.safetensors \
  --local-dir ComfyUI/models/

(Connectors 2.31 GB, video VAE 1.45 GB, audio VAE 0.36 GB — verified live via the unsloth/LTX-2.3-GGUF tree API.)

5. Download a quantized Gemma 3 12B text encoder

LTX-2.3 uses Gemma 3 12B as its text encoder. The unquantized Gemma encoder "needs ~24-27GB VRAM to operate" per Issue #303 — far over the card — so download a quantized encoder and keep it off the GPU (Step 6). The GGUF QAT-Q4 encoder is the smallest broadly-supported option:

huggingface-cli download unsloth/gemma-3-12b-it-qat-GGUF \
  gemma-3-12b-it-qat-UD-Q4_K_XL.gguf \
  mmproj-BF16.gguf \
  --local-dir ComfyUI/models/text_encoders/

The encoder file is 7.43 GB and the mmproj 0.85 GB (verified live via the unsloth/gemma-3-12b-it-qat-GGUF tree API). An FP8 single-file alternative (gemma_3_12B_it_fp8_e4m3fn.safetensors) is documented in Issue #303's workaround comment — see Troubleshooting for when to prefer it.
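Summing the file sizes quoted in Steps 3 to 5 gives the total download for this recipe. The sketch below only restates those published sizes; the dictionary keys are descriptive labels, not file names.

```python
# File sizes in GB as quoted in Steps 3-5.
TRANSFORMER_GB = {"Q4_K_S": 13.12, "Q3_K_S": 9.95}
SHARED_GB = {
    "embeddings connectors": 2.31,
    "video VAE": 1.45,
    "audio VAE": 0.36,
    "Gemma 3 12B QAT UD-Q4_K_XL": 7.43,
    "Gemma mmproj (BF16)": 0.85,
}


def total_download_gb(quant):
    """Total size of the chosen transformer plus every shared file, in GB."""
    return round(TRANSFORMER_GB[quant] + sum(SHARED_GB.values()), 2)


for quant in TRANSFORMER_GB:
    print(f"{quant}: {total_download_gb(quant)} GB")
```

With the sizes above this comes to about 25.5 GB for Q4_K_S and about 22.4 GB for Q3_K_S, before any Hugging Face cache overhead.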

Running

Launch ComfyUI with a VRAM mode that forces the heavy encoder off the GPU. The two 16 GB-card-proven options from Issue #303 are --novram (stream all weights from RAM) and --reserve-vram 10 (reserve headroom so the encoder spills to CPU):

# Option A — stream weights from RAM
python main.py --listen --novram

# Option B — reserve VRAM so the encoder offloads
python main.py --listen --reserve-vram 10

With --novram an RTX 5080 16GB owner — same 16 GB tier as the 4080 — reports the GPU footprint drops dramatically: "the inference part works incredibly fast and it only costs my gpu 3 GB VRAM to make a 720p video" (Issue #303 comment, community reporter). For --reserve-vram 10, another RTX 5080 owner confirms "these settings work for RTX 5080" (Issue #303 comment). The 4080 sits in the same 16 GB envelope, so the same offload discipline applies; treat the specific 3 GB figure as the 5080 reporter's measurement, not a 4080 benchmark.
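If you launch ComfyUI from a script, the two options can be expressed as a small argument builder. This is a convenience sketch: `comfy_args` is a hypothetical helper, and it uses only the flags shown in the commands above.

```python
import sys


def comfy_args(mode, reserve_gb=10, listen=True):
    """Build the argv for `python main.py` in one of the two offload modes."""
    args = [sys.executable, "main.py"]
    if listen:
        args.append("--listen")
    if mode == "novram":            # Option A: stream weights from RAM
        args.append("--novram")
    elif mode == "reserve":         # Option B: reserve VRAM so the encoder offloads
        args += ["--reserve-vram", str(reserve_gb)]
    else:
        raise ValueError(f"unknown mode: {mode!r}")
    return args
```

Pass the result to `subprocess.run(comfy_args("novram"), cwd="ComfyUI")` from the directory that contains the checkout.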

Open the browser UI and load one of the example workflows shipped by the Lightricks node:

ComfyUI/custom_nodes/ComfyUI-LTXVideo/example_workflows/2.3/

Swap the default UNet loader for the Unet Loader (GGUF) node from ComfyUI-GGUF and point it at the distilled GGUF from Step 3. Wire the Gemma 3 12B encoder through the GGUF text-encoder loader. The output (silent video, or synchronized audio+video if you load the audio-enabled workflow) lands in ComfyUI/output/.

Recommended distilled settings

| Parameter | Value | Source |
| --- | --- | --- |
| Sampler steps | 8 | Distilled checkpoint default per Lightricks/LTX-2.3 model card |
| CFG | 1.0 | Same |
| Resolution | width & height divisible by 32 | Lightricks/LTX-2.3 card: "Width & height settings must be divisible by 32." |
| Frame count | divisible by 8 + 1 (e.g. 65, 97, 121) | Lightricks/LTX-2.3 card: "Frame count must be divisible by 8 + 1." |

Start small (e.g. 512×512, 65 frames) to confirm the install fits before scaling resolution.
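The two divisibility rules are easy to get wrong when scaling up, so a small validator can save a failed run. The sketch below reads "divisible by 8 + 1" as frames = 8k + 1, which matches the 65, 97, 121 examples; the helper names are hypothetical and the snapping rounds down.

```python
def is_valid(width, height, frames):
    """True when width and height are multiples of 32 and frames is 8k + 1."""
    return width % 32 == 0 and height % 32 == 0 and (frames - 1) % 8 == 0


def snap_dim(pixels):
    """Round a width or height down to the nearest multiple of 32 (minimum 32)."""
    return max(32, (pixels // 32) * 32)


def snap_frames(frames):
    """Round a frame count down to the nearest 8k + 1 value (minimum 9)."""
    return max(9, ((frames - 1) // 8) * 8 + 1)


print(is_valid(512, 512, 65))     # True
print(is_valid(1280, 720, 121))   # False: 720 is not a multiple of 32
print(snap_dim(720), snap_frames(100))  # 704 97
```

Note that a nominal 720p height fails the rule; 704 is the nearest valid value below it.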

Results

  • Speed: Omitted. No RTX 4080 benchmark for LTX-2.3 22B at a fixed configuration has been published, and /check/ltx-video-2-3/rtx-4080 currently has no benchmark data. Re-anchoring from a different-bandwidth sibling would mislead: both the RTX 4090 (~1008 GB/s) and the RTX 5080 (~960 GB/s, Blackwell) have materially more memory bandwidth than the 4080 (~716.8 GB/s), so neither transfers down to the 4080 cleanly. Empirical 4080 timings will appear at /check/ltx-video-2-3/rtx-4080 once a community benchmark lands via /contribute.
  • VRAM usage: With the Gemma encoder forced to CPU, a 16 GB card runs the distilled GGUF transformer resident while streaming the encoder from RAM. Leave the encoder on the GPU and a 16 GB card OOMs at "Peak Usage: 29068 MiB" (Issue #303 body, filed on an RTX 5080 16GB). The recipe's min_vram_gb: 16 reflects the standard ComfyUI offload path (Q4_K_S transformer 13.12 GB resident + connectors + VAE-decode activations + encoder on CPU), which fills the 16 GB card; see /check/ltx-video-2-3/rtx-4080.
  • Quality notes: The distilled checkpoint (8 steps, CFG 1.0) trades fine motion detail for speed. Q4_K_S keeps full Q4 quality; Q3_K_S frees ~3 GB of headroom (9.95 GB vs 13.12 GB on disk) at a small quality cost.
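To put the quoted OOM peak in context, the arithmetic below compares it with the card's capacity. It assumes a 16 GB card exposes a nominal 16 GiB (16384 MiB); the usable figure on a real system is somewhat lower, so the true overshoot is slightly larger.

```python
MIB_PER_GIB = 1024


def overshoot_mib(peak_mib, card_gib):
    """How far a reported peak exceeds the card's nominal capacity, in MiB."""
    return peak_mib - card_gib * MIB_PER_GIB


peak = 29068  # "Peak Usage: 29068 MiB" from the Issue #303 report
over = overshoot_mib(peak, 16)
print(f"{over} MiB over ({over / MIB_PER_GIB:.1f} GiB)")  # 12684 MiB over (12.4 GiB)
```

An overshoot of that size is far beyond what quantizing the transformer alone could recover, which is why the encoder has to leave the GPU.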

Benchmark data, once published, will be at /check/ltx-video-2-3/rtx-4080.

Troubleshooting

Out of memory loading the Gemma 3 12B text encoder

This is the dominant 16 GB failure mode, and the RTX 4080 is named explicitly as a target of it in Issue #303: the unquantized Gemma 3 12B encoder "needs ~24-27GB VRAM to operate" and OOMs at "Peak Usage: 29068 MiB" on a 16 GB card (the report was filed on LTX-2 19B, but the encoder is the same Gemma 3 12B used by LTX-2.3, so the failure transfers). Fixes, in order of preference:

  1. Launch with --novram so the encoder runs on CPU. A 16 GB-card owner reports this drops GPU use to ~3 GB (comment).
  2. Or --reserve-vram 10, which an RTX 5080 owner confirms (comment).
  3. Use the GGUF QAT-Q4 Gemma from Step 5, or the FP8 single-file gemma_3_12B_it_fp8_e4m3fn.safetensors documented in a workaround comment.

Encoder is painfully slow on CPU

The --novram path moves the Gemma encoder to CPU, which is slow for the text-encode pass. A community user reports the "LTXV Audio Text Encoder Loader" node loads Gemma "8x times faster then normal loader" (sic) (Issue #303 comment). Replace the default Gemma loader with it and load the single safetensors encoder file.

mat1 and mat2 shapes cannot be multiplied after enabling --novram

A user hit this when running --novram together with --use-sage-attention --fast fp16_accumulation and a --preview-method latent2rgb flag (Issue #303 comment). Remove the sage-attention and custom preview flags; the error traces to preview generation during sampling, not the model itself.
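If your launch flags live in a script or config, the fix above can be applied mechanically. This sketch removes only the two flags the paragraph says to drop and leaves everything else, including `--fast fp16_accumulation`, untouched; `strip_conflicting` is a hypothetical helper name.

```python
def strip_conflicting(argv):
    """Return argv without --use-sage-attention and without --preview-method <value>."""
    cleaned = []
    skip_next = False
    for arg in argv:
        if skip_next:                      # value that followed --preview-method
            skip_next = False
            continue
        if arg == "--use-sage-attention":
            continue
        if arg == "--preview-method":      # flag and value given as two tokens
            skip_next = True
            continue
        if arg.startswith("--preview-method="):  # flag and value in one token
            continue
        cleaned.append(arg)
    return cleaned


flags = ["main.py", "--novram", "--use-sage-attention",
         "--fast", "fp16_accumulation", "--preview-method", "latent2rgb"]
print(strip_conflicting(flags))
# ['main.py', '--novram', '--fast', 'fp16_accumulation']
```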

FlashAttention on Ada (sm_89)

Unlike Blackwell (sm_120), the RTX 4080's Ada sm_89 architecture is fully covered by prebuilt FlashAttention wheels, so there is no kernel-availability gap to work around — and LTX-2.3's ComfyUI path defaults to PyTorch SDPA regardless. If you instead compile the optional LTX-Video Q8 FP8-matmul kernels, make sure your CUDA toolkit is 12.8+; a mismatched toolkit produces an sm89 assertion failure at the FP8 matmul (Issue #182). The GGUF path in this recipe does not use those kernels.

Audio-video output not synchronized

LTX-2.3 produces synchronized video + audio in a single model per the Lightricks/LTX-2.3 card. The non-audio workflows produce silent video — load the audio-enabled workflow from example_workflows/2.3/ in ComfyUI-LTXVideo if you need sound.