LTX Video 2.3 on RTX 4070 Ti SUPER: 22B Audio-Video at the 16 GB Floor via GGUF + CPU-Offloaded Gemma

video · advanced · 16GB+ VRAM · May 31, 2026
prerequisites
  • NVIDIA RTX 4070 Ti SUPER (16GB VRAM) or any 16GB consumer GPU
  • 64GB system RAM strongly recommended (the Gemma 3 12B text encoder is streamed/offloaded to RAM)
  • Python 3.10+ and CUDA 12.7+ — on the RTX 4070 Ti SUPER (Ada, sm_89) the default `pip install torch` already ships sm_89 kernels in the stable cu124-class wheel; no special index-url is needed
  • ComfyUI installed (latest version) + ComfyUI-LTXVideo + ComfyUI-GGUF + KJNodes
  • ~26GB free disk space for the Q4_K_S transformer + connectors + encoder + VAEs (~22GB if you pick Q3_K_S instead)

What You'll Build

Generate short, synchronized audio + video clips locally with LTX Video 2.3 — Lightricks' 22B-parameter DiT audio-video foundation model — on a 16 GB RTX 4070 Ti SUPER. Per the Lightricks/LTX-2.3 model card, "LTX-2.3 is a DiT-based audio-video foundation model designed to generate synchronized video and audio within a single model." The canonical ComfyUI install wants a "CUDA-compatible GPU with 32GB+ VRAM" (ComfyUI-LTXVideo README) — twice the 4070 Ti SUPER's 16 GB — so this recipe runs the distilled GGUF transformer with the heavy Gemma 3 12B text encoder offloaded to system RAM. The 16 GB constraint is documented in ComfyUI-LTXVideo Issue #303, whose reporter writes that the Gemma encoder requirement "makes LTX-2 unusable on consumer GPUs with 16GB VRAM (RTX 5080, RTX 4080, etc.), even with FP8 models and all optimizations applied." The RTX 4070 Ti SUPER sits in that same 16 GB tier, so the same offload discipline applies.

Hardware data: RTX 4070 Ti SUPER (16GB VRAM) · Q4_K_S distilled GGUF + CPU-offloaded Gemma 3 12B · See benchmark data

⚠️ The 16 GB envelope is tight and the text encoder is the binding constraint, not the transformer. Gemma 3 12B (LTX-2.3's text encoder) "needs ~24-27GB VRAM to operate" on its own per the Issue #303 report; on a 16 GB card it MUST run on the CPU (or be streamed from RAM). The transformer fits as a GGUF quant; the encoder is what OOMs you if you leave it on the GPU.

Variant pin. This recipe targets LTX-2.3 (22B, canonical repo Lightricks/LTX-2.3). It is NOT for the older LTX-2 19B line (repo Lightricks/LTX-2) or the LTX-Video 0.9.x family. The Issue #303 reports cited throughout this recipe were filed against LTX-2 19B; they are used here only for the Gemma 3 12B encoder failure mode, which carries over because both model lines use the same encoder.

ℹ️ License. The LTX-2.3 weights are released under the ltx-2-community-license-agreement (the model card metadata sets license: other and license_name: ltx-2-community-license-agreement), not Apache-2.0. Read the license before any commercial use.

Requirements

Component | Minimum | Tested
GPU | 16GB VRAM (Ada, Blackwell, or Ampere) | RTX 4070 Ti SUPER (16GB, Ada sm_89)
RAM | 32GB | 64GB recommended (Issue #303: 16GB-card reporters used 32–64GB)
Storage | ~22GB (with the Q3_K_S transformer) | ~26GB (Q4_K_S transformer + connectors + Gemma encoder + VAEs)
Software | ComfyUI + ComfyUI-LTXVideo + ComfyUI-GGUF + KJNodes | Python 3.10+, CUDA 12.7+

The full unquantized LTX-2.3 requires a "CUDA-compatible GPU with 32GB+ VRAM" per the ComfyUI-LTXVideo README. On 16 GB you replace the BF16 transformer with a distilled GGUF quant and keep the Gemma encoder off the GPU — the steps below.

Installation

1. Install ComfyUI

git clone https://github.com/comfyanonymous/ComfyUI
cd ComfyUI
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

On the RTX 4070 Ti SUPER (Ada Lovelace, sm_89) the default pip install torch already includes sm_89 kernels in the stable cu124-class wheel — unlike Blackwell (sm_120), no special --index-url / cu128 wheel selection is needed. The LTX-2.3 codebase "was tested with Python >=3.12, CUDA version >12.7, and supports PyTorch ~= 2.7" per the Lightricks/LTX-2.3 model card.
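
To confirm that the installed wheel can actually serve the card, a check along these lines may help. It is a minimal sketch, not part of the official install flow: `arch_is_covered` is a hypothetical helper, and it assumes CUDA's usual binary-compatibility rule (kernels built for a lower minor revision of the same major architecture, such as sm_86, generally run on sm_89), so the wheel does not have to list sm_89 by name. The only PyTorch calls used are `torch.cuda.is_available()`, `torch.cuda.get_device_capability()`, and `torch.cuda.get_arch_list()`.

```python
def arch_is_covered(arch_list, capability):
    """Return True if a wheel's compiled arch list can serve the given GPU.

    arch_list:  strings such as "sm_86", as returned by torch.cuda.get_arch_list()
    capability: (major, minor) tuple, e.g. (8, 9) for Ada
    """
    major, minor = capability
    for arch in arch_list:
        if not arch.startswith("sm_"):
            continue  # skip PTX-only entries such as "compute_90"
        digits = arch[3:]
        if not digits.isdigit():
            continue  # skip suffixed variants such as "sm_90a"
        arch_major, arch_minor = int(digits[:-1]), int(digits[-1])
        # Assumption: same major and lower-or-equal minor means the kernels run.
        if arch_major == major and arch_minor <= minor:
            return True
    return False


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        print("torch is not installed in this environment")
    else:
        if torch.cuda.is_available():
            cap = torch.cuda.get_device_capability(0)
            print("capability:", cap)
            print("covered by wheel:", arch_is_covered(torch.cuda.get_arch_list(), cap))
        else:
            print("no CUDA device visible")
```

Run it inside the virtual environment from this step; a False result on the 4070 Ti SUPER would suggest the wrong wheel was installed.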

2. Install the LTX-Video, GGUF, and KJNodes custom nodes

# Step 1 left the shell inside ComfyUI/, so custom_nodes is directly below
cd custom_nodes

# Official Lightricks ComfyUI nodes for LTX-2.3
git clone https://github.com/Lightricks/ComfyUI-LTXVideo.git
pip install -r ComfyUI-LTXVideo/requirements.txt

# city96's GGUF loader, required for the quantized transformer
git clone https://github.com/city96/ComfyUI-GGUF.git
pip install -r ComfyUI-GGUF/requirements.txt

# Kijai's KJNodes, used by the recommended offload workflows
git clone https://github.com/kijai/ComfyUI-KJNodes.git
pip install -r ComfyUI-KJNodes/requirements.txt

# Return to the directory that contains ComfyUI/; the download paths in Steps 3-5 are relative to it
cd ../..

The unsloth/LTX-2.3-GGUF model card lists city96/ComfyUI-GGUF and kijai/ComfyUI-KJNodes for the GGUF workflow; the Lightricks LTXVideo nodes provide the LTX-2.3 sampler and example workflows.

3. Download a distilled GGUF transformer that fits 16 GB

The distilled checkpoint runs at 8 steps with CFG 1.0 and is the right choice for a tight card. On 16 GB you want a transformer small enough to leave room for the connectors and VAE-decode activations — Q4_K_S (13.12 GB) or, for more comfortable headroom, Q3_K_S (9.95 GB):

# Q4_K_S distilled (13.12 GB on disk) — tightest Q4 tier with full Q4 quality
huggingface-cli download unsloth/LTX-2.3-GGUF \
  distilled/ltx-2.3-22b-distilled-Q4_K_S.gguf \
  --local-dir ComfyUI/models/unet/

# OR Q3_K_S distilled (9.95 GB) — more headroom, small quality trade
huggingface-cli download unsloth/LTX-2.3-GGUF \
  distilled/ltx-2.3-22b-distilled-Q3_K_S.gguf \
  --local-dir ComfyUI/models/unet/

Distilled-transformer GGUF file sizes (verified live via the unsloth/LTX-2.3-GGUF tree API, distilled/ folder):

Quant | unsloth distilled file size
Q2_K | 8.28 GB
Q3_K_S | 9.95 GB
Q3_K_M | 10.77 GB
Q4_K_S | 13.12 GB
Q4_K_M | 14.33 GB
Q5_K_S | 15.25 GB
Q6_K | 17.77 GB
Q8_0 | 22.76 GB

There is also a newer distilled-1.1 revision in the same repo (a "different aesthetic experience and improved audio compared to v1.0" per the Lightricks/LTX-2.3 model card); either revision works and this recipe pins the original distilled/. The full ltx-2.3-22b-dev BF16 transformer is 42.04 GB on disk (unsloth/LTX-2.3-GGUF tree) and is out of scope for 16 GB.
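
The choice between quants in the size table can be sketched as a simple budget lookup. This is an illustration only: the file sizes are the ones listed in this step, on-disk size is used as a rough proxy for resident VRAM, and the budget values passed in are hypothetical placeholders, not measured figures. `largest_quant_within` is a made-up helper name.

```python
# Distilled GGUF file sizes in GB, copied from the table in this step.
DISTILLED_GGUF_GB = {
    "Q2_K": 8.28,
    "Q3_K_S": 9.95,
    "Q3_K_M": 10.77,
    "Q4_K_S": 13.12,
    "Q4_K_M": 14.33,
    "Q5_K_S": 15.25,
    "Q6_K": 17.77,
    "Q8_0": 22.76,
}


def largest_quant_within(budget_gb, sizes=DISTILLED_GGUF_GB):
    """Return the name of the largest quant whose file fits the budget, or None."""
    fitting = {name: gb for name, gb in sizes.items() if gb <= budget_gb}
    if not fitting:
        return None
    return max(fitting, key=fitting.get)


if __name__ == "__main__":
    # Hypothetical budgets: 16 GB minus an assumed allowance for connectors,
    # VAE-decode activations, and the desktop. Tune these for your own system.
    for budget in (13.5, 11.0, 10.0):
        print(budget, "->", largest_quant_within(budget))
```

A tighter budget pushes the pick down the table, which is the same trade the recipe describes between Q4_K_S and Q3_K_S.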

4. Download the embeddings connectors and VAE

The distilled GGUF transformer needs its matching text-projection connectors and the audio + video VAE (LTX-2.3 ships them separately for the GGUF flow):

huggingface-cli download unsloth/LTX-2.3-GGUF \
  text_encoders/ltx-2.3-22b-distilled_embeddings_connectors.safetensors \
  vae/ltx-2.3-22b-distilled_video_vae.safetensors \
  vae/ltx-2.3-22b-distilled_audio_vae.safetensors \
  --local-dir ComfyUI/models/

(Connectors 2.31 GB, video VAE 1.45 GB, audio VAE 0.36 GB — verified live via the unsloth/LTX-2.3-GGUF tree API.)

5. Download a quantized Gemma 3 12B text encoder

LTX-2.3 uses Gemma 3 12B as its text encoder. The unquantized encoder "needs ~24-27GB VRAM to operate" per Issue #303, well beyond the card's 16 GB, so download a quantized encoder and keep it off the GPU (see Running). The GGUF QAT-Q4 encoder is the smallest broadly-supported option:

huggingface-cli download unsloth/gemma-3-12b-it-qat-GGUF \
  gemma-3-12b-it-qat-UD-Q4_K_XL.gguf \
  mmproj-BF16.gguf \
  --local-dir ComfyUI/models/text_encoders/

The encoder file is 7.43 GB and the mmproj 0.85 GB (verified live via the unsloth/gemma-3-12b-it-qat-GGUF tree API). An FP8 single-file alternative (gemma_3_12B_it_fp8_e4m3fn.safetensors) is documented in Issue #303's workaround comment — see Troubleshooting for when to prefer it.
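
For planning disk space, the file sizes quoted in Steps 3 to 5 can simply be added up. The sketch below does that arithmetic and nothing more; the dictionary labels and the `total_download_gb` helper are illustrative names, and the numbers are the ones stated in this recipe.

```python
# Sizes in GB as quoted in Steps 3-5 of this recipe.
TRANSFORMER_GB = {"Q4_K_S": 13.12, "Q3_K_S": 9.95}
SHARED_FILES_GB = {
    "embeddings connectors": 2.31,
    "video VAE": 1.45,
    "audio VAE": 0.36,
    "Gemma 3 12B QAT UD-Q4_K_XL": 7.43,
    "mmproj-BF16": 0.85,
}


def total_download_gb(quant):
    """Total download size in GB for one transformer quant plus the shared files."""
    return round(TRANSFORMER_GB[quant] + sum(SHARED_FILES_GB.values()), 2)


if __name__ == "__main__":
    for quant in TRANSFORMER_GB:
        print(quant, total_download_gb(quant), "GB")
```

With these figures the total comes to about 25.5 GB for Q4_K_S and about 22.4 GB for Q3_K_S, before any ComfyUI or Python package overhead.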

Running

Launch ComfyUI with a VRAM mode that forces the heavy encoder off the GPU. The two 16 GB-card-proven options from Issue #303 are --novram (stream all weights from RAM) and --reserve-vram 10 (hold back 10 GB of VRAM so the encoder spills to CPU):

# Run from the ComfyUI/ directory, with the virtual environment from Step 1 active

# Option A: stream weights from RAM
python main.py --listen --novram

# Option B: reserve VRAM so the encoder offloads
python main.py --listen --reserve-vram 10

With --novram an RTX 5080 16GB owner — same 16 GB tier as the 4070 Ti SUPER — reports the GPU footprint drops dramatically: "the inference part works incredibly fast and it only costs my gpu 3 GB VRAM to make a 720p video" (Issue #303 comment, community reporter). For --reserve-vram 10, another RTX 5080 owner confirms "these settings work for RTX 5080" (Issue #303 comment). The 4070 Ti SUPER sits in the same 16 GB envelope, so the same offload discipline applies; treat the specific 3 GB figure as the 5080 reporter's measurement, not a 4070 Ti SUPER benchmark.

Open the browser UI and load one of the example workflows shipped by the Lightricks node:

ComfyUI/custom_nodes/ComfyUI-LTXVideo/example_workflows/2.3/

Swap the default UNet loader for the Unet Loader (GGUF) node from ComfyUI-GGUF and point it at the distilled GGUF from Step 3. Wire the Gemma 3 12B encoder through the GGUF text-encoder loader. The output (silent video, or synchronized audio+video if you load the audio-enabled workflow) lands in ComfyUI/output/.
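
Before wiring the loaders, it can save time to confirm that every downloaded file is where the nodes will look. The following is a minimal sketch under one assumption: that `huggingface-cli download --local-dir` kept the repository's subfolders (for example `distilled/`) beneath the target directory, which is why the transformer path includes `unet/distilled/`. `missing_model_files` is a hypothetical helper; adjust the list if you chose Q3_K_S.

```python
from pathlib import Path

# Paths relative to ComfyUI/models, derived from the download commands in Steps 3-5.
# Assumption: repo subfolders are preserved under --local-dir.
EXPECTED_FILES = [
    "unet/distilled/ltx-2.3-22b-distilled-Q4_K_S.gguf",
    "text_encoders/ltx-2.3-22b-distilled_embeddings_connectors.safetensors",
    "vae/ltx-2.3-22b-distilled_video_vae.safetensors",
    "vae/ltx-2.3-22b-distilled_audio_vae.safetensors",
    "text_encoders/gemma-3-12b-it-qat-UD-Q4_K_XL.gguf",
    "text_encoders/mmproj-BF16.gguf",
]


def missing_model_files(models_dir, expected=EXPECTED_FILES):
    """Return the expected relative paths that are not present under models_dir."""
    root = Path(models_dir)
    return [rel for rel in expected if not (root / rel).is_file()]


if __name__ == "__main__":
    missing = missing_model_files("ComfyUI/models")
    if missing:
        for rel in missing:
            print("missing:", rel)
    else:
        print("all expected model files are present")
```

Run it from the directory that contains ComfyUI/. An empty result only shows the files exist; it does not verify their contents.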

Recommended distilled settings

Parameter | Value | Source
Sampler steps | 8 | Distilled checkpoint default per Lightricks/LTX-2.3 model card
CFG | 1.0 | Same
Resolution | width & height divisible by 32 | Lightricks/LTX-2.3 card: "Width & height settings must be divisible by 32."
Frame count | a multiple of 8, plus 1 (e.g. 65, 97, 121) | Lightricks/LTX-2.3 card: "Frame count must be divisible by 8 + 1."

Start small (e.g. 512×512, 65 frames) to confirm the install fits before scaling resolution.
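
The two size rules from the settings table are easy to get wrong when scaling up, so a small check can help. This is a sketch of the arithmetic only: `is_valid_ltx_shape` and `snap_down` are hypothetical helper names, and rounding down (rather than up) is an arbitrary choice made here to stay within the memory budget.

```python
def is_valid_ltx_shape(width, height, frames):
    """True if width and height are multiples of 32 and frames is 8*k + 1."""
    return width % 32 == 0 and height % 32 == 0 and frames % 8 == 1


def snap_down(width, height, frames):
    """Round each value down to the nearest one that satisfies the rules."""
    snapped_width = max(32, width - width % 32)
    snapped_height = max(32, height - height % 32)
    snapped_frames = max(1, frames - (frames - 1) % 8)
    return snapped_width, snapped_height, snapped_frames


if __name__ == "__main__":
    print(is_valid_ltx_shape(512, 512, 65))
    print(snap_down(720, 480, 100))
```

For example, a request for 720×480 at 100 frames snaps to 704×480 at 97 frames, since 704 is the nearest multiple of 32 below 720 and 97 is 8 × 12 + 1.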

Results

  • Speed: Omitted. No RTX 4070 Ti SUPER benchmark for LTX-2.3 22B at a fixed configuration has been published, and /check/ltx-video-2-3/rtx-4070-ti-super currently has no benchmark data. Re-anchoring from a different-bandwidth sibling would mislead: the RTX 4080 (~716.8 GB/s) is faster than the 4070 Ti SUPER (~672 GB/s), and the RTX 4090 (~1008 GB/s) and RTX 5080 (~960 GB/s, Blackwell) are faster again — none transfer to the 4070 Ti SUPER cleanly. Empirical 4070 Ti SUPER timings will appear at /check/ltx-video-2-3/rtx-4070-ti-super once a community benchmark lands via /contribute.
  • VRAM usage: With the Gemma encoder forced to CPU, a 16 GB card runs the distilled GGUF transformer resident while streaming the encoder from RAM. Leave the encoder on the GPU and a 16 GB card OOMs at "Peak Usage: 29068 MiB" (Issue #303 body, filed on an RTX 5080 16GB). The recipe's min_vram_gb: 16 reflects the standard ComfyUI offload path (Q4_K_S transformer 13.12 GB resident + connectors + VAE-decode activations + encoder on CPU), which fills the 16 GB card; see /check/ltx-video-2-3/rtx-4070-ti-super.
  • Quality notes: The distilled checkpoint (8 steps, CFG 1.0) trades fine motion detail for speed. Q4_K_S keeps full Q4 quality; Q3_K_S frees ~3 GB of headroom (9.95 GB vs 13.12 GB on disk) at a small quality cost.
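
The size of the gap behind the OOM figure quoted in the list above can be worked out directly. The sketch converts the reported peak from MiB to GiB and compares it with the card. It assumes the nominal 16 GB card exposes 16 GiB (16384 MiB); usable VRAM is somewhat lower in practice, so the real shortfall is larger. The helper names are illustrative.

```python
MIB_PER_GIB = 1024


def mib_to_gib(mib):
    """Convert mebibytes to gibibytes."""
    return mib / MIB_PER_GIB


def overshoot_mib(peak_mib, card_gib):
    """How many MiB the reported peak exceeds the card's nominal capacity."""
    return peak_mib - card_gib * MIB_PER_GIB


if __name__ == "__main__":
    peak = 29068  # "Peak Usage: 29068 MiB" from the Issue #303 body
    print(round(mib_to_gib(peak), 2), "GiB peak")
    print(overshoot_mib(peak, 16), "MiB over a nominal 16 GiB card")
```

The reported peak is about 28.39 GiB, or 12684 MiB more than the card can hold, which is why trimming settings alone does not help and the encoder has to leave the GPU.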

For the full benchmark data, see /check/ltx-video-2-3/rtx-4070-ti-super.

Troubleshooting

Out of memory loading the Gemma 3 12B text encoder

This is the dominant 16 GB failure mode. Issue #303 names the RTX 4080 and RTX 5080 explicitly as 16 GB-tier cards hit by it — the RTX 4070 Ti SUPER is the same 16 GB tier and behaves identically. The unquantized Gemma 3 12B encoder "needs ~24-27GB VRAM to operate" and OOMs at "Peak Usage: 29068 MiB" on a 16 GB card (the report was filed on LTX-2 19B, but the encoder is the same Gemma 3 12B used by LTX-2.3, so the failure transfers). Fixes, in order of preference:

  1. Launch with --novram so the encoder runs on CPU. An RTX 5080 16GB owner reports this drops GPU use to ~3 GB (Issue #303 comment).
  2. Or launch with --reserve-vram 10, which another RTX 5080 owner confirms works (Issue #303 comment).
  3. Use the GGUF QAT-Q4 Gemma from Step 5, or the FP8 single-file gemma_3_12B_it_fp8_e4m3fn.safetensors documented in an Issue #303 workaround comment.

Encoder is painfully slow on CPU

The --novram path moves the Gemma encoder to CPU, which is slow for the text-encode pass. A community user reports the "LTXV Audio Text Encoder Loader" node loads Gemma "8x times faster then normal loader" (sic) (Issue #303 comment). Replace the default Gemma loader with it and load the single safetensors encoder file.

mat1 and mat2 shapes cannot be multiplied after enabling --novram

A user hit this when running --novram together with --use-sage-attention --fast fp16_accumulation and a --preview-method latent2rgb flag (Issue #303 comment). Remove the sage-attention and custom preview flags; the error traces to preview generation during sampling, not the model itself.
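
If ComfyUI is started from a wrapper script, the flag cleanup described above can be automated. This is a minimal sketch: `drop_conflicting_flags` is a hypothetical helper, and it removes only the two flags named in this section (`--use-sage-attention` and `--preview-method` together with its value), leaving everything else untouched.

```python
def drop_conflicting_flags(argv):
    """Return argv without --use-sage-attention and --preview-method <value>."""
    cleaned = []
    skip_next = False
    for arg in argv:
        if skip_next:
            skip_next = False  # this item was the value of --preview-method
            continue
        if arg == "--use-sage-attention":
            continue
        if arg == "--preview-method":
            skip_next = True  # drop the flag and the value that follows it
            continue
        if arg.startswith("--preview-method="):
            continue
        cleaned.append(arg)
    return cleaned


if __name__ == "__main__":
    original = [
        "main.py", "--listen", "--novram",
        "--use-sage-attention", "--fast", "fp16_accumulation",
        "--preview-method", "latent2rgb",
    ]
    print(drop_conflicting_flags(original))
```

The `--fast fp16_accumulation` pair is kept because the fix described here only calls for removing the sage-attention and preview flags.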

FlashAttention on Ada (sm_89)

Unlike Blackwell (sm_120), the RTX 4070 Ti SUPER's Ada sm_89 architecture is fully covered by prebuilt FlashAttention wheels, so there is no kernel-availability gap to work around — and LTX-2.3's ComfyUI path defaults to PyTorch SDPA regardless. If you instead compile the optional LTX-Video Q8 FP8-matmul kernels, make sure your CUDA toolkit is 12.8+; a mismatched toolkit produces an sm89 assertion failure at the FP8 matmul (Issue #182). The GGUF path in this recipe does not use those kernels.

Audio-video output not synchronized

LTX-2.3 produces synchronized video + audio in a single model per the Lightricks/LTX-2.3 card. The non-audio workflows produce silent video — load the audio-enabled workflow from example_workflows/2.3/ in ComfyUI-LTXVideo if you need sound.