Sulphur 2 on RTX 4060 Ti 16GB: Uncensored LTX-2.3 Video via ComfyUI GGUF

video · advanced · 16GB+ VRAM · May 21, 2026

Prerequisites
  • NVIDIA RTX 4060 Ti (16GB VRAM) or any 16GB consumer GPU
  • 32GB+ system RAM (Gemma 3 12B text encoder is offloaded to CPU)
  • Python 3.10+ and CUDA 12.6+ (Ada sm_89 — stock PyTorch wheels work; no special wheel selection required)
  • ComfyUI installed (latest version) with ComfyUI-LTXVideo, ComfyUI-GGUF, ComfyUI-KJNodes custom nodes
  • ~25GB free disk space for Q4-tier GGUF + Gemma 3 QAT encoder + LTX VAE
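
A quick preflight check for the list above might look like the following. It is a minimal sketch: has_free_gb is a hypothetical helper name, the 25 GB threshold is the approximate figure from the list, and the Python and nvidia-smi checks are skipped when those tools are not on PATH.

```shell
# Hypothetical helper: succeed when <available_kib> covers <needed_gb> (decimal GB).
has_free_gb() {  # usage: has_free_gb <needed_gb> <available_kib>
  awk -v need="$1" -v avail="$2" 'BEGIN { exit !(avail * 1024 / 1e9 >= need) }'
}

# Python must be 3.10 or newer.
if command -v python3 >/dev/null 2>&1; then
  python3 -c 'import sys; print("python ok" if sys.version_info >= (3, 10) else "python too old")'
fi

# df -Pk prints POSIX-format output; column 4 of line 2 is the available KiB.
avail_kib=$(df -Pk . | awk 'NR == 2 { print $4 }')
if has_free_gb 25 "$avail_kib"; then echo "disk ok"; else echo "less than ~25 GB free"; fi

# Report GPU name and total VRAM if the NVIDIA driver tools are installed.
if command -v nvidia-smi >/dev/null 2>&1; then
  nvidia-smi --query-gpu=name,memory.total --format=csv,noheader
fi
```

Run it from the drive where ComfyUI will live, since df reports free space for the current directory's filesystem.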

What You'll Build

Generate uncensored text-to-video and image-to-video clips locally with Sulphur 2 — a 21B-param LTX-2.3 fine-tune from SulphurAI — on an RTX 4060 Ti 16GB. The upstream sulphur_dev_fp8mixed.safetensors weighs 27.16 GiB (upstream tree) — too large to fit on 16GB VRAM by itself, let alone alongside the Gemma 3 12B text encoder. This recipe uses the community Q4_K_S GGUF (12.29 GiB / 13.2 GB-decimal) from vantagewithai/Sulphur-2-Base-GGUF together with a quantized Gemma 3 12B QAT encoder.

Hardware data: RTX 4060 Ti 16GB · Q4_K_S GGUF + Gemma 3 12B QAT-Q4 encoder · See benchmark data

⚠️ Tight on 16 GB. The upstream sulphur_dev_bf16.safetensors (42.97 GiB) and sulphur_dev_fp8mixed.safetensors (27.16 GiB) shipped on SulphurAI/Sulphur-2-base do not fit in 16GB VRAM. The GGUF path below is the only one that runs on this card, and even then the encoder + DiT + VAE stack lives within 1–2 GiB of the ceiling — start with low frame counts and resolution and scale up carefully.

Requirements

| Component | Minimum | Tested |
|---|---|---|
| GPU | 16GB VRAM (Ada sm_89 or newer) | RTX 4060 Ti 16GB |
| RAM | 32GB | 32GB |
| Storage | ~25GB | Q4_K_S 12.29 GiB + Gemma 3 QAT encoder ~7 GiB + LTX VAE 2.28 GiB (Kijai tree API) |
| Software | ComfyUI + ComfyUI-LTXVideo + ComfyUI-GGUF + KJNodes | Python 3.10+, CUDA 12.6+ |

Sulphur 2 inherits the LTX-2.3 architecture (architecture: ltxv per the vantagewithai GGUF card) and the same Gemma 3 12B text-encoder requirement that ships with LTX-2.3. The 4060 Ti 16GB uses Ada Lovelace sm_89 — full FlashAttention-2 kernel coverage is in stock CUDA wheels (no cu128-specific wheel selection is required; that requirement only applies to Blackwell sm_120 cards like the 5060 Ti).

Installation

1. Install ComfyUI and the LTX-Video custom nodes

git clone https://github.com/comfyanonymous/ComfyUI
cd ComfyUI
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

cd custom_nodes
git clone https://github.com/Lightricks/ComfyUI-LTXVideo.git
pip install -r ComfyUI-LTXVideo/requirements.txt

git clone https://github.com/city96/ComfyUI-GGUF.git
pip install -r ComfyUI-GGUF/requirements.txt

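git clone https://github.com/kijai/ComfyUI-KJNodes.git
pip install -r ComfyUI-KJNodes/requirements.txt

# Return to the directory that contains ComfyUI/; the download commands in the next steps use paths relative to it
cd ../..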

The canonical Sulphur-2 workflow uses LTXV-prefixed nodes (LTXVConcatAVLatent, LTXVCropGuides, LTXVPreprocess) together with SamplerCustomAdvanced, confirmed by inspecting workflows/ltx23_t2v distilled.json on the upstream repo. SamplerCustomAdvanced is a core ComfyUI node; the LTXV nodes are covered by the ComfyUI-LTXVideo install above on an up-to-date ComfyUI.

2. Download the Q4_K_S Sulphur-2 GGUF

# Q4_K_S — 12.29 GiB (13.2 GB-decimal), the sweet spot for 16GB VRAM
huggingface-cli download vantagewithai/Sulphur-2-Base-GGUF \
  sulphur_dev-Q4_K_S.gguf \
  --local-dir ComfyUI/models/unet/

Quant-tier file-size reference (precise GiB from the HF tree API; GB-decimal matches the vantagewithai card per-tier table — 21B params, architecture: ltxv):

| Quant | File size (GiB) | GB-decimal | Fits 16GB GPU? |
|---|---|---|---|
| Q3_K_S | 9.63 | 10.3 | yes (headroom for encoder + activations) |
| Q3_K_M | 10.37 | 11.1 | yes (more headroom than Q4_K_S) |
| Q4_K_S | 12.29 | 13.2 | yes (recommended) |
| Q4_K_M | 13.31 | 14.3 | tight: possible if you cap resolution / frames aggressively |
| Q5_K_S | 14.01 | 15.0 | no (no room for encoder + activations) |
| Q5_K_M | 15.03 | 16.1 | no |
| Q6_K | 16.55 | 17.8 | no (weights alone exceed VRAM) |
| Q8_0 | 21.19 | 22.8 | no |

The 16GB ceiling is anchored on the closest published consumer-GPU LTX-family measurement: a ComfyUI user on a 16GB card running the architecturally related LTX-2 19B distilled stack reported a peak of 14926 MiB (~14.6 GiB) during sampling and hit an OOM shortly after (Comfy-Org/ComfyUI#11726). Sulphur-2's 21B distilled weights at Q4_K_S (12.29 GiB) are a similar size to the LTX-2 distilled weights in that report; the peak across DiT + encoder + VAE on this card lives in the 13.5–15 GiB band, leaving only 1–2 GiB of headroom against the 16 GB ceiling. Q4_K_M (13.31 GiB) is feasible only at very low frame counts; Q5 and above will OOM.
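
The unit conversions behind the table and the paragraph above can be reproduced with a few awk one-liners. This is a sketch only: the function names are hypothetical, the 16 GiB card size is an assumed default, and the headroom figure covers the weights alone. The 13.5–15 GiB peak band comes from the cited reports, not from this arithmetic.

```shell
# Convert binary GiB to decimal GB (1 GiB = 1073741824 bytes).
gib_to_gb() { awk -v g="$1" 'BEGIN { printf "%.1f\n", g * 1073741824 / 1e9 }'; }

# Convert MiB (the unit nvidia-smi reports) to GiB.
mib_to_gib() { awk -v m="$1" 'BEGIN { printf "%.2f\n", m / 1024 }'; }

# VRAM left after the weights alone; the 16 GiB default is an assumption.
weights_headroom_gib() {
  awk -v w="$1" -v v="${2:-16}" 'BEGIN { printf "%.2f\n", v - w }'
}

# Print the decimal size and weights-only headroom for a few tiers from the table.
while read -r name size; do
  echo "$name: $(gib_to_gb "$size") GB-decimal, $(weights_headroom_gib "$size") GiB left after weights"
done <<'EOF'
Q4_K_S 12.29
Q4_K_M 13.31
Q5_K_S 14.01
Q6_K 16.55
EOF
```

A negative headroom (as for Q6_K) means the weights alone exceed the card; a small positive one still has to absorb activations, the VAE, and the encoder whenever it is resident.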

3. Download the quantized Gemma 3 12B text encoder

Sulphur 2 inherits LTX-2.3's Gemma 3 12B text encoder. The full unquantized Gemma 3 12B will OOM on 16GB cards when loaded alongside the Sulphur 2 weights — the closest published consumer-GPU OOM datapoint in the LTX family is 29068 MiB peak on RTX 5080 16GB with the LTX-2 19B-dev-fp8 stack (Lightricks/ComfyUI-LTXVideo#303), and Sulphur-2's 21B distilled weights are heavier still. Use the QAT-Q4 GGUF instead:

huggingface-cli download unsloth/gemma-3-12b-it-qat-GGUF \
  gemma-3-12b-it-qat-UD-Q4_K_XL.gguf \
  --local-dir ComfyUI/models/text_encoders/

huggingface-cli download unsloth/gemma-3-12b-it-qat-GGUF \
  mmproj-BF16.gguf \
  --local-dir ComfyUI/models/text_encoders/

Both files are loaded by ComfyUI-GGUF's Gemma encoder node.

4. Download the LTX video VAE (Kijai community mirror)

Sulphur 2 reuses the upstream LTX video VAE — SulphurAI/Sulphur-2-base does not expose the VAE as a standalone file (it ships only sulphur_dev_bf16, sulphur_dev_fp8mixed, sulphur_distil_bf16, the rank-768 LoRA, and workflows). The simplest path for the GGUF-only flow is the community mirror by Kijai, which exposes a standalone bf16 VAE — architecture: ltxv is shared across the LTX family:

huggingface-cli download Kijai/LTXV2_comfy \
  VAE/LTX2_video_vae_bf16.safetensors \
  --local-dir ComfyUI/models/vae/

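File listing confirmed at Kijai/LTXV2_comfy. huggingface-cli keeps the repo-relative path under --local-dir, so the file lands at ComfyUI/models/vae/VAE/LTX2_video_vae_bf16.safetensors and appears in ComfyUI's VAE loader as VAE/LTX2_video_vae_bf16.safetensors.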
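
Before moving on, it can help to confirm that the model files from steps 2 to 4 are where ComfyUI will look for them. The sketch below assumes the download commands were run from the directory that contains ComfyUI/ and that huggingface-cli kept the repo's VAE/ prefix; missing_files is a hypothetical helper name.

```shell
# Print every path under <root> that is missing or empty.
missing_files() {  # usage: missing_files <root> <relative path>...
  mf_root=$1
  shift
  for mf_rel in "$@"; do
    [ -s "$mf_root/$mf_rel" ] || printf '%s\n' "$mf_rel"
  done
}

# No output means all four files are present and non-empty.
missing_files ComfyUI/models \
  unet/sulphur_dev-Q4_K_S.gguf \
  text_encoders/gemma-3-12b-it-qat-UD-Q4_K_XL.gguf \
  text_encoders/mmproj-BF16.gguf \
  vae/VAE/LTX2_video_vae_bf16.safetensors
```

The check only tests for existence and non-zero size; it does not verify checksums or that a download completed.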

5. Download the canonical Sulphur-2 workflow JSON

The canonical Sulphur 2 ComfyUI workflow lives on the upstream SulphurAI repo:

huggingface-cli download SulphurAI/Sulphur-2-base \
  "workflows/ltx23_t2v distilled.json" \
  --local-dir ComfyUI/user/default/workflows/

Optionally also pull the distill LoRA — per the SulphurAI README, this is the recommended quality path when running the dev (non-distilled) full-precision weights. If you loaded the GGUF in step 2, you do NOT need the LoRA — the distill is already baked into those weights.

# Optional — only needed if you switch away from the GGUF path
huggingface-cli download SulphurAI/Sulphur-2-base \
  sulphur_lora_rank_768.safetensors \
  --local-dir ComfyUI/models/loras/

The upstream README explicitly notes: "I'm aware the workflows contain sulphur_final right now, just use the lora or use the full models, don't use both at the same time."
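
To see which Sulphur file names the downloaded workflow actually references (including sulphur_final), a plain text search over the JSON is enough. This is a rough sketch: workflow_model_refs is a hypothetical helper, the pattern simply matches any string starting with sulphur_, and the path assumes huggingface-cli kept the repo's workflows/ prefix under the --local-dir used above.

```shell
# List the distinct "sulphur_*" names that appear anywhere in a workflow JSON.
workflow_model_refs() {  # usage: workflow_model_refs <workflow.json>
  grep -o 'sulphur_[A-Za-z0-9_.-]*' "$1" | sort -u
}

# Path relative to the directory that contains ComfyUI/; skipped if the file is absent.
wf="ComfyUI/user/default/workflows/workflows/ltx23_t2v distilled.json"
if [ -f "$wf" ]; then
  workflow_model_refs "$wf"
fi
```

Each name in the output is a loader widget to repoint or bypass after the workflow is opened in ComfyUI.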

Running

Launch ComfyUI from inside the ComfyUI directory, with the virtual environment from step 1 active:

python main.py --listen

Open the browser UI, then load the workflow downloaded in step 5. huggingface-cli keeps the repo's workflows/ prefix under --local-dir, so the file is at:

ComfyUI/user/default/workflows/workflows/ltx23_t2v distilled.json

In the loaded graph, swap the default UNet loader for the Unet Loader (GGUF) node from ComfyUI-GGUF (point it at sulphur_dev-Q4_K_S.gguf), and point the text encoder at the GGUF Gemma 3 loader from the same custom node pack. The canonical workflow defaults are tuned for high-VRAM cards, so adjust them on the 4060 Ti 16GB:

| Parameter | Canonical default | Recommended on 16GB | Source |
|---|---|---|---|
| Frame count | 18 (the LTXVPreprocess widget value in the shipped workflow) | start at 65 max | LTXVPreprocess widget in ltx23_t2v distilled.json |
| Resolution (longer edge) | 1536 px | drop to 832 px for first run | ResizeImagesByLongerEdge widget in the same file |

Once the workflow loads cleanly at 832 px / 65 frames, scale up only while peak VRAM stays comfortably below 16 GiB in nvidia-smi. The 4060 Ti 16GB has the same VRAM ceiling as the 5060 Ti 16GB used in the sibling recipe but ~25–30% less compute (Ada sm_89 vs Blackwell sm_120) — expect proportionally slower wall-time, not different VRAM behaviour.
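
One way to watch peak VRAM while a generation runs is to poll nvidia-smi once per second and keep a running maximum. The sketch below assumes a single GPU (add -i 0 to the nvidia-smi call to pin the first device on multi-GPU machines); peak_mib is a hypothetical helper name.

```shell
# Read one memory.used value (MiB) per line and print each new running maximum.
peak_mib() {
  awk '$1 + 0 > max { max = $1 + 0; print max; fflush() }'
}

# Usage in a second terminal while ComfyUI is sampling (Ctrl-C to stop);
# the last number printed is the peak in MiB:
#   nvidia-smi --query-gpu=memory.used --format=csv,noheader,nounits -l 1 | peak_mib
```

Because polling is once per second, short allocation spikes between samples can be missed, so treat the result as a lower bound on the true peak.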

Optional: prompt enhancer

The upstream Sulphur 2 ships a Q8_0 prompt enhancer (sulphur_prompt_enhancer_model-q8_0.gguf + mmproj-BF16.gguf) intended to be used via LM Studio. Per the SulphurAI README: create Sulphur/promptenhancer/ inside your LM Studio model folder, drop both files in, and load the model from LM Studio's UI. There is no system prompt — send the raw text (and optionally an image) you want enhanced.

Results

  • Speed: Omitted — no published Sulphur-2 benchmark on an RTX 4060 Ti 16GB at the time of writing. Empirical 4060 Ti 16GB data will land at /check/sulphur-2/rtx-4060-ti-16gb once a community benchmark report is contributed via /contribute.
  • VRAM usage: The closest cited consumer-16GB peak from the LTX family is 14926 MiB (~14.6 GiB) during sampling, reported by a ComfyUI user on a 16GB card running the LTX-2 19B distilled stack (Comfy-Org/ComfyUI#11726). Sulphur-2 at Q4_K_S (12.29 GiB weights) is similar in size, so plan on a runtime peak in the 13.5–15 GiB band with the QAT-Q4 Gemma encoder: within the 16 GiB ceiling but with very little headroom. If you load the unquantized Gemma 3 12B encoder instead, expect a peak closer to the 29068 MiB (~28.4 GiB) reported in Lightricks/ComfyUI-LTXVideo#303, which is why the QAT encoder in step 3 is mandatory on this card.
  • Quality notes: The Sulphur 2 GGUF is a quantization of the distilled checkpoint — expect the same 8-step / CFG=1 short-step sampling profile as LTX-2.3 distilled. Q3 tier and below shows noticeable quality regression; Q4_K_S is the recommended balance.

For up-to-date benchmark data on this pair, see /check/sulphur-2/rtx-4060-ti-16gb.

Troubleshooting

"Can I run this at full bf16 on 16 GB?" — No

The upstream sulphur_dev_bf16.safetensors is 42.97 GiB and sulphur_dev_fp8mixed.safetensors is 27.16 GiB (upstream tree). Each file alone exceeds 16 GiB by a wide margin before the encoder, VAE, and activations enter VRAM. The Q4_K_S GGUF in step 2 is the only path that runs on this card. If you have a 24GB+ card, see the upstream README for the fp8mixed flow.

OOM when loading the text encoder

Same root cause as documented upstream — the default unquantized Gemma 3 12B encoder OOMs on 16GB cards when loaded alongside the Sulphur 2 weights (Lightricks/ComfyUI-LTXVideo#303 reports peak 29068 MiB on RTX 5080 16GB with the LTX-2 19B-dev-fp8 pipeline, and Sulphur-2's 21B distilled weights are heavier still). Replace with gemma-3-12b-it-qat-UD-Q4_K_XL.gguf from Unsloth (step 3 above). Also enable CPU offload for the Gemma encoder via the KJNodes model-offload nodes — keep the encoder unloaded from VRAM while the DiT is sampling.

"sulphur_final" referenced in the workflow but missing locally

The upstream workflow JSON contains a sulphur_final checkpoint reference that does not exist as a published file. Per the SulphurAI README: "the workflows contain sulphur_final right now, just use the lora or use the full models, don't use both at the same time." If you used the GGUF in step 2, point the loader at sulphur_dev-Q4_K_S.gguf instead and delete or bypass the LoRA node — the distill is already baked into the GGUF weights.

Gemma GGUF loader fails or outputs gibberish

The Gemma 3 GGUF loader in ComfyUI-GGUF required PRs #399 and #402 to be merged for the LTX-2 family path (Sulphur-2 inherits this requirement via the LTX-2.3 lineage it builds on); pull the latest city96/ComfyUI-GGUF main — both PRs are now merged (Kijai/LTXV2_comfy discussion #7).

Slow generation

Keep the Gemma encoder offloaded with the KJNodes model-offload nodes; VRAM thrashing on a 16GB card kills wall time. The 4060 Ti 16GB's memory bandwidth is ~288 GB/s, about 64% of the 5060 Ti's 448 GB/s (roughly 36% lower), so VRAM pressure that's tolerable on the 5060 Ti will be more punishing on the 4060 Ti. Drop frame counts and resolution further if you observe steady-state VRAM oscillation in nvidia-smi. Empirical 4060 Ti 16GB wall-time numbers will appear at /check/sulphur-2/rtx-4060-ti-16gb once contributed.

Pushing beyond Q4_K_S

The Q4_K_M GGUF (13.31 GiB) is technically loadable on the 4060 Ti 16GB but leaves only ~1 GiB of headroom for activations + the QAT Gemma encoder, so it is feasible only at the lowest frame counts (≤ 25) and resolution (≤ 768 px). Q5 and above will OOM: per the vantagewithai per-tier table, the Q5_K_S weights alone are 15.0 GB-decimal (14.01 GiB), which already exceeds the practical ceiling once the encoder and activations are counted.