How much VRAM does Sulphur 2 need?

About 16 GB — the minimum this recipe targets.

How hard is this setup?

Advanced. Follow the steps below.

Sulphur 2 on RTX 4060 Ti 16GB: Uncensored LTX-2.3 Video via ComfyUI GGUF

What You'll Build

Generate uncensored text-to-video and image-to-video clips locally with Sulphur 2 — a 21B-param LTX-2.3 fine-tune from SulphurAI — on an RTX 4060 Ti 16GB. The upstream sulphur_dev_fp8mixed.safetensors weighs 27.16 GiB (upstream tree) — too large to fit on 16GB VRAM by itself, let alone alongside the Gemma 3 12B text encoder. This recipe uses the community Q4_K_S GGUF (12.29 GiB / 13.2 GB-decimal) from vantagewithai/Sulphur-2-Base-GGUF together with a quantized Gemma 3 12B QAT encoder.

Hardware data: RTX 4060 Ti 16GB · Q4_K_S GGUF + Gemma 3 12B QAT-Q4 encoder

⚠️ Tight on 16 GB. The upstream sulphur_dev_bf16.safetensors (42.97 GiB) and sulphur_dev_fp8mixed.safetensors (27.16 GiB) shipped on SulphurAI/Sulphur-2-base do not fit in 16GB VRAM. The GGUF path below is the only one that runs on this card, and even then the encoder + DiT + VAE stack lives within 1–2 GiB of the ceiling — start with low frame counts and resolution and scale up carefully.

Requirements

Component	Minimum	Tested
GPU	16GB VRAM (Ada sm_89 or newer)	RTX 4060 Ti 16GB
RAM	32GB	32GB
Storage	~25GB	Q4_K_S 12.29 GiB + distill LoRA 662 MB + Gemma 3 QAT encoder ~7 GiB + LTX VAE 2.28 GiB (Kijai tree API)
Software	ComfyUI + ComfyUI-LTXVideo + ComfyUI-GGUF + KJNodes	Python 3.10+, CUDA 12.6+
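As a rough cross-check of the storage row, the listed download sizes can be summed. This is only a sketch: it assumes the 662 MB LoRA figure is in decimal megabytes and takes the encoder at the approximate 7 GiB quoted in the table. `total_gib` is a helper name introduced here for illustration.

```shell
# Sum the download sizes from the Requirements table (all values in GiB).
# Assumption: the 662 MB distill LoRA is decimal MB, converted to GiB below.
total_gib() {
  awk 'BEGIN {
    gguf    = 12.29                    # Q4_K_S GGUF
    lora    = 662000000 / 1073741824   # 662 MB (decimal) expressed in GiB
    encoder = 7                        # Gemma 3 QAT encoder, approximate
    vae     = 2.28                     # LTX video VAE
    printf "%.1f\n", gguf + lora + encoder + vae
  }'
}

total_gib   # prints the total in GiB
```

Under those assumptions the total comes to roughly 22.2 GiB, which leaves a little room inside the ~25GB minimum for the workflow file and custom nodes.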

Sulphur 2 inherits the LTX-2.3 architecture (architecture: ltxv per the vantagewithai GGUF card) and the same Gemma 3 12B text-encoder requirement that ships with LTX-2.3. The 4060 Ti 16GB uses Ada Lovelace sm_89 — full FlashAttention-2 kernel coverage is in stock CUDA wheels (no cu128-specific wheel selection is required; that requirement only applies to Blackwell sm_120 cards like the 5060 Ti).

Installation

1. Install ComfyUI and the LTX-Video custom nodes

git clone https://github.com/comfyanonymous/ComfyUI
cd ComfyUI
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

cd custom_nodes
git clone https://github.com/Lightricks/ComfyUI-LTXVideo.git
pip install -r ComfyUI-LTXVideo/requirements.txt

git clone https://github.com/city96/ComfyUI-GGUF.git
pip install -r ComfyUI-GGUF/requirements.txt

git clone https://github.com/kijai/ComfyUI-KJNodes.git
pip install -r ComfyUI-KJNodes/requirements.txt

The canonical Sulphur-2 workflow uses LTXV-prefixed nodes (LTXVConcatAVLatent, LTXVCropGuides, LTXVPreprocess), which this recipe sources from ComfyUI-LTXVideo, together with SamplerCustomAdvanced, which ships with core ComfyUI. The node list was confirmed by inspecting workflows/ltx23_t2v distilled.json on the upstream repo.
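Before moving on, it may be worth confirming that all three node packs were actually cloned. The helper below is a minimal sketch; `check_custom_nodes` is a name made up for this example, and it only tests that the directories exist, not that their Python requirements installed cleanly.

```shell
# Report which of the custom node packs from step 1 are missing.
# $1 is the path to ComfyUI/custom_nodes. Prints one line per missing pack
# and returns non-zero if any pack is absent.
check_custom_nodes() {
  missing=0
  for pack in ComfyUI-LTXVideo ComfyUI-GGUF ComfyUI-KJNodes; do
    if [ ! -d "$1/$pack" ]; then
      echo "missing: $pack"
      missing=1
    fi
  done
  return "$missing"
}

# Example (run from the directory that contains ComfyUI/):
#   check_custom_nodes ComfyUI/custom_nodes && echo "all node packs present"
```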

2. Download the Q4_K_S Sulphur-2 GGUF

# Run from the directory that contains ComfyUI/ (step 1 ends inside ComfyUI/custom_nodes)
# Q4_K_S: 12.29 GiB (13.2 GB-decimal), the sweet spot for 16GB VRAM
huggingface-cli download vantagewithai/Sulphur-2-Base-GGUF \
  sulphur_dev-Q4_K_S.gguf \
  --local-dir ComfyUI/models/unet/

Quant-tier file-size reference. GiB values are from the HF tree API; GB-decimal values match the per-tier table on the vantagewithai card (21B params, architecture: ltxv):

Quant	File size (GiB)	GB-decimal	Fits 16GB GPU?
Q3_K_S	9.63	10.3	yes (headroom for encoder + activations)
Q3_K_M	10.37	11.1	yes (more headroom than Q4_K_S)
Q4_K_S	12.29	13.2	yes — recommended
Q4_K_M	13.31	14.3	tight — possible if you cap resolution / frames aggressively
Q5_K_S	14.01	15.0	no (no room for encoder + activations)
Q5_K_M	15.03	16.1	no
Q6_K	16.55	17.8	no (weights alone exceed VRAM)
Q8_0	21.19	22.8	no

The 16GB ceiling is anchored on the closest published consumer-GPU LTX-family measurement: a ComfyUI user on a 16GB card running the architecturally related LTX-2 19B distilled stack reported a peak of 14926 MiB (~14.6 GiB) during sampling and hit OOM shortly after (Comfy-Org/ComfyUI#11726). Sulphur-2's 21B dev weights at Q4_K_S (12.29 GiB) are a similar size to the LTX-2 distilled weights in that report, so the peak across DiT + encoder + VAE on this card is expected to sit in the 13.5–15 GiB band, leaving only 1–2 GiB of headroom against the 16 GB ceiling. Q4_K_M (13.31 GiB) is feasible only at very low frame counts; Q5 and above will OOM.
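The unit conversions behind these figures are easy to reproduce. The sketch below uses three helper names invented for this example (`gib_to_gb`, `mib_to_gib`, `weights_headroom_gib`) and assumes a nominal 16 GiB card; the usable figure reported by the driver is slightly lower.

```shell
# Convert GiB (2^30 bytes) to decimal GB (10^9 bytes), one decimal place.
gib_to_gb() {
  awk -v gib="$1" 'BEGIN { printf "%.1f\n", gib * 1073741824 / 1000000000 }'
}

# Convert MiB, as reported by nvidia-smi, to GiB, one decimal place.
mib_to_gib() {
  awk -v mib="$1" 'BEGIN { printf "%.1f\n", mib / 1024 }'
}

# GiB left on a nominal 16 GiB card after the weights alone are loaded.
# Assumption: 16 GiB total; real usable VRAM is a little less.
weights_headroom_gib() {
  awk -v gib="$1" 'BEGIN { printf "%.2f\n", 16 - gib }'
}

gib_to_gb 12.29             # Q4_K_S file size in GB-decimal
mib_to_gib 14926            # peak reported in Comfy-Org/ComfyUI#11726
weights_headroom_gib 12.29  # space left for encoder, VAE, and activations
```

The last value is only the gap left by the weights. The encoder, VAE, and activations all have to share it, which is why the usable headroom at runtime is much smaller.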

3. Download the quantized Gemma 3 12B text encoder

Sulphur 2 inherits LTX-2.3's Gemma 3 12B text encoder. The full unquantized Gemma 3 12B will OOM on 16GB cards when loaded alongside the Sulphur 2 weights — the closest published consumer-GPU OOM datapoint in the LTX family is 29068 MiB peak on RTX 5080 16GB with the LTX-2 19B-dev-fp8 stack (Lightricks/ComfyUI-LTXVideo#303), and Sulphur-2's 21B dev weights are heavier still. Use the QAT-Q4 GGUF instead:

huggingface-cli download unsloth/gemma-3-12b-it-qat-GGUF \
  gemma-3-12b-it-qat-UD-Q4_K_XL.gguf \
  --local-dir ComfyUI/models/text_encoders/

huggingface-cli download unsloth/gemma-3-12b-it-qat-GGUF \
  mmproj-BF16.gguf \
  --local-dir ComfyUI/models/text_encoders/

Both files are loaded by ComfyUI-GGUF's Gemma encoder node.

4. Download the LTX video VAE (Kijai community mirror)

Sulphur 2 reuses the upstream LTX video VAE — SulphurAI/Sulphur-2-base does not expose the VAE as a standalone file (it ships only sulphur_dev_bf16, sulphur_dev_fp8mixed, sulphur_distil_bf16, the rank-768 LoRA, and workflows). The simplest path for the GGUF-only flow is the community mirror by Kijai, which exposes a standalone bf16 VAE — architecture: ltxv is shared across the LTX family:

huggingface-cli download Kijai/LTXV2_comfy \
  VAE/LTX2_video_vae_bf16.safetensors \
  --local-dir ComfyUI/models/vae/

File listing confirmed at Kijai/LTXV2_comfy.

5. Download the canonical Sulphur-2 workflow JSON

The canonical Sulphur 2 ComfyUI workflow lives on the upstream SulphurAI repo:

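# huggingface-cli keeps the repo's workflows/ subfolder under --local-dir,
# so target the parent directory to land the file in ComfyUI/user/default/workflows/
huggingface-cli download SulphurAI/Sulphur-2-base \
  "workflows/ltx23_t2v distilled.json" \
  --local-dir ComfyUI/user/default/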

Also pull the distill LoRA, which is required on the GGUF path. Every tier published by vantagewithai is named sulphur_dev-*.gguf: the GGUF is a quantization of the non-distilled dev weights, not of the distilled checkpoint. The dev model's base output is corrupted without a distill LoRA (Discussion #14), and the canonical ltx23_t2v distilled.json wires in the 662 MB LoRA from the repo's distill_loras/ folder. Download it:

# Required distill LoRA — the in-workflow distill_loras/ one (the dev GGUF is not distilled)
huggingface-cli download SulphurAI/Sulphur-2-base \
  distill_loras/ltx-2.3-22b-distilled-lora-1.1_fro90_ceil72_condsafe.safetensors \
  --local-dir ComfyUI/models/loras/

The upstream README explicitly notes: "I'm aware the workflows contain sulphur_final right now, just use the lora or use the full models, don't use both at the same time." That means EITHER (dev weights + this distill LoRA — the GGUF path here) OR a full distilled checkpoint (sulphur_distil_bf16.safetensors), never both. Keep the workflow's distill-LoRA nodes wired (next section); without them the 8-step / CFG=1 schedule runs un-distilled and produces degraded, under-denoised output. (The repo's heavier sulphur_lora_rank_768.safetensors, 9.56 GiB, is a 24 GB-class alternative — skip it on 16 GB.)
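With every download done, a quick existence check can catch a file that landed in the wrong place. The sketch below assumes the download commands were run as written from the directory that contains ComfyUI/. `check_models` is a hypothetical helper name, and the `VAE/` and `distill_loras/` path segments reflect that huggingface-cli keeps each file's repo subfolder under `--local-dir`.

```shell
# Check that the model files from steps 2-5 are where ComfyUI will look.
# $1 is the ComfyUI root. Prints one line per file and returns non-zero
# if any file is missing.
check_models() {
  status=0
  for rel in \
    models/unet/sulphur_dev-Q4_K_S.gguf \
    models/text_encoders/gemma-3-12b-it-qat-UD-Q4_K_XL.gguf \
    models/text_encoders/mmproj-BF16.gguf \
    models/vae/VAE/LTX2_video_vae_bf16.safetensors \
    models/loras/distill_loras/ltx-2.3-22b-distilled-lora-1.1_fro90_ceil72_condsafe.safetensors
  do
    if [ -f "$1/$rel" ]; then
      echo "ok      $rel"
    else
      echo "missing $rel"
      status=1
    fi
  done
  return "$status"
}

# Example:
#   check_models ComfyUI
```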

Running

Launch ComfyUI from the ComfyUI directory, with the virtual environment from step 1 active:

python main.py --listen

Open the browser UI, then load the workflow downloaded in step 5:

ComfyUI/user/default/workflows/ltx23_t2v distilled.json

In the loaded graph, swap the default UNet loader for the Unet Loader (GGUF) node from ComfyUI-GGUF (point it at sulphur_dev-Q4_K_S.gguf), and point the text encoder at the GGUF Gemma 3 loader from the same custom node pack. Keep the workflow's LoraLoaderModelOnly distill-LoRA node wired and point it at ltx-2.3-22b-distilled-lora-1.1_fro90_ceil72_condsafe.safetensors (the file you downloaded). Do not delete it; the dev GGUF needs the distill LoRA for the 8-step / CFG=1 short-step schedule. The canonical workflow values are tuned for high-VRAM cards, so set them conservatively on the 4060 Ti 16GB:

Parameter	Canonical default	Recommended on 16GB	Source
Frame count	18 (the `LTXVPreprocess` widget value in the shipped workflow)	start at 65 max	`LTXVPreprocess` widget in `ltx23_t2v distilled.json`
Resolution (longer edge)	1536 px	drop to 832 px for first run	`ResizeImagesByLongerEdge` widget in the same file
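Before editing the graph by hand, it can help to see which names the workflow file actually references, for example the sulphur_final checkpoint or the distill LoRA node. The snippet below is a minimal sketch using a helper name made up here (`workflow_mentions`). It counts matching lines, so on a single-line JSON file it works as a presence check rather than a true count.

```shell
# Count how many lines of a workflow JSON mention a literal string.
# $1 = workflow file, $2 = string to look for (matched literally, not as a regex).
# Always prints a number; "0" means the string does not appear.
workflow_mentions() {
  grep -c -F -- "$2" "$1" || true
}

# Example (adjust the path to wherever the workflow JSON was saved):
#   wf="path/to/ltx23_t2v distilled.json"
#   workflow_mentions "$wf" sulphur_final
#   workflow_mentions "$wf" LoraLoaderModelOnly
#   workflow_mentions "$wf" LTXVPreprocess
```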

Once the workflow loads cleanly at 832 px / 65 frames, scale up only while peak VRAM stays comfortably below 16 GiB in nvidia-smi. The 4060 Ti 16GB has the same VRAM ceiling as the 5060 Ti 16GB used in the sibling recipe but ~25–30% less compute (Ada sm_89 vs Blackwell sm_120) — expect proportionally slower wall-time, not different VRAM behaviour.
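One way to watch peak VRAM while scaling up is to poll nvidia-smi once per second and keep only the running maximum. The filter below is a sketch under the assumption of a single-GPU machine (with several GPUs, nvidia-smi prints one line per GPU and the values would be mixed). `running_peak` is a name introduced for this example.

```shell
# Read one memory.used value (MiB) per line on stdin and print a new line
# each time a higher value is seen, so the last line printed is the peak so far.
running_peak() {
  awk '$1 + 0 > max { max = $1 + 0; print max; fflush() }'
}

# Example, in a second terminal while ComfyUI is sampling (Ctrl-C to stop):
#   nvidia-smi --query-gpu=memory.used --format=csv,noheader,nounits -l 1 | running_peak
```

Printing on every new maximum, rather than once at the end, means the peak is still visible after the pipeline is interrupted.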

Optional: prompt enhancer

The upstream Sulphur 2 ships a Q8_0 prompt enhancer (sulphur_prompt_enhancer_model-q8_0.gguf + mmproj-BF16.gguf) intended to be used via LM Studio. Per the SulphurAI README: create Sulphur/promptenhancer/ inside your LM Studio model folder, drop both files in, and load the model from LM Studio's UI. There is no system prompt — send the raw text (and optionally an image) you want enhanced.
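The folder setup described above can be scripted. This is a sketch only: `install_enhancer` is a hypothetical helper, and the location of the LM Studio model folder is not given in the source, so it is passed in as an argument (check LM Studio's settings for the actual path).

```shell
# Copy the two prompt-enhancer files into Sulphur/promptenhancer/ under an
# LM Studio model folder.
# $1 = folder holding the two downloaded enhancer files
# $2 = your LM Studio model folder (assumption: you look this up yourself)
install_enhancer() {
  dest="$2/Sulphur/promptenhancer"
  mkdir -p "$dest" &&
    cp "$1/sulphur_prompt_enhancer_model-q8_0.gguf" "$1/mmproj-BF16.gguf" "$dest/"
}

# Example:
#   install_enhancer ~/Downloads/sulphur-enhancer "/path/to/lmstudio/models"
```

The enhancer's mmproj-BF16.gguf has the same filename as the Gemma file from step 3, so keeping the enhancer download in its own folder avoids copying the wrong one.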

Results

Speed: Omitted — no published Sulphur-2 benchmark on an RTX 4060 Ti 16GB at the time of writing. Empirical 4060 Ti 16GB data will land at /check/sulphur-2/rtx-4060-ti-16gb once a community benchmark report is contributed via /contribute.
VRAM usage: The closest cited consumer-16GB peak from the LTX family is 14926 MiB (~14.6 GiB) during sampling, reported by a ComfyUI user on a 16GB card running the LTX-2 19B distilled stack (Comfy-Org/ComfyUI#11726). Sulphur-2 at Q4_K_S (12.29 GiB weights) is similar in size, so plan on a runtime peak in the 13.5–15 GiB band with the QAT-Q4 Gemma encoder: within the 16 GiB ceiling but with very little headroom. If you load the unquantized Gemma 3 12B encoder instead, the closest cited peak is 29068 MiB (~28.4 GiB) on the LTX-2 19B-dev-fp8 stack (Lightricks/ComfyUI-LTXVideo#303), which is why the QAT encoder in step 3 is mandatory on this card.
Quality notes: The Sulphur 2 GGUF is a quantization of the non-distilled dev weights; the distill LoRA wired into the workflow supplies the 8-step / CFG=1 short-step sampling profile (matching LTX-2.3 distilled). Run it without the LoRA and the output under-denoises. Q3 tiers and below show noticeable quality regression; Q4_K_S is the recommended balance.

For up-to-date benchmark data on this pair, see /check/sulphur-2/rtx-4060-ti-16gb.

Troubleshooting

"Can I run this at full bf16 on 16 GB?" — No

The upstream sulphur_dev_bf16.safetensors is 42.97 GiB and sulphur_dev_fp8mixed.safetensors is 27.16 GiB (upstream tree). Each file alone exceeds 16 GiB by a wide margin, before the encoder, VAE, and activations enter VRAM. The Q4_K_S GGUF in step 2 is the only path that runs on this card. If you have a 24GB+ card, see the upstream README for the fp8mixed flow.

OOM when loading the text encoder

The root cause is the same as documented upstream: the default unquantized Gemma 3 12B encoder OOMs on 16GB cards when loaded alongside the Sulphur 2 weights (Lightricks/ComfyUI-LTXVideo#303 reports a peak of 29068 MiB on an RTX 5080 16GB with the LTX-2 19B-dev-fp8 pipeline, and Sulphur-2's 21B dev weights are heavier still). Replace it with gemma-3-12b-it-qat-UD-Q4_K_XL.gguf from Unsloth (step 3 above). Also enable CPU offload for the Gemma encoder via the KJNodes model-offload nodes, so the encoder stays out of VRAM while the DiT is sampling.

"sulphur_final" referenced in the workflow but missing locally

The upstream workflow JSON contains a sulphur_final checkpoint reference that does not exist as a published file. Per the SulphurAI README: "the workflows contain sulphur_final right now, just use the lora or use the full models, don't use both at the same time." On the GGUF path, point the model loader at sulphur_dev-Q4_K_S.gguf and keep the workflow's LoraLoaderModelOnly distill-LoRA node pointed at ltx-2.3-22b-distilled-lora-1.1_fro90_ceil72_condsafe.safetensors (from step 5) — do not delete it. The dev GGUF is not distilled; without the LoRA the 8-step / CFG=1 schedule runs un-distilled and degrades output.

Gemma GGUF loader fails or outputs gibberish

The Gemma 3 GGUF loader in ComfyUI-GGUF required PRs #399 and #402 to be merged for the LTX-2 family path (Sulphur-2 inherits this requirement via the LTX-2.3 lineage it builds on); pull the latest city96/ComfyUI-GGUF main — both PRs are now merged (Kijai/LTXV2_comfy discussion #7).

Slow generation

Keep the Gemma encoder offloaded with the KJNodes model-offload nodes; VRAM thrashing on a 16GB card kills wall time. The 4060 Ti 16GB's memory bandwidth is ~288 GB/s, about 64% of the 5060 Ti's 448 GB/s, so VRAM pressure that's tolerable on the 5060 Ti will be more punishing on the 4060 Ti. Drop frame counts and resolution further if you observe steady-state VRAM oscillation in nvidia-smi. Empirical 4060 Ti 16GB wall-time numbers will appear at /check/sulphur-2/rtx-4060-ti-16gb once contributed.
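The ratio between the two bandwidth figures quoted above can be checked with a one-line calculation. The helper name `bandwidth_pct` is made up for this sketch; the inputs are the 288 GB/s and 448 GB/s values from the text.

```shell
# Ratio of two memory-bandwidth figures, as a percentage with one decimal place.
bandwidth_pct() {
  awk -v a="$1" -v b="$2" 'BEGIN { printf "%.1f\n", 100 * a / b }'
}

bandwidth_pct 288 448   # 4060 Ti 16GB relative to the 5060 Ti; prints 64.3
```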

Pushing beyond Q4_K_S

The Q4_K_M GGUF (13.31 GiB) loads on the 4060 Ti 16GB, but it is about 1 GiB larger than Q4_K_S, which uses up most of the 1–2 GiB of headroom this recipe runs with once activations and the QAT Gemma encoder are accounted for. It is feasible only at the lowest frame counts (≤ 25) and resolution (≤ 768 px). Q5 and above will OOM: at 15.0 GB-decimal in the vantagewithai per-tier table, Q5_K_S already exceeds the practical ceiling.