Sulphur 2 on RTX 4070 Ti SUPER: Uncensored LTX-2.3 Video via ComfyUI GGUF

What You'll Build

Generate uncensored text-to-video and image-to-video clips locally with Sulphur 2 — an LTX-2.3 fine-tune from SulphurAI — on an RTX 4070 Ti SUPER 16GB. The upstream full-precision sulphur_dev_bf16.safetensors weighs 42.97 GiB and the sulphur_dev_fp8mixed.safetensors weighs 27.16 GiB (upstream tree) — both far too large for 16GB VRAM, let alone alongside the Gemma 3 12B text encoder. This recipe runs the community Q4_K_S GGUF (12.29 GiB / 13.2 GB-decimal) from vantagewithai/Sulphur-2-Base-GGUF — a quantization of the non-distilled dev weights — with the upstream distill LoRA applied on top for the fast short-step schedule, plus a quantized Gemma 3 12B QAT encoder.

Hardware data: RTX 4070 Ti SUPER 16GB · Q4_K_S dev GGUF + distill LoRA + Gemma 3 12B QAT-Q4 encoder · Benchmark data: /check/sulphur-2/rtx-4070-ti-super

⚠️ The GGUF is the dev model — you must add the distill LoRA. Every tier published by vantagewithai is named sulphur_dev-*.gguf: it is a quantization of the non-distilled dev weights, not of the distilled checkpoint. Per the SulphurAI README, the fast short-step path comes from applying the distill LoRA on top of the dev weights — "I recommend downloading either of the dev versions, (fp8mixed or bf16) and downloading the distill lora provided." On this recipe's GGUF path, the dev GGUF replaces the dev safetensors, and the distill LoRA (sulphur_lora_rank_768.safetensors) is still required. Without it, the 8-step / CFG=1 schedule below runs against an un-distilled model and produces degraded output. The README's note "just use the lora or use the full models, don't use both at the same time" means: EITHER (dev weights + distill LoRA) OR a full distilled model (sulphur_distil_bf16.safetensors) — never stack the LoRA on an already-distilled full model, and never load the non-existent sulphur_final the shipped workflow references.

⚠️ Tight on 16 GB. Even the Q4_K_S dev GGUF + distill LoRA + QAT encoder + VAE stack lives within ~1–2 GiB of the 16 GiB ceiling. Start with low frame counts and resolution and scale up carefully. The RTX 4070 Ti SUPER's compute advantage over slower 16GB cards buys you faster generation, not more headroom: the VRAM ceiling is identical.

Requirements

Component	Minimum	Tested
GPU	16GB VRAM (Ada sm_89 or newer)	RTX 4070 Ti SUPER 16GB
RAM	32GB	32GB
Storage	~32GB	Q4_K_S 12.29 GiB + distill LoRA 9.56 GiB + Gemma 3 QAT encoder ~7.7 GiB (6.92 + 0.80) + LTX VAE 2.28 GiB (Kijai VAE tree API)
Software	ComfyUI + ComfyUI-LTXVideo + ComfyUI-GGUF + KJNodes	Python 3.10+, CUDA 12.4+
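The ~32GB storage figure is the sum of the per-file sizes in the table above. A minimal Python sketch of that arithmetic follows; the sizes are copied from the table, while the names DOWNLOADS_GIB, total_gib, and has_room are illustrative and not part of any tool.

```python
import shutil

# On-disk sizes in GiB, copied from the Requirements table.
DOWNLOADS_GIB = {
    "sulphur_dev-Q4_K_S.gguf": 12.29,
    "sulphur_lora_rank_768.safetensors": 9.56,
    "gemma-3-12b-it-qat-UD-Q4_K_XL.gguf": 6.92,
    "mmproj-BF16.gguf": 0.80,
    "LTX2_video_vae_bf16.safetensors": 2.28,
}

BYTES_PER_GIB = 1024 ** 3


def total_gib(sizes=DOWNLOADS_GIB):
    """Sum of all model downloads, in GiB."""
    return sum(sizes.values())


def has_room(path=".", needed_gib=None):
    """True if the filesystem holding `path` has at least `needed_gib` free."""
    needed = total_gib() if needed_gib is None else needed_gib
    return shutil.disk_usage(path).free >= needed * BYTES_PER_GIB
```

total_gib() comes to about 31.85 GiB, which is where the ~32GB minimum comes from. Calling has_room("ComfyUI") before downloading gives a quick check of the target disk; leave some extra space for ComfyUI itself and its Python environment.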

Sulphur 2 is a fine-tune of Lightricks' LTX-2.3 (architecture: ltxv per the vantagewithai GGUF card) and inherits its Gemma 3 12B text encoder. The RTX 4070 Ti SUPER uses Ada Lovelace sm_89 — full FlashAttention-2 kernel coverage is in stock CUDA wheels (the default pip install torch works; no cu128-specific index-url is required, since that requirement only applies to Blackwell sm_120 cards). The 4070 Ti SUPER also has native 4th-gen FP8 tensor cores, but the GGUF path below uses k-quant integer kernels, not FP8, so that capability isn't exercised here.

Installation

1. Install ComfyUI and the LTX-Video custom nodes

git clone https://github.com/comfyanonymous/ComfyUI
cd ComfyUI
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

cd custom_nodes
git clone https://github.com/Lightricks/ComfyUI-LTXVideo.git
pip install -r ComfyUI-LTXVideo/requirements.txt

git clone https://github.com/city96/ComfyUI-GGUF.git
pip install -r ComfyUI-GGUF/requirements.txt

git clone https://github.com/kijai/ComfyUI-KJNodes.git
pip install -r ComfyUI-KJNodes/requirements.txt

The canonical Sulphur-2 workflow uses LTXV-prefixed nodes (LTXVConcatAVLatent, LTXVCropGuides, LTXVPreprocess, LTXVScheduler) alongside SamplerCustomAdvanced, which is a core ComfyUI node rather than an LTXV one. The node list was confirmed by inspecting workflows/ltx23_t2v distilled.json on the upstream repo; installing ComfyUI-LTXVideo as above covers the LTX-specific nodes.

2. Download the Q4_K_S Sulphur-2 dev GGUF

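# Run this and the later download commands from the directory that contains ComfyUI/
# (step 1 leaves the shell in ComfyUI/custom_nodes).
# Q4_K_S: 12.29 GiB (13.2 GB-decimal), the sweet spot for 16GB VRAM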
huggingface-cli download vantagewithai/Sulphur-2-Base-GGUF \
  sulphur_dev-Q4_K_S.gguf \
  --local-dir ComfyUI/models/unet/

Quant-tier file-size reference. GiB values are taken from the HF tree API; GB-decimal values match the per-tier table on the vantagewithai card (architecture: ltxv). Every tier is a quant of the non-distilled dev weights (sulphur_dev-*):

Quant	File size (GiB)	GB-decimal	Fits 16GB GPU?
Q3_K_S	9.63	10.3	yes (headroom for LoRA + encoder + activations)
Q3_K_M	10.37	11.1	yes (more headroom than Q4_K_S)
Q4_K_S	12.29	13.2	yes — recommended
Q4_K_M	13.31	14.3	tight — possible if you cap resolution / frames aggressively
Q5_K_S	14.01	15.0	no (no room for LoRA + encoder + activations)
Q5_K_M	15.03	16.1	no
Q6_K	16.55	17.8	no (weights alone exceed VRAM)
Q8_0	21.19	22.8	no

The 16GB ceiling reflects a closely related community datapoint: a 16GB ComfyUI user running the architecturally related LTX-2 19B distilled stack recorded a sampling-stage peak of 14926 MiB (~14.6 GiB) in Comfy-Org/ComfyUI#11726. Their OOM occurred later, at the 2x-upscale second-sampler stage at 1080p / >200 frames, not at modest settings. That single datapoint suggests the peak across DiT + LoRA + encoder + VAE on a 16GB card lives in the 13.5–15 GiB band at conservative settings, leaving only 1–2 GiB of headroom. Q4_K_M (13.31 GiB) is feasible only at very low frame counts; Q5 and above will OOM. This VRAM envelope is identical on every 16GB card: the RTX 4070 Ti SUPER's compute advantage changes wall-time, not the memory ceiling.
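The two size columns in the quant table are the same numbers in different units, and the "fits" column follows from how much of the 16 GiB is left after the weights. The sketch below reproduces that arithmetic in Python. The file sizes are copied from the table; the helper names are illustrative, and the "nominal headroom" is only VRAM minus file size, before the LoRA, encoder, VAE, and activations are counted.

```python
# File sizes in GiB, copied from the quant-tier table.
QUANT_GIB = {
    "Q3_K_S": 9.63, "Q3_K_M": 10.37, "Q4_K_S": 12.29, "Q4_K_M": 13.31,
    "Q5_K_S": 14.01, "Q5_K_M": 15.03, "Q6_K": 16.55, "Q8_0": 21.19,
}
VRAM_GIB = 16.0  # nominal capacity of the card


def gib_to_gb(gib):
    """Convert binary gibibytes (2**30 bytes) to decimal gigabytes (10**9 bytes)."""
    return gib * 1024 ** 3 / 1e9


def nominal_headroom_gib(quant):
    """VRAM left after the DiT weights alone; negative means the file cannot fit."""
    return VRAM_GIB - QUANT_GIB[quant]


for name, gib in QUANT_GIB.items():
    print(f"{name}: {gib:.2f} GiB = {gib_to_gb(gib):.1f} GB, "
          f"nominal headroom {nominal_headroom_gib(name):+.2f} GiB")
```

Q4_K_S leaves a nominal 3.71 GiB and Q4_K_M 2.69 GiB, while Q6_K is already negative. The observed 13.5–15 GiB runtime band is what turns those nominal figures into the much tighter practical margin described above.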

3. Download the distill LoRA (required for the 8-step path)

The dev GGUF is not distilled. To get Sulphur 2's fast short-step (8-step / CFG=1) behavior, you must apply the upstream distill LoRA on top of it:

huggingface-cli download SulphurAI/Sulphur-2-base \
  sulphur_lora_rank_768.safetensors \
  --local-dir ComfyUI/models/loras/

This is the sulphur_lora_rank_768.safetensors file (9.56 GiB) listed on the upstream tree. Per the SulphurAI README, the two valid configurations are mutually exclusive — "just use the lora or use the full models, don't use both at the same time" — i.e. EITHER (dev weights + this LoRA, which is the GGUF path here) OR a full distilled checkpoint, never both.

4. Download the quantized Gemma 3 12B text encoder

Sulphur 2 inherits LTX-2.3's Gemma 3 12B text encoder. The full unquantized Gemma 3 12B will OOM on 16GB cards when loaded alongside the Sulphur 2 weights. A community feature request (Lightricks/ComfyUI-LTXVideo#303) documents this directly: it reports that LTX-2 with Gemma 3 12B "needs ~24-27GB VRAM to operate" and is therefore unusable on "consumer GPUs with 16GB VRAM (RTX 5080, RTX 4080, etc.)", and it quotes Peak Usage: 29068 MiB (~28.4 GiB) from a 16GB card running the LTX-2 19B-dev-fp8 stack. The issue was opened by community user Jackson3195, who has no affiliation with the repository (author_association: NONE), so treat it as a user report rather than official guidance. Use the QAT-Q4 GGUF encoder instead:

huggingface-cli download unsloth/gemma-3-12b-it-qat-GGUF \
  gemma-3-12b-it-qat-UD-Q4_K_XL.gguf \
  --local-dir ComfyUI/models/text_encoders/

huggingface-cli download unsloth/gemma-3-12b-it-qat-GGUF \
  mmproj-BF16.gguf \
  --local-dir ComfyUI/models/text_encoders/

Both files are loaded by ComfyUI-GGUF's Gemma encoder node (gemma-3-12b-it-qat-UD-Q4_K_XL.gguf is 6.92 GiB, mmproj-BF16.gguf is 0.80 GiB per the unsloth tree).

5. Download the LTX video VAE (Kijai community mirror)

Sulphur 2 reuses the upstream LTX video VAE — SulphurAI/Sulphur-2-base does not expose the VAE as a standalone file (it ships only sulphur_dev_bf16, sulphur_dev_fp8mixed, sulphur_distil_bf16, the rank-768 distill LoRA, the prompt enhancer, and workflows). The simplest path for the GGUF flow is the community mirror by Kijai, which exposes a standalone bf16 VAE — architecture: ltxv is shared across the LTX family:

huggingface-cli download Kijai/LTXV2_comfy \
  VAE/LTX2_video_vae_bf16.safetensors \
  --local-dir ComfyUI/models/vae/

File listing confirmed at Kijai/LTXV2_comfy.
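The downloads in steps 2 to 5 can also be scripted from Python. The sketch below uses hf_hub_download from the huggingface_hub package (the library behind huggingface-cli) and assumes it is installed in the active environment. MODEL_FILES, target_dir, and fetch_all are illustrative names; the repo IDs and filenames are the ones used in the commands above.

```python
from pathlib import Path

# (repo_id, filename inside the repo, ComfyUI models subfolder) for steps 2-5.
MODEL_FILES = [
    ("vantagewithai/Sulphur-2-Base-GGUF", "sulphur_dev-Q4_K_S.gguf", "unet"),
    ("SulphurAI/Sulphur-2-base", "sulphur_lora_rank_768.safetensors", "loras"),
    ("unsloth/gemma-3-12b-it-qat-GGUF", "gemma-3-12b-it-qat-UD-Q4_K_XL.gguf", "text_encoders"),
    ("unsloth/gemma-3-12b-it-qat-GGUF", "mmproj-BF16.gguf", "text_encoders"),
    ("Kijai/LTXV2_comfy", "VAE/LTX2_video_vae_bf16.safetensors", "vae"),
]


def target_dir(comfy_root, subfolder):
    """Directory passed as local_dir, mirroring the --local-dir values above."""
    return Path(comfy_root) / "models" / subfolder


def fetch_all(comfy_root="ComfyUI"):
    """Download every file in MODEL_FILES; needs network access and ~32 GiB of disk."""
    # Imported lazily so the plan above can be inspected without the package.
    from huggingface_hub import hf_hub_download

    for repo_id, filename, subfolder in MODEL_FILES:
        path = hf_hub_download(
            repo_id=repo_id,
            filename=filename,
            local_dir=target_dir(comfy_root, subfolder),
        )
        print("downloaded", path)
```

Call fetch_all() from the directory that contains ComfyUI/. Files whose repo path includes a folder (the VAE here) keep that folder under local_dir, so the VAE ends up in ComfyUI/models/vae/VAE/.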

6. Download the canonical Sulphur-2 workflow JSON

The canonical Sulphur 2 ComfyUI workflow lives on the upstream SulphurAI repo:

huggingface-cli download SulphurAI/Sulphur-2-base \
  "workflows/ltx23_t2v distilled.json" \
  --local-dir ComfyUI/user/default/workflows/

The upstream README is explicit that the shipped workflow references a checkpoint that does not exist as a published file: "I'm aware the workflows contain sulphur_final right now, just use the lora or use the full models, don't use both at the same time." You will rewire the LoRA loader to point at sulphur_lora_rank_768.safetensors in the next section.
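Before launching ComfyUI, it can help to confirm that every file from steps 2 to 6 landed where the loaders will look. The following is a minimal sketch with illustrative names (EXPECTED, missing_files). It assumes the download commands were run exactly as written, and that huggingface-cli kept each file's repo-relative path under --local-dir, which is why the VAE and workflow entries carry an extra folder; adjust the list if your files landed elsewhere.

```python
from pathlib import Path

# Paths relative to the ComfyUI root. The nested "VAE/" and "workflows/"
# segments are an assumption: they mirror the files' paths inside their repos.
EXPECTED = [
    "models/unet/sulphur_dev-Q4_K_S.gguf",
    "models/loras/sulphur_lora_rank_768.safetensors",
    "models/text_encoders/gemma-3-12b-it-qat-UD-Q4_K_XL.gguf",
    "models/text_encoders/mmproj-BF16.gguf",
    "models/vae/VAE/LTX2_video_vae_bf16.safetensors",
    "user/default/workflows/workflows/ltx23_t2v distilled.json",
]


def missing_files(comfy_root, expected=EXPECTED):
    """Return the expected relative paths that are not present as files."""
    root = Path(comfy_root)
    return [rel for rel in expected if not (root / rel).is_file()]
```

Running missing_files("ComfyUI") from the parent directory returns an empty list when everything is in place; any path it returns points at the step to repeat.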

Running

Launch ComfyUI from the ComfyUI root directory, with the virtual environment from step 1 active:

python main.py --listen

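Open the browser UI, then load the workflow downloaded in step 6. huggingface-cli keeps the file's repo-relative path under --local-dir, so the JSON lands one workflows/ folder deeper than the directory passed in step 6:

ComfyUI/user/default/workflows/workflows/ltx23_t2v distilled.json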

In the loaded graph, make three swaps:

Model loader. The shipped workflow loads the model via a CheckpointLoaderSimple pointed at ltx-2.3-22b-dev-fp8.safetensors. Replace it with the Unet Loader (GGUF) node from ComfyUI-GGUF and point it at sulphur_dev-Q4_K_S.gguf.
Distill LoRA. The shipped graph contains LoraLoaderModelOnly nodes referencing the missing sulphur_final.safetensors. Point those at sulphur_lora_rank_768.safetensors (downloaded in step 3) — this is what supplies the distillation on the dev GGUF path and makes the 8-step / CFG=1 schedule valid. Do not delete the LoRA node; without it the model runs un-distilled.
Text encoder. Point the text-encoder loader at the GGUF Gemma 3 loader from ComfyUI-GGUF (gemma-3-12b-it-qat-UD-Q4_K_XL.gguf + mmproj-BF16.gguf).
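Of the three swaps, the LoRA one is a pure filename change, so it can be scripted instead of clicked. The sketch below is one hedged way to do that. It assumes the workflow is saved in ComfyUI's UI format, where each node is a JSON object with a "type" field and a "widgets_values" list, and that the missing file is stored under the exact string sulphur_final.safetensors. The function names are illustrative. The model-loader and text-encoder swaps change node types and are easier to do in the UI.

```python
import json

MISSING_LORA = "sulphur_final.safetensors"
DISTILL_LORA = "sulphur_lora_rank_768.safetensors"


def repoint_lora(tree, old=MISSING_LORA, new=DISTILL_LORA):
    """Rewrite LoraLoaderModelOnly widget values in place; return the change count."""
    changed = 0
    if isinstance(tree, dict):
        widgets = tree.get("widgets_values")
        if tree.get("type") == "LoraLoaderModelOnly" and isinstance(widgets, list):
            for i, value in enumerate(widgets):
                if value == old:
                    widgets[i] = new
                    changed += 1
        # Recurse so nodes nested in groups or subgraphs are also visited.
        for child in tree.values():
            changed += repoint_lora(child, old, new)
    elif isinstance(tree, list):
        for child in tree:
            changed += repoint_lora(child, old, new)
    return changed


def patch_workflow_file(path):
    """Load a workflow JSON, repoint its LoRA loaders, and write it back."""
    with open(path, encoding="utf-8") as fh:
        workflow = json.load(fh)
    changed = repoint_lora(workflow)
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(workflow, fh, indent=2)
    return changed
```

Keep a copy of the original JSON before running patch_workflow_file on it. A return value of 0 means no matching node was found, in which case set the LoRA name by hand in the UI.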

The shipped distilled workflow's sampler is already configured for the short-step path: an LTXVScheduler set to 8 steps, a CFGGuider at CFG = 1, and an euler_ancestral_cfg_pp / lcm sampler (widget values in ltx23_t2v distilled.json). That 8-step / CFG=1 profile is only correct because the distill LoRA is applied — keep it as shipped once the LoRA is wired in. The canonical workflow's frame/resolution defaults are tuned for high-VRAM cards — drop them on the 4070 Ti SUPER 16GB:

Parameter	Canonical default	Recommended on 16GB	Source
Latent length	97 frames	start at 65 frames	EmptyLTXVLatentVideo widget [768, 512, 97, 1] in ltx23_t2v distilled.json
Resolution (longer edge)	1536 px	drop to 832 px for first run	ResizeImagesByLongerEdge widget [1536] in the same file
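To get a feel for how much the recommended settings shrink the job, compare pixel-frame volume against the canonical defaults. The sketch below does that in Python. It rests on a rough assumption, not a measurement: that activation memory and sampling work scale approximately with width × height × frames at a fixed aspect ratio. relative_workload is an illustrative helper name.

```python
def relative_workload(longer_edge, frames, ref_edge=1536, ref_frames=97):
    """Pixel-frame volume relative to the canonical defaults.

    With the aspect ratio held fixed, pixel count scales with the square of
    the longer edge, so the ratio is (edge / ref_edge)**2 * (frames / ref_frames).
    """
    return (longer_edge / ref_edge) ** 2 * (frames / ref_frames)


first_run = relative_workload(832, 65)
print(f"832 px / 65 frames is {first_run:.0%} of the canonical volume")
```

Under that assumption, the 832 px / 65 frame starting point is roughly a fifth of the canonical pixel-frame volume, which is why it is a safe first run. Use the same helper to see how large a step each increase in resolution or length represents before you try it.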

Once the workflow loads cleanly at 832 px / 65 frames, scale up only while peak VRAM stays comfortably below 16 GiB in nvidia-smi. The 4070 Ti SUPER shares the same 16 GiB VRAM ceiling as the other 16GB cards documented in the sibling recipes, so the same VRAM discipline applies. What changes is wall-time: its ~672 GB/s memory bandwidth and ~44 TFLOPS FP32 are roughly 2.3× the bandwidth and 2× the compute of an RTX 4060 Ti 16GB.
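Watching nvidia-smi by eye works, but a small poller makes the remaining headroom explicit. The sketch below uses nvidia-smi's CSV query mode (--query-gpu=memory.used,memory.total with --format=csv,noheader,nounits), which reports values in MiB. The function names are illustrative, and the sample line used in the usage note is a made-up reading, not a measurement from this card.

```python
import subprocess

QUERY = [
    "nvidia-smi",
    "--query-gpu=memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_memory_line(line):
    """Parse one 'used, total' CSV line (values in MiB) into two ints."""
    used, total = (int(field.strip()) for field in line.split(","))
    return used, total


def headroom_mib(line):
    """Free VRAM in MiB implied by one nvidia-smi CSV line."""
    used, total = parse_memory_line(line)
    return total - used


def read_gpu_memory(gpu_index=0):
    """Query nvidia-smi once and return (used_mib, total_mib) for one GPU."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_memory_line(result.stdout.strip().splitlines()[gpu_index])
```

For example, headroom_mib("14926, 16376") gives 1450 MiB. nvidia-smi reports instantaneous usage, so to approximate the peak, call read_gpu_memory() in a loop (say once per second) during sampling and keep the maximum.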

Optional: prompt enhancer

Sulphur 2 ships a Q8_0 prompt enhancer (sulphur_prompt_enhancer_model-q8_0.gguf + mmproj-BF16.gguf, under prompt_enhancer/ on the upstream tree) intended to be used via LM Studio. Per the SulphurAI README: inside your LM Studio model folder, create a Sulphur/promptenhancer/ folder, drop both files in, and load the model from LM Studio's UI. "There is no system prompt for it, just send the text (and an image) you'd like to be enhanced."

Results

Speed: Omitted — there is no published Sulphur-2 benchmark on an RTX 4070 Ti SUPER at the time of writing (/check/sulphur-2/rtx-4070-ti-super returns no benchmark data), and the 4070 Ti SUPER's compute is far enough from any sibling card (≈2× an RTX 4060 Ti 16GB, ≈90% of an RTX 4080) that transferring another card's wall-time would mislead. Empirical RTX 4070 Ti SUPER data will land at /check/sulphur-2/rtx-4070-ti-super once a community benchmark report is contributed via /contribute.
VRAM usage: Plan on a runtime peak in the 13.5–15 GiB band with the Q4_K_S dev GGUF + distill LoRA + QAT-Q4 Gemma encoder: within the 16 GiB ceiling but with very little headroom. One closely related community datapoint: a 16GB user running the LTX-2 19B distilled stack recorded a sampling-stage peak of 14926 MiB (~14.6 GiB) in Comfy-Org/ComfyUI#11726. That is one reading at the first-sampling stage, not a hard OOM threshold; their actual OOM was at 1080p / >200 frames during 2x upscale. If you load the unquantized Gemma 3 12B encoder instead, the reported peak jumps to ~28.4 GiB (Lightricks/ComfyUI-LTXVideo#303, Peak Usage: 29068 MiB), which is why the QAT encoder in step 4 is mandatory on this card.
Quality notes: The recommended configuration is the non-distilled dev GGUF with the distill LoRA applied; that combination delivers the 8-step / CFG=1 short-step sampling shipped in the distilled workflow JSON. Running the dev GGUF without the LoRA at those settings under-denoises and degrades output; if you ever drop the LoRA, switch to the dev model's own (longer, higher-CFG) schedule. Q3 tiers and below show noticeable quality regression; Q4_K_S is the recommended balance.
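The two community readings quoted under VRAM usage are given in MiB, so it is worth converting them once and comparing them with a nominal 16 GiB card. A short Python sketch of that conversion follows; the labels are descriptive strings chosen here, and the MiB values are the ones quoted from the two issues.

```python
MIB_PER_GIB = 1024
CARD_MIB = 16 * MIB_PER_GIB  # nominal 16 GiB card


def mib_to_gib(mib):
    """Convert mebibytes to gibibytes."""
    return mib / MIB_PER_GIB


REPORTED_PEAKS_MIB = {
    "Comfy-Org/ComfyUI#11726 sampling-stage peak": 14926,
    "Lightricks/ComfyUI-LTXVideo#303 peak usage": 29068,
}

for label, mib in REPORTED_PEAKS_MIB.items():
    margin = mib_to_gib(CARD_MIB - mib)
    print(f"{label}: {mib_to_gib(mib):.2f} GiB, margin vs 16 GiB: {margin:+.2f} GiB")
```

The first reading works out to about 14.58 GiB, leaving roughly 1.42 GiB of margin, which matches the 1–2 GiB headroom figure. The second is about 28.39 GiB, more than 12 GiB over the card's capacity.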

Benchmark data for this pair will be published at /check/sulphur-2/rtx-4070-ti-super as reports are contributed.

Troubleshooting

"Can I run this at full bf16 or fp8mixed on 16 GB?" — No

The upstream sulphur_dev_bf16.safetensors is 42.97 GiB and sulphur_dev_fp8mixed.safetensors is 27.16 GiB (upstream tree). Either file alone exceeds 16 GiB by a wide margin before the LoRA, encoder, VAE, and activations enter VRAM. This holds even though the RTX 4070 Ti SUPER has native FP8 tensor cores, because the constraint here is on-card memory, not compute. On this card, a GGUF quant of the dev weights (Q4_K_S from step 2 is the recommended tier) plus the distill LoRA from step 3 is the only path that fits. If you have a 24GB+ card, see the upstream README for the fp8mixed flow.

OOM when loading the text encoder

Same root cause as documented upstream — the default unquantized Gemma 3 12B encoder OOMs on 16GB cards when loaded alongside the Sulphur 2 weights (Lightricks/ComfyUI-LTXVideo#303, a community feature request, names the RTX 4080 as one of the blocked 16GB cards and reports Peak Usage: 29068 MiB on a 16GB card running the LTX-2 19B-dev-fp8 pipeline). Replace it with gemma-3-12b-it-qat-UD-Q4_K_XL.gguf from Unsloth (step 4 above). Also enable CPU offload for the Gemma encoder via the KJNodes model-offload nodes — keep the encoder unloaded from VRAM while the DiT is sampling. The RTX 4070 Ti SUPER's PCIe Gen4 x16 host link makes this offload no worse than on narrower-bus 16GB cards.

Output looks under-baked / noisy at 8 steps

If you used the dev GGUF but forgot the distill LoRA, the 8-step / CFG=1 schedule runs against an un-distilled model and produces under-denoised, low-quality frames. Confirm the LoraLoaderModelOnly node is pointed at sulphur_lora_rank_768.safetensors (step 3) and is connected to the GGUF model loader — the dev GGUF is not distilled on its own.

"sulphur_final" referenced in the workflow but missing locally

The upstream workflow JSON contains a sulphur_final.safetensors LoRA reference that does not exist as a published file. Per the SulphurAI README: "I'm aware the workflows contain sulphur_final right now, just use the lora or use the full models, don't use both at the same time." On the GGUF path, point that LoraLoaderModelOnly node at sulphur_lora_rank_768.safetensors (the published distill LoRA) instead.

Gemma GGUF loader fails or outputs gibberish

The Gemma 3 GGUF loader in ComfyUI-GGUF needed recent loader fixes merged for the LTX-2 family path (Sulphur-2 inherits this requirement via the LTX-2.3 lineage it builds on); pull the latest city96/ComfyUI-GGUF main so the Gemma encoder loads correctly — see the loader-compatibility notes on Kijai/LTXV2_comfy discussion #7.

Slow generation

Keep the Gemma encoder offloaded with the KJNodes model-offload nodes; VRAM thrashing on a 16GB card kills wall time even on a fast card. The RTX 4070 Ti SUPER's ~672 GB/s memory bandwidth is generous for a 16GB card, but a stack that's swapping between CPU RAM and VRAM mid-sample will still stall — drop frame counts and resolution further if you observe steady-state VRAM oscillation in nvidia-smi. Empirical RTX 4070 Ti SUPER wall-time numbers will appear at /check/sulphur-2/rtx-4070-ti-super once contributed.

Pushing beyond Q4_K_S

The Q4_K_M dev GGUF (13.31 GiB) is technically loadable on the 4070 Ti SUPER 16GB, but it is about 1 GiB larger than Q4_K_S and so consumes most of the 1–2 GiB margin the Q4_K_S stack leaves for the LoRA, the QAT Gemma encoder, and activations. It is feasible only at the lowest frame counts (≤ 25) and resolution (≤ 768 px). Q5 and above will OOM: per the vantagewithai per-tier table, Q5_K_S is already 15.0 GB-decimal (14.01 GiB) of weights alone, which exceeds the practical ceiling. The RTX 4070 Ti SUPER's extra compute does not buy a higher quant; VRAM is the binding constraint.