What You'll Build
Generate uncensored text-to-video and image-to-video clips locally with Sulphur 2 — an LTX-2.3 fine-tune from SulphurAI — on an RTX 4080 SUPER 16GB. The upstream full-precision sulphur_dev_bf16.safetensors weighs 42.97 GiB and the sulphur_dev_fp8mixed.safetensors weighs 27.16 GiB (upstream tree) — both far too large for 16GB VRAM, let alone alongside the Gemma 3 12B text encoder. This recipe runs the community Q4_K_S GGUF (12.29 GiB / 13.2 GB-decimal) from vantagewithai/Sulphur-2-Base-GGUF — a quantization of the non-distilled dev weights — with the upstream distill LoRA applied on top for the fast short-step schedule, plus a quantized Gemma 3 12B QAT encoder.
Hardware data: RTX 4080 SUPER 16GB · Q4_K_S dev GGUF + distill LoRA + Gemma 3 12B QAT-Q4 encoder · See benchmark data
⚠️ The GGUF is the dev model, so you must add the distill LoRA. Every tier published by vantagewithai is named sulphur_dev-*.gguf: it is a quantization of the non-distilled dev weights, not of the distilled checkpoint. Per the SulphurAI README, the fast short-step path comes from applying the distill LoRA on top of the dev weights: "I recommend downloading either of the dev versions, (fp8mixed or bf16) and downloading the distill lora provided." On this recipe's GGUF path, the dev GGUF replaces the dev safetensors, and the distill LoRA (sulphur_lora_rank_768.safetensors) is still required. Without it, the 8-step / CFG=1 schedule below runs against an un-distilled model and produces degraded output. The README's note "just use the lora or use the full models, don't use both at the same time" means: EITHER (dev weights + distill LoRA) OR a full distilled model (sulphur_distil_bf16.safetensors). Never stack the LoRA on an already-distilled full model, and never load the non-existent sulphur_final that the shipped workflow references.
⚠️ Tight on 16 GB. Even the Q4_K_S dev GGUF + distill LoRA + QAT encoder + VAE stack lives within ~1–2 GiB of the 16 GiB ceiling. Start with low frame counts and resolution and scale up carefully. The RTX 4080 SUPER's extra compute over smaller 16GB cards buys you faster generation, not more headroom — the VRAM ceiling is identical.
Requirements
| Component | Minimum | Tested |
|---|---|---|
| GPU | 16GB VRAM (Ada sm_89 or newer) | RTX 4080 SUPER 16GB |
| RAM | 32GB | 32GB |
| Storage | ~32GB | Q4_K_S 12.29 GiB + distill LoRA 9.56 GiB + Gemma 3 QAT encoder ~7.7 GiB (6.92 + 0.80) + LTX VAE 2.28 GiB (Kijai VAE tree API) |
| Software | ComfyUI + ComfyUI-LTXVideo + ComfyUI-GGUF + KJNodes | Python 3.10+, CUDA 12.4+ |
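The ~32GB storage figure in the table is the sum of the listed downloads. As a quick sanity check, the sketch below adds up the per-file sizes copied from the Storage row; the dictionary and function names are illustrative only.

```python
# Per-file sizes in GiB, copied from the Storage row of the requirements table.
COMPONENT_SIZES_GIB = {
    "sulphur_dev-Q4_K_S.gguf": 12.29,
    "sulphur_lora_rank_768.safetensors": 9.56,
    "gemma-3-12b-it-qat-UD-Q4_K_XL.gguf": 6.92,
    "mmproj-BF16.gguf": 0.80,
    "LTX2_video_vae_bf16.safetensors": 2.28,
}


def total_download_gib(sizes: dict[str, float]) -> float:
    """Sum the per-file sizes and round to two decimals."""
    return round(sum(sizes.values()), 2)


if __name__ == "__main__":
    total = total_download_gib(COMPONENT_SIZES_GIB)
    print(f"Total model download: {total} GiB")
```

This covers model files only; ComfyUI itself, the venv, and generated outputs need additional space.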
Sulphur 2 is a fine-tune of Lightricks' LTX-2.3 (architecture: ltxv per the vantagewithai GGUF card) and inherits its Gemma 3 12B text encoder. The RTX 4080 SUPER uses Ada Lovelace AD103 sm_89 — full FlashAttention-2 kernel coverage is in stock CUDA wheels (the default pip install torch works; no cu128-specific index-url is required, since that requirement only applies to Blackwell sm_120 cards). The 4080 SUPER also has native 4th-gen FP8 tensor cores, but the GGUF path below uses k-quant integer kernels, not FP8, so that capability isn't exercised here.
Installation
1. Install ComfyUI and the LTX-Video custom nodes
```bash
# Clone ComfyUI and install its dependencies into a fresh virtual environment
git clone https://github.com/comfyanonymous/ComfyUI
cd ComfyUI
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

# Install the three custom-node packs used by this recipe
cd custom_nodes
git clone https://github.com/Lightricks/ComfyUI-LTXVideo.git
pip install -r ComfyUI-LTXVideo/requirements.txt
git clone https://github.com/city96/ComfyUI-GGUF.git
pip install -r ComfyUI-GGUF/requirements.txt
git clone https://github.com/kijai/ComfyUI-KJNodes.git
pip install -r ComfyUI-KJNodes/requirements.txt
```
The canonical Sulphur-2 workflow uses LTXV-prefixed nodes (LTXVConcatAVLatent, LTXVCropGuides, LTXVPreprocess, LTXVScheduler), provided by ComfyUI-LTXVideo, alongside SamplerCustomAdvanced, which is a built-in ComfyUI node. This was confirmed by inspecting workflows/ltx23_t2v distilled.json on the upstream repo.
2. Download the Q4_K_S Sulphur-2 dev GGUF
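```bash
# Run this (and the later download steps) from the directory that contains ComfyUI/,
# because the --local-dir paths are relative to it.
# Q4_K_S: 12.29 GiB (13.2 GB-decimal), the sweet spot for 16GB VRAM
huggingface-cli download vantagewithai/Sulphur-2-Base-GGUF \
  sulphur_dev-Q4_K_S.gguf \
  --local-dir ComfyUI/models/unet/
```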
Quant-tier file-size reference (precise GiB from the HF tree API; GB-decimal matches the vantagewithai card per-tier table — architecture: ltxv). Every tier is a quant of the non-distilled dev weights (sulphur_dev-*):
| Quant | File size (GiB) | GB-decimal | Fits 16GB GPU? |
|---|---|---|---|
| Q3_K_S | 9.63 | 10.3 | yes (headroom for LoRA + encoder + activations) |
| Q3_K_M | 10.37 | 11.1 | yes (more headroom than Q4_K_S) |
| Q4_K_S | 12.29 | 13.2 | yes — recommended |
| Q4_K_M | 13.31 | 14.3 | tight — possible if you cap resolution / frames aggressively |
| Q5_K_S | 14.01 | 15.0 | no (no room for LoRA + encoder + activations) |
| Q5_K_M | 15.03 | 16.1 | no |
| Q6_K | 16.55 | 17.8 | no (weights alone exceed VRAM) |
| Q8_0 | 21.19 | 22.8 | no |
The 16GB ceiling reflects a closely-related community datapoint: a 16GB ComfyUI user running the architecturally-related LTX-2 19B distilled stack recorded a sampling-stage peak of 14926 MiB (~14.6 GiB) in Comfy-Org/ComfyUI#11726 (their OOM occurred later, at the 2x-upscale second-sampler stage at 1080p / >200 frames — not at modest settings). That single datapoint suggests the peak across DiT + LoRA + encoder + VAE on a 16GB card lives in the 13.5–15 GiB band at conservative settings, leaving only 1–2 GiB of headroom. Q4_K_M (13.31 GiB) is feasible only at very low frame counts; Q5 and above will OOM. This VRAM envelope is identical on every 16GB card — the RTX 4080 SUPER's compute advantage changes wall-time, not the memory ceiling.
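The table and the paragraph above mix three units: GiB (binary), GB-decimal, and the MiB readings that nvidia-smi reports. The following sketch shows the conversions behind those figures; the helper names are illustrative, and the sizes are copied from the quant-tier table.

```python
GIB = 1024 ** 3  # bytes per gibibyte (binary)
GB = 1000 ** 3   # bytes per gigabyte (decimal)
MIB = 1024 ** 2  # bytes per mebibyte, the unit nvidia-smi reports

# File sizes in GiB, copied from the quant-tier table.
QUANT_SIZES_GIB = {
    "Q3_K_S": 9.63, "Q3_K_M": 10.37, "Q4_K_S": 12.29, "Q4_K_M": 13.31,
    "Q5_K_S": 14.01, "Q5_K_M": 15.03, "Q6_K": 16.55, "Q8_0": 21.19,
}


def gib_to_gb(gib: float) -> float:
    """Convert binary gibibytes to decimal gigabytes."""
    return gib * GIB / GB


def mib_to_gib(mib: float) -> float:
    """Convert mebibytes (nvidia-smi readings) to gibibytes."""
    return mib * MIB / GIB


if __name__ == "__main__":
    for tier, gib in QUANT_SIZES_GIB.items():
        print(f"{tier:7s} {gib:6.2f} GiB = {gib_to_gb(gib):5.1f} GB-decimal")
    print(f"14926 MiB = {mib_to_gib(14926):.1f} GiB")
```

Keeping the units straight matters here because the gap between GiB and GB-decimal (about 7%) is roughly the same size as the headroom left on a 16GB card.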
3. Download the distill LoRA (required for the 8-step path)
The dev GGUF is not distilled. To get Sulphur 2's fast short-step (8-step / CFG=1) behavior, you must apply the upstream distill LoRA on top of it:
```bash
huggingface-cli download SulphurAI/Sulphur-2-base \
  sulphur_lora_rank_768.safetensors \
  --local-dir ComfyUI/models/loras/
```
This is the sulphur_lora_rank_768.safetensors file (9.56 GiB) listed on the upstream tree. Per the SulphurAI README, the two valid configurations are mutually exclusive — "just use the lora or use the full models, don't use both at the same time" — i.e. EITHER (dev weights + this LoRA, which is the GGUF path here) OR a full distilled checkpoint, never both.
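The either/or rule quoted from the README amounts to an exclusive-or between "the base weights are already distilled" and "the distill LoRA is loaded". A minimal sketch of that check follows; the class and function names are hypothetical and not part of any Sulphur or ComfyUI API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SulphurConfig:
    # True for a full distilled checkpoint (sulphur_distil_bf16.safetensors);
    # False for dev weights, including every sulphur_dev-*.gguf tier.
    base_is_distilled: bool
    # True when sulphur_lora_rank_768.safetensors is applied to the model.
    distill_lora_loaded: bool


def check_short_step_config(cfg: SulphurConfig) -> tuple[bool, str]:
    """Return (ok, reason) for running the 8-step / CFG=1 schedule."""
    if cfg.base_is_distilled and cfg.distill_lora_loaded:
        return False, "LoRA stacked on an already-distilled model: drop the LoRA"
    if not cfg.base_is_distilled and not cfg.distill_lora_loaded:
        return False, "dev weights without the distill LoRA: add the LoRA"
    return True, "ok"


if __name__ == "__main__":
    # The GGUF path in this recipe: dev weights plus the distill LoRA.
    print(check_short_step_config(SulphurConfig(False, True)))
```

Only the two mixed cases pass: dev weights with the LoRA (this recipe), or a distilled checkpoint without it.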
4. Download the quantized Gemma 3 12B text encoder
Sulphur 2 inherits LTX-2.3's Gemma 3 12B text encoder. The full unquantized Gemma 3 12B will OOM on 16GB cards when loaded alongside the Sulphur 2 weights — a community feature request on the LTX-2 repo documents this directly, reporting that LTX-2 with Gemma 3 12B "needs ~24-27GB VRAM to operate" and is therefore unusable on "consumer GPUs with 16GB VRAM (RTX 5080, RTX 4080, etc.)", with a measured Peak Usage: 29068 MiB on a 16GB card running the LTX-2 19B-dev-fp8 stack (Lightricks/ComfyUI-LTXVideo#303, opened by community user Jackson3195, author_association: NONE — a feature request, not official guidance). Use the QAT-Q4 GGUF encoder instead:
```bash
# Quantized Gemma 3 12B QAT encoder weights
huggingface-cli download unsloth/gemma-3-12b-it-qat-GGUF \
  gemma-3-12b-it-qat-UD-Q4_K_XL.gguf \
  --local-dir ComfyUI/models/text_encoders/

# Matching multimodal projector
huggingface-cli download unsloth/gemma-3-12b-it-qat-GGUF \
  mmproj-BF16.gguf \
  --local-dir ComfyUI/models/text_encoders/
```
Both files are loaded by ComfyUI-GGUF's Gemma encoder node (gemma-3-12b-it-qat-UD-Q4_K_XL.gguf is 6.92 GiB, mmproj-BF16.gguf is 0.80 GiB per the unsloth tree).
5. Download the LTX video VAE (Kijai community mirror)
Sulphur 2 reuses the upstream LTX video VAE — SulphurAI/Sulphur-2-base does not expose the VAE as a standalone file (it ships only sulphur_dev_bf16, sulphur_dev_fp8mixed, sulphur_distil_bf16, the rank-768 distill LoRA, the prompt enhancer, and workflows). The simplest path for the GGUF flow is the community mirror by Kijai, which exposes a standalone bf16 VAE — architecture: ltxv is shared across the LTX family:
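```bash
# huggingface-cli keeps the repo-relative path under --local-dir,
# so the file lands in ComfyUI/models/vae/VAE/
huggingface-cli download Kijai/LTXV2_comfy \
  VAE/LTX2_video_vae_bf16.safetensors \
  --local-dir ComfyUI/models/vae/
```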
File listing confirmed at Kijai/LTXV2_comfy.
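Before launching ComfyUI it can help to confirm that the model downloads from steps 2 to 5 are actually on disk. The sketch below searches the models directory recursively by filename, so it does not depend on which subfolder a download landed in. The function name is hypothetical, and the `ComfyUI/models` root is assumed from the `--local-dir` values used in this recipe.

```python
from pathlib import Path

# Filenames downloaded in steps 2 to 5 of this recipe.
EXPECTED_FILES = [
    "sulphur_dev-Q4_K_S.gguf",
    "sulphur_lora_rank_768.safetensors",
    "gemma-3-12b-it-qat-UD-Q4_K_XL.gguf",
    "mmproj-BF16.gguf",
    "LTX2_video_vae_bf16.safetensors",
]


def find_missing(models_dir: Path, filenames: list[str]) -> list[str]:
    """Return the filenames that are not found anywhere under models_dir."""
    missing = []
    for name in filenames:
        # rglob searches every subfolder, e.g. models/vae/VAE/.
        if not any(p.is_file() for p in models_dir.rglob(name)):
            missing.append(name)
    return missing


if __name__ == "__main__":
    # Assumed layout: run from the directory that contains ComfyUI/.
    missing = find_missing(Path("ComfyUI/models"), EXPECTED_FILES)
    print("All model files found" if not missing else f"Missing: {missing}")
```

This only checks presence, not integrity; a truncated download would still pass.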
6. Download the canonical Sulphur-2 workflow JSON
The canonical Sulphur 2 ComfyUI workflow lives on the upstream SulphurAI repo:
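```bash
# huggingface-cli keeps the repo-relative path under --local-dir,
# so the JSON lands in ComfyUI/user/default/workflows/workflows/
huggingface-cli download SulphurAI/Sulphur-2-base \
  "workflows/ltx23_t2v distilled.json" \
  --local-dir ComfyUI/user/default/workflows/
```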
The upstream README is explicit that the shipped workflow references a checkpoint that does not exist as a published file: "I'm aware the workflows contain sulphur_final right now, just use the lora or use the full models, don't use both at the same time." You will rewire the LoRA loader to point at sulphur_lora_rank_768.safetensors in the next section.
Running
Launch ComfyUI:
```bash
# Run from inside the ComfyUI directory, with the venv from step 1 active
python main.py --listen
```
Open the browser UI, then load the workflow downloaded in step 6:
ComfyUI/user/default/workflows/workflows/ltx23_t2v distilled.json (huggingface-cli preserves the repo's workflows/ subfolder under --local-dir)
In the loaded graph, make three swaps:
- Model loader. The shipped workflow loads the model via a CheckpointLoaderSimple pointed at ltx-2.3-22b-dev-fp8.safetensors. Replace it with the Unet Loader (GGUF) node from ComfyUI-GGUF and point it at sulphur_dev-Q4_K_S.gguf.
- Distill LoRA. The shipped graph contains LoraLoaderModelOnly nodes referencing the missing sulphur_final.safetensors. Point those at sulphur_lora_rank_768.safetensors (downloaded in step 3); this is what supplies the distillation on the dev GGUF path and makes the 8-step / CFG=1 schedule valid. Do not delete the LoRA node; without it the model runs un-distilled.
- Text encoder. Replace the text-encoder loader with the GGUF Gemma 3 loader from ComfyUI-GGUF and select gemma-3-12b-it-qat-UD-Q4_K_XL.gguf + mmproj-BF16.gguf.
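The LoRA filename swap can also be done on the JSON file before loading it, which avoids hunting for each node in the UI. The sketch below assumes the workflow is in ComfyUI's UI export format, with a top-level `nodes` list whose entries carry a `widgets_values` list; that layout, the function name, and the exact stored string `sulphur_final.safetensors` are assumptions to verify against the actual file. Nodes nested inside subgraphs, if any, are not visited. Replacing the model loader and text-encoder loader changes node types and wiring, so those two swaps are better done in the UI.

```python
import json
from pathlib import Path


def repoint_widget_values(workflow: dict, old: str, new: str) -> int:
    """Replace every widget value equal to `old` with `new`; return the count."""
    replaced = 0
    for node in workflow.get("nodes", []):
        values = node.get("widgets_values")
        if not isinstance(values, list):
            continue  # some nodes have no widgets or store them differently
        for i, value in enumerate(values):
            if value == old:
                values[i] = new
                replaced += 1
    return replaced


def patch_workflow_file(path: Path, old: str, new: str) -> int:
    """Patch a workflow JSON file in place and return the number of edits."""
    workflow = json.loads(path.read_text(encoding="utf-8"))
    count = repoint_widget_values(workflow, old, new)
    path.write_text(json.dumps(workflow, indent=2), encoding="utf-8")
    return count


if __name__ == "__main__":
    # Synthetic example; a real run would call patch_workflow_file on the JSON.
    demo = {"nodes": [{"type": "LoraLoaderModelOnly",
                       "widgets_values": ["sulphur_final.safetensors", 1.0]}]}
    n = repoint_widget_values(demo, "sulphur_final.safetensors",
                              "sulphur_lora_rank_768.safetensors")
    print(n, demo["nodes"][0]["widgets_values"])
```

A return value of 0 means the assumed string was not found, in which case open the file and check how the LoRA name is actually stored.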
The shipped distilled workflow's sampler is already configured for the short-step path: an LTXVScheduler set to 8 steps, a CFGGuider at CFG = 1, and an euler_ancestral_cfg_pp / lcm sampler (widget values in ltx23_t2v distilled.json). That 8-step / CFG=1 profile is only correct because the distill LoRA is applied — keep it as shipped once the LoRA is wired in. The canonical workflow's frame/resolution defaults are tuned for high-VRAM cards — drop them on the 4080 SUPER 16GB:
| Parameter | Canonical default | Recommended on 16GB | Source |
|---|---|---|---|
| Latent length | 97 frames (EmptyLTXVLatentVideo widget) | start at 65 | EmptyLTXVLatentVideo widget [768, 512, 97, 1] in ltx23_t2v distilled.json |
| Resolution (longer edge) | 1536 px | drop to 832 px for first run | ResizeImagesByLongerEdge widget [1536] in the same file |
Once the workflow loads cleanly at 832 px / 65 frames, scale up only while peak VRAM stays comfortably below 16 GiB in nvidia-smi. The 4080 SUPER shares the same 16 GiB VRAM ceiling as the smaller 16GB cards documented in the sibling recipes — it just samples faster (its ~736 GB/s memory bandwidth across the 256-bit GDDR6X bus and 10240 Ada AD103 CUDA cores give it generous throughput for a 16GB card). Treat the VRAM discipline below as identical to any 16GB card; the only difference is wall-time.
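To get a feel for how much the recommended first-run settings shrink the workload, the sketch below compares pixel-frame volume (width × height × frames) at two settings with the aspect ratio held fixed. Treating activation memory as roughly proportional to that volume is an assumption made here for illustration, not a measured property of Sulphur 2; attention cost may grow faster, so use the ratio only as a rough guide when scaling up.

```python
def relative_volume(edge_a: int, frames_a: int, edge_b: int, frames_b: int) -> float:
    """Pixel-frame volume of setting A relative to setting B.

    With a fixed aspect ratio, width * height scales with the square of the
    longer edge, so volume scales with edge**2 * frames.
    """
    return (edge_a / edge_b) ** 2 * (frames_a / frames_b)


if __name__ == "__main__":
    # Recommended first run (832 px, 65 frames) vs canonical default (1536 px, 97 frames).
    ratio = relative_volume(832, 65, 1536, 97)
    print(f"First-run settings use about {ratio:.0%} of the canonical volume")
```

By this estimate the 832 px / 65-frame starting point is about a fifth of the canonical workload, which is why it leaves room to step up gradually.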
Optional: prompt enhancer
Sulphur 2 ships a Q8_0 prompt enhancer (sulphur_prompt_enhancer_model-q8_0.gguf + mmproj-BF16.gguf, under prompt_enhancer/ on the upstream tree) intended to be used via LM Studio. Per the SulphurAI README: inside your LM Studio model folder, create a Sulphur/promptenhancer/ folder, drop both files in, and load the model from LM Studio's UI. "There is no system prompt for it, just send the text (and an image) you'd like to be enhanced."
Results
- Speed: Omitted — there is no published Sulphur-2 benchmark on an RTX 4080 SUPER at the time of writing, and quoting another card's wall-time would mislead. Empirical RTX 4080 SUPER data will land at /check/sulphur-2/rtx-4080-super once a community benchmark report is contributed via /contribute.
- VRAM usage: Plan on a runtime peak in the 13.5–15 GiB band with the Q4_K_S dev GGUF + distill LoRA + QAT-Q4 Gemma encoder. That is within the 16 GiB ceiling but leaves very little headroom. One closely-related community datapoint: a 16GB user running the LTX-2 19B distilled stack recorded a sampling-stage peak of 14926 MiB (~14.6 GiB) in Comfy-Org/ComfyUI#11726 (one reading at the first-sampling stage, not a hard OOM threshold; their actual OOM was at 1080p / >200 frames during 2x upscale). If you load the unquantized Gemma 3 12B encoder instead, the peak jumps to ~28.4 GiB (Lightricks/ComfyUI-LTXVideo#303, Peak Usage: 29068 MiB), which is why the QAT encoder in step 4 is mandatory on this card.
- Quality notes: The recommended configuration is the non-distilled dev GGUF with the distill LoRA applied; that combination delivers the 8-step / CFG=1 short-step sampling shipped in the distilled workflow JSON. Running the dev GGUF without the LoRA at those settings under-denoises and degrades output; if you ever drop the LoRA, switch to the dev model's own (longer, higher-CFG) schedule. Q3 tier and below shows noticeable quality regression; Q4_K_S is the recommended balance.
For the full benchmark data on this pair, see /check/sulphur-2/rtx-4080-super.
Troubleshooting
"Can I run this at full bf16 or fp8mixed on 16 GB?" — No
The upstream sulphur_dev_bf16.safetensors is 42.97 GiB and sulphur_dev_fp8mixed.safetensors is 27.16 GiB (upstream tree). Either set of weights alone exceeds 16 GiB by a wide margin before the LoRA, encoder, VAE, and activations enter VRAM. This holds even though the RTX 4080 SUPER has native FP8 tensor cores, because the constraint here is on-card memory, not compute. A quantized dev GGUF (Q4_K_S in step 2, or a smaller tier from the table there) plus the distill LoRA in step 3 is the only path that runs on this card. If you have a 24GB+ card, see the upstream README for the fp8mixed flow.
OOM when loading the text encoder
Same root cause as documented upstream — the default unquantized Gemma 3 12B encoder OOMs on 16GB cards when loaded alongside the Sulphur 2 weights (Lightricks/ComfyUI-LTXVideo#303, a community feature request, names the RTX 4080 as one of the blocked 16GB cards and reports Peak Usage: 29068 MiB on a 16GB card running the LTX-2 19B-dev-fp8 pipeline). Replace it with gemma-3-12b-it-qat-UD-Q4_K_XL.gguf from Unsloth (step 4 above). Also enable CPU offload for the Gemma encoder via the KJNodes model-offload nodes — keep the encoder unloaded from VRAM while the DiT is sampling. The RTX 4080 SUPER's PCIe Gen4 x16 host link makes this offload no worse than on narrower-bus 16GB cards.
Output looks under-baked / noisy at 8 steps
If you used the dev GGUF but forgot the distill LoRA, the 8-step / CFG=1 schedule runs against an un-distilled model and produces under-denoised, low-quality frames. Confirm the LoraLoaderModelOnly node is pointed at sulphur_lora_rank_768.safetensors (step 3) and is connected to the GGUF model loader — the dev GGUF is not distilled on its own.
"sulphur_final" referenced in the workflow but missing locally
The upstream workflow JSON contains a sulphur_final.safetensors LoRA reference that does not exist as a published file. Per the SulphurAI README: "I'm aware the workflows contain sulphur_final right now, just use the lora or use the full models, don't use both at the same time." On the GGUF path, point that LoraLoaderModelOnly node at sulphur_lora_rank_768.safetensors (the published distill LoRA) instead.
Gemma GGUF loader fails or outputs gibberish
The Gemma 3 GGUF loader in ComfyUI-GGUF needed recent loader fixes merged for the LTX-2 family path (Sulphur-2 inherits this requirement via the LTX-2.3 lineage it builds on); pull the latest city96/ComfyUI-GGUF main so the Gemma encoder loads correctly — see the loader-compatibility notes on Kijai/LTXV2_comfy discussion #7.
Slow generation
Keep the Gemma encoder offloaded with the KJNodes model-offload nodes; VRAM thrashing on a 16GB card kills wall time even on a fast card. The RTX 4080 SUPER's ~736 GB/s memory bandwidth is generous for a 16GB card, but a stack that's swapping between CPU RAM and VRAM mid-sample will still stall — drop frame counts and resolution further if you observe steady-state VRAM oscillation in nvidia-smi. Empirical RTX 4080 SUPER wall-time numbers will appear at /check/sulphur-2/rtx-4080-super once contributed.
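Watching for VRAM oscillation is easier with numbers than by eye. The sketch below polls `nvidia-smi` through its CSV query interface and reports used memory and remaining headroom for the first GPU. The function names are illustrative, and the script assumes `nvidia-smi` is on the PATH.

```python
import subprocess

# CSV query: used and total memory in MiB, no header row, no unit suffix.
QUERY = ["nvidia-smi", "--query-gpu=memory.used,memory.total",
         "--format=csv,noheader,nounits"]


def parse_memory_line(line: str) -> tuple[int, int]:
    """Parse one 'used, total' CSV line into integers (MiB)."""
    used, total = (int(field.strip()) for field in line.split(","))
    return used, total


def headroom_gib(used_mib: int, total_mib: int) -> float:
    """Free VRAM in GiB given used and total MiB."""
    return (total_mib - used_mib) / 1024


def sample_first_gpu() -> tuple[int, int]:
    """Run nvidia-smi once and return (used_mib, total_mib) for GPU 0."""
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
    return parse_memory_line(out.splitlines()[0])


if __name__ == "__main__":
    try:
        used, total = sample_first_gpu()
        print(f"used {used} MiB of {total} MiB, headroom {headroom_gib(used, total):.2f} GiB")
    except (FileNotFoundError, subprocess.CalledProcessError) as exc:
        print(f"nvidia-smi not available: {exc}")
```

Calling `sample_first_gpu()` in a loop during sampling and comparing the minimum and maximum readings gives a rough measure of the swing; large repeated swings suggest the stack is being shuffled between system RAM and VRAM.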
Pushing beyond Q4_K_S
The Q4_K_M dev GGUF (13.31 GiB) is technically loadable on the 4080 SUPER 16GB but leaves only ~1 GiB of headroom for the LoRA + activations + the QAT Gemma encoder — feasible only at the lowest frame counts (≤ 25) and resolution (≤ 768 px). Q5 and above will OOM; the vantagewithai per-tier table is unambiguous that Q5_K_S at 15 GB-decimal already exceeds the practical ceiling. The RTX 4080 SUPER's extra compute does not buy a higher quant — VRAM is the binding constraint.