Sulphur 2 on RTX 5090: Uncensored LTX-2.3 Video at fp8mixed, Native

What You'll Build

Generate uncensored text-to-video clips locally with Sulphur 2 — an LTX-2.3 22B fine-tune from SulphurAI (1.4M+ downloads, 1407 likes) — on an RTX 5090, the first consumer NVIDIA card whose 32 GB GDDR7 envelope holds sulphur_dev_fp8mixed.safetensors natively. Where the 16 GB RTX 4060 Ti sibling and 16 GB RTX 5060 Ti sibling had to lean on the community Q4_K_S GGUF (13.2 GB) and quantized Gemma encoder to squeeze under the envelope, the 5090 runs the upstream sulphur_dev_fp8mixed.safetensors (27.16 GiB) alongside an FP8-scaled Gemma 3 12B encoder with ~600 MiB of VRAM headroom to spare. This is the canonical "fp8mixed runs as intended" path.

Hardware data: RTX 5090 (32 GB VRAM) · fp8mixed dev at 50 steps quality + 5-step refine · See benchmark data

Variant pin. This recipe targets Sulphur 2 at the upstream sulphur_dev_fp8mixed.safetensors weights (27.16 GiB) — the canonical fp8 fine-tune of Lightricks' LTX-2.3 22B. NOT the GGUF-quantized siblings (which work on smaller cards) and NOT the experimental LoRA (sulphur_experimental_lora_v1.safetensors). Canonical repo: SulphurAI/Sulphur-2-base.

Requirements

Component	Minimum	Tested
GPU	32 GB VRAM per the official ComfyUI-LTXVideo README	RTX 5090 (32,607 MiB)
RAM	32 GB	64 GB (matches the BigBlueWhale RTX 5090 reproducer hardware target; note that the GitHub handle is spelled `BigBIueWhale`, with a capital I in place of the lowercase l in "Blue"; same person as HF user `BigBlueWhale` per the cross-link in HF Discussion #22)
Storage	~58 GB	fp8mixed 27.16 GiB + FP8 Gemma 12.30 GiB + VAE/upscalers ~2 GB + distill LoRA 632 MiB + prompt enhancer ~9 GiB
Software	ComfyUI ≥ v0.21.0 + ComfyUI-LTXVideo + ComfyUI-GGUF + KJNodes	PyTorch 2.7+ / CUDA 12.8+ (cu128/cu130 wheels)

Sulphur 2 is a 22B fine-tune of Lightricks' LTX-2.3 architecture — same DiT shape, same Gemma 3 12B text encoder, same audio + video VAE. The ComfyUI-LTXVideo README requires "CUDA-compatible GPU with 32GB+ VRAM" as a hard prerequisite. The RTX 5090 is the first consumer card meeting that floor — the 16 GB sibling recipes cover the GGUF-squeeze fallback path for smaller cards.

Blackwell note (sm_120). This is the recipe's native arch. The 5090 has both FP8 (E4M3/E5M2) tensor cores and FP4 microscaling hardware, so the sulphur_dev_fp8mixed.safetensors file is accelerated rather than dequantized to BF16 at compute time. This is the inverse of the lesson from cards without FP8 tensor cores (Ampere and older), where FP8 weights are a memory-only escape hatch.

Installation

1. Install ComfyUI + LTX nodes

git clone https://github.com/comfyanonymous/ComfyUI
cd ComfyUI
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

cd custom_nodes
git clone https://github.com/Lightricks/ComfyUI-LTXVideo.git
pip install -r ComfyUI-LTXVideo/requirements.txt

git clone https://github.com/city96/ComfyUI-GGUF.git
pip install -r ComfyUI-GGUF/requirements.txt

git clone https://github.com/kijai/ComfyUI-KJNodes.git
pip install -r ComfyUI-KJNodes/requirements.txt

Blackwell (sm_120) is a first-class target on PyTorch 2.7+ cu128 wheels. pip does not inspect your GPU, so confirm that the installed torch build reports CUDA 12.8 or newer, and reinstall from the matching PyTorch wheel index (cu128 or cu130) if it does not. The full pinned stack used by the BigBlueWhale RTX 5090 reproducer (the source of this recipe's measured numbers) is CUDA 13.0.2 base + PyTorch 2.11.0+cu130 + Triton 3.6.0 + ComfyUI v0.21.0 (commit 52976f3ea33c) + ComfyUI-LTXVideo commit 229437c6b657, which is useful as a sanity-check pin if you want byte-identical reproduction.
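A quick way to check the installed build before pulling the weights is sketched below. The helper name `blackwell_ready` is illustrative, not part of any library; `torch.version.cuda` and `torch.cuda.get_device_capability()` are the standard PyTorch attributes it is meant to be fed from. The thresholds (CUDA 12.8, compute capability 12.0) follow the requirements stated above.

```python
def blackwell_ready(cuda_version, capability):
    """Return True if a torch build looks usable on an sm_120 card.

    cuda_version: string such as "12.8" (the value of torch.version.cuda),
                  or None for CPU-only builds.
    capability:   (major, minor) tuple, as returned by
                  torch.cuda.get_device_capability().
    """
    if cuda_version is None:
        return False
    major, minor = (int(part) for part in cuda_version.split(".")[:2])
    return (major, minor) >= (12, 8) and tuple(capability) >= (12, 0)


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        print("torch is not installed in this environment")
    else:
        if torch.cuda.is_available():
            cap = torch.cuda.get_device_capability(0)
            ok = blackwell_ready(torch.version.cuda, cap)
            print(torch.__version__, torch.version.cuda, cap,
                  "OK" if ok else "needs a cu128+ build")
        else:
            print("CUDA is not available to this torch build")
```

Run it inside the ComfyUI virtual environment so it inspects the same torch build ComfyUI will load.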

2. Download the Sulphur 2 fp8mixed weights (primary path)

This is the file most users are looking for, and the 5090 is the first consumer card it fits on:

# 27.16 GiB / 29.16 GB-decimal — primary checkpoint
huggingface-cli download SulphurAI/Sulphur-2-base \
  sulphur_dev_fp8mixed.safetensors \
  --local-dir ComfyUI/models/checkpoints/

Sulphur ships four top-level weight files on the canonical Sulphur-2-base tree (three DiT checkpoints and one LoRA); the table below is from a direct HF tree API call:

File	Size	Fits 32 GB?
`sulphur_dev_bf16.safetensors`	42.97 GiB	No (weights alone exceed 32 GB)
`sulphur_dev_fp8mixed.safetensors`	27.16 GiB	Yes — primary path
`sulphur_distil_bf16.safetensors`	42.97 GiB	No (same; distilled BF16 variant)
`sulphur_lora_rank_768.safetensors`	9.56 GiB	(LoRA, not a standalone checkpoint)
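The "Fits 32 GB?" column can be reproduced with a few lines of arithmetic. This sketch compares each checkpoint's size against the 32,607 MiB the 5090 reports in the Requirements table; it counts weights only, so the margin it prints is an upper bound, not a prediction of real peak usage.

```python
MIB_PER_GIB = 1024
CARD_VRAM_MIB = 32_607  # RTX 5090 total from the Requirements table

# File sizes in GiB, copied from the table above.
FILES_GIB = {
    "sulphur_dev_bf16.safetensors": 42.97,
    "sulphur_dev_fp8mixed.safetensors": 27.16,
    "sulphur_distil_bf16.safetensors": 42.97,
}


def weights_margin_mib(size_gib, card_mib=CARD_VRAM_MIB):
    """VRAM left (positive) or missing (negative) after the weights alone."""
    return card_mib - size_gib * MIB_PER_GIB


for name, size in FILES_GIB.items():
    margin = weights_margin_mib(size)
    verdict = "fits" if margin > 0 else "does not fit"
    print(f"{name}: {verdict} ({margin:+,.0f} MiB before encoder and activations)")
```

The fp8mixed file leaves roughly 4.7 GiB on paper, while the BF16 files overshoot by about 11 GiB before anything else is loaded.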

The sulphur_lora_rank_768.safetensors file is the rank-768 distill LoRA that the canonical README intends to be used alongside the base (non-distilled) bf16 weights. The fp8mixed dev weights also need a distill LoRA to produce non-broken output (confirmed by community testing in Discussion #14: multiple users report base output is corrupted without it; user ubergarm notes "This is the base model. That's why it requires https://huggingface.co/SulphurAI/Sulphur-2-base/tree/main/distill_loras to output a correct image."). The smaller rank-384 + post-distillation LoRA from distill_loras/ is the in-workflow default; the rank-768 LoRA is the higher-rank alternative.

# Distill LoRA — required at stage-2 refinement in the canonical workflow (smaller file, in-workflow default)
huggingface-cli download SulphurAI/Sulphur-2-base \
  distill_loras/ltx-2.3-22b-distilled-lora-1.1_fro90_ceil72_condsafe.safetensors \
  --local-dir ComfyUI/models/loras/

# Optional — higher-rank alternative for the dev path
huggingface-cli download SulphurAI/Sulphur-2-base \
  sulphur_lora_rank_768.safetensors \
  --local-dir ComfyUI/models/loras/

3. Download the FP8-scaled Gemma 3 12B text encoder

Sulphur 2 inherits LTX-2.3's Gemma 3 12B text encoder. On 32 GB envelopes the upstream FP8-scaled variant from the ComfyUI org repackager is the right tier — the unquantized BF16 Gemma (24.38 GB) would crash with sulphur_dev_fp8mixed.safetensors resident; the FP4-mixed variant hits a known Blackwell regression (see Troubleshooting):

# 12.30 GiB / 13.21 GB-decimal — Blackwell-safe Gemma encoder
huggingface-cli download Comfy-Org/ltx-2 \
  split_files/text_encoders/gemma_3_12B_it_fp8_scaled.safetensors \
  --local-dir ComfyUI/models/text_encoders/

The Comfy-Org/ltx-2 text_encoders tree hosts four Gemma 3 12B variants; the FP8-scaled one (13.21 GB) is the canonical Blackwell pick.
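One detail worth checking after the download: in current huggingface_hub releases, `--local-dir` keeps the repo-relative path, so the encoder may land in `ComfyUI/models/text_encoders/split_files/text_encoders/` rather than directly in `text_encoders/`. If that happens on your install, a small helper like the following moves it up one level. The function name `flatten_download` is hypothetical, and the same idea applies to the `distill_loras/`, `workflows/`, and `prompt_enhancer/` downloads.

```python
import shutil
from pathlib import Path


def flatten_download(local_dir, repo_relative_path):
    """Move local_dir/<repo_relative_path> up to local_dir/<basename>.

    Returns the flat path. Does nothing if the nested file is absent
    (for example because the file is already flat).
    """
    local_dir = Path(local_dir)
    nested = local_dir / repo_relative_path
    flat = local_dir / Path(repo_relative_path).name
    if nested == flat or not nested.exists():
        return flat
    shutil.move(str(nested), str(flat))
    return flat


if __name__ == "__main__":
    # Paths match the step 3 command; run from the directory that contains ComfyUI/.
    print(flatten_download(
        "ComfyUI/models/text_encoders",
        "split_files/text_encoders/gemma_3_12B_it_fp8_scaled.safetensors",
    ))
```

Flattening keeps the filename in the loader dropdown identical to the bare name used elsewhere in this recipe.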

4. Download the LTX-2.3 VAE + spatial upscaler

The 2× spatial upscaler ships from the canonical Lightricks LTX-2.3 repo; the audio and video VAEs need no separate download on this path:

huggingface-cli download Lightricks/LTX-2.3 \
  ltx-2.3-spatial-upscaler-x2-1.1.safetensors \
  --local-dir ComfyUI/models/latent_upscale_models/

ComfyUI-LTXVideo's LTXVAudioVAELoader + LTXAVTextEncoderLoader nodes pull the in-checkpoint VAEs from the diffusion-model file directly (see the node implementations in the ComfyUI-LTXVideo source and the canonical ltx23_t2v base.json workflow node wiring) — no separate VAE download is required when running fp8mixed.

5. Download the canonical T2V workflow + prompt enhancer

# Canonical T2V quality workflow (50-step base + 5-step refinement)
huggingface-cli download SulphurAI/Sulphur-2-base \
  "workflows/ltx23_t2v base.json" \
  --local-dir ComfyUI/user/default/workflows/

# Optional: distilled t2v workflow for the fast path (8-step DISTILLED_SIGMA_VALUES)
huggingface-cli download SulphurAI/Sulphur-2-base \
  "workflows/ltx23_t2v distilled.json" \
  --local-dir ComfyUI/user/default/workflows/

# Prompt enhancer — Sulphur ships a Q8 GGUF intended to run on CPU via LM Studio or llama.cpp
huggingface-cli download SulphurAI/Sulphur-2-base \
  prompt_enhancer/sulphur_prompt_enhancer_model-q8_0.gguf \
  prompt_enhancer/mmproj-BF16.gguf \
  --local-dir ComfyUI/models/prompt_enhancer/

The Sulphur README's bootstrap guidance is explicit: "To get started with the model, I recommend downloading either of the dev versions, (fp8mixed or bf16) and downloading the distill lora provided. By the way, I'm aware the workflows contain sulphur_final right now, just use the lora or use the full models, don't use both at the same time." The workflows ship with a sulphur_final.safetensors reference inside the LoRA loader nodes — that file does not exist as a separate published artifact. Per the README's instruction, point the LoRA loader at either sulphur_lora_rank_768.safetensors (rank-768 distill LoRA) or the in-workflow-default ltx-2.3-22b-distilled-lora-1.1_fro90_ceil72_condsafe.safetensors, never both at once.

Running

Launch ComfyUI with Blackwell-friendly allocator flags:

PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True \
  python main.py --listen --reserve-vram 0.5

The --reserve-vram flag is official ComfyUI-LTXVideo guidance for tight-VRAM workloads (the README "Low VRAM" section documents the flag with example value 5). The specific 0.5 GiB value used here is BigBIueWhale's empirical 5090 tuning — at ~98% VRAM utilization on the default config, a 0.5 GiB reservation is enough to prevent fragmentation-OOM near the ceiling without wasting envelope; the reproducer README flags this as the "2.5% headroom" reasoning.

Load the canonical T2V quality workflow ltx23_t2v base.json. Replace the CheckpointLoaderSimple model field with sulphur_dev_fp8mixed.safetensors, leave the LTXAVTextEncoderLoader pointing at gemma_3_12B_it_fp8_scaled.safetensors (override the workflow's default gemma_3_12B_it_fp4_mixed.safetensors — see Troubleshooting), and confirm the LoRA loader points at one (not both) of the two distill LoRAs from step 2.
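If you prefer to make these substitutions outside the UI, the filename swaps can be scripted. The sketch below walks the workflow JSON and replaces exact string matches, so it does not depend on the node layout. It assumes the workflow stores bare filenames (no subfolder prefix), and the `sulphur_final.safetensors` target shown is only one of the two choices described in step 5; edit `MAPPING` to match yours. The function names are illustrative.

```python
import json
from pathlib import Path


def replace_filenames(node, mapping):
    """Recursively replace exact string matches anywhere in a JSON structure."""
    if isinstance(node, dict):
        return {key: replace_filenames(value, mapping) for key, value in node.items()}
    if isinstance(node, list):
        return [replace_filenames(item, mapping) for item in node]
    if isinstance(node, str):
        return mapping.get(node, node)
    return node


def patch_workflow(src, dst, mapping):
    """Read a workflow JSON file, apply the mapping, and write a patched copy."""
    workflow = json.loads(Path(src).read_text(encoding="utf-8"))
    patched = replace_filenames(workflow, mapping)
    Path(dst).write_text(json.dumps(patched, indent=2), encoding="utf-8")


# Encoder swap from step 3, plus an example choice for the missing LoRA reference.
MAPPING = {
    "gemma_3_12B_it_fp4_mixed.safetensors": "gemma_3_12B_it_fp8_scaled.safetensors",
    "sulphur_final.safetensors": "sulphur_lora_rank_768.safetensors",
}
```

A call such as `patch_workflow("ltx23_t2v base.json", "ltx23_t2v base.patched.json", MAPPING)` writes a copy and leaves the shipped workflow untouched, which makes it easy to diff the two afterwards.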

Canonical settings — Sulphur's tested envelope

The workflow ships pre-configured for the Sulphur fine-tune's tested operating point — these defaults come straight from ltx23_t2v base.json:

Parameter	Value	Source
Resolution	1280 × 704 (or 1344 × 768 stage-2)	`ResizeImageMaskNode` widget in `ltx23_t2v base.json`
Frame count	241 frames @ 24 fps = 10 s	`LTXVPreprocess` widget; satisfies LTX-2.3 `8n+1` constraint
Stage-1 sampler	`euler_ancestral`	`KSamplerSelect` widget
Stage-1 steps / CFG	50 / 3.6	`LTXVScheduler` + `CFGGuider` widgets
Stage-2 sigmas	5-step refinement `(0.85, 0.7933, 0.68, 0.51, 0.2833, 0.0)`	`ManualSigmas` widget
Stage-2 distill LoRA	`fro90_ceil72_condsafe` @ 0.5	Stage-2 `LoraLoaderModelOnly` (in-workflow default)
Stage-2 CFG	1.0	Stage-2 `CFGGuider`

For the fast path, load ltx23_t2v distilled.json instead. It uses the same Sulphur-tested envelope, but the stage-1 distill LoRA loader is enabled (mode: 0) for 8-step DISTILLED_SIGMA_VALUES sampling, roughly halving wall-clock time.

Optional: prompt enhancer

The Sulphur prompt enhancer ships as a Q8 GGUF + BF16 mmproj. The canonical Sulphur README documents the LM Studio loading path: "This model contains a prompt enhancer. The easiest way to get started with the prompt enhancer is by using it on lmstudio. The way to accomplish this is by going to your model folder inside lmstudio, then opening it up in your file explorer. Create a folder named "Sulphur", then a folder inside that called "promptenhancer". Inside that folder, place the gguf file and the mmproj file." For a programmatic path that uses no GPU VRAM, run it under a llama.cpp build compiled with GGML_CUDA=OFF so inference stays on the CPU. That is the approach the BigBlueWhale reproducer takes (10 GB system RAM, ~5–8 s per rewrite, never touches the GPU).
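The folder layout the README describes can be created with a short script. This is a sketch under two assumptions: `source_dir` is wherever the step 5 download placed the two files, and `lmstudio_models_dir` is your LM Studio models folder, whose location varies by install and is not specified here. The function name is hypothetical.

```python
import shutil
from pathlib import Path

ENHANCER_FILES = ("sulphur_prompt_enhancer_model-q8_0.gguf", "mmproj-BF16.gguf")


def stage_prompt_enhancer(source_dir, lmstudio_models_dir):
    """Copy the enhancer GGUF and mmproj into <models>/Sulphur/promptenhancer/.

    Returns the list of files actually copied; missing sources are skipped.
    """
    target = Path(lmstudio_models_dir) / "Sulphur" / "promptenhancer"
    target.mkdir(parents=True, exist_ok=True)
    copied = []
    for name in ENHANCER_FILES:
        src = Path(source_dir) / name
        if src.exists():
            shutil.copy2(src, target / name)
            copied.append(target / name)
    return copied
```

If the returned list has fewer than two entries, one of the files was not found in `source_dir`, and LM Studio will not be able to load the model with its projector.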

Results

Speed (community reproducer, RTX 5090): A single community-published first-hand RTX 5090 reproducer by BigBlueWhale (also published as HF Discussion #22 on the canonical Sulphur card) reports a default-envelope cold-start wall-clock of 188 s (~3 min 8 s) and warm-cache wall-clock of 169 s at 1280×704 × 10 s × quality mode with the prompt enhancer running CPU-side. Speaker is a community user (no Sulphur team / HF staff badge); validated 2026-05-11. The recipe's entire VRAM-and-speed envelope currently rests on this single reproducer plus the canonical Lightricks 32 GB+ spec floor — a second-source empirical measurement is queued (please submit yours via /contribute if you run the recipe). Empirical 5090 data from other measurement chains will appear at /check/sulphur-2/rtx-5090 as community benchmarks land.
VRAM usage: Same source reports cold-start peak of 31,795 MiB / 32,607 MiB (97.5%) and warm-cache peak of 32,095 MiB (98.4%) on the canonical Sulphur workflow at 1280×704 × 10 s × quality mode with gemma_3_12B_it_fp8_scaled encoder. This is consistent with the official 32 GB+ floor from the Lightricks ComfyUI-LTXVideo README and with the Issue #303 sibling-variant LTX-2 19B FP8 + unquantized Gemma 16 GB-RTX-5080 OOM report (peak 29,068 MiB on a smaller LTX-2 — Sulphur's 22B fp8mixed is larger, pushing the 5090 close to its ceiling). The 5090 hits the LTX-2.3 32 GB official floor with ~600 MiB of operating margin — no further envelope unlock; this is the binding card.
Quality notes: The fp8mixed dev weights produce broken output unless paired with a distill LoRA (community-confirmed in Discussion #14: user tech77 notes "For base model you always need to add DIstilled Lora ... Distilled model have this lora merged and no needed."). The recipe defaults above route through the stage-2-distill-LoRA path baked into the canonical workflow, which produces coherent output.
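The utilization percentages quoted above follow directly from the reported peaks. A small worked check, using only the figures from this section:

```python
TOTAL_MIB = 32_607  # RTX 5090 total from the Requirements table


def utilization(peak_mib, total_mib=TOTAL_MIB):
    """Return (percent used, MiB of headroom) for a measured peak."""
    return 100 * peak_mib / total_mib, total_mib - peak_mib


for label, peak in (("cold start", 31_795), ("warm cache", 32_095)):
    pct, free = utilization(peak)
    print(f"{label}: {pct:.1f}% used, {free} MiB free")
# cold start: 97.5% used, 812 MiB free
# warm cache: 98.4% used, 512 MiB free
```

The two headroom figures, 812 MiB and 512 MiB, bracket the ~600 MiB operating margin cited in this recipe.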

For the full benchmark data, see /check/sulphur-2/rtx-5090.

Troubleshooting

FP4-mixed Gemma encoder breaks framing on Blackwell

The canonical Sulphur workflow ships an LTXAVTextEncoderLoader node pointing at gemma_3_12B_it_fp4_mixed.safetensors. On Blackwell sm_120 cards, this combination hits the regression documented in Comfy-Org/ComfyUI#11920: the recent Gemma-3 multimodal patch in comfy/text_encoders/llama.py corrupts spatial alignment ("only the top half of a head is rendered instead of a medium shot"), and weight_dtype: fp8_e4m3fn triggers NotImplementedError: "addmm_cuda" not implemented for 'Float8_e4m3fn'. The issue was closed when the reporter (community user zappazack, no team badge) confirmed a workflow-side workaround. The simplest fix is to swap the encoder file to gemma_3_12B_it_fp8_scaled.safetensors (13.21 GB) from the Comfy-Org/ltx-2 text_encoders tree, the path documented in step 3 above. The BigBlueWhale reproducer does the same for the same reason: its versions.env documents the encoder swap as a Blackwell-stability fix, and its README explicitly flags that upstream FP4 hits ComfyUI #11920 on sm_120.

"sulphur_final" referenced in the workflow but missing locally

The canonical workflow JSON files contain a sulphur_final.safetensors reference inside the LoRA loader nodes. That file does not exist as a separate published artifact on the Sulphur HF repo. Per the README: "the workflows contain sulphur_final right now, just use the lora or use the full models, don't use both at the same time." If you loaded sulphur_dev_fp8mixed.safetensors in step 2 of the install, point that LoRA loader at the rank-768 LoRA (sulphur_lora_rank_768.safetensors), the distill LoRA for the dev checkpoint. The in-workflow stage-2 LoRA (ltx-2.3-22b-distilled-lora-1.1_fro90_ceil72_condsafe.safetensors from distill_loras/) stays untouched; both are required for clean output on the dev path.

Base/dev model produces corrupted output without distill LoRA

Multiple users in Discussion #14 report that running the dev (non-distilled) Sulphur weights — including sulphur_dev_fp8mixed.safetensors — without a distill LoRA produces broken video output. Community user tech77 notes: "For base model you always need to add DIstilled Lora - https://huggingface.co/SulphurAI/Sulphur-2-base/tree/main/distill_loras" and follows with: "Distilled model have this lora merged and no needed." The canonical workflow's stage-2 LoRA loader (pointing at ltx-2.3-22b-distilled-lora-1.1_fro90_ceil72_condsafe.safetensors) handles this — do not bypass that node when running the dev path.
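To confirm that no LoRA loader has been bypassed by accident, the workflow file can be inspected directly. The sketch below assumes the UI-format workflow layout (a top-level `nodes` list whose entries carry `type`, `mode`, and `widgets_values`) and the usual mode convention of 0 for active, 2 for muted, and 4 for bypassed; treat both as assumptions to verify against your own file. The function name is illustrative.

```python
LORA_NODE_TYPES = {"LoraLoaderModelOnly", "LoraLoader"}
ACTIVE_MODE = 0  # assumed convention: 0 = active, 2 = muted, 4 = bypassed


def lora_loader_report(workflow):
    """List (node type, first widget value, is_active) for each LoRA loader node."""
    report = []
    for node in workflow.get("nodes", []):
        if node.get("type") not in LORA_NODE_TYPES:
            continue
        widgets = node.get("widgets_values")
        # The first widget of a LoRA loader is expected to be the LoRA filename.
        first = widgets[0] if isinstance(widgets, list) and widgets else None
        is_active = node.get("mode", ACTIVE_MODE) == ACTIVE_MODE
        report.append((node["type"], first, is_active))
    return report
```

Load the file with `json.load` and pass the resulting dict in; any entry whose last field is `False` is a loader that will be skipped at run time.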

SageAttention `auto` selects unstable kernels on sm_120

The Sulphur workflow and the LTXVideo wrapper expect SageAttention 2++ for attention acceleration. The auto kernel selector occasionally lands on a path that's stable on Ada but flaky on Blackwell — the BigBlueWhale reproducer's versions.env pins SageAttention v2.2.0 built from source for sm_120 with the explicit kernel sageattn_qk_int8_pv_fp16_cuda (INT8 QK / FP16 PV). If you're seeing intermittent black frames or NaN warnings, set that kernel explicitly via the SageAttention node options.

FlashAttention-2 on Blackwell (sm_120)

LTX-2.3's PyTorch / ComfyUI path uses PyTorch SDPA + SageAttention 2, not raw FlashAttention-2, so the FA2 sm_120 wheel gap is rarely the blocking issue here. If you do try flash-attn via a custom node or wrapper, sm_120 kernel coverage is tracked at Dao-AILab/flash-attention#2168 (open since 2026-01-11) — fall back to SDPA or SageAttention 2 on Blackwell until that lands.

`sulphur_dev_bf16.safetensors` — why it doesn't fit even on the 5090

The BF16 dev weights are 42.97 GiB on disk per the HF tree API, which exceeds the 32 GB 5090 envelope by ~11 GiB before the encoder, VAE, and activations enter VRAM. Use sulphur_dev_fp8mixed.safetensors (27.16 GiB) as documented in step 2. The next step up that fits BF16 is the workstation-class RTX PRO 6000 Blackwell at 96 GB or a datacenter-class card. For users curious about the difference: fp8mixed is a quantization-aware fine-tune that targets Blackwell's FP8 tensor cores natively, so on sm_120 it is the intended runtime path, not a degraded fallback.

Pushing the LTX-2.3 envelope (1920×1088 × 20 s)

The 5090 has the VRAM headroom to run the LTX-2.3 base ceiling — 1920×1088 × 481 frames × 24 fps — but the Sulphur fine-tune was tested at a smaller envelope (~1024×576 × 25-125 frames per the BigBlueWhale reproducer's reverse-engineering of the shipped workflow cross-referencing the Musubi-tuner LTX-2.3 community standard and TenStrip's 10Eros sister model). The reproducer reports a successful 445 s wall-clock at 1920×1088 × 20 s in fast mode (31,713 MiB VRAM peak), but the quality-mode variant crashes at the audio VAE with avcodec_send_frame() returned 22 (EINVAL) — likely audio-VAE amplitude drift at the larger latent. Stick to the canonical 1280×704 × 10 s envelope unless you're OK with experimental behavior at the LTX ceiling.
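For a sense of scale, the ceiling envelope can be compared with the canonical one by counting pixel-frames. This is only a rough proxy for workload size, not a VRAM model: the reproducer's fast-mode peak at the ceiling (31,713 MiB) is actually below the canonical quality-mode peak.

```python
def pixel_frames(width, height, frames):
    """Total pixels across all frames of a clip."""
    return width * height * frames


canonical = pixel_frames(1280, 704, 241)   # 10 s envelope
ceiling = pixel_frames(1920, 1088, 481)    # 20 s envelope at the LTX-2.3 ceiling
print(f"{ceiling / canonical:.2f}x")  # 4.63x
```

The ceiling run processes roughly 4.6 times as many pixel-frames as the canonical clip, which is consistent with it being the experimental end of the range.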

Sibling recipes for smaller cards

Sulphur 2 has published recipes for 16 GB consumer cards via the community Q4_K_S GGUF path: RTX 5060 Ti 16GB sibling and RTX 4060 Ti 16GB sibling both use vantagewithai/Sulphur-2-Base-GGUF (Q4_K_S = 13.2 GB) + a Gemma 3 12B QAT-Q4 encoder. Those recipes are the right pick for cards below the 32 GB floor; this recipe is the canonical first-card-meets-floor path that the upstream maintainer designed sulphur_dev_fp8mixed.safetensors for.