
Sulphur 2 on RX 7900 XTX: Uncensored LTX-2.3 Video on ROCm via GGUF + the --disable-pinned-memory Load-Stall Fix

video · advanced · 24GB+ VRAM · Jun 17, 2026

This advanced recipe sets up Sulphur 2 on the RX 7900 XTX and targets about 24 GB of VRAM.

prerequisites
  • AMD Radeon RX 7900 XTX (24 GB VRAM, RDNA3 / Navi 31 / gfx1100) or equivalent ROCm-supported card
  • Linux (Ubuntu 24.04 / 22.04 or RHEL) with the AMD ROCm stack installed (ROCm 7.2.x)
  • 64 GB system RAM strongly recommended (the Gemma 3 12B text encoder is offloaded to RAM, and ROCm's aggressive RAM offload needs the headroom)
  • Python 3.12+ and PyTorch built for ROCm (not CUDA)
  • ComfyUI installed (latest version) with ComfyUI-LTXVideo, ComfyUI-GGUF, ComfyUI-KJNodes custom nodes
  • ~40 GB free disk for the GGUF dev transformer + the required distill LoRA + Gemma 3 QAT encoder + LTX VAE (the Q6_K path sums to about 38.8 GB)

What You'll Build

Generate uncensored text-to-video and image-to-video clips locally with Sulphur 2 — an LTX-2.3 fine-tune from SulphurAI — on a 24 GB Radeon RX 7900 XTX (RDNA3, Navi 31, gfx1100) through the ROCm stack. Sulphur 2 is built directly on Lightricks' LTX-2.3 audio-video DiT (the upstream card states the model "supports both t2v and i2v natively, as well as all of the other ltx 2.3 formats", SulphurAI/Sulphur-2-base), and the community GGUF carries architecture: ltxv (vantagewithai/Sulphur-2-Base-GGUF). Because Sulphur 2 is LTX-2.3 under the hood, it runs through the exact same ROCm path that LTX-2.3 itself runs through on this card — including the same documented ROCm load-stall and the same fix.

This recipe runs the community dev GGUF from vantagewithai/Sulphur-2-Base-GGUF — a quantization of the non-distilled dev weights — with the upstream distill LoRA graph-wired on top for the fast short-step schedule, plus a quantized Gemma 3 12B QAT encoder. At 24 GB you step up to a high-bit quant rather than fighting the 16 GB squeeze of the NVIDIA siblings.

Hardware data: RX 7900 XTX (24GB VRAM) · dev GGUF + REQUIRED distill LoRA + Gemma 3 12B QAT encoder · ComfyUI on ROCm 7.2 · See benchmark data

⚠️ This is a ROCm recipe, not CUDA. The RX 7900 XTX runs on AMD's ROCm/HIP stack — there is no cu124/cu128 wheel, no xformers install, and no FP8/FP4 path here. RDNA3 has no FP8/FP4 hardware (its WMMA units accept FP16, BF16, INT8, INT4 only), so the upstream sulphur_dev_fp8mixed.safetensors checkpoint would just upcast to FP16/BF16 on this card with no memory saving and no compute acceleration — it is not a path to take here. The attention path is PyTorch SDPA (ComfyUI's default), not FlashAttention-2 and not xformers. If a guide tells you to pip install xformers, build a flash-attn wheel, or pick a cu12x wheel for this card, it's written for the wrong vendor.

⚠️ The GGUF is the dev model — you MUST graph-wire the distill LoRA. Every tier published by vantagewithai is a quantization of the non-distilled dev weights, not of the distilled checkpoint. Per the SulphurAI README, the fast short-step path comes from applying the distill LoRA on top of the dev weights — "I recommend downloading either of the dev versions, (fp8mixed or bf16) and downloading the distill lora provided." On this recipe's GGUF path, the dev GGUF replaces the dev safetensors, and the distill LoRA (sulphur_lora_rank_768.safetensors) is still required and must stay wired into the graph. Without it, the 8-step / CFG=1 schedule below runs against an un-distilled model and produces degraded, under-denoised output. The README's note "just use the lora or use the full models, don't use both at the same time" means: EITHER (dev weights + distill LoRA — this recipe's path) OR a full distilled model (sulphur_distil_bf16.safetensors) — never stack the LoRA on an already-distilled full model, and never load the non-existent sulphur_final the shipped workflow references.

⚠️ At 24 GB, VRAM is not the binding constraint — ROCm memory management is. Unlike the 16 GB NVIDIA recipes for this model (which fight to fit), the 7900 XTX has plenty of room for a high-quality dev GGUF + distill LoRA + offloaded encoder. The real failure mode on this card is ROCm's pinned-memory / smart-memory / async-offload machinery stalling or OOM-ing during the large LTX-family weight load. An RX 7900 XTX owner running ROCm 7.2 reports LTX-2.3 "stalls during Requested to load LTXAV" and that a clean launch "always crashes with OOMs" (ComfyUI Issue #13730) — Sulphur 2 hits the identical stall because it loads through the same LTX-2.3 sampler and weight path. See Running for the exact launch flags.

ℹ️ License: LTX-2 Community License Agreement. The upstream SulphurAI/Sulphur-2-base repo ships a LICENSE.txt containing the LTX-2 Community License Agreement — only the Hugging Face model card's YAML license: metadata field is absent, not the license itself. This matches the community GGUF repo vantagewithai/Sulphur-2-Base-GGUF, which tags ltx-2-community-license-agreement, and is inherited from the LTX-2.3 base it derives from. Read the LTX-2 community license terms before any commercial use.

Requirements

| Component | Minimum | Tested |
| --- | --- | --- |
| GPU | 24 GB VRAM, ROCm-supported AMD card | RX 7900 XTX (24 GB, RDNA3 / gfx1100) |
| RAM | 32 GB | 64 GB recommended (Gemma encoder + ROCm RAM offload) |
| Storage | ~40 GB (the listed files sum to about 38.8 GB) | dev GGUF (Q6_K 17.8 GB) + distill LoRA 9.56 GiB + Gemma 3 QAT encoder ~7.7 GiB + LTX VAE 2.28 GiB |
| Driver | AMD ROCm 7.2.x on Linux | ROCm 7.2 (Issue #13730 reporter) |
| Software | ComfyUI + ComfyUI-LTXVideo + ComfyUI-GGUF + KJNodes | Python 3.12+, PyTorch (ROCm build) |
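
Summing the file sizes quoted in this recipe in a single unit gives a quick check on the storage figure. The sketch below assumes the GGUF size is decimal GB (as in the quant table) and that the other sizes are GiB as written; the dictionary keys are labels chosen for illustration.

```python
# Sizes as quoted in this recipe; GiB values are converted to decimal GB.
GIB_TO_GB = 2**30 / 10**9  # 1 GiB = 1.073741824 GB

downloads_gb = {
    "dev GGUF Q6_K": 17.8,  # already decimal GB
    "distill LoRA": 9.56 * GIB_TO_GB,
    "Gemma 3 QAT encoder + mmproj": (6.92 + 0.80) * GIB_TO_GB,
    "LTX video VAE": 2.28 * GIB_TO_GB,
}


def total_download_gb(sizes):
    """Sum the per-file sizes (decimal GB)."""
    return sum(sizes.values())


if __name__ == "__main__":
    for name, gb in downloads_gb.items():
        print(f"{name:32s} {gb:6.2f} GB")
    print(f"{'total':32s} {total_download_gb(downloads_gb):6.2f} GB")
```

With the Q6_K lead the total comes to a little under 39 GB. Swapping in Q8_0 (22.8 GB) adds another 5 GB.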

Sulphur 2 is a fine-tune of Lightricks' LTX-2.3 (architecture: ltxv per the vantagewithai GGUF card) and inherits its Gemma 3 12B text encoder. The CUDA/cu12x instructions that appear in NVIDIA Sulphur-2 guides do not apply on this card — on the 7900 XTX you install the ROCm PyTorch wheel instead (Step 2), and PyTorch's HIP runtime presents as the cuda device namespace.

Why not the full BF16 or fp8mixed dev weights on a 24 GB card? The upstream sulphur_dev_bf16.safetensors is 42.97 GiB and sulphur_dev_fp8mixed.safetensors is 27.16 GiB (upstream tree) — both exceed 24 GB resident, and on RDNA3 the fp8mixed file gives no memory win (it upcasts to BF16, no FP8 hardware). So even at 24 GB you run a high-bit GGUF quant of the dev weights. The difference from the 16 GB NVIDIA recipe is that at 24 GB you drop the tight Q4 squeeze and step up to a near-lossless Q6_K / Q8_0 quant for better quality.

Installation

1. Install ComfyUI

Per the ComfyUI README:

git clone https://github.com/comfyanonymous/ComfyUI.git
cd ComfyUI
python3 -m venv .venv
source .venv/bin/activate

2. Install PyTorch for ROCm

The RX 7900 XTX (gfx1100) is an officially ROCm-supported GPU, so it uses the stable ROCm PyTorch wheel — not a CUDA wheel. Per the ComfyUI README "AMD GPUs (Linux)" section:

pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm7.2

ℹ️ Verify the ROCm tag before you copy it. As of this writing the ComfyUI README pins rocm7.2 as the stable wheel — but the rocmX.Y tag moves over time (6.3 → 6.4 → 7.x). Read the current line in the live ComfyUI README before running. Confirm you got the ROCm build: python -c "import torch; print(torch.__version__)" should print a +rocm7.2-style suffix, and torch.cuda.is_available() returns True (ROCm masquerades as the cuda device namespace under HIP).
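
The check described above can be wrapped in a small script. This is a minimal sketch: `is_rocm_build` is a hypothetical helper name, and it relies on `torch.version.hip`, which is `None` on CUDA and CPU builds and a version string on ROCm builds.

```python
def is_rocm_build(version, hip_version):
    """Return True when the values look like a ROCm (HIP) PyTorch build.

    version     -- torch.__version__, expected to carry a "+rocm..." suffix
    hip_version -- torch.version.hip, which is None on CUDA and CPU builds
    """
    return hip_version is not None or "+rocm" in version


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        print("PyTorch is not installed in this environment")
    else:
        hip = getattr(torch.version, "hip", None)
        print("torch:", torch.__version__, "| hip:", hip)
        print("ROCm build:", is_rocm_build(torch.__version__, hip))
        # On a working ROCm install this is True even though the GPU is AMD.
        print("GPU visible:", torch.cuda.is_available())
```

If "ROCm build" prints False, the CUDA or CPU wheel was installed; see the "Torch not compiled with CUDA enabled" entry in Troubleshooting.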

3. Install ComfyUI dependencies and the LTX-Video, GGUF, and KJNodes custom nodes

# core deps
pip install -r requirements.txt

cd custom_nodes

# Official Lightricks ComfyUI nodes — provide the LTXV-prefixed Sulphur-2 graph nodes
git clone https://github.com/Lightricks/ComfyUI-LTXVideo.git
pip install -r ComfyUI-LTXVideo/requirements.txt

# city96's GGUF loader — required for the quantized dev transformer
git clone https://github.com/city96/ComfyUI-GGUF.git
pip install -r ComfyUI-GGUF/requirements.txt

# Kijai's KJNodes — used by the recommended offload workflows
git clone https://github.com/kijai/ComfyUI-KJNodes.git
pip install -r ComfyUI-KJNodes/requirements.txt
cd ..

The canonical Sulphur-2 workflow uses LTXV-prefixed nodes (LTXVConcatAVLatent, LTXVCropGuides, LTXVPreprocess, LTXVScheduler) provided by ComfyUI-LTXVideo, together with ComfyUI's built-in SamplerCustomAdvanced node, as confirmed by inspecting workflows/ltx23_t2v distilled.json on the upstream repo. These are the same nodes that drive the LTX-2.3 base on this card; nothing here is Sulphur-specific or ROCm-incompatible.

4. Download a dev GGUF transformer (lead: Q6_K, 17.8 GB)

With 24 GB you are not memory-bound the way a 16 GB card is — pick a high-bit quant for near-lossless quality. Q6_K (17.8 GB) is the recommended lead: it leaves clear headroom for the distill LoRA and VAE-decode activations once the encoder is offloaded. Q5_K_S (15 GB) is a lighter option, and Q8_0 (22.8 GB) is the maximum-quality quant if you keep the encoder fully off the GPU (it fills most of the card).

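# Run the downloads in Steps 4-8 from the directory that contains ComfyUI/
# (cd .. first if you are still inside ComfyUI/ after Step 3)
# Q6_K dev GGUF (17.8 GB on disk): recommended lead for 24 GB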
huggingface-cli download vantagewithai/Sulphur-2-Base-GGUF \
  sulphur_dev-Q6_K.gguf \
  --local-dir ComfyUI/models/unet/

# OR Q8_0 dev GGUF (22.8 GB) — max quality, fills most of the 24 GB card
# huggingface-cli download vantagewithai/Sulphur-2-Base-GGUF \
#   sulphur_dev-Q8_0.gguf \
#   --local-dir ComfyUI/models/unet/

Quant-tier file-size reference (from the vantagewithai/Sulphur-2-Base-GGUF card, architecture: ltxv). Every tier is a quant of the non-distilled dev weights — the distill LoRA in Step 5 is required on top of any of them:

| Quant | Size (GB, decimal) | Fit on 24 GB |
| --- | --- | --- |
| Q4_K_S | 13.2 | comfortable (but lower quality than needed here) |
| Q5_K_S | 15.0 | comfortable |
| Q6_K | 17.8 | recommended lead; room for LoRA + VAE activations |
| Q8_0 | 22.8 | tight; fills most of the card |
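
The choice between tiers can be framed as "largest file that still leaves a chosen reserve free". The sketch below uses only the sizes from the quant table. `pick_quant` and its `reserve_gb` parameter are hypothetical: the reserve is a knob for the LoRA, activations, and runtime overhead, not a measured value.

```python
# File sizes in decimal GB, copied from the quant table above.
QUANTS_GB = {"Q4_K_S": 13.2, "Q5_K_S": 15.0, "Q6_K": 17.8, "Q8_0": 22.8}


def pick_quant(vram_gb, reserve_gb):
    """Return the largest quant whose file fits in vram_gb minus reserve_gb.

    reserve_gb is an assumed allowance, not a benchmarked figure.
    Returns None when no tier fits.
    """
    budget = vram_gb - reserve_gb
    fitting = [(size, name) for name, size in QUANTS_GB.items() if size <= budget]
    return max(fitting)[1] if fitting else None


if __name__ == "__main__":
    for reserve in (1.0, 4.0, 8.0):
        print(f"reserve {reserve:4.1f} GB -> {pick_quant(24.0, reserve)}")
```

A reserve of around 4 GB selects Q6_K on a 24 GB card, matching the recommended lead. Q8_0 only wins when almost nothing else is kept resident.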

5. Download the distill LoRA (REQUIRED for the 8-step path)

The dev GGUF is not distilled. To get Sulphur 2's fast short-step (8-step / CFG=1) behavior, you must apply the upstream distill LoRA on top of it — and keep it wired into the graph:

huggingface-cli download SulphurAI/Sulphur-2-base \
  sulphur_lora_rank_768.safetensors \
  --local-dir ComfyUI/models/loras/

This is the sulphur_lora_rank_768.safetensors file (9.56 GiB) listed on the upstream tree. Per the SulphurAI README, the two valid configurations are mutually exclusive — "just use the lora or use the full models, don't use both at the same time" — i.e. EITHER (dev weights + this LoRA, which is the GGUF path here) OR a full distilled checkpoint, never both. LoRA application is pure PyTorch and works identically on the ROCm/SDPA path — there is no custom op here, so the distill LoRA graph-wires on the 7900 XTX exactly as it does on NVIDIA.

6. Download the quantized Gemma 3 12B text encoder

Sulphur 2 inherits LTX-2.3's Gemma 3 12B text encoder. Even though the 7900 XTX has 24 GB, the cleanest setup keeps the encoder off the GPU and lets the transformer own VRAM — and ROCm's RAM-offload path (see Running) handles the encoder on CPU. Download the QAT-Q4 GGUF encoder:

huggingface-cli download unsloth/gemma-3-12b-it-qat-GGUF \
  gemma-3-12b-it-qat-UD-Q4_K_XL.gguf \
  mmproj-BF16.gguf \
  --local-dir ComfyUI/models/text_encoders/

Both files are loaded by ComfyUI-GGUF's Gemma encoder node (gemma-3-12b-it-qat-UD-Q4_K_XL.gguf is 6.92 GiB, mmproj-BF16.gguf is 0.80 GiB per the unsloth tree).

7. Download the LTX video VAE (Kijai community mirror)

Sulphur 2 reuses the upstream LTX video VAE — SulphurAI/Sulphur-2-base does not expose the VAE as a standalone file (it ships only sulphur_dev_bf16, sulphur_dev_fp8mixed, sulphur_distil_bf16, the rank-768 distill LoRA, the prompt enhancer, and workflows). The simplest path for the GGUF flow is the community mirror by Kijai, which exposes a standalone bf16 VAE — architecture: ltxv is shared across the LTX family:

huggingface-cli download Kijai/LTXV2_comfy \
  VAE/LTX2_video_vae_bf16.safetensors \
  --local-dir ComfyUI/models/vae/

File listing confirmed at Kijai/LTXV2_comfy.

8. Download the canonical Sulphur-2 workflow JSON

The canonical Sulphur 2 ComfyUI workflow lives on the upstream SulphurAI repo:

huggingface-cli download SulphurAI/Sulphur-2-base \
  "workflows/ltx23_t2v distilled.json" \
  --local-dir ComfyUI/user/default/workflows/

The upstream README is explicit that the shipped workflow references a checkpoint that does not exist as a published file: "By the way, I'm aware the workflows contain sulphur_final right now, just use the lora or use the full models, don't use both at the same time." You will rewire the LoRA loader to point at sulphur_lora_rank_768.safetensors in the next section.
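
Before launching, it can help to confirm that the files from Steps 4 to 7 are where ComfyUI will look for them. This is a minimal sketch for the Q6_K lead path; `missing_files` is a hypothetical helper, and the search is recursive because huggingface-cli keeps each file's repo subfolder under `--local-dir`.

```python
from pathlib import Path

# Model subfolder -> file names downloaded in Steps 4-7 (Q6_K lead path).
EXPECTED = {
    "unet": ["sulphur_dev-Q6_K.gguf"],
    "loras": ["sulphur_lora_rank_768.safetensors"],
    "text_encoders": ["gemma-3-12b-it-qat-UD-Q4_K_XL.gguf", "mmproj-BF16.gguf"],
    "vae": ["LTX2_video_vae_bf16.safetensors"],
}


def missing_files(models_dir, expected=EXPECTED):
    """List 'subfolder/name' entries that are not found under models_dir."""
    root = Path(models_dir)
    missing = []
    for subdir, names in expected.items():
        base = root / subdir
        for name in names:
            # Search recursively so files inside repo subfolders still count.
            found = base.is_dir() and any(base.rglob(name))
            if not found:
                missing.append(f"{subdir}/{name}")
    return missing


if __name__ == "__main__":
    gaps = missing_files("ComfyUI/models")
    print("all model files present" if not gaps else "missing: " + ", ".join(gaps))
```

Run it from the directory that contains ComfyUI/. If you chose a different quant tier, edit the `unet` entry to match.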

Running

This is the part that breaks on the 7900 XTX if you launch ComfyUI clean. Because Sulphur 2 loads through the LTX-2.3 sampler and weight path, it hits the same documented LTX-family load-stall on this card: it stalls at "Requested to load LTXAV" and then OOMs because ROCm's pinned-memory, smart-memory, and async-offload paths interact badly with the large weight load. An RX 7900 XTX owner on ROCm 7.2 reports their only working configuration is to disable them (Issue #13730):

# Working launch on RX 7900 XTX / ROCm 7.2 — the load-stall fix
python main.py --listen --disable-pinned-memory --disable-async-offload --disable-dynamic-vram

That reporter states plainly: "Without this config and a clean comfy launch it always crashes with OOMs." (Issue #13730). The same issue also lists --reserve-vram 0.5, --cache-none, and --use-quad-cross-attention as part of their working set — add them if the bare three-flag launch still stalls.

The single most important flag is --disable-pinned-memory (ComfyUI cli_args.py: "Disable pinned memory use."). A second RX 7900 XTX owner running the LTX-2 family confirms the same flag is load-bearing — without it the second generation crashes; with it a 1024×1024 run completes (Issue #11949). AMD's own ComfyUI-on-Radeon install guide likewise documents: "If running on low-memory configs, try adding the --lowvram and --disable-pinned-memory parameters to the run command."

If the load-stall persists, add --disable-smart-memory (cli_args.py: "Force ComfyUI to agressively offload to regular ram instead of keeping models in vram when it can."), which forces the Gemma encoder and idle weights out to system RAM — this is why 64 GB RAM is recommended:

# If the stall persists, also force RAM offload
python main.py --listen --disable-pinned-memory --disable-async-offload --disable-dynamic-vram --disable-smart-memory
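
The flag combinations above can be assembled programmatically, which makes it easier to step through them one at a time. This is a sketch only: `build_launch_args` and its keyword names are hypothetical, while every flag string is copied from the commands and issue reports quoted in this section.

```python
import sys

# Flags quoted in this recipe, grouped by the step that introduces them.
BASE_FLAGS = ["--disable-pinned-memory", "--disable-async-offload", "--disable-dynamic-vram"]
EXTRA_FLAGS = ["--reserve-vram", "0.5", "--cache-none", "--use-quad-cross-attention"]
RAM_OFFLOAD_FLAGS = ["--disable-smart-memory"]


def build_launch_args(extra=False, ram_offload=False, listen=True):
    """Build the argv list for ComfyUI's main.py.

    extra       -- add the reporter's additional flags from Issue #13730
    ram_offload -- add --disable-smart-memory to force RAM offload
    """
    args = [sys.executable, "main.py"]
    if listen:
        args.append("--listen")
    args += BASE_FLAGS
    if ram_offload:
        args += RAM_OFFLOAD_FLAGS
    if extra:
        args += EXTRA_FLAGS
    return args


if __name__ == "__main__":
    # Print each variant; to launch one, pass the list to subprocess.run()
    # with cwd set to the ComfyUI directory.
    for extra, ram in [(False, False), (True, False), (False, True)]:
        print(" ".join(build_launch_args(extra=extra, ram_offload=ram)))
```

Start with the defaults and enable one option at a time, so that you can tell which flag actually clears the stall on your system.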

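Once running, open the browser UI (default http://127.0.0.1:8188) and load the workflow downloaded in step 8. huggingface-cli keeps the repo's workflows/ subfolder under --local-dir, so the file lands at:

ComfyUI/user/default/workflows/workflows/ltx23_t2v distilled.json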

In the loaded graph, make three swaps:

  1. Model loader. The shipped workflow loads the model via a CheckpointLoaderSimple pointed at a full LTX-2.3 dev checkpoint. Replace it with the Unet Loader (GGUF) node from ComfyUI-GGUF and point it at sulphur_dev-Q6_K.gguf (Step 4).
  2. Distill LoRA (do NOT remove it). The shipped graph contains LoraLoaderModelOnly nodes referencing the missing sulphur_final.safetensors. Point those at sulphur_lora_rank_768.safetensors (Step 5) — this is what supplies the distillation on the dev GGUF path and makes the 8-step / CFG=1 schedule valid. Do not delete the LoRA node; without it the model runs un-distilled and produces under-baked frames.
  3. Text encoder. Point the text-encoder loader at the GGUF Gemma 3 loader from ComfyUI-GGUF (gemma-3-12b-it-qat-UD-Q4_K_XL.gguf + mmproj-BF16.gguf).
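
Swap 2 can also be applied to the JSON file before loading it. The sketch below is schema-agnostic: it walks the whole document and replaces any string that mentions sulphur_final and ends in .safetensors, because the exact location of the LoRA name inside the workflow file is an assumption here. `repoint_lora` and `patch_workflow_file` are hypothetical helper names. Swaps 1 and 3 change node types and are easier to do in the UI.

```python
import json

MISSING_NAME = "sulphur_final"
DISTILL_LORA = "sulphur_lora_rank_768.safetensors"


def repoint_lora(node, replacement=DISTILL_LORA):
    """Return (patched_copy, replacement_count) for a parsed JSON value.

    Only strings that contain MISSING_NAME and end in ".safetensors" are
    replaced, so free-text notes in the workflow are left alone.
    """
    if isinstance(node, str):
        hit = MISSING_NAME in node and node.endswith(".safetensors")
        return (replacement, 1) if hit else (node, 0)
    if isinstance(node, list):
        pairs = [repoint_lora(item, replacement) for item in node]
        return [value for value, _ in pairs], sum(count for _, count in pairs)
    if isinstance(node, dict):
        pairs = {key: repoint_lora(value, replacement) for key, value in node.items()}
        patched = {key: value for key, (value, _) in pairs.items()}
        return patched, sum(count for _, count in pairs.values())
    return node, 0


def patch_workflow_file(src, dst):
    """Read a workflow JSON, repoint the LoRA references, write the result."""
    with open(src, encoding="utf-8") as fh:
        workflow = json.load(fh)
    patched, count = repoint_lora(workflow)
    with open(dst, "w", encoding="utf-8") as fh:
        json.dump(patched, fh, indent=2)
    return count
```

A return value of 0 from `patch_workflow_file` means no matching reference was found; in that case open the graph and check the LoraLoaderModelOnly nodes by hand.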

The shipped distilled workflow's sampler is already configured for the short-step path: an LTXVScheduler set to 8 steps, a CFGGuider at CFG = 1, and an euler_ancestral_cfg_pp / lcm sampler. That 8-step / CFG=1 profile is only correct because the distill LoRA is applied — keep it as shipped once the LoRA is wired in.

Recommended settings

| Parameter | Value | Source |
| --- | --- | --- |
| Sampler steps | 8 | Distilled (LoRA-applied) short-step path, LTXVScheduler widget in ltx23_t2v distilled.json |
| CFG | 1.0 | CFGGuider widget, same file |
| Resolution | width and height divisible by 32; start at 512×512 | LTX-2.3 lineage constraint; scale up after the load clears the ROCm stall |
| Frame count | a multiple of 8, plus 1 (8k + 1, e.g. 65, 97, 121); start at 65 | LTX-2.3 lineage constraint |

Start small (e.g. 512×512, 65 frames) to confirm the load clears the ROCm stall before scaling resolution. The canonical workflow's frame/resolution defaults are tuned for high-VRAM cards — on the 7900 XTX you have room, but clear the stall first, then scale.
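
The two shape constraints in the settings table are easy to check before queueing a run. A minimal sketch, with hypothetical helper names, that encodes "divisible by 32" for resolution and "a multiple of 8, plus 1" for frame count:

```python
def is_valid_resolution(width, height):
    """Width and height must both be positive multiples of 32."""
    return width > 0 and height > 0 and width % 32 == 0 and height % 32 == 0


def is_valid_frame_count(frames):
    """Frame count must be a multiple of 8, plus 1 (65, 97, 121, ...)."""
    return frames >= 1 and (frames - 1) % 8 == 0


def snap_frames_down(frames):
    """Largest valid frame count that does not exceed the request (minimum 1)."""
    return max(1, (frames - 1) // 8 * 8 + 1)


if __name__ == "__main__":
    print(is_valid_resolution(512, 512), is_valid_frame_count(65))
    print(snap_frames_down(100))
```

For example, a request for 100 frames snaps down to 97, and 512×512 at 65 frames passes both checks.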

Results

  • Speed: Omitted. No RX-7900-XTX Sulphur-2 benchmark at a fixed configuration has been verified, and /check/sulphur-2/rx-7900-xtx currently has no benchmark data. The single 7900-XTX timing in the wild (1024×1024 in 114.22 s, Issue #11949) is for the base LTX-2 family, not Sulphur 2, and is a one-off second-run number — not a stable benchmark — so it is not quoted as this recipe's speed. If you've measured Sulphur-2 timings on a 7900 XTX, please contribute them so they land on /check/sulphur-2/rx-7900-xtx.
  • VRAM usage: At 24 GB the Q6_K dev GGUF (17.8 GB) sits resident with headroom for the distill LoRA and VAE-decode activations, while the Gemma encoder runs from system RAM. VRAM is not the binding constraint on this card — the binding factor is ROCm's memory-management stall during the weight load (see Running / Troubleshooting). The min_vram_gb: 24 reflects the tested card and the high-bit-quant lead path; lighter quants (Q5_K_S, Q4_K_S) reduce the resident footprint further. See /check/sulphur-2/rx-7900-xtx for any community-submitted measurement.
  • Quality notes: The recommended configuration is the non-distilled dev GGUF with the distill LoRA applied — that combination delivers the 8-step / CFG=1 short-step sampling shipped in the distilled workflow JSON. Running the dev GGUF without the LoRA at those settings under-denoises and degrades output; if you ever drop the LoRA, switch to the dev model's own (longer, higher-CFG) schedule. At 24 GB you run Q6_K or Q8_0 rather than the 16 GB cards' Q4 — near-lossless versus the dev BF16 transformer. There is no FP8/FP4 path to consider on RDNA3; the GGUF integer quant is the memory-saving route, not the upstream fp8mixed file.

For the full benchmark data and other-GPU comparisons, see /check/sulphur-2/rx-7900-xtx.

Troubleshooting

Sulphur 2 stalls at "Requested to load LTXAV" and then OOMs (the headline 7900 XTX trap)

This is the dominant 7900 XTX failure mode and it is not about running out of 24 GB — it is ROCm's memory manager mishandling the large LTX-family weight load. Because Sulphur 2 loads through the LTX-2.3 sampler, it hits the exact stall an RX 7900 XTX / ROCm 7.2 owner reports for LTX-2.3: the model "stalls during Requested to load LTXAV" and a clean launch "always crashes with OOMs" (Issue #13730). Fixes, in order:

  1. Launch with --disable-pinned-memory --disable-async-offload --disable-dynamic-vram — the exact working config from the Issue #13730 reporter. This alone clears the stall for most users. If it doesn't, add their further flags --reserve-vram 0.5 --cache-none --use-quad-cross-attention.
  2. Add --disable-smart-memory to force idle weights and the Gemma encoder out to system RAM (cli_args.py: "Force ComfyUI to agressively offload to regular ram..."). Needs ample RAM (64 GB recommended).
  3. Confirm --disable-pinned-memory is present. It is the single most-cited flag for this class of failure on the 7900 XTX — a second owner confirms the LTX family crashes without it and runs with it (Issue #11949), and AMD's ComfyUI-Radeon guide recommends it for low-memory configs.

Output looks under-baked / noisy at 8 steps

If you used the dev GGUF but forgot to wire the distill LoRA, the 8-step / CFG=1 schedule runs against an un-distilled model and produces under-denoised, low-quality frames. Confirm the LoraLoaderModelOnly node is pointed at sulphur_lora_rank_768.safetensors (Step 5) and is connected to the GGUF model loader — the dev GGUF is not distilled on its own. This is the most common Sulphur-2 mistake; it is independent of GPU vendor.

"sulphur_final" referenced in the workflow but missing locally

The upstream workflow JSON contains a sulphur_final.safetensors LoRA reference that does not exist as a published file. Per the SulphurAI README: "just use the lora or use the full models, don't use both at the same time." On the GGUF path, point that LoraLoaderModelOnly node at sulphur_lora_rank_768.safetensors (the published distill LoRA) instead.

"Torch not compiled with CUDA enabled"

A CUDA build of PyTorch got installed instead of the ROCm build. Per the ComfyUI README troubleshooting note, uninstall and reinstall against the ROCm wheel index:

pip uninstall torch torchvision torchaudio
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm7.2

torch.cuda.is_available() should return True even on AMD — ROCm presents under the cuda device namespace via HIP.

Do not install xformers or FlashAttention on this card, and do not use the fp8mixed weights

Hugging Face and ComfyUI guides written for NVIDIA frequently suggest pip install xformers, a FlashAttention wheel, or the upstream sulphur_dev_fp8mixed.safetensors. On RDNA3 these are all the wrong path: the ROCm xformers fork is limited (no FP32, head-dim ≤ 256), consumer-card CK FlashAttention builds routinely fail on gfx1100, and RDNA3 has no FP8 hardware, so the fp8mixed checkpoint upcasts to BF16/FP16 with no memory saving and no compute acceleration. ComfyUI already routes attention through PyTorch SDPA on this stack. Stick with the default, and use the GGUF integer quants for the memory savings.

Gemma GGUF loader fails or outputs gibberish

The Gemma 3 GGUF loader in ComfyUI-GGUF needed recent loader fixes merged for the LTX-2 family path (Sulphur-2 inherits this requirement via the LTX-2.3 lineage it builds on); pull the latest city96/ComfyUI-GGUF main so the Gemma encoder loads correctly — see the loader-compatibility notes on Kijai/LTXV2_comfy discussion #7.

Encoder is slow on CPU

Forcing the Gemma 3 12B encoder to RAM (via --disable-smart-memory / low-VRAM modes) makes the text-encode pass slow. The Lightricks node ships an "LTXV Audio Text Encoder Loader" that a community user reports loads Gemma "8x times faster then normal loader" (sic) (ComfyUI-LTXVideo Issue #303 comment). Swap the default Gemma loader for it where available. Also keep the encoder offloaded with the KJNodes model-offload nodes so it stays out of VRAM while the DiT samples.

common questions
How much VRAM does Sulphur 2 need?

About 24 GB — the minimum this recipe targets.

Which GPUs is Sulphur 2 tested on?

RX 7900 XTX (24 GB).

How hard is this setup?

Advanced — follow the steps above.