
Sulphur 2 on RX 7800 XT: Uncensored LTX-2.3 Video on ROCm via ComfyUI GGUF

video · advanced · 16GB+ VRAM · Jun 19, 2026

This advanced recipe sets up Sulphur 2 on the RX 7800 XT, needing about 16 GB of VRAM.

prerequisites
  • AMD Radeon RX 7800 XT (16 GB VRAM, RDNA3 / Navi 32 / gfx1101) or an equivalent 16 GB ROCm-supported card
  • Linux (Ubuntu 24.04 / 22.04 or RHEL) with the AMD ROCm stack installed (ROCm 7.2.x)
  • 32 GB+ system RAM (the Gemma 3 12B text encoder is offloaded to RAM, and ROCm's aggressive RAM offload needs the headroom)
  • Python 3.12+ and PyTorch built for ROCm (not CUDA)
  • ComfyUI installed (latest version) with ComfyUI-LTXVideo, ComfyUI-GGUF, ComfyUI-KJNodes custom nodes
  • ~25 GB free disk for the Q4-tier dev GGUF + the required distill LoRA + Gemma 3 QAT encoder + LTX VAE

What You'll Build

Generate uncensored text-to-video and image-to-video clips locally with Sulphur 2, an LTX-2.3 fine-tune from SulphurAI, on a 16 GB Radeon RX 7800 XT (RDNA3, Navi 32, gfx1101) through the ROCm stack. Sulphur 2 is built directly on Lightricks' LTX-2.3 audio-video DiT (the upstream card describes the model as "supporting both t2v and i2v natively, as well as all of the other ltx 2.3 formats", SulphurAI/Sulphur-2-base), and the community GGUF carries the ltxv architecture tags (vantagewithai/Sulphur-2-Base-GGUF). Because Sulphur 2 is LTX-2.3 under the hood, it runs through the exact same ROCm path that LTX-2.3 itself runs through on this card, including the same documented ROCm load-stall and the same fix.

This recipe runs the community Q4_K_S dev GGUF (12.29 GiB) from vantagewithai/Sulphur-2-Base-GGUF, a quantization of the non-distilled dev weights, with the upstream distill LoRA graph-wired on top for the fast short-step schedule, plus a quantized Gemma 3 12B QAT encoder offloaded to system RAM. On 16 GB the binding constraint is VRAM. RDNA3 has no FP8 escape hatch, and unlike the 24 GB Radeon sibling this card has no room to step up to a higher quant tier, so the integer GGUF quant is the only memory-saving route and you lead with a Q4 tier, exactly as the 16 GB NVIDIA recipes do.

Hardware data: RX 7800 XT (16 GB VRAM, gfx1101) · Q4_K_S dev GGUF + REQUIRED distill LoRA + Gemma 3 12B QAT-Q4 encoder · ComfyUI on ROCm 7.2 · See benchmark data

⚠️ This is a ROCm recipe, not CUDA — and on RDNA3 there is no FP8. The RX 7800 XT runs on AMD's ROCm/HIP stack: there is no cu124/cu128 wheel, no xformers install, and no FP8/FP4 path here. RDNA3's WMMA units accept FP16, BF16, INT8, INT4 only — there is no FP8 hardware — so the upstream sulphur_dev_fp8mixed.safetensors checkpoint would just upcast to FP16/BF16 on this card with no memory saving and no compute acceleration. That matters more on 16 GB than on the 24 GB sibling: the FP8-to-squeeze move that 16 GB NVIDIA cards can fall back on does not exist here, so the integer GGUF quant is the squeeze. The attention path is PyTorch SDPA (ComfyUI's default), not FlashAttention-2 and not xformers. If a guide tells you to pip install xformers, build a flash-attn wheel, or pick a cu12x wheel for this card, it's written for the wrong vendor.

⚠️ The GGUF is the dev model — you MUST graph-wire the distill LoRA. Every tier published by vantagewithai is named sulphur_dev-*.gguf: it is a quantization of the non-distilled dev weights, not of the distilled checkpoint. Upstream ships sulphur_dev_bf16 AND sulphur_distil_bf16 as two separate 42.97 GiB checkpoints (upstream tree); the GGUF is a quant of the dev one. Per the SulphurAI README, the fast short-step path comes from applying the distill LoRA on top of the dev weights — "I recommend downloading either of the dev versions, (fp8mixed or bf16) and downloading the distill lora provided." On this recipe's GGUF path, the dev GGUF replaces the dev safetensors, and the distill LoRA (ltx-2.3-22b-distilled-lora-1.1_fro90_ceil72_condsafe.safetensors) is still required and must stay wired into the graph. Without it, the 8-step / CFG=1 schedule below runs against an un-distilled model and produces degraded, under-denoised output. The README's note "just use the lora or use the full models, don't use both at the same time" means: EITHER (dev weights + distill LoRA — this recipe's path) OR a full distilled model (sulphur_distil_bf16.safetensors) — never stack the LoRA on an already-distilled full model, and never load the non-existent sulphur_final the shipped workflow references.

⚠️ Tight on 16 GB. Even the Q4_K_S dev GGUF + distill LoRA + QAT encoder + VAE stack lives within ~1–2 GiB of the 16 GiB ceiling. Start with low frame counts and resolution and scale up carefully. And because Sulphur 2 loads through the LTX-2.3 sampler, it also hits the documented RDNA3 ROCm load-stall (see Running) — that is a memory-management failure on top of the VRAM squeeze, not the same thing as running out of VRAM.

ℹ️ License: LTX-2 Community License Agreement. The upstream SulphurAI/Sulphur-2-base repo ships a LICENSE.txt containing the LTX-2 Community License Agreement — only the Hugging Face model card's YAML license: metadata field is absent, not the license itself. This matches the community GGUF repo vantagewithai/Sulphur-2-Base-GGUF, which tags ltx-2-community-license-agreement, and is inherited from the LTX-2.3 base it derives from. Read the LTX-2 community license terms before any commercial use.

Requirements

Component | Minimum | Tested
GPU | 16 GB VRAM, ROCm-supported AMD card | RX 7800 XT (16 GB, RDNA3 / gfx1101)
RAM | 32 GB | 32 GB+ (Gemma encoder + ROCm RAM offload)
Storage | ~25 GB | Q4_K_S dev GGUF 12.29 GiB + distill LoRA 662 MB + Gemma 3 QAT encoder ~7.7 GiB (6.92 + 0.80) + LTX VAE 2.28 GiB
Driver | AMD ROCm 7.2.x on Linux | ROCm 7.2
Software | ComfyUI + ComfyUI-LTXVideo + ComfyUI-GGUF + KJNodes | Python 3.12+, PyTorch (ROCm build)

Sulphur 2 is a fine-tune of Lightricks' LTX-2.3 (architecture: ltxv in the vantagewithai GGUF metadata) and inherits its Gemma 3 12B text encoder. The RX 7800 XT (gfx1101, Navi 32) is an officially ROCm-supported card per the ROCm install-on-Linux system-requirements matrix, which lists the RX 7800 XT / 7700 XT / 7700 as gfx1101 — distinct from the RX 7900 XTX (gfx1100) and the RX 7600 (gfx1102). Because it is officially supported, no HSA_OVERRIDE_GFX_VERSION masquerade is required. The CUDA/cu12x instructions that appear in NVIDIA Sulphur-2 guides do not apply on this card — on the 7800 XT you install the ROCm PyTorch wheel instead (Step 2), and PyTorch's HIP runtime presents as the cuda device namespace.

Why not the full BF16 or fp8mixed dev weights on a 16 GB card? The upstream sulphur_dev_bf16.safetensors is 42.97 GiB and sulphur_dev_fp8mixed.safetensors is 27.16 GiB (upstream tree). Both far exceed 16 GB resident, before the Gemma 3 12B encoder, LoRA, VAE, and activations even enter VRAM. And on RDNA3 the fp8mixed file gives no memory win (with no FP8 hardware, it upcasts to BF16). So you run a low-bit GGUF quant of the dev weights, and at 16 GB you lead with the Q4_K_S tier, exactly the squeeze the 16 GB NVIDIA recipes use. The 24 GB Radeon sibling can step up to a near-lossless Q6_K/Q8_0; here Q4_K_S is the binding fit.

Installation

1. Install ComfyUI

Per the ComfyUI README:

git clone https://github.com/comfyanonymous/ComfyUI.git
cd ComfyUI
python3 -m venv .venv
source .venv/bin/activate

2. Install PyTorch for ROCm

The RX 7800 XT (gfx1101) is an officially ROCm-supported GPU, so it uses the stable ROCm PyTorch wheel — not a CUDA wheel. Per the ComfyUI README "AMD GPUs (Linux)" section:

pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm7.2

ℹ️ Verify the ROCm tag before you copy it. As of this writing the ComfyUI README pins rocm7.2 as the stable wheel — but the rocmX.Y tag moves over time (6.3 → 6.4 → 7.x). Read the current line in the live ComfyUI README before running. Confirm you got the ROCm build: python -c "import torch; print(torch.__version__)" should print a +rocm7.2-style suffix, and torch.cuda.is_available() returns True (ROCm masquerades as the cuda device namespace under HIP).
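The check described in the note can be scripted. The following is a minimal sketch, assuming the version string follows the +rocmX.Y pattern the note describes; the helper names is_rocm_build and describe_install are hypothetical, and torch is imported lazily so the script still runs (and says so) when PyTorch is missing.

```python
"""Sanity-check that the installed PyTorch wheel is a ROCm build."""


def is_rocm_build(version: str) -> bool:
    """Return True when a torch version string carries a +rocm local tag."""
    _, _, local_tag = version.partition("+")
    return local_tag.startswith("rocm")


def describe_install() -> str:
    """Summarise the local PyTorch install in one line."""
    try:
        import torch  # imported lazily so the script works without torch
    except ImportError:
        return "torch is not installed in this environment"
    return (
        f"torch {torch.__version__} | "
        f"rocm wheel: {is_rocm_build(torch.__version__)} | "
        f"HIP runtime: {torch.version.hip} | "
        f"device visible: {torch.cuda.is_available()}"
    )


if __name__ == "__main__":
    print(describe_install())
```

On a CUDA wheel, torch.version.hip is None and the version tag starts with cu rather than rocm, which is the signal to reinstall from the ROCm index.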

3. Install ComfyUI dependencies and the LTX-Video, GGUF, and KJNodes custom nodes

# core deps
pip install -r requirements.txt

cd custom_nodes

# Official Lightricks ComfyUI nodes — provide the LTXV-prefixed Sulphur-2 graph nodes
git clone https://github.com/Lightricks/ComfyUI-LTXVideo.git
pip install -r ComfyUI-LTXVideo/requirements.txt

# city96's GGUF loader — required for the quantized dev transformer
git clone https://github.com/city96/ComfyUI-GGUF.git
pip install -r ComfyUI-GGUF/requirements.txt

# Kijai's KJNodes — used by the recommended offload workflows
git clone https://github.com/kijai/ComfyUI-KJNodes.git
pip install -r ComfyUI-KJNodes/requirements.txt
cd ..

The canonical Sulphur-2 workflow uses LTXV-prefixed nodes (LTXVConcatAVLatent, LTXVCropGuides, LTXVPreprocess, LTXVScheduler) alongside ComfyUI's core SamplerCustomAdvanced. The LTXV nodes are provided by ComfyUI-LTXVideo, confirmed by inspecting workflows/ltx23_t2v distilled.json on the upstream repo. These are the same Lightricks nodes that drive the LTX-2.3 base on this card: nothing Sulphur-specific or ROCm-incompatible.

4. Download the Q4_K_S dev GGUF transformer (lead: 12.29 GiB)

On 16 GB you are memory-bound, and with no FP8 hardware on RDNA3 the integer GGUF quant is the squeeze. Q4_K_S (12.29 GiB) is the recommended lead — it leaves ~1–2 GiB for the distill LoRA, the offloaded encoder's working set, and VAE-decode activations. Lower tiers (Q3_K_M, Q3_K_S) trade quality for more headroom; Q5 and above will not fit alongside the rest of the stack.

# Q4_K_S dev GGUF (12.29 GiB on disk), recommended lead for 16 GB
# Run from the directory that contains ComfyUI/ (cd .. first if you are still inside it)
huggingface-cli download vantagewithai/Sulphur-2-Base-GGUF \
  sulphur_dev-Q4_K_S.gguf \
  --local-dir ComfyUI/models/unet/

Quant-tier file-size reference (precise GiB from the vantagewithai/Sulphur-2-Base-GGUF tree API; the GGUF metadata is architecture: ltxv). Every tier is a quant of the non-distilled dev weights (sulphur_dev-*) — the distill LoRA in Step 5 is required on top of any of them:

Quant | File size (GiB) | GB-decimal | Fits 16 GB GPU?
Q3_K_S | 9.63 | 10.34 | yes (headroom for LoRA + encoder + activations)
Q3_K_M | 10.37 | 11.13 | yes (more headroom than Q4_K_S)
Q4_K_S | 12.29 | 13.20 | yes, recommended lead
Q4_K_M | 13.31 | 14.30 | tight: possible only at low resolution / frame counts
Q5_K_S | 14.01 | 15.04 | no (no room for LoRA + encoder + activations)
Q5_K_M | 15.03 | 16.14 | no
Q6_K | 16.55 | 17.77 | no (weights alone exceed VRAM)
Q8_0 | 21.19 | 22.76 | no

The 16 GB ceiling reflects a closely-related community datapoint: a 16 GB ComfyUI user running the architecturally-related LTX-2 19B distilled stack recorded a sampling-stage peak of 14926 MiB (~14.6 GiB) in Comfy-Org/ComfyUI#11726 (their OOM occurred later, at the 2x-upscale second-sampler stage at 1080p / >200 frames — not at modest settings). That datapoint puts the peak across DiT + LoRA + encoder + VAE on a 16 GB card in the 13.5–15 GiB band at conservative settings, leaving only 1–2 GiB of headroom. Q4_K_M (13.31 GiB) is feasible only at very low frame counts; Q5 and above will OOM. Sulphur 2 itself is not separately measured on a 16 GB card — this is the closest cited LTX-family datapoint, and the VRAM envelope is identical on every 16 GB card (the RX 7800 XT's compute changes wall-time, not the memory ceiling).
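The arithmetic behind the tier table can be reproduced in a few lines. This is a sketch only: the sizes are copied from the table above, the card is treated as exactly 16 GiB, and the headroom figure counts weights alone, so it still has to absorb the LoRA, the encoder's working set, the VAE, and activations. The function names are hypothetical.

```python
GIB = 2**30  # bytes per GiB
VRAM_GIB = 16.0  # card capacity, treated as exactly 16 GiB (assumption)

# File sizes in GiB, copied from the quant-tier table above.
TIER_GIB = {
    "Q3_K_S": 9.63,
    "Q3_K_M": 10.37,
    "Q4_K_S": 12.29,
    "Q4_K_M": 13.31,
    "Q5_K_S": 14.01,
    "Q5_K_M": 15.03,
    "Q6_K": 16.55,
    "Q8_0": 21.19,
}


def gib_to_gb(gib: float) -> float:
    """Convert binary gibibytes to decimal gigabytes."""
    return gib * GIB / 1e9


def weights_headroom(tier: str) -> float:
    """GiB left after the transformer weights alone are resident."""
    return VRAM_GIB - TIER_GIB[tier]


if __name__ == "__main__":
    for tier, size in TIER_GIB.items():
        print(
            f"{tier:7s} {size:6.2f} GiB = {gib_to_gb(size):6.2f} GB, "
            f"weights-only headroom {weights_headroom(tier):+6.2f} GiB"
        )
```

A negative headroom (Q6_K, Q8_0) means the weights alone do not fit; a small positive one (Q5 tiers) leaves too little for the rest of the stack described above.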

5. Download the distill LoRA (REQUIRED for the 8-step path)

The dev GGUF is not distilled. To get Sulphur 2's fast short-step (8-step / CFG=1) behavior, you must apply the upstream distill LoRA on top of it — and keep it wired into the graph:

huggingface-cli download SulphurAI/Sulphur-2-base \
  distill_loras/ltx-2.3-22b-distilled-lora-1.1_fro90_ceil72_condsafe.safetensors \
  --local-dir ComfyUI/models/loras/

This is the 662 MB in-workflow distill LoRA from distill_loras/ — the one the canonical ltx23_t2v distilled.json wires in and that the community uses to fix the dev model's corrupted base output. In Discussion #14 ("distilled works fine, base doesn't"), community users confirm the dev/base weights produce garbled output without it: one writes "For base model you always need to add DIstilled Lora" and another that "This is the base model. That's why it requires https://huggingface.co/SulphurAI/Sulphur-2-base/tree/main/distill_loras to output a correct image" (both author.type: user, community reports — not the model author). At 662 MB the LoRA is under 1 GiB, so it fits the 16 GB stack comfortably on top of the Q4_K_S DiT. Per the SulphurAI README, the two valid configs are mutually exclusive — "just use the lora or use the full models, don't use both at the same time" — i.e. EITHER (dev weights + this LoRA, the GGUF path here) OR a full distilled checkpoint, never both. Do NOT substitute the repo's heavier sulphur_lora_rank_768.safetensors (9.56 GiB): it is a 24 GB-class higher-rank alternative and does not fit this 16 GB budget. LoRA application is pure PyTorch and works identically on the ROCm/SDPA path — no custom op, so the distill LoRA graph-wires on the 7800 XT exactly as on NVIDIA.
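The either/or rule from the README can be written down as a small check. This is an illustrative sketch, not part of ComfyUI or the Sulphur repo: check_stack is a hypothetical helper, it classifies files purely by the sulphur_dev / sulphur_distil filename prefixes used in this recipe, and it treats any LoRA filename as the distill LoRA without inspecting it.

```python
from typing import Optional


def check_stack(model_file: str, lora_file: Optional[str]) -> str:
    """Classify a model + LoRA pairing against the README's either/or rule."""
    is_dev = "sulphur_dev" in model_file
    is_distilled = "sulphur_distil" in model_file
    has_lora = bool(lora_file)

    if is_dev and has_lora:
        return "ok: dev weights + distill LoRA"
    if is_distilled and not has_lora:
        return "ok: full distilled checkpoint"
    if is_dev:
        return "error: dev weights need the distill LoRA for 8 steps / CFG=1"
    if is_distilled:
        return "error: do not stack the distill LoRA on a distilled checkpoint"
    return "unknown: filename matches neither sulphur_dev nor sulphur_distil"


if __name__ == "__main__":
    lora = "ltx-2.3-22b-distilled-lora-1.1_fro90_ceil72_condsafe.safetensors"
    print(check_stack("sulphur_dev-Q4_K_S.gguf", lora))
    print(check_stack("sulphur_dev-Q4_K_S.gguf", None))
```

The first call corresponds to this recipe's path; the second is the misconfiguration that produces degraded output at the short-step schedule.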

6. Download the quantized Gemma 3 12B text encoder

Sulphur 2 inherits LTX-2.3's Gemma 3 12B text encoder. The full unquantized Gemma 3 12B will OOM on 16 GB cards when loaded alongside the Sulphur 2 weights — a community feature request on the LTX-2 repo documents this directly, reporting that LTX-2 with Gemma 3 12B "needs ~24-27GB VRAM to operate" and is therefore unusable on "consumer GPUs with 16GB VRAM (RTX 5080, RTX 4080, etc.)", with a measured Peak Usage: 29068 MiB on a 16 GB card running the LTX-2 19B-dev stack (Lightricks/ComfyUI-LTXVideo#303, opened by community user Jackson3195, author_association: NONE — a feature request, not official guidance). The same OOM applies on this 16 GB Radeon card. Use the QAT-Q4 GGUF encoder instead, and keep it off the GPU:

huggingface-cli download unsloth/gemma-3-12b-it-qat-GGUF \
  gemma-3-12b-it-qat-UD-Q4_K_XL.gguf \
  mmproj-BF16.gguf \
  --local-dir ComfyUI/models/text_encoders/

Both files are loaded by ComfyUI-GGUF's Gemma encoder node (gemma-3-12b-it-qat-UD-Q4_K_XL.gguf is 6.92 GiB, mmproj-BF16.gguf is 0.80 GiB per the unsloth tree).

7. Download the LTX video VAE (Kijai community mirror)

Sulphur 2 reuses the upstream LTX video VAE — SulphurAI/Sulphur-2-base does not expose the VAE as a standalone file (it ships only sulphur_dev_bf16, sulphur_dev_fp8mixed, sulphur_distil_bf16, the distill LoRAs (the 662 MB distill_loras/ default plus the heavier rank-768), the prompt enhancer, and workflows). The simplest path for the GGUF flow is the community mirror by Kijai, which exposes a standalone bf16 VAE — architecture: ltxv is shared across the LTX family:

huggingface-cli download Kijai/LTXV2_comfy \
  VAE/LTX2_video_vae_bf16.safetensors \
  --local-dir ComfyUI/models/vae/

File listing confirmed at Kijai/LTXV2_comfy.
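Before or after the downloads in Steps 4 to 7, it can help to confirm the disk budget. The sketch below sums the file sizes quoted in the Requirements table and compares them with free space on the current volume. Treating the 662 MB LoRA as MiB and keeping 1 GiB of margin are both assumptions, and the helper names are hypothetical.

```python
import shutil

GIB = 2**30  # bytes per GiB

# Sizes in GiB from the Requirements table. The 662 MB LoRA is treated as
# MiB here, which is an assumption about the unit used in the source.
DOWNLOADS_GIB = {
    "sulphur_dev-Q4_K_S.gguf": 12.29,
    "distill LoRA": 662 / 1024,
    "gemma-3-12b-it-qat-UD-Q4_K_XL.gguf": 6.92,
    "mmproj-BF16.gguf": 0.80,
    "LTX2_video_vae_bf16.safetensors": 2.28,
}


def total_gib() -> float:
    """Total download size in GiB."""
    return sum(DOWNLOADS_GIB.values())


def has_room(path: str = ".", margin_gib: float = 1.0) -> bool:
    """True when the volume holding `path` can take every file plus a margin."""
    free_gib = shutil.disk_usage(path).free / GIB
    return free_gib >= total_gib() + margin_gib


if __name__ == "__main__":
    print(f"need {total_gib():.2f} GiB ({total_gib() * GIB / 1e9:.2f} GB)")
    print("enough free space:", has_room("."))
```

The total comes out just under the ~25 GB listed in the prerequisites, which is where that figure comes from.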

8. Download the canonical Sulphur-2 workflow JSON

The canonical Sulphur 2 ComfyUI workflow lives on the upstream SulphurAI repo:

huggingface-cli download SulphurAI/Sulphur-2-base \
  "workflows/ltx23_t2v distilled.json" \
  --local-dir ComfyUI/user/default/workflows/

The upstream README is explicit that the shipped workflow references a checkpoint that does not exist as a published file: "By the way, I'm aware the workflows contain sulphur_final right now, just use the lora or use the full models, don't use both at the same time." You will rewire the LoRA loader to point at ltx-2.3-22b-distilled-lora-1.1_fro90_ceil72_condsafe.safetensors in the next section.

Running

This is the part that breaks on the 7800 XT if you launch ComfyUI clean. Because Sulphur 2 loads through the LTX-2.3 sampler and weight path, it hits the same documented LTX-family load-stall that RDNA3 ROCm cards exhibit: it stalls at "Requested to load LTXAV" and then OOMs because ROCm's pinned-memory, smart-memory, and async-offload paths interact badly with the large weight load. The gfx1101 7800 XT inherits this from the same RDNA3 ROCm memory-management behavior the gfx1100 7900 XTX shows. An RX 7900 XTX owner on ROCm 7.2 reports their only working configuration is to disable them — the same flags apply on this card (ComfyUI Issue #13730):

# Working launch for the LTX-2.3 family on RDNA3 / ROCm 7.2 — the load-stall fix
python main.py --listen --disable-pinned-memory --disable-async-offload --disable-dynamic-vram

That reporter states plainly: "Without this config and a clean comfy launch it always crashes with OOMs." (Issue #13730). The same issue also lists --reserve-vram 0.5, --cache-none, and --use-quad-cross-attention as part of their working set — add them if the bare three-flag launch still stalls.

The single most important flag is --disable-pinned-memory (ComfyUI cli_args.py: "Disable pinned memory use."). A second RDNA3 owner running the LTX-2 family confirms the same flag is load-bearing — without it the second generation crashes; with it a 1024×1024 run completes (Issue #11949). AMD's own ComfyUI-on-Radeon install guide likewise documents: "If running on low-memory configs, try adding the --lowvram and --disable-pinned-memory parameters to the run command." On this 16 GB card --lowvram is also worth adding — it keeps more of the stack off the GPU while the DiT samples.

If the load-stall persists, add --disable-smart-memory (cli_args.py: "Force ComfyUI to agressively offload to regular ram instead of keeping models in vram when it can."), which forces the Gemma encoder and idle weights out to system RAM — this is why 32 GB+ RAM is required:

# If the stall persists, also force RAM offload
python main.py --listen --lowvram --disable-pinned-memory --disable-async-offload --disable-dynamic-vram --disable-smart-memory
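The escalation described above can be kept in one place so each retry adds flags rather than retyping them. This is a sketch under one reading of the text: level 0 is the three-flag fix, level 1 adds the further flags from the Issue #13730 reporter, and level 2 adds the RAM-offload flags. The flags themselves are the ones quoted in this section; the level numbering and the launch_args helper are hypothetical.

```python
import shlex

BASE = ["python", "main.py", "--listen"]
STALL_FIX = [
    "--disable-pinned-memory",
    "--disable-async-offload",
    "--disable-dynamic-vram",
]
EXTRA_13730 = ["--reserve-vram", "0.5", "--cache-none", "--use-quad-cross-attention"]
RAM_OFFLOAD = ["--lowvram", "--disable-smart-memory"]


def launch_args(level: int) -> list[str]:
    """Build the ComfyUI launch command for a given escalation level (0 to 2)."""
    args = BASE + STALL_FIX
    if level >= 1:
        args += EXTRA_13730
    if level >= 2:
        args += RAM_OFFLOAD
    return args


if __name__ == "__main__":
    for level in range(3):
        print(f"level {level}: {shlex.join(launch_args(level))}")
```

Passing the resulting list to subprocess.run from the ComfyUI directory would start the server; printing it first makes it easy to see which flags a given attempt used.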

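Once running, open the browser UI (default http://127.0.0.1:8188) and load the workflow downloaded in step 8. huggingface-cli keeps the repo's workflows/ subfolder under --local-dir, so the file lands at:

ComfyUI/user/default/workflows/workflows/ltx23_t2v distilled.json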

In the loaded graph, make three swaps:

  1. Model loader. The shipped workflow loads the model via a CheckpointLoaderSimple pointed at a full LTX-2.3 dev checkpoint. Replace it with the Unet Loader (GGUF) node from ComfyUI-GGUF and point it at sulphur_dev-Q4_K_S.gguf (Step 4).
  2. Distill LoRA (do NOT remove it). The shipped graph contains LoraLoaderModelOnly nodes referencing the missing sulphur_final.safetensors. Point those at ltx-2.3-22b-distilled-lora-1.1_fro90_ceil72_condsafe.safetensors (Step 5) — this is what supplies the distillation on the dev GGUF path and makes the 8-step / CFG=1 schedule valid. Do not delete the LoRA node; without it the model runs un-distilled and produces under-baked frames.
  3. Text encoder. Point the text-encoder loader at the GGUF Gemma 3 loader from ComfyUI-GGUF (gemma-3-12b-it-qat-UD-Q4_K_XL.gguf + mmproj-BF16.gguf), and keep it CPU-offloaded with the KJNodes model-offload nodes so it stays out of VRAM while the DiT samples.
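To find the nodes that still point at the missing file before opening the graph, the workflow JSON can be scanned directly. The sketch below assumes the UI-export layout, where the file has a top-level "nodes" list and each node carries "id", "type", and "widgets_values"; nodes nested inside subgraphs are not visited. find_references and scan_file are hypothetical helper names.

```python
import json
from pathlib import Path


def find_references(workflow: dict, needle: str = "sulphur_final") -> list:
    """Return (node id, node type, value) for every widget value containing `needle`."""
    hits = []
    for node in workflow.get("nodes", []):
        values = node.get("widgets_values") or []
        if isinstance(values, dict):  # some nodes store widgets as a mapping
            values = list(values.values())
        for value in values:
            if isinstance(value, str) and needle in value:
                hits.append((node.get("id"), node.get("type"), value))
    return hits


def scan_file(path: str, needle: str = "sulphur_final") -> list:
    """Load a workflow JSON file and scan it for `needle`."""
    workflow = json.loads(Path(path).read_text(encoding="utf-8"))
    return find_references(workflow, needle)


if __name__ == "__main__":
    import sys

    if len(sys.argv) > 1:
        for node_id, node_type, value in scan_file(sys.argv[1]):
            print(f"node {node_id} ({node_type}) -> {value}")
```

Run it with the workflow path as the only argument; each printed node is one that needs repointing at the published distill LoRA.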

The shipped distilled workflow's sampler is already configured for the short-step path: an LTXVScheduler set to 8 steps, a CFGGuider at CFG = 1, and an euler_ancestral_cfg_pp / lcm sampler. That 8-step / CFG=1 profile is only correct because the distill LoRA is applied — keep it as shipped once the LoRA is wired in. The canonical workflow's frame/resolution defaults are tuned for high-VRAM cards — drop them on the 16 GB 7800 XT:

Parameter | Canonical default | Recommended on 16 GB | Source
Latent length | 97 frames (EmptyLTXVLatentVideo widget) | start at 65 | EmptyLTXVLatentVideo widget [768, 512, 97, 1] in ltx23_t2v distilled.json
Resolution (longer edge) | 1536 px | drop to 832 px for first run | ResizeImagesByLongerEdge widget [1536] in the same file

Once the workflow loads cleanly at 832 px / 65 frames (and the load clears the ROCm stall), scale up only while peak VRAM stays comfortably below 16 GiB in rocm-smi. Frame counts follow the LTX-2.3 lineage constraint: a multiple of 8 plus 1 (8k + 1, e.g. 65, 97, 121). Width and height should each be divisible by 32.
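When scaling up, it is easy to type a frame count or edge length that violates these constraints. A minimal sketch that rounds a request down to the nearest valid value follows; rounding down rather than up is a choice made here to stay on the safe side of VRAM, and both function names are hypothetical.

```python
def snap_frames(requested: int) -> int:
    """Largest frame count of the form 8*k + 1 that does not exceed `requested`."""
    if requested < 1:
        raise ValueError("frame count must be at least 1")
    return ((requested - 1) // 8) * 8 + 1


def snap_dimension(pixels: int, multiple: int = 32) -> int:
    """Largest multiple of `multiple` that does not exceed `pixels`."""
    if pixels < multiple:
        raise ValueError(f"dimension must be at least {multiple}")
    return (pixels // multiple) * multiple


if __name__ == "__main__":
    for frames in (65, 97, 100, 121):
        print(f"{frames} frames -> {snap_frames(frames)}")
    for edge in (832, 850, 1536):
        print(f"{edge} px -> {snap_dimension(edge)}")
```

The recommended starting point (832 px, 65 frames) and the canonical defaults (1536 px, 97 frames) already satisfy both rules and pass through unchanged.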

Optional: prompt enhancer

Sulphur 2 ships a Q8_0 prompt enhancer (sulphur_prompt_enhancer_model-q8_0.gguf + mmproj-BF16.gguf, under prompt_enhancer/ on the upstream tree) intended to be used via LM Studio. Per the SulphurAI README: inside your LM Studio model folder, create a Sulphur/promptenhancer/ folder, drop both files in, and load the model from LM Studio's UI. "There is no system prompt for it, just send the text (and an image) you'd like to be enhanced."

Results

  • Speed: Omitted. No RX-7800-XT Sulphur-2 benchmark has been verified, and /check/sulphur-2/rx-7800-xt currently has no benchmark data. The 7800 XT also has materially lower memory bandwidth than the 24 GB Radeon sibling (~624 GB/s vs ~960 GB/s), so a wall-time figure from any other card would mislead — none is quoted here. If you've measured Sulphur-2 timings on a 7800 XT, please contribute them so they land on /check/sulphur-2/rx-7800-xt.
  • VRAM usage: Plan on a runtime peak in the 13.5–15 GiB band with the Q4_K_S dev GGUF + distill LoRA + QAT-Q4 Gemma encoder, which sits within the 16 GiB ceiling but with very little headroom. The closest cited LTX-family datapoint is a 16 GB user running the LTX-2 19B distilled stack who recorded a sampling-stage peak of 14926 MiB (~14.6 GiB) in Comfy-Org/ComfyUI#11726 (one reading at the first-sampling stage, not a hard OOM threshold; their actual OOM was at 1080p / >200 frames during 2x upscale). Sulphur 2 itself is not separately measured on a 16 GB card. If you load the unquantized Gemma 3 12B encoder instead, the peak jumps to ~28.4 GiB (Lightricks/ComfyUI-LTXVideo#303, Peak Usage: 29068 MiB), which is why the QAT encoder in step 6 is mandatory on this card. On 16 GB the binding factor is this VRAM squeeze plus the ROCm load-stall (see Running / Troubleshooting). See /check/sulphur-2/rx-7800-xt for any community-submitted measurement.
  • Quality notes: The recommended configuration is the non-distilled dev GGUF with the distill LoRA applied — that combination delivers the 8-step / CFG=1 short-step sampling shipped in the distilled workflow JSON. Running the dev GGUF without the LoRA at those settings under-denoises and degrades output; if you ever drop the LoRA, switch to the dev model's own (longer, higher-CFG) schedule. Q3 tier and below shows noticeable quality regression; Q4_K_S is the recommended balance on 16 GB. There is no FP8/FP4 path to consider on RDNA3; the GGUF integer quant is the memory-saving route, not the upstream fp8mixed file.

For the full benchmark data and other-GPU comparisons, see /check/sulphur-2/rx-7800-xt.
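To collect a VRAM reading of your own while a generation runs, rocm-smi can be polled from a second terminal. The sketch below is hedged on two assumptions: that the installed rocm-smi accepts --showmeminfo vram together with --json, and that its JSON uses the field names "VRAM Total Memory (B)" and "VRAM Total Used Memory (B)". Check both against the output on your machine before relying on it; parse_vram and read_vram are hypothetical names.

```python
import json
import subprocess

GIB = 2**30  # bytes per GiB
TOTAL_KEY = "VRAM Total Memory (B)"  # assumed rocm-smi JSON field name
USED_KEY = "VRAM Total Used Memory (B)"  # assumed rocm-smi JSON field name


def parse_vram(report: str) -> dict:
    """Map each card name to (used GiB, total GiB) from a rocm-smi JSON report."""
    usage = {}
    for card, fields in json.loads(report).items():
        if not isinstance(fields, dict):
            continue
        if USED_KEY not in fields or TOTAL_KEY not in fields:
            continue
        usage[card] = (int(fields[USED_KEY]) / GIB, int(fields[TOTAL_KEY]) / GIB)
    return usage


def read_vram() -> dict:
    """Run rocm-smi once and return the parsed per-card VRAM usage."""
    result = subprocess.run(
        ["rocm-smi", "--showmeminfo", "vram", "--json"],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_vram(result.stdout)


if __name__ == "__main__":
    try:
        for card, (used, total) in read_vram().items():
            print(f"{card}: {used:.2f} / {total:.2f} GiB")
    except (FileNotFoundError, subprocess.CalledProcessError) as err:
        print(f"could not read rocm-smi: {err}")
```

Calling read_vram in a loop during sampling and keeping the maximum gives a peak figure comparable to the MiB readings cited above (divide MiB by 1024 to get GiB).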

Troubleshooting

Sulphur 2 stalls at "Requested to load LTXAV" and then OOMs (the headline RDNA3 ROCm trap)

This is the dominant RDNA3 ROCm failure mode for the LTX-2 family and it is not purely about running out of 16 GB — it is ROCm's memory manager mishandling the large LTX-family weight load. Because Sulphur 2 loads through the LTX-2.3 sampler, it hits the exact stall an RX 7900 XTX / ROCm 7.2 owner reports for LTX-2.3 (the same RDNA3 ROCm path the 7800 XT runs): the model "stalls during Requested to load LTXAV" and a clean launch "always crashes with OOMs" (Issue #13730). Fixes, in order:

  1. Launch with --disable-pinned-memory --disable-async-offload --disable-dynamic-vram — the exact working config from the Issue #13730 reporter. This alone clears the stall for most users. If it doesn't, add their further flags --reserve-vram 0.5 --cache-none --use-quad-cross-attention.
  2. Add --lowvram and --disable-smart-memory to force idle weights and the Gemma encoder out to system RAM (cli_args.py: "Force ComfyUI to agressively offload to regular ram..."). Needs ample RAM (32 GB+). --lowvram is especially relevant on this 16 GB card.
  3. Confirm --disable-pinned-memory is present. It is the single most-cited flag for this class of failure on RDNA3 — a second owner confirms the LTX family crashes without it and runs with it (Issue #11949), and AMD's ComfyUI-Radeon guide recommends it for low-memory configs.

"Can I run this at full bf16 or fp8mixed on 16 GB?" — No

The upstream sulphur_dev_bf16.safetensors is 42.97 GiB and sulphur_dev_fp8mixed.safetensors is 27.16 GiB (upstream tree). Both weights alone exceed 16 GiB by a wide margin before the LoRA, encoder, VAE, and activations enter VRAM. And unlike a 16 GB NVIDIA card, the 7800 XT cannot fall back to FP8 to shrink the footprint — RDNA3 has no FP8 hardware, so the fp8mixed file upcasts to BF16 with no memory win. The Q4_K_S dev GGUF in step 4 (plus the distill LoRA in step 5) is the only path that runs on this card.

Output looks under-baked / noisy at 8 steps

If you used the dev GGUF but forgot to wire the distill LoRA, the 8-step / CFG=1 schedule runs against an un-distilled model and produces under-denoised, low-quality frames. Confirm the LoraLoaderModelOnly node is pointed at ltx-2.3-22b-distilled-lora-1.1_fro90_ceil72_condsafe.safetensors (Step 5) and is connected to the GGUF model loader — the dev GGUF is not distilled on its own. This is the most common Sulphur-2 mistake; it is independent of GPU vendor, and it is exactly the failure the community reports in Discussion #14.

"sulphur_final" referenced in the workflow but missing locally

The upstream workflow JSON contains a sulphur_final.safetensors LoRA reference that does not exist as a published file. Per the SulphurAI README: "just use the lora or use the full models, don't use both at the same time." On the GGUF path, point that LoraLoaderModelOnly node at ltx-2.3-22b-distilled-lora-1.1_fro90_ceil72_condsafe.safetensors (the published distill LoRA) instead.

"Torch not compiled with CUDA enabled"

A CUDA build of PyTorch got installed instead of the ROCm build. Per the ComfyUI README troubleshooting note, uninstall and reinstall against the ROCm wheel index:

pip uninstall torch
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm7.2

torch.cuda.is_available() should return True even on AMD — ROCm presents under the cuda device namespace via HIP.

Do not install xformers or FlashAttention on this card, and do not use the fp8mixed weights

HF and ComfyUI guides written for NVIDIA frequently suggest pip install xformers, a FlashAttention wheel, or the upstream sulphur_dev_fp8mixed.safetensors. On RDNA3 these are all the wrong path: the ROCm xformers fork is limited (no FP32, head-dim ≤ 256), consumer-card CK FlashAttention builds routinely fail on RDNA3, and RDNA3 has no FP8 hardware, so the fp8mixed checkpoint upcasts to BF16/FP16 with no memory win and no compute acceleration. ComfyUI already routes attention through PyTorch SDPA on this stack, so stick with the default and use the GGUF integer quants for the memory savings.

Gemma GGUF loader fails or outputs gibberish

The Gemma 3 GGUF loader in ComfyUI-GGUF needed recent loader fixes merged for the LTX-2 family path (Sulphur-2 inherits this requirement via the LTX-2.3 lineage it builds on); pull the latest city96/ComfyUI-GGUF main so the Gemma encoder loads correctly — see the loader-compatibility notes on Kijai/LTXV2_comfy discussion #7.

Encoder is slow on CPU

Forcing the Gemma 3 12B encoder to RAM (via --lowvram / --disable-smart-memory) makes the text-encode pass slow. The Lightricks node ships an "LTXV Audio Text Encoder Loader" that a community user reports loads Gemma "8x times faster then normal loader" (sic) (ComfyUI-LTXVideo Issue #303 comment). Swap the default Gemma loader for it where available. Also keep the encoder offloaded with the KJNodes model-offload nodes so it stays out of VRAM while the DiT samples — essential on a 16 GB card. Empirical RX 7800 XT wall-time numbers will appear at /check/sulphur-2/rx-7800-xt once a community benchmark is contributed.

common questions
How much VRAM does Sulphur 2 need?

About 16 GB — the minimum this recipe targets.

Which GPUs is Sulphur 2 tested on?

RX 7800 XT (16 GB).

How hard is this setup?

Advanced — follow the steps above.