What You'll Build
Generate uncensored text-to-video and image-to-video clips locally with Sulphur 2 — an LTX-2.3 fine-tune from SulphurAI — on a 16 GB Radeon RX 7800 XT (RDNA3, Navi 32, gfx1101) through the ROCm stack. Sulphur 2 is built directly on Lightricks' LTX-2.3 audio-video DiT (the upstream card states the model "supporting both t2v and i2v natively, as well as all of the other ltx 2.3 formats", SulphurAI/Sulphur-2-base), and the community GGUF carries the ltxv architecture tags (vantagewithai/Sulphur-2-Base-GGUF). Because Sulphur 2 is LTX-2.3 under the hood, it runs through the exact same ROCm path that LTX-2.3 itself runs through on this card — including the same documented ROCm load-stall and the same fix.
This recipe runs the community Q4_K_S dev GGUF (12.29 GiB) from vantagewithai/Sulphur-2-Base-GGUF — a quantization of the non-distilled dev weights — with the upstream distill LoRA graph-wired on top for the fast short-step schedule, plus a quantized Gemma 3 12B QAT encoder offloaded to system RAM. On 16 GB the binding constraint is VRAM, and unlike the 24 GB Radeon sibling there is no FP8 escape hatch on RDNA3 — so the integer GGUF quant is the only memory-saving route, and you lead with a Q4 tier exactly as the 16 GB NVIDIA recipes do.
Hardware data: RX 7800 XT (16 GB VRAM, gfx1101) · Q4_K_S dev GGUF + REQUIRED distill LoRA + Gemma 3 12B QAT-Q4 encoder · ComfyUI on ROCm 7.2 · See benchmark data
⚠️ This is a ROCm recipe, not CUDA, and on RDNA3 there is no FP8. The RX 7800 XT runs on AMD's ROCm/HIP stack: there is no cu124/cu128 wheel, no xformers install, and no FP8/FP4 path here. RDNA3's WMMA units accept FP16, BF16, INT8, and INT4 only (there is no FP8 hardware), so the upstream sulphur_dev_fp8mixed.safetensors checkpoint would just upcast to FP16/BF16 on this card with no memory saving and no compute acceleration. That matters more on 16 GB than on the 24 GB sibling: the FP8-to-squeeze move that 16 GB NVIDIA cards can fall back on does not exist here, so the integer GGUF quant is the squeeze. The attention path is PyTorch SDPA (ComfyUI's default), not FlashAttention-2 and not xformers. If a guide tells you to pip install xformers, build a flash-attn wheel, or pick a cu12x wheel for this card, it's written for the wrong vendor.
⚠️ The GGUF is the dev model: you MUST graph-wire the distill LoRA. Every tier published by vantagewithai is named sulphur_dev-*.gguf: it is a quantization of the non-distilled dev weights, not of the distilled checkpoint. Upstream ships sulphur_dev_bf16 AND sulphur_distil_bf16 as two separate 42.97 GiB checkpoints (upstream tree); the GGUF is a quant of the dev one. Per the SulphurAI README, the fast short-step path comes from applying the distill LoRA on top of the dev weights: "I recommend downloading either of the dev versions, (fp8mixed or bf16) and downloading the distill lora provided." On this recipe's GGUF path, the dev GGUF replaces the dev safetensors, and the distill LoRA (ltx-2.3-22b-distilled-lora-1.1_fro90_ceil72_condsafe.safetensors) is still required and must stay wired into the graph. Without it, the 8-step / CFG=1 schedule below runs against an un-distilled model and produces degraded, under-denoised output. The README's note "just use the lora or use the full models, don't use both at the same time" means: EITHER dev weights + distill LoRA (this recipe's path) OR a full distilled model (sulphur_distil_bf16.safetensors). Never stack the LoRA on an already-distilled full model, and never try to load sulphur_final, which the shipped workflow references but which does not exist as a published file.
⚠️ Tight on 16 GB. Even the Q4_K_S dev GGUF + distill LoRA + QAT encoder + VAE stack lives within ~1–2 GiB of the 16 GiB ceiling. Start with low frame counts and resolution and scale up carefully. And because Sulphur 2 loads through the LTX-2.3 sampler, it also hits the documented RDNA3 ROCm load-stall (see Running) — that is a memory-management failure on top of the VRAM squeeze, not the same thing as running out of VRAM.
ℹ️ License: LTX-2 Community License Agreement. The upstream SulphurAI/Sulphur-2-base repo ships a LICENSE.txt containing the LTX-2 Community License Agreement; only the Hugging Face model card's YAML license: metadata field is absent, not the license itself. This matches the community GGUF repo vantagewithai/Sulphur-2-Base-GGUF, which tags ltx-2-community-license-agreement, and is inherited from the LTX-2.3 base it derives from. Read the LTX-2 community license terms before any commercial use.
Requirements
| Component | Minimum | Tested |
|---|---|---|
| GPU | 16 GB VRAM, ROCm-supported AMD card | RX 7800 XT (16 GB, RDNA3 / gfx1101) |
| RAM | 32 GB | 32 GB+ (Gemma encoder + ROCm RAM offload) |
| Storage | ~25 GB | Q4_K_S dev GGUF 12.29 GiB + distill LoRA 662 MB + Gemma 3 QAT encoder ~7.7 GiB (6.92 + 0.80) + LTX VAE 2.28 GiB |
| Driver | AMD ROCm 7.2.x on Linux | ROCm 7.2 |
| Software | ComfyUI + ComfyUI-LTXVideo + ComfyUI-GGUF + KJNodes | Python 3.12+, PyTorch (ROCm build) |
Sulphur 2 is a fine-tune of Lightricks' LTX-2.3 (architecture: ltxv in the vantagewithai GGUF metadata) and inherits its Gemma 3 12B text encoder. The RX 7800 XT (gfx1101, Navi 32) is an officially ROCm-supported card per the ROCm install-on-Linux system-requirements matrix, which lists the RX 7800 XT / 7700 XT / 7700 as gfx1101 — distinct from the RX 7900 XTX (gfx1100) and the RX 7600 (gfx1102). Because it is officially supported, no HSA_OVERRIDE_GFX_VERSION masquerade is required. The CUDA/cu12x instructions that appear in NVIDIA Sulphur-2 guides do not apply on this card — on the 7800 XT you install the ROCm PyTorch wheel instead (Step 2), and PyTorch's HIP runtime presents as the cuda device namespace.
Why not the full BF16 or fp8mixed dev weights on a 16 GB card? The upstream sulphur_dev_bf16.safetensors is 42.97 GiB and sulphur_dev_fp8mixed.safetensors is 27.16 GiB (upstream tree). Both far exceed 16 GB resident, before the Gemma 3 12B encoder, LoRA, VAE, and activations even enter VRAM. And on RDNA3 the fp8mixed file gives no memory win (it upcasts to BF16; there is no FP8 hardware). So you run a low-bit GGUF quant of the dev weights, and at 16 GB you lead with the Q4_K_S tier, exactly the squeeze the 16 GB NVIDIA recipes use. The difference from the 24 GB Radeon sibling is that the sibling can step up to a near-lossless Q6_K/Q8_0; here Q4_K_S is the binding fit.
Installation
1. Install ComfyUI
Per the ComfyUI README:
git clone https://github.com/comfyanonymous/ComfyUI.git
cd ComfyUI
python3 -m venv .venv
source .venv/bin/activate
2. Install PyTorch for ROCm
The RX 7800 XT (gfx1101) is an officially ROCm-supported GPU, so it uses the stable ROCm PyTorch wheel — not a CUDA wheel. Per the ComfyUI README "AMD GPUs (Linux)" section:
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm7.2
ℹ️ Verify the ROCm tag before you copy it. As of this writing the ComfyUI README pins rocm7.2 as the stable wheel, but the rocmX.Y tag moves over time (6.3 → 6.4 → 7.x). Read the current line in the live ComfyUI README before running. Confirm you got the ROCm build: python -c "import torch; print(torch.__version__)" should print a +rocm7.2-style suffix, and torch.cuda.is_available() returns True (ROCm masquerades as the cuda device namespace under HIP).
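The build check can also be scripted. The following is a minimal sketch, not something taken from the ComfyUI README; the helper name classify_torch_build is hypothetical. It relies on torch.version.hip, which is a version string on ROCm builds of PyTorch and None otherwise, and on torch.version.cuda, which is set on CUDA builds.

```python
from typing import Optional


def classify_torch_build(version: str,
                         hip_version: Optional[str],
                         cuda_version: Optional[str] = None) -> str:
    """Classify a PyTorch build as 'rocm', 'cuda', or 'cpu'.

    version      -> torch.__version__
    hip_version  -> torch.version.hip
    cuda_version -> torch.version.cuda
    """
    if hip_version is not None or "+rocm" in version:
        return "rocm"
    if cuda_version is not None or "+cu" in version:
        return "cuda"
    return "cpu"


def main() -> None:
    try:
        import torch  # imported lazily so the helper also works without PyTorch
    except ImportError:
        print("PyTorch is not installed in this environment")
        return
    build = classify_torch_build(torch.__version__,
                                 torch.version.hip,
                                 torch.version.cuda)
    print(f"torch {torch.__version__} -> {build} build")
    # On a working ROCm install this is True: HIP is exposed as the cuda device.
    print("torch.cuda.is_available():", torch.cuda.is_available())


if __name__ == "__main__":
    main()
```

If the script reports a cuda or cpu build on this card, the wrong wheel index was used and the reinstall in the Troubleshooting section applies.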
3. Install ComfyUI dependencies and the LTX-Video, GGUF, and KJNodes custom nodes
# core deps
pip install -r requirements.txt
cd custom_nodes
# Official Lightricks ComfyUI nodes — provide the LTXV-prefixed Sulphur-2 graph nodes
git clone https://github.com/Lightricks/ComfyUI-LTXVideo.git
pip install -r ComfyUI-LTXVideo/requirements.txt
# city96's GGUF loader — required for the quantized dev transformer
git clone https://github.com/city96/ComfyUI-GGUF.git
pip install -r ComfyUI-GGUF/requirements.txt
# Kijai's KJNodes — used by the recommended offload workflows
git clone https://github.com/kijai/ComfyUI-KJNodes.git
pip install -r ComfyUI-KJNodes/requirements.txt
cd ..
The canonical Sulphur-2 workflow uses LTXV-prefixed nodes (LTXVConcatAVLatent, LTXVCropGuides, LTXVPreprocess, SamplerCustomAdvanced, LTXVScheduler) — all provided by ComfyUI-LTXVideo, confirmed by inspecting workflows/ltx23_t2v distilled.json on the upstream repo. These are the same Lightricks nodes that drive the LTX-2.3 base on this card — nothing Sulphur-specific or ROCm-incompatible.
4. Download the Q4_K_S dev GGUF transformer (lead: 12.29 GiB)
On 16 GB you are memory-bound, and with no FP8 hardware on RDNA3 the integer GGUF quant is the squeeze. Q4_K_S (12.29 GiB) is the recommended lead — it leaves ~1–2 GiB for the distill LoRA, the offloaded encoder's working set, and VAE-decode activations. Lower tiers (Q3_K_M, Q3_K_S) trade quality for more headroom; Q5 and above will not fit alongside the rest of the stack.
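# Q4_K_S dev GGUF (12.29 GiB on disk), the recommended lead for 16 GB.
# The download paths in Steps 4-8 are relative to the directory that contains ComfyUI/,
# so run "cd .." first if you are still inside the repo from Step 3.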
huggingface-cli download vantagewithai/Sulphur-2-Base-GGUF \
sulphur_dev-Q4_K_S.gguf \
--local-dir ComfyUI/models/unet/
Quant-tier file-size reference (precise GiB from the vantagewithai/Sulphur-2-Base-GGUF tree API; the GGUF metadata is architecture: ltxv). Every tier is a quant of the non-distilled dev weights (sulphur_dev-*) — the distill LoRA in Step 5 is required on top of any of them:
| Quant | File size (GiB) | GB-decimal | Fits 16 GB GPU? |
|---|---|---|---|
| Q3_K_S | 9.63 | 10.34 | yes (headroom for LoRA + encoder + activations) |
| Q3_K_M | 10.37 | 11.13 | yes (more headroom than Q4_K_S) |
| Q4_K_S | 12.29 | 13.20 | yes — recommended lead |
| Q4_K_M | 13.31 | 14.30 | tight — possible only at low resolution / frame counts |
| Q5_K_S | 14.01 | 15.04 | no (no room for LoRA + encoder + activations) |
| Q5_K_M | 15.03 | 16.14 | no |
| Q6_K | 16.55 | 17.77 | no (weights alone exceed VRAM) |
| Q8_0 | 21.19 | 22.76 | no |
The 16 GB ceiling reflects a closely-related community datapoint: a 16 GB ComfyUI user running the architecturally-related LTX-2 19B distilled stack recorded a sampling-stage peak of 14926 MiB (~14.6 GiB) in Comfy-Org/ComfyUI#11726 (their OOM occurred later, at the 2x-upscale second-sampler stage at 1080p / >200 frames — not at modest settings). That datapoint puts the peak across DiT + LoRA + encoder + VAE on a 16 GB card in the 13.5–15 GiB band at conservative settings, leaving only 1–2 GiB of headroom. Q4_K_M (13.31 GiB) is feasible only at very low frame counts; Q5 and above will OOM. Sulphur 2 itself is not separately measured on a 16 GB card — this is the closest cited LTX-family datapoint, and the VRAM envelope is identical on every 16 GB card (the RX 7800 XT's compute changes wall-time, not the memory ceiling).
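The GB-decimal column and the raw headroom each tier leaves can be reproduced with a few lines of arithmetic. This is a sketch using the file sizes from the table above; treating the card's capacity as exactly 16 GiB is an assumption, and the headroom figure covers the DiT weights only, not the LoRA, encoder working set, VAE, or activations.

```python
GIB = 1024 ** 3          # bytes in one GiB
GB = 1000 ** 3           # bytes in one decimal GB
VRAM_GIB = 16.0          # assumption: the full 16 GiB is addressable

# File sizes in GiB, copied from the quant-tier table above.
TIERS_GIB = {
    "Q3_K_S": 9.63, "Q3_K_M": 10.37, "Q4_K_S": 12.29, "Q4_K_M": 13.31,
    "Q5_K_S": 14.01, "Q5_K_M": 15.03, "Q6_K": 16.55, "Q8_0": 21.19,
}


def gib_to_gb(size_gib: float) -> float:
    """Convert binary GiB to decimal GB."""
    return size_gib * GIB / GB


def weight_headroom_gib(size_gib: float, vram_gib: float = VRAM_GIB) -> float:
    """VRAM left after the DiT weights alone (negative: they do not fit)."""
    return vram_gib - size_gib


for name, size in TIERS_GIB.items():
    print(f"{name:7s} {size:6.2f} GiB = {gib_to_gb(size):6.2f} GB, "
          f"weights-only headroom {weight_headroom_gib(size):+6.2f} GiB")

# The cited community sampling-stage peak, converted from MiB to GiB.
print(f"14926 MiB = {14926 / 1024:.2f} GiB")
```

The weights-only headroom is larger than the 1–2 GiB quoted above because the rest of the stack has not been subtracted yet; the runtime band comes from the cited community peak, not from this arithmetic.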
5. Download the distill LoRA (REQUIRED for the 8-step path)
The dev GGUF is not distilled. To get Sulphur 2's fast short-step (8-step / CFG=1) behavior, you must apply the upstream distill LoRA on top of it — and keep it wired into the graph:
huggingface-cli download SulphurAI/Sulphur-2-base \
distill_loras/ltx-2.3-22b-distilled-lora-1.1_fro90_ceil72_condsafe.safetensors \
--local-dir ComfyUI/models/loras/
This is the 662 MB in-workflow distill LoRA from distill_loras/ — the one the canonical ltx23_t2v distilled.json wires in and that the community uses to fix the dev model's corrupted base output. In Discussion #14 ("distilled works fine, base doesn't"), community users confirm the dev/base weights produce garbled output without it: one writes "For base model you always need to add DIstilled Lora" and another that "This is the base model. That's why it requires https://huggingface.co/SulphurAI/Sulphur-2-base/tree/main/distill_loras to output a correct image" (both author.type: user, community reports — not the model author). At 662 MB the LoRA is under 1 GiB, so it fits the 16 GB stack comfortably on top of the Q4_K_S DiT. Per the SulphurAI README, the two valid configs are mutually exclusive — "just use the lora or use the full models, don't use both at the same time" — i.e. EITHER (dev weights + this LoRA, the GGUF path here) OR a full distilled checkpoint, never both. Do NOT substitute the repo's heavier sulphur_lora_rank_768.safetensors (9.56 GiB): it is a 24 GB-class higher-rank alternative and does not fit this 16 GB budget. LoRA application is pure PyTorch and works identically on the ROCm/SDPA path — no custom op, so the distill LoRA graph-wires on the 7800 XT exactly as on NVIDIA.
6. Download the quantized Gemma 3 12B text encoder
Sulphur 2 inherits LTX-2.3's Gemma 3 12B text encoder. The full unquantized Gemma 3 12B will OOM on 16 GB cards when loaded alongside the Sulphur 2 weights — a community feature request on the LTX-2 repo documents this directly, reporting that LTX-2 with Gemma 3 12B "needs ~24-27GB VRAM to operate" and is therefore unusable on "consumer GPUs with 16GB VRAM (RTX 5080, RTX 4080, etc.)", with a measured Peak Usage: 29068 MiB on a 16 GB card running the LTX-2 19B-dev stack (Lightricks/ComfyUI-LTXVideo#303, opened by community user Jackson3195, author_association: NONE — a feature request, not official guidance). The same OOM applies on this 16 GB Radeon card. Use the QAT-Q4 GGUF encoder instead, and keep it off the GPU:
huggingface-cli download unsloth/gemma-3-12b-it-qat-GGUF \
gemma-3-12b-it-qat-UD-Q4_K_XL.gguf \
mmproj-BF16.gguf \
--local-dir ComfyUI/models/text_encoders/
Both files are loaded by ComfyUI-GGUF's Gemma encoder node (gemma-3-12b-it-qat-UD-Q4_K_XL.gguf is 6.92 GiB, mmproj-BF16.gguf is 0.80 GiB per the unsloth tree).
7. Download the LTX video VAE (Kijai community mirror)
Sulphur 2 reuses the upstream LTX video VAE — SulphurAI/Sulphur-2-base does not expose the VAE as a standalone file (it ships only sulphur_dev_bf16, sulphur_dev_fp8mixed, sulphur_distil_bf16, the distill LoRAs (the 662 MB distill_loras/ default plus the heavier rank-768), the prompt enhancer, and workflows). The simplest path for the GGUF flow is the community mirror by Kijai, which exposes a standalone bf16 VAE — architecture: ltxv is shared across the LTX family:
huggingface-cli download Kijai/LTXV2_comfy \
VAE/LTX2_video_vae_bf16.safetensors \
--local-dir ComfyUI/models/vae/
File listing confirmed at Kijai/LTXV2_comfy.
8. Download the canonical Sulphur-2 workflow JSON
The canonical Sulphur 2 ComfyUI workflow lives on the upstream SulphurAI repo:
huggingface-cli download SulphurAI/Sulphur-2-base \
"workflows/ltx23_t2v distilled.json" \
--local-dir ComfyUI/user/default/workflows/
The upstream README is explicit that the shipped workflow references a checkpoint that does not exist as a published file: "By the way, I'm aware the workflows contain sulphur_final right now, just use the lora or use the full models, don't use both at the same time." You will rewire the LoRA loader to point at ltx-2.3-22b-distilled-lora-1.1_fro90_ceil72_condsafe.safetensors in the next section.
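Before launching, it can help to confirm that every download from Steps 4-8 landed somewhere under the ComfyUI tree. A minimal sketch follows; the helper name find_missing is hypothetical, and the search is recursive on the assumption that huggingface-cli keeps repo subfolders (such as distill_loras/ or VAE/) beneath --local-dir.

```python
from pathlib import Path

# File names taken from Steps 4-8 of this recipe.
EXPECTED_FILES = [
    "sulphur_dev-Q4_K_S.gguf",
    "ltx-2.3-22b-distilled-lora-1.1_fro90_ceil72_condsafe.safetensors",
    "gemma-3-12b-it-qat-UD-Q4_K_XL.gguf",
    "mmproj-BF16.gguf",
    "LTX2_video_vae_bf16.safetensors",
    "ltx23_t2v distilled.json",
]


def find_missing(root: Path, names: list[str] = EXPECTED_FILES) -> list[str]:
    """Return the expected file names not found anywhere below `root`."""
    if not root.is_dir():
        return list(names)
    present = {p.name for p in root.rglob("*") if p.is_file()}
    return [name for name in names if name not in present]


if __name__ == "__main__":
    comfy_root = Path("ComfyUI")  # assumption: run from the parent directory
    missing = find_missing(comfy_root)
    if missing:
        print("Missing files:")
        for name in missing:
            print("  -", name)
    else:
        print("All expected files found under", comfy_root)
```

The check matches on file name only, so it does not verify sizes or that a file sits in the model folder its loader node scans.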
Running
This is the part that breaks on the 7800 XT if you launch ComfyUI clean. Because Sulphur 2 loads through the LTX-2.3 sampler and weight path, it hits the same documented LTX-family load-stall that RDNA3 ROCm cards exhibit: it stalls at "Requested to load LTXAV" and then OOMs because ROCm's pinned-memory, smart-memory, and async-offload paths interact badly with the large weight load. The gfx1101 7800 XT inherits this from the same RDNA3 ROCm memory-management behavior the gfx1100 7900 XTX shows. An RX 7900 XTX owner on ROCm 7.2 reports their only working configuration is to disable them — the same flags apply on this card (ComfyUI Issue #13730):
# Working launch for the LTX-2.3 family on RDNA3 / ROCm 7.2 (the load-stall fix).
# Run from inside ComfyUI/ with the venv active.
python main.py --listen --disable-pinned-memory --disable-async-offload --disable-dynamic-vram
That reporter states plainly: "Without this config and a clean comfy launch it always crashes with OOMs." (Issue #13730). The same issue also lists --reserve-vram 0.5, --cache-none, and --use-quad-cross-attention as part of their working set — add them if the bare three-flag launch still stalls.
The single most important flag is --disable-pinned-memory (ComfyUI cli_args.py: "Disable pinned memory use."). A second RDNA3 owner running the LTX-2 family confirms the same flag is load-bearing — without it the second generation crashes; with it a 1024×1024 run completes (Issue #11949). AMD's own ComfyUI-on-Radeon install guide likewise documents: "If running on low-memory configs, try adding the --lowvram and --disable-pinned-memory parameters to the run command." On this 16 GB card --lowvram is also worth adding — it keeps more of the stack off the GPU while the DiT samples.
If the load-stall persists, add --disable-smart-memory (cli_args.py: "Force ComfyUI to agressively offload to regular ram instead of keeping models in vram when it can."), which forces the Gemma encoder and idle weights out to system RAM — this is why 32 GB+ RAM is required:
# If the stall persists, also force RAM offload
python main.py --listen --lowvram --disable-pinned-memory --disable-async-offload --disable-dynamic-vram --disable-smart-memory
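The escalation described in this section (base flags, then RAM offload, then the reporter's extra flags) can be captured in a small launcher script. This is a sketch, not a ComfyUI feature: the function name build_launch_args and its parameters are hypothetical, while every flag is one quoted in this section.

```python
import shlex
import sys

# Flags quoted in this section, grouped by escalation level.
BASE_FLAGS = ["--disable-pinned-memory", "--disable-async-offload",
              "--disable-dynamic-vram"]
RAM_OFFLOAD_FLAGS = ["--lowvram", "--disable-smart-memory"]
EXTRA_FLAGS = ["--reserve-vram", "0.5", "--cache-none",
               "--use-quad-cross-attention"]


def build_launch_args(ram_offload: bool = False,
                      extras: bool = False,
                      listen: bool = True) -> list[str]:
    """Assemble the argv for `python main.py` at the chosen escalation level."""
    args = [sys.executable, "main.py"]
    if listen:
        args.append("--listen")
    args += BASE_FLAGS
    if ram_offload:
        args += RAM_OFFLOAD_FLAGS
    if extras:
        args += EXTRA_FLAGS
    return args


if __name__ == "__main__":
    # Print the command instead of running it, so it can be inspected first.
    print(shlex.join(build_launch_args(ram_offload=True)))
```

To start ComfyUI directly, the returned list can be passed to subprocess.run with cwd set to the ComfyUI checkout; the flag order differs from the hand-written command above, which should not matter to the argument parser.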
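Once running, open the browser UI (default http://127.0.0.1:8188) and load the workflow downloaded in step 8. Because huggingface-cli keeps the repository's workflows/ subfolder under --local-dir, the file lands one level deeper than the target folder:
ComfyUI/user/default/workflows/workflows/ltx23_t2v distilled.json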
In the loaded graph, make three swaps:
- Model loader. The shipped workflow loads the model via a CheckpointLoaderSimple pointed at a full LTX-2.3 dev checkpoint. Replace it with the Unet Loader (GGUF) node from ComfyUI-GGUF and point it at sulphur_dev-Q4_K_S.gguf (Step 4).
- Distill LoRA (do NOT remove it). The shipped graph contains LoraLoaderModelOnly nodes referencing the missing sulphur_final.safetensors. Point those at ltx-2.3-22b-distilled-lora-1.1_fro90_ceil72_condsafe.safetensors (Step 5); this is what supplies the distillation on the dev GGUF path and makes the 8-step / CFG=1 schedule valid. Do not delete the LoRA node; without it the model runs un-distilled and produces under-baked frames.
- Text encoder. Point the text-encoder loader at the GGUF Gemma 3 loader from ComfyUI-GGUF (gemma-3-12b-it-qat-UD-Q4_K_XL.gguf + mmproj-BF16.gguf), and keep it CPU-offloaded with the KJNodes model-offload nodes so it stays out of VRAM while the DiT samples.
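To locate the stale references before editing the graph by hand, the workflow JSON can be scanned for any node whose widget values mention sulphur_final. This is a sketch under an assumption: that the file uses ComfyUI's saved-workflow layout, with a top-level nodes list whose entries carry id, type, and widgets_values fields. The helper name find_stale_nodes is hypothetical.

```python
import json
from pathlib import Path


def find_stale_nodes(workflow: dict, needle: str = "sulphur_final") -> list[tuple]:
    """Return (node id, node type, value) for widget values containing `needle`."""
    hits = []
    for node in workflow.get("nodes", []):
        values = node.get("widgets_values") or []
        if isinstance(values, dict):  # some custom nodes store widgets as a mapping
            values = list(values.values())
        for value in values:
            if isinstance(value, str) and needle in value:
                hits.append((node.get("id"), node.get("type"), value))
    return hits


if __name__ == "__main__":
    # Assumption: adjust this path to wherever the Step 8 download landed.
    path = Path("ComfyUI/user/default/workflows/ltx23_t2v distilled.json")
    if path.is_file():
        data = json.loads(path.read_text(encoding="utf-8"))
        for node_id, node_type, value in find_stale_nodes(data):
            print(f"node {node_id} ({node_type}): {value}")
    else:
        print("Workflow file not found at", path)
```

The scan only reads the file. It covers top-level nodes, so nodes nested inside grouped or sub-graph definitions, if the workflow uses any, would need a separate pass.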
The shipped distilled workflow's sampler is already configured for the short-step path: an LTXVScheduler set to 8 steps, a CFGGuider at CFG = 1, and an euler_ancestral_cfg_pp / lcm sampler. That 8-step / CFG=1 profile is only correct because the distill LoRA is applied — keep it as shipped once the LoRA is wired in. The canonical workflow's frame/resolution defaults are tuned for high-VRAM cards — drop them on the 16 GB 7800 XT:
| Parameter | Canonical default | Recommended on 16 GB | Source |
|---|---|---|---|
| Latent length | 97 frames (EmptyLTXVLatentVideo widget) | start at 65 | EmptyLTXVLatentVideo widget [768, 512, 97, 1] in ltx23_t2v distilled.json |
| Resolution (longer edge) | 1536 px | drop to 832 px for first run | ResizeImagesByLongerEdge widget [1536] in the same file |
Once the workflow loads cleanly at 832 px / 65 frames (and the load clears the ROCm stall), scale up only while peak VRAM stays comfortably below 16 GiB in rocm-smi. Frame counts follow the LTX-2.3 lineage constraint (a multiple of 8 plus 1, i.e. 8k + 1, e.g. 65, 97, 121), and width/height should be divisible by 32.
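The frame-count and resolution constraints are easy to get wrong when scaling up in small steps. A minimal sketch of the arithmetic follows; the helper names are hypothetical, and rounding down (rather than to the nearest valid value) is a choice made here to stay on the safe side of the VRAM budget.

```python
def is_valid_frame_count(frames: int) -> bool:
    """True when `frames` has the form 8k + 1 (1, 9, ..., 65, 97, 121, ...)."""
    return frames >= 1 and (frames - 1) % 8 == 0


def snap_frame_count(frames: int) -> int:
    """Round down to the nearest 8k + 1 value, with a floor of 1."""
    return max(1, ((frames - 1) // 8) * 8 + 1)


def snap_dimension(pixels: int, multiple: int = 32) -> int:
    """Round down to the nearest multiple of `multiple`, with a floor of one multiple."""
    return max(multiple, (pixels // multiple) * multiple)


if __name__ == "__main__":
    for requested in (65, 70, 97, 100, 121):
        print(f"frames {requested:3d} -> {snap_frame_count(requested):3d} "
              f"(valid as given: {is_valid_frame_count(requested)})")
    for requested in (832, 850, 1536):
        print(f"edge {requested:4d} px -> {snap_dimension(requested):4d} px")
```

Snapping a requested value before typing it into the EmptyLTXVLatentVideo or resize widgets avoids a failed run on an invalid shape.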
Optional: prompt enhancer
Sulphur 2 ships a Q8_0 prompt enhancer (sulphur_prompt_enhancer_model-q8_0.gguf + mmproj-BF16.gguf, under prompt_enhancer/ on the upstream tree) intended to be used via LM Studio. Per the SulphurAI README: inside your LM Studio model folder, create a Sulphur/promptenhancer/ folder, drop both files in, and load the model from LM Studio's UI. "There is no system prompt for it, just send the text (and an image) you'd like to be enhanced."
Results
- Speed: Omitted. No RX-7800-XT Sulphur-2 benchmark has been verified, and /check/sulphur-2/rx-7800-xt currently has no benchmark data. The 7800 XT also has materially lower memory bandwidth than the 24 GB Radeon sibling (~624 GB/s vs ~960 GB/s), so a wall-time figure from any other card would mislead — none is quoted here. If you've measured Sulphur-2 timings on a 7800 XT, please contribute them so they land on /check/sulphur-2/rx-7800-xt.
- VRAM usage: Plan on a runtime peak in the 13.5–15 GiB band with the Q4_K_S dev GGUF + distill LoRA + QAT-Q4 Gemma encoder: within the 16 GiB ceiling but with very little headroom. The closest cited LTX-family datapoint is a 16 GB user running the LTX-2 19B distilled stack who recorded a sampling-stage peak of 14926 MiB (~14.6 GiB) in Comfy-Org/ComfyUI#11726 (one reading at the first-sampling stage, not a hard OOM threshold; their actual OOM was at 1080p / >200 frames during 2x upscale). Sulphur 2 itself is not separately measured on a 16 GB card. If you load the unquantized Gemma 3 12B encoder instead, the peak jumps to ~28.4 GiB (Lightricks/ComfyUI-LTXVideo#303, Peak Usage: 29068 MiB), which is why the QAT encoder in step 6 is mandatory on this card. On 16 GB the binding factor is this VRAM squeeze plus the ROCm load-stall (see Running / Troubleshooting). See /check/sulphur-2/rx-7800-xt for any community-submitted measurement.
- Quality notes: The recommended configuration is the non-distilled dev GGUF with the distill LoRA applied; that combination delivers the 8-step / CFG=1 short-step sampling shipped in the distilled workflow JSON. Running the dev GGUF without the LoRA at those settings under-denoises and degrades output; if you ever drop the LoRA, switch to the dev model's own (longer, higher-CFG) schedule. Q3 tier and below shows noticeable quality regression; Q4_K_S is the recommended balance on 16 GB. There is no FP8/FP4 path to consider on RDNA3; the GGUF integer quant is the memory-saving route, not the upstream fp8mixed file.
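For a quick headroom reading between generations, PyTorch's torch.cuda.mem_get_info() returns device-wide free and total bytes, which also works on ROCm builds through the cuda namespace. The following is a sketch, with a hypothetical helper name headroom_report; the worked example assumes a card with exactly 16 GiB and reuses the 14926 MiB peak cited above.

```python
MIB = 1024 ** 2
GIB = 1024 ** 3


def headroom_report(free_bytes: int, total_bytes: int) -> str:
    """Format used, total, and free VRAM in GiB."""
    used = total_bytes - free_bytes
    return (f"used {used / GIB:.2f} GiB of {total_bytes / GIB:.2f} GiB, "
            f"headroom {free_bytes / GIB:.2f} GiB")


if __name__ == "__main__":
    try:
        import torch  # optional: only needed for the live reading
    except ImportError:
        torch = None
    if torch is not None and torch.cuda.is_available():
        free, total = torch.cuda.mem_get_info()
        print("live:   ", headroom_report(free, total))
    else:
        print("live:    no ROCm/CUDA device visible to PyTorch")
    # Worked example: the cited 14926 MiB peak on an assumed 16 GiB card.
    print("example:", headroom_report(16 * GIB - 14926 * MIB, 16 * GIB))
```

Initializing PyTorch in a second process is likely to claim some VRAM of its own, so on a card this tight it is safer to run the script between generations and rely on rocm-smi while the DiT is sampling.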
For the full benchmark data and other-GPU comparisons, see /check/sulphur-2/rx-7800-xt.
Troubleshooting
Sulphur 2 stalls at "Requested to load LTXAV" and then OOMs (the headline RDNA3 ROCm trap)
This is the dominant RDNA3 ROCm failure mode for the LTX-2 family and it is not purely about running out of 16 GB — it is ROCm's memory manager mishandling the large LTX-family weight load. Because Sulphur 2 loads through the LTX-2.3 sampler, it hits the exact stall an RX 7900 XTX / ROCm 7.2 owner reports for LTX-2.3 (the same RDNA3 ROCm path the 7800 XT runs): the model "stalls during Requested to load LTXAV" and a clean launch "always crashes with OOMs" (Issue #13730). Fixes, in order:
- Launch with --disable-pinned-memory --disable-async-offload --disable-dynamic-vram, the exact working config from the Issue #13730 reporter. This alone clears the stall for most users. If it doesn't, add their further flags --reserve-vram 0.5 --cache-none --use-quad-cross-attention.
- Add --lowvram and --disable-smart-memory to force idle weights and the Gemma encoder out to system RAM (cli_args.py: "Force ComfyUI to agressively offload to regular ram..."). Needs ample RAM (32 GB+). --lowvram is especially relevant on this 16 GB card.
- Confirm --disable-pinned-memory is present. It is the single most-cited flag for this class of failure on RDNA3: a second owner confirms the LTX family crashes without it and runs with it (Issue #11949), and AMD's ComfyUI-Radeon guide recommends it for low-memory configs.
"Can I run this at full bf16 or fp8mixed on 16 GB?" — No
The upstream sulphur_dev_bf16.safetensors is 42.97 GiB and sulphur_dev_fp8mixed.safetensors is 27.16 GiB (upstream tree). Each checkpoint's weights alone exceed 16 GiB by a wide margin before the LoRA, encoder, VAE, and activations enter VRAM. And unlike a 16 GB NVIDIA card, the 7800 XT cannot fall back to FP8 to shrink the footprint: RDNA3 has no FP8 hardware, so the fp8mixed file upcasts to BF16 with no memory win. The Q4_K_S dev GGUF in step 4 (plus the distill LoRA in step 5) is the only path that runs on this card.
Output looks under-baked / noisy at 8 steps
If you used the dev GGUF but forgot to wire the distill LoRA, the 8-step / CFG=1 schedule runs against an un-distilled model and produces under-denoised, low-quality frames. Confirm the LoraLoaderModelOnly node is pointed at ltx-2.3-22b-distilled-lora-1.1_fro90_ceil72_condsafe.safetensors (Step 5) and is connected to the GGUF model loader — the dev GGUF is not distilled on its own. This is the most common Sulphur-2 mistake; it is independent of GPU vendor, and it is exactly the failure the community reports in Discussion #14.
"sulphur_final" referenced in the workflow but missing locally
The upstream workflow JSON contains a sulphur_final.safetensors LoRA reference that does not exist as a published file. Per the SulphurAI README: "just use the lora or use the full models, don't use both at the same time." On the GGUF path, point that LoraLoaderModelOnly node at ltx-2.3-22b-distilled-lora-1.1_fro90_ceil72_condsafe.safetensors (the published distill LoRA) instead.
"Torch not compiled with CUDA enabled"
A CUDA build of PyTorch got installed instead of the ROCm build. Per the ComfyUI README troubleshooting note, uninstall and reinstall against the ROCm wheel index:
pip uninstall torch
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm7.2
torch.cuda.is_available() should return True even on AMD — ROCm presents under the cuda device namespace via HIP.
Do not install xformers or FlashAttention on this card, and do not use the fp8mixed weights
HF and ComfyUI guides written for NVIDIA frequently suggest pip install xformers, a FlashAttention wheel, or the upstream sulphur_dev_fp8mixed.safetensors. On RDNA3 these are all the wrong path: the ROCm xformers fork is limited (no FP32, head-dim ≤ 256), consumer-card CK FlashAttention builds routinely fail on RDNA3, and RDNA3 has no FP8 hardware, so the fp8mixed checkpoint upcasts to BF16/FP16 with no memory win and no acceleration. ComfyUI already routes attention through PyTorch SDPA on this stack: stick with the default, and use the GGUF integer quants for the memory savings.
Gemma GGUF loader fails or outputs gibberish
The Gemma 3 GGUF loader in ComfyUI-GGUF needed recent loader fixes merged for the LTX-2 family path (Sulphur-2 inherits this requirement via the LTX-2.3 lineage it builds on); pull the latest city96/ComfyUI-GGUF main so the Gemma encoder loads correctly — see the loader-compatibility notes on Kijai/LTXV2_comfy discussion #7.
Encoder is slow on CPU
Forcing the Gemma 3 12B encoder to RAM (via --lowvram / --disable-smart-memory) makes the text-encode pass slow. The Lightricks node ships an "LTXV Audio Text Encoder Loader" that a community user reports loads Gemma "8x times faster then normal loader" (sic) (ComfyUI-LTXVideo Issue #303 comment). Swap the default Gemma loader for it where available. Also keep the encoder offloaded with the KJNodes model-offload nodes so it stays out of VRAM while the DiT samples — essential on a 16 GB card. Empirical RX 7800 XT wall-time numbers will appear at /check/sulphur-2/rx-7800-xt once a community benchmark is contributed.