What You'll Build
A local LongCat-Image text-to-image setup running in ComfyUI on a 16 GB Radeon RX 7800 XT (RDNA3, Navi 32, gfx1101) through the ROCm stack. LongCat-Image is Meituan's bilingual (Chinese + English) 6B-parameter diffusion transformer, and this recipe runs it with a GGUF-quantized transformer (Q4_K_M) so the whole pipeline fits the 7800 XT's 16 GB. The vanilla BF16 diffusers path needs ~17–18 GB even with CPU offload, which overflows 16 GB. Because RDNA3 has no FP8 hardware, the FP8 fallback that 16 GB NVIDIA cards use to shrink the model is not available either. The GGUF route is what makes 16 GB work here.
Hardware data: RX 7800 XT (16 GB VRAM) · GGUF Q4_K_M · ComfyUI-GGUF on ROCm 7.2 · See benchmark data
⚠️ This is a ROCm recipe, not CUDA. The RX 7800 XT runs on AMD's ROCm/HIP stack: there is no cu124/cu128 wheel, no xformers install, no FlashAttention wheel, and no FP8/FP4 path here. RDNA3's WMMA units accept FP16, BF16, INT8 and INT4 only (no native FP8/FP4), so an FP8 checkpoint would just upcast to BF16 with no memory saving. That is precisely why a 16 GB NVIDIA card's "drop to FP8" trick has no AMD equivalent, and why this recipe leans on a GGUF quant instead. The attention path is PyTorch SDPA, forced explicitly with --use-pytorch-cross-attention. If a guide tells you to pip install xformers, build a FlashAttention wheel, or pick a cu12x wheel for this card, it's written for the wrong vendor.
⚠️ Why this recipe pins the base variant. Meituan publishes several siblings under the LongCat-Image brand and their fit on 16 GB differs. See "Sibling variants" below before downloading anything.
Sibling variants
| Variant | Purpose | 16 GB fit on this card |
|---|---|---|
| LongCat-Image (this recipe) | Final-release T2I, 6B params | Yes via GGUF — vantagewithai/LongCat-Image-GGUF ships per-tier ComfyUI files (Q4_K_M 3.66 GB · Q6_K 5.20 GB · Q8_0 6.71 GB · BF16 12.54 GB — sizes from the repo's comfy/ folder via the HF tree API) |
| LongCat-Image-Edit | Image-to-image editing variant | The editing pipeline loads Qwen2.5-VL and the transformer together; out of scope here |
| LongCat-Image-Edit-Turbo | Distilled few-step edit variant | Same family as Edit; out of scope here |
| LongCat-Image-Dev | Mid-training checkpoint, intended for fine-tuning, not inference | Out of scope for this recipe |
The base model is released under the Apache-2.0 license per the HF model card and the weights are not gated — no access request or login is required to download them.
Requirements
| Component | Minimum | Tested |
|---|---|---|
| GPU | 16 GB VRAM (ROCm-supported AMD card) | RX 7800 XT (16 GB) |
| RAM | 32 GB system (text encoder is CPU-offloaded between stages) | — |
| Storage | ~15 GB (Q4_K_M transformer 3.66 GB + FP8-scaled Qwen2.5-VL text encoder 9.38 GB + VAE 0.34 GB) | per HF tree |
| Driver | AMD ROCm 7.2.x on Linux | — |
| Software | ComfyUI + ComfyUI-GGUF + PyTorch (ROCm 7.2 build), Python 3.10+ | — |
LongCat-Image is a 6B diffusion transformer paired with a Qwen2.5-VL text encoder and a VAE. The full BF16 transformer is 12.54 GB and the BF16 text encoder is 16.58 GB across five shards in the canonical repo's text_encoder/ folder — together far too much to keep resident on 16 GB. Quantizing the transformer to GGUF Q4_K_M (3.66 GB) and offloading the text encoder between stages is what brings the runtime envelope down under 16 GB.
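The arithmetic behind those figures can be checked with a short sketch. It uses only the file sizes quoted in this recipe; the dictionary keys and helper name are illustrative, and the sums describe file sizes on disk, not measured VRAM use.

```python
# File sizes in GB as quoted in this recipe (on-disk sizes, not measured VRAM).
SIZES_GB = {
    "transformer_bf16": 12.54,
    "transformer_q4_k_m": 3.66,
    "text_encoder_bf16": 16.58,
    "text_encoder_fp8_scaled": 9.38,
    "vae": 0.34,
}
VRAM_GB = 16.0  # nominal capacity of the RX 7800 XT


def total(*names):
    """Sum the named components, rounded to two decimals."""
    return round(sum(SIZES_GB[name] for name in names), 2)


# Both BF16 models together: what would have to stay resident without offload.
bf16_pair = total("transformer_bf16", "text_encoder_bf16")

# What this recipe downloads: Q4_K_M transformer + FP8-scaled encoder + VAE.
download = total("transformer_q4_k_m", "text_encoder_fp8_scaled", "vae")

# Weights held during sampling once the encoder has been offloaded.
sampling_weights = total("transformer_q4_k_m", "vae")

print(f"BF16 transformer + BF16 encoder: {bf16_pair} GB (card has {VRAM_GB} GB)")
print(f"Download for this recipe:        {download} GB")
print(f"Weights resident while sampling: {sampling_weights} GB")
```

The last figure covers weights only; activations, latents and VAE decode buffers come on top of it and are not estimated here.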
Installation
1. Install ComfyUI
Per the ComfyUI README, clone the repo:
git clone https://github.com/comfyanonymous/ComfyUI.git
cd ComfyUI
2. Install PyTorch for ROCm
The RX 7800 XT (gfx1101) is an officially ROCm-supported GPU, so it uses the stable ROCm PyTorch wheel. Per the ComfyUI README "AMD GPUs (Linux)" section, the stable install command is:
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm7.2
ℹ️ Verify the ROCm tag before you copy it. As of this writing the ComfyUI README pins rocm7.2 as the stable wheel, but the rocmX.Y tag moves over time (6.3 → 6.4 → 7.x). Read the current line in the live ComfyUI README before running. The README also lists a nightly variant (https://download.pytorch.org/whl/nightly/rocm7.2) which it says "might have some performance improvements". On the officially-supported gfx1101 card you do not need the experimental per-arch RDNA3 wheel (gfx110X-all); the stable whl/rocm7.2 wheel above is the canonical path. Only if a library ships gfx1100-only kernels would you fall back to the legacy HSA_OVERRIDE_GFX_VERSION=11.0.0 masquerade; it is not required for ComfyUI on this card.
3. Install ComfyUI dependencies
Per the ComfyUI README "Dependencies" section:
pip install -r requirements.txt
4. Install ComfyUI-GGUF
The GGUF transformer requires city96's ComfyUI-GGUF custom node, which provides the UnetLoaderGGUF node:
cd custom_nodes
git clone https://github.com/city96/ComfyUI-GGUF
cd ComfyUI-GGUF
pip install -r requirements.txt
cd ../..
5. Download the model files (GGUF path)
The 16 GB path needs three pieces: the LongCat-Image GGUF Q4_K_M transformer, the Qwen2.5-VL-7B text encoder, and the VAE. The GGUF transformer and matching ComfyUI workflow come from vantagewithai/LongCat-Image-GGUF (comfy/ folder, sizes verified from the HF tree); the encoder and VAE are Comfy-Org repackages.
# 1) LongCat-Image transformer (Q4_K_M, 3.66 GB) — smallest tier with full headroom for the encoder + VAE
hf download vantagewithai/LongCat-Image-GGUF comfy/LongCat-Image-Q4_K_M.gguf \
--local-dir models/diffusion_models/
# 2) Qwen2.5-VL-7B text encoder, FP8-scaled (Comfy-Org repackage, 9.38 GB on disk)
hf download Comfy-Org/HunyuanVideo_1.5_repackaged \
split_files/text_encoders/qwen_2.5_vl_7b_fp8_scaled.safetensors \
--local-dir models/text_encoders/
# 3) VAE (Comfy-Org repackage, 0.34 GB)
hf download Comfy-Org/z_image_turbo split_files/vae/ae.safetensors \
--local-dir models/vae/
ℹ️ On AMD, the FP8 encoder file saves disk, not VRAM. The qwen_2.5_vl_7b_fp8_scaled.safetensors file is the one the official Vantage workflow wires up, and it is 9.38 GB on disk versus 16.58 GB for the BF16 shards, a real download saving. But because RDNA3 has no native FP8 (see the ROCm note above), the FP8 weights upcast to BF16 in VRAM when loaded, so during the text-encode stage the encoder is back to a ~16 GB-class resident size. That fits anyway because ComfyUI keeps only one major model resident at a time: it loads the encoder, encodes the prompt, then offloads it to CPU RAM before swapping in the 3.66 GB GGUF transformer for sampling. The 32 GB system-RAM recommendation in Requirements is for exactly this offload. If you prefer to avoid the upcast entirely, you can instead point the encoder at the full BF16 shards in the canonical text_encoder/ folder: same VRAM behaviour during encode, larger download.
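Before launching, it can help to confirm that the three downloads landed where ComfyUI will look for them. The sketch below is a hypothetical helper, not part of ComfyUI. It assumes the commands above were run from the ComfyUI root and that hf download with --local-dir keeps the repository's own subfolders (comfy/, split_files/...), which is why it searches each model folder recursively.

```python
from pathlib import Path

# Expected file names, taken from the download commands in this recipe.
EXPECTED = {
    "diffusion_models": "LongCat-Image-Q4_K_M.gguf",
    "text_encoders": "qwen_2.5_vl_7b_fp8_scaled.safetensors",
    "vae": "ae.safetensors",
}


def find_model_files(models_dir):
    """Map each model subfolder to the first matching file path, or None."""
    root = Path(models_dir)
    found = {}
    for subdir, filename in EXPECTED.items():
        folder = root / subdir
        # Search nested folders too, since the download may have recreated
        # the repo layout (an assumption about hf download, see above).
        matches = sorted(folder.rglob(filename)) if folder.is_dir() else []
        found[subdir] = matches[0] if matches else None
    return found


if __name__ == "__main__":
    for subdir, path in find_model_files("models").items():
        print(f"{subdir}: {path if path else 'MISSING'}")
```

If a file shows up under a nested folder, the loader dropdown in ComfyUI will most likely list it with that relative prefix rather than the bare file name.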
Running
Launch ComfyUI from the repo root with the ROCm-stable launch flags:
python main.py --use-pytorch-cross-attention --disable-smart-memory --disable-pinned-memory
Why these three flags on RDNA3:
- --use-pytorch-cross-attention forces ComfyUI onto PyTorch's scaled-dot-product attention (SDPA), the attention path RDNA3 supports cleanly, since there is no FlashAttention wheel or xformers path for this card. Per ComfyUI's cli_args.py, the flag is documented as "Use the new pytorch 2.0 cross attention function." Do not pass --use-split-cross-attention here.
- --disable-smart-memory and --disable-pinned-memory stabilize repeated runs on ROCm and back the encoder→transformer offload above. The first is documented as "Force ComfyUI to agressively offload to regular ram instead of keeping models in vram when it can." and the second as "Disable pinned memory use." (both from cli_args.py).
This starts the server (default http://127.0.0.1:8188). Open it in a browser and load the LongCat-Image text-to-image workflow. Two reasonable starting points:
- Community GGUF workflow (comfy/Vantage-Longcat-Image.json), shipped alongside the GGUF weights. It already wires the UnetLoaderGGUF transformer loader, the FP8 Qwen2.5-VL encoder, and the ae.safetensors VAE used above.
- ComfyUI Core template (LongCat Image: Text to Image). If you start here, swap the default UNETLoader for UnetLoaderGGUF and point it at LongCat-Image-Q4_K_M.gguf; leave the encoder on qwen_2.5_vl_7b_fp8_scaled.safetensors and the VAE on ae.safetensors.
Point the loaders at the three files you downloaded, enter a prompt, and queue. Generated PNGs land in ComfyUI/output/ with the full workflow embedded. The model card's reference example generates at 768×1344, guidance_scale=4.0, num_inference_steps=50. One important detail for text rendering: the HF model card cautions that for any image containing text you must enclose the target text in single or double quotation marks (English '...'/"..." or Chinese), because the model uses a character-level encoding strategy that only triggers on quoted content.
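The quoting rule is easy to forget when prompts are assembled by a script, so a small pre-flight check may help. The following is a minimal sketch with hypothetical helper names. The set of quote characters it recognises (straight and curly pairs) is an assumption, not something the model card specifies in this form, and the reference settings are simply the ones quoted above.

```python
import re

# Reference settings from the model card example quoted in this recipe.
REFERENCE_SETTINGS = {
    "width": 768,
    "height": 1344,
    "guidance_scale": 4.0,
    "num_inference_steps": 50,
}

# Quoted spans; the accepted quote characters are an assumption.
QUOTED = re.compile(
    r"""(?<!\w)'[^']+'(?!\w)   # straight single quotes, skipping apostrophes
      | "[^"]+"                # straight double quotes
      | ‘[^’]+’                # curly single quotes
      | “[^”]+”                # curly double quotes
    """,
    re.VERBOSE,
)


def has_quoted_text(prompt):
    """Return True if the prompt contains at least one quoted span."""
    return bool(QUOTED.search(prompt))


def with_rendered_text(scene, text):
    """Append the text to render, wrapped in double quotes."""
    return f'{scene} with the text "{text}"'


if __name__ == "__main__":
    prompt = with_rendered_text("A neon sign above a noodle shop", "OPEN")
    print(prompt, "| quoted:", has_quoted_text(prompt))
```

A prompt that is meant to render text but fails has_quoted_text is worth fixing before queueing, since each 50-step run is expensive to repeat.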
Results
- Speed: Not yet measured on the RX 7800 XT. The /check/longcat-image/rx-7800-xt endpoint currently has no benchmark for this pair (verdict: unknown), and no RX-7800-XT-named LongCat-Image timing has surfaced in research, so no number is quoted here rather than transferring one from a different card. The 7800 XT has 624 GB/s of memory bandwidth and the GGUF diffusion + VAE-decode stages are memory-bound, so a real figure depends on the quant tier, resolution, and step count you pick. If you run it, please submit your timing via /contribute so a real figure lands on /check/longcat-image/rx-7800-xt.
- VRAM usage: budget for a peak in the low-to-mid teens of GB. The Q4_K_M transformer is 3.66 GB and the FP8-scaled Qwen2.5-VL encoder is 9.38 GB on disk (upcasting toward a ~16 GB-class size during encode on RDNA3), but ComfyUI keeps only one of them resident on the GPU at a time, offloading the encoder before the diffusion step. Peak occurs during sampling (the 3.66 GB GGUF transformer + activations + latents + VAE decode), which is what keeps the recipe within 16 GB. By contrast, the vanilla BF16 diffusers path needs ~17 GB even with enable_model_cpu_offload(): the canonical model card Quick Start annotates that line "Required ~17 GB", and a LongCat-Image project collaborator states the latest official code consumes approximately 18 GB of VRAM and supports inference on an RTX 4090 (Issue #8 comment by junqiangwu). That ~17–18 GB envelope is an NVIDIA-measured upstream footprint, not a 7800 XT figure; it is quoted here only to explain why the BF16 path overflows 16 GB and the GGUF route is the lead on this card. See /check/longcat-image/rx-7800-xt for any community-submitted measurement.
- Quality notes: LongCat-Image is bilingual by design and the arXiv technical report (2512.07584) highlights multilingual text rendering as a primary target. Q4_K_M is the lead tier here for safe 16 GB headroom; step up to comfy/LongCat-Image-Q6_K.gguf (5.20 GB) or comfy/LongCat-Image-Q8_0.gguf (6.71 GB) from the same repo if you notice quality drop on fine details or small text. Both still fit 16 GB because only one major model is resident at a time. The full BF16 transformer (12.54 GB) is the highest-quality tier and is what a 24 GB card such as the RX 7900 XTX can run un-quantized.
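To make the tier trade-off concrete, the sketch below computes how much of the nominal 16 GB each transformer tier leaves free once the VAE is also loaded. The tier sizes are the ones listed in this recipe; the helper names and the headroom threshold in the example are hypothetical, and no measured activation or latent footprint is assumed.

```python
VRAM_GB = 16.0  # nominal capacity; usable VRAM is somewhat lower in practice
VAE_GB = 0.34

# Transformer file sizes per tier, as listed in this recipe.
TIERS_GB = {"Q4_K_M": 3.66, "Q6_K": 5.20, "Q8_0": 6.71, "BF16": 12.54}


def headroom(tier):
    """GB left after loading the given transformer tier plus the VAE."""
    return round(VRAM_GB - TIERS_GB[tier] - VAE_GB, 2)


def largest_tier_with_headroom(min_headroom_gb):
    """Pick the largest tier that still leaves the requested free space."""
    fitting = [tier for tier in TIERS_GB if headroom(tier) >= min_headroom_gb]
    return max(fitting, key=TIERS_GB.get) if fitting else None


if __name__ == "__main__":
    for tier in TIERS_GB:
        print(f"{tier:7s} leaves {headroom(tier):5.2f} GB for everything else")
    # 6.0 GB is an illustrative threshold, not a measured requirement.
    print("Largest tier with 6 GB spare:", largest_tier_with_headroom(6.0))
```

Treat the output as a starting point only: the space that sampling actually needs depends on resolution and step settings, so a tier that looks safe on paper should still be confirmed with a real run.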
For the full benchmark data, see /check/longcat-image/rx-7800-xt.
Troubleshooting
"Torch not compiled with CUDA enabled"
This means a CUDA build of PyTorch got installed instead of the ROCm build. Per the ComfyUI README, uninstall and reinstall against the ROCm wheel index:
pip uninstall torch
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm7.2
Confirm the installed build is the ROCm one: python -c "import torch; print(torch.__version__)" should print a +rocm7.2-style suffix, and torch.cuda.is_available() returns True (ROCm masquerades as the cuda device namespace under HIP).
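The same check can be scripted so it reports the build type directly. This is a sketch under two assumptions: that torch.version.hip is populated on ROCm builds and None otherwise, and that CUDA and CPU wheels carry +cu and +cpu version suffixes. The classify_build helper is hypothetical.

```python
def classify_build(version, hip_version):
    """Label a PyTorch build from torch.__version__ and torch.version.hip."""
    if hip_version:
        # A HIP version string is only expected on ROCm builds.
        return "rocm"
    if "+cu" in version:
        return "cuda"
    if "+cpu" in version:
        return "cpu"
    return "unknown"


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        print("PyTorch is not installed in this environment")
    else:
        hip = getattr(torch.version, "hip", None)
        print(torch.__version__, "->", classify_build(torch.__version__, hip))
        # On ROCm the GPU is exposed through the torch.cuda namespace.
        print("GPU visible:", torch.cuda.is_available())
```

A result of cuda or cpu means the wrong wheel is installed and the reinstall step above applies; rocm with "GPU visible: False" points at the driver or ROCm install rather than at PyTorch.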
OOM during the diffusion step on a 16 GB card
If you loaded the BF16 transformer (12.54 GB) instead of the GGUF, or left the full BF16 5-shard Qwen2.5-VL encoder (16.58 GB) resident, you will OOM — neither leaves room for activations and the VAE on a 16 GB card. Use the Q4_K_M GGUF transformer and confirm ComfyUI is offloading the encoder before sampling: the --disable-smart-memory --disable-pinned-memory launch flags from the Running section force the aggressive CPU offload that makes the encoder→transformer swap fit. There is no FP8-precision escape hatch on RDNA3 (FP8 weights upcast to BF16), so the GGUF transformer is the memory lever, not encoder precision.
Crash during VAE decode, or unstable repeated runs
On RDNA3, ComfyUI's default attention selection can be unstable during VAE decode, and keeping models pinned in VRAM across runs can fail. The fix is the launch flags from the Running section: force SDPA with --use-pytorch-cross-attention, and add --disable-smart-memory --disable-pinned-memory to stop ComfyUI from holding pinned/smart-managed memory between runs. Do not reach for --use-split-cross-attention — it is not the right path on this card.
Do not install xformers or FlashAttention
HF and ComfyUI guides written for NVIDIA frequently suggest pip install xformers or a FlashAttention wheel. On RDNA3 these are the wrong path: the ROCm xformers fork is limited and a prebuilt FlashAttention wheel does not exist for gfx1101 consumer cards. ComfyUI routes attention through PyTorch SDPA on this stack — force it explicitly with --use-pytorch-cross-attention.