LongCat-Image (base T2I) on RX 7900 XTX: Bilingual 6B Text-to-Image at full BF16 via ComfyUI on ROCm

image · intermediate · 18 GB+ VRAM · Jun 17, 2026

This intermediate recipe sets up LongCat-Image on the RX 7900 XTX and needs about 18 GB of VRAM.

prerequisites
  • AMD Radeon RX 7900 XTX (24 GB VRAM, RDNA3 / Navi 31 / gfx1100) or equivalent ROCm-supported card
  • Linux (Ubuntu 24.04 / 22.04 or RHEL) with the AMD ROCm stack installed (ROCm 7.2.x)
  • Python 3.10+
  • ComfyUI installed (git clone) with PyTorch built for ROCm
  • ~30 GB free disk for the full BF16 transformer + Qwen2.5-VL text encoder + VAE

What You'll Build

A local LongCat-Image text-to-image setup running in ComfyUI on a 24 GB Radeon RX 7900 XTX (RDNA3, Navi 31, gfx1100) through the ROCm stack. LongCat-Image is Meituan's bilingual (Chinese + English) 6B-parameter diffusion transformer, and this recipe runs it at full BF16 precision — no quantization. On a 16 GB card the vanilla diffusers path does not fit and you are forced down to a GGUF quant; on the 7900 XTX's 24 GB the entire BF16 pipeline fits with headroom, so you get the model's native quality without the squeeze.

Hardware data: RX 7900 XTX (24GB VRAM) · full BF16 · ComfyUI on ROCm 7.2 · See benchmark data

⚠️ This is a ROCm recipe, not CUDA. The RX 7900 XTX runs on AMD's ROCm/HIP stack — there is no cu124/cu128 wheel, no xformers install, no FlashAttention wheel, and no FP8/FP4 path here. RDNA3's WMMA units accept FP16, BF16, INT8 and INT4 only (no native FP8/FP4), so an FP8 checkpoint would just upcast to BF16 with no memory saving — and at 24 GB you don't need quantization at all. The attention path is PyTorch SDPA, forced explicitly with --use-pytorch-cross-attention. If a guide tells you to pip install xformers, build a FlashAttention wheel, or pick a cu12x wheel for this card, it's written for the wrong vendor.

⚠️ Why this recipe pins the base variant. Meituan publishes several siblings under the LongCat-Image brand. This recipe covers the base text-to-image model. See "Sibling variants" below before downloading anything.

Sibling variants

| Variant | Purpose | Notes |
|---|---|---|
| LongCat-Image (this recipe) | Final-release T2I, 6B params, BF16 transformer 12.54 GB | Runs at full BF16 on 24 GB; no quant needed |
| LongCat-Image-Edit | Image-to-image editing variant | The editing pipeline loads Qwen2.5-VL and the transformer together; out of scope here |
| LongCat-Image-Edit-Turbo | Distilled few-step edit variant | Same family as Edit; out of scope here |
| LongCat-Image-Dev | Mid-training checkpoint, intended for fine-tuning, not inference | Out of scope for this recipe |

The base model is released under the Apache-2.0 license per the HF model card and the weights are not gated — no access request or login is required to download them.

Requirements

| Component | Minimum | Tested |
|---|---|---|
| GPU | 18 GB VRAM (ROCm-supported AMD card) | RX 7900 XTX (24 GB) |
| RAM | 16 GB system | |
| Storage | ~30 GB (BF16 transformer 12.54 GB + Qwen2.5-VL text encoder 16.58 GB + VAE 0.17 GB) | per HF Files tree |
| Driver | AMD ROCm 7.2.x on Linux | |
| Software | ComfyUI + PyTorch (ROCm 7.2 build), Python 3.10+ | |

LongCat-Image is a 6B diffusion transformer paired with a Qwen2.5-VL text encoder and a VAE. The BF16 transformer is 12.54 GB, and the BF16 text encoder is 16.58 GB across five shards in the canonical repo's text_encoder/ folder. With CPU offload (see the VRAM note under Results), the two are not resident on the GPU at the same time, which keeps the runtime envelope at ~18 GB instead of the ~29 GB on-disk total.
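The arithmetic behind those figures can be sketched in a few lines of Python. This is only an illustration: the sizes are the ones quoted in this recipe, the ~18 GB envelope is the sourced footprint figure and not a measurement on this card, and the function names are hypothetical.

```python
# Component sizes in GB, as quoted in the Requirements section.
SIZES_GB = {
    "transformer_bf16": 12.54,
    "text_encoder_bf16": 16.58,
    "vae": 0.17,
}


def on_disk_total_gb(sizes=SIZES_GB):
    """Total size of everything that has to be downloaded."""
    return round(sum(sizes.values()), 2)


def headroom_gb(card_vram_gb, runtime_envelope_gb=18.0):
    """VRAM left over once the ~18 GB runtime envelope is resident.

    The 18.0 default is the sourced footprint figure, not a value
    measured on an RX 7900 XTX.
    """
    return round(card_vram_gb - runtime_envelope_gb, 2)


if __name__ == "__main__":
    print(f"on-disk total: {on_disk_total_gb()} GB")
    print(f"headroom on a 24 GB card: {headroom_gb(24.0)} GB")
    print(f"headroom on a 16 GB card: {headroom_gb(16.0)} GB")
```

A negative headroom value, as for the 16 GB case, is the situation in which a quantized transformer becomes necessary.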

Installation

1. Install ComfyUI

Per the ComfyUI README, clone the repo:

git clone https://github.com/comfyanonymous/ComfyUI.git
cd ComfyUI

2. Install PyTorch for ROCm

The RX 7900 XTX (gfx1100) is an officially ROCm-supported GPU, so it uses the stable ROCm PyTorch wheel. Per the ComfyUI README "AMD GPUs (Linux)" section, the stable install command is:

pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm7.2

ℹ️ Verify the ROCm tag before you copy it. As of this writing the ComfyUI README pins rocm7.2 as the stable wheel — but the rocmX.Y tag moves over time (6.3 → 6.4 → 7.x). Read the current line in the live ComfyUI README before running. The README also lists a nightly variant (https://download.pytorch.org/whl/nightly/rocm7.2) which it says "might have some performance improvements". On officially-supported Linux you do not need the experimental RDNA3-specific wheel index; the stable whl/rocm7.2 wheel above is the canonical path.

3. Install ComfyUI dependencies

Per the ComfyUI README "Dependencies" section:

pip install -r requirements.txt

4. Download the model files (full BF16)

The full BF16 path needs three pieces: the LongCat-Image BF16 transformer, the Qwen2.5-VL text encoder, and the VAE. The ComfyUI core team publishes the repackaged BF16 transformer at Comfy-Org/LongCat-Image (split_files/diffusion_models/longcat_image_bf16.safetensors, 12.54 GB verified from the HF tree); the text encoder and VAE come from the canonical Meituan repo.

# 1) LongCat-Image BF16 transformer (~12.54 GB) — ComfyUI repackaged
wget -P models/diffusion_models/ \
  https://huggingface.co/Comfy-Org/LongCat-Image/resolve/main/split_files/diffusion_models/longcat_image_bf16.safetensors

# 2) VAE (~0.17 GB) — from the canonical Meituan repo
wget -O models/vae/longcat_image_vae.safetensors \
  https://huggingface.co/meituan-longcat/LongCat-Image/resolve/main/vae/diffusion_pytorch_model.safetensors

The Qwen2.5-VL-7B text encoder (5 BF16 shards, 16.58 GB total) lives in the text_encoder/ folder of the canonical repo and is loaded by the native LongCat workflow. Pull the whole folder with the Hugging Face CLI:

# 3) Qwen2.5-VL-7B text encoder (~16.58 GB BF16, 5 shards)
hf download meituan-longcat/LongCat-Image \
  --include "text_encoder/*" --local-dir models/text_encoders/longcat-image/
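Before launching, it can help to confirm that all three pieces landed where the commands above put them. The following is a minimal sketch, not part of ComfyUI: the helper name is hypothetical, it assumes `hf download --local-dir` preserves the repo's text_encoder/ subfolder, and it assumes the five shards use the .safetensors extension.

```python
from pathlib import Path

# Paths relative to the ComfyUI root, following the download commands above.
SINGLE_FILES = [
    "models/diffusion_models/longcat_image_bf16.safetensors",
    "models/vae/longcat_image_vae.safetensors",
]
# Assumption: --local-dir keeps the repo's "text_encoder/" subfolder.
TEXT_ENCODER_DIR = "models/text_encoders/longcat-image/text_encoder"
EXPECTED_SHARDS = 5  # shard count quoted in this recipe


def missing_model_files(comfy_root):
    """Return a list of problems; an empty list means everything was found."""
    root = Path(comfy_root)
    problems = []
    for rel in SINGLE_FILES:
        if not (root / rel).is_file():
            problems.append(f"missing file: {rel}")
    enc_dir = root / TEXT_ENCODER_DIR
    # Assumption: shards are *.safetensors files directly inside the folder.
    shards = sorted(enc_dir.glob("*.safetensors")) if enc_dir.is_dir() else []
    if len(shards) < EXPECTED_SHARDS:
        problems.append(
            f"text encoder: found {len(shards)} of {EXPECTED_SHARDS} shards "
            f"in {TEXT_ENCODER_DIR}"
        )
    return problems


if __name__ == "__main__":
    for line in missing_model_files(".") or ["all model files present"]:
        print(line)
```

Run it from the ComfyUI repo root. The check looks only at file presence, so a truncated download would still pass; comparing sizes against the HF tree is a reasonable next step.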

Running

Launch ComfyUI from the repo root with the ROCm-stable launch flags:

python main.py --use-pytorch-cross-attention --disable-smart-memory --disable-pinned-memory

Why these three flags on RDNA3:

  • --use-pytorch-cross-attention forces ComfyUI onto PyTorch's scaled-dot-product attention (SDPA) — the attention path RDNA3 supports cleanly, since there is no FlashAttention wheel or xformers path for this card. Per ComfyUI's cli_args.py, the flag is documented as "Use the new pytorch 2.0 cross attention function." Do not pass --use-split-cross-attention here.
  • --disable-smart-memory and --disable-pinned-memory stabilize repeated runs on ROCm. The first is documented as "Force ComfyUI to agressively offload to regular ram instead of keeping models in vram when it can." and the second as "Disable pinned memory use." (both from cli_args.py).
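If you launch ComfyUI from a script, the flag set above can be kept in one place. This is a small sketch under stated assumptions: the helper names are hypothetical, the script is expected to run with the same Python environment that has the ROCm PyTorch build, and the guard against --use-split-cross-attention simply encodes this recipe's advice.

```python
import subprocess
import sys

# The three flags from the launch command above.
ROCM_FLAGS = [
    "--use-pytorch-cross-attention",
    "--disable-smart-memory",
    "--disable-pinned-memory",
]


def build_launch_command(extra_args=()):
    """Build the argv list for starting ComfyUI with the RDNA3 flag set."""
    extra = list(extra_args)
    if "--use-split-cross-attention" in extra:
        # This recipe advises against the split path on this card.
        raise ValueError("--use-split-cross-attention conflicts with this recipe")
    return [sys.executable, "main.py", *ROCM_FLAGS, *extra]


def launch(comfy_root=".", extra_args=()):
    """Start the server from the repo root; blocks until ComfyUI exits."""
    return subprocess.run(build_launch_command(extra_args), cwd=comfy_root, check=True)
```

For example, `launch("ComfyUI", ["--port", "8189"])` would start the server on a non-default port with the same three flags applied.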

This starts the server (default http://127.0.0.1:8188). Open it in a browser and load the LongCat-Image text-to-image workflow — it is an official ComfyUI template, LongCat Image: Text to Image, which wires the diffusion transformer, the Qwen2.5-VL text encoder and the VAE into a compact graph. Point the loaders at the three files you downloaded, enter a prompt, and queue. Generated PNGs land in ComfyUI/output/ with the full workflow embedded.

The model card's reference example generates at 768×1344 with guidance_scale=4.0 and num_inference_steps=50. One important detail for text rendering: the HF model card cautions that any text you want rendered in the image must be enclosed in single or double quotation marks (English '...'/"..." or Chinese ‘...’/“...”), because the model uses a character-level encoding strategy that only triggers on quoted content.
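The quoting rule is easy to forget when prompts are assembled programmatically, so a small pre-flight check may help. The sketch below uses hypothetical helper names and assumes the Chinese styles are the curly marks ‘…’ and “…”. It deliberately skips straight single quotes, because apostrophes in ordinary English make them ambiguous to detect.

```python
import re

# Matches text wrapped in straight double quotes, Chinese double quotes,
# or Chinese single quotes. Straight single quotes are skipped on purpose:
# apostrophes ("it's", "don't") would produce false matches.
_QUOTED = re.compile(r'"[^"]+"|“[^”]+”|‘[^’]+’')


def quoted_spans(prompt):
    """Return every quoted span found in the prompt, quotes included."""
    return _QUOTED.findall(prompt)


def quote_text(text, chinese=False):
    """Wrap text to be rendered in double quotes of the chosen style."""
    return f"“{text}”" if chinese else f'"{text}"'


if __name__ == "__main__":
    prompt = f"A neon shop sign that says {quote_text('OPEN')} above a door"
    print(prompt)
    print(quoted_spans(prompt))
```

An empty result from `quoted_spans` on a prompt that is meant to contain rendered text is a hint that the quotes were dropped somewhere upstream.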

Results

  • Speed: Not yet measured on the RX 7900 XTX. The /check/longcat-image/rx-7900-xtx endpoint currently has no benchmark for this pair (verdict: unknown), and research turned up no LongCat-Image timing measured on an RX 7900 XTX. No figure is quoted here, because a number transferred from a different card would be misleading. If you run it, please submit your timing via /contribute so a real figure lands on /check/longcat-image/rx-7900-xtx.
  • VRAM usage: about 18 GB for the full BF16 path, per two sources. A LongCat-Image developer states that the latest official inference code consumes approximately 18 GB of VRAM and supports inference on an RTX 4090 (Issue #8 comment by junqiangwu, a project collaborator), and the HF model card's Quick Start annotates its enable_model_cpu_offload() line with "Required ~17 GB". That ~17–18 GB envelope was measured on NVIDIA hardware, not on ROCm; it is well within the 7900 XTX's 24 GB, which is why this recipe runs full BF16 instead of the GGUF quant a 16 GB card needs. See /check/longcat-image/rx-7900-xtx for any community-submitted measurement.
  • Quality notes: LongCat-Image is bilingual by design and the arXiv technical report (2512.07584) highlights multilingual text rendering as a primary target. Full BF16 is the highest-quality configuration; there is no quantization tradeoff to consider on this card.

For the full benchmark data, see /check/longcat-image/rx-7900-xtx.

Troubleshooting

"Torch not compiled with CUDA enabled"

This usually means a PyTorch build without GPU support (typically a CPU-only or other non-ROCm wheel) got installed instead of the ROCm build. Per the ComfyUI README, uninstall and reinstall against the ROCm wheel index:

pip uninstall torch
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm7.2

Confirm the installed build is the ROCm one: python -c "import torch; print(torch.__version__)" should print a +rocm7.2-style suffix, and torch.cuda.is_available() returns True (ROCm masquerades as the cuda device namespace under HIP).
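The same check can be written as a short script that reports what it finds in one line. This is a sketch with hypothetical helper names: it relies on the +rocm suffix described above and on torch.version.hip, which is None on non-ROCm builds, and the version strings in the comments are illustrative only.

```python
import importlib.util


def is_rocm_version(version):
    """True if a torch version string carries a +rocm suffix (e.g. '...+rocm7.2')."""
    return "+rocm" in version


def describe_torch_build():
    """Summarize the installed PyTorch build in one line."""
    if importlib.util.find_spec("torch") is None:
        return "torch is not installed in this environment"
    import torch

    hip = getattr(torch.version, "hip", None)  # None on non-ROCm builds
    if hip is None and not is_rocm_version(torch.__version__):
        return f"non-ROCm build: {torch.__version__}"
    if not torch.cuda.is_available():
        return f"ROCm build {torch.__version__}, but no GPU is visible"
    # ROCm devices are exposed through the torch.cuda namespace.
    return f"ROCm build {torch.__version__} (HIP {hip}) on {torch.cuda.get_device_name(0)}"


if __name__ == "__main__":
    print(describe_torch_build())
```

If the output reports a non-ROCm build, repeat the uninstall and reinstall steps above inside the same virtual environment that ComfyUI uses.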

Crash during VAE decode, or unstable repeated runs

On RDNA3, ComfyUI's default attention selection can be unstable during VAE decode, and keeping models pinned in VRAM across runs can fail. The fix is the launch flags from the Running section: force SDPA with --use-pytorch-cross-attention, and add --disable-smart-memory --disable-pinned-memory to stop ComfyUI from holding pinned/smart-managed memory between runs. Do not reach for --use-split-cross-attention — it is not the right path on this card.

Want more headroom? Optional GGUF path

You do not need it at 24 GB, but if you want to free VRAM (for a larger batch, a second model colocated on the card, or higher resolution), a GGUF quant of the transformer is available at vantagewithai/LongCat-Image-GGUF — per-tier sizes run Q4_K_M 3.73 GB · Q6_K 5.15 GB · Q8_0 6.67 GB (verified from the HF tree). GGUF on AMD loads via the HIP-backed loader; on a 7900 XTX, Q8 quality is closest to BF16. This path requires the city96/ComfyUI-GGUF custom node and a GGUF Qwen2.5-VL text encoder — it is a deliberate trade of quality/setup-simplicity for VRAM, and the full-BF16 path above is the recommended default on this card.
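To put the tiers in perspective, the difference in transformer file size against BF16 can be computed directly from the figures quoted above. This is a rough sketch: file-size differences only approximate the VRAM freed at runtime, and the function name is hypothetical.

```python
# Transformer file sizes in GB, as quoted in this recipe.
BF16_TRANSFORMER_GB = 12.54
GGUF_TIERS_GB = {"Q4_K_M": 3.73, "Q6_K": 5.15, "Q8_0": 6.67}


def savings_gb(tier):
    """File-size reduction of a GGUF tier relative to the BF16 transformer."""
    return round(BF16_TRANSFORMER_GB - GGUF_TIERS_GB[tier], 2)


if __name__ == "__main__":
    for tier in GGUF_TIERS_GB:
        print(f"{tier}: about {savings_gb(tier)} GB smaller than BF16")
```

These numbers cover the transformer only; the text encoder and VAE sizes are unchanged unless you also swap in a GGUF text encoder.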

Do not install xformers or FlashAttention

HF and ComfyUI guides written for NVIDIA frequently suggest pip install xformers or a FlashAttention wheel. On RDNA3 these are the wrong path: the ROCm xformers fork is limited and a prebuilt FlashAttention wheel does not exist for gfx1100 consumer cards. ComfyUI routes attention through PyTorch SDPA on this stack — force it explicitly with --use-pytorch-cross-attention.

common questions
How much VRAM does LongCat Image need?

About 18 GB — the minimum this recipe targets.

Which GPUs is LongCat Image tested on?

RX 7900 XTX (24 GB).

How hard is this setup?

Intermediate — follow the steps above.