LongCat-Image (base T2I) on RX 7800 XT: Bilingual 6B Text-to-Image at 16 GB via ComfyUI-GGUF on ROCm

image · intermediate · 13 GB+ VRAM · Jun 18, 2026

This intermediate recipe sets up LongCat Image on the RX 7800 XT, needing about 13 GB of VRAM.

prerequisites
  • AMD Radeon RX 7800 XT (16 GB VRAM, RDNA3 / Navi 32 / gfx1101) or equivalent ROCm-supported card
  • Linux (Ubuntu 24.04 / 22.04 or RHEL) with the AMD ROCm stack installed (ROCm 7.2.x)
  • Python 3.10+
  • ComfyUI installed (git clone) with PyTorch built for ROCm
  • ComfyUI-GGUF custom node (city96)
  • 32 GB system RAM recommended (text-encoder offload uses CPU RAM)

What You'll Build

A local LongCat-Image text-to-image setup running in ComfyUI on a 16 GB Radeon RX 7800 XT (RDNA3, Navi 32, gfx1101) through the ROCm stack. LongCat-Image is Meituan's bilingual (Chinese + English) 6B-parameter diffusion transformer, and this recipe runs it with a GGUF-quantized transformer (Q4_K_M) so the whole pipeline fits the 7800 XT's 16 GB. The vanilla BF16 diffusers path needs ~17–18 GB even with CPU offload, which overflows 16 GB. Because RDNA3 has no FP8 hardware, dropping to an FP8 checkpoint does not save memory the way it does on 16 GB NVIDIA cards. The GGUF route is what makes 16 GB work here.

Hardware data: RX 7800 XT (16GB VRAM) · GGUF Q4_K_M · ComfyUI-GGUF on ROCm 7.2 · See benchmark data

⚠️ This is a ROCm recipe, not CUDA. The RX 7800 XT runs on AMD's ROCm/HIP stack — there is no cu124/cu128 wheel, no xformers install, no FlashAttention wheel, and no FP8/FP4 path here. RDNA3's WMMA units accept FP16, BF16, INT8 and INT4 only (no native FP8/FP4), so an FP8 checkpoint would just upcast to BF16 with no memory saving — which is precisely why a 16 GB NVIDIA card's "drop to FP8" trick has no AMD equivalent, and why this recipe leans on a GGUF quant instead. The attention path is PyTorch SDPA, forced explicitly with --use-pytorch-cross-attention. If a guide tells you to pip install xformers, build a FlashAttention wheel, or pick a cu12x wheel for this card, it's written for the wrong vendor.

⚠️ Why this recipe pins the base variant. Meituan publishes several siblings under the LongCat-Image brand and their fit on 16 GB differs. See "Sibling variants" below before downloading anything.

Sibling variants

Variant | Purpose | 16 GB fit on this card
LongCat-Image (this recipe) | Final-release T2I, 6B params | Yes, via GGUF: vantagewithai/LongCat-Image-GGUF ships per-tier ComfyUI files (Q4_K_M 3.66 GB · Q6_K 5.20 GB · Q8_0 6.71 GB · BF16 12.54 GB; sizes from the repo's comfy/ folder via the HF tree API)
LongCat-Image-Edit | Image-to-image editing variant | The editing pipeline loads Qwen2.5-VL and the transformer together; out of scope here
LongCat-Image-Edit-Turbo | Distilled few-step edit variant | Same family as Edit; out of scope here
LongCat-Image-Dev | Mid-training checkpoint, intended for fine-tuning, not inference | Out of scope for this recipe

The base model is released under the Apache-2.0 license per the HF model card and the weights are not gated — no access request or login is required to download them.

Requirements

Component | Minimum | Tested
GPU | 16 GB VRAM (ROCm-supported AMD card) | RX 7800 XT (16 GB)
RAM | 32 GB system (text encoder is CPU-offloaded between stages) |
Storage | ~15 GB (Q4_K_M transformer 3.66 GB + FP8-scaled Qwen2.5-VL text encoder 9.38 GB + VAE 0.34 GB) | per HF tree
Driver | AMD ROCm 7.2.x on Linux |
Software | ComfyUI + ComfyUI-GGUF + PyTorch (ROCm 7.2 build), Python 3.10+ |

LongCat-Image is a 6B diffusion transformer paired with a Qwen2.5-VL text encoder and a VAE. The full BF16 transformer is 12.54 GB and the BF16 text encoder is 16.58 GB across five shards in the canonical repo's text_encoder/ folder — together far too much to keep resident on 16 GB. Quantizing the transformer to GGUF Q4_K_M (3.66 GB) and offloading the text encoder between stages is what brings the runtime envelope down under 16 GB.
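The budget argument above is plain addition, so a short sketch can make it concrete. The sizes are the figures quoted in this recipe; the function names are illustrative, and the totals count weights only, so activations, latents and VAE-decode buffers come on top of them.

```python
# Sizes in GB, copied from the figures quoted in this recipe (HF tree listings).
TRANSFORMER_GB = {"Q4_K_M": 3.66, "Q6_K": 5.20, "Q8_0": 6.71, "BF16": 12.54}
ENCODER_BF16_GB = 16.58      # five shards in the canonical text_encoder/ folder
ENCODER_FP8_DISK_GB = 9.38   # FP8-scaled repackage, size on disk
VAE_GB = 0.34
VRAM_GB = 16.0               # RX 7800 XT


def all_resident_weights(tier: str) -> float:
    """Weights-only total if transformer, BF16 encoder and VAE all stayed on the GPU."""
    return round(TRANSFORMER_GB[tier] + ENCODER_BF16_GB + VAE_GB, 2)


def sampling_stage_weights(tier: str) -> float:
    """Weights-only total during sampling, once the encoder is offloaded to CPU RAM."""
    return round(TRANSFORMER_GB[tier] + VAE_GB, 2)


def download_size(tier: str) -> float:
    """Disk footprint of the three files this recipe downloads."""
    return round(TRANSFORMER_GB[tier] + ENCODER_FP8_DISK_GB + VAE_GB, 2)


if __name__ == "__main__":
    for tier in TRANSFORMER_GB:
        together = all_resident_weights(tier)
        verdict = "fits" if together <= VRAM_GB else "overflows"
        print(
            f"{tier}: all resident {together} GB ({verdict} {VRAM_GB:.0f} GB), "
            f"sampling stage {sampling_stage_weights(tier)} GB, "
            f"download {download_size(tier)} GB"
        )
```

For Q4_K_M this gives 20.58 GB of weights if everything stayed resident, against 4.0 GB of weights during sampling once the encoder has been offloaded.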

Installation

1. Install ComfyUI

Per the ComfyUI README, clone the repo:

git clone https://github.com/comfyanonymous/ComfyUI.git
cd ComfyUI

2. Install PyTorch for ROCm

The RX 7800 XT (gfx1101) is an officially ROCm-supported GPU, so it uses the stable ROCm PyTorch wheel. Per the ComfyUI README "AMD GPUs (Linux)" section, the stable install command is:

pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm7.2

ℹ️ Verify the ROCm tag before you copy it. As of this writing the ComfyUI README pins rocm7.2 as the stable wheel — but the rocmX.Y tag moves over time (6.3 → 6.4 → 7.x). Read the current line in the live ComfyUI README before running. The README also lists a nightly variant (https://download.pytorch.org/whl/nightly/rocm7.2) which it says "might have some performance improvements". On the officially-supported gfx1101 card you do not need the experimental per-arch RDNA3 wheel (gfx110X-all); the stable whl/rocm7.2 wheel above is the canonical path. Only if a library ships gfx1100-only kernels would you fall back to the legacy HSA_OVERRIDE_GFX_VERSION=11.0.0 masquerade — it is not required for ComfyUI on this card.

3. Install ComfyUI dependencies

Per the ComfyUI README "Dependencies" section:

pip install -r requirements.txt

4. Install ComfyUI-GGUF

The GGUF transformer requires city96's ComfyUI-GGUF custom node, which provides the UnetLoaderGGUF node:

cd custom_nodes
git clone https://github.com/city96/ComfyUI-GGUF
cd ComfyUI-GGUF
pip install -r requirements.txt
cd ../..

5. Download the model files (GGUF path)

The 16 GB path needs three pieces: the LongCat-Image GGUF Q4_K_M transformer, the Qwen2.5-VL-7B text encoder, and the VAE. The GGUF transformer and matching ComfyUI workflow come from vantagewithai/LongCat-Image-GGUF (comfy/ folder, sizes verified from the HF tree); the encoder and VAE are Comfy-Org repackages.

# 1) LongCat-Image transformer (Q4_K_M, 3.66 GB) — smallest tier with full headroom for the encoder + VAE
hf download vantagewithai/LongCat-Image-GGUF comfy/LongCat-Image-Q4_K_M.gguf \
  --local-dir models/diffusion_models/

# 2) Qwen2.5-VL-7B text encoder, FP8-scaled (Comfy-Org repackage, 9.38 GB on disk)
hf download Comfy-Org/HunyuanVideo_1.5_repackaged \
  split_files/text_encoders/qwen_2.5_vl_7b_fp8_scaled.safetensors \
  --local-dir models/text_encoders/

# 3) VAE (Comfy-Org repackage, 0.34 GB)
hf download Comfy-Org/z_image_turbo split_files/vae/ae.safetensors \
  --local-dir models/vae/
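Before launching, it can help to confirm that all three files actually landed under models/. The sketch below is a hypothetical helper, not part of ComfyUI. It searches each folder recursively because hf download keeps a file's repo-relative path under --local-dir, so the files are expected to sit in subfolders such as comfy/ or split_files/ and not directly in the target folder.

```python
from pathlib import Path

# Folder -> filename, matching the three downloads in this recipe.
EXPECTED = {
    "diffusion_models": "LongCat-Image-Q4_K_M.gguf",
    "text_encoders": "qwen_2.5_vl_7b_fp8_scaled.safetensors",
    "vae": "ae.safetensors",
}


def find_models(models_root):
    """Map each model folder to the first matching file path, or None if missing."""
    root = Path(models_root)
    found = {}
    for folder, filename in EXPECTED.items():
        base = root / folder
        # rglob also matches files nested in subfolders such as comfy/.
        matches = sorted(base.rglob(filename)) if base.is_dir() else []
        found[folder] = matches[0] if matches else None
    return found


if __name__ == "__main__":
    # Run from the ComfyUI repo root so that "models" resolves correctly.
    for folder, path in find_models("models").items():
        if path is None:
            print(f"MISSING  {EXPECTED[folder]} (expected under models/{folder}/)")
        else:
            size_gb = path.stat().st_size / 1e9
            print(f"OK       {path} ({size_gb:.2f} GB)")
```

If a file shows up with a subfolder prefix, expect the same prefix in the loader node's dropdown.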

ℹ️ On AMD, the FP8 encoder file saves disk, not VRAM. The qwen_2.5_vl_7b_fp8_scaled.safetensors file is the one the official Vantage workflow wires up, and it is 9.38 GB on disk versus 16.58 GB for the BF16 shards — a real download saving. But because RDNA3 has no native FP8 (see the ROCm note above), the FP8 weights upcast to BF16 in VRAM when loaded, so during the text-encode stage the encoder is back to a ~16 GB-class resident size. That fits anyway because ComfyUI keeps only one major model resident at a time: it loads the encoder, encodes the prompt, then offloads it to CPU RAM before swapping in the 3.66 GB GGUF transformer for sampling. The 32 GB system-RAM recommendation in Requirements is for exactly this offload. If you prefer to avoid the upcast entirely, you can instead point the encoder at the full BF16 shards in the canonical text_encoder/ folder — same VRAM behaviour during encode, larger download.

Running

Launch ComfyUI from the repo root with the ROCm-stable launch flags:

python main.py --use-pytorch-cross-attention --disable-smart-memory --disable-pinned-memory

Why these three flags on RDNA3:

  • --use-pytorch-cross-attention forces ComfyUI onto PyTorch's scaled-dot-product attention (SDPA) — the attention path RDNA3 supports cleanly, since there is no FlashAttention wheel or xformers path for this card. Per ComfyUI's cli_args.py, the flag is documented as "Use the new pytorch 2.0 cross attention function." Do not pass --use-split-cross-attention here.
  • --disable-smart-memory and --disable-pinned-memory stabilize repeated runs on ROCm and back the encoder→transformer offload above. The first is documented as "Force ComfyUI to agressively offload to regular ram instead of keeping models in vram when it can." and the second as "Disable pinned memory use." (both from cli_args.py).
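If you launch ComfyUI often, a small wrapper keeps the three flags in one place. This is a minimal sketch with hypothetical helper names (build_command, launch); it only assembles the command shown above and assumes it is run from the ComfyUI repo root.

```python
import shlex
import subprocess
import sys

# The three launch flags this recipe uses on RDNA3.
ROCM_FLAGS = [
    "--use-pytorch-cross-attention",
    "--disable-smart-memory",
    "--disable-pinned-memory",
]


def build_command(extra_args=()):
    """Return the argv list for launching ComfyUI with the flags from this recipe."""
    return [sys.executable, "main.py", *ROCM_FLAGS, *extra_args]


def launch(extra_args=()):
    """Start ComfyUI and block until it exits; call this from the ComfyUI repo root."""
    return subprocess.call(build_command(extra_args))


if __name__ == "__main__":
    # Print the command only; call launch() yourself to actually start the server.
    print(shlex.join(build_command()))
```

Extra ComfyUI arguments can be passed through, for example build_command(["--port", "8189"]).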

This starts the server (default http://127.0.0.1:8188). Open it in a browser and load the LongCat-Image text-to-image workflow. Two reasonable starting points:

  1. Community GGUF workflow — the comfy/Vantage-Longcat-Image.json file shipped alongside the GGUF weights. It already wires the UnetLoaderGGUF transformer loader, the FP8 Qwen2.5-VL encoder, and the ae.safetensors VAE used above.
  2. ComfyUI Core template: "LongCat Image: Text to Image". If you start here, swap the default UNETLoader for UnetLoaderGGUF and point it at LongCat-Image-Q4_K_M.gguf; leave the encoder on qwen_2.5_vl_7b_fp8_scaled.safetensors and the VAE on ae.safetensors.

Point the loaders at the three files you downloaded, enter a prompt, and queue. Generated PNGs land in ComfyUI/output/ with the full workflow embedded. The model card's reference example generates at 768×1344, guidance_scale=4.0, num_inference_steps=50. One important detail for text rendering: the HF model card cautions that for any image containing text you must enclose the target text in single or double quotation marks (English '...'/"..." or Chinese), because the model uses a character-level encoding strategy that only triggers on quoted content.
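The quoting rule is easy to forget when prompts are generated programmatically, so a pre-flight check may be useful. The sketch below is a hypothetical helper, not part of LongCat-Image or ComfyUI. It treats straight quotes and curly quotes (“…” and ‘…’) as valid; counting the curly pairs as the "Chinese" quotation marks is an assumption, so check the model card if you rely on other quote styles.

```python
import re

QUOTED = re.compile(
    r"(?<!\w)'[^']+'(?!\w)"  # straight single quotes, skipping apostrophes inside words
    r'|"[^"]+"'              # straight double quotes
    r"|“[^”]+”"              # curly double quotes (assumed to count as Chinese quotes)
    r"|‘[^’]+’"              # curly single quotes (same assumption)
)


def quoted_spans(prompt):
    """Return every quoted span in the prompt, quotation marks included."""
    return QUOTED.findall(prompt)


def unquoted_targets(prompt, targets):
    """Return the target strings that do not appear inside any quoted span."""
    spans = quoted_spans(prompt)
    return [t for t in targets if not any(t in span for span in spans)]


if __name__ == "__main__":
    prompt = 'A neon shop sign that says "OPEN" above a rainy street'
    missing = unquoted_targets(prompt, ["OPEN"])
    print("all target text is quoted" if not missing else f"unquoted: {missing}")
```

The check is deliberately simple: it does not handle nested or unbalanced quotes, and it only reports targets you list explicitly.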

Results

  • Speed: Not yet measured on the RX 7800 XT. The /check/longcat-image/rx-7800-xt endpoint currently has no benchmark for this pair (verdict: unknown), and no LongCat-Image timing measured on an RX 7800 XT has surfaced in research, so this recipe quotes no number instead of borrowing one from a different card. The 7800 XT has 624 GB/s of memory bandwidth and the GGUF diffusion + VAE-decode stages are memory-bound, so a real figure depends on the quant tier, resolution, and step count you pick. If you run it, please submit your timing via /contribute so a real figure lands on /check/longcat-image/rx-7800-xt.
  • VRAM usage: budget for a peak in the low-to-mid teens of GB. The Q4_K_M transformer is 3.66 GB and the FP8-scaled Qwen2.5-VL encoder is 9.38 GB on disk (upcasting toward a ~16 GB-class size during encode on RDNA3), but ComfyUI keeps only one of them resident on the GPU at a time, offloading the encoder before the diffusion step. Peak occurs during sampling (the 3.66 GB GGUF transformer + activations + latents + VAE decode), which is what keeps the recipe within 16 GB. By contrast, the vanilla BF16 diffusers path needs ~17 GB even with enable_model_cpu_offload(): the canonical model card Quick Start annotates that line "Required ~17 GB", and a LongCat-Image project collaborator states that the latest official code consumes approximately 18 GB of VRAM and supports inference on an RTX 4090 (Issue #8 comment by junqiangwu). That ~17–18 GB envelope is an NVIDIA-measured upstream footprint, not a 7800 XT figure; it is quoted here only to explain why the BF16 path overflows 16 GB and why this recipe leads with the GGUF route on this card. See /check/longcat-image/rx-7800-xt for any community-submitted measurement.
  • Quality notes: LongCat-Image is bilingual by design and the arXiv technical report (2512.07584) highlights multilingual text rendering as a primary target. Q4_K_M is the lead tier here for safe 16 GB headroom; step up to comfy/LongCat-Image-Q6_K.gguf (5.20 GB) or comfy/LongCat-Image-Q8_0.gguf (6.71 GB) from the same repo if you notice quality drop on fine details or small text — both still fit 16 GB because only one major model is resident at a time. The full BF16 transformer (12.54 GB) is the highest-quality tier and is what a 24 GB card such as the RX 7900 XTX can run un-quantized.

For the full benchmark data, see /check/longcat-image/rx-7800-xt.

Troubleshooting

"Torch not compiled with CUDA enabled"

This means a CUDA build of PyTorch got installed instead of the ROCm build. Per the ComfyUI README, uninstall and reinstall against the ROCm wheel index:

pip uninstall torch
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm7.2

Confirm the installed build is the ROCm one: python -c "import torch; print(torch.__version__)" should print a version with a +rocm7.2-style suffix, and torch.cuda.is_available() should return True (ROCm builds of PyTorch expose the GPU through the torch.cuda namespace via HIP).
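The same check can be scripted so it reports a diagnosis and not just raw values. This is a minimal sketch: classify_torch_build is a hypothetical helper, and it assumes that torch.version.hip is None on non-ROCm builds of PyTorch.

```python
def classify_torch_build(hip_version, gpu_available):
    """Return a short diagnosis from the values PyTorch reports about itself."""
    if hip_version is None:
        return "not a ROCm build: reinstall from the ROCm wheel index"
    if not gpu_available:
        return "ROCm build, but no GPU visible: check the ROCm driver install"
    return "ROCm build with a visible GPU"


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        print("PyTorch is not installed in this environment")
    else:
        # getattr guards against builds that lack the attribute entirely.
        hip = getattr(torch.version, "hip", None)
        print(f"torch {torch.__version__}, HIP {hip}")
        print(classify_torch_build(hip, torch.cuda.is_available()))
```

Run it inside the same Python environment that launches ComfyUI; a different virtual environment can hold a different PyTorch build.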

OOM during the diffusion step on a 16 GB card

If you loaded the BF16 transformer (12.54 GB) instead of the GGUF, or left the full BF16 5-shard Qwen2.5-VL encoder (16.58 GB) resident, you will OOM — neither leaves room for activations and the VAE on a 16 GB card. Use the Q4_K_M GGUF transformer and confirm ComfyUI is offloading the encoder before sampling: the --disable-smart-memory --disable-pinned-memory launch flags from the Running section force the aggressive CPU offload that makes the encoder→transformer swap fit. There is no FP8-precision escape hatch on RDNA3 (FP8 weights upcast to BF16), so the GGUF transformer is the memory lever, not encoder precision.

Crash during VAE decode, or unstable repeated runs

On RDNA3, ComfyUI's default attention selection can be unstable during VAE decode, and keeping models pinned in VRAM across runs can fail. The fix is the launch flags from the Running section: force SDPA with --use-pytorch-cross-attention, and add --disable-smart-memory --disable-pinned-memory to stop ComfyUI from holding pinned/smart-managed memory between runs. Do not reach for --use-split-cross-attention — it is not the right path on this card.

Do not install xformers or FlashAttention

HF and ComfyUI guides written for NVIDIA frequently suggest pip install xformers or a FlashAttention wheel. On RDNA3 these are the wrong path: the ROCm xformers fork is limited and a prebuilt FlashAttention wheel does not exist for gfx1101 consumer cards. ComfyUI routes attention through PyTorch SDPA on this stack — force it explicitly with --use-pytorch-cross-attention.

common questions
How much VRAM does LongCat Image need?

About 13 GB — the minimum this recipe targets.

Which GPUs is LongCat Image tested on?

RX 7800 XT (16 GB).

How hard is this setup?

Intermediate — follow the steps above.