LongCat-Image (base T2I) on RTX 5080: Bilingual 6B Text-to-Image at 16 GB via ComfyUI GGUF

What You'll Build

A working ComfyUI workflow that runs Meituan's LongCat-Image — a 6B-parameter bilingual (Chinese + English) diffusion transformer — on a single 16 GB RTX 5080. The base text-to-image variant is the focus; image-editing siblings are out of scope below.

Hardware data: RTX 5080 (16 GB VRAM) · 1024×1024 baseline at 20 steps · See benchmark data

⚠️ Why this recipe pins the base variant. Meituan publishes several siblings under the LongCat-Image brand and their fit on 16 GB differs. See "Sibling variants and what fits 16 GB" below before downloading anything.

⚠️ The vanilla diffusers path does not fit 16 GB. The canonical Quick Start code on the HF model card enables CPU offload with the comment "Offload to CPU to save VRAM (Required ~17 GB); slower but prevents OOM" (Quick Start section, meituan-longcat/LongCat-Image), so even the offloaded path needs about 1 GB more than the 5080 has. The Meituan team reports a similar profile: their latest official inference code "consumes approximately 18 GB of VRAM and supports inference on an RTX 4090" (Issue #8 comment by junqiangwu, a project collaborator). The 5080 has the same 16 GB GDDR7 capacity as the smaller 16 GB Blackwell cards, so the diffusers path does not fit here either. This recipe uses the ComfyUI + GGUF path because that is the only sourced configuration confirmed to run end-to-end on a 16 GB consumer card.

Sibling variants and what fits 16 GB

Variant	Purpose	16 GB fit (cited)
LongCat-Image (this recipe)	Final-release T2I, 6B params, BF16 transformer ~12.5 GB	Yes via GGUF — vantagewithai/LongCat-Image-GGUF ships per-tier ComfyUI files (Q4_K_M 3.66 GB · Q6_K 5.20 GB · Q8_0 6.71 GB · BF16 12.54 GB — sizes from the repo's `comfy/` folder via the HF tree API)
LongCat-Image-Edit	Image-to-image editing variant	Harder — `enable_model_cpu_offload()` does not currently work with the edit pipeline (see Troubleshooting). If you need 16 GB image editing on this card, /contribute a working workflow so we can publish one
LongCat-Image-Edit-Turbo	Distilled few-step edit variant	Same memory profile as Edit; out of scope for this recipe
LongCat-Image-Dev	Mid-training checkpoint, intended for fine-tuning, not inference	Out of scope for this recipe

Requirements

Component	Minimum	Tested
GPU	16 GB VRAM, CUDA-capable	RTX 5080 (16 GB)
RAM	32 GB system RAM (text encoder is CPU-offloaded between stages)	—
Storage	~25 GB free (Q4_K_M transformer 3.66 GB + FP8 Qwen2.5-VL text encoder 9.38 GB + VAE 0.34 GB; the full BF16 repo is 29.3 GB)	—
Software	ComfyUI v0.16.0+, ComfyUI-GGUF, cu128 PyTorch, Python 3.10+	—
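
The storage row is simple arithmetic. The sketch below sums the three download sizes quoted in the table and compares free disk space against the recommended ~25 GB. The function names and the `path` argument are illustrative only and are not part of ComfyUI.

```python
import shutil

# File sizes in GB, as quoted in the Requirements table.
DOWNLOADS_GB = {
    "LongCat-Image-Q4_K_M.gguf": 3.66,
    "qwen_2.5_vl_7b_fp8_scaled.safetensors": 9.38,
    "ae.safetensors": 0.34,
}
RECOMMENDED_FREE_GB = 25.0  # the recipe's recommendation, not a hard limit


def total_download_gb(sizes=DOWNLOADS_GB):
    """Sum the listed file sizes, rounded to two decimals."""
    return round(sum(sizes.values()), 2)


def free_gb(path="."):
    """Free space on the filesystem holding `path`, in decimal GB."""
    return shutil.disk_usage(path).free / 1e9


def has_room(path=".", needed_gb=RECOMMENDED_FREE_GB):
    """True if the filesystem holding `path` has at least `needed_gb` free."""
    return free_gb(path) >= needed_gb


if __name__ == "__main__":
    print(f"Downloads total {total_download_gb()} GB")
    print(f"Free: {free_gb():.1f} GB, enough for the recipe: {has_room()}")
```

The three files come to 13.38 GB, so the ~25 GB figure leaves spare room, for example for a second GGUF tier.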

Installation

1. Update ComfyUI

LongCat-Image landed in ComfyUI Core on 2026-03-05 (v0.16.0). Update your ComfyUI to that build or later before continuing.

cd ComfyUI
git pull
pip install -r requirements.txt

ℹ️ Blackwell (sm_120) wheel note. The RTX 5080 is a Blackwell sm_120 card, so the PyTorch build in your ComfyUI environment must come from the CUDA 12.8 (cu128) wheels, which ship the sm_120 kernels; older cu126/cu121 wheels will fail at the first CUDA call. If you bootstrap a fresh venv, install torch first: pip install torch --index-url https://download.pytorch.org/whl/cu128. Unlike the diffusers path, the GGUF/ComfyUI flow below does not touch FlashAttention-2, so the sm_120 FA2 kernel gap (Dao-AILab#2168) does not apply.
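
Before moving on, you can check which architectures your installed wheel was compiled for. This is a minimal sketch: `wheel_supports_sm120` and `describe_torch_build` are hypothetical helper names, while `torch.version.cuda` and `torch.cuda.get_arch_list()` are the PyTorch attributes being read. Run it inside the same environment ComfyUI uses.

```python
def wheel_supports_sm120(arch_list):
    """True if a PyTorch arch list (e.g. ['sm_90', 'sm_120']) covers Blackwell."""
    return "sm_120" in arch_list


def describe_torch_build():
    """Return (cuda_version, arch_list) for the installed torch, or None if absent."""
    try:
        import torch  # imported lazily so the helper above works without torch
    except ImportError:
        return None
    # get_arch_list() returns an empty list when CUDA is not available at all.
    return torch.version.cuda, torch.cuda.get_arch_list()


if __name__ == "__main__":
    build = describe_torch_build()
    if build is None:
        print("torch is not installed in this environment")
    else:
        cuda_version, archs = build
        print(f"CUDA build: {cuda_version}, archs: {archs}")
        print("sm_120 kernels present:", wheel_supports_sm120(archs))
```

If the last line prints False on a machine with a working driver, reinstalling torch from the cu128 index is the likely fix.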

2. Install ComfyUI-GGUF

The GGUF transformer requires city96's ComfyUI-GGUF custom node, which provides the UnetLoaderGGUF node.

cd ComfyUI/custom_nodes
git clone https://github.com/city96/ComfyUI-GGUF
cd ComfyUI-GGUF
pip install -r requirements.txt
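
A quick import check can confirm that the node's dependencies landed in the environment ComfyUI runs from. This sketch assumes the `gguf` Python package is among ComfyUI-GGUF's requirements, so confirm that against the repo's requirements.txt. `missing_modules` is a hypothetical helper.

```python
import importlib.util


def missing_modules(names):
    """Return the subset of top-level module names that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # Assumption: ComfyUI-GGUF depends on the `gguf` package.
    missing = missing_modules(["gguf"])
    print("missing:", missing or "none")
```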

3. Download the model files

The 16 GB path needs three pieces: the LongCat-Image diffusion transformer (GGUF Q4_K_M), the Qwen2.5-VL-7B text encoder (FP8-scaled), and the VAE. The Qwen2.5-VL text encoder is the largest component in the BF16 path (16.58 GB across five sharded safetensors in the canonical text_encoder/ folder), so the FP8-scaled repackage is essential. All three filenames below are the exact ones referenced by the Vantage ComfyUI workflow JSON (comfy/Vantage-Longcat-Image.json in the GGUF repo).

# LongCat-Image transformer (Q4_K_M — smallest tier that keeps full headroom for the encoder + VAE)
hf download vantagewithai/LongCat-Image-GGUF comfy/LongCat-Image-Q4_K_M.gguf \
  --local-dir ComfyUI/models/diffusion_models/

# Qwen2.5-VL-7B text encoder, FP8-scaled (Comfy-Org repackage, 9.38 GB)
hf download Comfy-Org/HunyuanVideo_1.5_repackaged \
  split_files/text_encoders/qwen_2.5_vl_7b_fp8_scaled.safetensors \
  --local-dir ComfyUI/models/text_encoders/

# VAE (Comfy-Org repackage, 0.34 GB)
hf download Comfy-Org/z_image_turbo split_files/vae/ae.safetensors \
  --local-dir ComfyUI/models/vae/

If you have headroom to spare and want maximum quality, swap the Q4_K_M transformer for comfy/LongCat-Image-Q6_K.gguf (5.20 GB) or comfy/LongCat-Image-Q8_0.gguf (6.71 GB) from the same repo — both still leave room on a 16 GB card because ComfyUI keeps only one major model resident at a time.
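
One thing to verify after downloading: `hf download` with `--local-dir` generally keeps each file's repo-relative path, so the files may land in nested folders such as models/diffusion_models/comfy/ rather than at the top of each model folder. Treat that as an assumption about your `hf` version and check your own tree. If a workflow then reports a model as missing, a minimal sketch like this one moves the file up to the top of its folder. `flatten_model_file` is a hypothetical helper, not a ComfyUI or Hugging Face API.

```python
import shutil
from pathlib import Path


def flatten_model_file(model_dir, filename):
    """Move `filename` from any subfolder of `model_dir` to its top level.

    Returns the final path, or None if the folder or file is not found.
    """
    root = Path(model_dir)
    if not root.is_dir():
        return None
    target = root / filename
    if target.exists():
        return target  # already at the top level
    matches = sorted(root.rglob(filename))
    if not matches:
        return None
    shutil.move(str(matches[0]), str(target))
    return target


if __name__ == "__main__":
    # Folders and filenames are the ones used in the download commands above.
    for folder, name in [
        ("ComfyUI/models/diffusion_models", "LongCat-Image-Q4_K_M.gguf"),
        ("ComfyUI/models/text_encoders", "qwen_2.5_vl_7b_fp8_scaled.safetensors"),
        ("ComfyUI/models/vae", "ae.safetensors"),
    ]:
        print(name, "->", flatten_model_file(folder, name))
```

Moving is optional. Re-selecting the nested file in each loader node's dropdown should work as well.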

Running

1. Launch ComfyUI: python main.py.
2. Load the LongCat-Image text-to-image workflow. Two reasonable starting points:
- ComfyUI Core workflow: comfy.org/workflows/image_longcat_text_to_image-c0a547f8fee6 (the official template, default 20 steps at CFG 4, tuned for 1024×1024).
- Community GGUF workflow: the comfy/Vantage-Longcat-Image.json file shipped alongside the GGUF weights. It already wires up the FP8 Qwen2.5-VL text encoder and the ae.safetensors VAE used above.
3. In the workflow, swap the default UNETLoader for UnetLoaderGGUF and point it at LongCat-Image-Q4_K_M.gguf. Leave the CLIPLoader pointed at qwen_2.5_vl_7b_fp8_scaled.safetensors and the VAELoader at ae.safetensors.
4. Hit Queue Prompt. The first run loads the encoder, then swaps in the transformer for sampling; subsequent runs load straight from disk.
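
The loader swap described above can also be scripted. The sketch below assumes a workflow exported in ComfyUI's API format (a JSON object mapping node ids to `class_type` and `inputs`) and assumes both loaders take the file through an input named `unet_name`. `swap_to_gguf_loader` is a hypothetical helper, so verify the node and input names against your own export.

```python
import copy


def swap_to_gguf_loader(workflow, gguf_name):
    """Return a copy of an API-format workflow with UNETLoader nodes
    replaced by UnetLoaderGGUF nodes pointing at `gguf_name`."""
    patched = copy.deepcopy(workflow)
    for node in patched.values():
        if node.get("class_type") == "UNETLoader":
            node["class_type"] = "UnetLoaderGGUF"
            # Assumption: the GGUF loader only needs the file name, so other
            # UNETLoader inputs (such as weight_dtype) are dropped.
            node["inputs"] = {"unet_name": gguf_name}
    return patched


if __name__ == "__main__":
    # Toy two-node workflow for illustration only; a real export has more nodes.
    example = {
        "1": {"class_type": "UNETLoader",
              "inputs": {"unet_name": "model.safetensors", "weight_dtype": "default"}},
        "2": {"class_type": "VAELoader", "inputs": {"vae_name": "ae.safetensors"}},
    }
    print(swap_to_gguf_loader(example, "LongCat-Image-Q4_K_M.gguf")["1"])
```

The function returns a patched copy and leaves the input untouched, so the original export stays available for comparison.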

The sampler defaults (20 steps at CFG 4, 1024×1024) come from the official ComfyUI workflow page above. The HF model card example uses guidance_scale=4.0, num_inference_steps=50, 768×1344; raise steps to 50 if you want to match the model card's reference settings. Note that LongCat-Image uses a character-level encoding for quoted text: the card's [!CAUTION] block instructs that for any text you want rendered in the image, "you must enclose the target text within single or double quotation marks". Omitting the quotes disables the text-rendering mechanism.
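
The quoting rule is easy to forget when prompts are assembled programmatically. A minimal sketch of a guard follows. Both helper names and the joining phrase "with the text" are illustrative assumptions, not a template taken from the model card.

```python
def quote_render_text(text):
    """Wrap `text` in double quotes unless it is already wrapped in a
    matching pair of single or double quotes."""
    stripped = text.strip()
    already_quoted = (
        len(stripped) >= 2
        and stripped[0] == stripped[-1]
        and stripped[0] in ("'", '"')
    )
    return stripped if already_quoted else f'"{stripped}"'


def build_prompt(scene, render_text):
    """Join a scene description with quoted text to render.
    The joining phrase is illustrative, not a required template."""
    return f"{scene}, with the text {quote_render_text(render_text)}"


if __name__ == "__main__":
    print(build_prompt("a neon shop sign at night", "OPEN"))
    print(build_prompt("a neon shop sign at night", "营业中"))
```

Paste the resulting string into the workflow's positive prompt node as usual.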

Results

Speed: no RTX 5080-specific benchmark has been published for LongCat-Image yet, so this recipe does not quote a generation-time figure. The 5080's memory bandwidth (~960 GB/s) is roughly 2× that of the smaller 16 GB Blackwell cards (~448 GB/s), so the diffusion and VAE-decode stages should be meaningfully faster than on those cards — but forward-extrapolating a number from a different card would be a guess, not a measurement. If you benchmark it, please /contribute so we can publish a real figure at /check/longcat-image/rtx-5080.
VRAM usage: budget the full 16 GB. The Q4_K_M transformer is 3.66 GB and the FP8 Qwen2.5-VL text encoder is 9.38 GB, but ComfyUI keeps only one of them resident on the GPU at a time, swapping between the encode and sampling stages. Peak occurs during the diffusion step itself — transformer + activations + latents + VAE decode. The vanilla diffusers BF16 path needs ~17 GB even with enable_model_cpu_offload() per the canonical model card, and ~18 GB per the Meituan team's own Issue #8 statement, which is why this recipe uses GGUF instead.
Quality notes: LongCat-Image is bilingual by design and the arXiv technical report (2512.07584) highlights multilingual text rendering as a primary target. Q4_K_M is the lead tier here for safe 16 GB headroom; step up to Q6_K or Q8_0 (both still fit) if you notice a quality drop in fine details or small text.
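
The headroom reasoning above can be made concrete with a rough budget. The file sizes below come from this recipe. The activation allowance (`ACTIVATION_ALLOWANCE_GB`) is a placeholder assumption chosen only for illustration, not a measured figure, and file size only approximates resident VRAM.

```python
# File sizes in GB, as quoted in this recipe.
TRANSFORMER_GB = {"Q4_K_M": 3.66, "Q6_K": 5.20, "Q8_0": 6.71, "BF16": 12.54}
TEXT_ENCODER_FP8_GB = 9.38
VAE_GB = 0.34

# Placeholder assumption for activations + latents at 1024x1024; NOT measured.
ACTIVATION_ALLOWANCE_GB = 4.0


def sampling_stage_gb(tier, allowance_gb=ACTIVATION_ALLOWANCE_GB):
    """Rough sampling-stage estimate: transformer + VAE + assumed allowance."""
    return round(TRANSFORMER_GB[tier] + VAE_GB + allowance_gb, 2)


def fits(tier, vram_gb=16.0, allowance_gb=ACTIVATION_ALLOWANCE_GB):
    """True if both stages fit, assuming only one major model is resident
    at a time (encoder while encoding, transformer while sampling)."""
    encode_ok = TEXT_ENCODER_FP8_GB <= vram_gb
    return encode_ok and sampling_stage_gb(tier, allowance_gb) <= vram_gb


if __name__ == "__main__":
    for tier in TRANSFORMER_GB:
        print(f"{tier}: ~{sampling_stage_gb(tier)} GB, fits 16 GB: {fits(tier)}")
```

Replace the allowance with what you observe on your own card. The verdict for the BF16 tier in particular depends on that number.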

For the full benchmark data, see /check/longcat-image/rtx-5080.

Troubleshooting

`pip install -r requirements.txt` errors with `No module named 'dskernels'`

The official Meituan infer_requirements.txt lists dskernels, which is not on PyPI — community user ghostnyambit reported the same blocker on Issue #8. This only bites if you go down the native-diffusers path. The ComfyUI + GGUF flow above does not import dskernels, so this error is a strong signal you've installed the wrong dependency tree. Skip the Meituan infer_requirements.txt entirely; install the ComfyUI-GGUF requirements instead.

OOM during the diffusion step on a 16 GB card

If you used the BF16 transformer (12.54 GB) instead of the GGUF, or left the full BF16 5-shard Qwen2.5-VL text encoder (16.58 GB) resident, you will OOM — neither leaves room for activations and the VAE on a 16 GB card. Use the Q4_K_M GGUF transformer and the FP8-scaled text encoder (qwen_2.5_vl_7b_fp8_scaled.safetensors) as in the Installation steps, and confirm ComfyUI is offloading the encoder before sampling. The diffusers BF16 + CPU-offload path needs ~17 GB and will not fit the 5080's 16 GB envelope.
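
To confirm the encoder really is offloaded before sampling, you can watch GPU memory while a job runs. This is a minimal sketch that assumes `nvidia-smi` is on your PATH. `parse_memory_csv` and `gpu_memory_mib` are hypothetical helper names.

```python
import subprocess

# Ask nvidia-smi for used and total memory per GPU, in MiB, without headers.
QUERY = ["nvidia-smi", "--query-gpu=memory.used,memory.total",
         "--format=csv,noheader,nounits"]


def parse_memory_csv(output):
    """Parse lines like '9120, 16303' into (used_mib, total_mib) tuples,
    one per GPU."""
    rows = []
    for line in output.strip().splitlines():
        used, total = (int(field.strip()) for field in line.split(","))
        rows.append((used, total))
    return rows


def gpu_memory_mib():
    """Run nvidia-smi and return per-GPU (used_mib, total_mib) tuples."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_memory_csv(result.stdout)
```

Calling `gpu_memory_mib()` once during text encoding and once during sampling shows whether usage drops between the two stages. If it keeps climbing, both models are probably staying resident.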

`enable_model_cpu_offload()` doesn't work on the Edit pipeline

Community user mingyi456 notes on Issue #8 that enable_model_cpu_offload() currently does not work with the edit pipeline. The base LongCatImagePipeline (this recipe) is unaffected. This is one reason the recipe is scoped to the base text-to-image variant only.

Blackwell sm_120 wheel mismatch

If ComfyUI crashes at the first CUDA call with a kernel/arch error, you are likely on a cu126/cu121 PyTorch wheel that lacks sm_120 kernels. Reinstall torch from the cu128 index (pip install torch --index-url https://download.pytorch.org/whl/cu128). This affects all Blackwell 50-series cards, including the 5080.