LongCat-Image (base T2I) on RTX 5060 Ti: Bilingual 6B Text-to-Image at 16 GB via ComfyUI GGUF

image · intermediate · 16GB+ VRAM · May 19, 2026
Prerequisites
  • NVIDIA RTX 5060 Ti (16 GB VRAM) or any other CUDA-capable consumer card with 16 GB+ VRAM
  • Python 3.10+
  • ComfyUI (recent build with native LongCat-Image support, March 2026 or later)
  • ComfyUI-GGUF custom node (city96)
  • 32 GB system RAM recommended (text-encoder offload uses CPU RAM)

What You'll Build

A working ComfyUI workflow that runs Meituan's LongCat-Image — a 6B-parameter bilingual (Chinese + English) diffusion transformer — on a single 16 GB RTX 5060 Ti. The base text-to-image variant is the focus; image-editing siblings are out of scope below.

Hardware data: RTX 5060 Ti (16 GB VRAM) · 1024×1024 baseline at 20 steps · See benchmark data

⚠️ Why this recipe pins the base variant. Meituan publishes four siblings under the LongCat-Image brand and their fit on 16 GB differs. See "Sibling variants and what fits 16 GB" below before downloading anything.

⚠️ The vanilla diffusers path does not fit 16 GB. The Meituan team's own statement on the GitHub repo is that the latest official inference code "consumes approximately 18 GB of VRAM and supports inference on an RTX 4090" (Issue #8 comment by junqiangwu, 2025-12-08). This recipe uses the ComfyUI + GGUF path because that is the only sourced configuration confirmed to run end-to-end on a 16 GB consumer card.

Sibling variants and what fits 16 GB

Variant | Purpose | 16 GB fit (cited)
LongCat-Image (this recipe) | Final-release T2I, 6B params, BF16 transformer ~12.5 GB | Yes via GGUF Q6_K / Q8_0: vantagewithai/LongCat-Image-GGUF ships per-tier files (Q4_K_M 3.66 GB · Q6_K 5.20 GB · Q8_0 6.71 GB · BF16 12.5 GB)
LongCat-Image-Edit | Image-to-image editing variant | Possible: user ZAVHome reports a 16 GB RTX 5070 run by patching pipeline_longcat_image_edit.py to load/unload Qwen2.5-VL and the transformer sequentially; 30 steps in 1:17. Not a turnkey path
LongCat-Image-Edit-Turbo | Distilled few-step edit variant | Same memory profile as Edit; if you need 16 GB image editing on this card, /contribute a working workflow so we can publish one
LongCat-Image-Dev | Mid-training checkpoint, intended for fine-tuning, not inference | Out of scope for this recipe

Requirements

Component | Minimum | Tested
GPU | 16 GB VRAM, CUDA-capable | RTX 5060 Ti (16 GB)
RAM | 32 GB system RAM (text encoder is CPU-offloaded) |
Storage | ~30 GB free (transformer GGUF + Qwen2.5-VL text encoder + VAE; full BF16 repo is 29.3 GB) |
Software | ComfyUI with native LongCat-Image support, ComfyUI-GGUF, Python 3.10+ |
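
Before downloading anything, it can help to confirm what the card actually reports. The following is a minimal sketch, assuming nvidia-smi is on your PATH. The helper names and the 16,000 MiB threshold are illustrative assumptions, not values from the LongCat or ComfyUI documentation.

```python
import subprocess

# Hypothetical threshold: a "16 GB" card typically reports a little under 16384 MiB.
MIN_VRAM_MIB = 16000


def parse_vram_mib(smi_output: str) -> list[int]:
    """Parse one integer (MiB) per GPU from the nvidia-smi CSV output."""
    return [int(line.strip()) for line in smi_output.splitlines() if line.strip()]


def query_vram_mib() -> list[int]:
    """Ask nvidia-smi for the total VRAM of every visible GPU, in MiB."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.total", "--format=csv,noheader,nounits"],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_vram_mib(result.stdout)


def meets_minimum(vram_mib: int, minimum: int = MIN_VRAM_MIB) -> bool:
    """True when the reported VRAM is in the 16 GB class or above."""
    return vram_mib >= minimum
```

Calling query_vram_mib() on the target machine and passing each value to meets_minimum() gives a quick yes/no per GPU; it says nothing about how much of that VRAM other processes are already using.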

Installation

1. Update ComfyUI

LongCat-Image landed in ComfyUI Core in March 2026. Update your ComfyUI to that build or later before continuing.

cd ComfyUI
git pull
pip install -r requirements.txt

2. Install ComfyUI-GGUF

The GGUF transformer requires city96's ComfyUI-GGUF custom node, which provides the UnetLoaderGGUF and CLIPLoader (GGUF) nodes.

cd ComfyUI/custom_nodes
git clone https://github.com/city96/ComfyUI-GGUF
cd ComfyUI-GGUF
pip install -r requirements.txt

3. Download the model files

The 16 GB path needs three pieces: the LongCat-Image diffusion transformer (GGUF), the Qwen2.5-VL-7B-Instruct text encoder, and a VAE. The Qwen2.5-VL text encoder is what eats the most VRAM in the BF16 path (~16.6 GB on its own, sourced from the text_encoder folder in the canonical repo — five sharded safetensors totalling ~16.57 GB), so picking a quantized version is essential.

# LongCat-Image transformer (pick ONE tier; Q8_0 is the recommended quality/size tradeoff for 16 GB).
# hf download keeps the repo-relative path, so this file lands under diffusion_models/org/.
hf download vantagewithai/LongCat-Image-GGUF org/LongCat-Image-Q8_0.gguf \
  --local-dir ComfyUI/models/diffusion_models/

# Qwen2.5-VL-7B text encoder + mmproj (GGUF). Place both in models/text_encoders/ —
# rename mmproj to Qwen2.5-VL-7B-Instruct-mmproj-BF16.gguf so ComfyUI-GGUF auto-detects it.
# Pick a Q4_K_M-class file (~5 GB) to leave room for the transformer.

# VAE, pulled from the canonical repo's vae/ folder (the file lands under models/vae/vae/)
hf download meituan-longcat/LongCat-Image vae/diffusion_pytorch_model.safetensors \
  --local-dir ComfyUI/models/vae/

The text-encoder filenames and a working layout (mmproj in the same folder as the main text encoder, auto-detection on selection) are documented in the QuantStack Qwen2.5-VL family of repos and used unchanged by the LongCat workflow — see for example the QuantStack/Qwen-Image-Edit-GGUF discussion on mmproj loading, which the LongCat ComfyUI Native Support workflow inherits.
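
Because hf download should preserve each file's repo-relative path under --local-dir, the downloads can end up one folder deeper than expected. The sketch below scans the three model folders recursively and flags anything missing, including an absent mmproj file. The function names are hypothetical, and the "mmproj" substring check is an assumption based on the renaming advice above.

```python
from pathlib import Path

# Folder name -> file pattern expected by the 16 GB path described above.
EXPECTED = {
    "diffusion_models": "*.gguf",
    "text_encoders": "*.gguf",
    "vae": "*.safetensors",
}


def find_model_files(models_dir: Path) -> dict[str, list[str]]:
    """List matching files per model folder, searching subfolders as well."""
    found: dict[str, list[str]] = {}
    for folder, pattern in EXPECTED.items():
        root = models_dir / folder
        if root.is_dir():
            found[folder] = sorted(
                p.relative_to(root).as_posix() for p in root.rglob(pattern)
            )
        else:
            found[folder] = []
    return found


def missing_pieces(found: dict[str, list[str]]) -> list[str]:
    """Return human-readable problems; an empty list means all pieces are present."""
    problems = [f"no files in {folder}" for folder, files in found.items() if not files]
    encoders = found.get("text_encoders", [])
    if encoders and not any("mmproj" in name.lower() for name in encoders):
        problems.append("no mmproj file next to the text encoder")
    return problems
```

Running missing_pieces(find_model_files(Path("ComfyUI/models"))) from the directory that contains ComfyUI should return an empty list once all three downloads are in place.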

Running

  1. Launch ComfyUI: python main.py.
  2. Load the LongCat-Image text-to-image workflow. Two reasonable starting points:
  3. In the workflow, swap the default UNETLoader for UnetLoaderGGUF and point it at LongCat-Image-Q8_0.gguf. Swap the default text-encoder loader for CLIPLoader (GGUF) pointing at the Qwen2.5-VL Q4_K_M file.
  4. Hit Queue Prompt. First run downloads any remaining auto-fetched assets; subsequent runs load straight from disk.
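
If you would rather queue runs from a script than from the browser, ComfyUI also accepts workflows over HTTP. This is a minimal sketch, assuming ComfyUI is listening on its default local address (127.0.0.1:8188) and that you have exported the workflow in API format from the ComfyUI menu. The function names and any workflow filename you pass in are hypothetical.

```python
import json
import urllib.request

# Assumption: ComfyUI's default local address; change it if you launched with other flags.
COMFYUI_URL = "http://127.0.0.1:8188"


def build_queue_request(workflow: dict, base_url: str = COMFYUI_URL) -> urllib.request.Request:
    """Build the POST /prompt request that queues an API-format workflow."""
    payload = json.dumps({"prompt": workflow}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/prompt",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def queue_workflow(path: str) -> dict:
    """Load an API-format workflow file, queue it, and return ComfyUI's JSON reply."""
    with open(path, encoding="utf-8") as handle:
        workflow = json.load(handle)
    with urllib.request.urlopen(build_queue_request(workflow)) as response:
        return json.load(response)
```

The regular workflow JSON saved from the canvas is a different format from the API export, so queue_workflow() expects the latter.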

The sampler defaults are 20 steps at CFG 4, 1024×1024 — these come from the official ComfyUI workflow page above. The HF model card example uses guidance_scale=4.0, num_inference_steps=50, 768×1344 — bump steps to 50 if you want the model-card-quality reference.

Results

  • Speed: user ZAVHome on the LongCat-Image-Edit HF discussion reports 30 sampling steps in 01:17 on a 16 GB RTX 5070 using the sequential-offload approach, i.e. 77 s / 30 ≈ 2.57 seconds per step. The base T2I path on a 5060 Ti should land in the same ballpark since the diffusion transformer is shared between Edit and base variants. Treat this as an order-of-magnitude figure until a 5060 Ti benchmark lands at /check/longcat-image/rtx-5060-ti.
  • VRAM usage: budget the full 16 GB. The transformer at Q8_0 is 6.71 GB, the Qwen2.5-VL text encoder at Q4_K_M is ~5 GB, and ComfyUI keeps only one of them resident on GPU at a time. Peak occurs during the diffusion step itself — transformer + activations + latents + VAE. The vanilla diffusers BF16 path needs ~18 GB even with enable_model_cpu_offload() per the Meituan team's own Issue #8 statement, which is why this recipe uses GGUF.
  • Quality notes: LongCat-Image is bilingual by design and the arXiv technical report (2512.07584) highlights multilingual text rendering as a primary target. Quality at Q8_0 should be close to BF16; Q6_K is the budget tier if VRAM gets tight after VAE decode. Q4_K_M is viable but expect mild quality drop on fine details.
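
The per-step figure in the speed note can be turned into a rough sampling-time estimate for the two step counts mentioned in this recipe. This is a sketch only: it extrapolates linearly from the single reported Edit run and ignores model loading, text encoding, and VAE decode, so read the results as order-of-magnitude numbers rather than a 5060 Ti benchmark.

```python
# Figures from the reported run: 30 sampling steps in 01:17.
REPORTED_STEPS = 30
REPORTED_SECONDS = 1 * 60 + 17

seconds_per_step = REPORTED_SECONDS / REPORTED_STEPS


def estimate_sampling_seconds(steps: int, per_step: float = seconds_per_step) -> float:
    """Linear extrapolation of sampling time from the per-step figure."""
    return steps * per_step


print(f"per step: ~{seconds_per_step:.2f} s")  # per step: ~2.57 s
for steps in (20, 50):
    # 20 steps: ~51 s, 50 steps: ~128 s
    print(f"{steps} steps: ~{estimate_sampling_seconds(steps):.0f} s")
```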

For the full benchmark data, see /check/longcat-image/rtx-5060-ti.

Troubleshooting

pip install -r requirements.txt errors with No module named 'dskernels'

The official Meituan infer_requirements.txt lists dskernels, which is not on PyPI — ghostnyambit reported the same blocker on Issue #8 (2025-12-11). This is only an issue if you go down the native-diffusers path. The ComfyUI + GGUF flow above does not import dskernels, so this error is a strong signal you've installed the wrong dependency tree. Skip the Meituan infer_requirements.txt entirely; install the ComfyUI-GGUF requirements instead.

OOM during the diffusion step on a 16 GB card

If you used the BF16 transformer instead of GGUF, you will OOM — the BF16 transformer alone is 12.5 GB (file sizes per the vantagewithai GGUF README's tier table), leaving no room for activations and the VAE on a 16 GB card. Switch the UnetLoaderGGUF to Q8_0 or Q6_K and rerun. Also make sure the Qwen2.5-VL text encoder is the GGUF version, not the BF16 5-shard set — the BF16 text encoder is ~16.6 GB on its own and will not coexist with the transformer in VRAM.
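
The arithmetic behind that advice can be sketched as follows. The tier sizes are the file sizes quoted for vantagewithai/LongCat-Image-GGUF. Treating file size as the resident weight footprint, and treating the card as exactly 16 GB, are simplifying assumptions, so the headroom values are approximate.

```python
CARD_VRAM_GB = 16.0

# Transformer file sizes in GB, per tier.
TRANSFORMER_GB = {"Q4_K_M": 3.66, "Q6_K": 5.20, "Q8_0": 6.71, "BF16": 12.5}


def headroom_gb(tier: str, card_gb: float = CARD_VRAM_GB) -> float:
    """VRAM left for activations, latents, and the VAE once the weights are loaded."""
    return round(card_gb - TRANSFORMER_GB[tier], 2)


for tier, size in TRANSFORMER_GB.items():
    print(f"{tier:7s} weights {size:5.2f} GB -> headroom {headroom_gb(tier):5.2f} GB")
```

Under these assumptions Q8_0 leaves about 9.3 GB of working room while BF16 leaves about 3.5 GB, which is why dropping to Q8_0 or Q6_K is the first thing to try.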

enable_model_cpu_offload() doesn't work on the Edit pipeline

User mingyi456 notes on Issue #8 that enable_model_cpu_offload() currently doesn't work with LongCatImageEditPipeline. The base LongCatImagePipeline is unaffected. This is another reason this recipe is scoped to the base variant only — the editing path needs the manual sequential-offload patch documented by ZAVHome in the linked HF discussion.

LiVeen's FP8 (LongCat-Image-Edit-FP8-e4m3fn) is unverified

A community FP8 quant of the Edit variant exists at LiVeen/LongCat-Image-Edit-FP8-e4m3fn. The author themselves stated on Issue #8 that "there is a fairly high likelihood that this model won't work without the rest of the diffusers stuff, or even at all" and that they have not tested it. Don't substitute it for the vantagewithai GGUF until somebody actually verifies it works — report results via /contribute if you try.