LongCat-Image (base T2I) on RTX 4060 Ti 16GB: Bilingual 6B Text-to-Image via ComfyUI GGUF

image · intermediate · 16GB+ VRAM · May 20, 2026
prerequisites
  • NVIDIA RTX 4060 Ti 16GB (or any 16 GB+ consumer card with CUDA support)
  • Python 3.10+
  • ComfyUI v0.16.0 or later (native LongCat-Image support landed 2026-03-05)
  • ComfyUI-GGUF custom node (city96)
  • 32 GB system RAM recommended (text-encoder offload uses CPU RAM)

What You'll Build

A working ComfyUI workflow that runs Meituan's LongCat-Image — a 6B-parameter bilingual (Chinese + English) diffusion transformer — on a single 16 GB RTX 4060 Ti. The base text-to-image variant is the focus; image-editing siblings are out of scope below.

Hardware data: RTX 4060 Ti 16GB · 1024×1024 baseline at 20 steps · See benchmark data

⚠️ Why this recipe pins the base variant. Meituan publishes four siblings under the LongCat-Image brand and their fit on 16 GB differs. See "Sibling variants and what fits 16 GB" below before downloading anything.

⚠️ The vanilla diffusers path does not fit 16 GB. The Meituan team's own statement on the GitHub repo is that the latest official inference code "consumes approximately 18 GB of VRAM and supports inference on an RTX 4090" (Issue #8 comment by junqiangwu, 2025-12-08). The HF model card itself confirms this profile: "CPU offload mode: Requires ~17 GB VRAM" per the Quick Start section on meituan-longcat/LongCat-Image. This recipe uses the ComfyUI + GGUF path because that is the only sourced configuration confirmed to run end-to-end on a 16 GB consumer card.

Sibling variants and what fits 16 GB

| Variant | Purpose | 16 GB fit (cited) |
| --- | --- | --- |
| LongCat-Image (this recipe) | Final-release T2I, 6B params, BF16 transformer ~12.5 GB | Yes via GGUF Q6_K / Q8_0; vantagewithai/LongCat-Image-GGUF ships per-tier files (Q4_K_M 3.66 GB · Q6_K 5.2 GB · Q8_0 6.71 GB · BF16 12.5 GB) |
| LongCat-Image-Edit | Image-to-image editing variant | Possible: user ZAVHome reports a 16 GB run on a Blackwell card by patching pipeline_longcat_image_edit.py to load/unload Qwen2.5-VL and the transformer sequentially (LongCat-Image-Edit discussion #2). Not a turnkey path |
| LongCat-Image-Edit-Turbo | Distilled few-step edit variant | Same memory profile as Edit; if you need 16 GB image editing on this card, /contribute a working workflow so we can publish one |
| LongCat-Image-Dev | Mid-training checkpoint, intended for fine-tuning, not inference | Out of scope for this recipe |

Requirements

| Component | Minimum | Tested |
| --- | --- | --- |
| GPU | 16 GB VRAM, CUDA-capable | RTX 4060 Ti 16GB |
| RAM | 32 GB system RAM (text encoder is CPU-offloaded) | |
| Storage | ~30 GB free (transformer GGUF + Qwen2.5-VL text encoder GGUF + VAE; the full BF16 upstream repo is 29.3 GB) | |
| Software | ComfyUI v0.16.0+, ComfyUI-GGUF, Python 3.10+ | |

Installation

1. Update ComfyUI

LongCat-Image landed in ComfyUI Core in v0.16.0 on 2026-03-05. Update your ComfyUI to that build or later before continuing.

```shell
cd ComfyUI
git pull
pip install -r requirements.txt
```

The default pip install torch wheel runs on sm_89 (Ada Lovelace) GPUs out of the box, so no special wheel selection is required for the RTX 4060 Ti 16GB. FlashAttention-2 also supports Ada GPUs, so you can leave attn_implementation at its default if you wire one up later.
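
To confirm what the installed wheel reports for your card, a minimal sketch along these lines may help. It assumes PyTorch is importable in the ComfyUI environment and degrades to a message if it is not; `sm_tag` and `local_gpu_report` are illustrative helper names, not part of ComfyUI or PyTorch. Note that an sm_89 GPU can also run kernels compiled for an earlier 8.x target such as sm_80 or sm_86, so a wheel whose architecture list lacks a literal `sm_89` entry is not by itself a problem.

```python
def sm_tag(capability: tuple[int, int]) -> str:
    """Format a CUDA compute capability such as (8, 9) as 'sm_89'."""
    major, minor = capability
    return f"sm_{major}{minor}"


def local_gpu_report() -> str:
    """Describe the first CUDA device, or explain why it could not be queried."""
    try:
        import torch  # imported lazily so sm_tag() works without PyTorch
    except ImportError:
        return "PyTorch is not installed in this environment"
    if not torch.cuda.is_available():
        return "PyTorch is installed but no CUDA device is visible"
    tag = sm_tag(torch.cuda.get_device_capability(0))
    built_for = torch.cuda.get_arch_list()  # architectures compiled into this wheel
    return f"{torch.cuda.get_device_name(0)}: {tag}, wheel built for {built_for}"


print(local_gpu_report())
```

On an RTX 4060 Ti the capability tag should read sm_89; the architecture list depends on the wheel you installed.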

2. Install ComfyUI-GGUF

The GGUF transformer requires city96's ComfyUI-GGUF custom node, which provides the UnetLoaderGGUF and CLIPLoader (GGUF) nodes.

```shell
cd ComfyUI/custom_nodes
git clone https://github.com/city96/ComfyUI-GGUF
cd ComfyUI-GGUF
pip install -r requirements.txt
```

3. Download the model files

The 16 GB path needs three pieces: the LongCat-Image diffusion transformer (GGUF), the Qwen2.5-VL-7B-Instruct text encoder, and a VAE. The Qwen2.5-VL text encoder is what eats the most VRAM in the BF16 path (~16.57 GB on its own — five sharded safetensors in the text_encoder/ folder of the canonical repo), so picking a quantized version is essential.

```shell
# LongCat-Image transformer (pick ONE tier; Q8_0 is the recommended quality/size tradeoff for 16 GB)
hf download vantagewithai/LongCat-Image-GGUF org/LongCat-Image-Q8_0.gguf \
  --local-dir ComfyUI/models/diffusion_models/

# Qwen2.5-VL-7B text encoder + mmproj (GGUF). Place both in models/text_encoders/ and
# rename mmproj to Qwen2.5-VL-7B-Instruct-mmproj-BF16.gguf so ComfyUI-GGUF auto-detects it.
# Pick a Q4_K_M-class file (~5 GB) to leave room for the transformer.

# VAE, pulled from the canonical repo's vae/ folder
hf download meituan-longcat/LongCat-Image vae/diffusion_pytorch_model.safetensors \
  --local-dir ComfyUI/models/vae/
```

The text-encoder filenames and a working layout (mmproj in the same folder as the main text encoder, auto-detection on selection) are documented in the QuantStack Qwen2.5-VL family of repos and used unchanged by the LongCat workflow — see for example the QuantStack/Qwen-Image-Edit-GGUF discussion on mmproj loading, which the LongCat ComfyUI Native Support workflow inherits.
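
Before launching, it can be worth checking that the files ended up where the loaders will look. The sketch below is one way to do that, under two assumptions: `hf download --local-dir` keeps each file's repo subfolder (so the transformer lands under `org/` and the VAE under `vae/` inside the target directory), and the main text-encoder filename is left for you to add because this recipe does not pin one. `missing_files` and `EXPECTED` are illustrative names.

```python
from pathlib import Path


def missing_files(models_root, expected):
    """Return the expected relative paths that do not exist under models_root."""
    root = Path(models_root)
    return [rel for rel in expected if not (root / rel).is_file()]


# Paths derived from the download commands in this recipe (assumption: the
# repo subfolders are preserved). Append your chosen Qwen2.5-VL text-encoder
# GGUF filename under text_encoders/ once you have picked one.
EXPECTED = [
    "diffusion_models/org/LongCat-Image-Q8_0.gguf",
    "vae/vae/diffusion_pytorch_model.safetensors",
    "text_encoders/Qwen2.5-VL-7B-Instruct-mmproj-BF16.gguf",
]

for rel in missing_files("ComfyUI/models", EXPECTED):
    print("missing:", rel)
```

An empty output means every listed file was found; adjust the paths if you moved files out of their subfolders after downloading.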

Running

  1. Launch ComfyUI: python main.py.
  2. Load the LongCat-Image text-to-image workflow. Two reasonable starting points:
  3. In the workflow, swap the default UNETLoader for UnetLoaderGGUF and point it at LongCat-Image-Q8_0.gguf. Swap the default text-encoder loader for CLIPLoader (GGUF) pointing at the Qwen2.5-VL Q4_K_M file.
  4. Hit Queue Prompt. First run downloads any remaining auto-fetched assets; subsequent runs load straight from disk.

The HF model card's reference example uses guidance_scale=4.0, num_inference_steps=50, 768×1344 — bump steps to 50 if you want the model-card-quality reference; 20 steps is a reasonable preview default.
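
As a rough guide to what the 50-step reference settings cost relative to the 20-step preview, the arithmetic can be sketched as follows. This assumes sampling time scales linearly with pixel count times step count, which ignores fixed overheads (text encoding, VAE decode) and the faster-than-linear growth of attention cost with resolution, so treat the result as an estimate only. `relative_cost` is an illustrative helper.

```python
def relative_cost(width, height, steps, base=(1024, 1024, 20)):
    """Rough sampling cost relative to a baseline, assuming time ~ pixels x steps."""
    base_width, base_height, base_steps = base
    return (width * height * steps) / (base_width * base_height * base_steps)


# Model-card reference (768x1344, 50 steps) vs. this recipe's 1024x1024, 20-step baseline.
print(round(relative_cost(768, 1344, 50), 2))  # -> 2.46
```

The two resolutions have nearly the same pixel count, so almost all of the extra cost comes from the step count.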

Results

  • Speed: Not yet measured on the RTX 4060 Ti 16GB. The 5060 Ti sibling recipe quoted a 16 GB Blackwell figure for the Edit variant (per the ZAVHome HF discussion), but that source measures LongCat-Image-Edit on an RTX 5070 (672 GB/s memory bandwidth) and not the base T2I path on a 4060 Ti (288 GB/s) — so the number does not transfer. Once a community run lands at /check/longcat-image/rtx-4060-ti-16gb, this section gets updated.
  • VRAM usage: budget the full 16 GB. The transformer at Q8_0 is 6.71 GB and the Qwen2.5-VL text encoder at Q4_K_M is ~5 GB per the vantagewithai/LongCat-Image-GGUF per-tier table; ComfyUI keeps only one of them resident on GPU at a time. The vanilla diffusers BF16 path needs ~17–18 GB even with enable_model_cpu_offload() per the HF model card Quick Start and the Meituan team's Issue #8 statement, which is why this recipe uses GGUF. 16 GB is a tight fit — close other GPU-using applications before launching.
  • Quality notes: LongCat-Image is bilingual by design and the arXiv technical report (2512.07584) highlights multilingual text rendering as a primary target. Quality at Q8_0 should be close to BF16; Q6_K is the budget tier if VRAM gets tight after VAE decode. Q4_K_M is viable but expect mild quality drop on fine details.
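
The VRAM arithmetic behind these notes can be sketched as below. The sizes are the weight file sizes quoted in this recipe; activations, the VAE and CUDA context overhead are not modelled, so actual usage will be higher than these figures. `peak_weights_gb` is an illustrative helper, and treating on-disk size as in-VRAM size is an approximation.

```python
# Weight sizes in GB as quoted in this recipe (weights only).
TRANSFORMER_GB = {"Q4_K_M": 3.66, "Q6_K": 5.2, "Q8_0": 6.71, "BF16": 12.5}
TEXT_ENCODER_Q4_GB = 5.0      # "~5 GB" Q4_K_M-class Qwen2.5-VL file
TEXT_ENCODER_BF16_GB = 16.57  # five-shard BF16 text encoder


def peak_weights_gb(transformer_gb, text_encoder_gb, sequential=True):
    """Peak weight footprint: the larger model if loaded one at a time, else the sum."""
    if sequential:
        return max(transformer_gb, text_encoder_gb)
    return round(transformer_gb + text_encoder_gb, 2)


for tier, size in TRANSFORMER_GB.items():
    one_at_a_time = peak_weights_gb(size, TEXT_ENCODER_Q4_GB)
    both = peak_weights_gb(size, TEXT_ENCODER_Q4_GB, sequential=False)
    print(f"{tier:7s} sequential {one_at_a_time:5.2f} GB | both resident {both:5.2f} GB")
```

With these numbers, the Q8_0 transformer plus the Q4_K_M text encoder total 11.71 GB of weights even if both were resident at once, whereas the BF16 transformer plus the BF16 text encoder come to 29.07 GB.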

For the full benchmark data, see /check/longcat-image/rtx-4060-ti-16gb.

Troubleshooting

pip install -r requirements.txt errors with No module named 'dskernels'

The official Meituan infer_requirements.txt lists dskernels, which is not on PyPI — ghostnyambit reported the same blocker on Issue #8 (2025-12-11). This is only an issue if you go down the native-diffusers path. The ComfyUI + GGUF flow above does not import dskernels, so this error is a strong signal you've installed the wrong dependency tree. Skip the Meituan infer_requirements.txt entirely; install the ComfyUI-GGUF requirements instead.

OOM during the diffusion step on a 16 GB card

If you used the BF16 transformer instead of GGUF, you will OOM — the BF16 transformer alone is 12.5 GB (file sizes per the vantagewithai GGUF README's tier table), leaving no room for activations and the VAE on a 16 GB card. Switch the UnetLoaderGGUF to Q8_0 or Q6_K and rerun. Also make sure the Qwen2.5-VL text encoder is the GGUF version, not the BF16 5-shard set — the BF16 text encoder is ~16.57 GB on its own and will not coexist with the transformer in VRAM.
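
To see how much VRAM other applications are already holding before you launch, one option is to query `nvidia-smi`. The sketch below assumes `nvidia-smi` is on your PATH and uses its CSV query output, which reports values in MiB; the helper names are illustrative, and the function returns `None` when the tool is unavailable.

```python
import subprocess

QUERY = [
    "nvidia-smi",
    "--query-gpu=memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_memory_line(line):
    """Parse one 'used, total' CSV line (values in MiB) into two integers."""
    used, total = (int(field.strip()) for field in line.split(","))
    return used, total


def free_mib(line):
    """Free memory in MiB for one parsed nvidia-smi line."""
    used, total = parse_memory_line(line)
    return total - used


def query_first_gpu():
    """Return free MiB on the first GPU, or None if nvidia-smi cannot be run."""
    try:
        out = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
    except (FileNotFoundError, subprocess.CalledProcessError):
        return None
    lines = out.strip().splitlines()
    return free_mib(lines[0]) if lines else None


print("free MiB on GPU 0:", query_first_gpu())
```

If the reported free memory is well below the card's total before ComfyUI starts, close the applications holding it and retry.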

enable_model_cpu_offload() doesn't work on the Edit pipeline

User mingyi456 notes on Issue #8 that enable_model_cpu_offload() currently doesn't work with LongCatImageEditPipeline. The base LongCatImagePipeline is unaffected. This is another reason this recipe is scoped to the base variant only — the editing path needs the manual sequential-offload patch documented by ZAVHome in the linked HF discussion.

LiVeen's FP8 (LongCat-Image-Edit-FP8-e4m3fn) is unverified

A community FP8 quant of the Edit variant exists at LiVeen/LongCat-Image-Edit-FP8-e4m3fn. The author themselves stated on Issue #8 that "there is a fairly high likelihood that this model won't work without the rest of the diffusers stuff, or even at all" and that they have not tested it. Don't substitute it for the vantagewithai GGUF until somebody actually verifies it works — report results via /contribute if you try.