LongCat-Image (base T2I) on RTX 4060 Ti 16GB: Bilingual 6B Text-to-Image via ComfyUI GGUF

What You'll Build

A working ComfyUI workflow that runs Meituan's LongCat-Image — a 6B-parameter bilingual (Chinese + English) diffusion transformer — on a single 16 GB RTX 4060 Ti. The base text-to-image variant is the focus; image-editing siblings are out of scope below.

Hardware data: RTX 4060 Ti 16GB · 1024×1024 baseline at 20 steps · See benchmark data

⚠️ Why this recipe pins the base variant. Meituan publishes four siblings under the LongCat-Image brand and their fit on 16 GB differs. See "Sibling variants and what fits 16 GB" below before downloading anything.

⚠️ The vanilla diffusers path does not fit 16 GB. The Meituan team's own statement on the GitHub repo is that the latest official inference code "consumes approximately 18 GB of VRAM and supports inference on an RTX 4090" (Issue #8 comment by junqiangwu, 2025-12-08). The HF model card itself confirms this profile: "CPU offload mode: Requires ~17 GB VRAM" per the Quick Start section on meituan-longcat/LongCat-Image. This recipe uses the ComfyUI + GGUF path because that is the only sourced configuration confirmed to run end-to-end on a 16 GB consumer card.

Sibling variants and what fits 16 GB

Variant	Purpose	16 GB fit (cited)
LongCat-Image (this recipe)	Final-release T2I, 6B params, BF16 transformer ~12.5 GB	Yes via GGUF Q6_K / Q8_0 — vantagewithai/LongCat-Image-GGUF ships per-tier files (Q4_K_M 3.66 GB · Q6_K 5.2 GB · Q8_0 6.71 GB · BF16 12.5 GB)
LongCat-Image-Edit	Image-to-image editing variant	Possible — user `ZAVHome` reports a 16 GB run on a Blackwell card by patching `pipeline_longcat_image_edit.py` to load/unload Qwen2.5-VL and the transformer sequentially (LongCat-Image-Edit discussion #2). Not a turnkey path
LongCat-Image-Edit-Turbo	Distilled few-step edit variant	Same memory profile as Edit; if you need 16 GB image editing on this card, submit a working workflow via /contribute so we can publish one
LongCat-Image-Dev	Mid-training checkpoint, intended for fine-tuning, not inference	Out of scope for this recipe
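
The per-tier file sizes in the table lend themselves to a quick selection helper. The following is a minimal sketch, assuming the sizes quoted above from the vantagewithai/LongCat-Image-GGUF tier list. The function name `largest_tier_that_fits` and the example budgets are hypothetical, and file size is only a rough proxy for VRAM use, since activations, the text encoder, and the VAE need room too.

```python
from typing import Optional

# Transformer file sizes in GB, as quoted in the variant table above.
TIER_SIZES_GB = {
    "Q4_K_M": 3.66,
    "Q6_K": 5.2,
    "Q8_0": 6.71,
    "BF16": 12.5,
}


def largest_tier_that_fits(budget_gb: float) -> Optional[str]:
    """Return the largest tier whose file size is within budget_gb, or None."""
    fitting = {name: size for name, size in TIER_SIZES_GB.items() if size <= budget_gb}
    if not fitting:
        return None
    return max(fitting, key=fitting.get)


# Hypothetical budgets: how much VRAM you are willing to give the transformer.
for budget in (4.0, 6.0, 8.0):
    print(f"{budget:.1f} GB budget -> {largest_tier_that_fits(budget)}")
```

Treat the result as a starting point: the recipe recommends Q8_0 for this card, with Q6_K as the fallback if memory gets tight.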

Requirements

Component	Minimum	Tested
GPU	16 GB VRAM, CUDA-capable	RTX 4060 Ti 16GB
RAM	32 GB system RAM (text encoder is CPU-offloaded)	—
Storage	~30 GB free (transformer GGUF + Qwen2.5-VL text encoder GGUF + VAE; the full BF16 upstream repo is 29.3 GB)	—
Software	ComfyUI v0.16.0+, ComfyUI-GGUF, Python 3.10+	—
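
Two of the rows above (Python version and free disk space) can be checked with the standard library before you download anything. This is a hedged sketch: the thresholds come from the table, while the function name `preflight_problems` is hypothetical and the disk check only looks at the drive holding the current directory.

```python
import shutil
import sys

MIN_PYTHON = (3, 10)    # from the Software row
MIN_FREE_DISK_GB = 30   # from the Storage row (approximate)


def preflight_problems(python_version, free_disk_gb):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    if tuple(python_version[:2]) < MIN_PYTHON:
        problems.append(
            f"Python {python_version[0]}.{python_version[1]} is older than 3.10"
        )
    if free_disk_gb < MIN_FREE_DISK_GB:
        problems.append(
            f"only {free_disk_gb:.1f} GB free, need about {MIN_FREE_DISK_GB} GB"
        )
    return problems


# Check the drive that holds the current directory (run from your ComfyUI folder).
free_gb = shutil.disk_usage(".").free / 1e9
for problem in preflight_problems(sys.version_info, free_gb):
    print("WARNING:", problem)
```

No output means both checks passed. GPU and system RAM are not covered here.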

Installation

1. Update ComfyUI

LongCat-Image landed in ComfyUI Core in v0.16.0 on 2026-03-05. Update your ComfyUI to that build or later before continuing.

cd ComfyUI
git pull
pip install -r requirements.txt

The default pip install torch wheels run on Ada Lovelace GPUs (compute capability 8.9, sm_89) out of the box, so no special wheel selection is required for the RTX 4060 Ti 16GB. FlashAttention-2 also supports Ada GPUs, so you can leave attn_implementation at its default if you wire one up later.
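
To confirm what PyTorch actually sees after the update, you can query the device's compute capability and memory. The sketch below assumes PyTorch is installed in the same environment as ComfyUI; the helper `describe_gpu` is a hypothetical name, and the script falls back to a message when PyTorch or CUDA is unavailable.

```python
def describe_gpu(capability, total_memory_bytes):
    """Summarize whether a GPU matches this recipe's target (Ada, 16 GB class)."""
    major, minor = capability
    return {
        "compute_capability": f"{major}.{minor}",
        "is_ada_sm89": (major, minor) == (8, 9),
        "vram_gib": round(total_memory_bytes / 2**30, 1),
    }


try:
    import torch
except ImportError:
    torch = None

if torch is not None and torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    summary = describe_gpu(torch.cuda.get_device_capability(0), props.total_memory)
    print(torch.cuda.get_device_name(0), summary)
else:
    print("PyTorch with CUDA is not available in this environment.")
```

On an RTX 4060 Ti the capability should read 8.9. The reported memory is usually slightly below the nominal 16 GB.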

2. Install ComfyUI-GGUF

The GGUF transformer requires city96's ComfyUI-GGUF custom node, which provides the UnetLoaderGGUF and CLIPLoader (GGUF) nodes.

cd ComfyUI/custom_nodes
git clone https://github.com/city96/ComfyUI-GGUF
cd ComfyUI-GGUF
pip install -r requirements.txt

3. Download the model files

The 16 GB path needs three pieces: the LongCat-Image diffusion transformer (GGUF), the Qwen2.5-VL-7B-Instruct text encoder, and a VAE. The Qwen2.5-VL text encoder is what eats the most VRAM in the BF16 path (~16.57 GB on its own — five sharded safetensors in the text_encoder/ folder of the canonical repo), so picking a quantized version is essential.

# LongCat-Image transformer (pick ONE tier — Q8_0 is the recommended quality/size tradeoff for 16 GB)
hf download vantagewithai/LongCat-Image-GGUF org/LongCat-Image-Q8_0.gguf \
  --local-dir ComfyUI/models/diffusion_models/

# Qwen2.5-VL-7B text encoder + mmproj (GGUF). Place both in models/text_encoders/ —
# rename mmproj to Qwen2.5-VL-7B-Instruct-mmproj-BF16.gguf so ComfyUI-GGUF auto-detects it.
# Pick a Q4_K_M-class file (~5 GB) to leave room for the transformer.

# VAE — pulled from the canonical repo's vae/ folder
hf download meituan-longcat/LongCat-Image vae/diffusion_pytorch_model.safetensors \
  --local-dir ComfyUI/models/vae/

The text-encoder filenames and a working layout (mmproj in the same folder as the main text encoder, auto-detected when you select it) are documented in the QuantStack Qwen2.5-VL family of repos, and the LongCat workflow uses them unchanged. See, for example, the QuantStack/Qwen-Image-Edit-GGUF discussion on mmproj loading, which the LongCat ComfyUI Native Support workflow inherits.
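
One detail worth checking after the downloads: `hf download` with `--local-dir` keeps the file's repo-relative path, so a file requested as `vae/diffusion_pytorch_model.safetensors` lands in a `vae/` subfolder of the target directory. The sketch below computes where the two files from the commands above should end up and reports whether they exist. The helper name `downloaded_path` is hypothetical, and the text encoder is left out because its filename depends on the file you picked.

```python
from pathlib import Path


def downloaded_path(local_dir: str, repo_filename: str) -> Path:
    """Where `hf download REPO FILE --local-dir DIR` places FILE.

    The repo-relative path (including any subfolder) is kept under local_dir.
    """
    return Path(local_dir) / repo_filename


# Paths taken from the download commands above.
EXPECTED_FILES = [
    downloaded_path("ComfyUI/models/diffusion_models", "org/LongCat-Image-Q8_0.gguf"),
    downloaded_path("ComfyUI/models/vae", "vae/diffusion_pytorch_model.safetensors"),
]

for path in EXPECTED_FILES:
    status = "ok" if path.is_file() else "MISSING"
    print(f"{status:8} {path}")
```

Run it from the directory that contains ComfyUI/. Files in subfolders normally appear in the loader dropdowns with the subfolder prefix.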

Running

1. Launch ComfyUI: python main.py.
2. Load the LongCat-Image text-to-image workflow. Two reasonable starting points:
   - ComfyUI Core workflow: comfy.org/workflows/image_longcat_text_to_image-c0a547f8fee6 (the official template, tuned for 1024×1024).
   - Community GGUF workflow: the /comfy/Vantage-Longcat-Image.json file shipped alongside the GGUF weights.
3. In the workflow, swap the default UNETLoader for UnetLoaderGGUF and point it at LongCat-Image-Q8_0.gguf. Swap the default text-encoder loader for CLIPLoader (GGUF) pointing at the Qwen2.5-VL Q4_K_M file.
4. Hit Queue Prompt. First run downloads any remaining auto-fetched assets; subsequent runs load straight from disk.
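
If you keep the workflow as an API-format JSON export (a dict of node id to `class_type` and `inputs`), the loader swap described above can also be scripted. This is a sketch under stated assumptions rather than a supported tool: it assumes the core class types are `UNETLoader` and `CLIPLoader`, that ComfyUI-GGUF registers `UnetLoaderGGUF` and `CLIPLoaderGGUF`, and that the inputs are named `unet_name` and `clip_name`. The function name, the toy workflow, and the placeholder text-encoder filename are all hypothetical.

```python
import copy


def swap_in_gguf_loaders(workflow: dict, unet_gguf: str, clip_gguf: str) -> dict:
    """Return a copy of an API-format workflow with GGUF loaders swapped in."""
    patched = copy.deepcopy(workflow)
    for node in patched.values():
        inputs = node.setdefault("inputs", {})
        if node.get("class_type") == "UNETLoader":
            node["class_type"] = "UnetLoaderGGUF"
            # Core-loader option; assumed not to be used by the GGUF loader.
            inputs.pop("weight_dtype", None)
            inputs["unet_name"] = unet_gguf
        elif node.get("class_type") == "CLIPLoader":
            node["class_type"] = "CLIPLoaderGGUF"
            inputs["clip_name"] = clip_gguf
    return patched


# Toy workflow for illustration only; a real export has many more nodes.
toy = {
    "1": {"class_type": "UNETLoader",
          "inputs": {"unet_name": "model.safetensors", "weight_dtype": "default"}},
    "2": {"class_type": "CLIPLoader",
          "inputs": {"clip_name": "encoder.safetensors", "type": "from-template"}},
}
patched = swap_in_gguf_loaders(
    toy, "LongCat-Image-Q8_0.gguf", "YOUR-QWEN2.5-VL-Q4_K_M.gguf"
)
print(patched["1"]["class_type"], patched["2"]["class_type"])
```

Open the patched workflow in the ComfyUI interface before relying on it, since node input names can change between versions.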

The HF model card's reference example uses guidance_scale=4.0, num_inference_steps=50, 768×1344 — bump steps to 50 if you want the model-card-quality reference; 20 steps is a reasonable preview default.
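
To get a feel for what the jump from preview to reference settings costs, a rough proxy is steps multiplied by pixel count. The sketch below uses only the figures quoted above. The proxy itself is an assumption: real sampling time does not scale exactly linearly with pixels, so read the result as an order-of-magnitude hint.

```python
# Settings quoted in the text: the recipe's preview baseline and the
# model card's reference example.
PREVIEW = {"size": (1024, 1024), "steps": 20}
REFERENCE = {"size": (768, 1344), "steps": 50, "guidance_scale": 4.0}


def relative_cost(candidate, baseline):
    """Rough cost ratio: (steps x pixels) of candidate over baseline."""
    def work(cfg):
        side_a, side_b = cfg["size"]
        return cfg["steps"] * side_a * side_b
    return work(candidate) / work(baseline)


print(f"Reference vs preview: about {relative_cost(REFERENCE, PREVIEW):.2f}x the work")
```

The two resolutions have nearly the same pixel count, so almost all of the extra cost comes from the additional steps.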

Results

Speed: Not yet measured on the RTX 4060 Ti 16GB. The 5060 Ti sibling recipe quoted a 16 GB Blackwell figure for the Edit variant (per the ZAVHome HF discussion), but that source measures LongCat-Image-Edit on an RTX 5070 (672 GB/s memory bandwidth) and not the base T2I path on a 4060 Ti (288 GB/s) — so the number does not transfer. Once a community run lands at /check/longcat-image/rtx-4060-ti-16gb, this section gets updated.
VRAM usage: budget the full 16 GB. The transformer at Q8_0 is 6.71 GB and the Qwen2.5-VL text encoder at Q4_K_M is ~5 GB per the vantagewithai/LongCat-Image-GGUF per-tier table; ComfyUI keeps only one of them resident on GPU at a time. The vanilla diffusers BF16 path needs ~17–18 GB even with enable_model_cpu_offload() per the HF model card Quick Start and the Meituan team's Issue #8 statement, which is why this recipe uses GGUF. 16 GB is a tight fit — close other GPU-using applications before launching.
Quality notes: LongCat-Image is bilingual by design and the arXiv technical report (2512.07584) highlights multilingual text rendering as a primary target. Quality at Q8_0 should be close to BF16; Q6_K is the budget tier if VRAM gets tight after VAE decode. Q4_K_M is viable but expect a mild quality drop on fine details.
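
The VRAM figures above can be turned into a simple worst-case budget: how much of the 16 GB would be left if the listed components were all resident at once. This is a back-of-the-envelope sketch that uses file sizes as a stand-in for memory use, which ignores activations, the VAE, and framework overhead. The names `SIZES_GB` and `headroom_gb` are hypothetical.

```python
CARD_VRAM_GB = 16.0

# File sizes in GB as quoted in this recipe.
SIZES_GB = {
    "transformer_q8_0": 6.71,
    "text_encoder_q4_k_m": 5.0,   # approximate ("~5 GB" in the text)
    "transformer_bf16": 12.5,
    "text_encoder_bf16": 16.57,
}


def headroom_gb(*components):
    """VRAM left on the card if all named components were resident at once."""
    return CARD_VRAM_GB - sum(SIZES_GB[name] for name in components)


gguf = headroom_gb("transformer_q8_0", "text_encoder_q4_k_m")
bf16 = headroom_gb("transformer_bf16", "text_encoder_bf16")
print(f"GGUF pair: {gguf:+.2f} GB headroom")
print(f"BF16 pair: {bf16:+.2f} GB headroom")
```

A negative number means the pair cannot be resident together on this card. The GGUF pair leaves roughly 4 GB for everything else, which is why closing other GPU applications matters.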

For the full benchmark data, see /check/longcat-image/rtx-4060-ti-16gb.

Troubleshooting

`pip install -r requirements.txt` errors with `No module named 'dskernels'`

The official Meituan infer_requirements.txt lists dskernels, which is not on PyPI — ghostnyambit reported the same blocker on Issue #8 (2025-12-11). This is only an issue if you go down the native-diffusers path. The ComfyUI + GGUF flow above does not import dskernels, so this error is a strong signal you've installed the wrong dependency tree. Skip the Meituan infer_requirements.txt entirely; install the ComfyUI-GGUF requirements instead.
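
If you are unsure which requirements file you installed from, scanning it for the package name is a quick check. The sketch below is a deliberately simple parser for requirements.txt-style text, not a full implementation of pip's format. The function name `requirement_names` and the sample text are hypothetical.

```python
import re


def requirement_names(requirements_text: str) -> set:
    """Extract lowercase package names from requirements.txt-style text."""
    names = set()
    for line in requirements_text.splitlines():
        line = line.split("#", 1)[0].strip()
        if not line or line.startswith("-"):
            continue  # skip blanks, comments, and pip options such as -r
        match = re.match(r"[A-Za-z0-9][A-Za-z0-9._-]*", line)
        if match:
            names.add(match.group(0).lower())
    return names


# Hypothetical file contents for illustration.
sample = "torch>=2.0\ndskernels\n"
if "dskernels" in requirement_names(sample):
    print("This file lists dskernels: it is not the ComfyUI-GGUF requirements file.")
```

To check a real file, pass the result of `open(path).read()` instead of the sample string.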

OOM during the diffusion step on a 16 GB card

If you used the BF16 transformer instead of GGUF, you will OOM — the BF16 transformer alone is 12.5 GB (file sizes per the vantagewithai GGUF README's tier table), leaving no room for activations and the VAE on a 16 GB card. Switch the UnetLoaderGGUF to Q8_0 or Q6_K and rerun. Also make sure the Qwen2.5-VL text encoder is the GGUF version, not the BF16 5-shard set — the BF16 text encoder is ~16.57 GB on its own and will not coexist with the transformer in VRAM.

`enable_model_cpu_offload()` doesn't work on the Edit pipeline

User mingyi456 notes on Issue #8 that enable_model_cpu_offload() currently doesn't work with LongCatImageEditPipeline. The base LongCatImagePipeline is unaffected. This is another reason this recipe is scoped to the base variant only — the editing path needs the manual sequential-offload patch documented by ZAVHome in the linked HF discussion.

LiVeen's FP8 (`LongCat-Image-Edit-FP8-e4m3fn`) is unverified

A community FP8 quant of the Edit variant exists at LiVeen/LongCat-Image-Edit-FP8-e4m3fn. The author themselves stated on Issue #8 that "there is a fairly high likelihood that this model won't work without the rest of the diffusers stuff, or even at all" and that they have not tested it. Don't substitute it for the vantagewithai GGUF until somebody actually verifies it works — report results via /contribute if you try.