
LongCat-Image (base T2I) on RTX 4080: Bilingual 6B Text-to-Image at 16 GB via ComfyUI GGUF

image · intermediate · 16GB+ VRAM · May 29, 2026
prerequisites
  • NVIDIA RTX 4080 (16GB VRAM) or any 16 GB+ consumer card with CUDA support
  • Python 3.10+
  • ComfyUI with native LongCat-Image support (the official text-to-image template ships with current builds)
  • ComfyUI-GGUF custom node (city96)
  • 32 GB system RAM recommended (text-encoder offload uses CPU RAM)

What You'll Build

A working ComfyUI workflow that runs Meituan's LongCat-Image — a 6B-parameter bilingual (Chinese + English) diffusion transformer — on a single 16 GB RTX 4080. The base text-to-image variant is the focus; the image-editing siblings are out of scope below.

Hardware data: RTX 4080 (16GB VRAM) · 1024×1024 baseline at 20 steps · See benchmark data

⚠️ Why this recipe pins the base variant. Meituan publishes four siblings under the LongCat-Image brand and their fit on 16 GB differs. See "Sibling variants and what fits 16 GB" below before downloading anything.

⚠️ The vanilla diffusers path does not fit 16 GB. A LongCat-Image developer states on the GitHub repo that the latest official inference code consumes approximately 18 GB of VRAM and supports inference on an RTX 4090 (Issue #8 comment by junqiangwu, tagged as a project collaborator). The HF model card's own Quick Start points to the same profile: its enable_model_cpu_offload() line is annotated "Required ~17 GB" (Quick Start on meituan-longcat/LongCat-Image). Both figures exceed the RTX 4080's 16 GB, so this recipe uses the ComfyUI + GGUF path, the only sourced configuration confirmed to run end-to-end on a 16 GB consumer card.

Sibling variants and what fits 16 GB

Variant | Purpose | 16 GB fit (cited)
LongCat-Image (this recipe) | Final-release T2I, 6B params, BF16 transformer 12.54 GB | Yes via GGUF: vantagewithai/LongCat-Image-GGUF ships per-tier files (Q4_K_M 3.73 GB · Q6_K 5.15 GB · Q8_0 6.67 GB; no BF16, since the 12.54 GB BF16 transformer lives in the upstream/Comfy-Org repos)
LongCat-Image-Edit | Image-to-image editing variant | Not a turnkey 16 GB path: the editing pipeline loads Qwen2.5-VL and the transformer together and needs manual sequential offload; if you need 16 GB image editing, /contribute a working workflow so we can publish one
LongCat-Image-Edit-Turbo | Distilled few-step edit variant | Same memory profile as Edit; out of scope here
LongCat-Image-Dev | Mid-training checkpoint, intended for fine-tuning, not inference | Out of scope for this recipe

Requirements

Component | Minimum | Tested
GPU | 16 GB VRAM, CUDA-capable | RTX 4080 (16GB)
RAM | 32 GB system RAM (text encoder is CPU-offloaded) |
Storage | ~12 GB free (transformer GGUF + Qwen2.5-VL text-encoder GGUF + mmproj + VAE; the full BF16 upstream repo is ~29 GB) |
Software | ComfyUI (native LongCat support) + ComfyUI-GGUF, Python 3.10+ |
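
The "~12 GB free" storage figure can be sanity-checked against the per-file sizes quoted in the download step of this recipe. The sketch below is only arithmetic on those quoted sizes; the VAE size is not stated in this recipe, so it is treated as whatever fits in the remainder rather than guessed.

```python
# File sizes in GB, as quoted in this recipe from the Hugging Face Files tabs.
DOWNLOADS_GB = {
    "LongCat-Image-Q4_K_M.gguf": 3.73,
    "Qwen2.5-VL-7B-Instruct-Q4_K_M.gguf": 4.68,
    "mmproj-BF16.gguf": 1.35,
}
STORAGE_BUDGET_GB = 12.0  # the "~12 GB free" figure from the requirements table


def storage_summary(downloads=DOWNLOADS_GB, budget=STORAGE_BUDGET_GB):
    """Return (total GB of the listed files, GB left inside the budget)."""
    total = round(sum(downloads.values()), 2)
    return total, round(budget - total, 2)


total, remaining = storage_summary()
print(f"GGUF files: {total} GB; left for the VAE and workflow files: {remaining} GB")
```

The three GGUF files come to 9.76 GB, which leaves a little over 2 GB of the stated budget for the VAE.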

Installation

1. Update ComfyUI

LongCat-Image text-to-image is an official ComfyUI workflow template — see LongCat Image: Text to Image. Make sure your ComfyUI is current so the native LongCat nodes and template are present.

cd ComfyUI
git pull
pip install -r requirements.txt

The default pip install torch already includes sm_89 kernels (Ada Lovelace) — no special wheel selection is required for the RTX 4080. FlashAttention-2 also ships full sm_89 kernel coverage, so there is no eager/sdpa override needed on this card (unlike Blackwell sm_120 GPUs, which lag on FA2 wheels).
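
If you want to confirm what PyTorch sees before loading any weights, a short check along these lines may help. It is a sketch, not part of the official setup: sm_tag and is_ada are hypothetical helper names, and the torch calls are guarded so the script still runs where PyTorch or CUDA is absent.

```python
def sm_tag(capability):
    """Format a (major, minor) compute capability as an sm_XY tag."""
    major, minor = capability
    return f"sm_{major}{minor}"


def is_ada(capability):
    """Ada Lovelace cards such as the RTX 4080 report capability (8, 9)."""
    return tuple(capability) == (8, 9)


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        print("PyTorch is not installed in this environment")
    else:
        if torch.cuda.is_available():
            cap = torch.cuda.get_device_capability(0)
            print(torch.cuda.get_device_name(0), sm_tag(cap), "Ada:", is_ada(cap))
            # Architectures the installed wheel was compiled for.
            print("Compiled arch list:", torch.cuda.get_arch_list())
        else:
            print("CUDA is not available to PyTorch")
```

On an RTX 4080 the tag should read sm_89, and that tag should also appear in the compiled arch list.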

2. Install ComfyUI-GGUF

The GGUF transformer and text encoder require city96's ComfyUI-GGUF custom node, which provides the UnetLoaderGGUF and CLIPLoader (GGUF) nodes.

cd ComfyUI/custom_nodes
git clone https://github.com/city96/ComfyUI-GGUF
cd ComfyUI-GGUF
pip install -r requirements.txt

3. Download the model files

The 16 GB path needs three pieces: the LongCat-Image diffusion transformer (GGUF), the Qwen2.5-VL-7B-Instruct text encoder (GGUF + mmproj), and the VAE. The Qwen2.5-VL text encoder is what eats the most VRAM in the BF16 path — its five sharded safetensors in the text_encoder/ folder of the canonical repo total 16.58 GB on their own — so a quantized GGUF version is essential.

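# 1) LongCat-Image transformer (GGUF). GGUF is the only sourced path that fits 16 GB; Q4_K_M is the tier this recipe uses.
#    hf download keeps the repo-relative path, so this file lands under diffusion_models/org/.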
hf download vantagewithai/LongCat-Image-GGUF org/LongCat-Image-Q4_K_M.gguf \
  --local-dir ComfyUI/models/diffusion_models/

# 2) Qwen2.5-VL-7B text encoder + mmproj (GGUF). Place BOTH in models/text_encoders/.
#    ComfyUI-GGUF auto-detects the mmproj when it sits beside the main encoder.
hf download unsloth/Qwen2.5-VL-7B-Instruct-GGUF Qwen2.5-VL-7B-Instruct-Q4_K_M.gguf \
  --local-dir ComfyUI/models/text_encoders/
hf download unsloth/Qwen2.5-VL-7B-Instruct-GGUF mmproj-BF16.gguf \
  --local-dir ComfyUI/models/text_encoders/

# 3) VAE, pulled from the canonical repo's vae/ folder (hf download keeps that subfolder under --local-dir)
hf download meituan-longcat/LongCat-Image vae/diffusion_pytorch_model.safetensors \
  --local-dir ComfyUI/models/vae/

Per the live vantagewithai/LongCat-Image-GGUF Files tab, the transformer GGUF tiers are Q4_K_M 3.73 GB, Q6_K 5.15 GB and Q8_0 6.67 GB. The unsloth/Qwen2.5-VL-7B-Instruct-GGUF Files tab lists the Q4_K_M text encoder at 4.68 GB and mmproj-BF16.gguf at 1.35 GB. The mmproj-loading convention (place the mmproj in the same folder as the main text encoder; the node auto-detects it on selection) is documented in the QuantStack Qwen2.5-VL mmproj-loading discussion and is the same layout the LongCat ComfyUI workflow uses.
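
Before launching ComfyUI it can be worth confirming that all four files landed where the loaders will look. The following is a minimal sketch, assuming the ComfyUI/models layout used in the commands above; find_missing is a hypothetical helper, and it searches recursively because hf download preserves repo subfolders under --local-dir.

```python
from pathlib import Path

# Filenames taken from the download commands in this recipe.
EXPECTED = {
    "diffusion_models": ["LongCat-Image-Q4_K_M.gguf"],
    "text_encoders": ["Qwen2.5-VL-7B-Instruct-Q4_K_M.gguf", "mmproj-BF16.gguf"],
    "vae": ["diffusion_pytorch_model.safetensors"],
}


def find_missing(models_dir, expected=EXPECTED):
    """Return 'folder/filename' entries that are not present under models_dir."""
    models_dir = Path(models_dir)
    missing = []
    for folder, names in expected.items():
        root = models_dir / folder
        for name in names:
            # Recursive search: the file may sit in a nested subfolder.
            found = root.is_dir() and any(root.rglob(name))
            if not found:
                missing.append(f"{folder}/{name}")
    return missing


if __name__ == "__main__":
    problems = find_missing("ComfyUI/models")
    for item in problems:
        print("missing:", item)
    if not problems:
        print("all expected model files found")
```

Run it from the directory that contains ComfyUI/, or pass a different models path to find_missing.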

Running

  1. Launch ComfyUI: python main.py.
  2. Load the LongCat-Image text-to-image workflow. One starting point is the official LongCat Image: Text to Image template referenced in Installation step 1.
  3. In the workflow, set UnetLoaderGGUF to LongCat-Image-Q4_K_M.gguf, and the CLIPLoader (GGUF) to the Qwen2.5-VL-7B-Instruct-Q4_K_M.gguf encoder.
  4. Hit Queue Prompt. First run loads the weights from disk; subsequent runs reuse them.
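
Step 3 can also be applied to a workflow exported in ComfyUI's API format, which is a JSON object mapping node ids to a class_type and an inputs dict. The sketch below rests on assumptions you should verify against your own export: that the ComfyUI-GGUF loaders appear with class_type UnetLoaderGGUF and CLIPLoaderGGUF, and that their filename inputs are named unet_name and clip_name. The node ids and the toy workflow are hypothetical.

```python
import copy


def set_gguf_loaders(workflow, unet_file, clip_file):
    """Return a copy of an API-format workflow with GGUF loader filenames set."""
    patched = copy.deepcopy(workflow)
    for node in patched.values():
        class_type = node.get("class_type")
        if class_type == "UnetLoaderGGUF":        # assumed class name
            node["inputs"]["unet_name"] = unet_file   # assumed input key
        elif class_type == "CLIPLoaderGGUF":      # assumed class name
            node["inputs"]["clip_name"] = clip_file   # assumed input key
    return patched


# Toy stand-in for an exported workflow; real exports contain many more nodes.
toy_workflow = {
    "1": {"class_type": "UnetLoaderGGUF", "inputs": {"unet_name": "placeholder.gguf"}},
    "2": {"class_type": "CLIPLoaderGGUF", "inputs": {"clip_name": "placeholder.gguf"}},
    "3": {"class_type": "KSampler", "inputs": {"steps": 20}},
}

patched = set_gguf_loaders(
    toy_workflow,
    "LongCat-Image-Q4_K_M.gguf",
    "Qwen2.5-VL-7B-Instruct-Q4_K_M.gguf",
)
print(patched["1"]["inputs"], patched["2"]["inputs"])
```

If a file was downloaded into a subfolder, the name shown in the loader dropdown includes that subfolder, so pass the same string here.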

The HF model card's reference example uses guidance_scale=4.0, num_inference_steps=50, 768×1344 — bump steps to 50 if you want model-card-quality reference output; 20 steps is a reasonable preview default.
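
To get a feel for what moving from the 20-step preview to the model-card settings costs, the two configurations can be compared with simple arithmetic. This is a rough sketch that assumes cost scales with pixel count times step count; attention cost does not scale linearly with resolution, so treat the ratio as a ballpark only.

```python
# Preview baseline from this recipe's hardware note; reference from the HF model card.
PREVIEW = {"width": 1024, "height": 1024, "steps": 20}
REFERENCE = {"width": 768, "height": 1344, "steps": 50}


def pixels(settings):
    """Number of output pixels for a settings dict."""
    return settings["width"] * settings["height"]


def relative_cost(a, b):
    """Crude cost of a relative to b, assuming cost ~ pixels x steps."""
    return (pixels(a) * a["steps"]) / (pixels(b) * b["steps"])


print("preview pixels:  ", pixels(PREVIEW))
print("reference pixels:", pixels(REFERENCE))
print("reference vs preview cost: %.2fx" % relative_cost(REFERENCE, PREVIEW))
```

The two resolutions have nearly the same pixel count, so under this assumption almost all of the extra time comes from the step count.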

Results

  • Speed: Not yet measured on the RTX 4080. The 4080 (~716.8 GB/s memory bandwidth) is materially faster than the 16 GB siblings we have published recipes for (e.g. the RTX 4060 Ti 16GB at 288 GB/s), so quoting their numbers here would understate the card — and no RTX-4080-named LongCat-Image measurement has surfaced. The /check/longcat-image/rtx-4080 endpoint currently has no benchmark for this pair (verdict: unknown). If you run it, please submit your timing via /contribute so we can populate a real figure.
  • VRAM usage: budget the full 16 GB. The transformer at Q4_K_M is 3.73 GB and the Qwen2.5-VL text encoder at Q4_K_M is 4.68 GB per the per-tier file listings cited above; ComfyUI keeps only one of them resident on GPU at a time, leaving headroom for activations and VAE decode. The vanilla diffusers BF16 path needs ~17 GB with enable_model_cpu_offload() per the HF model card Quick Start, or ~18 GB for the latest official inference code per the project collaborator's Issue #8 statement, which is why this recipe uses GGUF.
  • Quality notes: LongCat-Image is bilingual by design and the arXiv technical report (2512.07584) highlights multilingual text rendering as a primary target. Q8_0 is the tier closest to BF16 quality; Q6_K is a budget tier if VRAM gets tight at VAE decode. Q4_K_M, the smallest published tier and the one this recipe downloads, is viable; expect a mild quality drop on fine details versus the higher tiers.
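
The VRAM trade-off between tiers can be laid out as arithmetic on the quoted file sizes. This is a sketch under a loud assumption: on-disk GGUF size is used as a stand-in for the VRAM the weights occupy, and activations, the VAE and the mmproj are not counted, so real headroom will be smaller.

```python
VRAM_GB = 16.0
TEXT_ENCODER_Q4_GB = 4.68  # Qwen2.5-VL Q4_K_M, as quoted in this recipe
TRANSFORMER_TIERS_GB = {"Q4_K_M": 3.73, "Q6_K": 5.15, "Q8_0": 6.67}


def headroom(tier, both_resident=False):
    """GB left on a 16 GB card after the weights (file size used as a proxy)."""
    weights = TRANSFORMER_TIERS_GB[tier]
    if both_resident:
        # Worst case: transformer and text encoder on the GPU at the same time.
        weights += TEXT_ENCODER_Q4_GB
    return round(VRAM_GB - weights, 2)


for tier in TRANSFORMER_TIERS_GB:
    print(f"{tier}: {headroom(tier)} GB free with the transformer alone, "
          f"{headroom(tier, both_resident=True)} GB if the encoder is resident too")
```

Even the pessimistic both-resident case stays under 16 GB at every tier for the weights alone; the open question on this card is how much activations and VAE decode need on top.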

For the full benchmark data, see /check/longcat-image/rtx-4080.

Troubleshooting

pip install -r requirements.txt errors with No module named 'dskernels'

The official Meituan infer_requirements.txt lists dskernels, which is not on PyPI — ghostnyambit reported the same blocker on Issue #8. This is only an issue if you go down the native-diffusers path. The ComfyUI + GGUF flow above does not import dskernels, so this error is a strong signal you've installed the wrong dependency tree. Skip the Meituan infer_requirements.txt entirely; install the ComfyUI-GGUF requirements instead.
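
A quick way to tell which dependency tree you are about to install is to scan the requirements file for a dskernels entry. The helper names below are hypothetical, the parsing is deliberately simple (it ignores comments and pip option lines), and the sample strings are illustrative rather than the real file contents.

```python
import re


def requirement_names(text):
    """Extract lowercase package names from requirements-file text."""
    names = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments
        if not line or line.startswith("-"):  # skip blanks and pip options
            continue
        match = re.match(r"[A-Za-z0-9][A-Za-z0-9._-]*", line)
        if match:
            names.append(match.group(0).lower())
    return names


def lists_dskernels(text):
    """True if the requirements text asks for the dskernels package."""
    return "dskernels" in requirement_names(text)


# Illustrative contents only, not copied from either project.
print(lists_dskernels("torch>=2.1\ndskernels\n"))          # True
print(lists_dskernels("gguf\nsentencepiece\n# dskernels\n"))  # False
```

In practice you would pass the file's text, for example Path("requirements.txt").read_text(), and back out if the check returns True.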

OOM during the diffusion step on a 16 GB card

If you loaded the BF16 transformer instead of GGUF, you will OOM — the BF16 transformer alone is 12.54 GB (per the Comfy-Org repackaged BF16), leaving almost no room for activations and the VAE on a 16 GB card. Switch the UnetLoaderGGUF to LongCat-Image-Q4_K_M.gguf (or Q6_K / Q8_0) and rerun. Also make sure the Qwen2.5-VL text encoder is the GGUF version, not the BF16 5-shard set — the BF16 text encoder is 16.58 GB on its own and will not coexist with the transformer in VRAM.
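
The numbers behind this failure mode are easy to check. The sketch below only restates the sizes quoted in this recipe as arithmetic; fits is a hypothetical helper that compares weight size against the card's VRAM and ignores activations entirely.

```python
VRAM_GB = 16.0
BF16_TRANSFORMER_GB = 12.54   # Comfy-Org repackaged BF16 transformer
BF16_TEXT_ENCODER_GB = 16.58  # five-shard Qwen2.5-VL text encoder


def fits(weights_gb, vram_gb=VRAM_GB):
    """True if the weights alone are smaller than the card's VRAM."""
    return weights_gb < vram_gb


left_after_transformer = round(VRAM_GB - BF16_TRANSFORMER_GB, 2)
bf16_total = round(BF16_TRANSFORMER_GB + BF16_TEXT_ENCODER_GB, 2)

print("left after BF16 transformer:", left_after_transformer, "GB")
print("BF16 text encoder fits alone:", fits(BF16_TEXT_ENCODER_GB))
print("BF16 transformer + encoder:", bf16_total, "GB")
```

The combined 29.12 GB also lines up with the ~29 GB quoted for the full BF16 upstream repo in the requirements table.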

enable_model_cpu_offload() doesn't work on the Edit pipeline

User mingyi456 notes on Issue #8 that enable_model_cpu_offload() currently doesn't work with LongCatImageEditPipeline. The base LongCatImagePipeline is unaffected. This is another reason this recipe is scoped to the base text-to-image variant only — the editing path needs a manual sequential-offload patch.

Where are the BF16 ComfyUI weights if I have a bigger card?

The ComfyUI core team publishes repackaged BF16 safetensors at Comfy-Org/LongCat-Image (split_files/diffusion_models/longcat_image_bf16.safetensors, 12.54 GB). Those are for cards with more headroom than 16 GB; on the RTX 4080 stick with the GGUF transformer above. Report any working configuration via /contribute.