What You'll Build
A working ComfyUI workflow that runs Meituan's LongCat-Image — a 6B-parameter bilingual (Chinese + English) diffusion transformer — on a single 16 GB RTX 4070 Ti SUPER. The base text-to-image variant is the focus; the image-editing siblings are out of scope below.
Hardware data: RTX 4070 Ti SUPER (16GB VRAM) · 1024×1024 baseline at 20 steps · See benchmark data
⚠️ Why this recipe pins the base variant. Meituan publishes four siblings under the LongCat-Image brand and their fit on 16 GB differs. See "Sibling variants and what fits 16 GB" below before downloading anything.
⚠️ The vanilla diffusers path does not fit 16 GB. A LongCat-Image developer states on the GitHub repo that the latest official inference code consumes approximately 18 GB of VRAM and supports inference on an RTX 4090 (Issue #8 comment by junqiangwu, a project COLLABORATOR). The HF model card's own Quick Start confirms the same profile: its enable_model_cpu_offload() line is annotated "Required ~17 GB" (Quick Start on meituan-longcat/LongCat-Image). Both numbers exceed the RTX 4070 Ti SUPER's 16 GB, so this recipe uses the ComfyUI + GGUF path, the only sourced configuration confirmed to run end-to-end on a 16 GB consumer card.
Sibling variants and what fits 16 GB
| Variant | Purpose | 16 GB fit (cited) |
|---|---|---|
| LongCat-Image (this recipe) | Final-release T2I, 6B params, BF16 transformer 12.54 GB | Yes via GGUF — vantagewithai/LongCat-Image-GGUF ships per-tier files (Q4_K_M 3.73 GB · Q6_K 5.15 GB · Q8_0 6.67 GB; no BF16 in the org/ folder — the 12.54 GB BF16 transformer lives in the upstream/Comfy-Org repos) |
| LongCat-Image-Edit | Image-to-image editing variant | Not a turnkey 16 GB path — the editing pipeline loads Qwen2.5-VL and the transformer together and needs manual sequential offload; if you need 16 GB image editing, /contribute a working workflow so we can publish one |
| LongCat-Image-Edit-Turbo | Distilled few-step edit variant | Same memory profile as Edit; out of scope here |
| LongCat-Image-Dev | Mid-training checkpoint, intended for fine-tuning, not inference | Out of scope for this recipe |
Requirements
| Component | Minimum | Tested |
|---|---|---|
| GPU | 16 GB VRAM, CUDA-capable | RTX 4070 Ti SUPER (16GB) |
| RAM | 32 GB system RAM (text encoder is CPU-offloaded) | — |
| Storage | ~12 GB free (transformer GGUF + Qwen2.5-VL text-encoder GGUF + mmproj + VAE; the full BF16 upstream repo is ~29 GB) | — |
| Software | ComfyUI (native LongCat support) + ComfyUI-GGUF, Python 3.10+ | — |
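The Python and storage rows above can be checked before downloading anything. The following is a minimal sketch using only the standard library; the thresholds are copied from the table, and the directory passed to free_gb is an assumption (point it at whichever drive will hold ComfyUI/models).

```python
import shutil
import sys
from pathlib import Path

MIN_PYTHON = (3, 10)   # "Python 3.10+" from the Requirements table
MIN_FREE_GB = 12.0     # "~12 GB free" from the Requirements table


def python_ok(version=None) -> bool:
    """True if the interpreter meets the recipe's Python 3.10+ requirement."""
    version = sys.version_info if version is None else version
    return tuple(version[:2]) >= MIN_PYTHON


def free_gb(path: Path) -> float:
    """Free space on the filesystem holding `path`, in decimal gigabytes."""
    return shutil.disk_usage(path).free / 1e9


def storage_ok(free: float, needed: float = MIN_FREE_GB) -> bool:
    """True if `free` gigabytes covers the recipe's storage estimate."""
    return free >= needed


if __name__ == "__main__":
    target = Path(".")  # assumption: run from the drive that will hold the models
    available = free_gb(target)
    print("Python OK:", python_ok())
    print(f"Free space: {available:.1f} GB, enough: {storage_ok(available)}")
```

This does not check system RAM or VRAM; those still need to be confirmed with your OS tools or nvidia-smi.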
Installation
1. Update ComfyUI
LongCat-Image text-to-image is an official ComfyUI workflow template — see LongCat Image: Text to Image. Make sure your ComfyUI is current so the native LongCat nodes and template are present.
cd ComfyUI
git pull
pip install -r requirements.txt
The default pip install torch already includes sm_89 kernels (Ada Lovelace) — no special wheel selection is required for the RTX 4070 Ti SUPER. FlashAttention-2 also ships full sm_89 kernel coverage, so there is no eager/sdpa override needed on this card (unlike Blackwell sm_120 GPUs, which lag on FA2 wheels).
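To confirm that the installed PyTorch build actually sees the card as sm_89, a short check along these lines may help. It is a sketch, not part of the official setup: the helper names are hypothetical, and it relies only on torch.cuda.is_available() and torch.cuda.get_device_capability().

```python
def describe_capability(capability: tuple[int, int]) -> str:
    """Turn a (major, minor) compute capability into a short verdict."""
    major, minor = capability
    arch = f"sm_{major}{minor}"
    if capability == (8, 9):
        return f"{arch}: Ada Lovelace, the architecture this recipe targets"
    return f"{arch}: not the architecture this recipe was written for"


def detect_capability():
    """Return (major, minor) for GPU 0, or None if PyTorch or CUDA is unavailable."""
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    return torch.cuda.get_device_capability(0)


if __name__ == "__main__":
    capability = detect_capability()
    if capability is None:
        print("No CUDA device visible to PyTorch")
    else:
        print(describe_capability(capability))
```

On an RTX 4070 Ti SUPER the detected capability should be (8, 9); any other value suggests a different GPU is being picked up as device 0.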
2. Install ComfyUI-GGUF
The GGUF transformer and text encoder require city96's ComfyUI-GGUF custom node, which provides the UnetLoaderGGUF and CLIPLoader (GGUF) nodes.
cd ComfyUI/custom_nodes
git clone https://github.com/city96/ComfyUI-GGUF
cd ComfyUI-GGUF
pip install -r requirements.txt
3. Download the model files
The 16 GB path needs three pieces: the LongCat-Image diffusion transformer (GGUF), the Qwen2.5-VL-7B-Instruct text encoder (GGUF + mmproj), and the VAE. The Qwen2.5-VL text encoder is what eats the most VRAM in the BF16 path — its five sharded safetensors in the text_encoder/ folder of the canonical repo total 16.58 GB on their own — so a quantized GGUF version is essential.
# 1) LongCat-Image transformer (GGUF). Q4_K_M is the smallest of the three published tiers and the baseline this recipe uses on 16 GB.
hf download vantagewithai/LongCat-Image-GGUF org/LongCat-Image-Q4_K_M.gguf \
--local-dir ComfyUI/models/diffusion_models/
# 2) Qwen2.5-VL-7B text encoder + mmproj (GGUF). Place BOTH in models/text_encoders/.
# ComfyUI-GGUF auto-detects the mmproj when it sits beside the main encoder.
hf download unsloth/Qwen2.5-VL-7B-Instruct-GGUF Qwen2.5-VL-7B-Instruct-Q4_K_M.gguf \
--local-dir ComfyUI/models/text_encoders/
hf download unsloth/Qwen2.5-VL-7B-Instruct-GGUF mmproj-BF16.gguf \
--local-dir ComfyUI/models/text_encoders/
# 3) VAE — pulled from the canonical repo's vae/ folder
hf download meituan-longcat/LongCat-Image vae/diffusion_pytorch_model.safetensors \
--local-dir ComfyUI/models/vae/
Per the live vantagewithai/LongCat-Image-GGUF Files tab, the transformer GGUF tiers are Q4_K_M 3.73 GB, Q6_K 5.15 GB and Q8_0 6.67 GB. The unsloth/Qwen2.5-VL-7B-Instruct-GGUF Files tab lists the Q4_K_M text encoder at 4.68 GB and mmproj-BF16.gguf at 1.35 GB. The mmproj-loading convention (place the mmproj in the same folder as the main text encoder; the node auto-detects it on selection) is documented in the QuantStack Qwen2.5-VL mmproj-loading discussion and is the same layout the LongCat ComfyUI workflow uses.
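Before launching ComfyUI it can be worth confirming that the four files landed where the loaders will look, and that the mmproj sits beside the text encoder. The sketch below is one way to do that. It searches recursively because hf download keeps the repo-relative subfolder (for example org/ or vae/) under --local-dir. The file names come from the commands above; the function names are hypothetical.

```python
from pathlib import Path

# File names taken from the download commands in this recipe.
EXPECTED = {
    "diffusion_models": "LongCat-Image-Q4_K_M.gguf",
    "text_encoders": "Qwen2.5-VL-7B-Instruct-Q4_K_M.gguf",
    "vae": "diffusion_pytorch_model.safetensors",
}
MMPROJ = "mmproj-BF16.gguf"


def find_one(folder: Path, name: str):
    """Return the first file called `name` anywhere under `folder`, else None."""
    if not folder.is_dir():
        return None
    return next(iter(sorted(folder.rglob(name))), None)


def check_models(models_dir: Path) -> list[str]:
    """Return human-readable problems; an empty list means the layout looks right."""
    problems = []
    found = {}
    for sub, name in EXPECTED.items():
        found[sub] = find_one(models_dir / sub, name)
        if found[sub] is None:
            problems.append(f"missing {name} under {sub}/")
    encoder = found["text_encoders"]
    # The mmproj must share a folder with the encoder to be auto-detected.
    if encoder is not None and not (encoder.parent / MMPROJ).is_file():
        problems.append(f"{MMPROJ} is not in the same folder as {encoder.name}")
    return problems
```

Calling check_models(Path("ComfyUI/models")) from the directory that contains ComfyUI returns an empty list when everything is in place. The check only looks at names and locations, not file sizes or checksums.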
Running
- Launch ComfyUI: python main.py.
- Load the LongCat-Image text-to-image workflow. Two reasonable starting points:
  - Official ComfyUI template: LongCat Image: Text to Image. It combines the LongCat-Image model, the Qwen2.5-VL text encoder and the VAE in a compact graph, tuned for 1024×1024 at 20 steps.
  - Community GGUF workflow: the comfy/Vantage-Longcat-Image.json file shipped alongside the GGUF weights.
- In the workflow, set UnetLoaderGGUF to LongCat-Image-Q4_K_M.gguf, and the CLIPLoader (GGUF) to the Qwen2.5-VL-7B-Instruct-Q4_K_M.gguf encoder.
- Hit Queue Prompt. First run loads the weights from disk; subsequent runs reuse them.
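The same queueing step can be scripted against ComfyUI's local HTTP endpoint, which is handy for repeated runs. This is a minimal sketch under two assumptions: ComfyUI is listening on its default address (127.0.0.1:8188), and the workflow has been exported in API format from the ComfyUI menu. A graph saved in the regular UI format, such as the community JSON mentioned above, would need to be re-exported first.

```python
import json
import urllib.request
from pathlib import Path

COMFY_URL = "http://127.0.0.1:8188"  # assumption: ComfyUI's default local address


def build_payload(workflow: dict) -> bytes:
    """Wrap an API-format workflow in the JSON body sent to POST /prompt."""
    return json.dumps({"prompt": workflow}).encode("utf-8")


def queue_workflow(workflow_path: Path, base_url: str = COMFY_URL) -> dict:
    """Submit the workflow file and return the server's JSON reply."""
    workflow = json.loads(workflow_path.read_text(encoding="utf-8"))
    request = urllib.request.Request(
        f"{base_url}/prompt",
        data=build_payload(workflow),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())
```

With ComfyUI running, queue_workflow(Path("longcat_t2i_api.json")) submits one generation; the file name is hypothetical and stands for your own exported graph.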
The HF model card's reference example uses guidance_scale=4.0, num_inference_steps=50, 768×1344 — bump steps to 50 if you want model-card-quality reference output; 20 steps is a reasonable preview default.
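For a rough sense of what the model-card settings cost relative to the 1024×1024, 20-step baseline, the arithmetic can be sketched as below. It assumes runtime scales linearly with pixel count and step count, which is a crude approximation (attention cost grows faster than linearly with resolution), so treat the result as an order-of-magnitude guide rather than a prediction.

```python
def relative_cost(width: int, height: int, steps: int,
                  base: tuple[int, int, int] = (1024, 1024, 20)) -> float:
    """Cost ratio versus the baseline, assuming time is proportional to pixels x steps."""
    base_width, base_height, base_steps = base
    return (width * height * steps) / (base_width * base_height * base_steps)


# Model-card reference example: 768x1344 at 50 steps.
print(f"{relative_cost(768, 1344, 50):.2f}x the baseline")  # prints "2.46x the baseline"
```

Almost all of the difference comes from the step count (50 / 20 = 2.5×); the 768×1344 canvas has slightly fewer pixels than 1024×1024.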
Results
- Speed: Not yet measured on the RTX 4070 Ti SUPER. No LongCat-Image measurement on this specific card has surfaced, and it is materially slower than the larger 16 GB Ada siblings we have published recipes for (the RTX 4070 Ti SUPER has ~672 GB/s memory bandwidth and 8448 CUDA cores, versus ~716.8 GB/s and 9728 cores on the RTX 4080), so borrowing a 4080 figure would overstate it. The /check/longcat-image/rtx-4070-ti-super endpoint currently has no benchmark for this pair (verdict: unknown). If you run it, please submit your timing via /contribute so we can populate a real figure.
- VRAM usage: budget the full 16 GB. The transformer at Q4_K_M is 3.73 GB and the Qwen2.5-VL text encoder at Q4_K_M is 4.68 GB per the per-tier file tables linked above; ComfyUI keeps only one of them resident on GPU at a time, leaving headroom for activations and VAE decode. The vanilla diffusers BF16 path needs ~17 GB with enable_model_cpu_offload() per the HF model card Quick Start, or ~18 GB for the latest official inference code per the Meituan COLLABORATOR's Issue #8 statement, which is why this recipe uses GGUF.
- Quality notes: LongCat-Image is bilingual by design and the arXiv technical report (2512.07584) highlights multilingual text rendering as a primary target. Quality at Q8_0 is closest to BF16; Q6_K is a budget tier if VRAM gets tight after VAE decode. Q4_K_M is the smallest of the three tiers and fits this card comfortably; expect a mild quality drop on fine details versus the higher tiers.
For the full benchmark data, see /check/longcat-image/rtx-4070-ti-super.
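If you plan to submit a timing, a small harness like the one below keeps the numbers comparable between runs. It is a sketch only: generate is a hypothetical zero-argument callable that you supply (for example, one that queues the workflow and blocks until the image is written), and the per-step figure includes text encoding and VAE decode because it is derived from wall-clock time.

```python
import statistics
import time


def time_runs(generate, runs: int = 3, steps: int = 20) -> dict:
    """Call `generate()` `runs` times and summarise wall-clock timings in seconds."""
    durations = []
    for _ in range(runs):
        start = time.perf_counter()
        generate()
        durations.append(time.perf_counter() - start)
    median = statistics.median(durations)
    return {
        "runs": runs,
        "first_run_s": durations[0],   # includes loading weights from disk
        "median_s": median,
        "s_per_step": median / steps,  # wall-clock, not pure sampler time
    }
```

Reporting the first run separately matters here because it includes the one-time weight load, while later runs reuse the loaded models.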
Troubleshooting
pip install -r requirements.txt errors with No module named 'dskernels'
The official Meituan infer_requirements.txt lists dskernels, which is not on PyPI — ghostnyambit reported the same blocker on Issue #8. This is only an issue if you go down the native-diffusers path. The ComfyUI + GGUF flow above does not import dskernels, so this error is a strong signal you've installed the wrong dependency tree. Skip the Meituan infer_requirements.txt entirely; install the ComfyUI-GGUF requirements instead.
OOM during the diffusion step on a 16 GB card
If you loaded the BF16 transformer instead of GGUF, you will OOM — the BF16 transformer alone is 12.54 GB (per the Comfy-Org repackaged BF16), leaving almost no room for activations and the VAE on a 16 GB card. Switch the UnetLoaderGGUF to LongCat-Image-Q4_K_M.gguf (or Q6_K / Q8_0) and rerun. Also make sure the Qwen2.5-VL text encoder is the GGUF version, not the BF16 5-shard set — the BF16 text encoder is 16.58 GB on its own and will not coexist with the transformer in VRAM.
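The headroom argument is simple subtraction, sketched here with the transformer file sizes quoted in this recipe. The result is an upper bound on what remains: the CUDA context, the desktop, activations and VAE decode all come out of the same budget, and the nominal 16 GB figure is itself an approximation.

```python
CARD_GB = 16.0  # nominal VRAM of the RTX 4070 Ti SUPER

# Transformer file sizes quoted in this recipe, in GB.
TRANSFORMER_GB = {"BF16": 12.54, "Q8_0": 6.67, "Q6_K": 5.15, "Q4_K_M": 3.73}


def headroom_gb(tier: str, card_gb: float = CARD_GB) -> float:
    """VRAM left after the transformer weights alone, ignoring everything else."""
    return round(card_gb - TRANSFORMER_GB[tier], 2)


for tier in TRANSFORMER_GB:
    print(f"{tier:7s} weights leave at most ~{headroom_gb(tier):.2f} GB")
```

BF16 leaves roughly 3.46 GB before anything else is allocated, while Q4_K_M leaves about 12.27 GB, which is the gap the troubleshooting advice above relies on.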
enable_model_cpu_offload() doesn't work on the Edit pipeline
Community user mingyi456 notes on Issue #8 that enable_model_cpu_offload() currently does not work with the edit pipeline. The base text-to-image pipeline is unaffected. This is another reason this recipe is scoped to the base variant only — the editing path needs a manual sequential-offload patch.
Where are the BF16 ComfyUI weights if I have a bigger card?
The ComfyUI core team publishes repackaged BF16 safetensors at Comfy-Org/LongCat-Image (split_files/diffusion_models/longcat_image_bf16.safetensors, 12.54 GB). Those are for cards with more headroom than 16 GB; on the RTX 4070 Ti SUPER stick with the GGUF transformer above. Report any working configuration via /contribute.