LongCat-Image (base T2I) on RTX 4070 Ti SUPER: Bilingual 6B Text-to-Image at 16 GB via ComfyUI GGUF

What You'll Build

A working ComfyUI workflow that runs Meituan's LongCat-Image — a 6B-parameter bilingual (Chinese + English) diffusion transformer — on a single 16 GB RTX 4070 Ti SUPER. The base text-to-image variant is the focus; the image-editing siblings are out of scope below.

Hardware data: RTX 4070 Ti SUPER (16GB VRAM) · 1024×1024 baseline at 20 steps · See benchmark data

⚠️ Why this recipe pins the base variant. Meituan publishes four variants under the LongCat-Image brand, and they differ in how well they fit on 16 GB. See "Sibling variants and what fits 16 GB" below before downloading anything.

⚠️ The vanilla diffusers path does not fit 16 GB. A LongCat-Image developer states on the GitHub repo that the latest official inference code consumes approximately 18 GB of VRAM and supports inference on an RTX 4090 (Issue #8 comment by junqiangwu, a project COLLABORATOR). The HF model card's own Quick Start confirms the same profile — its enable_model_cpu_offload() line is annotated Required ~17 GB (Quick Start on meituan-longcat/LongCat-Image). Both numbers exceed the RTX 4070 Ti SUPER's 16 GB, so this recipe uses the ComfyUI + GGUF path — the only sourced configuration confirmed to run end-to-end on a 16 GB consumer card.

Sibling variants and what fits 16 GB

Variant	Purpose	16 GB fit (cited)
LongCat-Image (this recipe)	Final-release T2I, 6B params, BF16 transformer 12.54 GB	Yes via GGUF — vantagewithai/LongCat-Image-GGUF ships per-tier files (Q4_K_M 3.73 GB · Q6_K 5.15 GB · Q8_0 6.67 GB; no BF16 in the `org/` folder — the 12.54 GB BF16 transformer lives in the upstream/Comfy-Org repos)
LongCat-Image-Edit	Image-to-image editing variant	Not a turnkey 16 GB path: the editing pipeline loads Qwen2.5-VL and the transformer together and needs manual sequential offload. If you get image editing working on 16 GB, submit the workflow via /contribute so we can publish it
LongCat-Image-Edit-Turbo	Distilled few-step edit variant	Same memory profile as Edit; out of scope here
LongCat-Image-Dev	Mid-training checkpoint, intended for fine-tuning, not inference	Out of scope for this recipe

Requirements

Component	Minimum	Tested
GPU	16 GB VRAM, CUDA-capable	RTX 4070 Ti SUPER (16GB)
RAM	32 GB system RAM (text encoder is CPU-offloaded)	—
Storage	~12 GB free (transformer GGUF + Qwen2.5-VL text-encoder GGUF + mmproj + VAE; the full BF16 upstream repo is ~29 GB)	—
Software	ComfyUI (native LongCat support) + ComfyUI-GGUF, Python 3.10+	—

Installation

1. Update ComfyUI

LongCat-Image text-to-image is an official ComfyUI workflow template — see LongCat Image: Text to Image. Make sure your ComfyUI is current so the native LongCat nodes and template are present.

cd ComfyUI
git pull
pip install -r requirements.txt

The default pip install torch wheels already run on sm_89 (Ada Lovelace) GPUs, so no special wheel selection is required for the RTX 4070 Ti SUPER. FlashAttention-2 also supports sm_89, so there is no eager/sdpa override needed on this card (unlike Blackwell sm_120 GPUs, which lag on FA2 wheels).
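To confirm that the installed wheel can drive this card, a check along the following lines may help. It is a sketch, assuming PyTorch is importable in the ComfyUI environment. The helper names `parse_sm`, `can_run` and `report_gpu_arch` are made up for illustration, while `torch.cuda.get_device_capability()` and `torch.cuda.get_arch_list()` are standard PyTorch calls. The build's arch list may or may not name sm_89 explicitly; CUDA binary compatibility lets a device run kernels compiled for the same major version and an equal or lower minor version, which is what `can_run` tests. PTX JIT fallbacks (`compute_XY` entries) are deliberately ignored.

```python
import re


def parse_sm(tag):
    """Split 'sm_86' into (8, 6); return None for entries such as 'compute_90'."""
    match = re.fullmatch(r"sm_(\d+)(\d)[a-z]?", tag)
    return (int(match.group(1)), int(match.group(2))) if match else None


def can_run(device_cc, arch_list):
    """True if some compiled sm_XY shares the device's major and has minor <= the device's."""
    major, minor = device_cc
    for tag in arch_list:
        sm = parse_sm(tag)
        if sm and sm[0] == major and sm[1] <= minor:
            return True
    return False


def report_gpu_arch():
    """Print the first GPU's compute capability and whether this torch build covers it."""
    try:
        import torch
    except ImportError:
        print("PyTorch is not installed in this environment")
        return None
    if not torch.cuda.is_available():
        print("CUDA is not available to PyTorch")
        return None
    capability = torch.cuda.get_device_capability(0)  # (8, 9) is Ada Lovelace
    covered = can_run(capability, torch.cuda.get_arch_list())
    print(f"{torch.cuda.get_device_name(0)}: sm_{capability[0]}{capability[1]}, covered: {covered}")
    return covered
```

Calling `report_gpu_arch()` from the same Python environment that launches ComfyUI is the intended use; a `False` result would suggest reinstalling torch from a CUDA-enabled index.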

2. Install ComfyUI-GGUF

The GGUF transformer and text encoder require city96's ComfyUI-GGUF custom node, which provides the UnetLoaderGGUF and CLIPLoader (GGUF) nodes.

cd ComfyUI/custom_nodes
git clone https://github.com/city96/ComfyUI-GGUF
cd ComfyUI-GGUF
pip install -r requirements.txt

3. Download the model files

The 16 GB path needs three pieces: the LongCat-Image diffusion transformer (GGUF), the Qwen2.5-VL-7B-Instruct text encoder (GGUF + mmproj), and the VAE. The Qwen2.5-VL text encoder is what eats the most VRAM in the BF16 path — its five sharded safetensors in the text_encoder/ folder of the canonical repo total 16.58 GB on their own — so a quantized GGUF version is essential.

# 1) LongCat-Image transformer (GGUF). Q4_K_M is this recipe's default tier for 16 GB.
#    hf download keeps the repo's org/ subfolder under --local-dir.
hf download vantagewithai/LongCat-Image-GGUF org/LongCat-Image-Q4_K_M.gguf \
  --local-dir ComfyUI/models/diffusion_models/

# 2) Qwen2.5-VL-7B text encoder + mmproj (GGUF). Place BOTH in models/text_encoders/.
#    ComfyUI-GGUF auto-detects the mmproj when it sits beside the main encoder.
hf download unsloth/Qwen2.5-VL-7B-Instruct-GGUF Qwen2.5-VL-7B-Instruct-Q4_K_M.gguf \
  --local-dir ComfyUI/models/text_encoders/
hf download unsloth/Qwen2.5-VL-7B-Instruct-GGUF mmproj-BF16.gguf \
  --local-dir ComfyUI/models/text_encoders/

# 3) VAE — pulled from the canonical repo's vae/ folder
hf download meituan-longcat/LongCat-Image vae/diffusion_pytorch_model.safetensors \
  --local-dir ComfyUI/models/vae/

Per the live vantagewithai/LongCat-Image-GGUF Files tab, the transformer GGUF tiers are Q4_K_M 3.73 GB, Q6_K 5.15 GB and Q8_0 6.67 GB. The unsloth/Qwen2.5-VL-7B-Instruct-GGUF Files tab lists the Q4_K_M text encoder at 4.68 GB and mmproj-BF16.gguf at 1.35 GB. The mmproj-loading convention (place the mmproj in the same folder as the main text encoder; the node auto-detects it on selection) is documented in the QuantStack Qwen2.5-VL mmproj-loading discussion and is the same layout the LongCat ComfyUI workflow uses.
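Before launching ComfyUI it can be worth checking that every file landed where the loaders will look. The sketch below assumes the download commands above were run unchanged, so the repos' own subfolders (`org/`, `vae/`) are preserved under each `--local-dir`. The function name `check_models`, the `EXPECTED` table and the 10% size tolerance are illustrative choices, not part of any tool. The tolerance is wide enough to absorb a GB-versus-GiB difference in how the sizes are reported. The VAE size is not quoted in this recipe, so only its presence is checked.

```python
from pathlib import Path

# Paths relative to ComfyUI/models/, with sizes in GB as quoted in this recipe.
EXPECTED = {
    "diffusion_models/org/LongCat-Image-Q4_K_M.gguf": 3.73,
    "text_encoders/Qwen2.5-VL-7B-Instruct-Q4_K_M.gguf": 4.68,
    "text_encoders/mmproj-BF16.gguf": 1.35,
    "vae/vae/diffusion_pytorch_model.safetensors": None,  # presence check only
}


def check_models(models_dir, expected=EXPECTED, tolerance=0.10):
    """Return a list of problems: missing files, or sizes off by more than `tolerance`."""
    problems = []
    for rel_path, size_gb in expected.items():
        path = Path(models_dir) / rel_path
        if not path.is_file():
            problems.append(f"missing: {rel_path}")
            continue
        if size_gb is not None:
            actual_gb = path.stat().st_size / 1e9
            if abs(actual_gb - size_gb) > tolerance * size_gb:
                problems.append(
                    f"size mismatch: {rel_path} is {actual_gb:.2f} GB, expected ~{size_gb} GB"
                )
    return problems
```

For example, `check_models("ComfyUI/models")` returns an empty list when all four files are present and plausibly sized; a size mismatch usually points to an interrupted download.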

Running

1. Launch ComfyUI: python main.py.
2. Load the LongCat-Image text-to-image workflow. Two reasonable starting points:
   - Official ComfyUI template: LongCat Image: Text to Image, which combines the LongCat-Image model, the Qwen2.5-VL text encoder and the VAE in a compact graph, tuned for 1024×1024 at 20 steps.
   - Community GGUF workflow: the comfy/Vantage-Longcat-Image.json file shipped alongside the GGUF weights.
3. In the workflow, set UnetLoaderGGUF to LongCat-Image-Q4_K_M.gguf (it may appear in the dropdown as org/LongCat-Image-Q4_K_M.gguf, since the download keeps the repo's org/ subfolder), and the CLIPLoader (GGUF) to the Qwen2.5-VL-7B-Instruct-Q4_K_M.gguf encoder.
4. Hit Queue Prompt. The first run loads the weights from disk; subsequent runs reuse them.
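The same steps can also be scripted against a running ComfyUI instance through its HTTP `/prompt` endpoint. This is a minimal sketch under several assumptions: the workflow has been exported from the UI in API format (the shipped workflow JSON is a UI-format file and would need re-exporting), the server listens on the default `127.0.0.1:8188`, and the loader nodes use the class names `UnetLoaderGGUF` and `CLIPLoaderGGUF` with inputs `unet_name` and `clip_name`. Check those identifiers against your own exported JSON before relying on them. The function names are illustrative.

```python
import copy
import json
import urllib.request


def set_gguf_models(graph, unet_name, clip_name):
    """Return a copy of an API-format graph with the GGUF loader nodes repointed."""
    patched = copy.deepcopy(graph)
    for node in patched.values():
        inputs = node.setdefault("inputs", {})
        if node.get("class_type") == "UnetLoaderGGUF":      # assumed class name
            inputs["unet_name"] = unet_name
        elif node.get("class_type") == "CLIPLoaderGGUF":    # assumed class name
            inputs["clip_name"] = clip_name
    return patched


def queue_prompt(graph, server="http://127.0.0.1:8188"):
    """POST the graph to ComfyUI's /prompt endpoint and return the decoded JSON reply."""
    body = json.dumps({"prompt": graph}).encode("utf-8")
    request = urllib.request.Request(
        f"{server}/prompt",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())
```

A typical use would be to load the exported JSON with `json.load`, pass it through `set_gguf_models(graph, "org/LongCat-Image-Q4_K_M.gguf", "Qwen2.5-VL-7B-Instruct-Q4_K_M.gguf")`, and hand the result to `queue_prompt`. The original graph is left untouched, so several tiers can be queued from one export.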

The HF model card's reference example uses guidance_scale=4.0, num_inference_steps=50, 768×1344 — bump steps to 50 if you want model-card-quality reference output; 20 steps is a reasonable preview default.
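To get a feel for what the model-card settings cost relative to the preview default, the arithmetic below compares the two. It is a rough sketch that assumes sampling time scales linearly with steps × pixel count, which ignores text encoding, VAE decode and the superlinear cost of attention. The dictionary and function names are illustrative.

```python
# Settings quoted in this recipe: the ComfyUI template default and the model-card example.
PREVIEW = {"width": 1024, "height": 1024, "steps": 20}
REFERENCE = {"width": 768, "height": 1344, "steps": 50, "guidance_scale": 4.0}


def pixels(cfg):
    """Number of output pixels for a settings dict."""
    return cfg["width"] * cfg["height"]


def relative_cost(cfg, baseline):
    """Cost ratio under the (assumed) steps x pixels scaling model."""
    return (cfg["steps"] * pixels(cfg)) / (baseline["steps"] * pixels(baseline))


print(pixels(PREVIEW), pixels(REFERENCE))           # 1048576 1032192
print(round(relative_cost(REFERENCE, PREVIEW), 2))  # 2.46
```

The two resolutions carry almost the same pixel count, so under this model nearly all of the extra cost of the reference settings comes from the jump from 20 to 50 steps.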

Results

Speed: Not yet measured on the RTX 4070 Ti SUPER. No LongCat-Image measurement taken on this specific card has surfaced, and this card is slower than the larger 16 GB Ada siblings we have published recipes for: it has ~672 GB/s memory bandwidth and 8448 CUDA cores, versus ~716.8 GB/s and 9728 cores on the RTX 4080 (about 6% less bandwidth and 13% fewer cores), so borrowing a 4080 figure would overstate it. The /check/longcat-image/rtx-4070-ti-super endpoint currently has no benchmark for this pair (verdict: unknown). If you run it, please submit your timing via /contribute so we can populate a real figure.
VRAM usage: budget the full 16 GB. The transformer at Q4_K_M is 3.73 GB and the Qwen2.5-VL text encoder at Q4_K_M is 4.68 GB per the per-tier file tables linked above; ComfyUI keeps only one of them resident on GPU at a time, leaving headroom for activations and VAE decode. The vanilla diffusers BF16 path needs ~17 GB with enable_model_cpu_offload() per the HF model card Quick Start, or ~18 GB for the latest official inference code per the Meituan COLLABORATOR's Issue #8 statement — which is why this recipe uses GGUF.
Quality notes: LongCat-Image is bilingual by design, and the arXiv technical report (2512.07584) highlights multilingual text rendering as a primary target. Q8_0 is the closest to BF16 in quality; Q6_K is the middle tier to step down to if VRAM gets tight after VAE decode. Q4_K_M is the smallest published tier and fits this card comfortably; expect a mild quality drop on fine details versus the higher tiers.
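The weight budget behind these notes can be tabulated quickly. The sketch below uses only the file sizes quoted in this recipe and treats them as a stand-in for resident VRAM, which is an approximation: activations, VAE decode and runtime overhead all come out of the remaining headroom. The "swapped" column follows the statement above that only one of the two models is resident at a time; the "both resident" column is a pessimistic case added for comparison. The names are illustrative.

```python
CARD_GB = 16.0
ENCODER_Q4_GB = 4.68  # Qwen2.5-VL-7B-Instruct Q4_K_M text encoder
TRANSFORMER_GB = {"Q4_K_M": 3.73, "Q6_K": 5.15, "Q8_0": 6.67, "BF16": 12.54}


def headroom(transformer_gb, encoder_gb=ENCODER_Q4_GB, card_gb=CARD_GB, one_at_a_time=True):
    """GB left after the weights, for activations, VAE decode and runtime overhead."""
    if one_at_a_time:
        weights = max(transformer_gb, encoder_gb)   # only the larger model counts
    else:
        weights = transformer_gb + encoder_gb       # pessimistic: both on the GPU
    return round(card_gb - weights, 2)


for tier, size in TRANSFORMER_GB.items():
    swapped = headroom(size)
    both = headroom(size, one_at_a_time=False)
    print(f"{tier:7s} swapped: {swapped:6.2f} GB   both resident: {both:6.2f} GB")
```

With these figures every GGUF tier leaves at least 9 GB when the models are swapped, while the BF16 transformer leaves about 3.5 GB and goes negative if the Q4_K_M encoder has to sit beside it.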

For the full benchmark data, see /check/longcat-image/rtx-4070-ti-super.

Troubleshooting

`pip install -r requirements.txt` errors with `No module named 'dskernels'`

The official Meituan infer_requirements.txt lists dskernels, which is not on PyPI — ghostnyambit reported the same blocker on Issue #8. This is only an issue if you go down the native-diffusers path. The ComfyUI + GGUF flow above does not import dskernels, so this error is a strong signal you've installed the wrong dependency tree. Skip the Meituan infer_requirements.txt entirely; install the ComfyUI-GGUF requirements instead.

OOM during the diffusion step on a 16 GB card

If you loaded the BF16 transformer instead of GGUF, you will OOM — the BF16 transformer alone is 12.54 GB (per the Comfy-Org repackaged BF16), leaving almost no room for activations and the VAE on a 16 GB card. Switch the UnetLoaderGGUF to LongCat-Image-Q4_K_M.gguf (or Q6_K / Q8_0) and rerun. Also make sure the Qwen2.5-VL text encoder is the GGUF version, not the BF16 5-shard set — the BF16 text encoder is 16.58 GB on its own and will not coexist with the transformer in VRAM.

`enable_model_cpu_offload()` doesn't work on the Edit pipeline

Community user mingyi456 notes on Issue #8 that enable_model_cpu_offload() currently does not work with the edit pipeline. The base text-to-image pipeline is unaffected. This is another reason this recipe is scoped to the base variant only — the editing path needs a manual sequential-offload patch.

Where are the BF16 ComfyUI weights if I have a bigger card?

The ComfyUI core team publishes repackaged BF16 safetensors at Comfy-Org/LongCat-Image (split_files/diffusion_models/longcat_image_bf16.safetensors, 12.54 GB). Those are for cards with more headroom than 16 GB; on the RTX 4070 Ti SUPER stick with the GGUF transformer above. Report any working configuration via /contribute.