LongCat-Image (base T2I) on RTX 4090: Bilingual 6B Text-to-Image via diffusers

image · intermediate · 18GB+ VRAM · May 20, 2026
Prerequisites
  • NVIDIA RTX 4090 (24 GB VRAM) or any 24 GB+ consumer card with CUDA support
  • Python 3.10
  • 32 GB system RAM recommended (CPU offload of the text encoder uses host RAM)
  • ~30 GB free disk for model weights
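
The prerequisites above can be checked before downloading anything. The following is a minimal preflight sketch, not part of the upstream repo: the function names, the Linux-only RAM detection via os.sysconf, and the 0.9 tolerance (hardware usually reports slightly less than its nominal capacity) are all assumptions made for illustration.

```python
import os
import shutil

# Nominal thresholds taken from the prerequisites list.
MIN_VRAM_GB = 24
MIN_RAM_GB = 32
MIN_DISK_GB = 30
# Assumption: a "24 GB" card or "32 GB" of RAM reports a bit less than nominal.
TOLERANCE = 0.9


def check_prereqs(vram_gb, ram_gb, disk_gb):
    """Return a list of problems; an empty list means nothing was flagged.

    Pass None for any value that could not be detected; it is skipped.
    """
    problems = []
    if vram_gb is not None and vram_gb < MIN_VRAM_GB * TOLERANCE:
        problems.append(f"VRAM {vram_gb:.1f} GB is below the {MIN_VRAM_GB} GB tier")
    if ram_gb is not None and ram_gb < MIN_RAM_GB * TOLERANCE:
        problems.append(f"RAM {ram_gb:.1f} GB is below the recommended {MIN_RAM_GB} GB")
    if disk_gb is not None and disk_gb < MIN_DISK_GB:
        problems.append(f"free disk {disk_gb:.1f} GB is below {MIN_DISK_GB} GB")
    return problems


def detect():
    """Best-effort detection of (vram_gb, ram_gb, disk_gb) on this machine."""
    gib = 1024 ** 3
    disk_gb = shutil.disk_usage(".").free / gib
    try:  # these sysconf names are available on Linux
        ram_gb = os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES") / gib
    except (AttributeError, ValueError, OSError):
        ram_gb = None
    try:
        import torch  # optional: may not be installed yet at this stage
        if torch.cuda.is_available():
            vram_gb = torch.cuda.get_device_properties(0).total_memory / gib
        else:
            vram_gb = None
    except ImportError:
        vram_gb = None
    return vram_gb, ram_gb, disk_gb


if __name__ == "__main__":
    found = check_prereqs(*detect())
    for line in found or ["no prerequisite problems detected"]:
        print(line)
```

Values that cannot be detected (for example VRAM before PyTorch is installed) are skipped rather than reported as failures, so a clean result only covers what was actually measured.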

What You'll Build

A working diffusers setup for Meituan's LongCat-Image — a 6B-parameter bilingual (Chinese + English) MM-DiT/Single-DiT text-to-image model — running natively on a single 24 GB RTX 4090. This recipe is scoped to the base text-to-image variant; the image-editing siblings are out of scope below.

Hardware data: RTX 4090 (24 GB VRAM) · canonical diffusers + enable_model_cpu_offload() · ~18 GB peak per Meituan team statement · benchmark data at /check/longcat-image/rtx-4090

⚠️ Why this recipe pins the base variant. Meituan publishes four siblings under the LongCat-Image brand and their inference paths and VRAM profiles differ — see "Sibling variants and what fits 24 GB" below before downloading anything.

ℹ️ Why this recipe pins the diffusers runtime. On a 16 GB consumer card the only confirmed path is ComfyUI + GGUF (see the 4060 Ti sibling recipe). The 24 GB tier unlocks the canonical diffusers LongCatImagePipeline directly. The Meituan team's own GitHub statement is that the latest official inference code "consumes approximately 18 GB of VRAM and supports inference on an RTX 4090" (Issue #8 comment by junqiangwu, 2025-12-08). The HF model card's Quick Start reports the same profile: with pipe.enable_model_cpu_offload() it is "Required ~17 GB" (HF model card). The community ComfyUI integration, which was contributed back to the official repo as a PR, similarly reports "Standard (CPU offload disabled): ~24GB+" and "Low VRAM (CPU offload enabled): ~17-19GB" (sooxt98/comfyui_longcat_image). All three sources put the offload path at roughly 17 to 19 GB, which makes it the canonical 4090 path.

Sibling variants and what fits 24 GB

| Variant | Purpose | 24 GB fit (cited) |
| --- | --- | --- |
| LongCat-Image (this recipe) | Final-release T2I, 6B params, hybrid MM-DiT/Single-DiT à la Flux1.dev per the arXiv technical report (2512.07584) | Yes: canonical diffusers path with enable_model_cpu_offload(), ~18 GB peak per Meituan team |
| LongCat-Image-Edit | Image-to-image editing variant | Tighter: enable_model_cpu_offload() does not work on the Edit pipeline per user mingyi456 on Issue #8, so the no-offload "~24 GB+" tier is what you get. Borderline on a 24 GB card; not in scope here |
| LongCat-Image-Edit-Turbo | Distilled few-step edit variant | Same memory profile as Edit; if you get a working setup, /contribute so we can publish one |
| LongCat-Image-Dev | Mid-training checkpoint intended for fine-tuning, not inference | Out of scope for this recipe |

Requirements

| Component | Minimum | Tested |
| --- | --- | --- |
| GPU | 24 GB VRAM, CUDA-capable | RTX 4090 (24 GB) |
| RAM | 32 GB system RAM (CPU offload spills the text encoder to host) | |
| Storage | ~30 GB free for the upstream BF16 repo | |
| Software | Python 3.10, latest diffusers from main, PyTorch with CUDA | |

Installation

The default pip install torch already includes sm_89 kernels (Ada Lovelace) — no special wheel selection is required for the RTX 4090. FlashAttention-2 also ships full sm_89 kernel coverage, so the standard PyTorch + diffusers install is everything you need.
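
To confirm the installed build actually covers the card, the compute capability can be compared against the architectures the build was compiled for. This is a hedged sketch: torch.cuda.get_device_capability and torch.cuda.get_arch_list are real PyTorch calls, but the helper build_covers and its matching rule are illustrative. The rule relies on the CUDA convention that a binary built for an architecture with the same major version and an equal or lower minor version also runs on the device, so an exact sm_89 entry is treated as sufficient but not necessary.

```python
def build_covers(capability, arch_list):
    """True if arch_list has a binary arch usable on a device of this capability.

    capability is a (major, minor) tuple, e.g. (8, 9) for an RTX 4090.
    arch_list holds strings such as "sm_86"; other entries are ignored.
    """
    major, minor = capability
    for arch in arch_list:
        if not arch.startswith("sm_"):
            continue  # skip PTX-only entries like "compute_90"
        digits = arch[3:]
        if not digits.isdigit():
            continue  # skip suffixed variants such as "sm_90a"
        arch_major, arch_minor = int(digits[:-1]), int(digits[-1])
        if arch_major == major and arch_minor <= minor:
            return True
    return False


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        print("PyTorch is not installed in this environment")
    else:
        if torch.cuda.is_available():
            cap = torch.cuda.get_device_capability(0)
            print(f"device sm_{cap[0]}{cap[1]}, build archs: {torch.cuda.get_arch_list()}")
            print("covered:", build_covers(cap, torch.cuda.get_arch_list()))
        else:
            print("CUDA is not available to PyTorch")
```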

1. Create the conda environment

The official GitHub README pins Python 3.10:

conda create -n longcat-image python=3.10
conda activate longcat-image

2. Clone the repo and install dependencies

Install the repo's inference requirements, then install diffusers from main. The LongCatImagePipeline class that the HF model card's Quick Start uses comes from the upstream integration in HF diffusers#12828:

git clone https://github.com/meituan-longcat/LongCat-Image
cd LongCat-Image
pip install -r infer_requirements.txt
pip install -U git+https://github.com/huggingface/diffusers

If infer_requirements.txt errors with No module named 'dskernels', see Troubleshooting below — dskernels is not on PyPI, but you can skip it for inference.

3. Pre-download the model

The pipeline auto-downloads on first run, but pre-fetching avoids surprises and gives you a clean progress bar:

hf download meituan-longcat/LongCat-Image --local-dir ./longcat-image

The repo is ~29 GB on disk (BF16 transformer + Qwen2.5-VL-7B text encoder + VAE).
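
A quick way to spot an interrupted download is to total the bytes in the local directory and compare against the ~29 GB figure. This is a small stdlib-only sketch; dir_size_gb is a hypothetical helper, it reports decimal gigabytes, and it skips symlinks so cached files are not counted twice.

```python
import os


def dir_size_gb(path):
    """Total size of regular files under path, in decimal GB (1e9 bytes)."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            full = os.path.join(root, name)
            if not os.path.islink(full):  # avoid double-counting linked blobs
                total += os.path.getsize(full)
    return total / 1e9


if __name__ == "__main__":
    target = "./longcat-image"  # the --local-dir used in the download command
    if os.path.isdir(target):
        print(f"{target}: {dir_size_gb(target):.1f} GB on disk")
    else:
        print(f"{target} does not exist yet")
```

A result far below 29 GB suggests the download stopped partway; rerunning the same hf download command should pick up the missing files.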

Running

The HF model card's reference Quick Start works as-is on a 24 GB card. Save the following as run_t2i.py inside the cloned LongCat-Image directory:

import torch
from diffusers import LongCatImagePipeline

if __name__ == '__main__':
    pipe = LongCatImagePipeline.from_pretrained(
        "meituan-longcat/LongCat-Image",
        torch_dtype=torch.bfloat16,
    )
    # On a 24 GB RTX 4090, keep CPU offload enabled — this is the path the
    # Meituan team validated at ~18 GB peak. Disable only if you have ≥32 GB VRAM.
    pipe.enable_model_cpu_offload()

    prompt = (
        "A young Asian woman in a yellow knit sweater with a white necklace, "
        "hands resting on her knees, calm expression. Background is a rough "
        "brick wall, warm afternoon sunlight, medium-distance shot."
    )

    image = pipe(
        prompt,
        height=768,
        width=1344,
        guidance_scale=4.0,
        num_inference_steps=50,
        num_images_per_prompt=1,
        generator=torch.Generator("cpu").manual_seed(43),
        enable_cfg_renorm=True,
        enable_prompt_rewrite=True,
    ).images[0]

    image.save("./t2i_example.png")

Run it:

python run_t2i.py

Output lands at ./t2i_example.png. First run downloads any weights not pre-fetched; subsequent runs load straight from the HF cache.

Meituan's repo also includes scripts/inference_t2i.py with the same defaults hardcoded; that script is equivalent to the above and runs with python scripts/inference_t2i.py.

Text-in-image: LongCat-Image renders embedded text. The HF README is explicit that you must wrap the target text in single or double quotation marks (English '...' / "..." or the Chinese full-width equivalents ‘...’ / “...”). Without quotes, the model treats the words as scene description, not glyphs to render.
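
The quoting rule is easy to get wrong when prompts are assembled programmatically. The sketch below shows one way to wrap the text to be rendered. The helper name quote_for_render is hypothetical, and using the curly marks U+201C/U+201D as the Chinese-style quotes is an assumption based on the README's description.

```python
def quote_for_render(text, chinese=False):
    """Wrap text in quotation marks so the prompt marks it as text to render."""
    if chinese:
        left, right = "\u201c", "\u201d"  # Chinese-style double quotation marks
    else:
        left, right = '"', '"'
    return f"{left}{text}{right}"


sign = quote_for_render("OPEN LATE")
prompt = f"A neon sign above a noodle shop that reads {sign}, rainy night street."
print(prompt)
```

The resulting prompt string can be passed to pipe(...) exactly like the prompt in run_t2i.py.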

Results

  • Speed: No RTX-4090-specific inference-time measurement is cited in the official model card, GitHub repo, ComfyUI integration, or arXiv tech report at time of writing. Once a community run lands at /check/longcat-image/rtx-4090, this section gets updated; contribute one via /contribute if you measure it.
  • VRAM usage: ~18 GB peak with enable_model_cpu_offload() per the Meituan team comment on Issue #8. The HF model card's Quick Start labels the same path "Required ~17 GB" (model card). The community ComfyUI port lists "Low VRAM (CPU offload enabled): ~17-19GB" and "Standard (CPU offload disabled): ~24GB+" (sooxt98/comfyui_longcat_image) — so a 24 GB card with offload disabled is borderline, but with offload enabled it sits comfortably under the VRAM ceiling.
  • Quality notes: LongCat-Image is bilingual by design (Chinese + English) and the arXiv technical report (2512.07584) highlights multilingual text rendering as a primary target. The 6B parameter count is "significantly smaller than the nearly 20B or larger Mixture-of-Experts (MoE) architectures common in the field" per the same report, and the architecture is "a hybrid MM-DiT and Single-DiT structure, consistent with Flux1.dev". Quality at native BF16 is the canonical reference — no quantization tradeoffs to consider on this card.

For the full benchmark data, see /check/longcat-image/rtx-4090.
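
For anyone who wants to measure speed and peak VRAM on their own card, a timing wrapper along these lines is one option. It is a sketch, not the method behind any cited figure: measure is a hypothetical helper, and torch.cuda.max_memory_allocated reports memory held by PyTorch tensors only, which is normally lower than the total that nvidia-smi shows for the process.

```python
import time


def measure(fn):
    """Run fn() once and return (result, seconds, peak_vram_gib or None)."""
    try:
        import torch
        cuda = torch.cuda.is_available()
    except ImportError:
        torch, cuda = None, False

    if cuda:
        torch.cuda.reset_peak_memory_stats()
        torch.cuda.synchronize()  # do not time previously queued GPU work

    start = time.perf_counter()
    result = fn()
    if cuda:
        torch.cuda.synchronize()  # wait for the GPU to finish before stopping
    seconds = time.perf_counter() - start

    peak_gib = torch.cuda.max_memory_allocated() / 1024 ** 3 if cuda else None
    return result, seconds, peak_gib


if __name__ == "__main__":
    # Stand-in workload; in run_t2i.py this would be
    # lambda: pipe(prompt, ...).images[0]
    _, secs, peak = measure(lambda: sum(range(1_000_000)))
    print(f"{secs:.3f} s, peak VRAM: {peak}")
```

The first generation after loading includes one-time warm-up cost, so timing a second call gives a steadier number.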

Troubleshooting

pip install -r infer_requirements.txt errors with No module named 'dskernels'

The official Meituan infer_requirements.txt lists dskernels, which is not on PyPI — ghostnyambit reported the same blocker on Issue #8 (2025-12-11). dskernels is only required for training-time DeepSpeed optimizations and not for inference. Comment the line out of infer_requirements.txt and re-run the install — the diffusers Quick Start above does not touch DeepSpeed.
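
Commenting the line out by hand works; the same step can be scripted so it is repeatable after a fresh clone. This is a minimal sketch: drop_requirement and the output filename infer_requirements.nodskernels.txt are made up for illustration, and the original file is left untouched.

```python
import os
import re


def drop_requirement(lines, package):
    """Return lines with every requirement entry for `package` commented out."""
    # Match the package name at line start, followed by end of line or a
    # version/extras/marker separator, so names like "dskernels_extra" survive.
    pattern = re.compile(
        rf"^\s*{re.escape(package)}\s*($|[=<>!~\[;#@ ])", re.IGNORECASE
    )
    return [f"# {line}" if pattern.match(line) else line for line in lines]


if __name__ == "__main__":
    src = "infer_requirements.txt"
    dst = "infer_requirements.nodskernels.txt"  # hypothetical output name
    if os.path.exists(src):
        with open(src, encoding="utf-8") as f:
            filtered = drop_requirement(f.readlines(), "dskernels")
        with open(dst, "w", encoding="utf-8") as f:
            f.writelines(filtered)
        print(f"wrote {dst}; install with: pip install -r {dst}")
    else:
        print(f"{src} not found; run this inside the cloned LongCat-Image directory")
```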

OOM despite enable_model_cpu_offload()

Confirm you are not also calling pipe.to(device, torch.bfloat16) after enable_model_cpu_offload() — the two are mutually exclusive. The HF model card's Quick Start has the pipe.to(...) line commented out for exactly this reason, with the in-line note "Uncomment for high VRAM devices (Faster inference)". On a 24 GB card, leave it commented; the team's quoted ~18 GB number assumes offload is active.
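
One way to make the either/or choice explicit is to route it through a single function so both calls can never run on the same pipeline. This is a sketch under stated assumptions: place_pipeline is a hypothetical helper, and the 32 GB default threshold is taken from the comment in the script above rather than from a measured limit.

```python
def place_pipeline(pipe, total_vram_gb, no_offload_min_gb=32.0):
    """Apply exactly one placement strategy and return its name.

    pipe is any object exposing .to(device) and .enable_model_cpu_offload(),
    as diffusers pipelines do.
    """
    if total_vram_gb >= no_offload_min_gb:
        pipe.to("cuda")  # whole pipeline resident on the GPU
        return "full-gpu"
    pipe.enable_model_cpu_offload()  # submodels moved to the GPU on demand
    return "cpu-offload"
```

On a 24 GB RTX 4090 this returns "cpu-offload", the path the quoted ~18 GB figure assumes. In run_t2i.py it would replace the direct pipe.enable_model_cpu_offload() call, with total_vram_gb read from torch.cuda.get_device_properties(0).total_memory.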

Out of host RAM, not VRAM

User reckless-huang reported on Issue #8 (2025-12-08) that the old script failed on a system with 32 GB host RAM even though VRAM was fine. The Meituan team's follow-up fix in the next-day commit reduced host-memory pressure as well as VRAM. Make sure you've installed from the latest main branch — if you cloned before 2025-12-08, pull again.

enable_model_cpu_offload() doesn't work on the Edit pipeline

User mingyi456 notes on Issue #8 that enable_model_cpu_offload() currently does not work with LongCatImageEditPipeline. This recipe is scoped to the base LongCatImagePipeline for exactly this reason — Edit needs the no-offload "~24GB+" tier, which is borderline on this card. For LongCat-Image-Edit on a 4090, follow the upstream issue thread for the manual sequential-offload patch before assuming a turnkey workflow exists.

LiVeen's FP8 (LongCat-Image-Edit-FP8-e4m3fn) is unverified

A community FP8 quant of the Edit variant exists at LiVeen/LongCat-Image-Edit-FP8-e4m3fn. The author themselves stated on Issue #8 that there is "a fairly high likelihood that this model won't work without the rest of the diffusers stuff, or even at all" and has not tested it. Don't substitute it for the canonical BF16 path until somebody verifies it — report results via /contribute if you try.