LongCat-Image (base T2I) on RTX 5090: Bilingual 6B Text-to-Image via diffusers BF16 with 14 GB Headroom

image · intermediate · 18 GB+ VRAM · May 24, 2026
Prerequisites
  • NVIDIA RTX 5090 (32 GB VRAM, Blackwell sm_120) or any 32 GB+ consumer card with CUDA support
  • Python 3.10
  • PyTorch built against CUDA 12.8+ wheels (cu128) for sm_120 kernel coverage
  • 32 GB system RAM recommended (CPU offload of the text encoder uses host RAM)
  • ~30 GB free disk for model weights

What You'll Build

A working diffusers setup for Meituan's LongCat-Image, a 6B-parameter bilingual (Chinese + English) MM-DiT/Single-DiT text-to-image model, running natively on a single 32 GB RTX 5090 (Blackwell, sm_120). This recipe covers only the base text-to-image variant; the image-editing siblings are out of scope. The install steps match the 24 GB Ada (4090) and Ampere (3090) sibling recipes because the official inference code is BF16-only and the on-disk transformer is just 12.54 GB. The Blackwell-specific notes cover three things: the cu128 PyTorch wheel, the FA2 sm_120 kernel gap (which does not affect this recipe, because diffusers uses SDPA by default), and how to spend the ~14 GB of leftover VRAM.

Hardware data: RTX 5090 (32 GB VRAM, Blackwell sm_120) · canonical diffusers + enable_model_cpu_offload() · ~17 GB peak per the HF model card Quick Start (verbatim Python comment: Required ~17 GB), cross-confirmed at ~18 GB by the Meituan team's GitHub statement (by junqiangwu, repository COLLABORATOR). See benchmark data.

ℹ️ Why this recipe pins the base variant. Meituan publishes four siblings under the LongCat-Image brand and their inference paths and VRAM profiles differ — see "Sibling variants and what fits 32 GB" below before downloading anything.

ℹ️ Why this recipe pins the diffusers runtime, not ComfyUI. On a 16 GB consumer card the only confirmed path is ComfyUI + GGUF (see the 4060 Ti / 5060 Ti sibling recipes). On 24 GB cards the canonical diffusers LongCatImagePipeline runs directly. The 32 GB 5090 inherits that path with substantial headroom: the Meituan team's own GitHub statement is that the latest official inference code "consumes approximately 18 GB of VRAM and supports inference on an RTX 4090" (Issue #8 comment by junqiangwu (COLLABORATOR), 2025-12-08). The HF model card's Quick Start independently labels the same path with the inline comment Required ~17 GB (HF model card). Two independent first-party sources name the same envelope; the 32 GB tier holds it with ~14 GB to spare.

Why the 4090 path transfers cleanly to the 5090

  • Pure BF16 pipeline, no FP8 needed. The official inference code uses torch_dtype=torch.bfloat16 end-to-end. Blackwell sm_120 adds hardware FP4 (microscaling) acceleration on top of the FP8 (E4M3/E5M2) tensor-core support that Ada already provides, but those tensor-core paths are irrelevant here because the model does not ship FP8 weights and the canonical pipeline does not invoke FP8 kernels at runtime. The Ada → Blackwell swap is a no-op at the precision level for this recipe.
  • Same 24 GB-tier envelope, with 14 GB to spare. The envelope ("~17 GB" with offload, "~24 GB+" without) is set by the BF16 transformer weights (12.54 GB on disk per the HF tree API), plus activations, plus the Qwen2.5-VL text encoder while it is resident on the GPU. A 32 GB Blackwell card holds the same footprint a 24 GB Ada card does, with the extra VRAM available for things like multi-image batching, larger generation resolutions, or co-loading a second model. See "Spending the 14 GB Headroom" below.
  • SDPA is the default — FA2 sm_120 kernel gap doesn't matter here. diffusers uses scaled_dot_product_attention (PyTorch's built-in fused attention) by default, and SDPA ships full sm_120 kernel coverage on the cu128 PyTorch wheel. FlashAttention-2's sm_120 kernel coverage is still incomplete — tracked at Dao-AILab/flash-attention#2168 (open as of 2026-05-24) — but the canonical LongCat-Image inference path does not require FA2 and no install step pulls it in.
  • cu128 PyTorch wheel is the one Blackwell-specific install detail. The default pip install torch resolves to cu128 since the late-Q1 2026 wheel update, which carries sm_120 kernels. If you're using a pre-cu128 environment (older virtualenv pinning an older CUDA index URL), upgrade — that's the only sm_120-vs-sm_89 install delta for this recipe.
  • Compute is ~1.3× that of Ada, but speed is not quoted here. Blackwell sm_120 dense FP16 throughput is roughly 105 TFLOPS vs the 4090's ~82 TFLOPS, and the 5090's 1792 GB/s memory bandwidth tops the 4090's 1008 GB/s. The 4090 sibling recipe does not quote a per-image speed because no GPU-named inference benchmark appears in the official sources, so extrapolating a 5090 number from a 4090 number we do not have would be doubly speculative. See Results → Speed below for the empirical handoff to /check/.
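
The headroom figures quoted in this section are plain subtraction, and the sketch below makes that arithmetic explicit. The constant and function names are illustrative, and the 24 GB entry is the lower bound of a "~24 GB+" band, so the real no-offload headroom may be smaller than the number printed.

```python
# VRAM figures quoted in this recipe, in GB. Names are illustrative only.
CARD_VRAM_GB = 32
CITED_PEAKS_GB = {
    "offload, HF model card": 17,
    "offload, Meituan GitHub statement": 18,
    "no offload, lower bound of the ~24 GB+ band": 24,
}


def headroom_gb(card_gb, peak_gb):
    """Return the VRAM left over once the cited peak is resident."""
    return card_gb - peak_gb


for label, peak in CITED_PEAKS_GB.items():
    print(f"{label}: ~{headroom_gb(CARD_VRAM_GB, peak)} GB spare")
```

The recipe's "~14 GB" figure uses the more conservative 18 GB statement rather than the 17 GB one.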

Sibling variants and what fits 32 GB

| Variant | Purpose | 32 GB fit (cited) |
| --- | --- | --- |
| LongCat-Image (this recipe) | Final-release T2I, 6B params, hybrid MM-DiT/Single-DiT à la Flux1.dev per the arXiv technical report (2512.07584) | Yes: canonical diffusers path with enable_model_cpu_offload(), ~17 GB peak per HF card Quick Start, ~18 GB per Meituan team; ~14 GB headroom |
| LongCat-Image-Edit | Image-to-image editing variant | Yes: enable_model_cpu_offload() currently does not work on the Edit pipeline per user mingyi456 on Issue #8, so the no-offload "~24 GB+" tier is what you get. Borderline on a 24 GB card but well within 32 GB. Out of scope here pending a published end-to-end workflow |
| LongCat-Image-Edit-Turbo | Distilled few-step edit variant | Same memory profile as Edit; comfortable on 32 GB. If you get a workflow working, submit it via /contribute so we can publish one |
| LongCat-Image-Dev | Mid-training checkpoint intended for fine-tuning, not inference | Out of scope for this recipe |

Requirements

| Component | Minimum | Tested |
| --- | --- | --- |
| GPU | 18 GB VRAM, CUDA-capable (sm_120 / sm_89 / sm_86 all fine) | RTX 5090 (32 GB, Blackwell sm_120) |
| RAM | 32 GB system RAM (CPU offload spills the text encoder to host) | |
| Storage | ~30 GB free for the upstream BF16 repo | |
| Software | Python 3.10, latest diffusers (≥ the integration in diffusers#12828), PyTorch with CUDA 12.8 (cu128) for sm_120 kernel coverage | |

Installation

The default pip install torch resolves to a cu128 wheel since the late-Q1 2026 update — that wheel carries sm_120 kernels for Blackwell. SDPA (PyTorch's built-in fused attention, which diffusers uses by default) has full sm_120 coverage. FlashAttention-2 sm_120 kernel coverage is still incomplete and tracked at Dao-AILab/flash-attention#2168 — but no step below pulls FA2 in, and it is not required for this pipeline.

1. Create the conda environment

The official GitHub README pins Python 3.10:

conda create -n longcat-image python=3.10
conda activate longcat-image

2. Clone the repo and install dependencies

Install diffusers from PyPI (≥ the version that includes the LongCatImagePipeline integration via diffusers#12828). The current Meituan README installs from PyPI directly, and the upstream HF model card's Quick Start uses the same shipped class:

git clone https://github.com/meituan-longcat/LongCat-Image
cd LongCat-Image
pip install -r infer_requirements.txt
pip install -U diffusers

If infer_requirements.txt errors with No module named 'dskernels', see Troubleshooting below — dskernels is not on PyPI, but you can skip it for inference.

If your torch install was pinned to a pre-cu128 index URL (e.g. an older --index-url https://download.pytorch.org/whl/cu121), reinstall against the current default index:

pip install --upgrade --force-reinstall torch torchvision

The default index now serves cu128 wheels with sm_120 kernels for Blackwell.

3. Pre-download the model

The pipeline auto-downloads on first run, but pre-fetching avoids surprises and gives you a clean progress bar:

hf download meituan-longcat/LongCat-Image --local-dir ./longcat-image

The repo is ~29 GB on disk (12.54 GB BF16 transformer + 16.58 GB Qwen2.5-VL-7B-Instruct text encoder + ~168 MB VAE, per the HF tree API).
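
As a quick sanity check before downloading, the component sizes above can be summed and compared with the free space on the target drive. This is a minimal sketch: the helper names are hypothetical, and it assumes the quoted sizes are decimal gigabytes (1 GB = 10^9 bytes).

```python
import shutil

# Component sizes as quoted in this recipe (assumed to be decimal GB).
COMPONENT_SIZES_GB = {
    "transformer (BF16)": 12.54,
    "text encoder (Qwen2.5-VL-7B-Instruct)": 16.58,
    "VAE": 0.168,
}


def total_size_gb(sizes):
    """Sum the per-component sizes."""
    return sum(sizes.values())


def free_disk_gb(path="."):
    """Free space on the filesystem holding `path`, in decimal GB."""
    return shutil.disk_usage(path).free / 1e9


needed = total_size_gb(COMPONENT_SIZES_GB)
available = free_disk_gb(".")
print(f"need ~{needed:.2f} GB, have {available:.1f} GB free")
```

Point `free_disk_gb` at the directory you pass to `--local-dir` (or at your HF cache location) rather than the current directory if they live on different drives.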

Running

The HF model card's reference Quick Start works as-is on a 32 GB 5090. Save the following as run_t2i.py inside the cloned LongCat-Image directory:

import torch
from diffusers import LongCatImagePipeline

if __name__ == '__main__':
    pipe = LongCatImagePipeline.from_pretrained(
        "meituan-longcat/LongCat-Image",
        torch_dtype=torch.bfloat16,
    )
    # On a 32 GB RTX 5090, you have two valid choices:
    #   (a) enable_model_cpu_offload() — the HF card's documented "Required ~17 GB"
    #       path. ~14 GB headroom on the 5090; the safe default. Use this if you
    #       plan to co-load a second model or batch multiple images.
    #   (b) pipe.to("cuda", torch.bfloat16) — keep everything resident on GPU,
    #       which the HF card flags as "Faster inference" for "high VRAM devices".
    #       Uses the no-offload "~24 GB+" tier; still fits with ~8 GB headroom.
    # This recipe defaults to (a) to match the cited ~17–18 GB envelope from both
    # first-party Meituan sources. Comment out the next line and uncomment the one
    # below for the faster fully-resident path.
    pipe.enable_model_cpu_offload()
    # pipe.to("cuda", torch.bfloat16)

    prompt = (
        "A young Asian woman in a yellow knit sweater with a white necklace, "
        "hands resting on her knees, calm expression. Background is a rough "
        "brick wall, warm afternoon sunlight, medium-distance shot."
    )

    image = pipe(
        prompt,
        height=768,
        width=1344,
        guidance_scale=4.0,
        num_inference_steps=50,
        num_images_per_prompt=1,
        generator=torch.Generator("cpu").manual_seed(43),
        enable_cfg_renorm=True,
        enable_prompt_rewrite=True,
    ).images[0]

    image.save("./t2i_example.png")

Run it:

python run_t2i.py

Output lands at ./t2i_example.png. First run downloads any weights not pre-fetched; subsequent runs load straight from the HF cache.

Meituan's repo also includes scripts/inference_t2i.py with the same defaults hardcoded; that script is equivalent to the above and runs with python scripts/inference_t2i.py.

Text-in-image: LongCat-Image renders embedded text — the HF README is explicit that you must wrap the target text in single or double quotation marks (English '...' / "..." or the Chinese full-width equivalents ‘...’ / “...”). Without quotes, the model treats the words as scene description, not glyphs to render.
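
The quoting rule can be applied mechanically when prompts are assembled in code. The helper below is a hypothetical convenience, not part of the LongCat-Image or diffusers API; the template text is only an example, and the `{text}` placeholder is a convention chosen for this sketch.

```python
def quote_for_rendering(text, chinese=False):
    """Wrap text that should be drawn as glyphs in quotation marks."""
    if chinese:
        # Full-width Chinese double quotation marks.
        return "\u201c" + text + "\u201d"
    return '"' + text + '"'


def with_rendered_text(template, text, chinese=False):
    """Fill the {text} placeholder in `template` with the quoted text."""
    return template.format(text=quote_for_rendering(text, chinese))


prompt = with_rendered_text("A wooden shop sign that says {text}.", "OPEN")
print(prompt)
```

The resulting string can be passed as `prompt` to the pipeline call shown earlier in this section.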

Spending the 14 GB Headroom

With the default enable_model_cpu_offload() path at ~17–18 GB peak, the 5090 leaves you roughly 14 GB of unused VRAM. The 32 GB card unlocks several use patterns that the 24 GB tier cannot:

  • num_images_per_prompt=2 or =3. Batching multiple images per prompt at the same 768×1344 resolution adds activation memory roughly linearly. On the 24 GB tier this pushes into OOM territory with offload; on 32 GB you can comfortably hold a 2-image batch in the headroom (each additional image adds the activation footprint of one forward pass, well within the spare ~14 GB).
  • The no-offload fully-resident path (option b above). Replacing enable_model_cpu_offload() with pipe.to("cuda", torch.bfloat16) puts the entire pipeline (transformer + Qwen2.5-VL text encoder + VAE) onto the GPU. This is the "~24 GB+" tier the sooxt98 community ComfyUI port names in its tier table. It should be faster per image because components are no longer moved between host RAM and the GPU as the text-encoder, transformer, and VAE stages run; it still fits 32 GB with ~8 GB to spare.
  • Co-loading a small auxiliary model. With ~14 GB free in the offload path you can keep e.g. a small captioner, an embedding model, or a control-LoRA loader resident on the same GPU for a tighter pipeline turnaround. The 24 GB tier can't do this without dropping LongCat-Image off the GPU between steps.
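
A batched run with num_images_per_prompt could be wrapped roughly as follows. This is a sketch under stated assumptions: `generate_batch` and the output file naming are hypothetical, `pipe` is an already-loaded LongCatImagePipeline, and the memory cost of larger batches has not been measured for this recipe.

```python
from pathlib import Path


def generate_batch(pipe, prompt, n_images=2, out_dir="outputs", **call_kwargs):
    """Generate n_images for one prompt and save each as a PNG file."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    # Extra keyword arguments (height, width, guidance_scale, ...) pass through.
    result = pipe(prompt, num_images_per_prompt=n_images, **call_kwargs)
    paths = []
    for index, image in enumerate(result.images):
        path = out / f"t2i_{index:02d}.png"
        image.save(path)
        paths.append(path)
    return paths
```

For example, `generate_batch(pipe, prompt, n_images=2, height=768, width=1344, guidance_scale=4.0, num_inference_steps=50)` mirrors the settings of the single-image script. Watch nvidia-smi on the first run to confirm the batch actually fits before raising n_images.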

Results

  • Speed: No RTX 5090-specific inference-time measurement appears in the official HF model card, the canonical Meituan GitHub repo, the arXiv technical report (2512.07584), or the diffusers integration PR (huggingface/diffusers#12828) at time of writing. The 4090 sibling recipe omits speed for the same reason. This recipe does not extrapolate an unmeasured number across GPU architectures, even where the compute and bandwidth ratios suggest a plausible estimate: Blackwell sm_120 has ~1.3× the 4090's compute, but the 4090's per-image wall time is itself uncited. This section will be updated once a community run lands at /check/longcat-image/rtx-5090; if you measure one, submit it via /contribute.
  • VRAM usage: ~17 GB peak with enable_model_cpu_offload() per the HF model card Quick Start (verbatim Python comment: Required ~17 GB). The Meituan team comment on Issue #8 by junqiangwu (COLLABORATOR) cross-confirms at ~18 GB: "We've submitted a new version of the code, which consumes approximately 18 GB of VRAM and supports inference on an RTX 4090." Both sources describe runtime-agnostic measurements that transfer cleanly upward to the 32 GB tier — the binding constraint has already cleared on the 24 GB tier, so the 5090 trivially holds the same footprint with ~14 GB headroom. The sooxt98 community ComfyUI port provides a third-party VRAM tier table for context (its "Standard (CPU offload disabled): ~24GB+" band describes the no-offload path), but it is a vanilla community walkthrough (not currently endorsed by the canonical Meituan repo, whose Community Works section is no longer present in the README as of 2026-05-24).
  • Quality notes: LongCat-Image is bilingual by design (Chinese + English) and the arXiv technical report (2512.07584) highlights multilingual text rendering as a primary target. The 6B parameter count is "significantly smaller than the nearly 20B or larger Mixture-of-Experts (MoE) architectures common in the field" per the same report, and the architecture is "a hybrid MM-DiT and Single-DiT structure, consistent with Flux1.dev". Quality at native BF16 is the canonical reference on this card — no quantization tradeoffs to consider, and no FP8 fallback to evaluate even though Blackwell has hardware FP8 acceleration (the canonical pipeline doesn't ship FP8 weights, and the model fits BF16 comfortably anyway).
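
If you want to measure per-image wall time yourself, a small timing harness along these lines is one way to do it. The function name and the returned dictionary keys are illustrative. The sketch assumes the timed callable returns only after the image is fully decoded (the pipeline hands back decoded images, so the call should already have waited for the GPU work, but treat that as an assumption).

```python
import statistics
import time


def time_calls(fn, runs=3, warmup=1):
    """Time `fn` over several runs after discarding warm-up calls."""
    for _ in range(warmup):
        fn()  # first calls pay one-off costs (weight loading, kernel selection)
    durations = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        durations.append(time.perf_counter() - start)
    return {
        "runs": runs,
        "median_s": statistics.median(durations),
        "min_s": min(durations),
        "max_s": max(durations),
    }
```

With a loaded pipeline you might call `time_calls(lambda: pipe(prompt, height=768, width=1344, num_inference_steps=50), runs=3)`. When reporting a result, state the resolution, step count, and whether offload was enabled.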

For the full benchmark data, see /check/longcat-image/rtx-5090.

Troubleshooting

pip install -r infer_requirements.txt errors with No module named 'dskernels'

The official Meituan infer_requirements.txt lists dskernels, which is not on PyPI — ghostnyambit reported the same blocker on Issue #8 (2025-12-11). dskernels is only required for training-time DeepSpeed optimizations and not for inference. Comment the line out of infer_requirements.txt and re-run the install — the diffusers Quick Start above does not touch DeepSpeed.
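
Commenting the line out can also be scripted. The sketch below is a hypothetical helper that works on the text of a requirements file; the sample package lines in the usage note are made up for illustration and are not the real contents of infer_requirements.txt.

```python
import re


def comment_out_requirement(text, package="dskernels"):
    """Return requirements text with any line for `package` commented out."""
    out = []
    for line in text.splitlines():
        # Take the package name: everything before a version specifier,
        # extras bracket, environment marker, URL marker, or whitespace.
        name = re.split(r"[<>=!~\[;@\s]", line.strip(), maxsplit=1)[0]
        if name.lower() == package.lower():
            out.append("# " + line)
        else:
            out.append(line)
    return "\n".join(out) + "\n"
```

For example, `comment_out_requirement("torch\ndskernels\n")` returns `"torch\n# dskernels\n"`. To apply it in place, read the file with `pathlib.Path("infer_requirements.txt").read_text()`, pass the text through the function, and write the result back with `write_text()`. Lines that are already commented are left untouched.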

no kernel image is available for execution on the device on first inference call

Your PyTorch wheel does not include sm_120 kernels for Blackwell. The default pip install torch resolves to the cu128 wheel since the late-Q1 2026 update, which carries sm_120 kernels. Verify with:

python -c "import torch; print(torch.version.cuda, torch.cuda.get_arch_list())"

The get_arch_list() output should include sm_120. If it doesn't, reinstall against the default index per Installation step 2 (or explicitly pin --index-url https://download.pytorch.org/whl/cu128). Note that if your code-path somehow imports FlashAttention-2 (the canonical Quick Start does not), you may hit the same error from FA2's side regardless of the torch wheel — FA2 sm_120 kernel coverage is tracked at Dao-AILab/flash-attention#2168 (open as of 2026-05-24). The safe answer on this card is to let diffusers use SDPA (its default) and not install FA2.
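
The same check can be run from inside a script so that a misbuilt environment fails early with a readable message. This is a sketch: the helper names are hypothetical, and only `torch.cuda.get_arch_list()` is an actual PyTorch call.

```python
def has_arch(arch_list, target="sm_120"):
    """True if the compiled-architecture list includes `target`."""
    return target in arch_list


def installed_torch_arches():
    """Return torch's compiled CUDA arch list, or None if torch is unusable."""
    try:
        import torch
    except (ImportError, OSError):
        return None
    return torch.cuda.get_arch_list()


arches = installed_torch_arches()
if arches is None:
    print("PyTorch could not be imported in this environment")
elif has_arch(arches):
    print("sm_120 kernels present")
else:
    print("sm_120 kernels missing; compiled arches:", arches)
```

An empty list usually means the installed wheel is a CPU-only build or CUDA is not visible, which also calls for a reinstall.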

OOM despite enable_model_cpu_offload()

Confirm you are not also calling pipe.to(device, torch.bfloat16) after enable_model_cpu_offload() — the two are mutually exclusive. The HF model card's Quick Start has the pipe.to(...) line commented out for exactly this reason, with the in-line note "Uncomment for high VRAM devices (Faster inference)". On the 5090 you have enough VRAM to switch to the fully-resident path (option b in the Running section above) if you want — but you cannot have both flags engaged. The cited ~17–18 GB number assumes offload is active.
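
One way to avoid engaging both placement modes by accident is to route the choice through a single function that calls exactly one of them. The helper below is hypothetical. It calls `pipe.to("cuda")` without a dtype, which assumes the weights were already loaded in bfloat16 through `torch_dtype=torch.bfloat16` in `from_pretrained`; the model card's own line passes the dtype explicitly.

```python
def place_pipeline(pipe, fully_resident=False):
    """Apply exactly one placement strategy and report which one was used."""
    if fully_resident:
        # No-offload path: every component stays on the GPU.
        pipe.to("cuda")
        return "resident"
    # Offload path: the route behind the cited ~17-18 GB figure.
    pipe.enable_model_cpu_offload()
    return "offload"
```

Call it once, right after `from_pretrained`, and do not call `pipe.to(...)` or `enable_model_cpu_offload()` anywhere else in the script.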

Out of host RAM, not VRAM

User reckless-huang reported on Issue #8 (2025-12-08) that the old script failed on a system with 32 GB host RAM even though VRAM was fine. The Meituan team's follow-up fix by junqiangwu (COLLABORATOR) in the same Issue #8 thread reduced host-memory pressure as well as VRAM. Make sure you've installed from the latest commit on main — if you cloned before 2025-12-08, pull again.
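
To see how much host RAM a machine has before loading the pipeline, a standard-library check like the following may help. It is a sketch that relies on `os.sysconf`, which is available on Linux and some other Unix systems; where it is unavailable the function returns None. It reports total installed memory, not the amount currently free.

```python
import os


def host_ram_gib():
    """Total physical RAM in GiB, or None if the platform does not expose it."""
    try:
        pages = os.sysconf("SC_PHYS_PAGES")
        page_size = os.sysconf("SC_PAGE_SIZE")
    except (AttributeError, ValueError, OSError):
        return None
    if pages <= 0 or page_size <= 0:
        return None
    return pages * page_size / 2**30


ram = host_ram_gib()
if ram is None:
    print("could not determine host RAM on this platform")
else:
    print(f"host RAM: {ram:.1f} GiB")
```

A machine sold as "32 GB" typically reports slightly under 32 GiB here because part of the memory is reserved by firmware and the kernel.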

enable_model_cpu_offload() doesn't work on the Edit pipeline

User mingyi456 notes on Issue #8 that enable_model_cpu_offload() currently does not work with LongCatImageEditPipeline. This recipe is scoped to the base LongCatImagePipeline for exactly that reason. On the 5090 you have enough VRAM (32 GB) to run the Edit pipeline in its no-offload "~24 GB+" envelope, but a published end-to-end Edit workflow is out of scope for this recipe — see the sibling-variants table above and the Edit/Edit-Turbo HF model cards if you want to attempt one.

FP8 weight quants are not relevant for this recipe

A community FP8 quant of the Edit variant was uploaded by LiVeenMusic (community contributor, NONE association on GitHub) — and per their own caveat on Issue #8, "all I did to make that one was download the diffusers HF repo, and use the same quantizing script that people were using for Flux" and the quant is essentially untested for end-to-end inference. The HF repo itself returns 401 to unauthenticated readers as of 2026-05-24 (the LiVeen namespace was withdrawn), so this is best treated as a historical pointer rather than a download target. The 5090 does have hardware FP8 (E4M3/E5M2) tensor cores that would accelerate such a path if a canonical FP8 release lands, but on this recipe (base T2I, BF16 weights, ~14 GB headroom) there is no memory pressure or speed motivation to substitute FP8 weights. Stick with the BF16 path documented above.