LongCat-Image (base T2I) on RTX 3090: Bilingual 6B Text-to-Image via diffusers BF16 + CPU Offload

What You'll Build

A working diffusers setup for Meituan's LongCat-Image — a 6B-parameter bilingual (Chinese + English) MM-DiT/Single-DiT text-to-image model — running natively on a single 24 GB RTX 3090 (Ampere, sm_86). This recipe is scoped to the base text-to-image variant; the image-editing siblings are out of scope below. The 3090 path is identical to the 4090 path because the official inference code is BF16-only and does not use FP8 — the architectural gap between Ada (sm_89) and Ampere (sm_86) does not change the VRAM envelope here.

Hardware data: RTX 3090 (24 GB VRAM, 936 GB/s memory bandwidth, Ampere sm_86) · canonical diffusers + enable_model_cpu_offload() · ~17 GB peak per the HF model card Quick Start, cross-confirmed at ~18 GB by the Meituan team's GitHub statement. The sooxt98 ComfyUI port's VRAM tier table places the RTX 3090 in the "Standard (CPU offload disabled): ~24GB+" tier — note that the table's separate "~17-19GB" band names offload-enabled mid-range cards (3080/4080), not the 3090. · See benchmark data

ℹ️ Why this recipe pins the base variant. Meituan publishes four siblings under the LongCat-Image brand and their inference paths and VRAM profiles differ — see "Sibling variants and what fits 24 GB" below before downloading anything.

ℹ️ Why this recipe pins the diffusers runtime, not ComfyUI. On a 16 GB consumer card the only confirmed path is ComfyUI + GGUF (see the 4060 Ti sibling recipe). The 24 GB tier, RTX 3090 included, unlocks the canonical diffusers LongCatImagePipeline directly. The Meituan team's own GitHub statement is that the latest official inference code "consumes approximately 18 GB of VRAM and supports inference on an RTX 4090" (Issue #8 comment by junqiangwu, 2025-12-08). The same code path runs unchanged on the 3090, with the same VRAM profile, because the pipeline is BF16-only and BF16 is fully supported on Ampere. The HF model card's Quick Start independently labels the same path "Required ~17 GB" (HF model card). The sooxt98 community ComfyUI port places the RTX 3090 in the "Standard (CPU offload disabled): ~24GB+" tier (sooxt98/comfyui_longcat_image); its separate "Low VRAM (CPU offload enabled): ~17-19GB" band describes offload-enabled mid-range cards (3080/4080), not the 3090. The three sources are mutually consistent: roughly 17–19 GB with CPU offload, and roughly 24 GB with everything resident in BF16.

Why the 4090 path transfers cleanly to the 3090

No FP8 dependency. The Ampere sm_86 architecture lacks FP8 tensor cores (FP8 first appears on Ada sm_89 / Hopper sm_90). For this model that does not matter: the official inference code path is BF16 throughout, and torch_dtype=torch.bfloat16 is what every cited source uses. No FP8 quant is being substituted, so the Ada-to-Ampere swap is a no-op at the precision level.
Same 24 GB VRAM tier. The envelope ("~17 GB" with offload, "~24 GB" without) is set by the size of the model's BF16 weights, not by the GPU. A 24 GB Ampere card holds the same footprint as a 24 GB Ada card.
FlashAttention-2 works on sm_86. Stock pip install flash-attn ships sm_86 kernels. If the diffusers install pulls in FA2, it works out of the box on the 3090 with no special wheel selection.
Compute is ~½ of Ada, but speed is not quoted here. RTX 3090 dense FP16 throughput (~35.58 TFLOPS per TechPowerUp) is roughly half the RTX 4090's (~82.58 TFLOPS per TechPowerUp; the ratio is about 0.43), and the LongCat-Image transformer forward is compute-bound on a DiT of this size. The 4090 recipe does not quote a per-image speed because no GPU-named inference benchmark is in the official sources. A 3090 figure would likely be around 2× slower, but quoting an extrapolation would violate this recipe's no-fake-data rule. See Results → Speed below for the empirical handoff to /check/.
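The precision and throughput points above can be checked with a few lines of Python. This is a minimal sketch, not part of the official tooling: `fp8_tensor_cores`, `throughput_ratio`, and `report_local_gpu` are hypothetical helper names, and the FP8 threshold simply encodes the statement above that FP8 first appears on sm_89.

```python
def fp8_tensor_cores(major: int, minor: int) -> bool:
    """True if the compute capability is sm_89 (Ada) or newer.

    Assumption: encodes the claim in the text that FP8 tensor cores
    first appear on sm_89; sm_86 (RTX 3090) therefore returns False.
    """
    return (major, minor) >= (8, 9)


def throughput_ratio(fast_tflops: float, slow_tflops: float) -> float:
    """Ratio of two theoretical throughput figures (not a measured speedup)."""
    return fast_tflops / slow_tflops


def report_local_gpu() -> None:
    """Print what the local GPU supports. Requires PyTorch with CUDA."""
    import torch  # imported lazily so the helpers above work without torch

    major, minor = torch.cuda.get_device_capability(0)
    print(
        f"sm_{major}{minor} | BF16 supported: {torch.cuda.is_bf16_supported()} | "
        f"FP8 tensor cores: {fp8_tensor_cores(major, minor)}"
    )


# TechPowerUp dense FP16 figures quoted in the text: RTX 4090 vs RTX 3090.
ratio = throughput_ratio(82.58, 35.58)
print(f"Theoretical 4090/3090 dense FP16 ratio: {ratio:.2f}x")
```

Calling `report_local_gpu()` on the target machine should report sm_86 for an RTX 3090. The ratio is a paper figure only, which is why the recipe still declines to quote a per-image time.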

Sibling variants and what fits 24 GB

Variant	Purpose	24 GB fit (cited)
LongCat-Image (this recipe)	Final-release T2I, 6B params, hybrid MM-DiT/Single-DiT à la Flux1.dev per the arXiv technical report (2512.07584)	Yes — canonical diffusers path with `enable_model_cpu_offload()`, ~17 GB peak per HF card Quick Start, ~18 GB per Meituan team
LongCat-Image-Edit	Image-to-image editing variant	Tighter — `enable_model_cpu_offload()` does not work on the Edit pipeline per user `mingyi456` on Issue #8, so the no-offload "~24 GB+" tier is what you get. Borderline on a 24 GB card; not in scope here
LongCat-Image-Edit-Turbo	Distilled few-step edit variant	Same memory profile as Edit; if you get a working setup, /contribute so we can publish one
LongCat-Image-Dev	Mid-training checkpoint intended for fine-tuning, not inference	Out of scope for this recipe

Requirements

Component	Minimum	Tested
GPU	24 GB VRAM, CUDA-capable	RTX 3090 (24 GB, Ampere sm_86)
RAM	32 GB system RAM (CPU offload spills the text encoder to host)	—
Storage	~30 GB free for the upstream BF16 repo	—
Software	Python 3.10, latest `diffusers` (≥ the integration in diffusers#12828), PyTorch with CUDA	—
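A preflight check along these lines can confirm the table before any large download starts. It is a sketch under stated assumptions: `meets_minimum` and `probe_local_machine` are hypothetical names, the thresholds are read straight off the table and compared in decimal GB, and the host-RAM probe uses `os.sysconf`, which is only available on Linux and similar systems.

```python
import os
import shutil

# Thresholds taken from the Requirements table (decimal GB).
MIN_VRAM_GB = 24.0
MIN_RAM_GB = 32.0
MIN_DISK_GB = 30.0


def meets_minimum(vram_gb: float, ram_gb: float, free_disk_gb: float) -> list[str]:
    """Return a list of shortfalls; an empty list means all minimums are met."""
    problems = []
    if vram_gb < MIN_VRAM_GB:
        problems.append(f"VRAM {vram_gb:.1f} GB < {MIN_VRAM_GB:.0f} GB")
    if ram_gb < MIN_RAM_GB:
        problems.append(f"host RAM {ram_gb:.1f} GB < {MIN_RAM_GB:.0f} GB")
    if free_disk_gb < MIN_DISK_GB:
        problems.append(f"free disk {free_disk_gb:.1f} GB < {MIN_DISK_GB:.0f} GB")
    return problems


def probe_local_machine(path: str = ".") -> tuple[float, float, float]:
    """Read VRAM, host RAM, and free disk in decimal GB (Linux, CUDA build of PyTorch)."""
    import torch  # lazy import: meets_minimum() stays usable without torch

    vram = torch.cuda.get_device_properties(0).total_memory / 1e9
    ram = os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES") / 1e9
    disk = shutil.disk_usage(path).free / 1e9
    return vram, ram, disk


# Usage on the target machine:
#   for problem in meets_minimum(*probe_local_machine()):
#       print("WARNING:", problem)
```

Comparing in decimal GB is a deliberate choice here: a nominal "24 GB" card reports somewhat more than 24e9 bytes, so the table values can be used without extra slack. Adjust the thresholds if your driver reports differently.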

Installation

The default pip install torch already includes sm_86 kernels (Ampere) — no special wheel selection is required for the RTX 3090. FlashAttention-2 also ships full sm_86 kernel coverage, so the standard PyTorch + diffusers install is everything you need.

1. Create the conda environment

The official GitHub README pins Python 3.10:

conda create -n longcat-image python=3.10
conda activate longcat-image

2. Clone the repo and install dependencies

Install diffusers from PyPI (≥ the version that includes the LongCatImagePipeline integration via diffusers#12828). The current Meituan README installs from PyPI directly; the upstream HF model card's Quick Start uses the same shipped class:

git clone https://github.com/meituan-longcat/LongCat-Image
cd LongCat-Image
pip install -r infer_requirements.txt
pip install -U diffusers

If infer_requirements.txt errors with No module named 'dskernels', see Troubleshooting below — dskernels is not on PyPI, but you can skip it for inference.

3. Pre-download the model

The pipeline auto-downloads on first run, but pre-fetching avoids surprises and gives you a clean progress bar:

hf download meituan-longcat/LongCat-Image --local-dir ./longcat-image

The repo is ~29 GB on disk (BF16 transformer + Qwen2.5-VL-7B text encoder + VAE).
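For readers who prefer to stay in Python, the same pre-fetch and a quick size check might look like the sketch below. `prefetch`, `dir_size_gb`, and `bf16_weight_gb` are hypothetical helper names; `snapshot_download` is the `huggingface_hub` function, used here on the assumption that it is an acceptable substitute for the `hf download` command above.

```python
from pathlib import Path


def bf16_weight_gb(num_params: float) -> float:
    """Size of a BF16 weight set in decimal GB (2 bytes per parameter)."""
    return num_params * 2 / 1e9


def dir_size_gb(root: str) -> float:
    """Total size of all regular files under root, in decimal GB."""
    total = sum(p.stat().st_size for p in Path(root).rglob("*") if p.is_file())
    return total / 1e9


def prefetch(local_dir: str = "./longcat-image") -> str:
    """Download the base T2I repo into local_dir and return its path."""
    from huggingface_hub import snapshot_download  # lazy import

    return snapshot_download(
        repo_id="meituan-longcat/LongCat-Image",
        local_dir=local_dir,
    )


# The 6B-parameter transformer alone accounts for this much of the repo;
# the remainder is the text encoder and the VAE.
print(f"Transformer weights in BF16: ~{bf16_weight_gb(6e9):.0f} GB")

# Usage after downloading:
#   print(f"On disk: {dir_size_gb(prefetch()):.1f} GB")
```

If `dir_size_gb` comes out well under the ~29 GB quoted above, the download was probably interrupted and is worth re-running before the first inference attempt.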

Running

The HF model card's reference Quick Start works as-is on a 24 GB 3090. Save the following as run_t2i.py inside the cloned LongCat-Image directory:

import torch
from diffusers import LongCatImagePipeline

if __name__ == '__main__':
    pipe = LongCatImagePipeline.from_pretrained(
        "meituan-longcat/LongCat-Image",
        torch_dtype=torch.bfloat16,
    )
    # On a 24 GB RTX 3090, keep CPU offload enabled — this is the path the
    # HF card's Quick Start ships ("Required ~17 GB") and the Meituan team
    # validated at ~18 GB peak. The 3090's 24 GB ceiling is the same as the
    # 4090's, so the same envelope applies. Disable only if you have ≥32 GB VRAM.
    pipe.enable_model_cpu_offload()

    prompt = (
        "A young Asian woman in a yellow knit sweater with a white necklace, "
        "hands resting on her knees, calm expression. Background is a rough "
        "brick wall, warm afternoon sunlight, medium-distance shot."
    )

    image = pipe(
        prompt,
        height=768,
        width=1344,
        guidance_scale=4.0,
        num_inference_steps=50,
        num_images_per_prompt=1,
        generator=torch.Generator("cpu").manual_seed(43),
        enable_cfg_renorm=True,
        enable_prompt_rewrite=True,
    ).images[0]

    image.save("./t2i_example.png")

Run it:

python run_t2i.py

Output lands at ./t2i_example.png. First run downloads any weights not pre-fetched; subsequent runs load straight from the HF cache.

Meituan's repo also includes scripts/inference_t2i.py with the same defaults hardcoded; that script is equivalent to the above and runs with python scripts/inference_t2i.py.

Text-in-image: LongCat-Image renders embedded text — the HF README is explicit that you must wrap the target text in single or double quotation marks (English '...' / "..." or the Chinese full-width equivalents ‘...’ / “...”). Without quotes, the model treats the words as scene description, not glyphs to render.
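A small helper can make the quoting rule hard to forget when prompts are built programmatically. This is an illustrative sketch only: `quote_for_rendering` is a hypothetical name, and the example prompts are invented for demonstration.

```python
def quote_for_rendering(text: str, chinese: bool = False) -> str:
    """Wrap text that should be drawn as glyphs in quotation marks.

    Uses ASCII double quotes by default and the Chinese full-width
    double quotes when chinese=True, matching the rule described above.
    """
    left, right = ("“", "”") if chinese else ('"', '"')
    return f"{left}{text}{right}"


# The sign text is quoted, so it should be rendered as glyphs;
# the rest of the sentence remains scene description.
prompt_en = f"A neon sign above a noodle shop that reads {quote_for_rendering('OPEN 24H')}"
prompt_zh = f"一家面馆门口的霓虹灯招牌，上面写着{quote_for_rendering('营业中', chinese=True)}"

print(prompt_en)
print(prompt_zh)
```

Either string can be passed as `prompt` to the pipeline call in run_t2i.py.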

Results

Speed: No RTX-3090-specific inference-time measurement is cited in the official model card, GitHub repo, ComfyUI integration, or arXiv tech report at time of writing. The RTX 4090 sibling recipe omits speed for the same reason. Expect roughly 2× the per-image wall time of an RTX 4090, because the 3090 has ~½ of Ada's dense FP16 throughput (~35.58 TFLOPS on Ampere vs ~82.58 TFLOPS on Ada) and the diffusion forward is compute-bound. That is an expectation, not a measurement: the official sources do not name a per-GPU number, and this recipe does not invent one. Once a community run lands at /check/longcat-image/rtx-3090, this section will be updated; contribute one via /contribute if you measure it.
VRAM usage: ~17 GB peak with enable_model_cpu_offload() per the HF model card Quick Start (verbatim: "Required ~17 GB"). The Meituan team comment on Issue #8 cross-confirms at ~18 GB. The sooxt98 community ComfyUI port's tier table (sooxt98/comfyui_longcat_image) places the RTX 3090 in the "Standard (CPU offload disabled): ~24GB+" tier; its separate "Low VRAM (CPU offload enabled): ~17-19GB" band describes offload-enabled mid-range cards (3080/4080), not the 3090 itself. All three citations are runtime-agnostic and transfer cleanly within the 24 GB tier: Ada → Ampere is a no-op here because the path is BF16 throughout with no FP8 dependency.
Quality notes: LongCat-Image is bilingual by design (Chinese + English) and the arXiv technical report (2512.07584) highlights multilingual text rendering as a primary target. The 6B parameter count is "significantly smaller than the nearly 20B or larger Mixture-of-Experts (MoE) architectures common in the field" per the same report, and the architecture is "a hybrid MM-DiT and Single-DiT structure, consistent with Flux1.dev". Quality at native BF16 is the canonical reference — no quantization tradeoffs to consider on this card, and no FP8 fallback to evaluate since Ampere has no FP8 tensor cores in the first place.

For the full benchmark data, see /check/longcat-image/rtx-3090.
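If you want to measure your own run for a contribution, a harness like the following may be enough. It is a sketch, not an official benchmark protocol: `measure_generation`, `bytes_to_gib`, and `format_report` are hypothetical names, and `torch.cuda.max_memory_allocated()` reports PyTorch's own allocations, which can read lower than what nvidia-smi shows for the process.

```python
import time


def bytes_to_gib(num_bytes: int) -> float:
    """Convert a byte count to GiB."""
    return num_bytes / 1024 ** 3


def format_report(seconds: float, peak_bytes: int) -> str:
    """One-line summary suitable for pasting into a benchmark submission."""
    return f"wall time {seconds:.1f} s | peak allocated {bytes_to_gib(peak_bytes):.2f} GiB"


def measure_generation(pipe, **call_kwargs):
    """Run pipe(**call_kwargs) once and return (output, seconds, peak_bytes).

    Requires PyTorch with CUDA. Run one warm-up generation first if you
    want to exclude one-time setup costs from the timing.
    """
    import torch  # lazy import so the helpers above work without torch

    torch.cuda.reset_peak_memory_stats()
    torch.cuda.synchronize()
    start = time.perf_counter()
    output = pipe(**call_kwargs)
    torch.cuda.synchronize()  # make sure all GPU work is finished before stopping the clock
    seconds = time.perf_counter() - start
    return output, seconds, torch.cuda.max_memory_allocated()


# Usage inside run_t2i.py, replacing the direct pipe(...) call:
#   out, secs, peak = measure_generation(pipe, prompt=prompt, height=768, width=1344,
#                                        guidance_scale=4.0, num_inference_steps=50)
#   print(format_report(secs, peak))
```

Report the resolution, step count, and whether offload was enabled alongside the numbers, since all three change the result.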

Troubleshooting

`pip install -r infer_requirements.txt` errors with `No module named 'dskernels'`

The official Meituan infer_requirements.txt lists dskernels, which is not on PyPI — ghostnyambit reported the same blocker on Issue #8 (2025-12-11). dskernels is only required for training-time DeepSpeed optimizations and not for inference. Comment the line out of infer_requirements.txt and re-run the install — the diffusers Quick Start above does not touch DeepSpeed.
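Commenting the line out by hand is quick, but for scripted setups the edit can be automated. The sketch below assumes the package appears at the start of its own line in infer_requirements.txt; `comment_out_package` and `patch_requirements` are hypothetical helper names.

```python
from pathlib import Path


def comment_out_package(requirements_text: str, package: str = "dskernels") -> str:
    """Return the requirements text with every line for `package` commented out.

    Lines that are already commented are left alone, so the function is
    safe to run more than once.
    """
    patched = []
    for line in requirements_text.splitlines():
        if line.strip().lower().startswith(package.lower()):
            patched.append("# " + line)
        else:
            patched.append(line)
    return "\n".join(patched) + "\n"


def patch_requirements(path: str = "infer_requirements.txt") -> None:
    """Rewrite the requirements file in place with the package disabled."""
    file = Path(path)
    file.write_text(comment_out_package(file.read_text()))


# Usage from inside the cloned LongCat-Image directory:
#   patch_requirements()
#   then re-run: pip install -r infer_requirements.txt
```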

OOM despite `enable_model_cpu_offload()`

Confirm you are not also calling pipe.to(device, torch.bfloat16) after enable_model_cpu_offload() — the two are mutually exclusive. The HF model card's Quick Start has the pipe.to(...) line commented out for exactly this reason, with the in-line note "Uncomment for high VRAM devices (Faster inference)". On a 24 GB card, leave it commented; the cited ~17–18 GB number assumes offload is active.
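One way to keep the two placement modes from being combined by accident is to route both through a single switch. This is a sketch with a hypothetical helper, `place_pipeline`; it calls `pipe.to(device)` without a dtype on the assumption that `torch_dtype=torch.bfloat16` was already passed to `from_pretrained`, as in run_t2i.py.

```python
def place_pipeline(pipe, use_offload: bool = True, device: str = "cuda"):
    """Apply exactly one placement strategy to a diffusers pipeline.

    use_offload=True  -> enable_model_cpu_offload() (the 24 GB path)
    use_offload=False -> move the whole pipeline to `device` (high-VRAM path)
    """
    if use_offload:
        pipe.enable_model_cpu_offload()
    else:
        pipe.to(device)
    return pipe


# Usage in run_t2i.py, replacing the bare enable_model_cpu_offload() call:
#   pipe = place_pipeline(pipe, use_offload=True)
```

On a 24 GB card, leave `use_offload` at its default.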

Out of host RAM, not VRAM

User reckless-huang reported on Issue #8 (2025-12-08) that the old script failed on a system with 32 GB host RAM even though VRAM was fine. The Meituan team's follow-up fix by junqiangwu in the same Issue #8 thread reduced host-memory pressure as well as VRAM. Make sure you've installed from the latest commit on main — if you cloned before 2025-12-08, pull again.

`enable_model_cpu_offload()` doesn't work on the Edit pipeline

User mingyi456 notes on Issue #8 that enable_model_cpu_offload() currently does not work with LongCatImageEditPipeline. This recipe is scoped to the base LongCatImagePipeline for exactly this reason — Edit needs the no-offload "~24 GB+" tier, which is borderline on this card. For LongCat-Image-Edit on a 3090, follow the upstream issue thread for the manual sequential-offload patch before assuming a turnkey workflow exists.

FP8 community quants are not relevant on Ampere

A community FP8 quant of the Edit variant exists in the ecosystem, but FP8 has no hardware acceleration on Ampere sm_86; FP8 tensor cores first appear on Ada sm_89. On the 3090 an FP8 quant therefore brings no speed benefit, and the base model already fits in BF16 with offload, so there is nothing to gain by chasing one. Stick with the BF16 path documented above, which is also the one all three cited sources (HF card, Meituan team, sooxt98 port) name as the canonical envelope.