What You'll Build
A working diffusers setup for Meituan's LongCat-Image — a 6B-parameter bilingual (Chinese + English) MM-DiT/Single-DiT text-to-image model — running natively on a single 24 GB RTX 3090 (Ampere, sm_86). This recipe is scoped to the base text-to-image variant; the image-editing siblings are out of scope below. The 3090 path is identical to the 4090 path because the official inference code is BF16-only and does not use FP8 — the architectural gap between Ada (sm_89) and Ampere (sm_86) does not change the VRAM envelope here.
Hardware data: RTX 3090 (24 GB VRAM, 936 GB/s memory bandwidth, Ampere sm_86) · canonical diffusers + enable_model_cpu_offload() · ~17 GB peak per the HF model card Quick Start, cross-confirmed at ~18 GB by the Meituan team's GitHub statement. The sooxt98 ComfyUI port's VRAM tier table places the RTX 3090 in the "Standard (CPU offload disabled): ~24GB+" tier — note that the table's separate "~17-19GB" band names offload-enabled mid-range cards (3080/4080), not the 3090. · See benchmark data
ℹ️ Why this recipe pins the base variant. Meituan publishes four siblings under the LongCat-Image brand and their inference paths and VRAM profiles differ — see "Sibling variants and what fits 24 GB" below before downloading anything.
ℹ️ Why this recipe pins the diffusers runtime, not ComfyUI. On a 16 GB consumer card the only confirmed path is ComfyUI + GGUF (see the 4060 Ti sibling recipe). The 24 GB tier, RTX 3090 included, unlocks the canonical diffusers LongCatImagePipeline directly. The Meituan team's own GitHub statement is that the latest official inference code "consumes approximately 18 GB of VRAM and supports inference on an RTX 4090" (Issue #8 comment by junqiangwu, 2025-12-08); the same code path with the same VRAM profile runs unchanged on the 3090 because the pipeline is BF16-only and BF16 is fully supported on Ampere. The HF model card's Quick Start independently labels the same path "Required ~17 GB" (HF model card), and the sooxt98 community ComfyUI port places the RTX 3090 in the "Standard (CPU offload disabled): ~24GB+" tier (sooxt98/comfyui_longcat_image); its separate "Low VRAM (CPU offload enabled): ~17-19GB" band describes offload-enabled mid-range cards (3080/4080), not the 3090. Three independent sources agree on the BF16-resident envelope for this tier.
Why the 4090 path transfers cleanly to the 3090
- No FP8 dependency. The Ampere sm_86 architecture lacks FP8 tensor cores (FP8 first appears on Ada sm_89 / Hopper sm_90). For this model that does not matter: the official inference code path is BF16 throughout, and torch_dtype=torch.bfloat16 is what every cited source uses. There is no FP8 quant being substituted, so the Ada-to-Ampere swap is a no-op at the precision level.
- Same 24 GB VRAM tier. The "~17 GB" with offload / "~24 GB" without offload envelope is set by the model's BF16 weight count, not by the GPU. A 24 GB Ampere card holds the same footprint a 24 GB Ada card does.
- FlashAttention-2 happy on sm_86. Stock pip install flash-attn ships sm_86 kernels. If the diffusers install pulls FA2, it works out of the box on the 3090 with no special wheel selection.
- Compute is ~½ of Ada, but speed is not quoted here. RTX 3090 dense FP16 throughput (~35.58 TFLOPS per TechPowerUp) is roughly half the RTX 4090's (~82.58 TFLOPS per TechPowerUp), and the LongCat-Image transformer forward is compute-bound on a DiT of this size. The 4090 recipe does not quote a per-image speed because no GPU-named inference benchmark is in the official sources; the 3090 number would be ~2× slower, but quoting an extrapolation would violate this recipe's no-fake-data rule. See Results → Speed below for the empirical handoff to /check/.
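The precision argument above can be checked against a card's CUDA compute capability. The sketch below is illustrative and not part of the official inference code: precision_features and local_gpu_features are hypothetical helpers, the FP8 threshold (sm_89) comes from this section, and the BF16 threshold (sm_80, the first Ampere capability) is an assumption added here.

```python
def precision_features(major: int, minor: int) -> dict:
    """Map a CUDA compute capability to the precision features discussed above.

    Thresholds are assumptions for illustration:
    BF16 from sm_80 (Ampere), FP8 tensor cores from sm_89 (Ada) / sm_90 (Hopper).
    """
    capability = (major, minor)
    return {
        "sm": f"sm_{major}{minor}",
        "bf16": capability >= (8, 0),
        "fp8_tensor_cores": capability >= (8, 9),
    }


def local_gpu_features() -> dict:
    """Query the first visible CUDA device; requires PyTorch with CUDA."""
    import torch  # imported lazily so precision_features works without PyTorch

    if not torch.cuda.is_available():
        raise RuntimeError("No CUDA device visible to PyTorch")
    major, minor = torch.cuda.get_device_capability(0)
    return precision_features(major, minor)
```

On an RTX 3090, local_gpu_features() should report sm_86 with BF16 available and no FP8 tensor cores, which is all a BF16-only pipeline needs.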
Sibling variants and what fits 24 GB
| Variant | Purpose | 24 GB fit (cited) |
|---|---|---|
| LongCat-Image (this recipe) | Final-release T2I, 6B params, hybrid MM-DiT/Single-DiT à la Flux1.dev per the arXiv technical report (2512.07584) | Yes — canonical diffusers path with enable_model_cpu_offload(), ~17 GB peak per HF card Quick Start, ~18 GB per Meituan team |
| LongCat-Image-Edit | Image-to-image editing variant | Tighter — enable_model_cpu_offload() does not work on the Edit pipeline per user mingyi456 on Issue #8, so the no-offload "~24 GB+" tier is what you get. Borderline on a 24 GB card; not in scope here |
| LongCat-Image-Edit-Turbo | Distilled few-step edit variant | Same memory profile as Edit; if you get a working setup, /contribute so we can publish one |
| LongCat-Image-Dev | Mid-training checkpoint intended for fine-tuning, not inference | Out of scope for this recipe |
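As a rough sanity check on the table, the 6B parameter count converts to a BF16 weight footprint as sketched below. This covers the transformer weights only; the text encoder, VAE, and activations are not included, so the result is a lower bound rather than the ~17 GB or ~18 GB peaks cited above. The helper name bf16_weight_bytes is hypothetical.

```python
BYTES_PER_BF16 = 2  # bfloat16 stores each parameter in 16 bits


def bf16_weight_bytes(num_params: float) -> int:
    """Bytes needed to hold `num_params` parameters in BF16 (weights only)."""
    return int(num_params * BYTES_PER_BF16)


transformer_bytes = bf16_weight_bytes(6e9)    # 6B parameters, per the table above
transformer_gb = transformer_bytes / 1e9      # decimal gigabytes
transformer_gib = transformer_bytes / 2**30   # binary gibibytes
print(f"{transformer_gb:.1f} GB ({transformer_gib:.1f} GiB) for the transformer weights")
```

The gap between this figure and the cited peaks is taken up by the other pipeline components and by activations, which is why the footprint depends on precision and offload settings rather than on the GPU generation.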
Requirements
| Component | Minimum | Tested |
|---|---|---|
| GPU | 24 GB VRAM, CUDA-capable | RTX 3090 (24 GB, Ampere sm_86) |
| RAM | 32 GB system RAM (CPU offload spills the text encoder to host) | — |
| Storage | ~30 GB free for the upstream BF16 repo | — |
| Software | Python 3.10, latest diffusers (≥ the integration in diffusers#12828), PyTorch with CUDA | — |
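The minimums in the table can be turned into a quick preflight check. This is a minimal sketch using only the standard library; unmet_requirements and measure_free_disk_gb are hypothetical helper names, and the VRAM and RAM figures are passed in by hand as nominal sizes (a "24 GB" card reports slightly different numbers depending on the tool and unit).

```python
import shutil


def unmet_requirements(vram_gb: float, ram_gb: float, free_disk_gb: float) -> list:
    """Return the rows of the requirements table that a machine does not meet."""
    minimums = {
        "GPU VRAM": (vram_gb, 24),
        "System RAM": (ram_gb, 32),
        "Free storage": (free_disk_gb, 30),
    }
    return [
        f"{name}: have {have:.1f} GB, need {need} GB"
        for name, (have, need) in minimums.items()
        if have < need
    ]


def measure_free_disk_gb(path: str = ".") -> float:
    """Free space on the filesystem holding `path`, in decimal GB."""
    return shutil.disk_usage(path).free / 1e9
```

For example, unmet_requirements(24, 32, measure_free_disk_gb()) returns an empty list when the current directory has at least 30 GB free.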
Installation
The default pip install torch already includes sm_86 kernels (Ampere) — no special wheel selection is required for the RTX 3090. FlashAttention-2 also ships full sm_86 kernel coverage, so the standard PyTorch + diffusers install is everything you need.
1. Create the conda environment
The official GitHub README pins Python 3.10:
conda create -n longcat-image python=3.10
conda activate longcat-image
2. Clone the repo and install dependencies
Install diffusers from PyPI (≥ the version that includes the LongCatImagePipeline integration via diffusers#12828). The current Meituan README installs from PyPI directly; the upstream HF model card's Quick Start uses the same shipped class:
git clone https://github.com/meituan-longcat/LongCat-Image
cd LongCat-Image
pip install -r infer_requirements.txt
pip install -U diffusers
If infer_requirements.txt errors with No module named 'dskernels', see Troubleshooting below — dskernels is not on PyPI, but you can skip it for inference.
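Before moving on, it can help to confirm that the installed diffusers actually exposes the pipeline class. The check below is a sketch: module_exports is a hypothetical helper, and it only tests that the name resolves, not that the pipeline runs.

```python
import importlib


def module_exports(module_name: str, attribute: str) -> bool:
    """True if `module_name` imports cleanly and exposes `attribute`."""
    try:
        module = importlib.import_module(module_name)
        getattr(module, attribute)
    except (ImportError, AttributeError, RuntimeError):
        # RuntimeError covers lazily loaded packages whose submodule import fails.
        return False
    return True


if module_exports("diffusers", "LongCatImagePipeline"):
    print("diffusers exposes LongCatImagePipeline")
else:
    print("LongCatImagePipeline not found; try: pip install -U diffusers")
```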
3. Pre-download the model
The pipeline auto-downloads on first run, but pre-fetching avoids surprises and gives you a clean progress bar:
hf download meituan-longcat/LongCat-Image --local-dir ./longcat-image
The repo is ~29 GB on disk (BF16 transformer + Qwen2.5-VL-7B text encoder + VAE).
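A quick way to spot an interrupted download is to total the files on disk and compare against the ~29 GB figure. The helper below is an illustrative sketch (directory_size_gb is a hypothetical name); it counts every regular file under the directory, including any cache metadata the download tool leaves there.

```python
from pathlib import Path


def directory_size_gb(root: str) -> float:
    """Total size of all regular files under `root`, in decimal GB."""
    total_bytes = sum(p.stat().st_size for p in Path(root).rglob("*") if p.is_file())
    return total_bytes / 1e9
```

Calling directory_size_gb("./longcat-image") after the download should land near the ~29 GB stated above; a markedly smaller number suggests the fetch did not finish.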
Running
The HF model card's reference Quick Start works as-is on a 24 GB 3090. Save the following as run_t2i.py inside the cloned LongCat-Image directory:
import torch
from diffusers import LongCatImagePipeline

if __name__ == '__main__':
    pipe = LongCatImagePipeline.from_pretrained(
        "meituan-longcat/LongCat-Image",
        torch_dtype=torch.bfloat16,
    )
    # On a 24 GB RTX 3090, keep CPU offload enabled: this is the path the
    # HF card's Quick Start ships ("Required ~17 GB") and the Meituan team
    # validated at ~18 GB peak. The 3090's 24 GB ceiling is the same as the
    # 4090's, so the same envelope applies. Disable only if you have ≥32 GB VRAM.
    pipe.enable_model_cpu_offload()

    prompt = (
        "A young Asian woman in a yellow knit sweater with a white necklace, "
        "hands resting on her knees, calm expression. Background is a rough "
        "brick wall, warm afternoon sunlight, medium-distance shot."
    )
    image = pipe(
        prompt,
        height=768,
        width=1344,
        guidance_scale=4.0,
        num_inference_steps=50,
        num_images_per_prompt=1,
        generator=torch.Generator("cpu").manual_seed(43),
        enable_cfg_renorm=True,
        enable_prompt_rewrite=True,
    ).images[0]
    image.save("./t2i_example.png")
Run it:
python run_t2i.py
Output lands at ./t2i_example.png. First run downloads any weights not pre-fetched; subsequent runs load straight from the HF cache.
Meituan's repo also includes scripts/inference_t2i.py with the same defaults hardcoded; that script is equivalent to the above and runs with python scripts/inference_t2i.py.
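If you want to record your own numbers for this card, a small timing wrapper around the pipeline call is enough. The sketch below is an assumption-laden helper, not part of the official scripts: measure is a hypothetical name, and the VRAM figure comes from torch.cuda.max_memory_allocated(), which reports memory held by PyTorch tensors and will read lower than what nvidia-smi shows for the whole process.

```python
import time
from contextlib import contextmanager


def _cuda_torch():
    """Return the torch module if it is installed and sees a CUDA device, else None."""
    try:
        import torch
    except ImportError:
        return None
    return torch if torch.cuda.is_available() else None


@contextmanager
def measure(stats: dict):
    """Record wall time and, when CUDA is available, peak allocated VRAM in GiB."""
    torch = _cuda_torch()
    if torch is not None:
        torch.cuda.reset_peak_memory_stats()
    start = time.perf_counter()
    try:
        yield stats
    finally:
        if torch is not None:
            torch.cuda.synchronize()  # wait for queued GPU work before stopping the clock
            stats["peak_vram_gib"] = torch.cuda.max_memory_allocated() / 2**30
        stats["seconds"] = time.perf_counter() - start
```

To use it, create stats = {} and wrap the pipe(...) call from run_t2i.py in a with measure(stats): block; afterwards stats holds the elapsed seconds and, on a CUDA machine, the peak allocation. The first call includes warm-up costs, so a second run is usually the more representative one.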
Text-in-image: LongCat-Image renders embedded text — the HF README is explicit that you must wrap the target text in single or double quotation marks (English '...' / "..." or the Chinese full-width equivalents ‘...’ / “...”). Without quotes, the model treats the words as scene description, not glyphs to render.
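The quoting rule is easy to apply programmatically when prompts are assembled from variables. The helper below is a hypothetical convenience, not part of the pipeline API, and the sample prompt is purely illustrative.

```python
def quote_for_rendering(text: str, chinese: bool = False) -> str:
    """Wrap text that should appear as glyphs in the image in quotation marks."""
    # Chinese full-width double quotes are U+201C / U+201D.
    opening, closing = ("\u201c", "\u201d") if chinese else ('"', '"')
    return f"{opening}{text}{closing}"


prompt = f"A wooden cafe sign that reads {quote_for_rendering('OPEN LATE')}, soft evening light."
print(prompt)
```

Only the text meant to be rendered goes through the helper; the rest of the prompt stays unquoted so it is read as scene description.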
Results
- Speed: No RTX-3090-specific inference-time measurement is cited in the official model card, GitHub repo, ComfyUI integration, or arXiv tech report at time of writing. The RTX 4090 sibling recipe omits speed for the same reason. Expect roughly 2× the per-image wall time of an RTX 4090, because the 3090 has ~½ of Ada's dense FP16 throughput (~35.58 TFLOPS on Ampere vs ~82.58 TFLOPS on Ada) and the diffusion forward is compute-bound. That figure is an extrapolation: the official sources do not name a per-GPU number, and this recipe does not invent one. Once a community run lands at /check/longcat-image/rtx-3090, this section gets updated; contribute one via /contribute if you measure it.
- VRAM usage: ~17 GB peak with enable_model_cpu_offload() per the HF model card Quick Start (verbatim: "Required ~17 GB"). The Meituan team comment on Issue #8 cross-confirms at ~18 GB. The sooxt98 community ComfyUI port's tier table (sooxt98/comfyui_longcat_image) places the RTX 3090 in the "Standard (CPU offload disabled): ~24GB+" tier; its separate "Low VRAM (CPU offload enabled): ~17-19GB" band describes offload-enabled mid-range cards (3080/4080), not the 3090 itself. All three citations are runtime-agnostic measurements that transfer cleanly within the 24 GB tier. Ada → Ampere is a no-op here because the path is BF16 throughout with no FP8 dependency.
- Quality notes: LongCat-Image is bilingual by design (Chinese + English) and the arXiv technical report (2512.07584) highlights multilingual text rendering as a primary target. The 6B parameter count is "significantly smaller than the nearly 20B or larger Mixture-of-Experts (MoE) architectures common in the field" per the same report, and the architecture is "a hybrid MM-DiT and Single-DiT structure, consistent with Flux1.dev". Quality at native BF16 is the canonical reference: there are no quantization tradeoffs to consider on this card, and no FP8 fallback to evaluate since Ampere has no FP8 tensor cores in the first place.
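The "roughly 2×" expectation in the Speed note follows from the two peak-throughput figures quoted there. The arithmetic below makes that ratio explicit; it is a ratio of spec-sheet numbers under the compute-bound assumption, not a measured inference speed.

```python
RTX_3090_FP16_TFLOPS = 35.58  # dense FP16, as quoted in the Speed note
RTX_4090_FP16_TFLOPS = 82.58  # dense FP16, as quoted in the Speed note

paper_ratio = RTX_4090_FP16_TFLOPS / RTX_3090_FP16_TFLOPS
print(f"4090 / 3090 peak FP16 ratio: {paper_ratio:.2f}x")
```

Real wall-time ratios can differ from this paper figure, since offload transfers and memory bandwidth also contribute, which is why a measured run is still needed.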
For the full benchmark data, see /check/longcat-image/rtx-3090.
Troubleshooting
pip install -r infer_requirements.txt errors with No module named 'dskernels'
The official Meituan infer_requirements.txt lists dskernels, which is not on PyPI — ghostnyambit reported the same blocker on Issue #8 (2025-12-11). dskernels is only required for training-time DeepSpeed optimizations and not for inference. Comment the line out of infer_requirements.txt and re-run the install — the diffusers Quick Start above does not touch DeepSpeed.
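Commenting the line out by hand is fine; for scripted setups, the same edit can be sketched as below. Both function names are hypothetical, and the matching logic assumes the package appears at the start of its own line, as is conventional in requirements files.

```python
import re
from pathlib import Path


def comment_out_package(requirements_text: str, package: str) -> str:
    """Comment out every requirement line whose package name equals `package`."""
    patched = []
    for line in requirements_text.splitlines():
        match = re.match(r"\s*([A-Za-z0-9_.-]+)", line)
        if match and match.group(1).lower() == package.lower():
            line = "# " + line  # already-commented lines start with '#' and never match
        patched.append(line)
    return "\n".join(patched) + "\n"


def patch_requirements_file(path: str, package: str = "dskernels") -> None:
    """Rewrite the requirements file in place with `package` commented out."""
    file = Path(path)
    file.write_text(comment_out_package(file.read_text(), package))
```

Running patch_requirements_file("infer_requirements.txt") from the cloned repo and then repeating the pip install should get past the missing package.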
OOM despite enable_model_cpu_offload()
Confirm you are not also calling pipe.to(device, torch.bfloat16) after enable_model_cpu_offload() — the two are mutually exclusive. The HF model card's Quick Start has the pipe.to(...) line commented out for exactly this reason, with the in-line note "Uncomment for high VRAM devices (Faster inference)". On a 24 GB card, leave it commented; the cited ~17–18 GB number assumes offload is active.
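One way to keep the two placement strategies from being combined by accident is to route the choice through a single function. This is a minimal sketch: place_pipeline is a hypothetical helper, and it assumes the dtype was already set via torch_dtype in from_pretrained, so only the device is passed to pipe.to.

```python
def place_pipeline(pipe, high_vram: bool = False):
    """Apply exactly one placement strategy to a diffusers pipeline."""
    if high_vram:
        pipe.to("cuda")  # whole pipeline resident on the GPU; faster, needs more VRAM
    else:
        pipe.enable_model_cpu_offload()  # components move to the GPU only while in use
    return pipe
```

On a 24 GB card, call place_pipeline(pipe) with the default so that offload stays active and the cited ~17 to 18 GB envelope applies.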
Out of host RAM, not VRAM
User reckless-huang reported on Issue #8 (2025-12-08) that the old script failed on a system with 32 GB host RAM even though VRAM was fine. The Meituan team's follow-up fix by junqiangwu in the same Issue #8 thread reduced host-memory pressure as well as VRAM. Make sure you've installed from the latest commit on main — if you cloned before 2025-12-08, pull again.
enable_model_cpu_offload() doesn't work on the Edit pipeline
User mingyi456 notes on Issue #8 that enable_model_cpu_offload() currently does not work with LongCatImageEditPipeline. This recipe is scoped to the base LongCatImagePipeline for exactly this reason — Edit needs the no-offload "~24 GB+" tier, which is borderline on this card. For LongCat-Image-Edit on a 3090, follow the upstream issue thread for the manual sequential-offload patch before assuming a turnkey workflow exists.
FP8 community quants are not relevant on Ampere
A community FP8 quant of the Edit variant exists in the ecosystem, but FP8 has no hardware acceleration on Ampere sm_86 — FP8 tensor cores first appear on Ada sm_89. On the 3090 there is no speed or VRAM benefit to chasing an FP8 quant: stick with the BF16 path documented above, which is also the one all three cited sources (HF card, Meituan team, sooxt98 port) name as the canonical envelope.