What You'll Build
A local anime-focused text-to-image pipeline using Anima, a 2-billion-parameter DiT built on NVIDIA Cosmos-Predict2 with a Qwen3 0.6B text encoder — a collaboration between CircleStone Labs and Comfy Org. On an 8GB RTX 5060 the binding constraint is VRAM, so the quant choice is the recipe: the INT8 ConvRot weights are the fastest path on this card and the one we lead with, with MXFP8 and full BF16 as alternatives.
Hardware data: RTX 5060 (8GB VRAM) · INT8 ConvRot ~1.12 it/s, peak 8.0GB · 24 steps · See benchmark data
⚠️ 8GB is a tight fit. All three quant paths peak at 8.0GB on this 8GB card per the benchmark data — the model fits, but with essentially no headroom. Close other GPU workloads (browser hardware-acceleration, a second model) before running, and prefer the INT8 ConvRot path below, which is both the fastest measured option and the one that loads the smallest weight file.
⚠️ License note: Anima ships under the CircleStone Labs Non-Commercial License (`license_name: circlestone-labs-non-commercial-license` on the model card) and, as a "Derivative Model" of `nvidia/Cosmos-Predict2-2B-Text2Image`, is additionally subject to NVIDIA's Open Model License. The weights are non-commercial; per the model card, the generated images (Outputs) may be used commercially. Commercial use of the model behind an API requires a separate license; email tdrussell@circlestone.ai.
Requirements
| Component | Minimum | Tested |
|---|---|---|
| GPU | 8GB VRAM (NVIDIA) | RTX 5060 (8GB) |
| RAM | 16GB system | — |
| Storage | ~4.1GB (INT8 ConvRot 2.65GB + encoder 1.19GB + VAE 0.25GB); ~5.6GB for the full BF16 base | per HF tree API |
| Software | ComfyUI (recent build) + PyTorch (cu12x/cu13x), Python 3.10+ | — |
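The storage figures in the table are plain sums of the files each path needs. A quick arithmetic check, using the byte counts this guide quotes for the INT8 ConvRot file, the text encoder, and the VAE (the constant and function names are illustrative, and the BF16 total falls back to the rounded GB figures because no exact byte count is quoted for that file):

```python
# Byte counts as quoted in this guide (Hugging Face tree API figures).
INT8_CONVROT_BYTES = 2_649_827_358
TEXT_ENCODER_BYTES = 1_192_135_096
VAE_BYTES = 253_806_246


def to_gb(num_bytes: int) -> float:
    """Convert bytes to decimal gigabytes (1 GB = 10**9 bytes)."""
    return num_bytes / 1e9


# Lead path: INT8 ConvRot weights plus the shared encoder and VAE.
int8_total_bytes = INT8_CONVROT_BYTES + TEXT_ENCODER_BYTES + VAE_BYTES

# BF16 path: only rounded sizes are quoted, so sum those instead.
bf16_total_gb = 4.18 + 1.19 + 0.25

print(f"INT8 ConvRot path: {to_gb(int8_total_bytes):.2f} GB")
print(f"BF16 path:         {bf16_total_gb:.2f} GB")
```

This works out to about 4.10 GB for the INT8 ConvRot path and about 5.62 GB for the BF16 path, matching the rounded ~4.1GB and ~5.6GB in the table.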
Anima is a 2B-parameter latent-diffusion DiT that pairs the diffusion model with a Qwen3 0.6B text encoder and the Qwen-Image VAE, all confirmed by the repo's split_files/ tree. It is natively supported in ComfyUI (no custom nodes for the BF16 path), and is focused on anime concepts, characters, and styles, intentionally not tuned for realism, per the model card.
Installation
1. Update ComfyUI
Anima depends on the Cosmos-Predict2 diffusion model class and the Qwen-Image VAE — both available only in recent ComfyUI builds, and the INT8 path additionally needs ComfyUI's native INT8 support. From your ComfyUI root:
git pull
pip install -r requirements.txt
ℹ️ PyTorch on the RTX 5060 (Blackwell, sm_120): the 5060 is a Blackwell card. Use a recent PyTorch wheel that ships sm_120 kernels (a `cu128` or newer wheel: `pip install torch torchvision --index-url https://download.pytorch.org/whl/cu128`, or newer per the live PyTorch install page). The older `cu124`/`cu126` wheels predate sm_120 and can fail to launch kernels on Blackwell. Anima's runtime has no FlashAttention-2 dependency, so the FA2-sm_120 kernel gap is a non-issue here; ComfyUI's default attention (or SageAttention, used in the INT8 path below) covers it.
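To confirm that the installed wheel actually ships kernels for your card, you can compare the GPU's compute capability with the architectures the wheel was built for. The following is a minimal sketch, not an official check: the helper name `wheel_supports_gpu` is made up for illustration, and it only looks for an exact `sm_XY` entry, so it ignores any PTX forward-compatibility a build might offer.

```python
def wheel_supports_gpu(arch_list, capability):
    """Return True if the wheel's arch list contains an exact sm_XY match.

    arch_list  : e.g. ["sm_90", "sm_120"], as returned by torch.cuda.get_arch_list()
    capability : (major, minor) tuple, as returned by torch.cuda.get_device_capability()
    """
    major, minor = capability
    return f"sm_{major}{minor}" in arch_list


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        print("PyTorch is not installed in this environment")
    else:
        if not torch.cuda.is_available():
            print("PyTorch cannot see a CUDA device")
        else:
            capability = torch.cuda.get_device_capability(0)
            arch_list = torch.cuda.get_arch_list()
            print("CUDA build:", torch.version.cuda)
            print("GPU capability:", capability)
            print("Wheel architectures:", arch_list)
            print("Exact kernel match:", wheel_supports_gpu(arch_list, capability))
```

On an RTX 5060 the capability should be reported as `(12, 0)`, so the check looks for `sm_120` in the wheel's architecture list.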
2. Download the model files
You always need the text encoder and VAE (shared across every quant), plus one diffusion-model file depending on the path you pick. The destination folders are stated verbatim in the model card. File sizes are verified from the Hugging Face tree API:
# from ComfyUI/ root: shared encoder + VAE (every quant path needs these)
cd models/text_encoders
wget https://huggingface.co/circlestone-labs/Anima/resolve/main/split_files/text_encoders/qwen_3_06b_base.safetensors
cd ../vae
wget https://huggingface.co/circlestone-labs/Anima/resolve/main/split_files/vae/qwen_image_vae.safetensors
qwen_3_06b_base.safetensors (1,192,135,096 bytes ≈ 1.19GB) goes in ComfyUI/models/text_encoders, and qwen_image_vae.safetensors (253,806,246 bytes ≈ 0.25GB) goes in ComfyUI/models/vae (this is the Qwen-Image VAE — you may already have it).
Now pick one diffusion-model file:
# from ComfyUI/ root
cd models/diffusion_models
# LEAD — INT8 ConvRot (fastest measured on the 5060; smallest weight file, 2.65GB)
wget https://huggingface.co/Bedovyy/Anima-INT8/resolve/main/anima-base-v1.0-int8convrot.safetensors
# ALT A — MXFP8 (hardware-FP8 on Blackwell; ~2.6GB)
# wget https://huggingface.co/pachiiahri/anima-fp8-comfyui/resolve/main/anima-preview_tcfp8_mixed.safetensors
# ALT B — full BF16 base (highest quality, tightest fit, slowest; 4.18GB)
# wget https://huggingface.co/circlestone-labs/Anima/resolve/main/split_files/diffusion_models/anima-base-v1.0.safetensors
The INT8 ConvRot file (anima-base-v1.0-int8convrot.safetensors, 2,649,827,358 bytes ≈ 2.65GB) is a direct quant of the canonical model, as its card metadata declares (base_model: circlestone-labs/Anima, base_model_relation: quantized). The MXFP8 file from pachiiahri/anima-fp8-comfyui carries calibrated metadata for hardware FP8 (Blackwell's TensorCoreFP8Layout).
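Since the guide quotes exact byte counts for three of the files, a truncated or interrupted download is easy to spot by comparing sizes on disk. A minimal sketch, assuming the lead INT8 ConvRot path and the destination folders from step 2 (`check_downloads` and `EXPECTED_SIZES` are illustrative names; a size match does not prove the contents are intact, it only catches incomplete files):

```python
from pathlib import Path

# Expected sizes in bytes, as quoted in this guide.
EXPECTED_SIZES = {
    "models/text_encoders/qwen_3_06b_base.safetensors": 1_192_135_096,
    "models/vae/qwen_image_vae.safetensors": 253_806_246,
    "models/diffusion_models/anima-base-v1.0-int8convrot.safetensors": 2_649_827_358,
}


def check_downloads(root, expected=EXPECTED_SIZES):
    """Return {relative_path: status}; status is 'ok', 'missing', or a size-mismatch note."""
    report = {}
    for rel_path, want in expected.items():
        path = Path(root) / rel_path
        if not path.is_file():
            report[rel_path] = "missing"
            continue
        got = path.stat().st_size
        report[rel_path] = "ok" if got == want else f"size mismatch: {got} != {want}"
    return report


if __name__ == "__main__":
    # Assumes the script is run from the ComfyUI root directory.
    for rel_path, status in check_downloads(".").items():
        print(f"{status}: {rel_path}")
```

If you chose the MXFP8 or BF16 file instead, swap the last entry for that filename and the size you see on its Hugging Face file listing.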
3. (INT8 path) Install the INT8-Fast custom node
ComfyUI has native INT8 support, but per the Bedovyy/Anima-INT8 card it rejects the int8_rowwise format and offers no real speedup over BF16 on its own. The card's workaround: the published int8convrot file actually uses int8_rowwise with convrot but is labeled int8_tensorwise to bypass ComfyUI's format restrictions, so the stock Load Diffusion Model node loads it. For the actual speed gain, install the ComfyUI-INT8-Fast custom node:
# from ComfyUI/custom_nodes/
git clone https://github.com/BobJohnson24/ComfyUI-INT8-Fast
4. Load the official workflow
The BF16 path is natively supported — the model card ships an example workflow image (example.png) with the graph embedded; drag-and-drop it onto the ComfyUI canvas. For the INT8/MXFP8 paths, start from the same graph and point the Load Diffusion Model node at the quant file you downloaded in step 2.
Running
After the encoder + VAE + your chosen diffusion-model file are in place and the workflow is loaded:
- Type a prompt in the positive node. The model card's recommended positive prefix is `masterpiece, best quality, score_7, safe,` and the recommended negative is `worst quality, low quality, score_1, score_2, score_3, artist name`. The documented tag order is `[quality/meta/year/safety tags] [1girl/1boy/1other etc] [character] [series] [artist] [general tags]`. Artist tags must be prefixed with `@`.
- Set resolution between 512×512 and 1536×1536 (the model card documents this range; ~1MP, e.g. 1024×1024, is the sweet spot).
- Steps: 30–50. CFG: 4–5. Sampler: `er_sde` is the author's documented default; `euler_a` softens line work; `dpmpp_2m_sde_gpu` adds variety.
- For the INT8 path, launch ComfyUI with the INT8-Fast options the Bedovyy card documents (`--fast fp16_accumulation fp8_matrix_mult cublas_ops --use-sage-attention --disable-dynamic-vram`) and run a couple of warm-up generations before timing (the benchmark's BF16 figure was measured after 2 warmup rounds).
- Click "Queue Prompt". Outputs land in `ComfyUI/output/`.
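The tag order and the `@` prefix for artists are easy to get wrong when typing prompts by hand. A small helper that assembles a positive prompt in the documented order might look like the sketch below. The function `build_prompt` and the `example_*` tags are hypothetical placeholders, not names from the model card; only the quality prefix and the negative prompt are taken from the recommendations above.

```python
QUALITY_PREFIX = ["masterpiece", "best quality", "score_7", "safe"]
NEGATIVE_PROMPT = "worst quality, low quality, score_1, score_2, score_3, artist name"


def build_prompt(quality, subject, character="", series="", artists=(), general=()):
    """Join tags in the documented order, prefixing each artist tag with '@'."""
    cleaned_artists = [a.strip() for a in artists if a and a.strip()]
    artist_tags = [a if a.startswith("@") else f"@{a}" for a in cleaned_artists]
    # Order: quality/meta/year/safety, subject count, character, series, artist, general.
    parts = [*quality, subject, character, series, *artist_tags, *general]
    return ", ".join(p.strip() for p in parts if p and p.strip())


positive = build_prompt(
    QUALITY_PREFIX,
    "1girl",
    character="example_character",  # hypothetical placeholder tag
    series="example_series",        # hypothetical placeholder tag
    artists=["example_artist"],     # becomes "@example_artist"
    general=["smile", "outdoors"],
)
print(positive)
print(NEGATIVE_PROMPT)
```

Paste the first printed line into the positive prompt node and the second into the negative one. Empty fields are skipped, so the same helper works for prompts without a named character or series.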
Results
All three numbers below are firsthand measurements on an 8GB RTX 5060 at 24 steps, from the benchmark data:
- Speed (INT8 ConvRot, lead): 1.12 it/s, peak 8.0GB. The fastest of the three quant paths on this card. See /check/anima/rtx-5060.
- Speed (MXFP8, alternative): 0.89 it/s, peak 8.0GB. Blackwell's hardware-FP8 path; to get a speedup from it you need `torch.compile` (via `TorchCompileModelAdvanced` from KJNodes in `max-autotune-no-cudagraphs` mode), otherwise it can run no faster than BF16.
- Speed (BF16, alternative): 0.78 it/s, peak 8.0GB. The full-precision base. It fits the 8GB card but is the slowest and tightest path; choose it only when you want maximum quality and can spare the generation time.
- VRAM usage: 8.0GB peak on all three paths per the benchmark data, a tight fit on the 8GB card. The INT8 ConvRot file (2.65GB on disk) leaves the most room for the encoder, VAE, and activations during the run.
- Quality notes: Anime-first; the base model is intentionally style-neutral, so reach for explicit `@artist` tags or LoRAs for stronger stylization. Text rendering is weak: the model card notes it "can generally do single words and sometimes short phrases, but lengthy text rendering won't work well." INT8/MXFP8 introduce minor quality loss versus BF16 but, unlike the Turbo LoRA, preserve negative-prompt behaviour.
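To put the it/s figures in more practical terms, the arithmetic below converts each measured rate into sampling time for the 24-step benchmark run and a speedup relative to BF16. It is a rough sketch: it covers the sampling loop only and ignores text encoding, VAE decode, and model loading, which the it/s figure does not include.

```python
# Measured rates from the benchmark data quoted above (iterations per second).
MEASURED_IT_PER_S = {"INT8 ConvRot": 1.12, "MXFP8": 0.89, "BF16": 0.78}
STEPS = 24


def sampling_seconds(it_per_s, steps):
    """Seconds spent in the sampling loop for a given step count."""
    return steps / it_per_s


def speedup(it_per_s, baseline_it_per_s):
    """How many times faster a path is than the baseline path."""
    return it_per_s / baseline_it_per_s


baseline = MEASURED_IT_PER_S["BF16"]
for name, rate in MEASURED_IT_PER_S.items():
    print(
        f"{name:<13} {sampling_seconds(rate, STEPS):5.1f} s for {STEPS} steps, "
        f"{speedup(rate, baseline):.2f}x BF16"
    )
```

That comes to roughly 21.4 s (INT8 ConvRot), 27.0 s (MXFP8), and 30.8 s (BF16) of sampling per image at 24 steps, or about 1.44x and 1.14x the BF16 rate for the two quantized paths.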
For the full benchmark data, see /check/anima/rtx-5060. Measured a different quant, resolution, or step count on your 5060? Please contribute your numbers.
Troubleshooting
INT8 model loads but gives no speedup
Stock ComfyUI rejects the int8_rowwise format and runs INT8 no faster than BF16. Use the int8convrot file from Bedovyy/Anima-INT8 (which is labeled int8_tensorwise so the loader accepts it) together with the ComfyUI-INT8-Fast custom node from step 3 and the --fast ... --disable-dynamic-vram launch options listed under Running. The INT8-Fast README documents "between 1.5~2x faster inference" on a 3090 and that it "appears to be faster than FP8 on 40-Series and above"; the 5060's measured 1.12 it/s (about 1.4× the 0.78 it/s BF16 figure and about 1.3× the 0.89 it/s MXFP8 figure) is broadly consistent with that pattern.
Out of memory at 8GB
Every path peaks at 8.0GB, so there is no headroom. Close other GPU consumers first (browsers with hardware acceleration, a second loaded model). If you still run out of memory at the final decode, swap the standard VAE decode node for ComfyUI's VAE Decode (Tiled) node, which decodes the latent in tiles to keep the decode-stage peak down. Dropping resolution toward 1024×1024 or below also reduces activation memory.
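Before launching ComfyUI it helps to see how much VRAM is already taken by other processes. One way is to read `nvidia-smi`'s CSV query output, as in this minimal sketch (the function names are illustrative; it assumes `nvidia-smi` is on your PATH and reports memory in MiB, and it returns `None` rather than failing if the tool is missing or its output cannot be parsed):

```python
import shutil
import subprocess


def parse_memory_csv(text):
    """Parse lines of 'used, total' (MiB, one GPU per line) into (used, total) int pairs."""
    rows = []
    for line in text.strip().splitlines():
        used, total = (int(field.strip()) for field in line.split(","))
        rows.append((used, total))
    return rows


def query_gpu_memory():
    """Return [(used_mib, total_mib), ...] from nvidia-smi, or None if unavailable."""
    if shutil.which("nvidia-smi") is None:
        return None
    try:
        result = subprocess.run(
            ["nvidia-smi", "--query-gpu=memory.used,memory.total",
             "--format=csv,noheader,nounits"],
            capture_output=True, text=True, check=True,
        )
        return parse_memory_csv(result.stdout)
    except (subprocess.CalledProcessError, OSError, ValueError):
        return None


if __name__ == "__main__":
    rows = query_gpu_memory()
    if rows is None:
        print("Could not read GPU memory from nvidia-smi")
    else:
        for index, (used, total) in enumerate(rows):
            print(f"GPU {index}: {used} MiB used of {total} MiB before launching ComfyUI")
```

If the "used" figure is already several hundred MiB with ComfyUI closed, something else is holding VRAM that the 8.0GB peak will need.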
Slow generations on first run
The benchmark's BF16 figure was measured after 2 warmup rounds; the first generation after a model load is slower while ComfyUI compiles and caches. Run one or two throwaway generations before judging speed, especially with torch.compile enabled on the MXFP8 path (the first compiled run pays the full compile cost).
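The same warm-up discipline applies if you time your own runs. The sketch below shows the general pattern of discarding the first rounds and measuring only the later ones. It is not the benchmark's actual harness: `time_after_warmup` is an illustrative name, and `fake_generate` is a stand-in you would replace with whatever triggers one generation in your setup.

```python
import time


def time_after_warmup(generate, warmup=2, measured=3):
    """Call `generate` warmup + measured times; return durations (s) of the measured calls only."""
    for _ in range(warmup):
        generate()  # discarded: these rounds absorb load, compile, and cache costs
    durations = []
    for _ in range(measured):
        start = time.perf_counter()
        generate()
        durations.append(time.perf_counter() - start)
    return durations


if __name__ == "__main__":
    def fake_generate():
        # Stand-in workload so the sketch runs on its own.
        time.sleep(0.01)

    runs = time_after_warmup(fake_generate)
    print(f"mean of {len(runs)} measured runs: {sum(runs) / len(runs):.3f} s")
```

Reporting the mean of the measured runs, together with the warm-up count, makes your numbers comparable with the figures on the benchmark page.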
Got different numbers?
The /check/anima/rtx-5060 page currently carries three 24-step measurements (INT8 ConvRot, MXFP8, BF16). If you measure a torch.compile-accelerated MXFP8 run, a different resolution, or a different step count on your 5060, please contribute your numbers so the live data reflects more of the quant matrix.