How much VRAM does Anima need?

About 8 GB — the minimum this recipe targets.

How hard is this setup?

Intermediate: follow the steps below.

Anima 2B on RX 7900 XTX: ComfyUI on ROCm (BF16/FP16)

What You'll Build

A local anime-focused text-to-image pipeline using Anima, a 2B-parameter DiT built on NVIDIA Cosmos-Predict2 with a Qwen3 0.6B text encoder, developed as a collaboration between CircleStone Labs and Comfy Org. It runs in ComfyUI on a 24 GB Radeon RX 7900 XTX (RDNA3, Navi 31, gfx1100) through the ROCm stack. The model is natively supported in ComfyUI (no custom nodes) and is tagged diffusion-single-file on its model card. At ~5.6 GB of FP16 weights it takes just under a quarter of the 7900 XTX's 24 GB VRAM budget, so you run the native BF16/FP16 weights with no quantization needed.

Hardware data: RX 7900 XTX (24GB VRAM) · ~7GB peak (unquantized) · ComfyUI on ROCm 7.2 · See benchmark data

⚠️ This is a ROCm recipe, not CUDA. The RX 7900 XTX runs on AMD's ROCm/HIP stack — there is no cu124/cu128 wheel, no xformers install, and no FP8/FP4 path here. RDNA3 has no FP8/FP4 hardware (its WMMA units accept FP16, BF16, INT8, INT4 only), so an FP8 checkpoint would just upcast to FP16 with no memory saving — and at 24 GB you don't need it anyway. The attention path is PyTorch SDPA (ComfyUI's default; the explicit flag is --use-pytorch-cross-attention), not FlashAttention-2 and not xformers. If a guide tells you to pip install xformers or pick a cu12x wheel for this card, it's written for the wrong vendor.

⚠️ License note: Anima ships under the CircleStone Labs Non-Commercial License (license_name: circlestone-labs-non-commercial-license on the model card) and, as a derivative of nvidia/Cosmos-Predict2-2B-Text2Image, is additionally subject to NVIDIA's Open Model License. This is for non-commercial use — check the LICENSE.md on the model card before any commercial deployment.

Requirements

Component	Minimum	Tested
GPU	8 GB VRAM (ROCm-supported AMD card)	RX 7900 XTX (24 GB)
RAM	16 GB system	—
Storage	~5.6 GB (base 4.18 GB + encoder 1.19 GB + VAE 0.25 GB)	per HF Files tree
Driver	AMD ROCm 7.2.x on Linux	—
Software	ComfyUI + PyTorch (ROCm 7.2 build), Python 3.10+	—

Anima is a 2B-parameter latent-diffusion DiT that pairs the diffusion model with a Qwen3 0.6B text encoder and the Qwen-Image VAE, all confirmed by the repo's split_files/ tree. It is focused on anime concepts, characters, and styles, and is intentionally not tuned for realism, per the model card.

Installation

1. Update ComfyUI

Anima depends on the Cosmos-Predict2 diffusion model class and the Qwen-Image VAE — both available only in recent ComfyUI builds. From your ComfyUI root:

git pull
pip install -r requirements.txt

2. Install / confirm PyTorch for ROCm

The RX 7900 XTX (gfx1100) is an officially ROCm-supported GPU, so it uses the stable ROCm PyTorch wheel. Per the ComfyUI README "AMD GPUs (Linux)" section, the stable install command is:

pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm7.2

ℹ️ Verify the ROCm tag before you copy it. As of this writing the ComfyUI README pins rocm7.2 as the stable wheel — but the rocmX.Y tag moves over time (6.3 → 6.4 → 7.x). Read the current line in the live ComfyUI README before running. A nightly variant (https://download.pytorch.org/whl/nightly/rocm7.2) "might have some performance improvements" per the README. There is also a separate experimental RDNA-3-specific wheel index (https://rocm.nightlies.amd.com/v2/gfx110X-all/) the README lists for Windows+Linux RDNA3 support — on officially-supported Linux you do not need it; the stable whl/rocm7.2 wheel above is the canonical path.

3. Download the model files

Three files are needed, and they go into three different ComfyUI subfolders. The destination folders are stated verbatim in the model card; the weight files live under the repo's split_files/ tree. File sizes are verified from the Hugging Face tree API (base 4,182,218,328 bytes ≈ 4.18 GB; encoder 1,192,135,096 bytes ≈ 1.19 GB; VAE 253,806,246 bytes ≈ 0.25 GB):

# from ComfyUI/ root
cd models/diffusion_models
wget https://huggingface.co/circlestone-labs/Anima/resolve/main/split_files/diffusion_models/anima-base-v1.0.safetensors

cd ../text_encoders
wget https://huggingface.co/circlestone-labs/Anima/resolve/main/split_files/text_encoders/qwen_3_06b_base.safetensors

cd ../vae
wget https://huggingface.co/circlestone-labs/Anima/resolve/main/split_files/vae/qwen_image_vae.safetensors

Per the model card: anima-base-v1.0.safetensors goes in ComfyUI/models/diffusion_models, qwen_3_06b_base.safetensors in ComfyUI/models/text_encoders, and qwen_image_vae.safetensors in ComfyUI/models/vae (this is the Qwen-Image VAE — you may already have it).
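To confirm the downloads landed in the right folders and were not truncated, a small check like the following may help. It is a minimal sketch, not part of ComfyUI: the function name check_model_files is made up for illustration, and the expected sizes are simply the byte counts quoted above.

```python
from pathlib import Path

# (subfolder under ComfyUI/models, filename, expected size in bytes).
# Sizes are the Hugging Face tree figures quoted in this step.
EXPECTED_FILES = [
    ("diffusion_models", "anima-base-v1.0.safetensors", 4_182_218_328),
    ("text_encoders", "qwen_3_06b_base.safetensors", 1_192_135_096),
    ("vae", "qwen_image_vae.safetensors", 253_806_246),
]


def check_model_files(comfy_root):
    """Return a list of human-readable problems; an empty list means all good."""
    problems = []
    for subfolder, filename, expected_size in EXPECTED_FILES:
        path = Path(comfy_root) / "models" / subfolder / filename
        if not path.is_file():
            problems.append(f"missing: {path}")
            continue
        actual_size = path.stat().st_size
        if actual_size != expected_size:
            problems.append(
                f"size mismatch: {path} is {actual_size} bytes, expected {expected_size}"
            )
    return problems


if __name__ == "__main__":
    # Assumes the script is run from the ComfyUI root; pass another path if not.
    issues = check_model_files(".")
    print("\n".join(issues) if issues else "All three Anima files are in place.")
```

A size mismatch usually points to an interrupted download, in which case re-running the matching wget line is the simplest fix. If the files are updated upstream, the byte counts here would need updating too.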

4. Load the official workflow

The model is natively supported in ComfyUI. The model card ships an example workflow image (example.png) with the graph embedded — drag-and-drop it onto the ComfyUI canvas to load the workflow, or build a standard Load Diffusion Model → CLIP (Qwen3) → KSampler → VAE Decode → Save Image graph and point the loaders at the three files from step 3.

Running

Launch ComfyUI from the repo root with the PyTorch SDPA attention backend — this is the attention path AMD's ROCm stack uses (it replaces the CUDA-only FlashAttention/xformers paths, which don't apply on RDNA3). Per ComfyUI's cli_args.py, the flag is documented as "Use the new pytorch 2.0 cross attention function.":

python main.py --use-pytorch-cross-attention

This starts the server (default http://127.0.0.1:8188). Open it in a browser and load the workflow from step 4. Then, per the model card's documented generation settings:

Type a prompt in the positive text node. The recommended positive prefix is masterpiece, best quality, score_7, safe, and the documented tag order is [quality/meta/year/safety tags] [1girl/1boy/1other etc] [character] [series] [artist] [general tags]. Artist tags must be prefixed with @.
Set resolution between 512×512 and 1536×1536 (the model card documents this range; ~1MP — e.g. 1024×1024 — is the sweet spot).
Steps: 30–50. CFG: 4–5. Sampler: er_sde is the model author's documented default; euler_a softens line work; dpmpp_2m_sde_gpu adds variety.
Click "Queue Prompt".
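The tag order and setting ranges above can be sketched as a small helper. This is an illustration only: build_prompt and check_settings are hypothetical names, the example tags are placeholders, and reading the resolution range as a per-side limit is an assumption rather than something the model card states in those words.

```python
# Recommended positive prefix, as quoted from the model card above.
QUALITY_PREFIX = ["masterpiece", "best quality", "score_7", "safe"]


def build_prompt(subject, character="", series="", artists=(), general=()):
    """Join tags in the documented order, prefixing each artist with '@'."""
    tags = list(QUALITY_PREFIX)
    tags.append(subject)  # e.g. 1girl / 1boy / 1other
    tags.extend(t for t in (character, series) if t)
    tags.extend(a if a.startswith("@") else f"@{a}" for a in artists)
    tags.extend(general)
    return ", ".join(tags)


def check_settings(width, height, steps, cfg):
    """Return warnings for values outside the ranges quoted above."""
    warnings = []
    # Assumption: the 512x512 to 1536x1536 range applies to each side.
    if not (512 <= width <= 1536 and 512 <= height <= 1536):
        warnings.append("resolution outside 512-1536 per side")
    if not 30 <= steps <= 50:
        warnings.append("steps outside 30-50")
    if not 4 <= cfg <= 5:
        warnings.append("CFG outside 4-5")
    return warnings


if __name__ == "__main__":
    print(build_prompt("1girl", character="example character",
                       series="example series", artists=["example artist"],
                       general=["smile"]))
    print(check_settings(1024, 1024, steps=30, cfg=4.5))
```

Paste the resulting string into the positive text node. The warnings are advisory: values outside the documented ranges may still work, they are just not what the model card recommends.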

Generated PNGs land in ComfyUI/output/ with the full workflow embedded. At 24 GB you should not need --lowvram or the memory-saving --use-split-cross-attention fallback (documented in cli_args.py as "Use the split cross attention optimization.") — those are for VRAM-constrained cards.

Results

Speed: Omitted. The backend reports verdict: unknown for anima × rx-7900-xtx — there is no published Anima benchmark on the RX 7900 XTX, and no comparable ROCm number exists in the community to anchor to. Forward-extrapolating from a different architecture, vendor, or card would be misleading. Once a community benchmark lands it will appear at /check/anima/rx-7900-xtx — please contribute yours.
VRAM usage: ~7 GB peak unquantized, per the lilting.ch technical overview (Feb 2026, updated Apr 2026), whose spec table lists VRAM as ~7GB (without quantization). This is corroborated by the on-disk weights: the split_files/ tree totals ~5.6 GB at FP16 (base 4.18 GB + Qwen3-0.6B encoder 1.19 GB + Qwen-Image VAE 0.25 GB), and runtime activations push the peak to ~7 GB. Either way it leaves over two-thirds of the 7900 XTX's 24 GB budget free — there is no quantization tradeoff to consider on this card, so run the native FP16/BF16 weights. See /check/anima/rx-7900-xtx for any community-submitted measurement.
Quality notes: Anime-first. The base model is intentionally style-neutral; reach for explicit @artist tags or LoRAs for stronger stylization (the model card documents finetuning tips — don't train the LLM adapter, use a low learning rate). Text rendering is weak — the model card notes it "can generally do single words and sometimes short phrases, but lengthy text rendering won't work well."
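The VRAM arithmetic behind these figures can be reproduced directly from the quoted byte counts. This is a back-of-the-envelope sketch that treats the card's 24 GB as a decimal figure for simplicity; it restates numbers from this page and is not a measurement.

```python
# File sizes in bytes, as quoted from the Hugging Face tree.
FILE_BYTES = {
    "anima-base-v1.0.safetensors": 4_182_218_328,
    "qwen_3_06b_base.safetensors": 1_192_135_096,
    "qwen_image_vae.safetensors": 253_806_246,
}
VRAM_GB = 24  # RX 7900 XTX
PEAK_GB = 7   # peak figure quoted above, not measured here

weights_gb = sum(FILE_BYTES.values()) / 1e9
weights_share = weights_gb / VRAM_GB
free_share = (VRAM_GB - PEAK_GB) / VRAM_GB

print(f"weights on disk: {weights_gb:.2f} GB ({weights_share:.0%} of {VRAM_GB} GB)")
# weights on disk: 5.63 GB (23% of 24 GB)
print(f"free at ~{PEAK_GB} GB peak: {free_share:.0%}")
# free at ~7 GB peak: 71%
```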

For the full benchmark data, see /check/anima/rx-7900-xtx.

Troubleshooting

"Torch not compiled with CUDA enabled"

This means the installed PyTorch build has no GPU backend compiled in (a CPU-only wheel), so ComfyUI cannot reach the card through ROCm. Per the ComfyUI README, uninstall and reinstall against the ROCm wheel index:

pip uninstall torch
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm7.2

To confirm the installed build is the ROCm one, run python -c "import torch; print(torch.__version__)". It should print a +rocm7.2-style suffix, and torch.cuda.is_available() should return True (ROCm exposes the GPU through the cuda device namespace via HIP).
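The same check can be scripted. The sketch below relies on torch.version.hip, which is a version string on ROCm builds and None otherwise; classify_build is a hypothetical helper name, and the suggested next steps in its messages are only hints.

```python
def classify_build(version, hip_version, gpu_available):
    """Classify a PyTorch install from three values that torch exposes."""
    if hip_version is None:
        return "not a ROCm build: reinstall from the ROCm wheel index"
    if not gpu_available:
        return "ROCm build, but no GPU is visible: check the ROCm driver install"
    return f"ROCm build {version} (HIP {hip_version}) with a visible GPU"


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        print("PyTorch is not installed in this environment")
    else:
        print(classify_build(torch.__version__, torch.version.hip,
                             torch.cuda.is_available()))
```

Run it with the same Python interpreter that launches ComfyUI; checking a different virtual environment is an easy way to get a misleading answer.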

VAE Decode crashes or hangs on ROCm

ComfyUI's VAE-decode stage has a model-independent ROCm footgun on the 7900 XTX: the decode step can default to an FP32 path that inflates VRAM and triggers a driver timeout / crash above 1024 px. This is tracked in ROCm Issue #4729 ("VAE decode defaults to FP32 causing driver timeout above 1024 pixel"), where a 7900 XTX user reports it reproduces "on SDXL and other models too" — and on the same stack a sibling image model hit a related decode crash in ComfyUI Issue #11551 (gfx1100, 24 GB, BF16). Anima uses the Qwen-Image VAE, so if you see the KSampler stage complete and then VAE Decode hang or crash, try these in order:

Keep resolution ≤ 1024 px on the longest side — the crash is far likelier above 1024, per Issue #4729.
Swap in the VAE Decode (Tiled) node — it decodes the latent in tiles, keeping the decode-stage peak down.
Move the VAE to the CPU with --cpu-vae — slower, but it sidesteps the ROCm VAE kernel entirely. Anima's VAE is small (0.25 GB), so the CPU penalty is modest.
Try --bf16-vae (documented in cli_args.py as "Run the VAE in bf16."). This helps in some Z-Image reports, but note that community users on Issue #4729 found BF16 VAE decode can itself inflate VRAM on this card versus FP16/FP32 — so treat it as one option to test, not a guaranteed fix, and prefer the tiled / CPU paths above if it doesn't help. (Avoid --fp16-vae, documented as "Run the VAE in fp16, might cause black images.")

Generation feels slow: try TunableOp

Per the ComfyUI README "AMD ROCm Tips": "You can try setting this env variable PYTORCH_TUNABLEOP_ENABLED=1 which might speed things up at the cost of a very slow initial run." TunableOp auto-tunes GEMM kernels for your card on the first pass (slow), then caches them for faster subsequent generations:

PYTORCH_TUNABLEOP_ENABLED=1 python main.py --use-pytorch-cross-attention
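If you switch between the launch variants in this troubleshooting section often, a tiny wrapper can assemble the command and environment for you. This is a sketch under stated assumptions: build_launch is a hypothetical helper, it only uses flags and the environment variable already named on this page, and refusing to combine the two VAE fallbacks is a design choice made here so each one can be tested in isolation, not a ComfyUI restriction.

```python
import os
import sys


def build_launch(tunable_op=False, cpu_vae=False, bf16_vae=False):
    """Return (command, environment) for launching ComfyUI from its repo root."""
    if cpu_vae and bf16_vae:
        raise ValueError("pick one VAE fallback at a time so its effect is clear")
    command = [sys.executable, "main.py", "--use-pytorch-cross-attention"]
    if cpu_vae:
        command.append("--cpu-vae")
    if bf16_vae:
        command.append("--bf16-vae")
    env = dict(os.environ)  # copy, so the current process is left untouched
    if tunable_op:
        env["PYTORCH_TUNABLEOP_ENABLED"] = "1"
    return command, env


if __name__ == "__main__":
    command, env = build_launch(tunable_op=True)
    # Printed only; pass command and env to subprocess.run to actually launch.
    print(" ".join(command))
```

To start the server from Python, hand the pair to subprocess.run(command, env=env) while the working directory is the ComfyUI root.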

Do not install xformers or FlashAttention

HF and ComfyUI guides written for NVIDIA frequently suggest pip install xformers or a FlashAttention wheel. On RDNA3 these are the wrong path: the ROCm xformers fork is limited, and ComfyUI already routes attention through PyTorch SDPA on this stack. Stick with the default, or force it explicitly with --use-pytorch-cross-attention.