What You'll Build
A local anime-focused text-to-image pipeline using Anima, a 2B-parameter DiT built on NVIDIA Cosmos-Predict2 with a Qwen3 0.6B text encoder (a collaboration between CircleStone Labs and Comfy Org), running in ComfyUI on a 24 GB Radeon RX 7900 XTX (RDNA3, Navi 31, gfx1100) through the ROCm stack. The model is natively supported in ComfyUI (no custom nodes) and is tagged diffusion-single-file on its model card. Its ~5.6 GB of FP16 weights take just under a quarter of the 7900 XTX's 24 GB VRAM budget, so you can run the native BF16/FP16 weights with no quantization.
Hardware data: RX 7900 XTX (24 GB VRAM) · ~7 GB peak (unquantized) · ComfyUI on ROCm 7.2 · Benchmark data: /check/anima/rx-7900-xtx
⚠️ This is a ROCm recipe, not CUDA. The RX 7900 XTX runs on AMD's ROCm/HIP stack: there is no cu124/cu128 wheel, no xformers install, and no FP8/FP4 path here. RDNA3 has no FP8/FP4 hardware (its WMMA units accept FP16, BF16, INT8, INT4 only), so an FP8 checkpoint would be upcast to FP16 at compute time and gain no speed, and at 24 GB you don't need the smaller weights anyway. The attention path is PyTorch SDPA (ComfyUI's default; the explicit flag is --use-pytorch-cross-attention), not FlashAttention-2 and not xformers. If a guide tells you to pip install xformers or pick a cu12x wheel for this card, it's written for the wrong vendor.
⚠️ License note: Anima ships under the CircleStone Labs Non-Commercial License (license_name: circlestone-labs-non-commercial-license on the model card) and, as a derivative of nvidia/Cosmos-Predict2-2B-Text2Image, is additionally subject to NVIDIA's Open Model License. This is for non-commercial use; check the LICENSE.md on the model card before any commercial deployment.
Requirements
| Component | Minimum | Tested |
|---|---|---|
| GPU | 8 GB VRAM (ROCm-supported AMD card) | RX 7900 XTX (24 GB) |
| RAM | 16 GB system | — |
| Storage | ~5.6 GB (base 4.18 GB + encoder 1.19 GB + VAE 0.25 GB) | per HF Files tree |
| Driver | AMD ROCm 7.2.x on Linux | — |
| Software | ComfyUI + PyTorch (ROCm 7.2 build), Python 3.10+ | — |
Anima is a 2B-parameter latent-diffusion DiT that pairs the diffusion model with a Qwen3 0.6B text encoder and the Qwen-Image VAE, all confirmed by the repo's split_files/ tree. It is focused on anime concepts, characters, and styles, and is intentionally not tuned for realism, per the model card.
Installation
1. Update ComfyUI
Anima depends on the Cosmos-Predict2 diffusion model class and the Qwen-Image VAE — both available only in recent ComfyUI builds. From your ComfyUI root:
git pull
pip install -r requirements.txt
2. Install / confirm PyTorch for ROCm
The RX 7900 XTX (gfx1100) is an officially ROCm-supported GPU, so it uses the stable ROCm PyTorch wheel. Per the ComfyUI README "AMD GPUs (Linux)" section, the stable install command is:
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm7.2
ℹ️ Verify the ROCm tag before you copy it. As of this writing the ComfyUI README pins rocm7.2 as the stable wheel, but the rocmX.Y tag moves over time (6.3 → 6.4 → 7.x). Read the current line in the live ComfyUI README before running. A nightly variant (https://download.pytorch.org/whl/nightly/rocm7.2) "might have some performance improvements" per the README. There is also a separate experimental RDNA-3-specific wheel index (https://rocm.nightlies.amd.com/v2/gfx110X-all/) that the README lists for Windows+Linux RDNA3 support. On officially-supported Linux you do not need it; the stable whl/rocm7.2 wheel above is the canonical path.
3. Download the model files
Three files are needed, and they go into three different ComfyUI subfolders. The destination folders are stated verbatim in the model card; the weight files live under the repo's split_files/ tree. File sizes are verified from the Hugging Face tree API (base 4,182,218,328 bytes ≈ 4.18 GB; encoder 1,192,135,096 bytes ≈ 1.19 GB; VAE 253,806,246 bytes ≈ 0.25 GB):
# from ComfyUI/ root
cd models/diffusion_models
wget https://huggingface.co/circlestone-labs/Anima/resolve/main/split_files/diffusion_models/anima-base-v1.0.safetensors
cd ../text_encoders
wget https://huggingface.co/circlestone-labs/Anima/resolve/main/split_files/text_encoders/qwen_3_06b_base.safetensors
cd ../vae
wget https://huggingface.co/circlestone-labs/Anima/resolve/main/split_files/vae/qwen_image_vae.safetensors
Per the model card: anima-base-v1.0.safetensors goes in ComfyUI/models/diffusion_models, qwen_3_06b_base.safetensors in ComfyUI/models/text_encoders, and qwen_image_vae.safetensors in ComfyUI/models/vae (this is the Qwen-Image VAE — you may already have it).
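A quick way to confirm the three downloads landed in the right folders and were not truncated is to compare their on-disk sizes with the byte counts quoted above. The following is a minimal sketch, not part of ComfyUI: the helper name check_model_files is hypothetical, and it assumes you run it from the ComfyUI root and that the upstream files have not been re-uploaded with different sizes.

```python
from pathlib import Path

# Expected sizes in bytes, copied from the figures quoted above.
# Assumption: the files on Hugging Face have not changed since those were recorded.
EXPECTED_FILES = {
    "models/diffusion_models/anima-base-v1.0.safetensors": 4_182_218_328,
    "models/text_encoders/qwen_3_06b_base.safetensors": 1_192_135_096,
    "models/vae/qwen_image_vae.safetensors": 253_806_246,
}


def check_model_files(comfy_root):
    """Return a list of problems; an empty list means all files look right."""
    problems = []
    for rel_path, expected in EXPECTED_FILES.items():
        path = Path(comfy_root) / rel_path
        if not path.is_file():
            problems.append(f"missing: {rel_path}")
            continue
        actual = path.stat().st_size
        if actual != expected:
            problems.append(
                f"size mismatch: {rel_path} is {actual} bytes, expected {expected}"
            )
    return problems


if __name__ == "__main__":
    issues = check_model_files(".")  # "." assumes the ComfyUI root
    print("\n".join(issues) if issues else "all three files present with expected sizes")
```

A size mismatch usually points to an interrupted download; re-running the matching wget command is the simplest fix.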
4. Load the official workflow
The model is natively supported in ComfyUI. The model card ships an example workflow image (example.png) with the graph embedded — drag-and-drop it onto the ComfyUI canvas to load the workflow, or build a standard Load Diffusion Model → CLIP (Qwen3) → KSampler → VAE Decode → Save Image graph and point the loaders at the three files from step 3.
Running
Launch ComfyUI from the repo root with the PyTorch SDPA attention backend — this is the attention path AMD's ROCm stack uses (it replaces the CUDA-only FlashAttention/xformers paths, which don't apply on RDNA3). Per ComfyUI's cli_args.py, the flag is documented as "Use the new pytorch 2.0 cross attention function.":
python main.py --use-pytorch-cross-attention
This starts the server (default http://127.0.0.1:8188). Open it in a browser and load the workflow from step 4. Then, per the model card's documented generation settings:
- Type a prompt in the positive text node. The recommended positive prefix is "masterpiece, best quality, score_7, safe," and the documented tag order is [quality/meta/year/safety tags] [1girl/1boy/1other etc] [character] [series] [artist] [general tags]. Artist tags must be prefixed with @.
- Set resolution between 512×512 and 1536×1536 (the model card documents this range; ~1MP, e.g. 1024×1024, is the sweet spot).
- Steps: 30–50. CFG: 4–5. Sampler: er_sde is the model author's documented default; euler_a softens line work; dpmpp_2m_sde_gpu adds variety.
- Click "Queue Prompt".
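The tag order and numeric ranges above can be captured in a small helper for scripting prompts. This is an illustrative sketch only: build_prompt and check_settings are hypothetical names, not ComfyUI or Anima APIs, and the resolution check assumes the documented range applies to each side independently.

```python
def build_prompt(quality, subject, character="", series="", artist="", general=""):
    """Join tag groups in the documented order, prefixing the artist tag with '@'."""
    if artist and not artist.startswith("@"):
        artist = "@" + artist
    groups = [quality, subject, character, series, artist, general]
    # Drop empty groups and stray leading/trailing commas before joining.
    cleaned = [g.strip().strip(",").strip() for g in groups if g.strip()]
    return ", ".join(cleaned)


def check_settings(width, height, steps, cfg):
    """Return warnings for values outside the ranges listed above."""
    warnings = []
    # Assumption: 512-1536 is treated as a per-side limit.
    if not (512 <= width <= 1536 and 512 <= height <= 1536):
        warnings.append(f"resolution {width}x{height} is outside 512-1536 per side")
    if not 30 <= steps <= 50:
        warnings.append(f"steps={steps} is outside 30-50")
    if not 4 <= cfg <= 5:
        warnings.append(f"cfg={cfg} is outside 4-5")
    return warnings


if __name__ == "__main__":
    print(build_prompt("masterpiece, best quality, score_7, safe,", "1girl",
                       artist="some artist", general="smile, outdoors"))
    print(check_settings(1024, 1024, 30, 4.5))
```

Values outside these ranges are not necessarily invalid; the helper only flags departures from the documented settings.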
Generated PNGs land in ComfyUI/output/ with the full workflow embedded. At 24 GB you should not need --lowvram or the memory-saving --use-split-cross-attention fallback (documented in cli_args.py as "Use the split cross attention optimization.") — those are for VRAM-constrained cards.
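To check that an output PNG (or the model card's example.png) really carries an embedded graph, you can list its PNG text chunks with the standard library alone. This is a sketch under stated assumptions: it reads only uncompressed tEXt chunks, and the expectation that ComfyUI stores the graph under keys named "workflow" and "prompt" is an assumption to verify against your own files. The function name png_text_chunks is hypothetical.

```python
import struct

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def png_text_chunks(data):
    """Return {keyword: text} for every tEXt chunk in raw PNG bytes."""
    if not data.startswith(PNG_SIGNATURE):
        raise ValueError("not a PNG file")
    chunks = {}
    pos = len(PNG_SIGNATURE)
    # Each chunk is: 4-byte big-endian length, 4-byte type, data, 4-byte CRC.
    while pos + 8 <= len(data):
        length, chunk_type = struct.unpack(">I4s", data[pos:pos + 8])
        body = data[pos + 8:pos + 8 + length]
        if chunk_type == b"tEXt":
            keyword, _, text = body.partition(b"\x00")
            chunks[keyword.decode("latin-1")] = text.decode("latin-1")
        pos += 12 + length  # header (8) + data + CRC (4)
        if chunk_type == b"IEND":
            break
    return chunks


if __name__ == "__main__":
    import sys
    if len(sys.argv) > 1:
        with open(sys.argv[1], "rb") as fh:
            found = png_text_chunks(fh.read())
        # Assumption: an embedded graph shows up as "workflow" and/or "prompt".
        print("text chunk keys:", sorted(found))
```

If no text chunks are listed, the image was likely re-encoded (for example by an image host), and drag-and-drop loading will not restore the graph.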
Results
- Speed: Omitted. The backend reports verdict: unknown for anima × rx-7900-xtx. There is no published Anima benchmark on the RX 7900 XTX, and no comparable ROCm number exists in the community to anchor to. Forward-extrapolating from a different architecture, vendor, or card would be misleading. Once a community benchmark lands it will appear at /check/anima/rx-7900-xtx; please contribute yours.
- VRAM usage: ~7 GB peak unquantized, per the lilting.ch technical overview (Feb 2026, updated Apr 2026), whose spec table lists VRAM as ~7GB (without quantization). This is corroborated by the on-disk weights: the split_files/ tree totals ~5.6 GB at FP16 (base 4.18 GB + Qwen3-0.6B encoder 1.19 GB + Qwen-Image VAE 0.25 GB), and runtime activations push the peak to ~7 GB. Either way it leaves over two-thirds of the 7900 XTX's 24 GB budget free. There is no quantization tradeoff to consider on this card, so run the native FP16/BF16 weights. See /check/anima/rx-7900-xtx for any community-submitted measurement.
- Quality notes: Anime-first. The base model is intentionally style-neutral; reach for explicit @artist tags or LoRAs for stronger stylization (the model card documents finetuning tips: don't train the LLM adapter, use a low learning rate). Text rendering is weak: the model card notes it "can generally do single words and sometimes short phrases, but lengthy text rendering won't work well."
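The headroom arithmetic behind the VRAM figures can be reproduced in a few lines. This sketch uses only numbers quoted in this recipe and, as a simplifying assumption, treats "24 GB" as 24 × 10^9 bytes (decimal gigabytes); the ~7 GB peak is the cited estimate, not a measurement made here.

```python
# File sizes in bytes as quoted in the download step.
WEIGHT_BYTES = [4_182_218_328, 1_192_135_096, 253_806_246]
VRAM_GB = 24.0  # assumption: decimal GB, matching how the weights are quoted
PEAK_GB = 7.0   # approximate peak from the cited overview, not measured here

weights_gb = sum(WEIGHT_BYTES) / 1e9
free_fraction = 1 - PEAK_GB / VRAM_GB

print(f"weights: {weights_gb:.2f} GB ({weights_gb / VRAM_GB:.0%} of VRAM)")
# prints: weights: 5.63 GB (23% of VRAM)
print(f"peak: {PEAK_GB:.0f} GB, leaving {free_fraction:.0%} of VRAM free")
# prints: peak: 7 GB, leaving 71% of VRAM free
```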
For the full benchmark data, see /check/anima/rx-7900-xtx.
Troubleshooting
"Torch not compiled with CUDA enabled"
This means the installed PyTorch build has no GPU backend compiled in (typically a CPU-only wheel) rather than the ROCm build. Per the ComfyUI README, uninstall and reinstall against the ROCm wheel index:
pip uninstall torch
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm7.2
Confirm the installed build is the ROCm one: python -c "import torch; print(torch.__version__)" should print a version with a +rocm7.2-style suffix, and torch.cuda.is_available() should return True (ROCm builds expose the GPU through the torch.cuda namespace via HIP).
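The same check can be scripted so it reports which kind of build is installed. This is a minimal sketch: classify_torch_build is a hypothetical helper, and it relies on torch.version.hip being set only on ROCm builds and on the usual +cu / +cpu version suffixes, which is how PyTorch wheels are commonly tagged but is treated here as an assumption.

```python
def classify_torch_build(version, hip_version):
    """Classify a build from torch.__version__ and torch.version.hip."""
    if hip_version:
        return "rocm"  # torch.version.hip is None on non-ROCm builds
    if "+cu" in version:
        return "cuda"
    if "+cpu" in version:
        return "cpu"
    return "unknown"


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        print("PyTorch is not installed in this environment")
    else:
        kind = classify_torch_build(torch.__version__, torch.version.hip)
        print(f"torch {torch.__version__}: {kind} build (HIP: {torch.version.hip})")
        print("GPU visible:", torch.cuda.is_available())
        if torch.cuda.is_available():
            print("device 0:", torch.cuda.get_device_name(0))
```

Anything other than a "rocm" result with a visible GPU suggests the wrong wheel was installed.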
VAE Decode crashes or hangs on ROCm
ComfyUI's VAE-decode stage has a model-independent ROCm footgun on the 7900 XTX: the decode step can default to an FP32 path that inflates VRAM and triggers a driver timeout / crash above 1024 px. This is tracked in ROCm Issue #4729 ("VAE decode defaults to FP32 causing driver timeout above 1024 pixel"), where a 7900 XTX user reports it reproduces "on SDXL and other models too" — and on the same stack a sibling image model hit a related decode crash in ComfyUI Issue #11551 (gfx1100, 24 GB, BF16). Anima uses the Qwen-Image VAE, so if you see the KSampler stage complete and then VAE Decode hang or crash, try these in order:
- Keep resolution ≤ 1024 px on the longest side — the crash is far likelier above 1024, per Issue #4729.
- Swap in the VAE Decode (Tiled) node. It decodes the latent in tiles, keeping the decode-stage peak down.
- Move the VAE to the CPU with --cpu-vae. This is slower, but it sidesteps the ROCm VAE kernel entirely. Anima's VAE is small (0.25 GB), so the CPU penalty is modest.
- Try --bf16-vae (documented in cli_args.py as "Run the VAE in bf16."). This helps in some Z-Image reports, but note that community users on Issue #4729 found BF16 VAE decode can itself inflate VRAM on this card versus FP16/FP32, so treat it as one option to test, not a guaranteed fix, and prefer the tiled / CPU paths above if it doesn't help. (Avoid --fp16-vae, documented as "Run the VAE in fp16, might cause black images.")
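The two flag-based fallbacks above can be laid out as an ordered list of launch commands to try one at a time. This is only a sketch: launch_commands is a hypothetical helper, and the resolution cap and the VAE Decode (Tiled) node are workflow changes, so they have no command-line counterpart here.

```python
BASE_COMMAND = ["python", "main.py", "--use-pytorch-cross-attention"]

# Flag-based fallbacks from the list above, in the order they are suggested.
VAE_FALLBACK_FLAGS = ["--cpu-vae", "--bf16-vae"]


def launch_commands():
    """Return the default launch command followed by one command per fallback."""
    commands = [list(BASE_COMMAND)]
    for flag in VAE_FALLBACK_FLAGS:
        commands.append(BASE_COMMAND + [flag])
    return commands


if __name__ == "__main__":
    for command in launch_commands():
        print(" ".join(command))
```

Each command is meant to be tested on its own; combining --cpu-vae with --bf16-vae is not something this recipe covers.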
Generation feels slow on the first run — enable TunableOp
Per the ComfyUI README "AMD ROCm Tips": "You can try setting this env variable PYTORCH_TUNABLEOP_ENABLED=1 which might speed things up at the cost of a very slow initial run." TunableOp auto-tunes GEMM kernels for your card on the first pass (slow), then caches them for faster subsequent generations:
PYTORCH_TUNABLEOP_ENABLED=1 python main.py --use-pytorch-cross-attention
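If you start ComfyUI from a script rather than a shell, the same environment variable can be set on the child process. A minimal sketch, assuming it is run from the ComfyUI root; tunableop_env is a hypothetical helper name.

```python
import os
import subprocess
import sys


def tunableop_env(base_env=None):
    """Return a copy of the environment with TunableOp switched on."""
    env = dict(os.environ if base_env is None else base_env)
    env["PYTORCH_TUNABLEOP_ENABLED"] = "1"
    return env


if __name__ == "__main__":
    if os.path.exists("main.py"):
        # Equivalent to the shell one-liner above; expect a slow first run.
        subprocess.run(
            [sys.executable, "main.py", "--use-pytorch-cross-attention"],
            env=tunableop_env(),
            check=True,
        )
    else:
        print("main.py not found; run this from the ComfyUI root")
```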
Do not install xformers or FlashAttention
HF and ComfyUI guides written for NVIDIA frequently suggest pip install xformers or a FlashAttention wheel. On RDNA3 these are the wrong path: the ROCm xformers fork is limited, and ComfyUI already routes attention through PyTorch SDPA on this stack. Stick with the default, or force it explicitly with --use-pytorch-cross-attention.