How much VRAM does Anima need?

About 6 GB — the minimum this recipe targets.

How hard is this setup?

Intermediate: follow the steps below.

Anima 2B on RX 7800 XT: ComfyUI on ROCm (BF16/FP16)

What You'll Build

A local anime-focused text-to-image pipeline using Anima, a 2B-parameter DiT built on NVIDIA Cosmos-Predict2 with a Qwen3 0.6B text encoder, developed as a collaboration between CircleStone Labs and Comfy Org. It runs in ComfyUI on a 16 GB Radeon RX 7800 XT (RDNA3, Navi 32, gfx1101) through the ROCm stack. The model is natively supported in ComfyUI (no custom nodes) and is tagged diffusion-single-file on its model card. Its ~5.6 GB of FP16 weights take about a third of the 7800 XT's 16 GB of VRAM, so you can run the native BF16/FP16 weights without quantization.

Hardware data: RX 7800 XT (16GB VRAM) · ~7GB peak (unquantized) · ComfyUI on ROCm 7.2 · See benchmark data

⚠️ This is a ROCm recipe, not CUDA. The RX 7800 XT runs on AMD's ROCm/HIP stack — there is no cu124/cu128 wheel, no xformers install, and no FP8/FP4 path here. RDNA3 has no FP8/FP4 hardware (its WMMA units accept FP16, BF16, INT8, INT4 only), so an FP8 checkpoint would just upcast to FP16 with no memory saving — and at ~5.6 GB of weights you don't need it anyway. The attention path is PyTorch SDPA (ComfyUI's default; the explicit flag is --use-pytorch-cross-attention), not FlashAttention-2 and not xformers. If a guide tells you to pip install xformers or pick a cu12x wheel for this card, it's written for the wrong vendor.

⚠️ License note: Anima ships under the CircleStone Labs Non-Commercial License (license_name: circlestone-labs-non-commercial-license on the model card) and, as a derivative of nvidia/Cosmos-Predict2-2B-Text2Image, is additionally subject to NVIDIA's Open Model License. This is for non-commercial use — check the LICENSE.md on the model card before any commercial deployment.

Requirements

Component	Minimum	Tested
GPU	6 GB VRAM (ROCm-supported AMD card)	RX 7800 XT (16 GB)
RAM	16 GB system	—
Storage	~5.6 GB (base 4.18 GB + encoder 1.19 GB + VAE 0.25 GB)	per HF Files tree
Driver	AMD ROCm 7.2.x on Linux	—
Software	ComfyUI + PyTorch (ROCm 7.2 build), Python 3.10+	—

Anima is a 2B-parameter latent-diffusion DiT that pairs the diffusion model with a Qwen3 0.6B text encoder and the Qwen-Image VAE, all confirmed by the repo's split_files/ tree. It is focused on anime concepts, characters, and styles, and is intentionally not tuned for realism, per the model card.

Installation

1. Update ComfyUI

Anima depends on the Cosmos-Predict2 diffusion model class and the Qwen-Image VAE — both available only in recent ComfyUI builds. From your ComfyUI root:

git pull
pip install -r requirements.txt

2. Install / confirm PyTorch for ROCm

The RX 7800 XT (gfx1101) is an officially ROCm-supported GPU, so it uses the stable ROCm PyTorch wheel. Per the ComfyUI README "AMD GPUs (Linux)" section, the stable install command is:

pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm7.2

ℹ️ Verify the ROCm tag before you copy it. As of this writing the ComfyUI README pins rocm7.2 as the stable wheel — but the rocmX.Y tag moves over time (6.3 → 6.4 → 7.x). Read the current line in the live ComfyUI README before running. A nightly variant (https://download.pytorch.org/whl/nightly/rocm7.2) "might have some performance improvements" per the README. There is also a separate experimental RDNA-3-specific wheel index (https://rocm.nightlies.amd.com/v2/gfx110X-all/) the README lists for Windows+Linux RDNA3 support — on officially-supported Linux you do not need it; the stable whl/rocm7.2 wheel above is the canonical path.
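Because the tag moves, it can help to keep it in one place and derive the install command from it. The following is a minimal sketch: `rocm_pip_command` is a hypothetical helper, not part of ComfyUI or pip, and the `--pre` flag on the nightly variant is an assumption about how nightly wheels are published, so compare the result with the live README before running it.

```python
# Index URL patterns taken from the two URLs quoted in this guide.
STABLE_INDEX = "https://download.pytorch.org/whl/{tag}"
NIGHTLY_INDEX = "https://download.pytorch.org/whl/nightly/{tag}"


def rocm_pip_command(tag: str = "rocm7.2", nightly: bool = False) -> list[str]:
    """Build the pip argv for a given rocmX.Y wheel tag (hypothetical helper)."""
    if not tag.startswith("rocm"):
        raise ValueError(f"expected a rocmX.Y tag, got {tag!r}")
    index = (NIGHTLY_INDEX if nightly else STABLE_INDEX).format(tag=tag)
    cmd = ["pip", "install"]
    if nightly:
        # Assumption: nightly wheels are pre-releases and need --pre.
        cmd.append("--pre")
    cmd += ["torch", "torchvision", "torchaudio", "--index-url", index]
    return cmd


# Print the stable command so it can be compared with the README line.
print(" ".join(rocm_pip_command()))
```

Changing only the `tag` argument keeps the stable and nightly commands in sync when the README moves to a newer ROCm release.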

ℹ️ gfx1101 vs gfx1100 — no override needed on current ROCm. The 7800 XT's ROCm target is gfx1101 (Navi 32 — not gfx1102, which is the RX 7600). It is a first-class supported target on current ROCm, so the stable wheel above works as-is. HSA_OVERRIDE_GFX_VERSION=11.0.0 (masquerading as gfx1100) is only a legacy fallback for older libraries that shipped gfx1100-only kernels — you should not need it on a current ROCm 7.2 stack. Set it only if a specific library refuses to launch a kernel on gfx1101.
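To see how the override value relates to a target name, here is a small sketch. It assumes the usual major.minor.stepping convention in which the last two fields are written as single hexadecimal digits (so 11.0.0 corresponds to gfx1100). Both helper names are hypothetical, and the sample `rocminfo` text is illustrative rather than captured from a real machine.

```python
import re


def override_to_gfx(version: str) -> str:
    """Map an HSA_OVERRIDE_GFX_VERSION value such as '11.0.0' to a gfx target name.

    Assumption: minor and stepping are each rendered as one hex digit.
    """
    major, minor, step = (int(part) for part in version.split("."))
    return f"gfx{major}{minor:x}{step:x}"


def gfx_targets(rocminfo_text: str) -> list[str]:
    """Return the distinct gfx target names mentioned in rocminfo-style output."""
    return sorted(set(re.findall(r"gfx[0-9a-f]+", rocminfo_text)))


# Illustrative sample only; real rocminfo output contains many more fields.
sample = """
  Name:                    gfx1101
  Marketing Name:          AMD Radeon RX 7800 XT
"""

print(gfx_targets(sample))
print(override_to_gfx("11.0.0"))
```

If the detected target is already gfx1101 and the library launches kernels on it, the override is unnecessary.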

3. Download the model files

Three files are needed, and they go into three different ComfyUI subfolders. The destination folders are stated verbatim in the model card; the weight files live under the repo's split_files/ tree. File sizes are verified from the Hugging Face tree API (base 4,182,218,328 bytes ≈ 4.18 GB; encoder 1,192,135,096 bytes ≈ 1.19 GB; VAE 253,806,246 bytes ≈ 0.25 GB):

# from ComfyUI/ root
cd models/diffusion_models
wget https://huggingface.co/circlestone-labs/Anima/resolve/main/split_files/diffusion_models/anima-base-v1.0.safetensors

cd ../text_encoders
wget https://huggingface.co/circlestone-labs/Anima/resolve/main/split_files/text_encoders/qwen_3_06b_base.safetensors

cd ../vae
wget https://huggingface.co/circlestone-labs/Anima/resolve/main/split_files/vae/qwen_image_vae.safetensors

Per the model card: anima-base-v1.0.safetensors goes in ComfyUI/models/diffusion_models, qwen_3_06b_base.safetensors in ComfyUI/models/text_encoders, and qwen_image_vae.safetensors in ComfyUI/models/vae (this is the Qwen-Image VAE — you may already have it).
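A truncated download is easy to miss, so a quick size check against the byte counts quoted above can save a confusing load error later. This is a sketch under stated assumptions: `check_downloads` is a hypothetical helper, it compares sizes only (not checksums), and it expects to be pointed at the ComfyUI root.

```python
from pathlib import Path

# Expected sizes in bytes, copied from the figures quoted in this step.
EXPECTED = {
    "models/diffusion_models/anima-base-v1.0.safetensors": 4_182_218_328,
    "models/text_encoders/qwen_3_06b_base.safetensors": 1_192_135_096,
    "models/vae/qwen_image_vae.safetensors": 253_806_246,
}


def check_downloads(comfy_root, expected=EXPECTED) -> dict[str, str]:
    """Report 'ok', 'missing', or a size mismatch for each expected file."""
    report = {}
    for rel_path, size in expected.items():
        path = Path(comfy_root) / rel_path
        if not path.is_file():
            report[rel_path] = "missing"
        elif path.stat().st_size != size:
            actual = path.stat().st_size
            report[rel_path] = f"size mismatch ({actual} bytes, expected {size})"
        else:
            report[rel_path] = "ok"
    return report


if __name__ == "__main__":
    # Run from the ComfyUI root, or replace "." with its path.
    for rel_path, status in check_downloads(".").items():
        print(f"{status}: {rel_path}")
```

A size match does not prove the file is intact, but a mismatch is a reliable sign that the download should be repeated.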

4. Load the official workflow

The model is natively supported in ComfyUI. The model card ships an example workflow image (example.png) with the graph embedded — drag-and-drop it onto the ComfyUI canvas to load the workflow, or build a standard Load Diffusion Model → CLIP (Qwen3) → KSampler → VAE Decode → Save Image graph and point the loaders at the three files from step 3.
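For the curious, the embedded graph can be inspected without ComfyUI by reading the PNG's text chunks. The sketch below uses only the standard library and makes two assumptions: that the graph is stored in an uncompressed tEXt chunk, and that its keyword is `workflow`. If the file uses a compressed or international text chunk instead, this reader will not find it. The demo runs on a synthetic byte string, not on the real example.png.

```python
import json
import struct
import zlib

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def png_chunk(chunk_type: bytes, body: bytes) -> bytes:
    """Serialize one PNG chunk: length, type, body, then CRC-32 over type+body."""
    crc = zlib.crc32(chunk_type + body) & 0xFFFFFFFF
    return struct.pack(">I", len(body)) + chunk_type + body + struct.pack(">I", crc)


def read_text_chunks(data: bytes) -> dict[str, str]:
    """Collect keyword -> text pairs from the tEXt chunks of a PNG byte string."""
    if data[:8] != PNG_SIGNATURE:
        raise ValueError("not a PNG file")
    found = {}
    pos = 8
    while pos + 8 <= len(data):
        length, chunk_type = struct.unpack(">I4s", data[pos:pos + 8])
        body = data[pos + 8:pos + 8 + length]
        pos += 12 + length  # 4 length + 4 type + body + 4 CRC
        if chunk_type == b"tEXt":
            keyword, _, text = body.partition(b"\x00")
            found[keyword.decode("latin-1")] = text.decode("latin-1")
        elif chunk_type == b"IEND":
            break
    return found


# Synthetic demo: only the chunks this reader cares about, so not a viewable image.
payload = json.dumps({"nodes": []}).encode("latin-1")
demo = PNG_SIGNATURE + png_chunk(b"tEXt", b"workflow\x00" + payload) + png_chunk(b"IEND", b"")
print(json.loads(read_text_chunks(demo)["workflow"]))
```

On a real file, pass `open("example.png", "rb").read()` to `read_text_chunks` and check which keywords are present before assuming `workflow` exists.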

Running

Launch ComfyUI from the repo root with the PyTorch SDPA attention backend — this is the attention path AMD's ROCm stack uses (it replaces the CUDA-only FlashAttention/xformers paths, which don't apply on RDNA3). Per ComfyUI's cli_args.py, the flag is documented as "Use the new pytorch 2.0 cross attention function.":

python main.py --use-pytorch-cross-attention

This starts the server (default http://127.0.0.1:8188). Open it in a browser and load the workflow from step 4. Then, per the model card's documented generation settings:

Type a prompt in the positive text node. The recommended positive prefix is masterpiece, best quality, score_7, safe, and the documented tag order is [quality/meta/year/safety tags] [1girl/1boy/1other etc] [character] [series] [artist] [general tags]. Artist tags must be prefixed with @.
Set resolution between 512×512 and 1536×1536 (the model card documents this range; ~1MP — e.g. 1024×1024 — is the sweet spot).
Steps: 30–50. CFG: 4–5. Sampler: er_sde is the model author's documented default; euler_a softens line work; dpmpp_2m_sde_gpu adds variety.
Click "Queue Prompt".
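The settings above can be sketched as a small prompt builder and range check. Everything here is illustrative: the function names are hypothetical, joining tags with commas is an assumption, and the resolution range is interpreted as applying to each side independently.

```python
# Recommended positive prefix from the model card, as quoted above.
QUALITY_PREFIX = ["masterpiece", "best quality", "score_7", "safe"]


def build_prompt(subject, character="", series="", artists=(), general=()):
    """Assemble tags in the documented order; artist tags get an '@' prefix."""
    parts = list(QUALITY_PREFIX) + [subject]
    parts += [tag for tag in (character, series) if tag]
    parts += [a if a.startswith("@") else f"@{a}" for a in artists]
    parts += list(general)
    return ", ".join(parts)


def check_settings(width, height, steps, cfg):
    """Return a list of settings that fall outside the documented ranges."""
    problems = []
    if not (512 <= width <= 1536 and 512 <= height <= 1536):
        problems.append("resolution outside 512-1536 per side")
    if not 30 <= steps <= 50:
        problems.append("steps outside 30-50")
    if not 4 <= cfg <= 5:
        problems.append("CFG outside 4-5")
    return problems


# "example artist" is a placeholder, not a real tag recommendation.
print(build_prompt("1girl", artists=["example artist"], general=["smile"]))
print(check_settings(1024, 1024, 30, 4.5))
```

An empty list from `check_settings` means the values sit inside the ranges the model card documents; it says nothing about output quality.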

Generated PNGs land in ComfyUI/output/ with the full workflow embedded. At 16 GB, with the native FP16 weights occupying only ~5.6 GB, you should not need --lowvram — but if the VAE-decode stage spikes (see Troubleshooting below), the --use-split-cross-attention fallback (documented in cli_args.py as "Use the split cross attention optimization.") and the tiled-VAE workarounds keep the peak in check.

Results

Speed: Omitted. The backend reports verdict: unknown for anima × rx-7800-xt — there is no published Anima benchmark on the RX 7800 XT, and no comparable ROCm number exists in the community to anchor to. Forward-extrapolating from a different architecture, vendor, or card (including the 24 GB RX 7900 XTX, which has ~54% more memory bandwidth) would be misleading. Once a community benchmark lands it will appear at /check/anima/rx-7800-xt — please contribute yours.
VRAM usage: ~7 GB peak unquantized, per the lilting.ch technical overview (Feb 2026, updated Apr 2026), whose spec table lists VRAM as ~7GB (without quantization). This is corroborated by the on-disk weights: the split_files/ tree totals ~5.6 GB at FP16 (base 4.18 GB + Qwen3-0.6B encoder 1.19 GB + Qwen-Image VAE 0.25 GB), and runtime activations push the peak to ~7 GB. That leaves over half of the 7800 XT's 16 GB budget free — there is no quantization tradeoff to consider on this card, so run the native FP16/BF16 weights. (The one caveat is the BF16 VAE-decode footgun below, which is more acute at 16 GB.) See /check/anima/rx-7800-xt for any community-submitted measurement.
Quality notes: Anime-first. The base model is intentionally style-neutral; reach for explicit @artist tags or LoRAs for stronger stylization (the model card documents finetuning tips — don't train the LLM adapter, use a low learning rate). Text rendering is weak — the model card notes it "can generally do single words and sometimes short phrases, but lengthy text rendering won't work well."

For the full benchmark data, see /check/anima/rx-7800-xt.
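The VRAM figures above reduce to a few lines of arithmetic. The sketch below only restates numbers already quoted in this section, uses decimal gigabytes throughout as the section does, and treats the ~7 GB peak as a reported figure rather than a measurement made here.

```python
# On-disk weights in GB, as listed in this section.
WEIGHTS_GB = {"base": 4.18, "text_encoder": 1.19, "vae": 0.25}
CARD_VRAM_GB = 16  # RX 7800 XT
REPORTED_PEAK_GB = 7  # approximate unquantized peak cited above

weights_gb = sum(WEIGHTS_GB.values())
weights_share = weights_gb / CARD_VRAM_GB  # fraction of VRAM taken by weights
runtime_overhead_gb = REPORTED_PEAK_GB - weights_gb  # activations and buffers
headroom_gb = CARD_VRAM_GB - REPORTED_PEAK_GB  # VRAM left at the reported peak

print(f"weights:  {weights_gb:.2f} GB ({weights_share:.0%} of {CARD_VRAM_GB} GB)")
print(f"overhead: {runtime_overhead_gb:.2f} GB above the weights at peak")
print(f"headroom: {headroom_gb} GB ({headroom_gb / CARD_VRAM_GB:.0%} of the card)")
```

The headroom figure does not account for the VAE-decode spike covered under Troubleshooting, which is a separate, transient allocation.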

Troubleshooting

"Torch not compiled with CUDA enabled"

This means a CUDA build of PyTorch got installed instead of the ROCm build. Per the ComfyUI README, uninstall and reinstall against the ROCm wheel index:

pip uninstall torch torchvision torchaudio
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm7.2

Confirm the installed build is the ROCm one: python -c "import torch; print(torch.__version__)" should print a version with a +rocm7.2-style suffix, and torch.cuda.is_available() should return True (PyTorch's ROCm builds expose the GPU through the cuda device namespace via HIP).
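The same check can be run as a short script that prints everything in one pass. This is a sketch: `looks_like_rocm_build` is a hypothetical helper that only inspects the version string, and the script skips the live checks when PyTorch is not importable.

```python
def looks_like_rocm_build(version: str) -> bool:
    """True when a torch version string carries a +rocm local suffix."""
    return "+rocm" in version


try:
    import torch
except ImportError:
    torch = None  # PyTorch is not installed in this environment

if torch is not None:
    print("version:    ", torch.__version__)
    print("ROCm build: ", looks_like_rocm_build(torch.__version__))
    print("HIP runtime:", torch.version.hip)  # None on non-ROCm builds
    print("GPU visible:", torch.cuda.is_available())
    if torch.cuda.is_available():
        print("device:     ", torch.cuda.get_device_name(0))
else:
    print("PyTorch is not installed; install the ROCm wheel first.")
```

A CUDA build shows a +cu suffix and a HIP runtime of None, which is the signature of the error in this section.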

VAE Decode crashes, hangs, or balloons VRAM on ROCm

ComfyUI's VAE-decode stage has a model-independent ROCm footgun on RDNA3 cards: the decode step can route through a path that inflates VRAM and triggers a driver timeout / crash above 1024 px. This is tracked in ROCm Issue #4729 ("VAE decode defaults to FP32 causing driver timeout above 1024 pixel"), where a community user reports it reproduces "on SDXL and other models too" and that BF16 VAE decode at 1024×1024 tries to use ~16 GB of VRAM while FP16 uses ~2 GB and FP32 ~4 GB. That makes this footgun especially acute on a 16 GB card like the 7800 XT — a BF16 VAE-decode spike can consume your entire VRAM budget. Anima uses the Qwen-Image VAE, so if you see the KSampler stage complete and then VAE Decode hang, crash, or OOM, try these in order:

Keep resolution ≤ 1024 px on the longest side — the crash is far likelier above 1024, per Issue #4729.
Swap in the VAE Decode (Tiled) node — it decodes the latent in tiles, keeping the decode-stage peak down. This is the most reliable fix at 16 GB.
Move the VAE to the CPU with --cpu-vae — slower, but it sidesteps the ROCm VAE kernel entirely. Anima's VAE is small (0.25 GB), so the CPU penalty is modest.
Avoid --bf16-vae here. The cli_args.py flag is documented as "Run the VAE in bf16.", but community users on Issue #4729 measured BF16 VAE decode ballooning to ~16 GB on RDNA3 (vs ~2 GB at FP16) — on a 16 GB card that is exactly the wrong direction. Prefer the tiled / CPU paths above, and also avoid --fp16-vae, documented as "Run the VAE in fp16, might cause black images."

Generation feels slow: try TunableOp (expect a very slow first run)

Per the ComfyUI README "AMD ROCm Tips": "You can try setting this env variable PYTORCH_TUNABLEOP_ENABLED=1 which might speed things up at the cost of a very slow initial run." TunableOp auto-tunes GEMM kernels for your card on the first pass (slow), then caches them for faster subsequent generations:

PYTORCH_TUNABLEOP_ENABLED=1 python main.py --use-pytorch-cross-attention
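If you launch ComfyUI from a wrapper script rather than a shell, the same environment variable and flag can be assembled in Python. This is a minimal sketch: `comfy_launch` is a hypothetical helper, and it uses only the variable and flags named in this guide.

```python
import os


def comfy_launch(tunable_op=True, extra_flags=(), base_env=None):
    """Return (env, argv) for starting ComfyUI with PyTorch SDPA attention."""
    env = dict(os.environ if base_env is None else base_env)
    if tunable_op:
        # Same effect as prefixing the shell command with the variable.
        env["PYTORCH_TUNABLEOP_ENABLED"] = "1"
    argv = ["python", "main.py", "--use-pytorch-cross-attention", *extra_flags]
    return env, argv


env, argv = comfy_launch(base_env={})
print(env)
print(" ".join(argv))

# To actually start the server, run from the ComfyUI root with the real environment:
#   import subprocess
#   env, argv = comfy_launch()
#   subprocess.run(argv, env=env, check=True)
```

Passing `extra_flags=["--cpu-vae"]` adds the VAE fallback from the previous section without changing the rest of the command.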

Do not install xformers or FlashAttention

HF and ComfyUI guides written for NVIDIA frequently suggest pip install xformers or a FlashAttention wheel. On RDNA3 these are the wrong path: the ROCm xformers fork is limited, and ComfyUI already routes attention through PyTorch SDPA on this stack. Stick with the default, or force it explicitly with --use-pytorch-cross-attention.