How much VRAM does ACE-Step 1.5 XL need?

About 12 GB at the default bf16 precision, which is the minimum this recipe targets. The optimization flags can bring that down to 8 GB.

How hard is this setup?

Intermediate. Follow the steps below.

ACE-Step 1.5 XL on RX 7900 XTX: Text-to-Music Generation on ROCm

What You'll Build

A working text-to-music pipeline that turns a text prompt plus optional lyrics into a full song (vocals, instruments, up to ~4 minutes) on a single Radeon RX 7900 XTX (RDNA3, Navi 31, gfx1100) through the ROCm stack, driven by the official Gradio app. ACE-Step is a pure-PyTorch, diffusion-based music generator (a flow-matching transformer over a Deep Compression AutoEncoder latent), so it runs on AMD's ROCm/HIP backend the same way it runs on CUDA, with no custom kernels to port. On a 24 GB card the native bf16 footprint (~12 GB resident) leaves about half the VRAM free, so memory is never the constraint.

Hardware data: RX 7900 XTX (24GB VRAM) · text-to-music, lyric-aligned vocals, 19 supported languages (top 10 well-performing) per HF model card · See benchmark data

ℹ️ Not a TTS model. ACE-Step generates music — instruments and lyric-aligned vocals — from a text description. It is filed under our tts vertical because the catalogue groups all audio-output models together, but it is not a text-to-speech engine. If you want spoken speech synthesis on this GPU, see Kokoro or VoxCPM. If you want sung vocals over generated backing, you're in the right place.

⚠️ This is a ROCm recipe, not CUDA. The RX 7900 XTX runs on AMD's ROCm/HIP stack — there is no cu124/cu128 wheel here, no FlashAttention build, no xformers, and no FP8/FP4 path. RDNA3 has no FP8/FP4 hardware (its WMMA units accept FP16, BF16, INT8, INT4 only), and ACE-Step's native precision is bf16 anyway, so there is nothing to quantize down. ACE-Step is plain-PyTorch diffusion: it carries no custom CUDA kernel in its sampler or scheduler (the flow-match Euler/Heun/Pingpong schedulers are pure PyTorch), so it runs unmodified on ROCm — PyTorch's HIP backend exposes the AMD GPU through the same cuda device namespace. The attention path is PyTorch SDPA, not FlashAttention-2 and not xformers. If a guide tells you to pip install xformers or pick a cu12x wheel for this card, it's written for the wrong vendor.

Requirements

Component	Minimum	Tested
GPU	12 GB VRAM at default bf16 precision (8 GB possible with optimization flags)	RX 7900 XTX (24 GB)
RAM	16 GB system	—
Storage	~9 GB free (the repository itself is ~8.28 GB) for the 3.5B transformer + DCAE + vocoder + UMT5-base text encoder	per HF Files tree
Driver	AMD ROCm 7.2.x on Linux	—
Software	Python 3.10, PyTorch (ROCm build)	—

The model is released under the Apache 2.0 License (HF model card) and the weights are not gated — no access request or login is required to download them. The repository is ~8.28 GB total on disk (HF Files tree), split across four components downloaded automatically on first launch: the ace_step_transformer diffusion model, the music_dcae_f8c8 autoencoder, the music_vocoder, and the umt5-base text encoder.
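As a rough sanity check on these sizes, the sketch below does the back-of-envelope arithmetic. It assumes 2 bytes per bf16 parameter and uses only figures quoted in this recipe (3.5B parameters, ~8.28 GB on disk, ~12 GB resident, 24 GB card). The function and variable names are illustrative and are not part of ACE-Step.

```python
# Back-of-envelope VRAM budget from the figures quoted in this recipe.
# Every name below is illustrative; none of it is part of ACE-Step.

BYTES_PER_BF16_PARAM = 2  # bfloat16 stores one parameter in 16 bits


def bf16_weight_gb(num_params: float) -> float:
    """Size of a set of bf16 weights in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_BF16_PARAM / 1e9


def headroom_gb(card_gb: float, resident_gb: float) -> float:
    """VRAM left free once the model is resident on the card."""
    return card_gb - resident_gb


transformer_gb = bf16_weight_gb(3.5e9)  # the 3.5B-parameter transformer alone
repo_gb = 8.28                          # all four components on disk
resident_gb = 12.0                      # approximate resident footprint at bf16
card_gb = 24.0                          # RX 7900 XTX

print(f"transformer weights in bf16: ~{transformer_gb:.1f} GB")
print(f"resident beyond on-disk size: ~{resident_gb - repo_gb:.2f} GB")
print(f"free VRAM on the card: ~{headroom_gb(card_gb, resident_gb):.0f} GB")
```

The gap between the on-disk size and the resident footprint is an estimate of activations and runtime overhead, not a measured value.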

Installation

1. Clone the repo and create the conda environment

git clone https://github.com/ace-step/ACE-Step.git
cd ACE-Step
conda create -n ace_step python=3.10 -y
conda activate ace_step

2. Install PyTorch for ROCm

The RX 7900 XTX (gfx1100) is an officially ROCm-supported GPU, so it uses the stable ROCm PyTorch wheel — not the default CUDA wheel that pip install torch pulls. Install from the ROCm wheel index:

pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm7.2

ℹ️ Verify the ROCm tag before you copy it. The rocmX.Y wheel tag moves over time (6.3 → 6.4 → 7.x). Read the current stable line at the live PyTorch "Get Started" selector (choose Linux → Pip → Python → ROCm) before running. AMD also ships its own Radeon-recommended wheels at repo.radeon.com (currently ~PyTorch 2.9.1 / ROCm 7.2.1 / Ubuntu 24.04) if you prefer the vendor build.

3. Install the package

pip install -e .

This installs the acestep console-script entry point along with diffusers, transformers, accelerate, and the project's audio dependencies. ACE-Step is pure-PyTorch diffusion — there is no CUDA-extension compile step, so the install does not need nvcc and runs cleanly under ROCm. Weights for ACE-Step/ACE-Step-v1-3.5B download automatically from Hugging Face on first launch.

4. (Optional) ComfyUI custom node

If you would rather drive the model from a ComfyUI workflow on your ROCm-built ComfyUI install:

cd ComfyUI/custom_nodes
git clone https://github.com/billwuhao/ComfyUI_ACE-Step.git

Then download the weights into ComfyUI/models/TTS/ACE-Step-v1-3.5B/. Per the custom node README the folder must contain four subdirectories: ace_step_transformer, music_dcae_f8c8, music_vocoder, and umt5-base. (ComfyUI itself must already be installed against the ROCm PyTorch wheel — see our SDXL-on-7900-XTX recipe for that setup.)
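Before launching ComfyUI, it can help to confirm that the folder layout matches what the custom node expects. The following is a minimal sketch using only the standard library; the helper name missing_components is hypothetical, and the path assumes you run it from the directory that contains ComfyUI.

```python
from pathlib import Path

# The four subdirectories the custom node README asks for.
REQUIRED_SUBDIRS = (
    "ace_step_transformer",
    "music_dcae_f8c8",
    "music_vocoder",
    "umt5-base",
)


def missing_components(model_dir) -> list:
    """Return the required subdirectory names that are absent from model_dir."""
    root = Path(model_dir)
    return [name for name in REQUIRED_SUBDIRS if not (root / name).is_dir()]


if __name__ == "__main__":
    # Assumed location, relative to the current working directory.
    model_dir = Path("ComfyUI/models/TTS/ACE-Step-v1-3.5B")
    missing = missing_components(model_dir)
    if missing:
        print(f"{model_dir} is missing: {', '.join(missing)}")
    else:
        print(f"{model_dir} contains all four components")
```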

Running

Gradio app (official)

The repository's setup.py registers acestep as a console script that maps to acestep.gui:main — a Click CLI that launches a Gradio web app. There is no .text2music() one-liner; the supported entrypoint is the Gradio app (or python -m acestep.gui):

acestep --port 7865 --bf16 true

Then open http://localhost:7865. In the Text2Music tab, enter descriptive tags (style, mood, instruments), optional lyrics with structure markers like [verse] / [chorus], set the audio duration, and click Generate. The app returns a downloadable audio file. On the 24 GB 7900 XTX the default bf16 path runs comfortably without any memory flags. --bf16 defaults to true, which matches the model's native precision — there is no smaller hardware format to drop to on RDNA3.

The Click options are defined in acestep/gui.py: --checkpoint_path, --server_name, --port, --device_id, --share, --bf16, --torch_compile, --cpu_offload, --overlapped_decode. The --device_id flag selects the GPU by index (under ROCm, PyTorch enumerates the AMD card through the cuda device namespace, so --device_id 0 targets your 7900 XTX).

Memory-optimized launch (free headroom for other workloads)

acestep --torch_compile true --cpu_offload true --overlapped_decode true --port 7865

You do not need these on a 24 GB 7900 XTX for a single generation: the default bf16 path fits with ~12 GB to spare. Together, though, the three flags drop resident VRAM to the official 8 GB floor, which is useful if the card is also driving a display or sharing VRAM with other inference jobs. --cpu_offload loads only the current stage's model to the GPU; --overlapped_decode runs the DCAE and vocoder using sliding windows to speed decoding; both are documented in acestep/gui.py.
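If you launch the app from a script rather than by hand, the two launch commands above can be assembled programmatically. This is a sketch under stated assumptions: build_acestep_command and its low_vram parameter are hypothetical helper names, and the only flags and values used are the ones shown in this recipe (each passed as "true").

```python
import shlex


def build_acestep_command(port: int = 7865, device_id: int = 0,
                          low_vram: bool = False) -> list:
    """Assemble the acestep launch command as an argument list."""
    cmd = ["acestep", "--port", str(port),
           "--device_id", str(device_id), "--bf16", "true"]
    if low_vram:
        # The three memory flags from the memory-optimized launch.
        for flag in ("--torch_compile", "--cpu_offload", "--overlapped_decode"):
            cmd += [flag, "true"]
    return cmd


# Print the commands instead of running them; pass the list to
# subprocess.run(cmd, check=True) to actually start the Gradio app.
print(shlex.join(build_acestep_command()))
print(shlex.join(build_acestep_command(low_vram=True)))
```

Building an argument list rather than a single shell string avoids quoting issues if you later add a --checkpoint_path that contains spaces.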

⚠️ --torch_compile on ROCm: verify on first run. torch.compile works on RDNA3 (it lowers to Triton-ROCm), but exotic fused ops occasionally hit kernel-compile failures on gfx1100 and fall back to eager. If a compiled run errors, drop --torch_compile — the model is fully functional in eager mode. Do not install triton-windows (that note in the upstream README is Windows/CUDA-specific; you are on Linux/ROCm).

Library / programmatic use

To integrate ACE-Step into your own Python project, follow the inference code in the source tree — the ACEStepPipeline class in acestep/pipeline_ace_step.py is the real call surface. The repository — not the auto-generated HF Hub text-to-audio snippet — is the source of truth for the call signature; see Troubleshooting below.
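Because the installed source is the authority on the call signature, one cautious way to start is to print the signatures from your own checkout instead of trusting any snippet. The sketch below uses only the standard library's inspect module; describe_class is a hypothetical helper, and the sketch deliberately makes no assumption about which arguments ACEStepPipeline accepts.

```python
import importlib
import inspect


def describe_class(cls) -> list:
    """Return the constructor signature and, if defined, the __call__ signature."""
    lines = [f"{cls.__name__}{inspect.signature(cls)}"]
    # Walk the MRO so an inherited __call__ is found too.
    for klass in cls.__mro__:
        call = vars(klass).get("__call__")
        if call is not None:
            lines.append(f"{cls.__name__}.__call__{inspect.signature(call)}")
            break
    return lines


if __name__ == "__main__":
    try:
        module = importlib.import_module("acestep.pipeline_ace_step")
    except ImportError as exc:
        print(f"ACE-Step is not importable in this environment: {exc}")
    else:
        for line in describe_class(module.ACEStepPipeline):
            print(line)
```

Run it inside the ace_step conda environment; the printed parameter names are what your installed version actually accepts.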

Results

Speed: No RX 7900 XTX benchmark is published yet (the live /check/acestep-1-5-xl/rx-7900-xtx verdict is unknown). The official ACE-Step model card publishes a per-device throughput table covering NVIDIA (A100, RTX 4090, RTX 3090) and Apple (M2 Max) — but no AMD / RDNA3 figure, and these are different architectures, so we do not extrapolate a 7900 XTX number from them. The Speed figure is therefore omitted. If you measure generation time on a 7900 XTX, please contribute it via the submission form and it will appear at /check/acestep-1-5-xl/rx-7900-xtx.
VRAM usage: At default bf16 precision the resident footprint is ~12 GB (the four components total ~8.28 GB on disk, plus activations) — trivially within the 24 GB 7900 XTX envelope, leaving ample headroom. The official minimum drops to 8 GB with cpu_offload + torch_compile + overlapped_decode enabled, but you don't need that on this card. See /check/acestep-1-5-xl/rx-7900-xtx for any community-submitted measurement.
Quality notes: Output is strongest in the 10 best-performing languages (19 are supported in total per the GitHub README); rare instruments may render imperfectly; outputs beyond ~5 minutes can lose structural coherence; the model is highly seed-sensitive ("gacha-style" results, per the HF card's Limitations section).

For the full benchmark data, see /check/acestep-1-5-xl/rx-7900-xtx.

Troubleshooting

"Torch not compiled with CUDA enabled"

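This means the installed PyTorch is not the ROCm build. The message itself typically comes from a wheel with no GPU backend compiled in (a CPU-only build); the default pip install torch on Linux pulls the CUDA wheel, which is equally wrong for this card and usually fails with a "Found no NVIDIA driver" error instead. In either case, uninstall and reinstall against the ROCm wheel index: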

pip uninstall torch torchvision torchaudio
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm7.2

Confirm the installed build is the ROCm one: python -c "import torch; print(torch.__version__)" should print a version with a +rocm7.2-style suffix, and python -c "import torch; print(torch.cuda.is_available())" should print True (ROCm exposes the GPU through the cuda device namespace under HIP, so ACE-Step's torch.device("cuda:0") calls resolve to your 7900 XTX with no code changes).
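The two one-liners above can be combined into a single check. This is a minimal sketch: rocm_tag and summarize_torch_build are hypothetical helper names, and the version-suffix parsing assumes the +rocmX.Y convention used by the wheels on the PyTorch index. torch.version.hip is the attribute PyTorch sets on HIP builds and leaves as None otherwise.

```python
import re


def rocm_tag(version_string: str):
    """Return the ROCm tag (e.g. '7.2') from a PyTorch version string, else None."""
    match = re.search(r"\+rocm(\d+(?:\.\d+)*)", version_string)
    return match.group(1) if match else None


def summarize_torch_build() -> str:
    """Describe the installed PyTorch build in one line."""
    try:
        import torch
    except ImportError:
        return "PyTorch is not installed in this environment"
    tag = rocm_tag(torch.__version__)
    hip = getattr(torch.version, "hip", None)  # None on non-ROCm builds
    if tag is None and hip is None:
        return f"{torch.__version__}: not a ROCm build, reinstall from the ROCm index"
    if not torch.cuda.is_available():
        return f"{torch.__version__}: ROCm build, but no GPU is visible to PyTorch"
    return f"{torch.__version__}: ROCm build, device 0 is {torch.cuda.get_device_name(0)}"


if __name__ == "__main__":
    print(summarize_torch_build())
```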

HF Quick Start snippet doesn't match the real API

ACE-Step is a music-generation model, but the Hugging Face Hub auto-generates a generic snippet from the text-to-audio pipeline tag — it does not reflect a runnable music-generation call, and there is no .text2music() method. The authoritative inference entry point is the acestep console script (which the repository's setup.py registers as acestep.gui:main, a Gradio app), launchable as acestep --port 7865 or python -m acestep.gui. For programmatic use, drive ACEStepPipeline from acestep/pipeline_ace_step.py rather than copy-pasting the Hub snippet.

`acestep` command not found after `pip install -e .`

The -e (editable) install registers the acestep entry point in your conda env. If the shell can't find it, you're probably in a different env — re-activate with conda activate ace_step and verify with which acestep. The GitHub repo README documents the entry point and command-line arguments.
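The same check can be scripted. The sketch below is an assumption-laden convenience, not part of ACE-Step: script_in_env is a hypothetical helper, and it assumes the Linux conda layout in which console scripts live in the environment's bin directory.

```python
import shutil
import sys
from pathlib import Path


def script_in_env(script_path, env_prefix) -> bool:
    """True if script_path sits directly in env_prefix/bin (Linux conda layout)."""
    if script_path is None:
        return False
    return Path(script_path).parent == Path(env_prefix) / "bin"


if __name__ == "__main__":
    found = shutil.which("acestep")  # same lookup the shell performs
    if found is None:
        print("acestep is not on PATH; activate the env with: conda activate ace_step")
    elif script_in_env(found, sys.prefix):
        print(f"acestep found in the active environment: {found}")
    else:
        print(f"acestep found at {found}, outside the active environment {sys.prefix}")
```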

`torch.compile` errors on ROCm

If you passed --torch_compile true and a generation crashes with a Triton or Inductor kernel-compilation error, this is a known RDNA3 rough edge — torch.compile lowers to Triton-ROCm, which occasionally fails on exotic fused ops on gfx1100. Drop the flag and run in eager mode (acestep --port 7865); the model is fully functional without compilation. The compile flag is an optional speedup, not a requirement.

Generations sound unstructured past ~5 minutes

This is a documented limitation, not a bug. The model card calls it out under "Limitations": the model loses long-range structural coherence beyond ~5 minutes. Either keep the audio duration inside that window, or use the repaint/extend operations on shorter segments and stitch them.

If you hit something not covered here, please report via the submission form so we can add it to the catalogue.