What You'll Build
A working text-to-music pipeline that turns a text prompt plus optional lyrics into a full song (vocals, instruments, up to ~4 minutes) on a single Radeon RX 7900 XTX (RDNA3, Navi 31, gfx1100) through the ROCm stack, driven by the official Gradio app. ACE-Step is a pure-PyTorch, diffusion-based music generator (a flow-matching transformer over a Deep Compression AutoEncoder latent), so it runs on AMD's ROCm/HIP backend the same way it runs on CUDA, with no custom kernels to port. On this 24 GB card the native bf16 model (~12 GB resident) fits with room to spare, so no memory-saving flags are required.
Hardware data: RX 7900 XTX (24GB VRAM) · text-to-music, lyric-aligned vocals, 19 supported languages (top 10 well-performing) per HF model card · See benchmark data
ℹ️ Not a TTS model. ACE-Step generates music (instruments and lyric-aligned vocals) from a text description. It is filed under our tts vertical because the catalogue groups all audio-output models together, but it is not a text-to-speech engine. If you want spoken speech synthesis on this GPU, see Kokoro or VoxCPM. If you want sung vocals over generated backing, you're in the right place.
⚠️ This is a ROCm recipe, not CUDA. The RX 7900 XTX runs on AMD's ROCm/HIP stack: there is no cu124/cu128 wheel here, no FlashAttention build, no xformers, and no FP8/FP4 path. RDNA3 has no FP8/FP4 hardware (its WMMA units accept FP16, BF16, INT8, INT4 only), and ACE-Step's native precision is bf16 anyway, so there is no lower-precision format to quantize down to. ACE-Step is plain-PyTorch diffusion: it carries no custom CUDA kernel in its sampler or scheduler (the flow-match Euler/Heun/Pingpong schedulers are pure PyTorch), so it runs unmodified on ROCm, where PyTorch's HIP backend exposes the AMD GPU through the same cuda device namespace. The attention path is PyTorch SDPA, not FlashAttention-2 and not xformers. If a guide tells you to pip install xformers or pick a cu12x wheel for this card, it's written for the wrong vendor.
Requirements
| Component | Minimum | Tested |
|---|---|---|
| GPU | 12 GB VRAM at default bf16 precision (8 GB possible with optimization flags) | RX 7900 XTX (24 GB) |
| RAM | 16 GB system | — |
| Storage | ~9 GB for the 3.5B transformer + DCAE + vocoder + UMT5-base text encoder | per HF Files tree |
| Driver | AMD ROCm 7.2.x on Linux | — |
| Software | Python 3.10, PyTorch (ROCm build) | — |
The model is released under the Apache 2.0 License (HF model card) and the weights are not gated — no access request or login is required to download them. The repository is ~8.28 GB total on disk (HF Files tree), split across four components downloaded automatically on first launch: the ace_step_transformer diffusion model, the music_dcae_f8c8 autoencoder, the music_vocoder, and the umt5-base text encoder.
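Before the first launch it can help to confirm the download target has room. The sketch below is a minimal check and not part of ACE-Step. It assumes the weights land somewhere under your home directory (the real location depends on your Hugging Face cache settings and on --checkpoint_path), and it uses the ~9 GB figure from the table above. The names has_room and check_path are illustrative.

```python
import shutil
from pathlib import Path

REQUIRED_GB = 9.0  # storage figure from the Requirements table


def has_room(free_bytes: int, required_gb: float = REQUIRED_GB) -> bool:
    """Return True if free_bytes covers required_gb (decimal gigabytes)."""
    return free_bytes >= required_gb * 1e9


def check_path(path: Path) -> bool:
    """Check free space on the filesystem that holds `path`."""
    usage = shutil.disk_usage(path)
    return has_room(usage.free)


if __name__ == "__main__":
    target = Path.home()  # assumption: the download cache lives under home
    print(f"{target}: enough space = {check_path(target)}")
```

If you pass a custom --checkpoint_path, point check_path at that directory instead.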
Installation
1. Clone the repo and create the conda environment
git clone https://github.com/ace-step/ACE-Step.git
cd ACE-Step
conda create -n ace_step python=3.10 -y
conda activate ace_step
2. Install PyTorch for ROCm
The RX 7900 XTX (gfx1100) is an officially ROCm-supported GPU, so it uses the stable ROCm PyTorch wheel — not the default CUDA wheel that pip install torch pulls. Install from the ROCm wheel index:
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm7.2
ℹ️ Verify the ROCm tag before you copy it. The rocmX.Y wheel tag moves over time (6.3 → 6.4 → 7.x). Read the current stable line at the live PyTorch "Get Started" selector (choose Linux → Pip → Python → ROCm) before running. AMD also ships its own Radeon-recommended wheels at repo.radeon.com (currently ~PyTorch 2.9.1 / ROCm 7.2.1 / Ubuntu 24.04) if you prefer the vendor build.
3. Install the package
pip install -e .
This installs the acestep console-script entry point along with diffusers, transformers, accelerate, and the project's audio dependencies. ACE-Step is pure-PyTorch diffusion — there is no CUDA-extension compile step, so the install does not need nvcc and runs cleanly under ROCm. Weights for ACE-Step/ACE-Step-v1-3.5B download automatically from Hugging Face on first launch.
4. (Optional) ComfyUI custom node
If you would rather drive the model from a ComfyUI workflow on your ROCm-built ComfyUI install:
cd ComfyUI/custom_nodes
git clone https://github.com/billwuhao/ComfyUI_ACE-Step.git
Then download the weights into ComfyUI/models/TTS/ACE-Step-v1-3.5B/. Per the custom node README the folder must contain four subdirectories: ace_step_transformer, music_dcae_f8c8, music_vocoder, and umt5-base. (ComfyUI itself must already be installed against the ROCm PyTorch wheel — see our SDXL-on-7900-XTX recipe for that setup.)
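A quick way to confirm the folder layout before starting ComfyUI is to check for the four subdirectories named above. This is a minimal sketch, assuming you run it from the directory that contains your ComfyUI checkout; missing_components is an illustrative helper, not part of the custom node.

```python
from pathlib import Path

# Subdirectory names required by the custom node README, per the text above.
EXPECTED = ("ace_step_transformer", "music_dcae_f8c8", "music_vocoder", "umt5-base")


def missing_components(model_dir: Path) -> list[str]:
    """Return the expected subdirectories that are absent from model_dir."""
    return [name for name in EXPECTED if not (model_dir / name).is_dir()]


if __name__ == "__main__":
    # Path from the text above, relative to the current working directory.
    model_dir = Path("ComfyUI/models/TTS/ACE-Step-v1-3.5B")
    missing = missing_components(model_dir)
    print("all components present" if not missing else f"missing: {missing}")
```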
Running
Gradio app (official)
The repository's setup.py registers acestep as a console script that maps to acestep.gui:main — a Click CLI that launches a Gradio web app. There is no .text2music() one-liner; the supported entrypoint is the Gradio app (or python -m acestep.gui):
acestep --port 7865 --bf16 true
Then open http://localhost:7865. In the Text2Music tab, enter descriptive tags (style, mood, instruments), optional lyrics with structure markers like [verse] / [chorus], set the audio duration, and click Generate. The app returns a downloadable audio file. On the 24 GB 7900 XTX the default bf16 path runs comfortably without any memory flags. --bf16 defaults to true, which matches the model's native precision — there is no smaller hardware format to drop to on RDNA3.
The Click options are defined in acestep/gui.py: --checkpoint_path, --server_name, --port, --device_id, --share, --bf16, --torch_compile, --cpu_offload, --overlapped_decode. The --device_id flag selects the GPU by index (under ROCm, PyTorch enumerates the AMD card through the cuda device namespace, so --device_id 0 targets your 7900 XTX).
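If you launch the app from a script rather than by hand, the command can be assembled from the options listed above. This is a sketch under a few assumptions: build_command is a hypothetical helper, boolean options are passed as the strings true and false (the true form appears in the commands in this recipe; false is assumed to be accepted the same way), and only flags named in this section are used.

```python
import shlex


def build_command(port: int = 7865, bf16: bool = True, device_id: int = 0,
                  extra_flags: tuple[str, ...] = ()) -> list[str]:
    """Assemble an argv list for the acestep console script."""
    cmd = ["acestep", "--port", str(port),
           "--bf16", "true" if bf16 else "false",
           "--device_id", str(device_id)]
    for flag in extra_flags:
        # Each extra flag is a boolean option switched on, e.g. cpu_offload.
        cmd += [f"--{flag}", "true"]
    return cmd


if __name__ == "__main__":
    cmd = build_command()
    print(shlex.join(cmd))
    # To actually launch the app: subprocess.run(cmd, check=True)
```

Building an argv list rather than a shell string avoids quoting problems if you later add a --checkpoint_path containing spaces.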
Memory-optimized launch (free headroom for other workloads)
acestep --torch_compile true --cpu_offload true --overlapped_decode true --port 7865
You do not need these on a 24 GB 7900 XTX for a single generation — the default bf16 path fits with ~12 GB to spare. But these three flags together drop resident VRAM to the official 8 GB floor, useful if the card is also driving a display or sharing with other inference. --cpu_offload loads only the current stage's model to the GPU; --overlapped_decode runs the DCAE and vocoder using sliding windows to speed decoding; both are documented in acestep/gui.py.
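Whether the memory flags are worth enabling depends on how much VRAM is actually free when you launch. The sketch below compares free VRAM with the ~12 GB footprint quoted above and suggests the three flags only when the card is short. It is an illustration, not ACE-Step code: flags_for and free_vram_bytes are hypothetical helpers, and the threshold is the approximate figure from this recipe rather than a measured one.

```python
FOOTPRINT_GB = 12.0  # approximate resident bf16 footprint quoted above
MEMORY_FLAGS = ("torch_compile", "cpu_offload", "overlapped_decode")


def flags_for(free_bytes: int, footprint_gb: float = FOOTPRINT_GB) -> tuple[str, ...]:
    """Return the memory flags to enable when free VRAM is below the footprint."""
    return MEMORY_FLAGS if free_bytes < footprint_gb * 1024**3 else ()


def free_vram_bytes(device: int = 0):
    """Free VRAM in bytes, or None if PyTorch or a GPU is unavailable."""
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    free, _total = torch.cuda.mem_get_info(device)
    return free


if __name__ == "__main__":
    free = free_vram_bytes()
    if free is None:
        print("no GPU visible to PyTorch")
    else:
        print(f"free: {free / 1024**3:.1f} GiB, suggested flags: {flags_for(free)}")
```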
⚠️ --torch_compile on ROCm: verify on first run. torch.compile works on RDNA3 (it lowers to Triton-ROCm), but exotic fused ops occasionally hit kernel-compile failures on gfx1100 and fall back to eager. If a compiled run errors, drop --torch_compile; the model is fully functional in eager mode. Do not install triton-windows (that note in the upstream README is Windows/CUDA-specific; you are on Linux/ROCm).
Library / programmatic use
To integrate ACE-Step into your own Python project, follow the inference code in the source tree — the ACEStepPipeline class in acestep/pipeline_ace_step.py is the real call surface. The repository — not the auto-generated HF Hub text-to-audio snippet — is the source of truth for the call signature; see Troubleshooting below.
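Because this recipe does not reproduce the pipeline's parameters, the safest first step is to read them from the installed package. The sketch below prints the constructor signature and the signatures of the functions defined on ACEStepPipeline using the standard inspect module. Only the module path and class name come from the text above; describe is an illustrative helper, and nothing here assumes any particular parameter exists.

```python
import importlib
import inspect


def describe(obj) -> str:
    """Return 'name(signature)' for a callable, using its real signature."""
    return f"{obj.__name__}{inspect.signature(obj)}"


if __name__ == "__main__":
    try:
        module = importlib.import_module("acestep.pipeline_ace_step")
    except ImportError:
        print("acestep is not installed in this environment")
    else:
        pipeline_cls = module.ACEStepPipeline
        print(describe(pipeline_cls))  # constructor parameters
        # List functions defined on the class, skipping private helpers.
        for name, member in inspect.getmembers(pipeline_cls, inspect.isfunction):
            if name == "__call__" or not name.startswith("_"):
                print(describe(member))
```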
Results
- Speed: No RX 7900 XTX benchmark is published yet (the live /check/acestep-1-5-xl/rx-7900-xtx verdict is unknown). The official ACE-Step model card publishes a per-device throughput table covering NVIDIA (A100, RTX 4090, RTX 3090) and Apple (M2 Max) but no AMD / RDNA3 figure. Those are different architectures, so we do not extrapolate a 7900 XTX number from them, and the Speed figure is omitted. If you measure generation time on a 7900 XTX, please contribute it via the submission form and it will appear at /check/acestep-1-5-xl/rx-7900-xtx.
- VRAM usage: At default bf16 precision the resident footprint is ~12 GB (the four components total ~8.28 GB on disk, plus activations), well within the 24 GB 7900 XTX envelope. The official minimum drops to 8 GB with cpu_offload + torch_compile + overlapped_decode enabled, but you don't need that on this card. See /check/acestep-1-5-xl/rx-7900-xtx for any community-submitted measurement.
- Quality notes: Output is strongest in the 10 best-performing of the 19 supported languages (per the GitHub README); rare instruments may render imperfectly; outputs beyond ~5 minutes can lose structural coherence; the model is highly seed-sensitive ("gacha-style" results, per the HF card's Limitations section).
For the full benchmark data, see /check/acestep-1-5-xl/rx-7900-xtx.
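If you plan to submit a measurement, a consistent way to report speed is wall-clock time plus a real-time factor, defined here as seconds of audio produced per second of wall-clock time. The sketch below is a generic timing helper and an assumption about how you might measure, not the catalogue's required method; timed and real_time_factor are illustrative names, and the sleep call stands in for your own generation call.

```python
import time


def real_time_factor(audio_seconds: float, wall_seconds: float) -> float:
    """Seconds of audio produced per second of wall-clock time."""
    if wall_seconds <= 0:
        raise ValueError("wall_seconds must be positive")
    return audio_seconds / wall_seconds


def timed(fn, *args, **kwargs):
    """Call fn and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


if __name__ == "__main__":
    # Stand-in workload; replace with your own generation call.
    _, elapsed = timed(time.sleep, 0.1)
    print(f"elapsed {elapsed:.2f} s")
    print(f"factor for 60 s of audio: {real_time_factor(60, elapsed):.0f}x")
```

Because the model is seed-sensitive and the first run includes weight loading, time several generations and discard the first.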
Troubleshooting
"Torch not compiled with CUDA enabled"
This means a CUDA build of PyTorch got installed instead of the ROCm build (the default pip install torch pulls CUDA). Uninstall and reinstall against the ROCm wheel index:
pip uninstall torch torchvision torchaudio
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm7.2
Confirm the installed build is the ROCm one: python -c "import torch; print(torch.__version__)" should print a version with a +rocm7.2-style suffix, and python -c "import torch; print(torch.cuda.is_available())" should print True (the HIP backend presents the AMD GPU through the cuda device namespace, so ACE-Step's torch.device("cuda:0") calls resolve to your 7900 XTX with no code changes).
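The same checks can be collected into one script that also reports which kind of build is installed. This is a minimal sketch: classify_build is an illustrative helper, and it relies on torch.version.hip being set only on ROCm builds and on CUDA wheels carrying a +cu suffix in their version string.

```python
def classify_build(version: str, hip_version) -> str:
    """Classify a PyTorch build from torch.__version__ and torch.version.hip."""
    if hip_version:          # set only on ROCm/HIP builds
        return "rocm"
    if "+cu" in version:     # e.g. a version string ending in +cu128
        return "cuda"
    return "cpu-or-unknown"


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        print("PyTorch is not installed")
    else:
        kind = classify_build(torch.__version__, torch.version.hip)
        print(f"build: {kind} ({torch.__version__})")
        if torch.cuda.is_available():
            # Under ROCm the AMD GPU is reported through the cuda namespace.
            print("device 0:", torch.cuda.get_device_name(0))
        else:
            print("no GPU visible; check the driver and the wheel index used")
```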
HF Quick Start snippet doesn't match the real API
ACE-Step is a music-generation model, but the Hugging Face Hub auto-generates a generic snippet from the text-to-audio pipeline tag — it does not reflect a runnable music-generation call, and there is no .text2music() method. The authoritative inference entry point is the acestep console script (which the repository's setup.py registers as acestep.gui:main, a Gradio app), launchable as acestep --port 7865 or python -m acestep.gui. For programmatic use, drive ACEStepPipeline from acestep/pipeline_ace_step.py rather than copy-pasting the Hub snippet.
acestep command not found after pip install -e .
The -e (editable) install registers the acestep entry point in your conda env. If the shell can't find it, you're probably in a different env — re-activate with conda activate ace_step and verify with which acestep. The GitHub repo README documents the entry point and command-line arguments.
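The check can also be done from Python, which makes it clear whether the acestep script on your PATH belongs to the interpreter you are running. This is a sketch with illustrative helper names (locate, in_same_env); it assumes a conda-style layout where console scripts live under the environment prefix.

```python
import os
import shutil
import sys


def locate(command: str = "acestep"):
    """Return the resolved path of `command` on PATH, or None."""
    return shutil.which(command)


def in_same_env(command_path: str, prefix: str = sys.prefix) -> bool:
    """True if command_path lives under the given environment prefix."""
    root = os.path.realpath(prefix) + os.sep
    return os.path.realpath(command_path).startswith(root)


if __name__ == "__main__":
    path = locate()
    print("conda env:", os.environ.get("CONDA_DEFAULT_ENV", "<none>"))
    if path is None:
        print("acestep is not on PATH; activate the env you installed into")
    else:
        print(f"acestep -> {path} (same env as this Python: {in_same_env(path)})")
```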
torch.compile errors on ROCm
If you passed --torch_compile true and a generation crashes with a Triton or Inductor kernel-compilation error, this is a known RDNA3 rough edge — torch.compile lowers to Triton-ROCm, which occasionally fails on exotic fused ops on gfx1100. Drop the flag and run in eager mode (acestep --port 7865); the model is fully functional without compilation. The compile flag is an optional speedup, not a requirement.
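In your own programmatic code the same fallback can be automated: try the compiled callable and rerun in eager mode if it raises. The helper below is a generic sketch, not something the Gradio app does for you. call_with_eager_fallback is a hypothetical name, and catching every exception is a deliberate simplification that would also hide unrelated errors such as out-of-memory failures.

```python
def call_with_eager_fallback(compiled_fn, eager_fn, *args, **kwargs):
    """Try the compiled callable first; on any exception, rerun eagerly.

    Returns (result, mode) where mode is "compiled" or "eager".
    """
    try:
        return compiled_fn(*args, **kwargs), "compiled"
    except Exception as exc:  # compile errors surface on the first call
        print(f"compiled path failed ({type(exc).__name__}); using eager mode")
        return eager_fn(*args, **kwargs), "eager"


# With PyTorch installed, the pairing would look like:
#   compiled = torch.compile(model)
#   out, mode = call_with_eager_fallback(compiled, model, inputs)

if __name__ == "__main__":
    def broken(x):
        raise RuntimeError("simulated kernel-compile failure")

    print(call_with_eager_fallback(broken, lambda x: x + 1, 1))
```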
Generations sound unstructured past ~5 minutes
This is a documented limitation, not a bug. The model card calls it out under "Limitations": the model loses long-range structural coherence beyond ~5 minutes. Either keep the requested audio duration inside that window, or use the repaint/extend operations on shorter segments and stitch them.
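Planning the shorter segments is simple arithmetic. The sketch below splits a target duration into consecutive spans, using a 240-second cap as an assumed default that stays inside the ~5-minute window. plan_segments is an illustrative helper: it only computes time ranges and does not call the repaint/extend operations or stitch audio.

```python
def plan_segments(total_seconds: float, max_segment: float = 240.0) -> list[tuple[float, float]]:
    """Split [0, total_seconds] into consecutive spans no longer than max_segment."""
    if total_seconds <= 0 or max_segment <= 0:
        raise ValueError("durations must be positive")
    total = float(total_seconds)
    spans, start = [], 0.0
    while start < total:
        end = min(start + max_segment, total)
        spans.append((start, end))
        start = end
    return spans


if __name__ == "__main__":
    for start, end in plan_segments(600):
        print(f"{start:6.1f} s -> {end:6.1f} s")
```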
If you hit something not covered here, please report via the submission form so we can add it to the catalogue.