ACE-Step 1.5 XL on RX 7800 XT: Text-to-Music Generation on ROCm

What You'll Build

A working text-to-music pipeline that turns a text prompt plus optional lyrics into a full song (vocals, instruments, up to ~4 minutes) on a single Radeon RX 7800 XT (RDNA3, Navi 32, gfx1101) through the ROCm stack, driven by the official Gradio app. ACE-Step is a pure-PyTorch, diffusion-based music generator (a flow-matching transformer over a Deep Compression AutoEncoder latent), so it runs on AMD's ROCm/HIP backend the same way it runs on CUDA, with no custom kernels to port. The native bf16 weights (~12 GB resident) fit in the card's 16 GB, but with limited headroom: this is the tightest of the RDNA3 music-gen fits, so plan around the ~12 GB footprint rather than assuming spare memory (see Results for the headroom caveat and the optional 8 GB path).

Hardware data: RX 7800 XT (16GB VRAM) · text-to-music, lyric-aligned vocals, 19 supported languages (top 10 well-performing) per HF model card · See benchmark data

ℹ️ Not a TTS model. ACE-Step generates music — instruments and lyric-aligned vocals — from a text description. It is filed under our tts vertical because the catalogue groups all audio-output models together, but it is not a text-to-speech engine. If you want spoken speech synthesis on this GPU, see Kokoro or VoxCPM. If you want sung vocals over generated backing, you're in the right place.

⚠️ This is a ROCm recipe, not CUDA. The RX 7800 XT runs on AMD's ROCm/HIP stack — there is no cu124/cu128 wheel here, no FlashAttention build, no xformers, and no FP8/FP4 path. RDNA3 has no FP8/FP4 hardware (its WMMA units accept FP16, BF16, INT8, INT4 only), and ACE-Step's native precision is bf16 anyway, so there is nothing to quantize down. ACE-Step is plain-PyTorch diffusion: it carries no custom CUDA kernel in its sampler or scheduler (the flow-match Euler/Heun/Pingpong schedulers are pure PyTorch), so it runs unmodified on ROCm — PyTorch's HIP backend exposes the AMD GPU through the same cuda device namespace. The attention path is PyTorch SDPA, not FlashAttention-2 and not xformers. If a guide tells you to pip install xformers or pick a cu12x wheel for this card, it's written for the wrong vendor.

Requirements

Component	Minimum	Tested
GPU	12 GB VRAM at default bf16 precision (8 GB possible with optimization flags)	RX 7800 XT (16 GB)
RAM	16 GB system	—
Storage	~9 GB for the 3.5B transformer + DCAE + vocoder + UMT5-base text encoder	per HF Files tree
Driver	AMD ROCm 7.2.x on Linux	—
Software	Python 3.10, PyTorch (ROCm build)	—

The model is released under the Apache 2.0 License (HF model card) and the weights are not gated — no access request or login is required to download them. The repository is ~8.28 GB total on disk (HF Files tree), split across four components downloaded automatically on first launch: the ace_step_transformer diffusion model (~6.61 GB), the music_dcae_f8c8 autoencoder (~0.31 GB), the music_vocoder (~0.21 GB), and the umt5-base text encoder (~1.14 GB).
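As a quick sanity check on the download size, the per-component figures quoted above can be summed and compared against free disk space. This is a minimal sketch: the sizes are the ones stated in this recipe, while the function names and the 1 GB safety margin are illustrative assumptions, not part of ACE-Step.

```python
# Component sizes in GB, copied from the paragraph above (HF Files tree).
COMPONENT_GB = {
    "ace_step_transformer": 6.61,
    "music_dcae_f8c8": 0.31,
    "music_vocoder": 0.21,
    "umt5-base": 1.14,
}


def total_download_gb(components=COMPONENT_GB):
    """Sum the per-component sizes, rounded to two decimals."""
    return round(sum(components.values()), 2)


def fits_on_disk(free_gb, components=COMPONENT_GB, margin_gb=1.0):
    """True if free space covers the download plus a safety margin.

    margin_gb is a hypothetical allowance for caches and temp files.
    """
    return free_gb >= total_download_gb(components) + margin_gb


print(total_download_gb())
```

The rounded components sum to 8.27 GB, within rounding of the ~8.28 GB repository total. To feed in a real free-space figure, shutil.disk_usage(path).free / 1e9 is one option.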

Installation

1. Clone the repo and create the conda environment

git clone https://github.com/ace-step/ACE-Step.git
cd ACE-Step
conda create -n ace_step python=3.10 -y
conda activate ace_step

2. Install PyTorch for ROCm

The RX 7800 XT (gfx1101) is an officially ROCm-supported GPU, so it uses the stable ROCm PyTorch wheel — not the default CUDA wheel that pip install torch pulls. Install from the ROCm wheel index:

pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm7.2

ℹ️ Verify the ROCm tag before you copy it. The rocmX.Y wheel tag moves over time (6.3 → 6.4 → 7.x). Read the current stable line at the live PyTorch "Get Started" selector (choose Linux → Pip → Python → ROCm) before running. AMD also ships its own Radeon-recommended wheels at repo.radeon.com (currently ~PyTorch 2.9.1 / ROCm 7.2.1 / Ubuntu 24.04) if you prefer the vendor build.
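Because the tag changes between releases, it can help to build the install command from a single version string instead of hand-editing the URL. The sketch below assumes only the index layout already used above (download.pytorch.org/whl/rocmX.Y); the helper names are hypothetical, and it does not confirm that a given tag is actually published, so the selector check still applies.

```python
import re

PYTORCH_WHEEL_ROOT = "https://download.pytorch.org/whl"


def rocm_index_url(rocm_version):
    """Build the pip --index-url for a ROCm release such as "7.2"."""
    # Accept X.Y or X.Y.Z; reject anything else (for example a cu12x tag).
    if not re.fullmatch(r"\d+\.\d+(\.\d+)?", rocm_version):
        raise ValueError(f"expected a version like '7.2', got {rocm_version!r}")
    return f"{PYTORCH_WHEEL_ROOT}/rocm{rocm_version}"


def pip_install_command(rocm_version):
    """Return the pip argv that installs the ROCm build of PyTorch."""
    return [
        "pip", "install", "torch", "torchvision", "torchaudio",
        "--index-url", rocm_index_url(rocm_version),
    ]


print(" ".join(pip_install_command("7.2")))
```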

3. Install the package

pip install -e .

This installs the acestep console-script entry point along with diffusers, transformers, accelerate, and the project's audio dependencies. ACE-Step is pure-PyTorch diffusion — there is no CUDA-extension compile step, so the install does not need nvcc and runs cleanly under ROCm. Weights for ACE-Step/ACE-Step-v1-3.5B download automatically from Hugging Face on first launch.

4. (Optional) ComfyUI custom node

If you would rather drive the model from a ComfyUI workflow on your ROCm-built ComfyUI install:

cd ComfyUI/custom_nodes
git clone https://github.com/billwuhao/ComfyUI_ACE-Step.git

Then download the weights into ComfyUI/models/TTS/ACE-Step-v1-3.5B/. Per the custom node README, the folder must contain four subdirectories: ace_step_transformer, music_dcae_f8c8, music_vocoder, and umt5-base. (ComfyUI itself must already be installed against the ROCm PyTorch wheel; see our SDXL-on-7800-XT recipe for that setup.)
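A short check can confirm the folder layout before you start ComfyUI. This is a sketch under the assumption that the four subdirectory names listed above are all the node looks for; the function name is hypothetical.

```python
from pathlib import Path

# Subdirectory names taken from the custom node README, as quoted above.
REQUIRED_SUBDIRS = (
    "ace_step_transformer",
    "music_dcae_f8c8",
    "music_vocoder",
    "umt5-base",
)


def missing_components(model_dir):
    """Return the required subdirectories that are absent from model_dir."""
    root = Path(model_dir)
    return [name for name in REQUIRED_SUBDIRS if not (root / name).is_dir()]


if __name__ == "__main__":
    missing = missing_components("ComfyUI/models/TTS/ACE-Step-v1-3.5B")
    print("missing:", missing if missing else "none")
```

An empty list means all four components are in place; anything else names what still has to be downloaded.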

Running

Gradio app (official)

The repository's setup.py registers acestep as a console script that maps to acestep.gui:main — a Click CLI that launches a Gradio web app. There is no .text2music() one-liner; the supported entrypoint is the Gradio app (or python -m acestep.gui):

acestep --port 7865 --bf16 true

Then open http://localhost:7865. In the Text2Music tab, enter descriptive tags (style, mood, instruments), optional lyrics with structure markers like [verse] / [chorus], set the audio duration, and click Generate. The app returns a downloadable audio file. On the 16 GB 7800 XT the default bf16 path fits at ~12 GB resident: runnable, but tight enough that you should close other GPU consumers (a display compositor, a second model) before generating. --bf16 defaults to true, which matches the model's native precision; RDNA3 has no FP8/FP4 hardware path to drop to.

The Click options are defined in acestep/gui.py: --checkpoint_path, --server_name, --port, --device_id, --share, --bf16, --torch_compile, --cpu_offload, --overlapped_decode. The --device_id flag selects the GPU by index (under ROCm, PyTorch enumerates the AMD card through the cuda device namespace, so --device_id 0 targets your 7800 XT).
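If you launch the app from a script, assembling the command from a dictionary of options keeps typos out of flag names. The sketch below uses only the option names listed above; passing booleans as the strings "true" and "false" mirrors how the commands in this recipe are written and is otherwise an assumption, and build_acestep_argv is a hypothetical helper rather than part of ACE-Step.

```python
import shlex

# Option names copied from the list above (acestep/gui.py).
KNOWN_OPTIONS = {
    "checkpoint_path", "server_name", "port", "device_id", "share",
    "bf16", "torch_compile", "cpu_offload", "overlapped_decode",
}


def build_acestep_argv(**options):
    """Turn keyword options into an argv list for the acestep CLI."""
    unknown = set(options) - KNOWN_OPTIONS
    if unknown:
        raise ValueError(f"unknown option(s): {sorted(unknown)}")
    argv = ["acestep"]
    for name, value in options.items():
        if isinstance(value, bool):
            # Assumption: booleans are spelled "true"/"false", as in this recipe.
            value = "true" if value else "false"
        argv += [f"--{name}", str(value)]
    return argv


print(shlex.join(build_acestep_argv(port=7865, bf16=True, device_id=0)))
```

The resulting list can be handed to subprocess.run as-is; shlex.join is used here only to print a copy-pasteable command.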

Memory-optimized launch (drop to the 8 GB floor)

acestep --torch_compile true --cpu_offload true --overlapped_decode true --port 7865

On a 16 GB 7800 XT these flags are genuinely useful, not just optional: the default bf16 path sits at ~12 GB, so if the card is also driving a display or you want to colocate another model, these three flags together drop resident VRAM to the official 8 GB floor and free real headroom. --cpu_offload loads only the current stage's model to the GPU; --overlapped_decode runs the DCAE and vocoder using sliding windows to speed decoding; both are documented in acestep/gui.py.
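The choice between the default launch and the memory-optimized one comes down to how much VRAM is free when you start. A minimal sketch of that decision follows, using the ~12 GB and 8 GB figures from this recipe; the 1 GB reserve and the function name are illustrative assumptions. With the ROCm build installed, torch.cuda.mem_get_info() returns (free, total) in bytes and is one way to obtain the input.

```python
DEFAULT_BF16_GB = 12.0   # approximate resident footprint of the default bf16 path
OFFLOAD_FLOOR_GB = 8.0   # floor with all three optimization flags enabled

OPTIMIZED_FLAGS = [
    "--torch_compile", "true",
    "--cpu_offload", "true",
    "--overlapped_decode", "true",
]


def pick_launch_flags(free_vram_gb, reserve_gb=1.0):
    """Pick extra acestep flags for the VRAM that is currently free.

    reserve_gb is a hypothetical cushion kept on top of the default footprint.
    """
    if free_vram_gb >= DEFAULT_BF16_GB + reserve_gb:
        return []  # default bf16 launch fits with room to spare
    if free_vram_gb >= OFFLOAD_FLOOR_GB:
        return list(OPTIMIZED_FLAGS)
    raise RuntimeError(
        f"{free_vram_gb:.1f} GB free is below the {OFFLOAD_FLOOR_GB:.0f} GB floor"
    )


print(["acestep", "--port", "7865"] + pick_launch_flags(10.5))
```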

⚠️ --torch_compile on ROCm: verify on first run. torch.compile works on RDNA3 (it lowers to Triton-ROCm), but exotic fused ops occasionally hit kernel-compile failures on gfx1101 and fall back to eager. If a compiled run errors, drop --torch_compile — the model is fully functional in eager mode. Do not install triton-windows (that note in the upstream README is Windows/CUDA-specific; you are on Linux/ROCm).

Library / programmatic use

To integrate ACE-Step into your own Python project, follow the inference code in the source tree — the ACEStepPipeline class in acestep/pipeline_ace_step.py is the real call surface (invoked via __call__, not a .text2music() method). The repository — not the auto-generated HF Hub text-to-audio snippet — is the source of truth for the call signature; see Troubleshooting below.

Results

Speed: No RX 7800 XT benchmark is published yet (the live /check/acestep-1-5-xl/rx-7800-xt verdict is unknown). The official ACE-Step model card publishes a per-device throughput table covering NVIDIA (A100, RTX 4090, RTX 3090) and Apple (M2 Max) — but no AMD / RDNA3 figure, and these are different architectures, so we do not extrapolate a 7800 XT number from them. We also do not carry the 7900 XTX figure: the 7800 XT has roughly two-thirds of the flagship's memory bandwidth (624 vs 960 GB/s) and fewer WMMA units (120 vs 192), so a sibling-card number would mislead. The Speed figure is therefore omitted. If you measure generation time on a 7800 XT, please contribute it via the submission form and it will appear at /check/acestep-1-5-xl/rx-7800-xt.
VRAM usage: At default bf16 precision the resident footprint is ~12 GB (the four components total ~8.28 GB on disk, plus activations) — it fits the 16 GB 7800 XT, but with limited headroom: this is a tight-but-fits situation, not a comfortable one. Close other GPU consumers before generating, or use cpu_offload + torch_compile + overlapped_decode to drop to the official 8 GB floor if you need to share the card. See /check/acestep-1-5-xl/rx-7800-xt for any community-submitted measurement.
Quality notes: Output quality is best in the 10 best-supported languages (19 are supported in total per the GitHub README); rare instruments may render imperfectly; outputs beyond ~5 minutes can lose structural coherence; the model is highly seed-sensitive ("gacha-style" results, per the HF card's Limitations section).

For the full benchmark data, see /check/acestep-1-5-xl/rx-7800-xt.
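The "roughly two-thirds" comparison in the Speed note can be checked with the figures quoted there. This is plain arithmetic on the recipe's own numbers, not a performance model, and it says nothing about actual generation time.

```python
# Figures quoted in the Speed note: RX 7800 XT vs RX 7900 XTX.
BANDWIDTH_GBPS = {"7800 XT": 624, "7900 XTX": 960}
WMMA_UNITS = {"7800 XT": 120, "7900 XTX": 192}


def relative(spec):
    """Ratio of the 7800 XT figure to the 7900 XTX figure."""
    return spec["7800 XT"] / spec["7900 XTX"]


bandwidth_ratio = relative(BANDWIDTH_GBPS)
wmma_ratio = relative(WMMA_UNITS)

print(f"memory bandwidth: {bandwidth_ratio:.3f} of the flagship")
print(f"WMMA units:       {wmma_ratio:.3f} of the flagship")
```

Both ratios land between 0.62 and 0.65, which is why a 7900 XTX timing is not carried over to this card.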

Troubleshooting

"Torch not compiled with CUDA enabled"

This means the installed PyTorch is not the ROCm build. The message comes from a wheel built without GPU support for this card (typically a CPU-only wheel), and a plain pip install torch never pulls the ROCm wheel. Uninstall and reinstall against the ROCm wheel index:

pip uninstall torch torchvision torchaudio
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm7.2

Confirm the installed build is the ROCm one: python -c "import torch; print(torch.__version__)" should print a +rocm7.2-style suffix, and python -c "import torch; print(torch.cuda.is_available())" should print True (PyTorch's HIP backend exposes the AMD GPU under the cuda device namespace, so ACE-Step's torch.device("cuda:0") calls resolve to your 7800 XT with no code changes).
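The version-suffix check can be scripted so it is easy to rerun after every reinstall. The sketch below only parses the string that torch.__version__ prints; it assumes the usual local-version suffixes (+rocmX.Y, +cuNNN, +cpu), and classify_torch_build is a hypothetical helper name.

```python
import re


def classify_torch_build(version):
    """Return (backend, backend_version) parsed from a torch.__version__ string."""
    match = re.search(r"\+(rocm|cu|cpu)([\d.]*)", version)
    if not match:
        return ("unknown", "")
    tag, number = match.groups()
    backend = {"rocm": "rocm", "cu": "cuda", "cpu": "cpu"}[tag]
    return (backend, number)


if __name__ == "__main__":
    # Example strings; pass torch.__version__ here on a real install.
    for example in ("2.9.1+rocm7.2", "2.5.1+cu124", "2.5.1+cpu"):
        print(example, "->", classify_torch_build(example))
```

Anything other than a "rocm" backend means the wrong wheel is installed. On a live install, torch.version.hip (None on non-ROCm builds) is a second signal worth checking.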

A library ships only gfx1100 kernels and won't load on the 7800 XT

The 7800 XT is gfx1101 (Navi 32), while the flagship 7900 XTX is gfx1100 (Navi 31). Most of the ROCm stack — and the stable ROCm PyTorch wheel that ACE-Step runs on — ships kernels for both, so ACE-Step itself does not need any override. But occasionally a third-party prebuilt extension only carries gfx1100 kernels and refuses to load on gfx1101 with a "no kernel image is available" / missing-gfx1101-kernel error. The standard Linux-only fallback is to mask the card as gfx1100 at runtime:

HSA_OVERRIDE_GFX_VERSION=11.0.0 acestep --port 7865

This is a legacy fallback, not a default — gfx1101 is officially ROCm-supported, so reach for it only if a specific library throws a missing-kernel error.
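When the app is started from a Python launcher instead of a shell, the same override can be scoped to the child process so the rest of the session is unaffected. This is a sketch: HSA_OVERRIDE_GFX_VERSION and the 11.0.0 value come from the command above, while env_with_gfx_override is a hypothetical helper.

```python
import os


def env_with_gfx_override(base_env=None, version="11.0.0"):
    """Return a copy of the environment with the gfx override set.

    The input mapping is not modified, so the override applies only to
    the process that receives the returned dict.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["HSA_OVERRIDE_GFX_VERSION"] = version
    return env


if __name__ == "__main__":
    env = env_with_gfx_override()
    print("HSA_OVERRIDE_GFX_VERSION =", env["HSA_OVERRIDE_GFX_VERSION"])
    # To launch with the override:
    #   import subprocess
    #   subprocess.run(["acestep", "--port", "7865"], env=env, check=True)
```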

HF Quick Start snippet doesn't match the real API

ACE-Step is a music-generation model, but the Hugging Face Hub auto-generates a generic snippet from the text-to-audio pipeline tag — it does not reflect a runnable music-generation call, and there is no .text2music() method. The authoritative inference entry point is the acestep console script (which the repository's setup.py registers as acestep.gui:main, a Gradio app), launchable as acestep --port 7865 or python -m acestep.gui. For programmatic use, drive ACEStepPipeline from acestep/pipeline_ace_step.py (via its __call__) rather than copy-pasting the Hub snippet.

`acestep` command not found after `pip install -e .`

The -e (editable) install registers the acestep entry point in your conda env. If the shell can't find it, you're probably in a different env — re-activate with conda activate ace_step and verify with which acestep. The GitHub repo README documents the entry point and command-line arguments.
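The two checks described above (is the command on PATH, and is the right conda env active) can be combined into one diagnostic. This sketch relies on shutil.which and on the CONDA_DEFAULT_ENV variable that conda sets on activation; the function name and messages are illustrative.

```python
import os
import shutil


def diagnose_entry_point(command="acestep", expected_env="ace_step",
                         which=shutil.which, environ=os.environ):
    """Explain why the console script is or is not reachable."""
    path = which(command)
    if path:
        return f"{command} found at {path}"
    active = environ.get("CONDA_DEFAULT_ENV")
    if active != expected_env:
        return (f"{command} not found; active conda env is {active!r}, "
                f"run: conda activate {expected_env}")
    return f"{command} not found in env {expected_env!r}; re-run: pip install -e ."


if __name__ == "__main__":
    print(diagnose_entry_point())
```

The which and environ parameters exist only so the function can be exercised without touching the real shell state.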

`torch.compile` errors on ROCm

If you passed --torch_compile true and a generation crashes with a Triton or Inductor kernel-compilation error, this is a known RDNA3 rough edge — torch.compile lowers to Triton-ROCm, which occasionally fails on exotic fused ops on gfx1101. Drop the flag and run in eager mode (acestep --port 7865); the model is fully functional without compilation. The compile flag is an optional speedup, not a requirement.
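In a launcher script, the eager fallback amounts to removing the compile flag and its value from the argument list and retrying. A minimal sketch of that step follows; without_torch_compile is a hypothetical helper, and how you detect the failed run (exit code, log text) is left to your script.

```python
def without_torch_compile(argv):
    """Return argv with --torch_compile (and its value, if any) removed."""
    cleaned = []
    skip_value = False
    for token in argv:
        if skip_value:
            skip_value = False
            if not token.startswith("--"):
                continue  # this token was the flag's value, drop it
        if token == "--torch_compile":
            skip_value = True  # the next token may be "true" or "false"
            continue
        if token.startswith("--torch_compile="):
            continue
        cleaned.append(token)
    return cleaned


compiled = ["acestep", "--torch_compile", "true", "--cpu_offload", "true", "--port", "7865"]
print(without_torch_compile(compiled))
```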

Generations sound unstructured past ~5 minutes

This is a documented limitation, not a bug. The model card calls it out under "Limitations" — the model loses long-range structural coherence beyond ~5 minutes. Either keep prompts inside that window or use the repaint/extend operations on shorter segments and stitch them.
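If you take the segment-and-stitch route, it helps to plan the segment boundaries up front. The sketch below only does the arithmetic: it splits a target length into consecutive windows no longer than a chosen maximum. The 240-second default echoes the ~4 minute figure from the top of this recipe and is an assumption; the function does not call ACE-Step and is not part of its API.

```python
def plan_segments(total_seconds, max_segment_seconds=240):
    """Split a target duration into consecutive (start, end) windows."""
    if total_seconds <= 0 or max_segment_seconds <= 0:
        raise ValueError("durations must be positive")
    segments = []
    start = 0
    while start < total_seconds:
        end = min(start + max_segment_seconds, total_seconds)
        segments.append((start, end))
        start = end
    return segments


print(plan_segments(600))
```

Each window can then be generated or extended separately in the app and joined in an audio editor.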

If you hit something not covered here, please report via the submission form so we can add it to the catalogue.