What You'll Build
A working text-to-music pipeline that turns a text prompt + optional lyrics into a full song (vocals, instruments, up to ~4 minutes) on a single RTX 4070 Ti Super, either through the official Gradio app or as a ComfyUI custom node.
Hardware data: RTX 4070 Ti Super (16GB VRAM) · text-to-music, lyric-aligned vocals, top-10 supported languages per HF model card · See benchmark data
ℹ️ Not a TTS model. ACE-Step generates music (instruments and lyric-aligned vocals) from a text description. It is filed under our tts vertical because the catalogue groups all audio-output models together, but it is not a text-to-speech engine. If you want speech synthesis on this GPU, see Kokoro or VoxCPM. If you want sung vocals over generated backing, you're in the right place.
Requirements
| Component | Minimum | Tested |
|---|---|---|
| GPU | 8 GB VRAM with optimization flags, 12 GB at default precision | RTX 4070 Ti Super (16 GB) |
| RAM | 16 GB | — |
| Storage | ~8 GB for the 3.5B transformer + DCAE + vocoder + UMT5-base text encoder | — |
| Software | Python 3.10, PyTorch with CUDA, ComfyUI (optional) | — |
The four weight components on the HF Files tab total ~8.3 GB on disk: the ace_step_transformer diffusion model (6.61 GB), the music_dcae_f8c8 autoencoder (0.31 GB), the music_vocoder (0.21 GB), and the umt5-base text encoder (1.13 GB).
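As a quick sanity check on the storage figure, the sketch below sums the four component sizes quoted above and compares the total against free disk space. The 1 GB safety margin and the helper names are illustrative assumptions, not part of ACE-Step.

```python
import shutil

# Component sizes in GB, copied from the figures quoted in the text above.
COMPONENT_SIZES_GB = {
    "ace_step_transformer": 6.61,
    "music_dcae_f8c8": 0.31,
    "music_vocoder": 0.21,
    "umt5-base": 1.13,
}


def total_size_gb(sizes):
    """Sum the component sizes, rounded to two decimals."""
    return round(sum(sizes.values()), 2)


def has_room(free_gb, sizes, margin_gb=1.0):
    """True if free_gb covers the weights plus a safety margin (margin is an assumption)."""
    return free_gb >= total_size_gb(sizes) + margin_gb


if __name__ == "__main__":
    # Free space on the drive holding the current directory, in decimal GB.
    free_gb = shutil.disk_usage(".").free / 1e9
    print(f"weights: {total_size_gb(COMPONENT_SIZES_GB)} GB, free: {free_gb:.1f} GB")
    print("enough room" if has_room(free_gb, COMPONENT_SIZES_GB) else "not enough room")
```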
Installation
1. Clone the repo and create the conda environment
git clone https://github.com/ace-step/ACE-Step.git
cd ACE-Step
conda create -n ace_step python=3.10 -y
conda activate ace_step
2. Install PyTorch (Windows GPU users only — Linux can skip)
pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu126
On Linux, the default pip install torch wheel bundles CUDA and already ships with sm_89 (Ada Lovelace) kernels, so no special wheel selection is required for the RTX 4070 Ti Super. On Windows, the default PyPI wheel is CPU-only, which is why Windows GPU users need the cu126 index shown above.
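To confirm that the installed wheel actually sees the GPU, a small check like the following can help. It is a sketch that uses only torch.cuda.is_available, torch.cuda.get_device_capability and torch.cuda.get_arch_list; the helper names has_sm89 and report are hypothetical.

```python
def has_sm89(arch_list):
    """True if the wheel's compiled architecture list includes Ada Lovelace (sm_89)."""
    return "sm_89" in arch_list


def report():
    """Describe the local PyTorch/CUDA setup without failing when torch is missing."""
    try:
        import torch
    except ImportError:
        return "torch is not installed in this environment"
    if not torch.cuda.is_available():
        return "torch is installed but CUDA is unavailable (CPU-only wheel or missing driver)"
    major, minor = torch.cuda.get_device_capability(0)
    archs = torch.cuda.get_arch_list()
    return f"device capability sm_{major}{minor}; wheel has sm_89 kernels: {has_sm89(archs)}"


if __name__ == "__main__":
    print(report())
```

On an RTX 4070 Ti Super the reported device capability should be sm_89. If the message says CUDA is unavailable on Windows, the CPU-only wheel is the likely cause.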
3. Install the package
pip install -e .
This installs the acestep command-line entry point along with diffusers, transformers, accelerate, and the project's audio dependencies. Weights for ACE-Step/ACE-Step-v1-3.5B download automatically from HuggingFace on first launch (to ~/.cache/ace-step/checkpoints by default).
4. (Optional) ComfyUI custom node
If you would rather drive the model from a ComfyUI workflow:
cd ComfyUI/custom_nodes
git clone https://github.com/billwuhao/ComfyUI_ACE-Step.git
Then download the weights into ComfyUI/models/TTS/ACE-Step-v1-3.5B/. Per the custom node README the folder must contain four subdirectories: ace_step_transformer, music_dcae_f8c8, music_vocoder, and umt5-base.
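Before starting ComfyUI, it can be worth checking that the weights folder has the layout described above. The following is a minimal sketch; the helper name missing_subdirs is hypothetical, and the path assumes the script is run from the directory that contains ComfyUI.

```python
from pathlib import Path

# The four subdirectories the custom node README is said to require.
REQUIRED_SUBDIRS = (
    "ace_step_transformer",
    "music_dcae_f8c8",
    "music_vocoder",
    "umt5-base",
)


def missing_subdirs(weights_root):
    """Return the required subdirectory names that are absent under weights_root."""
    root = Path(weights_root)
    return [name for name in REQUIRED_SUBDIRS if not (root / name).is_dir()]


if __name__ == "__main__":
    root = Path("ComfyUI/models/TTS/ACE-Step-v1-3.5B")
    missing = missing_subdirs(root)
    if missing:
        print(f"missing under {root}: {missing}")
    else:
        print("all four subdirectories are present")
```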
Running
Gradio app (official)
acestep --port 7865 --bf16 true
Then open http://localhost:7865. In the Text2Music tab, enter descriptive tags (style, mood, instruments), optional lyrics with structure markers like [verse] / [chorus], set the audio duration, and click Generate. The app returns a downloadable audio file. On a 16 GB RTX 4070 Ti Super the default bf16 path runs comfortably without any memory flags.
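If you assemble lyrics programmatically before pasting them into the Text2Music tab, a helper along these lines keeps the structure markers consistent. This is only a sketch: the layout (each marker on its own line, one blank line between sections) is an assumption, and format_lyrics is a hypothetical helper rather than anything ACE-Step provides.

```python
def format_lyrics(sections):
    """Join (marker, lines) pairs into one lyrics string.

    Each section becomes "[marker]" followed by its lines; sections are
    separated by a blank line (an assumed layout, see the note above).
    """
    blocks = []
    for marker, lines in sections:
        blocks.append("\n".join([f"[{marker}]", *lines]))
    return "\n\n".join(blocks)


if __name__ == "__main__":
    # Placeholder lyrics for illustration only.
    print(format_lyrics([
        ("verse", ["First line of the verse", "Second line of the verse"]),
        ("chorus", ["The hook goes here"]),
    ]))
```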
Memory-optimized launch (free headroom for other workloads)
acestep --torch_compile true --cpu_offload true --overlapped_decode true --port 7865
These three flags together are what bring resident VRAM down to the official 8 GB floor, which is useful if the RTX 4070 Ti Super is also driving a display, running a browser, or sharing the card with other inference workloads. On a 16 GB card you do not need them for a single generation, but they cut peak resident VRAM substantially. Each flag is documented under Command Line Arguments in the GitHub README: --cpu_offload offloads model weights to CPU to save GPU memory, --overlapped_decode uses overlapped decoding to speed up inference, and --torch_compile enables torch.compile. The RTX 4070 Ti Super's PCIe Gen4 x16 host link helps keep CPU↔GPU weight streaming cheap, although offloading still adds transfer overhead compared with keeping all weights resident in VRAM.
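If you launch the app from a script, the two launch variants in this guide can be assembled like this. It is a sketch that only builds and prints the command line, using the flags exactly as they appear above; build_launch_args is a hypothetical helper name.

```python
import shlex


def build_launch_args(port=7865, low_vram=False):
    """Return the acestep argv for the default bf16 launch or the memory-optimized one."""
    args = ["acestep", "--port", str(port)]
    if low_vram:
        # The three memory flags from the memory-optimized launch above.
        args += [
            "--torch_compile", "true",
            "--cpu_offload", "true",
            "--overlapped_decode", "true",
        ]
    else:
        # Default launch from the Gradio section.
        args += ["--bf16", "true"]
    return args


if __name__ == "__main__":
    # Print rather than execute, so the command can be inspected first.
    print(shlex.join(build_launch_args(low_vram=True)))
```

The returned list can be passed to subprocess.run once you are happy with the command.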
Library / programmatic use
To integrate ACE-Step into your own Python project, install it directly from the repo and follow the inference scripts in the source tree (the README documents pip install git+https://github.com/ace-step/ACE-Step.git). The repository — not the auto-generated HF Hub snippet — is the source of truth for the call signature; see Troubleshooting below.
Results
- Speed: No RTX 4070 Ti Super benchmark is published yet (the live /check/acestep-1-5-xl/rtx-4070-ti-super verdict is unknown). The official ACE-Step GitHub README publishes a per-device throughput table that names the RTX 4090 (RTF 34.48×, 1.74 s to render 1 minute of audio at 27 steps) and the RTX 3090 (RTF 12.76×, 4.70 s), but not the RTX 4070 Ti Super, which has fewer CUDA cores and lower memory bandwidth than either. Because the 4070 Ti Super is not a close compute sibling of the 4090 (24 GB, substantially higher FP16 throughput) or of the 3090 (older Ampere architecture), we do not extrapolate a 4070 Ti Super figure from them. If you measure generation time on a 4070 Ti Super, please contribute it via the submission form and it will appear at /check/acestep-1-5-xl/rtx-4070-ti-super.
- VRAM usage: At default half precision (no flags beyond --port), community user akande reported near-full VRAM use on a 12 GB RTX 3060 in HF discussion #4 (May 2025): "For me it runs on my 3060 and consumes 11.7GB / 12GB vram. Maybe it runs in half precision out of the box? Because i don't use any arguments other then --port to start." The RTX 4070 Ti Super's 16 GB envelope clears that ~11.7 GB peak with roughly 4 GB of headroom. The official minimum drops to 8 GB with cpu_offload + torch_compile + overlapped_decode enabled, confirmed by ACE-Step org member xushengyuan (Shengyuan Xu, an ACE-Step team member and paper author) in the same thread: "The minimum VRAM requirement for full-length generation is now just 8 GB. We tested it on an RTX 4060, and it delivers decent performance beyond our expectations (1.16it/s)."
- Quality notes: Performs best in the top 10 supported languages (19 are supported in total per the GitHub README); rare instruments may render imperfectly; outputs beyond ~5 minutes can lose structural coherence; the model is highly seed-sensitive ("gacha-style" results, per the HF card's Limitations section).
For the full benchmark data, see /check/acestep-1-5-xl/rtx-4070-ti-super.
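The arithmetic behind the figures above can be reproduced in a few lines. This sketch assumes RTF means seconds of audio produced per second of wall-clock time, which is consistent with the published 4090 and 3090 numbers; the function names are illustrative.

```python
def seconds_per_audio_minute(rtf):
    """Wall-clock seconds needed to render 60 s of audio at a given real-time factor."""
    return 60.0 / rtf


def vram_headroom_gb(card_gb, peak_gb):
    """VRAM left over on the card after the reported peak."""
    return card_gb - peak_gb


if __name__ == "__main__":
    for name, rtf in (("RTX 4090", 34.48), ("RTX 3090", 12.76)):
        print(f"{name}: {seconds_per_audio_minute(rtf):.2f} s per minute of audio")
    print(f"16 GB card vs 11.7 GB peak: {vram_headroom_gb(16, 11.7):.1f} GB headroom")
```

If you time a generation on a 4070 Ti Super yourself, dividing the audio duration by the measured wall-clock time gives an RTF that is directly comparable with the README table.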
Troubleshooting
Out of memory while sharing the card with other workloads
The 16 GB envelope of the RTX 4070 Ti Super clears the cited ~11.7 GB default-precision peak comfortably for a standalone run, but if you are also running a desktop session or other CUDA workloads, enable the launch-flag combination:
acestep --torch_compile true --cpu_offload true --overlapped_decode true
cpu_offload is the heaviest hitter: it keeps model weights in system RAM and moves them into VRAM on demand. Combined with torch_compile and overlapped_decode, it reaches the official 8 GB floor that the ACE-Step team measured on the RTX 4060 per HF discussion #4. On Windows, torch_compile additionally requires pip install triton-windows, per the GitHub README.
HF Quick Start snippet doesn't match the real API
ACE-Step is a music-generation model, but the HuggingFace Hub auto-generates a generic diffusers snippet from the pipeline tag — it does not reflect a runnable music-generation call. The authoritative inference entry point is the acestep command-line / Gradio app (the repository's setup.py registers acestep = acestep.gui:main as the console script), or the ComfyUI custom node. There is no .text2music() Python method to call; for programmatic use, follow the inference code in the official GitHub repository rather than copy-pasting the Hub snippet.
acestep command not found after pip install -e .
The -e (editable) install registers the acestep entry point in your conda env. If the shell can't find it, you're probably in a different env: re-activate with conda activate ace_step and verify with which acestep (where acestep on Windows). The GitHub repo README documents the entry point and command-line arguments.
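The same check works from Python, which is handy because it behaves identically on Linux and Windows. This is a small sketch built on the standard library's shutil.which; locate is a hypothetical helper name.

```python
import shutil
import sys


def locate(command="acestep"):
    """Return the full path of command on PATH, or None if it cannot be found."""
    return shutil.which(command)


if __name__ == "__main__":
    path = locate()
    if path is None:
        # sys.prefix shows which environment this interpreter belongs to.
        print(f"acestep not found on PATH; active environment is {sys.prefix}")
    else:
        print(f"acestep resolves to {path}")
```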
Generations sound unstructured past ~5 minutes
This is a documented limitation, not a bug. The model card calls it out under "Limitations": the model loses long-range structural coherence beyond ~5 minutes. Either keep the requested audio duration inside that window, or use the repaint/extend operations on shorter segments and stitch them.
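If you do go the segment-and-stitch route, planning the time windows up front keeps each piece inside the coherent range. The sketch below only computes the windows and does not call ACE-Step. The 240-second default mirrors the ~4 minute figure from the introduction and is an assumption, as is the helper name plan_segments.

```python
def plan_segments(total_s, max_segment_s=240.0):
    """Split total_s seconds into consecutive (start, end) windows of at most max_segment_s."""
    if total_s <= 0 or max_segment_s <= 0:
        raise ValueError("durations must be positive")
    total_s = float(total_s)
    segments = []
    start = 0.0
    while start < total_s:
        # The last window is shortened so it ends exactly at total_s.
        end = min(start + max_segment_s, total_s)
        segments.append((start, end))
        start = end
    return segments


if __name__ == "__main__":
    for start, end in plan_segments(600):
        print(f"{start:6.1f} s to {end:6.1f} s")
```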
If you hit something not covered here, please report via the submission form so we can add it to the catalogue.